# Train Document Classification Custom Skill
This tutorial shows how to train a document classification custom skill for Cognitive Search. We will use the 20newsgroups dataset provided by scikit-learn as our sample dataset.

For more information, see the [Azure Machine Learning (AML)](https://docs.microsoft.com/en-us/azure/machine-learning/service/) or [Cognitive Search](https://docs.microsoft.com/en-us/azure/search/cognitive-search-resources-documentation) documentation. This notebook is based on the [MNIST Image Classification tutorial](https://docs.microsoft.com/en-us/azure/machine-learning/service/tutorial-train-models-with-aml) and on the [documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-track-experiments) for tracking experiments locally.

### 0.0 Important Variables you need to set for this tutorial

Enter your workspace name, subscription ID, and resource group below.

In [None]:
# Machine Learning Service Workspace configuration
my_workspace_name = ''
my_azure_subscription_id = ''
my_resource_group = ''

### 1.0 Import packages
If this is your first time using AML, please see this quickstart to get your environment set up: https://docs.microsoft.com/azure/machine-learning/service/quickstart-create-workspace-with-python

In [None]:
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import azureml
from azureml.core import Workspace, Run

import numpy as np
import os

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

### 2.0 Connect to Workspace
Create a workspace object that points to your existing workspace. If you already have a workspace and a config.json file, you can use `ws = Workspace.from_config()` instead.
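For reference, the config.json file read by `Workspace.from_config()` is a small JSON document holding the same three values set in section 0.0. The sketch below writes such a file using only the standard library. The helper name `write_workspace_config`, the temporary directory, and the placeholder values are assumptions made for illustration; once you have a workspace object, the SDK's own `ws.write_config()` is the usual way to produce this file.

```python
import json
import os
import tempfile


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write .azureml/config.json under `directory` and return its path.

    Hypothetical helper for illustration only; it is not part of the AML SDK.
    """
    config_dir = os.path.join(directory, ".azureml")
    os.makedirs(config_dir, exist_ok=True)
    config_path = os.path.join(config_dir, "config.json")
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config_path


# Placeholder values; replace them with the values from section 0.0.
demo_dir = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_dir, "<subscription-id>", "<resource-group>", "<workspace-name>"
)

with open(config_path) as f:
    print(f.read())
```

In a real project you would write the file into the notebook's working directory rather than a temporary one, so that `Workspace.from_config()` can find it.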

In [None]:
ws = Workspace.get(name = my_workspace_name, resource_group = my_resource_group, subscription_id = my_azure_subscription_id)
print(ws.name, ws.location, ws.resource_group, sep = '\t')

### 3.0 Create Experiment
Create an experiment to track the runs in your workspace. A workspace can have multiple experiments.

In [None]:
experiment_name = 'newsgroup-classification'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

### 4.0 Import data for classification
The 20newsgroups dataset is available through scikit-learn and can be imported as shown below.

To use your own data instead, edit the cell below to populate `X_train`, `y_train`, `X_test`, and `y_test` with your documents and labels.
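As a rough sketch of what that substitution could look like, the example below reads labelled documents from CSV text and splits them into the four variables with scikit-learn's `train_test_split`. The column names `text` and `label`, the sample rows, and the 75/25 split are all assumptions for illustration; adapt them to your own data.

```python
import csv
import io

from sklearn.model_selection import train_test_split

# Hypothetical CSV content. In practice, open your own file instead,
# e.g. open('my_documents.csv', newline='').
raw = io.StringIO(
    "text,label\n"
    "Quarterly revenue grew in all regions,report\n"
    "Annual summary of operating costs,report\n"
    "Monthly status update for the project,report\n"
    "Findings from the customer survey,report\n"
    "Payment due within thirty days,invoice\n"
    "Amount owed for consulting services,invoice\n"
    "Remittance advice for order 1001,invoice\n"
    "Balance outstanding on your account,invoice\n"
)

rows = list(csv.DictReader(raw))
texts = [row["text"] for row in rows]
labels = [row["label"] for row in rows]

# Stratify so both classes appear in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

print(len(X_train), "training documents,", len(X_test), "test documents")
```

The rest of the notebook only requires that `X_train` and `X_test` are lists of strings and that `y_train` and `y_test` are lists of labels of matching length.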

In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

X_train = newsgroups_train.data
y_train = [categories[x] for x in newsgroups_train.target]

X_test = newsgroups_test.data
y_test = [categories[x] for x in newsgroups_test.target]

### 5.0 Train and Score model locally
For this tutorial, the model will be trained locally and results will be logged to AML.
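The training cell in this section logs precision, recall, and F-score computed with `average='micro'`. For single-label classification, micro-averaging pools true positives, false positives, and false negatives over all classes, so the three values coincide with plain accuracy. The standalone sketch below illustrates this on a handful of made-up labels; the function `micro_scores` and the sample predictions are hypothetical and are not part of the tutorial code.

```python
from collections import Counter


def micro_scores(y_true, y_pred):
    """Return micro-averaged (precision, recall, fscore) for single-label data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    precision = total_tp / (total_tp + total_fp) if total_tp + total_fp else 0.0
    recall = total_tp / (total_tp + total_fn) if total_tp + total_fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore


# Made-up labels for illustration only.
y_true_demo = ['comp.graphics', 'comp.graphics', 'sci.space', 'sci.space', 'sci.space']
y_pred_demo = ['comp.graphics', 'sci.space', 'sci.space', 'sci.space', 'comp.graphics']

precision, recall, fscore = micro_scores(y_true_demo, y_pred_demo)
accuracy = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
print(precision, recall, fscore, accuracy)
```

If you want scores that weight each class equally regardless of its size, `average='macro'` is the usual alternative.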

A scikit-learn pipeline is used to transform the documents into a TF-IDF matrix and then classify them using a [BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html).
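To make the first two pipeline steps more concrete, the sketch below applies the same vectorizer settings (`stop_words='english'`, `ngram_range=(1, 2)`) followed by `TfidfTransformer` to two toy sentences. The sentences are invented for illustration. The vocabulary contains both single words and two-word phrases, and each row of the resulting matrix is scaled to unit length by the transformer's default L2 normalization.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy documents, invented for illustration.
docs = [
    "The rocket reached orbit",
    "The shader renders pixels",
]

# Step 1: count unigrams and bigrams, dropping English stop words.
vect = CountVectorizer(stop_words='english', ngram_range=(1, 2))
counts = vect.fit_transform(docs)

# Step 2: reweight the raw counts with TF-IDF.
tfidf = TfidfTransformer().fit_transform(counts)

terms = sorted(vect.vocabulary_)
# Euclidean length of each document vector.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

print("Vocabulary:", terms)
print("Matrix shape (documents, terms):", tfidf.shape)
print("Row norms:", row_norms)
```

Including bigrams enlarges the vocabulary considerably on real corpora, which is worth keeping in mind if training becomes slow or memory-hungry.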

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Create a run object in the experiment
run = exp.start_logging()

# Creating pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words = 'english', ngram_range= (1, 2))),
    ('tfidf', TfidfTransformer()),
    ('bc', BaggingClassifier(n_estimators=20))
])

# Fitting pipeline
print("Started Training Model")
pipeline.fit(X_train, y_train)

print('Predicting against the test data')
pred = pipeline.predict(X_test)

print("")
print("Results:")
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

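from sklearn.metrics import precision_recall_fscore_support

# Micro-averaging pools the counts over all classes. For single-label
# classification, precision, recall and F-score then all equal accuracy.
scores = precision_recall_fscore_support(y_test, pred, average='micro')

# Log the metrics to the AML run
run.log('precision', scores[0])
run.log('recall', scores[1])
run.log('fscore', scores[2])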

os.makedirs('outputs', exist_ok=True)
import joblib
joblib.dump(value=pipeline, filename='outputs/newsgroup_classifier.pkl')

# Complete the run
run.complete()

print()
print('The model has been exported to `outputs/newsgroup_classifier.pkl` for use in the next tutorial.')
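
### 6.0 Reload the exported model (optional)
A pipeline saved with `joblib.dump` can be restored with `joblib.load` and used for prediction straight away, since the vectorizer, the TF-IDF transformer, and the classifier are stored together. The sketch below is self-contained: it trains a tiny stand-in pipeline on invented sentences and saves it to a temporary directory, both of which are assumptions for illustration. With the real model, you would call `joblib.load('outputs/newsgroup_classifier.pkl')` instead.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Invented training data standing in for the newsgroup posts.
texts = [
    "The rocket reached orbit around Mars",
    "NASA launched a new satellite",
    "The shader renders polygons quickly",
    "OpenGL draws textured triangles",
]
labels = ['sci.space', 'sci.space', 'comp.graphics', 'comp.graphics']

# Small stand-in for the pipeline trained in section 5.0.
stand_in = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('bc', BaggingClassifier(n_estimators=5, random_state=0)),
])
stand_in.fit(texts, labels)

# Save to a temporary location so this sketch does not touch outputs/.
model_path = os.path.join(tempfile.mkdtemp(), 'newsgroup_classifier.pkl')
joblib.dump(value=stand_in, filename=model_path)

# Restore the pipeline and predict on raw text.
loaded = joblib.load(model_path)
new_docs = ["A probe entered orbit around Jupiter"]
predictions = loaded.predict(new_docs)
print(predictions)
```

Pickled scikit-learn models are generally only safe to load with a scikit-learn version compatible with the one used for training, so it helps to record the version alongside the model file.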