## Text Classification

This notebook walks through training a binary classification model on textual data. Note that the notebook runs on a small sample dataset, so the results are illustrative only and should not be used to draw conclusions.

#### Input:

ws_2_article_topic_XX.csv:
This dataset contains the clean text and the LDA results as features, obtained from the `ws2_2_topic_modelling` notebook, where XX is the optimal number of topics. We create a binary label from the 'topic_label' column of this dataset.

#### Output:

'covid_classifier' model:
This notebook trains a binary classification model using sklearn's SGDClassifier and deploys it to an IBM Cloud Pak for Data deployment space.

tf_idf.csv:
The notebook also writes the TF-IDF matrix to a CSV file, which can be used to train a classification model with AutoAI.


#### Classification workflow includes:

- Import data
- Data split and upsampling
- Pipeline of TF-IDF tokenization method and Linear SVM classifier using SGDClassifier
- Hyper-parameter tuning with grid search
- Model evaluation
- Prepare data for AutoAI
- Save and deploy model with WML

In [None]:
# Importing Libraries

import sklearn
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics, ensemble

import warnings
warnings.filterwarnings('ignore')

### Configuration parameters

Path to the folder where the input data is read from and the outputs are saved.

In [None]:
# The path to the output folder where all the outputs will be saved
output_path = "/project_data/data_asset"

### Import data

The data contains the cleaned articles and the topic labels generated from the LDA analysis, which was evaluated with the coherence metric. Here we import the data and create a binary label, where `topic_of_interest` is the topic we aim to detect with a classification model.

In [None]:
# Importing Article Dataset

df_clean = pd.read_csv(f"{output_path}/ws_2_article_topic_6.csv")
df_clean.head()

In [None]:
# Set the topic of interest for binary classification
topic_of_interest = 'label_5'

# create a binary label where 1 is the topic of interest and 0 is the rest
df_clean['label'] = np.where(df_clean['topic_label'] == topic_of_interest, 1, 0)  
df_clean['label'].value_counts()

### Data split and upsampling

The dataset is imbalanced in the label column: there are significantly more examples of class 0 than of class 1. Upsampling lets us increase the proportion of class 1 in the training data.
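
To make the effect of upsampling concrete, here is a minimal sketch on a toy data frame (the rows and the 6-to-2 class ratio are made up for illustration and are not taken from the article dataset). It uses the same `resample` call as the cell below and shows that the minority class is balanced by repeating existing rows, not by creating new ones.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training frame (hypothetical data): 6 majority rows and 2 minority rows
train = pd.DataFrame({
    "article_clean": [f"doc {i}" for i in range(8)],
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = train[train["label"] == 0]
minority = train[train["label"] == 1]

# Draw minority rows with replacement until both classes have the same size
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=27)

balanced = pd.concat([majority, minority_upsampled])
counts = balanced["label"].value_counts().to_dict()

# The upsampled minority still contains at most the 2 original documents
n_unique_minority = minority_upsampled["article_clean"].nunique()
print(counts, n_unique_minority)
```

Because the extra rows are copies, the notebook upsamples only the training split; the test set keeps its original class balance.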

In [None]:
# Splitting the data into train and test sets
from sklearn.model_selection import train_test_split

# cleaned body of documents
X = df_clean['article_clean']  
# Target variable
y = df_clean['label'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    stratify=y, 
                                                    test_size=0.25,
                                                    random_state = 0)

In [None]:
# Upsampling
from sklearn.utils import resample

# concatenate our training data back together
train_concat = pd.concat([X_train, y_train], axis=1)

# separate minority and majority classes
majority_class = train_concat[train_concat.label==0]
minority_class = train_concat[train_concat.label==1]

# upsample the minority class
minority_upsampled = resample(minority_class,
                              replace=True, # sample with replacement
                              n_samples=len(majority_class), # match number in majority class
                              random_state=27) # reproducible results

# combine majority and upsampled minority
upsampled = pd.concat([majority_class, minority_upsampled])

# shuffle the rows (fixed seed so the order is reproducible)
upsampled = upsampled.sample(frac=1, random_state=27).reset_index(drop=True)

y_train = upsampled['label']
X_train = upsampled['article_clean']

### Pipeline

We chain the TF-IDF vectorizer and the classifier into a single pipeline, so that raw text goes in and a class prediction comes out.
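
As a rough illustration of how such a pipeline behaves, the sketch below fits the same two steps on a handful of made-up sentences (the toy documents and labels are assumptions for this example only). It also lists the `step__parameter` names that the pipeline exposes, which is the naming scheme used later for the grid search.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = topic of interest, 0 = everything else
docs = [
    "virus outbreak hospital nurse",
    "coronavirus lockdown school closure",
    "virus vaccine trial doctor",
    "football match final score",
    "stock market share price",
    "film festival award actor",
]
labels = [1, 1, 1, 0, 0, 0]

toy_pipeline = Pipeline([
    ("vect", TfidfVectorizer()),                          # text -> TF-IDF features
    ("clf", SGDClassifier(loss="hinge", random_state=0)), # linear SVM trained with SGD
])
toy_pipeline.fit(docs, labels)

# The pipeline accepts raw strings directly at prediction time
pred = toy_pipeline.predict(["virus outbreak in hospital"])

# Parameters of each step are addressable as '<step name>__<parameter>'
param_names = sorted(toy_pipeline.get_params().keys())
print(pred, [p for p in param_names if p.startswith("clf__")][:3])
```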

In [None]:
# Create the pipeline
pipeline = Pipeline([
    ('vect',TfidfVectorizer(max_df = 0.8, min_df=0.0001, norm = 'l2', use_idf = True)), 
    ('clf',SGDClassifier(random_state=0, alpha = 2e-05, penalty = 'l2', loss = 'hinge')) 
])

### Hyper-parameter tuning with grid search

We search for the best combination of vectorizer and classifier hyper-parameters using a cross-validated grid search.
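
Before launching the search it can help to know how much work it involves. The sketch below is not part of the original workflow; it copies the grid defined in the next cell and counts the candidates with scikit-learn's `ParameterGrid`, then multiplies by the 5 cross-validation folds used in the search.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell that follows
parameters = {
    'vect__max_df': (0.6, 0.7, 0.8),
    'vect__min_df': (0.0001, 0.001),
    'vect__max_features': (None, 5000, 10000),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': (None, 'l1', 'l2'),
    'clf__loss': ('log', 'modified_huber', 'hinge'),
    'clf__alpha': (0.00001, 0.00002),
    'clf__penalty': ('none', 'l2', 'l1'),
    'clf__max_iter': (10, 50),
}

n_candidates = len(ParameterGrid(parameters))  # one candidate per combination
cv_folds = 5
n_fits = n_candidates * cv_folds               # each candidate is fitted once per fold

print(n_candidates, n_fits)  # 7776 38880
```

On the small sample dataset this is manageable, but on a full corpus it may be worth trimming the grid to the parameters that matter most.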

In [None]:
# Define parameters
parameters = {
    'vect__max_df': (0.6, 0.7, 0.8), # max threshold for document frequency
    'vect__min_df': (0.0001, 0.001),
    'vect__max_features': (None, 5000, 10000), # top max_features ordered by term frequency across the corpus
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams only, or unigrams plus bigrams
    'vect__use_idf': (True, False), # Enables inverse-document-frequency reweighting
    'vect__norm': (None, 'l1', 'l2'), # normalizes tf-idf in each row
    'clf__loss': ('log','modified_huber','hinge'), # log for logistic regression, modified_huber gives a smooth loss tolerant to outliers, hinge for linear SVM
    'clf__alpha': (0.00001,0.00002), # multiplies the regularization term
    'clf__penalty': ('none','l2','l1'), # regularization term
    'clf__max_iter': (10, 50), # passes over the training data (aka epochs)
}

In [None]:
# Grid search with cross validation
grid_search = GridSearchCV(pipeline, parameters, cv=5, scoring='roc_auc', n_jobs=10, verbose=1, iid=False) # 'roc_auc' or 'f1'

print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
print(parameters)

# model training
grid_search.fit(X_train, y_train)

In [None]:
# Result of grid search
print("Best score: %0.3f" % grid_search.best_score_)  
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

### Scoring the test set

We select the best model found by the grid search and use it to predict the labels of the test set.

In [None]:
# best estimator found from grid search
model = grid_search.best_estimator_

In [None]:
# Model predictions

predictions = model.predict(X_test)
predictions

### Model evaluation
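
One point worth keeping in mind when reading the metrics: AUC is a ranking metric, so it is normally computed from continuous scores (for a linear model, the output of `decision_function`) and not from hard 0/1 predictions. The toy numbers below are made up to show how the two can differ; they are not results from this notebook.

```python
from sklearn import metrics

# Hypothetical true labels and decision scores for six documents
y_true = [0, 0, 0, 1, 1, 1]
scores = [-2.0, -1.0, 0.5, -0.5, 1.0, 2.0]

# Hard predictions obtained by thresholding the scores at 0
hard_labels = [1 if s > 0 else 0 for s in scores]

auc_from_scores = metrics.roc_auc_score(y_true, scores)       # uses the full ranking
auc_from_labels = metrics.roc_auc_score(y_true, hard_labels)  # uses one threshold only

print(round(auc_from_scores, 3), round(auc_from_labels, 3))  # 0.889 0.667
```

With hard labels the ROC curve collapses to a single operating point, so the value reflects that one threshold and not the quality of the ranking.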



In [None]:
# Report accuracy, precision, recall, F1 score and AUC on the test set

print("Accuracy:", metrics.accuracy_score(y_test, predictions))
print("Precision:", metrics.precision_score(y_test, predictions))
print("Recall:", metrics.recall_score(y_test, predictions))
print("f1_score:", metrics.f1_score(y_test, predictions))
# AUC is computed from the continuous decision scores, not from the hard 0/1 predictions
print("AUC:", metrics.roc_auc_score(y_test, model.decision_function(X_test)))

In [None]:
# confusion matrix of predictions

cnf_matrix = metrics.confusion_matrix(y_test, predictions)
cnf_matrix

## Prepare data for AutoAI

We extract a TF-IDF matrix restricted to the 1,000 most frequent terms in the corpus (`max_features=1000`) and attach the label column, so that the result can be used as input to AutoAI.
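
A minimal sketch of the resulting table layout, using three made-up documents and `max_features=5` in place of 1,000 (both are assumptions for illustration). Column names are read from the fitted `vocabulary_` attribute here, which is one way to name the columns that does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus and labels
docs = [
    "virus outbreak school closure",
    "school meal voucher",
    "virus vaccine trial",
]
labels = [1, 0, 1]

# Keep only the 5 most frequent terms across the corpus
vectorizer = TfidfVectorizer(max_features=5)
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each kept term to its column index; sort terms by that index
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

toy_df = pd.DataFrame(X.toarray(), columns=columns)
toy_df["label"] = labels  # one row per document, label as the last column
print(toy_df.shape)       # (3, 6)
```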

In [None]:
# create tf-idf matrix
Tvectorizer = TfidfVectorizer(max_df = 0.6, min_df=0.0001, norm = 'l2', max_features=1000)
X_tfidf = Tvectorizer.fit_transform(df_clean['article_clean'])
# place tf-idf values in a pandas data frame
tfidf_df = pd.DataFrame(X_tfidf.todense(), columns=Tvectorizer.get_feature_names())

In [None]:
# add label column to dataframe
tfidf_df['label'] = df_clean['label']

In [None]:
# save data (without the row index, so it is not picked up as a feature column)
tfidf_df.to_csv(f"{output_path}/tf_idf.csv", index=False)

## Save and deploy model with WML

We save and deploy the model by connecting to the ICP4D local Watson Machine Learning using CP4D credentials. Watson Machine Learning provides deployment spaces where the user can save, configure and deploy their models. We can save models, functions and data assets in this space. The steps involved in saving and deploying the model are detailed in the following cells. We will use the watson_machine_learning_client package to complete these steps.

* Connect to WML client
* Save the model in the deployment space repository
* Deploy the model ONLINE

### Connect to WML client

We establish a connection to the Watson Machine Learning API with the system credentials and set the default space and project.

In [None]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient
import os

token = os.environ['USER_ACCESS_TOKEN']

wml_credentials = {
   "token": token,
   "instance_id" : "openshift",
   "url": os.environ['RUNTIME_ENV_APSX_URL'],
   "version": "2.5.0"
}

client = WatsonMachineLearningAPIClient(wml_credentials)

In [None]:
# create a deployment space (only run for the first time)
space_details = client.spaces.store(meta_props={client.spaces.ConfigurationMetaNames.NAME: "text_analysis_space"}) 

# get the space uid
space_uid = client.spaces.get_uid(space_details)

In [None]:
# set project ID
project_uid = os.environ['PROJECT_ID']
client.set.default_project(project_uid)
client.set.default_space(space_uid)

### Save the model in the deployment space repository

In [None]:
# name under which the trained model (the best estimator from the grid search) is stored
model_name = 'covid_classifier'

In [None]:
# Store the model details
model_props = {client.repository.ModelMetaNames.NAME: model_name,
               client.repository.ModelMetaNames.RUNTIME_UID : "scikit-learn_0.20-py3.6",
               client.repository.ModelMetaNames.TYPE : "scikit-learn_0.20",
               }

# store model in the deployment space
stored_model_details = client.repository.store_model(model=model, meta_props=model_props)

### Deploy the model ONLINE

In [None]:
# deployment metadata of the model
meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: model_name,
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}

# deploy the model
model_uid = stored_model_details["metadata"]["guid"]
deployment_details = client.deployments.create( artifact_uid=model_uid, meta_props=meta_props)

### Score the deployed model
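
The scoring payload pairs a list of field names with a list of rows, one row per document. The sketch below shows that layout for several documents at once. The helper `build_scoring_payload` is hypothetical, and the literal key `"input_data"` is an assumption about what `client.deployments.ScoringMetaNames.INPUT_DATA` resolves to; in the notebook itself, keep using the client constant.

```python
def build_scoring_payload(documents, field_name="article_clean"):
    """Hypothetical helper: wrap cleaned documents in the fields/values layout."""
    return {
        # Assumed to match client.deployments.ScoringMetaNames.INPUT_DATA
        "input_data": [{
            "fields": [field_name],
            # One inner list per document, with one value per field
            "values": [[doc] for doc in documents],
        }]
    }


payload = build_scoring_payload([
    "first cleaned article text",
    "second cleaned article text",
])
print(payload["input_data"][0]["values"])
```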

In [None]:
fields = ["article_clean"]
values = [['virus hero key worker full list classify niamh cavanagh mar update mar boris johnson announce today school shut friday notice child key worker vulnerable kid official list release tomorrow list medical professional education name johnson announcement earlier today credit afp list medical professional include doctor nurse midwife paramedic health worker employment medical health community safety police force fire brigade education teacher worker school pre school supermarket worker delivery driver broadcast johnson announce school whole close friday britain desperately contain coronavirus outbreak right time education secretary gavin williamson promise free school meal everyone likely form supermarket voucher boris nation even slow result draconian measure place week believe step already together announce today already slow spread disease pay story']]

scoring_payload = {
client.deployments.ScoringMetaNames.INPUT_DATA: [{
    "fields": fields, 
    "values": values
}]
}

In [None]:
dep_id = client.deployments.get_uid(deployment_details)

In [None]:
client.deployments.score(deployment_id=dep_id,meta_props=scoring_payload)

#### Authors
* **Mehrnoosh Vahdat** is a Data Scientist with the Data Science & AI Elite team, where she specializes in Data Science, Analytics platforms, and Machine Learning solutions.
* **Anthony Ayanwale** is a Data Scientist with the CPAT team, where he specializes in Data Science, Analytics platforms, and Machine Learning solutions.

Copyright © IBM Corp. 2020. Licensed under the Apache License, Version 2.0. Released as licensed Sample Materials.