# NLP classification - supervised learning

In this example, you will learn how you can use supervised learning algorithms for NLP classification. We will use documents from mtsamples again. The task is to classify a document into its clinical specialty, e.g. pediatrics or hematology.

We will use classification algorithms as implemented in scikit-learn, and evaluate with cross-validation before testing on unseen test data.

Written by Sumithra Velupillai, March 2019. Updated August 2019.

In [None]:
%matplotlib inline

import matplotlib

import pandas as pd

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score

from nltk.tokenize import word_tokenize

import numpy as np

import matplotlib.pyplot as plt

import warnings; warnings.simplefilter('ignore')


# 1: Corpus
Read in the training data.

In [None]:
trainingdata = pd.read_pickle('training_data.pickle')

Take a look at the content. What are the labels we want to try to learn? How many instances do we have?

In [None]:
trainingdata['label'].value_counts()

What types of features do you think would be useful for the classification task? Where can we get them? Take a look at one or two of the documents. Can you guess which classification label these belong to?

In [None]:
trainingtxt_example = trainingdata['txt'].tolist()[0]
print(trainingtxt_example)

In [None]:
trainingtxt_example = trainingdata['txt'].tolist()[231]
print(trainingtxt_example)

The most common baseline feature representation for text classification is the bag-of-words model, stored as a document-term matrix: one row per document and one column per term. Let's build a simple one using raw counts, keeping at most 500 features. We can use the CountVectorizer class from sklearn, and tokenize using a function from nltk.

In [None]:
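## Bag-of-words vectorizer: unigrams only, no stop word removal, nltk tokenization,
## and at most 500 features (the most frequent terms in the corpus)
first_vectorizer = CountVectorizer(ngram_range=(1,1), stop_words=None,
                             tokenizer=word_tokenize, max_features=500)
## fit_transform learns the vocabulary and returns the document-term matrix in one step
first_fit_transformed_data = first_vectorizer.fit_transform(trainingdata['txt'])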
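
To make the document-term matrix more concrete, here is a minimal sketch on three made-up mini documents (they are not taken from the mtsamples data). It assumes the default CountVectorizer tokenizer rather than word_tokenize, and no feature limit, so that the whole matrix fits on screen.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up mini "documents" (hypothetical, not from the mtsamples data)
toy_docs = [
    "chest pain and cough",
    "cough and fever",
    "chest pain chest pain",
]

# Default tokenizer and no feature limit, to keep the toy example small
toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Column index of each term (terms are sorted alphabetically)
toy_vocabulary = sorted(toy_vectorizer.vocabulary_.items(), key=lambda kv: kv[1])
print(toy_vocabulary)

# One row per document, one column per term, raw counts as values
toy_dense = toy_matrix.toarray()
print(toy_dense)
```

Each row is one document and each column one term, so the third row holds a 2 in the columns for "chest" and "pain" and zeros elsewhere.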


We can now look at this transformed representation for an example document.

In [None]:
first_transformed_data = first_vectorizer.transform([trainingdata['txt'].tolist()[231]])
print(first_transformed_data)

Which word does each index represent? Have a look at a few examples.

In [None]:
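## get_feature_names_out() requires scikit-learn 1.0 or later; older versions use get_feature_names()
print(first_vectorizer.get_feature_names_out()[30])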

In [None]:
print(first_fit_transformed_data.shape)
print('Number of non-zero entries: ', first_fit_transformed_data.nnz)
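
The shape and the number of non-zero entries together tell you how sparse the matrix is. A minimal sketch, using a made-up 3 x 5 count matrix in place of the real document-term matrix:

```python
from scipy.sparse import csr_matrix

# A made-up 3 x 5 count matrix standing in for a document-term matrix
toy_counts = csr_matrix([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0],
    [0, 2, 0, 0, 2],
])

n_docs, n_terms = toy_counts.shape
# nnz is the number of stored non-zero cells
density = toy_counts.nnz / (n_docs * n_terms)
print(f"{toy_counts.nnz} non-zero cells out of {n_docs * n_terms} (density {density:.2f})")
```

The same division applied to the real matrix shows what fraction of its cells are actually filled.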

Let's build a classifier with this feature representation.

In [None]:
multinomialNB_classifier = MultinomialNB().fit(first_fit_transformed_data, trainingdata['label'])

We now have a trained multinomial Naive Bayes model. But how do we know how well it works? Let's evaluate it on the test data.

In [None]:
testdata = pd.read_pickle('test_data.pickle')
## We need to transform this data to the same representation
first_fit_transformed_testdata = first_vectorizer.transform(testdata['txt'])

In [None]:
## predict a label for every test document
multinomialNB_predicted = multinomialNB_classifier.predict(first_fit_transformed_testdata)
multinomialNB_predicted

In [None]:
## No target_names: a set is unordered, so its names could end up on the wrong rows of the report
print(metrics.classification_report(testdata['label'], multinomialNB_predicted))

These results may not look too bad. Still, there are probably ways to improve them, for example by changing the feature space or trying a different classifier model.
__There is one main problem though: we can't use this test data to try different configurations! Why?__

We can however try some different feature representations and classifier algorithms on the training data. Let's try finding a model we think will work well on unseen data by employing n-fold cross-validation.

In [None]:
## cross_validate fits a fresh copy of the classifier on each fold, so no prior call to fit is needed
multinomialNB_classifier = MultinomialNB()
scoring = ['precision_macro', 'recall_macro','precision_micro','recall_micro', 'f1_micro', 'f1_macro']
scores = cross_validate(multinomialNB_classifier, first_fit_transformed_data, trainingdata['label'], scoring=scoring, cv=10, return_train_score=False)
scoresdf = pd.DataFrame(scores)
## the result columns carry a 'test_' prefix
score_columns = ['test_precision_macro', 'test_recall_macro','test_precision_micro','test_recall_micro', 'test_f1_micro', 'test_f1_macro']
bp = scoresdf.boxplot(column=score_columns, grid=False, rot=45)
[ax_tmp.set_xlabel('') for ax_tmp in np.asarray(bp).reshape(-1)]
fig = np.asarray(bp).reshape(-1)[0].get_figure()
plt.show()
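
The scoring list above mixes macro and micro averages. The difference matters when classes are imbalanced. The following sketch uses made-up gold and predicted labels (hypothetical, not produced by any model in this notebook) to show it for recall.

```python
from sklearn.metrics import recall_score

# Made-up gold and predicted labels with an imbalanced class distribution
gold = ["cardiology"] * 8 + ["pediatrics"] * 2
pred = ["cardiology"] * 8 + ["cardiology", "pediatrics"]

# Macro: compute recall per class, then take the unweighted mean
macro_recall = recall_score(gold, pred, average="macro")
# Micro: pool all decisions, then compute a single recall value
micro_recall = recall_score(gold, pred, average="micro")

print("macro:", round(macro_recall, 3))
print("micro:", round(micro_recall, 3))
```

Here the single missed pediatrics document halves the recall of the small class, which pulls the macro average down to 0.75, while the micro average stays at 0.9 because nine of ten documents are still correct.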

What happens if we try another classifier? Let's try a random forest classifier.

In [None]:
## cross_validate fits a fresh copy of the classifier on each fold, so no prior call to fit is needed
rf_classifier = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)
scoring = ['precision_macro', 'recall_macro','precision_micro','recall_micro', 'f1_micro', 'f1_macro']
scores = cross_validate(rf_classifier, first_fit_transformed_data, trainingdata['label'], scoring=scoring, cv=10, return_train_score=False)
scoresdf = pd.DataFrame(scores)
## the result columns carry a 'test_' prefix
score_columns = ['test_precision_macro', 'test_recall_macro','test_precision_micro','test_recall_micro', 'test_f1_micro', 'test_f1_macro']
bp = scoresdf.boxplot(column=score_columns, grid=False, rot=45)
[ax_tmp.set_xlabel('') for ax_tmp in np.asarray(bp).reshape(-1)]
fig = np.asarray(bp).reshape(-1)[0].get_figure()
plt.show()

Was this better or worse? Are there any parameters worth changing?
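
One way to explore the parameter question systematically is a grid search with cross-validation. The sketch below is only an illustration: the mini corpus is made up, the candidate values in `param_grid` are arbitrary choices, and the texts are vectorized once up front with the default tokenizer, mirroring the approach used so far.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Made-up mini corpus (hypothetical texts and labels, not the mtsamples data)
toy_texts = [
    "child with fever and cough", "infant vaccination visit",
    "toddler with ear infection", "child asthma follow up",
    "newborn feeding problems", "infant with rash and fever",
    "chest pain and palpitations", "atrial fibrillation follow up",
    "heart failure with edema", "hypertension and chest pain",
    "coronary artery disease", "heart murmur evaluation",
]
toy_labels = ["pediatrics"] * 6 + ["cardiology"] * 6

toy_X = CountVectorizer().fit_transform(toy_texts)

# Candidate values are arbitrary choices for illustration
param_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="f1_macro", cv=3)
search.fit(toy_X, toy_labels)

# Best parameter combination and its mean cross-validated score
print(search.best_params_)
print(round(search.best_score_, 3))
```

On a corpus this small the scores are not meaningful; the point is only the mechanics of trying several settings under the same cross-validation scheme.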

We have used a very simple bag-of-words representation. What happens if we try something else? Let's try tf-idf. This is considered a strong baseline in many text classification tasks.

In [None]:
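from nltk.corpus import stopwords
## pass the stop words as a list: this works across scikit-learn versions, while recent ones may reject a set
stopWords = stopwords.words('english')
tfidf_vect = TfidfVectorizer(tokenizer=word_tokenize, stop_words=stopWords)
## fit_transform learns vocabulary and idf weights and returns the weighted matrix in one step
second_fit_transformed_data = tfidf_vect.fit_transform(trainingdata['txt'])
second_fit_transformed_data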
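
To see what tf-idf does to individual terms, here is a minimal sketch on three made-up mini documents, assuming the default TfidfVectorizer settings (default tokenizer, smoothed idf, L2-normalised rows). A term that occurs in every document should receive a lower weight than a term that occurs in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini documents: "patient" occurs in every document, "asthma" in only one
toy_docs = [
    "patient with asthma",
    "patient with fever",
    "patient with chest pain",
]

toy_tfidf = TfidfVectorizer()
toy_weights = toy_tfidf.fit_transform(toy_docs)

# Look up the weights of two terms in the first document
vocab = toy_tfidf.vocabulary_
weight_patient = toy_weights[0, vocab["patient"]]
weight_asthma = toy_weights[0, vocab["asthma"]]

print("patient:", round(weight_patient, 3))
print("asthma: ", round(weight_asthma, 3))
```

Both terms occur exactly once in the first document, so a raw count representation would treat them identically; the tf-idf weights differ only because of how many documents contain each term.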

What other parameters can you change in this representation? How does this look different from the CountVectorizer representation?

Let's now use this with the Multinomial Naive Bayes classifier.

In [None]:
## cross_validate fits a fresh copy of the classifier on each fold, so no prior call to fit is needed
multinomialNB_classifier = MultinomialNB()
scoring = ['precision_macro', 'recall_macro','precision_micro','recall_micro', 'f1_micro', 'f1_macro']
scores = cross_validate(multinomialNB_classifier, second_fit_transformed_data, trainingdata['label'], scoring=scoring, cv=10, return_train_score=False)
scoresdf = pd.DataFrame(scores)
## the result columns carry a 'test_' prefix
score_columns = ['test_precision_macro', 'test_recall_macro','test_precision_micro','test_recall_micro', 'test_f1_micro', 'test_f1_macro']
bp = scoresdf.boxplot(column=score_columns, grid=False, rot=45)
[ax_tmp.set_xlabel('') for ax_tmp in np.asarray(bp).reshape(-1)]
fig = np.asarray(bp).reshape(-1)[0].get_figure()
plt.show()
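
One caveat about the cells so far: the vectorizer was fitted on all training documents before cross-validation, so the held-out documents of each fold already influenced the vocabulary and the idf weights. The effect may well be small, but it can be avoided by wrapping vectorizer and classifier in a Pipeline. The sketch below shows the idea on a made-up mini corpus with the default tokenizer; the step names `tfidf` and `clf` are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up mini corpus (hypothetical texts and labels, not the mtsamples data)
toy_texts = [
    "child with fever and cough", "infant vaccination visit",
    "toddler with ear infection", "child asthma follow up",
    "newborn feeding problems", "infant with rash and fever",
    "chest pain and palpitations", "atrial fibrillation follow up",
    "heart failure with edema", "hypertension and chest pain",
    "coronary artery disease", "heart murmur evaluation",
]
toy_labels = ["pediatrics"] * 6 + ["cardiology"] * 6

# The vectorizer is a pipeline step, so it is refitted on the training part of each fold
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Raw texts go in; vectorization happens inside each fold
fold_scores = cross_val_score(toy_pipeline, toy_texts, toy_labels,
                              scoring="f1_macro", cv=3)
print(fold_scores)
```

The same pattern should carry over to the real data by passing `trainingdata['txt']` and `trainingdata['label']` instead of the toy lists.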

This looks better, doesn't it? Let's try some different configurations all in one go.

In [None]:
## material in parts from https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f

representations = {}

vectorizer = CountVectorizer(ngram_range=(1,1), stop_words=None,
                             tokenizer=word_tokenize, max_features=500)
xtrain_countvect = vectorizer.fit_transform(trainingdata['txt'])
representations['CountVectorizer'] = xtrain_countvect

tfidf_vect = TfidfVectorizer(tokenizer=word_tokenize, stop_words=stopWords)
tfidf_vect.fit(trainingdata['txt'])
xtrain_tfidf =  tfidf_vect.transform(trainingdata['txt'])
representations['TfidfVectorizer'] = xtrain_tfidf



for representation, transformed_vector in representations.items():
    classifier_models = [
        RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
        LinearSVC(multi_class='ovr', C=1.0, class_weight=None, dual=True, fit_intercept=True,
                  intercept_scaling=1, loss='squared_hinge', max_iter=1000,
                  penalty='l2', random_state=0, tol=1e-05, verbose=0),
        MultinomialNB(),
        #LogisticRegression(random_state=0),
        SGDClassifier(),
    ]
    CV = 10
    score = 'f1_micro'
    entries = []
    for model in classifier_models:
        model_name = model.__class__.__name__
        ## one score per fold for this model
        fold_scores = cross_val_score(model, transformed_vector, trainingdata['label'], scoring=score, cv=CV)
        for fold_idx, fold_score in enumerate(fold_scores):
            entries.append((model_name, fold_idx, fold_score))
    cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', score])

    print('representation: ', representation)
    bp = cv_df.boxplot(by='model_name', column=[score], grid=False, rot=45,)
    [ax_tmp.set_xlabel('') for ax_tmp in np.asarray(bp).reshape(-1)]
    fig = np.asarray(bp).reshape(-1)[0].get_figure()
    fig.suptitle('Representation: '+representation)
    plt.show()
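
Boxplots are easier to compare when backed by numbers. A minimal sketch of a per-model summary, using made-up fold scores in the same long format as `cv_df` (the model names and values are illustrative only):

```python
import pandas as pd

# Made-up per-fold scores in the same long format as cv_df (values are illustrative only)
toy_entries = [
    ("ModelA", 0, 0.70), ("ModelA", 1, 0.80), ("ModelA", 2, 0.75),
    ("ModelB", 0, 0.60), ("ModelB", 1, 0.90), ("ModelB", 2, 0.75),
]
toy_cv_df = pd.DataFrame(toy_entries, columns=["model_name", "fold_idx", "f1_micro"])

# Mean and standard deviation of the fold scores per model
summary = toy_cv_df.groupby("model_name")["f1_micro"].agg(["mean", "std"])
print(summary)
```

In this toy case both models share the same mean, but the second one varies much more across folds, which is worth considering when choosing a final model.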

What conclusions do you draw from this? Which classifier and which representation would you choose as your final model? Why?

# Assignment: Your turn to build a classifier
**Choose one classifier and one representation format and test it on the test data. What results do you get?** 

***Bonus question: What other configurations could you try before deciding on a final model? Is it appropriate to experiment with this on the test data? Why or why not?*** 

In [None]:
## First step: Transform your training and test data to your chosen representation. 

## choose a representation: CountVectorizer or TfidfVectorizer. Do you want to add additional parameters to the vectorizer?

chosen_vectorizer =

## transform the training data 
transformed_training_data = chosen_vectorizer.fit_transform(trainingdata['txt'])

## transform the test data
transformed_test_data = chosen_vectorizer.transform(testdata['txt'])

## Second step: Create a classifier - for instance the one you think gave best results when experimenting with cross-validation

chosen_classifier = 

## train the classifier on the training data
chosen_classifier.fit(transformed_training_data, trainingdata['label'])
## predict labels on the test data
predicted = chosen_classifier.predict(transformed_test_data)
## what results do you get?
## No target_names: a set is unordered, so its names could end up on the wrong rows of the report
print(metrics.classification_report(testdata['label'], predicted))

**What happens if you try to predict a label with a completely new text using your chosen trained classifier model? Does it seem to classify correctly?**

In [None]:
new_text = 'Patient with severe depression.'
testX = chosen_vectorizer.transform([new_text])
predicted = chosen_classifier.predict(testX)
print(predicted)

In [None]:
new_text = '5-year old girl with asthma.'
testX = chosen_vectorizer.transform([new_text])
predicted = chosen_classifier.predict(testX)
print(predicted)

In [None]:
new_text = 'Her pain is severe.'
testX = chosen_vectorizer.transform([new_text])
predicted = chosen_classifier.predict(testX)
print(predicted)
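
When judging predictions on short new texts, keep in mind that the vectorizer silently drops every word it did not see during fitting. The sketch below illustrates this with a made-up two-document training set and the default tokenizer; with the real vectorizer the vocabulary is of course much larger.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training texts (hypothetical, not the mtsamples data)
toy_train = ["child with fever and cough", "chest pain and palpitations"]
toy_vectorizer = CountVectorizer().fit(toy_train)

# No word of this sentence occurs in the toy training texts
unseen = toy_vectorizer.transform(["severe depression"])
# Only "pain" is part of the toy vocabulary here
partly_seen = toy_vectorizer.transform(["her pain is severe"])

print(unseen.nnz)       # 0: the vector is entirely empty
print(partly_seen.nnz)  # 1: a single known word carries the whole prediction
```

A classifier still returns a label for an empty or nearly empty vector, so a prediction on a short text may say more about the training data than about the text itself.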

### Bonus assignment

**Write ten example sentences or paragraphs where you assign the correct label to each of them. Then pass them to the classifier and calculate precision, recall and f-score.**

In [None]:
new_test_data = ## choose a representation where you have a sentence/paragraph and its 'gold' label
testX = chosen_vectorizer.transform(#pass the new texts to the vectorizer)
predicted = chosen_classifier.predict(testX)
## compare the predicted labels with the gold labels