<a href="https://colab.research.google.com/github/KCL-Health-NLP/nlp-half-day-workshop/blob/main/practicals/classification-short.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP classification - supervised learning
## A short example

In this example, you will learn how you can use supervised learning algorithms for NLP classification. We will use documents from [MTSamples](http://www.mtsamples.com/). These are transcribed sample medical reports from a variety of clinical disciplines and document types, such as pediatrics, hematology, radiology, surgery, and discharge summaries. Note that one document can belong to several categories.

The task is to classify a document into its clinical specialty, e.g. pediatrics or hematology.

We will use the simple K Nearest Neighbours classification algorithm as implemented in a popular Python machine learning library, [scikit-learn](https://scikit-learn.org/stable/), and evaluate with cross-validation before testing on unseen test data. We will use the [pandas](https://pandas.pydata.org/) library to store and handle our data.

We will experiment with a couple of different ways of representing the documents for the classifiers.

Parts of this material are adapted from https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f

Written by Sumithra Velupillai, March 2019 - adapted and updated February 2020, April 2024, January 2025 by Sumithra Velupillai and Angus Roberts

# 1: Packages
We will use a number of different packages for this exercise.

In [None]:
# We will use matplotlib to graph our results
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

# And pandas to store our data
import pandas as pd

# numpy for number and vector handling
import numpy as np

import warnings; warnings.simplefilter('ignore')


In [None]:
# We'll use scikit-learn for the classification algorithms.
# https://scikit-learn.org/stable/

from sklearn.neighbors import KNeighborsClassifier



In [None]:
## sklearn also has some nice functions for representations

# Bag-of-words implementation in sklearn
from sklearn.feature_extraction.text import CountVectorizer

# TfIdf implementation in sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

## and for evaluation
# from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score



In [None]:
## Since we're working with text, we might need to tokenize for some of these representations.
# We'll use nltk here, but there are other nlp packages available for this
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# 2: Corpus
Read in the training data.

In [None]:
# Copy files from GitHub into the local Colab filespace.
!git clone --quiet https://github.com/KCL-Health-NLP/nlp-half-day-workshop.git
print("Done copying files")

xlds_training = './nlp-half-day-workshop/practicals/classification_training_data.xlsx'
trainingdata = pd.read_excel(xlds_training)



Take a look at the content of the training data. What are we trying to classify? What are the labels we want to try to learn? How many instances do we have?

In [None]:
trainingdata['label'].value_counts()

What types of features do you think would be useful for the classification task? Where can we get them? Take a look at one or two of the documents. Can you guess which classification label these belong to?

In [None]:
trainingtxt_example = trainingdata['txt'].tolist()[0]
print(trainingtxt_example)

In [None]:
trainingtxt_example = trainingdata['txt'].tolist()[231]
print(trainingtxt_example)

# 3: Representation - BoW

The most common baseline feature representation for text classification tasks is the *bag-of-words* representation, stored as a document-term matrix. Let's build a simple one using raw counts, keeping only the 500 most frequent features. We can use the CountVectorizer class from sklearn, and tokenize using a function from nltk.

In [None]:
# Unigram counts, tokenized with nltk, keeping the 500 most frequent tokens
first_vectorizer = CountVectorizer(ngram_range=(1,1), stop_words=None,
                             tokenizer=word_tokenize, max_features=500)
# fit_transform learns the vocabulary and builds the document-term matrix in one step
first_fit_transformed_data = first_vectorizer.fit_transform(trainingdata['txt'])


We can now look at this transformed representation for an example document.

In [None]:
first_transformed_data = first_vectorizer.transform([trainingdata['txt'].tolist()[231]])
print (first_transformed_data)

What word is represented by the different indices? Have a look at a few examples.

In [None]:
print (first_vectorizer.get_feature_names_out()[32])

In [None]:
print(first_fit_transformed_data.shape)
print('Number of non-zero entries: ', first_fit_transformed_data.nnz)

# 4: Classification
Let's build a classifier with this feature representation. In text classification, many classification algorithms have been shown to work well. scikit-learn has implementations for many different types of classification algorithms - have a look at their website!

Let's try a K nearest neighbour classifier. This builds a model that assigns a class to each test example based on the majority class of the k nearest training examples to that test example. By default, scikit-learn's KNN classifier looks at the 5 closest neighbours.


In [None]:
kneighbour_classifier = KNeighborsClassifier().fit(first_fit_transformed_data, trainingdata['label'])
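
The majority vote can be seen on a toy problem. The sketch below uses made-up two-dimensional points and made-up labels (real documents have one dimension per vocabulary term), and sets `n_neighbors=3` for illustration instead of the default of 5. All `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training points: one cluster per label
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_toy = ["cardiology", "cardiology", "cardiology",
         "radiology", "radiology", "radiology"]

toy_knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# A new point that lies close to the first cluster
query = np.array([[1, 1]])

# kneighbors returns the distances to, and indices of, the nearest training points
distances, neighbour_idx = toy_knn.kneighbors(query)
neighbour_labels = [y_toy[i] for i in neighbour_idx[0]]
print(neighbour_labels)

# predict returns the majority label among those neighbours
toy_prediction = toy_knn.predict(query)[0]
print(toy_prediction)
```

Note that "training" a KNN model mostly amounts to storing the training examples; the real work happens at prediction time, when distances to the stored examples are computed.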

We now have a trained model. But how do we know how well it works? Let's evaluate it on the test data.

In [None]:

xlds_test = './nlp-half-day-workshop/practicals/classification_test_data.xlsx'
testdata = pd.read_excel(xlds_test)



## We need to transform this data to the same representation
first_fit_transformed_testdata = first_vectorizer.transform(testdata['txt'])

In [None]:
first_fit_transformed_testdata
kneighbour_predicted = kneighbour_classifier.predict(first_fit_transformed_testdata)
kneighbour_predicted

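Let's make a sorted list of all the labels in our dataset, and then run some standard evaluation metrics.

In [None]:
# classification_report assigns target_names to the labels in sorted order,
# so the list of names must be sorted too (a set has no guaranteed order)
labels = sorted(set(testdata['label']))
print(metrics.classification_report(testdata['label'], kneighbour_predicted, target_names=labels))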



What do you think about these results? There are probably ways of improving this, by changing the representation or maybe trying a different classifier model.
__There is one main problem though: we can't use this test data to try different configurations! Why?__

# 5: N-fold cross-validation

We can employ n-fold cross-validation on the training data to experiment with different representations, parameters, and classifiers.

There are also various metrics that can be used to evaluate classification results.
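
As a rough sketch of what n-fold cross-validation does, the loop below splits a made-up dataset into 5 folds by hand, trains on 4 of them, and scores on the one held out. It assumes `StratifiedKFold`, which is what scikit-learn uses when a classifier is cross-validated with an integer `cv`; the data and the `toy_` names are invented for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Made-up one-dimensional data: two well-separated groups of 10 points each
X_toy = np.array([[i] for i in range(10)] + [[100 + i] for i in range(10)], dtype=float)
y_toy = np.array([0] * 10 + [1] * 10)

fold_scores = []
folds = StratifiedKFold(n_splits=5)
for train_idx, held_out_idx in folds.split(X_toy, y_toy):
    # Train on four folds ...
    model = KNeighborsClassifier(n_neighbors=3).fit(X_toy[train_idx], y_toy[train_idx])
    # ... and evaluate on the fold that was held out
    predictions = model.predict(X_toy[held_out_idx])
    fold_scores.append(f1_score(y_toy[held_out_idx], predictions, average="macro"))

print(fold_scores)
print(np.mean(fold_scores))
```

Every example is held out exactly once, so we get one score per fold without ever touching the test data. The spread of those scores gives a sense of how stable the result is.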


In [None]:
kneighbour_classifier = KNeighborsClassifier().fit(first_fit_transformed_data, trainingdata['label'])
# Metrics to compute on the held-out fold in each round of cross-validation
scoring = ['precision_macro', 'recall_macro','precision_micro','recall_micro', 'f1_micro', 'f1_macro']
# cross_validate refits a fresh copy of the classifier on each of the 10 training splits
scores = cross_validate(kneighbour_classifier, first_fit_transformed_data, trainingdata['label'], scoring=scoring, cv=10, return_train_score=False)
scoresdf = pd.DataFrame(scores)
# cross_validate prefixes each metric name with 'test_' in its results
scoring = ['test_precision_macro', 'test_recall_macro','test_precision_micro','test_recall_micro', 'test_f1_micro', 'test_f1_macro']
# One box per metric, showing the spread of scores across the 10 folds
bp = scoresdf.boxplot(column=scoring, grid=False, rot=45,)
[ax_tmp.set_xlabel('') for ax_tmp in np.asarray(bp).reshape(-1)]
fig = np.asarray(bp).reshape(-1)[0].get_figure()
fig.suptitle('K nearest neighbour, count vectorizer')
plt.show()
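
The scoring names above mix *macro* and *micro* averages. A small sketch with made-up labels, in which class "a" is common and class "b" is rare, shows how they can differ. The variable names are illustrative only.

```python
from sklearn.metrics import recall_score

# Made-up gold labels and predictions: one of the two rare "b" examples is missed
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "a", "a", "b"]

# Micro: pool all decisions across classes, then compute recall once
micro_recall = recall_score(y_true, y_pred, average="micro")

# Macro: compute recall per class, then take the unweighted mean
macro_recall = recall_score(y_true, y_pred, average="macro")

print(micro_recall, macro_recall)
```

Recall is 1.0 for "a" and 0.5 for "b", so the macro average is 0.75, while the micro average is 5 correct out of 6. Macro averaging gives every class the same weight, so it penalises poor performance on rare classes more heavily. When every document has exactly one label, micro precision, micro recall and micro F1 all equal accuracy.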

# 6: Another representation model: Tf-idf
We have used a very simple bag-of-words representation. What happens if we try something else? Let's try a representation called tf-idf - Term Frequency, Inverse Document Frequency. Tf-idf is a word frequency model like bag-of-words, but it adjusts frequencies to take into account how rare words are. A rare word might be expected to help us distinguish between documents, so a very common word is given less weight than a rare word. Tf-idf is considered a strong baseline in many text classification tasks.

In [None]:

stopWords = list(stopwords.words('english'))
tfidf_vect = TfidfVectorizer(tokenizer=word_tokenize, stop_words=stopWords)
tfidf_vect.fit(trainingdata['txt'])
second_fit_transformed_data =  tfidf_vect.transform(trainingdata['txt'])
second_fit_transformed_data

What other parameters can you change in this representation? How does this look different from the CountVectorizer representation?

Let's now use this with the KNN classifier.

In [None]:
kneighbour_classifier = KNeighborsClassifier().fit(second_fit_transformed_data, trainingdata['label'])
scoring = ['precision_macro', 'recall_macro','precision_micro','recall_micro', 'f1_micro', 'f1_macro']
scores = cross_validate(kneighbour_classifier, second_fit_transformed_data, trainingdata['label'], scoring=scoring, cv=10, return_train_score=False)
scoresdf = pd.DataFrame(scores)
scoring = ['test_precision_macro', 'test_recall_macro','test_precision_micro','test_recall_micro', 'test_f1_micro', 'test_f1_macro']
bp = scoresdf.boxplot(column=scoring, grid=False, rot=45,)
[ax_tmp.set_xlabel('') for ax_tmp in np.asarray(bp).reshape(-1)]
fig = np.asarray(bp).reshape(-1)[0].get_figure()
fig.suptitle('K nearest neighbour, tf-idf vectorizer')
plt.show()

This looks better, doesn't it? Why do you think this works better?