# Natural Language Processing and Machine Learning

This is a Jupyter Notebook running Python. It contains a step-by-step example of some of the most basic and useful tools in Natural Language Processing (NLP). Explanations for each cell are given partially in the text, but they also rely on the content of the workshop presentation.

You do not need to know about programming in Python in order to run this notebook. There are places where the code can be modified without such knowledge. This is noted in the text, and will be explained thoroughly in the workshop.

The notebook was initially developed by Johannes Bjerva (jbjerva@cs.aau.dk / https://bjerva.github.io) as part of the Digital Literacy programme. It was further modified by Ross Deans Kristensen-McLachlan (rdkm@cas.au.dk) for the CompUp workshop at AAU. If you have any questions, do not hesitate to contact me (Ross).


**Note: This notebook uses data from the OffensEval 2020 shared task on hate speech detection (Zampieri et al., 2020). Be aware that examples of hate speech from this dataset are used in the cells below. They are not in any way meant as an endorsement of such utterances or behaviour.**

## Running a Jupyter Notebook Cell

If you aren't familiar with Jupyter notebooks, the main things you need to know are:

* Each "block" below is known as a **cell**.
* To execute (run) a cell, select it by clicking on it and press **Shift+Enter** on your keyboard.
* The code in the cell will then execute (this can be instant, or take up to 30 seconds depending on your computer), and the output will be displayed below the cell.

# Preliminaries

The first step is to make sure that we have access to all libraries and models that we need.
In particular, these are:
* spaCy (for NLP processing)
* spaCy's Danish model
* Scikit-learn (for machine learning classifiers)
* Matplotlib (for plotting)

In [None]:
# Importing the libraries we need
# python system tools
import os

# Loading the Danish Spacy models
import spacy
from spacy.lang.da.stop_words import STOP_WORDS
nlp = spacy.load("da_core_news_lg")

# data analysis tools
import random
import pandas as pd
import numpy as np
from collections import defaultdict, Counter

# some tools for 'Classical Machine Learning'
import sklearn
from sklearn import metrics
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

# For visualisations
from IPython.display import Image
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import hist
%matplotlib inline
random.seed(1)

Next, we define a helper function for inspecting a trained classifier. We will use it later in the notebook.

In [None]:
# Show the most informative features - again, don't worry too much about the details
def show_most_informative_features(vectorizer, classifier, n=20):
    """
    Print the most informative features from a classifier, i.e. the 'strongest' predictors.
    
    vectorizer:
        A fitted vectorizer defined by the user, e.g. 'CountVectorizer'
    classifier:
        A fitted binary classifier with a 'coef_' attribute, e.g. 'LogisticRegression'
    n:
        Number of features to display, defaults to 20
        
    """
    # Get feature names and coefficients
    # (get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
    #  newer versions use get_feature_names_out())
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(classifier.coef_[0], feature_names))
    # Get ordered labels from the fitted classifier
    labels = classifier.classes_
    # Pair the n most negative coefficients with the n most positive ones
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    # Pretty print columns showing most informative features
    print(f"{labels[0]}\t\t\t\t{labels[1]}\n")
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

## Data

We're going to work with Danish data from the OffensEval 2020 shared task on hatespeech detection (Zampieri et al. 2020).
https://sites.google.com/site/offensevalsharedtask/home

If you want to play around with your own data, this is relatively straightforward if you can convert it to a .tsv file with the following format:

```<ID>\t<TEXT>\t<LABEL> ```

If you require something more complex, feel free to reach out to me after the workshop at: rdkm@cas.au.dk
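
To make the three-column format concrete, here is a minimal sketch that writes and re-reads a tiny tab-separated file using only the Python standard library. The rows are hypothetical and invented for illustration. The header names `id`, `tweet` and `subtask_a` are an assumption based on the column names that later cells in this notebook access; an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical rows, invented for illustration only.
rows = [
    {"id": "1", "tweet": "Sikke en dejlig dag", "subtask_a": "NOT"},
    {"id": "2", "tweet": "Tak for hjælpen!", "subtask_a": "NOT"},
]

# Write the rows as tab-separated values (the buffer stands in for a .tsv file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "tweet", "subtask_a"], delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# Read the data back and check that every line has exactly three fields.
buffer.seek(0)
parsed = list(csv.reader(buffer, delimiter="\t"))
for line in parsed:
    if len(line) != 3:
        raise ValueError(f"Expected 3 tab-separated fields, got {len(line)}")

print(parsed[0])  # the header row
print(parsed[1])  # the first data row
```

If the check passes for your own file, it should be loadable with the `pd.read_csv(..., sep="\t")` call used in the next cells.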
   

In [None]:
# create variable with directory name
directory = "../dkhate/"

# create variables for train and test data files
train_file = os.path.join(directory, "offenseval-da-training-v1.tsv")
test_file = os.path.join(directory, "offenseval-da-test-v1.tsv")

Let's now read the training and test data from the given files:

In [None]:
# load data
train_data = pd.read_csv(train_file, sep="\t").dropna()
test_data = pd.read_csv(test_file, sep="\t").dropna()

# get labels
train_Y = train_data["subtask_a"]
test_Y = test_data["subtask_a"]

And let's see how many examples we have to work with:

In [None]:
print("\n")
print("Number of training instances:", len(train_data))
print("Number of test instances:", len(test_data))
print("\n")

As we can see, there are almost 3000 training instances and about 300 test instances.

The labels are either "NOT" for "Not Hatespeech" or "OFF" for hatespeech.

Let's now look at a single example:

In [None]:
example = train_data.iloc[52] # Choosing a nice example
print("\nTraining example:")
print("\tID:\t", example.iloc[0])    # .iloc selects by position
print("\tText:\t", example.iloc[1])
print("\tLabel:\t", example.iloc[2])

Let's investigate the label balance in the training data:

In [None]:
print(train_data["subtask_a"].value_counts())
train_data["subtask_a"].value_counts().plot(kind='bar')

As you can see, most tweets are not offensive.
We will return to this data after going through a standard NLP pipeline with an example.

# A Naïve Approach - Vectorizing and Feature Extraction

Vectorization. What is it and why are all the cool kids talking about it?

Essentially, vectorization is the process whereby textual or visual data is 'transformed' into some kind of numerical representation. One of the easiest ways to do this is to simply count how often individual features appear in a document.

Take the following text:

<center> <i>My father’s family name being Pirrip, and my Christian name Philip, my infant tongue could make of both names nothing longer or more explicit than Pip. So, I called myself Pip, and came to be called Pip.</i> </center><br>

We can convert this into the following vector

| and | be | being | both | called | came | christian | could | explicit | family | father | i | infant | longer | make | more | my | myself | name | names | nothing | of | or | philip | pip | pirrip | s | so | than | to | tongue|
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |  --- |
| 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 |

<br>
Our textual data is hence reduced to a jumbled-up 'vector' of numbers, known somewhat quaintly as a <i>bag-of-words</i>.
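
The counts in the table can be reproduced with a few lines of standard-library Python. This is only a sketch of the idea: the regular expression used for splitting words is an assumption made for illustration, and scikit-learn's `CountVectorizer` tokenizes slightly differently by default (for instance, it ignores single-character tokens such as "i" and "s").

```python
import re
from collections import Counter

text = ("My father’s family name being Pirrip, and my Christian name Philip, "
        "my infant tongue could make of both names nothing longer or more "
        "explicit than Pip. So, I called myself Pip, and came to be called Pip.")

# Lowercase the text and split it into runs of word characters.
tokens = re.findall(r"\w+", text.lower())

# Count how often each word occurs; word order is discarded.
counts = Counter(tokens)

# The vocabulary is the alphabetically sorted set of words,
# and the vector holds one count per vocabulary entry.
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

print(vocabulary)
print(vector)
```

Note that the vector alone no longer tells us in which order the words appeared, which is where the name "bag" comes from.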
<br>
<br>
To do this in practice, we first need to create our vectorizer.

In [None]:
vectorizer = CountVectorizer(ngram_range = (1,2),    # unigrams and bigrams (1 word and 2 word units)
                             stop_words = list(STOP_WORDS), # why use stopwords?
                             lowercase = True,       # why use lowercase?
                             max_df = 0.95,          # remove very common words
                             min_df = 0.01,          # remove very rare words
                             max_features = 500)     # keep only top 500 features
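
The `ngram_range = (1,2)` setting asks for both single words (unigrams) and pairs of neighbouring words (bigrams). As a rough illustration of what that means, here is a small sketch in plain Python. The helper `ngrams` is hypothetical and written for this example; it is not part of scikit-learn.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i called myself pip".split()

unigrams = ngrams(tokens, 1)   # single words
bigrams = ngrams(tokens, 2)    # pairs of neighbouring words

print(unigrams + bigrams)
```

Bigrams let the model see a little bit of word order, at the cost of many more possible features.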

This vectorizer is then used to turn all of our documents into a vector of numbers, instead of text.

In [None]:
# First we do it for our training data...
training_features = vectorizer.fit_transform(train_data["tweet"])
#... then we do it for our test data
test_features = vectorizer.transform(test_data["tweet"])
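# We can also create a list of the feature names.
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
feature_names = vectorizer.get_feature_names()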

<br>
Q: What are the first 20 features that are picked out by the CountVectorizer?

In [None]:
feature_names[:20]

## Classifying and predicting

We now have to 'fit' the classifier to our data. This means that the classifier takes our data and finds correlations between features and labels.

These correlations are then the *model* that the classifier learns about our data. This model can then be used to predict the label for new, unseen data.
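
Before running this on the real data, it can help to see the fit-then-predict pattern on a toy problem. The sketch below uses four invented English "documents" with invented labels, purely for illustration. The variable names are prefixed with `toy_` so that they do not overwrite anything else in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, invented for illustration only.
toy_texts = ["good nice", "nice day", "bad awful", "awful bad person"]
toy_labels = ["NOT", "NOT", "OFF", "OFF"]

# Step 1: turn the texts into count vectors.
toy_vectorizer = CountVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_texts)

# Step 2: fit the classifier, i.e. learn which words go with which label.
toy_classifier = LogisticRegression().fit(toy_features, toy_labels)

# Step 3: predict labels for new texts, using the SAME fitted vectorizer.
new_texts = ["nice good day", "awful bad person"]
toy_predictions = toy_classifier.predict(toy_vectorizer.transform(new_texts))

print(list(toy_predictions))
```

The important detail is that new texts are passed through `transform` rather than `fit_transform`, so that they are described with the vocabulary learned from the training data.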

In [None]:
classifier = LogisticRegression().fit(training_features, train_Y)

predictions = classifier.predict(test_features)

In [None]:
classifier.predict(training_features)

Q: What are the predictions for the first 20 examples of the test data?

In [None]:
print(predictions[:20])

We can also inspect the model, in order to see which features are most informative when trying to predict a label. 

To do this, we can use the ```show_most_informative_features``` function that I defined earlier - how convenient!

In [None]:
show_most_informative_features(vectorizer, classifier, n=20)
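
The idea behind this inspection is simple: each feature has a coefficient, and sorting the coefficients shows which features pull hardest towards each label. The following sketch illustrates the sorting step with hypothetical feature names and coefficients, which are invented here and are not output from the model above.

```python
# Hypothetical feature names and coefficients, invented for illustration.
toy_feature_names = ["tak", "dejlig", "idiot", "dum", "dag"]
toy_coefficients = [-0.8, -0.5, 1.2, 0.9, -0.1]

# Sort (coefficient, feature) pairs from most negative to most positive.
ranked = sorted(zip(toy_coefficients, toy_feature_names))

n = 2
most_negative = ranked[:n]            # strongest pull towards the first label
most_positive = ranked[:-(n + 1):-1]  # strongest pull towards the second label

print("Towards first label: ", most_negative)
print("Towards second label:", most_positive)
```

For a binary logistic regression in scikit-learn, negative coefficients favour the first entry of `classifier.classes_` and positive coefficients favour the second.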

## Evaluate

The computer has now learned a model of how our data behaves. But is it accurate?

In [None]:
Image("../img/confusionMatrix.jpg")

#### True Positive Rate => Recall => Sensitivity => TP / (TP + FN)

Sensitivity tells us what proportion of the positive class was correctly classified. <br>
i.e. the proportion of sick people who were correctly identified.

#### True Negative Rate => Specificity => TN / (TN + FP)
Specificity tells us what proportion of the negative class was correctly classified. <br>
i.e. the proportion of healthy people who were correctly identified.

#### False Negative Rate => FN / (TP + FN) = 1 - Sensitivity
The proportion of the positive class that was incorrectly classified.

#### False Positive Rate => FP / (TN + FP) = 1 - Specificity
The proportion of the negative class that was incorrectly classified.

#### Precision => TP / (TP + FP)
The proportion of patients we identify as having COVID who actually have it. <br>
i.e. the ratio of true positives to all predicted positives.

#### F1 => 2PR / (P + R)
The harmonic mean of precision (P) and recall (R), useful where both precision and recall are important.

#### Accuracy => (TP + TN) / (TP + FP + FN + TN)
The ratio of correct classifications, relative to the total dataset.
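
As a worked example, the measures above can be computed by hand from the four cells of a confusion matrix. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts, invented for illustration.
tp, fp, fn, tn = 40, 10, 20, 130

recall = tp / (tp + fn)                # sensitivity / true positive rate
specificity = tn / (tn + fp)           # true negative rate
false_negative_rate = fn / (tp + fn)   # equals 1 - recall
false_positive_rate = fp / (tn + fp)   # equals 1 - specificity
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"Recall:      {recall:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"Precision:   {precision:.3f}")
print(f"F1:          {f1:.3f}")
print(f"Accuracy:    {accuracy:.3f}")
```

With these counts the accuracy is noticeably higher than the recall, because the large number of true negatives dominates the total. This is why accuracy on its own can be misleading when one class is much more common than the other.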

----
Thankfully, libraries like ```sklearn``` come with a range of tools that are useful for evaluating models.

One way to do this is to use a confusion matrix, similar to the one you see above.

In [None]:
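# (plot_confusion_matrix was removed in scikit-learn 1.2; newer versions use
#  metrics.ConfusionMatrixDisplay.from_estimator with the same arguments)
metrics.plot_confusion_matrix(classifier, training_features, train_Y,
                              cmap=plt.cm.Blues, labels=["NOT", "OFF"])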
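
A confusion matrix is just a table of counts: for each pair of true label and predicted label, how often did that combination occur? The sketch below builds such a table by hand from two short, hypothetical lists of labels, using only the standard library.

```python
from collections import Counter

# Hypothetical gold labels and predictions, invented for illustration.
gold = ["NOT", "NOT", "OFF", "OFF", "NOT", "OFF"]
pred = ["NOT", "OFF", "OFF", "NOT", "NOT", "OFF"]

# Count each (gold, predicted) pair; these are the cells of the matrix.
cells = Counter(zip(gold, pred))

# Print the table with gold labels as rows and predictions as columns.
labels = ["NOT", "OFF"]
print("gold/pred\t" + "\t".join(labels))
for g in labels:
    print(g + "\t\t" + "\t".join(str(cells[(g, p)]) for p in labels))
```

The diagonal cells (same gold and predicted label) are the correct predictions; everything off the diagonal is an error.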

We can also do some quick calculations, in order to assess just how well our model performs.

NB: Slightly different terminology but Recall is the same as Sensitivity in the confusion matrix above.

In [None]:
classifier_metrics = metrics.classification_report(test_Y, predictions)
print(classifier_metrics)

# A More Sophisticated Approach
We will now go through how we can go from a given input text (X) and a given output (Y), to a model which can take *any* input X and **predict** which Y value it has.

The first step is to obtain the NLP analysis we want, using the spaCy NLP pipeline.

In [None]:
# Print the current pipeline
for tool in nlp.pipeline:
    print(tool[0])

In [None]:
# You can change this sentence to anything you want, and see what output you get in the steps below!
example = "Frøken Jensen bor i Aalborg, tæt på Limfjorden"

# With Spacy getting NLP tools to analyze a sentence is as easy as:
doc = nlp(example)

print(doc)

## Tokenization

In [None]:
# Simple white-space based tokenization goes wrong
for token in example.split():
    print(token)

In [None]:
# Spacy's tokenization gets it right
for token in doc:
    print(token.text)
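
To see exactly where whitespace splitting goes wrong on the example sentence, the sketch below compares it with a simple regular expression that separates punctuation from words. The regular expression is an assumption made for illustration; it is not how spaCy's tokenizer works internally.

```python
import re

sentence_text = "Frøken Jensen bor i Aalborg, tæt på Limfjorden"

# Splitting on whitespace leaves the comma attached to "Aalborg".
whitespace_tokens = sentence_text.split()

# Matching either a run of word characters or a single punctuation mark
# puts the comma in a token of its own.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence_text)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, "Aalborg," and "Aalborg" would be counted as two different words, which is rarely what we want.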

## Lemmatization

Note that the automatic solution gives the wrong lemma for "tæt"

In [None]:
for token in doc:
    print(token.lemma_)

## Part of Speech Tagging

Interestingly, the POS for "tæt" is correct, even though the lemma is wrong

In [None]:
for token in doc:
    print('{0:10.10} {1}'.format(token.text, token.pos_))

## Parsing

In [None]:
for token in doc:
    print('{0:10.10} {1}'.format(token.text, token.dep_))

This corresponds to the dependency parse below:

In [None]:
spacy.displacy.render(doc, style='dep')

## Named Entity Recognition

Note that the example sentence contains the name "Frøken Jensen". Check whether the NER tool recognises the full name as a single entity.

In [None]:
# Note that we loop over doc.ents rather than simply 'doc'
for ent in doc.ents:
    print('{0:15.15} {1}'.format(ent.text, ent.label_))

# A (very) simple Machine Learning pipeline using NLP

We will now look at combining these things into a full pipeline for the HateSpeech detection task.

In [None]:
def get_features(data, train=True):
    '''Extract required features for each sentence in the data set'''
    X = []
    for sentence in data:
        curr_X = []
        doc = nlp(sentence)
        for token in doc:
            if "tokens" in features:
                curr_X.append(token.text)
            if "lemmas" in features:
                curr_X.append(token.lemma_)
            if "parse" in features:
                curr_X.append(token.dep_)
            if "pos" in features:
                curr_X.append(token.pos_)
            if "token+pos" in features:
                curr_X.append(token.text + token.pos_)
                
        if "ner" in features:
            for ent in doc.ents:
                curr_X.append(ent.label_)
                
                
        X.append(" ".join(curr_X))
    if train:
        X_counts = count_vect.fit_transform(X)
    else:
        X_counts = count_vect.transform(X)
        
    return X_counts
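
The function above joins the selected annotations for every token into one long string per text, and that string is what the `CountVectorizer` then counts. The sketch below shows this string-building step in isolation. The annotations are written by hand and are hypothetical stand-ins for spaCy's output, and the helper `build_feature_string` is invented for this example.

```python
# Hand-written, hypothetical annotations standing in for spaCy's output:
# (text, lemma, part of speech) for each token.
annotated = [("Hunde", "hund", "NOUN"), ("løber", "løbe", "VERB")]

def build_feature_string(annotated_tokens, selected):
    """Join the selected annotations of each token into one string."""
    parts = []
    for text, lemma, pos in annotated_tokens:
        if "tokens" in selected:
            parts.append(text)
        if "lemmas" in selected:
            parts.append(lemma)
        if "pos" in selected:
            parts.append(pos)
        if "token+pos" in selected:
            parts.append(text + pos)
    return " ".join(parts)

print(build_feature_string(annotated, ["lemmas"]))
print(build_feature_string(annotated, ["lemmas", "pos"]))
print(build_feature_string(annotated, ["token+pos"]))
```

Each extra feature type simply adds more "words" to the string, so the vectorizer ends up counting lemmas, tags and combinations side by side.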

Running the cell below might take some time, as running the spaCy NLP tool on the entire training and test set is time-consuming.

In [None]:
### Settings ###

# Ignore features with a lower frequency than this:
minimum_feature_frequency = 10 

# You can edit which features to use by commenting in or out these list items
# Add a '#' in front of a line (like with, e.g., #"lemmas",), in order to remove a feature
features = [
    #"pos",
    #"parse",
    "ner",
    "lemmas",
    #"tokens",
    #"token+pos",
] 

### End of Settings ###

# Note that feature extraction can take some time!
count_vect = CountVectorizer(min_df=minimum_feature_frequency)
train_X = get_features(train_data["tweet"])
test_X = get_features(test_data["tweet"], train=False)

print("The number of features is:", train_X.shape[1])

We will now select and train/fit a classifier. The classifiers are from scikit-learn, and can be replaced with other options from: 

https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

## Fitting a classifier

In [None]:
# Fit the classifier to the training data
classifier = LogisticRegression().fit(train_X, train_Y)
predictions = classifier.predict(test_X)

Let's see how well we do on data which we *have already observed*:

In [None]:
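# (plot_confusion_matrix was removed in scikit-learn 1.2; newer versions use
#  metrics.ConfusionMatrixDisplay.from_estimator with the same arguments)
metrics.plot_confusion_matrix(classifier, train_X, train_Y,
                              cmap=plt.cm.Blues, labels=["NOT", "OFF"])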

Let's check the full classification report

In [None]:
# Evaluate on test data
classifier_metrics = metrics.classification_report(test_Y, predictions)
print(classifier_metrics)

## Error analysis - opening the "Black Box"

Let's get a general impression of what mistakes the model makes on the test data

In [None]:
# Trying to check a single case

sentence = test_data.iloc[42]
print("\nTest example:")
print("\tID:\t", sentence.iloc[0])    # .iloc selects by position
print("\tText:\t", sentence.iloc[1])
print("\tLabel:\t", sentence.iloc[2])

In [None]:
# See what the model predicts:

prediction = predictions[42]
print("The model predicts:", prediction)

For a broader picture, let's investigate several cases:

In [None]:
# Let's get several error cases:

n_errors = 5
n = 0

print("\n")
for idx, (pred, gold) in enumerate(zip(predictions, test_Y)):
    if pred != gold:
        print(f"Model incorrectly predicts '{pred}' for the sentence: {test_data.iloc[idx]['tweet']}")
        n += 1
    if n >= n_errors:
        break

print("\n")


Some are "obviously" wrong:
* Model incorrectly predicted 'NOT' for the sentence: NED MED SVENSKEN!
* Model incorrectly predicted 'OFF' for the sentence: Ja tak. Og jeg kører selv mc.


Some are perhaps debatable and might be offensive in certain contexts:
* Model incorrectly predicted 'NOT' for the sentence: @USER ryger du hash. ???
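
The two kinds of mistake seen here, predicting 'NOT' for an offensive text and predicting 'OFF' for an inoffensive one, are often worth looking at separately. The following sketch splits errors into those two groups, using short hypothetical lists of labels rather than the real predictions.

```python
# Hypothetical gold labels and predictions, invented for illustration.
toy_gold = ["OFF", "NOT", "NOT", "OFF", "NOT"]
toy_pred = ["NOT", "NOT", "OFF", "OFF", "NOT"]

pairs = list(enumerate(zip(toy_pred, toy_gold)))

# Offensive texts that the model let through as 'NOT'.
missed = [i for i, (p, g) in pairs if g == "OFF" and p == "NOT"]

# Inoffensive texts that the model wrongly flagged as 'OFF'.
false_alarms = [i for i, (p, g) in pairs if g == "NOT" and p == "OFF"]

print("Missed offensive examples at positions:", missed)
print("False alarms at positions:", false_alarms)
```

The positions can then be used with `test_data.iloc[...]` to read the texts behind each type of error.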

## Breakout discussions

- Besides feature engineering and choice of classifier algorithm, what else might improve performance?
- What potential problems are there with the data, which might impact performance?
- Besides their practical use for classification tasks, for what else might you use the NLP techniques outlined here?

# Other use cases, beyond 'solving' a task

* Finding all Named Entities in a dataset

In [None]:
named_entities = []
for sentence in test_data["tweet"]:
    doc = nlp(sentence)
    for ent in doc.ents:
        named_entities.append(ent.text)

for entity in named_entities[:10]:
    print(entity)

Most NEs make sense, but there are some issues.

Let's count and visualise the most frequent Named Entities below:

In [None]:
min_frequency = 3 # The minimum occurrence of a named entity
data = Counter(named_entities)
# Keep only the entities that occur at least min_frequency times
names = [name for name, count in data.items() if count >= min_frequency]
values = [count for count in data.values() if count >= min_frequency]

plt.bar(names, values)
p = plt.xticks(rotation='vertical')
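
If you would like the bars sorted from most to least frequent, `Counter.most_common()` returns the counts in that order. The sketch below shows this on a short, hypothetical list of entity strings; the variable names are invented for the example.

```python
from collections import Counter

# Hypothetical list of entity strings, invented for illustration.
toy_entities = ["Danmark", "Aalborg", "Danmark", "Sverige", "Danmark", "Aalborg"]
toy_counts = Counter(toy_entities)

# most_common() yields (entity, count) pairs, highest count first.
min_count = 2
frequent = [(name, count) for name, count in toy_counts.most_common()
            if count >= min_count]

print(frequent)
```

The names and counts in `frequent` can be passed to `plt.bar` in the same way as `names` and `values` above.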