# Natural Language Processing

In this lab, we will preprocess textual data and build models on it. We will learn how to clean, transform, and classify texts, and how to explain the predictions for particular cases.

#### Imports and definitions

In [None]:
! pip install lime

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns

from matplotlib import pyplot as plt
import wordcloud

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import WordPunctTokenizer, word_tokenize
from sklearn.model_selection import train_test_split 

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics 
from sklearn.model_selection import GridSearchCV


import random
from lime.lime_text import LimeTextExplainer
import urllib.request
import zipfile
%matplotlib inline


In [None]:
def plot_wordcloud(texts: list, title: str = ''):
    wc = wordcloud.WordCloud(background_color="white").generate(' '.join(texts))
    plt.figure()
    plt.imshow(wc)
    plt.axis("off")
    plt.title(title)
    plt.show()


def preprocess_tokens(text):
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.tokenize.regexp_tokenize(text, '[a-zA-Z]{3,}')
    # Lowercase before lemmatizing: the WordNet lookup is case-sensitive
    return [lemmatizer.lemmatize(word.lower()) for word in tokens]


def get_top_terms(tfidf, document, top_n=10):
    print(document[:100])
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    features = tfidf.get_feature_names_out()
    terms_vec = tfidf.transform([document]).toarray()[0]
    return [features[i] for i in np.argsort(terms_vec)[::-1][:top_n]
            if terms_vec[i] > 0]


def display_confusion_matrix(y_test, y_pred, class_names=None):
    confusion_matrix = pd.DataFrame(metrics.confusion_matrix(y_test, y_pred))
    if class_names:
        # Assign the labels first: replacing an index discards its name
        confusion_matrix.columns = class_names
        confusion_matrix.index = class_names
    confusion_matrix.index.name = 'Actual'
    confusion_matrix.columns.name = 'Predicted'
    # fmt='d' prints the counts as plain integers
    sns.heatmap(confusion_matrix, annot=True, fmt='d')

## News category classification

First, we will perform a text classification task on the 20 Newsgroups dataset: http://qwone.com/~jason/20Newsgroups/

In [None]:
news_group_data = fetch_20newsgroups()
class_names = news_group_data.target_names
print(class_names)

In [None]:
news_group_pd = pd.DataFrame(news_group_data.data, columns=['text'])
news_group_pd['category'] = news_group_data.target
news_group_pd['category_name'] = news_group_pd['category'].apply(lambda x: news_group_data.target_names[x])
news_group_pd.head()

### Data exploration

To better visualize the dataset, let us first plot the distribution of texts across classes.

In [None]:
sns.countplot(data=news_group_pd, y="category_name")

Next, we want to explore the content of texts in each of the categories. We will plot the word clouds which show the words that occur most often in each category.

In [None]:
for category in class_names:
  category_news_data = news_group_pd[news_group_pd['category_name']==category]
  plot_wordcloud(texts=category_news_data['text'].to_list(), title=category)

We can see that the texts contain a lot of noise and unnecessary information (such as the email subject line).

### Tokenization and vectorization

Next, we need to prepare our dataset for the modeling task. To use the texts as input to an ML model, we need to encode them as numerical vectors. We tokenize the texts (split them into words) and build a "Bag of Words" representation. Each text is encoded as a vector in which the positions correspond to tokens from the vocabulary and the values represent the number of occurrences of each token in this text.

We will use the `CountVectorizer` class from sklearn https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
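
As a minimal sketch of the Bag of Words encoding, the snippet below vectorizes a two-sentence toy corpus. The sentences and the `toy_*` names are made up for illustration and are not part of the lab data. Each column corresponds to one vocabulary token and each value counts how often that token occurs in the text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate the encoding
toy_corpus = ["the cat sat", "the cat saw the dog"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each token to its column index; sort tokens by that index
toy_vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(toy_counts)      # rows: [1 0 1 0 1] and [1 1 0 1 2]
```

The second row ends in 2 because "the" occurs twice in the second sentence. Word order is lost, which is why the representation is called a "bag".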

In [None]:
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(news_group_pd['text'])
print("Number of words in the vocabulary: ", len(vectorizer.vocabulary_))

We can check which words are in the vocabulary with the `.get_feature_names_out` method (it replaces `.get_feature_names`, which was removed in scikit-learn 1.2).

In [None]:
vectorizer.get_feature_names_out()[:20]

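Let us display the distribution of document frequencies (the number of documents in which each word occurs).

In [None]:
# Count, for each word, the documents in which it appears at least once
pd.Series(np.asarray((vectors > 0).sum(axis=0)).ravel()).describe()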

### Text cleaning

We can see that there are many tokens that occur in very few documents (possibly noise) and some that occur in most of the texts.



#### Stopwords removal
Some words are very common in the language but do not carry much information. Such words are called "stopwords", and we can list them for the English language (based on the `nltk` library).

In [None]:
set(stopwords.words('english'))
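
The effect of stopword removal can be sketched on a single made-up sentence. Note that passing `stop_words='english'` to `CountVectorizer` uses scikit-learn's own built-in English list, which is not identical to the `nltk` list printed above. The sentence and variable names below are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example sentence
sentence = ["This is the best film of the year"]

# Vocabulary with and without scikit-learn's built-in English stopword list
with_stopwords = set(CountVectorizer().fit(sentence).vocabulary_)
without_stopwords = set(CountVectorizer(stop_words='english').fit(sentence).vocabulary_)

print("All tokens:       ", sorted(with_stopwords))
print("Stopwords removed:", sorted(without_stopwords))
print("Dropped tokens:   ", sorted(with_stopwords - without_stopwords))
```

Function words such as "the" disappear, while content words such as "film" are kept.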

To reduce the noise in the data, we will remove the stopwords and restrict the tokens to alphabetic strings of at least 3 letters. Moreover, we will remove the words that occur in fewer than 10 documents or in more than 90% of the documents (the latter are likely stopwords).

In [None]:
vectorizer = CountVectorizer(stop_words='english', token_pattern='[a-zA-Z]{3,}',
                             max_df=0.9, min_df=10)
vectorizer.fit(news_group_pd['text'])
print(vectorizer.get_feature_names_out()[:20])
print("Number of words in the vocabulary: ", len(vectorizer.vocabulary_))
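
The following plain-Python sketch illustrates what document-frequency filtering does, under the assumption that an integer `min_df` is an absolute document count and a float `max_df` is a proportion of documents, as in `CountVectorizer`. The four tiny documents and the threshold values are invented for the example.

```python
from collections import Counter

# Hypothetical, already tokenized documents
docs = [
    ["good", "film"],
    ["bad", "film"],
    ["good", "plot", "film"],
    ["film", "score"],
]

# Document frequency: number of documents containing each token at least once
document_frequency = Counter(token for doc in docs for token in set(doc))

min_df = 2    # keep tokens found in at least 2 documents
max_df = 0.9  # drop tokens found in more than 90% of the documents

kept = sorted(
    token for token, df in document_frequency.items()
    if df >= min_df and df / len(docs) <= max_df
)
print(kept)  # ['good']
```

Here "film" is dropped because it occurs in every document, and "bad", "plot" and "score" are dropped because each occurs in only one.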

#### Base form


We can also observe that some words in the vocabulary appear in several grammatical forms. We can reduce the number of tokens by converting them to a base form, using either stemming or lemmatization.

Stemming cuts the ending of a word according to the language rules (fast but less accurate).

Lemmatization finds the base form in a dictionary (more accurate but slower and requires external resources). 

We can see the difference between these approaches in the example below:

In [None]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
word = 'studies'
stem = stemmer.stem(word)
lemma = lemmatizer.lemmatize(word)
print("Word: {}, stem: {}, lemma: {}".format(word, stem, lemma))

We will apply all the preprocessing operations with a single helper function, `preprocess_tokens`.

In [None]:
vectorizer = CountVectorizer(stop_words='english', tokenizer=preprocess_tokens,
                             max_df=0.9, min_df=10)
vectorizer.fit(news_group_pd['text'])
print(vectorizer.get_feature_names_out()[:20])
print("Number of words in the vocabulary: ", len(vectorizer.vocabulary_))

### Text classification

Next, we will perform classification of the prepared texts. 



#### Text classification pipeline

We will use a random forest classifier. The vectorizer and the random forest will be combined as stages in a pipeline. This means that the output of one step will be used as the input to the next one: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline

In [None]:
text_classification_pipeline = make_pipeline(vectorizer, RandomForestClassifier(max_depth=20))
print(text_classification_pipeline.named_steps)
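
`make_pipeline` names each step after its lowercased class name, and a parameter of a step is addressed as `<step name>__<parameter>`. The short sketch below shows this naming on a separate demo pipeline (the `demo_pipeline` name and the small `n_estimators` value are assumptions chosen for the illustration).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Demo pipeline, independent of the one used in the lab
demo_pipeline = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=10, random_state=0),
)

# Step names are derived from the class names
step_names = list(demo_pipeline.named_steps)
print(step_names)  # ['countvectorizer', 'randomforestclassifier']

# Parameters of a step are reachable through the '<step>__<parameter>' syntax
demo_pipeline.set_params(countvectorizer__max_df=0.8)
updated_max_df = demo_pipeline.named_steps['countvectorizer'].max_df
print(updated_max_df)  # 0.8
```

The same double-underscore keys are what a hyperparameter search over a pipeline expects in its parameter grid.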

We will split the data into train and test sets and run a grid search over the hyperparameters of both steps in the pipeline. `GridSearchCV` evaluates every combination in the parameter grid with cross-validation; for larger grids, `RandomizedSearchCV` can be used instead to sample only a subset of the possible combinations.

In [None]:
train, test = train_test_split(news_group_pd, test_size=0.2)

In [None]:
param_grid = {
    'countvectorizer__max_df': [0.5, 0.8, 1.0],
    'randomforestclassifier__min_samples_leaf': [1, 10], 
}
search = GridSearchCV(text_classification_pipeline, param_grid, n_jobs=-1)
search.fit(train['text'], train['category_name'])
print(search.best_params_)
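
To get a feeling for the cost of the search, the combinations in the grid can be enumerated with `ParameterGrid`. The sketch below repeats the grid under the name `demo_grid` and assumes 5 cross-validation folds, which is the default of `GridSearchCV` in recent scikit-learn versions.

```python
from sklearn.model_selection import ParameterGrid

# Same values as in the parameter grid above, repeated to keep the sketch self-contained
demo_grid = {
    'countvectorizer__max_df': [0.5, 0.8, 1.0],
    'randomforestclassifier__min_samples_leaf': [1, 10],
}

combinations = list(ParameterGrid(demo_grid))
n_folds = 5  # assumption: default number of folds in GridSearchCV

for params in combinations:
    print(params)

print("Combinations:", len(combinations))               # 6
print("Model fits:  ", len(combinations) * n_folds)      # 30, plus one final refit
```

Every additional value in the grid multiplies the number of fits, which is the motivation for sampling-based alternatives on larger grids.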

In [None]:
y_pred = search.predict(test['text'])
print("Accuracy: ", metrics.accuracy_score(test['category_name'], y_pred))
display_confusion_matrix(test['category_name'], y_pred, class_names)
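
As a small worked example of the two metrics, consider five invented predictions for a two-class problem (the labels below are illustrative and unrelated to the lab results). Rows of the confusion matrix are actual classes and columns are predicted classes, both in sorted label order.

```python
from sklearn import metrics

# Hypothetical labels for five examples
y_true_demo = ['neg', 'neg', 'pos', 'pos', 'pos']
y_pred_demo = ['neg', 'pos', 'pos', 'pos', 'neg']

demo_matrix = metrics.confusion_matrix(y_true_demo, y_pred_demo)
demo_accuracy = metrics.accuracy_score(y_true_demo, y_pred_demo)

print(demo_matrix)
# [[1 1]    actual 'neg': 1 predicted 'neg', 1 predicted 'pos'
#  [1 2]]   actual 'pos': 1 predicted 'neg', 2 predicted 'pos'
print(demo_accuracy)  # 0.6
```

Accuracy is the sum of the diagonal (3 correct predictions) divided by the total number of examples (5).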

### Model explanation

As the random forest classifier is difficult to interpret, we will use a separate method to explain the model's decisions. LIME builds a local approximation of a complex model to explain why an instance was assigned to a given category: https://github.com/marcotcr/lime

You can read more about LIME and other ML explanation methods: https://christophm.github.io/interpretable-ml-book/lime.html


We will use the text explainer to highlight the words with highest impact on the classification.

In [None]:
explainer = LimeTextExplainer(class_names=class_names)
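
The idea of a local approximation can be sketched in a few lines. The code below is a deliberately simplified illustration and not the implementation used by the `lime` library: it randomly removes words from one text, queries a black-box model on the perturbed texts, and fits a weighted linear model whose coefficients serve as word importances. The `toy_black_box` classifier, the similarity weighting, and all names are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge


def toy_black_box(texts):
    """Hypothetical classifier: P(positive) is high when 'great' is present."""
    return np.array([0.9 if 'great' in t.split() else 0.2 for t in texts])


def explain_locally(text, predict_fn, n_samples=500, seed=0):
    rng = np.random.default_rng(seed)
    words = text.split()
    # Binary masks: 1 keeps a word, 0 removes it; the first row is the original text
    masks = rng.integers(0, 2, size=(n_samples, len(words)))
    masks[0] = 1
    perturbed = [' '.join(w for w, keep in zip(words, mask) if keep) for mask in masks]
    predictions = predict_fn(perturbed)
    # Simplified similarity weight: fraction of the original words that were kept
    weights = masks.mean(axis=1)
    # Interpretable surrogate: a linear model fitted on the word-presence masks
    surrogate = Ridge(alpha=1.0).fit(masks, predictions, sample_weight=weights)
    return sorted(zip(words, surrogate.coef_), key=lambda pair: abs(pair[1]), reverse=True)


explanation = explain_locally("a great movie with a dull ending", toy_black_box)
for word, weight in explanation[:3]:
    print(f"{word}: {weight:+.3f}")
```

Since the toy model reacts only to the word "great", that word receives by far the largest coefficient, while the other words stay close to zero.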

First, we will display explanations for correctly classified examples:

In [None]:
correct_classes = test[y_pred == test['category_name']]
correct_examples = correct_classes.sample(3)

for i, example in correct_examples.iterrows():
    exp = explainer.explain_instance(example["text"],
                                     search.predict_proba,
                                     top_labels=2)
    exp.show_in_notebook(text=True)

Next, we will display explanations for examples that were incorrectly assigned to a class.

In [None]:
incorrect_classes = test[y_pred != test['category_name']]
incorrect_examples = incorrect_classes.sample(3)

for i, example in incorrect_examples.iterrows():
    print("Correct class: ", example["category_name"])
    exp = explainer.explain_instance(example["text"],
                                     search.predict_proba,
                                     top_labels=2)
    exp.show_in_notebook(text=True)

## Text sentiment classification

Next, we will apply the same techniques to build a classifier for text sentiment (positive or negative). We will use the labelled dataset from https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

In [None]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip"
urllib.request.urlretrieve(data_url, 'sentiment%20labelled%20sentences.zip')
data_file = zipfile.ZipFile('sentiment%20labelled%20sentences.zip')
movie_reviews = pd.read_csv(data_file.open('sentiment labelled sentences/imdb_labelled.txt'),
                            delimiter="\t", header=None,
                            quoting=3)  # csv.QUOTE_NONE: keep quote characters as ordinary text
movie_reviews.columns=["text", "sentiment"]
class_names = ["negative", "positive"]
movie_reviews.head()

#### Display the word clouds for each category

In [None]:
??

### Fit the vectorizer 
Compare the number of tokens in the vocabulary without any preprocessing and with preprocessing (use `max_df=0.9`, `min_df=10` and the `preprocess_tokens` function).

In [None]:
??

#### Split the dataset into train and test sets

In [None]:
??

#### Create the classification pipeline consisting of vectorizer and Random Forest Classifier steps

In [None]:
??

#### Configure the Grid Search 
Use the same pipeline and parameters as for the news group classification.

In [None]:
??

Display the accuracy and confusion matrix.

In [None]:
??

### Explain the classification for 3 correctly and 3 incorrectly classified examples.

In [None]:
??

In [None]:
??