# Natural Language Processing for Data Analytics
[The Analytics Store](http://www.theanalyticsstore.com)

# Predictive Modelling With Vector Representations

## Lots of Imports

To build predictive models in Python we use a set of libraries that are imported here. **pandas** and **sklearn** are particularly important, and **nltk** is used for data preparation.

In [None]:
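import os

import nltk
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import re

import pandas as pd
import numpy as np

# Import every scikit-learn submodule used below explicitly;
# `import sklearn` on its own does not guarantee they are loaded
import sklearn
import sklearn.feature_extraction.text
import sklearn.model_selection
import sklearn.metrics
import sklearn.naive_bayes
import sklearn.ensemble
import sklearn.tree
import sklearn.linear_model
import sklearn.neighbors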

Occasionally we need to install bits and pieces (e.g. corpora) from the NLTK. To do this, uncomment and run the code below, which will launch the interactive NLTK downloader.

In [None]:
# Uncomment this in order to launch the NLTK downloader to access corpora, packages etc
#nltk.download()

## Load The Text Data

Load an nltk corpus, pre-process it and then use a scikit-learn vectoriser to convert it to a format appropriate for ML.

### Prepare nltk Corpus

Create a corpus object from a collection of text files

In [None]:
news_grps_20_corpus = nltk.corpus.CategorizedPlaintextCorpusReader(root = './data/20news-bydate/20news-bydate-train/', \
                                                                   fileids = r'.+/\d+', \
                                                                   cat_pattern=r'((\w|\.)+)/*', \
                                                                   encoding = 'latin1')
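To see how the two regular expressions above pick out files and categories, the patterns can be tried on a few paths with plain `re`. This is only a sketch: the example paths are made up, and it assumes the reader matches `fileids` against the whole relative path and takes the first capture group of `cat_pattern` as the category.

```python
import re

fileid_pattern = r'.+/\d+'
cat_pattern = r'((\w|\.)+)/*'

# Hypothetical relative paths, laid out like the 20 Newsgroups folders
candidate_paths = ['alt.atheism/49960', 'comp.graphics/37261', 'README.txt']

# Keep only paths of the form <folder>/<digits>
matched_paths = [p for p in candidate_paths if re.fullmatch(fileid_pattern, p)]

# The first capture group of cat_pattern is the folder name, i.e. the category
path_categories = {p: re.match(cat_pattern, p).group(1) for p in matched_paths}

print(matched_paths)
print(path_categories)
```

The file without a folder and numeric name is skipped, and each remaining path is labelled with its folder name.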

Define a pre-processing function that performs text cleaning and normalisation operations

In [None]:
def preprocess(words, to_lowercase = True, remove_punctuation = True, remove_digits = True, remove_odd_chars = True, remove_stopwords=True, stem = True):
    if to_lowercase:
        words = [w.lower() for w in words]
    
    if remove_punctuation:
        words = [w for w in words if re.match(r'^\W+$', w) is None]
    
    if remove_digits:
        words = [w for w in words if not w.replace('.','',1).isdigit()]

    if remove_odd_chars:
        words = [re.sub(r'[^a-zA-Z0-9_]','_', w) for w in words]
    
    if remove_stopwords:
        sw = set(nltk.corpus.stopwords.words("english"))
        words = [w for w in words if w not in sw]

    if stem:
        porter = nltk.PorterStemmer()
        words = [porter.stem(w) for w in words]
    
    return words
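The regex-based filters are easiest to understand on a small token list. The sketch below repeats just those steps outside the function (the token list is invented for illustration), so it runs without the NLTK stopword corpus or the stemmer.

```python
import re

tokens = ['Hello', ',', 'world', '!!', '3.14', '2019', 'e-mail', "don't"]

# Lower-case everything
tokens = [t.lower() for t in tokens]

# Drop tokens made up entirely of non-word characters (pure punctuation)
no_punct = [t for t in tokens if re.match(r'^\W+$', t) is None]

# Drop integers and simple decimals: remove at most one '.' and test for digits
no_digits = [t for t in no_punct if not t.replace('.', '', 1).isdigit()]

# Replace any remaining character outside [a-zA-Z0-9_] with an underscore
cleaned = [re.sub(r'[^a-zA-Z0-9_]', '_', t) for t in no_digits]

print(cleaned)
```

Note that tokens containing punctuation inside a word are kept, with the offending characters turned into underscores.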

Iterate across the corpus applying pre-processing and storing text data in a list with file id and category meta-data

In [None]:
documents = [((fileid, category),
              preprocess(news_grps_20_corpus.words(fileid), to_lowercase = True, remove_punctuation = True,
                         remove_digits = True, remove_odd_chars = True, remove_stopwords=True, stem = False))
#             for category in news_grps_20_corpus.categories()
             for category in ['alt.atheism', 'comp.graphics']
             for fileid in news_grps_20_corpus.fileids(category)]

### Generate Vector Representation

Define a dummy function to stop the scikit-learn vectorisers performing tokenisation and other pre-processing

In [None]:
def dummy_fun(doc):
    return doc

Create a CountVectorizer object ready to perform the transformation - use the dummy function

In [None]:
bow_gen = sklearn.feature_extraction.text.CountVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None,
    ngram_range=(1, 2),
    min_df = 100,
    max_df = 0.9) 

Extract the texts into a bag of words representation and add fileid and category meta data

In [None]:
bow = bow_gen.fit_transform([doc[1] for doc in documents])
fileids = [doc[0][0] for doc in documents]
cats = [doc[0][1] for doc in documents]

Build a nice pandas data frame containing all data

In [None]:
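# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available from 1.0)
bow_df = pd.DataFrame(bow.toarray(), columns = bow_gen.get_feature_names_out())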
bow_df['fileids'] = fileids
bow_df['target'] = cats
bow_df

Do the same preparation but with a tf-idf vector representation

In [None]:
tfidf_gen = sklearn.feature_extraction.text.TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None) 
tfidf = tfidf_gen.fit_transform([doc[1] for doc in documents])
fileids = [doc[0][0] for doc in documents]
cats = [doc[0][1] for doc in documents]
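# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available from 1.0)
tfidf_df = pd.DataFrame(tfidf.toarray(), columns = tfidf_gen.get_feature_names_out())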
tfidf_df['fileids'] = fileids
tfidf_df['target'] = cats
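To make the tf-idf weights less of a black box, here is a small hand computation in plain Python. It is a sketch that assumes the vectoriser is left at its defaults (raw term counts, smoothed idf, L2 normalisation), and the two documents are made up.

```python
import math
from collections import Counter

# Two made-up, pre-tokenised documents
docs = [['apple', 'apple', 'pie'], ['apple', 'tart']]
vocab = sorted({t for d in docs for t in d})
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = {t: sum(t in d for d in docs) for t in vocab}

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(doc):
    tf = Counter(doc)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))   # L2 normalisation
    return [v / norm for v in raw]

for d in docs:
    print([round(v, 3) for v in tfidf_vector(d)])
```

A term that occurs in every document, such as `apple` here, gets the minimum idf of 1, while rarer terms are weighted up.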

Select either tf-idf or bow representation

In [None]:
dataset = bow_df

## Partition Data

Examine the distribution of the classification targets

In [None]:
dataset["target"].value_counts()

Extract features and target

In [None]:
X = dataset[dataset.columns.difference(['fileids', 'target'])]
y = dataset['target']

Split the data into a **training set**, a **validation set**, and a **test set**

In [None]:
X_train_plus_valid, X_test, y_train_plus_valid, y_test \
    = sklearn.model_selection.train_test_split(X, y, random_state=0, \
                                    train_size = 0.7)

X_train, X_valid, y_train, y_valid \
    = sklearn.model_selection.train_test_split(X_train_plus_valid, \
                                        y_train_plus_valid, \
                                        random_state=0, \
                                        train_size = 0.5/0.7)
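The second `train_size` of `0.5/0.7` is chosen so that the final partitions are roughly 50% training, 20% validation and 30% test. The check below repeats the two-stage split on placeholder data (the `X_demo` and `y_demo` arrays are invented) and prints the resulting fractions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 rows, one feature, two balanced classes
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([0, 1] * 500)

X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, random_state=0, train_size=0.7)

# 0.5 / 0.7 of the remaining 70% is about 50% of the full dataset
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, random_state=0, train_size=0.5 / 0.7)

n = len(X_demo)
fractions = {'train': len(X_tr) / n, 'valid': len(X_va) / n, 'test': len(X_te) / n}
print(fractions)
```

Because the sizes are rounded to whole rows, the fractions can be off by a row or so. Passing `stratify=y` to `train_test_split` would additionally keep the class proportions the same in every partition; the cells above do not do this.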

### A Simple Model

Train a Naive Bayes classification model

In [None]:
my_model = sklearn.naive_bayes.MultinomialNB()
my_model.fit(X_train,y_train)

### Evaluating Model Performance

Assess the performance of the Naive Bayes model on the training set

In [None]:
# Make a set of predictions for the training data
y_pred = my_model.predict(X_train)

# Print performance details
accuracy = sklearn.metrics.accuracy_score(y_train, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(sklearn.metrics.classification_report(y_train, y_pred))

# Print nicer homemade confusion matrix
print("Confusion Matrix")
pd.crosstab(y_train, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Assess the performance of the Naive Bayes model on the validation dataset

In [None]:
# Make a set of predictions for the validation data
y_pred = my_model.predict(X_valid)

# Print performance details
accuracy = sklearn.metrics.accuracy_score(y_valid, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print nicer confusion matrix
print("Confusion Matrix")
pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
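The numbers in the reports above can all be read off the confusion matrix. The toy example below uses six made-up labels to show how accuracy and per-class recall follow from the `pd.crosstab` table.

```python
import pandas as pd

# Made-up true and predicted labels for six documents
y_true = pd.Series(['alt.atheism', 'alt.atheism', 'alt.atheism',
                    'comp.graphics', 'comp.graphics', 'comp.graphics'])
y_hat = pd.Series(['alt.atheism', 'alt.atheism', 'comp.graphics',
                   'comp.graphics', 'comp.graphics', 'comp.graphics'])

cm = pd.crosstab(y_true, y_hat, rownames=['True'], colnames=['Predicted'])
print(cm)

# Correct predictions sit on the diagonal
correct = sum(cm.loc[label, label] for label in cm.index)
demo_accuracy = correct / cm.values.sum()

# Recall for one class: correct predictions divided by the row total
recall_atheism = cm.loc['alt.atheism', 'alt.atheism'] / cm.loc['alt.atheism'].sum()
print(demo_accuracy, recall_atheism)
```

`margins=True` is left out here so that the row and column totals do not get mixed into the sums.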

## Choosing Parameters Using a Grid Search

Use cross validation to perform an evaluation

In [None]:
my_model = sklearn.naive_bayes.MultinomialNB()
scores = sklearn.model_selection.cross_val_score(my_model, X_train_plus_valid, y_train_plus_valid, cv=5)
print(scores)
print(np.mean(scores), "+/-", np.std(scores))
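When `cv` is an integer and the estimator is a classifier, scikit-learn splits the data with stratified folds, so each fold keeps roughly the same class balance. The sketch below makes that visible on made-up labels (`X_demo` and `y_demo` are placeholders).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 20 documents of one class, 10 of the other
y_demo = np.array(['alt.atheism'] * 20 + ['comp.graphics'] * 10)
X_demo = np.zeros((len(y_demo), 1))   # features are irrelevant for splitting

fold_sizes = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    # Count how many documents of each class land in this held-out fold
    labels, counts = np.unique(y_demo[test_idx], return_counts=True)
    fold_sizes.append(dict(zip(labels.tolist(), counts.tolist())))

print(fold_sizes)
```

Each of the five held-out folds keeps the 2:1 class ratio of the full label set.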

We can use a grid search to try out a set of possible parameter values. Here we try values of the smoothing parameter `alpha` between 0.1 and 1.0, each with and without fitted class priors (`fit_prior`).

In [None]:
# Set up the parameter grid to search
param_grid ={'fit_prior': [True, False], \
             'alpha': list(np.arange(0.1, 1.1, 0.1))}

# Perform the search
my_tuned_model = sklearn.model_selection.GridSearchCV(sklearn.naive_bayes.MultinomialNB(), \
                                param_grid, cv=2, verbose = 2)
my_tuned_model.fit(X_train_plus_valid, y_train_plus_valid)

# Print details
print("Best parameters set found on development set:")
display(my_tuned_model.best_params_)
display(my_tuned_model.best_score_)
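Beyond the single best setting, it is often useful to look at how every candidate scored. One way to do that is to load `cv_results_` into a data frame. The sketch below uses synthetic count data and a smaller grid purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic count data standing in for the bag-of-words matrix
rng = np.random.default_rng(0)
X_demo = rng.poisson(lam=1.0, size=(60, 8))
y_demo = np.array([0, 1] * 30)

search = GridSearchCV(MultinomialNB(),
                      {'fit_prior': [True, False], 'alpha': [0.1, 0.5, 1.0]},
                      cv=2)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
summary = results[['param_alpha', 'param_fit_prior',
                   'mean_test_score', 'rank_test_score']].sort_values('rank_test_score')
print(summary)
```

The same table can be built from `my_tuned_model.cv_results_` after the search above has been fitted.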

### Final Evaluation on Test Set

Evaluate the tuned model on the held-out test set

In [None]:
# Make a set of predictions for the test data
y_pred = my_tuned_model.predict(X_test)

# Print performance details
print(sklearn.metrics.classification_report(y_test, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

### Other Models

We can easily use the same patterns to train other types of models.

#### Random Forests

In [None]:
# Do the same job with random forests
my_model = sklearn.ensemble.RandomForestClassifier(n_estimators=300, \
                                           max_features = 3,\
                                           min_samples_split=200)
my_model.fit(X_train,y_train)

Assess the performance of the model on the **validation set**

In [None]:
# Make a set of predictions for the validation data
y_pred = my_model.predict(X_valid)

# Print performance details
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Choose parameters using a grid search

In [None]:
# Set up the parameter grid to search
param_grid = [
 {'n_estimators': list(range(100, 501, 50)), 'max_features': list(range(1, 10, 2)), 'min_samples_split': list(range(20, 200, 50)) }
]

# Perform the search
my_tuned_model = sklearn.model_selection.GridSearchCV(sklearn.ensemble.RandomForestClassifier(), param_grid, cv=5)
my_tuned_model.fit(X_train_plus_valid, y_train_plus_valid)

# Print details
print("Best parameters set found on development set:")
print(my_tuned_model.best_params_)
print(my_tuned_model.best_score_)
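Grid searches grow quickly, so it is worth counting the work before starting one. The short calculation below enumerates the same random forest grid with `itertools.product`; the names `rf_grid` and `cv_folds` are introduced only for this illustration.

```python
from itertools import product

rf_grid = {
    'n_estimators': list(range(100, 501, 50)),
    'max_features': list(range(1, 10, 2)),
    'min_samples_split': list(range(20, 200, 50)),
}
cv_folds = 5

# Every combination of values is one candidate model
candidates = list(product(*rf_grid.values()))
n_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", n_fits, "fits")
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on all of the data it was given.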

In [None]:
# Make a set of predictions for the test data
y_pred = my_tuned_model.predict(X_test)

# Print performance details
print(sklearn.metrics.classification_report(y_test, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

#### Bagging

In [None]:
# Do the same job with bagging (the `estimator` argument was called `base_estimator` before scikit-learn 1.2)
my_model = sklearn.ensemble.BaggingClassifier(estimator = sklearn.tree.DecisionTreeClassifier(criterion="entropy", min_samples_leaf = 50), \
                                      n_estimators=10)
my_model.fit(X_train,y_train)

In [None]:
# Make a set of predictions for the validation data
y_pred = my_model.predict(X_valid)

# Print performance details
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

In [None]:
# Set up the parameter grid to search
param_grid = [
 {'n_estimators': list(range(5, 25, 1))}
]

# Perform the search
my_tuned_model = sklearn.model_selection.GridSearchCV(sklearn.ensemble.BaggingClassifier(), param_grid, cv=5)
my_tuned_model.fit(X_train_plus_valid, y_train_plus_valid)

# Print details
print("Best parameters set found on development set:")
print(my_tuned_model.best_params_)
print(my_tuned_model.best_score_)

In [None]:
# Make a set of predictions for the test data
y_pred = my_tuned_model.predict(X_test)

# Print performance details
print(sklearn.metrics.classification_report(y_test, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

#### AdaBoost

In [None]:
# Do the same job with AdaBoost (the `estimator` argument was called `base_estimator` before scikit-learn 1.2)
my_model = sklearn.ensemble.AdaBoostClassifier(estimator = sklearn.tree.DecisionTreeClassifier(criterion="entropy", min_samples_leaf = 50), \
                                       n_estimators=10)
my_model.fit(X_train,y_train)

In [None]:
# Make a set of predictions for the validation data
y_pred = my_model.predict(X_valid)

# Print performance details
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

#### Logistic Regression

In [None]:
# Do the same job with logistic regression
my_model = sklearn.linear_model.LogisticRegression()
my_model.fit(X_train,y_train)

Assess the performance of the model on the **validation set**

In [None]:
# Make a set of predictions for the validation data
y_pred = my_model.predict(X_valid)

# Print performance details
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

#### Nearest Neighbour

In [None]:
# Do the same job with a k-nearest-neighbour classifier
my_model = sklearn.neighbors.KNeighborsClassifier()
my_model = my_model.fit(X_train,y_train)

Assess the performance of the nearest neighbour model on the **validation set**

In [None]:
# Make a set of predictions for the validation data
y_pred = my_model.predict(X_valid)

# Print performance details
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Choose parameters using a grid search

In [None]:
# Set up the parameter grid to search
param_grid = [
 {'n_neighbors':[3,5,15, 25]}
]

# Perform the search
my_tuned_model = sklearn.model_selection.GridSearchCV(sklearn.neighbors.KNeighborsClassifier(), param_grid, cv=5)
my_tuned_model.fit(X_train_plus_valid, y_train_plus_valid)

# Print details
print("Best parameters set found on development set:")
print(my_tuned_model.best_params_)
print(my_tuned_model.best_score_)

In [None]:
# Make a set of predictions for the test data
y_pred = my_tuned_model.predict(X_test)

# Print performance details
print(sklearn.metrics.classification_report(y_test, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)