In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split as split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Loading the data

In [2]:
reviews = pd.read_csv("/Users/isaaclambert/Desktop/SST-2/train.tsv", delimiter='\t')
dev = pd.read_csv("/Users/isaaclambert/Desktop/SST-2/dev.tsv", delimiter='\t')

# Splitting the data
The data is split pseudo-randomly into training and test sets using scikit-learn's train_test_split function (imported here as split).

In [3]:
x = reviews["sentence"]
y = reviews["label"]

# Randomly split the data into training and test sets
train_x, test_x, train_y, test_y = split(x, y, test_size=0.25, random_state=21, shuffle=True)
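
As a quick sanity check, a split like the one above can be inspected for reproducibility and class balance. The sketch below is only illustrative: `toy_x` and `toy_y` are made-up stand-ins rather than the SST-2 data, and it additionally passes `stratify`, which the notebook itself does not use, as one option for keeping the class proportions the same in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 sentences, 12 positive (1) and 8 negative (0)
toy_x = pd.Series([f"sentence {n}" for n in range(20)])
toy_y = pd.Series([1] * 12 + [0] * 8)

def make_split(seed):
    """Split the toy data 75/25, keeping class proportions equal in both parts."""
    return train_test_split(
        toy_x, toy_y, test_size=0.25, random_state=seed, shuffle=True, stratify=toy_y
    )

train_a, test_a, train_ya, test_ya = make_split(21)
train_b, test_b, _, _ = make_split(21)

# The same random_state should select exactly the same rows
same_split = list(test_a.index) == list(test_b.index)
print("Reproducible:", same_split)
print("Positive share (train):", train_ya.mean())
print("Positive share (test):", test_ya.mean())
```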

# Text Vectorization
Several TF-IDF vectorizer configurations are tested; the results and their analysis follow below. The variables in question are: the removal of stop words; the minimum document frequency (mdf); and the use and range of N-grams.
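
To make the three variables concrete, here is a minimal sketch of how each one changes the vocabulary that TfidfVectorizer builds. The corpus `toy_corpus` and the helper `vocabulary_size` are hypothetical and exist only for illustration; the real experiment on the SST-2 sentences is in the next cell.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the three settings
toy_corpus = [
    "the film is good",
    "the film is not good",
    "a truly bad film",
]

def vocabulary_size(**kwargs):
    """Fit a TfidfVectorizer with the given settings and return its vocabulary size."""
    return len(TfidfVectorizer(**kwargs).fit(toy_corpus).vocabulary_)

baseline = vocabulary_size()                         # default settings
no_stop_words = vocabulary_size(stop_words="english")  # drops words such as "the"
min_df_2 = vocabulary_size(min_df=2)                 # keeps terms found in at least 2 documents
with_bigrams = vocabulary_size(ngram_range=(1, 2))   # adds two-word sequences

print(f"baseline: {baseline}")
print(f"no stop words: {no_stop_words}")
print(f"min_df=2: {min_df_2}")
print(f"uni- and bi-grams: {with_bigrams}")
```

Removing stop words and raising the minimum document frequency both shrink the vocabulary, while widening the N-gram range enlarges it, which mirrors the pattern in the feature counts reported below.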


In [4]:
# Create dictionary with various text vectorizers + their descriptions as keys
vectorizers = {'TF-IDF': TfidfVectorizer(), 
               'TF-IDF + no stop words': TfidfVectorizer(stop_words="english"),
               'TF-IDF + no stop words + mdf (5)': TfidfVectorizer(stop_words="english", min_df=5),
               'TF-IDF + no stop words + mdf (5) + N-grams (1-2)': TfidfVectorizer(stop_words="english", min_df=5, ngram_range=(1,2)),
               'TF-IDF + no stop words + mdf (2) + N-grams (1-2)': TfidfVectorizer(stop_words="english", min_df=2, ngram_range=(1,2)),
               'TF-IDF + no stop words + mdf (3) + N-grams (1-2)': TfidfVectorizer(stop_words="english", min_df=3, ngram_range=(1,2)),
               'TF-IDF + no stop words + mdf (2) + N-grams (1-3)': TfidfVectorizer(stop_words="english", min_df=2, ngram_range=(1,3)),
               'TF-IDF + mdf (2) + N-grams (1-2)': TfidfVectorizer(min_df=2, ngram_range=(1,2))
              }

# Create 2nd dictionary to display results  
dic_to_df = {'Vectorizer':[], 'Number of features':[], 'Training accuracy':[], 'Test accuracy':[]}

# Iterate through the 1st dictionary, training a Multinomial Naive Bayes model on each type of vectorization
for name, vec in vectorizers.items():
    
    # Create the vocabulary based on the training data only
    vec.fit(train_x)
    
    # Based on the vocabulary, encode the training and test sentences
    train_v = vec.transform(train_x)
    test_v = vec.transform(test_x)
    
    # Train the model
    model = MultinomialNB(alpha=0.5).fit(X=train_v, y=train_y)
    
    # Append results to 2nd dictionary 
    dic_to_df['Vectorizer'].append(name)
    # len(vocabulary_) works across scikit-learn versions; get_feature_names() was removed in 1.2
    dic_to_df['Number of features'].append(len(vec.vocabulary_))
    dic_to_df['Training accuracy'].append("{:.2%}".format(model.score(train_v, train_y)))
    dic_to_df['Test accuracy'].append("{:.2%}".format(model.score(test_v, test_y)))

In [5]:
text_vectorizer_results = pd.DataFrame.from_dict(dic_to_df)
display(text_vectorizer_results)

| | Vectorizer | Number of features | Training accuracy | Test accuracy |
|---|---|---|---|---|
| 0 | TF-IDF | 13587 | 91.29% | 88.24% |
| 1 | TF-IDF + no stop words | 13305 | 90.37% | 87.45% |
| 2 | TF-IDF + no stop words + mdf (5) | 8758 | 88.46% | 85.44% |
| 3 | TF-IDF + no stop words + mdf (5) + N-grams (1-2) | 23844 | 90.32% | 86.52% |
| 4 | TF-IDF + no stop words + mdf (2) + N-grams (1-2) | 51521 | 93.65% | 89.26% |
| 5 | TF-IDF + no stop words + mdf (3) + N-grams (1-2) | 41558 | 92.76% | 88.52% |
| 6 | TF-IDF + no stop words + mdf (2) + N-grams (1-3) | 85157 | 94.08% | 89.46% |
| 7 | TF-IDF + mdf (2) + N-grams (1-2) | 68412 | 94.62% | 89.96% |


## Analysis of vectorization alternatives
The first 4 rows (0 to 3) are generic configurations which are tested with a Multinomial Naive Bayes model. Together they form an ablation study, which allows us to identify the impact of each addition. 
* From row 0 to 1: removing stop words reduces the number of features only marginally, yet it also lowers the test accuracy by about 0.8 percentage points. 
* From row 1 to 2: introducing an mdf of 5 cuts the number of features by roughly a third, but it also lowers the test accuracy by just over 2 percentage points.
* From row 2 to 3: adding bi-grams alongside uni-grams increases the number of features significantly (from 8,758 to 23,844) and raises the test accuracy by just over 1 percentage point. 
* Since the mdf of 5 caused a marked drop in test accuracy, in row 4 it is lowered to 2. This leads to an increase of nearly 3 percentage points in test accuracy.
* In row 5, the mdf is raised to 3 to see whether the gain in test accuracy can be maintained while reducing the number of features. Since the test accuracy drops again (by about 0.7 percentage points), I decided to keep the mdf at 2 in the subsequent rows. I did not test an mdf of 1, as I thought this would leave too many rare and sentiment-unrelated words among the features. 
* In row 6, tri-grams are introduced. Compared with row 4, this raises the test accuracy by only 0.2 percentage points while increasing the number of features by about 65% (from 51,521 to 85,157). As such, I decided to keep a range of uni- to bi-grams.
* Finally, in row 7, the stop words which were removed are reintroduced. This increases the number of features significantly compared with row 4, although less than the use of tri-grams did, and it yields a higher test accuracy than tri-grams, indeed the highest of all tested configurations. I therefore used the configuration in row 7 to vectorize the data.
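
The row-to-row comparisons above can be recomputed directly from the results table. The following sketch copies the reported figures into two lists and prints, for each pair of rows discussed, the change in test accuracy (in percentage points) and the ratio of feature counts. The names `comparisons` and `deltas` are introduced here purely for illustration.

```python
# Figures copied from the results table above, indexed by row number
test_accuracy = [88.24, 87.45, 85.44, 86.52, 89.26, 88.52, 89.46, 89.96]
n_features = [13587, 13305, 8758, 23844, 51521, 41558, 85157, 68412]

# (from_row, to_row) pairs compared in the analysis
comparisons = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6), (4, 7)]

deltas = {}
for a, b in comparisons:
    accuracy_change = test_accuracy[b] - test_accuracy[a]  # percentage points
    feature_ratio = n_features[b] / n_features[a]          # >1 means more features
    deltas[(a, b)] = (accuracy_change, feature_ratio)
    print(f"row {a} -> row {b}: {accuracy_change:+.2f} points, features x{feature_ratio:.2f}")
```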

# Model Selection - Cross-Validation 
Different models are compared using 10-fold cross-validation on the training set. The models tested are: the Multinomial Naive Bayes model used above; Decision Trees; and K-nearest neighbours.

In [6]:
# Set up cross-validation and pass number of folds
cross_val = KFold(n_splits=10, random_state=21, shuffle=True)

# Create 1st dictionary with the models to be tested + their descriptions as keys
classifiers = {'MultinomialNB': MultinomialNB(),
               'Descision Tree': DecisionTreeClassifier(),
               'K-nearest neighbours': KNeighborsClassifier()
              }

# Create the vocabulary 
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1,2)).fit(train_x)

# Based on the vocabulary, encode the words in the training and test dataset
train_v = vectorizer.transform(train_x)
test_v = vectorizer.transform(test_x)

# Create 2nd dictionary to display results 
dic_to_df_2 = {'Model':[], 'Mean accuracy':[], 'Standard deviation':[]}

for name, clf in classifiers.items():
    # Cross-validate the model on the training data
    results = cross_val_score(estimator=clf, X=train_v, y=train_y, cv=cross_val)
    
    # Append results to 2nd dictionary 
    dic_to_df_2['Model'].append(name)
    dic_to_df_2['Mean accuracy'].append("{:.2%}".format(results.mean()))
    dic_to_df_2['Standard deviation'].append("{:.2%}".format(results.std()))

In [7]:
diff_model_results = pd.DataFrame.from_dict(dic_to_df_2)
display(diff_model_results)

| | Model | Mean accuracy | Standard deviation |
|---|---|---|---|
| 0 | MultinomialNB | 90.17% | 0.32% |
| 1 | Descision Tree | 83.50% | 0.67% |
| 2 | K-nearest neighbours | 83.83% | 0.56% |


## Analysis of model testing results
* The Multinomial Naive Bayes model has by far the highest mean cross-validation accuracy (more than 6 percentage points above the other two models) as well as the lowest standard deviation. As such, it is the clear winner.
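
In the cell above, the vectorizer is fitted once on the whole training set before cross-validation, so each validation fold has already contributed to the vocabulary and IDF weights. One possible variation, which is not what this notebook does, is to wrap the vectorizer and the classifier in a scikit-learn Pipeline so that the vectorizer is re-fitted inside every fold. The sketch below assumes hypothetical toy data (`toy_texts`, `toy_labels`) in place of train_x and train_y.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for train_x / train_y
positive = ["good film", "great movie", "good movie", "great film"]
negative = ["bad film", "awful movie", "bad movie", "awful film"]
toy_texts = positive * 5 + negative * 5
toy_labels = [1] * 20 + [0] * 20

# The vectorizer is now a pipeline step, so it is fitted on each training fold only
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=2, ngram_range=(1, 2))),
    ("nb", MultinomialNB()),
])

folds = KFold(n_splits=5, shuffle=True, random_state=21)
scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=folds)
print(f"Mean accuracy: {scores.mean():.2%} (std {scores.std():.2%})")
```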

# Hyper-parameter tuning - Grid Search
In order to find the optimum values for the hyper-parameters, the Grid Search method is used on the Multinomial Naive Bayes model. In the first cell, the parameters of this model and their default values are printed. In the second, Grid Search is initialized with a range of values for alpha. Finally, the results of the Grid Search are printed in the output of the third cell. 

In [8]:
# Show hyper-parameters for this type of model
print(MultinomialNB().get_params())

{'alpha': 1.0, 'class_prior': None, 'fit_prior': True}


In [9]:
# Create the model
best_model = MultinomialNB()

# Set the possible values for hyper-parameter 
alphas = np.array([1, 0.75, 0.55, 0.5, 0.45, 0.25, 0.1, 0.01, 0.001])

# Set up grid search and fit it to the data 
grid = GridSearchCV(estimator=best_model, param_grid=dict(alpha=alphas))
grid.fit(X=train_v, y=train_y);

In [10]:
print(f'Best value for {grid.best_params_}')
print(f'Best training accuracy: {grid.best_score_:.2%}')
print(f'Test accuracy: {grid.score(test_v, test_y):.2%}')

Best value for {'alpha': 0.5}
Best training accuracy: 90.04%
Test accuracy: 89.96%


## Analysis of Grid Search results
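* The optimum value of alpha is 0.5 and its neighbours in the grid (0.45 and 0.55) are only 0.05 away, so no finer search is needed and 0.5 can be used in the Multinomial Naive Bayes model. Note that the "best training accuracy" printed above is the mean cross-validated accuracy of the best setting (GridSearchCV uses 5 folds by default), not the accuracy on the full training set. 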
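
To judge how much the choice of alpha actually matters, it can help to look at every grid point rather than only the winner. The sketch below is a minimal illustration on hypothetical toy data (`toy_texts`, `toy_labels`); it reads the per-alpha mean and standard deviation of the fold scores from the fitted search's `cv_results_` attribute.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the vectorized training set
positive = ["good film", "great movie", "good movie", "great film"]
negative = ["bad film", "awful movie", "bad movie", "awful film"]
toy_texts = positive * 5 + negative * 5
toy_labels = [1] * 20 + [0] * 20
toy_features = TfidfVectorizer().fit_transform(toy_texts)

# Same alpha grid as in the notebook
alphas = [1, 0.75, 0.55, 0.5, 0.45, 0.25, 0.1, 0.01, 0.001]

grid_sketch = GridSearchCV(
    estimator=MultinomialNB(),
    param_grid={"alpha": alphas},
    cv=5,                 # stated explicitly; this is also the default
    scoring="accuracy",
)
grid_sketch.fit(toy_features, toy_labels)

# One row per alpha, best-ranked first
columns = ["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]
summary = pd.DataFrame(grid_sketch.cv_results_)[columns].sort_values("rank_test_score")
print(summary)
```

If the mean scores of neighbouring alphas differ by less than their standard deviations, the exact optimum is unlikely to be meaningful.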

# Training the final model
The best-performing type of model (Multinomial Naive Bayes) is trained with the optimum value of alpha (0.5) on the training text, vectorized with the best configuration found above (TF-IDF with an mdf of 2 and uni- to bi-grams).

In [11]:
# Create the model with the tuned alpha and fit it to the vectorized training data
best_model_tuned = MultinomialNB(alpha=0.5)
best_model_tuned_fitted = best_model_tuned.fit(X=train_v, y=train_y)

# Dev.tsv dataset Test 
Finally, the model is tested on the unseen dev.tsv dataset and the results are printed below.

In [12]:
dev_x = dev["sentence"]
dev_y = dev["label"]

# Encode the words from the dev.tsv dataset
dev_v = vectorizer.transform(dev_x)

print(f"Model's accuracy on dev.tsv dataset: {best_model_tuned_fitted.score(dev_v, dev_y):.2%}")

Model's accuracy on dev.tsv dataset: 81.77%


# Ethical Reflections 
Since machine learning models of this sort are never 100% accurate yet are often used to make predictions or as the basis for important decisions, it is essential to be completely transparent about their limitations and the types of mistake they can make. To this end, it can be helpful to make a confusion matrix and see what types of text were misclassified and why. Likewise, since the data used to train such models can often be obtained in unscrupulous ways, being transparent about the source of one's data is ethically important. For this model, the publicly accessible Stanford Sentiment Treebank dataset was used.  
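
A minimal sketch of the confusion-matrix check mentioned above follows. The sentences, labels and predictions are hypothetical and were not produced by the model in this notebook; the sketch also assumes that 1 denotes positive and 0 negative sentiment. With the real data, the predictions would come from calling predict on the fitted model.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical example data; real predictions would come from the fitted model
sentences = [
    "a joy to watch", "not exactly thrilling", "warm and funny", "a dull mess",
    "hardly a failure", "tedious and flat", "quietly moving", "painfully slow",
]
true_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes, in the order [0, 1]
matrix = confusion_matrix(true_labels, predicted_labels, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()
print(matrix)
print(f"false positives: {fp}, false negatives: {fn}")

# List the misclassified sentences so they can be inspected by hand
misclassified = [
    (text, truth, guess)
    for text, truth, guess in zip(sentences, true_labels, predicted_labels)
    if truth != guess
]
for text, truth, guess in misclassified:
    print(f"true={truth} predicted={guess}: {text}")
```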

# References
* Text Vectorization/Ablation Study, Text Analytics Notebook, Alexandros Koliousis, 22/04/21
* Cross-Validation, Machine Learning applications Notebook, Alexandros Koliousis, 22/04/21
* [Grid Search, Jason Brownlee, 23/04/21](https://machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/)