# Modelling Using Term Frequency - Inverse Document Frequency
To create an accurate predictive model that determines how well someone did on their virtual internship, there are two general approaches:
* Using the provided tabular data
* Using a numeric representation of the chat transcripts

Although interpreting written text is far more difficult than creating a tabular classifier, it has greater overall potential.
This is because the given tabular data does not provide enough information to make an informed decision on how well or badly someone fared.

In [1]:
import pandas as pd
import numpy as np

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
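from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report, accuracy_score, f1_score
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator  # plot_confusion_matrix was removed in scikit-learn 1.2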
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, loguniform, randint

from matplotlib import pyplot as plt
from xgboost import XGBClassifier

## Data Processing
Before we can begin creating models to predict people's scores, we have to ensure that the data is cleaned and interpretable.

The process begins with removing messages sent by the mentor.
These are preset with an average rating of 4 to avoid modifying any average statistics run on the dataset.
However, they do not add any value to the analysis and further skew the dataset's mode towards the average result.

Secondly, we oversample the minority classes, which have fewer samples.
This mitigates the class imbalance, where our models would otherwise see very few high-rated and low-rated scores alongside a very large number of average ratings.
Finally, we ensure that oversampling is not applied to the test dataset, as we want to see how the model fares on the actual problem (having duplicates there does not help).
We use random state seeds to ensure that this all happens in exactly the same way each time.
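
To make the oversampling step concrete, the block below is a minimal sketch of random oversampling written in plain Python. The function `random_oversample` and the toy messages are hypothetical and only illustrate the idea; the notebook itself relies on `RandomOverSampler` from `imblearn`.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    is as large as the largest one (a simplified stand-in for RandomOverSampler)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        # All original samples that belong to this class.
        pool = [t for t, l in zip(texts, labels) if l == label]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_texts.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_texts, out_labels


# Hypothetical toy messages: score 4 dominates, as described above.
toy_texts = ["ok", "fine", "sure", "alright", "yes", "great analysis", "no idea"]
toy_scores = [4, 4, 4, 4, 4, 7, 1]

balanced_texts, balanced_scores = random_oversample(toy_texts, toy_scores)
print(Counter(balanced_scores))
```

After resampling, every score appears as often as the majority score, and every added message is an exact copy of an existing one. This is why the test data should stay untouched: copies of the same message on both sides of a split would make the task look easier than it is.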

In [2]:
df = pd.read_csv("../data/data.csv")
df = df[df["RoleName"] != "Mentor"]

In [3]:
# Split first so that no test message is duplicated into the training data.
x_train, x_test, y_train, y_test = train_test_split(df[["content"]], df["OutcomeScore"], train_size=0.8, random_state=0)
ros = RandomOverSampler(random_state=0)
x_train, y_train = ros.fit_resample(x_train, y_train)  # oversample the training split only
x_train, x_test = x_train["content"], x_test["content"]  # test on imbalanced data
x_resampled, y_resampled = ros.fit_resample(df[["content"]], df["OutcomeScore"])  # used for cross-validation later

## Model Training
We will test a variety of models to see how each fares.
These include logistic regression (baseline), naive Bayes, k-nearest neighbors, decision trees and ensemble models such as random forests and (normal/extreme) gradient boosting.
The selection is designed to highlight which types of models are most likely to work well for classifying grades based on sparse text.


After initial tests, all models go through a hyperparameter optimisation process.
Instead of tuning all hyperparameters, a small selection is chosen per model which either alters how the model functions (e.g. optimisation routines) or how conservative it is (e.g. max depth).
Hyperparameter values are randomly selected within their specified range, and in the end the combination which produces the highest weighted F1 score is chosen.
F1 scores are preferred over accuracy simply to avoid situations where either precision or recall is high whilst the other is low.
Due to the large computational and time cost of hyperparameter selection, the process is only rerun for models where it resulted in sizeable improvements (based on further evaluation).
This is the case for the baseline logistic regression model.
Other optimisation routines are commented out and the basic pipeline is selected instead.
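
The random search described above can be sketched in a few lines of plain Python. Everything here is illustrative: `sample_params`, `toy_score` and `random_search` are hypothetical names, and `toy_score` is an invented stand-in for the cross-validated weighted F1 score that `RandomizedSearchCV` would compute for each candidate.

```python
import math
import random


def sample_params(rng):
    """Draw one random hyperparameter combination from the search space."""
    return {
        "solver": rng.choice(["lbfgs", "liblinear", "saga"]),
        # Log-uniform draw between 1e-5 and 1e-3.
        "tol": 10 ** rng.uniform(-5, -3),
        # Uniform draw for the regularisation strength.
        "C": rng.uniform(0.01, 4.0),
    }


def toy_score(params):
    """Hypothetical stand-in for a cross-validated weighted F1 score."""
    return 1.0 - abs(math.log10(params["C"])) / 10


def random_search(n_iter, seed=0):
    """Sample n_iter combinations and keep the one with the highest score."""
    rng = random.Random(seed)
    candidates = [sample_params(rng) for _ in range(n_iter)]
    return max(candidates, key=toy_score)


best = random_search(n_iter=100)
print(best, toy_score(best))
```

The cost grows linearly with `n_iter`, and in the real search each candidate is also refitted once per cross-validation fold, which is why the search is only kept active where it paid off.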


Note that although these models aren't trained using k-fold cross-validation, this happens later on in the evaluation section for the best and worst model.
This helps to ensure statistically that the results are sound and not simply due to overfitting or a randomly easy/hard dataset (for example, a model trained on a split with very few examples of high-scoring messages will struggle on the test set).


We will prioritise testing two types of models:
* Random Forests - An Ensemble of Decision Trees
* Logistic Regression


To simplify the creation and usage of these models we will compose several pipelines.
Each of these starts with a TF-IDF vectoriser (to transform the text into a matrix of numbers), followed by a classifier (like logistic regression).
Term Frequency - Inverse Document Frequency (TF-IDF) provides a standard way to go from text to a numeric vector representation.
It works by counting how often each word appears in a given document (term frequency) and in how many documents the word appears at all (document frequency).
Each term frequency is then scaled down according to the word's document frequency, so that words common to every document carry little weight, and the resulting values are used as numeric features in models like random forests and logistic regression.
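
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand for three made-up messages. It follows the smoothed formula that `TfidfVectorizer` uses by default, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, but the whitespace tokeniser is a simplification and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        # Term frequency multiplied by the smoothed inverse document frequency.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Hypothetical chat messages.
toy_messages = ["the pump design works", "the pump failed", "the budget works"]
for message, vector in zip(toy_messages, tfidf(toy_messages)):
    print(message, {term: round(w, 2) for term, w in vector.items()})
```

In every message "the" receives the lowest weight because it occurs in all three documents, while words unique to one message, such as "failed", receive the highest.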

In [None]:
# baseline_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=0, max_iter=500))

baseline_clf = make_pipeline(
    TfidfVectorizer(),
    RandomizedSearchCV(
        LogisticRegression(random_state=0, max_iter=500),
        {
            "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
            "penalty": ["l1", "l2", "elasticnet", "none"],
            "tol": loguniform(1e-5, 1e-3),
            "C": uniform(loc=0, scale=4)
        },
        # random_state makes the sampled combinations reproducible
        n_jobs=2, n_iter=100, cv=5, random_state=0,
        scoring="f1_weighted"
    ),
)

baseline_clf.fit(x_train, y_train);
baseline_clf["randomizedsearchcv"].best_params_

In [None]:
naive_bayes_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())

# naive_bayes_clf = make_pipeline(
#     TfidfVectorizer(),
#     RandomizedSearchCV(
#         MultinomialNB(),
#         {
#             "alpha": uniform(0, 3),
#             "fit_prior": [True, False]
#         },
#         n_jobs=2,
#         scoring="f1_weighted"
#     )
# )

naive_bayes_clf.fit(x_train, y_train);

In [None]:
k_nearest_neighbors_clf = make_pipeline(TfidfVectorizer(), KNeighborsClassifier())

# k_nearest_neighbors_clf = make_pipeline(
#     TfidfVectorizer(),
#     RandomizedSearchCV(
#         KNeighborsClassifier(),
#         {
#             "n_neighbors": randint(5, 15),
#             "weights": ["uniform", "distance"],
#             "metric": ["euclidean", "manhattan", "minkowski"]
#         },
#         n_jobs=2,
#         scoring="f1_weighted"
#     )
# )

k_nearest_neighbors_clf.fit(x_train, y_train);

In [None]:
decision_tree_clf = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(random_state=0))

# decision_tree_clf = make_pipeline(
#     TfidfVectorizer(),
#     RandomizedSearchCV(
#         DecisionTreeClassifier(random_state=0),
#         {
#             "criterion": ["gini", "entropy"],
#             "max_depth": [None] + list(range(5, 50)),
#             "min_samples_split": randint(2, 5),
#             "min_samples_leaf": randint(2, 5)
#         },
#         n_jobs=2,
#         scoring="f1_weighted"
#     )
# )

decision_tree_clf.fit(x_train, y_train);

In [None]:
random_forest_clf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))

# random_forest_clf = make_pipeline(
#     TfidfVectorizer(),
#     RandomizedSearchCV(
#         RandomForestClassifier(random_state=0),
#         {
#             "n_estimators": randint(2, 100),
#             "max_depth": [None] + list(range(5, 50)),
#             "min_samples_split": randint(2, 5),
#             "min_samples_leaf": randint(2, 5)
#         },
#         n_jobs=2,
#         scoring="f1_weighted"
#     )
# )

random_forest_clf.fit(x_train, y_train);

In [None]:
gradient_boosted_clf = make_pipeline(TfidfVectorizer(), GradientBoostingClassifier(n_estimators=10, max_features=200, max_depth=200, random_state=0))
gradient_boosted_clf.fit(x_train, y_train)

In [None]:
# xg_boosted_clf = make_pipeline(TfidfVectorizer(), XGBClassifier(max_depth=500, eta=1, gamma=1, min_child_weight=1, use_label_encoder=False))

xg_boosted_clf = make_pipeline(
    TfidfVectorizer(),
    RandomizedSearchCV(
        XGBClassifier(random_state=0, use_label_encoder=False),
        {
            # a list may only hold concrete values, not a distribution object
            "max_depth": [None] + list(range(5, 100)),
            "gamma": randint(0, 5),
            "eta": uniform(0, 1),
            "min_child_weight": uniform(0, 2),
            "max_delta_step": randint(0, 5)
        },
        n_jobs=2,
        scoring="f1_weighted"
    )
)

xg_boosted_clf.fit(x_train, y_train)

## Evaluation
To evaluate how good our models are, we will start by establishing a baseline estimate of how well a basic logistic regression model performs.
We will look at the F1 score (which balances precision and recall) and plot the confusion matrix.
This will be repeated for each additional model.
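
To show what the weighted F1 score measures, here is a minimal hand-rolled version applied to invented predictions. The function `weighted_f1` and the toy scores are hypothetical; the notebook itself uses `f1_score(..., average="weighted")` from scikit-learn.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 scores averaged with each class weighted by its support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n_true
    return total / len(y_true)


# Hypothetical scores: the classifier predicts the majority class (4) every time.
true_scores = [4, 4, 4, 4, 4, 4, 7, 1]
predicted = [4] * 8
accuracy = sum(t == p for t, p in zip(true_scores, predicted)) / len(true_scores)
print(accuracy, weighted_f1(true_scores, predicted))
```

A model that only ever predicts the majority class reaches 0.75 accuracy on this toy data, but its weighted F1 is about 0.64 because the two minority classes each score zero.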

We will evaluate all our models, but due to the added time required to run cross-validation, it is only used for the logistic regression baseline and the best random forest model.
This will output the accuracy over five separate dataset folds to help ensure statistically that the results are neither an anomaly nor cherry-picked!
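
The five-fold procedure can be sketched as follows. This is a simplified illustration with plain, unshuffled folds and a hypothetical majority-class "model"; `k_fold_indices` and `majority_accuracy` are invented names, and `cross_val_score` additionally stratifies the folds by class when given a classifier.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def majority_accuracy(labels, train, test):
    """Hypothetical stand-in model: always predict the most common training label."""
    train_labels = [labels[i] for i in train]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test) / len(test)


# Hypothetical outcome scores.
toy_labels = [4, 4, 1, 4, 7, 4, 4, 4, 1, 4, 4, 7]
fold_scores = [
    majority_accuracy(toy_labels, train, test)
    for train, test in k_fold_indices(len(toy_labels))
]
print(fold_scores)
```

Each sample is used for testing exactly once, so a single lucky or unlucky split cannot dominate the result; the spread of the five scores gives a feel for how much the metric depends on the split.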

In [None]:
def evaluate_model(model_pipeline):
    """Return the weighted F1 score and classification report on the test set, and plot the confusion matrix."""
    predictions = model_pipeline.predict(x_test)
    f1 = f1_score(y_test, predictions, average="weighted")
    report = classification_report(y_test, predictions)

    plot_confusion_matrix(model_pipeline, x_test, y_test)

    return f1, report

## Logistic Regression Baseline

In [None]:
f1, report = evaluate_model(baseline_clf)

In [None]:
f1

In [None]:
print(report)

In [None]:
# cross_val_score(baseline_clf, x_resampled["content"], y_resampled, cv=5)

## Naive Bayes

In [None]:
f1, report = evaluate_model(naive_bayes_clf)

In [None]:
f1

In [None]:
print(report)

## K-Nearest Neighbors

In [None]:
f1, report = evaluate_model(k_nearest_neighbors_clf)

In [None]:
f1

In [None]:
print(report)

## Decision Trees

In [None]:
f1, report = evaluate_model(decision_tree_clf)

In [None]:
f1

In [None]:
print(report)

## Random Forests

In [None]:
f1, report = evaluate_model(random_forest_clf)

In [None]:
f1

In [None]:
print(report)

In [None]:
cross_val_score(random_forest_clf, x_resampled["content"], y_resampled, cv=5)

## Gradient Boosting

In [None]:
f1, report = evaluate_model(gradient_boosted_clf)

In [None]:
f1

In [None]:
print(report)

## Extreme Gradient Boosting

In [None]:
f1, report = evaluate_model(xg_boosted_clf)

In [None]:
f1

In [None]:
print(report)

## Performance Evaluation
Across all the models, precision and recall scores are similar.
This can be seen in the classification reports, which show a variety of metrics (all usually with similar scores).
For robustness, though, the F1 score is used to decide which models perform best.


We can empirically see that basic logistic regression models perform with around 40% accuracy.
The confusion matrix has both a bright diagonal and a bright horizontal line.
The horizontal line at four indicates that average scores are being predicted more than anything else, despite the fact that we are working with rebalanced data.
This is likely because there is a limit to how much oversampling can be done.
Although it is not shown here, substituting the oversampled training dataset with the original unaltered one makes this far more extreme, to the point where four is almost the only number predicted.


From the confusion matrices it is clear that both the Naive Bayes and K-Nearest Neighbours classifiers have the exact same problem.
The problem is slightly more pronounced in Naive Bayes, whereas K-Nearest Neighbours predicts top scorers far more accurately.


On the other hand, decision trees and random forests completely avoid the problem of predicting average scores far more frequently than anything else.
These models still struggle to predict high-scoring responses.
This can be read from the numbers, but is not visible in the confusion matrix due to the lack of data at these extremes.
Note that this is a problem with the underlying dataset and not the models here.


The boosting methods (gradient boosting and extreme gradient boosting) perform very poorly here.
This can be further confirmed by rerunning the notebook with different numbers of models within their ensembles.
The results being far worse in every metric than the baseline logistic regression suggests that the underlying data is not complex enough to warrant these boosted methods.
Overfitting has likely occurred.