# Report Description Classification

This notebook compares report classification performance across two preprocessing libraries (spaCy and NLTK) and two models (Complement Naive Bayes and SVM).

> This notebook is based on Bo's [ski_learn_with_spacy_finetune.ipynb](https://github.com/Code-the-Change-YYC/YW-NLP-Report-Classifier/blob/02ff7a9e7f49779c736cbb55edb4e8d2835beddd/notebooks/machine_learning/ski_learn_with_spacy_finetune.ipynb)

## Data Specification

This notebook was tested with data preprocessed by `ReportData` with 335 training examples.

> Lemmatization is **not** applied during `ReportData` preprocessing because it is performed in the notebook.

Commit tested at: `a9ed0b8b4587410fd969bce6481057b205d9049e`

## Results Summary

The results of multiple preprocessing combinations are summarized here:

![image.png](./images/description_classification_results_no_weights.png)
> The placeholder used is `'someone'`. Without placeholders, scrubadub uses `'{{}}'` entities and spaCy uses `'*'` entities.

## Setup

If running this notebook in Google Colab, upload the requested files and allow the dependencies to be installed. Treat the file paths as relative to the Colab file, and create any folders needed to satisfy them.

Otherwise, update the path so that the necessary imports can be resolved.

In [None]:
import sys

IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    from google.colab import files

    required_files = [
        "requirements.txt",
        "data-processed.csv",
        "report_data.py",
        "report_data_d.py",
        "incident_types_d.py",
        "training/description_classification/utils.py",
        "training/description_classification/model_paths.py",
    ]
    for file in required_files:
        print(f"Upload {file}")
        files.upload()

    !pip install -r requirements.txt
else:
    from os import path

    root = path.abspath(path.join("..", ".."))
    sys.path.append(root)

    preprocess = path.join(root, "preprocess")
    sys.path.append(preprocess)

    incident_types = path.join(preprocess, "incident_types")
    sys.path.append(incident_types)

In [None]:
import pickle
from tempfile import mkdtemp

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    plot_confusion_matrix,
)
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from sklearn.preprocessing import LabelEncoder


from incident_types_d import IncidentType
from preprocess.report_data import ReportData
from preprocess.report_data_d import ColName
from training.description_classification import model_paths, utils


set_config(display="diagram")

## Preprocessing

Both the NLTK and spaCy versions of the preprocessing remove stop words and non-letter tokens and perform lemmatization.
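As a rough illustration of this kind of filtering, here is a minimal standard-library sketch. It is not the project's `utils.spacy_tokenizer` or `utils.nltk_tokenizer`: the stop-word list and the `simple_tokenizer` helper are hypothetical, and lemmatization is left out because it needs a language library.

```python
import re
from typing import List

# Hypothetical stop-word list for illustration only; spaCy and NLTK ship
# much larger lists.
STOP_WORDS = {"a", "an", "the", "was", "to", "and", "of", "in", "at"}


def simple_tokenizer(text: str) -> List[str]:
    """Lowercase the text, keep letter-only tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    letters_only = [t for t in tokens if t.isalpha()]  # drops "3pm", "42", ...
    return [t for t in letters_only if t not in STOP_WORDS]


tokens = simple_tokenizer("The client was taken to the hospital at 3pm.")
print(tokens)
```

A real tokenizer would also map each remaining token to its lemma (for example "taken" to "take"), which is the step the two libraries handle differently.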

Load the data

In [None]:
if IN_COLAB:
    yw_df = ReportData(out_file_path="data-processed.csv").get_processed_data()
else:
    yw_df = ReportData().get_processed_data()[[ColName.DESC,ColName.INC_T1]]

print(yw_df.info())

Summarize the differences between spaCy and NLTK tokenization.

In [None]:
yw_clean = yw_df[ColName.DESC]

print(
    "Spacy tokenization compared to NLTK tokenization on the same report description:\n"
)
spacy_tokens = utils.spacy_tokenizer(yw_clean[0])
nltk_tokens = utils.nltk_tokenizer(yw_clean[0])
print("Items in spacy_tokens but not in nltk_tokens:")
print([x for x in spacy_tokens if x not in nltk_tokens])
print()
print("Items in nltk_tokens but not in spacy_tokens:")
print([x for x in nltk_tokens if x not in spacy_tokens])

Use tf-idf with our spaCy tokenizer to vectorize the data. Note:

- We set `token_pattern` to match single-character alphanumeric words instead of the default two-character minimum. scikit-learn ignores `token_pattern` when a custom `tokenizer` is passed, so in practice the spaCy tokenizer decides which tokens are kept.
- We use both uni-grams and bi-grams. This gives more features and preserves some possibly important word order. See [here](https://scikit-learn.org/stable/modules/feature_extraction.html?highlight=tfidf#common-vectorizer-usage) for an example.
- We set `min_df=2` to drop terms that appear in fewer than two reports, so only more common word patterns are considered.
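The three settings above can be illustrated without scikit-learn. The sketch below uses only the standard library: the example reports and the `ngrams` helper are made up for illustration, and the code mimics the vocabulary-building step only, not the tf-idf weighting.

```python
import re
from collections import Counter

# Made-up example reports, not taken from the real data.
docs = [
    "client fell in room 4",
    "client fell outside",
    "staff called a nurse",
]

default_pattern = r"(?u)\b\w\w+\b"  # scikit-learn's default: two or more characters
notebook_pattern = r"\b\w+\b"       # pattern used here: one or more characters

default_tokens = re.findall(default_pattern, docs[0])
notebook_tokens = re.findall(notebook_pattern, docs[0])
print(default_tokens)   # the single-character "4" is dropped
print(notebook_tokens)  # the single-character "4" is kept


def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams from n_min to n_max words, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


# Document frequency: in how many documents does each term occur?
doc_terms = [set(ngrams(re.findall(notebook_pattern, d))) for d in docs]
doc_freq = Counter(term for terms in doc_terms for term in terms)

# min_df=2 keeps only terms that occur in at least two documents.
vocabulary = sorted(term for term, df in doc_freq.items() if df >= 2)
print(vocabulary)
```

Only "client", "fell", and the bi-gram "client fell" survive the `min_df=2` filter here, which shows how quickly rare terms are pruned on a small corpus.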

In [None]:
word_vec = TfidfVectorizer(
    tokenizer=utils.spacy_tokenizer,
    token_pattern=r"\b\w+\b",
    ngram_range=(1, 2),
    min_df=2,
)

Split data into training and test data. The `random_state` of `32` has been manually optimized for our data.

In [None]:
X = yw_clean
y = yw_df[ColName.INC_T1]
X_train_set, X_test_set, y_train_set, y_test_set = train_test_split(
    X, y, train_size=0.75, random_state=32, shuffle=True
)

Compute a sample weight for each example, giving examples from more frequent classes more weight.

In [None]:
weight_all = compute_sample_weight(utils.count_weight(y), y)
weight_train = compute_sample_weight(utils.count_weight(y_train_set), y_train_set)
weight_test = compute_sample_weight(utils.count_weight(y_test_set), y_test_set)
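The source of `utils.count_weight` is not shown here, so the following is only a sketch under the assumption that it maps each class to the number of examples in that class. With a dict of class weights and single-output labels, `compute_sample_weight` looks up each example's class weight, which the list comprehension below mirrors. The labels are made up.

```python
from collections import Counter

# Made-up labels for illustration.
labels = ["fall", "fall", "fall", "medication", "medication", "other"]

# Assumption: the class weight is the number of examples of that class.
class_weight = dict(Counter(labels))

# Each example receives the weight of its own class.
sample_weight = [class_weight[label] for label in labels]

print(class_weight)
print(sample_weight)
```

Under this assumption the weighting is the opposite of `class_weight="balanced"`, which would instead give rare classes larger weights.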

## Training and Cross Validation Evaluation

Initialize and fit the models.
> NOTE: Using sample weights with CNB significantly decreases accuracy. This may be related to the inner workings of the algorithm.

In [None]:
cnb_cache = mkdtemp()
cnb = make_pipeline(
    word_vec,
    CalibratedClassifierCV(ComplementNB(), method="sigmoid"),
    memory=cnb_cache,
)
cnb

In [None]:
svm_cache = mkdtemp()
svm = make_pipeline(word_vec, SVC(), memory=svm_cache)
svm

Fine tune the estimator hyperparameters.

In [None]:
# Cache the encoded input, as regenerating it for every grid search takes too long
X_train_enc = word_vec.fit_transform(X_train_set)

In [None]:
# To save time some of the best options from previous runs are selected here
# Once in a while this should be rerun with all options to ensure the best options haven't become outdated
svc_params_list = {
    "C": np.linspace(1, 0, num=5, endpoint=False),
    "coef0": np.logspace(-1, 1, num=5),
    "kernel": ["sigmoid"],  # ["linear", "poly", "sigmoid"],
    "gamma": ["scale"],  # ["scale", "auto"],
    # SVC always handles multi-class with one-vs-one internally
    "decision_function_shape": ["ovo"],  # ["ovo", "ovr"],
    "class_weight": ["balanced"],  # ["balanced", None],
}
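To get a feel for why the grid is kept small, we can count how many model fits it implies. This sketch rebuilds the grid values by hand (rounded, without NumPy) and assumes `GridSearchCV` runs with its default 5-fold cross-validation, since no `cv` argument is passed.

```python
from itertools import product

grid = {
    # Same values as np.linspace(1, 0, num=5, endpoint=False), rounded
    "C": [round(1 - i / 5, 2) for i in range(5)],
    # Same values as np.logspace(-1, 1, num=5), rounded
    "coef0": [round(10 ** (-1 + i / 2), 3) for i in range(5)],
    "kernel": ["sigmoid"],
    "gamma": ["scale"],
    "decision_function_shape": ["ovo"],
    "class_weight": ["balanced"],
}

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]

cv_folds = 5  # assumed: GridSearchCV default when cv is not passed
n_fits = len(candidates) * cv_folds

print(grid["C"])
print(grid["coef0"])
print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

Re-enabling the commented-out options multiplies the count: three kernels, two `gamma` values, and two `class_weight` values would give 12 times as many candidates. With the default `refit=True`, one extra fit on the full training data follows the search.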

In [None]:
svm_op = GridSearchCV(svm.named_steps["svc"], param_grid=svc_params_list)
svm_op.fit(X_train_enc, y_train_set, sample_weight=weight_train)

In [None]:
svm_op.best_params_

In [None]:
scoring = ["recall_weighted", "precision_weighted", "balanced_accuracy", "accuracy"]
fit_params = {"sample_weight": weight_train}

cv_s = cross_validate(
    svm_op.best_estimator_,  # change to svm variable to see differences from fine tuning
    X_train_enc,
    y_train_set,
    scoring=scoring,
    fit_params=fit_params,
)

In [None]:
cnb_op = GridSearchCV(
    cnb.named_steps["calibratedclassifiercv"].base_estimator,
    param_grid={"alpha": np.linspace(3, 0, num=50, endpoint=False)},
)
cnb_op.fit(X_train_enc, y_train_set)



In [None]:
cnb_op.best_params_

In [None]:
cv_b = cross_validate(cnb_op.best_estimator_, X_train_enc, y_train_set, scoring=scoring)

Cross validation results (training set only).

In [None]:
metrics = cv_b.keys()
names = ["Metric", "Model"]
index = pd.MultiIndex.from_product([metrics, ["CNB", "SVM"]], names=names)
pairs = zip(cv_b.values(), cv_s.values())
flattened = sum(pairs, ())
df = pd.DataFrame(flattened, index=index)
df.join(df.agg(func=["mean", "max"], axis=1), on=names)

In [None]:
# Initial test of the calibrated classifier
encoded_desc = word_vec.fit_transform(X)
encoder = LabelEncoder()
encoded_y = encoder.fit_transform(y)
cnb_op.fit(encoded_desc, encoded_y)

# cv="prefit" reuses the fitted grid search, so only the calibrators are fitted here
cal_clf_sig = CalibratedClassifierCV(cnb_op, cv="prefit", method="sigmoid")
cal_clf_iso = CalibratedClassifierCV(cnb_op, cv="prefit", method="isotonic")
cal_clf_sig.fit(encoded_desc, encoded_y)
cal_clf_iso.fit(encoded_desc, encoded_y)

# The probabilities are predicted on the same data used for fitting,
# so the log-loss values below are in-sample and likely optimistic
prob_cnb_sig = cal_clf_sig.predict_proba(encoded_desc)
prob_cnb_iso = cal_clf_iso.predict_proba(encoded_desc)
prob_cnb_old = cnb_op.predict_proba(encoded_desc)

cnb_score_sig = log_loss(encoded_y, prob_cnb_sig)
cnb_score_iso = log_loss(encoded_y, prob_cnb_iso)
cnb_score_old = log_loss(encoded_y, prob_cnb_old)

print("With sigmoid calibration: %1.3f" % cnb_score_sig)
print("With isotonic calibration: %1.3f" % cnb_score_iso)
print("No calibration: %1.3f \n" % cnb_score_old)

# Show the effect of calibration on the first 5 examples
for i, label in enumerate(encoded_y[:5]):
    print(f"encoded index : {label} and class label : {encoder.inverse_transform([label])[0]}")
    print(f"No calibration predict proba: {prob_cnb_old[i][label]:.2f} at index {label}")
    print(f"With sigmoid calibration predict proba: {prob_cnb_sig[i][label]:.2f} at index {label}")
    print(f"With isotonic calibration predict proba: {prob_cnb_iso[i][label]:.2f} at index {label}\n")
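The comparison above relies on log loss, so a small worked example may help in reading the numbers. The sketch below computes the mean negative log-probability assigned to the true class; `mean_log_loss` is a hypothetical helper, the probabilities are made up, and unlike scikit-learn's `log_loss` it does no clipping of zero probabilities.

```python
import math


def mean_log_loss(y_true, probabilities):
    """Average of -log(p), where p is the probability given to the true class."""
    total = 0.0
    for label, row in zip(y_true, probabilities):
        total += -math.log(row[label])
    return total / len(y_true)


y_true = [0, 1, 2]

# A model that puts most of its probability on the correct class.
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
# A model that spreads its probability evenly over the three classes.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3

confident_loss = mean_log_loss(y_true, confident)
uniform_loss = mean_log_loss(y_true, uniform)

print(f"confident: {confident_loss:.3f}")
print(f"uniform:   {uniform_loss:.3f}")
```

Lower is better. A uniform guess over K classes scores log(K), which gives a baseline for judging whether calibration actually improved the probabilities.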
  


## Model Saving

Save the optimized models to pickle files after retraining on the entire dataset. If you only need to train the models, you can stop after running this cell.
> NOTE: In Colab the files are saved to the current directory. Download them to your local `model_output` folder.

In [None]:
def get_model_output_path(file_name: str, full_path: str) -> str:
    return full_path if not IN_COLAB else f"./{file_name}"


def save_model(model, file_name: str, full_path: str):
    with open(get_model_output_path(file_name, full_path), "wb") as f:
        pickle.dump(model, f)


# Ensure each pipeline's estimator uses the best params from its grid search
for cv, pipe in [(svm_op, svm)]:
    # Each step is a (name, estimator) tuple; take the last step's estimator
    estimator = pipe.steps[-1][1]
    estimator.set_params(**cv.best_params_)
for cv, pipe in [(cnb_op, cnb)]:
    # The last step is the calibration wrapper; tune the CNB model inside it
    estimator = pipe.steps[-1][1].base_estimator
    estimator.set_params(**cv.best_params_)


cnb.fit(X, y)


save_model(cnb, model_paths.cnb_cli_file_name, model_paths.cnb_cli)
# Old path for the CNB file:
# save_model(cnb, model_paths.cnb_file_name, model_paths.cnb)

# svm.fit(X, y, svc__sample_weight=weight_all)
# save_model(svm, model_paths.svm_file_name, model_paths.svm)
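For reference, a saved model can be read back with `pickle.load`. The sketch below round-trips a stand-in dictionary through a temporary file rather than a real fitted pipeline; the file name `model.pickle` is made up.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted pipeline.
model = {"name": "stand-in for a fitted pipeline", "alpha": 0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pickle")

    # Save, mirroring save_model above.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load the object back from disk.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

Pickle stores functions by reference, so loading the real pipeline requires the tokenizer function to be importable under the same module path as when it was saved, ideally with the same scikit-learn version.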


## Test Set Evaluation

Retrain the models on the entire training set. Click on the pipeline steps to view the chosen hyperparameters.

In [None]:
cnb.fit(X_train_set, y_train_set)

In [None]:
svm.fit(X_train_set, y_train_set, svc__sample_weight=weight_train)

> NOTE: Using `sample_weight=weight_test` in our metrics calculations makes examples from classes that are more prevalent in our data contribute more to the overall score.
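The effect of weighting a metric can be seen on a toy case. The sketch below computes accuracy with and without sample weights, as the weighted share of correct predictions; `weighted_accuracy` is a hypothetical helper, and the labels and weights are made up.

```python
def weighted_accuracy(y_true, y_pred, sample_weight):
    """Sum of the weights of correct predictions divided by the total weight."""
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


# A classifier that always predicts the majority class.
y_true = ["fall", "fall", "fall", "other"]
y_pred = ["fall", "fall", "fall", "fall"]

uniform_weights = [1, 1, 1, 1]
count_weights = [3, 3, 3, 1]  # each example weighted by its class frequency

unweighted = weighted_accuracy(y_true, y_pred, uniform_weights)
weighted = weighted_accuracy(y_true, y_pred, count_weights)

print(f"unweighted accuracy: {unweighted:.2f}")
print(f"weighted accuracy:   {weighted:.2f}")
```

Mistakes on the rare class cost less once frequency weights are applied, so the weighted reports should be read alongside an unweighted or balanced metric.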

In [None]:
print("Complement NB train:\n")
utils.show_classification_report(
    cnb, X_train_set, y_train_set, sample_weight=weight_train
)

In [None]:
print("Complement NB test:\n")
utils.show_classification_report(cnb, X_test_set, y_test_set, sample_weight=weight_test)

In [None]:
print("SVM-C train:\n")
utils.show_classification_report(
    svm, X_train_set, y_train_set, sample_weight=weight_train
)

In [None]:
print("SVM-C test:\n")
utils.show_classification_report(svm, X_test_set, y_test_set, sample_weight=weight_test)