# Questions 5: Optimising pre-processing and feature extraction (30 marks)

I have made some adjustments to the original jupyter notebook so that it can handle different Feature extraction and pre-processing techniques using the same script. I have created a list of configurations, which are iterated through and the current configuration is passed to each function. Each configuration is a dictionary, with entries which map to the Description of the configuration, and the flags for each pre-processing technique. 

## Downloading and Importing Packages 

In [95]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [96]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [97]:
pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [98]:
pip install --upgrade pip

Note: you may need to restart the kernel to use updated packages.


In [99]:
import csv                               
from sklearn.svm import LinearSVC
from nltk.classify import SklearnClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import re
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import ngrams
from sklearn import metrics
import matplotlib.pyplot as plt
import pandas as pd

In [100]:
import nltk

nltk.download("wordnet")
nltk.download("punkt")
nltk.download('omw-1.4')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Feature Extraction function for (Presence of words) / (Bag of Words) model

In [101]:
def to_feature_vector(tokens, gfd, config):
    # Should return a dictionary containing features as keys, and weights as values
    # DESCRIBE YOUR METHOD IN WORDS
    def presence():
        feature_dict = {}
        ut = set(tokens)
        for token in tokens:
            feature_dict[token] = 1
        for token in ut:
            gfd[token] = 1
        return feature_dict
    def bag_of_words():
        feature_dict = {}
        ut = set(tokens)
        for token in tokens:
            feature_dict[token] = feature_dict.get(token, 0) + 1
        for token in ut:
            gfd[token] = gfd.get(token, 0) + 1
        return feature_dict
        
    if config.get("feature_extraction_method") == "presence":
        return presence()
    elif config.get("feature_extraction_method") == "BoW":
            return bag_of_words()
    else:
        return bag_of_words()

## Data Loading and Splitting

I have kept the `load_data()`function the same as in Q1-4, however I now have two different implementations of the `split_and_preprocess_data()` function such that it can properly support the different feature extraction methods.

**Split and preprocess data**:

A. `split_and_preprocess_data(percentage, config, traind, traind1, testd, testd1, gfd)`: This function is designed to be used with configurations which use a `Presence of word` or `Bag of words` feature extraction model. I have added a few extra parameters, however this function works in a very similar way to the function in the other notebook. Here is what each of the parameters does:
- *percentage*: Tells the function what ratio the `raw_data` should be divided into to create the `train_data` and `test_data` variables.
- *config*: Gives the flags for each of the different preprocessing and feature extraction methods that are to be used.
- *traind, traind1, testd, testd1*: These are empty lists which are filled in by this function. The parameters `traind` and `testd` just store the normal train-test splits with features and labels. `traind1` and `testd1` also store the original text sample alongside the features and the label which is convenient for certain operation which is explained further in later sections.
- *gfd*: This is an empty dictionary which represents the global feature dictionary. 


B. `split_and_preprocess_data_tfidf(percentage, config)`: 
- This split and preprocess function is defined specifically to be used if the configuration uses a `TF/IDF` feature extraction method.
- The parameters for this function works the same way as in the previous function.
- For this feature extraction method we don't call the `to_feature_vector` function but directly make use of the `sklearn` function `TfidfVectorizer` to generate the features.
- Here the data is split based on the percentage parameter and the tokens which are obtained from pre-processing the train and test data are joined together to create pre-processed text samples which are neccessary as `TfidfVectorizer` creates features from all documents at once.
- I had to limit the `max_features` parameter otherwise I ended up facing memory issues and the jupyter kernel crashing.
- The variables `tfidf_train` and `tfidf_test` contain the training and testing features created using the `TfidfVectorizer`. The `TfidfVectorizer` learns the vocabulary and then computes the associated TF/IDF values.
- The function returns `traind`, `testd` and `tfidf_vectorizer`, where `traind` and `testd` are lists of tuples with features and their labels.

In [102]:
def load_data(path):
    """Load data from a tab-separated file and append it to raw_data."""
    with open(path) as f:
        reader = csv.reader(f, delimiter='\t')
        for line in reader:
            if line[0] == "Id":  # skip header
                continue
            (label, text) = parse_data_line(line)
            raw_data.append((text, label))

def split_and_preprocess_data(percentage, config, traind, traind1, testd, testd1, gfd):
    num_samples = len(raw_data)
    num_training_samples = int((percentage * num_samples))

    for (text, label) in raw_data[:num_training_samples]:
        traind.append((to_feature_vector(pre_process(text, config), gfd, config), label))
        traind1.append((text, to_feature_vector(pre_process(text, config), gfd, config), label))
    for (text, label) in raw_data[num_training_samples:]:
        testd.append((to_feature_vector(pre_process(text, config), gfd, config), label))
        testd1.append((text, to_feature_vector(pre_process(text, config), gfd, config), label))

In [103]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def split_and_preprocess_data_tfidf(percentage, config):
    num_samples = len(raw_data)
    num_training_samples = int((percentage * num_samples))
    
    ## Creates a list of texts by joining together tokens obtained from preprocessing
    training_texts = [" ".join(pre_process(text, config)) for text, _ in raw_data[:num_training_samples]]
    test_texts = [" ".join(pre_process(text, config)) for text, _ in raw_data[num_training_samples:]]

    tfidf_vectorizer = TfidfVectorizer(max_features=50_000)
    tfidf_train = tfidf_vectorizer.fit_transform(training_texts)
    tfidf_test = tfidf_vectorizer.transform(test_texts)

    # Puts the Training and Testing features created from samples with their associated labels 
    traind = [(tfidf_train[i], label) for i, (_, label) in enumerate(raw_data[:num_training_samples])]
    testd = [(tfidf_test[i], label) for i, (_, label) in enumerate(raw_data[num_training_samples:])]

    return traind, testd, tfidf_vectorizer

In [104]:
def parse_data_line(data_line):
    # Should return a tuple of the label as just positive or negative and the statement
    # e.g. (label, statement)
    _, label, statement = data_line
    return (label, statement)

## Configurations for Feature Extraction and Pre-processing and populating `raw_data` 

I have listed all 26 configurations which I tested but have commented the ones which I didn't include in the report. You can enable or disable configurations by commenting or uncommenting lines from the configuration list.

In [105]:
raw_data = []
data_file_path = 'sentiment-dataset.tsv'

# 26 Configurations with different combination of preprocessing and feature extraction techniques
configurations = [
    {"description": "Presence/Absence of Word + Punctuation Separation"                                                                           , "feature_extraction_method": "presence" , "rm_urls":False, "sep_pn":True , "rmt": False, "rm_sw":False, "app_stem":False, "app_lem":False, "add_bg":False, "add_tg":False, "add_qg":False},
    {"description": "Bag of words + Punctuation Separation"                                                                                       , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":True , "rmt": False, "rm_sw":False, "app_stem":False, "app_lem":False, "add_bg":False, "add_tg":False, "add_qg":False},
    # {"description": "Bag of words + URL Removal + Punctuation Separation"                                                                         , "feature_extraction_method": "BoW" , "rm_urls":True , "sep_pn":True , "rmt": False, "rm_sw":False, "app_stem":False, "app_lem":False, "add_bg":False, "add_tg":False, "add_qg":False},
    # {"description": "Bag of words + Punctuation Removal"                                                                                          , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":False, "rmt": False, "rm_sw":False, "app_stem":False, "app_lem":False, "add_bg":False, "add_tg":False, "add_qg":False},
    # {"description": "Bag of words + Punctuation Separation + Stemming"                                                                            , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":True , "rmt": False, "rm_sw":False, "app_stem":True , "app_lem":False, "add_bg":False, "add_tg":False, "add_qg":False},
    # {"description": "Bag of words + Punctuation Separation + Lemmatization"                                                                       , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":True , "rmt": False, "rm_sw":False, "app_stem":False, "app_lem":True , "add_bg":False, "add_tg":False, "add_qg":False},
    # {"description": "Bag of words + URL Removal + Punctuation Separation + Stemming"                                                              , "feature_extraction_method": "BoW" , "rm_urls":True , "sep_pn":True , "rmt": False, "rm_sw":False, "app_stem":True , "app_lem":False, "add_bg":False, "add_tg":False, "add_qg":False},
    # {"description": "Bag of words + URL Removal + Punctuation Separation + Lemmatization"                                                         , "feature_extraction_method": "BoW" , "rm_urls":True , "sep_pn":True , "rmt": False, "rm_sw":False, "app_stem":False, "app_lem":True , "add_bg":False, "add_tg":False, "add_qg":False},
    {"description": "Bag of words + Bigrams + Punctuation Removal"                                                                                , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":False, "rmt": False, "rm_sw":False, "app_stem":False, "app_lem":False, "add_bg":True , "add_tg":False, "add_qg":False},
    # {"description": "Bag of words + Bigrams + Trigrams + URL Removal + Punctuation Separation"                                                    , "feature_extraction_method": "BoW" , "rm_urls":True , "sep_pn":True , "rmt": False, "rm_sw":False, "app_stem":False, "app_lem":False, "add_bg":True , "add_tg":True , "add_qg":False},
    # {"description": "Bag of words + Bigrams + Trigrams + URL Removal + Punctuation Separation + Lemmatization"                                    , "feature_extraction_method": "BoW" , "rm_urls":True , "sep_pn":True , "rmt": False, "rm_sw":False, "app_stem":False, "app_lem":True , "add_bg":True , "add_tg":True , "add_qg":False},
    {"description": "Bag of words + Trigrams + Punctuation Removal + Lemmatization"                                                               , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":False, "rmt": False, "rm_sw":False, "app_stem":False, "app_lem":True , "add_bg":False, "add_tg":True , "add_qg":False},
    # {"description": "Bag of words + Trigrams + Punctuation Separation + Lemmatization"                                                            , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":True , "rmt": False, "rm_sw":False, "app_stem":False, "app_lem":True , "add_bg":True , "add_tg":True , "add_qg":False},
    {"description": "Bag of words + Bigrams + Punctuation Separation + Lemmatization"                                                             , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":True , "rmt": False, "rm_sw":False, "app_stem":False, "app_lem":True , "add_bg":True , "add_tg":False, "add_qg":False},
    # {"description": "Bag of words + Punctuation Separation + Stopword Removal"                                                                    , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":True , "rmt": False, "rm_sw":True , "app_stem":False, "app_lem":False, "add_bg":False, "add_tg":False, "add_qg":False},
    # {"description": "Bag of words + Punctuation Separation + Lemmatization + Stopword Removal"                                                    , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":True , "rmt": False, "rm_sw":True , "app_stem":False, "app_lem":True , "add_bg":False, "add_tg":False, "add_qg":False},
    # {"description": "Bag of words + Bigrams + Punctuation Separation + Lemmatization + Stopword Removal"                                          , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":True , "rmt": False, "rm_sw":True , "app_stem":False, "app_lem":True , "add_bg":True , "add_tg":False, "add_qg":False},
    {"description": "Bag of words + Quadgrams + Punctuation Separation"                                                                           , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":True , "rmt": False, "rm_sw":False, "app_stem":False, "app_lem":False, "add_bg":False, "add_tg":False, "add_qg":True },
    # {"description": "Bag of words + Bigrams + Trigrams + Quadgrams + Punctuation Separation"                                                      , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":True , "rmt": False, "rm_sw":False, "app_stem":False, "app_lem":False, "add_bg":True , "add_tg":True , "add_qg":True },
    {"description": "Bag of words + Bigrams + Punctuation Removal + Tag Removal + Lemmatization"                                                  , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":False, "rmt": True , "rm_sw":False, "app_stem":False, "app_lem":True , "add_bg":True , "add_tg":False, "add_qg":False},
    {"description": "Bag of words + Bigrams + Trigrams + Quadgrams + URL Removal + Punctuation Removal + Lemmatization + Stopword Removal"        , "feature_extraction_method": "BoW" , "rm_urls":True , "sep_pn":False, "rmt": False, "rm_sw":True , "app_stem":False, "app_lem":True , "add_bg":True , "add_tg":True , "add_qg":True },
    {"description": "Bag of words + Bigrams + Punctuation Removal + Lemmatization + Stopword Removal"                                             , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":False, "rmt": False, "rm_sw":True , "app_stem":False, "app_lem":True , "add_bg":True , "add_tg":False, "add_qg":False},
    # {"description": "Bag of words + Trigrams + Punctuation Removal + Lemmatization + Stopword Removal"                                            , "feature_extraction_method": "BoW" , "rm_urls":False, "sep_pn":False, "rmt": False, "rm_sw":True , "app_stem":False, "app_lem":True , "add_bg":False, "add_tg":True , "add_qg":False},
    # {"description": "TFIDF + Punctuation Separation"                                                                                              , "feature_extraction_method": "tfidf" , "rm_urls":False, "sep_pn":True, "rmt": False , "rm_sw":False, "app_stem":False, "app_lem":False, "add_bg":False, "add_tg":False, "add_qg":False},
    {"description": "TFIDF + Bigrams + Punctuation Removal + Tag Removal + Lemmatization"                                                         , "feature_extraction_method": "tfidf" , "rm_urls":False, "sep_pn":False, "rmt": True , "rm_sw":False, "app_stem":False, "app_lem":True , "add_bg":True , "add_tg":False, "add_qg":False},
    # {"description": "TFIDF + Bigrams + Trigrams + Quadgrams + URL Removal + Punctuation Removal + Tag Removal + Lemmatization + Stopword Removal" , "feature_extraction_method": "tfidf" , "rm_urls":True,  "sep_pn":False, "rmt": True , "rm_sw":False, "app_stem":False, "app_lem":True , "add_bg":True , "add_tg":False, "add_qg":False},
]

load_data(data_file_path) 

print("rawData", len(raw_data))

rawData 33540


## Pre-processing Text Samples

These are the pre-processing techniques which I have employed and I have also included some feature extraction methods within the same function (n-grams implementation), as it made it more convenient to return the tokens for the text all in one go. These are the techniques which I used:
- URL Removal
- Punctuation Separation
- Punctuation Removal
- User tag and Hash tag removal
- Applying Porter-Stemmer
- Applying Lemmatization
- Stop word removal
- Generating N-grams (Bigrams, Trigrams and Quadgrams)

I check the specific configuration flag for each of the above techniques and apply them accordingly. Once all of the different preprocessing techniques have been applied, the pre-processed text is tokenized. This is followed by the generation of n-grams (if they are part of the configuration passed to the function) which are appended to the list of tokens (or unigrams). After this the final list of tokens is returned.

In [106]:
def pre_process(text, config):
    """ 
    Performs Different preprocessing operations based on the config parameter passed to the function.

    Parameters:
    text (string): passes a line of text (assume sentence segmentation has already been done)
    config (dictionary): passes a bunch of flags indicating which techniques are to be used

    Returns:
    List[string]: Should return a list of tokens.
    """
    # DESCRIBE YOUR METHOD IN WORDS

    def remove_urls(text):
        url_pattern = r'https?://\S+|www\.\S+'
        cleaned_text = re.sub(url_pattern, '', text)
        cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
        return cleaned_text

    def separate_punctuation(text):
        text = re.sub(r"(\w)([.,;:!?'\"”\)])", r"\1 \2", text) # separates punctuation at ends of strings and separates hash tags
        text = re.sub(r"([.,;:!?'\"“\(\)])(\w)", r"\1 \2", text) # separates punctuation at beginning of strings and separates hash tags
        return text

    def remove_punctuation(text):
        text = re.sub(r"(\w)([.,;:!?'\"”\)])", r"\1", text) # removes punctuation at ends of strings and removes hash tags
        text = re.sub(r"([.,;:!?'\"“\(\)])(\w)", r"\2", text) # removes punctuation at beginning of strings and removes hash tags
        return text
    
    def remove_tags(text):
        text = re.sub(r"(\w)([@#'\"”\)])", r"\1", text) # removes user tags and hash tags at ends of strings
        text = re.sub(r"([@#'\"“\(\)])(\w)", r"\2", text) # removes user tags  and hash tags at beginning of strings
        return text
        
    def tokenize_text(text):
        tokens = re.split(r"\s+",text) # separate into tokens by splitting on trailing spaces
        # normalization - only by lower casing for now
        tokens = [t.lower() for t in tokens]
        return tokens

    def apply_stemming(tokens):
        # Use porter stemmer
        stemmer = PorterStemmer()
        stemmed_tokens = [stemmer.stem(token) for token in tokens]
        return stemmed_tokens

    def apply_lemmatization(tokens):
        # Apply lemmatization
        lemmatizer = WordNetLemmatizer()
        lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
        return lemmatized_tokens

    def generate_ngrams_from_tokens(tokens, n):
        # make n-grams from tokens
        return list(ngrams(tokens, n))

    # Remove URLs

    if config.get("rm_urls"):
        text = remove_urls(text)

    # Separate Punctuation otherwise Remove it
    
    if config.get("sep_pn"):
        text = separate_punctuation(text)
    else:
        text = remove_punctuation(text)
    
    # Remove User and Hash tags

    if config.get("rmt"):
        text = remove_tags(text)
    
    tokens = tokenize_text(text)

    # Apply Lemmatization or Stemming

    if config.get("app_stem"):
        tokens = apply_stemming(tokens)

    if config.get("app_lem"):
        tokens = apply_lemmatization(tokens)


    # Generate bigrams, trigrams and quadgrams
    if config.get("add_bg"):
        bigrams = generate_ngrams_from_tokens(tokens, 2)
        bg = [i + " " + j for (i,j) in bigrams]
        tokens += bg

    if config.get("add_tg"):
        trigrams = generate_ngrams_from_tokens(tokens, 3)
        tg = [i + " " + j + " " + k for (i,j,k) in trigrams]
        tokens += tg

    if config.get("add_qg"):
        quadgrams = generate_ngrams_from_tokens(tokens, 4)
        qg = [i + " " + j + " " + k + " " + l for (i,j,k,l) in quadgrams]
        tokens += qg

    # Remove Stop words
    if config.get("rm_sw"):
        stop_words = set(stopwords.words('english'))
        tokens = [w for w in tokens if w not in stop_words]

    return tokens

## Training Classifier based on Feature Extraction Method

- The method for training the classifier for `Bag of Words` and `Presence of Words` feature extraction model is the same as the code provided in the template for questions 1-4.
- However for the "TF/IDF" feature extraction method, I had to adapt the training function such that it works for the features generated by the `TfidfVectorizer` in the `split_and_preprocess_data_tfidf()` function.
    - I imported vstack from scipy.sparse so that a sparse matrix can be created for training the classifier. The features are taken from the data into a matrix where each row represents the documents and each column represents the terms. The sparse matrix has been used for efficient memory usage as matrices from TF/IDF features can be quite large.
    - The LinearSVC model is then trained on the sparse matrix and the extracted labels.

I had experimented with the `C` parameter and tried using `class_weight=balanced` for LinearSVC, however it didn't result in any gains and sometimes even made the model worse so I have chosen not to include it in the final implementation

In [107]:
from scipy.sparse import vstack

def train_classifier(data, config):
    if config.get("feature_extraction_method") == "BoW" or config.get("feature_extraction_method") == "presence":
        print("(BoW/presence)Training Classifier...")
        pipeline =  Pipeline([('svc', LinearSVC())])
        return SklearnClassifier(pipeline).train(data)
    else:
        print("(TF/IDF)Training Classifier...")
        # Stack sparse matrices to create a 2D sparse matrix for LinearSVC
        X_train = vstack([sample[0] for sample in data])  # vstack is used to keep X_train in sparse format
        y_train = [sample[1] for sample in data]  # Extract labels
        
        model = LinearSVC()
        model.fit(X_train, y_train)
        return model

In [108]:
def cross_validate(dataset, folds, config):
    results = []
    fold_size = int(len(dataset)/folds) + 1
    best_fold = None
    best_accuracy = 0
    for i in range(0,len(dataset),int(fold_size)):
        # insert code here that trains and tests on the 10 folds of data in the dataset
        # print("Fold start on items %d - %d" % (i, i+fold_size))
        # FILL IN THE METHOD HERE
        train_data, test_data  = dataset[:i]+dataset[i+fold_size:], dataset[i:i+fold_size]
        test_inputs = [data[0] for data in test_data]
        test_labels = [data[1] for data in test_data]
        classifier = train_classifier(train_data, config)
        predicted_labels = predict_labels(test_inputs, classifier, config)
        precision, recall, fscore, _ = precision_recall_fscore_support(test_labels, predicted_labels, average="weighted")
        accuracy = accuracy_score(test_labels, predicted_labels)
        cv_results = [precision, recall, fscore, accuracy]
        results.append(cv_results)
        print(accuracy)
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_fold = cv_results
    return results, best_fold, best_accuracy

## Predicting Labels for Different Feature Extraction Methods
- `classifier.classify_many(samples)` is used for the `Presence of Words` and `Bag of Words` implementation just like the implementation for questions 1-4.
- However for `TF/IDF` method, `classifier.predict(samples)` is used instead. This is mainly because the classifier is trained on a sparse matrix. `.predict()` function supports sparse matrix inputs and thus was a natural choice for this task.

In [109]:
# PREDICTING LABELS GIVEN A CLASSIFIER

def predict_labels(samples, classifier, config):
    """Assuming preprocessed samples, return their predicted labels from the classifier model."""
    if config.get("feature_extraction_method") == "presence" or config.get("feature_extraction_method") == "BoW":
        return classifier.classify_many(samples)
    elif config.get("feature_extraction_method") == "tfidf":
        return classifier.predict(samples)


# def predict_label_from_raw(sample, classifier):
#     """Assuming raw text, return its predicted label from the classifier model."""
#     return classifier.classify(to_feature_vector(pre_process(reviewSample)))

In [110]:
def confusion_matrix_heatmap(y_test, preds, labels):
    """Function to plot a confusion matrix"""
    cm = metrics.confusion_matrix(y_test, preds, labels=labels)
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm)
    plt.title('Confusion matrix of the classifier')
    fig.colorbar(cax)
    ax.set_xticks(np.arange(len(labels)))
    ax.set_yticks(np.arange(len(labels)))
    ax.set_xticklabels( labels, rotation=45)
    ax.set_yticklabels( labels)

    for i in range(len(cm)):
        for j in range(len(cm)):
            text = ax.text(j, i, cm[i, j],
                           ha="center", va="center", color="w")

    plt.xlabel('Predicted')
    plt.ylabel('True')

## Automation Script to run all scenarios and get all results at once!

Just to quickly remind what `traind`, `traind1`, `testd`, `testd1`, `gfd` are:
- `traind` and `testd`: Contains a list of tuples with the first element being a feature and the second being the associated label. `traind` contains all of the feature-label tuples used for training while `testd` contains all of the feature-label tuples used for testing. The train and test data was split in one of the `split_and_preprocess_data` functions.
- `traind1` and `testd1`: Contains tuples with the original text samples alongside the features and labels. These variables have the same features and labels which are also part of `traind` and `testd`.
- `gfd`: This is the Global Feature Dictionary.


1. Here the metrics can be computed for both the first fold as well as through the full cross-validation. The preferred method would be to use the first fold of the cross-validation to speed up the computation.
<!-- 1. Here I'm performing just the first fold of cross validation as it's the method I used for error analysis and it's also quicker than running the cross-validation function which would run for all 26 cases in one go which would take forever to execute. -->
2. If anyone wants to check the check the cross_validation results for each case, the code for it has been just commented right above so one can uncomment it and test that version as well.
    - In that case comment the lines `fold_size = ...` to `accuracy = accuracy_score(test_labels, predicted_labels)`, before you run the code for cross-validation.
3. I return the text description of each configuration alongside all of the performance metrics inside a dictionary. This way I can convert the results into a panda DataFrame later which makes it nicer to see all of the data for each of the cases in one place.
4. I also put the code for training on the entire training data right below the normal computation if one wants to see how the model performs in that case. Set `functions_complete` variable to True for the if and elif statements to see how it works for each of the 26 different cases.

In [111]:
def run_all_configurations(configurations):
    results = []
    for i, config in enumerate(configurations):
        print(f"Experiment {i} ", end="")
        traind, traind1, testd, testd1, gfd = [], [], [], [], {}

        if config.get("feature_extraction_method") == "BoW" or config.get("feature_extraction_method") == "presence":
            split_and_preprocess_data(0.8, config, traind, traind1, testd, testd1, gfd)

            ## If anyone wants to try how the different configurations perform if all folds of cross validation are run. 
            ## WARNING -> This can run for upto 20+ minutes if run with all configurations enabled. 
            ## However you can reduce the amount of time by commenting out some of the cases from the configuration list.
            
            # cv = cross_validate(traind, 10, config)
            # precision, recall, fscore, accuracy = cv[1]

            fold_size = int(len(traind)/10) + 1
            train_data_2, test_data_2  = traind1[fold_size:], traind1[:fold_size]
            test_text, test_inputs, test_labels = [data[0] for data in test_data_2], [data[1] for data in test_data_2], [data[2] for data in test_data_2]
            classifier = train_classifier(traind[fold_size:], config)
            predicted_labels = predict_labels(test_inputs, classifier, config)
            precision, recall, fscore, _ = precision_recall_fscore_support(test_labels, predicted_labels, average="weighted")
            accuracy = accuracy_score(test_labels, predicted_labels)

            results.append({
                "config": config["description"],
                "precision": precision,
                "recall": recall,
                "f1_score": fscore,
                "accuracy": accuracy,
            })

            
            # Finally, check the accuracy of your classifier by training on all the traning data
            # and testing on the test set
            functions_complete = False  # set to True once you're happy with your methods for cross val
            if functions_complete:
                print(testd[0])   # have a look at the first test data instance
                classifier = train_classifier(traind, config)  # train the classifier
                test_true = [t[1] for t in testd]   # get the ground-truth labels from the data
                test_pred = predict_labels([x[0] for x in testd], classifier, config)  # classify the test data to get predicted labels
                precision, recall, fscore, _ = precision_recall_fscore_support(test_true, test_pred, average='weighted') # evaluate
                accuracy = accuracy_score(test_true, test_pred)
                print("Done training!")
                print(f"Precision: {precision} | Recall: {recall} | F Score:{fscore} | Accuracy:{accuracy}")
        
        elif config.get("feature_extraction_method") == "tfidf":
            traind, testd, tfidf_vectorizer = split_and_preprocess_data_tfidf(0.8, config)

            fold_size = int(len(traind)/10) + 1
            train_data_2, test_data_2  = traind[fold_size:], traind[:fold_size]
            test_inputs, test_labels = vstack([data[0] for data in test_data_2]), [data[1] for data in test_data_2]
            classifier = train_classifier(traind[fold_size:], config)
            predicted_labels = predict_labels(test_inputs, classifier, config)
            precision, recall, fscore, _ = precision_recall_fscore_support(test_labels, predicted_labels, average="weighted")
            accuracy = accuracy_score(test_labels, predicted_labels)

            results.append({
                "config": config["description"],
                "precision": precision,
                "recall": recall,
                "f1_score": fscore,
                "accuracy": accuracy,
            })

            # Finally, check the accuracy of your classifier by training on all the traning data
            # and testing on the test set
            functions_complete = False  # set to True once you're happy with your methods for cross val
            if functions_complete:
                print(testd[0])   # have a look at the first test data instance
                test_inputs = vstack([data[0] for data in testd])
                test_true = [data[1] for data in testd]
                classifier = train_classifier(traind, config)
                test_pred = predict_labels(test_inputs, classifier, config)
                precision, recall, fscore, _ = precision_recall_fscore_support(test_true, test_pred, average='weighted')
                accuracy = accuracy_score(test_true, test_pred)
                print("Done training!")
                print(f"Precision: {precision} | Recall: {recall} | F Score:{fscore} | Accuracy:{accuracy}")
        print("--------------------------------------------------------------------------------------------")
    return results

In [112]:
# Run all configurations for this dataset
# WARNING: This function will take upto 5 minutes or more to execute if run with all configurations (with first fold of cross-validation)
results = run_all_configurations(configurations)
print("Done!")

Experiment 0 (BoW/presence)Training Classifier...
--------------------------------------------------------------------------------------------
Experiment 1 (BoW/presence)Training Classifier...
--------------------------------------------------------------------------------------------
Experiment 2 (BoW/presence)Training Classifier...
--------------------------------------------------------------------------------------------
Experiment 3 (BoW/presence)Training Classifier...
--------------------------------------------------------------------------------------------
Experiment 4 (BoW/presence)Training Classifier...
--------------------------------------------------------------------------------------------
Experiment 5 (BoW/presence)Training Classifier...
--------------------------------------------------------------------------------------------
Experiment 6 (BoW/presence)Training Classifier...
--------------------------------------------------------------------------------------------



--------------------------------------------------------------------------------------------
Experiment 8 (BoW/presence)Training Classifier...
--------------------------------------------------------------------------------------------
Experiment 9 (TF/IDF)Training Classifier...
--------------------------------------------------------------------------------------------
Done!


## The Final Results!

These results are discussed in the report, explaining which approaches performed better and which ones underperformed and potential reasons for why that might have happened. I have stored the results in the file "Final_Results.xlsx" for all of the cases I tested for so please check that to see the metrics for all results in one place.

In [113]:
results_df = pd.DataFrame(results)
results_df

Unnamed: 0,config,precision,recall,f1_score,accuracy
0,Presence/Absence of Word + Punctuation Separation,0.853747,0.856185,0.854294,0.856185
1,Bag of words + Punctuation Separation,0.855645,0.858048,0.856153,0.858048
2,Bag of words + Bigrams + Punctuation Removal,0.870122,0.872206,0.869572,0.872206
3,Bag of words + Trigrams + Punctuation Removal ...,0.868971,0.871088,0.868345,0.871088
4,Bag of words + Bigrams + Punctuation Separatio...,0.866994,0.869225,0.866588,0.869225
5,Bag of words + Quadgrams + Punctuation Separation,0.868549,0.870343,0.866806,0.870343
6,Bag of words + Bigrams + Punctuation Removal +...,0.877065,0.878912,0.876735,0.878912
7,Bag of words + Bigrams + Trigrams + Quadgrams ...,0.843002,0.841282,0.831153,0.841282
8,Bag of words + Bigrams + Punctuation Removal +...,0.867455,0.869598,0.866647,0.869598
9,TFIDF + Bigrams + Punctuation Removal + Tag Re...,0.863522,0.865872,0.863604,0.865872
