# White Noise: Classic Supervised Machine Learning

## 1. Explaining the Problem

I must train a classifier on `labelled.csv`, a data set that contains 2200 bill summaries sampled from the 111th and 115th Congresses of the United States (House of Representatives and Senate). These correspond, respectively, to the first two years of the Obama presidency (2009-11), in which the Democrats held both chambers, and the first two years of the Trump presidency (2017-19), in which the Republicans held both chambers. This classifier will help me automatically label the remaining 19620 bill summaries retrieved from `api.congress.gov` as Economic / Non-Economic and as Socio-Cultural / Non-Socio-Cultural. I first follow a "classic" Bag-Of-Words approach, and only subsequently turn to the state-of-the-art feature representation and modelling provided by BERT transformers, working on Google Colab.

The textual data has been pre-processed by removing HTML tags and character escapes. Lowercasing and the removal of punctuation and special characters are left to the default preprocessing and tokenization built into the `sklearn` vectorizers. The labels are already tailored to the classification problem at hand, which involves identifying economic or socio-cultural content within the bill summaries. Consequently, I can proceed directly with extracting the bill summaries and labels and splitting them into suitable sets for training, validation, and testing.
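The cleaning step described above is not shown in this notebook. As a minimal sketch of what it could look like, assuming a hypothetical helper `clean_summary` and an invented example string (neither comes from the original pipeline):

```python
import html
import re

def clean_summary(raw: str) -> str:
    """Hypothetical helper: strip HTML tags, then decode character escapes."""
    # Replace every tag with a space so neighbouring words do not get glued together.
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    # Decode escapes such as "&amp;" only after the tags are gone,
    # so that an escaped "&lt;" is never mistaken for the start of a tag.
    return html.unescape(no_tags)

# Invented example, not taken from labelled.csv
example = "<p><b>Example Act</b></p> <p>This bill amends research &amp; development rules.</p>"
cleaned = clean_summary(example)
print(cleaned.split())
```

Replacing tags with spaces leaves runs of whitespace behind, which is harmless here because the vectorizers tokenize on word boundaries.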

In [1]:
# General packages for data handling and wrangling
import pandas as pd
import numpy as np
import joblib

# Classic SML: Tokenization
import nltk
from nltk.tokenize import TreebankWordTokenizer, WhitespaceTokenizer
from nltk.corpus import stopwords
nltk.download("stopwords")

# Classic SML: General
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline, Pipeline

# Classic SML: Preprocessing
from sklearn import preprocessing

# Classic SML: Train/test splits, cross validation, gridsearch
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    GridSearchCV,
)

# Classic SML: Classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Classic SML: Model evaluation
from sklearn import metrics

[nltk_data] Downloading package stopwords to C:\Users\Mattia aka
[nltk_data]     Mario\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Generating the Training, Validation, and Testing Dataset Splits

In [2]:
# I start by importing the "labelled.csv" data set as a DataFrame object within the Python environment.
# I crucially specify the "|" separator, because employing colons or semi-colons causes conflicts with the summaries' contents.

d = pd.read_csv("labelled.csv", sep = "|")

In [3]:
# I check the first few lines of the DataFrame object to assess if the "read_csv" command worked smoothly

d.head()

| | congress | bill_number | bill_type | text | economic | socio_cultural |
|---|---|---|---|---|---|---|
| 0 | 115 | 1308 | hr | Frank and Jeanne Moore Wild Steelhead Speci... | Non-Economic | Socio-Cultural |
| 1 | 115 | 4105 | hr | This bill extends funding through FY2022 for... | Economic | Non-Socio-Cultural |
| 2 | 115 | 3691 | s | Expanding Transparency of Information and S... | Non-Economic | Socio-Cultural |
| 3 | 111 | 1994 | hr | Citizen Soldier Equality Act of 2009 - Requi... | Economic | Socio-Cultural |
| 4 | 111 | 883 | hr | Amends the Internal Revenue Code to repeal, e... | Economic | Socio-Cultural |


In [4]:
# I check the shape of the DataFrame object to assess if the "read_csv" command worked smoothly

d.shape

(2200, 6)

The data set contains 2200 classified documents and six columns - i.e., the four original columns I retrieved from `api.congress.gov`, plus the two columns that contain the categories I manually annotated. Everything looks as expected. I can now convert the columns storing the textual data (`text`), the economic labels (`economic`), and the socio-cultural labels (`socio_cultural`) into three separate lists, which I subsequently split into suitable sets for training, validation, and testing.

The following cell's code is inspired by the official documentation for the `tolist()` `pandas` method, available at https://pandas.pydata.org/docs/reference/api/pandas.Series.tolist.html.

In [5]:
# I unpack the columns of interest into three separate lists with the .tolist() pandas method.

text = d["text"].tolist() # Textual data
economic = d["economic"].tolist() # Economic labels
socio_cultural = d["socio_cultural"].tolist() # Socio-cultural labels

In [6]:
# I check the first 5 elements and overall lengths of the three lists to assess whether this data wrangling step went smoothly.

text[:5]

['   Frank and Jeanne Moore Wild Steelhead Special Management Area Designation Act      This bill designates approximately 99,653 acres of Forest Service land in Oregon as the  Frank and Jeanne Moore Wild Steelhead Special Management Area.  ',
 '  This bill extends funding through FY2022 for the Department of Health and Human Services to award grants to states and certain other entities for demonstration projects that address health-professions workforce needs.  ',
 '   Expanding Transparency of Information and Safeguarding Toxics (EtO is Toxic) Act of 2018    This bill updates requirements for chemicals that pose an adverse public health risk. Specifically, the bill requires the Environmental Protection Agency (EPA) to publish an updated National Air Toxics Assessment once every two years. The assessment uses emissions data to estimate health risks from toxic air pollutants.    The bill also requires the EPA to use data from its Integrated Risk Information System when conducting rulem

In [7]:
print(f"The textual data list's total length is {len(text)}.")

The textual data list's total length is 2200.


In [8]:
economic[:5]

['Non-Economic', 'Economic', 'Non-Economic', 'Economic', 'Economic']

In [9]:
print(f"The economic label list's total length is {len(economic)}.")

The economic label list's total length is 2200.


In [10]:
socio_cultural[:5]

['Socio-Cultural',
 'Non-Socio-Cultural',
 'Socio-Cultural',
 'Socio-Cultural',
 'Socio-Cultural']

In [11]:
print(f"The socio-cultural label list's total length is {len(socio_cultural)}.")

The socio-cultural label list's total length is 2200.


All data was correctly dumped into separate lists. Now, I proceed to split them into suitable sets for training, validation, and testing with the `sklearn` `train_test_split` function.

In [12]:
# I set a given random seed to make my work reproducible. For the record, the 27th of August is my birthday.
my_seed = 27

# Running the train vs test split with the standard 80% vs 20% ratio.
text_train, text_test, econ_train, econ_test, sc_train, sc_test = train_test_split(
    text, economic, socio_cultural, test_size = 0.2, random_state = my_seed)

In [13]:
# Checking whether this splitting step went smoothly for the bill summaries...
print(f"The text sets have {len(text_train)} training instances and {len(text_test)} testing instances.")

The text sets have 1760 training instances and 440 testing instances.


In [14]:
# ...and for their labels.
print(f"The economic label sets have {len(econ_train)} training instances and {len(econ_test)} testing instances.")
print(f"The socio-cultural label sets have {len(sc_train)} training instances and {len(sc_test)} testing instances.")

The economic label sets have 1760 training instances and 440 testing instances.
The socio-cultural label sets have 1760 training instances and 440 testing instances.


I further split the remaining data into training and validation sets. This time, I apply a 75% versus 25% split ratio, saving one fourth of the remaining bill summaries and their respective labels for validation purposes, because I want the validation and testing sets to be as close in size as possible.
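The arithmetic behind the two-step split can be checked with a short sketch. It only reuses the figures reported in this notebook (2200 summaries, `test_size = 0.2`, then `test_size = 0.25`); the variable names are illustrative.

```python
from fractions import Fraction

n_total = 2200

# Step 1: 20% of all summaries are held out for testing.
n_test = int(n_total * Fraction(1, 5))
n_remaining = n_total - n_test

# Step 2: 25% of what remains is held out for validation.
n_valid = int(n_remaining * Fraction(1, 4))
n_train = n_remaining - n_valid

for name, n in [("train", n_train), ("validation", n_valid), ("test", n_test)]:
    print(f"{name}: {n} instances ({float(Fraction(n, n_total)):.0%} of the data)")
```

A 25% share of the remaining 80% equals 20% of the full data set, which is why the validation and testing sets end up with the same size and the overall split is 60% / 20% / 20%.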

In [15]:
# Running the train vs validate split with a 75% vs 25% ratio.
text_train, text_valid, econ_train, econ_valid, sc_train, sc_valid = train_test_split(
    text_train, econ_train, sc_train, test_size = 0.25, random_state = my_seed)

In [16]:
# Checking whether this splitting step went smoothly for the bill summaries...
print(f"The text sets have {len(text_train)} training instances and {len(text_valid)} validation instances.")

The text sets have 1320 training instances and 440 validation instances.


In [17]:
# ...and for their labels.
print(f"The economic label sets have {len(econ_train)} training instances and {len(econ_valid)} validation instances.")
print(f"The socio-cultural label sets have {len(sc_train)} training instances and {len(sc_valid)} validation instances.")

The economic label sets have 1320 training instances and 440 validation instances.
The socio-cultural label sets have 1320 training instances and 440 validation instances.


## 3. Finding the Best Classifiers

The code and best practices in Sections 3 to 5 are inspired by Chapters 8, 10, and 11 of van Atteveldt, Trilling, and Arcila Calderón's *Computational Analysis of Communication* (2022).

Now, I adopt a classic SML approach and look for the best vectorizer + classifier combinations by running several pipelines, before eventually tuning the hyperparameters of the most promising configurations. I have already partly pre-processed the textual data by removing HTML boilerplate - i.e., character escapes and tags. The built-in default tokenizers of `CountVectorizer` and `TfidfVectorizer` will handle the remaining special characters, remove punctuation, apply lowercasing, and eliminate whitespace, to achieve higher consistency in the textual data and prepare it for the Bag-Of-Words (BOW) transformation.

In terms of vectorizers, I evaluate `CountVectorizer` against `TfidfVectorizer`. The former represents features with raw word counts, while the latter applies a weighting formula that gives rarer features more importance in the classification task. Relatively rare terms such as "Hezbollah" - i.e., a Lebanese Shia Islamist political party and militant group that the USA deems a terrorist group - which appears in only four documents, are strong indicators of a Socio-Cultural bill summary, because terrorism is a prominent socio-cultural issue in American politics. I therefore expect `TfidfVectorizer` to achieve superior performance.
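To illustrate the difference between the two representations, here is a minimal sketch on an invented three-document corpus (the sentences are made up for illustration and are not taken from `labelled.csv`). Both vectorizers are used with their default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented toy corpus: "bill" occurs in every document, "hezbollah" in one only.
docs = [
    "the bill sanctions hezbollah",
    "the bill funds highways",
    "the bill amends tax rules",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
counts = count_vec.fit_transform(docs)
weights = tfidf_vec.fit_transform(docs)

# Compare the common and the rare term within the first document.
count_common = counts[0, count_vec.vocabulary_["bill"]]
count_rare = counts[0, count_vec.vocabulary_["hezbollah"]]
weight_common = weights[0, tfidf_vec.vocabulary_["bill"]]
weight_rare = weights[0, tfidf_vec.vocabulary_["hezbollah"]]

print(f"Raw counts:     bill = {count_common}, hezbollah = {count_rare}")
print(f"TF-IDF weights: bill = {weight_common:.3f}, hezbollah = {weight_rare:.3f}")
```

Both terms occur once in the first document, so their raw counts are identical, whereas the TF-IDF weighting assigns the rarer term the larger weight.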

All vectorizers are set to prune words that appear in more than 75% of the summaries. This makes the default `nltk` English stopwords list redundant, as it follows the same logic of factoring out extremely frequent words. Moreover, I do not impose any lower limit in pruning, since some summaries are very short and contain unique, yet very informative words. For example, Bill number 187 of the 115th Senate reads: "Provides for the relief of Alemseghed Mussie Tesfamical". This piece of legislation establishes that Alemseghed Mussie Tesfamical, a former soldier from Eritrea, is eligible for an immigrant visa or for admission for permanent residence, to avoid his deportation back to his home country. By excluding the words "Alemseghed", "Mussie", and "Tesfamical", I would likely retain only two words: "Provides" and "relief". This could make the classification task quite challenging for the computer. Therefore, I set the `min_df` parameter to 0. These thresholds serve as baselines, and they will be modified further when I fine-tune the hyperparameters of the best classifiers, following a less theory-driven and more data-driven approach.
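The effect of the pruning thresholds can be sketched on a toy corpus. The four sentences below are invented paraphrases in the style of the summaries, not actual rows of the data set. The sketch uses `min_df = 1`, the `sklearn` default, which likewise applies no lower pruning because every vocabulary term occurs in at least one document; as far as I am aware, some recent `sklearn` releases reject an integer `0` for this parameter.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus: "this" and "bill" appear in all four documents.
docs = [
    "this bill provides for the relief of alemseghed mussie tesfamical",
    "this bill amends the internal revenue code",
    "this bill extends funding for health grants",
    "this bill designates forest service land in oregon",
]

# Vocabulary without any pruning versus vocabulary with the 75% upper threshold.
full_vocab = set(CountVectorizer().fit(docs).vocabulary_)
pruned_vocab = set(CountVectorizer(min_df=1, max_df=0.75).fit(docs).vocabulary_)

dropped = full_vocab - pruned_vocab
print("Dropped by max_df = 0.75:", sorted(dropped))
print("Rare names kept:", {"alemseghed", "mussie", "tesfamical"} <= pruned_vocab)
```

Only the terms present in more than 75% of the documents are removed, while the names that occur in a single summary survive.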

Turning to classifiers, I assess several models:
1. The Multinomial Naive-Bayes Classifier;
2. The Logistic Regression Classifier, with the `liblinear` solver;
3. The Support Vector Machine Classifier, with the `linear` kernel trick;
4. The Support Vector Machine Classifier, with the `rbf` kernel trick;
5. The Random Forest Classifier, with 100, 500, and 1000 estimators.

I include the Multinomial Naive-Bayes as a baseline model. The Logistic Regression with the `liblinear` solver is my potential gold standard, since it usually fares really well in analogous classification tasks. I also test SVMs with two different kernel tricks because minimizing hinge loss instead of logistic loss might lead to superior performance if the Logistic Regression yields unsatisfactory results. On a final note, I must make an important observation on Random Forest classifiers explicit. I have no expectations regarding the form of the relationship between my BOW summary features and the Economic / Non-Economic and Socio-Cultural / Non-Socio-Cultural labels. Therefore, I wish to assess whether non-linear decision trees yield better approximations of such associations. I test for the decision process' complexity by specifying an increasing number of estimators. For each test run, I choose the best model, which I will fine-tune later on.
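The difference between the two loss functions mentioned above can be made concrete with a small sketch. It assumes the usual textbook formulation in which labels are coded as -1 / +1 and the margin is the label multiplied by the classifier's raw score; the function names are illustrative and are not part of `sklearn`.

```python
import numpy as np

def logistic_loss(margin):
    """Logistic loss log(1 + exp(-m)) for margin m = y * f(x)."""
    return np.log1p(np.exp(-margin))

def hinge_loss(margin):
    """Hinge loss max(0, 1 - m) for margin m = y * f(x)."""
    return np.maximum(0.0, 1.0 - margin)

# Negative margins are misclassifications, large positive margins are confident hits.
margins = np.array([-2.0, 0.0, 0.5, 1.0, 3.0])

for m, log_l, hinge_l in zip(margins, logistic_loss(margins), hinge_loss(margins)):
    print(f"margin = {m:+.1f}   logistic = {log_l:.3f}   hinge = {hinge_l:.3f}")
```

The hinge loss drops to exactly zero once an instance is classified with a margin of at least 1, so such instances stop influencing the fit, whereas the logistic loss keeps rewarding ever larger margins.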

In [18]:
# a. Economic / Non-economic

# 1a. Multinomial Naive-Bayes
# +
# 2a. Logistic Regression ("liblinear")

configs = [ # Saving all the vectorizer + classifier configurations of interest
    ("NB with Count", CountVectorizer(min_df = 0, max_df = .75), MultinomialNB()),
    ("NB with TfIdf", TfidfVectorizer(min_df = 0, max_df = .75), MultinomialNB()),
    
    ("LogReg with Count", CountVectorizer(min_df = 0, max_df = .75),
     LogisticRegression(solver = "liblinear")),
    
    ("LogReg with TfIdf", TfidfVectorizer(min_df = 0, max_df =.75),
     LogisticRegression(solver = "liblinear")),
]

# Instead of fitting the vectorizer and classifier separately, I combine them in a pipeline!

# I loop over all the desired configurations in 'configs'.
for name, vectorizer, classifier in configs: 
    print(name) # I print the given name of the classifier-vectorizer combination.
    
    pipe = make_pipeline(vectorizer, classifier) # I make a pipeline that combines the given vectorizer and classifier.
    pipe.fit(text_train, econ_train) # I fit the training data on the pipeline.
    
    econ_pred = pipe.predict(text_valid) # I predict the labels from the text database I set aside for validation.
    
    # I print a classification report for the predicted values against the true labels from the validation database.
    print(metrics.classification_report(econ_valid, econ_pred))
    
    # I add a new line for pretty printing.
    print("\n")

NB with Count
              precision    recall  f1-score   support

    Economic       0.79      0.85      0.82       250
Non-Economic       0.78      0.71      0.75       190

    accuracy                           0.79       440
   macro avg       0.79      0.78      0.78       440
weighted avg       0.79      0.79      0.79       440



NB with TfIdf
              precision    recall  f1-score   support

    Economic       0.67      0.98      0.80       250
Non-Economic       0.92      0.38      0.54       190

    accuracy                           0.72       440
   macro avg       0.80      0.68      0.67       440
weighted avg       0.78      0.72      0.69       440



LogReg with Count
              precision    recall  f1-score   support

    Economic       0.78      0.80      0.79       250
Non-Economic       0.73      0.71      0.72       190

    accuracy                           0.76       440
   macro avg       0.76      0.75      0.76       440
weighted avg       0.76 

The most promising model appears to be the **Naive-Bayes classifier with the `CountVectorizer`**. It does not achieve the highest accuracy of this test run (0.79), but it maintains consistently strong figures for both categories, except for the relatively low recall for the Non-Economic label (0.71). This seems to be the Achilles' heel of all the vectorizer + classifier combinations in this test run. The **Logistic Regression classifier with the `liblinear` solver and the `TfidfVectorizer`**, which shows the highest overall accuracy (0.90) and great precision for the "Economic" class (0.84), suffers from the same shortcoming - i.e., an even more problematic recall of only 0.67 for the Non-Economic category.

In general, this could be a critical issue when automatically labelling the rest of the summaries, because all models appear to artificially inflate the number of bill summaries annotated as "Economic". Therefore, it is crucial that the fine-tuned classifiers perform solidly across the board, and that is why I deem the **Naive-Bayes classifier with the `CountVectorizer`** to have potentially the brightest outlook.
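The link between a low Non-Economic recall and an inflated "Economic" count can be worked out from the classification reports. The sketch below uses a hypothetical helper, `implied_predicted_positive`, and plugs in the recall and support values printed above for the two Naive-Bayes models. Since those figures are rounded to two decimals, the results are approximations.

```python
def implied_predicted_positive(recall_pos, support_pos, recall_neg, support_neg):
    """Approximate how many instances a model labels as the positive class."""
    true_positives = recall_pos * support_pos            # positives that are found
    false_positives = (1.0 - recall_neg) * support_neg   # negatives that are missed
    return true_positives + false_positives

true_economic = 250  # support of the Economic class in the validation set

# Recall values as reported for the validation set (Economic, Non-Economic).
nb_count = implied_predicted_positive(0.85, 250, 0.71, 190)
nb_tfidf = implied_predicted_positive(0.98, 250, 0.38, 190)

print(f"True Economic summaries:         {true_economic}")
print(f"NB with Count predicts roughly:  {nb_count:.0f}")
print(f"NB with TfIdf predicts roughly:  {nb_tfidf:.0f}")
```

Every Non-Economic summary that a model misses is necessarily labelled "Economic", so the lower the Non-Economic recall, the further the predicted "Economic" count overshoots the true one.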

In [19]:
# 3a. Support Vector Machine ("linear")
# +
# 4a. Support Vector Machine ("rbf")

configs = [ # Saving all the vectorizer + classifier configurations of interest
    ("SVM with Count - linear kernel", CountVectorizer(min_df = 0, max_df = .75),
     SVC(kernel = "linear")),
    
    ("SVM with Count - rbf kernel", CountVectorizer(min_df = 0, max_df = .75),
     SVC(kernel = "rbf")),
    
    ("SVM with Tfidf - linear kernel", TfidfVectorizer(min_df = 0, max_df = .75),
     SVC(kernel = "linear")),
    
    ("SVM with Tfidf - rbf kernel", TfidfVectorizer(min_df = 0, max_df = .75),
     SVC(kernel = "rbf")),
]

# Instead of fitting the vectorizer and classifier separately, I combine them in a pipeline!

# I loop over all the desired configurations in 'configs'.
for name, vectorizer, classifier in configs:
    
    # I print the given name of the classifier-vectorizer combination.
    print(name)
    
    pipe = make_pipeline(vectorizer, classifier) # I make a pipeline that combines the given vectorizer and classifier.
    pipe.fit(text_train, econ_train) # I fit the training data on the pipeline.
    
    econ_pred = pipe.predict(text_valid) # I predict the labels from the text database I set aside for validation.
    
    # I print a classification report for the predicted values against the true labels from the validation database.
    print(metrics.classification_report(econ_valid, econ_pred))
    
    # I add a new line for pretty printing.
    print("\n")

SVM with Count - linear kernel
              precision    recall  f1-score   support

    Economic       0.75      0.76      0.76       250
Non-Economic       0.68      0.67      0.68       190

    accuracy                           0.72       440
   macro avg       0.72      0.72      0.72       440
weighted avg       0.72      0.72      0.72       440



SVM with Count - rbf kernel
              precision    recall  f1-score   support

    Economic       0.73      0.84      0.78       250
Non-Economic       0.74      0.58      0.65       190

    accuracy                           0.73       440
   macro avg       0.73      0.71      0.72       440
weighted avg       0.73      0.73      0.72       440



SVM with Tfidf - linear kernel
              precision    recall  f1-score   support

    Economic       0.81      0.84      0.83       250
Non-Economic       0.78      0.74      0.76       190

    accuracy                           0.80       440
   macro avg       0.79      0.79 

When utilising a `TfidfVectorizer`, substituting the logistic loss with the hinge loss seems to help the classifiers' performance. **SVM classifiers combined with the `TfidfVectorizer`**, regardless of the kernel trick I employ, are the best solutions for this run. The **SVM classifier with the `rbf` kernel trick** exhibits the highest overall accuracy (0.81), but the **SVM classifier with the `linear` kernel trick** has a better recall for the Non-Economic category - i.e., 0.74 against 0.72. This means that the linear model is less likely to artificially inflate the general count of bill summaries classified as Economic.

In [20]:
# 5a. Random Forests

configs = [ # Saving all the vectorizer + classifier configurations of interest
    ('RF with Count - 100 estimators', CountVectorizer(min_df = 0, max_df = .75),
     RandomForestClassifier(n_estimators = 100)),
    
    ('RF with Tfidf - 100 estimators', TfidfVectorizer(min_df = 0, max_df = .75),
     RandomForestClassifier(n_estimators = 100)),
    
    ('RF with Count - 500 estimators', CountVectorizer(min_df = 0, max_df = .75),
     RandomForestClassifier(n_estimators = 500)),
    
    ('RF with Tfidf - 500 estimators', TfidfVectorizer(min_df = 0, max_df = .75),
     RandomForestClassifier(n_estimators = 500)),
    
    ('RF with Count - 1000 estimators', CountVectorizer(min_df = 0, max_df = .75),
     RandomForestClassifier(n_estimators = 1000)),
    
    ('RF with Tfidf - 1000 estimators', TfidfVectorizer(min_df = 0, max_df = .75),
     RandomForestClassifier(n_estimators = 1000)),
]

# Instead of fitting the vectorizer and classifier separately, I combine them in a pipeline!

# I loop over all the desired configurations in 'configs'.
for name, vectorizer, classifier in configs:
    
    # I print the given name of the classifier-vectorizer combination.
    print(name)
    
    pipe = make_pipeline(vectorizer, classifier) # I make a pipeline that combines vectorizer and classifier.
    pipe.fit(text_train, econ_train) # I fit the training data on the pipeline.
    
    econ_pred = pipe.predict(text_valid) # I predict the labels from the text database I set aside for validation.
    
    # I print a classification report for the predicted values against the true labels from the validation database.
    print(metrics.classification_report(econ_valid, econ_pred))
    
    # I add a new line for pretty printing.
    print("\n")

RF with Count - 100 estimators
              precision    recall  f1-score   support

    Economic       0.77      0.82      0.79       250
Non-Economic       0.74      0.67      0.70       190

    accuracy                           0.76       440
   macro avg       0.75      0.75      0.75       440
weighted avg       0.76      0.76      0.75       440



RF with Tfidf - 100 estimators
              precision    recall  f1-score   support

    Economic       0.75      0.86      0.80       250
Non-Economic       0.77      0.62      0.68       190

    accuracy                           0.75       440
   macro avg       0.76      0.74      0.74       440
weighted avg       0.76      0.75      0.75       440



RF with Count - 500 estimators
              precision    recall  f1-score   support

    Economic       0.77      0.84      0.80       250
Non-Economic       0.76      0.67      0.71       190

    accuracy                           0.76       440
   macro avg       0.76      0.

Random Forest classifiers do not fare well relative to their added computational complexity. The **Random Forest classifier with 500 estimators and the `TfidfVectorizer`** shows the best performance of the test run, due to its overall accuracy (0.79) and acceptable recall (0.71) for the Non-Economic category. However, since its metrics are almost identical to those of the baseline model - i.e., the **Naive-Bayes classifier with the `CountVectorizer`** - I determine that the non-existent improvement is not worth the remarkable time and effort required to tune the hyperparameters of a Random Forest classifier with 500 estimators.

In [21]:
# b. Socio-Cultural / Non-Socio-Cultural

# 1b. Multinomial Naive-Bayes
# +
# 2b. Logistic Regression ("liblinear")

configs = [ # Saving all the vectorizer + classifier configurations of interest
    ("NB with Count", CountVectorizer(min_df = 0, max_df = .75), MultinomialNB()),
    ("NB with TfIdf", TfidfVectorizer(min_df = 0, max_df = .75), MultinomialNB()),
    
    ("LogReg with Count", CountVectorizer(min_df = 0, max_df = .75),
     LogisticRegression(solver = "liblinear")),
    
    ("LogReg with TfIdf", TfidfVectorizer(min_df = 0, max_df =.75),
     LogisticRegression(solver = "liblinear")),
]

# Instead of fitting the vectorizer and classifier separately, I combine them in a pipeline!

# I loop over all the desired configurations in 'configs'.
for name, vectorizer, classifier in configs: 
    print(name) # I print the given name of the classifier-vectorizer combination.
    
    pipe = make_pipeline(vectorizer, classifier) # I make a pipeline that combines the given vectorizer and classifier.
    pipe.fit(text_train, sc_train) # I fit the training data on the pipeline.
    
    sc_pred = pipe.predict(text_valid) # I predict the labels from the text database I set aside for validation.
    
    # I print a classification report for the predicted values against the true labels from the validation database.
    print(metrics.classification_report(sc_valid, sc_pred))
    
    # I add a new line for pretty printing.
    print("\n")

NB with Count
                    precision    recall  f1-score   support

Non-Socio-Cultural       0.80      0.72      0.76       178
    Socio-Cultural       0.82      0.88      0.85       262

          accuracy                           0.81       440
         macro avg       0.81      0.80      0.80       440
      weighted avg       0.81      0.81      0.81       440



NB with TfIdf
                    precision    recall  f1-score   support

Non-Socio-Cultural       0.96      0.27      0.42       178
    Socio-Cultural       0.67      0.99      0.80       262

          accuracy                           0.70       440
         macro avg       0.81      0.63      0.61       440
      weighted avg       0.79      0.70      0.65       440



LogReg with Count
                    precision    recall  f1-score   support

Non-Socio-Cultural       0.70      0.60      0.64       178
    Socio-Cultural       0.75      0.83      0.79       262

          accuracy                        

The best model of this test run is clearly the **Naive-Bayes classifier with the `CountVectorizer`**. Not only is it the one with the highest accuracy (0.81), but it is also the only model to exhibit an acceptable recall for the Non-Socio-Cultural category. All other options show a marked tendency to artificially inflate the number of bill summaries presumed to express visions regarding socio-cultural issues.

In [22]:
# 3b. Support Vector Machine ("linear")
# +
# 4b. Support Vector Machine ("rbf")

configs = [ # Saving all the vectorizer + classifier configurations of interest
    ("SVM with Count - linear kernel", CountVectorizer(min_df = 0, max_df = .75),
     SVC(kernel = "linear")),
    
    ("SVM with Count - rbf kernel", CountVectorizer(min_df = 0, max_df = .75),
     SVC(kernel = "rbf")),
    
    ("SVM with Tfidf - linear kernel", TfidfVectorizer(min_df = 0, max_df = .75),
     SVC(kernel = "linear")),
    
    ("SVM with Tfidf - rbf kernel", TfidfVectorizer(min_df = 0, max_df = .75),
     SVC(kernel = "rbf")),
]

# Instead of fitting the vectorizer and classifier separately, I combine them in a pipeline!

# I loop over all the desired configurations in 'configs'.
for name, vectorizer, classifier in configs:
    
    # I print the given name of the classifier-vectorizer combination.
    print(name)
    
    pipe = make_pipeline(vectorizer, classifier) # I make a pipeline that combines the given vectorizer and classifier.
    pipe.fit(text_train, sc_train) # I fit the training data on the pipeline.
    
    sc_pred = pipe.predict(text_valid) # I predict the labels from the text database I set aside for validation.
    
    # I print a classification report for the predicted values against the true labels from the validation database.
    print(metrics.classification_report(sc_valid, sc_pred))
    
    # I add a new line for pretty printing.
    print("\n")

SVM with Count - linear kernel
                    precision    recall  f1-score   support

Non-Socio-Cultural       0.70      0.63      0.66       178
    Socio-Cultural       0.76      0.82      0.79       262

          accuracy                           0.74       440
         macro avg       0.73      0.72      0.73       440
      weighted avg       0.74      0.74      0.74       440



SVM with Count - rbf kernel
                    precision    recall  f1-score   support

Non-Socio-Cultural       0.82      0.42      0.55       178
    Socio-Cultural       0.70      0.94      0.80       262

          accuracy                           0.73       440
         macro avg       0.76      0.68      0.68       440
      weighted avg       0.75      0.73      0.70       440



SVM with Tfidf - linear kernel
                    precision    recall  f1-score   support

Non-Socio-Cultural       0.80      0.62      0.70       178
    Socio-Cultural       0.77      0.89      0.83       262

Although the Logistic Regression classifiers perform far worse than the baseline, substituting the logistic loss with the hinge loss does not help much. Nevertheless, the **SVM classifier with the `linear` kernel trick and the `TfidfVectorizer`** appears to be the best solution within this test run, as it is the only one whose accuracy (0.78) is comparable to the baseline's. Its weakness is a low recall (0.62) for the Non-Socio-Cultural category, which indicates (again) a tendency to artificially inflate the number of bill summaries assumed to express political beliefs on socio-cultural issues. Even so, it achieves the highest F1-score (0.70) of this test run for that particular class.

In [23]:
# 5b. Random Forests

configs = [ # Saving all the vectorizer + classifier configurations of interest
    ('RF with Count - 100 estimators', CountVectorizer(min_df = 0, max_df = .75),
     RandomForestClassifier(n_estimators = 100)),
    
    ('RF with Tfidf - 100 estimators', TfidfVectorizer(min_df = 0, max_df = .75),
     RandomForestClassifier(n_estimators = 100)),
    
    ('RF with Count - 500 estimators', CountVectorizer(min_df = 0, max_df = .75),
     RandomForestClassifier(n_estimators = 500)),
    
    ('RF with Tfidf - 500 estimators', TfidfVectorizer(min_df = 0, max_df = .75),
     RandomForestClassifier(n_estimators = 500)),
    
    ('RF with Count - 1000 estimators', CountVectorizer(min_df = 0, max_df = .75),
     RandomForestClassifier(n_estimators = 1000)),
    
    ('RF with Tfidf - 1000 estimators', TfidfVectorizer(min_df = 0, max_df = .75),
     RandomForestClassifier(n_estimators = 1000)),
]

# Instead of fitting the vectorizer and classifier separately, I combine them in a pipeline!

# I loop over all the desired configurations in 'configs'.
for name, vectorizer, classifier in configs:
    
    # I print the given name of the classifier-vectorizer combination.
    print(name)
    
    pipe = make_pipeline(vectorizer, classifier) # I make a pipeline that combines vectorizer and classifier.
    pipe.fit(text_train, sc_train) # I fit the training data on the pipeline.
    
    sc_pred = pipe.predict(text_valid) # I predict the labels from the text database I set aside for validation.
    
    # I print a classification report for the predicted values against the true labels from the validation database.
    print(metrics.classification_report(sc_valid, sc_pred))
    
    # I add a new line for pretty printing.
    print("\n")

RF with Count - 100 estimators
                    precision    recall  f1-score   support

Non-Socio-Cultural       0.81      0.47      0.60       178
    Socio-Cultural       0.72      0.92      0.81       262

          accuracy                           0.74       440
         macro avg       0.76      0.70      0.70       440
      weighted avg       0.76      0.74      0.72       440



RF with Tfidf - 100 estimators
                    precision    recall  f1-score   support

Non-Socio-Cultural       0.81      0.44      0.57       178
    Socio-Cultural       0.71      0.93      0.81       262

          accuracy                           0.73       440
         macro avg       0.76      0.69      0.69       440
      weighted avg       0.75      0.73      0.71       440



RF with Count - 500 estimators
                    precision    recall  f1-score   support

Non-Socio-Cultural       0.80      0.43      0.56       178
    Socio-Cultural       0.70      0.93      0.80       

All Random Forest classifiers perform dismally. Their recall for the Non-Socio-Cultural category stays below 0.50 (0.47 at best among the reports shown above), meaning that they would skew my analysis by coding far too many bill summaries as concerning socio-cultural issues. Thus, no model is selected from this test run.

## 4. Fine-Tuning the Best Hyperparameters

Focusing on the Economic / Non-Economic classification task, I am left with four models to fine-tune and compare:
1. The Naive-Bayes classifier with the `CountVectorizer`;
2. The Logistic Regression classifier with the `liblinear` solver and the `TfidfVectorizer`;
3. The SVM classifier with the `linear` kernel and the `TfidfVectorizer`;
4. The SVM classifier with the `rbf` kernel and the `TfidfVectorizer`.

Turning to the Socio-Cultural / Non-Socio-Cultural classification task, I wish to fine-tune and compare only two models:
1. The Naive-Bayes classifier with the `CountVectorizer`;
2. The SVM classifier with the `linear` kernel and the `TfidfVectorizer`.

I wish to optimise the models' macro-F1 score, which is an appropriate evaluation metric because class imbalance seems to be a critical problem for both classification tasks. Specifically, all models show at least some tendency to inflate the number of positive labels. I will search for the best vectorizer parameters in terms of:
- Allowing for bi-grams or tri-grams. While remaining within a co-occurrence framework, I believe these could bring a performance gain: the concepts that guided my manual annotation are latent, abstract, and thus context-dependent, so specific pairs or triplets of words may capture them better than single words.
- Removing the default `nltk` English stopwords or not, as I may have overlooked potential advantages of stopword removal.
- Setting lower or higher minimum and maximum document-frequency thresholds for pruning (`min_df` and `max_df`). This step is almost purely data-driven, but I do expect the optimal lower threshold to be zero since some summaries are very short and contain unique, yet very informative words.
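To make the first option concrete, the sketch below enumerates the word n-grams that a range such as `(1, 2)` adds on top of single words. It is a plain-Python illustration rather than the vectorizer's own code; the helper `word_ngrams` and the example phrase are hypothetical.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


# Hypothetical three-word phrase, already lowercased and tokenized.
tokens = "affordable care act".split()

unigrams = word_ngrams(tokens, 1, 1)         # analogous to ngram_range = (1, 1)
uni_and_bigrams = word_ngrams(tokens, 1, 2)  # analogous to ngram_range = (1, 2)

print(unigrams)
print(uni_and_bigrams)
```

With bi-grams enabled, a pair such as "care act" becomes a feature of its own, which is the kind of context-dependent signal single words cannot carry.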

On the other hand, I will search for the best classifier parameters in terms of:
- Regularisation parameters (`C`), for the Logistic Regression and SVM models.
- The additive smoothing parameter (`alpha`), for the Naive-Bayes classifier.
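As a rough check on the cost of such a search, the sketch below counts the combinations an exhaustive grid search has to try for the Naive-Bayes grid used in this section. It uses only the standard library; the two-word stopword list is a placeholder standing in for the `nltk` list.

```python
from itertools import product

# Same shape as the Naive-Bayes grid in this section.
grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vectorizer__stop_words": [None, ["the", "a"]],  # placeholder for the nltk list
    "vectorizer__max_df": [0.5, 0.75],
    "vectorizer__min_df": [0, 5, 10],
    "classifier__alpha": [0.01, 1, 10, 100],
}

# An exhaustive search evaluates the Cartesian product of all value lists.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * 5  # one fit per candidate and per cross-validation fold

print(n_candidates, n_fits)
```

The product 3 x 2 x 2 x 3 x 4 gives 144 candidates, hence 720 fits with five folds, which is why adding even one more value to a parameter noticeably lengthens the search.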

In [24]:
# I first save the default nltk English stopwords list in the "stop_words" object

stop_words = nltk.corpus.stopwords.words('english')

In [25]:
# a. Economic / Non-economic

# 1a. Naive-Bayes classifier with CountVectorizer

pipeline = Pipeline( # Constructing a pipeline comprised of two steps.
    steps = [
        ("vectorizer", CountVectorizer()), # A CountVectorizer, without any previously specified parameters.
        ("classifier", MultinomialNB()), # A Multinomial Naive-Bayes, without any previously specified parameters.
    ]
)

grid = { # In the gridsearch I want to search for...
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)], # Whether I want to allow for bigrams or tri-grams.
    "vectorizer__stop_words": [None, stop_words], # Whether I want to remove English stopwords or not.
    "vectorizer__max_df": [0.5, 0.75], # Different cutoffs for the maximum... 
    "vectorizer__min_df": [0, 5, 10], # and minimum thresholds for pruning.
    "classifier__alpha": [0.01, 1, 10, 100], # Candidate values for the Naive-Bayes smoothing parameter alpha.
}

# I run a gridsearch with 5-fold cross-validation, maximising the macro-F1 score since class imbalances are critical within
# this classification task.

search = GridSearchCV(
    estimator = pipeline, n_jobs = -1, param_grid = grid, scoring = "f1_macro", cv = 5, verbose = 10
)

In [26]:
search.fit(text_train, econ_train) # I run the gridsearch on the training data.

Fitting 5 folds for each of 144 candidates, totalling 720 fits


In [27]:
print(f"Best parameters: {search.best_params_}") # I print the best parameters found by the gridsearch.

Best parameters: {'classifier__alpha': 0.01, 'vectorizer__max_df': 0.75, 'vectorizer__min_df': 5, 'vectorizer__ngram_range': (1, 2), 'vectorizer__stop_words': None}
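Besides `best_params_`, a fitted `GridSearchCV` also exposes the cross-validated score of the winning candidate (`best_score_`) and the mean score of every candidate (`cv_results_`). The sketch below shows how these could be inspected; the toy corpus, labels, reduced grid, and `cv = 2` are assumptions made only to keep the example small and self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the bill summaries.
toy_texts = [
    "tax credit for small business",
    "federal budget and tax reform",
    "tariff on imported steel",
    "minimum wage increase for workers",
    "voting rights for citizens",
    "religious freedom in schools",
    "immigration and citizenship reform",
    "marriage equality protections",
]
toy_labels = ["Economic"] * 4 + ["Non-Economic"] * 4

toy_pipeline = Pipeline(
    steps = [
        ("vectorizer", CountVectorizer()),
        ("classifier", MultinomialNB()),
    ]
)

# A much smaller grid than the one used in the notebook.
toy_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__alpha": [0.01, 1],
}

toy_search = GridSearchCV(
    estimator = toy_pipeline, param_grid = toy_grid, scoring = "f1_macro", cv = 2
)
toy_search.fit(toy_texts, toy_labels)

# Cross-validated macro-F1 of the best candidate.
print(toy_search.best_params_, round(toy_search.best_score_, 2))

# Mean cross-validated macro-F1 of every candidate.
results = toy_search.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    print(params, round(score, 2))
```

Looking at the full table of mean scores can show whether the best candidate wins clearly or only by a margin that is within fold-to-fold noise.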


In [28]:
pred = search.predict(text_valid) # I make my predictions with the best parameters found by the gridsearch.

In [29]:
# I print a classification report for the predicted values against the true labels from the validation set.
print(metrics.classification_report(econ_valid, pred))

              precision    recall  f1-score   support

    Economic       0.80      0.83      0.81       250
Non-Economic       0.76      0.73      0.75       190

    accuracy                           0.79       440
   macro avg       0.78      0.78      0.78       440
weighted avg       0.79      0.79      0.79       440



By fine-tuning the Naive-Bayes baseline, I achieve a small yet satisfactory trade-off between precision and recall. Most importantly, recall for the Non-Economic class rises from 0.71 to 0.73, at the cost of a decrease in precision from 0.78 to 0.76. The overall accuracy remains unchanged at 0.79.
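For reference, the macro-F1 targeted by the gridsearch is the unweighted mean of the per-class F1 scores, so the smaller class counts as much as the larger one. The sketch below works this out by hand; the confusion counts are hypothetical, and only the last line uses figures from the report above.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score of one class from its true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical binary problem with 20 documents:
# class A has 12 documents (9 predicted correctly), class B has 8 (4 predicted correctly).
f1_a = f1_from_counts(tp = 9, fp = 4, fn = 3)
f1_b = f1_from_counts(tp = 4, fp = 3, fn = 4)

macro_f1 = (f1_a + f1_b) / 2               # every class weighs the same
weighted_f1 = (12 * f1_a + 8 * f1_b) / 20  # classes weighted by their support

print(round(macro_f1, 3), round(weighted_f1, 3))

# The same averaging applied to the rounded per-class F1 scores of the report above.
report_macro_f1 = round((0.81 + 0.75) / 2, 2)
print(report_macro_f1)
```

Because the weaker minority class pulls the macro average down more than the weighted one, optimising macro-F1 penalises models that neglect that class.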

In [30]:
# 2a. Logistic Regression classifier with TfIdf Vectorizer

pipeline = Pipeline( # Constructing a pipeline comprised of two steps.
    steps = [
        ("vectorizer", TfidfVectorizer()), # A TfidfVectorizer, without any previously specified parameters.
        ("classifier", LogisticRegression(solver = "liblinear")), # A Logistic Regression, with the "liblinear" solver.
    ]
)

grid = { # In the gridsearch I want to search for...
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)], # Whether I want to allow for bigrams or tri-grams.
    "vectorizer__stop_words": [None, stop_words], # Whether I want to remove stopwords or not.
    "vectorizer__max_df": [0.5, 0.75], # Different cutoffs for the maximum... 
    "vectorizer__min_df": [0, 5, 10], # and minimum thresholds for pruning.
    "classifier__C": [0.01, 1, 10, 100], # Candidate values for the regularisation parameter of the Logistic Regression.
}

# I run a gridsearch with 5-fold cross-validation, maximising the macro-F1 score since class imbalances are critical within
# this classification task.

search = GridSearchCV(
    estimator = pipeline, n_jobs = -1, param_grid = grid, scoring = "f1_macro", cv = 5, verbose = 10
)

In [31]:
search.fit(text_train, econ_train) # I run the gridsearch on the training data.

Fitting 5 folds for each of 144 candidates, totalling 720 fits


In [32]:
print(f"Best parameters: {search.best_params_}") # I print the best parameters found by the gridsearch.

Best parameters: {'classifier__C': 10, 'vectorizer__max_df': 0.75, 'vectorizer__min_df': 0, 'vectorizer__ngram_range': (1, 1), 'vectorizer__stop_words': None}


In [33]:
pred = search.predict(text_valid) # I make my predictions with the best parameters found by the gridsearch.

In [34]:
# I print a classification report for the predicted values against the true labels from the validation set.
print(metrics.classification_report(econ_valid, pred))

              precision    recall  f1-score   support

    Economic       0.81      0.84      0.82       250
Non-Economic       0.77      0.74      0.75       190

    accuracy                           0.79       440
   macro avg       0.79      0.79      0.79       440
weighted avg       0.79      0.79      0.79       440



Thanks to hyperparameter fine-tuning, the Logistic Regression model attains a level of performance similar to the baseline. The classifier's tendency to artificially inflate the number of positive labels is greatly reduced, with the Non-Economic category's recall increasing sharply from 0.67 to 0.74. The main trade-off concerns the Economic class, whose F1 score shrinks from 0.84 to 0.82, but this is acceptable as it helps cover the classifier's greatest weakness.
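The "inflation" of positive labels can be put into a rough number by comparing how many documents a model labels as positive with how many truly are. The sketch below estimates that ratio from the supports and recalls printed in a classification report; the helper `predicted_positive_ratio` is hypothetical, and since the report values are rounded the result is only approximate.

```python
def predicted_positive_ratio(n_pos, recall_pos, n_neg, recall_neg):
    """Estimated ratio of predicted positives to actual positives."""
    true_pos = n_pos * recall_pos         # positives correctly labelled positive
    false_pos = n_neg * (1 - recall_neg)  # negatives wrongly labelled positive
    return (true_pos + false_pos) / n_pos


# Figures from the fine-tuned Logistic Regression report above:
# Economic (positive): support 250, recall 0.84; Non-Economic: support 190, recall 0.74.
ratio = predicted_positive_ratio(n_pos = 250, recall_pos = 0.84, n_neg = 190, recall_neg = 0.74)
print(round(ratio, 3))
```

A ratio above 1 means the model hands out more positive labels than the validation set contains; the closer it is to 1, the weaker the inflation.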

In [35]:
# 3a. Support Vector Machine classifier with linear kernel and TfIdf Vectorizer

pipeline = Pipeline( # Constructing a pipeline comprised of two steps.
    steps = [
        ("vectorizer", TfidfVectorizer()), # A TfidfVectorizer, without any previously specified parameters.
        ("classifier", SVC(kernel = "linear")), # A Support Vector Machine, with the "linear" kernel.
    ]
)

grid = { # In the gridsearch I want to search for...
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)], # Whether I want to allow for bigrams or tri-grams.
    "vectorizer__stop_words": [None, stop_words], # Whether I want to remove stopwords or not.
    "vectorizer__max_df": [0.5, 0.75], # Different cutoffs for the maximum... 
    "vectorizer__min_df": [0, 5, 10], # and minimum thresholds for pruning.
    "classifier__C": [0.01, 1, 10, 100], # Candidate values for the regularisation parameter of the Support Vector Machine.
}

# I run a gridsearch with 5-fold cross-validation, maximising the macro-F1 score since class imbalances are critical within
# this classification task.

search = GridSearchCV(
    estimator = pipeline, n_jobs = -1, param_grid = grid, scoring = "f1_macro", cv = 5, verbose = 10
)

In [36]:
search.fit(text_train, econ_train) # I run the gridsearch on the training data.

Fitting 5 folds for each of 144 candidates, totalling 720 fits


In [37]:
print(f"Best parameters: {search.best_params_}") # I print the best parameters found by the gridsearch.

Best parameters: {'classifier__C': 1, 'vectorizer__max_df': 0.75, 'vectorizer__min_df': 0, 'vectorizer__ngram_range': (1, 1), 'vectorizer__stop_words': None}


In [38]:
pred = search.predict(text_valid) # I make my predictions with the best parameters found by the gridsearch.

In [39]:
# I print a classification report for the predicted values against the true labels from the validation set.
print(metrics.classification_report(econ_valid, pred))

              precision    recall  f1-score   support

    Economic       0.81      0.84      0.83       250
Non-Economic       0.78      0.74      0.76       190

    accuracy                           0.80       440
   macro avg       0.79      0.79      0.79       440
weighted avg       0.80      0.80      0.80       440



Never change a winning team, they say. It seems there was no need to fine-tune the SVM classifier with the `linear` kernel, which is the best model yet, for three reasons: its tendency to artificially inflate the number of positive labels is slightly lower than that of the other classifiers; its accuracy is the highest overall (0.80); and its F1 scores for both categories are the highest, at 0.83 (Economic) and 0.76 (Non-Economic).

In [40]:
# 4a. Support Vector Machine classifier with rbf kernel and TfIdf Vectorizer

pipeline = Pipeline( # Constructing a pipeline comprised of two steps.
    steps = [
        ("vectorizer", TfidfVectorizer()), # A TfidfVectorizer, without any previously specified parameters.
        ("classifier", SVC(kernel = "rbf")), # A Support Vector Machine, with the "rbf" kernel.
    ]
)

grid = { # In the gridsearch I want to search for...
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)], # Whether I want to allow for bigrams or tri-grams.
    "vectorizer__stop_words": [None, stop_words], # Whether I want to remove stopwords or not.
    "vectorizer__max_df": [0.5, 0.75], # Different cutoffs for the maximum... 
    "vectorizer__min_df": [0, 5, 10], # and minimum thresholds for pruning.
    "classifier__C": [0.01, 1, 10, 100], # Candidate values for the regularisation parameter of the Support Vector Machine.
}

# I run a gridsearch with 5-fold cross-validation, maximising the macro-F1 score since class imbalances are critical within
# this classification task.

search = GridSearchCV(
    estimator = pipeline, n_jobs = -1, param_grid = grid, scoring = "f1_macro", cv = 5, verbose = 10
)

In [41]:
search.fit(text_train, econ_train) # I run the gridsearch on the training data.

Fitting 5 folds for each of 144 candidates, totalling 720 fits


In [42]:
print(f"Best parameters: {search.best_params_}") # I print the best parameters found by the gridsearch.

Best parameters: {'classifier__C': 1, 'vectorizer__max_df': 0.5, 'vectorizer__min_df': 10, 'vectorizer__ngram_range': (1, 1), 'vectorizer__stop_words': None}


In [43]:
pred = search.predict(text_valid) # I make my predictions with the best parameters found by the gridsearch.

In [44]:
# I print a classification report for the predicted values against the true labels from the validation set.
print(metrics.classification_report(econ_valid, pred))

              precision    recall  f1-score   support

    Economic       0.78      0.86      0.82       250
Non-Economic       0.79      0.68      0.73       190

    accuracy                           0.79       440
   macro avg       0.79      0.77      0.78       440
weighted avg       0.79      0.79      0.78       440



Again, never change a winning team. Gearing the hyperparameters towards the maximum cross-validated macro-F1 score actually yields a worse fit on the validation set, which leads me to discard the SVM classifier with the `rbf` kernel. In the end, the best choice for the Economic / Non-Economic classification task is the **SVM classifier with the `linear` kernel and the `TfidfVectorizer`**.

In [45]:
# b. Socio-Cultural / Non-Socio-Cultural

# 1b. Naive-Bayes classifier with CountVectorizer

pipeline = Pipeline( # Constructing a pipeline comprised of two steps.
    steps = [
        ("vectorizer", CountVectorizer()), # A CountVectorizer, without any previously specified parameters.
        ("classifier", MultinomialNB()), # A Multinomial Naive-Bayes, without any previously specified parameters.
    ]
)

grid = { # In the gridsearch I want to search for...
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)], # Whether I want to allow for bigrams or tri-grams.
    "vectorizer__stop_words": [None, stop_words], # Whether I want to remove English stopwords or not.
    "vectorizer__max_df": [0.5, 0.75], # Different cutoffs for the maximum... 
    "vectorizer__min_df": [0, 5, 10], # and minimum thresholds for pruning.
    "classifier__alpha": [0.01, 1, 10, 100], # Candidate values for the Naive-Bayes smoothing parameter alpha.
}

# I run a gridsearch with 5-fold cross-validation, maximising the macro-F1 score since class imbalances are critical within
# this classification task.

search = GridSearchCV(
    estimator = pipeline, n_jobs = -1, param_grid = grid, scoring = "f1_macro", cv = 5, verbose = 10
)

In [46]:
search.fit(text_train, sc_train) # I run the gridsearch on the training data.

Fitting 5 folds for each of 144 candidates, totalling 720 fits


In [47]:
print(f"Best parameters: {search.best_params_}") # I print the best parameters found by the gridsearch.

Best parameters: {'classifier__alpha': 1, 'vectorizer__max_df': 0.5, 'vectorizer__min_df': 0, 'vectorizer__ngram_range': (1, 1), 'vectorizer__stop_words': ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'whe

In [48]:
pred = search.predict(text_valid) # I make my predictions with the best parameters found by the gridsearch.

In [49]:
# I print a classification report for the predicted values against the true labels from the validation set.
print(metrics.classification_report(sc_valid, pred))

                    precision    recall  f1-score   support

Non-Socio-Cultural       0.80      0.71      0.76       178
    Socio-Cultural       0.82      0.88      0.85       262

          accuracy                           0.81       440
         macro avg       0.81      0.80      0.80       440
      weighted avg       0.81      0.81      0.81       440



I will not repeat myself for a third time, as I would be overusing an objectively unfunny line. Again, there is no remarkable change from the original Naive-Bayes classifier with the `CountVectorizer`, so this remains the model that exhibits the least tendency to artificially inflate the number of positive labels, while retaining the highest overall accuracy (0.81).

In [50]:
# 2b. Support Vector Machine classifier with linear kernel and TfIdf Vectorizer

pipeline = Pipeline( # Constructing a pipeline comprised of two steps.
    steps = [
        ("vectorizer", TfidfVectorizer()), # A TfidfVectorizer, without any previously specified parameters.
        ("classifier", SVC(kernel = "linear")), # A Support Vector Machine, with the "linear" kernel.
    ]
)

grid = { # In the gridsearch I want to search for...
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)], # Whether I want to allow for bigrams or tri-grams.
    "vectorizer__stop_words": [None, stop_words], # Whether I want to remove stopwords or not.
    "vectorizer__max_df": [0.5, 0.75], # Different cutoffs for the maximum... 
    "vectorizer__min_df": [0, 5, 10], # and minimum thresholds for pruning.
    "classifier__C": [0.01, 1, 10, 100], # Candidate values for the regularisation parameter of the Support Vector Machine.
}

# I run a gridsearch with 5-fold cross-validation, maximising the macro-F1 score since class imbalances are critical within
# this classification task.

search = GridSearchCV(
    estimator = pipeline, n_jobs = -1, param_grid = grid, scoring = "f1_macro", cv = 5, verbose = 10
)

In [51]:
search.fit(text_train, sc_train) # I run the gridsearch on the training data.

Fitting 5 folds for each of 144 candidates, totalling 720 fits


In [52]:
print(f"Best parameters: {search.best_params_}") # I print the best parameters found by the gridsearch.

Best parameters: {'classifier__C': 10, 'vectorizer__max_df': 0.5, 'vectorizer__min_df': 0, 'vectorizer__ngram_range': (1, 3), 'vectorizer__stop_words': ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where'

In [53]:
pred = search.predict(text_valid) # I make my predictions with the best parameters found by the gridsearch.

In [54]:
# I print a classification report for the predicted values against the true labels from the validation set.
print(metrics.classification_report(sc_valid, pred))

                    precision    recall  f1-score   support

Non-Socio-Cultural       0.82      0.66      0.73       178
    Socio-Cultural       0.80      0.90      0.85       262

          accuracy                           0.80       440
         macro avg       0.81      0.78      0.79       440
      weighted avg       0.81      0.80      0.80       440



Fine-tuning the hyperparameters of the SVM classifier with the `linear` kernel and the `TfidfVectorizer` leads to far more promising performance metrics across the board. However, its overall accuracy is still lower than that of the fine-tuned baseline, 0.80 (SVM) versus 0.81 (NB), and the same is true for the Non-Socio-Cultural category's recall, 0.66 (SVM) versus 0.71 (NB). This implies that the alternative model is more inclined to artificially inflate the number of positive labels, leading me to discard it. In the end, the best choice for the Socio-Cultural / Non-Socio-Cultural classification task is what should have been the baseline, i.e., the **Naive-Bayes classifier with the `CountVectorizer`**.

## 5. The Final Tests

To account for the two selected classifiers' potential overfitting of the validation data, I run two separate final evaluations on the unseen testing set I kept aside from the start. First, I check the **SVM classifier with the `linear` kernel and the `TfidfVectorizer`**, geared towards the Economic / Non-Economic classification task.

In [55]:
# a. Economic / Non-Economic

# I define the optimised TfidfVectorizer with the fine-tuned hyperparameters found by the gridsearch...
vectorizer = TfidfVectorizer(ngram_range = (1, 1), min_df = 0, max_df = .75)

# ...and I do the same with the SVM classifier.
# Caution: the gridsearch for this pipeline (In [37]) reported classifier__C = 1, whereas C = 10 is set here.
classifier = SVC(kernel = "linear", C = 10)

# I generate the BOW feature representation by fitting the vectorizer on my training data only...
x_train = vectorizer.fit_transform(text_train)

# ...and I reuse the fitted vocabulary to transform the testing data.
x_test = vectorizer.transform(text_test)

# I now train my optimised classifier.
classifier.fit(x_train, econ_train)

In [56]:
# I make my predictions on the yet unseen set for testing purposes...
pred = classifier.predict(x_test)

# ...and I print a classification report for the predicted values against the true labels from the test database.
print(metrics.classification_report(econ_test, pred))

              precision    recall  f1-score   support

    Economic       0.78      0.76      0.77       222
Non-Economic       0.76      0.78      0.77       218

    accuracy                           0.77       440
   macro avg       0.77      0.77      0.77       440
weighted avg       0.77      0.77      0.77       440



Unfortunately, it appears that randomisation has dealt me an unlucky hand. The negative category is noticeably more frequent in the test set (218 of 440 summaries) than in the validation set (190 of 440), which means that this classification report should be interpreted with added caution. Overall performance is slightly lower, with accuracy falling from 0.80 to 0.77, partially due to overfitting of the validation set, but the **SVM classifier with the `linear` kernel and the `TfidfVectorizer`** is still a solid model for this classification task. In this particular instance the classifier shows balanced metrics across the board, which is valuable since the most common issue throughout the pipeline has been the artificial inflation of positive labels.
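The shift in class balance between the two splits can be read directly from the `support` column of the classification reports. The short sketch below does the arithmetic; the dictionary layout is only an illustration, while the counts are those printed in the reports.

```python
# Supports taken from the Economic / Non-Economic classification reports.
supports = {
    "validation": {"Economic": 250, "Non-Economic": 190},
    "test": {"Economic": 222, "Non-Economic": 218},
}

# Share of the negative (Non-Economic) class in each split.
negative_share = {
    split: counts["Non-Economic"] / sum(counts.values())
    for split, counts in supports.items()
}

for split, share in negative_share.items():
    print(f"{split}: {share:.1%} Non-Economic")
```

The negative class makes up roughly 43% of the validation set but close to half of the test set, so per-class metrics on the two splits are not measured against the same mix of labels.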

Before turning to the Socio-Cultural / Non-Socio-Cultural classification task, I save this vectorizer and classifier combination in two files. This allows me to load the summary classifier I just trained at any given moment without needing to re-estimate it. The choice of the `joblib` library over the usual `pickle` library was inspired by `ogrisel`'s top-rated response to the https://stackoverflow.com/questions/12615525/what-are-the-different-use-cases-of-joblib-versus-pickle thread. I must greatly thank this user for helping me to fine-tune my code, even though the performance gain is not too relevant for my application.

In [57]:
# I now save the vectorizer and the classifier to disk.
# I employ the joblib library, which is faster than pickle in saving or loading large NumPy arrays.

with open("econ_vectorizer.pkl", mode = "wb") as f:
    joblib.dump(vectorizer, f)
    
with open("econ_classifier.pkl", mode = "wb") as f:
    joblib.dump(classifier, f)
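Reloading a saved pair later could look like the sketch below. To stay self-contained it trains a tiny stand-in model on a hypothetical toy corpus and writes it to a temporary directory; with the real files, only the two `joblib.load` calls and the final `transform` and `predict` steps would be needed.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the training summaries.
toy_texts = [
    "tax credit for small business",
    "federal budget and tax reform",
    "voting rights for citizens",
    "religious freedom in schools",
]
toy_labels = ["Economic", "Economic", "Non-Economic", "Non-Economic"]

toy_vectorizer = TfidfVectorizer()
toy_classifier = SVC(kernel = "linear")
toy_classifier.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    vectorizer_path = os.path.join(tmp_dir, "toy_vectorizer.pkl")
    classifier_path = os.path.join(tmp_dir, "toy_classifier.pkl")

    # Save both fitted objects...
    joblib.dump(toy_vectorizer, vectorizer_path)
    joblib.dump(toy_classifier, classifier_path)

    # ...and load them back, as one would in a later session.
    loaded_vectorizer = joblib.load(vectorizer_path)
    loaded_classifier = joblib.load(classifier_path)

# New text must go through the loaded vectorizer before reaching the classifier.
new_summaries = ["a bill to reduce the tax on small business"]
original_pred = toy_classifier.predict(toy_vectorizer.transform(new_summaries))
reloaded_pred = loaded_classifier.predict(loaded_vectorizer.transform(new_summaries))

print(reloaded_pred)
```

The vectorizer has to be saved alongside the classifier because the classifier only understands the feature columns of the vocabulary it was trained on.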

In [58]:
# b. Socio-Cultural / Non-Socio-Cultural

# I define the optimised CountVectorizer with the fine-tuned hyperparameters found by the gridsearch...
vectorizer = CountVectorizer(ngram_range = (1, 1), min_df = 0, max_df = .5, stop_words = stop_words)

# ...and I do the same with the NB classifier.
classifier = MultinomialNB(alpha = 1)

# I generate the BOW feature representation by fitting the vectorizer on my training data only...
x_train = vectorizer.fit_transform(text_train)

# ...and I reuse the fitted vocabulary to transform the testing data.
x_test = vectorizer.transform(text_test)

# I now train my optimised classifier.
classifier.fit(x_train, sc_train)

In [59]:
# I make my predictions on the yet unseen set for testing purposes...
pred = classifier.predict(x_test)

# ...and I print a classification report for the predicted values against the true labels from the test database.
print(metrics.classification_report(sc_test, pred))

                    precision    recall  f1-score   support

Non-Socio-Cultural       0.77      0.71      0.74       157
    Socio-Cultural       0.84      0.88      0.86       283

          accuracy                           0.82       440
         macro avg       0.81      0.80      0.80       440
      weighted avg       0.82      0.82      0.82       440



On the other hand, the **Naive-Bayes classifier with the `CountVectorizer`** retains its performance when applied to the test set. Accuracy even rises marginally, from 0.81 to 0.82, and the only notable loss is in precision for the Non-Socio-Cultural category (from 0.80 to 0.77), so the classifier is an acceptable answer to the Socio-Cultural / Non-Socio-Cultural classification task. However, there is a bitter note at the end, as the recall metrics (0.71 for the negative class against 0.88 for the positive class) still imply an underlying bias towards the positive label. This recurring problem is likely caused by how conceptually broad and abstract the categories I employed for human annotation are, and it points to the need to transcend the Bag-Of-Words approach and take context into account.

I still decide to save this vectorizer and classifier combination in two files, just in case the state-of-the-art BERT fine-tuning technique does not lead to better outcomes. As with the previous model, I rely on the `joblib` library rather than `pickle`, so that I can load the classifier at any given moment without needing to re-estimate it.

In [60]:
# I now save the vectorizer and the classifier to disk.
# I employ the joblib library, which is faster than pickle in saving or loading large NumPy arrays.

with open("sc_vectorizer.pkl", mode = "wb") as f:
    joblib.dump(vectorizer, f)
    
with open("sc_classifier.pkl", mode = "wb") as f:
    joblib.dump(classifier, f)

## 6. Wrapping Up

The vectorizer + classifier combinations I fine-tuned with grid-searches could be greatly improved. Even though the accuracy of the respective solutions hovers around the 80% mark, which is pretty satisfactory, there is a recurring underlying bias towards the positive labels. The most flexible and powerful way to transcend the Bag-Of-Words approach and try to solve this issue is to fine-tune a BERT transformer pre-trained on legal text in English, so that the pre-training phase is consistent with my domain of interest, i.e., US Congress bills. However, this requires considerably more computation, to the extent that I am forced to run my script on Google CoLab in order to use Google's GPUs.

Thus, I will download the pre-trained `nlpaueb/legal-bert-base-uncased`, a BERT model from the `HuggingFace` library, created by the Athens University of Economics and Business's Natural Language Processing Group. The LEGAL-BERT model is pre-trained on a corpus of EU legislation, UK legislation, US contracts from the US Securities and Exchange Commission (SECOM), and cases from the European Court of Justice (ECJ), the European Court of Human Rights (ECHR), and various courts across the USA. It is available at https://huggingface.co/nlpaueb/legal-bert-base-uncased. I expect that fine-tuning this transformer for my specific downstream tasks will lead to superior performance in both classification tasks, ultimately yielding more nuanced predictions that take context and temporality into account.