## Part a:

Read the hansard40000.csv dataset in the texts directory into a dataframe.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score, classification_report
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from pathlib import Path
import time

In [140]:
root_directory = Path.cwd()
part_two_data_path = root_directory / "data/hansard40000.csv"

In [None]:

df = pd.read_csv(part_two_data_path)

# rename the ‘Labour (Co-op)’ value in ‘party’ column to ‘Labour’,
df["party"] = df["party"].replace("Labour (Co-op)", "Labour")

# remove any rows where the value of the ‘party’ column is not one of the
# four most common party names, and remove the ‘Speaker’ value
party_name_count = df["party"].value_counts()
if "Speaker" in party_name_count:
    party_name_count = party_name_count.drop("Speaker")

most_common_party_name = party_name_count.nlargest(4).index
df = df[df["party"].isin(most_common_party_name)]

# remove any rows where the value in the ‘speech_class’ column is not
# ‘Speech’
df = df[df["speech_class"] == "Speech"]

# remove any rows where the text in the ‘speech’ column is less than 1000
# characters long.
df = df[df["speech"].str.len() >= 1000]

print(df.shape)
print(df.head())

(8084, 8)
                                                speech  \
63   It has been less than two weeks since the Gove...   
99   I am delighted to announce that last Friday we...   
100  I thank the Secretary of State for advance sig...   
101  After the right hon. Lady’s congratulations to...   
104  I congratulate the Secretary of State. I recog...   

                       party                  constituency        date  \
63              Conservative               Suffolk Coastal  2020-09-14   
99              Conservative            South West Norfolk  2020-09-14   
100                   Labour  Islington South and Finsbury  2020-09-14   
101             Conservative            South West Norfolk  2020-09-14   
104  Scottish National Party                   Dundee East  2020-09-14   

    speech_class               major_heading  year       speakername  
63        Speech           Work and Pensions  2020    Therese Coffey  
99        Speech  Japan Free Trade Agreement  2020   E
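The filtering steps can be checked on a small hand-made frame. This is only a sketch: the function name `filter_speeches` and the toy rows are hypothetical, and the speech text is placeholder characters of a chosen length rather than real Hansard data.

```python
import pandas as pd

def filter_speeches(frame):
    """Apply the same four cleaning steps to any Hansard-like frame."""
    frame = frame.copy()
    frame["party"] = frame["party"].replace("Labour (Co-op)", "Labour")
    # Rank parties by row count, ignoring the 'Speaker' value
    counts = frame.loc[frame["party"] != "Speaker", "party"].value_counts()
    frame = frame[frame["party"].isin(counts.nlargest(4).index)]
    frame = frame[frame["speech_class"] == "Speech"]
    return frame[frame["speech"].str.len() >= 1000]

# Hypothetical rows: (party, speech_class, speech length in characters)
rows = [
    ("Conservative", "Speech", 1200),
    ("Conservative", "Speech", 1500),
    ("Conservative", "Procedural", 1300),
    ("Labour", "Speech", 1100),
    ("Labour (Co-op)", "Speech", 1000),
    ("Scottish National Party", "Speech", 2000),
    ("Scottish National Party", "Speech", 500),
    ("Liberal Democrat", "Speech", 1050),
    ("Liberal Democrat", "Speech", 999),
    ("Green Party", "Speech", 3000),
    ("Speaker", "Speech", 4000),
]
toy = pd.DataFrame({
    "party": [party for party, _, _ in rows],
    "speech_class": [speech_class for _, speech_class, _ in rows],
    "speech": ["x" * length for _, _, length in rows],
})

filtered = filter_speeches(toy)
print(filtered.shape)
print(filtered["party"].value_counts().to_dict())
```

Six of the eleven toy rows survive. The 'Labour (Co-op)' row is kept under the 'Labour' label because its speech is exactly 1000 characters long and the length filter uses `>=`.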

## Part b:

Vectorise the speeches using TfidfVectorizer from scikit-learn. Use the default
parameters, except for omitting English stopwords and setting max_features to
3000. Split the data into a train and test set, using stratified sampling, with a
random seed of 26.

In [45]:
# Vectorise the speeches using TfidfVectorizer from scikit-learn.
vectorizer = TfidfVectorizer(stop_words="english", max_features=3000)
X = vectorizer.fit_transform(df["speech"])
y = df["party"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=26)
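To illustrate what `stratify=y` does, the sketch below splits a hypothetical imbalanced label array and compares class proportions in the two halves. The labels, counts and the helper `proportions` are made up for the example; only the `train_test_split` arguments mirror the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels (60 / 30 / 10), not the real Hansard counts
labels = np.array(["Conservative"] * 60 + ["Labour"] * 30 + ["SNP"] * 10)
features = np.arange(len(labels)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=26
)

def proportions(y):
    """Return the share of each class in y as a dict."""
    values, counts = np.unique(y, return_counts=True)
    return {str(v): float(c) / len(y) for v, c in zip(values, counts)}

print("train:", proportions(y_tr))
print("test: ", proportions(y_te))
```

With stratification, both halves keep the 60/30/10 mix, so a small class such as the Liberal Democrats is represented in the test set in proportion to its size.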


## Part c:

Train RandomForest (with n_estimators=300) and SVM with linear kernel classifiers on the training set, and print the scikit-learn macro-average f1 score and
classification report for each classifier on the test set. The label that you are
trying to predict is the ‘party’ value.

In [68]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=300, random_state=26)
random_forest.fit(X_train, y_train)
random_forest_prediction = random_forest.predict(X_test)
print("Macro F1 Score:", f1_score(y_test, random_forest_prediction, average="macro"))
print(classification_report(y_test, random_forest_prediction))


Macro F1 Score: 0.47296725014116325
                         precision    recall  f1-score   support

           Conservative       0.73      0.96      0.83       964
                 Labour       0.75      0.48      0.58       463
       Liberal Democrat       0.00      0.00      0.00        54
Scottish National Party       0.85      0.33      0.48       136

               accuracy                           0.74      1617
              macro avg       0.58      0.44      0.47      1617
           weighted avg       0.72      0.74      0.70      1617





In [47]:
# SVM with Linear Classifier
SVM_Linear = SVC(kernel="linear", random_state=26)
SVM_Linear.fit(X_train, y_train)
SVM_Linear_prediction = SVM_Linear.predict(X_test)
print("Macro F1 Score:", f1_score(y_test, SVM_Linear_prediction, average="macro"))
print(classification_report(y_test, SVM_Linear_prediction))

Macro F1 Score: 0.5933446121140653
                         precision    recall  f1-score   support

           Conservative       0.83      0.92      0.87       964
                 Labour       0.74      0.71      0.72       463
       Liberal Democrat       1.00      0.07      0.14        54
Scottish National Party       0.78      0.54      0.64       136

               accuracy                           0.80      1617
              macro avg       0.84      0.56      0.59      1617
           weighted avg       0.81      0.80      0.79      1617
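The gap between the macro and weighted averages in this report can be reproduced by hand. The sketch below uses the rounded per-class F1 scores and supports printed in the SVM report above, so the results only approximate the reported averages.

```python
# Per-class F1 and support copied from the SVM report above (rounded to 2 d.p.)
per_class_f1 = {
    "Conservative": 0.87,
    "Labour": 0.72,
    "Liberal Democrat": 0.14,
    "Scottish National Party": 0.64,
}
support = {
    "Conservative": 964,
    "Labour": 463,
    "Liberal Democrat": 54,
    "Scottish National Party": 136,
}

# Macro average: every class counts equally, whatever its size
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(per_class_f1[c] * support[c] for c in per_class_f1) / sum(support.values())

print(f"macro F1    ~ {macro_f1:.4f}")
print(f"weighted F1 ~ {weighted_f1:.4f}")
```

Because the macro average gives the 54 Liberal Democrat speeches the same weight as the 964 Conservative ones, the low Liberal Democrat F1 pulls the macro score well below the weighted score.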



---

## Part d:




Adjust the parameters of the TfidfVectorizer so that unigrams, bi-grams and
tri-grams will be considered as features, limiting the total number of features to
3000. Print the classification report as in 2(c) again using these parameters.

In [57]:
# Vectorise the speeches
vectorizer = TfidfVectorizer(stop_words="english", max_features=3000, ngram_range=(1, 3))
X = vectorizer.fit_transform(df["speech"])
y = df["party"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=26)

In [58]:
# Random Forest
random_forest = RandomForestClassifier(n_estimators=300, random_state=26)
random_forest.fit(X_train, y_train)
random_forest_prediction = random_forest.predict(X_test)
print(classification_report(y_test, random_forest_prediction))

                         precision    recall  f1-score   support

           Conservative       0.73      0.96      0.83       964
                 Labour       0.75      0.48      0.58       463
       Liberal Democrat       0.00      0.00      0.00        54
Scottish National Party       0.85      0.33      0.48       136

               accuracy                           0.74      1617
              macro avg       0.58      0.44      0.47      1617
           weighted avg       0.72      0.74      0.70      1617





In [69]:
# SVM with Linear Classifier
SVM_Linear = SVC(kernel="linear", random_state=26)
SVM_Linear.fit(X_train, y_train)
SVM_Linear_prediction = SVM_Linear.predict(X_test)
print(classification_report(y_test, SVM_Linear_prediction))

                         precision    recall  f1-score   support

           Conservative       0.84      0.92      0.88       964
                 Labour       0.75      0.73      0.74       463
       Liberal Democrat       1.00      0.04      0.07        54
Scottish National Party       0.78      0.56      0.65       136

               accuracy                           0.81      1617
              macro avg       0.84      0.56      0.59      1617
           weighted avg       0.81      0.81      0.79      1617



## Part e: 


Implement a new custom tokenizer and pass it to the tokenizer argument of
TfidfVectorizer. You can use this function in any way you like to try to achieve
the best classification performance while keeping the number of features to no
more than 3000, and using the same three classifiers as above. Print the classification report for the best performing classifier using your tokenizer. Marks
will be awarded both for a high overall classification performance, and a good
trade-off between classification performance and efficiency (i.e., using fewer parameters).

In [76]:
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

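def new_custom_tokenizer(text):
    """Lowercase, tokenize, filter and lemmatize a speech into a list of tokens."""
    text = text.lower()
    word_tokens = word_tokenize(text)

    cleaned_word_tokens = []
    for token in word_tokens:
        # Keep alphabetic, lowercase tokens only (drops numbers and punctuation) and skip stopwords
        if token.isalpha() and token.islower() and token not in stop_words:
            lemma_token = lemmatizer.lemmatize(token)
            # Drop very short lemmas (two characters or fewer)
            if len(lemma_token) > 2:
                cleaned_word_tokens.append(lemma_token)

    return cleaned_word_tokens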


[nltk_data] Downloading package punkt to /Users/marcolecu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/marcolecu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/marcolecu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [85]:
new_vectorizer = TfidfVectorizer(tokenizer = new_custom_tokenizer, max_features=3000)
X = new_vectorizer.fit_transform(df["speech"])
y = df["party"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=26)
        



In [None]:
def grid_search_evaluation(model_class, param_grid, label):
    """Use grid-search cross-validation to find the highest-performing model by macro F1.

    Relies on the module-level X_train, X_test, y_train and y_test.

    Args:
        model_class (callable): Zero-argument callable that returns a scikit-learn
            estimator (an estimator class or a factory such as a lambda, not an instance).
        param_grid (dict): Dictionary with parameter names (str) as keys and lists of parameter settings to try.
        label (str): A label to identify the model in the results.

    Returns:
        dict: Metadata about the best model (parameters, scores, timings, predictions).
    """
    classifier_grid = GridSearchCV(estimator=model_class(), param_grid=param_grid, cv=3, scoring="f1_macro", n_jobs=-1)

    # Time the whole search: every parameter combination on every fold, plus the final refit
    start_training_time = time.time()
    classifier_grid.fit(X_train, y_train)
    training_time = time.time() - start_training_time

    best_model = classifier_grid.best_estimator_

    # Time prediction on the held-out test set
    start_predicting_time = time.time()
    y_pred = best_model.predict(X_test)
    predicting_time = time.time() - start_predicting_time

    f1 = f1_score(y_test, y_pred, average="macro")
    accuracy = accuracy_score(y_test, y_pred)
    return {
        "label": label,
        "parameters": classifier_grid.best_params_,
        "model": best_model,
        "f1": f1,
        "accuracy": accuracy,
        "training_time": training_time,
        "predicting_time": predicting_time,
        "total_time": training_time + predicting_time,
        "y_pred": y_pred
    }
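The `training_time` measured above covers the whole search, so it grows with the number of parameter combinations as well as with the cost of each model. As a possible complement, `GridSearchCV` records per-candidate fit times in `cv_results_` and the time to refit the best model in `refit_time_`. The sketch below reads both on a small synthetic dataset; the data from `make_classification` and the three-value grid are assumptions for the example, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic 3-class problem standing in for the TF-IDF matrix
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=26
)

search = GridSearchCV(
    SVC(kernel="linear", random_state=26),
    {"C": [0.1, 1, 10]},
    cv=3,
    scoring="f1_macro",
)
search.fit(X_toy, y_toy)

# Mean time to fit the winning candidate on one fold, and time to refit it on all data
best_mean_fit_time = search.cv_results_["mean_fit_time"][search.best_index_]
refit_time = search.refit_time_

print("best parameters:", search.best_params_)
print(f"mean fit time per fold: {best_mean_fit_time:.4f} s")
print(f"refit time on full training data: {refit_time:.4f} s")
```

These per-model timings do not depend on how many candidates were searched, which makes them easier to compare across classifiers with grids of different sizes.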

In [None]:

random_forest_parameter_grid = {"n_estimators": [50, 100, 200, 300, 400, 500],
                                "max_depth": [None, 10, 20, 30]}

svc_parameter_grid = {"C": [0.01, 0.1, 1, 10, 100]}

random_forest_result = grid_search_evaluation(lambda: RandomForestClassifier(random_state=26), random_forest_parameter_grid, "Random Forest")
svc_linear_result = grid_search_evaluation(lambda: SVC(kernel="linear", random_state=26), svc_parameter_grid, "SVC Linear")

results = [random_forest_result, svc_linear_result]

best_performance = max(results, key=lambda x: x['f1'])
best_efficiency = min(results, key=lambda x: x["total_time"])

# Best by performance
print(f"\n Best performance model: {best_performance['label']}")
print(f"Parameters: {best_performance['parameters']}")
print(f"Total time: {best_performance['total_time']:.3f} seconds")
print("\nClassification Report:\n", classification_report(y_test, best_performance['y_pred']))
print("---")
# Best by efficiency (only reported when it is a different model from the best performer)
if best_efficiency is not best_performance:
    print(f"\n Best efficient model: {best_efficiency['label']}")
    print(f"Parameters: {best_efficiency['parameters']}")
    print(f"Total time: {best_efficiency['total_time']:.3f} seconds")
    print("\nClassification Report:\n", classification_report(y_test, best_efficiency['y_pred']))



 Best performance model: SVC Linear
Parameters: {'C': 100}
Total time: 53.119 seconds

Classification Report:
                          precision    recall  f1-score   support

           Conservative       0.83      0.86      0.84       964
                 Labour       0.67      0.69      0.68       463
       Liberal Democrat       0.64      0.33      0.44        54
Scottish National Party       0.73      0.62      0.67       136

               accuracy                           0.77      1617
              macro avg       0.72      0.63      0.66      1617
           weighted avg       0.77      0.77      0.77      1617


 Best efficient model: Random Forest
Parameters: {'max_depth': None, 'n_estimators': 400}
Total time: 51.657 seconds

Classification Report:
                          precision    recall  f1-score   support

           Conservative       0.71      0.98      0.82       964
                 Labour       0.76      0.43      0.55       463
       Liberal Democrat   



## Part f:

Explain your tokenizer function and discuss its performance.

To complete this task, I used a custom tokenizer to clean and standardise the speech texts before vectorisation, with the aim of improving classification performance. The tokenizer first converts the text to lowercase and splits it into tokens with NLTK's word_tokenize. It then keeps only purely alphabetic tokens, which discards numbers and punctuation, and filters out English stopwords. Each remaining token is lemmatised, and any lemma of two characters or fewer is dropped.

Lemmatisation was used instead of stemming because it returns valid dictionary base forms, whereas stemming often generates truncated or non-standard forms. Keeping recognisable word forms is a significant factor in formal contexts such as political speech.

With this tokenizer, the best-performing model was the linear SVC with C=100, which reached a macro-average F1 score of 0.66 and an accuracy of 0.77. Compared with the linear SVC in parts (c) and (d), macro F1 rose from 0.59 to 0.66 while accuracy fell from about 0.80 to 0.77. The gain comes mainly from the smaller classes: Liberal Democrat recall rose from 0.07 to 0.33. This suggests that the lemmatised features capture class-specific linguistic patterns more evenly across the parties, although C was also tuned in this part, so the improvement cannot be attributed to the tokenizer alone.

In contrast, the Random Forest with 400 estimators had the best efficiency, with a total runtime of 51.7 seconds against 53.1 seconds for the SVC, but it severely underperformed in classification, with an F1 score of 0.43 and an accuracy of 0.72. Based on these findings, the custom tokenizer appears useful for linear SVM classifiers, which depend heavily on features that are both clean and contextually relevant in a high-dimensional space.