Name: Sanad Masannat

ID: 24217734

Assignment 2 Machine Learning with Python

In [1]:
import pandas as pd
import string
import nltk
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument




In [2]:

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert to lowercase and remove punctuation
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Tokenize words
    words = word_tokenize(text)
    # Remove stopwords and lemmatize
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return " ".join(words)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sanad\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\sanad\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sanad\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sanad\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Here we load NLTK's list of common English stop words and define a preprocessing function that lowercases the text, removes punctuation, tokenises it into words, drops the stop words, and lemmatizes the remaining words.
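As a small illustration of the punctuation and stop-word steps, the sketch below applies the same str.translate call to a made-up sentence. It uses only the standard library, a tiny hypothetical stop-word set in place of NLTK's list, and a whitespace split in place of word_tokenize, and it skips lemmatization, so it is not a substitute for preprocess_text.

```python
import string

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook itself uses NLTK's full English list.
DEMO_STOP_WORDS = {"the", "is", "to", "a"}

def demo_clean(text):
    # Lowercase, then delete every ASCII punctuation character.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # A whitespace split stands in for word_tokenize here.
    return [w for w in text.split() if w not in DEMO_STOP_WORDS]

tokens = demo_clean("Sony's PlayStation is ready to go!")
print(tokens)  # ['sonys', 'playstation', 'ready', 'go']
```

Deleting the apostrophe joins the possessive to the word, which is why forms such as "sonys" show up in the Processed_Text column.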

In [3]:


# Load dataset
df = pd.read_csv("BBC_news.csv")

# Apply the preprocessing done above to the text
df["Processed_Text"] = df["Text"].apply(preprocess_text)

# Check that the columns and the preprocessing look correct
df.head(5)


Unnamed: 0,Text,Class,Processed_Text
0,Hariri killing hits Beirut shares Shares in S...,business,hariri killing hit beirut share share solidere...
1,Asian banks halt dollar's slide The dollar re...,business,asian bank halt dollar slide dollar regained l...
2,Housewives lift Channel 4 ratings The debut o...,entertainment,housewife lift channel 4 rating debut u televi...
3,Portable PlayStation ready to go Sony's PlayS...,tech,portable playstation ready go sonys playstatio...
4,Georgia plans hidden asset pardon Georgia is ...,business,georgia plan hidden asset pardon georgia offer...


We read in the CSV file and apply the preprocessing function defined earlier, saving the result to a new column. We then display the first few records to check that it worked as intended.

In [4]:
vectorizer = CountVectorizer()
model = LogisticRegression()
X = vectorizer.fit_transform(df["Processed_Text"])  # Transform text into numerical features
y = df["Class"]  # Target labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print results
print("Accuracy: ",accuracy)
print("Classification Report:\n", classification_rep)

Accuracy:  0.9520958083832335
Classification Report:
                precision    recall  f1-score   support

     business       0.93      0.97      0.95        38
entertainment       1.00      0.94      0.97        31
     politics       0.93      0.97      0.95        29
        sport       0.93      1.00      0.96        40
         tech       1.00      0.86      0.93        29

     accuracy                           0.95       167
    macro avg       0.96      0.95      0.95       167
 weighted avg       0.95      0.95      0.95       167



Here we use a standard Logistic Regression model on the word counts produced by CountVectorizer. We split the data 70/30 into training and test sets and fit the model to the training data. We then predict the classes of the test articles and compare the predictions with the true labels to get an accuracy score and a classification report.
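One caveat: the cell above fits CountVectorizer on all articles before splitting, so the vocabulary also reflects the test articles. Below is a minimal sketch of one way to avoid that, assuming a scikit-learn pipeline is acceptable here. The toy texts and labels are made up, and this is not the code that produced the scores reported in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy corpus standing in for df["Processed_Text"] and df["Class"].
texts = [
    "share market bank profit", "bank dollar economy growth",
    "market profit economy share", "dollar bank share growth",
    "match goal team player", "team player win match",
    "goal win league team", "player league match goal",
]
labels = ["business"] * 4 + ["sport"] * 4

# Split the raw text first, so the vectorizer never sees the test articles.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline learns the vocabulary from the training split only.
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(predictions)
```

With a pipeline, the vectorizer and the classifier are fitted together on the training split, and the test split is only transformed.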

**Findings**

Our finding here is that we get a decent accuracy score of 95.2%. Going through each category, tech and entertainment have the highest precision (1.00), but tech has the lowest recall at 0.86 and entertainment the second lowest at 0.94. Using entertainment as a reference, this means the model misses a few entertainment articles by labelling them as something else, but every article it does label as entertainment really is one. Conversely, sport has the joint-lowest precision (0.93, shared with business and politics) but the highest recall (1.00), meaning all sport articles were correctly classified, while the precision indicates that some non-sport articles were classified as sport. The F1-scores range from 0.93 to 0.97. This means the model did not find it difficult to assign each text to a class: the F1-score balances precision and recall, so if a class does worse on one metric but better on the other, the F1-score reflects both. The model appeared to struggle the most with tech articles: despite the perfect precision, tech has the lowest F1-score and a noticeably lower recall than the other classes.
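To make the relationship between the three metrics concrete, here is a small sketch that recomputes the tech row. The counts are inferred rather than taken from the notebook: a recall of 0.86 on 29 tech articles is consistent with 25 correct predictions, and a precision of 1.00 implies no false positives.

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: share of predicted positives that are correct.
    precision = tp / (tp + fp)
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts for the tech class: 25 found, 4 missed, 0 false alarms.
p, r, f1 = precision_recall_f1(tp=25, fp=0, fn=4)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # 1.00 0.86 0.93
```

Because F1 is a harmonic mean, it sits closer to the weaker of the two metrics, which is why the low recall pulls the tech F1-score down even though the precision is perfect.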

In [None]:


# Prepare data for Doc2Vec as each document will need an associated tag
tagged_data = [TaggedDocument(words=text.split(), tags=[str(i)]) for i, text in enumerate(df["Processed_Text"])]

# Define and train the Doc2Vec model
doc2vec_model = Doc2Vec(vector_size=100, window=5, min_count=2, workers=4, epochs=20)
doc2vec_model.build_vocab(tagged_data)
doc2vec_model.train(tagged_data, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)

# Convert text data into embeddings
X_doc2vec = np.array([doc2vec_model.infer_vector(text.split()) for text in df["Processed_Text"]])

# Encode class labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(df["Class"])

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_doc2vec, y_encoded, test_size=0.3,random_state=42)

# Train Logistic Regression
model_doc2vec = LogisticRegression()
model_doc2vec.fit(X_train, y_train)

# Predict and evaluate
y_pred_doc2vec = model_doc2vec.predict(X_test)

# Print results
accuracy_doc2vec = accuracy_score(y_test, y_pred_doc2vec)
classification_rep_doc2vec = classification_report(y_test, y_pred_doc2vec, target_names=label_encoder.classes_)

print(f"Doc2Vec Accuracy: {accuracy_doc2vec:.4f}")
print("Doc2Vec Classification Report:\n", classification_rep_doc2vec)


Doc2Vec Accuracy: 0.9401
Doc2Vec Classification Report:
                precision    recall  f1-score   support

     business       0.90      0.97      0.94        38
entertainment       1.00      0.94      0.97        31
     politics       0.93      0.93      0.93        29
        sport       0.97      0.95      0.96        40
         tech       0.90      0.90      0.90        29

     accuracy                           0.94       167
    macro avg       0.94      0.94      0.94       167
 weighted avg       0.94      0.94      0.94       167



Here we again use a Logistic Regression model, but this time we use gensim's Doc2Vec to produce the embeddings. We first tag each document so that the Doc2Vec model can accept it, train Doc2Vec, and infer a vector for every article. We then encode the class labels, split the data 70/30 into training and test sets, and fit the model to the training data. Finally, we predict the classes of the test articles and compare the predictions with the true labels to get an accuracy score.

**Findings**

Our finding here is that we get a lower accuracy score of 94%. Going through each category, entertainment once again has the highest precision (1.00) and the median recall (0.94), meaning the model misses a few entertainment articles but every article it labels as entertainment is correct. Business and tech have the lowest precision, at 0.90 each. This time tech has the lowest recall (0.90), whereas business has the highest (0.97). The F1-scores are a bit more spread out, with a wider range of 0.90 to 0.97. The model appeared to struggle with tech articles the most, as indicated by the lowest recall and F1-score (both 0.90).

In [6]:


# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

# Function to convert text into BERT embeddings
def get_bert_embedding(text):
    tokens = tokenizer(text, padding='max_length', truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():  # Disable gradient calculation for efficiency
        output = bert_model(**tokens)
    return output.last_hidden_state.mean(dim=1).squeeze().numpy()  # Mean pooling

# Convert dataset text into BERT embeddings
X_bert = np.array([get_bert_embedding(text) for text in df["Processed_Text"]])

# Encode class labels and set up the model
label_encoder = LabelEncoder()
model_bert = LogisticRegression()

y_encoded = label_encoder.fit_transform(df["Class"])

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_bert, y_encoded, test_size=0.3,random_state=42)

# Train the model and make predictions

model_bert.fit(X_train, y_train)
y_pred_bert = model_bert.predict(X_test)

# Print results
accuracy_bert = accuracy_score(y_test, y_pred_bert)
classification_rep_bert = classification_report(y_test, y_pred_bert, target_names=label_encoder.classes_)

print(f"BERT Accuracy: {accuracy_bert:.4f}")
print("BERT Classification Report:\n", classification_rep_bert)


BERT Accuracy: 0.9521
BERT Classification Report:
                precision    recall  f1-score   support

     business       0.95      0.92      0.93        38
entertainment       1.00      0.97      0.98        31
     politics       0.93      0.97      0.95        29
        sport       1.00      1.00      1.00        40
         tech       0.87      0.90      0.88        29

     accuracy                           0.95       167
    macro avg       0.95      0.95      0.95       167
 weighted avg       0.95      0.95      0.95       167



Here we once again use a Logistic Regression model, but this time we use BERT to produce the embeddings. We tokenise each article, run it through BERT, and mean-pool the last hidden state into one vector per article. We then split the data 70/30 into training and test sets and fit the model to the training data. Finally, we predict the classes of the test articles and compare the predictions with the true labels to get an accuracy score.
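Note that get_bert_embedding pads every article to 512 tokens and then averages over all positions, so the padding positions also enter the mean. A common alternative is to weight the average by the attention mask. The sketch below shows the idea on small made-up NumPy arrays rather than real BERT output; the function name masked_mean is hypothetical, and this is not the pooling used for the results in this notebook.

```python
import numpy as np

def masked_mean(hidden, mask):
    # hidden: (seq_len, dim) token vectors; mask: (seq_len,) with 1 = real token.
    mask = mask.astype(hidden.dtype)[:, None]
    # Sum only the real tokens, then divide by how many there are.
    return (hidden * mask).sum(axis=0) / mask.sum()

# Made-up example: two real tokens followed by two padding positions.
hidden = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]])
mask = np.array([1, 1, 0, 0])

print(hidden.mean(axis=0))        # plain mean, padding included: [5.5 6. ]
print(masked_mean(hidden, mask))  # masked mean, padding ignored: [2. 3.]
```

In the real pipeline the mask would come from the tokenizer output (its attention_mask entry), with the same weighting applied to last_hidden_state.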

**Findings**

Our finding here is that we get an accuracy score of 95.21%, level with the highest seen so far. Going through each category, sport articles had a perfect score for precision, recall and F1-score. Entertainment had an equally perfect precision but a lower recall (0.97). Once again tech is the weakest class, with the lowest precision, recall and F1-score at 0.87, 0.90 and 0.88 respectively. Entertainment has a very high F1-score of 0.98; it falls short of a perfect score only because its recall is below 1.

**Final Thoughts**

Based on all the findings, BERT and the baseline model performed best and are effectively tied: both reach an accuracy of 0.9521 to four decimal places, so the difference between the reported 95.2% and 95.21% is only rounding. The Doc2Vec model follows, with an accuracy about 1 percentage point lower than both. The baseline model seemed to struggle the most with tech articles but performed well with entertainment articles. The Doc2Vec model did well classifying entertainment articles but struggled with tech the most. Finally, BERT did really well with sport articles but did badly with tech articles once again.
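The reported accuracies can be turned back into counts of correctly classified test articles, which makes the gaps easier to judge. A small sketch using only the figures printed by the three cells and the test-set size of 167:

```python
# Accuracies as printed by the three cells, and the test-set size (support).
accuracies = {"Baseline": 0.9520958083832335, "Doc2Vec": 0.9401, "BERT": 0.9521}
n_test = 167

# Recover how many test articles each model classified correctly.
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}
print(correct)  # {'Baseline': 159, 'Doc2Vec': 157, 'BERT': 159}
```

On this split the baseline and BERT classify the same number of articles correctly, and Doc2Vec trails by two articles.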

All three models did badly with tech articles. This could be because technology is used across different sectors, so words from tech articles also appear in other article types. At first it might seem that the cause is that tech has the fewest articles in the test set, but the models performed considerably better on politics articles, and politics has the same number of articles (29).

To ensure the comparison is fair, we made sure to use the same data, the same train/test split (by keeping the random state the same), and the same Logistic Regression model. We also used the same metrics for measuring accuracy and classification scores so that nothing differs between runs. This matters because, without a fixed random state, the accuracies and the ranking change slightly from run to run. In preliminary testing, BERT had an accuracy of around 97% and the Doc2Vec model produced a better accuracy than the baseline, but after setting the same random state for all models we obtained the current ranking, in which Doc2Vec falls behind BERT and the baseline.

To ensure that all the models work as intended, we did the following:

    1. We tokenised and converted the data correctly for the BERT model

    2. We made sure to use Logistic Regression on top of the different embeddings so that the comparisons are fair

    3. We tagged and generated embeddings prior to using our Logistic Regression Models

    4. We set the random state for all models to be equal

    5. We removed the maximum-iterations limit that ChatGPT gave us, to let the models be slightly more accurate
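Since the ranking was sensitive to the random state, one way to check how stable a comparison is would be to repeat the split with several seeds and look at the spread. This is a hedged sketch on synthetic data from make_classification, not on the BBC articles; the seed values and dataset sizes are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an embedding matrix and its class labels.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

scores = []
for seed in [0, 1, 2, 3, 4]:
    # Same 70/30 split as the notebook, but with a different seed each time.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression().fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

# Report the average accuracy and how much it varies across seeds.
print(f"mean={np.mean(scores):.3f} std={np.std(scores):.3f}")
```

If the gap between two models is smaller than this spread, the ranking between them may simply reflect the particular split.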

    
    