<h1>Random Forest Model</h1>

Import Libraries

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import joblib

Load Preprocessed Dataset:

This dataset contains:
- `label`: the sentiment (positive or negative)
- `original_text`: the original text
- `preprocessed_text`: the cleaned and lemmatised version

In [2]:
df = pd.read_csv("Sentiment CSVs/jerbarnes_dataset_lowercased_lemmatized.csv",
                 header=None,
                 names=["label", "original_text", "preprocessed_text"])

Split into Training and Testing Sets (80% Training, 20% Testing):

In [3]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Check sizes
print(f"Training set size: {len(train_df)}")
print(f"Test set size: {len(test_df)}")

Training set size: 680
Test set size: 171
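
The split above is a plain random split. Because the two classes are not equally frequent (the test set later shows 119 versus 52 examples), a stratified split is one option for keeping the class ratio the same in both partitions. The sketch below uses a hypothetical toy frame (`toy_df`), not the notebook's data, and is an illustration rather than the method used here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with a 70/30 class imbalance (not the real dataset)
toy_df = pd.DataFrame({
    "label": [0] * 70 + [1] * 30,
    "preprocessed_text": [f"sample text {i}" for i in range(100)],
})

# stratify= keeps the class ratio the same in both partitions
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["label"]
)

train_share = toy_train["label"].mean()  # fraction of class 1 in the training part
test_share = toy_test["label"].mean()    # fraction of class 1 in the test part
print(f"Class-1 share: train={train_share:.2f}, test={test_share:.2f}")
```

Switching to a stratified split would change which rows land in the test set, so scores would no longer be directly comparable with the ones reported in this notebook.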


Convert Text to Features (TF-IDF):

TF-IDF vectorisation is applied with an n-gram range of 1 to 3, so unigrams, bigrams, and trigrams all become features.

In [4]:
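# Fit the vocabulary and IDF weights on the training text only
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X_train = vectorizer.fit_transform(train_df["preprocessed_text"])
# Reuse the fitted vectorizer so the test text does not influence the features
X_test = vectorizer.transform(test_df["preprocessed_text"])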

y_train = train_df["label"]
y_test = test_df["label"]

Train the Random Forest Classifier:

The Random Forest uses `class_weight='balanced'` to compensate for class imbalance, together with the optimised hyperparameters shown below.

In [5]:
rf_classifier = RandomForestClassifier(
    class_weight='balanced',
    n_estimators=200,
    max_depth=15,
    min_samples_split=4,
    min_samples_leaf=2,
    random_state=42
)
rf_classifier.fit(X_train, y_train)
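
As a rough illustration of what `class_weight='balanced'` does, scikit-learn weights each class by `n_samples / (n_classes * count(class))`, so the rarer class receives the larger weight. The label counts below are hypothetical; the notebook does not print the class counts of its training set.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector: 70 examples of class 0 and 30 of class 1
toy_labels = np.array([0] * 70 + [1] * 30)
classes = np.unique(toy_labels)

# Weights as computed by scikit-learn for the 'balanced' setting
weights = compute_class_weight(class_weight="balanced", classes=classes, y=toy_labels)

# The same formula written out by hand: n_samples / (n_classes * count per class)
manual = len(toy_labels) / (len(classes) * np.bincount(toy_labels))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
```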

Evaluate the Model:

Accuracy is computed on the test set, and a per-class classification report is generated.

In [6]:
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_test, y_pred))

Accuracy: 0.75
              precision    recall  f1-score   support

           0       0.81      0.85      0.83       119
           1       0.61      0.54      0.57        52

    accuracy                           0.75       171
   macro avg       0.71      0.69      0.70       171
weighted avg       0.75      0.75      0.75       171
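
The per-class rows of the report come from the confusion matrix. The sketch below shows that relationship for class 1, the class with the lower scores above, using hypothetical labels (`toy_true`, `toy_pred`) rather than the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (not the notebook's results)
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=[0, 1]).ravel()

precision_1 = tp / (tp + fp)  # of the rows predicted as class 1, how many were right
recall_1 = tp / (tp + fn)     # of the rows that really are class 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```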



Add New Annotated Data to Improve the Model:

In [8]:
annotated_data = pd.read_csv("Sentiment CSVs/crowdsourced_dataset_lowercased_lemmatized.csv",
                             header=None,
                             names=["label", "original_text", "preprocessed_text"])

Extend the Training Set:

In [9]:
train_df_extended = pd.concat([train_df, annotated_data], ignore_index=True)
print(f"Extended training set size: {len(train_df_extended)}")

Extended training set size: 2274
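
The notebook does not say whether the crowdsourced file shares any texts with the held-out test set. If it did, test scores would be inflated, so a quick overlap check before retraining may be worthwhile. The sketch below uses hypothetical stand-ins (`toy_train_extended`, `toy_test`) for `train_df_extended` and `test_df`.

```python
import pandas as pd

# Hypothetical stand-ins for train_df_extended and test_df
toy_train_extended = pd.DataFrame({
    "label": [0, 1, 1, 0],
    "preprocessed_text": ["great film", "bad plot", "dull act", "nice score"],
})
toy_test = pd.DataFrame({
    "label": [1, 0],
    "preprocessed_text": ["bad plot", "lovely cast"],
})

# Flag test rows whose text also appears in the extended training data
overlap_mask = toy_test["preprocessed_text"].isin(toy_train_extended["preprocessed_text"])
n_overlap = int(overlap_mask.sum())
print(f"Test rows also present in training data: {n_overlap}")

# Optionally drop the shared texts from the training side before fitting
toy_train_clean = toy_train_extended[
    ~toy_train_extended["preprocessed_text"].isin(toy_test["preprocessed_text"])
]
```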


Retrain TF-IDF on Extended Data:

In [10]:
vectorizer_extended = TfidfVectorizer(ngram_range=(1, 3))
X_train_extended = vectorizer_extended.fit_transform(train_df_extended["preprocessed_text"])
X_test_extended = vectorizer_extended.transform(test_df["preprocessed_text"])

y_train_extended = train_df_extended["label"]
y_test_extended = test_df["label"]

Retrain the Random Forest Classifier:

A slightly larger model is used this time: `n_estimators` is raised from 200 to 300, with the other hyperparameters unchanged.

In [11]:
rf_classifier_extended = RandomForestClassifier(
    class_weight='balanced',
    n_estimators=300,
    max_depth=15,
    min_samples_split=4,
    min_samples_leaf=2,
    random_state=42
)
rf_classifier_extended.fit(X_train_extended, y_train_extended)

Evaluate the Updated Model:

In [12]:
y_pred_extended = rf_classifier_extended.predict(X_test_extended)
accuracy_extended = accuracy_score(y_test_extended, y_pred_extended)

print(f"Accuracy after adding new data: {accuracy_extended:.2f}")
print(classification_report(y_test_extended, y_pred_extended))

Accuracy after adding new data: 0.73
              precision    recall  f1-score   support

           0       0.82      0.79      0.80       119
           1       0.55      0.60      0.57        52

    accuracy                           0.73       171
   macro avg       0.69      0.69      0.69       171
weighted avg       0.74      0.73      0.73       171

Compared with the first model, accuracy drops from 0.75 to 0.73. Recall for class 1 rises from 0.54 to 0.60, its precision falls from 0.61 to 0.55, and its F1-score stays at 0.57.



Save the Updated Model and Vectorizer:

In [13]:
joblib.dump(rf_classifier_extended, "randomForestModel_Extended.pkl")
joblib.dump(vectorizer_extended, "vectorizer_Extended.pkl")

print("Updated model with additional data saved.")

Updated model with additional data saved.
