## Assignment 2: Sentiment Analysis

The goal of this assignment is to train a machine learning model that predicts the sentiment of short statements as negative, positive, or neutral. Split the attached data into training, validation, and test sets, and train a multi-class logistic regression classifier on the training set. Tune the model's hyperparameters using the validation set. Once the hyperparameters are tuned and you have selected your best model, evaluate it on the test set and report its performance using the standard metrics for multi-class classifiers.

*Note: Please include comments in your code so it can be easily followed and understood.*

### Loading the data


Load the data.csv file

In [None]:
# import libraries
import pandas as pd

In [None]:
# load the data.csv file on moodle
df = pd.read_csv('data.csv')

In [None]:
# display the first 5 rows of your dataset so you can explore its content
print("First 5 rows of the dataset:")
print(df.head())

### Cleaning the data

Remove duplicates and missing values from your dataset

In [None]:
# check for missing values
print("\nChecking for missing values:")
print(df.isnull().sum())

In [None]:
# drop missing values
df = df.dropna()

In [None]:
# check for duplicates
print("\nChecking for duplicates:")
print(df.duplicated().sum())

In [None]:
# drop duplicates
df = df.drop_duplicates()

### Splitting the data

Split your dataset into train (80%), validation (10%) and test (10%) sets

In [None]:
# import libraries
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split

In [None]:
# split the data
total_samples = len(df)   

X = df['text']
y = df['sentiment']

# Split the data into train and temp sets (80% train, 20% temp)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.20, random_state=42)

# Split the temp set into validation and test sets (50% each, i.e. 10% of the full data each)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)

# Display the shapes of the resulting sets
print("Train set shape:", X_train.shape, y_train.shape)
print("Validation set shape:", X_val.shape, y_val.shape)
print("Test set shape:", X_test.shape, y_test.shape)
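The 80/10/10 arithmetic of a two-step split can be checked on toy data. The following is a minimal sketch, assuming 1000 made-up statements; the names `X_demo`, `y_demo` and `split_sizes` are illustrative and not part of the assignment.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the text and label columns (hypothetical data).
X_demo = [f"statement {i}" for i in range(1000)]
y_demo = [i % 3 for i in range(1000)]

# Step 1: hold out 20% of the data; the remaining 80% is the training set.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# Step 2: split the 20% hold-out in half: 10% validation, 10% test.
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=42
)

split_sizes = (len(X_tr), len(X_va), len(X_te))
print(split_sizes)  # (800, 100, 100)
```

The second call operates only on the hold-out portion, so the training set is never reduced below 80%.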

### Pre-processing the data

Pre-process your data by converting all characters to lowercase, removing stop words, and performing stemming.

In [None]:
# import libraries
%pip install nltk
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# download the stop word list (needed once per environment)
nltk.download('stopwords')

In [None]:
# pre-process the data
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove stop words
    words = [word for word in text.split() if word not in stop_words]
    
    # Perform stemming
    words = [stemmer.stem(word) for word in words]
    
    return ' '.join(words)

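# Apply preprocessing to each split (the splits were created before this step,
# so updating df['text'] alone would leave them unprocessed)
X_train = X_train.apply(preprocess_text)
X_val = X_val.apply(preprocess_text)
X_test = X_test.apply(preprocess_text)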

### Representing the data using TF-IDF

Transform your pre-processed data into tf-idf vectors using the **TfidfVectorizer** of the **sklearn** library. *Note:* You should first transform the training data into TF-IDF vectors, and then when transforming the validation and test data, you should make sure that the IDF scores for test instances are deduced from the training data to avoid data leakage.

In [None]:
# import libraries
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer 
tfidf_vectorizer = TfidfVectorizer()

In [None]:
# transform the text column of the training data into TF-IDF vectors
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

In [None]:
# transform the text column of the validation data into TF-IDF vectors
X_val_tfidf = tfidf_vectorizer.transform(X_val)

In [None]:
# transform the text column of the test data into TF-IDF vectors
X_test_tfidf = tfidf_vectorizer.transform(X_test)
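The leakage rule in this section can be illustrated on a toy corpus. This is a minimal sketch, assuming three made-up training sentences and two made-up validation sentences; it shows that `transform` reuses the vocabulary and IDF weights learned by `fit_transform` and silently ignores words that never occurred in training.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not taken from data.csv.
train_docs = ["good movie", "bad movie", "okay plot"]
val_docs = ["good plot", "terrible movie"]  # "terrible" never occurs in training

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF
val_matrix = vectorizer.transform(val_docs)          # reuses them unchanged

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)                             # terms seen in training only
print(train_matrix.shape, val_matrix.shape)   # same number of columns
```

Calling `fit_transform` on the validation or test data instead would recompute the vocabulary and IDF scores from those sets, which is the leakage the note warns about.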

### Training a classifier

Train a multi-class logistic regression model to classify the sentiment of statements as negative, positive, or neutral. You should use the **"sentiment"** column as the target variable and the **TF-IDF features** created above as independent variables.


In [None]:
# import libraries
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.metrics import accuracy_score

In [None]:
# train the logistic regression model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_tfidf, y_train)

In [None]:
# report the accuracy of the model on your validation set
val_predictions = model.predict(X_val_tfidf)
accuracy = accuracy_score(y_val, val_predictions)
print("Accuracy on Validation Set:", accuracy)

In [None]:
# plot the learning curve of the model to examine bias and variance
train_sizes, train_scores, val_scores = learning_curve(model, X_train_tfidf, y_train, cv=5, scoring='accuracy', n_jobs=-1)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, np.mean(train_scores, axis=1), label='Training Accuracy')
plt.plot(train_sizes, np.mean(val_scores, axis=1), label='Validation Accuracy')
plt.title('Learning Curve')
plt.xlabel('Number of Training Samples')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Does your model suffer from overfitting (high variance) or underfitting (high bias) or neither and why?

Answer: The training accuracy is in a reasonable range and slowly tends toward 1 as the number of training samples increases, so the model does not appear to underfit. The validation accuracy rises sharply at first, which indicates that the model is learning, but then plateaus at about 0.5 and increases only very slowly with more training samples.
Because adding samples yields little further gain on the validation folds while the training accuracy stays much higher, the gap between the two curves suggests limited generalization, i.e. some overfitting (high variance).
Overall, high bias does not seem to be the problem; the main concern is the gap between training and validation accuracy.
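The reading of a learning curve described above can be made more explicit by comparing the final training and validation means. The sketch below is only illustrative: the helper `summarize_learning_curve`, its thresholds `gap_tol` and `low_score`, and the fold scores are all assumptions chosen for the example, not values from this notebook.

```python
import numpy as np

def summarize_learning_curve(train_scores, val_scores, gap_tol=0.10, low_score=0.60):
    """Rough bias/variance read-out from learning_curve-style score arrays.

    gap_tol and low_score are illustrative thresholds, not standard values.
    """
    # Mean over CV folds at the largest training size (last row).
    train_final = float(np.mean(train_scores, axis=1)[-1])
    val_final = float(np.mean(val_scores, axis=1)[-1])
    gap = train_final - val_final
    if gap > gap_tol:
        verdict = "high variance (overfitting)"
    elif train_final < low_score:
        verdict = "high bias (underfitting)"
    else:
        verdict = "neither"
    return train_final, val_final, gap, verdict

# Made-up fold scores: rows are training sizes, columns are CV folds.
demo_train = np.array([[0.99, 0.98], [0.95, 0.95], [0.92, 0.94]])
demo_val = np.array([[0.40, 0.42], [0.48, 0.50], [0.50, 0.52]])

result = summarize_learning_curve(demo_train, demo_val)
print(result[3])
```

The same function could be applied to the `train_scores` and `val_scores` arrays returned by `learning_curve`, with thresholds adjusted to the task.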



### Hyperparameter tuning

Use the **GridSearchCV** class of the **sklearn** library to tune the hyperparameters of your logistic regression model on the validation set. You should try to tune as many hyperparameters as you can, such as regularization, learning rate, solver, number of iterations, etc. Your tuning should be guided by the observations you made from the learning curve of the untuned model.

In [None]:
# import libraries
from sklearn.model_selection import GridSearchCV

In [None]:
# use gridsearch to find the best combination of hyperparameters
# Define the hyperparameter grid
param_grid = {
    'penalty': ['l2'],           # Regularization type
    'C': [0.001, 0.01, 0.1, 1, 10],    # Inverse of regularization strength
    'solver': ['lbfgs'],  # Optimization algorithm
    'max_iter': [100, 500, 1000],       # Maximum number of iterations
}

# Create GridSearchCV object
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the grid search to the data
grid_search.fit(X_train_tfidf, y_train)

# Print the best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)


In [None]:
# report the best mean cross-validated accuracy found by the grid search
print('Best score: ', grid_search.best_score_)
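The grid search above scores each candidate with 5-fold cross-validation on the training set. If the candidates should instead be scored on the separate validation set, one option is `PredefinedSplit`. The following is a minimal sketch on synthetic data generated with `make_classification`; the data, the variable names, and the reduced grid are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

# Synthetic 3-class data standing in for the TF-IDF features (hypothetical).
X_all, y_all = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# -1 marks rows that are always used for training; 0 marks the single validation fold.
fold_ids = np.concatenate([np.full(len(X_tr), -1), np.zeros(len(X_va), dtype=int)])
val_split = PredefinedSplit(test_fold=fold_ids)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=val_split,
    scoring="accuracy",
)
# Train and validation rows are stacked in the same order as fold_ids.
search.fit(np.vstack([X_tr, X_va]), np.concatenate([y_tr, y_va]))
print(search.best_params_, search.best_score_)
```

With the sparse TF-IDF matrices from this notebook, `scipy.sparse.vstack` would take the place of `np.vstack`. Note that with the default `refit=True`, the best model is refit on all rows passed to `fit`, i.e. training and validation rows together.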

### Model Selection and Testing

Choose the best model with the best hyperparameters based on the validation set and test on the test data using accuracy, precision, recall, and F1-score. Display all these evaluation metrics as well as the confusion matrix of the best model.

In [None]:
# import libraries
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

In [None]:
# validate the accuracy of your best selected model

# Best hyperparameters found by the grid search in the previous part
# (the recorded run selected C=10, max_iter=100, penalty='l2', solver='lbfgs')
best_hyperparameters = grid_search.best_params_

# Create the best model with the best hyperparameters
best_model = LogisticRegression(**best_hyperparameters)

# Train the best model on the entire training data
best_model.fit(X_train_tfidf, y_train)

# Make predictions on the test set
test_predictions_best = best_model.predict(X_test_tfidf)

# Evaluate the best model on the test set
# (average='weighted' averages per-class scores weighted by class support)
test_accuracy_best = accuracy_score(y_test, test_predictions_best)
test_precision_best = precision_score(y_test, test_predictions_best, average='weighted')
test_recall_best = recall_score(y_test, test_predictions_best, average='weighted')
test_f1_score_best = f1_score(y_test, test_predictions_best, average='weighted')

# Display the evaluation metrics
print("Evaluation Metrics for the Best Model on Test Set:")
print("Accuracy:", test_accuracy_best)
print("Precision:", test_precision_best)
print("Recall:", test_recall_best)
print("F1-Score:", test_f1_score_best)

# Display the confusion matrix (rows: true labels, columns: predicted labels)
print("Confusion Matrix:")
print(metrics.confusion_matrix(y_test, test_predictions_best))
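To make the relation between the confusion matrix and the weighted metrics concrete, here is a small worked sketch on ten made-up labels (the label lists are hypothetical, not results from this notebook). It recomputes the weighted precision by hand from the confusion matrix and compares it with `precision_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

labels = ["negative", "neutral", "positive"]

# Ten made-up true and predicted labels for illustration only.
y_true_demo = ["negative"] * 3 + ["neutral"] * 2 + ["positive"] * 5
y_pred_demo = ["negative", "negative", "neutral",
               "neutral", "positive",
               "positive", "positive", "positive", "positive", "negative"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Precision per class: correct predictions divided by all predictions of that class.
per_class_precision = cm.diagonal() / cm.sum(axis=0)

# Support per class: number of true examples of that class.
support = cm.sum(axis=1)

# Weighted precision: per-class precision averaged with support as weights.
weighted_by_hand = float(np.sum(per_class_precision * support) / support.sum())
weighted_sklearn = precision_score(y_true_demo, y_pred_demo, average="weighted")

print(cm)
print(weighted_by_hand, weighted_sklearn)
```

Here the per-class precisions are 2/3, 1/2 and 4/5 with supports 3, 2 and 5, so the weighted precision is (2 + 1 + 4) / 10 = 0.7.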
