## Section Two - Sentiment Analysis


### <div class="alert alert-info"></div>
* Split the data into training and test sets using the `train_test_split` function so that the training set is 67% of the whole data (set `random_state=80493` to make the result deterministic, and make sure the split is stratified)

* Use two feature extraction methods, __HashingVectorizer__ and __TfidfVectorizer__, from `sklearn.feature_extraction.text`

* Use two classification algorithms, one of which must be Perceptron.

* Report and interpret the results (accuracy score) on the test set. You should report four results, one per vectorizer/classifier combination
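
To make the split requirement concrete, here is a minimal sketch on made-up labels (the 600/400 class counts and the `review i` strings are assumptions for illustration, not the IMDB data). It checks that `test_size=0.33` with `stratify` leaves roughly 67% of the rows for training and keeps the class proportions nearly identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the review data: 600 positive, 400 negative labels.
labels = np.array(["positive"] * 600 + ["negative"] * 400)
texts = np.array([f"review {i}" for i in range(len(labels))])

# Same arguments as in the task: 33% test, fixed seed, stratified on the labels.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.33, random_state=80493, stratify=labels
)

train_share = len(X_tr) / len(texts)
pos_share_train = float(np.mean(y_tr == "positive"))
pos_share_test = float(np.mean(y_te == "positive"))

print(f"training share: {train_share:.2f}")
print(f"positive share in train: {pos_share_train:.3f}, in test: {pos_share_test:.3f}")
```

Without `stratify`, the two positive shares could drift apart by chance, which matters more on small or imbalanced datasets.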

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.metrics import accuracy_score

# Loading the dataset
file_path = 'IMDB Dataset.csv'
data = pd.read_csv(file_path)

# Splitting the data (67% train / 33% test, stratified on the labels)
X = data['review']
y = data['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=80493, stratify=y)

# Vectorizing the data
hashing_vectorizer = HashingVectorizer()
tfidf_vectorizer = TfidfVectorizer()

# Transforming the data
X_train_hash = hashing_vectorizer.fit_transform(X_train)
X_test_hash = hashing_vectorizer.transform(X_test)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

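# Initializing classifiers
# Note: SGDClassifier is created without a fixed random_state,
# so its scores can vary slightly from run to run
perceptron = Perceptron()
sgd_classifier = SGDClassifier()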

# Train and evaluate Perceptron with HashingVectorizer
perceptron.fit(X_train_hash, y_train)
y_pred_hash_perceptron = perceptron.predict(X_test_hash)
accuracy_hash_perceptron = accuracy_score(y_test, y_pred_hash_perceptron)

# Train and evaluate Perceptron with TfidfVectorizer (refits the same estimator)
perceptron.fit(X_train_tfidf, y_train)
y_pred_tfidf_perceptron = perceptron.predict(X_test_tfidf)
accuracy_tfidf_perceptron = accuracy_score(y_test, y_pred_tfidf_perceptron)

# Train and evaluate SGDClassifier with HashingVectorizer
sgd_classifier.fit(X_train_hash, y_train)
y_pred_hash_sgd = sgd_classifier.predict(X_test_hash)
accuracy_hash_sgd = accuracy_score(y_test, y_pred_hash_sgd)

# Train and evaluate SGDClassifier with TfidfVectorizer
sgd_classifier.fit(X_train_tfidf, y_train)
y_pred_tfidf_sgd = sgd_classifier.predict(X_test_tfidf)
accuracy_tfidf_sgd = accuracy_score(y_test, y_pred_tfidf_sgd)

# Print results
print(f'Accuracy of Perceptron with HashingVectorizer: {accuracy_hash_perceptron:.4f}')
print(f'Accuracy of Perceptron with TfidfVectorizer: {accuracy_tfidf_perceptron:.4f}')
print(f'Accuracy of SGDClassifier with HashingVectorizer: {accuracy_hash_sgd:.4f}')
print(f'Accuracy of SGDClassifier with TfidfVectorizer: {accuracy_tfidf_sgd:.4f}')


Accuracy of Perceptron with HashingVectorizer: 0.8742
Accuracy of Perceptron with TfidfVectorizer: 0.8776
Accuracy of SGDClassifier with HashingVectorizer: 0.8667
Accuracy of SGDClassifier with TfidfVectorizer: 0.8997
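
The two feature extractors used here behave differently at fit time, which may help when interpreting the four scores. The sketch below uses a tiny made-up corpus (the example sentences are assumptions for illustration): `HashingVectorizer` is stateless and maps tokens into a fixed number of columns, while `TfidfVectorizer` learns a vocabulary and IDF weights from the data it is fitted on, so words unseen during fitting are dropped.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Toy documents, invented for this illustration only.
docs_train = ["a great movie", "a terrible movie", "great acting"]
docs_test = ["terrible acting", "an unseen masterpiece"]

# HashingVectorizer needs no fitting: the column of each token comes from a hash.
hasher = HashingVectorizer()
X_hash = hasher.transform(docs_test)

# TfidfVectorizer must be fitted first; its vocabulary comes from docs_train.
tfidf = TfidfVectorizer()
tfidf.fit(docs_train)
X_tfidf = tfidf.transform(docs_test)

print("hashing shape:", X_hash.shape)
print("tf-idf shape:", X_tfidf.shape)
print("tf-idf vocabulary:", sorted(tfidf.vocabulary_))
# The second test document shares no word with the fitted vocabulary.
print("non-zero tf-idf entries in second test doc:", X_tfidf[1].nnz)
```

Because hashing has no learned state, calling `fit_transform` on the training reviews and `transform` on the test reviews, as in the cell above, gives the same mapping either way.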


### <div class="alert alert-info"></div>
* Add cross-validation using `RepeatedKFold` with 10 splits, 10 repeats, and 80493 as the random state.
* Report the cross-validation result on the training set as the mean and standard deviation of the accuracy score
* Compare it to the accuracy score on the test set, and explain whether the model is overfitting or underfitting the training data
* You should report four sets of results
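
As a small illustration of what this cross-validation setup produces, the sketch below runs `RepeatedKFold` with 10 splits and 10 repeats on a made-up 20-document corpus (the toy reviews and the `Perceptron(random_state=0)` setting are assumptions). It wraps the vectorizer and classifier in a pipeline, so the TF-IDF statistics are refitted inside each fold; this is one possible variant, not necessarily the approach taken in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration: 10 positive and 10 negative reviews.
texts = [f"great wonderful film take {i}" for i in range(10)]
texts += [f"awful boring film take {i}" for i in range(10)]
labels = ["positive"] * 10 + ["negative"] * 10

# The vectorizer is part of the estimator, so each fold fits it on its own training part.
model = make_pipeline(TfidfVectorizer(), Perceptron(random_state=0))

# 10 splits x 10 repeats gives 100 accuracy scores.
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=80493)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")

print(f"number of scores: {len(scores)}")
print(f"mean accuracy: {np.mean(scores):.4f}, std: {np.std(scores):.4f}")
```

Each score is measured on the held-out tenth of the training data, so the mean estimates generalization rather than accuracy on the data the model was fitted to.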

In [4]:
# train_test_split and accuracy_score are already imported in the first cell
from sklearn.model_selection import RepeatedKFold, cross_val_score
import numpy as np


# Defining cross-validation
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=80493)

# Cross-validation and evaluation
def evaluate_model(model, X_train, y_train, X_test, y_test):
    # Cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy')
    mean_cv_score = np.mean(cv_scores)
    std_cv_score = np.std(cv_scores)

    # Train and test accuracy
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_pred)

    return mean_cv_score, std_cv_score, test_accuracy

# Evaluate Perceptron with HashingVectorizer
mean_cv_hash_perceptron, std_cv_hash_perceptron, accuracy_hash_perceptron = evaluate_model(perceptron, X_train_hash, y_train, X_test_hash, y_test)

# Evaluate Perceptron with TfidfVectorizer
mean_cv_tfidf_perceptron, std_cv_tfidf_perceptron, accuracy_tfidf_perceptron = evaluate_model(perceptron, X_train_tfidf, y_train, X_test_tfidf, y_test)

# Evaluate SGDClassifier with HashingVectorizer
mean_cv_hash_sgd, std_cv_hash_sgd, accuracy_hash_sgd = evaluate_model(sgd_classifier, X_train_hash, y_train, X_test_hash, y_test)

# Evaluate SGDClassifier with TfidfVectorizer
mean_cv_tfidf_sgd, std_cv_tfidf_sgd, accuracy_tfidf_sgd = evaluate_model(sgd_classifier, X_train_tfidf, y_train, X_test_tfidf, y_test)

# Print results
print(f'Perceptron with HashingVectorizer: CV Mean Accuracy = {mean_cv_hash_perceptron:.4f}, CV Std Dev = {std_cv_hash_perceptron:.4f}, Test Accuracy = {accuracy_hash_perceptron:.4f}')
print(f'Perceptron with TfidfVectorizer: CV Mean Accuracy = {mean_cv_tfidf_perceptron:.4f}, CV Std Dev = {std_cv_tfidf_perceptron:.4f}, Test Accuracy = {accuracy_tfidf_perceptron:.4f}')
print(f'SGDClassifier with HashingVectorizer: CV Mean Accuracy = {mean_cv_hash_sgd:.4f}, CV Std Dev = {std_cv_hash_sgd:.4f}, Test Accuracy = {accuracy_hash_sgd:.4f}')
print(f'SGDClassifier with TfidfVectorizer: CV Mean Accuracy = {mean_cv_tfidf_sgd:.4f}, CV Std Dev = {std_cv_tfidf_sgd:.4f}, Test Accuracy = {accuracy_tfidf_sgd:.4f}')


Perceptron with HashingVectorizer: CV Mean Accuracy = 0.8629, CV Std Dev = 0.0194, Test Accuracy = 0.8742
Perceptron with TfidfVectorizer: CV Mean Accuracy = 0.8732, CV Std Dev = 0.0064, Test Accuracy = 0.8776
SGDClassifier with HashingVectorizer: CV Mean Accuracy = 0.8620, CV Std Dev = 0.0066, Test Accuracy = 0.8663
SGDClassifier with TfidfVectorizer: CV Mean Accuracy = 0.8930, CV Std Dev = 0.0046, Test Accuracy = 0.8988
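
One way to approach the overfitting question is to look at the gap between the test accuracy and the cross-validated mean for each combination. The sketch below does that arithmetic on the figures printed above; the two-standard-deviation threshold is an arbitrary choice for illustration, not a formal test.

```python
# (CV mean, CV std, test accuracy) copied from the cross-validation output above.
results = {
    "Perceptron + HashingVectorizer": (0.8629, 0.0194, 0.8742),
    "Perceptron + TfidfVectorizer": (0.8732, 0.0064, 0.8776),
    "SGDClassifier + HashingVectorizer": (0.8620, 0.0066, 0.8663),
    "SGDClassifier + TfidfVectorizer": (0.8930, 0.0046, 0.8988),
}

gaps = {}
within_band = {}
for name, (cv_mean, cv_std, test_acc) in results.items():
    # Positive gap: the test score is higher than the cross-validated mean.
    gap = round(test_acc - cv_mean, 4)
    gaps[name] = gap
    # Illustrative threshold only: is the gap within two fold-level standard deviations?
    within_band[name] = abs(gap) <= 2 * cv_std
    print(f"{name}: gap = {gap:+.4f}, within 2 std = {within_band[name]}")
```

On these numbers every test accuracy sits slightly above its cross-validated mean, so this comparison alone shows no obvious sign of overfitting. Note that both figures are measured on held-out data; comparing them with accuracy on the training data itself would give a more direct check.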
