## Sentiment Analysis


* Splitting the data into training and test sets using the `train_test_split` function so that the training set is 75% of the whole data (set the argument `random_state=2023` to make the result deterministic, and make sure the split is stratified on the target)

* Reporting and interpreting the result (accuracy score) on the test set
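
As a quick sanity check of what `stratify` does, the sketch below splits a small made-up label list (not the IMDB data) and compares the class counts in each part. The `toy_*` names and the 60/20 class mix are assumptions chosen only for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 reviews, 60 positive and 20 negative (imbalanced on purpose)
toy_reviews = ["review {}".format(i) for i in range(80)]
toy_labels = ["positive"] * 60 + ["negative"] * 20

# Same arguments as in the task: 75% train, stratified, fixed random state
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_reviews, toy_labels, test_size=0.25, stratify=toy_labels, random_state=2023
)

# Count the classes in each part to compare them with the original 3:1 ratio
train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print("Train:", dict(train_counts))
print("Test:", dict(test_counts))
```

With stratification, both parts keep the original 3:1 class ratio (45/15 in the training part and 15/5 in the test part). Without `stratify`, the ratio in the smaller part could drift by chance.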

In [4]:


import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the dataset
dataset_path = "/Users/shivanivellanki/Downloads/IMDB Dataset.csv"
df = pd.read_csv(dataset_path)

# Split the data into training and testing sets
X = df['review']  # Input features
y = df['sentiment']  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=2023)

# Vectorize the text data
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train a logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy score on the test set: {:.2f}".format(accuracy))


# Make predictions on new data
new_data = ["I didnt like the actors of this movie", "This movie showed great acting"]
new_data_features = vectorizer.transform(new_data)  # Vectorize the new data using the same vectorizer

# Predict the sentiment of the new data
new_data_predictions = model.predict(new_data_features)

# Print the predictions
for data, prediction in zip(new_data, new_data_predictions):
    print("Data: {}\nSentiment: {}".format(data, prediction))



Accuracy score on the test set: 0.90
Data: I didnt like the actors of this movie
Sentiment: negative
Data: This movie showed great acting
Sentiment: positive


In [None]:
# The accuracy score on the test set is 0.90, which means the model correctly classified about 90% of the test reviews as 'positive' or 'negative'.
# This is a high accuracy for a simple TF-IDF plus logistic regression model.
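
An accuracy figure is easier to interpret next to a baseline and a per-class breakdown. The sketch below uses made-up labels and predictions (not the notebook's results) to show how a majority-class baseline and a confusion matrix could be computed. If the two classes are balanced, the baseline is about 0.50, so an accuracy well above that indicates the model has learned something useful.

```python
from collections import Counter

from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, for illustration only
toy_true = ["positive", "positive", "positive", "negative", "negative", "negative"]
toy_pred = ["positive", "positive", "negative", "negative", "negative", "positive"]

# Baseline: accuracy obtained by always predicting the most frequent class
majority_class, majority_count = Counter(toy_true).most_common(1)[0]
baseline_accuracy = majority_count / len(toy_true)

toy_accuracy = accuracy_score(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes, in the order given by `labels`
toy_cm = confusion_matrix(toy_true, toy_pred, labels=["negative", "positive"])

print("Baseline accuracy: {:.2f}".format(baseline_accuracy))
print("Model accuracy: {:.2f}".format(toy_accuracy))
print(toy_cm)
```

The off-diagonal entries of the matrix show which kind of error dominates (negative reviews predicted as positive, or the reverse), which a single accuracy number hides.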


* Adding cross-validation using the `RepeatedKFold` class with 5 splits, 10 repeats, and 2023 as the random state
* Reporting the average and the standard deviation of the accuracy score on both the training and test sets


In [5]:

from sklearn.model_selection import train_test_split, RepeatedKFold
from statistics import mean, stdev

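# Perform repeated k-fold cross-validation on the (already vectorized) training data
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=2023)
train_scores = []
test_scores = []  # accuracy on the held-out fold of each split

for train_index, val_index in rkf.split(X_train):
    # X_train is a sparse TF-IDF matrix, so rows are selected by position directly;
    # y_train is a pandas Series, so positional selection needs .iloc
    X_train_fold, X_val_fold = X_train[train_index], X_train[val_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

    # Refit the model on the training part of this fold
    model.fit(X_train_fold, y_train_fold)

    # Accuracy on the data the model was just fitted on
    train_pred = model.predict(X_train_fold)
    train_score = accuracy_score(y_train_fold, train_pred)
    train_scores.append(train_score)

    # Accuracy on the held-out fold
    val_pred = model.predict(X_val_fold)
    val_score = accuracy_score(y_val_fold, val_pred)
    test_scores.append(val_score)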

# Calculate and print average and standard deviation of scores
train_avg_score = mean(train_scores)
train_std_dev = stdev(train_scores)
print("Average accuracy score on training set: {:.2f}".format(train_avg_score))
print("Standard deviation of accuracy score on training set: {:.2f}".format(train_std_dev))

test_avg_score = mean(test_scores)
test_std_dev = stdev(test_scores)
print("Average accuracy score on test set: {:.2f}".format(test_avg_score))
print("Standard deviation of accuracy score on test set: {:.2f}".format(test_std_dev))

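# Rough heuristic for overfitting/underfitting based only on the sign of the gap.
# Training accuracy is normally a little higher than held-out accuracy, and
# underfitting usually shows up as low accuracy on both sets, so treat these messages as hints.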
if train_avg_score > test_avg_score:
    print("The model is potentially overfitting the training data.")
elif train_avg_score < test_avg_score:
    print("The model is potentially underfitting the training data.")
else:
    print("The model is performing consistently on both the training and test data.")


Average accuracy score on training set: 0.93
Standard deviation of accuracy score on training set: 0.00
Average accuracy score on test set: 0.89
Standard deviation of accuracy score on test set: 0.00
The model is potentially overfitting the training data.
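
One alternative way to run repeated cross-validation is to wrap the vectorizer and the classifier in a pipeline and call `cross_validate`, so that TF-IDF is refit inside every fold and the held-out fold never influences the vocabulary or IDF weights. The sketch below is not the notebook's method. It uses a tiny made-up corpus and fewer splits and repeats so that it runs quickly, and all `toy_*` names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the raw (unvectorized) training reviews
toy_texts = [
    "great acting and a wonderful story",
    "terrible plot and awful acting",
    "wonderful film with great music",
    "awful film with a terrible ending",
    "great story and wonderful cast",
    "terrible cast and awful music",
    "a wonderful and great experience",
    "an awful and terrible experience",
]
toy_y = ["positive", "negative"] * 4

# The vectorizer is part of the estimator, so it is fitted on each training fold only
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
toy_rkf = RepeatedKFold(n_splits=4, n_repeats=2, random_state=2023)

results = cross_validate(
    pipeline, toy_texts, toy_y, cv=toy_rkf, scoring="accuracy", return_train_score=True
)

# ddof=1 gives the sample standard deviation, matching statistics.stdev
for name in ("train_score", "test_score"):
    scores = results[name]
    print("{}: mean={:.4f}, std={:.4f}".format(name, np.mean(scores), np.std(scores, ddof=1)))
```

Printing four decimal places instead of two makes a small but non-zero standard deviation visible rather than rounding it to 0.00.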
