### Hypotheses

1. Is there a significant difference in the average word count between positive and negative reviews?

2. Is there a significant association between sentiment (positive) and the presence of specific words like "amazing," "great," and "interesting" in movie reviews?

3. Is there a significant association between sentiment (negative) and the presence of specific words like "terrible," "bad," and "boring" in movie reviews?


### Questions 

1. What is the distribution of sentiment labels in the movie reviews dataset, and how are they distributed among different sentiment categories?

2. What does the distribution of word counts in movie reviews reveal about the length of reviews in the dataset?

3. How does the word count of movie reviews relate to their sentiment? Is there a noticeable difference in word count between positive and negative reviews?

## Installation of libraries

In [None]:
# !pip install spacy
# !pip install gradio
# !python -m spacy download en_core_web_sm
# !pip install wordcloud

## Importation of libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import string
import pickle
import os
import shutil
from collections import Counter
from wordcloud import WordCloud
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
import scipy.stats as stats
from scipy.stats import ttest_ind
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import roc_curve, auc
from sklearn.base import TransformerMixin 
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings("ignore")

## Data Collection & Loading

In [None]:
# !git clone https://github.com/Israel-Anaba/Movie_Review_Analysis.git

In [None]:
# %cd Movie_Review_Analysis

In [None]:
# Raw string so the backslash in the Windows path is not treated as an escape
data = pd.read_csv(r'C:Assets\Train.csv')

In [None]:
data

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.isna().sum()

In [None]:
print(f'Columns Names: {list(data.columns)}')

In [None]:
# Raw string so the backslash in the Windows path is not treated as an escape
data1 = pd.read_csv(r'C:Assets\Test.csv')

In [None]:
data1

In [None]:
data1.info()

In [None]:
data.rename(columns={'content': 'review'}, inplace=True)

In [None]:
data.drop('review_file', axis=1, inplace=True)

In [None]:
data

## EDA

In [None]:
# Distribution of sentiment labels
sns.countplot(x='sentiment', data=data)
plt.title('Distribution of Sentiment Labels')
plt.show()

In [None]:
# Count the number of each sentiment label
sentiment_count = data['sentiment'].value_counts()

print(sentiment_count)

#### Observation - The two classes (Positive and Negative) are evenly matched. This confirms that the classes are balanced, so no resampling is needed to mitigate class imbalance.

In [None]:
hashtags = data['review'].str.findall(r'#\w+')
hashtags = [item for sublist in hashtags for item in sublist]
hashtag_freq = Counter(hashtags)

# Get the top 10 hashtags
top_10_hashtags = dict(hashtag_freq.most_common(10))

plt.figure(figsize=(10, 6))
plt.bar(top_10_hashtags.keys(), top_10_hashtags.values())
plt.xlabel('Hashtags')
plt.ylabel('Frequency')
plt.title('Top 10 Hashtags')
plt.xticks(rotation=45)
plt.show()

#### Observation - The hashtags found in the dataset are numeric. They are likely consumers numbering their reviews.

In [None]:
# Word cloud for positive reviews
positive_reviews = data[data['sentiment'] == 'positive']['review']
positive_text = ' '.join(positive_reviews)
positive_wordcloud = WordCloud(width=800, height=400).generate(positive_text)
plt.figure(figsize=(10, 5))
plt.imshow(positive_wordcloud, interpolation='bilinear')
plt.title('Word Cloud for Positive Reviews')
plt.axis('off')
plt.show()

#### Observation - This word cloud shows some of the most prevalent words in the Positive class. Words like "movie", "watch", and "director" feature heavily. These terms are closely related to the movie domain within the Positive class.

In [None]:
# Word cloud for negative reviews
negative_reviews = data[data['sentiment'] == 'negative']['review']
negative_text = ' '.join(negative_reviews)
negative_wordcloud = WordCloud(width=800, height=400).generate(negative_text)
plt.figure(figsize=(10, 5))
plt.imshow(negative_wordcloud, interpolation='bilinear')
plt.title('Word Cloud for Negative Reviews')
plt.axis('off')
plt.show()

#### Observation - This word cloud shows some of the most prevalent words in the Negative class. Words like "movie", "bad", and "actor" feature heavily. These terms are closely related to the movie domain within the Negative class.

## HYPOTHESIS TESTING

### Question: Is there a significant difference in the average word count between positive and negative reviews?

### H(0) - There is no significant difference in the average word count between positive and negative reviews 

### H(1)- There is a significant difference in the average word count between positive and negative reviews 

In [None]:
# Separate positive and negative reviews
# .copy() avoids pandas' SettingWithCopyWarning when adding the word_count column
positive_reviews = data[data['sentiment'] == 'positive'].copy()
negative_reviews = data[data['sentiment'] == 'negative'].copy()

# Calculate word count for each review
positive_reviews['word_count'] = positive_reviews['review'].apply(lambda x: len(x.split()))
negative_reviews['word_count'] = negative_reviews['review'].apply(lambda x: len(x.split()))

# Perform a two-sample t-test
t_stat, p_value = ttest_ind(positive_reviews['word_count'], negative_reviews['word_count'])

# Define the significance level (alpha)
alpha = 0.05

# Check the p-value
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in word count between positive and negative reviews.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in word count between positive and negative reviews.")


#### Observation - The hypothesis test suggests that the average word count differs between positive and negative reviews, so review length may carry some signal for classifying a review.
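
For reference, `ttest_ind` assumes equal variances in the two groups by default (`equal_var=True`); passing `equal_var=False` gives Welch's variant, which drops that assumption. The sketch below works through the arithmetic of Welch's statistic with the standard library only. The word counts are made-up toy values, not taken from the dataset, and `welch_t` is a hypothetical helper name.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 in the denominator)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group
    se_a, se_b = var_a / n_a, var_b / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Toy word counts (hypothetical values for illustration only)
toy_positive = [120, 340, 95, 410, 230]
toy_negative = [110, 150, 90, 300, 200]

t_stat, dof = welch_t(toy_positive, toy_negative)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.2f}")
```

A positive statistic here only says the first toy sample has the larger mean; the p-value still has to come from the t distribution, which is what `ttest_ind` handles.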

### Question: Is there a significant association between sentiment (positive) and the presence of specific words like "amazing," "great," and "interesting" in movie reviews?

### Hypotheses:

### H(0): There is no significant association between sentiment and the presence of these specific words in movie reviews.
### H(1): There is a significant association between sentiment and the presence of these specific words in movie reviews.

In [None]:
# List of words to test
words_to_test = ["amazing", "great", "interesting"]

for word in words_to_test:
    # Flag reviews that contain the word (case-insensitive substring match)
    has_word = data['review'].str.contains(word, case=False, regex=False)
    is_positive = data['sentiment'] == 'positive'

    # Counts of positive reviews with and without the word
    count_positive_word = (is_positive & has_word).sum()
    count_positive_without_word = (is_positive & ~has_word).sum()

    # Chi-squared test on the 2x2 table: sentiment (rows) x word presence (columns)
    contingency_table = [[count_positive_word, count_positive_without_word],
                         [(~is_positive & has_word).sum(), (~is_positive & ~has_word).sum()]]

    chi2, p, _, _ = stats.chi2_contingency(contingency_table)

    # Print the results for each word
    print(f"Word: '{word}'")
    print(f"Count of positive reviews containing '{word}':", count_positive_word)
    print(f"Count of positive reviews without '{word}':", count_positive_without_word)
    print("Chi-squared statistic:", chi2)
    print("p-value:", p)

    # Perform a hypothesis test for each word
    alpha = 0.05
    if p < alpha:
        print(f"Reject the null hypothesis: There's a significant association between sentiment and '{word}'.")
    else:
        print(f"Fail to reject the null hypothesis: No significant association between sentiment and '{word}'.")
    
    # Add a separator for readability
    print("\n" + "=" * 50 + "\n")

#### Observation - This hypothesis test suggests that the presence of these words in a review is associated with positive sentiment.
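
To make the chi-squared test less of a black box, here is a minimal sketch of the Pearson statistic for a 2x2 table, using only plain Python. The counts are hypothetical, and `chi_squared_2x2` is an illustrative helper, not part of SciPy. Note that `stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so its statistic will be somewhat smaller than the uncorrected value computed here.

```python
def chi_squared_2x2(table):
    """Pearson chi-squared statistic for a 2x2 table, without continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if sentiment and word presence were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows = sentiment, columns = word present / word absent
toy_table = [[30, 70],   # positive reviews
             [10, 90]]   # negative reviews

chi2_stat = chi_squared_2x2(toy_table)
print(chi2_stat)
```

The rows must be the two sentiment classes and the columns the two word-presence outcomes; a table built from one class alone cannot show an association.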

### Question: Is there a significant association between sentiment (negative) and the presence of specific words like "terrible," "boring," and "bad" in movie reviews?

### Hypotheses:

### H(0): There is no significant association between sentiment and the presence of these specific words in movie reviews.
### H(1): There is a significant association between sentiment and the presence of these specific words in movie reviews.

In [None]:
# List of words to test
words_to_test = ["terrible", "bad", "boring"]

for word in words_to_test:
    # Flag reviews that contain the word (case-insensitive substring match)
    has_word = data['review'].str.contains(word, case=False, regex=False)
    is_negative = data['sentiment'] == 'negative'

    # Counts of negative reviews with and without the word
    count_negative_with_word = (is_negative & has_word).sum()
    count_negative_without_word = (is_negative & ~has_word).sum()

    # Chi-squared test on the 2x2 table: sentiment (rows) x word presence (columns)
    contingency_table = [[count_negative_with_word, count_negative_without_word],
                         [(~is_negative & has_word).sum(), (~is_negative & ~has_word).sum()]]

    chi2, p, _, _ = stats.chi2_contingency(contingency_table)

    # Print the results for each word
    print(f"Word: '{word}'")
    print("Count of negative reviews containing the word:", count_negative_with_word)
    print("Count of negative reviews without the word:", count_negative_without_word)
    print("Chi-squared statistic:", chi2)
    print("p-value:", p)

    # Perform a hypothesis test for each word
    alpha = 0.05
    if p < alpha:
        print(f"Reject the null hypothesis: There's a significant association between sentiment and '{word}'.")
    else:
        print(f"Fail to reject the null hypothesis: No significant association between sentiment and '{word}'.")
    # Add a separator for readability
    print("\n" + "=" * 50 + "\n")

#### Observation - This hypothesis test suggests that the presence of these words in a review is associated with negative sentiment.

## ANSWERING QUESTIONS

1. What is the distribution of sentiment labels in the movie reviews dataset, and how are they distributed among different sentiment categories?

In [None]:
# Plot the distribution of sentiment labels
sentiment_counts = data['sentiment'].value_counts()
plt.bar(sentiment_counts.index, sentiment_counts.values)
plt.title('Distribution of Sentiment Labels')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

### Observation - The counts of the two sentiment classes are level at about 12,500 each.

2. What does the distribution of word counts in movie reviews reveal about the length of reviews in the dataset?

In [None]:
# Calculate word count for each review
data['word_count'] = data['review'].apply(lambda x: len(x.split()))

# Plot the word count distribution
plt.hist(data['word_count'], bins=20)
plt.title('Word Count Distribution in Reviews')
plt.xlabel('Word Count')
plt.ylabel('Frequency')
plt.show()

### Observation - Reviews with a word count between roughly 100 and 500 are the most common, with histogram bins in that range reaching about 8,000 reviews.

3. How does the word count of movie reviews relate to their sentiment? Is there a noticeable difference in word count between positive and negative reviews?

In [None]:
# Create a box plot to visualize the relationship between sentiment and word count
plt.figure(figsize=(8, 6))
sns.boxplot(x='sentiment', y='word_count', data=data)
plt.title('Relationship between Sentiment and Word Count')
plt.xlabel('Sentiment')
plt.ylabel('Word Count')
plt.show()

### Observation - The average word count in the Positive class was about 500, while the Negative class averaged about 400 words. This suggests that longer reviews are more likely to have positive sentiment.

## Feature Processing & Engineering

In [None]:
# Remove hashtags from the 'review' column
data['review'] = data['review'].str.replace(r'#\w+', '', regex=True)

#display dataframe
data

In [None]:
# Load spaCy
nlp = English()
stopwords = list(nlp.Defaults.stop_words)
punctuations = string.punctuation

In [None]:
# Preprocess the text 
data['review'] = data['review'].apply(lambda text: text.strip().lower())

In [None]:
# Tokenization, stopwords removal, and punctuation removal using spaCy
def process_text(text):
    doc = nlp(text)
    tokens = [token.text for token in doc if token.text not in stopwords and token.text not in punctuations]
    return ' '.join(tokens)
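
As a rough illustration of what this cleaning step does, the sketch below approximates it with the standard library only. It is not spaCy's tokenizer (which handles contractions, URLs, and similar cases differently), and `TOY_STOPWORDS` and `simple_clean` are hypothetical names; spaCy's real stop-word list is far larger.

```python
import re
import string

# Tiny hypothetical stop-word set, for illustration only
TOY_STOPWORDS = {'the', 'a', 'is', 'and', 'this', 'was'}


def simple_clean(text):
    """Lowercase, split into word and punctuation tokens, drop stop words and punctuation."""
    tokens = re.findall(r"\w+|[^\w\s]", text.strip().lower())
    kept = [t for t in tokens if t not in TOY_STOPWORDS and t not in string.punctuation]
    return ' '.join(kept)


cleaned = simple_clean("This movie was GREAT, and the cast is fun!")
print(cleaned)
```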

In [None]:
data.drop('word_count', axis=1, inplace=True)

In [None]:
data

In [None]:
# Create a label encoder
label_encoder = LabelEncoder()

# Encode the 'sentiment' column in your DataFrame
data['sentiment'] = label_encoder.fit_transform(data['sentiment'])

data['sentiment'].unique()
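
`LabelEncoder` assigns integer codes in the sorted order of the class names, so with string labels 'negative' and 'positive' the mapping is negative = 0 and positive = 1. A minimal plain-Python sketch of the same idea, using a toy label list rather than the dataset:

```python
# Toy labels (hypothetical); the real column has one label per review
labels = ['positive', 'negative', 'negative', 'positive']

# Sorted unique classes determine the integer codes
classes = sorted(set(labels))
mapping = {label: code for code, label in enumerate(classes)}
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```

After this step the sentiment column holds integers, so any later code that compares it against the strings 'positive' or 'negative' needs the mapping above.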


## Data Splitting

In [None]:
# Split the data into training and testing sets
X = data['review']
y = data['sentiment']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

## Feature Encoding

In [None]:
X_train = X_train.apply(process_text)
X_test = X_test.apply(process_text)

In [None]:
# Text vectorization using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
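
To show what the TF-IDF weights represent, here is a small standard-library sketch of the weighting that `TfidfVectorizer` uses with its default settings: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The tokenization here is a plain whitespace split, which is a simplification of the vectorizer's own token pattern, and the three documents are made up.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return the sorted vocabulary and one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocabulary}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


toy_docs = ["great movie great cast", "boring movie", "great fun"]
vocabulary, rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the second toy document, "boring" gets a higher weight than "movie" because it appears in fewer documents. Fitting the vectorizer on the training split only, as done above, keeps test-set document frequencies out of the IDF values.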

## TRAIN & EVALUATE MODEL

In [None]:
# Create an empty DataFrame to store the results
results_df = pd.DataFrame(columns=['Classifier', 'Accuracy'])

from sklearn.svm import SVC  

# List of classifiers
classifiers = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(),
    "RandomForest": RandomForestClassifier(),
    "SVC": SVC()  
}

# Train and evaluate each classifier
for clf_name, classifier in classifiers.items():
    classifier.fit(X_train_tfidf, y_train)
    y_pred = classifier.predict(X_test_tfidf)

    # Calculate and print the accuracy score in percentage format
    accuracy = accuracy_score(y_test, y_pred) * 100
    print(f"Accuracy Score for {clf_name}: {accuracy:.0f}%\n")
    

    # Append the results to the DataFrame (DataFrame.append was removed in pandas 2.0)
    results_df.loc[len(results_df)] = [clf_name, accuracy]

    # Print the classification report
    class_report = classification_report(y_test, y_pred)
    print(f"Classification Report for {clf_name}:\n{class_report}\n")

results_df = results_df.sort_values(by='Accuracy', ascending=False)
results_df


## CONFUSION MATRIX

In [None]:
# Iterate through each classifier in the dictionary
for clf_name, classifier in classifiers.items():
    # Fit the classifier on the training data
    classifier.fit(X_train_tfidf, y_train)
    
    # Make predictions on the test data
    y_pred = classifier.predict(X_test_tfidf)
    
    # Create the confusion matrix
    cm = confusion_matrix(y_test, y_pred)

    # Print the confusion matrix distribution
    print(f"Confusion Matrix for {clf_name} Distribution:")
    print(pd.DataFrame(cm, index=['Actual Negative', 'Actual Positive'], columns=['Predicted Negative', 'Predicted Positive']))

    # Plot the confusion matrix as a heatmap
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
    plt.title(f'Confusion Matrix for {clf_name}')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()
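
The confusion matrix also yields the headline metrics directly. In scikit-learn's layout, rows are actual classes and columns are predicted classes in sorted label order, so a binary matrix reads `[[TN, FP], [FN, TP]]`. The sketch below derives accuracy, precision, and recall from a hypothetical matrix; the counts are illustrative, not results from this notebook.

```python
def metrics_from_confusion(cm):
    """Compute accuracy, precision and recall from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        'accuracy': (tp + tn) / total,
        'precision': tp / (tp + fp),  # of predicted positives, how many are correct
        'recall': tp / (tp + fn),     # of actual positives, how many are found
    }


# Hypothetical counts for illustration
toy_cm = [[40, 10],
          [5, 45]]
toy_metrics = metrics_from_confusion(toy_cm)
print(toy_metrics)
```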

## HYPERPARAMETER TUNING

In [None]:
#Define the hyperparameter grid
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  
    'penalty': ['l1', 'l2'] 
}

# Create the logistic regression model with random state
logistic_classifier = LogisticRegression(solver='liblinear', max_iter=1000, random_state=42)

# Create the GridSearchCV object
grid_search = GridSearchCV(logistic_classifier, param_grid, cv=5, scoring='accuracy')

# Perform grid search
grid_search.fit(X_train_tfidf, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Train the final model with the best hyperparameters
best_classifier = LogisticRegression(solver='liblinear', max_iter=1000, random_state=42, **best_params)

best_classifier.fit(X_train_tfidf, y_train)
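
To get a feel for the cost of the search, the sketch below enumerates the same grid with `itertools.product` and counts the model fits. It uses no scikit-learn; the variable names other than `param_grid` are illustrative.

```python
from itertools import product

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
}

# Every combination of C and penalty is one candidate model
keys = list(param_grid)
candidates = [dict(zip(keys, values)) for values in product(*(param_grid[k] for k in keys))]

cv_folds = 5
total_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", total_fits, "cross-validation fits")
```

With its default `refit=True`, `GridSearchCV` also retrains the best candidate on the full training data, so `grid_search.best_estimator_` is an alternative to constructing the final model by hand.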

In [None]:
# Make predictions on the test set using the tuned model
y_pred = best_classifier.predict(X_test_tfidf)

# Calculate and print the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy Score after Hyperparameter Tuning: {accuracy:.2f}")

In [None]:
from sklearn.metrics import roc_curve, auc

# y_test was already label-encoded above (LabelEncoder sorts classes, so negative = 0, positive = 1)
y_test_binary = y_test

# Get the predicted probabilities for the positive class
y_prob = best_classifier.predict_proba(X_test_tfidf)[:, 1]

# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test_binary, y_prob)

# Calculate the AUC (Area Under the Curve)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()

#### Observation - An AUC of 0.9 or higher indicates outstanding discriminative ability. The ROC curve here has an AUC of 0.94, which means the model separates positive from negative reviews effectively and is a strong performer on this binary classification task.
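
One way to read the AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on four toy scores; `pairwise_auc` is a hypothetical helper, and the quadratic loop is only suitable for small illustrations.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities (hypothetical)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Three of the four positive/negative pairs are ordered correctly in the toy data, giving 0.75.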

In [None]:
# Create a dictionary to store the components and model
components_dict = {
    'cleaner': process_text,
    'cleaner_tfidf_vectorizer': tfidf_vectorizer,
    'cleaner_classifier': best_classifier
}

In [None]:
# Create a folder to store the exported data
folder_name = 'comp_folder'
os.makedirs(folder_name, exist_ok=True)

In [None]:
# Export the components and model
components_path = os.path.join(folder_name, 'sentiment_components.pkl')
with open(components_path, 'wb') as file:
    pickle.dump(components_dict, file)

In [None]:
# Generate the requirements.txt file
requirements_file = 'requirements.txt'
os.system(f'pip freeze > {os.path.join(folder_name, requirements_file)}')

In [None]:
# Zip the exported data folder
shutil.make_archive('exported_data', 'zip', folder_name)

print("Exported data has been zipped.")

### Test Data

In [None]:
# Load the exported components from the file
components_path = 'comp_folder/sentiment_components.pkl'
with open(components_path, 'rb') as file:
    components_dict = pickle.load(file)

In [None]:
# Extract the loaded components
loaded_cleaner = components_dict['cleaner']
loaded_vectorizer = components_dict['cleaner_tfidf_vectorizer']
loaded_classifier = components_dict['cleaner_classifier']

In [None]:
# Preview the test data loaded earlier from Test.csv
data1

In [None]:
data1.rename(columns={'content': 'review'}, inplace=True)

In [None]:
files = data1['review_file'].copy()

In [None]:
data1.drop('review_file', axis=1, inplace=True)

In [None]:
data1['review'] = data1['review'].apply(lambda text: text.strip().lower())

In [None]:
# Apply text cleaning to the new data
data1['review'] = data1['review'].apply(loaded_cleaner)

# Vectorize the new data using the loaded vectorizer
X_new_tfidf = loaded_vectorizer.transform(data1['review'])

# Make predictions on the new data
new_predictions = loaded_classifier.predict(X_new_tfidf)

In [None]:
# Add the "review_file" column back to the data1 DataFrame
data1['review_file'] = files

# Add the predictions to the data1 DataFrame
data1['predicted_sentiment'] = new_predictions

In [None]:
# Create a new DataFrame with only "review_file" and "predicted_sentiment" columns
prediction = data1[['review_file', 'predicted_sentiment']]

In [None]:
data1

In [None]:
prediction

In [None]:
# Count the predicted labels (0 = negative, 1 = positive after label encoding)
count = prediction['predicted_sentiment'].value_counts()

print(count)

In [None]:
# Save the Predictions
prediction.to_csv('predicted_sentiment_results.csv', index=False)

In [None]:
# !pip show spacy