# Sentiment Analysis for Customer Reviews Challenge

## Challenge:
Develop a robust Sentiment Analysis classifier for XYZ customer reviews, automating the categorization into positive, negative, or neutral sentiments. Utilize Natural Language Processing (NLP) techniques, exploring different sentiment analysis methods.

## Problem Statement:
XYZ organization, a global online retail giant, accumulates a vast number of customer reviews daily. Extracting sentiments from these reviews offers insights into customer satisfaction, product quality, and market trends. The challenge is to create an effective sentiment analysis model that accurately classifies XYZ customer reviews.

### Important Instructions:

1. Make sure this ipynb file that you have cloned is in the __Project__ folder on the Desktop. The Dataset is also available in the same folder.
2. Ensure that all the cells in the notebook can be executed without any errors.
3. Once the Challenge has been completed, save the SentimentAnalysis.ipynb notebook in the __*Project*__ Folder on the desktop. If the file is not present in that folder, auto-evaluation will fail.
4. Print the evaluation metrics of the model. 
5. Before you submit the challenge for evaluation, please make sure you have assigned the Accuracy score of the model that was created for evaluation.
6. Assign the Accuracy score obtained for the model created in this challenge to the specified variable in the predefined function *submit_accuracy_score*. The solution is to be written between the comments `# code starts here` and `# code ends here`
7. Please do not make any changes to the variable names and the function name *submit_accuracy_score* as this will be used for automated evaluation of the challenge. Any modification in these names will result in unexpected behaviour.

### --------------------------------------- CHALLENGE CODE STARTS HERE --------------------------------------------

In [16]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.neural_network import MLPClassifier
import nltk
# Download NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/labuser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/labuser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/labuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [17]:
df = pd.read_csv('/home/labuser/Desktop/Project/Reviews.csv', nrows=6000)

In [18]:
# 'Score' holds the 1-5 star rating, used here as the target label
label_column = 'Score'

# Encode the ratings as integer class labels (1-5 become 0-4)
label_encoder = LabelEncoder()
df['encoded_label'] = label_encoder.fit_transform(df[label_column])

# Resources used by the text preprocessing function
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
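
The challenge asks for positive, negative, or neutral categories, while the cell above encodes the raw star rating, which yields five classes. A minimal sketch of one possible mapping follows. The thresholds (1-2 negative, 3 neutral, 4-5 positive) and the helper name `score_to_sentiment` are assumptions for illustration; the challenge text does not specify them.

```python
def score_to_sentiment(score: int) -> str:
    """Map a 1-5 star rating to a coarse sentiment label.

    Assumed thresholds (not given in the challenge text):
    1-2 -> negative, 3 -> neutral, 4-5 -> positive.
    """
    if score <= 2:
        return "negative"
    if score == 3:
        return "neutral"
    return "positive"


# Small illustrative sample of ratings
sample_scores = [5, 1, 4, 2, 3]
sample_sentiments = [score_to_sentiment(s) for s in sample_scores]
print(sample_sentiments)
```

In the notebook, something like `df['Score'].map(score_to_sentiment)` could then be passed to the `LabelEncoder` instead of the raw `Score` column. The metrics reported further down were produced with the five-class labels, so they would change under this mapping.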

In [19]:
def preprocess_text(text):
    tokens = word_tokenize(text.lower())  # Tokenization and lowercasing
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word.isalpha()]  # Lemmatization
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return ' '.join(tokens)
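
To make the effect of `preprocess_text` concrete, here is a simplified stand-in that runs without any NLTK downloads. It is an approximation, not the function above: it tokenizes with a regular expression, uses a tiny hypothetical stop-word set (`DEMO_STOP_WORDS`), and skips lemmatization. It also returns an empty string for non-string input, a guard that may be useful if the `Text` column contains missing values.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's full English list.
DEMO_STOP_WORDS = {"the", "is", "a", "and", "it", "this", "of", "to"}


def simple_preprocess(text) -> str:
    """Lowercase, keep alphabetic tokens, drop stop words (no lemmatization)."""
    if not isinstance(text, str):  # e.g. NaN from a missing review
        return ""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in DEMO_STOP_WORDS)


demo = simple_preprocess("This is the BEST coffee, and it arrived in 2 days!")
print(demo)  # best coffee arrived in days
```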

In [20]:
print(df.columns)  # Check column names
print(df.head())   # Display the first few rows of the DataFrame

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text',
       'encoded_label'],
      dtype='object')
   Id   ProductId          UserId                      ProfileName  \
0   1  B001E4KFG0  A3SGXH7AUHU8GW                       delmartian   
1   2  B00813GRG4  A1D87F6ZCVE5NK                           dll pa   
2   3  B000LQOCH0   ABXLMWJIXXAIN  Natalia Corres "Natalia Corres"   
3   4  B000UA0QIQ  A395BORC6FGVXV                             Karl   
4   5  B006K2ZZ7K  A1UQRSCLF8GW1T    Michael D. Bigham "M. Wassir"   

   HelpfulnessNumerator  HelpfulnessDenominator  Score        Time  \
0                     1                       1      5  1303862400   
1                     0                       0      1  1346976000   
2                     1                       1      4  1219017600   
3                     3                       3      2  1307923200   
4                     0               

In [21]:
# Apply text preprocessing
df['processed_text'] = df['Text'].apply(preprocess_text)

In [22]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['processed_text'], df['encoded_label'], test_size=0.2, random_state=42)
 
# Convert text data to TF-IDF features
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
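
For intuition about what the TF-IDF features contain, the sketch below recomputes them by hand with the weighting that `TfidfVectorizer` uses by default: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The whitespace tokenization and the helper name `tfidf_vectors` are simplifications assumed here; scikit-learn's own tokenizer differs (for example, it drops single-character tokens).

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, one L2-normalized TF-IDF vector per document)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["good coffee", "bad coffee", "good good tea"])
for row in vectors:
    print([round(v, 3) for v in row])
```

In the second toy document, "bad" occurs in only one document and therefore gets a larger weight than "coffee", which occurs in two. Rare terms being weighted up is the property the classifiers below rely on.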

In [23]:
# Method 1: Support Vector Machine (SVM) with TF-IDF
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_tfidf, y_train)
y_pred_svm = svm_model.predict(X_test_tfidf)

In [24]:
# Method 2: Multinomial Naive Bayes with TF-IDF
nb_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_test)

In [25]:
# Method 3: Random Forest Classifier with TF-IDF
rf_model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100, random_state=42))
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
 

In [26]:
# Method 4: Logistic Regression with TF-IDF
lr_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

In [27]:
# Method 5: Neural Network (MLP) with TF-IDF
mlp_model = make_pipeline(TfidfVectorizer(), MLPClassifier(hidden_layer_sizes=(100,), max_iter=500))
mlp_model.fit(X_train, y_train)
y_pred_mlp = mlp_model.predict(X_test)
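
Methods 2 to 5 each refit their own `TfidfVectorizer` inside a pipeline, although the features were already computed once for the SVM. One possible alternative, sketched here on a hypothetical toy corpus, is to vectorize once and loop over the candidate classifiers. The helper `evaluate_models`, the toy texts, and the toy labels are all assumptions made for illustration, and only three of the five classifiers are shown to keep it short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Hypothetical toy data standing in for the preprocessed reviews
train_texts = [
    "great product love it",
    "terrible taste waste of money",
    "love the flavor great value",
    "awful product terrible smell",
    "great taste love it",
    "waste of money awful",
]
train_labels = [1, 0, 1, 0, 1, 0]
test_texts = ["love this great flavor", "terrible waste"]
test_labels = [1, 0]

# Fit the vectorizer on the training texts only, then reuse it for the test set
toy_vectorizer = TfidfVectorizer()
X_tr = toy_vectorizer.fit_transform(train_texts)
X_te = toy_vectorizer.transform(test_texts)

candidates = {
    "SVM": SVC(kernel="linear"),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}


def evaluate_models(models, X_train, y_train, X_test, y_test):
    """Fit each model on shared features and return its test accuracy."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


toy_scores = evaluate_models(candidates, X_tr, train_labels, X_te, test_labels)
print(toy_scores)
```

With the notebook's variables, the same helper could be called as `evaluate_models(candidates, X_train_tfidf, y_train, X_test_tfidf, y_test)`.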

In [28]:
# Display the accuracies for each method
print("Accuracy (SVM):", accuracy_score(y_test, y_pred_svm))
print("Accuracy (Naive Bayes):", accuracy_score(y_test, y_pred_nb))
print("Accuracy (Random Forest):", accuracy_score(y_test, y_pred_rf))
print("Accuracy (Logistic Regression):", accuracy_score(y_test, y_pred_lr))
print("Accuracy (Neural Network):", accuracy_score(y_test, y_pred_mlp))

Accuracy (SVM): 0.7033333333333334
Accuracy (Naive Bayes): 0.6516666666666666
Accuracy (Random Forest): 0.6666666666666666
Accuracy (Logistic Regression): 0.69
Accuracy (Neural Network): 0.665
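
Since the instructions ask for a single accuracy score to be submitted, the printed values can be compared programmatically. The snippet below is a small sketch that copies the accuracies from the output above into a dictionary (the name `reported_accuracy` is introduced here for illustration) and picks the highest one.

```python
# Accuracies copied from the notebook output above
reported_accuracy = {
    "SVM": 0.7033333333333334,
    "Naive Bayes": 0.6516666666666666,
    "Random Forest": 0.6666666666666666,
    "Logistic Regression": 0.69,
    "Neural Network": 0.665,
}

# Select the model with the highest test accuracy
best_model = max(reported_accuracy, key=reported_accuracy.get)
best_accuracy = reported_accuracy[best_model]
print(f"{best_model}: {best_accuracy:.4f}")  # SVM: 0.7033
```

The `MLPClassifier` is created without a `random_state`, so its score may differ between runs; the other figures should be reproducible given the fixed seeds and split.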


In [29]:
# Additional metrics: per-class precision, recall, and F1 for each model
print("\nAdditional Metrics (SVM):\n", classification_report(y_test, y_pred_svm))
print("\nAdditional Metrics (Naive Bayes):\n", classification_report(y_test, y_pred_nb))
print("\nAdditional Metrics (Random Forest):\n", classification_report(y_test, y_pred_rf))
print("\nAdditional Metrics (Logistic Regression):\n", classification_report(y_test, y_pred_lr))
print("\nAdditional Metrics (Neural Network):\n", classification_report(y_test, y_pred_mlp))


Additional Metrics (SVM):
               precision    recall  f1-score   support

           0       0.67      0.41      0.51       113
           1       0.60      0.05      0.10        55
           2       0.40      0.04      0.07       106
           3       0.45      0.09      0.15       144
           4       0.72      0.99      0.83       782

    accuracy                           0.70      1200
   macro avg       0.57      0.32      0.33      1200
weighted avg       0.65      0.70      0.62      1200
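
The summary rows of the SVM report can be re-derived from its per-class rows, which also shows how skewed the test set is. This sketch copies the figures from the SVM report above and computes the majority-class baseline, the macro-averaged recall, and the support-weighted F1. The dictionary name `svm_report` is introduced here for illustration.

```python
# class -> (precision, recall, f1, support), copied from the SVM report above
svm_report = {
    0: (0.67, 0.41, 0.51, 113),
    1: (0.60, 0.05, 0.10, 55),
    2: (0.40, 0.04, 0.07, 106),
    3: (0.45, 0.09, 0.15, 144),
    4: (0.72, 0.99, 0.83, 782),
}

total = sum(row[3] for row in svm_report.values())

# Accuracy of a classifier that always predicts the most frequent class
majority_baseline = max(row[3] for row in svm_report.values()) / total

# Macro average: unweighted mean over classes
macro_recall = sum(row[1] for row in svm_report.values()) / len(svm_report)

# Weighted average: mean over classes weighted by support
weighted_f1 = sum(row[2] * row[3] for row in svm_report.values()) / total

print(f"majority baseline: {majority_baseline:.4f}")  # 0.6517
print(f"macro recall:      {macro_recall:.2f}")       # 0.32
print(f"weighted F1:       {weighted_f1:.2f}")        # 0.62
```

The majority baseline of 782 / 1200 is the same value as the Naive Bayes accuracy printed earlier (0.6517), which is consistent with that model predicting class 4 for every review. On data this imbalanced, accuracy alone says little about the minority classes, so the macro averages are worth reading alongside it.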


Additional Metrics (Naive Bayes):
               precision    recall  f1-score   support

           0       0.00      0.00      0.00       113
           1       0.00      0.00      0.00        55
           2       0.00      0.00      0.00       106
           3       0.00      0.00      0.00       144
           4       0.65      1.00      0.79       782

    accuracy                           0.65      1200
   macro avg       0.13      0.20      0.16      1200
weighted avg 



### --------------------------------------- CHALLENGE CODE ENDS HERE --------------------------------------------

### NOTE:
1. Assign the Accuracy score obtained for the model created in this challenge to the specified variable in the predefined function *submit_accuracy_score* below. The solution is to be written between the comments `# code starts here` and `# code ends here`
2. Please do not make any changes to the variable names and the function name *submit_accuracy_score* as this will be used for automated evaluation of the challenge. Any modification in these names will result in unexpected behaviour.

In [1]:
def submit_accuracy_score() -> float:
    # accuracy should be in the range of 0.0 to 1.0
    accuracy = 0.0
    # code starts here

    # code ends here
    return accuracy