### Part 1: Naive Bayes

[Files](https://drive.google.com/drive/folders/1OUVrOMp2jSSBDJSqvEyXDFTrhiyZnqit?usp=sharing)

You will perform sentiment analysis on a product review dataset containing customer reviews and star ratings that belong to four classes (1, 2, 4, 5). You can use sklearn for this question. Your tasks are as follows:

1. Clean the text by removing punctuation and preprocess it using techniques such as stop word removal, stemming, etc. You can explore anything!
1. Create BoW features using the word counts. You can choose the words that form the features such that performance is optimised. Use the train-test split provided in `train_test_index.pickle` and report any interesting observations based on metrics such as accuracy, precision, recall and F1 score (you can use `classification_report` in sklearn).
1. Repeat Task 2 with TfIdf features.
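
As a rough illustration of how the two feature types in Tasks 2 and 3 differ, the sketch below builds raw counts and TF-IDF weights by hand for a toy corpus. The corpus and all variable names are made up for this example. The idf formula is the smoothed variant that scikit-learn applies when `smooth_idf=True`, followed by L2 normalisation of each row; the tokenisation here is a plain whitespace split, which is simpler than what the vectorizers do.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the review dataset).
docs = [
    "good food good service",
    "bad food",
    "good place",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Bag of words: one raw count per vocabulary word.
bow = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Document frequency: number of documents containing each word.
n_docs = len(tokenized)
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed idf: ln((1 + n) / (1 + df)) + 1.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}


def tfidf_row(counts):
    """Weight counts by idf, then scale the row to unit L2 norm."""
    weighted = [c * idf[w] for c, w in zip(counts, vocab)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted] if norm else weighted


tfidf = [tfidf_row(row) for row in bow]

print(vocab)
for counts, weights in zip(bow, tfidf):
    print(counts, [round(x, 2) for x in weights])
```

In the second document, "bad" and "food" both occur once, so their counts are equal, but "bad" gets the larger TF-IDF weight because it appears in fewer documents.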

In [None]:
from google.colab import drive
import pickle
import pandas as pd
import numpy as np
import re
import string
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import nltk

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
drive.mount('/content/drive/')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
with open('/content/drive/MyDrive/SMAI FOLDERS Assignment 2/Copy of train_test_index.pickle', 'rb') as handle:
    train_test_index_dict = pickle.load(handle)
print(len(train_test_index_dict['train_index']))
print(len(train_test_index_dict['test_index']))

17862
8799


In [None]:
data = pd.read_csv('/content/drive/MyDrive/SMAI FOLDERS Assignment 2/Copy of product_reviews.csv')
data.head()

Unnamed: 0,text,stars,sentiment
0,Total bill for this horrible service? Over $8G...,1.0,0
1,Went in for a lunch. Steak sandwich was delici...,5.0,1
2,This place has gone down hill. Clearly they h...,1.0,0
3,"Walked in around 4 on a Friday afternoon, we s...",1.0,0
4,Michael from Red Carpet VIP is amazing ! I rea...,4.0,1


## Pre-processing

In [None]:
patterns = {
    'url': re.compile(r'https?://\S+|www\.\S+'),
    'hashtag': re.compile(r'(?<!\w)#\w+\b'),
    'mention': re.compile(r'(?<!\w)@\w+\b'),
    'number': re.compile(r'\b\d{1,3}(?:,\d{3})*(?:\.\d+)?%?\b'),
    'email': re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
    'date': re.compile(r'\b\d{1,2}/\d{1,2}/\d{2,4}\b'),
    'time': re.compile(r'\b(?:[01]?\d|2[0-3]):[0-5]\d\b'),
    'phoneNo': re.compile(r'\b(?:\+\d{1,4}[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b'),
}

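lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
# Keep "not" so that negations such as "not good" survive stop word removal.
stop_words.discard("not")

def preprocess_text(text):
    """Lowercase, strip punctuation, drop stop words and lemmatize a review."""
    text = text.lower()
    # Optional: strip URLs, mentions, numbers, etc. using `patterns` (currently disabled).
    # for pattern in patterns.values():
    #     text = pattern.sub('', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    return " ".join(lemmatized_tokens)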

data['processed_text'] = data['text'].apply(preprocess_text)
data.head()

Unnamed: 0,text,stars,sentiment,processed_text
0,Total bill for this horrible service? Over $8G...,1.0,0,total bill horrible service 8gs crook actually...
1,Went in for a lunch. Steak sandwich was delici...,5.0,1,went lunch steak sandwich delicious caesar sal...
2,This place has gone down hill. Clearly they h...,1.0,0,place gone hill clearly cut back staff food qu...
3,"Walked in around 4 on a Friday afternoon, we s...",1.0,0,walked around 4 friday afternoon sat table bar...
4,Michael from Red Carpet VIP is amazing ! I rea...,4.0,1,michael red carpet vip amazing reached needed ...
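
The reason for keeping "not" out of the stop word list can be seen on a single sentence. The sketch below is a simplified stand-in for `preprocess_text`: it uses a tiny, made-up stop word set and a whitespace split instead of NLTK's list, tokenizer, and lemmatizer, so it only illustrates the effect of the stop word choice.

```python
import string

# Tiny stand-in stop word list (hypothetical); the notebook uses NLTK's English list.
toy_stop_words = {"the", "was", "is", "this", "not"}
keep_negation = toy_stop_words - {"not"}


def clean(text, stop_words):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(tok for tok in text.split() if tok not in stop_words)


review = "The food was not good!"
print(clean(review, toy_stop_words))  # negation is dropped
print(clean(review, keep_negation))   # negation is kept
```

With "not" removed, the review collapses to the same tokens as a positive one. With "not" kept, a vectorizer that includes bigrams can pick up "not good" as its own feature.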


## Splitting the data

In [None]:
train_data = data.iloc[train_test_index_dict['train_index']]
test_data = data.iloc[train_test_index_dict['test_index']]
X_train, y_train = train_data['processed_text'], train_data['stars']
X_test, y_test = test_data['processed_text'], test_data['stars']

### BoW

In [None]:
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=5, max_df=0.8, max_features=10000)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

X_train_bow.shape

(17862, 10000)

In [None]:
classifier_bow = MultinomialNB(alpha=5)
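
The `alpha` argument controls additive smoothing of the per-class word likelihoods. A minimal sketch of that estimate, assuming the usual form P(w | c) = (count(w, c) + alpha) / (total count in c + alpha * |V|), is shown below. The function name, vocabulary, and tokens are hypothetical and only serve to show how a larger `alpha` pulls probabilities towards uniform.

```python
from collections import Counter


def smoothed_likelihoods(class_tokens, vocab, alpha):
    """Return P(word | class) with additive smoothing over a fixed vocabulary."""
    counts = Counter(class_tokens)
    total = sum(counts[w] for w in vocab)
    denom = total + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}


# Hypothetical tokens observed in one class; "terrible" never occurs in it.
vocab = ["good", "great", "terrible"]
five_star_tokens = ["good", "good", "great", "good"]

for alpha in (0.1, 5):
    probs = smoothed_likelihoods(five_star_tokens, vocab, alpha)
    print(alpha, {w: round(p, 3) for w, p in probs.items()})
```

A small `alpha` keeps the estimate close to the observed frequencies, while a large one gives unseen words a substantial share of the probability mass.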

In [None]:
classifier_bow.fit(X_train_bow, y_train)

y_pred = classifier_bow.predict(X_test_bow)
print("BoW Features Performance:")
print(classification_report(y_test, y_pred))

BoW Features Performance:
              precision    recall  f1-score   support

         1.0       0.71      0.81      0.76      1149
         2.0       0.55      0.08      0.14       587
         4.0       0.50      0.52      0.51      1981
         5.0       0.80      0.84      0.82      5082

    accuracy                           0.72      8799
   macro avg       0.64      0.57      0.56      8799
weighted avg       0.71      0.72      0.70      8799



### TfIdf

In [None]:
classifier_tfidf = MultinomialNB(alpha=0.1)

In [None]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5, max_df=0.8, max_features=10000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
X_train_tfidf.shape

(17862, 10000)

In [None]:
classifier_tfidf.fit(X_train_tfidf, y_train)

y_pred_tfidf = classifier_tfidf.predict(X_test_tfidf)
print("TfIdf Features Performance:")
print(classification_report(y_test, y_pred_tfidf))

TfIdf Features Performance:
              precision    recall  f1-score   support

         1.0       0.77      0.74      0.76      1149
         2.0       0.59      0.25      0.35       587
         4.0       0.54      0.40      0.46      1981
         5.0       0.77      0.90      0.83      5082

    accuracy                           0.72      8799
   macro avg       0.67      0.57      0.60      8799
weighted avg       0.70      0.72      0.70      8799
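
One observation from the two reports: class 5 makes up 5082 of the 8799 test reviews, so the weighted average is dominated by it, while the macro average counts every class equally. The sketch below recomputes both averages from the rounded per-class F1 scores and supports printed above. Because the inputs are already rounded to two decimals, the results should only be expected to match the reported averages to within rounding error.

```python
# Per-class (f1, support) pairs copied from the two classification reports.
bow_report = {1: (0.76, 1149), 2: (0.14, 587), 4: (0.51, 1981), 5: (0.82, 5082)}
tfidf_report = {1: (0.76, 1149), 2: (0.35, 587), 4: (0.46, 1981), 5: (0.83, 5082)}


def macro_and_weighted(report):
    """Return (macro F1, support-weighted F1) from per-class values."""
    f1s = [f1 for f1, _ in report.values()]
    total = sum(n for _, n in report.values())
    macro = sum(f1s) / len(f1s)
    weighted = sum(f1 * n for f1, n in report.values()) / total
    return macro, weighted


for name, report in (("BoW", bow_report), ("TfIdf", tfidf_report)):
    macro, weighted = macro_and_weighted(report)
    print(f"{name}: macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

The gain on class 2 with TfIdf features (F1 of 0.14 versus 0.35) shows up in the macro average, while the weighted average and accuracy barely move.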

