![sentiment%20analysis%20cover.JPG](attachment:sentiment%20analysis%20cover.JPG)

<h1><center>🥴Capstone Project: Sentiment Based Product Recommendation System🥴</center></h1>

<h2><center>Submitted by: Praveersinh Parmar</h2>


# Problem Statement 📒

- An e-commerce company named '`Ebuss`' has captured a huge market share in many fields, and it sells the products in various categories such as household essentials, books, personal care products, medicines, cosmetic items, beauty products, electrical appliances, kitchen and dining products and health care products.

- For Ebuss to grow quickly in the e-commerce market and become a major leader in the market, it has to compete with the likes of Amazon, Flipkart, etc., which are already market leaders.

- As a senior **Machine Learning Engineer** in Ebuss, our goal here is to **build a model that will improve the recommendations given to the users given their past reviews and ratings**. 



- We will build a sentiment-based product recommendation system, which includes the following tasks.

<div class="alert alert-block alert-info">  
<p>1. Data sourcing and sentiment analysis
    
<p>2. Building a recommendation system  

<p>3. Improving the recommendations using the sentiment analysis model  

<p>4. Deploying the end-to-end project with a user interface  
</div>

# 1. Data sourcing and sentiment analysis

In this task, we will perform the following sub-tasks:

1. **Data cleaning and Preprocssing**
2. **Text preprocessing**
3. 
4. **Feature extraction**: In order to extract features from the text data, you may choose from any of the methods, including bag-of-words, TF-IDF vectorization or word embedding.
5. **Training a text classification model**: You need to build at least three ML models. You then need to analyse the performance of each of these models and choose the best model. At least three out of the following four models need to be built (Do not forget, if required, handle the class imbalance and perform hyperparameter tuning.). 
>1. Logistic regression
>2. Random forest
>3. XGBoost
>4. Naive Bayes  

# 📚 Importing the necessary libraries

In [57]:
# Libraries for data loading and manipulation
import pandas as pd
import numpy as np
import pickle
from collections import Counter


# Libraries for data visualization and EDA
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
from textblob import TextBlob
from wordcloud import WordCloud, STOPWORDS


# Libraries for text preprocessing
import re, nltk, spacy, string
nlp = spacy.load("en_core_web_sm")
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize


# Libraries for machine learning models
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier


# Libraries for evaluating models
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, f1_score, classification_report


In [2]:
# Suppress Warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
## Set limits for displaying rows and columns

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# `Task 1`: Data Cleaning 🧹 and Preprocessing 🛠️

In [4]:
# Load the csv file into a pandas dataframe 'df'
df = pd.read_csv("../input/sentiments/sample30.csv")

# View first five rows
df.head()

In [5]:
# View the dimensions of dataframe
print(f"Number of rows    = {df.shape[0]}")
print(f"Number of columns = {df.shape[1]}")

In [6]:
# View the info of all columns
df.info()

In [7]:
# View the description of data from another csv file
data_description = pd.read_csv("../input/sentiments/DataAttributeDescription.csv", encoding='unicode_escape')
data_description

### 📌 We observe that there are three columns that will be useful in building a sentiment classification model:
### 1. **reviews_text**: It contains the review given by the user to a particular product
### 2. **reviews_title**: It contains the title of the review given in previous column
### 3. **user_sentiment**: It contains the overall sentiment of the user for a particular product (Positive or Negative). We will use them as labels in our model.





In [8]:
# Check for the missing values 
df.isna().sum()

### 📌 There are 190 missing values in **reviews_title**, we replace them with blank.
### 📌 There is 1 missing value in **user_sentiment**, we simply drop it.

In [9]:
# Replace the missing values in 'reviews_title' with blank
df['reviews_title'].fillna("", inplace = True)

# Drop the missing value in 'user_sentiment'
df.dropna(subset=['user_sentiment'], inplace=True)

In [10]:
# Check for the missing values again
df.isna().sum()

### 📌 All missing values in our three useful columns have been treated.

### 📌 We now create a new dataframe containing only two columns
### 1. First column will be a concatenation of the two columns: **reviews_text** and **reviews_title**.
### 2. Second column will be the **user_sentiment** column and it will serve as our target column.

In [11]:
# Concatenate the two columns 'reviews_text' and 'reviews_title' into a single column: 'reviews'
df['reviews'] = df['reviews_text'] + df['reviews_title']
df.head(2)

In [12]:
# Create a new dataframe 'data' with only two columns from the dataframe 'df'
data = df[['reviews', 'user_sentiment']]

# View five random rows from the newly formed dataframe 'data'
data.sample(5)

In [13]:
# Check for data types
data.dtypes

### 📌 Both columns are of correct data type and need no conversion.

### 💠 Let us check how imbalanced our target column is.

In [14]:
# View value counts of the sentiments
data['user_sentiment'].value_counts()

In [15]:
# Visualize the perentage count of sentiments
plt.figure(figsize=(6,6))
plt.pie(
    data['user_sentiment'].value_counts(normalize=True), 
    explode=(0,0.1), labels=['Positive', 'Negative'], autopct='%1.2f%%', shadow=True
)
plt.show()

### 📌 As our target column is highly imbalanced, we will have to use techniques to handle imbalanced data during our model building process.

# `Task 2`: Text Preprocessing 📜

In [16]:
# View any 10 reviews to get an idea of how the reviews text is
# To get a more clear idea, you can run this cell more than one times
data['reviews'].sample(10)

### 💠 First we will clean the text of reviews column

In [17]:
# Helper function to clean the text and remove all the unnecessary elements.
def clean_text(text):
    '''This function 
        - makes the given text lowercase
        - removes punctuation 
        
    :param text: text to be cleaned
    :return: cleaned text
    '''
    
    # Make the text lowercase
    text = text.lower()
    
    # Remove punctuation
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    
    
    return text

In [18]:
# Clean the review column using above helper function
data['reviews'] = data['reviews'].apply(lambda x: clean_text(x))

# View any 10 text-cleaned reviews
data['reviews'].head(10)

In [19]:
# Helper function to lemmatize the text
def lemmatizer(text):     
    """
    This function lemmatizes the given input text.
    :param text: given text
    :return: lemmatized text
    """
    
    # Initialize empty list to store lemmas
    sent = []
    
    # Extract lemmas of given text and add to the list 'sent'
    doc = nlp(text)
    for word in doc:
        sent.append(word.lemma_)
        
    # return string converted form of the list of lemmas
    return " ".join(sent)

In [20]:
# Lemmatized the reviews text using above helper function
data["reviews"] = data["reviews"].apply(lambda x: lemmatizer(x))

# View the first five rows of text preprocessed dataframe
data.head()

### 💠 Let us try to visualize our data

In [23]:
# Create list of lengths of reviews
doc_lens = [len(d) for d in data['reviews']]
doc_lens[:5]

In [24]:
# Plot the data according to character length of complaints
plt.figure(figsize=(10,6))
plt.hist(doc_lens, edgecolor='black', bins = 50)
plt.title('Distribution of Review character length', fontsize=25)
plt.ylabel('Number of Reviews', fontsize=20)
plt.xlabel('Review character length', fontsize=20)
sns.despine()
plt.show()

In [25]:
# We create a word cloud showing the top 40 words by frequency among all the reviews after processing the text
stopwords = set(STOPWORDS)
wordcloud = WordCloud(
                          background_color='GhostWhite',
                          stopwords=stopwords,
                          max_words=40,
                          max_font_size=40, 
                          random_state=42
                         ).generate(str(data['reviews']))

fig = plt.figure(figsize=(15,10))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

# `Task 3`: Feature Extraction ⚙️

### 💠 Now, we will create features using TF-IDF method

In [27]:
# Get the TF-IDF Vector count
tfv = TfidfVectorizer()
X_train_tfidf = tfv.fit_transform(data['reviews'])

In [29]:
# View the sparse feature matrix
print(X_train_tfidf)

In [30]:
# View the dense feature matrix
print(X_train_tfidf.toarray())

In [31]:
# Save TF-IDF to a pickle file
pickle.dump(tfv, open("tfidf.pkl","wb"))

In [33]:
# Perform Train-Test split with 80% training set and 20% test set
X_train, X_test, y_train, y_test = train_test_split(X_train_tfidf, data['user_sentiment'], 
                                                    test_size=0.20, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_test: {y_test.shape}")

### 💠 Selection of Evaluation Metric 

- As the target variable is highly imbalanced, we select **F1 Score** as our evaluation metric for comparing the performance of various models we will build.
- Moreover, we will use a **weighted average** method for evaluating `F1 Score` due to the imbalance of classes.

In [34]:
# Create a function to evaluate models
def eval_model(y_test, y_pred, model_name):
    """
    This function prints the classification report of a classifier 
    and plots the confusion martrix
    :param y_test: actual labels
    :param y_pred: predicted labels
    :param model_name: the name of the model being evaluated
    :return: None
    """
    
    # print classification report of classifier
    print(f"CLASSIFICATION REPORT for {model_name}\n")
    print(classification_report(y_test, y_pred, target_names=["Positive", "Negative"]))
    
    # plot confusion matrix of the classifier
    plt.figure(figsize=(6,6))
    plt.title(f"CONFUSION MATRIX for {model_name}\n")
    matrix = confusion_matrix(y_test, y_pred)
    sns.heatmap(matrix, annot=True, cmap="coolwarm", fmt='d', xticklabels=["Positive", "Negative"], 
                yticklabels=["Positive", "Negative"])
    plt.show()
    
    return

## `Model #1`: Naive Bayes 😯

In [35]:
# Run the Multinomial Naive Bayes with default parameters
model_name = 'NAIVE BAYES'
clf_nb = MultinomialNB()
%time 
clf_nb.fit(X_train, y_train)
y_pred_nb = clf_nb.predict(X_test)

In [36]:
# Calculate F1 Score using weighted average method
f1_nb = f1_score(y_test, y_pred_nb, average="weighted")
f1_nb

In [None]:
# # Hyperparameter tuning to improve Naive Bayes performance
# param_grid_nb = {
#     'alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0.00001),
#     'fit_prior':[True, False]
# }

# grid_nb = GridSearchCV(estimator=clf_nb, 
#                        param_grid=param_grid_nb,
#                        verbose=1,
#                        scoring='f1_weighted',
#                        n_jobs=-1,
#                        cv=10)
# grid_nb.fit(X_train, y_train)
# print(grid_nb.best_params_)

### 📌 Best Estimator parameters


{'alpha': 0.01, 'fit_prior': True}

In [37]:
# Run Multinomial Naive Bayes on tuned hyperparameters
clf_nb_tuned = MultinomialNB(alpha=0.01, fit_prior=True)
%time 
clf_nb_tuned.fit(X_train, y_train)
y_pred_nb_tuned = clf_nb_tuned.predict(X_test)

In [38]:
# Calculate F1 Score of tuned model using weighted average method
f1_nb_tuned = f1_score(y_test, y_pred_nb_tuned, average="weighted")
f1_nb_tuned

In [39]:
# Evaluate the tuned Naive Bayes classifier
eval_model(y_test, y_pred_nb_tuned, model_name)

### 📌 The tuned Naive Bayes model gives a decent F1 score of 0.88.

### 📌 However, this model performs a bit poor on classifying `Positive` class compared to `Negative`.

In [40]:
# Create a dataframe to store F1 Scores of all models we will build
summary = pd.DataFrame([{'Model': 'Naive Bayes','F1 Score (untuned)': round(f1_nb, 2), 
                         'F1 Score (tuned)': round(f1_nb_tuned, 2)}])
summary

## `Model #2`: Logistic Regression 📈

In [67]:
# Run the Logistic Regression model
model_name = 'LOGISTIC REGRESSION'
clf_lr = LogisticRegression()
%time 
clf_lr.fit(X_train, y_train)
y_pred_lr = clf_lr.predict(X_test)

In [68]:
# Calculate F1 Score using weighted average method
f1_lr = f1_score(y_test, y_pred_lr, average="weighted")
f1_lr

In [64]:
# # Hyperparameter tuning to improve Logistic Regression performance
# param_grid_lr = {
#     'penalty': ['l1', 'l2','elasticnet', 'none'],
#     'C': [0.001,0.01,0.1,1,10,100],
# #     'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
# }

# grid_lr = GridSearchCV(estimator=clf_lr, 
#                        param_grid=param_grid_lr,
#                        verbose=1,
#                        scoring='f1_weighted',
#                        n_jobs=-1,
#                        cv=5)
# grid_lr.fit(X_train, y_train)
# print(grid_lr.best_params_)

### 📌 Best Estimator parameters


{'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}

In [69]:
# Run Logistic Regression on tuned hyperparameters
clf_lr_tuned = LogisticRegression(C=10, 
                                  penalty='l1', 
                                  solver='liblinear')
%time 
clf_lr_tuned.fit(X_train, y_train)
y_pred_lr_tuned = clf_lr_tuned.predict(X_test)

In [70]:
# Calculate F1 Score of tuned model using weighted average method
f1_lr_tuned = f1_score(y_test, y_pred_lr_tuned, average="weighted")
f1_lr_tuned

In [71]:
# Evaluate the tuned Logistic Regression classifier
eval_model(y_test, y_pred_lr_tuned, model_name)

### 📌 The tuned Logistic Regression model gives a pretty high F1 score of 0.93.

In [72]:
# Update the summary table
summary.loc[len(summary.index)] = ['Logistic Regression', round(f1_lr, 2), round(f1_lr_tuned, 2)]
summary

## `Model #3`: Decision Tree 🌴

In [73]:
# Run Decision Tree on default hyperparameters
model_name = 'DECISION TREE'
clf_dt = DecisionTreeClassifier()
%time 
clf_dt.fit(X_train, y_train)
y_pred_dt = clf_dt.predict(X_test)

In [74]:
# Calculate F1 Score using weighted average method
f1_dt = f1_score(y_test, y_pred_dt, average="weighted")
f1_dt

In [None]:
# # Hyperparameter tuning to improve Decision Tree performance
# param_grid_dt = {
#     'criterion': ['gini', 'entropy'],
#     'max_depth' : [40, 45, 50, 60, 70],
#     'min_samples_leaf':[10,15, 20, 25, 30, 35],
#     'max_features':['auto','log2','sqrt',None],
# }

# grid_dt = GridSearchCV(estimator=clf_dt, 
#                        param_grid=param_grid_dt,
#                        verbose=1,
#                        scoring='f1_weighted',
#                        n_jobs=-1,
#                        cv=5)
# grid_dt.fit(X_train, y_train)
# print(grid_dt.best_params_)

### 📌 Best Estimator parameters


{'criterion': 'gini', 'max_depth': 60, 'max_features': None, 'min_samples_leaf': 10}

In [75]:
# Run Decision Tree on tuned hyperparameters
clf_dt_tuned = DecisionTreeClassifier(criterion='gini', 
                                      max_depth=60, 
                                      min_samples_leaf=10, 
                                      max_features=None)
%time 
clf_dt_tuned.fit(X_train, y_train)
y_pred_dt_tuned = clf_dt_tuned.predict(X_test)

In [76]:
# Calculate F1 Score of tuned model using weighted average method
f1_dt_tuned = f1_score(y_test, y_pred_dt_tuned, average="weighted")
f1_dt_tuned

In [77]:
# Evaluate the tuned Decision Tree classifier
eval_model(y_test, y_pred_dt_tuned, model_name)

### 📌 The Decision Tree model gives a decent F1 score of 0.90.

In [78]:
# Update the summary table
summary.loc[len(summary.index)] = ['Decision Tree', round(f1_dt, 2), round(f1_dt_tuned, 2)]
summary

## `Model #4`: Random Forest 🌳🌳🌳

In [79]:
# Run the Random Forest model on default hyperparameters
model_name = 'RANDOM FOREST'
clf_rf = RandomForestClassifier()
%time 
clf_rf.fit(X_train, y_train)
y_pred_rf = clf_rf.predict(X_test)

In [80]:
# Calculate F1 Score using weighted average method
f1_rf = f1_score(y_test, y_pred_rf, average="weighted")
f1_rf

In [None]:
# # Hyperparameter tuning to improve Random Forest performance
# param_grid_rf = {
#     'n_estimators': [100, 300, 500],
#     'criterion':['gini','entropy'],
#     'max_depth': [10, 30, 40],
#     'min_samples_split': [2, 5, 10],
#     'min_samples_leaf': [1, 5, 10],
#     'max_features': ['log2', 'sqrt', None]    
# }

# grid_rf = RandomizedSearchCV(estimator=clf_rf, 
#                        param_distributions=param_grid_rf,
#                        scoring='f1_weighted',
#                        verbose=1,
#                        n_jobs=-1,
#                        cv=5)
# grid_rf.fit(X_train, y_train)
# print(grid_rf.best_params_)

### 📌 Best Estimator parameters
{'n_estimators': 300, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': None, 'max_depth': 40, 'criterion': 'gini'}

In [81]:
# Run Random Forest on tuned hyperparameters
clf_rf_tuned = RandomForestClassifier(n_estimators=300, 
                                      min_samples_split=2, 
                                      min_samples_leaf=1, 
                                      max_features=None, 
                                      max_depth=40, 
                                      criterion='gini'
)
%time 
clf_rf_tuned.fit(X_train, y_train)
y_pred_rf_tuned = clf_rf_tuned.predict(X_test)

In [82]:
# Calculate F1 Score of tuned model using weighted average method
f1_rf_tuned = f1_score(y_test, y_pred_rf_tuned, average="weighted")
f1_rf_tuned

In [83]:
# Evaluate the tuned Random Forest classifier
eval_model(y_test, y_pred_rf_tuned, model_name)

### 📌 The tuned Random Forest model gives a pretty high F1 score of 0.90.

In [84]:
# Update the summary table
summary.loc[len(summary.index)] = ['Random Forest', round(f1_rf, 2), round(f1_rf_tuned, 2)]
summary

## `Model #5`: Support Vector Machine 🏹

In [85]:
# Run the Support Vector Machine (SVM) model on default hyperparameters
model_name = 'SUPPORT VECTOR MACHINE'
clf_svm = SVC()
%time 
clf_svm.fit(X_train, y_train)
y_pred_svm = clf_svm.predict(X_test)

In [86]:
# Calculate F1 Score using weighted average method
f1_svm = f1_score(y_test, y_pred_svm, average="weighted")
f1_svm

In [87]:
# # Hyperparameter tuning to improve SVM performance
# param_grid_svm = {
#     'C': [10, 15],
#     'gamma': ['scale', 0.01],
#     'kernel': ['linear', 'rbf']
# }

# grid_svm = GridSearchCV(estimator=clf_svm, 
#                        param_grid=param_grid_svm,
#                        scoring='f1_weighted',
#                        verbose=1,
#                        n_jobs=-1,
#                        cv=2)
# grid_svm.fit(X_train, y_train)
# print(grid_svm.best_params_)

### 📌 Best Estimator parameters
{'C': 10, 'gamma': 'scale', 'kernel': 'linear'}

In [88]:
# Run SVM on tuned hyperparameters
clf_svm_tuned = SVC(C=10,
                    gamma='scale',
                    kernel='linear')
%time 
clf_svm_tuned.fit(X_train, y_train)
y_pred_svm_tuned = clf_svm_tuned.predict(X_test)

In [89]:
# Calculate F1 Score of tuned model using weighted average method
f1_svm_tuned = f1_score(y_test, y_pred_svm_tuned, average="weighted")
f1_svm_tuned

In [90]:
# Evaluate the SVM classifier
eval_model(y_test, y_pred_svm_tuned, model_name)

### 📌 The SVM model gives a pretty high F1 score of 0.92.

In [91]:
# Update the summary table
summary.loc[len(summary.index)] = ['Support Vector Machine', round(f1_svm, 2), round(f1_svm_tuned, 2)]
summary

## `Model #6`: XGBoost 💨

In [92]:
# Run XGBoost model on default hyperparameters 
################
# This uses GPU
################
model_name = 'XGBOOST'
clf_xgb = XGBClassifier(tree_method='gpu_hist', 
                        gpu_id=0, 
                        predictor="gpu_predictor")
%time
clf_xgb.fit(X_train, y_train)
y_pred_xgb = clf_xgb.predict(X_test)

In [93]:
# Calculate F1 Score using weighted average method
f1_xgb = f1_score(y_test, y_pred_xgb, average="weighted")
f1_xgb

In [94]:
from collections import Counter
counter = Counter(y_train)
print(counter)

In [95]:
scale_pos_weight = counter['Negative'] / counter['Positive']
scale_pos_weight

In [None]:
# Hyperparameter tuning to improve XGBoost performance
param_grid_xgb = {
    'learning_rate': [0.1, 0.2, 0.3],
    'max_depth': [2, 6, 10],
    'min_child_weight': [7, 11, 19],
    'colsample_bytree': [0.5, 0.8, 1],
    'scale_pos_weight': [0.1, 0.13, 0.5, 0.7],
    'n_estimators': [300, 500, 700]
}

grid_xgb = RandomizedSearchCV(estimator=clf_xgb, 
                              param_distributions=param_grid_xgb,
                              scoring='f1_weighted',
                              verbose=1,
                              n_jobs=-1,
                              cv=3)
grid_xgb.fit(X_train, y_train)
print(grid_xgb.best_params_)

### 📌 Best Estimator parameters


{'scale_pos_weight': 8, 'n_estimators': 700, 'min_child_weight': 7, 'max_depth': 6, 'learning_rate': 0.3}

In [None]:
# Run XGBoost on tuned hyperparameters
clf_xgb_tuned = XGBClassifier(scale_pos_weight=8, 
                              n_estimators=700, 
                              min_child_weight=7, 
                              max_depth=6, 
                              learning_rate=0.3, 
                              tree_method='gpu_hist', 
                              gpu_id=0, 
                              predictor="gpu_predictor"
)
%time 
clf_xgb_tuned.fit(X_train, y_train)
y_pred_xgb_tuned = clf_xgb_tuned.predict(X_test)

In [None]:
# Calculate F1 Score of tuned model using weighted average method
f1_xgb_tuned = f1_score(y_test, y_pred_xgb_tuned, average="weighted")
f1_xgb_tuned

In [None]:
# Evaluate the tuned XGBoost classifier
eval_model(y_test, y_pred_xgb_tuned, model_name)

### 📌 The tuned XGBoost model gives a pretty high F1 score of 0.92.
### 📌 This model performs quite well on all topics.

In [None]:
# Update the summary table
summary.loc[len(summary.index)] = ['XGBoost', round(f1_xgb, 2), round(f1_xgb_tuned, 2)]
summary