<a href="https://colab.research.google.com/github/NoofAlsafi-DS/NLP_Project/blob/main/Copy_of_NLP_Sentiment_Analysis_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

# WELCOME!

Welcome to the "***Sentiment Analysis and Classification***" project, the first and only project of the ***Natural Language Processing (NLP)*** course.

This analysis focuses on using natural language processing techniques to find broad trends in the written thoughts of customers.
The goal in this project is to predict whether customers recommend the product they purchased using the information in their review text.

One of the challenges in this project is to extract useful information from the *Review Text* variable using text mining techniques. The other challenge is that you need to convert text files into numeric feature vectors to run machine learning algorithms.

At the end of this project, you will learn how to build sentiment classification models using Machine Learning algorithms (***Logistic Regression, Naive Bayes, Support Vector Machine, Random Forest*** and ***Ada Boosting***), **Deep Learning algorithms** and **BERT algorithm**.

Before diving into the project, please take a look at the Determines and Tasks.

- ***NOTE:*** *This tutorial assumes that you already know the basics of coding in Python and are familiar with the theory behind the algorithms mentioned above as well as NLP techniques.*



---
---


# #Determines
The raw data contains 23486 rows and 10 feature variables; 22641 of those rows have a non-missing *Review Text*. Each row includes a written comment as well as additional customer information.
Each row corresponds to one customer review and includes the following variables:


**Feature Information:**

**Clothing ID:** Integer Categorical variable that refers to the specific piece being reviewed.

**Age:** Positive Integer variable of the reviewer's age.

**Title:** String variable for the title of the review.

**Review Text:** String variable for the review body.

**Rating:** Positive Ordinal Integer variable for the product score granted by the customer from 1 Worst, to 5 Best.

**Recommended IND:** Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

**Positive Feedback Count:** Positive Integer documenting the number of other customers who found this review positive.

**Division Name:** Categorical name of the product high level division.

**Department Name:** Categorical name of the product department name.

**Class Name:** Categorical name of the product class name.

---

The basic goal of this project is to predict whether customers recommend the product they purchased using the information in their *Review Text*.
Note in particular that you are expected to use only the "Review Text" variable and to ignore the others.
Of course, if you want, you can explore the other variables on your own.

The project is divided into five tasks: ***EDA, Feature Selection and Data Cleaning, Text Mining, Word Cloud*** and ***Sentiment Classification with Machine Learning, Deep Learning and BERT model***.

As usual, you can start getting to know the data once the import and load operations are done.
You need to detect missing values in Review Text, which is the only variable you need to care about. You can drop the other variables.

Once the data set is ready for text mining, you will need to apply ***noise removal*** and ***lexicon normalization*** to it using the capabilities of the ***nltk*** library.

Afterwards, you will implement ***Word Cloud*** as a visual analysis of word repetition.

Finally, you will build models with five classical machine learning algorithms, a deep learning model and a BERT model, and compare their performance. In this way you will determine which algorithm predicts sentiment most accurately from the information in the *Review Text* variable.






---
---


# #Tasks

#### 1. Exploratory Data Analysis

- Import Modules, Load and Discover the Data

#### 2. Feature Selection and Data Cleaning

- Feature Selection and Rename Column Name
- Missing Value Detection

#### 3. Text Mining

- Tokenization
- Noise Removal
- Lexicon Normalization

#### 4. WordCloud - Repetition of Words

- Detect Reviews
- Collect Words
- Create Word Cloud


#### 5. Sentiment Classification with Machine Learning, Deep Learning and BERT Model

- Train - Test Split
- Vectorization
- TF-IDF
- Logistic Regression
- Naive Bayes
- Support Vector Machine
- Random Forest
- AdaBoost
- Deep Learning Model
- BERT Model
- Model Comparison

---
---


# Sentiment analysis of women's clothes reviews


In this project we use sentiment analysis to determine whether a product is recommended or not. We use several algorithms and compare them to find the most accurate predictions: classical ML algorithms (Logistic Regression, Naive Bayes, Support Vector Machine (SVM), Random Forest and Ada Boosting), a deep learning algorithm and the BERT algorithm. The data comes from the Women's Clothing E-Commerce Reviews dataset, which can be found on Kaggle (https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews).


## 1. Exploratory Data Analysis

### Import Libraries, Load and Discover the Data

In [36]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns
import missingno as msno
import scipy.stats as stats
#from matplotlib_dashboard import MatplotlibDashboard

import re
import random
import string
import requests
import tempfile

from PIL import Image
from tqdm.notebook import tqdm
from wordcloud import WordCloud

import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn")

# memory management and garbage collection
import gc
gc.collect()

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
# warnings is already imported above; silence all remaining warnings
warnings.filterwarnings("ignore")
plt.rcParams["figure.figsize"] = (7,4)
pd.set_option('display.max_columns', 50)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
df = pd.read_csv("drive/My Drive/Womens Clothing E-Commerce Reviews.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [4]:
df_copy = df.copy()
print('Shape of DataFrame: ',df_copy.shape,'\n')

Shape of DataFrame:  (23486, 11) 



In [5]:
#!git clone https://github.com/SarahMoshababQ/G3-project/Womens Clothing E-Commerce Reviews.csv

### Data Wrangling

In [6]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


In [7]:
df_copy.isna().sum()

Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

In [8]:
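# Clothing ID is not needed for the text-based prediction
df_copy.drop(['Clothing ID'], axis=1, inplace=True)
# Keep only rows that have a review body and a division name
df_copy.dropna(subset=['Review Text', 'Division Name'], inplace=True)

In [9]:
# Combine title and review body into one text field ('-' stands in for a missing title)
df_copy['Text'] = df_copy['Title'].fillna('-') + ' ' + df_copy['Review Text']
df_copy.drop(['Title', 'Review Text', 'Division Name'], axis=1, inplace=True)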

In [10]:
df_copy['Text_Length'] = df_copy['Text'].astype('str').apply(len)
df_copy = df_copy.reset_index(drop=True)

In [11]:
df_copy["Recommended"]=df_copy["Recommended IND"]

#### Check Proportion of Target Class Variable:

The target class variable is imbalanced: "Recommended" reviews are far more common than "Not Recommended" ones.

In [12]:
df_copy["Recommended"].value_counts(normalize=True)

1    0.818764
0    0.181236
Name: Recommended, dtype: float64
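
With roughly 82% of the reviews recommended, plain accuracy can look good even for a model that learns nothing. The sketch below illustrates this with a synthetic label list (an assumption that mimics the split above, not the project data) and a hypothetical baseline that always predicts the majority class.

```python
# Synthetic labels with roughly the class balance reported above
# (assumption: 82 recommended and 18 not recommended out of 100 reviews).
y_true = [1] * 82 + [0] * 18

# A "model" that ignores the text and always predicts the majority class.
y_pred = [1] * len(y_true)

# Share of labels the baseline gets right.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the minority class (label 0): share of true 0s that were predicted as 0.
preds_for_true_zeros = [p for t, p in zip(y_true, y_pred) if t == 0]
recall_not_recommended = sum(p == 0 for p in preds_for_true_zeros) / len(preds_for_true_zeros)

print(f"accuracy: {accuracy:.2f}")
print(f"recall (not recommended): {recall_not_recommended:.2f}")
```

The baseline reaches high accuracy while finding none of the "not recommended" reviews, which is why recall and F1 for the minority class are more informative than accuracy here.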

## 2. Feature Selection and Data Cleaning

From now on, the DataFrame you will work with should contain two columns: **"Review Text"** and **"Recommended IND"**. You can do the missing value detection operations from now on. You can also rename the column names if you want.



### Feature Selection and Rename Column Name

In [13]:
# Keep only the text and the target; .copy() avoids chained-assignment warnings later
df_copy = df_copy[["Text", "Recommended"]].copy()
df_copy

Unnamed: 0,Text,Recommended
0,- Absolutely wonderful - silky and sexy and co...,1
1,- Love this dress! it's sooo pretty. i happe...,1
2,Some major design flaws I had such high hopes ...,0
3,"My favorite buy! I love, love, love this jumps...",1
4,Flattering shirt This shirt is very flattering...,1
...,...,...
22623,Great dress for many occasions I was very happ...,1
22624,Wish it was made of cotton It reminds me of ma...,1
22625,"Cute, but see through This fit well, but the t...",0
22626,"Very cute dress, perfect for summer parties an...",1


In [14]:
df_copy['not_recommended'] = df_copy['Recommended'].map({0 : 1, 1: 0})
df_copy

Unnamed: 0,Text,Recommended,not_recommended
0,- Absolutely wonderful - silky and sexy and co...,1,0
1,- Love this dress! it's sooo pretty. i happe...,1,0
2,Some major design flaws I had such high hopes ...,0,1
3,"My favorite buy! I love, love, love this jumps...",1,0
4,Flattering shirt This shirt is very flattering...,1,0
...,...,...,...
22623,Great dress for many occasions I was very happ...,1,0
22624,Wish it was made of cotton It reminds me of ma...,1,0
22625,"Cute, but see through This fit well, but the t...",0,1
22626,"Very cute dress, perfect for summer parties an...",1,0


---
---


### Missing Value Detection

In [15]:
df_copy.isna().sum()

Text               0
Recommended        0
not_recommended    0
dtype: int64

---
---


## 3. Text Mining

Text is the most unstructured form of all the available data, so various types of noise are present in it. This means that the data is not readily analyzable without pre-processing. The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as **text preprocessing**.

The three key steps of text preprocessing:

- **Tokenization:**
This step is one of the top priorities in text mining. Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

- **Noise Removal:**
Any piece of text that is not relevant to the context of the data and the end output can be considered noise.
Examples include language stop words (commonly used words such as is, am, the, of, in), URLs or links, upper and lower case differences, punctuation and industry-specific words. This step removes all such noisy entities from the text.


- **Lexicon Normalization:**
Another type of textual noise comes from the multiple representations a single word can take.
For example, “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”. Though they mean different things, contextually they are all similar. This step converts all the variants of a word into their normalized form (also known as the lemma).
There are two methods of lexicon normalization: **[Stemming or Lemmatization](https://www.guru99.com/stemming-lemmatization-python-nltk.html)**. Lemmatization is recommended for this case because it returns the root form of each word, whereas stemming just strips suffixes.

As the first step, split the text into tokens and convert all words to lower case. Next, remove punctuation, bad characters, numbers and stop words. The second step normalizes the remaining tokens through lemmatization.


***Note:*** *Use the functions of the ***[nltk Library](https://www.guru99.com/nltk-tutorial.html)*** for all the above operations.*
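
To make the stages concrete, here is a dependency-free sketch that prints the intermediate result of tokenization and noise removal for one made-up sentence. The helper `demo_clean` and the tiny `DEMO_STOP_WORDS` set are hypothetical and only for illustration; the project itself should use the nltk functions, and lemmatization is left out here because it needs the WordNet corpus.

```python
import re

# Hypothetical mini stop word list for illustration only.
DEMO_STOP_WORDS = {"i", "this", "it", "is", "the", "and", "a", "so"}

def demo_clean(text):
    """Return (all tokens, tokens kept after noise removal) for one sentence."""
    # Noise removal on the raw string: drop URLs, then apostrophes.
    text = re.sub(r"http\S*|www\S*", "", text).replace("'", "")
    # Tokenization: lower-case and split into runs of letters or digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Noise removal on tokens: keep alphabetic tokens that are not stop words.
    kept = [t for t in tokens if t.isalpha() and t not in DEMO_STOP_WORDS]
    return tokens, kept

tokens, kept = demo_clean("I LOVE this dress, it's so pretty! See www.example.com 10/10")
print(tokens)  # every token, including numbers and stop words
print(kept)    # tokens that survive noise removal
```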



### Tokenization, Noise Removal, Lexicon Normalization

In [41]:
import nltk
from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords
import gzip
import json
nltk.download('stopwords')
nltk.download('wordnet', download_dir='/usr/share/nltk_data')
nltk.download('averaged_perceptron_tagger')

nltk.download('punkt')

# if 'wordnet' error
!unzip -oq /usr/share/nltk_data/corpora/wordnet.zip -d /usr/share/nltk_data/corpora

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [35]:
stop_words = set(stopwords.words('english'))

words_to_exclude = {"no", "not"}

stop_words.difference_update(words_to_exclude)

In [42]:
def preprocess_cleaning(data):
    """Clean one review: strip noise, tokenize, drop stop words, lemmatize."""

    #1. Removing URLs (raw strings avoid invalid-escape warnings for \S)
    data = re.sub(r'http\S*', '', data).strip()
    data = re.sub(r'www\S*', '', data).strip()

    #2. Removing hashtags
    data = re.sub(r'#\S*', '', data).strip()

    #3. Removing mentions
    data = re.sub(r'@\S*', '', data).strip()

    #4. Removing apostrophes so negative contractions stay one token (e.g. "wasn't" -> "wasnt")
    data = data.replace("'", "")

    #5. Tokenize
    text_tokens = word_tokenize(data.lower())

    #6. Remove punctuation and numbers
    tokens_without_punc = [w for w in text_tokens if w.isalpha()]

    #7. Removing stop words
    tokens_without_sw = [t for t in tokens_without_punc if t not in stop_words]

    #8. Lemmatize (one lemmatizer instance is reused for every token)
    lemmatizer = WordNetLemmatizer()
    text_cleaned = [lemmatizer.lemmatize(t) for t in tokens_without_sw]

    #9. Joining
    return " ".join(text_cleaned)

In [43]:
df_copy['text_clean']= df_copy['Text'].apply(preprocess_cleaning)
df_copy

Unnamed: 0,Text,Recommended,not_recommended,text_clean
0,- Absolutely wonderful - silky and sexy and co...,1,0,absolutely wonderful silky sexy comfortable
1,- Love this dress! it's sooo pretty. i happe...,1,0,love dress sooo pretty happened find store im ...
2,Some major design flaws I had such high hopes ...,0,1,major design flaw high hope dress really wante...
3,"My favorite buy! I love, love, love this jumps...",1,0,favorite buy love love love jumpsuit fun flirt...
4,Flattering shirt This shirt is very flattering...,1,0,flattering shirt shirt flattering due adjustab...
...,...,...,...,...
22623,Great dress for many occasions I was very happ...,1,0,great dress many occasion happy snag dress gre...
22624,Wish it was made of cotton It reminds me of ma...,1,0,wish made cotton reminds maternity clothes sof...
22625,"Cute, but see through This fit well, but the t...",0,1,cute see fit well top see never would worked i...
22626,"Very cute dress, perfect for summer parties an...",1,0,cute dress perfect summer party bought dress w...


## 4. WordCloud - Repetition of Words

Now you'll create word clouds for the reviews, representing the most common words in each target class.

Word Cloud is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance. Significant textual data points can be highlighted using a word cloud.

You are expected to create separate word clouds for positive and negative reviews. You can qualify a review as positive or negative, by looking at its recommended status. You may need to use capabilities of matplotlib for visualizations.

You can follow the steps below:

- Detect Reviews
- Collect Words
- Create Word Cloud


### Detect Reviews (positive and negative separately)

In [44]:
import collections
df_copy['text_list'] = df_copy['text_clean'].apply(lambda x:str(x).split())
top = collections.Counter([item for sublist in df_copy['text_list'] for item in sublist])
temp = pd.DataFrame(top.most_common(50))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')

Unnamed: 0,Common_words,count
0,dress,13005
1,not,10939
2,love,10852
3,fit,10807
4,size,9587
5,top,9460
6,great,7887
7,color,7371
8,like,7266
9,look,7092


In [45]:
df_copy

Unnamed: 0,Text,Recommended,not_recommended,text_clean,text_list
0,- Absolutely wonderful - silky and sexy and co...,1,0,absolutely wonderful silky sexy comfortable,"[absolutely, wonderful, silky, sexy, comfortable]"
1,- Love this dress! it's sooo pretty. i happe...,1,0,love dress sooo pretty happened find store im ...,"[love, dress, sooo, pretty, happened, find, st..."
2,Some major design flaws I had such high hopes ...,0,1,major design flaw high hope dress really wante...,"[major, design, flaw, high, hope, dress, reall..."
3,"My favorite buy! I love, love, love this jumps...",1,0,favorite buy love love love jumpsuit fun flirt...,"[favorite, buy, love, love, love, jumpsuit, fu..."
4,Flattering shirt This shirt is very flattering...,1,0,flattering shirt shirt flattering due adjustab...,"[flattering, shirt, shirt, flattering, due, ad..."
...,...,...,...,...,...
22623,Great dress for many occasions I was very happ...,1,0,great dress many occasion happy snag dress gre...,"[great, dress, many, occasion, happy, snag, dr..."
22624,Wish it was made of cotton It reminds me of ma...,1,0,wish made cotton reminds maternity clothes sof...,"[wish, made, cotton, reminds, maternity, cloth..."
22625,"Cute, but see through This fit well, but the t...",0,1,cute see fit well top see never would worked i...,"[cute, see, fit, well, top, see, never, would,..."
22626,"Very cute dress, perfect for summer parties an...",1,0,cute dress perfect summer party bought dress w...,"[cute, dress, perfect, summer, party, bought, ..."


In [46]:
Positive_sent = df_copy[df_copy['Recommended']==1]
Negative_sent = df_copy[df_copy['not_recommended']==1]

In [47]:
Positive_sent

Unnamed: 0,Text,Recommended,not_recommended,text_clean,text_list
0,- Absolutely wonderful - silky and sexy and co...,1,0,absolutely wonderful silky sexy comfortable,"[absolutely, wonderful, silky, sexy, comfortable]"
1,- Love this dress! it's sooo pretty. i happe...,1,0,love dress sooo pretty happened find store im ...,"[love, dress, sooo, pretty, happened, find, st..."
3,"My favorite buy! I love, love, love this jumps...",1,0,favorite buy love love love jumpsuit fun flirt...,"[favorite, buy, love, love, love, jumpsuit, fu..."
4,Flattering shirt This shirt is very flattering...,1,0,flattering shirt shirt flattering due adjustab...,"[flattering, shirt, shirt, flattering, due, ad..."
6,Cagrcoal shimmer fun I aded this in my basket ...,1,0,cagrcoal shimmer fun aded basket hte last mint...,"[cagrcoal, shimmer, fun, aded, basket, hte, la..."
...,...,...,...,...,...
22622,What a fun piece! So i wasn't sure about order...,1,0,fun piece wasnt sure ordering skirt couldnt se...,"[fun, piece, wasnt, sure, ordering, skirt, cou..."
22623,Great dress for many occasions I was very happ...,1,0,great dress many occasion happy snag dress gre...,"[great, dress, many, occasion, happy, snag, dr..."
22624,Wish it was made of cotton It reminds me of ma...,1,0,wish made cotton reminds maternity clothes sof...,"[wish, made, cotton, reminds, maternity, cloth..."
22626,"Very cute dress, perfect for summer parties an...",1,0,cute dress perfect summer party bought dress w...,"[cute, dress, perfect, summer, party, bought, ..."


### Collect Words (positive and negative separately)

In [51]:
# Most common positive words
top = collections.Counter([item for sublist in Positive_sent['text_list'] for item in sublist])
temp_positive = pd.DataFrame(top.most_common(50))
temp_positive.columns = ['Common_words','count']
temp_positive.style.background_gradient(cmap='Greens')

Unnamed: 0,Common_words,count
0,dress,10775
1,love,9783
2,fit,9089
3,size,8217
4,not,7844
5,top,7652
6,great,7232
7,color,6276
8,wear,5835
9,look,5436


In [50]:
# Most common negative words
top = collections.Counter([item for sublist in Negative_sent['text_list'] for item in sublist])
temp_negative = pd.DataFrame(top.most_common(50))
# Skip the single most frequent word (row 0) before displaying
temp_negative = temp_negative.iloc[1:,:]
temp_negative.columns = ['Common_words','count']
temp_negative.style.background_gradient(cmap='Reds')

Unnamed: 0,Common_words,count
1,dress,2230
2,like,1840
3,top,1808
4,fit,1718
5,look,1656
6,fabric,1390
7,size,1370
8,would,1323
9,color,1095
10,love,1069


### Create Word Cloud (for most common words in recommended and not recommended reviews separately)
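
One possible way to draw the clouds is sketched below. The helper names `word_frequencies` and `plot_word_cloud`, the figure sizes and the colormaps are assumptions for illustration; the toy token lists stand in for `Positive_sent['text_list']`.

```python
from collections import Counter

def word_frequencies(token_lists):
    """Count how often each token occurs across a collection of token lists."""
    return Counter(token for tokens in token_lists for token in tokens)

def plot_word_cloud(frequencies, title, colormap="Greens"):
    """Draw a word cloud from a {word: count} mapping."""
    # Imported lazily so the counting helper works without the plotting stack.
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white",
                      colormap=colormap).generate_from_frequencies(frequencies)
    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()

# Toy token lists standing in for Positive_sent['text_list'] (assumption).
demo_freqs = word_frequencies([["love", "dress"], ["love", "fit"]])
print(demo_freqs.most_common(1))

# In the notebook the helpers could be called as, for example:
# plot_word_cloud(word_frequencies(Positive_sent['text_list']), "Recommended")
# plot_word_cloud(word_frequencies(Negative_sent['text_list']), "Not recommended", colormap="Reds")
```

Building the cloud from explicit frequencies keeps the picture consistent with the count tables computed in the previous step.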

## 5. Sentiment Classification with Machine Learning, Deep Learning and BERT model

Before moving on to modeling, you will need to perform two data preprocessing steps: **[vectorization](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)** and **train-test split**. You have performed the train-test split many times before,
but you will perform vectorization for the first time.

Machine learning algorithms most often take numeric feature vectors as input. Thus, when working with text documents, you need a way to convert each document into a numeric vector. This process is known as text vectorization. A commonly used approach, which you will use here, is to represent each text as a vector of word counts.

At this point, your review text column holds cleaned tokens (with no punctuation or stop words). You can use Scikit-learn’s CountVectorizer to convert the text collection into a matrix of token counts. You can picture the result as a 2-D matrix in which each row is a review and each column is a unique word.

Train every model on both the CountVectorizer features and the TF-IDF features.

**For Deep learning model, use embedding layer for all words.**

**For BERT model, use TF tensor**

After performing data preprocessing, build your models using following classification algorithms:

- Logistic Regression,
- Naive Bayes,
- Support Vector Machine,
- Random Forest,
- Ada Boosting
- Deep Learning Model
- BERT Model

### Train - Test Split

To run machine learning algorithms we need to convert text files into numerical feature vectors. We will use bag of words model for our analysis.

First, we split the data into train and test sets:
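
A minimal sketch of a stratified split, assuming the cleaned text lives in a `text_clean` column and the target in `Recommended`. The toy rows, the 80/20 ratio and the random seed are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same column names as df_copy (the rows are made up).
demo = pd.DataFrame({
    "text_clean": ["love dress", "great fit", "poor fabric", "love color", "not flattering",
                   "perfect size", "cute top", "soft comfy", "nice look", "runs small"],
    "Recommended": [1, 1, 0, 1, 0, 1, 1, 1, 1, 0],
})

X = demo["text_clean"]
y = demo["Recommended"]

# stratify=y keeps the recommended / not recommended ratio similar in both sets,
# which matters because the target is imbalanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```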

In the next step we create a numerical feature vector for each document:

### Count Vectorization

### TF-IDF

### Eval Function
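
A reusable evaluation helper could look like the sketch below. The name `eval_model` and its argument order are assumptions; the small `DummyClassifier` demo only shows how it would be called.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report, confusion_matrix

def eval_model(model, X_train, y_train, X_test, y_test):
    """Print the confusion matrix and per-class metrics for the test and train sets."""
    for name, X, y in [("TEST", X_test, y_test), ("TRAIN", X_train, y_train)]:
        y_pred = model.predict(X)
        print(f"--- {name} ---")
        print(confusion_matrix(y, y_pred))
        # zero_division=0 avoids warnings when a class is never predicted.
        print(classification_report(y, y_pred, zero_division=0))

# Tiny made-up data and a majority-class baseline, just to demonstrate the call.
X_demo = [[0], [1], [2], [3]]
y_demo = [1, 1, 1, 0]
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
eval_model(baseline, X_demo, y_demo, X_demo, y_demo)
```

Printing train and test metrics side by side makes overfitting easy to spot for each model.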

## Logistic Regression

### CountVectorizer
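
As a sketch of how vectorizer and classifier can be chained, the example below fits a logistic regression on count features for a handful of made-up reviews. The data, the pipeline name `log_count` and the choice of `class_weight="balanced"` are assumptions, not settings taken from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up cleaned reviews and labels (1 = recommended, 0 = not recommended).
train_text = ["love dress perfect fit", "great color love", "cheap fabric not flattering",
              "poor quality returned", "love soft fabric", "not worth price returned"]
train_y = [1, 1, 0, 0, 1, 0]

# class_weight="balanced" reweights classes inversely to their frequency,
# one common way to handle an imbalanced target.
log_count = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
log_count.fit(train_text, train_y)

preds = log_count.predict(["love perfect dress", "poor cheap quality"])
print(preds)
```

Wrapping both steps in one pipeline ensures the vocabulary is always fitted on training data only. Swapping `CountVectorizer` for `TfidfVectorizer` gives the TF-IDF variant.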

### TF-IDF

## Naive Bayes

### CountVectorizer

### TF-IDF

## Support Vector Machine (SVM)

### CountVectorizer

### TF-IDF

## Random Forest

### CountVectorizer

### TF-IDF

## Ada Boosting

### CountVectorizer

### TF-IDF

## DL modeling

### Tokenization

### Creating word index

### Converting tokens to numeric

### Maximum number of tokens for all documents

### Fixing token counts of all documents (pad_sequences)
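
The steps from word index to fixed-length sequences can be illustrated without any deep learning library. The pure-Python sketch below is intended to mirror what a Keras-style tokenizer and `pad_sequences` do with their default settings (index 0 reserved for padding, padding and truncation at the front); the toy documents, `max_tokens = 3` and the helper `pad` are assumptions.

```python
from collections import Counter

docs = [["love", "dress"], ["love", "fit", "size", "color"], ["not", "love"]]  # toy token lists

# Word index: the most frequent word gets 1; index 0 is reserved for padding.
counts = Counter(token for doc in docs for token in doc)
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Convert tokens to integers.
sequences = [[word_index[t] for t in doc] for doc in docs]

def pad(seq, maxlen, value=0):
    """Left-pad (or left-truncate) a sequence to exactly maxlen items."""
    seq = seq[-maxlen:]                       # keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq

max_tokens = 3
padded = [pad(s, max_tokens) for s in sequences]
print(word_index)
print(padded)
```

Every row now has the same length, which is what an embedding layer expects as input.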

### Train Set Split

### Modeling

### Model Evaluation

## BERT Modeling

### Read Data

### Train test split

### Tokenization

### Fixing token counts of all documents

### Transformation Vectors to Matrices

### Transformation Matrix to Tensorflow tensor

### Batch Size

### Creating optimization

### Creating Model with TPU

### Model Fitting

### Model evaluation

### Compare Models F1 Scores, Recall Scores and Average Precision Score
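
One way to collect the scores into a single table is sketched below for the five classical models on TF-IDF features. The reviews, hyperparameters and column names are made up for illustration, and the scores on such tiny data mean nothing; average precision is left out because it needs predicted scores rather than labels.

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up cleaned reviews (1 = recommended, 0 = not recommended).
train_text = ["love dress perfect fit", "great color love", "cheap fabric not flattering",
              "poor quality returned", "love soft fabric", "not worth price returned",
              "beautiful dress great fit", "itchy fabric poor fit"]
train_y = [1, 1, 0, 0, 1, 0, 1, 0]
test_text = ["love great dress", "poor cheap fabric returned"]
test_y = [1, 0]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

rows = []
for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(), clf).fit(train_text, train_y)
    pred = pipe.predict(test_text)
    # Score the minority class (label 0), since that is the hard one to find.
    rows.append({
        "model": name,
        "f1_not_recommended": f1_score(test_y, pred, pos_label=0, zero_division=0),
        "recall_not_recommended": recall_score(test_y, pred, pos_label=0, zero_division=0),
    })

comparison = (pd.DataFrame(rows)
              .set_index("model")
              .sort_values("f1_not_recommended", ascending=False))
print(comparison)
```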

### Conclusion
