# [Getting Started with NLP](https://dphi.tech/bootcamps/getting-started-with-natural-language-processing?utm_source=header)
by [CSpanias](https://cspanias.github.io/aboutme/), 28/01 - 06/02/2022 <br>

Bootcamp organized by **[DPhi](https://dphi.tech/community/)**, lectures given by [**Dipanjan (DJ) Sarkar**](https://www.linkedin.com/in/dipanzan/) ([GitHub repo](https://github.com/dipanjanS/nlp_essentials)) <br>

## Fundamental Tutorials for NLP:
* [NLTK Book](https://www.nltk.org/book/)
* [spaCy Tutorials](https://course.spacy.io/en/chapter1)

# CONTENT
1. Text Wrangling
2. Text Representation with Feature Engineering - Statistical Models
3. Text Representation with Feature Engineering - Deep Learning Models
4. NLP Applications - Recommender Systems
5. [NLP Applications - Recommender Systems \#2](#NLPApp2)
    1. [Load and View Data](#Data)
    1. [Text Pre-Processing](#TextPre)
        1. [Merge Text Attributes](#merge)
        2. [Convert Rating System](#convert)
        3. [Remove Useless Records](#useless)
        4. [Check Label's Distribution](#balance)
    1. [Build Train & Test Datasets](#split)
    1. [Experiment 1: Basic NLP Count-based Features](#experiment1)
        1. [Logistic Regression](#logreg)
        1. [Model Evaluation Metrics - Quick Refresher](#metrics)
        1. [Leveraging Text Sentiment](#sentiment)
    1. [Experiment 2: Features from Sentiment Analysis](#experiment2)
        1. [Building & Evaluating Model with Sentiment Analysis](#model2)
        1. [Text Pre-Processing \#2](#textpre2) 
        1. [Extracting Basic NLP Count-based Features](#previousfeatures)
    1. [Experiment 3: Adding Bag of Words based Features](#experiment3)
        1. [Model Training & Evaluation](#evaluation3)

<a name="NLPApp2"></a>
# 5. NLP Applications - Recommender Systems \#2

The problem presented in this notebook is a classic NLP problem dealing with data from an e-commerce store focusing on women's clothing. 

The aim is to **predict a product's rating from customer reviews**.

**Each record** in the dataset is a **customer review** consisting of a review title, a text description and a product rating (from 1 to 5), among other features.

We convert this into a **binary classification problem** such that:
1. a customer recommends a product (label 1) if the **rating is > 3**
2. else they do not recommend the product (label 0).

The goal is to **leverage the review text attributes to predict the recommendation rating**.

**Note**: _The dataset is [available](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews) on Kaggle._

<a name="Data"></a>
# 5.1 Load and View Data

In [1]:
# import required libraries
import numpy as np
import pandas as pd

from sklearn.metrics import confusion_matrix, classification_report

In [2]:
# read data as pandas DataFrame
df = pd.read_csv('https://github.com/CSpanias/nlp_resources/blob/main/dphi_nlp_bootcamp/Womens%20Clothing%20E-Commerce%20Reviews.csv?raw=true', keep_default_na=False)

# check first 5 rows
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


<a name="TextPre"></a>
# 5.2 Text Pre-Processing

We want to achieve the following goals:
1. [Merge Text Attributes](#merge)
2. [Convert Rating System](#convert)
3. [Remove Useless Records](#useless)
4. [Check Label's Distribution](#balance)

<a name="merge"></a>
### 5.2.1 Merge Text Attributes

We want to **merge all review text attributes** into a **single attribute**.

In [4]:
# merge 'title' and 'review text' columns into a single column
df['Review'] = (df['Title'].map(str) +' '+ df['Review Text']).apply(lambda row: row.strip())

# check first 5 rows
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Review
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates,Absolutely wonderful - silky and sexy and comf...
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses,Love this dress! it's sooo pretty. i happene...
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,Some major design flaws I had such high hopes ...
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,"My favorite buy! I love, love, love this jumps..."
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,Flattering shirt This shirt is very flattering...


<a name="convert"></a>
### 5.2.2 Convert Rating System

We need to **convert** the 5-star **rating system** into a **binary rating system** (1 or 0).

In [5]:
# convert 5-star rating system into a binary one
df['Rating'] = [1 if rating > 3 else 0 for rating in df['Rating']]

# create a new df with just the 2 columns
df = df[['Review', 'Rating']]

# check first 5 rows
df.head()

Unnamed: 0,Review,Rating
0,Absolutely wonderful - silky and sexy and comf...,1
1,Love this dress! it's sooo pretty. i happene...,1
2,Some major design flaws I had such high hopes ...,0
3,"My favorite buy! I love, love, love this jumps...",1
4,Flattering shirt This shirt is very flattering...,1


<a name="useless"></a>
### 5.2.3 Remove Useless Records

We can remove all records that have **no review text**, as they won't provide any information to our model.

In [7]:
# remove empty records from 'Review' column
df = df[df['Review'] != '']

# check basic info
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22642 entries, 0 to 23485
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  22642 non-null  object
 1   Rating  22642 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 530.7+ KB


<a name="balance"></a>
### 5.2.4 Check Label's Distribution

We can also check if our dataset is **balanced or not**.

In [8]:
# check target variable's distribution
df['Rating'].value_counts()

1    17449
0     5193
Name: Rating, dtype: int64
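
To put the counts above on a common scale, they can be turned into proportions. A minimal sketch using the two counts printed above (the variable names are illustrative and not part of the notebook):

```python
# label counts copied from the value_counts() output above
label_counts = {1: 17449, 0: 5193}

total = sum(label_counts.values())
# share of each label in the filtered dataset
label_share = {label: count / total for label, count in label_counts.items()}

for label, share in label_share.items():
    print(f"Label {label}: {share:.1%}")
```

Roughly three out of four reviews are recommendations, so the dataset is imbalanced. On the DataFrame itself, `df['Rating'].value_counts(normalize=True)` should give the same proportions.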

<a name="split"></a>
# 5.3 Build Train & Test Datasets

We must split our dataset into **training** and **test sets**. 

The simplest way to achieve that is by using sklearn's **`train_test_split`** function, which by default holds out 25% of the records for testing.

**Note**: _You can find **`train_test_split`**'s documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)._

In [10]:
from sklearn.model_selection import train_test_split

# split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df[['Review']], df['Rating'], random_state=42)

# check shape
print("The training set has {} rows and {} column.\nThe test set has {} rows and {} column.".
      format(X_train.shape[0], X_train.shape[1], X_test.shape[0], X_test.shape[1]))

The training set has 16981 rows and 1 column.
The test set has 5661 rows and 1 column.


We can also check the **label distribution** separately for each set.

In [27]:
from collections import Counter

# check label's distribution for each set
print("The label distribution is:\n\nTraining set: Label 1 = {} | Label 0 = {}.\n\nTest set: Label 1 = {} | Label 0 = {}".
      format(Counter(y_train)[1], Counter(y_train)[0], Counter(y_test)[1], Counter(y_test)[0]))

The label distribution is:

Training set: Label 1 = 13059 | Label 0 = 3922.

Test set: Label 1 = 4390 | Label 0 = 1271
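
The split above leaves the two sets with slightly different label ratios. One optional variation, not used in this notebook, is to pass `stratify` to `train_test_split` so that both sets keep the same ratio. A sketch on toy data (the lists `toy_reviews` and `toy_labels` are made up for illustration):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# toy data: 80 positive and 20 negative labels (hypothetical, for illustration only)
toy_reviews = [f"review {i}" for i in range(100)]
toy_labels = [1] * 80 + [0] * 20

# stratify=toy_labels keeps the 80/20 label ratio in both splits
r_train, r_test, l_train, l_test = train_test_split(
    toy_reviews, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

print("train:", Counter(l_train))
print("test: ", Counter(l_test))
```

The results reported in the rest of this notebook come from the unstratified split shown earlier.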


<a name="experiment1"></a>
# 5.4 Experiment 1: Basic NLP Count-based Features

A number of basic text-based features can also be created, which sometimes help **improve text classification models**. 

Some examples are:

- __Word Count:__ total number of words in the documents
- __Character Count:__ total number of characters in the documents
- __Average Word Density:__ average length of the words used in the documents
- __Punctuation Count:__ total number of punctuation marks in the documents
- __Upper Case Count:__ total number of upper case words in the documents
- __Title Word Count:__ total number of proper case (title) words in the documents

**Note**: _The aforementioned information comes from [this](https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/) article._

In [28]:
import string

# calculate total number of characters
X_train['char_count'] = X_train['Review'].apply(len)
# calculate total number of words
X_train['word_count'] = X_train['Review'].apply(lambda x: len(x.split()))
# calculate average word density
X_train['word_density'] = X_train['char_count'] / (X_train['word_count']+1)
# calculate total number of punctuation marks
X_train['punctuation_count'] = X_train['Review'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation)))
# calculate total number of title-cased words
X_train['title_word_count'] = X_train['Review'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
# calculate total number of upper-cased words
X_train['upper_case_word_count'] = X_train['Review'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

# calculate total number of characters
X_test['char_count'] = X_test['Review'].apply(len)
# calculate total number of words
X_test['word_count'] = X_test['Review'].apply(lambda x: len(x.split()))
# calculate average word density
X_test['word_density'] = X_test['char_count'] / (X_test['word_count']+1)
# calculate total number of punctuation marks
X_test['punctuation_count'] = X_test['Review'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 
# calculate total number of title-cased words
X_test['title_word_count'] = X_test['Review'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
# calculate total number of upper-cased words
X_test['upper_case_word_count'] = X_test['Review'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

**Note**: _Information about **lambda functions** can be found [here](https://www.w3schools.com/python/python_lambda.asp)._
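
The cell above applies the same six steps to the training set and then again to the test set. One way to avoid the duplication is a small helper, sketched below. The function name `add_count_features` and the toy DataFrame are assumptions made for this example; the feature definitions mirror the cell above.

```python
import string

import pandas as pd


def add_count_features(frame, text_col="Review"):
    """Return a copy of `frame` with the six count-based features added."""
    out = frame.copy()
    text = out[text_col]
    out["char_count"] = text.apply(len)
    out["word_count"] = text.apply(lambda x: len(x.split()))
    # the +1 in the denominator mirrors the cell above and avoids division by zero
    out["word_density"] = out["char_count"] / (out["word_count"] + 1)
    out["punctuation_count"] = text.apply(
        lambda x: sum(ch in string.punctuation for ch in x)
    )
    out["title_word_count"] = text.apply(
        lambda x: sum(wrd.istitle() for wrd in x.split())
    )
    out["upper_case_word_count"] = text.apply(
        lambda x: sum(wrd.isupper() for wrd in x.split())
    )
    return out


# toy frame, made up for illustration
toy = pd.DataFrame({"Review": ["Love this DRESS!"]})
print(add_count_features(toy).iloc[0].to_dict())
```

With such a helper, `X_train = add_count_features(X_train)` and `X_test = add_count_features(X_test)` would replace the two duplicated blocks.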

In [29]:
# check first 5 rows
X_train.head()

Unnamed: 0,Review,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count
12896,Soooo soft! This is a delightfully soft and fl...,268,52,5.056604,8,2,0
13183,"Had my eye on this, but dind't get I finally v...",399,84,4.694118,20,2,1
1496,I wanted to like this... I wanted to like this...,525,104,5.0,19,2,2
5205,Beautiful blouse Bought this for my daughter i...,203,35,5.638889,10,2,0
13366,"Boxy. large. Boxy, unflattering, and large.\n\...",295,51,5.673077,22,2,0


<a name="logreg"></a>
## 5.4.1 Logistic Regression 

A logistic regression model is **easy to train** and **interpret**, and it **works well** on a wide variety of NLP problems.

**Note**: _More info about Logistic Regression [here](https://www.youtube.com/watch?v=yIYKR4sgzI8)._

In [30]:
from sklearn.linear_model import LogisticRegression

# instantiate log reg
lr = LogisticRegression(C=1, random_state=42, solver='liblinear')

<a name="metrics"></a>
## 5.4.2 Model Evaluation Metrics - Quick Refresher

Accuracy alone is **not enough** for datasets with a **rare class problem**, because a model can score well simply by always predicting the majority class.

- __Precision:__ The positive predictive power of a model. Out of all the predictions made by a model for a class, how many are actually correct.
- __Recall:__ The coverage or hit-rate of a model. Out of all the test data samples belonging to a class, how many the model was able to predict (hit or cover) correctly.
- __F1-score:__ The harmonic mean of precision and recall.

**Note**: _More info about classification metrics [here](https://www.analyticsvidhya.com/blog/2021/07/metrics-to-evaluate-your-classification-model-to-take-the-right-decisions/)._
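
The three definitions above can be written directly in terms of true positives (TP), false positives (FP) and false negatives (FN) for a given class. A minimal sketch with hypothetical counts (the function names and numbers are made up for illustration; in practice `classification_report` does this for us):

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# hypothetical counts for one class: 8 true positives, 2 false positives, 4 false negatives
tp, fp, fn = 8, 2, 4
print(f"precision={precision(tp, fp):.2f} recall={recall(tp, fn):.2f} f1={f1_score(tp, fp, fn):.2f}")
```

Returning 0.0 when a denominator is zero matches the `zero_division=0` setting used with `classification_report` in this section.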

In [34]:
# train model
lr.fit(X_train.drop(['Review'], axis=1), y_train)

# predict with test data
predictions = lr.predict(X_test.drop(['Review'], axis=1))

# evaluate model
print(classification_report(y_test, predictions, zero_division=0))

# show the confusion matrix as a pandas DataFrame (rows: actual, columns: predicted)
pd.DataFrame(confusion_matrix(y_test, predictions))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1271
           1       0.78      1.00      0.87      4390

    accuracy                           0.78      5661
   macro avg       0.39      0.50      0.44      5661
weighted avg       0.60      0.78      0.68      5661



Unnamed: 0,0,1
0,0,1271
1,0,4390


Looks like our model was not able to predict a single product having a bad (no recommendation) rating, i.e. __Class 0__. 

This is as good as someone predicting a __1__ or __good__ for every product review. 

Can we do better?
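
To make the "always predict 1" comparison concrete, the sketch below scores such a baseline using only the class sizes from the report above (the variable names are illustrative):

```python
# class sizes in the test set, copied from the "support" column above
n_bad, n_good = 1271, 4390

y_true = [0] * n_bad + [1] * n_good
# a "model" that recommends every product
y_pred = [1] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_bad = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred)) / n_bad

print(f"accuracy={accuracy:.2f} recall for class 0={recall_bad:.2f}")
```

The baseline reaches the same 0.78 accuracy as the model, which is why accuracy alone is misleading here.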

<a name="sentiment"></a>
## 5.4.3 Leveraging Text Sentiment

Reviews are **subjective** and **opinionated**, and people often **express strong emotions and feelings** in them. 

This makes the review text a good candidate for **extracting sentiment as a feature**.

The general expectation is that **highly rated and recommended products** (label 1) should have a **positive sentiment** and products which are **not recommended** (label 0) should have a **negative sentiment**.

**`TextBlob`** is an excellent open-source library for performing **sentiment analysis**. It has a **sentiment lexicon** (in the form of an XML file) which it leverages to give both **polarity and subjectivity scores**. 

- The **polarity score** is a float within the range [-1.0, 1.0]. 
- The **subjectivity** is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective. 

**Note**: _The above information comes from [this](https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72) article._

In [48]:
from textblob import TextBlob

# perform sentiment analysis
sent = 'This is an AMAZING pair of Jeans!'
sent_analysis = TextBlob(sent).sentiment
print("Sentiment analysis of '{}': Polarity = {:.2f} | Subjectivity = {:.2f}\n".
      format(sent, sent_analysis[0], sent_analysis[1]))

sent_1 = 'I really hated this UGLY T-shirt!!'
sent_analysis_1 = TextBlob(sent_1).sentiment
print("Sentiment analysis of '{}': Polarity = {:.2f} | Subjectivity = {:.2f}".
      format(sent_1, sent_analysis_1[0], sent_analysis_1[1]))

Sentiment analysis of 'This is an AMAZING pair of Jeans!': Polarity = 0.75 | Subjectivity = 0.90

Sentiment analysis of 'I really hated this UGLY T-shirt!!': Polarity = -0.95 | Subjectivity = 0.85


Looks like this should help us get features which **can distinguish between good and bad products**.
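
To build some intuition for what "lexicon-based" means, here is a deliberately simplified sketch: look up each word in a small dictionary of polarity values and average the matches. The lexicon and its values are invented for this example, and TextBlob's own analyzer is more elaborate than this.

```python
import re

# tiny hand-made lexicon with made-up polarity values (NOT TextBlob's lexicon)
TOY_LEXICON = {"amazing": 1.0, "love": 0.5, "ugly": -0.75, "hated": -1.0}


def toy_polarity(text):
    """Average the lexicon scores of the words found in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity("This is an AMAZING pair of Jeans!"))
print(toy_polarity("I really hated this UGLY T-shirt!!"))
```

No labeled training data is involved: the score depends only on which lexicon words appear in the text.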

<a name="experiment2"></a>
# 5.5 Experiment 2: Features from Sentiment Analysis 

Remember, this is **unsupervised**, **lexicon-based sentiment analysis**, where **we don't have any pre-labeled data** saying which review might have a positive or negative sentiment. 

In [50]:
# calculate review's sentiment 
x_train_snt_obj = X_train['Review'].apply(lambda row: TextBlob(row).sentiment)
# create a column for polarity scores
X_train['Polarity'] = [obj.polarity for obj in x_train_snt_obj.values]
# create a column for subjectivity scores
X_train['Subjectivity'] = [obj.subjectivity for obj in x_train_snt_obj.values]

# calculate review's sentiment 
x_test_snt_obj = X_test['Review'].apply(lambda row: TextBlob(row).sentiment)
# create a column for polarity scores
X_test['Polarity'] = [obj.polarity for obj in x_test_snt_obj.values]
# create a column for subjectivity scores
X_test['Subjectivity'] = [obj.subjectivity for obj in x_test_snt_obj.values]

In [51]:
# check first 5 rows
X_train.head()

Unnamed: 0,Review,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count,Polarity,Subjectivity
12896,Soooo soft! This is a delightfully soft and fl...,268,52,5.056604,8,2,0,0.170455,0.490909
13183,"Had my eye on this, but dind't get I finally v...",399,84,4.694118,20,2,1,0.101944,0.719537
1496,I wanted to like this... I wanted to like this...,525,104,5.0,19,2,2,0.186538,0.458761
5205,Beautiful blouse Bought this for my daughter i...,203,35,5.638889,10,2,0,0.625,0.825
13366,"Boxy. large. Boxy, unflattering, and large.\n\...",295,51,5.673077,22,2,0,0.329613,0.510268


<a name="model2"></a>
## 5.5.1 Building & Evaluating Model with Sentiment Analysis

In [53]:
# train log reg
lr.fit(X_train.drop(['Review'], axis=1), y_train)
# predict on test data
predictions = lr.predict(X_test.drop(['Review'], axis=1))

# print classification report
print(classification_report(y_test, predictions))

# show the confusion matrix as a pandas DataFrame (rows: actual, columns: predicted)
pd.DataFrame(confusion_matrix(y_test, predictions))

              precision    recall  f1-score   support

           0       0.69      0.26      0.38      1271
           1       0.82      0.97      0.89      4390

    accuracy                           0.81      5661
   macro avg       0.75      0.61      0.63      5661
weighted avg       0.79      0.81      0.77      5661



Unnamed: 0,0,1
0,336,935
1,152,4238


Interesting! We are now able to identify __26%__ of the bad (negatively rated) products, and **Precision** for that class is reasonable at __69%__.

__F1-Score__ for bad reviews is now __38%__ and for good reviews is __89%__.

This brings our weighted-average __F1-Score__ to __77%__, up from 68% in Experiment 1.

The recall for bad reviews is still low, so can we improve the model further?

<a name="textpre2"></a>
## 5.5.2 Text Pre-Processing \#2

We want to extract some specific features based on standard NLP feature engineering models like the classic **Bag of Words** model.

For this we need to **clean and pre-process our text data**. We will build a **simple text pre-processor** that will perform:

- Text Lowercasing
- Expansion of contractions
- Removing unnecessary characters, numbers and symbols
- Stemming
- Stopword removal

**Note**: _The above information comes from [this](https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72) article._

In [55]:
import contractions

# expand contractions
contractions.fix('I didn\'t like this t-shirt')

'I did not like this t-shirt'

In [59]:
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# remove some stopwords to capture negation in n-grams if possible
stop_words = stopwords.words('english')
stop_words.remove('no')
stop_words.remove('not')
stop_words.remove('but')

# load up a simple porter stemmer - nothing fancy
ps = PorterStemmer()

def simple_text_preprocessor(document):
    """Perform basic text pre-processing tasks."""
    
    # lower case
    document = str(document).lower()
    
    # expand contractions
    document = contractions.fix(document)
    
    # remove unnecessary characters
    document = re.sub(r'[^a-zA-Z]',r' ', document)
    document = re.sub(r'nbsp', r'', document)
    document = re.sub(' +', ' ', document)
    
    # simple porter stemming
    document = ' '.join([ps.stem(word) for word in document.split()])
    
    # stopwords removal
    document = ' '.join([word for word in document.split() if word not in stop_words])
    
    return document

# vectorize function
stp = np.vectorize(simple_text_preprocessor)

In [60]:
# create a new column with cleaned text
X_train['Clean Review'] = stp(X_train['Review'].values)
X_test['Clean Review'] = stp(X_test['Review'].values)

# check first 5 rows
X_train.head()

Unnamed: 0,Review,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count,Polarity,Subjectivity,Clean Review
12896,Soooo soft! This is a delightfully soft and fl...,268,52,5.056604,8,2,0,0.170455,0.490909,soooo soft thi delight soft fluffi sweater mig...
13183,"Had my eye on this, but dind't get I finally v...",399,84,4.694118,20,2,1,0.101944,0.719537,eye thi but dind get final visit store petit t...
1496,I wanted to like this... I wanted to like this...,525,104,5.0,19,2,2,0.186538,0.458761,want like thi want like thi top badli badli fa...
5205,Beautiful blouse Bought this for my daughter i...,203,35,5.638889,10,2,0,0.625,0.825,beauti blous bought thi daughter law birthday ...
13366,"Boxy. large. Boxy, unflattering, and large.\n\...",295,51,5.673077,22,2,0,0.329613,0.510268,boxi larg boxi unflatt larg curvi pound thi to...
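
Note the token `thi` in the cleaned text: because stemming runs before stopword removal, `this` is stemmed to `thi`, which no longer matches the stopword list. The sketch below illustrates the effect of the ordering with a toy stemmer and a toy stopword list (both are stand-ins invented for this example, not NLTK's `PorterStemmer` or stopword corpus):

```python
TOY_STOPWORDS = {"this", "is", "a"}


def toy_stem(word):
    """Very crude stand-in for a stemmer: strip one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def stem_then_filter(text):
    """Stem first, then drop stopwords (the order used in this notebook)."""
    stems = [toy_stem(w) for w in text.split()]
    return [w for w in stems if w not in TOY_STOPWORDS]


def filter_then_stem(text):
    """Drop stopwords first, then stem what is left."""
    kept = [w for w in text.split() if w not in TOY_STOPWORDS]
    return [toy_stem(w) for w in kept]


text = "this dress is a bit boxy"
print(stem_then_filter(text))  # 'thi' slips past the stopword list
print(filter_then_stem(text))  # 'this' is removed before stemming
```

Whether to swap the two steps in `simple_text_preprocessor` is a judgment call; the results reported in this notebook were produced with the original order.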


<a name="previousfeatures"></a>
## 5.5.3 Extracting Basic NLP Count-based Features

In [61]:
# drop the two text columns, keeping only the numeric features
X_train_metadata = X_train.drop(['Review', 'Clean Review'], axis=1).reset_index(drop=True)
X_test_metadata = X_test.drop(['Review', 'Clean Review'], axis=1).reset_index(drop=True)

# check first 5 rows
X_train_metadata.head()

Unnamed: 0,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count,Polarity,Subjectivity
0,268,52,5.056604,8,2,0,0.170455,0.490909
1,399,84,4.694118,20,2,1,0.101944,0.719537
2,525,104,5.0,19,2,2,0.186538,0.458761
3,203,35,5.638889,10,2,0,0.625,0.825
4,295,51,5.673077,22,2,0,0.329613,0.510268


<a name="experiment3"></a>
# 5.6 Experiment 3: Adding Bag of Words based Features

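The Bag of Words model is perhaps the simplest vector space representational model for unstructured text. A vector space model is simply a mathematical model to **represent unstructured text as numeric vectors**, such that each dimension of the vector is a specific feature/attribute. In the Bag of Words model, each dimension is a vocabulary term and its value is the number of times that term occurs in the document.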


**Note**: _The above information comes from [this](https://towardsdatascience.com/understanding-feature-engineering-part-3-traditional-methods-for-text-data-f6f7d70acd41) article._
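
Before using `CountVectorizer`, it may help to see the idea in plain Python. The sketch below builds a vocabulary from two toy documents (made up for illustration) and turns each document into a vector of word counts:

```python
from collections import Counter

# two toy documents, made up for illustration
docs = ["nice dress nice fit", "bad fit"]

# vocabulary: every distinct word, in alphabetical order
vocabulary = sorted({word for doc in docs for word in doc.split()})


def to_count_vector(doc):
    """Count how often each vocabulary word occurs in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


vectors = [to_count_vector(doc) for doc in docs]
print(vocabulary)
print(vectors)
```

Word order is discarded: only the counts per vocabulary term remain, which is where the name "Bag of Words" comes from.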

In [65]:
from sklearn.feature_extraction.text import CountVectorizer

# instantiate vectorizer
cv = CountVectorizer(min_df=0.0, max_df=1.0, ngram_range=(1, 1))
# fit vectorizer to 'Clean Review' and convert it to numpy array
X_traincv = cv.fit_transform(X_train['Clean Review']).toarray()
# create a pandas DataFrame
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
X_traincv = pd.DataFrame(X_traincv, columns=cv.get_feature_names_out())

# use vectorizer to transform 'Clean Review' and convert it to numpy array
X_testcv = cv.transform(X_test['Clean Review']).toarray()
# create a pandas DataFrame
X_testcv = pd.DataFrame(X_testcv, columns=cv.get_feature_names_out())

# check first 5 rows
X_traincv.head()

Unnamed: 0,aa,aaaaandidon,aaaaannnnnnd,aaaah,aaaahmaz,aaah,ab,abbey,abbi,abck,...,zing,zip,zipper,zipperi,zippi,zone,zooland,zoom,zowi,zuma
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [66]:
# concatenate the 2 dataframes
X_train_comb = pd.concat([X_train_metadata, X_traincv], axis=1)
X_test_comb = pd.concat([X_test_metadata, X_testcv], axis=1)

# check first 5 rows
X_train_comb.head()

Unnamed: 0,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count,Polarity,Subjectivity,aa,aaaaandidon,...,zing,zip,zipper,zipperi,zippi,zone,zooland,zoom,zowi,zuma
0,268,52,5.056604,8,2,0,0.170455,0.490909,0,0,...,0,0,0,0,0,0,0,0,0,0
1,399,84,4.694118,20,2,1,0.101944,0.719537,0,0,...,0,0,0,0,0,0,0,0,0,0
2,525,104,5.0,19,2,2,0.186538,0.458761,0,0,...,0,0,0,0,0,0,0,0,0,0
3,203,35,5.638889,10,2,0,0.625,0.825,0,0,...,0,0,0,0,0,0,0,0,0,0
4,295,51,5.673077,22,2,0,0.329613,0.510268,0,0,...,0,0,0,0,0,0,0,0,0,0
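
Calling `.toarray()` produces a dense matrix with one column per vocabulary term, which can use a lot of memory on larger corpora. One possible alternative, not used in this notebook, is to keep the Bag of Words matrix sparse and stack the metadata next to it with `scipy.sparse.hstack`. A sketch on toy data (the documents and the `meta` array are made up):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# toy cleaned documents and two toy metadata columns (char_count, word_count)
docs = ["soft fluffi sweater", "boxi larg top", "soft top"]
meta = np.array([[19, 3], [13, 3], [8, 2]])

cv_sparse = CountVectorizer()
bow = cv_sparse.fit_transform(docs)  # stays a scipy sparse matrix

# place the metadata columns to the left of the word-count columns
combined = hstack([csr_matrix(meta), bow], format="csr")
print(combined.shape)
```

The trade-off is that the combined matrix no longer carries column names the way a DataFrame does.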


<a name="evaluation3"></a>
## 5.6.1 Model Training & Evaluation

In [67]:
# train logreg
lr.fit(X_train_comb, y_train)
# predict using test data
predictions = lr.predict(X_test_comb)

# print classification report
print(classification_report(y_test, predictions))

# show the confusion matrix as a pandas DataFrame (rows: actual, columns: predicted)
pd.DataFrame(confusion_matrix(y_test, predictions))

              precision    recall  f1-score   support

           0       0.76      0.71      0.73      1271
           1       0.92      0.94      0.93      4390

    accuracy                           0.88      5661
   macro avg       0.84      0.82      0.83      5661
weighted avg       0.88      0.88      0.88      5661



Unnamed: 0,0,1
0,899,372
1,285,4105


Wow! This looks promising.

We are now able to identify __71%__ of the bad (negatively rated) products, and **Precision** for that class is quite good at __76%__.

__F1-Score__ for bad reviews is now __73%__ and for good reviews is __93%__.

This brings our weighted-average __F1-Score__ to __88%__, which is quite good.
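
To compare the three experiments side by side, the recall for bad reviews (class 0) can be recomputed from the confusion matrices printed in each experiment. A short sketch (the dictionary keys are labels chosen for this example):

```python
# confusion matrices copied from the three experiments: [[TN, FP], [FN, TP]]
experiments = {
    "count features": [[0, 1271], [0, 4390]],
    "+ sentiment": [[336, 935], [152, 4238]],
    "+ bag of words": [[899, 372], [285, 4105]],
}

recall_bad = {}
for name, ((tn, fp), (fn, tp)) in experiments.items():
    # recall for class 0 = correctly flagged bad reviews / all bad reviews
    recall_bad[name] = tn / (tn + fp)
    print(f"{name:<15} recall(class 0) = {recall_bad[name]:.2f}")
```

Each added feature group raises the share of bad reviews that the model catches.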