# Customer Satisfaction Prediction - Brazilian e-Commerce Public Dataset

## 1. Business Problem:-

### 1.1 Description

This is a Brazilian e-commerce public dataset of orders made at Olist Store. The dataset has information on 100k orders made from 2016 to 2018 at multiple marketplaces in Brazil. Its features allow viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. A geolocation dataset that relates Brazilian zip codes to lat/lng coordinates has also been released.

This dataset was generously provided by Olist, the largest department store in Brazilian marketplaces. Olist connects small businesses from all over Brazil to channels without hassle and with a single contract. Those merchants are able to sell their products through the Olist Store and ship them directly to the customers using Olist logistics partners. See more on the website: www.olist.com

After a customer purchases a product from Olist Store, a seller is notified to fulfill that order. Once the customer receives the product, or the estimated delivery date is due, the customer gets a satisfaction survey by email in which they can rate the purchase experience and write down some comments.

CREDITS:- Kaggle

### 1.2 Problem Statement
Predict customer satisfaction with a purchase made on the Olist e-commerce site.

## 2. Machine Learning Problem
### 2.1 Data
#### 2.1.1 Data Overview

Source:- https://www.kaggle.com/olistbr/brazilian-ecommerce

The data is divided in multiple datasets for better understanding and organization. Please refer to the following data schema when working with it:
<img src="https://i.imgur.com/HRhd2Y0.png" />


#### 2.1.2 Data Description
The **olist_orders_dataset** has the order data for each purchase and connects to the other tables through order_id and customer_id.
The **olist_order_reviews_dataset** has the labelled review data for each order in the orders table, with scores in [1,2,3,4,5], where 5 is the highest and 1 is the lowest.
We will treat review scores greater than 3 as positive and scores less than or equal to 3 as negative.
The tables will be joined as needed to get the data for the analysis, feature selection and model training.
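
The labelling rule above can be sketched on a few toy scores. The values below are hypothetical and only illustrate the threshold; they are not taken from the dataset.

```python
import pandas as pd

# Toy review scores (hypothetical values, not taken from the dataset)
scores = pd.Series([5, 4, 3, 2, 1, 4])

# Scores above 3 become the positive class (1); scores of 3 or below become 0
labels = (scores > 3).astype(int)
print(labels.tolist())
```

Note that a score of exactly 3 falls into the negative class under this rule.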

### 2.2 Mapping the real world problem to an ML problem
#### 2.2.1 Type of Machine Learning Problem
It is a binary classification problem: for a given purchase order, we need to predict whether it will get a positive or a negative review from the customer.

#### 2.2.2 Performance Metric

Metric(s):
* F1-score (harmonic mean of precision and recall)
* Binary Confusion Matrix
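
As a rough illustration of how the two metrics relate, the sketch below computes a confusion matrix and the F1-score with scikit-learn and cross-checks F1 by hand. The label vectors are hypothetical.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical true and predicted labels (1 = positive review, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# F1 is the harmonic mean of precision and recall
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

# f1_score reports F1 for the positive class (label 1) by default
f1 = f1_score(y_true, y_pred)
print(cm)
print(round(f1, 3), round(f1_manual, 3))
```

Because F1 ignores true negatives, reading it together with the confusion matrix helps to see how the negative class is handled.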

### 2.3 Train and Test Construction

We build the train and test sets with a stratified random split of the data, in a ratio of 70:30 or 80:20, since we have sufficient data points to work with.
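
A minimal sketch of a stratified 80:20 split, assuming scikit-learn's `train_test_split`. The arrays `X` and `y` are hypothetical stand-ins for the feature matrix and the binary label, and `random_state=42` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 positive and 20 negative labels
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 80 + [0] * 20)

# stratify=y keeps the class ratio (approximately) equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())
```

Stratification matters here because a plain random split of an imbalanced data set can leave the test set with a different share of negative reviews than the training set.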

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 3. Exploratory Data Analysis

### 3.1 Importing libraries

In [None]:
!pip install --upgrade gensim



In [None]:
import warnings
warnings.filterwarnings("ignore")
import re
import nltk
nltk.download('stopwords')
nltk.download('rslp')
from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer
from tqdm import tqdm
import shutil
import os
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib
matplotlib.use(u'nbAgg')
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pickle
import random
from scipy.stats import randint as sp_randint
from scipy.stats import uniform
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy.sparse import hstack
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier,LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier, AdaBoostClassifier
from sklearn.metrics import log_loss,accuracy_score, confusion_matrix, f1_score

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package rslp to /root/nltk_data...
[nltk_data]   Unzipping stemmers/rslp.zip.


### 3.2 Loading data and preprocessing

In [None]:
# loading the data tables
customer_data = pd.read_csv('/content/drive/MyDrive/Predict Customer Satisfaction /Data/olist_customers_dataset.csv')
geolocation_data = pd.read_csv('/content/drive/MyDrive/Predict Customer Satisfaction /Data/olist_geolocation_dataset.csv')
order_items_dataset = pd.read_csv('/content/drive/MyDrive/Predict Customer Satisfaction /Data/olist_order_items_dataset.csv')
order_payments_dataset = pd.read_csv('/content/drive/MyDrive/Predict Customer Satisfaction /Data/olist_order_payments_dataset.csv')
order_reviews_dataset = pd.read_csv('/content/drive/MyDrive/Predict Customer Satisfaction /Data/olist_order_reviews_dataset.csv')
order_dataset = pd.read_csv('/content/drive/MyDrive/Predict Customer Satisfaction /Data/olist_orders_dataset.csv')
order_products_dataset = pd.read_csv('/content/drive/MyDrive/Predict Customer Satisfaction /Data/olist_products_dataset.csv')
order_sellers_dataset = pd.read_csv('/content/drive/MyDrive/Predict Customer Satisfaction /Data/olist_sellers_dataset.csv')
product_translation_dataset = pd.read_csv('/content/drive/MyDrive/Predict Customer Satisfaction /Data/product_category_name_translation.csv')

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/Predict Customer Satisfaction /Data/olist_customers_dataset.csv'

In [None]:
# checking customer data
customer_data.head()

In [None]:
# checking geo-location data
geolocation_data.head()

In [None]:
# checking ordered items data
order_items_dataset.head()

In [None]:
# checking payments data
order_payments_dataset.head()

In [None]:
# checking order reviews data
order_reviews_dataset.head()

In [None]:
# checking the order data
order_dataset.head()

In [None]:
# checking sellers data
order_sellers_dataset.head()

In [None]:
# checking products data
order_products_dataset.head()

In [None]:
# product category name translation data from Portuguese to English
product_translation_dataset.head()

In [None]:
# checking info of reviews data
print(order_reviews_dataset.info())

Here, we can see that the review data has a review score for each of the roughly 100k data points, but fewer than 50% of the orders have review comments. Also, we want to predict the customer's review based on the order fulfillment rather than classify reviews as positive or negative based on the comments posted.

According to our objective, i.e. to predict customer satisfaction from the order fulfillment rather than from the review text, the review comments given by the customer should be removed from the data to avoid bias in the model.

In [None]:
# keeping only the columns needed from the review data set
order_reviews_dataset = order_reviews_dataset[['order_id','review_score', 'review_comment_message']]
order_reviews_dataset.info()

In [None]:
# Merging order data with review data to get a review score on each order
order_review_data = order_reviews_dataset.merge(order_dataset,on='order_id')
order_review_data.head()

As seen above, the product dataset contains the product categories in Portuguese. So, let us translate the product categories to English for better understanding.

In [None]:
# changing product name to english in the ordered product dataset
order_products_dataset_english = pd.merge(order_products_dataset,product_translation_dataset,on='product_category_name'
                                          ,how='left')
order_products_dataset_english = order_products_dataset_english.drop(labels='product_category_name',axis=1)
order_products_dataset_english.head()
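
A small sketch of what `how='left'` does in a merge like the one above. The two toy tables and the category `categoria_sem_traducao` are hypothetical; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical product and translation tables
products = pd.DataFrame({
    'product_id': ['p1', 'p2', 'p3'],
    'product_category_name': ['perfumaria', 'artes', 'categoria_sem_traducao'],
})
translation = pd.DataFrame({
    'product_category_name': ['perfumaria', 'artes'],
    'product_category_name_english': ['perfumery', 'art'],
})

# how='left' keeps every product; categories without a translation become NaN
translated = pd.merge(products, translation, on='product_category_name', how='left')
n_missing = translated['product_category_name_english'].isnull().sum()
print(translated)
```

Any product whose category has no translation keeps its row but gets a null English category, which is one way null values can enter the merged data.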

The above data set contains a detailed description of each product item available for sale on the website. So, let us merge this data with **order_items_dataset**, which contains the order details of each product item sold, to get the product description of each item sold in the same data.

The dataset now contains the detailed description of each product ordered online such as price, dimensions, seller, number of photos available to customer, and product weight etc.

In [None]:
# merging item description to the products ordered data using product_id
order_product_item_dataset = pd.merge(order_items_dataset,order_products_dataset_english,on='product_id')
order_product_item_dataset.head()

In [None]:
# merging detailed product data with the order review data
ordered_product_reviews = pd.merge(order_product_item_dataset,order_review_data,on='order_id')
ordered_product_reviews_payments = pd.merge(ordered_product_reviews,order_payments_dataset,on='order_id')
ordered_product_reviews_payments.head()
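
One thing to keep in mind with merges on `order_id`: when both tables hold several rows for the same order, the result contains every combination of them. The sketch below uses hypothetical toy tables to show the effect.

```python
import pandas as pd

# Hypothetical order 'a' with two items and two payment records
items = pd.DataFrame({'order_id': ['a', 'a', 'b'], 'price': [10.0, 20.0, 5.0]})
payments = pd.DataFrame({'order_id': ['a', 'a', 'b'], 'payment_value': [15.0, 15.0, 5.0]})

merged = pd.merge(items, payments, on='order_id')

# Order 'a' contributes 2 x 2 = 4 rows, order 'b' contributes 1 row
rows_per_order = merged.groupby('order_id').size()
print(len(merged))
print(rows_per_order)
```

Counts and percentages computed on the merged table are therefore per row, not strictly per order.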

#### 3.2.1 Final data

Now we have our final data set: for each order_id we have product info, seller info, item info, customer info, payment info and the review score given by the customer.
Let us dive deep into our data set and see what the data tells us. Let us start with simple statistics on the data.

In [None]:
# merging customer data with the order, product, review and payment data
df_final = pd.merge(ordered_product_reviews_payments,customer_data,on='customer_id')
# df_final.to_csv('olist_final.csv',index=False)
df_final.head()

In [None]:
#info on the data set
df_final.info()

#### 3.2.2 Handling missing values
From the info table above, we can see that our data set has missing values for some of the features. Let us see the count of missing values for each feature.

In [None]:
# checking the count of null values per column
df_final.isnull().sum()

The most missing values are seen in the order delivery date feature, with around 2% of the total data. For the numerical features with null values, we will use median imputation (the median is less sensitive to outliers than the mean). For the date columns, the order delivery date and the order approval date, we will fill the missing values from the corresponding estimated delivery date and order purchase time columns. The customer generally does not pay attention to the order_delivered_carrier_date of their order, so we will drop this column. Also, the categorical product category feature has null values in less than 1% of the total data, so we will drop the rows having null values.
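
To make the two strategies concrete, here is a minimal sketch on a toy frame. All values are hypothetical; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names mirror the dataset
toy = pd.DataFrame({
    'product_weight_g': [200.0, 250.0, 300.0, np.nan, 30000.0],
    'order_delivered_customer_date': ['2018-01-10', None, '2018-01-15', '2018-01-20', None],
    'order_estimated_delivery_date': ['2018-01-12', '2018-01-14', '2018-01-18', '2018-01-25', '2018-01-30'],
})

# The single large value pulls the mean up far more than the median
mean_w = toy['product_weight_g'].mean()      # 7687.5
median_w = toy['product_weight_g'].median()  # 275.0
toy['product_weight_g'] = toy['product_weight_g'].fillna(median_w)

# Missing delivery dates fall back to the estimated delivery date of the same row
toy['order_delivered_customer_date'] = toy['order_delivered_customer_date'].fillna(
    toy['order_estimated_delivery_date']
)
print(toy)
```

Filling a missing delivery date with the estimated date assumes the order arrived exactly on time, which is an assumption of this notebook rather than something the data confirms.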

In [None]:
# Handling missing values: median imputation for the numerical product features
median_impute_cols = ['product_name_lenght', 'product_description_lenght', 'product_photos_qty',
                      'product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']
for col in median_impute_cols:
    # assigning back avoids calling fillna(inplace=True) on a column selection
    df_final[col] = df_final[col].fillna(df_final[col].median())

In [None]:
# Handling missing values: fall back to a related date column of the same row
df_final['order_delivered_customer_date'] = df_final['order_delivered_customer_date'].fillna(
    df_final['order_estimated_delivery_date'])

df_final['order_approved_at'] = df_final['order_approved_at'].fillna(
    df_final['order_purchase_timestamp'])

#dropping order delivery carrier date
df_final.drop(labels='order_delivered_carrier_date',axis=1,inplace=True)

In [None]:
# filling nan value of review comments with no_review
df_final['review_comment_message'] = df_final['review_comment_message'].fillna('no_review')

# dropping the remaining rows with null values (e.g. missing product category name)
df_final = df_final.dropna()

In [None]:
df_final.info()

##### Observation
We handled missing values differently depending on the feature type and the share of missing values. Missing values in the numerical features were imputed with the median. The delivery date features were handled differently: we used data from other columns to fill them. For example, we assumed that any order with a missing customer delivery date was delivered by the estimated delivery date, and filled it accordingly. For the features with less than 1% missing values, we dropped the data points containing any null values.
Now that we have handled all the missing values, the preprocessing of the data is complete. So, let us go ahead and do some analysis on the data.

### 3.3 Data Analysis
Since preprocessing of the data is done, we now have our final data set. Let us analyse the data and find meaningful insights.

Our objective here is to build a model which can predict the review score on the data or classify the data into 0 and 1 review score. So, let us try to find insights and analyze the data keeping our objective in mind.

In [None]:
# checking the review score
df_final.review_score.value_counts()

>According to our objective, we are going to solve this problem as a binary classification. So, let us convert the review score into 0 and 1 labels and view the distribution.

In [None]:
# converting reviews into 0 and 1 to make it binary classification problem
df_final['review_score'] = df_final['review_score'].apply(lambda x:1 if x>3 else 0)

#let us see the distribution now.
plt.figure(figsize=(10,5))
ax=sns.countplot(x="review_score", data=df_final)
plt.title('Distribution of Review Score')
plt.show()

###### Observation
The plot above shows the distribution of the class labels (review_score) in the data set. From the plot, we can see that the majority of the data points (more than 50%) belong to class label 1, i.e. the positive class, and the rest to the negative class, which suggests that the data set is class-imbalanced.
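
The class shares can also be read off numerically instead of from the plot. A minimal sketch with hypothetical labels:

```python
import pandas as pd

# Hypothetical binary labels (1 = positive review, 0 = negative review)
labels = pd.Series([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])

# normalize=True returns the share of each class instead of raw counts
class_share = labels.value_counts(normalize=True)
print(class_share)
```

On the real data, the same call on `df_final['review_score']` would give the exact imbalance ratio.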

In [None]:
# statistics of numerical features in the data set
df_final.describe()

###### Observation
We observe from the table above that we have 12 useful numerical features, excluding zip code, order_item_id and our target variable review score. Let us look at the statistics of the price and freight value of an order. The maximum price is 6735 Brazilian real, while the maximum freight value is around 410 Brazilian real. The average price is around 120 Brazilian real and the average freight value is around 20 Brazilian real. The minimum price is 0.85 Brazilian real. Let us look at the distributions of these features to see how they help in separating the class labels and to find other insights.

#### 3.3.1 Univariate Analysis

In [None]:
# https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
# plotting distributions of price per class
plt.figure()
sns.set_style("whitegrid")
ax = sns.FacetGrid(df_final, hue="review_score", height=5,aspect=2.0)
ax = ax.map(sns.distplot, "price").add_legend();
plt.title('Distribution of product price per class')
plt.show()

###### Observation
The distribution plot above shows the distribution of price for both the positive and negative classes. The overlap of the two distributions suggests that it is not possible to classify the points based on the price feature alone.

In [None]:
# plotting distributions of freight_value per class
plt.figure()
sns.set_style("whitegrid")
ax = sns.FacetGrid(df_final, hue="review_score", height=5,aspect=2.0)
ax = ax.map(sns.distplot, "freight_value").add_legend();
plt.title('Distribution of freight_value per class')
plt.show()

##### Observation
From the plot above, titled `Distribution of freight_value per class`, we observe that freight value is somewhat normally distributed, but it too overlaps for both classes and hence does not provide much information for the classification.
Let us look at some more distributions and see if we get something important from any of them.

In [None]:
# plotting distributions of product_height_cm per class
sns.set_style("whitegrid")
ax = sns.FacetGrid(df_final, hue="review_score", height=5,aspect=2.0)
ax = ax.map(sns.distplot, "product_height_cm").add_legend();
plt.title('Distribution of product_height_cm per class')
plt.show()

In [None]:
# distribution plot of product_weight_g
plt.figure()
sns.set_style("whitegrid")
ax = sns.FacetGrid(df_final, hue="review_score", height=5,aspect=2.0)
ax = ax.map(sns.distplot, "product_weight_g").add_legend();
plt.title('Distribution of product_weight_g per class')
plt.show()

In [None]:
# distribution plot of payment_value
plt.figure()
sns.set_style("whitegrid")
ax = sns.FacetGrid(df_final, hue="review_score", height=5,aspect=2.0)
ax = ax.map(sns.distplot, "payment_value").add_legend();
plt.title('Distribution of payment_value per class')
plt.show()

##### Observation
From all the univariate plots above, we observe that almost all of them have overlapping distributions for the two class labels. We can infer that feature values lying in any range of their distribution have an almost equal chance of getting a positive or a negative review. If that is the case, how can we train a model to separate the positive and negative points when the feature distributions alone do not differentiate them?
So, let us do some bivariate analysis and see whether using more than one feature at a time helps to separate the classes.

#### 3.3.2 Bivariate Analysis

In [None]:
# Distribution of price vs freight_value per class
plt.figure(figsize=(8,5))
sns.set_style("whitegrid")
ax = sns.scatterplot(x='price',y='freight_value', data = df_final, hue="review_score")
plt.title('Distribution of price vs freight_value per class')
plt.show()

In [None]:
# Distribution of price vs product_weight_g per class
plt.figure(figsize=(8,5))
sns.set_style("whitegrid")
ax = sns.scatterplot(x='price',y='product_weight_g', data = df_final, hue="review_score")
plt.title('Distribution of price vs product_weight_g per class')
plt.show()

##### Observation
In the two scatter plots above, `Distribution of price vs freight_value per class` and `Distribution of price vs product_weight_g per class`, we tried to observe how price is distributed against freight_value and product_weight_g for the two class labels. The points of both classes are mixed together, which suggests that algorithms like KNN might not be good at classifying these points. Therefore, we will look at the distributions of a few more features against each other and see if we find something more important.

In [None]:
# https://seaborn.pydata.org/generated/seaborn.pairplot.html
# pair plot
sns.set(style="ticks", color_codes=True)
g = sns.pairplot(df_final[['price','freight_value','product_photos_qty','product_weight_g','product_length_cm',
                           'product_height_cm','product_width_cm', 'review_score']],hue='review_score')
# g.savefig("pairplot1.png")

##### Observation
In the pair plot above, we see the distribution of each feature against the rest for each class label. Here we can tell some of the blue points apart from the orange points, while in the univariate analysis we observed overlap in almost all cases. The distribution of freight value against the other features, like price, product length etc., shows that we might be able to separate the positive and negative classes using some non-linear techniques, whereas the univariate analysis showed that it is nearly impossible to classify them with straightforward conditions or in a linear way. The results in the pair plot are better than in the univariate analysis, but not promising enough to reach a firm conclusion. So, in the next step, we can engineer some new features and see if we can find any relations to differentiate between the positive and negative classes.

#### 3.3.3 Analysis of Categorical Variable

In [None]:
# count plot of payment type
# https://stackoverflow.com/questions/34615854/seaborn-countplot-with-normalized-y-axis-per-group
plt.figure(figsize=(8,5))
sns.set_style("whitegrid")
ax = sns.countplot(x="review_score", hue="payment_type", data=df_final)
for p in ax.patches:
    ax.annotate('{:.1f}%'.format(100*p.get_height()/len(df_final)),(p.get_x()+0.05, p.get_height()+5))
ax.set_title('Review Score w.r.t payment method')
plt.show()

##### Observation
The plot above shows the distribution of the categorical variable payment type w.r.t. the review score. The percentages annotated on the bars are shares of all data points, since each bar height is divided by the total number of rows. From the plot, we observe that around 55% of the data points are positive reviews paid by credit card, and around 18% are negative reviews paid by credit card. In second position we have boleto, a bank payment slip widely used in Brazil.
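
The difference between a share of all data points and a share within one review class can be made explicit with `pd.crosstab`. This is only a sketch on hypothetical rows; the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical rows; column names mirror the dataset
toy = pd.DataFrame({
    'review_score': [1, 1, 1, 1, 0, 0, 0, 1],
    'payment_type': ['credit_card', 'credit_card', 'boleto', 'credit_card',
                     'credit_card', 'boleto', 'credit_card', 'voucher'],
})

# normalize='all': each cell is a share of the whole data set (as in the annotated plot)
share_of_all = pd.crosstab(toy['review_score'], toy['payment_type'], normalize='all')

# normalize='index': each row sums to 1, i.e. shares within one review class
share_within_class = pd.crosstab(toy['review_score'], toy['payment_type'], normalize='index')

print(share_of_all)
print(share_within_class)
```

Comparing payment methods between the two classes is easier with the within-class shares, because they are not affected by the class imbalance.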

In [None]:
# count plot of order fulfillment
plt.figure(figsize=(8,5))
sns.set_style("whitegrid")
ax = sns.countplot(x="review_score", hue="order_status", data=df_final)
ax.set_title('Review Score by order fulfillment')
plt.show()

##### Observation
The plot above is a very simple plot which shows the distribution of the review score per order status. From the plot, we can observe that, of all the orders which got a positive review, 99% have been successfully delivered.

In [None]:
# Top 10 shopping states
plt.figure(figsize=(8,5))
sns.set_style("whitegrid")
ax = df_final.customer_state.value_counts().sort_values()[-10:].plot(kind='bar')
ax.set_title("Top ten consumer states of Brazil")
ax.set_xlabel("States")
plt.xticks(rotation=35)
ax.set_ylabel("Frequency")
plt.show()

In [None]:
# top 10 products categories from which products have been sold
plt.figure(figsize=(8,5))
sns.set_style("whitegrid")
ax = df_final.product_category_name_english.value_counts().sort_values()[-10:].plot(kind='bar')
ax.set_title("Top ten product categories sold")
ax.set_xlabel("Product category")
plt.xticks(rotation=35)
ax.set_ylabel("Frequency")
plt.show()

##### Observation
In the two plots above, titled `Top ten consumer states of Brazil` and `Top ten product categories sold` respectively, we looked at the ten Brazilian states which shopped the most online and the ten product categories from which the most products were sold. In the first plot, we observe that around 45% of the consumers who shopped online are from the state **SP**, while the second-ranked state accounts for only around 15% of the purchases in the data.

From the second plot, we observe that most of the products sold are from the category bed_bath_table. The top two product categories, bed_bath_table and health_beauty, constitute around 20% of the sales of the site.

### 3.4 Feature Engineering
Let us create some features and analyse them.
1. **Sellers Count**:- Number of sellers for each product as a feature.
2. **Products count**:- Number of products ordered in each order as a feature.
3. **Estimated Delivery Time (in number of days)**:- The days between order approval and the estimated delivery date. A customer might be unsatisfied if the estimated time is long.
4. **Actual Delivery Time**:- The days between order approval and the date the order was delivered to the customer. A customer might be more satisfied if the product arrives faster.
5. **Difference in delivery days**:- The estimated delivery time minus the actual delivery time. If positive, the order was delivered early; if negative, it was delivered late. A customer might be more satisfied if the order arrives sooner than expected, or unhappy if it arrives after the deadline.
6. **On-time delivery**:- Binary variable indicating whether the order was delivered before the estimated date (1) or not (0).
7. **Average Product Value**:- Price divided by the number of products in the order. Cheaper products might have lower quality, leaving customers unhappy.
8. **Total Order Value**:- Price plus freight value. If a customer spends more, they might expect a better order fulfillment.
9. **Order Freight Ratio**:- Freight value divided by price. If a customer pays more for freight, they might expect a better service.
10. **Purchase Day of Week**:- Day of the week on which the purchase was made.
11. **is_reviewed**:- Whether a review comment was given or not.

In [None]:
# Finding number of sellers for each product as a feature
# Note: count() counts order-item rows per product, not distinct sellers;
# nunique() would count distinct sellers instead.
product_seller_count = (order_product_item_dataset.groupby('product_id')['seller_id']
                        .count().reset_index(name='sellers_count'))
product_seller_count.head()
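
Since the cell above uses `count()`, it may help to see how that differs from counting distinct sellers. A minimal sketch on hypothetical order items:

```python
import pandas as pd

# Hypothetical order items: product 'p1' is sold three times by two sellers
items = pd.DataFrame({
    'product_id': ['p1', 'p1', 'p1', 'p2'],
    'seller_id': ['s1', 's1', 's2', 's3'],
})

# count() returns the number of non-null rows per product
rows_per_product = items.groupby('product_id')['seller_id'].count()

# nunique() returns the number of distinct sellers per product
distinct_sellers = items.groupby('product_id')['seller_id'].nunique()

print(rows_per_product)
print(distinct_sellers)
```

With `count()`, the feature behaves more like "number of times the product was sold" than "number of sellers offering it". Which of the two is the better predictor is an open question here.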

In [None]:
# Finding number of products ordered in each order as a feature
order_items_count = (order_product_item_dataset.groupby('order_id')['product_id']
                     .count().reset_index(name='products_count'))
order_items_count.head()

In [None]:
# Adding the seller count and products count feature to the final data set
df_final = pd.merge(df_final,product_seller_count,on='product_id')
df_final = pd.merge(df_final,order_items_count,on='order_id')

In [None]:
# converting date to datetime and extracting dates from the datetime columns in the data set
datetime_cols = ['order_purchase_timestamp', 'order_approved_at', 'order_delivered_customer_date', 'order_estimated_delivery_date']

for col in datetime_cols:
    df_final[col] = pd.to_datetime(df_final[col])

In [None]:
# https://www.kaggle.com/andresionek/predicting-customer-satisfaction
# calculating estimated delivery time
df_final['estimated_delivery_time'] = (df_final['order_estimated_delivery_date'] - df_final['order_approved_at']).dt.days

# calculating actual delivery time
df_final['actual_delivery_time'] = (df_final['order_delivered_customer_date'] - df_final['order_approved_at']).dt.days

# calculating diff_in_delivery_time
df_final['diff_in_delivery_time'] = df_final['estimated_delivery_time'] - df_final['actual_delivery_time']

# flagging on-time delivery (1 = delivered before the estimated date, 0 = otherwise)
df_final['on_time_delivery'] = df_final['order_delivered_customer_date'] < df_final['order_estimated_delivery_date']
df_final['on_time_delivery'] = df_final['on_time_delivery'].astype('int')

# calculating mean product value
df_final['avg_product_value'] = df_final['price']/df_final['products_count']

# finding total order cost
df_final['total_order_cost'] = df_final['price'] + df_final['freight_value']

# calculating order freight ratio
df_final['order_freight_ratio'] = df_final['freight_value']/df_final['price']

# finding the day of week on which order was made
df_final['purchase_dayofweek'] = pd.to_datetime(df_final['order_purchase_timestamp']).dt.dayofweek

# adding is_reviewed where 1 is if review comment is given otherwise 0.
df_final['is_reviewed'] = (df_final['review_comment_message'] != 'no_review').astype('int')
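
To make the sign convention of the delivery features concrete, here is a small sketch with two hypothetical orders, one delivered early and one delivered late. The column names mirror the dataset.

```python
import pandas as pd

# Hypothetical orders; column names mirror the dataset
toy = pd.DataFrame({
    'order_approved_at': ['2018-01-01', '2018-01-01'],
    'order_delivered_customer_date': ['2018-01-08', '2018-01-20'],
    'order_estimated_delivery_date': ['2018-01-15', '2018-01-15'],
})
for col in toy.columns:
    toy[col] = pd.to_datetime(toy[col])

toy['estimated_delivery_time'] = (toy['order_estimated_delivery_date'] - toy['order_approved_at']).dt.days
toy['actual_delivery_time'] = (toy['order_delivered_customer_date'] - toy['order_approved_at']).dt.days

# estimated minus actual: positive means early, negative means late
toy['diff_in_delivery_time'] = toy['estimated_delivery_time'] - toy['actual_delivery_time']
toy['on_time_delivery'] = (toy['order_delivered_customer_date']
                           < toy['order_estimated_delivery_date']).astype(int)

print(toy[['estimated_delivery_time', 'actual_delivery_time',
           'diff_in_delivery_time', 'on_time_delivery']])
```

The first order arrives 7 days ahead of the estimate and is flagged as on time; the second arrives 5 days after it and is flagged as late.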

#### 3.4.1 Dropping date columns and id columns like seller_id, order_id etc.

In [None]:
df_final.drop(columns=['order_id', 'order_item_id', 'product_id', 'seller_id','shipping_limit_date','customer_id',
                       'order_purchase_timestamp', 'order_approved_at', 'order_delivered_customer_date', 'customer_state',
                       'order_estimated_delivery_date','customer_unique_id', 'customer_city','customer_zip_code_prefix'],
              axis=1,inplace=True)

In [None]:
# Final data set after feature creation and removing of irrelevant features
# df_final.to_csv('olist_final.csv',index=False)
df_final.head()

In [None]:
df_final.info()

#### 3.4.2 Analysis of the engineered features
#### 3.4.2.1 Understanding statistics of the engineered features

In [None]:
# describe the data set
df_final[['sellers_count', 'products_count','estimated_delivery_time', 'actual_delivery_time','diff_in_delivery_time',
          'avg_product_value','total_order_cost', 'order_freight_ratio']].describe()

##### Observation
Let us observe the statistics of the features we created.
1. The sellers count per product has a minimum of 1 and a maximum of 527.
2. The number of products ordered in a single order has a maximum of 21 and a minimum of 1.
3. The maximum estimated delivery time is 153 days, with a mean of 23 days.
4. The maximum actual delivery time is 208 days, with an average delivery time of 12 days.
5. The average total order cost is 140 Brazilian real, with a minimum of 6 real.

#### 3.4.3 Analysis

In [None]:
# distribution plot of actual delivery time
plt.figure()
sns.set_style("whitegrid")
ax = sns.FacetGrid(df_final, hue="review_score", height=5,aspect=2.0)
ax = ax.map(sns.distplot, "actual_delivery_time").add_legend();
plt.title('Distribution of actual_delivery_time per class')
plt.show()

In [None]:
# distribution plot of order freight ratio
plt.figure()
sns.set_style("whitegrid")
ax = sns.FacetGrid(df_final, hue="review_score", height=5,aspect=2.0)
ax = ax.map(sns.distplot, "order_freight_ratio").add_legend();
plt.title('Distribution of order_freight_ratio per class')
plt.show()

In [None]:
# distribution review by on time delivery
plt.figure(figsize=(8,5))
sns.set_style("whitegrid")
ax = sns.countplot(x="review_score", hue="on_time_delivery", data=df_final)
ax.set_title('Review Score by timely delivery of orders')
plt.show()

In [None]:
# Distribution of estimated_delivery_time vs actual_delivery_time per class
plt.figure(figsize=(8,5))
sns.set_style("whitegrid")
ax = sns.scatterplot(x='estimated_delivery_time',y='actual_delivery_time', data = df_final, hue="review_score")
plt.title('Distribution of estimated_delivery_time vs actual_delivery_time per class')
plt.show()

In [None]:
# Distribution of sellers_count vs actual_delivery_time per class
plt.figure(figsize=(8,5))
sns.set_style("whitegrid")
ax = sns.scatterplot(x='sellers_count',y='actual_delivery_time', data = df_final, hue="review_score")
plt.title('Distribution of sellers_count vs actual_delivery_time per class')
plt.show()

###### Observation
From the univariate analysis, we have seen that the distribution of actual delivery time partially overlaps for the two class labels, and we cannot derive any definite rule to classify them based on the actual delivery time alone. So, we plotted the scatter plots above to see if we can derive a relation using more than one feature. From these plots, we observe that we can separate the blue points from the orange points with a straight line, with some errors. Thus, we can say that some linear relation could be derived to classify them. Let us verify this observation with the pair plot below.

In [None]:
# pair plot
feat = ['estimated_delivery_time','actual_delivery_time', 'diff_in_delivery_time','avg_product_value',
           'total_order_cost', 'order_freight_ratio','products_count','sellers_count','review_score']
sns.set(style="ticks", color_codes=True)
pp = sns.pairplot(df_final[feat],hue='review_score')
# pp.savefig("pairplot2.png")

###### Observation
The pair plot above is the bivariate analysis of 8 newly engineered features, like sellers count, products count, delivery time in days etc. From the plot, we observe that the actual delivery time vs sellers count panel separates the positive and negative classes more visibly than the others. We can say that the number of sellers available and the actual delivery time affect the review score, either positively or negatively, which we can examine in detail in the correlation analysis of the features.

## 4. Observation on EDA and FE
Let us gather what we have observed and learned so far from the data.
1. The class label is not balanced.
2. It is not possible to differentiate the classes based on any single feature.
3. Numerical features like price and freight value have skewed distributions, suggesting the presence of very high values.
4. The freight_value vs product photo qty plot shows a good result for separating the class labels.
5. The most used payment method is credit card.
6. We also found that around 45% of the consumers belong to a single state, and the most shopped product categories are bed_bath_table and health_beauty.
7. As seen in the analysis, 11 different features were created from existing features in the data, such as products count, sellers count, total order cost, actual delivery time etc., and previous features like delivery date, product_id etc. were dropped.
8. We learnt that the feature actual delivery time provides partial differentiation between the positive and negative classes.
9. The scatter plots in the first part of the analysis do not show any significant results in separating the positive and negative points, while the scatter plots in the second part, after feature engineering, show a visible difference between the positive and negative class points, as evident from the plot `Distribution of sellers_count vs actual_delivery_time per class`.
10. The pair plot shows that these features can separate the positive class from the negative class with some non-linear transformation.
11. As evident from the scatter plots and pair plots, linear algorithms or KNN might not be a good choice for classifying these points. But at least we learnt that, with this set of features, we can classify them into two classes.

## 5. Data Preparation
### 5.1 Getting Numerical and categorical features

In [None]:
# selecting features
# numerical features
num_feat = ['price', 'freight_value', 'product_name_lenght','product_description_lenght', 'product_photos_qty',
           'product_weight_g','product_length_cm', 'product_height_cm', 'product_width_cm','sellers_count',
           'products_count', 'payment_sequential','payment_installments', 'payment_value','on_time_delivery',
           'estimated_delivery_time','actual_delivery_time', 'diff_in_delivery_time','avg_product_value', 'purchase_dayofweek',
           'total_order_cost', 'order_freight_ratio','is_reviewed']

# categorical features
cat_feat = ['review_comment_message','product_category_name_english','order_status', 'payment_type']

In [None]:
from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy='median')
si.fit(df_final[num_feat])
df_final[num_feat] = si.transform(df_final[num_feat])
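
As a rough illustration of what median imputation does to a single column, the sketch below fills missing entries with the median of the observed values. The helper name `impute_median` and the sample numbers are made up for this example; it is a plain-Python approximation of the idea, not the scikit-learn implementation.

```python
import math
from statistics import median

def impute_median(column):
    """Replace None/NaN entries with the median of the observed values."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in column if not is_missing(v)]
    fill = median(observed)  # computed from the non-missing values only
    return [fill if is_missing(v) else v for v in column]

# hypothetical freight values with two gaps
freight = [10.0, None, 30.0, float("nan"), 20.0]
print(impute_median(freight))
```

The median is less sensitive to extreme values than the mean, which fits the skewed distributions noted earlier for price and freight value.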

In [None]:
# checking values of categorical features
print("order Status: ",df_final.order_status.value_counts())
print("----------------------------------------------------------------------------")
print("Payment type: ",df_final.payment_type.value_counts())

We see that the on_time_delivery and payment_type columns contain only two and three distinct values respectively, so we will use label encoding for them rather than one-hot encoding. Let us look at the review comments feature.
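
To make the difference between the two encodings concrete, here is a minimal plain-Python sketch. The function names `label_encode` and `one_hot_encode` and the sample values are assumptions made for this illustration; they are not part of the notebook.

```python
def label_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Give each distinct category its own 0/1 column (columns sorted by name)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

# hypothetical payment types
payments = ["credit_card", "boleto", "credit_card", "voucher"]
codes, mapping = label_encode(payments)
rows, categories = one_hot_encode(payments)
print(codes, mapping)
print(rows, categories)
```

Label encoding keeps a single column but imposes an arbitrary order on the categories. Tree-based models usually tolerate this, while linear models may read the integers as magnitudes.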

In [None]:
df_final['review_comment_message'][:10]

In [None]:
# https://www.aclweb.org/anthology/W17-6615
import re
from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer
from tqdm import tqdm

def process_data(texts):
    """Normalise Portuguese review texts and return them as a list of strings."""
    processed_text = []

    portuguese_stopwords = set(stopwords.words('portuguese')) # Portuguese stopwords (a set keeps lookups fast)
    stemmer = RSLPStemmer() # Portuguese stemmer (only needed by the commented-out stemming step)

    links = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+' # hyperlinks
    dates = r'([0-2][0-9]|(3)[0-1])(\/|\.)(((0)[0-9])|((1)[0-2]))(\/|\.)\d{2,4}' # dates such as 31/12/2018
    currency = r'[R]{0,1}\$[ ]{0,}\d+(,|\.)\d+' # amounts such as R$ 10,50

    for text in tqdm(texts):
        text = re.sub(r'[\n\r]', ' ', text) # replace line breaks with spaces
        text = re.sub(links, ' URL ', text) # replace hyperlinks with a placeholder
        text = re.sub(dates, ' ', text) # remove dates
        text = re.sub(currency, ' dinheiro ', text) # replace currency amounts with a placeholder
        text = re.sub(r'[0-9]+', ' numero ', text) # replace remaining digits with a placeholder
        text = re.sub(r'([nN][ãÃaA][oO]|[ñÑ]| [nN] )', ' negação ', text) # map negation words to a single token
        text = re.sub(r'\W', ' ', text) # replace punctuation and other non-word characters with spaces
        text = re.sub(r'\s+', ' ', text) # collapse repeated whitespace
        text = re.sub(r'[ \t]+$', '', text) # strip trailing whitespace
        text = ' '.join(e for e in text.split() if e.lower() not in portuguese_stopwords) # remove stopwords
#         text = ' '.join(stemmer.stem(e.lower()) for e in text.split()) # stemming the words
        processed_text.append(text.lower().strip())

    return processed_text
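
The effect of the placeholder substitutions can be checked on a single sentence. The sketch below reuses the date and currency patterns from `process_data` but leaves out the NLTK stopword step so that it runs with the standard library only; the helper name `replace_placeholders` and the sample sentence are made up for this illustration.

```python
import re

# same patterns as in process_data, written as raw strings
DATES = r'([0-2][0-9]|(3)[0-1])(\/|\.)(((0)[0-9])|((1)[0-2]))(\/|\.)\d{2,4}'
CURRENCY = r'[R]{0,1}\$[ ]{0,}\d+(,|\.)\d+'

def replace_placeholders(text):
    """Apply the date, currency and digit substitutions, then tidy whitespace."""
    text = re.sub(DATES, ' ', text)              # drop dates first
    text = re.sub(CURRENCY, ' dinheiro ', text)  # then currency amounts
    text = re.sub(r'[0-9]+', ' numero ', text)   # then any digits that are left
    return re.sub(r'\s+', ' ', text).strip().lower()

sample = "Comprei em 12/03/2018 por R$ 59,90 e chegaram 2 itens"
print(replace_placeholders(sample))
```

The order of the substitutions matters: if the generic digit pattern ran first, the date and currency patterns would no longer find anything to match.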

In [None]:
processed_text = process_data(df_final['review_comment_message'])

In [None]:
df_final['review_comment_message'] = processed_text
# nao_reveja = no_review in Portuguese
df_final['review_comment_message'] = df_final['review_comment_message'].replace({'no_review':'nao_reveja'})
# df_final.to_csv('olist_final.csv',index=False)

In [None]:
df_final['review_comment_message'].iloc[:10]

In [None]:
# Encoding categorical variable
df_final['payment_type'] = df_final['payment_type'].replace({'credit_card':1,'boleto':2,'voucher':3,'debit_card':4})

### 5.2 Splitting data into test and train

In [None]:
# separating the target variable
y = df_final['review_score']
X = df_final.drop(labels='review_score',axis=1)

# train test 80:20 split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=25)
print("Train data: ",X_train.shape,y_train.shape)
print("Test data: ",X_test.shape,y_test.shape)
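
Because the class label is imbalanced, the split uses `stratify=y`. The plain-Python sketch below shows the idea behind a stratified split under simplified assumptions: the helper `stratified_split` and the 80/20 toy labels are invented for this example and do not reproduce scikit-learn's exact shuffling.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=25):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # take the same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# hypothetical imbalanced labels: 80 positive and 20 negative reviews
labels = [1] * 80 + [0] * 20
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Without stratification, a random 20% sample could by chance contain noticeably more or fewer negative reviews than the training part, which would distort the test scores.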

### 5.3 Encoding categorical features
#### 5.3.1 Encoding categorical feature order_status

In [None]:
# encoding feature order status
vect = CountVectorizer()
vect.fit(X_train['order_status'])
training_os = vect.transform(X_train['order_status'])
test_os = vect.transform(X_test['order_status'])


print("training order status: ",training_os.shape)
print("test order status: ",test_os.shape)

#### 5.3.2 Encoding categorical feature product category

In [None]:
# encoding product category
cv = CountVectorizer()
cv.fit(X_train['product_category_name_english'])
training_pc = cv.transform(X_train['product_category_name_english'])
test_pc = cv.transform(X_test['product_category_name_english'])

print("training product category: ",training_pc.shape)
print("test product category: ",test_pc.shape)
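
The sketch below imitates, in plain Python, what a count vectorizer does with single-token category names. It assumes the default token pattern of `CountVectorizer` (tokens of two or more word characters, where the underscore counts as a word character); the helpers `fit_vocabulary` and `transform` and the sample categories are hypothetical.

```python
import re

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def fit_vocabulary(train_values):
    """Collect the tokens seen in the training values and give each a column index."""
    tokens = {t for v in train_values for t in TOKEN.findall(v.lower())}
    return {tok: i for i, tok in enumerate(sorted(tokens))}

def transform(values, vocabulary):
    """Count vocabulary tokens per value; tokens unseen during fitting are ignored."""
    rows = []
    for v in values:
        row = [0] * len(vocabulary)
        for t in TOKEN.findall(v.lower()):
            if t in vocabulary:
                row[vocabulary[t]] += 1
        rows.append(row)
    return rows

train = ["health_beauty", "bed_bath_table", "health_beauty"]
test = ["bed_bath_table", "toys"]
vocab = fit_vocabulary(train)
print(vocab)
print(transform(test, vocab))
```

Since each category name is a single token, the result behaves like a one-hot encoding. A category that appears only in the test data becomes an all-zero row, which is why the vocabulary is fitted on the training data alone.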

#### 5.3.3 Encoding categorical feature review_comment_message

In [None]:
# Word2vec encoding
# https://radimrehurek.com/gensim/models/word2vec.html
from gensim.models import Word2Vec

# tokenise every review on single spaces
texts = [x.split(' ') for x in df_final['review_comment_message']]

# train 300-dimensional vectors, keeping every word (min_count=1), then save and reload the model
w2vmodel = Word2Vec(texts, vector_size=300, window=5, min_count=1, workers=4)
w2vmodel.save("word2vec.model")

w2vmodel = Word2Vec.load("word2vec.model")

# continue training for 10 more epochs on the same corpus
w2vmodel.train(texts, total_examples=len(texts), epochs=10)

w2vmodel.wv['nao_reveja'].shape  # shape of the numpy vector of a word

In [None]:
from sklearn.manifold import TSNE

# Retrieve the list of words from the Word2Vec model's vocabulary
words = list(w2vmodel.wv.index_to_key)
# Retrieve vectors for each word
vectors = [w2vmodel.wv[word] for word in words]

# Initialize TSNE
tsne = TSNE(n_components=2, random_state=0)

# Limit the number of words and vectors for visualization
limited_words = words[:100]
limited_vectors = np.array(vectors[:100])  # Convert list to numpy array

# Apply TSNE
Y = tsne.fit_transform(limited_vectors)

# Plot the results using matplotlib
plt.scatter(Y[:, 0], Y[:, 1])
# loop names chosen so that the target variable y is not overwritten
for label, x_pos, y_pos in zip(limited_words, Y[:, 0], Y[:, 1]):
    plt.annotate(label, xy=(x_pos, y_pos), xytext=(0, 0), textcoords="offset points")

plt.show()

In [None]:
# http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc#
# Reading glove vectors in python: https://stackoverflow.com/a/38230349/4084039
def loadGloveModel(gloveFile):
    """Load GloVe vectors from a text file into a {word: vector} dictionary."""
    print ("Loading Glove Model")
    model = {}
    # the with-block closes the file once all lines are read
    with open(gloveFile,'r', encoding="utf8") as f:
        lines = f.readlines()[1:] # skip the header line
    for line in tqdm(lines):
        splitLine = line.split(' ')
        word = splitLine[0]
        embedding = np.asarray(splitLine[1:], "float32")#np.array([float(0) if val=='0,0' else float(val) for val in splitLine[1:]])
        model[word] = embedding
    print ("Done.",len(model)," words loaded!")
    return model

embeddings = loadGloveModel('/content/drive/MyDrive/Predict Customer Satisfaction /Trang /glove_s300.txt')

In [None]:
def tfidfWord2Vector(text,glove_words,tfidf_words,tf_values):
    # TF-IDF weighted Word2Vec
    # compute the TF-IDF weighted average of the word vectors of each review.
    tfidf_w2v_vectors = []; # the weighted vector of each sentence/review is stored in this list
    for sentence in tqdm(text): # for each review/sentence
        vector = np.zeros(300) # accumulator with the same length as the word vectors (300)
        tf_idf_weight =0; # sum of the tf-idf weights of the words with a valid vector
        words = sentence.split()
        for word in words: # for each word in a review/sentence
            if (word in glove_words) and (word in tfidf_words):
                vec = w2vmodel.wv[word] # embeddings[word]
                # multiply the idf value (tf_values[word]) by the term frequency (words.count(word)/len(words));
                # counting in the token list avoids also counting substrings of longer words
                tf_idf = tf_values[word]*(words.count(word)/len(words)) # getting the tfidf value for each word
                vector += (vec * tf_idf) # calculating tfidf weighted w2v
                tf_idf_weight += tf_idf
        if tf_idf_weight != 0:
            vector /= tf_idf_weight
        tfidf_w2v_vectors.append(vector)
    tfidf_w2v_vectors = np.asarray(tfidf_w2v_vectors)

    return tfidf_w2v_vectors
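
Read as a formula, the loop in `tfidfWord2Vector` appears to compute the following weighted average. Here $d$ is a review, the sums run over the token positions of $d$ whose word $w$ has both a word vector $\mathbf{e}_w$ and an idf value, and $\mathrm{tf}(w,d)$ is intended to be the number of occurrences of $w$ in $d$ divided by the number of tokens in $d$. The symbols are notation introduced for this note only.

$$
\mathbf{v}(d) = \frac{\sum_{w \in d} \mathrm{tf}(w,d)\,\mathrm{idf}(w)\,\mathbf{e}_w}{\sum_{w \in d} \mathrm{tf}(w,d)\,\mathrm{idf}(w)}
$$

When no word of the review qualifies, the denominator is zero and the function keeps the all-zero vector instead of dividing.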

In [None]:
# encoding review comment message using Tfidf weighted W2V
tfidf = TfidfVectorizer()
tfidf.fit(X_train['review_comment_message'])
# pickle.dump(tfidf,open('tfidf_review_comments.pkl','wb'))

# build a dictionary with the word as key and its idf as value
tf_values = dict(zip(tfidf.get_feature_names_out(), list(tfidf.idf_)))
tfidf_words = set(tfidf.get_feature_names_out())
glove_words = set(w2vmodel.wv.index_to_key) # set(embeddings.keys()); a set keeps the membership test fast

tfidf_w2v_vectors_train = tfidfWord2Vector(X_train['review_comment_message'].values,glove_words,tfidf_words,tf_values)
tfidf_w2v_vectors_test = tfidfWord2Vector(X_test['review_comment_message'].values,glove_words,tfidf_words,tf_values)

In [None]:
tfidf_w2v_vectors_train.shape

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
vocab = list()
for x in df_final['review_comment_message']:
    vocab.extend(x.split())
vocab = set(vocab)
word_index = {word:i+1 for i,word in enumerate(vocab)}
# pickle.dump(word_index,open('word_index.pkl','wb'))
vocab_size = len(word_index)+1
# integer encode the documents
X_train_encoded_text = []
for x in X_train['review_comment_message']:
    X_train_encoded_text.append([word_index[w] for w in x.split()])

X_test_encoded_text = []
for x in X_test['review_comment_message']: # loop variable renamed so that the target y is not overwritten
    X_test_encoded_text.append([word_index[w] for w in x.split()])


# pad documents to a max length of 122 words, the 95th percentile of the review lengths
max_length = 122
X_train_padded_text = pad_sequences(X_train_encoded_text, maxlen=max_length, padding='post')
X_test_padded_text = pad_sequences(X_test_encoded_text, maxlen=max_length, padding='post')


print(X_train_padded_text.shape,X_test_padded_text.shape)
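
The padding step can be pictured with a small plain-Python function. It assumes the documented Keras defaults for `pad_sequences`, where `truncating='pre'` removes tokens from the start of sequences that are too long; the helper name `pad_post` and the toy sequences are made up for this sketch.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; overlong sequences keep their last `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # truncating='pre': drop from the front
        padded.append(seq + [value] * (maxlen - len(seq)))  # padding='post': fill at the end
    return padded

encoded = [[4, 7], [1, 2, 3, 4, 5, 6]]
print(pad_post(encoded, maxlen=4))
```

With `padding='post'` alone, reviews longer than `max_length` therefore lose their first words rather than their last ones. Passing `truncating='post'` would keep the beginning instead, if that is preferred.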

In [None]:
# creating embedding matrix
embedding_matrix = np.zeros((vocab_size, 300))
for word,i in word_index.items():
    # wv[word] raises KeyError for an unknown word instead of returning None, so test membership first
    if word in w2vmodel.wv:
        embedding_matrix[i] = w2vmodel.wv[word]

print(embedding_matrix.shape)
# pickle.dump(embedding_matrix,open('embedding_matrix.pkl','wb'))
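
The relation between the word index and the embedding matrix can be sketched with plain lists. The helper names and the toy two-dimensional vectors below are assumptions made for this illustration only.

```python
def build_word_index(texts):
    """Assign ids starting at 1 so that id 0 stays free for padding."""
    vocab = sorted({w for t in texts for w in t.split()})
    return {w: i + 1 for i, w in enumerate(vocab)}

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with id i; missing words keep zeros."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = list(vectors[word])
    return matrix

texts = ["produto bom", "produto ruim"]
vectors = {"produto": [0.1, 0.2], "bom": [0.3, 0.4]}  # toy vectors; "ruim" has none
word_index = build_word_index(texts)
matrix = build_embedding_matrix(word_index, vectors, dim=2)
print(word_index)
print(matrix)
```

Sorting the vocabulary makes the ids reproducible. Iterating over a Python `set` of strings can give a different order in each interpreter session, so an index built that way should be saved together with the matrix.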

### 5.4 Encoding numerical features

In [None]:
normalizer = Normalizer()

X_train[num_feat] = normalizer.fit_transform(X_train[num_feat])
X_test[num_feat] = normalizer.transform(X_test[num_feat])
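
It may help to recall that scikit-learn's `Normalizer` rescales each row (one order) to unit length, rather than scaling each column. The plain-Python sketch below mimics the default L2 behaviour; the function name `l2_normalize_rows` and the two sample rows are invented for this example.

```python
import math

def l2_normalize_rows(rows):
    """Scale each row to unit Euclidean length; all-zero rows are left unchanged."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row] if norm else list(row))
    return out

# hypothetical rows of (price, freight_value): the second order is ten times larger
rows = [[3.0, 4.0], [30.0, 40.0]]
print(l2_normalize_rows(rows))
```

Both rows end up as the same unit vector, so the overall magnitude of an order is removed. If that magnitude matters for the prediction, a column-wise scaler such as `StandardScaler` or `MinMaxScaler` may be worth comparing.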

In [None]:
# dropping categorical features

X_train = X_train.drop(labels=['review_comment_message','product_category_name_english','order_status'],axis=1)
X_test = X_test.drop(labels=['review_comment_message','product_category_name_english','order_status'],axis=1)

print(X_train.shape,X_test.shape)

### 5.5 Merging all the features

In [None]:
# merging our encoded categorical features with rest of the data
X_train_merge = hstack((X_train, training_pc, training_os, tfidf_w2v_vectors_train))
X_test_merge = hstack((X_test, test_pc, test_os, tfidf_w2v_vectors_test))

print("Train shape:",X_train_merge.shape)
print("Test shape:",X_test_merge.shape)

In [None]:
# merging the encoded categorical features with the rest of the data, without the review text vectors
X_train_other = hstack((X_train, training_pc, training_os))
X_test_other = hstack((X_test, test_pc, test_os))

print("Train shape:",X_train_other.shape)
print("Test shape:",X_test_other.shape)

## 6. Model Selection
### 6.1 Logistic Regression

In [None]:
def confusion_matrices_plot(y_real, y_pred, y_test,y_test_pred):
    # representing the train and test confusion matrices as heatmaps
    # https://seaborn.pydata.org/generated/seaborn.heatmap.html
    cmap=sns.light_palette("brown")
    C1 = confusion_matrix(y_real,y_pred)
    C2 = confusion_matrix(y_test,y_test_pred)

    fig,ax = plt.subplots(1, 2, figsize=(15,5))
    # fmt="d" prints the counts as integers
    ax1 = sns.heatmap(C1, annot=True, cmap=cmap, fmt="d", ax = ax[0])
    ax1.set_xlabel('Predicted Class')
    ax1.set_ylabel('Original Class')
    ax1.set_title("Train Confusion matrix")

    ax2 = sns.heatmap(C2, annot=True, cmap=cmap, fmt="d", ax = ax[1])
    ax2.set_xlabel('Predicted Class')
    ax2.set_ylabel('Original Class')
    ax2.set_title("Test Confusion matrix")

    plt.show()
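
For reading the heatmaps, the layout of a binary confusion matrix can be reproduced by hand. The sketch below assumes labels 0 and 1 and the scikit-learn convention of true classes in rows and predicted classes in columns; the function name `confusion_counts` and the toy labels are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0/1 (rows: true, columns: predicted)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
print(confusion_counts(y_true, y_pred))
```

In this layout the top-right cell counts false positives (negative reviews predicted as positive) and the bottom-left cell counts false negatives.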

In [None]:
# Training Logistic regression model and checking f1 score metric
alpha = [0.001,0.01,0.1,1,10,100,1000]
train_scores = [] # store train scores
test_scores = [] # store test scores

for i in alpha:
    # loss='log_loss' is the logistic loss in scikit-learn >= 1.1 (it was named 'log' in older versions)
    lr = SGDClassifier(loss='log_loss', penalty='l2', alpha=i, n_jobs=-1, random_state=25)
    lr.fit(X_train_merge,y_train)
    train_sc = f1_score(y_train,lr.predict(X_train_merge))
    test_sc = f1_score(y_test,lr.predict(X_test_merge))
    test_scores.append(test_sc)
    train_scores.append(train_sc)
    print('Alpha = ',i,'Train Score',train_sc,'test Score',test_sc)

# plotting the scores vs the log of the parameter
plt.plot(np.log(alpha),train_scores,label='Train Score')
plt.plot(np.log(alpha),test_scores,label='Test Score')
plt.xlabel('log(Alpha)')
plt.ylabel('Score')
plt.title('Alpha vs Score')
plt.legend()
plt.show()

In [None]:
# Parameter tuning of Logistic regression using RandomizedSearchCV
sgd = SGDClassifier(loss='log_loss', n_jobs=-1, random_state=25) # named 'log' in scikit-learn < 1.1

prams={ 'alpha': [0.001,0.01,0.1,1,10,100,1000] }

random_cfl1 = RandomizedSearchCV(sgd,param_distributions=prams,verbose=10,scoring='f1',n_jobs=-1,random_state=25,
                               return_train_score=True)
random_cfl1.fit(X_train_merge,y_train)

print('mean test scores',random_cfl1.cv_results_['mean_test_score'])
print('mean train scores',random_cfl1.cv_results_['mean_train_score'])

In [None]:
# printing best parameters and score
print("Best Parameters: ",random_cfl1.best_params_)
print("Best Score: ",random_cfl1.best_score_)

In [None]:
# Fitting LogisticRegression model on best parameters
sgd = SGDClassifier(loss='log_loss', alpha=0.001, n_jobs=-1, random_state=25) # named 'log' in scikit-learn < 1.1
sgd.fit(X_train_merge,y_train)
os.makedirs('models', exist_ok=True) # create the output folder if it does not exist yet

with open('models/logistic.pkl', 'wb') as file:
    pickle.dump(sgd, file)

y_train_pred = sgd.predict(X_train_merge)
y_test_pred = sgd.predict(X_test_merge)

# printing train and test scores
print('Train f1 score: ',f1_score(y_train,y_train_pred))
print('Test f1 score: ',f1_score(y_test,y_test_pred))

In [None]:
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, roc_auc_score

def print_evaluation_scores(y_true, y_pred, y_proba, set_name="Set"):
    accuracy = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_proba)

    print(f"{set_name} Accuracy: {accuracy:.4f}")
    print(f"{set_name} F1 Score: {f1:.4f}")
    print(f"{set_name} Recall: {recall:.4f}")
    print(f"{set_name} Precision: {precision:.4f}")
    print(f"{set_name} AUC: {auc:.4f}\n")
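
As a worked example of how precision, recall and F1 relate, the sketch below computes them from raw counts in plain Python. The function name `precision_recall_f1` and the toy labels are assumptions for illustration; the notebook itself relies on the scikit-learn metric functions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# toy case: always predicting the majority (positive) class on an 80/20 split
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10
print(precision_recall_f1(y_true, y_pred))
```

On imbalanced data, a classifier that always predicts the positive class can already reach a high positive-class F1 (8/9 in this toy case), so it is worth comparing model scores against such a baseline and also looking at the negative class.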

In [None]:
y_test_proba = sgd.predict_proba(X_test_merge)[:, 1]
print_evaluation_scores(y_test, y_test_pred, y_test_proba, "Test")

In [None]:
confusion_matrices_plot(y_train,y_train_pred,y_test,y_test_pred)

### 6.2 Linear SVM

In [None]:
# Training linear SVM model and checking f1 score metric
alpha = [0.001,0.01,0.1,1,10,100,1000]
train_scores = [] # store train scores
test_scores = [] # store test scores

for i in alpha:
    # loss='hinge' makes SGDClassifier a linear SVM
    svm = SGDClassifier(loss='hinge', penalty='l2', alpha=i, n_jobs=-1, random_state=25)
    svm.fit(X_train_merge,y_train)
    train_sc = f1_score(y_train,svm.predict(X_train_merge))
    test_sc = f1_score(y_test,svm.predict(X_test_merge))
    test_scores.append(test_sc)
    train_scores.append(train_sc)
    print('Alpha = ',i,'Train Score',train_sc,'test Score',test_sc)

# plotting the scores vs the log of the parameter
plt.plot(np.log(alpha),train_scores,label='Train Score')
plt.plot(np.log(alpha),test_scores,label='Test Score')
plt.xlabel('log(Alpha)')
plt.ylabel('Score')
plt.title('Alpha vs Score')
plt.legend()
plt.show()

In [None]:
# Parameter tuning of linear SVM using RandomizedSearchCV
svm_clf = SGDClassifier(loss='hinge', n_jobs=-1, random_state=25) # hinge loss gives a linear SVM

prams={ 'alpha': [0.001,0.01,0.1,1,10,100,1000] }

random_cfl1 = RandomizedSearchCV(svm_clf,param_distributions=prams,verbose=10,scoring='f1',n_jobs=-1,random_state=25,
                               return_train_score=True)
random_cfl1.fit(X_train_merge,y_train)

print('mean test scores',random_cfl1.cv_results_['mean_test_score'])
print('mean train scores',random_cfl1.cv_results_['mean_train_score'])

In [None]:
# printing best parameters and score
print("Best Parameters: ",random_cfl1.best_params_)
print("Best Score: ",random_cfl1.best_score_)

In [None]:
# Fitting linear SVM model on best parameters
svm_clf = SGDClassifier(loss='hinge',alpha=0.001, n_jobs=-1, random_state=25)
svm_clf.fit(X_train_merge,y_train)
pickle.dump(svm_clf,open('models/svm.pkl','wb'))

y_train_pred = svm_clf.predict(X_train_merge)
y_test_pred = svm_clf.predict(X_test_merge)

# printing train and test scores
print('Train f1 score: ',f1_score(y_train,y_train_pred))
print('Test f1 score: ',f1_score(y_test,y_test_pred))

In [None]:
# a hinge-loss model has no predict_proba, so the decision_function scores are used for the AUC
y_test_proba = svm_clf.decision_function(X_test_merge)
print_evaluation_scores(y_test, y_test_pred, y_test_proba, "Test")

In [None]:
confusion_matrices_plot(y_train,y_train_pred,y_test,y_test_pred)

### 6.3 Decision Tree

In [None]:
# Checking the variation of score with depth parameters of Decision Tree
depth = [3,10,50,100,250,500]
train_scores = []
test_scores = []
for i in depth:
    clf = DecisionTreeClassifier(max_depth=i,random_state=25)
    clf.fit(X_train_merge,y_train)
    train_sc = f1_score(y_train,clf.predict(X_train_merge))
    test_sc = f1_score(y_test,clf.predict(X_test_merge))
    test_scores.append(test_sc)
    train_scores.append(train_sc)
    print('Depth = ',i,'Train Score',train_sc,'test Score',test_sc)

# plotting the score vs depth
plt.plot(depth,train_scores,label='Train Score')
plt.plot(depth,test_scores,label='Test Score')
plt.xlabel('Depth')
plt.ylabel('Score')
plt.title('Depth vs Score')
plt.show()

In [None]:
# Parameter tuning of DecisionTreeClassifier using RandomisedSearch CV technique
# https://medium.com/@mohtedibf/indepth-parameter-tuning-for-decision-tree-6753118a03c3
dt = DecisionTreeClassifier(random_state=25)

params = { "max_depth": sp_randint(3,500), "min_samples_split": sp_randint(50,200), "min_samples_leaf": sp_randint(2,50)}

random_cfl1 = RandomizedSearchCV(dt, param_distributions=params,verbose=10,scoring='f1',n_jobs=-1,random_state=25,
                               return_train_score=True)
random_cfl1.fit(X_train_merge,y_train)

print('mean test scores',random_cfl1.cv_results_['mean_test_score'])
print('mean train scores',random_cfl1.cv_results_['mean_train_score'])

In [None]:
# printing best parameters and scores
print("Best Parameters: ",random_cfl1.best_params_)
print("Best Score: ",random_cfl1.best_score_)

In [None]:
# Fitting the model on best parameters
dt = DecisionTreeClassifier(max_depth = 320, min_samples_leaf = 25, min_samples_split = 186,random_state=25)
dt.fit(X_train_merge,y_train)
pickle.dump(dt,open('models/decision_tree.pkl','wb'))

y_train_pred = dt.predict(X_train_merge)
y_test_pred = dt.predict(X_test_merge)

# printing train test score
print('Train f1 score',f1_score(y_train,y_train_pred))
print('Test f1 score',f1_score(y_test,y_test_pred))

In [None]:
y_test_proba = dt.predict_proba(X_test_merge)[:, 1] # probabilities from the decision tree, not the SGD model
print_evaluation_scores(y_test, y_test_pred, y_test_proba, "Test")

In [None]:
confusion_matrices_plot(y_train,y_train_pred,y_test,y_test_pred)

In [None]:
# checking some top features
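# the names follow the column order of X_train_merge: numerical columns, product category tokens,
# order status tokens, then the 300 review-vector dimensions (w2v_0 ... w2v_299 are placeholder names)
features = (list(X_train.columns.values) + list(cv.get_feature_names_out()) + list(vect.get_feature_names_out())
            + ['w2v_%d' % i for i in range(300)])
importances = dt.feature_importances_ # importance extracted from the model
indices = np.argsort(importances)[-20:] # top 20 features by importance value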

# plotting top 20 features
plt.figure(figsize=(10,5))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
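
The ranking behind the bar chart can be written without NumPy as a short helper. The names `top_k_features`, the feature names and the importance values below are hypothetical; the length check reflects the requirement that the name list has one entry per column of the merged feature matrix.

```python
def top_k_features(names, importances, k=20):
    """Return the k (name, importance) pairs with the largest importance, largest first."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# made-up names and importance values
names = ["price", "freight_value", "actual_delivery_time", "w2v_0"]
importances = [0.10, 0.05, 0.60, 0.25]
print(top_k_features(names, importances, k=2))
```

If the name list is shorter or longer than the importance array, the labels on the chart silently shift, so checking the lengths before plotting is a cheap safeguard.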

###### Observation
After not getting promising results with linear models, we moved to non-linear, tree-based models and used a DecisionTreeClassifier as the classification model. First, we trained the model for various values of tree depth to find the range of the parameter to use while fine-tuning with randomised search cross validation. Taking the results of the previous step as the baseline, the score obtained by this model during training is 0.88, which is better than the previous model.

Also, looking at the confusion matrices, we see some improvement in false positives but some loss in false negatives, suggesting that the model is learning differently from the linear models. Keeping these results in mind, let us try ensemble techniques to further verify our hypothesis and try to achieve improved results.

We have also plotted a feature importance chart to look at some of the top features.

### 6.4 Random Forest

In [None]:
# Variation of score with estimators used in Random forest with other parameters set to constant value
estimators = [1,2,5,10,50,100,250,500]
train_scores = []
test_scores = []
for i in estimators:
    # max_features='sqrt' is what 'auto' meant for classifiers; min_impurity_split was removed in scikit-learn 1.0
    clf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0,
            min_samples_leaf=52, min_samples_split=120,
            min_weight_fraction_leaf=0.0, n_estimators=i, n_jobs=-1,random_state=25,verbose=0,warm_start=False)
    clf.fit(X_train_merge,y_train)
    train_sc = f1_score(y_train,clf.predict(X_train_merge))
    test_sc = f1_score(y_test,clf.predict(X_test_merge))
    test_scores.append(test_sc)
    train_scores.append(train_sc)
    print('Estimators = ',i,'Train Score',train_sc,'test Score',test_sc)
plt.plot(estimators,train_scores,label='Train Score')
plt.plot(estimators,test_scores,label='Test Score')
plt.xlabel('Estimators')
plt.ylabel('Score')
plt.title('Estimators vs score at depth of 5')
plt.legend()
plt.show()

In [None]:
# Parameter tuning of Random forest classifier using Randomised search CV
param_dist = {"n_estimators":sp_randint(1,500),
              "max_depth": sp_randint(3,20),
              "min_samples_split": sp_randint(50,200),
              "min_samples_leaf": sp_randint(2,50)}

clf = RandomForestClassifier(random_state=25,n_jobs=-1)

random_cfl1 = RandomizedSearchCV(clf,param_distributions=param_dist,scoring='f1',verbose=10,n_jobs=-1,random_state=25,
                               return_train_score=True)
random_cfl1.fit(X_train_merge,y_train)

print('mean test scores',random_cfl1.cv_results_['mean_test_score'])
print('mean train scores',random_cfl1.cv_results_['mean_train_score'])

In [None]:
# printing best parameters and score
print("Best Parameters: ",random_cfl1.best_params_)
print("Best Score: ",random_cfl1.best_score_)

In [None]:
# Fitting the model on best parameters
rf = RandomForestClassifier(max_depth = 19, min_samples_leaf = 40, min_samples_split = 166, n_estimators = 131,random_state=25,
                           n_jobs=-1)
rf.fit(X_train_merge,y_train)
pickle.dump(rf,open('models/random_forest.pkl','wb'))

y_train_pred = rf.predict(X_train_merge)
y_test_pred = rf.predict(X_test_merge)

# printing train and test scores
print('Train f1 score',f1_score(y_train,y_train_pred))
print('Test f1 score',f1_score(y_test,y_test_pred))

In [None]:
y_test_proba = rf.predict_proba(X_test_merge)[:, 1] # probabilities from the random forest, not the SGD model
print_evaluation_scores(y_test, y_test_pred, y_test_proba, "Test")

In [None]:
confusion_matrices_plot(y_train,y_train_pred,y_test,y_test_pred)

In [None]:
# plotting top 25 features
# features = list(X_train.columns.values) + cv.get_feature_names() + vect.get_feature_names() # features list
importances = rf.feature_importances_ # importance generated by model
indices = np.argsort(importances)[-25:]

plt.figure(figsize=(10,5))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

### 6.5 LightGBM

In [None]:
# Variation of score with estimators used in LGBM with other parameters set to default value
estimators = [1,3,5,10,50,100,250,500,1000]
train_scores = []
test_scores = []
for i in estimators:
    clf = LGBMClassifier(n_estimators=i, n_jobs=-1,random_state=25)
    clf.fit(X_train_merge,y_train)
    train_sc = f1_score(y_train,clf.predict(X_train_merge))
    test_sc = f1_score(y_test,clf.predict(X_test_merge))
    test_scores.append(test_sc)
    train_scores.append(train_sc)
    print('Estimators = ',i,'Train Score',train_sc,'test Score',test_sc)
plt.plot(estimators,train_scores,label='Train Score')
plt.plot(estimators,test_scores,label='Test Score')
plt.xlabel('Estimators')
plt.ylabel('Score')
plt.title('Estimators vs score with default parameters')
plt.legend()
plt.show()

In [None]:
# https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
# Parameter tuning of the LGBM parameters using RandomizedSearchCV
x_cfl=LGBMClassifier(random_state=25,n_jobs=-1)

prams={
    'learning_rate':[0.001,0.01,0.03,0.05,0.1,0.15,0.2],
     'n_estimators':[1,3,5,10,50,100,250,500,1000],
     'max_depth':[3,5,10,15,20,50],
    'colsample_bytree':[0.1,0.3,0.5,1],
    'subsample':[0.1,0.3,0.5,1]
}
random_cfl1=RandomizedSearchCV(x_cfl,param_distributions=prams,verbose=10,n_jobs=-1,random_state=25,scoring='f1',
                               return_train_score=True)
random_cfl1.fit(X_train_merge,y_train)

print('mean test scores',random_cfl1.cv_results_['mean_test_score'])
print('mean train scores',random_cfl1.cv_results_['mean_train_score'])
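
To get a feeling for how much of the search space a randomised search covers, the sketch below counts the combinations in the grid above and draws ten random candidates, ten being the default `n_iter` of `RandomizedSearchCV`. The helper `sample_candidates` is a simplified stand-in written for this note: it samples with replacement, whereas scikit-learn samples list-only grids without replacement.

```python
import math
import random

param_grid = {
    "learning_rate": [0.001, 0.01, 0.03, 0.05, 0.1, 0.15, 0.2],
    "n_estimators": [1, 3, 5, 10, 50, 100, 250, 500, 1000],
    "max_depth": [3, 5, 10, 15, 20, 50],
    "colsample_bytree": [0.1, 0.3, 0.5, 1],
    "subsample": [0.1, 0.3, 0.5, 1],
}

def sample_candidates(grid, n_iter, seed=25):
    """Draw n_iter random parameter combinations from the grid."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in grid.items()}
            for _ in range(n_iter)]

total = math.prod(len(values) for values in param_grid.values())
candidates = sample_candidates(param_grid, n_iter=10)
print("combinations in the grid:", total)
print("candidates tried:", len(candidates))
```

The grid holds 7 x 9 x 6 x 4 x 4 = 6048 combinations, so ten candidates cover well under one percent of it. Raising `n_iter` trades run time for a better chance of finding a strong setting.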

In [None]:
# printing best parameters and score
print("Best Parameters: ",random_cfl1.best_params_)
print("Best Score: ",random_cfl1.best_score_)

In [None]:
# Fitting the model on best parameters
lgbm = LGBMClassifier(n_estimators=1000, max_depth=5,subsample=0.5,learning_rate=0.05,colsample_bytree=1,random_state=25,
                      n_jobs=-1)
lgbm.fit(X_train_merge,y_train)
pickle.dump(lgbm,open('models/lgbm.pkl','wb'))

y_train_pred = lgbm.predict(X_train_merge)
y_test_pred = lgbm.predict(X_test_merge)

# printing train and test scores
print('Train f1 score',f1_score(y_train,y_train_pred))
print('Test f1 score',f1_score(y_test,y_test_pred))

In [None]:
y_test_proba = lgbm.predict_proba(X_test_merge)[:, 1] # probabilities from the LightGBM model, not the SGD model
print_evaluation_scores(y_test, y_test_pred, y_test_proba, "Test")

In [None]:
confusion_matrices_plot(y_train,y_train_pred,y_test,y_test_pred)

In [None]:
# observing top 25 features
importances = lgbm.feature_importances_
indices = np.argsort(importances)[-25:]

plt.figure(figsize=(10,5))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

### 6.6 XGBoost

In [None]:
# https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
# Parameter tuning of the XGBoost parameters using RandomizedSearchCV
x_cfl=XGBClassifier(random_state=25,n_jobs=-1)

prams={
    'learning_rate':[0.001,0.01,0.03,0.05,0.1,0.15,0.2],
     'n_estimators':[1,3,5,10,50,100,250,500,1000],
     'max_depth':[3,5,10,15,20,50],
    'colsample_bytree':[0.1,0.3,0.5,1],
    'subsample':[0.1,0.3,0.5,1]
}
random_cfl1=RandomizedSearchCV(x_cfl,param_distributions=prams,verbose=10,n_jobs=-1,random_state=25,scoring='f1',
                               return_train_score=True)
random_cfl1.fit(X_train_merge,y_train)

print('mean test scores',random_cfl1.cv_results_['mean_test_score'])
print('mean train scores',random_cfl1.cv_results_['mean_train_score'])

In [None]:
# printing best parameters and score
print("Best Parameters: ",random_cfl1.best_params_)
print("Best Score: ",random_cfl1.best_score_)

In [None]:
# Fitting the model on best parameters
xgb = XGBClassifier(n_estimators=50, max_depth=15,subsample=0.5,learning_rate=0.2,colsample_bytree=0.3,random_state=25,
                      n_jobs=-1)
xgb.fit(X_train_merge,y_train)
pickle.dump(xgb,open('models/xgb.pkl','wb'))

y_train_pred = xgb.predict(X_train_merge)
y_test_pred = xgb.predict(X_test_merge)

# printing train and test scores
print('Train f1 score',f1_score(y_train,y_train_pred))
print('Test f1 score',f1_score(y_test,y_test_pred))

In [None]:
y_test_proba = xgb.predict_proba(X_test_merge)[:, 1] # probabilities from the XGBoost model, not the SGD model
print_evaluation_scores(y_test, y_test_pred, y_test_proba, "Test")

In [None]:
confusion_matrices_plot(y_train,y_train_pred,y_test,y_test_pred)

### 6.7 AdaBoost

In [None]:
# https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
# Parameter tuning of the AdaBoost parameters using RandomizedSearchCV
x_cfl=AdaBoostClassifier(random_state=25)

prams={
    'learning_rate':[0.001,0.01,0.03,0.05,0.1,0.15,0.2],
     'n_estimators':[1,3,5,10,50,100,250,500,1000]
}
random_cfl1=RandomizedSearchCV(x_cfl,param_distributions=prams,verbose=10,n_jobs=-1,random_state=25,scoring='f1',
                               return_train_score=True)
random_cfl1.fit(X_train_merge,y_train)

print('mean test scores',random_cfl1.cv_results_['mean_test_score'])
print('mean train scores',random_cfl1.cv_results_['mean_train_score'])

In [None]:
# printing best parameters and score
print("Best Parameters: ",random_cfl1.best_params_)
print("Best Score: ",random_cfl1.best_score_)

In [None]:
# Fitting the model on best parameters
ada = AdaBoostClassifier(n_estimators=500, learning_rate=0.05, random_state=25)
ada.fit(X_train_merge,y_train)
pickle.dump(ada,open('models/ada_boost.pkl','wb'))

y_train_pred = ada.predict(X_train_merge)
y_test_pred = ada.predict(X_test_merge)

# printing train and test scores
print('Train f1 score',f1_score(y_train,y_train_pred))
print('Test f1 score',f1_score(y_test,y_test_pred))

In [None]:
y_test_proba = ada.predict_proba(X_test_merge)[:, 1] # probabilities from the AdaBoost model, not the SGD model
print_evaluation_scores(y_test, y_test_pred, y_test_proba, "Test")

In [None]:
confusion_matrices_plot(y_train,y_train_pred,y_test,y_test_pred)