# **PREDICTING SUCCESS RATE OF ADVERTISEMENTS - TERM PROJECT**

## **Table Of Contents**

1. [Dataset and Problem Statement](#section1)<br>
2. [Importing libraries](#section2)<br>
3. [Reading Data](#section3)<br>
4. [Pre-profiling of Data](#section4)<br>
5. [Profiling using Sweetviz package](#section5)<br>
6. [Analysis using Autoviz](#section6)<br>
7. [Statistical Analysis of Data](#section7)<br>
8. [Exploratory Analysis of Data](#section8)<br>
  - 8.1 [Advertisements counts based on relationship status](#section801)<br>
  - 8.2 [Advertisements counts for each industry](#section802)<br>
  - 8.3 [Advertisements counts for each genre](#section803)<br>
  - 8.4 [Advertisements counts gender-wise](#section804)<br>
  - 8.5 [Advertisements depending on time of air](#section805)<br>
  - 8.6 [Advertisements based on price class](#section806)<br>
  - 8.7 [Advertisements based on money back guarantee](#section807)<br>
  - 8.8 [Advertisements based on location](#section808)<br>
	- 8.8.1 [Based on genre and runtime](#section80801)<br>
	- 8.8.2 [Based on industry and runtime](#section80802)<br>
	- 8.8.3 [Based on genre and ratings](#section80803)<br>
	- 8.8.4 [Based on industry and ratings](#section80804)<br>
  - 8.9 [Advertisements count based on runtime and genre](#section809)<br>
  - 8.10 [Advertisements count based on runtime and industry](#section810)<br>
  - 8.11 [Advertisements count based on ratings and genre](#section811)<br>
  - 8.12 [Advertisements count based on ratings and industry](#section812)<br>
  - 8.13 [Ratings distribution for each genre](#section813)<br>
  - 8.14 [Ratings distribution for each industry](#section814)<br>
  - 8.15 [Runtime distribution for each genre](#section815)<br>
  - 8.16 [Runtime distribution for each industry](#section816)<br>
  - 8.17 [Relationship between ratings and runtime](#section817)<br>
  - 8.18 [Analysis of columns with respect to target column - netgain](#section818)<br>
	- 8.18.1 [Profitable records analysis](#section8181)<br>
	- 8.18.2 [Loss records analysis](#section8182)<br>
9. [Checking data distribution of output variable](#section9)<br>
10. [Feature Engineering and Data Transformation](#section10)<br>
	- 10.1 [Combining feature values](#section101)<br>
	- 10.2 [Label Encoding and One Hot Encoding](#section102)<br>
	- 10.3 [Creating bins for the runtime values](#section103)<br>
	- 10.4 [Checking for outliers](#section104)<br>
		- 10.4.1 [Using Boxplot](#section10401)<br>
		- 10.4.2 [Analyzing using IQR](#section10402)<br>
		- 10.4.3 [Analyzing approaches on data imbalance issue](#section10403)<br>
		- 10.4.4 [LDA and PCA Transformation](#section10404)<br>
11. [Feature Selection](#section11)<br>
  - 11.1 [Using Model](#section1101)<br>
  - 11.2 [Using SelectKBest](#section1102)<br>
12. [Pycaret - to analyze the best models on the dataset and features before actual modelling](#section12)<br>
13. [Checking the distribution of continuous columns to decide on normalization](#section13)<br>
14. [Machine learning - Analyzing baseline models](#section14)<br>
  - 14.1 [Modelling on basic dataset - adv_data with stratified - False while splitting the data for test/train](#section1401)<br>
  - 14.2 [Modelling on basic dataset - adv_data with stratified - True while splitting the data for test/train](#section1402)<br>
  - 14.3 [Modelling on adv_data_1 with stratified - False while splitting the data for test/train](#section1403)<br>
  - 14.4 [Modelling on adv_data_1 with stratified - True while splitting the data for test/train](#section1404)<br>
  - 14.5 [Modelling on adv_data_2 with stratified - False while splitting the data for test/train](#section1405)<br>
  - 14.6 [Modelling on adv_data_2 with stratified - True while splitting the data for test/train](#section1406)<br>
  - 14.7 [Summary on baseline model results](#section1407)<br>
15. [Machine learning - Analyzing tuned models (and Ensemble models)](#section15)<br>
  - 15.1 [Hyper-parameter tuning](#section1501)<br>
	- 15.1.1 [Hyper parameter tuning in Random Forest](#section15011)<br>
	- 15.1.2 [Hyper parameter tuning in Gradient Boosting Classifier](#section15012)<br>
	- 15.1.3 [Hyper parameter tuning for Xtreme Gradient Boosting Classifier](#section15013)<br>
	- 15.1.4 [Hyper parameter tuning for LightGBM Classifier](#section15014)<br>
  - 15.2 [Ensemble Techniques](#section1502)<br>
	- 15.2.1 [Voting Classifier](#section15021)<br>
	- 15.2.2 [Stacking](#section15022)<br>
16. [Deep Learning](#section16)<br>
  - 16.1 [Check the data distribution and if the data is linearly separable based on output class](#section1601)<br>
  - 16.2 [Function to Normalize the data](#section1602)<br>
  - 16.3 [Create simple Neural Network first and evaluation with all datasets generated above](#section1603)<br>
	- 16.3.1 [Training the model with 1000 EPOCHS](#section16031)<br>
	- 16.3.2 [Training the model with 500 EPOCHS](#section16032)<br>
	- 16.3.3 [Training the model with 2000 EPOCHS](#section16033)<br>
	- 16.3.4 [Summary of highest accuracies obtained for all 3 epoch values](#section16034)<br>
	- 16.3.5 [Analyzing validation loss for various data generated above](#section16035)<br>
  - 16.4 [Creating deep neural networks with hyper parameter optimization](#section1604)<br>
  - 16.5 [Experiment with selected configuration from hyper parameter tuning](#section1605)<br>
  - 16.6 [Batch Normalization and Weight initializer on models selected above along with different optimizers](#section1606)<br>
	- 16.6.1 [Using Batch Normalization and early stopping](#section16061)<br>
	- 16.6.2 [Applying Kernel initializers](#section16062)<br>
  - 16.7 [Analyzing with various Optimizers](#section1607)<br>
  - 16.8 [Selecting the final parameters](#section1608)<br>
  - 16.9 [Final Model](#section1609)<br>
17. [Comparison of Machine Learning Model results and Deep Learning results](#section17)<br>
18. [Conclusion](#section18)<br>




<a id=section1></a>
## **1. Dataset and Problem Statement**

This dataset contains information about various advertisements.

It is a collection of approximately 26,000 different instances of advertisements of different products aired in different countries.

**Input fields**

Column|Description
---|---
id|Unique id for each row
realtionship_status|The relationship status of the most responsive customers to the advertisement (the column name is spelled this way in the dataset)
industry|The industry to which the product belonged
genre|The type of advertisement
targeted_sex|Sex that was mainly targeted for the advertisement
average_runtime(minutes_per_week)|Minutes per week the advertisement was aired
airtime|Time when the advertisement was aired
airlocation|Country of origin
ratings|Metric out of 1 which represents how much of the targeted demographic watched the advertisement
expensive|A general measure of how expensive the product or service is that the ad is discussing
money_back_guarantee|Whether or not the product offers a refund in the case of customer dissatisfaction


**Target column**

Column|Description
---|---
netgain|Whether the ad will incur a gain or not when sold




**Problem Statement**

Using the above dataset, we need to classify whether an advertisement will be profitable or not.

To accomplish this, we will train both machine learning and deep learning models on the data and compare which performs better in this case.

<a id=section2></a>
## **2. Importing libraries**

In [None]:
!pip install --upgrade pandas
!pip install --upgrade numpy
!pip install --upgrade folium
!pip install --upgrade requests

In [None]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import RobustScaler

from sklearn.decomposition import PCA


from imblearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.svm import LinearSVC
from imblearn.pipeline import make_pipeline

from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import StackingClassifier

from sklearn import metrics


from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_classif

import pickle

In [None]:
!pip install catboost

In [None]:
from catboost import CatBoostClassifier, Pool

In [None]:
import plotly
from plotly.offline import iplot
import plotly.graph_objs as go

In [None]:
def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
  init_notebook_mode(connected=False)

In [None]:
!pip install pandas-profiling

In [None]:
from pandas_profiling import ProfileReport

In [None]:
!pip install pycaret

In [None]:
!pip install shap

In [None]:
!pip install sweetviz

In [None]:
import sweetviz as sv     

In [None]:
!pip install autoviz

In [None]:
!pip install xlrd

In [None]:
from autoviz.AutoViz_Class import AutoViz_Class

AV=AutoViz_Class()

In [None]:
!pip install git+https://github.com/tensorflow/docs

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import activations
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import BatchNormalization


import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

from keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.callbacks import EarlyStopping

<a id=section3></a>
## **3. Reading Data**

In [None]:
#adv_data=pd.read_csv('/content/advertisement_success.csv')
adv_data=pd.read_csv('advertisement_success.csv')
adv_data.head()

<a id=section4></a>
## **4. Pre-profiling of Data**

In [None]:
!pip install config-with-yaml

In [None]:
prof_adv=ProfileReport(adv_data)
prof_adv.to_file(output_file='Advertising_data_profiling.html')

 - No missing values.
 - realtionship_status is a categorical column with 7 distinct values; 'Married-civ-spouse' and 'Never-married' account for the majority of the records.
 - industry is a categorical column with 6 distinct values; 'Pharma' and 'Auto' are the most common.
 - genre is a categorical column with 5 distinct values; 'Comedy' holds the majority of the data, while the other values have very few records.
 - targeted_sex is a categorical column with 2 values, 'Male' and 'Female'; 'Male' has the majority of the records.
 - runtime is a continuous numerical column with an almost normal distribution; the 30-40 mins bin holds most of the records.
 - airtime is a categorical column and 'Primetime' is the most common value.
 - airlocation is a categorical column and 'United States' has the majority of the records.
 - ratings is a continuous numerical column; it is positively skewed with high kurtosis, i.e. sharply peaked with a long right tail.
 - expensive is a categorical column and 'Low' has the majority of the records.
 - money_back_guarantee is a boolean column (Yes/No) with an almost equal percentage of records for both values.
 - netgain is a boolean column (True/False) where False has the majority of the records, so there is a data imbalance issue that needs to be taken care of.
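
The imbalance noted for the target column can be quantified with a quick class-share check. The snippet below is a minimal sketch on a hypothetical toy column (the values are made up for illustration); in the notebook the same calls would be applied to `adv_data['netgain']`.

```python
import pandas as pd

# Hypothetical toy target column, used only for illustration.
toy = pd.DataFrame({"netgain": [True, False, False, False, True, False, False, False]})

# Absolute counts and relative shares of each class.
class_counts = toy["netgain"].value_counts()
class_share = toy["netgain"].value_counts(normalize=True)

# Ratio of the largest class to the smallest; 1.0 would mean perfectly balanced.
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_share)
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio well above 1 suggests that stratified splitting or resampling may be worth considering before modelling.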
 

<a id=section5></a>
## **5. Profiling using Sweetviz package**

In [None]:
advert_report = sv.analyze(adv_data)
advert_report.show_html('Advertising.html')

Sweetviz provided information on the correlation among fields:
 - realtionship_status has a high correlation with industry
 - targeted_sex has a high correlation with industry
 - netgain  - the output variable is not highly correlated with any field but slightly dependent on netgain

<a id=section6></a>
## **6. Analysis using Autoviz**

In [None]:
aftrain=AV.AutoViz('advertisement_success.csv')

1. Data imbalance issue in the output variable.
2. Most of the advertisements are for low cost products; high cost is a distant second with almost half the count of low cost.
3. The most common average runtime is 40 mins/week, followed by 50 mins and 60 mins, but their counts are almost negligible compared to 40 mins/week.
4. Comedy is the most common genre and dominates the dataset; it is followed by Infomercial, which has a very low count compared to Comedy.
5. Pharma is the most common industry for advertisements, followed by Auto, Political and Entertainment, with Political and Entertainment at almost half of Pharma.
6. Most of the advertisements are intended for Primetime, followed by Morning with half of the Primetime count; Daytime has a very low count.
7. Ratings below 0.2 are the most common.
8. For targeted gender, the Male category is almost double the Female category.
9. For relationship status, 'Married-civ-spouse' is the most common, followed by 'Never-married'.
10. Most of the records belong to the United States.
11. Money back guarantee has an almost equal distribution between the two categories.
12. Audiences in most relationship categories give an average rating below 0.4, but the 'Married-civ-spouse' category gives Pharma an average rating above 0.4.
13. The Drama genre has a higher average rating than the other genres: above 0.5, while the others are below 0.4.
14. Daytime advertisements have a better average rating than Primetime advertisements, even though Daytime has the lowest count in the airtime category and Primetime has the majority.
15. Asian countries like Japan, Thailand and India give better average ratings overall than European countries and the United States.
16. Average ratings from Male audiences are higher than from Female audiences, but considering how many more records target Male, the ratings from Female audiences seem better.
17. Grouped by netgain, profitable records have an average rating of around 0.07, while non-profitable records are around 0.03.
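
Observations such as the average rating per netgain class can be cross-checked numerically instead of being read off the AutoViz charts. The following is a rough sketch on a hypothetical toy frame (the ratings are invented and do not reproduce the figures above); the column names `ratings` and `netgain` follow the dataset.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "ratings": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "netgain": [False, False, False, True, True, True],
})

# Mean, median and record count of ratings for each netgain class.
rating_summary = toy.groupby("netgain")["ratings"].agg(["mean", "median", "count"])
print(rating_summary)
```

Reporting the count next to the mean helps when one class is much smaller than the other, since a mean over few records is less reliable.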

<a id=section7></a>
## **7. Statistical Analysis of Data**

A backup of the original data is taken, and the "id" field is removed from the data as it holds no significance in the analysis.

Also, the column name "average_runtime(minutes_per_week)" is shortened to "runtime".

In [None]:
adv_data_orig=adv_data.copy()
adv_data.drop('id', axis=1, inplace=True)
adv_data.rename(columns={"average_runtime(minutes_per_week)":"runtime"}, inplace=True)
adv_data.head()

In [None]:
adv_data.shape

The dataset has 26048 rows and 11 columns.

In [None]:
adv_data.info()

The dataset contains no null values.

Most of the columns are of type object and need to be converted to numerical values before the data is fed to the various ML/AI models.

In [None]:
adv_data.describe(include='all')

Above is the summary of all the columns present in the dataset, whether numeric or object type.

 - No null values
 - 2 numeric columns - runtime and ratings
 - Both columns seem to have outliers
 - runtime is 40 mins/week for around 50% of the data and the maximum is 99 mins/week, so very few advertisements run for 99 minutes while most are limited to 40 mins.
 - ratings is slightly positively skewed, with an outlier value of 1: the majority of the advertisements have ratings of around 0.027, but very few have a rating of 1.
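
The skewness and outlier remarks above can be backed by numbers. Below is a minimal sketch, assuming the common 1.5 x IQR rule for flagging outliers; the helper `iqr_outliers` and the toy runtime values are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical runtime values (mins/week), for illustration only.
toy = pd.Series([30, 35, 40, 40, 40, 40, 45, 50, 99], name="runtime")


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


outliers = iqr_outliers(toy)
skewness = toy.skew()  # > 0 indicates a longer right tail

print(outliers.tolist())
print(f"skewness: {skewness:.2f}")
```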

**Checking the unique values for categorical columns**

In [None]:
print(adv_data.realtionship_status.unique())

'Married-spouse-absent', 'Married-civ-spouse' and 'Married-AF-spouse' can be merged later into a single category - "Married"

'Divorced' and 'Separated' can be merged later into a single category - "Divorced"

'Separated', 'Never-married' and 'Widowed' can be merged later into a single category - "Single"
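
One way to express such a merge is a plain dictionary passed to `Series.map`. The sketch below is only an illustration: 'Separated' is listed under both "Divorced" and "Single" above, so the mapping assumes it goes to "Divorced"; the name `status_map` and the toy rows are hypothetical.

```python
import pandas as pd

# Assumption: 'Separated' is grouped with 'Divorced' (the text lists it under two groups).
status_map = {
    "Married-civ-spouse": "Married",
    "Married-spouse-absent": "Married",
    "Married-AF-spouse": "Married",
    "Divorced": "Divorced",
    "Separated": "Divorced",
    "Never-married": "Single",
    "Widowed": "Single",
}

# Hypothetical toy rows using the dataset's column name.
toy = pd.DataFrame({
    "realtionship_status": ["Married-AF-spouse", "Separated", "Widowed", "Never-married"]
})

# map() leaves NaN for any value missing from the dictionary,
# so an incomplete mapping is easy to detect.
toy["status_merged"] = toy["realtionship_status"].map(status_map)
unmapped = int(toy["status_merged"].isna().sum())

print(toy)
print(f"unmapped rows: {unmapped}")
```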

In [None]:
print(adv_data.industry.unique())

In [None]:
print(adv_data.genre.unique())

In [None]:
print(adv_data.targeted_sex.unique())

In [None]:
print(adv_data.airtime.unique())

In [None]:
print(adv_data.airlocation.unique())

In [None]:
print(adv_data.expensive.unique())

In [None]:
print(adv_data.money_back_guarantee.unique())

In [None]:
print(adv_data.netgain.unique())

All the above values will be label encoded to numeric values later

<a id=section8></a>
## **8. Exploratory Analysis of Data**

Understanding the relation in the data by plotting various graphs

In [None]:
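def plot_countplot(df, column, color, description):
  """Draw a count plot of `column` in `df` using a single bar color and `description` as the title."""
  plt.figure(figsize=(15,8))
  sns.countplot(x=column,color=color, data=df)
  plt.title(description)
  plt.show()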


In [None]:
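def plot_barplot(df, col1, col2, col3, loc, description, w=20, h=8):
  """Draw a bar plot of `col2` (y) against `col1` (x), split by `col3`; `loc` positions the legend."""
  plt.figure(figsize=(w,h))
  sns.barplot(x=col1, y=col2, hue=col3, data=df)
  plt.legend(loc=loc)
  plt.title(description)
  plt.show()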

<a id=section801></a>
### **8.1 Advertisements counts based on relationship status**

In [None]:
plot_countplot(adv_data, 'realtionship_status', 'red', 'Advertisements count based on relationship status')

'Married-civ-spouse' and 'Never-married' are the largest audience groups for the advertisements, followed by 'Divorced'.

<a id=section802></a>
### **8.2 Advertisements counts for each industry**

In [None]:
plot_countplot(adv_data, 'industry', 'orange', 'Advertisements count per industry')

The largest share of the advertisements belongs to the Pharma industry, followed by the Auto industry.

<a id=section803></a>
### **8.3 Advertisements counts for each genre**

In [None]:
plot_countplot(adv_data, 'genre', 'green', 'Advertisements count per genre')

The Comedy genre has the highest advertisement count. A possible reason is that comedy attracts viewers/listeners easily, so this genre is preferred.

<a id=section804></a>
### **8.4 Advertisements counts gender-wise**

In [None]:
plot_countplot(adv_data, 'targeted_sex', 'violet', 'Advertisements count based on gender')

Most of the advertisements are targeted at the Male gender, and about half as many are targeted at the Female gender.

<a id=section805></a>
### **8.5 Advertisements depending on time of air**

In [None]:
plot_countplot(adv_data, 'airtime', 'blue', 'Advertisements count based on time of air')

Most of the advertisements are run at prime time, followed by morning time.

Very few have a daytime slot, possibly because most people are out for work at that time, which means a smaller audience and hence fewer advertisements.

<a id=section806></a>
### **8.6 Advertisements based on price class**

In [None]:
plot_countplot(adv_data, 'expensive', 'yellow', 'Advertisements count based on price range')

Most of the advertisements are for products in the low price range, possibly because they target the larger part of the population, which could be middle class and would prefer low cost products.

It is followed by the High price range, which could be aimed at big spenders or cover products that are popular irrespective of their high price.

The Medium range has the lowest count.

<a id=section807></a>
### **8.7 Advertisements based on money back guarantee**

In [None]:
plot_countplot(adv_data, 'money_back_guarantee', 'pink', 'Advertisements count based on money back guarantee')

The counts are almost equal. Some big manufacturers may treat customer satisfaction as a high priority, while many low/medium cost manufacturers producing in bulk may not be able to offer a refund.

<a id=section808></a>
### **8.8 Advertisements based on location**

<a id=section80801></a>
#### **8.8.1 Based on genre and runtime**

In [None]:
df_location=pd.DataFrame(adv_data.groupby(['airlocation','genre','runtime'])['runtime'].count().nlargest(100))
df_location.rename(columns={"runtime":"count"}, inplace=True)
df_location.reset_index(inplace=True)
df_location=df_location.sort_values(by='runtime', ascending=False)
df_location.head()

In [None]:
plot_barplot(df_location, 'airlocation', 'runtime', 'genre', 'upper right', 'Advertisements count based on location and genre', 30)

Across countries, all genres generally run for about 40 mins/week, with the Direct genre marginally higher in the United States.
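
The group-count-rename pattern used in these cells can also be written as a single chain with `size()` and `reset_index(name=...)`. This is an equivalent sketch on hypothetical toy rows, not a change to the notebook's cells; the variable `top_groups` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the dataset.
toy = pd.DataFrame({
    "airlocation": ["United States", "United States", "United States", "India", "India", "Japan"],
    "genre": ["Comedy", "Comedy", "Drama", "Comedy", "Comedy", "Drama"],
    "runtime": [40, 40, 40, 40, 45, 40],
})

# Count rows per (airlocation, genre, runtime) combination, keep the 3 most
# frequent combinations, and store the counts in a column named 'count'.
top_groups = (
    toy.groupby(["airlocation", "genre", "runtime"])
       .size()
       .nlargest(3)
       .reset_index(name="count")
)
print(top_groups)
```

Passing `name="count"` to `reset_index` avoids the separate rename step, because the count column never shares a name with the `runtime` grouping key.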

<a id=section80802></a>
#### **8.8.2 Based on industry and runtime**

In [None]:
df_location=pd.DataFrame(adv_data.groupby(['airlocation','industry','runtime'])['runtime'].count().nlargest(100))
df_location.rename(columns={"runtime":"count"}, inplace=True)
df_location.reset_index(inplace=True)
df_location=df_location.sort_values(by='runtime', ascending=False)
df_location.head()

In [None]:
plot_barplot(df_location, 'airlocation', 'runtime', 'industry', 'upper right', 'Advertisements count based on location and industry', 30)

All industries across countries run for about 40 mins/week, but Political has a lower runtime than the others in the United States.

<a id=section80803></a>
#### **8.8.3 Based on genre and ratings**

In [None]:
df_location=pd.DataFrame(adv_data.groupby(['airlocation','genre','ratings'])['ratings'].count().nlargest(50))
df_location.rename(columns={"ratings":"count"}, inplace=True)
df_location.reset_index(inplace=True)
df_location=df_location.sort_values(by='ratings', ascending=False)
df_location.head()

In [None]:
plot_barplot(df_location, 'airlocation', 'ratings', 'genre', 'upper right', 'Advertisements count based on location and genre', 30)

All the advertisements have a rating below 0.05 on average, but Comedy has a somewhat better rating in the United States.

<a id=section80804></a>
#### **8.8.4 Based on industry and ratings**

In [None]:
df_location=pd.DataFrame(adv_data.groupby(['airlocation','industry','ratings'])['ratings'].count().nlargest(100))
df_location.rename(columns={"ratings":"count"}, inplace=True)
df_location.reset_index(inplace=True)
df_location=df_location.sort_values(by='ratings', ascending=False)
df_location.head()

In [None]:
plot_barplot(df_location, 'airlocation', 'ratings', 'industry', 'upper right', 'Advertisements count based on location and industry', 30)

In the United States, the Auto industry has the most records, followed by Pharma.

In other countries, no particular industry has a major share of the records.

<a id=section809></a>
### **8.9 Advertisements count based on runtime and genre**

In [None]:
df_genre_runtime=pd.DataFrame(adv_data.groupby(['runtime','genre'])['runtime'].count().nlargest(50))
df_genre_runtime.rename(columns={"runtime":"count"}, inplace=True)
df_genre_runtime.reset_index(inplace=True)
df_genre_runtime=df_genre_runtime.sort_values(by='runtime', ascending=False)
df_genre_runtime.head()

In [None]:
plot_barplot(df_genre_runtime, 'runtime', 'count', 'genre', 'upper left', 'Runtime by genres', 20)

 - Comedy advertisements most commonly run for 40 mins/week.
 - Infomercial advertisements are also mostly 40 mins/week.
 - Comedy advertisements also have a significant number of records at other durations such as 45, 50 and 60 mins.
 - The remaining genres have very few records at durations other than 40 mins/week.

So, it is assumed that 40 mins/week is the most preferred and cost effective duration for companies to showcase their advertisements.

Comedy has more distinct duration values than the other genres, possibly because comedy advertisements are more popular among people and more profitable for companies.



<a id=section810></a>
### **8.10 Advertisements count based on runtime and industry**

In [None]:
df_industry_runtime=pd.DataFrame(adv_data.groupby(['runtime','industry'])['runtime'].count().nlargest(100))
df_industry_runtime.rename(columns={"runtime":"count"}, inplace=True)
df_industry_runtime.reset_index(inplace=True)
df_industry_runtime=df_industry_runtime.sort_values(by='runtime', ascending=False)
df_industry_runtime.head()

In [None]:
plot_barplot(df_industry_runtime, 'runtime', 'count', 'industry', 'upper right', 'Runtime by industry', 20)

- Pharma has the most varied durations: the majority are at 40 mins/week, but other values also have a significant number of records.

- Auto is the second most common industry by number of records and also has various duration values.

- Political and Entertainment also have durations other than 40 mins/week, but the record counts at those durations are quite low.

- The remaining industries have most of their advertisements at 40 mins/week and negligible counts at other durations.

So Pharma and Auto have more records than the other industries and also have advertisements of different durations apart from the most common duration of 40 mins/week.

So, 40 mins/week is the most preferred duration, as observed in the previous graph as well.

<a id=section811></a>
### **8.11 Advertisements count based on ratings and genre**

In [None]:
df_genre_ratings=pd.DataFrame(adv_data.groupby(['ratings','genre'])['ratings'].count().nlargest(100))
df_genre_ratings.rename(columns={"ratings":"count"}, inplace=True)
df_genre_ratings.reset_index(inplace=True)
df_genre_ratings=df_genre_ratings.sort_values(by='ratings', ascending=False)
df_genre_ratings['ratings']=df_genre_ratings['ratings'].map('{:,.2f}'.format)
df_genre_ratings.head()

In [None]:
plot_barplot(df_genre_ratings, 'ratings', 'count', 'genre', 'upper right', 'Ratings by genres', 20)

- Since Comedy is the most common genre, it is also the most common on the ratings scale and is spread across various values.

- The most common rating for the Comedy and Infomercial genres is slightly below 0.03.

- For Drama, Infomercial and the other categories as well, it is around 0.03.

<a id=section812></a>
### **8.12 Advertisements count based on ratings and industry**

In [None]:
df_industry_ratings=pd.DataFrame(adv_data.groupby(['ratings','industry'])['ratings'].count().nlargest(100))
df_industry_ratings.rename(columns={"ratings":"count"}, inplace=True)
df_industry_ratings.reset_index(inplace=True)
df_industry_ratings=df_industry_ratings.sort_values(by='ratings', ascending=False)
df_industry_ratings['ratings']=df_industry_ratings['ratings'].map('{:,.2f}'.format)
df_industry_ratings.head()

In [None]:
plot_barplot(df_industry_ratings, 'ratings', 'count', 'industry', 'upper right', 'Ratings by industry', 20)

As observed before, Pharma has the highest count of advertisements; its most frequent rating is 0.17, followed by 0.10 and then 1.

Auto has the most records at a rating of 0.11, followed by 0.16.

<a id=section813></a>
### **8.13 Ratings distribution for each genre**

In [None]:
plt.figure(figsize=(8,8))
sns.stripplot(x='genre', y='ratings',  data=adv_data)
plt.title('Ratings distribution for each genre')
plt.show()

The ratings distribution is not uniform; no advertisement of any genre has a rating in the 0.5 to 0.9 range.

Comedy and Infomercial have some advertisements with ratings around 0.3 and 0.4, but the other genres have no ratings above 0.2 apart from those at 1.

Across all genres, the most common ratings are between 0.1 and 0.2.

<a id=section814></a>
### **8.14 Ratings distribution for each industry**

In [None]:
plt.figure(figsize=(8,8))
sns.stripplot(x='industry', y='ratings',  data=adv_data)
plt.title('Ratings distribution for each industry')
plt.show()

Similar to genre, ratings by industry follow the same distribution, with Political and Pharma having a few ratings around 0.4.

There are no ratings between 0.5 and 0.9.

Auto and Pharma account for most of the points in the graph and, like the previous graph, the most common ratings are around 0.1 to 0.2.

<a id=section815></a>
### **8.15 Runtime distribution for each genre**

In [None]:
plt.figure(figsize=(8,8))
sns.stripplot(x='genre', y='runtime',  data=adv_data)
plt.title('Runtime distribution for each genre')
plt.show()

Runtime is spread fairly evenly across all genres.

For Comedy, most of the advertisements are below 60 mins, and the data is more dispersed beyond that.

For the other genres as well, the count is lower above 60 mins.

Compared to the other genres, Comedy has the highest count at extreme values of more than 90 mins.

<a id=section816></a>
### **8.16 Runtime distribution for each industry**

In [None]:
plt.figure(figsize=(8,8))
sns.stripplot(x='industry', y='runtime',  data=adv_data)
plt.title('Runtime distribution for each industry')
plt.show()

Pharma has the most records, distributed across various runtime values with most of them below 60 mins.

Pharma is followed by Auto, Political and Entertainment in count, where most of the records are below 60 mins and a few are above.

Pharma has the highest count at extreme runtime values of more than 90 mins.

<a id=section817></a>
### **8.17 Relationship between ratings and runtime**

In [None]:
plt.figure(figsize=(8,8))
sns.scatterplot(x='ratings', y='runtime',  data=adv_data)
plt.title('Ratings based on runtime')
plt.show()

No relation is visible between ratings and runtime.
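
A scatter plot can be complemented by correlation coefficients to put a number on the (lack of) relationship. The sketch below uses `DataFrame.corr` on hypothetical toy values chosen so that both coefficients come out at zero; it illustrates the check and is not a result from the dataset.

```python
import pandas as pd

# Hypothetical toy values, constructed to be uncorrelated.
toy = pd.DataFrame({
    "ratings": [0.1, 0.2, 0.3, 0.4, 0.5],
    "runtime": [40, 60, 30, 60, 40],
})

# Pearson measures linear association; Spearman measures monotonic association.
pearson = toy.corr(method="pearson").loc["ratings", "runtime"]
spearman = toy.corr(method="spearman").loc["ratings", "runtime"]

print(f"pearson: {pearson:.3f}, spearman: {spearman:.3f}")
```

Values close to 0 for both coefficients would support the reading of the scatter plot; a non-linear relationship could still exist and would need a different check.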

**Summary**

- Most of the audience belongs to the 'Married-civ-spouse' or 'Never-married' category, while the other categories have nominal counts.
- Pharma is the most common industry, followed by Auto.
- Comedy is the most common genre and the other genres have negligible counts in comparison.
- Most of the advertisements belong to the low cost category.
- With respect to air location, Asian countries prefer Drama, European countries prefer Comedy, and the United States is into all genres.
- The most common duration among genres is 40 mins/week.
- The majority of the advertisements have a rating around or below 0.03. Comedy is the most common genre among them, with the other genres having negligible counts; in terms of industry, Pharma is the most common, followed by Auto and Political.

Comedy is by far the most common genre, with the other genres having very low counts in comparison. Among industries, Pharma and Auto are the most common, but the others have significant counts too.

So, most of the industries prefer creating advertisements in the comedy genre as it is more popular.



<a id=section818></a>
### **8.18 Analysis of columns with respect to target column - netgain**

<a id=section8181></a>
#### **8.18.1 Profitable records analysis**

##### **Filtering the profitable records**

In [None]:
profit_data=adv_data[adv_data.netgain==True]
profit_data.head()

In [None]:
profit_data.shape

##### **Advertisements count based on relationship status**

In [None]:
plot_countplot(profit_data, 'realtionship_status', 'red', 'Advertisements count based on relationship status')

Since 'Married-civ-spouse' has the maximum count in the overall dataset, it is expected to be the most common in the profit dataset as well.

However, the 'Never-married' count drops drastically from the overall dataset to the profit dataset, which means most of the advertisements for the Never-married category are not profitable for the industry.

##### **Advertisements count per industry**

In [None]:
plot_countplot(profit_data, 'industry', 'orange', 'Advertisements count per industry')

Pharma has the largest share of profitable advertisements, but the count drops drastically from above 10000 overall to around 5000 profitable.

Auto has a huge count in the overall dataset, but very few records are profitable: the overall count is more than 6000, while the profitable count is around 800.

All the industries have a low count of profitable advertisements.
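
Comparing overall counts with profitable counts by eye can be replaced by a per-category profit rate. The following is a minimal sketch on hypothetical toy rows (the counts are invented); since `netgain` is boolean, its mean within a group is the share of profitable records.

```python
import pandas as pd

# Hypothetical toy rows; the real analysis would use adv_data.
toy = pd.DataFrame({
    "industry": ["Pharma"] * 4 + ["Auto"] * 4,
    "netgain": [True, True, False, False, True, False, False, False],
})

# Per industry: total records, profitable records, and the profitable share.
profit_rate = (
    toy.groupby("industry")["netgain"]
       .agg(total="count", profitable="sum", rate="mean")
       .sort_values("rate", ascending=False)
)
print(profit_rate)
```

The same pattern works for any of the categorical columns (genre, targeted_sex, airtime, expensive) by changing the grouping key.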

##### **Advertisements count per genre**

In [None]:
plot_countplot(profit_data, 'genre', 'green', 'Advertisements count per genre')

As expected, the Comedy genre has the highest count of profitable advertisements, but of around 20000 Comedy records overall only around 6000 are profitable.

The other genres have very few profitable advertisements.

##### **Advertisements count based on gender**

In [None]:
plot_countplot(profit_data, 'targeted_sex', 'violet', 'Advertisements count based on gender')

Male-oriented advertisements have the higher profitable count, partly because they are also more numerous in the original dataset; still, the profitable count (around 5000) is small compared with the overall count (around 17500).

Female-oriented advertisements likewise drop from around 8000 overall to around 1000 when only profitable records are considered.

##### **Advertisements count based on time of air**

In [None]:
plot_countplot(profit_data, 'airtime', 'blue', 'Advertisements count based on time of air')

Primetime advertisements have the highest profitable count, but that is also because Primetime has the highest count in the overall dataset.

Its count drops from around 16000 overall to around 5000 profitable.

Daytime has the lowest count in the overall dataset, but its profitable count is almost the same as Morning's, and there is no drastic decrease either (around 2000 overall to around 600 in the profit data).

##### **Advertisements count based on price range**

In [None]:
plot_countplot(profit_data, 'expensive', 'yellow', 'Advertisements count based on price range')

Low, being the most common price class in the overall dataset, is expected to have the highest count here as well, but its count declines sharply, from around 16000 in the overall dataset to around 4000 profitable records.

##### **Advertisements count based on money back guarantee**

In [None]:
plot_countplot(profit_data, 'money_back_guarantee', 'pink', 'Advertisements count based on money back guarantee')

Money back guarantee has almost the same share here as in the overall dataset, so it does not appear to be a major factor in making advertisements profitable.

##### **Ratings distribution for each genre**

In [None]:
plt.figure(figsize=(8,8))
sns.stripplot(x='genre', y='ratings',  data=profit_data)
plt.title('Ratings distribution for each genre')
plt.show()

Most profit-earning advertisements in each genre do not have very good ratings; in fact, very few profitable advertisements have a rating of 1.

Most profitable advertisements have a rating around or below 0.2.

##### **Ratings distribution for each industry**

In [None]:
plt.figure(figsize=(8,8))
sns.stripplot(x='industry', y='ratings',  data=profit_data)
plt.title('Ratings distribution for each industry')
plt.show()

Most profit-earning advertisements in each industry do not have very good ratings; in fact, very few profitable advertisements have a rating of 1.

Most profitable advertisements have a rating around or below 0.2.

Pharma and Auto have the major share.

##### **Runtime distribution for each genre**

In [None]:
plt.figure(figsize=(8,8))
sns.stripplot(x='genre', y='runtime',  data=profit_data)
plt.title('Runtime distribution for each genre')
plt.show()

Runtime is spread across all values, so no particular runtime range is clearly more profitable than the others, but advertisements with a runtime between 30 and 60 mins have the major share.

##### **Runtime distribution for each industry**

In [None]:
plt.figure(figsize=(8,8))
sns.stripplot(x='industry', y='runtime',  data=profit_data)
plt.title('Runtime distribution for each industry')
plt.show()

Runtime is spread across all values, so no particular runtime range is clearly more profitable than the others, but advertisements with a runtime between 30 and 60 mins have the major share.

##### **Ratings based on runtime**

In [None]:
plt.figure(figsize=(8,8))
sns.scatterplot(x='ratings', y='runtime',  data=profit_data)
plt.title('Ratings based on runtime')
plt.show()

As observed before, advertisements with a rating around or below 0.2 have the major share.

##### **Advertisements count based on industry and relationship status**

In [None]:
df_indsrel=pd.DataFrame(profit_data.groupby(['industry','realtionship_status'])['netgain'].count())
df_indsrel.rename(columns={"netgain":"count"}, inplace=True)
df_indsrel.reset_index(inplace=True)
df_indsrel=df_indsrel.sort_values(by='count', ascending=False)
df_indsrel.head()
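
The group, count, rename, reset and sort steps above recur throughout this section. As a hedged sketch, the same table can be produced in one chain with `groupby(...).size()`; `count_by` is a hypothetical helper name, not something defined in this notebook. Note that `size()` counts every row in a group, while `['netgain'].count()` skips rows where `netgain` is missing, so the two agree only when the target column has no missing values.

```python
import pandas as pd

def count_by(df, cols):
    """Count rows per combination of `cols`, largest groups first."""
    return (
        df.groupby(cols)
          .size()
          .reset_index(name="count")
          .sort_values("count", ascending=False)
          .reset_index(drop=True)
    )

# Toy data standing in for profit_data (values are made up for illustration)
toy = pd.DataFrame({
    "industry": ["Pharma", "Pharma", "Auto", "Pharma"],
    "realtionship_status": ["Married-civ-spouse", "Married-civ-spouse",
                            "Divorced", "Never-married"],
})
counts = count_by(toy, ["industry", "realtionship_status"])
print(counts)
```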

In [None]:
plot_barplot(df_indsrel, 'industry', 'count', 'realtionship_status', 'upper right', 'Advertisements count based on industry and relationship status', 20)

 - Pharma is most profitable for the 'Married-civ-spouse' category.
 - The Auto industry is mainly targeted at the 'Never-married' and 'Divorced' categories.
 - Entertainment is mostly 'Divorced'.
 - Political is mostly 'Never-married'.

##### **Advertisements count based on industry and gender**

In [None]:
df_indsrel=pd.DataFrame(profit_data.groupby(['industry','targeted_sex'])['netgain'].count())
df_indsrel.rename(columns={"netgain":"count"}, inplace=True)
df_indsrel.reset_index(inplace=True)
df_indsrel=df_indsrel.sort_values(by='count', ascending=False)
df_indsrel.head()

In [None]:
plot_barplot(df_indsrel, 'industry', 'count', 'targeted_sex', 'upper right', 'Advertisements count based on industry and gender', 20)

 - Pharma is quite popular with Male.
 - Auto is targeted more towards Male, but has a comparatively significant count for Female as well.
 - The Other category is Female only, with no Male count.
 - Entertainment has similar counts for Male and Female.

##### **Advertisements count based on genre and relationship status**

In [None]:
df_genrerel=pd.DataFrame(profit_data.groupby(['genre','realtionship_status'])['netgain'].count())
df_genrerel.rename(columns={"netgain":"count"}, inplace=True)
df_genrerel.reset_index(inplace=True)
df_genrerel=df_genrerel.sort_values(by='count', ascending=False)
df_genrerel.head()

In [None]:
plot_barplot(df_genrerel, 'genre', 'count', 'realtionship_status', 'upper right', 'Advertisements count based on genre and relationship status', 20)

 - Comedy is most popular with the Married-civ-spouse category, followed by Never-married and Divorced.
 - Infomercial and Drama are also more popular with Married-civ-spouse.

##### **Advertisements count based on genre and gender**

In [None]:
df_genrerel=pd.DataFrame(profit_data.groupby(['genre','targeted_sex'])['netgain'].count())
df_genrerel.rename(columns={"netgain":"count"}, inplace=True)
df_genrerel.reset_index(inplace=True)
df_genrerel=df_genrerel.sort_values(by='count', ascending=False)
df_genrerel.head()

In [None]:
plot_barplot(df_genrerel, 'genre', 'count', 'targeted_sex', 'upper right', 'Advertisements count based on genre and gender', 20)

In every genre the Male count is higher, which follows from the Male count being higher in the overall dataset.

##### **Advertisements count based on location,runtime and genre**

In [None]:
df_location=pd.DataFrame(profit_data.groupby(['airlocation','genre','runtime'])['runtime'].count().nlargest(100))
df_location.rename(columns={"runtime":"count"}, inplace=True)
df_location.reset_index(inplace=True)
df_location=df_location.sort_values(by='runtime', ascending=False)
df_location.head()

In [None]:
plot_barplot(df_location, 'airlocation', 'runtime', 'genre', 'upper right', 'Advertisements count based on location, runtime and genre', 30)

The United States does not have any particular genre with a major share of records in the profitable data, but Comedy is the most common genre across multiple locations, followed by Drama.

In European countries Drama appears to be in the majority, and in Asian countries Drama is also the most profitable.

##### **Advertisements count based on location,runtime and industry**

In [None]:
df_location=pd.DataFrame(profit_data.groupby(['airlocation','industry','runtime'])['runtime'].count().nlargest(100))
df_location.rename(columns={"runtime":"count"}, inplace=True)
df_location.reset_index(inplace=True)
df_location=df_location.sort_values(by='runtime', ascending=False)
df_location.head()

In [None]:
plot_barplot(df_location, 'airlocation', 'runtime', 'industry', 'upper right', 'Advertisements count based on location,runtime and industry', 30)

The United States has no single dominant industry, but Pharma is the most common across countries overall.

All of them have a runtime of approximately 40 to 50 mins.

##### **Advertisements count based on location,ratings and genre**

In [None]:
df_location=pd.DataFrame(profit_data.groupby(['airlocation','genre','ratings'])['ratings'].count().nlargest(50))
df_location.rename(columns={"ratings":"count"}, inplace=True)
df_location.reset_index(inplace=True)
df_location=df_location.sort_values(by='ratings', ascending=False)
df_location.head()

In [None]:
plot_barplot(df_location, 'airlocation', 'ratings', 'genre', 'upper right', 'Advertisements count based on location,ratings and genre', 30)

Comedy is the most common genre, with ratings of around 0.02, except in the United States, where it is around 0.15.

##### **Advertisements count based on location,ratings and industry**

In [None]:
df_location=pd.DataFrame(profit_data.groupby(['airlocation','industry','ratings'])['ratings'].count().nlargest(200))
df_location.rename(columns={"ratings":"count"}, inplace=True)
df_location.reset_index(inplace=True)
df_location=df_location.sort_values(by='ratings', ascending=False)
df_location.head()

In [None]:
plot_barplot(df_location, 'airlocation', 'ratings', 'industry', 'upper right', 'Advertisements count based on location,ratings and industry', 40)

Pharma is the most common, with a rating around 0.1; in the Philippines its rating is around 0.5.

In the United States, unlike other countries, the Entertainment and Political industries are more common than Pharma.

##### **Runtime by genres**

In [None]:
df_genre_runtime=pd.DataFrame(profit_data.groupby(['runtime','genre'])['runtime'].count())
df_genre_runtime.rename(columns={"runtime":"count"}, inplace=True)
df_genre_runtime.reset_index(inplace=True)
df_genre_runtime=df_genre_runtime.sort_values(by='runtime', ascending=False)
df_genre_runtime.head()

In [None]:
plot_barplot(df_genre_runtime, 'runtime', 'count', 'genre', 'upper left', 'Runtime by genres', 20)

The Comedy genre with a runtime of around 40 mins has the most records in the profitable dataset, followed by 50 mins and 45 mins.

The other genres have comparatively few entries.

##### **Runtime by industry**

In [None]:
df_industry_runtime=pd.DataFrame(profit_data.groupby(['runtime','industry'])['runtime'].count())
df_industry_runtime.rename(columns={"runtime":"count"}, inplace=True)
df_industry_runtime.reset_index(inplace=True)
df_industry_runtime=df_industry_runtime.sort_values(by='runtime', ascending=False)
df_industry_runtime.head()

In [None]:
plot_barplot(df_industry_runtime, 'runtime', 'count', 'industry', 'upper right', 'Runtime by industry', 20)

Pharma has the largest share at a runtime of 40 mins, followed by 50 mins and 45 mins.

##### **Ratings by genres**

In [None]:
df_genre_ratings=pd.DataFrame(profit_data.groupby(['ratings','genre'])['ratings'].count())
df_genre_ratings.rename(columns={"ratings":"count"}, inplace=True)
df_genre_ratings.reset_index(inplace=True)
df_genre_ratings=df_genre_ratings.sort_values(by='ratings', ascending=False)
df_genre_ratings['ratings']=df_genre_ratings['ratings'].map('{:,.2f}'.format)
df_genre_ratings.head()

In [None]:
plot_barplot(df_genre_ratings, 'ratings', 'count', 'genre', 'upper right', 'Ratings by genres', 20)

Comedy is the most common, with ratings around 0.3, followed by ratings of around 0.17 and 1.

This means that a good rating does not always mean a profitable advertisement.

##### **Ratings by industry**

In [None]:
df_industry_ratings=pd.DataFrame(profit_data.groupby(['ratings','industry'])['ratings'].count().nlargest(100))
df_industry_ratings.rename(columns={"ratings":"count"}, inplace=True)
df_industry_ratings.reset_index(inplace=True)
df_industry_ratings=df_industry_ratings.sort_values(by='ratings', ascending=False)
df_industry_ratings['ratings']=df_industry_ratings['ratings'].map('{:,.2f}'.format)
df_industry_ratings.head()

In [None]:
plot_barplot(df_industry_ratings, 'ratings', 'count', 'industry', 'upper right', 'Ratings by industry', 20)

Pharma is the most common, at a rating of around 0.03, followed by 0.17 and 0.10.

As in the previous graph, most profitable advertisements have low ratings and very few have high ratings.

However, advertisements with high ratings are rare in the overall dataset, which explains their low count in the profitable data as well.
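
Because high ratings are rare overall, raw counts per rating say little about whether a high rating goes with profit. A minimal sketch of an alternative is to bin the ratings and compute the profit share within each bin. It assumes ratings lie in [0, 1], as the plots suggest; the bin edges and the helper name `profit_rate_by_rating` are arbitrary illustrative choices.

```python
import pandas as pd

def profit_rate_by_rating(df, edges=(0.0, 0.05, 0.2, 0.5, 1.0)):
    """Record count and share of profitable records per rating bin."""
    bins = pd.cut(df["ratings"], bins=list(edges), include_lowest=True)
    return (
        df.groupby(bins, observed=False)["netgain"]
          .agg(total="size", profit_rate="mean")
    )

# Toy data standing in for adv_data (values are made up for illustration)
toy = pd.DataFrame({
    "ratings": [0.02, 0.03, 0.03, 0.15, 0.30, 0.60, 1.00],
    "netgain": [False, True, False, False, False, True, True],
})
by_rating = profit_rate_by_rating(toy)
print(by_rating)
```

On the real data this would be called on `adv_data` (before encoding), not on `profit_data`, since the rate needs both profitable and non-profitable records.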

##### **Analyzing Profit data using AutoViz**

In [None]:
profit_data.to_csv('Profit_data.csv')

In [None]:
aftrain=AV.AutoViz('Profit_data.csv')

 - Pharma is the most common industry, followed by Auto and Other.
 - Comedy is the most common genre.
 - 0.00 to 0.02 is the most common rating range.
 - Most of the records belong to the United States.
 - Male records are in the majority.
 - Married-civ-spouse is the most common relationship status and has the better average rating.
 - Primetime has most of the records.
 - Auto industry advertisements have the best rating, followed by Entertainment.
 - Drama has the best rating.
 - Daytime advertisements have the best rating.
 - Advertisements with runtimes of 51 mins and 66 mins have the best ratings.
 - Ratings are almost the same for Male and Female.


##### **Summary**

 - Pharma is the most common industry in the profit dataset, as in the overall dataset, but Auto has far fewer records than in the overall dataset, so Auto advertisements are not very profitable.
 - 'Married-civ-spouse' has the highest count, as in the overall dataset, but 'Never-married' shows a large decrease compared with the overall dataset, which means advertisements targeted at Never-married are not very profitable.
 - Comedy is the most profitable genre, 'Married-civ-spouse' the most profitable relationship status, and Primetime the most profitable airtime.
 - Pharma is most profitable for 'Married-civ-spouse', Auto is mostly profitable with 'Never-married' or 'Divorced', and Entertainment is most profitable with 'Divorced' and 'Never-married' audiences.
 - Pharma is mostly profitable for the Male audience, while Auto and Entertainment are profitable for both Male and Female.
 - Similarly, the Comedy genre is most profitable with 'Married-civ-spouse'.
 - Comedy and Drama with a runtime of 40-50 mins are more profitable across the world, though all categories have significant counts.
 - Pharma is the most profitable industry across the world, but in the US, Auto and Entertainment are profitable.
 - The typical profitable rating is 0.02 to 0.03 across the world, but in the US, Comedy and Infomercial have ratings above 0.10.
 - The most profitable runtime for Comedy/Drama is 40 mins, followed by 50 mins, 45 mins and 60 mins.



So, we can deduce that Pharma is quite profitable, which could be the reason it is the most common industry in the overall dataset, whereas Auto is not very profitable yet still has a good count in the overall dataset.

Most of the profit comes from the 'Married-civ-spouse' audience, followed by Divorced and Never-married.

<a id=section8182></a>
#### **8.18.2 Loss records analysis**

In [None]:
loss_data=adv_data[adv_data.netgain==False]
loss_data.head()

In [None]:
loss_data.shape

##### **Advertisements count per industry**

In [None]:
plot_countplot(loss_data, 'industry', 'orange', 'Advertisements count per industry')

Auto has the highest count in the loss dataset, followed by Pharma and Political, which means most advertisements for the Auto industry are not profitable.

Auto and Pharma have the most records in the overall dataset, so they have the highest counts in both the profit and loss datasets.

However, Pharma is the most common industry in the profit dataset, while Auto is the most common in the loss dataset.

##### **Advertisements count per genre**

In [None]:
plot_countplot(loss_data, 'genre', 'green', 'Advertisements count per genre')

Comedy has the largest share of the loss dataset because it is the most common genre in the overall dataset.

##### **Advertisements count based on gender**

In [None]:
plot_countplot(loss_data, 'targeted_sex', 'violet', 'Advertisements count based on gender')

Male, being the most common in the total dataset, has a high count in both the loss and profit datasets, but the Female count is considerably higher in the loss dataset than in the profit dataset.

This means that many advertisements targeted towards females may not be very profitable.

##### **Advertisements count based on time of air**

In [None]:
plot_countplot(loss_data, 'airtime', 'blue', 'Advertisements count based on time of air')

Since most records in the overall dataset are Primetime, it has a high count in the loss dataset as well.

##### **Advertisements count based on price range**

In [None]:
plot_countplot(loss_data, 'expensive', 'yellow', 'Advertisements count based on price range')

Since most records are for low-cost products, they have the largest share in the loss dataset as well, followed by high-cost products.

##### **Advertisements count based on money back guarantee**

In [None]:
plot_countplot(loss_data, 'money_back_guarantee', 'pink', 'Advertisements count based on money back guarantee')

The money back guarantee distribution is almost the same in the profit and loss datasets, so it does not appear to have much impact on revenue.
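
A statement that a column has "almost the same" distribution in both subsets can be checked numerically by placing the normalized shares side by side. The sketch below is one hedged way to do that; `compare_shares` is a hypothetical helper, and the 'Yes'/'No' values in the toy frame are assumed for illustration only.

```python
import pandas as pd

def compare_shares(profit_df, loss_df, col):
    """Share of each value of `col` within the profit and loss subsets."""
    return pd.concat(
        {
            "profit_share": profit_df[col].value_counts(normalize=True),
            "loss_share": loss_df[col].value_counts(normalize=True),
        },
        axis=1,
    ).fillna(0.0)  # a value absent from one subset has a share of 0

# Toy data standing in for adv_data (values are made up for illustration)
toy = pd.DataFrame({
    "money_back_guarantee": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "netgain": [True, True, False, False, False, False],
})
shares = compare_shares(toy[toy.netgain], toy[~toy.netgain], "money_back_guarantee")
print(shares)
```

With the notebook's frames the call would be `compare_shares(profit_data, loss_data, 'money_back_guarantee')`; near-equal columns support the reading that the feature has little influence on `netgain`.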

##### **Ratings distribution for each genre**

In [None]:
plt.figure(figsize=(8,8))
sns.stripplot(x='genre', y='ratings',  data=loss_data)
plt.title('Ratings distribution for each genre')
plt.show()

Very few advertisements in the loss dataset have a rating of 1, compared with the profit dataset, and the majority have ratings below 0.2.

##### **Ratings distribution for each industry**

In [None]:
plt.figure(figsize=(8,8))
sns.stripplot(x='industry', y='ratings',  data=loss_data)
plt.title('Ratings distribution for each industry')
plt.show()

As in the previous graph, advertisements with a rating of 1 have a very low count in the loss dataset irrespective of industry, while most records have ratings around 0.2 or below.

##### **Runtime distribution for each genre**

In [None]:
plt.figure(figsize=(8,8))
sns.stripplot(x='genre', y='runtime',  data=loss_data)
plt.title('Runtime distribution for each genre')
plt.show()

The Comedy genre has the most records in the loss dataset, which is expected since Comedy has the most records in the overall dataset.

##### **Runtime distribution for each industry**

In [None]:
plt.figure(figsize=(8,8))
sns.stripplot(x='industry', y='runtime',  data=loss_data)
plt.title('Runtime distribution for each industry')
plt.show()

Auto and Pharma with runtimes of 40-60 mins have the most entries in the loss dataset.

##### **Ratings based on runtime**

In [None]:
plt.figure(figsize=(8,8))
sns.scatterplot(x='ratings', y='runtime',  data=loss_data)
plt.title('Ratings based on runtime')
plt.show()

As seen in the earlier graphs, most advertisements with a rating of 1 are profitable and very few appear in the loss dataset, while most advertisements with ratings below 0.2 are non-profitable.

There is no specific runtime pattern, though: advertisements of all runtimes are present in both the profit and loss datasets.
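
Whether runtime really looks the same in both subsets can also be checked with summary statistics instead of only by eye. A minimal sketch, using a made-up toy frame in place of `adv_data`:

```python
import pandas as pd

# Toy data standing in for adv_data (values are made up for illustration)
toy = pd.DataFrame({
    "runtime": [40, 45, 50, 40, 40, 45, 50, 60],
    "netgain": [True, True, True, True, False, False, False, False],
})

# One row per netgain value, with count, mean, quartiles, min and max of runtime
summary = toy.groupby("netgain")["runtime"].describe()
print(summary[["count", "mean", "50%"]])
```

If the means and quartiles of the two rows are close on the real data, that supports the observation that runtime alone does not separate profitable from non-profitable advertisements.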

##### **Advertisements count based on location,runtime and genre**

In [None]:
df_location=pd.DataFrame(loss_data.groupby(['airlocation','genre','runtime'])['runtime'].count().nlargest(200))
df_location.rename(columns={"runtime":"count"}, inplace=True)
df_location.reset_index(inplace=True)
df_location=df_location.sort_values(by='runtime', ascending=False)
df_location.head()

In [None]:
plot_barplot(df_location, 'airlocation', 'runtime', 'genre', 'upper right', 'Advertisements count based on location,runtime and genre', 50)

Comedy and Drama have most of the entries across the world, while in the United States and neighbouring locations Infomercial, Direct and Other are the most common entries.

##### **Advertisements count based on location,runtime and industry**

In [None]:
df_location=pd.DataFrame(loss_data.groupby(['airlocation','industry','runtime'])['runtime'].count().nlargest(200))
df_location.rename(columns={"runtime":"count"}, inplace=True)
df_location.reset_index(inplace=True)
df_location=df_location.sort_values(by='runtime', ascending=False)
df_location.head()

In [None]:
plot_barplot(df_location, 'airlocation', 'runtime', 'industry', 'upper right', 'Advertisements count based on location,runtime and industry', 30)

Auto and Pharma are the most common around the world, but in the US and neighbouring locations almost all industries have a large count of records.

This could be because most of the records belong to the United States and nearby locations rather than to other locations.

##### **Advertisements count based on location,ratings and genre**

In [None]:
df_location=pd.DataFrame(loss_data.groupby(['airlocation','genre','ratings'])['ratings'].count().nlargest(100))
df_location.rename(columns={"ratings":"count"}, inplace=True)
df_location.reset_index(inplace=True)
df_location=df_location.sort_values(by='ratings', ascending=False)
df_location.head()

In [None]:
plot_barplot(df_location, 'airlocation', 'ratings', 'genre', 'upper right', 'Advertisements count based on location,ratings and genre', 45)

In the US, the Comedy genre has a good count even at somewhat higher ratings, while the other genres have an average rating just below 0.03.

##### **Advertisements count based on location,ratings and industry**

In [None]:
df_location=pd.DataFrame(loss_data.groupby(['airlocation','industry','ratings'])['ratings'].count().nlargest(100))
df_location.rename(columns={"ratings":"count"}, inplace=True)
df_location.reset_index(inplace=True)
df_location=df_location.sort_values(by='ratings', ascending=False)
df_location.head()

In [None]:
plot_barplot(df_location, 'airlocation', 'ratings', 'industry', 'upper right', 'Advertisements count based on location,ratings and industry', 30)

All industries have a significant count of records, but Pharma has the majority across the various locations in the world; in the US, Pharma and Auto have most of the records.

##### **Runtime by genres**

In [None]:
df_genre_runtime=pd.DataFrame(loss_data.groupby(['runtime','genre'])['runtime'].count().nlargest(100))
df_genre_runtime.rename(columns={"runtime":"count"}, inplace=True)
df_genre_runtime.reset_index(inplace=True)
df_genre_runtime=df_genre_runtime.sort_values(by='runtime', ascending=False)
df_genre_runtime.head()

In [None]:
plot_barplot(df_genre_runtime, 'runtime', 'count', 'genre', 'upper left', 'Runtime by genres', 20)

Comedy is the most common genre, and a runtime of 40 mins has the majority of the records.

##### **Runtime by industry**

In [None]:
df_industry_runtime=pd.DataFrame(loss_data.groupby(['runtime','industry'])['runtime'].count().nlargest(100))
df_industry_runtime.rename(columns={"runtime":"count"}, inplace=True)
df_industry_runtime.reset_index(inplace=True)
df_industry_runtime=df_industry_runtime.sort_values(by='runtime', ascending=False)
df_industry_runtime.head()

In [None]:
plot_barplot(df_industry_runtime, 'runtime', 'count', 'industry', 'upper right', 'Runtime by industry', 20)

Most of the industries have maximum records at 40 mins followed by 45 and 50 mins.

##### **Ratings by genres**

In [None]:
df_genre_ratings=pd.DataFrame(loss_data.groupby(['ratings','genre'])['ratings'].count().nlargest(100))
df_genre_ratings.rename(columns={"ratings":"count"}, inplace=True)
df_genre_ratings.reset_index(inplace=True)
df_genre_ratings=df_genre_ratings.sort_values(by='ratings', ascending=False)
df_genre_ratings['ratings']=df_genre_ratings['ratings'].map('{:,.2f}'.format)
df_genre_ratings.head()

In [None]:
plot_barplot(df_genre_ratings, 'ratings', 'count', 'genre', 'upper right', 'Ratings by genres', 20)

Across genres, most non-profitable advertisements have ratings around 0.03.

##### **Ratings by industry**

In [None]:
df_industry_ratings=pd.DataFrame(loss_data.groupby(['ratings','industry'])['ratings'].count().nlargest(100))
df_industry_ratings.rename(columns={"ratings":"count"}, inplace=True)
df_industry_ratings.reset_index(inplace=True)
df_industry_ratings=df_industry_ratings.sort_values(by='ratings', ascending=False)
df_industry_ratings['ratings']=df_industry_ratings['ratings'].map('{:,.2f}'.format)
df_industry_ratings.head()

In [None]:
plot_barplot(df_industry_ratings, 'ratings', 'count', 'industry', 'upper right', 'Ratings by industry', 20)

Most industries have their non-profitable advertisements at a rating of around 0.03, with Pharma the most common, followed by Auto.

##### **Advertisements count based on genre and relationship status**

In [None]:
df_genrerel=pd.DataFrame(loss_data.groupby(['genre','realtionship_status'])['netgain'].count())
df_genrerel.rename(columns={"netgain":"count"}, inplace=True)
df_genrerel.reset_index(inplace=True)
df_genrerel=df_genrerel.sort_values(by='count', ascending=False)
df_genrerel.head()

In [None]:
plot_barplot(df_genrerel, 'genre', 'count', 'realtionship_status', 'upper right', 'Advertisements count based on genre and relationship status', 20)

 - Never-married has the highest count for the Comedy genre, followed by Married-civ-spouse and Divorced.
 - The other genres show a similar segregation.

##### **Advertisements count based on genre and gender**

In [None]:
df_genrerel=pd.DataFrame(loss_data.groupby(['genre','targeted_sex'])['netgain'].count())
df_genrerel.rename(columns={"netgain":"count"}, inplace=True)
df_genrerel.reset_index(inplace=True)
df_genrerel=df_genrerel.sort_values(by='count', ascending=False)
df_genrerel.head()

In [None]:
plot_barplot(df_genrerel, 'genre', 'count', 'targeted_sex', 'upper right', 'Advertisements count based on genre and gender', 20)

- Most of the non-profitable Comedy advertisements were targeted at Male, but Female also has a good count, which suggests that many female-oriented advertisements are not profitable.
- Infomercial and Drama also have almost similar counts.

##### **Advertisements count based on industry and relationship status**

In [None]:
df_indsrel=pd.DataFrame(loss_data.groupby(['industry','realtionship_status'])['netgain'].count())
df_indsrel.rename(columns={"netgain":"count"}, inplace=True)
df_indsrel.reset_index(inplace=True)
df_indsrel=df_indsrel.sort_values(by='count', ascending=False)
df_indsrel.head()

In [None]:
plot_barplot(df_indsrel, 'industry', 'count', 'realtionship_status', 'upper right', 'Advertisements count based on industry and relationship status', 20)

- Pharma has the highest count for Married-civ-spouse, the same as in the profit data.
- In the other industry categories, Never-married and Divorced have higher counts.

##### **Advertisements count based on industry and gender**

In [None]:
df_indsrel=pd.DataFrame(loss_data.groupby(['industry','targeted_sex'])['netgain'].count())
df_indsrel.rename(columns={"netgain":"count"}, inplace=True)
df_indsrel.reset_index(inplace=True)
df_indsrel=df_indsrel.sort_values(by='count', ascending=False)
df_indsrel.head()

In [None]:
plot_barplot(df_indsrel, 'industry', 'count', 'targeted_sex', 'upper right', 'Advertisements count based on industry and gender', 20)

 - Pharma has only Male records.
 - Auto and Political have almost similar record counts for Female.
 - Entertainment has more records for Female than for Male.

##### **Analyzing Loss data using AutoViz**

In [None]:
loss_data.to_csv('Loss_data.csv')
aftrain=AV.AutoViz('Loss_data.csv')

##### **Summary**

- Auto has more loss records than Pharma, even though Auto has fewer records than Pharma in the overall data. That means Auto has a large share of non-profitable records.
- Comedy is the most common genre in the loss data as well.
- The Female count is lower than the Male count, but still high compared with the profit dataset, which means most advertisements targeted at Female are not very profitable.
- There are only a couple of records with rating = 1 in the loss dataset, which means most advertisements with high ratings are profitable.
- Runtime seems to have no impact on loss and profit, as the distributions are similar.
- Most loss-making advertisements have ratings around 0.01.
- The industry distribution has more Auto records than Pharma.
- The profit dataset has more records for the 'Married-civ-spouse' category, while the loss dataset has more for 'Never-married', closely followed by 'Married-civ-spouse'.
- The loss data has a comparatively higher ratio of Female records than the overall and profit datasets, which means most advertisements for Female are not earning much profit.

<a id=section9></a>
## **9. Checking data distribution of output variable**

In [None]:
plt.figure(figsize=(8,8))
sns.countplot(x='netgain', data=adv_data)
plt.show()

The data is imbalanced, so extra records need to be generated for the minority class to balance it before applying any ML/DL algorithm.
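
One simple way to balance the classes is random oversampling, where minority-class rows are resampled with replacement until both classes have the same count. The sketch below illustrates that technique only; it is not necessarily the method applied later in this notebook, and `oversample_minority` is a hypothetical helper. To keep duplicated rows out of the evaluation data, resampling is normally applied to the training split only.

```python
import pandas as pd

def oversample_minority(df, target="netgain", random_state=42):
    """Resample smaller classes with replacement up to the majority class size."""
    counts = df[target].value_counts()
    majority_size = counts.max()
    parts = []
    for label, n in counts.items():
        subset = df[df[target] == label]
        if n < majority_size:
            extra = subset.sample(n=majority_size - n, replace=True,
                                  random_state=random_state)
            subset = pd.concat([subset, extra])
        parts.append(subset)
    # Shuffle so that rows of one class are not grouped together
    return (pd.concat(parts)
              .sample(frac=1, random_state=random_state)
              .reset_index(drop=True))

# Toy data standing in for a training split (values are made up)
toy = pd.DataFrame({
    "runtime": range(8),
    "netgain": [False] * 6 + [True] * 2,
})
balanced = oversample_minority(toy)
print(balanced["netgain"].value_counts())
```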

<a id=section10></a>
## **10. Feature Engineering and Data Transformation**

<a id=section101></a>
### **10.1 Combining feature values**

In [None]:
adv_data.head()

In [None]:
adv_data.realtionship_status.unique()

In [None]:
adv_data.industry.unique()

In [None]:
adv_data.genre.unique()

In [None]:
adv_data.targeted_sex.unique()

In [None]:
adv_data.expensive.unique()

In [None]:
adv_data.airtime.unique()

In [None]:
adv_data.airlocation.unique()

In the EDA above, we observed that the 'Never-married' and 'Divorced' values of realtionship_status behave similarly with respect to the other columns.

So we can combine both into a new value, 'Single', and check how the ML models perform on this dataset.


We can also combine 'Never-married', 'Divorced', 'Widowed' and 'Separated' for further analysis.

So we now have the following datasets for further experiments:

 - The original one
 - realtionship_status['Never-married'] and realtionship_status['Divorced'] merged into realtionship_status['Single']
 - realtionship_status['Never-married'], realtionship_status['Separated'], realtionship_status['Widowed'] and realtionship_status['Divorced'] merged into realtionship_status['Single']

In [None]:
adv_data_1=adv_data.copy()
adv_data_2=adv_data.copy()

In [None]:
adv_data_1.shape

In [None]:
adv_data_2.shape

In [None]:
adv_data_1.head()

In [None]:
adv_data_1.realtionship_status.unique()

In [None]:
adv_data_1.realtionship_status=adv_data_1.realtionship_status.replace(['Never-married','Divorced'], 'Single')

In [None]:
adv_data_1.shape

In [None]:
adv_data_1.head()

In [None]:
adv_data_1.realtionship_status.unique()

In [None]:
adv_data_2.realtionship_status=adv_data_2.realtionship_status.replace(['Never-married','Divorced','Separated', 'Widowed'], 'Single')

In [None]:
adv_data_2.head()

In [None]:
adv_data_2.realtionship_status.unique()

Now we have 3 datasets
- adv_data
- adv_data_1
- adv_data_2

<a id=section102></a>
### **10.2 Label Encoding and One Hot Encoding**

One hot encoding (OHE) will be applied to
- realtionship_status
- industry
- genre
- targeted_sex
- airtime
- airlocation

Label encoding will be applied to
- expensive
- netgain
- money_back_guarantee

In [None]:
adv_data.head()

In [None]:
adv_data_1.head()

In [None]:
adv_data_2.head()

In [None]:
#Function to apply One hot encoding
def apply_ohe(df):
    df=pd.get_dummies(df, columns=['realtionship_status','industry','genre','targeted_sex','airtime','airlocation'], drop_first=True)
    
    return df

In [None]:
#Function to apply Label Encoding
def apply_le(df):
    le = LabelEncoder()
    df['expensive']=le.fit_transform(df['expensive'])
    df['netgain']=le.fit_transform(df['netgain'])
    df['money_back_guarantee']=le.fit_transform(df['money_back_guarantee'])
    
    return df
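
`LabelEncoder` assigns integer codes in sorted order of the labels, so an ordered price class does not keep its natural order after encoding. If that ordering matters to the model, an explicit mapping is one alternative. The sketch below assumes the `expensive` column holds the values 'Low', 'Medium' and 'High'; that is an assumption to be checked against `adv_data.expensive.unique()`, and `encode_price` is a hypothetical helper.

```python
import pandas as pd

PRICE_ORDER = {"Low": 0, "Medium": 1, "High": 2}  # assumed category names

def encode_price(series, mapping=PRICE_ORDER):
    """Map price classes to ordered integers; fail loudly on unknown labels."""
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped price classes: {sorted(unknown)}")
    return series.map(mapping)

toy = pd.Series(["Low", "High", "Medium", "Low"], name="expensive")
encoded = encode_price(toy)
print(encoded.tolist())  # [0, 2, 1, 0]
```

For the binary columns `netgain` and `money_back_guarantee` the order of the two codes makes no practical difference, so `LabelEncoder` is sufficient there.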

In [None]:
#Wrapper to call OHE and LE on dataframe
def apply_encoding(df):
    df=apply_ohe(df)
    df=apply_le(df)
    
    return df

In [None]:
adv_data=apply_encoding(adv_data)
adv_data.head()

In [None]:
adv_data_1=apply_encoding(adv_data_1)
adv_data_1.head()

In [None]:
adv_data_2=apply_encoding(adv_data_2)
adv_data_2.head()

In [None]:
adv_data.to_csv('Encoded_data.csv')

In [None]:
adv_data_1.to_csv('Encoded_data_1.csv')

In [None]:
adv_data_2.to_csv('Encoded_data_2.csv')

<a id=section103></a>
### **10.3 Creating bins for the runtime values**

In [None]:
adv_data_3=adv_data.copy()

In [None]:
plt.figure(figsize=(10,6))
adv_data_3['runtime'].plot(kind='hist')
plt.show()

From the histogram, it seems reasonable to split runtime into 10 equal-width bins.

In [None]:
adv_data_3['runtime'].describe()

Around half of the records have a runtime of around 40 mins, and 75% of the records have a runtime of 45 mins or less.

In [None]:
adv_data_3['runtime_bins']=pd.cut(adv_data_3['runtime'], bins=10)
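
As a rough illustration of what `pd.cut` does with an integer `bins` argument, the sketch below bins a small made-up series. The runtimes are assumptions and are not taken from the dataset.

```python
import pandas as pd

# Made-up runtimes in minutes, only to show what equal-width binning returns.
runtime = pd.Series([10, 25, 38, 40, 41, 45, 60, 99])

# bins=10 splits the observed range [min, max] into 10 intervals of equal width.
runtime_bins = pd.cut(runtime, bins=10)

print(runtime_bins.cat.categories)             # the 10 interval labels
print(runtime_bins.value_counts(sort=False))   # rows per interval; some may be empty
```

Because the intervals have equal width rather than equal counts, bins far from the cluster around 40 can end up nearly empty.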

In [None]:
adv_data_3.drop('runtime', axis=1, inplace=True)
adv_data_3.head()

In [None]:
adv_data_3.to_csv('Bin_data_1.csv')

In [None]:
adv_data_4=adv_data_2.copy()
adv_data_4['runtime_bins']=pd.cut(adv_data_4['runtime'], bins=10)
adv_data_4.drop('runtime', axis=1, inplace=True)
print(adv_data_4.head())

In [None]:
adv_data_4.to_csv('Encoded_Bin_data_1.csv')

In [None]:
adv_data_5=adv_data_1.copy()
adv_data_5['runtime_bins']=pd.cut(adv_data_5['runtime'], bins=10)
adv_data_5.drop('runtime', axis=1, inplace=True)
print(adv_data_5.head())

In [None]:
adv_data_5.to_csv('Encoded_Bin_data_2.csv')

Now we have 6 datasets with various combinations to analyze
 - adv_data: original data with one-hot and label encoding
 - adv_data_1 and adv_data_2: data with relationship status values combined into new categories
 - adv_data_3, adv_data_4 and adv_data_5: encoded data with the runtime column replaced by bins

<a id=section104></a>
### **10.4 Checking for outliers**

There are 2 continuous columns, runtime and ratings, and both seem to have a high count of outliers. It is worth checking whether these outliers should be removed or retained.

<a id=section10401></a>
#### **10.4.1 Using Boxplot**

In [None]:
def plot_outliers(df, col):
    trace=[]
    trace.append(go.Box(y=df[col],name=col))
    data=trace
    iplot({"data":data})

In [None]:
enable_plotly_in_cell()
plot_outliers(adv_data, 'runtime')

The runtime column has many outliers because most advertisements run for around 40 mins and only a few have other runtimes. These values can't be removed, since runtime is also important in determining netgain, so it is better to retain them.

In [None]:
enable_plotly_in_cell()
plot_outliers(adv_data, 'ratings')

Similar to runtime, most values in the ratings column are around 0.02, so the other values appear extreme, but they can't be ignored

<a id=section10402></a>
#### **10.4.2 Analyzing using IQR**

In [None]:
def get_outlier_info(df):
    Q1=df.quantile(0.25)
    Q3=df.quantile(0.75)
    IQR=Q3-Q1
    #print (IQR)
    return df[((df<(Q1-IQR*1.5)) | (df>(Q3+IQR*1.5))).any(axis=1)]

In [None]:
tmp=get_outlier_info(adv_data)
print (tmp.shape)
tmp

In [None]:
def get_outlier_colinfo(df, col):
    Q1=df[col].quantile(0.25)
    Q3=df[col].quantile(0.75)
    IQR=Q3-Q1
    #print (IQR)
    print (df[((df[col]<(Q1-IQR*1.5)) | (df[col]>(Q3+IQR*1.5)))][col])

In [None]:
get_outlier_colinfo(adv_data, 'runtime')

In [None]:
get_outlier_colinfo(adv_data, 'ratings')

Combined across runtime and ratings, around 8600 records contain outliers, which is a large share of a dataset with 26k records. Dropping them would lose information, so it is better to retain them.

Instead, RobustScaler can be tried while modelling.
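
To show why RobustScaler is a candidate here, the sketch below compares it with StandardScaler on a synthetic column that has a few extreme values. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Synthetic column: values clustered near 40 plus a few extreme ones (made up).
x = np.array([38, 39, 40, 40, 41, 42, 120, 150], dtype=float).reshape(-1, 1)

# StandardScaler centers on the mean and divides by the standard deviation,
# both of which are pulled by the extreme values.
x_std = StandardScaler().fit_transform(x)

# RobustScaler centers on the median and divides by the IQR (Q3 - Q1),
# so the bulk of the data keeps a sensible spread.
x_rob = RobustScaler().fit_transform(x)

print(np.round(x_std.ravel(), 2))
print(np.round(x_rob.ravel(), 2))
```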

<a id=section10403></a>
### **10.4.3 Analyzing approaches on data imbalance issue**

We can analyze various ways to reduce the data imbalance issue by oversampling the minority class

 - RandomOverSampler
 - SMOTE 
 - ADASYN

We can try all of these techniques with a basic SVC classifier and measure the performance

In [None]:
#Function to split data into test and train
def split_data(data, test_size):
    X=data.drop('netgain', axis=1)
    y=data['netgain']
    
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=test_size, random_state=42)
    
    print("X_train shape - ", X_train.shape)
    print("y_train shape - ", y_train.shape)
    print("X_test shape - ", X_test.shape)
    print("y_test shape - ", y_test.shape)
    
    return X_train, X_test, y_train, y_test

In [None]:
X_train, X_test, y_train, y_test=split_data(adv_data, 0.2)

In [None]:
std=StandardScaler()
X_train_std=std.fit_transform(X_train)

**With imbalance data**

In [None]:
clf=SVC()
clf.fit(X_train_std, y_train)
clf.score(X_train_std, y_train)

**Using Random OverSampler**

In [None]:
pipe = make_pipeline(RandomOverSampler(random_state=0), SVC())
pipe.fit(X_train_std, y_train)
pipe.score(X_train_std, y_train)

**Using SMOTE**

In [None]:
pipe = make_pipeline(SMOTE(random_state=0), SVC())
pipe.fit(X_train_std, y_train)
pipe.score(X_train_std, y_train)

**Using ADASYN**

In [None]:
pipe = make_pipeline(ADASYN(random_state=0), SVC())
pipe.fit(X_train_std, y_train)
pipe.score(X_train_std, y_train)

The accuracy changed with the different sampling techniques, but accuracy alone is not a reliable measure of a model, especially on a highly imbalanced dataset like this one. The sampling methods can therefore be included in the ML pipeline to experiment with various approaches.
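
A minimal sketch of why accuracy can mislead on imbalanced data: a baseline that always predicts the majority class scores well on accuracy while never finding the minority class. The 90/10 class ratio below is an assumption chosen for illustration, not the ratio of this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic labels: 90% class 0 and 10% class 1 (ratio chosen for illustration).
y_true = np.array([0] * 90 + [1] * 10)
X_dummy = np.zeros((100, 1))  # features are irrelevant for this baseline

# A baseline that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)               # recall of the minority class
f1 = f1_score(y_true, y_pred, zero_division=0)   # no positive predictions are made

print(acc, rec, f1)
```

Recall, precision and F1 on a held-out split are therefore more informative than accuracy when comparing the sampling techniques.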

<a id=section10404></a>
### **10.4.4 LDA and PCA Transformation**

#### **Using LDA**

In [None]:
def select_n_components(var_ratio, goal_var):
    total_variance=0
    n_comp=0

    for explained_variance in var_ratio:
        total_variance  += explained_variance

        n_comp += 1

        if total_variance>=goal_var:
            break

    return n_comp

In [None]:
def get_lda_comp(X, y):
    lda=LDA(n_components=None)

    X_lda=lda.fit(X, y)

    lda_var_ratios=lda.explained_variance_ratio_
    
    print(select_n_components(lda_var_ratios, 0.99))

In [None]:
X=adv_data.drop('netgain', axis=1)
y=adv_data['netgain']
std=StandardScaler()
X_std=std.fit_transform(X)
get_lda_comp(X_std, y)

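It seems that only 1 component explains 99% of the variance. This is expected: LDA yields at most n_classes - 1 components, so for a two-class target it can only ever return 1.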
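
The limit on LDA components can be checked on synthetic two-class data. The sketch below uses made-up data and the same `LDA(n_components=None)` call as above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

rng = np.random.default_rng(0)

# Synthetic two-class data with 5 features (made up for illustration).
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

lda = LDA(n_components=None).fit(X_toy, y_toy)

# With 2 classes, LDA has min(n_classes - 1, n_features) = 1 discriminant axis,
# so a single explained-variance ratio is reported.
print(lda.explained_variance_ratio_)
print(lda.transform(X_toy).shape)
```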

#### **Using PCA**

In [None]:
X_std.shape[1]

In [None]:
pca = PCA(n_components=0.99)
X_pca = pca.fit_transform(X_std)
X_pca.shape[1]

It seems 58 components explain 99% of the variance using PCA

So, in the pipeline, it would be better to experiment with both
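
For reference, passing a float to `PCA(n_components=...)` selects the number of components from the cumulative explained variance. The sketch below reproduces that count by hand on synthetic data; the data and variable names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 8))  # synthetic data with 8 features

# Fit all components, then count how many are needed to pass the 99% target.
ratios = PCA().fit(X_toy).explained_variance_ratio_
n_manual = int(np.searchsorted(np.cumsum(ratios), 0.99, side="right") + 1)

# Passing a float in (0, 1) asks PCA to make the same choice internally.
n_auto = PCA(n_components=0.99).fit(X_toy).n_components_

print(n_manual, n_auto)
```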

<a id=section11></a>
## **11. Feature Selection**

<a id=section1101></a>
### **11.1 Using Model**

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
model=ExtraTreesClassifier()
model.fit(X,y)

plt.figure(figsize=(6,6))
feat_importances=pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.title('Important Features')
plt.show()

<a id=section1102></a>
### **11.2 Using SelectKBest**

In [None]:
bestfeatures=SelectKBest(score_func=chi2, k=20)
fit=bestfeatures.fit(X,y)
dfscores=pd.DataFrame(fit.scores_)
dfcolumns=pd.DataFrame(X.columns)

plt.figure(figsize=(6,6))
featureScores=pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns=['Features', 'Score']
sns.barplot(x='Score', y='Features', data=featureScores.sort_values(by='Score', ascending=False).head(20))
plt.show()

In [None]:
bestfeatures=SelectKBest(score_func=f_classif, k=20)
fit=bestfeatures.fit(X,y)
dfscores=pd.DataFrame(fit.scores_)
dfcolumns=pd.DataFrame(X.columns)

plt.figure(figsize=(6,6))
featureScores=pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns=['Features', 'Score']
sns.barplot(x='Score', y='Features', data=featureScores.sort_values(by='Score', ascending=False).head(20))
plt.show()

From the above analysis, the features below appear to be the most important. During experimentation, we can try modelling on data restricted to these features (using SelectKBest in the pipeline with the same k value)

 - ratings
 - runtime
 - relationship_status_Married-civ-spouse
 - industry_Pharma
 - relationship_status_Never-married
 - airtime_Morning
 - airtime_Primetime
 - expensive
 - targeted_sex_Male
 - industry_Political
 - industry_Entertainment
 - industry_Other

<a id=section12></a>
## **12. Pycaret - to analyze the best models on the dataset and features before actual modelling**

**Will run and analyze Pycaret for adv_data to analyze the best models and important features**

**Running for adv_data with feature_selection and fix_imbalance**

In [None]:
from pycaret.classification import *

In [None]:
clf1 = setup(adv_data, target = 'netgain', session_id=123, log_experiment=False, experiment_name='adv1',  feature_selection=True, fix_imbalance=True)

In [None]:
base_models = compare_models(exclude=['nb','qda'])

In [None]:
catb=create_model('catboost')

In [None]:
gbc=create_model('gbc')

In [None]:
lgbm=create_model('lightgbm')

In [None]:
xgb=create_model('xgboost')

In [None]:
ada=create_model('ada')

In [None]:
blender_specific_soft= blend_models(estimator_list = [catb,gbc,xgb,lgbm,ada], method = 'soft')

In [None]:
blender_specific_hard= blend_models(estimator_list = [catb,gbc,xgb,lgbm,ada], method = 'hard')

In [None]:
stacker = stack_models(estimator_list = [catb,gbc,lgbm,ada], meta_model = xgb)

In [None]:
interpret_model(catb)

In [None]:
interpret_model(gbc)

In [None]:
interpret_model(lgbm)

**Running for adv_data with feature_selection and fix_imbalance and normalize**

In [None]:
clf2 = setup(adv_data, target = 'netgain', session_id=124, log_experiment=False, experiment_name='adv1',  feature_selection=True, fix_imbalance=True, normalize=True)

In [None]:
base_models = compare_models(exclude=['nb','qda','lda'])

In [None]:
catb=create_model('catboost')

In [None]:
gbc=create_model('gbc')

In [None]:
lgbm=create_model('lightgbm')

In [None]:
xgb=create_model('xgboost')

In [None]:
ada=create_model('ada')

In [None]:
blender_specific_soft= blend_models(estimator_list = [catb,gbc,xgb,lgbm,ada], method = 'soft')

In [None]:
blender_specific_hard= blend_models(estimator_list = [catb,gbc,xgb,lgbm,ada], method = 'hard')

In [None]:
stacker = stack_models(estimator_list = [catb,gbc,lgbm,ada], meta_model = xgb)

In [None]:
interpret_model(catb)

In [None]:
interpret_model(gbc)

In [None]:
interpret_model(lgbm)

So, while modelling the data, we can use the above features to check the performance on the best features.

**Summary**

The models below seem to be the best performing

- Logistic Regression
- Random Forest Classifier
- Ada Boost Classifier
- Extreme Gradient Boosting (XGBoost) Classifier
- Light Gradient Boosting Classifier
- Catboost


The features below also seem to be the most important, as seen before

 - ratings
 - runtime
 - relationship_status_Married-civ-spouse
 - industry_Pharma
 - relationship_status_Never-married
 - airtime_Morning
 - airtime_Primetime
 - expensive
 - targeted_sex_Male
 - industry_Political
 - industry_Entertainment
 - industry_Other

<a id=section13></a>
## **13. Checking the distribution of continuous columns to decide on normalization**

Need to analyze if runtime and ratings are normally distributed using various methods

**Basic methods using describe()**

In [None]:
adv_data.runtime.describe()

In [None]:
adv_data.ratings.describe()

**Analyzing using KDE plot**

In [None]:
plt.figure(figsize=(8,6))
sns.kdeplot(data=adv_data.runtime)
plt.title('Runtime Distribution')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.kdeplot(data=adv_data.ratings)
plt.title('Ratings Distribution')
plt.show()

**QQ Plot**

In [None]:
import statsmodels.api as sm
from scipy.stats import norm
import pylab

In [None]:
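# fit=True standardizes the sample so the 45-degree reference line is meaningful
sm.qqplot(adv_data.runtime, line='45', fit=True)
pylab.show()

In [None]:
sm.qqplot(adv_data.ratings, line='45', fit=True)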
pylab.show()

It seems that both columns need to be normalized/scaled, so we will experiment with that in the sklearn ML pipelines
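
The visual checks can be complemented with numeric ones. The sketch below computes skewness and a normality test from `scipy.stats` on two synthetic samples; the samples are made up and only stand in for a roughly normal and a right-skewed column.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples: one roughly normal, one right-skewed (made up for illustration).
normal_like = rng.normal(loc=40, scale=5, size=2000)
skewed = rng.exponential(scale=0.02, size=2000)

for name, values in [("normal_like", normal_like), ("skewed", skewed)]:
    skewness = stats.skew(values)
    # D'Agostino-Pearson test; a small p-value is evidence against normality.
    _, p_value = stats.normaltest(values)
    print(f"{name}: skew={skewness:.2f}, p={p_value:.3g}")
```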

<a id=section14></a>
## **14. Machine learning - Analyzing baseline models**

From PyCaret we observed that the following models gave satisfactory results:

- Logistic Regression
- Random Forest Classifier
- Ada Boost Classifier
- Extreme Gradient Boosting (XGBoost) Classifier
- Light Gradient Boosting Classifier
- Catboost

Since we now have an idea of the best features in the dataset, PCA and LDA can be left out of the experimentation: LDA returns only 1 component, which risks losing information, and PCA returns 58 components, which is close to the actual number of features (63)
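
The baseline search in this section pairs each classifier with several scaling and sampling techniques. One compact way to enumerate such combinations is sketched below; it is an alternative illustration rather than the code used in this notebook. The registry names are hypothetical, and the sampling step is left out so that the sketch depends on scikit-learn only.

```python
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical registries mapping a short name to a factory for each step.
scalers = {"standard": StandardScaler, "minmax": MinMaxScaler, "robust": RobustScaler}
classifiers = {
    "lr": lambda: LogisticRegression(max_iter=500, random_state=0),
    "rf": lambda: RandomForestClassifier(random_state=0),
}

# One freshly constructed pipeline per (scaler, classifier) pair.
pipelines = {
    f"{clf_name}_{scaler_name}": Pipeline([
        ("scaling", make_scaler()),
        ("classifier", make_clf()),
    ])
    for (scaler_name, make_scaler), (clf_name, make_clf)
    in product(scalers.items(), classifiers.items())
}

print(sorted(pipelines))
```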

In [None]:
#Function to create pipelines of various models with various scaling and sampling techniques and return the best model parameters
#Will be called for different datasets

MLA_columns = ['MLA Name', 'MLA Parameters','MLA Train Accuracy Mean', 'MLA Test Accuracy Mean', 'MLA Recall', 'MLA Precision', 'MLA F1 Score']
MLA_compare = pd.DataFrame(columns = MLA_columns)

def train_predict_default_ml(data, stratified, modelfilename):
    """
    data - input dataframe
    stratified - boolean - whether to stratify on the target while splitting the data into train and test sets
    modelfilename - pickle filename to dump the best model so that it can be loaded and used later
    """

    row_index=0
    best_accuracy=0
    best_recall=0
    best_precision=0
    best_f1=0

    X=data.drop('netgain', axis=1)
    y=data['netgain']
    
    if stratified:
        X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = .20, random_state = 0, stratify=y, shuffle=True)
    else:
        X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = .20, random_state = 0)
              
    #preprocess_step=FeatureUnion([('kbest1', SelectKBest(score_func=chi2, k=15)), ('kbest2', SelectKBest(score_func=f_classif, k=15))])
    preprocess_step=FeatureUnion([('kbest', SelectKBest(score_func=f_classif, k=15))])

    #Logistic Regression CV pipeline with Data imbalance and Scaling functions
    pipe_lrv_1=Pipeline([('lrvscaling1', StandardScaler()),
                        ('lrvimbalance1', RandomOverSampler(random_state=0)),
                        ('lrv_preprocess1', preprocess_step),
                        ('lrvclassifer1', LogisticRegressionCV(cv=5, max_iter=500, random_state=0))])
    pipe_lrv_2=Pipeline([('lrvscaling2', StandardScaler()),
                        ('lrvimbalance2', SMOTE(random_state=0)),
                        ('lrv_preprocess2', preprocess_step),
                        ('lrvclassifer2', LogisticRegressionCV(cv=5, max_iter=500, random_state=0))])
    pipe_lrv_3=Pipeline([('lrvscaling3', StandardScaler()),
                        ('lrvimbalance3', ADASYN(random_state=0)),
                        ('lrv_preprocess3', preprocess_step),
                        ('lrvclassifer3', LogisticRegressionCV(cv=5, max_iter=500, random_state=0))])
    pipe_lrv_4=Pipeline([('lrvscaling4', Normalizer()),
                        ('lrvimbalance4', RandomOverSampler(random_state=0)),
                        ('lrv_preprocess4', preprocess_step),
                        ('lrvclassifer4', LogisticRegressionCV(cv=5, max_iter=500, random_state=0))])
    pipe_lrv_5=Pipeline([('lrvscaling5', Normalizer()),
                        ('lrvimbalance5', SMOTE(random_state=0)),
                        ('lrv_preprocess5', preprocess_step),
                        ('lrvclassifer5', LogisticRegressionCV(cv=5, max_iter=500, random_state=0))])
    pipe_lrv_6=Pipeline([('lrvscaling6', Normalizer()),
                        ('lrvimbalance6', ADASYN(random_state=0)),
                        ('lrv_preprocess6', preprocess_step),
                        ('lrvclassifer6', LogisticRegressionCV(cv=5, max_iter=500, random_state=0))])
    pipe_lrv_7=Pipeline([('lrvscaling7', MinMaxScaler()),
                        ('lrvimbalance7', RandomOverSampler(random_state=0)),
                        ('lrv_preprocess7', preprocess_step),
                        ('lrvclassifer7', LogisticRegressionCV(cv=5, max_iter=500, random_state=0))])
    pipe_lrv_8=Pipeline([('lrvscaling8', MinMaxScaler()),
                        ('lrvimbalance8', SMOTE(random_state=0)),
                        ('lrv_preprocess8', preprocess_step),
                        ('lrvclassifer8', LogisticRegressionCV(cv=5, max_iter=500, random_state=0))])
    pipe_lrv_9=Pipeline([('lrvscaling9', MinMaxScaler()),
                        ('lrvimbalance9', ADASYN(random_state=0)),
                        ('lrv_preprocess9', preprocess_step),
                        ('lrvclassifer9', LogisticRegressionCV(cv=5, max_iter=500, random_state=0))])
    pipe_lrv_10=Pipeline([('lrvscaling10', RobustScaler()),
                        ('lrvimbalance10', RandomOverSampler(random_state=0)),
                        ('lrv_preprocess10', preprocess_step),
                        ('lrvclassifer10', LogisticRegressionCV(cv=5, max_iter=500, random_state=0))])
    pipe_lrv_11=Pipeline([('lrvscaling11', RobustScaler()),
                        ('lrvimbalance11', SMOTE(random_state=0)),
                        ('lrv_preprocess11', preprocess_step),
                        ('lrvclassifer11', LogisticRegressionCV(cv=5, max_iter=500, random_state=0))])
    pipe_lrv_12=Pipeline([('lrvscaling12', RobustScaler()),
                        ('lrvimbalance12', ADASYN(random_state=0)),
                        ('lrv_preprocess12', preprocess_step),
                        ('lrvclassifer12', LogisticRegressionCV(cv=5, max_iter=500, random_state=0))])
    
    
    #Logistic Regression pipeline with Data imbalance and Scaling functions
    pipe_lr_1=Pipeline([('lrscaling1', StandardScaler()),
                        ('lrimbalance1', RandomOverSampler(random_state=0)),
                        ('lr_preprocess1', preprocess_step),
                        ('lrclassifer1', LogisticRegression(max_iter=500, random_state=0))])
    pipe_lr_2=Pipeline([('lrscaling2', StandardScaler()),
                        ('lrimbalance2', SMOTE(random_state=0)),
                        ('lr_preprocess2', preprocess_step),
                        ('lrclassifer2', LogisticRegression(max_iter=500, random_state=0))])
    pipe_lr_3=Pipeline([('lrscaling3', StandardScaler()),
                        ('lrimbalance3', ADASYN(random_state=0)),
                        ('lr_preprocess3', preprocess_step),
                        ('lrclassifer3', LogisticRegression(max_iter=500, random_state=0))])
    pipe_lr_4=Pipeline([('lrscaling4', Normalizer()),
                        ('lrimbalance4', RandomOverSampler(random_state=0)),
                        ('lr_preprocess4', preprocess_step),
                        ('lrclassifer4', LogisticRegression(max_iter=500, random_state=0))])
    pipe_lr_5=Pipeline([('lrscaling5', Normalizer()),
                        ('lrimbalance5', SMOTE(random_state=0)),
                        ('lr_preprocess5', preprocess_step),
                        ('lrclassifer5', LogisticRegression(max_iter=500, random_state=0))])
    pipe_lr_6=Pipeline([('lrscaling6', Normalizer()),
                        ('lrimbalance6', ADASYN(random_state=0)),
                        ('lr_preprocess6', preprocess_step),
                        ('lrclassifer6', LogisticRegression(max_iter=500, random_state=0))])
    pipe_lr_7=Pipeline([('lrscaling7', MinMaxScaler()),
                        ('lrimbalance7', RandomOverSampler(random_state=0)),
                        ('lr_preprocess7', preprocess_step),
                        ('lrclassifer7', LogisticRegression(max_iter=500, random_state=0))])
    pipe_lr_8=Pipeline([('lrscaling8', MinMaxScaler()),
                        ('lrimbalance8', SMOTE(random_state=0)),
                        ('lr_preprocess8', preprocess_step),
                        ('lrclassifer8', LogisticRegression(max_iter=500, random_state=0))])
    pipe_lr_9=Pipeline([('lrscaling9', MinMaxScaler()),
                        ('lrimbalance9', ADASYN(random_state=0)),
                        ('lr_preprocess9', preprocess_step),
                        ('lrclassifer9', LogisticRegression(max_iter=500, random_state=0))])
    pipe_lr_10=Pipeline([('lrscaling10', RobustScaler()),
                        ('lrimbalance10', RandomOverSampler(random_state=0)),
                        ('lr_preprocess10', preprocess_step),
                        ('lrclassifer10', LogisticRegression(max_iter=500, random_state=0))])
    pipe_lr_11=Pipeline([('lrscaling11', RobustScaler()),
                        ('lrimbalance11', SMOTE(random_state=0)),
                        ('lr_preprocess11', preprocess_step),
                        ('lrclassifer11', LogisticRegression(max_iter=500, random_state=0))])
    pipe_lr_12=Pipeline([('lrscaling12', RobustScaler()),
                        ('lrimbalance12', ADASYN(random_state=0)),
                        ('lr_preprocess12', preprocess_step),
                        ('lrclassifer12', LogisticRegression(max_iter=500, random_state=0))])
     
    
    #Random Forest pipeline with Data imbalance and Scaling functions
    pipe_rf_1=Pipeline([('rfscaling1', StandardScaler()),
                        ('rfimbalance1', RandomOverSampler(random_state=0)),
                        ('rf_preprocess1', preprocess_step),
                        ('rfclassifer1', RandomForestClassifier(random_state=0, criterion='gini'))])
    pipe_rf_2=Pipeline([('rfscaling2', StandardScaler()),
                        ('rfimbalance2', SMOTE(random_state=0)),
                        ('rf_preprocess2', preprocess_step),
                        ('rfclassifer2', RandomForestClassifier(random_state=0, criterion='gini'))])
    pipe_rf_3=Pipeline([('rfscaling3', StandardScaler()),
                        ('rfimbalance3', ADASYN(random_state=0)),
                        ('rf_preprocess3', preprocess_step),
                        ('rfclassifer3', RandomForestClassifier(random_state=0, criterion='gini'))])
    pipe_rf_4=Pipeline([('rfscaling4', Normalizer()),
                        ('rfimbalance4', RandomOverSampler(random_state=0)),
                        ('rf_preprocess4', preprocess_step),
                        ('rfclassifer4', RandomForestClassifier(random_state=0, criterion='gini'))])
    pipe_rf_5=Pipeline([('rfscaling5', Normalizer()),
                        ('rfimbalance5', SMOTE(random_state=0)),
                        ('rf_preprocess5', preprocess_step),
                        ('rfclassifer5', RandomForestClassifier(random_state=0, criterion='gini'))])
    pipe_rf_6=Pipeline([('rfscaling6', Normalizer()),
                        ('rfimbalance6', ADASYN(random_state=0)),
                        ('rf_preprocess6', preprocess_step),
                        ('rfclassifer6', RandomForestClassifier(random_state=0, criterion='gini'))])
    pipe_rf_7=Pipeline([('rfscaling7', MinMaxScaler()),
                        ('rfimbalance7', RandomOverSampler(random_state=0)),
                        ('rf_preprocess7', preprocess_step),
                        ('rfclassifer7', RandomForestClassifier(random_state=0, criterion='gini'))])
    pipe_rf_8=Pipeline([('rfscaling8', MinMaxScaler()),
                        ('rfimbalance8', SMOTE(random_state=0)),
                        ('rf_preprocess8', preprocess_step),
                        ('rfclassifer8', RandomForestClassifier(random_state=0, criterion='gini'))])
    pipe_rf_9=Pipeline([('rfscaling9', MinMaxScaler()),
                        ('rfimbalance9', ADASYN(random_state=0)),
                        ('rf_preprocess9', preprocess_step),
                        ('rfclassifer9', RandomForestClassifier(random_state=0, criterion='gini'))])
    pipe_rf_10=Pipeline([('rfscaling10', RobustScaler()),
                        ('rfimbalance10', RandomOverSampler(random_state=0)),
                        ('rf_preprocess10', preprocess_step),
                        ('rfclassifer10', RandomForestClassifier(random_state=0, criterion='gini'))])
    pipe_rf_11=Pipeline([('rfscaling11', RobustScaler()),
                        ('rfimbalance11', SMOTE(random_state=0)),
                        ('rf_preprocess11', preprocess_step),
                        ('rfclassifer11', RandomForestClassifier(random_state=0, criterion='gini'))])
    pipe_rf_12=Pipeline([('rfscaling12', RobustScaler()),
                        ('rfimbalance12', ADASYN(random_state=0)),
                        ('rf_preprocess12', preprocess_step),
                        ('rfclassifer12', RandomForestClassifier(random_state=0, criterion='gini'))])

    #Gradient Boosting Classifier pipeline with Data imbalance and Scaling functions
    pipe_gbc_1=Pipeline([('gbcscaling1', StandardScaler()),
                        ('gbcimbalance1', RandomOverSampler(random_state=0)),
                        ('gbc_preprocess1', preprocess_step),
                        ('gbcclassifer1', GradientBoostingClassifier(random_state=0))])
    pipe_gbc_2=Pipeline([('gbcscaling2', StandardScaler()),
                        ('gbcimbalance2', SMOTE(random_state=0)),
                        ('gbc_preprocess2', preprocess_step),
                        ('gbcclassifer2', GradientBoostingClassifier(random_state=0))])
    pipe_gbc_3=Pipeline([('gbcscaling3', StandardScaler()),
                        ('gbcimbalance3', ADASYN(random_state=0)),
                        ('gbc_preprocess3', preprocess_step),
                        ('gbcclassifer3', GradientBoostingClassifier(random_state=0))])
    pipe_gbc_4=Pipeline([('gbcscaling4', Normalizer()),
                        ('gbcimbalance4', RandomOverSampler(random_state=0)),
                        ('gbc_preprocess4', preprocess_step),
                        ('gbcclassifer4', GradientBoostingClassifier(random_state=0))])
    pipe_gbc_5=Pipeline([('gbcscaling5', Normalizer()),
                        ('gbcimbalance5', SMOTE(random_state=0)),
                        ('gbc_preprocess5', preprocess_step),
                        ('gbcclassifer5', GradientBoostingClassifier(random_state=0))])
    pipe_gbc_6=Pipeline([('gbcscaling6', Normalizer()),
                        ('gbcimbalance6', ADASYN(random_state=0)),
                        ('gbc_preprocess6', preprocess_step),
                        ('gbcclassifer6', GradientBoostingClassifier(random_state=0))])
    pipe_gbc_7=Pipeline([('gbcscaling7', MinMaxScaler()),
                        ('gbcimbalance7', RandomOverSampler(random_state=0)),
                        ('gbc_preprocess7', preprocess_step),
                        ('gbcclassifer7', GradientBoostingClassifier(random_state=0))])
    pipe_gbc_8=Pipeline([('gbcscaling8', MinMaxScaler()),
                        ('gbcimbalance8', SMOTE(random_state=0)),
                        ('gbc_preprocess8', preprocess_step),
                        ('gbcclassifer8', GradientBoostingClassifier(random_state=0))])
    pipe_gbc_9=Pipeline([('gbcscaling9', MinMaxScaler()),
                        ('gbcimbalance9', ADASYN(random_state=0)),
                        ('gbc_preprocess9', preprocess_step),
                        ('gbcclassifer9', GradientBoostingClassifier(random_state=0))]) 
    pipe_gbc_10=Pipeline([('gbcscaling10', RobustScaler()),
                        ('gbcimbalance10', RandomOverSampler(random_state=0)),
                        ('gbc_preprocess10', preprocess_step),
                        ('gbcclassifer10', GradientBoostingClassifier(random_state=0))])
    pipe_gbc_11=Pipeline([('gbcscaling11', RobustScaler()),
                        ('gbcimbalance11', SMOTE(random_state=0)),
                        ('gbc_preprocess11', preprocess_step),
                        ('gbcclassifer11', GradientBoostingClassifier(random_state=0))])
    pipe_gbc_12=Pipeline([('gbcscaling12', RobustScaler()),
                        ('gbcimbalance12', ADASYN(random_state=0)),
                        ('gbc_preprocess12', preprocess_step),
                        ('gbcclassifer12', GradientBoostingClassifier(random_state=0))])  

    #Light Gradient Boosting Classifier pipeline with Data imbalance and Scaling functions
    pipe_lgbm_1=Pipeline([('lgbmscaling1', StandardScaler()),
                        ('lgbmimbalance1', RandomOverSampler(random_state=0)),
                        ('lgbm_preprocess1', preprocess_step),
                        ('lgbmclassifer1', LGBMClassifier(random_state=0))])
    pipe_lgbm_2=Pipeline([('lgbmscaling2', StandardScaler()),
                        ('lgbmimbalance2', SMOTE(random_state=0)),
                        ('lgbm_preprocess2', preprocess_step),
                        ('lgbmclassifer2', LGBMClassifier(random_state=0))])
    pipe_lgbm_3=Pipeline([('lgbmscaling3', StandardScaler()),
                        ('lgbmimbalance3', ADASYN(random_state=0)),
                        ('lgbm_preprocess3', preprocess_step),
                        ('lgbmclassifer3', LGBMClassifier(random_state=0))])
    pipe_lgbm_4=Pipeline([('lgbmscaling4', Normalizer()),
                        ('lgbmimbalance4', RandomOverSampler(random_state=0)),
                        ('lgbm_preprocess4', preprocess_step),
                        ('lgbmclassifer4', LGBMClassifier(random_state=0))])
    pipe_lgbm_5=Pipeline([('lgbmscaling5', Normalizer()),
                        ('lgbmimbalance5', SMOTE(random_state=0)),
                        ('lgbm_preprocess5', preprocess_step),
                        ('lgbmclassifer5', LGBMClassifier(random_state=0))])
    pipe_lgbm_6=Pipeline([('lgbmscaling6', Normalizer()),
                        ('lgbmimbalance6', ADASYN(random_state=0)),
                        ('lgbm_preprocess6', preprocess_step),
                        ('lgbmclassifer6', LGBMClassifier(random_state=0))])
    pipe_lgbm_7=Pipeline([('lgbmscaling7', MinMaxScaler()),
                        ('lgbmimbalance7', RandomOverSampler(random_state=0)),
                        ('lgbm_preprocess7', preprocess_step),
                        ('lgbmclassifer7', LGBMClassifier(random_state=0))])
    pipe_lgbm_8=Pipeline([('lgbmscaling8', MinMaxScaler()),
                        ('lgbmimbalance8', SMOTE(random_state=0)),
                        ('lgbm_preprocess8', preprocess_step),
                        ('lgbmclassifer8', LGBMClassifier(random_state=0))])
    pipe_lgbm_9=Pipeline([('lgbmscaling9', MinMaxScaler()),
                        ('lgbmimbalance9', ADASYN(random_state=0)),
                        ('lgbm_preprocess9', preprocess_step),
                        ('lgbmclassifer9', LGBMClassifier(random_state=0))])   
    pipe_lgbm_10=Pipeline([('lgbmscaling10', RobustScaler()),
                        ('lgbmimbalance10', RandomOverSampler(random_state=0)),
                        ('lgbm_preprocess10', preprocess_step),
                        ('lgbmclassifer10', LGBMClassifier(random_state=0))])
    pipe_lgbm_11=Pipeline([('lgbmscaling11', RobustScaler()),
                        ('lgbmimbalance11', SMOTE(random_state=0)),
                        ('lgbm_preprocess11', preprocess_step),
                        ('lgbmclassifer11', LGBMClassifier(random_state=0))])
    pipe_lgbm_12=Pipeline([('lgbmscaling12', RobustScaler()),
                        ('lgbmimbalance12', ADASYN(random_state=0)),
                        ('lgbm_preprocess12', preprocess_step),
                        ('lgbmclassifer12', LGBMClassifier(random_state=0))])   
    
    
  
    #Extreme Gradient Boosting Classifier pipeline with Data imbalance and Scaling functions
    pipe_xgb_1=Pipeline([('xgbscaling1', StandardScaler()),
                        ('xgbimbalance1', RandomOverSampler(random_state=0)),
                        ('xgb_preprocess1', preprocess_step),
                        ('xgbclassifer1', XGBClassifier(random_state=0))])
    pipe_xgb_2=Pipeline([('xgbscaling2', StandardScaler()),
                        ('xgbimbalance2', SMOTE(random_state=0)),
                        ('xgb_preprocess2', preprocess_step),
                        ('xgbclassifer2', XGBClassifier(random_state=0))])
    pipe_xgb_3=Pipeline([('xgbscaling3', StandardScaler()),
                        ('xgbimbalance3', ADASYN(random_state=0)),
                        ('xgb_preprocess3', preprocess_step),
                        ('xgbclassifer3', XGBClassifier(random_state=0))])
    pipe_xgb_4=Pipeline([('xgbscaling4', Normalizer()),
                        ('xgbimbalance4', RandomOverSampler(random_state=0)),
                        ('xgb_preprocess4', preprocess_step),
                        ('xgbclassifer4', XGBClassifier(random_state=0))])
    pipe_xgb_5=Pipeline([('xgbscaling5', Normalizer()),
                        ('xgbimbalance5', SMOTE(random_state=0)),
                        ('xgb_preprocess5', preprocess_step),
                        ('xgbclassifer5', XGBClassifier(random_state=0))])
    pipe_xgb_6=Pipeline([('xgbscaling6', Normalizer()),
                        ('xgbimbalance6', ADASYN(random_state=0)),
                        ('xgb_preprocess6', preprocess_step),
                        ('xgbclassifer6', XGBClassifier(random_state=0))])
    pipe_xgb_7=Pipeline([('xgbscaling7', MinMaxScaler()),
                        ('xgbimbalance7', RandomOverSampler(random_state=0)),
                        ('xgb_preprocess7', preprocess_step),
                        ('xgbclassifer7', XGBClassifier(random_state=0))])
    pipe_xgb_8=Pipeline([('xgbscaling8', MinMaxScaler()),
                        ('xgbimbalance8', SMOTE(random_state=0)),
                        ('xgb_preprocess8', preprocess_step),
                        ('xgbclassifer8', XGBClassifier(random_state=0))])
    pipe_xgb_9=Pipeline([('xgbscaling9', MinMaxScaler()),
                        ('xgbimbalance9', ADASYN(random_state=0)),
                        ('xgb_preprocess9', preprocess_step),
                        ('xgbclassifer9', XGBClassifier(random_state=0))])   
    pipe_xgb_10=Pipeline([('xgbscaling10', RobustScaler()),
                        ('xgbimbalance10', RandomOverSampler(random_state=0)),
                        ('xgb_preprocess10', preprocess_step),
                        ('xgbclassifer10', XGBClassifier(random_state=0))])
    pipe_xgb_11=Pipeline([('xgbscaling11', RobustScaler()),
                        ('xgbimbalance11', SMOTE(random_state=0)),
                        ('xgb_preprocess11', preprocess_step),
                        ('xgbclassifer11', XGBClassifier(random_state=0))])
    pipe_xgb_12=Pipeline([('xgbscaling12', RobustScaler()),
                        ('xgbimbalance12', ADASYN(random_state=0)),
                        ('xgb_preprocess12', preprocess_step),
                        ('xgbclassifer12', XGBClassifier(random_state=0))])   
 
    #Adaptive Boosting Classifier pipelines with data imbalance and scaling functions
    pipe_ada_1=Pipeline([('adascaling1', StandardScaler()),
                        ('adaimbalance1', RandomOverSampler(random_state=0)),
                        ('ada_preprocess1', preprocess_step),
                        ('adaclassifer1', AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=0))])
    pipe_ada_2=Pipeline([('adascaling2', StandardScaler()),
                        ('adaimbalance2', SMOTE(random_state=0)),
                        ('ada_preprocess2', preprocess_step),
                        ('adaclassifer2', AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=0))])
    pipe_ada_3=Pipeline([('adascaling3', StandardScaler()),
                        ('adaimbalance3', ADASYN(random_state=0)),
                        ('ada_preprocess3', preprocess_step),
                        ('adaclassifer3', AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=0))])
    pipe_ada_4=Pipeline([('adascaling4', Normalizer()),
                        ('adaimbalance4', RandomOverSampler(random_state=0)),
                        ('ada_preprocess4', preprocess_step),
                        ('adaclassifer4', AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=0))])
    pipe_ada_5=Pipeline([('adascaling5', Normalizer()),
                        ('adaimbalance5', SMOTE(random_state=0)),
                        ('ada_preprocess5', preprocess_step),
                        ('adaclassifer5', AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=0))])
    pipe_ada_6=Pipeline([('adascaling6', Normalizer()),
                        ('adaimbalance6', ADASYN(random_state=0)),
                        ('ada_preprocess6', preprocess_step),
                        ('adaclassifer6', AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=0))])
    pipe_ada_7=Pipeline([('adascaling7', MinMaxScaler()),
                        ('adaimbalance7', RandomOverSampler(random_state=0)),
                        ('ada_preprocess7', preprocess_step),
                        ('adaclassifer7', AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=0))])
    pipe_ada_8=Pipeline([('adascaling8', MinMaxScaler()),
                        ('adaimbalance8', SMOTE(random_state=0)),
                        ('ada_preprocess8', preprocess_step),
                        ('adaclassifer8', AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=0))])
    pipe_ada_9=Pipeline([('adascaling9', MinMaxScaler()),
                        ('adaimbalance9', ADASYN(random_state=0)),
                        ('ada_preprocess9', preprocess_step),
                        ('adaclassifer9', AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=0))])   
    pipe_ada_10=Pipeline([('adascaling10', RobustScaler()),
                        ('adaimbalance10', RandomOverSampler(random_state=0)),
                        ('ada_preprocess10', preprocess_step),
                        ('adaclassifer10', AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=0))])
    pipe_ada_11=Pipeline([('adascaling11', RobustScaler()),
                        ('adaimbalance11', SMOTE(random_state=0)),
                        ('ada_preprocess11', preprocess_step),
                        ('adaclassifer11', AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=0))])
    pipe_ada_12=Pipeline([('adascaling12', RobustScaler()),
                        ('adaimbalance12', ADASYN(random_state=0)),
                        ('ada_preprocess12', preprocess_step),
                        ('adaclassifer12', AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=0))])    


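    #CatBoost Classifier pipelines with data imbalance and scaling functions
    #(the step names below reuse the 'ada...' prefix; they are labels only and do not affect the models)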
    pipe_catb_1=Pipeline([('adascaling1', StandardScaler()),
                        ('adaimbalance1', RandomOverSampler(random_state=0)),
                        ('ada_preprocess1', preprocess_step),
                        ('adaclassifer1', CatBoostClassifier(iterations=50, learning_rate=0.1, random_state=0, verbose=0))])
    pipe_catb_2=Pipeline([('adascaling2', StandardScaler()),
                        ('adaimbalance2', SMOTE(random_state=0)),
                        ('ada_preprocess2', preprocess_step),
                        ('adaclassifer2', CatBoostClassifier(iterations=50, learning_rate=0.1, random_state=0, verbose=0))])
    pipe_catb_3=Pipeline([('adascaling3', StandardScaler()),
                        ('adaimbalance3', ADASYN(random_state=0)),
                        ('ada_preprocess3', preprocess_step),
                        ('adaclassifer3', CatBoostClassifier(iterations=50, learning_rate=0.1, random_state=0, verbose=0))])
    pipe_catb_4=Pipeline([('adascaling4', Normalizer()),
                        ('adaimbalance4', RandomOverSampler(random_state=0)),
                        ('ada_preprocess4', preprocess_step),
                        ('adaclassifer4', CatBoostClassifier(iterations=50, learning_rate=0.1, random_state=0, verbose=0))])
    pipe_catb_5=Pipeline([('adascaling5', Normalizer()),
                        ('adaimbalance5', SMOTE(random_state=0)),
                        ('ada_preprocess5', preprocess_step),
                        ('adaclassifer5', CatBoostClassifier(iterations=50, learning_rate=0.1, random_state=0, verbose=0))])
    pipe_catb_6=Pipeline([('adascaling6', Normalizer()),
                        ('adaimbalance6', ADASYN(random_state=0)),
                        ('ada_preprocess6', preprocess_step),
                        ('adaclassifer6', CatBoostClassifier(iterations=50, learning_rate=0.1, random_state=0, verbose=0))])
    pipe_catb_7=Pipeline([('adascaling7', MinMaxScaler()),
                        ('adaimbalance7', RandomOverSampler(random_state=0)),
                        ('ada_preprocess7', preprocess_step),
                        ('adaclassifer7', CatBoostClassifier(iterations=50, learning_rate=0.1, random_state=0, verbose=0))])
    pipe_catb_8=Pipeline([('adascaling8', MinMaxScaler()),
                        ('adaimbalance8', SMOTE(random_state=0)),
                        ('ada_preprocess8', preprocess_step),
                        ('adaclassifer8', CatBoostClassifier(iterations=50, learning_rate=0.1, random_state=0, verbose=0))])
    pipe_catb_9=Pipeline([('adascaling9', MinMaxScaler()),
                        ('adaimbalance9', ADASYN(random_state=0)),
                        ('ada_preprocess9', preprocess_step),
                        ('adaclassifer9', CatBoostClassifier(iterations=50, learning_rate=0.1, random_state=0, verbose=0))])
    pipe_catb_10=Pipeline([('adascaling10', RobustScaler()),
                        ('adaimbalance10', RandomOverSampler(random_state=0)),
                        ('ada_preprocess10', preprocess_step),
                        ('adaclassifer10', CatBoostClassifier(iterations=50, learning_rate=0.1, random_state=0, verbose=0))])
    pipe_catb_11=Pipeline([('adascaling11', RobustScaler()),
                        ('adaimbalance11', SMOTE(random_state=0)),
                        ('ada_preprocess11', preprocess_step),
                        ('adaclassifer11', CatBoostClassifier(iterations=50, learning_rate=0.1, random_state=0, verbose=0))])
    pipe_catb_12=Pipeline([('adascaling12', RobustScaler()),
                        ('adaimbalance12', ADASYN(random_state=0)),
                        ('ada_preprocess12', preprocess_step),
                        ('adaclassifer12', CatBoostClassifier(iterations=50, learning_rate=0.1, random_state=0, verbose=0))])
    
    pipelines=[pipe_lrv_1,pipe_lrv_2,pipe_lrv_3,pipe_lrv_4,pipe_lrv_5,pipe_lrv_6,pipe_lrv_7,pipe_lrv_8,pipe_lrv_9,pipe_lrv_10,pipe_lrv_11,pipe_lrv_12,
            pipe_lr_1,pipe_lr_2,pipe_lr_3,pipe_lr_4,pipe_lr_5,pipe_lr_6,pipe_lr_7,pipe_lr_8,pipe_lr_9,pipe_lr_10,pipe_lr_11,pipe_lr_12,
            pipe_rf_1,pipe_rf_2,pipe_rf_3,pipe_rf_4,pipe_rf_5,pipe_rf_6,pipe_rf_7,pipe_rf_8,pipe_rf_9,pipe_rf_10,pipe_rf_11,pipe_rf_12,
            pipe_gbc_1,pipe_gbc_2,pipe_gbc_3,pipe_gbc_4,pipe_gbc_5,pipe_gbc_6,pipe_gbc_7,pipe_gbc_8,pipe_gbc_9,pipe_gbc_10,pipe_gbc_11,pipe_gbc_12,
            pipe_lgbm_1,pipe_lgbm_2,pipe_lgbm_3,pipe_lgbm_4,pipe_lgbm_5,pipe_lgbm_6,pipe_lgbm_7,pipe_lgbm_8,pipe_lgbm_9,pipe_lgbm_10,pipe_lgbm_11,pipe_lgbm_12,
            pipe_xgb_1,pipe_xgb_2,pipe_xgb_3,pipe_xgb_4,pipe_xgb_5,pipe_xgb_6,pipe_xgb_7,pipe_xgb_8,pipe_xgb_9,pipe_xgb_10,pipe_xgb_11,pipe_xgb_12,
            pipe_ada_1,pipe_ada_2,pipe_ada_3,pipe_ada_4,pipe_ada_5,pipe_ada_6,pipe_ada_7,pipe_ada_8,pipe_ada_9,pipe_ada_10,pipe_ada_11,pipe_ada_12,
            pipe_catb_1,pipe_catb_2,pipe_catb_3,pipe_catb_4,pipe_catb_5,pipe_catb_6,pipe_catb_7,pipe_catb_8,pipe_catb_9,pipe_catb_10,pipe_catb_11,pipe_catb_12]

    for pipe in pipelines:
        pipe.fit(X_tr, y_tr)
  
    for i,model in enumerate(pipelines):
        if (i>=0 and i<=11):
            MLA_compare.loc[row_index, 'MLA Name']='LogRegCV'
        elif (i>=12 and i<=23):
            MLA_compare.loc[row_index, 'MLA Name']='LogReg'
        elif (i>=24 and i<=35):
            MLA_compare.loc[row_index, 'MLA Name']='RF'
        elif (i>=36 and i<=47):
            MLA_compare.loc[row_index, 'MLA Name']='GB'
        elif (i>=48 and i<=59):
            MLA_compare.loc[row_index, 'MLA Name']='LightGBM'
        elif (i>=60 and i<=71):
            MLA_compare.loc[row_index, 'MLA Name']='XGB'
        elif (i>=72 and i<=83):
            MLA_compare.loc[row_index, 'MLA Name']='ADA'
        elif (i>=84 and i<=95):
            MLA_compare.loc[row_index, 'MLA Name']='CATB'


        #Compute every metric once and reuse the values below
        train_acc=model.score(X_tr,y_tr)
        test_acc=model.score(X_ts,y_ts)
        y_pred = model.predict(X_ts)
        test_recall=metrics.recall_score(y_ts, y_pred, average='weighted')
        test_precision=metrics.precision_score(y_ts, y_pred, average='weighted')
        test_f1=metrics.f1_score(y_ts, y_pred, average='weighted')

        MLA_compare.loc[row_index, 'MLA Parameters']=str(model.get_params())
        MLA_compare.loc[row_index, 'MLA Train Accuracy Mean']=train_acc
        MLA_compare.loc[row_index, 'MLA Test Accuracy Mean']=test_acc
        MLA_compare.loc[row_index, 'MLA Recall']=test_recall
        MLA_compare.loc[row_index, 'MLA Precision']=test_precision
        MLA_compare.loc[row_index, 'MLA F1 Score']=test_f1

        row_index=row_index+1

        #Track the best pipeline for each metric
        if (test_acc>best_accuracy):
            best_accuracy=test_acc
            best_accpipeline=model

        if (test_recall>best_recall):
            best_recall=test_recall
            bestrecpipeline=model

        if (test_precision>best_precision):
            best_precision=test_precision
            bestprcpipeline=model

        if (test_f1>best_f1):
            best_f1=test_f1
            bestf1pipeline=model

    #Gap between test and train accuracy in percentage points, computed once after the loop
    MLA_compare['Difference']= (MLA_compare['MLA Test Accuracy Mean']-MLA_compare['MLA Train Accuracy Mean'])*100

    #Storing the model with best f1 score   
    with open(modelfilename, 'wb') as file:  
        pickle.dump(bestf1pipeline, file)

    return MLA_compare
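
The index ranges in the loop above rely on every model family contributing exactly 12 pipelines in a fixed order. As a minimal sketch, assuming all eight families follow the scaler/sampler ordering visible in the XGB, AdaBoost and CatBoost blocks, the same lookup can be derived with integer division. The helper `describe_pipeline` and the three lists are illustrative names, not part of the notebook.

```python
from itertools import product

# Family order matches the order of the `pipelines` list; names match 'MLA Name'.
MODEL_NAMES = ['LogRegCV', 'LogReg', 'RF', 'GB', 'LightGBM', 'XGB', 'ADA', 'CATB']
# Assumed ordering inside each family: scalers vary slowest, samplers fastest.
SCALERS = ['StandardScaler', 'Normalizer', 'MinMaxScaler', 'RobustScaler']
SAMPLERS = ['RandomOverSampler', 'SMOTE', 'ADASYN']
COMBOS = list(product(SCALERS, SAMPLERS))  # 12 (scaler, sampler) pairs


def describe_pipeline(i):
    """Map a position in `pipelines` to (model name, scaler, sampler)."""
    family, offset = divmod(i, len(COMBOS))
    scaler, sampler = COMBOS[offset]
    return MODEL_NAMES[family], scaler, sampler


print(describe_pipeline(0))
print(describe_pipeline(64))
print(describe_pipeline(95))
```

Storing the scaler and sampler next to 'MLA Name' in this way would also make the result tables easier to read than the full `get_params()` string.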

In [None]:
def plot_test_accuracy(df):
    tmp_df=df.sort_values(by = ['MLA Test Accuracy Mean'], ascending = False)
    tmp_df.reset_index(drop=True, inplace=True)
    plt.figure(figsize=(10,10))
    sns.barplot(x='MLA Name', y='MLA Test Accuracy Mean', data=tmp_df.head(10))
    plt.title("Mean Test Accuracy of Models", fontdict={'fontweight':'bold'})
    plt.show()

In [None]:
def plot_train_accuracy(df):
    tmp_df=df.sort_values(by = ['MLA Train Accuracy Mean'], ascending = False)
    tmp_df.reset_index(drop=True, inplace=True)
    plt.figure(figsize=(10,10))
    sns.barplot(x='MLA Name', y='MLA Train Accuracy Mean', data=tmp_df.head(10))
    plt.title("Mean Train Accuracy of Models", fontdict={'fontweight':'bold'})
    plt.show()

In [None]:
def plot_recall(df):
    tmp_df=df.sort_values(by = ['MLA Recall'], ascending = False)
    tmp_df.reset_index(drop=True, inplace=True)
    plt.figure(figsize=(10,10))
    sns.barplot(x='MLA Name', y='MLA Recall', data=tmp_df.head(10))
    plt.title("Weighted Recall of Models", fontdict={'fontweight':'bold'})
    plt.show()

In [None]:
def plot_precision(df):
    tmp_df=df.sort_values(by = ['MLA Precision'], ascending = False)
    tmp_df.reset_index(drop=True, inplace=True)
    plt.figure(figsize=(10,10))
    sns.barplot(x='MLA Name', y='MLA Precision', data=tmp_df.head(10))
    plt.title("Weighted Precision of Models", fontdict={'fontweight':'bold'})
    plt.show()

In [None]:
def plot_f1score(df):
    tmp_df=df.sort_values(by = ['MLA F1 Score'], ascending = False)
    tmp_df.reset_index(drop=True, inplace=True)
    plt.figure(figsize=(10,10))
    sns.barplot(x='MLA Name', y='MLA F1 Score', data=tmp_df.head(10))
    plt.title("Weighted F1 Score of Models", fontdict={'fontweight':'bold'})
    plt.show()
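
One thing to keep in mind with the plot functions above: the top 10 rows usually contain the same 'MLA Name' several times, and seaborn's `barplot` aggregates rows that share an x value (by default it shows their mean with an error bar). The sketch below, which assumes a results frame shaped like `MLA_compare`, builds a unique label per row so that each configuration could get its own bar. The helper name `top_rows_with_labels` and the 'Config' column are hypothetical.

```python
import pandas as pd


def top_rows_with_labels(df, metric, n=10):
    """Return the n best rows for `metric` with a unique 'Config' label per row."""
    top = df.sort_values(by=metric, ascending=False).head(n).reset_index()
    # 'index' holds the original row position, i.e. the pipeline number.
    top['Config'] = top['MLA Name'] + ' #' + top['index'].astype(str)
    return top


# Toy data standing in for the real results table.
demo = pd.DataFrame({
    'MLA Name': ['XGB', 'XGB', 'RF', 'LightGBM'],
    'MLA F1 Score': [0.81, 0.79, 0.78, 0.80],
})
top = top_rows_with_labels(demo, 'MLA F1 Score', n=3)
print(top[['Config', 'MLA F1 Score']])
```

Passing `x='Config'` instead of `x='MLA Name'` to `sns.barplot` would then draw one bar per pipeline.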

<a id=section1401></a>
### **14.1 Modelling on basic dataset - adv_data with stratified - False while splitting the data for test/train**

In [None]:
df=train_predict_default_ml(adv_data, 0, 'model_adv_data_no_stratify').copy()

In [None]:
plot_test_accuracy(df)

In [None]:
df.sort_values(by='MLA Test Accuracy Mean', ascending=False).head(10)

In terms of test accuracy, Xtreme Gradient Boosting (XGB), LightGBM and Random Forest have performed better than the others.
Looking at the top 5 results, XGB is the best model.

In [None]:
plot_train_accuracy(df)

In [None]:
df.sort_values(by='MLA Train Accuracy Mean', ascending=False).head(10)

In terms of train accuracy, Random Forest and XGB are the best models, similar to the test accuracy results.
Here, however, Random Forest leads the top 5 results. That is expected: Random Forest tends to overfit, so it may perform better on training data than on test data.

In [None]:
plot_recall(df)

In [None]:
df.sort_values(by='MLA Recall', ascending=False).head(10)

In terms of recall, the result is similar: RF, XGB and LightGBM are the best models, with XGB performing best of all.
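
The recall ranking mirrors the test accuracy ranking for an arithmetic reason: recall averaged with `average='weighted'` weights each class by its share of the samples, which makes it equal to plain accuracy. The pure-Python sketch below (toy labels, hand-written helpers rather than the scikit-learn functions) shows the identity.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights equal to each class's support."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    # weight (support / n) times recall (hits / support) per class
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)


y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred))
```

If recall for the profitable class is what matters, `average=None` or `average='binary'` would report it separately.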

In [None]:
plot_precision(df)

In [None]:
df.sort_values(by='MLA Precision', ascending=False).head(10)

In terms of precision, the CatBoost classifier has performed marginally better than the other models, but LightGBM and XGB have also performed well.

In [None]:
plot_f1score(df)

In [None]:
df.sort_values(by='MLA F1 Score', ascending=False).head(10)

In terms of F1 score, XGB, LightGBM and Random Forest have performed best, with XGB the best of all.

<a id=section1402></a>
### **14.2 Modelling on basic dataset - adv_data with stratified - True while splitting the data for test/train**

In [None]:
df1=train_predict_default_ml(adv_data, 1, 'model_adv_data_stratify').copy()

In [None]:
plot_test_accuracy(df1)

In [None]:
df1.sort_values(by='MLA Test Accuracy Mean', ascending=False).head(10)

The same models lead as with stratified=FALSE, but test accuracy has increased with the flag set to TRUE.
GB, XGB, CatBoost and LightGBM are the best-performing models.

In [None]:
plot_train_accuracy(df1)

In [None]:
df1.sort_values(by='MLA Train Accuracy Mean', ascending=False).head(10)

The same models lead as in the test accuracy results; train accuracy has also increased with the stratified flag set to TRUE.

In [None]:
plot_recall(df1)

In [None]:
df1.sort_values(by='MLA Recall', ascending=False).head(10)

Similar models (XGB, GB, LightGBM and RF) lead as in the earlier cases; recall is better with stratified=TRUE.

In [None]:
plot_precision(df1)

In [None]:
df1.sort_values(by='MLA Precision', ascending=False).head(10)

With stratified=TRUE, AdaBoost has the best precision, but its other metrics are comparatively low.
Other models such as LightGBM, LogisticRegressionCV and XGB have marginally lower precision but better values for the other metrics.

In [None]:
plot_f1score(df1)

In [None]:
df1.sort_values(by='MLA F1 Score', ascending=False).head(10)

In terms of F1 score, Gradient Boosting has performed well along with XGB, LightGBM and Random Forest, but the F1 score is marginally lower than with stratified=FALSE.

**Summary for adv_data with baseline models**

XGB, LightGBM, Gradient Boosting and Random Forest are the best-performing models with either stratified=TRUE or FALSE; CatBoost and AdaBoost also performed well with stratified=TRUE.
For the top 5 models, all metrics except F1 score increased slightly with stratified=TRUE; the F1 score is marginally lower.
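
To make the effect of the stratified flag concrete, here is a small sketch on made-up labels (80 negatives, 20 positives; not the advertisement data). With `stratify=y`, `train_test_split` keeps the class ratio of the full data in the test set, while a plain shuffle can drift away from it.

```python
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80% class 0, 20% class 1.
y = [0] * 80 + [1] * 20
X = [[i] for i in range(len(y))]

# Same split settings as in the notebook, without and with stratification.
_, _, _, y_ts_plain = train_test_split(X, y, test_size=0.20, random_state=0)
_, _, _, y_ts_strat = train_test_split(X, y, test_size=0.20, random_state=0, stratify=y)

print("positive share, plain split     :", sum(y_ts_plain) / len(y_ts_plain))
print("positive share, stratified split:", sum(y_ts_strat) / len(y_ts_strat))
```

Differences between the stratified and non-stratified runs therefore partly reflect which rows landed in the test set, not only model quality.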

<a id=section1403></a>
### **14.3 Modelling on adv_data_1 with stratified - False while splitting the data for test/train**

In [None]:
df2=train_predict_default_ml(adv_data_1, 0, 'model_adv_data_1_no_stratify').copy()

In [None]:
plot_test_accuracy(df2)

In [None]:
df2.sort_values(by='MLA Test Accuracy Mean', ascending=False).head(10)

The same models lead as for adv_data, and test accuracy for the top few results is higher than it was for adv_data with stratified=FALSE.

In [None]:
plot_train_accuracy(df2)

In [None]:
df2.sort_values(by='MLA Train Accuracy Mean', ascending=False).head(10)

The same models lead as for adv_data with stratified=FALSE. Train accuracy is nearly the same in both cases, with adv_data (stratified=FALSE) slightly lower.

In [None]:
plot_recall(df2)

In [None]:
df2.sort_values(by='MLA Recall', ascending=False).head(10)

Recall has decreased compared to the results obtained from adv_data, with or without stratification.

In [None]:
plot_precision(df2)

In [None]:
df2.sort_values(by='MLA Precision', ascending=False).head(10)

Overall precision of the models has decreased compared to the adv_data results.

In [None]:
plot_f1score(df2)

In [None]:
df2.sort_values(by='MLA F1 Score', ascending=False).head(10)

The F1 score is nearly the same as in the adv_data results: marginally higher for a few top results, but with little difference overall.
The other metrics (precision, recall and accuracy) also look satisfactory, alongside the better F1 score in this case.

**Summary**

Recall and precision values are marginally lower than in the adv_data results, but accuracy and F1 score increased marginally.

<a id=section1404></a>
### **14.4 Modelling on adv_data_1 with stratified - True while splitting the data for test/train**

In [None]:
df3=train_predict_default_ml(adv_data_1, 1, 'model_adv_data_1_stratify').copy()

In [None]:
plot_test_accuracy(df3)

In [None]:
df3.sort_values(by='MLA Test Accuracy Mean', ascending=False).head(10)

Test accuracy is almost the same as for adv_data with stratified=TRUE, with the same models performing best: Gradient Boosting, XGB, CatBoost and LightGBM.

In [None]:
plot_train_accuracy(df3)

In [None]:
df3.sort_values(by='MLA Train Accuracy Mean', ascending=False).head(10)

Train accuracy is the same as in adv_data.

In [None]:
plot_recall(df3)

In [None]:
df3.sort_values(by='MLA Recall', ascending=False).head(10)

Recall is also similar to adv_data, with the same models.

In [None]:
plot_precision(df3)

In [None]:
df3.sort_values(by='MLA Precision', ascending=False).head(10)

Precision is the same as in the adv_data results.

In [None]:
plot_f1score(df3)

In [None]:
df3.sort_values(by='MLA F1 Score', ascending=False).head(10)

F1 score is also same as in adv_data

<a id=section1405></a>
### **14.5 Modelling on adv_data_2 with stratified - False while splitting the data for test/train**

In [None]:
df4=train_predict_default_ml(adv_data_2, 0, 'model_adv_data_2_no_stratify').copy()

In [None]:
plot_test_accuracy(df4)

In [None]:
df4.sort_values(by='MLA Test Accuracy Mean', ascending=False).head(10)

Test accuracies have increased compared to adv_data and adv_data_1 with stratified=FALSE, with the same models leading: XGB, GB and LightGBM.

In [None]:
plot_train_accuracy(df4)

In [None]:
df4.sort_values(by='MLA Train Accuracy Mean', ascending=False).head(10)

Train accuracies have also increased, with the same models.

In [None]:
plot_recall(df4)

In [None]:
df4.sort_values(by='MLA Recall', ascending=False).head(10)

Recall has also increased, with the same models.

In [None]:
plot_precision(df4)

In [None]:
df4.sort_values(by='MLA Precision', ascending=False).head(10)

Precision has also increased compared to adv_data and adv_data_1.

In [None]:
plot_f1score(df4)

In [None]:
df4.sort_values(by='MLA F1 Score', ascending=False).head(10)

F1 score has also increased, with the same models, compared to adv_data and adv_data_1.

**Summary**

With adv_data_2, all the metric values have increased, so it is preferred over the other 2 datasets for further experimentation.

<a id=section1406></a>
### **14.6 Modelling on adv_data_2 with stratified - True while splitting the data for test/train**

In [None]:
df5=train_predict_default_ml(adv_data_2, 1, 'model_adv_data_2_stratify').copy()

In [None]:
plot_test_accuracy(df5)

In [None]:
df5.sort_values(by='MLA Test Accuracy Mean', ascending=False).head(10)

Test accuracies have decreased as compared to adv_data and adv_data_1 with stratified=TRUE

Also, adv_data_2 with stratified=FALSE has better accuracy

In [None]:
plot_train_accuracy(df5)

In [None]:
df5.sort_values(by='MLA Train Accuracy Mean', ascending=False).head(10)

Train accuracies have increased and are marginally higher than for the other datasets and for adv_data_2 with stratified=FALSE.

In [None]:
plot_recall(df5)

In [None]:
df5.sort_values(by='MLA Recall', ascending=False).head(10)

Recall has decreased compared to the other datasets.

In [None]:
plot_precision(df5)

In [None]:
df5.sort_values(by='MLA Precision', ascending=False).head(10)

Precision has also decreased compared to the other datasets and to adv_data_2 with stratified=FALSE.

In [None]:
plot_f1score(df5)

In [None]:
df5.sort_values(by='MLA F1 Score', ascending=False).head(10)

F1 score also decreased as compared to previous results

<a id=section1407></a>
### **14.7 Summary on baseline model results**

 - adv_data_2 with stratified=FALSE has given the best results
 - XGB, GB, LightGBM and RF are the best-performing models

**Below are the top 5 performing model configurations (considering F1 score as final evaluation parameter) - all are XGB or LightGBM**

In [None]:
df4.sort_values(by='MLA F1 Score', ascending=False).head()

It was also noticed that StandardScaler and RobustScaler performed better, possibly for the reasons below:
 - StandardScaler: ratings and runtime values were on different scales (0-1 and 10-100 respectively), so it brought them to the same scale
 - RobustScaler: it handled the large number of outliers in both columns
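
As a rough illustration of the two points above, the sketch below scales a made-up runtime column containing one extreme value (the numbers are invented, not taken from the dataset). StandardScaler centres on the mean and divides by the standard deviation, so the outlier pushes the ordinary values below zero; RobustScaler centres on the median and divides by the interquartile range, so the ordinary values keep a sensible spread.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical runtime values with a single outlier.
runtime = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

standard = StandardScaler().fit_transform(runtime).ravel()
robust = RobustScaler().fit_transform(runtime).ravel()

print("StandardScaler:", np.round(standard, 3))
print("RobustScaler  :", np.round(robust, 3))
```

Tree-based models such as Random Forest or XGB split on thresholds and are largely insensitive to monotonic rescaling, so part of the observed difference may instead come from the samplers and the `f_classif` feature selection, which do depend on distances and variances.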

<a id=section15></a>
## **15. Machine learning - Analyzing tuned models (and Ensemble models)**

<a id=section1501></a>
### **15.1 Hyper-parameter tuning**

We will tune the parameters for Xtreme Gradient Boosting, LightGBM, Gradient Boosting and Random Forest, combined with scaling (StandardScaler, RobustScaler and MinMaxScaler) and with data imbalance techniques (RandomOverSampler, SMOTE and ADASYN).
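
For intuition about the imbalance step, here is a hand-rolled sketch of random oversampling: minority rows are duplicated at random until every class is as large as the biggest one. It is an illustration of the idea only, not imbalanced-learn's `RandomOverSampler` implementation, and the function name `random_oversample` is hypothetical. SMOTE and ADASYN differ in that they synthesize new minority points by interpolating between neighbours instead of duplicating rows.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Draw with replacement from this class's rows to fill the gap.
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Resampling should only ever touch the training data; placing the sampler inside the pipeline, as this notebook does, keeps the test split untouched when the pipeline is imbalanced-learn's `Pipeline`.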

We also use the dataset adv_data_2 with stratified=FALSE, as it provided the best results with the baseline models.
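
The tuning functions in this section use `RandomizedSearchCV` with `n_iter=10`, so each search evaluates only 10 parameter combinations drawn from the grid rather than the full grid. The sketch below uses a deliberately small toy grid (not the notebook's) and scikit-learn's `ParameterSampler` to show what such a draw looks like.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Toy grid for illustration; the real grids in this section are far larger.
toy_grid = {
    'classifier__n_estimators': [10, 100, 500],
    'classifier__max_depth': [2, 4, 8, 16],
    'classifier__max_features': ['sqrt', 'log2'],
}

total = len(ParameterGrid(toy_grid))
sampled = list(ParameterSampler(toy_grid, n_iter=10, random_state=0))

print("combinations in the full grid:", total)
print("combinations actually tried  :", len(sampled))
for params in sampled[:3]:
    print(params)
```

Because only a small random subset is tried, passing `random_state` to `RandomizedSearchCV` is what makes a reported "best" configuration reproducible between runs.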

<a id=section15011></a>
#### **15.1.1 Hyper parameter tuning in Random Forest**

In [None]:
#Function to train Random Forest using hyper parameter tuning
def process_rf(data,stratified):
    """
    data - input data
    stratified flag - wether to set the stratified flag to TRUE or FALSE (always FALSE in this case)
    """
    X=data.drop('netgain', axis=1)
    y=data['netgain']
    
    if (True==stratified):
        X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = .20, random_state = 0, stratify=y, shuffle=True)
    else:
        X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = .20, random_state = 0)
        
    preprocess_step=FeatureUnion([('kbest', SelectKBest(score_func=f_classif, k=15))])
    
    pipe_1=Pipeline([('scaling', StandardScaler()),
                    ('imbalance', RandomOverSampler(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', RandomForestClassifier(random_state=0))])
    pipe_2=Pipeline([('scaling', StandardScaler()),
                    ('imbalance', SMOTE(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', RandomForestClassifier(random_state=0))])
    pipe_3=Pipeline([('scaling', StandardScaler()),
                    ('imbalance', ADASYN(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', RandomForestClassifier(random_state=0))])
    pipe_4=Pipeline([('scaling', RobustScaler()),
                    ('imbalance', RandomOverSampler(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', RandomForestClassifier(random_state=0))])
    pipe_5=Pipeline([('scaling', RobustScaler()),
                    ('imbalance', SMOTE(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', RandomForestClassifier(random_state=0))])
    pipe_6=Pipeline([('scaling', RobustScaler()),
                    ('imbalance', ADASYN(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', RandomForestClassifier(random_state=0))])
    pipe_7=Pipeline([('scaling', MinMaxScaler()),
                    ('imbalance', RandomOverSampler(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', RandomForestClassifier(random_state=0))])
    pipe_8=Pipeline([('scaling', MinMaxScaler()),
                    ('imbalance', SMOTE(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', RandomForestClassifier(random_state=0))])
    pipe_9=Pipeline([('scaling', MinMaxScaler()),
                    ('imbalance', ADASYN(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', RandomForestClassifier(random_state=0))])
    
    
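    #Only values RandomForestClassifier accepts: n_estimators>=1, max_depth>=1, min_samples_split>=2,
    #min_samples_leaf>=1; 'auto' for max_features was deprecated and later removed in newer scikit-learn
    param_grid= {
                  'classifier__n_estimators': [10, 100, 500],
                  'classifier__criterion':['gini'],
                  'classifier__max_depth':range(2,50,2),
                  'classifier__min_samples_split':range(10,500, 10),
                  'classifier__min_samples_leaf':range(1,20, 1),
                  'classifier__max_features':['sqrt','log2']
               }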
    
    #Shared search settings; scoring='f1' is the binary F1 of the positive class,
    #and with a single scorer refit only needs to be True
    search_args=dict(param_distributions=param_grid, n_iter=10, cv=10, verbose=True, n_jobs=-1,
                     return_train_score=True, refit=True, scoring='f1')
    gs_1=RandomizedSearchCV(pipe_1, **search_args)
    gs_2=RandomizedSearchCV(pipe_2, **search_args)
    gs_3=RandomizedSearchCV(pipe_3, **search_args)
    gs_4=RandomizedSearchCV(pipe_4, **search_args)
    gs_5=RandomizedSearchCV(pipe_5, **search_args)
    gs_6=RandomizedSearchCV(pipe_6, **search_args)
    gs_7=RandomizedSearchCV(pipe_7, **search_args)
    gs_8=RandomizedSearchCV(pipe_8, **search_args)
    gs_9=RandomizedSearchCV(pipe_9, **search_args)
    
    grids=[gs_1,gs_2,gs_3,gs_4,gs_5,gs_6,gs_7,gs_8,gs_9]
    
    print('Performing model optimizations...')
   
    best_f1=0
    
    for gs in grids:
        gs.fit(X_tr, y_tr)
        y_pred = gs.predict(X_ts)
        test_f1=metrics.f1_score(y_ts, y_pred, average='weighted')
        print("Best parameters - : ", gs.best_params_)
        #scoring='f1', so best_score_ and score() report positive-class F1, not accuracy
        print("Best cross-validated F1 - :", gs.best_score_)
        print("Test F1 (positive class) -  : ", gs.score(X_ts, y_ts))
        print ("Recall Score - : ", metrics.recall_score(y_ts, y_pred, average='weighted'))
        print ("Precision Score - : ", metrics.precision_score(y_ts, y_pred, average='weighted'))
        print ("F1 Score - : ", test_f1)
        
        #Keep the search with the best weighted F1 on the test split
        if test_f1>best_f1:
            best_f1=test_f1
            best_gs=gs
     
    print("===========================================================")
    print ("Best Model values :")
    print("Best parameters - : ", best_gs.best_params_)
    print("Best cross-validated F1 - :", best_gs.best_score_)
    y_pred = best_gs.predict(X_ts)
    print("Test F1 (positive class) -  : ", best_gs.score(X_ts, y_ts))
    print ("Recall Score - : ", metrics.recall_score(y_ts, y_pred, average='weighted'))
    print ("Precision Score - : ", metrics.precision_score(y_ts, y_pred, average='weighted'))
    print ("F1 Score - : ", metrics.f1_score(y_ts, y_pred, average='weighted'))
    
    y_pred_proba = best_gs.predict_proba(X_ts)
    preds = y_pred_proba[:,1]
    fpr, tpr, threshold = metrics.roc_curve(y_ts, preds)
    roc_auc = metrics.auc(fpr, tpr)
    print(roc_auc)

    plt.figure()
    plt.plot(fpr, tpr, label='AUC ROC (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()
    
    
    print("===========================================================")
    
    return best_gs
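
The ROC part of the function above takes column 1 of `predict_proba` as the score for the positive class and integrates the resulting curve. As a small sketch on made-up scores, the area under that curve equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, which is a convenient way to read the AUC value.

```python
from sklearn import metrics

# Toy labels and positive-class scores (not model output from this notebook).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = metrics.roc_curve(y_true, scores)
roc_auc = metrics.auc(fpr, tpr)

# Pairwise interpretation: how often does a positive outrank a negative?
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
pairwise = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

print(roc_auc, pairwise, metrics.roc_auc_score(y_true, scores))
```

Unlike precision, recall and F1, the AUC does not depend on the 0.5 decision threshold, so it can look considerably better than the threshold-based metrics for the same model.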

In [None]:
best_rf_model=process_rf(adv_data_2, 0)

In [None]:
with open('rf_param_model', 'wb') as file:  
    pickle.dump(best_rf_model, file)
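
A saved model can be read back with `pickle.load`. The round trip below is a minimal sketch that uses a plain dictionary as a stand-in for the fitted search object and a temporary directory instead of the notebook's working directory; both are assumptions made so that the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted object; the notebook pickles `best_rf_model` itself.
model_stub = {'name': 'rf_param_model',
              'best_params': {'classifier__n_estimators': 500}}

path = os.path.join(tempfile.mkdtemp(), 'rf_param_model')

with open(path, 'wb') as file:
    pickle.dump(model_stub, file)

# Later, possibly in another session: read the object back.
with open(path, 'rb') as file:
    restored = pickle.load(file)

print(restored['best_params'])
```

Loading a real fitted pipeline additionally requires the same libraries (ideally the same versions) to be installed, and pickle files should only be loaded from trusted sources because unpickling can execute arbitrary code.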

**Best Random Forest Model information**

**Best Model :** best_rf_model

**Best parameters - :**  {'classifier__n_estimators': 500, 'classifier__min_samples_split': 20, 'classifier__min_samples_leaf': 2, 'classifier__max_features': 'log2', 'classifier__max_depth': 48, 'classifier__criterion': 'gini'}

**Best cross-validated F1 score - :** 0.6171399533822614

**Test F1 score (positive class) - :**  0.6307458143074581

**Recall Score (weighted) - :**  0.7671785028790787

**Precision Score (weighted) - :**  0.8329584077668938

**F1 Score (weighted) - :**  0.7824949126262221

**AUC-ROC Score - :** 0.8650290285244923
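
When reading these numbers it helps to keep two facts in mind. A search object configured with a scorer returns that scorer's value from `best_score_` and `score()`, and support-weighted recall is arithmetically identical to plain accuracy. The sketch below checks the second fact and contrasts it with binary F1. The helper functions and the toy predictions are assumptions for illustration only.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with class-support weights."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        total += (n / len(y_true)) * (hits / n)
    return total


def binary_f1(y_true, y_pred, pos_label=1):
    """F1 for the positive class only."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == pos_label and p != pos_label)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy imbalanced labels and predictions (illustrative, not from the dataset).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

acc = accuracy(y_true, y_pred)
w_recall = weighted_recall(y_true, y_pred)
f1_pos = binary_f1(y_true, y_pred)
print(acc, w_recall, f1_pos)
```

On imbalanced data the positive-class F1 can sit well below accuracy, which is the pattern seen between the first two values and the weighted recall reported for each tuned model.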

<a id=section15012></a>
#### **15.1.2 Hyper parameter tuning for Gradient Boosting Classifier**

In [None]:
#Function to process Gradient Boosting model using hyper parameter tuning
def process_gb(data,stratified):
    
    """
    data - input data
    stratified flag - whether to stratify the train/test split (always FALSE in this case)
    """

    X=data.drop('netgain', axis=1)
    y=data['netgain']
    
    if (True==stratified):
        X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = .20, random_state = 0, stratify=y, shuffle=True)
    else:
        X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = .20, random_state = 0)
        
    preprocess_step=FeatureUnion([('kbest', SelectKBest(score_func=f_classif, k=15))])
    
    pipe_1=Pipeline([('scaling', StandardScaler()),
                    ('imbalance', RandomOverSampler(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', GradientBoostingClassifier(random_state=0))])
    pipe_2=Pipeline([('scaling', StandardScaler()),
                    ('imbalance', SMOTE(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', GradientBoostingClassifier(random_state=0))])
    pipe_3=Pipeline([('scaling', StandardScaler()),
                    ('imbalance', ADASYN(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', GradientBoostingClassifier(random_state=0))])
    pipe_4=Pipeline([('scaling', RobustScaler()),
                    ('imbalance', RandomOverSampler(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', GradientBoostingClassifier(random_state=0))])
    pipe_5=Pipeline([('scaling', RobustScaler()),
                    ('imbalance', SMOTE(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', GradientBoostingClassifier(random_state=0))])
    pipe_6=Pipeline([('scaling', RobustScaler()),
                    ('imbalance', ADASYN(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', GradientBoostingClassifier(random_state=0))])
    pipe_7=Pipeline([('scaling', MinMaxScaler()),
                    ('imbalance', RandomOverSampler(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', GradientBoostingClassifier(random_state=0))])
    pipe_8=Pipeline([('scaling', MinMaxScaler()),
                    ('imbalance', SMOTE(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', GradientBoostingClassifier(random_state=0))])
    pipe_9=Pipeline([('scaling', MinMaxScaler()),
                    ('imbalance', ADASYN(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', GradientBoostingClassifier(random_state=0))])
    
    
    param_grid= {
                'classifier__n_estimators': [10, 100,250,500,750,1000,1250,1500,1750],
                'classifier__learning_rate': [0.15,0.1,0.05,0.01,0.005,0.001],
                'classifier__max_depth':range(2,50,2),
                'classifier__min_samples_split':range(2,500, 10),
                'classifier__min_samples_leaf':range(1,20, 1),
                # For this classifier 'auto' meant the same as 'sqrt'; it was removed in scikit-learn 1.3
                'classifier__max_features':['auto','sqrt','log2'],
                'classifier__subsample': [0.5, 0.6, 0.7,0.75,0.8,0.85,0.9,0.95,1]
               }

    gs_1=RandomizedSearchCV(pipe_1, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_2=RandomizedSearchCV(pipe_2, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_3=RandomizedSearchCV(pipe_3, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_4=RandomizedSearchCV(pipe_4, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_5=RandomizedSearchCV(pipe_5, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_6=RandomizedSearchCV(pipe_6, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_7=RandomizedSearchCV(pipe_7, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_8=RandomizedSearchCV(pipe_8, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_9=RandomizedSearchCV(pipe_9, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    
    grids=[gs_1,gs_2,gs_3,gs_4,gs_5,gs_6,gs_7,gs_8,gs_9]
    
    print('Performing model optimizations...')
   
    best_f1=0
    
    for gs in grids:
        gs.fit(X_tr, y_tr)
        print("Best parameters - : ", gs.best_params_)
        # scoring='f1': best_score_ and score() report F1 for the positive class, not accuracy
        print("Best cross-validated F1 score - :", gs.best_score_)
        y_pred = gs.predict(X_ts)
        print("Test F1 score (positive class) -  : ", gs.score(X_ts, y_ts))
        print ("Recall Score (weighted) - : ", metrics.recall_score(y_ts, y_pred, average='weighted'))
        print ("Precision Score (weighted) - : ", metrics.precision_score(y_ts, y_pred, average='weighted'))
        print ("F1 Score (weighted) - : ", metrics.f1_score(y_ts, y_pred, average='weighted'))
        
        if metrics.f1_score(y_ts, y_pred, average='weighted')>best_f1:
            best_f1=metrics.f1_score(y_ts, y_pred, average='weighted')
            best_gs=gs
     
    print("===========================================================")
    print ("Best Model values :")
    print("Best parameters - : ", best_gs.best_params_)
    print("Best cross-validated F1 score - :", best_gs.best_score_)
    y_pred = best_gs.predict(X_ts)
    print("Test F1 score (positive class) -  : ", best_gs.score(X_ts, y_ts))
    print ("Recall Score (weighted) - : ", metrics.recall_score(y_ts, y_pred, average='weighted'))
    print ("Precision Score (weighted) - : ", metrics.precision_score(y_ts, y_pred, average='weighted'))
    print ("F1 Score (weighted) - : ", metrics.f1_score(y_ts, y_pred, average='weighted'))
    
    y_pred_proba = best_gs.predict_proba(X_ts)
    preds = y_pred_proba[:,1]
    fpr, tpr, threshold = metrics.roc_curve(y_ts, preds)
    roc_auc = metrics.auc(fpr, tpr)
    print(roc_auc)

    plt.figure()
    plt.plot(fpr, tpr, label='AUC ROC (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()
    
    
    print("===========================================================")
    
    return best_gs

In [None]:
best_gb_model=process_gb(adv_data_2, 0)

In [None]:
with open('gb_param_model', 'wb') as file: 
    pickle.dump(best_gb_model, file)
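
To put the randomized search in perspective, the following sketch counts how many combinations the Gradient Boosting search space contains and how many model fits the nine searches perform. It copies the grid from `process_gb`. The scaler and sampler names are plain strings standing in for the real objects, and the fit count assumes one refit per search.

```python
from itertools import product
from math import prod

# Search space copied from process_gb (values only, no estimators involved).
param_grid = {
    "classifier__n_estimators": [10, 100, 250, 500, 750, 1000, 1250, 1500, 1750],
    "classifier__learning_rate": [0.15, 0.1, 0.05, 0.01, 0.005, 0.001],
    "classifier__max_depth": range(2, 50, 2),
    "classifier__min_samples_split": range(2, 500, 10),
    "classifier__min_samples_leaf": range(1, 20, 1),
    "classifier__max_features": ["auto", "sqrt", "log2"],
    "classifier__subsample": [0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1],
}

# Number of distinct parameter combinations an exhaustive grid search would visit.
grid_size = prod(len(values) for values in param_grid.values())

# The nine pipelines are every scaler paired with every sampler.
scalers = ["StandardScaler", "RobustScaler", "MinMaxScaler"]
samplers = ["RandomOverSampler", "SMOTE", "ADASYN"]
pipelines = list(product(scalers, samplers))

n_iter, cv = 10, 10
search_fits = len(pipelines) * n_iter * cv
total_fits = search_fits + len(pipelines)  # plus one refit of the best candidate per search

print(grid_size, len(pipelines), total_fits)
```

Each search therefore samples 10 of roughly 33 million combinations, so the reported best parameters are a good candidate found under a small budget, not a guaranteed optimum.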

**Best Gradient Boosting Model information**

**Best Model :** best_gb_model

**Best parameters - :**  {'classifier__subsample': 1, 'classifier__n_estimators': 1250, 'classifier__min_samples_split': 412, 'classifier__min_samples_leaf': 9, 'classifier__max_features': 'auto', 'classifier__max_depth': 38, 'classifier__learning_rate': 0.01}

**Best cross-validated F1 score - :** 0.6238598604165437

**Test F1 score (positive class) - :**  0.635900700190961

**Recall Score (weighted) - :**  0.7804222648752399

**Precision Score (weighted) - :**  0.8310411097706167

**F1 Score (weighted) - :**  0.7934883805062174

**AUC-ROC Score - :** 0.8660632686289025


Compared with Random Forest, every metric improved except precision, which dipped slightly (0.8330 to 0.8310).

<a id=section15013></a>
#### **15.1.3 Hyper parameter tuning for Xtreme Gradient Boosting Classifier**

In [None]:
#Function to process Xtreme Gradient Boosting model using hyper parameter tuning
def process_xgb(data,stratified):
    
    """
    data - input data
    stratified flag - whether to stratify the train/test split (always FALSE in this case)
    """

    X=data.drop('netgain', axis=1)
    y=data['netgain']
    
    if (True==stratified):
        X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = .20, random_state = 0, stratify=y, shuffle=True)
    else:
        X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = .20, random_state = 0)
        
    preprocess_step=FeatureUnion([('kbest', SelectKBest(score_func=f_classif, k=15))])
    
    pipe_1=Pipeline([('scaling', StandardScaler()),
                    ('imbalance', RandomOverSampler(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', XGBClassifier(random_state=0))])
    pipe_2=Pipeline([('scaling', StandardScaler()),
                    ('imbalance', SMOTE(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', XGBClassifier(random_state=0))])
    pipe_3=Pipeline([('scaling', StandardScaler()),
                    ('imbalance', ADASYN(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', XGBClassifier(random_state=0))])
    pipe_4=Pipeline([('scaling', RobustScaler()),
                    ('imbalance', RandomOverSampler(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', XGBClassifier(random_state=0))])
    pipe_5=Pipeline([('scaling', RobustScaler()),
                    ('imbalance', SMOTE(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', XGBClassifier(random_state=0))])
    pipe_6=Pipeline([('scaling', RobustScaler()),
                    ('imbalance', ADASYN(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', XGBClassifier(random_state=0))])
    pipe_7=Pipeline([('scaling', MinMaxScaler()),
                    ('imbalance', RandomOverSampler(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', XGBClassifier(random_state=0))])
    pipe_8=Pipeline([('scaling', MinMaxScaler()),
                    ('imbalance', SMOTE(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', XGBClassifier(random_state=0))])
    pipe_9=Pipeline([('scaling', MinMaxScaler()),
                    ('imbalance', ADASYN(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', XGBClassifier(random_state=0))])
    
    
    param_grid= {
                'classifier__eta': [0.3, 0.2, 0.1 , 0.01, 0.001],
                'classifier__max_depth':range(2,20,2),
                'classifier__gamma':[i/10.0 for i in range(0,5)],
                'classifier__min_child_weight': range(2,20,2),
                'classifier__subsample':[0.5, 0.6, 0.7,0.75,0.8,0.85,0.9,0.95,1],
                'classifier__reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
               }

    gs_1=RandomizedSearchCV(pipe_1, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_2=RandomizedSearchCV(pipe_2, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_3=RandomizedSearchCV(pipe_3, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_4=RandomizedSearchCV(pipe_4, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_5=RandomizedSearchCV(pipe_5, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_6=RandomizedSearchCV(pipe_6, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_7=RandomizedSearchCV(pipe_7, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_8=RandomizedSearchCV(pipe_8, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_9=RandomizedSearchCV(pipe_9, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    
    grids=[gs_1,gs_2,gs_3,gs_4,gs_5,gs_6,gs_7,gs_8,gs_9]
    
    print('Performing model optimizations...')
   
    best_f1=0
    
    for gs in grids:
        gs.fit(X_tr, y_tr)
        print("Best parameters - : ", gs.best_params_)
        # scoring='f1': best_score_ and score() report F1 for the positive class, not accuracy
        print("Best cross-validated F1 score - :", gs.best_score_)
        y_pred = gs.predict(X_ts)
        print("Test F1 score (positive class) -  : ", gs.score(X_ts, y_ts))
        print ("Recall Score (weighted) - : ", metrics.recall_score(y_ts, y_pred, average='weighted'))
        print ("Precision Score (weighted) - : ", metrics.precision_score(y_ts, y_pred, average='weighted'))
        print ("F1 Score (weighted) - : ", metrics.f1_score(y_ts, y_pred, average='weighted'))
        
        if metrics.f1_score(y_ts, y_pred, average='weighted')>best_f1:
            best_f1=metrics.f1_score(y_ts, y_pred, average='weighted')
            best_gs=gs
     
    print("===========================================================")
    print ("Best Model values :")
    print("Best parameters - : ", best_gs.best_params_)
    print("Best cross-validated F1 score - :", best_gs.best_score_)
    y_pred = best_gs.predict(X_ts)
    print("Test F1 score (positive class) -  : ", best_gs.score(X_ts, y_ts))
    print ("Recall Score (weighted) - : ", metrics.recall_score(y_ts, y_pred, average='weighted'))
    print ("Precision Score (weighted) - : ", metrics.precision_score(y_ts, y_pred, average='weighted'))
    print ("F1 Score (weighted) - : ", metrics.f1_score(y_ts, y_pred, average='weighted'))
    
    y_pred_proba = best_gs.predict_proba(X_ts)
    preds = y_pred_proba[:,1]
    fpr, tpr, threshold = metrics.roc_curve(y_ts, preds)
    roc_auc = metrics.auc(fpr, tpr)
    print(roc_auc)

    plt.figure()
    plt.plot(fpr, tpr, label='AUC ROC (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()
    
    
    print("===========================================================")
    
    return best_gs

In [None]:
best_xgb_model=process_xgb(adv_data_2, 0)

In [None]:
with open('xgb_param_model', 'wb') as file:  
    pickle.dump(best_xgb_model, file)
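
Every pipeline above rebalances the classes before feature selection and classification. The simplest of the three samplers, random oversampling, can be sketched in a few lines of plain Python. The function `random_oversample` and the toy data are assumptions for illustration; the notebook itself relies on the samplers from imbalanced-learn, and SMOTE and ADASYN synthesize new points instead of duplicating existing ones.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, t in zip(X, y) if t == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))  # sampled with replacement
            y_out.append(label)
    return X_out, y_out


# Toy imbalanced data: four negatives, two positives (illustrative only).
X = [[0.1], [0.2], [0.3], [0.4], [0.9], [1.0]]
y = [0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Resampling belongs inside the pipeline, as done here, so that it is applied only to the training portion of each cross-validation split and never to the data used for scoring.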

**Best Xtreme Gradient Boosting Model information**

**Best Model :** best_xgb_model

**Best parameters - :**  {'classifier__subsample': 0.8, 'classifier__reg_alpha': 1e-05, 'classifier__min_child_weight': 4, 'classifier__max_depth': 4, 'classifier__gamma': 0.3, 'classifier__eta': 0.3}

**Best cross-validated F1 score - :** 0.6227975490043983

**Test F1 score (positive class) - :**  0.6409364125276812

**Recall Score (weighted) - :**  0.7821497120921305

**Precision Score (weighted) - :**  0.8344563671761712

**F1 Score (weighted) - :**  0.7953197635875338

**AUC-ROC Score - :** 0.871461222793621


Compared with Gradient Boosting, every test metric improved; only the best cross-validation score from the search is marginally lower (0.6228 vs 0.6239).

<a id=section15014></a>
#### **15.1.4 Hyper parameter tuning for LightGBM Classifier**

In [None]:
#Function to process Light GBM using hyper parameter tuning
def process_lgbm(data,stratified):

    """
    data - input data
    stratified flag - whether to stratify the train/test split (always FALSE in this case)
    """
        
    X=data.drop('netgain', axis=1)
    y=data['netgain']
    
    if (True==stratified):
        X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = .20, random_state = 0, stratify=y, shuffle=True)
    else:
        X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = .20, random_state = 0)
        
    preprocess_step=FeatureUnion([('kbest', SelectKBest(score_func=f_classif, k=15))])
    
    params = {'boosting_type': 'gbdt',
          'max_depth' : -1,
          'objective': 'binary',
          'nthread': 3, # Updated from nthread
          'num_leaves': 64,
          'learning_rate': 0.05,
          'max_bin': 512,
          'subsample_for_bin': 200,
          'subsample': 1,
          'subsample_freq': 1,
          'colsample_bytree': 0.8,
          'reg_alpha': 5,
          'reg_lambda': 10,
          'min_split_gain': 0.5,
          'min_child_weight': 1,
          'min_child_samples': 5,
          'scale_pos_weight': 1,
          'num_class' : 1,
          'metric' : 'binary_error'}

    lgbmdl = LGBMClassifier(boosting_type= 'gbdt',
          objective = 'binary',
          n_jobs = -1, 
          silent = True,
          max_depth = params['max_depth'],
          max_bin = params['max_bin'],
          subsample_for_bin = params['subsample_for_bin'],
          subsample = params['subsample'],
          subsample_freq = params['subsample_freq'],
          min_split_gain = params['min_split_gain'],
          min_child_weight = params['min_child_weight'],
          min_child_samples = params['min_child_samples'],
          scale_pos_weight = params['scale_pos_weight'],
        random_state=0)

    pipe_1=Pipeline([('scaling', StandardScaler()),
                    ('imbalance', RandomOverSampler(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', lgbmdl)])
    pipe_2=Pipeline([('scaling', StandardScaler()),
                    ('imbalance', SMOTE(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', lgbmdl)])
    pipe_3=Pipeline([('scaling', StandardScaler()),
                    ('imbalance', ADASYN(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', lgbmdl)])
    pipe_4=Pipeline([('scaling', RobustScaler()),
                    ('imbalance', RandomOverSampler(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', lgbmdl)])
    pipe_5=Pipeline([('scaling', RobustScaler()),
                    ('imbalance', SMOTE(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', lgbmdl)])
    pipe_6=Pipeline([('scaling', RobustScaler()),
                    ('imbalance', ADASYN(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', lgbmdl)])
    pipe_7=Pipeline([('scaling', MinMaxScaler()),
                    ('imbalance', RandomOverSampler(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', lgbmdl)])
    pipe_8=Pipeline([('scaling', MinMaxScaler()),
                    ('imbalance', SMOTE(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', lgbmdl)])
    pipe_9=Pipeline([('scaling', MinMaxScaler()),
                    ('imbalance', ADASYN(random_state=0)),
                    ('preprocess', preprocess_step),
                   ('classifier', lgbmdl)])

    param_grid = {
        'classifier__learning_rate': [0.1, 0.01, 0.005, 0.001],
        'classifier__n_estimators': [10, 100,250,500],
        'classifier__num_leaves': range(2,20, 1),  # LightGBM requires num_leaves > 1
        'classifier__boosting_type' : ['gbdt'],
        'classifier__objective' : ['binary'],
        'classifier__colsample_bytree' : [0.65, 0.66],
        'classifier__subsample' : [0.5, 0.6, 0.7,0.75,0.8,0.85,0.9,0.95,1],
        'classifier__reg_alpha' : [1,1.2],
        'classifier__reg_lambda' : [1,1.2,1.4]
    }


    gs_1=RandomizedSearchCV(pipe_1, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_2=RandomizedSearchCV(pipe_2, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_3=RandomizedSearchCV(pipe_3, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_4=RandomizedSearchCV(pipe_4, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_5=RandomizedSearchCV(pipe_5, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_6=RandomizedSearchCV(pipe_6, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_7=RandomizedSearchCV(pipe_7, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_8=RandomizedSearchCV(pipe_8, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    gs_9=RandomizedSearchCV(pipe_9, param_distributions=param_grid, n_iter=10, cv =10, verbose=True, n_jobs=-1, return_train_score=True, refit='f1_score', scoring='f1')
    
    grids=[gs_1,gs_2,gs_3,gs_4,gs_5,gs_6,gs_7,gs_8,gs_9]
    
    print('Performing model optimizations...')
   
    best_f1=0
    
    for gs in grids:
        gs.fit(X_tr, y_tr)
        print("Best parameters - : ", gs.best_params_)
        # scoring='f1': best_score_ and score() report F1 for the positive class, not accuracy
        print("Best cross-validated F1 score - :", gs.best_score_)
        y_pred = gs.predict(X_ts)
        print("Test F1 score (positive class) -  : ", gs.score(X_ts, y_ts))
        print ("Recall Score (weighted) - : ", metrics.recall_score(y_ts, y_pred, average='weighted'))
        print ("Precision Score (weighted) - : ", metrics.precision_score(y_ts, y_pred, average='weighted'))
        print ("F1 Score (weighted) - : ", metrics.f1_score(y_ts, y_pred, average='weighted'))
        
        if metrics.f1_score(y_ts, y_pred, average='weighted')>best_f1:
            best_f1=metrics.f1_score(y_ts, y_pred, average='weighted')
            best_gs=gs
     
    print("===========================================================")
    print ("Best Model values :")
    print("Best parameters - : ", best_gs.best_params_)
    print("Best cross-validated F1 score - :", best_gs.best_score_)
    y_pred = best_gs.predict(X_ts)
    print("Test F1 score (positive class) -  : ", best_gs.score(X_ts, y_ts))
    print ("Recall Score (weighted) - : ", metrics.recall_score(y_ts, y_pred, average='weighted'))
    print ("Precision Score (weighted) - : ", metrics.precision_score(y_ts, y_pred, average='weighted'))
    print ("F1 Score (weighted) - : ", metrics.f1_score(y_ts, y_pred, average='weighted'))
    
    y_pred_proba = best_gs.predict_proba(X_ts)
    preds = y_pred_proba[:,1]
    fpr, tpr, threshold = metrics.roc_curve(y_ts, preds)
    roc_auc = metrics.auc(fpr, tpr)
    print(roc_auc)

    plt.figure()
    plt.plot(fpr, tpr, label='AUC ROC (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()
    
    
    print("===========================================================")
    
    return best_gs

In [None]:
best_lgbm_model=process_lgbm(adv_data_2, 0)

In [None]:
with open('lgbm_param_model', 'wb') as file:  
    pickle.dump(best_lgbm_model, file)

**Best Light Gradient Boosting Model information**

**Best Model :** best_lgbm_model

**Best parameters - :**  {'classifier__subsample': 0.8, 'classifier__reg_lambda': 1, 'classifier__objective': 'binary', 'classifier__num_leaves': 11, 'classifier__n_estimators': 100, 'classifier__learning_rate': 0.1, 'classifier__eg_alpha': 1.2, 'classifier__colsample_bytree': 0.65, 'classifier__boosting_type': 'gbdt'}

The key `classifier__eg_alpha` reported above is a misspelling of `classifier__reg_alpha`; LightGBM most likely ignored it, so L1 regularization was probably not tuned in this recorded run.

**Best cross-validated F1 score - :** 0.6205174445915582

**Test F1 score (positive class) - :**  0.6340731853629273

**Recall Score (weighted) - :**  0.7658349328214972

**Precision Score (weighted) - :**  0.8368709448909321

**F1 Score (weighted) - :**  0.7816403922259263

**AUC-ROC Score - :** 0.8649997078073866


Compared with Xtreme Gradient Boosting and Gradient Boosting, every metric decreased except precision, which is the highest of the tuned models (0.8369).

**Summary on hyper parameter tuning of various models**

Xtreme Gradient Boosting performed best among the tuned models (compared on F1 score), with the following values:

**Parameters - :**  {'classifier__subsample': 0.8, 'classifier__reg_alpha': 1e-05, 'classifier__min_child_weight': 4, 'classifier__max_depth': 4, 'classifier__gamma': 0.3, 'classifier__eta': 0.3}

**Best cross-validated F1 score - :** 0.6227975490043983

**Test F1 score (positive class) - :**  0.6409364125276812

**Recall Score (weighted) - :**  0.7821497120921305

**Precision Score (weighted) - :**  0.8344563671761712

**F1 Score (weighted) - :**  0.7953197635875338

**AUC-ROC Score - :** 0.871461222793621
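
The comparison behind this summary can be reproduced from the recorded outputs. The snippet below is a small sketch that tabulates the six values reported for each tuned model, rounded to four decimals, and picks the leader per column. The column names are labels chosen for this example.

```python
# (model, CV score, test score, recall, precision, F1, ROC AUC),
# rounded to 4 decimals from the outputs recorded in this notebook.
rows = [
    ("Random Forest",            0.6171, 0.6307, 0.7672, 0.8330, 0.7825, 0.8650),
    ("Gradient Boosting",        0.6239, 0.6359, 0.7804, 0.8310, 0.7935, 0.8661),
    ("Xtreme Gradient Boosting", 0.6228, 0.6409, 0.7821, 0.8345, 0.7953, 0.8715),
    ("LightGBM",                 0.6205, 0.6341, 0.7658, 0.8369, 0.7816, 0.8650),
]
columns = ("cv", "test", "recall", "precision", "f1", "auc")
results = {name: dict(zip(columns, values)) for name, *values in rows}

# Leader for each metric.
best_by = {col: max(results, key=lambda m: results[m][col]) for col in columns}

# Models ranked by F1, the criterion used in this summary.
for name in sorted(results, key=lambda m: results[m]["f1"], reverse=True):
    print(f"{name:<26} F1={results[name]['f1']:.4f}  AUC={results[name]['auc']:.4f}")
print(best_by)
```

Xtreme Gradient Boosting leads on F1, recall, test score and ROC AUC, while LightGBM has the highest precision and Gradient Boosting the highest cross-validation score, so the ranking depends on which metric is prioritized.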


Ensembling techniques can now be applied to all hyper-parameter-tuned models and the best baseline model to check whether performance improves.

<a id=section1502></a>
### **15.2 Ensemble Techniques**

First, let's load and inspect all the models to be used in the ensemble.

In [None]:
#Xtreme Gradient Boosting Model
with open('xgb_param_model', 'rb') as file:  
    best_xgb_model = pickle.load(file)
    
best_xgb_model

In [None]:
#Gradient Boosting model
with open('gb_param_model', 'rb') as file:  
    best_gb_model = pickle.load(file)

best_gb_model

In [None]:
#Light Gradient Boosting
with open('lgbm_param_model', 'rb') as file:  
    best_lgbm_model = pickle.load(file)
    
best_lgbm_model

In [None]:
#Random Forest Model
with open('rf_param_model', 'rb') as file:  
    best_rf_model = pickle.load(file)

best_rf_model

In [None]:
#Best performing model from baseline models
with open('model_adv_data_2_no_stratify', 'rb') as file:  
    baseline_model = pickle.load(file)
    
baseline_model

<a id=section15021></a>
#### **15.2.1 Voting Classifier**

In [None]:
X=adv_data_2.drop('netgain', axis=1)
y=adv_data_2['netgain']

X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = .20, random_state = 0)

votingclf=VotingClassifier(estimators=[
            ('model1', baseline_model),
            ('model2', best_rf_model),
            ('model3', best_gb_model),
            ('model4', best_xgb_model),
            ('model5', best_lgbm_model)], voting='hard', n_jobs=-1)

votingclf=votingclf.fit(X_tr,y_tr)

print("===========================================================")
print ("Voting Classifier Model values :")
print("Training accuracy - :", votingclf.score(X_tr, y_tr))
y_pred = votingclf.predict(X_ts)
print("Test accuracy -  : ", votingclf.score(X_ts, y_ts))
print ("Recall Score - : ", metrics.recall_score(y_ts, y_pred, average='weighted'))
print ("Precision Score - : ", metrics.precision_score(y_ts, y_pred, average='weighted'))
print ("F1 Score - : ", metrics.f1_score(y_ts, y_pred, average='weighted'))
print("===========================================================")

In [None]:
with open('voting_model', 'wb') as file:  
    pickle.dump(votingclf, file)

With the Voting Classifier, the F1 score, recall and precision improved, along with both training and test accuracy.

So the Voting Classifier can be considered the final model from the ML perspective, unless the Stacking Classifier scores higher.
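
With `voting='hard'`, the ensemble predicts, for each sample, the label chosen by most of its five members. A minimal sketch of that rule follows. The function `hard_vote`, its tie-breaking choice and the toy predictions are assumptions for illustration; with five binary voters a tie cannot occur.

```python
from collections import Counter


def hard_vote(predictions):
    """predictions: one list of labels per model, all of the same length."""
    voted = []
    for sample_preds in zip(*predictions):
        counts = Counter(sample_preds)
        top = max(counts.values())
        # Assumed tie-break: prefer the smallest label among the most frequent ones.
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Toy predictions from five models on four samples (illustrative only).
model_preds = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
]
ensemble_pred = hard_vote(model_preds)
print(ensemble_pred)
```

Hard voting helps most when the members make different mistakes, because an error by one or two models is outvoted by the rest.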

<a id=section15022></a>
#### **15.2.2 Stacking**

In [None]:
X=adv_data_2.drop('netgain', axis=1)
y=adv_data_2['netgain']

X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = .20, random_state = 0)

stackclf=StackingClassifier(estimators=[
            ('model1', baseline_model),
            ('model2', best_rf_model),
            ('model3', best_gb_model),
            ('model4', best_xgb_model),
            ('model5', best_lgbm_model)], final_estimator=LogisticRegressionCV(cv=5, random_state=0), n_jobs=-1)

stackclf=stackclf.fit(X_tr,y_tr)

print("===========================================================")
print ("Stacking Classifier Model values :")
print("Training accuracy - :", stackclf.score(X_tr, y_tr))
y_pred = stackclf.predict(X_ts)
print("Test accuracy -  : ", stackclf.score(X_ts, y_ts))
print ("Recall Score - : ", metrics.recall_score(y_ts, y_pred, average='weighted'))
print ("Precision Score - : ", metrics.precision_score(y_ts, y_pred, average='weighted'))
print ("F1 Score - : ", metrics.f1_score(y_ts, y_pred, average='weighted'))
   
print("===========================================================")

In [None]:
with open('stacking_modelfile', 'wb') as file:  
    pickle.dump(stackclf, file)

Stacking further increased the F1 score and accuracy, so we take this model as the final one from the ML perspective.
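
Stacking differs from voting in that a final estimator (here `LogisticRegressionCV`) learns how to combine the base models, and it is trained on cross-validated predictions so that no base model scores rows it was fitted on. The sketch below illustrates that out-of-fold idea with one toy base learner. The helpers `kfold_indices` and `fit_midpoint_rule`, the contiguous folds and the hard 0/1 outputs are simplifying assumptions, not what `StackingClassifier` does internally in detail.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, holdout_idx) for contiguous, near-equal folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        holdout = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, holdout
        start += size


def fit_midpoint_rule(xs, ys):
    """Toy base learner: predict 1 when x is above the midpoint of the class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    cut = (mean0 + mean1) / 2
    return lambda x: 1 if x > cut else 0


# Toy one-feature dataset (illustrative only).
X = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Each row's meta-feature comes from a model that never saw that row.
meta_feature = [None] * len(X)
for train_idx, holdout_idx in kfold_indices(len(X), n_splits=5):
    model = fit_midpoint_rule([X[i] for i in train_idx], [y[i] for i in train_idx])
    for i in holdout_idx:
        meta_feature[i] = model(X[i])

print(meta_feature)
```

With five base models there would be one such column per model, and the final estimator is fitted on those columns against `y`.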

**Summary on final model from ML**

**Stacking Classifier built on the best baseline model, Random Forest, Gradient Boosting Classifier, Xtreme Gradient Boosting Classifier and LightGBM**

**Training accuracy** - : 0.8229196659948171

**Test accuracy** -  :  0.8166986564299424

**Recall Score** - :  0.8166986564299424

**Precision Score** - :  0.8042796289150841

**F1 Score** - :  0.8059686365556438

<a id=section16></a>
## **16. Deep Learning**

We now have the final model and its metric values from Machine Learning, so the main objective is to create a DNN model and check whether it performs better than the ML models on the given dataset.

<a id=section1601></a>
### **16.1  - Check the data distribution and whether the data is linearly separable based on output class**

**Plotting a pairplot to study the data using the original dataset (before any feature engineering or encoding)**

In [None]:
adv_data_orig.head()

In [None]:
plt.figure(figsize=(8,8))
sns.pairplot(data=adv_data_orig, hue='netgain')
plt.show()

From the above graph, we see that the data is not linearly separable.

Next, we plot the encoded data obtained earlier after feature engineering and other operations (adv_data), using only the important features identified earlier.

In [None]:
tmp=adv_data[['ratings','runtime','realtionship_status_Married-civ-spouse','industry_Pharma','realtionship_status_Never-married','airtime_Morning','airtime_Primetime','expensive','targeted_sex_Male','industry_Political','industry_Entertainment','industry_Other','netgain']]
tmp.head()

In [None]:
plt.figure(figsize=(10,10))
sns.pairplot(data=tmp, hue='netgain')
plt.show()

The data is not linearly separable by output class, so the network layers need non-linear activation functions.
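
Why non-linear activations matter can be seen on the classic XOR pattern, which no straight line separates. The sketch below uses a tiny hand-weighted two-unit network. The weights, the `two_layer` helper and the XOR inputs are assumptions chosen for illustration and have nothing to do with the advertisement data.

```python
def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def two_layer(x1, x2, activation):
    """Two hidden units with fixed weights, followed by a linear output unit."""
    h1 = activation(x1 + x2)
    h2 = activation(x1 + x2 - 1)
    return h1 - 2 * h2


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_targets = [0, 1, 1, 0]

# With ReLU the network reproduces XOR exactly.
with_relu = [two_layer(a, b, relu) for a, b in inputs]

# Without an activation the same weights collapse to one affine function of (x1, x2).
without_activation = [two_layer(a, b, identity) for a, b in inputs]

print(with_relu, without_activation)
```

Any affine function satisfies f(0,0) + f(1,1) = f(0,1) + f(1,0), so it cannot score both XOR positives above both negatives. Stacking more linear layers does not change that, which is why the hidden layers need a non-linearity.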

<a id=section1602></a>
### **16.2  - Function to Normalize the data**

We will create and analyze the neural network on adv_data, adv_data_1 and adv_data_2.

We will normalize the data with each of the four common techniques to see which works best with neural networks.

In [None]:
#Getting basic X_train, y_train, X_test and y_test before normalization for further use if required

train_dataset = adv_data.sample(frac=0.8,random_state=0)   
test_dataset = adv_data.drop(train_dataset.index)
    
X_train=train_dataset.drop('netgain', axis=1)
y_train=train_dataset['netgain']

X_test=test_dataset.drop('netgain', axis=1)
y_test=test_dataset['netgain']
    

In [None]:
def get_normalized_dataset(data, scaling):
        
    # Split the dataset passed in (not the global adv_data) so each call scales its own data
    train_dataset = data.sample(frac=0.8,random_state=0)   
    test_dataset = data.drop(train_dataset.index)

    print(train_dataset.shape)
    print(test_dataset.shape)
    
    X_train=train_dataset.drop('netgain', axis=1)
    y_train=train_dataset['netgain']

    X_test=test_dataset.drop('netgain', axis=1)
    y_test=test_dataset['netgain']
    
    print(X_train.shape)
    print(y_train.shape)
    print(X_test.shape)
    print(y_test.shape)


    if (1==scaling):
        norm_clf=StandardScaler()
    elif (2==scaling):
        norm_clf=MinMaxScaler()
    elif (3==scaling):
        norm_clf=Normalizer()
    elif (4==scaling):
        norm_clf=RobustScaler()
    else:
        raise ValueError("scaling must be 1, 2, 3 or 4")
        
    X_train_norm=norm_clf.fit_transform(X_train)
    X_test_norm=norm_clf.transform(X_test)
    
    print ("\nNormalized Train data :\n")
    print (X_train_norm[0])
    print ("\nNormalized Test data :\n")
    print (X_test_norm[0])
    
    return X_train_norm, X_test_norm

#### **Scaling adv_data using various methods like StandardScaler, MinMaxScaler, RobustScaler and Normalizer**

In [None]:
#Scaling using StandardScaler
X_train_norm11, X_test_norm11=get_normalized_dataset(adv_data,1)

In [None]:
#Scaling using MinMaxScaler
X_train_norm12, X_test_norm12=get_normalized_dataset(adv_data,2)

In [None]:
#Normalizer
X_train_norm13, X_test_norm13=get_normalized_dataset(adv_data,3)

In [None]:
#Robust Scaler
X_train_norm14, X_test_norm14=get_normalized_dataset(adv_data,4)

#### **Scaling adv_data_1 using various methods like StandardScaler, MinMaxScaler, RobustScaler and Normalizer**

In [None]:
#Scaling using StandardScaler
X_train_norm21, X_test_norm21=get_normalized_dataset(adv_data_1,1)

In [None]:
#Scaling using MinMaxScaler
X_train_norm22, X_test_norm22=get_normalized_dataset(adv_data_1,2)

In [None]:
#Normalizer
X_train_norm23, X_test_norm23=get_normalized_dataset(adv_data_1,3)

In [None]:
#Robust Scaler
X_train_norm24, X_test_norm24=get_normalized_dataset(adv_data_1,4)

#### **Scaling adv_data_2 using various methods like StandardScaler, MinMaxScaler, RobustScaler and Normalizer**

In [None]:
#Scaling using StandardScaler
X_train_norm31, X_test_norm31=get_normalized_dataset(adv_data_2,1)

In [None]:
#Scaling using MinMaxScaler
X_train_norm32, X_test_norm32=get_normalized_dataset(adv_data_2,2)

In [None]:
#Normalizer
X_train_norm33, X_test_norm33=get_normalized_dataset(adv_data_2,3)

In [None]:
#Robust Scaler
X_train_norm34, X_test_norm34=get_normalized_dataset(adv_data_2,4)

We now have the following normalized/scaled datasets for further experiments:

- adv_data
  * X_train_norm11, X_test_norm11
  * X_train_norm12, X_test_norm12
  * X_train_norm13, X_test_norm13
  * X_train_norm14, X_test_norm14

- adv_data_1
  * X_train_norm21, X_test_norm21
  * X_train_norm22, X_test_norm22
  * X_train_norm23, X_test_norm23
  * X_train_norm24, X_test_norm24

- adv_data_2
  * X_train_norm31, X_test_norm31
  * X_train_norm32, X_test_norm32
  * X_train_norm33, X_test_norm33
  * X_train_norm34, X_test_norm34

<a id=section1603></a>
### **16.3 Create a simple neural network and evaluate it on all datasets generated above**

#### **Building Simple Neural Network with only 1 hidden layer**

In [None]:
num_input_nodes=len(X_train.columns)
num_input_nodes

In [None]:
num_classes=len(np.unique(y_train))
num_classes

In [None]:
def simple_nn(num_hidden_neurons):
    # initialize model
    model = keras.Sequential()
    # add an input layer and a hidden layer
    model.add(Dense(num_hidden_neurons, input_dim = num_input_nodes))
    # add activation layer to add non-linearity
    model.add(Activation('relu'))
    # add output layer
    model.add(Dense(1))
    # add sigmoid activation to output a probability for the binary target
    model.add(Activation('sigmoid'))
    
    return model

In [None]:
def build_model(num_hidden_neurons):
    model=simple_nn(num_hidden_neurons)
    
    # Defining the optimizer with a specific learning rate of 0.001
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    
    # Compiling the model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    
    return model
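
The model above ends in a single sigmoid unit and is compiled with `binary_crossentropy`. As a rough sketch of what those two pieces compute for one sample, the pure-Python version below may help; the logits and labels are made up, and the small clipping constant `eps` is an illustrative choice rather than a value taken from Keras.

```python
import math

def sigmoid(z):
    # Squashes any real-valued output of the last Dense(1) layer into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    # Clip the probability so log() never receives exactly 0 or 1
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# Hypothetical raw outputs (logits) for three samples and their true labels
logits = [2.0, -1.0, 0.0]
labels = [1, 0, 1]

probs = [sigmoid(z) for z in logits]
losses = [binary_crossentropy(y, p) for y, p in zip(labels, probs)]
mean_loss = sum(losses) / len(losses)

for z, y, p, l in zip(logits, labels, probs, losses):
    print(f"logit={z:+.1f} label={y} prob={p:.3f} loss={l:.3f}")
print("mean loss:", round(mean_loss, 3))
```

A confident correct prediction gets a small loss, while an undecided output of 0.5 costs ln(2), roughly 0.693, so an untrained model typically starts near that loss value.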

In [None]:
model=build_model(63)
model.summary()

In [None]:
example_batch = X_train_norm11[:10]
example_result = model.predict(example_batch)
example_result

In [None]:
def train_model(model, X_train, y_train, num_epochs):

    model_info = model.fit(X_train, y_train, epochs=num_epochs, validation_split=0.2,
                        verbose=0, callbacks=[tfdocs.modeling.EpochDots()])
    model_hist=pd.DataFrame(model_info.history)
    model_hist['epochs']=model_info.epoch
    model_hist=model_hist.sort_values(by='val_accuracy', ascending=False)
    model_hist.reset_index(drop=True, inplace=True)
    
    return model_info, model_hist

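For the initial experiment with the simple neural network, we try different numbers of epochs (1000, 500 and 2000) on all the datasets. Note that every call below reuses the same `model` object, so each run continues from the weights left by the previous run rather than starting from a fresh initialization.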

<a id=section16031></a>
#### **16.3.1 Training the model with 1000 EPOCHS**

In [None]:
info,df=train_model(model,X_train_norm11, y_train, 1000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm12, y_train, 1000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm13, y_train, 1000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm14, y_train, 1000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm21, y_train, 1000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm22, y_train, 1000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm23, y_train, 1000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm24, y_train, 1000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm31, y_train, 1000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm32, y_train, 1000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm33, y_train, 1000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm34, y_train, 1000)
df.head()

With 1000 epochs, training and validation accuracy are above 0.80 in most cases. Next we try fewer epochs to see whether similar results can be achieved with fewer iterations.

<a id=section16032></a>
#### **16.3.2 Training the model with 500 EPOCHS**

In [None]:
info,df=train_model(model, X_train_norm11, y_train, 500)
df.head()

In [None]:
info,df=train_model(model, X_train_norm12, y_train, 500)
df.head()

In [None]:
info,df=train_model(model, X_train_norm13, y_train, 500)
df.head()

In [None]:
info,df=train_model(model, X_train_norm14, y_train, 500)
df.head()

In [None]:
info,df=train_model(model, X_train_norm21, y_train, 500)
df.head()

In [None]:
info,df=train_model(model, X_train_norm22, y_train, 500)
df.head()

In [None]:
info,df=train_model(model, X_train_norm23, y_train, 500)
df.head()

In [None]:
info,df=train_model(model, X_train_norm24, y_train, 500)
df.head()

In [None]:
info,df=train_model(model, X_train_norm31, y_train, 500)
df.head()

In [None]:
info,df=train_model(model, X_train_norm32, y_train, 500)
df.head()

In [None]:
info,df=train_model(model, X_train_norm33, y_train, 500)
df.head()

In [None]:
info,df=train_model(model, X_train_norm34, y_train, 500)
df.head()

The experiment with 500 epochs gives similar results, but 1000 epochs gave marginally higher accuracies, so we next try 2000 epochs to see whether increasing the number of epochs increases the accuracy further.

<a id=section16033></a>
#### **16.3.3 Training the model with 2000 EPOCHS**

In [None]:
info,df=train_model(model, X_train_norm11, y_train, 2000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm12, y_train, 2000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm13, y_train, 2000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm14, y_train, 2000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm21, y_train, 2000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm22, y_train, 2000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm23, y_train, 2000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm24, y_train, 2000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm31, y_train, 2000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm32, y_train, 2000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm33, y_train, 2000)
df.head()

In [None]:
info,df=train_model(model, X_train_norm34, y_train, 2000)
df.head()

Increasing the number of epochs to 2000 did not increase the accuracy values, and the experiment with 1000 epochs gave the better results overall.

<a id=section16034></a>
#### **16.3.4 Summary of highest accuracies obtained for all 3 epoch values**



<table>
    <tr>
        <th>Dataset</th>
        <th>Accuracy (500 epochs)</th>
        <th>Accuracy (1000 epochs)</th>
        <th>Accuracy (2000 epochs)</th>
    </tr>
    <tr>
        <td>adv_data+StandardScaler</td>
        <td>.8131</td>
        <td>.8186</td>
        <td>.8138</td>
    </tr>   
    <tr>
        <td>adv_data+MinMaxScaler</td>
        <td>.8080</td>
        <td>.8090</td>
        <td>.8114</td>
    </tr> 
    <tr>
        <td>adv_data+Normalizer</td>
        <td>.7909</td>
        <td>.8030</td>
        <td>.7675</td>
    </tr> 
    <tr>
        <td>adv_data+RobustScaler</td>
        <td>.8147</td>
        <td>.8131</td>
        <td>.8159</td>
    </tr> 
</table>

<table>
    <tr>
        <th>Dataset</th>
        <th>Accuracy (500 epochs)</th>
        <th>Accuracy (1000 epochs)</th>
        <th>Accuracy (2000 epochs)</th>
    </tr>
    <tr>
        <td>adv_data_1+StandardScaler</td>
        <td>.8133</td>
        <td>.8169</td>
        <td>.8138</td>
    </tr>   
    <tr>
        <td>adv_data_1+MinMaxScaler</td>
        <td>.8083</td>
        <td>.8157</td>
        <td>.8083</td>
    </tr> 
    <tr>
        <td>adv_data_1+Normalizer</td>
        <td>.7927</td>
        <td>.8068</td>
        <td>.800</td>
    </tr> 
    <tr>
        <td>adv_data_1+RobustScaler</td>
        <td>.8102</td>
        <td>.8073</td>
        <td>.8063</td>
    </tr> 
</table>

<table>
    <tr>
        <th>Dataset</th>
        <th>Accuracy (500 epochs)</th>
        <th>Accuracy (1000 epochs)</th>
        <th>Accuracy (2000 epochs)</th>
    </tr>
    <tr>
        <td>adv_data_2+StandardScaler</td>
        <td>.8126</td>
        <td>.8159</td>
        <td>.809</td>
    </tr>   
    <tr>
        <td>adv_data_2+MinMaxScaler</td>
        <td>.8080</td>
        <td>.8114</td>
        <td>.806</td>
    </tr> 
    <tr>
        <td>adv_data_2+Normalizer</td>
        <td>.7927</td>
        <td>.7893</td>
        <td>.7972</td>
    </tr> 
    <tr>
        <td>adv_data_2+RobustScaler</td>
        <td>.814</td>
        <td>.8152</td>
        <td>.803</td>
    </tr> 
</table>



From the above tables, we can deduce the following:

- StandardScaler and RobustScaler gave better values than the other techniques.
- 2000 epochs did not increase the accuracies significantly, so given the trade-off between training time and accuracy, we can skip 2000 epochs in further analysis.
- The difference in accuracy between 500 and 1000 epochs is small, so we can experiment with deep neural networks and other optimization techniques using 500 epochs.
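These deductions can be cross-checked with a few lines of Python. The sketch below simply averages the accuracies copied from the three summary tables above, per scaler and per epoch setting; the variable names are illustrative.

```python
from statistics import fmean

# Accuracies copied from the summary tables above, in the order (500, 1000, 2000) epochs
results = {
    ("adv_data", "StandardScaler"): (0.8131, 0.8186, 0.8138),
    ("adv_data", "MinMaxScaler"): (0.8080, 0.8090, 0.8114),
    ("adv_data", "Normalizer"): (0.7909, 0.8030, 0.7675),
    ("adv_data", "RobustScaler"): (0.8147, 0.8131, 0.8159),
    ("adv_data_1", "StandardScaler"): (0.8133, 0.8169, 0.8138),
    ("adv_data_1", "MinMaxScaler"): (0.8083, 0.8157, 0.8083),
    ("adv_data_1", "Normalizer"): (0.7927, 0.8068, 0.800),
    ("adv_data_1", "RobustScaler"): (0.8102, 0.8073, 0.8063),
    ("adv_data_2", "StandardScaler"): (0.8126, 0.8159, 0.809),
    ("adv_data_2", "MinMaxScaler"): (0.8080, 0.8114, 0.806),
    ("adv_data_2", "Normalizer"): (0.7927, 0.7893, 0.7972),
    ("adv_data_2", "RobustScaler"): (0.814, 0.8152, 0.803),
}
epochs = (500, 1000, 2000)

# Mean accuracy per scaler, pooled over datasets and epoch settings
by_scaler = {}
for (_, scaler), accs in results.items():
    by_scaler.setdefault(scaler, []).extend(accs)
mean_by_scaler = {s: fmean(v) for s, v in by_scaler.items()}

# Mean accuracy per epoch setting, pooled over datasets and scalers
mean_by_epochs = {e: fmean([accs[i] for accs in results.values()])
                  for i, e in enumerate(epochs)}

ranking = sorted(mean_by_scaler, key=mean_by_scaler.get, reverse=True)
print("scalers, best first:", ranking)
print("mean accuracy per epoch setting:",
      {e: round(m, 4) for e, m in mean_by_epochs.items()})
```

On these numbers StandardScaler and RobustScaler come out on top and Normalizer last, and the 1000-epoch column has the highest average while 2000 epochs has the lowest, with all three epoch settings within about 0.006 of each other.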

For a better understanding, we can also check the test data results of the simple neural network for the following combinations:
 - adv_data+RobustScaler with 1000 and 500 epochs
 - adv_data+StandardScaler with 1000 and 500 epochs
 - adv_data_1+RobustScaler with 1000 and 500 epochs
 - adv_data_1+StandardScaler with 1000 and 500 epochs
 - adv_data_2+RobustScaler with 1000 and 500 epochs
 - adv_data_2+StandardScaler with 1000 and 500 epochs

<a id=section16035></a>
#### **16.3.5 Analyzing validation loss for various data generated above**

In [None]:
def plot_accuracies(X_train, y_train, X_test, y_test, num_epochs):
    model=build_model(63)
    info, df=train_model(model, X_train, y_train, num_epochs)
    _, train_acc = model.evaluate(X_train, y_train, verbose=1)
    _, test_acc = model.evaluate(X_test, y_test, verbose=1)
    print ("\nTrain accuracy - ", train_acc)
    print ("\nTest accuracy - ", test_acc)
    plt.plot(info.history['loss'], label='train')
    # 'val_loss' comes from the validation_split inside fit(), not from X_test
    plt.plot(info.history['val_loss'], label='validation')
    plt.legend()
    plt.title("Training vs. validation loss")
    plt.show()

In [None]:
#adv_data + StandardScaler for 500 epochs
plot_accuracies(X_train_norm11,y_train, X_test_norm11, y_test, 500)

The validation loss is much higher than the training loss, and it increases with the number of epochs.

In [None]:
#adv_data + RobustScaler for 500 epochs
plot_accuracies(X_train_norm14,y_train, X_test_norm14, y_test, 500)

The validation loss is much higher than the training loss, and the gap is large, which could indicate overfitting.

The training loss does not change much, while the validation loss increases with the number of epochs.

In [None]:
#adv_data + StandardScaler for 1000 epochs
plot_accuracies(X_train_norm11,y_train, X_test_norm11, y_test, 1000)

The validation loss is much higher than the training loss, and it increases with the number of epochs.

In [None]:
#adv_data + RobustScaler for 1000 epochs
plot_accuracies(X_train_norm14,y_train, X_test_norm14, y_test, 1000)

The validation loss is much higher than the training loss, and the gap is large, which could indicate overfitting.

The training loss does not change much, while the validation loss increases with the number of epochs.

In [None]:
#adv_data_1 + StandardScaler for 500 epochs
plot_accuracies(X_train_norm21,y_train, X_test_norm21, y_test, 500)

The validation loss is much higher than the training loss, and it increases with the number of epochs.

In [None]:
#adv_data_1 + RobustScaler for 500 epochs
plot_accuracies(X_train_norm24,y_train, X_test_norm24, y_test, 500)


The validation loss is much higher than the training loss, and the gap is large, which could indicate overfitting.

The training loss does not change much, while the validation loss increases with the number of epochs.

In [None]:
#adv_data_1 + StandardScaler for 1000 epochs
plot_accuracies(X_train_norm21,y_train, X_test_norm21, y_test, 1000)

The validation loss is much higher than the training loss, and it increases with the number of epochs.

In [None]:
#adv_data_1 + RobustScaler for 1000 epochs
plot_accuracies(X_train_norm24,y_train, X_test_norm24, y_test, 1000)

The validation loss is much higher than the training loss, and the gap is large, which could indicate overfitting.

The training loss does not change much, while the validation loss increases with the number of epochs.

In [None]:
#adv_data_2 + StandardScaler for 500 epochs
plot_accuracies(X_train_norm31,y_train, X_test_norm31, y_test, 500)

The validation loss is much higher than the training loss, and it increases with the number of epochs.

In [None]:
#adv_data_2 + RobustScaler for 500 epochs
plot_accuracies(X_train_norm34,y_train, X_test_norm34, y_test, 500)

The validation loss is much higher than the training loss, and the gap is large, which could indicate overfitting.

The training loss does not change much, while the validation loss increases with the number of epochs.

In [None]:
#adv_data_2 + StandardScaler for 1000 epochs
plot_accuracies(X_train_norm31,y_train, X_test_norm31, y_test, 1000)

The validation loss is much higher than the training loss, and it increases with the number of epochs.

In [None]:
#adv_data_2 + RobustScaler for 1000 epochs
plot_accuracies(X_train_norm34,y_train, X_test_norm34, y_test, 1000)

The validation loss is much higher than the training loss, and the gap is large, which could indicate overfitting.

The training loss does not change much, while the validation loss increases with the number of epochs.

**Summary**

- There is not much difference in train and test accuracy between 500 and 1000 epochs, so we can continue with epochs=500.
- In some cases the accuracy has stabilized and no longer increases significantly, so early stopping (for example Keras's EarlyStopping callback) can be used to halt training at the appropriate time.
- The graphs are not smooth, which could be due to the default batch size of 32, so further experiments can try different batch sizes.
- Further experiments can use only adv_data, as there is not much difference or improvement among the different datasets.
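The early-stopping idea mentioned in the summary can be sketched without any framework. The function below is a simplified illustration of a patience rule (stop once the validation loss has not improved for `patience` consecutive epochs); it is not the Keras callback's actual implementation, and the loss values are made up.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) under a simple patience rule.

    stop_epoch is None when training would run through every epoch.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best validation loss: remember it and reset the counter
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Made-up validation losses: improvement stalls after epoch 2
history = [0.50, 0.45, 0.44, 0.46, 0.47, 0.48, 0.49]
stop_epoch, best_epoch = early_stop_epoch(history, patience=3)
print("stop at epoch", stop_epoch, "- best epoch was", best_epoch)
```

Here training would stop at epoch 5 with the best validation loss at epoch 2, instead of running all remaining epochs while the validation loss keeps rising.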

<a id=section1604></a>
### **16.4 Creating deep neural networks with hyper parameter optimization**

We will train the model with different parameters, such as layers, activation functions and dropout probabilities, to determine which combination gives the best result.

For the basic analysis, we run the search on the adv_data + StandardScaler data to analyze the behaviour.


We use KerasClassifier and RandomizedSearchCV to try multiple parameter values and obtain the best model information.

In [None]:
def create_deep_nn(layers, activations, dropouts):
    model=Sequential()
    
    for i,nodes in enumerate(layers):
        if i==0:
            model.add(Dense(nodes, input_dim=len(X_train.columns)))
        else:
            model.add(Dense(nodes))
        model.add(Activation(activations))
        model.add(Dropout(dropouts))
    
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

In [None]:
model=KerasClassifier(build_fn=create_deep_nn, verbose=0)

In [None]:
layers=[(63,),(128,),(256,),(128,64),(256,128,64)]
activations=['relu','selu','elu','tanh']
dropouts=[0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
param_grid=dict(layers=layers, activations=activations, dropouts=dropouts, 
                batch_size=[32,64,80,128,256], epochs=[50,100,200,300,500])
#grid=GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid=RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs=-1, cv=3, verbose=True)

In [None]:
grid_result=grid.fit(X_train_norm11, y_train)

In [None]:
grid_result.best_params_

In [None]:
grid_result.best_score_

We can try different values in param_grid as a further experiment.
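For a sense of scale, the sketch below counts the combinations in the first parameter grid above and compares the number of model fits an exhaustive grid search would need with what the randomized search performs. It assumes `n_iter` is left at scikit-learn's default of 10, since the cells above do not pass it.

```python
from math import prod

# First parameter grid from the cells above
param_grid = {
    "layers": [(63,), (128,), (256,), (128, 64), (256, 128, 64)],
    "activations": ["relu", "selu", "elu", "tanh"],
    "dropouts": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "batch_size": [32, 64, 80, 128, 256],
    "epochs": [50, 100, 200, 300, 500],
}
cv_folds = 3
n_iter = 10  # RandomizedSearchCV's default when n_iter is not passed

n_combinations = prod(len(values) for values in param_grid.values())
grid_search_fits = n_combinations * cv_folds
random_search_fits = n_iter * cv_folds

print("combinations in the grid:", n_combinations)
print("fits for an exhaustive grid search:", grid_search_fits)
print("fits for the randomized search:", random_search_fits)
```

Only a small sample of the grid is evaluated per search, and no `random_state` is set, so repeated runs can return different best parameters. That is consistent with the different "best" configurations collected from the searches in this section.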

In [None]:
model=KerasClassifier(build_fn=create_deep_nn, verbose=0)

In [None]:
layers=[(32,),(45,),(64,), (80,),(128,), (40,20),(60,40,20)]
activations=['relu','selu','elu','tanh']
dropouts=[0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
param_grid=dict(layers=layers, activations=activations, dropouts=dropouts, 
                batch_size=[64, 80, 100, 128, 180, 256, 300], epochs=[50,100,200,300,500, 1000])
#grid=GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid=RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs=-1, cv=10, verbose=True)

In [None]:
grid_result=grid.fit(X_train_norm11, y_train)

In [None]:
grid_result.best_params_

In [None]:
grid_result.best_score_

Changing the parameter grid gave a better score, so we can try some more values with a similar range of epochs.

In [None]:
model=KerasClassifier(build_fn=create_deep_nn, verbose=0)

In [None]:
layers=[(32,), (40,), (45,),(64,), (80,), (128,), (40,20), (60,40,20), (20, 40, 40, 20)]
activations=['relu','selu','elu','tanh']
dropouts=[0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
param_grid=dict(layers=layers, activations=activations, dropouts=dropouts, 
                batch_size=[64, 80, 100, 128, 180, 256, 300], epochs=[100,200,300,400, 500, 600, 800, 1000])
#grid=GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid=RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs=-1, cv=10, verbose=True)

In [None]:
grid_result=grid.fit(X_train_norm11, y_train)

In [None]:
grid_result.best_params_

In [None]:
grid_result.best_score_

From the randomized search over the number of layers, number of neurons, activation function and dropout probability, we get a few configurations like the ones below. We can build neural networks from them, analyze their scores on train and test data, and later optimize them using different optimizers and kernel regularization techniques.

{'layers': (63,),
 'epochs': 300,
 'dropouts': 0.4,
 'batch_size': 32,
 'activations': 'relu'}

{'layers': (60, 40, 20),
 'epochs': 1000,
 'dropouts': 0.5,
 'batch_size': 64,
 'activations': 'relu'}

{'layers': (80,),
 'epochs': 1000,
 'dropouts': 0.6,
 'batch_size': 256,
 'activations': 'tanh'}

Based on the above information, we can create a few architectures and observe their performance (without batch normalization, changed optimizers or regularization).

<a id=section1605></a>
### **16.5 Experiment with selected configuration from hyper parameter tuning**

We will try the hidden-layer combinations below with dropout values of (0.2, 0.4, 0.5, 0.6), the activation functions 'relu', 'selu' and 'tanh', and optimizer='adam'.

 - 63
 - 60+40+20
 - 80
 - 45+15+45
 - 40+50+40

In [None]:
len(X_train.columns)

In [None]:
#create model for 1 hidden layer of 63 neurons
def build_deep_nn1(dropouts, actfn):

    # initialize model
    model = keras.Sequential()
    # add an input layer and a hidden layer
    model.add(Dense(63, input_dim = len(X_train.columns)))
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    # add output layer
    model.add(Dense(1))
    # add sigmoid activation to output a probability for the binary target
    model.add(Activation('sigmoid'))
    
    # Compiling the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
#create model for 1 hidden layer of 80 neurons
def build_deep_nn2(dropouts, actfn):
    
    # initialize model
    model = keras.Sequential()
    # add an input layer and a hidden layer
    model.add(Dense(80, input_dim = len(X_train.columns)))
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    # add output layer
    model.add(Dense(1))
    # add sigmoid activation to output a probability for the binary target
    model.add(Activation('sigmoid'))
    
    # Compiling the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  
    return model

In [None]:
#create model for 3 hidden layers of 45,15 and 45 neurons each
def build_deep_nn3(dropouts, actfn):
    
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(45, input_dim = len(X_train.columns)))
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(15))  # input size is inferred from the previous layer
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(45))  # input size is inferred from the previous layer
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # add sigmoid activation to output a probability for the binary target
    model.add(Activation('sigmoid'))
    
    # Compiling the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
#create model for 3 hidden layers of 60,40 and 20 neurons each
def build_deep_nn4(dropouts, actfn):
    
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(60, input_dim = len(X_train.columns)))
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(40))  # input size is inferred from the previous layer
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(20))  # input size is inferred from the previous layer
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # add sigmoid activation to output a probability for the binary target
    model.add(Activation('sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
#create model for 3 hidden layers of 40, 50 and 40 neurons each
def build_deep_nn5(dropouts, actfn):
 
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(40, input_dim = len(X_train.columns)))
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(50))  # input size is inferred from the previous layer
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(40))  # input size is inferred from the previous layer
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # add sigmoid activation to output a probability for the binary target
    model.add(Activation('sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
def build_train_model(X_train, y_train, X_test, y_test, num_epochs):

    data_list=[]

    dropout_list=[0.2,0.4,0.5,0.6]
    act_fnlist=['relu','selu','tanh']
    batch_size=[32,64,128,256, 512]
    data_record=[]
    function_list=[build_deep_nn1,build_deep_nn2,build_deep_nn3,build_deep_nn4,build_deep_nn5]
    function_name=['deep_nn1','deep_nn2','deep_nn3','deep_nn4','deep_nn5']
    
    for name, fn in zip(function_name,function_list):
        for actfn in act_fnlist:
            for dropout in dropout_list:
                for batch in batch_size:
                    data_record=[]
                    deep_model=fn(dropout, actfn)
                    model_info = deep_model.fit(X_train, y_train, epochs=num_epochs, validation_split=0.2,
                                        verbose=0, callbacks=[tfdocs.modeling.EpochDots()], batch_size=batch)
                    model_hist=pd.DataFrame(model_info.history)
                    model_hist['epochs']=model_info.epoch
                    model_hist=model_hist.sort_values(by='val_accuracy', ascending=False)
                    model_hist.reset_index(drop=True, inplace=True)
                    _, train_acc = deep_model.evaluate(X_train, y_train, verbose=1)
                    _, test_acc = deep_model.evaluate(X_test, y_test, verbose=1)

                    print("\nActivation - ", actfn)
                    print("\nDropout - ", dropout)
                    print("\nBatch size - ", batch)
                    print("\nModel summary - \n")
                    deep_model.summary()  # summary() prints the table itself and returns None
                    print(model_hist.head())
                    print ("\nTrain accuracy - ", train_acc)
                    print ("\nTest accuracy - ", test_acc)
                    plt.plot(model_info.history['loss'], label='train')
                    # 'val_loss' comes from the validation_split inside fit(), not from X_test
                    plt.plot(model_info.history['val_loss'], label='validation')
                    plt.legend()
                    plt.title("Training vs. validation loss")
                    plt.show()
                    print("=========================================")
                    data_record.append(name)
                    data_record.append(actfn)
                    data_record.append(dropout)
                    data_record.append(batch)
                    data_record.append(model_hist.loc[0,'accuracy'])
                    data_record.append(model_hist.loc[0,'val_accuracy'])
                    data_record.append(model_hist.loc[0,'val_loss'])
                    data_record.append(train_acc)
                    data_record.append(test_acc)
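                    # summary() returns None, so capture its printed lines as text instead
                    summary_lines=[]
                    deep_model.summary(print_fn=lambda line, **kwargs: summary_lines.append(line))
                    data_record.append("\n".join(summary_lines))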
                    data_record.append(model_hist.loc[0,'epochs'])
                    data_record.append(num_epochs)
                    data_list.append(tuple(data_record))
                    print("*************************************************************************************************************************")
                    print("*************************************************************************************************************************")
                    
                    print("\n")

    deep_model_data=pd.DataFrame(data_list, columns=['model','activation', 'dropout', 'batchsize', 'trainaccuracy', 'testaccuracy', 
                                          'loss', 'meantrainaccuracy', 'meantestaccuracy', 'summary', 'epoch', 'epochs'])
    return deep_model_data

In [None]:
df_500_1=build_train_model(X_train_norm11, y_train, X_test_norm11, y_test, 500)

Multiple configurations of layers and neurons were tested with different batch sizes, dropout values and activation functions for 500 epochs. The observations are:

- The relu activation function with batch sizes of 32, 64, 128 and 256, at any dropout value, shows a large gap between training and validation loss, which could indicate overfitting.
- relu with a batch size of 512 is more stable, with a smaller gap between training and validation loss.
- The tanh and selu activation functions with dropout of 0.5 or 0.6 and batch sizes of 256 and 512 are the most stable.

In many scenarios the loss becomes constant after a certain number of epochs, so an early-stopping callback should be included to halt training at the right time.
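One factor that may partly explain the smoother curves at larger batch sizes: each weight update averages the gradient over more samples, and there are far fewer updates per epoch. The sketch below counts the updates per epoch for the batch sizes tried above. The 20,000-row training set is a made-up figure for illustration, and `updates_per_epoch` is a hypothetical helper.

```python
import math

def updates_per_epoch(n_samples, batch_size, validation_split=0.2):
    # fit() holds out the validation fraction, then trains on the remaining rows
    n_train = int(n_samples * (1 - validation_split))
    return math.ceil(n_train / batch_size)

# ASSUMPTION: 20,000 training rows, a made-up figure for illustration
n_samples = 20_000
steps = {b: updates_per_epoch(n_samples, b) for b in [32, 64, 128, 256, 512]}
for batch_size, n_updates in steps.items():
    print(f"batch_size={batch_size:>3}: {n_updates} weight updates per epoch")
```

Under this assumption a batch size of 32 makes 500 noisy updates per epoch, while 512 makes 32 updates that are each averaged over sixteen times as many samples.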

In [None]:
df_500_2=build_train_model(X_train_norm14, y_train, X_test_norm14, y_test, 500)

Multiple configurations of layers and neurons were tested with different batch sizes, dropout values and activation functions for 500 epochs. The observations are:

- The relu activation function with batch sizes of 32, 64, 128 and 256, at any dropout value, shows a large gap between training and validation loss, which could indicate overfitting.
- relu with a batch size of 512 is more stable, with a smaller gap between training and validation loss.
- The tanh and selu activation functions with dropout of 0.5 or 0.6 and batch sizes of 256 and 512 are the most stable.

In many scenarios the loss becomes constant after a certain number of epochs, so an early-stopping callback should be included to halt training at the right time.

The models with multiple layers also seem to be more stable than the models with a single layer.

In [None]:
df_1000_1=build_train_model(X_train_norm11, y_train, X_test_norm11, y_test, 1000)

Multiple configurations of layers and neurons were tested with different batch sizes, dropout values and activation functions for 1000 epochs. The observations are:

- The relu activation function with batch sizes of 32, 64, 128 and 256, at any dropout value, shows a large gap between training and validation loss, which could indicate overfitting.
- relu with a batch size of 512 is more stable, with a smaller gap between training and validation loss.
- The tanh and selu activation functions with dropout of 0.5 or 0.6 and batch sizes of 256 and 512 are the most stable.

In many scenarios the loss becomes constant after a certain number of epochs, so an early-stopping callback should be included to halt training at the right time.

The models with multiple layers also seem to be more stable than the models with a single layer.

In [None]:
df_1000_2=build_train_model(X_train_norm14, y_train, X_test_norm14, y_test, 1000)

Multiple configurations of layers and neurons were tested with different batch sizes, dropout values and activation functions for 1000 epochs. The observations are:

- The relu activation function with batch sizes of 32, 64, 128 and 256, at any dropout value, shows a large gap between training and validation loss, which could indicate overfitting.
- relu with a batch size of 512 is more stable, with a smaller gap between training and validation loss.
- The tanh and selu activation functions with dropout of 0.5 or 0.6 and batch sizes of 256 and 512 are the most stable.

In many scenarios the loss becomes constant after a certain number of epochs, so an early-stopping callback should be included to halt training at the right time.

The models with multiple layers also seem to be more stable than the models with a single layer.

In [None]:
df_500_1.sort_values(by='testaccuracy', ascending=False).head()

In [None]:
df_500_2.sort_values(by='testaccuracy', ascending=False).head()

In [None]:
df_1000_1.sort_values(by='testaccuracy', ascending=False).head()

In [None]:
df_1000_2.sort_values(by='testaccuracy', ascending=False).head()

In [None]:
df_500_1.to_csv('Dataframe_500_1.csv')

In [None]:
df_500_2.to_csv('Dataframe_500_2.csv')

In [None]:
df_1000_1.to_csv('Dataframe_1000_1.csv')

In [None]:
df_1000_2.to_csv('Dataframe_1000_2.csv')

From the above data, we found that the deep_nn5, deep_nn4 and deep_nn3 models performed better, specifically with batch sizes of 128, 256 and 512, the relu and selu activation functions, epochs=1000 and dropout of 0.4, 0.5 and 0.6.

So, we will apply batch normalization, regularization techniques, callbacks that halt training early once accuracy reaches the specified value and starts decreasing thereafter, and various custom metrics like AUC/ROC.
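
As a reference for the AUC metric mentioned above, the sketch below computes ROC AUC in plain Python as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (ties count as half). The function name `roc_auc` and the sample labels and scores are hypothetical; in Keras, the built-in `tf.keras.metrics.AUC` metric could be passed to `compile` instead.

```python
def roc_auc(y_true, y_score):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print("AUC:", auc)
```

The pairwise loop is quadratic in the number of samples, so it is only meant to show what the metric measures, not to replace a library implementation.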

<a id=section1606></a>
### **16.6 Batch Normalization and Weight initializer on models selected above along with different optimizers**

<a id=section16061></a>
#### **16.6.1 Using Batch Normalization and early stopping**

In [None]:
#create model for 3 hidden layers of 60,40 and 20 neurons each
def build_deep_nn1_withbn(dropouts, actfn):
    
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(60, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(40, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(20, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # apply sigmoid activation for the binary output
    model.add(Activation('sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
#create model for 3 hidden layers of 45,15 and 45 neurons each
def build_deep_nn2_withbn(dropouts, actfn):
    
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(45, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(15, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(45, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # apply sigmoid activation for the binary output
    model.add(Activation('sigmoid'))
    
    # Compiling the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
def build_deep_nn3_withbn(dropouts, actfn):
 
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(40, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(50, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(40, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # apply sigmoid activation for the binary output
    model.add(Activation('sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
def build_train_model_bn(X_train, y_train, X_test, y_test, num_epochs=1000):

    data_list=[]

    dropout_list=[0.4,0.5,0.6]
    act_fnlist=['selu','tanh']
    batch_size=[256, 512]
    data_record=[]
    fucntion_list=[build_deep_nn1_withbn,build_deep_nn2_withbn, build_deep_nn3_withbn]
    function_name=['deep_nn1bn','deep_nn2bn','deep_nn3bn']
    
    for name, fn in zip(function_name,fucntion_list):
        for actfn in act_fnlist:
            for dropout in dropout_list:
                for batch in batch_size:
                    data_record=[]
                    deep_model=fn(dropout, actfn)
                    earlystop = EarlyStopping(monitor='val_accuracy', min_delta=0.001, patience=5, verbose=1, mode='auto')
                    callbacks_list = [tfdocs.modeling.EpochDots(), earlystop]
                    model_info = deep_model.fit(X_train, y_train, epochs=num_epochs, validation_split=0.2,
                                        verbose=0, callbacks=callbacks_list, batch_size=batch)
                    model_hist=pd.DataFrame(model_info.history)
                    model_hist['epochs']=model_info.epoch
                    model_hist=model_hist.sort_values(by='val_accuracy', ascending=False)
                    model_hist.reset_index(drop=True, inplace=True)
                    _, train_acc = deep_model.evaluate(X_train, y_train, verbose=1)
                    _, test_acc = deep_model.evaluate(X_test, y_test, verbose=1)

                    print("\nActivation - ", actfn)
                    print("\nDropout - ", dropout)
                    print("\nBatch size - ", batch)
                    print("\nModel summary - ")
                    deep_model.summary()
                    print(model_hist.head())
                    print ("\nTrain accuracy - ", train_acc)
                    print ("\nTest accuracy - ", test_acc)
                    plt.plot(model_info.history['loss'], label='train')
                    plt.plot(model_info.history['val_loss'], label='validation')
                    plt.legend()
                    plt.title("Training vs. Validation Loss")
                    plt.show()
                    print("=========================================")
                    data_record.append(name)
                    data_record.append(actfn)
                    data_record.append(dropout)
                    data_record.append(batch)
                    data_record.append(model_hist.loc[0,'accuracy'])
                    data_record.append(model_hist.loc[0,'val_accuracy'])
                    data_record.append(model_hist.loc[0,'val_loss'])
                    data_record.append(train_acc)
                    data_record.append(test_acc)
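                    # summary() prints and returns None, so capture its text instead
                    summary_lines = []
                    deep_model.summary(print_fn=summary_lines.append)
                    data_record.append("\n".join(summary_lines))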
                    data_record.append(model_hist.loc[0,'epochs'])
                    data_record.append(num_epochs)
                    data_list.append(tuple(data_record))
                    print("*************************************************************************************************************************")
                    print("*************************************************************************************************************************")
                    
                    print("\n")

    deep_model_data=pd.DataFrame(data_list, columns=['model','activation', 'dropout', 'batchsize', 'trainaccuracy', 'testaccuracy', 
                                          'loss', 'meantrainaccuracy', 'meantestaccuracy', 'summary', 'epoch', 'epochs'])
    return deep_model_data

In [None]:
df_500_bn_1=build_train_model_bn(X_train_norm11, y_train, X_test_norm11, y_test, 500)

In [None]:
df_500_bn_1.sort_values(by='testaccuracy', ascending=False).head()

In [None]:
df_500_bn_1.to_csv('Dataframe_500_bn_1.csv')

In [None]:
df_500_bn_2=build_train_model_bn(X_train_norm14, y_train, X_test_norm14, y_test, 500)

In [None]:
df_500_bn_2.sort_values(by='testaccuracy', ascending=False).head()

In [None]:
df_500_bn_2.to_csv('Dataframe_500_bn_2.csv')

In [None]:
df_1000_bn_1=build_train_model_bn(X_train_norm11, y_train, X_test_norm11, y_test, 1000)

In [None]:
df_1000_bn_1.sort_values(by='testaccuracy', ascending=False).head()

In [None]:
df_1000_bn_1.to_csv('Dataframe_1000_bn_1.csv')

In [None]:
df_1000_bn_2=build_train_model_bn(X_train_norm14, y_train, X_test_norm14, y_test, 1000)

In [None]:
df_1000_bn_2.sort_values(by='testaccuracy', ascending=False).head()

In [None]:
df_1000_bn_2.to_csv('Dataframe_1000_bn_2.csv')

After including BatchNormalization and EarlyStopping (which monitors validation accuracy in the code above), accuracy improved slightly in some cases and the time needed to train the model decreased.
The configurations below performed well with EarlyStopping and BatchNormalization:

- Robust-scaled data
- Epochs = 500
- Dropout = 0.4 and 0.5
- Batch size = 256 and 512
- Activation function = tanh
- Model = 3 hidden layers of 40, 50 and 40 neurons each
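
For context on what the BatchNormalization layers in these models compute, the sketch below shows the training-time transformation for a single feature in plain Python: subtract the batch mean, divide by the batch standard deviation, then scale and shift. The function name `batch_norm` and the sample batch are hypothetical, and the moving averages a real layer keeps for inference are left out.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature across a batch, then scale and shift it.

    gamma and beta stand in for the learned scale and offset; eps guards
    against division by zero when the batch variance is tiny.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# Made-up pre-activation values for one neuron across a batch of four rows.
batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)
print(normalized)
```

Because each layer's inputs are re-centred and re-scaled per batch, the next activation sees values in a similar range throughout training, which is one common explanation for faster convergence.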

<a id=section16062></a>
#### **16.6.2 Applying Kernel initializers**

We will try different values for kernel_initializer and bias_initializer and analyze whether they enhance the performance.
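
To get a feel for how the three initializer families differ, the following pure-Python sketch draws samples the way each one is usually described: a normal distribution, a uniform range, and a normal distribution whose draws beyond two standard deviations are discarded and redrawn. The helper names are hypothetical and the code only imitates the distributions; it does not call Keras.

```python
import random


def random_normal(n, mean=0.0, stddev=1.0, rng=random):
    """Draw n values from a normal distribution."""
    return [rng.gauss(mean, stddev) for _ in range(n)]


def random_uniform(n, minval=0.0, maxval=1.0, rng=random):
    """Draw n values uniformly between minval and maxval."""
    return [rng.uniform(minval, maxval) for _ in range(n)]


def truncated_normal(n, mean=0.0, stddev=1.0, rng=random):
    """Draw n normal values, redrawing any that fall beyond 2 stddev."""
    out = []
    while len(out) < n:
        x = rng.gauss(mean, stddev)
        if abs(x - mean) <= 2 * stddev:
            out.append(x)
    return out


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = {
    "normal": random_normal(1000, rng=rng),
    "uniform": random_uniform(1000, rng=rng),
    "tnormal": truncated_normal(1000, rng=rng),
}
for name, values in samples.items():
    print(name, "min =", min(values), "max =", max(values))
```

One thing the sketch makes visible is that a uniform range of 0 to 1 produces only non-negative starting weights, while the two normal variants are centred on zero.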

In [None]:
#create model for 3 hidden layers of 60,40 and 20 neurons each
def build_deep_nn1_withbninit(dropouts, actfn, kernel_initializer, bias_initializer):
    
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(60, kernel_initializer=kernel_initializer, bias_initializer=bias_initializer, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(40, kernel_initializer=kernel_initializer, bias_initializer=bias_initializer, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(20, kernel_initializer=kernel_initializer, bias_initializer=bias_initializer, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # apply sigmoid activation for the binary output
    model.add(Activation('sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
#create model for 3 hidden layers of 45,15 and 45 neurons each
def build_deep_nn2_withbninit(dropouts, actfn, kernel_initializer, bias_initializer):
    
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(45, kernel_initializer=kernel_initializer, bias_initializer=bias_initializer, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(15, kernel_initializer=kernel_initializer, bias_initializer=bias_initializer, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(45, kernel_initializer=kernel_initializer, bias_initializer=bias_initializer, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # apply sigmoid activation for the binary output
    model.add(Activation('sigmoid'))
    
    # Compiling the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
#create model for 3 hidden layers of 40, 50 and 40 neurons each
def build_deep_nn3_withbninit(dropouts, actfn, kernel_initializer, bias_initializer):
 
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(40, kernel_initializer=kernel_initializer, bias_initializer=bias_initializer, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(50, kernel_initializer=kernel_initializer, bias_initializer=bias_initializer, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(40, kernel_initializer=kernel_initializer, bias_initializer=bias_initializer, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # apply sigmoid activation for the binary output
    model.add(Activation('sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
def build_train_model_bninit(X_train, y_train, X_test, y_test, num_epochs=1000):

    data_list=[]

    dropout_list=[0.4,0.5,0.6]
    act_fnlist=['selu','tanh']
    batch_size=[256, 512]
    data_record=[]
    kernel_initializer=[tf.keras.initializers.RandomNormal(mean=0., stddev=1.), tf.keras.initializers.RandomUniform(minval=0., maxval=1.), tf.keras.initializers.TruncatedNormal(mean=0., stddev=1.)]
    kname_list=['normal','uniform', 'tnormal']

    bias_initializer=[tf.keras.initializers.Zeros(), tf.keras.initializers.Ones(), tf.keras.initializers.RandomNormal(mean=0., stddev=1.), tf.keras.initializers.RandomUniform(minval=0., maxval=1.), tf.keras.initializers.TruncatedNormal(mean=0., stddev=1.)]
    bname_list=['zero', 'one', 'normal','uniform', 'tnormal']

    fucntion_list=[build_deep_nn1_withbninit,build_deep_nn2_withbninit, build_deep_nn3_withbninit]
    function_name=['deep_nn1bn','deep_nn2bn','deep_nn3bn']
    
    for name, fn in zip(function_name,fucntion_list):
        for actfn in act_fnlist:
            for dropout in dropout_list:
                for batch in batch_size:
                    for kname, kinit in zip(kname_list,kernel_initializer):
                        for bname, binit in zip(bname_list, bias_initializer):
                            data_record=[]
                            deep_model=fn(dropout, actfn, kinit, binit)
                            earlystop = EarlyStopping(monitor='val_accuracy', min_delta=0.0001, patience=5, verbose=1, mode='auto')
                            model_info = deep_model.fit(X_train, y_train, epochs=num_epochs, validation_split=0.2,
                                                verbose=0, callbacks=[tfdocs.modeling.EpochDots(), earlystop],  batch_size=batch)
                            model_hist=pd.DataFrame(model_info.history)
                            model_hist['epochs']=model_info.epoch
                            model_hist=model_hist.sort_values(by='val_accuracy', ascending=False)
                            model_hist.reset_index(drop=True, inplace=True)
                            _, train_acc = deep_model.evaluate(X_train, y_train, verbose=1)
                            _, test_acc = deep_model.evaluate(X_test, y_test, verbose=1)

                            print("\nActivation - ", actfn)
                            print("\nDropout - ", dropout)
                            print("\nBatch size - ", batch)
                            print ("\nKernel init method - ", kname)
                            print("\nBias init method - ", bname)
                            print("\nModel summary - ")
                            deep_model.summary()
                            print(model_hist.head())
                            print ("\nTrain accuracy - ", train_acc)
                            print ("\nTest accuracy - ", test_acc)
                            plt.plot(model_info.history['loss'], label='train')
                            plt.plot(model_info.history['val_loss'], label='validation')
                            plt.legend()
                            plt.title("Training vs. Validation Loss")
                            plt.show()
                            print("=========================================")
                            data_record.append(name)
                            data_record.append(actfn)
                            data_record.append(dropout)
                            # keep the same order as the DataFrame columns:
                            # Kernelinitmethod, biasinitimethod, batchsize
                            data_record.append(kname)
                            data_record.append(bname)
                            data_record.append(batch)
                            data_record.append(model_hist.loc[0,'accuracy'])
                            data_record.append(model_hist.loc[0,'val_accuracy'])
                            data_record.append(model_hist.loc[0,'val_loss'])
                            data_record.append(train_acc)
                            data_record.append(test_acc)
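                            # summary() prints and returns None, so capture its text instead
                            summary_lines = []
                            deep_model.summary(print_fn=summary_lines.append)
                            data_record.append("\n".join(summary_lines))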
                            data_record.append(model_hist.loc[0,'epochs'])
                            data_record.append(num_epochs)
                            data_list.append(tuple(data_record))
                            print("*************************************************************************************************************************")
                            print("*************************************************************************************************************************")

                            print("\n")

    deep_model_data=pd.DataFrame(data_list, columns=['model','activation', 'dropout', 'Kernelinitmethod', 'biasinitimethod', 'batchsize', 'trainaccuracy', 'testaccuracy', 
                                          'loss', 'meantrainaccuracy', 'meantestaccuracy', 'summary', 'epoch', 'epochs'])
    return deep_model_data
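
Because each value in `data_record` is appended positionally, the order of the appends has to match the order of the column names exactly. A hedged alternative is to build each record as a dictionary keyed by column name, so the pairing is explicit. The helper `make_record` below is hypothetical, and the `0.0` accuracies are placeholders rather than measured results.

```python
def make_record(model, activation, dropout, batch, kname, bname, test_acc):
    """Bundle one run's settings and result under explicit column names."""
    return {
        "model": model,
        "activation": activation,
        "dropout": dropout,
        "batchsize": batch,
        "Kernelinitmethod": kname,
        "biasinitimethod": bname,
        "meantestaccuracy": test_acc,
    }


# Placeholder values for illustration; 0.0 is not a measured accuracy.
records = [
    make_record("deep_nn1bn", "tanh", 0.4, 256, "uniform", "zero", 0.0),
    make_record("deep_nn3bn", "selu", 0.5, 512, "normal", "one", 0.0),
]

# Column names come from the dictionary keys, so they cannot drift apart
# from the values they describe.
columns = list(records[0].keys())
print(columns)
print(records[0]["batchsize"], records[0]["Kernelinitmethod"])
```

A list of such dictionaries can be passed straight to `pd.DataFrame`, which takes the column names from the keys.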

In [None]:
df_500_bninit_1=build_train_model_bninit(X_train_norm11, y_train, X_test_norm11, y_test, 500)

In [None]:
df_500_bninit_1.sort_values(by='testaccuracy', ascending=False).head()

No significant improvement in accuracy from including the kernel initializers; however, the configuration below performed better than the rest:

- selu/tanh with batch size = 256, kernel initializer = uniform and bias initializer = uniform

In [None]:
df_500_bninit_1.to_csv('Dataframe_500_bninit_1.csv')

In [None]:
df_500_bninit_2=build_train_model_bninit(X_train_norm14, y_train, X_test_norm14, y_test, 500)

In [None]:
df_500_bninit_2.sort_values(by='testaccuracy', ascending=False).head()

Similar behaviour as above: no significant improvement from adding kernel and bias initializers, but selu/tanh with uniform values for both performed better than the rest.

In [None]:
df_500_bninit_2.to_csv('Dataframe_500_bninit_2.csv')

In [None]:
df_1000_bninit_1=build_train_model_bninit(X_train_norm11, y_train, X_test_norm11, y_test, 1000)

In [None]:
df_1000_bninit_1.sort_values(by='testaccuracy', ascending=False).head()

Similar behaviour as above: no significant improvement from adding kernel and bias initializers, but tanh with uniform and normal values for both performed better than the rest.

In [None]:
df_1000_bninit_1.to_csv('Dataframe_1000_bninit_1.csv')

In [None]:
df_1000_bninit_2=build_train_model_bninit(X_train_norm14, y_train, X_test_norm14, y_test, 1000)

In [None]:
df_1000_bninit_2.sort_values(by='testaccuracy', ascending=False).head()

No significant improvement by including kernel and bias initializer

In [None]:
df_1000_bninit_2.to_csv('Dataframe_1000_bninit_2.csv')

The initializers did not improve accuracy, but it is still worth experimenting once more with the kernel initializer alone.

In [None]:
#create model for 3 hidden layers of 60,40 and 20 neurons each
def build_deep_nn1_withbninit2(dropouts, actfn, kernel_initializer):
    
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(60, kernel_initializer=kernel_initializer, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(40, kernel_initializer=kernel_initializer, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(20, kernel_initializer=kernel_initializer,  input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # apply sigmoid activation for the binary output
    model.add(Activation('sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
#create model for 3 hidden layers of 45,15 and 45 neurons each
def build_deep_nn2_withbninit2(dropouts, actfn, kernel_initializer):
    
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(45, kernel_initializer=kernel_initializer, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(15, kernel_initializer=kernel_initializer,  input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(45, kernel_initializer=kernel_initializer,  input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # apply sigmoid activation for the binary output
    model.add(Activation('sigmoid'))
    
    # Compiling the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
#create model for 3 hidden layers of 40, 50 and 40 neurons each
def build_deep_nn3_withbninit2(dropouts, actfn, kernel_initializer):
 
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(40, kernel_initializer=kernel_initializer,  input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(50, kernel_initializer=kernel_initializer, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(40, kernel_initializer=kernel_initializer,  input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # apply sigmoid activation for the binary output
    model.add(Activation('sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
def build_train_model_bninit2(X_train, y_train, X_test, y_test, num_epochs=1000):

    data_list=[]

    dropout_list=[0.4,0.5,0.6]
    act_fnlist=['selu','tanh']
    batch_size=[256, 512]
    data_record=[]
    kernel_initializer=[tf.keras.initializers.RandomNormal(mean=0., stddev=1.), tf.keras.initializers.RandomUniform(minval=0., maxval=1.), tf.keras.initializers.TruncatedNormal(mean=0., stddev=1.)]
    kname_list=['normal','uniform', 'tnormal']

    fucntion_list=[build_deep_nn1_withbninit2,build_deep_nn2_withbninit2, build_deep_nn3_withbninit2]
    function_name=['deep_nn1bn','deep_nn2bn','deep_nn3bn']
    
    for name, fn in zip(function_name,fucntion_list):
        for actfn in act_fnlist:
            for dropout in dropout_list:
                for batch in batch_size:
                    for kname, kinit in zip(kname_list,kernel_initializer):
                        data_record=[]
                        deep_model=fn(dropout, actfn, kinit)
                        earlystop = EarlyStopping(monitor='val_accuracy', min_delta=0.0001, patience=5, verbose=1, mode='auto')
                        model_info = deep_model.fit(X_train, y_train, epochs=num_epochs, validation_split=0.2,
                                            verbose=0, callbacks=[tfdocs.modeling.EpochDots(), earlystop],  batch_size=batch)
                        model_hist=pd.DataFrame(model_info.history)
                        model_hist['epochs']=model_info.epoch
                        model_hist=model_hist.sort_values(by='val_accuracy', ascending=False)
                        model_hist.reset_index(drop=True, inplace=True)
                        _, train_acc = deep_model.evaluate(X_train, y_train, verbose=1)
                        _, test_acc = deep_model.evaluate(X_test, y_test, verbose=1)

                        print("\nActivation - ", actfn)
                        print("\nDropout - ", dropout)
                        print("\nBatch size - ", batch)
                        print ("\nKernel init method - ", kname)
                        print("\nModel summary - ")
                        deep_model.summary()
                        print(model_hist.head())
                        print ("\nTrain accuracy - ", train_acc)
                        print ("\nTest accuracy - ", test_acc)
                        plt.plot(model_info.history['loss'], label='train')
                        plt.plot(model_info.history['val_loss'], label='validation')
                        plt.legend()
                        plt.title("Training vs. Validation Loss")
                        plt.show()
                        print("=========================================")
                        data_record.append(name)
                        data_record.append(actfn)
                        data_record.append(dropout)
                        data_record.append(kname)
                        data_record.append(batch)
                        data_record.append(model_hist.loc[0,'accuracy'])
                        data_record.append(model_hist.loc[0,'val_accuracy'])
                        data_record.append(model_hist.loc[0,'val_loss'])
                        data_record.append(train_acc)
                        data_record.append(test_acc)
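                        # summary() prints and returns None, so capture its text instead
                        summary_lines = []
                        deep_model.summary(print_fn=summary_lines.append)
                        data_record.append("\n".join(summary_lines))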
                        data_record.append(model_hist.loc[0,'epochs'])
                        data_record.append(num_epochs)
                        data_list.append(tuple(data_record))
                        print("*************************************************************************************************************************")
                        print("*************************************************************************************************************************")

                        print("\n")

    deep_model_data=pd.DataFrame(data_list, columns=['model','activation', 'dropout', 'Kernelinitmethod', 'batchsize', 'trainaccuracy', 'testaccuracy', 
                                          'loss', 'meantrainaccuracy', 'meantestaccuracy', 'summary', 'epoch', 'epochs'])
    return deep_model_data

In [None]:
df_500_bninit_k=build_train_model_bninit2(X_train_norm14, y_train, X_test_norm14, y_test, 500)

In [None]:
df_500_bninit_k.sort_values(by='testaccuracy', ascending=False).head()

In [None]:
df_500_bninit_k.to_csv('Dataframe_500_bninit_k.csv')

The kernel initializer (without a bias initializer) has not enhanced performance in terms of accuracy.

So further experiments can be done with the configurations of layers and neurons below, with BatchNormalization and EarlyStopping:
 - 3 hidden layers of 40, 50 and 40 neurons each
 - 3 hidden layers of 60, 40 and 20 neurons each
 - 3 hidden layers of 45, 15 and 45 neurons each

Epochs = 500, with data scaled by RobustScaler and StandardScaler and dropouts of 0.4, 0.5 and 0.6.
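
The remaining search space listed above can be enumerated with `itertools.product` rather than nested loops, which also makes its size easy to check before launching any training. This is a sketch under assumptions: the scaler labels `"robust"` and `"standard"` and the dictionary layout are hypothetical names chosen for illustration.

```python
from itertools import product

# Hidden-layer sizes of the three shortlisted architectures.
models = {
    "deep_nn3bn": (40, 50, 40),
    "deep_nn1bn": (60, 40, 20),
    "deep_nn2bn": (45, 15, 45),
}
scalers = ["robust", "standard"]  # hypothetical labels for the two scaled datasets
dropouts = [0.4, 0.5, 0.6]

# One dictionary per run; every combination appears exactly once.
grid = [
    {"model": m, "layers": models[m], "scaler": s, "dropout": d, "epochs": 500}
    for m, s, d in product(models, scalers, dropouts)
]

print("number of runs:", len(grid))
print("first run:", grid[0])
```

Each entry can then be handed to a training function, and further dimensions such as optimizers only need one more list in the `product` call.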

<a id=section1607></a>
### **16.7 Analyzing with various Optimizers**

In [None]:
def build_deep_nn1_fn(dropouts, actfn, optimizer):
    
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(60, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(40, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(20, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # apply sigmoid activation for the binary output
    model.add(Activation('sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    
    return model

In [None]:
def build_deep_nn2_fn(dropouts, actfn, optimizer):
    
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(45, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(15, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(45, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # apply sigmoid activation to output a probability for binary classification
    model.add(Activation('sigmoid'))
    
    # Compiling the model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    
    return model

In [None]:
def build_deep_nn3_fn(dropouts, actfn, optimizer):
 
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(40, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(50, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(40, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation(actfn))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # apply sigmoid activation to output a probability for binary classification
    model.add(Activation('sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    
    return model

In [None]:
def build_train_model_fn(X_train, y_train, X_test, y_test, num_epochs=1000):

    data_list=[]

    dropout_list=[0.4,0.5,0.6]
    act_fnlist=['selu','tanh']
    batch_size=[256, 512]
    data_record=[]

    function_list=[build_deep_nn1_fn,build_deep_nn2_fn, build_deep_nn3_fn]
    function_name=['deep_nn1bn','deep_nn2bn','deep_nn3bn']

    optimizer_list=['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
    
    for name, fn in zip(function_name,function_list):
        for actfn in act_fnlist:
            for dropout in dropout_list:
                for batch in batch_size:
                    for opt in optimizer_list:
                        data_record=[]
                        deep_model=fn(dropout, actfn, opt)
                        earlystop = EarlyStopping(monitor='val_accuracy', min_delta=0.0001, patience=10, verbose=1, mode='auto')
                        model_info = deep_model.fit(X_train, y_train, epochs=num_epochs, validation_split=0.2,
                                            verbose=0, callbacks=[tfdocs.modeling.EpochDots(), earlystop],  batch_size=batch)
                        model_hist=pd.DataFrame(model_info.history)
                        model_hist['epochs']=model_info.epoch
                        model_hist=model_hist.sort_values(by='val_accuracy', ascending=False)
                        model_hist.reset_index(drop=True, inplace=True)
                        _, train_acc = deep_model.evaluate(X_train, y_train, verbose=1)
                        _, test_acc = deep_model.evaluate(X_test, y_test, verbose=1)

                        print("\nActivation - ", actfn)
                        print("\nDropout - ", dropout)
                        print("\nBatch size - ", batch)
                        print ("\nOptimizer - ", opt)
                        print("\nModel summary - ")
                        deep_model.summary()  # summary() prints the table itself and returns None
                        print(model_hist.head())
                        print ("\nTrain accuracy - ", train_acc)
                        print ("\nTest accuracy - ", test_acc)
                        plt.plot(model_info.history['loss'], label='train')
                        plt.plot(model_info.history['val_loss'], label='validation')
                        plt.legend()
                        plt.title("Training and Validation Loss")
                        plt.show()
                        print("=========================================")
                        data_record.append(name)
                        data_record.append(actfn)
                        data_record.append(dropout)
                        data_record.append(batch)
                        data_record.append(opt)
                        data_record.append(model_hist.loc[0,'accuracy'])
                        data_record.append(model_hist.loc[0,'val_accuracy'])
                        data_record.append(model_hist.loc[0,'val_loss'])
                        data_record.append(train_acc)
                        data_record.append(test_acc)
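                        # summary() returns None, so capture its printed lines as text instead
                        summary_lines = []
                        deep_model.summary(print_fn=lambda line, **kwargs: summary_lines.append(line))
                        data_record.append("\n".join(summary_lines))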
                        data_record.append(model_hist.loc[0,'epochs'])
                        data_record.append(num_epochs)
                        data_list.append(tuple(data_record))
                        print("*************************************************************************************************************************")
                        print("*************************************************************************************************************************")

                        print("\n")

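    # Note: 'trainaccuracy', 'testaccuracy' and 'loss' hold the training accuracy, validation accuracy and
    # validation loss of the epoch with the best val_accuracy; 'meantrainaccuracy' and 'meantestaccuracy'
    # hold the evaluate() accuracy on the train and test sets; 'epoch' is the best epoch, 'epochs' the cap.
    deep_model_data=pd.DataFrame(data_list, columns=['model','activation', 'dropout', 'batchsize', 'Optimizer', 'trainaccuracy', 'testaccuracy', 
                                          'loss', 'meantrainaccuracy', 'meantestaccuracy', 'summary', 'epoch', 'epochs'])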
    return deep_model_data

In [None]:
df_500_opt1=build_train_model_fn(X_train_norm11, y_train, X_test_norm11, y_test, 500) 

In [None]:
df_500_opt1.sort_values(by='testaccuracy', ascending=False).head()

In [None]:
df_500_opt1.to_csv('Dataframe_500_opt1.csv')

In [None]:
df_500_opt2=build_train_model_fn(X_train_norm14, y_train, X_test_norm14, y_test, 500) 

In [None]:
df_500_opt2.sort_values(by='testaccuracy', ascending=False).head()

In [None]:
df_500_opt2.to_csv('Dataframe_500_opt2.csv')

So, after analyzing, the best model is:
 - 3 hidden layers of 60,40 and 20 neurons each
 - activation - tanh
 - dropout - 0.4 or 0.6
 - optimizer - Adam
 - Batch size - 256 or 512
 - Epoch - 500
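
To give a sense of the size of the sweep behind this choice, the sketch below enumerates the search space used in `build_train_model_fn` with `itertools.product`. It is only an illustration of how the five nested loops could be flattened and counted. It does not train anything, and the variable names are chosen for this example.

```python
from itertools import product

# Search space copied from build_train_model_fn.
model_names = ['deep_nn1bn', 'deep_nn2bn', 'deep_nn3bn']
activations = ['selu', 'tanh']
dropouts = [0.4, 0.5, 0.6]
batch_sizes = [256, 512]
optimizers = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']

# product() yields combinations in the same order as the nested for-loops.
configs = list(product(model_names, activations, dropouts, batch_sizes, optimizers))

# 3 * 2 * 3 * 2 * 7 = 252 training runs per scaled dataset
print(len(configs))
print(configs[0])
```

Each call to `build_train_model_fn` therefore trains 252 models, so the two calls in this section amount to 504 training runs in total.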

<a id=section1608></a>
### **16.8 Selecting the final parameters**

In [None]:
def build_final_model(dropouts):
    
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(60, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation('tanh'))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(40, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation('tanh'))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    # add a hidden layer
    model.add(Dense(20, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation('tanh'))
    #Add Dropout layer
    model.add(Dropout(dropouts))
    
    
    # add output layer
    model.add(Dense(1))
    # apply sigmoid activation to output a probability for binary classification
    model.add(Activation('sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
final_model=KerasClassifier(build_fn=build_final_model, verbose=1, epochs=500)
batch_size=[256,512]
dropouts=[0.4,0.6]
param_grid=dict(batch_size=batch_size, dropouts=dropouts)
grid=RandomizedSearchCV(estimator=final_model, param_distributions=param_grid, n_jobs=-1, cv=10, verbose=True)

In [None]:
grid_result=grid.fit(X_train_norm11, y_train)

In [None]:
grid_result.best_score_

In [None]:
grid_result.best_params_

In [None]:
grid_result1=grid.fit(X_train_norm14, y_train)

In [None]:
grid_result1.best_score_

In [None]:
grid_result1.best_params_

So, the final model would be:
 - Data scaled using StandardScaler
 - 3 hidden layers of 60,40 and 20 neurons each
 - activation - tanh
 - dropout - 0.4
 - optimizer - Adam
 - Batch size - 512
 - Epoch - 500

<a id=section1609></a>
### **16.9 Final Model**

 - Data scaled using StandardScaler
 - 3 hidden layers of 60,40 and 20 neurons each
 - activation - tanh
 - dropout - 0.4
 - optimizer - Adam
 - Batch size - 512
 - Epoch - 500

In [None]:
def build_finalized_model():
    
    # initialize model
    model = keras.Sequential()
    
    # add a hidden layer
    model.add(Dense(60, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation('tanh'))
    #Add Dropout layer
    model.add(Dropout(0.4))
    
    # add a hidden layer
    model.add(Dense(40, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation('tanh'))
    #Add Dropout layer
    model.add(Dropout(0.4))
    
    # add a hidden layer
    model.add(Dense(20, input_dim = len(X_train.columns)))
    model.add(BatchNormalization())
    # add activation layer to add non-linearity
    model.add(Activation('tanh'))
    #Add Dropout layer
    model.add(Dropout(0.4))
    
    
    # add output layer
    model.add(Dense(1))
    # apply sigmoid activation to output a probability for binary classification
    model.add(Activation('sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy',tf.keras.metrics.AUC()])
    
    return model

In [None]:
finalized_model=build_finalized_model()

In [None]:
finalized_model.summary()

In [None]:
earlystop = EarlyStopping(monitor='val_accuracy', min_delta=0.0001, patience=10, verbose=1, mode='auto')
final_model_info = finalized_model.fit(X_train_norm11, y_train, epochs=500, validation_split=0.2,verbose=0, callbacks=[tfdocs.modeling.EpochDots(), earlystop], batch_size=512)

In [None]:
final_df = pd.DataFrame(final_model_info.history)
final_df['epoch'] = final_model_info.epoch
final_df.tail()

In [None]:
plotter = tfdocs.plots.HistoryPlotter(smoothing_std=2)

In [None]:
plotter.plot({'Basic': final_model_info}, metric="accuracy")

In [None]:
plotter.plot({'Basic': final_model_info}, metric="loss")

In [None]:
results=finalized_model.evaluate(X_test_norm11, y_test, verbose=2)
results

In [None]:
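# predict_classes() was removed in TensorFlow 2.6; threshold the sigmoid output at 0.5 instead
test_predictions = (finalized_model.predict(X_test_norm11) > 0.5).astype("int32")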

In [None]:
test_predictions

In [None]:
pd.DataFrame(zip(y_test, test_predictions.flatten()), columns=['Actual','Predict'])

**Summary on final model from DL**

 - 3 hidden layers of 60,40 and 20 neurons each
 - activation - tanh
 - dropout - 0.4
 - optimizer - Adam
 - Batch size - 512
 - Epoch - 500

**Training accuracy** - : 0.803119

**Test accuracy** - : 0.8027

**AUC Score** - : 0.8513
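
As a reminder of what the AUC score measures, the sketch below computes it from first principles: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The function name `pairwise_auc` and the toy data are made up for illustration. `tf.keras.metrics.AUC` approximates the curve with a fixed set of thresholds, so its value can differ slightly from this exact pairwise figure.

```python
def pairwise_auc(y_true, y_score):
    """Exact ROC AUC as the share of correctly ranked positive/negative pairs."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (not taken from the dataset).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported score in context without depending on the 0.5 classification threshold.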

<a id=section17></a>
## **17. Comparison of Machine Learning Model results and Deep Learning results**

**From the Machine Learning models we got the values below**

- Model - Stacking Classifier built on Random Forest, Gradient Boosting Classifier, Extreme Gradient Boosting (XGBoost) classifier and LightGBM

- Training accuracy - : 0.8229196659948171

- Test accuracy - : 0.8166986564299424

- Recall Score - : 0.8166986564299424

- Precision Score - : 0.8042796289150841

- F1 Score - : 0.8059686365556438
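
The recall score listed above is identical to the test accuracy. That is what one would expect if recall was averaged across classes with support weights, although the averaging mode is not stated in this section, so treat that as an assumption. The sketch below uses made-up labels and hypothetical helper names to show why support-weighted recall always equals accuracy.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    # (support / total) * (correct / support) sums to total_correct / total
    return sum((support[c] / total) * (correct[c] / support[c]) for c in support)


# Toy labels (not taken from the dataset).
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]

print(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred))
print(math.isclose(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred)))
```

If that assumption holds, the recall figure adds no information beyond accuracy, and per-class recall would be the more informative number to compare.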

**From the Deep Learning models we got the values below**

 - 3 hidden layers of 60,40 and 20 neurons each
 - activation - tanh
 - dropout - 0.4
 - optimizer - Adam
 - Batch size - 512
 - Epoch - 500

**Training accuracy** - : 0.803119

**Test accuracy** - : 0.8027

**AUC Score** - : 0.8513

The accuracy values are marginally higher for the ML model, but further experiments with more DL configurations might yield better values.

But if we consider the time taken to train the final model, it was much shorter for the DL model.

<a id=section18></a>
## **18. Conclusion**

If we consider the tradeoff between accuracy and time, the DL model is preferable. If accuracy matters most, the ML model is preferable in this case: its accuracy is marginally higher, but it took a lot longer to train than the DL model on the given dataset.

Also, we can consider the points below for profitable advertisements:

 - Pharma is the most profitable industry in terms of advertisements.
 - Advertisements targeted towards males, especially males living with a spouse, are more profitable.
 - The Comedy genre is the most profitable, also for married people, and primetime is the most profitable time to air an advertisement.
 - Comedy and Drama with a runtime of 40-50 mins are more profitable across the world, although all categories have a significant count.
 - The most profitable runtime for Comedy/Drama is 40 mins, followed by 50 mins, 45 mins and 60 mins.
 - Pharma is mostly profitable for a male audience, while Auto and Entertainment are profitable for both male and female audiences.