# Marketing Analytics - Paphawit

>**“If you can’t explain it simply, you don’t understand it well enough.”**
— Albert Einstein

**Marketing analytics** is an essential process for a company because it supports better decisions. A good decision can yield a large profit with minimal effort, whereas a bad decision can cost a large amount of revenue.

For that reason, this notebook uses a marketing dataset to show how such data can be used with statistics and machine learning. The table has 28 columns and 2,240 rows. The column descriptions were collected from a post by *Thiago Fraletti* in the Kaggle discussion section.

***

# Table of Contents

- [Feature Description](#feature)
- [Reading Data](#read-data)
- [Data Cleaning](#data-cleaning)
- [Data Exploring](#data-exploring)
    - [Univariate Analysis](#univariate)
    - [Bivariate Analysis](#bivariate)
- [Preparing Data for Machine Learning](#preparing)
    - [Normalizer Data](#normalizer)
    - [One Hot Encoding Data](#one-hot)
    - [Train / Test Split Data](#train-test)
- [Cluster Analysis](#cluster-analysis)
    - [K-Means Clustering](#k-means)
- [Machine Learning Algorithms](#machine-learning)
    - [Logistic Regression Classifier](#logistic-regression)
    - [Decision Tree Classifier](#decision-tree)
    - [Support Vector Machine Classifier](#svc)
    - [Gradient Boosting Classifier](#gradient-boosting-classifier)

***

# Feature Description <a class="anchor" id="feature"></a>

The features of this dataset consist of:
- **Demographic** (Dt_Customer, Education, Marital_Status, Kidhome, Teenhome, Income)
- **Purchasing record** (amount spent on products, number of purchases and web visits)
- **Campaign record** (the acceptance of each campaign)

The target is the `Response` feature, which indicates whether the customer accepted the last campaign. The full feature list is shown below:
- `AcceptedCmp1` -> 1 if customer accepted the offer in the 1st campaign. 0 otherwise
- `AcceptedCmp2` -> 1 if customer accepted the offer in the 2nd campaign. 0 otherwise 
- `AcceptedCmp3` -> 1 if customer accepted the offer in the 3rd campaign. 0 otherwise 
- `AcceptedCmp4` -> 1 if customer accepted the offer in the 4th campaign. 0 otherwise 
- `AcceptedCmp5` -> 1 if customer accepted the offer in the 5th campaign. 0 otherwise 
- `Response` -> 1 if customer accepted the offer in the last campaign. 0 otherwise (target)
- `Complain` -> 1 if customer complained in the last 2 years
- `Dt_Customer` -> date of customer's enrollment with the company
- `Education` -> customer's level of education
- `Marital_Status` -> customer's marital status
- `Kidhome` -> number of small children in customer's household
- `Teenhome` -> number of teenagers in customer's household
- `Income` -> customer's yearly household income
- `MntFishProducts` -> amount spent on fish products in the last 2 years
- `MntMeatProducts` -> amount spent on meat products in the last 2 years
- `MntFruits` -> amount spent on fruits in the last 2 years
- `MntSweetProducts` -> amount spent on sweet products in the last 2 years
- `MntWines` -> amount spent on wines in the last 2 years
- `MntGoldProds` -> amount spent on gold products in the last 2 years
- `NumDealsPurchases` -> number of purchases made with discount
- `NumCatalogPurchases` -> number of purchases made using catalogue
- `NumStorePurchases` -> number of purchases made directly in stores
- `NumWebPurchases` -> number of purchases made through company's web site
- `NumWebVisitsMonth` -> number of visits to company's web site in the last month 
- `Recency` -> number of days since the last purchase

The techniques used on this dataset are:
- **Customer Segmentation** (by K-Means Clustering)
- **Predicting the target** (`Response`) (by using Logistic Regression Classifier / Decision Tree Classifier / Support Vector Machine Classifier / Gradient Boosting Classifier)

***

# Reading Data <a class="anchor" id="read-data"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None) # show all columns when displaying a dataframe
df_read = pd.read_csv('/kaggle/input/marketing-data/marketing_data.csv')

***

# Data Cleaning <a class="anchor" id="data-cleaning"></a>

To start the data cleaning process, inspect the data with the **Pandas** `head()` and `info()` functions.

In [None]:
df_read.head()

Some columns in the dataset can be changed as follows:
- `ID` can be set as the index
- `Year_Birth` can be converted to age (easier to interpret)
- `Income` can be converted to float (for use in machine learning)
- `Dt_Customer` can be converted to the number of years as a customer (for later classification)

In [None]:
df_read.info()

There are some missing values in the `Income` feature. The rows with missing income are removed with the `dropna` function, and the whitespace around this column's name is removed as well.

In [None]:
df = df_read.copy() # create a new dataframe from the original dataframe
df = df.rename(columns={' Income ': 'Income'}) # rename the Income feature by removing the surrounding spaces
df = df.set_index('ID') # set the index to the customer ID
df = df.dropna(subset=['Income']) # drop the 24 rows with missing income
df['Income'] = df['Income'].replace(r'[\$,]', '', regex=True).astype(float) # remove the dollar sign and commas, then convert income from string to float
df['Dt_Customer'] = pd.to_datetime(df["Dt_Customer"]).dt.strftime('%Y') # convert the Dt_Customer feature into the enrollment year
df['Year_of_Customer'] = 2021 - df['Dt_Customer'].astype(int) # convert Dt_Customer to Year_of_Customer for later classification
df['Age'] = 2021 - df['Year_Birth'] # change the Year_Birth feature into Age (easier to interpret)
df = df.drop(['Year_Birth','Dt_Customer'],axis=1) # drop the features that are no longer needed
df = df.sort_index() # sort by customer ID for easier reference
df.head()
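
The currency cleanup in the cell above can be checked in isolation. The following is a minimal sketch on a made-up three-row table (the values and the `raw` and `clean` names are assumptions for illustration, not rows from the real file); it strips the column name, drops the missing income and converts the remaining strings to floats.

```python
import pandas as pd

# Toy rows imitating the raw file: padded column name, currency strings, one missing value.
raw = pd.DataFrame({
    "ID": [3, 1, 2],
    " Income ": ["$84,835.00 ", None, "$57,091.00 "],
    "Year_Birth": [1970, 1961, 1958],
})

clean = (
    raw.rename(columns=lambda name: name.strip())  # remove whitespace around every column name
       .set_index("ID")
       .dropna(subset=["Income"])                  # drop rows without an income
       .sort_index()
)

# Remove '$', ',' and stray whitespace, then convert the strings to floats.
clean["Income"] = clean["Income"].str.replace(r"[\$,\s]", "", regex=True).astype(float)

print(clean)
```

Stripping every column name, rather than renaming only `' Income '`, is a design choice of this sketch; it also protects against padded names in other columns.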

In [None]:
# Rearrange the newly converted columns to match the order of the original dataset
df = df.reindex(columns = ['Age','Education', 'Marital_Status', 'Income', 'Kidhome', 'Teenhome', 'Year_of_Customer',
                           'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts',
                           'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
                           'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases',
                           'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp1',
                           'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5',
                           'Response', 'Complain', 'Country'])

***

# Data Exploring <a class="anchor" id="data-exploring"></a>

Before any machine learning step, each individual column of the dataset and the relationships between columns should be understood.

Import **Matplotlib** and **Seaborn** to visualize the data. (This notebook uses Seaborn as the main plotting library.)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('Set2') #set theme color of graph

## Univariate Analysis <a class="anchor" id="univariate"></a>

Univariate analysis helps to understand each individual column, starting with the `describe()` function to see the overall statistics.

In [None]:
df.describe().T.apply(lambda x: np.round(x, decimals=2)) # transpose and round to two decimals so the summary is easier to interpret

Create helper functions to plot each individual column.

In [None]:
#configuration plotting
def plot_config(figsize, xticks_rotation, title, ylim):
    plt.figure(figsize=figsize)
    plt.xticks(rotation=xticks_rotation)
    plt.title(title)
    plt.ylim(ylim)

#create number label in histogram    
def histogram_label(histogram, space):
    for p in histogram.patches:
        height = p.get_height()
        histogram.text(x = p.get_x()+(p.get_width()/2),
        y = height+space,
        s = '{:.0f}'.format(height),
        ha = 'center')
        
#create number label in countplot        
def countplot_label(label_value, space):
    for index,data in enumerate(label_value):
        plt.text(index,data+space,'{:,}'.format(data),horizontalalignment='center',rotation=0)

**Age of customer** is approximately normally distributed. The range is between 29 and 78 years old. The median age is 51.

In [None]:
pd.DataFrame(df['Age'].describe().apply(lambda x: np.round(x, decimals=2))).T

In [None]:
plot_config(figsize=(10,5), xticks_rotation=0 , title='Age of Customer', ylim=(0,700))
customer_histplot = sns.histplot(df['Age'], bins=np.histogram_bin_edges(df['Age'], bins=6, range=(20, 80)))
histogram_label(histogram=customer_histplot, space=5)

**Customer Education** is mostly at the graduation level (n=1116). The column has five unique values.

In [None]:
pd.DataFrame(df['Education'].describe()).T

In [None]:
plot_config(figsize=(5,5), xticks_rotation=0 , title='Customer Education', ylim=(0,1200))
education_countplot = sns.countplot(x='Education', data=df, order=df['Education'].value_counts().index)
countplot_label(label_value=df['Education'].value_counts().sort_values(ascending=False),space=20)

**Marital Status** is mostly married, which is the largest group. The Alone, YOLO and Absurd statuses have very small samples, so the rows with these three statuses will be dropped.

In [None]:
pd.DataFrame(df['Marital_Status'].describe()).T

In [None]:
plot_config(figsize=(10,5), xticks_rotation=0, title='Marital Status', ylim=(0,1000))
marital_countplot = sns.countplot(x='Marital_Status', data=df, order = df['Marital_Status'].value_counts().index)
countplot_label(label_value=df['Marital_Status'].value_counts().sort_values(ascending=False), space=10)

**Customer Income** is approximately normally distributed, but the column contains outliers that need to be cut off. Because of the outliers, the maximum and minimum values are far apart and the overall statistics are distorted.

In [None]:
pd.DataFrame(df['Income'].describe().apply(lambda x: np.round(x, decimals=2))).T

In [None]:
plot_config(figsize=(25,5), xticks_rotation=0, title='Customer Income', ylim=(0,200))
income_histplot = sns.histplot(df['Income'])
histogram_label(histogram=income_histplot, space=1)

**Number of Kids at Home** is mostly zero. The column takes the values zero, one and two, which is the number of kids in the customer's home.

In [None]:
pd.DataFrame(df['Kidhome'].describe().apply(lambda x: np.round(x, decimals=2))).T

In [None]:
plot_config(figsize=(5,5), xticks_rotation=0, title='Number of Kid in Home', ylim=(0,1500))
kidhome_countplot = sns.countplot(x='Kidhome', data=df, order = df['Kidhome'].value_counts().index)
countplot_label(label_value=df['Kidhome'].value_counts().sort_values(ascending=False), space=15)

**Number of Teens at Home** is split almost equally between zero and one.

In [None]:
pd.DataFrame(df['Teenhome'].describe().apply(lambda x: np.round(x, decimals=2))).T

In [None]:
plot_config(figsize=(5,5), xticks_rotation=0, title='Number of Teen in Home', ylim=(0,1500))
teenhome_countplot = sns.countplot(x='Teenhome', data=df, order = df['Teenhome'].value_counts().index)
countplot_label(label_value=df['Teenhome'].value_counts().sort_values(ascending=False), space=15)

**Years of Being a Customer** is mostly 8 years.

In [None]:
pd.DataFrame(df['Year_of_Customer'].describe().apply(lambda x: np.round(x, decimals=2))).T

In [None]:
plot_config(figsize=(5,5), xticks_rotation=0, title='Year of Being Customer', ylim=(0,1500))
year_countplot = sns.countplot(x='Year_of_Customer', data=df, order =sorted(df['Year_of_Customer'].value_counts().index))
countplot_label(label_value=df['Year_of_Customer'].value_counts().sort_index(), space=15)

**Number of days since the last purchase** is spread almost evenly across its values.

In [None]:
pd.DataFrame(df['Recency'].describe().apply(lambda x: np.round(x, decimals=2))).T

In [None]:
plot_config(figsize=(10,5), xticks_rotation=0, title='Number of days since the last purchase', ylim=(0,300))
recency_histplot = sns.histplot(df['Recency'],bins=np.histogram_bin_edges(df['Recency'], bins=10, range=(0, 100)))
histogram_label(histogram=recency_histplot, space=2)

**Amounts spent on wines, fruits, sweet products and gold products** have similarly shaped distributions. Wines have the highest mean, which implies that customers spend the most on wines.

In [None]:
pd.DataFrame(df[['MntWines','MntFruits','MntSweetProducts','MntGoldProds']].describe().apply(lambda x: np.round(x, decimals=2))).T

In [None]:
plot_config(figsize=(10,5),xticks_rotation=0, title='Amount spent on wines in the last 2 years', ylim=(0,1000))
plt.xticks(np.arange(0, 1600, 100))
wines_histplot = sns.histplot(df['MntWines'],bins=np.histogram_bin_edges(df['MntWines'], bins=15, range=(0, 1500)))
histogram_label(histogram=wines_histplot, space=7)

In [None]:
plot_config(figsize=(10,5), xticks_rotation=0, title='Amount spent on fruits in the last 2 years', ylim=(0,1600))
plt.xticks(np.arange(0, 220, 20))
fruits_histogram = sns.histplot(df['MntFruits'],bins=np.histogram_bin_edges(df['MntFruits'], bins=10, range=(0, 200)))
histogram_label(histogram=fruits_histogram, space=10)

In [None]:
plot_config(figsize=(10,5), xticks_rotation=0, title='Amount spent on sweet products in the last 2 years', ylim=(0,1800))
plt.xticks(np.arange(0, 220, 20))
sweet_histogram = sns.histplot(df['MntSweetProducts'],bins=np.histogram_bin_edges(df['MntSweetProducts'], bins=10, range=(0, 200)))
histogram_label(histogram=sweet_histogram, space=10)

In [None]:
plot_config(figsize=(10,5), xticks_rotation=0, title='Amount spent on gold products in the last 2 years', ylim=(0,2000))
plt.xticks(np.arange(0, 300, 20))
gold_histogram = sns.histplot(df['MntGoldProds'],bins=np.histogram_bin_edges(df['MntGoldProds'], bins=14, range=(0, 280)))
histogram_label(histogram=gold_histogram, space=15)

The five charts below show the channels through which customers purchase the company's products. The store is the most used channel in this dataset.

In [None]:
pd.DataFrame(df[['NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth']].describe().apply(lambda x: np.round(x, decimals=2))).T

In [None]:
plot_config(figsize=(10,5), xticks_rotation=0, title='Number of purchases made with discount', ylim=(0,1000))
numdeal_countplot = sns.countplot(x='NumDealsPurchases', data=df, order = sorted(df['NumDealsPurchases'].value_counts().index))
countplot_label(label_value=df['NumDealsPurchases'].value_counts().sort_values(ascending=False).sort_index(), space=7)

In [None]:
plot_config(figsize=(10,5), xticks_rotation=0, title="Number of purchases made through company's web site", ylim=(0,1000))
numweb_countplot = sns.countplot(x='NumWebPurchases', data=df, order = sorted(df['NumWebPurchases'].value_counts().index)).set(ylim=(0,1000))
countplot_label(label_value=df['NumWebPurchases'].value_counts().sort_values(ascending=False).sort_index(), space=7)

In [None]:
plot_config(figsize=(10,5), xticks_rotation=0, title='Number of purchases made using catalogue', ylim=(0,1000))
numcatalog_countplot = sns.countplot(x='NumCatalogPurchases', data=df, order = sorted(df['NumCatalogPurchases'].value_counts().index))
countplot_label(label_value=df['NumCatalogPurchases'].value_counts().sort_values(ascending=False).sort_index(), space=7)

In [None]:
plot_config(figsize=(10,5), xticks_rotation=0, title='Number of purchases made directly in stores', ylim=(0,1000))
numstore_countplot = sns.countplot(x='NumStorePurchases', data=df, order = sorted(df['NumStorePurchases'].value_counts().index))
countplot_label(label_value=df['NumStorePurchases'].value_counts().sort_values(ascending=False).sort_index(), space=7)

In [None]:
plot_config(figsize=(10,5), xticks_rotation=0, title="Number of visits to company's web site in the last month", ylim=(0,1000))
numweb_countplot = sns.countplot(x='NumWebVisitsMonth', data=df, order = sorted(df['NumWebVisitsMonth'].value_counts().index))
countplot_label(label_value=df['NumWebVisitsMonth'].value_counts().sort_values(ascending=False).sort_index(), space=7)

The six charts below show the acceptance of each campaign.

In [None]:
pd.DataFrame(df[['AcceptedCmp1','AcceptedCmp2','AcceptedCmp3','AcceptedCmp4','AcceptedCmp5']].describe().apply(lambda x: np.round(x, decimals=2))).T

In [None]:
plot_config(figsize=(5,5), xticks_rotation=0, title='Customer accepted the 1st campaign', ylim=(0,2500))
accept1_countplot = sns.countplot(x='AcceptedCmp1', data=df, order = df['AcceptedCmp1'].value_counts().index)
countplot_label(label_value=df['AcceptedCmp1'].value_counts().sort_values(ascending=False).sort_index(), space=15)

In [None]:
plot_config(figsize=(5,5), xticks_rotation=0, title='Customer accepted the 2nd campaign', ylim=(0,2500))
accept2_countplot = sns.countplot(x='AcceptedCmp2', data=df, order = df['AcceptedCmp2'].value_counts().index)
countplot_label(label_value=df['AcceptedCmp2'].value_counts().sort_values(ascending=False).sort_index(), space=15)

In [None]:
plot_config(figsize=(5,5), xticks_rotation=0, title='Customer accepted the 3rd campaign', ylim=(0,2500))
accept3_countplot = sns.countplot(x='AcceptedCmp3', data=df, order = df['AcceptedCmp3'].value_counts().index)
countplot_label(label_value=df['AcceptedCmp3'].value_counts().sort_values(ascending=False).sort_index(), space=15)

In [None]:
plot_config(figsize=(5,5), xticks_rotation=0, title='Customer accepted the 4th campaign', ylim=(0,2500))
accept4_countplot = sns.countplot(x='AcceptedCmp4', data=df, order = df['AcceptedCmp4'].value_counts().index)
countplot_label(label_value=df['AcceptedCmp4'].value_counts().sort_values(ascending=False).sort_index(), space=15)

In [None]:
plot_config(figsize=(5,5), xticks_rotation=0, title='Customer accepted the 5th campaign', ylim=(0,2500))
accept5_countplot = sns.countplot(x='AcceptedCmp5', data=df, order = df['AcceptedCmp5'].value_counts().index)
countplot_label(label_value=df['AcceptedCmp5'].value_counts().sort_values(ascending=False).sort_index(), space=15)

In [None]:
plot_config(figsize=(5,5), xticks_rotation=0, title='Customer accepted in the last campaign (target)', ylim=(0,2500))
response_countplot = sns.countplot(x='Response', data=df, order = df['Response'].value_counts().index)
countplot_label(label_value=df['Response'].value_counts().sort_values(ascending=False).sort_index(), space=15)

In this dataset, customers rarely complain.

In [None]:
pd.DataFrame(df['Complain'].describe().apply(lambda x: np.round(x, decimals=2))).T

In [None]:
plot_config(figsize=(5,5), xticks_rotation=0, title='Customer complained in the last 2 years', ylim=(0,2500))
complain_countplot = sns.countplot(x='Complain', data=df, order = df['Complain'].value_counts().index)
countplot_label(label_value=df['Complain'].value_counts().sort_values(ascending=False).sort_index(), space=15)

**Customer's Country** is diverse. However, the ME sample is small and needs to be cut out.

In [None]:
pd.DataFrame(df['Country'].describe()).T

In [None]:
plot_config(figsize=(10,5), xticks_rotation=0, title="Customer's Country", ylim=(0,2500))
country_countplot = sns.countplot(x='Country', data=df, order = df['Country'].value_counts().index)
countplot_label(label_value=df['Country'].value_counts().sort_values(ascending=False), space=15)

After reviewing each individual column, the outliers are removed from the dataset.

In [None]:
#cut off the outlier
df = df[df['Age'].between(29,78,inclusive='both')]
df = df.query("Marital_Status not in ['Alone', 'Absurd', 'YOLO']")
df = df[df['Income'].between(0,100000,inclusive='both')]
df = df.query('MntGoldProds <= 270')
df = df.query('NumDealsPurchases <= 9')
df = df.query('NumCatalogPurchases <= 11')
df = df.query('NumWebVisitsMonth <= 9')
df = df.query("Country not in ['ME']")
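
The cut-off values above were chosen by reading the charts. As an alternative way to derive such thresholds, the sketch below uses the common 1.5 × IQR rule on a made-up income series. This is a different technique from the manual cut-offs in this notebook, and the `iqr_bounds` helper and the sample values are hypothetical.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return (low, high) limits that lie k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up incomes with one extreme value standing in for an outlier.
income = pd.Series([30000, 42000, 51000, 58000, 64000, 70000, 666666], name="Income")

low, high = iqr_bounds(income)
kept = income[income.between(low, high)]  # keep only values inside the limits

print(low, high)
print(kept.tolist())
```

A rule like this is easier to reproduce than hand-picked limits, but it may remove more or fewer rows than the thresholds used in this notebook.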

Plot the final charts of all columns to see the overall picture.

In [None]:
fig, axs = plt.subplots(nrows=9,ncols=3,figsize=(20,45))
sns.histplot(ax = axs[0,0],x='Age',data=df)
sns.countplot(ax = axs[0,1],x='Education',data=df)
sns.countplot(ax = axs[0,2],x='Marital_Status',data=df)
sns.histplot(ax = axs[1,0],x='Income',data=df)
sns.countplot(ax = axs[1,1],x='Kidhome',data=df)
sns.countplot(ax = axs[1,2],x='Teenhome',data=df)
sns.countplot(ax = axs[2,0],x='Year_of_Customer',data=df)
sns.histplot(ax = axs[2,1],x='Recency',data=df)
sns.histplot(ax = axs[2,2],x='MntWines',data=df)
sns.histplot(ax = axs[3,0],x='MntFruits',data=df)
sns.histplot(ax = axs[3,1],x='MntMeatProducts',data=df)
sns.histplot(ax = axs[3,2],x='MntFishProducts',data=df)
sns.histplot(ax = axs[4,0],x='MntSweetProducts',data=df)
sns.histplot(ax = axs[4,1],x='MntGoldProds',data=df)
sns.countplot(ax = axs[4,2],x='NumDealsPurchases',data=df)
sns.countplot(ax = axs[5,0],x='NumWebPurchases',data=df)
sns.countplot(ax = axs[5,1],x='NumCatalogPurchases',data=df)
sns.countplot(ax = axs[5,2],x='NumStorePurchases',data=df)
sns.countplot(ax = axs[6,0],x='NumWebVisitsMonth',data=df)
sns.countplot(ax = axs[6,1],x='AcceptedCmp1',data=df)
sns.countplot(ax = axs[6,2],x='AcceptedCmp2',data=df)
sns.countplot(ax = axs[7,0],x='AcceptedCmp3',data=df)
sns.countplot(ax = axs[7,1],x='AcceptedCmp4',data=df)
sns.countplot(ax = axs[7,2],x='AcceptedCmp5',data=df)
sns.countplot(ax = axs[8,0],x='Response',data=df)
sns.countplot(ax = axs[8,1],x='Complain',data=df)
sns.countplot(ax = axs[8,2],x='Country',data=df)

## Bivariate Analysis <a class="anchor" id="bivariate"></a>

Plot a heatmap with Seaborn to see the relationships between the variables.

In [None]:
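plt.figure(figsize=(15,10))
corr = df.select_dtypes('number').corr() # correlate numeric columns only; text columns such as Education have no correlation
mask = np.triu(np.ones_like(corr, dtype=bool)) # hide the upper triangle, which mirrors the lower one
sns.heatmap(corr.round(1),annot=True, cmap="YlGnBu",mask=mask)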

From the heatmap above, the dataset shows that:
- Customers with a high income tend to spend more. Their main purchasing channels are the catalog and the store, followed by the website. They tend not to have kids at home. The campaigns that most affect their decisions are the 1st and 5th campaigns.
- Customers with kids at home tend to spend less. They often visit the website and accept a few of the deals that the company offers.
- Acceptance of the five campaigns is noticeably correlated with the amount spent on wines.
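
Reading the strongest relationships off a large heatmap by eye is error-prone. The sketch below ranks column pairs by absolute correlation instead. It runs on a small made-up table, and the `top_correlations` helper is a hypothetical name, not part of pandas.

```python
import numpy as np
import pandas as pd

def top_correlations(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes("number").corr()
    cols = corr.columns
    # Indices of the strict lower triangle, so each pair appears exactly once.
    rows_idx, cols_idx = np.tril_indices(len(cols), k=-1)
    pairs = pd.Series(
        corr.to_numpy()[rows_idx, cols_idx],
        index=[f"{cols[i]} ~ {cols[j]}" for i, j in zip(rows_idx, cols_idx)],
    )
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Made-up values that only imitate the direction of the relationships described above.
toy = pd.DataFrame({
    "Income": [20, 40, 60, 80, 100],
    "MntWines": [1, 3, 5, 8, 10],
    "Kidhome": [2, 1, 1, 0, 0],
    "Country": ["SP", "US", "SP", "CA", "US"],  # non-numeric, ignored by the helper
})

top = top_correlations(toy, n=2)
print(top)
```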

***

# Cluster Analysis <a class="anchor" id="cluster-analysis"></a>

## K-Means Clustering <a class="anchor" id="k-means"></a>

Use K-Means clustering to understand the dataset further.

In [None]:
from sklearn.cluster import KMeans

In [None]:
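def plot_k(column_name,x,y):
    # cluster the customers into two groups using a single column
    df_kmeans = df[[column_name]]
    kmeans = KMeans(n_clusters=2, random_state=0) # fixed seed so the labels are reproducible
    kmeans.fit(df_kmeans)
    # colour the Age/Income scatter by cluster label; axs_k must exist before this function is called
    sns.scatterplot(ax=axs_k[x,y],data=df,x='Age',y='Income',hue=kmeans.labels_,palette="Set2")
    axs_k[x,y].set_title(column_name)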

In [None]:
fig_k, axs_k = plt.subplots(nrows=2,ncols=3,figsize=(20,10))

plot_k('MntWines',0,0)
plot_k('MntFruits',0,1)
plot_k('MntMeatProducts',0,2)
plot_k('MntFishProducts',1,0)
plot_k('MntSweetProducts',1,1)
plot_k('MntGoldProds',1,2)

In [None]:
fig_k, axs_k = plt.subplots(nrows=2,ncols=3,figsize=(20,10))

plot_k('NumDealsPurchases',0,0)
plot_k('NumWebPurchases',0,1)
plot_k('NumCatalogPurchases',0,2)
plot_k('NumStorePurchases',1,0)
plot_k('NumWebVisitsMonth',1,1)
fig_k.delaxes(axs_k[1,2])

From the charts, K-Means shows that **income is more effective than age for grouping customers.**
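
The cells above fix the number of clusters at two. One common way to check that choice is to compare silhouette scores for several values of k. The sketch below does this on synthetic age and income values (the data and the candidate range 2 to 5 are assumptions, and this is not the procedure used in this notebook).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two income groups that share the same age range.
income = np.concatenate([rng.normal(30000, 4000, 100), rng.normal(80000, 4000, 100)])
age = rng.uniform(29, 78, 200)

# Standardize so that income does not dominate the distance purely because of its scale.
features = StandardScaler().fit_transform(np.column_stack([age, income]))

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    scores[k] = silhouette_score(features, labels)  # closer to 1 means better separated clusters

best_k = max(scores, key=scores.get)
print(scores, best_k)
```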

***

# Preparing Data for Machine Learning <a class="anchor" id="preparing"></a>

Before implementing a classifier, the dataset should be normalized, one-hot encoded and split into train and test sets.

## Normalizer Data <a class="anchor" id="normalizer"></a>

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_norm = df.copy()
# scale each listed column to the [0, 1] range (MinMaxScaler works column by column)
scale_columns = ['Income', 'MntWines', 'MntFruits', 'MntMeatProducts',
                 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
                 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases',
                 'NumStorePurchases', 'NumWebVisitsMonth']
df_norm[scale_columns] = scaler.fit_transform(df_norm[scale_columns])
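
With its default settings, `MinMaxScaler` maps each column to the range 0 to 1 using (x - min) / (max - min). The sketch below confirms this on a made-up two-column table by comparing the scaler output with the manual formula; the values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values on very different scales.
toy = pd.DataFrame({
    "Income": [20000.0, 50000.0, 100000.0],
    "MntWines": [0.0, 300.0, 1500.0],
})

scaled = MinMaxScaler().fit_transform(toy)            # NumPy array, one column per feature
manual = (toy - toy.min()) / (toy.max() - toy.min())  # the same formula applied column by column

print(scaled)
print(np.allclose(scaled, manual.to_numpy()))
```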

## One Hot Encoding Data <a class="anchor" id="one-hot"></a>

In [None]:
df_dummies = pd.get_dummies(df_norm,columns=['Education','Marital_Status','Country'])
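
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up table: the categorical column is replaced by one indicator column per category, while the numeric column is left unchanged.

```python
import pandas as pd

# Made-up rows with one categorical and one numeric column.
toy = pd.DataFrame({
    "Education": ["PhD", "Basic", "PhD"],
    "Income": [1.0, 0.2, 0.7],
})

encoded = pd.get_dummies(toy, columns=["Education"])

print(encoded.columns.tolist())
print(encoded)
```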

## Train / Test Split Data <a class="anchor" id="train-test"></a>

In [None]:
df_dummies['Response'].value_counts()

The response value is imbalanced between 0 and 1, which will affect the algorithms. As a result, the dataset is undersampled to 314 rows for each of the 0 and 1 response values.

In [None]:
df_sample_0 = df_dummies.query("Response == 0").sample(n=314,random_state=0)
df_sample_1 = df_dummies.query("Response == 1").sample(n=314,random_state=0)
df_sample = pd.concat([df_sample_0,df_sample_1])
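
The sample size of 314 above is hard-coded to the size of the smaller class. A sketch of the same undersampling idea that derives the size from the data is shown below; the `undersample` helper and the toy table are hypothetical.

```python
import pandas as pd

def undersample(frame: pd.DataFrame, target: str, random_state: int = 0) -> pd.DataFrame:
    """Sample every class down to the size of the smallest class."""
    n_min = frame[target].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in frame.groupby(target)
    ]
    return pd.concat(parts)

# Made-up imbalanced data: eight rows of class 0 and three rows of class 1.
toy = pd.DataFrame({"Response": [0] * 8 + [1] * 3, "Income": range(11)})

balanced = undersample(toy, "Response")
print(balanced["Response"].value_counts())
```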

In [None]:
X = df_sample.drop('Response',axis=1)
y = df_sample['Response']

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
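
Because the sampled data is balanced, a random split usually keeps the classes roughly even, but that is not guaranteed. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The toy arrays are assumptions, and this option is not used in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up balanced data: 20 rows, two features, ten rows per class.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

print(np.bincount(y_tr), np.bincount(y_te))
```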

***

# Machine Learning Algorithms <a class="anchor" id="machine-learning"></a>

Apply four algorithms to classify the target.

## Logistic Regression Classifier <a class="anchor" id="logistic-regression"></a>

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train,y_train)
plot_confusion_matrix(clf, X_test, y_test,display_labels=['Not Response','Response'])  
plt.show()
clf.score(X_test, y_test)

The Logistic Regression Classifier performs well on this dataset. It has 82.21% accuracy on the test set.
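
The accuracy reported by `score` is the share of correct predictions, which can also be read from the confusion matrix as (TP + TN) / total. The sketch below shows the arithmetic on made-up labels, not on the predictions of this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels (1 = Response, 0 = Not Response).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)  # rows are true labels, columns are predicted labels
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
print(cm)
print(accuracy, accuracy_score(y_true, y_pred))
```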

In [None]:
df_coef = pd.DataFrame({'Name':X_train.columns,'Coefficient': clf.coef_.round(2)[0]}).sort_values('Coefficient',ascending=False)
graph_coef = sns.catplot(x='Coefficient',y='Name', data=df_coef,kind='bar')
graph_coef.fig.set_size_inches(20,10)

The columns with the highest coefficients are:
- Positive coefficient:
    - Acceptance of the 3rd campaign
    - Number of website visits per month
    - Amount spent on meat products
    - Acceptance of the 5th campaign
    - Number of purchases made through the catalog and with deals
- Negative coefficient:
    - Number of purchases made at the store
    - Having a basic education
    - Having a teen at home
    - The IND country
    - Being married
These are the recommendations from the logistic regression classifier:
- The company can learn from the 3rd and 5th campaigns, because people who accepted them are likely to accept the last campaign.
- The number of website visits per month also has a positive effect on acceptance.
- People who purchase through the catalog and with deals tend to accept the last campaign.
- People who spend more on meat are likely to accept the last campaign.
- People who purchase more at the store tend not to accept the last campaign.
- The last campaign is not attractive to people who have a basic education, have a teen at home, live in the IND country or are married.
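
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to explain: a value above 1 raises the odds of acceptance, a value below 1 lowers them. The sketch below fits a model on synthetic data with an assumed positive effect for `AcceptedCmp3` and an assumed negative effect for `Teenhome`; the effect sizes are made up and are not estimates from this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_rows = 2000

# Synthetic binary features; the column names only echo the notebook's features.
X_toy = pd.DataFrame({
    "AcceptedCmp3": rng.integers(0, 2, n_rows),
    "Teenhome": rng.integers(0, 2, n_rows),
})

# Assumed true effects on the log-odds scale: +1.5 and -1.0.
log_odds = 1.5 * X_toy["AcceptedCmp3"] - 1.0 * X_toy["Teenhome"]
y_toy = (rng.random(n_rows) < 1 / (1 + np.exp(-log_odds))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# exp(coefficient) = multiplicative change in the odds when the feature goes from 0 to 1.
odds = pd.Series(np.exp(model.coef_[0]), index=X_toy.columns, name="odds_ratio")
print(odds.round(2))
```

Note that coefficients are only comparable across features when the features share a similar scale, which is one reason the notebook normalizes the spending and purchase columns first.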

## Decision Tree Classifier <a class="anchor" id="decision-tree"></a>

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train,y_train)
plot_confusion_matrix(clf, X_test, y_test,display_labels=['Not Response','Response'])  
plt.show()
clf.score(X_test, y_test)

The Decision Tree Classifier does not perform well. It has 68.26% accuracy on the test set. The tree is visualized below.

In [None]:
from sklearn import tree
import graphviz
dot_data = tree.export_graphviz(clf, out_file=None, 
                                feature_names=X_train.columns,  
                                filled=True)
graphviz.Source(dot_data, format="png") 
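
When Graphviz is not installed, the same tree can be printed as plain text with `sklearn.tree.export_text`. The sketch below uses a synthetic dataset from `make_classification`; the feature names are hypothetical labels borrowed from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with four features standing in for the real columns.
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
names = ["Income", "Recency", "MntWines", "NumWebVisitsMonth"]  # hypothetical labels

tree_clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)

# Each line is one split or leaf, indented by depth.
rules = export_text(tree_clf, feature_names=names)
print(rules)
```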

## Support Vector Machine Classifier <a class="anchor" id="svc"></a>

In [None]:
from sklearn.svm import SVC

clf = SVC()
clf.fit(X_train,y_train)
plot_confusion_matrix(clf, X_test, y_test,display_labels=['Not Response','Response'])  
plt.show()
clf.score(X_test, y_test)

The Support Vector Machine Classifier does not perform well.

## Gradient Boosting Classifier <a class="anchor" id="gradient-boosting-classifier"></a>

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(max_depth=4)
clf.fit(X_train,y_train)
plot_confusion_matrix(clf, X_test, y_test,display_labels=['Not Response','Response'])  
plt.show()
clf.score(X_test, y_test)

The Gradient Boosting Classifier performs well. However, compared to Logistic Regression, it has lower performance.
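
The four models above are compared on a single train/test split, so small differences may be due to chance. A sketch of a more stable comparison with 5-fold cross-validation is shown below. It runs on synthetic data from `make_classification`, so the printed scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared feature matrix and target.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# Same model families and main settings as in the sections above.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "Support Vector Machine": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(max_depth=4, random_state=0),
}

# Mean accuracy over five folds for each model.
results = {
    name: cross_val_score(model, X_toy, y_toy, cv=5).mean()
    for name, model in models.items()
}

for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```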

***