# Notebook for Customer Personality Analysis

### Target problem to solve:

Cluster the customers into segments and summarize each segment.

### Source:
[Notebook1](https://www.kaggle.com/code/karnikakapoor/customer-segmentation-clustering/notebook)

[Author Notebook](https://thecleverprogrammer.com/2021/02/08/customer-personality-analysis-with-python/)

### Table of Contents
[Library](#Library)

[Load Data](#loading)

[Clean Data](#clean)

[Preprocess Data](#preprocess)

[Clustering Method](#cluster)

[Apriori Algorithm](#apriori)

<a id="Library"></a>
## Library 

In [None]:
# Dependency 
!pip install -U dataprep

In [None]:
# Library 
import pandas as pd 
import numpy as np 
from dataprep.eda import plot, plot_correlation, create_report, plot_missing
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

<a id="loading"></a>
## Load data 

In [None]:
#  Load data and separate by tab 
data = pd.read_csv("../input/customer-personality-analysis/marketing_campaign.csv",sep='\t')

In [None]:
data.head()
#len(data)

<a id="clean"></a>

## Clean 

- Deal with missing data 
- Deal with categorical data (dtype is object)
- EDA (auto EDA):

    Dig into the data to find patterns and insights
- Feature engineering:
    
    Summarize raw features into more meaningful ones
    
- Outliers
    
    Box plot (1.5 × IQR rule)
    
- Correlation matrix heatmap
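
The 1.5 × IQR rule behind the box plot can be sketched on its own. This is a minimal illustration on made-up income values, not the notebook's data; the helper name `iqr_bounds` is hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences that a standard box plot uses."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up incomes; the last value is an obvious outlier.
toy_income = pd.Series([30_000, 42_000, 51_000, 58_000, 64_000, 70_000, 666_000])
lower, upper = iqr_bounds(toy_income)

# Anything outside the fences is drawn as an individual point on the box plot.
outliers = toy_income[(toy_income < lower) | (toy_income > upper)]
print(lower, upper)
print(outliers.tolist())
```

Points outside the fences are only candidates for removal; whether to drop them is still a judgment call.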

In [None]:
# Auto EDA
plot(data)

In [None]:
data.info()

In [None]:
# Remove NA 
data = data.dropna()
len(data)

In [None]:
# Parse the enrollment date (the values appear to be day-first, dd-mm-yyyy)
data["Dt_Customer"]=pd.to_datetime(data["Dt_Customer"], dayfirst=True)

# Most recent enrollment date in the data
newest_date=data["Dt_Customer"].max()

# Days each customer has been enrolled, relative to the newest enrollment date
data["Dt_pass"]=(newest_date-data["Dt_Customer"]).dt.days

In [None]:
# Feature engineering
# Customer age, using 2014 as the reference year
data["Age"] = 2014-data["Year_Birth"]

#Total spendings on various items
data["Spent"] = data["MntWines"]+ data["MntFruits"]+ data["MntMeatProducts"]+ data["MntFishProducts"]+ data["MntSweetProducts"]+ data["MntGoldProds"]

#Deriving living situation ("Partner" or "Alone") from marital status
data["Living_With"]=data["Marital_Status"].replace({"Married":"Partner", "Together":"Partner", "Absurd":"Alone", "Widow":"Alone", "YOLO":"Alone", "Divorced":"Alone", "Single":"Alone",})

#Feature indicating total children living in the household
data["Children"]=data["Kidhome"]+data["Teenhome"]

#Feature for total members in the household
data["Family_Size"] = data["Living_With"].replace({"Alone": 1, "Partner":2})+ data["Children"]

#Feature indicating parenthood
data["Is_Parent"] = np.where(data.Children> 0, 1, 0)

#Segmenting education levels in three groups
data["Education"]=data["Education"].replace({"Basic":"Undergraduate","2n Cycle":"Undergraduate", "Graduation":"Graduate", "Master":"Postgraduate", "PhD":"Postgraduate"})

#For clarity
data=data.rename(columns={"MntWines": "Wines","MntFruits":"Fruits","MntMeatProducts":"Meat","MntFishProducts":"Fish","MntSweetProducts":"Sweets","MntGoldProds":"Gold"})

#Dropping some of the redundant features
to_drop = ["Marital_Status", "Dt_Customer", "Z_CostContact", "Z_Revenue", "Year_Birth", "ID"]
data = data.drop(to_drop, axis=1)

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
plot(data)

In [None]:
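# The EDA above shows skewed distributions, so check Income and Age for outliers
# Separate axes keep the two box plots from being drawn on top of each other
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.boxplot(x=data["Income"], ax=axes[0])
sns.boxplot(x=data["Age"], ax=axes[1])
plt.show()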

In [None]:
#Dropping the outliers by setting a cap on Age and income. 
data = data[(data["Age"]<90)]
data = data[(data["Income"]<600000)]
len(data)

In [None]:
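# Heat map of the correlation matrix
# numeric_only=True skips the columns that are still strings at this point
cor_m=data.corr(numeric_only=True)
plt.figure(figsize=(20,20))
sns.heatmap(cor_m,annot=True)

# Some features are correlated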

<a id="preprocess"></a>

## Preprocessing

- Encode categorical data 
- Split the y and x / get the features
- Scaling 
- Collinearity problem:

    Reduce dimensions with PCA
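
Why scaling plus PCA helps with collinearity can be seen on synthetic data. The sketch below uses made-up columns (three strongly correlated, one independent), not the customer table, and only standard scikit-learn calls.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)

# Synthetic data: three strongly correlated columns on very different scales,
# plus one independent column.
toy = np.column_stack([
    base * 1000 + rng.normal(scale=50, size=n),
    base * 10 + rng.normal(scale=0.5, size=n),
    -base + rng.normal(scale=0.05, size=n),
    rng.normal(size=n),
])

# Standardize first so that no column dominates just because of its units.
scaled = StandardScaler().fit_transform(toy)
toy_pca = PCA(n_components=2).fit(scaled)
components = toy_pca.transform(scaled)

# Share of the total variance captured by each retained component.
print(toy_pca.explained_variance_ratio_)
# The components are uncorrelated with each other, unlike the input columns.
print(np.corrcoef(components, rowvar=False).round(3))
```

Checking `explained_variance_ratio_` is one way to judge how much information is lost by keeping only a few components.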

In [None]:
# Name of categorical col
categorical_cols=list(data.dtypes[data.dtypes==object].index)

In [None]:
# Encode the categorical data 
Encoder=LabelEncoder()
for col in categorical_cols:
    # data[[col]] returns a DataFrame, so apply runs the encoder on that column
    data[[col]]=data[[col]].apply(Encoder.fit_transform)

In [None]:
data.head()
ds = data.copy()

In [None]:
# Split the x and y
y=['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2', 'Complain', 'Response']
ds=ds.drop(y,axis=1)

In [None]:
# Scaling
Scaler= StandardScaler()
Scaler.fit(ds)
scaled_ds=pd.DataFrame(Scaler.transform(ds),columns=ds.columns)

In [None]:
corr_1=scaled_ds.corr()
plt.figure(figsize=(20,20))
sns.heatmap(corr_1,annot=True)

# Collinearity: several features are strongly correlated

In [None]:
scaled_ds.head()

In [None]:
# PCA reduce dimension
pca=PCA(n_components=3)
pca.fit(scaled_ds)
pca_ds=pd.DataFrame(pca.transform(scaled_ds),columns=(["col1","col2", "col3"]))

In [None]:
pca_ds.describe()

In [None]:
# Visualization
x =pca_ds["col1"]
y =pca_ds["col2"]
z =pca_ds["col3"]
fig=plt.figure()
ax=plt.axes(projection="3d")
ax.scatter(x,y,z, c="maroon", marker="o" )
ax.set_title("A 3D Projection Of Data In The Reduced Dimension")
plt.show()

<a id="cluster"></a>
## Clustering Method

### [sklearn cluster](https://scikit-learn.org/stable/modules/clustering.html)

- K-means and choosing k

<a id="kmeans"></a>

### K Means:

Overview: 

The K-means algorithm aims to choose centroids that minimise the **inertia**, or **within-cluster sum-of-squares criterion**
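
To make inertia concrete, here is a small sketch on two synthetic blobs (not the notebook's data) that recomputes it by hand and compares it with the value scikit-learn reports.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two well-separated synthetic blobs in 2D.
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
])

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Inertia: sum of squared distances from each point to its assigned centroid.
assigned_centroids = toy_km.cluster_centers_[toy_km.labels_]
manual_inertia = ((points - assigned_centroids) ** 2).sum()

print(manual_inertia, toy_km.inertia_)
```

Inertia always decreases as k grows, which is why the elbow method looks for the point where the decrease flattens rather than for the minimum.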

[Analysis](#a-part) :

With the clusters in hand, we need to extract information from them. That's the A (analysis) part of DA.

- Scatter plot 
- Box plot 


In [None]:
# K means with elbow method library :
Auto_elbow=KElbowVisualizer(KMeans(),k=10)
Auto_elbow.fit(pca_ds)
Auto_elbow.show()
# The plot below shows the turning point is at k=4

In [None]:
inertia = []
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(pca_ds)
    inertia.append(km.inertia_)
    
plt.plot(K, inertia, 'bx-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal k')
plt.show()

In [None]:
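# K-means with the k chosen from the elbow plot
k=4
# random_state is fixed (arbitrary value) so the cluster labels are reproducible
km=KMeans(n_clusters=k, random_state=42)
# fit_predict fits the model and returns the labels, so a separate fit is not needed
predictions=km.fit_predict(pca_ds)
pca_ds["cluster"]=predictions
data["cluster"]=predictions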

In [None]:
fig = plt.figure(figsize=(10,8))
ax = plt.axes( projection='3d')
ax.scatter(x, y, z, s=40, c=pca_ds["cluster"], marker='o' )
ax.set_title("The Plot Of The Clusters")
plt.show()

<a id="a-part"></a>
#### Analysis

I found this part hard and had few ideas, which is why expert input and continued practice are needed.

I want to become more sensitive to the features and stay open-minded and creative!


- scatter plot for income and spent for clusters 
- box plot for detail
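
Besides plots, a per-cluster summary table is a quick way to start profiling. The sketch below uses made-up rows standing in for the clustered customer table; the output column names (`n`, `mean_income`, `mean_spent`) are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the clustered customer table.
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "Income":  [25_000, 31_000, 52_000, 58_000, 80_000, 90_000],
    "Spent":   [40, 60, 400, 600, 1500, 1700],
})

# One row per cluster: its size and the average income and spending.
profile = toy.groupby("cluster").agg(
    n=("Income", "size"),
    mean_income=("Income", "mean"),
    mean_spent=("Spent", "mean"),
)
print(profile)
```

Reading such a table next to the box plots makes it easier to give each cluster a short description.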


In [None]:
## A part 
data.head()

In [None]:
# Income - Spent 
scatter=sns.scatterplot(data=data,y="Income",x="Spent",hue="cluster")
scatter.set_title("Cluster's Profile Based On Income And Spending")
plt.legend()
plt.show()

In [None]:
# box - plot
plt.figure()
box_1=sns.boxplot(y=data["Spent"],x=data["cluster"])
plt.show()

In [None]:
plt.figure()
box_1=sns.boxplot(y=data["Income"],x=data["cluster"])
plt.show()

In [None]:
plt.figure()
box_1=sns.boxplot(y=data["Dt_pass"],x=data["cluster"])
plt.show()

In [None]:
plt.figure()
box_1=sns.boxplot(y=data["Age"],x=data["cluster"])
plt.show()

<a id="apriori"></a>
## Apriori Algorithm

Overview:

Foundation of the basket analysis problem (association rules)

#### Basket Analysis:

The goal is to find combinations of products that are often bought together, which we call **frequent itemsets**. The technical term for the domain is Frequent Itemset Mining.
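
The metrics used later to rank rules (support, confidence, lift) can be computed by hand on a few made-up baskets. This is a plain-Python sketch of the definitions, not the mlxtend implementation.

```python
# Made-up baskets for illustration.
baskets = [
    {"wine", "meat"},
    {"wine", "meat", "fish"},
    {"wine", "sweets"},
    {"meat", "fish"},
    {"wine", "meat", "sweets"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline support."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"wine", "meat"}))
print(confidence({"wine"}, {"meat"}))
print(lift({"wine"}, {"meat"}))
```

A lift above 1 means the two sides occur together more often than independence would predict; a lift below 1 means less often.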

#### Preprocess data for the Apriori algorithm
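
Apriori works on indicator columns, so continuous features are binned first and then one-hot encoded. The sketch below contrasts `pd.cut` (fixed edges) with `pd.qcut` (quantile edges) on made-up ages; the column names `Age_fixed` and `Age_quartile` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [19, 25, 38, 47, 52, 61, 70, 83]})  # made-up ages

# pd.cut: fixed bin edges, so the group sizes can be uneven.
toy["Age_fixed"] = pd.cut(toy["Age"], bins=[0, 20, 45, 65, 90],
                          labels=["Young", "Adult", "Mature", "Senior"])
# pd.qcut: quantile-based edges, so the groups are roughly equal in size.
toy["Age_quartile"] = pd.qcut(toy["Age"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# One indicator column per category, the kind of table apriori takes as input.
onehot = pd.get_dummies(toy[["Age_fixed", "Age_quartile"]])
print(toy)
print(onehot.columns.tolist())
```

Fixed edges keep the groups interpretable, while quantile edges guarantee that every group is large enough to pass a minimum support threshold.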

In [None]:
# Continuous data to categorical segment data
## Age -> Age group 
cut_labels_Age = ['Young', 'Adult', 'Mature', 'Senior']
cut_bins = [0, 20, 45, 65, 90]
data['Age_group']=pd.cut(data["Age"],bins=cut_bins,labels=cut_labels_Age)

In [None]:
## Income -> Income Group
cut_labels_Income = ['Low income', 'Low to medium income', 'Medium to high income', 'High income']
data['Income_group']=pd.qcut(data["Income"],q=4,labels=cut_labels_Income)

In [None]:
## Dt_pass -> Type of seniority  
cut_labels_Seniority = ['New customers', 'Repeat customers', 'Experienced customers', 'Old customers']
data['Seniority_group'] = pd.qcut(data['Dt_pass'], q=4, labels=cut_labels_Seniority)

In [None]:
## Drop the continuous cols
data=data.drop(columns=['Age','Income','Dt_pass'])

In [None]:
## Spending per product category -> spending segment:
cut_labels = ['Low consumer', 'Frequent consumer', 'Fan consumer']
data['Wines_segment'] = pd.qcut(data['Wines'][data['Wines']>0],q=[0, .25, .75, 1], labels=cut_labels).astype("object")
data['Fruits_segment'] = pd.qcut(data['Fruits'][data['Fruits']>0],q=[0, .25, .75, 1], labels=cut_labels).astype("object")
data['Meat_segment'] = pd.qcut(data['Meat'][data['Meat']>0],q=[0, .25, .75, 1], labels=cut_labels).astype("object")
data['Fish_segment'] = pd.qcut(data['Fish'][data['Fish']>0],q=[0, .25, .75, 1], labels=cut_labels).astype("object")
data['Sweets_segment'] = pd.qcut(data['Sweets'][data['Sweets']>0],q=[0, .25, .75, 1], labels=cut_labels).astype("object")
data['Gold_segment'] = pd.qcut(data['Gold'][data['Gold']>0],q=[0, .25, .75, 1], labels=cut_labels).astype("object")

data.replace(np.nan, "Non consumer",inplace=True)
data.drop(columns=['Wines','Fruits','Meat','Fish','Sweets','Gold'],inplace=True)
data = data.astype(object)

In [None]:
## Overall spent group 
cut_labels_spent = ['Low spent', 'Low to medium spent', 'Medium to high spent', 'High spent']
# Spent was cast to object above, so convert it back to a number before binning
data['Overall_spent_group']=pd.qcut(data["Spent"].astype(float),q=4,labels=cut_labels_spent)


In [None]:
## Drop all irrelevant cols
y=['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2', 'Complain', 'Response','Spent']
data=data.drop(y,axis=1)

In [None]:
# Apriori

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 999)
pd.options.display.float_format = "{:.3f}".format

association=data.copy() 
df = pd.get_dummies(association)
min_support = 0.08
max_len = 10
frequent_items = apriori(df, use_colnames=True, min_support=min_support, max_len=max_len + 1)
rules = association_rules(frequent_items, metric='lift', min_threshold=1)

In [None]:
product='Wines'
segment='Fan consumer'
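# String form of the target consequent, e.g. {'Wines_segment_Fan consumer'}
target = "{'%s_segment_%s'}" % (product, segment)
# regex=False matches the braces literally instead of as a regular expression
results_personnal_care = rules[rules['consequents'].astype(str).str.contains(target, regex=False, na=False)].sort_values(by='confidence', ascending=False)
results_personnal_care.head()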