In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Customer Segmentation Analysis Project

# Problem Statement

This Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business better understand its customers and makes it easier to modify products according to the specific needs, behaviors and concerns of different types of customers.

Customer personality analysis helps a business modify its product based on its target customers from different customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that particular segment.

## Attributes

### People
- ID: Customer's unique identifier  
- Year_Birth: Customer's birth year  
- Education: Customer's education level  
- Marital_Status: Customer's marital status  
- Income: Customer's yearly household income  
- Kidhome: Number of children in customer's household  
- Teenhome: Number of teenagers in customer's household  
- Dt_Customer: Date of customer's enrollment with the company  
- Recency: Number of days since customer's last purchase  
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise  

### Products

- MntWines: Amount spent on wine in last 2 years  
- MntFruits: Amount spent on fruits in last 2 years  
- MntMeatProducts: Amount spent on meat in last 2 years  
- MntFishProducts: Amount spent on fish in last 2 years  
- MntSweetProducts: Amount spent on sweets in last 2 years  
- MntGoldProds: Amount spent on gold in last 2 years  

### Promotion

- NumDealsPurchases: Number of purchases made with a discount  
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise  
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise  
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise  
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise  
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise  
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise  

### Place

- NumWebPurchases: Number of purchases made through the company’s website  
- NumCatalogPurchases: Number of purchases made using a catalogue  
- NumStorePurchases: Number of purchases made directly in stores  
- NumWebVisitsMonth: Number of visits to company’s website in the last month  

# Introduction

For this project, I will be clustering customers into segments using unsupervised learning on data from a marketing campaign. This allows us to derive insights into customer behaviour and sentiment towards the products offered by this company.

## Import libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
from datetime import date
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn import metrics
from yellowbrick.cluster import KElbowVisualizer
%matplotlib inline

# Check and prepare data

In [None]:
data=pd.read_csv("../input/customer-personality-analysis/marketing_campaign.csv",sep='\t')

In [None]:
data.head()

In [None]:
#Summary statistics of dataset
data.describe()

In [None]:
#Checking data types and possible missing values
data.info()
#Count of missing values per column
data.isnull().sum()

There are 24 customers with missing income data. While this comes as no surprise, we will need to identify the reason for the missing data to decide how we deal with it. 
1. If they were unemployed and left it blank, the values should be imputed with 0 instead
2. If they intentionally left it blank due to confidentiality reasons or are self-employed, then there is a good reason to leave them out of the dataset


In [None]:
#Examine observations with null income
data.loc[pd.isnull(data.Income)]

The dataset is constrained by the fact that employment status was not collected, so we are unable to determine which customers have 0 income. We could guess that a customer who is married and has children at home may have voluntarily left the workforce to be a caregiver, hence having 0 income, but that would be speculation. Given that there are only 24 such observations out of 2240, representing about 1% of our dataset, it is simpler to leave them out.  

Just a note, for cases where the number of observations with missing values is sizeable, it would be better to consider imputations instead.
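
As a rough illustration of the imputation alternative mentioned above, here is a minimal sketch using median imputation. It is not what this notebook does (the rows are dropped instead); the `toy` frame, its values and the `Income_missing` flag column are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Income": [30000.0, np.nan, 52000.0, 75000.0, np.nan],
})

# Keep a flag so that imputed rows can still be told apart later
toy["Income_missing"] = toy["Income"].isnull()

# The median is less sensitive to extreme incomes than the mean
income_median = toy["Income"].median()
toy["Income"] = toy["Income"].fillna(income_median)

print(toy)
```

Whether the median is a sensible fill value depends on why the data is missing, which is exactly the question we cannot answer for this dataset.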

In [None]:
#Drop rows with missing income data
data=data.dropna()


#Sense check of data values

data.nunique()

For this sense check, I am looking out for attributes that defy common sense. For example, if Year_Birth has more than 120 unique values, this would be a sign that the data may have been collected incorrectly or customers may have input the wrong values.


At a glance, most values in the dataset make sense. *Year_Birth* has 59 unique values, which is an acceptable range. *Income* has the most unique values (1974), but that is to be expected for a continuous attribute. Maximum number of kids and teens at home are 3 each. Binary attributes such as *AcceptedCmp* have 2 values each, which is consistent. Overall, the data looks fine.  

Something which caught my attention was *Z_CostContact* and *Z_Revenue*, which each have only 1 unique value. There is no Data Dictionary available to explain these 2 columns, but since they hold the same value for all observations, dropping them would not affect our model or results.
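
Constant columns like these can also be picked out programmatically rather than by reading the `nunique()` output. The sketch below uses a made-up `toy` frame; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame with two constant columns (values are illustrative only)
toy = pd.DataFrame({
    "Income": [30000, 52000, 75000],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})

# Columns with a single unique value carry no information for clustering
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)

toy = toy.drop(columns=constant_cols)
```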

*Marital_Status* has 8 unique values, and *Education* has 5 unique values. It might be worthwhile to explore what the different categories are for each attribute and identify any overlaps. This will be done in the next phase when we transform data.


In [None]:
#Drop the two constant columns (axis is not needed when columns= is given)
data=data.drop(columns=["Z_CostContact", "Z_Revenue"])

# Transform data

After checking through and ensuring the sensibility and integrity of the dataset, we move on to transforming the data meaningfully. This involves identifying opportunities to group or create new attributes in order to improve our analysis.  

## Categorical Attributes

### Marital_Status

In [None]:
#Exploring Marital_Status categories
data['Marital_Status'].value_counts()

We can group Marital Status into 2 buckets: Attached & Single. 
1. 'Married' and 'Together' can be replaced with 'Attached'
2. 'Divorced','Widow' and 'Alone' can be replaced with 'Single' 

The last 2 categories, 'YOLO' and 'Absurd', do not seem to be appropriate answers for this question. Since there are only 4 observations with such answers, I will drop them.

In [None]:
#Replace values
data['Marital_Status']=data['Marital_Status'].replace(['Married','Together'],'Attached')
data['Marital_Status']=data['Marital_Status'].replace(['Divorced','Widow','Alone'],'Single')

#Drop rows with answers as 'YOLO' or 'Absurd' for Marital Status
data=data[(data.Marital_Status!='YOLO')&(data.Marital_Status!='Absurd')]

### Education

Next, we examine the Education attribute. We can group the 5 categories into 2 buckets: Postgrad & Undergrad

1. 'Graduation','Master' and 'PhD' can be replaced with 'Postgrad'
2. 'Basic' and '2n Cycle' can be replaced with 'Undergrad'

In [None]:
#Examine categories
data['Education'].value_counts()

#Replace values
data['Education']=data['Education'].replace(['Graduation','Master','PhD'],'Postgrad')
data['Education']=data['Education'].replace(['Basic','2n Cycle'],'Undergrad')

### Dt_Customer

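Upon inspection, this is not a categorical attribute, but a datetime attribute read in as strings. We can convert it to datetime format here.

In [None]:
#Dates in this file appear to be day-first (dd-mm-yyyy), so state that explicitly when parsing
data['Dt_Customer'] = pd.to_datetime(data['Dt_Customer'], dayfirst=True)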
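
Date strings such as `04-09-2012` are ambiguous, so it is worth checking how they are interpreted. The sketch below assumes the enrollment dates are stored as dd-mm-yyyy; the sample strings in `raw` are hypothetical.

```python
import pandas as pd

# Hypothetical sample strings in dd-mm-yyyy form (an assumption about the file's format)
raw = pd.Series(["04-09-2012", "21-08-2013", "10-02-2014"])

# An explicit format removes any ambiguity between day and month
parsed = pd.to_datetime(raw, format="%d-%m-%Y")

# "04-09-2012" is read as 4 September 2012, not 9 April 2012
print(parsed.dt.month.tolist())
```

Getting day and month the wrong way round would silently shift the membership length derived from this column.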

## Feature Engineering

We can create useful attributes that would improve the analysis.

In [None]:
#Given the birth year of customers, we can derive their age during the time of data collection.
data['Age']=2015-data['Year_Birth']

#Number of children and number of teenagers in customer's household can be combined
data['Children']=data['Kidhome']+data['Teenhome']

#Derive number of months that customers have been members of the company as of data collection
#(a month is approximated as 30 days)
last_date=pd.Timestamp(2015,1,1)
data['Member_months']=(last_date-data['Dt_Customer']).dt.days/30



#Aggregate spending of all goods
data['Spending']=data['MntWines']+data['MntFruits']+data['MntMeatProducts']+data['MntFishProducts']+data['MntSweetProducts']+data['MntGoldProds']

#Rename columns for better readability
data=data.rename(columns={'MntWines': 'Wines','MntFruits':'Fruits','MntMeatProducts':'Meat','MntFishProducts':'Fish','MntSweetProducts':'Sweets','MntGoldProds':'Gold'})

#Create an indicator of whether customer is a parent
data['Is_parent']=np.where(data.Children>0,'Yes','No')

In [None]:
#Lastly, remove the unnecessary attributes
to_delete=['Recency','ID','Year_Birth','AcceptedCmp1' , 'AcceptedCmp2', 'AcceptedCmp3' , 'AcceptedCmp4','AcceptedCmp5', 'Response', 'Kidhome', 'Teenhome','NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth','Complain','Dt_Customer']
data=data.drop(columns=to_delete,axis=1)
data.head()


# Visualize Data

This segment of the project will be split into 2 parts:
1. Examine data points for each attribute to identify outliers, if any. Outliers can distort statistical analyses, so it would be important to remove them.
2. Examine relationships between some variables.
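
In this notebook outliers are identified by eye from the plots. A numeric rule of thumb can complement that; the sketch below applies the common 1.5 × IQR rule to a made-up series of ages. The helper `iqr_outliers` is hypothetical and not part of any library.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up ages with one implausible value
ages = pd.Series([25, 31, 38, 44, 47, 52, 58, 63, 70, 115])
mask = iqr_outliers(ages)
print(ages[mask].tolist())
```

The rule can flag many legitimate values on skewed attributes such as income, so its output is best treated as a list of candidates to inspect rather than rows to delete automatically.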


## Univariate

In [None]:
#Income

sns.scatterplot(x='Income',y='Income',data=data)

Wow! That's a really high annual income. While I do applaud that customer for success in his/her career, it is an extreme outlier in our dataset and will be removed.

In [None]:
#Remove data point
data=data[data['Income']<600000]

In [None]:
#Age

sns.scatterplot(x='Age',y='Age',data=data)

For Age, most values fall between 20 and 80. The 3 outliers with ages above 100 can be removed from our dataset.

In [None]:
data=data[data['Age']<100]

In [None]:
#Spending

sns.scatterplot(x='Spending',y='Spending',data=data)

Spending does not have any outliers, so we can leave it as it is.

## Bivariate

In [None]:
#Income and Spending

sns.jointplot(x='Income',y='Spending',data=data,kind='reg')

Income and Spending seem to be positively correlated, which makes sense as customers have higher purchasing power when income is higher.

In [None]:
#Age and Income

sns.jointplot(x='Age',y='Income',data=data,kind='reg')

Age and Income are positively correlated as well, which also makes sense as workers tend to receive periodic pay increments.

Alternatively, we could have checked our variables using a pairplot. However, a pairplot may be too cluttered if we introduce more variables, or have regression lines per plot.

In [None]:
examine_vars=['Income','Spending','Age','Member_months','Is_parent']
sns.set(rc = {'figure.figsize':(12,8)})
sns.pairplot(data[examine_vars],hue='Is_parent')

Identify perfect collinearity or multicollinearity amongst variables

In [None]:
sns.set(rc = {'figure.figsize':(10,8)})
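#Education, Marital_Status and Is_parent are still strings at this point,
#so restrict the correlation matrix to numeric columns
sns.heatmap(data.drop(['Wines', 'Fruits', 'Meat',
       'Fish', 'Sweets', 'Gold'],axis=1).select_dtypes('number').corr(),annot=True)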

From the correlation heatmap, there does not seem to be any case of multicollinearity or perfect collinearity.
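
A heatmap only shows pairwise correlations, whereas multicollinearity can involve several variables at once. One complementary check is the variance inflation factor (VIF), which equals the corresponding diagonal entry of the inverse correlation matrix. The sketch below uses synthetic features (`x1` to `x4` are made up) rather than the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic features: x3 is almost a linear combination of x1 and x2
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)
x4 = rng.normal(size=n)  # unrelated feature
X = np.column_stack([x1, x2, x3, x4])

# VIF_j equals the j-th diagonal entry of the inverse correlation matrix
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["x1", "x2", "x3", "x4"], vif):
    print(f"{name}: VIF = {value:.1f}")
```

Values above roughly 5 to 10 are often read as a sign of multicollinearity, though that threshold is a convention rather than a strict rule.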

# Analysis

1. Label Encoding categorical attributes
2. Scaling data so that each attribute has zero mean and unit variance
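
To make the scaling step concrete, the sketch below standardizes a small made-up array by hand. By default `StandardScaler` applies the same formula column by column, using the population standard deviation.

```python
import numpy as np

# Made-up data: column 0 is an income-like value, column 1 a small count
X = np.array([[20000.0, 1.0],
              [40000.0, 2.0],
              [90000.0, 0.0]])

# z = (x - mean) / std, computed per column (population std, ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0)
Z = (X - mean) / std

# After scaling, every column has mean 0 and standard deviation 1
print(Z.mean(axis=0).round(6))
print(Z.std(axis=0).round(6))
```

Without this step, an attribute measured in tens of thousands (such as Income) would dominate the distance calculations used by PCA and K-Means.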

In [None]:
#Identify categorical variables
cat_vars=list(data.select_dtypes(['object']).columns)

In [None]:
#Label Encoding
LE=LabelEncoder()
for i in cat_vars:
    data[i]=data[[i]].apply(LE.fit_transform)

In [None]:
#Check to see if attributes have transformed
data[cat_vars].dtypes

In [None]:
#Scaling
scaler=StandardScaler()
scaler.fit(data)

In [None]:
scaled=scaler.transform(data)
#scaled is a NumPy array; we convert it back to a dataframe
scaled_data=pd.DataFrame(scaled,columns=data.columns)

In [None]:
#Check if scaling was done correctly
scaled_data.head()

## Principal Component Analysis

It is difficult to visualize high dimensional data, so we can use PCA to find principal components.

In [None]:
#Reducing to 3 dimensions, so that we can plot a 3D graph
pca=PCA(n_components=3)
pca.fit(scaled_data)
pca_transformed=pca.transform(scaled_data)
pca_data=pd.DataFrame(pca_transformed,columns=['Att1','Att2','Att3'])

In [None]:
fig=plt.figure(figsize=(10,8))
ax=fig.add_subplot(projection='3d')
ax.scatter(pca_data['Att1'],pca_data['Att2'],pca_data['Att3'])

Do we really need 3 principal components? We can check that using a scree plot to show percentage of total variance explained.

In [None]:
#Number the components from 1 so that the x-axis reads PC1, PC2, PC3
pc_values=np.arange(pca.n_components_)+1
plt.plot(pc_values, pca.explained_variance_ratio_, 'o-', linewidth=2, color='blue')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.show()

print(pca.explained_variance_ratio_)

The first principal component explains around 40.3% of total variation in the dataset, the second explains 10.3% and the third explains 8.1%.
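
Adding these ratios up shows how much of the total variation the 3 components retain together. The sketch below simply reuses the rounded figures quoted above.

```python
import numpy as np

# Explained variance ratios quoted in the text above (rounded)
ratios = np.array([0.403, 0.103, 0.081])

# Running total of variance explained by the first 1, 2 and 3 components
cumulative = np.cumsum(ratios)
for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.1%} of variance, {c:.1%} cumulative")
```

The 3 components therefore capture about 58.7% of the variation, which means a little over 40% of it is not represented in the clustering that follows.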

## Clustering

With the 3 attributes from dimensionality reduction, we can move on to perform clustering. I will use the elbow method to determine the optimal number of clusters.

In [None]:
km = KMeans()
Elbow_M = KElbowVisualizer(estimator = km, k = 10)
Elbow_M.fit(pca_data)
Elbow_M.show()

From the graph, the optimal number of clusters is 4, where there is a kink in the curve. Beyond 4 clusters, the within-cluster sum of squares does not decrease significantly with each additional cluster.
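
Reading an elbow is somewhat subjective, so a second criterion can help. The sketch below compares silhouette scores across values of k using `sklearn.metrics.silhouette_score`. It runs on synthetic blobs as a stand-in for `pca_data`, so the result says nothing about the real dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for pca_data: four well-separated blobs in 3 dimensions
centers = [[0, 0, 0], [10, 10, 10], [-10, 10, 0], [10, -10, 0]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

# Fit K-Means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("Highest silhouette at k =", best_k)
```

If the silhouette score and the elbow disagree on real data, that is usually a hint that the clusters are not cleanly separated.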

# Evaluate Model

In [None]:
#Assigning datapoints to each cluster

km=KMeans(n_clusters=4,random_state=0)

prediction=km.fit_predict(pca_data)
pca_data['Cluster']=prediction
data['Cluster']=prediction

In [None]:
#Visualize clusters in 3D plot

fig=plt.figure(figsize=(10,8))
ax=fig.add_subplot(projection='3d')
ax.scatter(pca_data['Att1'],pca_data['Att2'],pca_data['Att3'],c=pca_data['Cluster'],marker='o',alpha=0.5,cmap='Accent')
ax.set_title('Clusters Plot 3D')

In [None]:
#Visualize clusters in 2D plot

plt.figure(figsize=(10,8))

sns.scatterplot(data=pca_data,x='Att1',y='Att2',hue='Cluster')
plt.title('Clusters plot 2D')

It would be useful to identify features of the clusters for customer segmentation. This can be done through analyzing plots, to observe clustering of datapoints across attribute comparisons.
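
Alongside the plots, a per-cluster summary table gives numbers to back up the visual impressions. The sketch below uses a made-up `toy` frame; the column names mirror the ones in this notebook but the values do not come from the dataset.

```python
import pandas as pd

# Made-up customers with an already assigned cluster label
toy = pd.DataFrame({
    "Income":   [30000, 32000, 80000, 85000, 50000, 52000],
    "Spending": [100, 150, 1500, 1700, 600, 650],
    "Children": [2, 1, 0, 0, 1, 2],
    "Cluster":  [0, 0, 1, 1, 2, 2],
})

# One row per cluster: size, average income and spending, largest household
profile = toy.groupby("Cluster").agg(
    customers=("Income", "count"),
    mean_income=("Income", "mean"),
    mean_spending=("Spending", "mean"),
    max_children=("Children", "max"),
)
print(profile)
```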

In [None]:
#Distribution of datapoints in clusters

dist= sns.countplot(x=data["Cluster"])
dist.set_title('Distribution of clusters')

In [None]:
plot1 = sns.scatterplot(data = data,x=data["Spending"], y=data["Income"],hue=data["Cluster"], palette= 'Accent')

In [None]:
#Continuous variables
examine_vars=['Income','Spending','Age','Member_months']
examine_vars.append('Cluster')
p=sns.pairplot(data[examine_vars],hue='Cluster')
p.fig.set_size_inches(15,15)

In [None]:
Features=['Education', 'Marital_Status','Children','Is_parent']
for i in Features:
    plt.figure()
    sns.kdeplot(x=data[i],y=data['Spending'],hue=data['Cluster'],palette='Accent')
    plt.title('{} vs Spending'.format(i))

# Conclusion

From our plots, it appears that the 4 clusters have the following characteristics:  

Cluster 0: Average Income, Average Spending, Mostly parents but maximum 2 children  
Cluster 1: High Income, High Spending, Mostly not parents and maximum of 1 child  
Cluster 2: Average to Low Income, Low Spending, Mostly parents with no restrictions on number of children  
Cluster 3: Low Income, Low Spending, a fair mix between parents and non-parents but no more than 2 children  