# Problem Statement:
## Customer Personality Analysis

In this project, we aim to perform an in-depth analysis of customer behavior and preferences. By leveraging exploratory data analysis (EDA), univariate and bivariate analysis, as well as unsupervised learning techniques, we seek to gain valuable insights into customer segments.

## Goal:
Perform clustering to summarize customer segments.

## Purpose / Business Implementation:
The primary objective of this analysis is to help the business gain a comprehensive understanding of its customers, including their habits, behavior, and needs. By segmenting customers based on their characteristics, the business can tailor its products and services to meet the specific needs of each segment.

Customer segmentation plays a crucial role in optimizing marketing strategies, product development, and customer relationship management. By identifying distinct customer segments, businesses can effectively allocate resources, personalize marketing campaigns, and enhance customer satisfaction.

## Data Source:
SAS Institute
## Collection Method:
Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner.

## TABLE OF CONTENTS:
* [LOAD DATA](#LOAD-DATA)
* [DATA CLEANING](#DATA-CLEANING)
* [FEATURE ENGINEERING](#FEATURE-ENGINEERING)
* [UNIVARIATE ANALYSIS](#UNIVARIATE-ANALYSIS)
* [FEATURE CORRELATION](#FEATURE-CORRELATION)
* [BIVARIATE ANALYSIS](#BIVARIATE-ANALYSIS)
* [CLUSTERING](#CLUSTERING)
* [MODEL EVALUATION](#MODEL-EVALUATION)
* [CLUSTERING SUMMARY](#CLUSTERING-SUMMARY)

In [20]:
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import sys
# Ignoring Warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
# Setting Seed for Reproducibility
np.random.seed(123) 

# LOAD DATA

In [21]:
# Loading data
data = pd.read_csv("Dataset/marketing_campaign.csv", sep="\t")

In [22]:
# Looking at the first five rows
data.head(5)

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,04-09-2012,58,635,...,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,08-03-2014,38,11,...,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,21-08-2013,26,426,...,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,10-02-2014,26,11,...,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,19-01-2014,94,173,...,5,0,0,0,0,0,0,3,11,0


## Customer Data Attributes

### People
- **ID**: Customer's unique identifier
- **Year_Birth**: Customer's birth year
- **Education**: Customer's education level
- **Marital_Status**: Customer's marital status
- **Income**: Customer's yearly household income
- **Kidhome**: Number of children in customer’s household
- **Teenhome**: Number of teenagers in customer’s household
- **Dt_Customer**: Date of customer’s enrollment with the company
- **Recency**: Number of days since customer’s last purchase
- **Complain**: 1 if the customer complained in the last 2 years, 0 otherwise

### Products
- **MntWines**: Amount spent on wine in last 2 years
- **MntFruits**: Amount spent on fruits in last 2 years
- **MntMeatProducts**: Amount spent on meat in last 2 years
- **MntFishProducts**: Amount spent on fish in last 2 years
- **MntSweetProducts**: Amount spent on sweets in last 2 years
- **MntGoldProds**: Amount spent on gold in last 2 years

### Place
- **NumDealsPurchases**: Number of purchases made with a discount

### Promotion
- **AcceptedCmp1**: 1 if the customer accepted the offer in the 1st campaign, 0 otherwise
- **AcceptedCmp2**: 1 if the customer accepted the offer in the 2nd campaign, 0 otherwise
- **AcceptedCmp3**: 1 if the customer accepted the offer in the 3rd campaign, 0 otherwise
- **AcceptedCmp4**: 1 if the customer accepted the offer in the 4th campaign, 0 otherwise
- **AcceptedCmp5**: 1 if the customer accepted the offer in the 5th campaign, 0 otherwise
- **Response**: 1 if the customer accepted the offer in the last campaign, 0 otherwise

### Online
- **NumWebPurchases**: Number of purchases made through the company’s website
- **NumCatalogPurchases**: Number of purchases made through a catalog


#### Quick Overview of the Data

In [23]:
data.describe()

Unnamed: 0,ID,Year_Birth,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
count,2240.0,2240.0,2216.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,...,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0
mean,5592.159821,1968.805804,52247.251354,0.444196,0.50625,49.109375,303.935714,26.302232,166.95,37.525446,...,5.316518,0.072768,0.074554,0.072768,0.064286,0.013393,0.009375,3.0,11.0,0.149107
std,3246.662198,11.984069,25173.076661,0.538398,0.544538,28.962453,336.597393,39.773434,225.715373,54.628979,...,2.426645,0.259813,0.262728,0.259813,0.245316,0.114976,0.096391,0.0,0.0,0.356274
min,0.0,1893.0,1730.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
25%,2828.25,1959.0,35303.0,0.0,0.0,24.0,23.75,1.0,16.0,3.0,...,3.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
50%,5458.5,1970.0,51381.5,0.0,0.0,49.0,173.5,8.0,67.0,12.0,...,6.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
75%,8427.75,1977.0,68522.0,1.0,1.0,74.0,504.25,33.0,232.0,50.0,...,7.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
max,11191.0,1996.0,666666.0,2.0,2.0,99.0,1493.0,199.0,1725.0,259.0,...,20.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,11.0,1.0


### DATA CLEANING

In [24]:
# Information about the data columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i

In [25]:
# Checking for missing values
data.isnull().sum()

ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64

**Takeaways:**
- There are 24 missing values in the `Income` column.
- `Dt_Customer` is stored as `object` (string), so we need to convert it to `datetime`.

Dealing with missing values:

In [26]:
# Percentage of missing values in the Income column
missing = data['Income'].isnull().sum()
total = len(data['Income'])
percent = missing / total * 100
print(f"The percentage of missing values in Income is {percent}%")

The percentage of missing values in Income is 1.0714285714285714%


- Only about 1% of the `Income` values are missing, so we will fill them with the column median.
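
The median is a reasonable choice here because `Income` is right-skewed: the summary statistics show a maximum of 666,666 against a 75th percentile of 68,522. A minimal sketch on made-up values (the numbers below are illustrative, not taken from the dataset) shows how a single extreme value pulls the mean but leaves the median unaffected:

```python
import numpy as np
import pandas as pd

# Toy income column with one missing value and one extreme outlier (illustrative values only)
income = pd.Series([30000.0, 40000.0, 50000.0, 60000.0, 666666.0, np.nan])

# Fill the gap with the mean and with the median for comparison
mean_fill = income.fillna(income.mean())
median_fill = income.fillna(income.median())

print(f"Value imputed with the mean:   {mean_fill.iloc[-1]:.1f}")
print(f"Value imputed with the median: {median_fill.iloc[-1]:.1f}")
```

With the mean, the imputed customer would look far wealthier than most of the sample; the median stays close to a typical value.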

In [27]:
# Filling missing values with the median
data['Income'] = data['Income'].fillna(data['Income'].median())

In [28]:
# Checking for missing values again
data.isnull().sum() 

ID                     0
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Z_CostContact          0
Z_Revenue              0
Response               0
dtype: int64

We will also convert the `Dt_Customer` column to `datetime` format.

In [32]:
# Converting 'Dt_Customer' to datetime
data['Dt_Customer'] = pd.to_datetime(data['Dt_Customer'], dayfirst=True)

In [33]:
data["Dt_Customer"].head()

0   2012-09-04
1   2014-03-08
2   2013-08-21
3   2014-02-10
4   2014-01-19
Name: Dt_Customer, dtype: datetime64[ns]
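
The `dayfirst=True` argument matters because the raw strings are in day-month-year order. As a small sketch (using the first two date strings from the table shown earlier), passing an explicit `format` is an alternative that raises an error on malformed input instead of guessing the order:

```python
import pandas as pd

# First two Dt_Customer strings from the head of the dataset
raw = pd.Series(["04-09-2012", "08-03-2014"])

# Explicit day-month-year format; an alternative to dayfirst=True
parsed = pd.to_datetime(raw, format="%d-%m-%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```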

Number of unique values in each column:

In [34]:
data.nunique()

ID                     2240
Year_Birth               59
Education                 5
Marital_Status            8
Income                 1975
Kidhome                   3
Teenhome                  3
Dt_Customer             663
Recency                 100
MntWines                776
MntFruits               158
MntMeatProducts         558
MntFishProducts         182
MntSweetProducts        177
MntGoldProds            213
NumDealsPurchases        15
NumWebPurchases          15
NumCatalogPurchases      14
NumStorePurchases        14
NumWebVisitsMonth        16
AcceptedCmp3              2
AcceptedCmp4              2
AcceptedCmp5              2
AcceptedCmp1              2
AcceptedCmp2              2
Complain                  2
Z_CostContact             1
Z_Revenue                 1
Response                  2
dtype: int64

`Z_CostContact` and `Z_Revenue` each contain a single unique value. We will drop these columns, since a constant column has no discriminatory power.

In [35]:
# Dropping 'Z_CostContact' and 'Z_Revenue' columns
data = data.drop(['Z_CostContact', 'Z_Revenue'], axis=1)
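
Constant columns can also be detected programmatically rather than named by hand. A minimal sketch on a toy frame (`toy` and `constant_cols` are names introduced only for this example; the rows reuse a few values from the table shown earlier):

```python
import pandas as pd

toy = pd.DataFrame({
    "Income": [58138.0, 46344.0, 71613.0],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})

# Columns with at most one distinct value carry no information for clustering
constant_cols = toy.columns[toy.nunique() <= 1].tolist()
toy = toy.drop(columns=constant_cols)
print(constant_cols)
```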

Examining some of the features before we proceed with feature engineering:

In [37]:
# Education column
data['Education'].unique()

array(['Graduation', 'PhD', 'Master', 'Basic', '2n Cycle'], dtype=object)

In [39]:
# Marital_Status column
data['Marital_Status'].unique()

array(['Single', 'Together', 'Married', 'Divorced', 'Widow', 'Alone',
       'Absurd', 'YOLO'], dtype=object)
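
Some of these labels (`Alone`, `Absurd`, `YOLO`) are not standard marital statuses and need to be folded into a broader category. One way to do this, sketched below under the assumption of a two-way partnered/unpartnered grouping, is an explicit dictionary with `Series.map`, which makes any unmapped label visible as `NaN`. The name `relationship_map` is hypothetical:

```python
import pandas as pd

# The eight labels observed in Marital_Status
status = pd.Series(["Single", "Together", "Married", "Divorced",
                    "Widow", "Alone", "Absurd", "YOLO"])

# Assumed grouping: partnered vs. not partnered
relationship_map = {
    "Married": "Relationship", "Together": "Relationship",
    "Single": "Single", "Divorced": "Single", "Widow": "Single",
    "Alone": "Single", "Absurd": "Single", "YOLO": "Single",
}

grouped = status.map(relationship_map)

# .map() yields NaN for labels missing from the dictionary, so this catches omissions
unmapped = status[grouped.isna()].tolist()
print("Unmapped labels:", unmapped)
```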

### FEATURE ENGINEERING

- We will create a new column `Age` by subtracting `Year_Birth` from the reference year (2024).
- We will collapse `Education` into two levels: Under Graduate and Post Graduate.
- We will create a new column `Children` by adding `Kidhome` and `Teenhome`.
- We will create a new column `Total_Spent` by adding the six product-spend (`Mnt`) columns.
- We will create a new column `TotalPurchases` by adding the four purchase-count columns (`NumDealsPurchases`, `NumWebPurchases`, `NumCatalogPurchases`, `NumStorePurchases`).
- We will create a new column `TotalAcceptedCmp` by adding the five `AcceptedCmp` columns.
- We will create a new column `CustomerSince` as the number of years between the enrollment year in `Dt_Customer` and 2024.
- We will create a new column `Is_Parent` that is `1` if `Children` is greater than zero and `0` otherwise.
- We will collapse `Marital_Status` into two levels: Relationship and Single.
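
The aggregate features in this list can be sketched on a toy frame first. The values below are illustrative only, and `DataFrame.filter(like=...)` is used here as one possible shortcut for selecting columns by name fragment; it is an alternative to spelling out each column:

```python
import pandas as pd

# Toy frame with a subset of the original column names (illustrative values)
toy = pd.DataFrame({
    "Kidhome": [0, 1, 1],
    "Teenhome": [0, 1, 0],
    "MntWines": [635, 11, 11],
    "MntFruits": [88, 1, 4],
    "AcceptedCmp1": [0, 0, 1],
    "AcceptedCmp2": [0, 0, 0],
})

toy["Children"] = toy["Kidhome"] + toy["Teenhome"]
toy["Is_Parent"] = (toy["Children"] > 0).astype(int)

# Row-wise sums over every column whose name contains the given fragment
toy["Total_Spent"] = toy.filter(like="Mnt").sum(axis=1)
toy["TotalAcceptedCmp"] = toy.filter(like="AcceptedCmp").sum(axis=1)

print(toy[["Children", "Is_Parent", "Total_Spent", "TotalAcceptedCmp"]])
```

Because `filter(like=...)` matches by substring, it should run before any derived column containing the same fragment is added, and before the `Mnt` columns are renamed.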


In [40]:
# Renaming the Product columns
data.rename(columns={"MntWines": "Wines", "MntFruits": "Fruits", "MntMeatProducts": "Meat", "MntFishProducts": "Fish", "MntSweetProducts": "Sweets", "MntGoldProds": "Gold"}, inplace=True)

In [41]:
# Creating Age column
data['Age'] = 2024 - data['Year_Birth']

# Converting Education to Understandable Format
data["Education"] = data["Education"].replace(['Graduation', 'PhD', 'Master', '2n Cycle'], "Post Graduate")
data["Education"] = data["Education"].replace(["Basic"], "Under Graduate")

# Converting Marital_Status to Understandable Format
data["Marital_Status"] = data["Marital_Status"].replace(['Single', 'Divorced', 'Widow', 'Alone','Absurd', 'YOLO'], "Single")
data["Marital_Status"] = data["Marital_Status"].replace(["Married", "Together"], "Relationship")

# Creating a new column 'Total_Spent' (using the renamed product columns)
data['Total_Spent'] = data['Wines'] + data['Fruits'] + data['Meat'] + data['Fish'] + data['Sweets'] + data['Gold']

# Creating a new column 'Children'
data['Children'] = data['Kidhome'] + data['Teenhome']

# CustomerSince column: years between the enrollment year and 2024
data['CustomerSince'] = 2024 - data['Dt_Customer'].dt.year

# TotalPurchases column
data['TotalPurchases'] = data['NumDealsPurchases'] + data['NumWebPurchases'] + data['NumCatalogPurchases'] + data['NumStorePurchases']

# TotalAcceptedCmp column
data['TotalAcceptedCmp'] = data['AcceptedCmp1'] + data['AcceptedCmp2'] + data['AcceptedCmp3'] + data['AcceptedCmp4'] + data['AcceptedCmp5']

# Is_Parent column
data['Is_Parent'] = np.where(data['Children'] > 0, 1, 0)

# Dropping unnecessary columns
drop_columns = ["ID", "Year_Birth", "Dt_Customer"]
data.drop(drop_columns, axis=1, inplace=True)
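
One thing worth checking after computing `Age`: the summary statistics show a minimum `Year_Birth` of 1893, which gives an age of 131 under the 2024 reference year. A minimal sketch of a plausibility check follows; the 100-year cutoff and the name `MAX_PLAUSIBLE_AGE` are assumptions chosen for illustration, not something the notebook applies:

```python
import pandas as pd

REFERENCE_YEAR = 2024      # same reference year used for the Age column
MAX_PLAUSIBLE_AGE = 100    # hypothetical cutoff, for illustration only

# Minimum, median, and maximum Year_Birth from the summary statistics
year_birth = pd.Series([1893, 1970, 1996])
age = REFERENCE_YEAR - year_birth

# Flag rows whose derived age exceeds the cutoff
implausible = age > MAX_PLAUSIBLE_AGE
print(age[implausible].tolist())
```

Whether such rows should be dropped, capped, or kept is a judgment call that depends on how sensitive the clustering step is to extreme values.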