___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

# WELCOME!

Welcome to the "***Clustering (Customer Segmentation) Project***". This is the last medium-sized project of the ***Machine Learning*** course.

At the end of this project, you will have performed ***Cluster Analysis*** with an ***Unsupervised Learning*** method.

---

In this project, you are asked to segment the customers of a food delivery company according to their purchasing history.

This project is less challenging than the other projects. After a quick first look at the data set, you are expected to perform ***Exploratory Data Analysis***. Observe how customers are distributed across different variables, and look for relationships and correlations between variables. Then choose the variables to use for cluster analysis.

The last step in customer segmentation is to group the customers into distinct clusters based on their characteristics and behaviors. One of the most common methods for clustering is ***K-Means Clustering***, which partitions the data into k clusters based on the distance to the cluster centroids. Other clustering methods include ***hierarchical clustering***, density-based clustering, and spectral clustering. Each cluster can be assigned a label that describes its main features and preferences.
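As a rough illustration of the K-Means idea described above, the sketch below fits `KMeans` to synthetic two-feature data (not the project data). The names `X_demo`, `kmeans_demo`, `labels_demo`, `centroids_demo` and `nearest` are placeholders chosen for this example, and `n_init=10` is an assumed setting.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic two-feature data with three well-separated groups (illustrative only).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-Means with k=3; n_init is set explicitly so the result does not
# depend on the library version's default.
kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

labels_demo = kmeans_demo.labels_              # cluster index of each observation
centroids_demo = kmeans_demo.cluster_centers_  # one centroid per cluster

# transform() returns the distance from every observation to every centroid;
# the nearest centroid is the assigned cluster.
nearest = kmeans_demo.transform(X_demo).argmin(axis=1)
```

On real customer data the features would normally be scaled first, because K-Means relies on Euclidean distances.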

- ***NOTE:*** *This project assumes that you already know the basics of coding in Python. You should also be familiar with Machine Learning in general, the theory behind Cluster Analysis, and the scikit-learn module before you begin.*

***Features:***
- AcceptedCmp1 - 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2 - 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3 - 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4 - 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5 - 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- ***Response (target) - 1 if customer accepted the offer in the last campaign, 0 otherwise*** 
- Complain - 1 if customer complained in the last 2 years
- DtCustomer - date of customer’s enrolment with the company
- Education - customer’s level of education
- Marital - customer’s marital status
- Kidhome - number of small children in customer’s household
- Teenhome - number of teenagers in customer’s household
- Income - customer’s yearly household income
- MntFishProducts - amount spent on fish products in the last 2 years
- MntMeatProducts - amount spent on meat products in the last 2 years
- MntFruits - amount spent on fruit products in the last 2 years
- MntSweetProducts - amount spent on sweet products in the last 2 years
- MntWines - amount spent on wine products in the last 2 years
- MntGoldProds - amount spent on gold products in the last 2 years
- NumDealsPurchases - number of purchases made with discount
- NumCatalogPurchases - number of purchases made using catalogue
- NumStorePurchases - number of purchases made directly in stores
- NumWebPurchases - number of purchases made through company’s web site
- NumWebVisitsMonth - number of visits to company’s web site in the last month
- Recency - number of days since the last purchase

#### 1. Import Libraries, Load Dataset, Exploring Data
- Import Libraries
- Load Dataset
- The First Look

#### 2. Exploratory Data Analysis (EDA)


#### 3. Cluster Analysis

- Clustering with numeric features

    * Create a new dataset with numeric features
    
    * Determine optimal number of clusters
    
    * Apply K Means
    
    * Visualizing and Labeling All the Clusters
    
    
- Clustering based on selected features

    * Create a new dataset with variables of your choice
    
    * Determine optimal number of clusters
    
    * Apply K Means
    
    * Visualizing and Labeling All the Clusters
    
    
- Hierarchical Clustering with selected features

    * Determine optimal number of clusters using Dendrogram

    * Apply Agglomerative Clustering

    * Visualizing and Labeling All the Clusters

- Conclusion

---
---

## 1. Import Libraries, Load Dataset, Exploring Data

### Import Libraries

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

sns.set_style("whitegrid")

from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

from sklearn.metrics import silhouette_score

from ipywidgets import interact
import warnings

warnings.filterwarnings("ignore")

pd.options.display.float_format = (
    lambda x: "{:.0f}".format(x) if int(x) == x else "{:,.2f}".format(x)
)

### Load Dataset

In [6]:
data = pd.read_csv('marketing_campaign.csv', sep=';')

In [10]:
df = data.copy()
df.sample(10).T

Unnamed: 0,1482,1403,760,2075,892,1413,1664,1977,1047,904
ID,3102,4188,10270,10281,10925,4310,8299,4867,4002,6404
Year_Birth,1981,1957,1981,1970,1983,1944,1989,1968,1960,1969
Education,2n Cycle,Graduation,2n Cycle,Graduation,Graduation,Graduation,PhD,PhD,PhD,Graduation
Marital_Status,Together,Single,Married,Divorced,Married,Married,Single,Single,Married,Together
Income,19414,36864,35523,64713,76630,80589,33996,38236,77037,58917
Kidhome,1,0,1,1,0,0,0,1,0,1
Teenhome,0,1,0,0,0,0,0,1,1,2
Dt_Customer,2013-10-16,2012-08-13,2013-10-03,2014-02-07,2014-01-14,2014-01-22,2013-09-11,2013-09-20,2013-10-13,2013-03-24
Recency,32,53,8,11,93,25,46,2,3,10
MntWines,2,204,11,180,255,507,40,58,463,151


### The First Look
- Since this is a clustering analysis, we will remove the Response variable from the dataset.
- You can rename the columns to something more convenient if you need to.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i

In [11]:
df.duplicated().sum()

0

In [13]:
df.isnull().sum()

ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64

In [16]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,2240,5592.16,3246.66,0,2828.25,5458.5,8427.75,11191
Year_Birth,2240,1968.81,11.98,1893,1959.0,1970.0,1977.0,1996
Income,2216,52247.25,25173.08,1730,35303.0,51381.5,68522.0,666666
Kidhome,2240,0.44,0.54,0,0.0,0.0,1.0,2
Teenhome,2240,0.51,0.54,0,0.0,0.0,1.0,2
Recency,2240,49.11,28.96,0,24.0,49.0,74.0,99
MntWines,2240,303.94,336.6,0,23.75,173.5,504.25,1493
MntFruits,2240,26.3,39.77,0,1.0,8.0,33.0,199
MntMeatProducts,2240,166.95,225.72,0,16.0,67.0,232.0,1725
MntFishProducts,2240,37.53,54.63,0,3.0,12.0,50.0,259


In [17]:
df.describe(include='object').T

Unnamed: 0,count,unique,top,freq
Education,2240,5,Graduation,1127
Marital_Status,2240,8,Married,864
Dt_Customer,2240,663,2012-08-31,12


In [18]:
df['Marital_Status'].value_counts()

Marital_Status
Married     864
Together    580
Single      480
Divorced    232
Widow        77
Alone         3
Absurd        2
YOLO          2
Name: count, dtype: int64

In [19]:
df['dt_customer'].max()

'2014-06-29'

In [48]:
# Convert the date column to datetime format
df['dt_customer'] = pd.to_datetime(df['dt_customer'])

# Get the most recent enrolment date
max_date = df['dt_customer'].max()

# Compute how long each customer has been enrolled (in months)
df['dt_month'] = df['dt_customer'].apply(lambda x: (max_date.year - x.year) * 12 + max_date.month - x.month)
df

Unnamed: 0,id,year_birth,education,marital_status,income,kidhome,teenhome,dt_customer,recency,mntwines,...,acceptedcmp5,acceptedcmp1,acceptedcmp2,complain,z_costcontact,z_revenue,response,age,marital_status_summary,dt_month
0,5524,1957,Graduation,Single,58138,0,0,2012-09-04,58,635,...,0,0,0,0,3,11,1,57,single,21
1,2174,1954,Graduation,Single,46344,1,1,2014-03-08,38,11,...,0,0,0,0,3,11,0,60,single,3
2,4141,1965,Graduation,Together,71613,0,0,2013-08-21,26,426,...,0,0,0,0,3,11,0,49,together,10
3,6182,1984,Graduation,Together,26646,1,0,2014-02-10,26,11,...,0,0,0,0,3,11,0,30,together,4
4,5324,1981,PhD,Married,58293,1,0,2014-01-19,94,173,...,0,0,0,0,3,11,0,33,together,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,10870,1967,Graduation,Married,61223,0,1,2013-06-13,46,709,...,0,0,0,0,3,11,0,47,together,12
2236,4001,1946,PhD,Together,64014,2,1,2014-06-10,56,406,...,0,1,0,0,3,11,0,68,together,0
2237,7270,1981,Graduation,Divorced,56981,0,0,2014-01-25,91,908,...,0,0,0,0,3,11,0,33,single,5
2238,8235,1956,Master,Together,69245,0,1,2014-01-24,8,428,...,0,0,0,0,3,11,0,58,together,5


In [21]:
df['age'] = 2014 - df['Year_Birth']
df.age.value_counts()

age
38     89
43     87
39     83
42     79
36     77
44     77
41     74
49     74
45     71
40     69
58     55
56     53
35     53
62     52
37     52
46     51
55     51
48     50
60     50
59     49
54     49
32     45
51     45
47     44
52     44
57     43
63     43
31     42
28     42
50     42
34     39
33     39
30     38
53     36
61     35
29     32
25     30
65     30
64     29
26     29
27     27
66     21
24     18
68     16
67     16
23     15
22     13
69      8
71      7
70      7
21      5
19      5
20      3
18      2
115     1
73      1
121     1
114     1
74      1
Name: count, dtype: int64
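The counts above include a few implausible ages (114, 115 and 121). One possible way to handle them, sketched here on a toy Series, is to keep only rows below a cutoff. The cutoff `MAX_PLAUSIBLE_AGE = 100` and the names `ages` and `plausible` are assumptions made for this example, not values taken from the project.

```python
import pandas as pd

# Toy ages, including the three outliers seen in the value counts (illustrative only).
ages = pd.Series([38, 43, 57, 114, 115, 121, 30], name="age")

MAX_PLAUSIBLE_AGE = 100  # assumed cutoff; choose one that suits your analysis

# Boolean mask keeps only the plausible ages.
plausible = ages[ages <= MAX_PLAUSIBLE_AGE]
```

On the full DataFrame the same mask could be applied as `df[df["age"] <= MAX_PLAUSIBLE_AGE]`, assuming the `age` column created above.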

In [42]:
df.nunique()

id                        2240
year_birth                  59
education                    5
marital_status               8
income                    1974
kidhome                      3
teenhome                     3
dt_customer                663
recency                    100
mntwines                   776
mntfruits                  158
mntmeatproducts            558
mntfishproducts            182
mntsweetproducts           177
mntgoldprods               213
numdealspurchases           15
numwebpurchases             15
numcatalogpurchases         14
numstorepurchases           14
numwebvisitsmonth           16
acceptedcmp3                 2
acceptedcmp4                 2
acceptedcmp5                 2
acceptedcmp1                 2
acceptedcmp2                 2
complain                     2
z_costcontact                1
z_revenue                    1
response                     2
age                         59
marital_status_summary       2
dtype: int64

In [26]:
df.columns = [col.lower() for col in df.columns]
df.columns

Index(['id', 'year_birth', 'education', 'marital_status', 'income', 'kidhome',
       'teenhome', 'dt_customer', 'recency', 'mntwines', 'mntfruits',
       'mntmeatproducts', 'mntfishproducts', 'mntsweetproducts',
       'mntgoldprods', 'numdealspurchases', 'numwebpurchases',
       'numcatalogpurchases', 'numstorepurchases', 'numwebvisitsmonth',
       'acceptedcmp3', 'acceptedcmp4', 'acceptedcmp5', 'acceptedcmp1',
       'acceptedcmp2', 'complain', 'z_costcontact', 'z_revenue', 'response',
       'age'],
      dtype='object')

In [28]:
df[df['income'].isna()]

Unnamed: 0,id,year_birth,education,marital_status,income,kidhome,teenhome,dt_customer,recency,mntwines,...,acceptedcmp3,acceptedcmp4,acceptedcmp5,acceptedcmp1,acceptedcmp2,complain,z_costcontact,z_revenue,response,age
10,1994,1983,Graduation,Married,,1,0,2013-11-15,11,5,...,0,0,0,0,0,0,3,11,0,31
27,5255,1986,Graduation,Single,,1,0,2013-02-20,19,5,...,0,0,0,0,0,0,3,11,0,28
43,7281,1959,PhD,Single,,0,0,2013-11-05,80,81,...,0,0,0,0,0,0,3,11,0,55
48,7244,1951,Graduation,Single,,2,1,2014-01-01,96,48,...,0,0,0,0,0,0,3,11,0,63
58,8557,1982,Graduation,Single,,1,0,2013-06-17,57,11,...,0,0,0,0,0,0,3,11,0,32
71,10629,1973,2n Cycle,Married,,1,0,2012-09-14,25,25,...,0,0,0,0,0,0,3,11,0,41
90,8996,1957,PhD,Married,,2,1,2012-11-19,4,230,...,0,0,0,0,0,0,3,11,0,57
91,9235,1957,Graduation,Single,,1,1,2014-05-27,45,7,...,0,0,0,0,0,0,3,11,0,57
92,5798,1973,Master,Together,,0,0,2013-11-23,87,445,...,0,0,0,0,0,0,3,11,0,41
128,8268,1961,PhD,Married,,0,1,2013-07-11,23,352,...,0,0,0,0,0,0,3,11,0,53


In [37]:
def mapping_marital_status(x):
    if x in ["Married", "Together"]:
        return "together"
    elif x in ["Single", "Divorced", "Widow", 'Absurd', "Alone", "YOLO"]:
        return "single"

In [38]:
df.marital_status.apply(mapping_marital_status).value_counts(dropna=False)


marital_status
together    1444
single       796
Name: count, dtype: int64

In [40]:
df["marital_status_summary"] = df.marital_status.apply(mapping_marital_status)

0         single
1         single
2       together
3       together
4       together
          ...   
2235    together
2236    together
2237      single
2238    together
2239    together
Name: marital_status_summary, Length: 2240, dtype: object

In [41]:
df['education'].value_counts()

education
Graduation    1127
PhD            486
Master         370
2n Cycle       203
Basic           54
Name: count, dtype: int64

In [None]:
def mapping_education(x):
    # Draft grouping of the five education values present in this dataset
    # (see the value_counts output above). The three tiers are an assumption;
    # adjust them as needed.
    if x in ["Basic"]:
        return "low_level_grade"
    elif x in ["Graduation"]:
        return "medium_level_grade"
    elif x in ["2n Cycle", "Master", "PhD"]:
        return "high_level_grade"

In [32]:
df.groupby(['marital_status', 'education'])[['income']].mean()

marital_status,education,income
Absurd,Graduation,79244.0
Absurd,Master,65487.0
Alone,Graduation,34176.0
Alone,Master,61331.0
Alone,PhD,35860.0
Divorced,2n Cycle,49395.13
Divorced,Basic,9548.0
Divorced,Graduation,54526.04
Divorced,Master,50331.95
Divorced,PhD,53096.62
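The group averages above suggest one possible way to handle the 24 missing `income` values: fill each gap with a statistic of the customer's own group. The sketch below uses the group median on a toy frame. The frame `toy`, the column `income_filled`, and the choice of median over mean are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with two missing incomes (illustrative only).
toy = pd.DataFrame({
    "education": ["PhD", "PhD", "PhD", "Basic", "Basic", "Basic"],
    "income": [60000.0, np.nan, 70000.0, 20000.0, 24000.0, np.nan],
})

# transform("median") returns, for every row, the median income of its group;
# missing values are ignored when the median is computed.
group_median = toy.groupby("education")["income"].transform("median")

# Fill only the missing entries with their group's median.
toy["income_filled"] = toy["income"].fillna(group_median)
```

The median is less sensitive than the mean to extreme values such as the maximum income of 666666 seen in the `describe()` output.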


In [None]:
df.drop(columns=['id', 'year_birth', 'dt_customer', 'z_costcontact', 'z_revenue', 'response'], inplace=True)

## 2. Exploratory Data Analysis (EDA)

To label the observations correctly after performing Cluster Analysis, you need to know the data well. Analyze the frequency distributions of the features, and the relationships and correlations between the variables. Data visualization techniques are recommended. Observing breakpoints helps you internalize the data.

### PCA
- We have too many features for bivariate analysis and a pairplot, so we will create 3 principal components to get insight into how our data is distributed.
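A minimal sketch of the three-component projection, assuming the numeric features are standardized first. The random matrix `X_num` stands in for the numeric customer features, and the names `X_scaled`, `pca`, `components` and `explained` are placeholders for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder for the numeric customer features: 200 rows, 8 columns.
rng = np.random.default_rng(0)
X_num = rng.normal(size=(200, 8))

# PCA is sensitive to feature scale, so standardize before projecting.
X_scaled = StandardScaler().fit_transform(X_num)

# Project onto the first three principal components.
pca = PCA(n_components=3, random_state=42)
components = pca.fit_transform(X_scaled)

# Share of the total variance captured by each component.
explained = pca.explained_variance_ratio_
```

The three columns of `components` could then be passed to a 3D scatter plot, for example with `plotly.express.scatter_3d`.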

# 3. Cluster Analysis

The purpose of the project is to perform cluster analysis using [K-Means](https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1) and [Hierarchical Clustering](https://medium.com/analytics-vidhya/hierarchical-clustering-d2d92835280c) algorithms.

The K-Means algorithm requires the number of clusters to be chosen in advance, for example with the [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)), while Hierarchical Clustering builds a dendrogram without defining the number of clusters beforehand. Different labeling should be done based on the information obtained from each analysis.

Labeling example:

- **Normal Customers** -- An average consumer in terms of purchases and income.
- **Spender Customers** -- Income is lower but purchases are high, so they can also be treated as potential target customers.
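One way such labels could be attached, sketched on a toy frame: compute each cluster's average profile and apply a simple rule. The column `total_spend`, the rule in `label_cluster`, and the comparison against the overall mean are all assumptions made for this example, not the project's prescribed method.

```python
import pandas as pd

# Toy data with cluster ids already assigned (illustrative only).
clustered = pd.DataFrame({
    "income": [30000, 32000, 60000, 62000],
    "total_spend": [900, 1000, 500, 550],  # hypothetical aggregate of the Mnt* columns
    "cluster": [0, 0, 1, 1],
})

# Average profile of each cluster and of the whole data set.
profile = clustered.groupby("cluster")[["income", "total_spend"]].mean()
overall_mean = clustered[["income", "total_spend"]].mean()

def label_cluster(row, overall):
    """Assumed rule: below-average income with above-average spending."""
    if row["income"] < overall["income"] and row["total_spend"] > overall["total_spend"]:
        return "Spender Customers"
    return "Normal Customers"

# One label per cluster, then mapped back onto every customer.
cluster_labels = profile.apply(label_cluster, axis=1, overall=overall_mean)
clustered["segment"] = clustered["cluster"].map(cluster_labels)
```

In practice the rule would be written after inspecting the cluster profiles rather than fixed in advance.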

## K-means Clustering

### Create a new dataset with numeric features
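A possible starting point for this step, shown on a toy frame: select the numeric columns with `select_dtypes` and standardize them. The frame `toy` and the choice of `StandardScaler` (rather than `RobustScaler` or `MinMaxScaler`, which are also imported above) are assumptions for this sketch.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame mixing a categorical and two numeric columns (illustrative only).
toy = pd.DataFrame({
    "education": ["PhD", "Basic", "Master"],
    "income": [60000.0, 20000.0, 45000.0],
    "recency": [10, 50, 30],
})

# Keep only the numeric columns.
numeric = toy.select_dtypes(include="number")

# Standardize to zero mean and unit variance, keeping column names and index.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,
)
```

Scaling matters here because K-Means uses Euclidean distance, so a large-valued feature such as income would otherwise dominate the result.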

#### Determine optimal number of clusters

### inertia

In [39]:
def elbow_vis(X, k_range=range(2, 11), inertia=True):
    """
    This is a function that visualizes the elbow method for determining the optimal number of clusters in a dataset.

    Args:
        X (pd.DataFrame): Input data
        k_range (range, optional): iterable of k values to try. Defaults to range(2, 11).
        inertia (bool, optional): plot inertia if True, distortion otherwise. Defaults to True.
    Returns:
        None.
    """
    if inertia:
        inertias = []
        for k in k_range:
            kmeanModel = KMeans(n_clusters=k, random_state=42).fit(X)
            inertias.append(kmeanModel.inertia_)
        plt.figure(figsize=(10, 6))
        plt.plot(k_range, inertias, "bo--")
        plt.xlabel("k")
        plt.ylabel("Inertia")
        plt.title("The Elbow Method showing the optimal k")
        plt.show()
    else:
        distortion = []
        for k in k_range:
            kmeanModel = KMeans(n_clusters=k, random_state=42)
            kmeanModel.fit(X)
            distances = kmeanModel.transform(
                X
            )  # distances from each observation to each cluster centroid
            labels = kmeanModel.labels_
            result = []
            for i in range(k):
                cluster_distances = distances[
                    labels == i, i
                ]  # distances from observations in each cluster to their own centroid
                result.append(
                    np.mean(cluster_distances**2)
                )  # calculate the mean of squared distances from observations in each cluster to their own centroid and add it to the result list
            distortion.append(
                sum(result)
            )  # sum the means of all clusters and add it to the distortion list

        plt.figure(figsize=(10, 6))
        plt.plot(k_range, distortion, "r*--", markersize=14.0)
        plt.xlabel("Different k values")
        plt.ylabel("Distortion")
        plt.title("elbow method")
        plt.show()

### distortion
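Reading the `elbow_vis` code above, the two quantities differ only in how each cluster is weighted. Writing $C_j$ for the set of observations in cluster $j$ and $\mu_j$ for its centroid (notation introduced here for illustration), inertia is the total squared distance, while distortion, as that function computes it, sums the per-cluster mean squared distances:

$$
\text{inertia} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\text{distortion} = \sum_{j=1}^{k} \frac{1}{|C_j|} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

Definitions of distortion vary between sources, so the second formula should be read as a description of this function only.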

### silhouette score
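A minimal sketch of comparing silhouette scores across candidate values of k, using synthetic data with three groups. The names `X_demo`, `sil_scores` and `best_k`, and the range of k values, are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Silhouette score for each candidate k; values lie in [-1, 1], higher is better.
sil_scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    sil_scores[k] = silhouette_score(X_demo, labels)

# The k with the highest average silhouette score.
best_k = max(sil_scores, key=sil_scores.get)
```

Unlike inertia, which keeps decreasing as k grows, the silhouette score can peak, which makes it a useful cross-check for the elbow plot.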

#### Apply K Means

#### Visualizing and Labeling All the Clusters

### Clustering based on selected features

#### Select features from existing data

#### Determine optimal number of clusters for selected features

#### Building the model based on the optimal number of clusters with selected features

#### So far we have drawn our inferences from the K-Means algorithm; next we will compare these results with another clustering algorithm, Hierarchical Clustering:


### Pay attention to the number of clusters chosen in K-Means so that the differences between the two results can be identified.

## Hierarchical Clustering

### Determine optimal number of clusters using Dendrogram
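A sketch of how a dendrogram could be built and read with SciPy, on synthetic data. SciPy is not among the imports at the top of this notebook, so its use here is an assumption. Ward linkage and the "largest gap in merge heights" heuristic are also choices made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic data with three well-separated groups (illustrative only).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Build the merge tree; each row of Z records one merge and its height.
Z = linkage(X_demo, method="ward")

# no_plot=True returns the layout without drawing; remove it to draw
# the dendrogram with matplotlib.
tree = dendrogram(Z, no_plot=True)

# Heuristic: cut where consecutive merge heights jump the most.
merge_heights = Z[:, 2]
gaps = np.diff(merge_heights)
n_clusters_suggested = len(merge_heights) - int(np.argmax(gaps))

# Flat cluster labels for the suggested number of clusters.
hier_labels = fcluster(Z, t=n_clusters_suggested, criterion="maxclust")
```

The gap heuristic is only a guide; on real data it is worth inspecting the plotted dendrogram as well.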

### silhouette_score
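The same silhouette comparison can be sketched for `AgglomerativeClustering`. The synthetic data, the Ward linkage, and the names `agg_scores` and `best_k_agg` are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Silhouette score of the agglomerative solution for each candidate k.
agg_scores = {}
for k in range(2, 7):
    model = AgglomerativeClustering(n_clusters=k, linkage="ward")
    labels = model.fit_predict(X_demo)
    agg_scores[k] = silhouette_score(X_demo, labels)

# The k with the highest average silhouette score.
best_k_agg = max(agg_scores, key=agg_scores.get)
```

Comparing `best_k_agg` with the k chosen for K-Means gives a quick check on whether the two algorithms agree.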

### Clustering based on selected features

## Conclusion
