# Unsupervised Lab Session

## Learning outcomes:
- Exploratory data analysis and data preparation for model building.
- PCA for dimensionality reduction.
- K-means and agglomerative clustering.

## Problem Statement
Based on the given marketing campaign dataset, segment similar customers into suitable clusters. Analyze the clusters and provide insights to help the organization promote its business.

## Context:
- Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business better understand its customers and makes it easier to modify products according to the specific needs, behaviors, and concerns of different types of customers.
- Customer personality analysis helps a business tailor its product to target customers from different customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that segment.

## About dataset
- Source: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis?datasetId=1546318&sortBy=voteCount

### Attribute Information:
- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month

### 1. Import required libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import dendrogram, linkage

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

import warnings
warnings.filterwarnings('ignore')

### 2. Load the CSV file (i.e., marketing.csv) and display the first 5 rows of the dataframe. Check the shape and info of the dataset.

In [18]:
df = pd.read_csv("marketing.csv")
df.head()

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Response
0,5524,1957,Graduation,Single,58138.0,0,0,4/9/2012,58,635,...,10,4,7,0,0,0,0,0,0,1
1,2174,1954,Graduation,Single,46344.0,1,1,8/3/2014,38,11,...,1,2,5,0,0,0,0,0,0,0
2,4141,1965,Graduation,Together,71613.0,0,0,21-08-2013,26,426,...,2,10,4,0,0,0,0,0,0,0
3,6182,1984,Graduation,Together,26646.0,1,0,10/2/2014,26,11,...,0,4,6,0,0,0,0,0,0,0
4,5324,1981,PhD,Married,58293.0,1,0,19-01-2014,94,173,...,3,6,5,0,0,0,0,0,0,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i
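
The step heading also asks for the shape of the dataset, which `df.info()` reports only indirectly through its entry and column counts. A minimal sketch on a small stand-in frame (the toy rows are illustrative, not the real file):

```python
import pandas as pd

# Toy stand-in for the marketing data; the real frame comes from pd.read_csv("marketing.csv")
toy = pd.DataFrame({
    "ID": [5524, 2174, 4141],
    "Year_Birth": [1957, 1954, 1965],
    "Income": [58138.0, 46344.0, 71613.0],
})

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(f"rows={n_rows}, columns={n_cols}")
```

On the real data, `df.shape` should agree with the 2240 entries and 27 columns reported by `df.info()`.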

### 3. Check the percentage of missing values. If any are present, treat them accordingly.

In [5]:
df.isnull().sum()/len(df)*100

ID                     0.000000
Year_Birth             0.000000
Education              0.000000
Marital_Status         0.000000
Income                 1.071429
Kidhome                0.000000
Teenhome               0.000000
Dt_Customer            0.000000
Recency                0.000000
MntWines               0.000000
MntFruits              0.000000
MntMeatProducts        0.000000
MntFishProducts        0.000000
MntSweetProducts       0.000000
MntGoldProds           0.000000
NumDealsPurchases      0.000000
NumWebPurchases        0.000000
NumCatalogPurchases    0.000000
NumStorePurchases      0.000000
NumWebVisitsMonth      0.000000
AcceptedCmp3           0.000000
AcceptedCmp4           0.000000
AcceptedCmp5           0.000000
AcceptedCmp1           0.000000
AcceptedCmp2           0.000000
Complain               0.000000
Response               0.000000
dtype: float64

In [6]:
# Filling the missing values in 'Income' with the column mean
df['Income'] = df['Income'].fillna(df['Income'].mean())
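
Mean imputation is sensitive to extreme values, and income columns are often right-skewed. As a hedged alternative (not what this notebook does), the sketch below compares mean and median filling on a toy series; the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy income column with one missing value and one very large value (illustrative numbers)
income = pd.Series([30000.0, 40000.0, 50000.0, np.nan, 600000.0], name="Income")

mean_filled = income.fillna(income.mean())      # the mean is pulled upward by the large value
median_filled = income.fillna(income.median())  # the median is robust to the large value

print("mean used for filling:  ", income.mean())
print("median used for filling:", income.median())
```

With only about 1% of `Income` missing, either choice is likely to have a small effect on the clustering; the median is simply the more conservative default when outliers are suspected.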

In [7]:
## Let's recheck the missing values
df.isnull().sum()

ID                     0
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Response               0
dtype: int64

### 4. Check if there are any duplicate records in the dataset. If there are, drop them.

In [12]:
len(df[df.duplicated()])

0
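
The count above is 0, so nothing needs to be dropped here. If duplicates were present, they could be removed as in the following sketch on a toy frame (illustrative values only):

```python
import pandas as pd

toy = pd.DataFrame({
    "Education": ["Graduation", "PhD", "Graduation"],
    "Income": [58138.0, 58293.0, 58138.0],
})

n_dupes = int(toy.duplicated().sum())                   # rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)  # keep the first occurrence of each row
print(n_dupes, deduped.shape)
```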

### 5. Drop the columns that you think are redundant for the analysis

In [19]:
df = df.drop(columns=['ID', 'Dt_Customer'])

### 6. Check the unique categories in the column 'Marital_Status'
- i) Group categories 'Married', 'Together' as 'relationship'
- ii) Group categories 'Divorced', 'Widow', 'Alone', 'YOLO', and 'Absurd' as 'Single'.

In [21]:
# Checking the number of records in each category of 'Marital_Status'
df['Marital_Status'].value_counts()

Marital_Status
Married     864
Together    580
Single      480
Divorced    232
Widow        77
Alone         3
Absurd        2
YOLO          2
Name: count, dtype: int64

In [22]:
# Grouping the categories as described above
df['Marital_Status'] = df['Marital_Status'].replace(['Married', 'Together'], 'relationship')
df['Marital_Status'] = df['Marital_Status'].replace(['Divorced', 'Widow', 'Alone', 'YOLO', 'Absurd'], 'Single')

In [24]:
print(df['Marital_Status'].unique())

['Single' 'relationship']
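
The grouping in this step can also be written as one explicit mapping, which makes it easy to spot a category that was forgotten. This is a sketch on a toy series; `group_map` is a hypothetical name, not something defined in the notebook.

```python
import pandas as pd

status = pd.Series(["Married", "Together", "Single", "Divorced",
                    "Widow", "Alone", "Absurd", "YOLO"])

# One explicit mapping from every raw category to its group
group_map = {
    "Married": "relationship", "Together": "relationship",
    "Single": "Single", "Divorced": "Single", "Widow": "Single",
    "Alone": "Single", "Absurd": "Single", "YOLO": "Single",
}
grouped = status.map(group_map)

# A category missing from the mapping would show up as NaN here
print(grouped.isna().sum())
print(grouped.value_counts())
```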

### 7. Group the columns 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', and 'MntGoldProds' as 'Total_Expenses'

In [40]:
df['Total_Expenses'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']

In [41]:
df['Total_Expenses'].head()
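
Long chains of `df[...] + df[...]` are easy to mistype. A row-wise sum over a list of column names is a compact alternative; the sketch below uses two made-up customers, and `mnt_cols` is a hypothetical helper name.

```python
import pandas as pd

mnt_cols = ["MntWines", "MntFruits", "MntMeatProducts",
            "MntFishProducts", "MntSweetProducts", "MntGoldProds"]

# Two toy customers; the amounts are made up for illustration
toy = pd.DataFrame([[100, 10, 50, 5, 3, 2],
                    [0, 1, 6, 2, 1, 6]], columns=mnt_cols)

toy["Total_Expenses"] = toy[mnt_cols].sum(axis=1)  # row-wise sum across the listed columns
print(toy["Total_Expenses"].tolist())
```

The same pattern would apply to the purchase counts, the `Kidhome`/`Teenhome` pair, and the campaign flags in the following steps.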

### 8. Group the columns 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', and 'NumDealsPurchases' as 'Num_Total_Purchases'

In [42]:
df['Num_Total_Purchases'] = df['NumWebPurchases'] + df['NumCatalogPurchases'] + df['NumStorePurchases'] + df['NumDealsPurchases']

In [43]:
df['Num_Total_Purchases'].head()

### 9. Group the columns 'Kidhome' and 'Teenhome' as 'Kids'

In [33]:
df['Kids'] = df['Kidhome'] + df['Teenhome']

In [37]:
df['Kids'].head()

### 10. Group the columns 'AcceptedCmp1' to 'AcceptedCmp5' and 'Response' as 'TotalAcceptedCmp'

In [None]:
df['TotalAcceptedCmp'] = df['AcceptedCmp1'] + df['AcceptedCmp2'] + df['AcceptedCmp3'] + df['AcceptedCmp4'] + df['AcceptedCmp5'] + df['Response']

### 11. Drop those columns which we have used above for obtaining new features

In [35]:
# Dropping the columns, since we have grouped them
col_del = ["AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5", "Response",
           "NumWebVisitsMonth", "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases", "NumDealsPurchases",
           # Assumed remainder of the list: the columns combined in steps 7 and 9
           "Kidhome", "Teenhome", "MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts", "MntGoldProds"]
df = df.drop(columns=col_del)
df.head()

### 12. Extract 'Age' using the column 'Year_Birth' and then drop the column 'Year_Birth'

In [None]:
 # Adding a column "Age" in the dataframe
df['Age'] = 2024 - df["Year_Birth"]

In [None]:
df.drop('Year_Birth', axis=1, inplace=True)

In [36]:
df.head(2)


### 13. Encode the categorical variables in the dataset

In [None]:
## Label Encoding
cate = ['Education', 'Marital_Status']
lbl_encode = LabelEncoder()
for i in cate:
    df[i] = lbl_encode.fit_transform(df[i])
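
It helps to know what `LabelEncoder` actually assigns: each category gets the index of its position in sorted order. The sketch below shows this on a toy education column (the category values are illustrative).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

education = pd.Series(["Graduation", "PhD", "Master", "Basic", "Graduation"])

enc = LabelEncoder()
codes = enc.fit_transform(education)

# classes_ holds the categories in sorted order; a category's code is its position there
print(list(enc.classes_))   # ['Basic', 'Graduation', 'Master', 'PhD']
print(codes.tolist())       # [1, 3, 2, 0, 1]
```

Because the codes follow sort order, they may or may not match a meaningful ranking of the categories, while distance-based methods such as K-means treat them as ordinary numbers. Keeping `enc.classes_` around also makes it possible to translate cluster summaries back to the original labels.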

In [39]:
df.head()


### 14. Standardize the columns so that all features are on a comparable scale

In [44]:
## Standardization
df1 = df.copy()
scaled_features = StandardScaler().fit_transform(df1.values)
scaled_features_df = pd.DataFrame(scaled_features, index=df1.index, columns=df1.columns)

In [45]:
scaled_features_df.head(3)
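
Standardization rescales each column to zero mean and unit variance, z = (x - mean) / std, rather than to a fixed range. A minimal check on a toy matrix (illustrative values) that `StandardScaler` matches the manual formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: two features on very different scales (illustrative values)
X = np.array([[20000.0, 1.0],
              [40000.0, 2.0],
              [60000.0, 3.0],
              [80000.0, 6.0]])

scaled = StandardScaler().fit_transform(X)

# Same result by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))  # approximately 0 and 1 per column
```

Without this step, a large-valued feature such as income would dominate both the PCA directions and the Euclidean distances used by the clustering algorithms.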

### 15. Apply PCA on the above dataset and determine the number of PCA components to be used so that 90-95% of the variance in the data is explained.

In [46]:
## step1: Calculate the covariance matrix.
cov_matrix = np.cov(scaled_features.T)
cov_matrix

In [None]:
## step2: Calculate the eigenvalues and eigenvectors.
eig_vals, eig_vectors = np.linalg.eig(cov_matrix)
print('Eigenvalues:', '\n', eig_vals)
print('\n')
print('Eigenvectors:', '\n', eig_vectors)

In [None]:
## step3: Explained variance and cumulative explained variance (in percent).
total = sum(eig_vals)
var_exp = [(i/total)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print('Explained Variance: ', var_exp)
print('Cumulative Variance Explained: ', cum_var_exp)

In [None]:
## Scree plot.
n_comp = len(var_exp)  # one bar per principal component
plt.bar(range(1, n_comp + 1), var_exp, align='center', color='lightgreen', edgecolor='black', label='Explained Variance')
plt.step(range(1, n_comp + 1), cum_var_exp, where='mid', color='red', label='Cumulative Explained Variance')
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance (%)')
plt.title('Scree Plot')
plt.legend(loc='best')
plt.show()
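
Reading the component count off a plot can be replaced by a small calculation: take the first component at which the cumulative explained variance reaches the target. The sketch below does this on random toy data with a 90% threshold; the data and the `threshold` name are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples, 6 features, with two features made strongly correlated
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize, as in the step above

cov = np.cov(X.T)
# eigvalsh returns real eigenvalues of a symmetric matrix in ascending order; reverse them
eig_vals = np.linalg.eigvalsh(cov)[::-1]
cum_var = np.cumsum(eig_vals) / eig_vals.sum() * 100

threshold = 90.0
n_components = int(np.argmax(cum_var >= threshold)) + 1   # first position reaching the threshold
print(n_components, cum_var.round(2))
```

Using `eigvalsh` instead of `eig` is a design choice for symmetric matrices such as a covariance matrix: it guarantees real eigenvalues in a known order.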

### 16. Apply K-means clustering and segment the data (Use PCA transformed data for clustering)

In [None]:
## Using the number of dimensions obtained from the PCA to apply clustering (i.e., 8).
pca = PCA(n_components=8)
pca_df = pd.DataFrame(pca.fit_transform(scaled_features_df), columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8'])
pca_df.head()
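
Instead of hard-coding the component count, scikit-learn's `PCA` also accepts a fraction between 0 and 1 and keeps just enough components to explain that share of the variance. This is a sketch on random toy data, so the resulting count says nothing about how many components the marketing data needs.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))              # toy stand-in for the standardized features
X = (X - X.mean(axis=0)) / X.std(axis=0)

# A float in (0, 1) asks PCA to keep the smallest number of components
# whose cumulative explained variance exceeds that fraction
pca = PCA(n_components=0.90, svd_solver="full")
X_pca = pca.fit_transform(X)

print(pca.n_components_, round(float(pca.explained_variance_ratio_.sum()), 3))
```

After fitting, `pca.n_components_` holds the number of components that were kept, which can then be used to name the columns of the transformed frame.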

# Kmeans Clustering using PCA transformed data

In [47]:
## finding optimal K value by KMeans clustering using Elbow plot.
cluster_errors = []
cluster_range = range(1, 15)
for num_clusters in cluster_range:
    clusters = KMeans(n_clusters=num_clusters, random_state=100)
    clusters.fit(pca_df)
    cluster_errors.append(clusters.inertia_)
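
The elbow in an inertia curve can be ambiguous, so a second criterion is useful. One common option is the silhouette score, which the notebook already imports; the sketch below applies it to K-means on three synthetic, well-separated blobs (toy data, not the marketing features).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Toy data: three well-separated 2-D blobs of 50 points each
centers = np.array([[-10.0, -10.0], [0.0, 0.0], [10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 7):                       # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=100).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)        # k with the highest average silhouette
print(best_k, {k: round(float(v), 3) for k, v in scores.items()})
```

On real data the maximum is usually less clear-cut than on these blobs, so the silhouette values are best read alongside the elbow plot rather than instead of it.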

In [None]:
## creating a dataframe of number of clusters and cluster errors.
cluster_df = pd.DataFrame({'num_clusters': cluster_range, 'cluster_errors': cluster_errors})

## Elbow plot.
plt.figure(figsize=[15,5])
plt.plot(cluster_df['num_clusters'], cluster_df['cluster_errors'], marker='o', color='b')
plt.show()

In [None]:
## Applying KMeans clustering for the optimal number of clusters obtained above.
kmeans = KMeans(n_clusters=3, random_state=100)
kmeans.fit(pca_df)

KMeans(n_clusters=3, random_state=100)

In [None]:
## creating a dataframe of the labels.
label = pd.DataFrame(kmeans.labels_, columns=['Label'])

In [None]:
## joining the label dataframe to the pca_df dataframe.
kmeans_df = pca_df.join(label)
kmeans_df.head()

In [None]:
kmeans_df['Label'].value_counts()

In [None]:
## visualizing the clusters formed
sns.scatterplot(x='PC1', y='PC2', hue='Label', data=kmeans_df)
plt.show()

### 17. Apply Agglomerative clustering and segment the data (Use Original data for clustering), and perform cluster analysis by doing bivariate analysis between the cluster label and different features and write your observations.

# Agglomerative clustering using the original data

In [16]:
plt.figure(figsize=[18,5])
merg = linkage(scaled_features, method='ward')
dendrogram(merg, leaf_rotation=90)
plt.xlabel('Datapoints')
plt.ylabel('Euclidean distance')
plt.show()

# Computing silhouette score for agglomerative clustering

In [None]:
from sklearn.metrics import silhouette_score

In [None]:
for i in range(2, 15):
    hier = AgglomerativeClustering(n_clusters=i)
    labels = hier.fit_predict(scaled_features_df)
    print(i, silhouette_score(scaled_features_df, labels))

In [None]:
## Building the hierarchical clustering model with 3 clusters on the original (standardized) data
## Ward linkage always uses Euclidean distance, so no distance argument is passed
hie_cluster = AgglomerativeClustering(n_clusters=3, linkage='ward')
hie_cluster_model = hie_cluster.fit(scaled_features_df)

In [17]:
# creating a dataframe of the labels
df_label1 = pd.DataFrame(hie_cluster_model.labels_, columns=['Labels'])
df_label1.head(5)

### Visualization and Interpretation of results

KMeans Clustering Visualization and Interpretation
Interpretation:
The scatter plot shows how the data points are distributed among different clusters in the reduced 2D PCA space. Each color represents a different cluster. The separation between clusters indicates how distinct they are from each other.

Agglomerative Clustering Visualization and Interpretation
Interpretation:
The dendrogram shows the hierarchical merging of clusters. The height at which two clusters are joined represents the distance between them. Cutting the dendrogram at different heights results in different numbers of clusters.
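
Cutting the dendrogram can be done programmatically with SciPy's `fcluster`, either by asking for a number of clusters or by giving a cut height. A minimal sketch on three synthetic groups (toy data; the cut height of 5 is chosen for this toy example only):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy data: three well-separated groups of 20 points each
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

Z = linkage(X, method="ward")

# Cut the tree so that at most 3 flat clusters remain
labels_k = fcluster(Z, t=3, criterion="maxclust")

# Alternatively, cut at a fixed height: merges above distance 5 are not applied
labels_h = fcluster(Z, t=5.0, criterion="distance")

print(np.unique(labels_k), np.unique(labels_h))
```

Both cuts recover the same three groups here because the gap between within-group and between-group merge heights is large; on real data the two criteria can disagree.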

Bivariate Analysis and Interpretation
Interpretation:
Boxplots of each feature against the cluster label show how that feature is distributed within each cluster. Comparing the central tendency (median) and spread (IQR) across clusters helps identify which features distinguish the clusters.
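
The same median and IQR that a boxplot draws can be tabulated per cluster with a `groupby`. The sketch below uses a made-up frame with a `Labels` column; the `iqr` helper is a hypothetical name introduced here.

```python
import pandas as pd

# Toy frame: two features plus a cluster label (values are made up for illustration)
toy = pd.DataFrame({
    "Income":         [20000, 25000, 30000, 70000, 80000, 90000],
    "Total_Expenses": [50, 80, 60, 900, 1200, 1500],
    "Labels":         [0, 0, 0, 1, 1, 1],
})

def iqr(s):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return s.quantile(0.75) - s.quantile(0.25)

# Median and IQR of every feature within each cluster
profile = toy.groupby("Labels").agg(["median", iqr])
print(profile)
```

On the notebook's data, a similar table could be built after joining `df_label1` to `df`; features whose medians differ strongly between clusters are the ones most useful for describing each segment.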

-----
## Happy Learning
-----