# Unsupervised Lab Session

## Learning outcomes:
- Exploratory data analysis and data preparation for model building.
- PCA for dimensionality reduction.
- K-means and Agglomerative Clustering

## Problem Statement
Based on the given marketing campaign dataset, segment similar customers into suitable clusters. Analyze the clusters and provide insights that help the organization promote its business.

## Context:
- Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.
- Customer personality analysis helps a business tailor its product to target customers in different customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market it only to that segment.

## About dataset
- Source: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis?datasetId=1546318&sortBy=voteCount

### Attribute Information:
- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month

### 1. Import required libraries

In [2]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


### 2. Load the CSV file (i.e., marketing.csv) and display the first 5 rows of the dataframe. Check the shape and info of the dataset.

In [3]:
df = pd.read_csv('marketing.csv')
df.head()

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Response
0,5524,1957,Graduation,Single,58138.0,0,0,4/9/2012,58,635,...,10,4,7,0,0,0,0,0,0,1
1,2174,1954,Graduation,Single,46344.0,1,1,8/3/2014,38,11,...,1,2,5,0,0,0,0,0,0,0
2,4141,1965,Graduation,Together,71613.0,0,0,21-08-2013,26,426,...,2,10,4,0,0,0,0,0,0,0
3,6182,1984,Graduation,Together,26646.0,1,0,10/2/2014,26,11,...,0,4,6,0,0,0,0,0,0,0
4,5324,1981,PhD,Married,58293.0,1,0,19-01-2014,94,173,...,3,6,5,0,0,0,0,0,0,0


### 3. Check the percentage of missing values. If any values are missing, treat them accordingly.

In [4]:
df.isnull().sum()/len(df)*100

ID                     0.000000
Year_Birth             0.000000
Education              0.000000
Marital_Status         0.000000
Income                 1.071429
Kidhome                0.000000
Teenhome               0.000000
Dt_Customer            0.000000
Recency                0.000000
MntWines               0.000000
MntFruits              0.000000
MntMeatProducts        0.000000
MntFishProducts        0.000000
MntSweetProducts       0.000000
MntGoldProds           0.000000
NumDealsPurchases      0.000000
NumWebPurchases        0.000000
NumCatalogPurchases    0.000000
NumStorePurchases      0.000000
NumWebVisitsMonth      0.000000
AcceptedCmp3           0.000000
AcceptedCmp4           0.000000
AcceptedCmp5           0.000000
AcceptedCmp1           0.000000
AcceptedCmp2           0.000000
Complain               0.000000
Response               0.000000
dtype: float64
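Only `Income` has missing values (about 1.07% of rows), and the cell above reports them without treating them. A minimal sketch of one common treatment, median imputation, is shown below on a toy frame; the toy values are illustrative, and choosing the median over the mean or over dropping the rows is an assumption, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the marketing data; values are illustrative only
toy = pd.DataFrame({
    "Income": [58138.0, 46344.0, np.nan, 26646.0, 58293.0],
    "Recency": [58, 38, 26, 26, 94],
})

# The median is less sensitive than the mean to a few extreme incomes
income_median = toy["Income"].median()
toy["Income"] = toy["Income"].fillna(income_median)

# Confirm that no missing values remain
print(toy["Income"].isnull().sum())
```

On the real data the same two lines would be applied to `df['Income']` before scaling, so that no NaN reaches the covariance computation later on.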

### 4. Check whether there are any duplicate records in the dataset. If there are, drop them.

In [5]:
df[df.duplicated()]

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Response
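The empty result above means no fully identical rows were found, so nothing needs dropping here. For reference, a minimal sketch of counting and removing duplicates on a toy frame (the column values are made up for illustration):

```python
import pandas as pd

# Toy frame with one exact duplicate row; values are illustrative only
toy = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Income": [100.0, 200.0, 200.0, 300.0],
})

# duplicated() flags every repeat of an earlier row
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```

Because `ID` is unique per customer, a stricter check could pass `subset=` with the non-identifier columns to `duplicated()`; whether that is appropriate depends on the analysis.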


### 5. Drop the columns that you consider redundant for the analysis

In [6]:
df.describe()


Unnamed: 0,ID,Year_Birth,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Response
count,2240.0,2240.0,2216.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,...,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0
mean,5592.159821,1968.805804,52247.251354,0.444196,0.50625,49.109375,303.935714,26.302232,166.95,37.525446,...,2.662054,5.790179,5.316518,0.072768,0.074554,0.072768,0.064286,0.013393,0.009375,0.149107
std,3246.662198,11.984069,25173.076661,0.538398,0.544538,28.962453,336.597393,39.773434,225.715373,54.628979,...,2.923101,3.250958,2.426645,0.259813,0.262728,0.259813,0.245316,0.114976,0.096391,0.356274
min,0.0,1893.0,1730.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2828.25,1959.0,35303.0,0.0,0.0,24.0,23.75,1.0,16.0,3.0,...,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,5458.5,1970.0,51381.5,0.0,0.0,49.0,173.5,8.0,67.0,12.0,...,2.0,5.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,8427.75,1977.0,68522.0,1.0,1.0,74.0,504.25,33.0,232.0,50.0,...,4.0,8.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,11191.0,1996.0,666666.0,2.0,2.0,99.0,1493.0,199.0,1725.0,259.0,...,28.0,13.0,20.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
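The summary above is only inspected; no column is actually dropped. A hedged sketch of one way to do it follows: treat the identifier column as redundant, along with any column that holds a single constant value. Which columns count as redundant is a judgment call, and the `Constant` column below is hypothetical, added only to illustrate the check.

```python
import pandas as pd

# Toy frame; 'Constant' is a hypothetical column used only for illustration
toy = pd.DataFrame({
    "ID": [5524, 2174, 4141],
    "Income": [58138.0, 46344.0, 71613.0],
    "Constant": [3, 3, 3],
})

# A column with a single unique value cannot help separate customers
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# An identifier carries no behavioural information, so drop it as well
reduced = toy.drop(columns=["ID"] + constant_cols)

print(reduced.columns.tolist())
```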


### 6. Check the unique categories in the column 'Marital_Status'
- i) Group categories 'Married', 'Together' as 'Relationship'
- ii) Group categories 'Divorced', 'Widow', 'Alone', 'YOLO', and 'Absurd' as 'Single'.

In [7]:
df['Marital_Status'].unique()

array(['Single', 'Together', 'Married', 'Divorced', 'Widow', 'Alone',
       'Absurd', 'YOLO'], dtype=object)

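In [None]:
# replace() returns a new Series, so assign the result back to the column
df['Marital_Status'] = df['Marital_Status'].replace(['Divorced', 'Widow', 'Alone',
       'Absurd', 'YOLO'],'Single')

In [None]:
df['Marital_Status'] = df['Marital_Status'].replace(['Together', 'Married'],'Relationship')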

### 7. Group the columns 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', and 'MntGoldProds' as 'Total_Expenses'

In [27]:
# Sum every spending column (names containing 'Mnt') into one total per customer
columns_series = pd.Series(df.columns)
req_columns = columns_series[columns_series.str.contains('Mnt')].values.tolist()
df['Total_Expenses'] = df[req_columns].sum(axis=1)
df['Total_Expenses']

0       1617
1         27
2        776
3         53
4        422
        ... 
2235    1341
2236     444
2237    1241
2238     843
2239     172
Length: 2240, dtype: int64

### 8. Group the columns 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', and 'NumDealsPurchases' as 'Num_Total_Purchases'

In [18]:
columns_series = pd.Series(df.columns)

In [21]:
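# Note: the pattern 'Num' also matches 'NumWebVisitsMonth', so the totals below include monthly web visits
req_columns = columns_series[columns_series.str.contains('Num')].values.tolist()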

In [23]:
df['Num_Total_Purchases'] =  df[req_columns].sum(axis=1)

In [24]:
df['Num_Total_Purchases']

0       32
1       11
2       25
3       14
4       24
        ..
2235    23
2236    29
2237    25
2238    26
2239    18
Name: Num_Total_Purchases, Length: 2240, dtype: int64
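The task names four purchase columns, whereas a substring match on `'Num'` selects every column whose name contains it. One way to restrict the sum to the four named columns is an explicit list, sketched below on a toy frame with illustrative values.

```python
import pandas as pd

# Toy frame; values are illustrative only
toy = pd.DataFrame({
    "NumDealsPurchases": [3, 2],
    "NumWebPurchases": [8, 1],
    "NumCatalogPurchases": [10, 1],
    "NumStorePurchases": [4, 2],
    "NumWebVisitsMonth": [7, 5],
})

# Name the purchase columns explicitly so that web visits are not counted
purchase_cols = [
    "NumWebPurchases",
    "NumCatalogPurchases",
    "NumStorePurchases",
    "NumDealsPurchases",
]
toy["Num_Total_Purchases"] = toy[purchase_cols].sum(axis=1)

print(toy["Num_Total_Purchases"].tolist())
```

An explicit list is also safe to re-run: a pattern match would pick up `Num_Total_Purchases` itself on a second execution of the cell.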

### 9. Group the columns 'Kidhome' and 'Teenhome' as 'Kids'

In [12]:
columns_series = pd.Series(df.columns)
req_columns = columns_series[columns_series.str.contains('home')].values.tolist()
df['Kids'] =  df[req_columns].sum(axis=1)
df['Kids']

0       0
1       2
2       0
3       1
4       1
       ..
2235    1
2236    3
2237    0
2238    1
2239    2
Name: Kids, Length: 2240, dtype: int64

### 10. Group the columns 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', and 'Response' as 'TotalAcceptedCmp'

In [56]:
df1 = df.copy()

In [57]:
df1['TotalAcceptedCmp'] = df1['AcceptedCmp1'] + df1['AcceptedCmp2'] + df1['AcceptedCmp3']  + df1['AcceptedCmp4']  + df1['AcceptedCmp5'] + df1['Response'] 

In [58]:
df1['TotalAcceptedCmp']

0       1
1       0
2       0
3       0
4       0
       ..
2235    0
2236    1
2237    1
2238    0
2239    1
Name: TotalAcceptedCmp, Length: 2240, dtype: int64

### 11. Drop those columns which we have used above for obtaining new features

In [59]:
df1.drop(['AcceptedCmp1','AcceptedCmp2','AcceptedCmp3','AcceptedCmp4','AcceptedCmp5','Response'],axis =1,inplace=True )

In [60]:
# info is a method; call it with parentheses to print the column summary
df1.info()

### 12. Extract 'age' using the column 'Year_Birth' and then drop the column 'Year_Birth'

In [61]:
df1['age'] = 2023-df1['Year_Birth']
df1.drop('Year_Birth',axis =1,inplace=True)

### 13. Encode the categorical variables in the dataset

In [62]:

from sklearn.preprocessing import StandardScaler,LabelEncoder
# Select the text (object) columns, which must be encoded before scaling
df_cat = df1.select_dtypes(include='object')
df_cat

0
1
2
3
4
...
2235
2236
2237
2238
2239


In [63]:
## Label encoding
le = LabelEncoder()
for col in df_cat:
    df1[col] = le.fit_transform(df1[col])
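`LabelEncoder` assigns integer codes in sorted label order, which implies an ordering that distance-based methods such as K-means will treat as meaningful. For nominal columns, one-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on a toy frame; it is an alternative under that assumption, not the method this notebook uses.

```python
import pandas as pd

# Toy frame; values are illustrative only
toy = pd.DataFrame({
    "Education": ["Graduation", "PhD", "Graduation"],
    "Marital_Status": ["Single", "Relationship", "Relationship"],
    "Income": [58138.0, 58293.0, 71613.0],
})

# One indicator column per category, with no implied order between categories
encoded = pd.get_dummies(toy, columns=["Education", "Marital_Status"], dtype=int)

print(encoded.columns.tolist())
```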
    

### 14. Standardize the columns so that each has zero mean and unit variance

In [64]:
## Standardization
df2 = df.copy()
scaled_features = StandardScaler().fit_transform(df2.values)
scaled_features_df = pd.DataFrame(scaled_features, index=df2.index, columns=df2.columns)
scaled_features_df.head(5)

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Response,Kids
0,-0.020999,-0.985345,-0.350141,0.251004,0.234063,-0.825218,-0.929894,1.187363,0.307039,0.983781,...,-0.550785,0.693904,-0.28014,-0.28383,-0.28014,-0.262111,-0.11651,-0.097282,2.388846,-1.264505
1,-1.053058,-1.235733,-0.350141,0.251004,-0.234559,1.032559,0.906934,1.582039,-0.383664,-0.870479,...,-1.166125,-0.130463,-0.28014,-0.28383,-0.28014,-0.262111,-0.11651,-0.097282,-0.418612,1.396361
2,-0.44707,-0.317643,-0.350141,1.18034,0.769478,-0.825218,-0.929894,-0.162843,-0.798086,0.362723,...,1.295237,-0.542647,-0.28014,-0.28383,-0.28014,-0.262111,-0.11651,-0.097282,-0.418612,-1.264505
3,0.181716,1.268149,-0.350141,1.18034,-1.017239,1.032559,-0.929894,-1.523436,-0.798086,-0.870479,...,-0.550785,0.28172,-0.28014,-0.28383,-0.28014,-0.262111,-0.11651,-0.097282,-0.418612,0.065928
4,-0.082614,1.017761,1.428354,-0.678332,0.240221,1.032559,-0.929894,-0.562712,1.550305,-0.389085,...,0.064556,-0.130463,-0.28014,-0.28383,-0.28014,-0.262111,-0.11651,-0.097282,-0.418612,0.065928
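A quick way to confirm that standardization worked is to check that every scaled column has mean 0 and standard deviation 1. The sketch below does this on a toy frame with illustrative values. `StandardScaler` divides by the population standard deviation (`ddof=0`), which is why a sample covariance computed afterwards with `np.cov` has diagonal entries of n/(n-1) rather than exactly 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame; values are illustrative only
toy = pd.DataFrame({
    "Income": [58138.0, 46344.0, 71613.0, 26646.0],
    "Total_Expenses": [1617, 27, 776, 53],
})

scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy),
    index=toy.index,
    columns=toy.columns,
)

# Each column should now have mean 0 and population standard deviation 1
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```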


### 15. Apply PCA to the standardized dataset and determine how many principal components are needed to explain 90-95% of the variance in the data.

In [65]:
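## Calculating covariance matrix
# np.cov returns a NumPy array, which has no dropna method; the nan entries
# in the output come from the missing 'Income' values
cov_matrix = np.cov(scaled_features_df.T)
print('Covariance matrix','\n',cov_matrix)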

Covariance matrix 
 [[ 1.00044663e+00  2.76368349e-05 -3.84114609e-03  1.95885588e-02
              nan  2.40668086e-03 -2.58104310e-03  2.44967341e-02
  -4.65448809e-02 -2.28885078e-02  4.60211667e-03 -4.43919937e-03
  -2.44858793e-02 -7.64496983e-03 -1.34438344e-02 -3.71824475e-02
  -1.89323486e-02 -3.44167996e-03 -1.49335955e-02 -7.44950801e-03
  -3.60560931e-02 -2.53980586e-02 -7.52038148e-03 -2.16238909e-02
  -1.50676897e-02  3.38981712e-02 -2.19776291e-02 -1.45948170e-04]
 [ 2.76368349e-05  1.00044663e+00 -1.71466853e-01 -6.06069703e-02
              nan  2.30278341e-01 -3.52268696e-01 -2.77779803e-03
  -1.98799637e-02 -1.57842989e-01 -1.79252427e-02 -3.08861876e-02
  -4.16439684e-02 -1.81407051e-02 -6.18457914e-02 -6.08727256e-02
  -1.45104917e-01 -1.21328797e-01 -1.28329345e-01  1.21193363e-01
   6.18020701e-02 -6.05366207e-02  7.12572266e-03 -5.93296539e-03
  -6.54220470e-03 -3.01410006e-02  2.13347389e-02 -9.02395543e-02]
 [-3.84114609e-03 -1.71466853e-01  1.00044663e+00  7.0

In [67]:
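## Calculating eigenvalues and eigenvectors
# NaNs make the decomposition fail, so rebuild the covariance matrix from rows without missing values
cov_matrix = np.cov(scaled_features_df.dropna().T)
# eigh is intended for symmetric matrices such as a covariance matrix and returns real eigenvalues
eig_vals, eig_vecs = np.linalg.eigh(cov_matrix)
print('Eigenvectors:','\n',eig_vecs)
print('\n')
print('Eigenvalues:','\n',eig_vals)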

In [68]:
## Calculating the variance explained and the cumulative variance explained (in percent)
total = sum(eig_vals)
var_exp = [ (i/total)*100  for i in sorted(eig_vals,reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print('Variance Explained: ',var_exp)
print('Cumulative Variance Explained: ',cum_var_exp)

In [69]:
## Scree plot
# Use the actual number of components rather than a hard-coded count
n_components = len(var_exp)
plt.bar(range(1, n_components + 1), var_exp, align='center', color='lightgreen', edgecolor='black', label='Individual Explained Variance')
plt.step(range(1, n_components + 1), cum_var_exp, where='mid', color='red', label='Cumulative Explained Variance')
plt.legend(loc = 'best')
plt.ylabel('Explained Variance (%)')
plt.xlabel('Principal Components')
plt.tight_layout()
plt.show()
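The same question, how many components are needed to reach a variance target, can also be answered with scikit-learn's `PCA`. The sketch below runs on synthetic data (not the marketing data) and uses a 95% target taken from the upper end of the range in the task; the variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 observed features driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 8))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 8))
X_scaled = StandardScaler().fit_transform(X)

# Fit all components, then find the smallest count whose cumulative ratio reaches 95%
pca = PCA().fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.95)) + 1

# Project the data onto that many components
data_pca = PCA(n_components=n_components).fit_transform(X_scaled)
print(n_components, data_pca.shape)
```

The projected array plays the role of the PCA-transformed data that the clustering step expects.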

### 16. Apply K-means clustering and segment the data (Use PCA transformed data for clustering)

In [70]:
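from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Assumption: data_pca is never defined above, so it is built here by projecting the
# complete, standardized rows onto the components that explain 95% of the variance
data_pca = PCA(n_components=0.95, svd_solver='full').fit_transform(scaled_features_df.dropna())

cluster_range = range(1,15)
cluster_errors = []

for num_clusters in cluster_range:
    clusters = KMeans(n_clusters=num_clusters, n_init=10)
    clusters.fit(data_pca)
    # Inertia is the within-cluster sum of squared distances, used for the elbow method
    cluster_errors.append(clusters.inertia_)

clusters_df = pd.DataFrame({'num_clusters':cluster_range, 
                           'cluster_errors':cluster_errors})

clusters_df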
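Inertia always decreases as the number of clusters grows, so the elbow can be hard to read. The silhouette score is a common complementary check. The sketch below applies it to synthetic, well-separated blobs (not the marketing data); the range of candidate values for k is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Silhouette needs at least two clusters, so start the search at k=2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest average silhouette is the best-separated partition
best_k = max(scores, key=scores.get)
print(best_k)
```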

### 17. Apply Agglomerative clustering and segment the data (Use Original data for clustering), and perform cluster analysis by doing bivariate analysis between the cluster label and different features and write your observations.

In [71]:
from scipy.cluster.hierarchy import linkage, dendrogram

# Ward linkage on the standardized (non-PCA) rows without missing values,
# since the task asks for the original data
plt.figure(figsize=[18,7])
merg = linkage(scaled_features_df.dropna().values, method='ward')
dendrogram(merg, leaf_rotation=90)
plt.xlabel('Datapoints')
plt.ylabel('Euclidean distance')
plt.show()

### Visualization and Interpretation of results
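Once a cluster count has been read off the dendrogram, each customer can be labeled and the clusters profiled by comparing feature means per label. The sketch below shows this on synthetic data with two obvious groups; the column names mirror the notebook's features, but the values and the choice of two clusters are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Synthetic customers: a low-income/low-spend group and a high-income/high-spend group
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Income": np.concatenate([rng.normal(30000, 2000, 30), rng.normal(80000, 2000, 30)]),
    "Total_Expenses": np.concatenate([rng.normal(100, 20, 30), rng.normal(1200, 50, 30)]),
})

# Scale first so that Income does not dominate the distance computation
X = StandardScaler().fit_transform(toy)
toy["cluster"] = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# Bivariate view: average of each feature within each cluster label
profile = toy.groupby("cluster")[["Income", "Total_Expenses"]].mean()
print(profile)
```

On the real data, the same `groupby` on the cluster label, or a `sns.boxplot` of a feature against the label, gives the bivariate comparison the task asks for.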

-----
## Happy Learning
-----