# Unsupervised Lab Session

## Learning outcomes:
- Exploratory data analysis and data preparation for model building.
- PCA for dimensionality reduction.
- K-means and agglomerative clustering.

## Problem Statement
Based on the given marketing campaign dataset, segment similar customers into suitable clusters. Analyze the clusters and provide insights that help the organization promote its business.

## Context:
- Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.
- Customer personality analysis helps a business to modify its product based on its target customers from different types of customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that particular segment.

## About dataset
- Source: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis?datasetId=1546318&sortBy=voteCount

### Attribute Information:
- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month

### 1. Import required libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch

import warnings
warnings.filterwarnings('ignore')

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA


### 2. Load the CSV file (i.e., marketing.csv) and display the first 5 rows of the DataFrame. Check the shape and info of the dataset.

In [None]:
df = pd.read_csv("C:/Users/DEBANJANA/OneDrive/Desktop/marketing.csv")
print(df.shape)  # (rows, columns)
df.info()        # column dtypes and non-null counts
df.head()

### 3. Check the percentage of missing values. If any are present, treat them accordingly.

In [None]:
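# Percentage of missing values per column
print(df.isnull().sum() / len(df) * 100)
# Treat the gaps: fill missing numeric values with the column median
df = df.fillna(df.median(numeric_only=True))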

### 4. Check whether there are any duplicate records in the dataset. If there are, drop them.

In [None]:
# Count fully duplicated rows, then drop them
print("Duplicate rows:", df.duplicated().sum())
df.drop_duplicates(inplace=True)

### 5. Drop the columns that you think are redundant for the analysis

In [None]:
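# 'Dt_Customer' (enrollment date) is not used in the steps below; 'ID' is kept and excluded later
df.drop(columns=['Dt_Customer'], inplace=True)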

### 6. Check the unique categories in the column 'Marital_Status'
- i) Group categories 'Married', 'Together' as 'relationship'
- ii) Group categories 'Divorced', 'Widow', 'Alone', 'YOLO', and 'Absurd' as 'Single'.

In [None]:
# Inspect the original categories before grouping
print("Unique categories in 'Marital_Status':", df['Marital_Status'].unique())

def group_marital_status(status):
    # Couples become 'relationship'; every other status, including 'Single', becomes 'Single'
    if status in ['Married', 'Together']:
        return 'relationship'
    return 'Single'

df['Marital_Status_Grouped'] = df['Marital_Status'].apply(group_marital_status)
unique_categories = df['Marital_Status_Grouped'].unique()
print("Unique categories in 'Marital_Status_Grouped':", unique_categories)
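
The same grouping can also be written as a lookup table, which keeps the category-to-group mapping in one place. The sketch below runs on a small made-up Series rather than the lab data; the names `sample_status` and `status_to_group` are illustrative assumptions, and any status missing from the table falls back to 'Single'.

```python
import pandas as pd

# Hypothetical sample values, not taken from marketing.csv
sample_status = pd.Series(['Married', 'Single', 'Together', 'YOLO', 'Widow'])

# Lookup table: only the couple categories need an explicit entry
status_to_group = {'Married': 'relationship', 'Together': 'relationship'}

# map() returns NaN for statuses missing from the table; fillna() turns those into 'Single'
grouped = sample_status.map(status_to_group).fillna('Single')
print(grouped.tolist())
```

On the lab data, the equivalent call would be applied to `df['Marital_Status']` instead of the sample Series.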

### 7. Group the columns 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', and 'MntGoldProds' as 'Total_Expenses'

In [None]:
columns_to_sum1 = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']

df['Total_Expenses'] = df[columns_to_sum1].sum(axis=1)


### 8. Group the columns 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', and 'NumDealsPurchases' as 'Num_Total_Purchases'

In [None]:
columns_to_sum2 = ['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumDealsPurchases']

df['Num_Total_Purchases'] = df[columns_to_sum2].sum(axis=1)

### 9. Group the columns 'Kidhome' and 'Teenhome' as 'Kids'

In [None]:
columns_to_sum3 = ['Kidhome', 'Teenhome']

df['Kids'] = df[columns_to_sum3].sum(axis=1)

### 10. Group the columns 'AcceptedCmp1' to 'AcceptedCmp5' and 'Response' as 'TotalAcceptedCmp'

In [None]:
columns_to_sum4 = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']

df['TotalAcceptedCmp'] = df[columns_to_sum4].sum(axis=1)

### 11. Drop those columns which we have used above for obtaining new features

In [None]:
# Source columns that are now summarized by the engineered features
columns_to_drop = ['Marital_Status', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumDealsPurchases', 'Kidhome', 'Teenhome', 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']
df = df.drop(columns=columns_to_drop)


### 12. Extract 'Age' using the column 'Year_Birth' and then drop the column 'Year_Birth'

In [None]:
df['Age'] = 2024 - df['Year_Birth']
df = df.drop(columns=['Year_Birth'])


### 13. Encode the categorical variables in the dataset

In [None]:
label_encoder = LabelEncoder()
df['Education_Label'] = label_encoder.fit_transform(df['Education'])
df['Marital_Status_Label'] = label_encoder.fit_transform(df['Marital_Status_Grouped'])
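
Label encoding assigns one integer per category in sorted order of the labels, so the codes carry an order and spacing that methods relying on distances, such as K-means, will treat as meaningful. One-hot encoding avoids that at the cost of extra columns. The sketch below contrasts the two on a made-up `education` Series; it is an illustration under that assumption, not a statement about which encoding the lab expects.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical education levels for illustration
education = pd.Series(['PhD', 'Basic', 'Master', 'Basic'], name='Education')

# Label encoding: one integer per category, assigned in sorted order of the labels
encoder = LabelEncoder()
codes = encoder.fit_transform(education)
print(list(encoder.classes_), codes.tolist())

# One-hot encoding: one indicator column per category, no implied order
dummies = pd.get_dummies(education, prefix='Education', dtype=int)
print(dummies)
```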


### 14. Standardize the columns so that they share a common scale (zero mean and unit variance)

In [None]:
# Numeric features to bring onto a common scale
columns_to_standardize = [
    'Income', 'Recency', 'NumWebVisitsMonth', 'Age',
    'Total_Expenses', 'Num_Total_Purchases', 'Kids', 'TotalAcceptedCmp'
]

scaler = StandardScaler()

df[columns_to_standardize] = scaler.fit_transform(df[columns_to_standardize])

print(df.head())
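
To make explicit what `StandardScaler` computes, the following sketch reproduces it by hand on a made-up column of spending values (the numbers are assumptions for illustration): each value has the column mean subtracted and is divided by the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical spending values for five customers
expenses = np.array([[20.0], [150.0], [400.0], [900.0], [1530.0]])

scaled = StandardScaler().fit_transform(expenses)

# Same result by hand: subtract the mean, divide by the population standard deviation (ddof=0)
manual = (expenses - expenses.mean(axis=0)) / expenses.std(axis=0)

print(np.allclose(scaled, manual))
```

The scaled values are not confined to a fixed interval; they are expressed in standard deviations from the mean.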


### 15. Apply PCA on the above dataset and determine how many principal components are needed to explain 90-95% of the variance in the data.

In [None]:
# Feature matrix: numeric columns only, without the identifier
features = df.drop(columns=['ID']).select_dtypes(include='number')

# Scale every feature so that no single column dominates the components
X_scaled = StandardScaler().fit_transform(features)

# Fit PCA with all components to inspect the explained variance
pca = PCA()
pca.fit(X_scaled)

explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance = explained_variance_ratio.cumsum()

# Smallest number of components whose cumulative variance reaches each threshold
num_components_90 = next(i for i, total in enumerate(cumulative_variance) if total >= 0.90) + 1
num_components_95 = next(i for i, total in enumerate(cumulative_variance) if total >= 0.95) + 1

print(f"Number of components to explain at least 90% of the variance: {num_components_90}")
print(f"Number of components to explain at least 95% of the variance: {num_components_95}")
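
scikit-learn can also pick the number of components directly: passing a float between 0 and 1 as `n_components` keeps just enough components to explain that share of the variance. The sketch below checks this against the manual cumulative-sum count on synthetic data; the data and the names `hidden`, `manual_k`, and `pca_90` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 correlated features generated from 3 hidden factors plus noise
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 3))
X = hidden @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(200, 8))
X_scaled = StandardScaler().fit_transform(X)

# Manual count: first position where the cumulative variance ratio reaches 90%
cumulative = PCA().fit(X_scaled).explained_variance_ratio_.cumsum()
manual_k = int(np.argmax(cumulative >= 0.90)) + 1

# Shortcut: a float n_components asks PCA to keep enough components for that share of variance
pca_90 = PCA(n_components=0.90).fit(X_scaled)
print(manual_k, pca_90.n_components_)
```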

### 16. Apply K-means clustering and segment the data (Use PCA transformed data for clustering)

In [None]:
# Scaled feature matrix: numeric columns only, 'ID' excluded
features = df.drop(columns=['ID']).select_dtypes(include='number')
X_scaled = StandardScaler().fit_transform(features)

# Number of components needed to explain at least 90% of the variance
cumulative_variance = PCA().fit(X_scaled).explained_variance_ratio_.cumsum()
num_components = next(i for i, total in enumerate(cumulative_variance) if total >= 0.90) + 1

# Apply PCA with the determined number of components
pca = PCA(n_components=num_components)
pca_transformed = pca.fit_transform(X_scaled)

# Determine the optimal number of clusters using the elbow method
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(pca_transformed)
    wcss.append(kmeans.inertia_)

# Plot the elbow method
plt.figure(figsize=(10, 5))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# From the plot, choose an optimal number of clusters (e.g., 3 for demonstration)
optimal_clusters = 3

# Apply K-Means with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
df['Cluster'] = kmeans.fit_predict(pca_transformed)

# Show the resulting clustered data
print(df.head())
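
The elbow in a WCSS plot can be hard to read, so a second criterion is often useful. One option, not part of the original lab steps, is the silhouette score from `sklearn.metrics`, where higher values indicate tighter and better-separated clusters. The sketch below uses synthetic blobs with assumed centers instead of the PCA output.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (assumed centers, not lab data)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.8, random_state=42)

# The silhouette score needs at least 2 clusters; higher means tighter, better-separated clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On the lab data, `pca_transformed` would take the place of `X`.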

### 17. Apply Agglomerative clustering and segment the data (Use Original data for clustering), and perform cluster analysis by doing bivariate analysis between the cluster label and different features and write your observations.

In [None]:
# Numeric features only; exclude the identifier and any cluster labels from step 16
features = df.drop(columns=['ID', 'Cluster'], errors='ignore').select_dtypes(include='number')

# Standardize so that every feature contributes equally to the Euclidean distances
X_scaled = StandardScaler().fit_transform(features)

# Ward linkage works only with Euclidean distance, which is the default
agg_cluster = AgglomerativeClustering(n_clusters=3, linkage='ward')
df['Cluster'] = agg_cluster.fit_predict(X_scaled)

# Show the resulting clustered data
print(df.head())

# Features to compare across clusters
columns_to_compare = ['Income', 'Total_Expenses', 'Num_Total_Purchases', 'Kids', 'TotalAcceptedCmp', 'Age']

# Bivariate analysis using pairplot
sns.pairplot(df, hue='Cluster', vars=columns_to_compare[:4])  # Pairplot for the first few features
plt.show()

# Analyzing the distribution of features across clusters
for column in columns_to_compare:
    plt.figure(figsize=(10, 5))
    sns.boxplot(x='Cluster', y=column, data=df)
    plt.title(f'Cluster vs {column}')
    plt.show()
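
A dendrogram is a common way to justify the number of clusters for agglomerative clustering, and the `scipy.cluster.hierarchy` module imported as `sch` in step 1 provides it. The sketch below builds a Ward linkage on synthetic blobs with assumed centers, computes the tree, and cuts it into three clusters; it is an illustration, not the lab's prescribed procedure.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (assumed centers, not lab data)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.8, random_state=42)

# Ward linkage merges, at each step, the two clusters whose union adds the least variance
Z = sch.linkage(X, method='ward')

# no_plot=True only computes the tree layout; drop it to draw the dendrogram with matplotlib
tree = sch.dendrogram(Z, no_plot=True)

# Cut the tree so that at most 3 clusters remain (labels start at 1)
labels = sch.fcluster(Z, t=3, criterion='maxclust')
print(np.bincount(labels)[1:])
```

In the drawn dendrogram, long vertical gaps between merges suggest where to cut the tree.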




### Visualization and Interpretation of results

In [None]:
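# Features to compare across clusters
columns_to_compare = ['Income', 'Total_Expenses', 'Num_Total_Purchases', 'Kids', 'TotalAcceptedCmp', 'Age']

# Cluster sizes and per-cluster feature means give a compact profile of each segment
print(df['Cluster'].value_counts().sort_index())
print(df.groupby('Cluster')[columns_to_compare].mean().round(2))

for column in columns_to_compare:
    plt.figure(figsize=(10, 5))
    sns.boxplot(x='Cluster', y=column, data=df)
    plt.title(f'Cluster vs {column}')
    plt.show()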
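
Boxplots cover the numeric features. For a categorical feature such as 'Marital_Status_Grouped', a row-normalized cross-tabulation shows how each cluster is composed. The sketch below uses a small made-up frame named `sample`; the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical cluster labels and marital groups for eight customers
sample = pd.DataFrame({
    'Cluster': [0, 0, 0, 1, 1, 2, 2, 2],
    'Marital_Status_Grouped': ['relationship', 'relationship', 'Single', 'Single',
                               'Single', 'relationship', 'Single', 'relationship'],
})

# normalize='index' turns counts into row shares, so each cluster's row sums to 1
shares = pd.crosstab(sample['Cluster'], sample['Marital_Status_Grouped'], normalize='index')
print(shares.round(2))
```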

-----
## Happy Learning
-----