# Unsupervised Lab Session

## Learning outcomes:
- Exploratory data analysis and data preparation for model building.
- PCA for dimensionality reduction.
- K-means and Agglomerative Clustering.

## Problem Statement
Based on the given marketing campaign dataset, segment similar customers into suitable clusters. Analyze the clusters and provide insights to help the organization promote its business.

## Context:
- Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.
- Customer personality analysis helps a business to modify its product based on its target customers from different types of customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only on that particular segment.

## About dataset
- Source: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis?datasetId=1546318&sortBy=voteCount

### Attribute Information:
- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month

### 1. Import required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score


### 2. Load the CSV file (i.e., marketing.csv) and display the first 5 rows of the dataframe. Check the shape and info of the dataset.

In [None]:
# Import required libraries
import pandas as pd

# Load the CSV file
file_path = "path/to/marketing.csv"  # Replace with the actual path to your CSV file
df = pd.read_csv(file_path)

# Display the first 5 rows of the dataframe
print("First 5 rows of the dataframe:")
print(df.head())

# Check the shape of the dataset
print("\nShape of the dataset:")
print(df.shape)

# Display information about the dataset
# (df.info() prints directly and returns None, so it is not wrapped in print)
print("\nInfo of the dataset:")
df.info()


### 3. Check the percentage of missing values. If any are present, treat them accordingly.

In [None]:
# Check the percentage of missing values in each column
missing_percentage = df.isnull().mean() * 100

# Display columns with missing values and their respective percentages
columns_with_missing_values = missing_percentage[missing_percentage > 0]
if not columns_with_missing_values.empty:
    print("Columns with missing values and their respective percentages:")
    print(columns_with_missing_values)
else:
    print("No missing values in the dataset.")

# Fill missing numeric values with the column mean (non-numeric columns are left as-is)
df = df.fillna(df.mean(numeric_only=True))
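As an alternative to the mean, the median is less sensitive to extreme values, which may matter for a skewed column such as income. The following is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative assumptions, not taken from marketing.csv); it shows that only numeric columns are filled.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the marketing data (values are made up)
toy = pd.DataFrame({
    "Income": [10.0, np.nan, 30.0],
    "Education": ["Graduation", "PhD", None],
})

# Compute medians for numeric columns only, then fill just those columns;
# columns absent from the Series (here 'Education') are left untouched
numeric_medians = toy.median(numeric_only=True)
toy_filled = toy.fillna(numeric_medians)

print(toy_filled)
```

Whether mean or median imputation is more suitable depends on the distribution you observe during exploratory analysis.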

### 4. Check whether there are any duplicate records in the dataset. If there are, drop them.

In [None]:
# Check for duplicate records
duplicate_rows = df[df.duplicated()]

if not duplicate_rows.empty:
    print("Duplicate records found. Dropping duplicates...")
    # Drop duplicate records
    df = df.drop_duplicates()
    print("Duplicates dropped. Updated shape of the dataset:", df.shape)
else:
    print("No duplicate records found in the dataset.")


### 5. Drop the columns that you think are redundant for the analysis

In [None]:
# List of columns to be dropped (replace with the actual column names)
columns_to_drop = ['ID', 'Dt_Customer']

# Drop the specified columns
df = df.drop(columns=columns_to_drop)

# Display the updated shape of the dataset
print("Updated shape of the dataset after dropping columns:", df.shape)


### 6. Check the unique categories in the column 'Marital_Status'
- i) Group categories 'Married', 'Together' as 'relationship'
- ii) Group categories 'Divorced', 'Widow', 'Alone', 'YOLO', and 'Absurd' as 'Single'.

In [None]:
# Check unique categories in the 'Marital_Status' column
unique_categories = df['Marital_Status'].unique()
print("Unique categories in 'Marital_Status':", unique_categories)

# Group categories 'Married' and 'Together' as 'relationship'
df['Marital_Status'] = df['Marital_Status'].replace(['Married', 'Together'], 'relationship')

# Group categories 'Divorced', 'Widow', 'Alone', 'YOLO', and 'Absurd' as 'Single'
df['Marital_Status'] = df['Marital_Status'].replace(['Divorced', 'Widow', 'Alone', 'YOLO', 'Absurd'], 'Single')

# Check the unique categories after grouping
updated_categories = df['Marital_Status'].unique()
print("\nUnique categories after grouping:")
print(updated_categories)


### 7. Group the columns 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', and 'MntGoldProds' as 'Total_Expenses'

In [None]:
# List of columns to be grouped
expense_columns = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']

# Create a new column 'Total_Expenses' by summing the specified columns
df['Total_Expenses'] = df[expense_columns].sum(axis=1)

# Drop the individual expense columns if needed
df = df.drop(columns=expense_columns)

# Display the updated dataframe
print("Updated dataframe with 'Total_Expenses':")
print(df.head())


### 8. Group the columns 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', and 'NumDealsPurchases' as 'Num_Total_Purchases'

In [None]:
# List of columns to be grouped
purchase_columns = ['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumDealsPurchases']

# Create a new column 'Num_Total_Purchases' by summing the specified columns
df['Num_Total_Purchases'] = df[purchase_columns].sum(axis=1)

# Drop the individual purchase columns if needed
df = df.drop(columns=purchase_columns)

# Display the updated dataframe
print("Updated dataframe with 'Num_Total_Purchases':")
print(df.head())


### 9. Group the columns 'Kidhome' and 'Teenhome' as 'Kids'

In [None]:
# List of columns to be grouped
kids_columns = ['Kidhome', 'Teenhome']

# Create a new column 'Kids' by summing the specified columns
df['Kids'] = df[kids_columns].sum(axis=1)

# Drop the individual columns 'Kidhome' and 'Teenhome' if needed
df = df.drop(columns=kids_columns)

# Display the updated dataframe
print("Updated dataframe with 'Kids':")
print(df.head())


### 10. Group the columns 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', and 'Response' as 'TotalAcceptedCmp'

In [None]:
# List of columns to be grouped
cmp_columns = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']

# Create a new column 'TotalAcceptedCmp' by summing the specified columns
df['TotalAcceptedCmp'] = df[cmp_columns].sum(axis=1)

# Drop the individual columns if needed
df = df.drop(columns=cmp_columns)

# Display the updated dataframe
print("Updated dataframe with 'TotalAcceptedCmp':")
print(df.head())


### 11. Drop those columns which we have used above for obtaining new features

In [None]:
# List of columns used for creating new features
columns_to_drop = ['ID', 'Dt_Customer', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
                   'MntSweetProducts', 'MntGoldProds', 'NumWebPurchases', 'NumCatalogPurchases',
                   'NumStorePurchases', 'NumDealsPurchases', 'Kidhome', 'Teenhome',
                   'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']

# Drop the specified columns; errors='ignore' skips any that were already dropped in steps 5 and 7-10
df = df.drop(columns=columns_to_drop, errors='ignore')

# Display the updated dataframe
print("Updated dataframe after dropping columns:")
print(df.head())


### 12. Extract 'Age' using the column 'Year_Birth' and then drop the column 'Year_Birth'

In [None]:
# Extract 'age' from 'Year_Birth'
current_year = pd.to_datetime('today').year
df['Age'] = current_year - df['Year_Birth']

# Drop the 'Year_Birth' column
df = df.drop(columns=['Year_Birth'])

# Display the updated dataframe
print("Updated dataframe with 'Age' and without 'Year_Birth':")
print(df.head())


### 13. Encode the categorical variables in the dataset

In [None]:
# One-hot encode categorical variables
df_encoded = pd.get_dummies(df, columns=['Education', 'Marital_Status'], drop_first=True)

# Display the updated dataframe with encoded categorical variables
print("Updated dataframe with encoded categorical variables:")
print(df_encoded.head())


### 14. Standardize the columns, so that the values are on a comparable scale (zero mean, unit variance)

In [None]:
from sklearn.preprocessing import StandardScaler

# Select only the numerical columns for standardization
numerical_columns = df_encoded.select_dtypes(include=['float64', 'int64']).columns

# Create a StandardScaler object
scaler = StandardScaler()

# Fit and transform the selected numerical columns
df_encoded[numerical_columns] = scaler.fit_transform(df_encoded[numerical_columns])

# Display the updated dataframe with standardized numerical columns
print("Updated dataframe with standardized numerical columns:")
print(df_encoded.head())
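To make clear what standardization does, here is a small sketch on synthetic data (the array `toy` and its two columns are assumptions made for illustration). It compares `StandardScaler` against the formula z = (x - mean) / std, where std is the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features on very different scales
rng = np.random.default_rng(0)
toy = rng.normal(loc=[50_000, 40], scale=[20_000, 12], size=(200, 2))

# Standardize with scikit-learn
scaled = StandardScaler().fit_transform(toy)

# Same computation by hand: subtract the column mean, divide by the
# population standard deviation (ddof=0, NumPy's default)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled.mean(axis=0).round(6))  # column means after scaling
print(scaled.std(axis=0).round(6))   # column standard deviations after scaling
```

Note that standardized values are not confined to a fixed range; they are only centered and rescaled.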


### 15. Apply PCA on the above dataset and determine the number of PCA components to be used so that 90-95% of the variance in data is explained by the same.

In [None]:
from sklearn.decomposition import PCA

# This is an unsupervised task, so there is no target column to separate;
# use every preprocessed feature as input
X = df_encoded.copy()

# Choose the desired explained variance threshold (e.g., 0.90 or 0.95)
desired_variance_threshold = 0.95

# Create a PCA object
pca = PCA()

# Fit PCA on the data
pca.fit(X)

# Determine the number of components needed to explain the desired variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
num_components_needed = np.argmax(cumulative_variance >= desired_variance_threshold) + 1

# Transform the data using the selected number of components
X_pca = pca.transform(X)[:, :num_components_needed]

# Display the number of components and the cumulative explained variance
print(f"Number of components needed: {num_components_needed}")
print(f"Cumulative explained variance with {num_components_needed} components: {cumulative_variance[num_components_needed - 1]:.4f}")

# Use X_pca for further analysis or modeling
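scikit-learn can also pick the number of components itself when `n_components` is given as a fraction between 0 and 1. The sketch below demonstrates this on synthetic data (the names `X_toy`, `latent`, and `mixing` are illustrative assumptions); it is a shortcut for the cumulative-variance calculation, not a replacement for inspecting the variance curve.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in: 300 rows, 8 features driven by 3 latent factors plus noise
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 8))
X_toy = latent @ mixing + 0.1 * rng.normal(size=(300, 8))

# A float n_components asks PCA to keep enough components to explain
# more than that fraction of the variance (requires the full SVD solver)
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_toy_pca = pca_95.fit_transform(X_toy)

print("Components kept:", pca_95.n_components_)
print("Variance explained:", pca_95.explained_variance_ratio_.sum())
```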


### 16. Apply K-means clustering and segment the data (Use PCA transformed data for clustering)

In [None]:
from sklearn.cluster import KMeans

# Assuming X_pca is the PCA-transformed data from the previous step

# Choose the number of clusters (K) - You may need to tune this based on analysis
num_clusters = 3

# Create a KMeans object (n_init is set explicitly so results do not depend on the library's default)
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)

# Fit the model to the PCA-transformed data
kmeans.fit(X_pca)

# Assign cluster labels to the original dataframe
df_encoded['Cluster_Labels'] = kmeans.labels_

# Display the cluster labels in the dataframe
print("Dataframe with Cluster Labels:")
print(df_encoded.head())
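The number of clusters above is a starting guess. One common way to compare candidate values of K is the silhouette score, which `silhouette_score` (imported in step 1) computes. The sketch below uses synthetic, well-separated blobs (the data `X_toy` and the candidate range 2 to 6 are assumptions for illustration); on the real PCA-transformed data the peak is usually less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_pca: three well-separated 2-D blobs
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_toy = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

# Fit K-means for each candidate K and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print("Silhouette score per K:", scores)
print("Best K by silhouette:", best_k)
```

To apply the same loop to the lab data, `X_toy` would be replaced with `X_pca`.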


### 17. Apply Agglomerative clustering and segment the data (Use Original data for clustering), and perform cluster analysis by doing bivariate analysis between the cluster label and different features and write your observations.

In [None]:
from sklearn.cluster import AgglomerativeClustering
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming df_encoded is the dataframe after all the preprocessing steps

# Choose the number of clusters (K) for Agglomerative clustering
num_clusters_agg = 3

# Create an AgglomerativeClustering object
agg_clustering = AgglomerativeClustering(n_clusters=num_clusters_agg)

# Fit the model to the original data
agg_labels = agg_clustering.fit_predict(df_encoded.drop(columns=['Cluster_Labels']))

# Assign cluster labels to the dataframe
df_encoded['Agg_Cluster_Labels'] = agg_labels

# Display the dataframe with Agglomerative cluster labels
print("Dataframe with Agglomerative Cluster Labels:")
print(df_encoded.head())

# Bivariate analysis between cluster labels and different features
features_for_analysis = ['Income', 'Total_Expenses', 'Num_Total_Purchases', 'Kids', 'TotalAcceptedCmp', 'Age']

for feature in features_for_analysis:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x='Agg_Cluster_Labels', y=feature, data=df_encoded)
    plt.title(f'Bivariate Analysis: {feature} by Agglomerative Cluster Labels')
    plt.show()
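The cluster count for agglomerative clustering can be informed by the merge hierarchy itself. A minimal sketch using SciPy's hierarchical clustering routines on synthetic blobs follows (the data `X_toy` is an assumption for illustration; Ward linkage is used because it is also the default linkage of `AgglomerativeClustering`).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in for the preprocessed features: three 2-D blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_toy = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

# Build the full merge hierarchy with Ward linkage
Z = linkage(X_toy, method="ward")

# Column 2 of Z holds the distance at which each merge happened;
# a large jump near the end suggests a natural number of clusters
merge_heights = Z[:, 2]
print("Last merge heights:", merge_heights[-4:])

# Cut the hierarchy into at most 3 flat clusters (labels start at 1)
toy_labels = fcluster(Z, t=3, criterion="maxclust")
print("Cluster sizes:", np.bincount(toy_labels)[1:])

# scipy.cluster.hierarchy.dendrogram(Z) can draw the tree with matplotlib
```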


### Visualization and Interpretation of results

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming df_encoded is the dataframe with Cluster_Labels from K-means clustering

# Scatter plot for two selected features (Income and Total_Expenses are used here; any two columns of df_encoded work)
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Income', y='Total_Expenses', hue='Cluster_Labels', data=df_encoded, palette='viridis', s=80)
plt.title('K-means Clustering Results')
plt.xlabel('Income (standardized)')
plt.ylabel('Total_Expenses (standardized)')
plt.legend(title='Cluster Labels')
plt.show()


In [None]:
# Assuming df_encoded is the dataframe with Cluster_Labels from K-means clustering

# Group by cluster labels and calculate mean for each feature
cluster_means = df_encoded.groupby('Cluster_Labels').mean()

# Display the cluster means
print("Cluster Means:")
print(cluster_means)
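Cluster means are easier to interpret alongside cluster sizes, and medians can be more robust than means. The sketch below builds such a profile on a made-up toy frame (the values and the output column names `size`, `mean_income`, and `median_expenses` are illustrative assumptions). Keep in mind that `df_encoded` was standardized in step 14, so its cluster means are in standard-deviation units rather than original units.

```python
import pandas as pd

# Toy frame standing in for a labelled dataset (values are made up)
toy = pd.DataFrame({
    "Cluster_Labels": [0, 0, 1, 1, 1, 2],
    "Income": [30.0, 50.0, 80.0, 90.0, 100.0, 20.0],
    "Total_Expenses": [100.0, 300.0, 1200.0, 1500.0, 1800.0, 50.0],
})

# Named aggregation: one row per cluster with its size and summary statistics
profile = toy.groupby("Cluster_Labels").agg(
    size=("Income", "size"),
    mean_income=("Income", "mean"),
    median_expenses=("Total_Expenses", "median"),
)
print(profile)
```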


-----
## Happy Learning
-----