# Unsupervised Lab Session

## Learning outcomes:
- Exploratory data analysis and data preparation for model building.
- PCA for dimensionality reduction.
- K-means and Agglomerative Clustering

## Problem Statement
Based on the given marketing campaign dataset, segment similar customers into suitable clusters. Analyze the clusters and provide insights to help the organization promote its business.

## Context:
- Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.
- Customer personality analysis helps a business tailor its product to target customers in different segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market it only to that segment.

## About dataset
- Source: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis?datasetId=1546318&sortBy=voteCount

### Attribute Information:
- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month

### 1. Import required libraries

In [None]:
### Import required libraries

import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing utilities
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

# Dimensionality reduction and clustering
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering

warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### 2. Load the CSV file (i.e., marketing.csv) and display the first 5 rows of the dataframe. Check the shape and info of the dataset.

In [None]:
# Load the CSV file
df = pd.read_csv('/content/drive/MyDrive/Datasets/marketing.csv')

# Display the first 5 rows
print(df.head(5))

# Check the shape of the dataset
print("Shape:", df.shape)

# Check the info of the dataset
print("Info:")
print(df.info())

### 3. Check the percentage of missing values. If any are present, treat them accordingly.

In [None]:
# Check the percentage of missing values
missing_percentages = df.isnull().mean() * 100

# Display the missing percentages
print("Missing Value Percentages:")
print(missing_percentages)
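
The cell above only reports the percentages. One possible treatment is sketched below on a small hypothetical frame named `toy` (not the marketing data). Median imputation is an assumption here, chosen because it is robust to skewed numeric columns; dropping the affected rows is an equally valid option when only a few are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the marketing data
toy = pd.DataFrame({
    "Income": [52000.0, np.nan, 61000.0, 38000.0, np.nan],
    "Recency": [10, 45, 3, 78, 22],
})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Fill gaps in numeric columns with the column median (an assumed strategy)
numeric_cols = toy.select_dtypes(include="number").columns
toy_filled = toy.copy()
toy_filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

print(missing_pct)
print(toy_filled)
```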

### 4. Check whether the dataset contains duplicate records. If it does, drop them.

In [None]:
# Check for duplicate records
duplicates = df.duplicated()

# Count the number of duplicate records
num_duplicates = duplicates.sum()

# Display the number of duplicate records
print("Number of Duplicate Records:", num_duplicates)

# Drop the duplicate records (if any)
df.drop_duplicates(inplace=True)

# Verify if duplicates were dropped
print("Number of Records after Dropping Duplicates:", len(df))

### 5. Drop the columns that you think are redundant for the analysis

In [None]:
# Check the column names
columns = df.columns

# Print the column names
print("Column Names:")
print(columns)

In [None]:
# List of redundant columns to drop
redundant_columns = ['ID', 'Dt_Customer']

# Drop the redundant columns
df.drop(columns=redundant_columns, inplace=True)

# Verify the updated DataFrame
print(df.head())

### 6. Check the unique categories in the column 'Marital_Status'
- i) Group categories 'Married', 'Together' as 'relationship'
- ii) Group categories 'Divorced', 'Widow', 'Alone', 'YOLO', and 'Absurd' as 'Single'.

In [None]:
# Check the unique categories in 'Marital_Status'
unique_categories = df['Marital_Status'].unique()

# Print the unique categories
print("Unique Categories in 'Marital_Status':")
print(unique_categories)

# Group categories 'Married' and 'Together' as 'relationship'
# (assign the result back instead of using inplace=True on a column selection)
df['Marital_Status'] = df['Marital_Status'].replace(['Married', 'Together'], 'relationship')

# Group categories 'Divorced', 'Widow', 'Alone', 'YOLO', and 'Absurd' as 'Single'
df['Marital_Status'] = df['Marital_Status'].replace(['Divorced', 'Widow', 'Alone', 'YOLO', 'Absurd'], 'Single')

# Verify the updated 'Marital_Status' column
print("\nUpdated 'Marital_Status' Column:")
print(df['Marital_Status'].unique())

### 7. Group the columns 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', and 'MntGoldProds' as 'Total_Expenses'

In [None]:
# Define the columns to be grouped
columns_to_group = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']

# Create the 'Total_Expenses' column as the sum of the grouped columns
df['Total_Expenses'] = df[columns_to_group].sum(axis=1)

# Verify the updated DataFrame with the new 'Total_Expenses' column
print(df.head())

### 8. Group the columns 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', and 'NumDealsPurchases' as 'Num_Total_Purchases'

In [None]:
# Define the columns to be grouped
columns_to_group = ['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumDealsPurchases']

# Create the 'Num_Total_Purchases' column as the sum of the grouped columns
df['Num_Total_Purchases'] = df[columns_to_group].sum(axis=1)

# Verify the updated DataFrame with the new 'Num_Total_Purchases' column
print(df.head())

### 9. Group the columns 'Kidhome' and 'Teenhome' as 'Kids'

In [None]:
# Create the 'Kids' column as the sum of 'Kidhome' and 'Teenhome'
df['Kids'] = df['Kidhome'] + df['Teenhome']

# Verify the updated DataFrame with the new 'Kids' column
print(df.head())

### 10. Group the columns 'AcceptedCmp1' to 'AcceptedCmp5' and 'Response' as 'TotalAcceptedCmp'

In [None]:
# Define the columns to be grouped
columns_to_group = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']

# Create the 'TotalAcceptedCmp' column as the sum of the grouped columns
df['TotalAcceptedCmp'] = df[columns_to_group].sum(axis=1)

# Verify the updated DataFrame with the new 'TotalAcceptedCmp' column
print(df.head())

### 11. Drop those columns which we have used above for obtaining new features

In [None]:
# Define the columns to drop
columns_to_drop = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
                   'MntGoldProds', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
                   'NumDealsPurchases', 'Kidhome', 'Teenhome', 'AcceptedCmp1', 'AcceptedCmp2',
                   'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']

# Drop the columns
df.drop(columns=columns_to_drop, inplace=True)

# Verify the updated DataFrame
print(df.head())

In [None]:
df.shape

### 12. Extract 'age' using the column 'Year_Birth' and then drop the column 'Year_Birth'

In [None]:
# Extract 'age' from 'Year_Birth'
df['age'] = pd.Timestamp('now').year - df['Year_Birth']

# Drop the 'Year_Birth' column
df.drop(columns=['Year_Birth'], inplace=True)

# Verify the updated DataFrame
print(df.head())

### 13. Encode the categorical variables in the dataset

In [None]:
# Encode categorical variables using get_dummies()
df_encoded = pd.get_dummies(df)

# Verify the encoded DataFrame
print(df_encoded.head())
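
As a small illustration of what `pd.get_dummies` produces, the sketch below encodes a hypothetical frame named `toy`. The `drop_first=True` option is an assumption rather than part of the cell above: it removes one dummy column per categorical feature, which avoids carrying perfectly redundant columns into distance-based models.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    "Education": ["Graduation", "PhD", "Master", "Graduation"],
    "Marital_Status": ["Single", "relationship", "relationship", "Single"],
    "Income": [52000, 61000, 38000, 45000],
})

# Numeric columns pass through unchanged; each categorical column becomes 0/1 columns
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```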

### 14. Standardize the columns, so that their values are on a comparable scale

In [None]:
# Verify the current columns in the DataFrame
print(df.head())

In [None]:
# Identify the numeric columns
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns

# Print the numeric columns
print("Numeric Columns:")
print(numeric_columns)

In [None]:
# Identify the numeric columns
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns

# Standardize every numeric column so that no feature dominates
# distance-based methods simply because of its units
scaler = StandardScaler()

# Work on a copy so that df keeps its original units for the cluster analysis later on
df_standardized = df.copy()
df_standardized[numeric_columns] = scaler.fit_transform(df[numeric_columns])

# Verify the standardized DataFrame
print(df_standardized.head())
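
To make the effect of `StandardScaler` concrete, the sketch below applies it to a tiny hypothetical array and reproduces the result by hand with z = (x - mean) / std, where std is the population standard deviation. The array values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales
X = np.array([[20000.0, 1.0],
              [40000.0, 2.0],
              [60000.0, 3.0]])

scaled = StandardScaler().fit_transform(X)

# Same result computed by hand, using the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaled)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates a Euclidean distance.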

### 15. Apply PCA on the above dataset and determine the number of PCA components needed to explain 90-95% of the variance in the data.

In [None]:
print(df.columns)

In [None]:
# Select the numeric features for PCA (remaining original columns plus the engineered ones)
columns_for_pca = ['Income', 'Recency', 'NumWebVisitsMonth', 'Complain',
                   'Total_Expenses', 'Num_Total_Purchases', 'Kids',
                   'TotalAcceptedCmp', 'age']

# Drop rows with missing values in the selected columns
# (no further imputation is needed once these rows are gone)
df = df.dropna(subset=columns_for_pca)

# Standardize the selected columns
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[columns_for_pca])

# Apply PCA
pca = PCA()
pca.fit(df_scaled)

# Calculate explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

# Calculate cumulative explained variance ratio
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

# Determine the number of components to explain the desired variance
n_components = np.argmax(cumulative_variance_ratio >= 0.95) + 1

# Generate elbow plot
plt.plot(range(1, len(explained_variance_ratio) + 1), cumulative_variance_ratio, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Elbow Plot for PCA')
plt.axvline(x=n_components, color='red', linestyle='--', label='95% Variance')
plt.legend()
plt.show()

# Print the number of components required
print("Number of PCA components to explain 95% variance:", n_components)
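
scikit-learn's `PCA` also accepts a fraction between 0 and 1 for `n_components`, in which case it keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below compares this shortcut with the manual cumulative-sum approach on synthetic data; the data and the names `k_manual` and `pca_95` are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 6 observed features driven by 2 hidden factors plus small noise
factors = rng.normal(size=(200, 2))
X = factors @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(200, 6))
X_scaled = StandardScaler().fit_transform(X)

# Manual route: smallest k whose cumulative explained variance reaches 95%
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
k_manual = int(np.argmax(cumulative >= 0.95) + 1)

# Shortcut: let PCA choose the number of components itself
pca_95 = PCA(n_components=0.95).fit(X_scaled)

print("Manual:", k_manual, "| PCA(n_components=0.95):", pca_95.n_components_)
```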


### 16. Apply K-means clustering and segment the data (Use PCA transformed data for clustering)

In [None]:
print(df.columns)

In [None]:
# Select the columns for PCA analysis. The individual 'Mnt*', 'Num*Purchases' and
# 'AcceptedCmp*' columns were dropped in step 11, so use the aggregated features instead
columns_for_pca = ['Income', 'Recency', 'NumWebVisitsMonth', 'Complain',
                   'Total_Expenses', 'Num_Total_Purchases', 'Kids',
                   'TotalAcceptedCmp', 'age']

# Drop rows with missing values in the selected columns
df = df.dropna(subset=columns_for_pca).copy()

# Standardize the selected columns
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[columns_for_pca])

# Apply PCA, keeping as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95)
df_pca = pca.fit_transform(df_scaled)

# Print the explained variance ratio of each retained component
print("PCA Components:")
for i in range(pca.n_components_):
    print(f"Component {i+1}: {pca.explained_variance_ratio_[i]}")

# Apply K-means clustering (fixed random_state for reproducible labels)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(df_pca)

# Get the cluster labels
cluster_labels = kmeans.labels_

# Add the cluster labels to the original DataFrame
df['Cluster'] = cluster_labels

# Scatter plot of PCA-transformed data with cluster labels
plt.scatter(df_pca[:, 0], df_pca[:, 1], c=cluster_labels)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('K-means Clustering')
plt.show()
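
The number of clusters is fixed at 4 in the cell above. A common way to check such a choice is to compare the inertia (elbow method) and the silhouette score across several values of k. The sketch below does this on synthetic blobs generated with `make_blobs`, which stand in for `df_pca`; the data, the range of k, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups (a stand-in for df_pca)
X, _ = make_blobs(n_samples=300, centers=[[-6, -6], [0, 6], [6, -6]],
                  cluster_std=0.8, random_state=42)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_                           # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, km.labels_)    # cohesion vs separation, in [-1, 1]

# Pick the k with the highest silhouette score
best_k = max(silhouettes, key=silhouettes.get)
print("Best k by silhouette:", best_k)
```

On real customer data the peak is usually less sharp than on these blobs, so it helps to look at the inertia curve and the silhouette scores together rather than rely on either alone.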

### 17. Apply Agglomerative clustering and segment the data (Use Original data for clustering), and perform cluster analysis by doing bivariate analysis between the cluster label and different features and write your observations.

In [None]:
# Print the column names
print(df.columns)

In [None]:
# 'ID', 'Dt_Customer' and 'Year_Birth' were already dropped in steps 5 and 12,
# and 'age' was derived in step 12, so no further column clean-up is needed here

# Define which columns should be encoded vs scaled
# ('Cluster' holds the K-means labels from step 16 and is not a feature)
columns_to_encode = ['Education', 'Marital_Status']
columns_to_scale = [col for col in df.columns if col not in columns_to_encode + ['Cluster']]

# Instantiate encoder/scaler
scaler = StandardScaler()
ohe = OneHotEncoder(drop='first')
num_imputer = SimpleImputer(strategy='mean')  # Using mean strategy for imputation
cat_imputer = SimpleImputer(strategy='most_frequent')  # Using most_frequent strategy for imputation

# Combine the encoder/scaler to preprocess data
# (sparse_threshold=0 guarantees a dense array for AgglomerativeClustering)
preprocess = ColumnTransformer(
    transformers=[
        ('impute_encode', make_pipeline(cat_imputer, ohe), columns_to_encode),
        ('impute_scale', make_pipeline(num_imputer, scaler), columns_to_scale)],
    sparse_threshold=0)

df_std = preprocess.fit_transform(df)

# Applying Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters=3)
clusters = agg_clustering.fit_predict(df_std)

# Adding cluster labels to the dataframe
df['cluster_labels'] = clusters

# Listing features and cluster labels
print("Features: ", df.columns[:-1])  # Excluding 'cluster_labels'
print("Cluster Labels: ", np.unique(clusters))

# For numerical columns, calculate the mean
numerical_cols = df.select_dtypes(include=[np.number]).columns
numeric_cluster_analysis = df[numerical_cols].groupby('cluster_labels').mean()

# For categorical columns, calculate the mode (most frequent category)
categorical_cols = df.select_dtypes(include=[object]).columns
categorical_cols_with_labels = list(categorical_cols) + ['cluster_labels']
categorical_cluster_analysis = df[categorical_cols_with_labels].groupby('cluster_labels').agg(lambda x: x.mode().iloc[0])

# Combine both dataframes for the final cluster analysis
cluster_analysis = pd.concat([numeric_cluster_analysis, categorical_cluster_analysis], axis=1)

# Printing the cluster analysis
cluster_analysis
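
The choice of 3 clusters can be checked against the merge distances of the hierarchy. The sketch below uses SciPy's `linkage` and `fcluster`, which are not used elsewhere in this notebook, on synthetic blobs: a large jump between consecutive merge distances suggests where to cut the tree. The data and the name `suggested_k` are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with 3 well-separated groups (a stand-in for df_std)
X, _ = make_blobs(n_samples=150, centers=[[-6, -6], [0, 6], [6, -6]],
                  cluster_std=0.8, random_state=0)

# Ward linkage matrix: each row records one merge and the distance at which it happened
Z = linkage(X, method="ward")

# Distances of the last 5 merges; the last one leaves a single cluster
last_merges = Z[-5:, 2]
gaps = np.diff(last_merges)

# Cutting just before the largest jump leaves this many clusters
suggested_k = len(last_merges) - int(np.argmax(gaps))

# Both libraries should yield the same partition for that number of clusters
labels_scipy = fcluster(Z, t=suggested_k, criterion="maxclust")
labels_sklearn = AgglomerativeClustering(n_clusters=suggested_k, linkage="ward").fit_predict(X)
agreement = adjusted_rand_score(labels_scipy, labels_sklearn)

print("Suggested number of clusters:", suggested_k)
print("Agreement (adjusted Rand index):", agreement)
```

The same linkage matrix can be passed to `scipy.cluster.hierarchy.dendrogram` to inspect the tree visually.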

### Visualization and Interpretation of results

In [None]:
analysis = """
Cluster 0:
- Average income: $75,886.80
- Low number of kids at home and teens at home
- Average recency (time since last purchase)
- Relatively higher spending on wines, meat products, and fish products
- Moderate spending on fruits, sweet products, and gold products
- Higher acceptance rates for campaign 3, campaign 4, and campaign 5
- Older age group with an average age of 54.90
- Mostly have a graduation level of education
- Mostly married individuals

Cluster 1:
- Average income: $34,783.23
- High number of kids at home and teens at home
- Average recency (time since last purchase)
- Lower spending on wines, fruits, meat products, fish products, sweet products, and gold products
- Low acceptance rates for campaign 1 and campaign 2
- Higher complaint rates
- Average age of 51.72
- Mostly have a graduation level of education
- Mostly married individuals

Cluster 2:
- Average income: $56,000.80
- Moderate number of kids at home and high number of teens at home
- Low recency (recent purchases)
- Higher spending on wines, meat products, fish products, sweet products, and gold products
- Relatively lower acceptance rates for campaign 1, campaign 2, and campaign 5
- Lower complaint rates
- Average age of 57.02
- Mostly have a graduation level of education
- Mostly married individuals

In conclusion, the three clusters represent distinct customer segments based on their income levels, family size, recency of purchases, spending patterns, responsiveness to marketing campaigns, and demographic characteristics. Cluster 0 represents affluent and older customers who are more responsive to campaigns. Cluster 1 consists of customers with lower income levels, larger family sizes, and lower spending. Cluster 2 represents customers with moderate income levels, moderate family sizes, and higher spending on various product categories. These insights can be valuable for targeted marketing strategies and personalized customer engagement based on specific cluster characteristics.
"""

print(analysis)

In [None]:
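# Box plot of each numeric feature against the cluster labels
# (categorical columns and the label columns themselves are skipped)
numeric_features = df.select_dtypes(include=[np.number]).columns.drop(
    ['cluster_labels', 'Cluster'], errors='ignore')
for col in numeric_features:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x='cluster_labels', y=col, data=df)
    plt.xlabel('Cluster Labels')
    plt.ylabel(col)
    plt.title(f"Box Plot: {col} vs Cluster Labels")
    plt.show()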
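
Box plots suit numeric features. For categorical features such as 'Education' or 'Marital_Status', a row-normalised cross-tabulation gives the matching bivariate view. The sketch below uses a small hypothetical frame named `toy` with made-up labels, not the clustering results above.

```python
import pandas as pd

# Hypothetical cluster labels and marital status for 8 customers
toy = pd.DataFrame({
    "cluster_labels": [0, 0, 0, 1, 1, 2, 2, 2],
    "Marital_Status": ["relationship", "relationship", "Single", "Single",
                       "Single", "relationship", "Single", "relationship"],
})

# Share of each marital status within each cluster (every row sums to 1)
share = pd.crosstab(toy["cluster_labels"], toy["Marital_Status"], normalize="index")

print(share)
```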

-----
## Happy Learning
-----