# CODECHEF-VIT RECRUITMENTS 2025

# **Step 0: Copy this notebook**

Guidelines:

*   Make a copy of this notebook in your Google Drive
*   Submit the edited Colab notebook as your final submission



# **Step 1: Dataset for this task**

Guidelines: Download the dataset from the link provided and import it into your notebook

Link: https://drive.google.com/file/d/1F65Br7-pkcTZ05d9JG_-qdsKCd_WOyKo/view

# **Step 2: Import necessary libraries**

Guidelines: Import all required libraries here

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# **Step 3: Import Dataset**

Guidelines: Import the csv dataset into a dataframe

In [None]:
df = pd.read_csv('customer_data.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()  # summary statistics for all numeric columns

# Step 4: Data Cleaning

Guidelines: Prepare the data for analysis

In [None]:
df.isnull().sum() # there are no null values to be removed

In [None]:
df.dropna(inplace=True) #deleting rows with null values

In [None]:
df.duplicated().sum() #there are no duplicate values in the dataset

In [None]:
df = df.drop_duplicates() #dropping duplicated values

In [None]:
df.shape

In [None]:
education_mapping = {'HighSchool': 0, 'College': 1, 'Bachelor': 2, 'Masters': 3}
loyalty_mapping = {'Regular': 0, 'Silver': 1, 'Gold': 2}
purchase_frequency_mapping = {'rare': 0, 'occasional': 1, 'frequent': 2}

# Apply manual encoding
df['education'] = df['education'].map(education_mapping)
df['loyalty_status'] = df['loyalty_status'].map(loyalty_mapping)
df['purchase_frequency'] = df['purchase_frequency'].map(purchase_frequency_mapping)

# Apply Label Encoding to nominal categories
label_encoders = {}
nominal_features = ['region', 'product_category', 'gender']

for col in nominal_features:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])  # Transform categorical values into numbers
    label_encoders[col] = le  # Save the encoder for future use
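
`LabelEncoder` assigns arbitrary integers to nominal categories, and distance-based methods or correlation coefficients can misread those integers as an ordering. One common alternative is one-hot encoding. The following is a minimal sketch on a small made-up frame; the rows and category values are invented for illustration and are not taken from `customer_data.csv`.

```python
import pandas as pd

# Toy stand-in for the nominal columns; all values here are invented.
toy = pd.DataFrame({
    "region": ["North", "South", "East", "North"],
    "gender": ["Male", "Female", "Female", "Male"],
    "income": [30000, 52000, 41000, 38000],
})

# One 0/1 indicator column per category; numeric columns pass through unchanged.
toy_encoded = pd.get_dummies(toy, columns=["region", "gender"], dtype=int)
print(toy_encoded.columns.tolist())
```

The trade-off is more columns in exchange for not implying that, say, one region is "greater than" another.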


In [None]:
print("Education Mapping:", education_mapping)
print("Loyalty Status Mapping:", loyalty_mapping)
print("Purchase Frequency Mapping:", purchase_frequency_mapping)

for col, le in label_encoders.items():
    print(f"{col} Mapping:")
    for i, class_ in enumerate(le.classes_):
        print(f"  {class_} -> {i}")
    print()

In [None]:
df.head()

# Step 5: Analysis

Guidelines: Perform your analysis here

In [None]:
sns.set(style="whitegrid")

Age Distribution

In [None]:
plt.figure(figsize=(6,4))
sns.histplot(df['age'], bins=20, kde=True, color='blue')
plt.title("Age Distribution")
plt.show()

Income Distribution

In [None]:
plt.figure(figsize=(6, 4))
sns.histplot(df['income'], bins=20, kde=True, color='green')
plt.title("Income Distribution")
plt.show()

Purchase Amount Distribution

In [None]:
plt.figure(figsize=(6, 4))
sns.histplot(df['purchase_amount'], bins=20, kde=True, color='purple')
plt.title("Purchase Amount Distribution")
plt.show()

Box plot for income

In [None]:
plt.figure(figsize=(6, 4))
sns.boxplot(x=df['income'], color='red')
plt.title("Income Box Plot")
plt.show()

In [None]:
plt.figure(figsize=(6, 4))
sns.boxplot(x=df['purchase_amount'], color='yellow')
plt.title("Purchase Amount Box Plot")
plt.show()

In [None]:
x = df[['income', 'purchase_amount']]

In [None]:
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

In [None]:
inertia = []
k_range = range(1, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(x_scaled)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Curve
plt.figure(figsize=(8, 5))
plt.plot(k_range, inertia, marker='o', linestyle='--', color='blue')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method for Optimal k')
plt.show()
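
The elbow in an inertia curve can be ambiguous, so a second criterion is often useful. The sketch below computes the silhouette score for several values of k. It runs on synthetic blobs from `make_blobs` as a stand-in for `x_scaled`; the data, the range of k, and the names `demo_points`, `silhouette_by_k`, and `best_k` are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with three well-separated groups (stand-in for x_scaled).
demo_points, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.8,
    random_state=42,
)

silhouette_by_k = {}
for k in range(2, 7):  # the silhouette score is undefined for k = 1
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(demo_points)
    silhouette_by_k[k] = silhouette_score(demo_points, labels)

# Higher is better; pick the k with the largest average silhouette.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(silhouette_by_k, best_k)
```

To apply this to the notebook's data, `demo_points` would be replaced by `x_scaled`.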

In [None]:
optimal_k = 3
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df['Cluster'] = kmeans.fit_predict(x_scaled)

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df['income'], y=df['purchase_amount'], hue=df['Cluster'], palette='viridis', s=100)
plt.title("Customer Segmentation Based on Spending Habits")
plt.xlabel("Income")
plt.ylabel("Purchase Amount")
plt.legend(title="Cluster")
plt.show()

In [None]:
cluster_summary = df.groupby("Cluster")[["income", "purchase_amount"]].mean()
print(cluster_summary)
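
KMeans cluster IDs are arbitrary, so "Cluster 0" is not guaranteed to be the lowest-spending group. A minimal sketch of relabeling clusters by ascending mean `purchase_amount` follows; the toy frame and the `Cluster_ranked` column name are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for df after clustering; all numbers are invented.
toy = pd.DataFrame({
    "income": [90000, 30000, 60000, 95000, 28000, 62000],
    "purchase_amount": [900, 200, 500, 950, 180, 520],
    "Cluster": [0, 2, 1, 0, 2, 1],  # raw KMeans labels, in no particular order
})

# Rank clusters by mean purchase_amount, then map old label -> rank (0 = lowest).
order = toy.groupby("Cluster")["purchase_amount"].mean().sort_values().index
relabel = {old: new for new, old in enumerate(order)}
toy["Cluster_ranked"] = toy["Cluster"].map(relabel)

print(toy.groupby("Cluster_ranked")[["income", "purchase_amount"]].mean())
```

After relabeling, 0, 1, and 2 can be read as low, mid, and high spenders regardless of how KMeans numbered them.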

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x=df['product_category'], order=df['product_category'].value_counts().index, palette="coolwarm")
plt.title("Product Category Distribution")
plt.xlabel("Product Category")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(7, 7))
df['product_category'].value_counts().plot.pie(autopct='%1.1f%%', colors=sns.color_palette('coolwarm'))
plt.title("Product Category Distribution")
plt.ylabel("")
plt.show()

In [None]:
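# Pairwise Pearson correlations of the numeric columns (including encoded categoricals)
correlation_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)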
plt.title("Correlation Heatmap (With Encoded Categorical Data)")
plt.show()

# Step 6: Results and Inferences

Guidelines: List out your inferences here

**Heatmap Inference**

*   Strong positive correlation (0.95) between income and purchase_amount
*   Moderate positive correlation (0.46) between income and Cluster
*   No significant relationship between satisfaction_score and any other feature
*   Similarly, loyalty_status has only weak correlations, and gender shows no correlation with the other features

**Histograms**


*   The age distribution is approximately normal with a slight skew
*   The income distribution is right-skewed: most customers fall in the mid-income range, while a few have very high incomes, and these few high-value customers contribute a significant share of total revenue
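
The skew read off the histograms can be checked numerically with `DataFrame.skew()`, where values near 0 suggest symmetry and positive values indicate a right tail. The sketch below uses synthetic columns (a normal "age" and a log-normal "income"), which are assumptions standing in for the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns: roughly symmetric "age", right-skewed (log-normal) "income".
toy = pd.DataFrame({
    "age": rng.normal(loc=40, scale=10, size=1000),
    "income": rng.lognormal(mean=10.5, sigma=0.6, size=1000),
})

# Sample skewness per column: near 0 for "age", clearly positive for "income".
skewness = toy.skew()
print(skewness)
```

On the notebook's data, the equivalent call would be `df[['age', 'income', 'purchase_amount']].skew()`.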

**Box Plot**


*   Income vs. Purchase Amount: A few customers spend much more than average.
*   Purchase Frequency vs. Purchase Amount: Frequent buyers tend to spend more.

*   There are clear outliers in spending (likely luxury shoppers).
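
The points a box plot draws beyond its whiskers follow the 1.5 × IQR rule, so the same outliers can be listed directly. This is a minimal sketch; the helper name `iqr_outliers` and the toy purchase amounts are hypothetical.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, whisker: float = 1.5) -> pd.Series:
    """Return the entries lying more than whisker * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return values[(values < lower) | (values > upper)]

# Invented purchase amounts with one very large value.
toy_purchases = pd.Series([120, 135, 150, 160, 170, 180, 195, 210, 950])
flagged = iqr_outliers(toy_purchases)
print(flagged.tolist())
```

Calling `iqr_outliers(df['purchase_amount'])` would list the customers behind the outlier points in the notebook's box plot.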

**Customer Segmentation (Clustering using KMeans)**

*   Cluster 0: Low spenders (probably budget conscious)
*   Cluster 1: Mid-level spenders (occasional buyers)
*   Cluster 2: High-value customers (loyal, frequent buyers)
*   The majority of customers fall into the mid-level spender cluster.
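
A claim about which cluster holds the majority can be backed by the share of customers per cluster. The sketch below uses invented cluster assignments as a stand-in for `df['Cluster']`.

```python
import pandas as pd

# Invented cluster assignments standing in for df['Cluster'].
toy_clusters = pd.Series([1, 1, 0, 1, 2, 1, 0, 1, 2, 1], name="Cluster")

# Fraction of customers in each cluster, ordered by cluster label.
cluster_share = toy_clusters.value_counts(normalize=True).sort_index()
largest = cluster_share.idxmax()

print(cluster_share)
print("Largest cluster:", largest)
```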



**Product Category Insights (Bar Chart/Pie Chart)**


*   Top Categories: Clothing, Electronics and Food
*   Least Popular: Beauty and Books.















