# Smart Segmentation - Customer Personas with AI

This notebook contains the solution for the Day 5 Assignment.

**Objectives:**
1.  **Explore Gender vs. Spending Score:** Analyze the relationship between 'Gender' and 'Spending Score (1-100)'.
2.  **Apply Feature Engineering for Clustering:** Create a new feature, determine the optimal number of clusters, and visualize the resulting segments.

## 1. Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Set plot style
sns.set_style('whitegrid')

In [None]:
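# Clone the dataset repo if it is not already present (requires internet access).
# A failing `!` shell command does not raise a Python exception, so try/except
# cannot detect it; check for the target directory instead.
import os
if not os.path.isdir('Datasets'):
    !git clone "https://github.com/HarshvardhanSingh-13/Datasets"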

# Load the dataset
# Adjust path if necessary based on your local environment setup
try:
    # Standard path if cloned in current dir
    df = pd.read_csv('Datasets/Mall Dataset/Mall_Customers.csv')
except FileNotFoundError:
    # Fallback for Colab/specific structure
    try:
        df = pd.read_csv('/content/Datasets/Mall Dataset/Mall_Customers.csv')
    except FileNotFoundError:
        print("Error: Dataset not found. Please ensure the dataset is downloaded.")

if 'df' in locals():
    print("Data Loaded Successfully")
    display(df.head())

## 2. Explore Gender vs. Spending Score

Goal: Analyze if gender plays a significant role in spending habits.

In [None]:
if 'df' in locals():
    # Summary Statistics by Gender
    gender_spending = df.groupby('Gender')['Spending Score (1-100)'].describe()
    display(gender_spending)

    # Visualization
    plt.figure(figsize=(10, 6))
    sns.boxplot(x='Gender', y='Spending Score (1-100)', hue='Gender', data=df, palette='Set2', dodge=False)
    plt.title('Spending Score Distribution by Gender')
    plt.show()

    plt.figure(figsize=(10, 6))
    sns.violinplot(x='Gender', y='Spending Score (1-100)', hue='Gender', data=df, palette='Set2', inner='quartile', dodge=False)
    plt.title('Spending Score Density by Gender')
    plt.show()

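**Observation:**
Compare the per-gender means, medians, and quartiles in the summary table with the shapes in the two plots. In this dataset the spending score distributions for males and females largely overlap; females may show a slightly higher mean or a somewhat different spread, but check this against the output above rather than assuming it.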
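
To go beyond eyeballing the plots, a two-sample test can quantify the difference between the groups. The sketch below runs Welch's t-test via `scipy.stats.ttest_ind` on a small synthetic frame that reuses the dataset's column names. The helper name `compare_spending_by_gender` and the generated numbers are illustrative assumptions, not results from the mall data.

```python
import numpy as np
import pandas as pd
from scipy import stats


def compare_spending_by_gender(data, group_col="Gender",
                               value_col="Spending Score (1-100)"):
    """Welch's t-test between the two groups in `group_col` (hypothetical helper)."""
    groups = {name: g[value_col].to_numpy() for name, g in data.groupby(group_col)}
    if len(groups) != 2:
        raise ValueError("expected exactly two groups")
    (name_a, a), (name_b, b) = groups.items()
    # equal_var=False selects Welch's variant, which does not assume equal variances
    result = stats.ttest_ind(a, b, equal_var=False)
    return {
        "groups": (name_a, name_b),
        "mean_diff": float(a.mean() - b.mean()),
        "t_stat": float(result.statistic),
        "p_value": float(result.pvalue),
    }


# Synthetic stand-in data (NOT the mall dataset), only to make the sketch runnable
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Gender": ["Female"] * 60 + ["Male"] * 40,
    "Spending Score (1-100)": rng.integers(1, 101, size=100),
})
summary = compare_spending_by_gender(toy)
print(summary)
```

On the real data the call would be `compare_spending_by_gender(df)`. A large p-value would be consistent with the observation that gender is not a strong differentiator, although it does not prove that no difference exists.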

---

## 3. Apply Feature Engineering for Clustering

**Step 1: Feature Engineering**
Hypothesis: high income does not always mean high spending, so a feature that combines the two may separate customers better than either one alone.

We create an interaction term, `Affluence_Score` = `Annual Income (k$)` * `Spending Score (1-100)`, as a rough proxy for total customer value. `Annual Income (k$)` and `Spending Score (1-100)` are typically the most useful clustering features in this dataset, so we keep both and add the new feature alongside them. Alternatives such as a spending-to-income ratio (closer to "willingness" to spend relative to earnings), age groups, or normalized income could be engineered in the same way.

In [None]:
if 'df' in locals():
    # Feature Engineering choice: Interaction between Income and Score
    # Let's create an 'Affluence Score' = Annual Income * Spending Score
    # This amplifies high income + high spenders vs low income + low spenders
    df['Affluence_Score'] = df['Annual Income (k$)'] * df['Spending Score (1-100)']

    # Preparing data for clustering
    # We will use Annual Income, Spending Score, and our new Affluence Score
    X = df[['Annual Income (k$)', 'Spending Score (1-100)', 'Affluence_Score']]
    
    # Scaling is crucial because Affluence_Score will be much larger in magnitude
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    print("Feature Engineering Complete.")
    print(X.head())
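
Why scaling matters here can be seen by comparing column spreads before and after `StandardScaler`. This is a minimal sketch on synthetic income and score values (made up for illustration, not taken from the dataset); the names `features`, `raw_std` and `scaled_std` are assumptions of the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic income (k$) and spending scores, for illustration only
rng = np.random.default_rng(42)
income = rng.uniform(15, 140, size=200)
score = rng.uniform(1, 100, size=200)

# Columns: income, score, and their product (the interaction term)
features = np.column_stack([income, score, income * score])

raw_std = features.std(axis=0)
scaled = StandardScaler().fit_transform(features)
scaled_std = scaled.std(axis=0)

print("std before scaling:", np.round(raw_std, 1))
print("std after scaling: ", np.round(scaled_std, 3))
```

Without scaling, the product column would dominate the Euclidean distances that K-Means minimizes; after scaling, every column has unit variance and contributes on the same footing.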

**Step 2: Determine Optimal Number of Clusters (Elbow Method)**

In [None]:
if 'df' in locals():
    wcss = []
    K_range = range(1, 11)

    for k in K_range:
        kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
        kmeans.fit(X_scaled)
        wcss.append(kmeans.inertia_)

    plt.figure(figsize=(10, 5))
    plt.plot(K_range, wcss, marker='o', linestyle='--')
    plt.title('Elbow Method')
    plt.xlabel('Number of Clusters')
    plt.ylabel('WCSS')
    plt.show()
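
The elbow can be ambiguous, so a second criterion is often useful. A minimal sketch, assuming synthetic blobs in place of `X_scaled`: it computes the mean silhouette score for each k with `sklearn.metrics.silhouette_score` and picks the k with the highest score. The names `points_scaled`, `silhouette_by_k` and `best_k` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Four well-separated synthetic blobs stand in for X_scaled (not the mall data)
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
points, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                       random_state=0)
points_scaled = StandardScaler().fit_transform(points)

silhouette_by_k = {}
for k in range(2, 9):  # the silhouette score needs at least two clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points_scaled)
    silhouette_by_k[k] = silhouette_score(points_scaled, labels)

# The k with the highest mean silhouette is the candidate choice
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for k, s in silhouette_by_k.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k by silhouette:", best_k)
```

In the notebook, the same loop would run on `X_scaled`. If the silhouette peak and the elbow disagree, that is a sign the cluster structure is not clear-cut and k should be chosen with the business use in mind.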

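**Step 3: Clustering and Visualization**
We pick k at the point where the elbow plot flattens. For the income and spending score features of this dataset that is usually k=5; with the extra engineered feature the bend may shift, so confirm it on the plot above before fitting the model.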

In [None]:
if 'df' in locals():
    # k=5 is the usual elbow for this dataset; adjust if your elbow plot suggests otherwise
    optimal_k = 5
    kmeans = KMeans(n_clusters=optimal_k, init='k-means++', n_init=10, random_state=42)
    clusters = kmeans.fit_predict(X_scaled)
    
    df['Cluster'] = clusters

    # Visualize - We map back to original features for interpretability
    plt.figure(figsize=(12, 8))
    sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', hue='Cluster', 
                    data=df, palette='viridis', s=100, alpha=0.8)
    
    plt.title(f'Customer Segments (k={optimal_k}) with Engineered Feature Influence')
    plt.xlabel('Annual Income (k$)')
    plt.ylabel('Spending Score (1-100)')
    plt.legend(title='Cluster')
    plt.show()

## 4. Conclusion and Overall Insights

**Summary of Findings:**
*   **Gender Analysis:** Gender alone does not appear to be a strong differentiator of spending habits in this dataset. The spending score distributions for males and females largely overlap.
*   **Clustering:** Clustering on Income, Spending Score, and the engineered **Affluence_Score** (Income * Spending Score) separates customers into distinct segments.
*   **Segments Identified (typical clusters for this dataset; verify against the per-cluster means of your own run):**
    1.  **Low Income, Low Spending:** Price-sensitive; likely to need discounts to convert.
    2.  **Low Income, High Spending:** Possibly young, aspirational shoppers. Good targets for buy-now-pay-later schemes.
    3.  **High Income, Low Spending:** "Frugal" wealthy customers. Unlikely to be swayed by standard promotions.
    4.  **High Income, High Spending:** The VIP customers. Best served with exclusive access and luxury branding.
    5.  **Average Income, Average Spending:** The large middle group. Standard mass-market campaigns work here.
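
The segment labels above can be checked against per-cluster averages rather than read off the scatter plot. A minimal sketch, assuming a tiny hand-made frame that reuses the notebook's column names (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Hand-made rows with the notebook's column names (illustrative values only)
toy = pd.DataFrame({
    "Annual Income (k$)": [20, 25, 22, 85, 90, 88, 55, 60],
    "Spending Score (1-100)": [15, 80, 75, 10, 90, 85, 50, 45],
    "Cluster": [0, 1, 1, 2, 3, 3, 4, 4],
})

cols = ["Annual Income (k$)", "Spending Score (1-100)"]

# Mean income and spending score per cluster, plus the cluster size
profile = toy.groupby("Cluster")[cols].mean().round(1)
profile["n_customers"] = toy.groupby("Cluster").size()
print(profile)
```

In the notebook, replacing `toy` with `df` after the clustering cell gives one row per cluster; a cluster earns a label such as "High Income, Low Spending" only if its means actually sit in that corner.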

**Feature Engineering Impact:**
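Creating an interaction term emphasizes the extremes (high income with high spending versus low income with low spending), which can make the segments easier to interpret in terms of total customer value. Because `Affluence_Score` is derived entirely from the other two features, it adds no new information and mainly reweights the distances K-Means uses, so whether it makes the clusters more robust should be checked rather than assumed.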
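
One way to gauge the impact of the engineered feature is to cluster with and without it and compare the two label assignments. The sketch below does this on synthetic income and score values using `sklearn.metrics.adjusted_rand_score`; the helper `cluster_labels` and the generated data are assumptions of the example, not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic income/score pairs (illustrative, not the mall dataset)
rng = np.random.default_rng(7)
income = rng.uniform(15, 140, size=200)
score = rng.uniform(1, 100, size=200)

base = np.column_stack([income, score])
with_interaction = np.column_stack([income, score, income * score])


def cluster_labels(features, k=5):
    """Scale the features and return K-Means labels (hypothetical helper)."""
    scaled = StandardScaler().fit_transform(features)
    return KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(scaled)


labels_base = cluster_labels(base)
labels_inter = cluster_labels(with_interaction)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
ari = adjusted_rand_score(labels_base, labels_inter)
print(f"Adjusted Rand index between the two clusterings: {ari:.3f}")
```

A score close to 1 would suggest the interaction term changes little, while a lower score means it reshapes the segments and the two sets of clusters deserve a side-by-side look.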