# Mall Customer Segmentation Analysis

## Introduction

As a data science student, I aim to apply unsupervised learning to segment mall customers based on their demographic and behavioral data. This project uses the Mall Customer Segmentation Data from Kaggle to identify distinct customer groups, which can help mall management tailor marketing strategies. My approach includes exploratory data analysis (EDA), clustering with unsupervised learning models, and hyperparameter optimization to ensure robust results. This notebook documents my process, from data collection to model evaluation, addressing the project rubric's requirements.

### Step 1: Data Gathering and Provenance

#### Explanation

I selected the "Mall Customer Segmentation Data" dataset, which is publicly available on Kaggle (https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python) and well suited to clustering tasks. It contains 200 records with five features: CustomerID, Gender, Age, Annual Income (in thousands of dollars), and Spending Score (1-100). The data is clean, with no missing values. I downloaded the CSV file and placed it in my working directory. My goal is to segment customers without predefined labels, which makes this an unsupervised learning problem.


In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('Mall_Customers.csv')

# Display the first few rows and basic information
print(data.head())
data.info()  # info() prints its report itself and returns None, so no print() is needed
print(data.describe())

#### Analysis/Interpretation of Output

The output from `data.head()` shows the first five rows of the dataset, confirming the presence of five columns: CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100). The `data.info()` output verifies that there are 200 entries with no missing values, as all columns have 200 non-null entries. The data types are appropriate: `CustomerID`, `Age`, `Annual Income (k$)`, and `Spending Score (1-100)` are integers, while `Gender` is an object (string). The `data.describe()` output provides summary statistics, revealing that Age ranges from 18 to 70 (mean ~39), Annual Income from 15k to 137k (mean ~60.56k), and Spending Score from 1 to 99 (mean ~50.2). The standard deviations (e.g., ~26.26k for income, ~25.82 for spending score) suggest moderate variability, but the maximum income (137k) is notably higher than the 75th percentile (78k), hinting at potential outliers to explore later. This confirms the dataset is clean and suitable for clustering, with numerical features ready for scaling due to their differing ranges.
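The checks I read off the printed output above can also be written as explicit assertions. The following is a minimal sketch, not part of the original notebook: `validate_customers` and `EXPECTED_COLUMNS` are hypothetical names, and the toy rows are made-up values that only mimic the dataset's column layout.

```python
import pandas as pd

# Column names as described for the raw CSV (assumed to match the file exactly).
EXPECTED_COLUMNS = [
    "CustomerID",
    "Gender",
    "Age",
    "Annual Income (k$)",
    "Spending Score (1-100)",
]


def validate_customers(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        # Without the expected columns the remaining checks cannot run.
        problems.append(f"missing columns: {missing}")
        return problems
    if df.isnull().any().any():
        problems.append("null values present")
    if df["CustomerID"].duplicated().any():
        problems.append("duplicate CustomerID values")
    if not df["Spending Score (1-100)"].between(1, 100).all():
        problems.append("Spending Score outside 1-100")
    return problems


# Toy rows (illustrative values, not taken from the dataset).
good = pd.DataFrame(
    [[1, "Male", 19, 15, 39], [2, "Female", 35, 54, 81]],
    columns=EXPECTED_COLUMNS,
)
bad = good.copy()
bad.loc[1, "Spending Score (1-100)"] = 120  # deliberately out of range

good_problems = validate_customers(good)
bad_problems = validate_customers(bad)
print(good_problems)
print(bad_problems)
```

In the notebook, the same function could be called on `data` right after loading, before the columns are renamed.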


### Step 2: Identifying the Unsupervised Learning Problem

#### Explanation

The problem is to segment mall customers into meaningful groups based on their Age, Annual Income, and Spending Score, using unsupervised learning. Since the dataset has no target variable, clustering is appropriate. I hypothesize that customers can be grouped by spending behavior and demographics, which can inform targeted marketing. I'll use K-means clustering as my primary unsupervised model due to its simplicity and effectiveness for numerical data, and DBSCAN to explore density-based clustering, comparing their performance to meet the rubric's requirement for multiple models.
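To make the difference between the two models concrete, here is a minimal sketch on synthetic data (three well-separated blobs from `make_blobs`, not the mall data). The values `n_clusters=3`, `eps=0.5`, and `min_samples=5` are assumptions chosen for this toy example and are not tuned for the real dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features (NOT the mall data).
X, _ = make_blobs(
    n_samples=200,
    centers=[[-5, -5], [0, 5], [6, -2]],
    cluster_std=0.6,
    random_state=42,
)
X_scaled = StandardScaler().fit_transform(X)

# K-means needs the number of clusters up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# DBSCAN infers the number of clusters from density; the label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

kmeans_silhouette = silhouette_score(X_scaled, kmeans_labels)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})
n_noise = int(np.sum(dbscan_labels == -1))

print(f"K-means silhouette: {kmeans_silhouette:.3f}")
print(f"DBSCAN clusters: {n_dbscan_clusters}, noise points: {n_noise}")
```

The silhouette score is one common way to compare clusterings; for DBSCAN it would need to be computed on the non-noise points only, which this sketch leaves out.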

### Step 3: Exploratory Data Analysis (EDA) - Initial Inspection and Cleaning

#### Explanation

I begin EDA by inspecting the dataset for missing values, data types, and basic statistics. Since the dataset is reportedly clean, I expect no missing values but will verify. I'll rename columns for clarity (e.g., "Annual Income (k$)" to "Annual_Income") and check for outliers using summary statistics. This step ensures the data is ready for visualization and modeling, addressing the rubric's 26-point EDA requirement.

In [None]:
# Check for missing values
print("Missing values:\n", data.isnull().sum())

# Rename columns for clarity
data.columns = ['CustomerID', 'Gender', 'Age', 'Annual_Income', 'Spending_Score']

# Verify data types and summary statistics
print("\nData Types:\n", data.dtypes)
print("\nSummary Statistics:\n", data.describe())

#### Analysis/Interpretation of Output

The `isnull().sum()` output confirms no missing values across all columns, aligning with the dataset's reported cleanliness and eliminating the need for imputation. The `dtypes` output shows that after renaming columns, `CustomerID`, `Age`, `Annual_Income`, and `Spending_Score` are `int64`, and `Gender` is `object`, which is expected. This supports using numerical features directly for clustering and encoding `Gender` if included. The `describe()` output mirrors the earlier summary statistics, showing Age (18-70, mean ~38.85), Annual_Income (15-137k, mean ~60.56k), and Spending_Score (1-99, mean ~50.2). The wide range in Annual_Income (std ~26.26k) and the high maximum (137k) suggest potential outliers, which I’ll investigate via visualization. The differing scales (e.g., Age vs. Annual_Income) confirm the need for standardization before clustering to ensure equal feature influence. No immediate anomalies are evident, so I can proceed to visualization for deeper insights.
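As a rough illustration of the standardization and optional Gender encoding mentioned above, the sketch below uses a small toy frame with the renamed columns (the values are made up, not taken from the dataset). The column name `Gender_Female` and the choice of which category maps to 1 are my own assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows using the renamed columns (illustrative values, not from the dataset).
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Female", "Male"],
        "Age": [19, 35, 52, 64],
        "Annual_Income": [15, 54, 78, 137],
        "Spending_Score": [39, 81, 6, 18],
    }
)

numeric_cols = ["Age", "Annual_Income", "Spending_Score"]

# Standardize each numeric feature to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy[numeric_cols]),
    columns=numeric_cols,
    index=toy.index,
)

# Optional binary encoding for Gender (the 0/1 assignment is arbitrary).
scaled["Gender_Female"] = (toy["Gender"] == "Female").astype(int)

print(scaled.round(2))
```

After scaling, a one-unit difference means one standard deviation in every numeric column, so no single feature dominates the Euclidean distances used by K-means.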

### Step 4: EDA - Visualizing Data Distributions

#### Explanation

To understand feature distributions, I'll create histograms for Age, Annual Income, and Spending Score, and a count plot for Gender. Box plots will help identify outliers. These visualizations reveal the spread and skewness of features, guiding preprocessing decisions (e.g., scaling). Visualizing Gender distribution ensures I account for categorical variables, potentially encoding them for clustering.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set plot style
sns.set(style="whitegrid")

# Histograms for numerical features
plt.figure(figsize=(15, 5))
for i, feature in enumerate(['Age', 'Annual_Income', 'Spending_Score'], 1):
    plt.subplot(1, 3, i)
    sns.histplot(data[feature], kde=True)  # histplot replaces the deprecated distplot
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

# Count plot for Gender
plt.figure(figsize=(6, 4))
sns.countplot(x='Gender', data=data)
plt.title('Gender Distribution')
plt.show()

# Box plots for numerical features
plt.figure(figsize=(15, 5))
for i, feature in enumerate(['Age', 'Annual_Income', 'Spending_Score'], 1):
    plt.subplot(1, 3, i)
    sns.boxplot(y=data[feature])
    plt.title(f'Box Plot of {feature}')
plt.tight_layout()
plt.show()

#### Analysis/Interpretation of Output

The histograms show the distributions of Age, Annual_Income, and Spending_Score. Age is slightly right-skewed, with a peak around 30-40 years, indicating that most customers are young to middle-aged. Annual_Income has a multimodal distribution, with peaks around 20-40k and 60-80k and a smaller group above 100k, suggesting diverse income groups and possible outliers at the high end (e.g., 137k). Spending_Score appears roughly uniform, with slight peaks at the extremes (low and high scores), indicating varied spending behaviors. The Gender count plot shows slightly more female customers than male customers, so the customer base is roughly balanced with a mild female skew.

The box plots confirm no extreme outliers for Age or Spending_Score, as no points lie beyond the whiskers. For Annual_Income, however, the highest values (~137k) lie above the upper whisker, indicating mild outliers. These findings suggest that scaling is necessary because the features have different ranges. The income outliers likely represent valid high-earning customers, so I'll retain them. I'll also consider encoding Gender for clustering to explore its impact.
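The whisker reading can be cross-checked numerically. This is a minimal sketch that assumes the box plots use the common 1.5 × IQR rule (seaborn's default `whis=1.5`); `iqr_outliers` is a hypothetical helper and the income values are toy numbers, not the dataset.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Toy incomes in k$ (illustrative values, not taken from the dataset).
toy_income = pd.Series(
    [15, 20, 28, 40, 48, 54, 60, 63, 71, 78, 87, 137],
    name="Annual_Income",
)

outliers = iqr_outliers(toy_income)
print(outliers.tolist())
```

In the notebook, calling `iqr_outliers(data['Annual_Income'])` would list exactly the points a default box plot draws beyond the whiskers.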

### Step 5: EDA - Correlation Analysis

#### Explanation

I'll compute and visualize correlations between numerical features (Age, Annual Income, Spending Score) using a heatmap. This helps identify relationships that may influence clustering (e.g., if high income correlates with high spending). K-means measures similarity with Euclidean distance, so strongly correlated features effectively count the same information twice; strong correlations may therefore call for feature selection or transformation. I'll also consider Gender's impact qualitatively.

In [None]:
# Compute correlation matrix
corr_matrix = data[['Age', 'Annual_Income', 'Spending_Score']].corr()

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()
print(corr_matrix)

#### Analysis/Interpretation of Output

The correlation heatmap and matrix show the relationships between Age, Annual_Income, and Spending_Score. The correlation between Age and Spending_Score is moderately negative (-0.327), suggesting that younger customers tend to have higher spending scores, which could influence cluster formation. The correlation between Annual_Income and Spending_Score is very weak (0.0099), indicating almost no linear relationship, which is surprising because I expected higher income to correlate with higher spending. Note that a near-zero Pearson correlation only rules out a linear relationship; non-linear structure, such as distinct income and spending groups, can still exist. The correlation between Age and Annual_Income is also negligible (-0.0124).

These weak correlations are favorable for clustering: K-means does not strictly require independent features, but highly correlated features would overweight the same information in its Euclidean distances, so weak correlations reduce the need for feature selection or transformation. The moderate Age-Spending_Score correlation does suggest that clusters may separate along age-related spending patterns. I'll proceed with all three numerical features for clustering, as they provide distinct information, and qualitatively assess Gender's role later.
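For reference, the coefficients in the matrix are Pearson correlations, which is what pandas computes by default in `.corr()`. The sketch below recomputes one from its definition on toy values (made up, not the dataset) and compares it with the pandas result; `pearson` is a hypothetical helper written for illustration.

```python
import numpy as np
import pandas as pd

# Toy values with a negative age/spending relationship (not taken from the dataset).
toy = pd.DataFrame(
    {
        "Age": [19, 21, 30, 35, 45, 52, 64],
        "Spending_Score": [81, 77, 72, 40, 35, 50, 14],
    }
)


def pearson(x, y) -> float:
    """Pearson r: sum of centered products divided by the product of centered norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))


manual_r = pearson(toy["Age"], toy["Spending_Score"])
pandas_r = toy["Age"].corr(toy["Spending_Score"])  # Pearson by default

print(f"manual: {manual_r:.4f}, pandas: {pandas_r:.4f}")
```

Because the formula centers and normalizes both variables, the coefficient is unaffected by the differing feature scales, unlike the distances used in clustering.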