# Hierarchical Clustering - Mall Customers Dataset




## 1. Introduction

In this project, we apply hierarchical clustering to a dataset of mall customers to segment them by annual income, spending score, and age. The goal is to find groups of similar customers that can inform decisions such as targeted marketing and personalized offers.

### Libraries Used:
- **pandas**: For data manipulation and cleaning.
- **numpy**: For numerical operations.
- **matplotlib** & **seaborn**: For data visualization and analysis.
- **sklearn** (scikit-learn): For feature scaling (`StandardScaler`) and agglomerative clustering (`AgglomerativeClustering`).
- **scipy**: For building the linkage matrix and plotting the dendrogram.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
```


## 2. Dataset Overview

The dataset consists of information about various customers of a mall, including their spending habits and demographic features. Below are the columns available in the dataset:

- **CustomerID**: Unique identifier for each customer.

- **Annual Income**: The yearly income of the customer.

- **Spending Score**: A score assigned to the customer based on their spending behavior.

- **Age**: The age of the customer.

We will cluster on Annual Income, Spending Score, and Age. CustomerID is only an identifier and carries no behavioral information, so it is excluded from the clustering.

```python
df = pd.read_csv('Mall_Customers.csv')  # Load the dataset
df.head()  # Preview the first five rows
```

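Some publicly distributed copies of this dataset may label the income and score columns with unit suffixes, for example `Annual Income (k$)` and `Spending Score (1-100)`. The sketch below shows one way to normalize such headers to the short names used in this write-up and to confirm that the expected columns are present. The sample rows and the `rename_map` mapping are hypothetical and only illustrate the idea; check the headers of your own file first.

```python
import pandas as pd

# Hypothetical rows standing in for a copy of the file with suffixed headers.
raw = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Age": [19, 21, 20],
    "Annual Income (k$)": [15, 15, 16],
    "Spending Score (1-100)": [39, 81, 6],
})

# Assumed mapping from suffixed headers to the short names used in this write-up.
rename_map = {
    "Annual Income (k$)": "Annual Income",
    "Spending Score (1-100)": "Spending Score",
}
clean = raw.rename(columns=rename_map)

# Report any expected column that is still missing after renaming.
expected = ["CustomerID", "Annual Income", "Spending Score", "Age"]
missing = [col for col in expected if col not in clean.columns]
print("Missing columns:", missing)
```

If `missing` is not empty, the later column selections in this write-up would raise a `KeyError`, so it is worth running a check like this right after loading.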

## 3. Data Exploration

Let's take a look at the structure of the data and perform some basic analysis.

### 3.1 Basic Analysis

With the first few records already previewed above, we now check the column data types and the summary statistics.

```python
df.info()  # Column names, non-null counts, and data types
df.describe()  # Summary statistics for the numeric columns
```




### 3.2 Check for Missing Values

It's important to check if there are any missing values in the dataset.

```python
df.isnull().sum()  # Number of missing values per column
```


### 3.3 Data Visualization

We can visualize the pairwise relationships between the features with a pair plot, which draws a scatter plot for every pair of features and the distribution of each feature on the diagonal.

```python
sns.pairplot(df[['Annual Income', 'Spending Score', 'Age']])
plt.show()
```


## 4. Data Preprocessing

Before clustering, we scale the data so that each feature contributes comparably to the distance computation. Without scaling, features with larger numeric ranges would dominate the Euclidean distances that Ward's linkage relies on.

### 4.1 Scaling the Data

We standardize each feature to zero mean and unit variance with `StandardScaler`.

```python
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['Annual Income', 'Spending Score', 'Age']])
```

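As a quick sanity check on what standardization does, the following sketch scales synthetic data and confirms that every column ends up with mean 0 and standard deviation 1. The data here is randomly generated as a stand-in for the three features and is not the mall dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three features (not the real data).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(60, 25, size=200),  # income-like values
    rng.normal(50, 25, size=200),  # score-like values
    rng.normal(40, 14, size=200),  # age-like values
])

X_scaled = StandardScaler().fit_transform(X)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for std().
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print("means:", np.round(col_means, 6))
print("stds: ", np.round(col_stds, 6))
```

After this transformation a one-unit difference means "one standard deviation" in every column, so no single feature dominates the distances.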

## 5. Hierarchical Clustering

### 5.1 Dendrogram

We plot the dendrogram to help choose the number of clusters. The height of each horizontal link is the distance at which two clusters merge, so a long vertical stretch with no merges suggests a natural place to cut the tree.


```python
linked = linkage(df_scaled, method='ward')  # Hierarchical clustering using Ward's method
plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.title('Dendrogram (Ward linkage)')
plt.xlabel('Customers')
plt.ylabel('Merge distance')
plt.show()
```

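Reading the cut height off the plot can be complemented by inspecting the linkage matrix directly. The sketch below, run on synthetic blobs rather than the mall data, looks for the largest jump between consecutive merge distances and cuts the tree there with SciPy's `fcluster`. The "largest gap" rule is one common heuristic and is an assumption here, not the method this project prescribes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three well-separated synthetic blobs of 30 points each (illustrative data only).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

Z = linkage(X, method="ward")

# Column 2 of the linkage matrix holds the merge distances, in increasing order.
merge_dist = Z[:, 2]
gaps = np.diff(merge_dist)

# Cutting inside the largest gap keeps every merge up to and including index idx.
# After (idx + 1) merges, n - (idx + 1) clusters remain.
idx = int(np.argmax(gaps))
k = len(X) - (idx + 1)

# Flat cluster labels for a cut that yields at most k clusters.
labels = fcluster(Z, t=k, criterion="maxclust")
print("suggested number of clusters:", k)
print("cluster sizes:", np.bincount(labels)[1:])
```

On real data the gaps are rarely this clear-cut, so the suggested value is best treated as a starting point to compare against the dendrogram.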

### 5.2 Create the Clusters

We choose the number of clusters by deciding where to cut the dendrogram. As an example, we assume a cut that yields 5 clusters.


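```python
# Ward linkage always uses Euclidean distance, so no distance argument is needed.
# The `affinity` parameter was deprecated in scikit-learn 1.2 and later removed in favor of `metric`.
hierarchical = AgglomerativeClustering(n_clusters=5, linkage='ward')
df['Cluster'] = hierarchical.fit_predict(df_scaled)
```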


### 5.3 Visualize the Clusters

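We visualize the clusters with a scatter plot of Annual Income against Spending Score, colored by cluster label. Age also took part in the clustering but is not shown in this two-dimensional view, so some clusters may appear to overlap.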

```python
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Annual Income', y='Spending Score', hue='Cluster', palette='Set1')
plt.title('Hierarchical Clustering of Mall Customers')
plt.show()
```


## 6. Model Evaluation

Because the dataset has no ground-truth labels, we cannot compute accuracy for the clustering. Quality is instead judged by how compact and well separated the clusters are, either visually or with internal indices such as the silhouette coefficient. Here we focus on the cluster size distribution and visual separability.
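
One way to make cluster separation less subjective is an internal index. The sketch below computes scikit-learn's `silhouette_score` for several candidate cluster counts on synthetic blobs. The data and the range of `k` values are assumptions for illustration, and this check is not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs of 40 points each (illustrative data only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0], [8.0, -8.0]])
X = np.vstack([c + rng.normal(scale=0.6, size=(40, 2)) for c in centers])

# Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
scores = {}
for k in range(2, 8):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

The same loop could be run on the scaled customer features to check whether the chosen number of clusters is supported by the data.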

### 6.1 Cluster Distribution

We can check the distribution of customers in each cluster.

```python
df['Cluster'].value_counts()  # Number of customers per cluster
```

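Cluster sizes alone do not say what distinguishes the segments. A per-cluster profile of feature means is a common next step, sketched below on a small hypothetical frame. The values and the `df_demo` name are made up for illustration and are not results from the mall dataset.

```python
import pandas as pd

# Hypothetical toy data with two obvious segments (not real results).
df_demo = pd.DataFrame({
    "Annual Income": [15, 16, 17, 78, 80, 82],
    "Spending Score": [80, 77, 82, 15, 12, 17],
    "Age": [22, 24, 23, 45, 47, 44],
    "Cluster": [0, 0, 0, 1, 1, 1],
})

# Mean of each feature per cluster, computed on the original (unscaled) values
# so the profile stays interpretable.
features = ["Annual Income", "Spending Score", "Age"]
profile = df_demo.groupby("Cluster")[features].mean()

# Attach the cluster sizes; the assignment aligns on the cluster label index.
profile["Size"] = df_demo["Cluster"].value_counts().sort_index()
print(profile.round(2))
```

Applied to `df` after the clustering step, the same `groupby` would describe each segment in terms of typical income, spending score, and age.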

## 7. Conclusion

### Summary

- **Dataset**: The dataset contains information about mall customers, including their annual income, spending score, and age.

- **Clustering Method**: We used hierarchical clustering with Ward's linkage method to group the customers into 5 clusters.

- **Evaluation**: The clusters were assessed by their size distribution and visual separation, since there are no ground-truth labels against which to measure accuracy.

### Insights

- Customers in the same cluster are, by construction, similar in income, spending score, and age.

- The clustering can be used for targeted marketing and improving customer service strategies.