# Customer Segmentation Solution

This project introduces unsupervised learning through a practical application: customer segmentation. You'll use clustering techniques to group customers by their behaviors and attributes.

## Objective

The goal of this project is to understand unsupervised learning techniques, particularly clustering, and apply them to segment a customer dataset. Identifying distinct groups within the customer base makes it possible to tailor marketing strategies to each group.

## Getting Started

You will use the "Mall Customers" dataset for this project. This dataset includes information about customers that can be used for segmentation, such as gender, age, annual income, and spending score.

- **Resource**: The dataset can often be found on Kaggle or directly via a web search for "Mall Customers dataset CSV".

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


In [None]:
# Load the dataset
df = pd.read_csv('Mall_Customers.csv')


## Data Exploration

Understanding the dataset is crucial before applying any clustering techniques. Explore the dataset to gain insights into the customer information it contains.

- **TODO**: Display the first few rows of the dataset.
- **TODO**: Generate descriptive statistics for the dataset.
- **TODO**: Visualize the distribution of key features to get a sense of the data.

- **Resource**: [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)
- **Resource**: [Seaborn Gallery](https://seaborn.pydata.org/examples/index.html)


In [None]:
# Display the first few rows
print(df.head())

# Generate descriptive statistics
print(df.describe())

# Visualize the distribution of 'Annual Income' and 'Spending Score'
sns.jointplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=df, kind='scatter')
plt.show()


## Preprocessing the Data

Before applying clustering algorithms, it's important to preprocess the data. This can involve handling missing values, encoding categorical variables, and scaling features.

- **TODO**: Check for and handle any missing values in the dataset.
- **TODO**: Encode any categorical variables as needed.
- **TODO**: Scale the features to have a similar range for better performance of clustering algorithms.

- **Resource**: [Preprocessing Data with Scikit-Learn](https://scikit-learn.org/stable/modules/preprocessing.html)
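
The three preprocessing steps listed above can be sketched on a small stand-in table. This is only an illustration: the `toy` DataFrame and its values are made up (they are not rows from the Mall Customers file), and median imputation is just one of several reasonable ways to handle a gap.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data that reuses the dataset's column names.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Age": [19, 21, None, 23, 31],  # one deliberately missing value
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})
numeric_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]

# 1. Count missing values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# 2. Handle them; here numeric gaps are filled with the column median.
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# 3. Encode the two-valued 'Gender' column as 0/1.
toy["Gender_Male"] = (toy["Gender"] == "Male").astype(int)

# 4. Standardize: each column ends up with mean 0 and unit variance.
toy_scaled = StandardScaler().fit_transform(toy[numeric_cols])
print(toy_scaled.mean(axis=0).round(6))
```

Scaling matters for distance-based methods such as K-Means because, without it, a feature measured on a larger numeric range would dominate the distance calculation.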


In [None]:
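# Check for missing values in each column
print(df.isnull().sum())

# One-hot encode categorical columns such as 'Gender'
# (drop_first=True keeps a single 0/1 column for a two-valued category)
df_encoded = pd.get_dummies(df, drop_first=True)

# Scale the numeric features used for clustering; the encoded gender column
# stays in df_encoded but is not part of the clustering features here
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df_encoded[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']])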

## Choosing and Applying a Clustering Algorithm

There are several clustering algorithms available. K-Means is a popular choice for many segmentation tasks, but feel free to explore others like hierarchical clustering or DBSCAN.
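
As a rough illustration of the alternatives mentioned above, the sketch below runs agglomerative (hierarchical) clustering and DBSCAN from scikit-learn. The data comes from `make_blobs` as a synthetic stand-in for the scaled customer features, and the `eps` and `min_samples` values are arbitrary starting points, not tuned recommendations.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Hierarchical clustering: the number of clusters is chosen up front.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
agg_labels = agg.fit_predict(X_scaled)

# DBSCAN: no cluster count is given; eps and min_samples define density,
# and points that belong to no dense region get the label -1 (noise).
db = DBSCAN(eps=0.5, min_samples=5)
db_labels = db.fit_predict(X_scaled)

print("Agglomerative clusters:", len(set(agg_labels)))
print("DBSCAN clusters (excluding noise):", len(set(db_labels) - {-1}))
```

Unlike K-Means, DBSCAN can leave some customers unassigned, which may or may not be acceptable for a segmentation that needs to cover every customer.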

- **TODO**: Choose a clustering algorithm to apply to the dataset.
- **TODO**: Determine the optimal number of clusters (if applicable, e.g., for K-Means).
- **TODO**: Apply the clustering algorithm to segment the customer data.

- **Resource**: [Scikit-learn Clustering Techniques](https://scikit-learn.org/stable/modules/clustering.html)
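
For the second TODO, one common complement to reading an elbow plot by eye is the silhouette score, which is higher when clusters are compact and well separated. The sketch below is a minimal example on synthetic data: the four blob centers are made up for illustration, and on real customer data the peak is usually less clear-cut.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: four well-separated groups (centers chosen arbitrarily).
centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# The silhouette score needs at least 2 clusters, so start at k=2.
silhouette_by_k = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    silhouette_by_k[k] = silhouette_score(X_scaled, labels)

# Pick the k with the highest average silhouette.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for k, score in silhouette_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k by silhouette:", best_k)
```

When the elbow and the silhouette peak disagree, it is worth inspecting the candidate clusterings directly rather than trusting either number alone.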


In [None]:
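# Elbow method: fit K-Means for k = 1..10 and record the within-cluster
# sum of squares (WCSS), which scikit-learn exposes as inertia_
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(scaled_features)
    wcss.append(kmeans.inertia_)

# Plot WCSS against k; the "elbow" is where the curve starts to flatten
plt.plot(range(1, 11), wcss, marker='o')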
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

In [None]:
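# Fit the final model with 5 clusters (assumed here to be where the elbow
# appears; adjust this to match your own plot)
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(scaled_features)

# Attach each customer's cluster label to the original DataFrame
df['Cluster'] = cluster_labels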


## Analyzing the Segments

After applying the clustering algorithm, analyze the resulting customer segments.

- **TODO**: Examine the characteristics of each cluster. What distinguishes each segment?
- **TODO**: Visualize the clusters in terms of key features.

- **Resource**: [Matplotlib Documentation](https://matplotlib.org/stable/)
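
One way to approach the first TODO is to build a per-cluster profile table: cluster sizes plus feature means in the original units. The sketch below assumes a tiny made-up `toy` DataFrame that reuses the dataset's column names; it also shows converting K-Means centroids back from standardized units with `scaler.inverse_transform`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data with three obvious groups (values are illustrative only).
toy = pd.DataFrame({
    "Age": [22, 25, 24, 48, 52, 50, 35, 33, 36],
    "Annual Income (k$)": [20, 22, 19, 85, 90, 88, 55, 52, 58],
    "Spending Score (1-100)": [80, 85, 78, 15, 20, 18, 50, 48, 52],
})
features = list(toy.columns)

scaler = StandardScaler()
scaled = scaler.fit_transform(toy[features])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy["Cluster"] = kmeans.fit_predict(scaled)

# Per-cluster feature means in original units, plus the cluster size.
profile = toy.groupby("Cluster")[features].mean()
profile["Size"] = toy["Cluster"].value_counts().sort_index()

# Centroids live in standardized space; map them back to original units.
centroids = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_), columns=features
)

print(profile.round(1))
print(centroids.round(1))
```

Reading the profile row by row (for example, "young, low income, high spending") is usually what turns cluster numbers into segment descriptions that a marketing team can act on.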


In [None]:
# Visualizing clusters for 'Annual Income' and 'Spending Score'
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', hue='Cluster', data=df, palette='viridis')
plt.title('Customer Segments Based on Annual Income and Spending Score')
plt.show()

# Analyzing clusters by 'Age'
sns.boxplot(x='Cluster', y='Age', data=df)
plt.title('Age Distribution by Cluster')
plt.show()


## Conclusion

Reflect on the clustering process and the resulting customer segments. Consider how these segments might inform marketing strategies or product offerings.

Congratulations on completing this project! This exercise aimed to give you hands-on experience with unsupervised learning and customer segmentation.
