# Report: Product Segmentation Analysis of Nike

## Main Objective
Apply unsupervised learning, specifically **clustering**, to a Nike product dataset. The analysis groups products by price, stock availability, and average rating, so that Nike and its stakeholders can see how the product range is structured. These groupings can support decision-making and inventory optimization.

## Model Focus
- **Clustering Techniques**: K-Means, DBSCAN, and Agglomerative Clustering were applied to identify product segments.

## Dataset Overview
The dataset contains 112 Nike product listings with 17 columns, including each product's name, price, color, availability, average rating, and review count. These attributes are used to group products with similar characteristics.

### **Key Attributes:**
- `name`: Product name.
- `price`: Price of the product.
- `avg_rating`: Average rating given by customers.
- `review_count`: Number of customer reviews.
- `availability`: Shows whether the product is in stock.
- `color`: Available colors.

The features used for clustering were `price`, `avg_rating`, `review_count`, `color`, and `availability`.

In [3]:
%pip install pandas






In [5]:
import pandas as pd
# Load the dataset
nike_data = "D:\\projects\\-IBM-Machine-Learning\\nike_data_2022_09.csv"
df = pd.read_csv(nike_data)
# Display basic information about the dataset 
print(df.info())
print(df.describe())
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112 entries, 0 to 111
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   url              112 non-null    object 
 1   name             112 non-null    object 
 2   sub_title        112 non-null    object 
 3   brand            112 non-null    object 
 4   model            112 non-null    int64  
 5   color            110 non-null    object 
 6   price            112 non-null    float64
 7   currency         112 non-null    object 
 8   availability     108 non-null    object 
 9   description      112 non-null    object 
 10  raw_description  112 non-null    object 
 11  avg_rating       23 non-null     float64
 12  review_count     23 non-null     float64
 13  images           108 non-null    object 
 14  available_sizes  56 non-null     object 
 15  uniq_id          112 non-null    object 
 16  scraped_at       112 non-null    object 
dtypes: float64(3), i

## Data Exploration and Cleaning

The data was explored and cleaned to prepare it for analysis:
- **Column Selection**: Columns that are not useful for clustering (URLs, descriptions, images, sub-titles, and scrape metadata) were dropped.
- **Missing Values**: Missing values in `avg_rating` and `review_count` were filled with zeros, assuming no reviews or ratings for these products. Missing values in `availability` were filled with "Unknown". Rows with a missing `color` or `available_sizes` were dropped for simplicity.
- **Categorical Encoding**: The categorical columns `availability` and `color` were encoded as integer category codes.
- **Scaling**: The numerical features `price`, `avg_rating`, and `review_count` were standardized (zero mean, unit variance) to bring them onto the same scale for the clustering algorithms. The integer-encoded `availability` and `color` columns were left unscaled.


In [10]:
# Data Cleaning and Preprocessing
from sklearn.preprocessing import StandardScaler
# Drop columns not useful for clustering (URLs, descriptions, images, etc.)
cleaned_data = df.drop(columns=['url', 'description', 'raw_description', 'images', 'sub_title', 'uniq_id', 'scraped_at'])

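# Handle missing values
# Fill missing 'avg_rating' and 'review_count' with 0, assuming no reviews or ratings.
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which raises a FutureWarning in recent pandas and will stop working in pandas 3.0.
cleaned_data['avg_rating'] = cleaned_data['avg_rating'].fillna(0)
cleaned_data['review_count'] = cleaned_data['review_count'].fillna(0)

# Fill missing 'availability' with 'Unknown'
cleaned_data['availability'] = cleaned_data['availability'].fillna('Unknown')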

# Drop rows with missing 'color' and 'available_sizes' for simplicity in this analysis
cleaned_data.dropna(subset=['color', 'available_sizes'], inplace=True)

# Encode categorical columns 'availability' and 'color'
cleaned_data['availability'] = cleaned_data['availability'].astype('category').cat.codes
cleaned_data['color'] = cleaned_data['color'].astype('category').cat.codes

# Scale numerical columns like 'price', 'avg_rating', and 'review_count'
scaler = StandardScaler()
scaled_columns = ['price', 'avg_rating', 'review_count']
cleaned_data[scaled_columns] = scaler.fit_transform(cleaned_data[scaled_columns])

# Display the sanitized data
print(cleaned_data.isnull().sum())
print(cleaned_data.head())

name               0
brand              0
model              0
color              0
price              0
currency           0
availability       0
avg_rating         0
review_count       0
available_sizes    0
dtype: int64
                                      name brand     model  color     price  \
0  Nike Dri-FIT Team (MLB Minnesota Twins)  Nike  14226571     26 -0.809161   
1                             Club América  Nike  13814665      3  0.438128   
4    Paris Saint-Germain Repel Academy AWF  Nike  13327415     16 -0.060788   
5        NFL Miami Dolphins (Mike Gesicki)  Nike  14057953     41  1.435959   
7                    Nike College (Oregon)  Nike  13817332     41 -1.333771   

  currency  availability  avg_rating  review_count       available_sizes  
0      USD             0   -0.436256     -0.219495  S | M | L | XL | 2XL  
1      USD             0    2.437901     -0.159535             L (12–14)  
4      USD             0   -0.436256     -0.219495   XS | S | M | L | XL  
5 


After cleaning, the selected attributes were ready for unsupervised learning.

## Training Multiple Clustering Models
With the data preprocessed, I trained three different clustering models:

- *K-Means Clustering*
- *Agglomerative Clustering*
- *DBSCAN (Density-Based Spatial Clustering)*

1. **K-Means Clustering**:
- K-Means is a centroid-based clustering algorithm. I used 3 clusters, chosen based on the size of the dataset and the number of attributes.
- Clusters were formed from products with similar values of `price`, `avg_rating`, `review_count`, `availability`, and `color`.

2. **Agglomerative Clustering**:
- This is a hierarchical clustering algorithm: each product starts in its own cluster, and the closest clusters are merged iteratively.
- As with K-Means, the target was 3 distinct clusters.

3. **DBSCAN (Density-Based Clustering)**:
- DBSCAN groups products by density: it treats high-density regions as clusters and marks points in sparse regions as noise or outliers.
- Unlike K-Means and Agglomerative Clustering, which need the number of clusters specified in advance, DBSCAN is controlled by `eps` and `min_samples` and determines the number of clusters itself.

In [11]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

# Selecting features for clustering (excluding 'name', 'brand', 'model', 'currency', and 'available_sizes')
X = cleaned_data.drop(columns=['name', 'brand', 'model', 'currency', 'available_sizes'])

# 1. K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# 2. Agglomerative Clustering
agglo = AgglomerativeClustering(n_clusters=3)
agglo_labels = agglo.fit_predict(X)

# 3. DBSCAN Clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

# Silhouette scores to evaluate the clustering quality (higher is better)
kmeans_silhouette = silhouette_score(X, kmeans_labels)
agglo_silhouette = silhouette_score(X, agglo_labels)
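# The silhouette score is undefined for fewer than two distinct labels, so fall back to -1
# as a placeholder (for example, when DBSCAN marks every point as noise)
dbscan_silhouette = silhouette_score(X, dbscan_labels) if len(set(dbscan_labels)) > 1 else -1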

(kmeans_silhouette, agglo_silhouette, dbscan_silhouette)

(np.float64(0.5776236995471457), np.float64(0.5682635453508619), -1)

## **Model Evaluation**
The **Silhouette Score** was calculated to evaluate each clustering model. It measures how similar each product is to its own cluster compared with the nearest other cluster, and ranges from -1 to 1. Higher values indicate better-defined clusters.
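
To make the metric concrete, here is a minimal sketch that computes the mean silhouette by hand and compares it with scikit-learn's `silhouette_score`. The toy points, the labels, and the helper name `silhouette_by_hand` are made up for illustration and are not part of the Nike analysis.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical toy data: two tight, well-separated groups in 2-D.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels_toy = np.array([0, 0, 1, 1])


def silhouette_by_hand(X, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points.

    a: mean distance from point i to the other points in its own cluster.
    b: smallest mean distance from point i to the points of any other cluster.
    """
    # Pairwise Euclidean distance matrix.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # singleton cluster: s(i) is set to 0
            continue
        a = dists[i, same].mean()
        b = min(
            dists[i, labels == other].mean()
            for other in set(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


manual_score = silhouette_by_hand(X_toy, labels_toy)
library_score = silhouette_score(X_toy, labels_toy)
print(manual_score, library_score)
```

Both values should agree, which shows that the score rewards clusters whose members are close to each other and far from the nearest neighboring cluster.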

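- K-Means: Silhouette Score of about 0.578, the highest of the three models, which makes it a good baseline.
- Agglomerative Clustering: Silhouette Score of about 0.568, comparable to K-Means, which is plausible since both are distance-based and both targeted 3 clusters.
- DBSCAN: With `eps=0.5` and `min_samples=5`, DBSCAN did not produce at least two distinct labels, so no Silhouette Score could be computed; the reported -1 is the placeholder assigned in the code. DBSCAN can detect outliers well, but it struggles when the data lacks clear density variations or when `eps` is poorly matched to the scale of the features.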
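
How strongly DBSCAN depends on `eps` can be seen with a small sweep. The sketch below uses synthetic data from `make_blobs`, not the Nike dataset, and the helper name `summarize_dbscan` and the `eps` grid are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three blobs, standardized.
X_demo, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)


def summarize_dbscan(X, eps_values, min_samples=5):
    """Return (eps, number of clusters, number of noise points) per eps value."""
    rows = []
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels) - {-1})  # label -1 marks noise
        n_noise = int(np.sum(labels == -1))
        rows.append((eps, n_clusters, n_noise))
    return rows


results = summarize_dbscan(X_demo, eps_values=[0.001, 0.1, 0.3, 0.5, 1.0])
for eps, n_clusters, n_noise in results:
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

When `eps` is far too small for the feature scale, every point ends up as noise and only one label is returned, which is the situation the -1 placeholder guards against.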

## Final Model Recommendation
The recommended model is the one that produces the most clearly separated product groups. Based on the Silhouette Scores, that is K-Means or Agglomerative Clustering: both gave reasonably well-separated groups, with K-Means scoring marginally higher. DBSCAN did not yield usable clusters with the parameters used here, but it could still be useful for picking out outliers or niche products once its parameters are tuned. The points below are expected uses of the clusters and should be verified by inspecting the cluster contents:
- **Product Categories**: The chosen model likely separates products by price and availability. The clusters may correspond to pricing ranges such as low, medium, and high.
- **Availability Insights**: Clustering on availability may show stakeholders which product lines are more often out of stock.
- **Customer Ratings**: Clustering on average rating may highlight well-received products and those requiring improvement. Only 23 of the 112 products have ratings, so a zero rating mostly reflects missing data rather than poor reception.
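
One way to check these interpretations is to profile each cluster by its average feature values. The following sketch uses a small hand-made table with hypothetical prices, ratings, and cluster labels; none of the numbers come from the Nike dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only; "cluster" stands in for model labels.
products = pd.DataFrame({
    "price": [25.0, 30.0, 110.0, 120.0, 60.0, 65.0],
    "avg_rating": [4.5, 4.7, 0.0, 0.0, 4.0, 3.8],
    "review_count": [12, 30, 0, 0, 5, 8],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# Per-cluster size and feature means give a quick profile of each segment.
profile = products.groupby("cluster").agg(
    n_products=("price", "size"),
    mean_price=("price", "mean"),
    mean_rating=("avg_rating", "mean"),
    mean_reviews=("review_count", "mean"),
)
print(profile)
```

In the notebook, the labels would come from `kmeans_labels` or `agglo_labels`. Because `cleaned_data` holds standardized values, the profile is easier to read if it is computed on the unscaled prices and ratings.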

## Ideas for Next Steps
- **Increase Product Features**: Adding features such as customer demographics or regional availability may improve clustering quality and yield more actionable insights.
- **Model Tuning**: Tuning the number of clusters in K-Means, or the distance metric in Agglomerative Clustering, may improve the results.
- **Outlier Detection**: For niche products or outliers, run DBSCAN with different `eps` and `min_samples` values to surface unique clusters and sparse categories.
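
As a starting point for tuning the number of clusters, the Silhouette Score can be compared across several values of `k`. This is a minimal sketch on synthetic `make_blobs` data; the helper name `silhouette_by_k` and the range of `k` are assumptions, and in the notebook the feature matrix `X` would take the place of `X_demo`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption), not the Nike dataset.
X_demo, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.5, random_state=42)


def silhouette_by_k(X, k_values, random_state=42):
    """Fit K-Means for each k and return a {k: silhouette score} mapping."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = float(silhouette_score(X, labels))
    return scores


scores = silhouette_by_k(X_demo, range(2, 7))
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k by silhouette:", best_k)
```

The highest-scoring `k` is only a candidate: the resulting clusters still need to make sense as product groups before the choice is adopted.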