# Exercise 3B

In this exercise, you will explore how the DBSCAN clustering algorithm identifies dense groups and outliers in the Pokémon statistics dataset. You will analyze the dataset, scale features, determine DBSCAN parameters, run DBSCAN, and interpret the clustering results.

In [31]:
import kagglehub
import os
import pandas as pd

In [32]:
# Download latest version
path = kagglehub.dataset_download("abcsds/pokemon")
print("Path to dataset files:", path)

Using Colab cache for faster access to the 'pokemon' dataset.
Path to dataset files: /kaggle/input/pokemon


In [33]:
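# Confirm that the download path is a directory
print(os.path.isdir(path))

# List the downloaded files and build the full path to the first one
contents = os.listdir(path)
mydataset = os.path.join(path, contents[0])

# Load the CSV into a DataFrame
df = pd.read_csv(mydataset)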

True


## 1. Load the Dataset (8 pts)

Load the Pokémon dataset into a pandas DataFrame.

Show the first five rows (3 pts)

In [34]:
df.head(5)

index,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False



How many Pokémon are in the dataset? (2 pts)


In [35]:
df.shape[0]

800



List all columns available (3 pts)

In [36]:
df.columns

Index(['#', 'Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense',
       'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')

## 2. Select Features & Clean the Data (8 pts)

Select the numeric features needed for clustering (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed). (3 pts)

In [37]:
features = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
features_df = df[features]

Are there any missing values? (2 pts)

In [38]:
features_df.isnull().sum()

Feature,Missing values
HP,0
Attack,0
Defense,0
Sp. Atk,0
Sp. Def,0
Speed,0


If missing values exist, describe how you handled them (3 pts)

In [39]:
# No missing values were found in the six selected features, so no imputation or row removal was needed.

## 3. Scale the Features (10 pts)

Apply StandardScaler to the selected features.

Show the transformed feature sample (e.g., first 5 rows) (10 pts)


In [40]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features_df)
scaled_features_df = pd.DataFrame(scaled_features, columns=features_df.columns)
scaled_features_df.head(5)

index,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
0,-0.950626,-0.924906,-0.797154,-0.23913,-0.248189,-0.801503
1,-0.362822,-0.52413,-0.347917,0.21956,0.291156,-0.285015
2,0.420917,0.092448,0.293849,0.831146,1.010283,0.403635
3,0.420917,0.647369,1.577381,1.503891,1.729409,0.403635
4,-1.185748,-0.832419,-0.989683,-0.392027,-0.787533,-0.112853
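
As a sanity check on what the scaler does, the sketch below reproduces the standardization by hand on a small made-up array (the values are illustrative and the names `toy`, `col_mean`, `col_std`, and `toy_scaled` are hypothetical). It assumes the usual z-score definition with the population standard deviation (`ddof=0`), which is the default behavior of `StandardScaler`.

```python
import numpy as np

# Toy stat table (rows = Pokémon, columns = two stats); values are made up
toy = np.array([[45.0, 49.0],
                [60.0, 62.0],
                [80.0, 82.0],
                [39.0, 52.0]])

# z-score: subtract each column's mean, divide by its population std (ddof=0)
col_mean = toy.mean(axis=0)
col_std = toy.std(axis=0, ddof=0)
toy_scaled = (toy - col_mean) / col_std

print(toy_scaled.round(3))
```

After scaling, every column has mean 0 and standard deviation 1, so no single stat dominates the Euclidean distances that DBSCAN relies on.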


## 4. Determine a Suitable eps Value (10 pts)

Using k = 4:
Compute the distance to the 4th nearest neighbor for each Pokémon (6 pts)

In [41]:
from sklearn.neighbors import NearestNeighbors
import numpy as np

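# Request 5 neighbors: when querying the fitted data, the closest match is the point itself
neigh = NearestNeighbors(n_neighbors=5)
neigh.fit(scaled_features_df)

distances, indices = neigh.kneighbors(scaled_features_df)

# Column 0 is the zero-distance self match, so column 4 holds the 4th nearest neighbor.
# np.sort returns a sorted copy and leaves the distances array untouched.
k_distances = np.sort(distances[:, 4])

print("First 10 k-distances (sorted):", k_distances[:10])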

First 10 k-distances (sorted): [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


What is your chosen eps based on the "elbow"? (2 pts)



In [42]:
eps_chosen = 1.0

One-sentence explanation of your reasoning (2 pts)

In [59]:
# The chosen eps value of 1.0 is selected because it appears to be the 'elbow' point in the k-distance graph,
# where the rate of increase in distances significantly changes, indicating a good threshold for defining dense regions.
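
The k-distance curve can also be inspected numerically. The sketch below is a minimal illustration on synthetic data (a dense blob plus scattered points standing in for the scaled features; `X`, `pairwise`, and `k_dist` are hypothetical names). It computes the 4th-nearest-neighbor distance by brute force and prints a few percentiles of the sorted curve, so a sharp jump near the top of the range can be spotted without a plot.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled features: one dense blob plus scattered points
dense = rng.normal(0.0, 0.3, size=(80, 2))
sparse = rng.uniform(-4.0, 4.0, size=(20, 2))
X = np.vstack([dense, sparse])

# Pairwise Euclidean distances; after sorting each row, column 0 is the point itself
diffs = X[:, None, :] - X[None, :, :]
pairwise = np.sqrt((diffs ** 2).sum(axis=2))
k = 4
k_dist = np.sort(np.sort(pairwise, axis=1)[:, k])

# Percentiles of the sorted curve; a sharp jump near the top marks the elbow
for q in (50, 75, 90, 95, 99):
    print(f"{q}th percentile of {k}-distance: {np.percentile(k_dist, q):.3f}")
```

On the real data the same percentiles could be taken from `k_distances`; an eps near the point where the values start rising quickly is a reasonable starting candidate, not a guaranteed optimum.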

## 5. Run DBSCAN (10 points)

Run DBSCAN using your chosen eps and min_samples


How many clusters did DBSCAN find? (4 pts)

In [60]:
from sklearn.cluster import DBSCAN
import numpy as np

dbscan = DBSCAN(eps=eps_chosen, min_samples=5)

dbscan_labels = dbscan.fit_predict(scaled_features_df)

num_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
print(f"DBSCAN found {num_clusters} clusters.")


DBSCAN found 2 clusters.


How many Pokémon were labeled as noise? (4 pts)

In [61]:
num_noise_points = np.sum(dbscan_labels == -1)
print(f"DBSCAN labeled {num_noise_points} Pokémon as noise.")


DBSCAN labeled 202 Pokémon as noise.


Show the unique labels output by DBSCAN (2 pts)

In [62]:
# Show unique labels
unique_labels = np.unique(dbscan_labels)
print(f"Unique labels output by DBSCAN: {unique_labels}")


Unique labels output by DBSCAN: [-1  0  1]
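
To make the `-1` label concrete, the sketch below applies the core-point test on a tiny made-up 1-D dataset using plain NumPy. It is not the full DBSCAN algorithm (it does not assign cluster ids), only the rule that decides noise: a point is a core point if at least `min_samples` points, counting itself, lie within `eps`, and a non-core point is noise if no core point lies within `eps` of it. The data and variable names are illustrative assumptions.

```python
import numpy as np

# Toy 1-D data: a tight group of five points and one isolated point
X = np.array([[0.0], [0.1], [0.2], [0.3], [0.4], [5.0]])
eps, min_samples = 1.0, 5

# Count neighbors within eps for every point (the point itself is included)
dist = np.abs(X - X.T)
neighbor_counts = (dist <= eps).sum(axis=1)
is_core = neighbor_counts >= min_samples

# A non-core point is noise only if no core point lies within eps of it
near_core = ((dist <= eps) & is_core[None, :]).any(axis=1)
is_noise = ~is_core & ~near_core

print("neighbor counts:", neighbor_counts)
print("core:", is_core)
print("noise:", is_noise)
```

Because the count includes the point itself, `min_samples=5` pairs naturally with the 4th-nearest-neighbor distance used to choose eps.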


## 6. Attach Cluster Labels to the Original Dataset (7 points)

Add the cluster labels back to the original DataFrame (3 pts)

In [47]:
df['DBSCAN_Cluster'] = dbscan_labels

Show the first 10 rows including the cluster label (4 pts)

In [48]:
df.head(10)

index,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,DBSCAN_Cluster
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False,0
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False,0
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False,0
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False,0
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False,0
5,5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False,0
6,6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,False,0
7,6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False,0
8,6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,False,-1
9,7,Squirtle,Water,,314,44,48,65,50,64,43,1,False,0


## 7. Explore the Clusters (12 points)

For each cluster:


How many Pokémon does it contain? (4 pts)

In [49]:
df['DBSCAN_Cluster'].value_counts().sort_index()

DBSCAN_Cluster,count
-1,202
0,593
1,5


What are the average Attack, Defense, and Speed? (4 pts)


In [50]:
df.groupby('DBSCAN_Cluster')[['Attack', 'Defense', 'Speed']].mean()

DBSCAN_Cluster,Attack,Defense,Speed
-1,93.084158,91.950495,72.69802
0,74.322091,67.394604,66.62226
1,65.0,107.0,86.0


Compare the clusters: What differences do you notice? (4 pts)

In [63]:
# Cluster -1 (noise): these Pokémon have the highest average Attack (93.1) and a high average Defense (92.0).
# Cluster 0 is by far the largest cluster (593 Pokémon) and has moderate average stats across the board.
# Cluster 1 is very small (5 Pokémon) and has the lowest average Attack (65.0) but the highest average Defense (107.0) and Speed (86.0).

## 8. Identify Outliers (10 points)

List all Pokémon labeled as noise (cluster = -1) (4 pts)

In [52]:
df[df['DBSCAN_Cluster'] == -1]['Name']

index,Name
8,CharizardMega Charizard Y
19,BeedrillMega Beedrill
44,Jigglypuff
45,Wigglytuff
55,Diglett
...,...
793,Yveltal
795,Diancie
796,DiancieMega Diancie
797,HoopaHoopa Confined


Are many of them legendary? (3 pts)

In [53]:
df[df['DBSCAN_Cluster'] == -1]['Legendary'].value_counts()

Legendary,count
False,157
True,45


Explain why DBSCAN might classify them as outliers (3 pts)

In [54]:
# Of the 202 Pokémon labeled as noise, 45 are Legendary and 157 are not, so Legendary Pokémon are a minority of the noise points but a notable share (about 22%).
# DBSCAN labels a point as noise when it is neither a core point nor within eps of one, i.e. when it sits in a sparse region of the scaled stat space.
# The noise group's higher average Attack and Defense suggest that unusually strong or lopsided stat lines, such as those of many Legendary Pokémon, fall in these sparse regions.
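
The counts reported in the outputs above can be turned into proportions with a few lines of arithmetic. This is only a sketch that hard-codes those counts; the variable names are hypothetical.

```python
# Counts copied from the notebook outputs above
n_total = 800            # Pokémon in the dataset
n_noise = 202            # Pokémon labeled -1 by DBSCAN
n_noise_legendary = 45   # Legendary Pokémon among the noise points

noise_share = n_noise / n_total
legendary_share_in_noise = n_noise_legendary / n_noise

print(f"Noise share of dataset: {noise_share:.1%}")
print(f"Legendary share among noise points: {legendary_share_in_noise:.1%}")
```

Comparing the second figure against the overall Legendary rate (for example `df['Legendary'].mean()`, which this notebook does not compute) would show whether Legendary Pokémon are over-represented among the noise points.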

## 9. Interpret the Clustering Results (25 points)
Write a short interpretation (4–6 sentences).
Discuss:







What types of Pokémon grouped together (5 pts)

In [55]:
print('Type 1 distribution per cluster:')
print(df.groupby('DBSCAN_Cluster')['Type 1'].value_counts(normalize=True).unstack(fill_value=0).round(2))

print('\nType 2 distribution per cluster (showing top 5 for each):')
# For Type 2, it's often more sparse, so let's just see the top few per cluster
for cluster in df['DBSCAN_Cluster'].unique():
    print(f'\nCluster {cluster} (Type 2):')
    type2_counts = df[df['DBSCAN_Cluster'] == cluster]['Type 2'].value_counts(normalize=True)
    if not type2_counts.empty:
        print(type2_counts.head(5).round(2))
    else:
        print('No Type 2 data for this cluster.')

Type 1 distribution per cluster:
Type 1           Bug  Dark  Dragon  Electric  Fairy  Fighting  Fire  Flying  \
DBSCAN_Cluster                                                                
-1              0.06  0.03    0.07      0.02   0.02      0.03  0.05    0.00   
 0              0.10  0.04    0.03      0.06   0.02      0.03  0.07    0.01   
 1              0.00  0.00    0.00      1.00   0.00      0.00  0.00    0.00   

Type 1          Ghost  Grass  Ground   Ice  Normal  Poison  Psychic  Rock  \
DBSCAN_Cluster                                                              
-1               0.06   0.04    0.04  0.02    0.09    0.00     0.13  0.10   
 0               0.03   0.10    0.04  0.03    0.13    0.05     0.05  0.04   
 1               0.00   0.00    0.00  0.00    0.00    0.00     0.00  0.00   

Type 1          Steel  Water  
DBSCAN_Cluster                
-1               0.07   0.14  
 0               0.02   0.14  
 1               0.00   0.00  

Type 2 distribution per clust

Whether the clusters make intuitive sense (10 pts)

In [56]:
# The clusters make reasonable intuitive sense. Most Pokémon have broadly similar, moderate stats, so one large general cluster (Cluster 0) is expected. Cluster 1 is a very small group whose members all have Electric as Type 1, and the many noise points are Pokémon with unusual or extreme stats that do not sit in any dense region.


What the noise points reveal about DBSCAN (5 pts)


In [57]:
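# The noise points (Cluster -1) show that DBSCAN does not force every point into a cluster: Pokémon whose stats are far from any dense region are left unassigned. With eps = 1.0, about a quarter of the dataset (202 of 800) is labeled noise, which suggests the result depends strongly on the eps choice. The Legendary Pokémon among the noise points illustrate the kind of unusual stat profiles that end up isolated.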

What stat patterns you discovered (5 pts)





*   Cluster -1 (Noise): Pokémon in this group have the highest average Attack (93.1) and the second-highest average Defense (92.0), suggesting they are generally stronger than the Pokémon in Cluster 0.
*   Cluster 0: This is the largest and most general cluster, characterized by moderate average stats across Attack, Defense, and Speed.
*   Cluster 1: This small cluster (5 Pokémon, all with Electric as Type 1) shows the lowest average Attack (65.0) but the highest average Defense (107.0) and Speed (86.0). Its averages are whole numbers, which hints that its members have nearly identical stat lines rather than forming a broad defensive archetype.
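
Whether a very small cluster is a genuine archetype or just a handful of duplicated stat lines can be checked by looking at the spread inside each cluster, not only the mean. The sketch below uses made-up numbers and hypothetical names (`stats`, `labels`, `summary`); a per-cluster standard deviation of zero means every member has identical values.

```python
import numpy as np

# Made-up stats (Attack, Defense, Speed) and cluster labels, for illustration only
stats = np.array([[70, 60, 65], [80, 75, 70], [75, 65, 60],
                  [65, 107, 86], [65, 107, 86], [65, 107, 86]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

# Per-cluster mean and standard deviation; a std of zero means identical rows
summary = {}
for lab in np.unique(labels):
    members = stats[labels == lab]
    summary[int(lab)] = (members.mean(axis=0), members.std(axis=0))
    print(lab, members.mean(axis=0).round(1), members.std(axis=0).round(1))
```

On the real data, something like `df.groupby('DBSCAN_Cluster')[features].std()` would give the same kind of check.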

