# **Pokémon Data Cleaning**

This notebook merges and cleans two Pokémon datasets ('poke_data.csv' and 'poke_links.csv') to prepare a final dataset for analysis and dashboard creation.

The following steps are performed:

- Load both raw CSV files as DataFrames
- Standardize Dex numbers to enable merging
- Merge datasets on Dex and Name
- Clean and normalize column names
- Handle missing values
- Fix data types
- Export cleaned DataFrame as 'pokemon_clean.csv'

## **1. Import libraries and load raw CSV files**

In [2]:
import os
import pandas as pd

In [34]:
# Load the raw CSV files into DataFrames.
# '..' moves one folder up from the current working directory.
file_path_links = os.path.join('..', 'Data', 'poke_links.csv')
df_links = pd.read_csv(file_path_links)

file_path_data = os.path.join('..', 'Data', 'poke_data.csv')
df_data = pd.read_csv(file_path_data)
df_data[:5]

,Dex,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80
3,4,Charmander,Fire,,309,39,52,43,60,50,65
4,5,Charmeleon,Fire,,405,58,64,58,80,65,80
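
Because `'..'` is resolved against the current working directory, the load step only works when the notebook is started from its own folder. As a minimal sketch, the same paths can be built with `pathlib`, which also makes it easy to print the absolute location being read. The names `DATA_DIR` and `csv_path` are hypothetical helpers, not part of the original notebook; only the folder layout (`..`, `Data`) comes from the cell above.

```python
import os
from pathlib import Path

# Hypothetical helper names; the folder layout ('..', 'Data') is taken from the cell above.
DATA_DIR = Path('..') / 'Data'


def csv_path(filename: str) -> Path:
    """Build the path to a CSV file inside the Data folder."""
    return DATA_DIR / filename


links_path = csv_path('poke_links.csv')
data_path = csv_path('poke_data.csv')

# Same relative location as the os.path.join version used above
print(links_path == Path(os.path.join('..', 'Data', 'poke_links.csv')))

# resolve() shows which absolute folder '..' points at from the current working directory
print(data_path.resolve())
```

`pd.read_csv` accepts a `Path` object directly, so `pd.read_csv(links_path)` would behave like the string-based call.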


## **2. Standardize Dex numbers to 4-digit strings**

In [9]:
# Standardize Dex numbers to 4-digit strings with .str.zfill(4)
df_data['Dex'] = df_data['Dex'].astype(str).str.zfill(4)
df_links['Dex'] = df_links['Dex'].astype(str).str.zfill(4)
df_data[:15]

,Dex,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80
3,4,Charmander,Fire,,309,39,52,43,60,50,65
4,5,Charmeleon,Fire,,405,58,64,58,80,65,80
5,6,Charizard,Fire,Flying,534,78,84,78,109,85,100
6,7,Squirtle,Water,,314,44,48,65,50,64,43
7,8,Wartortle,Water,,405,59,63,80,65,80,58
8,9,Blastoise,Water,,530,79,83,100,85,105,78
9,10,Caterpie,Bug,,195,45,30,35,20,20,45
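
The padding step can be illustrated on a small, made-up Series (the values below are toy data, not read from the CSV files). The sketch also shows one pitfall worth checking for: if a Dex column had been parsed as float, `astype(str)` would keep the decimal part, so converting to integer first would be needed.

```python
import pandas as pd

# Toy Dex values (illustrative only)
dex_int = pd.Series([1, 25, 1025])
dex_padded = dex_int.astype(str).str.zfill(4)
print(dex_padded.tolist())  # ['0001', '0025', '1025']

# If the column were float, the string form keeps '.0' and the padding is wrong
dex_float = pd.Series([1.0, 25.0])
bad = dex_float.astype(str).str.zfill(4)
print(bad.tolist())  # ['01.0', '25.0']

# Casting to int first restores the intended result
fixed = dex_float.astype(int).astype(str).str.zfill(4)
print(fixed.tolist())  # ['0001', '0025']
```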


## **3. Preview datasets and basic structure check**

In [15]:
df_links.info()
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Dex         1025 non-null   object 
 1   Name        1025 non-null   object 
 2   poke_links  1025 non-null   object 
 3   Generation  1025 non-null   float64
 4   Image URL   1025 non-null   object 
dtypes: float64(1), object(4)
memory usage: 40.2+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Dex      1025 non-null   object
 1   Name     1025 non-null   object
 2   Type 1   1025 non-null   object
 3   Type 2   526 non-null    object
 4   Total    1025 non-null   int64 
 5   HP       1025 non-null   int64 
 6   Attack   1025 non-null   int64 
 7   Defense  1025 non-null   int64 
 8   Sp. Atk  1025 non-null   int64 
 9   Sp. Def  1025 non-null   int64 
 10

In [17]:
df_data.describe()

,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
count,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0
mean,427.686829,70.18439,77.521951,72.507317,70.080976,70.205854,67.186341
std,112.770735,26.631054,29.782541,29.286972,29.658378,26.639329,28.717227
min,175.0,1.0,5.0,5.0,10.0,20.0,5.0
25%,325.0,50.0,55.0,50.0,47.0,50.0,45.0
50%,450.0,68.0,75.0,70.0,65.0,67.0,65.0
75%,508.0,85.0,100.0,90.0,90.0,86.0,88.0
max,720.0,255.0,181.0,230.0,173.0,230.0,200.0


In [18]:
df_data.columns

Index(['Dex', 'Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense',
       'Sp. Atk', 'Sp. Def', 'Speed'],
      dtype='object')

## **4. Check for missing values and duplicates**

In [19]:
df_data.isnull().sum()

Dex          0
Name         0
Type 1       0
Type 2     499
Total        0
HP           0
Attack       0
Defense      0
Sp. Atk      0
Sp. Def      0
Speed        0
dtype: int64

In [20]:
df_data.duplicated().sum()

0
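
`duplicated()` without arguments only flags rows that match in every column. Since the merge later relies on `Dex` and `Name` as keys, it may also be worth checking for repeated key pairs. The following is a minimal sketch on invented rows, not on the real dataset:

```python
import pandas as pd

# Toy frame: the first two rows share the same key but differ in HP
toy = pd.DataFrame({
    'Dex': ['0001', '0001', '0002'],
    'Name': ['Bulbasaur', 'Bulbasaur', 'Ivysaur'],
    'HP': [45, 46, 60],
})

# Full-row duplicates: none, because HP differs
full_dupes = int(toy.duplicated().sum())

# Duplicates on the merge keys only: one repeated (Dex, Name) pair
key_dupes = int(toy.duplicated(subset=['Dex', 'Name']).sum())

print(full_dupes, key_dupes)  # 0 1
```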

## **5. Merge datasets on Dex and Name**

In [22]:
merged_df = pd.merge(df_data, df_links, on=['Dex', 'Name'], how='inner')
merged_df[:5]

,Dex,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,poke_links,Generation,Image URL
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,https://pokemondb.net/pokedex/bulbasaur,1.0,https://img.pokemondb.net/artwork/large/bulbas...
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,https://pokemondb.net/pokedex/ivysaur,1.0,https://img.pokemondb.net/artwork/large/ivysau...
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,https://pokemondb.net/pokedex/venusaur,1.0,https://img.pokemondb.net/artwork/large/venusa...
3,4,Charmander,Fire,,309,39,52,43,60,50,65,https://pokemondb.net/pokedex/charmander,1.0,https://img.pokemondb.net/artwork/large/charma...
4,5,Charmeleon,Fire,,405,58,64,58,80,65,80,https://pokemondb.net/pokedex/charmeleon,1.0,https://img.pokemondb.net/artwork/large/charme...
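
An inner join silently drops rows whose keys appear in only one frame, for example when a name is spelled differently in the two files. One way to audit this, sketched here on made-up frames, is an outer merge with `indicator=True`, which adds a `_merge` column, plus `validate='one_to_one'`, which raises an error if a key pair is repeated on either side.

```python
import pandas as pd

# Toy inputs (illustrative only): one key is missing on each side
left = pd.DataFrame({
    'Dex': ['0001', '0002', '0003'],
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur'],
})
right = pd.DataFrame({
    'Dex': ['0001', '0002', '0004'],
    'Name': ['Bulbasaur', 'Ivysaur', 'Charmander'],
    'Generation': [1.0, 1.0, 1.0],
})

# Outer merge keeps every row and records where each one came from
check = pd.merge(
    left, right,
    on=['Dex', 'Name'],
    how='outer',
    indicator=True,
    validate='one_to_one',
)
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Dex', 'Name', '_merge']])

# The inner merge used in the notebook would keep only the matched rows
inner = pd.merge(left, right, on=['Dex', 'Name'], how='inner')
print(len(inner))  # 2
```

If `unmatched` is empty on the real data, the inner merge loses nothing.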


## **6. Clean column names for consistency**

In [25]:
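# Normalize column names: trim, lowercase, spaces to underscores, drop periods.
# regex=False treats ' ' and '.' as literal text; as a regex, '.' would match any character.
merged_df.columns = (
    merged_df.columns
    .str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
    .str.replace('.', '', regex=False)
)
merged_df[:5]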

,dex,name,type_1,type_2,total,hp,attack,defense,sp_atk,sp_def,speed,poke_links,generation,image_url
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,https://pokemondb.net/pokedex/bulbasaur,1.0,https://img.pokemondb.net/artwork/large/bulbas...
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,https://pokemondb.net/pokedex/ivysaur,1.0,https://img.pokemondb.net/artwork/large/ivysau...
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,https://pokemondb.net/pokedex/venusaur,1.0,https://img.pokemondb.net/artwork/large/venusa...
3,4,Charmander,Fire,,309,39,52,43,60,50,65,https://pokemondb.net/pokedex/charmander,1.0,https://img.pokemondb.net/artwork/large/charma...
4,5,Charmeleon,Fire,,405,58,64,58,80,65,80,https://pokemondb.net/pokedex/charmeleon,1.0,https://img.pokemondb.net/artwork/large/charme...
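
The renaming chain can be tried on a standalone `Index` built from a few of the column names shown above. This sketch passes `regex=False` explicitly, since the default interpretation of the pattern has differed between pandas versions; the `re.sub` line uses the standard library to show what happens when `'.'` is read as a regular expression.

```python
import re

import pandas as pd

# A few of the original column names
cols = pd.Index(['Dex', 'Type 1', 'Sp. Atk', 'Image URL'])

cleaned = (
    cols
    .str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)   # literal space
    .str.replace('.', '', regex=False)    # literal period
)
print(list(cleaned))  # ['dex', 'type_1', 'sp_atk', 'image_url']

# As a regular expression, '.' matches every character, so the whole name is removed
as_regex = re.sub(r'.', '', 'Sp. Atk')
print(repr(as_regex))  # ''
```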


## **7. Fill missing values in Type 2**

In [26]:
merged_df['type_2'] = merged_df['type_2'].fillna('None')
merged_df[:5]

,dex,name,type_1,type_2,total,hp,attack,defense,sp_atk,sp_def,speed,poke_links,generation,image_url
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,https://pokemondb.net/pokedex/bulbasaur,1.0,https://img.pokemondb.net/artwork/large/bulbas...
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,https://pokemondb.net/pokedex/ivysaur,1.0,https://img.pokemondb.net/artwork/large/ivysau...
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,https://pokemondb.net/pokedex/venusaur,1.0,https://img.pokemondb.net/artwork/large/venusa...
3,4,Charmander,Fire,,309,39,52,43,60,50,65,https://pokemondb.net/pokedex/charmander,1.0,https://img.pokemondb.net/artwork/large/charma...
4,5,Charmeleon,Fire,,405,58,64,58,80,65,80,https://pokemondb.net/pokedex/charmeleon,1.0,https://img.pokemondb.net/artwork/large/charme...
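
A quick way to confirm the fill worked is to count missing values before and after. The sketch below uses an invented Series; note that the fill value is the string `'None'`, which is ordinary text and not Python's `None`.

```python
import pandas as pd

# Toy secondary-type column with two missing entries
type_2 = pd.Series(['Poison', None, 'Flying', None], name='type_2')

filled = type_2.fillna('None')

missing_before = int(type_2.isna().sum())
missing_after = int(filled.isna().sum())
print(missing_before, missing_after)  # 2 0

# The placeholder now shows up as its own category
print(filled.value_counts())
```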


## **8. Convert generation column to integer**

In [31]:
merged_df['generation'] = merged_df['generation'].astype('int')
merged_df[200:205]

,dex,name,type_1,type_2,total,hp,attack,defense,sp_atk,sp_def,speed,poke_links,generation,image_url
200,201,Unown,Psychic,,336,48,72,48,72,48,48,https://pokemondb.net/pokedex/unown,2,https://img.pokemondb.net/artwork/large/unown.jpg
201,202,Wobbuffet,Psychic,,405,190,33,58,33,58,33,https://pokemondb.net/pokedex/wobbuffet,2,https://img.pokemondb.net/artwork/large/wobbuf...
202,203,Girafarig,Normal,Psychic,455,70,80,65,90,65,85,https://pokemondb.net/pokedex/girafarig,2,https://img.pokemondb.net/artwork/large/girafa...
203,204,Pineco,Bug,,290,50,65,90,35,35,15,https://pokemondb.net/pokedex/pineco,2,https://img.pokemondb.net/artwork/large/pineco...
204,205,Forretress,Bug,Steel,465,75,90,140,60,60,40,https://pokemondb.net/pokedex/forretress,2,https://img.pokemondb.net/artwork/large/forret...
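
The cast to `int` works here because `Generation` has no missing values (1025 non-null in the `info()` output). If the column did contain `NaN`, a plain integer cast would raise an error. As a sketch on toy data, pandas' nullable `'Int64'` dtype is one alternative that keeps missing entries:

```python
import pandas as pd

# Toy generation column with one missing value (stored as float64 with NaN)
generation = pd.Series([1.0, 2.0, None])

# A plain integer cast cannot represent NaN
try:
    generation.astype('int')
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)  # True

# The nullable integer dtype keeps the gap as <NA>
nullable = generation.astype('Int64')
print(nullable.dtype)  # Int64
print(int(nullable.isna().sum()))  # 1
```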


## **9. Save cleaned dataset to CSV**

In [33]:
file_path_merged = os.path.join('..', 'Data', 'pokemon_clean.csv')
merged_df.to_csv(file_path_merged, index=False)
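
One thing to keep in mind when reading the cleaned file back: CSV stores no type information, so a default `pd.read_csv` will infer `dex` as an integer and drop the leading zeros. Recent pandas versions may also treat the text `None` as a missing-value marker, which would undo the fill from step 7. The sketch below round-trips a toy frame through an in-memory buffer (an assumption made for illustration, in place of the real file) and reloads it with `dtype` and `keep_default_na=False`.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned frame
clean = pd.DataFrame({
    'dex': ['0001', '0004'],
    'name': ['Bulbasaur', 'Charmander'],
    'type_2': ['Poison', 'None'],
    'generation': [1, 1],
})

# Write to an in-memory buffer; the notebook writes to a file path
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# Default reload: dex is inferred as integer, so the zero padding is lost
buffer.seek(0)
naive = pd.read_csv(buffer)
print(naive['dex'].tolist())  # [1, 4]

# Reload keeping dex as text and disabling the default missing-value markers
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype={'dex': str}, keep_default_na=False)
print(reloaded['dex'].tolist())     # ['0001', '0004']
print(reloaded['type_2'].tolist())  # ['Poison', 'None']
```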