# YBLL Workshop - Intro to Data Science
## Hands-on: Exploratory Data Analysis with Pandas

*By Carlos Doble*

In this notebook, we will use Pandas to perform a common data science technique: exploratory data analysis (EDA). EDA helps us understand the data we are handling, gain insights from it, and identify the data preprocessing (or data cleaning) steps needed before feeding it to a machine learning model (oops, that's a spoiler).


In [None]:
## Install required libraries for this hands-on
!pip install pandas
!pip install numpy
!pip install matplotlib

In [None]:
## Importing required libraries to the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

For the first part of the hands-on, you will perform a guided EDA on the Pokémon dataset.

In [None]:
df_pokemon = pd.read_csv('pokemon_final.csv') 
df_pokemon.head(9)

### Dataset Columns and Descriptions

The dataset contains the following columns:

| Column             | Description                                |
|--------------------|--------------------------------------------|
| `pokemon_id`     | Unique identifier for each Pokémon       |
| `pokemon_name`   | Name of the Pokémon                       |
| `species_id`     | Identifier for the species of the Pokémon |
| `height`         | Height of the Pokémon in decimetres (1dm == 0.1m)      |
| `weight`         | Weight of the Pokémon in hectograms (1hg == 0.1kg)       |
| `type_1_name`    | Primary type of the Pokémon               |
| `type_2_name`    | Secondary type of the Pokémon (if any)    |
| `attack`         | Attack stat of the Pokémon                |
| `defense`        | Defense stat of the Pokémon               |
| `hp`             | Hit Points (HP) stat of the Pokémon       |
| `special-attack` | Special Attack stat of the Pokémon        |
| `special-defense`| Special Defense stat of the Pokémon       |
| `speed`          | Speed stat of the Pokémon                 |
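
Since `height` and `weight` are stored in decimetres and hectograms, converting them to metres and kilograms can make the numbers easier to read. The sketch below is only an illustration: `df_demo`, `height_m`, and `weight_kg` are hypothetical names, and the rows are toy values rather than data read from `pokemon_final.csv`. In the notebook, the same two lines could be applied to `df_pokemon`.

```python
import pandas as pd

# Toy rows with illustrative values (assumption: not read from pokemon_final.csv).
df_demo = pd.DataFrame({
    "pokemon_name": ["bulbasaur", "charmander", "squirtle"],
    "height": [7, 6, 5],     # decimetres
    "weight": [69, 85, 90],  # hectograms
})

# 1 dm = 0.1 m and 1 hg = 0.1 kg, so both conversions divide by 10.
df_demo["height_m"] = df_demo["height"] / 10
df_demo["weight_kg"] = df_demo["weight"] / 10

print(df_demo)
```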

Time to put our knowledge of descriptive statistics to good use. Finding the counts, minimum and maximum values, measures of central tendency, and measures of variability is usually the first step in EDA.
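
Before computing each statistic by hand, it can help to see them all at once. The sketch below uses `DataFrame.describe()`, which reports the count, mean, standard deviation, minimum, quartiles, and maximum for every numeric column. `df_demo` is a hypothetical toy frame with illustrative values; in the notebook, `df_pokemon.describe()` would give the same kind of summary.

```python
import pandas as pd

# Toy numeric columns with illustrative values (assumption: not the real dataset).
df_demo = pd.DataFrame({
    "attack": [49, 52, 48, 55],
    "speed": [45, 65, 43, 90],
})

# One row per statistic, one column per numeric feature.
summary = df_demo.describe()
print(summary)
```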

In [None]:
## Find the min and max values for height
min_height = df_pokemon['height'].min()
max_height = df_pokemon['height'].max()

print("Minimum value for height:", min_height)
print("Maximum value for height:", max_height)

In [None]:
min_height_df = df_pokemon[df_pokemon['height'] == min_height]
min_height_df

In [None]:
max_height_df = df_pokemon[df_pokemon['height'] == max_height]
max_height_df

In [None]:
## Find the min and max values for attack
min_attack = df_pokemon['attack'].min()
max_attack = df_pokemon['attack'].max()

print("Minimum value for attack:", min_attack)
print("Maximum value for attack:", max_attack)

In [None]:
min_attack_df = df_pokemon[df_pokemon['attack'] == min_attack]
min_attack_df

In [None]:
max_attack_df = df_pokemon[df_pokemon['attack'] == max_attack]
max_attack_df

## Quartiles

In [None]:
## Find the three quartiles (25th, 50th, and 75th percentiles) of attack
q1_attack = df_pokemon['attack'].quantile(0.25)
q2_attack = df_pokemon['attack'].quantile(0.50)
q3_attack = df_pokemon['attack'].quantile(0.75)

print("First quartile for attack:", q1_attack)
print("Second quartile for attack:", q2_attack)
print("Third quartile for attack:", q3_attack)

In [None]:
## Plot the distribution of attack and mark its quartiles
fig, ax = plt.subplots(figsize=(10, 5))
df_pokemon['attack'].dropna().plot.kde(ax=ax, c="#00aeef")
# Use a list (not a set) so the labels stay in the same order as the handles
labels = [f'First quartile for attack: {q1_attack}',
          f'Second quartile for attack: {q2_attack}',
          f'Third quartile for attack: {q3_attack}',
          ]
handles, _ = ax.get_legend_handles_labels()
handles.append(ax.axvline(x=q1_attack, c="#99cc33", linestyle='dashed'))
handles.append(ax.axvline(x=q2_attack, c="#99cc33", linestyle='dashed'))
handles.append(ax.axvline(x=q3_attack, c="#99cc33", linestyle='dashed'))
# Skip the first handle (the KDE curve) so only the quartile lines are labelled
ax.legend(handles=handles[1:], labels=labels)

## Measures of Central Tendency (Mean, Median, Mode)

In [None]:
## Mean of attack
mean_attack = df_pokemon['attack'].mean()

## Median of attack
median_attack = df_pokemon['attack'].median()

## Mode of attack (mode() returns a Series because several values can tie)
mode_attack = df_pokemon['attack'].mode()

print(f"Mean attack: {mean_attack}")
print(f"Median attack: {median_attack}")
print(f"Mode attack: {mode_attack.tolist()}")

## Measures of Variability (Range, IQR, Variance/Standard Deviation)

In [None]:
## Range of attack
range_attack = df_pokemon['attack'].max() - df_pokemon['attack'].min()

## Interquartile range (IQR) of attack
iqr_attack = df_pokemon['attack'].quantile(0.75) - df_pokemon['attack'].quantile(0.25)

## Variance of attack (pandas computes the sample variance by default)
var_attack = df_pokemon['attack'].var()

## Standard deviation of attack
std_attack = df_pokemon['attack'].std()

print(f"Range of attack: {range_attack}")
print(f"IQR of attack: {iqr_attack}")
print(f"Variance of attack: {var_attack}")
print(f"Standard Deviation of attack: {std_attack}")

In the next cells, the notebook poses some questions to answer through EDA. In practice, one skill a data scientist must have is the curiosity to come up with these questions (*don't be afraid to ask dumb questions, as long as the data can answer them*).

### Q1: What are the top 5 fastest Pokémon in the dataset?

In [None]:
top_5_fastest_pokemon = df_pokemon.sort_values(by='speed', ascending=False).head(5)
top_5_fastest_pokemon

### Q2: Does the weight of a Pokémon impact its speed stat?

In [None]:
correlation = df_pokemon['weight'].corr(df_pokemon['speed'])
print(f"Correlation between weight and speed: {correlation}")

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df_pokemon['weight'], df_pokemon['speed'], alpha=0.5)
plt.title('Correlation between Weight and Speed')
plt.xlabel('Weight')
plt.ylabel('Speed')
plt.grid(True)
plt.show()

### Q3: What is the most common Pokémon type?

In [None]:
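# Count how often each type appears as a primary and as a secondary type
type_1_counts = df_pokemon['type_1_name'].value_counts()
type_2_counts = df_pokemon['type_2_name'].value_counts()

# Add the two counts together per type, then sort from most to least common
combined_counts = pd.concat([type_1_counts, type_2_counts], axis=0).groupby(level=0).sum()
combined_counts = combined_counts.sort_values(ascending=False)

combined_counts.plot(kind='bar', figsize=(10, 6), color='#00aeef')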

### Basic Data Exploration

- How many Pokémon are in the dataset?
- What are the different types of Pokémon (`type_1_name`) and how many Pokémon belong to each type?
- Are there any missing values in the dataset? If so, in which columns?
- What is the range of values for numerical features like height, weight, attack, defense, etc.?
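
The basic exploration questions above map onto a handful of standard pandas calls. The sketch below is one possible starting point, not a model answer: `df_demo` and the variable names are hypothetical, and the rows are toy values. In the notebook, the same calls could be run on `df_pokemon`.

```python
import pandas as pd

# Toy rows with illustrative values (assumption: not read from pokemon_final.csv).
df_demo = pd.DataFrame({
    "pokemon_name": ["bulbasaur", "charmander", "squirtle", "pikachu"],
    "type_1_name": ["grass", "fire", "water", "electric"],
    "type_2_name": ["poison", None, None, None],
    "height": [7, 6, 5, 4],
})

# How many rows (Pokémon) are there?
n_rows = len(df_demo)

# How many Pokémon belong to each primary type?
type_counts = df_demo["type_1_name"].value_counts()

# How many missing values does each column have?
missing_per_column = df_demo.isna().sum()

# Range (max minus min) of a numeric column.
height_range = df_demo["height"].max() - df_demo["height"].min()

print(n_rows)
print(type_counts)
print(missing_per_column)
print(height_range)
```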

### Descriptive Statistics

- What are the mean, median, and standard deviation of numerical features such as height, weight, and attack?
- Which Pokémon has the highest attack? Which one has the highest defense?
- What are the top 5 fastest Pokémon in the dataset?
- What is the distribution of Pokémon heights and weights? Are they normally distributed or skewed?
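
Two of the questions above can be approached without plotting. The sketch below uses `idxmax()` to find the row with the highest attack and `skew()` as a rough numeric hint about skewness (a clearly positive value suggests a long right tail). `df_demo` is a hypothetical toy frame with illustrative values; a histogram or KDE plot on `df_pokemon` would still be the better way to judge the real distribution.

```python
import pandas as pd

# Toy rows with illustrative values (assumption: not read from pokemon_final.csv).
df_demo = pd.DataFrame({
    "pokemon_name": ["bulbasaur", "charmander", "squirtle", "pikachu", "snorlax"],
    "attack": [49, 52, 48, 55, 110],
    "weight": [69, 85, 90, 60, 4600],
})

# idxmax() returns the index label of the largest value; use it to look up the name.
strongest = df_demo.loc[df_demo["attack"].idxmax(), "pokemon_name"]

# A mean far above the median, and a positive skew, both point to a right-skewed column.
weight_mean = df_demo["weight"].mean()
weight_median = df_demo["weight"].median()
weight_skew = df_demo["weight"].skew()

print(strongest)
print(weight_mean, weight_median, weight_skew)
```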

### Type-Based Analysis

- What are the most and least common Pokémon primary types (`type_1_name`)?
- What is the average attack, defense, and speed for each Pokémon type?
- Do Pokémon with two types (`type_2_name`) tend to have higher or lower stats compared to single-type Pokémon?
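
Type-based questions like these usually come down to `groupby`. The sketch below averages stats per primary type, then compares single-type and dual-type Pokémon by checking whether `type_2_name` is missing. `df_demo` and the helper column `type_kind` are hypothetical, and the rows are toy values, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy rows with illustrative values (assumption: not read from pokemon_final.csv).
df_demo = pd.DataFrame({
    "type_1_name": ["grass", "fire", "fire", "water"],
    "type_2_name": ["poison", None, "flying", None],
    "attack": [49, 52, 84, 48],
    "speed": [45, 65, 100, 43],
})

# Average attack and speed for each primary type.
mean_by_type = df_demo.groupby("type_1_name")[["attack", "speed"]].mean()

# Label each row as single-type or dual-type based on whether type_2_name is present.
df_demo["type_kind"] = df_demo["type_2_name"].notna().map({True: "dual", False: "single"})
mean_attack_by_kind = df_demo.groupby("type_kind")["attack"].mean()

print(mean_by_type)
print(mean_attack_by_kind)
```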

### Comparing Attributes

- Is there a correlation between a Pokémon's weight and its attack or defense stats?
- Do taller Pokémon tend to have higher HP (Hit Points)?
- How do `special-attack` and `special-defense` compare across different Pokémon types?
- Do fire-type Pokémon generally have higher `special-attack` than other types?
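
For comparisons like these, a correlation matrix and a boolean mask go a long way. Note that `special-attack` contains a hyphen, so it has to be accessed with brackets rather than attribute syntax. The sketch below is a hedged illustration on a hypothetical toy frame `df_demo`; the values are illustrative and the results do not describe the real dataset.

```python
import pandas as pd

# Toy rows with illustrative values (assumption: not read from pokemon_final.csv).
df_demo = pd.DataFrame({
    "type_1_name": ["fire", "fire", "water", "grass"],
    "weight": [85, 190, 90, 69],
    "attack": [52, 64, 48, 49],
    "special-attack": [60, 80, 50, 65],
})

# Pairwise Pearson correlations between the selected numeric columns.
corr_matrix = df_demo[["weight", "attack", "special-attack"]].corr()

# Compare the mean special-attack of fire types against all other types.
is_fire = df_demo["type_1_name"] == "fire"
fire_mean = df_demo.loc[is_fire, "special-attack"].mean()
other_mean = df_demo.loc[~is_fire, "special-attack"].mean()

print(corr_matrix)
print(fire_mean, other_mean)
```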

### Advanced Explorations

- Is there a relationship between a Pokémon's speed and its type?
- How does the distribution of attack values compare between physical attackers (high attack) and special attackers (high special-attack)?
- What is the overall distribution of Pokémon by `species_id`? Are some species more represented than others?
- Which type combinations (`type_1_name` and `type_2_name`) appear most frequently?
- What insights can we gain from comparing the average stats of starter Pokémon (Bulbasaur, Charmander, Squirtle, etc.) versus legendary Pokémon?
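
As one example from this list, type combinations can be counted by grouping on both type columns. The sketch below fills missing secondary types with the placeholder string `"none"` (an assumption made here so that single-type Pokémon are counted too) and then counts each pair. `df_demo` is a hypothetical toy frame with illustrative values.

```python
import pandas as pd

# Toy rows with illustrative values (assumption: not read from pokemon_final.csv).
df_demo = pd.DataFrame({
    "type_1_name": ["grass", "grass", "fire", "fire", "water"],
    "type_2_name": ["poison", "poison", None, "flying", None],
})

# Replace missing secondary types with a placeholder, then count each (type_1, type_2) pair.
combo_counts = (
    df_demo.fillna({"type_2_name": "none"})
    .groupby(["type_1_name", "type_2_name"])
    .size()
    .sort_values(ascending=False)
)

print(combo_counts)
```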