
# 🦸‍♀️ Superhero Mini‑Project: How Tall is your average SuperHero?
**Duration:** 15–20 minutes  
**Tools:** Python, pandas, matplotlib  
**Dataset:** `heroes_information.csv` (already provided)

Welcome! Today you're a **data detective**. We'll use real superhero data to answer questions like:

- Who are the **tallest heroes** on average — Marvel or DC?
- How can we **clean messy data** (like wrong heights) so our answers are trustworthy?
- How can we get **subsets** of data (e.g., *all female heroes*)?

By the end, you'll have loaded, cleaned, explored, and **visualized** superhero data using Python.



## 🎯 Learning Goals
By the end of this mini‑project, you will:
1. Load and inspect a real dataset using **pandas**
2. Identify and fix **data quality** issues (like `-99.0` in height/weight)
3. Create **subsets** of your data using filters (e.g., all female heroes)
4. Compare groups using **groupby** (e.g., average height by publisher)
5. Visualize your results with **matplotlib**



## 1) Setup & Imports (Run me first)


In [None]:

import pandas as pd
import matplotlib.pyplot as plt

# Make plots show up in the notebook
%matplotlib inline

# Optional: widen the display a bit for readability
pd.set_option('display.max_columns', 50)



## 2) Load the Dataset
Let's read the CSV file into a pandas DataFrame and peek at the first few rows.

To do this, first add the data to your Colab notebook:
1. Click the file folder icon in the left side panel
2. Right-click the white space and create a new folder
3. Rename the new folder `data`
4. Download [this linked file](https://drive.google.com/file/d/1JYQtyfhhDFLJxM8oKmIlcAIADdAnFlIQ/view?usp=drive_link) from Google Drive
5. Click the upload button and place the file in your `data` folder



In [None]:
# If you've followed the steps above, this code loads the data
csv_path = "data/heroes_information.csv"

# Name the DataFrame `heroes` so it matches every later cell
heroes = pd.read_csv(csv_path)
heroes.head()



## 3) First Look: What's in our Data?
We'll inspect the **columns**, **types**, and some basic **statistics**.


In [None]:

heroes.info()


In [None]:
# Let's see what the data looks like
heroes.head(20)
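
The cells above cover columns and types. For the basic statistics, a minimal sketch on a made-up toy table might look like the one below. The `toy` DataFrame and its values are assumptions for illustration only; with the real data you would call the same methods on `heroes`.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female", "Male"],
    "Height": [170.0, 185.0, 190.0, 165.0, 180.0],
})

# describe() summarizes a numeric column: count, mean, std, min, quartiles, max
stats = toy["Height"].describe()
print(stats)

# value_counts() counts how often each category appears in a text column
gender_counts = toy["Gender"].value_counts()
print(gender_counts)
```

`describe()` suits numeric columns like `Height`, while `value_counts()` is usually more informative for text columns like `Gender`.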



### 🔎 Data Quality Check
Notice anything odd? For example, sometimes datasets use placeholder values like **`-99.0`** for **unknown** height/weight.  
If we don't fix that, our averages and charts will be misleading.
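
Before changing anything, it can help to count how many placeholder values there are. The sketch below uses a small made-up table (the `toy` DataFrame is an assumption, not the real dataset); the same comparison should work on `heroes[['Height', 'Weight']]`.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "Height": [180.0, -99.0, 165.0, -99.0],
    "Weight": [80.0, 70.0, -99.0, 95.0],
})

# Comparing to -99.0 gives True/False per cell; summing counts the True values per column
placeholder_counts = (toy[["Height", "Weight"]] == -99.0).sum()
print(placeholder_counts)
```

If a column has many placeholders, filling them all with a single value will noticeably reshape that column, which is worth keeping in mind when reading the results.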

👉 **Plan:** Replace `-99.0` with `NaN` (pandas' missing value), then fill missing heights/weights with the **median** value.
- **Why median?** It's less affected by extreme outliers than the mean.
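
To see why the median is more robust, here is a small sketch with made-up heights, one of which is an extreme outlier. The numbers are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Four ordinary heights plus one giant (made-up values, in cm)
heights = pd.Series([170.0, 175.0, 180.0, 185.0, 900.0])

mean_height = heights.mean()      # pulled upward by the 900 cm outlier
median_height = heights.median()  # the middle value, unaffected by the outlier

print("Mean:  ", mean_height)    # 322.0
print("Median:", median_height)  # 180.0
```

One giant raises the mean far above every ordinary height, while the median still lands on a typical value.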



## 4) Clean the Data
We'll replace placeholder values and handle missing values so our analysis is fair.


In [None]:

# Replace the placeholder value (-99.0) with NaN so pandas treats it as missing
heroes = heroes.replace(-99.0, float('nan'))

# Make sure both columns are numeric (anything unparseable becomes NaN)
heroes['Height'] = pd.to_numeric(heroes['Height'], errors='coerce')
heroes['Weight'] = pd.to_numeric(heroes['Weight'], errors='coerce')

# Fill missing height/weight with the column median (robust to outliers)
heroes['Height'] = heroes['Height'].fillna(heroes['Height'].median())
heroes['Weight'] = heroes['Weight'].fillna(heroes['Weight'].median())

# Quick re-check
heroes[['Height','Weight']].describe()


- After this cleanup, the single `describe()` call above summarizes both columns. Its `min` and `max` rows give the height and weight of the shortest, tallest, lightest, and heaviest heroes.




In [None]:
# Let's find out the names of these heroes
tallest_hero = heroes[heroes['Height'] == heroes['Height'].max()]
shortest_hero = heroes[heroes['Height'] == heroes['Height'].min()]
heaviest_hero = heroes[heroes['Weight'] == heroes['Weight'].max()]
lightest_hero = heroes[heroes['Weight'] == heroes['Weight'].min()]

# If several heroes tie, .values[0] reports only the first match
print(f"The tallest hero is: {tallest_hero['name'].values[0]} with a height of {tallest_hero['Height'].values[0]} cm")
print(f"The shortest hero is: {shortest_hero['name'].values[0]} with a height of {shortest_hero['Height'].values[0]} cm")
print(f"The heaviest hero is: {heaviest_hero['name'].values[0]} with a weight of {heaviest_hero['Weight'].values[0]} kg")
print(f"The lightest hero is: {lightest_hero['name'].values[0]} with a weight of {lightest_hero['Weight'].values[0]} kg")





## 5) Getting a Subset: All Female Heroes
**Question:** How can we get a subset of **all female** characters?  
We use a **Boolean filter** on the `Gender` column.


In [None]:

female_heroes = heroes[heroes['Gender'] == 'Female']


len_female = len(female_heroes)
len_total = len(heroes)
print(f"Female heroes: {len_female} out of {len_total} total heroes")
female_heroes.head()



> ✅ **Try it yourself:** Change `"Female"` to `"Male"` or try another column like `Alignment` (e.g., `"good"`, `"bad"`, `"neutral"`).


In [None]:
# create a boolean mask for a new filtered dataset
boolean_mask = heroes['Gender'] == 'Female' # update this to filter by a different column
filtered_heroes = heroes[boolean_mask]


len_filtered = len(filtered_heroes)
len_total = len(heroes)
print(f"Filtered heroes: {len_filtered} out of {len_total} total heroes")
filtered_heroes.head()
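
Boolean masks can also be combined. The sketch below, on a made-up toy table (an assumption for illustration), keeps only rows that pass two conditions at once. Each condition needs its own parentheses, and pandas uses `&` for "and" and `|` for "or".

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "Gender": ["Female", "Male", "Female", "Female"],
    "Alignment": ["good", "good", "bad", "good"],
})

# Both conditions must be True for a row to be kept
mask = (toy["Gender"] == "Female") & (toy["Alignment"] == "good")
good_female = toy[mask]

print(good_female["name"].tolist())  # ['A', 'D']
```

The same pattern should carry over to `heroes`, for example when narrowing the data to good heroes from a single publisher.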


## 6) Who Publishes the Tallest Heroes (on average)?
We'll compare the **average height** grouped by `Publisher`.  
Then we'll visualize the result with a simple bar chart.


In [None]:

publisher_height = heroes.groupby('Publisher', dropna=True)['Height'].mean().sort_values(ascending=False)
publisher_height
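
An average can be misleading when a group contains only a handful of heroes. As a hedged sketch, the made-up toy table below (publisher names and heights are assumptions) shows how `agg` can report the group size next to the mean.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Publisher": ["Pub A", "Pub A", "Pub A", "Pub B"],
    "Height": [170.0, 180.0, 190.0, 250.0],
})

# For each publisher, compute the mean height and how many heroes it is based on
summary = (
    toy.groupby("Publisher")["Height"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Here "Pub B" tops the ranking, but its average rests on a single hero, so the `count` column is worth checking before drawing conclusions.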


In [None]:

# Bar chart of average height by publisher
publisher_height.plot(kind='bar', figsize=(9,4))
plt.title("Average Hero Height by Publisher")
plt.xlabel("Publisher")
plt.ylabel("Average Height (cm)")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()



### (Optional View) Marvel vs DC Focus
Sometimes it's easier to compare the two biggest publishers directly.


In [None]:

top_publishers = heroes[heroes['Publisher'].isin(['Marvel Comics', 'DC Comics'])]
marvel_dc_height = top_publishers.groupby('Publisher')['Height'].mean().sort_values(ascending=False)
marvel_dc_height


In [None]:

marvel_dc_height.plot(kind='bar', figsize=(6,4))
plt.title("Average Hero Height: Marvel vs DC")
plt.xlabel("Publisher")
plt.ylabel("Average Height (cm)")
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()


In [None]:
# Let's add a scatter plot of height vs. weight, color-coding Marvel and DC characters

# First, we'll create a new DataFrame that only includes the heroes from Marvel and DC
marvel_dc_heroes = heroes[heroes['Publisher'].isin(['Marvel Comics', 'DC Comics'])]

# Map each publisher to a color (red = Marvel, blue = DC)
point_colors = marvel_dc_heroes['Publisher'].map({'Marvel Comics': 'red', 'DC Comics': 'blue'})

# Now, we'll create a scatter plot of height vs. weight
plt.figure(figsize=(10, 6))
plt.scatter(marvel_dc_heroes['Height'], marvel_dc_heroes['Weight'], c=point_colors)
plt.title('Height vs. Weight: Marvel (red) vs DC (blue)')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.tight_layout()
plt.show()




## 7) Reflect & Explain 
Now's your chance to explain what you've found in your data. 

- What surprised you about the results?
- Why was **data cleaning** important today?
- If you changed how we filled missing values (mean vs median), how might the results differ?
- (Bonus thought) How would the answer change if we only looked at **good** heroes?

---

#### Double-click this cell and erase this text to write your explanation.