# 50 Problems: Data Cleaning & Analysis Bootcamp

**Role:** Junior Data Scientist
**Mission:** You have received two messy datasets from the field. Your job is to clean, standardize, and analyze them.

**The Mess:**
- Duplicate rows
- Missing values (NaN)
- Inconsistent text ("Apple" vs "apple")
- Mixed units (mph vs km/h)
- Outliers (data errors)

---

## LEVEL 1: Inspection & Triage
First, we need to see how bad the damage is. We'll use `../data/fastest-animals.csv`.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

### Problem 1
**Task:** Load `../data/fastest-animals.csv` into a variable `df_animals`.

In [None]:
# YOUR CODE HERE


### Problem 2
**Task:** Display the first 10 rows. Notice the messy data (NaNs, duplicates).

In [None]:
# YOUR CODE HERE


### Problem 3
**Task:** Check the data types of each column (`.dtypes`). Is 'Speed' a number or an object (string)?

In [None]:
# YOUR CODE HERE


### Problem 4
**Task:** Check for missing values in the entire DataFrame (`.isnull().sum()`).

In [None]:
# YOUR CODE HERE


### Problem 5
**Task:** Check for duplicate rows (`.duplicated().sum()`).

In [None]:
# YOUR CODE HERE


### Problem 6
**Task:** Drop the duplicate rows from `df_animals` (update the variable).
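
**Hint:** A minimal sketch of the count-then-drop pattern, using a made-up `df_demo` table (the `city` and `temp_c` columns are hypothetical, not from the bootcamp file). `drop_duplicates()` returns a new DataFrame by default, so the result has to be assigned back.

```python
import pandas as pd

# Hypothetical toy data, not the bootcamp file.
df_demo = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Cairo"],
    "temp_c": [4.0, 19.0, 4.0, 30.0],
})

# Count rows that are exact copies of an earlier row.
n_dupes = df_demo.duplicated().sum()

# drop_duplicates() does not modify df_demo in place, so reassign.
df_demo = df_demo.drop_duplicates()

print(n_dupes, len(df_demo))
```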

In [None]:
# YOUR CODE HERE


### Problem 7
**Task:** Display the rows where `Speed` is missing (NaN).

In [None]:
# YOUR CODE HERE


### Problem 8
**Task:** Drop the rows where `Speed` is missing.

In [None]:
# YOUR CODE HERE


### Problem 9
**Task:** We still have a missing 'Unit' for the Greyhound. Fill all missing 'Unit' values with 'km/h' (assuming that's the default).
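
**Hint:** One way to fill gaps in a single column, sketched on a made-up `df_demo` table (the `item` and `unit` columns and the default `"m"` are assumptions for illustration). Calling `fillna` on one column leaves missing values in other columns untouched.

```python
import pandas as pd

# Hypothetical toy data: one row is missing its unit.
df_demo = pd.DataFrame({
    "item": ["rope", "nail", "tape"],
    "unit": ["m", None, "m"],
})

# Fill only the 'unit' column, then assign the result back.
df_demo["unit"] = df_demo["unit"].fillna("m")

print(df_demo)
```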

In [None]:
# YOUR CODE HERE


### Problem 10
**Task:** Verify that your DataFrame is now free of NaNs and duplicates.

In [None]:
# YOUR CODE HERE


---
## LEVEL 2: The Logic Challenge (Standardization)
We have a problem. Some animals are measured in `mph` and some in `km/h`. We need to convert everyone to `km/h`.

### Problem 11
**Task:** Get a list of all unique values in the `Unit` column to see what we are dealing with.

In [None]:
# YOUR CODE HERE


### Problem 12
**Task:** Create a boolean mask (a Series of True/False values, one per row) that is True for rows where `Unit` is 'mph'.

In [None]:
# YOUR CODE HERE


### Problem 13
**Task:** Ensure the `Speed` column is numeric. If it loaded as a string/object, convert it to float.
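
**Hint:** A small sketch of converting a text column to numbers with `pd.to_numeric`, on a made-up `price` column (hypothetical data). With `errors="coerce"`, entries that cannot be parsed become NaN instead of raising an error, so it is worth re-checking for missing values afterwards.

```python
import pandas as pd

# Hypothetical toy data: numbers stored as text, plus one junk entry.
df_demo = pd.DataFrame({"price": ["10.5", "7", "n/a"]})

# Unparseable strings such as "n/a" are turned into NaN.
df_demo["price"] = pd.to_numeric(df_demo["price"], errors="coerce")

print(df_demo["price"].dtype)         # float64
print(df_demo["price"].isna().sum())  # 1
```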

In [None]:
# YOUR CODE HERE


### Problem 14
**Task:** This is the hard one. Update the `Speed` column: **IF** the unit is 'mph', multiply the speed by `1.609` (1 mile is approximately 1.609 km). Rows already in 'km/h' must stay unchanged.
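
**Hint:** The usual pattern for "change only some rows" is a boolean mask combined with `.loc[mask, column]`. The sketch below shows it on a made-up inches-to-centimetres table (the column names and the `2.54` factor belong to this toy example, not to the animals data).

```python
import pandas as pd

# Hypothetical toy data with mixed units.
df_demo = pd.DataFrame({
    "part": ["bolt", "pipe", "rod"],
    "length": [2.0, 30.0, 10.0],
    "unit": ["in", "cm", "in"],
})

# True for the rows that need converting, False otherwise.
is_inches = df_demo["unit"] == "in"

# Update only the masked rows; the 'cm' row keeps its value.
df_demo.loc[is_inches, "length"] = df_demo.loc[is_inches, "length"] * 2.54

print(df_demo)
```

Writing through `.loc` in a single step avoids chained indexing such as `df_demo[is_inches]["length"] = ...`, which may update a temporary copy and leave the original unchanged.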

In [None]:
# YOUR CODE HERE


### Problem 15
**Task:** Now that all numbers are converted, change all 'mph' values in the `Unit` column to 'km/h'.

In [None]:
# YOUR CODE HERE


### Problem 16
**Task:** Rename the `Speed` column to `Speed_kmh` and drop the `Unit` column (since it's all the same now).
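
**Hint:** Renaming and dropping can be chained, as in this sketch on a made-up table (hypothetical `len` and `unit` columns). Both methods return a new DataFrame, so the result is assigned back.

```python
import pandas as pd

# Hypothetical toy data where every row already uses the same unit.
df_demo = pd.DataFrame({
    "name": ["a", "b"],
    "len": [1.0, 2.0],
    "unit": ["cm", "cm"],
})

# Put the unit into the column name, then drop the redundant column.
df_demo = df_demo.rename(columns={"len": "len_cm"}).drop(columns=["unit"])

print(df_demo.columns.tolist())
```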

In [None]:
# YOUR CODE HERE


### Problem 17
**Task:** Sort the animals by speed (fastest on top). Who is the fastest?

In [None]:
# YOUR CODE HERE


### Problem 18
**Task:** Save this clean dataset to `../data/clean_animals.csv` (without the index).

In [None]:
# YOUR CODE HERE


### Problem 19
**Task:** Group the animals by `Class` (Mammal, Bird, etc.) and find the average speed for each class.
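
**Hint:** A minimal groupby sketch on made-up data (the `dept` and `score` columns are hypothetical). Selecting the numeric column before calling `.mean()` keeps text columns out of the calculation.

```python
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame({
    "dept": ["A", "B", "A", "B"],
    "score": [80.0, 60.0, 90.0, 70.0],
})

# One mean per group; the result is a Series indexed by 'dept'.
avg_by_dept = df_demo.groupby("dept")["score"].mean()

print(avg_by_dept)
```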

In [None]:
# YOUR CODE HERE


### Problem 20
**Task:** Visualize the average speed per class using a Bar Chart.

In [None]:
# YOUR CODE HERE


---
## LEVEL 3: String Cleaning & Outliers
Now we switch to `../data/fruits-weights.csv`. This data has typo issues.

### Problem 21
**Task:** Load `../data/fruits-weights.csv` into `df_fruits`.

In [None]:
# YOUR CODE HERE


### Problem 22
**Task:** Look at the unique values in the `fruit` column. Notice "Apple", "apple", "Banana", "BANANA".

In [None]:
# YOUR CODE HERE


### Problem 23
**Task:** Convert the `fruit` column to lowercase using `.str.lower()`.

In [None]:
# YOUR CODE HERE


### Problem 24
**Task:** Strip any extra whitespace from the `fruit` column using `.str.strip()`.
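
**Hint:** The `.str` methods can be chained. The sketch below normalises a made-up `color` column (hypothetical data) so that spelling variants collapse into one value each.

```python
import pandas as pd

# Hypothetical toy data with inconsistent case and stray spaces.
df_demo = pd.DataFrame({"color": ["Red", " red", "BLUE ", "blue"]})

# Lowercase first, then trim leading and trailing whitespace.
df_demo["color"] = df_demo["color"].str.lower().str.strip()

print(df_demo["color"].unique())
```

Note that `.str.strip()` only removes whitespace at the ends of each string, not spaces in the middle.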

In [None]:
# YOUR CODE HERE


### Problem 25
**Task:** Check unique values again. They should be clean now (e.g., only one 'apple').

In [None]:
# YOUR CODE HERE


### Problem 26
**Task:** Drop any rows with missing weight values.

In [None]:
# YOUR CODE HERE


### Problem 27
**Task:** Sort by weight. Do you see a crazy value? (Outlier detection).

In [None]:
# YOUR CODE HERE


### Problem 28
**Task:** That 3000 g Grapefruit is an error. Filter the DataFrame to keep only fruits weighing less than 1000 g.

In [None]:
# YOUR CODE HERE


### Problem 29
**Task:** Group by `fruit` and calculate the average weight.

In [None]:
# YOUR CODE HERE


### Problem 30
**Task:** Plot a bar chart of the average fruit weights.

In [None]:
# YOUR CODE HERE


---
## LEVEL 4: Real World Analysis (Aggregation)
Back to `df_animals`. Let's ask some statistical questions.

### Problem 31
**Task:** What is the median speed of all animals?

In [None]:
# YOUR CODE HERE


### Problem 32
**Task:** What is the standard deviation of the speeds?

In [None]:
# YOUR CODE HERE


### Problem 33
**Task:** Get the descriptive statistics for the 'Fish' class only.

In [None]:
# YOUR CODE HERE


### Problem 34
**Task:** Filter for animals faster than 100 km/h AND that are NOT birds.
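
**Hint:** Conditions are combined with `&` (and), `|` (or) and `~` (not), and each condition needs its own parentheses. A sketch on made-up vehicle data (all names and values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame({
    "name": ["sedan", "city bike", "van", "express"],
    "kind": ["car", "bike", "car", "train"],
    "top_kmh": [180, 40, 90, 300],
})

# Python's 'and' / 'not' do not work element-wise on a Series; use & and != (or ~).
fast_non_cars = df_demo[(df_demo["top_kmh"] > 100) & (df_demo["kind"] != "car")]

print(fast_non_cars)
```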

In [None]:
# YOUR CODE HERE


### Problem 35
**Task:** Create a new column `Speed_Level`. Set it to 'High' if speed > 100, else 'Normal'.
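
**Hint:** `np.where(condition, value_if_true, value_if_false)` is one common way to build a two-valued label column. A sketch on made-up scores (the threshold and labels are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame({"score": [45, 72, 90]})

# Element-wise if/else: "Pass" where the condition holds, "Fail" elsewhere.
df_demo["grade"] = np.where(df_demo["score"] >= 70, "Pass", "Fail")

print(df_demo)
```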

In [None]:
# YOUR CODE HERE


### Problem 36
**Task:** Count how many animals are in each `Speed_Level`.

In [None]:
# YOUR CODE HERE


### Problem 37
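**Task:** Use `groupby` to find the slowest speed in each Class (`.min()` on `Speed_kmh`). Be careful: calling `.min()` on the whole grouped frame takes the minimum of each column independently, so the `Animal` shown is just the alphabetically first name in the class, not necessarily the slowest animal.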
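
**Hint:** A sketch of the difference between a per-column minimum and the row that actually holds the minimum, on made-up data (`team`, `player` and `points` are hypothetical). `idxmin()` returns the index label of the smallest value in each group, which `.loc` can then use to pull the full rows.

```python
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame({
    "team": ["red", "red", "blue", "blue"],
    "player": ["Zoe", "Adam", "Yuki", "Ben"],
    "points": [11.0, 14.0, 12.5, 13.0],
})

# Per-column minimum: for 'red' this pairs "Adam" (alphabetical minimum)
# with 11.0, even though 11.0 is Zoe's value.
col_min = df_demo.groupby("team").min()

# Row-wise answer: look up the row holding the lowest points in each team.
lowest_rows = df_demo.loc[df_demo.groupby("team")["points"].idxmin()]

print(col_min)
print(lowest_rows)
```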

In [None]:
# YOUR CODE HERE


### Problem 38
**Task:** Create a pivot table: Index='Class', Values='Speed_kmh', Aggfunc='mean'.
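
**Hint:** A minimal `pivot_table` sketch on made-up data (hypothetical `dept` and `score` columns). With a single index and a single value column, the numbers match a `groupby(...).mean()`, but the result comes back as a DataFrame.

```python
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame({
    "dept": ["A", "B", "A", "B"],
    "score": [80.0, 60.0, 90.0, 70.0],
})

# Rows are the unique 'dept' values; the cell is the mean 'score'.
pivot = df_demo.pivot_table(index="dept", values="score", aggfunc="mean")

print(pivot)
```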

In [None]:
# YOUR CODE HERE


### Problem 39
**Task:** Who is the fastest Mammal? (Filter for Mammals, then find the max).

In [None]:
# YOUR CODE HERE


### Problem 40
**Task:** Create a histogram of all animal speeds. Use 5 bins.

In [None]:
# YOUR CODE HERE


---
## LEVEL 5: Mastery Challenge
Combine everything. Clean. Analyze. Visualize.

### Problem 41
**Task:** Create a scatter plot of `Animal` vs `Speed_kmh` (the column was renamed in Problem 16). This will be messy, but try it.

In [None]:
# YOUR CODE HERE


### Problem 42
**Task:** That plot is too crowded. Filter for only the top 5 fastest animals and plot them as a bar chart.

In [None]:
# YOUR CODE HERE


### Problem 43
**Task:** Add a title "Top 5 Fastest Animals" and label the Y-axis "Speed (km/h)".

In [None]:
# YOUR CODE HERE


### Problem 44
**Task:** Make the bars green.

In [None]:
# YOUR CODE HERE


### Problem 45
**Task:** Save the figure as `top_5_animals.png`.

In [None]:
# YOUR CODE HERE


### Problem 46
**Task:** Create a function `is_fast(speed)` that returns "YES" if speed > 100, else "NO".

In [None]:
# YOUR CODE HERE


### Problem 47
**Task:** Apply this function to the `Speed_kmh` column to create a new column `Fast_Check`.
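
**Hint:** `Series.apply` calls a function once per value and collects the results into a new Series. A sketch with a made-up function and column (`is_long` and `length_cm` are hypothetical):

```python
import pandas as pd

def is_long(length_cm):
    """Return "YES" for lengths above 50 cm, otherwise "NO"."""
    return "YES" if length_cm > 50 else "NO"

# Hypothetical toy data.
df_demo = pd.DataFrame({"length_cm": [12.0, 75.0, 50.0]})

# Pass the function itself (no parentheses); pandas calls it for each value.
df_demo["long_check"] = df_demo["length_cm"].apply(is_long)

print(df_demo)
```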

In [None]:
# YOUR CODE HERE


### Problem 48
**Task:** Use a Pie Chart to show the proportion of "YES" vs "NO" in `Fast_Check`.

In [None]:
# YOUR CODE HERE


### Problem 49
**Task:** Randomly sample 3 rows from the dataframe (`.sample()`).

In [None]:
# YOUR CODE HERE


### Problem 50
**Task:** Final Report. Print a string summary: "We analyzed [X] animals. The average speed is [Y] km/h."
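
**Hint:** An f-string can embed computed values directly and round them with a format specifier such as `:.1f`. A sketch on made-up scores (the data and wording are hypothetical):

```python
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame({"score": [80.0, 60.0, 90.0, 70.0]})

count = len(df_demo)            # number of rows
avg = df_demo["score"].mean()   # arithmetic mean of the column

# :.1f formats the mean with one digit after the decimal point.
summary = f"We looked at {count} scores. The mean is {avg:.1f}."
print(summary)
```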

In [None]:
# YOUR CODE HERE
