# 50 Problems: Data Cleaning & Analysis Bootcamp

**Role:** Junior Data Scientist
**Mission:** You have received two messy datasets from the field. Your job is to clean, standardize, and analyze them.

**The Mess:**
- Duplicate rows
- Missing values (NaN)
- Inconsistent text ("Apple" vs "apple")
- Mixed units (mph vs km/h)
- Outliers (data errors)

---

## LEVEL 1: Inspection & Triage
First, we need to see how bad the damage is. We'll use `../data/fastest-animals.csv`.

In [102]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


### Problem 1
**Task:** Load `../data/fastest-animals.csv` into a variable `df_animals`.

In [103]:
# YOUR CODE HERE
# Load the CSV into a pandas DataFrame
df_animals = pd.read_csv('../data/fastest-animals.csv')


In [104]:
df_animals.head()


Unnamed: 0,Animal,Class,Speed,Unit
0,Cheetah,Mammal,120.0,km/h
1,Peregrine Falcon,Bird,389.0,km/h
2,Cheetah,Mammal,120.0,km/h
3,Sailfish,Fish,68.0,mph
4,Black Marlin,Fish,,mph


### Problem 2
**Task:** Display the first 10 rows. Notice the messy data (NaNs, duplicates).

In [105]:
# YOUR CODE HERE
df_animals.head(10)


Unnamed: 0,Animal,Class,Speed,Unit
0,Cheetah,Mammal,120.0,km/h
1,Peregrine Falcon,Bird,389.0,km/h
2,Cheetah,Mammal,120.0,km/h
3,Sailfish,Fish,68.0,mph
4,Black Marlin,Fish,,mph
5,Pronghorn,Mammal,88.5,km/h
6,Springbok,Mammal,55.0,mph
7,Cheetah,Mammal,75.0,mph
8,Golden Eagle,Bird,320.0,km/h
9,Peregrine Falcon,Bird,240.0,mph


### Problem 3
**Task:** Check the data types of each column (`.dtypes`). Is 'Speed' a number or an object (string)?

In [106]:
# YOUR CODE HERE
df_animals.dtypes


Animal     object
Class      object
Speed     float64
Unit       object
dtype: object

`Speed` is a float (`float64`), so it loaded as a number rather than an object (string).

### Problem 4
**Task:** Check for missing values in the entire DataFrame (`.isnull().sum()`).

In [107]:
# YOUR CODE HERE
df_animals.isnull().sum()


Animal    0
Class     0
Speed     1
Unit      1
dtype: int64

### Problem 5
**Task:** Check for duplicate rows (`.duplicated().sum()`).

In [108]:
# YOUR CODE HERE
df_animals.duplicated().sum()


np.int64(1)

### Problem 6
**Task:** Drop the duplicate rows from `df_animals` (update the variable).

In [109]:
# YOUR CODE HERE
df_animals = df_animals.drop_duplicates()


### Problem 7
**Task:** Display the rows where `Speed` is missing (NaN).

In [110]:
# YOUR CODE HERE
rows_null = df_animals[df_animals['Speed'].isna()]
rows_null


Unnamed: 0,Animal,Class,Speed,Unit
4,Black Marlin,Fish,,mph


### Problem 8
**Task:** Drop the rows where `Speed` is missing.

In [111]:
# YOUR CODE HERE
df_animals = df_animals.dropna(subset=['Speed'])


### Problem 9
**Task:** We still have a missing 'Unit' for the Greyhound. Fill all missing 'Unit' values with 'km/h' (assuming that's the default).

In [112]:
# YOUR CODE HERE
df_animals['Unit'] = df_animals['Unit'].fillna('km/h')


### Problem 10
**Task:** Verify your dataframe is now clean of NaNs and Duplicates.

In [156]:
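# YOUR CODE HERE
# Note: a cell displays only its last expression, so the duplicate count below is computed but not shown (wrap it in print() to see it).
df_animals.duplicated().sum()
df_animals.isna().sum()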


Animal    0
Class     0
Speed     0
Unit      0
dtype: int64

--- 
## LEVEL 2: The Logic Challenge (Standardization)
We have a problem. Some animals are measured in `mph` and some in `km/h`. We need to convert everyone to `km/h`.

### Problem 11
**Task:** Get a list of all unique values in the `Unit` column to see what we are dealing with.

In [114]:
# YOUR CODE HERE


### Problem 12
**Task:** Create a boolean mask (a Series of True/False values) for rows where `Unit` is 'mph'.

In [115]:
# YOUR CODE HERE


### Problem 13
**Task:** Ensure the `Speed` column is numeric. If it loaded as a string/object, convert it to float.

In [116]:
# YOUR CODE HERE
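
One possible way to force a column to numeric is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN. The sketch below runs on a small made-up Series (`raw` is a hypothetical name, not part of the dataset), so treat it as an illustration rather than the expected answer.

```python
import pandas as pd

# Made-up values for illustration only; the last entry cannot be parsed as a number.
raw = pd.Series(["120", "88.5", "fast"])

# errors="coerce" converts unparseable strings to NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.dtype)
print(numeric)
```

Coercing can silently create new NaNs, so it is worth re-checking `.isna().sum()` afterwards.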


### Problem 14
**Task:** This is the hard one. Update the `Speed` column: **IF** the unit is 'mph', multiply the speed by `1.609`.

In [117]:
# YOUR CODE HERE
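
A common pattern for this kind of conditional update is a boolean mask combined with `.loc`. The sketch below uses a small made-up DataFrame (`toy` and `is_mph` are hypothetical names) and the same `1.609` factor as the task; it is one way to do it, not the only one.

```python
import pandas as pd

# Made-up sample rows (not the real dataset), used only to show the pattern.
toy = pd.DataFrame({
    "Animal": ["A", "B", "C"],
    "Speed": [100.0, 50.0, 10.0],
    "Unit": ["km/h", "mph", "mph"],
})

# Boolean mask: True for rows measured in mph.
is_mph = toy["Unit"] == "mph"

# Update only the masked rows; km/h rows are left untouched.
toy.loc[is_mph, "Speed"] = toy.loc[is_mph, "Speed"] * 1.609

print(toy)
```

Running this cell twice on the same DataFrame would convert the mph rows twice, which is one reason to update the `Unit` column right after converting.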


### Problem 15
**Task:** Now that all numbers are converted, change all 'mph' values in the `Unit` column to 'km/h'.

In [118]:
# YOUR CODE HERE


### Problem 16
**Task:** Rename the `Speed` column to `Speed_kmh` and drop the `Unit` column (since it's all the same now).

In [119]:
# YOUR CODE HERE
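
As a rough sketch of the relabel, rename, and drop steps from Problems 15 and 16, here is the sequence on a made-up DataFrame (`toy` and `units_after` are hypothetical names):

```python
import pandas as pd

# Made-up rows whose speeds are assumed to be already converted to km/h.
toy = pd.DataFrame({
    "Animal": ["A", "B"],
    "Speed": [100.0, 80.45],
    "Unit": ["km/h", "mph"],
})

# Relabel the remaining 'mph' entries now that the numbers are in km/h.
toy["Unit"] = toy["Unit"].replace("mph", "km/h")
units_after = toy["Unit"].tolist()

# Put the unit into the column name, then drop the now-redundant Unit column.
toy = toy.rename(columns={"Speed": "Speed_kmh"}).drop(columns=["Unit"])

print(toy.columns.tolist())
```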


### Problem 17
**Task:** Sort the animals by speed (fastest on top). Who is the fastest?

In [120]:
# YOUR CODE HERE


### Problem 18
**Task:** Save this clean dataset to `../data/clean_animals.csv` (without the index).

In [121]:
# YOUR CODE HERE
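
To see what `index=False` changes, the sketch below writes a made-up DataFrame to an in-memory buffer instead of a file (the buffer and the `toy` data are assumptions for illustration; the task itself writes to `../data/clean_animals.csv`).

```python
import io

import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({"Animal": ["A", "B"], "Speed_kmh": [100.0, 80.45]})

# An in-memory text buffer stands in for the output file here.
buffer = io.StringIO()

# index=False leaves out the row labels (0, 1, ...) from the CSV.
toy.to_csv(buffer, index=False)

csv_text = buffer.getvalue()
print(csv_text)
```

Without `index=False`, the row labels are written as an extra unnamed first column, which then reappears as `Unnamed: 0` when the file is read back.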


### Problem 19
**Task:** Group the animals by `Class` (Mammal, Bird, etc.) and find the average speed for each class.

In [122]:
# YOUR CODE HERE
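
A minimal sketch of the group-then-aggregate pattern, on made-up rows (`toy` and `avg_by_class` are hypothetical names):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Class": ["Mammal", "Mammal", "Bird"],
    "Speed_kmh": [100.0, 60.0, 300.0],
})

# Split rows by Class, pick the speed column, and average within each group.
avg_by_class = toy.groupby("Class")["Speed_kmh"].mean()

print(avg_by_class)
```

The result is a Series indexed by `Class`, which can be passed directly to a plotting call such as `avg_by_class.plot(kind="bar")`.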


### Problem 20
**Task:** Visualize the average speed per class using a Bar Chart.

In [123]:
# YOUR CODE HERE


--- 
## LEVEL 3: String Cleaning & Outliers
Now we switch to `../data/fruits-weights.csv`. This data has typo issues.

### Problem 21
**Task:** Load `../data/fruits-weights.csv` into `df_fruits`.

In [124]:
# YOUR CODE HERE


### Problem 22
**Task:** Look at the unique values in the `fruit` column. Notice "Apple", "apple", "Banana", "BANANA".

In [125]:
# YOUR CODE HERE


### Problem 23
**Task:** Convert the `fruit` column to lowercase using `.str.lower()`.

In [126]:
# YOUR CODE HERE


### Problem 24
**Task:** Strip any extra whitespace from the `fruit` column using `.str.strip()`.

In [127]:
# YOUR CODE HERE
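
The two string steps from Problems 23 and 24 can be chained. The sketch below applies them to a made-up `fruit` column (the values are invented for illustration):

```python
import pandas as pd

# Made-up values mixing case differences and stray whitespace.
fruits = pd.DataFrame({"fruit": ["Apple", "apple ", " BANANA", "Banana"]})

# Lowercase first, then trim leading and trailing whitespace.
fruits["fruit"] = fruits["fruit"].str.lower().str.strip()

unique_fruits = sorted(fruits["fruit"].unique())
print(unique_fruits)
```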


### Problem 25
**Task:** Check unique values again. They should be clean now (e.g., only one 'apple').

In [128]:
# YOUR CODE HERE


### Problem 26
**Task:** Drop any rows with missing weight values.

In [129]:
# YOUR CODE HERE


### Problem 27
**Task:** Sort by weight. Do you see a crazy value? (Outlier detection).

In [130]:
# YOUR CODE HERE


### Problem 28
**Task:** That 3000g Grapefruit is an error. Filter the dataframe to only keep fruits weighing LESS than 1000.

In [131]:
# YOUR CODE HERE
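
Filtering by a threshold is a boolean mask applied to the whole DataFrame. In the sketch below the column name `weight` is an assumption (check the real column name in `df_fruits`), and the rows are made up:

```python
import pandas as pd

# Made-up rows; the column name 'weight' is assumed, not taken from the real file.
fruits = pd.DataFrame({
    "fruit": ["apple", "grapefruit", "banana"],
    "weight": [150.0, 3000.0, 120.0],
})

# Keep only rows strictly below the threshold; the outlier row is dropped.
kept = fruits[fruits["weight"] < 1000]

print(kept)
```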


### Problem 29
**Task:** Group by `fruit` and calculate the average weight.

In [132]:
# YOUR CODE HERE


### Problem 30
**Task:** Plot a bar chart of the average fruit weights.

In [133]:
# YOUR CODE HERE


--- 
## LEVEL 4: Real World Analysis (Aggregation)
Back to `df_animals`. Let's ask some statistical questions.

### Problem 31
**Task:** What is the median speed of all animals?

In [134]:
# YOUR CODE HERE


### Problem 32
**Task:** What is the standard deviation of the speeds?

In [135]:
# YOUR CODE HERE


### Problem 33
**Task:** Get the descriptive statistics for the 'Fish' class only.

In [136]:
# YOUR CODE HERE


### Problem 34
**Task:** Filter for animals faster than 100 km/h AND that are NOT birds.

In [137]:
# YOUR CODE HERE
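
Combining conditions in pandas uses `&` (and), `|` (or), and `~` (not), with each condition wrapped in parentheses. A small sketch on made-up rows (`toy` and `fast_non_birds` are hypothetical names):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Animal": ["A", "B", "C"],
    "Class": ["Mammal", "Bird", "Fish"],
    "Speed_kmh": [120.0, 389.0, 90.0],
})

# Each condition needs its own parentheses because & binds tighter than > and !=.
fast_non_birds = toy[(toy["Speed_kmh"] > 100) & (toy["Class"] != "Bird")]

print(fast_non_birds)
```

Python's plain `and` / `or` do not work element-wise on Series and raise an error here.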


### Problem 35
**Task:** Create a new column `Speed_Level`. Set it to 'High' if speed > 100, else 'Normal'.

In [138]:
# YOUR CODE HERE
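
One way to build a two-valued label column is `np.where(condition, value_if_true, value_if_false)`. The sketch below uses made-up speeds and also counts the labels, as Problem 36 asks (`toy` and `counts` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Made-up speeds for illustration only.
toy = pd.DataFrame({"Speed_kmh": [120.0, 90.0, 389.0, 100.0]})

# 'High' where the speed exceeds 100, 'Normal' everywhere else (100 itself is 'Normal').
toy["Speed_Level"] = np.where(toy["Speed_kmh"] > 100, "High", "Normal")

# Count how many rows fall into each label.
counts = toy["Speed_Level"].value_counts()

print(toy)
print(counts)
```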


### Problem 36
**Task:** Count how many animals are in each `Speed_Level`.

In [139]:
# YOUR CODE HERE


### Problem 37
**Task:** Use `groupby` to find the slowest animal in each Class (`.min()`).

In [140]:
# YOUR CODE HERE
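
A caveat worth checking: `.min()` on a grouped DataFrame takes the minimum of each column independently, so the `Animal` shown can come from a different row than the minimum speed. The sketch below, on made-up rows, contrasts that with `idxmin`, which is one possible way to recover the actual slowest row per class (all names here are hypothetical).

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Animal": ["Zed", "Abe", "Kit"],
    "Class": ["Mammal", "Mammal", "Bird"],
    "Speed_kmh": [50.0, 90.0, 200.0],
})

# Column-wise minimum: the slowest speed per class is correct...
naive = toy.groupby("Class").min()

# ...but 'Animal' is just the alphabetically first name in the group.
print(naive)

# idxmin returns the row label of the minimum speed in each group,
# which .loc then uses to pull the complete rows.
slowest_rows = toy.loc[toy.groupby("Class")["Speed_kmh"].idxmin()]
print(slowest_rows)
```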


### Problem 38
**Task:** Create a pivot table: Index='Class', Values='Speed_kmh', Aggfunc='mean'.

In [141]:
# YOUR CODE HERE
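
For comparison with the earlier `groupby`, here is a sketch of the same aggregation written as a pivot table on made-up rows (`toy` and `pivot` are hypothetical names):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Class": ["Mammal", "Mammal", "Bird"],
    "Speed_kmh": [100.0, 60.0, 300.0],
})

# One row per Class, one column per value, each cell holding the group mean.
pivot = toy.pivot_table(index="Class", values="Speed_kmh", aggfunc="mean")

print(pivot)
```

The numbers match `groupby("Class")["Speed_kmh"].mean()`; the difference is that `pivot_table` returns a DataFrame rather than a Series.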


### Problem 39
**Task:** Who is the fastest Mammal? (Filter for Mammals, then find the max).

In [142]:
# YOUR CODE HERE


### Problem 40
**Task:** Create a histogram of all animal speeds. Use 5 bins.

In [143]:
# YOUR CODE HERE


--- 
## LEVEL 5: Mastery Challenge
Combine everything. Clean. Analyze. Visualize.

### Problem 41
**Task:** Create a scatter plot of `Animal` vs `Speed_kmh` (this will be messy, but try it).

In [144]:
# YOUR CODE HERE


### Problem 42
**Task:** That plot is too crowded. Filter for only the top 5 fastest animals and plot them.

In [145]:
# YOUR CODE HERE


### Problem 43
**Task:** Add a title "Top 5 Fastest Animals" and label the Y-axis "Speed (km/h)".

In [146]:
# YOUR CODE HERE


### Problem 44
**Task:** Make the bars green.

In [147]:
# YOUR CODE HERE


### Problem 45
**Task:** Save the figure as `top_5_animals.png`.

In [148]:
# YOUR CODE HERE


### Problem 46
**Task:** Create a function `is_fast(speed)` that returns "YES" if speed > 100, else "NO".

In [149]:
# YOUR CODE HERE


### Problem 47
**Task:** Apply this function to the `Speed_kmh` column to create a new column `Fast_Check`.

In [150]:
# YOUR CODE HERE
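
Problems 46 and 47 fit together as "define a plain Python function, then apply it to every value in a column". A minimal sketch on made-up speeds (`toy` is a hypothetical name):

```python
import pandas as pd


def is_fast(speed):
    """Return "YES" for speeds above 100, otherwise "NO"."""
    return "YES" if speed > 100 else "NO"


# Made-up speeds for illustration only.
toy = pd.DataFrame({"Speed_kmh": [120.0, 90.0, 100.0]})

# .apply calls is_fast once per value and collects the results in a new column.
toy["Fast_Check"] = toy["Speed_kmh"].apply(is_fast)

print(toy)
```

For a simple threshold like this, a vectorized `np.where` gives the same labels and is usually faster on large data; `.apply` is the more general tool when the logic does not vectorize easily.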


### Problem 48
**Task:** Use a Pie Chart to show the proportion of "YES" vs "NO" in `Fast_Check`.

In [151]:
# YOUR CODE HERE


### Problem 49
**Task:** Randomly sample 3 rows from the dataframe (`.sample()`).

In [152]:
# YOUR CODE HERE


### Problem 50
**Task:** Final Report. Print a string summary: "We analyzed [X] animals. The average speed is [Y] km/h."

In [153]:
# YOUR CODE HERE
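
An f-string is one convenient way to assemble the report sentence. The sketch below uses made-up speeds and rounds the average to one decimal place, which is a formatting choice the task does not specify (`toy`, `n_animals`, `avg_speed`, and `summary` are hypothetical names).

```python
import pandas as pd

# Made-up speeds for illustration only.
toy = pd.DataFrame({"Animal": ["A", "B", "C"], "Speed_kmh": [100.0, 60.0, 200.0]})

n_animals = len(toy)                  # number of rows
avg_speed = toy["Speed_kmh"].mean()   # arithmetic mean of the speed column

# {avg_speed:.1f} formats the number with one digit after the decimal point.
summary = f"We analyzed {n_animals} animals. The average speed is {avg_speed:.1f} km/h."

print(summary)
```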
