# ai03bTasks
# Machine Learning: Decision Trees
## Data Cleaning

**Instructions:**
- Complete each task below by running the code cells
- Fill in the blanks and answer questions in markdown cells
- Save your work when finished
- Push this file to your GitHub "Machine Learning" Repo under the appropriate folder.

---
## Setup: Import Libraries and Load Data

Run this cell first to set up your environment.

In [None]:
import pandas as pd

# Load the Titanic dataset
df = pd.read_csv("Titanic Dataset.csv")

print("✓ Data loaded successfully!")
print(f"Original shape: {df.shape}")
print(f"\nOriginal columns: {df.columns.tolist()}")

---
## Task 1: Understand the Original Data

Before cleaning, let's see what we're working with.

### 1a. How many rows and columns are in the original dataset?

In [None]:
# TODO: Print the shape of the DataFrame
print(f"Rows: {df.shape[0]}")
print(f"Columns: {df.shape[1]}")

### 1b. Display the first few rows

In [None]:
# TODO: Use .head() to display the first 5 rows
df.head()

### 1c. Check for missing values in the original data

In [None]:
# TODO: Use .isnull().sum() to count missing values per column
print("Missing values in original data:")
print(df.isnull().sum())

**Q: Which columns have the most missing values?**

A: body, cabin, then boat
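
One way to read off the worst columns is to sort the missing counts. The sketch below uses a tiny made-up DataFrame (the `toy` name and its values are assumptions for illustration, not the real Titanic file):

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from the Titanic CSV.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 40.0],
    "cabin": [None, None, None, "C85"],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# Count missing values per column, largest first.
missing = toy.isnull().sum().sort_values(ascending=False)
print(missing)
```

On the real data, the same `.sort_values(ascending=False)` call would put the columns with the most missing values at the top.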

---
## Task 2: Select Useful Features

We'll keep only the columns that help predict survival.

### 2a. Keep only these 8 columns: pclass, survived, sex, age, sibsp, parch, fare, embarked

In [None]:
# TODO: Select only the useful columns
# Hint: df = df[['column1', 'column2', ...]]
df = df[['pclass', 'survived', 'sex', 'age',
         'sibsp', 'parch', 'fare', 'embarked']]

print("✓ Columns selected!")
print(f"New shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")

### 2b. Explain why we dropped certain columns

**Q: Why did we drop the 'name' column?** (Refer to the lesson if you don't recall, or ask a neighbor!)

A: Names are basically unique identifiers, so the column has very high cardinality. Unless I extract titles like Mr. or Mrs., the name itself doesn't help the model predict survival.
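
For reference, extracting a title could look roughly like this. It is only a sketch: the names are made up, and it assumes they follow a "Last, Title. First" pattern.

```python
import pandas as pd

# Made-up names, assumed to follow the "Last, Title. First" pattern.
names = pd.Series([
    "Smith, Mr. John",
    "Jones, Mrs. Mary",
    "Brown, Miss. Ann",
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

A title column like this has only a handful of distinct values, which is why it can be useful where the full name is not.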

**Q: Why did we drop the 'cabin' column?**

A: It has too many missing values. Dropping those rows would lose too much data, and filling them in would be guesswork because there is no good reference to fill from.

**Q: Why did we drop the 'boat' column?**

A: This causes data leakage. If a passenger has a lifeboat number, it almost always means they survived. The model would just "cheat" by looking at this column rather than learning patterns from age or class, and its results would not be trustworthy.
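
A quick way to spot this kind of leakage is to compare survival rates for rows with and without a value in the suspect column. The sketch below uses made-up rows (the `toy` DataFrame is an assumption, not the real data):

```python
import pandas as pd

# Made-up rows for illustration: a boat value is present only for survivors.
toy = pd.DataFrame({
    "boat": ["2", None, "11", None, None, "5"],
    "survived": [1, 0, 1, 0, 0, 1],
})

has_boat = toy["boat"].notna()

# Survival rate among passengers with and without a recorded lifeboat.
rate_with_boat = toy.loc[has_boat, "survived"].mean()
rate_without_boat = toy.loc[~has_boat, "survived"].mean()

print(f"Survival rate with a boat value: {rate_with_boat:.2f}")
print(f"Survival rate without a boat value: {rate_without_boat:.2f}")
```

When one column splits the target this cleanly, it is a strong hint that the column records the outcome rather than predicting it.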

---
## Task 3: Check for Missing Values

Now let's see which of our selected columns have missing values.

In [None]:
# TODO: Check for missing values in the cleaned dataset
print("Missing values after feature selection:")
print(df.isnull().sum())

**Q: How many missing values are in the 'age' column?**

A: 263

**Q: How many missing values are in the 'fare' column?**

A: 1

**Q: How many missing values are in the 'embarked' column?**

A: 2

---
## Task 4: Handle Missing Age Values

Age has many missing values. We'll fill them with the median age.

### 4a. Calculate the median age

In [None]:
# TODO: Calculate the median age
median_age = df['age'].median()
print(f"Median age: {median_age}")

### 4b. Fill missing ages with the median

In [None]:
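# TODO: Fill missing ages with median_age
# Hint: Assign the result of .fillna(value) back to the column.
# Recent pandas versions warn about df['age'].fillna(value, inplace=True)
# because it may not update df.
df['age'] = df['age'].fillna(median_age)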

print("✓ Missing ages filled with median!")
print(f"Missing ages now: {df['age'].isnull().sum()}")

### 4c. Compare median vs mean for age

In [None]:
# Calculate both median and mean
print(f"Median age: {df['age'].median():.2f}")
print(f"Mean age: {df['age'].mean():.2f}")

**Q: Which is larger, the median or mean? Why might this be?**

A: The mean is slightly larger. Age is right-skewed: a few older passengers pull the mean up, while the median stays at the middle value.
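
A small worked example of the same effect, using made-up ages (not values from the dataset) and Python's built-in `statistics` module:

```python
from statistics import mean, median

# Made-up, right-skewed ages: most are in the 20s, one is much older.
ages = [20, 22, 24, 26, 28, 30, 74]

mean_age = mean(ages)      # pulled up by the single 74-year-old
median_age = median(ages)  # the middle value of the sorted list

print(f"Mean age: {mean_age}")
print(f"Median age: {median_age}")
```

The single large value raises the mean to 32, while the median stays at 26.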

---
## Task 5: Handle Missing Fare Values

Fare has only 1 missing value. We'll fill it with the median fare.

In [None]:
# TODO: Calculate median fare
median_fare = df['fare'].median()
print(f"Median fare: ${median_fare:.2f}")

# TODO: Fill missing fare with median
# Assigning back avoids the pandas warning about inplace=True on a single column
df['fare'] = df['fare'].fillna(median_fare)

print("✓ Missing fare filled!")
print(f"Missing fares now: {df['fare'].isnull().sum()}")

---
## Task 6: Handle Missing Embarked Values

Embarked has only 2 missing values. Since this is so few, we'll drop those rows.

### 6a. How many rows before dropping?

In [None]:
rows_before = len(df)
print(f"Rows before dropping: {rows_before}")

### 6b. Drop rows with missing embarked values

In [None]:
# TODO: Drop rows where 'embarked' is missing
# Hint: Use .dropna(subset=['column_name'], inplace=True) - also in the lesson material
df.dropna(subset=['embarked'], inplace=True)

rows_after = len(df)
rows_dropped = rows_before - rows_after

print("✓ Rows with missing embarked dropped!")
print(f"Rows after dropping: {rows_after}")
print(f"Rows dropped: {rows_dropped}")
print(f"Missing embarked now: {df['embarked'].isnull().sum()}")

**Q: What percentage of data did we lose by dropping these rows?**

A: Only a small amount, about 0.15%. Since we lost just 2 rows out of 1300+, it is better to drop them than to guess where those passengers embarked.
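
The arithmetic behind that percentage, as a sketch. The row count of 1309 is an assumption (the answer above only says "1300+"); in the notebook, `rows_before` and `rows_after` from Task 6 would be used instead.

```python
# Assumed row counts for illustration; use the notebook's own values in practice.
rows_before = 1309
rows_after = 1307

rows_dropped = rows_before - rows_after
pct_lost = rows_dropped / rows_before * 100

print(f"Rows dropped: {rows_dropped}")
print(f"Percentage lost: {pct_lost:.2f}%")
```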

---
## Task 7: Verify All Missing Values Are Gone

Let's do a final check to make sure our data is completely clean.

In [None]:
# TODO: Check for any remaining missing values
print("Final missing value check:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

**Q: Are there any missing values remaining? (Should be 0!)**

A: No, the total is 0.

---
## Task 8: Summary Statistics

Now that our data is clean, let's look at summary statistics.

In [None]:
# Display summary statistics
print("Summary statistics after cleaning:")
df.describe()

**Q: What is the average age after filling missing values?**

A: About 29 years old.

**Q: What is the average fare?**

A: About $33.

---
## Task 9: Save the Cleaned Data
Save your cleaned data to a new CSV file so you can use it in the next lesson.

In [None]:
# Save cleaned data
# Writes the DataFrame to a new CSV file without adding extra row numbers (index)
df.to_csv("Titanic_Cleaned.csv", index=False)

# Prints a confirmation message so the user knows the save was successful
print("✓ Cleaned data saved to 'Titanic_Cleaned.csv'")
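
To see what `index=False` changes, here is a small sketch on a made-up DataFrame (`toy` is an assumption). Calling `to_csv()` without a path returns the CSV text instead of writing a file.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({"age": [22.0, 38.0], "fare": [7.25, 71.28]})

# With no path, to_csv returns the CSV text as a string.
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

print(with_index.splitlines()[0])     # header row starts with an empty index column
print(without_index.splitlines()[0])  # header row contains only the real columns
```

Without `index=False`, reading the file back with `pd.read_csv` would typically show that extra leading column as `Unnamed: 0`.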

---
## Reflection Questions

Answer these questions based on your work:

**1. Why is it important to check for missing values before building a model?**

Answer: Machine learning algorithms are math equations. Most algorithms can't handle "NaN" values and will crash or throw errors. Even if they don't crash, missing data can lead to biased or inaccurate predictions.

**2. When should you fill missing values vs. drop rows?**

Answer: Fill when the column is important and many rows are missing a value, because dropping those rows would lose too much data. Drop rows when only a small percentage of them are missing data, like with 'embarked'. Drop the whole column when it is missing so much data that it is useless, like 'cabin'.
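
That rule of thumb can be sketched as a small helper. The function name, both thresholds, and the example fractions are hypothetical and chosen only for illustration; they are not from the lesson.

```python
def suggest_action(missing_fraction, drop_rows_below=0.01, drop_column_above=0.5):
    """Return a rough suggestion for a column given its share of missing rows.

    Both thresholds are hypothetical defaults for illustration only.
    """
    if missing_fraction == 0:
        return "nothing to do"
    if missing_fraction < drop_rows_below:
        return "drop rows"
    if missing_fraction > drop_column_above:
        return "drop column"
    return "fill"

# Illustrative fractions, not measured from the real file.
examples = {"embarked": 0.002, "age": 0.20, "cabin": 0.77}
suggestions = {col: suggest_action(frac) for col, frac in examples.items()}
print(suggestions)
```

A rule like this is only a starting point. In this lesson, for example, the single missing fare was filled rather than dropped, which is also a reasonable choice.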

**3. Why did we use median instead of mean to fill missing ages?**

Answer: The median is more robust to outliers. If there were a few 80-year-olds on the ship, they would pull the mean higher, potentially making it inaccurate for the typical passenger. The median represents the true "middle" of the crowd.

**4. What could happen if we trained a model on data with missing values?**

Answer: The code would likely fail/error out. If the library handles it automatically, it might make bad assumptions like assuming missing values are 0, leading to a model that makes bad predictions.
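
A small sketch of both problems on made-up ages (the values are assumptions): a plain NumPy mean turns into `nan` when a value is missing, and filling with 0 quietly drags the average down.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value.
ages = pd.Series([20.0, 30.0, np.nan, 40.0])

# A plain NumPy mean does not skip NaN, so the result is NaN.
raw_mean = np.mean(ages.to_numpy())

# Treating the missing value as 0 distorts the average.
zero_filled_mean = ages.fillna(0).mean()

# Filling with the median keeps the average close to the known values.
median_filled_mean = ages.fillna(ages.median()).mean()

print(f"Mean with NaN present: {raw_mean}")
print(f"Mean after filling with 0: {zero_filled_mean}")
print(f"Mean after filling with median: {median_filled_mean}")
```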

**5. Name one real-world scenario where missing data might occur.**

Answer: A user skips an optional question on a survey; a weather sensor battery dies and stops recording temperature for an hour; a clerical error where a doctor forgets to write down a patient's weight. Depending on the situation, the impact could be minor or serious.

---
## Lesson Complete!

You've successfully cleaned the Titanic dataset!

**Summary of what you did:**
- Selected 8 useful features from 15 columns
- Filled missing ages with median
- Filled missing fares with median
- Dropped 2 rows with missing embarked values
- Verified all missing values are gone

Save this notebook and push to GitHub.

**Next lesson**: Convert categorical data to numbers