# 📝 Section 2: Homework 2: Comprehensive Data Cleaning and Analysis

### Assignment Overview

In this homework, you will clean, transform, and analyze the Titanic dataset. You will practice handling missing data, grouping and merging, creating pivot tables, and engineering new features. This assignment will test your ability to apply a variety of **Pandas** techniques to a real-world dataset.

---

## 🚢 1. Getting Started and Data Cleaning

### Instructions
1.  Import the **Pandas** library.
2.  Load the Titanic dataset from the specified URL.
3.  Perform the initial data cleaning:
    * Impute missing `Age` values with the **median**.
    * Fill missing `Embarked` values with the **mode** (the most frequent value).
    * Drop the `Cabin` column entirely.
4.  Display the first 10 rows of the cleaned DataFrame to verify your work.
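
The cleaning steps above can be sketched on a small toy frame. This is a minimal illustration, not the assignment solution: the `toy` DataFrame and its values are made up for the example. It assigns the result of `fillna` back to the column, which avoids the chained `inplace=True` pattern that recent pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic columns (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, "E46"],
})

# Assign the filled column back instead of calling fillna(..., inplace=True)
# on a column selection
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Drop the column and keep the returned frame
toy = toy.drop(columns="Cabin")

print(toy)
```

`mode()` returns a Series because ties are possible, so `[0]` picks the first of the most frequent values.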

In [2]:
# Your code for getting started and data cleaning here
import pandas as pd

# Load the dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_df = pd.read_csv(url)

# Impute missing Age with the median
# (assign back; chained fillna(..., inplace=True) is deprecated in pandas)
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())

# Fill missing Embarked with the mode (most frequent value)
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])

# Drop the Cabin column
titanic_df.drop('Cabin', axis=1, inplace=True)

# Display the first 10 rows of the cleaned DataFrame
display(titanic_df.head(10))

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | 28.0 | 0 | 0 | 330877 | 8.4583 | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | C |


---

## 📊 2. Grouping and Merging

### Instructions
1.  Group the cleaned DataFrame by `Pclass`.
2.  Calculate the **mean `Fare`** and the **count of passengers** for each class. *Hint: `.agg()` can be useful here.*
3.  Create a second, small DataFrame that maps `Pclass` values to a `ClassDescription` (e.g., 1 to "Upper", 2 to "Middle", 3 to "Lower").
4.  **Merge** your grouped data with this new DataFrame to add the descriptive class names.
5.  Display the final merged DataFrame.
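
The group-then-merge steps above can be sketched on toy data. This is a minimal illustration under stated assumptions: the `toy` values are made up, and the output names `MeanFare` and `Passengers` are hypothetical labels chosen via named aggregation (the self-assessment at the end of this notebook looks for a column called `Fare` and a count column such as `PassengerId`, so the dictionary form of `.agg()` matches it more directly).

```python
import pandas as pd

# Toy passengers (illustrative values only)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 10.0],
})

# Named aggregation: new_column=(source_column, function)
per_class = (
    toy.groupby("Pclass")
       .agg(MeanFare=("Fare", "mean"), Passengers=("PassengerId", "count"))
       .reset_index()
)

# Lookup table mapping each class number to a description
labels = pd.DataFrame({
    "Pclass": [1, 2, 3],
    "ClassDescription": ["Upper", "Middle", "Lower"],
})

# validate="one_to_one" raises if Pclass is duplicated on either side
merged = per_class.merge(labels, on="Pclass", how="left", validate="one_to_one")
print(merged)
```

Using `validate` is optional, but it catches an accidentally duplicated key in the lookup table before it silently multiplies rows.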

In [3]:
# Group by Pclass with mean Fare + passenger count
grouped_by_class = titanic_df.groupby("Pclass").agg({
    "Fare": "mean",
    "PassengerId": "count"
}).reset_index()

# Create class description DataFrame
class_desc_df = pd.DataFrame({
    "Pclass": [1, 2, 3],
    "ClassDescription": ["Upper", "Middle", "Lower"]
})

# Merge
merged_class_df = pd.merge(grouped_by_class, class_desc_df, on="Pclass")

display(merged_class_df)


| | Pclass | Fare | PassengerId | ClassDescription |
|---|---|---|---|---|
| 0 | 1 | 84.154687 | 216 | Upper |
| 1 | 2 | 20.662183 | 184 | Middle |
| 2 | 3 | 13.67555 | 491 | Lower |


---

## 🔀 3. Creating a Pivot Table

### Instructions
1.  Create a **pivot table** from the cleaned `titanic_df`.
2.  The pivot table should show the **mean `Fare`**.
3.  Use `Embarked` for the **rows** (index) and `Sex` for the **columns**.
4.  Fill any missing values (`NaN`) in the resulting pivot table with `0`.
5.  Display the pivot table.
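
To see what `fill_value` does, here is a minimal sketch on toy data (values are made up). The toy frame deliberately has no male passenger embarking at `Q`, so that cell would be `NaN` without `fill_value=0`.

```python
import pandas as pd

# Toy data with one missing Embarked/Sex combination (Q, male)
toy = pd.DataFrame({
    "Embarked": ["C", "C", "Q", "S", "S"],
    "Sex": ["female", "male", "female", "female", "male"],
    "Fare": [80.0, 40.0, 12.0, 30.0, 20.0],
})

table = toy.pivot_table(
    values="Fare",
    index="Embarked",   # rows
    columns="Sex",      # columns
    aggfunc="mean",
    fill_value=0,       # replaces the empty (Q, male) cell
)
print(table)
```

In the real Titanic data every port has passengers of both sexes, so `fill_value` changes nothing there; it is a safeguard rather than a transformation.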

In [4]:
# Your code for creating a pivot table here

pivot_table = titanic_df.pivot_table(
    values="Fare",
    index="Embarked",
    columns="Sex",
    aggfunc="mean",
    fill_value=0
)

display(pivot_table)


| Embarked / Sex | female | male |
|---|---|---|
| C | 75.169805 | 48.262109 |
| Q | 12.634958 | 13.838922 |
| S | 39.143456 | 21.711996 |


---

## ✨ 4. Feature Engineering and Outlier Handling

### Instructions
1.  **Feature Engineering**: Create a new `TravelGroup` column in `titanic_df`. This column should categorize passengers based on their total family size (`SibSp` + `Parch` + 1).
    * **"Solo"**: Family size of 1.
    * **"Small"**: Family size of 2 or 3.
    * **"Large"**: Family size of 4 or more.
2.  **Outlier Handling**: Document your chosen method for handling outliers in the `Fare` column in the markdown cell below. Then, apply this method to your DataFrame.
3.  **Final Summary**:
    * Display the first 10 rows of the DataFrame with the new `TravelGroup` column and adjusted `Fare`.
    * Create a dictionary called `summary_stats` to store at least two summary statistics (e.g., mean `Fare` after outlier handling, number of "Solo" travelers).
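
The family-size categories above can also be expressed as bins. The following is a sketch of one vectorized alternative using `pd.cut`, shown on a made-up `family_size` Series; it is not the only way to build the column, and a plain function with `.apply()` works equally well.

```python
import pandas as pd

# Illustrative family sizes (SibSp + Parch + 1)
family_size = pd.Series([1, 2, 3, 4, 7], name="FamilySize")

# Bins are right-inclusive: (0, 1] -> Solo, (1, 3] -> Small, (3, inf] -> Large
travel_group = pd.cut(
    family_size,
    bins=[0, 1, 3, float("inf")],
    labels=["Solo", "Small", "Large"],
)
print(travel_group.tolist())
```

`pd.cut` returns a categorical column; call `.astype(str)` on it if plain strings are preferred.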

### My Outlier Handling Method

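The code below uses **IQR clipping**. It computes the first and third quartiles of `Fare` (Q1 and Q3), sets the bounds at `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR`, and clips every fare to that range with `.clip()` rather than dropping rows. `Fare` is strongly right-skewed, so this caps the few very expensive tickets while keeping every passenger in the dataset for the `TravelGroup` counts. In the output below, for example, the fare of passenger 2 is capped from 71.2833 to 65.6344.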
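
As a reference point for this section, here is a minimal sketch of IQR clipping on a few made-up fares, assuming the conventional 1.5 × IQR rule. It also counts how many values fall outside the bounds, which can be useful when documenting the method.

```python
import pandas as pd

# Illustrative fares with one extreme value
fares = pd.Series([7.0, 8.0, 10.0, 12.0, 500.0])

q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Count values outside the bounds before changing anything
n_outliers = ((fares < lower) | (fares > upper)).sum()

# Clip instead of dropping, so the number of rows stays the same
clipped = fares.clip(lower=lower, upper=upper)
print(n_outliers, clipped.tolist())
```

Clipping changes the mean of the column, so any summary statistic computed afterwards describes the adjusted fares, not the original ones.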

In [5]:
# Create FamilySize = SibSp + Parch + 1
titanic_df["FamilySize"] = titanic_df["SibSp"] + titanic_df["Parch"] + 1

# TravelGroup categories
def travel_group(size):
    if size == 1:
        return "Solo"
    elif size in [2, 3]:
        return "Small"
    else:
        return "Large"

titanic_df["TravelGroup"] = titanic_df["FamilySize"].apply(travel_group)

# Outlier handling for Fare (IQR clipping)
Q1 = titanic_df["Fare"].quantile(0.25)
Q3 = titanic_df["Fare"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

titanic_df["Fare"] = titanic_df["Fare"].clip(lower=lower_bound, upper=upper_bound)

# Show first 10 rows
display(titanic_df.head(10))

# Summary stats dictionary
summary_stats = {
    "Mean_Fare": titanic_df["Fare"].mean(),
    "Max_FamilySize": titanic_df["FamilySize"].max(),
    "Solo_Count": (titanic_df["TravelGroup"] == "Solo").sum()
}

summary_stats


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | TravelGroup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S | 2 | Small |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 65.6344 | C | 2 | Small |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S | 1 | Solo |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S | 2 | Small |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S | 1 | Solo |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | 28.0 | 0 | 0 | 330877 | 8.4583 | Q | 1 | Solo |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | S | 1 | Solo |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | S | 5 | Large |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | S | 3 | Small |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | C | 2 | Small |


{'Mean_Fare': np.float64(24.04681335578002),
 'Max_FamilySize': 11,
 'Solo_Count': np.int64(537)}

---

## ✅ Self-Assessment

Run the cell below to check your work. This script will evaluate the key components of your assignment and provide a score and feedback. **Make sure you have run all the cells above this one first.**

In [7]:
#@title Run this cell to check your work
import pandas as pd
from IPython.display import display, Markdown

def check_homework():
    """Checks the student's work and provides feedback."""
    score = 0
    total_points = 5
    feedback = []

    # Check 1: Data Cleaning
    try:
        if 'titanic_df' in globals() and isinstance(titanic_df, pd.DataFrame):
            cabin_dropped = 'Cabin' not in titanic_df.columns
            age_imputed = titanic_df['Age'].notna().all()
            embarked_filled = titanic_df['Embarked'].notna().all()
            if cabin_dropped and age_imputed and embarked_filled:
                score += 1
                feedback.append("- ✅ **Task 1 (Cleaning):** Passed. DataFrame is loaded and initial cleaning is correct.")
            else:
                error_msgs = []
                if not cabin_dropped: error_msgs.append("'Cabin' column not dropped")
                if not age_imputed: error_msgs.append("'Age' has missing values")
                if not embarked_filled: error_msgs.append("'Embarked' has missing values")
                feedback.append(f"- ❌ **Task 1 (Cleaning):** Incomplete. Issues found: {', '.join(error_msgs)}.")
        else:
            feedback.append("- ❌ **Task 1 (Cleaning):** Failed. DataFrame named `titanic_df` not found.")
    except Exception as e:
        feedback.append(f"- ❌ **Task 1 (Cleaning):** An error occurred: {e}")

    # Check 2: Grouping and Merging
    try:
        if 'merged_class_df' in globals() and isinstance(merged_class_df, pd.DataFrame):
            required_cols = ['Pclass', 'Fare', 'ClassDescription']
            # Accommodate different possible names for the count column
            count_col_present = any(col in merged_class_df.columns for col in ['PassengerId', 'Survived', 'Name'])
            if all(c in merged_class_df.columns for c in required_cols) and count_col_present:
                score += 1
                feedback.append("- ✅ **Task 2 (Grouping/Merging):** Passed. Grouping and merging were successful.")
            else:
                feedback.append("- ❌ **Task 2 (Grouping/Merging):** Failed. `merged_class_df` is missing required columns.")
        else:
            feedback.append("- ❌ **Task 2 (Grouping/Merging):** Failed. DataFrame `merged_class_df` not found.")
    except Exception as e:
        feedback.append(f"- ❌ **Task 2 (Grouping/Merging):** An error occurred: {e}")

    # Check 3: Pivot Table
    try:
        if 'pivot_table' in globals() and isinstance(pivot_table, pd.DataFrame):
            index_ok = pivot_table.index.name == 'Embarked'
            cols_ok = 'male' in pivot_table.columns and 'female' in pivot_table.columns
            nan_ok = not pivot_table.isna().any().any()
            if index_ok and cols_ok and nan_ok:
                score += 1
                feedback.append("- ✅ **Task 3 (Pivot Table):** Passed. Pivot table has the correct structure and no missing values.")
            else:
                feedback.append("- ❌ **Task 3 (Pivot Table):** Failed. Check index, columns, or missing value handling.")
        else:
            feedback.append("- ❌ **Task 3 (Pivot Table):** Failed. DataFrame `pivot_table` not found.")
    except Exception as e:
        feedback.append(f"- ❌ **Task 3 (Pivot Table):** An error occurred: {e}")

    # Check 4: Feature Engineering
    try:
        if 'titanic_df' in globals() and 'TravelGroup' in titanic_df.columns:
            expected_values = {"Solo", "Small", "Large"}
            actual_values = set(titanic_df['TravelGroup'].unique())
            if actual_values.issubset(expected_values):
                score += 1
                feedback.append("- ✅ **Task 4 (Feature Engineering):** Passed. `TravelGroup` column created correctly.")
            else:
                feedback.append("- ❌ **Task 4 (Feature Engineering):** Failed. `TravelGroup` contains unexpected values.")
        else:
            feedback.append("- ❌ **Task 4 (Feature Engineering):** Failed. `TravelGroup` column not found in `titanic_df`.")
    except Exception as e:
        feedback.append(f"- ❌ **Task 4 (Feature Engineering):** An error occurred: {e}")

    # Check 5: Summary Stats
    try:
        if 'summary_stats' in globals() and isinstance(summary_stats, dict):
            if len(summary_stats) >= 2:
                score += 1
                feedback.append("- ✅ **Task 5 (Summary Stats):** Passed. `summary_stats` dictionary is created with at least two entries.")
            else:
                feedback.append("- ❌ **Task 5 (Summary Stats):** Failed. Dictionary must contain at least two stats.")
        else:
            feedback.append("- ❌ **Task 5 (Summary Stats):** Failed. Dictionary `summary_stats` not found.")
    except Exception as e:
        feedback.append(f"- ❌ **Task 5 (Summary Stats):** An error occurred: {e}")

    # Final Feedback
    final_message = "**Homework Self-Assessment Feedback:**\n\n" + "\n".join(feedback)
    final_message += f"\n\n### **Final Score: {score}/{total_points}**"
    if score == total_points:
        final_message += "\n\nExcellent work! All checks passed."
    else:
        final_message += "\n\nSome tasks need revision. Please review the feedback above."

    display(Markdown(final_message))

check_homework()

**Homework Self-Assessment Feedback:**

- ✅ **Task 1 (Cleaning):** Passed. DataFrame is loaded and initial cleaning is correct.
- ✅ **Task 2 (Grouping/Merging):** Passed. Grouping and merging were successful.
- ✅ **Task 3 (Pivot Table):** Passed. Pivot table has the correct structure and no missing values.
- ✅ **Task 4 (Feature Engineering):** Passed. `TravelGroup` column created correctly.
- ✅ **Task 5 (Summary Stats):** Passed. `summary_stats` dictionary is created with at least two entries.

### **Final Score: 5/5**

Excellent work! All checks passed.