# Lesson 02: Working with DataFrames

## Your Self-Guided Reference

This notebook is your reference guide for learning how to work with **DataFrames** in pandas. A DataFrame is like a spreadsheet in Python: it stores data in rows and columns, and it gives you tools to organize and analyze that data.

As you work through the lesson, use this notebook to:
- See examples of each technique
- Understand how pandas operations connect to SQL queries
- Practice with the "Try This" sections

We'll be working with the **Titanic Dataset** — a real historical dataset of passengers from the RMS Titanic.

## Sub-Lesson 02a: Setup & Loading Data

This section covers importing pandas and loading the Titanic dataset into a DataFrame.

In [None]:
# Import pandas library — this gives us access to DataFrame tools
import pandas as pd

# Load the Titanic dataset into a DataFrame
# The file must be in the same folder as this notebook
titanic = pd.read_csv("Titanic Dataset.csv")

# Quick check — how big is our data?
# .shape returns (rows, columns)
print("Shape:", titanic.shape)

# Display the first 5 rows to see what the data looks like
titanic.head()
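
If you want to try the loading step without the Titanic file on hand, the sketch below reads a tiny CSV from a string instead. The three rows are invented stand-ins (not real passengers), and `toy` and `csv_text` are hypothetical names used only for illustration.

```python
import io
import pandas as pd

# Invented stand-in for a CSV file: three made-up rows, not real passengers
csv_text = """name,age,pclass,fare
Passenger A,29,1,80.0
Passenger B,,3,7.5
Passenger C,41,2,13.0
"""

# read_csv accepts a file path or a file-like object, so StringIO works here
toy = pd.read_csv(io.StringIO(csv_text))

print("Shape:", toy.shape)             # (rows, columns)
print("Columns:", list(toy.columns))   # column names as a plain list
print(toy.dtypes)                      # data type pandas chose for each column
```

The empty age for Passenger B is loaded as a missing value (NaN), which is why checking `.shape` and `.dtypes` right after loading is a useful habit.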

## Sub-Lesson 02b: Selecting Columns

This section covers selecting one or more columns from a DataFrame.

In [None]:
# Select a SINGLE column — returns a Series (one column of data)
# Use square brackets with the column name in quotes
# This is like SQL: SELECT name FROM titanic
names = titanic["name"]

# Display the first 5 names
print("First 5 names:")
print(names.head())
print()
print(f"Data type: {type(names)}")

In [None]:
# Select MULTIPLE columns — use DOUBLE brackets with a list of column names
# This is like SQL: SELECT name, age, survived FROM titanic
subset = titanic[["name", "age", "survived"]]

# Display the first 5 rows
print("First 5 rows of selected columns:")
print(subset.head())
print()
print(f"Data type: {type(subset)}")
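
The difference between single and double brackets is easiest to see when you select the same column both ways. This is a minimal sketch on an invented three-row DataFrame (`toy` is a hypothetical name, and the rows are made up).

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "name": ["Passenger A", "Passenger B", "Passenger C"],
    "age": [29.0, 35.0, 41.0],
    "survived": [1, 0, 1],
})

series = toy["name"]     # single brackets: a Series
frame = toy[["name"]]    # double brackets: a DataFrame with one column

print(type(series).__name__)   # Series
print(type(frame).__name__)    # DataFrame
print(frame.shape)             # (3, 1)
```

The inner brackets in `toy[["name"]]` are an ordinary Python list of column names, which is why the same syntax extends to two or more columns.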

### Try This:
Select the columns **'name'**, **'sex'**, and **'fare'** and display the first 10 rows.

In [None]:
# Try This: Select name, sex, and fare columns


## Sub-Lesson 02b: Filtering Rows

This section covers filtering data using comparison operators and combining conditions.

In [None]:
# Filter: keep only rows where a condition is True
# In this case: pclass == 1 (first class passengers)
# This is like SQL: SELECT * FROM titanic WHERE pclass = 1
firstClass = titanic[titanic["pclass"] == 1]

# How many first class passengers were there?
print(f"First class passengers: {len(firstClass)}")
print()

# Show name, class, and fare for the first few first-class passengers
firstClass[["name", "pclass", "fare"]].head()

In [None]:
# Different comparison operators in action

# Passengers older than 50
# This is like SQL: WHERE age > 50
olderPassengers = titanic[titanic["age"] > 50]
print(f"Passengers over 50 years old: {len(olderPassengers)}")
print()

# Passengers who survived (survived == 1 means they survived)
# This is like SQL: WHERE survived = 1
survivors = titanic[titanic["survived"] == 1]
print(f"Passengers who survived: {len(survivors)}")
print()

# Passengers with missing age values (age column is NaN)
# .isna() checks for missing/empty values
missingAge = titanic[titanic["age"].isna()]
print(f"Passengers with missing age data: {len(missingAge)}")
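
Missing values deserve a closer look, because a comparison against NaN is always False. The sketch below uses four invented rows (`toy` is a hypothetical name) to show that a filter like `age > 50` silently leaves out rows with no age recorded.

```python
import numpy as np
import pandas as pd

# Made-up rows; Passenger B has no age recorded
toy = pd.DataFrame({
    "name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "age": [62.0, np.nan, 30.0, 55.0],
})

over50 = toy[toy["age"] > 50]        # NaN > 50 is False, so B is left out
known = toy[toy["age"].notna()]      # rows where age is recorded
missing = toy[toy["age"].isna()]     # rows where age is missing

print(len(over50), len(known), len(missing))   # 2 3 1
```

So "not over 50" and "age unknown" are different groups, and it can help to count the missing rows before drawing conclusions from an age filter.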

In [None]:
# Combining MULTIPLE conditions with & (and) or | (or)
# IMPORTANT: Each condition needs its own parentheses!

# Female survivors — BOTH conditions must be True
# This is like SQL: WHERE sex = 'female' AND survived = 1
femaleSurvivors = titanic[
    (titanic["sex"] == "female") & (titanic["survived"] == 1)
]
print(f"Female survivors: {len(femaleSurvivors)}")
print()

# First OR second class passengers — EITHER condition can be True
# This is like SQL: WHERE pclass = 1 OR pclass = 2
upperClass = titanic[
    (titanic["pclass"] == 1) | (titanic["pclass"] == 2)
]
print(f"Upper class passengers (1st or 2nd class): {len(upperClass)}")
print()

# Negating a condition: put ~ (tilde) in front to flip True and False
# Passengers who did NOT survive
# This is like SQL: WHERE survived != 1
nonSurvivors = titanic[~(titanic["survived"] == 1)]
print(f"Passengers who did not survive: {len(nonSurvivors)}")
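
The parentheses are needed because Python evaluates `&` and `|` before `==`, so without them the expression is grouped the wrong way. One way to keep combined filters readable is to name each condition first. The sketch below does that on invented rows, and also shows `.isin()` as a compact alternative to chaining `|`. The names `toy`, `is_female`, and `did_survive` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "pclass": [1, 2, 3, 3],
    "survived": [1, 0, 1, 1],
})

# Name each condition, then combine the named masks
is_female = toy["sex"] == "female"
did_survive = toy["survived"] == 1
both = toy[is_female & did_survive]

# .isin() checks membership in a list: same rows as pclass == 1 OR pclass == 2
upper = toy[toy["pclass"].isin([1, 2])]
same = toy[(toy["pclass"] == 1) | (toy["pclass"] == 2)]

print(len(both))            # 2
print(upper.equals(same))   # True
```

In SQL terms, `.isin([1, 2])` plays the role of `WHERE pclass IN (1, 2)`.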

### Try This:
Filter to find **male passengers in third class**. How many are there? Show their names and fares.

In [None]:
# Try This: Filter for male passengers in third class


## Sub-Lesson 02b: Sorting Data

This section covers sorting rows with `.sort_values()` in ascending and descending order.

In [None]:
# Sort by age — ascending (smallest to largest) is the default
# This is like SQL: SELECT * FROM titanic ORDER BY age
byAge = titanic.sort_values("age")

# Display the 10 youngest passengers
print("10 youngest passengers:")
byAge[["name", "age"]].head(10)

In [None]:
# Sort DESCENDING — highest to lowest
# This is like SQL: SELECT * FROM titanic ORDER BY fare DESC
byFare = titanic.sort_values("fare", ascending=False)

# Display the 10 passengers who paid the highest fares
print("10 highest fares paid:")
byFare[["name", "fare", "pclass"]].head(10)
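
`.sort_values()` also accepts a list of columns, with one `ascending` flag per column. As a minimal sketch on invented rows (`toy` and `ordered` are hypothetical names), this sorts by class first and then by fare from highest to lowest within each class, much like `ORDER BY pclass, fare DESC` in SQL.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "pclass": [3, 1, 1, 3],
    "fare": [7.5, 80.0, 120.0, 9.0],
})

# Class ascending, then fare descending within each class
ordered = toy.sort_values(["pclass", "fare"], ascending=[True, False])

print(ordered["name"].tolist())
# ['Passenger C', 'Passenger B', 'Passenger D', 'Passenger A']

# sort_values returns a new DataFrame; the original order is unchanged
print(toy["name"].tolist())
# ['Passenger A', 'Passenger B', 'Passenger C', 'Passenger D']
```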

### Try This:
Sort passengers by age in **descending order** (oldest to youngest) and show the **10 oldest passengers**.

In [None]:
# Try This: Sort by age descending and show the 10 oldest passengers


## Sub-Lesson 02b: Basic Statistics

This section covers `.mean()`, `.max()`, `.min()`, `.sum()`, `.count()` and calculating stats on filtered data.

In [None]:
# Calculate statistics on number columns
avgAge = titanic["age"].mean()              # Average age
maxFare = titanic["fare"].max()             # Highest fare paid
minAge = titanic["age"].min()               # Youngest passenger age
totalSurvived = titanic["survived"].sum()   # Total number who survived
ageCount = titanic["age"].count()           # How many non-missing age values we have

# Display the results
print(f"Average age: {avgAge:.1f} years")
print(f"Highest fare: ${maxFare:.2f}")
print(f"Youngest passenger: {minAge} years old")
print(f"Total survivors: {int(totalSurvived)}")
print(f"Age values recorded: {ageCount}")
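
Because the age column has missing values, `.count()` and `len()` can give different answers. The sketch below shows the difference on a small invented Series (`ages` is a hypothetical name). The SQL parallel is `COUNT(age)`, which skips NULLs, versus `COUNT(*)`, which counts every row.

```python
import numpy as np
import pandas as pd

# Made-up ages; two of the four values are missing
ages = pd.Series([20.0, np.nan, 40.0, np.nan], name="age")

print(len(ages))            # 4   (every row, missing or not)
print(ages.count())         # 2   (non-missing values only)
print(ages.mean())          # 30.0 (missing values are skipped)
print(ages.isna().sum())    # 2   (number of missing values)
```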

In [None]:
# Filter FIRST, then calculate: the statistic then describes only the rows you kept
# This lets us compare groups, such as first class vs third class

# Average fare for first class passengers only
# This is like SQL: SELECT AVG(fare) FROM titanic WHERE pclass = 1
avgFirstFare = titanic[titanic["pclass"] == 1]["fare"].mean()

# Average fare for third class passengers
# This is like SQL: SELECT AVG(fare) FROM titanic WHERE pclass = 3
avgThirdFare = titanic[titanic["pclass"] == 3]["fare"].mean()

# Display comparison
print(f"Average first class fare: ${avgFirstFare:.2f}")
print(f"Average third class fare: ${avgThirdFare:.2f}")
print(f"Difference: ${avgFirstFare - avgThirdFare:.2f}")
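
The same filter-then-calculate pattern works on a 0/1 column such as `survived`: `.sum()` gives the number of 1s, and `.mean()` gives the share of 1s. This is a minimal sketch on invented rows (`toy`, `first`, and `third` are hypothetical names), so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3],
    "survived": [1, 1, 0, 1, 0],
})

# Filter by class, then keep only the survived column
first = toy[toy["pclass"] == 1]["survived"]
third = toy[toy["pclass"] == 3]["survived"]

print(f"First class: {first.sum()} survived, rate {first.mean():.2f}")
# First class: 2 survived, rate 1.00
print(f"Third class: {third.sum()} survived, rate {third.mean():.2f}")
# Third class: 1 survived, rate 0.33
```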

### Try This:
Calculate the **average age of survivors** vs **average age of non-survivors**. What do you notice?

In [None]:
# Try This: Calculate average age of survivors vs non-survivors


## Sub-Lesson 02b: Value Counts

This section covers .value_counts() for understanding data distribution.

In [None]:
# .value_counts() counts how many times each value appears
# Results are sorted from most to least common, and missing values are left out
# This is like SQL: SELECT pclass, COUNT(*) FROM titanic GROUP BY pclass

print("Passengers by class:")
print(titanic["pclass"].value_counts())
print()

print("Survival counts (0 = did not survive, 1 = survived):")
print(titanic["survived"].value_counts())
print()

print("Gender distribution:")
print(titanic["sex"].value_counts())
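
Two optional parameters of `.value_counts()` are often handy: `normalize=True` returns proportions instead of counts, and `dropna=False` adds a row for missing values. The sketch below tries both on a small invented Series (`sex` here is made-up data, not the real column).

```python
import pandas as pd

# Made-up values; one entry is missing
sex = pd.Series(["male", "female", "male", None, "male"], name="sex")

counts = sex.value_counts()                     # missing value left out
shares = sex.value_counts(normalize=True)       # proportions of non-missing values
with_missing = sex.value_counts(dropna=False)   # missing value gets its own row

print(counts["male"], counts["female"])   # 3 1
print(shares["male"])                     # 0.75
print(with_missing.sum())                 # 5
```

Note that the proportions are computed over the four non-missing values, not over all five rows.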

### Try This:
Use `.value_counts()` on the **'embarked'** column (the port where passengers boarded). Which port had the most passengers?

In [None]:
# Try This: Use value_counts on the 'embarked' column


## Section 7: Summary — Pandas to SQL Bridge

You now know the fundamental pandas operations! Here's how they connect to SQL queries:

| Task | Pandas | SQL |
|------|--------|-----|
| Select columns | `df[["col1", "col2"]]` | `SELECT col1, col2 FROM table` |
| Filter rows | `df[df["col"] == value]` | `SELECT * FROM table WHERE col = value` |
| Sort data | `df.sort_values("col")` | `SELECT * FROM table ORDER BY col` |
| Count rows | `len(df)` | `SELECT COUNT(*) FROM table` |
| Average | `df["col"].mean()` | `SELECT AVG(col) FROM table` |
| Group & count | `df["col"].value_counts()` | `SELECT col, COUNT(*) FROM table GROUP BY col` |
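
To see a couple of rows of this table side by side, the sketch below runs the same question in pandas and in SQL against an in-memory SQLite database from Python's standard library. It is only an illustration under assumptions: the rows are invented, and the table name `toy` is hypothetical.

```python
import sqlite3
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "pclass": [1, 3, 1, 2],
    "fare": [80.0, 7.5, 120.0, 13.0],
})

# Copy the DataFrame into a temporary in-memory database
conn = sqlite3.connect(":memory:")
toy.to_sql("toy", conn, index=False)

# Filter + average, in pandas and in SQL
pandas_avg = toy[toy["pclass"] == 1]["fare"].mean()
sql_avg = conn.execute(
    "SELECT AVG(fare) FROM toy WHERE pclass = 1"
).fetchone()[0]

# Filter + count, in pandas and in SQL
pandas_count = len(toy[toy["pclass"] == 1])
sql_count = conn.execute(
    "SELECT COUNT(*) FROM toy WHERE pclass = 1"
).fetchone()[0]

conn.close()

print(pandas_avg, sql_avg)       # 100.0 100.0
print(pandas_count, sql_count)   # 2 2
```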

### What's Next?

In the next lesson, we'll take this data and load it into a **real SQLite database**. You'll learn how to write actual SQL queries to answer the same questions we've answered here with pandas — but using the power of a database!

---

**Keep practicing!** Use this notebook as a reference while you work through the lesson activities.