# Pandas Student Notebook — Foundations Practice  
## Dataset: Kaggle “Titanic: Machine Learning from Disaster” (train.csv)

This notebook provides guided practice in core pandas skills: loading data, cleaning, feature creation, grouping, and basic validation.

Expected columns include: `Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Name`.

Write your code in the empty code cells. Keep your work readable and show intermediate results when helpful.


## 0. Setup

Load `train.csv` into a DataFrame called `df`.


In [92]:
import pandas as pd
import numpy as np
import os


df = pd.read_csv(os.path.join('data', 'titanic', 'train.csv'))
df.head()


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
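
A quick sanity check right after loading can catch a wrong path or file early. The sketch below uses a small inline CSV as a stand-in for `train.csv` (two rows copied from the output above, with `Ticket` and `Cabin` left out) and confirms that every expected column is present. The names `sample` and `EXPECTED` are illustrative, not part of the exercise.

```python
import io
import pandas as pd

# Tiny inline stand-in for train.csv (illustrative subset of rows and columns).
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S\n'
)
sample = pd.read_csv(sample_csv)

# Columns the notebook relies on (hypothetical constant name).
EXPECTED = {"Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Name"}

# Set difference: anything left over is a column the file does not provide.
missing_cols = EXPECTED - set(sample.columns)
print("missing columns:", sorted(missing_cols))
```

With the real file, the same check would run on `df.columns` instead of `sample.columns`.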


## 1. Basic inspection

1) Show the column names.  
2) Use `df.describe()` for numeric columns.  
3) Show how many missing values each column has.

Write as a comment: Which 2 columns have the most missing values?


In [93]:
# 1) column names
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [94]:
# 2) numeric describe
df.describe()


,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [95]:
# 3) missing values per column
df.isna().sum().sort_values(ascending=False)

# Most missing: Cabin (687) and Age (177). Embarked is a distant third with 2.

Cabin          687
Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64
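
Raw counts are easier to judge as shares of the table. In the output above, 687 of 891 `Cabin` values (about 77%) and 177 of 891 `Age` values (about 20%) are missing. A minimal sketch on a toy frame (the `toy` data is made up for illustration) shows how `isna().mean()` gives those fractions directly:

```python
import numpy as np
import pandas as pd

# Made-up data with the same kind of gaps as the Titanic table.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isna() yields booleans; their column mean is the fraction of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```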

## 2. Selecting and filtering

1) Select only these columns into a new DataFrame `mini`:  
`Survived, Pclass, Sex, Age, Fare, Embarked`

2) Filter passengers who:
- are in 1st class
- and paid more than 100 (`Fare > 100`)

Show the first 10 rows of the filtered result.


In [96]:
mini = df[["Survived","Pclass","Sex","Age","Fare","Embarked"]].copy()
mini.head()


,Survived,Pclass,Sex,Age,Fare,Embarked
0,0,3,male,22.0,7.25,S
1,1,1,female,38.0,71.2833,C
2,1,3,female,26.0,7.925,S
3,1,1,female,35.0,53.1,S
4,0,3,male,35.0,8.05,S


In [97]:
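# The task asks for the first 10 rows; .head() alone returns only 5.
# The captured output below lists just the first 5 of these rows.
mini.loc[(mini["Pclass"] == 1) & (mini["Fare"] > 100)].head(10)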

,Survived,Pclass,Sex,Age,Fare,Embarked
27,0,1,male,19.0,263.0,S
31,1,1,female,,146.5208,C
88,1,1,female,23.0,263.0,S
118,0,1,male,24.0,247.5208,C
195,1,1,female,58.0,146.5208,C
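
The parentheses around each condition matter because `&` binds more tightly than `==` and `>`. As a hedged alternative, `DataFrame.query` expresses the same filter as a string; the sketch below compares both forms on made-up data (the `toy` frame is illustrative only):

```python
import pandas as pd

# Made-up rows: two first-class passengers above the fare threshold, two rows that fail it.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 1],
    "Fare": [263.0, 71.28, 7.25, 146.52],
})

# Boolean-mask form, as used in the cell above.
mask_result = toy.loc[(toy["Pclass"] == 1) & (toy["Fare"] > 100)]

# Equivalent query-string form; column names are referenced directly.
query_result = toy.query("Pclass == 1 and Fare > 100")
print(query_result)
```

Both forms return the same rows; which one to use is mostly a readability choice.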


## 3. Missing values: counting and filling

1) Count how many `Age` values are missing.  
2) Create a new column `Age_filled` where missing ages are replaced by the **median** age.



In [98]:
df['Age'].isna().sum()

np.int64(177)

In [99]:
age_median = df["Age"].median()
df["Age_filled"] = df["Age"].fillna(age_median)

df.head()


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_filled
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,22.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,26.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,35.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,35.0
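
To see what the median fill does, it helps to work through a series small enough to check by hand. The sketch below uses made-up ages; the `was_missing` flag is an optional extra (an assumption, not required by the exercise) that records which values were imputed.

```python
import numpy as np
import pandas as pd

# Made-up ages with two gaps.
ages = pd.Series([22.0, np.nan, 26.0, 38.0, np.nan])

# median() skips NaN by default, so this is the median of 22, 26 and 38.
median_age = ages.median()

# fillna returns a new Series; the original keeps its gaps.
ages_filled = ages.fillna(median_age)

# Optional: remember which entries were imputed.
was_missing = ages.isna()

print(median_age)
print(ages_filled.tolist())
```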


## 4. Feature engineering (vectorized)

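Create these new columns:
- `is_child`: True if the passenger is younger than 16 (based on `Age_filled`)
- `family_size`: `SibSp + Parch + 1` (the passenger plus relatives aboard)
- `is_alone`: True if `family_size == 1`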



In [100]:
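# Rows whose age was imputed with the median (28) are never flagged as children.
df["is_child"] = df["Age_filled"] < 16
# Siblings/spouses + parents/children + the passenger themselves.
df["family_size"] = df["SibSp"] + df["Parch"] + 1
df["is_alone"] = df["family_size"] == 1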

df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_filled,is_child,family_size,is_alone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,22.0,False,2,False
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0,False,2,False
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,26.0,False,1,True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,35.0,False,2,False
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,35.0,False,1,True
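
One side effect of deriving `is_child` from `Age_filled` is that passengers with an unknown age inherit the median and are therefore counted as adults. The sketch below shows this on made-up data; the `age_was_missing` column is a hypothetical companion flag, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Made-up ages: a child, an unknown age, and an adult.
toy = pd.DataFrame({"Age": [4.0, np.nan, 30.0]})
toy["Age_filled"] = toy["Age"].fillna(toy["Age"].median())  # median of 4 and 30 is 17

# Vectorized comparison on the filled column: the unknown age becomes 17, so not a child.
toy["is_child"] = toy["Age_filled"] < 16

# Hypothetical companion flag so later analysis can tell imputed ages apart.
toy["age_was_missing"] = toy["Age"].isna()
print(toy)
```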


## 5. Value counts and proportions

1) Count the passengers per `Sex`.  
2) Compute the proportion of survivors overall.  
3) Compute the proportion of survivors by `Sex`.


In [101]:
# 1) passengers per Sex
df["Sex"].value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

In [102]:
# 2) proportion of survivors overall
df["Survived"].mean()


np.float64(0.3838383838383838)

In [103]:
# 3) proportion of survivors by Sex
df.groupby("Sex")["Survived"].mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
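
The mean of a 0/1 column equals the proportion of ones, which is why `mean()` works as a survival rate. A small sketch on made-up data shows this, along with `value_counts(normalize=True)` as another way to get the same shares:

```python
import pandas as pd

# Made-up data: three survivors out of four passengers.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 1],
})

# Mean of a 0/1 column = share of ones.
overall_rate = toy["Survived"].mean()

# Same information as relative frequencies of each value.
shares = toy["Survived"].value_counts(normalize=True)

# Per-group proportion of survivors.
by_sex = toy.groupby("Sex")["Survived"].mean()

print(overall_rate)
print(by_sex)
```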

## 6. Groupby + aggregation

Compute survival rate by:
1) `Pclass`
2) `Pclass` and `Sex` together

Make the result easy to read (sort by class and use clear column names).


In [104]:
surv_by_pclass = (
    df.groupby("Pclass", as_index=False)["Survived"]
    .mean()
    .rename(columns={"Survived": "survival_rate"})
    .sort_values("Pclass")
)
surv_by_pclass


,Pclass,survival_rate
0,1,0.62963
1,2,0.472826
2,3,0.242363


In [105]:
surv_by_pclass_sex = (
    df.groupby(["Pclass", "Sex"], as_index=False)["Survived"]
    .mean()
    .rename(columns={"Survived": "survival_rate"})
    .sort_values(["Pclass", "Sex"])
)
surv_by_pclass_sex


,Pclass,Sex,survival_rate
0,1,female,0.968085
1,1,male,0.368852
2,2,female,0.921053
3,2,male,0.157407
4,3,female,0.5
5,3,male,0.135447
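
A rate is easier to interpret next to the number of passengers it is based on. As one possible variation on the cells above, named aggregation can produce the rate and the group size in a single step. The sketch uses made-up data, and the column name `n_passengers` is an illustrative choice.

```python
import pandas as pd

# Made-up data: five passengers across two classes.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 0, 1],
})

# Named aggregation: each keyword becomes an output column.
summary = (
    toy.groupby(["Pclass", "Sex"], as_index=False)
    .agg(
        survival_rate=("Survived", "mean"),
        n_passengers=("Survived", "size"),
    )
    .sort_values(["Pclass", "Sex"])
)
print(summary)
```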


## 10. Data quality check

Create a boolean column `suspicious_fare` that is True if:
- `Fare <= 0` OR `Fare` is missing

Then print:
- number of suspicious rows
- a sample of suspicious rows (show `Name, Fare, Pclass, Embarked`)

Write as a comment: Is a zero fare always “wrong”? Give one possible explanation.


In [109]:
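df["suspicious_fare"] = df["Fare"].isna() | (df["Fare"] <= 0)

print("number of suspicious rows:", int(df["suspicious_fare"].sum()))

# A zero fare is not necessarily wrong: it may reflect a complimentary ticket
# (for example, an employee or guest of the shipping line) rather than a data-entry error.
df.loc[df["suspicious_fare"], ["Name", "Fare", "Pclass", "Embarked"]].head(10)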


number of suspicious rows: 15


,Name,Fare,Pclass,Embarked
179,"Leonard, Mr. Lionel",0.0,3,S
263,"Harrison, Mr. William",0.0,1,S
271,"Tornquist, Mr. William Henry",0.0,3,S
277,"Parkes, Mr. Francis ""Frank""",0.0,2,S
302,"Johnson, Mr. William Cahoone Jr",0.0,3,S
413,"Cunningham, Mr. Alfred Fleming",0.0,2,S
466,"Campbell, Mr. William",0.0,2,S
481,"Frost, Mr. Anthony Wood ""Archie""",0.0,2,S
597,"Johnson, Mr. Alfred",0.0,3,S
633,"Parr, Mr. William Henry Marsh",0.0,1,S
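
The explicit `isna()` term in the flag is needed because comparisons with `NaN` evaluate to False, so `Fare <= 0` alone would silently skip missing fares. A minimal sketch on made-up values illustrates the difference:

```python
import numpy as np
import pandas as pd

# Made-up fares: zero, valid, missing, and negative.
fares = pd.Series([0.0, 7.25, np.nan, -1.0])

# Comparison only: the NaN entry compares as False and is not flagged.
only_comparison = fares <= 0

# Comparison plus explicit missing-value check, as in the cell above.
suspicious = fares.isna() | (fares <= 0)

print(only_comparison.tolist())
print(suspicious.tolist())
```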


## 11. Capstone: clean feature table

Create a DataFrame `features` containing:
- `Survived`
- `Pclass`
- `Age_filled`
- `Fare_filled` (fill missing Fare with median)
- `family_size`
- `is_alone`

Requirements:
- No missing values in `features`
- Show `features.head()` and `features.isna().sum()`

This is a typical “model-ready” table.


In [107]:
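# Fare has no missing values in train.csv (count = 891 in describe()),
# so this fill acts as a safeguard here rather than changing any value.
fare_median = df["Fare"].median()
df["Fare_filled"] = df["Fare"].fillna(fare_median)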

features = df[[
    "Survived",
    "Pclass",
    "Age_filled",
    "Fare_filled",
    "family_size",
    "is_alone",
]].copy()


features.head()

,Survived,Pclass,Age_filled,Fare_filled,family_size,is_alone
0,0,3,22.0,7.25,2,False
1,1,1,38.0,71.2833,2,False
2,1,3,26.0,7.925,1,True
3,1,1,35.0,53.1,2,False
4,0,3,35.0,8.05,1,True


In [108]:
features.isna().sum()

Survived       0
Pclass         0
Age_filled     0
Fare_filled    0
family_size    0
is_alone       0
dtype: int64
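
The requirements can also be checked with assertions instead of by eye. The sketch below is one possible approach on a made-up two-row table; the helper `check_model_ready` and the frame `features_toy` are hypothetical names, and the consistency rules are assumptions drawn from how the columns were defined earlier in this notebook.

```python
import pandas as pd

# Made-up stand-in for the real features table.
features_toy = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Age_filled": [22.0, 38.0],
    "Fare_filled": [7.25, 71.2833],
    "family_size": [2, 1],
    "is_alone": [False, True],
})


def check_model_ready(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the table violates the basic expectations."""
    # No missing values anywhere in the table.
    assert not frame.isna().any().any(), "features contains missing values"
    # family_size counts the passenger, so it can never be below 1.
    assert (frame["family_size"] >= 1).all(), "family_size must be at least 1"
    # is_alone must agree with family_size == 1.
    assert (frame["is_alone"] == (frame["family_size"] == 1)).all(), "is_alone disagrees with family_size"


check_model_ready(features_toy)
print("all checks passed")
```

With the real data, calling `check_model_ready(features)` would apply the same checks to the capstone table.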