# 🔪 Part 3: Data Slicing and Aggregation (Select, Filter, GroupBy)

**Goal:** To master the core Pandas techniques used to target specific data subsets and calculate aggregated statistics, which are fundamental to Exploratory Data Analysis (EDA).

---
### Key Learning Objectives
1.  Select single and multiple columns using bracket (`[]`) and dot (`.`) notation.
2.  Filter rows based on single and multiple **boolean conditions** (`&` for AND, `|` for OR).
3.  Sort the data by one or more columns using `sort_values()`.
4.  Use the powerful **`groupby()`** method to calculate aggregated statistics.

In [7]:
import pandas as pd
import os

# Load the Titanic dataset
try:
    # NOTE: Assuming 'titanic_snapshot.csv' exists from the previous notebook's export
    titanic_df = pd.read_csv('data-visualization/data/titanic_snapshot.csv')
except FileNotFoundError:
    # Fallback to URL if local file isn't available
    url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
    titanic_df = pd.read_csv(url)

print("Data loaded successfully!")
print(f"Shape: {titanic_df.shape}")
print(f"Columns: {list(titanic_df.columns)}\n")

Data loaded successfully!
Shape: (891, 12)
Columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']



## 1. Selecting Columns

Selecting the right columns is the first step in focusing your analysis. Pandas offers flexible methods for this:

* **Single Column:** Returns a Pandas **Series** (1D). Use bracket syntax (`df['ColumnName']`) or, when the name is a valid Python identifier that does not clash with an existing DataFrame attribute or method, **dot notation** (`df.ColumnName`).
* **Multiple Columns:** Requires passing a **list of column names** inside the brackets (`df[['Col1', 'Col2']]`), which always returns a **DataFrame** (2D).
* **By Type:** The **`.select_dtypes()`** method is useful for automatically grabbing all **numeric** or **text (`object`)** columns at once.
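
The three selection styles above can be sketched on a tiny, made-up DataFrame (the column names `Ticket Class` and `count` are invented here purely to show where dot notation stops working; they are not part of the Titanic data):

```python
import pandas as pd

# Toy data, invented for illustration (not the Titanic file)
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Age": [29.0, 41.0, None],
    "Ticket Class": [1, 3, 2],   # hypothetical name containing a space
    "count": [2, 0, 1],          # hypothetical name that collides with DataFrame.count
})

# Bracket and dot notation return the same Series for a simple name
same_series = df["Age"].equals(df.Age)

# Dot notation breaks down in two cases:
#   df.Ticket Class  -> SyntaxError (not a valid identifier)
#   df.count         -> the DataFrame.count method, not the column
dot_gives_method = callable(df.count)
bracket_gives_column = isinstance(df["count"], pd.Series)

# Single vs. double brackets: Series vs. one-column DataFrame
as_series = df["Age"]
as_frame = df[["Age"]]

# select_dtypes can include or exclude columns by type
numeric_cols = list(df.select_dtypes(include="number").columns)
non_numeric_cols = list(df.select_dtypes(exclude="number").columns)

print(same_series, dot_gives_method, bracket_gives_column)
print(type(as_series).__name__, type(as_frame).__name__)
print(numeric_cols, non_numeric_cols)
```

Bracket notation works for every column name, so it is the safer default in reusable code; dot notation is mainly a convenience for interactive exploration.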

In [8]:
# Single column selection (Series)
names = titanic_df['Name']
print(f"Type: {type(names)}")
print("First 5 names:")
print(names.head())

# Multiple columns selection (DataFrame)
basic_info = titanic_df[['Name', 'Age', 'Sex', 'Survived']]
print(f"\nType: {type(basic_info)}")
print("Basic info (first 5 rows):")
print(basic_info.head())

# Selection by Data Type
numeric_data = titanic_df.select_dtypes(include=['number'])
print("\nNumeric columns:")
print(numeric_data.dtypes)

Type: <class 'pandas.core.series.Series'>
First 5 names:
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

Type: <class 'pandas.core.frame.DataFrame'>
Basic info (first 5 rows):
                                                Name   Age     Sex  Survived
0                            Braund, Mr. Owen Harris  22.0    male         0
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0  female         1
2                             Heikkinen, Miss. Laina  26.0  female         1
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0  female         1
4                           Allen, Mr. William Henry  35.0    male         0

Numeric columns:
PassengerId      int64
Survived         int64
Pclass           int64
Age            float64

## 2. Filtering Data (Boolean Masking)

Filtering, or **subsetting**, allows you to select only rows that meet specific conditions.

* **Single Condition:** Create a boolean Series (True/False) and pass it to the DataFrame: `df[df['Age'] > 30]`.
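* **Multiple Conditions:** Wrap each condition in **parentheses** (`&` and `|` bind more tightly than comparisons such as `==` and `<`) and combine them with the element-wise operators below, not Python's `and`/`or`: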
    * **AND** operator: `&`
    * **OR** operator: `|`
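
As a minimal sketch of these rules, the example below combines masks on a small made-up DataFrame. It also shows two related tools not listed above: `~` to negate a mask and `isin()` as a compact alternative to several `==` tests joined by `|`. Using Python's `and`/`or` on a Series raises a `ValueError`, which is why the element-wise operators are required.

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Age": [25, 40, 52, 8],
    "Pclass": [1, 3, 2, 1],
})

# AND: both conditions must hold; each comparison sits in its own parentheses
young_women = df[(df["Sex"] == "female") & (df["Age"] < 30)]

# OR: either condition may hold
first_or_child = df[(df["Pclass"] == 1) | (df["Age"] < 12)]

# NOT: ~ inverts a boolean mask
not_third = df[~(df["Pclass"] == 3)]

# isin() replaces (Pclass == 1) | (Pclass == 2)
upper_classes = df[df["Pclass"].isin([1, 2])]

print(len(young_women), len(first_or_child), len(not_third), len(upper_classes))
# prints: 1 2 3 3
```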

In [9]:
# Single condition filtering
women = titanic_df[titanic_df['Sex'] == 'female']
print(f"Total passengers: {len(titanic_df)}")
print(f"Female passengers: {len(women)} ({len(women)/len(titanic_df)*100:.1f}%)")

# Multiple conditions filtering
young_women = titanic_df[(titanic_df['Sex'] == 'female') & (titanic_df['Age'] < 30)]
print(f"\nYoung women (under 30): {len(young_women)}")

elite_survivors = titanic_df[
    (titanic_df['Sex'] == 'female') &
    (titanic_df['Pclass'] == 1) &
    (titanic_df['Survived'] == 1)
]
print(f"First-class women who survived: {len(elite_survivors)}")
print("\nExample: Elite survivors")
print(elite_survivors[['Name', 'Age', 'Pclass', 'Fare']].head())

Total passengers: 891
Female passengers: 314 (35.2%)

Young women (under 30): 147
First-class women who survived: 91

Example: Elite survivors
                                                 Name   Age  Pclass      Fare
1   Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0       1   71.2833
3        Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0       1   53.1000
11                           Bonnell, Miss. Elizabeth  58.0       1   26.5500
31     Spencer, Mrs. William Augustus (Marie Eugenie)   NaN       1  146.5208
52           Harper, Mrs. Henry Sleeper (Myna Haxtun)  49.0       1   76.7292
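
One detail worth noting about the age filter above: a comparison against a missing value evaluates to `False`, so passengers whose `Age` is `NaN` (such as row 31 in the output) are excluded from `Age < 30` and would equally be excluded from `Age >= 30`. A small sketch with made-up ages:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: two of five ages are missing
ages = pd.DataFrame({"Age": [22.0, np.nan, 58.0, np.nan, 4.0]})

under_30 = ages[ages["Age"] < 30]
at_least_30 = ages[ages["Age"] >= 30]
unknown = ages[ages["Age"].isna()]

# The two comparisons do not cover every row: NaN rows match neither
print(len(under_30), len(at_least_30), len(unknown), len(ages))
# prints: 2 1 2 5
```

Counting the `isna()` rows alongside a filter makes it clear how many records were left out because the value was unknown rather than because it failed the condition.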


## 3. Sorting and Ordering Data

The **`sort_values()`** method arranges your data by the values in one or more columns.

* **Ascending vs. Descending:** Use `ascending=False` for descending order.
* **Multi-Column Sort:** Pass a list of columns and a corresponding list of boolean values for the sort order.
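
The sketch below illustrates both options on made-up data, plus two behaviors that are easy to miss: `sort_values()` returns a new DataFrame rather than modifying the original, and missing values are placed last by default unless `na_position="first"` is given.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Pclass": [3, 1, 1, 2],
    "Fare": [7.25, 80.0, np.nan, 13.0],
})

# Descending sort; df itself keeps its original row order
by_fare = df.sort_values(by="Fare", ascending=False)

# Missing values go last by default; na_position="first" moves them to the top
nan_first = df.sort_values(by="Fare", na_position="first")

# One ascending flag per column: class low-to-high, fare high-to-low
by_class_fare = df.sort_values(by=["Pclass", "Fare"], ascending=[True, False])

# Index labels travel with the rows; reset_index(drop=True) renumbers them
renumbered = by_fare.reset_index(drop=True)

print(list(by_fare["Name"]))        # ['B', 'D', 'A', 'C']
print(list(nan_first["Name"]))      # ['C', 'A', 'D', 'B']
print(list(by_class_fare["Name"]))  # ['B', 'C', 'D', 'A']
print(list(renumbered.index))       # [0, 1, 2, 3]
```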

In [10]:
# Basic sorting by one column
by_fare = titanic_df.sort_values(by='Fare', ascending=False)
print("Most expensive tickets:")
print(by_fare[['Name', 'Fare', 'Pclass']].head())

# Multi-column sorting
by_class_fare = titanic_df.sort_values(by=['Pclass', 'Fare'], ascending=[True, False])
print("\nSorted by Class (1→3), then by highest Fare within class:")
print(by_class_fare[['Name', 'Pclass', 'Fare', 'Survived']].head(10))

Most expensive tickets:
                                   Name      Fare  Pclass
679  Cardeza, Mr. Thomas Drake Martinez  512.3292       1
258                    Ward, Miss. Anna  512.3292       1
737              Lesurer, Mr. Gustave J  512.3292       1
88           Fortune, Miss. Mabel Helen  263.0000       1
438                   Fortune, Mr. Mark  263.0000       1

Sorted by Class (1→3), then by highest Fare within class:
                                      Name  Pclass      Fare  Survived
258                       Ward, Miss. Anna       1  512.3292         1
679     Cardeza, Mr. Thomas Drake Martinez       1  512.3292         1
737                 Lesurer, Mr. Gustave J       1  512.3292         1
27          Fortune, Mr. Charles Alexander       1  263.0000         0
88              Fortune, Miss. Mabel Helen       1  263.0000         1
341         Fortune, Miss. Alice Elizabeth       1  263.0000         1
438                      Fortune, Mr. Mark       1  263.0000         0

## 4. Aggregation with GroupBy

**`groupby()`** implements the **split-apply-combine** pattern: it **splits** the rows into groups by the values of one or more columns, **applies** a function (such as `mean()`, `sum()`, or `count()`) to each group, and **combines** the results into a single Series or DataFrame. Because `Survived` is coded 0/1, its mean within a group is that group's survival rate.
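
A minimal sketch of split-apply-combine on made-up rows follows. It uses named aggregation (`agg(new_name=(column, function))`) to compute several statistics at once; the output column names `passengers`, `known_ages`, and `mean_age` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0],
    "Age": [30.0, np.nan, 40.0, 20.0, np.nan],
})

# Mean of a 0/1 column is the share of 1s, i.e. the survival rate per group
rate = df.groupby("Sex")["Survived"].mean()

# Named aggregation: several statistics in one call, one output column each
stats = df.groupby("Sex").agg(
    passengers=("Survived", "size"),   # rows per group, missing values included
    known_ages=("Age", "count"),       # non-missing Age values only
    mean_age=("Age", "mean"),          # NaN is skipped by default
)

print(rate)
print(stats)
```

The difference between `size` and `count` matters whenever a column has gaps: here the male group has three rows but only two known ages, and the mean age is computed from those two.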

In [11]:
# Survival rate by gender
gender_survival = titanic_df.groupby('Sex')['Survived'].mean()
print("Survival rate by gender:")
for gender, rate in gender_survival.items():
    print(f"  {gender.capitalize()}: {rate:.2%}")

# Survival rate by class
class_survival = titanic_df.groupby('Pclass')['Survived'].mean()
print("\nSurvival rate by class:")
for pclass, rate in class_survival.items():
    print(f"  Class {pclass}: {rate:.2%}")

# Multi-variable GroupBy
age_by_gender_class = titanic_df.groupby(['Sex', 'Pclass'])['Age'].mean()
print("\nAverage age by gender and class:")
print(age_by_gender_class.round(1))

Survival rate by gender:
  Female: 74.20%
  Male: 18.89%

Survival rate by class:
  Class 1: 62.96%
  Class 2: 47.28%
  Class 3: 24.24%

Average age by gender and class:
Sex     Pclass
female  1         34.6
        2         28.7
        3         21.8
male    1         41.3
        2         30.7
        3         26.5
Name: Age, dtype: float64


In [12]:
# Redefine the filtered DataFrames so this cell runs on its own (only `women` is exported below)
women = titanic_df[titanic_df['Sex'] == 'female']
older_passengers = titanic_df[titanic_df['Age'] > 30]
young_women = titanic_df[(titanic_df['Sex'] == 'female') & (titanic_df['Age'] < 30)]
first_class = titanic_df[titanic_df['Pclass'] == 1]


summary = f"""
KEY INSIGHTS FROM DATA MANIPULATION
===================================
- Female survival rate: {titanic_df[titanic_df['Sex'] == 'female']['Survived'].mean():.1%}
- Male survival rate: {titanic_df[titanic_df['Sex'] == 'male']['Survived'].mean():.1%}
- First class survival: {titanic_df[titanic_df['Pclass'] == 1]['Survived'].mean():.1%}
- Third class survival: {titanic_df[titanic_df['Pclass'] == 3]['Survived'].mean():.1%}

Next Steps:
- Handle missing Age and Cabin data properly (imputation).
- Create new, engineered features (e.g., FamilySize, Title).
- Introduce basic data visualization.
"""
print(f"\n{summary}")

# Save filtered datasets for further analysis
os.makedirs('data-visualization/data', exist_ok=True)
women.to_csv('data-visualization/data/titanic_women.csv', index=False)
complete_data = titanic_df.dropna(subset=['Age', 'Embarked'])
complete_data.to_csv('data-visualization/data/titanic_complete.csv', index=False)

print("\nSaved datasets:")
print(f"- Women passengers: data-visualization/data/titanic_women.csv ({len(women)} rows)")
print(f"- Complete data: data-visualization/data/titanic_complete.csv ({len(complete_data)} rows)")



KEY INSIGHTS FROM DATA MANIPULATION
===================================
- Female survival rate: 74.2%
- Male survival rate: 18.9%
- First class survival: 63.0%
- Third class survival: 24.2%

Next Steps:
- Handle missing Age and Cabin data properly (imputation).
- Create new, engineered features (e.g., FamilySize, Title).
- Introduce basic data visualization.


Saved datasets:
- Women passengers: data-visualization/data/titanic_women.csv (314 rows)
- Complete data: data-visualization/data/titanic_complete.csv (712 rows)
