# Day 5 — Applied Data Science: From Data to Insights

**Duration:** 2 hours

**Objectives:**
- Run an end-to-end mini-project (Titanic)
- Practice cleaning, EDA, modeling, evaluation
- Prepare deliverables for the take-home assignment

## 1. Project Overview

**Dataset:** Titanic (loaded with `seaborn.load_dataset`)

**Goal:** Predict `survived` from passenger features such as class, sex, age, and fare.

**Deliverables:** a Jupyter notebook and a 1-page summary report.

## 2. Step-by-step pipeline (we'll follow this in the notebook)
1. Load data
2. Inspect and clean
3. Feature engineering
4. Visualize (EDA)
5. Train two models (Logistic Regression, Decision Tree)
6. Evaluate and compare
7. Summarize findings

In [None]:
# Load dataset and quick inspection
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

sns.set()

df = sns.load_dataset('titanic')
print('Shape:', df.shape)
df.head()

## 3. Cleaning (suggested)
- Impute `age` with median
- Fill `embarked` with mode
- Encode `sex`, and `class` if you use it instead of the already-numeric `pclass` (one-hot or ordinal)
- Drop unused columns (e.g., `deck`, `embark_town` if desired)

In [None]:
# Cleaning steps
from sklearn.impute import SimpleImputer

notebook_df = df[['survived','pclass','sex','age','fare','embarked']].copy()
# Impute age with the median; ravel() flattens the (n, 1) array the imputer returns
imp = SimpleImputer(strategy='median')
notebook_df['age'] = imp.fit_transform(notebook_df[['age']]).ravel()
# Impute embarked with the most frequent value
imp2 = SimpleImputer(strategy='most_frequent')
notebook_df['embarked'] = imp2.fit_transform(notebook_df[['embarked']]).ravel()
# Encode sex
notebook_df['sex'] = notebook_df['sex'].map({'male':0,'female':1})
# One-hot embarked
notebook_df = pd.get_dummies(notebook_df, columns=['embarked'], prefix='emb')
notebook_df = notebook_df.dropna()
notebook_df.head()

## 4. EDA — Visualizations

Create plots to show relationships and distributions. Examples: survival rate by sex, age histogram, survival by fare bin.

In [None]:
# EDA visuals
import seaborn as sns

plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
sns.countplot(x='sex', hue='survived', data=notebook_df)
plt.xticks([0, 1], ['male', 'female'])  # sex was encoded as male=0, female=1 during cleaning
plt.title('Survival by sex')

plt.subplot(1,2,2)
notebook_df['age'].hist(bins=20)
plt.title('Age distribution')
plt.show()

# Survival by fare bins
bins=[0,10,20,50,100,600]
labels=['0-10','10-20','20-50','50-100','100+']
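# include_lowest=True keeps zero-fare passengers in the first bin instead of turning them into NaN
notebook_df['fare_bin'] = pd.cut(notebook_df['fare'].fillna(0), bins=bins, labels=labels, include_lowest=True)
# observed=False keeps every fare bin in the result, even an empty one
surv_by_fare = notebook_df.groupby('fare_bin', observed=False)['survived'].mean().reset_index()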
sns.barplot(x='fare_bin', y='survived', data=surv_by_fare)
plt.title('Survival rate by Fare bin')
plt.show()

## 5. Modeling: train Logistic Regression & Decision Tree

In [None]:
# Modeling & evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# The one-hot emb_* columns are left out here; add them to `features` to test their effect
features = ['pclass','sex','age','fare']
X = notebook_df[features]
y = notebook_df['survived'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression
lr = LogisticRegression(max_iter=200)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print('Logistic Regression accuracy:', accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))

# Decision Tree
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print('Decision Tree accuracy:', accuracy_score(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))

## 6. Interpret & Compare

Compare accuracy, precision, and recall for the two models. Discuss which model you would prefer and why, weighing the domain cost of false negatives against that of false positives.
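To make the trade-off concrete, here is a minimal sketch that computes the three metrics from raw confusion counts in plain Python. The helper name `binary_metrics` and the two prediction lists are made up for illustration and are not real Titanic results; in the notebook you would pass `y_test` with `y_pred_lr` or `y_pred_dt`, or simply rely on `classification_report`.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and error counts for 0/1 labels (1 = survived)."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("y_true and y_pred must not be empty")
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "fp": fp,
        "fn": fn,
    }


# Hypothetical labels and predictions for two models (illustration only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_a = [1, 0, 0, 1, 0, 1, 1, 0]
y_pred_b = [1, 0, 0, 0, 0, 0, 1, 0]

metrics_a = binary_metrics(y_true, y_pred_a)
metrics_b = binary_metrics(y_true, y_pred_b)
print("Model A:", metrics_a)
print("Model B:", metrics_b)
```

In this toy case both models reach the same accuracy (0.75), but model B never raises a false alarm (precision 1.0) while missing half of the survivors (recall 0.5). Which one is preferable depends on which error is more costly in your setting.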

## 7. Take-Home Assignment (deliverables)

**Individual task (due in 1 week):**
- Use the Titanic dataset (or Iris if preferred).
- Clean and preprocess the data.
- Create at least 3 EDA visualizations with short captions/insights.
- Train **two** models (one linear/simple, one tree-based or ensemble).
- Compare their performance using at least two metrics (e.g., accuracy plus recall or F1).
- Submit:
  1. Jupyter Notebook (.ipynb) with code and outputs.
  2. One-page PDF/Markdown summary: problem, approach, results, insights.

**Bonus (optional):** Try feature engineering (create new features) and report impact.
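One possible starting point for the bonus, sketched under assumptions: the seaborn Titanic data has `sibsp` (siblings/spouses aboard) and `parch` (parents/children aboard) columns, which can be combined into a family-size feature. The helper name `add_family_features`, the new column names, and the toy rows below are illustrative only.

```python
import pandas as pd


def add_family_features(frame):
    """Return a copy of `frame` with two engineered columns (hypothetical names)."""
    out = frame.copy()
    # Passenger plus siblings/spouses plus parents/children
    out["family_size"] = out["sibsp"] + out["parch"] + 1
    # 1 if the passenger travelled without family, else 0
    out["is_alone"] = (out["family_size"] == 1).astype(int)
    return out


# Toy rows for illustration; in the notebook, pass the real Titanic DataFrame
toy = pd.DataFrame({
    "sibsp": [1, 0, 3, 0],
    "parch": [0, 0, 2, 1],
    "survived": [1, 0, 0, 1],
})
engineered = add_family_features(toy)
print(engineered)
```

To use this in the notebook, `sibsp` and `parch` would need to be included in the column selection during cleaning. To report impact, train the same model with and without the new columns on the same split and compare the metrics.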

## 8. Starter checklist & tips

- Use train/test split and consider cross-validation for robustness.
- Document every preprocessing step.
- When in doubt, visualize the data.
- Keep your code reproducible: set `random_state` where appropriate.
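The cross-validation tip can be sketched as follows. This is a minimal example on synthetic data (generated with `make_classification`, with missing values injected artificially) so that it runs without downloading anything; the data, the 10% missing rate, and the variable names are assumptions, not part of the course notebook. Putting the imputer inside a pipeline means it is refit on the training part of each fold, so no information from the held-out fold leaks into preprocessing.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the Titanic features (assumption: 4 numeric columns)
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

# Inject roughly 10% missing values so the imputer has something to do
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.1] = np.nan

# Imputation and model are fit together inside each cross-validation fold
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    LogisticRegression(max_iter=200),
)

# With an integer cv and a classifier, scikit-learn uses stratified folds
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

For the Titanic notebook, the same pattern would apply with `X = notebook_df[features]` and `y = notebook_df['survived']`, ideally starting from the un-imputed columns so the pipeline handles the missing values.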

## 9. Course Wrap-up

Congratulations, you have completed the 5-day crash course and refresher! Good luck with the take-home assignment, and reach out if you need help.