# 📊 Day 9: Data Analysis & Visualization with Pandas & Matplotlib  
## *Student Workbook — No Solutions Provided*

> **Instructor Note**: This notebook is for students. All code cells are empty or contain only starter comments. Explanations guide them — but they must write the code.

## ⏱️ Hour 0 — Syntax & Setup (15 min)

### 📝 Why This Matters

> At FAANG companies, you’ll rarely start from scratch. You’ll inherit messy data, broken scripts, and vague client requests.  
> Today, you’ll learn to:  
> - **Clean** like a data engineer  
> - **Visualize** like a product analyst  
> - **Fix** like a senior dev  
> - **Explain** like a tech lead

> **Toolchain**: `pandas` (data manipulation), `matplotlib` (visualization), `seaborn` (plot styling and palettes), `venv` (isolation), `git` (version control).

In [None]:
%%bash
# Create project folder + venv
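# YOUR CODE HERE: mkdir, venv, activate, install pandas matplotlib seaborn jupyter
# NOTE: activation only lasts for this cell's shell. To use the venv in the notebook, launch Jupyter from the activated environment.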

# Save dependencies
# YOUR CODE HERE: pip freeze > requirements.txt

# Create .gitignore
# YOUR CODE HERE: echo ... > .gitignore

echo "✅ Environment ready. Activate with: source venv/bin/activate"

In [None]:
# Core imports — import pandas, matplotlib.pyplot, seaborn
# YOUR CODE HERE

# Check versions (FAANG standard: pin versions in requirements.txt)
# YOUR CODE HERE: print pandas version, matplotlib version

# Set style (professional defaults)
# YOUR CODE HERE: set matplotlib style to 'seaborn-v0_8-whitegrid'
# YOUR CODE HERE: set seaborn palette to "colorblind"

print("✅ Libraries imported. Ready to analyze.")
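
One version pitfall is worth knowing before you set the style: the name `seaborn-v0_8-whitegrid` is assumed here to exist only in matplotlib 3.6 and later. The sketch below checks `plt.style.available` and falls back to the default style instead of raising an error. The fallback choice is an assumption for illustration, not part of the workbook.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so this also runs without a display
import matplotlib.pyplot as plt

# Assumption: this style name ships with matplotlib 3.6+; older releases
# may not know it, so fall back instead of crashing.
preferred = "seaborn-v0_8-whitegrid"
fallback = "default"

chosen_style = preferred if preferred in plt.style.available else fallback
plt.style.use(chosen_style)
print(f"matplotlib {matplotlib.__version__} using style: {chosen_style}")
```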

## 📚 Part 1: Data Concepts & Process (20 min)

### 📝 The Data Lifecycle (FAANG Workflow)

> **1. Ingest** → Load raw data (CSV, API, DB)  
> **2. Clean** → Fix duplicates, NaNs, typos, outliers  
> **3. Transform** → Group, filter, calculate new columns  
> **4. Visualize** → Charts that tell a story (not just pretty)  
> **5. Interpret** → "What does this mean for the business?"  
> **6. Ship** → Save, commit, document, present

> ⚠️ **FAANG Tip**: Never skip cleaning. Dirty data → wrong decisions → fired engineers.
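
The lifecycle steps can be sketched end to end on a few rows. This is a minimal illustration on hypothetical toy data (the `day`, `units`, and `store` columns are made up and unrelated to the sales file), covering ingest, clean, transform, and interpret.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for a raw export (not the workbook dataset).
raw_csv = io.StringIO(
    "day,units,store\n"
    "Mon,10,A\n"
    "Mon,10,A\n"
    "Tue,,B\n"
    "Wed,7,A\n"
)

# 1. Ingest: load the raw text into a DataFrame.
orders = pd.read_csv(raw_csv)

# 2. Clean: drop the exact duplicate, then the row with no unit count.
orders = orders.drop_duplicates().dropna(subset=["units"])

# 3. Transform: total units per store.
units_per_store = orders.groupby("store")["units"].sum()

# 5. Interpret: which store sold the most?
top_store = units_per_store.idxmax()
print(units_per_store)
print(f"Top store: {top_store}")
```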

### 📝 Matplotlib Core Concepts

> - **`pyplot`**: MATLAB-style interface. Use `import matplotlib.pyplot as plt`.  
> - **Figure & Axes**: A `Figure` is the whole canvas (the window or saved image). Each `Axes` is one plot area inside it, with its own x- and y-axis.  
> - **Plot Types**:  
>   - `plt.plot()` → Line (trends)  
>   - `plt.bar()` → Bars (comparisons)  
>   - `plt.scatter()` → Scatter (relationships)  
>   - `plt.hist()` → Histograms (distributions)  
>   - `plt.pie()` → Pie (parts of whole)  
> - **Styling**: `marker`, `linestyle`, `color`, `label`, `title`, `grid`, `subplot`.

> 🎯 **Golden Rule**: Every chart must answer: *"So what?"*
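
To make the Figure and Axes distinction concrete, here is a small sketch that uses the object-oriented interface: one `Figure` holding two `Axes`, each styled with some of the options listed above. The quarterly numbers are invented for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; in Jupyter you can omit this line
import matplotlib.pyplot as plt

# One Figure (the canvas) holding two Axes (the individual plot areas).
fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

quarters = ["Q1", "Q2", "Q3", "Q4"]  # hypothetical toy values
signups = [120, 150, 170, 210]

# Left Axes: a line plot with marker, linestyle, color, and label set.
ax_line.plot(quarters, signups, marker="o", linestyle="--", color="tab:blue", label="Signups")
ax_line.set_title("Trend")
ax_line.legend()
ax_line.grid(True)

# Right Axes: the same data as bars.
ax_bar.bar(quarters, signups, color="tab:green")
ax_bar.set_title("Comparison")

fig.tight_layout()
n_axes = len(fig.axes)
print(f"Figure contains {n_axes} Axes")
plt.close(fig)
```

The `plt.plot()` style calls in the list act on whichever Axes is current; holding `ax_line` and `ax_bar` explicitly tends to be easier to follow once a figure has more than one panel.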

## 🛠️ Part 2: Build Tools — Clean & Visualize (45 min)

### 📝 The Dataset

> We’ll use `sales_messy.csv` — a real-world mess:  
> - Duplicate rows  
> - Missing values (NaN)  
> - Inconsistent names ("iPhone " vs "iPhone")  
> - Empty rows  
> - Misleading column names

> Your job: Turn this into `sales_clean.csv` — ready for the CFO.
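
Before cleaning, it helps to measure each kind of mess. The sketch below counts duplicates, fully empty rows, and missing values, and uses `repr()` to expose trailing spaces. It runs on a hypothetical toy frame with the same kinds of problems, not on `sales_messy.csv`.

```python
import io

import pandas as pd

# Hypothetical toy data with the same kinds of problems (not sales_messy.csv).
toy_csv = io.StringIO(
    "city,temp\n"
    "Oslo ,4\n"
    "Oslo,5\n"
    "Oslo,5\n"
    ",\n"
    "Rome,\n"
)
toy = pd.read_csv(toy_csv)

n_duplicates = int(toy.duplicated().sum())        # rows identical to an earlier row
n_empty_rows = int(toy.isna().all(axis=1).sum())  # rows where every column is NaN
missing_per_column = toy.isna().sum()             # NaN count per column

# repr() makes trailing spaces visible: 'Oslo ' vs 'Oslo'
distinct_cities = [repr(c) for c in toy["city"].dropna().unique()]

print(n_duplicates, n_empty_rows)
print(missing_per_column)
print(distinct_cities)
```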

In [None]:
%%writefile sales_messy.csv
Month,revenue($),Product,Region
Jan,1000,iPhone ,North
Feb,1500,iPhone,North
Feb,1500,iPhone,North
Mar, ,iPad,South
Apr,2000,MacBook ,East
May,2500,MacBook,East
Jun,3000, ,West
Jul,3500,iPhone,North
Aug, ,iPad,South
Sep,4500,MacBook,East
Oct,5000,iPhone,North
Nov,5500,iPad,South
Dec,6000,MacBook,East
,,,
Jan,1000,iPhone ,North

In [None]:
# Load messy data
# YOUR CODE HERE: use pd.read_csv() and store the result in df_messy

print("🔴 RAW DATA — 5 Rows:")
# YOUR CODE HERE: print first 5 rows

print("\n🔴 RAW DATA — Info:")
# YOUR CODE HERE: call df_messy.info() directly (it prints its own report; wrapping it in print() adds a stray "None")

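print("\n🔴 RAW DATA — Describe:")
# YOUR CODE HERE: print df_messy.describe()
# NOTE: revenue($) loads as text because of its blank entries, so describe() reports count/unique/top/freq here, not mean/std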

In [None]:
# 1. Drop duplicates (store the result in df_clean so df_messy stays intact)
# YOUR CODE HERE

# 2. Rename columns: rename "revenue($)" to "Revenue"
# YOUR CODE HERE

# 3. Convert Revenue to numeric (coerce errors to NaN)
# HINT: use pd.to_numeric(..., errors='coerce')
# YOUR CODE HERE

# 4. Drop rows where ALL columns are NaN
# (do this BEFORE filling Revenue; otherwise the empty row gets a Revenue value and is no longer all-NaN)
# YOUR CODE HERE

# 5. Fill missing Revenue (with mean)
# YOUR CODE HERE

# 6. Strip whitespace from Product
# YOUR CODE HERE

# 7. Ensure Product has no empty strings
# YOUR CODE HERE

# Save cleaned data
# YOUR CODE HERE: to_csv("sales_clean.csv", index=False)

print("✅ CLEANED DATA:")
# YOUR CODE HERE: print df_clean

print(f"\n✅ Rows reduced from {len(df_messy)} to {len(df_clean)}")
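
One ordering pitfall in a cleaning pipeline like this is worth checking: filling a column before dropping all-NaN rows can keep an empty row alive. The sketch below shows both orders on a hypothetical toy frame (the `item` and `amount` columns are made up), along with how `errors="coerce"` treats a blank string.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one blank-string amount, one fully empty row.
toy = pd.DataFrame(
    {
        "item": ["pen", "ink", np.nan, "pad"],
        "amount": ["10", " ", np.nan, "30"],
    }
)

# Blank or non-numeric strings become NaN instead of raising an error.
toy["amount"] = pd.to_numeric(toy["amount"], errors="coerce")

# Order A: drop fully empty rows first, then fill the remaining gaps.
dropped_first = toy.dropna(how="all").copy()
dropped_first["amount"] = dropped_first["amount"].fillna(dropped_first["amount"].mean())

# Order B: fill first. The empty row now has an amount, so how="all" keeps it.
filled_first = toy.copy()
filled_first["amount"] = filled_first["amount"].fillna(filled_first["amount"].mean())
filled_first = filled_first.dropna(how="all")

print(len(dropped_first), len(filled_first))
```

Order A ends with three rows and order B with four, because the filled-in amount hides the empty row from `dropna(how="all")`.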

## 🐞 Part 3: Fix Tickets — Visualization Deep Dive (45 min)

### 📝 Ticket Philosophy

> At FAANG companies, you’ll get tickets like these. Your fix must:  
> - Solve the technical issue  
> - Explain the business impact  
> - Not break anything else  
> - Be documented in code + commit

### 🎟️ Ticket 901: Fix Misleading Y-Axis (Line Plot + Labels + Grid)

#### 📝 Client Note

> "The chart makes growth look flat because Y-axis starts at $1000. Fix it — our investors think we’re failing!"

In [None]:
# Load cleaned data
# YOUR CODE HERE

# BROKEN: Misleading axis
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
# YOUR CODE HERE: plot line with red markers
plt.title("❌ BROKEN: Hides Growth")
plt.grid(True)

# FIXED: Honest axis
plt.subplot(1, 2, 2)
# YOUR CODE HERE: plot line with green markers, add ylabel, xlabel
plt.title("✅ FIXED: True Growth")
# YOUR CODE HERE: set ylim to start at 0
plt.tight_layout()
plt.show()

print("💡 Business Impact: [YOUR ANSWER: How did this fix affect the business?]")
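
The effect of the axis baseline can be checked numerically rather than by eye. This sketch plots the same hypothetical toy series twice and compares the lower y-limit matplotlib chooses automatically with one pinned at zero. The values are invented and unrelated to the sales data.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; in Jupyter you can omit this line
import matplotlib.pyplot as plt

months = [1, 2, 3, 4]  # hypothetical toy values
revenue = [1000, 1100, 1250, 1400]

fig, (ax_auto, ax_zero) = plt.subplots(1, 2, figsize=(8, 3))

# Left: matplotlib picks limits that hug the data range.
ax_auto.plot(months, revenue, marker="o")
ax_auto.set_title("Auto limits")

# Right: same data, but the baseline is pinned at zero.
ax_zero.plot(months, revenue, marker="o")
ax_zero.set_ylim(bottom=0)
ax_zero.set_title("Zero baseline")

auto_bottom, _ = ax_auto.get_ylim()
zero_bottom, _ = ax_zero.get_ylim()
print(f"auto bottom: {auto_bottom:.0f}, pinned bottom: {zero_bottom:.0f}")
plt.close(fig)
```

With automatic limits the axis starts just below the smallest value, so the vertical space shows only the change between points. With a zero baseline the same change is drawn in proportion to the total.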

### 🎟️ Ticket 902: Add Product Comparison (Bar Chart + Subplot)

#### 📝 Client Note

> "We need to know which product drives revenue. Show me a bar chart — and put it next to the line chart."

In [None]:
# Group by Product
# YOUR CODE HERE: group by "Product", sum "Revenue"

# Create subplots
# YOUR CODE HERE: plt.subplots(2, 1, figsize=(10, 8))

# Line chart (top)
# YOUR CODE HERE: plot line with markers and linestyle

# Bar chart (bottom)
# YOUR CODE HERE: bar chart with colors

plt.tight_layout()
plt.show()

print("💡 Business Impact: [YOUR ANSWER: What decision did this enable?]")
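
The group-then-plot pattern can be sketched on unrelated data first. The example below aggregates a hypothetical `visits` frame by category and draws a line chart above a bar chart using the two Axes returned by `plt.subplots(2, 1)`. All column names and values are invented for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; in Jupyter you can omit this line
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data (not the workbook dataset).
visits = pd.DataFrame(
    {
        "week": [1, 2, 3, 4, 5, 6],
        "channel": ["email", "ads", "email", "ads", "social", "email"],
        "clicks": [40, 55, 60, 70, 30, 90],
    }
)

# Aggregate: one total per category, sorted so the largest bar comes first.
clicks_by_channel = visits.groupby("channel")["clicks"].sum().sort_values(ascending=False)

# plt.subplots(2, 1) returns the Figure plus an array of two Axes (top, bottom).
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, figsize=(6, 6))

ax_top.plot(visits["week"], visits["clicks"], marker="o", linestyle="-")
ax_top.set_title("Clicks per week")

ax_bottom.bar(
    clicks_by_channel.index,
    clicks_by_channel.values,
    color=["tab:blue", "tab:orange", "tab:green"],
)
ax_bottom.set_title("Clicks by channel")

fig.tight_layout()
print(clicks_by_channel)
plt.close(fig)
```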

### 🎟️ Ticket 903: Show Revenue Distribution (Histogram + Scatter)

#### 📝 Client Note

> "Is revenue predictable? Or are there wild outliers? Plot a histogram and scatter vs. month number."

In [None]:
# Histogram
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
# YOUR CODE HERE: histogram with bins=5, edgecolor, color

plt.subplot(1, 2, 2)
# YOUR CODE HERE: scatter plot (x=month number, y=Revenue)
# HINT: Month holds names like "Jan", so map them to 1-12 before plotting

plt.tight_layout()
plt.show()

print("💡 Business Impact: [YOUR ANSWER: What did this reveal about risk?]")
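
A scatter plot against month number needs a numeric x-value. One way to get it, sketched here on a hypothetical toy frame, is an explicit lookup from three-letter abbreviations to 1 through 12. The helper names `MONTH_ABBREVS` and `month_to_num` are made up for this example.

```python
import pandas as pd

# Explicit lookup table, so the result does not depend on the system locale.
MONTH_ABBREVS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_to_num = {name: number for number, name in enumerate(MONTH_ABBREVS, start=1)}

# Hypothetical toy frame with three-letter month abbreviations.
toy = pd.DataFrame({"Month": ["Jan", "Mar", "Dec"], "value": [5, 8, 13]})

# .map() looks each abbreviation up in the dictionary.
toy["month_num"] = toy["Month"].map(month_to_num)

print(toy)
```

Any abbreviation missing from the table would map to NaN, which is a useful signal that the names still need cleaning.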

### 🎟️ Ticket 904: Revenue Share (Pie Chart + Styling)

#### 📝 Client Note

> "What % of revenue comes from each product? Make a pie chart — and make it look boardroom-ready."

In [None]:
# Pie chart
plt.figure(figsize=(8, 8))
# YOUR CODE HERE: pie chart with autopct, startangle, colors, explode, shadow
plt.title("Revenue Share by Product", fontsize=16, fontweight='bold')
plt.axis('equal')  # Perfect circle
plt.show()

print("💡 Business Impact: [YOUR ANSWER: What strategy did this support?]")
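
The percentages a pie chart prints through `autopct` are each slice divided by the total. The sketch below computes them directly and pulls out the largest slice with `explode`, using hypothetical plan totals that are unrelated to the sales data.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; in Jupyter you can omit this line
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical totals per category.
totals = pd.Series({"basic": 200, "plus": 300, "pro": 500})

# Share of the whole, in percent, rounded to one decimal place.
share_pct = (totals / totals.sum() * 100).round(1)

# explode needs one offset per slice; only the largest slice is pulled out here.
explode = [0.1 if name == totals.idxmax() else 0 for name in totals.index]

fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(totals, labels=totals.index, autopct="%1.1f%%", startangle=90, explode=explode)
ax.set_title("Share by plan")

print(share_pct)
plt.close(fig)
```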

## 📈 Part 4: Business Impact & Extend (15 min)

### 📝 FAANG-Style Retrospective

> For each ticket, write:  
> **Technical Fix**: What code changed?  
> **Business Impact**: How did this affect revenue/users/decisions?  
> **Prevention**: How to avoid this in future? (e.g., data validation, unit tests)

> Example:  
> - **Ticket 901**: Set `plt.ylim(bottom=0)` → Showed true growth → Prevented investor pullout.  
> - **Prevention**: Add data validation + automated chart audits.
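
As one possible shape for the data validation mentioned under Prevention, here is a minimal sketch. The function name `validate_sales` and the specific checks are assumptions chosen for illustration. The column names follow the cleaned dataset described earlier.

```python
import pandas as pd

REQUIRED_COLUMNS = ["Month", "Revenue", "Product", "Region"]


def validate_sales(df: pd.DataFrame) -> list:
    """Return a list of problem descriptions; an empty list means no issue was found."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        # Without the expected columns the remaining checks cannot run.
        return [f"missing columns: {missing}"]

    problems = []
    if df.duplicated().any():
        problems.append("duplicate rows")
    if not pd.api.types.is_numeric_dtype(df["Revenue"]):
        problems.append("Revenue is not numeric")
    elif df["Revenue"].isna().any():
        problems.append("Revenue has missing values")
    if (df["Product"].fillna("").str.strip() == "").any():
        problems.append("Product has empty values")
    return problems


# Hypothetical frames: one clean, one with two deliberate problems.
good = pd.DataFrame(
    {
        "Month": ["Jan", "Feb"],
        "Revenue": [1000.0, 1500.0],
        "Product": ["iPhone", "iPad"],
        "Region": ["North", "South"],
    }
)
bad = good.copy()
bad.loc[1, "Revenue"] = float("nan")
bad.loc[0, "Product"] = "  "

good_report = validate_sales(good)
bad_report = validate_sales(bad)
print(good_report)
print(bad_report)
```

Returning a list of messages instead of raising on the first failure lets a single run report every problem at once.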

### 🚀 Extend: What’s Next?

> - Add **unit tests** for cleaning functions (pytest)  
> - Build **interactive dashboards** (Plotly/Dash)  
> - **Automate** with GitHub Actions (run tests + generate charts on push)  
> - **Deploy** as web app (Flask + Heroku)
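
For the unit-test item, a pytest-style test could look like the sketch below. The helper `strip_product_names` is hypothetical and stands in for whatever cleaning function you extract from your notebook. pytest would discover the `test_` function automatically, and calling it directly also works.

```python
import pandas as pd


def strip_product_names(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning helper: trim whitespace around Product values."""
    out = df.copy()
    out["Product"] = out["Product"].str.strip()
    return out


def test_strip_product_names_merges_variants():
    messy = pd.DataFrame({"Product": ["iPhone ", "iPhone", " iPad"]})
    cleaned = strip_product_names(messy)
    # Whitespace variants collapse to the same name.
    assert cleaned["Product"].tolist() == ["iPhone", "iPhone", "iPad"]
    # The input frame is left untouched.
    assert messy["Product"].tolist() == ["iPhone ", "iPhone", " iPad"]


# pytest would run this for you; a direct call raises AssertionError on failure.
test_strip_product_names_merges_variants()
print("test passed")
```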

## ✅ Deliverables Checklist

- [ ] `sales_clean.csv` generated  
- [ ] 4 charts saved as PNG (line, bar, histogram, pie), e.g. with `plt.savefig("line.png")` called before `plt.show()`  
- [ ] GitHub repo with branches: `v1-messy`, `v2-cleaned`, `v3-visualized`  
- [ ] Client text proof saved  
- [ ] Business impact notes for all 4 tickets  
- [ ] LinkedIn post drafted
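
For the PNG deliverable, a minimal saving sketch follows. It writes to a temporary folder so it can run anywhere; in your project you would pick your own path. The file name `line.png` and the plotted values are placeholders.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend; in Jupyter you can omit this line
import matplotlib.pyplot as plt

# Stand-in output folder; replace with a folder inside your project.
out_dir = Path(tempfile.mkdtemp())

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot([1, 2, 3], [2, 4, 8], marker="o")  # placeholder values
ax.set_title("Demo chart")

png_path = out_dir / "line.png"
# Save before showing: after show(), pyplot may hand you a fresh, empty figure.
fig.savefig(png_path, dpi=150, bbox_inches="tight")
plt.close(fig)

print(png_path.exists())
```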

## 🎓 You Are Now FAANG-Ready

> You didn’t just "learn matplotlib."  
> You **solved business problems with data** — cleaned it, visualized it, fixed it, explained it.  
> This is what FAANG hires for.  
> Go update your LinkedIn:  
> *"Mastered Pandas + Matplotlib — turned messy data into boardroom-ready insights. #DataScience #Python #FAANG"*