# 🚀 Data Science Bootcamp: Hands-On Exercises

---

<div align="center">
    <img src="figs/a_lab_flask_with_data_points.png" width="300"/>
</div>

---

## 🎯 Welcome to Your Data Science Lab!

Congratulations on completing the introduction session! Now it's time to **apply what you've learned** and build your data science portfolio with real-world datasets.

### 📚 What You'll Accomplish:

In this hands-on exercise notebook, you'll work with **5 exciting new datasets** to:

✨ **Analyze Netflix viewing habits** and discover binge-watching patterns  
🛍️ **Optimize retail sales strategies** using transaction data  
🎵 **Explore Spotify music trends** and build recommendation insights  
🌤️ **Investigate weather patterns** and climate trends  
📱 **Decode social media engagement** and viral content secrets  

### 🎪 How This Works:

Each exercise includes:
- **📋 Clear Instructions** - Step-by-step guidance
- **💡 Hints & Tips** - Helpful suggestions when you need them
- **✍️ TODO Sections** - Your space to write code
- **🎯 Challenges** - Optional advanced tasks
- **✅ Self-Check Questions** - Verify your understanding

### 🏆 Your Goal:

Complete all 5 exercises to build a **professional data science portfolio** showcasing:
- Data exploration and cleaning
- Statistical analysis
- Visualization mastery
- Business insights generation
- Storytelling with data

---

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin: 20px 0; text-align: center;">
    <h3>💪 Ready to Level Up Your Skills?</h3>
    <p style="font-size: 18px;">Let's dive into real data and discover amazing insights!</p>
</div>

## ⚙️ Setup & Configuration

First, let's set up your data science environment with all the tools you'll need.

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from datetime import datetime, timedelta
import warnings

# Import our custom utilities
from data_science_utils import (
    MyDataScienceKit, PlottingUtils, DataAnalysisUtils,
    create_environment_checker
)
from interactive_components import InteractiveQuiz, DataExplorer

# Configuration
warnings.filterwarnings('ignore')
np.random.seed(42)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Initialize your toolkit
my_kit = MyDataScienceKit("Data Science Explorer")

print("🎉 Environment Setup Complete!")
print("="*50)
print(create_environment_checker())
print("="*50)
print("\n🚀 You're ready to start your exercises!")

---

## 📊 Exercise 1: Netflix Binge-Watching Analytics 🎬

<div align="center">
    <img src="figs/visualization-example.png" width="400"/>
</div>

### 🎯 Objective
Analyze Netflix viewing patterns to understand user behavior, content preferences, and binge-watching habits.

### 📋 Your Tasks:
1. Load and explore the Netflix viewing data
2. Identify the most popular shows and genres
3. Analyze viewing patterns by time of day
4. Investigate device preferences
5. Calculate completion rates and identify binge-worthy content

In [None]:
# Load the Netflix dataset
netflix_df = pd.read_csv('datasets/netflix_viewing_data.csv')

print("🎬 Netflix Viewing Data Loaded!")
print(f"📊 Dataset Shape: {netflix_df.shape}")
print(f"👥 Unique Users: {netflix_df['user_id'].nunique()}")
print(f"📺 Unique Shows: {netflix_df['show'].nunique()}")
print("\n📋 Dataset Overview:")
display(netflix_df.head())
print("\n📊 Data Types:")
print(netflix_df.dtypes)

### 📺 Task 1.1: Most Popular Shows

**TODO:** Find the top 10 most-watched shows and visualize them.
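
💡 **Warm-up sketch:** Before touching the real data, it can help to try the counting step on a tiny frame. The example below is only an illustration: the show names and the `toy_views` variable are made up and are not part of the Netflix dataset. It shows that `value_counts()` and `groupby().size()` produce the same counts.

```python
import pandas as pd

# Hypothetical toy data: one row per viewing event
toy_views = pd.DataFrame({
    "show": ["Show A", "Show B", "Show A", "Show C", "Show A", "Show B"]
})

# Option 1: value_counts() counts rows per show, sorted in descending order
top_by_value_counts = toy_views["show"].value_counts().head(2)

# Option 2: groupby + size gives the same counts, but must be sorted explicitly
top_by_groupby = (
    toy_views.groupby("show").size().sort_values(ascending=False).head(2)
)

print(top_by_value_counts.to_dict())
print(top_by_groupby.to_dict())
```

`value_counts()` is usually the shorter choice for a single column; `groupby()` becomes useful once you need more than one statistic per show.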

In [None]:
# TODO: Calculate the most popular shows
# Hint: Use value_counts() or groupby() to count show occurrences

# Your code here:
popular_shows = netflix_df['show'].value_counts().head(10)

# Visualization
fig, ax = PlottingUtils.create_bar_chart(
    dict(popular_shows),
    title="🏆 Top 10 Most-Watched Netflix Shows",
    xlabel="Shows",
    ylabel="Number of Views",
    figsize=(12, 6)
)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("\n📊 Top 5 Shows:")
for i, (show, count) in enumerate(popular_shows.head().items(), 1):
    print(f"{i}. {show}: {count} views")

### 🎭 Task 1.2: Genre Analysis

**TODO:** Analyze genre preferences and viewing patterns.
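
💡 **Optional pattern:** One way to compute several statistics per genre and name the result columns in a single step is pandas named aggregation. The sketch below uses a made-up `toy` frame (the values are invented for illustration) with the same column names as the viewing data.

```python
import pandas as pd

# Hypothetical toy data standing in for the viewing log
toy = pd.DataFrame({
    "genre": ["Drama", "Drama", "Comedy", "Comedy"],
    "watch_time_minutes": [60, 40, 30, 20],
    "completed": [True, False, True, True],
})

# Named aggregation: output_column=(input_column, function)
genre_summary = toy.groupby("genre").agg(
    avg_watch_time=("watch_time_minutes", "mean"),
    view_count=("watch_time_minutes", "count"),
    completion_rate=("completed", "mean"),  # mean of booleans = share of True
)

print(genre_summary)
```

Because the output names are declared next to each statistic, there is no separate column-renaming step to keep in sync.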

In [None]:
# TODO: Analyze genres
# Your code here:

# Genre distribution
genre_counts = netflix_df['genre'].value_counts()

# Average watch time by genre
genre_watch_time = netflix_df.groupby('genre').agg({
    'watch_time_minutes': ['mean', 'sum', 'count'],
    'rating': 'mean',
    'completed': 'mean'
}).round(2)

genre_watch_time.columns = ['avg_watch_time', 'total_watch_time', 'view_count', 'avg_rating', 'completion_rate']
genre_watch_time = genre_watch_time.sort_values('total_watch_time', ascending=False)

print("🎭 Genre Analysis:")
display(genre_watch_time)

# Create a pie chart for genre distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Pie chart
colors = PlottingUtils.COLORS['palette'][:len(genre_counts)]
ax1.pie(genre_counts.values, labels=genre_counts.index, autopct='%1.1f%%', colors=colors)
ax1.set_title('📊 Genre Distribution', fontweight='bold')

# Bar chart for completion rates
ax2.bar(genre_watch_time.index, genre_watch_time['completion_rate'], color=colors)
ax2.set_title('✅ Completion Rate by Genre', fontweight='bold')
ax2.set_xlabel('Genre')
ax2.set_ylabel('Completion Rate')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

### 🕐 Task 1.3: Time of Day Analysis

**TODO:** When do people watch Netflix the most?

In [None]:
# TODO: Analyze viewing patterns by time of day
# Your code here:

time_analysis = netflix_df.groupby('time_of_day').agg({
    'user_id': 'count',
    'watch_time_minutes': 'mean',
    'completed': 'mean'
}).round(2)

time_analysis.columns = ['view_count', 'avg_watch_time', 'completion_rate']
time_analysis = time_analysis.sort_values('view_count', ascending=False)

print("🕐 Viewing Patterns by Time of Day:")
display(time_analysis)

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
x = range(len(time_analysis))
bars = ax.bar(x, time_analysis['view_count'], color=PlottingUtils.COLORS['palette'][:len(time_analysis)])
ax.set_xticks(x)
ax.set_xticklabels(time_analysis.index)
ax.set_title('📺 Netflix Viewing by Time of Day', fontweight='bold', fontsize=14)
ax.set_xlabel('Time of Day')
ax.set_ylabel('Number of Views')

# Add value labels on bars
for bar, value in zip(bars, time_analysis['view_count']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 20,
            f'{int(value)}', ha='center', va='bottom', fontweight='bold')

plt.show()

# Insight
peak_time = time_analysis.index[0]
print(f"\n🎯 Key Insight: Peak viewing time is during {peak_time} with {time_analysis.iloc[0]['view_count']:.0f} views!")

### 📱 Task 1.4: Device Analysis

**TODO:** Which devices do users prefer for watching content?

In [None]:
# TODO: Analyze device preferences
# Your code here:

device_analysis = netflix_df.groupby('device').agg({
    'user_id': 'count',
    'watch_time_minutes': 'mean',
    'completed': lambda x: (x == True).mean(),
    'rating': 'mean'
}).round(2)

device_analysis.columns = ['usage_count', 'avg_watch_time', 'completion_rate', 'avg_rating']

print("📱 Device Usage Analysis:")
display(device_analysis)

# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Device usage count
axes[0, 0].bar(device_analysis.index, device_analysis['usage_count'], 
               color=PlottingUtils.COLORS['palette'][:len(device_analysis)])
axes[0, 0].set_title('📱 Device Usage Count')
axes[0, 0].set_ylabel('Number of Views')

# Average watch time by device
axes[0, 1].bar(device_analysis.index, device_analysis['avg_watch_time'],
               color=PlottingUtils.COLORS['palette'][:len(device_analysis)])
axes[0, 1].set_title('⏱️ Average Watch Time by Device')
axes[0, 1].set_ylabel('Minutes')

# Completion rate by device
axes[1, 0].bar(device_analysis.index, device_analysis['completion_rate'],
               color=PlottingUtils.COLORS['palette'][:len(device_analysis)])
axes[1, 0].set_title('✅ Completion Rate by Device')
axes[1, 0].set_ylabel('Completion Rate')

# Average rating by device
axes[1, 1].bar(device_analysis.index, device_analysis['avg_rating'],
               color=PlottingUtils.COLORS['palette'][:len(device_analysis)])
axes[1, 1].set_title('⭐ Average Rating by Device')
axes[1, 1].set_ylabel('Rating (1-5)')

plt.tight_layout()
plt.show()

### 🎯 Task 1.5: Binge-Watching Analysis

**TODO:** Identify binge-worthy content (high completion rates and ratings).
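
💡 **Warm-up sketch:** Selecting "binge-worthy" rows comes down to combining several boolean conditions. The toy example below (made-up data, hypothetical `toy` frame) shows how conditions are joined with `&` and why each comparison needs its own parentheses.

```python
import pandas as pd

# Hypothetical toy viewing log
toy = pd.DataFrame({
    "show": ["A", "A", "B", "C"],
    "completed": [True, True, True, False],
    "rating": [5, 3, 4, 5],
    "watch_time_minutes": [120, 90, 100, 30],
})

avg_time = toy["watch_time_minutes"].mean()

# Each comparison is wrapped in parentheses because & binds tighter than >= and >
mask = (
    toy["completed"]
    & (toy["rating"] >= 4)
    & (toy["watch_time_minutes"] > avg_time)
)

binge_rows = toy[mask]
print(binge_rows)
```

Only rows that satisfy all three conditions survive the filter; relaxing any one of them is an easy way to test how sensitive the ranking is to the definition.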

In [None]:
# TODO: Find binge-worthy shows
# Define binge-worthy as: completed = True, rating >= 4, watch_time > average

# Your code here:
avg_watch_time = netflix_df['watch_time_minutes'].mean()

# Filter for potential binge-watching behavior
binge_criteria = (
    (netflix_df['completed'] == True) & 
    (netflix_df['rating'] >= 4) & 
    (netflix_df['watch_time_minutes'] > avg_watch_time)
)

binge_worthy = netflix_df[binge_criteria].groupby('show').agg({
    'user_id': 'count',
    'rating': 'mean',
    'watch_time_minutes': 'mean'
}).round(2)

binge_worthy.columns = ['binge_count', 'avg_rating', 'avg_watch_time']
binge_worthy = binge_worthy.sort_values('binge_count', ascending=False).head(10)

print("🍿 Top 10 Binge-Worthy Shows:")
display(binge_worthy)

# Visualization
fig, ax = plt.subplots(figsize=(12, 6))
bars = ax.bar(range(len(binge_worthy)), binge_worthy['binge_count'],
               color=PlottingUtils.COLORS['success'])
ax.set_xticks(range(len(binge_worthy)))
ax.set_xticklabels(binge_worthy.index, rotation=45, ha='right')
ax.set_title('🍿 Most Binge-Watched Shows on Netflix', fontweight='bold', fontsize=14)
ax.set_xlabel('Show')
ax.set_ylabel('Number of Binge Sessions')

# Add rating stars on top of bars
for bar, rating in zip(bars, binge_worthy['avg_rating']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2,
            f'⭐{rating:.1f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

print(f"\n🎯 Most Binge-Worthy: {binge_worthy.index[0]} with {binge_worthy.iloc[0]['binge_count']:.0f} binge sessions!")

### ✅ Self-Check Quiz: Netflix Analytics

In [None]:
# Interactive quiz for Netflix analysis
netflix_quiz = InteractiveQuiz("Netflix Analytics Check")

netflix_quiz.add_question(
    "What type of analysis helps identify binge-watching behavior?",
    ["Time series analysis", "Completion rate & rating analysis", "Device analysis", "Genre analysis"],
    1,
    "Correct! High completion rates combined with high ratings indicate binge-worthy content."
)

netflix_quiz.add_question(
    "Which metric is most important for content recommendation?",
    ["Watch time only", "Device used", "User ratings and completion", "Time of day"],
    2,
    "Excellent! User ratings and completion rates are key indicators of content satisfaction."
)

quiz_widget = netflix_quiz.create_quiz()
display(quiz_widget)

---

## 🛍️ Exercise 2: Retail Sales Intelligence

<div align="center">
    <img src="figs/shopping_cart_icon.png" width="200"/>
</div>

### 🎯 Objective
Analyze retail transaction data to optimize sales strategies, understand customer behavior, and identify revenue opportunities.

### 📋 Your Tasks:
1. Explore sales trends over time
2. Identify best-selling products and categories
3. Analyze the impact of promotions
4. Segment customers based on purchasing behavior
5. Find revenue optimization opportunities

In [None]:
# Load the retail dataset
retail_df = pd.read_csv('datasets/retail_transactions_data.csv')

# Convert date to datetime
retail_df['date'] = pd.to_datetime(retail_df['date'])
retail_df['month'] = retail_df['date'].dt.month
retail_df['week'] = retail_df['date'].dt.isocalendar().week

print("🛍️ Retail Transaction Data Loaded!")
print(f"📊 Dataset Shape: {retail_df.shape}")
print(f"💰 Total Revenue: ${retail_df['total_amount'].sum():,.2f}")
print(f"🛒 Unique Customers: {retail_df['customer_id'].nunique()}")
print(f"📦 Product Categories: {retail_df['category'].nunique()}")
print("\n📋 Sample Transactions:")
display(retail_df.head())

### 📈 Task 2.1: Sales Trends Analysis

**TODO:** Analyze sales trends over time (daily, weekly, monthly).
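
💡 **Watch out:** Grouping by a bare month or week number merges the same month from different years. Whether that matters depends on the date range of the data, so check it first. The sketch below uses a made-up `toy_sales` frame to contrast the two groupings.

```python
import pandas as pd

# Hypothetical toy transactions spanning two calendar years
toy_sales = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-12-30", "2024-01-02", "2024-01-15", "2024-12-30"]
    ),
    "total_amount": [10.0, 20.0, 30.0, 40.0],
})

# Month number only: December 2023 and December 2024 land in the same bucket
by_month_number = toy_sales.groupby(toy_sales["date"].dt.month)["total_amount"].sum()

# Monthly period: year and month together, so each December stays separate
by_month_period = toy_sales.groupby(
    toy_sales["date"].dt.to_period("M")
)["total_amount"].sum()

print(by_month_number)
print(by_month_period)
```

If the data covers a single year, both approaches give the same picture; the period-based grouping is simply the safer default.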

In [None]:
# TODO: Analyze sales trends
# Your code here:

# Daily sales
daily_sales = retail_df.groupby('date')['total_amount'].sum().reset_index()
daily_sales.columns = ['date', 'revenue']

# Weekly sales
weekly_sales = retail_df.groupby('week')['total_amount'].sum().reset_index()
weekly_sales.columns = ['week', 'revenue']

# Monthly sales
monthly_sales = retail_df.groupby('month')['total_amount'].sum().reset_index()
monthly_sales.columns = ['month', 'revenue']

# Create visualizations
fig, axes = plt.subplots(3, 1, figsize=(15, 12))

# Daily trend
axes[0].plot(daily_sales['date'], daily_sales['revenue'], 
             color=PlottingUtils.COLORS['primary'], linewidth=1)
axes[0].set_title('📈 Daily Sales Trend', fontweight='bold')
axes[0].set_xlabel('Date')
axes[0].set_ylabel('Revenue ($)')
axes[0].grid(True, alpha=0.3)

# Weekly trend
axes[1].bar(weekly_sales['week'], weekly_sales['revenue'],
            color=PlottingUtils.COLORS['success'], alpha=0.7)
axes[1].set_title('📊 Weekly Sales Performance', fontweight='bold')
axes[1].set_xlabel('Week Number')
axes[1].set_ylabel('Revenue ($)')

# Monthly trend
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
axes[2].bar(monthly_sales['month'], monthly_sales['revenue'],
            color=PlottingUtils.COLORS['palette'][:len(monthly_sales)])
axes[2].set_title('📅 Monthly Sales Performance', fontweight='bold')
axes[2].set_xlabel('Month')
axes[2].set_ylabel('Revenue ($)')
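# Use only the months present in the data so ticks and labels always match in number
axes[2].set_xticks(monthly_sales['month'])
axes[2].set_xticklabels([month_names[m - 1] for m in monthly_sales['month']])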

plt.tight_layout()
plt.show()

# Summary statistics
print("\n📊 Sales Summary:")
print(f"Average Daily Revenue: ${daily_sales['revenue'].mean():,.2f}")
print(f"Peak Day Revenue: ${daily_sales['revenue'].max():,.2f}")
print(f"Best Month: {month_names[monthly_sales.loc[monthly_sales['revenue'].idxmax(), 'month'] - 1]}")

### 🏆 Task 2.2: Best-Selling Products & Categories

**TODO:** Identify top products and categories by revenue and quantity.

In [None]:
# TODO: Find best-selling products and categories
# Your code here:

# Category analysis
category_analysis = retail_df.groupby('category').agg({
    'total_amount': 'sum',
    'quantity': 'sum',
    'transaction_id': 'count',
    'unit_price': 'mean'
}).round(2)

category_analysis.columns = ['total_revenue', 'total_quantity', 'transaction_count', 'avg_price']
category_analysis = category_analysis.sort_values('total_revenue', ascending=False)

print("🏆 Category Performance:")
display(category_analysis)

# Product analysis
product_analysis = retail_df.groupby('product').agg({
    'total_amount': 'sum',
    'quantity': 'sum',
    'customer_id': 'nunique'
}).round(2)

product_analysis.columns = ['revenue', 'units_sold', 'unique_customers']
product_analysis = product_analysis.sort_values('revenue', ascending=False).head(10)

print("\n📦 Top 10 Products by Revenue:")
display(product_analysis)

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Category revenue
axes[0, 0].barh(category_analysis.index[:10], category_analysis['total_revenue'][:10],
                color=PlottingUtils.COLORS['palette'][:10])
axes[0, 0].set_title('💰 Revenue by Category', fontweight='bold')
axes[0, 0].set_xlabel('Revenue ($)')

# Category quantity
axes[0, 1].barh(category_analysis.index[:10], category_analysis['total_quantity'][:10],
                color=PlottingUtils.COLORS['palette'][:10])
axes[0, 1].set_title('📦 Units Sold by Category', fontweight='bold')
axes[0, 1].set_xlabel('Quantity')

# Top products
axes[1, 0].bar(range(len(product_analysis)), product_analysis['revenue'],
               color=PlottingUtils.COLORS['success'])
axes[1, 0].set_title('🏆 Top 10 Products by Revenue', fontweight='bold')
axes[1, 0].set_xlabel('Product Rank')
axes[1, 0].set_ylabel('Revenue ($)')
axes[1, 0].set_xticks(range(len(product_analysis)))
axes[1, 0].set_xticklabels(range(1, len(product_analysis) + 1))

# Product popularity
axes[1, 1].scatter(product_analysis['units_sold'], product_analysis['revenue'],
                   s=product_analysis['unique_customers']*2, alpha=0.6,
                   color=PlottingUtils.COLORS['info'])
axes[1, 1].set_title('📊 Product Performance Matrix', fontweight='bold')
axes[1, 1].set_xlabel('Units Sold')
axes[1, 1].set_ylabel('Revenue ($)')

plt.tight_layout()
plt.show()

### 🎯 Task 2.3: Promotion Impact Analysis

**TODO:** Analyze the effectiveness of promotions on sales.
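
💡 **Worked example:** Promotion lift is commonly expressed as the percentage change in average transaction value relative to the no-promotion baseline: `(promo_avg - baseline_avg) / baseline_avg * 100`. The sketch below applies that formula to a made-up `toy` frame and assumes `promotion` is a boolean column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy transactions; promotion is assumed to be boolean
toy = pd.DataFrame({
    "promotion": [False, False, True, True],
    "total_amount": [40.0, 60.0, 50.0, 70.0],
})

baseline_avg = toy.loc[~toy["promotion"], "total_amount"].mean()
promo_avg = toy.loc[toy["promotion"], "total_amount"].mean()

# Guard against an empty or zero baseline, where lift is undefined
if baseline_avg and not np.isnan(baseline_avg):
    lift_pct = (promo_avg - baseline_avg) / baseline_avg * 100
else:
    lift_pct = np.nan

print(f"Baseline: {baseline_avg:.2f}, Promo: {promo_avg:.2f}, Lift: {lift_pct:.1f}%")
```

A positive lift here only describes the difference between the two groups; it does not by itself show that the promotion caused it.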

In [None]:
# TODO: Analyze promotion impact
# Your code here:

# Promotion analysis
promo_analysis = retail_df.groupby('promotion').agg({
    'total_amount': ['mean', 'sum', 'count'],
    'quantity': 'mean'
}).round(2)

promo_analysis.columns = ['avg_transaction', 'total_revenue', 'transaction_count', 'avg_quantity']

print("🎯 Promotion Impact Analysis:")
display(promo_analysis)

# Calculate lift
no_promo_avg = promo_analysis.loc[False, 'avg_transaction']
promo_avg = promo_analysis.loc[True, 'avg_transaction']
lift = ((promo_avg - no_promo_avg) / no_promo_avg) * 100

print(f"\n📈 Promotion Lift: {lift:.1f}% increase in average transaction value")

# Category-wise promotion effectiveness
# Leave missing groups as NaN: filling them with 0 would cause a division by zero in the lift
category_promo = retail_df.groupby(['category', 'promotion'])['total_amount'].mean().unstack()
category_promo['lift'] = ((category_promo[True] - category_promo[False]) / category_promo[False] * 100).round(1)
category_promo = category_promo.sort_values('lift', ascending=False)

print("\n📊 Promotion Effectiveness by Category:")
display(category_promo.head(10))

# Visualizations
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Overall promotion impact
promo_labels = ['No Promotion', 'With Promotion']
axes[0].bar(promo_labels, [promo_analysis.loc[False, 'avg_transaction'], 
                           promo_analysis.loc[True, 'avg_transaction']],
            color=[PlottingUtils.COLORS['secondary'], PlottingUtils.COLORS['success']])
axes[0].set_title('💳 Average Transaction Value', fontweight='bold')
axes[0].set_ylabel('Amount ($)')

# Transaction volume
axes[1].pie([promo_analysis.loc[False, 'transaction_count'], 
             promo_analysis.loc[True, 'transaction_count']],
            labels=promo_labels, autopct='%1.1f%%',
            colors=[PlottingUtils.COLORS['secondary'], PlottingUtils.COLORS['success']])
axes[1].set_title('📊 Transaction Volume Distribution', fontweight='bold')

# Category lift
top_categories = category_promo.head(8)
axes[2].barh(range(len(top_categories)), top_categories['lift'],
             color=PlottingUtils.COLORS['palette'][:len(top_categories)])
axes[2].set_yticks(range(len(top_categories)))
axes[2].set_yticklabels(top_categories.index)
axes[2].set_title('🚀 Promotion Lift by Category (%)', fontweight='bold')
axes[2].set_xlabel('Lift %')

plt.tight_layout()
plt.show()

### 👥 Task 2.4: Customer Segmentation

**TODO:** Segment customers based on their purchasing behavior using RFM (Recency, Frequency, Monetary) analysis.
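
💡 **Warm-up sketch:** The three RFM numbers are: days since the last purchase (recency), number of purchases (frequency), and total spend (monetary). The example below computes them on a made-up `toy_tx` frame and turns them into 1 to 3 scores. Ranking before `pd.qcut` is one possible way to keep tied values from collapsing the bins; treat it as an assumption of this sketch rather than the only valid scoring rule.

```python
import pandas as pd

# Hypothetical toy transactions for three customers
toy_tx = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c3", "c3", "c3"],
    "date": pd.to_datetime([
        "2024-01-01", "2024-03-01", "2024-01-15",
        "2024-02-01", "2024-02-20", "2024-03-10",
    ]),
    "total_amount": [20.0, 30.0, 100.0, 10.0, 15.0, 25.0],
})

snapshot = toy_tx["date"].max()  # reference date for recency

rfm_toy = toy_tx.groupby("customer_id").agg(
    recency=("date", lambda d: (snapshot - d.max()).days),
    frequency=("date", "count"),
    monetary=("total_amount", "sum"),
)

# Lower recency is better, so the labels run from 3 down to 1
rfm_toy["R_score"] = pd.qcut(
    rfm_toy["recency"].rank(method="first"), q=3, labels=[3, 2, 1]
).astype(int)
rfm_toy["F_score"] = pd.qcut(
    rfm_toy["frequency"].rank(method="first"), q=3, labels=[1, 2, 3]
).astype(int)

print(rfm_toy)
```

Converting the scores to plain integers keeps later comparisons such as `F_score >= 2` straightforward.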

In [None]:
# TODO: Perform customer segmentation
# Your code here:

# Calculate RFM metrics
current_date = retail_df['date'].max()

rfm = retail_df.groupby('customer_id').agg({
    'date': lambda x: (current_date - x.max()).days,  # Recency
    'transaction_id': 'count',  # Frequency
    'total_amount': 'sum'  # Monetary
}).reset_index()

rfm.columns = ['customer_id', 'recency', 'frequency', 'monetary']

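# Create segments based on quantiles
# Rank first so tied values cannot produce duplicate bin edges; cast to int for later comparisons
rfm['R_score'] = pd.qcut(rfm['recency'].rank(method='first'), q=3, labels=[3, 2, 1]).astype(int)  # Lower recency is better
rfm['F_score'] = pd.qcut(rfm['frequency'].rank(method='first'), q=3, labels=[1, 2, 3]).astype(int)
rfm['M_score'] = pd.qcut(rfm['monetary'].rank(method='first'), q=3, labels=[1, 2, 3]).astype(int)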

# Combine scores
rfm['RFM_score'] = rfm['R_score'].astype(str) + rfm['F_score'].astype(str) + rfm['M_score'].astype(str)

# Define segments
def segment_customers(row):
    if row['RFM_score'] == '333':
        return 'Champions'
    elif row['R_score'] == 3 and row['F_score'] >= 2:
        return 'Loyal Customers'
    elif row['R_score'] == 3 and row['F_score'] == 1:
        return 'Potential Loyalists'
    elif row['R_score'] == 2:
        return 'At Risk'
    elif row['R_score'] == 1 and row['M_score'] == 3:
        return 'Can\'t Lose Them'
    else:
        return 'Lost'

rfm['segment'] = rfm.apply(segment_customers, axis=1)

# Segment summary
segment_summary = rfm.groupby('segment').agg({
    'customer_id': 'count',
    'recency': 'mean',
    'frequency': 'mean',
    'monetary': 'mean'
}).round(2)

segment_summary.columns = ['customer_count', 'avg_recency', 'avg_frequency', 'avg_monetary']
segment_summary = segment_summary.sort_values('avg_monetary', ascending=False)

print("👥 Customer Segmentation Results:")
display(segment_summary)

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Segment distribution
segment_counts = rfm['segment'].value_counts()
axes[0, 0].pie(segment_counts.values, labels=segment_counts.index, autopct='%1.1f%%',
               colors=PlottingUtils.COLORS['palette'][:len(segment_counts)])
axes[0, 0].set_title('👥 Customer Segment Distribution', fontweight='bold')

# Monetary by segment
axes[0, 1].bar(segment_summary.index, segment_summary['avg_monetary'],
               color=PlottingUtils.COLORS['palette'][:len(segment_summary)])
axes[0, 1].set_title('💰 Average Spend by Segment', fontweight='bold')
axes[0, 1].set_xlabel('Segment')
axes[0, 1].set_ylabel('Average Monetary Value ($)')
axes[0, 1].tick_params(axis='x', rotation=45)

# RFM scatter
scatter = axes[1, 0].scatter(rfm['frequency'], rfm['monetary'], 
                             c=rfm['recency'], cmap='RdYlGn_r', 
                             s=50, alpha=0.6)
axes[1, 0].set_title('📊 RFM Analysis: Frequency vs Monetary', fontweight='bold')
axes[1, 0].set_xlabel('Frequency')
axes[1, 0].set_ylabel('Monetary ($)')
plt.colorbar(scatter, ax=axes[1, 0], label='Recency (days)')

# Segment characteristics
segment_chars = segment_summary[['avg_recency', 'avg_frequency']].head(5)
x = np.arange(len(segment_chars))
width = 0.35

axes[1, 1].bar(x - width/2, segment_chars['avg_recency'], width, 
               label='Avg Recency', color=PlottingUtils.COLORS['info'])
axes[1, 1].bar(x + width/2, segment_chars['avg_frequency']*10, width, 
               label='Avg Frequency (x10)', color=PlottingUtils.COLORS['success'])
axes[1, 1].set_xlabel('Segment')
axes[1, 1].set_xticks(x)
axes[1, 1].set_xticklabels(segment_chars.index, rotation=45, ha='right')
axes[1, 1].set_title('📈 Segment Characteristics', fontweight='bold')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

print(f"\n🎯 Key Insight: {segment_counts.index[0]} is the largest segment with {segment_counts.iloc[0]} customers")

---

## 🎵 Exercise 3: Spotify Music Discovery

<div align="center">
    <img src="figs/visualization-example.png" width="400"/>
</div>

### 🎯 Objective
Analyze Spotify streaming data to understand music preferences, discover patterns in audio features, and build insights for music recommendations.

### 📋 Your Tasks:
1. Explore genre popularity and trends
2. Analyze audio features and their relationships
3. Identify factors affecting skip rates
4. Build mood-based music insights
5. Find patterns in playlist additions

In [None]:
# Load the Spotify dataset
spotify_df = pd.read_csv('datasets/spotify_music_data.csv')

print("🎵 Spotify Music Data Loaded!")
print(f"📊 Dataset Shape: {spotify_df.shape}")
print(f"🎤 Unique Artists: {spotify_df['artist'].nunique()}")
print(f"🎸 Genres: {spotify_df['genre'].nunique()}")
print(f"🎭 Moods: {spotify_df['mood'].nunique()}")
print("\n📋 Dataset Overview:")
display(spotify_df.head())
print("\n📊 Audio Feature Statistics:")
display(spotify_df[['tempo_bpm', 'energy', 'danceability', 'valence', 'acousticness']].describe())

### 🎸 Task 3.1: Genre Popularity Analysis

**TODO:** Analyze which genres are most popular and their characteristics.

In [None]:
# TODO: Analyze genre popularity and characteristics
# Your code here:

genre_analysis = spotify_df.groupby('genre').agg({
    'play_count': ['sum', 'mean'],
    'skip_rate': 'mean',
    'added_to_playlist': lambda x: x.sum() / len(x),
    'energy': 'mean',
    'danceability': 'mean',
    'valence': 'mean'
}).round(3)

genre_analysis.columns = ['total_plays', 'avg_plays', 'avg_skip_rate', 'playlist_add_rate', 
                          'avg_energy', 'avg_danceability', 'avg_valence']
genre_analysis = genre_analysis.sort_values('total_plays', ascending=False)

print("🎸 Genre Performance Analysis:")
display(genre_analysis)

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Total plays by genre
axes[0, 0].barh(genre_analysis.index[:10], genre_analysis['total_plays'][:10],
                color=PlottingUtils.COLORS['palette'][:10])
axes[0, 0].set_title('🎵 Total Plays by Genre', fontweight='bold')
axes[0, 0].set_xlabel('Total Play Count')

# Skip rate by genre
axes[0, 1].bar(genre_analysis.index[:10], genre_analysis['avg_skip_rate'][:10],
               color=PlottingUtils.COLORS['palette'][:10])
axes[0, 1].set_title('⏭️ Skip Rate by Genre', fontweight='bold')
axes[0, 1].set_ylabel('Average Skip Rate')
axes[0, 1].tick_params(axis='x', rotation=45)

# Audio features grouped bar chart for top 5 genres
top_genres = genre_analysis.head(5)
features = ['avg_energy', 'avg_danceability', 'avg_valence']

x = np.arange(len(features))
width = 0.15

for i, (genre, row) in enumerate(top_genres.iterrows()):
    axes[1, 0].bar(x + i * width, row[features], width, label=genre)

axes[1, 0].set_xlabel('Audio Features')
axes[1, 0].set_ylabel('Average Value')
axes[1, 0].set_title('🎼 Audio Features by Top Genres', fontweight='bold')
axes[1, 0].set_xticks(x + width * 2)
axes[1, 0].set_xticklabels(['Energy', 'Danceability', 'Valence'])
axes[1, 0].legend()

# Playlist addition rate
axes[1, 1].scatter(genre_analysis['avg_plays'], genre_analysis['playlist_add_rate'],
                   s=genre_analysis['total_plays']/1000, alpha=0.6,
                   color=PlottingUtils.COLORS['info'])
axes[1, 1].set_title('📋 Playlist Addition vs Popularity', fontweight='bold')
axes[1, 1].set_xlabel('Average Plays per Song')
axes[1, 1].set_ylabel('Playlist Addition Rate')

# Add genre labels for top 5
for genre, row in genre_analysis.head(5).iterrows():
    axes[1, 1].annotate(genre, (row['avg_plays'], row['playlist_add_rate']),
                       xytext=(5, 5), textcoords='offset points', fontsize=9)

plt.tight_layout()
plt.show()

### 🎼 Task 3.2: Audio Features Analysis

**TODO:** Explore relationships between audio features and their impact on song performance.
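
💡 **Warm-up sketch:** `DataFrame.corr()` returns pairwise Pearson correlations by default, ranging from -1 to 1. The made-up `toy_tracks` frame below is built so that the signs are easy to predict: plays rise with energy while skip rate falls.

```python
import pandas as pd

# Hypothetical toy tracks with deliberately linear relationships
toy_tracks = pd.DataFrame({
    "energy": [0.2, 0.4, 0.6, 0.8],
    "play_count": [100, 200, 300, 400],
    "skip_rate": [0.40, 0.30, 0.20, 0.10],
})

corr = toy_tracks.corr()  # Pearson correlation by default

# Correlation of each feature with play_count, excluding play_count itself
with_plays = corr["play_count"].drop("play_count")
print(with_plays)
```

Real audio features will rarely correlate this cleanly, and a strong correlation still says nothing about which variable drives the other.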

In [None]:
# TODO: Analyze audio features
# Your code here:

# Create audio feature correlation matrix
audio_features = ['tempo_bpm', 'energy', 'danceability', 'valence', 'acousticness', 
                  'play_count', 'skip_rate']
audio_corr = spotify_df[audio_features].corr()

# Create correlation heatmap
fig, ax = PlottingUtils.create_correlation_heatmap(
    spotify_df[audio_features],
    title="🎼 Audio Features Correlation Matrix",
    figsize=(10, 8)
)
plt.show()

# Feature impact on performance
print("\n📊 Feature Correlations with Performance Metrics:")
performance_corr = pd.DataFrame({
    'Play Count Correlation': audio_corr['play_count'].drop('play_count'),
    'Skip Rate Correlation': audio_corr['skip_rate'].drop('skip_rate')
}).sort_values('Play Count Correlation', ascending=False)
display(performance_corr)

# Feature distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
features_to_plot = ['energy', 'danceability', 'valence', 'acousticness', 'tempo_bpm']

for idx, feature in enumerate(features_to_plot):
    row = idx // 3
    col = idx % 3
    axes[row, col].hist(spotify_df[feature], bins=30, 
                        color=PlottingUtils.COLORS['palette'][idx], alpha=0.7)
    axes[row, col].set_title(f'Distribution of {feature.replace("_", " ").title()}')
    axes[row, col].set_xlabel(feature)
    axes[row, col].set_ylabel('Frequency')

# Hide empty subplot
axes[1, 2].axis('off')

plt.tight_layout()
plt.show()

# High-performance songs analysis
high_performers = spotify_df[spotify_df['play_count'] > spotify_df['play_count'].quantile(0.75)]
print(f"\n🌟 High-Performance Songs (Top 25% by play count):")
print(f"Average Energy: {high_performers['energy'].mean():.3f}")
print(f"Average Danceability: {high_performers['danceability'].mean():.3f}")
print(f"Average Valence: {high_performers['valence'].mean():.3f}")
print(f"Average Skip Rate: {high_performers['skip_rate'].mean():.3f}")

### 😊 Task 3.3: Mood-Based Music Analysis

**TODO:** Analyze music characteristics by mood and their listener engagement.
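
💡 **Warm-up sketch:** Features such as tempo (in BPM) and energy (0 to 1) live on very different scales, so plotting them side by side usually needs a normalization step. One simple option, assumed here, is dividing each column by its own maximum. The made-up `toy_moods` frame below also shows what happens when the division is aligned along the wrong axis.

```python
import pandas as pd

# Hypothetical per-mood averages
toy_moods = pd.DataFrame(
    {"avg_energy": [0.8, 0.4], "avg_tempo": [120.0, 90.0]},
    index=["Happy", "Calm"],
)

col_max = toy_moods.max()  # one maximum per column, indexed by column name

# Correct: match the Series labels against the column labels
normalized = toy_moods.div(col_max, axis="columns")

# Misaligned: matching column names against row labels finds no overlap, so all NaN
misaligned = toy_moods.div(col_max, axis="index")

print(normalized)
print(misaligned)
```

After column-wise normalization every feature peaks at 1.0, which makes grouped bars directly comparable.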

In [None]:
# TODO: Analyze music by mood
# Your code here:

mood_analysis = spotify_df.groupby('mood').agg({
    'play_count': 'sum',
    'skip_rate': 'mean',
    'energy': 'mean',
    'valence': 'mean',
    'tempo_bpm': 'mean',
    'added_to_playlist': lambda x: x.sum() / len(x)
}).round(3)

mood_analysis.columns = ['total_plays', 'avg_skip_rate', 'avg_energy', 
                         'avg_valence', 'avg_tempo', 'playlist_rate']
mood_analysis = mood_analysis.sort_values('total_plays', ascending=False)

print("😊 Mood-Based Music Analysis:")
display(mood_analysis)

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Mood popularity
axes[0, 0].pie(mood_analysis['total_plays'], labels=mood_analysis.index, 
               autopct='%1.1f%%', colors=PlottingUtils.COLORS['palette'][:len(mood_analysis)])
axes[0, 0].set_title('🎭 Music Listening by Mood', fontweight='bold')

# Mood characteristics
mood_chars = mood_analysis[['avg_energy', 'avg_valence', 'avg_tempo']].head()
mood_chars_norm = mood_chars.div(mood_chars.max(), axis=1)  # Normalize each column by its max for comparison

x = np.arange(len(mood_chars))
width = 0.25

axes[0, 1].bar(x - width, mood_chars_norm['avg_energy'], width, label='Energy')
axes[0, 1].bar(x, mood_chars_norm['avg_valence'], width, label='Valence')
axes[0, 1].bar(x + width, mood_chars_norm['avg_tempo'], width, label='Tempo')
axes[0, 1].set_xlabel('Mood')
axes[0, 1].set_ylabel('Normalized Value')
axes[0, 1].set_title('🎼 Audio Characteristics by Mood', fontweight='bold')
axes[0, 1].set_xticks(x)
axes[0, 1].set_xticklabels(mood_chars.index, rotation=45)
axes[0, 1].legend()

# Skip rate by mood
axes[1, 0].barh(mood_analysis.index, mood_analysis['avg_skip_rate'],
                color=[PlottingUtils.COLORS['success'] if rate < 0.2 else PlottingUtils.COLORS['warning'] 
                       for rate in mood_analysis['avg_skip_rate']])
axes[1, 0].set_title('⏭️ Skip Rate by Mood', fontweight='bold')
axes[1, 0].set_xlabel('Average Skip Rate')

# Mood vs Energy/Valence scatter
# Encode moods once so the scatter colors and the legend share the same mapping
# (tab10 has 10 distinct colors, so this assumes at most 10 moods)
mood_cat = pd.Categorical(spotify_df['mood'])
scatter = axes[1, 1].scatter(spotify_df['energy'], spotify_df['valence'], 
                             c=mood_cat.codes, cmap='tab10', vmin=0, vmax=9,
                             alpha=0.5, s=20)
axes[1, 1].set_title('🎨 Energy vs Valence by Mood', fontweight='bold')
axes[1, 1].set_xlabel('Energy')
axes[1, 1].set_ylabel('Valence')

# Add mood labels as legend (category i is drawn with tab10 color i)
handles = [plt.Line2D([0], [0], marker='o', color='w', 
                     markerfacecolor=plt.cm.tab10(i), markersize=8, label=mood) 
          for i, mood in enumerate(mood_cat.categories)]
axes[1, 1].legend(handles=handles, title='Mood', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()

print(f"\n🎯 Most Popular Mood: {mood_analysis.index[0]} with {mood_analysis.iloc[0]['total_plays']:,.0f} plays")
print(f"🎵 Most Engaging Mood (lowest skip): {mood_analysis['avg_skip_rate'].idxmin()} ({mood_analysis['avg_skip_rate'].min():.1%})")

---

## 🌤️ Exercise 4: Weather Pattern Analysis

<div align="center">
    <img src="figs/Satellite_icon.png" width="200"/>
</div>

### 🎯 Objective
Analyze weather data to identify climate patterns, seasonal trends, and extreme weather events across different cities.

### 📋 Your Tasks:
1. Explore temperature trends over time
2. Identify seasonal patterns
3. Find extreme weather events
4. Compare weather patterns across cities
5. Analyze climate indicators

In [None]:
# Load the weather dataset
weather_df = pd.read_csv('datasets/weather_patterns_data.csv')
weather_df['date'] = pd.to_datetime(weather_df['date'])
weather_df['month'] = weather_df['date'].dt.month
# Meteorological seasons: Dec-Feb -> 1, Mar-May -> 2, Jun-Aug -> 3, Sep-Nov -> 4
weather_df['season'] = weather_df['month'] % 12 // 3 + 1
season_map = {1: 'Winter', 2: 'Spring', 3: 'Summer', 4: 'Fall'}
weather_df['season_name'] = weather_df['season'].map(season_map)

print("🌤️ Weather Pattern Data Loaded!")
print(f"📊 Dataset Shape: {weather_df.shape}")
print(f"🌍 Cities: {weather_df['city'].unique()}")
print(f"📅 Date Range: {weather_df['date'].min().date()} to {weather_df['date'].max().date()}")
print("\n📋 Dataset Overview:")
display(weather_df.head())
print("\n🌡️ Temperature Statistics:")
display(weather_df[['temperature', 'humidity', 'precipitation', 'wind_speed']].describe())

### 🌡️ Task 4.1: Temperature Trends Analysis

**TODO:** Analyze temperature trends over time and identify patterns.
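
💡 **Hint:** A common way to expose a trend is to collapse daily readings into monthly summaries. The sketch below uses a tiny made-up frame (the name `toy` and all of its values are illustrative assumptions, not the bootcamp dataset) to show what grouping on `dt.to_period('M')` produces.

```python
import pandas as pd

# Hypothetical toy data: four daily readings spanning two months
toy = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-03', '2023-02-25']),
    'temperature': [30.0, 40.0, 50.0, 60.0],
})

# Group by calendar month and summarize each month
toy_monthly = toy.groupby(toy['date'].dt.to_period('M'))['temperature'].agg(['mean', 'min', 'max'])

# Convert the PeriodIndex to timestamps so matplotlib can draw it on a date axis
toy_monthly.index = toy_monthly.index.to_timestamp()
print(toy_monthly)
```

Each row of `toy_monthly` is one month, which is the shape you need for a line plot with a min-to-max band.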

In [None]:
# TODO: Analyze temperature trends
# Your code here:

# Overall temperature trends
monthly_temp = weather_df.groupby(weather_df['date'].dt.to_period('M'))['temperature'].agg(['mean', 'min', 'max'])
monthly_temp.index = monthly_temp.index.to_timestamp()

# City-wise temperature analysis
city_temp = weather_df.groupby('city')['temperature'].agg(['mean', 'min', 'max', 'std']).round(2)
city_temp = city_temp.sort_values('mean', ascending=False)

print("🌡️ Temperature Analysis by City:")
display(city_temp)

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Temperature trend over time
axes[0, 0].plot(monthly_temp.index, monthly_temp['mean'], 
                color=PlottingUtils.COLORS['primary'], linewidth=2, label='Mean')
axes[0, 0].fill_between(monthly_temp.index, monthly_temp['min'], monthly_temp['max'], 
                        alpha=0.3, color=PlottingUtils.COLORS['info'])
axes[0, 0].set_title('🌡️ Temperature Trends Over Time', fontweight='bold')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Temperature (°F)')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# City temperature comparison
cities = city_temp.index
x = np.arange(len(cities))
width = 0.25

axes[0, 1].bar(x - width, city_temp['min'], width, label='Min', color=PlottingUtils.COLORS['info'])
axes[0, 1].bar(x, city_temp['mean'], width, label='Mean', color=PlottingUtils.COLORS['success'])
axes[0, 1].bar(x + width, city_temp['max'], width, label='Max', color=PlottingUtils.COLORS['warning'])

axes[0, 1].set_xlabel('City')
axes[0, 1].set_ylabel('Temperature (°F)')
axes[0, 1].set_title('🌍 Temperature Comparison by City', fontweight='bold')
axes[0, 1].set_xticks(x)
axes[0, 1].set_xticklabels(cities, rotation=45)
axes[0, 1].legend()

# Temperature distribution
for city in weather_df['city'].unique():
    city_data = weather_df[weather_df['city'] == city]['temperature']
    axes[1, 0].hist(city_data, bins=30, alpha=0.5, label=city)

axes[1, 0].set_title('🌡️ Temperature Distribution by City', fontweight='bold')
axes[1, 0].set_xlabel('Temperature (°F)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].legend()

# Temperature variability
axes[1, 1].bar(city_temp.index, city_temp['std'], 
               color=PlottingUtils.COLORS['palette'][:len(city_temp)])
axes[1, 1].set_title('📊 Temperature Variability (Std Dev) by City', fontweight='bold')
axes[1, 1].set_xlabel('City')
axes[1, 1].set_ylabel('Standard Deviation')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print(f"\n🌡️ Warmest City: {city_temp.index[0]} (Avg: {city_temp.iloc[0]['mean']}°F)")
print(f"❄️ Coldest City: {city_temp.index[-1]} (Avg: {city_temp.iloc[-1]['mean']}°F)")

### 🍂 Task 4.2: Seasonal Pattern Analysis

**TODO:** Identify seasonal weather patterns.
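
💡 **Hint:** The `season` column created at load time comes from the expression `month % 12 // 3 + 1`. The short sketch below walks through that arithmetic for a few months; the helper name `month_to_season` is made up for this illustration, and the mapping assumes the Dec-Feb = Winter convention used in the loading cell.

```python
# Season codes, matching the mapping in the loading cell
season_map = {1: 'Winter', 2: 'Spring', 3: 'Summer', 4: 'Fall'}

def month_to_season(month):
    # month % 12 turns December (12) into 0 so it sits next to January and February;
    # integer division by 3 then buckets the twelve months into four groups of three
    return season_map[month % 12 // 3 + 1]

for m in (12, 1, 2, 3, 6, 9, 11):
    print(m, month_to_season(m))
```

December, January, and February all land in `Winter`, which is why the modulo step is needed before dividing.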

In [None]:
# TODO: Analyze seasonal patterns
# Your code here:

seasonal_analysis = weather_df.groupby('season_name').agg({
    'temperature': ['mean', 'std'],
    'humidity': 'mean',
    'precipitation': 'sum',
    'wind_speed': 'mean'
}).round(2)

seasonal_analysis.columns = ['avg_temp', 'temp_std', 'avg_humidity', 
                             'total_precip', 'avg_wind']

# Reorder seasons
season_order = ['Winter', 'Spring', 'Summer', 'Fall']
seasonal_analysis = seasonal_analysis.reindex(season_order)

print("🍂 Seasonal Weather Patterns:")
display(seasonal_analysis)

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Seasonal temperatures
colors = ['#2E86AB', '#A4C3A2', '#F18F01', '#C73E1D']  # Winter, Spring, Summer, Fall colors
axes[0, 0].bar(seasonal_analysis.index, seasonal_analysis['avg_temp'], color=colors)
axes[0, 0].set_title('🌡️ Average Temperature by Season', fontweight='bold')
axes[0, 0].set_ylabel('Temperature (°F)')

# Seasonal precipitation
axes[0, 1].pie(seasonal_analysis['total_precip'], labels=seasonal_analysis.index,
               autopct='%1.1f%%', colors=colors)
axes[0, 1].set_title('☔ Precipitation Distribution by Season', fontweight='bold')

# Weather conditions by season
season_conditions = pd.crosstab(weather_df['season_name'], weather_df['conditions'], normalize='index') * 100
season_conditions = season_conditions.reindex(season_order)

season_conditions.plot(kind='bar', stacked=True, ax=axes[1, 0], 
                       color=PlottingUtils.COLORS['palette'][:len(season_conditions.columns)])
axes[1, 0].set_title('🌤️ Weather Conditions by Season (%)', fontweight='bold')
axes[1, 0].set_xlabel('Season')
axes[1, 0].set_ylabel('Percentage')
axes[1, 0].legend(title='Condition', bbox_to_anchor=(1.05, 1), loc='upper left')
axes[1, 0].tick_params(axis='x', rotation=0)

# Monthly temperature cycle
monthly_cycle = weather_df.groupby('month')['temperature'].mean()
theta = np.linspace(0, 2*np.pi, 12, endpoint=False)
r = monthly_cycle.values

# Swap the Cartesian axes in the bottom-right slot for a polar one
axes[1, 1].remove()
ax_polar = fig.add_subplot(2, 2, 4, projection='polar')
bars = ax_polar.bar(theta, r, width=2*np.pi/12, bottom=30,
                   color=plt.cm.coolwarm((r - r.min())/(r.max() - r.min())))
ax_polar.set_theta_zero_location('N')
ax_polar.set_theta_direction(-1)
ax_polar.set_title('📅 Annual Temperature Cycle', fontweight='bold', pad=20)
ax_polar.set_xticks(theta)
ax_polar.set_xticklabels(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

plt.tight_layout()
plt.show()

### ⛈️ Task 4.3: Extreme Weather Events

**TODO:** Identify and analyze extreme weather events.
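
💡 **Hint:** One simple definition of "extreme" is anything beyond the 5th or 95th percentile. The sketch below shows the idea on made-up numbers (the series `toy_temps` is a hypothetical stand-in, chosen so the thresholds are easy to check by hand).

```python
import pandas as pd

# Hypothetical toy readings: the values 1..100
toy_temps = pd.Series(range(1, 101), dtype=float)

upper = toy_temps.quantile(0.95)   # 95th percentile
lower = toy_temps.quantile(0.05)   # 5th percentile

# Boolean flags: True where a reading falls outside the thresholds
is_hot = toy_temps > upper
is_cold = toy_temps < lower

print(f"thresholds: {lower:.2f} / {upper:.2f}")
print(f"hot days: {is_hot.sum()}, cold days: {is_cold.sum()}")
```

Because the thresholds come from the pooled data, roughly 5% of all rows are flagged in each tail by construction, and warmer cities will tend to collect most of the heat flags. If you would rather ask "what is unusual for this city?", computing the quantiles per city is a reasonable alternative.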

In [None]:
# TODO: Find extreme weather events
# Your code here:

# Define extreme thresholds
temp_q95 = weather_df['temperature'].quantile(0.95)
temp_q5 = weather_df['temperature'].quantile(0.05)
precip_q95 = weather_df['precipitation'].quantile(0.95)
wind_q95 = weather_df['wind_speed'].quantile(0.95)

# Identify extreme events
weather_df['extreme_heat'] = weather_df['temperature'] > temp_q95
weather_df['extreme_cold'] = weather_df['temperature'] < temp_q5
weather_df['heavy_rain'] = weather_df['precipitation'] > precip_q95
weather_df['high_wind'] = weather_df['wind_speed'] > wind_q95

# Count extreme events by city
extreme_events = weather_df.groupby('city')[['extreme_heat', 'extreme_cold', 
                                             'heavy_rain', 'high_wind']].sum()

print("⛈️ Extreme Weather Events by City:")
display(extreme_events)

# Find most extreme records
print("\n🌡️ Most Extreme Weather Records:")
print(f"Highest Temperature: {weather_df['temperature'].max()}°F on {weather_df.loc[weather_df['temperature'].idxmax(), 'date'].date()} in {weather_df.loc[weather_df['temperature'].idxmax(), 'city']}")
print(f"Lowest Temperature: {weather_df['temperature'].min()}°F on {weather_df.loc[weather_df['temperature'].idxmin(), 'date'].date()} in {weather_df.loc[weather_df['temperature'].idxmin(), 'city']}")
print(f"Highest Precipitation: {weather_df['precipitation'].max():.2f} inches on {weather_df.loc[weather_df['precipitation'].idxmax(), 'date'].date()}")
print(f"Highest Wind Speed: {weather_df['wind_speed'].max():.1f} mph on {weather_df.loc[weather_df['wind_speed'].idxmax(), 'date'].date()}")

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Extreme events by city
extreme_events.plot(kind='bar', ax=axes[0, 0], 
                    color=[PlottingUtils.COLORS['secondary'], PlottingUtils.COLORS['info'],
                           PlottingUtils.COLORS['primary'], PlottingUtils.COLORS['warning']])
axes[0, 0].set_title('⛈️ Extreme Weather Events Count by City', fontweight='bold')
axes[0, 0].set_xlabel('City')
axes[0, 0].set_ylabel('Number of Events')
axes[0, 0].legend(title='Event Type')
axes[0, 0].tick_params(axis='x', rotation=45)

# Extreme temperature distribution
extreme_temps = weather_df[(weather_df['extreme_heat']) | (weather_df['extreme_cold'])]
axes[0, 1].scatter(extreme_temps['date'], extreme_temps['temperature'],
                   c=['red' if x else 'blue' for x in extreme_temps['extreme_heat']],
                   alpha=0.6, s=30)
axes[0, 1].axhline(y=temp_q95, color='red', linestyle='--', alpha=0.5, label=f'Heat threshold ({temp_q95:.1f}°F)')
axes[0, 1].axhline(y=temp_q5, color='blue', linestyle='--', alpha=0.5, label=f'Cold threshold ({temp_q5:.1f}°F)')
axes[0, 1].set_title('🌡️ Extreme Temperature Events Over Time', fontweight='bold')
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('Temperature (°F)')
axes[0, 1].legend()

# UV Index analysis
uv_analysis = weather_df.groupby('city')['uv_index'].agg(['mean', 'max'])
axes[1, 0].bar(uv_analysis.index, uv_analysis['mean'], 
               color=PlottingUtils.COLORS['palette'][:len(uv_analysis)], 
               label='Average')
axes[1, 0].scatter(uv_analysis.index, uv_analysis['max'], 
                   color='red', s=100, zorder=5, label='Maximum')
axes[1, 0].set_title('☀️ UV Index by City', fontweight='bold')
axes[1, 0].set_ylabel('UV Index')
axes[1, 0].legend()
axes[1, 0].tick_params(axis='x', rotation=45)

# Extreme event timeline
extreme_timeline = weather_df[['date', 'extreme_heat', 'extreme_cold', 'heavy_rain', 'high_wind']].copy()
extreme_timeline['any_extreme'] = extreme_timeline[['extreme_heat', 'extreme_cold', 'heavy_rain', 'high_wind']].any(axis=1)
monthly_extremes = extreme_timeline.groupby(extreme_timeline['date'].dt.to_period('M'))['any_extreme'].sum()
monthly_extremes.index = monthly_extremes.index.to_timestamp()

axes[1, 1].plot(monthly_extremes.index, monthly_extremes.values, 
                color=PlottingUtils.COLORS['secondary'], linewidth=2)
axes[1, 1].fill_between(monthly_extremes.index, monthly_extremes.values, 
                        alpha=0.3, color=PlottingUtils.COLORS['secondary'])
axes[1, 1].set_title('📈 Extreme Weather Events Timeline', fontweight='bold')
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Number of Extreme Events')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 📱 Exercise 5: Social Media Engagement

<div align="center">
    <img src="figs/brain_intertwined_with_circuitry.png" width="300"/>
</div>

### 🎯 Objective
Analyze social media data to understand engagement patterns, viral content characteristics, and platform-specific behaviors.

### 📋 Your Tasks:
1. Compare engagement across platforms
2. Identify viral content characteristics
3. Analyze sentiment and its impact
4. Find engagement rate drivers
5. Identify influencer patterns

In [None]:
# Load the social media dataset
social_df = pd.read_csv('datasets/social_media_data.csv')

print("📱 Social Media Data Loaded!")
print(f"📊 Dataset Shape: {social_df.shape}")
print(f"📱 Platforms: {social_df['platform'].unique()}")
print(f"📝 Topics: {social_df['topic'].unique()}")
print(f"✅ Verified Accounts: {social_df['verified_account'].sum()} ({social_df['verified_account'].mean():.1%})")
print("\n📋 Dataset Overview:")
display(social_df.head())
print("\n📊 Engagement Statistics:")
display(social_df[['likes', 'shares', 'comments', 'engagement_rate']].describe())

### 📱 Task 5.1: Platform Comparison

**TODO:** Compare engagement metrics across different social media platforms.
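
💡 **Hint:** Grouping by platform and aggregating several columns at once gives you one comparison table. The sketch below uses pandas named aggregation on invented data (`toy_posts` and its values are assumptions for illustration only), which sets the output column names in the same step instead of renaming afterwards.

```python
import pandas as pd

# Hypothetical toy posts (values invented for illustration)
toy_posts = pd.DataFrame({
    'platform': ['A', 'A', 'B', 'B', 'B'],
    'likes': [10, 30, 5, 15, 10],
    'engagement_rate': [0.10, 0.30, 0.05, 0.15, 0.10],
})

# Named aggregation: new_column=(source_column, function)
toy_summary = (
    toy_posts.groupby('platform')
    .agg(avg_likes=('likes', 'mean'),
         avg_engagement=('engagement_rate', 'mean'),
         post_count=('likes', 'size'))
    .sort_values('avg_engagement', ascending=False)
)
print(toy_summary)
```

Keeping `post_count` next to the averages helps you judge how much data each average rests on.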

In [None]:
# TODO: Compare platforms
# Your code here:

platform_analysis = social_df.groupby('platform').agg({
    'likes': 'mean',
    'shares': 'mean',
    'comments': 'mean',
    'engagement_rate': 'mean',
    'text_length': 'mean',
    'hashtags': 'mean',
    'post_id': 'count'
}).round(2)

platform_analysis.columns = ['avg_likes', 'avg_shares', 'avg_comments', 
                             'avg_engagement', 'avg_text_length', 'avg_hashtags', 'post_count']
platform_analysis = platform_analysis.sort_values('avg_engagement', ascending=False)

print("📱 Platform Performance Analysis:")
display(platform_analysis)

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Engagement by platform
axes[0, 0].bar(platform_analysis.index, platform_analysis['avg_engagement'],
               color=PlottingUtils.COLORS['palette'][:len(platform_analysis)])
axes[0, 0].set_title('📊 Average Engagement Rate by Platform', fontweight='bold')
axes[0, 0].set_ylabel('Engagement Rate')
axes[0, 0].tick_params(axis='x', rotation=45)

# Interaction types by platform
x = np.arange(len(platform_analysis))
width = 0.25

axes[0, 1].bar(x - width, platform_analysis['avg_likes'], width, label='Likes')
axes[0, 1].bar(x, platform_analysis['avg_shares'], width, label='Shares')
axes[0, 1].bar(x + width, platform_analysis['avg_comments'], width, label='Comments')
axes[0, 1].set_xlabel('Platform')
axes[0, 1].set_ylabel('Average Count')
axes[0, 1].set_title('💬 Interaction Types by Platform', fontweight='bold')
axes[0, 1].set_xticks(x)
axes[0, 1].set_xticklabels(platform_analysis.index, rotation=45)
axes[0, 1].legend()

# Platform post distribution
axes[1, 0].pie(platform_analysis['post_count'], labels=platform_analysis.index,
               autopct='%1.1f%%', colors=PlottingUtils.COLORS['palette'][:len(platform_analysis)])
axes[1, 0].set_title('📱 Post Distribution Across Platforms', fontweight='bold')

# Content characteristics by platform
axes[1, 1].scatter(platform_analysis['avg_text_length'], platform_analysis['avg_hashtags'],
                   s=platform_analysis['avg_engagement']*1000, alpha=0.6,
                   c=range(len(platform_analysis)), cmap='viridis')
for i, platform in enumerate(platform_analysis.index):
    axes[1, 1].annotate(platform, 
                       (platform_analysis.iloc[i]['avg_text_length'], 
                        platform_analysis.iloc[i]['avg_hashtags']),
                       xytext=(5, 5), textcoords='offset points', fontsize=9)
axes[1, 1].set_title('📝 Content Characteristics by Platform', fontweight='bold')
axes[1, 1].set_xlabel('Average Text Length')
axes[1, 1].set_ylabel('Average Hashtags')

plt.tight_layout()
plt.show()

### 🚀 Task 5.2: Viral Content Analysis

**TODO:** Identify characteristics of viral content (high engagement posts).
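
💡 **Hint:** "Viral" needs a concrete definition before you can analyze it. The sketch below flags the top 10% of posts by engagement rate and then computes the viral share per topic; `toy_posts` and its numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical toy posts: nine ordinary posts and one outlier
toy_posts = pd.DataFrame({
    'topic': ['sports'] * 5 + ['tech'] * 5,
    'engagement_rate': [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.50],
})

# Posts strictly above the 90th percentile count as viral
threshold = toy_posts['engagement_rate'].quantile(0.9)
toy_posts['is_viral'] = toy_posts['engagement_rate'] > threshold

# normalize='index' makes every row sum to 1, so * 100 gives percentages within each topic
viral_rate = pd.crosstab(toy_posts['topic'], toy_posts['is_viral'], normalize='index') * 100
print(viral_rate)

# The mean of a boolean column is the fraction of True values, so this gives the same viral share
print(toy_posts.groupby('topic')['is_viral'].mean() * 100)
```

The `groupby(...).mean()` form is handy when you only need the viral percentage and not the full two-column table.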

In [None]:
# TODO: Analyze viral content
# Your code here:

# Define viral as top 10% engagement rate
viral_threshold = social_df['engagement_rate'].quantile(0.9)
social_df['is_viral'] = social_df['engagement_rate'] > viral_threshold

print(f"🚀 Viral Threshold: {viral_threshold:.3f} engagement rate")
print(f"📊 Viral Posts: {social_df['is_viral'].sum()} ({social_df['is_viral'].mean():.1%})")

# Compare viral vs non-viral
viral_comparison = social_df.groupby('is_viral').agg({
    'likes': 'mean',
    'shares': 'mean',
    'comments': 'mean',
    'text_length': 'mean',
    'hashtags': 'mean',
    'mentions': 'mean',
    'verified_account': 'mean'
}).round(2)

viral_comparison = viral_comparison.rename(index={False: 'Non-Viral', True: 'Viral'})
print("\n🔥 Viral vs Non-Viral Content:")
display(viral_comparison.T)

# Topic analysis for viral content
viral_topics = pd.crosstab(social_df['topic'], social_df['is_viral'], normalize='index') * 100
viral_topics = viral_topics.sort_values(True, ascending=False)

print("\n📝 Viral Rate by Topic:")
display(viral_topics)

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Viral characteristics comparison
comparison_metrics = ['likes', 'shares', 'comments', 'hashtags']
viral_means = [viral_comparison.loc['Viral', m] for m in comparison_metrics]
non_viral_means = [viral_comparison.loc['Non-Viral', m] for m in comparison_metrics]

x = np.arange(len(comparison_metrics))
width = 0.35

axes[0, 0].bar(x - width/2, non_viral_means, width, label='Non-Viral', color=PlottingUtils.COLORS['secondary'])
axes[0, 0].bar(x + width/2, viral_means, width, label='Viral', color=PlottingUtils.COLORS['success'])
axes[0, 0].set_xlabel('Metric')
axes[0, 0].set_ylabel('Average Value')
axes[0, 0].set_title('🚀 Viral vs Non-Viral Content Metrics', fontweight='bold')
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels(comparison_metrics)
axes[0, 0].legend()

# Viral rate by topic
axes[0, 1].barh(viral_topics.index, viral_topics[True],
                color=PlottingUtils.COLORS['palette'][:len(viral_topics)])
axes[0, 1].set_title('🔥 Viral Rate by Topic (%)', fontweight='bold')
axes[0, 1].set_xlabel('Viral Percentage')

# Text length distribution
axes[1, 0].hist(social_df[~social_df['is_viral']]['text_length'], bins=30, 
                alpha=0.5, label='Non-Viral', color=PlottingUtils.COLORS['secondary'])
axes[1, 0].hist(social_df[social_df['is_viral']]['text_length'], bins=30, 
                alpha=0.5, label='Viral', color=PlottingUtils.COLORS['success'])
axes[1, 0].set_title('📝 Text Length Distribution', fontweight='bold')
axes[1, 0].set_xlabel('Text Length (characters)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].legend()

# Engagement scatter
colors = ['red' if v else 'blue' for v in social_df['is_viral']]
axes[1, 1].scatter(social_df['hashtags'], social_df['engagement_rate'],
                   c=colors, alpha=0.3, s=20)
axes[1, 1].axhline(y=viral_threshold, color='green', linestyle='--', 
                   label=f'Viral Threshold ({viral_threshold:.3f})')
axes[1, 1].set_title('📊 Hashtags vs Engagement Rate', fontweight='bold')
axes[1, 1].set_xlabel('Number of Hashtags')
axes[1, 1].set_ylabel('Engagement Rate')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

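# Look up shares by name rather than by list position
share_lift = (viral_comparison.loc['Viral', 'shares'] / viral_comparison.loc['Non-Viral', 'shares'] - 1) * 100
print(f"\n🎯 Key Insight: Viral posts have {share_lift:.1f}% more shares than non-viral posts!")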

### 😊 Task 5.3: Sentiment Analysis Impact

**TODO:** Analyze how sentiment affects engagement.
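
💡 **Hint:** To see how two categories interact, arrange one along the rows and the other along the columns, with an average in each cell. The sketch below does this with `pivot_table` on invented data (`toy_posts` is a hypothetical stand-in for the social media dataset).

```python
import pandas as pd

# Hypothetical toy posts (values invented for illustration)
toy_posts = pd.DataFrame({
    'platform': ['A', 'A', 'A', 'B', 'B'],
    'sentiment': ['Positive', 'Positive', 'Negative', 'Positive', 'Negative'],
    'engagement_rate': [0.2, 0.4, 0.1, 0.3, 0.5],
})

# Rows = platform, columns = sentiment, cell = mean engagement rate
toy_grid = toy_posts.pivot_table(index='platform', columns='sentiment',
                                 values='engagement_rate', aggfunc='mean')
print(toy_grid)
```

A grid like this is exactly the input a heatmap expects. Any platform and sentiment pair with no posts shows up as `NaN`, which is worth checking before you plot.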

In [None]:
# TODO: Analyze sentiment impact
# Your code here:

sentiment_analysis = social_df.groupby('sentiment').agg({
    'engagement_rate': 'mean',
    'likes': 'mean',
    'shares': 'mean',
    'comments': 'mean',
    'post_id': 'count'
}).round(3)

sentiment_analysis.columns = ['avg_engagement', 'avg_likes', 'avg_shares', 'avg_comments', 'post_count']
sentiment_analysis = sentiment_analysis.sort_values('avg_engagement', ascending=False)

print("😊 Sentiment Impact on Engagement:")
display(sentiment_analysis)

# Platform-sentiment analysis
platform_sentiment = pd.crosstab(social_df['platform'], social_df['sentiment'], 
                                 values=social_df['engagement_rate'], aggfunc='mean').round(3)

print("\n📱 Engagement by Platform and Sentiment:")
display(platform_sentiment)

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Sentiment distribution
sentiment_colors = {'Positive': PlottingUtils.COLORS['success'], 
                   'Negative': PlottingUtils.COLORS['secondary'], 
                   'Neutral': PlottingUtils.COLORS['info']}
colors = [sentiment_colors.get(s, PlottingUtils.COLORS['primary']) for s in sentiment_analysis.index]

axes[0, 0].pie(sentiment_analysis['post_count'], labels=sentiment_analysis.index,
               autopct='%1.1f%%', colors=colors)
axes[0, 0].set_title('😊 Sentiment Distribution', fontweight='bold')

# Engagement by sentiment
axes[0, 1].bar(sentiment_analysis.index, sentiment_analysis['avg_engagement'], color=colors)
axes[0, 1].set_title('📊 Engagement Rate by Sentiment', fontweight='bold')
axes[0, 1].set_ylabel('Average Engagement Rate')

# Platform-sentiment heatmap
im = axes[1, 0].imshow(platform_sentiment.values, cmap='RdYlGn', aspect='auto')
axes[1, 0].set_xticks(np.arange(len(platform_sentiment.columns)))
axes[1, 0].set_yticks(np.arange(len(platform_sentiment.index)))
axes[1, 0].set_xticklabels(platform_sentiment.columns)
axes[1, 0].set_yticklabels(platform_sentiment.index)

# Add values to heatmap
for i in range(len(platform_sentiment.index)):
    for j in range(len(platform_sentiment.columns)):
        axes[1, 0].text(j, i, f'{platform_sentiment.iloc[i, j]:.3f}',
                       ha='center', va='center', color='white', fontweight='bold')

axes[1, 0].set_title('🔥 Engagement Heatmap: Platform vs Sentiment', fontweight='bold')
axes[1, 0].set_xlabel('Sentiment')
axes[1, 0].set_ylabel('Platform')

# Interaction types by sentiment
interaction_metrics = ['avg_likes', 'avg_shares', 'avg_comments']
sentiment_interactions = sentiment_analysis[interaction_metrics]

sentiment_interactions.plot(kind='bar', ax=axes[1, 1], 
                           color=PlottingUtils.COLORS['palette'][:3])
axes[1, 1].set_title('💬 Interaction Types by Sentiment', fontweight='bold')
axes[1, 1].set_xlabel('Sentiment')
axes[1, 1].set_ylabel('Average Count')
axes[1, 1].legend(title='Interaction Type')
axes[1, 1].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

best_sentiment = sentiment_analysis.index[0]
print(f"\n🎯 Key Insight: {best_sentiment} sentiment has the highest average engagement ({sentiment_analysis.iloc[0]['avg_engagement']:.3f})!")

---

## 🏆 Challenge Section: Cross-Dataset Analysis

### 🎯 Advanced Challenge
Now that you've analyzed each dataset individually, try combining insights across datasets for deeper analysis!

**Ideas to explore:**
1. **Weather & Retail**: How does weather affect shopping patterns?
2. **Social Media & Netflix**: Content trends across platforms
3. **Spotify & Social Media**: Music trends and social engagement
4. **Weather & Social Media**: Does weather affect social media mood?
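
💡 **Hint:** Metrics from different datasets usually live on different scales (degrees vs dollars), so a shared plot needs them rescaled first. The sketch below applies min-max normalization to two invented monthly series and then measures how they move together; `toy_temp`, `toy_sales`, and the helper `min_max` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series, both indexed by month number
toy_temp = pd.Series([30.0, 50.0, 70.0, 90.0], index=[1, 2, 3, 4])
toy_sales = pd.Series([400.0, 300.0, 200.0, 100.0], index=[1, 2, 3, 4])

def min_max(series):
    # Rescale so the smallest value becomes 0 and the largest becomes 1
    return (series - series.min()) / (series.max() - series.min())

temp_norm = min_max(toy_temp)
sales_norm = min_max(toy_sales)

# Pearson correlation between the two series
r = np.corrcoef(temp_norm, sales_norm)[0, 1]
print(f"correlation: {r:.3f}")
```

Normalizing is only there to make the plot readable: the Pearson correlation is the same before and after min-max scaling. With at most twelve monthly points, treat the coefficient as a rough signal and not as evidence that one series causes the other.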

In [None]:
# Challenge Example: Do retail sales follow the same seasonal cycle as temperature?
print("🏆 CHALLENGE: Cross-Dataset Analysis")
print("="*50)
print("\nExample: Analyzing patterns across different datasets...\n")

# Example: Compare seasonal patterns
# Weather has clear seasons, let's see if other activities follow similar patterns

# Add month to datasets for comparison
retail_df['month'] = pd.to_datetime(retail_df['date']).dt.month

# Monthly patterns comparison
weather_monthly = weather_df.groupby('month')['temperature'].mean()
retail_monthly = retail_df.groupby('month')['total_amount'].sum()

# Normalize for comparison
weather_norm = (weather_monthly - weather_monthly.min()) / (weather_monthly.max() - weather_monthly.min())
retail_norm = (retail_monthly - retail_monthly.min()) / (retail_monthly.max() - retail_monthly.min())

# Align both series on the months they share so each point compares the same month
common_months = weather_norm.index.intersection(retail_norm.index).sort_values()
weather_aligned = weather_norm.loc[common_months]
retail_aligned = retail_norm.loc[common_months]

# Visualization
fig, ax = plt.subplots(figsize=(12, 6))
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

ax.plot(common_months, weather_aligned, marker='o', label='Temperature (normalized)', linewidth=2)
ax.plot(common_months, retail_aligned, marker='s', label='Retail Sales (normalized)', linewidth=2)

# Label each tick with the name of the month it actually represents
ax.set_xticks(common_months)
ax.set_xticklabels([months[m - 1] for m in common_months])
ax.set_title('🌡️💰 Seasonal Patterns: Temperature vs Retail Sales', fontweight='bold', fontsize=14)
ax.set_xlabel('Month')
ax.set_ylabel('Normalized Value (0-1)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate correlation on the aligned months
correlation = np.corrcoef(weather_aligned, retail_aligned)[0, 1]
print(f"\n📊 Correlation between temperature and retail sales: {correlation:.3f}")

print("\n💡 YOUR TURN: Try creating your own cross-dataset analysis!")
print("Suggestions:")
print("- Compare Netflix viewing patterns with weather (indoor activity)")
print("- Check whether social media sentiment follows the day-of-week pattern seen in the retail data")
print("- See if music energy levels correlate with time patterns")

In [None]:
# TODO: Your cross-dataset analysis
# Write your creative analysis here!

# Your code here:


---

## 🎉 Congratulations! Summary & Portfolio

<div align="center">
    <img src="figs/Data_Science_VD.png" width="400"/>
</div>

### 🏆 What You've Accomplished

You've just completed **5 comprehensive data science projects** using real-world datasets!

#### 📊 Your Data Science Portfolio:

1. **Netflix Analytics** 🎬
   - Analyzed viewing patterns and binge-watching behavior
   - Skills: Time series analysis, user behavior analytics

2. **Retail Intelligence** 🛍️
   - Optimized sales strategies and customer segmentation
   - Skills: RFM analysis, revenue optimization, promotion analysis

3. **Spotify Music Discovery** 🎵
   - Explored audio features and recommendation patterns
   - Skills: Feature correlation, pattern recognition

4. **Weather Pattern Analysis** 🌤️
   - Identified climate trends and extreme events
   - Skills: Time series, anomaly detection, seasonal analysis

5. **Social Media Engagement** 📱
   - Decoded viral content and engagement drivers
   - Skills: Sentiment analysis, viral prediction, platform analytics

### 🛠️ Skills You've Developed:

✅ **Data Exploration & Cleaning**  
✅ **Statistical Analysis**  
✅ **Data Visualization**  
✅ **Pattern Recognition**  
✅ **Business Insights Generation**  
✅ **Storytelling with Data**  
✅ **Cross-functional Analysis**  

### 📈 Your Analytics Metrics:


In [None]:
# Calculate your accomplishments
total_rows_analyzed = sum([
    len(netflix_df),
    len(retail_df),
    len(spotify_df),
    len(weather_df),
    len(social_df)
])

datasets_explored = 5
visualizations_created = 20  # Approximate based on exercises
insights_generated = 25  # Approximate

print("📊 YOUR DATA SCIENCE ACHIEVEMENTS")
print("="*50)
print(f"📈 Total Rows Analyzed: {total_rows_analyzed:,}")
print(f"📁 Datasets Explored: {datasets_explored}")
print(f"📊 Visualizations Created: ~{visualizations_created}")
print(f"💡 Insights Generated: ~{insights_generated}")
print("🏆 Skill Level: Data Science Practitioner")
print("="*50)

# Create a completion certificate visualization
fig, ax = plt.subplots(figsize=(12, 8))
ax.text(0.5, 0.9, '🏆 Certificate of Completion 🏆', 
        fontsize=24, fontweight='bold', ha='center')
ax.text(0.5, 0.75, 'Data Science Bootcamp Exercises', 
        fontsize=18, ha='center')
ax.text(0.5, 0.6, f'Successfully Analyzed {total_rows_analyzed:,} Data Points', 
        fontsize=14, ha='center')
ax.text(0.5, 0.5, f'Across {datasets_explored} Real-World Datasets', 
        fontsize=14, ha='center')

skills = ['Netflix Analytics ✓', 'Retail Intelligence ✓', 'Music Discovery ✓', 
          'Weather Analysis ✓', 'Social Media Analytics ✓']
for i, skill in enumerate(skills):
    ax.text(0.5, 0.35 - i*0.05, skill, fontsize=12, ha='center', color='green')

ax.text(0.5, 0.05, f'Date: {datetime.now().strftime("%B %d, %Y")}', 
        fontsize=10, ha='center', style='italic')

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.axis('off')

# Add border
rect = plt.Rectangle((0.05, 0.02), 0.9, 0.96, fill=False, 
                     edgecolor='gold', linewidth=3)
ax.add_patch(rect)

plt.tight_layout()
plt.show()

### 🚀 Next Steps

You're now ready to:

1. **Build Your Portfolio** 💼
   - Share your analyses on GitHub
   - Create a data science blog
   - Showcase projects to employers

2. **Advance Your Skills** 📚
   - Try machine learning models on these datasets
   - Build interactive dashboards
   - Combine multiple datasets for deeper insights

3. **Real-World Applications** 🌍
   - Apply these techniques to your own data
   - Solve business problems
   - Contribute to open-source projects

### 💬 Final Interactive Check

In [None]:
# Final knowledge check
final_quiz = InteractiveQuiz("Data Science Mastery Check")

final_quiz.add_question(
    "Which analysis technique helps identify customer segments?",
    ["Time series analysis", "RFM analysis", "Correlation analysis", "Sentiment analysis"],
    1,
    "Excellent! RFM (Recency, Frequency, Monetary) analysis is perfect for customer segmentation."
)

final_quiz.add_question(
    "What's the best way to identify viral content characteristics?",
    ["Look at timestamps", "Compare high vs low engagement posts", "Count hashtags", "Check platform"],
    1,
    "Perfect! Comparing characteristics of high vs low engagement content reveals viral patterns."
)

final_quiz.add_question(
    "Which skill is MOST important for a data scientist?",
    ["Programming only", "Statistics only", "Visualization only", "Curiosity and problem-solving"],
    3,
    "Absolutely right! Curiosity and problem-solving drive all great data science work!"
)

quiz_widget = final_quiz.create_quiz()
display(quiz_widget)

---

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 30px; border-radius: 10px; margin: 20px 0; text-align: center;">
    <h2>🎊 You Did It!</h2>
    <p style="font-size: 18px;">You've completed the Data Science Bootcamp Exercises!</p>
    <p style="font-size: 16px;">You're not just learning data science — you're becoming a data scientist!</p>
    <br>
    <p style="font-size: 20px; font-weight: bold;">Welcome to the data-driven future! 🚀</p>
</div>

---

### 📝 Notes Section
Use this space to write your own notes, ideas, and insights from the exercises:

In [None]:
# Your notes and observations
my_notes = """
Key Learnings:
1. 
2. 
3. 

Ideas for Future Projects:
- 
- 
- 

Questions to Explore:
- 
- 
"""

print(my_notes)