# Project 1: Exploratory Data Analysis on Student Performance Dataset

**Goal**: Walk from raw CSV to clear, visual insights in a single notebook.

## Session Overview (90 minutes)
- **Kick-off (0-10 min)**: What makes "good" EDA?
- **Live Demo (10-55 min)**: Instructor walkthrough
- **Guided Exercise (55-80 min)**: Student practice
- **Lightning Recap (80-90 min)**: Best practices & pitfalls

## Checkpoints
1. ✅ Load and inspect data
2. ✅ Handle missing values
3. ✅ Categorical vs Numeric audit
4. ✅ Feature engineering
5. ✅ Correlation analysis
6. ✅ Visualization insights

## 1. Setup and Data Loading

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("✅ Libraries imported successfully!")

In [None]:
# TODO: Load the dataset
# Hint: Use pd.read_csv() to load 'student_performance.csv'
# df = pd.read_csv('student_performance.csv')

# TODO: Report the dataset's shape and preview the first rows
# print(f"📊 Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
# print("\nFirst 5 rows:")
# YOUR CODE HERE to display first 5 rows

## 2. Data Overview and Types

In [None]:
# TODO: Display basic information about the dataset
# Hint: Use df.info() and df.describe()
print("📋 Dataset Info:")
print("=" * 50)
# YOUR CODE HERE

print("\n📈 Basic Statistics:")
print("=" * 50)
# YOUR CODE HERE

In [None]:
# TODO: Check for missing values
# print("🔍 Missing Values Check:")
# print("=" * 50)

# Hint: Use df.isnull().sum() to count missing values
# missing_values = df.isnull().sum()
# missing_percentage = (missing_values / len(df)) * 100

# Create a DataFrame to display missing values nicely
# missing_df = pd.DataFrame({
#     'Missing Count': missing_values,
#     'Missing Percentage': missing_percentage
# })
# missing_df = missing_df[missing_df['Missing Count'] > 0]

# if len(missing_df) == 0:
#     print("✅ No missing values found!")
# else:
#     print(missing_df)

In [None]:
# TODO: Create missing values heatmap
# Hint: Use sns.heatmap() with df.isnull()
# plt.figure(figsize=(10, 6))
# sns.heatmap(df.isnull(), yticklabels=False, cbar=True, cmap='viridis')
# plt.title('Missing Values Heatmap')
# plt.tight_layout()
# plt.show()

# print("✅ Missing values visualization complete!")
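
Checkpoint 2 asks you to handle missing values, not only count them. The sketch below shows one common approach, assuming simple imputation is acceptable for your analysis: fill numeric gaps with the column median and categorical gaps with the most frequent value. The toy DataFrame and its column names (`study_time`, `final_grade`, `breakfast`) are illustrative assumptions, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (column names are assumptions)
toy = pd.DataFrame({
    "study_time": [1, 2, np.nan, 4],
    "final_grade": [55.0, np.nan, 70.0, 85.0],
    "breakfast": ["yes", None, "yes", "no"],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Median is less sensitive to outliers than the mean
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # mode() returns a Series; take the first (most frequent) value
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
print("Remaining missing values:", int(filled.isnull().sum().sum()))
```

Dropping rows with `dropna()` is the other common option. Whichever you choose, record it in your insights summary, because it changes every statistic computed afterwards.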

## 3. Categorical vs Numeric Audit

In [None]:
# TODO: Separate categorical and numerical columns
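# Hint: Use df.select_dtypes() to identify column types
# include='number' matches every numeric dtype (int32, float32, ...), not just int64/float64
# numerical_cols = df.select_dtypes(include='number').columns.tolist()
# Everything else (strings, categories, booleans) is treated as categorical
# categorical_cols = df.select_dtypes(exclude='number').columns.tolist()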

# print(f"📊 Categorical columns: {categorical_cols}")
# print(f"📈 Numerical columns: {numerical_cols}")

# TODO: Display unique values for categorical columns
# print("\n🔤 Categorical Variables Analysis:")
# print("=" * 50)
# for col in categorical_cols:
#     print(f"\n{col}:")
#     print(df[col].value_counts())
#     print(f"Unique values: {df[col].nunique()}")

In [None]:
# TODO: Visualize categorical variables
# Hint: Use sns.countplot() in subplots
# fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# axes = axes.ravel()

# for ax, col in zip(axes, categorical_cols):  # zip stops after the 4 available panels
#     sns.countplot(data=df, x=col, ax=ax)
#     ax.set_title(f'Distribution of {col}')
#     ax.tick_params(axis='x', rotation=45)

# Hide panels left empty when there are fewer than 4 categorical columns
# for ax in axes[len(categorical_cols):]:
#     ax.set_visible(False)

# plt.tight_layout()
# plt.show()
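
The audit above covers the categorical columns. For the numeric side, and for the "Outliers" entry in your insights summary, a quick check is the 1.5 × IQR rule. This is a minimal sketch on made-up grades, not on the real dataset; the 1.5 multiplier is a convention rather than a fixed rule.

```python
import pandas as pd

# Toy grades; the value 5 is a deliberately planted outlier
grades = pd.Series([62, 65, 68, 70, 71, 73, 75, 78, 5], name="final_grade")

# Interquartile range: spread of the middle 50% of the values
q1, q3 = grades.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged for inspection, not automatically removed
outliers = grades[(grades < lower) | (grades > upper)]
print(f"IQR bounds: [{lower:.1f}, {upper:.1f}]")
print(outliers)
```

A flagged value is a prompt to look closer: it may be a data-entry error or a genuine low score.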

## 4. Feature Engineering

In [None]:
# TODO: Create study time bands (engineered feature)
# Hint: Create a function to map study_time to bands
def create_study_bands(study_time):
    # YOUR CODE HERE
    # if study_time == 1:
    #     return 'Low'
    # elif study_time == 2:
    #     return 'Medium'
    # elif study_time == 3:
    #     return 'High'
    # else:
    #     return 'Very High'
    pass

# Apply the function to create the new feature
# df['study_time_band'] = df['study_time'].apply(create_study_bands)

# print("✅ Study time bands created!")
# print("\nStudy Time Bands Distribution:")
# print(df['study_time_band'].value_counts())

In [None]:
# TODO: Create meal habit feature
# Hint: Combine breakfast and lunch columns with '_'
# df['meal_habit'] = df['breakfast'] + '_' + df['lunch']

# print("✅ Meal habit feature created!")
# print("\nMeal Habit Distribution:")
# print(df['meal_habit'].value_counts())
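
As an alternative to the `if`/`elif` function, a dictionary lookup with `Series.map()` does the same banding in one step. The sketch assumes `study_time` is coded as the integers 1 to 4, as the function template implies; the toy DataFrame is illustrative only.

```python
import pandas as pd

# Assumption: study_time is coded 1-4, as in the function template above
band_labels = {1: "Low", 2: "Medium", 3: "High", 4: "Very High"}

toy = pd.DataFrame({"study_time": [1, 3, 2, 4, 2]})

# map() looks each value up in the dictionary; unmapped values become NaN
bands = toy["study_time"].map(band_labels)

# An ordered categorical keeps Low < Medium < High < Very High in tables and plots
toy["study_time_band"] = pd.Categorical(
    bands, categories=list(band_labels.values()), ordered=True
)

print(toy["study_time_band"].value_counts(sort=False))
```

One behavioral difference: the function's `else` branch labels every unexpected value "Very High", while `map()` leaves it as NaN, which makes bad codes easier to spot.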

## 5. Correlation Analysis

In [None]:
# TODO: Select numerical columns for correlation analysis
# numerical_df = df[numerical_cols]

# TODO: Calculate correlation matrix
# Hint: Use the .corr() method
# correlation_matrix = numerical_df.corr()

# print("📊 Correlation Matrix:")
# print("=" * 50)
# print(correlation_matrix.round(3))

In [None]:
# TODO: Create correlation heatmap
# Hint: Use sns.heatmap() with a mask for the upper triangle
# plt.figure(figsize=(10, 8))
# mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
# sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='coolwarm', center=0,
#             square=True, linewidths=0.5, cbar_kws={"shrink": .8})
# plt.title('Correlation Heatmap of Numerical Variables')
# plt.tight_layout()
# plt.show()
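
To fill in the "Key Correlations" part of your summary, it helps to rank every numeric column by its correlation with the target. A minimal sketch, assuming `final_grade` is the target; the toy data and the `absences` column are hypothetical.

```python
import pandas as pd

# Toy numeric data; column names are assumptions for illustration
toy = pd.DataFrame({
    "study_time": [1, 2, 3, 4, 2, 3],
    "absences": [10, 8, 3, 1, 7, 4],
    "final_grade": [50, 60, 75, 90, 62, 78],
})

target = "final_grade"
corr_with_target = (
    toy.corr()[target]
    .drop(target)                           # a column's correlation with itself is always 1
    .sort_values(key=abs, ascending=False)  # strongest first, regardless of sign
)
print(corr_with_target.round(3))
```

Sorting by absolute value keeps strong negative relationships near the top instead of burying them at the bottom.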

## 6. Guided Mini-Exercise

**Choose ONE of the following prompts to investigate:**

### Option A: Which non-academic factors correlate most with final grade?
- Investigate gender, breakfast habits, lunch habits, meal combinations
- Create box plots and run statistical tests to compare groups
- Calculate summary statistics by groups
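
A p-value says whether a group difference is likely to be noise, not how large it is. One way to report size alongside the test is Cohen's d, the mean difference divided by the pooled standard deviation. The helper `cohens_d` below is a hypothetical name written for this sketch, and the grades are made up.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference between two groups (pooled SD)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Made-up grades for students who do / do not eat breakfast
with_breakfast = [72, 75, 78, 80, 85]
without_breakfast = [65, 70, 72, 74, 79]

d = cohens_d(with_breakfast, without_breakfast)
print(f"Cohen's d: {d:.2f}")
```

On real data you would pass in two filtered columns, for example the `final_grade` values for each breakfast group.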

### Option B: How does study time interact with parental education?
- Create pivot tables and heatmaps
- Analyze average grades by education level and study time
- Look for interaction effects
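
One simple way to look for an interaction is to compare the gain from more study time across education levels: if the gain is similar in every row, the two factors act independently; if it differs, they interact. A minimal sketch on made-up data, where the labels `HS`, `College`, `Low`, and `High` are assumptions for illustration:

```python
import pandas as pd

# Toy data; the category labels are assumptions for illustration
toy = pd.DataFrame({
    "parental_education": ["HS", "HS", "HS", "HS",
                           "College", "College", "College", "College"],
    "study_time_band": ["Low", "Low", "High", "High",
                        "Low", "Low", "High", "High"],
    "final_grade": [50, 54, 60, 64, 58, 62, 80, 84],
})

pivot = toy.pivot_table(
    values="final_grade",
    index="parental_education",
    columns="study_time_band",
    aggfunc="mean",
)[["Low", "High"]]  # reorder columns; the default order is alphabetical

# Gain from Low to High study time within each education level
gain = pivot["High"] - pivot["Low"]
print(pivot)
print(gain)
```

In this toy example the gain is larger for the `College` group, which is the pattern an interaction line plot would show as non-parallel lines. With small group counts, treat such differences as a lead to investigate rather than a conclusion.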

**Instructions:**
1. Create at least 2 visualizations
2. Calculate relevant statistics
3. Draw insights from your analysis
4. Document your findings below

In [None]:
# TODO: Option A - Non-academic factors analysis
# Uncomment and complete the code below if you choose Option A
# Requires the 'meal_habit' column created in Section 4

# print("🔍 Option A: Non-academic factors analysis")
# print("=" * 50)

# 1. Gender analysis
# plt.figure(figsize=(15, 5))
# plt.subplot(1, 3, 1)
# YOUR CODE HERE for gender boxplot

# 2. Breakfast impact
# plt.subplot(1, 3, 2)
# YOUR CODE HERE for breakfast boxplot

# 3. Meal habit analysis
# plt.subplot(1, 3, 3)
# YOUR CODE HERE for meal habit boxplot

# plt.tight_layout()
# plt.show()

# Statistical tests
# Hint: Use stats.ttest_ind() for comparing groups
# equal_var=False runs Welch's t-test, which does not assume equal group variances
# male_grades = df[df['gender'] == 'Male']['final_grade']
# female_grades = df[df['gender'] == 'Female']['final_grade']
# t_stat, p_value = stats.ttest_ind(male_grades, female_grades, equal_var=False, nan_policy='omit')
# print(f"Gender difference t-test p-value: {p_value:.4f}")

# YOUR CODE HERE for breakfast impact test

# Summary statistics
# print("\n📊 Summary Statistics by Gender:")
# print(df.groupby('gender')['final_grade'].agg(['mean', 'std', 'count']))


In [None]:
# TODO: Option B - Study time and parental education interaction
# Uncomment and complete the code below if you choose Option B
# Requires the 'study_time_band' column created in Section 4

# print("🔍 Option B: Study time and parental education interaction")
# print("=" * 50)

# 1. Create pivot table
# pivot_table = df.pivot_table(
#     values='final_grade',
#     index='parental_education',
#     columns='study_time_band',
#     aggfunc='mean'
# )
# print("Average Final Grade by Education and Study Time:")
# print(pivot_table.round(1))

# 2. Visualization
# plt.figure(figsize=(12, 8))
# plt.subplot(2, 2, 1)
# YOUR CODE HERE for heatmap

# plt.subplot(2, 2, 2)
# YOUR CODE HERE for education boxplot

# plt.subplot(2, 2, 3)
# YOUR CODE HERE for study time by education countplot

# plt.subplot(2, 2, 4)
# YOUR CODE HERE for interaction line plot

# plt.tight_layout()
# plt.show()

# Statistical analysis
# print("\n📊 Summary Statistics:")
# summary = df.groupby(['parental_education', 'study_time_band'])['final_grade'].agg(['mean', 'count'])
# print(summary.round(1))

## 7. Your Insights Summary

**Document your findings here:**

### Data Quality:
- **Missing values**: [YOUR FINDINGS]
- **Data types**: [YOUR FINDINGS]
- **Outliers**: [YOUR FINDINGS]

### Key Correlations:
1. **[Variable Name]** ([correlation]) - [Your interpretation]
2. **[Variable Name]** ([correlation]) - [Your interpretation]
3. **[Variable Name]** ([correlation]) - [Your interpretation]

### Feature Engineering Results:
- **Study time bands**: [YOUR FINDINGS]
- **Meal habits**: [YOUR FINDINGS]

### Mini-Exercise Results:
**Which option did you choose?** [A or B]

**Key findings:**
- [YOUR FINDING 1]
- [YOUR FINDING 2]
- [YOUR FINDING 3]

### Your Recommendations:
1. [YOUR RECOMMENDATION 1]
2. [YOUR RECOMMENDATION 2]
3. [YOUR RECOMMENDATION 3]

## 8. Best Practices & Common Pitfalls

### ✅ Best Practices Demonstrated:
1. **Data loading and inspection** - Always check data types and missing values
2. **Visualization hierarchy** - Start with distributions, then correlations
3. **Feature engineering** - Create meaningful derived features
4. **Correlation analysis** - Use a masked heatmap to scan pairwise relationships; pair plots are a useful follow-up
5. **Documentation** - Clear markdown explanations

### ⚠️ Common Pitfalls to Avoid:
1. **Ignoring data types** - Categorical and numerical columns need different summaries and plots
2. **Correlation ≠ Causation** - A strong correlation does not show that one variable drives the other
3. **Too many plots** - Keep only the visualizations that support a finding
4. **Missing context** - Always explain what insights mean
5. **No action items** - EDA should lead to recommendations

### 🎯 Next Steps:
- Consider feature selection for modeling
- Plan data preprocessing pipeline
- Design hypothesis testing experiments
- Implement recommendations and measure impact

### 📚 Additional Resources:
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Seaborn Gallery](https://seaborn.pydata.org/examples/index.html)
- [Matplotlib Tutorials](https://matplotlib.org/stable/tutorials/index.html)