# 📊 Student Dropout Prediction - Data Exploration
## Notebook 1: Exploratory Data Analysis

This notebook explores the Student Mental Health dataset to understand:
- Dataset structure and characteristics
- Distribution of features
- Missing values and data quality
- Relationships between features
- Initial insights for dropout prediction

## 1. Setup and Imports

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os

# Add parent directory to path
sys.path.append('..')

from src.utils import (
    load_data, get_missing_value_summary,
    plot_feature_distributions, plot_correlation_matrix
)
import config

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Random seed
np.random.seed(config.RANDOM_STATE)

print("✓ Libraries imported successfully")
print(f"Working Directory: {os.getcwd()}")

## 2. Load Dataset

**Dataset**: Student Mental Health Dataset from Kaggle

⚠️ **IMPORTANT**: Make sure you have downloaded the dataset and placed it in `data/raw/student_mental_health.csv`

If you haven't downloaded it yet:
1. Go to Kaggle: https://www.kaggle.com/datasets
2. Search for "Student Mental Health"
3. Download the CSV file
4. Place it in `data/raw/student_mental_health.csv` under the project root (`../data/raw/student_mental_health.csv` relative to this notebook)

In [None]:
# Load the dataset
try:
    df = load_data(config.DATASET_PATH)
    print("\n✓ Dataset loaded successfully!")
    print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
except FileNotFoundError as e:
    print(f"\n❌ Error: {e}")
    print("\nPlease download the dataset and place it in the correct location.")
    print("Expected path:", config.DATASET_PATH)
    raise  # Stop here: every later cell needs `df` to be defined

## 3. Initial Data Inspection

In [None]:
# Display first few rows
print("\n" + "="*80)
print("FIRST 5 ROWS OF THE DATASET")
print("="*80 + "\n")
df.head()

In [None]:
# Dataset info
print("\n" + "="*80)
print("DATASET INFORMATION")
print("="*80 + "\n")
df.info()

In [None]:
# Statistical summary
print("\n" + "="*80)
print("STATISTICAL SUMMARY")
print("="*80 + "\n")
df.describe()

In [None]:
# Column names and types
print("\n" + "="*80)
print("COLUMN INFORMATION")
print("="*80 + "\n")

column_info = pd.DataFrame({
    'Column': df.columns,
    'Type': df.dtypes.values,
    'Non-Null Count': df.count().values,
    'Null Count': df.isnull().sum().values,
    'Unique Values': [df[col].nunique() for col in df.columns]
})

print(column_info.to_string(index=False))

## 4. Missing Value Analysis

In [None]:
# Check for missing values
missing_summary = get_missing_value_summary(df)

if not missing_summary.empty:
    print("\n" + "="*80)
    print("MISSING VALUES SUMMARY")
    print("="*80 + "\n")
    print(missing_summary.to_string(index=False))
    
    # Visualize missing values
    plt.figure(figsize=(12, 6))
    plt.barh(missing_summary['Column'], missing_summary['Missing_Percentage'], color='coral')
    plt.xlabel('Missing Percentage (%)', fontsize=12)
    plt.title('Missing Values by Column', fontsize=14, fontweight='bold')
    plt.grid(axis='x', alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("\n✓ No missing values found in the dataset!")
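
The `get_missing_value_summary` helper lives in `src/utils` and is not shown in this notebook. The cell above only relies on it returning a DataFrame with `Column` and `Missing_Percentage` columns. A minimal sketch of such a helper, assuming that output shape, might look like the following; the function name `missing_value_summary_sketch` and the `Missing_Count` column are hypothetical.

```python
import pandas as pd


def missing_value_summary_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for get_missing_value_summary (the real helper is not shown)."""
    missing_count = frame.isnull().sum()
    summary = pd.DataFrame({
        "Column": missing_count.index,
        "Missing_Count": missing_count.values,  # assumed column name
        "Missing_Percentage": (missing_count.values / len(frame) * 100).round(2),
    })
    # Keep only columns that actually contain gaps, worst first
    summary = summary[summary["Missing_Count"] > 0]
    return summary.sort_values("Missing_Percentage", ascending=False).reset_index(drop=True)


# Toy data for illustration only, not the Kaggle dataset
demo = pd.DataFrame({
    "age": [20, None, 22, 23],
    "gpa": [3.1, 3.4, None, None],
    "year": [1, 2, 3, 4],
})
demo_summary = missing_value_summary_sketch(demo)
print(demo_summary.to_string(index=False))
```

Returning an empty DataFrame when nothing is missing matches the `missing_summary.empty` check used in the cell above.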

## 5. Target Variable Analysis

**Note**: The actual target variable name may differ. Update this section based on your dataset.

In [None]:
# Identify potential target variable
# Common names: 'dropout', 'status', 'target', 'label', etc.
print("\nColumn names:")
print(df.columns.tolist())

# TODO: Update 'target_column' with the actual dropout indicator column
# target_column = 'dropout'  # Replace with actual column name
# 
# if target_column in df.columns:
#     print(f"\n" + "="*80)
#     print(f"TARGET VARIABLE: {target_column}")
#     print("="*80 + "\n")
#     
#     # Value counts
#     print(df[target_column].value_counts())
#     # Note: .mean() gives the dropout rate only if the target is encoded as 0/1
#     print(f"\nDropout Rate: {df[target_column].mean():.2%}")
#     
#     # Visualize class distribution
#     fig, axes = plt.subplots(1, 2, figsize=(14, 5))
#     
#     # Count plot
#     df[target_column].value_counts().plot(kind='bar', ax=axes[0], color=['#2ecc71', '#e74c3c'])
#     axes[0].set_title('Class Distribution', fontsize=14, fontweight='bold')
#     axes[0].set_xlabel('Class', fontsize=12)
#     axes[0].set_ylabel('Count', fontsize=12)
#     axes[0].grid(axis='y', alpha=0.3)
#     
#     # Pie chart
#     df[target_column].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%', 
#                                           colors=['#2ecc71', '#e74c3c'])
#     axes[1].set_ylabel('')
#     axes[1].set_title('Class Proportion', fontsize=14, fontweight='bold')
#     
#     plt.tight_layout()
#     plt.show()
# else:
#     print(f"\n⚠️ Target column '{target_column}' not found!")
#     print("Please update the target_column variable with the correct column name.")
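
If the target column is not obvious from its name, one way to narrow the search is to list columns with exactly two distinct values, since a dropout indicator is usually binary. The sketch below illustrates that heuristic on toy data; the helper `find_binary_columns` and the example column names are hypothetical, and a binary column is only a candidate, not a confirmed target.

```python
import pandas as pd


def find_binary_columns(frame: pd.DataFrame) -> list:
    """Return the columns that hold exactly two distinct non-null values."""
    return [col for col in frame.columns if frame[col].nunique(dropna=True) == 2]


# Toy data for illustration only; column names are made up
demo = pd.DataFrame({
    "dropout": ["Yes", "No", "No", "Yes"],
    "age": [19, 20, 21, 22],
    "treatment": [0, 1, 0, 0],
})
candidates = find_binary_columns(demo)
print("Candidate target columns:", candidates)

# Inspect the class balance of each candidate before choosing one
for col in candidates:
    print(demo[col].value_counts(normalize=True))
```

On the real dataset, call `find_binary_columns(df)` and check each candidate against the dataset documentation before setting `target_column`.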

## 6. Feature Distribution Analysis

In [None]:
# Separate numerical and categorical features
numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()

print(f"\nNumerical Features ({len(numerical_features)}):")
print(numerical_features)

print(f"\nCategorical Features ({len(categorical_features)}):")
print(categorical_features)

In [None]:
# Plot numerical feature distributions
if numerical_features:
    print("\n" + "="*80)
    print("NUMERICAL FEATURE DISTRIBUTIONS")
    print("="*80 + "\n")
    
    plot_feature_distributions(df, numerical_features[:9], ncols=3)

In [None]:
# Plot categorical feature distributions
if categorical_features:
    print("\n" + "="*80)
    print("CATEGORICAL FEATURE DISTRIBUTIONS")
    print("="*80 + "\n")
    
    plot_feature_distributions(df, categorical_features[:9], ncols=3)

## 7. Correlation Analysis

In [None]:
# Correlation matrix for numerical features
if len(numerical_features) > 1:
    print("\n" + "="*80)
    print("CORRELATION ANALYSIS")
    print("="*80 + "\n")
    
    plot_correlation_matrix(df[numerical_features])
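
A heatmap can be hard to read when there are many features, so it may help to also list the strongest pairwise correlations as a table. The sketch below keeps the upper triangle of the absolute correlation matrix and sorts it; `top_correlated_pairs` is a hypothetical helper, not part of `src.utils`, and the data is a toy example.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    numeric = frame.select_dtypes(include=[np.number])
    corr = numeric.corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)


# Toy data for illustration only
demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # perfectly correlated with x
    "z": [5, 3, 4, 1, 2],
    "label": list("abcde"),  # non-numeric, ignored
})
pairs = top_correlated_pairs(demo)
print(pairs)
```

Highly correlated pairs are candidates for removal or combination during preprocessing, although correlation alone does not say which feature to keep.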

## 8. Outlier Detection

In [None]:
# Box plots for numerical features to detect outliers
if numerical_features:
    print("\n" + "="*80)
    print("OUTLIER DETECTION (Box Plots)")
    print("="*80 + "\n")
    
    n_features = min(len(numerical_features), 9)
    n_rows = (n_features + 2) // 3
    # squeeze=False always returns a 2D array, so flatten() is safe for any grid size
    fig, axes = plt.subplots(n_rows, 3, figsize=(15, 3 * n_rows), squeeze=False)
    axes = axes.flatten()
    
    for idx, feature in enumerate(numerical_features[:n_features]):
        df.boxplot(column=feature, ax=axes[idx])
        axes[idx].set_title(f'{feature}', fontweight='bold')
        axes[idx].grid(alpha=0.3)
    
    # Hide unused subplots
    for idx in range(n_features, len(axes)):
        axes[idx].axis('off')
    
    plt.tight_layout()
    plt.show()
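
To put numbers on what the box plots show, the points beyond the whiskers can be counted per feature. The sketch below uses the 1.5 × IQR rule, which is also the default whisker length in matplotlib box plots; `iqr_outlier_counts` is a hypothetical helper and the data is a toy example.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, columns: list, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each listed column."""
    counts = {}
    for col in columns:
        values = frame[col].dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        counts[col] = int(((values < lower) | (values > upper)).sum())
    return pd.Series(counts, name="outlier_count")


# Toy data for illustration only
demo = pd.DataFrame({
    "score": [10, 12, 11, 13, 12, 100],  # 100 lies far above the upper fence
    "age": [18, 19, 20, 21, 22, 23],
})
outliers = iqr_outlier_counts(demo, ["score", "age"])
print(outliers)
```

On the real dataset this would be called as `iqr_outlier_counts(df, numerical_features)`. A flagged value is not necessarily an error, so inspect it before dropping or capping.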

## 9. Key Insights and Next Steps

### Observations:
1. **Dataset Size**: [Update after running]
2. **Missing Values**: [Update after running]
3. **Class Balance**: [Update after running]
4. **Feature Types**: [Update after running]
5. **Outliers**: [Update after running]

### Next Steps:
1. ✅ Data exploration completed
2. 📝 Proceed to `02_data_preprocessing.ipynb` for:
   - Handling missing values
   - Feature engineering
   - Encoding categorical variables
   - Scaling numerical features
   - Train-test split

In [None]:
# Build and print the exploration summary (nothing is written to disk here)
summary = {
    'total_rows': len(df),
    'total_columns': len(df.columns),
    'numerical_features': len(numerical_features),
    'categorical_features': len(categorical_features),
    'missing_values': df.isnull().sum().sum(),
    'duplicate_rows': df.duplicated().sum()
}

print("\n" + "="*80)
print("EXPLORATION SUMMARY")
print("="*80)
for key, value in summary.items():
    print(f"{key.replace('_', ' ').title()}: {value}")
print("="*80)
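
The cell above only prints the summary. If the numbers should be available to later notebooks, they could be written to a JSON file. The sketch below shows one way to do that; `save_summary` and the output file name are hypothetical, and the demo writes to a temporary directory rather than to the project. The `int()` conversion matters because pandas aggregations such as `df.isnull().sum().sum()` return NumPy integers, which the `json` module cannot serialize directly.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def save_summary(summary: dict, path: Path) -> None:
    """Write an all-integer summary dict to a JSON file."""
    # Convert NumPy integers to plain Python ints so json can serialize them
    clean = {key: int(value) for key, value in summary.items()}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(clean, indent=2))


# Toy values for illustration only, not results from the dataset
demo_values = {"total_rows": 100, "missing_values": np.int64(7)}
out_path = Path(tempfile.mkdtemp()) / "exploration_summary.json"  # hypothetical file name
save_summary(demo_values, out_path)

loaded = json.loads(out_path.read_text())
print(loaded)
```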