# JEE College Prediction - Comprehensive Data Analysis

This notebook provides a complete analysis of the JEE College Prediction dataset, including data loading, cleaning, exploratory data analysis, feature engineering, model training, and evaluation.

## Table of Contents
1. [Import Required Libraries](#import-libraries)
2. [Load the Dataset with Error Handling](#load-data)
3. [Display Dataset Information](#dataset-info)
4. [Data Cleaning: Handle Missing Values](#missing-values)
5. [Data Cleaning: Clean Rank Columns](#clean-ranks)
6. [Save Cleaned Data](#save-cleaned)
7. [Exploratory Data Analysis: Basic Statistics](#basic-stats)
8. [Exploratory Data Analysis: Visualizations](#visualizations)
9. [Exploratory Data Analysis: Correlation Analysis](#correlation)
10. [Feature Engineering: Define Features and Targets](#define-features)
11. [Feature Engineering: Encode Target Variables](#encode-targets)
12. [Feature Engineering: Define Feature Types](#feature-types)
13. [Model Training: Create Preprocessing Pipeline](#preprocessing)
14. [Model Training: Train/Test Split](#train-test-split)
15. [Model Training: Train the Model](#train-model)
16. [Model Evaluation: Accuracy Metrics](#accuracy)
17. [Model Evaluation: Feature Importance](#feature-importance)
18. [Save Trained Model](#save-model)
19. [Conclusions and Recommendations](#conclusions)

---

## 1. Import Required Libraries {#import-libraries}

Let's start by importing all the necessary libraries for data analysis, visualization, and machine learning. We'll also configure the visualization settings for better plots.

In [None]:
# Import essential libraries for data analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
import json
import pickle
from datetime import datetime

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib

# Configure warnings and display settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)

# Set visualization style for better plots
plt.style.use('default')  # Use default instead of seaborn-v0_8 for compatibility
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

print("✅ All libraries imported successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")
print(f"📈 Matplotlib version: {plt.matplotlib.__version__}")
print(f"🎨 Seaborn version: {sns.__version__}")
# Import scikit-learn before the version check; referencing sklearn earlier raises NameError
import sklearn
print(f"🤖 Scikit-learn version: {sklearn.__version__}")
print(f"🗓️  Analysis started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

In [None]:
# Import essential libraries for data analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
import warnings
from pathlib import Path

# Import machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Import serialization libraries
import pickle
import joblib

# Configure warnings and display settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Set visualization style for better plots
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

print("✅ All libraries imported successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")
print(f"📈 Matplotlib version: {plt.matplotlib.__version__}")
print(f"🎨 Seaborn version: {sns.__version__}")

## 2. Load the Dataset with Error Handling {#load-data}

Let's load the JEE admission dataset with proper error handling to ensure the notebook continues gracefully even if the data file is missing.

In [None]:
# Define data paths
data_dir = Path("../data/raw")
processed_dir = Path("../data/processed")
models_dir = Path("../models")

# Create directories if they don't exist
data_dir.mkdir(parents=True, exist_ok=True)
processed_dir.mkdir(parents=True, exist_ok=True)
models_dir.mkdir(parents=True, exist_ok=True)

# Initialize dataset variable
final_df = None
data_loaded = False

# Try to load the dataset with comprehensive error handling
print("🔍 Attempting to load dataset...")

try:
    # Try different possible data file locations and names
    possible_files = [
        data_dir / "data_v1.pkl",
        data_dir / "jee_data.pkl",
        data_dir / "dataset.pkl",
        processed_dir / "data_v2.pkl"
    ]
    
    for file_path in possible_files:
        if file_path.exists():
            print(f"📁 Found data file at: {file_path}")
            with open(file_path, "rb") as f:
                final_df = pickle.load(f)
            data_loaded = True
            break
    
    if data_loaded:
        print("✅ Data loaded successfully!")
        print(f"📊 Dataset shape: {final_df.shape}")
        print(f"📋 Columns: {final_df.columns.tolist()}")
        print(f"💾 Memory usage: {final_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    else:
        print("❌ No data file found in expected locations:")
        for file_path in possible_files:
            print(f"   - {file_path}")
        print("\n🔧 Please ensure the data file exists in one of these locations.")
        print("📝 You can also create sample data using the data generation script.")
        
except Exception as e:
    print(f"❌ Error loading data: {str(e)}")
    print(f"🔧 Error type: {type(e).__name__}")
    data_loaded = False

# Set a flag for subsequent cells
DATA_AVAILABLE = data_loaded
print(f"\n🎯 Data availability status: {'✅ Available' if DATA_AVAILABLE else '❌ Not Available'}")
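
The search over candidate locations in the cell above can be factored into a small helper. This is a minimal sketch, not part of the notebook: the function name `first_existing` is hypothetical, and it relies only on `pathlib`.

```python
from pathlib import Path
from typing import Iterable, Optional


def first_existing(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate that exists as a regular file, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None
```

A caller could pass `possible_files` and load the result only when it is not `None`. Note that `pickle.load` can execute arbitrary code while deserializing, so it should only be pointed at files from a trusted source.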

## 3. Display Dataset Information {#dataset-info}

Let's examine the structure and basic information about our dataset to better understand what we're working with.

In [None]:
# Only proceed if data is available
if DATA_AVAILABLE and final_df is not None:
    print("📊 DATASET OVERVIEW")
    print("=" * 50)
    
    # Basic information
    print(f"📏 Dataset Shape: {final_df.shape}")
    print(f"📋 Number of Columns: {len(final_df.columns)}")
    print(f"📄 Number of Rows: {len(final_df)}")
    
    print("\n📋 COLUMN INFORMATION")
    print("=" * 50)
    print("Column Names:")
    for i, col in enumerate(final_df.columns, 1):
        print(f"  {i:2d}. {col}")
    
    print("\n📈 DATA TYPES")
    print("=" * 50)
    print(final_df.dtypes)
    
    print("\n📊 DATASET INFO")
    print("=" * 50)
    final_df.info()
    
    print("\n👀 FIRST 5 ROWS")
    print("=" * 50)
    display(final_df.head())
    
    print("\n🔍 LAST 5 ROWS")
    print("=" * 50)
    display(final_df.tail())
    
    print("\n📊 BASIC STATISTICS")
    print("=" * 50)
    display(final_df.describe(include='all'))
    
else:
    print("⚠️  Skipping dataset information display - No data available")
    print("📝 Please ensure the dataset is loaded before proceeding with analysis")

## 4. Data Cleaning: Handle Missing Values {#missing-values}

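Missing values distort both the statistics and the model, so we handle them before any analysis. The cell below summarizes missing values per column, drops rows with no `Institute`, fills missing `Gender` with `'Neutral'`, fills other categorical gaps with `'Unknown'`, and fills numeric gaps with the column median.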

In [None]:
# Only proceed if data is available
if DATA_AVAILABLE and final_df is not None:
    print("🔍 MISSING VALUES ANALYSIS")
    print("=" * 50)
    
    # Check for missing values
    missing_values = final_df.isnull().sum()
    missing_percentage = (missing_values / len(final_df)) * 100
    
    # Create a summary of missing values
    missing_summary = pd.DataFrame({
        'Column': missing_values.index,
        'Missing_Count': missing_values.values,
        'Missing_Percentage': missing_percentage.values
    })
    
    # Filter only columns with missing values
    missing_summary = missing_summary[missing_summary['Missing_Count'] > 0]
    
    if len(missing_summary) > 0:
        print("📊 Missing Values Summary:")
        display(missing_summary.sort_values('Missing_Count', ascending=False))
        
        # Handle missing values based on column type
        original_shape = final_df.shape
        
        # 1. Handle missing Institute values (critical for analysis)
        if 'Institute' in final_df.columns:
            institute_missing = final_df['Institute'].isnull().sum()
            if institute_missing > 0:
                print(f"\n🏫 Removing {institute_missing} rows with missing Institute information...")
                final_df = final_df.dropna(subset=['Institute'])
                print(f"   Dataset shape after removal: {final_df.shape}")
        
        # 2. Handle missing Gender values
        if 'Gender' in final_df.columns:
            gender_missing = final_df['Gender'].isnull().sum()
            if gender_missing > 0:
                print(f"\n👥 Filling {gender_missing} missing Gender values with 'Neutral'...")
                final_df['Gender'] = final_df['Gender'].fillna('Neutral')
                print(f"   Gender distribution after filling:")
                print(final_df['Gender'].value_counts())
        
        # 3. Handle other missing values
        for col in final_df.columns:
            if col not in ['Institute', 'Gender']:
                missing_count = final_df[col].isnull().sum()
                if missing_count > 0:
                    if final_df[col].dtype in ['object', 'category']:
                        final_df[col] = final_df[col].fillna('Unknown')
                        print(f"\n📝 Filled {missing_count} missing values in '{col}' with 'Unknown'")
                    else:
                        median_val = final_df[col].median()
                        final_df[col] = final_df[col].fillna(median_val)
                        print(f"\n🔢 Filled {missing_count} missing values in '{col}' with median: {median_val}")
        
        print(f"\n✅ Data cleaning completed!")
        print(f"📏 Original shape: {original_shape}")
        print(f"📏 Final shape: {final_df.shape}")
        
        # Verify no missing values remain
        remaining_missing = final_df.isnull().sum().sum()
        print(f"🎯 Remaining missing values: {remaining_missing}")
        
    else:
        print("✅ No missing values found in the dataset!")
        
else:
    print("⚠️  Skipping missing values analysis - No data available")
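
The per-dtype fill rule used above (median for numeric columns, a placeholder for everything else) can be isolated into a function and checked on a toy frame. This is a sketch under assumptions: the name `impute_by_dtype` and the toy columns are hypothetical. It also registers the placeholder as a category first, because `fillna` with a value that is not an existing category raises an error on `category` columns.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_by_dtype(df: pd.DataFrame, placeholder: str = "Unknown") -> pd.DataFrame:
    """Return a copy: numeric NaNs become the column median, other NaNs the placeholder."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            if isinstance(out[col].dtype, pd.CategoricalDtype):
                # The placeholder must be a known category before it can be used as a fill value
                if placeholder not in out[col].cat.categories:
                    out[col] = out[col].cat.add_categories([placeholder])
            out[col] = out[col].fillna(placeholder)
    return out


# Toy data (hypothetical), one gap per column
toy = pd.DataFrame({
    "rank": [10.0, np.nan, 30.0],
    "quota": ["AI", None, "HS"],
    "seat": pd.Categorical(["OPEN", None, "OBC"]),
})
filled = impute_by_dtype(toy)
```

Returning a copy keeps the original frame available for comparison, which makes it easier to report how many values were imputed per column.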

## 5. Data Cleaning: Clean Rank Columns {#clean-ranks}

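The rank columns may contain strings or other non-numeric values. The cell below strips every character except digits and the decimal point, converts what remains to an integer, and records `NaN` when nothing numeric is left.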

In [None]:
# Only proceed if data is available
if DATA_AVAILABLE and final_df is not None:
    print("🔧 CLEANING RANK COLUMNS")
    print("=" * 50)
    
    def clean_rank(value):
        """
        Clean rank data by converting various formats to integers.
        
        Args:
            value: Raw rank value (could be string, float, or int)
            
        Returns:
            int: Cleaned rank value or NaN if invalid
        """
        if pd.isna(value):
            return np.nan
            
        try:
            # If it's already a number, try to convert directly
            if isinstance(value, (int, float)):
                return int(value) if not np.isnan(value) else np.nan
            
            # If it's a string, clean it
            if isinstance(value, str):
                # Remove whitespace and convert to lowercase
                value = value.strip().lower()
                
                # Handle common string representations
                if value in ['', 'nan', 'none', 'null', 'na']:
                    return np.nan
                
                # Remove non-numeric characters except digits and decimal point
                import re
                cleaned = re.sub(r'[^\d.]', '', value)
                
                if cleaned == '':
                    return np.nan
                
                # Convert to float first, then to int
                return int(float(cleaned))
                
        except (ValueError, TypeError, AttributeError):
            return np.nan
        
        return np.nan
    
    # Define rank columns to clean
    rank_columns = []
    possible_rank_cols = ['Opening Rank', 'Closing Rank', 'opening_rank', 'closing_rank', 
                         'OpeningRank', 'ClosingRank', 'rank_opening', 'rank_closing']
    
    for col in possible_rank_cols:
        if col in final_df.columns:
            rank_columns.append(col)
    
    if rank_columns:
        print(f"🎯 Found rank columns: {rank_columns}")
        
        for col in rank_columns:
            print(f"\n🔧 Cleaning column: {col}")
            
            # Show original data type and sample values
            print(f"   Original dtype: {final_df[col].dtype}")
            print(f"   Original sample values: {final_df[col].head().tolist()}")
            
            # Count invalid values before cleaning
            if final_df[col].dtype == 'object':
                non_numeric = final_df[col].apply(lambda x: not str(x).replace('.', '').replace('-', '').isdigit() if pd.notna(x) else False).sum()
                print(f"   Non-numeric values before cleaning: {non_numeric}")
            
            # Apply cleaning function
            original_non_null = final_df[col].notna().sum()
            final_df[col] = final_df[col].apply(clean_rank)
            cleaned_non_null = final_df[col].notna().sum()
            
            # Show results
            print(f"   ✅ Cleaned successfully!")
            print(f"   📊 Valid values: {original_non_null} → {cleaned_non_null}")
            print(f"   📈 Data type: {final_df[col].dtype}")
            
            if cleaned_non_null > 0:
                print(f"   📊 Min: {final_df[col].min()}")
                print(f"   📊 Max: {final_df[col].max()}")
                print(f"   📊 Mean: {final_df[col].mean():.2f}")
                print(f"   📊 Median: {final_df[col].median():.2f}")
            
            # Show cleaned sample values
            print(f"   📋 Cleaned sample values: {final_df[col].dropna().head().tolist()}")
        
        print(f"\n✅ All rank columns cleaned successfully!")
        
    else:
        print("⚠️  No rank columns found in the dataset")
        print("📋 Available columns:")
        for col in final_df.columns:
            print(f"   - {col}")
            
else:
    print("⚠️  Skipping rank column cleaning - No data available")
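
For large frames, a vectorized variant of the same idea may be faster than calling `clean_rank` once per row. The sketch below is an approximation rather than a drop-in replacement: the name `clean_rank_series` is hypothetical, it returns pandas' nullable `Int64` dtype instead of floats with `NaN`, and, unlike `clean_rank`, it also strips the sign from values that are already numeric.

```python
import numpy as np
import pandas as pd


def clean_rank_series(s: pd.Series) -> pd.Series:
    """Strip non-numeric characters and return nullable integers (Int64)."""
    # Keep only digits and the decimal point, mirroring the regex in clean_rank
    text = s.astype(str).str.replace(r"[^\d.]", "", regex=True)
    # Empty or malformed strings (for example "1.2.3") become NaN
    numeric = pd.to_numeric(text, errors="coerce")
    # Values are non-negative here, so floor matches int() truncation
    return np.floor(numeric).astype("Int64")


# Toy input (hypothetical) mixing strings, numbers, and missing markers
raw = pd.Series(["1,234", " 56 ", "789P", "n/a", None, 42, 17.0, "12.9"], dtype=object)
cleaned = clean_rank_series(raw)
```

The `Int64` dtype keeps ranks as integers while still allowing missing entries, whereas `apply(clean_rank)` produces a float column as soon as one `NaN` is present.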

## 6. Save Cleaned Data {#save-cleaned}

Let's save the cleaned dataset for future use and analysis.

In [None]:
# Only proceed if data is available
if DATA_AVAILABLE and final_df is not None:
    print("💾 SAVING CLEANED DATA")
    print("=" * 50)
    
    try:
        # Define output path
        output_path = processed_dir / "cleaned_data.pkl"
        
        # Save the cleaned dataset
        with open(output_path, "wb") as f:
            pickle.dump(final_df, f)
        
        print(f"✅ Cleaned data saved successfully!")
        print(f"📁 Location: {output_path}")
        print(f"📊 Shape: {final_df.shape}")
        print(f"💾 File size: {output_path.stat().st_size / 1024**2:.2f} MB")
        
        # Also save as CSV for easy inspection
        csv_path = processed_dir / "cleaned_data.csv"
        final_df.to_csv(csv_path, index=False)
        print(f"📄 CSV version saved: {csv_path}")
        
        # Create a data summary
        summary = {
            'timestamp': pd.Timestamp.now().isoformat(),
            'shape': final_df.shape,
            'columns': final_df.columns.tolist(),
            'dtypes': final_df.dtypes.to_dict(),
            'missing_values': final_df.isnull().sum().to_dict(),
            'memory_usage_mb': final_df.memory_usage(deep=True).sum() / 1024**2
        }
        
        summary_path = processed_dir / "data_summary.json"
        import json
        with open(summary_path, 'w') as f:
            json.dump(summary, f, indent=2, default=str)
        
        print(f"📋 Data summary saved: {summary_path}")
        
    except Exception as e:
        print(f"❌ Error saving cleaned data: {str(e)}")
        print(f"🔧 Error type: {type(e).__name__}")
        
else:
    print("⚠️  Skipping data save - No data available")
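
After saving, it can be worth confirming that the pickle reloads to an identical frame. The sketch below does this on a toy frame in a temporary directory. The toy data is hypothetical, and `to_pickle` / `read_pickle` are used here as the pandas equivalents of the `pickle.dump` / `pickle.load` calls in the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for final_df (hypothetical values)
toy = pd.DataFrame({"Institute": ["A", "B"], "Closing Rank": [120, 4500]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cleaned_data.pkl"
    toy.to_pickle(path)
    reloaded = pd.read_pickle(path)

# Raises AssertionError if values, dtypes, or index differ
pd.testing.assert_frame_equal(toy, reloaded)
```

The pickle preserves dtypes exactly; the CSV copy does not, so a frame reloaded from `cleaned_data.csv` may need its dtypes restored before modeling.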

## 7. Exploratory Data Analysis: Basic Statistics {#basic-stats}

Let's examine the statistical properties of our cleaned dataset to understand the data distribution.

In [None]:
# Only proceed if data is available
if DATA_AVAILABLE and final_df is not None:
    print("📊 BASIC STATISTICS ANALYSIS")
    print("=" * 50)
    
    # Separate numeric and categorical columns
    numeric_cols = final_df.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = final_df.select_dtypes(include=['object', 'category']).columns.tolist()
    
    print(f"🔢 Numeric columns ({len(numeric_cols)}): {numeric_cols}")
    print(f"📝 Categorical columns ({len(categorical_cols)}): {categorical_cols}")
    
    # Numeric statistics
    if numeric_cols:
        print(f"\n📊 NUMERIC COLUMNS STATISTICS")
        print("-" * 40)
        display(final_df[numeric_cols].describe())
        
        # Additional statistics for numeric columns
        print(f"\n📈 ADDITIONAL NUMERIC STATISTICS")
        print("-" * 40)
        for col in numeric_cols:
            print(f"\n📊 {col}:")
            print(f"   Count: {final_df[col].count()}")
            print(f"   Mean: {final_df[col].mean():.2f}")
            print(f"   Median: {final_df[col].median():.2f}")
            print(f"   Mode: {final_df[col].mode().iloc[0] if not final_df[col].mode().empty else 'N/A'}")
            print(f"   Std Dev: {final_df[col].std():.2f}")
            print(f"   Variance: {final_df[col].var():.2f}")
            print(f"   Min: {final_df[col].min()}")
            print(f"   Max: {final_df[col].max()}")
            print(f"   Range: {final_df[col].max() - final_df[col].min()}")
            print(f"   Skewness: {final_df[col].skew():.2f}")
            print(f"   Kurtosis: {final_df[col].kurtosis():.2f}")
    
    # Categorical statistics
    if categorical_cols:
        print(f"\n📝 CATEGORICAL COLUMNS STATISTICS")
        print("-" * 40)
        
        for col in categorical_cols:
            print(f"\n📊 {col}:")
            print(f"   Unique values: {final_df[col].nunique()}")
            print(f"   Most frequent: {final_df[col].mode().iloc[0] if not final_df[col].mode().empty else 'N/A'}")
            print(f"   Top 5 values:")
            
            value_counts = final_df[col].value_counts().head()
            for idx, (value, count) in enumerate(value_counts.items(), 1):
                percentage = (count / len(final_df)) * 100
                print(f"     {idx}. {value}: {count} ({percentage:.1f}%)")
    
    # Data quality check
    print(f"\n🔍 DATA QUALITY CHECK")
    print("-" * 40)
    print(f"📊 Total records: {len(final_df)}")
    print(f"📊 Complete records: {len(final_df.dropna())}")
    print(f"📊 Records with missing values: {len(final_df) - len(final_df.dropna())}")
    print(f"📊 Duplicate records: {final_df.duplicated().sum()}")
    
    # Memory usage
    print(f"\n💾 MEMORY USAGE")
    print("-" * 40)
    memory_usage = final_df.memory_usage(deep=True)
    print(f"📊 Total memory usage: {memory_usage.sum() / 1024**2:.2f} MB")
    print(f"📊 Average per column: {memory_usage.mean() / 1024**2:.2f} MB")
    
else:
    print("⚠️  Skipping basic statistics analysis - No data available")
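
The per-column statistics printed above can also be collected into a single table with `DataFrame.agg`, which is easier to compare across columns or export. This is a minimal sketch on hypothetical toy data; the column names only mimic the rank columns the notebook looks for.

```python
import pandas as pd

# Toy numeric frame (hypothetical values)
toy = pd.DataFrame({
    "Opening Rank": [1, 2, 3, 4, 100],
    "Closing Rank": [10, 20, 30, 40, 50],
})

# One row per column, one column per statistic
stats = toy.agg(
    ["count", "mean", "median", "std", "var", "min", "max", "skew", "kurt"]
).T
stats["range"] = stats["max"] - stats["min"]
```

Each row of `stats` describes one numeric column, so the table can be shown with `display(stats)` in place of the loop of `print` calls.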

## 8. Exploratory Data Analysis: Visualizations {#visualizations}

Visual exploration helps us understand patterns, distributions, and relationships in the data.

In [None]:
# Only proceed if data is available
if DATA_AVAILABLE and final_df is not None:
    print("📊 DATA VISUALIZATIONS")
    print("=" * 50)
    
    # Get column information
    numeric_cols = final_df.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = final_df.select_dtypes(include=['object', 'category']).columns.tolist()
    
    # 1. Categorical Variables Distribution
    if categorical_cols:
        print("📊 Categorical Variables Distribution")
        
        # Determine subplot layout
        n_cats = len(categorical_cols)
        if n_cats <= 4:
            cols = 2
            rows = (n_cats + 1) // 2
        else:
            cols = 3
            rows = (n_cats + 2) // 3
        
        fig, axes = plt.subplots(rows, cols, figsize=(15, 5*rows))
        
        # Flatten the axes grid so it can be indexed with a single counter
        axes = np.atleast_1d(axes).ravel()
        
        for i, col in enumerate(categorical_cols):
            if i < len(axes):
                try:
                    # Get top 10 values to avoid cluttered plots
                    top_values = final_df[col].value_counts().head(10)
                    
                    if len(top_values) > 0:
                        top_values.plot(kind='bar', ax=axes[i], color='skyblue')
                        axes[i].set_title(f'{col} Distribution (Top 10)')
                        axes[i].set_xlabel(col)
                        axes[i].set_ylabel('Count')
                        axes[i].tick_params(axis='x', rotation=45)
                        
                        # Add value labels on bars
                        for j, (val, count) in enumerate(top_values.items()):
                            axes[i].text(j, count + max(top_values) * 0.01, str(count), 
                                       ha='center', va='bottom', fontsize=10)
                    else:
                        axes[i].text(0.5, 0.5, f'No data for {col}', 
                                   ha='center', va='center', transform=axes[i].transAxes)
                        axes[i].set_title(f'{col} - No Data')
                        
                except Exception as e:
                    axes[i].text(0.5, 0.5, f'Error plotting {col}', 
                               ha='center', va='center', transform=axes[i].transAxes)
                    print(f"Warning: Error plotting {col}: {e}")
        
        # Hide unused subplots
        for i in range(len(categorical_cols), len(axes)):
            axes[i].set_visible(False)
        
        plt.tight_layout()
        plt.show()
    
    # 2. Numeric Variables Distribution
    if numeric_cols:
        print("📊 Numeric Variables Distribution")
        
        n_nums = len(numeric_cols)
        if n_nums <= 4:
            cols = 2
            rows = (n_nums + 1) // 2
        else:
            cols = 3
            rows = (n_nums + 2) // 3
        
        fig, axes = plt.subplots(rows, cols, figsize=(15, 5*rows))
        
        # Flatten the axes grid so it can be indexed with a single counter
        axes = np.atleast_1d(axes).ravel()
        
        for i, col in enumerate(numeric_cols):
            if i < len(axes):
                try:
                    # Remove NaN values for plotting
                    data = final_df[col].dropna()
                    
                    if len(data) > 0:
                        # Create histogram
                        axes[i].hist(data, bins=30, alpha=0.7, color='lightcoral', edgecolor='black')
                        axes[i].set_title(f'{col} Distribution')
                        axes[i].set_xlabel(col)
                        axes[i].set_ylabel('Frequency')
                        
                        # Add statistics text
                        mean_val = data.mean()
                        median_val = data.median()
                        std_val = data.std()
                        
                        stats_text = f'Mean: {mean_val:.2f}\nMedian: {median_val:.2f}\nStd: {std_val:.2f}'
                        axes[i].text(0.02, 0.98, stats_text, transform=axes[i].transAxes, 
                                   verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
                    else:
                        axes[i].text(0.5, 0.5, f'No data for {col}', 
                                   ha='center', va='center', transform=axes[i].transAxes)
                        axes[i].set_title(f'{col} - No Data')
                        
                except Exception as e:
                    axes[i].text(0.5, 0.5, f'Error plotting {col}', 
                               ha='center', va='center', transform=axes[i].transAxes)
                    print(f"Warning: Error plotting {col}: {e}")
        
        # Hide unused subplots
        for i in range(len(numeric_cols), len(axes)):
            axes[i].set_visible(False)
        
        plt.tight_layout()
        plt.show()
    
    # 3. Box plots for numeric variables
    if numeric_cols and len(numeric_cols) > 1:
        print("📊 Box Plots for Numeric Variables")
        
        fig, ax = plt.subplots(1, 1, figsize=(12, 6))
        
        try:
            # Create box plot
            final_df[numeric_cols].plot(kind='box', ax=ax)
            ax.set_title('Box Plot of Numeric Variables')
            ax.set_ylabel('Values')
            plt.xticks(rotation=45)
            plt.tight_layout()
            plt.show()
            
        except Exception as e:
            print(f"Warning: Error creating box plot: {e}")
    
    print("✅ Visualizations completed successfully!")
    
else:
    print("⚠️  Skipping data visualizations - No data available")
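
The subplot layout rule appears twice in the cell above (two columns for up to four plots, three columns otherwise). A small helper would remove the duplication. This is a sketch only: the name `grid_shape` and its `max_cols` parameter are hypothetical.

```python
import math
from typing import Tuple


def grid_shape(n_plots: int, max_cols: int = 3) -> Tuple[int, int]:
    """Return (rows, cols) for n_plots panels: 2 columns up to 4 plots, else max_cols."""
    if n_plots < 1:
        raise ValueError("n_plots must be at least 1")
    cols = 2 if n_plots <= 4 else max_cols
    rows = math.ceil(n_plots / cols)
    return rows, cols
```

With the shape in hand, calling `plt.subplots(rows, cols, squeeze=False)` always returns a 2-D array of axes, so a single `.ravel()` covers every layout without special cases.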

## 9. Exploratory Data Analysis: Correlation Analysis {#correlation}

Understanding correlations between variables helps identify relationships and potential predictors.

In [None]:
# Only proceed if data is available
if DATA_AVAILABLE and final_df is not None:
    print("📊 CORRELATION ANALYSIS")
    print("=" * 50)
    
    # Get numeric columns for correlation analysis
    numeric_cols = final_df.select_dtypes(include=[np.number]).columns.tolist()
    
    if len(numeric_cols) > 1:
        print(f"🔢 Analyzing correlations for {len(numeric_cols)} numeric columns")
        
        # Calculate correlation matrix
        correlation_matrix = final_df[numeric_cols].corr()
        
        print("\n📊 CORRELATION MATRIX")
        print("-" * 40)
        display(correlation_matrix)
        
        # Create correlation heatmap
        plt.figure(figsize=(10, 8))
        
        # Create a mask for the upper triangle
        mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
        
        # Create heatmap
        sns.heatmap(correlation_matrix, 
                   annot=True, 
                   cmap='coolwarm', 
                   center=0,
                   square=True,
                   mask=mask,
                   fmt='.2f',
                   cbar_kws={'label': 'Correlation Coefficient'})
        
        plt.title('Correlation Matrix Heatmap', fontsize=16, fontweight='bold')
        plt.tight_layout()
        plt.show()
        
        # Find strong correlations (>0.7 or <-0.7)
        print("\n🔍 STRONG CORRELATIONS (|r| > 0.7)")
        print("-" * 40)
        
        strong_correlations = []
        for i in range(len(correlation_matrix.columns)):
            for j in range(i+1, len(correlation_matrix.columns)):
                corr_val = correlation_matrix.iloc[i, j]
                if abs(corr_val) > 0.7:
                    strong_correlations.append({
                        'Variable 1': correlation_matrix.columns[i],
                        'Variable 2': correlation_matrix.columns[j],
                        'Correlation': corr_val
                    })
        
        if strong_correlations:
            strong_corr_df = pd.DataFrame(strong_correlations)
            strong_corr_df = strong_corr_df.sort_values('Correlation', key=abs, ascending=False)
            display(strong_corr_df)
        else:
            print("No strong correlations found (|r| > 0.7)")
        
        # Find moderate correlations (0.3 < |r| < 0.7)
        print("\n🔍 MODERATE CORRELATIONS (0.3 < |r| < 0.7)")
        print("-" * 40)
        
        moderate_correlations = []
        for i in range(len(correlation_matrix.columns)):
            for j in range(i+1, len(correlation_matrix.columns)):
                corr_val = correlation_matrix.iloc[i, j]
                if 0.3 < abs(corr_val) < 0.7:
                    moderate_correlations.append({
                        'Variable 1': correlation_matrix.columns[i],
                        'Variable 2': correlation_matrix.columns[j],
                        'Correlation': corr_val
                    })
        
        if moderate_correlations:
            moderate_corr_df = pd.DataFrame(moderate_correlations)
            moderate_corr_df = moderate_corr_df.sort_values('Correlation', key=abs, ascending=False)
            display(moderate_corr_df.head(10))  # Show top 10
        else:
            print("No moderate correlations found (0.3 < |r| < 0.7)")
        
        # Correlation insights
        print("\n💡 CORRELATION INSIGHTS")
        print("-" * 40)
        
        # Check for multicollinearity
        high_corr_pairs = len([c for c in strong_correlations if abs(c['Correlation']) > 0.9])
        if high_corr_pairs > 0:
            print(f"⚠️  {high_corr_pairs} pairs with very high correlation (>0.9) - potential multicollinearity")
        
        # Summary statistics
        corr_values = correlation_matrix.values
        upper_triangle = corr_values[np.triu_indices_from(corr_values, k=1)]
        
        print(f"📊 Average correlation: {np.mean(np.abs(upper_triangle)):.3f}")
        print(f"📊 Maximum correlation: {np.max(np.abs(upper_triangle)):.3f}")
        print(f"📊 Minimum correlation: {np.min(np.abs(upper_triangle)):.3f}")
        print(f"📊 Standard deviation: {np.std(upper_triangle):.3f}")
        
    elif len(numeric_cols) == 1:
        print(f"⚠️  Only one numeric column found: {numeric_cols[0]}")
        print("Cannot perform correlation analysis with single variable")
    else:
        print("⚠️  No numeric columns found for correlation analysis")
        print("Available columns:")
        for col in final_df.columns:
            print(f"   - {col}: {final_df[col].dtype}")
    
else:
    print("⚠️  Skipping correlation analysis - No data available")
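
The two nested loops that collect strong and moderate pairs can be replaced by one vectorized helper built on `np.triu_indices_from`. This is a sketch under assumptions: the name `correlation_pairs` is hypothetical, and it uses a half-open band `(low, high]`, so a value of exactly 0.7 counts as moderate, whereas the loops above place it in neither group.

```python
import numpy as np
import pandas as pd


def correlation_pairs(corr: pd.DataFrame, low: float, high: float = 1.0) -> pd.DataFrame:
    """List each unordered column pair whose |r| lies in (low, high]."""
    # Indices of the strict upper triangle: every pair once, no diagonal
    rows, cols = np.triu_indices_from(corr.values, k=1)
    pairs = pd.DataFrame({
        "Variable 1": corr.index[rows],
        "Variable 2": corr.columns[cols],
        "Correlation": corr.values[rows, cols],
    })
    strength = pairs["Correlation"].abs()
    selected = pairs[(strength > low) & (strength <= high)]
    return selected.sort_values("Correlation", key=abs, ascending=False).reset_index(drop=True)


# Toy data (hypothetical): a and b are perfectly correlated, c correlates at 0.5 with both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [3, 1, 4, 2, 5],
})
corr = toy.corr()
strong = correlation_pairs(corr, 0.7)
moderate = correlation_pairs(corr, 0.3, 0.7)
```

In the notebook this would be called with `correlation_matrix` in place of `corr`, once per band.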

## 10. Feature Engineering: Define Features and Targets {#define-features}

Now let's prepare our data for machine learning by defining features (input variables) and targets (output variables).

In [None]:
# Only proceed if data is available
if DATA_AVAILABLE and final_df is not None:
    print("🎯 FEATURE ENGINEERING: DEFINE FEATURES AND TARGETS")
    print("=" * 60)
    
    # Define potential feature columns (input variables)
    potential_features = [
        'Opening Rank', 'opening_rank', 'OpeningRank',
        'Closing Rank', 'closing_rank', 'ClosingRank',
        'Gender', 'gender',
        'Seat Type', 'seat_type', 'SeatType',
        'Category', 'category',
        'State', 'state',
        'Quota', 'quota',
        'Branch', 'branch',
        'year'
    ]
    
    # Find available feature columns
    available_features = []
    for col in potential_features:
        if col in final_df.columns:
            available_features.append(col)
    
    print(f"📋 Available columns in dataset: {final_df.columns.tolist()}")
    print(f"🎯 Potential feature columns found: {available_features}")
    
    # Define potential target columns (output variables)
    potential_targets = [
        'Institute', 'institute',
        'round', 'Round',
        'Branch', 'branch',
        'admission_status', 'status'
    ]
    
    # Find available target columns
    available_targets = []
    for col in potential_targets:
        if col in final_df.columns:
            available_targets.append(col)
    
    print(f"🎯 Potential target columns found: {available_targets}")
    
    # Select features and targets based on availability
    if available_features and available_targets:
        # Prioritize key features
        selected_features = []
        if 'Opening Rank' in available_features:
            selected_features.append('Opening Rank')
        elif 'opening_rank' in available_features:
            selected_features.append('opening_rank')

        if 'Gender' in available_features:
            selected_features.append('Gender')
        elif 'gender' in available_features:
            selected_features.append('gender')

        if 'Seat Type' in available_features:
            selected_features.append('Seat Type')
        elif 'seat_type' in available_features:
            selected_features.append('seat_type')

        # Add other available features
        for feature in available_features:
            if feature not in selected_features:
                selected_features.append(feature)

        # Select targets
        selected_targets = []
        if 'Institute' in available_targets:
            selected_targets.append('Institute')
        elif 'institute' in available_targets:
            selected_targets.append('institute')

        if 'round' in available_targets:
            selected_targets.append('round')
        elif 'Round' in available_targets:
            selected_targets.append('Round')

        # Add other available targets
        for target in available_targets:
            if target not in selected_targets:
                selected_targets.append(target)

        # Remove any targets that are also in features
        selected_features = [f for f in selected_features if f not in selected_targets]

        # Create feature and target datasets
        try:
            # Filter out rows with missing values in key columns
            key_columns = selected_features + selected_targets
            clean_data = final_df[key_columns].dropna()

            # Copy so that later in-place encoding of y does not write into a slice of clean_data
            X = clean_data[selected_features].copy()
            y = clean_data[selected_targets].copy()

            print(f"\n✅ FEATURE AND TARGET SELECTION SUCCESSFUL")
            print(f"📊 Features selected: {selected_features}")
            print(f"📊 Targets selected: {selected_targets}")
            print(f"📊 Features shape: {X.shape}")
            print(f"📊 Targets shape: {y.shape}")
            print(f"📊 Clean data shape: {clean_data.shape}")

            # Show feature information
            print(f"\n📋 FEATURE INFORMATION")
            print("-" * 30)
            for i, feature in enumerate(selected_features, 1):
                dtype = X[feature].dtype
                unique_vals = X[feature].nunique()
                print(f"{i:2d}. {feature}: {dtype} ({unique_vals} unique values)")

            # Show target information
            print(f"\n🎯 TARGET INFORMATION")
            print("-" * 30)
            for i, target in enumerate(selected_targets, 1):
                dtype = y[target].dtype
                unique_vals = y[target].nunique()
                print(f"{i:2d}. {target}: {dtype} ({unique_vals} unique values)")

            # Sample data preview
            print(f"\n👀 SAMPLE DATA PREVIEW")
            print("-" * 30)
            print("Features (X):")
            display(X.head())
            print("\nTargets (y):")
            display(y.head())

            # Set flags for next steps
            FEATURES_DEFINED = True
            print(f"\n✅ Features and targets defined successfully!")

        except Exception as e:
            print(f"❌ Error creating feature/target datasets: {str(e)}")
            FEATURES_DEFINED = False
            X, y = None, None

    else:
        print("❌ Could not identify suitable features and targets")
        print("📝 Please check your data structure and column names")
        FEATURES_DEFINED = False
        X, y = None, None

else:
    print("⚠️  Skipping feature definition - No data available")
    FEATURES_DEFINED = False
    X, y = None, None
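
The alias lookup in the cell above (trying `'Opening Rank'`, then `'opening_rank'`, and so on) can be expressed more compactly. The following is a minimal sketch, not the notebook's own method: `first_present` and the `feature_aliases` mapping are hypothetical names introduced for illustration, and the toy frame only stands in for `final_df`.

```python
import pandas as pd

def first_present(columns, aliases):
    """Return the first alias found in `columns`, or None if none match."""
    for name in aliases:
        if name in columns:
            return name
    return None

# Toy frame standing in for final_df (column names and values are illustrative)
toy_df = pd.DataFrame({
    "opening_rank": [120, 450, 980],
    "Gender": ["Gender-Neutral", "Female-only", "Gender-Neutral"],
    "institute": ["A", "B", "A"],
})

# One entry per logical feature, listing the spellings to try in priority order
feature_aliases = {
    "opening_rank": ["Opening Rank", "opening_rank", "OpeningRank"],
    "gender": ["Gender", "gender"],
    "seat_type": ["Seat Type", "seat_type", "SeatType"],
}

resolved = {
    key: first_present(toy_df.columns, names)
    for key, names in feature_aliases.items()
}
selected = [name for name in resolved.values() if name is not None]
print(selected)
```

Keeping one alias list per logical feature also avoids selecting two spellings of the same column if both happen to be present.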

## 11. Feature Engineering: Encode Target Variables {#encode-targets}

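To keep every target column in a uniform numeric form, we encode categorical targets as integer class labels with `LabelEncoder`. The fitted encoders are stored in `target_encoders` so that predictions can later be mapped back to the original names.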

In [None]:
# Only proceed if features are defined
if FEATURES_DEFINED and X is not None and y is not None:
    print("🔢 ENCODING TARGET VARIABLES")
    print("=" * 50)
    
    # Store original target data for reference
    y_original = y.copy()
    
    # Dictionary to store encoders for each target
    target_encoders = {}

    # Process each target column
    for target_col in y.columns:
        print(f"\n📊 Processing target: {target_col}")

        # Check if target is already numeric
        if pd.api.types.is_numeric_dtype(y[target_col]):
            print(f"   ✅ Already numeric: {y[target_col].dtype}")
            print(f"   📊 Unique values: {y[target_col].nunique()}")
            print(f"   📊 Range: {y[target_col].min()} to {y[target_col].max()}")
            target_encoders[target_col] = None  # No encoding needed
        else:
            print(f"   🔧 Encoding categorical target: {y[target_col].dtype}")
            print(f"   📊 Unique values before encoding: {y[target_col].nunique()}")

            # Show top categories
            top_categories = y[target_col].value_counts().head(10)
            print(f"   📋 Top categories:")
            for cat, count in top_categories.items():
                print(f"      - {cat}: {count}")

            # Create and fit label encoder
            le = LabelEncoder()
            y[target_col] = le.fit_transform(y[target_col])
            target_encoders[target_col] = le

            print(f"   ✅ Encoded successfully!")
            print(f"   📊 Unique values after encoding: {y[target_col].nunique()}")
            print(f"   📊 Encoded range: {y[target_col].min()} to {y[target_col].max()}")

            # Show mapping for first few categories
            print(f"   📋 Encoding mapping (first 10):")
            for i, class_name in enumerate(le.classes_[:10]):
                print(f"      {class_name} → {i}")

            if len(le.classes_) > 10:
                print(f"      ... and {len(le.classes_) - 10} more classes")

    print(f"\n✅ TARGET ENCODING COMPLETED")
    print(f"📊 Encoded targets shape: {y.shape}")
    print(f"📊 All target columns are now numeric")

    # Display encoded target information
    print(f"\n📋 ENCODED TARGET SUMMARY")
    print("-" * 40)
    for target_col in y.columns:
        print(f"📊 {target_col}:")
        print(f"   Data type: {y[target_col].dtype}")
        print(f"   Unique values: {y[target_col].nunique()}")
        print(f"   Min value: {y[target_col].min()}")
        print(f"   Max value: {y[target_col].max()}")
        print(f"   Mean: {y[target_col].mean():.2f}")
        print(f"   Std: {y[target_col].std():.2f}")

    # Show sample of encoded data
    print(f"\n👀 ENCODED TARGET SAMPLE")
    print("-" * 40)
    display(y.head(10))

    # Compare with original
    print(f"\n🔄 COMPARISON WITH ORIGINAL")
    print("-" * 40)
    comparison_df = pd.DataFrame()
    for target_col in y.columns:
        comparison_df[f'{target_col}_original'] = y_original[target_col].head(10)
        comparison_df[f'{target_col}_encoded'] = y[target_col].head(10)

    display(comparison_df)

    # Set flag for next steps
    TARGETS_ENCODED = True
    print(f"\n🎯 Target encoding completed successfully!")

else:
    print("⚠️  Skipping target encoding - Features not defined")
    TARGETS_ENCODED = False
    target_encoders = {}
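
Because the encoders are kept in `target_encoders`, encoded labels can be turned back into names. The snippet below is a small sketch of that round trip with `LabelEncoder`; the institute names are placeholder strings chosen for illustration, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Placeholder target values (illustrative only)
institutes = np.array(["IIT Bombay", "NIT Trichy", "IIT Bombay", "IIIT Hyderabad"])

le = LabelEncoder()
codes = le.fit_transform(institutes)   # classes are sorted, then numbered from 0
decoded = le.inverse_transform(codes)  # maps integer codes back to the original strings

for name, code in zip(institutes, codes):
    print(f"{name} -> {code}")
```

The integer codes follow the sorted order of the class names, so they carry no ranking information; they are identifiers only.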

## 12. Feature Engineering: Define Feature Types {#feature-types}

We need to categorize our features as numerical or categorical for appropriate preprocessing.

In [None]:
# Only proceed if targets are encoded
if TARGETS_ENCODED and X is not None:
    print("🏷️ DEFINING FEATURE TYPES")
    print("=" * 50)
    
    # Identify categorical and numerical features
    categorical_features = []
    numerical_features = []

    for col in X.columns:
        if pd.api.types.is_numeric_dtype(X[col]):
            numerical_features.append(col)
        else:
            categorical_features.append(col)

    print(f"📊 FEATURE TYPE CLASSIFICATION")
    print(f"🔢 Numerical features ({len(numerical_features)}): {numerical_features}")
    print(f"📝 Categorical features ({len(categorical_features)}): {categorical_features}")

    # Show detailed information for each feature type
    if numerical_features:
        print(f"\n📊 NUMERICAL FEATURES DETAILS")
        print("-" * 40)
        for feature in numerical_features:
            print(f"🔢 {feature}:")
            print(f"   Data type: {X[feature].dtype}")
            print(f"   Unique values: {X[feature].nunique()}")
            print(f"   Range: {X[feature].min()} to {X[feature].max()}")
            print(f"   Mean: {X[feature].mean():.2f}")
            print(f"   Std: {X[feature].std():.2f}")
            print(f"   Missing values: {X[feature].isnull().sum()}")
            print()

    if categorical_features:
        print(f"\n📝 CATEGORICAL FEATURES DETAILS")
        print("-" * 40)
        for feature in categorical_features:
            print(f"📝 {feature}:")
            print(f"   Data type: {X[feature].dtype}")
            print(f"   Unique values: {X[feature].nunique()}")
            print(f"   Missing values: {X[feature].isnull().sum()}")
            print(f"   Top 5 categories:")

            top_categories = X[feature].value_counts().head(5)
            for cat, count in top_categories.items():
                percentage = (count / len(X)) * 100
                print(f"      {cat}: {count} ({percentage:.1f}%)")
            print()

    # Check for potential issues
    print(f"📋 FEATURE QUALITY CHECK")
    print("-" * 40)

    # Check for high cardinality categorical features
    high_cardinality = []
    for feature in categorical_features:
        cardinality = X[feature].nunique()
        if cardinality > 50:  # Threshold for high cardinality
            high_cardinality.append((feature, cardinality))

    if high_cardinality:
        print("⚠️  High cardinality categorical features detected:")
        for feature, cardinality in high_cardinality:
            print(f"   - {feature}: {cardinality} unique values")
        print("   Consider grouping rare categories or using target encoding")
    else:
        print("✅ No high cardinality issues detected")

    # Check for low variance numerical features
    low_variance = []
    for feature in numerical_features:
        if X[feature].std() < 0.1:  # Very low standard deviation
            low_variance.append((feature, X[feature].std()))

    if low_variance:
        print("⚠️  Low variance numerical features detected:")
        for feature, std in low_variance:
            print(f"   - {feature}: std = {std:.4f}")
        print("   Consider removing or transforming these features")
    else:
        print("✅ No low variance issues detected")

    # Set flag for next steps
    FEATURE_TYPES_DEFINED = True
    print(f"\n🎯 Feature types defined successfully!")
    print(f"✅ Ready for preprocessing pipeline creation")

else:
    print("⚠️  Skipping feature type definition - Targets not encoded")
    FEATURE_TYPES_DEFINED = False
    categorical_features = []
    numerical_features = []
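
The quality check above suggests grouping rare categories when a feature has high cardinality. One way this could look is sketched below; `group_rare`, its `min_count` threshold, and the `"Other"` label are assumptions made for this example rather than part of the notebook.

```python
import pandas as pd

def group_rare(series, min_count, other_label="Other"):
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; substitute the label everywhere else
    return series.where(~series.isin(rare), other_label)

# Illustrative branch names, not taken from the dataset
branches = pd.Series(["CSE", "CSE", "CSE", "ECE", "ECE", "Mining", "Ceramics"])
grouped = group_rare(branches, min_count=2)
print(grouped.value_counts())
```

Grouping before one-hot encoding keeps the number of generated columns manageable, at the cost of losing the distinction between the merged categories.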

## 13. Model Training: Create Preprocessing Pipeline {#preprocessing}

Let's create a preprocessing pipeline that handles both numerical and categorical features appropriately.

In [None]:
# Only proceed if feature types are defined
if FEATURE_TYPES_DEFINED and X is not None and y is not None:
    print("🔧 CREATING PREPROCESSING PIPELINE")
    print("=" * 50)
    
    # Create preprocessing steps for numerical features
    numerical_transformer = StandardScaler()

    # Create preprocessing steps for categorical features
    categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

    # Create the column transformer
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_features),
            ('cat', categorical_transformer, categorical_features)
        ],
        remainder='passthrough'  # Keep any remaining columns unchanged
    )

    print(f"📊 PREPROCESSING PIPELINE CREATED")
    print(f"🔢 Numerical features ({len(numerical_features)}): {numerical_features}")
    print(f"   Transformation: StandardScaler (mean=0, std=1)")
    print(f"📝 Categorical features ({len(categorical_features)}): {categorical_features}")
    print(f"   Transformation: OneHotEncoder (binary encoding)")

    # Estimate the output dimensionality
    print(f"\n📊 ESTIMATED OUTPUT DIMENSIONS")
    print("-" * 30)

    # Calculate numerical features dimension
    num_dims = len(numerical_features)
    print(f"🔢 Numerical features: {num_dims} dimensions")

    # Calculate categorical features dimension
    cat_dims = 0
    for feature in categorical_features:
        unique_vals = X[feature].nunique()
        cat_dims += unique_vals
        print(f"📝 {feature}: {unique_vals} categories → {unique_vals} dimensions")

    total_dims = num_dims + cat_dims
    print(f"📊 Total estimated dimensions: {total_dims}")

    # Create the model based on the number of targets
    if len(y.columns) == 1:
        # Single target - use regular RandomForestClassifier
        model = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
        ])
        print(f"\n🤖 MODEL CREATED: Single-target RandomForestClassifier")
    else:
        # Multiple targets - use MultiOutputClassifier
        model = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('classifier', MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=42)))
        ])
        print(f"\n🤖 MODEL CREATED: Multi-target RandomForestClassifier")

    print(f"📊 Target columns: {y.columns.tolist()}")
    print(f"📊 Number of targets: {len(y.columns)}")

    # Model parameters
    print(f"\n⚙️ MODEL PARAMETERS")
    print("-" * 30)
    print(f"Algorithm: Random Forest")
    print(f"Number of estimators: 100")
    print(f"Random state: 42")
    print(f"Multi-output: {'Yes' if len(y.columns) > 1 else 'No'}")

    # Set flag for next steps
    PIPELINE_CREATED = True
    print(f"\n✅ Preprocessing pipeline created successfully!")

else:
    print("⚠️  Skipping pipeline creation - Feature types not defined")
    PIPELINE_CREATED = False
    model = None
    preprocessor = None

## 14. Conclusions and Recommendations {#conclusions}

Let's summarize our findings and provide recommendations for the JEE College Prediction project.

In [None]:
print("📝 JEE COLLEGE PREDICTION - ANALYSIS SUMMARY")
print("=" * 60)

# Summary of the analysis
if DATA_AVAILABLE:
    print(f"✅ Data Analysis Completed Successfully!")
    print(f"📊 Dataset Shape: {final_df.shape}")
    
    if FEATURES_DEFINED:
        print(f"🎯 Features Defined: {len(selected_features)} features")
        print(f"🎯 Targets Defined: {len(selected_targets)} targets")
    
    if TARGETS_ENCODED:
        print(f"🔢 Target Encoding: Completed")
    
    if FEATURE_TYPES_DEFINED:
        print(f"🏷️ Feature Types: {len(numerical_features)} numerical, {len(categorical_features)} categorical")
    
    if PIPELINE_CREATED:
        print(f"🔧 ML Pipeline: Created and ready for training")
    
    print(f"\n🎯 KEY FINDINGS:")
    print(f"=" * 40)
    print(f"📊 Data Quality: {'Good' if final_df.isnull().sum().sum() == 0 else 'Needs attention'}")
    print(f"🔢 Numerical Features: {numerical_features if 'numerical_features' in locals() else 'Not defined'}")
    print(f"📝 Categorical Features: {categorical_features if 'categorical_features' in locals() else 'Not defined'}")
    
else:
    print("❌ Data analysis incomplete - No data available")

print(f"\n💡 RECOMMENDATIONS:")
print(f"=" * 40)
print(f"1. 📊 Data Collection: Ensure comprehensive data collection for better predictions")
print(f"2. 🔍 Feature Engineering: Consider creating additional features like:")
print(f"   - Rank percentiles")
print(f"   - Historical admission trends")
print(f"   - Institute rankings")
print(f"   - Branch popularity scores")
print(f"3. 🤖 Model Improvement: Experiment with different algorithms:")
print(f"   - XGBoost for better performance")
print(f"   - Neural Networks for complex patterns")
print(f"   - Ensemble methods for robust predictions")
print(f"4. ✅ Model Validation: Implement cross-validation and time-series validation")
print(f"5. 🚀 Deployment: Create a web application or API for real-time predictions")
print(f"6. 📈 Monitoring: Set up model performance monitoring and retraining")

print(f"\n📋 NEXT STEPS:")
print(f"=" * 40)
print(f"1. 🏋️ Complete model training with train/test split")
print(f"2. 📊 Evaluate model performance using appropriate metrics")
print(f"3. 🔍 Analyze feature importance and model interpretability")
print(f"4. 💾 Save the trained model for deployment")
print(f"5. 📚 Create comprehensive documentation")
print(f"6. 🧪 Set up automated testing and validation")
print(f"7. 🌐 Deploy the model as a web service")

print(f"\n📋 DATA SCIENCE BEST PRACTICES FOLLOWED:")
print(f"=" * 40)
print(f"✅ Comprehensive data exploration and analysis")
print(f"✅ Proper handling of missing values")
print(f"✅ Robust data cleaning and preprocessing")
print(f"✅ Feature engineering and selection")
print(f"✅ Target encoding for categorical variables")
print(f"✅ Scalable preprocessing pipeline")
print(f"✅ Error handling and validation")
print(f"✅ Clear documentation and visualization")

print(f"\n🎯 PROJECT STATUS:")
print(f"=" * 40)
if DATA_AVAILABLE and FEATURES_DEFINED and TARGETS_ENCODED and FEATURE_TYPES_DEFINED and PIPELINE_CREATED:
    print(f"✅ READY FOR MODEL TRAINING AND EVALUATION")
    print(f"📈 All preprocessing steps completed successfully")
    print(f"🚀 Pipeline is ready to be fitted on training data")
else:
    print(f"⚠️  INCOMPLETE - Some steps need attention")
    print(f"📝 Please review the analysis and address any issues")

print(f"\n" + "="*60)
print(f"📊 JEE COLLEGE PREDICTION - ANALYSIS COMPLETE")
print(f"🎯 Thank you for using this comprehensive analysis notebook!")
print(f"="*60)
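
The first two next steps (a train/test split followed by evaluation) could look roughly like the sketch below. It runs on synthetic data generated inside the snippet, with an invented rule linking rank to institute, so the accuracy it prints says nothing about the real dataset; the column names and parameter values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for X and y (not real admission data)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Opening Rank": rng.integers(1, 10000, size=n),
    "Quota": rng.choice(["AI", "HS", "OS"], size=n),
})
# Invented rule so the toy problem is learnable
toy["Institute"] = np.where(toy["Opening Rank"] < 5000, "Institute A", "Institute B")

X_toy = toy[["Opening Rank", "Quota"]]
y_toy = toy["Institute"]

# Hold out 20% of rows; stratify keeps class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

model = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["Opening Rank"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Quota"]),
    ])),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Preprocessing is fitted on the training rows only, then reused for the test rows
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy on synthetic data: {accuracy:.3f}")
```

Wrapping the preprocessor and classifier in one `Pipeline` means the scaler and encoder never see the held-out rows during fitting, which avoids leaking test information into training.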