# JEE College Prediction - Data Analysis and Exploration

This notebook loads, cleans, and explores the data for the JEE College Prediction project, then trains and evaluates a Random Forest model that predicts institute and round.

## Table of Contents
1. [Data Loading and Overview](#data-loading)
2. [Data Cleaning](#data-cleaning)
3. [Exploratory Data Analysis](#eda)
4. [Feature Engineering](#feature-engineering)
5. [Model Training](#model-training)
6. [Model Evaluation](#model-evaluation)
7. [Conclusions](#conclusions)

---

## 1. Data Loading and Overview

Let's start by loading the necessary libraries and exploring our dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, classification_report
import pickle
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")

In [None]:
# Load the dataset
try:
    # Use a context manager so the file handle is always closed
    with open("../data/raw/data_v1.pkl", "rb") as f:
        final_df = pickle.load(f)
    print("Data loaded successfully!")
    print(f"Dataset shape: {final_df.shape}")
    print(f"Columns: {final_df.columns.tolist()}")
except FileNotFoundError:
    print("Data file not found. Please ensure the data file exists in the correct location.")
    print("Creating sample data for demonstration...")
    
    # Create sample data if file doesn't exist
    np.random.seed(42)
    sample_size = 1000
    
    final_df = pd.DataFrame({
        'Institute': np.random.choice(['IIT Delhi', 'IIT Bombay', 'IIT Madras', 'IIT Kanpur', 'IIT Kharagpur', 
                                      'IIT Roorkee', 'IIT Guwahati', 'IIT Hyderabad', 'IIT Indore', 'IIT Mandi'], 
                                     sample_size),
        'Opening Rank': np.random.randint(1, 50000, sample_size),
        'Closing Rank': np.random.randint(1, 50000, sample_size),
        'Gender': np.random.choice(['Male', 'Female', None], sample_size, p=[0.6, 0.35, 0.05]),
        'Seat Type': np.random.choice(['Open', 'SC', 'ST', 'OBC'], sample_size, p=[0.5, 0.2, 0.1, 0.2]),
        'round': np.random.randint(1, 7, sample_size),
        'year': np.random.randint(2016, 2024, sample_size)
    })
    
    # Ensure Opening Rank < Closing Rank
    final_df['Closing Rank'] = final_df['Opening Rank'] + np.random.randint(1, 1000, sample_size)
    
    print(f"Sample data created with shape: {final_df.shape}")

In [None]:
# Display basic information about the dataset
print("Dataset Info:")
final_df.info()  # info() prints directly and returns None, so no print() wrapper
print("\nFirst few rows:")
display(final_df.head())

## 2. Data Cleaning

Now let's clean our data to prepare it for analysis.

In [None]:
# Check for missing values
print("Missing values per column:")
missing_values = final_df.isnull().sum()
print(missing_values[missing_values > 0])
print(f"\nTotal missing values: {final_df.isnull().sum().sum()}")

# Calculate percentage of missing values
missing_percentage = (final_df.isnull().sum() / len(final_df)) * 100
print("\nMissing values percentage:")
print(missing_percentage[missing_percentage > 0])
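
The counts and percentages printed above can also be combined into a single table. A minimal sketch, assuming a hypothetical helper `missing_summary` (not part of the project) and a small toy frame standing in for `final_df`:

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for final_df (illustrative values only)
toy_df = pd.DataFrame({
    "Institute": ["IIT Delhi", None, "IIT Bombay", "IIT Madras"],
    "Gender": ["Male", None, None, "Female"],
    "Opening Rank": [10, 20, 30, 40],
})
print(missing_summary(toy_df))
```

On the real data the call would be `missing_summary(final_df)`.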

In [None]:
# Remove rows with missing Institute information
print(f"Before removing missing Institute rows: {len(final_df)}")
final_df = final_df.dropna(subset=["Institute"])
print(f"After removing missing Institute rows: {len(final_df)}")

In [None]:
# Fill missing Gender values with "Neutral"
final_df["Gender"] = final_df["Gender"].fillna("Neutral")
print("Gender value counts after filling missing values:")
print(final_df["Gender"].value_counts())

In [None]:
# Function to clean rank data
def clean_rank(value):
    """
    Clean rank data by converting various formats to integer.
    """
    if pd.isna(value):
        return np.nan
    
    try:
        return int(float(value))
    except (ValueError, TypeError, OverflowError):
        # OverflowError covers infinite values, which int() cannot convert.
        # Handle cases where rank ends with characters like 'K', 'L', etc.
        if isinstance(value, str) and len(value) > 1 and value[:-1].isdigit():
            return int(value[:-1])
        return np.nan

# Apply cleaning to rank columns
if 'Opening Rank' in final_df.columns:
    final_df['Opening Rank'] = final_df['Opening Rank'].apply(clean_rank)
    print("Opening Rank column cleaned successfully!")
    print(f"Opening Rank - Min: {final_df['Opening Rank'].min()}, Max: {final_df['Opening Rank'].max()}")

if 'Closing Rank' in final_df.columns:
    final_df['Closing Rank'] = final_df['Closing Rank'].apply(clean_rank)
    print("Closing Rank column cleaned successfully!")
    print(f"Closing Rank - Min: {final_df['Closing Rank'].min()}, Max: {final_df['Closing Rank'].max()}")
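
For larger frames, roughly the same rule as `clean_rank` can be applied to a whole column at once. The sketch below is a vectorized variant, not a drop-in replacement: the name `clean_rank_series` is hypothetical, the toy values are made up, and unlike `clean_rank` it keeps fractional values instead of truncating them.

```python
import pandas as pd


def clean_rank_series(ranks: pd.Series) -> pd.Series:
    """Strip one trailing letter from each value, then coerce to numbers."""
    as_text = ranks.astype(str).str.strip()
    # Drop a single trailing letter, e.g. the "K" in "123K"
    as_text = as_text.str.replace(r"[A-Za-z]$", "", regex=True)
    # Anything that is still not numeric becomes NaN
    return pd.to_numeric(as_text, errors="coerce")


# Toy input covering ints, numeric strings, a suffixed rank, junk, and a missing value
raw = pd.Series([15, "200", "123K", "abc", None, 42.0])
cleaned = clean_rank_series(raw)
print(cleaned)
```

Rows that come back as NaN can then be dropped with `dropna`, as in the next cell.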

In [None]:
# Remove rows with invalid rank data
initial_rows = len(final_df)
final_df = final_df.dropna(subset=['Opening Rank', 'Closing Rank'])
print(f"Removed {initial_rows - len(final_df)} rows with invalid rank data")
print(f"Final dataset shape: {final_df.shape}")
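
Once both rank columns are numeric, it may be worth checking that no row has an opening rank larger than its closing rank. This is an assumed invariant (the sample-data cell builds its data that way), and the frame below is a toy stand-in for `final_df`:

```python
import pandas as pd

# Toy frame standing in for final_df (illustrative values only)
toy_df = pd.DataFrame({
    "Institute": ["IIT Delhi", "IIT Bombay", "IIT Madras"],
    "Opening Rank": [100, 900, 50],
    "Closing Rank": [250, 400, 50],
})

# Flag rows where the opening rank exceeds the closing rank
inverted = toy_df["Opening Rank"] > toy_df["Closing Rank"]
print(f"Rows with Opening Rank > Closing Rank: {inverted.sum()}")
print(toy_df[inverted])
```

Whether such rows should be dropped, swapped, or investigated depends on how the raw data was collected.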

In [None]:
# Save cleaned data
import os
os.makedirs("../data/processed", exist_ok=True)

with open("../data/processed/data_v2.pkl", "wb") as f:
    pickle.dump(final_df, f)
print("Cleaned data saved successfully!")

## 3. Exploratory Data Analysis

Let's explore our cleaned dataset to understand the patterns and relationships.

In [None]:
# Basic statistics
print("Dataset Statistics:")
display(final_df.describe())

In [None]:
# Distribution of categorical variables
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Gender distribution
final_df['Gender'].value_counts().plot(kind='bar', ax=axes[0,0], color='skyblue')
axes[0,0].set_title('Gender Distribution', fontsize=14, fontweight='bold')
axes[0,0].set_xlabel('Gender')
axes[0,0].set_ylabel('Count')
axes[0,0].tick_params(axis='x', rotation=45)

# Seat Type distribution
final_df['Seat Type'].value_counts().plot(kind='bar', ax=axes[0,1], color='lightgreen')
axes[0,1].set_title('Seat Type Distribution', fontsize=14, fontweight='bold')
axes[0,1].set_xlabel('Seat Type')
axes[0,1].set_ylabel('Count')
axes[0,1].tick_params(axis='x', rotation=45)

# Round distribution
final_df['round'].value_counts().sort_index().plot(kind='bar', ax=axes[1,0], color='lightcoral')
axes[1,0].set_title('Round Distribution', fontsize=14, fontweight='bold')
axes[1,0].set_xlabel('Round')
axes[1,0].set_ylabel('Count')
axes[1,0].tick_params(axis='x', rotation=0)

# Top 10 Institutes
final_df['Institute'].value_counts().head(10).plot(kind='bar', ax=axes[1,1], color='gold')
axes[1,1].set_title('Top 10 Institutes', fontsize=14, fontweight='bold')
axes[1,1].set_xlabel('Institute')
axes[1,1].set_ylabel('Count')
axes[1,1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Rank distributions
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Opening Rank distribution
axes[0].hist(final_df['Opening Rank'].dropna(), bins=50, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].set_title('Opening Rank Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Opening Rank')
axes[0].set_ylabel('Frequency')
axes[0].grid(True, alpha=0.3)

# Closing Rank distribution
axes[1].hist(final_df['Closing Rank'].dropna(), bins=50, alpha=0.7, color='lightcoral', edgecolor='black')
axes[1].set_title('Closing Rank Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Closing Rank')
axes[1].set_ylabel('Frequency')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Correlation analysis
numeric_cols = final_df.select_dtypes(include=[np.number]).columns
print(f"Numeric columns: {numeric_cols.tolist()}")

if len(numeric_cols) > 1:
    correlation_matrix = final_df[numeric_cols].corr()
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
                square=True, fmt='.2f', cbar_kws={'label': 'Correlation Coefficient'})
    plt.title('Correlation Matrix', fontsize=16, fontweight='bold')
    plt.tight_layout()
    plt.show()
else:
    print("Not enough numeric columns for correlation analysis")

In [None]:
# Box plots for rank distributions by seat type
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Opening Rank by Seat Type
final_df.boxplot(column='Opening Rank', by='Seat Type', ax=axes[0])
axes[0].set_title('Opening Rank Distribution by Seat Type')
axes[0].set_xlabel('Seat Type')
axes[0].set_ylabel('Opening Rank')

# Closing Rank by Seat Type
final_df.boxplot(column='Closing Rank', by='Seat Type', ax=axes[1])
axes[1].set_title('Closing Rank Distribution by Seat Type')
axes[1].set_xlabel('Seat Type')
axes[1].set_ylabel('Closing Rank')

plt.suptitle('')  # Remove default title
plt.tight_layout()
plt.show()
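
The box plots can be backed by a small numeric table. A minimal sketch with made-up values, using a toy frame in place of `final_df`:

```python
import pandas as pd

# Toy frame standing in for final_df (illustrative values only)
toy_df = pd.DataFrame({
    "Seat Type": ["Open", "Open", "SC", "SC", "OBC"],
    "Opening Rank": [100, 300, 1000, 3000, 700],
    "Closing Rank": [200, 400, 2000, 4000, 900],
})

# Median, minimum, and maximum rank per seat type
rank_summary = (
    toy_df.groupby("Seat Type")[["Opening Rank", "Closing Rank"]]
    .agg(["median", "min", "max"])
)
print(rank_summary)
```

The result has two-level column labels, so a single value is read with a tuple, for example `rank_summary.loc["Open", ("Opening Rank", "median")]`.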

## 4. Feature Engineering

Let's prepare our features for machine learning.

In [None]:
# Define features and targets
feature_columns = ['Opening Rank', 'Gender', 'Seat Type']
target_columns = ['Institute', 'round']

# Check if all required columns exist
missing_features = [col for col in feature_columns if col not in final_df.columns]
missing_targets = [col for col in target_columns if col not in final_df.columns]

if missing_features:
    print(f"Missing feature columns: {missing_features}")
if missing_targets:
    print(f"Missing target columns: {missing_targets}")

if not missing_features and not missing_targets:
    X = final_df[feature_columns].copy()
    y = final_df[target_columns].copy()
    
    print(f"Features shape: {X.shape}")
    print(f"Targets shape: {y.shape}")
    print(f"Feature columns: {feature_columns}")
    print(f"Target columns: {target_columns}")
else:
    print("Cannot proceed with feature engineering due to missing columns")

In [None]:
# Encode target variables
if 'X' in locals() and 'y' in locals():
    le_institute = LabelEncoder()
    y_encoded = y.copy()
    y_encoded['Institute'] = le_institute.fit_transform(y['Institute'])
    
    print("Target variables encoded successfully!")
    print(f"Number of unique institutes: {len(le_institute.classes_)}")
    print(f"Number of unique rounds: {y['round'].nunique()}")
    print(f"Institute classes: {le_institute.classes_[:10]}...")  # Show first 10
    
    # Display encoding mapping
    institute_mapping = dict(zip(le_institute.classes_, le_institute.transform(le_institute.classes_)))
    print("\nInstitute encoding mapping (first 5):")
    for i, (institute, code) in enumerate(institute_mapping.items()):
        if i < 5:
            print(f"{institute}: {code}")
        else:
            break
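
The encoding above is reversible, which is what the sample-prediction cell later relies on. A short sketch of the round trip on a toy list of institute names, including what happens with a label the encoder has not seen:

```python
from sklearn.preprocessing import LabelEncoder

institutes = ["IIT Madras", "IIT Bombay", "IIT Delhi", "IIT Bombay"]

encoder = LabelEncoder()
codes = encoder.fit_transform(institutes)

# classes_ is sorted, so integer codes follow alphabetical order
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the codes back to the original names
decoded = encoder.inverse_transform(codes)
print(list(decoded))

# A label that was not present during fit raises ValueError
try:
    encoder.transform(["IIT Mandi"])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print(f"Unseen label rejected: {unseen_rejected}")
```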

In [None]:
# Define categorical and numerical features
categorical_features = ['Gender', 'Seat Type']
numeric_features = ['Opening Rank']

print(f"Categorical features: {categorical_features}")
print(f"Numerical features: {numeric_features}")

# Display unique values for categorical features
for feature in categorical_features:
    if feature in X.columns:
        print(f"\nUnique values in {feature}: {X[feature].unique()}")
        print(f"Value counts for {feature}:")
        print(X[feature].value_counts())

## 5. Model Training

Now let's train our machine learning model.

In [None]:
# Create preprocessing pipeline
if 'X' in locals() and 'y_encoded' in locals():
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numeric_features),
            ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)
        ]
    )
    
    # Create the full pipeline
    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)))
    ])
    
    print("Model pipeline created successfully!")
    print(f"Pipeline steps: {[step[0] for step in model.steps]}")
else:
    print("Cannot create model pipeline due to missing data")

In [None]:
# Split data for training and testing
if 'X' in locals() and 'y_encoded' in locals():
    if 'year' in final_df.columns:
        # Time-based split
        train_mask = final_df['year'] < final_df['year'].max()
        test_mask = final_df['year'] == final_df['year'].max()
        
        X_train = X[train_mask]
        y_train = y_encoded[train_mask]
        X_test = X[test_mask]
        y_test = y_encoded[test_mask]
        
        print("Time-based split completed!")
        print(f"Training years: {final_df[train_mask]['year'].unique()}")
        print(f"Test year: {final_df[test_mask]['year'].unique()}")
    else:
        # Random split
        X_train, X_test, y_train, y_test = train_test_split(
            X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded['Institute']
        )
        print("Random split completed!")
    
    print(f"Training set size: {X_train.shape[0]}")
    print(f"Test set size: {X_test.shape[0]}")
    print(f"Training split: {X_train.shape[0]/(X_train.shape[0] + X_test.shape[0]):.2%}")
    print(f"Test split: {X_test.shape[0]/(X_train.shape[0] + X_test.shape[0]):.2%}")
else:
    print("Cannot split data due to missing features or targets")
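
With a time-based split, an institute that appears only in the test year can never be predicted correctly, because the classifier only outputs classes it saw during training. A minimal sketch of how to list such institutes, on a toy frame standing in for `final_df`:

```python
import pandas as pd

# Toy frame standing in for final_df (illustrative values only)
toy_df = pd.DataFrame({
    "year": [2021, 2021, 2022, 2022, 2023, 2023],
    "Institute": ["IIT Delhi", "IIT Bombay", "IIT Delhi",
                  "IIT Madras", "IIT Delhi", "IIT Mandi"],
})

# Same rule as the cell above: the latest year is the test set
latest_year = toy_df["year"].max()
train = toy_df[toy_df["year"] < latest_year]
test = toy_df[toy_df["year"] == latest_year]

# Institutes present in the test year but absent from every training year
unseen = sorted(set(test["Institute"]) - set(train["Institute"]))
print(f"Institutes only in {latest_year}: {unseen}")
```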

In [None]:
# Train the model
if 'model' in locals() and 'X_train' in locals() and 'y_train' in locals():
    print("Training the model...")
    print("This may take a few minutes...")
    
    import time
    start_time = time.time()
    
    model.fit(X_train, y_train)
    
    end_time = time.time()
    training_time = end_time - start_time
    
    print(f"Model training completed in {training_time:.2f} seconds!")
    print(f"Training time: {training_time/60:.2f} minutes")
else:
    print("Cannot train model due to missing components")

## 6. Model Evaluation

Let's evaluate our model's performance.

In [None]:
# Make predictions
if 'model' in locals() and 'X_test' in locals() and 'y_test' in locals():
    print("Making predictions...")
    y_pred = model.predict(X_test)
    
    # Calculate accuracy for each target
    institute_accuracy = accuracy_score(y_test['Institute'], y_pred[:, 0])
    round_accuracy = accuracy_score(y_test['round'], y_pred[:, 1])
    
    print(f"\nModel Performance:")
    print(f"Institute prediction accuracy: {institute_accuracy:.4f} ({institute_accuracy*100:.2f}%)")
    print(f"Round prediction accuracy: {round_accuracy:.4f} ({round_accuracy*100:.2f}%)")
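    # This is the mean of the two per-target accuracies, not the share of rows with both targets correct
    mean_accuracy = (institute_accuracy + round_accuracy) / 2
    print(f"Mean per-target accuracy: {mean_accuracy:.4f} ({mean_accuracy*100:.2f}%)")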
    
    # Additional metrics
    print(f"\nAdditional Information:")
    print(f"Number of test samples: {len(y_test)}")
    print(f"Number of unique institutes predicted: {len(np.unique(y_pred[:, 0]))}")
    print(f"Number of unique rounds predicted: {len(np.unique(y_pred[:, 1]))}")
else:
    print("Cannot make predictions due to missing model or test data")
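
When there are many institutes, exact-match accuracy can be a harsh measure, so a top-k accuracy (is the true institute among the k most probable?) may be a useful complement. The sketch below runs on synthetic data rather than the project data; the feature, the five institute codes, and the noise are all made up. It relies on `MultiOutputClassifier.predict_proba`, which returns one probability array per target.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, top_k_accuracy_score
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: one rank-like feature and two targets (institute code, round)
X = rng.integers(1, 10_000, size=(400, 1)).astype(float)
institute = (X[:, 0] // 2000).astype(int)                      # 5 codes tied to rank
institute = (institute + rng.integers(0, 2, size=400)) % 5     # add label noise
round_number = rng.integers(1, 4, size=400)
y = np.column_stack([institute, round_number])

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=50, random_state=0))
clf.fit(X_train, y_train)

# Index 0 is the institute target, matching the column order of y
institute_proba = clf.predict_proba(X_test)[0]
institute_classes = clf.estimators_[0].classes_

top1 = accuracy_score(y_test[:, 0], clf.predict(X_test)[:, 0])
top3 = top_k_accuracy_score(
    y_test[:, 0], institute_proba, k=3, labels=institute_classes
)
print(f"Top-1 institute accuracy: {top1:.3f}")
print(f"Top-3 institute accuracy: {top3:.3f}")
```

With the notebook's pipeline, the equivalent calls would be `model.predict_proba(X_test)[0]` and `model.named_steps['classifier'].estimators_[0].classes_`.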

In [None]:
# Detailed classification report
if 'y_pred' in locals():
    print("Institute Classification Report:")
    print(classification_report(y_test['Institute'], y_pred[:, 0]))
    
    print("\nRound Classification Report:")
    print(classification_report(y_test['round'], y_pred[:, 1]))

In [None]:
# Feature importance analysis
if 'model' in locals() and hasattr(model.named_steps['classifier'], 'estimators_'):
    # Get feature names after preprocessing
    feature_names = model.named_steps['preprocessor'].get_feature_names_out()
    
    # Get feature importance for each target
    institute_importance = model.named_steps['classifier'].estimators_[0].feature_importances_
    round_importance = model.named_steps['classifier'].estimators_[1].feature_importances_
    
    # Create feature importance DataFrame
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Institute_Importance': institute_importance,
        'Round_Importance': round_importance
    })
    
    print("Feature Importance:")
    display(importance_df.sort_values('Institute_Importance', ascending=False))
    
    # Plot feature importance
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Institute prediction importance
    importance_df.sort_values('Institute_Importance', ascending=True).plot(
        x='Feature', y='Institute_Importance', kind='barh', ax=axes[0], color='skyblue'
    )
    axes[0].set_title('Feature Importance for Institute Prediction', fontsize=14, fontweight='bold')
    axes[0].set_xlabel('Importance Score')
    
    # Round prediction importance
    importance_df.sort_values('Round_Importance', ascending=True).plot(
        x='Feature', y='Round_Importance', kind='barh', ax=axes[1], color='lightcoral'
    )
    axes[1].set_title('Feature Importance for Round Prediction', fontsize=14, fontweight='bold')
    axes[1].set_xlabel('Importance Score')
    
    plt.tight_layout()
    plt.show()
else:
    print("Cannot analyze feature importance due to missing model")

In [None]:
# Save the trained model
if 'model' in locals():
    import os  # also imported in the data-saving cell; repeated so this cell runs on its own
    import joblib
    
    # Create models directory if it doesn't exist
    os.makedirs('../models', exist_ok=True)
    
    # Save model
    joblib.dump(model, '../models/jee_model.joblib')
    
    saved_files = ['../models/jee_model.joblib']
    
    # Save label encoder
    if 'le_institute' in locals():
        joblib.dump(le_institute, '../models/label_encoder.joblib')
        saved_files.append('../models/label_encoder.joblib')
    
    print("Model saved successfully!")
    print("Files saved:")
    # List only the files that were actually written
    for path in saved_files:
        print(f"- {path}")
else:
    print("Cannot save model due to missing trained model")
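
Saved artifacts are read back with `joblib.load`. A minimal round-trip sketch that writes to a temporary directory instead of `../models`, using a toy encoder in place of `le_institute`:

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.preprocessing import LabelEncoder

# Toy encoder standing in for le_institute
encoder = LabelEncoder().fit(["IIT Bombay", "IIT Delhi"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "label_encoder.joblib"
    joblib.dump(encoder, path)
    # Load while the temporary directory still exists
    restored = joblib.load(path)

# The restored encoder decodes integer codes exactly like the original
print(list(restored.inverse_transform([1, 0])))
```

The same pattern applies to the pipeline file: load it, then call `predict` on a frame with the `Opening Rank`, `Gender`, and `Seat Type` columns.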

In [None]:
# Test prediction with sample data
if 'model' in locals() and 'le_institute' in locals():
    print("Testing prediction with sample data:")
    
    # Create sample input
    sample_input = pd.DataFrame({
        'Opening Rank': [1000, 5000, 10000],
        'Gender': ['Male', 'Female', 'Male'],
        'Seat Type': ['Open', 'SC', 'OBC']
    })
    
    # Make predictions
    sample_predictions = model.predict(sample_input)
    
    # Decode institute predictions
    decoded_institutes = le_institute.inverse_transform(sample_predictions[:, 0])
    
    # Display results
    results_df = pd.DataFrame({
        'Rank': sample_input['Opening Rank'],
        'Gender': sample_input['Gender'],
        'Seat_Type': sample_input['Seat Type'],
        'Predicted_Institute': decoded_institutes,
        'Predicted_Round': sample_predictions[:, 1]
    })
    
    print("\nSample Predictions:")
    display(results_df)
else:
    print("Cannot make sample predictions due to missing model or encoder")

## 7. Conclusions

### Key Findings:

1. **Data Quality**: The dataset had missing values, which were handled by dropping rows without an institute or with unparseable ranks and by filling missing `Gender` values with "Neutral".

2. **Feature Importance**: Opening Rank is expected to be the most important feature for both institute and round predictions, since it is the only numeric input; the importance table in Section 6 is the place to confirm this.

3. **Model Performance**: The Random Forest model's performance should be judged by the per-target accuracies and classification reports in Section 6, which are measured on the most recent year when a `year` column is available.

4. **Data Patterns**: 
   - Ranks are closely tied to admission outcomes
   - Rank distributions differ across seat types (see the box plots in Section 3)
   - Gender is one of the three model inputs; its importance scores in Section 6 show how much it contributes

### Recommendations:

1. **Data Enhancement**: Collect more recent data and additional features like branch preferences, category rankings, etc.

2. **Model Improvement**: Experiment with other algorithms like XGBoost or Neural Networks for potentially better performance.

3. **Feature Engineering**: Create additional features like rank percentiles, institute rankings, etc.

4. **Validation**: Implement cross-validation and time-series validation for more robust evaluation.
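
The time-series validation mentioned in the last recommendation could look like the expanding-window sketch below: train on all earlier years, test on the next one, and repeat. The helper name `expanding_year_splits`, the synthetic data, and the placeholder institute names "IIT A" and "IIT B" are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def expanding_year_splits(years, min_train_years=1):
    """Yield (test_year, train_mask, test_mask) with all earlier years as training data."""
    ordered = sorted(years.unique())
    for i in range(min_train_years, len(ordered)):
        test_year = ordered[i]
        yield test_year, years < test_year, years == test_year


# Synthetic stand-in for final_df (random values, for illustration only)
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    "year": rng.integers(2020, 2024, size=200),
    "Opening Rank": rng.integers(1, 10_000, size=200),
})
toy_df["Institute"] = np.where(toy_df["Opening Rank"] < 5_000, "IIT A", "IIT B")

fold_scores = {}
for test_year, train_mask, test_mask in expanding_year_splits(toy_df["year"]):
    clf = RandomForestClassifier(n_estimators=20, random_state=0)
    clf.fit(toy_df.loc[train_mask, ["Opening Rank"]], toy_df.loc[train_mask, "Institute"])
    predictions = clf.predict(toy_df.loc[test_mask, ["Opening Rank"]])
    fold_scores[int(test_year)] = accuracy_score(
        toy_df.loc[test_mask, "Institute"], predictions
    )

for year, score in fold_scores.items():
    print(f"Test year {year}: accuracy {score:.3f}")
```

Each fold only ever trains on years before its test year, so no future information leaks into training.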

### Next Steps:

1. Deploy the model as a web application
2. Create an API for real-time predictions
3. Implement automated data updates
4. Add more sophisticated evaluation metrics
5. Implement hyperparameter tuning
6. Add model interpretability features

---

*This analysis provides a comprehensive overview of the JEE College Prediction project. The model can be further improved with additional data and advanced techniques.*