# Student Assistance Prediction - V1
## Objective:
Predict whether a student requires **financial assistance** (`Need Assistance`) or only **academic advice** (`Don't Need Assistance (Advice)`) based on their socio-economic and demographic profile.

**Consistency Note:** This notebook uses the same cleaning and preprocessing pipeline as `student_classification_v3.ipynb` to ensure model compatibility and data integrity.

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, ConfusionMatrixDisplay, f1_score

import warnings
warnings.filterwarnings('ignore')
print('Libraries imported successfully!')

Libraries imported successfully!


### 1. Data Loading & Label Synthesis
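We load the dataset and synthesize the target variable `Assistance_Status` from socio-economic indicators. Each student accumulates points for low family income, missing internet or device access, a part-time job, and frequent power outages; a total score of 4 or more is labelled `Need Assistance`.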

In [11]:
df = pd.read_csv('dataset.csv')

# Define Assistance Needs logic based on socio-economic factors
def determine_assistance(row):
    score = 0
    # Income Factor
    if row['Family_Income_PKR'] == 'Low (<30k)': score += 3
    elif row['Family_Income_PKR'] == 'Lower-Middle (30k-60k)': score += 2
    
    # Resource Factors
    if row['Internet_Access'] == 'No': score += 1
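    # Caveat: pandas 2.0+ treats the literal string 'None' as missing by default in read_csv,
    # so this branch matches only if the value survives loading as a string.
    if row['Device_Available'] == 'None': score += 2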
    elif row['Device_Available'] == 'Mobile': score += 1
    
    # Hardship Factors
    if row['Part_Time_Job'] == 'Yes': score += 1
    if row['Electricity_Availability'] == 'Frequent Outages': score += 1
    
    # Classification
    return 'Need Assistance' if score >= 4 else 'Don\'t Need Assistance (Advice)'

df['Assistance_Status'] = df.apply(determine_assistance, axis=1)
print(f"Assistance Status Distribution:\n{df['Assistance_Status'].value_counts()}")
df.head()

Assistance Status Distribution:
Assistance_Status
Don't Need Assistance (Advice)    687
Need Assistance                   313
Name: count, dtype: int64


| | Student_ID | Age | Gender | City | Province | CGPA | Performance_Class | Family_Income_PKR | Parents_Education | Study_Hours_Per_Week | ... | School_Type | Medium_of_Instruction | Distance_to_Institute_km | Transport_Mode | Parental_Support_Level | Health_Issues | Part_Time_Job | Extra_Curricular | Motivation_Level | Assistance_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 21 | Female | Faisalabad | KPK | 1.5 | Weak | Low (<30k) | Graduate | 6 | ... | Private | Urdu | 15.6 | Bus | Low | Chronic | Yes | Clubs | Low | Need Assistance |
| 1 | 2 | 23 | Male | Mardan | Punjab | 1.5 | Weak | Middle (60k-120k) | Primary | 10 | ... | Public | Urdu | 19.0 | Bike | Medium | Minor | Yes | Clubs | High | Don't Need Assistance (Advice) |
| 2 | 3 | 20 | Male | Gilgit | Gilgit Baltistan | 1.5 | Weak | Upper-Middle (120k-250k) | Matric | 17 | ... | Public | English | 6.4 | Van | Medium | Minor | Yes | | Low | Need Assistance |
| 3 | 4 | 20 | Female | Dir | Punjab | 1.51 | Weak | Middle (60k-120k) | Intermediate | 30 | ... | Private | English | 24.2 | Walk | Low | Minor | No | Sports | Low | Don't Need Assistance (Advice) |
| 4 | 5 | 24 | Male | Faisalabad | KPK | 1.51 | Weak | High (>250k) | Primary | 23 | ... | Public | Urdu | 22.7 | Bus | High | Minor | Yes | Clubs | High | Don't Need Assistance (Advice) |
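
To make the scoring rule concrete, here is a minimal standalone restatement of it applied to two hand-written profiles. The function name `assistance_score`, both profiles, and the category values `'Laptop'` and `'Stable'` are illustrative assumptions, not rows or values taken from `dataset.csv`; the point values and the threshold of 4 mirror `determine_assistance`.

```python
# Standalone restatement of the scoring rule in determine_assistance.
# `assistance_score` and the sample profiles are hypothetical.
INCOME_POINTS = {'Low (<30k)': 3, 'Lower-Middle (30k-60k)': 2}
DEVICE_POINTS = {'None': 2, 'Mobile': 1}
THRESHOLD = 4  # score >= 4 means 'Need Assistance'


def assistance_score(profile):
    """Return the hardship score for one student profile (a plain dict)."""
    score = INCOME_POINTS.get(profile['Family_Income_PKR'], 0)
    score += DEVICE_POINTS.get(profile['Device_Available'], 0)
    score += 1 if profile['Internet_Access'] == 'No' else 0
    score += 1 if profile['Part_Time_Job'] == 'Yes' else 0
    score += 1 if profile['Electricity_Availability'] == 'Frequent Outages' else 0
    return score


# Low income but no other hardship flags.
low_income_only = {
    'Family_Income_PKR': 'Low (<30k)',
    'Internet_Access': 'Yes',
    'Device_Available': 'Laptop',          # assumed category value
    'Part_Time_Job': 'No',
    'Electricity_Availability': 'Stable',  # assumed category value
}

# Higher income, but every resource and hardship flag is set.
high_income_hardship = {
    'Family_Income_PKR': 'Upper-Middle (120k-250k)',
    'Internet_Access': 'No',
    'Device_Available': 'Mobile',
    'Part_Time_Job': 'Yes',
    'Electricity_Availability': 'Frequent Outages',
}

for label, profile in [('low income only', low_income_only),
                       ('high income, hardship', high_income_hardship)]:
    score = assistance_score(profile)
    print(label, score, score >= THRESHOLD)
```

Under this rule, low income alone scores 3 and stays below the threshold, while a higher-income student with four resource or hardship flags reaches it. That is one way a row with `Upper-Middle` income can still end up labelled `Need Assistance`.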


### 2. Consistent Data Cleaning
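We drop the `Student_ID` identifier, label-encode the target, and split the remaining columns into numeric and categorical features, as in the classification pipeline.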

In [12]:
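# Drop the ID column only. CGPA and attendance describe performance rather than need;
# they are kept so we can see whether performance correlates with need.
# Note: the columns used to synthesize the label (income, internet, device, part-time job,
# electricity) also remain in X, so a model can recover the scoring rule directly from them.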
if 'Student_ID' in df.columns:
    df = df.drop(columns=['Student_ID'])

# Encode Target
le = LabelEncoder()
df['Assistance_Status'] = le.fit_transform(df['Assistance_Status'])
print(f"Classes: {list(le.classes_)}")

X = df.drop(columns=['Assistance_Status'])
y = df['Assistance_Status']

# Identify columns
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

Classes: ["Don't Need Assistance (Advice)", 'Need Assistance']
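
The class order printed above follows from how `LabelEncoder` assigns integers: it sorts the distinct labels, so `Don't Need Assistance (Advice)` becomes 0 and `Need Assistance` becomes 1. A minimal sketch on a hand-written label list (the list itself is an assumption for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label list; only the two label strings come from the notebook.
labels = ['Need Assistance', "Don't Need Assistance (Advice)", 'Need Assistance']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(list(encoder.classes_))   # sorted class names; index = encoded integer
print(encoded.tolist())         # integer code for each input label
print(encoder.inverse_transform([1, 0]).tolist())  # decode integers back to names
```

Keeping the fitted encoder around (or its `classes_` list) is what lets integer predictions be turned back into readable labels later.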


### 3. Preprocessing Pipeline
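Numeric columns are median-imputed and standardized; categorical columns are imputed with the most frequent value and one-hot encoded. This is the same `ColumnTransformer` structure used in the classification notebook, kept for consistency. The cell also makes a stratified 80/20 train/test split.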

In [13]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

### 4. Model Training & Comparison
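We fit Logistic Regression, Random Forest, and SVM inside the same preprocessing pipeline and compare accuracy and F1 on the held-out test set. F1 is computed for class 1 (`Need Assistance`), which is the default positive label for `f1_score` on a binary target.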

In [14]:
models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(probability=True)
}

results = {}
for name, model in models.items():
    pipe = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    results[name] = {'Accuracy': acc, 'F1': f1, 'Model': pipe}
    
    print(f"{name} - Accuracy: {acc:.4f}")

comparison_df = pd.DataFrame(results).T.drop(columns=['Model'])
comparison_df

Logistic Regression - Accuracy: 0.9600
Random Forest - Accuracy: 0.9550
SVM - Accuracy: 0.9300


| | Accuracy | F1 |
|---|---|---|
| Logistic Regression | 0.96 | 0.934426 |
| Random Forest | 0.955 | 0.92437 |
| SVM | 0.93 | 0.881356 |
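
The F1 column can be cross-checked against the accuracy column by hand. For a binary problem, F1 = 2·TP / (2·TP + FP + FN), and FP + FN is simply the number of misclassified test rows. The sketch below assumes the test set has 200 rows (20% of the 1000 rows counted in the label distribution). The true-positive counts are back-solved from the reported scores rather than read from a confusion matrix, so treat them as inferred; the helper names are hypothetical.

```python
TEST_ROWS = 200  # assumed: test_size=0.2 applied to 1000 rows

# (accuracy, F1) as reported in the comparison table.
reported = {
    'Logistic Regression': (0.96, 0.934426),
    'Random Forest': (0.955, 0.92437),
    'SVM': (0.93, 0.881356),
}


def f1_from_counts(tp, errors):
    """F1 for the positive class, given true positives and errors = FP + FN."""
    return 2 * tp / (2 * tp + errors)


def infer_true_positives(accuracy, f1):
    """Back-solve TP from accuracy and F1, rounding to the nearest integer."""
    errors = round((1 - accuracy) * TEST_ROWS)
    tp = round(f1 * errors / (2 * (1 - f1)))
    return tp, errors


for name, (acc, f1) in reported.items():
    tp, errors = infer_true_positives(acc, f1)
    print(f"{name}: errors={errors}, TP={tp}, F1={f1_from_counts(tp, errors):.6f}")

# Logistic Regression: errors=8, TP=57, F1=0.934426
# Random Forest: errors=9, TP=55, F1=0.924370
# SVM: errors=14, TP=52, F1=0.881356
```

The recomputed F1 values match the table, which suggests the two metrics are internally consistent. How the errors split between false positives and false negatives cannot be recovered from these two numbers alone; `confusion_matrix` on the test predictions would show that.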


### 5. Final Model Selection & Save
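We save the pipeline with the highest test accuracy. Here that is Logistic Regression (accuracy 0.96, F1 0.934), which also has the highest F1 of the three.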

In [16]:
import joblib

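# Select the model with the highest test accuracy.
# The column is object dtype (it was built alongside the pipeline objects), so cast to float first.
best_model_name = comparison_df['Accuracy'].astype(float).idxmax()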
final_model = results[best_model_name]['Model']

print(f"Saving the best model: {best_model_name}")
joblib.dump(final_model, 'student_assistance_model.pkl')
print("Model saved successfully!")

Saving the best model: Logistic Regression
Model saved successfully!
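
The saved pipeline predicts the integers 0 and 1, and the fitted `LabelEncoder` is not part of the file written above. One possible way to keep predictions readable after reloading is to store the class names next to the model. The sketch below shows this on toy data; the bundle layout, the file name `toy_assistance_model.pkl`, and the one-feature dataset are all assumptions for illustration, not what the notebook does.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the real features and labels (hypothetical data).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_names = ["Don't Need Assistance (Advice)"] * 3 + ['Need Assistance'] * 3

encoder = LabelEncoder()
y_toy = encoder.fit_transform(y_names)
model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_assistance_model.pkl')
    # Store the class names with the model so integer predictions can be decoded later.
    joblib.dump({'model': model, 'classes': [str(c) for c in encoder.classes_]}, path)
    bundle = joblib.load(path)

predicted = bundle['model'].predict(X_toy)
decoded = [bundle['classes'][int(i)] for i in predicted]
print(decoded)
```

With the real pipeline, the same pattern would apply to `final_model` and `le.classes_`, and the input to `predict` would be a DataFrame with the same columns as `X`.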
