<h1 align="center"> Academic Success Prediction (Multi Class Classification) </h1>

<img 
    src="https://www.kaggle.com/competitions/73290/images/header" 
    alt="" 
    width="300" 
    height="150" 
    style="display: block; margin: 0 auto" 
/>

---

## Problem Definition

- Domain    
    
    - Education (Higher Education)
    
<br>

- Dataset
    
    - [Classification with an Academic Success Dataset](https://www.kaggle.com/competitions/playground-series-s4e6/data) by Kaggle. The dataset contains 36 features describing students' demographics, academic background, family background, and so on. The training set comprises 76,518 samples, and each instance represents one student.

<br>

- Objective
    
    - The goal of this project is to predict the academic risk of students in higher education. This is a multi-class classification problem: the target variable takes one of three values, 'Graduate', 'Dropout', or 'Enrolled'.

<br>

- Algorithms

    - The following classification algorithms are used to train models. The models are evaluated using the accuracy score.
    
        1. [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
        2. [XGBClassifier](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier)
        3. [LGBMClassifier](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html)
        4. [StackingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html)
    
<br>

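Accuracy is the fraction of students whose predicted class matches the true class. As a minimal sketch with made-up labels (not taken from the dataset), the metric reduces to the following. The helper name `accuracy` is hypothetical and only mirrors what `accuracy_score` returns with its default settings.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only.
y_true = ["Graduate", "Dropout", "Enrolled", "Graduate"]
y_pred = ["Graduate", "Dropout", "Graduate", "Graduate"]
print(accuracy(y_true, y_pred))  # 3 of 4 correct -> 0.75
```

Accuracy weights every sample equally, so if the three classes turn out to be unevenly represented, a per-class breakdown may be worth checking alongside it.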

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder, StandardScaler, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

from sklearn.metrics import accuracy_score

import optuna

In [None]:
train_data = pd.read_csv('/kaggle/input/playground-series-s4e6/train.csv')
test_data = pd.read_csv('/kaggle/input/playground-series-s4e6/test.csv')


In [None]:
print(f"Train Data\nInstance Count: {train_data.shape[0]} \nColumn Count: {train_data.shape[1]}")
train_data.head().style.set_table_attributes(
    'style="overflow-x: auto; display: inline-block;"'
    ).set_properties(**{'min-width': '50px'})


In [None]:
print(f"Test Data\nInstance Count: {test_data.shape[0]} \nColumn Count: {test_data.shape[1]}")

test_data.head().style.set_table_attributes(
    'style="overflow-x: auto; display: inline-block;"'
    ).set_properties(**{'min-width': '50px'})


---

# Step 1. Exploratory Data Analysis (EDA)

---

In [None]:
train_data.info()


In [None]:
test_data.info()


In [None]:
print(f"Train data null value count: {train_data.isnull().sum().sum()}")
print(f"Test data null value count: {test_data.isnull().sum().sum()}")     


In [None]:
duplicates = train_data[train_data.duplicated(keep=False)]
print(len(duplicates))


### Conclusion:

- Neither the train nor the test dataset contains null (missing) values.
- The train dataset contains no duplicated rows.

---

In [None]:
unique_value_counts = {}
cat_features = []
num_features = []

for col in train_data.columns:
    unique_values = train_data[col].unique()
    value_counts = train_data[col].value_counts()
    unique_value_counts[col] = {'unique_values': unique_values, 'value_counts': value_counts}

for col, values_counts in unique_value_counts.items():
#     print(f"Column '{col}':")
#     print(f"Unique values: {values_counts['unique_values']}")
#     print(f"Value counts:")
#     print(values_counts['value_counts'])
#     print()
    if len(values_counts['unique_values']) < 10:
        cat_features.append(col)
    else: 
        num_features.append(col)
        
        
print("Categorical Features: \n", cat_features, "\n")
print("Numerical Features: \n", num_features, "\n")


In [None]:
cat_features = [
    "Marital status",
    "Application mode",
    "Application order",
    "Course",
    "Daytime/evening attendance",
    "Previous qualification",
    "Nacionality",
    "Mother's qualification",
    "Father's qualification",
    "Mother's occupation",
    "Father's occupation",
    "Displaced",
    "Educational special needs",
    "Debtor",
    "Tuition fees up to date",
    "Gender",
    "Scholarship holder",
    "International"
]

num_features = list(train_data.drop(cat_features, axis=1).columns)
num_features.remove('id')
num_features.remove('Target')


print(f"Categorical Feature Count: {len(cat_features)}")
print(f"Numerical Feature Count: {len(num_features)}")

print(f"\nCategorical Features: {cat_features}")
print(f"\nNumerical Features: {num_features}")

### Conclusion:

- Although every feature is stored as a number, some features are category codes and should be treated as categorical.
- Accordingly, there are 18 categorical features and 18 numerical features.
---


---

## 1.1 Statistical Analysis

---

In [None]:
train_data[num_features].describe().style.set_table_attributes(
    'style="overflow-x: auto; display: inline-block;"').set_properties(**{'min-width': '100px'})


In [None]:
Q1 = train_data[num_features].quantile(0.25)
Q3 = train_data[num_features].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_iqr = ((train_data[num_features] < lower_bound) | (train_data[num_features] > upper_bound))

outliers_count = outliers_iqr.sum()
outliers_count = outliers_count[outliers_count > 0].sort_values(ascending=False)
print(f"Outliers Count: \n{outliers_count}")


columns_with_outliers = outliers_iqr.any()
columns_with_no_outliers = ~columns_with_outliers

# filtered_data = train_data[num_features].loc[:, columns_with_no_outliers]
# print(filtered_data)

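The outlier check above applies the 1.5 × IQR rule column by column. A minimal sketch of the same rule on a single made-up array (the values are illustrative and not drawn from the dataset) may make the bounds easier to follow:

```python
import numpy as np

# Made-up values for illustration; 100 is the obvious extreme.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

is_outlier = (values < lower_bound) | (values > upper_bound)
print(lower_bound, upper_bound)       # -1.0 7.0
print(values[is_outlier].tolist())    # [100.0]
```

Here Q1 = 2 and Q3 = 4, so the IQR is 2 and the accepted range is [-1, 7]; only the value 100 falls outside it.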

---

## 1.2 Data Visualization

---

---
## 1.2.1 Univariate Analysis
---

In [None]:
target_counts = train_data['Target'].value_counts()

plt.figure(figsize=(8,6))
sns.barplot(x=target_counts.index, y=target_counts.values, palette='crest')
plt.ylabel('Count')
plt.title('Target Value Counts', fontsize=24)
plt.savefig("target count.png")
plt.show()



In [None]:
cat_features_short = []
cat_features_long = []


for col in cat_features:
    if train_data[col].nunique() < 10:
        cat_features_short.append(col)
    else:
        cat_features_long.append(col)

print(cat_features_short, '\n', cat_features_long)

In [None]:
n_rows = 5
n_cols = 2

fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, 5*n_rows))
fig.tight_layout(pad=4.0)

for i, column in enumerate(train_data[cat_features_short].columns):
    row = i // n_cols
    col = i % n_cols
    ax = axes[row, col]

    category_counts = train_data[cat_features_short][column].value_counts()
    sns.barplot(x=category_counts.index, y=category_counts.values, ax=ax, palette='crest')

plt.suptitle('Bar Charts - Academic Success Features', fontsize=20, y=1.0)
plt.savefig("categorical feature analysis 1.png")
plt.show()


In [None]:
n_rows = 4
n_cols = 2

fig, axes = plt.subplots(n_rows, n_cols, figsize=(20,5*n_rows))
fig.tight_layout(pad=4.0)

for i, column in enumerate(train_data[cat_features_long].columns):
    row = i // n_cols
    col = i % n_cols
    ax = axes[row, col]

    category_counts = train_data[cat_features_long][column].value_counts()
    sns.barplot(x=category_counts.index, y=category_counts.values, ax=ax, palette='crest')

    ax.tick_params(axis='x', rotation=90)

plt.suptitle('Bar Charts - Academic Success Features', fontsize=20, y=1.0)    
plt.savefig("categorical feature analysis 2.png")
plt.show()


In [None]:
num_features_1 = ['Previous qualification (grade)', 'Admission grade', 'Age at enrollment', 'Unemployment rate', 'Inflation rate', 'GDP']
num_features_2 = [
    'Curricular units 1st sem (credited)', 'Curricular units 2nd sem (credited)',
    'Curricular units 1st sem (enrolled)', 'Curricular units 2nd sem (enrolled)', 
    'Curricular units 1st sem (evaluations)', 'Curricular units 2nd sem (evaluations)',
    'Curricular units 1st sem (approved)', 'Curricular units 2nd sem (approved)',
    'Curricular units 1st sem (grade)', 'Curricular units 2nd sem (grade)',
    'Curricular units 1st sem (without evaluations)', 'Curricular units 2nd sem (without evaluations)'
]


In [None]:
n_cols = 2
n_rows = 3

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5*n_rows))
axes = axes.flatten()


for i, col in enumerate(train_data[num_features_1].columns):
    ax = axes[i]
    sns.histplot(train_data[num_features_1][col], kde=True, ax=ax, color='seagreen')
    ax.set_ylabel('Frequency')

plt.suptitle('Histograms - Academic Success Dataset', fontsize=20, y=1.0)
plt.tight_layout()
plt.savefig('numerical feature analysis 1.png')
plt.show()


In [None]:
n_cols = 2
n_rows = 6

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5*n_rows))
axes = axes.flatten()


for i, col in enumerate(train_data[num_features_2].columns):
    ax = axes[i]
    sns.histplot(train_data[num_features_2][col], kde=True, ax=ax, color='seagreen')
    ax.set_ylabel('Frequency')

plt.suptitle('Histograms - Academic Success Dataset', fontsize=20, y=1.0)
plt.tight_layout()
plt.savefig('numerical feature analysis 2.png')
plt.show()


## 1.2.2 Bivariate Analysis
---

In [None]:
corr = train_data[num_features].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.light_palette("darkgreen", as_cmap=True)

plt.figure(figsize=(12, 8))
sns.heatmap(corr, mask=mask, cmap=cmap, annot=True, fmt='.2f', linewidths=0.5, cbar_kws={"shrink": 0.5})
plt.title('Correlation Matrix - Academic Success Dataset', fontsize=20)

plt.savefig("Correlation Matrix.png")
plt.show()


---

# Step 2. Feature Engineering

---

In [None]:
train_data.drop(['id'], axis=1, inplace=True)
test_data.drop(['id'], axis=1, inplace=True)


In [None]:
train_data['Curricular units (credited)'] = train_data['Curricular units 1st sem (credited)'] + train_data['Curricular units 2nd sem (credited)']
train_data['Curricular units (enrolled)'] = train_data['Curricular units 1st sem (enrolled)'] + train_data['Curricular units 2nd sem (enrolled)']
train_data['Curricular units (approved)'] = train_data['Curricular units 1st sem (approved)'] + train_data['Curricular units 2nd sem (approved)']


In [None]:
test_data['Curricular units (credited)'] = test_data['Curricular units 1st sem (credited)'] + test_data['Curricular units 2nd sem (credited)']
test_data['Curricular units (enrolled)'] = test_data['Curricular units 1st sem (enrolled)'] + test_data['Curricular units 2nd sem (enrolled)']
test_data['Curricular units (approved)'] = test_data['Curricular units 1st sem (approved)'] + test_data['Curricular units 2nd sem (approved)']


In [None]:
# train_data.drop(
#     columns=[
#         'Curricular units 1st sem (credited)', 'Curricular units 1st sem (enrolled)', 
#          'Curricular units 1st sem (approved)',
#         'Curricular units 2nd sem (credited)', 'Curricular units 2nd sem (enrolled)', 
#         'Curricular units 2nd sem (approved)'
#     ], inplace=True)


In [None]:
# test_data.drop(
#     columns=[
#         'Curricular units 1st sem (credited)', 'Curricular units 1st sem (enrolled)', 
#          'Curricular units 1st sem (approved)',
#         'Curricular units 2nd sem (credited)', 'Curricular units 2nd sem (enrolled)', 
#         'Curricular units 2nd sem (approved)'
#     ], inplace=True)


---

# Step 3. Data Pre-processing

---

In [None]:
X = train_data.drop('Target', axis=1)
y = train_data['Target']


In [None]:
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

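`LabelEncoder` assigns integer codes following the sorted order of the class labels. The plain-Python sketch below reproduces that ordering for the three target classes named in the problem definition. The names `classes` and `to_code` are hypothetical and used only for illustration.

```python
# A few example labels using the three target classes.
labels = ["Graduate", "Dropout", "Enrolled", "Graduate"]

# Sorted unique labels define the codes, mirroring LabelEncoder's ordering.
classes = sorted(set(labels))
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in labels]
decoded = [classes[code] for code in encoded]  # inverse mapping

print(to_code)   # {'Dropout': 0, 'Enrolled': 1, 'Graduate': 2}
print(encoded)   # [2, 0, 1, 2]
```

In the notebook itself, `label_encoder.classes_` holds the same ordering, and `label_encoder.inverse_transform` is what maps predictions back to class names in Step 5.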

In [None]:
robust_scaler = RobustScaler()
standard_scaler = StandardScaler()
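# Keep only the column names flagged True in each boolean mask;
# taking .index alone would return every numerical column for both lists.
cols_with_outliers = list(columns_with_outliers[columns_with_outliers].index)
cols_with_no_outliers = list(columns_with_no_outliers[columns_with_no_outliers].index)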

preprocessor = ColumnTransformer(
    transformers=[
        ('robust', robust_scaler, cols_with_outliers),
        ('standard', standard_scaler, cols_with_no_outliers)
    ],
    remainder='passthrough'
)

preprocessor

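The preprocessor sends columns with outliers to `RobustScaler` and the remaining numerical columns to `StandardScaler`. As a rough sketch of why that split can help, the NumPy code below applies both formulas to one made-up column. It assumes the default settings of both scalers (median and interquartile range for `RobustScaler`, mean and population standard deviation for `StandardScaler`).

```python
import numpy as np

# Made-up column with one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# StandardScaler default: subtract the mean, divide by the population std.
standard = (x - x.mean()) / x.std()

# RobustScaler default: subtract the median, divide by the IQR (Q3 - Q1).
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print(np.round(standard, 2).tolist())
print(robust.tolist())   # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

With robust scaling, the four typical values keep a spread of 1.5 units while the extreme value stays far away. With standard scaling, the extreme value inflates both the mean and the standard deviation, which squeezes the typical values close together.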

---

# Step 4. Model Training & Evaluation

---


In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# X_train.head()


In [None]:

# def xgb_objective(trial, X_train, y_train):
#     param = {
#         'verbosity': 0,
#         'objective': 'multi:softmax',
#         'num_class': len(set(y)),
#         'eval_metric': 'mlogloss',
#         'booster': trial.suggest_categorical('booster', ['gbtree', 'dart']),
#         'lambda': trial.suggest_float('lambda', 1e-8, 1.0, log=True),
#         'alpha': trial.suggest_float('alpha', 1e-8, 1.0, log=True),
#     }

#     if param['booster'] == 'gbtree' or param['booster'] == 'dart':
#         param['max_depth'] = trial.suggest_int('max_depth', 3, 9)
#         param['eta'] = trial.suggest_float('eta', 0.01, 0.3)
#         param['gamma'] = trial.suggest_float('gamma', 1e-8, 1.0, log=True)
#         param['grow_policy'] = trial.suggest_categorical('grow_policy', ['depthwise', 'lossguide'])

#     if param['booster'] == 'dart':
#         param['sample_type'] = trial.suggest_categorical('sample_type', ['uniform', 'weighted'])
#         param['normalize_type'] = trial.suggest_categorical('normalize_type', ['tree', 'forest'])
#         param['rate_drop'] = trial.suggest_float('rate_drop', 1e-8, 1.0, log=True)
#         param['skip_drop'] = trial.suggest_float('skip_drop', 1e-8, 1.0, log=True)

#     model = XGBClassifier(**param, use_label_encoder=False)
#     pipeline = make_pipeline(preprocessor, model)
    
#     score = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy').mean()
    
#     return score


# study = optuna.create_study(direction='maximize')
# study.optimize(lambda trial: xgb_objective(trial, X_train, y_train), n_trials=10)

# print("Best hyperparameters: ", study.best_params)


In [None]:
xgb_best_params = {'booster': 'gbtree', 'lambda': 0.7307682601434828, 'alpha': 6.221868031196627e-06, 'max_depth': 6, 'eta': 0.1595865305437521, 'gamma': 0.13412702643857463, 'grow_policy': 'lossguide'}

In [None]:
best_xgb = XGBClassifier(
#     **study.best_params,
    **xgb_best_params,
    device='gpu'
)

pipeline = make_pipeline(preprocessor, best_xgb)

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_val)

accuracy = accuracy_score(y_val, y_pred).round(4)
print("Validation Set Accuracy:", accuracy)


In [None]:

# def lgb_objective(trial, X_train, y_train):
#     param = {
#         'n_estimators': trial.suggest_int('n_estimators', 50, 300),
#         'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
#         'max_depth': trial.suggest_int('max_depth', -1, 50),
#         'num_leaves': trial.suggest_int('num_leaves', 2, 256),
#         'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
#         'subsample': trial.suggest_float('subsample', 0.5, 1.0),
#         'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
#     }

#     model = LGBMClassifier(**param, random_state=42)

#     pipeline = make_pipeline(preprocessor, model)
    
#     score = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy').mean()
    
#     return score


# study = optuna.create_study(direction='maximize')
# study.optimize(lambda trial: lgb_objective(trial, X_train, y_train), n_trials=10)

# print("Best hyperparameters: ", study.best_params)


In [None]:
lgb_best_params = {'n_estimators': 266, 'learning_rate': 0.05058460730773053, 'max_depth': 21, 'num_leaves': 95, 'min_child_samples': 70, 'subsample': 0.7118244893103414, 'colsample_bytree': 0.6388926693521347}

In [None]:
best_lgb = LGBMClassifier(
#     **study.best_params,
    **lgb_best_params,
    random_state=42
)

pipeline = make_pipeline(preprocessor, best_lgb)

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_val)

accuracy = accuracy_score(y_val, y_pred).round(4)
print("Validation Set Accuracy:", accuracy)


In [None]:
base_learners = [
    ('XGB', best_xgb),
    ('LGB', best_lgb)
]

stacking_model = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    cv=5
)

pipeline = make_pipeline(preprocessor, stacking_model)
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_val)

accuracy = accuracy_score(y_val, y_pred).round(4)
print("Validation Set Accuracy:", accuracy)

stacking_model


---

# Step 5. Make Predictions on Test Data

---

In [None]:
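# preprocessor and stacking_model were already fitted in Step 4 (on X_train),
# so they are reused here for prediction without another call to fit.
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', stacking_model)
])

test_predictions = pipeline.predict(test_data)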


In [None]:
sample_submission = pd.read_csv('/kaggle/input/playground-series-s4e6/sample_submission.csv')

sample_submission['Target'] = test_predictions
sample_submission['Target'] = label_encoder.inverse_transform(sample_submission['Target'])

sample_submission.to_csv('submission.csv', index=False)
sample_submission.head(5)


In [None]:
target_counts = sample_submission['Target'].value_counts()

plt.figure(figsize=(8,6))
sns.barplot(x=target_counts.index, y=target_counts.values, palette='crest')
plt.ylabel('Count')
plt.show()
