<a href="https://colab.research.google.com/github/Gihanibhagya/Regression-Based-Models/blob/main/Regression_Based_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Regression-Based Models**

**Step 1: Import Required Libraries**

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

**Step 2: Load and Inspect Dataset**

In [3]:
df = pd.read_excel("/content/Students_Performance_data_set.xlsx")
df.head()

Unnamed: 0,University Admission year,Gender,Age,H.S.C passing year,Program,Current Semester,Do you have meritorious scholarship ?,Do you use University transportation?,How many hour do you study daily?,How many times do you seat for study in a day?,...,What is you interested area?,What is your relationship status?,Are you engaged with any co-curriculum activities?,With whom you are living with?,Do you have any health issues?,What was your previous SGPA?,Do you have any physical disabilities?,What is your current CGPA?,How many Credit did you have completed?,What is your monthly family income?
0,2018,Male,24,2016,BCSE,12,Yes,No,3,2,...,Data Schince,Single,Yes,Bachelor,No,2.68,No,3.15,75,25000
1,2021,Male,22,2020,BCSE,4,Yes,Yes,3,2,...,Event management,Single,Yes,Family,No,2.68,No,3.15,36,100000
2,2020,Female,21,2019,BCSE,5,No,No,3,3,...,Software,Single,No,Bachelor,No,2.68,No,3.15,50,50000
3,2021,Male,20,2020,BCSE,4,Yes,No,1,3,...,Artificial Intelligence,Single,No,Bachelor,Yes,2.68,No,3.15,36,62488
4,2021,Male,22,2019,BCSE,4,Yes,No,3,1,...,Software,Relationship,No,Bachelor,Yes,2.68,No,3.15,36,50000
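
Beyond `df.head()`, a per-column overview of dtypes and missing values can help decide how to impute in the next step. The following is a minimal sketch on a small, hypothetical stand-in frame (`sample` and its values are assumptions made for illustration, not rows from the real dataset); in the notebook the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey data (values invented for illustration).
sample = pd.DataFrame({
    "Age": [24, 22, np.nan, 20],
    "Gender": ["Male", "Male", "Female", None],
    "What is your current CGPA?": [3.15, 3.15, 3.15, np.nan],
})

# One row per column: its dtype and how many values are missing.
overview = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "missing": sample.isna().sum(),
})

print(sample.shape)
print(overview)
```

Columns with a non-zero `missing` count are the ones the imputers in Step 3 would actually change.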


**Step 3: Data Preprocessing (Handle missing values, encode, normalize)**

In [7]:
# Identify the target column and the column to one-hot encode
target_col = 'What is your current CGPA?'
one_hot_col = 'Status of your English language proficiency'

# Separate column types
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
numerical_cols.remove(target_col)
categorical_cols.remove(one_hot_col)

# Impute missing values
df[numerical_cols] = SimpleImputer(strategy='median').fit_transform(df[numerical_cols])
df[categorical_cols + [one_hot_col]] = SimpleImputer(strategy='most_frequent').fit_transform(df[categorical_cols + [one_hot_col]])

# Cast mixed-type columns to string, then label-encode each one as integers
for col in categorical_cols:
    df[col] = df[col].astype(str)
    df[col] = LabelEncoder().fit_transform(df[col])

# One-hot encode the English proficiency column
df = pd.get_dummies(df, columns=[one_hot_col], prefix="English")

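# Normalize numerical features and the target to [0, 1].
# Because the target is scaled too, the MSE and RMSE reported later are in scaled units, not CGPA points.
scaler = MinMaxScaler()
df[numerical_cols + [target_col]] = scaler.fit_transform(df[numerical_cols + [target_col]])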
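
The cell above fits the scaler on the full dataset before any split, so the minimum and maximum of rows that later land in the test set influence the scaled training data. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only. This is not what the notebook does; the names `X_demo`, `y_demo`, and `scaler_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic numeric features and target, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Split before computing any scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # min/max learned from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # training min/max reused on held-out rows

print(X_tr_scaled.min(), X_tr_scaled.max())
```

With this ordering, held-out values may fall slightly outside [0, 1], which is expected: the scaler has never seen them.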

In [6]:
for col in categorical_cols:
    print(f"Column: {col}, Data Type: {df[col].dtype}")

Column: Gender, Data Type: int64
Column: Program, Data Type: int64
Column: Do you have meritorious scholarship ?, Data Type: int64
Column: Do you use University transportation?, Data Type: int64
Column: What is your preferable learning mode?, Data Type: int64
Column: Do you use smart phone?, Data Type: int64
Column: Do you have personal Computer?, Data Type: int64
Column: Average attendance on class, Data Type: object
Column: Did you ever fall in probation?, Data Type: object
Column: Did you ever got suspension?, Data Type: object
Column: Do you attend in teacher consultancy for any kind of academical problems?, Data Type: object
Column: What are the skills do you have ?, Data Type: object
Column: What is you interested area?, Data Type: object
Column: What is your relationship status?, Data Type: object
Column: Are you engaged with any co-curriculum activities?, Data Type: object
Column: With whom you are living with?, Data Type: object
Column: Do you have any health issues?, Data Typ

**Step 4: Train-Test Split**

In [8]:
X = df.drop(columns=[target_col])
y = df[target_col]
# 70% train, then split the remaining 30% evenly into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42)
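
The two chained calls above give roughly a 70/15/15 train/validation/test split. The sketch below reproduces the same two calls on a synthetic array of 1000 rows (an assumed size, chosen so the proportions are easy to read) and prints the resulting set sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 2 features (sizes are illustrative assumptions).
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000)

# First split: hold out 30% of the rows.
X_train_d, X_temp_d, y_train_d, y_temp_d = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)
# Second split: divide the held-out 30% evenly into validation and test.
X_val_d, X_test_d, y_val_d, y_test_d = train_test_split(
    X_temp_d, y_temp_d, test_size=0.50, random_state=42
)

print(len(X_train_d), len(X_val_d), len(X_test_d))  # 700 150 150
```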

**Step 5: Model Training and Evaluation using XGBoost, LightGBM, MLP, Random Forest, SVM**

In [10]:
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_model(name, y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    print(f"{name} => MSE: {mse:.4f}, RMSE: {rmse:.4f}, R²: {r2:.4f}")
    return {"Model": name, "MSE": mse, "RMSE": rmse, "R2": r2}
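
As a sanity check on these three metrics, the sketch below computes them for a tiny hand-made example where the values can be verified by hand. The arrays `y_true_demo` and `y_pred_demo` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Three perfect predictions and one that is off by 2.
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.0, 2.0, 3.0, 2.0])

# MSE: squared errors are 0, 0, 0, 4, so the mean is 1.0.
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
# RMSE: square root of the MSE, here 1.0.
rmse_demo = np.sqrt(mse_demo)
# R2: 1 - SS_res / SS_tot = 1 - 4 / 5 = 0.2 (the mean of y_true is 2.5).
r2_demo = r2_score(y_true_demo, y_pred_demo)

print(f"MSE: {mse_demo:.4f}, RMSE: {rmse_demo:.4f}, R2: {r2_demo:.4f}")
```

An R² of 0.2 here means the predictions explain only a fifth of the variance in the true values, even though three of the four are exact.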

**Train Models**

In [11]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
import lightgbm as lgb
import xgboost as xgb

results = []

# XGBoost
xgb_model = xgb.XGBRegressor(n_estimators=100, max_depth=4)
xgb_model.fit(X_train, y_train)
results.append(evaluate_model("XGBoost", y_test, xgb_model.predict(X_test)))

# LightGBM
lgb_model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
lgb_model.fit(X_train, y_train)
results.append(evaluate_model("LightGBM", y_test, lgb_model.predict(X_test)))

# MLP
mlp_model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=1000)
mlp_model.fit(X_train, y_train)
results.append(evaluate_model("MLP", y_test, mlp_model.predict(X_test)))

# Random Forest
rf_model = RandomForestRegressor(n_estimators=100)
rf_model.fit(X_train, y_train)
results.append(evaluate_model("Random Forest", y_test, rf_model.predict(X_test)))

# SVM
svm_model = SVR(kernel='rbf')
svm_model.fit(X_train, y_train)
results.append(evaluate_model("SVM", y_test, svm_model.predict(X_test)))

XGBoost => MSE: 0.0095, RMSE: 0.0974, R²: 0.6997
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000618 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 502
[LightGBM] [Info] Number of data points in the train set: 835, number of used features: 30
[LightGBM] [Info] Start training from score 0.788756
LightGBM => MSE: 0.0094, RMSE: 0.0971, R²: 0.7013
MLP => MSE: 0.0244, RMSE: 0.1561, R²: 0.2288
Random Forest => MSE: 0.0093, RMSE: 0.0965, R²: 0.7053
SVM => MSE: 0.0296, RMSE: 0.1721, R²: 0.0617
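
The split in Step 4 creates `X_val` and `y_val`, but the cells shown here score every model directly on the test set. One common use of a validation set is to pick a model on it and then touch the test set only once. The sketch below illustrates that workflow on synthetic data with two of the model families used above; the data, the `candidates` dictionary, and the hyperparameters are assumptions for illustration and are not the notebook's procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(300, 5))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.05, size=300)

# Same 70/15/15 pattern as in Step 4.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

candidates = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "SVM": SVR(kernel="rbf"),
}

# Fit on the training set and compare candidates on the validation set.
val_rmse = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    val_rmse[name] = float(np.sqrt(mean_squared_error(y_va, model.predict(X_va))))

# Select the candidate with the lowest validation RMSE, then score it once on the test set.
best_name = min(val_rmse, key=val_rmse.get)
test_rmse = float(np.sqrt(mean_squared_error(y_te, candidates[best_name].predict(X_te))))

print(val_rmse)
print(f"Selected: {best_name}, test RMSE: {test_rmse:.4f}")
```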


**Step 6: Summary Table of Results**

In [12]:
# Summary Table
pd.DataFrame(results)

Unnamed: 0,Model,MSE,RMSE,R2
0,XGBoost,0.009484,0.097387,0.699685
1,LightGBM,0.009432,0.097119,0.701338
2,MLP,0.024355,0.156061,0.228813
3,Random Forest,0.009306,0.096465,0.705345
4,SVM,0.029633,0.172143,0.061684
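
Because Step 3 min-max scales the target, the MSE and RMSE in this table are in scaled units rather than CGPA points (R² is unaffected by this rescaling). An RMSE can be mapped back by multiplying it by the target's original range. The sketch below shows the arithmetic with a hypothetical CGPA range of 2.0 to 4.0; the real range is not shown in the notebook output, so the resulting number is illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw CGPA values; the true minimum and maximum are assumptions.
cgpa = np.array([[2.0], [3.0], [4.0]])
target_scaler = MinMaxScaler().fit(cgpa)

# data_range_ holds (max - min) per column as seen during fit.
cgpa_range = target_scaler.data_range_[0]

rmse_scaled = 0.0965                   # Random Forest RMSE from the table above
rmse_cgpa = rmse_scaled * cgpa_range   # RMSE expressed in CGPA points

print(f"RMSE in CGPA points (assumed range {cgpa_range}): {rmse_cgpa:.3f}")
```

In the notebook, the corresponding range would come from the target's entry in the fitted `scaler` (the last column passed to `fit_transform`).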


