
In [20]:
# Q1_Scratch_Regression.py

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

# --- 1. Linear Regression Class from Scratch ---
class ScratchLinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1500):
        self.lr = learning_rate
        self.n_iters = n_iterations
        self.weights = None
        self.bias = None
        self.loss_history = []

    @staticmethod
    def mse(y_true, y_pred):
        """Mean Squared Error loss function"""
        return np.mean((y_true - y_pred)**2)

    def predict(self, X):
        """Linear model prediction: y_hat = XW + b"""
        return np.dot(X, self.weights) + self.bias

    def fit(self, X, y):
        """Trains the model using Gradient Descent"""
        n_samples, n_features = X.shape
        y = y.reshape(-1, 1) # Ensure y is a column vector

        # 1. Initialize parameters (W and b)
        self.weights = np.zeros((n_features, 1))
        self.bias = 0
        self.loss_history = []

        # 2. Gradient Descent Loop
        for _ in range(self.n_iters):
            y_pred = self.predict(X)

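            # Gradients of the half-MSE cost (1/(2n)) * sum((y_pred - y)**2);
            # the factor 2 from plain MSE is absorbed into the learning rate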
            dw = (1/n_samples) * np.dot(X.T, (y_pred - y))
            db = (1/n_samples) * np.sum(y_pred - y)

            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

            # Record loss
            cost = self.mse(y, y_pred)
            self.loss_history.append(cost)

        return self

# --- 2. Data Loading and Preprocessing ---
RSEED = 42

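DATA_PATH = '/content/drive/MyDrive/question1_house.csv'

try:
    df_house = pd.read_csv(DATA_PATH)
except FileNotFoundError as err:
    # Re-raise with a clearer message; exit() is unreliable inside a notebook kernel
    raise FileNotFoundError(f"'{DATA_PATH}' not found. Mount Google Drive first.") from err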

features_q1 = ['GrLivArea', 'OverallQual', 'TotalBsmtSF', 'GarageCars']
target_q1 = 'SalePrice'

df_q1 = df_house[features_q1 + [target_q1]].copy()

# Impute missing values
df_q1['TotalBsmtSF'] = df_q1['TotalBsmtSF'].fillna(0)
df_q1['GarageCars'] = df_q1['GarageCars'].fillna(0)
df_q1['GrLivArea'] = df_q1['GrLivArea'].fillna(df_q1['GrLivArea'].median())
df_q1 = df_q1.dropna(subset=[target_q1])

X_q1 = df_q1[features_q1].values
y_q1 = df_q1[target_q1].values.reshape(-1, 1)

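# Scaling: essential for gradient descent to converge at this learning rate.
# Note: both scalers are fit on the full dataset, i.e. before the CV split below.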
scaler_X = StandardScaler()
X_q1_scaled = scaler_X.fit_transform(X_q1)
scaler_y = StandardScaler()
y_q1_scaled = scaler_y.fit_transform(y_q1)

# --- 3. 5-Fold Cross-Validation ---
kf = KFold(n_splits=5, shuffle=True, random_state=RSEED)
r2_scores = []

print("Starting 5-Fold Cross-Validation for Scratch Linear Regression...")
print("-" * 50)

for fold, (train_index, val_index) in enumerate(kf.split(X_q1_scaled)):
    X_train, X_val = X_q1_scaled[train_index], X_q1_scaled[val_index]
    y_train, y_val = y_q1_scaled[train_index], y_q1_scaled[val_index]

    # Initialize and train the scratch model
    model_scratch = ScratchLinearRegression(learning_rate=0.01, n_iterations=1500)
    model_scratch.fit(X_train, y_train)

    # Predict and Evaluate
    y_pred_val = model_scratch.predict(X_val)
    r2 = r2_score(y_val, y_pred_val)
    r2_scores.append(r2)

    # Optional: Inverse transform for real-world MSE printing
    y_val_unscaled = scaler_y.inverse_transform(y_val)
    y_pred_unscaled = scaler_y.inverse_transform(y_pred_val)
    mse_unscaled = ScratchLinearRegression.mse(y_val_unscaled, y_pred_unscaled)

    print(f"Fold {fold+1}: R2 = {r2:.4f}, Unscaled MSE = {mse_unscaled:,.0f}")

print("-" * 50)
print(f"Average R2 Score: {np.mean(r2_scores):.4f} (+/- {np.std(r2_scores):.4f})")

# Note: Visualizations were done in the .ipynb file, not possible here.

Starting 5-Fold Cross-Validation for Scratch Linear Regression...
--------------------------------------------------
Fold 1: R2 = 0.7910, Unscaled MSE = 1,602,796,787
Fold 2: R2 = 0.8033, Unscaled MSE = 1,337,102,548
Fold 3: R2 = 0.4716, Unscaled MSE = 2,919,406,857
Fold 4: R2 = 0.7901, Unscaled MSE = 1,318,247,120
Fold 5: R2 = 0.8047, Unscaled MSE = 1,020,892,047
--------------------------------------------------
Average R2 Score: 0.7321 (+/- 0.1304)
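
The scalers in the cell above are fit on the full dataset before the split, so each validation fold contributes to the mean and standard deviation used to scale its own training data. A minimal sketch of the alternative, fitting the scalers on the training fold only, follows. It runs on synthetic data and uses scikit-learn's `LinearRegression` as a stand-in for `ScratchLinearRegression`; the helper name `cross_validate_with_fold_scaling` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler


def cross_validate_with_fold_scaling(X, y, n_splits=5, seed=42):
    """Return per-fold R2 scores, fitting the scalers on the training fold only."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in kf.split(X):
        # Scaling statistics come from the training rows of this fold only
        scaler_X = StandardScaler().fit(X[train_idx])
        scaler_y = StandardScaler().fit(y[train_idx])

        X_train = scaler_X.transform(X[train_idx])
        y_train = scaler_y.transform(y[train_idx])
        X_val = scaler_X.transform(X[val_idx])
        y_val = scaler_y.transform(y[val_idx])

        model = LinearRegression().fit(X_train, y_train)
        scores.append(r2_score(y_val, model.predict(X_val)))
    return scores


# Synthetic stand-in for the house data (4 features, nearly linear target)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo @ np.array([3.0, -2.0, 0.5, 1.0])
          + rng.normal(scale=0.1, size=200)).reshape(-1, 1)

fold_scores = cross_validate_with_fold_scaling(X_demo, y_demo)
print([round(s, 4) for s in fold_scores])
```

With only a mean and a standard deviation per column the effect is usually small, but fitting inside the fold keeps the validation rows fully unseen.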


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Task
Improve the accuracy of the regression model that predicts student performance:

1. Explore additional features in the `df_student` DataFrame.
2. Update the preprocessing pipeline to include these features.
3. Re-evaluate the baseline models with the new features.
4. Perform more extensive hyperparameter tuning on the best model.
5. Compare the final performance to the initial model.

## Feature exploration

### Subtask:
Examine the available columns in the `df_student` DataFrame (using the output from the `fcade769` cell) and identify potentially relevant features that were not included in the initial model. This could include other academic performance metrics (like other GPA scores or course marks), study habits, or demographic information.


**Reasoning**:
The column list printed by cell `fcade769` is reviewed below, and the columns that look relevant but were not used in the initial model are collected in a list.



In [21]:
# Reviewing the column list from cell fcade769 output:
# ['start_2024-03-21 10:08:42.013000', 'end_2024-03-21 10:36:50.758000', 'Year of study_Second year second semester', 'In this section, you are requested to answer questions about your social and demographic factors that are likely to influence the academic performance in examinations of biomedical sciences namely, anatomy, physiology, and biochemistry._Unnamed: 3_level_1', 'A1. How old are you? (age in complete years)_24', 'A2. What is your gender?_Male', 'A3. What is your sponsorship?_Private', 'Specify_Unnamed: 7_level_1', 'A4: If private in A3, specify source of funding_Parent/guardian', 'Specify_Unnamed: 9_level_1', 'A5. What is your religion?_Born again including Protestant', 'Specify_Unnamed: 11_level_1', 'A6: In which region in Uganda is your home district found? (where you came from to the university)_Northern', 'A7: In which level of urbanization is your home located? (where you came from to the university)_Rural', 'A8: What is the living status of your parents?. _Both alive', 'A9: What is/was the highest level of education your father attained? _Secondary level', 'A10: What is/was the highest level of education of your mother?_Never went to school', 'A11: What is/was the occupation of your father?_Formal employment', 'Specify_Unnamed: 18_level_1', 'A12: What is/was the occupation of your mother?_No employment', 'Specify_Unnamed: 20_level_1', 'A13: What is your marital status?_Married/Cohabiting', 'A14: Do you have a paying job you do while studying?_No', 'A15: How many people do you stay with while at home other than your self?_9', 'A16: Where do you live while at school?_Privately rented room', 'Specify_Unnamed: 25_level_1', 'A17: How many people do you stay with in the room while at university?_2', 'A18: Approximately, how far is it in km from where you live to the university campus? _0.5',
# 'A19: To what extent does your parent(s) or guardian encourage, guide, or motivate you to concentrate on your studies?_Neutral', 'In this section, you are requested to write the marks ( percentage score)  you scored in the following courses: anatomy, physiology, and biochemistry  at first attempt  (and subsequent attempts if the course was retaken)._Unnamed: 29_level_1', 'Anatomy 1_58', 'LGA1_D+', 'gpant1_2.5', 'cuant1_4', 'Anatomy 2_53', 'LGA2_D', 'gpant2_2', 'cuant2_4', 'Anatomy 3_Unnamed: 38_level_1', 'LGA3_Unnamed: 39_level_1', 'gpant3_Unnamed: 40_level_1', 'cuant3_Unnamed: 41_level_1', 'Anatomy 4_Unnamed: 42_level_1', 'LGA4_Unnamed: 43_level_1', 'gpant4_Unnamed: 44_level_1', 'cuant4_Unnamed: 45_level_1', 'gpaant_2.25', 'Anatomy 1_Unnamed: 47_level_1', 'Anatomy 2_Unnamed: 48_level_1', 'Anatomy 3_Unnamed: 49_level_1', 'Anatomy 4_Unnamed: 50_level_1', 'Physiology 1 _68', 'LGP1_C+', 'gpphys1_3.5', 'cuphys1_4', 'Physiology 2_63', 'LGP2_C', 'gpphys2_3', 'cuphys2_4', 'Physiology 3_Unnamed: 59_level_1', 'LGP3_Unnamed: 60_level_1', 'gpphys3_Unnamed: 61_level_1', 'cuphys3_Unnamed: 62_level_1', 'Physiology 4_Unnamed: 63_level_1', 'LGP4_Unnamed: 64_level_1', 'gpphys4_Unnamed: 65_level_1', 'cuphys4_Unnamed: 66_level_1', 'gpaphys_3.25', 'Physiology 5_Unnamed: 68_level_1', 'Physiology 6_Unnamed: 69_level_1', 'Physiology 1_Unnamed: 70_level_1', 'Physiology 2_Unnamed: 71_level_1', 'Physiology 3_Unnamed: 72_level_1', 'Physiology 4_Unnamed: 73_level_1', 'Physiology 5_Unnamed: 74_level_1', 'Physiology 6_Unnamed: 75_level_1', 'Biochemistry 1_82', 'LGB1_A', 'gpbio1_5', 'cubio1_4', 'Biochemistry 2_68', 'LGB2_C+', 'gpbio2_3.5', 'cubio2_3', 'Biochemistry 3_Unnamed: 84_level_1', 'LGB3_Unnamed: 85_level_1', 'gpbio3_Unnamed: 86_level_1', 'cubio3_Unnamed: 87_level_1', 'Biochemistry 4_Unnamed: 88_level_1', 'LGB4_Unnamed: 89_level_1', 'gpbio4_Unnamed: 90_level_1', 'cubio4_Unnamed: 91_level_1', 'gpabio_4.357142857142857', 'Biochemistry 1_Unnamed: 93_level_1', 'Biochemistry 2_Unnamed: 94_level_1',
# 'Biochemistry 3_Unnamed: 95_level_1', 'Biochemistry 4_Unnamed: 96_level_1', 'cgpa_3.285714285714286', 'This is scored on a 5-point Likert scale with Strongly Disagree (SD) = 1, Disagree (D)= 2,  Neutral (N)= 3, Agree (A)= 4, Strongly Agree (SA) = 5._Unnamed: 98_level_1', 'B2.1 I feel satisfied with my performance in anatomy        _Neutral', 'B2.2 I feel satisfied with my performance in Physiology        _Agree', 'B2.3 I feel satisfied with my performance in Biochemistry_Agree', 'B2.4 My performance is appropriate for my effort _Neutral', 'B2.5 I feel knowledgeable in these three courses                  _Agree', 'B2.6 I believe can perform better in these courses_Strongly agree', 'B2.7 I apply knowledge of these courses in patient care_Strongly agree', 'For these two statements, select the course chosen. _Unnamed: 106_level_1', 'This course was the most difficult of all (tick)_Anatomy', 'This course was the simpler of the three (tick)_Biochemistry', 'In this section, you are requested to answer questions about how you as an individual student contribute  towards academic performance in biomedical science courses namely, anatomy, physiology,  and biochemistry._Unnamed: 109_level_1', 'C1: Which type of entry scheme did you use to be admitted into this course_Advanced Level (UACE) results', 'C2: In which type of school did you complete your Ordinary Level (O-Level) of education?_Government, U.S.E', 'C3: In which type of school did you complete your Advanced Level (A-Level) of education?_Government, non U.S.E', 'C4: In which location was your Advanced Level school?_Rural', 'Specify_Unnamed: 114_level_1', 'C5: How many aggregates did you get at primary seven?_8', 'C6: How many aggregates did you get at O-Level?_31', 'Chemistry_5', 'Biology_6', 'Physics_6', 'Mathematics_2', 'English_6', 'A-Level points _9', 'CGPA at diploma (diploma entry)_Unnamed: 123_level_1', 'Not applicable                                                 _Unnamed: 124_level_1',
# 'Chemistry_D', 'Biology_O', 'Physics_Unnamed: 127_level_1', 'Mathematics_C', 'General Paper_7', 'Approx Wts_18.2', 'C10: Which choice was the nursing profession in your life_Third', 'Specify_Unnamed: 132_level_1', 'C11: To what extent are you proud to be a Bachelor nurse student?_Less extent', 'C12: Do you intend to change from nursing profession to another profession in future?_No', 'C13: If you were to choose, which nursing specialty would you prefer to practice after qualifying with BNS degree?_Nursing education/teaching', 'Specify_Unnamed: 136_level_1', 'C14: How many hours do you use to do private study or read on daily basis?_4', 'C15: During the time you studied biomedical science courses, how many hours did you use to do private study or read on daily basis? _4', 'C16: Which electronic gargets do you use mostly?_Smart phone', 'Specify_Unnamed: 140_level_1', 'C17: To what extent are you engaged in games and sports at university?_Less extent', 'C18: To what extent do you miss class/lectures_Some extent', 'C19: What is the commonest reason for missing lectures/classes?_Financial constraints', 'Specify_Unnamed: 144_level_1', 'C20: If you are to blame one person for your performance in exams of anatomy, physiology, and biochemistry; who would you blame?_Myself', 'Specify_Unnamed: 146_level_1', 'C21: Who do you believe is most responsible for your good performance in anatomy, physiology, and biochemistry?_Teachers/lecturers', 'Specify_Unnamed: 148_level_1', 'C22: How often do you read from the university library?_Rarely', 'C23: Choose only one resource you used most to study anatomy, physiology, and biochemistry_Lecture notes', 'Specify_Unnamed: 151_level_1', 'C24: While studying anatomy, physiology, and biochemistry, how often did you participate in group discussion or team learning?_Rarely', 'C25: While studying anatomy, physiology, and biochemistry, on average, how many hours did you use to sleep/rest daily?_8', 'C26: How adequate was the sleep stated in C25 above?_Not sure',
# 'C27: To what extent are you confident that you will complete BNS course on time?_Great extent', 'In this section, you are requested to answer questions about how you believe the university contributes to your academic performance in biomedical science courses namely, anatomy, physiology, and biochemistry._Unnamed: 156_level_1', 'Anatomy_At times', 'Physiology _No', 'Biochemistry_No', 'Anatomy_Great extent', 'Physiology_Great extent', 'Biochemistry_Great extent', 'Anatomy_To some extent', 'Physiology_Great extent', 'Biochemistry_Great extent', 'Anatomy_Fair', 'Physiology_Very good', 'Biochemistry_Very good', 'Anatomy_Small extent', 'Physiology_Great extent .1', 'Biochemistry_Great extent .1', 'Anatomy_to some extent', 'Physiology_great extent', 'Biochemistry_to some extent', 'Anatomy (cadaver dissection)_some times', 'Physiology_most times', 'Biochemistry_most times', 'Anatomy_Neutral', 'Physiology_Great extent .2', 'Biochemistry_Small extent', 'Anatomy_Not at all', 'Physiology_Small extent', 'Biochemistry_Not at all', 'Anatomy_Small extent.1', 'Physiology_Great extent', 'Biochemistry_Great extent.1', 'Anatomy_Lecture', 'Specify_Unnamed: 188_level_1', 'Physiology_Lecture', 'Specify_Unnamed: 190_level_1', 'Biochemistry _Lecture', 'Specify_Unnamed: 192_level_1', 'Anatomy_Tutorials', 'Specify_Unnamed: 194_level_1', 'Physiology_Lecture.1', 'Specify_Unnamed: 196_level_1', 'Biochemistry_Lecture', 'Specify_Unnamed: 198_level_1', 'Anatomy_Group presentation', 'Specify _Unnamed: 200_level_1', 'Physiology_Problem based', 'Specify_Unnamed: 202_level_1', 'Biochemistry _Problem based', 'Specify_Unnamed: 204_level_1', 'Anatomy_Face to face/Physical', 'Specify_Unnamed: 206_level_1', 'Physiology_Face2face/Physical', 'Specify_Unnamed: 208_level_1', 'Biochemistry_Face2face/Physical', 'Specify_Unnamed: 210_level_1', 'Anatomy_Face to face/Physical .1', 'Specify_Unnamed: 212_level_1', 'Physiology _Face2face/Physical', 'Select_Unnamed: 214_level_1', 'Biochemistry _Face2face/Physical',
# 'Specify_Unnamed: 216_level_1', 'Anatomy _always', 'Physiology_always', 'Biochemistry_always', 'Anatomy_No', 'Physiology _No.1', 'Biochemistry _No', 'Anatomy _No', 'Physiology_No', 'Biochemistry_No.1', 'Anatomy _never', 'Physiology_some times', 'Biochemistry_never', 'Anatomy_to some extent.1', 'Physiology _great extent', 'Biochemistry_great extent', 'Anatomy_Not at all.1', 'Physiology _Small extent', 'Biochemistry _Small extent', 'Anatomy_Small extent', 'Physiology_Small extent', 'Biochemistry _Small extent.1', 'Anatomy_Neutral', 'Physiology_Great extent .3', 'Biochemistry_Great extent .2', 'Anatomy_No.1', 'Physiology_No.1', 'Biochemistry_No.2', 'Anatomy_Unnamed: 244_level_1', 'Physiology_Unnamed: 245_level_1', 'Biochemistry_Unnamed: 246_level_1', 'D26: To what extent is the university library accessible to you? _Great extent', 'D27: To what extent is the university library stocked with very good books of biomedical sciences named anatomy, physiology and biochemistry? _Great extent', 'Thank you!_Unnamed: 249_level_1', 'D29: Does the university have online resources such as e-books, magazines, and videos? _Unnamed: 250_level_1', 'D30: If yes in D29 above, to what extent are online learning resources accessible?_Unnamed: 251_level_1', 'A5: In which region in Uganda is your home district found? (where you came from to the university)_Unnamed: 252_level_1', 'A6: In which level of urbanization is your home located? (where you came from to the university)_Unnamed: 253_level_1', 'A8: If No in A7 above, is one of the parents a live or both died?_Unnamed: 254_level_1', 'A9: What is/was the highest level of education your father/guardian attained? _Unnamed: 255_level_1',
# 'A15: How many people do you stay with at home?_Unnamed: 256_level_1', 'A16: Where do you live while at school?_Unnamed: 257_level_1', 'A18: How far is it from where you live to the University?_Unnamed: 258_level_1', 'C21: If you were to blame one person for your performance in exams of anatomy?_Unnamed: 259_level_1', 'A18: Approximately, how far is it km from where you live to the university campus? _Unnamed: 260_level_1', 'Did you get a retake?_Unnamed: 261_level_1', 'Did you g...']

# Initial features used: 'Age', 'Gender', 'Region', 'Urbanisation', 'Father_Education', 'Entry_Scheme', 'Study_Hours_Daily', 'Sleep_Hours_Daily'

potential_features = [
    'Year of study_Second year second semester', # Year of study
    'A3. What is your sponsorship?_Private', # Sponsorship
    'A4: If private in A3, specify source of funding_Parent/guardian', # Source of funding
    'A5. What is your religion?_Born again including Protestant', # Religion
    'A8: What is the living status of your parents?. _Both alive', # Living status of parents
    'A10: What is/was the highest level of education of your mother?_Never went to school', # Mother's education
    'A11: What is/was the occupation of your father?_Formal employment', # Father's occupation
    'A12: What is/was the occupation of your mother?_No employment', # Mother's occupation
    'A13: What is your marital status?_Married/Cohabiting', # Marital status
    'A14: Do you have a paying job you do while studying?_No', # Paying job while studying
    'A15: How many people do you stay with while at home other than your self?_9', # People at home
    'A16: Where do you live while at school?_Privately rented room', # Living situation at school
    'A17: How many people do you stay with in the room while at university?_2', # People in room at university
    'A18: Approximately, how far is it in km from where you live to the university campus? _0.5', # Distance to university
    'A19: To what extent does your parent(s) or guardian encourage, guide, or motivate you to concentrate on your studies?_Neutral', # Parental encouragement
    'Anatomy 1_58', 'Anatomy 2_53', # Anatomy course marks
    'gpant1_2.5', 'gpant2_2', 'gpaant_2.25', # Anatomy GPA
    'Physiology 1 _68', 'Physiology 2_63', # Physiology course marks
    'gpphys1_3.5', 'gpphys2_3', 'gpaphys_3.25', # Physiology GPA
    'Biochemistry 1_82', 'Biochemistry 2_68', # Biochemistry course marks (excluding the target gpabio)
    'gpbio1_5', 'gpbio2_3.5', # Biochemistry GPA (excluding the target gpabio)
    'cgpa_3.285714285714286', # Cumulative GPA
    'B2.1 I feel satisfied with my performance in anatomy        _Neutral', # Satisfaction with performance
    'B2.2 I feel satisfied with my performance in Physiology        _Agree',
    'B2.3 I feel satisfied with my performance in Biochemistry_Agree',
    'B2.4 My performance is appropriate for my effort _Neutral',
    'B2.5 I feel knowledgeable in these three courses                  _Agree',
    'B2.6 I believe can perform better in these courses_Strongly agree',
    'B2.7 I apply knowledge of these courses in patient care_Strongly agree',
    'This course was the most difficult of all (tick)_Anatomy', # Most difficult course
    'This course was the simpler of the three (tick)_Biochemistry', # Simpler course
    'C2: In which type of school did you complete your Ordinary Level (O-Level) of education?_Government, U.S.E', # O-Level school type
    'C3: In which type of school did you complete your Advanced Level (A-Level) of education?_Government, non U.S.E', # A-Level school type
    'C4: In which location was your Advanced Level school?_Rural', # A-Level school location
    'C5: How many aggregates did you get at primary seven?_8', # Primary aggregates
    'C6: How many aggregates did you get at O-Level?_31', # O-Level aggregates
    'Chemistry_5', 'Biology_6', 'Physics_6', 'Mathematics_2', 'English_6', # O-Level subject marks
    'A-Level points _9', # A-Level points
    'Chemistry_D', 'Biology_O', 'Mathematics_C', 'General Paper_7', 'Approx Wts_18.2', # A-Level subject grades/points
    'C10: Which choice was the nursing profession in your life_Third', # Choice of nursing profession
    'C11: To what extent are you proud to be a Bachelor nurse student?_Less extent', # Pride in being nursing student
    'C12: Do you intend to change from nursing profession to another profession in future?_No', # Intention to change profession
    'C13: If you were to choose, which nursing specialty would you prefer to practice after qualifying with BNS degree?_Nursing education/teaching', # Preferred nursing specialty
    'C15: During the time you studied biomedical science courses, how many hours did you use to do private study or read on daily basis? _4', # Study hours during biomedical courses
    'C16: Which electronic gargets do you use mostly?_Smart phone', # Electronic gadgets used
    'C17: To what extent are you engaged in games and sports at university?_Less extent', # Engagement in games/sports
    'C18: To what extent do you miss class/lectures_Some extent', # Extent of missing classes
    'C19: What is the commonest reason for missing lectures/classes?_Financial constraints', # Reason for missing classes
    'C20: If you are to blame one person for your performance in exams of anatomy, physiology, and biochemistry; who would you blame?_Myself', # Blame for performance
    'C21: Who do you believe is most responsible for your good performance in anatomy, physiology, and biochemistry?_Teachers/lecturers', # Responsible for good performance
    'C22: How often do you read from the university library?_Rarely', # Frequency of reading from library
    'C23: Choose only one resource you used most to study anatomy, physiology, and biochemistry_Lecture notes', # Most used study resource
    'C24: While studying anatomy, physiology, and biochemistry, how often did you participate in group discussion or team learning?_Rarely', # Participation in group discussion
    'C26: How adequate was the sleep stated in C25 above?_Not sure', # Adequacy of sleep
    'C27: To what extent are you confident that you will complete BNS course on time?_Great extent', # Confidence in completing course on time
    'D26: To what extent is the university library accessible to you? _Great extent', # University library accessibility
    'D27: To what extent is the university library stocked with very good books of biomedical sciences named anatomy, physiology and biochemistry? _Great extent', # University library stock
    'D29: Does the university have online resources such as e-books, magazines, and videos? _Unnamed: 250_level_1', # Availability of online resources
    'D30: If yes in D29 above, to what extent are online learning resources accessible?_Unnamed: 251_level_1', # Accessibility of online resources
]

print("Potentially relevant features not in the initial model:")
for feature in potential_features:
    print(f"- {feature}")

Potentially relevant features not in the initial model:
- Year of study_Second year second semester
- A3. What is your sponsorship?_Private
- A4: If private in A3, specify source of funding_Parent/guardian
- A5. What is your religion?_Born again including Protestant
- A8: What is the living status of your parents?. _Both alive
- A10: What is/was the highest level of education of your mother?_Never went to school
- A11: What is/was the occupation of your father?_Formal employment
- A12: What is/was the occupation of your mother?_No employment
- A13: What is your marital status?_Married/Cohabiting
- A14: Do you have a paying job you do while studying?_No
- A15: How many people do you stay with while at home other than your self?_9
- A16: Where do you live while at school?_Privately rented room
- A17: How many people do you stay with in the room while at university?_2
- A18: Approximately, how far is it in km from where you live to the university campus? _0.5
- A19: To what extent does yo
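
Several of the candidates above are GPA aggregates, and the comment on the Biochemistry entries names `gpabio` as the target. The header-row values embedded in the column names suggest that `cgpa` is the plain mean of the three course GPAs, in which case the target could be recovered exactly from `cgpa`, `gpaant`, and `gpaphys`. The check below uses only those four header values; whether the same relationship holds for every row of `df_student` is an assumption that would need to be verified on the data.

```python
import math

# Values read from the flattened column names (first response row)
gpaant = 2.25
gpaphys = 3.25
gpabio = 4.357142857142857   # target
cgpa = 3.285714285714286

# If cgpa is the mean of the three course GPAs, the target follows from the others
mean_of_course_gpas = (gpaant + gpaphys + gpabio) / 3
recovered_gpabio = 3 * cgpa - gpaant - gpaphys

cgpa_is_mean = math.isclose(mean_of_course_gpas, cgpa, rel_tol=1e-9)
target_is_recoverable = math.isclose(recovered_gpabio, gpabio, rel_tol=1e-9)
print(cgpa_is_mean, target_is_recoverable)
```

If this holds across rows, a model that receives `cgpa` together with `gpaant` and `gpaphys` is effectively handed the target, so an ablation without `cgpa` would show how much of any improvement comes from the other new features.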

In [9]:
# Initial model results from cell ZhoBJ6dgBlyZ output:
initial_test_rmse = 0.9155
initial_test_r2 = 0.0121

# Final optimized model with new features results from the previous step (final evaluation subtask):
final_test_rmse_new = final_rmse_new # variable from previous step
final_test_r2_new = final_r2_new # variable from previous step

print("--- Model Performance Comparison ---")
print(f"Initial Model (Baseline GBR on limited features):")
print(f"  Test RMSE: {initial_test_rmse:.4f}")
print(f"  Test R2 Score: {initial_test_r2:.4f}")
print("-" * 35)
print(f"Optimized Model (Tuned GBR on expanded features):")
print(f"  Test RMSE: {final_test_rmse_new:.4f}")
print(f"  Test R2 Score: {final_test_r2_new:.4f}")
print("-" * 35)

# Determine if accuracy improved
rmse_improved = final_test_rmse_new < initial_test_rmse
r2_improved = final_test_r2_new > initial_test_r2

print(f"RMSE Improvement: {rmse_improved}")
print(f"R2 Score Improvement: {r2_improved}")

if rmse_improved and r2_improved:
    print("\nConclusion: Adding new features and hyperparameter tuning improved test-set accuracy (lower RMSE, higher R2).")
elif rmse_improved:
    print("\nConclusion: Adding new features and hyperparameter tuning decreased RMSE, but R2 score did not increase.")
elif r2_improved:
    print("\nConclusion: Adding new features and hyperparameter tuning increased R2 score, but RMSE did not decrease.")
else:
    print("\nConclusion: Adding new features and hyperparameter tuning did not improve model accuracy based on RMSE and R2 score.")

--- Model Performance Comparison ---
Initial Model (Baseline GBR on limited features):
  Test RMSE: 0.9155
  Test R2 Score: 0.0121
-----------------------------------
Optimized Model (Tuned GBR on expanded features):
  Test RMSE: 0.4428
  Test R2 Score: 0.7689
-----------------------------------
RMSE Improvement: True
R2 Score Improvement: True

Conclusion: Adding new features and hyperparameter tuning improved test-set accuracy (lower RMSE, higher R2).
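
When two models are scored on the same test targets, RMSE and R2 are tied together by `R2 = 1 - RMSE**2 / Var(y_test)` (population variance), so a lower RMSE implies a higher R2 and the two flags above cannot disagree. The sketch below uses only the four numbers reported above to back out the test-set variance implied by each model; the helper name `implied_target_variance` is hypothetical.

```python
def implied_target_variance(rmse, r2):
    """Solve R2 = 1 - RMSE**2 / Var(y) for Var(y)."""
    return rmse ** 2 / (1.0 - r2)


var_initial = implied_target_variance(0.9155, 0.0121)
var_final = implied_target_variance(0.4428, 0.7689)
print(f"Implied Var(y_test): initial={var_initial:.4f}, final={var_final:.4f}")
```

The two implied variances agree to about three decimal places, which is consistent with both models having been evaluated on the same `y_test`.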


In [8]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# 1. Get the best estimator from the random_search_extensive object
best_gbr_optimized = random_search_extensive.best_estimator_

# 2. Make predictions on the held-out test set X_test_new
y_pred_test_new = best_gbr_optimized.predict(X_test_new)

# 3. Calculate the Mean Squared Error (MSE)
mse_new = mean_squared_error(y_test, y_pred_test_new)

# 4. Calculate the Root Mean Squared Error (RMSE)
final_rmse_new = np.sqrt(mse_new)

# 5. Calculate the R2 score
final_r2_new = r2_score(y_test, y_pred_test_new)

# 6. Print the results
print("\n--- Final Test Set Evaluation (Optimized GBR with New Features) ---")
print(f"Test RMSE: {final_rmse_new:.4f}")
print(f"Test R2 Score: {final_r2_new:.4f}")


--- Final Test Set Evaluation (Optimized GBR with New Features) ---
Test RMSE: 0.4428
Test R2 Score: 0.7689
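
A single held-out score is easier to interpret next to the cross-validated score that selected the model. With `scoring='neg_mean_squared_error'`, `best_score_` is the negated mean of the fold MSEs, so its RMSE-scale counterpart is `sqrt(-best_score_)`. A minimal sketch on synthetic data; the dataset, search space, and variable names are illustrative and are not the notebook's.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Small synthetic regression problem standing in for the student data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={"n_estimators": randint(20, 60), "max_depth": randint(2, 4)},
    n_iter=4, cv=3, scoring="neg_mean_squared_error", random_state=0,
)
search.fit(X_tr, y_tr)

# best_score_ is negative MSE; flip the sign before taking the square root
cv_rmse = np.sqrt(-search.best_score_)
test_rmse = np.sqrt(mean_squared_error(y_te, search.best_estimator_.predict(X_te)))
print(f"CV RMSE: {cv_rmse:.2f}, test RMSE: {test_rmse:.2f}")
```

A test RMSE far below the CV RMSE, or far above it, is usually a cue to look again at the split or the features.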


## Hyperparameter tuning (best model)

### Subtask:
Perform more extensive hyperparameter tuning on the best performing model from the previous step. This might involve a wider search space or more iterations in `RandomizedSearchCV`, or even using `GridSearchCV` if the search space is manageable.

**Reasoning**:
Identify the best baseline model, define a more extensive hyperparameter distribution for the Gradient Boosting Regressor, instantiate `RandomizedSearchCV` with the updated pipeline, the new distribution, and more iterations, and fit it to the training data.

In [7]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from scipy.stats import randint, uniform

# 1. The best baseline model is Gradient Boosting Regressor, as identified in the previous step.

# 2. Define a more extensive hyperparameter distribution for Gradient Boosting Regressor
param_dist_extensive = {
    'regressor__n_estimators': randint(100, 600), # Integers 100..599 (wider range)
    'regressor__learning_rate': uniform(0.005, 0.3), # uniform(loc, scale): samples from [0.005, 0.305]
    'regressor__max_depth': randint(3, 8), # Integers 3..7 (deeper trees)
    'regressor__min_samples_split': randint(2, 20), # Integers 2..19 (wider range)
    'regressor__min_samples_leaf': randint(1, 10), # Integers 1..9 (newly added)
    'regressor__subsample': uniform(0.6, 0.4) # Samples from [0.6, 1.0] (newly added)
}

# The preprocessor_new and X_train_new, y_train are already defined in previous steps.

# 3. Instantiate RandomizedSearchCV with the updated pipeline and extensive parameter distribution
gbr_pipeline_new = Pipeline(steps=[('preprocessor', preprocessor_new),
                                ('regressor', GradientBoostingRegressor(random_state=RSEED))])

random_search_extensive = RandomizedSearchCV(gbr_pipeline_new, param_distributions=param_dist_extensive,
                                   n_iter=100, cv=5, scoring='neg_mean_squared_error', # Increased iterations
                                   verbose=0, random_state=RSEED, n_jobs=-1)

# 4. Fit the RandomizedSearchCV object to the training data
print("Performing extensive hyperparameter optimization for Gradient Boosting...")
random_search_extensive.fit(X_train_new, y_train)

# 5. Print the best parameters found by RandomizedSearchCV
print("\n--- Extensive Hyperparameter Optimization Results ---")
print(f"Best CV Parameters: {random_search_extensive.best_params_}")

Performing extensive hyperparameter optimization for Gradient Boosting...

--- Extensive Hyperparameter Optimization Results ---
Best CV Parameters: {'regressor__learning_rate': np.float64(0.03593716065077978), 'regressor__max_depth': 3, 'regressor__min_samples_leaf': 1, 'regressor__min_samples_split': 17, 'regressor__n_estimators': 316, 'regressor__subsample': np.float64(0.7280198404122447)}
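
The search space above relies on SciPy's parameterization: `uniform(loc, scale)` covers `[loc, loc + scale]`, and `randint(low, high)` draws integers from `low` up to `high - 1`. The sketch below samples the same three distributions to confirm the effective ranges; the sample size and seed are arbitrary choices.

```python
from scipy.stats import randint, uniform

# Same distributions as in param_dist_extensive
learning_rates = uniform(0.005, 0.3).rvs(size=10_000, random_state=0)
subsamples = uniform(0.6, 0.4).rvs(size=10_000, random_state=0)
depths = randint(3, 8).rvs(size=10_000, random_state=0)

print(f"learning_rate: {learning_rates.min():.3f} to {learning_rates.max():.3f}")
print(f"subsample:     {subsamples.min():.3f} to {subsamples.max():.3f}")
print(f"max_depth:     {depths.min()} to {depths.max()}")
```

In particular, `uniform(0.6, 0.4)` keeps `subsample` within `[0.6, 1.0]`, which matters because `GradientBoostingRegressor` rejects values above 1.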


**Reasoning**:
Select and rename the features for the model, then identify the numerical and categorical columns and convert relevant columns to numeric.
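
A minimal sketch of this step on a three-row toy frame. The short names (`Age`, `Gender`, `Study_Hours_Daily`) follow the initial feature list; the `rename_map` dictionary, the toy values, and the use of `errors='coerce'` are assumptions for illustration rather than the notebook's actual code.

```python
import pandas as pd

# Toy frame using three of the flattened survey column names
toy = pd.DataFrame({
    'A1. How old are you? (age in complete years)_24': ['23', '25', 'n/a'],
    'A2. What is your gender?_Male': ['Female', 'Male', 'Female'],
    'C14: How many hours do you use to do private study or read on daily basis?_4': ['3', '5', '2'],
})

# Hypothetical mapping from long survey headers to short feature names
rename_map = {
    'A1. How old are you? (age in complete years)_24': 'Age',
    'A2. What is your gender?_Male': 'Gender',
    'C14: How many hours do you use to do private study or read on daily basis?_4': 'Study_Hours_Daily',
}
X = toy.rename(columns=rename_map)

# Columns expected to be numeric; unparseable entries become NaN for later imputation
numeric_candidates = ['Age', 'Study_Hours_Daily']
X[numeric_candidates] = X[numeric_candidates].apply(pd.to_numeric, errors='coerce')

# Split the columns by dtype for the preprocessing pipeline
numerical_cols = X.select_dtypes(include='number').columns.tolist()
categorical_cols = X.select_dtypes(exclude='number').columns.tolist()
print(numerical_cols, categorical_cols)
```

Coercing instead of raising keeps free-text survey answers from aborting the conversion, at the cost of extra missing values that the imputer then has to handle.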



In [10]:
# 1. Define the new set of selected features including original and new ones
selected_features_q2 = [
    'A1. How old are you? (age in complete years)_24', # Original: Age
    'A2. What is your gender?_Male', # Original: Gender
    'A6: In which region in Uganda is your home district found? (where you came from to the university)_Northern', # Original: Region
    'A7: In which level of urbanization is your home located? (where you came from to the university)_Rural', # Original: Urbanisation
    'A9: What is/was the highest level of education your father attained? _Secondary level', # Original: Father_Education
    'C1: Which type of entry scheme did you use to be admitted into this course_Advanced Level (UACE) results', # Original: Entry_Scheme
    'C14: How many hours do you use to do private study or read on daily basis?_4', # Original: Study_Hours_Daily
    'C25: While studying anatomy, physiology, and biochemistry, on average, how many hours did you use to sleep/rest daily?_8', # Original: Sleep_Hours_Daily

    # Added features based on exploration
    'Year of study_Second year second semester', # Year of study
    'A3. What is your sponsorship?_Private', # Sponsorship
    'A5. What is your religion?_Born again including Protestant', # Religion
    'A8: What is the living status of your parents?. _Both alive', # Living status of parents
    'A10: What is/was the highest level of education of your mother?_Never went to school', # Mother's education
    'A13: What is your marital status?_Married/Cohabiting', # Marital status
    'A14: Do you have a paying job you do while studying?_No', # Paying job while studying
    'A15: How many people do you stay with while at home other than your self?_9', # People at home
    'A17: How many people do you stay with in the room while at university?_2', # People in room at university
    'A18: Approximately, how far is it in km from where you live to the university campus? _0.5', # Distance to university
    'A19: To what extent does your parent(s) or guardian encourage, guide, or motivate you to concentrate on your studies?_Neutral', # Parental encouragement
    'gpaant_2.25', # Anatomy GPA
    'gpaphys_3.25', # Physiology GPA
    'cgpa_3.285714285714286', # Cumulative GPA
    'B2.1 I feel satisfied with my performance in anatomy        _Neutral', # Satisfaction with performance (Anatomy)
    'B2.2 I feel satisfied with my performance in Physiology        _Agree', # Satisfaction with performance (Physiology)
    'B2.3 I feel satisfied with my performance in Biochemistry_Agree', # Satisfaction with performance (Biochemistry)
    'B2.4 My performance is appropriate for my effort _Neutral', # Performance vs Effort
    'B2.5 I feel knowledgeable in these three courses                  _Agree', # Knowledgeable in courses
    'B2.6 I believe can perform better in these courses_Strongly agree', # Belief in better performance
    'B2.7 I apply knowledge of these courses in patient care_Strongly agree', # Application of knowledge
    'This course was the most difficult of all (tick)_Anatomy', # Most difficult course
    'This course was the simpler of the three (tick)_Biochemistry', # Simpler course
    'C2: In which type of school did you complete your Ordinary Level (O-Level) of education?_Government, U.S.E', # O-Level school type
    'C3: In which type of school did you complete your Advanced Level (A-Level) of education?_Government, non U.S.E', # A-Level school type
    'C4: In which location was your Advanced Level school?_Rural', # A-Level school location
    'C5: How many aggregates did you get at primary seven?_8', # Primary aggregates
    'C6: How many aggregates did you get at O-Level?_31', # O-Level aggregates
    'A-Level points _9', # A-Level points
    'Approx Wts_18.2', # Approx Wts (related to A-Level)
    'C10: Which choice was the nursing profession in your life_Third', # Choice of nursing profession
    'C11: To what extent are you proud to be a Bachelor nurse student?_Less extent', # Pride in being nursing student
    'C12: Do you intend to change from nursing profession to another profession in future?_No', # Intention to change profession
    'C13: If you were to choose, which nursing specialty would you prefer to practice after qualifying with BNS degree?_Nursing education/teaching', # Preferred nursing specialty
    'C15: During the time you studied biomedical science courses, how many hours did you use to do private study or read on daily basis? _4', # Study hours during biomedical courses
    'C16: Which electronic gargets do you use mostly?_Smart phone', # Electronic gadgets used
    'C17: To what extent are you engaged in games and sports at university?_Less extent', # Engagement in games/sports
    'C18: To what extent do you miss class/lectures_Some extent', # Extent of missing classes
    'C19: What is the commonest reason for missing lectures/classes?_Financial constraints', # Reason for missing classes
    'C20: If you are to blame one person for your performance in exams of anatomy, physiology, and biochemistry; who would you blame?_Myself', # Blame for performance
    'C21: Who do you believe is most responsible for your good performance in anatomy, physiology, and biochemistry?_Teachers/lecturers', # Responsible for good performance
    'C22: How often do you read from the university library?_Rarely', # Frequency of reading from library
    'C23: Choose only one resource you used most to study anatomy, physiology, and biochemistry_Lecture notes', # Most used study resource
    'C24: While studying anatomy, physiology, and biochemistry, how often did you participate in group discussion or team learning?_Rarely', # Participation in group discussion
    'C26: How adequate was the sleep stated in C25 above?_Not sure', # Adequacy of sleep
    'C27: To what extent are you confident that you will complete BNS course on time?_Great extent', # Confidence in completing course on time
    'D26: To what extent is the university library accessible to you? _Great extent', # University library accessibility
    'D27: To what extent is the university library stocked with very good books of biomedical sciences named anatomy, physiology and biochemistry? _Great extent', # University library stock
]


# 2. Create a dictionary for renaming features
feature_rename_map_q2 = {
    'A1. How old are you? (age in complete years)_24': 'Age',
    'A2. What is your gender?_Male': 'Gender',
    'A6: In which region in Uganda is your home district found? (where you came from to the university)_Northern': 'Region',
    'A7: In which level of urbanization is your home located? (where you came from to the university)_Rural': 'Urbanisation',
    'A9: What is/was the highest level of education your father attained? _Secondary level': 'Father_Education',
    'C1: Which type of entry scheme did you use to be admitted into this course_Advanced Level (UACE) results': 'Entry_Scheme',
    'C14: How many hours do you use to do private study or read on daily basis?_4': 'Study_Hours_Daily',
    'C25: While studying anatomy, physiology, and biochemistry, on average, how many hours did you use to sleep/rest daily?_8': 'Sleep_Hours_Daily',

    'Year of study_Second year second semester': 'Year_of_Study',
    'A3. What is your sponsorship?_Private': 'Sponsorship',
    'A5. What is your religion?_Born again including Protestant': 'Religion',
    'A8: What is the living status of your parents?. _Both alive': 'Parents_Living_Status',
    'A10: What is/was the highest level of education of your mother?_Never went to school': 'Mother_Education',
    'A13: What is your marital status?_Married/Cohabiting': 'Marital_Status',
    'A14: Do you have a paying job you do while studying?_No': 'Paying_Job',
    'A15: How many people do you stay with while at home other than your self?_9': 'People_at_Home',
    'A17: How many people do you stay with in the room while at university?_2': 'People_in_Room',
    'A18: Approximately, how far is it in km from where you live to the university campus? _0.5': 'Distance_to_University_km',
    'A19: To what extent does your parent(s) or guardian encourage, guide, or motivate you to concentrate on your studies?_Neutral': 'Parental_Encouragement',
    'gpaant_2.25': 'Anatomy_GPA',
    'gpaphys_3.25': 'Physiology_GPA',
    'cgpa_3.285714285714286': 'CGPA',
    'B2.1 I feel satisfied with my performance in anatomy        _Neutral': 'Satisfaction_Anatomy',
    'B2.2 I feel satisfied with my performance in Physiology        _Agree': 'Satisfaction_Physiology',
    'B2.3 I feel satisfied with my performance in Biochemistry_Agree': 'Satisfaction_Biochemistry',
    'B2.4 My performance is appropriate for my effort _Neutral': 'Performance_vs_Effort',
    'B2.5 I feel knowledgeable in these three courses                  _Agree': 'Knowledgeable_in_Courses',
    'B2.6 I believe can perform better in these courses_Strongly agree': 'Belief_in_Better_Performance',
    'B2.7 I apply knowledge of these courses in patient care_Strongly agree': 'Application_of_Knowledge',
    'This course was the most difficult of all (tick)_Anatomy': 'Most_Difficult_Course',
    'This course was the simpler of the three (tick)_Biochemistry': 'Simpler_Course',
    'C2: In which type of school did you complete your Ordinary Level (O-Level) of education?_Government, U.S.E': 'O_Level_School_Type',
    'C3: In which type of school did you complete your Advanced Level (A-Level) of education?_Government, non U.S.E': 'A_Level_School_Type',
    'C4: In which location was your Advanced Level school?_Rural': 'A_Level_School_Location',
    'C5: How many aggregates did you get at primary seven?_8': 'Primary_Aggregates',
    'C6: How many aggregates did you get at O-Level?_31': 'O_Level_Aggregates',
    'A-Level points _9': 'A_Level_Points',
    'Approx Wts_18.2': 'Approx_Wts',
    'C10: Which choice was the nursing profession in your life_Third': 'Choice_Nursing_Profession',
    'C11: To what extent are you proud to be a Bachelor nurse student?_Less extent': 'Pride_Nursing_Student',
    'C12: Do you intend to change from nursing profession to another profession in future?_No': 'Intent_Change_Profession',
    'C13: If you were to choose, which nursing specialty would you prefer to practice after qualifying with BNS degree?_Nursing education/teaching': 'Preferred_Nursing_Specialty',
    'C15: During the time you studied biomedical science courses, how many hours did you use to do private study or read on daily basis? _4': 'Study_Hours_Biomedical',
    'C16: Which electronic gargets do you use mostly?_Smart phone': 'Electronic_Gadgets',
    'C17: To what extent are you engaged in games and sports at university?_Less extent': 'Engagement_Games_Sports',
    'C18: To what extent do you miss class/lectures_Some extent': 'Extent_Missing_Classes',
    'C19: What is the commonest reason for missing lectures/classes?_Financial constraints': 'Reason_Missing_Classes',
    'C20: If you are to blame one person for your performance in exams of anatomy, physiology, and biochemistry; who would you blame?_Myself': 'Blame_for_Performance',
    'C21: Who do you believe is most responsible for your good performance in anatomy, physiology, and biochemistry?_Teachers/lecturers': 'Responsible_for_Good_Performance',
    'C22: How often do you read from the university library?_Rarely': 'Frequency_Library_Use',
    'C23: Choose only one resource you used most to study anatomy, physiology, and biochemistry_Lecture notes': 'Most_Used_Study_Resource',
    'C24: While studying anatomy, physiology, and biochemistry, how often did you participate in group discussion or team learning?_Rarely': 'Group_Discussion_Participation',
    'C26: How adequate was the sleep stated in C25 above?_Not sure': 'Adequacy_of_Sleep',
    'C27: To what extent are you confident that you will complete BNS course on time?_Great extent': 'Confidence_Complete_Course',
    'D26: To what extent is the university library accessible to you? _Great extent': 'Library_Accessibility',
    'D27: To what extent is the university library stocked with very good books of biomedical sciences named anatomy, physiology and biochemistry? _Great extent': 'Library_Stock'
}

# 3. Select and rename columns
X_new_q2 = df_q2[selected_features_q2].rename(columns=feature_rename_map_q2)

# 4. Identify numerical and categorical columns
numerical_cols_new = [
    'Age',
    'Study_Hours_Daily',
    'Sleep_Hours_Daily',
    'People_at_Home', # New numerical feature
    'People_in_Room', # New numerical feature
    'Distance_to_University_km', # New numerical feature
    'Anatomy_GPA', # New numerical feature
    'Physiology_GPA', # New numerical feature
    'CGPA', # New numerical feature
    'Primary_Aggregates', # New numerical feature
    'O_Level_Aggregates', # New numerical feature
    'A_Level_Points', # New numerical feature
    'Approx_Wts', # New numerical feature
    'Study_Hours_Biomedical' # New numerical feature
]

categorical_cols_new = [
    'Gender',
    'Region',
    'Urbanisation',
    'Father_Education',
    'Entry_Scheme',
    'Year_of_Study', # New categorical feature
    'Sponsorship', # New categorical feature
    'Religion', # New categorical feature
    'Parents_Living_Status', # New categorical feature
    'Mother_Education', # New categorical feature
    'Marital_Status', # New categorical feature
    'Paying_Job', # New categorical feature
    'Parental_Encouragement', # New categorical feature
    'Satisfaction_Anatomy', # New categorical feature
    'Satisfaction_Physiology', # New categorical feature
    'Satisfaction_Biochemistry', # New categorical feature
    'Performance_vs_Effort', # New categorical feature
    'Knowledgeable_in_Courses', # New categorical feature
    'Belief_in_Better_Performance', # New categorical feature
    'Application_of_Knowledge', # New categorical feature
    'Most_Difficult_Course', # New categorical feature
    'Simpler_Course', # New categorical feature
    'O_Level_School_Type', # New categorical feature
    'A_Level_School_Type', # New categorical feature
    'A_Level_School_Location', # New categorical feature
    'Choice_Nursing_Profession', # New categorical feature
    'Pride_Nursing_Student', # New categorical feature
    'Intent_Change_Profession', # New categorical feature
    'Preferred_Nursing_Specialty', # New categorical feature
    'Electronic_Gadgets', # New categorical feature
    'Engagement_Games_Sports', # New categorical feature
    'Extent_Missing_Classes', # New categorical feature
    'Reason_Missing_Classes', # New categorical feature
    'Blame_for_Performance', # New categorical feature
    'Responsible_for_Good_Performance', # New categorical feature
    'Frequency_Library_Use', # New categorical feature
    'Most_Used_Study_Resource', # New categorical feature
    'Group_Discussion_Participation', # New categorical feature
    'Adequacy_of_Sleep', # New categorical feature
    'Confidence_Complete_Course', # New categorical feature
    'Library_Accessibility', # New categorical feature
    'Library_Stock' # New categorical feature
]


# 5. Convert numerical columns to numeric, coercing errors
for col in numerical_cols_new:
    X_new_q2[col] = pd.to_numeric(X_new_q2[col], errors='coerce')

display(X_new_q2.head())

Unnamed: 0,Age,Gender,Region,Urbanisation,Father_Education,Entry_Scheme,Study_Hours_Daily,Sleep_Hours_Daily,Year_of_Study,Sponsorship,...,Reason_Missing_Classes,Blame_for_Performance,Responsible_for_Good_Performance,Frequency_Library_Use,Most_Used_Study_Resource,Group_Discussion_Participation,Adequacy_of_Sleep,Confidence_Complete_Course,Library_Accessibility,Library_Stock
0,21.0,Female,Eastern,Rural,Never went to school,Advanced Level (UACE) results ...,4.0,8.0,Second year second semester,Private,...,Sickness,Myself,Myself,Never,Videos,Often,Adequate,Neutral,Great extent,Great extent
1,22.0,Male,Western,Rural,Primary level,Advanced Level (UACE) results ...,2.0,8.0,Second year second semester,Private,...,Sickness,Teachers/lecturers,Myself,Often,Textbooks,Often,Not sure,Less extent,Small extent,Great extent
2,23.0,Female,Central,Peri Urban (small towns),Primary level,Advanced Level (UACE) results ...,2.0,6.0,Second year second semester,Private,...,Other (specify),Myself,Classmates or friends,Sometimes,Textbooks,Very often,Adequate,Neutral,Great extent,Small extent
3,23.0,Male,Western,Rural,Never went to school,Advanced Level (UACE) results ...,5.0,6.0,Second year second semester,Government,...,,Myself,Classmates or friends,Often,Textbooks,Often,Adequate,Great extent,Great extent,Great extent
4,23.0,Male,Western,Rural,Never went to school,Advanced Level (UACE) results ...,5.0,6.0,Second year second semester,Government,...,,Myself,Classmates or friends,Often,Textbooks,Often,Adequate,Great extent,Great extent,Great extent
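
With the numerical and categorical column lists identified, a common next step is a `ColumnTransformer` that treats the two groups differently. The following is a minimal sketch on a toy frame, assuming median imputation plus scaling for numeric columns and most-frequent imputation plus one-hot encoding for categorical ones. The toy data and these preprocessing choices are assumptions, not necessarily what the notebook does downstream.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (assumption) with a few of the renamed columns and some dirty values.
toy = pd.DataFrame({
    "Age": [21, 22, np.nan, 23],
    "Study_Hours_Daily": ["4", "2", "two", "5"],   # "two" cannot be parsed as a number
    "Gender": ["Female", "Male", "Female", np.nan],
    "Region": ["Eastern", "Western", "Central", "Western"],
})
num_cols = ["Age", "Study_Hours_Daily"]
cat_cols = ["Gender", "Region"]

# Same idea as the conversion loop in the cell: unparseable entries become NaN.
toy[num_cols] = toy[num_cols].apply(pd.to_numeric, errors="coerce")

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, num_cols),
    ("cat", categorical_pipe, cat_cols),
])

encoded = preprocessor.fit_transform(toy)
print(encoded.shape)  # 2 numeric columns + 2 Gender levels + 3 Region levels
```

Because `errors='coerce'` silently turns bad entries into NaN, it is worth checking `isna().sum()` per column after conversion to see how many values were lost.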


## Summary:

### Data Analysis Key Findings

* The initial baseline Gradient Boosting Regressor model on a limited feature set had a Test RMSE of 0.9155 and a Test R2 Score of 0.0121.
* After adding new features and performing extensive hyperparameter tuning, the optimized Gradient Boosting Regressor model achieved a Test RMSE of 0.4428 and a Test R2 Score of 0.7689 on the held-out test set.
* Relative to the baseline, the optimized model roughly halves the Test RMSE (from 0.9155 to 0.4428, a reduction of about 52%) and raises the Test R2 Score from 0.0121 to 0.7689.

### Insights or Next Steps

* Adding features and tuning hyperparameters substantially improved the model's predictive accuracy for student performance on the held-out test set. The two changes were applied together, so their separate contributions are not isolated here.
* A feature importance analysis on the final model would show which of the new features contributed most to the improvement, for example whether the GPA-related features (`Anatomy_GPA`, `Physiology_GPA`, `CGPA`) dominate.

In [15]:
# Re-executing relevant parts of cell ZhoBJ6dgBlyZ to define df_q2
import pandas as pd  # imported here so this cell can run on its own

# --- 1. Data Loading and Cleaning ---
RSEED = 42

try:
    # Load the student data, attempting to handle the complex header
    df_student = pd.read_excel('/content/drive/MyDrive/question2_students.xlsx', header=[0, 1])

    # Flatten multi-level columns
    df_student.columns = ['_'.join(map(str, col)).strip() for col in df_student.columns.values]

    # Select the target (gpabio)
    # Use the flattened column name for gpabio
    gpabio_cols = [col for col in df_student.columns if 'gpabio_' in col]
    target_q2 = gpabio_cols[0] if gpabio_cols else None


    if not target_q2:
        # Fallback for potentially different exact gpabio column name in other runs
        gpabio_cols = [col for col in df_student.columns if 'gpabio' in col and '_Unnamed:' not in col and 'gpaant' not in col and 'gpaphys' not in col]
        target_q2 = gpabio_cols[0] if gpabio_cols else None
        if not target_q2:
            raise ValueError("Target variable 'gpabio' not found after cleaning.")


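except FileNotFoundError:
    print("Error: '/content/drive/MyDrive/question2_students.xlsx' not found.")
    raise  # re-raise instead of exit(), which would terminate the notebook kernel
except Exception as e:
    print(f"Error loading or cleaning student data: {e}")
    raise  # stop this cell but keep the kernel and earlier variables alive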

# Filter out rows with missing target and select features
df_q2 = df_student[df_student[target_q2].notna()].reset_index(drop=True)
# y is already defined as the target variable in cell ZhoBJ6dgBlyZ, no need to redefine here

# --- Continue with the rest of the steps from the previous attempt ---

# 1. Define the new set of selected features including original and new ones
selected_features_q2 = [
    'A1. How old are you? (age in complete years)_24', # Original: Age
    'A2. What is your gender?_Male', # Original: Gender
    'A6: In which region in Uganda is your home district found? (where you came from to the university)_Northern', # Original: Region
    'A7: In which level of urbanization is your home located? (where you came from to the university)_Rural', # Original: Urbanisation
    'A9: What is/was the highest level of education your father attained? _Secondary level', # Original: Father_Education
    'C1: Which type of entry scheme did you use to be admitted into this course_Advanced Level (UACE) results', # Original: Entry_Scheme
    'C14: How many hours do you use to do private study or read on daily basis?_4', # Original: Study_Hours_Daily
    'C25: While studying anatomy, physiology, and biochemistry, on average, how many hours did you use to sleep/rest daily?_8', # Original: Sleep_Hours_Daily

    # Added features based on exploration
    'Year of study_Second year second semester', # Year of study
    'A3. What is your sponsorship?_Private', # Sponsorship
    'A5. What is your religion?_Born again including Protestant', # Religion
    'A8: What is the living status of your parents?. _Both alive', # Living status of parents
    'A10: What is/was the highest level of education of your mother?_Never went to school', # Mother's education
    'A13: What is your marital status?_Married/Cohabiting', # Marital status
    'A14: Do you have a paying job you do while studying?_No', # Paying job while studying
    'A15: How many people do you stay with while at home other than your self?_9', # People at home
    'A17: How many people do you stay with in the room while at university?_2', # People in room at university
    'A18: Approximately, how far is it in km from where you live to the university campus? _0.5', # Distance to university
    'A19: To what extent does your parent(s) or guardian encourage, guide, or motivate you to concentrate on your studies?_Neutral', # Parental encouragement
    'gpaant_2.25', # Anatomy GPA
    'gpaphys_3.25', # Physiology GPA
    'cgpa_3.285714285714286', # Cumulative GPA
    'B2.1 I feel satisfied with my performance in anatomy        _Neutral', # Satisfaction with performance (Anatomy)
    'B2.2 I feel satisfied with my performance in Physiology        _Agree', # Satisfaction with performance (Physiology)
    'B2.3 I feel satisfied with my performance in Biochemistry_Agree', # Satisfaction with performance (Biochemistry)
    'B2.4 My performance is appropriate for my effort _Neutral', # Performance vs Effort
    'B2.5 I feel knowledgeable in these three courses                  _Agree', # Knowledgeable in courses
    'B2.6 I believe can perform better in these courses_Strongly agree', # Belief in better performance
    'B2.7 I apply knowledge of these courses in patient care_Strongly agree', # Application of knowledge
    'This course was the most difficult of all (tick)_Anatomy', # Most difficult course
    'This course was the simpler of the three (tick)_Biochemistry', # Simpler course
    'C2: In which type of school did you complete your Ordinary Level (O-Level) of education?_Government, U.S.E', # O-Level school type
    'C3: In which type of school did you complete your Advanced Level (A-Level) of education?_Government, non U.S.E', # A-Level school type
    'C4: In which location was your Advanced Level school?_Rural', # A-Level school location
    'C5: How many aggregates did you get at primary seven?_8', # Primary aggregates
    'C6: How many aggregates did you get at O-Level?_31', # O-Level aggregates
    'A-Level points _9', # A-Level points
    'Approx Wts_18.2', # Approx Wts (related to A-Level)
    'C10: Which choice was the nursing profession in your life_Third', # Choice of nursing profession
    'C11: To what extent are you proud to be a Bachelor nurse student?_Less extent', # Pride in being nursing student
    'C12: Do you intend to change from nursing profession to another profession in future?_No', # Intention to change profession
    'C13: If you were to choose, which nursing specialty would you prefer to practice after qualifying with BNS degree?_Nursing education/teaching', # Preferred nursing specialty
    'C15: During the time you studied biomedical science courses, how many hours did you use to do private study or read on daily basis? _4', # Study hours during biomedical courses
    'C16: Which electronic gargets do you use mostly?_Smart phone', # Electronic gadgets used
    'C17: To what extent are you engaged in games and sports at university?_Less extent', # Engagement in games/sports
    'C18: To what extent do you miss class/lectures_Some extent', # Extent of missing classes
    'C19: What is the commonest reason for missing lectures/classes?_Financial constraints', # Reason for missing classes
    'C20: If you are to blame one person for your performance in exams of anatomy, physiology, and biochemistry; who would you blame?_Myself', # Blame for performance
    'C21: Who do you believe is most responsible for your good performance in anatomy, physiology, and biochemistry?_Teachers/lecturers', # Responsible for good performance
    'C22: How often do you read from the university library?_Rarely', # Frequency of reading from library
    'C23: Choose only one resource you used most to study anatomy, physiology, and biochemistry_Lecture notes', # Most used study resource
    'C24: While studying anatomy, physiology, and biochemistry, how often did you participate in group discussion or team learning?_Rarely', # Participation in group discussion
    'C26: How adequate was the sleep stated in C25 above?_Not sure', # Adequacy of sleep
    'C27: To what extent are you confident that you will complete BNS course on time?_Great extent', # Confidence in completing course on time
    'D26: To what extent is the university library accessible to you? _Great extent', # University library accessibility
    'D27: To what extent is the university library stocked with very good books of biomedical sciences named anatomy, physiology and biochemistry? _Great extent', # University library stock
]


# 2. Create a dictionary for renaming features
feature_rename_map_q2 = {
    'A1. How old are you? (age in complete years)_24': 'Age',
    'A2. What is your gender?_Male': 'Gender',
    'A6: In which region in Uganda is your home district found? (where you came from to the university)_Northern': 'Region',
    'A7: In which level of urbanization is your home located? (where you came from to the university)_Rural': 'Urbanisation',
    'A9: What is/was the highest level of education your father attained? _Secondary level': 'Father_Education',
    'C1: Which type of entry scheme did you use to be admitted into this course_Advanced Level (UACE) results': 'Entry_Scheme',
    'C14: How many hours do you use to do private study or read on daily basis?_4': 'Study_Hours_Daily',
    'C25: While studying anatomy, physiology, and biochemistry, on average, how many hours did you use to sleep/rest daily?_8': 'Sleep_Hours_Daily',

    'Year of study_Second year second semester': 'Year_of_Study',
    'A3. What is your sponsorship?_Private': 'Sponsorship',
    'A5. What is your religion?_Born again including Protestant': 'Religion',
    'A8: What is the living status of your parents?. _Both alive': 'Parents_Living_Status',
    'A10: What is/was the highest level of education of your mother?_Never went to school': 'Mother_Education',
    'A13: What is your marital status?_Married/Cohabiting': 'Marital_Status',
    'A14: Do you have a paying job you do while studying?_No': 'Paying_Job',
    'A15: How many people do you stay with while at home other than your self?_9': 'People_at_Home',
    'A17: How many people do you stay with in the room while at university?_2': 'People_in_Room',
    'A18: Approximately, how far is it in km from where you live to the university campus? _0.5': 'Distance_to_University_km',
    'A19: To what extent does your parent(s) or guardian encourage, guide, or motivate you to concentrate on your studies?_Neutral': 'Parental_Encouragement',
    'gpaant_2.25': 'Anatomy_GPA',
    'gpaphys_3.25': 'Physiology_GPA',
    'cgpa_3.285714285714286': 'CGPA',
    'B2.1 I feel satisfied with my performance in anatomy        _Neutral': 'Satisfaction_Anatomy',
    'B2.2 I feel satisfied with my performance in Physiology        _Agree': 'Satisfaction_Physiology',
    'B2.3 I feel satisfied with my performance in Biochemistry_Agree': 'Satisfaction_Biochemistry',
    'B2.4 My performance is appropriate for my effort _Neutral': 'Performance_vs_Effort',
    'B2.5 I feel knowledgeable in these three courses                  _Agree': 'Knowledgeable_in_Courses',
    'B2.6 I believe can perform better in these courses_Strongly agree': 'Belief_in_Better_Performance',
    'B2.7 I apply knowledge of these courses in patient care_Strongly agree': 'Application_of_Knowledge',
    'This course was the most difficult of all (tick)_Anatomy': 'Most_Difficult_Course',
    'This course was the simpler of the three (tick)_Biochemistry': 'Simpler_Course',
    'C2: In which type of school did you complete your Ordinary Level (O-Level) of education?_Government, U.S.E': 'O_Level_School_Type',
    'C3: In which type of school did you complete your Advanced Level (A-Level) of education?_Government, non U.S.E': 'A_Level_School_Type',
    'C4: In which location was your Advanced Level school?_Rural': 'A_Level_School_Location',
    'C5: How many aggregates did you get at primary seven?_8': 'Primary_Aggregates',
    'C6: How many aggregates did you get at O-Level?_31': 'O_Level_Aggregates',
    'A-Level points _9': 'A_Level_Points',
    'Approx Wts_18.2': 'Approx_Wts',
    'C10: Which choice was the nursing profession in your life_Third': 'Choice_Nursing_Profession',
    'C11: To what extent are you proud to be a Bachelor nurse student?_Less extent': 'Pride_Nursing_Student',
    'C12: Do you intend to change from nursing profession to another profession in future?_No': 'Intent_Change_Profession',
    'C13: If you were to choose, which nursing specialty would you prefer to practice after qualifying with BNS degree?_Nursing education/teaching': 'Preferred_Nursing_Specialty',
    'C15: During the time you studied biomedical science courses, how many hours did you use to do private study or read on daily basis? _4': 'Study_Hours_Biomedical',
    'C16: Which electronic gargets do you use mostly?_Smart phone': 'Electronic_Gadgets',
    'C17: To what extent are you engaged in games and sports at university?_Less extent': 'Engagement_Games_Sports',
    'C18: To what extent do you miss class/lectures_Some extent': 'Extent_Missing_Classes',
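    'C19: What is the commonest reason for missing lectures/classes?_Financial constraints': 'Reason_Missing_Classes',
    'C20: If you are to blame one person for your performance in exams of anatomy, physiology, and biochemistry; who would you blame?_Myself': 'Blame_for_Performance',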
    'C21: Who do you believe is most responsible for your good performance in anatomy, physiology, and biochemistry?_Teachers/lecturers': 'Responsible_for_Good_Performance',
    'C22: How often do you read from the university library?_Rarely': 'Frequency_Library_Use',
    'C23: Choose only one resource you used most to study anatomy, physiology, and biochemistry_Lecture notes': 'Most_Used_Study_Resource',
    'C24: While studying anatomy, physiology, and biochemistry, how often did you participate in group discussion or team learning?_Rarely': 'Group_Discussion_Participation',
    'C26: How adequate was the sleep stated in C25 above?_Not sure': 'Adequacy_of_Sleep',
    'C27: To what extent are you confident that you will complete BNS course on time?_Great extent': 'Confidence_Complete_Course',
    'D26: To what extent is the university library accessible to you? _Great extent': 'Library_Accessibility',
    'D27: To what extent is the university library stocked with very good books of biomedical sciences named anatomy, physiology and biochemistry? _Great extent': 'Library_Stock'
}

# 3. Select and rename columns
X_new_q2 = df_q2[selected_features_q2].rename(columns=feature_rename_map_q2)

# 4. Identify numerical and categorical columns
numerical_cols_new = [
    'Age',
    'Study_Hours_Daily',
    'Sleep_Hours_Daily',
    'People_at_Home',
    'People_in_Room',
    'Distance_to_University_km',
    'Anatomy_GPA',
    'Physiology_GPA',
    'CGPA',
    'Primary_Aggregates',
    'O_Level_Aggregates',
    'A_Level_Points',
    'Approx_Wts',
    'Study_Hours_Biomedical'
]

categorical_cols_new = [
    'Gender',
    'Region',
    'Urbanisation',
    'Father_Education',
    'Entry_Scheme',
    'Year_of_Study',
    'Sponsorship',
    'Religion',
    'Parents_Living_Status',
    'Mother_Education',
    'Marital_Status',
    'Paying_Job',
    'Parental_Encouragement',
    'Satisfaction_Anatomy',
    'Satisfaction_Physiology',
    'Satisfaction_Biochemistry',
    'Performance_vs_Effort',
    'Knowledgeable_in_Courses',
    'Belief_in_Better_Performance',
    'Application_of_Knowledge',
    'Most_Difficult_Course',
    'Simpler_Course',
    'O_Level_School_Type',
    'A_Level_School_Type',
    'A_Level_School_Location',
    'Choice_Nursing_Profession',
    'Pride_Nursing_Student',
    'Intent_Change_Profession',
    'Preferred_Nursing_Specialty',
    'Electronic_Gadgets',
    'Engagement_Games_Sports',
    'Extent_Missing_Classes',
    'Reason_Missing_Classes',
    'Blame_for_Performance',
    'Responsible_for_Good_Performance',
    'Frequency_Library_Use',
    'Most_Used_Study_Resource',
    'Group_Discussion_Participation',
    'Adequacy_of_Sleep',
    'Confidence_Complete_Course',
    'Library_Accessibility',
    'Library_Stock'
]


# 5. Convert numerical columns to numeric, coercing errors
for col in numerical_cols_new:
    X_new_q2[col] = pd.to_numeric(X_new_q2[col], errors='coerce')

display(X_new_q2.head())

Unnamed: 0,Age,Gender,Region,Urbanisation,Father_Education,Entry_Scheme,Study_Hours_Daily,Sleep_Hours_Daily,Year_of_Study,Sponsorship,...,Reason_Missing_Classes,"C20: If you are to blame one person for your performance in exams of anatomy, physiology, and biochemistry; who would you blame?_Myself",Responsible_for_Good_Performance,Frequency_Library_Use,Most_Used_Study_Resource,Group_Discussion_Participation,Adequacy_of_Sleep,Confidence_Complete_Course,Library_Accessibility,Library_Stock
0,21.0,Female,Eastern,Rural,Never went to school,Advanced Level (UACE) results ...,4.0,8.0,Second year second semester,Private,...,Sickness,Myself,Myself,Never,Videos,Often,Adequate,Neutral,Great extent,Great extent
1,22.0,Male,Western,Rural,Primary level,Advanced Level (UACE) results ...,2.0,8.0,Second year second semester,Private,...,Sickness,Teachers/lecturers,Myself,Often,Textbooks,Often,Not sure,Less extent,Small extent,Great extent
2,23.0,Female,Central,Peri Urban (small towns),Primary level,Advanced Level (UACE) results ...,2.0,6.0,Second year second semester,Private,...,Other (specify),Myself,Classmates or friends,Sometimes,Textbooks,Very often,Adequate,Neutral,Great extent,Small extent
3,23.0,Male,Western,Rural,Never went to school,Advanced Level (UACE) results ...,5.0,6.0,Second year second semester,Government,...,,Myself,Classmates or friends,Often,Textbooks,Often,Adequate,Great extent,Great extent,Great extent
4,23.0,Male,Western,Rural,Never went to school,Advanced Level (UACE) results ...,5.0,6.0,Second year second semester,Government,...,,Myself,Classmates or friends,Often,Textbooks,Often,Adequate,Great extent,Great extent,Great extent


**Reasoning**:
The traceback indicates that `df_student` and `pd` are not defined. I need to import `pandas` and include the data loading and cleaning steps at the beginning of the code block to define `df_student`, `df_q2`, and `y`.



In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

RSEED = 42

# --- Data Loading and Cleaning (from previous steps) ---
try:
    # Load the student data, attempting to handle the complex header
    df_student = pd.read_excel('/content/drive/MyDrive/question2_students.xlsx', header=[0, 1])

    # Flatten multi-level columns
    df_student.columns = ['_'.join(map(str, col)).strip() for col in df_student.columns.values]

    # Select the target (gpabio)
    gpabio_cols = [col for col in df_student.columns if 'gpabio_' in col]
    target_q2 = gpabio_cols[0] if gpabio_cols else None

    if not target_q2:
        gpabio_cols = [col for col in df_student.columns if 'gpabio' in col and '_Unnamed:' not in col and 'gpaant' not in col and 'gpaphys' not in col]
        target_q2 = gpabio_cols[0] if gpabio_cols else None
        if not target_q2:
            raise ValueError("Target variable 'gpabio' not found after cleaning.")

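except FileNotFoundError as e:
    # Raise instead of calling exit(), which can shut down the notebook kernel
    raise FileNotFoundError("'/content/drive/MyDrive/question2_students.xlsx' not found.") from e
except Exception as e:
    raise RuntimeError(f"Error loading or cleaning student data: {e}") from e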

# Filter out rows with missing target and select features
df_q2 = df_student[df_student[target_q2].notna()].reset_index(drop=True)
y = df_q2[target_q2].astype(float)

# --- Feature Selection and Engineering (from previous steps) ---
selected_features_q2 = [
    'A1. How old are you? (age in complete years)_24',
    'A2. What is your gender?_Male',
    'A6: In which region in Uganda is your home district found? (where you came from to the university)_Northern',
    'A7: In which level of urbanization is your home located? (where you came from to the university)_Rural',
    'A9: What is/was the highest level of education your father attained? _Secondary level',
    'C1: Which type of entry scheme did you use to be admitted into this course_Advanced Level (UACE) results',
    'C14: How many hours do you use to do private study or read on daily basis?_4',
    'C25: While studying anatomy, physiology, and biochemistry, on average, how many hours did you use to sleep/rest daily?_8',
    'Year of study_Second year second semester',
    'A3. What is your sponsorship?_Private',
    'A5. What is your religion?_Born again including Protestant',
    'A8: What is the living status of your parents?. _Both alive',
    'A10: What is/was the highest level of education of your mother?_Never went to school',
    'A13: What is your marital status?_Married/Cohabiting',
    'A14: Do you have a paying job you do while studying?_No',
    'A15: How many people do you stay with while at home other than your self?_9',
    'A17: How many people do you stay with in the room while at university?_2',
    'A18: Approximately, how far is it in km from where you live to the university campus? _0.5',
    'A19: To what extent does your parent(s) or guardian encourage, guide, or motivate you to concentrate on your studies?_Neutral',
    'gpaant_2.25',
    'gpaphys_3.25',
    'cgpa_3.285714285714286',
    'B2.1 I feel satisfied with my performance in anatomy        _Neutral',
    'B2.2 I feel satisfied with my performance in Physiology        _Agree',
    'B2.3 I feel satisfied with my performance in Biochemistry_Agree',
    'B2.4 My performance is appropriate for my effort _Neutral',
    'B2.5 I feel knowledgeable in these three courses                  _Agree',
    'B2.6 I believe can perform better in these courses_Strongly agree',
    'B2.7 I apply knowledge of these courses in patient care_Strongly agree',
    'This course was the most difficult of all (tick)_Anatomy',
    'This course was the simpler of the three (tick)_Biochemistry',
    'C2: In which type of school did you complete your Ordinary Level (O-Level) of education?_Government, U.S.E',
    'C3: In which type of school did you complete your Advanced Level (A-Level) of education?_Government, non U.S.E',
    'C4: In which location was your Advanced Level school?_Rural',
    'C5: How many aggregates did you get at primary seven?_8',
    'C6: How many aggregates did you get at O-Level?_31',
    'A-Level points _9',
    'Approx Wts_18.2',
    'C10: Which choice was the nursing profession in your life_Third',
    'C11: To what extent are you proud to be a Bachelor nurse student?_Less extent',
    'C12: Do you intend to change from nursing profession to another profession in future?_No',
    'C13: If you were to choose, which nursing specialty would you prefer to practice after qualifying with BNS degree?_Nursing education/teaching',
    'C15: During the time you studied biomedical science courses, how many hours did you use to do private study or read on daily basis? _4',
    'C16: Which electronic gargets do you use mostly?_Smart phone',
    'C17: To what extent are you engaged in games and sports at university?_Less extent',
    'C18: To what extent do you miss class/lectures_Some extent',
    'C19: What is the commonest reason for missing lectures/classes?_Financial constraints',
    'C20: If you are to blame one person for your performance in exams of anatomy, physiology, and biochemistry; who would you blame?_Myself',
    'C21: Who do you believe is most responsible for your good performance in anatomy, physiology, and biochemistry?_Teachers/lecturers',
    'C22: How often do you read from the university library?_Rarely',
    'C23: Choose only one resource you used most to study anatomy, physiology, and biochemistry_Lecture notes',
    'C24: While studying anatomy, physiology, and biochemistry, how often did you participate in group discussion or team learning?_Rarely',
    'C26: How adequate was the sleep stated in C25 above?_Not sure',
    'C27: To what extent are you confident that you will complete BNS course on time?_Great extent',
    'D26: To what extent is the university library accessible to you? _Great extent',
    'D27: To what extent is the university library stocked with very good books of biomedical sciences named anatomy, physiology and biochemistry? _Great extent',
]

feature_rename_map_q2 = {
    'A1. How old are you? (age in complete years)_24': 'Age',
    'A2. What is your gender?_Male': 'Gender',
    'A6: In which region in Uganda is your home district found? (where you came from to the university)_Northern': 'Region',
    'A7: In which level of urbanization is your home located? (where you came from to the university)_Rural': 'Urbanisation',
    'A9: What is/was the highest level of education your father attained? _Secondary level': 'Father_Education',
    'C1: Which type of entry scheme did you use to be admitted into this course_Advanced Level (UACE) results': 'Entry_Scheme',
    'C14: How many hours do you use to do private study or read on daily basis?_4': 'Study_Hours_Daily',
    'C25: While studying anatomy, physiology, and biochemistry, on average, how many hours did you use to sleep/rest daily?_8': 'Sleep_Hours_Daily',
    'Year of study_Second year second semester': 'Year_of_Study',
    'A3. What is your sponsorship?_Private': 'Sponsorship',
    'A5. What is your religion?_Born again including Protestant': 'Religion',
    'A8: What is the living status of your parents?. _Both alive': 'Parents_Living_Status',
    'A10: What is/was the highest level of education of your mother?_Never went to school': 'Mother_Education',
    'A13: What is your marital status?_Married/Cohabiting': 'Marital_Status',
    'A14: Do you have a paying job you do while studying?_No': 'Paying_Job',
    'A15: How many people do you stay with while at home other than your self?_9': 'People_at_Home',
    'A17: How many people do you stay with in the room while at university?_2': 'People_in_Room',
    'A18: Approximately, how far is it in km from where you live to the university campus? _0.5': 'Distance_to_University_km',
    'A19: To what extent does your parent(s) or guardian encourage, guide, or motivate you to concentrate on your studies?_Neutral': 'Parental_Encouragement',
    'gpaant_2.25': 'Anatomy_GPA',
    'gpaphys_3.25': 'Physiology_GPA',
    'cgpa_3.285714285714286': 'CGPA',
    'B2.1 I feel satisfied with my performance in anatomy        _Neutral': 'Satisfaction_Anatomy',
    'B2.2 I feel satisfied with my performance in Physiology        _Agree': 'Satisfaction_Physiology',
    'B2.3 I feel satisfied with my performance in Biochemistry_Agree': 'Satisfaction_Biochemistry',
    'B2.4 My performance is appropriate for my effort _Neutral': 'Performance_vs_Effort',
    'B2.5 I feel knowledgeable in these three courses                  _Agree': 'Knowledgeable_in_Courses',
    'B2.6 I believe can perform better in these courses_Strongly agree': 'Belief_in_Better_Performance',
    'B2.7 I apply knowledge of these courses in patient care_Strongly agree': 'Application_of_Knowledge',
    'This course was the most difficult of all (tick)_Anatomy': 'Most_Difficult_Course',
    'This course was the simpler of the three (tick)_Biochemistry': 'Simpler_Course',
    'C2: In which type of school did you complete your Ordinary Level (O-Level) of education?_Government, U.S.E': 'O_Level_School_Type',
    'C3: In which type of school did you complete your Advanced Level (A-Level) of education?_Government, non U.S.E': 'A_Level_School_Type',
    'C4: In which location was your Advanced Level school?_Rural': 'A_Level_School_Location',
    'C5: How many aggregates did you get at primary seven?_8': 'Primary_Aggregates',
    'C6: How many aggregates did you get at O-Level?_31': 'O_Level_Aggregates',
    'A-Level points _9': 'A_Level_Points',
    'Approx Wts_18.2': 'Approx_Wts',
    'C10: Which choice was the nursing profession in your life_Third': 'Choice_Nursing_Profession',
    'C11: To what extent are you proud to be a Bachelor nurse student?_Less extent': 'Pride_Nursing_Student',
    'C12: Do you intend to change from nursing profession to another profession in future?_No': 'Intent_Change_Profession',
    'C13: If you were to choose, which nursing specialty would you prefer to practice after qualifying with BNS degree?_Nursing education/teaching': 'Preferred_Nursing_Specialty',
    'C15: During the time you studied biomedical science courses, how many hours did you use to do private study or read on daily basis? _4': 'Study_Hours_Biomedical',
    'C16: Which electronic gargets do you use mostly?_Smart phone': 'Electronic_Gadgets',
    'C17: To what extent are you engaged in games and sports at university?_Less extent': 'Engagement_Games_Sports',
    'C18: To what extent do you miss class/lectures_Some extent': 'Extent_Missing_Classes',
    'C19: What is the commonest reason for missing lectures/classes?_Financial constraints': 'Reason_Missing_Classes',
    'C20: If you are to blame one person for your performance in exams of anatomy, physiology, and biochemistry; who would you blame?_Myself': 'Blame_for_Performance',
    'C21: Who do you believe is most responsible for your good performance in anatomy, physiology, and biochemistry?_Teachers/lecturers': 'Responsible_for_Good_Performance',
    'C22: How often do you read from the university library?_Rarely': 'Frequency_Library_Use',
    'C23: Choose only one resource you used most to study anatomy, physiology, and biochemistry_Lecture notes': 'Most_Used_Study_Resource',
    'C24: While studying anatomy, physiology, and biochemistry, how often did you participate in group discussion or team learning?_Rarely': 'Group_Discussion_Participation',
    'C26: How adequate was the sleep stated in C25 above?_Not sure': 'Adequacy_of_Sleep',
    'C27: To what extent are you confident that you will complete BNS course on time?_Great extent': 'Confidence_Complete_Course',
    'D26: To what extent is the university library accessible to you? _Great extent': 'Library_Accessibility',
    'D27: To what extent is the university library stocked with very good books of biomedical sciences named anatomy, physiology and biochemistry? _Great extent': 'Library_Stock',
}

X_new_q2 = df_q2[selected_features_q2].rename(columns=feature_rename_map_q2)

# Convert numerical columns to numeric, coercing errors
numerical_cols_new = [
    'Age',
    'Study_Hours_Daily',
    'Sleep_Hours_Daily',
    'People_at_Home',
    'People_in_Room',
    'Distance_to_University_km',
    'Anatomy_GPA',
    'Physiology_GPA',
    'CGPA',
    'Primary_Aggregates',
    'O_Level_Aggregates',
    'A_Level_Points',
    'Approx_Wts',
    'Study_Hours_Biomedical'
]

for col in numerical_cols_new:
    X_new_q2[col] = pd.to_numeric(X_new_q2[col], errors='coerce')

categorical_cols_new = [
    'Gender',
    'Region',
    'Urbanisation',
    'Father_Education',
    'Entry_Scheme',
    'Year_of_Study',
    'Sponsorship',
    'Religion',
    'Parents_Living_Status',
    'Mother_Education',
    'Marital_Status',
    'Paying_Job',
    'Parental_Encouragement',
    'Satisfaction_Anatomy',
    'Satisfaction_Physiology',
    'Satisfaction_Biochemistry',
    'Performance_vs_Effort',
    'Knowledgeable_in_Courses',
    'Belief_in_Better_Performance',
    'Application_of_Knowledge',
    'Most_Difficult_Course',
    'Simpler_Course',
    'O_Level_School_Type',
    'A_Level_School_Type',
    'A_Level_School_Location',
    'Choice_Nursing_Profession',
    'Pride_Nursing_Student',
    'Intent_Change_Profession',
    'Preferred_Nursing_Specialty',
    'Electronic_Gadgets',
    'Engagement_Games_Sports',
    'Extent_Missing_Classes',
    'Reason_Missing_Classes',
    'Blame_for_Performance',
    'Responsible_for_Good_Performance',
    'Frequency_Library_Use',
    'Most_Used_Study_Resource',
    'Group_Discussion_Participation',
    'Adequacy_of_Sleep',
    'Confidence_Complete_Course',
    'Library_Accessibility',
    'Library_Stock'
]


# --- Preprocessing Pipeline Definition ---
# numerical_cols_new and categorical_cols_new are already defined above

numerical_transformer_new = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer_new = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor_new = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer_new, numerical_cols_new),
        ('cat', categorical_transformer_new, categorical_cols_new)
    ],
    remainder='passthrough'
)

# Split X_new_q2 (features) and y (target) into training (80%) and test (20%) sets
X_train_new, X_test_new, y_train, y_test = train_test_split(X_new_q2, y, test_size=0.2, random_state=RSEED)

print("New preprocessing pipeline defined and data split into training and testing sets.")

New preprocessing pipeline defined and data split into training and testing sets.


## Re-evaluate baseline models

### Subtask:
Train and evaluate the three baseline models (Ridge, Random Forest, Gradient Boosting) again using the updated feature set and preprocessing pipeline with 5-fold cross-validation on the training data.


**Reasoning**:
Train and evaluate the baseline models using the new features and preprocessing pipeline.



In [2]:
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd

# 1. Define the new baseline models
models_new = {
    'Ridge Regression': Ridge(random_state=RSEED),
    'Random Forest Regressor': RandomForestRegressor(random_state=RSEED),
    'Gradient Boosting Regressor': GradientBoostingRegressor(random_state=RSEED)
}

# 2. Initialize dictionary to store results
results_new = {}

# 3. Print start message
print("Training and Evaluating Baseline Models with New Features (5-Fold CV on Training Set)...")
print("-" * 70)

# 4. Iterate and evaluate models
for name, model in models_new.items():
    pipeline_new = Pipeline(steps=[('preprocessor', preprocessor_new), ('regressor', model)])

    cv_scores_new = cross_val_score(pipeline_new, X_train_new, y_train, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
    rmse_scores_new = np.sqrt(-cv_scores_new)

    results_new[name] = {
        'RMSE_Mean': rmse_scores_new.mean(),
        'RMSE_Std': rmse_scores_new.std()
    }
    print(f"  {name:<25}: RMSE = {results_new[name]['RMSE_Mean']:.4f} (+/- {results_new[name]['RMSE_Std']:.4f})")

print("-" * 70)

# 5. Convert results to DataFrame and sort
results_df_new = pd.DataFrame(results_new).T.sort_values(by='RMSE_Mean')

# 6. Identify and print the best baseline model
best_baseline_model_new = results_df_new.index[0]
print(f"\nBest Baseline Model with New Features: {best_baseline_model_new}")

display(results_df_new)

Training and Evaluating Baseline Models with New Features (5-Fold CV on Training Set)...
----------------------------------------------------------------------
  Ridge Regression         : RMSE = 0.5023 (+/- 0.0573)
  Random Forest Regressor  : RMSE = 0.4162 (+/- 0.0638)
  Gradient Boosting Regressor: RMSE = 0.4046 (+/- 0.0667)
----------------------------------------------------------------------

Best Baseline Model with New Features: Gradient Boosting Regressor


Unnamed: 0,RMSE_Mean,RMSE_Std
Gradient Boosting Regressor,0.404643,0.06672
Random Forest Regressor,0.416216,0.063843
Ridge Regression,0.502287,0.057307
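
The cell above converts `neg_mean_squared_error` scores to RMSE by hand. As a hedged alternative, scikit-learn also offers the `neg_root_mean_squared_error` scorer, which should give the same per-fold values. The sketch below uses synthetic data from `make_regression`, not the student data, purely to show that the two routes agree.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only to illustrate the scoring conventions
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=42)
demo_model = Ridge()

# Route 1: negative MSE per fold, then flip the sign and take the square root
neg_mse = cross_val_score(demo_model, X_demo, y_demo, cv=5,
                          scoring="neg_mean_squared_error")
rmse_from_mse = np.sqrt(-neg_mse)

# Route 2: negative RMSE per fold, then flip the sign
neg_rmse = cross_val_score(demo_model, X_demo, y_demo, cv=5,
                           scoring="neg_root_mean_squared_error")
rmse_direct = -neg_rmse

print(rmse_from_mse.mean(), rmse_direct.mean())
```

Both scorers are negated because scikit-learn treats higher scores as better, which is why the sign must be flipped before reporting an error value.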


## Hyperparameter tuning (best model)

### Subtask:
Perform more extensive hyperparameter tuning on the best performing model from the previous step. This might involve a wider search space or more iterations in `RandomizedSearchCV`, or even using `GridSearchCV` if the search space is manageable.


**Reasoning**:
Identify the best baseline model, define a more extensive hyperparameter distribution for the Gradient Boosting Regressor, instantiate `RandomizedSearchCV` with the updated pipeline, the new distribution, and an increased number of iterations, and fit it to the training data.



In [3]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from scipy.stats import randint, uniform

# 1. The best baseline model is Gradient Boosting Regressor, as identified in the previous step.

# 2. Define a more extensive hyperparameter distribution for Gradient Boosting Regressor
param_dist_extensive = {
    'regressor__n_estimators': randint(100, 600), # Wider range: integers from 100 to 599
    'regressor__learning_rate': uniform(0.005, 0.3), # uniform(loc, scale): values in [0.005, 0.305]
    'regressor__max_depth': randint(3, 8), # Deeper trees: depths 3 to 7 (upper bound is exclusive)
    'regressor__min_samples_split': randint(2, 20), # Wider range: integers from 2 to 19
    'regressor__min_samples_leaf': randint(1, 10), # Add min_samples_leaf: integers from 1 to 9
    'regressor__subsample': uniform(0.6, 0.4) # Add subsample: values in [0.6, 1.0]
}

# The preprocessor_new and X_train_new, y_train are already defined in previous steps.

# 3. Instantiate RandomizedSearchCV with the updated pipeline and extensive parameter distribution
gbr_pipeline_new = Pipeline(steps=[('preprocessor', preprocessor_new),
                                ('regressor', GradientBoostingRegressor(random_state=RSEED))])

random_search_extensive = RandomizedSearchCV(gbr_pipeline_new, param_distributions=param_dist_extensive,
                                   n_iter=100, cv=5, scoring='neg_mean_squared_error', # Increased iterations
                                   verbose=0, random_state=RSEED, n_jobs=-1)

# 4. Fit the RandomizedSearchCV object to the training data
print("Performing extensive hyperparameter optimization for Gradient Boosting...")
random_search_extensive.fit(X_train_new, y_train)

# 5. Print the best parameters found by RandomizedSearchCV
print("\n--- Extensive Hyperparameter Optimization Results ---")
print(f"Best CV Parameters: {random_search_extensive.best_params_}")

Performing extensive hyperparameter optimization for Gradient Boosting...

--- Extensive Hyperparameter Optimization Results ---
Best CV Parameters: {'regressor__learning_rate': np.float64(0.03593716065077978), 'regressor__max_depth': 3, 'regressor__min_samples_leaf': 1, 'regressor__min_samples_split': 17, 'regressor__n_estimators': 316, 'regressor__subsample': np.float64(0.7280198404122447)}
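
The search space above relies on how `scipy.stats` parameterizes its distributions: `uniform(loc, scale)` covers `[loc, loc + scale]`, and `randint(low, high)` excludes `high`. The sketch below samples from two of the distributions used in the search to check their ranges; the variable names are illustrative only.

```python
from scipy.stats import randint, uniform

# Same parameterization as in the search space: loc=0.005, scale=0.3
learning_rate_dist = uniform(0.005, 0.3)
# Integers from 3 up to, but not including, 8
max_depth_dist = randint(3, 8)

lr_samples = learning_rate_dist.rvs(size=1000, random_state=42)
depth_samples = max_depth_dist.rvs(size=1000, random_state=42)

print(lr_samples.min(), lr_samples.max())
print(sorted(set(depth_samples.tolist())))
```

Reading the second argument of `uniform` as an upper bound is a common slip; here it is a width, so the learning rate can reach 0.305 rather than 0.3.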


## Final evaluation

### Subtask:
Evaluate the best optimized model on the held-out test set to assess its final performance using RMSE and R2 score.


**Reasoning**:
Evaluate the best optimized Gradient Boosting model on the held-out test set using RMSE and R2 score.



In [4]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# 1. Get the best estimator from the random_search_extensive object
best_gbr_optimized = random_search_extensive.best_estimator_

# 2. Make predictions on the held-out test set X_test_new
y_pred_test_new = best_gbr_optimized.predict(X_test_new)

# 3. Calculate the Mean Squared Error (MSE)
mse_new = mean_squared_error(y_test, y_pred_test_new)

# 4. Calculate the Root Mean Squared Error (RMSE)
final_rmse_new = np.sqrt(mse_new)

# 5. Calculate the R2 score
final_r2_new = r2_score(y_test, y_pred_test_new)

# 6. Print the results
print("\n--- Final Test Set Evaluation (Optimized GBR with New Features) ---")
print(f"Test RMSE: {final_rmse_new:.4f}")
print(f"Test R2 Score: {final_r2_new:.4f}")


--- Final Test Set Evaluation (Optimized GBR with New Features) ---
Test RMSE: 0.4428
Test R2 Score: 0.7689
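
For reference, the two reported metrics can be computed directly from residuals. The sketch below uses made-up target and prediction values (not the notebook's test set) and checks the manual formulas against scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values, for illustration only
y_true_demo = np.array([2.0, 3.0, 3.5, 4.0, 2.5])
y_pred_demo = np.array([2.2, 2.8, 3.6, 3.5, 2.9])

residuals = y_true_demo - y_pred_demo

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R2: one minus the ratio of residual to total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
r2_sklearn = r2_score(y_true_demo, y_pred_demo)

print(rmse_manual, r2_manual)
```

RMSE is expressed in the units of the target (GPA points here), while R2 is unitless and compares the model against always predicting the mean.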


## Compare results

### Subtask:
Compare the final performance metrics to the initial model's results to determine if the changes improved accuracy.


**Reasoning**:
Compare the final performance metrics from the optimized model with the new features to the initial model's results. The initial model's results were printed in the output of cell ZhoBJ6dgBlyZ.



In [5]:
# Initial model results from cell ZhoBJ6dgBlyZ output:
initial_test_rmse = 0.9155
initial_test_r2 = 0.0121

# Final optimized model with new features results from the previous step (final evaluation subtask):
final_test_rmse_new = final_rmse_new # variable from previous step
final_test_r2_new = final_r2_new # variable from previous step

print("--- Model Performance Comparison ---")
print(f"Initial Model (Baseline GBR on limited features):")
print(f"  Test RMSE: {initial_test_rmse:.4f}")
print(f"  Test R2 Score: {initial_test_r2:.4f}")
print("-" * 35)
print(f"Optimized Model (Tuned GBR on expanded features):")
print(f"  Test RMSE: {final_test_rmse_new:.4f}")
print(f"  Test R2 Score: {final_test_r2_new:.4f}")
print("-" * 35)

# Determine if accuracy improved
rmse_improved = final_test_rmse_new < initial_test_rmse
r2_improved = final_test_r2_new > initial_test_r2

print(f"RMSE Improvement: {rmse_improved}")
print(f"R2 Score Improvement: {r2_improved}")

if rmse_improved and r2_improved:
    print("\nConclusion: Adding new features and hyperparameter tuning significantly improved model accuracy.")
elif rmse_improved:
    print("\nConclusion: Adding new features and hyperparameter tuning decreased RMSE, but R2 score did not increase.")
elif r2_improved:
    print("\nConclusion: Adding new features and hyperparameter tuning increased R2 score, but RMSE did not decrease.")
else:
    print("\nConclusion: Adding new features and hyperparameter tuning did not improve model accuracy based on RMSE and R2 score.")


--- Model Performance Comparison ---
Initial Model (Baseline GBR on limited features):
  Test RMSE: 0.9155
  Test R2 Score: 0.0121
-----------------------------------
Optimized Model (Tuned GBR on expanded features):
  Test RMSE: 0.4428
  Test R2 Score: 0.7689
-----------------------------------
RMSE Improvement: True
R2 Score Improvement: True

Conclusion: Adding new features and hyperparameter tuning significantly improved model accuracy.
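
The boolean comparison above can be complemented with the size of the change. The sketch below computes the relative RMSE reduction from the four reported numbers, plus a rough consistency check: since R2 = 1 - MSE / Var(y), each (RMSE, R2) pair implies a target variance, and similar implied variances would suggest both models were scored on the same test targets. This is an illustrative check, not part of the original analysis.

```python
# Values copied from the comparison output
initial_rmse, initial_r2 = 0.9155, 0.0121
final_rmse, final_r2 = 0.4428, 0.7689

# Fraction of the initial RMSE removed by the new model
relative_rmse_reduction = (initial_rmse - final_rmse) / initial_rmse

# Target variance implied by each pair: Var(y) = MSE / (1 - R2)
implied_var_initial = initial_rmse ** 2 / (1.0 - initial_r2)
implied_var_final = final_rmse ** 2 / (1.0 - final_r2)

print(f"Relative RMSE reduction: {relative_rmse_reduction:.1%}")
print(f"Implied target variance: {implied_var_initial:.4f} vs {implied_var_final:.4f}")
```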


## Summary:

### Data Analysis Key Findings

*   The initial baseline Gradient Boosting Regressor model on a limited feature set had a Test RMSE of 0.9155 and a Test R2 Score of 0.0121.
*   After adding new features and performing extensive hyperparameter tuning, the optimized Gradient Boosting Regressor model achieved a Test RMSE of 0.4428 and a Test R2 Score of 0.7689 on the held-out test set.
*   Relative to the initial model, the optimized model cuts Test RMSE by about 52% (from 0.9155 to 0.4428) and raises the Test R2 score from 0.0121 to 0.7689.

### Insights or Next Steps

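*   Adding features and tuning hyperparameters together substantially improved the model's predictive accuracy for student performance. The untuned Gradient Boosting baseline already reached a cross-validated RMSE of 0.4046 on the expanded features, which suggests that most of the gain came from the new features rather than from tuning.
*   Further analysis could examine feature importance on the final model to understand which of the new features (for example `Anatomy_GPA`, `Physiology_GPA`, and `CGPA`) contributed most to the performance improvement.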
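
One possible way to carry out the feature importance analysis is permutation importance, which measures how much the error grows when a single column is shuffled. The sketch below is a hedged illustration on synthetic data with hypothetical column names (`informative`, `noise`); in the notebook, the analogous call would presumably pass `best_gbr_optimized` together with `X_test_new` and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: the target depends on 'informative' and not on 'noise'
rng = np.random.default_rng(42)
n_rows = 200
X_demo = pd.DataFrame({
    "informative": rng.normal(size=n_rows),
    "noise": rng.normal(size=n_rows),
})
y_demo = 3.0 * X_demo["informative"] + rng.normal(scale=0.1, size=n_rows)

demo_model = GradientBoostingRegressor(random_state=42).fit(X_demo, y_demo)

# Shuffle each column several times and record the average drop in score
perm = permutation_importance(
    demo_model, X_demo, y_demo,
    scoring="neg_mean_squared_error", n_repeats=5, random_state=42,
)
importance_table = pd.Series(
    perm.importances_mean, index=X_demo.columns
).sort_values(ascending=False)

print(importance_table)
```

Because permutation importance shuffles the raw input columns, it reports one value per original feature even when the pipeline one-hot encodes categorical columns internally.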
