## **Introduction and Project Outline**

### **Introduction**
This project tests the robustness of an established XGBoost model ([xgboost_model.pkl](https://github.com/Compcode1/ml-predictions-exam-scores/blob/main/xgboost_model.pkl)), previously trained to predict exam scores from a 250,000-row dataset of medical diagnoses and lifestyle factors ([medrecords.csv](https://github.com/Compcode1/ml-predictions-exam-scores/blob/main/medrecords.csv)).

The XGBoost model demonstrated impressive performance on the original dataset, with high R-squared values and low Mean Squared Error (MSE). Here are some of the key performance metrics from the original project:

- **XGBoost Model MSE**: 0.02338906874154974
- **XGBoost Model R-squared**: 0.9999245421116839
- **Cross-Validated MSE for each fold**: [0.02500604, 0.01728912, 0.01483328, 0.02271097, 0.02217558]
- **Mean Cross-Validated MSE**: 0.02040299750467838
- **Standard Deviation of MSE**: 0.003751659821773931
- **Original Exam Scores for the first 5 rows**: [81., 72., 110., 77., 77.]
- **Predicted Exam Scores for the first 5 rows**: [80.987045, 72.02215, 109.99814, 77.011765, 77.01987]
- **MSE for the first 5 rows**: 0.00023898585932329296
- **Original Exam Scores for the first 1,000 rows** (first 10 shown): [81., 72., 110., 77., 77., 85.5, 26.136837, 58.905, 47.5657875, 87.]
- **Predicted Exam Scores for the first 1,000 rows** (first 10 shown): [80.987045, 72.02215, 109.99814, 77.011765, 77.01987, 85.495255, 26.2142, 59.47049, 47.51045, 86.996826]
- **MSE for the first 1,000 rows**: 0.010526539734485856
- **MSE for the entire dataset**: 0.016013743906791843

In this extension of the project, we introduce a new target variable, **Exam Score 2**, drawn at random from a range selected by the last digit of the row index. Assuming row order is unrelated to the features, this new variable is expected to have no significant correlation with them. The aim is to observe how the same XGBoost model performs when the target variable is unrelated to the input features, testing the model's generalization and performance under challenging conditions.

My hypothesis is that the model will struggle to make accurate predictions for **Exam Score 2**, given that the new target has little to no alignment with the input features.

---

### **Project Outline**

#### **Step 1: Setting up the Environment**
- Import the necessary libraries (e.g., pandas, numpy, xgboost, and sklearn).
- Load the dataset (`medrecords.csv`) into a Pandas DataFrame.
- Verify the dataset is clean, with no missing or corrupted values.

#### **Step 2: Add a New Target Column (Exam Score 2)**
- Define a function that generates the **Exam Score 2** value based on the last digit of the row index.
- Apply this function across all rows to create the new column.

#### **Step 3: Preprocessing the Data**
- One-hot encode categorical variables (e.g., Gender).
- Standardize numerical features such as Age and BMI.
- Drop unnecessary or redundant columns such as Exam Age and Age Group.

#### **Step 4: Split the Dataset**
- Split the dataset into training and testing sets (e.g., 80/20 split).
- Ensure **Exam Score 2** is the target variable in this iteration.

#### **Step 5: Train the XGBoost Model on Exam Score 2**
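- Load the saved XGBoost model to reuse its configuration (optional: initialize a new model instead).
- Refit the model on the new dataset with **Exam Score 2** as the target. Calling `fit` trains from scratch, so the trees learned for the original **Exam Score** are replaced.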
- Evaluate the model using the same performance metrics: Mean Squared Error (MSE), R-squared, and cross-validation results.

#### **Step 6: Evaluate Model Performance**
- Compare the model’s performance on **Exam Score 2** with the performance metrics from the original **Exam Score**.
- Analyze the results: Does the model struggle with the new, arbitrary target variable? Are the metrics significantly worse?

#### **Step 7: Document Findings**
- Summarize the findings in terms of model generalization and feature importance.
- Discuss whether the model can generalize well with an unrelated target variable, and the implications for using machine learning models on structured data.


In [10]:
# Step 1: Setting up the Environment

# Import the necessary libraries
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset from your Desktop
file_path = '/Users/steventuschman/Desktop/medrecords.csv'  # Adjust the file path if necessary
df = pd.read_csv(file_path)

# Verify the dataset is clean (check for missing or corrupted values)
print(df.head())
print("\nMissing values in each column:")
print(df.isnull().sum())

# Optional: handle any missing values if necessary
# df = df.dropna()  # Uncomment this if you want to drop rows with missing values


   Gender  Age Age Group   BMI  Obesity  Smoking  High Alcohol  Heart Disease  \
0    Male   27     18-44  22.7        0        0             0              0   
1  Female   54     45-64  28.5        0        0             0              0   
2    Male   21     18-44  21.3        0        0             1              0   
3  Female   62     45-64  28.6        0        0             0              0   
4    Male   61     45-64  21.4        0        0             1              0   

   Cancer  COPD  Alzheimers  Diabetes  CKD  High Blood Pressure  Stroke  \
0       0     0           0         0    0                    0       0   
1       0     0           0         1    0                    0       0   
2       0     0           0         0    0                    0       0   
3       0     0           0         0    0                    0       0   
4       0     0           0         0    0                    0       0   

   Liver Dx  Strength Exam Age  Exam Score  
0         1      

In [11]:
import random

# Step 2: Add a New Target Column (Exam Score 2)

# Inclusive (low, high) range for each possible last digit of the row index
SCORE_RANGES = {
    0: (40, 60),
    1: (30, 50),
    2: (30, 90),
    3: (10, 80),
    4: (20, 30),
    5: (80, 90),
    6: (50, 75),
    7: (10, 90),
    8: (30, 40),
    9: (70, 84),
}

# Define a function to generate Exam Score 2 based on the last digit of the index
# Note: no random seed is set, so the generated values differ on every run
def generate_exam_score_2(index):
    low, high = SCORE_RANGES[index % 10]  # Look up the range for the last digit
    return random.randint(low, high)  # Draw an integer from that range (bounds inclusive)

# Apply this function to each row to create the new Exam Score 2 column
df['Exam Score 2'] = df.index.map(generate_exam_score_2)

# Display the updated DataFrame to verify the new column
print(df[['Exam Score', 'Exam Score 2']].head())


   Exam Score  Exam Score 2
0        81.0            59
1        72.0            41
2       110.0            73
3        77.0            26
4        77.0            25
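
Because the cell above draws from Python's global `random` state without a seed, every run produces a different **Exam Score 2** column and slightly different metrics. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for this experiment. The names `RANGES_BY_DIGIT` and `make_exam_score_2` and the seed value 42 are illustrative choices, not part of the original notebook.

```python
import random

# Inclusive (low, high) ranges keyed by the last digit of the row index,
# copied from the Step 2 cell above.
RANGES_BY_DIGIT = {
    0: (40, 60), 1: (30, 50), 2: (30, 90), 3: (10, 80), 4: (20, 30),
    5: (80, 90), 6: (50, 75), 7: (10, 90), 8: (30, 40), 9: (70, 84),
}

def make_exam_score_2(n_rows, seed=42):
    """Return a list of n_rows scores; the same seed always gives the same list."""
    rng = random.Random(seed)  # private generator, independent of the global state
    scores = []
    for index in range(n_rows):
        low, high = RANGES_BY_DIGIT[index % 10]
        scores.append(rng.randint(low, high))
    return scores

first_run = make_exam_score_2(20)
second_run = make_exam_score_2(20)
print(first_run == second_run)  # True: identical seed, identical column
```

In the notebook, this could be assigned with `df['Exam Score 2'] = make_exam_score_2(len(df))`, assuming the DataFrame keeps its default integer index.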


In [12]:
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Step 3: Preprocessing the Data

# One-hot encode categorical variables (Gender)
df_encoded = pd.get_dummies(df, columns=['Gender'], drop_first=True)

# Drop unnecessary columns such as 'Exam Age' and 'Age Group'
df_encoded = df_encoded.drop(['Exam Age', 'Age Group'], axis=1)

# Standardize numerical features like 'Age' and 'BMI'
scaler = StandardScaler()

# Columns to be standardized
columns_to_standardize = ['Age', 'BMI']

# Apply the standard scaler to the selected columns
# Note: the scaler is fit on the full dataset, before the train-test split in Step 4
df_encoded[columns_to_standardize] = scaler.fit_transform(df_encoded[columns_to_standardize])

# Display the first few rows of the preprocessed data to verify
print(df_encoded.head())


        Age       BMI  Obesity  Smoking  High Alcohol  Heart Disease  Cancer  \
0 -1.116721 -0.971258        0        0             0              0       0   
1  0.318498 -0.164576        0        0             0              0       0   
2 -1.435659 -1.165975        0        0             1              0       0   
3  0.743748 -0.150668        0        0             0              0       0   
4  0.690592 -1.152066        0        0             1              0       0   

   COPD  Alzheimers  Diabetes  CKD  High Blood Pressure  Stroke  Liver Dx  \
0     0           0         0    0                    0       0         1   
1     0           0         1    0                    0       0         0   
2     0           0         0    0                    0       0         0   
3     0           0         0    0                    0       0         0   
4     0           0         0    0                    0       0         0   

   Strength  Exam Score  Exam Score 2  Gender_Male  
0  

In [13]:
from sklearn.model_selection import train_test_split

# Step 4: Split the Dataset

# Define the features (X) and the target variable (y) - Exam Score 2 in this case
X = df_encoded.drop(columns=['Exam Score', 'Exam Score 2'])
y = df_encoded['Exam Score 2']

# Perform an 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets
print(f"Training set shape: X_train = {X_train.shape}, y_train = {y_train.shape}")
print(f"Testing set shape: X_test = {X_test.shape}, y_test = {y_test.shape}")


Training set shape: X_train = (200000, 16), y_train = (200000,)
Testing set shape: X_test = (50000, 16), y_test = (50000,)
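
Step 3 standardizes `Age` and `BMI` on the full dataset before the split in Step 4, so the test rows contribute to the scaling statistics. Tree-based models are largely insensitive to feature scaling, so this probably has no measurable effect here, but fitting the scaler on the training rows only is the leakage-free ordering. Below is a dependency-free sketch of that idea using made-up ages; the values and helper names are hypothetical.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values only."""
    return mean(values), pstdev(values)

def standardize(values, mu, sigma):
    """Apply previously learned statistics to any split."""
    return [(v - mu) / sigma for v in values]

# Hypothetical ages, already split into train and test
train_age = [27, 54, 21, 62, 61, 45, 33, 70]
test_age = [30, 58]

mu, sigma = fit_standardizer(train_age)          # statistics come from train only
train_scaled = standardize(train_age, mu, sigma)
test_scaled = standardize(test_age, mu, sigma)   # test reuses the train statistics

print(abs(mean(train_scaled)) < 1e-9)  # True: train is centered at zero
print(test_scaled)                     # test values are scaled, but not re-centered on themselves
```

With scikit-learn, the equivalent is to call `scaler.fit_transform` on the training split and `scaler.transform` on the test split.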


In [14]:
import xgboost as xgb
import pickle
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

# Step 5: Train the XGBoost Model on Exam Score 2

# Load the saved XGBoost model (xgboost_model.pkl) to reuse its hyperparameters
model_path = '/Users/steventuschman/Desktop/xgboost_model.pkl'
with open(model_path, 'rb') as file:
    xgboost_model = pickle.load(file)

# Alternatively, initialize a new XGBoost model with explicit hyperparameters
# Uncomment the following line to do so:
# xgboost_model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=6, random_state=42)

# Train the XGBoost model on the training set
# Note: fit() trains from scratch, so the trees learned for the original Exam Score are discarded
xgboost_model.fit(X_train, y_train)

# Evaluate the model using Mean Squared Error (MSE) and R-squared
y_pred = xgboost_model.predict(X_test)

# Calculate the metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Perform cross-validation
# Note: scikit-learn's 'neg_mean_squared_error' scorer returns negated MSE,
# so the per-fold scores printed below are negative
cv_scores = cross_val_score(xgboost_model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
mean_cv_mse = -cv_scores.mean()  # Flip the sign to report a positive MSE
std_cv_mse = cv_scores.std()

# Output the results
print("Model Evaluation on Exam Score 2:")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared: {r2}")
print(f"Cross-Validated MSE for each fold: {cv_scores}")
print(f"Mean Cross-Validated MSE: {mean_cv_mse}")
print(f"Standard Deviation of MSE: {std_cv_mse}")


Model Evaluation on Exam Score 2:
Mean Squared Error (MSE): 457.16622435908454
R-squared: -0.010028362274169922
Cross-Validated MSE for each fold: [-461.5811903  -459.19734528 -458.24252344 -464.60684403 -462.86848149]
Mean Cross-Validated MSE: 461.2992769069315
Standard Deviation of MSE: 2.33425520545703
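
To put the MSE printed above in context, it helps to know what error a constant prediction would achieve. If each last digit occurs equally often and each score is drawn uniformly from its range, the variance of **Exam Score 2** follows from the Step 2 ranges by the law of total variance. The following is a short sketch under those assumptions; the helper names are illustrative.

```python
# Inclusive (low, high) ranges from the Step 2 cell, one per last digit 0-9
RANGES = [(40, 60), (30, 50), (30, 90), (10, 80), (20, 30),
          (80, 90), (50, 75), (10, 90), (30, 40), (70, 84)]

def uniform_int_mean(low, high):
    """Mean of a discrete uniform distribution on low..high (inclusive)."""
    return (low + high) / 2

def uniform_int_variance(low, high):
    """Variance of a discrete uniform distribution on low..high (inclusive)."""
    n = high - low + 1
    return (n * n - 1) / 12

means = [uniform_int_mean(lo, hi) for lo, hi in RANGES]
variances = [uniform_int_variance(lo, hi) for lo, hi in RANGES]

overall_mean = sum(means) / len(means)
within = sum(variances) / len(variances)                            # average spread inside a range
between = sum((m - overall_mean) ** 2 for m in means) / len(means)  # spread of the range centers
total_variance = within + between

print(f"Expected mean: {overall_mean:.2f}")        # 52.95
print(f"Expected variance: {total_variance:.2f}")  # 455.31
```

Under these assumptions the variance is about 455, so a model that always predicted the mean would score an MSE near that value. The test MSE of roughly 457 printed above sits just above it, which is what a slightly negative R-squared implies.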


#### **Step 6: Evaluate Model Performance**

We compared the model's performance on **Exam Score 2** (an arbitrary target) with its performance on the original **Exam Score** (which had meaningful relationships with the features). Below are the results:

**Original Exam Score Performance:**
- Mean Squared Error (MSE): **0.0234**
- R-squared: **0.9999**
- Cross-Validated MSE for each fold: **[0.0250, 0.0173, 0.0148, 0.0227, 0.0222]**
- Mean Cross-Validated MSE: **0.0204**
- Standard Deviation of MSE: **0.0038**

**Exam Score 2 Performance:**
- Mean Squared Error (MSE): **457.17**
- R-squared: **-0.0100**
- Cross-Validated MSE for each fold: **[461.58, 459.20, 458.24, 464.61, 462.87]**
- Mean Cross-Validated MSE: **461.30**
- Standard Deviation of MSE: **2.3343**

These values are taken from the Step 5 output, with the per-fold scores shown as positive MSE. Because **Exam Score 2** is generated without a fixed random seed, they shift slightly from run to run.

### Analysis:

- **Mean Squared Error (MSE)**:
  The MSE on **Exam Score 2** is **457.17**, compared with 0.0234 on the original **Exam Score**. An error of this size means the model cannot predict **Exam Score 2** accurately, which is consistent with the target's arbitrary construction and its lack of correlation with the features.

- **R-squared**:
  The R-squared value for **Exam Score 2** is **-0.0100**. A value slightly below zero means the model does marginally worse than always predicting the mean of the target, so it explains none of the variance. This is a stark contrast to the near-perfect R-squared of **0.9999** achieved with the original **Exam Score**, where the model captured almost all the variability.

- **Cross-Validation MSE**:
  For **Exam Score 2**, the cross-validated MSE values are consistently poor across all folds, with an average MSE of **461.30** and a standard deviation (**2.3343**) that is small relative to that average, showing that the model struggles equally on every data split. This is far worse than the mean cross-validated MSE of **0.0204** on the original **Exam Score**, further highlighting the model's inability to generalize to this new target variable.
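
The reading of R-squared in the bullets above follows from its definition, R² = 1 - SS_res / SS_tot: always predicting the mean of the target gives exactly 0, and any prediction that adds error on top of that goes negative. The sketch below illustrates this with a small made-up target (the values are hypothetical), then rearranges the same identity to recover the test-set variance implied by the MSE and R-squared printed in the Step 5 output.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical target values and two candidate predictors
y = [59, 41, 73, 26, 25, 88, 60, 15, 33, 80]
y_mean = sum(y) / len(y)
mean_pred = [y_mean] * len(y)                    # always predict the mean
noisy_pred = [y_mean + d for d in (3, -3) * 5]   # mean plus an unrelated wiggle

print(r_squared(y, mean_pred))        # 0.0
print(r_squared(y, noisy_pred) < 0)   # True

# Rearranged identity: Var(y) = MSE / (1 - R^2), using the Step 5 output values
reported_mse = 457.16622435908454
reported_r2 = -0.010028362274169922
implied_variance = reported_mse / (1 - reported_r2)
print(f"{implied_variance:.1f}")      # about 452.6
```

The implied variance is only slightly smaller than the reported MSE, which is another way of saying the fitted model adds a little error relative to a constant prediction.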


The comparison shows that the model struggles significantly with the arbitrary target variable **Exam Score 2**. The original model had strong performance on the **Exam Score** variable, with very low MSE and high R-squared values. However, when presented with a target variable that has no meaningful relationship with the features, the model's performance declines sharply, as indicated by the extremely high MSE and negative R-squared values for **Exam Score 2**. This demonstrates the importance of having a meaningful correlation between the features and the target variable in machine learning models like XGBoost.


#### **Step 7: Document Findings**

### Summary of Findings:

Through this experiment, we aimed to test the robustness and generalization of a previously trained XGBoost model by introducing a new, arbitrary target variable (**Exam Score 2**) designed to have no significant correlation with the input features. The model, which previously demonstrated impressive performance on the original **Exam Score**, showed a dramatic decline in predictive ability when refit and evaluated on **Exam Score 2**.

1. **Model Performance:**
   - The XGBoost model performed exceptionally well on the original **Exam Score**, with very low **Mean Squared Error (MSE)** and near-perfect **R-squared** values, indicating a strong ability to capture the relationships between features and the target variable.
   - However, when predicting **Exam Score 2**, the model's performance metrics in the Step 5 output (MSE: **457.17**, R-squared: **-0.0100**) showed that it could not make accurate predictions, because there is no meaningful relationship between the input features and the target variable.

2. **Generalization Ability:**
   - The significant performance drop on **Exam Score 2** highlights that the XGBoost model cannot generalize well when the target variable has no inherent connection to the input data. The model relies heavily on finding patterns and relationships in structured data, and when those patterns are absent, its predictive power diminishes substantially.
   - This demonstrates that machine learning models like XGBoost are designed to optimize for patterns and correlations within the data. Without meaningful features, the model becomes ineffective.

3. **Feature Importance:**
   - In the original experiment, key features such as **Strength**, **Heart Disease**, **Alzheimer's**, and **Cancer** were identified as important in predicting **Exam Score**. These features had significant correlations with the target variable, enabling the model to perform well.
   - In contrast, for **Exam Score 2**, which is arbitrarily assigned, feature importance is not informative: the target does not depend on any feature, so any importance the model reports reflects fitted noise rather than a real relationship. This explains why the model struggles with this unrelated target.
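
The claim that **Exam Score 2** does not depend on any feature can be checked without a model by measuring the correlation between the target and each feature. The sketch below does this on synthetic stand-in data, since the real rows are not reproduced here: `age` is a hypothetical feature drawn independently of row order, and `score_dependent` is a made-up target included only for contrast.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Inclusive (low, high) ranges from the Step 2 cell, one per last digit 0-9
RANGES = [(40, 60), (30, 50), (30, 90), (10, 80), (20, 30),
          (80, 90), (50, 75), (10, 90), (30, 40), (70, 84)]

rng = random.Random(0)  # seeded so the sketch is repeatable
n_rows = 20_000

# Hypothetical feature, generated independently of the row index
age = [rng.randint(18, 90) for _ in range(n_rows)]
# Index-keyed random target, built the same way as in Step 2
score_2 = [rng.randint(*RANGES[i % 10]) for i in range(n_rows)]
# A target that does depend on the feature, for contrast
score_dependent = [100 - 0.5 * a + rng.gauss(0, 5) for a in age]

print(f"age vs index-keyed target: {pearson(age, score_2):+.3f}")
print(f"age vs dependent target:   {pearson(age, score_dependent):+.3f}")
```

The first correlation should land near zero and the second should be strongly negative. On the real data, `df_encoded.corr()['Exam Score 2']` would be one way to run the same kind of check for every feature at once, assuming all remaining columns are numeric or boolean.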

### Implications:

The results of this experiment underscore the importance of aligning target variables with meaningful features when using machine learning models. If the target variable does not have a logical relationship with the input features, even highly optimized models like XGBoost will fail to produce accurate predictions. This experiment also illustrates the limitations of machine learning when applied to datasets with poorly defined or arbitrary target variables, emphasizing the need for careful feature selection and target definition in predictive modeling tasks.

In structured data, models like XGBoost excel at identifying relationships between features and the target and using them to make accurate predictions. However, the model's inability to perform well on **Exam Score 2** reinforces the importance of understanding the data context and ensuring that the target variable is appropriately aligned with the available features for meaningful model training and evaluation.
