# CROSS VALIDATION
## by Hafiz Zain Waheed 2862
--------------------------



## Step 1: Load the data

In [45]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import cross_val_score
from math import sqrt

In [46]:
# Load the dataset
df = pd.read_csv('heart.csv')

# Display the first few rows to inspect the data
print(df.head())


   Age Sex ChestPainType  RestingBP  Cholesterol  FastingBS RestingECG  MaxHR  \
0   40   M           ATA        140          289          0     Normal    172   
1   49   F           NAP        160          180          0     Normal    156   
2   37   M           ATA        130          283          0         ST     98   
3   48   F           ASY        138          214          0     Normal    108   
4   54   M           NAP        150          195          0     Normal    122   

  ExerciseAngina  Oldpeak ST_Slope  HeartDisease  
0              N      0.0       Up             0  
1              N      1.0     Flat             1  
2              N      0.0       Up             0  
3              Y      1.5     Flat             1  
4              N      0.0       Up             0  


## Step 2: Find and fill the missing values

In [48]:

# Handling Missing Values
# Count the missing values in each column
missing_values = df.isnull().sum()
# Fill missing values in numeric columns with the column mean
# (numeric_only=True skips text columns such as Sex, which have no mean)
df.fillna(df.mean(numeric_only=True), inplace=True)



## Step 3: Split the dataset into training and testing sets

In [49]:
# Data Splitting
# Split the data into features and target variable
X = df.drop('HeartDisease', axis=1)  # Features
y = df['HeartDisease']                # Target variable

In [50]:
# Convert categorical variables to dummy variables
X = pd.get_dummies(X)


In [51]:
# Split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


This code uses train_test_split from sklearn.model_selection to divide the dataset into a training set (70%) and a testing set (30%), as set by test_size=0.3. The random_state parameter seeds the random number generator, which makes the train-test split reproducible.
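
To illustrate the reproducibility point, here is a minimal sketch on synthetic arrays (X_demo and y_demo are stand-ins invented for this example, not the heart data). It calls train_test_split twice with the same seed and checks that both calls produce the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 100 rows, 2 features; not heart.csv)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Same call shape as in the cell above: 30% held out, fixed seed
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print('train rows:', len(X_tr), 'test rows:', len(X_te))

# Identical seeds should give identical partitions
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print('same partition on both calls:', same_split)
```

Changing random_state (or leaving it unset) would generally give a different partition, so metrics computed on the split can shift slightly from run to run.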

## Step 4: Linear Regression Testing

In [52]:
# Linear Regression Testing
# Initialize the Linear Regression model
lr = LinearRegression()
# Fit the model on the training data
lr.fit(X_train, y_train)
# Predict the target variable for the testing set
y_pred = lr.predict(X_test)

In [53]:
# Evaluation Metrics on Training Data
# Calculate MSE, MAE, and RMSE on the training dataset
y_train_pred = lr.predict(X_train)
mse = mean_squared_error(y_train, y_train_pred)
mae = mean_absolute_error(y_train, y_train_pred)
rmse = sqrt(mse)

In [64]:
print(mse)
print(mae)
print(rmse)

0.10769341198166363
0.23828437882444234
0.3281667441738477
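
The metrics above are computed on the rows the model was fitted on, while the test-set predictions in y_pred are never scored. The sketch below shows one way to report both sets of metrics side by side. It is an illustration on synthetic data (X_demo, y_demo, and the report helper are assumptions made for this example), so its numbers say nothing about the heart dataset.

```python
import numpy as np
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a binary target (assumption: not heart.csv)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_tr, y_tr)

def report(y_true, y_hat):
    """Return (MSE, MAE, RMSE) for one set of predictions (hypothetical helper)."""
    mse_value = mean_squared_error(y_true, y_hat)
    return mse_value, mean_absolute_error(y_true, y_hat), sqrt(mse_value)

# Score the fitted model on the rows it saw and on the held-out rows
train_scores = report(y_tr, model.predict(X_tr))
test_scores = report(y_te, model.predict(X_te))
print('train (MSE, MAE, RMSE):', train_scores)
print('test  (MSE, MAE, RMSE):', test_scores)
```

In the notebook itself, the equivalent step would be to pass y_test and y_pred to the same metric functions.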


## Step 5: Cross Validation with varying K Values

In [58]:
# Cross-Validation
# Run 5-fold cross-validation (cv=5). With no scoring argument,
# cross_val_score returns the estimator's default score (R^2 for LinearRegression)
cv_scores = cross_val_score(lr, X_train, y_train, cv=5)

In [59]:
# Cross-Validated Metrics on Training Data
# Estimate MSE and MAE by averaging over the 5 validation folds
# (sklearn returns negated errors, hence the leading minus sign)
mse_cv = -cross_val_score(lr, X_train, y_train, scoring='neg_mean_squared_error', cv=5).mean()
mae_cv = -cross_val_score(lr, X_train, y_train, scoring='neg_mean_absolute_error', cv=5).mean()
# RMSE is taken as the square root of the mean fold MSE
rmse_cv = sqrt(mse_cv)

In [61]:
print(mse_cv)
print(mae_cv)
print(rmse_cv)

0.11491806533741648
0.2470438758241992
0.33899567156147653
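
The cells in this step use a single value, K = 5. A minimal sketch of how the same metrics could be collected for several values of K is shown below. The values 3, 5, and 10 are illustrative choices, and the data is synthetic (X_demo and y_demo are assumptions for this example), so the printed numbers are not results for the heart dataset.

```python
import numpy as np
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with a binary target (assumption: not heart.csv)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

results = {}
for k in (3, 5, 10):  # illustrative choices of K
    neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                              scoring='neg_mean_squared_error', cv=k)
    neg_mae = cross_val_score(LinearRegression(), X_demo, y_demo,
                              scoring='neg_mean_absolute_error', cv=k)
    mse_k = -neg_mse.mean()  # flip the sign back to a positive error
    results[k] = (mse_k, -neg_mae.mean(), sqrt(mse_k))
    print(f'K={k}: MSE={results[k][0]:.4f}, '
          f'MAE={results[k][1]:.4f}, RMSE={results[k][2]:.4f}')
```

Larger K means each model is trained on more of the data but validated on smaller folds, so the per-fold scores tend to vary more.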


## Step 6: Compare results with and without CV

In [63]:
print('MSE (without CV):', mse, '\nMSE (with CV):', mse_cv)
print('MAE (without CV):', mae, '\nMAE (with CV):', mae_cv)
print('RMSE (without CV):', rmse, '\nRMSE (with CV):', rmse_cv)

MSE (without CV): 0.10769341198166363 
MSE (with CV): 0.11491806533741648
MAE (without CV): 0.23828437882444234 
MAE (with CV): 0.2470438758241992
RMSE (without CV): 0.3281667441738477 
RMSE (with CV): 0.33899567156147653


From the results, we observe that the error metrics are slightly higher under 5-fold cross-validation (CV) than when the model is scored on the same training data it was fitted on. Specifically, the MSE increased from 0.1077 to 0.1149, the MAE from 0.2383 to 0.2470, and the RMSE from 0.3282 to 0.3390.

The lower initial error metrics show that the model fits the training data well, but they are optimistic because the model is scored on rows it has already seen. The slight increase in error under cross-validation reflects that each fold is scored on rows held out from fitting. This is not a sign of a worse model; the fitted model is the same, and only the evaluation is more realistic. Cross-validation does not change the model itself. It helps to detect overfitting by testing the model's ability to generalize on multiple held-out subsets of the data.
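
To make the held-out scoring concrete, the sketch below runs the folds by hand with KFold and compares the per-fold MSE with what cross_val_score reports for the same splitter. It uses synthetic data (X_demo and y_demo are assumptions for this example) and is meant as an illustration of the mechanism, not as a description of scikit-learn's internals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data with a binary target (assumption: not heart.csv)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

kf = KFold(n_splits=5)  # 5 consecutive folds, no shuffling
manual_mse = []
for train_idx, val_idx in kf.split(X_demo):
    # Fit on four folds, then score on the one fold left out
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    manual_mse.append(mean_squared_error(y_demo[val_idx], fold_pred))

# Same splitter passed to cross_val_score; negate to get positive MSE
auto_mse = -cross_val_score(LinearRegression(), X_demo, y_demo,
                            scoring='neg_mean_squared_error', cv=kf)

folds_match = np.allclose(manual_mse, auto_mse)
print('manual and cross_val_score folds agree:', folds_match)
```

Every score in the loop comes from rows the fold's model never saw during fitting, which is why the averaged error is a fairer estimate than the training error.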

In conclusion, while the non-CV metrics look better because they are lower, the CV metrics give a more honest assessment of the model's performance on unseen data. The cross-validated results are therefore the ones to report when setting a realistic expectation of the model's predictive power.





