
Model Evaluation with Cross Validation and Bias-Variance Analysis


In [None]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
air_quality = fetch_ucirepo(id=360)

# data (as pandas dataframes)
X = air_quality.data.features
y = air_quality.data.targets

# metadata
print(air_quality.metadata)

# variable information
print(air_quality.variables)


{'uci_id': 360, 'name': 'Air Quality', 'repository_url': 'https://archive.ics.uci.edu/dataset/360/air+quality', 'data_url': 'https://archive.ics.uci.edu/static/public/360/data.csv', 'abstract': 'Contains the responses of a gas multisensor device deployed on the field in an Italian city. Hourly responses averages are recorded along with gas concentrations references from a certified analyzer. ', 'area': 'Computer Science', 'tasks': ['Regression'], 'characteristics': ['Multivariate', 'Time-Series'], 'num_instances': 9358, 'num_features': 15, 'feature_types': ['Real'], 'demographics': [], 'target_col': None, 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2008, 'last_updated': 'Sun Mar 10 2024', 'dataset_doi': '10.24432/C59K5F', 'creators': ['Saverio Vito'], 'intro_paper': {'ID': 420, 'type': 'NATIVE', 'title': 'On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario', 'authors': 

# Task
Analyze the air quality dataset from "https://archive.ics.uci.edu/dataset/360/air+quality" by performing cross-validation and bias-variance analysis.

## Load and preprocess data

### Subtask:
Handle missing values, convert data types, and prepare the data for modeling.


In [None]:
import numpy as np
import pandas as pd
from IPython.display import display

print(air_quality.data.keys())
print(air_quality.data.features.head())
print(air_quality.data.targets)

# Check if 'targets' is in the keys and if it is None
if 'targets' in air_quality.data.keys() and air_quality.data.targets is not None:
    X = air_quality.data.features
    y = air_quality.data.targets

    # Replace the missing value indicator (-200) with NaN
    X = X.replace(-200, np.nan)
    y = y.replace(-200, np.nan)

    # Convert 'Date' and 'Time' to datetime and combine them
    X['Date'] = pd.to_datetime(X['Date'], format='%m/%d/%Y')
    X['Time'] = pd.to_timedelta(X['Time'].astype(str))
    X['DateTime'] = X['Date'] + X['Time']
    X = X.drop(['Date', 'Time'], axis=1)

    # Impute missing values with the mean
    X = X.fillna(X.mean())
    y = y.fillna(y.mean())

    # Concatenate the preprocessed feature and target DataFrames
    df_processed = pd.concat([X, y], axis=1)

    display(df_processed.head())

else:
    print("Target variable 'y' is not available or is None in the dataset.")


dict_keys(['ids', 'features', 'targets', 'original', 'headers'])
        Date      Time  CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  \
0  3/10/2004  18:00:00     2.6         1360       150      11.9   
1  3/10/2004  19:00:00     2.0         1292       112       9.4   
2  3/10/2004  20:00:00     2.2         1402        88       9.0   
3  3/10/2004  21:00:00     2.2         1376        80       9.2   
4  3/10/2004  22:00:00     1.6         1272        51       6.5   

   PT08.S2(NMHC)  NOx(GT)  PT08.S3(NOx)  NO2(GT)  PT08.S4(NO2)  PT08.S5(O3)  \
0           1046      166          1056      113          1692         1268   
1            955      103          1174       92          1559          972   
2            939      131          1140      114          1555         1074   
3            948      172          1092      122          1584         1203   
4            836      131          1205      116          1490         1110   

      T    RH      AH  
0  13.6  48.9  0.7578  
1  13.3  

**Reasoning**:
The output confirms that `air_quality.data.targets` is `None`, which matches the `'target_col': None` entry in the metadata. The variable information nevertheless lists several columns that could serve as targets (e.g., 'CO(GT)', 'NMHC(GT)', 'C6H6(GT)', 'NOx(GT)', 'NO2(GT)'). I will select 'CO(GT)' as the target variable, separate it from the `features` DataFrame, and then continue with the preprocessing steps.



In [None]:
# Select 'CO(GT)' as the target variable from the features DataFrame
y = X['CO(GT)']
X = X.drop('CO(GT)', axis=1)

# Replace the missing value indicator (-200) with NaN in the updated X and y
X = X.replace(-200, np.nan)
y = y.replace(-200, np.nan)

# Convert 'Date' and 'Time' to datetime and combine them
X['Date'] = pd.to_datetime(X['Date'], format='%m/%d/%Y')
X['Time'] = pd.to_timedelta(X['Time'].astype(str))
X['DateTime'] = X['Date'] + X['Time']
X = X.drop(['Date', 'Time'], axis=1)

# Impute missing values with the mean
X = X.fillna(X.mean())
y = y.fillna(y.mean())

# Concatenate the preprocessed feature and target DataFrames
df_processed = pd.concat([X, y], axis=1)

display(df_processed.head())

Unnamed: 0,PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,DateTime,CO(GT)
0,1360.0,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,13.6,48.9,0.7578,2004-03-10 18:00:00,2.6
1,1292.0,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,13.3,47.7,0.7255,2004-03-10 19:00:00,2.0
2,1402.0,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,11.9,54.0,0.7502,2004-03-10 20:00:00,2.2
3,1376.0,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,11.0,60.0,0.7867,2004-03-10 21:00:00,2.2
4,1272.0,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,11.2,59.6,0.7888,2004-03-10 22:00:00,1.6
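
The preprocessing above fills missing values with means computed over the whole dataset, before any split into training and evaluation rows. As a minimal sketch of an alternative, assuming a small made-up frame (the column names `sensor_a` and `sensor_b`, the values, and the split point are all hypothetical), the same sentinel replacement can be combined with fill values taken from the training rows only:

```python
import numpy as np
import pandas as pd

# Toy frame; -200 plays the role of the missing-value sentinel.
toy = pd.DataFrame({
    "sensor_a": [1.0, -200.0, 3.0, 5.0, -200.0, 7.0],
    "sensor_b": [10.0, 20.0, -200.0, 40.0, 50.0, 60.0],
})

# Step 1: turn the sentinel into NaN so pandas treats it as missing.
toy = toy.replace(-200, np.nan)

# Step 2: hypothetical split, first four rows for training, the rest for testing.
train, test = toy.iloc[:4], toy.iloc[4:]

# Step 3: compute the fill values from the training rows only,
# then apply those same values to both parts.
train_means = train.mean()
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(train_means)
print(test_filled)
```

Whether this matters for the air quality data depends on how many values are missing per column; the sketch only shows the mechanics.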


## Select and train a model

### Subtask:
Choose a suitable regression model and train it on the preprocessed data.


**Reasoning**:
Import the Linear Regression model from scikit-learn and define the feature matrix X and target vector y from the preprocessed dataframe. Then instantiate and train the model.



In [None]:
from sklearn.linear_model import LinearRegression

# Define the feature matrix X and target vector y
X = df_processed.drop(['CO(GT)', 'DateTime'], axis=1)
y = df_processed['CO(GT)']

# Instantiate the Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X, y)

## Perform cross-validation

### Subtask:
Implement cross-validation to evaluate the model's performance and assess its generalization ability.


**Reasoning**:
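Implement cross-validation using `KFold` and `cross_val_score` to estimate the model's out-of-sample performance. `cross_val_score` fits a fresh copy of the estimator on each training fold, so the earlier `model.fit(X, y)` call does not influence the scores.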



In [None]:
from sklearn.model_selection import cross_val_score, KFold

# Define the number of splits for cross-validation
n_splits = 5

# Instantiate a KFold object with the chosen number of splits and shuffle=True
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

# cross_val_score clones the estimator and refits it on each training fold
# 'neg_mean_squared_error' returns the negated MSE per fold (higher is better)
cv_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')

# Negate the mean score to recover the mean MSE, then take the square root for RMSE
avg_cv_rmse = np.sqrt(-cv_scores.mean())

# Print the average cross-validation score
print(f"Average Cross-Validation RMSE: {avg_cv_rmse:.4f}")

Average Cross-Validation RMSE: 0.5834
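
To make the procedure concrete, here is a rough NumPy-only sketch of a shuffled 5-fold evaluation of an ordinary least-squares fit. It runs on synthetic data (an assumption for illustration), the helper `kfold_mse` is hypothetical, and its shuffling will not reproduce the exact folds that `KFold` generates:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data (illustrative only, not the air quality data).
n_samples, n_features = 200, 3
X_demo = rng.normal(size=(n_samples, n_features))
true_coef = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ true_coef + rng.normal(scale=0.3, size=n_samples)

def kfold_mse(X, y, n_splits=5, seed=42):
    """Return the held-out MSE of a least-squares fit for each fold."""
    indices = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(indices, n_splits)
    fold_mse = []
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        # Add an intercept column and refit from scratch on the training folds.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        residuals = y[test_idx] - A_test @ coef
        fold_mse.append(np.mean(residuals ** 2))
    return np.array(fold_mse)

fold_mse = kfold_mse(X_demo, y_demo)
rmse_from_mean_mse = np.sqrt(fold_mse.mean())
print(f"Per-fold MSE: {np.round(fold_mse, 4)}")
print(f"RMSE from mean MSE: {rmse_from_mean_mse:.4f}")
```

Each fold is scored by a model that never saw its rows, which is what makes the averaged score an estimate of generalization error.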


## Perform bias-variance analysis

### Subtask:
Analyze the model's bias and variance to understand its sources of error.


**Reasoning**:
Import the necessary library for bias-variance decomposition.



In [None]:
from mlxtend.evaluate import bias_variance_decomp

**Reasoning**:
Calculate the bias, variance, and average loss using bias_variance_decomp and print the results.



In [None]:
import numpy as np

# Reshape y to be a 1D array if it's a Series
y = y.values.reshape(-1)

# Perform bias-variance decomposition
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
    model, X.values, y, loss='mse', random_seed=42)

# Print the results
print(f'Average expected loss: {avg_expected_loss:.4f}')
print(f'Average bias: {avg_bias:.4f}')
print(f'Average variance: {avg_var:.4f}')

TypeError: bias_variance_decomp() missing 2 required positional arguments: 'X_test' and 'y_test'

**Reasoning**:
The error indicates that `bias_variance_decomp` requires test data (`X_test` and `y_test`). Split the data into training and testing sets before performing bias-variance decomposition.



In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Make sure the targets are 1D NumPy arrays, whether y was a Series or an array
y_train = np.asarray(y_train).reshape(-1)
y_test = np.asarray(y_test).reshape(-1)

# Perform bias-variance decomposition
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
    model, X_train.values, y_train, X_test.values, y_test, loss='mse', random_seed=42)

# Print the results
print(f'Average expected loss: {avg_expected_loss:.4f}')
print(f'Average bias: {avg_bias:.4f}')
print(f'Average variance: {avg_var:.4f}')

Average expected loss: 0.3346
Average bias: 0.3336
Average variance: 0.0009
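
The three numbers above come from repeatedly refitting the model on bootstrap resamples of the training set. The following NumPy-only sketch shows that kind of decomposition on synthetic data. It reflects my understanding of the approach rather than the library's exact code, and the data, the straight-line model, and `num_rounds = 200` are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a quadratic signal, fitted by a straight line (assumed setup).
x_train = rng.uniform(-2, 2, size=150)
y_train = x_train ** 2 + rng.normal(scale=0.2, size=150)
x_test = rng.uniform(-2, 2, size=50)
y_test = x_test ** 2 + rng.normal(scale=0.2, size=50)

num_rounds = 200
all_pred = np.empty((num_rounds, x_test.size))
for r in range(num_rounds):
    # Bootstrap: resample the training set with replacement and refit.
    idx = rng.integers(0, x_train.size, size=x_train.size)
    slope, intercept = np.polyfit(x_train[idx], y_train[idx], deg=1)
    all_pred[r] = slope * x_test + intercept

# Average prediction per test point across all rounds.
main_pred = all_pred.mean(axis=0)
# MSE averaged over rounds and test points.
avg_expected_loss = np.mean((all_pred - y_test) ** 2)
# Squared-bias term, measured against the noisy observed targets.
avg_bias_sq = np.mean((main_pred - y_test) ** 2)
# Spread of the predictions across rounds, averaged over test points.
avg_var = np.mean(all_pred.var(axis=0))

print(f"loss={avg_expected_loss:.4f}  bias^2={avg_bias_sq:.4f}  var={avg_var:.4f}")
```

With these definitions the expected loss equals the squared-bias term plus the variance, which matches the printed results above (0.3336 + 0.0009 ≈ 0.3346). Because the squared-bias term is computed against observed targets, it also contains whatever noise those targets carry.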


## Evaluate model performance

### Subtask:
Summarize the results from cross-validation and bias-variance analysis.


**Reasoning**:
Summarize the results from cross-validation and bias-variance analysis by comparing the calculated metrics and interpreting the model's performance based on these values.



In [None]:
print(f"Average Cross-Validation RMSE: {avg_cv_rmse:.4f}")
print(f'Average expected loss (MSE): {avg_expected_loss:.4f}')
print(f'Average bias squared: {avg_bias:.4f}')
print(f'Average variance: {avg_var:.4f}')

# Interpretation
print("\n--- Model Performance Interpretation ---")

# Cross-validation vs Expected Loss (MSE)
# RMSE is the square root of MSE. Squaring avg_cv_rmse to compare with avg_expected_loss (MSE).
avg_cv_mse = avg_cv_rmse**2
print(f"Average Cross-Validation MSE: {avg_cv_mse:.4f}")

if avg_cv_mse > avg_expected_loss:
    print("The average cross-validation MSE is slightly higher than the average expected loss.")
    print("This suggests that the performance on unseen data during cross-validation is slightly worse than the performance on the dedicated test set used for bias-variance decomposition.")
else:
    print("The average cross-validation MSE is similar to or lower than the average expected loss.")
    print("This suggests that the performance on unseen data during cross-validation is comparable to or better than the performance on the dedicated test set used for bias-variance decomposition.")

# Bias vs Variance
print(f"\nBias-Variance Analysis:")
print(f"Average Bias: {avg_bias:.4f}")
print(f"Average Variance: {avg_var:.4f}")

if avg_bias > avg_var:
    print("The average bias is significantly higher than the average variance.")
    print("This indicates that bias is the main source of error for this model.")
else:
    print("The average variance is higher than or comparable to the average bias.")
    print("This indicates that variance is a significant source of error for this model.")

# Underfitting vs Overfitting
print(f"\nUnderfitting vs Overfitting:")
if avg_bias > 0.1 and avg_var < 0.01: # Thresholds are illustrative and can be adjusted
    print("High bias and low variance suggest that the model is likely underfitting the data.")
    print("The model is too simple to capture the underlying patterns in the training data, leading to systematic errors (high bias) and consistent predictions across different datasets (low variance).")
elif avg_bias < 0.1 and avg_var > 0.01:
    print("Low bias and high variance suggest that the model is likely overfitting the data.")
    print("The model is too complex and is capturing noise in the training data, leading to small errors on the training set (low bias) but large variations in predictions on unseen data (high variance).")
else:
    print("The balance between bias and variance appears reasonable, or the values fall in a range that doesn't strongly indicate clear underfitting or overfitting based on simple thresholds.")


# Concise Summary
print("\n--- Summary of Findings ---")
print(f"The model's performance on unseen data, as indicated by the average cross-validation RMSE ({avg_cv_rmse:.4f}), is a key metric for generalization.")
print(f"The bias-variance decomposition reveals that the primary source of error is high bias ({avg_bias:.4f}), while the variance ({avg_var:.4f}) is relatively low.")


Average Cross-Validation RMSE: 0.5834
Average expected loss (MSE): 0.3346
Average bias squared: 0.3336
Average variance: 0.0009

--- Model Performance Interpretation ---
Average Cross-Validation MSE: 0.3403
The average cross-validation MSE is slightly higher than the average expected loss.
This suggests that the performance on unseen data during cross-validation is slightly worse than the performance on the dedicated test set used for bias-variance decomposition.

Bias-Variance Analysis:
Average Bias: 0.3336
Average Variance: 0.0009
The average bias is significantly higher than the average variance.
This indicates that bias is the main source of error for this model.

Underfitting vs Overfitting:
High bias and low variance suggest that the model is likely underfitting the data.
The model is too simple to capture the underlying patterns in the training data, leading to systematic errors (high bias) and consistent predictions across different datasets (low variance).

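--- Summary of Findings ---
The model's performance on unseen data, as indicated by the average cross-validation RMSE (0.5834), is a key metric for generalization.
The bias-variance decomposition reveals that the primary source of error is high bias (0.3336), while the variance (0.0009) is relatively low.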
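
The comparison in the cell above squares the cross-validation RMSE to put it on the same scale as the expected loss. That works because the RMSE was computed as the square root of the mean fold MSE. A small sketch with hypothetical per-fold values (not the notebook's actual fold scores) shows that averaging per-fold RMSEs instead would give a different number:

```python
import math

# Hypothetical per-fold MSE values, chosen only to illustrate the arithmetic.
fold_mse = [0.30, 0.36, 0.33, 0.38, 0.32]

mean_mse = sum(fold_mse) / len(fold_mse)
# Square root of the mean MSE: the summary the notebook computes.
rmse_of_mean_mse = math.sqrt(mean_mse)
# Mean of the per-fold RMSEs: a different summary of the same folds.
mean_of_fold_rmse = sum(math.sqrt(m) for m in fold_mse) / len(fold_mse)

# Squaring the first recovers the mean MSE; squaring the second does not.
print(f"mean MSE:             {mean_mse:.4f}")
print(f"(RMSE of mean MSE)^2: {rmse_of_mean_mse ** 2:.4f}")
print(f"(mean fold RMSE)^2:   {mean_of_fold_rmse ** 2:.4f}")
```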

## Summary:

### Data Analysis Key Findings

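*   The average cross-validation RMSE for the Linear Regression model is 0.5834, which corresponds to an MSE of about 0.3403.
*   The average expected loss (MSE) from the bias-variance decomposition is 0.3346, close to the cross-validation MSE.
*   The squared-bias term is 0.3336, which accounts for nearly all of the expected loss and far exceeds the average variance (0.0009).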

### Insights or Next Steps

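*   The analysis indicates that the Linear Regression model is likely underfitting the data: almost all of its error sits in the bias term, and its predictions barely change across resampled training sets. Because the bias term is measured against observed targets, it may also absorb irreducible noise and any distortion from mean-imputing the target, so not all of this error is necessarily removable with a more flexible model.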
*   To improve performance, consider using a more complex model or exploring feature engineering techniques to better represent the relationships in the data.
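
As a rough illustration of the second point, the sketch below compares a straight-line fit with a quadratic fit on a held-out portion of synthetic data. The curved relationship, the noise level, and the split sizes are assumptions chosen for the example; nothing here is derived from the air quality dataset:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic curved relationship (an assumption; not the air quality data).
x = rng.uniform(-3, 3, size=300)
y = 0.5 * x ** 2 + x + rng.normal(scale=0.5, size=300)

# Simple hold-out split: first 200 points for fitting, last 100 for evaluation.
x_fit, y_fit = x[:200], y[:200]
x_eval, y_eval = x[200:], y[200:]

def holdout_mse(degree):
    """Fit a polynomial of the given degree and return its MSE on the held-out points."""
    coefs = np.polyfit(x_fit, y_fit, deg=degree)
    return np.mean((np.polyval(coefs, x_eval) - y_eval) ** 2)

mse_linear = holdout_mse(1)
mse_quadratic = holdout_mse(2)
print(f"linear MSE:    {mse_linear:.4f}")
print(f"quadratic MSE: {mse_quadratic:.4f}")
```

When the true relationship is curved, the added squared feature lowers the held-out error. Whether a similar gain appears on the real data would need to be checked with the same cross-validation setup used earlier.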
