Title: Understanding Regression Metrics

Task 1: Calculate MAE and MSE on test predictions and compare errors.

In [1]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data (replace with your dataset)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Split dataset into training and testing sets
# (with only 5 samples, test_size=0.2 leaves a single test point)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a simple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on test data
y_pred = model.predict(X_test)

# Calculate MAE and MSE
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")

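# Comparison note: MAE is in the units of y and MSE is in squared units,
# so comparing them directly is not meaningful (MSE < MAE only says the
# errors are below 1). RMSE shares the units of y and is always >= MAE.
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
if np.isclose(rmse, mae):
    print("RMSE equals MAE: the test errors all have the same magnitude.")
else:
    print("RMSE exceeds MAE: larger errors dominate the squared loss.")


Mean Absolute Error (MAE): 0.2429
Mean Squared Error (MSE): 0.0590
Root Mean Squared Error (RMSE): 0.2429
RMSE equals MAE: the test errors all have the same magnitude.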
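
The metrics above can also be computed directly from their definitions, which makes the unit difference between MAE and MSE explicit. The following is a minimal sketch using NumPy only; the arrays `y_true` and `y_hat` are hypothetical values chosen for illustration and are not taken from the cell above.

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_hat          # residuals: [0.5, 0.0, -2.0, -1.0]
mae = np.mean(np.abs(errors))    # mean absolute error, same units as y
mse = np.mean(errors ** 2)       # mean squared error, units of y squared
rmse = np.sqrt(mse)              # root of MSE, back in the units of y

print(f"MAE:  {mae:.4f}")
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```

Here the single error of 2.0 contributes 4.0 to the squared sum, so RMSE ends up above MAE. When every error has the same magnitude (for example, with a single test point), the two coincide.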


Task 2: Evaluate R2 Score on varying datasets and discuss significance.

In [2]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import numpy as np

# Function to train and evaluate model on a dataset
def evaluate_r2(X, y, description):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    print(f"R2 score on {description}: {r2:.4f}")
    return r2

# Dataset 1: Simple linear relation with low noise
X1, y1 = make_regression(n_samples=100, n_features=1, noise=5, random_state=1)
r2_1 = evaluate_r2(X1, y1, "linear data with low noise")

# Dataset 2: Linear relation with higher noise
X2, y2 = make_regression(n_samples=100, n_features=1, noise=20, random_state=1)
r2_2 = evaluate_r2(X2, y2, "linear data with high noise")

# Dataset 3: Non-linear relation (quadratic)
# Note: the noise below is unseeded, so this score varies between runs
X3 = np.linspace(-3, 3, 100).reshape(-1, 1)
y3 = 2*X3.flatten()**2 + 3 + np.random.normal(0, 3, 100)
r2_3 = evaluate_r2(X3, y3, "non-linear quadratic data")

# Discussion:
print("\nSignificance of R2 score:")
print("- R2 measures the fraction of the variance in y that the model explains.")
print("- 1 is a perfect fit; 0 matches always predicting the mean of y; negative is worse than that baseline.")
print("- Higher noise lowers R2 because more of the variance cannot be explained.")
print("- For non-linear data, a linear model can score near 0 or below, which shows why model choice matters.")


R2 score on linear data with low noise: 0.9958
R2 score on linear data with high noise: 0.9420
R2 score on non-linear quadratic data: -0.0800

Significance of R2 score:
- R2 measures the fraction of the variance in y that the model explains.
- 1 is a perfect fit; 0 matches always predicting the mean of y; negative is worse than that baseline.
- Higher noise lowers R2 because more of the variance cannot be explained.
- For non-linear data, a linear model can score near 0 or below, which shows why model choice matters.
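
To see why R2 can drop to 0 or below, it may help to compute it from its definition, 1 - SS_res / SS_tot. The sketch below assumes a hypothetical helper `r2_from_definition` and made-up prediction vectors; it is an illustration of the formula, not the code used in the cell above.

```python
import numpy as np

def r2_from_definition(y_true, y_hat):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical targets, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

perfect = y_true.copy()                          # exact predictions
mean_only = np.full_like(y_true, y_true.mean())  # always predict the mean
reversed_pred = y_true[::-1]                     # systematically wrong

print(r2_from_definition(y_true, perfect))        # 1.0
print(r2_from_definition(y_true, mean_only))      # 0.0
print(r2_from_definition(y_true, reversed_pred))  # -3.0
```

Predicting the mean gives SS_res equal to SS_tot and hence a score of 0, so a negative score means the model does worse than that constant baseline on the evaluated data.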


Task 3: Use a sample dataset, compute all three metrics, and deduce model performance.

In [3]:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Generate a sample regression dataset
X, y = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print results
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R² Score: {r2:.4f}")

# Deduce model performance
print("\nModel Performance Interpretation:")
print("- MAE provides the average magnitude of errors in the predictions, in the same units as the output.")
print("- MSE penalizes larger errors more due to squaring, making it sensitive to outliers.")
print("- R² score indicates the proportion of variance explained by the model (closer to 1 is better).")

if r2 > 0.8:
    print("The model fits the data very well.")
elif r2 > 0.5:
    print("The model has a moderate fit; there might be room for improvement.")
else:
    print("The model does not fit the data well; consider a different model or features.")


Mean Absolute Error (MAE): 12.03
Mean Squared Error (MSE): 246.12
R² Score: 0.9681

Model Performance Interpretation:
- MAE provides the average magnitude of errors in the predictions, in the same units as the output.
- MSE penalizes larger errors more due to squaring, making it sensitive to outliers.
- R² score indicates the proportion of variance explained by the model (closer to 1 is better).
The model fits the data very well.
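
One way to read the MSE above in the units of y is to take its square root. The sketch below starts from the reported values (MAE 12.03, MSE 246.12) and the `noise=15` argument passed to `make_regression`, which sets the standard deviation of the Gaussian noise added to y; the variable names are introduced here for illustration.

```python
import math

# Values copied from the printed output above
reported_mae = 12.03
reported_mse = 246.12
noise_std = 15.0  # the `noise` argument given to make_regression

rmse = math.sqrt(reported_mse)   # error in the units of y
ratio = rmse / reported_mae      # about sqrt(pi/2) ~ 1.25 for Gaussian errors

print(f"RMSE: {rmse:.2f}")
print(f"RMSE / MAE: {ratio:.2f}")
print(f"RMSE vs. noise std: {rmse:.2f} vs. {noise_std:.2f}")
```

An RMSE close to the injected noise level suggests that most of the remaining error is noise the model cannot explain, which is consistent with the high R² score. A ratio of RMSE to MAE well above roughly 1.25 would instead hint at a few large outlying errors.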
