# An Introduction to Machine Learning
## Session 2b: Regression and Practical Considerations in ML

Welcome to Session 2b! We’ll explore a new type of machine learning model: regression. While classification models predict categories, regression models predict continuous values, making them useful in situations like forecasting sales, predicting scores, or analysing trends.

We’ll focus on Linear Regression, a foundational model in machine learning. By the end of this session, you’ll understand how Linear Regression works, how to evaluate a regression model, and how to interpret its predictions.
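To make "predicting a continuous value" concrete, here is a minimal sketch of least-squares line fitting with a single feature, written in plain Python. The data points and the helper name `fit_line` are invented for illustration and are not part of the session's dataset. scikit-learn's `LinearRegression`, used later in this session, solves the same least-squares problem for any number of features.

```python
# Illustrative sketch only: ordinary least squares for ONE feature.
# The data and the helper name `fit_line` are made up for this example.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that follows y = 3x + 2 exactly
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 8.0, 11.0, 14.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 10.0 + intercept  # predict for an unseen x = 10
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={prediction:.2f}")
```

Unlike a classifier, the output is a number on a continuous scale, so the prediction for an unseen input can be any value along the fitted line.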

### 1. Importing libraries and preparing the data

In [None]:
# Run this cell to import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [None]:
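# Load both datasets
# Note: the original UCI copies of these files are semicolon-delimited;
# if yours are too, pass sep=";" to pd.read_csv.
white_wine = pd.read_csv("../data/winequality-white.csv")
red_wine = pd.read_csv("../data/winequality-red.csv")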

# Display the first few rows of each to understand structure
white_wine.head(), red_wine.head()

In [None]:
# Display info for both datasets
print("White Wine Dataset Info:")
white_wine.info()
print("\nRed Wine Dataset Info:")
red_wine.info()
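
# Check for missing values in each dataset
print("\nMissing values (white):")
print(white_wine.isnull().sum())
print("\nMissing values (red):")
print(red_wine.isnull().sum())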

In [None]:
# Define features and target variable
features = ['alcohol', 'sulphates', 'citric acid', 'density', 'pH']
X_white = white_wine[features]
y_white = white_wine['quality']
X_red = red_wine[features]
y_red = red_wine['quality']

In [None]:
# EXERCISE: Split the white wine dataset into training and testing sets.
# Hint: Use train_test_split with test_size=0.2 and random_state=42 for consistency.

X_train_white, X_test_white, y_train_white, y_test_white = train_test_split(____, ____, test_size=____, random_state=____)

### 2. Linear Regression Modelling

In [None]:
# Initialise the Linear Regression model
linear_reg = LinearRegression()

In [None]:
# EXERCISE: Train the Linear Regression model on X_train_white and y_train_white.

linear_reg.fit(____, ____)

In [None]:
# EXERCISE: Predict the quality scores on X_test_white using the trained model.

y_pred = linear_reg.predict(____)

In [None]:
# Calculate Mean Squared Error, Mean Absolute Error, and R-squared on the white wine test set
mse_white = mean_squared_error(y_test_white, y_pred)
mae_white = mean_absolute_error(y_test_white, y_pred)
r2_white = r2_score(y_test_white, y_pred)

print(f"Mean Squared Error: {mse_white}")
print(f"Mean Absolute Error: {mae_white}")
print(f"R-squared: {r2_white}")

REFLECTION:
1. Based on the MSE, MAE, and R² values, what can you conclude about the model's accuracy?
2. Which metric do you think best describes the model's performance?
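
As an aid to the reflection, the sketch below computes the three metrics by hand on a handful of invented scores, so you can see what each one measures. The numbers and the `_demo` variable names are hypothetical and unrelated to the wine results. The formulas are the standard definitions that `mean_squared_error`, `mean_absolute_error`, and `r2_score` implement.

```python
# Illustrative sketch only: the three metrics computed by hand on made-up numbers.
actual = [5.0, 6.0, 7.0, 6.0]      # hypothetical true quality scores
predicted = [5.5, 6.0, 6.0, 6.5]   # hypothetical model predictions

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

# MSE squares each error, so large misses dominate the average
mse_demo = sum(e ** 2 for e in errors) / n
# MAE is the average size of a miss, in the target's own units
mae_demo = sum(abs(e) for e in errors) / n

# R-squared compares the model's squared error with that of always predicting the mean
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)                   # variation the model leaves unexplained
ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation around the mean
r2_demo = 1 - ss_res / ss_tot

print(f"MSE={mse_demo}, MAE={mae_demo}, R-squared={r2_demo}")
```

Because MAE stays in the units of the target, it is often the easiest of the three to explain: here the predictions miss by half a quality point on average.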

### 3. Interpreting actual vs predicted values

In [None]:
# Plot predictions vs. actual values
plt.figure(figsize=(8, 6))
plt.scatter(y_test_white, y_pred, alpha=0.5)
plt.plot([y_test_white.min(), y_test_white.max()], [y_test_white.min(), y_test_white.max()], 'r--')  # Reference line: perfect predictions
plt.xlabel('Actual Quality')
plt.ylabel('Predicted Quality')
plt.title('Predicted vs. Actual Quality Scores')
plt.show()

In [None]:
# Print the model coefficients (paired with their feature names) and the intercept.

print("Model Coefficients:")
print(pd.Series(linear_reg.coef_, index=features))
print("Intercept:", linear_reg.intercept_)

REFLECTION:
1. Which features have the highest positive or negative coefficients?
2. Do these coefficients align with what you would expect based on the data?
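
One caution when comparing coefficients: their size depends on the units and range of each feature, not only on how strongly the feature relates to the target. The sketch below, on invented numbers with a hypothetical helper `fit_line`, fits the same relationship twice with the feature expressed in different units. The fitted line is identical, yet the coefficient differs by a factor of 1000.

```python
# Illustrative sketch only: the same relationship, with the feature in two different units.
# The data and the helper name `fit_line` are made up for this example.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

ys = [3.0, 5.0, 7.0, 9.0]                        # hypothetical target values
feature_kg = [1.0, 2.0, 3.0, 4.0]                # feature measured in kilograms
feature_g = [1000.0 * x for x in feature_kg]     # the same feature measured in grams

slope_kg, intercept_kg = fit_line(feature_kg, ys)
slope_g, intercept_g = fit_line(feature_g, ys)

print(f"per kilogram: coefficient={slope_kg}, intercept={intercept_kg}")
print(f"per gram:     coefficient={slope_g}, intercept={intercept_g}")
```

So a large coefficient can simply reflect a feature that varies over a small numeric range. Check the scale of each feature before ranking them by coefficient.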

### 4. Applying the model to the red wine data

In [None]:
# EXERCISE: Predict the quality of red wine using the model trained on white wine.

y_pred_red = linear_reg.predict(____)

In [None]:
# Calculate evaluation metrics for red wine predictions
mse_red = mean_squared_error(y_red, y_pred_red)
mae_red = mean_absolute_error(y_red, y_pred_red)
r2_red = r2_score(y_red, y_pred_red)

print("Model Performance on Red Wine Data:")
print(f"Mean Squared Error: {mse_red}")
print(f"Mean Absolute Error: {mae_red}")
print(f"R-squared: {r2_red}")

In [None]:
# Create a performance comparison table
comparison = pd.DataFrame({
    "Dataset": ["White Wine Test Set", "Red Wine (New Data)"],
    "Mean Squared Error": [mse_white, mse_red],
    "Mean Absolute Error": [mae_white, mae_red],
    "R-squared": [r2_white, r2_red]
})

comparison

REFLECTION:
1. How does the model's performance change when applied to red wine data?
2. Why do you think the model performed differently on the red wine dataset?
3. What does this tell you about generalising models across similar datasets?
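
When reading the comparison table, note that R² can be negative on new data. A negative value means the predictions are worse than always guessing that dataset's mean. The sketch below shows this on invented numbers that are not the wine data; the two groups and the helper `r_squared` are hypothetical. A line that fits group A perfectly is scored on group B, where the same trend holds but every value is shifted upwards.

```python
# Illustrative sketch only: scoring a model on data it was not fitted to.
# All numbers are made up; `r_squared` is a hypothetical helper.

def r_squared(actual, predicted):
    """R-squared: 1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

xs = [1.0, 2.0, 3.0, 4.0]
ys_group_a = [3.0, 5.0, 7.0, 9.0]     # follows y = 2x + 1
ys_group_b = [6.0, 8.0, 10.0, 12.0]   # same slope, but shifted up by 3

# The line y = 2x + 1 fits group A exactly, so use it as the "trained" model
predictions = [2.0 * x + 1.0 for x in xs]

r2_a = r_squared(ys_group_a, predictions)
r2_b = r_squared(ys_group_b, predictions)
print(f"R-squared on group A: {r2_a}")
print(f"R-squared on group B: {r2_b}")
```

The model captures the trend in both groups, but the constant offset in group B is enough to push R² below zero.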