Introduction
Understanding Multiple Linear Regression

This notebook demonstrates multiple linear regression using the Fish Market dataset. We'll predict fish weight using two physical measurements: height and width.

Dataset Overview:
159 fish observations
Multiple species included
Physical measurements: Length1, Length2, Length3, Height, Width
Target: Weight (in grams)

What We'll Learn:
How to fit a plane through 3D data
Understanding regression coefficients
Evaluating model performance
Interpreting results in practical terms

In [2]:
%pip install numpy pandas scikit-learn matplotlib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Load the Fish Market dataset
df = pd.read_csv('data/Fish.csv', skipinitialspace=True)
print(df.head())
df.info()  # info() prints its own summary and returns None, so no print() is needed



# Explore data 

In [3]:
# Data exploration
print("Dataset shape:", df.shape)
print("\nBasic statistics:")
print(df.describe())

print("\nMissing values:")
print(df.isnull().sum())

print("\nSpecies distribution:")
print(df['Species'].value_counts())

Dataset shape: (159, 7)

Basic statistics:
            Weight     Length1     Length2     Length3      Height       Width
count   159.000000  159.000000  159.000000  159.000000  159.000000  159.000000
mean    398.326415   26.247170   28.415723   31.227044    8.970994    4.417486
std     357.978317    9.996441   10.716328   11.610246    4.286208    1.685804
min       0.000000    7.500000    8.400000    8.800000    1.728400    1.047600
25%     120.000000   19.050000   21.000000   23.150000    5.944800    3.385650
50%     273.000000   25.200000   27.300000   29.400000    7.786000    4.248500
75%     650.000000   32.700000   35.500000   39.650000   12.365900    5.584500
max    1650.000000   59.000000   63.400000   68.000000   18.957000    8.142000

Missing values:
Species    0
Weight     0
Length1    0
Length2    0
Length3    0
Height     0
Width      0
dtype: int64

Species distribution:
Species
Perch        56
Bream        35
Roach        20
Pike         17
Smelt        14
Parkki       1
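
The summary statistics above report a minimum Weight of 0 g, which cannot be a real measurement. A minimal sketch of how such rows could be inspected, and optionally dropped, before fitting is shown here. The small frame `toy` is a made-up stand-in for `df`, and its values are illustrative, not taken from Fish.csv.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are illustrative only
toy = pd.DataFrame({
    "Species": ["Bream", "Roach", "Perch"],
    "Weight": [242.0, 0.0, 110.0],
    "Height": [11.52, 6.48, 6.11],
    "Width": [4.02, 3.62, 3.53],
})

# Rows whose recorded weight is not strictly positive
suspect = toy[toy["Weight"] <= 0]
print(suspect)

# One option: keep only rows with a positive weight
clean = toy[toy["Weight"] > 0].reset_index(drop=True)
print("Rows kept:", len(clean))
```

Whether to drop such a row is a judgment call. The model fitted later in this notebook uses all 159 rows.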

# Prepare features and target

In [4]:
# Prepare features and target
# Using Height and Width as independent variables
X = df[['Height', 'Width']]
y = df['Weight']

print("Features shape:", X.shape)
print("Target shape:", y.shape)

# Check for any missing values
print("Missing values in X:", X.isnull().sum().sum())
print("Missing values in y:", y.isnull().sum())

Features shape: (159, 2)
Target shape: (159,)
Missing values in X: 0
Missing values in y: 0


Mathematical Foundation
Behind the scenes of regression

The Regression Equation
ŷ = β₀ + β₁·Height + β₂·Width

We find coefficients (β₀, β₁, β₂) that minimize the sum of squared residuals (errors).

Key Formula (Cramer's Rule)
β₁ = (Σx₂²·Σx₁y - Σx₁x₂·Σx₂y) / (Σx₁²·Σx₂² - (Σx₁x₂)²)

β₂ = (Σx₁²·Σx₂y - Σx₁x₂·Σx₁y) / (Σx₁²·Σx₂² - (Σx₁x₂)²)

where x₁, x₂ are the centered (mean-subtracted) Height and Width. The intercept then follows from the uncentered means: β₀ = ȳ - β₁·x̄₁ - β₂·x̄₂
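
To make the closed-form expressions concrete, here is a minimal NumPy sketch that evaluates them and cross-checks the result against a generic least-squares solver. It runs on synthetic data rather than the fish measurements. The names `x1`, `x2`, `b0`, `b1`, `b2` and the numbers used to generate the data are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic stand-in data (not the fish measurements)
rng = np.random.default_rng(0)
x1 = rng.normal(9.0, 4.0, size=200)
x2 = 0.3 * x1 + rng.normal(1.5, 1.0, size=200)
y = -400.0 + 5.0 * x1 + 180.0 * x2 + rng.normal(0.0, 50.0, size=200)

# Center the predictors (subtract their means)
c1 = x1 - x1.mean()
c2 = x2 - x2.mean()

# Sums of squares and cross-products that appear in the formulas
s11 = np.sum(c1 * c1)
s22 = np.sum(c2 * c2)
s12 = np.sum(c1 * c2)
s1y = np.sum(c1 * y)
s2y = np.sum(c2 * y)

# Closed-form slopes, then the intercept from the uncentered means
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = y.mean() - b1 * x1.mean() - b2 * x2.mean()

# Cross-check against a generic least-squares solve
A = np.column_stack([np.ones_like(x1), x1, x2])
ref, *_ = np.linalg.lstsq(A, y, rcond=None)
print("closed form:", b0, b1, b2)
print("lstsq      :", ref)
```

The two printed rows should agree up to floating-point error. The denominator `det` shrinks toward zero as the two predictors become more strongly correlated, which is why highly collinear features give unstable coefficients.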

In [5]:
# Fit the Multiple Linear Regression model
model = LinearRegression()
model.fit(X, y)

# Extract coefficients
beta0 = model.intercept_
beta1, beta2 = model.coef_

print(f"Intercept (β₀): {beta0:.4f}")
print(f"Height coefficient (β₁): {beta1:.4f}")
print(f"Width coefficient (β₂): {beta2:.4f}")

# Equation
print(f"\nRegression Equation:")
print(f"Weight = {beta0:.4f} + {beta1:.4f} × Height + {beta2:.4f} × Width")

Intercept (β₀): -433.5757
Height coefficient (β₁): 4.8246
Width coefficient (β₂): 178.5225

Regression Equation:
Weight = -433.5757 + 4.8246 × Height + 178.5225 × Width
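
As a worked example, the fitted equation can be evaluated by hand. The sketch below hard-codes the coefficients printed above. The helper `predict_weight` and both sets of fish measurements are hypothetical, chosen only to show the arithmetic.

```python
# Coefficients copied from the fitted-model output above
b0, b1, b2 = -433.5757, 4.8246, 178.5225

def predict_weight(height_cm, width_cm):
    """Evaluate Weight = b0 + b1 * Height + b2 * Width (grams)."""
    return b0 + b1 * height_cm + b2 * width_cm

# Hypothetical mid-sized fish: Height 10 cm, Width 5 cm
# -433.5757 + 48.2460 + 892.6125 = 507.2828
mid_sized = predict_weight(10.0, 5.0)
print(f"Predicted weight: {mid_sized:.1f} g")

# Hypothetical very small fish: the linear model extrapolates below zero
small = predict_weight(2.0, 1.2)
print(f"Predicted weight: {small:.1f} g")
```

The second prediction is negative, which shows that a plane fitted over the whole size range can give physically impossible values for the smallest fish.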


# Interpretation and insights

In [7]:
# Calculate R-squared on the training data (in-sample fit)
r_squared = model.score(X, y)

# Interpretation of coefficients
print("MODEL INTERPRETATION")
print("=" * 50)
print(f"\n1. Intercept (β₀): {beta0:.4f}")
print("   - Weight when Height and Width are 0 cm")
print("   - Not physically meaningful: no fish has zero size, and a negative weight is impossible")

print(f"\n2. Height Coefficient (β₁): {beta1:.4f}")
print("   - For every 1 cm increase in Height,")
print(f"   - Weight increases by {beta1:.4f} grams (holding Width constant)")

print(f"\n3. Width Coefficient (β₂): {beta2:.4f}")
print("   - For every 1 cm increase in Width,")
print(f"   - Weight increases by {beta2:.4f} grams (holding Height constant)")

print(f"\n4. R-squared: {r_squared:.4f}")
print(f"   - The model explains {r_squared*100:.2f}% of the variance")
print(f"   - Remaining {(1-r_squared)*100:.2f}% is unexplained")

MODEL INTERPRETATION
==================================================

1. Intercept (β₀): -433.5757
   - Weight when Height and Width are 0 cm
   - Not physically meaningful: no fish has zero size, and a negative weight is impossible

2. Height Coefficient (β₁): 4.8246
   - For every 1 cm increase in Height,
   - Weight increases by 4.8246 grams (holding Width constant)

3. Width Coefficient (β₂): 178.5225
   - For every 1 cm increase in Width,
   - Weight increases by 178.5225 grams (holding Height constant)

4. R-squared: 0.7871
   - The model explains 78.71% of the variance
   - Remaining 21.29% is unexplained
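
The R-squared reported above is what `model.score(X, y)` returns: 1 minus the ratio of the residual sum of squares to the total sum of squares, here measured on the same data the model was fitted to. A rough sketch of that calculation follows, again on synthetic data and with NumPy only. All variable names and generating values are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in data (not the fish measurements)
rng = np.random.default_rng(1)
height = rng.normal(9.0, 4.0, size=150)
width = 0.3 * height + rng.normal(1.5, 1.0, size=150)
weight = -400.0 + 5.0 * height + 180.0 * width + rng.normal(0.0, 120.0, size=150)

# Ordinary least squares fit with an intercept column
A = np.column_stack([np.ones_like(height), height, width])
coef, *_ = np.linalg.lstsq(A, weight, rcond=None)
fitted = A @ coef

# R-squared = 1 - SS_res / SS_tot
ss_res = np.sum((weight - fitted) ** 2)
ss_tot = np.sum((weight - weight.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Root mean squared error, in the same units as the target (grams)
rmse = np.sqrt(np.mean((weight - fitted) ** 2))

print(f"R-squared: {r2:.4f}")
print(f"RMSE: {rmse:.1f} g")
```

Because this is an in-sample figure, it tends to be optimistic. Scoring on held-out rows would give a fairer estimate of how the model performs on new fish.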
