# Linear Regression - Example: Formula 1 Data

## Overview
In this notebook, we walk through a simple example of linear regression using Formula 1 qualifying and race-results data. No prior knowledge of linear regression is assumed. We will:

- Introduce the concept  
- Load and inspect the data  
- Visualize relationships  
- Train a linear regression model  
- Evaluate and interpret the results

Let’s learn how to use linear regression with a real example:  
**Can we predict a driver’s race finishing position based on their qualifying lap time?**

## 1. What Is Linear Regression?

Linear regression is a statistical technique for modeling the relationship between a **predictor** (input) and a **response** (output) as a straight line:

predicted_y = β₀ + β₁x

- β₀ (intercept) shifts the line up/down.  
- β₁ (slope) controls how steep the line is.  

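Our goal: given data points \((x_i, y_i)\), find the β₀ and β₁ that best “fit” those points, meaning the line that minimizes the sum of squared differences between the actual and predicted y values (ordinary least squares).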
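
As a minimal sketch of what a least-squares "best fit" looks like in practice, the snippet below computes β₀ and β₁ by hand using the closed-form formulas for a single predictor. The five data points are made up for illustration and do not come from the Formula 1 files.

```python
import numpy as np

# Hypothetical toy data (not from the F1 dataset): y is exactly 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Closed-form least-squares estimates for one predictor:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

print(f"beta_0 = {beta_0:.3f}, beta_1 = {beta_1:.3f}")  # beta_0 = 1.000, beta_1 = 2.000
```

Because the toy points lie exactly on a line, the formulas recover its intercept and slope. With noisy real data the same formulas return the line with the smallest total squared error.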

## 2. Setup and Imports

In [1]:
# Standard data-science libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Scikit-learn for linear regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

## 3. Load the Data

In [2]:
qualifying = pd.read_csv('Data/qualifying.csv')
results = pd.read_csv('Data/results.csv')

In [6]:
# 🔹 Preview: A quick glance at each dataset
print("🔹 First 3 rows from Qualifying:")
print(qualifying[['driverId', 'raceId', 'q1', 'q2', 'q3']].head(3))

print("\n🔹 First 3 rows from Race Results:")
print(results[['driverId', 'positionOrder', 'fastestLap', 'raceId']].head(3))

# 📏 Size check: How big are these files?
print("\n📏 Dataset Sizes (Rows x Columns):")
print(f"Qualifying: {qualifying.shape[0]} rows × {qualifying.shape[1]} columns")
print(f"Results: {results.shape[0]} rows × {results.shape[1]} columns")

# 🔍 Top Performers: Who placed best in races
print("\n🏁 Top 5 Race Finishes:")
print(results.sort_values('positionOrder')[['driverId', 'positionOrder', 'raceId']].head(5))

# ❗Quick data check: Any missing info?
print("\n🧠 Missing Data Overview:")
print("Qualifying missing values:\n", qualifying.isnull().sum())
print("\nResults missing values:\n", results.isnull().sum())

🔹 First 3 rows from Qualifying:
   driverId  raceId        q1        q2        q3
0         1      18  1:26.572  1:25.187  1:26.714
1         9      18  1:26.103  1:25.315  1:26.869
2         5      18  1:25.664  1:25.452  1:27.079

🔹 First 3 rows from Race Results:
   driverId  positionOrder fastestLap  raceId
0         1              1         39      18
1         2              2         41      18
2         3              3         41      18

📏 Dataset Sizes (Rows x Columns):
Qualifying: 10494 rows × 9 columns
Results: 26759 rows × 18 columns

🏁 Top 5 Race Finishes:
       driverId  positionOrder  raceId
14628       224              1     594
22061        20              1     897
22083        20              1     898
6812        102              1     315
22127         3              1     900

🧠 Missing Data Overview:
Qualifying missing values:
 qualifyId         0
raceId            0
driverId          0
constructorId     0
number            0
position          0
q1            

## 4. Merge Data

In [9]:
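# Merge on raceId and driverId
merged = pd.merge(
    qualifying,
    results,
    on=['raceId', 'driverId'],
    how='inner'
)

# q3 is stored as text such as '1:26.714'; convert it to seconds.
# Values that do not parse (e.g. a missing Q3 time) become NaN.
def lap_time_to_seconds(value):
    try:
        minutes, seconds = str(value).split(':')
        return int(minutes) * 60 + float(seconds)
    except ValueError:
        return np.nan

merged['q3_seconds'] = merged['q3'].apply(lap_time_to_seconds)

# Keep only what we need and drop rows without a usable Q3 time
merged = merged[['driverId', 'raceId', 'q3_seconds', 'positionOrder']].dropna()

# Rename columns for clarity
merged = merged.rename(columns={
    'q3_seconds': 'qualifying_time',
    'positionOrder': 'finishing_position'
})

# Preview the merged data
merged.head()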
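
To make the behaviour of an inner merge on two keys concrete, here is a small sketch with made-up tables. The frames `quali_demo` and `results_demo` are hypothetical stand-ins that only mimic the key columns of the real files.

```python
import pandas as pd

# Hypothetical mini-tables that mimic the key columns of the real files
quali_demo = pd.DataFrame({
    'raceId':   [18, 18, 19],
    'driverId': [1, 9, 1],
    'q3':       ['1:26.714', '1:26.869', '1:35.310'],
})
results_demo = pd.DataFrame({
    'raceId':        [18, 18, 20],
    'driverId':      [1, 5, 1],
    'positionOrder': [1, 3, 2],
})

# Inner join: keep only (raceId, driverId) pairs present in BOTH tables
merge_demo = pd.merge(quali_demo, results_demo, on=['raceId', 'driverId'], how='inner')
print(merge_demo)
print(len(quali_demo), len(results_demo), len(merge_demo))  # 3 3 1
```

Only the pair (18, 1) appears in both tables, so the result has a single row. Unmatched rows are dropped without warning, so comparing row counts before and after a merge is a cheap sanity check.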

## 5. Visualize the Relationship

In [7]:
plt.scatter(merged['qualifying_time'], merged['finishing_position'])
plt.xlabel('Qualifying Time (seconds)')
plt.ylabel('Finishing Position')
plt.title('Does Qualifying Time Predict Finishing Position?')
plt.show()

This scatter plot helps us see if a straight‐line model could make sense.
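
A numeric companion to the visual check is the Pearson correlation coefficient, which measures how close the points are to a straight line. The sketch below uses synthetic stand-in data rather than the real columns; the variable names and the generating formula are assumptions made purely for illustration.

```python
import numpy as np

# Synthetic stand-in data (NOT the real F1 columns):
# finishing position loosely rises with qualifying time, plus random noise
rng = np.random.default_rng(0)
qualifying_time_demo = rng.uniform(80, 95, size=200)
finishing_position_demo = (
    1 + 1.2 * (qualifying_time_demo - 80) + rng.normal(0, 3, size=200)
)

# Pearson r: +1 or -1 means perfectly linear, 0 means no linear relationship
r = np.corrcoef(qualifying_time_demo, finishing_position_demo)[0, 1]
print(f"Pearson r = {r:.2f}")
```

A value near zero would suggest that a straight line explains little, while a value near +1 or -1 suggests a linear model is a reasonable starting point.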

## 6. Prepare Features and Target

In [None]:
# Our predictor (X) must be a 2D array: shape (n_samples, n_features)
X = merged[['qualifying_time']]

# Our response (y) is a 1D array
y = merged['finishing_position']
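
The difference between double and single brackets is easy to miss, so here is a minimal sketch on a hypothetical three-row frame that reuses the notebook's column names with invented values.

```python
import pandas as pd

# Hypothetical three-row frame; the values are made up
shape_demo = pd.DataFrame({
    'qualifying_time': [85.2, 86.1, 87.4],
    'finishing_position': [1, 4, 9],
})

X_demo = shape_demo[['qualifying_time']]   # double brackets -> DataFrame (2D)
y_demo = shape_demo['finishing_position']  # single brackets -> Series (1D)

print(X_demo.shape)  # (3, 1)
print(y_demo.shape)  # (3,)
```

scikit-learn estimators expect the feature matrix to have shape `(n_samples, n_features)`, which is why the predictor keeps its column dimension even when there is only one feature.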

## 7. Split into Training and Test Sets

We split the data into:
- 80% training (for fitting the model)
- 20% testing (for evaluating performance)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
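
As a quick illustration of what `test_size` and `random_state` do, the sketch below splits a hypothetical 10-sample toy array. The arrays `X_toy` and `y_toy` are invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy arrays: 10 samples, 1 feature
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))  # 8 2

# Reusing the same random_state reproduces the identical split
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))  # True
```

Fixing `random_state` makes results repeatable between runs, which helps when comparing models or debugging.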


## 8. Train the Linear Regression Model

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print('Intercept:', model.intercept_)
print('Slope:', model.coef_[0])
print('MAE:', mean_absolute_error(y_test, y_pred))
print('R²:', r2_score(y_test, y_pred))

plt.scatter(y_test, y_pred)
plt.xlabel('Actual Finishing Position')
plt.ylabel('Predicted Finishing Position')
plt.title('Actual vs Predicted')
plt.show()

Once fitted, the model stores the learned parameters:
- model.intercept_ → β₀
- model.coef_[0] → β₁
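
To connect these attributes back to the line equation and the two metrics, the sketch below fits a model on a hypothetical five-point dataset, rebuilds the predictions by hand from β₀ and β₁, and recomputes MAE and R² from their definitions. The data and the `*_demo` names are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical toy data (not the F1 dataset)
X_fit_demo = np.array([[80.0], [82.0], [84.0], [86.0], [88.0]])
y_fit_demo = np.array([2.0, 5.0, 7.0, 12.0, 14.0])

toy_model = LinearRegression().fit(X_fit_demo, y_fit_demo)
beta_0, beta_1 = toy_model.intercept_, toy_model.coef_[0]

# Prediction by hand: beta_0 + beta_1 * x
manual_pred = beta_0 + beta_1 * X_fit_demo[:, 0]
print(np.allclose(manual_pred, toy_model.predict(X_fit_demo)))  # True

# MAE: mean absolute error; R^2: 1 - SS_res / SS_tot
mae = np.mean(np.abs(y_fit_demo - manual_pred))
ss_res = np.sum((y_fit_demo - manual_pred) ** 2)
ss_tot = np.sum((y_fit_demo - y_fit_demo.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(np.isclose(mae, mean_absolute_error(y_fit_demo, manual_pred)))  # True
print(np.isclose(r2, r2_score(y_fit_demo, manual_pred)))              # True
```

Read this way, the slope is the expected change in finishing position for each additional second of qualifying time, and the intercept is the predicted position at a qualifying time of zero, which has no practical meaning on its own.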