# Intelligent Systems | HS2025 | SW05 
## Linear regression tutorial, SW05 lecture
### Eugen Rodel, 15/10/2025

# Linear Regression Tutorial: Predicting Sales per Year Based on the Number of Gas Pumps

In this tutorial, we will learn how to perform linear regression to predict the annual sales of gas stations based on the number of gas pumps they have. We will use a dataset of 100 data points, each pairing a station's number of gas pumps with its annual sales figure.

We are going to use the scikit-learn library, which includes many ML tools. More details here: [scikit-learn](https://scikit-learn.org/stable/)


In [None]:
# Install the scikit-learn library, which includes tools for ML
!pip install scikit-learn

# 1. Import necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Set Seaborn style for plots
sns.set_style('whitegrid')


# 2. Load Data

In [None]:
# Load the uploaded data from the CSV file
data = pd.read_csv('gas_station_data.csv')

# Display the first few rows of the DataFrame to understand the structure
data.head()


# 3. Visualize/Explore the data
(We skip data cleaning because it is not necessary here; it will be covered in a later lecture.)

In [None]:
# Scatter plot to visualize the relationship between Gas Pumps and Sales per Year
sns.scatterplot(x='Gas Pumps', y='Sales per Year', data=data)
plt.xlabel('Number of Gas Pumps')
plt.ylabel('Sales per Year (in thousands CHF)')
plt.title('Scatter Plot of Sales vs. Number of Gas Pumps')
plt.show()

In [None]:
# Box plot to detect potential outliers in the sales data
sns.boxplot(x=data['Sales per Year'])
plt.xlabel('Sales per Year (in thousands CHF)')
plt.title('Box Plot of Sales per Year')
plt.show()

In [None]:
data['Sales per Year'].describe()

In [None]:
# Histogram to show the distribution of the number of gas pumps
sns.histplot(data['Gas Pumps'], bins=10, kde=True)
plt.xlabel('Number of Gas Pumps')
plt.title('Histogram of Number of Gas Pumps')
plt.show()

In [None]:
# Histogram to show the distribution of sales per year
sns.histplot(data['Sales per Year'], bins=10, kde=True)
plt.xlabel('Sales per Year (in thousands CHF)')
plt.title('Histogram of Sales per Year')
plt.show()

# 4.  Prepare the Data for Linear Regression

In [None]:
# Separate the features/independent variables (Gas Pumps) and the target/dependent variable (Sales per Year)
X = data[['Gas Pumps']].values
y = data['Sales per Year'].values

### Code Explanation

This code separates the features (independent variables) and the target (dependent variable) from the dataset, preparing them for use in machine learning. Here's what each part does:

1. **Features (Independent Variables) and Target (Dependent Variable)**
   - In this dataset, the number of **Gas Pumps** is the **independent variable**, or **feature** (`X`), which we will use to make predictions.
   - **Sales per Year** is the **dependent variable**, or **target** (`y`), which we are trying to predict based on the number of gas pumps.

2. **Extracting the Feature Data: `X = data[['Gas Pumps']].values`**
   - `data[['Gas Pumps']]`: This selects the "Gas Pumps" column from the DataFrame. The double square brackets (`[[]]`) ensure that the result is a DataFrame rather than a Series, so `X` ends up two-dimensional (one row per gas station, one column per feature), which is the shape scikit-learn expects for features.
   - `.values`: This converts the selected DataFrame into a NumPy array, the plain numeric array format that scikit-learn and most other machine learning libraries operate on.
   - As a result, `X` becomes a NumPy array containing the number of gas pumps for each gas station.

3. **Extracting the Target Data: `y = data['Sales per Year'].values`**
   - `data['Sales per Year']`: This selects the "Sales per Year" column from the DataFrame.
   - `.values`: This converts the selected Series into a NumPy array.
   - As a result, `y` becomes a NumPy array containing the corresponding sales figures for each gas station.
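
The difference between the two selections is easiest to see in the array shapes. The following is a minimal sketch on a hypothetical three-row DataFrame (`toy`, with made-up values) that stands in for the real CSV; only the column names are taken from the tutorial.

```python
import pandas as pd

# Hypothetical three-row stand-in for gas_station_data.csv (values are made up)
toy = pd.DataFrame({
    'Gas Pumps': [4, 8, 12],
    'Sales per Year': [410.0, 790.0, 1230.0],
})

# Double brackets -> DataFrame -> two-dimensional array (rows x features)
X_toy = toy[['Gas Pumps']].values

# Single brackets -> Series -> one-dimensional array
y_toy = toy['Sales per Year'].values

print(X_toy.shape)  # (3, 1)
print(y_toy.shape)  # (3,)
```

The features keep an explicit column dimension even though there is only one feature, while the target is a flat vector.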

### Why This Step Is Important

Separating the features (`X`) and the target (`y`) is a crucial step in preparing the data for machine learning. By organizing the data this way, we ensure that the machine learning algorithms can process the inputs and outputs correctly.


In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### Code Explanation

This line of code splits the dataset into training and testing sets. It uses the `train_test_split` function from the `sklearn.model_selection` module to randomly divide the data into two subsets: one for training the model and one for testing its performance. Let's break down the parameters and variables:

1. **`X` and `y`**
   - `X` represents the **features** or **input variables** (in this case, the number of gas pumps).
   - `y` represents the **target variable** or **output** (in this case, sales per year).

2. **`train_test_split(X, y, test_size=0.2, random_state=0)`**
   - This function randomly splits the `X` and `y` arrays into training and testing sets.
   - **`test_size=0.2`**:
     - Specifies the proportion of the data to be used for testing.
     - `0.2` means that **20% of the data** will be used as the testing set, and the remaining **80% will be used for training**.
   - **`random_state=0`**:
     - This parameter ensures **reproducibility**. Setting `random_state` to a fixed number (e.g., `0`) ensures that the random split will be the same each time you run the code.
     - If `random_state` is not set, the split will be different each time, which could lead to variations in results.

3. **`X_train, X_test, y_train, y_test`**
   - The function returns four arrays:
     - **`X_train`**: The training portion of the feature data (80% of `X`).
     - **`X_test`**: The testing portion of the feature data (20% of `X`).
     - **`y_train`**: The training portion of the target data (80% of `y`).
     - **`y_test`**: The testing portion of the target data (20% of `y`).
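
To check the 80/20 proportions and the effect of `random_state`, here is a small sketch on synthetic stand-in data. The arrays `X_demo` and `y_demo` are assumptions made up for illustration, sized at 100 rows like the tutorial's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, one feature (values are made up)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel()

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # (80, 1) (20, 1)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te2))  # True

# No row ends up in both subsets
overlap = set(X_tr.ravel()) & set(X_te.ravel())
print(len(overlap))  # 0
```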

### Why Split the Data?
- **Training Set**: Used to **train the machine learning model**, so it can learn the relationship between the input features and the target variable.
- **Testing Set**: Used to **evaluate the model's performance** on unseen data, allowing us to understand how well the model generalizes to new, unseen examples.

By splitting the data into training and testing sets, we can train the model on one portion and test it on another. The split does not prevent overfitting by itself, but it lets us detect it: a model that scores well on the training set and poorly on the test set has not generalized to data it has not seen before.


# 5. Train the Linear Regression Model

In [None]:
# Initialize the Linear Regression model
model = LinearRegression()

# Fit (or learn) the model to the training data
model.fit(X_train, y_train)

### Code Explanation

This code initializes a Linear Regression model and then fits (trains) it to the training data. Let's break down each line:

1. **Initialize the Linear Regression Model**
   - `model = LinearRegression()`
     - Here, we create an instance of the `LinearRegression` class from the `sklearn.linear_model` module.
     - Calling `LinearRegression()` invokes the class constructor and returns a new linear regression model, which will be used to find the best-fitting line that describes the relationship between the input features (`X_train`) and the target variable (`y_train`).
     - At this stage, the model is just created and has not yet learned anything from the data.

2. **Fit (or Learn) the Model to the Training Data**
   - `model.fit(X_train, y_train)`
     - The `.fit()` method trains the linear regression model using the training data (`X_train` and `y_train`).
     - **`X_train`**: Represents the input features (in this case, the number of gas pumps).
     - **`y_train`**: Represents the corresponding target values (sales per year).
     - During this step, the model learns the relationship between the features and the target by finding the best-fitting line. It does this by calculating the values of the slope (coefficient) and intercept that minimize the sum of squared differences between the predicted and actual target values in the training set (ordinary least squares).
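
For a single feature, the least-squares slope and intercept have a simple closed form: the slope is the covariance of `x` and `y` divided by the variance of `x`, and the intercept is `mean(y) - slope * mean(x)`. The sketch below, on synthetic data made up for illustration, checks that these formulas agree with what `LinearRegression` finds. It is meant to illustrate the result of the fit, not to describe how scikit-learn computes it internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data (made up): roughly y = 100 * x + 50 plus noise
rng = np.random.default_rng(0)
x = rng.integers(2, 20, size=50).astype(float)
y = 100.0 * x + 50.0 + rng.normal(0.0, 30.0, size=50)

# Closed-form least-squares estimates for one feature
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Fit the same data with scikit-learn (features must be two-dimensional)
model_demo = LinearRegression().fit(x.reshape(-1, 1), y)

print(np.isclose(slope, model_demo.coef_[0]))        # True
print(np.isclose(intercept, model_demo.intercept_))  # True
```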

### Why These Steps Are Important

- **Initialization**: Creating the model instance is the first step in any machine learning workflow. It sets up the algorithm that will be used to make predictions.
- **Training/Fitting**: The `.fit()` method allows the model to learn from the data. It adjusts the model parameters to capture the patterns in the training data, which helps the model make accurate predictions on new, unseen data.

In summary, initializing and fitting the model are key steps in building a machine learning model that can learn from data and make predictions.


In [None]:
# Display the slope (coefficient) and intercept of the trained model
print(f'Coefficient (Slope): {model.coef_[0]:.2f}')
print(f'Intercept: {model.intercept_:.2f}')

# 6. Make Predictions and Evaluate the Model

In [None]:
# Predict the target variable for the test set
y_pred = model.predict(X_test)

### Code Explanation

This line of code is used to make predictions using the trained Linear Regression model. Let's break down what it does:

1. **Predict the Target Variable for the Test Set**
   - `y_pred = model.predict(X_test)`
     - The `.predict()` method uses the trained model to make predictions based on the input features provided.
     - **`X_test`**: Represents the test set of input features (in this case, the number of gas pumps), which was not used during the model training. These values are new to the model, and the goal is to see how well the model can predict the corresponding target values.
     - **`y_pred`**: Stores the predicted values for the target variable (sales per year) based on the test set features.

2. **What Is Happening Here?**
   - The model uses the relationship it learned during training (using `X_train` and `y_train`) to predict the target values (`y_pred`) for the new input data (`X_test`).
   - Since the model has already been trained to find the best-fitting line, it applies that line's equation (slope and intercept) to the values in `X_test` to generate predictions for the sales.
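
As a quick check of this description, the sketch below compares `.predict()` with applying the line equation by hand. The data points are made up and lie exactly on the line `y = 100 * x + 50`; the variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points that lie exactly on y = 100 * x + 50
X_fit = np.array([[2.0], [4.0], [6.0], [8.0]])
y_fit = np.array([250.0, 450.0, 650.0, 850.0])
demo = LinearRegression().fit(X_fit, y_fit)

# New inputs the model has not seen
X_new = np.array([[5.0], [12.0]])

by_predict = demo.predict(X_new)
by_hand = demo.coef_[0] * X_new.ravel() + demo.intercept_

print(np.allclose(by_predict, by_hand))  # True
```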

### Why This Step Is Important

- **Prediction**: This is where we see how well our model performs. By making predictions on the test data, we can evaluate the model's accuracy and understand how well it generalizes to new, unseen data.
- **Model Evaluation**: After predicting the target values, we can compare the predicted values (`y_pred`) to the actual values (`y_test`) to assess the model's performance.

In summary, this line of code allows us to use the trained model to make predictions and evaluate its effectiveness on the test data.


In [None]:
print(f'Test Data (How many gas pumps):\n{X_test}\n')
print(f'Predictions from trained model (Predicted sales per year):\n {y_pred}')

In [None]:
# Calculate Mean Squared Error and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Output the evaluation metrics
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')

### Code Explanation

This code calculates two evaluation metrics, **Mean Squared Error (MSE)** and **R-squared (R²)**, to assess the performance of the Linear Regression model. Let's break down each part:

1. **Calculate Mean Squared Error and R-squared**
   - `mse = mean_squared_error(y_test, y_pred)`
     - **`mean_squared_error(y_test, y_pred)`** calculates the **Mean Squared Error (MSE)** between the actual target values (`y_test`) and the predicted values (`y_pred`).
     - MSE measures the average squared difference between the actual and predicted values. It is used to understand how far the model's predictions are from the actual values.
     - A **lower MSE** indicates that the model's predictions are closer to the actual values, while a **higher MSE** suggests a larger error.

   - `r2 = r2_score(y_test, y_pred)`
     - **`r2_score(y_test, y_pred)`** calculates the **R-squared (R²)** value, which represents the proportion of the variance in the actual target values (`y_test`) that is explained by the model's predictions (`y_pred`).
     - R² is at most **1**, and on a test set it can also be negative:
       - **1** indicates a perfect fit, meaning the model explains all the variability in the target variable.
       - **0** means the model does no better than always predicting the mean of `y_test`.
       - **Negative values** mean the model does worse than that constant mean prediction.
     - A **higher R²** value indicates a better model fit, while a **lower R²** suggests that the model may not be capturing the underlying patterns well.

2. **Output the Evaluation Metrics**
   - `print(f'Mean Squared Error: {mse:.2f}')`
   - `print(f'R-squared: {r2:.2f}')`
     - These lines print the calculated MSE and R² values to two decimal places, allowing us to see how well the model performed.
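
Both metrics can be reproduced with a few lines of NumPy, which makes the definitions concrete. The sketch below uses made-up actual and predicted values (`y_true`, `y_hat`) and compares the manual results with scikit-learn's functions. The last lines show that R² drops below zero when predictions are worse than simply predicting the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values (illustration only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))  # both 250.0
print(r2_manual, r2_score(y_true, y_hat))             # both about 0.98

# Predictions in reversed order are worse than always predicting the mean
y_bad = y_true[::-1]
r2_bad = r2_score(y_true, y_bad)
print(r2_bad)  # -3.0
```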

### What Do MSE and R² Tell Us About the Model?

- **Mean Squared Error (MSE)**
  - MSE is the **average of the squared errors** between the actual and predicted values, so it is expressed in squared units of the target (here, thousands of CHF squared).
  - **Lower MSE** values indicate that the model's predictions are more accurate, while **higher MSE** values suggest larger errors.
  - Because MSE squares the differences, larger errors have a bigger impact on the metric, making it sensitive to outliers.

- **R-squared (R²)**
  - R² indicates how well the model explains the variation in the target variable.
  - A **higher R² value** (close to 1) means that the model fits the data well and explains most of the variability.
  - A **lower R² value** (close to 0) means that the model does not explain the variability in the data well, suggesting that a different approach or more features may be needed.

### Why These Metrics Are Important

- **MSE and R² are standard evaluation metrics** in regression analysis, helping us understand the accuracy and goodness-of-fit of the model.
- They provide a quantitative way to **compare different models** or **tune model parameters** to achieve better performance.

In summary, calculating MSE and R² allows us to measure how well the model's predictions align with the actual values and understand the quality of the fit.


# 7. Visualize the Linear Regression Line

In [None]:

# Create a scatter plot for the actual data points
sns.scatterplot(x=X_test.flatten(), y=y_test, color='red', label='Actual', s=50)

# Create a line plot for the regression line (predicted values)
sns.lineplot(x=X_test.flatten(), y=y_pred, color='blue', linewidth=2, label='Predicted')

# Label the plot
plt.xlabel('Number of Gas Pumps')
plt.ylabel('Sales per Year (in thousands CHF)')
plt.title('Linear Regression Fit: Sales vs. Number of Gas Pumps')
plt.legend()
plt.show()

## Alternative: plot the prediction line using the slope (coefficient) and intercept from the trained model

In [None]:
# Create a scatter plot for the actual data points
sns.scatterplot(x=X_test.flatten(), y=y_test, color='red', label='Actual', s=50)

# Calculate the regression line using the model's coefficient and intercept
slope = model.coef_[0]
intercept = model.intercept_

# Generate values for the regression line
x_range = np.linspace(X_test.min(), X_test.max(), 100)
y_range = slope * x_range + intercept

# Plot the regression line
sns.lineplot(x=x_range, y=y_range, color='blue', linewidth=2, label='Predicted')

# Label the plot
plt.xlabel('Number of Gas Pumps')
plt.ylabel('Sales per Year (in thousands CHF)')
plt.title('Linear Regression Fit: Sales vs. Number of Gas Pumps')
plt.legend()
plt.show()

# 8. Predict the sales per year for a new gas station with 12 gas pumps

In [None]:
# New gas station with 12 pumps
new_gas_station = np.array([[12]])

# Use the trained model to predict sales for the new gas station
predicted_sales = model.predict(new_gas_station)

# Output the prediction
print(f'Predicted Sales for a gas station with 12 pumps: {predicted_sales[0]:.2f} (in thousands CHF)')

# 9. Save and reuse our trained model

## Save the trained model

In [None]:
import joblib

# Save the trained model to a file
joblib.dump(model, 'linear_regression_model.joblib')

## Load the saved model

In [None]:
# Load the trained model from the file
loaded_model = joblib.load('linear_regression_model.joblib')

# New gas station with 16 pumps
new_gas_station = np.array([[16]])

# Use the loaded model to make predictions
predicted_sales = loaded_model.predict(new_gas_station)
print(f'Predicted Sales for a gas station with 16 pumps: {predicted_sales[0]:.2f} (in thousands CHF)')

# 10. Conclusion

In this tutorial, we have learned how to:
1. Load and visualize the dataset to understand the relationship between the number of gas pumps and sales per year.
2. Prepare the data for linear regression by splitting it into training and testing sets.
3. Train a linear regression model to predict sales based on the number of gas pumps.
4. Evaluate the model's performance using metrics such as Mean Squared Error and R-squared.
5. Visualize the results to see how well the model fits the data.

Linear regression is a useful technique for understanding and predicting relationships in data, and it forms the basis for more advanced machine learning methods.