# Linear Regression

# 📈 Linear Regression Pipeline
*Professional Guide to Building, Training, and Evaluating a Linear Regression Model*

---

## 📊 Overview

Linear regression is one of the most fundamental algorithms in machine learning.  
This notebook demonstrates the entire workflow: preparing data, building a model, training, making predictions, and evaluating performance.

**Workflow:** Raw Data → Training/Test Sets → Linear Regression Model → Predictions → Evaluation

---

## 🟢 Step 1: Data Preprocessing

### Importing the Dataset
- Load the dataset and inspect the first 5 rows.
- Identify input features (AT, V, AP, RH) and target variable (PE).

**Columns:**
- **AT** – Ambient Temperature (°C)  
- **V** – Exhaust Vacuum (cm Hg)  
- **AP** – Ambient Pressure (mbar)  
- **RH** – Relative Humidity (%)  
- **PE** – Net hourly electrical energy output (MW)

---

### Defining Inputs and Output
- Separate **features** (X = AT, V, AP, RH) and **target** (y = PE).

---

### Splitting Data
- Split into a **Training Set** (80%) and a **Test Set** (20%), using a fixed random seed so the split is reproducible.

---

## 🟡 Step 2: Building and Training the Model

### Building the Model
- Define a **Linear Regression** model using scikit-learn.

---

### Training the Model
- Fit the model with the training data (X_train, y_train).

---

### Making Predictions
- Predict values for the **Test Set**.  
- Predict a **single custom data point**:
  - AT = 15  
  - V = 40  
  - AP = 1000  
  - RH = 75  

---

## 🔵 Step 3: Model Evaluation

### R-Squared
- Measure the proportion of variance in the target (PE) that the model explains on the test set.
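
For reference, R-squared is commonly defined as shown below. The symbols are introduced here for illustration and do not appear in the notebook: $y_i$ is an observed target value, $\hat{y}_i$ the corresponding prediction, $\bar{y}$ the mean of the observed values, and $n$ the number of evaluated samples.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

A value of 1 means the predictions match the observations exactly, while a value of 0 means the model does no better than always predicting $\bar{y}$.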

---

### Adjusted R-Squared
- R-squared adjusted for the number of predictors, so that adding features that do not improve the fit is penalized.
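
The adjustment used in this notebook can be written as follows, with $n$ the number of samples used for evaluation and $k$ the number of predictors (here the four features AT, V, AP, RH). The notation $\bar{R}^2$ is an assumption made for this sketch.

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
```

Because $(n - 1)/(n - k - 1) > 1$ whenever $k \geq 1$, the adjusted value is lower than $R^2$ for any model that fits imperfectly, and the gap grows as predictors are added.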

---

### Error Metrics
- **RMSE** – Root Mean Squared Error  
- **MAE** – Mean Absolute Error  

---

✅ This completes the **Linear Regression pipeline**: from raw data to performance evaluation.


## Part 1 - Data Preprocessing

### Importing the dataset

In [1]:
# Load dataset from Excel
import pandas as pd
dataset = pd.read_excel('reg_data.xlsx')

In [3]:
# Display the first five rows of the dataset to understand its structure and contents.
# The goal is to inspect the input features (AT, V, AP, RH) and the target variable (PE) for linear regression.
# Column descriptions:
# AT: Ambient Temperature (in °C)
# V: Exhaust Vacuum (in cm Hg)
# AP: Ambient Pressure (in mbar)
# RH: Relative Humidity (in %)
# PE: Net hourly electrical energy output (in MW)
dataset.head()

|   | AT | V | AP | RH | PE |
|---|---|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 4 | 10.82 | 37.5 | 1009.23 | 96.62 | 473.9 |
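
Before modelling, it can help to confirm that every column is numeric and that no values are missing. The sketch below rebuilds the five previewed rows by hand as a small DataFrame (the name `sample` is illustrative, not part of the notebook) and runs two such checks. On the full data, the same calls would presumably be applied to `dataset`.

```python
import pandas as pd

# The five previewed rows, rebuilt by hand for illustration only.
sample = pd.DataFrame(
    {
        "AT": [14.96, 25.18, 5.11, 20.86, 10.82],
        "V": [41.76, 62.96, 39.40, 57.32, 37.50],
        "AP": [1024.07, 1020.04, 1012.16, 1010.24, 1009.23],
        "RH": [73.17, 59.08, 92.14, 76.64, 96.62],
        "PE": [463.26, 444.37, 488.56, 446.48, 473.90],
    }
)

# Count missing values per column; scikit-learn's LinearRegression rejects NaN inputs.
missing_per_column = sample.isna().sum()

# Check that every column has a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in sample.dtypes)

print(missing_per_column.to_dict())
print("All columns numeric:", all_numeric)
```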


### Getting the inputs and output

In [6]:
# Extract features (all columns except last)
X = dataset.iloc[:, :-1]

In [7]:
# Preview feature matrix
X

|   | AT | V | AP | RH |
|---|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 |
| 2 | 5.11 | 39.40 | 1012.16 | 92.14 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 |
| 4 | 10.82 | 37.50 | 1009.23 | 96.62 |
| ... | ... | ... | ... | ... |
| 9563 | 16.65 | 49.69 | 1014.01 | 91.00 |
| 9564 | 13.19 | 39.18 | 1023.67 | 66.78 |
| 9565 | 31.32 | 74.33 | 1012.92 | 36.48 |
| 9566 | 24.48 | 69.45 | 1013.86 | 62.39 |


In [9]:
# Extract target (last column)
y = dataset.iloc[:, -1].values

In [10]:
# Preview target vector
y

array([463.26, 444.37, 488.56, ..., 429.57, 435.74, 453.28], shape=(9568,))
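
Positional slicing with `iloc` relies on PE being the last column. As an alternative sketch, the features and target can be selected by name, which does not depend on column order. The names `demo` and `feature_cols` are illustrative assumptions, and the two rows are copied from the dataset preview.

```python
import pandas as pd

# Two rows from the dataset preview, used only to illustrate the selection.
demo = pd.DataFrame(
    {
        "AT": [14.96, 25.18],
        "V": [41.76, 62.96],
        "AP": [1024.07, 1020.04],
        "RH": [73.17, 59.08],
        "PE": [463.26, 444.37],
    }
)

feature_cols = ["AT", "V", "AP", "RH"]  # illustrative name, not from the notebook
X_by_name = demo[feature_cols]          # DataFrame of inputs
y_by_name = demo["PE"].values           # NumPy array of targets

# Same result as positional slicing, as long as PE is the last column.
assert X_by_name.equals(demo.iloc[:, :-1])
assert (y_by_name == demo.iloc[:, -1].values).all()
```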

### Creating the Training Set and the Test Set

In [11]:
# Split data into training and test sets (80/20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [12]:
# Preview training features
X_train

|   | AT | V | AP | RH |
|---|---|---|---|---|
| 496 | 11.22 | 43.13 | 1017.24 | 80.90 |
| 294 | 13.67 | 54.30 | 1015.92 | 75.42 |
| 6796 | 32.84 | 77.95 | 1014.68 | 45.80 |
| 6785 | 31.91 | 67.83 | 1008.76 | 53.22 |
| 1203 | 10.37 | 37.50 | 1013.19 | 79.25 |
| ... | ... | ... | ... | ... |
| 7891 | 16.21 | 50.90 | 1012.46 | 84.45 |
| 9225 | 13.85 | 44.90 | 1019.11 | 76.79 |
| 4859 | 16.81 | 38.52 | 1018.26 | 75.21 |
| 3264 | 12.80 | 41.16 | 1022.43 | 86.19 |


In [13]:
# Preview test features
X_test

|   | AT | V | AP | RH |
|---|---|---|---|---|
| 4834 | 28.66 | 77.95 | 1009.56 | 69.07 |
| 1768 | 17.48 | 49.39 | 1021.51 | 84.53 |
| 2819 | 14.86 | 43.14 | 1019.21 | 99.14 |
| 7779 | 22.46 | 58.33 | 1013.21 | 68.68 |
| 7065 | 18.38 | 55.28 | 1020.22 | 68.33 |
| ... | ... | ... | ... | ... |
| 6452 | 16.63 | 39.16 | 1005.85 | 72.02 |
| 794 | 16.45 | 63.31 | 1015.96 | 83.97 |
| 627 | 12.24 | 44.92 | 1023.74 | 88.21 |
| 3515 | 27.28 | 47.93 | 1003.46 | 59.22 |


In [14]:
# Preview training targets
y_train

array([473.93, 467.87, 431.97, ..., 459.01, 462.72, 428.12], shape=(7654,))

In [15]:
# Preview test targets
y_test

array([431.23, 460.01, 461.14, ..., 473.26, 438.  , 463.28], shape=(1914,))
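
The array shapes above (7654 training rows, 1914 test rows) follow from applying `test_size = 0.2` to 9568 rows. A minimal sketch that reproduces only the sizes, using placeholder arrays (`X_demo` and `y_demo` are illustrative names, and their values are irrelevant here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (9568);
# only the shapes matter for this check.
X_demo = np.zeros((9568, 4))
y_demo = np.zeros(9568)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (7654, 4) (1914, 4)
```

Fixing `random_state` means the same rows land in each set on every run, which keeps the evaluation numbers reproducible.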

## Part 2 - Building and training the model

### Building the model

In [16]:
# Initialize Linear Regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression()

### Training the model

In [17]:
# Train the model on training data
model.fit(X_train, y_train)
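
After fitting, a scikit-learn `LinearRegression` exposes the learned intercept and per-feature weights as `intercept_` and `coef_`. The sketch below uses synthetic, noise-free data with arbitrarily chosen weights (all names and values are assumptions for illustration, not taken from the dataset), so the fit should recover those weights up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic, noise-free data: y = 3 + 2*x0 - 1*x1 (weights chosen arbitrarily).
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)

# Ordinary least squares should recover the generating weights
# up to floating-point error.
print(demo_model.intercept_)
print(demo_model.coef_)
```

Inspecting `model.coef_` in the same way on the fitted notebook model would show how strongly each of AT, V, AP, and RH contributes to the predicted PE.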

### Inference

Predicting the data points in the test set

In [18]:
# Predict on the held-out test set
y_pred = model.predict(X_test)

In [19]:
# Preview predictions for test set
y_pred

array([431.42761597, 458.56124622, 462.75264705, ..., 469.51835895,
       442.41759454, 461.88279939], shape=(1914,))

Predicting a single data point with AT = 15, V = 40, AP = 1000, RH = 75

In [20]:
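# Predict a single custom data point [AT, V, AP, RH];
# a DataFrame with the training column names avoids a feature-name warning
model.predict(pd.DataFrame([[15, 40, 1000, 75]], columns=X.columns))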



array([465.80771895])
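
A linear model's prediction is the intercept plus the weighted sum of the feature values. The sketch below shows that mechanic by fitting on only the five rows from the dataset preview, which is far too few for a meaningful model, and comparing `predict` with the manual calculation. The names `X_small`, `small_model`, and `point` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

cols = ["AT", "V", "AP", "RH"]
# Five rows from the dataset preview; far too few for a useful model,
# used here only to show the mechanics of a prediction.
X_small = pd.DataFrame(
    [
        [14.96, 41.76, 1024.07, 73.17],
        [25.18, 62.96, 1020.04, 59.08],
        [5.11, 39.40, 1012.16, 92.14],
        [20.86, 57.32, 1010.24, 76.64],
        [10.82, 37.50, 1009.23, 96.62],
    ],
    columns=cols,
)
y_small = np.array([463.26, 444.37, 488.56, 446.48, 473.90])

small_model = LinearRegression().fit(X_small, y_small)

# The custom data point, with the same column names used for fitting.
point = pd.DataFrame([[15, 40, 1000, 75]], columns=cols)
by_predict = small_model.predict(point)[0]
by_hand = small_model.intercept_ + point.to_numpy()[0] @ small_model.coef_

print(by_predict, by_hand)
```

The two printed values should agree up to floating-point error. They will not match the 465.81 reported above, since this toy model sees only five rows.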

## Part 3 - Evaluating the model

### R-Squared

In [21]:
# Compute R-squared on test set
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)

In [22]:
# Preview R-squared value
r2

0.9325315554761303

### Adjusted R-Squared

In [23]:
# Compute Adjusted R-squared
k = X_test.shape[1]
n = X_test.shape[0]
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

In [24]:
# Preview Adjusted R-squared value
adj_r2

0.9323901862890713
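
The same calculation can be wrapped in a small helper so it is easy to reuse. This is a sketch: the function name `adjusted_r2` is an assumption, and the inputs are the values reported above (test-set R-squared, 1914 test rows, 4 features).

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for n samples and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Values reported above: test-set R-squared, 1914 test rows, 4 features.
print(adjusted_r2(0.9325315554761303, n=1914, k=4))
```

With only four predictors and nearly two thousand test rows, the penalty is small, which is why the adjusted value differs from R-squared only in the fourth decimal place.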

### Error Metrics

In [None]:
# Compute error metrics: RMSE and MAE
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Calculate MAE
mae = mean_absolute_error(y_test, y_pred)

rmse, mae
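
Both metrics are expressed in the units of the target (MW here). RMSE weights large errors more heavily because errors are squared before averaging, while MAE treats all errors linearly. A minimal sketch on invented toy values (not taken from the dataset) that computes both by hand and checks them against scikit-learn's helpers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy values, invented for illustration (not from the dataset).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_hat
rmse_manual = np.sqrt(np.mean(errors ** 2))  # root of the mean squared error
mae_manual = np.mean(np.abs(errors))         # mean of the absolute errors

# The manual results should agree with scikit-learn's helpers.
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
assert np.isclose(mae_manual, mean_absolute_error(y_true, y_hat))

print(rmse_manual, mae_manual)
```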
