# Linear Regression

## Part 1 - Data Preprocessing

### Importing the dataset

In [14]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns



from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score

import warnings
warnings.filterwarnings('ignore')

**pandas (pd) and numpy (np):** Tools for working with data and numbers in a structured way.
<br>
<br>
**matplotlib.pyplot (plt) and seaborn (sns):** Used to create graphs and visualize data.
<br>
<br>
**scikit-learn tools:**
<br>
<br>
**train_test_split:** Splits the data into two parts: training and testing.
<br>
<br>
**StandardScaler:** Helps to scale/standardize the data so it can be used effectively in the model.
<br>
<br>
**LinearRegression:** The algorithm that fits a linear model to predict values.
<br>
<br>
**mean_squared_error (MSE) and r2_score (R²):** Metrics for evaluating how well the model performs.
<br>
<br>
**warnings.filterwarnings('ignore'):** Suppresses all warnings for cleaner output. Note that this can also hide useful messages such as deprecation warnings.
<br><br>


In [19]:
df = pd.read_csv('2boston.csv')  # load the Boston housing data
df.head(5)  # display the first 5 rows of the dataset

# 506 rows, 14 columns
# independent variables (features): CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B and LSTAT
# dependent variable (target): MEDV

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


### Loading and Understanding the Dataset

In [46]:
# Load the CSV into a DataFrame; df.shape gives (rows, columns)
df = pd.read_csv('2boston.csv')
df


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296.0,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273.0,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273.0,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48,22.0


Loads the data from a CSV file into a **dataframe** (a table of data) and displays it.

In [47]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


Displays information about the data: **column names, data types, and non-null counts** (which reveal whether any data is missing).

In [48]:
df.isna().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

Counts the missing values in each column. Every column reports 0, so nothing needs to be filled in or dropped.
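To see what this check looks like when values really are missing, here is a minimal sketch on a made-up two-column table. The frame `demo_df` and its contents are assumptions for illustration only; they are not part of the Boston data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing value in column "a"
demo_df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_per_column = demo_df.isna().sum()
print(missing_per_column)

# Total number of missing cells in the whole table
total_missing = int(missing_per_column.sum())
print(total_missing)
```

A non-zero count for a column would be the signal to impute or drop rows before training.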


### Separating the Data (Features and Target)

In [49]:
# Extract features and target
X = df.drop('MEDV', axis=1)
y = df['MEDV']
# X: All the columns except "MEDV" (the home price) are considered features (factors that help predict the home price).
# y: The "MEDV" column is the target (what we're trying to predict: the home price).


### Splitting the Data for Training and Testing


In [50]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Splits the data into two parts:
<br>
<br>
**Training data** (80% of the total): Used to train the model.
<br>
<br>
**Testing data** (20% of the total): Used to see how well the model performs on data it hasn’t seen before.
<br>
<br>
**random_state=42:** Ensures that the splitting is the same each time for consistent results.
<br>
<br>
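The 80/20 proportions and the effect of `random_state` can be checked on a small example. This is a minimal sketch on synthetic arrays (`X_demo` and `y_demo` are made-up stand-ins, not the Boston data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 50 rows, 2 features (an assumption for illustration)
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

# Two calls with the same random_state produce identical splits
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr1.shape, X_te1.shape)        # 40 training rows and 10 test rows
print(np.array_equal(X_te1, X_te2))    # the two test sets are identical
```

Changing `random_state` to another integer gives a different, but equally reproducible, split.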


In [None]:
# Initialize and fit the StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

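Standardizes the features so that each has zero mean and unit variance. For plain least-squares linear regression this does not change the predictions, but it makes the coefficients comparable across features and matters for regularized or gradient-based models.
<br>
<br>
**fit_transform:** Learns each feature's mean and standard deviation from the training data, then scales the training data with them.
<br>
<br>
**transform:** Scales the test data using the mean and standard deviation learned from the training data, so no information from the test set leaks into training.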
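The difference between the two calls is easiest to see numerically. The sketch below uses randomly generated data (an assumption for illustration; the `_demo` names are hypothetical) and checks that the test set is scaled with the training statistics rather than its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data: 80 training rows and 20 test rows, 3 features each
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

scaler_demo = StandardScaler()
X_train_demo_scaled = scaler_demo.fit_transform(X_train_demo)  # learn mean/std, then scale
X_test_demo_scaled = scaler_demo.transform(X_test_demo)        # reuse the learned mean/std

# Training features end up with mean ~0 and standard deviation ~1
print(X_train_demo_scaled.mean(axis=0).round(6))
print(X_train_demo_scaled.std(axis=0).round(6))

# transform is just (x - training mean) / training std
manual = (X_test_demo - scaler_demo.mean_) / scaler_demo.scale_
print(np.allclose(manual, X_test_demo_scaled))
```

The scaled test set will usually not have exactly zero mean, which is expected: it was never used to compute the statistics.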

### Training The Model

In [53]:
# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

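NameError: name 'X_train_scaled' is not defined

(This error appears because the scaling cell above, marked In [None], had not been run when this cell was executed. Run the scaling cell first.)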

**model = LinearRegression():** Creates a new linear regression model.
<br>
<br>
**model.fit(X_train_scaled, y_train):** Trains the model using the scaled training data.
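What `fit` actually learns is one coefficient per feature plus an intercept. As a minimal sketch, assuming a synthetic, noise-free target built from known weights (all `_demo` names are hypothetical), the fitted model should recover those weights.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features and a noise-free target: y = 3*x0 - 2*x1 + 5
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# The learned parameters should be close to the weights used to build y_demo
print(demo_model.coef_)       # close to [3, -2]
print(demo_model.intercept_)  # close to 5
```

On real data such as the housing set, the target is noisy, so the coefficients are estimates rather than exact values.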

### Predicting and Evaluating the Model

In [None]:
# Evaluate on the held-out test set (a single train/test split, not cross-validation)
test_predictions = model.predict(X_test_scaled)
test_mse = mean_squared_error(y_test, test_predictions)
print(f"Test Mean Squared Error: {test_mse:.4f}")
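The evaluation cell above scores one held-out test set. If k-fold cross-validation is wanted instead, a minimal sketch could look like the following. It assumes synthetic data and a scaler-plus-model pipeline; the `_demo` names, the pipeline, and the choice of 5 folds are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# The pipeline refits the scaler inside every fold, so each validation fold
# is scaled only with statistics from its own training folds
pipeline_demo = make_pipeline(StandardScaler(), LinearRegression())

# scikit-learn reports negated MSE so that a higher score is always better
neg_mse_scores = cross_val_score(
    pipeline_demo, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
cv_mse_scores = -neg_mse_scores

print(cv_mse_scores)         # one MSE per fold
print(cv_mse_scores.mean())  # average MSE across the 5 folds
```

Averaging over folds gives a less split-dependent estimate than a single test set, at the cost of fitting the model several times.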

### Creating the Training Set and the Test Set

In [22]:
# scikit-learn is a library
# model_selection is a module
# train_test_split is a function
# Note: this re-splits the data with random_state=0, replacing the earlier split made with random_state=42
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [23]:
X_train

array([[3.5809e-01, 0.0000e+00, 6.2000e+00, ..., 1.7400e+01, 3.9170e+02,
        9.7100e+00],
       [1.5876e-01, 0.0000e+00, 1.0810e+01, ..., 1.9200e+01, 3.7694e+02,
        9.8800e+00],
       [1.1329e-01, 3.0000e+01, 4.9300e+00, ..., 1.6600e+01, 3.9125e+02,
        1.1380e+01],
       ...,
       [1.5098e-01, 0.0000e+00, 1.0010e+01, ..., 1.7800e+01, 3.9451e+02,
        1.0300e+01],
       [2.2927e-01, 0.0000e+00, 6.9100e+00, ..., 1.7900e+01, 3.9274e+02,
        1.8800e+01],
       [1.3914e-01, 0.0000e+00, 4.0500e+00, ..., 1.6600e+01, 3.9690e+02,
        1.4690e+01]])

In [24]:
X_test

array([[6.7240e-02, 0.0000e+00, 3.2400e+00, ..., 1.6900e+01, 3.7521e+02,
        7.3400e+00],
       [9.2323e+00, 0.0000e+00, 1.8100e+01, ..., 2.0200e+01, 3.6615e+02,
        9.5300e+00],
       [1.1425e-01, 0.0000e+00, 1.3890e+01, ..., 1.6400e+01, 3.9374e+02,
        1.0500e+01],
       ...,
       [1.4932e-01, 2.5000e+01, 5.1300e+00, ..., 1.9700e+01, 3.9511e+02,
        1.3150e+01],
       [1.4052e-01, 0.0000e+00, 1.0590e+01, ..., 1.8600e+01, 3.8581e+02,
        9.3800e+00],
       [1.2802e-01, 0.0000e+00, 8.5600e+00, ..., 2.0900e+01, 3.9524e+02,
        1.2270e+01]])

In [25]:
y_train

array([26.7, 21.7, 22. , 22.9, 10.4, 21.9, 20.6, 26.4, 41.3, 17.2, 27.1,
       20.4, 16.5, 24.4,  8.4, 23. ,  9.7, 50. , 30.5, 12.3, 19.4, 21.2,
       20.3, 18.8, 33.4, 18.5, 19.6, 33.2, 13.1,  7.5, 13.6, 17.4,  8.4,
       35.4, 24. , 13.4, 26.2,  7.2, 13.1, 24.5, 37.2, 25. , 24.1, 16.6,
       32.9, 36.2, 11. ,  7.2, 22.8, 28.7, 14.4, 24.4, 18.1, 22.5, 20.5,
       15.2, 17.4, 13.6,  8.7, 18.2, 35.4, 31.7, 33. , 22.2, 20.4, 23.9,
       25. , 12.7, 29.1, 12. , 17.7, 27. , 20.6, 10.2, 17.5, 19.7, 29.8,
       20.5, 14.9, 10.9, 19.5, 22.7, 19.5, 24.6, 25. , 24.5, 50. , 14.3,
       11.8, 31. , 28.7, 16.2, 43.5, 25. , 22. , 19.9, 22.1, 46. , 22.9,
       20.2, 43.1, 34.6, 13.8, 24.3, 21.5, 24.4, 21.2, 23.8, 26.6, 25.1,
        9.6, 19.4, 19.4,  9.5, 14. , 26.5, 13.8, 34.7, 16.3, 21.7, 17.5,
       15.6, 20.9, 21.7, 12.7, 18.5, 23.7, 19.3, 12.7, 21.6, 23.2, 29.6,
       21.2, 23.8, 17.1, 22. , 36.5, 18.8, 21.9, 23.1, 20.2, 17.4, 37. ,
       24.1, 36.2, 15.7, 32.2, 13.5, 17.9, 13.3, 11

In [26]:
y_test

array([22.6, 50. , 23. ,  8.3, 21.2, 19.9, 20.6, 18.7, 16.1, 18.6,  8.8,
       17.2, 14.9, 10.5, 50. , 29. , 23. , 33.3, 29.4, 21. , 23.8, 19.1,
       20.4, 29.1, 19.3, 23.1, 19.6, 19.4, 38.7, 18.7, 14.6, 20. , 20.5,
       20.1, 23.6, 16.8,  5.6, 50. , 14.5, 13.3, 23.9, 20. , 19.8, 13.8,
       16.5, 21.6, 20.3, 17. , 11.8, 27.5, 15.6, 23.1, 24.3, 42.8, 15.6,
       21.7, 17.1, 17.2, 15. , 21.7, 18.6, 21. , 33.1, 31.5, 20.1, 29.8,
       15.2, 15. , 27.5, 22.6, 20. , 21.4, 23.5, 31.2, 23.7,  7.4, 48.3,
       24.4, 22.6, 18.3, 23.3, 17.1, 27.9, 44.8, 50. , 23. , 21.4, 10.2,
       23.3, 23.2, 18.9, 13.4, 21.9, 24.8, 11.9, 24.3, 13.8, 24.7, 14.1,
       18.7, 28.1, 19.8])

## Part 2 - Building and training the model

### Building the model

In [27]:
# linear_model is the module
# LinearRegression is a class inside the linear_model module.
# A class is a pre-coded blueprint from which objects are created;
# here, each object represents one linear regression model.
from sklearn.linear_model import LinearRegression
model = LinearRegression()

### Training the Model

In [28]:
# fit is a method of the LinearRegression class; a method is a function that belongs to an object.
model.fit(X_train, y_train)

### Inference

In [29]:
y_pred = model.predict(X_test)
y_pred

array([24.88963777, 23.72141085, 29.36499868, 12.12238621, 21.44382254,
       19.2834443 , 20.49647539, 21.36099298, 18.8967118 , 19.9280658 ,
        5.12703513, 16.3867396 , 17.07776485,  5.59375659, 39.99636726,
       32.49654668, 22.45798809, 36.85192327, 30.86401089, 23.15140009,
       24.77495789, 24.67187756, 20.59543752, 30.35369168, 22.41940736,
       10.23266565, 17.64816865, 18.27419652, 35.53362541, 20.96084724,
       18.30413012, 17.79262072, 19.96561663, 24.06127231, 29.10204874,
       19.27774123, 11.15536648, 24.57560579, 17.5862644 , 15.49454112,
       26.20577527, 20.86304693, 22.31460516, 15.60710156, 23.00363104,
       25.17247952, 20.11459464, 22.90256276, 10.0380507 , 24.28515123,
       20.94127711, 17.35258791, 24.52235405, 29.95143046, 13.42695877,
       21.72673066, 20.7897053 , 15.49668805, 13.98982601, 22.18377874,
       17.73047814, 21.58869165, 32.90522136, 31.11235671, 17.73252635,
       32.76358681, 18.7124637 , 19.78693475, 19.02958927, 22.89

#### Making the prediction of a single data point with CRIM = 0.00632, ZN = 18, INDUS = 2.31, CHAS = 0, NOX = 0.538, RM = 6.575, AGE = 65.2, DIS = 4.09, RAD = 1, TAX = 296, PTRATIO = 15.3, B = 396.9, LSTAT = 4.98 (actual MEDV = 24.0)

In [31]:
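# Predict MEDV for one observation (the first row of the dataset); predict expects a 2D array
medv_pred = model.predict([[0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98]])
medv_pred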

array([30.49949836])
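The double brackets in the call above matter: `predict` expects a 2D input of shape (number of samples, number of features), even for one observation. A minimal sketch on made-up data (the `_demo` names and values are assumptions for illustration) shows the reshape that turns a flat row into that form.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic problem with an exact linear relation: y = x0 + 2*x1
X_demo = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y_demo = X_demo[:, 0] + 2.0 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)

# A single observation stored as a flat 1D array
single_row = np.array([5.0, 1.0])

# reshape(1, -1) turns it into one row with as many columns as features
single_pred = demo_model.predict(single_row.reshape(1, -1))

print(single_pred.shape)  # one prediction for the one row
print(single_pred[0])     # close to 5 + 2*1 = 7
```

Passing the flat array directly would raise an error, because scikit-learn cannot tell whether it is one sample with two features or two samples with one feature.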

## Part 3: Evaluating the Model

### R-Squared

In [32]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
r2

0.5892223849182529

### Adjusted R-Squared

In [33]:
k = X_test.shape[1]
k

13

In [34]:
n = X_test.shape[0]
n

102

In [35]:
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)
adj_r2

0.528539328144813

### Mean Squared Error

In [36]:
# Toy example: four made-up true values and predictions (not the model's output)
# Separate names are used so the model's y_pred from the Inference step is not overwritten
# Calculate Mean Squared Error

from sklearn.metrics import mean_squared_error

y_true_toy = [3, -0.5, 2, 7]
y_pred_toy = [2.5, 0.0, 2, 8]
mse = mean_squared_error(y_true_toy, y_pred_toy)
print("Mean Squared Error:", mse)

Mean Squared Error: 0.375
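The 0.375 above can be reproduced by hand, which makes clear what the metric measures. This sketch reuses the same four toy values and also derives the root mean squared error (RMSE), which is an addition for illustration and is not computed in the original cell.

```python
import numpy as np

# Same toy values as in the cell above
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

# Squared difference for each pair: [0.25, 0.25, 0.0, 1.0]
squared_errors = (y_true_demo - y_pred_demo) ** 2

# MSE is the average of the squared errors: 1.5 / 4
mse_manual = squared_errors.mean()
print(mse_manual)   # 0.375

# RMSE is the square root of MSE, expressed in the same units as the target
rmse_manual = np.sqrt(mse_manual)
print(rmse_manual)
```

Because the errors are squared, the single prediction that is off by 1 contributes four times as much as each prediction that is off by 0.5.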


#### Result Chart

In [42]:
# Load the data from the CSV file and create the plots with pandas, matplotlib and seaborn
# Load the data

import matplotlib.pyplot as plt
import seaborn as sns

boston_df = pd.read_csv('2boston.csv')

In [43]:
# Scatter plot: median home value (MEDV) vs average number of rooms (RM)
# RM is chosen here as an illustrative feature; any numeric column of the dataset works

plt.figure(figsize=(10, 6))
sns.scatterplot(data=boston_df, x='RM', y='MEDV')
plt.title('Median Home Value vs Average Number of Rooms')
plt.xlabel('Average Number of Rooms per Dwelling (RM)')
plt.ylabel('Median Home Value (MEDV)')
plt.grid(True)
plt.show()

In [44]:
# Bar chart: distribution of records by highway accessibility index (RAD)

plt.figure(figsize=(10, 6))
boston_df['RAD'].value_counts().sort_index().plot(kind='bar')
plt.title('Distribution of Records by Highway Accessibility Index')
plt.xlabel('Highway Accessibility Index (RAD)')
plt.ylabel('Number of Records')
plt.grid(True)
plt.show()