# **Build Linear Regression Model in Python**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, I will show you how to build a linear regression model in Python using the scikit-learn package.

Inspired by [scikit-learn's Linear Regression Example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html)

---

## **Load the Diabetes dataset** (via scikit-learn)

### **Import library**

In [6]:
pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.5.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.5.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.5.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.9 MB)
Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
Downloading threadpoolctl-3.5.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.4.2 scikit-learn-1.5.2 threadpoolctl-3.5.0
Note

In [7]:
from sklearn import datasets

### **Load dataset**

In [8]:
diabetes = datasets.load_diabetes()

In [10]:
diabetes

{'data': array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
          0.01990749, -0.01764613],
        [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
         -0.06833155, -0.09220405],
        [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
          0.00286131, -0.02593034],
        ...,
        [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
         -0.04688253,  0.01549073],
        [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
          0.04452873, -0.02593034],
        [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
         -0.00422151,  0.00306441]]),
 'target': array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
         69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
         68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
         87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
        259.,  53., 190., 142.,  75., 142., 155., 225.,  59

### **Description of the Diabetes dataset**

In [11]:
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

:Number of Instances: 442

:Number of Attributes: First 10 columns are numeric predictive values

:Target: Column 11 is a quantitative measure of disease progression one year after baseline

:Attribute Information:
    - age     age in years
    - sex
    - bmi     body mass index
    - bp      average blood pressure
    - s1      tc, total serum cholesterol
    - s2      ldl, low-density lipoproteins
    - s3      hdl, high-density lipoproteins
    - s4      tch, total cholesterol / HDL
    - s5      ltg, possibly log of serum triglycerides level
    - s6      glu, blood sugar level

Note: Each of these 10 feature variables have bee

### **Feature names**

In [15]:
print(diabetes.feature_names)

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']


### **Create X and Y data matrices**

In [16]:
X = diabetes.data
Y = diabetes.target

In [17]:
X.shape, Y.shape

((442, 10), (442,))

### **Load dataset + Create X and Y data matrices (in 1 step)**

In [18]:
X, Y = datasets.load_diabetes(return_X_y=True)

In [19]:
X.shape, Y.shape

((442, 10), (442,))
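
If column names are useful for later inspection, the same loader can also return pandas objects instead of NumPy arrays. The sketch below assumes scikit-learn 0.23 or later (where the `as_frame` parameter is available) and that pandas is installed; the names `X_df` and `Y_s` are illustrative and not part of the original notebook.

```python
from sklearn import datasets

# as_frame=True returns a pandas DataFrame / Series instead of NumPy arrays
X_df, Y_s = datasets.load_diabetes(return_X_y=True, as_frame=True)

print(X_df.shape, Y_s.shape)   # same dimensions as the NumPy version
print(list(X_df.columns))      # column names match diabetes.feature_names
```

The rest of the notebook works with either form, since scikit-learn estimators accept both arrays and DataFrames.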

## **Load the Boston Housing dataset (via GitHub)**

The Boston Housing dataset was obtained from the mlbench R package, which was loaded using the following commands:

```
library(mlbench)
data(BostonHousing)
```

For your convenience, I have also shared the [Boston Housing dataset](https://github.com/dataprofessor/data/blob/master/BostonHousing.csv) in the Data Professor GitHub repository.

### **Import library**

In [20]:
import pandas as pd

### **Download CSV from GitHub**

In [11]:
! wget https://github.com/dataprofessor/data/raw/master/BostonHousing.csv

--2020-03-30 07:43:30--  https://github.com/dataprofessor/data/raw/master/BostonHousing.csv
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/data/master/BostonHousing.csv [following]
--2020-03-30 07:43:36--  https://raw.githubusercontent.com/dataprofessor/data/master/BostonHousing.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36242 (35K) [text/plain]
Saving to: ‘BostonHousing.csv’


2020-03-30 07:43:37 (1.25 MB/s) - ‘BostonHousing.csv’ saved [36242/36242]



### **Read in CSV file**

In [12]:
BostonHousing = pd.read_csv("BostonHousing.csv")
BostonHousing

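|  | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 | 22.0 |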


### **Split dataset to X and Y variables**

In [13]:
Y = BostonHousing.medv
Y

0      24.0
1      21.6
2      34.7
3      33.4
4      36.2
       ... 
501    22.4
502    20.6
503    23.9
504    22.0
505    11.9
Name: medv, Length: 506, dtype: float64

In [14]:
X = BostonHousing.drop(['medv'], axis=1)
X

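|  | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 |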
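
The read-and-split steps can be wrapped in one small helper. This is only a sketch: `split_features_target` is a hypothetical name, and the inline sample reuses the first three rows shown above as a stand-in for the full file. Because `pd.read_csv` accepts a local path, a URL, or a file-like object, the same helper could presumably be pointed at the raw GitHub URL from the download log instead of a local copy.

```python
import io
import pandas as pd

def split_features_target(source, target="medv"):
    """Read a CSV (path, URL, or file-like object) and split off the target column."""
    df = pd.read_csv(source)
    Y = df[target]                   # response variable
    X = df.drop(columns=[target])    # all remaining columns are features
    return X, Y

# Three rows copied from the data shown above (illustrative sample only)
sample_csv = io.StringIO(
    "crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv\n"
    "0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0\n"
    "0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6\n"
    "0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7\n"
)
X_sample, Y_sample = split_features_target(sample_csv)
print(X_sample.shape, Y_sample.tolist())
```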


## **Data split**

### **Import library**

In [21]:
from sklearn.model_selection import train_test_split

### **Perform 80/20 Data split**

In [22]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

### **Data dimension**

In [23]:
X_train.shape, Y_train.shape

((353, 10), (353,))

In [24]:
X_test.shape, Y_test.shape

((89, 10), (89,))
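
`train_test_split` shuffles the rows randomly, so each run of the cell above produces a different split and slightly different model metrics. A minimal sketch of a reproducible split follows; the seed value `42` is an arbitrary choice for illustration, and the numbers reported later in this notebook came from an unseeded split, so this seed will not reproduce them.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, Y = datasets.load_diabetes(return_X_y=True)

# random_state fixes the shuffle; 42 is an arbitrary illustrative seed
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# 442 rows at test_size=0.2 gives 89 test rows and 353 training rows
print(X_train.shape, X_test.shape)
```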

## **Linear Regression Model**

### **Import library**

In [25]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

### **Build linear regression**

#### Define the regression model

In [26]:
model = linear_model.LinearRegression()

#### Train the model

In [27]:
model.fit(X_train, Y_train)

#### Apply the trained model to make predictions (on the test set)

In [28]:
Y_pred = model.predict(X_test)
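
To see what `predict` does for a linear model, the sketch below recomputes the predictions by hand as the feature matrix times the fitted coefficients plus the intercept. It reloads the data and uses an arbitrary seed (`random_state=0`) so that it runs on its own; `manual_pred` and `same` are illustrative names.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

X, Y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0  # arbitrary seed for illustration
)

model = linear_model.LinearRegression()
model.fit(X_train, Y_train)

# For a fitted LinearRegression, predict() is X · coef_ + intercept_
manual_pred = X_test @ model.coef_ + model.intercept_
same = np.allclose(manual_pred, model.predict(X_test))
print(same)
```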

## **Prediction results**

### **Print model performance**

In [29]:
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(Y_test, Y_pred))
print('Coefficient of determination (R^2): %.2f'
      % r2_score(Y_test, Y_pred))

Coefficients: [  -21.21752065  -225.9328035    507.53881184   304.81030126
 -1039.96123572   683.17197073   220.99103165   213.76249997
   866.32383572    72.41259527]
Intercept: 151.69862172133583
Mean squared error (MSE): 2766.65
Coefficient of determination (R^2): 0.55
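
For intuition about the two metrics printed above, here is a small pure-Python sketch of how they are defined: MSE is the mean of the squared residuals, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. The function names and the toy values are made up for illustration; in practice the scikit-learn functions should be preferred.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Mean of squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values (not from the diabetes data)
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.5, 5.0, 7.5, 9.0]

mse_toy = mean_squared_error_manual(y_true_toy, y_pred_toy)
r2_toy = r2_score_manual(y_true_toy, y_pred_toy)
print('MSE: %.3f' % mse_toy)   # MSE: 0.125
print('R^2: %.3f' % r2_toy)    # R^2: 0.975
```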


### **String formatting**

By default, `r2_score` returns a floating-point number ([more details](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html)).

In [32]:
r2_score(Y_test, Y_pred)

0.5455013815129843

In [39]:
type(r2_score(Y_test, Y_pred))

float

We will use the `%` string-formatting operator to round the number for display.

In [40]:
'%f' % 0.523810833536016

'0.523811'

We will now round it to 3 decimal places.

In [41]:
'%.3f' % 0.523810833536016

'0.524'

We will now round it to 2 decimal places.

In [42]:
'%.2f' % 0.523810833536016

'0.52'
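
The same rounding can be written with `str.format` or an f-string, which are the more common styles in current Python code. The sketch below also shows that formatting produces a string, whereas the built-in `round` returns a number; the variable names are illustrative.

```python
value = 0.523810833536016

old_style = '%.2f' % value              # '0.52'
with_format = '{:.2f}'.format(value)    # '0.52'
with_fstring = f'{value:.2f}'           # '0.52'
three_places = f'{value:.3f}'           # '0.524'

rounded_number = round(value, 2)        # 0.52, still a float rather than a string

print(old_style, with_format, with_fstring, three_places, rounded_number)
```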

## **Scatter plots**

### **Import library**

In [45]:
import seaborn as sns

### **Make scatter plot**

#### The Data

In [46]:
Y_test

array([185., 233., 221., 132., 277.,  78., 113., 244., 321., 162., 258.,
       164.,  90., 200.,  47., 168., 180., 210., 167., 115., 121., 127.,
       196., 128.,  97.,  88.,  96., 125., 220., 107., 201., 229., 111.,
       237., 245.,  65., 259.,  53., 192.,  53., 102., 109.,  85.,  53.,
       104., 257., 296.,  71.,  64., 258.,  45., 283., 262.,  51., 243.,
       180.,  52.,  83.,  84.,  66., 273.,  72., 242., 129., 182., 145.,
        96., 172., 141., 190., 292., 263., 140., 144., 249., 200.,  92.,
       275.,  51.,  93., 248.,  55., 303., 202.,  39.,  99., 230.,  94.,
        88.])

In [47]:
import numpy as np
np.array(Y_test)

array([185., 233., 221., 132., 277.,  78., 113., 244., 321., 162., 258.,
       164.,  90., 200.,  47., 168., 180., 210., 167., 115., 121., 127.,
       196., 128.,  97.,  88.,  96., 125., 220., 107., 201., 229., 111.,
       237., 245.,  65., 259.,  53., 192.,  53., 102., 109.,  85.,  53.,
       104., 257., 296.,  71.,  64., 258.,  45., 283., 262.,  51., 243.,
       180.,  52.,  83.,  84.,  66., 273.,  72., 242., 129., 182., 145.,
        96., 172., 141., 190., 292., 263., 140., 144., 249., 200.,  92.,
       275.,  51.,  93., 248.,  55., 303., 202.,  39.,  99., 230.,  94.,
        88.])

In [48]:
Y_pred

array([146.86157004, 255.67116179, 203.61706725, 118.55731938,
       250.87421364, 189.80485978, 120.79598899, 178.11979023,
       233.6152016 , 127.23483187, 286.44927782, 188.1579659 ,
       175.87892389, 146.829536  ,  99.61786413, 124.43434566,
       172.70148782, 151.21531719, 183.74864546,  92.16703875,
       167.71595743, 162.59467536, 163.42057658,  71.74118642,
       147.75348821, 107.48639221, 114.82596949,  99.84862015,
       260.70103084, 180.89332888,  84.94759995, 192.08633427,
       174.14098374, 157.43790018, 246.07495872, 126.24143298,
       222.0175248 ,  81.30383996, 217.34591876, 100.13411928,
       106.36236358, 110.01314562, 150.88776886, 124.78314272,
        30.51795022, 222.36015277, 219.00590854, 119.18196964,
       117.92392972, 172.85639061,  32.69716783, 184.52170412,
       181.02815937,  75.70577299, 274.83961118, 220.06945154,
        60.86746784, 140.35039036, 179.35704046, 122.40066367,
       250.8503855 ,  50.20021047, 257.45800734, 157.16

#### Making the scatter plot

In [49]:
# seaborn 0.12 and later require x and y to be passed as keyword arguments
sns.scatterplot(x=Y_test, y=Y_pred)

In [50]:
sns.scatterplot(x=Y_test, y=Y_pred, marker="+")

In [51]:
sns.scatterplot(x=Y_test, y=Y_pred, alpha=0.5)
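
Since seaborn 0.12, `scatterplot()` accepts `x` and `y` only as keyword arguments. The sketch below uses that form and adds axis labels plus a dashed identity line, which makes it easier to judge how close the predictions are to the observed values. The data here are synthetic stand-ins generated with a fixed seed, not the notebook's `Y_test` and `Y_pred`, and the `Agg` backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import numpy as np
import seaborn as sns

# Synthetic stand-ins for Y_test and Y_pred (illustration only)
rng = np.random.default_rng(0)
y_test_toy = rng.uniform(50, 300, size=89)
y_pred_toy = y_test_toy + rng.normal(0, 50, size=89)

# x and y must be keyword arguments
ax = sns.scatterplot(x=y_test_toy, y=y_pred_toy, alpha=0.5)
ax.set_xlabel("Observed Y")
ax.set_ylabel("Predicted Y")

# Identity line: points on it would be perfect predictions
lims = [min(y_test_toy.min(), y_pred_toy.min()),
        max(y_test_toy.max(), y_pred_toy.max())]
ax.plot(lims, lims, linestyle="--", color="gray")
```

With the notebook's own variables, the call would be `sns.scatterplot(x=Y_test, y=Y_pred, alpha=0.5)`.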