In [2]:
import numpy as np

Let us load the iris plants dataset and split it into training and test sets.

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
# Hold out 30% of the samples for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=22, train_size=0.7, test_size=0.3
)


Now let us fit a linear regression model to the training data and print the model parameters: the intercept and the coefficients.

In [4]:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
print(reg.intercept_,reg.coef_)

0.20850858638763448 [-0.14562569 -0.00744822  0.26953125  0.54028834]
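
To make the role of these parameters concrete, here is a minimal sketch that rebuilds the predictions by hand as the intercept plus the dot product of the features and the coefficients. It repeats the split and fit from the cells above so that it runs on its own; the names `manual_prediction` and `matches_predict` are introduced here for illustration only.

```python
import numpy as np
from sklearn import linear_model
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Same data, split and model as in the cells above.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=22, train_size=0.7, test_size=0.3
)
reg = linear_model.LinearRegression().fit(X_train, y_train)

# A linear model predicts intercept + dot(features, coefficients).
manual_prediction = reg.intercept_ + X_test @ reg.coef_

# Check that the hand-built prediction agrees with reg.predict.
matches_predict = np.allclose(manual_prediction, reg.predict(X_test))
print(matches_predict)
```

Each coefficient is the change in the predicted (numeric) class label per unit change in the corresponding feature, with the other features held fixed.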


Let us use the regression line fitted on the training data to predict the test data. Each continuous prediction is turned into a class label by rounding it to the nearest integer, so the cutoff points between classes are 0.5 and 1.5.

In [5]:
y_predict = reg.predict(X_test)
# Round each continuous prediction to the nearest integer to obtain a class label.
y_predict_classified = np.array(list(map(round,y_predict)))
print(y_predict_classified)

[0 2 1 2 1 1 1 2 1 0 2 1 2 2 0 2 1 1 1 1 0 2 0 1 2 0 2 2 2 2 0 0 1 1 1 0 0
 0 2 2 1 1 0 0 1]
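
The cutoffs can also be applied explicitly. The sketch below is an alternative to rounding, not the method used in the cell above: it uses `np.digitize` with the cutoffs 0.5 and 1.5, and the helper name `classify_by_cutoffs` and the example scores are made up for illustration.

```python
import numpy as np

def classify_by_cutoffs(scores, cutoffs=(0.5, 1.5)):
    """Map continuous scores to class labels 0, 1, 2 using fixed cutoffs."""
    # np.digitize returns, for each score, how many cutoffs are <= that score.
    return np.digitize(scores, bins=np.asarray(cutoffs))

# Hypothetical scores, including values below 0 and above 2.
example_scores = np.array([-0.2, 0.3, 0.7, 1.4, 1.6, 2.8])
example_labels = classify_by_cutoffs(example_scores)
print(example_labels)  # [0 0 1 1 2 2]
```

Two differences from rounding are worth keeping in mind. Rounding can produce labels outside 0 to 2 (a score of 2.8 rounds to 3), whereas the cutoff approach always returns one of the three classes. Python's `round` also rounds halves to the nearest even integer, so a score of exactly 0.5 rounds to 0, while `np.digitize` as used here assigns it to class 1.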


Let us also look at the actual test labels.

In [6]:
print(y_test)

[0 2 1 2 1 1 1 2 1 0 2 1 2 2 0 2 1 1 2 1 0 2 0 1 2 0 2 2 2 2 0 0 1 1 1 0 0
 0 2 2 1 1 0 0 1]


Let us compute a simple performance metric, R<sup>2</sup>, to measure the goodness of fit of the model on the test data.

In [7]:
from sklearn.metrics import r2_score
print(f"The r-squared score of the model is {r2_score(y_test,y_predict)}")

The r-squared score of the model is 0.9134398043968128
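
For reference, R<sup>2</sup> is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below computes it by hand on a small made-up example and compares it with `r2_score`; the function name `r_squared` and the toy arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up labels and continuous predictions.
toy_true = np.array([0, 1, 2, 2, 1])
toy_pred = np.array([0.1, 0.9, 1.8, 2.2, 1.3])

manual_r2 = r_squared(toy_true, toy_pred)
sklearn_r2 = r2_score(toy_true, toy_pred)
print(manual_r2, sklearn_r2)
```

Note that R<sup>2</sup> is computed on the continuous predictions, before they are rounded to class labels.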


Let us check the accuracy of the classified predictions.

In [8]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_predict_classified)

0.9777777777777777
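
Accuracy is simply the fraction of test samples whose predicted label equals the true label. The sketch below recomputes it from the two arrays printed above (copied by hand into `predicted` and `actual`, names chosen here for illustration) and adds a confusion matrix to show where the miss occurs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Predicted labels, copied from the printed output of y_predict_classified.
predicted = np.array([
    0, 2, 1, 2, 1, 1, 1, 2, 1, 0, 2, 1, 2, 2, 0, 2, 1, 1, 1, 1, 0, 2, 0,
    1, 2, 0, 2, 2, 2, 2, 0, 0, 1, 1, 1, 0, 0, 0, 2, 2, 1, 1, 0, 0, 1,
])
# Actual labels, copied from the printed output of y_test.
actual = np.array([
    0, 2, 1, 2, 1, 1, 1, 2, 1, 0, 2, 1, 2, 2, 0, 2, 1, 1, 2, 1, 0, 2, 0,
    1, 2, 0, 2, 2, 2, 2, 0, 0, 1, 1, 1, 0, 0, 0, 2, 2, 1, 1, 0, 0, 1,
])

# Accuracy: share of positions where prediction and truth agree.
manual_accuracy = np.mean(predicted == actual)
print(manual_accuracy)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(actual, predicted)
print(cm)
```

Of the 45 test samples, 44 are classified correctly; the single miss is a class 2 sample predicted as class 1.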

So far we have fitted a linear model by focusing purely on goodness of fit, an approach that is biased towards more complex models. One way to tackle this is Ridge regression, which adds a penalty that grows with the size of the coefficients; the strength of the penalty is set by alpha. The resulting fit shrinks the coefficients and, as an added bonus, is better hedged against collinearity.

In [10]:
reg_ridge = linear_model.Ridge(alpha = 0.6)
reg_ridge.fit(X_train, y_train)
y_predict_ridge = reg_ridge.predict(X_test)
print(reg_ridge.coef_, reg_ridge.intercept_)
print(f"The r-squared score of the ridged model is {r2_score(y_test,y_predict_ridge)}")

[-0.14404322 -0.00495491  0.28992261  0.48949645] 0.17593596175937132
The r-squared score of the ridged model is 0.9121883197926157
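
To show what the penalty does, here is a minimal sketch of the ridge solution in closed form, assuming the usual objective of squared error plus alpha times the squared norm of the coefficients, with an unpenalised intercept. It repeats the split from the cells above so that it runs on its own; `coef_manual`, `intercept_manual` and `coef_gap` are names introduced here for illustration.

```python
import numpy as np
from sklearn import linear_model
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=22, train_size=0.7, test_size=0.3
)
alpha = 0.6

# Centre the data so that the intercept is left out of the penalty.
X_mean, y_mean = X_train.mean(axis=0), y_train.mean()
Xc, yc = X_train - X_mean, y_train - y_mean

# Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the ridge coefficients.
n_features = Xc.shape[1]
coef_manual = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
intercept_manual = y_mean - X_mean @ coef_manual

# Compare with scikit-learn's Ridge fitted on the same data.
reg_ridge = linear_model.Ridge(alpha=alpha).fit(X_train, y_train)
coef_gap = np.max(np.abs(coef_manual - reg_ridge.coef_))
print(coef_manual, intercept_manual)
print(coef_gap)
```

The `alpha * I` term is what distinguishes this from ordinary least squares: it keeps the system well conditioned when features are nearly collinear and pulls the coefficient vector towards zero. Individual coefficients may still grow, as the third one does in the output above, even though the overall norm of the vector shrinks.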


The accuracy of the ridge model is:

In [11]:
y_predict_ridge_classified = np.array(list(map(round,y_predict_ridge)))
accuracy_score(y_test,y_predict_ridge_classified)

0.9777777777777777

Both models reach the same accuracy. Let us check whether they also assign the same class to every test sample.

In [12]:
print(y_predict_classified==y_predict_ridge_classified)

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True]
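
For longer arrays, an element-wise printout like the one above is hard to scan. A more compact summary is sketched below on two small made-up label arrays (`labels_a` and `labels_b` are illustrative, not taken from the models above).

```python
import numpy as np

# Made-up label arrays that differ in exactly one position.
labels_a = np.array([0, 2, 1, 1, 2])
labels_b = np.array([0, 2, 1, 2, 2])

# True only if shapes match and every element is equal.
identical = np.array_equal(labels_a, labels_b)

# Count and locate the positions where the two arrays disagree.
n_disagreements = int(np.sum(labels_a != labels_b))
disagreement_idx = np.flatnonzero(labels_a != labels_b)
print(identical, n_disagreements, disagreement_idx)
```

Applied to the two classified prediction arrays from this notebook, `np.array_equal` would condense the 45 comparisons into a single boolean.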
