<a href="https://www.kaggle.com/code/lxlz1986/linear-regression-model-in-scikit-learn?scriptVersionId=143115090" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## 1. Linear Regression Foundation

Among machine learning models, **Linear Regression** is one of the simplest supervised models. That simplicity makes it a good starting point for beginners who want to see how a machine learning model actually works and what training a model is meant to achieve.

A Linear Regression model assumes that the relationship between the **output** (or **prediction**) $\hat{y}$ and the **input** (or **features**) $x$ is linear, which can be expressed as:

$$\hat{y} = \theta_0 + \theta_1x_1 + \theta_2x_2 + ...+ \theta_nx_n$$

where $x_i$ ($i = 1, 2, \ldots, n$) is the $i$th feature value and $\theta_i$ is the $i$th model parameter. $\theta_0$ is usually called the bias term (or intercept), and $\theta_1$ to $\theta_n$ are the weights of features $x_1$ to $x_n$.
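
As a small illustration of the formula above, the sketch below computes one prediction by hand with NumPy. The parameter and feature values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical parameters for a model with n = 3 features (illustrative values only)
theta_0 = 2.0                          # bias term
weights = np.array([1.5, -0.5, 3.0])   # theta_1 .. theta_n

# Feature values x_1 .. x_n of a single sample
x = np.array([1.0, 2.0, 0.5])

# y_hat = theta_0 + theta_1*x_1 + ... + theta_n*x_n
y_hat = theta_0 + np.dot(weights, x)
print(y_hat)  # 4.0
```

The dot product sums the weighted features (1.5 - 1.0 + 1.5 = 2.0), and the bias term is added on top.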

In practice, when we use a Linear Regression model to make predictions, the goal of training is to find the bias term and feature weights $\theta_i$ that make the model fit the training dataset best. A suitable metric, such as **Mean Absolute Error (MAE)**, **Mean Squared Error (MSE)**, or **Root Mean Squared Error (RMSE)**, measures how well the model fits the training dataset.

In a dataset with $m$ samples, the prediction $\hat{y}^j$ ($j = 1, 2, \ldots, m$) for the $j$th training sample is:

$$\hat{y}^j = \theta_0 + \theta_1{x_1}^j + \theta_2{x_2}^j + ...+ \theta_n{x_n}^j$$

The **RMSE** cost function of a Linear Regression model is then:

$$\sqrt{\frac{1}{m}\sum_{j=1}^m(\boldsymbol{\theta}^\intercal\mathbf{x}^j-y^j)^2}$$

where $\boldsymbol{\theta}$ is the model’s parameter vector, $\mathbf{x}^j$ is the feature vector of the $j$th sample (with ${x_0}^j = 1$ prepended so that $\boldsymbol{\theta}^\intercal\mathbf{x}^j$ includes the bias term $\theta_0$), and $y^j$ is the actual target value of that sample. The goal of training the Linear Regression model is to find the $\boldsymbol{\theta}$ that minimizes the **RMSE** cost function on the training dataset.
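
The RMSE formula can be sketched in a few lines of NumPy. The helper name `rmse` and the three-sample dataset below are hypothetical, made up for illustration. The function prepends a constant feature $x_0 = 1$ to every sample so that `theta[0]` acts as the bias term.

```python
import numpy as np

def rmse(theta, X, y):
    """RMSE of a linear model; theta[0] is the bias, theta[1:] are the weights."""
    m = X.shape[0]
    X_b = np.hstack([np.ones((m, 1)), X])   # prepend x_0 = 1 to each sample
    residuals = X_b @ theta - y              # theta^T x^j - y^j for every j
    return np.sqrt(np.mean(residuals ** 2))

# Tiny hand-made dataset (hypothetical values) that follows y = 1 + 2x exactly
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])

print(rmse(np.array([1.0, 2.0]), X, y))  # perfect fit: 0.0
print(rmse(np.array([0.0, 2.0]), X, y))  # every prediction is off by 1: 1.0
```

A lower value means a better fit, so training amounts to searching for the `theta` that makes this number as small as possible.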

## 2. Train a Linear Regression model in Scikit-Learn

In Scikit-Learn, the `fit()` method of a `LinearRegression` object uses the training dataset to find the best $\boldsymbol{\theta}$.
The example below shows how to train a Linear Regression model in Scikit-Learn.

### Import libraries

In [None]:
import numpy as np # import linear algebra library
import matplotlib.pyplot as plt # import matplotlib library
from sklearn.linear_model import LinearRegression # import Linear Regression model from sklearn

### Generate a Dataset
We will generate a dataset of 100 samples in which $\mathbf{x}$ (the feature) and $y$ (the target) have a linear-like relationship.

In [None]:
# generate 100 random numbers uniformly distributed in [0, 1.5)
X = 1.5 * np.random.rand(100, 1)

# generate y so that X and y have a linear-like relationship (uniform noise in [0, 1) is added)
y = 2 + 1.5 * X + np.random.rand(100, 1)
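
Because `np.random.rand` draws new values on every run, the dataset (and therefore the fitted parameters) changes slightly each time. A minimal sketch of a reproducible variant is shown below. It assumes NumPy's `default_rng` generator and an arbitrary seed of 42, neither of which is part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)      # seed value 42 is arbitrary

X = 1.5 * rng.random((100, 1))       # features, uniform in [0, 1.5)
noise = rng.random((100, 1))         # uniform noise in [0, 1), mean 0.5
y = 2 + 1.5 * X + noise              # same linear-like relationship as above

print(X.shape, y.shape)  # (100, 1) (100, 1)
```

Since the noise is uniform on [0, 1) and has mean 0.5 rather than 0, a model fitted to this data should end up with an intercept near 2.5, not 2.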

### Plot dataset points

In [None]:
# plot dataset(X,y)
plt.plot(X,y,'b.')
plt.axis([0,2,0,8]) # specify axis range
plt.xlabel('X') # x-axis label
plt.ylabel('y') # y-axis label
plt.title('Linear-like dataset') # set figure title
plt.grid(True)

### Train a Linear Regression model using generated dataset

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X,y) # train the linear regression model
lin_reg.intercept_, lin_reg.coef_ # the intercept and coefficient of linear regression model
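
`LinearRegression` fits the model by ordinary least squares, and minimizing the sum of squared errors also minimizes the RMSE. As a sanity check, the sketch below solves the same least-squares problem directly with `np.linalg.lstsq` and compares the result with the fitted `intercept_` and `coef_`. The data is regenerated with an arbitrary seed so the block runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)                 # arbitrary seed, for reproducibility
X = 1.5 * rng.random((100, 1))
y = 2 + 1.5 * X + rng.random((100, 1))

lin_reg = LinearRegression().fit(X, y)

# Solve the same least-squares problem directly; prepend x_0 = 1 for the bias term
X_b = np.hstack([np.ones((100, 1)), X])
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# theta[0] is the bias and theta[1] is the weight of the single feature
print("lstsq:  ", theta.ravel())
print("sklearn:", lin_reg.intercept_, lin_reg.coef_.ravel())
```

The two sets of parameters are expected to agree up to floating-point error, which shows that `fit()` is simply finding the least-squares $\boldsymbol{\theta}$ for the training data.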

### Make predictions

In [None]:
X_new = np.array([[0],[2]]) # new X data: 0 and 2
y_predict = lin_reg.predict(X_new) # use the trained model to make predictions
fig,ax = plt.subplots()
ax.plot(X,y,'b.', label = 'Data samples')
ax.plot(X_new, y_predict,'g-',label = 'Predictions')
ax.axis([0,2,0,8])
ax.set_xlabel('X')
ax.set_ylabel('y')
ax.legend()
ax.grid(True)
plt.show()
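
To tie the plot back to the RMSE cost function from Section 1, the fit can also be scored numerically. The sketch below is one way to do that, assuming `mean_squared_error` from `sklearn.metrics` and a regenerated dataset with an arbitrary seed so that the block is self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)                 # arbitrary seed, for reproducibility
X = 1.5 * rng.random((100, 1))
y = 2 + 1.5 * X + rng.random((100, 1))

lin_reg = LinearRegression().fit(X, y)
y_fit = lin_reg.predict(X)                     # predictions on the training data

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y, y_fit))
print(f"Training RMSE: {rmse:.3f}")
```

The noise added to `y` is uniform on [0, 1), whose standard deviation is about 0.29, so a training RMSE in that neighborhood suggests the model has captured the linear trend and only the noise remains.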
