# Linear Regression using Scikit-Learn

There is an open-source, commercially usable machine learning toolkit called [scikit-learn](https://scikit-learn.org/stable/index.html). This toolkit contains implementations of many of the algorithms that you will work with in this course.

## Goals

- Use scikit-learn to implement linear regression with a closed-form solution based on the normal equation

In [1]:
import numpy as np
np.set_printoptions(precision=2)
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd

## Linear Regression, closed-form solution
Scikit-learn has the [linear regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) which implements a closed-form linear regression.

Let's use the data from the earlier labs: a house with 1000 square feet sold for \\$300,000 and a house with 2000 square feet sold for \\$500,000.

| Size (1000 sqft)     | Price (1000s of dollars) |
| ----------------| ------------------------ |
| 1               | 300                      |
| 2               | 500                      |


### Load the data set

In [2]:
X_train = np.array([1.0, 2.0])   # Features
y_train = np.array([300, 500])   # Target Value

### Create and fit the model
The code below performs regression using scikit-learn.  
The first step creates a regression object.  
The second step calls one of the object's methods, `fit`, which performs the regression by fitting the parameters to the input data. `fit` expects `X` to be a two-dimensional array with one row per example and one column per feature.

In [3]:
linear_model = LinearRegression()
# X must be a 2-D Matrix
linear_model.fit(X_train.reshape(-1, 1), y_train)
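
As a small illustration of why the reshape is needed, the sketch below compares the shape of the 1-D training array with the 2-D array passed to `fit`. The name `X_2d` is introduced here only for illustration.

```python
import numpy as np

X_train = np.array([1.0, 2.0])   # 1-D array of two house sizes
X_2d = X_train.reshape(-1, 1)    # -1 lets NumPy infer the row count; 1 = one feature column

print(X_train.shape)  # (2,)
print(X_2d.shape)     # (2, 1)
```

Each row of `X_2d` is one training example and each column is one feature, which is the layout scikit-learn estimators expect.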

### View Parameters 
The $\mathbf{w}$ and $\mathbf{b}$ parameters are referred to as 'coefficients' and 'intercept' in scikit-learn, and are stored in the fitted model's `coef_` and `intercept_` attributes.

In [4]:
w = linear_model.coef_

b = linear_model.intercept_

print(f"w = {w}, b = {b:0.2f}")

# Size is measured in units of 1000 sqft, so a 1200 sqft house is x = 1.2
print(f"'manual' prediction: f_wb = wx+b : {1.2*w + b}")

w = [200.], b = 100.00
'manual' prediction: f_wb = wx+b : [340.]
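
The fitted values w = 200 and b = 100 can be cross-checked against the normal equation directly. The following is a minimal NumPy sketch, not scikit-learn's internal code path (the library delegates to a least-squares solver); the names `A`, `theta`, `w_ne`, and `b_ne` are introduced here for illustration.

```python
import numpy as np

X_train = np.array([1.0, 2.0])       # size in 1000 sqft
y_train = np.array([300.0, 500.0])   # price in 1000s of dollars

# Design matrix with a leading column of ones so the intercept is learned too
A = np.column_stack([np.ones_like(X_train), X_train])

# Normal equation: (A^T A) theta = A^T y, solved without an explicit inverse
theta = np.linalg.solve(A.T @ A, A.T @ y_train)
b_ne, w_ne = theta

print(f"w = {w_ne:0.2f}, b = {b_ne:0.2f}")
```

Using `np.linalg.solve` rather than computing the inverse of `A.T @ A` is a common choice because it is generally more stable numerically.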


### Make Predictions

Calling the `predict` method generates predictions.

In [5]:
y_pred = linear_model.predict(X_train.reshape(-1, 1))

print("Prediction on training set:", y_pred)

Prediction on training set: [300. 500.]


In [6]:
X_test = np.array([[1.2]])   # 1200 sqft, expressed in units of 1000 sqft
print(f"Prediction for 1200 sqft house: ${linear_model.predict(X_test)[0]*1000:0.2f}")

Prediction for 1200 sqft house: $340000.00


## Second Example
The second example is from an earlier lab with multiple features. The final parameter values and predictions are very close to the results from the un-normalized 'long-run' from that lab. That un-normalized run took hours to produce results, while this one is nearly instantaneous. **The closed-form solution works well on smaller data sets such as these, but can be computationally demanding on larger data sets.**

**The closed-form solution does not require normalization.**
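
One way to see this is to fit the same model on raw and on standardized features and compare the predictions. The sketch below uses hypothetical synthetic data (not `data/houses.txt`); the variable names and the generating coefficients are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical synthetic features on very different scales
X = np.column_stack([
    rng.uniform(800, 2000, size=50),   # size-like feature
    rng.integers(1, 5, size=50),       # bedroom-like feature
])
y = 0.2 * X[:, 0] - 30.0 * X[:, 1] + 100.0 + rng.normal(0, 5, size=50)

# Fit once on the raw features
raw_model = LinearRegression().fit(X, y)

# Fit again on z-score normalized features
X_scaled = StandardScaler().fit_transform(X)
scaled_model = LinearRegression().fit(X_scaled, y)

pred_raw = raw_model.predict(X)
pred_scaled = scaled_model.predict(X_scaled)

# The coefficients differ between the two fits, but the predictions should agree
print(np.allclose(pred_raw, pred_scaled))
```

The coefficients change because they are expressed in different units, yet both fits describe the same least-squares solution, so the predictions match up to floating-point tolerance.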

In [7]:
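# The file has no header row, so the column names are supplied explicitly.
# read_csv already returns a DataFrame, so no extra conversion is needed.
df = pd.read_csv("data/houses.txt", header=None,
                 names=['Size(sqft)', 'Bedrooms',
                        'floors', 'Age',
                        'Price(1000s dollars)'])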

df

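|     | Size(sqft) | Bedrooms | floors | Age  | Price(1000s dollars) |
| --- | ---------- | -------- | ------ | ---- | -------------------- |
| 0   | 952.0      | 2.0      | 1.0    | 65.0 | 271.5                |
| 1   | 1244.0     | 3.0      | 1.0    | 64.0 | 300.0                |
| 2   | 1947.0     | 3.0      | 2.0    | 17.0 | 509.8                |
| 3   | 1725.0     | 3.0      | 2.0    | 42.0 | 394.0                |
| 4   | 1959.0     | 3.0      | 2.0    | 15.0 | 540.0                |
| ... | ...        | ...      | ...    | ...  | ...                  |
| 95  | 1224.0     | 2.0      | 2.0    | 12.0 | 329.0                |
| 96  | 1432.0     | 2.0      | 1.0    | 43.0 | 388.0                |
| 97  | 1660.0     | 3.0      | 2.0    | 19.0 | 390.0                |
| 98  | 1212.0     | 3.0      | 1.0    | 20.0 | 356.0                |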


In [8]:
X_train = df.iloc[:,:4]
X_train

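|     | Size(sqft) | Bedrooms | floors | Age  |
| --- | ---------- | -------- | ------ | ---- |
| 0   | 952.0      | 2.0      | 1.0    | 65.0 |
| 1   | 1244.0     | 3.0      | 1.0    | 64.0 |
| 2   | 1947.0     | 3.0      | 2.0    | 17.0 |
| 3   | 1725.0     | 3.0      | 2.0    | 42.0 |
| 4   | 1959.0     | 3.0      | 2.0    | 15.0 |
| ... | ...        | ...      | ...    | ...  |
| 95  | 1224.0     | 2.0      | 2.0    | 12.0 |
| 96  | 1432.0     | 2.0      | 1.0    | 43.0 |
| 97  | 1660.0     | 3.0      | 2.0    | 19.0 |
| 98  | 1212.0     | 3.0      | 1.0    | 20.0 |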


In [9]:
y_train = df.iloc[:, 4]
y_train

0     271.5
1     300.0
2     509.8
3     394.0
4     540.0
      ...  
95    329.0
96    388.0
97    390.0
98    356.0
99    257.8
Name: Price(1000s dollars), Length: 100, dtype: float64

In [10]:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

In [11]:
b = linear_model.intercept_
w = linear_model.coef_
print(f"w = {w:}, b = {b:0.2f}")

w = [  0.27 -32.9  -67.29  -1.47], b = 221.50


In [12]:
# print(f"Prediction on training set:\n {linear_model.predict(X_train)[:4]}" )
# print(f"prediction using w,b:\n {(X_train @ w + b)[:4]}")
# print(f"Target values \n {y_train[:4]}")

# x_house = np.array([1200, 3,1, 40]).reshape(-1,4)
# x_house_predict = linear_model.predict(x_house)[0]
# print(f" predicted price of a house with 1200 sqft, 3 bedrooms, 1 floor, 40 years old = ${x_house_predict*1000:0.2f}")
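
The checks commented out above compare `predict` with a manual `X @ w + b` computation. A self-contained sketch of the same idea follows, using a few hypothetical rows (not taken from `data/houses.txt`) so it runs without the data file; `X_demo`, `y_demo`, and `cols` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical rows for illustration only; not the lab's data set
cols = ['Size(sqft)', 'Bedrooms', 'floors', 'Age']
X_demo = pd.DataFrame(
    [[1000, 2, 1, 30],
     [1500, 3, 2, 10],
     [1200, 3, 1, 40],
     [1800, 4, 2, 5],
     [900, 2, 1, 50],
     [1600, 3, 2, 20]],
    columns=cols,
)
y_demo = pd.Series([300.0, 450.0, 320.0, 520.0, 250.0, 460.0])

model = LinearRegression().fit(X_demo, y_demo)
w, b = model.coef_, model.intercept_

# predict() and the explicit linear formula should give the same values
pred_api = model.predict(X_demo)
pred_manual = X_demo.to_numpy() @ w + b
print(np.allclose(pred_api, pred_manual))

# A single new house, wrapped in a DataFrame with the same column names
x_house = pd.DataFrame([[1200, 3, 1, 40]], columns=cols)
x_house_predict = model.predict(x_house)[0]
print(f"predicted price: ${x_house_predict * 1000:0.2f}")
```

Passing the new house as a DataFrame with matching column names keeps the input consistent with the training data, which was also a DataFrame.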

## What we learned

- Used an open-source machine learning toolkit, scikit-learn
- Implemented linear regression using a closed-form solution from that toolkit

# REFERENCE

[1] https://github.com/Thomson-Cui/2022-Machine-Learning-Specialization/blob/main/Supervised%20Machine%20Learning%20Regression%20and%20Classification/week2/3.Gradient%20descent%20in%20practice/C1_W2_Lab06_Sklearn_Normal_Soln.ipynb

[2] https://www.bilibili.com/video/BV1Pa411X76s/?p=30&spm_id_from=333.880.my_history.page.click&vd_source=8c32dd2bfbfecb1eaa9b0b9c4fb4d83e

[3] https://www.coursera.org/specializations/machine-learning-introduction