<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 4.1.2 Linear Regression

In Lab 4.1.1 we predicted house price from a predictor variable using first principles. Here we see how the same can be done using scikit-learn.

For comparison purposes, we will continue with the same predictor `sq__ft` in our dataset as last time.

## Prediction of House Price Using Linear Regression

### Data

The Sacramento real estate transactions file lists 985 real estate transactions in the Sacramento area over a five-day period, as reported by the Sacramento Bee.

In [86]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
%matplotlib inline

### 1. Read in the data

In [30]:
# Read CSV
house_csv = "/Users/tresornoel/Desktop/IOD/DATA/Sacramento_transactions.csv"
df = pd.read_csv(house_csv)
df.head()

| | street | city | zip | state | beds | baths | sq__ft | type | sale_date | price | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3526 HIGH ST | SACRAMENTO | 95838 | CA | 2 | 1 | 836 | Residential | Wed May 21 00:00:00 EDT 2008 | 59222 | 38.631913 | -121.434879 |
| 1 | 51 OMAHA CT | SACRAMENTO | 95823 | CA | 3 | 1 | 1167 | Residential | Wed May 21 00:00:00 EDT 2008 | 68212 | 38.478902 | -121.431028 |
| 2 | 2796 BRANCH ST | SACRAMENTO | 95815 | CA | 2 | 1 | 796 | Residential | Wed May 21 00:00:00 EDT 2008 | 68880 | 38.618305 | -121.443839 |
| 3 | 2805 JANETTE WAY | SACRAMENTO | 95815 | CA | 2 | 1 | 852 | Residential | Wed May 21 00:00:00 EDT 2008 | 69307 | 38.616835 | -121.439146 |
| 4 | 6001 MCMAHON DR | SACRAMENTO | 95824 | CA | 2 | 1 | 797 | Residential | Wed May 21 00:00:00 EDT 2008 | 81900 | 38.51947 | -121.435768 |


What are the summary statistics for `price`, `sq__ft`, and `beds`?

In [31]:
df[['price', 'sq__ft', 'beds']].describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| price | 985.0 | 234144.263959 | 138365.839085 | 1551.0 | 145000.0 | 213750.0 | 300000.0 | 884790.0 |
| sq__ft | 985.0 | 1314.916751 | 853.048243 | 0.0 | 952.0 | 1304.0 | 1718.0 | 5822.0 |
| beds | 985.0 | 2.911675 | 1.307932 | 0.0 | 2.0 | 3.0 | 4.0 | 8.0 |


### 2. Predict Price

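We are going to predict the target variable `price` from `sq__ft` using scikit-learn's `linear_model` module. Note that the answer cells below also include `beds` as a second predictor, so the fitted model has two coefficients.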

Read up on the following methods and attributes here: [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

- coef_
- intercept_
- fit()
- predict()
- score()
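
As a rough illustration of how these five names fit together, here is a minimal sketch on synthetic data. The data (points lying exactly on `y = 3 + 2x`) and the names `X_toy`, `y_toy`, and `toy_model` are assumptions made up for this example; they are not part of the lab dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data on the line y = 3 + 2*x (illustration only).
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])  # shape (n_samples, n_features)
y_toy = 3.0 + 2.0 * X_toy[:, 0]

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)                     # estimate parameters from the data

slope = toy_model.coef_[0]                      # one coefficient per feature
intercept = toy_model.intercept_                # the constant term
pred = toy_model.predict(np.array([[5.0]]))[0]  # prediction for a new point x = 5
r2 = toy_model.score(X_toy, y_toy)              # R^2 on the data passed in

print(slope, intercept, pred, r2)
```

Because the toy data has no noise, the fit should recover the slope and intercept almost exactly; real data such as the Sacramento prices will not behave this neatly.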

In [32]:
# import the LinearRegression class from the sklearn.linear_model module
from sklearn.linear_model import LinearRegression


#### 2.1 Create an instance of LinearRegression.

In [33]:
# ANSWER
model = LinearRegression()


#### 2.2 Fit predictor and target variables using linear regression

In [56]:
df.dropna(inplace = True)
X = df[['sq__ft','beds']].values
y = df['price'].values


In [57]:
# ANSWER
model.fit(X,y)

#### 2.3 Using attributes of the LinearRegression class, find the coefficients and intercept

In [58]:
# ANSWER
print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)

Coefficient: [   30.46522993 22590.89940988]
Intercept: 128307.66288262016
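
To make the meaning of these numbers concrete, the sketch below evaluates the fitted equation by hand: price = intercept + coefficient for `sq__ft` × square feet + coefficient for `beds` × bedrooms. The coefficient values are copied from the output above; the helper `predict_price` and the example house (1,000 sq ft, 3 beds) are hypothetical and only illustrate the arithmetic.

```python
# Values copied from the printed output above.
coef_sqft = 30.46522993
coef_beds = 22590.89940988
intercept = 128307.66288262016


def predict_price(sq_ft, beds):
    """Evaluate intercept + coef_sqft * sq_ft + coef_beds * beds."""
    return intercept + coef_sqft * sq_ft + coef_beds * beds


# Hypothetical house, chosen only to illustrate the calculation.
example_price = predict_price(sq_ft=1000, beds=3)
print(f"${example_price:,.2f}")

# Holding beds fixed, one extra square foot changes the prediction by coef_sqft.
per_sqft = predict_price(1001, 3) - predict_price(1000, 3)
```

This is the same calculation that `model.predict` performs for a model fitted on the full dataset, so each extra square foot adds roughly $30 to the predicted price when the number of beds is held fixed.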


#### 2.4 Find R^2 Score

Find $R^2$ using the `score` method of `LinearRegression`.

In [59]:
# ANSWER
score = model.score(X, y)
print("R^2 Score:", score)


R^2 Score: 0.13575119373153988
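
An $R^2$ of about 0.136 suggests the model explains only around 14% of the variance in price. For reference, the standard definition is $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below implements that definition in plain Python on made-up numbers; the function `r_squared` and the toy values are assumptions for illustration, not part of the lab.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [100.0, 200.0, 300.0, 400.0]  # made-up targets

perfect = r_squared(y_true, y_true)                            # predictions match exactly
baseline = r_squared(y_true, [250.0] * 4)                      # always predict the mean
partial = r_squared(y_true, [150.0, 150.0, 350.0, 350.0])      # somewhere in between

print(perfect, baseline, partial)
```

Always predicting the mean gives an $R^2$ of 0, so a score near 0.14 is only a modest improvement over that baseline.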


### 3. Splitting Data


Splitting the data into training and test sets is important in supervised learning.

- We ensure that the test set remains untouched during the model training process. This isolation prevents any information leakage about the test set into the training process.

- It allows us to evaluate the performance of our machine learning model on unseen data.


In [60]:
from sklearn.model_selection import train_test_split

#### 3.1 Create training and testing subsets

Hint: Use the `train_test_split` function.

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```

In [61]:
# ANSWER
# Create training and testing subsets
df.dropna(inplace = True)
X = df[['sq__ft','beds']].values
y = df['price'].values
assert X.shape[0] == y.shape[0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

#### 3.2 Check Shape, Sample of Test Train Data

In [62]:
# ANSWER
## Check training/test data
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(659, 2)
(326, 2)
(659,)
(326,)
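
These shapes follow from the 985 rows and `test_size=0.33`. As far as I understand scikit-learn's behaviour, a fractional `test_size` is converted to a row count by rounding up, and the remaining rows go to training. The sketch below reproduces that arithmetic; the variable names are chosen for this example only.

```python
import math

n_samples = 985    # rows in the dataset
test_size = 0.33   # fraction requested for the test set

# Assumption: the test count is the fraction of rows, rounded up.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The result agrees with the shapes printed above: 659 training rows and 326 test rows.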


#### 3.3 Using Linear Regression Find The Score

1. Fit model using X_train, y_train
2. Find score using X_test, y_test

In [63]:
# ANSWER
model.fit(X_train,y_train)

In [64]:
from sklearn.metrics import r2_score

In [65]:
# ANSWER
score = model.score(X_test, y_test)
print("R^2 Score:", score)

# Predict the target values for the test set
y_pred = model.predict(X_test)

# Recompute R^2 with r2_score as a cross-check; it should match model.score
score_manual = r2_score(y_test, y_pred)

print(f'Manually calculated R^2 score on the test set: {score_manual}')

R^2 Score: 0.14706805098582454
Manually calculated R^2 score on the test set: 0.14706805098582454


### 4. Predict House Price

Suppose we have the following information about a house:

- street:	1140 EDMONTON DR
- city:	SACRAMENTO
- zip:	95833
- state:	CA
- beds:	3
- baths:	2
- sq__ft:	1204
- type:	Residential

**Predict the price of this house using the linear regression model.**

In [81]:
# ANSWER
house_to_predict = {
    'street': '1140 EDMONTON DR',
    'city': 'SACRAMENTO',
    'zip': 95833,
    'state': 'CA',
    'beds': 3,
    'baths': 2,
    'sq__ft': 1204,
    'type': 'Residential'
}

# Build a one-row DataFrame with the features in the same order used for training
house_to_predict_df = pd.DataFrame([{
    'sq__ft': house_to_predict['sq__ft'],
    'beds': house_to_predict['beds']
}])

# Convert the DataFrame to a NumPy array, matching the format the model was fitted on
house_features = house_to_predict_df.values

# Make the prediction
prediction = model.predict(house_features)

# Print the prediction
print(f"the predicted price at 1140 EDMONTON DR, SACRAMENTO is $ {prediction[0]:,.2f}")

the predicted price at 1140 EDMONTON DR, SACRAMENTO is $ 227,649.97


#### Find the error

In [87]:
# Score the model and predict prices for X_test
score = model.score(X_test, y_test)
print(f'R^2 score on the test set: {score}')
y_pred = model.predict(X_test)

# Error metrics (mean_absolute_error and mean_squared_error are imported at the top)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f'Mean Absolute Error (MAE): ${mae:,.2f}')
print(f'Mean Squared Error (MSE): ${mse:,.2f}')
print(f'Root Mean Squared Error (RMSE): ${rmse:,.2f}')

R^2 score on the test set: 0.14706805098582454
Mean Absolute Error (MAE): $100,634.52
Mean Squared Error (MSE): $18,709,460,498.02
Root Mean Squared Error (RMSE): $136,782.53
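
To show what these three metrics measure, here is a small sketch that computes them from their definitions in plain Python. The toy prices and the helper names `mae_fn`, `mse_fn`, and `rmse_fn` are made up for illustration. Note that MSE is in squared dollars, so RMSE (its square root) is the one to compare directly with prices.

```python
import math


def mae_fn(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse_fn(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse_fn(y_true, y_pred):
    """Square root of the mean squared error."""
    return math.sqrt(mse_fn(y_true, y_pred))


# Made-up prices for illustration only.
y_true_toy = [200000.0, 300000.0, 250000.0]
y_pred_toy = [210000.0, 280000.0, 250000.0]

toy_mae = mae_fn(y_true_toy, y_pred_toy)
toy_mse = mse_fn(y_true_toy, y_pred_toy)
toy_rmse = rmse_fn(y_true_toy, y_pred_toy)

# Consistency check on the values printed above: RMSE is the square root of MSE.
reported_rmse = math.sqrt(18709460498.02)
print(toy_mae, toy_mse, toy_rmse, reported_rmse)
```

An RMSE of roughly $137,000 against a mean price of about $234,000 indicates that this two-feature model leaves large errors, in line with the low $R^2$.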


### Conclusion
We have seen that scikit-learn needs only minimal code to implement and evaluate a linear regression model.



---

© 2024 Institute of Data