# Linear Regression with sklearn

In this Jupyter notebook, we train a linear regressor using the sklearn library. It will serve as a benchmark for our own linear regression model.

### Importing libraries

First, we import the necessary libraries. For our ML model, we use LinearRegression from sklearn.

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

### Data handling

Next, we load our input data file, which has already been preprocessed, and keep only the feature columns and the target column (price).

In [2]:
# Load the data file

data = pd.read_csv('../output/data/real_estate_preprocessed.csv')
data = data[['district', 'size', 'floor', 'registration', 'rooms', 'parking', 'balcony', 'state', 'price']]

In [3]:
# Split data into feature vectors and outputs

X = np.array(data.iloc[:, 0:-1])
y = np.array(data.iloc[:, -1:])

In [4]:
# One-hot encode the categorical features: column 0 (district) and column 7 (state).
# All other columns are passed through unchanged.

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0, 7])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
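
To make the encoding step concrete, here is a minimal sketch on a made-up toy matrix (the values `north`/`south` and the names `X_toy`, `ct_toy` are illustrative assumptions, not taken from the real data set). It also passes `sparse_threshold=0`, which the notebook cell above does not, purely so that the result is a dense array that is easy to print.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy feature matrix (made-up values): column 0 is categorical, column 1 is numeric
X_toy = np.array(
    [
        ["north", 50.0],
        ["south", 72.5],
        ["north", 64.0],
    ],
    dtype=object,
)

ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # assumption for this demo: always return a dense array
)
X_toy_encoded = ct_toy.fit_transform(X_toy)

# The one-hot columns come first (one per category), followed by the passthrough columns
print(X_toy_encoded)
print(X_toy_encoded.shape)
```

Each categorical column is replaced by one indicator column per distinct category, so the encoded matrix is wider than the original one.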

In [5]:
# Split data into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X.tolist(), y, test_size=0.2, random_state=0)
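
As a small illustration of what `test_size=0.2` and `random_state=0` do, the sketch below splits a made-up array of 10 samples (the names `X_demo` and `y_demo` are hypothetical). It shows that 20% of the rows go to the test set, that features and targets stay aligned, and that a fixed seed makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features; row i is [2*i, 2*i + 1] and its target is i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Rows and targets are shuffled together, so they still correspond to each other
print(np.array_equal(X_te[:, 0], 2 * y_te))

# Repeating the call with the same seed reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te2))
```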

### Model training and evaluation

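We use LinearRegression from sklearn.linear_model as our baseline model. As evaluation metrics, we look at both the RMSE (Root Mean Squared Error), which is expressed in the same units as the target (price), and the R2 (R squared) score, which measures the fraction of the target's variance explained by the model.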
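
For reference, both metrics can be computed by hand. The sketch below uses four made-up target values (not the real estate data) and checks the manual formulas against sklearn: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R2: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The same quantities via sklearn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)

print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```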

In [6]:
# Fit sklearn's LinearRegression on the train set, then predict on the test set

regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
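
Since this model is meant as a benchmark for a hand-written linear regression, it can help to see that `LinearRegression` solves an ordinary least squares problem. The sketch below is run on synthetic data (the names `X_syn`, `y_syn`, `true_w` and the chosen coefficients are assumptions for illustration) and compares sklearn's fitted parameters with a direct least squares solve via `np.linalg.lstsq`. It is one possible way to cross-check a custom implementation, not necessarily the approach used elsewhere in this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a known linear relationship plus a little noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_syn = X_syn @ true_w + 4.0 + rng.normal(scale=0.1, size=50)

# Fit with sklearn
model = LinearRegression().fit(X_syn, y_syn)

# Solve the same least squares problem directly:
# append a column of ones so the last weight plays the role of the intercept
A = np.hstack([X_syn, np.ones((X_syn.shape[0], 1))])
w, *_ = np.linalg.lstsq(A, y_syn, rcond=None)

print(np.allclose(model.coef_, w[:3]))
print(np.allclose(model.intercept_, w[3]))
```

If a custom model is fitted with gradient descent instead, its parameters should approach these values as training converges, provided the design matrix has full column rank as it does here.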

In [7]:
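# Compute RMSE as the square root of the MSE
# (the `squared=False` argument is deprecated in recent sklearn releases)

np.sqrt(mean_squared_error(y_test, y_pred))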

52173.01733639259

In [8]:
# Compute R squared

r2_score(y_test, y_pred)

0.7328485950994026