# Model training

In this notebook, we use the training data produced in 01_preprocessing.ipynb to train a simple linear regression model that predicts house prices. As in 01_preprocessing.ipynb, we use LineaPy to save the model, together with the code that creates it, as a Linea Artifact.

In [1]:
# NBVAL_IGNORE_OUTPUT

# Uncomment the following lines to install the necessary libraries for running this example.
# ! pip install lineapy
# ! pip install pandas
# ! pip install numpy
# ! pip install scikit-learn

import lineapy
import numpy as np
import pandas as pd

lineapy.options.set("is_demo", True) # Not for normal use

In [2]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Instead of managing CSV files ourselves, we can use LineaPy to pass values directly between notebooks. In 01_preprocessing.ipynb, we saved the training data to a Linea Artifact called `cleaned_data_housing_lineapy`. We can use the artifact name to retrieve the dataframe.

In [3]:
cleaned_data = lineapy.get("cleaned_data_housing_lineapy").get_value()

In [4]:
len(cleaned_data)

1998

In [5]:
cleaned_data = cleaned_data.dropna()

In [6]:
len(cleaned_data)

1998

In [7]:
train, val = train_test_split(cleaned_data, test_size=0.3, random_state=42)
X_train = train.drop(['SalePrice'], axis = 1)
y_train = train.loc[:, 'SalePrice']
X_val = val.drop(['SalePrice'], axis = 1)
y_val = val.loc[:, 'SalePrice']
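
As a sanity check on the split sizes: with 1,998 rows and `test_size=0.3`, scikit-learn rounds the validation set up to a whole number of rows and gives the remaining rows to the training set. The sketch below reproduces that arithmetic with the standard library only. `split_sizes` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_val) for a fractional test_size.

    Hypothetical helper that mirrors how train_test_split sizes the two
    sets when only a float test_size is given: the held-out set is
    rounded up and the training set takes the remaining rows.
    """
    n_val = math.ceil(test_size * n_samples)
    n_train = n_samples - n_val
    return n_train, n_val


n_train, n_val = split_sizes(1998, 0.3)
print(n_train, n_val)  # 1398 600
```

The 1,398 training rows agree with the `Length: 1398` reported for `y_train` in this notebook.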

In [8]:
# NBVAL_IGNORE_OUTPUT
X_train

Unnamed: 0,Gr_Liv_Area,Garage_Area,LA_v_1st,1st_v_2nd,wd_v_2nd,basement_value,Neighborhood=Blueste,Neighborhood=BrDale,Neighborhood=BrkSide,Neighborhood=ClearCr,...,Neighborhood=NoRidge,Neighborhood=NridgHt,Neighborhood=OldTown,Neighborhood=SWISU,Neighborhood=Sawyer,Neighborhood=SawyerW,Neighborhood=Somerst,Neighborhood=StoneBr,Neighborhood=Timber,Neighborhood=Veenker
557,2787,820,10.149829,0.721433,0.192712,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
843,1436,1488,8.641148,6.890110,0.000000,0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1651,2263,420,10.377682,1.061020,0.131148,0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1345,1559,812,5.195638,0.000000,0.000000,0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1160,1554,627,2.953668,0.000000,0.000000,0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1130,914,270,18.817287,0.000000,0.000000,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1294,922,308,10.596529,0.000000,0.000000,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
860,2082,484,8.389134,1.891667,0.388889,0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1459,1330,437,6.266165,0.000000,0.000000,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
# NBVAL_IGNORE_OUTPUT
y_train

557     260700
843     147000
1651    263800
1345    146500
1160    202300
         ...  
1130    134900
1294    109100
860     207500
1459    161140
1126    161900
Name: SalePrice, Length: 1398, dtype: int64

In [10]:
linear_model = LinearRegression(fit_intercept=True)

In [11]:
linear_model.fit(X_train, y_train)
y_fitted = linear_model.predict(X_train)
y_predicted = linear_model.predict(X_val)
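
For intuition about what `fit` does, `LinearRegression` performs an ordinary least squares fit. With a single feature, the least squares line has a simple closed form, sketched below in plain Python. The helper `fit_line` and the toy data are assumptions made for illustration; the notebook's model fits many features at once and does not use this code.

```python
def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept for one feature.

    Hypothetical helper for illustration only.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Setting `fit_intercept=True`, as in the cell above, corresponds to estimating the `intercept` term rather than forcing the line through the origin.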

In [12]:
X_val["Predicted Sales Price"] = y_predicted

In [13]:
# NBVAL_IGNORE_OUTPUT
from sklearn.metrics import mean_squared_error

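# Take the square root of the MSE ourselves: the `squared=False` argument
# was deprecated in scikit-learn 1.4 and later removed.
rmse = np.sqrt(mean_squared_error(y_val, y_predicted))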
rmse

41354.417675922945
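
The root mean squared error (RMSE) is the square root of the average squared difference between actual and predicted values, so it is expressed in the same units as `SalePrice`. A minimal sketch of the calculation in plain Python follows. The helper `rmse_from_lists` and the three prices are made up for illustration and are not taken from the housing data.

```python
import math


def rmse_from_lists(actual, predicted):
    """Root mean squared error of two equal-length sequences.

    Hypothetical helper for illustration only.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up sale prices and predictions.
actual = [200_000, 150_000, 300_000]
predicted = [210_000, 140_000, 330_000]
print(rmse_from_lists(actual, predicted))
```

Because the errors are squared before averaging, the single 30,000 miss in this toy example pulls the RMSE well above the 10,000 error of the other two predictions.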

With a validation RMSE of about 41,354 (in the same units as `SalePrice`), the results look good for a simple baseline. Let's save the model using LineaPy.

In [14]:
artifact = lineapy.save(linear_model, 'linea_model_housing')

In [15]:
# NBVAL_IGNORE_OUTPUT

from lineapy.utils.utils import prettify
print(prettify(artifact.get_code(use_lineapy_serialization=False)))

import pickle

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

cleaned_data = pickle.load(open("pre-3257800972296068657-post.pkl", "rb"))
cleaned_data = cleaned_data.dropna()
train, val = train_test_split(cleaned_data, test_size=0.3, random_state=42)
X_train = train.drop(["SalePrice"], axis=1)
y_train = train.loc[:, "SalePrice"]
linear_model = LinearRegression(fit_intercept=True)
linear_model.fit(X_train, y_train)

