# Model Training

### scikit-learn

https://scikit-learn.org/stable/

scikit-learn (sklearn) provides simple and efficient tools for predictive data analysis. It is built on NumPy, SciPy, and matplotlib.

First, import the libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_columns', 50)

In [2]:
# next load the data
df = pd.read_csv('C:\\Users\\Vasan\\#Data science python\\DS LEVEL 2\\DSFINAL PROJECT\\#1RMSwk10\\Dataset\\final.csv')
df.head()

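| | price | year_sold | property_tax | insurance | beds | baths | sqft | year_built | lot_size | basement | popular | recession | property_age | property_type_Condo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 295850 | 2013 | 234 | 81 | 1 | 1 | 584 | 2013 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 216500 | 2006 | 169 | 51 | 1 | 1 | 612 | 1965 | 0 | 1 | 0 | 0 | 41 | 1 |
| 2 | 279900 | 2012 | 216 | 74 | 1 | 1 | 615 | 1963 | 0 | 0 | 0 | 1 | 49 | 1 |
| 3 | 379900 | 2005 | 265 | 92 | 1 | 1 | 618 | 2000 | 33541 | 0 | 0 | 0 | 5 | 1 |
| 4 | 340000 | 2002 | 88 | 30 | 1 | 1 | 634 | 1992 | 0 | 0 | 0 | 0 | 10 | 1 |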


In [None]:
df.shape

## Linear Regression Model

In [3]:
# import linear regression model
from sklearn.linear_model import LinearRegression

In [4]:
# separate input features in x
x = df.drop('price', axis=1)

# store the target variable in y
y = df['price']

**Train Test Split**
* Training sets are used to fit and tune your models.
* Test sets are put aside as "unseen" data to evaluate your models.
* The `train_test_split()` function splits data into randomized subsets.
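
As a small illustration of the points above, the sketch below splits a toy dataset (the hypothetical `X_demo` and `y_demo`, not the housing data) and checks the resulting sizes. With `test_size=0.2`, 20% of the rows are held out, and fixing `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Hold out 20% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Re-running with the same random_state gives the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234
)
print(np.array_equal(X_te, X_te2))  # True
```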

In [5]:
# import module
from sklearn.model_selection import train_test_split

# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=1234)

In [6]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((1505, 13), (1505,), (377, 13), (377,))

In [7]:
# train your model
lrmodel = LinearRegression().fit(x_train,y_train)

# make predictions on train set
train_pred = lrmodel.predict(x_train)

In [8]:
# evaluate your model
# we need mean absolute error
from sklearn.metrics import mean_absolute_error

train_mae = mean_absolute_error(y_train, train_pred)
print('Train error is', train_mae)

Train error is 88675.43111087685
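
For reference, mean absolute error is the average of the absolute differences between actual and predicted values, so it is expressed in the same units as `price`. A minimal sketch with made-up numbers (not taken from the dataset) that compares a manual calculation with `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices for illustration only
y_true_demo = np.array([200000, 300000, 400000])
y_pred_demo = np.array([210000, 280000, 430000])

# MAE = mean(|actual - predicted|)
manual_mae = np.mean(np.abs(y_true_demo - y_pred_demo))
sk_mae = mean_absolute_error(y_true_demo, y_pred_demo)

print(manual_mae, sk_mae)  # 20000.0 20000.0
```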


In [9]:
lrmodel.coef_

array([ 7.44010547e+03, -4.41550418e+02,  2.22905873e+03,  6.28887604e+03,
        8.83064328e+03,  2.45349098e+01,  4.03674033e+03,  2.25738351e-01,
       -6.59411590e+03, -8.65731825e+03, -5.15347356e+04,  3.40336513e+03,
       -2.53395649e+04])

In [10]:
lrmodel.intercept_

-22774236.00680962
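
`coef_` holds one weight per input feature (in column order) and `intercept_` is the constant term, so a prediction is the intercept plus the weighted sum of the feature values. The sketch below checks this identity on synthetic data; the names `X_demo` and `demo_model` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): y = 3*x0 - 2*x1 + 5 plus small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5 + rng.normal(scale=0.01, size=100)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Rebuild the predictions by hand: intercept + X . coef
manual_pred = demo_model.intercept_ + X_demo @ demo_model.coef_

print(np.allclose(manual_pred, demo_model.predict(X_demo)))  # True
```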

In [12]:
# make predictions on test set
ypred = lrmodel.predict(x_test)

# evaluate the model
test_mae = mean_absolute_error(y_test, ypred)
print('Test error is', test_mae)

Test error is 89564.66781890021


### Our model is still not good enough because we need a Mean Absolute Error below $70,000

Note: we have not scaled the features or tuned the model.
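
One way to add scaling, sketched here on synthetic data, is to chain `StandardScaler` and the model with `make_pipeline` so that the scaler is fitted only on the data passed to `fit`. For plain `LinearRegression`, standardizing the features changes the coefficients but not the predictions, so scaling by itself would not change the MAE; it matters more for regularized or distance-based models. The data and variable names below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption for illustration)
rng = np.random.default_rng(1)
X_demo = np.column_stack([rng.normal(2000, 10, 200), rng.normal(3, 1, 200)])
y_demo = 50 * X_demo[:, 0] + 10000 * X_demo[:, 1] + rng.normal(0, 100, 200)

plain = LinearRegression().fit(X_demo, y_demo)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)

# Compare predictions and coefficients of the two fits
same_predictions = np.allclose(plain.predict(X_demo), scaled.predict(X_demo))
print(same_predictions)
print(plain.coef_, scaled[-1].coef_)
```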

## Decision Tree Model

In [None]:
# import decision tree model
from sklearn.tree import DecisionTreeRegressor

In [None]:
# create an instance of the class
dt = DecisionTreeRegressor(max_depth=3, max_features=10, random_state=567)

In [None]:
# train the model
dtmodel = dt.fit(x_train,y_train)

In [None]:
# make predictions using the test set
ytest_pred = dtmodel.predict(x_test)

In [None]:
# evaluate the model
test_mae = mean_absolute_error(y_test, ytest_pred)
test_mae

## How do I know if my model is overfitting or generalizing?

In [None]:
# make predictions on train set
ytrain_pred = dtmodel.predict(x_train)

In [None]:
# import mean absolute error metric
from sklearn.metrics import mean_absolute_error

# evaluate the model
train_mae = mean_absolute_error(y_train, ytrain_pred)
train_mae
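
A common check is to compare the error on the training set with the error on the test set: a much lower training error suggests overfitting, while similar values suggest the model generalizes. The sketch below uses synthetic data and a hypothetical helper `mae_gap` to contrast a tree with no depth limit against a shallow one.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (assumption for illustration)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(400, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(0, 5, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

def mae_gap(model):
    """Hypothetical helper: return (train MAE, test MAE) for a fitted model."""
    return (mean_absolute_error(y_tr, model.predict(X_tr)),
            mean_absolute_error(y_te, model.predict(X_te)))

# A tree with no depth limit can memorize the training rows
deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# A shallow tree is forced to learn coarser rules
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)

print('deep    train/test MAE:', mae_gap(deep))
print('shallow train/test MAE:', mae_gap(shallow))
```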

## Plot the tree

In [None]:
# get the features
dtmodel.feature_names_in_

In [None]:
# plot the tree
from sklearn import tree

# Plot the tree with feature names
# (passed as a plain list, which every scikit-learn version accepts)
tree.plot_tree(dtmodel, feature_names=list(dtmodel.feature_names_in_))

#tree.plot_tree(dtmodel)
#plt.show()

# Save the plot to a file
plt.savefig('tree.png', dpi=300)
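
When the plotted tree is hard to read, the same splits can be printed as text with `sklearn.tree.export_text`. A minimal sketch on synthetic data: the feature names `sqft` and `beds` are borrowed from the dataset's columns, but the values are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up data (assumption): price depends mainly on sqft
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(500, 3000, 200), rng.integers(1, 6, 200)])
y_demo = 150 * X_demo[:, 0] + rng.normal(0, 10000, 200)

demo_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

# One line per split or leaf, indented by depth
rules = export_text(demo_tree, feature_names=['sqft', 'beds'])
print(rules)
```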

## Random Forest Model

In [None]:
# import random forest model
from sklearn.ensemble import RandomForestRegressor

In [None]:
# create an instance of the model
rf = RandomForestRegressor(n_estimators=200, criterion='absolute_error')

In [None]:
# train the model
rfmodel = rf.fit(x_train,y_train)

In [None]:
# make prediction on train set
ytrain_pred = rfmodel.predict(x_train)

In [None]:
# make predictions on the x_test values
ytest_pred = rfmodel.predict(x_test)

In [None]:
# evaluate the model
test_mae = mean_absolute_error(y_test, ytest_pred)
test_mae

In [None]:
# Individual Decision Trees
# tree.plot_tree(rfmodel.estimators_[2], feature_names=dtmodel.feature_names_in_)
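
A random forest regressor predicts by averaging the predictions of its individual trees, which are available in `estimators_`. The sketch below verifies this on synthetic data with a small forest; the data and names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 4 * X_demo[:, 0] + rng.normal(0, 1, size=200)

demo_rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Average the per-tree predictions by hand
per_tree = np.stack([t.predict(X_demo) for t in demo_rf.estimators_])
manual_avg = per_tree.mean(axis=0)

print(len(demo_rf.estimators_))  # 10
print(np.allclose(manual_avg, demo_rf.predict(X_demo)))  # True
```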

## Pickle

* The pickle module implements a binary protocol for serializing and de-serializing a Python object structure.

* Saving an object to a byte stream is called serialization, and loading it back is called de-serialization.

The **pickle** module provides the following functions:
* **`pickle.dump`** serializes an object hierarchy and writes it to an open binary file.
* **`pickle.load`** deserializes a data stream read from an open binary file. (`pickle.loads()` does the same for an in-memory `bytes` object.)
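
A minimal round trip using only the standard library: `dump()` and `load()` work with an open binary file, while `dumps()` and `loads()` work with an in-memory `bytes` object. The dictionary `params` and the file name `demo.pkl` are placeholders.

```python
import os
import pickle
import tempfile

# Placeholder object to serialize
params = {'n_estimators': 200, 'criterion': 'absolute_error'}

# File-based: dump() writes to a binary file, load() reads it back
path = os.path.join(tempfile.mkdtemp(), 'demo.pkl')
with open(path, 'wb') as f:
    pickle.dump(params, f)
with open(path, 'rb') as f:
    from_file = pickle.load(f)

# In-memory: dumps() returns bytes, loads() restores the object
from_bytes = pickle.loads(pickle.dumps(params))

print(from_file == params, from_bytes == params)  # True True
```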

In [None]:
# import pickle to save model
import pickle

# Save the trained model on the drive (the with block closes the file)
with open('RE_Model', 'wb') as f:
    pickle.dump(rfmodel, f)

In [None]:
# Load the pickled model
with open('RE_Model', 'rb') as f:
    RE_Model = pickle.load(f)

In [None]:
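# Use the loaded pickled model to make predictions
# (wrap the row in a DataFrame so the feature names match those seen during training)
sample = pd.DataFrame([[2012, 216, 74, 1, 1, 618, 2000, 600, 1, 0, 0, 6, 0]], columns=x_train.columns)
RE_Model.predict(sample)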