In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


%matplotlib inline

---

# Part 2. Using sklearn pipeline

**Goals:**

* Implement an ML pipeline in sklearn
    * Check that there are no NAs
    * Train/test split
    * Preprocess columns
    * Train model
    * Evaluate on test
* Practice using a pipeline

In [2]:
data = pd.read_csv('https://github.com/mbburova/MDS/raw/main/house_prices_small.csv')
data.head()

Unnamed: 0,SalePrice,LotArea,OverallQual,SaleCondition,YearBuilt
0,208500,8450,7,Normal,2003
1,181500,9600,6,Normal,1976
2,223500,11250,7,Normal,2001
3,140000,9550,7,Abnorml,1915
4,250000,14260,8,Normal,2000


## 1. Prepare the data

### 1.1 Explore the dataset


In [None]:
data.head()

Unnamed: 0,SalePrice,LotArea,OverallQual,SaleCondition,YearBuilt
0,208500,8450,7,Normal,2003
1,181500,9600,6,Normal,1976
2,223500,11250,7,Normal,2001
3,140000,9550,7,Abnorml,1915
4,250000,14260,8,Normal,2000


In [None]:
# missing values?
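
One way to check for missing values is to count them per column with `isna().sum()`. The sketch below rebuilds only the five rows shown by `data.head()` so that it runs without downloading the file; the name `sample` is illustrative, and on the real dataset the same calls would be applied to `data`.

```python
import pandas as pd

# The five rows displayed by data.head(); a stand-in for the full dataset.
sample = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "OverallQual": [7, 6, 7, 7, 8],
    "SaleCondition": ["Normal", "Normal", "Normal", "Abnorml", "Normal"],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
})

# Number of missing values in each column.
na_counts = sample.isna().sum()

# True if at least one cell anywhere in the frame is missing.
has_missing = bool(sample.isna().any().any())

print(na_counts)
print("Any missing values:", has_missing)
```

If any count is non-zero, the affected rows would need to be dropped or imputed before the preprocessing steps that follow.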

### 1.2 Separate features from the target and perform train-test split

The function `train_test_split` randomly splits the dataset into two parts:
- training data, which we will use to fit the parameters of the model
- test data, which will be used to report the final performance of the model

E.g. if your dataset initially has 1000 observations and you set the argument `test_size=0.2`, it will return 800 random observations as the **train dataset** and the remaining 200 observations as the **test dataset**.
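
The 1000-observation example can be reproduced on synthetic data. This is a minimal sketch: the toy frame, the column names `feature` and `target`, and `random_state=42` are assumptions made for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic toy data with 1000 observations (illustrative only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "target": rng.normal(size=1000),
})

# Separate the features from the target.
X_toy = toy.drop(columns="target")
y_toy = toy["target"]

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
```

For the house price data, the same pattern would apply with `SalePrice` as the target and the remaining columns as features.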

In [None]:
from sklearn.model_selection import train_test_split


### 1.3 Encode categorical and ordinal features, scale numerical ones

In [None]:
X_train.head()

Unnamed: 0,LotArea,OverallQual,SaleCondition,YearBuilt
254,8400,5,Normal,1957
1066,7837,6,Normal,1993
638,8777,5,Normal,1910
799,7200,5,Normal,1937
380,5000,5,Normal,1924


How to preprocess the features:
* `LotArea`,  `YearBuilt`  - numerical features, scale
* `SaleCondition` - categorical feature, one-hot encoding
* `OverallQual` - ordinal feature, already stored as ordered integers, so no need to encode

So we need to apply different transformations to different columns. This can be done with `ColumnTransformer`:

```
ColumnTransformer([
    ('name1', transform1, column_names1),
    ('name2', transform2, column_names2)
])
```
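
A possible concrete version of this template for the three feature groups listed above is sketched here. It uses the five rows shown by `data.head()` as a stand-in for the training data. The step names (`num`, `cat`, `ord`), `handle_unknown="ignore"`, and `sparse_threshold=0` (which forces a dense array) are choices made for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Features from the five rows displayed by data.head() (stand-in for X_train).
features = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "OverallQual": [7, 6, 7, 7, 8],
    "SaleCondition": ["Normal", "Normal", "Normal", "Abnorml", "Normal"],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
})

preprocessor = ColumnTransformer(
    [
        # Numerical columns: standardize to zero mean and unit variance.
        ("num", StandardScaler(), ["LotArea", "YearBuilt"]),
        # Categorical column: one indicator column per category.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["SaleCondition"]),
        # Ordinal column: keep the integer values unchanged.
        ("ord", "passthrough", ["OverallQual"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

transformed = preprocessor.fit_transform(features)
print(transformed.shape)
```

The output columns follow the order of the transformer list: two scaled columns, then the one-hot columns for `SaleCondition`, then `OverallQual`.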

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer


## 2. Train the model

Now we are ready to train the model. We will use the `LinearRegression` model.
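
One way to chain preprocessing and the model is a `Pipeline`, sketched below under stated assumptions: the step names `preprocess` and `regressor` are illustrative, and the five rows from `data.head()` stand in for the training set. Five rows are far too few to fit a meaningful model, so this only demonstrates the mechanics.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The five rows displayed by data.head(); a stand-in for the training data.
sample = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "OverallQual": [7, 6, 7, 7, 8],
    "SaleCondition": ["Normal", "Normal", "Normal", "Abnorml", "Normal"],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
})
X_sample = sample.drop(columns="SalePrice")
y_sample = sample["SalePrice"]

# Column-wise preprocessing: scale, one-hot encode, or pass through.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["LotArea", "YearBuilt"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["SaleCondition"]),
    ("ord", "passthrough", ["OverallQual"]),
])

# The pipeline fits the preprocessor and then the regressor in one call.
model = Pipeline([
    ("preprocess", preprocessor),
    ("regressor", LinearRegression()),
])
model.fit(X_sample, y_sample)

# predict() applies the already fitted preprocessing before the regressor.
predictions = model.predict(X_sample)
print(predictions.shape)
```

Because the preprocessing lives inside the pipeline, calling `fit` on the training data and `predict` on the test data keeps the scaler and encoder from being refit on test rows.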

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression


## 3. Evaluate on the test set

$$
\text{Root Mean Squared Error} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}
$$
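
The formula can be computed directly with NumPy or by taking the square root of sklearn's `mean_squared_error`. The numbers below are made up for illustration; in the exercise, `y_true` would be the test targets and `y_pred` the model's predictions on the test features.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions, chosen only to keep the arithmetic simple.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Direct implementation of the formula: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via sklearn: take the square root of the MSE.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Here the residuals are 1, 0 and -2, so the mean squared error is 5/3 and both values equal its square root.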

In [None]:
# evaluate on test
