# Build a sklearn Pipeline for an ML contest submission
In the ML_coruse_train notebook we first analyzed the housing dataset to gain statistical insights and then, for example, added new features,
replaced missing values and scaled the columns using pandas DataFrame methods.
In the following we will use sklearn [Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to integrate all these steps into one final *estimator*. The resulting pipeline can be saved to a file and used later in production.

*Optional:*
If you want, you can save your estimator as explained in the last cell at the bottom of this notebook.
Based on a hidden dataset, its performance will then be ranked against all other submissions.

In [2]:
# read housing data again
import pandas as pd
import numpy as np 
housing = pd.read_csv("datasets/housing/housing.csv")

# Show the column names and the first rows of the dataframe:
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


First, we again add some extra columns (e.g. `rooms_per_household`, `population_per_household`, `bedrooms_per_household`) which might correlate better with the target `median_house_value`.
To modify the dataset, we now use a [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html), which we can later put into a pipeline.

In [None]:
from sklearn.preprocessing import FunctionTransformer

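# Column positions of the attributes needed for the ratio features.
# They are looked up in housing.columns, so the array passed to
# add_extra_features() must keep these columns at the same positions.
rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]


def add_extra_features(X):
    """Append three per-household ratio columns to the NumPy array X."""
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = X[:, population_ix] / X[:, household_ix]
    bedrooms_per_household = X[:, bedrooms_ix] / X[:, household_ix]
    return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_household]

# validate=False: the input array is passed to the function without checks
attr_adder = FunctionTransformer(add_extra_features, validate=False)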
housing_extra_attribs = attr_adder.fit_transform(housing.values)

In [None]:
housing.values

To replace NaN values in the dataset with the mean or median of their column, you can also use a [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html):

In [None]:
from sklearn.impute import SimpleImputer 

# Drop the categorical column ocean_proximity
housing_num = housing.drop('ocean_proximity', axis=1)

print("We have %d nan elements in the numerical columns" %np.count_nonzero(np.isnan(housing_num.values)))

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
housing_num_cleaned = imp_mean.fit_transform(housing_num)

assert np.count_nonzero(np.isnan(housing_num_cleaned)) == 0
housing_num_cleaned[1,:]
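
The difference between the `mean` and `median` strategies is easiest to see on a tiny column with an outlier. The following is a minimal sketch on made-up numbers (the array `toy` is not part of the housing data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one outlier (100.0) and one missing value (made-up numbers)
toy = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)
median_filled = SimpleImputer(strategy="median").fit_transform(toy)

print("mean fill:  ", mean_filled[-1, 0])    # (1 + 2 + 3 + 100) / 4 = 26.5
print("median fill:", median_filled[-1, 0])  # median of 1, 2, 3, 100 = 2.5
```

The median is much less affected by the outlier, which is one reason it is often preferred for skewed columns.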

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled = scaler.fit_transform(housing_num_cleaned)
print("mean of the columns is: " ,  np.mean(scaled, axis=0))
print("standard deviation of the columns is: " ,  np.std(scaled, axis=0))
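
What the scaler computes can be checked by hand: each column is shifted by its mean and divided by its standard deviation. A small sketch on made-up numbers (the array `toy` is hypothetical) compares the manual z-score with the `StandardScaler` result:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns with different scales
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaled_toy = StandardScaler().fit_transform(toy)

# Manual z-score; np.std uses ddof=0 by default, like StandardScaler
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled_toy, manual))
```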

Now let's build a pipeline for preprocessing the **numerical** attributes.
The pipeline processes the data in the following steps:
* [Impute](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) median or mean values for elements which are NaN.
* Add attributes using the FunctionTransformer with the function `add_extra_features()`.
* Scale the numerical values using the [StandardScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

In [None]:
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),
        ('std_scaler', StandardScaler()),
    ])

# Now test the pipeline on housing_num
num_pipeline.fit_transform(housing_num)

Now we have a pipeline for the numerical columns.  
We need one more for the categorical column. Instead of the "Dummy encoding" we used before, we now use the [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) from sklearn.

In [None]:
housing['ocean_proximity'].values

In [None]:
from sklearn.preprocessing import OneHotEncoder
housing_cat = housing[['ocean_proximity']]
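# Return a dense array; the parameter is called sparse_output in scikit-learn >= 1.2
# (older versions use sparse=False instead)
cat_encoder = OneHotEncoder(sparse_output=False)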
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
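
To see which output column belongs to which category, and what happens with a category that was not present during fitting, here is a minimal sketch on a made-up column (the frames `toy_cat` and `unseen` are hypothetical). It assumes `handle_unknown="ignore"` is the desired behaviour; the default would raise an error instead.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(toy_cat).toarray()  # default output is a sparse matrix

print(enc.categories_)  # learned categories, in output-column order
print(encoded)

unseen = pd.DataFrame({"ocean_proximity": ["ISLAND"]})
print(enc.transform(unseen).toarray())  # all zeros
```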

We now have everything we need to build a preprocessing pipeline that applies all of the previous steps to the columns.
Since different transformations should be applied to different columns, we use the class [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = ["longitude","latitude","housing_median_age","total_rooms", "total_bedrooms",
               "population","households", "median_income"]
cat_attribs = ["ocean_proximity"]

full_prep_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])
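
How a `ColumnTransformer` routes columns can be tried out on a tiny frame first. The sketch below uses made-up data and a simplified numerical step (only a `StandardScaler`), so it is an illustration rather than the pipeline from this notebook. Columns that are not listed are dropped by default (`remainder="drop"`).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up frame: one numerical, one categorical and one unlisted column
toy = pd.DataFrame({
    "median_income": [2.0, 4.0, 6.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
    "median_house_value": [100000.0, 200000.0, 150000.0],
})

toy_prep = ColumnTransformer([
    ("num", StandardScaler(), ["median_income"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

toy_out = toy_prep.fit_transform(toy)
print(toy_out.shape)  # 1 scaled column + 2 one-hot columns; the unlisted column is dropped
```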

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

housing_features = housing.drop("median_house_value", axis = 1)
housing_labels = housing["median_house_value"]

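# A fixed random_state makes the split (and therefore the scores below) reproducible
X_train, X_test, y_train, y_test = train_test_split(
    housing_features, housing_labels, test_size=0.20, random_state=42)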


full_pipeline_with_predictor = Pipeline([
        ("preparation", full_prep_pipeline),
        ("forest", RandomForestRegressor())
    ])

full_pipeline_with_predictor.fit(X_train, y_train)

In [None]:
from sklearn.metrics import mean_squared_error

y_pred = full_pipeline_with_predictor.predict(X_test)
# sklearn metrics expect the arguments in the order (y_true, y_pred)
forest_mse = mean_squared_error(y_test, y_pred)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
from sklearn.metrics import r2_score

# r2_score is not symmetric: the true values must come first
r2_score(y_test, y_pred)
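
Unlike the mean squared error, the R² score depends on the order of its arguments, because it normalizes by the variance of the true values. A short sketch on made-up numbers (the arrays `y_true` and `y_hat` are hypothetical) shows the effect:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([2.0, 2.0, 3.0, 3.0])

r2_correct = r2_score(y_true, y_hat)  # documented order: (y_true, y_pred)
r2_swapped = r2_score(y_hat, y_true)  # swapped arguments

print(r2_correct)  # 1 - 2/5 = 0.6
print(r2_swapped)  # 1 - 2/1 = -1.0
```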

In [None]:
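import pickle
import getpass
from sklearn.utils.validation import check_is_fitted

your_regressor = full_pipeline_with_predictor  # Put your regression pipeline here
assert isinstance(your_regressor, Pipeline)
# Make sure the final estimator has been fitted before saving
check_is_fitted(your_regressor.steps[-1][1])

# Save the fitted pipeline to "<username>s_model.p"; the with block closes the file
with open(getpass.getuser() + "s_model.p", "wb") as f:
    pickle.dump(your_regressor, f)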
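
To use a saved estimator later, it can be loaded again with `pickle.load`. The following is a minimal round-trip sketch with a small stand-in pipeline on made-up data; the file name `toy_model.p` and the names `toy_pipeline` and `restored` are hypothetical and only serve the illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline fitted on made-up data (y = 2 * x)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 2.0, 4.0, 6.0])
toy_pipeline = Pipeline([("scaler", StandardScaler()), ("lin", LinearRegression())])
toy_pipeline.fit(X_toy, y_toy)

# Save to and load from a hypothetical file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.p")
    with open(model_path, "wb") as f:
        pickle.dump(toy_pipeline, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline reproduces the predictions of the original one
print(np.allclose(restored.predict(X_toy), toy_pipeline.predict(X_toy)))
```

Note that pickle stores functions by reference. A pipeline containing a `FunctionTransformer` around a custom function such as `add_extra_features` can only be loaded where that function (and the index variables it uses) is defined under the same name.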