# Build a sklearn Pipeline for an ML contest submission
In the ML_coruse_train notebook we first analyzed the housing dataset to gain statistical insights and then, for example, added new features, replaced missing values, and scaled the columns using pandas DataFrame methods.
In the following we will use sklearn [Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to integrate all these steps into one final *estimator*. The resulting pipeline can be saved to a file and used later in production.

*Optional:*
If you want, you can save your estimator as explained in the last cell at the bottom of this notebook.
Its performance will then be ranked against all other submissions on a hidden dataset.

In [None]:
# read housing data again
import pandas as pd
import numpy as np 
housing = pd.read_csv("datasets/housing/housing.csv")

# Show the first rows of the dataframe, including the column headers:
housing.head()

One remark: by default, sklearn transformers do **not** return pandas dataframes. They convert their input to numpy arrays and return numpy arrays.  
Now try to [convert](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html) a dataframe to a numpy array:

In [None]:
housing.head().to_numpy()

In [None]:
housing.head()

As you can see, the column names are lost now.
In a numpy array, columns are indexed by integers and no longer by their names.
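
As a small illustration of this point, the sketch below uses a made-up two-column DataFrame (the column names `a` and `b` are placeholders and not part of the housing data). It assumes scikit-learn's default output settings, under which a transformer hands back a plain numpy array, so a column has to be addressed by its integer position.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy DataFrame; the column names are purely illustrative
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# With default settings, the transformer returns a numpy array
toy_scaled = StandardScaler().fit_transform(toy)
print(type(toy_scaled))

# The names are gone, so look up the integer position on the DataFrame ...
b_ix = toy.columns.get_loc("b")
# ... and use it to select the column in the array
print(b_ix, toy_scaled[:, b_ix])
```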

### Add extra feature columns
First, we again add some extra columns (e.g. `rooms_per_household, population_per_household, bedrooms_per_household`) which might correlate better with the predicted parameter `median_house_value`.
For modifying the dataset, we now use a [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html), which we can later put into a pipeline.  
Hints:  
* For finding the index number of a given column name, you can use the method [get_loc()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.get_loc.html)
* For concatenating the new columns with the given array, you can use numpy method [c_](https://docs.scipy.org/doc/numpy/reference/generated/numpy.c_.html)
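
As a warm-up for these hints, here is a minimal sketch on a toy array. The helper `add_ratio` and the variable names are invented for illustration. It shows how `np.c_` appends a 1-D result as a new column and how a `FunctionTransformer` wraps a plain function so that it offers `fit_transform`.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def add_ratio(X):
    # Divide column 0 by column 1 and append the result as a new last column
    ratio = X[:, 0] / X[:, 1]
    return np.c_[X, ratio]

toy_X = np.array([[6.0, 2.0],
                  [9.0, 3.0]])

# Wrap the function so it can be used like any other sklearn transformer
ratio_adder = FunctionTransformer(add_ratio, validate=False)
toy_out = ratio_adder.fit_transform(toy_X)
print(toy_out.shape)
print(toy_out)
```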

In [None]:
from sklearn.preprocessing import FunctionTransformer

# First, get the integer indexes from the column names.
# Note: these positions are only valid for arrays that keep the column order of `housing`.
rooms_ix = housing.columns.get_loc("total_rooms")
bedrooms_ix = housing.columns.get_loc("total_bedrooms")
population_ix = housing.columns.get_loc("population")
household_ix = housing.columns.get_loc("households")

# Now implement a function which takes a numpy array as argument and adds the new feature columns
def add_extra_features(X):
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = X[:, population_ix] / X[:, household_ix]
    bedrooms_per_household = X[:, bedrooms_ix] / X[:, household_ix]
    
    # Concatenate the original array X with the new columns
    return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_household]

attr_adder = FunctionTransformer(add_extra_features, validate = False)
housing_extra_attribs = attr_adder.fit_transform(housing.values)

assert housing_extra_attribs.shape == (16512, 13)
housing_extra_attribs 

### Imputing missing elements
For replacing NaN values in the dataset with the mean or median of their column, you can use a [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html):  

In [None]:
from sklearn.impute import SimpleImputer 

# Drop the categorical column ocean_proximity
housing_num = housing.drop('ocean_proximity', axis=1)

print("We have %d nan elements in the numerical columns" %np.count_nonzero(np.isnan(housing_num.values)))

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
housing_num_cleaned = imp_mean.fit_transform(housing_num)

assert np.count_nonzero(np.isnan(housing_num_cleaned)) == 0
housing_num_cleaned[1,:]

### Column scaling
For scaling the columns, you can use a [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).
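
With its default settings, the scaler subtracts the column mean and divides by the column standard deviation (the population standard deviation, which is also numpy's default). The sketch below checks this on a made-up toy array by comparing a manual computation with the scaler's output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_X = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 600.0]])

# Manual standardization: z = (x - mean) / std, computed per column
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

sk_scaled = StandardScaler().fit_transform(toy_X)

# Both ways should agree up to floating point precision
print(np.allclose(manual, sk_scaled))
```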

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled = scaler.fit_transform(housing_num_cleaned)
print("mean of the columns is: " ,  np.mean(scaled, axis=0))
print("standard deviation of the columns is: " ,  np.std(scaled, axis=0))

### Putting all preprocessing steps together  
Now let's build a pipeline for preprocessing the **numerical** attributes.
The pipeline shall process the data in the following steps:
* [Impute](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) median or mean values for elements which are NaN
* Add attributes using the FunctionTransformer with the function `add_extra_features()`.
* Scale the numerical values using the [StandardScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
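
To see what a pipeline does with such a list of steps, the following sketch uses a toy array with one missing value and only two of the steps (the feature-adding step is left out to keep it short). It compares the pipeline output with the result of applying the same transformers by hand, one after the other. All names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_X = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, 60.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
piped = toy_pipeline.fit_transform(toy_X)

# The same two steps applied by hand, in the same order
imputed = SimpleImputer(strategy="median").fit_transform(toy_X)
by_hand = StandardScaler().fit_transform(imputed)

print(np.allclose(piped, by_hand))
```

The order of the steps matters: each step receives the output of the previous one.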

In [None]:
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),
        ('std_scaler', StandardScaler()),
    ])

# Now test the pipeline on housing_num
num_pipeline.fit_transform(housing_num)

Now we have a pipeline for the numerical columns.  
But we still have a categorical column:

In [None]:
housing['ocean_proximity'].head()

We need one more pipeline for the categorical column. Instead of the "Dummy encoding" we used before, we now use the [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) from sklearn.
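
One practical reason for using an encoder object inside a pipeline is that it learns the categories during `fit` and reuses the same column layout for any later data. The sketch below illustrates this with a hypothetical `color` column. The data and names are made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
toy_new = pd.DataFrame({"color": ["blue", "red"]})

toy_encoder = OneHotEncoder()
toy_encoder.fit(toy_train)

# The learned categories are stored in sorted order, one array per input column
print(toy_encoder.categories_)

# transform() returns a sparse matrix by default; toarray() makes it dense
toy_new_1hot = toy_encoder.transform(toy_new).toarray()
print(toy_new_1hot)
```

Although `toy_new` contains only two of the three colors, the output still has three columns, one per category seen during `fit`.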

In [None]:
from sklearn.preprocessing import OneHotEncoder
housing_cat = housing[['ocean_proximity']]
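# The encoder returns a sparse matrix by default; toarray() converts it to a dense numpy array.
# (The constructor flag for dense output is `sparse` in older sklearn versions and `sparse_output` from 1.2 on.)
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat).toarray()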
housing_cat_1hot

We now have everything we need to build a preprocessing pipeline that combines all the previous steps.
Since different transformations must be applied to different columns, we use the class [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).
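
A minimal sketch of the idea, on a made-up DataFrame with two numerical columns and one categorical column (all names are illustrative): each entry of the `ColumnTransformer` names a transformer and the columns it is applied to, and the results are concatenated side by side.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, 4.0],
    "weight": [10.0, 20.0, 30.0, 40.0],
    "color": ["red", "green", "red", "blue"],
})

toy_prep = ColumnTransformer([
    ("num", StandardScaler(), ["size", "weight"]),
    ("cat", OneHotEncoder(), ["color"]),
])
toy_out = toy_prep.fit_transform(toy)

# 2 scaled numerical columns + 3 one-hot columns (blue, green, red)
print(toy_out.shape)
```

Columns that are not listed in any entry are dropped by default.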

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = ["longitude","latitude","housing_median_age","total_rooms", "total_bedrooms",
               "population","households", "median_income"]
cat_attribs = ["ocean_proximity"]

full_prep_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

full_prep_pipeline.fit_transform(housing)

### Train an estimator
Embed `full_prep_pipeline` in a further pipeline, where it is followed by a RandomForestRegressor.
This way, the data is first prepared by `full_prep_pipeline`, and then the RandomForestRegressor is trained on the result.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

full_pipeline_with_predictor = Pipeline([
        ("preparation", full_prep_pipeline),
        ("forest", RandomForestRegressor())
    ])

First, separate the label column (`median_house_value`) from the feature columns (all other columns).
Then split the data into a training and a test dataset using `train_test_split`.
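
The following sketch shows the splitting step on toy arrays. The `random_state` value is an arbitrary choice that only serves to make the shuffle reproducible; all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row i of toy_X is [2*i, 2*i + 1], and its label is i
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# test_size=0.2 puts 20 % of the rows into the test set
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)
# Features and labels are shuffled together, so they stay aligned
print(np.array_equal(X_te[:, 0] // 2, y_te))
```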

In [None]:
# Separate the features (a DataFrame) from the labels (a Series)
housing_features = housing.drop("median_house_value", axis = 1)
housing_labels = housing["median_house_value"]

# Split the two dataframes into a training and a test dataset
X_train, X_test, y_train, y_test = train_test_split(housing_features, housing_labels, test_size = 0.20)

# Now train the full_pipeline_with_predictor on the training dataset
full_pipeline_with_predictor.fit(X_train, y_train)

As usual, calculate some score metrics:
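
For reference, the two metrics used here (RMSE and R²) can be written out by hand. The sketch below does so on made-up toy values and compares the results with sklearn. It also shows that the argument order `(y_true, y_pred)` matters for R², while the RMSE is symmetric.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.0, 2.5, 4.0])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true_toy - y_pred_toy) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))

# R^2 = 1 - SS_res / SS_tot, where SS_tot uses the mean of the true labels
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_score(y_true_toy, y_pred_toy))
# Swapping the arguments changes R^2 (but not the RMSE)
print(r2_score(y_pred_toy, y_true_toy))
```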

In [None]:
from sklearn.metrics import mean_squared_error

y_pred = full_pipeline_with_predictor.predict(X_test)
# sklearn metrics expect the arguments in the order (y_true, y_pred)
forest_mse = mean_squared_error(y_test, y_pred)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
from sklearn.metrics import r2_score

# r2_score is not symmetric: the true labels must come first
r2_score(y_test, y_pred)

Use the [pickle serializer](https://docs.python.org/3/library/pickle.html) to save your estimator to a file for contest participation.
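
Before saving the real model, it can help to see the round trip on a toy pipeline. The sketch below (toy data, a `LinearRegression` stand-in, and a hypothetical file name) dumps a fitted pipeline to a temporary file, loads it again, and checks that both objects give the same predictions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_X = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_y = np.array([2.0, 4.0, 6.0, 8.0])

toy_model = Pipeline([
    ("std_scaler", StandardScaler()),
    ("linreg", LinearRegression()),
]).fit(toy_X, toy_y)

# Write to a temporary directory so no file in the project is overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_model.p")  # hypothetical file name
    with open(toy_path, "wb") as handle:
        pickle.dump(toy_model, handle)
    with open(toy_path, "rb") as handle:
        toy_restored = pickle.load(handle)

# The restored pipeline gives the same predictions as the original
print(np.allclose(toy_model.predict(toy_X), toy_restored.predict(toy_X)))
```

Since the whole pipeline is pickled, the preprocessing steps travel with the model, and the loaded object can be applied directly to raw data. A pickled estimator should be loaded with the same scikit-learn version that was used to save it.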

In [None]:
import pickle
import getpass
from sklearn.utils.validation import check_is_fitted

your_regressor = full_pipeline_with_predictor
assert isinstance(your_regressor, Pipeline)
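# Use a context manager so the file is closed (and fully written) after dumping
with open(getpass.getuser() + "s_model.p", "wb") as handle:
    pickle.dump(your_regressor, handle)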

In [None]:
housing_valid = pd.read_csv("datasets/housing/housing_valid.csv")
housing_valid_features = housing_valid.drop("median_house_value", axis = 1)
housing_valid_labels = housing_valid["median_house_value"]
housing_valid_labels

In [None]:
with open('chrus_model.p', 'rb') as handle:
    contestModel = pickle.load(handle)

# Evaluate the loaded model on the validation set; metric arguments are (y_true, y_pred)
y_pred = contestModel.predict(housing_valid_features)
valid_mse = mean_squared_error(housing_valid_labels.to_numpy(), y_pred)
valid_rmse = np.sqrt(valid_mse)
print(valid_rmse)

In [None]:
# For generating the hidden test set (this overwrites housing.csv and housing_valid.csv)
housing_orig = pd.read_csv("datasets/housing/housing_orig.csv")
course, valid = train_test_split(housing_orig, test_size=0.2, random_state = 26)
valid.to_csv("datasets/housing/housing_valid.csv", index = False)
course.to_csv("datasets/housing/housing.csv", index = False)

In [None]:
course.head(5)

In [None]:
valid.head(5)