# Build a sklearn Pipeline for an ML contest submission
In the ML_course_train notebook, we first analyzed the housing dataset to gain statistical insights and then, for example, added new features, 
replaced missing values, and scaled the columns using pandas DataFrame methods.
In the following, we will use sklearn [Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to integrate all these steps into one final *estimator*. The resulting pipeline can be saved to a file and used later in production.

*Optional:*
If you want, you can save your estimator as explained in the last cell at the bottom of this notebook.
Its performance will then be evaluated on a hidden dataset and ranked against all other submissions.

In [19]:
# read housing data again
import pandas as pd
import numpy as np 
housing = pd.read_csv("datasets/housing/housing.csv")

# Try to get header information of the dataframe:
housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


One remark: sklearn transformers work internally on numpy arrays and, by default, return numpy arrays rather than pandas DataFrames.  
Now try to [convert](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html) a dataframe to a numpy array:

In [20]:
housing.head().to_numpy()

array([[-122.23, 37.88, 41.0, 880.0, 129.0, 322.0, 126.0, 8.3252,
        452600.0, 'NEAR BAY'],
       [-122.22, 37.86, 21.0, 7099.0, 1106.0, 2401.0, 1138.0, 8.3014,
        358500.0, 'NEAR BAY'],
       [-122.24, 37.85, 52.0, 1467.0, 190.0, 496.0, 177.0, 7.2574,
        352100.0, 'NEAR BAY'],
       [-122.25, 37.85, 52.0, 1274.0, 235.0, 558.0, 219.0, 5.6431,
        341300.0, 'NEAR BAY'],
       [-122.25, 37.85, 52.0, 1627.0, 280.0, 565.0, 259.0, 3.8462,
        342200.0, 'NEAR BAY']], dtype=object)

As you can see, the column names are now lost.
In a numpy array, columns are indexed by integers and no longer by their names. 
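
The following is a minimal sketch of that consequence. It uses a hypothetical toy frame (`toy`, with made-up columns `width`, `height`, `label`) rather than the housing data: the integer position of a column is looked up while the names still exist and is then used to address the same column in the array.

```python
import pandas as pd

# Hypothetical toy frame, not the housing data
toy = pd.DataFrame({"width": [1.0, 2.0], "height": [3.0, 4.0], "label": ["a", "b"]})

# Look up the integer position while the column names are still available
height_ix = toy.columns.get_loc("height")

# After conversion, the column can only be addressed by that position
toy_array = toy.to_numpy()
height_column = toy_array[:, height_ix]
```

Because the toy frame mixes floats and strings, the resulting array has `dtype=object`, just like the housing array above.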

### Add extra feature columns
First, we again add some extra columns (e.g. `rooms_per_household`, `population_per_household`, `bedrooms_per_household`) which might correlate better with the target variable `median_house_value`.
For modifying the dataset, we now use a [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html), which we later can put into a pipeline.  
Hints:  
* For finding the index number of a given column name, you can use the method [get_loc()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.get_loc.html)
* For concatenating the new columns with the given array, you can use numpy method [c_](https://docs.scipy.org/doc/numpy/reference/generated/numpy.c_.html)
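
As an illustration of the two hints, here is a small sketch on a hypothetical two-column toy array. The function `add_area`, the indices `width_ix` and `height_ix`, and the data are all made up for this example and are not part of the exercise.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical column positions in the toy array below
width_ix, height_ix = 0, 1

def add_area(X):
    # Derive one new column and append it on the right with np.c_
    area = X[:, width_ix] * X[:, height_ix]
    return np.c_[X, area]

toy = np.array([[1.0, 3.0], [2.0, 4.0]])
area_adder = FunctionTransformer(add_area, validate=False)
toy_extended = area_adder.fit_transform(toy)
```

The transformer is stateless: `fit` learns nothing, and `transform` simply calls the wrapped function, which is why it can later be dropped into a pipeline like any other step.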

In [21]:
from sklearn.preprocessing import FunctionTransformer

# At first, get the indexes as integers from the column names:
rooms_ix = housing.columns.get_loc("total_rooms")
bedrooms_ix = 
population_ix = 
household_ix = 

# Now implement a function which takes a numpy array as argument and adds the new feature columns
def add_extra_features(X):
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = 
    bedrooms_per_household = 
    
    # Concatenate the original array X with the new columns
    return 

attr_adder = FunctionTransformer(add_extra_features, validate = False)
housing_extra_attribs = attr_adder.fit_transform(housing.values)

assert housing_extra_attribs.shape == (17999, 13)
housing_extra_attribs 

SyntaxError: invalid syntax (<ipython-input-21-ad36653281b6>, line 5)

### Imputing missing elements
To replace NaN values in the dataset with the mean or median of their column, you can also use a [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html):  
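
A minimal sketch of median imputation on a hypothetical toy array (the values are invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array with one missing value per column
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan],
                [5.0, 60.0]])

imputer = SimpleImputer(strategy="median")
toy_filled = imputer.fit_transform(toy)

# The per-column medians learned during fit are stored in statistics_
learned_medians = imputer.statistics_
```

The medians are computed from the non-missing entries of each column during `fit` and reused by `transform`, so the same fill values are applied to any later data.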

In [22]:
from sklearn.impute import SimpleImputer 

# Drop the categorical column ocean_proximity
housing_num = housing.drop(...)

print("We have %d nan elements in the numerical columns" %np.count_nonzero(np.isnan(housing_num.to_numpy())))

imp_mean = ...
housing_num_cleaned = imp_mean.fit_transform(housing_num)

assert np.count_nonzero(np.isnan(housing_num_cleaned)) == 0
housing_num_cleaned[1,:]

KeyError: '[Ellipsis] not found in axis'

### Column scaling
For scaling and normalizing the columns, you can use the class [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).
After scaling, use numpy [mean](https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html) and [std](https://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html) to calculate the mean and standard deviation of each column (hint: `axis=0` aggregates over the rows and returns one value per column).
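
A short sketch of what to expect, using a hypothetical toy array with two columns on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 600.0]])

toy_scaled = StandardScaler().fit_transform(toy)

# axis=0 aggregates over the rows, giving one value per column
col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)
```

After standardization, each column should have a mean of approximately 0 and a standard deviation of approximately 1, up to floating-point rounding.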

In [23]:
from sklearn.preprocessing import StandardScaler

scaler = ...
scaled = scaler.fit_transform(housing_num_cleaned)
print("mean of the columns is: " ,  ...)
print("standard deviation of the columns is: " ,  ...)

AttributeError: 'ellipsis' object has no attribute 'fit_transform'

### Putting all preprocessing steps together  
Now let's build a pipeline for preprocessing the **numerical** attributes.
The pipeline shall process the data in the following steps:
* [Impute](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) median or mean values for elements which are NaN
* Add attributes using the FunctionTransformer with the function add_extra_features().
* Scale the numerical values using the [StandardScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
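
The three steps listed above can be sketched on a hypothetical toy array as follows. The step names (`imputer`, `ratio_adder`, `scaler`) and the helper `add_ratio` are assumptions chosen for this example. The only requirement sklearn imposes on the names is that they are unique.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def add_ratio(X):
    # Hypothetical extra feature: column 0 divided by column 1
    return np.c_[X, X[:, 0] / X[:, 1]]

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),    # fill NaN first
    ("ratio_adder", FunctionTransformer(add_ratio)),  # then derive features
    ("scaler", StandardScaler()),                     # finally scale all columns
])

toy = np.array([[2.0, 1.0],
                [np.nan, 2.0],
                [6.0, 3.0],
                [8.0, 2.0]])
toy_prepared = toy_pipeline.fit_transform(toy)
```

Imputing before the feature step means the derived ratio is computed from filled values, so no NaN propagates into the new column.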

In [24]:
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
        ('give a name', ...), # Imputer
        ('give a name', ...), # FunctionTransformer
        ('give a name', ...), # Scaler
    ])

# Now test the pipeline on housing_num
num_pipeline.fit_transform(housing_num)

ValueError: Names provided are not unique: ['give a name', 'give a name', 'give a name']

Now we have a pipeline for the numerical columns.  
But we still have a categorical column:

In [25]:
housing['ocean_proximity'].head()

0    NEAR BAY
1    NEAR BAY
2    NEAR BAY
3    NEAR BAY
4    NEAR BAY
Name: ocean_proximity, dtype: object

We need one more pipeline for the categorical column. Instead of the dummy encoding we used before, we now use the [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) from sklearn.  
Hint: to make things easier, set the `sparse` option of the OneHotEncoder to False (the option is named `sparse_output` in scikit-learn 1.2 and later).
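
A minimal sketch of one-hot encoding on a hypothetical toy column of colors. Because the name of the option for dense output depends on the installed scikit-learn version, this sketch keeps the default sparse output and converts it with `toarray()` instead.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The encoder expects a 2D input: one row per sample, one column per feature
toy_cat = np.array([["red"], ["green"], ["red"], ["blue"]])

encoder = OneHotEncoder()
toy_1hot = encoder.fit_transform(toy_cat).toarray()

# Categories are sorted; their order defines the order of the output columns
learned_categories = encoder.categories_[0]
```

Each row of `toy_1hot` contains exactly one 1, in the column of its category.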

In [26]:
from sklearn.preprocessing import OneHotEncoder
housing_cat = housing[] #get the right column
cat_encoder = 
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

SyntaxError: invalid syntax (<ipython-input-26-0ec62a40c906>, line 2)

We now have everything we need to build a preprocessing pipeline that applies all the previous steps to the columns.  
Since different transformations should be applied to different columns, we use the class [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).
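
As a rough sketch of the idea, the following applies a scaler to two numerical columns and a one-hot encoder to one categorical column of a hypothetical toy frame. The column names and the transformer names `num` and `cat` are invented for this example. `sparse_threshold=0` is set here only so that the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0],
    "weight": [10.0, 20.0, 10.0, 40.0],
    "color": ["red", "green", "red", "blue"],
})

# Each entry is (unique name, transformer, columns it should work on)
toy_prep = ColumnTransformer([
    ("num", StandardScaler(), ["length", "weight"]),
    ("cat", OneHotEncoder(), ["color"]),
], sparse_threshold=0)

toy_prepared = toy_prep.fit_transform(toy)
```

The outputs of the transformers are concatenated column-wise in the order they are listed: two scaled columns followed by three one-hot columns.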

In [27]:
from sklearn.compose import ColumnTransformer

# These are the columns with the numerical features:
num_attribs = ["longitude", ...]

# Here are the columns with categorical features:
cat_attribs = [...]

full_prep_pipeline = ColumnTransformer([
        ("give a name", ..., ...), # Add the numerical pipeline and specify the columns it should work on
        ("give a name", ..., ...), # Add a OneHotEncoder and specify the columns it should work on
    ])

full_prep_pipeline.fit_transform(housing)

ValueError: Names provided are not unique: ['give a name', 'give a name']

### Train an estimator
Embed `full_prep_pipeline` in a further pipeline where it is followed by a RandomForestRegressor.  
This way, our data is first prepared by `full_prep_pipeline`, and then the RandomForestRegressor is trained on the result.
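
A sketch of this pattern on hypothetical synthetic data (the columns `area` and `color`, the step names, and the hyperparameters are assumptions made only for this example):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: the target depends linearly on "area" plus a little noise
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "area": rng.uniform(1.0, 10.0, 40),
    "color": rng.choice(["red", "green", "blue"], 40),
})
toy_y = 3.0 * toy_X["area"].to_numpy() + rng.normal(0.0, 0.1, 40)

toy_model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["area"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ])),
    ("forest", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# fit() runs the preprocessing and then trains the regressor on its output
toy_model.fit(toy_X, toy_y)
toy_pred = toy_model.predict(toy_X.head(3))
```

Calling `predict` on the combined pipeline applies the already fitted preprocessing to the new rows before passing them to the regressor, so raw DataFrames can be fed in directly.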

In [28]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

full_pipeline_with_predictor = Pipeline([
        ("give a name", full_prep_pipeline), # add the full_prep_pipeline
        ("give a name", RandomForestRegressor())  # Add a RandomForestRegressor 
    ])

ValueError: Names provided are not unique: ['give a name', 'give a name']

To train the regressor, separate the label column (`median_house_value`) from the feature columns (all other columns).
Then split the data into a training and a test dataset using `train_test_split`.
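
The same two steps, sketched on a hypothetical toy frame with a made-up label column `price`. The `random_state` is an assumption added here only to make the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"area": np.arange(10.0), "price": 2.0 * np.arange(10.0)})

# Features: everything except the label column; labels: only that column
toy_features = toy.drop(columns=["price"])
toy_labels = toy["price"]

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_labels, test_size=0.2, random_state=42
)
```

Features and labels are shuffled together, so each row of `X_tr` still lines up with the corresponding entry of `y_tr`.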

In [29]:
# Create two dataframes, one for the labels one for the features
housing_features = housing...
housing_labels = housing

# Split the two dataframes into a training and a test dataset
X_train, X_test, y_train, y_test = train_test_split(housing_features, housing_labels, test_size = 0.20)

# Now train the full_pipeline_with_predictor on the training dataset
full_pipeline_with_predictor.fit(X_train, y_train)

SyntaxError: invalid syntax (<ipython-input-29-7e4640eba1c2>, line 2)

As usual, calculate some score metrics:
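
One detail worth checking when computing these metrics is the argument order: sklearn's metric functions take the true values first and the predictions second. The mean squared error is symmetric in its arguments, but the R² score is not. A small sketch with hypothetical numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([2.0, 2.0, 3.0, 3.0])

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_true, y_hat))

# R^2 normalizes by the variance of the first argument, so order matters
r2_correct = r2_score(y_true, y_hat)
r2_swapped = r2_score(y_hat, y_true)
```

For these numbers, `r2_correct` is 0.6 while `r2_swapped` is -1.0, so swapping the arguments can change the reported score substantially.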

In [30]:
from sklearn.metrics import mean_squared_error

y_pred = full_pipeline_with_predictor.predict(X_test)
tree_mse = mean_squared_error(y_test, y_pred)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

NameError: name 'full_pipeline_with_predictor' is not defined

In [31]:
from sklearn.metrics import r2_score

# r2_score expects (y_true, y_pred); unlike the MSE, it is not symmetric
r2_score(y_test, y_pred)

NameError: name 'y_pred' is not defined

Use the [pickle serializer](https://docs.python.org/3/library/pickle.html) to save your estimator to a file for contest participation.
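
A minimal sketch of a pickle round trip with a hypothetical toy pipeline. To avoid writing a file, it serializes to an in-memory bytes object with `pickle.dumps` and restores it with `pickle.loads`. Writing to a file with `pickle.dump` works the same way.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_X = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_y = np.array([2.0, 4.0, 6.0, 8.0])

toy_model = Pipeline([
    ("scaler", StandardScaler()),
    ("reg", LinearRegression()),
]).fit(toy_X, toy_y)

# Serialize the fitted pipeline and restore it again
blob = pickle.dumps(toy_model)
restored = pickle.loads(blob)
```

The restored object is a fitted pipeline that gives the same predictions as the original. Note that plain Python functions, such as one wrapped in a FunctionTransformer, are pickled by reference, so the environment that loads the file must be able to find a function with the same name. Only unpickle files from sources you trust.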

In [32]:
import pickle
import getpass
from sklearn.utils.validation import check_is_fitted

your_regressor = ... # Put your regression pipeline here
assert isinstance(your_regressor, Pipeline)
# Use a context manager so that the file is closed after writing
with open(getpass.getuser() + "s_model.p", "wb") as f:
    pickle.dump(your_regressor, f)

AssertionError: 