# Putting it all together: SciKit-Learn inference pipeline

In [1]:
import pickle
from pathlib import Path

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Optimized hyperparameters from the model building notebook
with open(Path('../data/processed/hyperparameters.pkl'), 'rb') as input_file:
    hyperparameters = pickle.load(input_file)

# Training and testing data from the data preparation notebook
with open(Path('../data/processed/train.pkl'), 'rb') as input_file:
    train_df = pickle.load(input_file)

with open(Path('../data/processed/test.pkl'), 'rb') as input_file:
    test_df = pickle.load(input_file)



## 1. Define a column transformer that encodes the categorical features

See SciKit-Learn [`ColumnTransformer()`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) documentation.

In [2]:
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['Gender'])
    ],
    remainder='passthrough')
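
To make the encoder's behavior concrete, here is a minimal sketch on a toy DataFrame. The string labels `'female'`, `'male'`, and `'other'` and the `toy_*` names are assumptions for illustration only; the real `Gender` column may already be numeric. It shows that a category not seen during fitting is mapped to `-1`, and that the encoded column comes first in the output, followed by the passed-through columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data, not the project's dataset
toy_train = pd.DataFrame({
    'Gender': ['female', 'male', 'female'],
    'Age': [30, 41, 25],
})
toy_new = pd.DataFrame({
    'Gender': ['male', 'other'],  # 'other' was never seen during fit
    'Age': [52, 33],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['Gender'])
    ],
    remainder='passthrough')

toy_preprocessor.fit(toy_train)

# Categories are sorted, so 'female' -> 0 and 'male' -> 1; unknown -> -1
encoded = toy_preprocessor.transform(toy_new)
print(encoded)  # rows: [1., 52.] and [-1., 33.]
```

Because the output is a plain array with the encoded column moved to the front, later pipeline steps see features by position rather than by name.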

## 2. Define a pipeline that takes raw input and does prediction

See SciKit-Learn [`Pipeline()`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) documentation.

The pipeline will have three steps:

1. **Feature encoder** - (column transformer from section 1, above) - use the strategy you devised in the data preparation notebook.
2. **Imputer** - to fill in any features the user didn't know or specify - use the strategy you came up with in the user input imputation notebook.
3. **Regressor** - use the optimized hyperparameters from the model building notebook.

Make TWO pipelines, one for the time model and one for the calories model, and store them in a dict.

## 3. Train the pipeline

In [3]:
from sklearn.base import clone

# Pipeline does not copy its steps, so give each pipeline its own unfitted
# copy of the preprocessor instead of sharing one instance between them.
pipelines = {
    'calorie_model_pipeline': Pipeline([
        ('preprocessor', clone(preprocessor)),
        ('imputer', KNNImputer(n_neighbors=3, weights='distance')),
        ('model', GradientBoostingRegressor(**hyperparameters))
    ]),

    'time_model_pipeline': Pipeline([
        ('preprocessor', clone(preprocessor)),
        ('imputer', KNNImputer(n_neighbors=3, weights='distance')),
        ('model', GradientBoostingRegressor(**hyperparameters))
    ])
}
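
The reason for the imputer step is easiest to see at prediction time. The sketch below builds the same three-step pipeline on synthetic data and predicts for a user who left `Age` blank. The data, the `toy_*` names, the string gender labels, and the regressor settings are all assumptions for illustration; they are not the project's data or tuned hyperparameters.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n = 60
toy_X = pd.DataFrame({
    'Gender': rng.choice(['female', 'male'], size=n),
    'Age': rng.integers(20, 70, size=n).astype(float),
    'Duration': rng.uniform(5, 30, size=n),
})
toy_y = 5.0 * toy_X['Duration'] + 0.5 * toy_X['Age']

toy_pipeline = Pipeline([
    ('preprocessor', ColumnTransformer(
        transformers=[
            ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['Gender'])
        ],
        remainder='passthrough')),
    ('imputer', KNNImputer(n_neighbors=3, weights='distance')),
    ('model', GradientBoostingRegressor(n_estimators=50, random_state=0)),
])
toy_pipeline.fit(toy_X, toy_y)

# A user who did not supply their age: the NaN passes through the encoder,
# is filled in by the KNN imputer, and only then reaches the regressor.
user_input = pd.DataFrame({'Gender': ['male'], 'Age': [np.nan], 'Duration': [20.0]})
prediction = toy_pipeline.predict(user_input)
print(prediction)
```

Without the imputer step, `GradientBoostingRegressor` would reject the missing value, so the step order (encode, impute, predict) matters.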

In [4]:
print(train_df)
print(test_df)

        User_ID  Calories  Gender  Age  Height  Weight  Duration  Heart_Rate  \
9839   16554569      17.0     0.0   37   179.0    77.0       7.0        81.0   
9680   18903739     167.0     0.0   23   195.0    87.0      26.0       110.0   
7093   11938260      40.0     0.0   33   181.0    77.0      12.0        88.0   
11293  14116395      34.0     1.0   66   156.0    54.0       9.0        77.0   
820    13815395      23.0     1.0   32   144.0    49.0       5.0        90.0   
...         ...       ...     ...  ...     ...     ...       ...         ...   
5191   11890347     151.0     1.0   75   148.0    51.0      22.0       104.0   
13418  13504073     114.0     1.0   21   172.0    67.0      20.0       104.0   
5390   17918506      41.0     0.0   57   189.0    92.0       8.0        90.0   
860    12133833      57.0     0.0   35   174.0    76.0      12.0        97.0   
7270   19189565      59.0     0.0   26   182.0    86.0      16.0        91.0   

       Body_Temp  
9839        39.5  
[output truncated]

## 4. Evaluate the pipeline

In [5]:
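# Fit the calorie pipeline on the training features and target
pipelines['calorie_model_pipeline'].fit(train_df.drop(columns=['Calories']), train_df['Calories'])

# Score on the held-out test set (returns R^2 for a regressor); left disabled here
#pipelines['calorie_model_pipeline'].score(test_df.drop(columns=['Calories']), test_df['Calories'])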

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).
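
The cell above fits only the calorie pipeline. One possible way to fit and score both pipelines in a loop is sketched below. It runs on synthetic stand-in data with a reduced set of columns, and it assumes that the time model's target is `Duration`; that target, the `build_toy_pipeline` helper, the `targets` mapping, and the default regressor settings are all hypothetical choices, not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for train_df / test_df (hypothetical values)
rng = np.random.default_rng(0)
n = 200
toy_df = pd.DataFrame({
    'Gender': rng.integers(0, 2, size=n).astype(float),
    'Age': rng.integers(20, 80, size=n).astype(float),
    'Duration': rng.uniform(1, 30, size=n),
    'Heart_Rate': rng.uniform(70, 120, size=n),
})
toy_df['Calories'] = 6.0 * toy_df['Duration'] + 0.2 * toy_df['Age'] + rng.normal(0, 1, size=n)
toy_train, toy_test = train_test_split(toy_df, test_size=0.25, random_state=0)

# Assumed mapping from pipeline name to target column
targets = {
    'calorie_model_pipeline': 'Calories',
    'time_model_pipeline': 'Duration',
}

def build_toy_pipeline():
    """Return a fresh, unfitted imputer + regressor pipeline."""
    return Pipeline([
        ('imputer', KNNImputer(n_neighbors=3, weights='distance')),
        ('model', GradientBoostingRegressor(random_state=0)),
    ])

toy_pipelines = {}
scores = {}
for name, target in targets.items():
    pipeline = build_toy_pipeline()
    # Each model is trained on every column except its own target
    pipeline.fit(toy_train.drop(columns=[target]), toy_train[target])
    # Pipeline.score() delegates to the regressor and returns R^2
    scores[name] = pipeline.score(toy_test.drop(columns=[target]), toy_test[target])
    toy_pipelines[name] = pipeline

print(scores)
```

Fitting every pipeline before saving matters because an unfitted pipeline pickles without complaint but raises an error on the first call to `predict()`.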



## 5. Save pipeline for deployment

In [6]:
Path('../models').mkdir(exist_ok=True)

with open('../models/pipelines.pkl', 'wb') as output_file:
    pickle.dump(pipelines, output_file)
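
Before deploying, it can be worth checking that the saved file loads and predicts the same way as the in-memory object. The following is a minimal round-trip sketch using a temporary directory and a toy pipeline; the data, the `toy_model_pipeline` key, and the regressor settings are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

# Toy data and pipeline (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 3))
y = X @ np.array([1.0, 2.0, 3.0])

toy_pipelines = {
    'toy_model_pipeline': Pipeline([
        ('imputer', KNNImputer(n_neighbors=3, weights='distance')),
        ('model', GradientBoostingRegressor(n_estimators=20, random_state=0)),
    ]).fit(X, y)
}

# Save and reload the dict of pipelines, mirroring the cell above
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'pipelines.pkl'

    with open(model_path, 'wb') as output_file:
        pickle.dump(toy_pipelines, output_file)

    with open(model_path, 'rb') as input_file:
        loaded_pipelines = pickle.load(input_file)

# The reloaded pipeline should reproduce the original predictions
original_predictions = toy_pipelines['toy_model_pipeline'].predict(X[:5])
loaded_predictions = loaded_pipelines['toy_model_pipeline'].predict(X[:5])
print(np.allclose(original_predictions, loaded_predictions))
```

Pickled estimators are tied to the library version that created them, so the deployment environment should use the same scikit-learn version as this notebook.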