## Pipelines Challenge

In this challenge, we will be working with this [dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing), where we will be predicting sales. 

**The main goal is to create a `pipeline` that covers all the data preprocessing and modeling steps.**


**TASK 1**: Build a pipeline that ends with a regression model, to predict `Item_Outlet_Sales` from the dataset. 

**The pipeline should have following steps:**

1. Split the features into numerical and categorical (text)
2. Replace null values
    - the mean for numerical variables
    - the most frequent value for categorical variables
3. Create dummy variables from categorical features
4. Use a PCA to reduce number of dummy variables to 3 principal components. PCA will be used directly after the OneHotEncoder that outputs data into a SparseMatrix, so we will need to use the **ToDenseTransformer** from the [article about custom pipelines](https://queirozf.com/entries/scikit-learn-pipelines-custom-pipelines-and-pandas-integration).
5. Select the 3 best candidates from the original numerical features using KBest
6. Fit a Ridge regression (default alpha is fine for now)

**TASK 2**: Tune the parameters of multiple models as well as the preprocessing steps and find the best solution.
- Try these models: 
        - Random Forest Regressor
        - Gradient Boosting Regressor 
        - Ridge Regression. 
- For the task 2, we will need to use the same approach from this [earlier article](https://iaml.it/blog/optimizing-sklearn-pipelines), in the section `PIPELINE TUNING (ADVANCED VERSION)`, where we tried different kinds of scalers. (Use the article as reference.)

_________________________________

In [570]:
import numpy as np
import pandas as pd
from pathlib import Path
data_path = Path('./data')
df = pd.read_csv(data_path / "regression_exercise.csv")
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [571]:
# separate target variable from the original dataframe
y = df["Item_Outlet_Sales"]
x = df.drop(["Item_Outlet_Sales","Item_Identifier"], axis = 1)

Split the dataset into a train and test set.

**Note:** We should always do this at the beginning before the pipeline.

In [572]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y)

---------------------
## Task I

### Split Features into numerical and categorical

In [573]:
cat_feats = x.dtypes[x.dtypes == 'object'].index.tolist()
num_feats = x.dtypes[~(x.dtypes == 'object')].index.tolist()

In [574]:
from sklearn.preprocessing import FunctionTransformer

# Using own function in Pipeline
def numFeat(data):
    return data[num_feats]

def catFeat(data):
    return data[cat_feats]

In [575]:
# we will start two separate pipelines for each type of features
keep_num = FunctionTransformer(numFeat)
keep_cat = FunctionTransformer(catFeat)

### replacing null values

In [576]:
# Use SimpleImputer
from sklearn.impute import SimpleImputer

# imputer numerical
impute_num = SimpleImputer(strategy='mean')
# df_num_imputed = impute_num.fit_transform(numFeat(xtrain))

# impute categorical
impute_cat = SimpleImputer(strategy='most_frequent')
# df_cat_imputed = impute_cat.fit_transform(catFeat(xtrain))

### Creating dummy variables

In [577]:
# use OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
# df_cat_endcoded = encoder.fit_transform(df_cat_imputed)

### Use PCA to reduce the number of dummy variables to 3 principal components.

In [578]:
# don't forget ToDenseTransformer after one hot encoder
class ToDenseTransformer():
    def transform(self, x, y=None, **fit_params):
        return x.todense()
    
    def fit(self, x, y=None, **fit_params):
        return self

dense = ToDenseTransformer()
# dense.fit(df_cat_endcoded)
# df_cat_dense = dense.transform(df_cat_endcoded)

In [579]:
from sklearn.decomposition import PCA
def npconvert(data):
    return np.asarray(data)

to_array = FunctionTransformer(npconvert)

pca = PCA(n_components=3)
# df_cat_pca = pca.fit_transform(np.asarray(df_cat_dense))


### Select the 3 best numeric features

In [580]:
# use SelectKBest
from sklearn.feature_selection import SelectKBest
kbest = SelectKBest(k=3)
# df_num_kbest = kbest.fit_transform(df_num_imputed, ytrain)

### Fitting models

In [581]:
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Use base_model in Task I
ridge_model = Ridge()
forest_model = RandomForestRegressor()
gradient_model = GradientBoostingRegressor()

## Building a Pipeline

In [582]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer

# build the features around the pipeline
cat_pipe = Pipeline([
    ('cat_select', keep_cat),
    ('impute_cat', impute_cat),
    ('encoder', encoder),
    ('dense', dense),
    ('convert', to_array),
    ('pca', pca)
])

num_pipe = Pipeline([
    ('num_select', keep_num),
    ('impute_num', impute_num),
    ('kbest', kbest)
])

features = FeatureUnion([
    ('features_num', num_pipe),
    ('features_cat', cat_pipe)
])

### Baseline Pipes

In [583]:
# ridge regression
ridge_pipeline = Pipeline([
    ('features', features),
    ('model', ridge_model)
])

# random forest
forest_pipeline = Pipeline([
    ('features', features),
    ('model', forest_model)
])

# random forest
gradient_pipeline = Pipeline([
    ('features', features),
    ('model', gradient_model)
])

### Baseline Models

In [584]:
# ridge regression
ridge_pipeline.fit(xtrain, ytrain)
ypred_ridge = ridge_pipeline.predict(xtest)

# forest regression
forest_pipeline.fit(xtrain, ytrain)
ypred_forest = forest_pipeline.predict(xtest)

# gradient regression
gradient_pipeline.fit(xtrain, ytrain)
ypred_gradient = gradient_pipeline.predict(xtest)

In [585]:
# model.score(df_test,y_test)
print(ridge_pipeline.score(xtest, ytest))
print(forest_pipeline.score(xtest, ytest))
print(gradient_pipeline.score(xtest, ytest))

0.4081355031074001
0.5628468906722521
0.6073410281290602


----------------------------
## Task II

In [586]:
from sklearn.model_selection import GridSearchCV

In [587]:
# define grid parameters
ridge_params = {
    'features__features_cat__pca__n_components' : [5],
    'features__features_num__kbest__k' : [3],
    'features__features_num__impute_num__strategy' : ['mean', 'medium'],
    'features__features_cat__encoder__drop' : ['first', None],
    'model__alpha' : [0.3],
}

forest_params = {
    'features__features_cat__pca__n_components' : [5],
    'features__features_num__kbest__k' : [2],
    'features__features_num__impute_num__strategy' : ['mean', 'medium'],
    'features__features_cat__encoder__drop' : ['if_binary', None],
    'model__n_estimators' : [50],
    'model__max_depth' : [5],
    'model__min_samples_split' : [3]
}

gradient_params = {
    'features__features_cat__pca__n_components' : [2, 3, 4, 5],
    'features__features_num__kbest__k' : [2, 3, 4, 5],
    'features__features_num__impute_num__strategy' : ['mean', 'medium'],
    'features__features_cat__encoder__drop' : ['first', 'if_binary'],
    'model__loss' : ['squared_error', 'absolute_error', 'huber', 'quantile'],
    'model__learning_rate' : [0.01, 0.03, 0.1, 0.3, 1],
    'model__n_estimators' : [10, 50, 100, 110],
    'model__max_depth' : [2, 3, 4, 5],
    'model__min_samples_split' : [2, 3, 4, 5],
    'model__criterion' : ['squared_error', 'mse', 'mae']
}

In [588]:
# train ridge model and output results
tuned_ridge = GridSearchCV(ridge_pipeline, ridge_params, refit=True, cv=5, return_train_score=True)
tuned_ridge.fit(xtrain, ytrain)

10 fits failed out of a total of 20.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/lighthouse/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/lighthouse/lib/python3.8/site-packages/sklearn/pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/lighthouse/lib/python3.8/site-packages/sklearn/pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/opt/homebr

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('features',
                                        FeatureUnion(transformer_list=[('features_num',
                                                                        Pipeline(steps=[('num_select',
                                                                                         FunctionTransformer(func=<function numFeat at 0x17fd1f670>)),
                                                                                        ('impute_num',
                                                                                         SimpleImputer()),
                                                                                        ('kbest',
                                                                                         SelectKBest(k=3))])),
                                                                       ('features_cat',
                                                                        Pipeline

In [589]:
tuned_ridge.score

Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('features_num',
                                                 Pipeline(steps=[('num_select',
                                                                  FunctionTransformer(func=<function numFeat at 0x17fd1f670>)),
                                                                 ('impute_num',
                                                                  SimpleImputer()),
                                                                 ('kbest',
                                                                  SelectKBest(k=3))])),
                                                ('features_cat',
                                                 Pipeline(steps=[('cat_select',
                                                                  FunctionTransformer(func=<function catFeat at 0x17fd1f8b0>)),
                                                                 ('impute_cat',
                  

In [590]:
# train forest model and output results
tuned_forest = GridSearchCV(forest_pipeline, forest_params, refit=True, cv=5, return_train_score=True)
tuned_forest.fit(xtrain, ytrain)

10 fits failed out of a total of 20.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/lighthouse/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/lighthouse/lib/python3.8/site-packages/sklearn/pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/lighthouse/lib/python3.8/site-packages/sklearn/pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/opt/homebr

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('features',
                                        FeatureUnion(transformer_list=[('features_num',
                                                                        Pipeline(steps=[('num_select',
                                                                                         FunctionTransformer(func=<function numFeat at 0x17fd1f670>)),
                                                                                        ('impute_num',
                                                                                         SimpleImputer()),
                                                                                        ('kbest',
                                                                                         SelectKBest(k=3))])),
                                                                       ('features_cat',
                                                                        Pipeline

In [None]:
tuned_forest.score

Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('features_num',
                                                 Pipeline(steps=[('num_select',
                                                                  FunctionTransformer(func=<function numFeat at 0x17fd1f670>)),
                                                                 ('impute_num',
                                                                  SimpleImputer()),
                                                                 ('kbest',
                                                                  SelectKBest(k=2))])),
                                                ('features_cat',
                                                 Pipeline(steps=[('cat_select',
                                                                  FunctionTransformer(func=<function catFeat at 0x17fd1f8b0>)),
                                                                 ('impute_cat',
                  

In [592]:
# train gradient model and output results
tuned_gradient = GridSearchCV(gradient_pipeline, gradient_params, refit=True, cv=5, return_train_score=True)
tuned_gradient.fit(xtrain, ytrain)

In [None]:
# test predictions
ypred = tuned_ridge.predict(xtest)
yprob = tuned_ridge.predict_proba(xtest)

# scores
print('Final score is: ', tuned_ridge.score(xtest, ytest))

In [None]:
# test predictions
ypred = tuned_forest.predict(xtest)
yprob = tuned_forest.predict_proba(xtest)

# scores
print('Final score is: ', tuned_forest.score(xtest, ytest))

In [None]:
# test predictions
ypred = tuned_gradient.predict(ytest)
yprob = tuned_gradient.predict_proba(ytest)

# scores
print('Final score is: ', tuned_gradient.score(xtest, ytest))