## Pipelines Challenge

In this challenge, we will work with this [dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing) to predict sales.

**The main goal is to create a `pipeline` that covers all the data preprocessing and modeling steps.**


**TASK 1**: Build a pipeline that ends with a regression model, to predict `Item_Outlet_Sales` from the dataset. 

**The pipeline should have the following steps:**

1. Split the features into numerical and categorical (text)
2. Replace null values
    - the mean for numerical variables
    - the most frequent value for categorical variables
3. Create dummy variables from categorical features
4. Use PCA to reduce the dummy variables to 3 principal components. Because PCA runs directly after the OneHotEncoder, which outputs a sparse matrix, we will need the **ToDenseTransformer** from the [article about custom pipelines](https://queirozf.com/entries/scikit-learn-pipelines-custom-pipelines-and-pandas-integration).
5. Select the 3 best candidates from the original numerical features using `SelectKBest`
6. Fit a Ridge regression (default alpha is fine for now)
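
Step 4 relies on a transformer that turns the encoder's sparse output into a dense array. The block below is a minimal sketch of such a step; it is one possible version written for this note and is not necessarily identical to the `ToDenseTransformer` in the linked article.

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin


class ToDenseTransformer(BaseEstimator, TransformerMixin):
    """Convert a SciPy sparse matrix into a dense NumPy array (illustrative sketch)."""

    def fit(self, X, y=None):
        # Nothing to learn; the method exists so the class works as a pipeline step.
        return self

    def transform(self, X):
        # Sparse input is densified; already-dense input is passed through as an array.
        return X.toarray() if sparse.issparse(X) else np.asarray(X)


dense = ToDenseTransformer().fit_transform(sparse.csr_matrix([[1, 0], [0, 2]]))
print(type(dense).__name__, dense.shape)
```

Placed between the one-hot encoder and PCA, a step like this hands PCA a plain dense array.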

**TASK 2**: Tune the parameters of multiple models as well as the preprocessing steps and find the best solution.
- Try these models:
    - Random Forest Regressor
    - Gradient Boosting Regressor
    - Ridge Regression
- For Task 2, use the same approach as this [earlier article](https://iaml.it/blog/optimizing-sklearn-pipelines), in the section `PIPELINE TUNING (ADVANCED VERSION)`, which tries different kinds of scalers. (Use the article as a reference.)
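
One way to search over several models in a single grid is to treat the final pipeline step itself as a parameter. The sketch below shows the idea on synthetic data; the names `demo_pipe` and `demo_grid`, the data, and the parameter values are assumptions chosen for illustration, not the settings used later in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sales data (assumption, for illustration only).
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

demo_pipe = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])

# Each dict is searched separately, so every model is paired only with its own parameters.
demo_grid = [
    {"regressor": [Ridge()], "regressor__alpha": [0.1, 1.0, 10.0]},
    {"regressor": [RandomForestRegressor(n_estimators=20, random_state=0)],
     "regressor__max_depth": [3, 5]},
    {"regressor": [GradientBoostingRegressor(n_estimators=20, random_state=0)],
     "regressor__learning_rate": [0.05, 0.1]},
]

search = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=3)
search.fit(X_demo, y_demo)
print(type(search.best_estimator_.named_steps["regressor"]).__name__)
print(search.best_params_)
```

The same list-of-dicts pattern extends to preprocessing steps, for example swapping the scaler, by adding a key named after that step.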

_________________________________

In [1]:
import pandas as pd
df = pd.read_csv("data/regression_exercise.csv")
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [43]:
df.Item_Fat_Content.unique()

array(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], dtype=object)

In [2]:
# creating target variable
y = df["Item_Outlet_Sales"]
df = df.drop(["Item_Outlet_Sales","Item_Identifier"],axis = 1)

Split the dataset into a train and test set.

**Note:** We should always do this at the beginning before the pipeline.
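
Splitting first means the imputers, scaler, encoder and PCA are fit on the training rows only, so nothing about the test rows leaks into preprocessing. As an alternative to sampling by index, scikit-learn's `train_test_split` does the same job and accepts a `random_state` so the split can be reproduced. The sketch below uses a small made-up frame (`toy`) rather than the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real data (assumption, for illustration only).
toy = pd.DataFrame({"Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9],
                    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 732.4, 994.7]})
X_toy = toy.drop(columns="Item_Outlet_Sales")
y_toy = toy["Item_Outlet_Sales"]

# random_state pins the shuffle so the split is identical on every run.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))
```

Features and target are split together, so their indices stay aligned without any manual `isin` filtering.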

In [158]:
# split into train and test sets first, THEN build the pipeline
df_train = df.sample(frac=0.8).sort_index()
y_train = y[y.index.isin(df_train.index.tolist())]

In [159]:
df_test = df[~df.index.isin(df_train.index.tolist())].sort_index()
y_test = y[y.index.isin(df_test.index.tolist())]

In [4]:
df.dtypes

Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
dtype: object

---------------------
## Task I

### Split Features into numerical and categorical

In [5]:
cat_feats = df.dtypes[df.dtypes == 'object'].index.tolist()
num_feats = df.dtypes[~df.dtypes.index.isin(cat_feats)].index.tolist()

In [6]:
cat_feats

['Item_Fat_Content',
 'Item_Type',
 'Outlet_Identifier',
 'Outlet_Size',
 'Outlet_Location_Type',
 'Outlet_Type']

In [None]:
num_feats

['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']
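
The same split can also be written with pandas' `select_dtypes`, which avoids comparing dtypes by hand. This is a small sketch on a made-up frame; the `toy`, `num_cols` and `cat_cols` names are illustrative only.

```python
import pandas as pd

# Toy frame with two numeric columns and one text column (assumption, for illustration).
toy = pd.DataFrame({"Item_Weight": [9.3, 5.92],
                    "Outlet_Size": ["Medium", "High"],
                    "Outlet_Establishment_Year": [1999, 2009]})

# Numeric dtypes (int and float) on one side, everything else on the other.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(num_cols, cat_cols)
```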

In [28]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Using own function in Pipeline
def numFeat(data):
    return data[num_feats]

def catFeat(data):
    return data[cat_feats]


In [63]:
# we will start two separate pipelines for each type of features
keep_num = FunctionTransformer(numFeat)
keep_cat = FunctionTransformer(catFeat)


In [145]:
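# Item_Fat_Content uses several spellings for the same two categories; consolidate them.

def consolidate_low_fat(X):
    X = X.copy()  # work on a copy so the caller's DataFrame is not mutated
    X['Item_Fat_Content'] = X['Item_Fat_Content'].replace(
        {'low fat': 'Low Fat', 'LF': 'Low Fat', 'reg': 'Regular'})
    return X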

consolidate_low_fat_object = FunctionTransformer(consolidate_low_fat)

x = consolidate_low_fat_object.fit_transform(df_train)
x.Item_Fat_Content.unique()


array(['Low Fat', 'Regular'], dtype=object)

### Replacing null values

In [10]:
# Use SimpleImputer

In [146]:
import numpy as np
from sklearn.impute import SimpleImputer

imp_num = SimpleImputer(missing_values=np.nan, strategy='mean')
# Standalone check only; inside the pipeline the imputer is fit automatically.
df_num_no_null = imp_num.fit_transform(keep_num.fit_transform(df_train))

imp_cat = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# df_cat_no_null = imp_cat.fit_transform(keep_cat.fit_transform(df_train))


In [101]:
np.isnan(df_num_no_null).sum()

0
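
To see what the two strategies do, here is a small sketch on hand-made arrays (the values are invented for illustration): the missing weight is replaced by the column mean, and the missing outlet size by the most frequent label.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up columns with one missing entry each (assumption, for illustration only).
weights = np.array([[1.0], [np.nan], [3.0]])
sizes = np.array([["Medium"], [np.nan], ["Medium"], ["High"]], dtype=object)

# 'mean' works on numeric data; 'most_frequent' also works on text columns.
filled_weights = SimpleImputer(strategy="mean").fit_transform(weights)
filled_sizes = SimpleImputer(strategy="most_frequent").fit_transform(sizes)
print(filled_weights.ravel())
print(filled_sizes.ravel())
```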

### Creating dummy variables

In [196]:
# use OneHotEncoder

In [148]:
from sklearn.preprocessing import OneHotEncoder
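# `sparse=False` returns a dense array; scikit-learn >= 1.2 names this parameter `sparse_output`.
enc = OneHotEncoder(sparse=False)
# df_cat_ohe = enc.fit_transform(df_cat_no_null)  # not needed here: the pipeline calls fit_transform itself


# To recover the dummy column names after fitting, call enc.get_feature_names_out()
# (enc.get_feature_names() on scikit-learn < 1.0).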

### Use PCA to reduce the number of dummy variables to 3 principal components.

In [149]:
# don't forget ToDenseTransformer after one hot encoder (or set sparse=False)

In [150]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
# df_pca = pca.fit_transform(df_cat_ohe, y_train)
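
A quick way to judge whether 3 components are enough is to look at how much variance they retain. The sketch below runs PCA on random 0/1 columns standing in for the one-hot output; the data and the `demo_pca` name are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 50 rows of random 0/1 dummies across 8 columns (synthetic data, assumption).
dummies = rng.integers(0, 2, size=(50, 8)).astype(float)

demo_pca = PCA(n_components=3)
components = demo_pca.fit_transform(dummies)

print(components.shape)  # one row per sample, one column per component
print(demo_pca.explained_variance_ratio_.sum())  # share of variance kept by the 3 components
```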

### Select the 3 best numeric features

In [3]:
# use SelectKBest

In [151]:
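from sklearn.feature_selection import SelectKBest, f_regression
# The default score function, f_classif, is meant for classification; f_regression suits a continuous target.
selection = SelectKBest(score_func=f_regression, k=3)
# df_kbest = selection.fit_transform(df_num_no_null, y_train)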
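
`SelectKBest` ranks each feature against the target using a score function. Its default, `f_classif`, is designed for class labels, so for a continuous target such as sales `f_regression` is the matching choice. The sketch below uses synthetic data in which only columns 0 and 2 drive the target, then checks which columns the selector keeps.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
# Four synthetic features; the target depends on columns 0 and 2 only (assumption).
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# Boolean mask over the input columns: True where the feature was kept.
print(selector.get_support())
```

`get_support()` is also how the surviving column names can be recovered once the selector sits inside a pipeline.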

In [152]:
# feature_union = FeatureUnion([("pca", pca), ("univ_select", selection)]) # do in pipeline instead


### Fitting models

In [54]:
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Use base_model in Task I

In [154]:
base_model = Ridge()
# base_model.fit(feature_union, y_train)


### Building a Pipeline

In [186]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [187]:
numeric_transform = Pipeline([('impute_mean', SimpleImputer(strategy='mean')), 
                              ('scaling', StandardScaler()),
                             ("univ_select", selection)]) # kbest


categorical_transform = Pipeline([('fix_low_fat', consolidate_low_fat_object),
                                  ('impute_mode', SimpleImputer(strategy='most_frequent')), 
                                  ('one-hot-encode', OneHotEncoder(sparse=False)),
                                 ("pca", pca)])


preprocessing_sales = ColumnTransformer([('numeric', numeric_transform, num_feats), 
                                        ('categorical', categorical_transform, cat_feats)])


In [190]:
# feature_union = FeatureUnion([("pca", pca), ("univ_select", selection)])

pipeline = Pipeline(steps = [('preprocessing_sales', preprocessing_sales),
                     ("regressor", Ridge())])

pipeline.fit(df_train, y_train)

y_pred = pipeline.predict(df_test)
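rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE; `squared=False` is deprecated in recent scikit-learn
r2 = r2_score(y_test, y_pred)
print(f'\nTest set RMSE: {rmse}\nr2: {r2}')
pipeline.named_steps['regressor'].coef_  # coefficients of the fitted Ridge step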


Test set RMSE: 1298.118054471505
r2: 0.3918016430295822


array([-189.13482509,  973.04446381, -359.86393405,   39.5948291 ,
       -629.10679271,  105.10233984])
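
The six coefficients follow the `ColumnTransformer` order: the three numeric features kept by `SelectKBest` come first, followed by the three principal components. The sketch below shows one way to attach labels to such coefficients on a simplified toy pipeline. The data, the column names and the pre-encoded dummy columns `d1` to `d3` are assumptions made to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
num_cols, dummy_cols = ["a", "b", "c"], ["d1", "d2", "d3"]

# Synthetic data: three numeric columns plus three already-encoded 0/1 dummies.
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=num_cols)
for name in dummy_cols:
    toy[name] = rng.integers(0, 2, size=200).astype(float)
target = 2.0 * toy["a"] - toy["c"] + rng.normal(scale=0.1, size=200)

prep = ColumnTransformer([
    ("numeric", SelectKBest(score_func=f_regression, k=2), num_cols),
    ("categorical", PCA(n_components=2), dummy_cols),
])
model = Pipeline([("prep", prep), ("regressor", Ridge())]).fit(toy, target)

# Recover which numeric columns survived, then append one label per component.
kbest = model.named_steps["prep"].named_transformers_["numeric"]
kept = [col for col, keep in zip(num_cols, kbest.get_support()) if keep]
labels = kept + ["pc1", "pc2"]
coefs = pd.Series(model.named_steps["regressor"].coef_, index=labels)
print(coefs)
```

The principal components are mixtures of the dummy columns, so their coefficients cannot be read as the effect of any single category.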

----------------------------
## Task II

In [83]:
from sklearn.model_selection import GridSearchCV

ridge = Ridge()
rf = RandomForestRegressor() # main parameters to tune: n_estimators, max_depth
gb = GradientBoostingRegressor() # learning_rate (default 0.1), n_estimators (default 100), max_depth (default 3)

pca = PCA() # tune n_components
selection = SelectKBest() # tune k
imp_num = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_cat = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

In [202]:
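from sklearn.feature_selection import f_regression

numeric_transform = Pipeline([('impute_mean', SimpleImputer(strategy='mean')),
                              ('scaling', StandardScaler()),
                              ("kbest", SelectKBest(score_func=f_regression))])  # f_regression: scoring for a continuous target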


categorical_transform = Pipeline([('fix_low_fat', consolidate_low_fat_object),
                                  ('impute_mode', SimpleImputer(strategy='most_frequent')), 
                                  ('one-hot-encode', OneHotEncoder(sparse=False)),
                                 ("pca", PCA())])


preprocessing_sales = ColumnTransformer([('numeric', numeric_transform, num_feats), 
                                        ('categorical', categorical_transform, cat_feats)])

pipeline_ridge = Pipeline(steps = [('preprocessing_sales', preprocessing_sales),
                                     ("regressor", Ridge())])


param_grid_ridge = {'preprocessing_sales__categorical__pca__n_components':[3,5,7],
              'preprocessing_sales__numeric__kbest__k': [1,2,3,4]}
              

grid_ridge = GridSearchCV(pipeline_ridge, param_grid=param_grid_ridge, cv=5)
grid_ridge.fit(df_train, y_train)

best_model = grid_ridge.best_estimator_
best_hyperparams = grid_ridge.best_params_
best_score = grid_ridge.best_score_

print(f'hyperparameters: {best_hyperparams}\n {best_score}')
# y_pred = grid_ridge.predict(df_test)
# rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# r2 = r2_score(y_test, y_pred)
# print(f'\nTest set RMSE: {rmse}\nr2: {r2}')
# best_model.named_steps['regressor'].coef_


hyperparameters: {'preprocessing_sales__categorical__pca__n_components': 7, 'preprocessing_sales__numeric__kbest__k': 2}
 0.5599733752834932
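
`best_score_` is the mean cross-validated score of the best parameter combination. With no `scoring` argument, a regressor is scored with R², so this number is not directly comparable to the test-set RMSE from Task I. To compare every combination rather than only the winner, `cv_results_` can be loaded into a DataFrame, as in this sketch on synthetic data (the data and the `alpha` grid are assumptions for illustration).

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (assumption, for illustration only).
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_demo, y_demo)

# One row per parameter combination, with mean and per-fold scores.
results = pd.DataFrame(search.cv_results_)
print(results[["param_alpha", "mean_test_score", "rank_test_score"]]
      .sort_values("rank_test_score"))
```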


In [207]:
pipeline_rf = Pipeline(steps = [('preprocessing_sales', preprocessing_sales),
                     ("regressor", RandomForestRegressor())])

param_grid_rf = {'regressor__n_estimators': [50, 100, 200],
                 'regressor__max_depth': [10, 20, 30],
                 'preprocessing_sales__categorical__pca__n_components': [3, 5, 7],
                 'preprocessing_sales__numeric__kbest__k': [1, 2, 3, 4]}
              
grid_rf = GridSearchCV(pipeline_rf, param_grid=param_grid_rf, cv=5, n_jobs = -1)
grid_rf.fit(df_train, y_train)

# best_model = grid_rf.best_estimator_
best_hyperparams = grid_rf.best_params_
best_score = grid_rf.best_score_
print(f'hyperparameters: {best_hyperparams}\n {best_score}')

hyperparameters: {'preprocessing_sales__categorical__pca__n_components': 5, 'preprocessing_sales__numeric__kbest__k': 4, 'regressor__max_depth': 10, 'regressor__n_estimators': 200}
 0.5799716368817872


In [208]:
# GradientBoostingRegressor defaults: learning_rate=0.1, n_estimators=100, max_depth=3
pipeline_gb = Pipeline(steps = [('preprocessing_sales', preprocessing_sales),
                     ("regressor", GradientBoostingRegressor())])

param_grid_gb = {'regressor__n_estimators': [50, 100, 200],
                 'regressor__max_depth': [10, 20, 30],
                 'preprocessing_sales__categorical__pca__n_components': [3, 5, 7],
                 'preprocessing_sales__numeric__kbest__k': [1, 2, 3, 4]}
              
grid_gb = GridSearchCV(pipeline_gb, param_grid=param_grid_gb, cv=5, n_jobs = -1)
grid_gb.fit(df_train, y_train)

# best_model = grid_gb.best_estimator_
best_hyperparams = grid_gb.best_params_
best_score = grid_gb.best_score_
print(f'hyperparameters: {best_hyperparams}\n {best_score}')

hyperparameters: {'preprocessing_sales__categorical__pca__n_components': 7, 'preprocessing_sales__numeric__kbest__k': 3, 'regressor__max_depth': 10, 'regressor__n_estimators': 50}
 0.527652957575053


In [219]:
# print('Final score is: ', tuned_model.score(df_test, y_test))

Final score is:  0.6241741712069144
