## Pipelines Challenge

In this challenge, we will be working with this [dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing), where we will be predicting sales. 

**The main goal is to create a `pipeline` that covers all the data preprocessing and modeling steps.**


**TASK 1**: Build a pipeline that ends with a regression model, to predict `Item_Outlet_Sales` from the dataset. 

**The pipeline should have the following steps:**

1. Split the features into numerical and categorical (text)
2. Replace null values
    - the mean for numerical variables
    - the most frequent value for categorical variables
3. Create dummy variables from categorical features
4. Use PCA to reduce the dummy variables to 3 principal components. PCA runs directly after the `OneHotEncoder`, which outputs a sparse matrix, so we will need the **ToDenseTransformer** from the [article about custom pipelines](https://queirozf.com/entries/scikit-learn-pipelines-custom-pipelines-and-pandas-integration).
5. Select the 3 best candidates from the original numerical features using `SelectKBest`
6. Fit a Ridge regression (default alpha is fine for now)

**TASK 2**: Tune the parameters of multiple models as well as the preprocessing steps and find the best solution.
- Try these models:
    - Random Forest Regressor
    - Gradient Boosting Regressor
    - Ridge Regression
- For Task 2, use the same approach as in this [earlier article](https://iaml.it/blog/optimizing-sklearn-pipelines), in the section `PIPELINE TUNING (ADVANCED VERSION)`, where different kinds of scalers were tried. (Use the article as a reference.)

_________________________________

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd
df = pd.read_csv("/regression_exercise.csv")
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [3]:
# Create the target variable, then drop it (and the item identifier) from the features
y = df["Item_Outlet_Sales"]
df = df.drop(["Item_Outlet_Sales", "Item_Identifier"], axis=1)

In [4]:
df.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1
1,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2
2,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1
3,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store
4,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1


Split the dataset into a train and test set.

**Note:** Always do this first, before fitting the pipeline, so that no preprocessing step sees the test data.

In [5]:
df_train = df.sample(frac=0.8).sort_index()
y_train = y[y.index.isin(df_train.index.tolist())]

In [6]:
df_test = df[~df.index.isin(df_train.index.tolist())].sort_index()
y_test = y[y.index.isin(df_test.index.tolist())]
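
The manual split above draws a different random sample on every run because `sample` is called without a seed. As a minimal sketch of a repeatable alternative, `train_test_split` can do the same 80/20 split in one call. The toy frame and the seed `42` below are made-up values for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real features and target (values are made up).
toy_X = pd.DataFrame(
    {"Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9, 51.4, 57.7, 107.8, 96.9, 187.8]}
)
toy_y = pd.Series([3735.1, 443.4, 2097.3, 732.4, 994.7, 556.6, 343.5, 4022.8, 1076.6, 4710.5])

# 80/20 split; random_state is an arbitrary seed that makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
```

Features and target stay aligned because both are split with the same shuffled indices.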

---------------------
## Task I

### Split Features into numerical and categorical

In [7]:
cat_feats = df.dtypes[df.dtypes == 'object'].index.tolist()
num_feats = df.dtypes[~df.dtypes.index.isin(cat_feats)].index.tolist()
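
The same numerical/categorical split can also be expressed with `DataFrame.select_dtypes`. This is a minimal sketch on a made-up four-column frame, not the real dataset; the names `toy`, `toy_num` and `toy_cat` are hypothetical.

```python
import pandas as pd

# Toy frame with two numeric and two text columns (values are made up).
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, None],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat"],
    "Item_MRP": [249.8092, 48.2692, 141.618],
    "Outlet_Size": ["Medium", None, "High"],
})

# Numeric columns, and everything that is not numeric.
toy_num = toy.select_dtypes(include="number").columns.tolist()
toy_cat = toy.select_dtypes(exclude="number").columns.tolist()

print(toy_num)
print(toy_cat)
```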

In [8]:
from sklearn.preprocessing import FunctionTransformer


# Helper functions that can be passed to FunctionTransformer to select columns
def numFeat(data):
    return data[num_feats]

def catFeat(data):
    return data[cat_feats]

In [9]:
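# We will start two separate pipelines, one for each type of feature.
# FunctionTransformer() with no function is an identity step;
# the columns themselves are selected later by ColumnTransformer.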
keep_num = FunctionTransformer()
keep_cat = FunctionTransformer()
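
A `FunctionTransformer()` created without a function passes its input through unchanged, so `numFeat` and `catFeat` are not actually called by `keep_num` and `keep_cat`. The sketch below contrasts the identity behaviour with a transformer that wraps a selector function. The toy frame and the names `toy_num_feats`, `keep_numeric`, `identity` and `selector` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Toy frame with one numeric and one text column (values are made up).
toy = pd.DataFrame({"Item_MRP": [249.8, 48.3], "Item_Type": ["Dairy", "Meat"]})
toy_num_feats = ["Item_MRP"]

def keep_numeric(data):
    """Return only the numeric columns of a DataFrame."""
    return data[toy_num_feats]

identity = FunctionTransformer()              # no func: input is returned as-is
selector = FunctionTransformer(keep_numeric)  # func: columns are filtered

passed_through = identity.fit_transform(toy)
selected = selector.fit_transform(toy)

print(passed_through.shape, list(selected.columns))
```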

### replacing null values

In [10]:
# Use SimpleImputer
from sklearn.impute import SimpleImputer

In [11]:
null_num = SimpleImputer(strategy='mean')
null_cat = SimpleImputer(strategy='most_frequent')
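
To see what the two imputers do, here is a minimal sketch on made-up columns: the numeric gap is filled with the column mean and the categorical gap with the most frequent label. The frames `num` and `cat` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy columns with one missing value each (values are made up).
num = pd.DataFrame({"Item_Weight": [9.0, np.nan, 18.0]})
cat = pd.DataFrame({"Outlet_Size": ["Medium", np.nan, "Medium", "High"]}, dtype=object)

# Mean of the observed weights (9.0 and 18.0) replaces the numeric gap.
num_filled = SimpleImputer(strategy="mean").fit_transform(num)

# The most frequent label ("Medium") replaces the categorical gap.
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(cat)

print(num_filled.ravel())
print(cat_filled.ravel())
```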

### Creating dummy variables

In [12]:
# use OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

In [44]:
encode_cat = OneHotEncoder()
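
With its default settings, `OneHotEncoder` raises an error at transform time if the test set contains a category that was not seen during fitting. One option, sketched here on made-up labels, is `handle_unknown="ignore"`, which encodes an unseen category as a row of zeros. Whether this is needed for the sales data depends on how the random split falls.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training and test labels; "reg" never appears in training (values are made up).
train = np.array([["Low Fat"], ["Regular"], ["Low Fat"]], dtype=object)
test = np.array([["Regular"], ["reg"]], dtype=object)

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# The encoder returns a scipy sparse matrix; toarray() makes it easy to inspect.
encoded = enc.transform(test).toarray()
print(encoded)
```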

### Use PCA to reduce the number of dummy variables to 3 principal components.

In [45]:
from sklearn.base import TransformerMixin,BaseEstimator
from sklearn.decomposition import PCA
import numpy as np

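# don't forget ToDenseTransformer after the one-hot encoder
class ToDenseTransformer(TransformerMixin, BaseEstimator):
    """Convert a scipy sparse matrix into a dense NumPy array."""

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns an ndarray; todense() returns np.matrix,
        # which newer scikit-learn releases no longer accept
        return X.toarray()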

In [46]:
dense = ToDenseTransformer()
pca = PCA(n_components=3)
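
The dense-then-PCA step can be tried in isolation. The sketch below builds a made-up sparse one-hot matrix (20 rows, 5 categories), converts it to a plain NumPy array, and reduces it to 3 components. The data and variable names are illustrative assumptions.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Fake one-hot matrix: each row has a single 1 in one of 5 category columns.
codes = rng.integers(0, 5, size=20)
one_hot = sparse.csr_matrix(np.eye(5)[codes])

# Convert the sparse matrix into a regular ndarray before PCA.
dense_matrix = one_hot.toarray()

components = PCA(n_components=3).fit_transform(dense_matrix)
print(components.shape)
```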

### Select the 3 best numeric features

In [19]:
# use SelectKBest
from sklearn.feature_selection import SelectKBest

In [20]:
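# Note: the default score_func is f_classif (meant for classification targets);
# f_regression is the usual choice for a continuous target such as sales.
select_best = SelectKBest(k=3)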
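
`SelectKBest` defaults to `score_func=f_classif`, which is designed for class labels. For a continuous target, `f_regression` is the usual choice. Here is a minimal sketch on synthetic data, where the target is built from columns 0, 2 and 4 only; the data and coefficients are made up.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Synthetic features: only columns 0, 2 and 4 influence the target.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 1.5 * X[:, 4] + rng.normal(scale=0.1, size=200)

# Score each column with a univariate linear-regression F-test and keep the top 3.
selector = SelectKBest(score_func=f_regression, k=3).fit(X, y)
kept = selector.get_support(indices=True)

print(kept)
```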

### Fitting models

In [21]:
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Use base_model in Task I
base_model = Ridge()

### Building a Pipeline

In [23]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer

In [47]:
# Apply transformations to the numerical and categorical features
numeric_transform = Pipeline([('transform', keep_num),
                              ('impute_mean', null_num)])

categ_transform = Pipeline([('transform', keep_cat),
                            ('impute_mode', null_cat),
                            ('one-hot encode', encode_cat),
                            ('to_dense', dense)
                            ])

# Apply each sub-pipeline to its own columns with ColumnTransformer
preprocess = ColumnTransformer([('numeric', numeric_transform, num_feats),
                                ('categorical', categ_transform, cat_feats)])

# Feature union of PCA and KBest (both are applied to the full preprocessed matrix)
feature_union = FeatureUnion([("PCA", pca),
                              ('KBest', select_best)])
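
In the arrangement above, both PCA and `SelectKBest` receive the full preprocessed matrix (numeric columns plus dummies). The task description asks for PCA on the dummy variables only and `SelectKBest` on the original numerical features only. One way to wire that, as a hedged sketch, is to move each step into its own branch of the `ColumnTransformer`. The toy data, the column names `n1`..`n4`, `c1`, `c2`, the helper `to_dense` and the use of `f_regression` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

rng = np.random.default_rng(0)
n = 60

# Toy data: four numeric and two categorical columns (values are made up).
toy = pd.DataFrame({
    "n1": rng.normal(size=n), "n2": rng.normal(size=n),
    "n3": rng.normal(size=n), "n4": rng.normal(size=n),
    "c1": pd.Series(rng.choice(["a", "b", "c"], size=n), dtype=object),
    "c2": pd.Series(rng.choice(["x", "y", "z"], size=n), dtype=object),
})
target = 2.0 * toy["n1"] + (toy["c1"] == "a").astype(float) + rng.normal(scale=0.1, size=n)
toy.loc[0, "n2"] = np.nan  # one numeric gap for the mean imputer
toy.loc[1, "c2"] = np.nan  # one categorical gap for the mode imputer

def to_dense(X):
    """Return a dense ndarray whether X is sparse or already dense."""
    return X.toarray() if sparse.issparse(X) else np.asarray(X)

# Numeric branch: impute, then keep the 3 best numeric columns.
numeric_branch = Pipeline([
    ("impute_mean", SimpleImputer(strategy="mean")),
    ("kbest", SelectKBest(score_func=f_regression, k=3)),
])

# Categorical branch: impute, one-hot encode, densify, then PCA on the dummies only.
categorical_branch = Pipeline([
    ("impute_mode", SimpleImputer(strategy="most_frequent")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
    ("to_dense", FunctionTransformer(to_dense, accept_sparse=True)),
    ("pca", PCA(n_components=3)),
])

preprocess_alt = ColumnTransformer([
    ("numeric", numeric_branch, ["n1", "n2", "n3", "n4"]),
    ("categorical", categorical_branch, ["c1", "c2"]),
])

model_alt = Pipeline([("preprocessing", preprocess_alt), ("regression", Ridge())])
model_alt.fit(toy, target)

# 3 selected numeric columns + 3 principal components = 6 features.
features = model_alt.named_steps["preprocessing"].transform(toy)
print(features.shape)
```

With this layout no `FeatureUnion` is needed, because the `ColumnTransformer` already concatenates the two branches.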

In [48]:
model = Pipeline(steps = [
    ('preprocessing', preprocess),
    ('features', feature_union),
    ('regression', base_model)
])

model.fit(df_train, y_train)



Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('numeric',
                                                  Pipeline(steps=[('transform',
                                                                   FunctionTransformer()),
                                                                  ('impute_mean',
                                                                   SimpleImputer())]),
                                                  ['Item_Weight',
                                                   'Item_Visibility',
                                                   'Item_MRP',
                                                   'Outlet_Establishment_Year']),
                                                 ('categorical',
                                                  Pipeline(steps=[('transform',
                                                                   FunctionTransformer()),
                                                

In [49]:
model.score(df_test,y_test)



0.5473279231763989
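
For a regressor, `score` returns the coefficient of determination, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, so the value above is an $R^2$ on the test set. A minimal sketch on synthetic data showing that `score` and `r2_score` agree:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic regression problem (values are made up).
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=50)

reg = Ridge().fit(X, y)

score = reg.score(X, y)                # estimator's built-in score
manual = r2_score(y, reg.predict(X))   # the same quantity computed explicitly

print(score, manual)
```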

----------------------------
## Task II

In [32]:
from sklearn.model_selection import GridSearchCV

In [50]:
params = [{
    'regression': [
        Ridge(alpha=0.001), Ridge(alpha=0.1), Ridge(alpha=1), Ridge(alpha=10),
        GradientBoostingRegressor(learning_rate=0.001),
        GradientBoostingRegressor(learning_rate=0.1),
        GradientBoostingRegressor(learning_rate=1),
        RandomForestRegressor(n_estimators=10), RandomForestRegressor(n_estimators=40),
        RandomForestRegressor(n_estimators=60), RandomForestRegressor(n_estimators=100),
    ]
}]
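
An alternative to listing pre-configured estimators is a list of grids, one per model, using the `step__parameter` naming convention. This also lets a preprocessing step be tuned alongside the model. The sketch below is a small stand-in, not the notebook's pipeline: the step name `scale`, the scalers, the synthetic data and the grid values are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)

# Synthetic regression data (values are made up).
X = rng.normal(size=(80, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=80)

pipe = Pipeline([("scale", StandardScaler()), ("regression", Ridge())])

# One dict per model; each dict only lists parameters that model understands.
param_grid = [
    {
        "scale": [StandardScaler(), MinMaxScaler()],  # tune a preprocessing step too
        "regression": [Ridge()],
        "regression__alpha": [0.001, 0.1, 1, 10],
    },
    {
        "regression": [GradientBoostingRegressor(random_state=0)],
        "regression__learning_rate": [0.01, 0.1],
        "regression__n_estimators": [20],
    },
    {
        "regression": [RandomForestRegressor(random_state=0)],
        "regression__n_estimators": [10, 40],
    },
]

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

print(len(search.cv_results_["params"]), "candidates tried")
print(search.best_params_)
```

Here `best_params_` reports the winning hyperparameter values directly, rather than only the estimator object.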

In [None]:
tuned_model = GridSearchCV(model, params, verbose = 10, cv=5)

tuned_model.fit(df_train,y_train)

In [69]:
print('Best Params', tuned_model.best_params_)
print('Final score is: ', tuned_model.score(df_test, y_test))

Best Params {'regression': GradientBoostingRegressor(alpha=0.001)}
Final score is:  0.5783155975964194




In [80]:
# Let us try more params for the gradient boosting regressor.
# Note: alpha is the quantile used by the 'huber' and 'quantile' losses;
# it has no effect with 'squared_error' or 'absolute_error'.
params2 = [{
    'regression': [
        GradientBoostingRegressor(n_estimators=50, alpha=0.001, loss='squared_error'),
        GradientBoostingRegressor(n_estimators=60, alpha=0.001, loss='absolute_error'),
        GradientBoostingRegressor(n_estimators=70, alpha=0.001, loss='huber'),
    ]
}]

In [83]:
tuned_model = GridSearchCV(model, params2, cv=5)

tuned_model.fit(df_train,y_train)



GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessing',
                                        ColumnTransformer(transformers=[('numeric',
                                                                         Pipeline(steps=[('transform',
                                                                                          FunctionTransformer()),
                                                                                         ('impute_mean',
                                                                                          SimpleImputer())]),
                                                                         ['Item_Weight',
                                                                          'Item_Visibility',
                                                                          'Item_MRP',
                                                                          'Outlet_Establishment_Year']),
                                   

In [84]:
print('Best Params', tuned_model.best_params_)
print('Final score is: ', tuned_model.score(df_test, y_test))

Best Params {'regression': GradientBoostingRegressor(alpha=0.001, n_estimators=50)}
Final score is:  0.5823700454044951




In [None]:
# print('Final score is: ', tuned_model.score(df_test, y_test))

Final score is:  0.6241741712069144
