## Pipelines Challenge

In this challenge, we will work with this [dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing) to predict sales.

**The main goal is to create a `pipeline` that covers all the data preprocessing and modeling steps.**


**TASK 1**: Build a pipeline that ends with a regression model, to predict `Item_Outlet_Sales` from the dataset. 

**The pipeline should have the following steps:**

1. Split the features into numerical and categorical (text)
2. Replace null values
    - the mean for numerical variables
    - the most frequent value for categorical variables
3. Create dummy variables from categorical features
4. Use PCA to reduce the dummy variables to 3 principal components. PCA runs directly after the OneHotEncoder, which outputs a sparse matrix, so we will need the **ToDenseTransformer** from the [article about custom pipelines](https://queirozf.com/entries/scikit-learn-pipelines-custom-pipelines-and-pandas-integration) to convert the data to a dense array first.
5. Select the 3 best candidates from the original numerical features using `SelectKBest`
6. Fit a Ridge regression (default alpha is fine for now)

**TASK 2**: Tune the parameters of multiple models as well as the preprocessing steps and find the best solution.
- Try these models:
    - Random Forest Regressor
    - Gradient Boosting Regressor
    - Ridge Regression
- For Task 2, we will need to use the same approach as in this [earlier article](https://iaml.it/blog/optimizing-sklearn-pipelines), in the section `PIPELINE TUNING (ADVANCED VERSION)`, where we tried different kinds of scalers. (Use the article as a reference.)

_________________________________

In [1]:
import pandas as pd
df = pd.read_csv("regression_exercise.csv")
df.head()

| Unnamed: 0 | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [2]:
# creating target variable
y = df["Item_Outlet_Sales"]
df = df.drop(["Item_Outlet_Sales","Item_Identifier"],axis = 1)

Split the dataset into a training set and a test set.

**Note:** We should always do this at the start, before fitting the pipeline, so that no information from the test set leaks into the preprocessing steps.

In [7]:
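# sample 80% of the rows for training (no random_state is set, so the split changes on every run)
df_train = df.sample(frac=0.8).sort_index()
# keep the target values whose index labels are in the training set
y_train = y[y.index.isin(df_train.index)]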

In [8]:
# the remaining 20% of the rows form the test set
df_test = df[~df.index.isin(df_train.index)].sort_index()
y_test = y[y.index.isin(df_test.index)]
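
The manual split above can also be expressed with scikit-learn's `train_test_split`, which keeps features and target aligned and accepts a seed for reproducibility. A minimal sketch on a toy frame; the values, the variable names and `random_state=42` are illustrative assumptions, not taken from the notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9, 120.0, 99.5, 60.2, 210.4, 75.0],
    "Item_Weight": [9.3, 5.92, 17.5, 19.2, 8.93, 10.4, 13.6, 16.2, 19.2, 11.8],
})
toy_y = pd.Series(range(10), name="Item_Outlet_Sales")

# 80/20 split; a fixed random_state makes the split repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_y, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```

With a fixed seed, scores from different runs become comparable, which the unseeded `df.sample` call does not guarantee.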

---------------------
## Task I

### Split Features into numerical and categorical

In [3]:
cat_feats = df.dtypes[df.dtypes == 'object'].index.tolist()
num_feats = df.dtypes[~df.dtypes.index.isin(cat_feats)].index.tolist()
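
The same split by type can be sketched with `DataFrame.select_dtypes`, which avoids comparing dtypes by hand. The toy frame and the names `num_cols` / `cat_cols` below are assumptions for illustration only:

```python
import pandas as pd

# Toy frame with two numeric columns and one text column (hypothetical values).
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, None],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat"],
    "Outlet_Establishment_Year": [1999, 2009, 1999],
})

# Numeric columns (ints and floats); everything else is treated as categorical.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(num_cols, cat_cols)
```

Note that any numeric column counts as a feature here, so an index-like column such as `Unnamed: 0` would be included unless it is dropped first.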

In [29]:
from sklearn.preprocessing import FunctionTransformer

# Custom column-selection functions, to be wrapped in FunctionTransformer and used as pipeline steps
def numFeat(data):
    return data[num_feats]

def catFeat(data):
    return data[cat_feats]

In [30]:
# each selector starts its own pipeline, one per feature type
keep_num = FunctionTransformer(numFeat)
keep_cat = FunctionTransformer(catFeat)
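
To see what such a selector does on its own, here is a small sketch with a toy frame. The frame, `toy_num_cols` and `keep_numeric` are hypothetical names introduced for the example:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

toy = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"], "c": [3, 4]})
toy_num_cols = ["a", "c"]

def keep_numeric(data):
    # Return only the numeric columns of the incoming DataFrame.
    return data[toy_num_cols]

# FunctionTransformer is stateless: fit learns nothing, transform calls the function.
selector = FunctionTransformer(keep_numeric)
out = selector.fit_transform(toy)
print(out.columns.tolist())
```

Because `validate` is `False` by default, the DataFrame is passed to the function unchanged, so column selection by name works.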

### Replacing null values

In [195]:
# Use SimpleImputer

In [6]:
from sklearn.impute import SimpleImputer

In [23]:
import numpy as np
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
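
A quick sketch of what the two strategies do, on toy columns with made-up values (the variable names are assumptions for the example):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

num = pd.DataFrame({"Item_Weight": [9.3, np.nan, 17.5]})
cat = pd.DataFrame({"Outlet_Size": ["Medium", np.nan, "Medium", "High"]}, dtype=object)

# Numeric column: the missing value becomes the column mean, (9.3 + 17.5) / 2 = 13.4.
num_filled = SimpleImputer(strategy="mean").fit_transform(num)

# Text column: the missing value becomes the most frequent category, "Medium".
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(cat)

print(num_filled.ravel(), cat_filled.ravel())
```

Inside a pipeline, the fill values are learned from the training data only and then reused on the test data.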

### Creating dummy variables

In [196]:
# use OneHotEncoder

In [10]:
from sklearn.preprocessing import OneHotEncoder

In [21]:
enc = OneHotEncoder(handle_unknown='ignore')
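
`handle_unknown='ignore'` matters because the test set may contain a category that was never seen during fitting. A small sketch with toy values (the arrays and `enc_demo` are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["Tier 1"], ["Tier 3"], ["Tier 1"]], dtype=object)
test = np.array([["Tier 2"]], dtype=object)  # category not seen during fit

enc_demo = OneHotEncoder(handle_unknown="ignore").fit(train)

# The unseen category is encoded as all zeros instead of raising an error.
encoded = enc_demo.transform(test)
print(type(encoded).__name__, encoded.toarray())
```

The result is a SciPy sparse matrix by default, which is why a densifying step is needed before PCA.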

### Use PCA to reduce the number of dummy variables to 3 principal components.

In [24]:
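# don't forget ToDenseTransformer after the one-hot encoder
from sklearn.base import BaseEstimator, TransformerMixin

class ToDenseTransformer(BaseEstimator, TransformerMixin):

    # nothing to learn, so just return self
    def fit(self, X, y=None, **fit_params):
        return self

    # convert the sparse matrix to a dense ndarray; toarray() is used instead of
    # todense() because todense() returns np.matrix, which recent scikit-learn versions reject
    def transform(self, X, y=None, **fit_params):
        return X.toarray()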

In [13]:
from sklearn.decomposition import PCA

In [14]:
pca = PCA(n_components=3)
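
If densifying the one-hot matrix is too costly, one alternative is `TruncatedSVD`, which accepts sparse input directly. It is a different technique from the PCA the task asks for (it does not center the data), so treat this only as a sketch of an option; the category values below are made up:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# Two hypothetical categorical columns with five possible values each.
cats = rng.choice(["Dairy", "Meat", "Household", "Soft Drinks", "Snacks"], size=(50, 2))

# One-hot encoding yields a sparse matrix; TruncatedSVD reduces it without densifying.
sparse_dummies = OneHotEncoder(handle_unknown="ignore").fit_transform(cats)
reduced = TruncatedSVD(n_components=3, random_state=0).fit_transform(sparse_dummies)
print(reduced.shape)
```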

### Select the 3 best numeric features

In [3]:
# use SelectKBest

In [15]:
from sklearn.feature_selection import SelectKBest

In [16]:
selection = SelectKBest(k=3)
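
One thing to be aware of: `SelectKBest` defaults to `score_func=f_classif`, an ANOVA F-test intended for classification targets. For a continuous target such as `Item_Outlet_Sales`, `f_regression` is usually the more appropriate scorer. A minimal sketch on synthetic data (the data and `k=2` are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only columns 0 and 3 drive the target; the others are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Score each feature with a univariate linear-regression F-test and keep the best two.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
print(selector.get_support(indices=True))
```

Switching the scorer in the notebook would change which features are kept, so the scores reported later would not necessarily stay the same.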

### Fitting models

In [18]:
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Use base_model in Task I
base_model = Ridge()

### Building a Pipeline

In [19]:
from sklearn.pipeline import Pipeline, FeatureUnion

In [31]:
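# numeric branch: select columns -> impute with the mean -> keep the k best features
pipeline1 = Pipeline([
    ('numFeats', keep_num),
    ('missing_num', imp_mean),
    ('selection', selection)
])
# categorical branch: select columns -> impute with the mode -> one-hot encode -> densify -> PCA
pipeline2 = Pipeline([
    ('catFeats', keep_cat),
    ('missing_cat', imp_mode),
    ('ohe', enc),
    ('to_dense', ToDenseTransformer()),
    ('pca', pca)
])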

In [32]:
# concatenate the outputs of both branches column-wise
pipeline3 = FeatureUnion([('pipe1', pipeline1), ('pipe2', pipeline2)])

In [33]:
final_pipe = Pipeline([("features", pipeline3), ("ridge", base_model)])
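
The same structure can also be sketched with `ColumnTransformer`, which routes columns to each branch by name and so replaces the two `FunctionTransformer` selectors. This is an alternative layout, not the one the challenge prescribes. The toy data, the helper `to_dense_array`, the step names and the use of `f_regression` are all assumptions made for the example:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "Item_Weight": rng.uniform(5, 20, n),
    "Item_Visibility": rng.uniform(0, 0.2, n),
    "Item_MRP": rng.uniform(30, 260, n),
    "Outlet_Establishment_Year": rng.integers(1985, 2010, n),
    "Item_Type": pd.Series(rng.choice(["Dairy", "Meat", "Household", "Snacks"], n), dtype=object),
    "Outlet_Size": pd.Series(rng.choice(["Small", "Medium", "High"], n), dtype=object),
})
toy.loc[::7, "Item_Weight"] = np.nan   # inject some missing values
toy.loc[::9, "Outlet_Size"] = np.nan
toy_y = 10 * toy["Item_MRP"] + rng.normal(scale=5, size=n)

num_cols = ["Item_Weight", "Item_Visibility", "Item_MRP", "Outlet_Establishment_Year"]
cat_cols = ["Item_Type", "Outlet_Size"]

def to_dense_array(X):
    # Convert a SciPy sparse matrix to a dense ndarray (no-op for dense input).
    return X.toarray() if hasattr(X, "toarray") else np.asarray(X)

num_branch = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("select", SelectKBest(score_func=f_regression, k=3)),
])
cat_branch = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ("to_dense", FunctionTransformer(to_dense_array)),
    ("pca", PCA(n_components=3)),
])
preprocess = ColumnTransformer([
    ("num", num_branch, num_cols),
    ("cat", cat_branch, cat_cols),
])
model = Pipeline([("features", preprocess), ("ridge", Ridge())])
model.fit(toy, toy_y)
print(model.named_steps["features"].transform(toy).shape)
```

The transformed matrix has six columns: three selected numeric features plus three principal components.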


In [4]:
# model.score(df_test,y_test)

In [35]:
final_pipe.fit(df_train,y_train)



Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('pipe1',
                                                 Pipeline(steps=[('numFeats',
                                                                  FunctionTransformer(func=<function numFeat at 0x000002BF24E03EE0>)),
                                                                 ('missing_num',
                                                                  SimpleImputer()),
                                                                 ('selection',
                                                                  SelectKBest(k=3))])),
                                                ('pipe2',
                                                 Pipeline(steps=[('catFeats',
                                                                  FunctionTransformer(func=<function catFeat at 0x000002BF24E03F70>)),
                                                                 ('missing_cat',
                

In [37]:
final_pipe.score(df_test,y_test)



0.39590528907534517
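
For a regressor, `Pipeline.score` delegates to the final estimator's `score` method, which returns the coefficient of determination R², so the value above means the model explains roughly 40% of the variance in test-set sales. A small sketch of how R² and RMSE are computed, using made-up numbers rather than the notebook's predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

# R^2 = 1 - SS_res / SS_tot = 1 - 1.0 / 20.0
r2 = r2_score(y_true, y_pred)

# RMSE is in the units of the target, which makes it easier to interpret.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(r2, rmse)
```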

----------------------------
## Task II

In [39]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

In [41]:
ridge = Ridge()
rf = RandomForestRegressor()
gbr = GradientBoostingRegressor()

models = [ridge, rf, gbr]

params = [
    {
        'features__pipe1__selection__k':[3,4,5],
        'features__pipe2__pca__n_components': [3, 4,5],
        'models':[ridge],
        'models__alpha':[0.001, 0.01, 0.1, 1.0, 10]
    },
    {
        'features__pipe1__selection__k':[3,4,5],
        'features__pipe2__pca__n_components': [3, 4,5],
        'models':[rf],
        "models__n_estimators":[100, 150, 200],
        "models__max_depth":[5,7, 10]
        
    },
    {
        'features__pipe1__selection__k':[3,4,5],
        'features__pipe2__pca__n_components': [3, 4,5],
        'models':[gbr],
        "models__learning_rate":[0.1, 0.2, 0.3],
        "models__n_estimators":[100, 150, 200],
        "models__max_depth":[5,7, 10]
    }
]
# the last step is named "models" so that each grid above can swap in its own estimator
final_pipe = Pipeline([("features", pipeline3), ("models", base_model)])
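
The keys in the parameter grids follow the step names joined by double underscores, and a whole step can be replaced by using its name as the key. A minimal sketch with a simplified pipeline; `StandardScaler`, `Lasso` and `demo_pipe` are stand-ins chosen for the example, not part of the notebook:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

union = FeatureUnion([
    ("pipe1", Pipeline([("scale", StandardScaler())])),
    ("pipe2", Pipeline([("pca", PCA(n_components=2))])),
])
demo_pipe = Pipeline([("features", union), ("models", Ridge())])

# Every tunable name is listed by get_params(); nested steps are joined with "__".
tunable = sorted(demo_pipe.get_params().keys())
print([name for name in tunable if name.endswith(("n_components", "alpha"))])

# Using the step name itself as the key replaces the whole estimator.
demo_pipe.set_params(models=Lasso())
print(type(demo_pipe.named_steps["models"]).__name__)
```

Listing `get_params().keys()` on the real pipeline is a quick way to check a grid key for typos before starting a long search.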


In [42]:
gridsearch = GridSearchCV(final_pipe, params, verbose=1).fit(df_train, y_train)
print('Final training score is: ', gridsearch.score(df_train, y_train))

Fitting 5 folds for each of 369 candidates, totalling 1845 fits

615 fits failed out of a total of 1845.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
615 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\User\anaconda3\envs\midproj_env\lib\site-packages\sklearn\model_selection\_validation.py", line 681, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\User\anaconda3\envs\midproj_env\lib\site-packages\sklearn\pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\User\anaconda3\envs\midproj_env\lib\site-packages\sklearn\pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\User\anaconda3\envs\midproj_env\lib\site-packages\joblib\memory.py"



Final training score is:  0.6453135613481296
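
The traceback above is truncated, so the root cause is not visible here; it only shows that the error occurred while a transformer step of the pipeline was being fitted. Note that 615 is exactly one third of 1845, which would be consistent with one value of a three-valued parameter failing in every combination, but that is a guess. As the warning suggests, `error_score='raise'` surfaces the full error. The sketch below uses synthetic data and a deliberately invalid PCA setting; it is not a reproduction of the notebook's failure:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X[:, 0] + rng.normal(scale=0.1, size=60)

pipe = Pipeline([("pca", PCA()), ("model", Ridge())])
grid = {"pca__n_components": [2, 10]}  # 10 exceeds the 4 available features

try:
    # With error_score="raise", the first failing fit aborts the search
    # and the original exception is propagated.
    GridSearchCV(pipe, grid, cv=3, error_score="raise").fit(X, y)
    failure = None
except ValueError as exc:
    failure = str(exc)
print(failure)
```

Once the offending value is identified, removing it from the grid avoids spending time on fits that can never succeed.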


In [43]:
print('Final testing score is: ', gridsearch.score(df_test, y_test))

Final testing score is:  0.5770243975149199




In [44]:
gridsearch.best_params_

{'features__pipe1__selection__k': 4,
 'features__pipe2__pca__n_components': 3,
 'models': RandomForestRegressor(max_depth=7, n_estimators=150),
 'models__max_depth': 7,
 'models__n_estimators': 150}
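
`best_params_` is chosen by the mean cross-validated score, which is also available as `best_score_`. To compare all candidates rather than only the winner, `cv_results_` can be loaded into a DataFrame. A minimal sketch with a small Ridge search on synthetic data (the data and the grid values are assumptions for the example):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=80)

search = GridSearchCV(Ridge(), {"alpha": [0.01, 1.0, 100.0]}, cv=3).fit(X, y)

# One row per candidate, ordered from best to worst mean validation score.
results = (
    pd.DataFrame(search.cv_results_)
    [["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results)
print(search.best_score_)
```

Looking at `std_test_score` alongside the mean helps judge whether the gap between the top candidates is meaningful.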