## Pipelines Challenge

In this challenge, we will be working with this [dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing), where we will be predicting sales. 

**The main goal is to create a `pipeline` that covers all the data preprocessing and modeling steps.**


**TASK 1**: Build a pipeline that ends with a regression model, to predict `Item_Outlet_Sales` from the dataset. 

**The pipeline should have the following steps:**

1. Split the features into numerical and categorical (text)
2. Replace null values
    - the mean for numerical variables
    - the most frequent value for categorical variables
3. Create dummy variables from categorical features
4. Use PCA to reduce the dummy variables to 3 principal components. PCA is applied directly after the OneHotEncoder, which outputs a sparse matrix, so we will need the **ToDenseTransformer** from the [article about custom pipelines](https://queirozf.com/entries/scikit-learn-pipelines-custom-pipelines-and-pandas-integration).
5. Select the 3 best candidates from the original numerical features using `SelectKBest`
6. Fit a Ridge regression (default alpha is fine for now)

**TASK 2**: Tune the parameters of multiple models as well as the preprocessing steps and find the best solution.
- Try these models:
    - Random Forest Regressor
    - Gradient Boosting Regressor
    - Ridge Regression
- For Task 2, we will use the same approach as in this [earlier article](https://iaml.it/blog/optimizing-sklearn-pipelines), in the section `PIPELINE TUNING (ADVANCED VERSION)`, where we tried different kinds of scalers. (Use the article as a reference.)
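
As a rough illustration of that tuning approach, the sketch below searches over both a preprocessing step and the model in one `GridSearchCV`. It runs on synthetic data from `make_regression`, not on the sales dataset, and the step names `scaler` and `model` as well as all parameter values are assumptions chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the sales data (assumption, not the real dataset).
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", Ridge())])

# One dict per model family, so each model is only combined with its own parameters.
param_grid = [
    {
        "scaler": [StandardScaler(), MinMaxScaler()],
        "model": [Ridge()],
        "model__alpha": [0.1, 1.0, 10.0],
    },
    {
        "scaler": [StandardScaler()],
        "model": [RandomForestRegressor(n_estimators=20, random_state=0)],
        "model__max_depth": [3, 5],
    },
]

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Keeping a separate dict per model avoids invalid combinations such as `model__alpha` being passed to a random forest.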

_________________________________

In [1]:
import pandas as pd
df = pd.read_csv("regression_exercise.csv")
df.head()

| Unnamed: 0 | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 |  | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [2]:
# creating target variable
y = df["Item_Outlet_Sales"]
df = df.drop(["Item_Outlet_Sales","Item_Identifier"],axis = 1)

Split the dataset into a train and test set.

**Note:** We should always do this at the beginning, before fitting the pipeline, so that no information from the test set leaks into the preprocessing steps.
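
For comparison, here is a minimal sketch of the same 80/20 split done with scikit-learn's `train_test_split`. The toy frame and the variable names (`toy`, `X_toy`, `y_toy`) are made up for the example; the fixed `random_state` is an assumption added to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9, 120.0, 75.5, 199.9, 33.3, 88.8],
    "Outlet_Size": ["Medium", "Medium", "Medium", None, "High",
                    "Small", "Small", "High", None, "Medium"],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 732.4, 994.7,
                          1500.0, 820.5, 2900.2, 310.9, 1120.3],
})

X_toy = toy.drop(columns="Item_Outlet_Sales")
y_toy = toy["Item_Outlet_Sales"]

# Features and target are split together, so their indices stay aligned.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```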

In [3]:
df_train = df.sample(frac=0.8).sort_index()
y_train = y[y.index.isin(df_train.index.tolist())]

In [4]:
df_test = df[~df.index.isin(df_train.index.tolist())].sort_index()
y_test = y[y.index.isin(df_test.index.tolist())]

---------------------
## Task I

### Split Features into numerical and categorical

In [5]:
cat_feats = df.dtypes[df.dtypes == 'object'].index.tolist()
num_feats = df.dtypes[~df.dtypes.index.isin(cat_feats)].index.tolist()
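
An alternative way to build the two column lists is `DataFrame.select_dtypes`. The sketch below uses a small made-up frame (`toy`) and treats every non-numeric column as categorical, which is an assumption that happens to fit this dataset.

```python
import pandas as pd

# Toy frame with two numeric and two text columns (hypothetical values).
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, None, 19.2],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat", "Regular"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
    "Outlet_Size": ["Medium", "Medium", None, "High"],
})

# Numeric columns, and everything else as categorical.
num_cols_toy = toy.select_dtypes(include="number").columns.tolist()
cat_cols_toy = toy.select_dtypes(exclude="number").columns.tolist()

print(num_cols_toy)
print(cat_cols_toy)
```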

In [6]:
from sklearn.preprocessing import FunctionTransformer

# Using own function in Pipeline
def numFeat(data):
    return data[num_feats]

def catFeat(data):
    return data[cat_feats]

In [7]:
# we will start two separate pipelines for each type of features
keep_num = FunctionTransformer(numFeat)
keep_cat = FunctionTransformer(catFeat)

In [8]:
categorical_cols = [cname for cname in df_train.columns if
                        df_train[cname].dtype == "object"]

In [9]:
numerical_cols = [cname for cname in df_train.columns if df_train[cname].dtype in ['int64', 'float64']]

### Replacing null values

In [10]:
# Imports for the preprocessing pipelines (SimpleImputer replaces the null values)
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

class ToDenseTransformer(BaseEstimator, TransformerMixin):

    # convert a SciPy sparse matrix into a dense NumPy array
    def transform(self, X, y=None, **fit_params):
        return X.toarray()

    # nothing to learn here, just return self
    def fit(self, X, y=None, **fit_params):
        return self

# need to make matrices dense because PCA does not work with sparse vectors.
# pipeline = Pipeline([
#     ('to_dense',ToDenseTransformer()),
#     ('pca',PCA(n_components=3))])
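
To see the dense conversion in action, the sketch below chains a sparse `OneHotEncoder`, a small converter and `PCA` on a made-up categorical frame. `DenseConverter` is a hypothetical stand-in written for this example, and the column values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class DenseConverter(BaseEstimator, TransformerMixin):
    """Hypothetical helper: turn a sparse matrix into a dense NumPy array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Sparse matrices expose toarray(); anything else is passed through as an array.
        return X.toarray() if hasattr(X, "toarray") else np.asarray(X)


toy_cat = pd.DataFrame({
    "Outlet_Size": ["Medium", "High", "Small", "Medium", "High", "Small"],
    "Outlet_Location_Type": ["Tier 1", "Tier 3", "Tier 2", "Tier 3", "Tier 1", "Tier 2"],
})

cat_pipe = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore")),  # sparse output by default
    ("to_dense", DenseConverter()),
    ("pca", PCA(n_components=3)),
])

components = cat_pipe.fit_transform(toy_cat)
print(components.shape)
```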

In [20]:
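# sparse=False makes the encoder return a dense array, so ToDenseTransformer is not needed here
# (in scikit-learn >= 1.2 this argument is called sparse_output)
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                          ('onehot', OneHotEncoder(handle_unknown='ignore',sparse=False)),
                                          #('to_dense', ToDenseTransformer()),
                                          ('pca', PCA(n_components=3))
                                         ])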

In [21]:
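from sklearn.feature_selection import SelectKBest, f_regression
# SelectKBest defaults to f_classif, which is meant for classification targets;
# f_regression is the matching score function for a continuous target.
numerical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                                      ('Kbest', SelectKBest(score_func=f_regression, k=3))
                                      ])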
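
A quick way to check what `SelectKBest` keeps is to run it on data where the answer is known. The sketch below builds synthetic features in which only columns 0, 2 and 4 drive the target, and uses `f_regression` as the score function (an assumption made here because the target is continuous).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_num = rng.normal(size=(500, 5))

# The target depends only on columns 0, 2 and 4, plus a little noise.
y_num = (2.0 * X_num[:, 0] + 2.0 * X_num[:, 2] + 2.0 * X_num[:, 4]
         + rng.normal(scale=0.1, size=500))

selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X_num, y_num)

# Indices of the columns that were kept.
kept = selector.get_support(indices=True)
print(kept, X_selected.shape)
```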

In [22]:
# Bundle preprocessing for numerical and categorical data
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

In [23]:
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor()

In [34]:
from sklearn.metrics import r2_score

# Bundle preprocessing and modeling code in a pipeline.
# Note: model_best is the tuned GradientBoostingRegressor defined in Task II (cell In [33]);
# this cell was run after the tuning step.
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model_best)
                             ])

# Preprocess the training data and fit the model
my_pipeline.fit(df_train, y_train)

# Preprocess the test data and get predictions
preds = my_pipeline.predict(df_test)


# Evaluate the model
score = r2_score(y_test, preds)
print('r2 score:', score)
my_pipeline.score(df_test, y_test)

r2 score: 0.5758001391146859


0.5758001391146859

In [25]:
from sklearn import set_config
set_config(display='diagram')
my_pipeline

In [15]:
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Use base_model in Task I
base_model = Ridge()

----------------------------
## Task II

In [26]:
df_train_grid = preprocessor.fit_transform(df_train, y_train)

In [27]:
df_train_grid_short = df_train_grid[:1000]

In [28]:
y_train_grid_short = y_train[:1000]

In [31]:
from sklearn.model_selection import RandomizedSearchCV

parameters = {'learning_rate': [0.01, 0.02, 0.03, 0.04],
              'subsample'    : [0.9, 0.5, 0.2, 0.1],
              'n_estimators' : [100, 500, 1000, 1500],
              'max_depth'    : [4, 6, 8, 10]
             }

# RandomizedSearchCV samples n_iter=10 parameter combinations by default
grid_gradient = RandomizedSearchCV(GradientBoostingRegressor(), param_distributions=parameters, cv=5, verbose=1)
grid_gradient.fit(df_train_grid_short, y_train_grid_short)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
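
The "10 candidates" in the log come from the default `n_iter=10` of `RandomizedSearchCV`, while the full grid is much larger. The sketch below counts both using `ParameterGrid` and `ParameterSampler`; the `random_state` is an assumption added only to make the sample repeatable.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

parameters = {
    "learning_rate": [0.01, 0.02, 0.03, 0.04],
    "subsample": [0.9, 0.5, 0.2, 0.1],
    "n_estimators": [100, 500, 1000, 1500],
    "max_depth": [4, 6, 8, 10],
}

# Every combination an exhaustive grid search would try: 4 * 4 * 4 * 4.
full_grid = ParameterGrid(parameters)
print(len(full_grid))

# The random search only draws n_iter of them (10 by default).
sampled = list(ParameterSampler(parameters, n_iter=10, random_state=0))
print(len(sampled))
```

With 5-fold cross-validation, the 10 sampled candidates give the 50 fits reported above, whereas the full grid would need 1280.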


In [None]:
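# grid_gradient was fitted on preprocessed features, so the test set must be transformed the same way
print('Final score is: ', grid_gradient.score(preprocessor.transform(df_test), y_test))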

In [32]:
grid_gradient.best_params_

{'subsample': 0.5, 'n_estimators': 100, 'max_depth': 8, 'learning_rate': 0.03}

In [33]:
model_best = GradientBoostingRegressor(subsample=0.5, n_estimators=100,max_depth=8, learning_rate=0.03)

In [30]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import xgboost as xgb

In [None]:
param_grid = {'model': [GradientBoostingRegressor(), RandomForestRegressor(), Ridge(), xgb.XGBRegressor()]    # Which regressor works best in the 'model' step?
              }
grid = GridSearchCV(my_pipeline, param_grid=param_grid, cv=5)
grid.fit(df_train, y_train)

In [None]:
print('Final score is: ', grid.score(df_test, y_test))

In [None]:
grid.best_params_
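
Besides `best_params_`, the search object stores the scores of every candidate in `cv_results_`, which is handy for comparing the models side by side. The sketch below is a small self-contained example on synthetic data; the names `demo_pipe` and `demo_grid` and the reduced set of two models are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data (assumption, not the sales dataset).
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=0)

demo_pipe = Pipeline(steps=[("model", Ridge())])
demo_grid = GridSearchCV(
    demo_pipe,
    param_grid={"model": [Ridge(), GradientBoostingRegressor(n_estimators=20, random_state=0)]},
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate, best-ranked first.
results = pd.DataFrame(demo_grid.cv_results_)
summary = results[
    ["param_model", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary)
```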