## Machine Learning Pipelines

- Pipelines are useful for data scientists to transform data, train machine learning models, and make predictions.
- The data science process involves several steps, but a streamlined process can be achieved with the Pipeline class in scikit-learn.
- Pipelines can integrate multiple steps of the machine learning workflow and allow for comparing different classification techniques.
- Grid search can be integrated into the pipeline to tune hyperparameters in each of the machine learning models while avoiding data leakage.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Create the pipeline
pipe = Pipeline([('mms', MinMaxScaler()),
                 ('tree', DecisionTreeClassifier(random_state=123))])

In [None]:
# Fit to the training data
pipe.fit(X_train, y_train)

In [None]:
# Calculate the score on test data
pipe.score(X_test, y_test)

A good introduction to the basic ideas behind pipelines is [this KDnuggets post](https://www.kdnuggets.com/2017/12/managing-machine-learning-workflows-scikit-learn-pipelines-part-1.html).
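The cells above assume `X_train`, `X_test`, `y_train`, and `y_test` already exist. As a minimal end-to-end sketch, the same pipeline can be run on synthetic data; the dataset from `make_classification` and the split settings are assumptions made purely for illustration, not the data used in these notes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the dataset from the notes)
X, y = make_classification(n_samples=400, n_features=8, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

pipe = Pipeline([('mms', MinMaxScaler()),
                 ('tree', DecisionTreeClassifier(random_state=123))])
pipe.fit(X_train, y_train)

# The fitted scaler inside the pipeline only ever saw the training split
fitted_scaler = pipe.named_steps['mms']
scaler_uses_train_only = np.allclose(fitted_scaler.data_min_, X_train.min(axis=0))
print(scaler_uses_train_only)

# score() scales X_test with the training statistics, then predicts
test_accuracy = pipe.score(X_test, y_test)
print(test_accuracy)
```

Because the scaler is fitted inside `pipe.fit()`, the test split never influences the scaling statistics.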

# Integrating Grid Search in Pipelines

In [None]:
from sklearn.model_selection import GridSearchCV

# Create the pipeline
pipe = Pipeline([('mms', MinMaxScaler()),
                 ('tree', DecisionTreeClassifier(random_state=123))])

# Create the grid parameter
grid = [{'tree__max_depth': [None, 2, 6, 10], 
         'tree__min_samples_split': [5, 10]}]


# Create the grid, with "pipe" as the estimator
gridsearch = GridSearchCV(estimator=pipe, 
                          param_grid=grid, 
                          scoring='accuracy', 
                          cv=5)

# Fit using grid search
gridsearch.fit(X_train, y_train)

# Calculate the test score
gridsearch.score(X_test, y_test)

An article with a more detailed workflow can be found [here](https://www.kdnuggets.com/2018/01/managing-machine-learning-workflows-scikit-learn-pipelines-part-2.html).
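The opening notes mention comparing different classification techniques. One way to do this, sketched below, is to treat the final pipeline step itself as a grid parameter, so that each dictionary in the grid swaps in a different estimator. The step name `'clf'`, the candidate models, and the synthetic data are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

# The 'clf' step is a placeholder that the grid replaces
pipe = Pipeline([('mms', MinMaxScaler()),
                 ('clf', DecisionTreeClassifier(random_state=123))])

# Each dictionary sets the estimator and the parameters that belong to it
grid = [
    {'clf': [DecisionTreeClassifier(random_state=123)],
     'clf__max_depth': [2, 6]},
    {'clf': [LogisticRegression(max_iter=1000)],
     'clf__C': [0.1, 1.0]},
]

gridsearch = GridSearchCV(estimator=pipe, param_grid=grid,
                          scoring='accuracy', cv=5)
gridsearch.fit(X_train, y_train)

best_model_name = type(gridsearch.best_estimator_.named_steps['clf']).__name__
print(best_model_name)
print(gridsearch.best_params_)
```

Using a list of dictionaries keeps each model's hyperparameters separate, so a tree-only parameter is never passed to the logistic regression.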


* Machine learning pipelines create a nice workflow to combine data manipulations, preprocessing, and modeling
* Machine learning pipelines can be used along with grid search to evaluate several parameter settings
    * Grid search can increase computation time considerably, because every combination of parameter values is fitted once per cross-validation fold
    * Some models are very sensitive to hyperparameter changes, so the grid values should be chosen with care; even a large grid does not guarantee a good outcome
* Machine learning pipelines can also be pickled so that they can be used in the future without re-training
* Model deployment can be something as simple as pickling a model, or a more complex approach like a cloud function that exposes model predictions through an HTTP API
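As a minimal sketch of the pickling point above, a fitted pipeline can be serialized with the standard-library `pickle` module and restored later without re-training. The synthetic data is an assumption, and the example pickles to bytes in memory; writing to a file with `pickle.dump()` works the same way.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=123)

pipe = Pipeline([('mms', MinMaxScaler()),
                 ('tree', DecisionTreeClassifier(random_state=123))])
pipe.fit(X, y)

# Serialize the fitted pipeline (scaler statistics and tree included)
payload = pickle.dumps(pipe)

# Later, or in another process: restore and predict without re-fitting
restored = pickle.loads(payload)
same_predictions = np.array_equal(restored.predict(X), pipe.predict(X))
print(same_predictions)
```

Pickles should only be loaded from trusted sources, and it is safest to restore them with the same scikit-learn version that created them.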


# Example in sklearn

Nicely done. This pattern (preprocessing, then fitting a model) is very common. Although the process is fairly straightforward once you get the hang of it, **pipelines** make it simpler, more intuitive, and less error-prone.

Instead of standardizing and fitting the model separately, you can do both in one step using `sklearn`'s `Pipeline()`. A pipeline takes any number of preprocessing steps, each with `.fit()` and `.transform()` methods (like `StandardScaler()` above), and a final step with a `.fit()` method (an estimator like `KNeighborsClassifier()`). When fitted, the pipeline fits and applies each preprocessing step in order, then fits the final estimator on the transformed data. Do this now.
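To make the sequence described above concrete, the sketch below does the scaling and fitting by hand and then with a pipeline, and checks that both give the same predictions. The synthetic data is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual version: fit the scaler on train, transform both splits, fit the model
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn = KNeighborsClassifier().fit(X_train_scaled, y_train)
manual_preds = knn.predict(X_test_scaled)

# Pipeline version: the same steps behind a single fit/predict interface
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', KNeighborsClassifier())])
pipe_preds = pipe.fit(X_train, y_train).predict(X_test)

same_predictions = np.array_equal(manual_preds, pipe_preds)
print(same_predictions)
```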

## Build a pipeline (I) 

Build a pipeline with two steps: 

- First step: `StandardScaler()` 
- Second step (estimator): `KNeighborsClassifier()` 


In [None]:
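from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Build a pipeline with StandardScaler and KNeighborsClassifier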
scaled_pipeline_1 = Pipeline([('scaler', StandardScaler()),
                                ('clf', KNeighborsClassifier())])

- Fit the pipeline on the training data (you should use `X_train` and `y_train` here) 
- Print the accuracy of the model on the test set (you should use `X_test` and `y_test` here) 

In [None]:
# Fit the pipeline to the training data
scaled_pipeline_1.fit(X_train, y_train)

# Print the accuracy on test set
scaled_pipeline_1.score(X_test, y_test)

0.5775

If you did everything right, this answer should match the one from above! 

Of course, you can also perform a grid search to determine which combination of hyperparameters can be used to build the best possible model. The way you define the pipeline still remains the same. What you need to do next is define the grid and then use `GridSearchCV()`. Let's do this now.

## Build a pipeline (II)

Again, build a pipeline with two steps: 

- First step: `StandardScaler()` named 'ss'.  
- Second step (estimator): `RandomForestClassifier()` named 'RF'. Set `random_state=123` when instantiating the random forest classifier 

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Build a pipeline with StandardScaler ('ss') and RandomForestClassifier ('RF')
scaled_pipeline_2 = Pipeline([('ss', StandardScaler()),
                              ('RF', RandomForestClassifier(random_state=123))])

Use the `grid` defined below to perform a grid search. The grid is limited to a few hyperparameters and values to keep the runtime down. 

In [None]:
# Define the grid
grid = [{'RF__max_depth': [4, 5, 6], 
         'RF__min_samples_split': [2, 5, 10], 
         'RF__min_samples_leaf': [1, 3, 5]}]

Define a grid search now. Use: 
- the pipeline you defined above (`scaled_pipeline_2`) as the estimator 
- the parameter `grid` 
- `'accuracy'` to evaluate the score 
- 5-fold cross-validation 

In [None]:
# Define a grid search
gridsearch = GridSearchCV(estimator=scaled_pipeline_2, 
                          param_grid=grid, 
                          scoring='accuracy', 
                          cv=5)
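To see why even this small grid takes a while, the number of model fits can be counted before running the search. The sketch below uses `ParameterGrid` from scikit-learn on the same grid; the variable names `n_candidates` and `n_fits` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above
grid = [{'RF__max_depth': [4, 5, 6],
         'RF__min_samples_split': [2, 5, 10],
         'RF__min_samples_leaf': [1, 3, 5]}]

# Number of distinct parameter combinations: 3 * 3 * 3
n_candidates = len(ParameterGrid(grid))

# Each combination is fitted once per cross-validation fold
cv = 5
n_fits = n_candidates * cv

print(n_candidates, n_fits)  # 27 135
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the full training set.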

After defining the grid values and the grid search criteria, all that is left to do is fit the model to training data and then score the test set. Do this below: 

In [None]:
# Fit the training data
gridsearch.fit(X_train, y_train)

# Print the accuracy on test set
gridsearch.score(X_test, y_test)

0.6025
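After fitting, the grid search object exposes more than the test score. The sketch below shows one way to inspect the winning parameters and rank all candidates; it uses synthetic data, a smaller grid, fewer trees, and 3 folds, all of which are assumptions made to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

pipe = Pipeline([('ss', StandardScaler()),
                 ('RF', RandomForestClassifier(n_estimators=20, random_state=123))])
grid = [{'RF__max_depth': [4, 6],
         'RF__min_samples_leaf': [1, 5]}]

gridsearch = GridSearchCV(estimator=pipe, param_grid=grid,
                          scoring='accuracy', cv=3)
gridsearch.fit(X_train, y_train)

# Best combination and its mean cross-validated accuracy
print(gridsearch.best_params_)
print(gridsearch.best_score_)

# Full results, one row per parameter combination, best first
results = pd.DataFrame(gridsearch.cv_results_)
summary = results[['params', 'mean_test_score', 'rank_test_score']]
summary = summary.sort_values('rank_test_score')
print(summary)
```

Note that `best_score_` is the mean cross-validated score on the training data, while `gridsearch.score(X_test, y_test)` evaluates the refitted best pipeline on the held-out test set.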

# Refactoring for pipelines

#### Bringing It All Together

Here is the full preprocessing example without a pipeline:

In [None]:
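import pandas as pd
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

def preprocess_data_without_pipeline(X):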
    
    transformers = []

    ### Encoding categorical data ###

    # Make a transformer
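    # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; older versions need sparse=False
    ohe = OneHotEncoder(categories="auto", handle_unknown="ignore", sparse_output=False)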

    # Create transformed dataframe
    category_encoded = ohe.fit_transform(X[["category"]])
    category_encoded = pd.DataFrame(
        category_encoded,
        columns=ohe.categories_[0],
        index=X.index
    )
    transformers.append(ohe)

    # Replace categorical data with encoded data
    # (drop returns a new dataframe, so the caller's X is not modified)
    X = X.drop("category", axis=1)
    X = pd.concat([category_encoded, X], axis=1)

    ### Feature engineering ###

    def is_odd(data):
        """
        Helper function that returns 1 if odd, 0 if even
        """
        return data % 2

    # Instantiate transformer
    func_transformer = FunctionTransformer(is_odd)

    # Create transformed column
    number_odd = func_transformer.fit_transform(X["number"])
    transformers.append(func_transformer)

    # Add engineered column
    X["number_odd"] = number_odd

    ### Scaling ###

    # Instantiate transformer
    scaler = StandardScaler()

    # Create transformed dataset
    data_scaled = scaler.fit_transform(X)
    transformers.append(scaler)

    # Replace dataset with transformed one
    X = pd.DataFrame(
        data_scaled,
        columns=X.columns,
        index=X.index
    )

    return X, transformers

# Reset value of example_X
example_X = example_data.drop("target", axis=1)
# Test out our function
result, transformers = preprocess_data_without_pipeline(example_X)
result

|   | A | B | C | number | number_odd |
|---|---|---|---|---|---|
| 0 | 1.224745 | -0.816497 | -0.5 | 0.0 | 0.816497 |
| 1 | 1.224745 | -0.816497 | -0.5 | 0.597614 | -1.224745 |
| 2 | -0.816497 | 1.224745 | -0.5 | 1.195229 | 0.816497 |
| 3 | -0.816497 | 1.224745 | -0.5 | 0.0 | 0.816497 |
| 4 | -0.816497 | -0.816497 | 2.0 | -1.792843 | -1.224745 |


## Complete Refactored Pipeline Example

Below is the complete pipeline (without the estimator), which produces the same output as the original full preprocessing example:

In [None]:
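import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

def preprocess_data_with_pipeline(X):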
    
    ### Encoding categorical data ###
    original_features_encoded = ColumnTransformer(transformers=[
        ("ohe", OneHotEncoder(categories="auto", handle_unknown="ignore"), ["category"])
    ], remainder="passthrough")
    
    ### Feature engineering ###
    def is_odd(data):
        """
        Helper function that returns 1 if odd, 0 if even
        """
        return data % 2

    feature_eng = ColumnTransformer(transformers=[
        ("add_number_odd", FunctionTransformer(is_odd), ["number"])
    ], remainder="drop")
  
    ### Combine encoded and engineered features ###
    feature_union = FeatureUnion(transformer_list=[
        ("encoded_features", original_features_encoded),
        ("engineered_features", feature_eng)
    ])
    
    ### Pipeline (including scaling) ###
    pipe = Pipeline(steps=[
        ("feature_union", feature_union),
        ("scale", StandardScaler())
    ])
    
    transformed_data = pipe.fit_transform(X)
    
    ### Re-apply labels (optional step for readability) ###
    encoder = original_features_encoded.named_transformers_["ohe"]
    category_labels = encoder.categories_[0]
    all_cols = list(category_labels) + ["number", "number_odd"]
    return pd.DataFrame(transformed_data, columns=all_cols, index=X.index), pipe
    
# Reset value of example_X
example_X = example_data.drop("target", axis=1)
# Test out our new function
result, pipe = preprocess_data_with_pipeline(example_X)
result

|   | A | B | C | number | number_odd |
|---|---|---|---|---|---|
| 0 | 1.224745 | -0.816497 | -0.5 | 0.0 | 0.816497 |
| 1 | 1.224745 | -0.816497 | -0.5 | 0.597614 | -1.224745 |
| 2 | -0.816497 | 1.224745 | -0.5 | 1.195229 | 0.816497 |
| 3 | -0.816497 | 1.224745 | -0.5 | 0.0 | 0.816497 |
| 4 | -0.816497 | -0.816497 | 2.0 | -1.792843 | -1.224745 |
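One practical benefit of the refactored version is that the fitted pipeline can be reused on new rows with a single `.transform()` call. The sketch below rebuilds a compact version of the same pipeline; the toy training frame, the new rows, and the use of `sparse_threshold=0` to force dense output are assumptions made for illustration, not part of the original example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

def is_odd(data):
    """Helper function that returns 1 if odd, 0 if even"""
    return data % 2

# Hypothetical training data with the same two columns as the example
train = pd.DataFrame({"category": ["A", "A", "B", "B", "C"],
                      "number": [7, 8, 9, 7, 4]})

encoded = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(handle_unknown="ignore"), ["category"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
engineered = ColumnTransformer(
    transformers=[("add_number_odd", FunctionTransformer(is_odd), ["number"])],
    remainder="drop",
)
pipe = Pipeline(steps=[
    ("feature_union", FeatureUnion(transformer_list=[
        ("encoded_features", encoded),
        ("engineered_features", engineered),
    ])),
    ("scale", StandardScaler()),
])
pipe.fit(train)

# New rows, including a category ("D") never seen during fitting
new_rows = pd.DataFrame({"category": ["B", "D"], "number": [3, 10]})
new_transformed = pipe.transform(new_rows)

print(new_transformed.shape)  # (2, 5)
print(np.round(new_transformed, 3))
```

Because the encoder uses `handle_unknown="ignore"`, the unseen category is encoded as all zeros before scaling instead of raising an error.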


# ChatGPT: How to set up gridsearch for logreg

To set up a grid search for a logistic regression (logreg) model, you can follow these steps:

1. Import the necessary libraries:
```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
```

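2. Prepare your data:
   - Split your data into features (X) and target variable (y).
   - Perform any necessary data preprocessing steps, such as scaling or encoding categorical variables. To avoid data leakage during cross-validation, steps that learn from the data (such as scaling) are best placed inside a `Pipeline` that is passed to the grid search, rather than applied to the whole dataset beforehand.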

3. Define the parameter grid:
   - Create a dictionary (or a list of dictionaries) where the keys are the hyperparameters you want to tune, and the values are the candidate values for each hyperparameter. For logistic regression, common hyperparameters to tune are:
     - `penalty`: Regularization type, such as 'l1' or 'l2'.
     - `C`: Inverse of regularization strength; smaller values mean stronger regularization.
     - `solver`: Optimization algorithm, such as 'liblinear' or 'lbfgs'.
     - `max_iter`: Maximum number of iterations for the solver to converge.
   - Not every solver supports every penalty: 'lbfgs' supports only 'l2' (or no penalty), while 'liblinear' supports both 'l1' and 'l2'. A list of dictionaries keeps incompatible combinations out of the grid.
   - Example parameter grid:
     ```python
     param_grid = [
         {'solver': ['liblinear'], 'penalty': ['l1', 'l2'],
          'C': [0.1, 1, 10], 'max_iter': [100, 500, 1000]},
         {'solver': ['lbfgs'], 'penalty': ['l2'],
          'C': [0.1, 1, 10], 'max_iter': [100, 500, 1000]},
     ]
     ```

4. Create an instance of the logistic regression model:
```python
logreg = LogisticRegression()
```

5. Create a GridSearchCV object:
```python
grid_search = GridSearchCV(estimator=logreg,
                           param_grid=param_grid,
                           cv=5)
```
   - `estimator`: The estimator object to be optimized (logreg in this case).
   - `param_grid`: The parameter grid to search over.
   - `cv`: Number of cross-validation folds (5 in this example).

6. Fit the GridSearchCV object on your data:
```python
grid_search.fit(X, y)
```

7. Access the best parameters and best score:
```python
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
```

After performing the grid search, the `best_params_` attribute will contain the combination of hyperparameters that yielded the best mean score during the cross-validation process.

Because `refit=True` by default, `grid_search.best_estimator_` has already been refit with those parameters on all the data passed to `.fit()`, so you can use it directly for predictions or further evaluation.
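Tying this back to the pipeline material earlier in these notes, the steps can be sketched with the scaler and the logistic regression inside one pipeline, so that scaling is re-fitted within each cross-validation fold. The synthetic data, the step names `'ss'` and `'logreg'`, and the particular grid values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([('ss', StandardScaler()),
                 ('logreg', LogisticRegression(max_iter=1000))])

# Parameter names are prefixed with the step name and a double underscore.
# Two dictionaries keep 'l1' away from 'lbfgs', which does not support it.
param_grid = [
    {'logreg__solver': ['liblinear'],
     'logreg__penalty': ['l1', 'l2'],
     'logreg__C': [0.1, 1, 10]},
    {'logreg__solver': ['lbfgs'],
     'logreg__penalty': ['l2'],
     'logreg__C': [0.1, 1, 10]},
]

grid_search = GridSearchCV(estimator=pipe, param_grid=param_grid,
                           scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best CV Score:", grid_search.best_score_)

# Held-out evaluation with the refitted best pipeline
test_accuracy = grid_search.score(X_test, y_test)
print("Test Accuracy:", test_accuracy)
```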