# Challenge - FeatureUnion

![](https://images.unsplash.com/photo-1491602917301-a0d24c462b8b?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1057&q=80)

## Guidelines

In this challenge, we will work with the **Pima Indians Diabetes** dataset from Kaggle.

### Context
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

### Content
The dataset consists of several medical predictor variables and one target variable, **Outcome**. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

## Imports

In [None]:
# TODO : import useful libraries

## EDA

**Q1** : Do a quick EDA. Are there any missing values ? Any outliers ?

In [None]:
# TODO : load data

In [None]:
# TODO : exploratory data analysis

In [None]:
# TODO : exploratory data analysis

In [None]:
# TODO : exploratory data analysis

In [None]:
# TODO : exploratory data analysis

In [None]:
# TODO : exploratory data analysis

**Q2** : The dataframe has no NaNs, but several columns contain invalid 0 values. Take the column `BMI`: a BMI of 0 makes no sense, so we must replace those values with something more plausible.

The following columns contain invalid 0 values :
- Glucose
- BloodPressure
- SkinThickness
- Insulin
- BMI

Previously, you used the `.fillna()` method from `pandas`. This time, we will deal with missing values in a slightly different way, by using the `SimpleImputer` from sklearn. By default it only treats `NaN` as missing, so pay attention to its `missing_values` parameter. Check the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer).
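As a minimal sketch of the idea, the snippet below runs a `SimpleImputer` on a small made-up array (the values are hypothetical, not taken from the dataset). It assumes zeros mark invalid measurements and uses the `missing_values` parameter so that they are the values being replaced.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values): a 0 marks an invalid measurement.
X = np.array([[30.0, 0.0],
              [0.0, 80.0],
              [20.0, 60.0],
              [40.0, 70.0]])

# missing_values=0 tells the imputer to treat zeros as missing values.
imputer = SimpleImputer(missing_values=0, strategy="mean")
X_filled = imputer.fit_transform(X)

# statistics_ holds the per-column fill values, computed without the zeros.
print(imputer.statistics_)
print(X_filled)
```

The fill values are learned in `fit` and applied in `transform`, which is what separates the imputer from a one-off `.fillna()` call.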

In [None]:
# TODO : select the columns with invalid 0 values

In [None]:
# TODO : plot the distribution of those columns

In [None]:
# TODO : use the SimpleImputer to fill those values

In [None]:
# TODO : plot the new distribution of those columns

How does this filling strategy impact the data ?

**Q3**: Why would we use the `SimpleImputer` when `pandas` does the job just as well ? Because, unlike the `.fillna()` method, the `SimpleImputer` is a transformer class with `fit` and `transform` methods. And such classes can be used as steps in **Pipelines** !
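To illustrate what this buys us, here is a small sketch on made-up data (the arrays and the step name `"imputer"` are assumptions chosen for the example). The pipeline learns the fill value on the training data only, then reuses it on new data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Toy train/test arrays (hypothetical values): a 0 marks an invalid measurement.
X_train = np.array([[10.0], [0.0], [30.0]])
X_test = np.array([[0.0], [50.0]])

prep = Pipeline([
    ("imputer", SimpleImputer(missing_values=0, strategy="mean")),
])

# The mean is learned on the training data only...
prep.fit(X_train)

# ...and reused as-is to fill the zeros in the test data.
X_test_tf = prep.transform(X_test)

print(prep.named_steps["imputer"].statistics_)
print(X_test_tf)
```

With `.fillna()`, you would have to store the training mean yourself and remember to apply it to the test set.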

You will find below the code for a custom class `ColumnSelector` that allows you to select specific columns in a dataframe or in an array by their integer positions.

Using the `ColumnSelector` and the `SimpleImputer`, build a preprocessing Pipeline that will fill all invalid 0 values in the selected columns.

In [None]:
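# Custom class ColumnSelector

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


# BaseEstimator adds get_params / set_params, which sklearn relies on
# when it clones the pipeline (for example inside GridSearchCV).
class ColumnSelector(TransformerMixin, BaseEstimator):
    """Select columns by integer position from a numpy array or a DataFrame."""

    def __init__(self, columns_idx):
        self.columns_idx = columns_idx

    def fit(self, X, y=None):
        # Nothing to learn: the selection is fully defined by columns_idx
        return self

    def transform(self, X, y=None):
        if isinstance(X, np.ndarray):
            return X[:, self.columns_idx]
        if isinstance(X, pd.DataFrame):
            return X.iloc[:, self.columns_idx]
        # Fail explicitly instead of hitting an undefined variable
        raise TypeError("X must be a numpy array or a pandas DataFrame")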

In [None]:
# Build preprocessing pipeline

In [None]:
# Test your preprocessing pipeline

**Q4**: Let's now deal with the rest of the columns, which contain no invalid values, so there is nothing to replace. Build a second preprocessing pipeline to deal with those columns. That pipeline only needs to take in the complete dataframe and return the selected columns.

In [None]:
# Build preprocessing pipeline

In [None]:
# Test your preprocessing pipeline

**Q5**: We now have two pipelines that handle two distinct sets of columns differently. The goal is to use both to preprocess the data, then to append an estimator at the end to make predictions. Unfortunately, the two cannot simply be chained one after the other: each one works on a different subset of the columns, and the output of the first no longer contains the columns the second one needs.

Pipelines work linearly :

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1eAxmlWJ3S1CPZScjduroFka4bXqGLXFP">
</p>

But `sklearn` offers an easy way to run several preprocessing steps side by side : `FeatureUnion`. It applies each of its transformers to the same input and concatenates their outputs column-wise, a bit like a pipeline that concatenates other pipelines ! You can still use its result in a bigger pipeline.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1BVhuR-4bOqQryemUqdVobEKWxNT0Xj4v">
</p>

Explore the documentation about [FeatureUnion](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html).
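The sketch below shows the mechanics on a tiny made-up array. To stay self-contained, it uses sklearn's `FunctionTransformer` with a hypothetical `select` helper as a stand-in for the `ColumnSelector` class; the branch names `"with_zeros"` and `"others"` are also assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def select(X, idx):
    """Hypothetical helper: keep only the columns at positions idx."""
    return X[:, idx]


# Toy data (hypothetical values): column 1 contains an invalid 0.
X = np.array([[1.0, 0.0, 5.0],
              [2.0, 4.0, 6.0],
              [3.0, 8.0, 7.0]])

# Branch 1: select the column with zeros, then impute it.
with_zeros = Pipeline([
    ("select", FunctionTransformer(select, kw_args={"idx": [1]})),
    ("imputer", SimpleImputer(missing_values=0, strategy="mean")),
])

# Branch 2: select the remaining columns and pass them through unchanged.
others = Pipeline([
    ("select", FunctionTransformer(select, kw_args={"idx": [0, 2]})),
])

# Both branches receive the full X; their outputs are stacked column-wise.
union = FeatureUnion([("with_zeros", with_zeros), ("others", others)])
X_tf = union.fit_transform(X)
print(X_tf)
```

Note that the output columns follow the order of the branches in the `FeatureUnion`, not the original column order.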

In [None]:
# TODO : Use a FeatureUnion to parallelize both of your preprocessing pipelines.

In [None]:
# TODO : test your resulting FeatureUnion

**Q6**: Let's build our final pipeline, which will predict the outcome (diabetes, yes or no) from the rest of the features. Don't forget to split your dataset before fitting anything on it ! You can use any estimator you want.
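As a rough sketch of the overall shape, the example below splits a synthetic dataset, then fits a pipeline ending with an estimator. The synthetic data from `make_classification`, the `LogisticRegression` estimator and the step names are all assumptions made for illustration; in the challenge, the preprocessing step would be your `FeatureUnion`.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real dataset (assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Split first, so that nothing is fitted on the test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = Pipeline([
    ("imputer", SimpleImputer(missing_values=0, strategy="mean")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit runs fit_transform on the preprocessing step, then fits the estimator.
model.fit(X_train, y_train)

# score applies the fitted preprocessing to X_test, then returns the accuracy.
score = model.score(X_test, y_test)
print(score)
```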

In [None]:
# TODO : split data

In [None]:
# TODO : build final pipeline

In [None]:
# TODO : build final pipeline

In [None]:
# TODO : make a prediction and score your model

**Q7**: Using a `GridSearchCV` and your final pipeline, run hyperparameter optimization. You should specifically try the strategy "mean" and the strategy "median" in the `SimpleImputer`, and see which one gives the best results.
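The main thing to get right here is the parameter naming: parameters of nested steps are addressed by joining the step names with double underscores. The sketch below shows this on synthetic data; the step names (`features`, `with_zeros`, `imputer`, `clf`), the `select` helper and the choice of `LogisticRegression` are assumptions, and `FunctionTransformer` again stands in for the `ColumnSelector`.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def select(X, idx):
    """Hypothetical helper: keep only the columns at positions idx."""
    return X[:, idx]


# Synthetic stand-in for the real dataset, with some zeros injected.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X[:20, 0] = 0

features = FeatureUnion([
    ("with_zeros", Pipeline([
        ("select", FunctionTransformer(select, kw_args={"idx": [0, 1]})),
        ("imputer", SimpleImputer(missing_values=0)),
    ])),
    ("others", FunctionTransformer(select, kw_args={"idx": [2, 3]})),
])

pipe = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Nested parameter names: <step>__<sub-step>__...__<parameter>
param_grid = {
    "features__with_zeros__imputer__strategy": ["mean", "median"],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

If you are unsure about a name, `pipe.get_params().keys()` lists every parameter the grid search can reach.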

In [None]:
# TODO : hyperparameter optimization

In [None]:
# TODO : hyperparameter optimization

In [None]:
# TODO : hyperparameter optimization