# Preprocessing and pipelines

This chapter introduces pipelines: scikit-learn's mechanism for chaining transformers and estimators together so that they can be used as a single unit. It also introduces preprocessing techniques as a way to improve model performance, and uses pipelines to tie together concepts from previous chapters.

# (1) Preprocessing data

## Dealing with categorical features
- Scikit-learn will not accept categorical features by default
- Need to encode categorical features numerically
- Convert to 'dummy variables'
    - 0: Observation was NOT that category
    - 1: Observation was that category

## Dummy variables

<img src="image/Screenshot 2021-02-03 011255.png" width=50%>

## Dealing with categorical features in Python
- scikit-learn: OneHotEncoder()
- pandas: get_dummies()
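
To make the two options concrete, here is a minimal sketch that encodes the same column both ways. The `toy` DataFrame and its values are made up for illustration and are not the automobile data used below.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: the values are invented purely for illustration
toy = pd.DataFrame({"origin": ["US", "Asia", "Europe", "US"]})

# pandas: one 0/1 column per category, named <column>_<category>
dummies = pd.get_dummies(toy, columns=["origin"], dtype=int)
print(dummies)

# scikit-learn: fit the encoder, then convert the sparse result to a dense array
encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[["origin"]]).toarray()
print(encoder.categories_)
print(encoded)
```

Both approaches order the categories alphabetically here, so the two outputs hold the same 0/1 pattern. `get_dummies()` is convenient for exploration, while `OneHotEncoder()` remembers the categories it was fitted on, which helps when the same encoding has to be applied to new data.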

## Automobile dataset
- mpg: Target variable
- Origin: Categorical Feature

<img src="image/Screenshot 2021-02-03 011715.png">

## EDA w/ categorical feature

<img src="image/Screenshot 2021-02-03 012441.png">

## Encoding dummy variables

In [None]:
import pandas as pd
df = pd.read_csv('auto.csv')
df_origin = pd.get_dummies(df)
print(df_origin.head())

In [None]:
df_origin = df_origin.drop('origin_Asia', axis=1)
print(df_origin.head())

## Linear regression with dummy variables

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
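# Note: the `normalize` argument was removed in scikit-learn 1.2; on newer
# versions, drop it and scale the features in a separate step instead.
ridge = Ridge(alpha=0.5, normalize=True).fit(X_train, y_train)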

In [None]:
ridge.score(X_test, y_test)
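
The cells above use `X` and `y` without showing where they come from. The sketch below is one plausible way to build them from a dummy-encoded DataFrame. It uses synthetic data as a stand-in for `auto.csv` (the `weight` column and the formula for `mpg` are assumptions made for illustration), and it omits `normalize` so that it also runs on recent scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the automobile data (all values are made up)
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "weight": rng.uniform(1500, 4500, size=n),
    "origin": rng.choice(["US", "Europe", "Asia"], size=n),
})
toy["mpg"] = 45 - 0.007 * toy["weight"] + rng.normal(0, 1.5, size=n)

# Encode the categorical column, dropping the first (redundant) dummy
toy_encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

# Features are every column except the target
X = toy_encoded.drop("mpg", axis=1).values
y = toy_encoded["mpg"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

ridge = Ridge(alpha=0.5).fit(X_train, y_train)
print(ridge.score(X_test, y_test))  # R^2 on the held-out data
```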

# Exercise I: Exploring categorical features

The Gapminder dataset that you worked with in previous chapters also contained a categorical `'Region'` feature, which we dropped in previous exercises since you did not have the tools to deal with it. Now, however, you do, so we have added it back in!

Your job in this exercise is to explore this feature. Boxplots are particularly useful for visualizing categorical features such as this.

### Instructions

- Import `pandas` as `pd`.
- Read the CSV file `'gapminder.csv'` into a DataFrame called `df`.
- Use pandas to create a boxplot showing the variation of life expectancy (`'life'`) by region (`'Region'`). To do so, pass the column names in to `df.boxplot()` (in that order).


In [None]:
# Import pandas and matplotlib's pyplot (needed for plt.show() below)
import pandas as pd
import matplotlib.pyplot as plt

# Read 'gapminder.csv' into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create a boxplot of life expectancy per region
df.boxplot('life', 'Region', rot=60)

# Show the plot
plt.show()


<img src="image/2021-02-03-014227.svg" width=50%>

# Exercise II: Creating dummy variables

As Andy discussed in the video, scikit-learn does not accept non-numerical features. You saw in the previous exercise that the `'Region'` feature contains very useful information that can predict life expectancy. For example, Sub-Saharan Africa has a lower life expectancy compared to Europe and Central Asia. Therefore, if you are trying to predict life expectancy, it would be preferable to retain the `'Region'` feature. To do this, you need to binarize it by creating dummy variables, which is what you will do in this exercise.

### Instructions

- Use the pandas `get_dummies()` function to create dummy variables from the `df` DataFrame. Store the result as `df_region`.
- Print the columns of `df_region`. This has been done for you.
- Use the `get_dummies()` function again, this time specifying `drop_first=True` to drop the unneeded dummy variable (in this case, `'Region_America'`).
- Hit 'Submit Answer' to print the new columns of `df_region` and take note of how one column was dropped!


In [None]:
# Create dummy variables: df_region
df_region = pd.get_dummies(df)

# Print the columns of df_region
print(df_region.columns)

# Create dummy variables with drop_first=True: df_region
df_region = pd.get_dummies(df, drop_first=True)

# Print the new columns of df_region
print(df_region.columns)
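
To see why the dropped dummy variable is "unneeded", the following sketch checks that it can be rebuilt from the columns that remain. The `toy` DataFrame is made up for illustration and is not the Gapminder data.

```python
import pandas as pd

# Toy data: invented values, used only to illustrate the idea
toy = pd.DataFrame({"Region": ["America", "Europe", "Asia", "Europe"]})

full = pd.get_dummies(toy, dtype=int)
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

# Which column did drop_first=True remove?
dropped = sorted(set(full.columns) - set(reduced.columns))
print(dropped)

# Each row belongs to exactly one category, so the dropped column
# is 1 minus the sum of the remaining dummy columns
reconstructed = 1 - reduced.sum(axis=1)
print((reconstructed == full[dropped[0]]).all())
```

Because the dropped column carries no extra information, keeping it would only duplicate what the other dummies already say, which can cause problems for some linear models.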


# Exercise III: Regression with categorical features

Having created the dummy variables from the `'Region'` feature, you can build regression models as you did before. Here, you'll use ridge regression to perform 5-fold cross-validation.

The feature array `X` and target variable array `y` have been pre-loaded.

### Instructions

- Import `Ridge` from `sklearn.linear_model` and `cross_val_score` from `sklearn.model_selection`.
- Instantiate a ridge regressor called `ridge` with `alpha=0.5` and `normalize=True`.
- Perform 5-fold cross-validation on `X` and `y` using the `cross_val_score()` function.
- Print the cross-validated scores.


In [None]:
# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Instantiate a ridge regressor: ridge
ridge = Ridge(alpha=0.5, normalize=True)

# Perform 5-fold cross-validation: ridge_cv
ridge_cv = cross_val_score(ridge, X, y, cv=5)

# Print the cross-validated scores
print(ridge_cv)
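
Since `X` and `y` are pre-loaded in the exercise, here is a self-contained sketch of the same cross-validation call on synthetic data generated with `make_regression` (an assumption made so the example runs on its own). It leaves out `normalize`, which is not available in recent scikit-learn releases.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the pre-loaded X and y
X_toy, y_toy = make_regression(n_samples=100, n_features=5,
                               noise=10.0, random_state=0)

ridge = Ridge(alpha=0.5)

# One score per fold; for regressors the default score is R^2
scores = cross_val_score(ridge, X_toy, y_toy, cv=5)
print(scores)
print(scores.mean())
```

Looking at the spread of the five fold scores, and not only their mean, gives a rough sense of how stable the model's performance is across different splits.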


# (2) Handling missing data

## PIMA Indians dataset

In [None]:
df = pd.read_csv('diabetes.csv')
df.info()

In [None]:
print(df.head())

In [None]:
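import numpy as np

# Zeros in these columns stand in for missing measurements, so mark them as NaN
df['insulin'] = df['insulin'].replace(0, np.nan)
df['triceps'] = df['triceps'].replace(0, np.nan)
df['bmi'] = df['bmi'].replace(0, np.nan)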
df.info()

## Dropping missing data

In [None]:
df = df.dropna()
df.shape

## Imputing missing data
- Making an educated guess about the missing values
- Example: Using the mean of the non-missing entries
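
As a worked example of mean imputation, the sketch below fills the gaps in a tiny made-up array. It assumes a scikit-learn version that provides `sklearn.impute.SimpleImputer` (0.20 or later), which replaces the older `Imputer` class.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny made-up feature array with one missing entry per column
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp.fit_transform(X_toy)

# Column means computed from the non-missing entries: (1 + 3) / 2 and (10 + 20) / 2
print(imp.statistics_)
print(X_filled)
```

Each missing entry is replaced by the mean of the observed values in its own column, so the first column's gap becomes 2 and the second column's gap becomes 15.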

In [None]:
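# `Imputer` was removed from sklearn.preprocessing in scikit-learn 0.22;
# `SimpleImputer` replaces it and always imputes column-wise (the old axis=0).
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X)
X = imp.transform(X)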

## Imputing within a pipeline

In [None]:
import numpy as np
from sklearn.impute import SimpleImputer  # replaces `Imputer`, removed in scikit-learn 0.22
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
logreg = LogisticRegression()
steps = [('imputation', imp), ('logistic_regression', logreg)]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
pipeline.score(X_test, y_test)
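
One practical benefit of imputing inside a pipeline is that the imputation statistics are learned from the training data only. The sketch below checks this on synthetic data; the data, the 10% missing rate, and the `_toy` names are assumptions made for illustration, and `SimpleImputer` is used as the current replacement for `Imputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic binary classification data (made up for this sketch)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Knock out roughly 10% of the entries to simulate missing values
mask = rng.random(X_toy.shape) < 0.1
X_toy[mask] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42)

pipeline = Pipeline([("imputation", SimpleImputer(strategy="mean")),
                     ("logistic_regression", LogisticRegression())])
pipeline.fit(X_train, y_train)

# The imputer inside the pipeline saw only the training rows
train_means = np.nanmean(X_train, axis=0)
learned = pipeline.named_steps["imputation"].statistics_
print(np.allclose(learned, train_means))

# The same training-set means are reused to fill gaps in the test set
print(pipeline.score(X_test, y_test))
```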

# Exercise IV: Dropping missing data 

The voting dataset from Chapter 1 contained a bunch of missing values that we dealt with for you behind the scenes. Now, it's time for you to take care of these yourself!

The unprocessed dataset has been loaded into a DataFrame `df`. Explore it in the IPython Shell with the `.head()` method. You will see that there are certain data points labeled with a `'?'`. These denote missing values. As you saw in the video, different datasets encode missing values in different ways. Sometimes it may be a `'9999'`, other times a `0` - real-world data can be very messy! If you're lucky, the missing values will already be encoded as `NaN`. We use `NaN` because it is an efficient and simplified way of internally representing missing data, and it lets us take advantage of pandas methods such as `.dropna()` and `.fillna()`, as well as scikit-learn's imputation transformer (`Imputer()` in older releases, `SimpleImputer()` from scikit-learn 0.20 onward).

In this exercise, your job is to convert the `'?'`s to NaNs, and then drop the rows that contain them from the DataFrame.

### Instructions

- Explore the DataFrame `df` in the IPython Shell. Notice how the missing value is represented.
- Convert all `'?'` data points to `np.nan`.
- Count the total number of NaNs using the `.isnull()` and `.sum()` methods. This has been done for you.
- Drop the rows with missing values from `df` using `.dropna()`.
- Hit 'Submit Answer' to see how many rows were lost by dropping the missing values.


In [None]:
# Convert '?' to NaN
df[df == '?'] = np.nan

# Print the number of NaNs
print(df.isnull().sum())

# Print shape of original DataFrame
print("Shape of Original DataFrame: {}".format(df.shape))

# Drop rows with missing values
df = df.dropna()

# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))
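
The same steps can be tried on a small made-up DataFrame, which makes it easy to verify by eye how many rows are lost. The column names and values below are hypothetical and are not taken from the voting dataset; `.replace()` is used here as an alternative to boolean-mask assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of a dataset that marks missing values with '?'
toy = pd.DataFrame({
    "party": ["democrat", "republican", "democrat", "republican"],
    "vote_a": ["y", "?", "n", "y"],
    "vote_b": ["n", "y", "?", "y"],
})

# Convert '?' to NaN without modifying the original DataFrame
cleaned = toy.replace("?", np.nan)

# Number of NaNs per column
print(cleaned.isnull().sum())

# Any row with at least one NaN is removed
dropped = cleaned.dropna()
print(toy.shape, dropped.shape)
```

Two of the four rows contain a `'?'`, so half of this toy dataset disappears. Losing that many rows is the main drawback of dropping missing data, and it motivates imputation.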


# Exercise V: Imputing missing data in a ML Pipeline I

As you've come to appreciate, there are many steps to building a model, from creating training and test sets, to fitting a classifier or regressor, to tuning its parameters, to evaluating its performance on new data. Imputation can be seen as the first step of this machine learning process, the entirety of which can be viewed within the context of a pipeline. Scikit-learn provides a pipeline constructor that allows you to piece together these steps into one process and thereby simplify your workflow.

You'll now practice setting up a pipeline with two steps: the imputation step, followed by the instantiation of a classifier. You've seen three classifiers in this course so far: k-NN, logistic regression, and the decision tree. You will now be introduced to a fourth one - the Support Vector Machine, or SVM. For now, do not worry about how it works under the hood. It works exactly as you would expect of the scikit-learn estimators that you have worked with previously, in that it has the same `.fit()` and `.predict()` methods as before.

### Instructions

- Import `SimpleImputer` from `sklearn.impute` and `SVC` from `sklearn.svm`. `SimpleImputer` replaces the `Imputer` class that `sklearn.preprocessing` provided before scikit-learn 0.22. SVC stands for Support Vector Classification, which is a type of SVM.
- Set up the imputation transformer to impute missing data (represented as `NaN`) with the `'most_frequent'` value in each column.
- Instantiate a `SVC` classifier. Store the result in `clf`.
- Create the steps of the pipeline by creating a list of tuples:
    - The first tuple should consist of the imputation step, using `imp`.
    - The second should consist of the classifier.


In [None]:
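# Import the imputer and the SVC classifier
# (`SimpleImputer` replaces `Imputer`, which was removed in scikit-learn 0.22)
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC

# Set up the imputation transformer: imp (SimpleImputer always works column-wise)
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')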

# Instantiate the SVC classifier: clf
clf = SVC()

# Setup the pipeline with the required steps: steps
steps = [('imputation', imp),
        ('SVM', clf)]
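
To illustrate what the `'most_frequent'` strategy does, here is a minimal sketch on a made-up array of 0/1 values. It assumes `SimpleImputer` from `sklearn.impute`, the current replacement for `Imputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up 0/1 features with a missing entry in each column
X_toy = np.array([[1.0, 0.0],
                  [1.0, np.nan],
                  [np.nan, 0.0],
                  [0.0, 1.0]])

imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
X_filled = imp.fit_transform(X_toy)

# The most common observed value in each column: 1 for the first, 0 for the second
print(imp.statistics_)
print(X_filled)
```

Unlike the mean, the most frequent value is always one that actually occurs in the column, which makes this strategy a natural fit for categorical or binary features.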

# Exercise VI: Imputing missing data in a ML Pipeline II

Having set up the steps of the pipeline in the previous exercise, you will now use it on the voting dataset to classify a Congressman's party affiliation. What makes pipelines so incredibly useful is the simple interface that they provide. You can use the `.fit()` and `.predict()` methods on pipelines just as you did with your classifiers and regressors!

Practice this for yourself now and generate a classification report of your predictions. The steps of the pipeline have been set up for you, and the feature array `X` and target variable array `y` have been pre-loaded. Additionally, `train_test_split` and `classification_report` have been imported from `sklearn.model_selection` and `sklearn.metrics` respectively.

### Instructions

- Import the following modules:
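    - `SimpleImputer` from `sklearn.impute` (the replacement for `Imputer`, which was removed from `sklearn.preprocessing` in scikit-learn 0.22) and `Pipeline` from `sklearn.pipeline`.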
    - `SVC` from `sklearn.svm`.
- Create the pipeline using `Pipeline()` and `steps`.
- Create training and test sets. Use 30% of the data for testing and a random state of `42`.
- Fit the pipeline to the training set and predict the labels of the test set.
- Compute the classification report.


In [None]:
# Import necessary modules
# (`SimpleImputer` replaces `Imputer`, which was removed in scikit-learn 0.22)
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Setup the pipeline steps: steps
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
        ('SVM', SVC())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

# Compute metrics
print(classification_report(y_test, y_pred))
