# Preprocessing and pipelines
So far all our data has been nice and clean, ready to use right away. In read life we'll sometimes deal with categorical features like male/female, red/blue. Scikit-learn will not accept them. 
- --> we neeed to encode these cat. features numerically.
- Convert to 'dummy variables' 0 and 1, one for each cat.
- 0 = not this cat., 1 = this cat. yes
We'll be using OneHotEncoder() from scikit-learn or pandas:get_dummies() to do this.

We'll train with automobile dataset where the target variable is mpg (miles/gallon) and the origin is the cat. feature. Origin is either US, Asia or Europe.

In [1]:
import pandas as pd
df = pd.read_csv('auto.csv.txt')
df_origin = pd.get_dummies(df)
print(df_origin.head())

    mpg  displ   hp  weight  accel  size  origin_Asia  origin_Europe  \
0  18.0  250.0   88    3139   14.5  15.0            0              0   
1   9.0  304.0  193    4732   18.5  20.0            0              0   
2  36.1   91.0   60    1800   16.4  10.0            1              0   
3  18.5  250.0   98    3525   19.0  15.0            0              0   
4  34.3   97.0   78    2188   15.8  10.0            0              1   

   origin_US  
0          1  
1          1  
2          0  
3          1  
4          0  


In [9]:
#since we only need the info from two origin columns to know the true origin
#we can drop one of them
df_origin = df_origin.drop('origin_Asia', axis =1)
print(df_origin.head())

KeyError: "['origin_Asia'] not found in axis"

In [31]:
y = df['mpg'].values

In [29]:
#Now we have two features instead of only one, even if they include the same data.
x = df_origin[['origin_US', 'origin_Europe']].values

In [2]:
#now we can use our models as usual
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3, random_state=42)
ridge = Ridge(alpha = 0.5, normalize=True).fit(x_train,y_train)
ridge.score(x_test, y_test)

NameError: name 'x' is not defined

### Exploring categorical features
The Gapminder dataset that you worked with in previous chapters also contained a categorical 'Region' feature, which we dropped in previous exercises since you did not have the tools to deal with it. Now however, you do, so we have added it back in!

Your job in this exercise is to explore this feature. Boxplots are particularly useful for visualizing categorical features such as this.

In [3]:
# Import pandas
import pandas as pd

# Read 'gapminder.csv' into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create a boxplot of life expectancy per region
df.boxplot('life', 'Region', rot=60)

# Show the plot
plt.show()


FileNotFoundError: [Errno 2] File b'gapminder.csv' does not exist: b'gapminder.csv'

### Creating dummy variables
As Andy discussed in the video, scikit-learn does not accept non-numerical features. You saw in the previous exercise that the 'Region' feature contains very useful information that can predict life expectancy. For example, Sub-Saharan Africa has a lower life expectancy compared to Europe and Central Asia. Therefore, if you are trying to predict life expectancy, it would be preferable to retain the 'Region' feature. To do this, you need to binarize it by creating dummy variables, which is what you will do in this exercise.

In [None]:
# Create dummy variables: df_region
df_region = pd.get_dummies(df)

# Print the columns of df_region
print(df_region.columns)

# Create dummy variables with drop_first=True: df_region
df_region = pd.get_dummies(df, drop_first=True)

# Print the new columns of df_region
print(df_region.columns)

## Handling missing data
No value for a given feature in a partical row. This can be due to:
- transcription error
- no observation
- corrupted data

In [55]:
df = pd.read_csv('diabetes.csv')
df.info()
#missing values can be encoded in different ways, so they might not be evident in info()
#They could be NaN, 0, -1, ?,...

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [56]:
df.shape

(768, 9)

In [36]:
df.head()
#we see that some triceps and insulin observations are 0.

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
#we'll make all those entries nan
#inplace = True means that we'll modify the original df
import numpy as np
df.Insulin.replace(0, np.nan, inplace = True)
df.SkinThickness.replace(0, np.nan, inplace = True)
df.BMI.replace(0, np.nan, inplace = True)

df.info()

AttributeError: 'DataFrame' object has no attribute 'Insulin'

In [58]:
#we can deal with missing data dropping every row containing it
df = df.dropna()
#We only have half the rows left now!
df.shape

(393, 9)

If only some rows had missing data then it's not so bad, but we're gonna need better methods to deal with it otherwise. 

We could impute missing data = Make an educated guess about the missing values. Per example we could use the mean of all the non-missing entries to replace all missing values.

In [53]:
from sklearn.preprocessing import Imputer
#axis = 0 means that we're gonna impute along columns
imp = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imp.fit(x)
#this is why imputers are known as transformers
x = imp.transform(x)
#now our data is in good shape and we're ready to implement models

How to do both parts at the same time?

--> Imputing withing a pipeline

Note that in a pipeline every step but the last one has to be a transformer. The last one needs to be a classifier or estimator.

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
#now we build the pipeline object
steps = [('imputation', imp), ('logistic_regression', logreg)]
pipeline = Pipeline(steps)

#now we can do the usual with train_test_split, pipeline.fit(), y_pred = pipeline.predict()
#pipeline.score()....

NameError: name 'imp' is not defined

### Example on voting data

In [None]:
# Convert '?' to NaN
df[df == '?'] = np.nan

# Print the number of NaNs
print(df.isnull().sum())

# Print shape of original DataFrame
print("Shape of Original DataFrame: {}".format(df.shape))

# Drop missing values and print shape of new DataFrame
df = df.dropna()

# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))


### Imputing missing data in a ML Pipeline I
As you've come to appreciate, there are many steps to building a model, from creating training and test sets, to fitting a classifier or regressor, to tuning its parameters, to evaluating its performance on new data. Imputation can be seen as the first step of this machine learning process, the entirety of which can be viewed within the context of a pipeline. Scikit-learn provides a pipeline constructor that allows you to piece together these steps into one process and thereby simplify your workflow.

You'll now practice setting up a pipeline with two steps: the imputation step, followed by the instantiation of a classifier. You've seen three classifiers in this course so far: k-NN, logistic regression, and the decision tree. You will now be introduced to a fourth one - the Support Vector Machine, or SVM. For now, do not worry about how it works under the hood. It works exactly as you would expect of the scikit-learn estimators that you have worked with previously, in that it has the same .fit() and .predict() methods as before.

In [6]:
# Import the Imputer module
from sklearn.preprocessing import Imputer
from sklearn.svm import SVC

# Setup the Imputation transformer: imp
imp = Imputer(missing_values= 'NaN', strategy='most_frequent', axis=0)

# Instantiate the SVC classifier: clf
clf = SVC()

# Setup the pipeline with the required steps: steps
steps = [('imputation', imp),
        ('SVM', clf)]



### Imputing missing data in a ML Pipeline II
Having setup the steps of the pipeline in the previous exercise, you will now use it on the voting dataset to classify a Congressman's party affiliation. What makes pipelines so incredibly useful is the simple interface that they provide. You can use the .fit() and .predict() methods on pipelines just as you did with your classifiers and regressors!

Practice this for yourself now and generate a classification report of your predictions. The steps of the pipeline have been set up for you, and the feature array X and target variable array y have been pre-loaded. Additionally, train_test_split and classification_report have been imported from sklearn.model_selection and sklearn.metrics respectively.

In [7]:
# Import necessary modules
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Setup the pipeline steps: steps
steps = [('imputation', Imputer(missing_values='NaN', strategy='most_frequent', axis=0)),
        ('SVM', SVC())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

# Compute metrics
print(classification_report(y_test, y_pred))




NameError: name 'X' is not defined

## Centering and scaling
Why scale our data?

In [80]:
df = pd.read_csv('winequality-red.csv')
df.head()

FileNotFoundError: [Errno 2] File b'winequality-red.csv' does not exist: b'winequality-red.csv'

In [78]:
#This should work but data format is messed up
#x = df.drop('quality', axis=1).values
#y = df['quality'].values

KeyError: "['quality'] not found in axis"

In [74]:
df.columns[3]

IndexError: index 3 is out of bounds for axis 0 with size 1

Our features are chemical properties and our target values in quality ranging from 0 to 1
we see how the ranges vary, density from 0 to 1, sulfur from 6 to 289...


This might influence the model. We want features to be on similar scale 

--> Normalizing = Scaling and centering 
- Standardization: Subtract the mean and divide by variance. All features are centered around 0 and have variance 1
- Can also subtract the minimum and divide by the range. Minimum zero and maximum 1
- Could do the same to have all data bet -1 and 1
Here we'll do standardization. Other methods are available in sci-kit documentation.


In [8]:
from sklearn.preprocessing import scale
x_scaled = scale(x)
np.mean(x), np.std(x)

NameError: name 'x' is not defined

In [63]:
np.mean(x_scaled), np.std(x_scaled)

(3.6252180395923476e-17, 0.9999999999999999)

In [9]:
#puting the scaling in a pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
steps = [('scaler', StandardScaler()),('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)

Performing models with and without scaling we see that we get better scores when scaling.

In [None]:
#to also incluse cv in pipeline
steps = [('scaler', StandardScaler()),('knn', KNeighborsClassifier())]
pipeline= Pipeline(steps)
parameters = {knn__n_neighbors = np.arange(1,50)}
#train_test
cv = GridSearchCV(pipeline, param_grid = parameters)
cv.fit(...)
y_pred = cv.predict()
#now we can predict best param, best score and classification report.


### Centering and scaling in a pipeline
With regard to whether or not scaling is effective, the proof is in the pudding! See for yourself whether or not scaling the features of the White Wine Quality dataset has any impact on its performance. You will use a k-NN classifier as part of a pipeline that includes scaling, and for the purposes of comparison, a k-NN classifier trained on the unscaled data has been provided.

The feature array and target variable array have been pre-loaded as X and y. Additionally, KNeighborsClassifier and train_test_split have been imported from sklearn.neighbors and sklearn.model_selection, respectively.

In [None]:
# Import the necessary modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]
        
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)

# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test,y_test)))


### Bringing it all together I: Pipeline for classification
It is time now to piece together everything you have learned so far into a pipeline for classification! Your job in this exercise is to build a pipeline that includes scaling and hyperparameter tuning to classify wine quality.

You'll return to using the SVM classifier you were briefly introduced to earlier in this chapter. The hyperparameters you will tune are C and gamma. C controls the regularization strength. It is analogous to the C you tuned for logistic regression in Chapter 3, while gamma controls the kernel coefficient: Do not worry about this now as it is beyond the scope of this course.

The following modules have been pre-loaded: Pipeline, svm, train_test_split, GridSearchCV, classification_report, accuracy_score. The feature and target variable arrays X and y have also been pre-loaded.

In [None]:
# Setup the pipeline
steps = [('scaler', StandardScaler()),
         ('SVM', SVC())]

pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'SVM__C':[1, 10, 100],
              'SVM__gamma':[0.1, 0.01]}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=21)

# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline, parameters) #By default cv=3

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))


### Bringing it all together II: Pipeline for regression
For this final exercise, you will return to the Gapminder dataset. Guess what? Even this dataset has missing values that we dealt with for you in earlier chapters! Now, you have all the tools to take care of them yourself!

Your job is to build a pipeline that imputes the missing data, scales the features, and fits an ElasticNet to the Gapminder data. You will then tune the l1_ratio of your ElasticNet using GridSearchCV.

All the necessary modules have been imported, and the feature and target variable arrays have been pre-loaded as X and y.

In [None]:
# Setup the pipeline steps: steps
steps = [('imputation', Imputer(missing_values='NaN', strategy='mean', axis=0)),
         ('scaler', StandardScaler()),
         ('elasticnet',ElasticNet())]

# Create the pipeline: pipeline 
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio':np.linspace(0,1,30)}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)

# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, parameters)

# Fit to the training set
gm_cv.fit(X_train, y_train)

# Compute and print the metrics
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet Alpha: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
