# Chapter 5.2 The Scikit-Learn API

In the previous section, we implemented $k$-nearest neighbors from scratch. Now we will see how to implement it using [Scikit-Learn](http://scikit-learn.org/), a Python library that makes it easy to train and use machine learning models. All models are trained using the exact same steps:

1. Declare the model.
2. Fit the model to training data, consisting of both features $X$ and labels $y$.
3. Use the model to predict the labels for new values of the features.
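
As a minimal sketch of these three steps, the example below runs the same declare/fit/predict sequence for two different models. The toy arrays `X`, `y`, and `X_new` are made up for illustration and are not part of the Ames data; `LinearRegression` is included only to show that the steps do not change when the model does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy training data (made up for illustration): 6 observations, 2 features.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([3.0, 3.0, 7.0, 7.0, 11.0, 11.0])
X_new = np.array([[3.5, 3.5]])

predictions = {}
for model in [KNeighborsRegressor(n_neighbors=2), LinearRegression()]:  # Step 1: declare
    model.fit(X, y)                                                      # Step 2: fit
    predictions[type(model).__name__] = model.predict(X_new)             # Step 3: predict

print(predictions)
```

Only the line that declares the model differs between the two; `.fit()` and `.predict()` are called in exactly the same way.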

Let's take a look at how we would use this API to train a model on the Ames housing data set to predict the 2011 price of the Old Town house from the previous section. Scikit-Learn assumes that the data has already been completely converted to quantitative variables and that the variables have already been standardized (if desired). The code below reads in the data and does the necessary preprocessing.

(All of this code is copied from the previous section. Read the code, and if you are not sure what a particular line does, refer back to the previous section.)

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
pd.options.display.max_rows = 5

housing = pd.read_csv("/data301/data/AmesHousing.txt", sep="\t")

housing["Date Sold"] = housing["Yr Sold"] + housing["Mo Sold"] / 12
features = ["Lot Area", "Gr Liv Area",
            "Full Bath", "Half Bath",
            "Bedroom AbvGr", 
            "Year Built", "Date Sold",
            "Neighborhood"]
X_train = pd.get_dummies(housing[features])
y_train = housing["SalePrice"]

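# Give the empty Series an explicit float dtype; newer versions of pandas
# would otherwise create it with dtype object.
x_new = pd.Series(index=X_train.columns, dtype=float)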
x_new["Lot Area"] = 9000
x_new["Gr Liv Area"] = 1400
x_new["Full Bath"] = 2
x_new["Half Bath"] = 1
x_new["Bedroom AbvGr"] = 3
x_new["Year Built"] = 1980
x_new["Date Sold"] = 2011
x_new["Neighborhood_OldTown"] = 1
x_new.fillna(0, inplace=True)

X_train_std = (X_train - X_train.mean()) / X_train.std()
x_new_std = (x_new - X_train.mean()) / X_train.std()

X_train_std

| | Lot Area | Gr Liv Area | Full Bath | Half Bath | Bedroom AbvGr | Year Built | Date Sold | Neighborhood_Blmngtn | Neighborhood_Blueste | Neighborhood_BrDale | ... | Neighborhood_NoRidge | Neighborhood_NridgHt | Neighborhood_OldTown | Neighborhood_SWISU | Neighborhood_Sawyer | Neighborhood_SawyerW | Neighborhood_Somerst | Neighborhood_StoneBr | Neighborhood_Timber | Neighborhood_Veenker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.743912 | 0.309212 | -1.024618 | -0.755074 | 0.176064 | -0.375473 | 1.620757 | -0.09821 | -0.058511 | -0.101692 | ... | -0.157561 | -0.245025 | -0.297967 | -0.129033 | -0.233061 | -0.211064 | -0.257308 | -0.133073 | -0.158694 | -0.090862 |
| 1 | 0.187065 | -1.194223 | -1.024618 | -0.755074 | -1.032058 | -0.342410 | 1.684822 | -0.09821 | -0.058511 | -0.101692 | ... | -0.157561 | -0.245025 | -0.297967 | -0.129033 | -0.233061 | -0.211064 | -0.257308 | -0.133073 | -0.158694 | -0.090862 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2928 | -0.017503 | -0.218968 | -1.024618 | -0.755074 | -1.032058 | 0.087408 | -1.518428 | -0.09821 | -0.058511 | -0.101692 | ... | -0.157561 | -0.245025 | -0.297967 | -0.129033 | -0.233061 | -0.211064 | -0.257308 | -0.133073 | -0.158694 | -0.090862 |
| 2929 | -0.066107 | 0.989715 | 0.783894 | 1.234464 | 0.176064 | 0.715604 | -1.069973 | -0.09821 | -0.058511 | -0.101692 | ... | -0.157561 | -0.245025 | -0.297967 | -0.129033 | -0.233061 | -0.211064 | -0.257308 | -0.133073 | -0.158694 | -0.090862 |


`X_train_std` is a matrix of all numbers, which is the form that Scikit-Learn expects. Now let's see how to use Scikit-Learn to fit a $k$-nearest neighbors model to this data.

In [2]:
from sklearn.neighbors import KNeighborsRegressor

# Step 1: Declare the model.
model = KNeighborsRegressor(n_neighbors=30)

# Step 2: Fit the model to training data.
model.fit(X_train_std, y_train)

# Step 3: Use the model to predict for new observations.
# Scikit-Learn expects 2-dimensional arrays, so we need to 
# turn the Series into a DataFrame with 1 row.
X_new_std = x_new_std.to_frame().T
model.predict(X_new_std)

array([ 132343.33333333])

This is the exact same prediction that we got by implementing $k$-nearest neighbors manually. 
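
To see why the two approaches agree, here is a minimal sketch that compares a hand-written $k$-nearest neighbors prediction with Scikit-Learn's on synthetic data. It assumes the manual version uses Euclidean distance and a plain average of the neighbors' labels, which matches the defaults of `KNeighborsRegressor`. The random arrays stand in for standardized features and are not the Ames data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # synthetic features, assumed already standardized
y = rng.normal(size=50)        # synthetic labels
x_new = rng.normal(size=3)     # a single new observation

k = 5

# Manual version: Euclidean distance to every training observation,
# then average the labels of the k closest ones.
dists = np.sqrt(((X - x_new) ** 2).sum(axis=1))
nearest = np.argsort(dists)[:k]
manual_pred = y[nearest].mean()

# Scikit-Learn version (default metric is Euclidean, default weights are uniform).
model = KNeighborsRegressor(n_neighbors=k)
model.fit(X, y)
sklearn_pred = model.predict(x_new.reshape(1, -1))[0]

print(manual_pred, sklearn_pred)
```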

In the case of training a machine learning model to predict for a single observation, Scikit-Learn may seem like overkill. In fact, the above Scikit-Learn code was 5 lines, whereas our implementation of $k$-nearest neighbors in the previous section was only 4 lines. However, learning Scikit-Learn will pay dividends as the problems become more complex.

## Preprocessing in Scikit-Learn

We constructed `X_train_std` and `x_new_std` above using just basic `pandas` operations. But it is also possible to have Scikit-Learn do this preprocessing for us. The preprocessing objects in Scikit-Learn all follow the same basic pattern:

1. First, the preprocessing object has to be "fit" to a data set.
2. The `.transform()` method actually processes the data. This method can be called repeatedly on different data sets and is guaranteed to process each data set in exactly the same way.

It might not be obvious why it is necessary to first "fit" the preprocessing object to a data set before using it to process data. Hopefully, the following examples will make this clear.

### Example 1: Dummy Encoding

Instead of using `pd.get_dummies()`, we can do dummy encoding in Scikit-Learn using the `DictVectorizer` tool. There is one catch: `DictVectorizer` expects the data as a list of dictionaries, not as a `DataFrame`. But each row of a `DataFrame` can be represented as a dictionary, where the keys are the column names and the values are the data. `pandas` provides a convenience method, `.to_dict()`, that converts a `DataFrame` into a list of dictionaries (one per row) when called with `orient="records"`.

In [7]:
X_train_dict = housing[features].to_dict(orient="records")
X_train_dict[:2]

[{'Bedroom AbvGr': 3,
  'Date Sold': 2010.4166666666667,
  'Full Bath': 1,
  'Gr Liv Area': 1656,
  'Half Bath': 0,
  'Lot Area': 31770,
  'Neighborhood': 'NAmes',
  'Year Built': 1960},
 {'Bedroom AbvGr': 2,
  'Date Sold': 2010.5,
  'Full Bath': 1,
  'Gr Liv Area': 896,
  'Half Bath': 0,
  'Lot Area': 11622,
  'Neighborhood': 'NAmes',
  'Year Built': 1961}]

Now we pass this list to `DictVectorizer`, which will expand each categorical variable (e.g., "Neighborhood") into dummy variables. When the vectorizer is fit to the training data, it will learn all of the possible categories for each categorical variable so that when `.transform()` is called on different data sets, the same dummy variables will be returned (and in the same order). This is important for us because we need to apply the encoding to two data sets, the training data and the new observation, and we want to be sure that the same dummy variables appear in both.

In [14]:
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)

X_train = vec.transform(X_train_dict)
x_new_dict = {
    "Lot Area": 9000,
    "Gr Liv Area": 1400,
    "Full Bath": 2,
    "Half Bath": 1,
    "Bedroom AbvGr": 3,
    "Year Built": 1980,
    "Date Sold": 2011,
    "Neighborhood": "OldTown"
}
X_new = vec.transform([x_new_dict])

X_train

array([[  3.00000000e+00,   2.01041667e+03,   1.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   1.96000000e+03],
       [  2.00000000e+00,   2.01050000e+03,   1.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   1.96100000e+03],
       [  3.00000000e+00,   2.01050000e+03,   1.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   1.95800000e+03],
       ..., 
       [  3.00000000e+00,   2.00658333e+03,   1.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   1.99200000e+03],
       [  2.00000000e+00,   2.00633333e+03,   1.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   1.97400000e+03],
       [  3.00000000e+00,   2.00691667e+03,   2.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   1.99300000e+03]])
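
The array above has no column labels, so the following sketch shows how one might inspect what the vectorizer learned. It uses a tiny made-up data set (two rows, not the Ames data) and the `feature_names_` attribute of a fitted `DictVectorizer`. It also illustrates what happens to a category that was not present when the vectorizer was fit: it is ignored, so every dummy variable is 0.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up training rows with one quantitative and one categorical feature.
train_rows = [
    {"Lot Area": 9000, "Neighborhood": "NAmes"},
    {"Lot Area": 7000, "Neighborhood": "OldTown"},
]

vec = DictVectorizer(sparse=False)
vec.fit(train_rows)

# The columns learned during fitting, in the order .transform() will use.
print(vec.feature_names_)

# A category seen during fitting sets the matching dummy variable to 1.
seen = vec.transform([{"Lot Area": 8000, "Neighborhood": "OldTown"}])

# A category never seen during fitting is ignored: all dummies are 0,
# but the number and order of the columns stay the same.
unseen = vec.transform([{"Lot Area": 8000, "Neighborhood": "Gilbert"}])

print(seen)
print(unseen)
```

Because the column layout is fixed at fit time, the training data and any new observation always end up with matching columns.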

### Example 2: Scaling

We can also use Scikit-Learn to scale our data. The `StandardScaler` class standardizes each variable, but there are other classes as well, such as `MinMaxScaler`, which applies min-max scaling to each variable, and `Normalizer`, which rescales each observation (row) to have unit norm.

In the previous section, we standardized both the training data and the new observation with respect to the _training data_. To specify that the standardization should be with respect to the training data, we fit the scaler to the training data. Then, we use the scaler to transform both the training data and the new observation.

In [17]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train_std = scaler.transform(X_train)
X_new_std = scaler.transform(X_new)

X_train_std

array([[ 0.17609421,  1.62103356, -1.02479289, ..., -0.15872127,
        -0.0908778 , -0.37553701],
       [-1.03223376,  1.68510949, -1.02479289, ..., -0.15872127,
        -0.0908778 , -0.34246845],
       [ 0.17609421,  1.68510949, -1.02479289, ..., -0.15872127,
        -0.0908778 , -0.44167415],
       ..., 
       [ 0.17609421, -1.32645923, -1.02479289, ..., -0.15872127,
        -0.0908778 ,  0.68265709],
       [-1.03223376, -1.51868702, -1.02479289, ..., -0.15872127,
        -0.0908778 ,  0.0874229 ],
       [ 0.17609421, -1.07015551,  0.7840283 , ..., -0.15872127,
        -0.0908778 ,  0.71572565]])
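
These values differ very slightly from the ones we computed with `pandas` at the start of this section (for example, 0.17609421 here versus 0.176064 there). The reason is that `StandardScaler` divides by the standard deviation computed with $n$ in the denominator (`ddof=0`), whereas the `.std()` method in `pandas` uses $n - 1$ by default. Every column is therefore rescaled by the same constant factor, which does not change which neighbors are nearest. The sketch below, on a small made-up data set, shows the stored statistics and both versions of the manual formula.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up training data and one new observation.
train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 60.0]})
new = pd.DataFrame({"a": [4.0], "b": [20.0]})

scaler = StandardScaler()
scaler.fit(train)

# .fit() stores the training mean and standard deviation of each column.
print(scaler.mean_, scaler.scale_)

# .transform() reuses those stored values, so the new observation is
# standardized with respect to the training data.
new_std = scaler.transform(new)

# Manual formula with ddof=0: matches StandardScaler.
manual_ddof0 = (new - train.mean()) / train.std(ddof=0)

# Manual formula with the pandas default (ddof=1): slightly different values.
manual_ddof1 = (new - train.mean()) / train.std()

print(new_std)
print(manual_ddof0.to_numpy())
print(manual_ddof1.to_numpy())
```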

## Putting It All Together

The following example shows a complete pipeline: from reading in the raw data and processing it, to fitting a machine learning model and using it for prediction.

In [19]:
# Read in the data.
housing = pd.read_csv("/data301/data/AmesHousing.txt", sep="\t")

# Define the features.
housing["Date Sold"] = housing["Yr Sold"] + housing["Mo Sold"] / 12
features = ["Lot Area", "Gr Liv Area",
            "Full Bath", "Half Bath",
            "Bedroom AbvGr", 
            "Year Built", "Date Sold",
            "Neighborhood"]

# Define the training data.
# Represent the features as a list of dicts.
X_train_dict = housing[features].to_dict(orient="records")
X_new_dict = [{
    "Lot Area": 9000,
    "Gr Liv Area": 1400,
    "Full Bath": 2,
    "Half Bath": 1,
    "Bedroom AbvGr": 3,
    "Year Built": 1980,
    "Date Sold": 2011,
    "Neighborhood": "OldTown"
}]
y_train = housing["SalePrice"]

# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_new = vec.transform(X_new_dict)

# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
X_new_std = scaler.transform(X_new)

# K-Nearest Neighbors Model
model = KNeighborsRegressor(n_neighbors=30)
model.fit(X_train_std, y_train)
model.predict(X_new_std)

array([ 132343.33333333])
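
As an optional aside, the same chain of fit and transform calls can be wrapped in a single object with `make_pipeline` from `sklearn.pipeline`. The sketch below is one possible way to do this and is not the approach used in the rest of this section. It runs on a small made-up data set (the rows and prices are invented for illustration) and checks that the pipeline gives the same prediction as the step-by-step code.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data (not the Ames data set).
X_train_dict = [
    {"Gr Liv Area": 1000, "Neighborhood": "NAmes"},
    {"Gr Liv Area": 1200, "Neighborhood": "NAmes"},
    {"Gr Liv Area": 1500, "Neighborhood": "OldTown"},
    {"Gr Liv Area": 1700, "Neighborhood": "OldTown"},
    {"Gr Liv Area": 2500, "Neighborhood": "Gilbert"},
]
y_train = [110000, 125000, 130000, 140000, 210000]
X_new_dict = [{"Gr Liv Area": 1400, "Neighborhood": "OldTown"}]

# Chain dummy encoding, standardization, and the model into one object.
pipeline = make_pipeline(
    DictVectorizer(sparse=False),
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=2),
)
pipeline.fit(X_train_dict, y_train)
pipeline_pred = pipeline.predict(X_new_dict)

# The same steps written out by hand, for comparison.
vec = DictVectorizer(sparse=False).fit(X_train_dict)
scaler = StandardScaler().fit(vec.transform(X_train_dict))
model = KNeighborsRegressor(n_neighbors=2)
model.fit(scaler.transform(vec.transform(X_train_dict)), y_train)
manual_pred = model.predict(scaler.transform(vec.transform(X_new_dict)))

print(pipeline_pred, manual_pred)
```

Calling `.fit()` on the pipeline fits each preprocessing object to the training data in turn, and `.predict()` applies the fitted transformations to the new observation before predicting.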

## Exercises

**Exercise 1.** Using Scikit-Learn, build a $k$-nearest neighbors model to predict how much tip a person will pay, using the Tips dataset (`/data301/data/tips.csv`) as your training data. Use your model to predict how much a male diner will tip on a bill of \$40.00 on a Sunday.

In [None]:
# TYPE YOUR CODE HERE