## holdout validation
<li>splitting the full dataset into 2 partitions:</li>
    <ul>
        <li>a training set</li>
        <li>a test set</li>
    </ul>
<li>training the model on the training set,</li>
<li>using the trained model to predict labels on the test set,</li>
<li>computing an error metric to understand the model's effectiveness,</li>
<li>switching the training and test sets and repeating,</li>
<li>averaging the errors.</li>
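The error metric used throughout this notebook is RMSE. As a minimal sketch of what that computation amounts to, the block below works it out by hand with NumPy and then averages two per-iteration values, as in the last step above. The helper name `rmse` and all of the numbers are made up for illustration; they are not taken from the Airbnb data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical prices, for illustration only.
actual = [100, 150, 200, 250]
predicted = [106, 142, 200, 250]   # errors: -6, 8, 0, 0
toy_rmse = rmse(actual, predicted)

# The holdout score is the plain mean of the two per-iteration RMSEs.
# 7.0 is a stand-in for the RMSE of the swapped iteration.
toy_avg = float(np.mean([toy_rmse, 7.0]))
print(toy_rmse, toy_avg)
```

This should agree with `np.sqrt(mean_squared_error(y_true, y_pred))` from scikit-learn, which is what the cells below use.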

In [45]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [46]:
dc_listings = pd.read_csv('dc_airbnb.csv')
dc_listings = dc_listings.iloc[np.random.permutation(len(dc_listings))]
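# regex=False: treat ',' and '$' as literal characters ('$' is the end-of-string anchor in a regex)
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)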
dc_listings['price'] = stripped_dollars.astype('float')
split_one = dc_listings.iloc[:1862].copy()
split_two = dc_listings.iloc[1862:].copy()

In [47]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

In [48]:
print(split_one.shape)  # train_one is only assigned in the next cell


(1862, 20)


In [49]:
train_one = split_one
test_one = split_two
train_two = split_two
test_two = split_one

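def cal_rmse(train, test):
    """Fit k-NN on `train` and return the RMSE of its price predictions on `test`."""
    knn = KNeighborsRegressor()
    knn.fit(train[['accommodates']], train['price'])
    predictions = knn.predict(test[['accommodates']])
    # mean_squared_error expects (y_true, y_pred)
    mse = mean_squared_error(test['price'], predictions)
    return np.sqrt(mse)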
iteration_one_rmse = cal_rmse(train_one, test_one)
iteration_two_rmse = cal_rmse(train_two, test_two)
print(iteration_one_rmse)
print(iteration_two_rmse)
avg_rmse = np.mean([iteration_two_rmse, iteration_one_rmse])
print(avg_rmse)

145.389456928
121.488814424
133.439135676


In [50]:
train_one = split_one
test_one = split_two
train_two = split_two
test_two = split_one
# First half
model = KNeighborsRegressor()
model.fit(train_one[["accommodates"]], train_one["price"])
test_one["predicted_price"] = model.predict(test_one[["accommodates"]])
iteration_one_rmse = mean_squared_error(test_one["price"], test_one["predicted_price"])**(1/2)

# Second half
model.fit(train_two[["accommodates"]], train_two["price"])
test_two["predicted_price"] = model.predict(test_two[["accommodates"]])
iteration_two_rmse = mean_squared_error(test_two["price"], test_two["predicted_price"])**(1/2)

avg_rmse = np.mean([iteration_two_rmse, iteration_one_rmse])

print(iteration_one_rmse, iteration_two_rmse, avg_rmse)

145.389456928 121.488814424 133.439135676


## k-fold cross-validation
<li>splitting the full dataset into k partitions of (nearly) equal length.</li>
    <ul>
    <li>selecting k-1 partitions as the training set and</li>
    <li>selecting the remaining partition as the test set</li>
    </ul>
<li>training the model on the training set.</li>
<li>using the trained model to predict labels on the test fold.</li>
<li>computing the test fold's error metric.</li>
<li>repeating all of the above steps k-1 times, until each partition has been used as the test set for an iteration.</li>
<li>calculating the mean of the k error values.</li>

In [51]:
# Assign each row to one of 5 folds by position (the rows were shuffled when the data was loaded):
dc_listings.loc[dc_listings.index[0:745], "fold"] = 1
dc_listings.loc[dc_listings.index[745:1490], "fold"] = 2
dc_listings.loc[dc_listings.index[1490:2234], "fold"] = 3
dc_listings.loc[dc_listings.index[2234:2978], "fold"] = 4
dc_listings.loc[dc_listings.index[2978:3723], "fold"] = 5
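
The slice boundaries above are hard-coded for 3723 rows and 5 folds. As a rough sketch of how the same labelling could be produced for any row count and any k, the hypothetical helper `fold_labels` below (not part of the original notebook) uses `np.array_split` to cut row positions into k nearly equal chunks. It assumes the rows have already been shuffled, as they were when the data was loaded.

```python
import numpy as np

def fold_labels(n_rows, k):
    """Return an array of fold labels 1..k for rows 0..n_rows-1.

    Rows are labelled in contiguous blocks, so shuffle the data first.
    """
    labels = np.empty(n_rows, dtype=int)
    chunks = np.array_split(np.arange(n_rows), k)
    for fold, positions in enumerate(chunks, start=1):
        labels[positions] = fold
    return labels

labels = fold_labels(3723, 5)
# Count rows per fold; index 0 is unused because labels start at 1.
fold_sizes = np.bincount(labels)[1:].tolist()
print(fold_sizes)
```

With a DataFrame this would be used as `dc_listings["fold"] = fold_labels(len(dc_listings), 5)`. Note that `np.array_split` gives the extra rows to the first folds, so the fold sizes come out as 745, 745, 745, 744, 744 rather than the exact arrangement in the cell above.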


In [52]:
dc_listings['fold'].value_counts()

5.0    745
2.0    745
1.0    745
4.0    744
3.0    744
Name: fold, dtype: int64

In [53]:
train_df = dc_listings[dc_listings['fold'] != 1]
test_df = dc_listings[dc_listings['fold'] == 1]
knn = KNeighborsRegressor()
knn.fit(train_df[['accommodates']], train_df['price'])
predictions = knn.predict(test_df[['accommodates']])
mse = mean_squared_error(test_df['price'], predictions)
iteration_one_rmse = np.sqrt(mse)
iteration_one_rmse

132.72601298724351

In [54]:
def train_and_validate(df, folds):
    """Return one RMSE per fold, using each fold in turn as the test set."""
    rmses = []
    knn = KNeighborsRegressor()
    for i in range(1, folds + 1):
        # Hold out fold i and train on the remaining folds.
        train = df[df['fold'] != i]
        test = df[df['fold'] == i]
        knn.fit(train[['accommodates']], train['price'])
        predictions = knn.predict(test[['accommodates']])
        # mean_squared_error expects (y_true, y_pred)
        mse = mean_squared_error(test['price'], predictions)
        rmses.append(np.sqrt(mse))
    return rmses

In [57]:
rmses = train_and_validate(dc_listings, 5)
avg_rmse = np.mean(rmses)
print(rmses)
print(avg_rmse)

[132.72601298724351, 161.02763529549534, 137.22042780304838, 158.47733683502693, 134.10222625761409]
144.710727836


## sklearn.model_selection
from sklearn.model_selection import KFold
kf = KFold(n_splits, shuffle=False, random_state=None)

from sklearn.model_selection import cross_val_score
cross_val_score(estimator, X, y, scoring=None, cv=None)

<li>instantiate the scikit-learn model class you want to fit,</li>
<li>instantiate the KFold class, using its parameters to specify the k-fold cross-validation attributes you want,</li>
<li>use the cross_val_score() function to return the scoring metric you're interested in.</li>

In [63]:
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=5, shuffle=True, random_state=1)
knn = KNeighborsRegressor()
mses = cross_val_score(estimator=knn, X=dc_listings[['accommodates']], scoring="neg_mean_squared_error", y=dc_listings['price'], cv=kf)
avg_rmse = np.sqrt(np.absolute(mses)).mean()

In [64]:
avg_rmse

129.3992056481454
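
The `np.absolute` call in the cell above is needed because scikit-learn scorers follow a "higher is better" convention, so `"neg_mean_squared_error"` returns the MSE with its sign flipped. The sketch below shows this on synthetic data; the feature, the price formula and the noise level are all invented for illustration and have nothing to do with the Airbnb file.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (hypothetical): one integer feature, noisy linear target.
rng = np.random.default_rng(0)
X = rng.integers(1, 9, size=(200, 1))
y = 50 * X[:, 0] + rng.normal(0, 20, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=1)
neg_mses = cross_val_score(KNeighborsRegressor(), X, y,
                           scoring="neg_mean_squared_error", cv=kf)

# One score per fold, each of them <= 0 because the MSE is negated.
print(neg_mses)

# Flip the sign back (or take the absolute value) before the square root.
rmses = np.sqrt(-neg_mses)
print(rmses.mean())
```

Taking `np.sqrt` of the raw scores would produce `nan` values, which is why the sign has to be undone first.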

In [65]:
num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]

for fold in num_folds:
    kf = KFold(fold, shuffle=True, random_state=1)
    model = KNeighborsRegressor()
    mses = cross_val_score(model, dc_listings[["accommodates"]], dc_listings["price"], scoring="neg_mean_squared_error", cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    print(str(fold), "folds: ", "avg RMSE: ", str(avg_rmse), "std RMSE: ", str(std_rmse))

3 folds:  avg RMSE:  131.385308835 std RMSE:  5.96796480742
5 folds:  avg RMSE:  129.399205648 std RMSE:  12.1782098885
7 folds:  avg RMSE:  126.451277775 std RMSE:  21.0599752859
9 folds:  avg RMSE:  128.943862599 std RMSE:  27.6231159582
10 folds:  avg RMSE:  135.032591014 std RMSE:  26.3726757307
11 folds:  avg RMSE:  130.71939193 std RMSE:  35.8440196593
13 folds:  avg RMSE:  135.652111378 std RMSE:  30.1222852488
15 folds:  avg RMSE:  128.513759695 std RMSE:  29.4054952241
17 folds:  avg RMSE:  131.660759634 std RMSE:  35.4332741402
19 folds:  avg RMSE:  132.419427262 std RMSE:  32.4121688009
21 folds:  avg RMSE:  130.570428452 std RMSE:  34.8565286082
23 folds:  avg RMSE:  124.719963163 std RMSE:  37.4324380026


## Bias-Variance Tradeoff
