In [23]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [9]:
candy = pd.read_csv('../candy-data.csv')
X = candy.drop(['competitorname', 'winpercent'], axis=1).values
y = candy['winpercent'].values

In [18]:
X.shape

(85, 11)

### scikit-learn's KFold()

You just finished running a colleague's code that creates a random forest model and calculates an out-of-sample accuracy. You noticed that your colleague's code did not set a random state, and the errors you found were completely different from the errors your colleague reported.

To get a better estimate for how accurate this random forest model will be on new data, you have decided to generate some indices to use for KFold cross-validation.
* Instructions

    * Call the KFold() constructor to create five splits, with shuffling and a random state of 1111.
    * Use the split() method of KFold on X.
    * Print the number of indices in both the train and validation indices lists.


In [15]:
from sklearn.model_selection import KFold

# Use KFold
kf = KFold(n_splits=5, shuffle=True, random_state=1111)

# Create splits
splits = kf.split(X)

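# Print the number of indices.
# Note: kf.split() returns a generator, so running this loop here would
# exhaust `splits` before the next cell iterates over it.
# for train_index, val_index in splits:
#     print("Number of training indices: %s" % len(train_index))
#     print("Number of validation indices: %s" % len(val_index))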
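
The fold sizes and the effect of the random state can be checked without touching `splits`. The following is a minimal sketch, assuming a zero-filled stand-in array with the same shape as `X` (85 rows, 11 columns); the names `X_demo`, `kf_demo`, `folds`, and `folds_again` are illustrative and not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in feature matrix with the same shape as the candy data (assumption).
X_demo = np.zeros((85, 11))

kf_demo = KFold(n_splits=5, shuffle=True, random_state=1111)

# Materialise the generator so the folds can be iterated more than once.
folds = list(kf_demo.split(X_demo))
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in folds]
print(fold_sizes)

# A second KFold built with the same random_state yields the same validation folds.
folds_again = list(KFold(n_splits=5, shuffle=True, random_state=1111).split(X_demo))
same_folds = all(np.array_equal(a[1], b[1]) for a, b in zip(folds, folds_again))
print(same_folds)
```

With 85 rows and five splits, every fold holds 17 validation rows and 68 training rows. Wrapping `kf.split()` in `list()` is one way to reuse the indices across cells.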

### Using KFold indices

You have already created splits, which contains indices for the candy-data dataset to complete 5-fold cross-validation. To get a better estimate for how well a colleague's random forest model will perform on new data, you want to run this model on the five different training and validation indices you just created.

In this exercise, you will use these indices to check the accuracy of this model using the five different splits. A for loop has been provided to assist with this process.
* Instructions

    * Use train_index and val_index to call the correct indices of X and y when creating training and validation data.
    * Fit rfc using the training dataset.
    * Use rfc to create predictions for the validation dataset and print the validation accuracy (measured here as mean squared error).


In [16]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rfc = RandomForestRegressor(n_estimators=25, random_state=1111)

# Access the training and validation indices of splits
for train_index, val_index in splits:

    # Setup the training and validation data
    X_train, y_train = X[train_index], y[train_index]
    X_val, y_val = X[val_index], y[val_index]
    # Fit the random forest model
    rfc.fit(X_train, y_train)
    # Make predictions and print the validation error (MSE, so lower is better)
    predictions = rfc.predict(X_val)
    print("Split accuracy: " + str(mean_squared_error(y_val, predictions)))

Split accuracy: 150.99298148707666
Split accuracy: 171.22206240542593
Split accuracy: 131.72569156195593
Split accuracy: 80.61940183841385
Split accuracy: 221.63020627476214
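
The five values above are mean squared errors, so they are easier to compare once summarised. The sketch below copies the printed values into an array and reports their mean, their spread, and the root of the mean MSE, which is on the same scale as `winpercent`. The variable names are illustrative.

```python
import numpy as np

# Per-fold MSE values copied from the output above.
fold_mse = np.array([
    150.99298148707666,
    171.22206240542593,
    131.72569156195593,
    80.61940183841385,
    221.63020627476214,
])

mean_mse = fold_mse.mean()   # average error across the five folds
std_mse = fold_mse.std()     # how much the error varies from fold to fold
rmse = np.sqrt(mean_mse)     # back on the scale of winpercent

print("Mean MSE: %.2f" % mean_mse)
print("Std of MSE: %.2f" % std_mse)
print("RMSE of the mean: %.2f" % rmse)
```

The largest fold error is more than twice the smallest, which suggests that a single train/validation split could give a misleading picture on a dataset this small.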


### scikit-learn's methods

You have decided to build a regression model to predict the number of new employees your company will successfully hire next month. You open up a new Python script to get started, but you quickly realize that sklearn has a lot of different modules. Let's make sure you understand the names of the modules, the methods, and which module contains which method.

Follow the instructions below to load in all of the necessary methods for completing cross-validation using sklearn. You will use modules:

    - metrics
    - model_selection
    - ensemble

* Instructions

    * Load the method for calculating the scores of cross-validation.
    * Load the random forest regression method.
    * Load the mean squared error metric.
    * Load the method for creating a scorer to use with cross-validation.


In [17]:
# Instruction 1: Load the cross-validation method
from sklearn.model_selection import cross_val_score

# Instruction 2: Load the random forest regression model
from sklearn.ensemble import RandomForestRegressor

# Instruction 3: Load the mean squared error method
# Instruction 4: Load the function for creating a scorer
from sklearn.metrics import mean_squared_error, make_scorer

### Implement cross_val_score()

Your company has created several new candies to sell, but they are not sure if they should release all five of them. To predict the popularity of these new candies, you have been asked to build a regression model using the candy dataset. Remember that the response value is a head-to-head win-percentage against other candies.

Before you begin trying different regression models, you have decided to run cross-validation on a simple random forest model to get a baseline error to compare with any future results.
* Instructions

    * Fill in cross_val_score().
        * Use X for the training data, and y for the response.
        * Use rfc as the model, 10-fold cross-validation, and mse for the scoring function.
    * Print the mean of the cv results.


In [22]:
rfc = RandomForestRegressor(n_estimators=25, random_state=1111)
mse = make_scorer(mean_squared_error)

# Set up cross_val_score
cv = cross_val_score(estimator=rfc,
                     X=X,
                     y=y,
                     cv=10,
                     scoring=mse)

# Print the mean error
print(cv.mean())

155.4061992697056
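
The scorer above returns plain MSE, which is convenient for reporting. When a scorer is used for model selection, scikit-learn treats larger scores as better, so errors are usually negated. The sketch below, which assumes synthetic data from `make_regression` in place of the candy dataset, compares `make_scorer(..., greater_is_better=False)` with the built-in `"neg_mean_squared_error"` string. The names `X_demo`, `y_demo`, and `model` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same shape as the candy features (assumption).
X_demo, y_demo = make_regression(n_samples=85, n_features=11, noise=10.0, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=1111)

# greater_is_better=False makes the scorer return the negated MSE.
neg_mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
custom_scores = cross_val_score(model, X_demo, y_demo, cv=10, scoring=neg_mse_scorer)

# The built-in scoring string follows the same sign convention.
builtin_scores = cross_val_score(model, X_demo, y_demo, cv=10, scoring="neg_mean_squared_error")

# Flip the sign to report a positive mean error.
print(-custom_scores.mean())
```

Both calls should return the same ten non-positive values, because the estimator and the folds are identical in each call.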


### Leave-one-out cross-validation

Let's assume your favorite candy is not in the candy dataset, and that you are interested in the popularity of this candy. Using 5-fold cross-validation will train on only 80% of the data at a time. The candy dataset has only 85 rows, though, and leaving out 20% of the data could hinder our model. Leave-one-out cross-validation (LOOCV) instead trains each model on all but one row, which makes the most of our limited dataset when estimating your favorite candy's popularity.

In this exercise, you will use cross_val_score() to perform LOOCV.
* Instructions

    * Create a scorer using mean_absolute_error for cross_val_score() to use.
    * Fill out cross_val_score() so that the model rfr, the newly defined mae_scorer, and LOOCV are used.
    * Print the mean and the standard deviation of scores using numpy (loaded as np).


In [24]:
from sklearn.metrics import mean_absolute_error

# Create scorer
mae_scorer = make_scorer(mean_absolute_error)

rfr = RandomForestRegressor(n_estimators=15, random_state=1111)

# Implement LOOCV: one fold per row, so cv is set to the number of rows
scores = cross_val_score(estimator=rfr, X=X, y=y, cv=X.shape[0], scoring=mae_scorer)

# Print the mean and standard deviation
print("The mean of the errors is: %s." % np.mean(scores))
print("The standard deviation of the errors is: %s." % np.std(scores))

The mean of the errors is: 9.52044832324183.
The standard deviation of the errors is: 7.349020637882744.
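
Passing `cv=X.shape[0]` is one way to request LOOCV. scikit-learn also provides a `LeaveOneOut` splitter that states the intent directly. The sketch below, which assumes a small synthetic dataset from `make_regression` and a smaller forest to keep it quick, checks that the two approaches give the same per-row errors. The names `X_demo`, `y_demo`, `model`, and `mae` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic stand-in for the candy data (assumption).
X_demo, y_demo = make_regression(n_samples=30, n_features=11, noise=10.0, random_state=0)
model = RandomForestRegressor(n_estimators=5, random_state=1111)
mae = make_scorer(mean_absolute_error)

# Explicit leave-one-out splitter: one validation row per fold.
loo = LeaveOneOut()
loo_scores = cross_val_score(model, X_demo, y_demo, cv=loo, scoring=mae)

# Integer cv equal to the number of rows, as in the cell above.
int_scores = cross_val_score(model, X_demo, y_demo, cv=X_demo.shape[0], scoring=mae)

print("Number of folds: %s" % loo.get_n_splits(X_demo))
print("Mean absolute error: %s" % np.mean(loo_scores))
```

Each score is the absolute error on a single held-out row, which is why the individual values vary so widely.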


In [27]:
scores

array([2.37803333e-02, 2.47870369e+01, 9.74897547e+00, 1.40292680e+01,
       1.22347290e+01, 1.43984723e+01, 1.17134173e+01, 1.20722279e+01,
       1.67160833e+00, 1.07965839e+01, 2.57086265e+01, 1.06381219e+01,
       2.58195450e+01, 1.50089847e+00, 1.39027689e-01, 7.17617623e+00,
       4.52800693e+00, 3.40470937e+00, 1.32163223e+01, 6.53413516e+00,
       3.93743453e+00, 1.05286297e+01, 1.30081473e+01, 1.63382045e+00,
       1.19905627e+00, 3.74306153e+00, 1.49037726e+01, 1.17556338e+01,
       2.39771329e+01, 2.02822253e+00, 7.46256057e+00, 9.55056979e+00,
       4.66078733e+00, 2.59597487e+00, 1.87996015e+00, 5.31957400e-01,
       1.16452668e+01, 1.66770417e+00, 3.68189193e+00, 1.09313461e+01,
       4.08630213e+00, 1.05372750e+01, 1.04115449e+01, 4.97415680e+00,
       2.13397774e+01, 8.95420552e+00, 1.38958781e+01, 7.27031000e+00,
       1.43695623e+00, 3.13020387e+00, 9.57358255e+00, 3.16714277e+01,
       1.49936082e+01, 1.87028892e+01, 2.88288620e+00, 4.24834540e+00,
      