In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
tic_tac = pd.read_csv('../tic-tac-toe.csv')
tic_tac.head()

Unnamed: 0,Top-Left,Top-Middle,Top-Right,Middle-Left,Middle-Middle,Middle-Right,Bottom-Left,Bottom-Middle,Bottom-Right,Class
0,x,x,x,x,o,o,x,o,o,positive
1,x,x,x,x,o,o,o,x,o,positive
2,x,x,x,x,o,o,o,o,x,positive
3,x,x,x,x,o,o,o,b,b,positive
4,x,x,x,x,o,o,b,o,b,positive


In [3]:
tic_tac['Class'] = tic_tac['Class'].map({
    'positive':1,
    'negative':0
})

### Create one holdout set

Your boss has asked you to create a simple random forest model on the tic_tac_toe dataset. She doesn't want you to spend much time selecting parameters; rather she wants to know how well the model will perform on future data. For future Tic-Tac-Toe games, it would be nice to know if your model can predict which player will win.

The dataset tic_tac_toe has been loaded for your use.

Note that in Python, a trailing backslash (as in =\ at the end of a line) tells the interpreter that the statement continues on the next line. It is used here because the code was too long for one line.
* Instructions

    * Create the X dataset by creating dummy variables for all of the categorical columns.
    * Split X and y into train (X_train, y_train) and test (X_test, y_test) datasets.
    * Split the datasets using 10% for testing


In [36]:
# Create dummy variables using pandas
X = pd.get_dummies(tic_tac.iloc[:, 0:9])
y = tic_tac.iloc[:, 9]

# Create training and testing datasets. Use 10% for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1111)
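
To see what pd.get_dummies() does to the board columns, here is a minimal sketch on a hypothetical two-column frame. The toy DataFrame and its values are made up for illustration; only the column names are borrowed from the dataset above. Each categorical column is expanded into one indicator column per distinct value.

```python
import pandas as pd

# Hypothetical miniature board: two squares, three games (not the real data)
toy = pd.DataFrame({
    "Top-Left": ["x", "o", "b"],
    "Top-Middle": ["x", "x", "o"],
})

# One indicator column per (square, value) pair that occurs in the data
toy_dummies = pd.get_dummies(toy)
print(toy_dummies.shape)
print(sorted(toy_dummies.columns))
```

With nine squares that can each hold x, o, or b, the full X can have up to 27 indicator columns.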

### Create two holdout sets

You recently created a simple random forest model to predict Tic-Tac-Toe game wins for your boss, and at her request, you did not do any parameter tuning. Unfortunately, the overall model accuracy was too low for her standards. This time around, she has asked you to focus on model performance.

Before you start testing different models and parameter sets, you will need to split the data into training, validation, and testing datasets. Remember that after splitting the data into training and testing datasets, the validation dataset is created by splitting the training dataset.

The datasets X and y have been loaded for your use.
* Instructions

    * Create temporary datasets and testing datasets (X_test, y_test). Use 20% of the overall data for the testing datasets.
    * Using the temporary datasets (X_temp, y_temp), create training (X_train, y_train) and validation (X_val, y_val) datasets.
    * Use 25% of the temporary data for the validation datasets.



In [7]:
# Create temporary training and final testing datasets
X_temp, X_test, y_temp, y_test  =\
    train_test_split(X, y, test_size=0.20, random_state=1111)

# Create the final training and validation datasets
X_train, X_val, y_train, y_val  =\
    train_test_split(X_temp, y_temp, test_size=0.25, random_state=1111)
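
Because the validation set is carved out of the temporary set, the 25% applies to the remaining 80% of the data, so the final proportions are 60% training, 20% validation, and 20% testing. A quick sanity check on hypothetical data (1,000 integer rows standing in for X and y, not the Tic-Tac-Toe data) could look like this:

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,000 rows with alternating 0/1 labels
rows = list(range(1000))
labels = [r % 2 for r in rows]

# First split: hold out 20% of everything for final testing
rows_temp, rows_test, labels_temp, labels_test = train_test_split(
    rows, labels, test_size=0.20, random_state=1111)

# Second split: 25% of the remaining 80% becomes the validation set
rows_train, rows_val, labels_train, labels_val = train_test_split(
    rows_temp, labels_temp, test_size=0.25, random_state=1111)

print(len(rows_train), len(rows_val), len(rows_test))  # 600 200 200
```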


### Confusion matrices, again

Creating a confusion matrix in Python is simple. The biggest challenge is making sure you understand the orientation of the matrix. This exercise makes sure you understand the sklearn implementation of confusion matrices, in which rows correspond to the actual classes and columns to the predicted classes. Here, you have created a random forest model, rfc, using the tic_tac_toe dataset to predict outcomes of 0 (a loss) or 1 (a win) for Player One.

Note: If you read about confusion matrices on another website or for another programming language, the rows and columns might be swapped.

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score

In [20]:
# Note: test_size=0.9 trains on only 10% of the rows and holds out the remaining 90% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.9, random_state=1111)

In [21]:
rfc = RandomForestClassifier(n_estimators=500, max_depth=None, random_state=1111)
rfc.fit(X_train, y_train)

RandomForestClassifier(n_estimators=500, random_state=1111)

In [22]:
# Create predictions
test_predictions = rfc.predict(X_test)

# Create and print the confusion matrix
cm = confusion_matrix(y_test, test_predictions)
print(cm)

# Print the true positives (actual 1s that were predicted 1s)
print("The number of true positives is: {}".format(cm[1, 1]))

[[177 123]
 [ 92 471]]
The number of true positives is: 471
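
To double-check the orientation, here is a small sketch with made-up labels (y_true_demo and y_pred_demo are hypothetical, not taken from the model above). It shows that sklearn puts the actual classes in rows and the predicted classes in columns, so for binary 0/1 labels ravel() yields the counts in the order tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: three actual 0s followed by four actual 1s
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Row index = actual class, column index = predicted class
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print("tn={}, fp={}, fn={}, tp={}".format(tn, fp, fn, tp))
```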


### Precision vs. Recall

In [23]:
# Compute the precision score from the true labels and the predictions
score = precision_score(y_test, test_predictions)

# Print the final result
print("The precision value is {0:.2f}".format(score))

The precision value is 0.79


In [24]:
# Compute the recall score from the true labels and the predictions
score = recall_score(y_test, test_predictions)

# Print the final result
print("The recall value is {0:.2f}".format(score))

The recall value is 0.84
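
Both scores can be recomputed by hand from the confusion matrix printed earlier, which is a useful check on the orientation. The sketch below simply hard-codes those four counts; the variable names are illustrative.

```python
# Counts from the confusion matrix printed earlier: [[177, 123], [92, 471]]
tn, fp = 177, 123   # actual 0s: predicted 0, predicted 1
fn, tp = 92, 471    # actual 1s: predicted 0, predicted 1

# Precision: of everything predicted as a win, how much really was a win?
precision = tp / (tp + fp)
# Recall: of all real wins, how many did the model find?
recall = tp / (tp + fn)

print("precision: {0:.2f}".format(precision))  # precision: 0.79
print("recall: {0:.2f}".format(recall))        # recall: 0.84
```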


### Error due to under/over-fitting
The candy dataset is prime for overfitting. With only 85 observations, if you use 20% for the testing dataset, you are losing a lot of vital data that could be used for modeling. Imagine the scenario where most of the chocolate candies ended up in the training data and very few in the holdout sample. The model might only see that chocolate is a vital factor, but fail to find that other attributes are also important. In this exercise, you'll explore how max_features, the number of features (columns) a random forest considers at each split, can lead to overfitting when it is set too high.

In [31]:
candy = pd.read_csv('../candy-data.csv')
X = candy.drop(['competitorname', 'winpercent'], axis=1)
y = candy['winpercent']

X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=.8, random_state=111)

In [32]:
X_train.shape, X_test.shape

((68, 11), (17, 11))

In [28]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse

#### Create a random forest model with 25 trees, a random state of 1111, and max_features of 2. Read the print statements.

In [33]:
# Update the rfr model
rfr = RandomForestRegressor(n_estimators=25,
                            random_state=1111,
                            max_features=2)
rfr.fit(X_train, y_train)

# Print the training and testing errors (mean absolute error)
print('The training error is {0:.2f}'.format(
  mae(y_train, rfr.predict(X_train))))
print('The testing error is {0:.2f}'.format(
  mae(y_test, rfr.predict(X_test))))

The training error is 3.77
The testing error is 9.67


#### Set max_features to 11 (the number of columns in the dataset). Read the print statements.

In [34]:
# Update the rfr model
rfr = RandomForestRegressor(n_estimators=25,
                            random_state=1111,
                            max_features=11)
rfr.fit(X_train, y_train)

# Print the training and testing errors (mean absolute error)
print('The training error is {0:.2f}'.format(
  mae(y_train, rfr.predict(X_train))))
print('The testing error is {0:.2f}'.format(
  mae(y_test, rfr.predict(X_test))))

The training error is 3.89
The testing error is 10.46


#### Set max_features equal to 4. Read the print statements.

In [35]:
# Update the rfr model
rfr = RandomForestRegressor(n_estimators=25,
                            random_state=1111,
                            max_features=4)
rfr.fit(X_train, y_train)

# Print the training and testing errors (mean absolute error)
print('The training error is {0:.2f}'.format(
  mae(y_train, rfr.predict(X_train))))
print('The testing error is {0:.2f}'.format(
  mae(y_test, rfr.predict(X_test))))

The training error is 3.88
The testing error is 10.42
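
The three runs above can be condensed into one loop over max_features. This is only a sketch: it uses synthetic data from make_regression (an assumption, chosen so the block runs without candy-data.csv), so its numbers will not match the candy results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the candy features (85 rows, 11 columns)
X_demo, y_demo = make_regression(n_samples=85, n_features=11,
                                 noise=10.0, random_state=1111)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=1111)

errors = {}
for max_features in [2, 4, 11]:
    model = RandomForestRegressor(n_estimators=25,
                                  random_state=1111,
                                  max_features=max_features)
    model.fit(X_tr, y_tr)
    # Store (training MAE, testing MAE) for this setting
    errors[max_features] = (mean_absolute_error(y_tr, model.predict(X_tr)),
                            mean_absolute_error(y_te, model.predict(X_te)))

for max_features, (train_err, test_err) in errors.items():
    print("max_features={}: train={:.2f}, test={:.2f}".format(
        max_features, train_err, test_err))
```

In the candy runs above, the smallest setting (max_features=2) gave the lowest testing error: 9.67, versus 10.42 and 10.46 for the larger settings.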


### Am I underfitting?

You are creating a random forest model to predict if you will win a future game of Tic-Tac-Toe. Using the **tic_tac_toe** dataset, you have created training and testing datasets: `X_train`, `X_test`, `y_train`, and `y_test`.

You have decided to create several random forest models with varying numbers of trees (1, 2, 3, 4, 5, 10, 20, and 50). The more trees you use, the longer your random forest model will take to run. However, if you don't use enough trees, you risk underfitting. You have created a for loop to test your model with each number of trees.

In [37]:
tic_tac = pd.read_csv('../tic-tac-toe.csv')
tic_tac.head()

Unnamed: 0,Top-Left,Top-Middle,Top-Right,Middle-Left,Middle-Middle,Middle-Right,Bottom-Left,Bottom-Middle,Bottom-Right,Class
0,x,x,x,x,o,o,x,o,o,positive
1,x,x,x,x,o,o,o,x,o,positive
2,x,x,x,x,o,o,o,o,x,positive
3,x,x,x,x,o,o,o,b,b,positive
4,x,x,x,x,o,o,b,o,b,positive


In [38]:
tic_tac['Class'] = tic_tac['Class'].map({
    'positive':1,
    'negative':0
})

In [39]:
# Create dummy variables using pandas
X = pd.get_dummies(tic_tac.iloc[:, 0:9])
y = tic_tac.iloc[:, 9]

# Create training and testing datasets. Use 20% for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

* Instructions

    * For each loop, predict values for both the X_train and X_test datasets.
    * For each loop, append the accuracy_score() of the y_train dataset and the corresponding predictions to train_scores.
    * For each loop, append the accuracy_score() of the y_test dataset and the corresponding predictions to test_scores.
    * Print the training and testing scores using the print statements.


In [41]:
# accuracy_score was already imported from sklearn.metrics earlier in the notebook

test_scores, train_scores = [], []
for i in [1, 2, 3, 4, 5, 10, 20, 50]:
    rfc = RandomForestClassifier(n_estimators=i, random_state=1111)
    rfc.fit(X_train, y_train)
    # Create predictions for the X_train and X_test datasets.
    train_predictions = rfc.predict(X_train)
    test_predictions = rfc.predict(X_test)
    # Append the accuracy score for the test and train predictions.
    train_scores.append(round(accuracy_score(y_train, train_predictions), 2))
    test_scores.append(round(accuracy_score(y_test, test_predictions), 2))

# Print the train and test scores.
print("The training scores were: {}".format(train_scores))
print("The testing scores were: {}".format(test_scores))

The training scores were: [0.94, 0.93, 0.98, 0.97, 0.99, 1.0, 1.0, 1.0]
The testing scores were: [0.83, 0.79, 0.89, 0.91, 0.91, 0.93, 0.97, 0.98]
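
One way to read these scores is to look at the gap between training and testing accuracy for each forest size. The sketch below hard-codes the scores printed above; the variable names are illustrative.

```python
# Scores copied from the output above
n_trees = [1, 2, 3, 4, 5, 10, 20, 50]
train_acc = [0.94, 0.93, 0.98, 0.97, 0.99, 1.0, 1.0, 1.0]
test_acc = [0.83, 0.79, 0.89, 0.91, 0.91, 0.93, 0.97, 0.98]

# Gap between training and testing accuracy for each forest size
gaps = [round(train - test, 2) for train, test in zip(train_acc, test_acc)]
for n, train, test, gap in zip(n_trees, train_acc, test_acc, gaps):
    print("{:>2} trees: train={:.2f} test={:.2f} gap={:.2f}".format(
        n, train, test, gap))

# Forest size with the highest testing accuracy
best_n = max(zip(test_acc, n_trees))[1]
print("Best testing accuracy with {} trees".format(best_n))
```

Testing accuracy is still rising at 50 trees while the gap to the training accuracy shrinks, which suggests the smaller forests were underfitting.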
