In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [4]:
candy = pd.read_csv('../candy-data.csv')
candy.head()

Unnamed: 0,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
0,100 Grand,1,0,1,0,0,1,0,1,0,0.732,0.86,66.971725
1,3 Musketeers,1,0,0,0,1,0,0,1,0,0.604,0.511,67.602936
2,One dime,0,0,0,0,0,0,0,0,0,0.011,0.116,32.261086
3,One quarter,0,0,0,0,0,0,0,0,0,0.011,0.511,46.116505
4,Air Heads,0,1,0,0,0,0,0,0,0,0.906,0.511,52.341465


In [12]:
X = candy.drop(['competitorname', 'winpercent'], axis=1)
y = candy['winpercent']

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.59, random_state=111)

### Seen vs. unseen data

Models tend to have higher accuracy on observations they have seen before. In the candy dataset, predicting the popularity of Skittles will likely be more accurate than predicting the popularity of Andes Mints: Skittles is in the dataset, and Andes Mints is not.

You've built a model based on 50 candies using the dataset X_train and need to report how accurate the model is at predicting the popularity of the 50 candies it was built on and the 35 candies (X_test) it has never seen. You will use the mean absolute error, mae(), as the metric. Because it measures error, lower values mean better predictions.
* Instructions

    * Using X_train and X_test as input data, create arrays of predictions using model.predict().
    * Calculate model accuracy on both data the model has seen and data the model has not seen before.
    * Use the print statements to print the model error on seen and unseen data.


In [14]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=500, random_state=111)

In [29]:
from sklearn.metrics import mean_absolute_error as mae

In [30]:
# The model is fit using X_train and y_train
model.fit(X_train, y_train)

# Create vectors of predictions
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

# Train/Test Errors
train_error = mae(y_true=y_train, y_pred=train_predictions)
test_error = mae(y_true=y_test, y_pred=test_predictions)

# Print the error for seen and unseen data
print("Model error on seen data: {0:.2f}.".format(train_error))
print("Model error on unseen data: {0:.2f}.".format(test_error))

Model error on seen data: 3.65.
Model error on unseen data: 12.08.
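
The error on unseen data (12.08) is more than three times the error on seen data (3.65). For reference, mean absolute error is the average absolute difference between true and predicted values, so it is expressed in the same units as winpercent. A minimal sketch of the calculation with NumPy follows; `manual_mae` is a hypothetical helper name and the numbers are made up for illustration, not taken from the candy data.

```python
import numpy as np

def manual_mae(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Illustrative values only, not from the candy dataset
y_true_demo = [50.0, 60.0, 70.0]
y_pred_demo = [52.0, 57.0, 70.0]

# Absolute errors are 2, 3 and 0, so the mean is 5 / 3
demo_error = manual_mae(y_true_demo, y_pred_demo)
print("Demo MAE: {0:.2f}".format(demo_error))
```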


### Set parameters and fit a model

Predictive tasks fall into one of two categories: regression or classification. In the candy dataset, the outcome is a continuous variable describing how often the candy was chosen over another candy in a series of 1-on-1 match-ups. To predict this value (the win-percentage), you will use a regression model.

In this exercise, you will set a few parameters on a random forest regression model, rfr.
* Instructions

    * Set parameters on rfr so that the number of trees built is 100 and the maximum depth of these trees is 6.
    * Make sure the model is reproducible by adding a random state of 1111.
    * Use the .fit() method to train the random forest regression model with X_train as the input data and y_train as the response.



In [31]:
rfr = RandomForestRegressor()

In [32]:
# Set the number of trees
rfr.n_estimators = 100

# Add a maximum depth
rfr.max_depth = 6

# Set the random state
rfr.random_state = 1111

# Fit the model
rfr.fit(X_train, y_train)

RandomForestRegressor(max_depth=6, random_state=1111)
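
Assigning attributes after construction works here because, by scikit-learn convention, an estimator stores its parameters and only uses them when `.fit()` is called. The same settings can also be passed to the constructor or through `set_params()`. A minimal sketch, assuming synthetic data from `make_regression` as a stand-in for the candy data (the names `X_demo`, `y_demo`, `rfr_a`, and `rfr_b` are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the candy features and target (illustrative only)
X_demo, y_demo = make_regression(n_samples=50, n_features=11, random_state=0)

# Option 1: pass the parameters to the constructor
rfr_a = RandomForestRegressor(n_estimators=100, max_depth=6, random_state=1111)

# Option 2: set them on an existing estimator with set_params()
rfr_b = RandomForestRegressor()
rfr_b.set_params(n_estimators=100, max_depth=6, random_state=1111)

rfr_a.fit(X_demo, y_demo)
rfr_b.fit(X_demo, y_demo)

# Both estimators report the same settings
print(rfr_a.get_params()["max_depth"], rfr_b.get_params()["max_depth"])
```

Passing parameters to the constructor keeps the configuration in one place, which may be easier to read than separate attribute assignments.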

### Feature importances

Although some candy attributes, such as chocolate, may be extremely popular, that doesn't mean they will be important to the model's predictions. After a random forest model has been fit, you can review its `.feature_importances_` attribute to see which variables had the biggest impact. You can check how important each variable was by looping over the feature importance array using `enumerate()`.

If you are unfamiliar with Python's `enumerate()` function, it loops over a list while also providing an automatic counter.
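
As a quick illustration of `enumerate()`, the sketch below pairs each value with its position and uses that position to look up a name in a second list. The names and scores are hypothetical and chosen only for the example.

```python
# Hypothetical feature names and scores, for illustration only
names = ["chocolate", "fruity", "caramel"]
scores = [0.5, 0.3, 0.2]

lines = []
for i, score in enumerate(scores):
    # i is the position (0, 1, 2, ...); score is the value at that position
    lines.append("{0:s}: {1:.2f}".format(names[i], score))

print("\n".join(lines))
```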

In [33]:
# Fit the model using X_train and y_train
rfr.fit(X_train, y_train)

# Print how important each column is to the model
for i, item in enumerate(rfr.feature_importances_):
    # Use i and item to print out the feature importance of each column
    print("{0:s}: {1:.2f}".format(X_train.columns[i], item))

chocolate: 0.14
fruity: 0.03
caramel: 0.03
peanutyalmondy: 0.05
nougat: 0.02
crispedricewafer: 0.04
hard: 0.01
bar: 0.14
pluribus: 0.02
sugarpercent: 0.28
pricepercent: 0.26
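
The list above follows column order, so the largest values do not stand out. One possible way to rank them is to wrap the array in a pandas Series and sort it. The sketch below assumes synthetic data and hypothetical column names (`f0` to `f4`) rather than the candy data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical column names (illustrative only)
X_arr, y_demo = make_regression(n_samples=60, n_features=5, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3", "f4"])

forest = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=1111)
forest.fit(X_demo, y_demo)

# Pair each importance with its column name, then sort from largest to smallest
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
ranked = importances.sort_values(ascending=False)
print(ranked.round(2))
```

For a fitted random forest in scikit-learn, these impurity-based importances are normalized, so they should sum to approximately 1.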


In [34]:
X_train.shape

(50, 11)

In [35]:
y_train.shape

(50,)

In [36]:
tic_tac = pd.read_csv("../tic-tac-toe.csv")
tic_tac.head()

Unnamed: 0,Top-Left,Top-Middle,Top-Right,Middle-Left,Middle-Middle,Middle-Right,Bottom-Left,Bottom-Middle,Bottom-Right,Class
0,x,x,x,x,o,o,x,o,o,positive
1,x,x,x,x,o,o,o,x,o,positive
2,x,x,x,x,o,o,o,o,x,positive
3,x,x,x,x,o,o,o,b,b,positive
4,x,x,x,x,o,o,b,o,b,positive


In [37]:
tic_tac['Top-Left'].value_counts()

x    418
o    335
b    205
Name: Top-Left, dtype: int64

In [39]:
ss = tic_tac.values