# Model Fitting

In this file, we will experiment with different prediction models to see whether we can build one that accurately predicts if a Pokemon is legendary or not.

Models Used: Decision Tree, Random Forest

In [2]:
# Module Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [3]:
# data imports
pokemon_data = pd.read_csv("../data/pokemon_clean.csv")

Our goal is a model that accurately predicts whether a given Pokemon is legendary, using the other properties (variates) of the Pokemon.

For this, we will fit and compare 2 different models: a Decision Tree and a Random Forest. XGBoost is left for future work.

At the time of writing (Jan 30th 2024), there have been 9 generations of Pokemon, each with its own set of legendaries. So we can also use those legendaries to test whether our model stays accurate on future data!

Let's start by splitting our data into training data and validation data and fitting a Decision Tree. For this model, we will remove all unique attributes (such as name, classification, japanese_name, abilities) and also one-hot encode the types using 25 variables.
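Before the full cell, here is a toy illustration of the encoding step on its own. It is only a sketch: the two small frames and the names `toy_train`, `toy_valid` and `toy_encoder` are made up for this example and are not part of the Pokemon data. The point it shows is that the encoder learns its categories from the training split and is then reused, unchanged, on the validation split.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames (made up for illustration; not rows from pokemon_clean.csv).
toy_train = pd.DataFrame({"type1": ["fire", "water", "grass"]})
toy_valid = pd.DataFrame({"type1": ["water", "dragon"]})  # "dragon" is unseen

# Learn the categories from the training split only.
toy_encoder = OneHotEncoder(handle_unknown="ignore")
encoded_train = toy_encoder.fit_transform(toy_train).toarray()

# Reuse the fitted encoder so the validation columns match the training ones.
encoded_valid = toy_encoder.transform(toy_valid).toarray()

print(toy_encoder.categories_[0])  # categories learned from the training split
print(encoded_valid)               # the unseen "dragon" row is all zeros
```

With `handle_unknown="ignore"`, a category that never appeared in training is encoded as a row of zeros instead of raising an error.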

In [4]:
y = pokemon_data["is_legendary"]
# Drop the target and the per-Pokemon unique columns; errors="ignore" skips
# any name that is not present in the frame.
X = pokemon_data.drop(
    columns=["is_legendary", "abilities", "name", "classfication", "japanese_name"],
    errors="ignore",
)
# Columns with dtype "object" are the categorical ones to one-hot encode.
object_cols = list(X.columns[X.dtypes == "object"])

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)

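# Fit the encoder on the training split only, then reuse it on the validation
# split so both frames get the same one-hot columns.
# .toarray() returns a dense array on any scikit-learn version (the `sparse`
# keyword was renamed to `sparse_output` in 1.2).
OH_encoder = OneHotEncoder(handle_unknown='ignore')
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(train_X[object_cols]).toarray())
OH_cols_valid = pd.DataFrame(OH_encoder.transform(val_X[object_cols]).toarray())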

OH_cols_train.index = train_X.index
OH_cols_valid.index = val_X.index

num_X_train = train_X.drop(object_cols, axis=1)
num_X_valid = val_X.drop(object_cols, axis=1)

OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

poke_model = DecisionTreeRegressor()
poke_model.fit(OH_X_train.values, train_y)
val_predictions = poke_model.predict(OH_X_valid.values)
print(mean_absolute_error(val_y, val_predictions))

0.014925373134328358


Here we get an MAE of approximately 0.0149. Now let's try a Random Forest.

In [5]:
forest_model_full = RandomForestRegressor(random_state=1)
forest_model_full.fit(OH_X_train.values, train_y)
forest_full_predict = forest_model_full.predict(OH_X_valid.values)
print(mean_absolute_error(val_y, forest_full_predict))

0.014776119402985073


The Random Forest is marginally better on the full feature set (its MAE is lower by approximately 0.00015).
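Since is_legendary is a 0/1 label, the regressors above output values between 0 and 1, and the MAE is the average distance of those values from the true label. If legendaries are rare, a model that always predicts 0 already gets a low MAE, so it can help to look at classification metrics as well. The following is a minimal sketch of that idea, not the approach used in this notebook: it swaps in `RandomForestClassifier` and runs on synthetic data, and every `toy_*` name and the roughly 10% positive rate are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_absolute_error, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Pokemon data (assumption: about 10% positives and
# a single "base_total"-like feature that tends to be higher for positives).
rng = np.random.default_rng(1)
toy_y = (rng.random(400) < 0.1).astype(int)
toy_X = (400 + 150 * toy_y + rng.normal(0, 60, size=400)).reshape(-1, 1)

tr_X, va_X, tr_y, va_y = train_test_split(
    toy_X, toy_y, random_state=1, stratify=toy_y
)

clf = RandomForestClassifier(random_state=1)
clf.fit(tr_X, tr_y)
pred = clf.predict(va_X)

# For 0/1 labels and 0/1 predictions, MAE is exactly 1 - accuracy.
toy_accuracy = accuracy_score(va_y, pred)
toy_mae = mean_absolute_error(va_y, pred)
# Recall: the fraction of true positives that the model actually finds.
toy_recall = recall_score(va_y, pred, zero_division=0)
print(toy_accuracy, toy_mae, toy_recall)
```

Recall is the number to watch for a rare class, because it is not inflated by the many easy negatives.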

Now that we've looked at the full models, let's see what we can do using only specific features.

When I think of which factors could affect the legendary predictions, I think of the following:
- Base Stat Totals (they should be high for legendaries)
- Types (certain type-combos are more likely to have legendaries, see explore_data)
- Height and Weight
- Generation
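One way to sanity-check a shortlist like this would be to look at the feature importances of a fitted forest. The sketch below is illustrative only: it uses a synthetic frame whose column names merely echo the list above, and the label is constructed (by assumption) to depend on `base_total` alone, so the ranking it prints says nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic frame (assumption: values are random, not from pokemon_clean.csv).
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "base_total": rng.normal(430, 100, n),
    "generation": rng.integers(1, 8, n),
    "noise": rng.normal(0, 1, n),
})
# Assumption for the toy data only: the label depends on base_total alone.
toy_label = (toy["base_total"] > 560).astype(int)

forest = RandomForestRegressor(random_state=1)
forest.fit(toy.values, toy_label)

# Impurity-based importances, one per column, normalised to sum to 1.
importances = pd.Series(forest.feature_importances_, index=toy.columns)
print(importances.sort_values(ascending=False))
```

Running the same two lines on `forest_model_full` with the columns of `OH_X_train` as the index would give the equivalent ranking for the real features.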

Let's try to fit Decision Tree and Random Forest models using these features (height and weight are not in the column list used for this first attempt):

In [8]:
cols_to_use = ["base_total", "type1", "type2", "generation"]

X_reduced = pokemon_data[cols_to_use]

train_X_r, val_X_r, train_y_r, val_y_r = train_test_split(X_reduced, y, random_state = 1)

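# Fit the encoder on the training split, then only transform the validation
# split so both frames share the same one-hot columns.
OH_encoder_R = OneHotEncoder(handle_unknown='ignore')
OH_cols_train_R = pd.DataFrame(OH_encoder_R.fit_transform(train_X_r[object_cols]).toarray())
OH_cols_valid_R = pd.DataFrame(OH_encoder_R.transform(val_X_r[object_cols]).toarray())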

OH_cols_train_R.index = train_X_r.index
OH_cols_valid_R.index = val_X_r.index

num_X_train_r = train_X_r.drop(object_cols, axis=1)
num_X_valid_r = val_X_r.drop(object_cols, axis=1)

OH_X_train_r = pd.concat([num_X_train_r, OH_cols_train_R], axis=1)
OH_X_valid_r = pd.concat([num_X_valid_r, OH_cols_valid_R], axis=1)

tree_r = DecisionTreeRegressor()
tree_r.fit(OH_X_train_r.values, train_y_r)
val_predictions_r = tree_r.predict(OH_X_valid_r.values)
print(mean_absolute_error(val_y_r, val_predictions_r))

0.05223880597014925


This is not an improvement compared to the full model. The MAE actually got worse (0.0522 versus 0.0149). Let's see if we can play around with max_leaf_nodes.

In [9]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree limited to max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

for max_leaf_nodes in [5, 6, 7, 8, 9, 10, 15, 20, 100]:
    my_mae = get_mae(max_leaf_nodes, OH_X_train_r.values, OH_X_valid_r.values, train_y_r, val_y_r)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %f" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  0.058560
Max leaf nodes: 6  		 Mean Absolute Error:  0.060071
Max leaf nodes: 7  		 Mean Absolute Error:  0.060624
Max leaf nodes: 8  		 Mean Absolute Error:  0.060624
Max leaf nodes: 9  		 Mean Absolute Error:  0.059522
Max leaf nodes: 10  		 Mean Absolute Error:  0.057294
Max leaf nodes: 15  		 Mean Absolute Error:  0.054263
Max leaf nodes: 20  		 Mean Absolute Error:  0.061325
Max leaf nodes: 100  		 Mean Absolute Error:  0.054726


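None of these settings beats the unrestricted tree on the reduced features (0.0522), let alone the full model, so it seems like we cannot improve on the model using Decision Trees.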
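Instead of reading the best setting off a printed table, the sweep can also collect the scores and pick the minimum. This is a small sketch on synthetic data: `toy_mae`, `candidates`, `scores` and `best_k` are names invented for the example, and the data is random rather than the Pokemon features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the Pokemon features).
rng = np.random.default_rng(1)
toy_X = rng.normal(size=(400, 4))
signal = toy_X[:, 0] + 0.5 * toy_X[:, 1] + rng.normal(0, 0.5, 400)
toy_y = (signal > 1.5).astype(int)
tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=1)

def toy_mae(max_leaf_nodes):
    """Validation MAE of a tree limited to max_leaf_nodes leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, model.predict(va_X))

candidates = [5, 10, 15, 20, 100]
# Map each candidate size to its validation MAE, then take the smallest.
scores = {k: toy_mae(k) for k in candidates}
best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])
```

Note that a setting chosen on the validation split will look slightly better there than it would on genuinely new data, which is one reason to consider cross-validation later.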

So let's try using a Random Forest.

In [10]:
forest_model_reduced = RandomForestRegressor(random_state=1)
forest_model_reduced.fit(OH_X_train_r.values, train_y_r)
forest_reduced_predict = forest_model_reduced.predict(OH_X_valid_r.values)
print(mean_absolute_error(val_y_r, forest_reduced_predict))

0.04820066334991708


This is still not as good as the full model (0.0482 versus 0.0148). So using these tools, the best model is still the Random Forest trained on the full feature set.

In the future, I want to try doing:
- Feature Engineering
- Cross-Validation
- XGBoost
- Compare the model to data from future generations
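As a starting point for the cross-validation item, here is a minimal sketch using scikit-learn's `cross_val_score`. It runs on synthetic data, and the choice of 5 folds, 50 trees and the `toy_*` names are assumptions for the example, not settings taken from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the Pokemon features).
rng = np.random.default_rng(1)
toy_X = rng.normal(size=(300, 3))
toy_y = (toy_X[:, 0] > 1.0).astype(int)

model = RandomForestRegressor(n_estimators=50, random_state=1)

# scikit-learn scorers follow "higher is better", so MAE comes back negated.
neg_mae = cross_val_score(
    model, toy_X, toy_y, cv=5, scoring="neg_mean_absolute_error"
)
cv_mae = -neg_mae
print(cv_mae.mean(), cv_mae.std())
```

Averaging over several folds gives a steadier estimate than the single train/validation split used above, which matters when differences between models are as small as 0.00015.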