Install and set up the environment

First, let's install TensorFlow through pip. While you can install a version of TensorFlow that uses your GPU, we'll be using the CPU version. Run this in a notebook cell, or drop the leading ! to run it in your terminal:

In [None]:
!pip install tensorflow

Now that it's installed, we can truly begin. Let's import TensorFlow and a few other packages we'll need. All of this course involves using the command line interface. Enter these commands to import the necessary packages:


In [2]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing

There are a lot of existing compilations of Pokémon stats, but we'll be using a CSV version [found on Kaggle](https://www.kaggle.com/alopez247/pokemon). There's a download button on the website, so save the file as pokemon.csv in your working directory and we can begin.

In [5]:
import os
!ls

sample_data


In [6]:
df = pd.read_csv('pokemon.csv')
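
If read_csv raises a FileNotFoundError, the file is probably not in the notebook's working directory (the !ls output above lists only sample_data). Here is a minimal sketch of a guard, assuming the file is named pokemon.csv as in the cell above. The helper name load_pokemon_csv is hypothetical and not part of pandas:

```python
from pathlib import Path

import pandas as pd


def load_pokemon_csv(path="pokemon.csv"):
    """Read the CSV if it exists, otherwise fail with a clearer message."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Report the directory that was actually searched.
        raise FileNotFoundError(
            f"{csv_path.name} not found in {csv_path.resolve().parent}; "
            "download it from Kaggle and place it there first."
        )
    return pd.read_csv(csv_path)
```

Calling load_pokemon_csv() with no argument looks for pokemon.csv in the current working directory, just like the cell above.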

First, let's see what the categories of data are.

In [7]:
df.columns

Index(['Number', 'Name', 'Type_1', 'Type_2', 'Total', 'HP', 'Attack',
       'Defense', 'Sp_Atk', 'Sp_Def', 'Speed', 'Generation', 'isLegendary',
       'Color', 'hasGender', 'Pr_Male', 'Egg_Group_1', 'Egg_Group_2',
       'hasMegaEvolution', 'Height_m', 'Weight_kg', 'Catch_Rate',
       'Body_Style'],
      dtype='object')

We'll narrow our focus a little and only select categories we think will be relevant.

In [8]:
df = df[['isLegendary','Generation', 'Type_1', 'Type_2', 'HP', 'Attack', 'Defense', 'Sp_Atk', 'Sp_Def', 'Speed','Color','Egg_Group_1','Height_m','Weight_kg','Body_Style']]

In [9]:
df.head()

Unnamed: 0,isLegendary,Generation,Type_1,Type_2,HP,Attack,Defense,Sp_Atk,Sp_Def,Speed,Color,Egg_Group_1,Height_m,Weight_kg,Body_Style
0,False,1,Grass,Poison,45,49,49,65,65,45,Green,Monster,0.71,6.9,quadruped
1,False,1,Grass,Poison,60,62,63,80,80,60,Green,Monster,0.99,13.0,quadruped
2,False,1,Grass,Poison,80,82,83,100,100,80,Green,Monster,2.01,100.0,quadruped
3,False,1,Fire,,39,52,43,60,50,65,Red,Monster,0.61,8.5,bipedal_tailed
4,False,1,Fire,,58,64,58,80,65,80,Red,Monster,1.09,19.0,bipedal_tailed


In [10]:
df['isLegendary'] = df['isLegendary'].astype(int)

In [11]:
df.head()

Unnamed: 0,isLegendary,Generation,Type_1,Type_2,HP,Attack,Defense,Sp_Atk,Sp_Def,Speed,Color,Egg_Group_1,Height_m,Weight_kg,Body_Style
0,0,1,Grass,Poison,45,49,49,65,65,45,Green,Monster,0.71,6.9,quadruped
1,0,1,Grass,Poison,60,62,63,80,80,60,Green,Monster,0.99,13.0,quadruped
2,0,1,Grass,Poison,80,82,83,100,100,80,Green,Monster,2.01,100.0,quadruped
3,0,1,Fire,,39,52,43,60,50,65,Red,Monster,0.61,8.5,bipedal_tailed
4,0,1,Fire,,58,64,58,80,65,80,Red,Monster,1.09,19.0,bipedal_tailed


There are a few other categories that we'll need to convert as well. Let's look at "Type_1" as an example. Pokémon have associated elements, such as water and fire. Our first intuition for converting these to numbers could be to just assign a number to each category, such as: Water = 1, Fire = 2, Grass = 3 and so on. This isn't a good idea because these categories aren't ordinal; they don't lie on a scale. By doing this, we would be implying that Water is closer to Fire than it is to Grass, which doesn't really make sense. Instead, we'll create dummy variables: one new column per category value, marking whether or not each Pokémon belongs to it.

In [12]:
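def dummy_creation(df, dummy_categories):
    for category in dummy_categories:
        # One indicator column per distinct value in this category.
        df_dummy = pd.get_dummies(df[category])
        df = pd.concat([df, df_dummy], axis=1)
        # Drop the original text column now that it is encoded.
        df = df.drop(category, axis=1)
    return df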
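
To see what pd.get_dummies produces, here is a minimal sketch on a made-up three-row DataFrame. The toy frame and its values are illustrative assumptions, not rows taken from the dataset:

```python
import pandas as pd

# Hypothetical three-row frame, used only to show the shape of the output.
toy = pd.DataFrame({"Type_1": ["Water", "Fire", "Grass"], "HP": [44, 39, 45]})

# One indicator column per category, in sorted order: Fire, Grass, Water.
dummies = pd.get_dummies(toy["Type_1"])

# Same steps as dummy_creation: append the indicators, drop the original.
encoded = pd.concat([toy, dummies], axis=1).drop("Type_1", axis=1)

print(list(encoded.columns))  # ['HP', 'Fire', 'Grass', 'Water']
```

Each row has exactly one indicator set, so no category ends up "closer" to another than any other pair.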

In [20]:
sum(df.columns.value_counts())

15

In [21]:
df = dummy_creation(df, ['Egg_Group_1', 'Body_Style', 'Color','Type_1', 'Type_2'])

In [22]:
sum(df.columns.value_counts())

85

Split and Normalize Data

In [26]:
df.Generation.nunique()

6

In [27]:
def train_test_splitter(DataFrame, column):
    # Generation 1 becomes the test set; every other generation is training data.
    df_train = DataFrame.loc[DataFrame[column] != 1]
    df_test = DataFrame.loc[DataFrame[column] == 1]

    df_train = df_train.drop(column, axis=1)
    df_test = df_test.drop(column, axis=1)

    return df_train, df_test

df_train, df_test = train_test_splitter(df, 'Generation')

This function takes any Pokémon whose "Generation" label is equal to 1 and puts it into the test dataset, and puts every other Pokémon into the training dataset. It then drops the Generation category from both datasets.
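
The same split can be sketched on a made-up frame to check the logic. The five rows below are hypothetical; only the column names mirror the tutorial:

```python
import pandas as pd

# Hypothetical five-row frame; only the column names mirror the tutorial.
toy = pd.DataFrame({
    "Generation": [1, 1, 2, 3, 1],
    "isLegendary": [0, 1, 0, 1, 0],
    "HP": [45, 106, 60, 100, 39],
})

# Boolean mask built from the same frame that is being filtered.
is_gen_one = toy["Generation"] == 1

toy_test = toy.loc[is_gen_one].drop("Generation", axis=1)
toy_train = toy.loc[~is_gen_one].drop("Generation", axis=1)

print(len(toy_train), len(toy_test))  # 2 3
```

Note that .loc keeps the original row labels, so the test rows here are still labelled 0, 1 and 4 rather than being renumbered.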

In [28]:
def label_delineator(df_train, df_test, label):
    
    train_data = df_train.drop(label, axis=1).values
    train_labels = df_train[label].values
    test_data = df_test.drop(label,axis=1).values
    test_labels = df_test[label].values
    return(train_data, train_labels, test_data, test_labels)

This function extracts the data from the DataFrame and puts it into arrays that TensorFlow can understand, using .values. We then have the four groups of data:

In [29]:
train_data, train_labels, test_data, test_labels = label_delineator(df_train, df_test, 'isLegendary')

Now that we have our labels extracted from the data, let's normalize the data so everything is on the same scale:

In [None]:
def data_normalizer(train_data, test_data):
    # Learn each column's min and max from the training data only,
    # then apply that same scaling to the test data.
    scaler = preprocessing.MinMaxScaler().fit(train_data)
    train_data = scaler.transform(train_data)
    test_data = scaler.transform(test_data)
    return train_data, test_data

train_data, test_data = data_normalizer(train_data, test_data)

In [30]:
from sklearn.preprocessing import MinMaxScaler

In [31]:
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

In [32]:
data

[[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

In [34]:
scaler = MinMaxScaler()
scaler

MinMaxScaler(copy=True, feature_range=(0, 1))



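MinMaxScaler rescales each column with the transformation below, where min and max are the two values of feature_range:

```
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
```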



In [39]:
print(scaler.fit(data))
print(scaler.data_max_)
print(scaler.data_min_)
print(scaler.fit_transform(data))

MinMaxScaler(copy=True, feature_range=(0, 1))
[ 1. 18.]
[-1.  2.]
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
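
As a cross-check, the same result can be reproduced by hand. This is a minimal NumPy sketch of the min-max formula applied to the example data above; the variable names are illustrative and not part of scikit-learn:

```python
import numpy as np

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Column-wise minimum and maximum, matching data_min_ and data_max_ above.
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
data_std = (data - col_min) / (col_max - col_min)

# With feature_range=(0, 1) the final rescaling step leaves X_std unchanged.
range_min, range_max = 0.0, 1.0
data_scaled = data_std * (range_max - range_min) + range_min

print(data_scaled)
```

Both columns end up as 0, 0.25, 0.5 and 1, the same values fit_transform printed.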


Now we can get to the machine learning! Let's create the model using Keras. Keras is a high-level API for TensorFlow. We have a few options for doing this, but we'll keep it simple for now. A model is built from layers. We'll add two fully connected neural layers.

In [40]:
length = train_data.shape[1]

model = keras.Sequential()
model.add(keras.layers.Dense(500, activation='relu', input_shape=[length,]))
model.add(keras.layers.Dense(2, activation='softmax'))

The number associated with each layer is the number of neurons in it. The first layer uses a ReLU (Rectified Linear Unit) activation function. Since this is also the first layer, we need to specify input_shape, which is the shape of a single entry in our dataset.

After that, we'll finish with a softmax layer. Softmax is a type of logistic regression done for situations with multiple cases, like our 2 possible groups: 'Legendary' and 'Not Legendary'. With this we delineate the possible identities of the Pokémon into 2 probability groups corresponding to the possible labels.
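
To make the two activation functions concrete, here is a minimal NumPy sketch of what ReLU and softmax compute. The input numbers are made up for illustration; Keras applies its own implementations inside the layers:

```python
import numpy as np


def relu(x):
    # Negative values become 0; positive values pass through unchanged.
    return np.maximum(x, 0.0)


def softmax(logits):
    # Subtracting the row maximum avoids overflow and does not change the result.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)


# Hypothetical raw outputs of the final layer for two Pokémon.
logits = np.array([[2.0, 0.0], [-1.0, 1.0]])
probs = softmax(logits)

print(relu(np.array([-2.0, 0.5])))  # -2.0 becomes 0, 0.5 is kept
print(probs.sum(axis=1))            # each row sums to 1
print(probs.argmax(axis=1))         # index of the most likely class per row
```

Each row of probs holds two probabilities, one per label, and the larger of the two decides the predicted class.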


Once we have decided on the specifics of our model, we need to do two processes: Compile the model and fit the data to the model.

We can compile the model like so:



In [41]:
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

Here we're just feeding three parameters to model.compile. We pick an optimizer, which determines how the model is updated as it gains information; a loss function, which measures how far the model's predictions are from the true labels as it trains; and metrics, which specifies what information is reported so we can analyze the model.

The optimizer we're using is the Stochastic Gradient Descent (SGD) optimization algorithm, but there are others available. For our loss we're using sparse_categorical_crossentropy, which expects each label to be a single integer class index. If our labels were one-hot encoded, we would want to use "categorical_crossentropy" instead.
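
The two loss variants differ only in how the labels are stored. Here is a minimal NumPy sketch, using made-up probabilities, showing that they give the same per-example loss; it illustrates the idea rather than Keras's actual implementation:

```python
import numpy as np

# Hypothetical softmax outputs for three Pokémon (columns: class 0, class 1).
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])

# Sparse labels: one integer class index per example, as in train_labels.
sparse_labels = np.array([0, 1, 1])

# One-hot labels: one row per example with a 1 in the true class column.
one_hot_labels = np.eye(2)[sparse_labels]

# Sparse form: pick the predicted probability of the true class directly.
sparse_loss = -np.log(probs[np.arange(len(sparse_labels)), sparse_labels])

# One-hot form: the zeros mask out every class except the true one.
one_hot_loss = -(one_hot_labels * np.log(probs)).sum(axis=1)

print(np.allclose(sparse_loss, one_hot_loss))  # True
```

Since isLegendary is stored as a single 0 or 1 per Pokémon, the sparse form matches our labels without any extra conversion.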

In [42]:
# Then we fit the model to our training data:

model.fit(train_data, train_labels, epochs=400)

Epoch 1/400
Epoch 2/400
Epoch 3/400
...

<tensorflow.python.keras.callbacks.History at 0x7f8dff6b4588>

Now that the model is trained on our training data, we can test it against our test data:

In [44]:
loss_value, accuracy_value = model.evaluate(test_data, test_labels)
print(f'Our test accuracy was {accuracy_value}')

Our test accuracy was 0.9536423683166504


model.evaluate will evaluate how well our model performs on the test data, and report that in the form of a loss value and an accuracy value (since we specified accuracy in the metrics argument when we compiled the model). We'll just focus on our accuracy for now. With an accuracy of ~95%, it's not perfect, but it's very accurate.
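
For intuition about what the accuracy number means, here is a minimal NumPy sketch that computes it from predicted probabilities. The probabilities and labels are made up for illustration and are not output from our model:

```python
import numpy as np

# Hypothetical predicted probabilities (columns: class 0, class 1).
probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.8, 0.2], [0.4, 0.6]])
true_labels = np.array([0, 1, 1, 1])

# The predicted class is the column with the highest probability.
predicted = probs.argmax(axis=1)

# Accuracy is the fraction of predictions that match the labels.
accuracy = (predicted == true_labels).mean()

print(accuracy)  # 0.75
```

Here three of the four predictions match their labels, so the accuracy is 0.75.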

We can also use our model to predict specific Pokémon, or at least have it tell us which status the Pokémon is most likely to have, with model.predict. All it needs to predict a Pokémon is the data for that Pokémon itself. We're providing that by selecting a certain index of test_data:

In [45]:
def predictor(test_data, test_labels, index):
    # Each row of prediction holds the probabilities for classes 0 and 1.
    prediction = model.predict(test_data)
    if np.argmax(prediction[index]) == test_labels[index]:
        print(f'This was correctly predicted to be a "{test_labels[index]}"!')
    else:
        print(f'This was incorrectly predicted to be a "{np.argmax(prediction[index])}". It was actually a "{test_labels[index]}".')
    # Return the predictions whether or not the guess was correct.
    return prediction

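Let's look at one of the more well-known legendary Pokémon: Mewtwo. He's number 150 in the list of Pokémon, so he sits at index 149 of test_data. As a quick first check, the call below uses index 11, which belongs to a non-legendary Pokémon: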



In [69]:
predictor(test_data, test_labels,11)

This was correctly predicted to be a "0"!


In [70]:
df_test['isLegendary'][11]

0