# Pokemon Team Predictor

Authors:
**Diego Eduardo A. Montenejo** (202005984)

**Jayson Isaiah T. Tan** (202109224)

This Python notebook covers the data preprocessing and model training stages of the project. It also reports the training and test set accuracies of each of the three models.


# JSON to DataFrame Conversion

In [None]:
#@title Mounting the Drive

import os
from google.colab import drive
drive.mount('/content/drive')

%cd "/content/drive/Shareddrives/CS180 ML Project/Implementation/"

Mounted at /content/drive
/content/drive/Shareddrives/CS180 ML Project/Implementation


In [None]:
#@title Loading the Raw Dataset
import pandas as pd
import json
import numpy as np

# Read Raw JSON data and convert to DataFrame
# NOTE TO SELF: You can also import teams.json from the Google Drive into the runtime storage if mounting doesn't work
file_path = 'teams.json'
with open(file_path, 'r', encoding="utf8") as file:
    file_content = file.read()

data = json.loads(file_content)
df = pd.json_normalize(data)

# Dataset Cleaning & Feature Creation
The following code preprocesses the data in these steps:
- Drop any irrelevant battle formats.
- Drop both the `format` and `replays` columns.
- Multiply the number of rows by 6, so that each pokemon on each team becomes the "focus" (the pokemon whose item, ability, and moves will be predicted) of one row, under the column `pkmn`.
- Concatenate all other non-focus team members into a single string under the column `team`.
  - Delete the items, moves, and abilities of all non-focus team members on each row.
- For the focus pokemon, concatenate all four of its moves into a single string under the column `moves`.
- Split the result into an input dataframe `x` and an output dataframe `y`.
  - `pkmn` and `team` are encoded as bag-of-words count columns in `x`.
  - `ability`, `item`, and `moves` are kept as strings in `y`; `moves` is split into one move per row later, before training.

The dataframe is exported to csv.

In [None]:
#@title Initial Preprocessing

# Drop unnecessary rows and columns
df = df[(df['format'] == 'gen9Gen 9 OU Archive') | (df['format'] == 'gen9OverUsed')]
df = df.drop(columns=['format', 'replays'])
df = df.rename(columns={'pokemon': 'json'})

def remove_spacedash(x):
  return x.replace(' ', '').replace('-', '')
# Create new Features
for i in range(6):
    df[f'pkmn{i+1}'] = df['json'].apply(lambda x: x[i]['name'])
    for j in range(4):
        df[f'pkmn{i+1}-move{j+1}'] = df['json'].apply(lambda x: x[i]['moves'][j] if j < len(x[i]['moves']) else 'None')
    df[f'pkmn{i+1}-ability'] = df['json'].apply(lambda x: x[i]['ability'] if not (x[i].get('ability') is None) else 'None')
    df[f'pkmn{i+1}-item'] = df['json'].apply(lambda x: x[i]['item'] if not (x[i].get('item') is None) else 'None')

    # Remove all dashes and spaces from all names (to prevent issues with vectorizer)
    df[f'pkmn{i+1}'] = df[f'pkmn{i+1}'].apply(remove_spacedash)
    for j in range(4):
      df[f'pkmn{i+1}-move{j+1}'] = df[f'pkmn{i+1}-move{j+1}'].apply(remove_spacedash)
    df[f'pkmn{i+1}-ability'] = df[f'pkmn{i+1}-ability'].apply(remove_spacedash)
    df[f'pkmn{i+1}-item'] = df[f'pkmn{i+1}-item'].apply(remove_spacedash)

# Drop original JSON and indices
df.reset_index(drop=True, inplace=True)
df = df.drop(columns=['json'])

df.dropna(inplace=True)
df = df.drop_duplicates()

# Now that we have separated the information of each pokemon (name, moves, ability, item),
#   we must split each pokemon into its own data entry for prediction

# Create a row for each pokemon in each team (effectively multiplying the number of rows by 6)
df = df.loc[df.index.repeat(6)].reset_index(drop=True)

# Copy the attributes of the focus pokemon to separate features
df['pkmn'] = df.apply(lambda x: x[f'pkmn{(x.name % 6) + 1}'], axis=1)
df['ability'] = df.apply(lambda x: x[f'pkmn{(x.name % 6) + 1}-ability'], axis=1)
df['item'] = df.apply(lambda x: x[f'pkmn{(x.name % 6) + 1}-item'], axis=1)

# We will use Bag of Words to aggregate all the team members and moves, thus we must gather them in a single string
def get_teammates(x):
    other_pkmn = ''
    for i in range(6):
      if i == x.name % 6:
        continue
      other_pkmn += x[f'pkmn{i+1}'] + ' '
    return other_pkmn

def get_moves(x):
  moves = ''
  for i in range(4):
    moves += x[f'pkmn{(x.name % 6) + 1}-move{i+1}'] + ' '
  return moves

df['team'] = df.apply(get_teammates, axis=1)
print(df['team'])
df['moves'] = df.apply(get_moves, axis=1)

# Drop all the previous features
def pkmn_names(x):
  return f'pkmn{x}'

def pkmn_abilities(x):
  return f'pkmn{x}-ability'

def pkmn_items(x):
  return f'pkmn{x}-item'

old_features = [f(x+1) for x in range(6) for f in (pkmn_names, pkmn_abilities, pkmn_items)] + [f'pkmn{i+1}-move{j+1}' for j in range(4) for i in range(6)]
df = df.drop(columns=old_features)

# Drop invalid names, easier to do it at this stage rather than earlier
df = df.drop(df[~df.pkmn.str.isalnum()].index)

# Export the full dataframe to CSV for inspection (comment out to skip):
df.to_csv('raw_dataset.csv', index=False)

0       Drednaw Garchomp IronBundle ChienPao Kilowattrel 
1       Pelipper Garchomp IronBundle ChienPao Kilowatt...
2       Pelipper Drednaw IronBundle ChienPao Kilowattrel 
3         Pelipper Drednaw Garchomp ChienPao Kilowattrel 
4       Pelipper Drednaw Garchomp IronBundle Kilowattrel 
                              ...                        
7741    Glimmora IronValiant RoaringMoon SamurottHisui...
7742    Glimmora IronMoth RoaringMoon SamurottHisui Gh...
7743    Glimmora IronMoth IronValiant SamurottHisui Gh...
7744    Glimmora IronMoth IronValiant RoaringMoon Ghol...
7745    Glimmora IronMoth IronValiant RoaringMoon Samu...
Name: team, Length: 7746, dtype: object


In [None]:
#@title One-Hot Encoding the Input
from sklearn.feature_extraction.text import CountVectorizer

# First, we separate the input x from the original dataframe
x = df.copy()[['pkmn', 'team']]

# Renaming the contents of the pkmn column so that the focus pokemon is still discernable in the final model
x['pkmn'] = "focus_" + x['pkmn']

# Use a CountVectorizer on each of the team and pkmn strings and put the counts into separate columns
x_vect = CountVectorizer()
x_vect2 = CountVectorizer()
x_doc_vec = x_vect.fit_transform(x['team'])
x_doc_vec2 = x_vect2.fit_transform(x['pkmn'])


team_vect = pd.DataFrame(x_doc_vec.toarray(), columns=x_vect.get_feature_names_out(), index=x.index)
pkmn_vect = pd.DataFrame(x_doc_vec2.toarray(), columns=x_vect2.get_feature_names_out(), index=x.index)
x = pd.concat([pkmn_vect, team_vect], axis=1)

print(x)
# Export the full dataframe to CSV for inspection (comment out to skip):
x.to_csv('preprocessed_input.csv', index=False)

      focus_abomasnow  focus_alakazam  focus_alcremiesaltedcream  \
0                   0               0                          0   
1                   0               0                          0   
2                   0               0                          0   
3                   0               0                          0   
4                   0               0                          0   
...               ...             ...                        ...   
7741                0               0                          0   
7742                0               0                          0   
7743                0               0                          0   
7744                0               0                          0   
7745                0               0                          0   

      focus_alomomola  focus_altaria  focus_altariamega  focus_amoonguss  \
0                   0              0                  0                0   
1                   0          

In [None]:
#@title Separating the Output

# Separate the output y from the original dataframe
y = df.copy()[['ability', 'item', 'moves']]

# Export the full dataframe to CSV for inspection (comment out to skip):
y.to_csv('preprocessed_output.csv', index=False)

# Print the abilities, items, and moves present:
print(y)

             ability            item  \
0            Drizzle        DampRock   
1          SwiftSwim       WhiteHerb   
2          RoughSkin       FocusSash   
3         QuarkDrive   BoosterEnergy   
4        SwordofRuin  HeavyDutyBoots   
...              ...             ...   
7741      QuarkDrive   BoosterEnergy   
7742      QuarkDrive   BoosterEnergy   
7743  Protosynthesis   BoosterEnergy   
7744       Sharpness  HeavyDutyBoots   
7745      GoodasGold     CovertCloak   

                                               moves  
0                         Surf Roost Uturn KnockOff   
1       ShellSmash Liquidation StoneEdge IceSpinner   
2             StealthRock Spikes Earthquake Outrage   
3                 HydroPump FreezeDry IceBeam Uturn   
4        SwordsDance SacredSword IcicleCrash Crunch   
...                                              ...  
7741  FieryDance SludgeWave DazzlingGleam TeraBlast   
7742   CloseCombat ShadowSneak KnockOff SwordsDance   
7743     DragonDance Ear

In [None]:
#@title Splitting the Dataset
from sklearn.model_selection import train_test_split

# Create train test split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, test_size=0.3, random_state=33123)

In [None]:
# @title Creating separate Ability, Item, Move Output Dataframes
# 109 abilities listed
# 181 items listed

y_train_ability = y_train["ability"]
y_train_item = y_train["item"]

y_test_ability = y_test["ability"]
y_test_item = y_test["item"]

y_train_moves = y_train.drop(axis=1, labels=['ability', 'item'], inplace=False)
y_test_moves = y_test.drop(axis=1, labels=['ability', 'item'], inplace=False)

# creating a copy of x_test and x_train for testing with moves
x_train_moves = x_train.copy()
x_test_moves = x_test.copy()

# Multiplying each row by 4 (for each move)
x_train_moves = x_train_moves.loc[x_train_moves.index.repeat(4)].reset_index(drop=True)
x_test_moves = x_test_moves.loc[x_test_moves.index.repeat(4)].reset_index(drop=True)

y_train_moves = y_train_moves.loc[y_train_moves.index.repeat(4)].reset_index(drop=True)
y_test_moves = y_test_moves.loc[y_test_moves.index.repeat(4)].reset_index(drop=True)

# Keep exactly one move per row: the move at position (row index % 4) of the moves string
y_train_moves = y_train_moves.apply(lambda x: x['moves'].split()[x.name % 4], axis=1)
y_test_moves = y_test_moves.apply(lambda x: x['moves'].split()[x.name % 4], axis=1)

# Export the per-move training labels to CSV for inspection (warning: huge)
y_train_moves.to_csv('y_train_moves.csv', index=False)

In [None]:
# Uncomment to see the shape of y_train_ability and x_train

#print(y_train_ability.shape)
#print(x_train.shape)

# Training the Model
We use three separate Complement Naive Bayes models to predict the ability, item, and moves of a pokemon given its team members. Because the input is a matrix of counts, Multinomial Naive Bayes was an appropriate starting point; in testing, Complement Naive Bayes performed better.
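The comparison between the two Naive Bayes variants can be reproduced in miniature. The sketch below is illustrative only: the count matrix `X` and labels `y` are hypothetical toy data, not the project's dataset, and it reports training accuracy only, so it says nothing about which variant generalizes better on the real data.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB, MultinomialNB

# Hypothetical toy counts: rows are samples, columns are teammate features.
X = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
])
# Imbalanced labels, loosely mimicking one common ability and two rare ones.
y = np.array(["Drizzle", "Drizzle", "Drizzle", "Drizzle", "SwiftSwim", "RoughSkin"])

scores = {}
for model_cls in (MultinomialNB, ComplementNB):
    model = model_cls(alpha=1.0).fit(X, y)
    scores[model_cls.__name__] = model.score(X, y)  # training accuracy only

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

On the real data, the same loop would be run with `x_train` and `y_train_ability`, scoring on `x_test` and `y_test_ability` instead of the training set.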

In [None]:
#@title Fitting the Model to the Ability Dataset
from sklearn.naive_bayes import ComplementNB

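# We ended up using ComplementNB, which is designed to correct the "severe assumptions" made by MultinomialNB.
# Our input is already shaped like a document-term matrix (DTM), so ComplementNB/MultinomialNB should be appropriate

# A throwaway model is fitted just to count the classes, which sets the length of the vector of prior probabilities
# force_alpha='warn' is omitted: it was a transitional default that newer scikit-learn releases may reject,
#   and it has no effect when alpha=1.0
ability_fakecnb = ComplementNB(alpha=1.0, fit_prior=True)
ability_fakecnb.fit(x_train, y_train_ability)

# NOTE: the scikit-learn docs describe ComplementNB's class_prior as unused outside the single-class edge case,
#   so these uniform priors are not expected to change the predictions
ability_priors = np.full((ability_fakecnb.class_count_.shape[0],), 0.01)

ability_cnb = ComplementNB(class_prior=ability_priors, alpha=1.0, fit_prior=True)
ability_cnb.fit(x_train, y_train_ability)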

In [None]:
#@title Predicting the Ability of a Pokemon
from sklearn import metrics

# make class predictions for x_train using the predict() function
y_train_ability_class = ability_cnb.predict(x_train)

# make class predictions for x_test using the predict() function
y_pred_ability_class = ability_cnb.predict(x_test)

# calculate accuracy of class predictions
print("Accuracy on training data")
print(metrics.accuracy_score(y_train_ability, y_train_ability_class))
print("Accuracy on test data")
print(metrics.accuracy_score(y_test_ability, y_pred_ability_class))

# Uncomment to check the probabilities for each class on the first sample in the training set:
# ability_predictions = pd.DataFrame(ability_cnb.predict_proba(x_train)[0],index = ability_cnb.classes_).transpose()
# ability_predictions.to_csv('ability_predictions.csv', index=False)
# x_train.to_csv("x_train.csv", index=False)

Accuracy on training data
0.9377999261720192
Accuracy on test data
0.9254952627045651


In [None]:
#@title Fitting the Model to the Held Item Dataset
from sklearn.naive_bayes import ComplementNB

# A model is made just to get the amount of values needed to fill the matrix of prior probabilities

# force_alpha='warn' is omitted: newer scikit-learn releases may reject it, and it has no effect when alpha=1.0
item_fakecnb = ComplementNB(alpha=1.0, fit_prior=True)
item_fakecnb.fit(x_train, y_train_item)

item_priors = np.full((item_fakecnb.class_count_.shape[0],), 0.01)

item_cnb = ComplementNB(class_prior=item_priors)
item_cnb.fit(x_train, y_train_item)

In [None]:
#@title Predicting the Held Item of a Pokemon
from sklearn import metrics

# make class predictions for x_train using the predict() function
y_train_item_class = item_cnb.predict(x_train)

# make class predictions for x_test using the predict() function
y_pred_item_class = item_cnb.predict(x_test)

# calculate accuracy of class predictions
print("Accuracy on training data")
print(metrics.accuracy_score(y_train_item, y_train_item_class))
print("Accuracy on test data")
print(metrics.accuracy_score(y_test_item, y_pred_item_class))

Accuracy on training data
0.6244001476559616
Accuracy on test data
0.5602928509905254


In [None]:
#@title Fitting the Model to the Moves Dataset
from sklearn.naive_bayes import ComplementNB

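# WARNING: This takes about 4-5 minutes
# A throwaway model is fitted just to count the classes, which sets the length of the vector of prior probabilities
# force_alpha='warn' is omitted: newer scikit-learn releases may reject it, and it has no effect when alpha=1.0
moves_fakecnb = ComplementNB(alpha=1.0, fit_prior=True)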
moves_fakecnb.fit(x_train_moves, y_train_moves)

moves_priors = np.full((moves_fakecnb.class_count_.shape[0],), 0.01)

moves_cnb = ComplementNB(class_prior=moves_priors)
moves_cnb.fit(x_train_moves, y_train_moves)

In [None]:
#@title Predicting the Moveset of a Pokemon


# Uncomment to check the probabilities for each class on the first sample in the training set:
#move_predictions = pd.DataFrame(moves_cnb.predict_proba(x_train_moves)[1],index = moves_cnb.classes_).transpose()
#move_predictions.to_csv('moves_predictions.csv', index=False)
#x_train_moves.to_csv("x_train_moves.csv", index=False)
#y_train_moves.to_csv("y_train_moves.csv", index=False)


# ComplementNB.predict() returns a single class, but a moveset has four moves in no particular order,
# so the four most probable classes are taken from predict_proba() instead.
# Accuracy is measured by checking whether each move in the guessed moveset appears in the real moveset;
# every match increases the tally by 1.
# accuracy = tally / (total number of guessed moves)
def move_accuracy_test(x_df, y_df):
  # x_df repeats every sample four times, so only every fourth row is scored below
  move_predictions = pd.DataFrame(moves_cnb.predict_proba(x_df), columns=moves_cnb.classes_)

  # Group the true moves back into sets of four, one group per sample
  moves_true = []
  for i in range(0, y_df.shape[0], 4):
    moves_true.append(y_df[i:i+4].values)

  moves_guesses = []
  for index, row in move_predictions.iterrows():
      if index % 4 == 0:
        # Take the four most probable moves. The earlier version called reset_index() and sliced [1:5]
        # to skip the extra 'index' column, which dropped the top move for the very first row.
        predicted_moves = row.sort_values(ascending=False).index
        moves_guesses.append(predicted_moves[:4])

  tally = 0
  for i in range(len(moves_guesses)):
    for move in moves_guesses[i]:
      if move in moves_true[i]:
        tally += 1

  score = tally / (len(moves_guesses) * 4)

  return score

print("Accuracy on training data:")
print(move_accuracy_test(x_train_moves, y_train_moves))
print("Accuracy on test data:")
print(move_accuracy_test(x_test_moves, y_test_moves))

Accuracy on training data:
0.7332041343669251
Accuracy on test data:
0.6599913867355728
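The moveset accuracy described in the cell above (the fraction of the four guessed moves that appear in the true moveset) can be checked by hand on a toy example. This is a sketch under stated assumptions: `top_k_overlap` is a hypothetical helper, the probabilities are made up, and it compares sets, so repeated `None` padding moves would be counted once rather than per slot as in the notebook's loop.

```python
import numpy as np

def top_k_overlap(proba, classes, true_sets, k=4):
    """Fraction of the top-k predicted labels that appear in the true label set."""
    # Indices of the k most probable classes for each row, highest first.
    top_idx = np.argsort(-proba, axis=1)[:, :k]
    hits = sum(
        len(set(classes[idx]) & set(true))
        for idx, true in zip(top_idx, true_sets)
    )
    return hits / (len(true_sets) * k)

classes = np.array(["Surf", "Roost", "Uturn", "KnockOff", "Spikes", "Outrage"])
proba = np.array([
    [0.30, 0.25, 0.20, 0.05, 0.15, 0.05],  # top 4: Surf, Roost, Uturn, Spikes
    [0.05, 0.05, 0.10, 0.30, 0.25, 0.25],  # top 4: KnockOff, Spikes, Outrage, Uturn
])
true_sets = [
    ["Surf", "Roost", "Uturn", "KnockOff"],     # 3 of the 4 guesses are correct
    ["KnockOff", "Spikes", "Outrage", "Surf"],  # 3 of the 4 guesses are correct
]

score = top_k_overlap(proba, classes, true_sets)
print(score)  # 0.75
```

Six of the eight guessed moves are in the true movesets, giving 6 / 8 = 0.75.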


# Exporting the Models

In [None]:
#@title Export the Machine Learning Models

import pickle

# save the models to disk
for name, model in (('ability_cnb', ability_cnb), ('item_cnb', item_cnb), ('moves_cnb', moves_cnb)):
    with open(f'{name}.sav', 'wb') as f:
        pickle.dump(model, f)
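For completeness, here is a minimal sketch of how a saved model could be loaded back and used. It is not part of the original notebook: the tiny model below is a stand-in for `ability_cnb` fitted on hypothetical toy counts, and the file is written to a temporary directory rather than to the project folder.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import ComplementNB

# Stand-in for ability_cnb, fitted on hypothetical toy counts.
X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y = np.array(["Drizzle", "SwiftSwim", "Drizzle", "RoughSkin"])
model = ComplementNB().fit(X, y)

# Save the model, then load it back from disk.
path = os.path.join(tempfile.mkdtemp(), "ability_cnb.sav")
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should give the same predictions as the original.
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.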

In [None]:
#@title Export the list of valid Pokemon names
import csv

valid_pkmn = df['pkmn'].unique()
valid_pkmn.sort()
print(valid_pkmn)

with open('VALID_POKEMON.csv', 'w', newline='') as f:
    write = csv.writer(f)
    write.writerow(valid_pkmn)

['Abomasnow' 'Alakazam' 'AlcremieSaltedCream' 'Alomomola' 'Altaria'
 'AltariaMega' 'Amoonguss' 'Ampharos' 'Annihilape' 'Appletun' 'Araquanid'
 'Arboliva' 'Arcanine' 'ArcanineHisui' 'Archaludon' 'Armarouge'
 'ArticunoGalar' 'Avalugg' 'Azelf' 'Azumarill' 'Barraskewda' 'Basculegion'
 'BasculegionF' 'Baxcalibur' 'Bellibolt' 'Bergmite' 'Bewear' 'Bisharp'
 'Blaziken' 'Blissey' 'Brambleghast' 'Breloom' 'Bronzong' 'BruteBonnet'
 'Cacturne' 'Carbink' 'Carvanha' 'Centiskorch' 'Ceruledge' 'Cetitan'
 'Chandelure' 'Chansey' 'Charizard' 'CharizardMegaY' 'Chesnaught' 'ChiYu'
 'ChienPao' 'Cinccino' 'Cinderace' 'Clefable' 'Clodsire' 'Cloyster'
 'Cobalion' 'Comfey' 'Conkeldurr' 'Copperajah' 'Corviknight'
 'Crabominable' 'Cramorant' 'Crawdaunt' 'Cresselia' 'Croagunk' 'Crocalor'
 'Cryogonal' 'Cyclizar' 'Dachsbun' 'Darkrai' 'Darmanitan' 'DecidueyeHisui'
 'Delibird' 'DeoxysDefense' 'DeoxysSpeed' 'Diancie' 'Dipplin' 'Ditto'
 'Dolliv' 'Dondozo' 'Dragalge' 'Dragapult' 'Dragonair' 'Dragonite'
 'Drednaw' 'Drifbl