This is the preprocessing document. When I pulled the Pokédex entries and type information from the API, each dex entry and each type came back on its own row; the results are shown below.


In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import sparse

In [2]:
# Read in the CSVs (read_csv already returns a DataFrame, so no extra conversion is needed)
dex_entries = pd.read_csv("Documents/GT/Potential Projects/pokedex.csv")
display(dex_entries.shape)
display(dex_entries.head())

type_entries = pd.read_csv("Documents/GT/Potential Projects/types.csv")
display(type_entries.shape)
display(type_entries.head())

(8650, 6)

,Pokemon,Number,Color,Habitat,Generation,Description
0,Bulbasaur,1,green,grassland,generation-i,a strange seed was planted on its back at birt...
1,Bulbasaur,1,green,grassland,generation-i,it can go for days without eating a single mor...
2,Bulbasaur,1,green,grassland,generation-i,the seed on its back is filled with nutrients....
3,Bulbasaur,1,green,grassland,generation-i,it carries a seed on its back right from birth...
4,Bulbasaur,1,green,grassland,generation-i,"while it is young, it uses the nutrients that ..."


(1551, 2)

,Pokemon,Type
0,Bulbasaur,Grass
1,Bulbasaur,Poison
2,Ivysaur,Grass
3,Ivysaur,Poison
4,Venusaur,Grass


Unfortunately, the habitat value is blank for every Pokémon after Gen III, so those blanks need to be filled in. I display one arbitrary later-generation row (index 6000) as a check that the NaN became 'unknown'.

In [3]:
dex_entries['Habitat'] = dex_entries['Habitat'].fillna('unknown')
display(dex_entries.iloc[6000])

Pokemon                                                 Drifloon
Number                                                       425
Color                                                     purple
Habitat                                                  unknown
Generation                                         generation-iv
Description    stories go that it grabs the hands of small ch...
Name: 6000, dtype: object

The goal of this document is to combine those rows so that every Pokémon has a single row in the dataframe, with its separate dex entries joined into one paragraph-like description. The habitat fix had to come first: a blank habitat in the csv file becomes NaN in pandas, and groupby by default drops any row whose grouping key contains NaN, so the groupby below was cutting out every Pokémon with a blank habitat. That would be fine if I only wanted the first three generations, but I want the entire dex.

In [4]:
# Combine dex entries into one long description so each pokemon species has one row in the df
agg_descriptions = (
    dex_entries.groupby(['Pokemon', 'Number', 'Color', 'Habitat', 'Generation'], as_index=False)
    .agg({'Description': ' '.join})
    .sort_values(by=['Number'])
    .reset_index(drop=True)
)
display(agg_descriptions.shape)
display(agg_descriptions.head())
display(agg_descriptions.tail())

(1025, 6)

,Pokemon,Number,Color,Habitat,Generation,Description
0,Bulbasaur,1,green,grassland,generation-i,a strange seed was planted on its back at birt...
1,Ivysaur,2,green,grassland,generation-i,"when the bulb on its back grows large, it appe..."
2,Venusaur,3,green,grassland,generation-i,the plant blooms when it is absorbing solaren...
3,Charmander,4,red,mountain,generation-i,"obviously prefers hot places. when it rains, s..."
4,Charmeleon,5,red,mountain,generation-i,"when it swings its burning tail, it elevates t..."


,Pokemon,Number,Color,Habitat,Generation,Description
1020,Raging-bolt,1021,yellow,unknown,generation-ix,it's said to incinerate everything around it w...
1021,Iron-boulder,1022,gray,unknown,generation-ix,it resembles a pokémon described in a dubious ...
1022,Iron-crown,1023,blue,unknown,generation-ix,it resembles a mysterious object introduced in...
1023,Terapagos,1024,blue,unknown,generation-ix,terapagos protects itself using its power to t...
1024,Pecharunt,1025,purple,unknown,generation-ix,it feeds others toxic mochi that draw out desi...
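
The NaN behaviour described above is easy to reproduce on a toy frame. The sketch below is only an illustration: the `toy` dataframe and its rows are made up, and `dropna=False` is shown as an alternative to `fillna('unknown')` that this notebook does not use (it needs pandas 1.1 or later).

```python
import numpy as np
import pandas as pd

# Made-up stand-in for dex_entries: two rows per species, one species with no habitat
toy = pd.DataFrame({
    "Pokemon": ["Bulbasaur", "Bulbasaur", "Drifloon", "Drifloon"],
    "Habitat": ["grassland", "grassland", np.nan, np.nan],
    "Description": ["seed on its back.", "goes days without eating.",
                    "grabs the hands of children.", "drifts on the wind."],
})

# Default (dropna=True): any group whose key contains NaN is silently dropped
default_groups = (
    toy.groupby(["Pokemon", "Habitat"], as_index=False)
    .agg({"Description": " ".join})
)

# dropna=False keeps the NaN group, so no species is lost
kept_groups = (
    toy.groupby(["Pokemon", "Habitat"], as_index=False, dropna=False)
    .agg({"Description": " ".join})
)

print(default_groups["Pokemon"].tolist())
print(kept_groups["Pokemon"].tolist())
```

Filling the blanks with 'unknown' first, as done in this notebook, has the side benefit that the habitat column stays a plain string column for the later steps.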


I did the same with the type entries, collapsing the one or two rows per Pokémon into a single row. Here I didn't think it was worth including the number or sorting by it, since I only use the Pokemon column to match against the descriptions table.

In [5]:
# Combine each pokemon's one or two types into a single space-separated string so each species has one row in the df
agg_types = type_entries.groupby(['Pokemon'])['Type'].agg(' '.join).reset_index()
display(agg_types.head())
display(agg_types.tail())

,Pokemon,Type
0,Abomasnow,Grass Ice
1,Abra,Psychic
2,Absol,Dark
3,Accelgor,Bug
4,Aegislash-shield,Steel Ghost


,Pokemon,Type
1020,Zoroark,Dark
1021,Zorua,Dark
1022,Zubat,Poison Flying
1023,Zweilous,Dark Dragon
1024,Zygarde-50,Dragon Ground


I joined the tables so the big description table now has a Type column holding each Pokémon's one or two types. Then I realized I will almost certainly only use the Pokemon and Description columns when comparing against the users' inputs, so I copied the color and the type into the start of each description, along with the habitat when it is known. I don't want three quarters of the descriptions to contain the word 'unknown', so I left the habitat out whenever that is its value.

In [6]:
# Join tables so entries can have types
full_dex = pd.merge(agg_descriptions, agg_types, on = 'Pokemon', how = 'left')
columns_order = ['Pokemon', 'Number', 'Color', 'Habitat', 'Type', 'Generation', 'Description']
full_dex = full_dex[columns_order]
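
One thing worth checking after a left merge like this is whether every Pokémon actually found a type. A species with no match keeps its row but gets NaN in Type, and an f-string would then write the literal text "nan" into its description. The sketch below shows one way to check, using the `validate` and `indicator` parameters of `pd.merge`. The two small frames and the unmatched name are invented for illustration; this is not a check the notebook itself runs.

```python
import pandas as pd

# Invented miniature versions of agg_descriptions and agg_types
descriptions = pd.DataFrame({
    "Pokemon": ["Bulbasaur", "Drifloon", "Missingno"],  # "Missingno" is a made-up unmatched name
    "Description": ["a strange seed...", "stories go that...", "no dex entry..."],
})
types = pd.DataFrame({
    "Pokemon": ["Bulbasaur", "Drifloon"],
    "Type": ["Grass Poison", "Ghost Flying"],
})

# validate raises MergeError if either side has duplicate keys;
# indicator adds a "_merge" column saying where each row came from
checked = pd.merge(descriptions, types, on="Pokemon", how="left",
                   validate="one_to_one", indicator=True)

# Rows present only in the left table found no type
unmatched = checked.loc[checked["_merge"] == "left_only", "Pokemon"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean every description row found its types.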

In [7]:
# We will just compare the user input to the pokemon description, so we need the color, habitat, and type as part of the description column
full_dex['Description'] = full_dex.apply(
    lambda row: f"{row['Color']} {row['Habitat']} {row['Type']} - {row['Description']}" if row['Habitat'] != 'unknown' 
    else f"{row['Color']} {row['Type']} - {row['Description']}",
    axis=1
)

full_dex['Description'] = full_dex['Description'].str.lower()
print(full_dex.shape)
display(full_dex.head())
display(full_dex.tail())

(1025, 7)


,Pokemon,Number,Color,Habitat,Type,Generation,Description
0,Bulbasaur,1,green,grassland,Grass Poison,generation-i,green grassland grass poison - a strange seed ...
1,Ivysaur,2,green,grassland,Grass Poison,generation-i,green grassland grass poison - when the bulb o...
2,Venusaur,3,green,grassland,Grass Poison,generation-i,green grassland grass poison - the plant bloom...
3,Charmander,4,red,mountain,Fire,generation-i,red mountain fire - obviously prefers hot plac...
4,Charmeleon,5,red,mountain,Fire,generation-i,red mountain fire - when it swings its burning...


,Pokemon,Number,Color,Habitat,Type,Generation,Description
1020,Raging-bolt,1021,yellow,unknown,Electric Dragon,generation-ix,yellow electric dragon - it's said to incinera...
1021,Iron-boulder,1022,gray,unknown,Rock Psychic,generation-ix,gray rock psychic - it resembles a pokémon des...
1022,Iron-crown,1023,blue,unknown,Steel Psychic,generation-ix,blue steel psychic - it resembles a mysterious...
1023,Terapagos,1024,blue,unknown,Normal,generation-ix,blue normal - terapagos protects itself using ...
1024,Pecharunt,1025,purple,unknown,Poison Ghost,generation-ix,purple poison ghost - it feeds others toxic mo...


In [8]:
# To csv
full_dex.to_csv('pokedex_full.csv',index = False)

The inputs will change with every user, but the Pokédex only changes when a new game comes out. Because of that, I want to process the Pokédex only when it is updated, not every time I run the app. I turn the dex into a bag-of-words matrix here so the app has the 1s and 0s ready to reference, and I save a TF-IDF version as well.

In [9]:
# Bag of Words Prep
# Initialize CountVectorizer and fit the vectorizer on the 'Description' column
vectorizer = CountVectorizer(stop_words='english')
description_matrix = vectorizer.fit_transform(full_dex['Description'])

# Convert the matrix to an array and create a DataFrame for features
features_df = pd.DataFrame(description_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Create a new column in full_dex that contains the unique features (Bag of Words)
full_dex['Features'] = features_df.apply(lambda row: ' '.join([word for word, val in zip(features_df.columns, row) if val > 0]), axis=1)

# Re-vectorize the de-duplicated 'Features' text to get a 0/1 bag-of-words matrix
full_dex_bow = vectorizer.transform(full_dex['Features']).toarray()

# Export as a npy file to maintain data format as a numpy array
np.save("full_dex_bow.npy", full_dex_bow)

# TF-IDF Prep
# Combine Pokémon descriptions into a list
full_dex_descriptions = full_dex['Description'].tolist()

# Initialize TF-IDF Vectorizer and fit-transform on descriptions
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(full_dex_descriptions)

# fit_transform already returns a SciPy sparse matrix; this just makes sure it is stored as CSR
tfidf_sparse = sparse.csr_matrix(tfidf_matrix)

# Save as .npz (compressed sparse format)
sparse.save_npz("full_dex_tfidf_sparse.npz", tfidf_sparse)
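
A saved matrix on its own is not enough to score new text: the app also needs the fitted vectorizer, because user input has to be mapped onto the same vocabulary columns. The sketch below shows one possible way to handle that, under several assumptions that are not in this notebook: the vectorizer is persisted with `pickle`, matching is done with cosine similarity, and the two descriptions and the query are made up. The bytes are kept in memory here so the example is self-contained; in practice they would be written to a file next to the .npz.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up miniature dex
names = ["Bulbasaur", "Charmander"]
descriptions = [
    "green grassland grass poison - a strange seed was planted on its back",
    "red mountain fire - the flame on its tail shows its life force",
]

# Preprocessing side: fit once, keep both the matrix and the fitted vectorizer
tfidf = TfidfVectorizer(stop_words="english")
dex_matrix = tfidf.fit_transform(descriptions)
vectorizer_bytes = pickle.dumps(tfidf)  # stand-in for writing a file

# App side: restore the vectorizer and project the user's text onto the same columns
loaded_tfidf = pickle.loads(vectorizer_bytes)
query_vector = loaded_tfidf.transform(["a red lizard with a flame on its tail"])

# Cosine similarity of the query against every dex row; highest score wins
scores = cosine_similarity(query_vector, dex_matrix).ravel()
best_match = names[int(np.argmax(scores))]
print(best_match)
```

Pickle files should only be loaded from a trusted source, which is fine here because the app would be reading a file it produced itself.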

This concludes the data preprocessing. For how the user input questions are handled, please see the separate document on user input.