<a href="https://colab.research.google.com/github/AseiSugiyama/PokemonAnalytics/blob/add-pokemon-analysis-notebook/pokemon_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Road to Pokémon Master!

Welcome to this data science hands-on! This notebook covers the following topics:

1. Exploratory data analysis with [The Complete Pokemon Dataset](https://www.kaggle.com/rounakbanik/pokemon)
2. Statistical tests
3. Outlier detection
4. Legendary Pokémon prediction
5. Pokémon battle usage ranking
6. Co-occurrence network

## Setup

In Colab, you need to click "Restart runtime" after running the install cell for the first time.

In [0]:
!pip install swifter
!pip install seaborn
!pip install statsmodels
!pip install scikit-learn
!pip install numpy
!pip install matplotlib
!pip install pyvis
!pip install -U pandas-profiling[notebook,html]

## Download datasets

In [0]:
!rm -rf PokemonAnalytics
!git clone https://github.com/AseiSugiyama/PokemonAnalytics.git

In [0]:
!ls -R ./PokemonAnalytics/data

## Pokédex

In [0]:
import pandas as pd
import swifter
%config InlineBackend.figure_formats = {'png', 'retina'}

In [0]:
usecols = [
           "pokedex_number",
           "name",
           "japanese_name",
           "type1",
           "type2",
           "height_m",
           "weight_kg",
           "hp",
           "attack",
           "defense",
           "sp_attack",
           "sp_defense",
           "speed",
           "base_egg_steps",
           "base_happiness",
           "capture_rate",
           "base_total",
           "classfication",
           "experience_growth",
           "generation",
           "is_legendary",
]

raw_pokedex = pd.read_csv(
    "./PokemonAnalytics/data/pokedex/Pokemon.csv",
    usecols=usecols,
    index_col="pokedex_number",
)[usecols[1:]].rename(columns={
    "height_m":"height",
    "weight_kg":"weight",
})

In [0]:
raw_pokedex.head()

### Preprocessing

#### Anomalies

In [0]:
raw_pokedex.dtypes

In [0]:
def is_int(value):
  try:
    int(value)
    return True
  except ValueError:
    return False

capture_rate_int_values = raw_pokedex.capture_rate.swifter.apply(is_int)
raw_pokedex[~capture_rate_int_values]

In [0]:
raw_pokedex.loc[~capture_rate_int_values, ["name", "japanese_name", "capture_rate"]]

In [0]:
raw_pokedex.at[774, "capture_rate"] = 30
raw_pokedex = raw_pokedex.astype({
    "capture_rate":int
})
raw_pokedex.dtypes
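
The same check can be done without a helper function. The sketch below uses `pd.to_numeric` with `errors="coerce"` on a toy column; the values and index labels are made up for illustration and are not read from the dataset.

```python
import pandas as pd

# Toy stand-in for the capture_rate column. The second value mimics an
# entry that packs two rates into one string (hypothetical example data).
capture_rate = pd.Series(
    ["45", "30 (Meteorite)255 (Core)", "3"],
    index=[1, 774, 150],
    name="capture_rate",
)

# errors="coerce" turns every unparsable value into NaN instead of raising.
as_number = pd.to_numeric(capture_rate, errors="coerce")

# Rows that failed to parse are the anomalies to inspect by hand.
bad_rows = capture_rate[as_number.isna()]
print(bad_rows)
```

Values that were already missing are flagged too, so the mask is best read as "needs attention" rather than "malformed".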

#### Missing Values

In [0]:
raw_pokedex.isna().sum()

In [0]:
raw_pokedex[raw_pokedex.height.isna()]

In [0]:
raw_pokedex[raw_pokedex.weight.isna()]

Join a second dataset to fill in the missing heights and weights.

In [0]:
weight_and_heights = pd.read_csv(
    "./PokemonAnalytics/data/pokedex/height_and_weight.csv",
    usecols=["ndex", "height", "weight"],
).drop_duplicates(
    subset=['ndex'],
    keep='first',
).rename(columns={
    "height":"height_from_feet",
    "weight":"weight_from_ponds",
})
weight_and_heights.index = weight_and_heights.ndex
weight_and_heights = weight_and_heights[["height_from_feet", "weight_from_ponds"]]

In [0]:
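# Both frames are indexed by Pokédex number, so join on the indexes only.
# Combining left_on/right_on with left_index/right_index raises a MergeError in recent pandas.
joined_pokedex = pd.merge(
    raw_pokedex.rename(columns={
        "height":"raw_height",
        "weight":"raw_weight",
        }),
    weight_and_heights,
    left_index=True,
    right_index=True,
)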

In [0]:
joined_pokedex.head()

In [0]:
height_is_na = joined_pokedex.raw_height.isna().astype(int)
weight_is_na = joined_pokedex.raw_weight.isna().astype(int)
pokedex_without_na = joined_pokedex.fillna(value={
    "type2":"NONE",
    "raw_height":0.0,
    "raw_weight":0.0,
})
pokedex_without_na["height_m"] = (1 - height_is_na) * pokedex_without_na.raw_height + height_is_na * pokedex_without_na.height_from_feet
pokedex_without_na["weight_kg"] = (1 - weight_is_na) * pokedex_without_na.raw_weight + weight_is_na * pokedex_without_na.weight_from_ponds

pokedex = pokedex_without_na.drop(columns=[
                                           "height_from_feet",
                                           "weight_from_ponds",
                                           "raw_height",
                                           "raw_weight",
                                           ]
                                  )[usecols[1:]].rename(columns={
                                      "height_m":"height",
                                      "weight_kg":"weight",})
pokedex.head()
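
The mask arithmetic above keeps the raw value where it exists and falls back to the second source otherwise. A shorter way to express the same idea, sketched here on a toy frame with made-up column names and values, is `Series.fillna` with another Series:

```python
import pandas as pd

# Toy frame: raw_height has a gap, fallback_height comes from a second source.
# Column names and values are hypothetical, chosen only for illustration.
toy = pd.DataFrame(
    {"raw_height": [0.4, None, 1.7], "fallback_height": [0.4, 0.8, 1.6]},
    index=[25, 26, 6],
)

# fillna with a Series aligns on the index and only touches the missing rows.
toy["height"] = toy["raw_height"].fillna(toy["fallback_height"])
print(toy["height"].tolist())
```

Rows that already have a raw value keep it, even when the fallback source disagrees.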

In [0]:
pokedex[pokedex.index == 26] # Raichu height: 0.8m, weight: 30kg

In [0]:
pokedex[pokedex.is_legendary==1]

### EDA

#### Descriptive statistics

In [0]:
pokedex.describe()

In [0]:
from pandas_profiling import ProfileReport

profile = ProfileReport(pokedex, title='Pokedex Profiling Report', html={'style':{'full_width':True}})
profile

#### Generations

In [0]:
import seaborn as sns
# factorplot was renamed to catplot and later removed from seaborn.
sns.catplot(
    x="generation",
    data=pokedex,
    kind='count',
)

In [0]:
sns.catplot(
    x='generation',
    data=pokedex,
    col='type1',
    kind='count',
    col_wrap=3
)

#### Types

##### Type1

In [0]:
sns.catplot(
    y='type1',
    data=pokedex,
    kind='count',
    order=pokedex['type1'].value_counts().index,
    aspect=1.5,
    color='darkblue'
)

In [0]:
pokedex[pokedex.type1 == "flying"]

##### Type 2

In [0]:
sns.catplot(
    y='type2',
    data=pokedex,
    kind='count',
    order=pokedex['type2'].value_counts().index,
    aspect=1.5,
    color='darkblue'
)

In [0]:
pokedex[pokedex.type2 == "bug"]

##### Type1 x Type2

In [0]:
pokedex.groupby(['type1', 'type2']).size().unstack()

In [0]:
sns.heatmap(
    pokedex.groupby(['type1', 'type2']).size().unstack(),
    linewidths=1,
    annot=True
)

In [0]:
sns.heatmap(
    pokedex[pokedex.type2 != "NONE"].groupby(['type1', 'type2']).size().unstack(),
    linewidths=1,
    annot=True,
)

#### Ranking

##### Strongest

In [0]:
pokedex.sort_values('base_total', ascending=False).head(10)

##### Weakest

In [0]:
pokedex.sort_values('base_total', ascending=True).head(10)

##### HP

In [0]:
pokedex.sort_values('hp', ascending=False).head(10)

##### Height


In [0]:
pokedex.sort_values('height', ascending=False).head(10)

##### Weight

In [0]:
pokedex.sort_values('weight', ascending=False).head(10)

### Statistical tests

Which Pokémon type is tougher, faster, or stronger than the others?

#### Check data

In [0]:
sns.heatmap(
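    # numeric_only=True skips the string columns, which pandas 2.x no longer drops silently.
    pokedex.groupby('type1').mean(numeric_only=True).loc[:, 'hp':'speed'],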
    linewidths=1,
    cmap="YlGnBu"
)

In [0]:
pokedex.groupby('type1').mean(numeric_only=True).loc[:, 'defense'].plot.bar()

In [0]:
sns.barplot(
    y="type1",
    x="defense",
    data=pokedex,
    order=pokedex.groupby(['type1']).mean(numeric_only=True).sort_values("defense", ascending=False).index,
    color="lightblue",
)

#### Is Steel harder than Rock?

In [0]:
from statsmodels.stats.weightstats import CompareMeans, DescrStatsW
steels_stats = DescrStatsW(pokedex[pokedex.type1 == "steel"].defense.values)
rocks_stats = DescrStatsW(pokedex[pokedex.type1 == "rock"].defense.values)
compare_means = CompareMeans(steels_stats, rocks_stats)
compare_means.summary()
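
The comparison above relies on the statsmodels defaults, which, to my understanding, pool the two variances. If the two types have clearly different spreads, Welch's unequal-variance t-test is a common alternative. The sketch below computes the Welch statistic and its degrees of freedom by hand with the standard library; `welch_t` is a hypothetical helper and the sample values are made up, not taken from the dataset.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # variance() is the unbiased sample variance (divides by n - 1).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Made-up defense values for two groups.
steel_like = [95, 110, 120, 140, 150]
rock_like = [60, 85, 100, 105, 130]
t_stat, dof = welch_t(steel_like, rock_like)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

A positive statistic means the first group has the higher mean. Turning it into a p-value still requires the t distribution, which statsmodels or SciPy provide.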

#### Pairwise multiple test

In [0]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
multiple_test = pairwise_tukeyhsd(pokedex.defense, pokedex.type1, alpha=0.05)
summary = multiple_test.summary()
(pd
 .DataFrame(summary.data[1:], columns=summary.data[0])[:20]
 .style.highlight_max(["reject"], axis=0))

Complete list

In [0]:
summary

### Outlier detection

In [0]:
from sklearn.svm import OneClassSVM

status = pokedex[["hp", "attack", "defense", "sp_attack", "sp_defense", "speed"]]
model = OneClassSVM(nu=0.01)
model.fit(status)

In [0]:
pokedex["outlier"] = (-model.predict(status) + 1) // 2
pokedex[pokedex.outlier == 1][["name", "japanese_name"] + status.columns.to_list()].style.bar(["hp", "attack", "defense", "sp_attack", "sp_defense", "speed"], color="lightblue")

### Legendary Pokémon prediction

#### Train test split

In [0]:
numerics = [
            "height", 
            "weight", 
            "hp", 
            "attack",
            "defense",
            "sp_attack",
            "sp_defense",
            "speed",
            "base_egg_steps",
            "base_happiness",
            "capture_rate",
            "experience_growth"
            ]
train = pokedex[pokedex.generation < 7]
test = pokedex[pokedex.generation == 7]
X_train, y_train = train[numerics], train.is_legendary
X_test, y_test = test[numerics], test.is_legendary

#### Training

In [0]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42, criterion="entropy")
model.fit(X_train, y_train)

#### Inference

In [0]:
pokedex["predicted"] = model.predict(pokedex[numerics])
sun_moon = pokedex[pokedex.generation == 7]
sun_moon[sun_moon.is_legendary == sun_moon.predicted][["name", "japanese_name", "is_legendary", "predicted"]]

In [0]:
sun_moon[sun_moon.is_legendary != sun_moon.predicted][["name", "japanese_name", "is_legendary", "predicted"]]

#### Evaluation

In [0]:
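from sklearn.metrics import ConfusionMatrixDisplay
sns.set_style("white")
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator replaces it.
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap="Blues")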
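
Legendary Pokémon are a small minority of the rows, so precision and recall for the legendary class may say more than overall accuracy. The sketch below derives both from the confusion-matrix counts; `precision_recall` is a hypothetical helper and the labels are made up, not model output.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against empty denominators when a class never appears.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 1 = legendary, 0 = not legendary.
y_true_toy = [0, 0, 1, 1, 1, 0, 0, 1]
y_pred_toy = [0, 0, 1, 0, 1, 1, 0, 1]
precision, recall = precision_recall(y_true_toy, y_pred_toy)
print(precision, recall)
```

On the real data, `sklearn.metrics.precision_score` and `recall_score` give the same quantities from `y_test` and the model's predictions.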

#### Model analysis

In [0]:
import numpy as np
import matplotlib.pyplot as plt

feature_importances = model.feature_importances_
sorted_idx = feature_importances.argsort()

y_ticks = np.arange(0, len(numerics))
fig, ax = plt.subplots()
ax.barh(y_ticks, feature_importances[sorted_idx])
ax.set_yticks(y_ticks)  # set tick positions first so the labels line up
ax.set_yticklabels(np.array(numerics)[sorted_idx])
ax.set_title("Random Forest Feature Importances")
fig.tight_layout()
plt.show()

In [0]:
sun_moon[sun_moon.is_legendary != sun_moon.predicted][["name", "japanese_name", "is_legendary", "predicted"] + numerics].style.bar(numerics, color="lightblue")

## Pokémon Battle

### Load Data

In [0]:
import json
from pathlib import Path
ladder = 'gen8battlestadiumsingles' # gen8
dataset_path = Path("PokemonAnalytics") / "data" /"battle" / f"{ladder}_dataset.json"
raw_data = json.loads(dataset_path.read_text())
teams = raw_data["train"] + raw_data["valid"]

### Frequency

In [0]:
from collections import Counter

def frequency(teams, pokemons=None):
  counter = Counter()
  for team in teams:
    if not team:
      continue
    elif not pokemons:
      counter.update(team)
    else:
      if set(pokemons).issubset(set(team)):
        counter.update(team)

  return counter
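
To see what the `pokemons` filter does, here is a compact, self-contained variant run on toy teams. `count_members` is a hypothetical name and the teams are invented; the logic is meant to mirror `frequency` above.

```python
from collections import Counter


def count_members(teams, required=()):
    """Count Pokémon over the teams that contain every name in `required`."""
    counter = Counter()
    for team in teams:
        # Skip empty teams; an empty `required` matches every team.
        if team and set(required).issubset(team):
            counter.update(team)
    return counter


toy_teams = [
    ["mimikyu", "dragapult", "togekiss"],
    ["mimikyu", "excadrill"],
    ["dragapult", "togekiss"],
    [],
]

overall = count_members(toy_teams)
with_mimikyu = count_members(toy_teams, required=["mimikyu"])
print(with_mimikyu.most_common())
```

The filtered count includes the required Pokémon itself, so its own frequency equals the number of matching teams.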

In [0]:
counter = frequency(teams)

In [0]:
names, freqs = zip(*counter.most_common())
df = pd.DataFrame([freqs], columns=names, index=['freq']).T
df.head(10)

In [0]:
barplot = sns.barplot(
    data=df.T,
)
barplot.figure.set_figheight(10)
barplot.figure.set_figwidth(30)

### Co-occurrence


In [0]:
counter = frequency(teams, pokemons=["mimikyu"])

In [0]:
names, freqs = zip(*counter.most_common())
df = pd.DataFrame([freqs], columns=names, index=['freq']).T
df.head(10)

In [0]:
counter = frequency(teams, pokemons=["mimikyu", "dragapult"])

In [0]:
names, freqs = zip(*counter.most_common())
df = pd.DataFrame([freqs], columns=names, index=['freq']).T
df.head(10)

### Co-occurrence Network

In [0]:
from itertools import combinations as pair_combinations
# Collect every unordered pair that appears together in at least one team.
# Sorting each pair makes (a, b) and (b, a) the same edge.
combinations = {
    tuple(sorted(pair))
    for team in teams
    for pair in pair_combinations(team, 2)
}

In [0]:
def pokemons_similarity(teams, poke1, poke2):
  intersection = len([team for team in teams if (poke1 in team) and (poke2 in team)])
  union = len([team for team in teams if (poke1 in team) or (poke2 in team)])
  return intersection / union
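
`pokemons_similarity` is the Jaccard index over teams: the number of teams containing both Pokémon divided by the number containing at least one of them. A small worked example on invented teams, using a hypothetical `jaccard` helper that also guards against an empty union:

```python
def jaccard(teams, poke1, poke2):
    """Share of teams with both Pokémon among teams with at least one."""
    both = sum(1 for team in teams if poke1 in team and poke2 in team)
    either = sum(1 for team in teams if poke1 in team or poke2 in team)
    return both / either if either else 0.0


# Made-up teams with single-letter names.
toy_teams = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["e"]]

# "a" and "b" share 2 teams; 3 teams contain at least one of them.
print(jaccard(toy_teams, "a", "b"))
# "a" and "e" never appear together.
print(jaccard(toy_teams, "a", "e"))
```

The value ranges from 0 (never on the same team) to 1 (always together).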

In [0]:
edges = [(poke1, poke2, pokemons_similarity(teams, poke1, poke2)) for poke1, poke2 in combinations]

In [0]:
edges_df = pd.DataFrame(edges, columns=["source", "destination", "weight"])
edges_df.head()

In [0]:
edges_df.plot.hist(bins=20)

In [0]:
from pyvis.network import Network

network = Network(height="1000px", width="95%", bgcolor="#FFFFFF", font_color="black", notebook=True)
network.force_atlas_2based()
threshold = (
    # 0.027 # top 25%
    0.111 # top 100
    # 0.249 # top 10
)

for edge in edges:
  source, destination, weight = edge
  if weight < threshold:
    continue
  network.add_node(source, source, title=source)
  network.add_node(destination, destination, title=destination)
  network.add_edge(source, destination, value=weight)

neighbor_map = network.get_adj_list()

for node in network.nodes:
  node["title"] += " Neighbors:<br>" + "<br>".join(neighbor_map[node["id"]])
  node["value"] = len(neighbor_map[node["id"]])

network.show_buttons(filter_=['physics'])
network.save_graph("co-occurence.html")
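
The thresholds in the cell above are hard-coded for this dataset. One way to derive a "top k" threshold from the edge weights is sketched below; `top_k_threshold` is a hypothetical helper and the weights are invented.

```python
def top_k_threshold(weights, k):
    """Smallest weight among the k largest; assumes k >= 1."""
    if not weights:
        raise ValueError("weights must not be empty")
    ranked = sorted(weights, reverse=True)
    # If there are fewer than k weights, fall back to the minimum.
    return ranked[min(k, len(ranked)) - 1]


toy_weights = [0.05, 0.30, 0.12, 0.25, 0.01, 0.18]
threshold = top_k_threshold(toy_weights, k=3)

# Same rule as the loop above: edges below the threshold are skipped.
kept = [w for w in toy_weights if w >= threshold]
print(threshold, kept)
```

With tied weights, more than k edges can pass the threshold.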

## Next Steps

### More samples!

- [Exploring the Pokemon dataset with pandas and seaborn | In Machines We Trust](https://inmachineswetrust.com/posts/exploring-pokemon-dataset/)

### More datasets!

The following datasets are available for exploring the Pokémon world.

- [Pokemon Sun and Moon (Gen 7) Stats | Kaggle](https://www.kaggle.com/mylesoneill/pokemon-sun-and-moon-gen-7-stats)
- [Pokemon with stats | Kaggle](https://www.kaggle.com/abcsds/pokemon)
