# Explainer Notebook

This notebook is structured into 4 sections:
* Motivation
* Basic Stats
* Tools, theory and analysis
* Discussion

We start with the imports that are used throughout the notebook.

In [18]:
# imports
import requests
import os
import random
import pickle
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import netwulf as nw
import community as community_louvain
import scipy.stats as stats


from bs4 import BeautifulSoup
from tqdm import tqdm
from collections import defaultdict

warnings.filterwarnings('ignore')


## Motivation

This brief section will cover the motivation behind this project.

* What is your dataset?

The dataset behind this project is the Pokémon world, i.e. all the Pokémon found in the Pokédex using PokeAPI, and all the episodes from the Pokémon TV-show collected from Bulbapedia. Using the Pokédex, it was possible to extract the name of each Pokémon as well as some attributes such as their types, abilities, and egg groups, which are important in the games. From Bulbapedia, it was possible to scrape the plot of each episode in each season as well as lists of which Pokémon appeared in which episodes. These appearance lists are what the graphs are built from.

* Why did you choose this/these particular dataset(s)?

We initially thought it would be interesting to go in a different direction than taking a "real-world" dataset, and see if it was still possible to apply methods from this course, and perform a relevant analysis. As such, we needed as much data from the Pokémon world as possible such that it was both possible to construct a graph with a number of attributes, and also have some text to analyse. 

* What was your goal for the end user’s experience?

The goal for the end user is to gain insight into the Pokémon world, and get a brief grasp of the different seasons, what separates them, and what makes them unique. This is the hope for someone who would come across this project. Essentially, this project can be boiled down to the following research questions:
1. Bla
2. Bla
3. Bla

These will lead the analysis done below.

## Basic Stats

Now, the focus will be shifted onto data collection and preprocessing. For this, numerous functions will be used, and these will be defined below.

In [3]:
def data_scrape():
    # scrape the data from PokéAPI
    temp_dict = {
        'pokemon': [],
        'abilities': [], 
        'types': [], 
        'egg_groups': [], 
        'moves': [],
        'pokedex_entry': []
    }
    
    for i, name in tqdm(enumerate(pokemons)):
        r = requests.get('https://pokeapi.co/api/v2/pokemon/' + str(i+1)).json()
        # append the name of the pokemon
        temp_dict['pokemon'].append(name)

        # append the abilities of the pokemon
        abilities = [r['abilities'][j]['ability']['name'] for j in range(len(r['abilities']))]
        temp_dict['abilities'].append(abilities)

        # append the types of the pokemon
        types = [t['type']['name'] for t in r['types']]
        temp_dict['types'].append(types)

        # append the moves of the pokemon
        moves = [r['moves'][j]['move']['name'] for j in range(len(r['moves']))]
        temp_dict['moves'].append(moves)

        # make new request to get the egg groups and pokedex entry
        r = requests.get('https://pokeapi.co/api/v2/pokemon-species/' + str(i+1)).json()

        # append the egg groups of the pokemon
        egg_groups = [r['egg_groups'][j]['name'] for j in range(len(r['egg_groups']))]
        temp_dict['egg_groups'].append(egg_groups)

        # append the pokedex entry of the pokemon
        entry = r['flavor_text_entries'][0]['flavor_text'].replace('\n', ' ').replace('\f', ' ') if len(r['flavor_text_entries']) > 0 else None
        temp_dict['pokedex_entry'].append(entry)
        

    print('Done!')

    return temp_dict

def find_unique(df, col):
    vals = df[col].values
    all_vals = [item for sublist in vals for item in sublist]
    unique_vals = list(set(all_vals))
    return unique_vals

def get_text_entries(attribute, unique_vals):
    temp_dict = {
        attribute: [],
        'text_entry': []
    }

    for i, val in tqdm(enumerate(unique_vals)):
        r = requests.get('https://pokeapi.co/api/v2/' + attribute + '/' + val).json()
        
        # check if the text entry exists in english
        if len(r['effect_entries']) == 0:
            for j in range(len(r['flavor_text_entries'])):
                if r['flavor_text_entries'][j]['language']['name'] == 'en':
                    temp_dict[attribute].append(val)
                    temp_dict['text_entry'].append(r['flavor_text_entries'][j]['flavor_text'].replace('\n', ' ').replace('\f', ' '))
                    break
        else:
            for j in range(len(r['effect_entries'])):
                if r['effect_entries'][j]['language']['name'] == 'en':
                    temp_dict[attribute].append(val)
                    temp_dict['text_entry'].append(r['effect_entries'][j]['effect'].replace('\n', ' ').replace('\f', ' '))
                    break

    return temp_dict

In [6]:
# make the initial request to get the pokemon names
data = requests.get('https://pokeapi.co/api/v2/pokemon?limit=1000').json()['results']

# get the names of the pokemons
pokemons = []
# get the name of the pokemon
for i in range(len(data)):
    pokemons.append(data[i]['name'])

print("Check the first 5 pokemons: ", pokemons[:5])

Check the first 5 pokemons:  ['bulbasaur', 'ivysaur', 'venusaur', 'charmander', 'charmeleon']


In [14]:
# next, use the function to create the dataset (only if the file does not exist)
if not os.path.exists('pokemon.pickle'):     
    print('Scraping data...')
    poke_dict = data_scrape()
    poke_df = pd.DataFrame(poke_dict)
    poke_df.to_pickle('pokemon.pickle')
else:
    poke_df = pd.read_pickle('pokemon.pickle')
    print('Data loaded!')


Data loaded!


In [16]:
# check the info of the dataframe to get a quick overview
poke_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   pokemon        1000 non-null   object
 1   abilities      1000 non-null   object
 2   types          1000 non-null   object
 3   egg_groups     1000 non-null   object
 4   moves          1000 non-null   object
 5   pokedex_entry  905 non-null    object
dtypes: object(6)
memory usage: 47.0+ KB


Now that the initial dataframe has been gathered there is a need for some cleaning. This is done in 2 simple steps:
1. Remove all rows with NaN values. These are the Pokémon that do not have a Pokédex entry.
2. Capitalize the names of each Pokémon.

In [19]:
# first, remove NaN values
# (copy so that the assignment below does not write to a view of poke_df)
poke_df_clean = poke_df.dropna().copy()

# second, capitalize the pokemon names
poke_df_clean['pokemon'] = poke_df_clean['pokemon'].str.capitalize()

In [21]:
# finally, we save the cleaned dataframe to a pickle file (only if the file does not exist)
if not os.path.exists('pokemon_clean.pickle'):
    poke_df_clean.to_pickle('pokemon_clean.pickle')
else:
    print('File already exists')

File already exists


The next step is to check the unique values in each of the columns of the dataframe. This is simply to gain a quick overview of how many there are of each.

In [24]:
unique_abilities = find_unique(poke_df_clean, 'abilities')
unique_types = find_unique(poke_df_clean, 'types')
unique_egg_groups = find_unique(poke_df_clean, 'egg_groups')
unique_moves = find_unique(poke_df_clean, 'moves')
print('Number of pokemon: ', len(poke_df_clean))
print('Number of unique abilities: ', len(unique_abilities))
print('Number of unique types: ', len(unique_types))
print('Number of unique egg groups: ', len(unique_egg_groups))
print('Number of unique moves: ', len(unique_moves))

Number of pokemon:  905
Number of unique abilities:  249
Number of unique types:  18
Number of unique egg groups:  15
Number of unique moves:  747


This concludes the initial preprocessing. Going forward, this project will only consider the 905 Pokémon found above. There are 249 unique abilities, 18 unique types and 15 unique egg groups, which will become important during the graph analysis. Note, however, that a Pokémon can have several of each, so there are many more distinct combinations of these values.
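To make the point about combinations concrete, the sketch below counts both the unique single values and the distinct combinations for a list-valued attribute. It uses a small hand-written dictionary rather than `poke_df_clean`, and the helper names `count_unique_values` and `count_combinations` are hypothetical, introduced only for this illustration.

```python
# Toy stand-in for the 'types' column; a few hand-picked entries, not the scraped data
toy_types = {
    "Charmander": ["fire"],
    "Charizard": ["fire", "flying"],
    "Moltres": ["fire", "flying"],
    "Squirtle": ["water"],
    "Gyarados": ["water", "flying"],
}

def count_unique_values(lists):
    """Number of distinct single values across all lists."""
    return len({item for sublist in lists for item in sublist})

def count_combinations(lists):
    """Number of distinct combinations, ignoring the order within a list."""
    # frozenset makes each list hashable and order-independent
    return len({frozenset(sublist) for sublist in lists})

n_values = count_unique_values(toy_types.values())
n_combos = count_combinations(toy_types.values())
print(n_values, n_combos)
```

Even in this tiny example the three single types give rise to four distinct combinations. The same two helpers could be applied to a dataframe column by passing `poke_df_clean['types']` instead of the dictionary values.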

The next step is to collect data from all the Pokémon seasons. This process is a little more complicated, and as before it starts with defining a couple of functions.
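The scraping functions below address each episode page by a series prefix (for example `EP` or `AG`) followed by a zero-padded three-digit episode number. As a minimal sketch of that addressing scheme, the hypothetical helper `episode_url` builds the page URL with plain string formatting; it is an illustration only and makes no network request.

```python
BASE_URL = "https://bulbapedia.bulbagarden.net/wiki"

def episode_url(prefix: str, number: int) -> str:
    """Build an episode URL from a series prefix and an episode number."""
    # :03d zero-pads the number to three digits, e.g. 7 -> '007'
    return f"{BASE_URL}/{prefix}{number:03d}"

first_episode = episode_url("EP", 1)
later_episode = episode_url("AG", 135)
print(first_episode)
print(later_episode)
```

Building URLs with string formatting rather than `os.path.join` keeps the separator a forward slash on every operating system.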

In [26]:
def make_number(num):
    # zero-pad the episode number to three digits, e.g. 7 -> '007'
    return str(num).zfill(3)
        
def get_pokemon_data(episode, names, season):
    lookup = season_dict[season]
    # build the URL by string concatenation; os.path.join would use backslashes on Windows
    r = requests.get('https://bulbapedia.bulbagarden.net/wiki/' + lookup + episode).text
    soup = BeautifulSoup(r, 'html.parser')
    elems = soup.find_all('a', href=True)
    episode_pokemon = []

    for name in names:
        for elem in elems:
            if name in elem.text:
                text = elem.text
                episode_pokemon.append(text)

    unique_pokemon = list(set(episode_pokemon))

    # remove elements that are not single words
    unique_pokemon = [p for p in unique_pokemon if len(p.split()) == 1]

    # remove nature names
    if 'Nature' in unique_pokemon:
        unique_pokemon.remove('Nature')
    return unique_pokemon

# get episode plots
def get_episode_plot(episode, season):
    lookup = season_dict[season]
    # build the URL by string concatenation; os.path.join would use backslashes on Windows
    r = requests.get('https://bulbapedia.bulbagarden.net/wiki/' + lookup + episode).text
    soup = BeautifulSoup(r, 'html.parser')
    elems = soup.find_all('p')
    plot = ''
    for i in range(1,len(elems)):
        if "Who's That Pokémon?" in elems[i].text:
            break
        plot += elems[i].text

    #plot = plot.replace('\n ', ' ')
    plot = plot.replace('\n', ' ')

    # remove trailing whitespace
    plot = plot.strip()
    
    return plot

def gather_pokemon_data(episode_numbers, names, season):
    episode_dict = {}
    for episode in tqdm(episode_numbers):
        episode_pokemon = get_pokemon_data(episode, names, season)
        plot = get_episode_plot(episode, season)
        episode_dict[episode] = []
        episode_dict[episode].append(episode_pokemon)
        episode_dict[episode].append(plot)
    return episode_dict

In [27]:
# now, we can get the pokemon data for each episode
# first, we get the names of all pokemon from the initial dataframe
names = poke_df_clean['pokemon'].values.tolist()

In [28]:
# then, we start collecting data for each season
# this requires a bit of manual work, since the episodes are not numbered in a consistent way
# also, we need a season dict
season_dict = {
    'Indigo League': 'EP',
    'Adventures on the Orange Islands': 'EP',
    'The Johto Journeys': 'EP',
    'Hoenn': 'AG',
    'Battle Frontier': 'AG',
    'Diamond and Pearl': 'DP',
    'Black and White': 'BW',
    'XY': 'XY',
    'Sun and Moon': 'SM',
    'Pocket Monsters': 'JN'
}


In [29]:
if not os.path.exists('indigo_df.pkl'):
    print('Scraping data...')
    episode_numbers_indigo_league = [make_number(i) for i in range(1, 81)]
    indigo_dict = gather_pokemon_data(episode_numbers_indigo_league, names, 'Indigo League')
    indigo_df = pd.DataFrame.from_dict(indigo_dict, orient='index', columns=['pokemon', 'plot'])
    indigo_df.to_pickle('indigo_df.pkl')
else:
    indigo_df = pd.read_pickle('indigo_df.pkl')
    print('Data loaded!')

Data loaded!


In [30]:
if not os.path.exists('orange_df.pkl'):
    print('Scraping data...')
    episode_numbers_orange_islands = [make_number(i) for i in range(81, 117)]
    orange_dict = gather_pokemon_data(episode_numbers_orange_islands, names, 'Adventures on the Orange Islands')
    orange_df = pd.DataFrame.from_dict(orange_dict, orient='index', columns=['pokemon', 'plot'])
    orange_df.to_pickle('orange_df.pkl')
else:
    orange_df = pd.read_pickle('orange_df.pkl')
    print('Data loaded!')

Data loaded!


In [31]:
if not os.path.exists('johto_df.pkl'):
    print('Scraping data...')
    episode_numbers_johto_journeys = [make_number(i) for i in range(117, 275)]
    johto_dict = gather_pokemon_data(episode_numbers_johto_journeys, names, 'The Johto Journeys')
    johto_df = pd.DataFrame.from_dict(johto_dict, orient='index', columns=['pokemon', 'plot'])
    johto_df.to_pickle('johto_df.pkl')
else:
    johto_df = pd.read_pickle('johto_df.pkl')
    print('Data loaded!')

Data loaded!


In [32]:
if not os.path.exists('hoenn_df.pkl'):
    print('Scraping data...')
    episode_numbers_hoenn_league = [make_number(i) for i in range(1, 135)]
    hoenn_dict = gather_pokemon_data(episode_numbers_hoenn_league, names, 'Hoenn')
    hoenn_df = pd.DataFrame.from_dict(hoenn_dict, orient='index', columns=['pokemon', 'plot'])
    hoenn_df.to_pickle('hoenn_df.pkl')
else:
    hoenn_df = pd.read_pickle('hoenn_df.pkl')
    print('Data loaded!')

Data loaded!


In [33]:
if not os.path.exists('battle_df.pkl'):
    print('Scraping data...')
    episode_numbers_battle_frontier = [make_number(i) for i in range(135, 193)]
    battle_dict = gather_pokemon_data(episode_numbers_battle_frontier, names, 'Battle Frontier')
    battle_df = pd.DataFrame.from_dict(battle_dict, orient='index', columns=['pokemon', 'plot'])
    battle_df.to_pickle('battle_df.pkl')
else:
    battle_df = pd.read_pickle('battle_df.pkl')
    print('Data loaded!')

Data loaded!


In [34]:
if not os.path.exists('diamond_df.pkl'):
    print('Scraping data...')
    episode_numbers_diamond_pearl = [make_number(i) for i in range(1, 192)]
    diamond_dict = gather_pokemon_data(episode_numbers_diamond_pearl, names, 'Diamond and Pearl')
    diamond_df = pd.DataFrame.from_dict(diamond_dict, orient='index', columns=['pokemon', 'plot'])
    diamond_df.to_pickle('diamond_df.pkl')
else:
    diamond_df = pd.read_pickle('diamond_df.pkl')
    print('Data loaded!')

Data loaded!


In [35]:
if not os.path.exists('black_df.pkl'):
    print('Scraping data...')
    episode_numbers_black_white = [make_number(i) for i in range(1, 143)]
    black_dict = gather_pokemon_data(episode_numbers_black_white, names, 'Black and White')
    black_df = pd.DataFrame.from_dict(black_dict, orient='index', columns=['pokemon', 'plot'])
    black_df.to_pickle('black_df.pkl')
else:
    black_df = pd.read_pickle('black_df.pkl')
    print('Data loaded!')

Data loaded!


In [36]:
if not os.path.exists('xy_df.pkl'):
    print('Scraping data...')
    episode_numbers_xy = [make_number(i) for i in range(1, 141)]
    xy_dict = gather_pokemon_data(episode_numbers_xy, names, 'XY')
    xy_df = pd.DataFrame.from_dict(xy_dict, orient='index', columns=['pokemon', 'plot'])
    xy_df.to_pickle('xy_df.pkl')
else:
    xy_df = pd.read_pickle('xy_df.pkl')
    print('Data loaded!')

Data loaded!


In [37]:
if not os.path.exists('sun_df.pkl'):
    print('Scraping data...')
    episode_numbers_sun_moon = [make_number(i) for i in range(1, 147)]
    sun_dict = gather_pokemon_data(episode_numbers_sun_moon, names, 'Sun and Moon')
    sun_df = pd.DataFrame.from_dict(sun_dict, orient='index', columns=['pokemon', 'plot'])
    sun_df.to_pickle('sun_df.pkl')
else:
    sun_df = pd.read_pickle('sun_df.pkl')
    print('Data loaded!')

Data loaded!


In [38]:
if not os.path.exists('pocket_monsters.pkl'):
    print('Scraping data...')
    episode_numbers_pocket_monsters = [make_number(i) for i in range(1, 148)]
    pocket_dict = gather_pokemon_data(episode_numbers_pocket_monsters, names, 'Pocket Monsters')
    pocket_df = pd.DataFrame.from_dict(pocket_dict, orient='index', columns=['pokemon', 'plot'])
    pocket_df.to_pickle('pocket_monsters.pkl')
else:
    pocket_df = pd.read_pickle('pocket_monsters.pkl')
    print('Data loaded!')

Data loaded!


That was quite a bit of work!

The only thing left to do is to add a column with the season number to each dataframe, and then concatenate the dataframes into a single one that holds all the information.

In [41]:
# collect all the dataframes into one
frames = [indigo_df, orange_df, johto_df, hoenn_df, battle_df, diamond_df, black_df, xy_df, sun_df, pocket_df]

# add a column for the season
for i in range(len(frames)):
    frames[i]['season'] = i + 1

# combine all the dataframes
all_seasons_df = pd.concat(frames)

# save the dataframe
if not os.path.exists('all_seasons_df.pkl'):
    all_seasons_df.to_pickle('all_seasons_df.pkl')
else:
    all_seasons_df = pd.read_pickle('all_seasons_df.pkl')
    print('Data loaded!')

Data loaded!


In [42]:
# summarize the data in each season
seasons = all_seasons_df.groupby('season')
seasons.describe()

| season | pokemon count | pokemon unique | pokemon top | pokemon freq | plot count | plot unique | plot top | plot freq |
|---|---|---|---|---|---|---|---|---|
| 1 | 80 | 80 | [Pikachu, Mankey, Spearow, Gyarados, Hypnosis,... | 1 | 80 | 80 | Pokémon - I Choose You! (Japanese: ポケモン！きみにきめた... | 1 |
| 2 | 35 | 35 | [Poliwag, Pikachu, Mankey, Spearow, Staryu, Pi... | 1 | 35 | 35 | After battling in the Pokémon League Tournamen... | 1 |
| 3 | 158 | 158 | [Pikachu, Chansey, Lickitung, Meowth, Fearow, ... | 1 | 158 | 158 | Ash begins his journey in Johto, a largely une... | 1 |
| 4 | 134 | 134 | [Entei, Pikachu, Mudkip, Poochyena, Beautifly,... | 1 | 134 | 134 | Team Rocket's failed attempt to catch Pikachu ... | 1 |
| 5 | 58 | 58 | [Pikachu, Rhyhorn, Manectric, Pinsir, Meowth, ... | 1 | 58 | 58 | The Battle Factory is Ash's next destination—i... | 1 |
| 6 | 191 | 191 | [Bidoof, Pikachu, Starly, Chatot, Mantyke, Meo... | 1 | 191 | 191 | It's always exciting when new Pokémon Trainers... | 1 |
| 7 | 142 | 142 | [Minccino, Pikachu, Reshiram, Deerling, Meowth... | 1 | 142 | 142 | Ash excitedly arrives in the Unova region alon... | 1 |
| 8 | 140 | 140 | [Pikachu, Furret, Staryu, Pidgeotto, Lickitung... | 1 | 140 | 140 | After a quick introduction to Serena, a buddin... | 1 |
| 9 | 146 | 146 | [Pikachu, Mankey, Litten, Staryu, Whimsicott, ... | 1 | 146 | 146 | It’s a beautiful day on Melemele Island in the... | 1 |
| 10 | 147 | 147 | [Poliwag, Pikachu, Mankey, Spearow, Dugtrio, E... | 1 | 147 | 147 | In Pallet Town, a young Ash Ketchum is beside ... | 1 |


Notice that these dataframes require no cleaning at all! Their purpose is simply to form the backbone of the graph creation, which is the next step in the process. It is also worth noticing that the number of episodes differs considerably between the seasons. This might play a role for the graphs.
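As a rough illustration of how per-episode appearance lists can become a weighted graph, the sketch below counts, for every unordered pair of Pokémon, the number of episodes in which both appear. The episode lists are made up for the example, and `cooccurrence_edges` is a hypothetical helper rather than the function used in the analysis.

```python
from collections import Counter
from itertools import combinations

# Made-up episode casts, not scraped data
episodes = [
    ["Pikachu", "Meowth", "Staryu"],
    ["Pikachu", "Meowth"],
    ["Staryu", "Psyduck"],
]

def cooccurrence_edges(episodes):
    """Return (a, b, weight) triples, weight = number of shared episodes."""
    counts = Counter()
    for cast in episodes:
        # sorting the de-duplicated cast gives each pair one canonical key
        for a, b in combinations(sorted(set(cast)), 2):
            counts[(a, b)] += 1
    return [(a, b, w) for (a, b), w in counts.items()]

edges = cooccurrence_edges(episodes)
for edge in edges:
    print(edge)
```

Triples of this form can be passed directly to `add_weighted_edges_from` on a `networkx` graph. Because every episode contributes one count per pair, seasons with more episodes tend to produce heavier and denser graphs.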

The next step is to create and analyse all the graphs. This is essentially done in one big function composed of many smaller steps. First, we define the smaller utility functions, and then the main function.
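One of the questions the utility functions address is whether neighbouring Pokémon share an attribute more often than chance would suggest. The sketch below shows the idea on a toy adjacency dictionary. Note that it differs from the notebook in two labelled ways: it shuffles the labels (a permutation, which keeps label counts fixed) instead of resampling them with `random.choice`, and it reports an empirical p-value instead of a one-sample t-test. All names and data here are illustrative assumptions.

```python
import random

# Toy undirected graph as an adjacency dict; illustrative only
neighbors = {
    "Charmander": ["Vulpix", "Squirtle"],
    "Vulpix": ["Charmander", "Growlithe"],
    "Growlithe": ["Vulpix"],
    "Squirtle": ["Charmander", "Psyduck"],
    "Psyduck": ["Squirtle"],
}
types = {"Charmander": "fire", "Vulpix": "fire", "Growlithe": "fire",
         "Squirtle": "water", "Psyduck": "water"}

def same_label_fraction(neighbors, labels):
    """Mean over nodes of the share of neighbours with the node's label."""
    fracs = []
    for node, nbrs in neighbors.items():
        if not nbrs:  # isolated nodes have no neighbours to compare with
            continue
        fracs.append(sum(labels[n] == labels[node] for n in nbrs) / len(nbrs))
    return sum(fracs) / len(fracs)

observed = same_label_fraction(neighbors, types)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
null = []
for _ in range(1000):
    shuffled = list(types.values())
    rng.shuffle(shuffled)  # permute labels across nodes
    null.append(same_label_fraction(neighbors, dict(zip(types, shuffled))))

# Empirical one-sided p-value: share of shuffles at least as homophilous
p_value = sum(v >= observed for v in null) / len(null)
print(f"observed = {observed:.2f}, p = {p_value:.3f}")
```

An empirical p-value answers the question directly from the null distribution, whereas a t-test of the null sample against the observed value mostly reflects how many random draws were taken.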

In [48]:
save_name_dict = {"indigo": "Indigo League",
                  "orange": "Orange Islands",
                    "johto": "Johto League",
                    "hoenn": "Hoenn League",
                    "battle": "Battle Frontier",
                    "sinnoh": "Sinnoh League",
                    "unova": "Unova League",
                    "kalos": "Kalos League",
                    "alola": "Alola League",
                    "journeys": "Pokémon Journeys",
                    "all_seasons": "All Seasons"}

def make_anime_edgelist(df):
    # count how many episodes each unordered pair of pokemon shares
    edgelist = defaultdict(int)
    # loop over all episodes
    for episode_pokemon in tqdm(df['pokemon']):
        # sort the de-duplicated names so that each pair gets one canonical key
        episode_names = sorted(set(episode_pokemon))
        # loop over all pairs of pokemon in the episode
        for j in range(len(episode_names)):
            for k in range(j+1, len(episode_names)):
                edgelist[(episode_names[j], episode_names[k])] += 1

    # convert to (node, node, weight) triples for an undirected weighted graph
    edgelist = [(k[0], k[1], v) for k, v in edgelist.items()]
    return edgelist

def calc_frac(graph, fields):
    """ Calculate the fraction of neighbors with the same attribute value as the node itself."""
    fracs = []
    for node in graph.nodes:
        # skip isolated nodes to avoid dividing by zero
        if graph.degree(node) == 0:
            continue
        c = 0
        for neighbor in graph.neighbors(node):
            if fields[neighbor] == fields[node]:
                c += 1
        fracs.append(c/graph.degree(node))

    return np.mean(fracs)

def set_group(graph, group_dict):
    nx.set_node_attributes(graph, group_dict, 'group')

def frac_same_field(graph, field):
    fields = nx.get_node_attributes(graph, field)
    return calc_frac(graph, fields)

def frac_rand_graph(graph, field):
    fields = nx.get_node_attributes(graph, field)
    field_list = list(fields.values())
    for key in fields.keys():
        fields[key] = random.choice(field_list)

    return calc_frac(graph, fields)

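# rewire a copy of the graph (preserving the degree sequence) and return
# the modularity of the best Louvain partition of the rewired graph
def modularity_test(graph, nswap):
    # double_edge_swap modifies the graph in place, so work on a copy
    temp_graph = graph.copy()
    nx.double_edge_swap(temp_graph, nswap=int(nswap), max_tries=1000000)
    partition = community_louvain.best_partition(temp_graph)
    return community_louvain.modularity(partition, temp_graph)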

# time to make a function for all graph analysis
def graph_analysis(df, save_name: str, save: bool = False):
    # setup relevant folders for saving
    if save:
        os.makedirs(f'figures/{save_name}', exist_ok=True)

    txt_lines = []

    # make big print statement
    print(f"Analysing the graph for the {save_name_dict[save_name]} season")
    txt_lines.append("Analysing the graph for the " + save_name_dict[save_name] + " season")
    # make the initial graph
    G = nx.Graph()
    print("Making graph...")
    G.add_weighted_edges_from(make_anime_edgelist(df))
    print("Done!")
    txt_lines.append(f"The graph has {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")

    # make dataframe with only pokemon in original df
    anime_pokemon = find_unique(df, 'pokemon')
    anime_pokemon_df = poke_df_clean[poke_df_clean['pokemon'].isin(anime_pokemon)].reset_index(drop=True)

    # remove all nodes that are not in the anime pokemon dataframe
    pokemon = anime_pokemon_df['pokemon'].values.tolist()
    G.remove_nodes_from([n for n in G.nodes() if n not in pokemon])

    # add pokemon attributes to graph
    types = [t for t in anime_pokemon_df['types'].values]
    type_dict = dict(zip(anime_pokemon_df['pokemon'], types))

    abilities = [a for a in anime_pokemon_df['abilities'].values]
    ability_dict = dict(zip(anime_pokemon_df['pokemon'], abilities))

    egg_groups = [e for e in anime_pokemon_df['egg_groups'].values]
    egg_group_dict = dict(zip(anime_pokemon_df['pokemon'], egg_groups))

    nx.set_node_attributes(G, type_dict, 'types')
    nx.set_node_attributes(G, ability_dict, 'abilities')
    nx.set_node_attributes(G, egg_group_dict, 'egg_groups')

    # degree rank plot
    degree_sequence = sorted([d for _, d in G.degree()], reverse=True)

    # plot degree distribution
    plt.figure(figsize=(10, 6))
    plt.plot(degree_sequence, 'b-', marker='o')
    # add the name to the highest degree
    plt.title(f"Degree rank plot for {save_name_dict[save_name]}")
    plt.ylabel("degree")
    plt.xlabel("rank")
    figure_path = os.path.join('figures', save_name, 'degree_rank_plot.png')
    if save:
        plt.savefig(figure_path)
    plt.close()

    # make histogram of degree distribution w. mean
    plt.figure(figsize=(10, 6))
    plt.hist(degree_sequence, bins=20)
    plt.axvline(np.mean(degree_sequence), color='r', linestyle='dashed', linewidth=1)
    plt.text(np.mean(degree_sequence) + 10, 100, 'Mean: {:.0f}'.format(np.mean(degree_sequence)))
    plt.title(f"Degree Distribution for {save_name_dict[save_name]}")
    plt.ylabel("Count")
    plt.xlabel("Degree")
    if save:
        plt.savefig(os.path.join('figures', save_name, 'degree_distribution.png'))
    plt.close()

    # identify the ten pokemon with the highest degree
    sorted_degree = sorted(G.degree, key=lambda x: x[1], reverse=True)
    # print the top ten pokemon with the highest degree and their degree value each on one line
    txt_lines.append("The top ten pokemon with the highest degree are:")
    for i in range(10):
        txt_lines.append(f"{sorted_degree[i][0]}: {sorted_degree[i][1]}")

    # get degree assortativity
    dac = nx.degree_assortativity_coefficient(G)
    txt_lines.append(f"The degree assortativity coefficient is {dac:.2f}")

    # explore connections between pokemon types, abilities and egg groups
    avg_typing = frac_same_field(G, 'types')
    txt_lines.append(f"The average fraction of neighbors with the same typing as the node itself is {avg_typing*100:.2f}%")

    avg_abilities = frac_same_field(G, 'abilities')
    txt_lines.append(f"The average fraction of neighbors with the same ability as the node itself is {avg_abilities*100:.2f}%")

    avg_egg_groups = frac_same_field(G, 'egg_groups')
    txt_lines.append(f"The average fraction of neighbors with the same egg group as the node itself is {avg_egg_groups*100:.2f}%")

    # now, we repeat the above 100 times and calculate the average fraction of neighbors with the same field as the node itself
    avg_rand_type_100 = [frac_rand_graph(G, 'types') for _ in range(100)]
    avg_rand_type_100_mu = np.mean(avg_rand_type_100)
    txt_lines.append(f"The average fraction of neighbors with the same typing as the node itself when random is {avg_rand_type_100_mu*100:.2f}%")
    

    avg_rand_abilities_100 = [frac_rand_graph(G, 'abilities') for _ in range(100)]
    avg_rand_abilities_100_mu = np.mean(avg_rand_abilities_100)
    txt_lines.append(f"The average fraction of neighbors with the same ability as the node itself when random is {avg_rand_abilities_100_mu*100:.2f}%")

    avg_rand_egg_groups_100 = [frac_rand_graph(G, 'egg_groups') for _ in range(100)]
    avg_rand_egg_groups_100_mu = np.mean(avg_rand_egg_groups_100)
    txt_lines.append(f"The average fraction of neighbors with the same egg group as the node itself when random is {avg_rand_egg_groups_100_mu*100:.2f}%")

    # now we make three subplots of the random distributions with the actual values plotted as vertical lines with text
    fig, ax = plt.subplots(1, 3, figsize=(15, 6), sharey=True)
    ax[0].hist(avg_rand_type_100, bins=20)
    ax[0].axvline(avg_typing, color='r', linestyle='dashed', linewidth=1)
    ax[0].set_title("Typing")
    ax[0].set_ylabel("Count")
    ax[0].set_xlabel("Fraction of neighbors with same typing")
    ax[1].hist(avg_rand_abilities_100, bins=20)
    ax[1].axvline(avg_abilities, color='r', linestyle='dashed', linewidth=1)
    ax[1].set_title("Abilities")
    ax[1].set_xlabel("Fraction of neighbors with same ability")
    ax[2].hist(avg_rand_egg_groups_100, bins=20)
    ax[2].axvline(avg_egg_groups, color='r', linestyle='dashed', linewidth=1)
    ax[2].set_title("Egg Groups")
    ax[2].set_xlabel("Fraction of neighbors with same egg group")
    
    plt.suptitle(f"Random distributions for {save_name_dict[save_name]}")
    if save:
        plt.savefig(os.path.join('figures', save_name, 'random_distributions.png'))
    plt.close()

    # make statistical tests for the three fields (scipy.stats is imported as stats at the top)
    txt_lines.append("Statistical tests for the three fields:")

    p_val_typing = stats.ttest_1samp(avg_rand_type_100, avg_typing)[1]
    txt_lines.append(f"Typing: {p_val_typing}")

    p_val_abilities = stats.ttest_1samp(avg_rand_abilities_100, avg_abilities)[1]
    txt_lines.append(f"Abilities: {p_val_abilities}")

    p_val_egg_groups = stats.ttest_1samp(avg_rand_egg_groups_100, avg_egg_groups)[1]
    txt_lines.append(f"Egg Groups: {p_val_egg_groups}")
    

    # find best partition
    partition = community_louvain.best_partition(G)
    # print the modularity
    mod = community_louvain.modularity(partition, G)
    txt_lines.append(f"The modularity is {mod:.2f}")

    num_communities = len(set(partition.values()))
    txt_lines.append(f"There are {num_communities} communities")

    # Community sizes
    community_sizes = [len(list(filter(lambda x: x[1] == i, partition.items()))) for i in range(num_communities)]
    txt_lines.append(f"The community sizes are {community_sizes}")

    # add the community as an attribute to the nodes
    set_group(G, partition)

    # time to test modularity
    txt_lines.append("Testing modularity")
    print("Testing modularity")
    if save_name != 'all_seasons':
        # the double edge swap test is skipped for the combined all-seasons graph
        
        mods = []
        for _ in tqdm(range(100)):
            mods.append(modularity_test(G, G.number_of_edges() // 2))
        txt_lines.append(f"The average modularity after double edge swap test is {np.mean(mods):.2f}")
        # statistical test
        p_val_mod = stats.ttest_1samp(mods, mod)[1]
        txt_lines.append(f"The p-value for the modularity test is {p_val_mod}")

        # plot the distribution of modularity values with the actual modularity value plotted as a vertical line
        plt.hist(mods, bins=20)
        plt.axvline(mod, color='r', linestyle='dashed', linewidth=1)
        plt.title(f"Modularity distribution for {save_name_dict[save_name]}")
        plt.xlabel("Modularity")
        plt.ylabel("Count")
        if save:
            plt.savefig(os.path.join('figures', save_name, 'modularity_distribution.png'))
        plt.close()

    # write all the text lines to a file (create the folder first if needed)
    os.makedirs('txt_files', exist_ok=True)
    with open(os.path.join('txt_files', f'{save_name}_text.txt'), 'w') as f:
        f.write('\n'.join(txt_lines))

    return G

def save_graph(G, save_name):
    """
    Saves the graph as a pkl file
    """
    # make the directory if it doesn't exist
    os.makedirs('graphs', exist_ok=True)
    print(f"Saving graph as {save_name}_G.pkl")
    with open(os.path.join('graphs', f'{save_name}_G.pkl'), 'wb') as f:
        pickle.dump(G, f)

In [49]:
# make the graphs
# note: battle_df (Battle Frontier) is not part of this mapping, so no graph is made for it
name_to_df_dict = {"indigo": indigo_df,
                   "orange": orange_df,
                   "johto": johto_df,
                   "hoenn": hoenn_df,
                   "sinnoh": diamond_df,
                   "unova": black_df,
                   "kalos": xy_df,
                   "alola": sun_df,
                   "journeys": pocket_df,
                   "all_seasons": all_seasons_df
}

# loop through all the dataframes and make the graphs
for name, df in name_to_df_dict.items():
    if not os.path.exists(os.path.join('graphs', f'{name}_G.pkl')):
        G = graph_analysis(df, name, save=True)
        save_graph(G, name)
    else:
        print(f"Graph for {name} already exists")

Analysing the graph for the Indigo League season
Making graph...


100%|██████████| 80/80 [00:00<00:00, 127.35it/s]


Done!
Testing modularity


100%|██████████| 100/100 [00:44<00:00,  2.24it/s]


Saving graph as indigo_G.pkl
Analysing the graph for the Orange Islands season
Making graph...


100%|██████████| 35/35 [00:00<00:00, 129.35it/s]


Done!
Testing modularity


100%|██████████| 100/100 [00:23<00:00,  4.31it/s]


Saving graph as orange_G.pkl
Analysing the graph for the Johto League season
Making graph...


100%|██████████| 158/158 [00:01<00:00, 153.53it/s]


Done!
Testing modularity


100%|██████████| 100/100 [01:16<00:00,  1.31it/s]


Saving graph as johto_G.pkl
Analysing the graph for the Hoenn League season
Making graph...


100%|██████████| 134/134 [00:01<00:00, 128.58it/s]


Done!
Testing modularity


100%|██████████| 100/100 [01:29<00:00,  1.12it/s]


Saving graph as hoenn_G.pkl
Analysing the graph for the Sinnoh League season
Making graph...


100%|██████████| 191/191 [00:01<00:00, 102.46it/s]


Done!
Testing modularity


100%|██████████| 100/100 [02:06<00:00,  1.26s/it]


Saving graph as sinnoh_G.pkl
Analysing the graph for the Unova League season
Making graph...


100%|██████████| 142/142 [00:01<00:00, 140.63it/s]


Done!
Testing modularity


100%|██████████| 100/100 [01:16<00:00,  1.30it/s]


Saving graph as unova_G.pkl
Analysing the graph for the Kalos League season
Making graph...


100%|██████████| 140/140 [00:01<00:00, 80.21it/s] 


Done!
Testing modularity


100%|██████████| 100/100 [02:26<00:00,  1.47s/it]


Saving graph as kalos_G.pkl
Analysing the graph for the Alola League season
Making graph...


100%|██████████| 146/146 [00:02<00:00, 51.02it/s]


Done!
Testing modularity


100%|██████████| 100/100 [03:45<00:00,  2.25s/it]


Saving graph as alola_G.pkl
Analysing the graph for the Pokémon Journeys season
Making graph...


100%|██████████| 147/147 [00:03<00:00, 37.78it/s]


Done!
Testing modularity


100%|██████████| 100/100 [08:17<00:00,  4.98s/it]


Saving graph as journeys_G.pkl
Analysing the graph for the All Seasons season
Making graph...


100%|██████████| 1231/1231 [00:16<00:00, 72.92it/s]


Done!
Testing modularity
Saving graph as all_seasons_G.pkl
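The modularity test above compares the observed modularity with that of randomly rewired graphs. As a minimal sketch of the rewiring step alone, the example below applies `nx.double_edge_swap` to a copy of a small built-in graph (the karate club graph, used here purely as a stand-in for the Pokémon graphs) and checks that every node keeps its degree. The seed and `max_tries` values are arbitrary choices for the illustration.

```python
import networkx as nx

G = nx.karate_club_graph()  # small built-in example graph, not the Pokémon data
degrees_before = dict(G.degree())
edges_before = {frozenset(e) for e in G.edges()}

# double_edge_swap rewires the graph it is given in place, so work on a copy
H = G.copy()
nx.double_edge_swap(H, nswap=G.number_of_edges() // 2, max_tries=10_000, seed=0)

same_degrees = dict(H.degree()) == degrees_before
original_untouched = {frozenset(e) for e in G.edges()} == edges_before
print(same_degrees, original_untouched)
```

Computing the modularity of the best partition of `H` would then give one sample from the null distribution. Working on a copy matters because rewiring the original graph would also change the graph that is later saved to disk.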
