# Get characters 
![header](../images/The_Marvel_Universe.png)

In this first notebook we extract all the characters from the different Marvel hero and villain teams to create the graph that is used in the project.

In [1]:
import json
import urllib.parse
import urllib.request

import re
import os

import pandas as pd
import numpy as np

import utils

from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import repeat

from tqdm.notebook import tqdm
tqdm.pandas()

In [2]:
def get_character_names(title: str):
  
  cmcontinue_text = ""
  first_time = True
  
  character_list = []
  
  while cmcontinue_text or first_time: 
    first_time = False
  
    baseurl = "https://marvel.fandom.com/api.php?"
    action = "action=query&list=categorymembers"
    q_title = "cmtitle={}".format(urllib.parse.quote_plus(title.replace(" ", "_")))

    content = "prop=revisions&rvprop=content&rvslots=*"
    dataformat ="format=json"
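    # URL-encode the continuation token, since it may contain reserved characters
    cmcontinue = "cmlimit=max&cmcontinue={}".format(urllib.parse.quote_plus(cmcontinue_text))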

    query = "{}{}&{}&{}&{}&{}".format(baseurl, action, q_title, content, dataformat, cmcontinue)
    wikiresponse = urllib.request.urlopen(query)
    wikidata = wikiresponse.read()
    wikitext = wikidata.decode('utf-8')
    
    wiki_json = json.loads(wikitext)
    
    character_list += [character["title"] for character in wiki_json["query"]["categorymembers"]]

    if "continue" in wiki_json:
      cmcontinue_text = wiki_json["continue"]["cmcontinue"]
    else:
      cmcontinue_text = ""
      
  return character_list

## Characters
With the query above we can get all characters in the Marvel Universe. As the length of the list shows, there are a lot of characters! To keep the size of the network manageable, we have chosen to focus on one specific universe, Universe 616, which can be considered the main storyline of the Marvel Universe. From this universe alone we find almost 30,000 unique characters.

Using `get_character_names()` we find all the character names. We then fetch the description of each character to find their current enemy and allied teams.
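
The paging logic inside `get_character_names()` can be illustrated without network access. The sketch below is a simplified stand-in, not the notebook's function: `collect_category_members`, `fake_fetch` and `FAKE_PAGES` are hypothetical names, and the responses are made-up data that only mimic the `query`/`continue` layout the function reads.

```python
from typing import Callable, Dict, List


def collect_category_members(fetch_page: Callable[[str], Dict]) -> List[str]:
    """Follow `cmcontinue` tokens until the API stops returning one."""
    titles: List[str] = []
    cmcontinue = ""  # an empty token requests the first page
    while True:
        page = fetch_page(cmcontinue)
        titles.extend(member["title"] for member in page["query"]["categorymembers"])
        # The "continue" key is only present when more results are available
        cmcontinue = page.get("continue", {}).get("cmcontinue", "")
        if not cmcontinue:
            break
    return titles


# Made-up responses standing in for two pages of API results
FAKE_PAGES = {
    "": {
        "query": {"categorymembers": [{"title": "A (Earth-616)"},
                                      {"title": "B (Earth-616)"}]},
        "continue": {"cmcontinue": "page|2"},
    },
    "page|2": {
        "query": {"categorymembers": [{"title": "C (Earth-616)"}]},
    },
}


def fake_fetch(cmcontinue: str) -> Dict:
    """Hypothetical replacement for the HTTP request."""
    return FAKE_PAGES[cmcontinue]


names = collect_category_members(fake_fetch)
print(names)
```

Passing the fetch step in as a function keeps the loop testable offline; in the notebook the same role is played by the `urllib.request.urlopen` call.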

In [3]:
character_names = get_character_names("Category:Earth-616/Characters")
character_names

["'Spinner (Earth-616)",
 '01100010 01110010 01110101 01110100 01100101 (Earth-616)',
 '107 (Earth-616)',
 '11-Ball (Earth-616)',
 '115 (Legion Personality) (Earth-616)',
 '14 (Earth-616)',
 '181 (Legion Personality) (Earth-616)',
 '26 (Earth-616)',
 '27 (Earth-616)',
 '302 (Legion Personality) (Earth-616)',
 '6-Ball (Earth-616)',
 '6R (Earth-616)',
 '7-X9 (Earth-616)',
 '749 (Legion Personality) (Earth-616)',
 '762 (Legion Personality) (Earth-616)',
 '8-Ball (Hobgoblin) (Earth-616)',
 '8-Ball II (Earth-616)',
 '803 (Earth-616)',
 '88 (Earth-616)',
 '9-Ball (Earth-616)',
 '933 (Legion Personality) (Earth-616)',
 '99 (Earth-616)',
 'A Friend (Earth-616)',
 "A'di (Earth-616)",
 "A'ishah (Earth-616)",
 "A'kane (Earth-616)",
 "A'Kurru U'mbaya (Earth-616)",
 "A'Lars (Earth-616)",
 "A'Sai (Earth-616)",
 "A'yin (Earth-616)",
 'A-14 (Earth-616)',
 'A-Bomb (Warp World) (Earth-616)',
 'A. G. Bell (Earth-616)',
 'A. Summers (Earth-616)',
 "A.C. O'Connor (Earth-616)",
 'A.J. Patton (Earth-616)',
 

## Downloading character content
Because our data set is so big (29,998 distinct characters) and the Marvel wiki query functions are not the fastest, we have done some work to make the downloading faster.

First, we figured that instead of issuing one query per title (character name), we could pass up to 50 titles at a time through the `titles=` option.
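
The batching step might look roughly like the sketch below. The notebook relies on `utils.generate_chunks`, whose implementation is not shown here, so `chunk_names` and `build_titles_param` are hypothetical helpers written under the assumption that a batch holds at most 50 titles joined with `|`.

```python
import urllib.parse
from typing import List


def chunk_names(names: List[str], size: int = 50) -> List[List[str]]:
    """Split `names` into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [names[i:i + size] for i in range(0, len(names), size)]


def build_titles_param(batch: List[str]) -> str:
    """Join one batch into a single pipe-separated `titles=` parameter."""
    encoded = [urllib.parse.quote_plus(name.replace(" ", "_")) for name in batch]
    return "titles=" + "|".join(encoded)


# Made-up names, only used to show the batch sizes
example_names = [f"Character {i} (Earth-616)" for i in range(120)]
chunks = chunk_names(example_names)
print([len(chunk) for chunk in chunks])
print(build_titles_param(["A'Lars (Earth-616)", "A-14 (Earth-616)"]))
```

With a batch size of 50, roughly 30,000 names need about 600 requests instead of 30,000.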

The second optimization was multithreading. We first tried `async` functions, but saw the opposite of a benefit: the runtime increased. We hypothesize that either we made a mistake in the code (which we checked thoroughly) or the API responded so quickly that the asyncio overhead actually slowed the process down. Either way, using threads instead allowed us to go from a download time of 4 hours to 16 minutes!
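
The effect of threading on I/O-bound work can be sketched with a simulated download. Everything here is an assumption for illustration: `fake_download` is a hypothetical stand-in that only sleeps, and the timings say nothing about the real wiki API or the 4 hours versus 16 minutes measured above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fake_download(chunk_id: int, latency: float = 0.05) -> int:
    """Pretend to download one chunk by waiting for `latency` seconds."""
    time.sleep(latency)  # stands in for network wait time
    return chunk_id


def run_sequential(chunk_ids: List[int]) -> List[int]:
    return [fake_download(chunk_id) for chunk_id in chunk_ids]


def run_threaded(chunk_ids: List[int], max_workers: int = 16) -> List[int]:
    # Threads overlap the waiting time; ex.map keeps the input order
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(fake_download, chunk_ids))


chunk_ids = list(range(16))

start = time.perf_counter()
sequential_results = run_sequential(chunk_ids)
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
threaded_results = run_threaded(chunk_ids)
threaded_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

Because each worker spends most of its time waiting, the threads overlap those waits, which is the same reason a thread pool helps with the HTTP requests in this notebook.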

In [4]:
image  = re.compile(r"Image\s*=\s(.*?)\n")
gender = re.compile(r"Gender\s*=\s(.*?)\n")
links  = re.compile(r"\[\[(.*?)(?:\|.*?)?\]\]")

content_path = "../data/character_content/"


def get_character_content(names: list):
    # Query the wiki for a batch of names
    base_url = "https://marvel.fandom.com/api.php?"
    action = "action=query"
    title = f"titles={'|'.join([urllib.parse.quote_plus(name.replace(' ', '_')) for name in names])}"
    content = "prop=revisions&rvprop=content&rvslots=*"
    dataformat ="format=json"

    query = "{}{}&{}&{}&{}".format(base_url, action, content, title, dataformat)
    
    resp = urllib.request.urlopen(query)
    text = resp.read().decode("utf-8")
    data = json.loads(text)

    pages = data["query"]["pages"]
    for page_id, page in pages.items():
        page["page_id"] = page_id
        filename = utils.generate_filename(page["title"])
        with open(f"{content_path}{filename}.json", "w") as f:
            json.dump(page, f, indent=4)

def get_character_info(name: str, all_names_set: set):
    filename = utils.generate_filename(name)
    # Use a context manager so the file handle is always closed
    with open(f"{content_path}{filename}.json", "r") as f:
        data = json.load(f)

    title   = data["title"]
    content = data["revisions"][0]["slots"]["main"]["*"]
    pageid  = data["pageid"]
    
    # Image link, used for visualization in the online version
    # (None when the page has no image field)
    img = image.findall(content)
    img = img[0] if img else None
    
    # Gender(s)
    gen = gender.findall(content)
    
    # All square-bracket links in the description.
    # Keep only links that point to a known character in all_names_set
    # and that are not the current character: no self-loops!
    lin = links.findall(content)
    lin = [x for x in lin if x in all_names_set and x != title]

    return title, pageid, img, gen, lin

# Multithreaded solution for faster speed
def get_content(names: list, max_workers: int = 16) -> None:
    files = set(os.listdir(content_path))
    missing_names = list(
        filter(lambda x: f"{utils.generate_filename(x)}.json" not in files,
        names
        ))

    if len(missing_names) == 0:
        print("No missing names found 😊")
        return
        
    chunks = utils.generate_chunks(missing_names)
    print(f"Generated {len(chunks)} chunks!")
    with tqdm(total=len(chunks)) as pbar:
        with ThreadPoolExecutor(max_workers=max_workers) as ex:
            futures = [ex.submit(get_character_content, chunk) 
                       for chunk in chunks]
            for future in as_completed(futures):
                future.result()  # re-raise any exception from the worker thread
                pbar.update(1)
    print("Done downloading files!")

def get_character_infos(names: list, max_workers: int = 16) -> pd.DataFrame:
    all_names_set = set(names)
    infos = []
    with tqdm(total=len(names)) as pbar:
        with ThreadPoolExecutor(max_workers=max_workers) as ex:
            for result in ex.map(get_character_info, names, repeat(all_names_set)):
                infos.append(result)
                pbar.update(1)
    print("Done getting character information")
    print("Converting to pd.DataFrame")
    return pd.DataFrame(infos, columns=["title", "pageid", "imglink", "gender", "links"])
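
To see what the three regular expressions pick out, the sketch below applies them to a tiny piece of wikitext. The sample text, `sample_title` and `known_names` are made up for illustration; real pages are much longer and their fields may be laid out differently.

```python
import re

# Same patterns as in the cell above
image = re.compile(r"Image\s*=\s(.*?)\n")
gender = re.compile(r"Gender\s*=\s(.*?)\n")
links = re.compile(r"\[\[(.*?)(?:\|.*?)?\]\]")

# Made-up wikitext, only meant to mimic the shape of a character page
sample_title = "Peter Parker (Earth-616)"
sample_content = (
    "| Image = Peter_Parker.jpg\n"
    "| Gender = Male\n"
    "Ally of [[Anthony Stark (Earth-616)|Iron Man]], lives in [[New York City]], "
    "also known as [[Peter Parker (Earth-616)|Spider-Man]].\n"
)
# Hypothetical set of known character names
known_names = {"Peter Parker (Earth-616)", "Anthony Stark (Earth-616)"}

found_image = image.findall(sample_content)
found_gender = gender.findall(sample_content)
raw_links = links.findall(sample_content)

# Keep only links to known characters and drop the self-reference
kept_links = [x for x in raw_links if x in known_names and x != sample_title]

print(found_image, found_gender)
print(raw_links)
print(kept_links)
```

The link pattern captures the target before any `|`, so the display text ("Iron Man") is discarded and only the page title is compared against the known names.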

In [5]:
get_content(character_names)

Generated 601 chunks!


  0%|          | 0/601 [00:00<?, ?it/s]

Done downloading files!


In [6]:
df = get_character_infos(character_names)

  0%|          | 0/30014 [00:00<?, ?it/s]

Done getting character information
Converting to pd.DataFrame


In [7]:
df.to_csv("../data/marvel_characters.csv")