# Team Assembler

![header](../images/The_Marvel_Universe.png)

In this first notebook we extract the characters from a selection of Marvel hero and villain teams to build the graph that is used throughout the project.

In [1]:
import json
import urllib.parse
import urllib.request

import re

import pandas as pd
import numpy as np


from tqdm.notebook import tqdm

tqdm.pandas()

In [2]:
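def get_json(title):
  """Query the Marvel Fandom MediaWiki API and return the parsed JSON for `title`."""
  baseurl = "https://marvel.fandom.com/api.php?"
  action = "action=query"
  # Page titles use underscores instead of spaces and must be URL-escaped
  title = "titles={}".format(urllib.parse.quote_plus(title.replace(" ", "_")))
  # Ask for the content of the latest revision, in every slot
  content = "prop=revisions&rvprop=content&rvslots=*"
  dataformat = "format=json"

  query = "{}{}&{}&{}&{}".format(baseurl, action, content, title, dataformat)

  # Close the connection as soon as the response has been read
  with urllib.request.urlopen(query) as wikiresponse:
    wikitext = wikiresponse.read().decode("utf-8")

  return json.loads(wikitext)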

def displayWiki(wiki):
    # The response holds a single page, keyed by its page id
    pages = wiki["query"]["pages"]
    code = next(iter(pages))
    title = pages[code]["title"]
    # Raises KeyError when the page has no "revisions" entry (e.g. a missing page)
    content = pages[code]["revisions"][0]["slots"]["main"]["*"]
    return title, content
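
The query string that `get_json` assembles by hand can also be built with `urllib.parse.urlencode`, which takes care of the escaping. The following is a minimal offline sketch; the helper name `build_query_url` is hypothetical and is not part of the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://marvel.fandom.com/api.php"

def build_query_url(title):
    """Return the API URL that requests the raw wikitext of `title` (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "*",
        # Page titles use underscores instead of spaces
        "titles": title.replace(" ", "_"),
        "format": "json",
    }
    # urlencode escapes every value, including the parentheses in the title
    return BASE_URL + "?" + urlencode(params)

url = build_query_url("Avengers (Earth-616)")
print(url)
```

No request is sent here, so the sketch can be run without network access to inspect the URL before using it.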

## Teams

Instead of getting every single Marvel character, we are going to work with a small subset. Why? First of all, the Marvel wiki has more than 30,000 characters. Collecting and processing all that information would take a long time, and most of those characters are secondary ones that add little information. The second reason, and the deciding factor, is that there is no easy way to get the links to all the characters from the wiki.

In [3]:
displayWiki(get_json("Category:Characters"))[1]

'{{MessageBox\n| Message = A comprehensive list of every character that can be found on the {{SITENAME}}.<br /> With over 70,000 characters in the [[Multiverse|Marvel Multiverse]], this is the most complete public listing in existence.<br /><br />{{CategoryTOC}}<br />\n}}<center>[[File:The Marvel Universe.png|630px|Marvel Characters]]</center>\n\n[[zh:Category:角色]]\n[[Category:Marvel Database]]'

In [4]:
displayWiki(get_json("Category:Earth-616/Characters"))[1]

'{{Reality Category}}'

As the previous cells show, querying the character category pages does not return any link to, or information about, individual characters.

### Some background

For those of you who are not huge Marvel nerds, some background will help you understand the data a bit better.

Marvel is divided into many (many) universes, which together form the Marvel multiverse (it makes sense, right?). This gives any Marvel writer some creative freedom. If someone wants to write a story where Spider-Man is a cartoony pig, well, they can do it ([Spider-Ham](https://marvel.fandom.com/wiki/Peter_Porker_(Earth-8311))) in a different universe, so it does not collide with the main characters in other universes.

The main universe, where the most relevant and most canonical stories occur, is called *Earth-616*, and it is the one we are going to use for our analysis. It is where the best-known stories happen, and where most superheroes and supervillains are found.

OK, but which teams did we select? We tried to pick the most famous ones, of both heroes and villains, so that we could get more characters from them, including the best-known ones.

In [5]:
teams = None
with open("../data/teams.txt", "r") as f:
  # splitlines() keeps the last team even if the file has no trailing newline
  teams = f.read().splitlines()
teams

['Avengers (Earth-616)',
 'X-Men (Earth-616)',
 'Illuminati (Earth-616)',
 'Inhuman Royal Guard (Earth-616)',
 'Guardians of the Galaxy (Earth-616)',
 'Avengers (1,000,000 BC) (Earth-616)',
 'Sinister Six (Earth-616)',
 'Thunderbolts (Earth-616)',
 'Elders of the Universe (Earth-616)',
 'Young Avengers (Earth-616)',
 'Dark Avengers (Earth-616)',
 'Fantastic Four (Earth-616)',
 'Strategic Homeland Intervention, Enforcement and Logistics Division (Earth-616)',
 'Defenders (Earth-616)',
 'Hydra (Earth-616)',
 'Black Order (Earth-616)',
 'Cabal (Dark Illuminati) (Earth-616)',
 'Hand (Earth-616)',
 'Heralds of Galactus (Earth-616)',
 'Winter Guard (Earth-616)']

All links are either `[[link]]` or `[[link|known_as]]`.

In [6]:
regex_links = r"\[\[(.*?)(?:|\|.*?)\]\]"
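
As a quick check of the pattern, the sketch below applies it to a made-up sentence (the sample text is an assumption, written only for illustration). Only the link target is captured; the display text after `|` is discarded.

```python
import re

regex_links = r"\[\[(.*?)(?:|\|.*?)\]\]"

# Made-up wikitext containing one piped link and one plain link
sample = "Led by [[Steven Rogers (Earth-616)|Captain America]] and based in [[Avengers Mansion]]."

links = re.findall(regex_links, sample)
print(links)  # ['Steven Rogers (Earth-616)', 'Avengers Mansion']
```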

Every team has a similar structure:

```
{{Marvel Database:Team Template
| Title                   = 
| Image                   = 
| ImageSize               = 
| Name                    = 
| EditorialNames          = 
| Aliases                 = 
| Status                  = 
| Identity                = 
| Reality                 = 
| BaseOfOperations        = 
| Leaders                 = 
| CurrentMembers          = 
| FormerMembers           = 
| Allies                  = 
| Enemies                 = 
| Origin                  = 
| PlaceOfFormation        = 
| PlaceOfDissolution      = 
| Creators                = 
| First                   =
| Last                    = 
 
 ...
 ...
 ...
 }}
```

We want to obtain the characters that are leaders of a group, current members, former members, allies and enemies. In case they turn out to be useful, we also collect the links that appear in the text that follows the template fields.

That is why we split each page into these groups of information and extract the links from each one.
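
The idea of cutting the page between two consecutive template keys can be sketched on a tiny example. The page text below is a heavily shortened, made-up sample (not real membership data), and `links_between` is a hypothetical helper that is not part of the notebook.

```python
import re

regex_links = r"\[\[(.*?)(?:|\|.*?)\]\]"

# Made-up, shortened team page used only for illustration
sample_page = """{{Marvel Database:Team Template
| Leaders                 = [[Anthony Stark (Earth-616)|Iron Man]]
| CurrentMembers          = [[Thor Odinson (Earth-616)|Thor]], [[Jennifer Walters (Earth-616)|She-Hulk]]
| FormerMembers           = [[Henry Pym (Earth-616)|Ant-Man]]
| Allies                  = 
}}"""

def links_between(text, start_key, end_key):
    """Return the link targets found between two template keys (hypothetical helper)."""
    pattern = r"\| {}.*?\| {}".format(re.escape(start_key), re.escape(end_key))
    # DOTALL lets the section span several lines
    match = re.search(pattern, text, flags=re.DOTALL)
    return re.findall(regex_links, match.group(0)) if match else []

print(links_between(sample_page, "Leaders", "CurrentMembers"))
print(links_between(sample_page, "CurrentMembers", "FormerMembers"))
```

Returning an empty list when a section is missing is a design choice of this sketch; indexing the result of `re.findall` with `[0]` would raise an `IndexError` in that case.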

In [7]:
def createDataFrame(teams):
  df = pd.DataFrame(columns=["team_name", "leaders", "current_members",
                             "former_members", "allies", "enemies",
                             "additional_links"]
                   )
  
  for team in tqdm(teams):
    content = displayWiki(get_json(team))[1]
    header = re.split(r"\| First", content)[0]
    web_content = re.split(r"\| Origin ", content)[1]

    # Raw string avoids the invalid "\(" escape warning
    name = re.sub(r" \(Earth-.*", "", team)
    
    leaders_raw = re.findall(r"Leaders.*?\| CurrentMembers", header, flags=re.DOTALL)[0]
    leaders = re.findall(regex_links, leaders_raw)

    current_member_raw = re.findall(r"\| CurrentMembers.*?\| FormerMembers", header, flags=re.DOTALL)[0]
    current_member = re.findall(regex_links, current_member_raw)

    former_member_raw = re.findall(r"\| FormerMembers.*?\| Allies", header, flags=re.DOTALL)[0]
    former_member = re.findall(regex_links, former_member_raw)

    allies_raw = re.findall(r"\| Allies.*?\| Enemies", header, flags=re.DOTALL)[0]
    allies = re.findall(regex_links, allies_raw)

    enemies_raw = re.findall(r"\| Enemies.*?\| Origin", header, flags=re.DOTALL)[0]
    enemies = re.findall(regex_links, enemies_raw)
    
    additional_links = re.findall(regex_links, web_content)
  
    with open("../data/teams/"+team.replace(" ", "_")+".txt", "w") as f:
      f.write(content)
  
    row = {
      "team_name"       : name,
      "leaders"         : leaders,
      "current_members" : current_member,
      "former_members"  : former_member,
      "allies"          : allies,
      "enemies"         : enemies,
      "additional_links": additional_links
    }
    
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
  
  return df

In [8]:
marvel_df = createDataFrame(teams)
marvel_df.head()

  0%|          | 0/20 [00:00<?, ?it/s]

| | team_name | leaders | current_members | former_members | allies | enemies | additional_links |
|---|---|---|---|---|---|---|---|
| 0 | Avengers | [T'Challa (Earth-616), Dane Whitman (Earth-616... | [List of Avengers members, T'Challa (Earth-616... | [Abyss (Ex Nihilo's) (Earth-616), Aleta Ogord ... | [Shuri (Earth-616), Alpha Flight (Earth-616), ... | [Ares (Earth-616), Attuma (Earth-616), Abner J... | [Loki Laufeyson (Earth-616), Henry Pym (Earth-... |
| 1 | X-Men | [Scott Summers (Earth-616), Jean Grey (Earth-6... | [Scott Summers (Earth-616), Jean Grey (Earth-6... | [Thomas Jones (Earth-616), Warren Worthington ... | [Alpha Flight (Earth-616), Alpha Flight (Space... | [Acolytes (Earth-616), Adjunct (Earth-616), Ad... | [Homo Superior, Xavier's School for Gifted You... |
| 2 | Illuminati | [] | [] | [Blackagar Boltagon (Skrull) (Earth-616), Stev... | [Avengers (Earth-616), Roberto Da Costa (Earth... | [Thanos (Earth-616), Deviant Skrulls, Parker R... | [Kree-Skrull War, Wakanda, Maximus Boltagon (E... |
| 3 | Inhuman Royal Guard | [Frank McGee (Earth-616), Medusalith Amaquelin... | [Alaris (Earth-616), Arvak (Earth-616), Asmode... | [Kirren (Earth-616), Leonus (Earth-616), Medus... | [Avengers (Earth-616)] | [Unspoken (Earth-616), Lash (Earth-616)] | [Inhumans (Inhomo supremis), Attilan, House of... |
| 4 | Guardians of the Galaxy | [Richard Rider (Earth-616), Rocket Raccoon (Ea... | [Phyla-Vell (Earth-18897), Arthur Douglas (Clo... | [Aldrif Odinsdottir (Earth-616), Scott Lang (E... | [Avengers (Earth-616), Camille Benally (Earth-... | [Annihilus (Earth-616), Badoon, Black Order (E... | [Hala (Planet), Richard Rider (Earth-616), Pha... |


In [9]:
marvel_df.to_csv("../data/marvel_teams.csv")
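
One thing to keep in mind when reloading this file: `to_csv` writes each list cell as its text representation, so `read_csv` returns strings rather than lists. The sketch below shows one possible way to restore them with `ast.literal_eval`. It uses an in-memory buffer and a made-up one-row frame instead of the real file, and the column selection is an assumption.

```python
import ast
import io

import pandas as pd

# Made-up one-row frame with a list-valued column, standing in for marvel_df
df = pd.DataFrame([{"team_name": "Example Team",
                    "leaders": ["A (Earth-616)", "B (Earth-616)"]}])

# Write to an in-memory buffer instead of ../data/marvel_teams.csv
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# Parse the text representation of each list back into a Python list
list_columns = ["leaders"]
restored = pd.read_csv(buffer, index_col=0,
                       converters={col: ast.literal_eval for col in list_columns})

print(restored.loc[0, "leaders"])
```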

## Characters

OK, now that we have each team with its characters, it's time to fetch each character. That would be straightforward if the data were clean. Because we collected every single link, some of them reference things that are not characters, such as other teams, races or concepts.

In [10]:
characters_df = pd.DataFrame(columns=["name", "teams"])

In [11]:
all_characters = list()

for i, row in marvel_df.iterrows():
  all_characters += [*row.leaders, *row.current_members, *row.former_members, *row.allies, *row.enemies]

characters_df = pd.DataFrame(list(set(all_characters)), columns=["name"])
characters_df.head()

| | name |
|---|---|
| 0 | Ebenezer Laughton (Earth-616) |
| 1 | Alisande Morales (Earth-616) |
| 2 | Zodiac (LMD) (Earth-616) |
| 3 | Dmitri Petrovich (Earth-616) |
| 4 | Strategic Homeland Intervention, Enforcement a... |


Luckily, characters have their own template, which looks like this:
  
```
{{Marvel Database:Character Template
| Image                   = 
| Name                    = 
| CurrentAlias            = 
| Aliases                 = 
| Affiliation             = 
| Relatives               = 
| MaritalStatus           = 
| CharRef                 = 
| Gender                  = 
| Height                  = 
| Weight                  = 
| Eyes                    = 
| Hair                    = 
| UnusualFeatures         = 
  ...
  ...
}}
```
OK, this looks similar to the teams template, but the thing to look at is the keyword `CharRef`. This keyword is unique to characters (as are `Gender`, `Height`, `Weight`, `Eyes`, `Hair` and `UnusualFeatures`).

Knowing this, we can decide whether a link points to a character or to something else. At the same time, because we are already looking at the wiki content, we can collect the links each page references.
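
The check itself can be sketched in a couple of lines. The two page snippets below are made up for illustration, and `looks_like_character` is a hypothetical helper name.

```python
import re

def looks_like_character(wikitext):
    """Return True when the page contains the character-only `CharRef` field."""
    return re.search(r"\| CharRef", wikitext) is not None

# Made-up, shortened page contents
character_page = "{{Marvel Database:Character Template\n| Name = Example\n| CharRef = \n| Gender = Male\n}}"
team_page = "{{Marvel Database:Team Template\n| Name = Example Team\n| Leaders = \n}}"

print(looks_like_character(character_page))  # True
print(looks_like_character(team_page))       # False
```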

In [12]:
def getContent(row):
  try:
    content = displayWiki(get_json(row["name"]))[1]

    isCharacter = len(re.findall(r"\| CharRef", content)) > 0
    links = list()
    
    with open("../data/characters/"+row["name"].replace(" ", "_").replace("/", "_") + ".txt", "w") as f:
      f.write(content)

    if isCharacter:
      links = re.findall(regex_links, content)
  except KeyError:    
    links = list()
    isCharacter = False
    
    
  return pd.Series([isCharacter, links])

In [13]:
characters_df[["is_character", "links"]] = characters_df.progress_apply(getContent, axis=1)

characters_df

  0%|          | 0/1737 [00:00<?, ?it/s]

| | name | is_character | links |
|---|---|---|---|
| 0 | Ebenezer Laughton (Earth-616) | True | [Ebenezer Laughton, Scarecrow, Green Goblin, H... |
| 1 | Alisande Morales (Earth-616) | False | [] |
| 2 | Zodiac (LMD) (Earth-616) | False | [] |
| 3 | Dmitri Petrovich (Earth-616) | True | [:Category:Bald, Category:Bald, File:Dmitri Pe... |
| 4 | Strategic Homeland Intervention, Enforcement a... | False | [] |
| ... | ... | ... | ... |
| 1732 | Ytitnedion (Earth-616) | True | [Tunnelworld, Tunnelworld, Defenders (Earth-61... |
| 1733 | Kirren (Earth-616) | True | [Inhomo supremis, Inhuman Royal Guard (Earth-6... |
| 1734 | Spite (Earth-616) | True | [Cyttorak (Earth-616), Dweller-in-Darkness (Ea... |
| 1735 | Wakers (Earth-616) | False | [] |


In [14]:
characters_df = characters_df[characters_df["is_character"]].drop(columns=["is_character"])
# The result is not assigned, so the original row labels are kept
characters_df.reset_index(drop=True)

characters_df.to_csv("../data/marvel_characters.csv")

characters_df

| | name | links |
|---|---|---|
| 0 | Ebenezer Laughton (Earth-616) | [Ebenezer Laughton, Scarecrow, Green Goblin, H... |
| 3 | Dmitri Petrovich (Earth-616) | [:Category:Bald, Category:Bald, File:Dmitri Pe... |
| 5 | Silvio Manfredi (Earth-616) | [Silvermane, Supreme Hydra, Supreme Hydra, Cat... |
| 6 | Maria de Guadalupe Santiago (Earth-616) | [Maria de Guadalupe Santiago, Silverclaw, Jaim... |
| 7 | Frances Barrison (Earth-616) | [Frances Barrison, Shriek, Cletus Kasady (Eart... |
| ... | ... | ... |
| 1730 | Bonita Juarez (Earth-616) | [Bonita Juarez, Firebird, Bonita Juarez (Earth... |
| 1732 | Ytitnedion (Earth-616) | [Tunnelworld, Tunnelworld, Defenders (Earth-61... |
| 1733 | Kirren (Earth-616) | [Inhomo supremis, Inhuman Royal Guard (Earth-6... |
| 1734 | Spite (Earth-616) | [Cyttorak (Earth-616), Dweller-in-Darkness (Ea... |
