# Team Assembler

![header](../images/The_Marvel_Universe.png)

In this first notebook we extract the characters from several Marvel hero and villain teams to build the graph that is going to be used in the project.

In [None]:
import json
import urllib.parse
import urllib.request

import re

import pandas as pd
import numpy as np


from tqdm.notebook import tqdm

tqdm.pandas()

In [None]:
def get_json(title):
  baseurl = "https://marvel.fandom.com/api.php?"
  action = "action=query"
  title = "titles={}".format(urllib.parse.quote_plus(title.replace(" ", "_")))
   
  content = "prop=revisions&rvprop=content&rvslots=*"
  dataformat ="format=json"

  query = "{}{}&{}&{}&{}".format(baseurl, action, content, title, dataformat)
    
  wikiresponse = urllib.request.urlopen(query)
  wikidata = wikiresponse.read()
  wikitext = wikidata.decode('utf-8')
    
  return json.loads(wikitext)

def displayWiki(wiki):
    code = str(list(wiki["query"]["pages"].keys())[0])
    title = wiki["query"]["pages"][code]["title"]
    content = wiki["query"]["pages"][code]["revisions"][0]["slots"]["main"]["*"]
    return title, content

## Teams

Instead of getting every single character from Marvel, we are going to work with a small subset. Why? First of all, the Marvel wiki has more than 30,000 characters. Collecting and processing all that information would take a long time, and most of those characters are secondary ones that add little information. The second reason, and the deciding factor, is that there is no easy way to get the links to all the characters from the wiki.

In [None]:
displayWiki(get_json("Category:Characters"))[1]

In [None]:
displayWiki(get_json("Category:Earth-616/Characters"))[1]

As the previous cells show, querying the character category pages does not return links to, or information about, any character.

### Some background

For those of you who are not huge Marvel fans, some background will help make sense of the data.

Marvel is divided into many (many) universes, which together form the Marvel multiverse (it makes sense, right?). This gives Marvel writers some creative freedom. If someone wants to tell a story where Spider-Man is a cartoony pig, they can ([Spider-Ham](https://marvel.fandom.com/wiki/Peter_Porker_(Earth-8311))), in a different universe, so it does not collide with the main characters in other universes.

The main universe, where the most relevant events and the more canonical stories take place, is called *Earth-616*, and it is the one we are going to use for our analysis. It is where the best-known stories happen, and where most superheroes and supervillains live.

OK, but which teams did we select? We tried to pick the most famous ones, both hero and villain teams, so that we would get more characters from them, including the most famous ones.

In [None]:
teams = None
with open("../data/teams.txt", "r") as f:
  teams = f.read().split("\n")[:-1]
teams

All links are either `[[link]]` or `[[link|known_as]]`.

In [None]:
regex_links = r"\[\[(.*?)(?:|\|.*?)\]\]"
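
As a quick sanity check, the pattern can be tried on a small string. The sample text below is written by hand for illustration and is not taken from the wiki; the names in it are only placeholders for the two link shapes.

```python
import re

regex_links = r"\[\[(.*?)(?:|\|.*?)\]\]"

# Hypothetical wikitext fragment, written by hand for this example.
sample = "* [[Tony Stark (Earth-616)|Iron Man]]\n* [[Thor Odinson (Earth-616)]]"

# findall returns only the capture group: the link target, without the display name.
sample_links = re.findall(regex_links, sample)
print(sample_links)
```

In both link forms only the page title is kept, which is what we need later to query the wiki for each page.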

Every team has a similar structure:

```
{{Marvel Database:Team Template
| Title                   = 
| Image                   = 
| ImageSize               = 
| Name                    = 
| EditorialNames          = 
| Aliases                 = 
| Status                  = 
| Identity                = 
| Reality                 = 
| BaseOfOperations        = 
| Leaders                 = 
| CurrentMembers          = 
| FormerMembers           = 
| Allies                  = 
| Enemies                 = 
| Origin                  = 
| PlaceOfFormation        = 
| PlaceOfDissolution      = 
| Creators                = 
| First                   =
| Last                    = 
 
 ...
 ...
 ...
 }}
```

We want to obtain the characters that are leaders of a group, current members, former members, allies and enemies. Just in case they are useful too, we also collect the links that appear in the subsequent text.

That is why we split the page into groups of information and get the links from each one.
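
To make the idea concrete, here is a minimal sketch of the splitting step on a made-up, heavily shortened team page. The helper `links_between` and the sample page are assumptions introduced only for this illustration; the function used in the notebook is defined in the next cell.

```python
import re

regex_links = r"\[\[(.*?)(?:|\|.*?)\]\]"

# Hypothetical, heavily shortened team page; real pages carry many more fields.
sample_page = """{{Marvel Database:Team Template
| Leaders                 = [[Leader A (Earth-616)|Leader A]]
| CurrentMembers          = [[Member B (Earth-616)|Member B]], [[Member C (Earth-616)]]
| FormerMembers           = [[Member D (Earth-616)]]
| Allies                  = 
}}"""

def links_between(text, start_field, end_field):
    """Return link targets found between two consecutive template fields."""
    pattern = r"\| {}.*?\| {}".format(re.escape(start_field), re.escape(end_field))
    sections = re.findall(pattern, text, flags=re.DOTALL)
    # An empty list is returned when either field is missing from the page.
    return re.findall(regex_links, sections[0]) if sections else []

current = links_between(sample_page, "CurrentMembers", "FormerMembers")
former = links_between(sample_page, "FormerMembers", "Allies")
print(current, former)
```

The sketch relies on the fields appearing in the order shown in the template above; a page that reorders or omits a field simply yields an empty list here.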

In [None]:
def createDataFrame(teams):
  df = pd.DataFrame(columns=["team_name", "leaders", "current_members",
                             "former_members", "allies", "enemies",
                             "additional_links"]
                   )
  
  for team in tqdm(teams):
    content = displayWiki(get_json(team))[1]
    header = re.split(r"\| First", content)[0]
    web_content = re.split(r"\| Origin ", content)[1]

    name = re.sub(r" \(Earth-.*", "", team)
    
    leaders_raw = re.findall(r"Leaders.*?\| CurrentMembers", header, flags=re.DOTALL)[0]
    leaders = re.findall(regex_links, leaders_raw)

    current_member_raw = re.findall(r"\| CurrentMembers.*?\| FormerMembers", header, flags=re.DOTALL)[0]
    current_member = re.findall(regex_links, current_member_raw)

    former_member_raw = re.findall(r"\| FormerMembers.*?\| Allies", header, flags=re.DOTALL)[0]
    former_member = re.findall(regex_links, former_member_raw)

    allies_raw = re.findall(r"\| Allies.*?\| Enemies", header, flags=re.DOTALL)[0]
    allies = re.findall(regex_links, allies_raw)

    enemies_raw = re.findall(r"\| Enemies.*?\| Origin", header, flags=re.DOTALL)[0]
    enemies = re.findall(regex_links, enemies_raw)
    
    additional_links = re.findall(regex_links, web_content)
  
    with open("../data/teams/"+team.replace(" ", "_")+".txt", "w") as f:
      f.write(content)
  
    row = {
      "team_name"       : name,
      "leaders"         : leaders,
      "current_members" : current_member,
      "former_members"  : former_member,
      "allies"          : allies,
      "enemies"         : enemies,
      "additional_links": additional_links
    }
    
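    # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)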
  
  return df

In [None]:
marvel_df = createDataFrame(teams)
marvel_df.head()

## Characters

OK, now that we have each team with its characters, it's time to fetch each character. That would be straightforward if the data were clean. Because we collected every single link, some of them reference things that are not characters, such as other teams, races or concepts.

In [None]:
characters_df = pd.DataFrame(columns=["name", "teams"])

In [None]:
all_characters = list()

for i, row in marvel_df.iterrows():
  all_characters += [*row.leaders, *row.current_members, *row.former_members, *row.allies, *row.enemies]

characters_df = pd.DataFrame(list(set(all_characters)), columns=["name"])
characters_df.head()

Luckily, the characters have their own template, which looks like:
  
```
{{Marvel Database:Character Template
| Image                   = 
| Name                    = 
| CurrentAlias            = 
| Aliases                 = 
| Affiliation             = 
| Relatives               = 
| MaritalStatus           = 
| CharRef                 = 
| Gender                  = 
| Height                  = 
| Weight                  = 
| Eyes                    = 
| Hair                    = 
| UnusualFeatures         = 
  ...
  ...
}}
```
This looks similar to the teams template, but the thing to look at is the keyword `CharRef`. This keyword only appears in character pages (as do `Gender`, `Height`, `Weight`, `Eyes`, `Hair` and `UnusualFeatures`).

Knowing this, we can decide whether a link points to a character or to something else. At the same time, because we are already looking at the wiki content, we can get the links each character references.
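
A minimal sketch of that check, using two made-up page snippets. The helper name `looks_like_character` is an assumption for this example only.

```python
import re

def looks_like_character(content):
    """True when the wikitext contains the character-only `CharRef` field."""
    return re.search(r"\| CharRef", content) is not None

# Both snippets are made-up, shortened examples of page content.
character_page = "{{Marvel Database:Character Template\n| Name = Example\n| CharRef = \n| Gender = \n}}"
team_page = "{{Marvel Database:Team Template\n| Name = Example Team\n| Leaders = \n}}"

print(looks_like_character(character_page), looks_like_character(team_page))
```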

In [None]:
def getContent(row):
  regex = r'\/|\"|\:|\*| '
  
  new_name = re.sub(regex, "_", row["name"])
  
  try:
    content = displayWiki(get_json(row["name"]))[1]

    isCharacter = len(re.findall(r"\| CharRef", content)) > 0
    links = list()
    
    with open("../data/characters/" + new_name + ".txt", "w") as f:
      f.write(content)

    if isCharacter:
      links = re.findall(regex_links, content)
      links = [re.sub(regex, "_", x) for x in links]
      
  except KeyError:    
    links = list()
    isCharacter = False
    
    
  return pd.Series([new_name, isCharacter, links])

In [None]:
characters_df[["name", "is_character", "links"]] = characters_df.progress_apply(getContent, axis=1)

characters_df

In [None]:
characters_df = characters_df[characters_df["is_character"]].drop(columns=["is_character"])
# reset_index returns a new DataFrame, so the result has to be assigned back
characters_df = characters_df.reset_index(drop=True)

characters_df

In [None]:
def replace_names(row):
  characters = [row.leaders, row.current_members, row.former_members,
                row.allies, row.enemies, row.additional_links]
  
  regex = r'\/|\"|\:|\*| '
  
  new_characters = list()
  
  for i, character_list in enumerate(characters):
    new_characters.append(list())
    for character in character_list:
      new_characters[i].append(re.sub(regex, "_", character))
  
  return pd.Series(new_characters)

## Cleaning

Once we have the links, we need to remove from both datasets those links that do not reference any character.
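
The filtering itself is a membership test against the names of the confirmed characters. The sketch below uses a set intersection on made-up names; this is a swapped-in variant of the list comprehensions used in the following cells, and it should give the same links apart from their order.

```python
# Hypothetical names, already normalised the same way as in `getContent`.
known_characters = {"Character_A_(Earth-616)", "Character_B_(Earth-616)"}
raw_links = [
    "Character_A_(Earth-616)",
    "Some_Team_(Earth-616)",
    "Character_A_(Earth-616)",
    "Character_B_(Earth-616)",
]

# Keep links to known characters and drop duplicates; sorting makes the output stable.
clean_links = sorted(set(raw_links) & known_characters)
print(clean_links)
```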

In [None]:
def get_real_links(row):
  links = [x for x in row.links if x in characters_df["name"].values]
  links = list(set(links))
  
  return pd.Series([links, len(links)])

In [None]:
characters_df[["links", "number_links"]] = characters_df.progress_apply(get_real_links, axis=1)

characters_df

In [None]:
marvel_df[["leaders", "current_members",
           "former_members", "allies", 
           "enemies", "additional_links"]] = marvel_df.progress_apply(replace_names, axis=1)

In [None]:
def clear_dataset(row):

  leaders = [x for x in row.leaders if x in characters_df["name"].values]
  current_members = [x for x in row.current_members if x in characters_df["name"].values]
  former_members = [x for x in row.former_members if x in characters_df["name"].values]
  allies = [x for x in row.allies if x in characters_df["name"].values]
  enemies = [x for x in row.enemies if x in characters_df["name"].values]
  additional_links = [x for x in row.additional_links if x in characters_df["name"].values]
  
  return pd.Series([leaders, current_members, former_members, allies, enemies, additional_links])

In [None]:
marvel_df[["leaders", "current_members",
           "former_members", "allies",
           "enemies", "additional_links"]] = marvel_df.progress_apply(clear_dataset, axis=1)

marvel_df

## Mixing the information from both datasets

Now we want to find out which teams each character belongs to, is an ally of, or is an enemy of, so that we can build more meaningful relations later.
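
In other words, we invert the team table: instead of asking which characters a team lists under a role, we ask which teams list a given character under that role. A small sketch with a made-up table (the dictionary `teams_example` and the helper `teams_for` are hypothetical and stand in for `marvel_df`):

```python
# Hypothetical team table: team name -> role -> normalised character names.
teams_example = {
    "Team_X": {"leaders": ["Character_A"], "enemies": ["Character_B"]},
    "Team_Y": {"leaders": [], "enemies": ["Character_A"]},
}

def teams_for(character, role):
    """List the teams in which `character` appears under `role`."""
    return [team for team, roles in teams_example.items() if character in roles[role]]

print(teams_for("Character_A", "leaders"), teams_for("Character_A", "enemies"))
```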

In [None]:
def belongs_to(team, character_name, column):
  members = marvel_df.loc[marvel_df['team_name'] == team][column]
  return character_name in members.values[0]

def get_team_info(row):
  
  leader = []
  member = []
  ally = []
  enemy = []
  
  for _, team_row in marvel_df.iterrows():
    if belongs_to(team_row["team_name"], row["name"], "leaders"):
      leader.append(team_row["team_name"])
    if belongs_to(team_row["team_name"], row["name"], "current_members"):
      member.append(team_row["team_name"])
    if belongs_to(team_row["team_name"], row["name"], "former_members"):
      member.append(team_row["team_name"])
    if belongs_to(team_row["team_name"], row["name"], "allies"):
      ally.append(team_row["team_name"])
    if belongs_to(team_row["team_name"], row["name"], "enemies"):
      enemy.append(team_row["team_name"])
  
  return pd.Series([leader, member, ally, enemy])

In [None]:
characters_df[["leader", "member", "ally", "enemy"]] = characters_df.progress_apply(get_team_info, axis=1)

In [None]:
characters_df.head()

## Save the datasets

Now that the entries are reasonably clean, we can save them.

In [None]:
marvel_df.to_csv("../data/marvel_teams.csv", index=False)
characters_df.to_csv("../data/marvel_characters.csv", index=False)

In [None]:
import os
import time

In [None]:
total = 29998  # number of files expected in the folder

# Poll the folder every 2 seconds and stop once the expected count is reached.
while True:
    cur = len(os.listdir("../data/character_content"))
    print(cur, end="->")
    if cur >= total:
        break

    time.sleep(2)

        