# Saving Universe 616

This project analyzes a graph built from the characters and teams of [Marvel Earth-616](https://marvel.fandom.com/wiki/Earth-616). As the central Marvel universe, it has a large number of characters and teams (30021 characters).

![header](../images/marvel616.webp)

The aim is to offer some understanding of the relationships inside this vast universe and try to get meaningful information from it.

The data is obtained from the [Marvel Database wiki](https://marvel.fandom.com/wiki/Marvel_Database).


## Motivation

The Marvel multiverse comprises all the Marvel comics, animated series and films that use Marvel characters. As you can imagine, it is a pretty huge multiverse. Thus, it is divided into universes, each identified by the word `Earth` followed by a number that works as a universe ID. For example, `Earth-616` is the main Marvel universe, where the most famous Marvel events happened, such as Civil War, the Infinity Saga or House of M.

We focus on this universe, as it is the one with the most characters and the most interesting stories.

The motivation of the work is to analyze and get information from the characters in Earth-616 that belong to at least one team.

This restriction is due to the sheer number of characters, and it is an attempt to reduce the size of our network.


## Basic stats

At first, we worked with just a few characters, because we had to manually select a couple of teams, and we only got the characters that belonged to those teams. After some more research, we were able to get every single team and organization, but we had to do some more complex queries.

We had to specify that `list` is `categorymembers`, which returns all the elements that belong to a category. In this case the category is `Category:Earth-616/Teams` or `Category:Earth-616/Organizations`, because we only want the teams that belong to universe 616, which is the main Marvel universe. That way we can get all the teams. Luckily, there is another very useful category, `Category:TEAM/Members`, which returns a list of all the members of a single team. That way we can get all the teams and all the characters that belong to at least one team from the Marvel 616 universe.

There is just one small problem. The maximum number of elements that `categorymembers` returns is 500, and there are 3054 teams and 19148 elements. To solve this there is the `cmcontinue` parameter, which allows us to get the next 500 elements. Its value is returned by the query, so to get every element of the list we need a `while` loop that runs until no `cmcontinue` value is returned, which means there are no more values to retrieve.
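The continuation logic can be sketched independently of the network. This is a minimal illustration, assuming a hypothetical `fetch_page` callable that stands in for the real HTTP request and returns a response shaped like the API's JSON (`query.categorymembers` plus an optional `continue.cmcontinue`). The fake pages below are made up for the example.

```python
def collect_category_members(fetch_page):
    """Follow `cmcontinue` codes until the API stops returning one.

    `fetch_page` is a hypothetical callable: it takes the current
    continuation code ("" for the first request) and returns the parsed JSON.
    """
    titles = []
    cmcontinue = ""
    while True:
        response = fetch_page(cmcontinue)
        titles += [m["title"] for m in response["query"]["categorymembers"]]
        # A missing "continue" key means this was the last batch
        if "continue" not in response:
            break
        cmcontinue = response["continue"]["cmcontinue"]
    return titles


# Fake three-page response, only to illustrate the loop
_fake_pages = {
    "": {"query": {"categorymembers": [{"title": "A"}, {"title": "B"}]},
         "continue": {"cmcontinue": "page2"}},
    "page2": {"query": {"categorymembers": [{"title": "C"}]},
              "continue": {"cmcontinue": "page3"}},
    "page3": {"query": {"categorymembers": [{"title": "D"}]}},
}

all_titles = collect_category_members(lambda code: _fake_pages[code])
print(all_titles)
```

Keeping the request behind a callable makes the loop easy to check without touching the wiki.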

## Imports

What would a project be without some imports ;).

In [None]:
# For API queries
import urllib.request
import urllib.parse
import json

# For fancy looking loops
from tqdm.notebook import tqdm

# Data manipulation
import pandas as pd

# For transforming to literal value instead of string ("[a, b, c]" => ["a", "b", "c"])
import ast

# For os operations (duh)
import os

# Small file that have useful functions
import utils

# For multithreading
from concurrent.futures import ThreadPoolExecutor, as_completed

# For regex operations
import re

# For sentiment analysis
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# For plot visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For wordclouds
from wordcloud import WordCloud

# For natural language processing
import nltk

# For graph building
import networkx as nx

# For numeric transformations and operations
import numpy as np

# For calculating the powerlaw
import powerlaw

# For more iterating tools
import itertools

# For the fa2 algorithm
from fa2 import ForceAtlas2

# For randomness
import random

In [None]:
# So tqdm works with pandas
tqdm.pandas()

# Cool default plots
sns.set()

## Teams

We will start by getting and analyzing information from the teams. We use `Category:Earth-616/Teams` from the wiki, as it is where the teams are. Unfortunately, it is not the only place where that information is stored. We realized that some superhero teams, such as the **Avengers**, are listed not as a team but as an organization, so we also need to use `Category:Earth-616/Organizations`.

The function is quite lengthy and maybe a bit convoluted, so we try to explain it as clearly as possible (note that all the steps are repeated for `Teams` and `Organizations`):

The query we do is something like:

```
https://marvel.fandom.com/api.php?action=query&list=categorymembers&cmtitle=Category:Earth-616/GROUP&prop=revisions&rvprop=content&rvslots=*&format=json&cmlimit=max&cmcontinue=CONTINUE_CODE
```

We know it is very long, but it has some fundamental values: `list=categorymembers`, which returns all the elements that belong to a category (in this case, `Teams` or `Organizations`), and `cmcontinue=CONTINUE_CODE`, which allows us to loop over multiple queries, as all the teams do not fit in the 500-element limit of a single query. Because we do not know in advance how many queries are needed, a `while` loop is used.
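As an illustration of how such a URL can be assembled, here is a minimal sketch that uses `urllib.parse.urlencode` instead of string formatting. The helper name `build_category_query` is hypothetical, and the sketch leaves out the `prop=revisions` parameters on the assumption that they are not needed just to list titles.

```python
from urllib.parse import urlencode

API_URL = "https://marvel.fandom.com/api.php"

def build_category_query(group, cmcontinue=""):
    """Build a categorymembers query URL for Category:Earth-616/<group>."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:Earth-616/{group}",
        "cmlimit": "max",
        "format": "json",
    }
    # Only send cmcontinue when there is a code to continue from
    if cmcontinue:
        params["cmcontinue"] = cmcontinue
    return f"{API_URL}?{urlencode(params)}"

teams_url = build_category_query("Teams")
print(teams_url)
```

`urlencode` percent-encodes special characters such as `:`, `/` and `|`, so continuation codes can be passed through unchanged.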

In [None]:
def get_teams():
  # Teams and organizations live in two separate wiki categories
  categories = ["Category:Earth-616/Teams", "Category:Earth-616/Organizations"]

  team_list = []

  for category in categories:
    cmcontinue_text = ""
    first_time = True

    # Keep querying until the API stops returning a continuation code
    while cmcontinue_text or first_time:

      first_time = False

      baseurl = "https://marvel.fandom.com/api.php?"
      action = "action=query&list=categorymembers"
      q_title = "cmtitle={}".format(category)

      content = "prop=revisions&rvprop=content&rvslots=*"
      dataformat = "format=json"
      cmcontinue = "cmlimit=max&cmcontinue={}".format(cmcontinue_text)

      query = "{}{}&{}&{}&{}&{}".format(baseurl, action, q_title, content, dataformat, cmcontinue)
      wikiresponse = urllib.request.urlopen(query)
      wikidata = wikiresponse.read()
      wikitext = wikidata.decode('utf-8')

      wiki_json = json.loads(wikitext)

      # Skip sub-categories and keep only the actual team pages
      team_list += [team["title"] for team in wiki_json["query"]["categorymembers"]
                    if not team["title"].startswith("Category:")]

      if "continue" in wiki_json:
        cmcontinue_text = wiki_json["continue"]["cmcontinue"]
      else:
        cmcontinue_text = ""

  return team_list

With this function, we can get all the teams.

In [None]:
teams = list(set(get_teams()))
print(f"There are {len(teams)} teams.")

### Getting members

Now that we have all our teams, we need to get the members of each team.  Luckily, there is a particular page where we can get that: `Category:TEAM_NAME/Members`.

The query is very similar to the previous one: 

```
https://marvel.fandom.com/api.php?action=query&list=categorymembers&cmtitle=Category:TEAM_NAME/Members&prop=revisions&rvprop=content&rvslots=*&format=json&cmlimit=max&cmcontinue=CONTINUE_CODE
```

It works pretty much the same as the previous one.

The only remark is that, for some reason, some teams have 'members' that are actually references to files, so we skip any title that starts with the prefix `File:`.

In [None]:
def getMembers(team):  
  cmcontinue_text = ""
  first_time = True
  
  member_list = []
  
  while cmcontinue_text or first_time: 
  
    first_time = False
  
    baseurl = "https://marvel.fandom.com/api.php?"
    action = "action=query&list=categorymembers"
    q_title = "cmtitle=Category:{}/Members".format(urllib.parse.quote_plus(team.replace(" ", "_")))

    content = "prop=revisions&rvprop=content&rvslots=*"
    dataformat ="format=json"
    cmcontinue = "cmlimit=max&cmcontinue={}".format(cmcontinue_text)

    query = "{}{}&{}&{}&{}&{}".format(baseurl, action, q_title, content, dataformat, cmcontinue)
    wikiresponse = urllib.request.urlopen(query)
    wikidata = wikiresponse.read()
    wikitext = wikidata.decode('utf-8')
    
    wiki_json = json.loads(wikitext)
    
    member_list += [member["title"]
                    for member in wiki_json["query"]["categorymembers"]
                    if not member["title"].startswith("File:")]
    
    if "continue" in list(wiki_json.keys()):
      cmcontinue_text = wiki_json["continue"]["cmcontinue"]
    else:
      cmcontinue_text = ""
      
  return member_list

The next operation is quite lengthy, so if you already have the data frame saved, just press Enter without typing anything.

In [None]:
answer = input("Do you want to get the members of the team? ")

if answer:
  dataset = []
  for team in tqdm(teams):
    dataset.append([team, getMembers(team)])

  df = pd.DataFrame(dataset, columns=["team_name", "members"])
  df.to_csv("../data/marvel_teams.csv", index=False)
else:
  df = pd.read_csv("../data/marvel_teams.csv")
  df["members"] = df["members"].apply(ast.literal_eval)
  
df

### Information from members

Cool, right? Now that we have the members of each team, we can start doing fancy stuff. Let's start by getting how many members each team has, and let's check the top 10 teams with the most members.

In [None]:
df["number_members"] = df["members"].apply(len)

df.sort_values("number_members", ascending=False).reset_index(drop=True).head(10)

OK, this is useful and entirely unexpected. The team with the second most members is not a superhero team but a police department. Not only that, but the third one is the German Nazi Party and the fourth the US Army.

It makes sense, as each of those has, or had, a lot of real-life members that probably make appearances in the comics and thus are in the wiki.

Let's try to see how many different characters there are in total in our dataset right now.

In [None]:
len_diff_char = len(list(set(df['members'].sum())))

print(f"There are {len_diff_char} different characters.")
print(f"This means there are around {30021 - len_diff_char} characters that do not belong to any team.")

To make the network analysis more manageable, we will only work with those characters that belong to at least one team.

Well, let's try to see more data from our teams. We can start with some basic analysis and then expand it with sentiment analysis. For that, we can use the quotes of each character of a team, but that gets messy if we do it multiple times per character (as one character can belong to various teams), so let's create a character data frame first and come back to the team sentiment analysis later.

## Characters

### Create data frame

The first step is to create the data frame with all the characters and which teams they belong to.

In [None]:
all_characters = []
for _, row in df.iterrows():
  for member in row["members"]:
    all_characters += [(member, row["team_name"])]
    
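# Each entry is a (character, team) pair, so characters in several teams appear more than once
print(f"Number of (character, team) pairs: {len(all_characters)}")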

With each character paired with a team it belongs to, we can do a `groupby` to collect all the teams of each character in a list and quickly get how many teams each character belongs to.

Let's check the top ten characters that belong to the most teams.

In [None]:
df_char = pd.DataFrame(all_characters, columns=["name", "team"])

df_char = df_char.groupby("name")["team"].progress_apply(list).to_frame("teams").reset_index()
df_char["number_teams"] = df_char["teams"].progress_apply(len)

df_char.sort_values("number_teams", ascending=False).reset_index(drop=True).head(10)

As expected, most of those characters are pretty famous (in order: Wolverine, Spider-Man, Iron Man, Storm, Beast, Deadpool, Captain Marvel, Taskmaster, Mystique and Hawkeye).
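The pairing and grouping steps above can also be sketched with `DataFrame.explode`, which turns each list of members into one row per member. This is only an alternative illustration on a tiny made-up data frame (the names `toy_teams` and `toy_chars` are hypothetical), not the code used for the results above.

```python
import pandas as pd

# Tiny made-up data frame with the same columns as the team data frame
toy_teams = pd.DataFrame({
    "team_name": ["Team A", "Team B", "Team C"],
    "members": [["Alice", "Bob"], ["Alice"], ["Bob", "Alice", "Carol"]],
})

# One row per (team, member) pair, then collect the teams of each member
toy_chars = (
    toy_teams.explode("members")
             .groupby("members")["team_name"]
             .apply(list)
             .to_frame("teams")
             .reset_index()
             .rename(columns={"members": "name"})
)
toy_chars["number_teams"] = toy_chars["teams"].apply(len)
print(toy_chars.sort_values("number_teams", ascending=False))
```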


### Quotes 

We are now going to get the quotes for all the characters.

The code is quite messy and difficult to read, but the idea is that we use multiple threads to download the content faster, as we can only query one character's quotes at a time, and we have almost 13,000 characters.

The query we do is:

```
https://marvel.fandom.com/api.php?action=query&list=categorymembers&cmtitle=Category:CHARACTER/Quotes&prop=revisions&rvprop=content&rvslots=*&format=json&cmlimit=max&cmcontinue=CONTINUE_CODE
```

Again, it works similarly to those already seen.

In [None]:
quote_path = "../data/character_quotes/"

def get_character_quotes(name: str):
  # This is a little bit hacky, but works
  cmcontinue_text = ""
  quote_titles = []

  # Breaks when there are no more cm_continues
  while True:
    baseurl = "https://marvel.fandom.com/api.php?"
    args = {
      "action"      : "action=query&list=categorymembers",
      "q_title"     : "cmtitle=Category:{}/Quotes".format(urllib.parse.quote_plus(name.replace(" ", "_"))),
      "content"     : "prop=revisions&rvprop=content&rvslots=*",
      "dataformat"  : "format=json",
      "cmcontinue"  :  "cmlimit=max&cmcontinue={}".format(cmcontinue_text),
    }
    
    query = f"{baseurl}{'&'.join(args.values())}"

    wikiresponse = urllib.request.urlopen(query)
    wikitext = wikiresponse.read().decode('utf-8')
    wiki_json = json.loads(wikitext)
    
    quote_titles += [page["title"] for page in wiki_json["query"]["categorymembers"]]

    if "continue" in list(wiki_json.keys()):
      cmcontinue_text = wiki_json["continue"]["cmcontinue"]
    else: break
  
  quote_title_chunks = utils.generate_chunks(quote_titles)
  quotes = []

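  for chunk in quote_title_chunks:
    # NOTE: search_quotes is not defined in this notebook; it is expected to
    # return the parsed JSON of a revisions query for the titles in the chunk
    quote_data = search_quotes(chunk)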
    for content in quote_data["query"]["pages"].values():
      content  = content["revisions"][-1]["slots"]["main"]["*"]
      quotes += re.findall(r"Quotation.*?= (.*?)\n", content)
  
  filename = utils.generate_filename(name)
  with open(f"{quote_path}{filename}.json", "w") as f:
    json.dump(quotes, f, indent = 4)

def get_chunk_quotes(chunk: list):
  for name in chunk:
    get_character_quotes(name)
  return

def get_quotes(names: list, max_workers=16):
  files = set(os.listdir(quote_path))
  missing_names = list(
    filter(lambda x: f"{utils.generate_filename(x)}.json" not in files,
    names)
  )

  if len(missing_names) == 0:
    print("No missing quotes found 😊")
    return
  
  chunks = utils.generate_chunks(missing_names)
  print(f"Generated {len(chunks)} chunks!")
  with tqdm(total=len(chunks)) as pbar:
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
      futures = [ex.submit(get_chunk_quotes, chunk)
                  for chunk in chunks]
      for future in as_completed(futures):
        pbar.update(1)


get_quotes(df_char["name"].values , max_workers=24)
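The cell above calls `search_quotes`, which is not defined in this section. A minimal sketch of what such a helper might look like is given here, assuming it mirrors the `prop=revisions` query used for the team pages in the Wiki content section and returns the parsed JSON. Both `build_revisions_query` and this version of `search_quotes` are hypothetical and may differ from the original helper.

```python
import json
import urllib.parse
import urllib.request

def build_revisions_query(titles):
    """Build a revisions query for several page titles at once."""
    joined = "|".join(urllib.parse.quote_plus(t.replace(" ", "_")) for t in titles)
    return (
        "https://marvel.fandom.com/api.php?action=query"
        "&prop=revisions&rvprop=content&rvslots=*"
        f"&titles={joined}&format=json"
    )

def search_quotes(titles):
    """Hypothetical helper: download and parse the content of the given quote pages."""
    with urllib.request.urlopen(build_revisions_query(titles)) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Only the URL is built here, so nothing is downloaded by this example
example_url = build_revisions_query(["Some Page/Quotes", "Other Page"])
print(example_url)
```

The parsed response is expected to expose the page contents under `["query"]["pages"]`, which is how the cell above reads it.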

Now that we have successfully downloaded all quotes, we will load them in the characters dataset and display the top 10 characters with the most quotes.

In [None]:
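def getQuotes(row):
  # The quotes were saved with json.dump, so they are read back with json.load
  try:
    with open("../data/character_quotes/"+utils.generate_filename(row["name"])+".json") as f:
      quotes = json.load(f)
  except (FileNotFoundError, json.JSONDecodeError):
    # Characters without a readable quotes file simply have no quotes
    quotes = []
  
  return pd.Series([quotes, len(quotes)])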


df_char[["quotes", "number_quotes"]] = df_char.progress_apply(getQuotes, axis=1)

df_char.sort_values("number_quotes", ascending=False).reset_index(drop=True).head(10)

Again, the most famous characters have the most quotes, with Spider-Man at the top.

## Back with teams

### Quotes

Now that we have the quotes, we are going to gather the quotes per team and display the ten teams with the most quotes.

In [None]:
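def getTeamQuotes(row):
  quotes = []
  for member in row["members"]:
    try:
      # Use the same filename helper that was used when saving the quotes
      with open("../data/character_quotes/"+utils.generate_filename(member)+".json") as f:
        quotes += json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
      # Members without a readable quotes file contribute no quotes
      pass
  
  return pd.Series([quotes, len(quotes)])


df[["quotes", "number_quotes"]] = df.progress_apply(getTeamQuotes, axis=1)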

df.sort_values("number_quotes", ascending=False).reset_index(drop=True).head(10)

It is not surprising that the team with the most quotes is the Avengers, as it is one of the biggest and most famous teams.

Now that each team has its quotes, we can do some text analysis. For example, we can start with how many words, unique words and lexical richness each team has.

For that, we first remove the links, all `\n` symbols and any non-alphabetical characters from the quotes. Lastly, we remove the additional spaces.

We will display the 10 teams with the highest lexical richness among those with more than 500 unique words.

In [None]:
def getLexicalRichness(row):
  regex_links = r"(?:\{\{.*?\}\}|\[\[.*?\]\])"
  regex_newline = r"\\n"
  regex_no_alpha = r"[^a-zA-Z ]"
  regex_aditional_space = r"\s\s+"
  
  text = " ".join(row["quotes"])
  text = re.sub(regex_links, "", text)
  text = re.sub(regex_newline, " ", text)
  text = re.sub(regex_no_alpha, " ", text)
  text = re.sub(regex_aditional_space, " ", text)
  
  words = [word for word in text.lower().split(" ") if len(word) > 1]
  
  number_words = len(words)
  number_unique_words = len(list(set(words)))
  lexical_richness = 0
  
  if number_words > 0:
    lexical_richness = number_unique_words/number_words
  return pd.Series([number_words, number_unique_words, lexical_richness])

df[["number_words",
    "number_unique_words",
    "lexical_richness"]] = df.progress_apply(getLexicalRichness, axis=1)

df[df["number_unique_words"] > 500].sort_values("lexical_richness", ascending=False)\
                                   .reset_index(drop=True).head(10)
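Lexical richness, as computed above, is the number of unique words divided by the total number of words. A minimal worked sketch on a single sentence, assuming a simplified cleaning step (letters and spaces only) and a hypothetical helper name `lexical_richness`:

```python
import re

def lexical_richness(text):
    """Return (number of words, number of unique words, unique/total ratio)."""
    # Keep letters and spaces only, then split on whitespace
    cleaned = re.sub(r"[^a-zA-Z ]", " ", text).lower()
    words = [w for w in cleaned.split() if len(w) > 1]
    if not words:
        return 0, 0, 0
    unique = set(words)
    return len(words), len(unique), len(unique) / len(words)

stats = lexical_richness("With great power there must also come great responsibility!")
print(stats)
```

The sentence has nine words, and `great` appears twice, so the ratio is eight unique words over nine. Short texts tend to score high on this metric, which is one reason to filter on a minimum number of unique words.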

We can do even more interesting stuff, such as sentiment analysis per team. We are going to use VADER sentiment analysis to get the average sentiment of each team.

In [None]:
def getVaderSentiment(row):
  sid_obj = SentimentIntensityAnalyzer()
  
  happy = []
  sad = []
  neutral = []
  compound = []
  category = []
  
  for quote in row["quotes"]:
    sentiment_dict = sid_obj.polarity_scores(quote)
    happy.append(sentiment_dict["pos"])
    sad.append(sentiment_dict["neg"])
    neutral.append(sentiment_dict["neu"])
    compound.append(sentiment_dict["compound"])
    
    if sentiment_dict['compound'] >= 0.05:
        category.append("Positive")
 
    elif sentiment_dict['compound'] <= - 0.05:
        category.append("Negative")
 
    else:
        category.append("Neutral")
  if len(row["quotes"]) == 0:
    return pd.Series([0, 0, 0, 0, "Neutral"])
  
  happy_val = sum(happy)/len(happy)
  sad_val = sum(sad)/len(sad)
  neutral_val = sum(neutral)/len(neutral)
  compound_val = sum(compound)/len(compound)
  category_val = max(category, key=category.count)
  return pd.Series([happy_val*100, sad_val*100, neutral_val*100, compound_val, category_val])
    

df[["%happy",
   "%sad",
   "%neutral",
   "compound_sentiment",
   "overall_category"]] = df.progress_apply(getVaderSentiment, axis=1)

Now we can display the 10 happiest teams and the 10 saddest (based on the compound value), among those with more than 500 unique words.

In [None]:
df[df["number_unique_words"] > 500].sort_values("compound_sentiment", ascending=False)\
                                   .reset_index(drop=True).head(10)

In [None]:
df[df["number_unique_words"] > 500].sort_values("compound_sentiment", ascending=True)\
                                   .reset_index(drop=True).head(10)
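The aggregation used in the sentiment cell above can be sketched without the VADER dependency: each quote's compound score is mapped to a label with the ±0.05 cut-offs, and the team gets the mean score and the most frequent label. The helper names and the scores below are made up for illustration.

```python
from collections import Counter

def classify_compound(compound, threshold=0.05):
    """Map a compound score to a label using the cut-offs from the sentiment cell."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

def summarize(compounds):
    """Return (mean compound score, most frequent label) for a list of scores."""
    if not compounds:
        return 0, "Neutral"
    labels = [classify_compound(c) for c in compounds]
    mean_compound = sum(compounds) / len(compounds)
    most_common_label = Counter(labels).most_common(1)[0][0]
    return mean_compound, most_common_label

# Made-up compound scores, only to show the aggregation
example_summary = summarize([0.6, 0.2, -0.4, 0.0])
print(example_summary)
```

Note that the mean compound score and the most frequent label can disagree, for instance when a few strongly negative quotes outweigh many mildly positive ones.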

### WordClouds

After this brief sentiment analysis, we can try to do the cool wordclouds :). We are going to display the wordclouds of the 20 teams with the most unique words.

For that, we first need to preprocess the quotes. First, we remove the links, then the `\n` symbols and any non-alphabetical characters. Then we remove additional spaces, set all the words to lowercase, drop the stopwords and lemmatize what is left.

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')

In [None]:
tokenizer = nltk.tokenize.WordPunctTokenizer()
wnl = nltk.WordNetLemmatizer()

def preprocessQuotes(row):
  processed_quotes = []
  
  regex_links = r"(?:\{\{.*?\}\}|\[\[.*?\]\])"
  regex_newline = r"\\n"
  regex_no_alpha = r"[^a-zA-Z ]"
  regex_aditional_space = r"\s\s+"
  
  for quote in row["quotes"]:
    quote = re.sub(regex_links, " ", quote)
    quote = re.sub(regex_newline, " ", quote)
    quote = re.sub(regex_no_alpha, " ", quote)
    quote = re.sub(regex_aditional_space, " ", quote)
    
    tokens = tokenizer.tokenize(quote)
    
    all_words = [x.strip().lower() for x in tokens]
    
    stop_words = list(nltk.corpus.stopwords.words("english")) # Stopwords
    
    filtered = [x for x in all_words if x not in stop_words]
    
    lemmatized = [wnl.lemmatize(w) for w in filtered]
    
    new_stopwords = ["im", "one", "earth", "know", "im", "b", "u"]
    
    lemmatized = [x for x in lemmatized if x not in new_stopwords]
    
    processed_quotes.append(lemmatized)
  
  return processed_quotes

df["processed_quotes"] = df.progress_apply(preprocessQuotes, axis=1)

Then with the processed quotes, we can display the wordclouds.

In [None]:
fig, axarr = plt.subplots(10, 2, figsize=(20, 60))

df_quotes = df.sort_values("number_unique_words", ascending=False)

word_cloud = WordCloud(max_words=2000,
                       background_color="white"
                      )

for i, ax in enumerate(axarr.flatten()):
  # Select columns by name so the cell does not depend on the column order
  team_row = df_quotes.iloc[i]
  content = " ".join(word
                     for quote in team_row["processed_quotes"]
                     for word in quote)
  ax.imshow(word_cloud.generate(content))
  ax.axis("off")
  ax.set_title(f"{team_row['team_name']}, (Number members: {team_row['number_members']})")

plt.show()

From these wordclouds we can see a clear result. Either all of those teams share one specific theme (`thing`, `time`, `going`...), or they all share a couple of characters that take up most of the quotes. Either way, they do not give much information.

### Graph

Now the moment of truth is here. It is time to build the graph. We are going to join two teams if they share at least one member. So the first step is, obviously, to get which teams share a member. For that, we can use the previously built `df_char` data frame of characters. We can then check which are the ten most connected teams.

In [None]:
def getConnections(row):
  connections = []
  
  df_char_row = df_char[df_char["name"].isin(row["members"])]
    
  for _, row_char in df_char_row.iterrows():
    if row["team_name"] in row_char["teams"]:
      connections += row_char["teams"]
      connections.remove(row["team_name"])
      
  connections = list(set(connections))
  
  return pd.Series([connections, len(connections)])

df[["connections", "number_connections"]] = df.progress_apply(getConnections, axis=1)  

df.sort_values("number_connections", ascending=False).reset_index(drop=True).head(10)

Once we know which teams each team connects to, we can build the graph.

In [None]:
def getEdges(row):
  edges = []
  for connection in row["connections"]:
    edges.append([row["team_name"], connection])
  
  return edges

team_graph = nx.Graph()

team_graph.add_nodes_from(df["team_name"])

all_edges = df.progress_apply(getEdges, axis=1)

for edges in all_edges:
  team_graph.add_edges_from(edges)


print(f"The team graph has {len(team_graph.nodes)} nodes and {len(team_graph.edges)} edges.")
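The linking rule (two teams are connected if they share at least one member) can also be sketched directly from the character-to-teams mapping: every pair of teams of the same character becomes an edge. This is a pure-Python illustration with made-up memberships, and `shared_member_edges` is a hypothetical helper.

```python
from itertools import combinations

def shared_member_edges(character_teams):
    """Return the set of team pairs that share at least one character.

    `character_teams` maps a character name to the list of teams it belongs to.
    """
    edges = set()
    for teams in character_teams.values():
        # Every pair of teams of the same character becomes one undirected edge
        for a, b in combinations(sorted(set(teams)), 2):
            edges.add((a, b))
    return edges

# Made-up memberships, only for illustration
toy_memberships = {
    "Alice": ["Team A", "Team B", "Team C"],
    "Bob": ["Team A", "Team C"],
    "Carol": ["Team D"],
}
toy_edges = shared_member_edges(toy_memberships)
print(sorted(toy_edges))
```

The resulting set can be passed to `add_edges_from` on a `networkx` graph. Teams without shared members, like `Team D` here, produce no edge and have to be added as nodes separately.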

Now we can show multiple graph properties.

In [None]:
print("Graph basic stats:")
print(f"\tNumber of nodes: {len(team_graph.nodes)}")
print(f"\tNumber of edges: {len(team_graph.edges)}")
print(f"\tAverage degree: {sum(x[1] for x in team_graph.degree)/len(team_graph.degree):.2f}")
print()
print(f"\tMost connected node: {max(team_graph.degree, key=lambda x: x[1])[0]} \
with a degree of {max(team_graph.degree, key=lambda x: x[1])[1]}")

We can also reduce the size by keeping only the Giant Connected Component (GCC).

In [None]:
team_graph_gcc = team_graph.subgraph(max(nx.connected_components(team_graph), key=len))

print(f"The team graph GCC has {len(team_graph_gcc.nodes)} nodes and {len(team_graph_gcc.edges)} edges.")
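For intuition on what the GCC is, here is a minimal pure-Python sketch that finds the largest connected component with a breadth-first search. It assumes the graph is given as a dictionary of neighbour sets and is only an illustration. The cell above relies on `nx.connected_components` instead.

```python
from collections import deque

def largest_component(adjacency):
    """Return the largest connected component of an undirected graph.

    `adjacency` maps each node to the set of its neighbours.
    """
    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from a node that has not been visited yet
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Made-up graph with three components of sizes 3, 2 and 1
toy_graph = {
    "A": {"B", "C"}, "B": {"A"}, "C": {"A"},
    "D": {"E"}, "E": {"D"},
    "F": set(),
}
gcc_nodes = largest_component(toy_graph)
print(sorted(gcc_nodes))
```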

From this GCC, we can plot the graph and the degree distribution and check if it is random, or if it follows a powerlaw distribution.

In [None]:
forceatlas2 = ForceAtlas2(
                          # Behavior alternatives
                          outboundAttractionDistribution=False,  # Dissuade hubs
                          linLogMode=False,  # NOT IMPLEMENTED
                          adjustSizes=False,  # Prevent overlap (NOT IMPLEMENTED)
                          edgeWeightInfluence=1.0,

                          # Performance
                          jitterTolerance=5.0,  # Tolerance
                          barnesHutOptimize=False,
                          barnesHutTheta=1.2,
                          multiThreaded=False,  # NOT IMPLEMENTED

                          # Tuning
                          scalingRatio=10.0,
                          strongGravityMode=False,
                          gravity=100.0,

                          # Log
                          verbose=True)
positions = forceatlas2.forceatlas2_networkx_layout(team_graph_gcc, pos=None, iterations=50)

In [None]:
sizes = []
alphas = []
colors = []
max_degree = max(team_graph_gcc.degree(), key=lambda x: x[1])[1]

for node in tqdm(team_graph_gcc.nodes):
  size = team_graph_gcc.degree(node) * 20 + 50
  alpha = max([team_graph_gcc.degree(node)/max_degree, .2])
  
  colors.append((random.random(), random.random(), random.random()))
  
  sizes.append(size)
  alphas.append(alpha)

In [None]:
fig, ax = plt.subplots(figsize=(30, 30))

nx.draw_networkx_nodes(team_graph_gcc,
                       positions,
                       linewidths  = 1,
                       node_size   = sizes,
                       node_color  = colors,
                       alpha       = alphas,
                       ax          = ax
                      )

nx.draw_networkx_edges(team_graph_gcc,
                       positions,
                       edge_color  = "black",
                       arrowstyle  = "-",
                       alpha       = 0.5,
                       width       = .5,
                       ax          = ax
                      )  
plt.axis("off")

plt.show()

The plot of the network does not give too much information. We can see that there are many nodes with a low degree (the ones from the border) and that there are a lot of nodes interconnected, but we cannot say much more only with this. To understand the network more, we do the degree distribution analysis.

In [None]:
fig, ((ax, l_ax), (ax_bp, ax_bp_l)) = plt.subplots(2, 2, figsize=(20, 12))
fig.suptitle("Degree distribution")

# Materialize the degrees as a list so they can be reused by numpy, matplotlib and powerlaw
degrees = list(dict(team_graph_gcc.degree()).values())

hist, bins = np.histogram(np.array(list(degrees)), bins=500)
center = (bins[:-1] + bins[1:])/2


ax.plot(center, hist)
ax.set_title("Degree distribution")
ax.set_xlabel("Degree")
ax.set_ylabel("Count")

l_ax.plot(center, hist)
l_ax.set_title("Degree distribution (log)")
l_ax.set_xlabel("Degree")
l_ax.set_ylabel("Count")
l_ax.set_xscale("log")
l_ax.set_yscale("log")

ax_bp.boxplot(degrees, vert=False, labels=["Degree"])
ax_bp.set_title("Box plot of the Degree")
ax_bp.set_xlabel("Degree")

ax_bp_l.boxplot(degrees, vert=False, labels=["Degree"])
ax_bp_l.set_title("Box plot of the Degree (log)")
ax_bp_l.set_xlabel("Degree")
ax_bp_l.set_xscale("log")

plt.show()

It does look like a power-law distribution, where a few nodes have a high degree while the majority of nodes only have a couple of connections, but we need to check it.

In [None]:
results = powerlaw.Fit(list(degrees))
print(f"The alpha of the degree distribution is: {results.power_law.alpha}")

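The fitted alpha is $>1$. Note that the `alpha` reported by `powerlaw` is the exponent of the degree distribution, which is not the same quantity as the preferential-attachment exponent that defines the **Superlinear Regime**, so this value alone does not place the network in that regime. What the heavy-tailed degree distribution does suggest is that the network is not random: it has a few disproportionately attractive nodes (hubs) that concentrate most of the connections.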
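For intuition on what the fitted exponent measures, here is a minimal sketch of the continuous maximum-likelihood estimator, $\alpha = 1 + n / \sum_i \ln(x_i / x_{min})$, for a fixed $x_{min}$. The helper `estimate_alpha` and the degrees below are made up for illustration. `powerlaw.Fit` also selects $x_{min}$ itself, so its result will generally differ from this simplified estimate.

```python
import math

def estimate_alpha(values, xmin):
    """Continuous maximum-likelihood estimate of a power-law exponent.

    Only values >= xmin are used: alpha = 1 + n / sum(ln(x / xmin)).
    """
    tail = [x for x in values if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above xmin")
    return 1 + len(tail) / log_sum

# Made-up degrees, only to show the call
toy_degrees = [1, 1, 1, 2, 2, 3, 5, 8, 20]
toy_alpha = estimate_alpha(toy_degrees, xmin=1)
print(f"Estimated alpha: {toy_alpha:.2f}")
```

A smaller exponent means a heavier tail, that is, hubs with very high degree are more likely.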

### Wiki content

Other metrics we can get from the data are the length and content of the teams' wiki pages. For that, we first need to download their content.

The query looks like:

```
http://marvel.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=*&titles=TEAM_NAME&format=json
```

Again, the function is a bit long, but it is written this way so we can use multithreading.

In [None]:
content_path = "../data/team_content/"

links  = re.compile(r"\[\[(.*?)(?:\|.*?)?\]\]")


def get_team_content(names):
    # Querying wiki for name
    base_url = "http://marvel.fandom.com/api.php?"
    action = "action=query"
    title = f"titles={'|'.join([urllib.parse.quote_plus(name.replace(' ', '_')) for name in names])}"
    content = "prop=revisions&rvprop=content&rvslots=*"
    dataformat ="format=json"

    query = "{}{}&{}&{}&{}".format(base_url, action, content, title, dataformat)
    
    resp = urllib.request.urlopen(query)
    text = resp.read().decode("utf-8")
    data = json.loads(text)
    
    pages = data["query"]["pages"]
    for page_id, page in pages.items():
        page["page_id"] = page_id
        filename = utils.generate_filename(page["title"])
        
        with open(f"{content_path}{filename}.json", "w") as f:
            json.dump(page, f, indent=4)

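def get_team_info(name, all_names_set):
    # all_names_set is currently unused; it is kept so the call in get_teams_infos still works
    filename = utils.generate_filename(name)
    # Use a context manager so the file is always closed
    with open(f"{content_path}{filename}.json", "r") as f:
        data = json.load(f)

    content = data["revisions"][0]["slots"]["main"]["*"]
    
    return content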

# Multithreaded solution for faster speed
def get_content(names, max_workers = 16):
    files = set(os.listdir(content_path))
    missing_names = list(
        filter(lambda x: f"{utils.generate_filename(x)}.json" not in files,
        names
        ))

    if len(missing_names) == 0:
        print("No missing names found 😊")
        return
        
    chunks = utils.generate_chunks(missing_names)
    print(f"Generated {len(chunks)} chunks!")
    with tqdm(total=len(chunks)) as pbar:
        with ThreadPoolExecutor(max_workers=max_workers) as ex:
            futures = [ex.submit(get_team_content, chunk) 
                       for chunk in chunks]
            for future in as_completed(futures):
                pbar.update(1)
    print("Done downloading files!")

def get_teams_infos(names, max_workers = 16):
    all_names_set = set(names)
    infos = []
    with tqdm(total=len(names)) as pbar:
        with ThreadPoolExecutor(max_workers=max_workers) as ex:
            for result in ex.map(get_team_info, names, itertools.repeat(all_names_set)):
                infos.append(result)
                pbar.update(1)
    print("Done getting character information")
    return infos
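The download pattern used above (split the names into chunks, then hand each chunk to a thread pool) can be sketched without any network access. `generate_chunks` below is a hypothetical stand-in for `utils.generate_chunks`, with an assumed chunk size of 50, and `process_chunk` is a placeholder for the real download.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_chunks(items, size=50):
    """Hypothetical stand-in for utils.generate_chunks: split a list into fixed-size pieces."""
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]

def process_chunk(chunk):
    """Placeholder for the real download; here it only reports the chunk size."""
    return len(chunk)

def run_in_threads(names, max_workers=4):
    chunks = generate_chunks(names)
    processed = 0
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        futures = [ex.submit(process_chunk, chunk) for chunk in chunks]
        for future in as_completed(futures):
            # result() re-raises any exception that happened inside the worker
            processed += future.result()
    return len(chunks), processed

n_chunks, n_items = run_in_threads([f"Team {i}" for i in range(120)])
print(n_chunks, n_items)
```

Calling `future.result()` is a design choice worth considering here: a loop that only updates the progress bar would let a failed download pass silently.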

In [None]:
get_content(df["team_name"])

In [None]:
df["wiki_content"] = get_teams_infos(df["team_name"])
df.head()

Once we have the wiki content, we can, for example, get the number of words, the number of unique words, the lexical richness and the wordclouds of the 20 teams with the most unique words.

In [None]:
def getLexicalRichnessWiki(row):
  regex_links = r"(?:\{\{.*?\}\}|\[\[.*?\]\])"
  regex_newline = r"\\n"
  regex_no_alpha = r"[^a-zA-Z ]"
  regex_aditional_space = r"\s\s+"
  
  # wiki_content is already a single string, so it is used directly
  text = row["wiki_content"]
  text = re.sub(regex_links, "", text)
  text = re.sub(regex_newline, " ", text)
  text = re.sub(regex_no_alpha, " ", text)
  text = re.sub(regex_aditional_space, " ", text)
  
  words = [word for word in text.lower().split(" ") if len(word) > 1]
  
  number_words = len(words)
  number_unique_words = len(list(set(words)))
  lexical_richness = 0
  
  if number_words > 0:
    lexical_richness = number_unique_words/number_words
  return pd.Series([number_words, number_unique_words, lexical_richness])

# Apply the wiki version of the function, not the quotes one
df[["number_words_wiki",
    "number_unique_words_wiki",
    "lexical_richness_wiki"]] = df.progress_apply(getLexicalRichnessWiki, axis=1)

df[df["number_unique_words_wiki"] > 500].sort_values("lexical_richness_wiki", ascending=False)\
                                        .reset_index(drop=True).head(10)

In [None]:

df[df["number_unique_words_wiki"] > 500].sort_values("%happy", ascending=False)\
                                        .reset_index(drop=True).head(10)

In [None]:
tokenizer = nltk.tokenize.WordPunctTokenizer()
wnl = nltk.WordNetLemmatizer()

def preprocessWiki(row):
  regex_links = r"(?:\{\{.*?\}\}|\[\[.*?\]\])"
  regex_newline = r"\\n"
  regex_no_alpha = r"[^a-zA-Z ]"
  regex_aditional_space = r"\s\s+"
  
  content = row["wiki_content"]
  content = re.sub(regex_links, " ", content)
  content = re.sub(regex_newline, " ", content)
  content = re.sub(regex_no_alpha, " ", content)
  content = re.sub(regex_aditional_space, " ", content)

  tokens = tokenizer.tokenize(content)

  all_words = [x.strip().lower() for x in tokens]

  stop_words = set(nltk.corpus.stopwords.words("english")) # Stopwords (a set gives fast membership tests)

  filtered = [x for x in all_words if x not in stop_words]

  lemmatized = [wnl.lemmatize(w) for w in filtered]
  
  new_stopwords = ["im", "one", "earth", "know", "im", "b", "u",
                   "vol", "jpg", "template", "database", "h",
                   "e", "l", "title", "br", "imagesize", "px",
                   "ref", "image", "pageid", "n", "revisions"]

  lemmatized = [x for x in lemmatized if x not in new_stopwords]
  
  return lemmatized

df["processed_wiki"] = df.progress_apply(preprocessWiki, axis=1)

In [None]:

df_wiki = df.sort_values("number_unique_words_wiki", ascending=False)


word_cloud = WordCloud(max_words=2000,
                       background_color=None,
                       height=400,
                       width=400,
                      )

for i in range(20):
  content = " ".join(df_wiki.iloc[i, 20])
  word_cloud.generate(content).to_file("../data/wordclouds/" + str(df_wiki.iloc[i, 0]) + ".png")



word_cloud = WordCloud(max_words=2000,
                       background_color="white"
                      )
fig, axarr = plt.subplots(10, 2, figsize=(20, 60))

for i, ax in enumerate(axarr.flatten()):
  content = " ".join(df_wiki.iloc[i, 20])
  ax.imshow(word_cloud.generate(content))
  ax.axis("off")
  ax.set_title(f"{df_wiki.iloc[i, 0]}, (Number members: {df_wiki.iloc[i, 2]})")

plt.show()

### Correlation

We can do a quick analysis using the correlation between the different columns to see, for example, whether having more members translates into a longer wiki page, or whether teams whose quotes are lexically richer also have lexically richer wiki pages. To display this information we can use a heatmap that summarizes all the correlations.
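To make it clearer what a single cell of such a heatmap measures, here is a minimal sketch of the Pearson correlation coefficient, which is the default method of pandas' `DataFrame.corr()`. The function `pearson` and the numbers are hypothetical and only serve as an illustration; they are not taken from the real data frame.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the common 1/n factor cancels out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: team sizes and wiki page lengths.
members = [3, 10, 25, 40]
wiki_words = [200, 900, 2100, 3900]
print(round(pearson(members, wiki_words), 3))
```

A value close to 1 means that both columns grow together, a value close to -1 means that one shrinks when the other grows, and a value around 0 means there is no linear relationship.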

In [None]:
fig, ax = plt.subplots(figsize=(15, 13))

ax.set_title("Heatmap of the teams values correlation")

sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True, cmap="icefire", ax=ax)

plt.show()

We know the heatmap is pretty tricky to read, so here are the most relevant results:

 - As expected, the **number of nodes** is highly correlated with the **number of connections** and is somewhat related to the **number of quotes**, **number of words**, and **number of words in the wiki**.
 - Again, as expected, the **number of words** is highly correlated with the **number of unique words**, the **number of connections** and the **number of unique words in the wiki**.
 - **Lexical richness** is related to **happy**, **sad**, and especially **neutral**, and to the **lexical richness** of the wiki.
 
## And back with characters

We have analyzed the team information, done natural language processing over it, and seen a lot of stuff. It's time to jump into the characters and see what we can get from here.

We already have the quotes from each character in our data frame. We can see that the character with the most quotes is **Peter Parker (Spider-Man)** and that the top 10 characters by number of quotes are all famous, but with only the name and not the alias, it is sometimes difficult to tell who they are. Luckily, we can get their aliases from their wiki content, and while we are at it, let's get their whole wiki page.

In [None]:
df_char.sort_values("number_quotes", ascending=False).reset_index(drop=True).head(10)

We can get multiple pages at the same time. We can get the content, the links and number of connections, and the alias.
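The idea of asking for several pages in one request can be sketched as follows. The helpers `chunked` and `build_query` are hypothetical stand-ins (the notebook uses `utils.generate_chunks`, whose implementation is not shown here), and the chunk size of 50 is an assumption based on the usual MediaWiki limit on `titles` for regular clients. The sketch only builds the URLs and does not contact the wiki.

```python
import urllib.parse

def chunked(items, size=50):
    """Split a list into consecutive chunks of at most `size` items (hypothetical helper)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_query(titles, base_url="https://marvel.fandom.com/api.php"):
    """Build one MediaWiki query URL asking for the content of several pages."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "*",
        "titles": "|".join(titles),  # several titles are separated by a pipe
        "format": "json",
    }
    # urlencode escapes spaces, pipes and other special characters.
    return base_url + "?" + urllib.parse.urlencode(params)

names = [f"Character {i}" for i in range(120)]  # placeholder names
urls = [build_query(chunk) for chunk in chunked(names)]
print(len(urls))  # 3
```

With this layout, 120 pages need 3 requests instead of 120, and each request is independent of the others, so they can be sent from different threads.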

In [None]:
links  = re.compile(r"\[\[(.*?)(?:\|.*?)?\]\]")

content_path = "../data/character_content/"


def get_character_content(names):
    # Querying wiki for name
    base_url = "http://marvel.fandom.com/api.php?"
    action = "action=query"
    title = f"titles={'|'.join([urllib.parse.quote_plus(name.replace(' ', '_')) for name in names])}"
    content = "prop=revisions&rvprop=content&rvslots=*"
    dataformat ="format=json"

    query = "{}{}&{}&{}&{}".format(base_url, action, content, title, dataformat)
    
    resp = urllib.request.urlopen(query)
    text = resp.read().decode("utf-8")
    data = json.loads(text)

    pages = data["query"]["pages"]
    for page_id, page in pages.items():
        page["page_id"] = page_id
        filename = utils.generate_filename(page["title"])
        with open(f"{content_path}{filename}.json", "w") as f:
            json.dump(page, f, indent=4)

# Multithreaded solution for faster speed
def get_content(names, max_workers = 16):
    files = set(os.listdir(content_path))
    missing_names = list(
        filter(lambda x: f"{utils.generate_filename(x)}.json" not in files,
        names
        ))

    if len(missing_names) == 0:
        print("No missing names found 😊")
        return
        
    chunks = utils.generate_chunks(missing_names)
    print(f"Generated {len(chunks)} chunks!")
    with tqdm(total=len(chunks)) as pbar:
        with ThreadPoolExecutor(max_workers=max_workers) as ex:
            futures = [ex.submit(get_character_content, chunk) 
                       for chunk in chunks]
            for future in as_completed(futures):
                pbar.update(1)
    print("Done downloading files!")

def GetCharacterInfo(row, all_names):
  
  # Reuse the caller's set when one is passed, otherwise build it
  all_names_set = all_names if isinstance(all_names, (set, frozenset)) else set(all_names)
  filename= utils.generate_filename(row["name"])
  
  regex_alias = r"\|\sCurrentAlias\s+=\s(.*?)(?:|\|.*?)\\n"
  regex_alphanum = r"[^a-zA-Z0-9\- ]"
  regex_spaces = r"\s\s+"
  alias = "-"
  wiki = ""  # Stays empty if the page file cannot be read

  try:
    with open("../data/character_content/"+filename+".json") as f:
      wiki = f.read()
  except OSError:
    pass

  # The alias is optional: keep "-" when there is no usable CurrentAlias
  found = re.findall(regex_alias, wiki)
  if found:
    cleaned = re.sub(regex_alphanum, "", found[0])
    cleaned = re.sub(regex_spaces, " ", cleaned).strip()
    if cleaned:
      alias = cleaned

  # Keep only links that point to other known characters, without duplicates
  lin = links.findall(wiki)
  lin = [x for x in lin if x in all_names_set and x != row["name"]]

  lin = list(set(lin))

  return pd.Series([wiki, lin, len(lin), alias])


As always, we will show the top 10 characters with the most links.
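The link count behind this ranking comes from matching `[[Target|label]]` patterns in the wiki text and keeping only the targets that are known characters. The sketch below applies the same regular expression to a made-up piece of wiki text; `extract_links`, the sample text and the set of known names are assumptions for illustration.

```python
import re

# Same pattern as the notebook's `links` regex: capture the link target
# and ignore the optional "|label" part.
LINK_RE = re.compile(r"\[\[(.*?)(?:\|.*?)?\]\]")

def extract_links(wikitext, known_names, own_name):
    """Return the distinct known characters referenced in `wikitext`."""
    targets = LINK_RE.findall(wikitext)
    return sorted({t for t in targets if t in known_names and t != own_name})

# Made-up wiki text for illustration.
sample = ("[[Peter Parker (Earth-616)|Spider-Man]] visited [[Krakoa (Earth-616)]] "
          "with [[Peter Parker (Earth-616)]] and read the [[Daily Bugle]].")
known = {"Peter Parker (Earth-616)", "Krakoa (Earth-616)", "James Howlett (Earth-616)"}

print(extract_links(sample, known, own_name="Krakoa (Earth-616)"))
# ['Peter Parker (Earth-616)']
```

Repeated references to the same character count once, links to pages that are not characters are dropped, and a page never links to itself.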

In [None]:
get_content(df_char["name"])

# Build the set of names once instead of once per row
df_char[["wiki_content",
         "links",
         "number_links",
         "alias"]] = df_char.progress_apply(GetCharacterInfo, all_names=set(df_char["name"]), axis=1)

df_char.sort_values("number_links", ascending=False).reset_index(drop=True).head(10)

Now we can see the alias, and the characters are easier to recognise :). Umm, there is a surprising character in here. Krakoa seemed like an odd character to have the most links, so we inspected it further and saw that it is a sentient colony creature (don't ask, Marvel's universes are weird) where many other characters live, and all of them are referenced in its wiki.

We can do some of the same analysis we did for the teams, but for characters. For example, we can check the lexical richness of each character. Now we are going to display the top 10 characters by lexical richness among those with at least 100 unique words.

In [None]:
df_char[["number_words",
         "number_unique_words",
         "lexical_richness"]] = df_char.progress_apply(getLexicalRichness, axis=1)

df_char[df_char["number_unique_words"] > 100].sort_values("lexical_richness", ascending=False)\
                                             .reset_index(drop=True).head(10)

We can do the same for the wiki content and check which are the top 10 characters by lexical richness in their wiki, among those with at least 500 different words in it.

In [None]:
df_char[["number_words_wiki",
         "number_unique_words_wiki",
         "lexical_richness_wiki"]] = df_char.progress_apply(getLexicalRichness, axis=1)

df_char[df_char["number_unique_words_wiki"] > 500].sort_values("lexical_richness_wiki", ascending=False)\
                                        .reset_index(drop=True).head(10)

Interesting. Now that we have a preliminary text analysis, let's check the super cool word clouds. First, let's do a word cloud for the top 20 characters, ranked by the number of unique words in their quotes.

One important remark: when doing the word clouds for the teams, many of them contained the word 'time'. The top 20 characters also have the word 'time', so we will remove it from the `content`, and the same goes for 'thing'.

In [None]:
df_char["processed_quotes"] = df_char.progress_apply(preprocessQuotes, axis=1)

fig, axarr = plt.subplots(10, 2, figsize=(20, 60))

df_char_quotes = df_char.sort_values("number_unique_words", ascending=False)

word_cloud = WordCloud(max_words=2000,
                       background_color="white"
                      )

for i, ax in enumerate(axarr.flatten()):
  content = " ".join(word
                     for quote in df_char_quotes.iloc[i, 15]
                     for word in quote
                     if word != "time"
                     and word != "thing")
  
  ax.imshow(word_cloud.generate(content))
  ax.axis("off")
  ax.set_title(f"{df_char_quotes.iloc[i, 0]}")

plt.show()

We can also do the word clouds for the wiki content, and see whether the general idea or sentiment is similar to that of their quotes.

In [None]:
df_char

In [None]:
df_char["processed_wiki"] = df_char.progress_apply(preprocessWiki, axis=1)

fig, axarr = plt.subplots(10, 2, figsize=(20, 60))

df_char_wiki = df_char.sort_values("number_unique_words", ascending=False)

word_cloud = WordCloud(max_words=2000,
                       background_color="white"
                      )

for i, ax in enumerate(axarr.flatten()):
  content = " ".join(df_char_wiki.iloc[i, 16])
  ax.imshow(word_cloud.generate(content))
  ax.axis("off")
  ax.set_title(f"{df_char_wiki.iloc[i, 0]}")

plt.show()

We can do the VADER sentiment analysis too. Let's display the top 10 happiest and saddest characters that have at least 100 unique words in their quotes.

In [None]:
df_char[["%happy",
         "%sad",
         "%neutral",
         "compound_sentiment",
         "overall_category"]] = df_char.progress_apply(getVaderSentiment, axis=1)

In [None]:
df_char[df_char["number_unique_words"] > 100].sort_values("compound_sentiment", ascending=False)\
                                             .reset_index(drop=True).head(10)

In [None]:
df_char[df_char["number_unique_words"] > 100].sort_values("compound_sentiment", ascending=True)\
                                             .reset_index(drop=True).head(10)

### Graph

Now it's time to build the graph again. This time, a connection exists when a character's wiki page references another character. Because a reference goes from one page to another, the result is a directed graph.
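To illustrate why direction matters, the following dependency-free sketch turns a small, made-up dictionary of links into directed edges and counts in and out connections separately. The node names and link lists are hypothetical; in the notebook the same bookkeeping is handled by `nx.DiGraph`.

```python
from collections import Counter

# Hypothetical link lists: each key's wiki page references the listed characters.
links = {"A": ["B", "C"], "B": ["A"], "C": [], "D": ["A"]}

# One directed edge per reference: (page that links, page that is linked).
edges = [(src, dst) for src, targets in links.items() for dst in targets]

out_degree = Counter(src for src, _ in edges)  # references made by a page
in_degree = Counter(dst for _, dst in edges)   # references received by a page

for node in links:
    print(node, "in:", in_degree[node], "out:", out_degree[node])
```

Here `A` is referenced twice and references two pages, while `D` references `A` but nobody references `D`. In an undirected graph that difference would be lost.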

In [None]:
char_graph = nx.DiGraph()

char_graph.add_nodes_from(df_char["name"])
for _, row in df_char.iterrows():
  char_graph.add_edges_from([[row["name"], x] for x in row["links"]])
  
print(f"The graph has {len(char_graph.nodes)} nodes and {len(char_graph.edges)} edges")

In [None]:
print("Graph basic stats:")
print(f"\tNumber of nodes: {len(char_graph.nodes)}")
print(f"\tNumber of edges: {len(char_graph.edges)}")
print(f"\tAverage degree: {sum(x[1] for x in char_graph.degree)/len(char_graph.degree):.2f}")
print()
print(f"\tMost connected node: {max(char_graph.degree, key=lambda x: x[1])[0]} \
with a degree of {max(char_graph.degree, key=lambda x: x[1])[1]}")
print(f"\tMost in connections node: {max(char_graph.in_degree, key=lambda x: x[1])[0]} \
with an in degree of {max(char_graph.in_degree, key=lambda x: x[1])[1]}")
print(f"\tMost out connections node: {max(char_graph.out_degree, key=lambda x: x[1])[0]} \
with an out degree of {max(char_graph.out_degree, key=lambda x: x[1])[1]}")

We can again obtain the giant connected component (GCC) of the character graph. Connected components are defined on undirected graphs, so we first make an undirected copy of the graph.
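What "largest connected component, ignoring direction" means can be sketched with a plain breadth-first search. The function `largest_component` and the toy edges are assumptions for illustration. The notebook relies on `nx.connected_components` on the undirected copy, and NetworkX also offers `nx.weakly_connected_components` for directed graphs.

```python
from collections import defaultdict, deque

def largest_component(edges, nodes):
    """Return the largest set of nodes connected when edge direction is ignored."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        # Store every edge in both directions to make the graph undirected.
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    seen = set()
    best = set()
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("C", "B"), ("D", "E")]  # toy directed edges
print(sorted(largest_component(edges, nodes)))  # ['A', 'B', 'C']
```

Note that `A` and `C` cannot reach each other by following the arrows, but they end up in the same component once direction is dropped.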

In [None]:
char_und_graph = nx.Graph(char_graph)

char_graph_gcc = char_und_graph.subgraph(max(nx.connected_components(char_und_graph), key=len))

print(f"The character graph GCC has {len(char_graph_gcc.nodes)} nodes and {len(char_graph_gcc.edges)} edges.")

And again, we can plot the graph to see what it looks like.

In [None]:
forceatlas2 = ForceAtlas2(
                          # Behavior alternatives
                          outboundAttractionDistribution=False,  # Dissuade hubs
                          linLogMode=False,  # NOT IMPLEMENTED
                          adjustSizes=False,  # Prevent overlap (NOT IMPLEMENTED)
                          edgeWeightInfluence=1.0,

                          # Performance
                          jitterTolerance=5.0,  # Tolerance
                          barnesHutOptimize=False,
                          barnesHutTheta=1.2,
                          multiThreaded=False,  # NOT IMPLEMENTED

                          # Tuning
                          scalingRatio=10.0,
                          strongGravityMode=False,
                          gravity=100.0,

                          # Log
                          verbose=True)
positions = forceatlas2.forceatlas2_networkx_layout(char_graph_gcc, pos=None, iterations=5)

In [None]:
sizes = []
alphas = []
colors = []
max_degree = max(char_graph_gcc.degree(), key=lambda x: x[1])[1]

for node in tqdm(char_graph_gcc.nodes):
  size = char_graph_gcc.degree(node) * 20 + 50
  alpha = max([char_graph_gcc.degree(node)/max_degree, .2])
  
  colors.append((random.random(), random.random(), random.random()))
  
  sizes.append(size)
  alphas.append(alpha)

In [None]:
fig, ax = plt.subplots(figsize=(30, 30))

nx.draw_networkx_nodes(char_graph_gcc,
                       positions,
                       linewidths  = 1,
                       node_size   = sizes,
                       node_color  = colors,
                       alpha       = alphas,
                       ax          = ax
                      )

nx.draw_networkx_edges(char_graph_gcc,
                       positions,
                       edge_color  = "black",
                       arrowstyle  = "-",
                       alpha       = 0.5,
                       width       = .5,
                       ax          = ax
                      )  
plt.axis("off")
ax.set_facecolor(None)

plt.savefig("../docs/assets/img/networkimgs/character_network.png", transparent=True)

plt.show()

Well :/, this is a huge hairball. Let's see if we can get any meaningful information from the degree distribution analysis.

In [None]:
fig, ((ax, l_ax), (ax_bp, ax_bp_l)) = plt.subplots(2, 2, figsize=(20, 12))
fig.suptitle("Degree distribution")

degrees = dict(char_graph.degree()).values()

hist, bins = np.histogram(np.array(list(degrees)), bins=500)
center = (bins[:-1] + bins[1:])/2


ax.plot(center, hist)
ax.set_title("Degree distribution")
ax.set_xlabel("Degree")
ax.set_ylabel("Count")

l_ax.plot(center, hist)
l_ax.set_title("Degree distribution (log)")
l_ax.set_xlabel("Degree")
l_ax.set_ylabel("Count")
l_ax.set_xscale("log")
l_ax.set_yscale("log")

ax_bp.boxplot(degrees, vert=False, labels=["Degree"])
ax_bp.set_title("Box plot of the Degree")
ax_bp.set_xlabel("Degree")

ax_bp_l.boxplot(degrees, vert=False, labels=["Degree"])
ax_bp_l.set_title("Box plot of the Degree (log)")
ax_bp_l.set_xlabel("Degree")
ax_bp_l.set_xscale("log")

plt.show()

fig, ((ax, l_ax), (ax_bp, ax_bp_l)) = plt.subplots(2, 2, figsize=(20, 12))
fig.suptitle("In Degree distribution")

degrees = dict(char_graph.in_degree()).values()

hist, bins = np.histogram(np.array(list(degrees)), bins=500)
center = (bins[:-1] + bins[1:])/2


ax.plot(center, hist)
ax.set_title("In Degree distribution")
ax.set_xlabel("Degree")
ax.set_ylabel("Count")

l_ax.plot(center, hist)
l_ax.set_title("In Degree distribution (log)")
l_ax.set_xlabel("Degree")
l_ax.set_ylabel("Count")
l_ax.set_xscale("log")
l_ax.set_yscale("log")

ax_bp.boxplot(degrees, vert=False, labels=["Degree"])
ax_bp.set_title("Box plot of the In Degree")
ax_bp.set_xlabel("Degree")

ax_bp_l.boxplot(degrees, vert=False, labels=["Degree"])
ax_bp_l.set_title("Box plot of the In Degree (log)")
ax_bp_l.set_xlabel("Degree")
ax_bp_l.set_xscale("log")

plt.show()

fig, ((ax, l_ax), (ax_bp, ax_bp_l)) = plt.subplots(2, 2, figsize=(20, 12))
fig.suptitle("Out Degree distribution")

degrees = dict(char_graph.out_degree()).values()

hist, bins = np.histogram(np.array(list(degrees)), bins=500)
center = (bins[:-1] + bins[1:])/2


ax.plot(center, hist)
ax.set_title("Out Degree distribution")
ax.set_xlabel("Degree")
ax.set_ylabel("Count")

l_ax.plot(center, hist)
l_ax.set_title("Out Degree distribution (log)")
l_ax.set_xlabel("Degree")
l_ax.set_ylabel("Count")
l_ax.set_xscale("log")
l_ax.set_yscale("log")

ax_bp.boxplot(degrees, vert=False, labels=["Degree"])
ax_bp.set_title("Box plot of the Out Degree")
ax_bp.set_xlabel("Degree")

ax_bp_l.boxplot(degrees, vert=False, labels=["Degree"])
ax_bp_l.set_title("Box plot of the Out Degree (log)")
ax_bp_l.set_xlabel("Degree")
ax_bp_l.set_xscale("log")

plt.show()

It does look like a power-law distribution. Let's check by fitting one.

In [None]:
degrees = dict(char_graph.degree()).values()
results = powerlaw.Fit(list(degrees), suppress_output=True)

degrees_in = dict(char_graph.in_degree()).values()
results_in = powerlaw.Fit(list(degrees_in), suppress_output=True)

degrees_out = dict(char_graph.out_degree()).values()
results_out = powerlaw.Fit(list(degrees_out), suppress_output=True)

In [None]:
print(f"The alpha of the degree distribution is: {results.power_law.alpha}")
print(f"The alpha of the in degree distribution is: {results_in.power_law.alpha}")
print(f"The alpha of the out degree distribution is: {results_out.power_law.alpha}")

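All the alphas are $>1$. Keep in mind that the `alpha` reported by `powerlaw` is the exponent of the degree distribution, $p(k) \sim k^{-\alpha}$, and a normalizable power law always has $\alpha > 1$, so this value on its own does not place the network in the **Superlinear Regime** (that regime is defined by the exponent of the preferential-attachment rule, not by the degree exponent). What the heavy-tailed degree distributions do suggest is that this is not a random network: it has a few disproportionately attractive nodes that concentrate most of the connections.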
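To give an intuition of what the fitted exponent is, the sketch below generates synthetic data from a continuous power law with a known exponent and recovers it with the standard maximum-likelihood estimator $\hat{\alpha} = 1 + n / \sum_i \ln(x_i / x_{min})$. The helpers `sample_power_law` and `fit_alpha` are hypothetical, the data is synthetic, and this is only an illustration: degrees are integers, so the `powerlaw` package may use a different (discrete) estimator and also chooses `xmin` itself.

```python
import math
import random

def sample_power_law(alpha, x_min, n, rng):
    """Draw n samples from p(x) ~ x^(-alpha), x >= x_min, by inverse transform."""
    # rng.random() is in [0, 1), so 1 - u is in (0, 1] and the power is defined.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def fit_alpha(values, x_min):
    """Maximum-likelihood estimate of alpha for a continuous power law."""
    tail = [v for v in values if v >= x_min]
    return 1.0 + len(tail) / sum(math.log(v / x_min) for v in tail)

rng = random.Random(0)  # fixed seed so the run is repeatable
data = sample_power_law(alpha=2.5, x_min=1.0, n=20000, rng=rng)
print(round(fit_alpha(data, x_min=1.0), 2))  # should be close to 2.5
```

Since the sum of logarithms is positive, this estimator always returns a value above 1, whatever the data looks like.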

We can get more information regarding the in and out degree.

In [None]:
plt.figure(figsize=(12, 8))
plt.scatter(x=list(degrees_in), y=list(degrees_out))  # dict views are converted to lists for plotting
plt.title("In degree vs. Out degree")
plt.xlabel("In degree")
plt.ylabel("Out degree")

plt.show()

From this plot, we can see that most nodes with a low in-degree have a low out-degree too, and nodes with a high in-degree tend to have a high out-degree. There are a few outliers, but the two seem pretty correlated.


## Conclusions
From this project, we can first and foremost conclude that the Marvel Universe 616, and presumably the other Marvel universes, are very interlinked.

From the network analysis, we can observe that there are prominent characters linking to a lot of teams and a lot of characters. This is presumably because of the long history of the Marvel comics and movies.

As we have mostly delved into the comic universe, which contains much more information than the cinematic one, we can observe that the characters that are widely known by the general public are also the most connected characters in the comics: Wolverine, Spider-Man, Magneto, etc.

From the sentiment analysis, we observed that although some teams and characters are happier than others, the general happiness score is relatively low. This might be because the Marvel Universe is jam-packed with action. Most of the lore around the characters is about their struggles against supervillains or the descriptions of the supervillains mowing down planets. Although the narratives might be heroic or stoic, we still observed a lot of words that could have a negative connotation.

Another lesson from this project is just how convoluted stories over time become. Initially, we wanted to divide the characters into heroes and villains. Still, as we progressed through the project, it became clear that the definition of good and evil is not as static in the Marvel Universe as one might imagine. The seed for us was reading up on Magneto, where he is first an ally of the X-Men and later an enemy. This, in our eyes, makes the universe a lot more interesting as the characters are no longer clearly black or white but greyer, and their actions and plotlines become more attractive as an effect of that.

## Discussion

We display multiple stats and characteristics from the data, but we were unable to reconstruct the original teams, as more complex techniques would have been necessary for that. We started this project with an ambitious idea, without thinking about whether it was within the scope of this course or how difficult it would be to achieve. The original idea was to recreate the original teams from the links. What's the problem, right? That shouldn't be so difficult. Well, the main problem is that, from the wiki content, one can get the characters that are referenced by another, but there is no way, or at least no easy way, to tell whether the references from a character are allies, enemies, partners in a team or just random meetings between characters.

The second problem is the complexity of the Marvel timeline. Marvel comics have been around for quite some time (since 1939), and the main universe was created in 1961. Since then, many teams have been created and have disappeared. Many heroes have been in multiple teams, and the relationships between characters have changed a lot. Because of the size of the universe, many different authors have inserted their own ideas of what each character means to them. That results in characters that started in a team, then left it, then came back and maybe, later, became enemies of that team. As a result, some characters end up being both enemies and allies of themselves.

As you can guess, this makes the team reconstruction goal almost impossible. 

After realising this, we thought that maybe we could try to add some kind of timeline to the characters, to see how they evolve. But again, this is very difficult for many reasons. First, time in the Marvel universe is not constant at all. Peter Parker (Spider-Man) was introduced in 1962, but as of today, he is still not over 30 years old. On top of that, some events involve time travel and immortal beings. It is a huge chaos.

In the end, we decided to just analyze the Marvel 616 universe and build a kind of informative wiki about it, showing different stats for each character and each team, so we could apply the different techniques and analyses we learned during the course and show how difficult and confusing the Marvel universe is.

What is missing? Well, that is easy: a reconstruction of the teams based on the links from each character. Is it an obtainable goal? Not really, at least not without using much more complex techniques.

Possible improvements and future work for this project could include incorporating more universes into the network. Apart from the 616, there are some other big and interesting universes in the Marvel multiverse, such as *Earth-1610* (Ultimate Marvel Universe), *Earth-199999* (Marvel Cinematic Universe) and *Earth-928* (2099 universe).


## Contributions

Although both of us have worked on the whole project, I (Alejandro) have dedicated more time to this notebook, and have done more research on how to do the queries and how to perform the analysis. Gustav is the one to thank for the multithreading of the content download, which allowed us to get all the information in a reasonable time (it went from around 10 hours to 3 minutes, quite impressive to say the least :) ). He is also responsible for the cool website.

Both of us agreed on the idea of what to do, through multiple brainstorming sessions. Unfortunately, we got stuck with the project multiple times, because we didn't know what to get from the data, as we could not do what we thought we could at first. But, after finishing the full project, we are happy with the results and how everything turned out in the end, as it feels informative, and we have learnt quite a lot.