# Explainer Notebook
> "This section describes the detailed analysis carried out in this project."

- toc: false
- branch: master
- badges: true
- comments: true
- categories: [explainer notebook, in depth analysis]
- image: images/some_folder/your_image.png
- hide: true
- search_exclude: false
- metadata_key1: metadata_value1
- metadata_key2: metadata_value2


<p align="center">
  <img src="https://sm.ign.com/ign_ap/news/g/game-of-th/game-of-thrones-showrunners-reveal-the-night-kings-true-moti_2swf.jpg" />
</p>

<h1 align = "center"> 1. Motivation </h1>

---

Game Of Thrones is a very popular series and is among the most-watched series to date. It follows a large number of characters, allegiances, religions and locations, with many storylines playing out concurrently. At some point this makes the information and storytelling quite overwhelming. Do the characters follow a pattern in how they act? Is there a pattern in how their stories play out? Could we forecast these events?

We wish to dig into the large amount of information on each character to see whether we can unveil underlying patterns that could help Game Of Thrones fans understand the story even better. This includes investigating network patterns and finding the most important character in each season and in the series overall. We also look for underlying communities, and thereby try to decode the underlying structure of the story.


<h2 align = "center">1.1 What is your dataset? </h2>

Our main dataset is from [The Game Of Thrones wiki-page](https://gameofthrones.fandom.com/wiki/Game_of_Thrones_Wiki), from which we extracted the character pages for all characters present in the Game Of Thrones series. The dataset is restricted to characters listed on the wiki, so it contains fewer characters than the books do.

The data for each character was extracted both in total across all seasons, i.e. all the data from each character page, and per season. From each character page some basic attributes were extracted: allegiance, religion, culture, appearances, status, which other characters the given character links to, and how often. The thumbnail image was also extracted for later visualization, and lastly the text was cleaned for the text analysis later on.

As the wiki-pages are written by contributors, their wording could affect the sentiment of the character pages. We therefore included an extra data source, namely the character dialogue. We used a data source containing clean transcripts of all seasons and episodes, so no data cleaning was needed besides excluding the dialogue of characters not found in the wiki-page data.

Lastly, IMDb ratings and reviews were used for each episode, as we wished to compare these with the dialogue. They were retrieved with a Python package that extracts data from IMDb. We also investigated scraping the IMDb webpage for reviews, but found that this was not allowed and therefore did not pursue it further.

<p align="center">
  <img src="https://assets.datacamp.com/production/project_76/img/got_network.jpeg" />
</p>
<h2 align = "center">1.2 Why did you choose this/these particular dataset(s)? </h2>

The dataset from the Game of Thrones wiki was used because it provides information about each character, i.e. some of their general attributes, as well as text describing the characters and their interactions. This could help us understand the interactions between all characters, and hopefully some of the underlying patterns in the series.

We could investigate the network properties of the series both across all seasons and within each season. This lets us analyze each season and investigate how the network develops over time. Using the words describing each character, together with attributes such as allegiance, we perform text and sentiment analysis to see if we can understand the grouping.

As the sentiment of the character pages could be biased by the writer of the page, we wanted to go to the root source, namely the transcripts of the series, and therefore used the dialogue of all the characters as a second data source. We expected this to add valuable information about how the characters were feeling and how they were grouped.

Lastly, we added the ratings from IMDb to see if the reviewers were affected by the general mood in the series.

<h2 align = "center">1.3 What was your goal for the end user's experience? </h2>

The goal of the website is to present interactive visualizations that engage the user in exploring the data and the Game Of Thrones series. The site should be easy to grasp, also for people without the theoretical knowledge taught in the course. It presents results in visualizations and tables that invite the user to dive deeper into the analysis. The site should be engaging: it should not take hours to gain insights, although a longer visit should be possible.

The website should let the end user explore the complex character interactions in the Game Of Thrones series and understand some of their patterns, which are explored through network analysis, text analysis and sentiment analysis. The user should learn which characters are the most important in the series as a whole and in each season, and be able to explore these characters' properties, such as the words they use, how they feel and who they interact with. The user should feel the urge to dive deeper, and should feel confident about what the data contains and what it can be used for.

<h1 align = "center">2. Data and basic statistics </h1>

---

<h2 align = "center">2.1 Data extraction, cleaning and pre-processing </h2>



In [None]:
import json
import os
import re
import urllib.request

import matplotlib.pyplot as plt
import networkx as nx
import nltk
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import requests
from bs4 import BeautifulSoup  # thumbnail extraction (get_img)
from imdb import IMDb  # IMDbPY, used for ratings and reviews
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from plotly.subplots import make_subplots

nltk.download('punkt')
nltk.download('wordnet')
base_path = "../data/"

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

<h3 align = "center"> 2.1.1 The Game of Thrones Wiki-page </h3>

The main data set of this project has been extracted from [the Game of Thrones fan wiki](https://gameofthrones.fandom.com/wiki/Game_of_Thrones_Wiki). To generate a Game of Thrones network, it is first necessary to compile a list of characters to build the network on. This was done by using the wiki API to extract the cast of each season, as seen below.


In [None]:
def api_call(t):
    # Query the Game of Thrones wiki (MediaWiki API) for the wikitext of page `t`
    baseurl = "https://gameofthrones.fandom.com/api.php?"
    action = "action=query"
    title = "titles="+t.replace('é', 'e')
    content = "prop=revisions&rvprop=content&rvslots=*"
    dataformat ="format=json"

    query = "{}{}&{}&{}&{}".format(baseurl, action, content,title, dataformat)

    # Requires `import urllib.request`
    with urllib.request.urlopen(query) as wikiresponse:
        wikitext = wikiresponse.read().decode('utf-8')
    return json.loads(wikitext)



def get_vals(nested, key):
    # Recursively collects every value stored under `key` in a nested dict/list structure. Taken from:
    ## https://stackoverflow.com/questions/67747851/python-extracting-data-from-nested-dictionary
    result = []
    if isinstance(nested, list) and nested != []:   #non-empty list
        for lis in nested:
            result.extend(get_vals(lis, key))
    elif isinstance(nested, dict) and nested != {}:   #non-empty dict
        for val in nested.values():
            if isinstance(val, (list, dict)):   #(list or dict) in dict
                result.extend(get_vals(val, key))
        if key in nested.keys():   #key found in dict
            result.append(nested[key])
    return result

# Base page title for the season pages
base_cat = "Game_of_Thrones_Season_"

names = []
pattern1 = r"\[\[(.*?)\]\] as \[\[(.*?)\]\].*"
for i in range(1, 9):
    tmp = api_call(base_cat + str(i))
    txt = get_vals(tmp, '*')[0]
    names.extend(re.findall(pattern1, txt))
# For piped links ("Page|Label"), the part before the pipe is the page title
characters = {char: {"actor": actor,
                     "link": char.split("|")[0].replace(" ", "_")}
              for actor, char in names}

Using the method above, we generate a dictionary with the Game of Thrones characters as keys and, as values, the actor playing the character and the character's wiki link. To extract the character and actor names from the returned text, we use the regular expression defined as **pattern1**. On the wiki, each actor and the role they play follow the simple syntax "\[\[Actor]] as \[\[Character]]". The regular expression therefore looks for two sets of \[\[]] separated by "as" and extracts the text enclosed by each pair of brackets.
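
As a small illustration of the cast pattern, the sketch below applies it to a short, hypothetical wikitext excerpt (the excerpt and the helper name `parse_cast` are assumptions made for this example, not part of the project code). It also shows how a piped link of the form "Page|Label" is reduced to the page title.

```python
import re

# Hypothetical excerpt in the "[[Actor]] as [[Character]]" style described above.
sample_cast = """
*[[Sean Bean]] as [[Eddard Stark]]
*[[Kit Harington]] as [[Jon Snow]]
*[[Emilia Clarke]] as [[Daenerys Targaryen|Daenerys]]
"""

cast_pattern = r"\[\[(.*?)\]\] as \[\[(.*?)\]\]"


def parse_cast(wikitext):
    """Return {character: {"actor": ..., "link": ...}} for every cast line."""
    cast = {}
    for actor, character in re.findall(cast_pattern, wikitext):
        # For piped links ("Page|Label") only the page title is used for the link.
        page = character.split("|")[0]
        cast[character] = {"actor": actor, "link": page.replace(" ", "_")}
    return cast


cast = parse_cast(sample_cast)
print(cast["Jon Snow"])
```

Note that the dictionary key keeps the full link text, including any "|Label" part, exactly as the notebook code does.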

With a compiled list of all characters present in the Game of Thrones series, we can extract the character wiki-pages. This is done in the same way as above, simply replacing the season page title with the character page title.

In [None]:
links = get_vals(characters, 'link')
for link in links:
    j = api_call(link)
    if j['batchcomplete'] != '':
        print('error:', link)
    else:
        txt = get_vals(j, '*')[0]
        if "REDIRECT" in txt:
            # Follow the redirect to the page it points at
            tmp = re.findall(r"\[\[(.*)\]\]", txt)
            j = api_call(tmp[0].replace(" ", "_"))
            txt = get_vals(j, '*')[0]
        # The with-block closes the file automatically
        with open(base_path + 'got/' + link + ".txt", "w") as text_file:
            text_file.write(txt)

Note that if the returned text contains "REDIRECT", we perform a new API call to the page specified by the redirect. Finally, all the returned character pages are saved as *.txt* files.
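
The redirect handling can be sketched in isolation as follows. This is a minimal example, assuming a redirect page consists of a line such as `#REDIRECT [[Target Page]]`; the helper name `redirect_target` is hypothetical and not part of the project code.

```python
import re


def redirect_target(wikitext):
    """Return the redirect target as a page title with underscores, or None."""
    match = re.search(r"#REDIRECT\s*\[\[(.*?)\]\]", wikitext, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).replace(" ", "_")


print(redirect_target("#REDIRECT [[Eddard Stark]]"))
print(redirect_target("Eddard Stark was the head of [[House Stark]]."))
```

Anchoring the pattern on the `#REDIRECT` keyword avoids picking up an ordinary link when a regular page merely mentions the word "REDIRECT".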

<h4 align="center"> Creating the Game of Thrones Network </h4>

We can now generate the overall Game of Thrones network containing all the listed characters as network nodes. This is done by iterating through all the extracted character *.txt* files. The whole code chunk can be seen below but is easier to understand when separated into specific parts and purposes.

**Extracting character links**:
To compute edges for our Game of Thrones network, we iterate through all the generated *.txt* files and look for links to other Game of Thrones characters. This is done using the following:
```py
pattern = r"\[\[(.*?)\]\]"
links = re.findall(pattern, node_description)
links, counts = np.unique([link.replace(" ", "_") for link in links if link.replace(" ", "_") in char_list], return_counts=True)
final_links = []
for i, link in enumerate(links):
    final_links.append((name, link, counts[i]))
G.add_weighted_edges_from(final_links)
```

All links on the Game of Thrones wiki-page are enclosed by double brackets \[\[link]]. For each saved *.txt* file, we search for all links in the file and check whether each link points to a Game of Thrones character. We also count how many times each character is linked; this count becomes the weight of the edge. Finally, all the found links and their respective weights are added as edges from the given node. This is done for all nodes in the network.
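
To make the edge weights concrete, here is a minimal sketch of the same counting idea using only the standard library (`collections.Counter` in place of `np.unique`). The sample text, the character set and the function name `weighted_edges` are assumptions made for the illustration.

```python
import re
from collections import Counter


def weighted_edges(source, wikitext, known_characters):
    """Return (source, target, weight) tuples, one per linked known character."""
    links = re.findall(r"\[\[(.*?)\]\]", wikitext)
    targets = [link.replace(" ", "_") for link in links]
    # Links that are not character pages (e.g. locations) are dropped here.
    counts = Counter(t for t in targets if t in known_characters)
    return [(source, target, n) for target, n in sorted(counts.items())]


# Hypothetical page text and character list.
sample_text = "[[Jon Snow]] meets [[Sansa Stark]]. Later [[Jon Snow]] leaves [[Winterfell]]."
known = {"Jon_Snow", "Sansa_Stark", "Arya_Stark"}

edges = weighted_edges("Arya_Stark", sample_text, known)
print(edges)
```

The resulting tuples have the shape expected by `G.add_weighted_edges_from`, so a character linked twice ends up with an edge of weight 2.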

**Extracting character attributes**:
On the wiki-page, most of the Game of Thrones characters have a fact box containing generic information such as allegiance, religion, culture, status, number of episodes and more. We extract information regarding culture, religion and allegiance using the following regular expressions:

```py
culture_pattern = r'((?<=Culture[\s\=\s]{1})|(?<=Culture[\=\s]{2})|(?<=Culture[\s\=\s]{3}))\[\[(.*?)\]\]'
religion_pattern = r'((?<=Religion[\s\=\s]{1})|(?<=Religion[\=\s]{2})|(?<=Religion[\s\=\s]{3}))\[\[(.*?)\]\]'
allegiance_pattern = r'((?<=Allegiance[\s\=\s]{1})|(?<=Allegiance[\=\s]{2})|(?<=Allegiance[\s\=\s]{3}))\[\[(.*?)\]\]'
```
Each of the regular expressions above contains three positive lookbehinds. A positive lookbehind is written *?<=* and here checks that the match is preceded by the given attribute name, e.g. Culture. The wiki-page uses three different formats in its fact boxes: *"Culture = "*, *"Culture= "* or *"Culture ="*, which is why several lookbehind cases are needed (Python's `re` module requires each lookbehind to have a fixed width). If any of the lookbehinds matches, we extract the information enclosed in \[\[]]. In short, we only extract information enclosed in \[\[]] if the brackets are preceded by one of the three lookbehind cases. We extract Status and the number of Appearances in a similar way.
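
For comparison, the same fact-box formats can also be matched without lookbehinds by allowing optional whitespace around the equals sign. The sketch below is a simplified alternative, not the patterns used in this project, and the helper name `infobox_link` is hypothetical.

```python
import re


def infobox_link(field, wikitext):
    """Return the first [[...]] value that follows 'field =', or None."""
    pattern = re.escape(field) + r"\s*=\s*\[\[(.*?)\]\]"
    match = re.search(pattern, wikitext)
    return match.group(1) if match else None


# The three spacing variants mentioned above.
variants = ["| Culture = [[Andals]]", "| Culture= [[Andals]]", "| Culture =[[Andals]]"]
results = [infobox_link("Culture", line) for line in variants]
print(results)
```

Because the attribute name is part of the match rather than a lookbehind, the value is read from a single capture group and no fixed-width restriction applies.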

**Minor pre-processing of extracted attributes**
There are some inconsistencies in naming on the wiki-pages. For instance, both *Andal* and *Andals* are used for a character's culture. This is handled with simple if-statements as seen below. Other inconsistencies are handled similarly.

```py
    if culture == 'Andal':
        culture = 'Andals'
```
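
When the number of special cases grows, the same normalization can be expressed as a lookup table. The following is a small sketch of that idea; the names `CULTURE_ALIASES` and `normalize` are assumptions for the example, while the two aliases are the ones handled in this notebook.

```python
# Map each variant spelling to the canonical name used in the network.
CULTURE_ALIASES = {
    "Andal": "Andals",
    "Valyrian": "Valyrians",
}


def normalize(value, aliases):
    """Return the canonical form of value, or value itself if it has no alias."""
    return aliases.get(value, value)


print(normalize("Andal", CULTURE_ALIASES))
print(normalize("Dothraki", CULTURE_ALIASES))
```

A table keeps all special cases in one place, which makes it easier to review them than a chain of if-statements.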

**Extraction of thumbnail and formatting of text for network application**
In order to extract links to thumbnail images for the network app seen on the webpage, we made use of the `python` library `BeautifulSoup`. This is done by parsing the html code of the given character site, and looking for the keyword *img* of class *pi-image-thumbnail* and extracting the *src* link. This extracted image link is then added to the network as an attribute for the use of the network app. 

Similarly, the extracted character name, attributes, most used words and a link to the character's wiki-page are also added to the network as a string under the attribute *text*. This is also for use in the network app. The network app is explained further in Section 3.1, and how the most used words are found is explained in Section 3.2.
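
The thumbnail lookup can be illustrated on a static HTML snippet. To keep the example free of third-party dependencies, this sketch swaps `BeautifulSoup` for the standard library's `html.parser`; the class name `ThumbnailFinder`, the helper `find_thumbnail` and the sample HTML with its `example.org` URL are all assumptions made for the illustration.

```python
from html.parser import HTMLParser


class ThumbnailFinder(HTMLParser):
    """Remember the src of the first <img> carrying the pi-image-thumbnail class."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag != "img" or self.src is not None:
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "pi-image-thumbnail" in classes:
            self.src = attributes.get("src")


def find_thumbnail(html):
    """Return the thumbnail URL found in html, or None if there is none."""
    finder = ThumbnailFinder()
    finder.feed(html)
    return finder.src


# Hypothetical HTML fragment of a character page.
sample_html = '<figure><img class="pi-image-thumbnail" src="https://example.org/jon.png"></figure>'
print(find_thumbnail(sample_html))
```

Returning `None` when no thumbnail exists lets the caller choose a fallback image explicitly.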

The code used for creating the compiled Game of Thrones network, together with the helper functions that extract image thumbnails for the network app, is shown below.

In [None]:
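# Code for extracting thumbnail images for network app
def getdata(url):
    # Download the raw HTML of a page
    r = requests.get(url)
    return r.text

def get_img(name):
    # Return the src of the infobox thumbnail on the character's wiki page.
    # Raises TypeError if the page has no such image (soup.find returns None).
    htmldata = getdata("https://gameofthrones.fandom.com/wiki/"+name)
    soup = BeautifulSoup(htmldata, 'html.parser')
    image = soup.find('img', attrs={"class":"pi-image-thumbnail"})
    return image['src']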

In [None]:
#Creating the overall game of thrones network
files = os.listdir(base_path+"got/")
char_list = [file.split('.txt')[0] for file in files] # List of all characters 


G = nx.DiGraph() #Create empty Directed Graph Object

# Most used words (TF-IDF); computed once, since the call does not depend on the current file
tf_idf_char = tf_idf_func()

for file in files: # Iterating through all the saved .txt files for characters

    with open(base_path+"got/" + file) as f:
        node_description = f.read() # read in current .txt file
    name = file.split(".txt")[0] # name of current character

    if name == "Royal_Steward": #Special case
        continue 
    pattern = r"\[\[(.*?)\]\]" # Pattern for finding links to other characters in the text file
    links = re.findall(pattern, node_description)

    # Extracting only links to other characters and count of how many times a character is linked.
    links, counts = np.unique([link.replace(" ", "_") for link in links if link.replace(" ", "_") in char_list], return_counts=True)

    # Regular expressions for extracting character attributes
    culture_pattern = r'((?<=Culture[\s\=\s]{1})|(?<=Culture[\=\s]{2})|(?<=Culture[\s\=\s]{3}))\[\[(.*?)\]\]'
    religion_pattern = r'((?<=Religion[\s\=\s]{1})|(?<=Religion[\=\s]{2})|(?<=Religion[\s\=\s]{3}))\[\[(.*?)\]\]'
    allegiance_pattern = r'((?<=Allegiance[\s\=\s]{1})|(?<=Allegiance[\=\s]{2})|(?<=Allegiance[\s\=\s]{3}))\[\[(.*?)\]\]'

    status_matches = re.findall(r"\|.*Status.*\[\[.*\|(.*?)\]\]", node_description)
    status = status_matches[0] if status_matches else ""
    appearance_matches = re.findall(r"\|.*Appearances.* (\d+)", node_description)
    appearances = appearance_matches[0] if appearance_matches else ""
    
    allegiance = re.findall(allegiance_pattern, node_description)[0] if len(re.findall(allegiance_pattern, node_description))>0 else ""
    allegiance = allegiance[1] if len(allegiance)>1 else allegiance
    allegiance = allegiance.split('|')[1] if len(allegiance.split('|'))>1 else allegiance
    
    culture = re.findall(culture_pattern, node_description)[0] if len(re.findall(culture_pattern, node_description)) > 0 else ""
    culture = culture[1] if len(culture)>1 else culture
    culture = culture.split('|')[1] if len(culture.split('|'))>1 else culture

    religion = re.findall(religion_pattern, node_description)[0] if len(re.findall(religion_pattern, node_description)) >0 else "" 
    religion = religion[1] if len(religion)>1 else religion
    religion = religion.split('|')[1] if len(religion.split('|'))>1 else religion
    
    #Special cases
    if name == 'Tommen_Baratheon':
        allegiance = 'House Baratheon of King\'s Landing'

    if culture == 'Andal':
        culture = 'Andals'
    if culture == 'Valyrian':
        culture = 'Valyrians'

    if allegiance == 'King of the Andals and the First Men':
        allegiance = 'King of the Andals, the Rhoynar, and the First Men'

    if religion == '':
        religion = "No known religion"
    
    # Text attribute for network app
    name_for_text = name.replace('_', ' ').capitalize()
    if len(name_for_text.split(' '))> 1:
        name_for_text = name_for_text.split(' ')[0].capitalize() + ' ' + name_for_text.split(' ')[1].capitalize()
    text = "The selected character: " + name_for_text + " has the following attributes: "+ \
            "\n    - Allegiance: " + allegiance + \
            "\n    - Religion: " + religion + \
            "\n    - Culture: "+culture + \
            "\nMost used words by " + name_for_text + " are: " +  \
            ", ".join([word[0] for word in sorted(tf_idf_char[name].items(), key=lambda value: value[1], reverse = True)[:5]]) + \
            "\nLink to character page are: https://gameofthrones.fandom.com/wiki/"+ name.replace(' ', '_')
    
    # Thumbnail image for network app
    try:
        thumbnail = get_img(name)
    except Exception:
        # Fall back to the Iron Throne image when the page has no thumbnail or the request fails
        site = 'Iron_Throne'
        thumbnail = get_img(site)

    # Add given character as a network node with the found attributes
    G.add_node(name, **{"status": status, "appearances": appearances, "culture":culture,
                    'allegiance': allegiance,"religion": religion, "text" : text, "thumbnail": thumbnail})
    
    # Compiling the found links into network edges with weights = counts of links
    final_links = []
    for i, link in enumerate(links):
            final_links.append((name,link,counts[i]))    
    G.add_weighted_edges_from(final_links)
    

<h4 align = "center"> Creating the season Networks </h4>


To create networks for a single season, we use an approach similar to the one used for the overall network. Most character pages on the Game Of Thrones wiki have subsections describing the character's story progression in each season. This information is used to create the networks for each season.

We start by creating a nested dictionary that, for each season, holds a dictionary of characters with their respective actors and wiki-page links. Characters are extracted exactly as for the overall network, the only difference being that this is done separately for each season.

The code for this can be seen below.

In [None]:
names = {}
pattern1 = r"\[\[(.*?)\]\] as \[\[(.*?)\]\].*"
for i in range(1, 9):
    tmp = api_call(base_cat + str(i))
    txt = get_vals(tmp, '*')[0]
    tmp = re.findall(pattern1, txt)
    # For piped links ("Page|Label"), the part before the pipe is the page title
    names[i] = {char: {"actor": actor,
                       "link": char.split("|")[0].replace(" ", "_")}
                for actor, char in tmp}

We now need to extract the text for the characters. This is done by iterating through the seasons, using the dictionary created above that contains all characters present in a given season. The character links are taken from the dictionary, and the character page is fetched through the wiki API. The relevant text for a given season is extracted by splitting the page text at the season heading, as seen below. Again, the season subsections do not use a consistent heading format, so several spacing variants of the heading have to be handled.

The extracted text snippet for each season for a given character is then saved as a *.txt* file.

In [None]:
for i in range(1,9):
    links = get_vals(names[i],'link')
    # Season heading such as "===[[Game of Thrones Season 1|Season 1]]===",
    # with zero to two spaces on either side of the link
    heading = re.compile(r"=== {0,2}\[\[Game of Thrones Season " + str(i) + r"\|Season " + str(i) + r"\]\] {0,2}===")
    for link in links:
        j = api_call(link)
        if j['batchcomplete'] != '':
            print('error:', link)
        else:
            txt = get_vals(j, '*')[0]
            if "REDIRECT" in txt:
                tmp = re.findall(r"\[\[(.*)\]\]", txt)
                j = api_call(tmp[0].replace(" ", "_"))
                txt = get_vals(j, '*')[0]
            parts = heading.split(txt, maxsplit=1)
            if len(parts) < 2:
                continue  # no section for this season on the character page
            # Keep only the text up to the next heading
            txt = parts[1].split("==")[0]
            with open(base_path+'got2/s'+str(i)+'/'+link+".txt", "w") as text_file:
                text_file.write(txt)

We now know all characters for a given season and have extracted the relevant part of their character pages. Hence we can create the season networks using almost the same approach as for the overall Game of Thrones network, but with the seasonal character texts. Instead of extracting character attributes again with regular expressions, we reuse the attributes already stored in the overall Game of Thrones network.

One minor tweak for the seasonal networks is needed because many character references in the seasonal text files are not links but plain mentions of a first name. To take this into account, we altered the way we extract character links slightly. Instead of using a regular expression to match the link syntax, we search the text for all character names: both the first name alone and first name + last name.
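
The name-based matching can be sketched as below. This is a simplified variant and not the exact logic of the notebook: since every full-name mention also contains the first name, it counts first-name occurrences only, and it adds word boundaries so that a name is not matched inside a longer word. The function name `count_mentions` and the sample sentence are assumptions for the example.

```python
import re


def count_mentions(text, full_names):
    """Count mentions per character by first name, keyed by underscore page names."""
    counts = {}
    for full_name in full_names:
        first_name = full_name.split(" ")[0]
        # re.escape guards against names containing regex metacharacters.
        pattern = r"\b" + re.escape(first_name) + r"\b"
        hits = len(re.findall(pattern, text))
        if hits:
            counts[full_name.replace(" ", "_")] = hits
    return counts


# Hypothetical season text.
sample = "Jon Snow rides north. Jon meets Sansa at Winterfell, and Sansa Stark welcomes Jon."
mentions = count_mentions(sample, ["Jon Snow", "Sansa Stark", "Bronn"])
print(mentions)
```

One limitation of any first-name approach is that characters sharing a first name cannot be told apart, so their counts are inflated.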

In [None]:
# Note: read_gpickle/write_gpickle were removed in NetworkX 3.0, so this notebook assumes an older version
G_raw = nx.read_gpickle(base_path+"got_G.gpickle")

# Create attribute dict for seasonal networks from the overall network
attribute_dict = dict(G_raw.nodes(data=True))

In [None]:
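# Create network objects for each Game of Thrones season
char_list_all = [f.split('.txt')[0] for f in os.listdir(base_path+"got_cleaned/")]

# List of all character names. File names use underscores while the page text uses spaces,
# so convert before searching the text (needed for the first-name matching below).
char_names_cap = [char.replace('_', ' ') for char in char_list_all]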

for i in range(1,9):
    tf_idf_season = tf_idf_func(i) # Computation of most used words
    files = os.listdir(base_path+"got2/s"+str(i) +"/") # List of all character files for a given season
    char_list = [file.split('.txt')[0] for file in files] #List of all characters
    G = nx.DiGraph() # Create Directed Graph Object
    for file in files: #Iterate through character files for a given season
        with open(base_path+"got2/s"+str(i)+'/' + file) as f:
            node_description = f.read()
        name_char = file.split(".txt")[0]

        # Special case
        if name_char == "Royal_Steward":
            continue 
        
        #pattern = "\[\[(.*?)\]\]" # Pattern for extracting links
        # links = re.findall(pattern, node_description)

        #Extract character links by searching through txt file for character names.
        links = []
        for name in char_names_cap:

            if len(re.findall(name, node_description)) >0 :
                links.extend(re.findall(name, node_description))
            elif len(name.split(' '))>1:
                if len(re.findall(name.split(" ")[0], node_description)) >1:
                    tmp_link = re.findall(name.split(" ")[0], node_description)
                    if len(tmp_link) > len(re.findall(name, node_description)):
                        links.extend([name] * (len(tmp_link) - len(re.findall(name, node_description))))

        # Count of each unique link
        links, counts = np.unique([link.replace(" ", "_") for link in links if link.replace(" ", "_") in char_list], return_counts=True)
        # print(name_char)

        # Saving attributes for network app
        attributes = attribute_dict[name_char]  
        name_for_text = name_char.replace('_', ' ').capitalize()
        if len(name_for_text.split(' '))> 1:
            name_for_text = name_for_text.split(' ')[0].capitalize() + ' ' + name_for_text.split(' ')[1].capitalize()
        attributes['text'] = "The selected character: " + name_for_text + " has the following attributes: "+ \
            "\n    - Allegiance: " + attributes['allegiance'] + \
            "\n    - Religion: " + attributes['religion'] + \
            "\n    - Culture: "+attributes['culture'] + \
            "\nMost used words by " + name_for_text + " are: " +  \
            ", ".join([word[0] for word in sorted(tf_idf_season[name_char].items(), key=lambda value: value[1], reverse = True)[:5]]) + \
            "\nLink to character page are: https://gameofthrones.fandom.com/wiki/"+ name_char.replace(' ', '_')
        attributes['thumbnail'] = get_img(name_char)
        attributes['wordcloud'] = ""

        # Create node for given character with extracted attributes
        G.add_node(name_char, **attributes)
        # Add edges to node
        final_links = []
        for j, link in enumerate(links):
                final_links.append((name_char,link,counts[j]))    
        G.add_weighted_edges_from(final_links)
    # Save .gpickle file containing the network.
    nx.write_gpickle(G, base_path+"got_G_s"+str(i)+".gpickle")
        

<h3 align = "center"> 2.1.2 The Dialogue of Game of Thrones </h3>

The second data set used in this project is the dialogue of the Game of Thrones series, which can be found [here](https://raw.githubusercontent.com/jeffreylancaster/game-of-thrones/master/data/script-bag-of-words.json). For each episode in the series, it contains the episode name and all the dialogue of that episode. As the data is already in `.json` format, it is simply loaded, and we create a dictionary of characters and all their dialogue throughout the show.


In [None]:
resp = requests.get("https://raw.githubusercontent.com/jeffreylancaster/game-of-thrones/master/data/script-bag-of-words.json")

diag = json.loads(resp.text)

# Map each character name to the list of all their lines across the show
char_diag = {}
for element in diag:
    for textObj in element['text']:
        char_diag.setdefault(textObj['name'], []).append(textObj['text'])

No pre-processing was needed for this dataset.
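
Since parts of the analysis are carried out per season, the same data can also be grouped by season. The sketch below assumes the key names read by the loading code (`seasonNum`, `text`, `name`); the tiny sample and the function name `dialogue_by_season` are hypothetical.

```python
from collections import defaultdict

# Tiny hypothetical sample that mirrors the keys of the dialogue file.
sample_episodes = [
    {"seasonNum": 1, "episodeNum": 1, "episodeTitle": "Episode A",
     "text": [{"name": "Jon Snow", "text": "first line"},
              {"name": "Arya Stark", "text": "second line"}]},
    {"seasonNum": 2, "episodeNum": 1, "episodeTitle": "Episode B",
     "text": [{"name": "Jon Snow", "text": "third line"}]},
]


def dialogue_by_season(episodes):
    """Return {season: {character: [lines]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for episode in episodes:
        for entry in episode["text"]:
            grouped[episode["seasonNum"]][entry["name"]].append(entry["text"])
    return grouped


by_season = dialogue_by_season(sample_episodes)
print(by_season[1]["Jon Snow"])
```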

<h3 align = "center"> 2.1.3 IMDb ratings and reviews </h3>

The last data set we used is the IMDb ratings and reviews of the show's episodes. It was extracted using the Python library [IMDbPY](https://imdbpy.github.io/). The data contains the average rating of each episode and demographic information about the reviewers, such as gender and age. To use the `IMDbPY` package, the IMDb ID of each episode was needed. These were collected manually:


In [None]:
imdb_ids = {'season 1': ["1480055", "1668746", "1829962", "1829963","1829964","1837862","1837863","1837864","1851398",
                          "1851397"],
            'season 2' : ["1971833","2069318","2070135","2069319","2074658","2085238","2085239","2085240","2084342",
                            "2112510"],
            'season 3' : ["2178782","2178772","2178802","2178798","2178788","2178812","2178814","2178806","2178784",
                            "2178796"],
            'season 4' : ["2816136","2832378","2972426","2972428","3060856","3060910","3060876","3060782","3060858",
                            "3060860"], 
            'season 5' : ["3658012","3846626","3866836","3866838","3866840","3866842","3866846","3866850","3866826",
                            "3866862"],
            'season 6' : ["3658014","4077554","4131606","4283016","4283028","4283054","4283060","4283074","4283088","4283094"],
            'season 7' : ["5654088","5655178","5775840","5775846","5775854","5775864","5775874",], 
            'season 8' : ["5924366","6027908","6027912","6027914","6027916","6027920"]
            }

Next, we create an instance of the IMDb class and iterate through all the episode IDs listed above. For each episode we extract the reviews and the rating votes. Unfortunately, this method limited us to 25 reviews per episode. The extracted information is stored in a dictionary and saved as a *.json* file, as can be seen below.

In [None]:
import json
from imdb import IMDb

# create an instance of the IMDb class
ia = IMDb()
#Extract reviews and rating votes for every episode:
imdb_reviews = {}
for season, episodes in imdb_ids.items():
    imdb_reviews[season] = {}
    for episode in episodes:
        reviews = ia.get_movie_reviews(episode)
        votes = ia.get_movie_vote_details(episode)
        #Print the number of reviews returned for this episode:
        print(len(reviews['data']['reviews']))
        imdb_reviews[season][episode] = {"reviews" : reviews['data']['reviews'],
                                        "ratings" : votes['data']}

#Save the result as a .json file:
with open(base_path+'imdb_reviews.json','w') as f:
    json.dump(imdb_reviews, f, indent = 4)

Again, no pre-processing of this data was needed.
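
Since the package returned at most 25 reviews per episode, a quick sanity check on the saved structure can be useful. The sketch below is not part of the original pipeline: `count_reviews` is a hypothetical helper, and the toy dictionary only mimics the nesting of `imdb_reviews` (season, then episode ID, then `reviews` and `ratings`) with placeholder values.

```python
def count_reviews(imdb_reviews):
    """Return {season: {episode_id: number of stored reviews}}."""
    return {
        season: {ep_id: len(content["reviews"]) for ep_id, content in episodes.items()}
        for season, episodes in imdb_reviews.items()
    }

# Toy data with the same nesting as imdb_reviews (IDs and reviews are placeholders):
toy = {"season 1": {"0000001": {"reviews": ["a", "b"], "ratings": {}},
                    "0000002": {"reviews": [], "ratings": {}}}}
print(count_reviews(toy))
```

Run on the real dictionary, this would show at a glance which episodes have few or no reviews.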

<h2 align = "center">2.2 Basic statistics </h2>

In this section we investigate the basic statistics of the Game Of Thrones network built on the full data from the wiki-pages, which can be found [here](https://github.com/MikkelMathiasen23/GameOfThrones_Network/raw/master/data/got_G.gpickle). Further, we investigate the properties of the Game Of Thrones dialogue, which can be found [here](https://raw.githubusercontent.com/jeffreylancaster/game-of-thrones/master/data/script-bag-of-words.json). Lastly, we investigate the reviews and ratings data from IMDb, which were gathered using a Python package. The data can be found [here](https://github.com/MikkelMathiasen23/GameOfThrones_Network/raw/master/data/imdb_reviews.json).

Each of these exploratory analyses has its own subsection to make it easier to find in this notebook.

<h3 align = "center"> 2.2.1 Wiki-page data </h3>

We start out by loading the pickle file containing the network and all its attributes. The network contains the edges between all characters, and each node, which represents a character, has five attributes: status, appearances, culture, allegiance and religion, which we are going to investigate further.

In [None]:
#Load network
G = nx.read_gpickle(base_path+"got_G.gpickle")

#Make space in a property dict for the values:
property_dict = {
    "status": [],
    "appearances": [],
    "culture" : [],
    "allegiance": [],
    "religion" : []
}
#Allocate space for characters and define attributes: 
attributes = ["status", "appearances", "culture", "allegiance", "religion"]
characters = []
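
Note that `nx.read_gpickle` was removed in NetworkX 3.0, so the load above only works on older installations. A minimal sketch of an alternative, assuming the file is a plain pickle of the graph object; `load_pickled_graph` is a hypothetical helper name and not part of the original notebook:

```python
import pickle

def load_pickled_graph(path):
    """Load a pickled graph object from disk; only use this on trusted files."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage (assumes base_path is defined as elsewhere in this notebook):
# G = load_pickled_graph(base_path + "got_G.gpickle")
```

Unpickling still requires NetworkX to be installed, since the stored object is a NetworkX graph.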

First, we are going to investigate the number of characters and edges in the network.

Next, we are going to look into how the characters are distributed across the different allegiances, religions and cultures, how many characters die during the series, and how often the characters appear.

In [None]:
print(f'The network contains: {G.number_of_nodes()} nodes and {G.number_of_edges()} number of edges.')

The network contains: 162 nodes and 3085 number of edges.


In [None]:
#Iterate through all the nodes in the network and look into their attributes. 
for x,y in G.nodes(data = True): 
    #Save the character name
    node_name = x
    #Append name to save all names.
    characters.append(node_name.replace('_', ' '))

    #For each of the five attributes save their value:
    for attribute in attributes:
        #A number of special handles if the value is empty then save No known XXX
        if attribute == "appearances":
            if y[attribute] == "":
                yat = 0
            else:
                yat = int(y[attribute])
        elif attribute == "allegiance":
            if y[attribute] == "":
                yat = "No known allegiance"
            else:
                yat = y[attribute]
        elif attribute == "culture":
            if y[attribute] == "":
                yat = "No known culture"
            else: 
                yat = y[attribute]
        elif attribute == "status":
            if y[attribute] == "Place = [[Haystack Hall" or y[attribute] == '':
                yat = 'Unknown status'
            else: 
                yat = y[attribute]
        else:
            yat = y[attribute]
        
        #Save the attribute in the property_dict for plotting:
        property_dict[attribute].append(yat)   
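
The special cases in the loop above can also be expressed as a small lookup of default values. This is only a sketch of the same cleaning rules; `DEFAULTS` and `clean_attribute` are hypothetical names introduced for illustration:

```python
# Placeholder used when an attribute is empty (same labels as in the loop above):
DEFAULTS = {
    "appearances": 0,
    "allegiance": "No known allegiance",
    "culture": "No known culture",
    "status": "Unknown status",
}

def clean_attribute(attribute, value):
    """Return a cleaned attribute value, falling back to a default when empty."""
    # Known parsing artefact in the status field:
    if attribute == "status" and value == "Place = [[Haystack Hall":
        value = ""
    if value == "":
        # Attributes without a default (e.g. religion) are kept as they are
        return DEFAULTS.get(attribute, value)
    return int(value) if attribute == "appearances" else value

print(clean_attribute("appearances", "12"), clean_attribute("culture", ""))
```

Keeping the defaults in one dictionary makes it easy to add a placeholder for another attribute later.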

Now that we have saved all the attributes and character names, we are ready to plot them. This is done using `plotly` to make the figures interactive. All figures are saved in a dict, so after making the figures we can visualize them one by one.

In [None]:
#Convert dict to dataframe for easy plotting in plotly:
df_property = pd.DataFrame.from_dict(property_dict, orient = "columns")

#Allocate space for all dataframes and figures:
dfs = {}
figs ={}

#Iterate through all 5 attributes:
for attribute in attributes:
    #Count number of occurrences of the categories for the given attribute:
    dfs[attribute] = df_property[attribute].value_counts()
    #Reset index to have the attribute values as column instead of index:
    dfs[attribute] = dfs[attribute].reset_index()
    #Convert eg. status to Status for niceness in plot:
    dfs[attribute].columns = [attribute.capitalize(), "Counts"]

    #Special handling for appearances as we don't want to see zeros:
    if attribute == 'appearances':
        dfs['appearances'] = dfs['appearances'][dfs['appearances']['Appearances'] != 0]
    #Make and save figure:
    figs[attribute] = px.bar(dfs[attribute], x=attribute.capitalize(),
             y="Counts", color=attribute.capitalize(), title="Distribution of character "+attribute)

Next, we are going to look at the plots one by one: 

In [None]:
figs["religion"].show()

From this it is apparent that the majority of the characters do not have a known religion, but if we toggle this category off, we can see that most of the remaining characters follow the *Faith of the Seven* or the *Old Gods of the Forest*. The least frequent religions are *White Walkers* and *Ghiscari religion*. From basic knowledge of the Game Of Thrones universe it makes sense that the two most popular religions are the *Faith of the Seven* and the *Old Gods of the Forest*, as *The Seven Kingdoms* practice the *Faith of the Seven* whereas the people in the *North* practice the *Old Gods of the Forest*.

Further, it should be noted that the Game Of Thrones universe contains 8 different religions based on the [Wiki pages](https://gameofthrones.fandom.com/wiki/Game_of_Thrones_Wiki). 

In [None]:
figs["allegiance"].show()

Again, some of the characters do not have an associated allegiance. The two most frequent allegiances are *House Stark* and *House Lannister*, followed by *Night's Watch* and *House Targaryen*. These are also the main allegiances in Game Of Thrones and the allegiances of the main characters in the series.

*House Lannister* has characters such as Cersei, Jaime and Tyrion, whereas *House Stark* has Robb, Bran and the bastard Jon Snow. Jon Snow is one of the most well-known characters of the series and is also part of the *Night's Watch*, which plays a big role later in the series when the battle against the *White Walkers* takes place. Lastly, *House Targaryen* is a house which has been beaten down, but as the series evolves Daenerys becomes a larger player in the universe as she conquers the world part by part.

In [None]:
figs["culture"].show()

From the above figure it can be seen that the most prominent culture is *Andals* followed by *Northmen*; again, a large group has an unknown culture. Most of the characters with a known culture thus belong to the *Andals* or the *Northmen*. Further, it should be noted that the universe contains a lot of small cultures such as *Children of the Forest*.

The *Andals* are the people who invaded Westeros early in the history of the universe, and are the dominant group. The *Northmen* are also a big cultural group, defined by all the characters living in the North of the Game Of Thrones world. The *Children of the Forest* are a small group of characters which are introduced fairly late in the series. They are small non-human characters, said to be the original inhabitants of Westeros.

In [None]:
figs["status"].show()

From the plot we can see that 121 characters die throughout the series, and anyone who has seen the series would be able to confirm that a lot of characters die as the series progresses. In the figure below the distribution of the number of episode appearances can be seen.

In [None]:
figs["appearances"].show()

Again, a lot of characters do not have this attribute on their character page, and these observations have been omitted in the figure above. We can see that the majority of the characters only appear a couple of times, i.e. below 10-15 appearances. This would make sense as a lot of the characters are not main characters and therefore only appear in a single season or similar. We can further see small groups around 40 and 60 appearances, which could indicate that a small group of characters appears in most episodes. This would be expected as the series has a couple of main characters.

<h3 align="center"> 2.2.2 Character dialogue </h3>

Next, we are going to do some exploratory analysis of the character dialogue. The dialogue comes as a list of JSON objects, each of which contains information about which episode and season it comes from. Each episode object also contains a list of objects, each holding the name of the character who speaks and the dialogue of that character.

We start out by loading the dialogue from a GitHub page.

In [None]:
#Download data and load as json:
resp = requests.get("https://raw.githubusercontent.com/jeffreylancaster/game-of-thrones/master/data/script-bag-of-words.json")
diag = json.loads(resp.text)

#Allocate space for character dialog:
char_diag = {}
char_count = {}

Next, we are going to pre-process the dialogue into two dictionaries. The first dictionary, `char_diag`, has the character name as key and the dialogue of the character as value.

The other dictionary, `char_count`, contains the episodes and seasons in which the given character occurs. This is used to investigate the number of appearances of each character. Further, we save the length of each character's dialogue.

In [None]:
#Iterate through each episode object:
for element in diag:

    #Extract season and episode number and build an episode ID such as S1E1:
    episode = element['episodeNum']
    season = element['seasonNum']
    ep_id = "S"+str(season)+"E"+str(episode)

    #Iterate through all dialogue objects of the episode:
    for textObj in element['text']:
        name = textObj['name']

        #If the character name is not in our character network then continue:
        if name not in characters:
            continue
        #If the character is not in the dictionaries yet, create the entries:
        if name not in char_diag:
            char_diag[name] = []
            char_count[name] = {'episodes': [], 'seasons': [], 'diag': 0}

        #Append text:
        char_diag[name].append(textObj['text'])
        #Keep track of which episodes and seasons the character speaks in:
        if ep_id not in char_count[name]['episodes']:
            char_count[name]['episodes'].append(ep_id)
        if season not in char_count[name]['seasons']:
            char_count[name]['seasons'].append(season)
        #Add the number of tokens, also for the character's first line:
        char_count[name]['diag'] += len(word_tokenize(textObj['text']))

Having saved the data in two dictionaries, we convert `char_count` into a dataframe, which makes it easier to plot afterwards.

Simultaneously, we count the number of unique episodes and seasons each character occurs in.

In [None]:
df_diag = pd.DataFrame(
    #Iterate through all characters:
    {"character" : [char for char in char_count.keys()],
    #Count number of episodes the character is present in:
    'Character episode count': [len(v['episodes']) for char, v in char_count.items()],
    #Count number of seasons the character is present in:
     'Character season count':[len(v['seasons']) for char, v in char_count.items()],
     #Add total dialogue length for each character:
     'Character diag length': [v['diag'] for char, v in char_count.items()]    })

We are now ready to plot the 3 dialogue attributes. This is again done using `plotly`, with bar plots.

In [None]:
fig_char_diag_len = px.bar(df_diag,x = "character", y="Character diag length", title = "Character dialogue length distribution")
fig_char_diag_len.update_layout(xaxis={'categoryorder':'total descending'})

fig_char_diag_len.show()

From the figure above we can see that *Tyrion Lannister* clearly is the character with the most dialogue; anyone who has seen the series knows that Tyrion likes to talk. Next we can see that *Jon Snow*, *Cersei Lannister* and *Daenerys Targaryen* also have a lot of dialogue. This makes sense as these three are among the main characters and appear in a lot of episodes.

In [None]:
fig_char_season  = px.bar(df_diag,x = "character", y="Character season count", title = "Character appearances distribution (season level)")
fig_char_season.update_layout(xaxis={'categoryorder':'total descending'})

fig_char_season.show()

From the figure above it can be seen that a lot of characters are present in all 8 seasons, such as *Jon Snow, Sansa Stark, Tyrion Lannister, Bronn and Samwell Tarly*, and again this is expected as these are key characters. On the other hand, a lot of characters are only present in 1 season, such as *Syrio Forel*, who is Arya Stark's "dancing teacher" when she moves to King's Landing.

In [None]:
fig_char_episode  = px.bar(df_diag,x = "character", y="Character episode count", title = "Character appearances distribution (episode level)")
fig_char_episode.update_layout(xaxis={'categoryorder':'total descending'})

fig_char_episode.show()

From this we can see that the character who appears in most episodes is *Tyrion Lannister*, followed by *Jon Snow, Sansa Stark and Daenerys Targaryen*, which makes perfect sense as these are main characters. A couple of characters are present only once, which clearly indicates that they had a small role in the Game Of Thrones plot.

<h3 align="center"> 2.2.3 Reviews and ratings </h3>

As the last data source we look into the reviews and ratings taken from IMDb. These were gathered using a Python package (see earlier section), as scraping the IMDb webpage, which we initially looked into, is not permitted.

We are going to look into the average rating at episode and season level. Also we are looking into the demographics of the raters/reviewers, and lastly, what is the average review length for each episode/season.

We start out by loading the data which is saved in a `.json` file.  

In [None]:
#Load data and allocate space for ratings:
with open(base_path+"imdb_reviews.json") as f:
    ratings = json.load(f)

episode_rating = {}
season_rating = {}

Now, we want to organize the data to ease the plotting afterwards. The ratings dictionary is organized so that the top-level key is the season, followed by the episode, and each episode contains the reviews and the ratings. We extract and reorganize this so it is easier to plot.

We create two new dictionaries: one for storing the rating on season level and another for storing the rating on episode level. 

In [None]:
#Iterate through each season:
for s, (season, episodes) in enumerate(ratings.items(), start=1):
    season_key = "S" + str(s)
    season_rating[season_key] = 0
    #Iterate through all episodes in this season:
    for c, episode in enumerate(episodes, start=1):
        rating = episodes[episode]['ratings']['demographics']['imdb users']['rating']
        #Add rating:
        season_rating[season_key] += rating
        episode_rating[season_key + " E" + str(c)] = rating
    #Divide by number of episodes in this season:
    season_rating[season_key] = season_rating[season_key]/len(episodes)

Now that we have organized the data, we are ready to investigate the average rating per season and episode.

In [None]:
#Convert to dataframe for easy plotting:
df_season_rating = pd.DataFrame.from_dict(season_rating, orient= 'index')
#Reset index so we have a column named season instead:
df_season_rating = df_season_rating.reset_index()
df_season_rating.columns = ['Season', "IMDB rating"]
#Plot it:
fig_season_rating = px.bar(df_season_rating, x="Season",
             y="IMDB rating", color="Season", title="Rating per season")
fig_season_rating.show()

Next we will look at the average rating per episode, to see if we can find any patterns. From the figure below we see approximately the same pattern as above, but we can now see that the last 2 episodes in a season often achieve a higher average score compared to the middle episodes. Further, it should be noticed that from the beginning of season 8 the episodes keep getting lower average scores, and the last episode in season 8 achieves a quite low score of only 4.

In [None]:
#Convert to dataframe:
df_episode_rating = pd.DataFrame.from_dict(episode_rating, orient= 'index')
#Reset index again:
df_episode_rating = df_episode_rating.reset_index()
df_episode_rating.columns = ['Episode', "IMDB rating"]

#Plot:
fig_episode_rating = px.bar(df_episode_rating, x="Episode",
             y="IMDB rating", color="Episode", title="Rating per episode")
fig_episode_rating.show()

In the next part we investigate the demographics of the people who review and rate the Game Of Thrones series. To make this easier we are going to create two functions.

One extracts the number of votes and the rating sum, and the other converts the rating sum into a weighted average rating.

In [None]:
def rating_weighted(cat_dist):
    """
    Function that given a distribution of votes and rating_sum convert
    the rating into a weighted average
    """
    for key, value in cat_dist.items():
        #Iterate through each category and compute the weighted rating score:
        cat_dist[key]['rating_sum'] = value['rating_sum']/value['votes']
    return cat_dist

def dist(item, dict_item):
    """
    Function that iterates through a dict and assign the number of votes and rating sum
    """
    for i in item.keys():
        #If not in dict create the object:
        if i not in dict_item:
            dict_item[i] = {'votes':0, "rating_sum":0}

        #Put values into dict:
        dict_item[i]["votes"] += item[i]['votes']
        dict_item[i]["rating_sum"] += item[i]['rating']*item[i]['votes']
    return dict_item

cat_dist = {}
#Iterate through each season:
for season, episodes in ratings.items():
    #Iterate through each episode in season:
    for episode in episodes:
        #Get demographics item:
        item = episodes[episode]['ratings']['demographics']
        #Accumulate votes and rating sums in the cat_dist dictionary:
        cat_dist = dist(item, cat_dist)
#Compute weighted average rating:
cat_dist = rating_weighted(cat_dist)
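
To make the weighting explicit: for each demographic group, the two functions above compute the sum of rating times votes over all episodes, divided by the total number of votes. A minimal sketch with made-up numbers; `weighted_rating` is a hypothetical helper, and returning `None` for a group without votes is an assumption rather than the behaviour of the code above:

```python
def weighted_rating(ratings_and_votes):
    """Vote-weighted mean of (rating, votes) pairs."""
    total_votes = sum(votes for _, votes in ratings_and_votes)
    if total_votes == 0:
        return None  # no votes in this group, so no average is defined
    return sum(rating * votes for rating, votes in ratings_and_votes) / total_votes

# Two hypothetical episodes for one demographic group: (rating, votes)
toy = [(9.0, 100), (8.0, 300)]
print(weighted_rating(toy))  # 8.25
```

The episode with more votes pulls the average towards its rating, which is why the result is 8.25 rather than the unweighted mean of 8.5.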

Now that the data is organized in a way that makes it easier to plot, we are going to plot the demographics of the voters.

In [None]:
#Age categories:
x = ["Aged under 18", "Aged 18 29",'Aged 30 44','Aged 45 plus']*2
#Categories for females:
x_f = ['females aged under 18', 'females aged 18 29', 'females aged 30 44','females aged 45 plus']
#Categories for males:
x_m = ['males aged under 18', 'males aged 18 29', 'males aged 30 44','males aged 45 plus']

#Get the gender:
gender = np.concatenate((['female']*len(x_f),['male']*len(x_m)))
x_gender = np.concatenate((x_f, x_m))

# Extract number of votes:
v = [cat_dist[vote]['votes'] for vote in x_gender]

#Extract average rating (named avg_ratings so the loaded ratings dict is not overwritten):
avg_ratings = [cat_dist[vote]['rating_sum'] for vote in x_gender]

#Convert into dataframe for easy plotting:
df = pd.DataFrame({
    "Category": x,
    "Gender Category" : x_gender,
    "Gender" : gender,
    "Votes" : v,
    "Ratings" : avg_ratings
})

#Plot number of votes based on age and gender:
px.bar(df, x="Category", y="Votes", color="Gender", title="Distribution of number votes across age and gender")

Apparently, males are the largest gender group among the voters, and the largest age group is 30-44, whereas the smallest is under 18 years old. This would make sense as the series is restricted to viewers of 18 years or older. Further, it can be seen that people above age 45 cast fewer votes, which could indicate that they do not watch Game Of Thrones as much.

In [None]:
#Create figure with grouped bar plots for the average IMDB rating:
fig = go.Figure(data=[
    go.Bar(name='Females', x=df['Category'].unique(), y=df['Ratings'][df['Gender'] =='female']),
    go.Bar(name='Males', x=df['Category'].unique(), y=df['Ratings'][df['Gender'] =='male'])
])
# Change the bar mode
fig.update_layout(barmode='group',title_text="Distribution of average rating across age and gender", 
                    yaxis_title= "Average IMDB rating", xaxis_title = "Age category")
fig.show()

Generally, the average IMDb rating is quite constant across the age and gender groups. The average rating is further quite high, as the IMDb rating scale goes from 1 to 10 and the average is above 8. The group that gives the series the lowest score is females under 18 years, but this group is also rather small, so only a few bad ratings are needed to affect its average score.

<h1 align ="center" >3. Tools, theory and analysis. Describe the process of theory to insight </h1>

---

This section is going to describe how we have approached the task of answering the following questions about the Game Of Thrones: 
- Who are the main characters of each season of the series?
- Is it possible to find a pattern in the data that helps understand the complicated world of Westeros?
- Is the theme of the Game of Thrones series consistent throughout the show or does it change during its course?

This will be done by first analysing the character network, next performing text analysis and lastly performing a community analysis. Therefore this section is divided into three parts:

- Network analysis 
- Text analysis 
- Community analysis

Each part is introduced below, together with its subconclusion. Further, a link to a notebook with the full analysis of each part is provided. This is done to make this notebook more convenient, as it would otherwise be too big to render on GitHub and extremely slow to read through on most computers.

<h2 align = "center"> 3.1 Network analysis </h2>

The Notebook going through the full analysis can be found [here](https://mikkelmathiasen23.github.io/GameOfThrones_Network/Explainer_Network/).

This part analyses the character interactions: do people tend to group in e.g. allegiances? Or do they interact with people from other allegiances? Which characters are the main characters in each season and across the whole series?

These are the questions that this part of the analysis dives into. The methods used are mainly network science tools such as degree distributions, centrality measures and assortativity, and lastly, good old exploratory analysis and visualizations in general.
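
As a small illustration of the degree measures mentioned above, the sketch below counts in- and out-degrees from a directed edge list in plain Python. The edges are made-up examples and `in_out_degrees` is a hypothetical helper; on a NetworkX `DiGraph` the same numbers are available through `G.in_degree()` and `G.out_degree()`.

```python
from collections import Counter

def in_out_degrees(edges):
    """Count in- and out-degree for every node in a directed edge list."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    nodes = set(out_deg) | set(in_deg)
    return {n: {"in": in_deg[n], "out": out_deg[n]} for n in nodes}

# Hypothetical links between character pages (source page -> linked page):
toy_edges = [("Jon Snow", "Arya Stark"), ("Arya Stark", "Jon Snow"),
             ("Sansa Stark", "Jon Snow")]
degrees = in_out_degrees(toy_edges)
print(degrees["Jon Snow"])  # {'in': 2, 'out': 1}
```

A character whose page is linked from many other pages gets a high in-degree, which is one simple indication of importance.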

The main outputs from this analysis are two interactive visualizations/apps that the end user can play around with. The first app is based on directed graphs, where the links between the characters (nodes) are extracted from the links between the character pages on the Game Of Thrones wiki-page. The network can be visualized both per season and across all seasons, and it can be overlaid with different colormaps based on a selected attribute.

The other main output is an interactive visualization/table where the in- and out-degrees and centrality measures for each character are presented in a table and a graph, where one can tab through the seasons. This can help determine the importance of each of the characters in the selected season.

<h3 align = "center"> 3.1.1 Subconclusion </h3>

The resulting graph across all seasons was very dense, but through different network analysis tools and by splitting it into seasons it was possible to determine the most important characters in each season. Further, we found that the selected character attributes (religion, culture, allegiance, appearances and status) did not prove to be good measures to distinguish the characters. The characters were possibly connected based on another attribute or underlying pattern.

In the next sections we are going to investigate community analysis and hopefully this can help us reveal some of these patterns. 


<h2 align = "center"> 3.2 Text analysis </h2>

The Notebook doing the full text analysis can be found [here](https://mikkelmathiasen23.github.io/GameOfThrones_Network/Explainer_Text/).

<p align="center">
  <img src="https://blog.grabon.in/wp-content/uploads/2019/05/Got-Quote-Tyrion-Lannister.jpeg" />
</p>


Does the theme of Game Of Thrones change as the seasons go by? Do characters and allegiances have specific words they use? Is it maybe possible to determine the mood of the characters throughout the series? These are some of the questions we try to answer in this part of the analysis.

In order to dive into these questions the text analysis contains three parts namely:

- TF-IDF analysis 
- Sentiment analysis
- Dispersion plot

The text analysis is based on two data sources: data extracted from the wiki-pages (i.e. the character pages), and the character dialogue, which is based on transcripts (see the data section for a more in-depth description). In the TF-IDF analysis we analyse the words used by each character and compare the results between the character dialogue and the wiki-pages. Further, we dive into whether the different allegiances use specific words compared to others. For both the character and the allegiance text analysis, a few characters and allegiances are selected and compared. Further, we look into whether the theme changes throughout the seasons by looking into the words used in each season.

The results are presented in wordclouds in order to make it easy to grasp the information, and get an overview. 
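
To make the TF-IDF step concrete, the sketch below computes one common variant (term frequency times the natural logarithm of the inverse document frequency) in plain Python. The exact weighting used in the linked notebook may differ; `tf_idf` and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: {name: list of tokens}. Returns {name: {term: score}}."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in documents.values() for term in set(tokens))
    scores = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores

# Two toy "documents", e.g. the tokenized dialogue of two characters:
toy_docs = {
    "A": ["winter", "is", "coming"],
    "B": ["a", "debt", "is", "paid"],
}
scores = tf_idf(toy_docs)
print(scores["A"])
```

Words that occur in every document, such as "is" here, get a score of zero, while words specific to one document are weighted up. This is what makes the resulting wordclouds characteristic of a single character, allegiance or season.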

Further, in order to determine the mood of the characters we look into sentiment analysis. Here two methods are utilized: LabMT and VADER, which are further explained in the linked notebook. Lastly, in order to look into whether the theme changes, we visualize a few carefully selected words in a dispersion plot.
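
Dictionary-based methods such as LabMT score a text by averaging per-word happiness scores over the words found in the word list. The sketch below shows that idea only; the scores in `toy_lexicon` are made up for illustration and are not LabMT values, and `average_happiness` is a hypothetical helper.

```python
from collections import Counter

def average_happiness(tokens, lexicon):
    """Frequency-weighted mean score of the tokens found in the lexicon."""
    counts = Counter(t for t in tokens if t in lexicon)
    total = sum(counts.values())
    if total == 0:
        return None  # none of the tokens are in the lexicon
    return sum(lexicon[t] * c for t, c in counts.items()) / total

# Made-up scores on a 1-9 scale, for illustration only (not LabMT values):
toy_lexicon = {"love": 8.0, "war": 2.0, "king": 6.0}
print(average_happiness(["love", "war", "war", "the"], toy_lexicon))  # 4.0
```

Words missing from the lexicon ("the" in the example) are ignored, so the score only reflects the words the word list knows about.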



<h3 align = "center"> 3.2.1 Subconclusion </h3>

Through the use of wordclouds it was possible to dive into the words used by a few selected characters and allegiances, and the extracted words clearly did explain these characters and allegiances. The words also corresponded well with our preconceived image of the characters and allegiances. Further, through the TF-IDF analysis of the seasons, we could see that the theme clearly changes as the series progresses, and this was further backed up by the dispersion plot.

Lastly, by analysing the sentiment of the characters based on the dialogue and the wiki-pages, we found that the dialogue contained larger variation in sentiment, which would be expected as dialogue is likely to contain more polarizing words.

<h2 align = "center"> 3.3 Community analysis </h2>

The Notebook doing the full community analysis can be found [here](https://mikkelmathiasen23.github.io/GameOfThrones_Network/Explainer_Community/).

What is the underlying pattern that connects the characters? This is the main question of this section. From the network analysis (in the previous part) it appeared that the underlying pattern connecting the different characters was not primarily based on the selected attributes for each character. This section contains four parts which dive into how the characters are connected. The parts are as follows:

- Community detection and exploratory analysis
- Network analysis
- TF-IDF
- Sentiment analysis 

We utilize the Louvain algorithm to detect communities in our character network. We perform some exploratory analysis on these communities, and afterwards create an interactive network visualization/app following the same approach as in the Network analysis section. Afterwards, we dive into the words characterizing these communities and, lastly, look into the sentiment.
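
The Louvain algorithm searches for a partition with high modularity, i.e. one with more edges inside the communities than would be expected by chance. The sketch below is not the Louvain algorithm itself; it only computes the modularity of a given partition, assuming an undirected, unweighted graph. `modularity` and the toy graph are illustrative names and data.

```python
def modularity(edges, communities):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    community_of = {node: i for i, nodes in enumerate(communities) for node in nodes}
    internal = [0] * len(communities)    # edges with both ends in community i
    degree_sum = [0] * len(communities)  # summed node degrees in community i
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(internal, degree_sum))

# Two triangles joined by a single edge:
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
print(modularity(toy_edges, [{1, 2, 3}, {4, 5, 6}]))  # 5/14, about 0.357
```

Splitting the toy graph into its two triangles gives a clearly positive modularity, whereas putting all nodes in one community gives zero.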

<h3 align = "center"> 3.3.1 Subconclusion </h3>

The Louvain algorithm extracted six communities, and through the exploratory analysis it became clear that the characters were not grouped by their attributes (allegiance, religion etc.) but instead by which people they were surrounded by. Daenerys Targaryen was a good example of this. When we looked into the community networks it became even more clear that this was the underlying pattern connecting these characters, which seems obvious when you think about it. We further analyzed the words of each community, which backed up the hypothesis, and lastly the sentiment analysis showed that these communities clearly had different levels of sentiment.

<h1 align= "center"> 4. Discussion </h1>

---

Our primary data source was the wiki-pages, and extracting data from them proved to be a time-consuming task due to a lot of human inconsistencies and errors. This included multiple iterations of the regular expressions used to extract both the character links and the attributes for each character. After numerous attempts we ended up with an approach that appears to be mostly error-free (we might find some errors when digging more into this), so we ended up with reasonably clean data. We also discovered that the information present for each character was very diverse: the character pages contained different amounts and quality of information, which needed to be handled.

At the beginning of the project we wished to add all the reviews from IMDb. We first approached this with web scraping and spent quite some time setting up scripts and functions for it, but learned after some research that this was not allowed. We therefore used a Python package to extract the reviews, but this turned out to only return 25 reviews per episode. Further analysis showed that this was not enough, as e.g. the reviews and ratings did not always appear to match very well. We therefore dropped the reviews from this project, but we still think it could be interesting to investigate how the sentiment of the reviews develops as the series progresses and compare this with the state of the (main) characters. Does our mood change when the characters' mood does? Humans are said to mirror other humans, so this would be interesting to dive into.

<p align="center">
  <img src="https://uproxx.com/wp-content/uploads/2019/04/brienne-jaime.jpg?w=650" />
</p>

We especially think that our interactive visualization apps went very well, i.e. the network app that enables the user to investigate the interactions season by season but also across the whole story. This app contains an enormous amount of information, as one can get information about each character including the attributes, frequently used words, an image and a link to the character page on the wiki. Further, the user can overlay the network with different attributes and investigate patterns. At the same time we think the visualization is easy to understand without diving deep into the information.

The app describing the degrees and closeness centrality is also an element that we are proud of, as it shows in an easy-to-grasp way who the main characters are in each season. We are also very proud of the extensive functionality of this app, as the user can sort and delete features.

<h1 align = "center"> 5. Future Work </h1>

---

The data contained only 165 characters, because the primary data source for the characters was the Game Of Thrones wiki, which does not contain all the characters from the books. It would be interesting to investigate the characters, their interactions, and some of the underlying patterns further, as there is still a lot to dive into. The data could be supplemented with further data from the books, which can be found [here](https://www.kaggle.com/mmmarchetti/game-of-thrones-network-analysis/data). 

This extensive dataset could add further complexity to the network analysis, and the dialogues used in this project could supplement it with information on the dialogue of the majority of these characters. The dialogue dataset contains more than 800 characters, and extending the analysis to date with these elements could hopefully help reveal further interesting patterns in the Game Of Thrones story. 


<p align="center">
  <img src="https://quotedtext.com/wp-content/uploads/2020/06/game-of-thrones-quotes-by-tyrion-lannister-on-life-possibilities-1024x768.jpeg" width = 600 />
</p>


<h1 align = "center"> 6. Conclusion </h1>

---

To reiterate our goals for this project:
- Create a visually pleasing and interactive website
- Perform network analysis and investigate whether any patterns arise in the network attributes
- Determine who was actually the main character of the story and of its respective seasons
- Determine the overall sentiment of each character in the Game of Thrones series
- Search for communities in the Game of Thrones network and establish whether these communities make sense from a story point of view

---

In conclusion, we think, subjectively, that we succeeded in creating a visually pleasing website that gives viewers the opportunity to explore the extracted data and the analysis themselves. This was done using interactive *Dash* applications hosted on *Heroku*. A lot of effort went into the website content and its formatting. In our network analysis it was possible to determine the most important characters of the show and of its respective seasons. This was achieved by investigating centrality measures and degree distributions. We also concluded that the extracted attributes of religion, culture and allegiance did not prove to be good measures for distinguishing the characters. 
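As a reminder of what the centrality measures capture, here is a minimal sketch in plain Python. The character names and edges are a hypothetical toy network, not the project's data, and the helper functions (`build_adjacency`, `degree_centrality`, `closeness_centrality`) are illustrative names rather than the code used in the apps. The closeness computation assumes a connected, undirected graph.

```python
from collections import deque

# Hypothetical toy interaction network (undirected). The edges are made up
# for illustration and are not taken from the project's dataset.
edges = [
    ("Jon", "Sam"),
    ("Jon", "Tyrion"),
    ("Tyrion", "Cersei"),
    ("Tyrion", "Jaime"),
    ("Cersei", "Jaime"),
]


def build_adjacency(edge_list):
    """Map each node to the set of its neighbours."""
    adjacency = {}
    for a, b in edge_list:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(adjacency)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


def closeness_centrality(adjacency):
    """Reachable nodes divided by the sum of shortest-path distances (BFS)."""
    result = {}
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if neighbour not in dist:
                    dist[neighbour] = dist[current] + 1
                    queue.append(neighbour)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


adjacency = build_adjacency(edges)
degree = degree_centrality(adjacency)
closeness = closeness_centrality(adjacency)
most_central = max(degree, key=degree.get)
print(most_central, degree[most_central], closeness[most_central])
```

In this toy network the character with the most direct links is also the one closest to everyone else, which is the kind of agreement between measures that makes a "main character" easy to point out.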

We further analysed the characters and their allegiances through text analysis, focusing especially on TF-IDF scores, which we visualized using word clouds. This clearly established properties of each character and allegiance, in line with what we expected from our own (biased) opinion. We also looked into whether the overall theme of the series changed through the seasons, using TF-IDF analysis and dispersion plots; from these it was possible to see new patterns in the theme, which clearly changed through the story. As the last element of the text analysis we looked into sentiment analysis and could clearly find differences in the mood of the characters. We also found that sentiment analysis appeared to work best on dialogue, where we saw larger variation in sentiment across characters.
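To illustrate why TF-IDF highlights the words that are characteristic of one allegiance, here is a minimal sketch using only the standard library. The three word lists are hypothetical stand-ins for the real texts, and the weighting shown (relative term frequency times the natural log of inverse document frequency) is one common variant; the project may have used a different one.

```python
import math
from collections import Counter

# Hypothetical miniature "documents": one bag of words per allegiance.
# These words are invented for illustration only.
documents = {
    "Stark": "winter north wolf honor winter".split(),
    "Lannister": "gold debt lion gold honor".split(),
    "Targaryen": "dragon fire blood dragon".split(),
}


def tf_idf(docs):
    """Return {document: {term: tf * idf}} for a dict of token lists."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))
    scores = {}
    for name, words in docs.items():
        counts = Counter(words)
        total = len(words)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


scores = tf_idf(documents)
top_terms = {name: max(s, key=s.get) for name, s in scores.items()}
print(top_terms)
```

A word shared between documents (here "honor") gets a lower score than a word that is unique to one document, even when both occur equally often, which is what makes the resulting word clouds distinctive per allegiance.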

Lastly, as the character attributes did not prove to be a good measure for distinguishing the characters, we performed community detection and found that the best way to split the characters was based on which other characters surrounded them. This conclusion can appear obvious, but it was nevertheless our finding from the analysis. We further investigated the properties of these communities, and these were clearly in line with the main result that the characters are best split into communities based on which characters they are surrounded by. 
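One way to make the comparison between an attribute-based split and a neighbourhood-based split concrete is to score both partitions with modularity. The sketch below assumes modularity as the quality measure, which this section does not state explicitly, and uses a hypothetical six-node graph with made-up labels; `modularity` is an illustrative helper, not the project's code.

```python
def modularity(edge_list, partition):
    """Newman modularity of an undirected, unweighted graph.

    partition maps each node to a community label.
    """
    n_edges = len(edge_list)
    internal = {}    # number of edges with both ends in the community
    degree_sum = {}  # sum of node degrees per community
    for a, b in edge_list:
        ca, cb = partition[a], partition[b]
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(
        internal.get(c, 0) / n_edges - (k / (2 * n_edges)) ** 2
        for c, k in degree_sum.items()
    )


# Hypothetical graph: two tightly knit triangles joined by a single edge.
edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("D", "E"), ("E", "F"), ("D", "F"),
    ("C", "D"),
]

# Split by who the characters are surrounded by (the two triangles).
by_neighbours = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
# Split by a made-up attribute that cuts across the triangles.
by_attribute = {"A": "x", "D": "x", "E": "x", "B": "y", "C": "y", "F": "y"}

q_neighbours = modularity(edges, by_neighbours)
q_attribute = modularity(edges, by_attribute)
print(q_neighbours, q_attribute)
```

In this toy case the neighbourhood-based split scores clearly higher than the attribute-based one, mirroring the finding above that attributes alone do not separate the characters well.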

<h1 align = "center"> 7. Contributions </h1>

---

Both group members have contributed to all parts of the project.

**Introduction:** Mikkel 

**Data:** Nicolai

**Basic statistics:** Mikkel

**Network analysis:** Nicolai

**Text analysis:** Mikkel

**Community analysis:** Nicolai

**Network app:** Mikkel

**Table app:** Nicolai

**Community app:** Mikkel

**Main explainer notebook:** Nicolai

**Network analysis explainer notebook:** Mikkel

**Text analysis explainer notebook:** Nicolai

**Community analysis explainer notebook:** Mikkel

**Website:** Nicolai


<h1 align = "center"> 8. References </h1>

---
**Data sources:**

Game of Thrones Wiki: https://gameofthrones.fandom.com/wiki/Game_of_Thrones_Wiki

Dialogue data: https://github.com/jeffreylancaster/game-of-thrones#data


**Plotting and visualization templates:**

Plotly: https://dash.plotly.com/ 

Heroku: https://towardsdatascience.com/deploying-your-dash-app-to-heroku-the-magical-guide-39bd6a0c586c 

Previous work in the 02805 Social Graphs and Interactions assignments

**Website:**

Fastai fastpages: https://github.com/fastai/fastpages 

**Network science:**

Design & Data Visualization: Max Tillich, Kim Albrecht, Mauro Martino, Marton Posfai and Gabriele Musella. 

**Sentiment analysis methods:**

VADER: https://github.com/cjhutto/vaderSentiment

LabMT: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4466725/ 
