# Basic Statistics
> "This section performs an exploratory analysis of the data from the three data sources described in [Data](https://mikkelmathiasen23.github.io/GameOfThrones_Network/data/)."

- toc: false
- branch: master
- badges: true
- comments: true
- categories: [exploratory analysis, basic statistics, data visualization]
- hide: true
- search_exclude: false

In [1]:
#hide
import networkx as nx
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import json
import requests

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from plotly.subplots import make_subplots
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Mikkel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Mikkel\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

This part of the webpage gives a further introduction to the data used for analysis in this project. The [Data](https://mikkelmathiasen23.github.io/GameOfThrones_Network/data/) page introduces the basics of the data, whereas this page dives further into what the data contains and some of its properties. A full analysis can be found in the [Explainer Notebook](https://mikkelmathiasen23.github.io/GameOfThrones_Network/Explainer_Notebook/).

This page contains three parts: first, properties of the Game Of Thrones network are presented; second, properties of the dialogues; and last, the reviews and ratings from IMDB.

<h1 align="center">Game Of Thrones Network</h1>

This part introduces properties of the data used for the network analysis. The focus is on the properties of the characters, namely their religion, allegiance, culture, number of appearances in the series, and status, i.e. whether they are dead or alive at the end of the series.


In [2]:
#hide
G = nx.read_gpickle("../data/got_G.gpickle")

property_dict = {
    "status": [],
    "appearances": [],
    "culture" : [],
    "allegiance": [],
    "religion" : []
}
attributes = ["status", "appearances", "culture", "allegiance", "religion"]
characters = []
for x,y in G.nodes(data = True): 
    node_name = x
    characters.append(node_name.replace('_', ' '))
    for attribute in attributes:
        if attribute == "appearances":
            if y[attribute] == "":
                yat = 0
            else:
                yat = int(y[attribute])
        elif attribute == "allegiance":
            if y[attribute] == "":
                yat = "No known allegiance"
            else:
                yat = y[attribute]
        elif attribute == "culture":
            if y[attribute] == "":
                yat = "No known culture"
            else: 
                yat = y[attribute]
        elif attribute == "status":
            if y[attribute] == "Place = [[Haystack Hall" or y[attribute] == '':
                yat = 'Unknown status'
            else: 
                yat = y[attribute]
        else:
            yat = y[attribute]
        property_dict[attribute].append(yat)    
df_property = pd.DataFrame.from_dict(property_dict, orient = "columns")
dfs = {}
figs ={}
for attribute in attributes:
    dfs[attribute] = df_property[attribute].value_counts()
    dfs[attribute] = dfs[attribute].reset_index()
    dfs[attribute].columns = [attribute.capitalize(), "Counts"]
    if attribute == 'appearances':
        dfs['appearances'] = dfs['appearances'][dfs['appearances']['Appearances'] != 0]
    figs[attribute] = px.bar(dfs[attribute], x=attribute.capitalize(),
             y="Counts", color=attribute.capitalize(), title="Distribution of character "+attribute)


The network data contains 162 characters, which were found by scraping the [gameofthrones.fandom.com](https://gameofthrones.fandom.com/wiki/Game_of_Thrones_Wiki) webpage. Each character has been assigned a number of attributes, namely religion, allegiance, culture, status (i.e. whether the character is dead or alive), and the number of appearances the character has throughout the series.
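As a minimal sketch of how such attributes can be tallied, the snippet below counts one attribute over a handful of made-up characters and substitutes a placeholder label when the value is empty. The sample records, the helper name `count_attribute` and the label "No known religion" are illustrative assumptions, not the scraped data or the exact code used on this page.

```python
from collections import Counter

# Hypothetical sample records; the real attributes come from the scraped wiki pages.
sample_nodes = {
    "Jon_Snow": {"religion": "Old Gods of the Forest"},
    "Cersei_Lannister": {"religion": "Faith of the Seven"},
    "Bronn": {"religion": ""},
    "Sansa_Stark": {"religion": "Faith of the Seven"},
}


def count_attribute(nodes, attribute, missing_label):
    """Count how often each value of `attribute` occurs, labelling empty values."""
    values = (attrs.get(attribute) or missing_label for attrs in nodes.values())
    return Counter(values)


religion_counts = count_attribute(sample_nodes, "religion", "No known religion")
for value, count in religion_counts.most_common():
    print(f"{value}: {count}")
```

Giving missing values an explicit label keeps them visible as their own category instead of silently dropping them.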

We start by investigating how the characters are distributed across these attributes, and whether some groups within each attribute are more frequent than others. The results are presented in interactive figures where the categories of each attribute can be selected and deselected by clicking them on and off. To investigate a single category, double-click the category of interest.

Let's start out with how the characters distribute across the different religions. 

In [3]:
#hide_input
figs["religion"].show()

From this it is apparent that most characters do not have a known religion. If we toggle this category off, we can see that most of the remaining characters follow the *Faith of the Seven* or the *Old Gods of the Forest*. At the other end, the least frequent religions are *White Walkers* and *Ghiscari religion*. Given basic knowledge of the Game Of Thrones universe, it makes sense that the two most popular religions are the *Faith of the Seven* and the *Old Gods of the Forest*, as *The Seven Kingdoms* practice the *Faith of the Seven*, whereas the people in the *North* follow the *Old Gods of the Forest*.

Further, it should be noted that the Game Of Thrones universe contains 8 different religions based on the [Wiki pages](https://gameofthrones.fandom.com/wiki/Game_of_Thrones_Wiki).

Next, we investigate the frequencies of the different allegiances. Allegiances are a strong factor in Game Of Thrones, as they have a large effect on the wars and on how people and houses talk and interact. Some allegiances are known to be hostile to each other, such as *House Stark* and *House Lannister*. *House Baratheon of King's Landing* is also known to be very hostile towards *House Targaryen*; these two houses are at war due to past history, in which the *Mad King* killed people for fun.

In [4]:
#hide_input
figs["allegiance"].show()

Again, some of the characters do not have an associated allegiance. The two most frequent allegiances are *House Stark* and *House Lannister*, followed by *Night's Watch* and *House Targaryen*. These are also the main allegiances in Game Of Thrones, and the allegiances of the main characters in the series.

*House Lannister* has characters such as Cersei, Jaime and Tyrion, whereas *House Stark* has Robb, Bran and the bastard Jon Snow. Jon Snow is one of the series' best-known characters and is also part of the *Night's Watch*, which plays a big role later in the series during the battle against the *White Walkers*. Lastly, *House Targaryen* is a house that has been beaten down, but as the series evolves Daenerys becomes a larger player in the universe as she conquers the world part by part.

The characters are divided not only into allegiances but also into cultures, which have proven to be important: the people in the North help each other out even when they are not part of the same allegiance.

In [5]:
#hide_input
figs["culture"].show()

From the figure above it can be seen that the most prominent culture is *Andals*, followed by *Northmen*; again, a large group has an unknown culture. Most characters therefore belong to the *Andals* and *Northmen* cultures, which make up the majority of the Game Of Thrones universe. Further, it should be noted that the universe contains a lot of small cultures, such as the *Children of the Forest*.

The *Andals* are the people who invaded Westeros early in the history of the universe, and they are the dominant group. The *Northmen* are also a big cultural group, made up of all the characters living in the North of the Game Of Thrones world. The *Children of the Forest* are a small group of characters introduced fairly late in the series. They are small non-human characters, said to be the original inhabitants of Westeros.

We further investigate how many of the characters die during the series. We start out with 224 characters and end up with only 30 characters alive, whereas 8 are uncertain and 2 unknown.

This means that 121 characters die throughout the series, and anyone who has seen the series would be able to confirm that a lot of characters die as the series progresses. The figure below shows the distribution of the characters' status.

In [6]:
#hide_input
figs["status"].show()

Next we dive into the last attribute of each character in the network, namely how many appearances the character has throughout the series. This gives an indication of how often we generally see a character, and shows whether there are any clear patterns.

In [7]:
#hide_input
figs["appearances"].show()

Again, a lot of characters do not have this attribute on their character page, and these observations have been omitted from the figure above. We can see that the majority of the characters appear only a few times, i.e. fewer than 10-15 appearances. This makes sense, as a lot of the characters are not main characters and therefore appear in only a single season or so. We can further see small groups around 40 and 60 appearances, which could indicate a small group of characters appearing in most episodes; this would be expected, as the series has a handful of main characters.

<h1 align="center">Character dialogues</h1>

Next we dive into the character dialogues, which are extracted from transcripts. This dataset contains dialogue from all characters in the series and comes from a different source than the data in the previous part of this page. Therefore we restrict it to the characters that are present in the network used for analysis in [Text Analysis](https://mikkelmathiasen23.github.io/GameOfThrones_Network/textanalysis/).
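The restriction to network characters can be sketched as a simple membership filter. The node names and transcript entries below are made up for illustration; only the idea of matching transcript names against node names (with underscores replaced by spaces) is taken from this page.

```python
# Hypothetical node names, written with underscores as in the network file.
network_nodes = ["Jon_Snow", "Tyrion_Lannister", "Syrio_Forel"]

# Hypothetical transcript entries; only the "name" and "text" keys are used here.
transcript = [
    {"name": "Jon Snow", "text": "first line"},
    {"name": "Stark Guard", "text": "second line"},
    {"name": "Tyrion Lannister", "text": "third line"},
]

# Convert node names to the spelling used in the transcripts.
characters = {name.replace("_", " ") for name in network_nodes}

# Keep only the lines spoken by characters that exist in the network.
kept = [entry for entry in transcript if entry["name"] in characters]
print([entry["name"] for entry in kept])
```

Using a set for `characters` makes each membership check constant time on average, which helps when the transcript has many lines.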

Originally the data contains 817 characters and the original dataset can be found [here](https://raw.githubusercontent.com/jeffreylancaster/game-of-thrones/master/data/script-bag-of-words.json). 

In [8]:
#hide
resp = requests.get("https://raw.githubusercontent.com/jeffreylancaster/game-of-thrones/master/data/script-bag-of-words.json")

diag = json.loads(resp.text)

char_diag = {}
char_count = {}
for element in diag:
    episode = element['episodeNum']
    season = element['seasonNum']
    title = element['episodeTitle']
    text = element['text']
    for textObj in text:
        if textObj['name'] not in characters: 
            continue 
        if textObj['name'] in char_diag:
            char_diag[textObj['name']].append(textObj['text'])
            if ("S"+str(season)+"E"+str(episode)) not in char_count[textObj["name"]]['episodes']:
                char_count[textObj["name"]]['episodes'].append("S"+str(season)+"E"+str(episode))
            if season not in char_count[textObj["name"]]['seasons']:
                char_count[textObj['name']]['seasons'].append(season) 
            char_count[textObj['name']]['diag'] += len(word_tokenize(textObj['text']))
        else:
            char_diag[textObj['name']] = [textObj['text']]
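            # Count the tokens of the first line too, so no dialogue is dropped from the total
            char_count[textObj['name']] = {'episodes': ["S"+str(season)+"E"+str(episode)], 'seasons': [season], "diag": len(word_tokenize(textObj['text']))}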
            

In [9]:
#hide
df_diag = pd.DataFrame(
    {"character" : [char for char in char_count.keys()],
    'Character episode count': [len(v['episodes']) for char, v in char_count.items()],
     'Character season count':[len(v['seasons']) for char, v in char_count.items()],
     'Character diag length': [v['diag'] for char, v in char_count.items()]    })


We investigate how many episodes and seasons each character appears in, and how many tokens of dialogue each character has in total. Dialogue length is used as an approximation of importance: a character with a lot of dialogue is probably also present a lot in the series.
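To make the idea of dialogue length concrete, here is a small sketch that sums token counts per character. It uses whitespace splitting as a crude stand-in for `word_tokenize`, which also separates punctuation into its own tokens, so the real counts would differ somewhat. The character names and lines are placeholders.

```python
def token_count(text):
    """Count whitespace-separated tokens; a crude stand-in for a real tokenizer."""
    return len(text.split())


def dialogue_lengths(lines):
    """Sum token counts per character over (name, text) pairs."""
    totals = {}
    for name, text in lines:
        totals[name] = totals.get(name, 0) + token_count(text)
    return totals


# Placeholder dialogue lines, not taken from the transcripts.
sample_lines = [
    ("Character A", "one two three"),
    ("Character B", "one two"),
    ("Character A", "four five six seven"),
]
lengths = dialogue_lengths(sample_lines)
print(lengths)
```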

In [10]:
#hide_input
fig_char_diag_len = px.bar(df_diag,x = "character", y="Character diag length", title = "Character dialogue length distribution")
fig_char_diag_len.update_layout(xaxis={'categoryorder':'total descending'})

fig_char_diag_len.show()

From the figure above we can see that *Tyrion Lannister* is clearly the character with the most dialogue; anyone who has seen the series knows that Tyrion talks a lot and likes to talk. Next we can see that *Jon Snow*, *Cersei Lannister* and *Daenerys Targaryen* also have a lot of dialogue. This makes sense, as these three are among the main characters and appear in a lot of episodes.

Next we dive into the characters' appearances in seasons and episodes. This is done by finding the episodes and seasons in which they have some dialogue and using this as an indication of appearance.
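A minimal sketch of this counting, assuming each dialogue record is reduced to a `(name, season, episode)` tuple (a hypothetical shape chosen for illustration), collects distinct episode and season identifiers in sets so that repeated lines in the same episode are counted only once:

```python
from collections import defaultdict

# Hypothetical (name, season, episode) records; repeated entries mimic several
# lines of dialogue within the same episode.
records = [
    ("Tyrion Lannister", 1, 1),
    ("Tyrion Lannister", 1, 1),
    ("Tyrion Lannister", 1, 2),
    ("Tyrion Lannister", 2, 1),
    ("Syrio Forel", 1, 3),
]

episodes_seen = defaultdict(set)
seasons_seen = defaultdict(set)
for name, season, episode in records:
    episodes_seen[name].add(f"S{season}E{episode}")  # unique per season/episode pair
    seasons_seen[name].add(season)

episode_count = {name: len(eps) for name, eps in episodes_seen.items()}
season_count = {name: len(seas) for name, seas in seasons_seen.items()}
print(episode_count)
print(season_count)
```

Note that a character who is on screen without speaking would not be counted by this approach.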

In [11]:
#hide_input
fig_char_season  = px.bar(df_diag,x = "character", y="Character season count", title = "Character appearances distribution (season level)")
fig_char_season.update_layout(xaxis={'categoryorder':'total descending'})
fig_char_season.show()

From the figure above it can be seen that a lot of characters are present in all 8 seasons, such as *Jon Snow, Sansa Stark, Tyrion Lannister, Bronn and Samwell Tarly*; again, this is expected, as these are key characters. On the other hand, a lot of characters are present in only 1 season, such as *Syrio Forel*, who is Arya Stark's "dancing teacher" when she moves to King's Landing.

We now investigate appearances at the episode level, as this gives a more fine-grained description of character presence.

In [12]:
#hide_input
fig_char_episode  = px.bar(df_diag,x = "character", y="Character episode count", title = "Character appearances distribution (episode level)")
fig_char_episode.update_layout(xaxis={'categoryorder':'total descending'})

fig_char_episode.show()

From this we can see that the character who appears in the most episodes is *Tyrion Lannister*, followed by *Jon Snow, Sansa Stark, Daenerys Targaryen*, which makes perfect sense as these are main characters. Only a couple of characters are present only once, which clearly indicates they had a small role in the Game Of Thrones plot.

<h1 align="center">Reviews and ratings</h1>

Lastly we dive into the data from IMDB, from which ratings and reviews are extracted. Here we investigate how the ratings are distributed in general, and also how they are distributed when taking the average rating per episode and per season.

In [13]:
#hide
# Use a context manager so the file is closed after loading
with open("../data/imdb_reviews.json") as f:
    ratings = json.load(f)

episode_rating = {}
season_rating = {}
s = 0
for season, episodes in ratings.items():
    season_rating["S" + str(s+1)] = 0
    c = 0
    for episode in episodes:
        season_rating["S"+ str(s+1)] += episodes[episode]['ratings']['demographics']['imdb users']['rating']
        episode_rating["S" + str(s+1) + " E" + str(c+1)]= episodes[episode]['ratings']['demographics']['imdb users']['rating']
        c+= 1
    season_rating["S"+ str(s+1)] = season_rating["S"+str(s+1)]/c
    s+=1

We start by looking at the average rating per season. From the figure below we can see that seasons 1 through 7 have almost the same average rating, whereas season 8 clearly sticks out with a low score. Season 4 has the highest average rating at 9.31, which is quite high given that the highest possible IMDB score is 10. The seasons in general have a high average rating.

Season 8 having the lowest score does not come as a surprise, as a lot of people were unhappy with the ending of the series and felt that it was ended too quickly.
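The season averages are plain unweighted means of the episode ratings in each season. A minimal sketch, using made-up ratings rather than the IMDB data, could look like this:

```python
from statistics import mean

# Made-up episode ratings per season, for illustration only.
episode_ratings = {
    "S1": [9.0, 8.8, 9.2],
    "S2": [8.5, 8.9],
}

# Unweighted mean per season: every episode counts equally,
# regardless of how many votes it received.
season_average = {
    season: round(mean(scores), 2) for season, scores in episode_ratings.items()
}
print(season_average)
```

Because every episode counts equally, a single poorly rated episode pulls a season's average down more in a short season than in a long one.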

In [14]:
#hide_input
df_season_rating = pd.DataFrame.from_dict(season_rating, orient= 'index')
df_season_rating = df_season_rating.reset_index()
df_season_rating.columns = ['Season', "IMDB rating"]
fig_season_rating = px.bar(df_season_rating, x="Season",
             y="IMDB rating", color="Season", title="Rating per season")
fig_season_rating.show()

Next we look at the average rating per episode, to see if we can find any patterns. From the figure below we see approximately the same pattern as above, but we can now see that the last 2 episodes of a season often achieve a higher average score compared to the middle episodes. Further, from the beginning of season 8 the episodes keep getting lower average scores, and the last episode of season 8 achieves a quite low score of only 4.

In [15]:
#hide_input
df_episode_rating = pd.DataFrame.from_dict(episode_rating, orient= 'index')
df_episode_rating = df_episode_rating.reset_index()
df_episode_rating.columns = ['Episode', "IMDB rating"]
fig_episode_rating = px.bar(df_episode_rating, x="Episode",
             y="IMDB rating", color="Episode", title="Rating per episode")
fig_episode_rating.show()

In [16]:
#hide
def rating_weighted(cat_dist):
    """Turn each category's vote-weighted rating sum into a vote-weighted average."""
    for key, value in cat_dist.items():
        cat_dist[key]['rating_sum'] = value['rating_sum']/value['votes']
    return cat_dist

def dist(item, dict_item):
    """Accumulate votes and vote-weighted rating sums per demographic category."""
    for i in item.keys():
        if i not in dict_item:
            dict_item[i] = {'votes':0, "rating_sum":0}

        dict_item[i]["votes"] += item[i]['votes']
        dict_item[i]["rating_sum"] += item[i]['rating']*item[i]['votes']
    return dict_item
cat_dist = {}
for season, episodes in ratings.items():
    s = "S" + season.split(" ")[1]
    for c,(episode) in enumerate(episodes):
        e = "E" + str(c)

        item = ratings[season][episode]['ratings']['demographics']
        cat_dist = dist(item, cat_dist)

cat_dist = rating_weighted(cat_dist)

We now dive into the demographics of the reviewers of Game Of Thrones to see whether a specific group of people watch the series, as this could help us understand any patterns and dig deeper.

The figure below shows the distribution of the number of votes across 4 age groups and gender. It can be seen that males vote more than females, and that the largest age group is 30-44, whereas the smallest is under 18.

In [17]:
#hide_input
x = ["Aged under 18", "Aged 18 29",'Aged 30 44','Aged 45 plus']*2
x_f = ['females aged under 18', 'females aged 18 29', 'females aged 30 44','females aged 45 plus']
x_m = ['males aged under 18', 'males aged 18 29', 'males aged 30 44','males aged 45 plus']
gender = np.concatenate((['female']*int(len(x)/2),['male']*int(len(x)/2)))
x_gender = np.concatenate((x_f, x_m))
v = [cat_dist[vote]['votes'] for vote in x_gender]
# Use a separate name so the `ratings` dict loaded earlier is not overwritten
avg_ratings = [cat_dist[vote]['rating_sum'] for vote in x_gender]

df = pd.DataFrame({
    "Category": x,
    "Gender Category" : x_gender,
    "Gender" : gender,
    "Votes" : v,
    "Ratings" : avg_ratings
})

px.bar(df, x="Category", y="Votes", color="Gender", title="Distribution of number of votes across age and gender")

We can now look into how the average rating is distributed across gender and age. From the figure below it can be seen that males across all age groups give approximately the same high average rating, whereas females under 18 give the lowest average rating. It should be noted that this age group was also the smallest, so its score is more easily affected by a few people giving low ratings than the scores of the other groups.
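The averages per demographic group are vote-weighted: each episode's rating counts in proportion to the number of votes it received from that group. The sketch below illustrates this with invented `(rating, votes)` pairs, and shows why a group with few votes is more sensitive to a handful of low ratings. The helper name `weighted_average` and all numbers are assumptions for illustration.

```python
def weighted_average(entries):
    """Vote-weighted mean of (rating, votes) pairs; None if there are no votes."""
    total_votes = sum(votes for _, votes in entries)
    if total_votes == 0:
        return None
    return sum(rating * votes for rating, votes in entries) / total_votes


# Invented numbers: a large group and a small group of voters.
large_group = [(9.0, 1000), (9.2, 3000)]
small_group = [(9.0, 10), (5.0, 10)]

large_avg = weighted_average(large_group)
small_avg = weighted_average(small_group)
print(round(large_avg, 2), round(small_avg, 2))

# Ten extra low votes barely move the large group's average.
large_with_outliers = weighted_average(large_group + [(5.0, 10)])
print(round(large_with_outliers, 2))
```

In this toy example the same ten low votes pull the small group well below the large group, while the large group's average shifts only in the second decimal.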

In [18]:
#hide_input
fig = go.Figure(data=[
    go.Bar(name='Females', x=df['Category'].unique(), y=df['Ratings'][df['Gender'] =='female']),
    go.Bar(name='Males', x=df['Category'].unique(), y=df['Ratings'][df['Gender'] =='male'])
])
# Change the bar mode
fig.update_layout(barmode='group',title_text="Distribution of average rating across age and gender", 
                    yaxis_title= "Average IMDB rating", xaxis_title = "Age category")
fig.show()

We have now investigated the data used in this project: how the attributes of the characters are distributed, with a short dive into the Game Of Thrones universe. We have investigated the characters' dialogue from transcripts, and seen how many episodes and seasons these characters are present in and how long their dialogue is. Lastly, we have dived into the basics of the voters for the series: their demographics and their average ratings.