# Project: Game of Thrones Data Analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

Hey 👋🏻 Welcome to this Jupyter notebook! I'm glad you joined me. Here we're going to answer some fun questions about Game of Thrones using data analysis. There may be spoilers 😏.
<br />

The dataset that we're investigating holds data on almost 1,000 characters from the universe of Game of Thrones. I took it from [Kaggle](https://www.kaggle.com/mylesoneill/game-of-thrones).

### Questions to answer:

#### 1️⃣ Is there any trend in the evolution of deaths through the book chapters?

Does George R. R. Martin become more sadistic with time? Does the death rate increase as the story unfolds? We'll see.

#### 2️⃣ What does the distribution of the death proportion look like across allegiances?

How does the proportion of deaths vary from house to house? Just to know who to swear allegiance to.

#### 3️⃣ Hypothesis Testing: Are we less likely to die if we are a noble?

Is nobility a guarantee of a long life? We'll set up a hypothesis test to answer this question from a statistical point of view.

#### 4️⃣ How many chapters does a character take to die?

What a sordid question. But in Game of Thrones everyone seems to die fast. How fast, exactly? 😂



<a id='wrangling'></a>
## Data Wrangling

### General Properties

In the first section we'll just take a look at the data and try to detect its weak points.

We're importing all of the packages that we'll use during the analysis 📦

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Render charts inline, then set a wide default figure size for every plot.
%matplotlib inline
plt.rcParams["figure.figsize"] = (26, 10)

Importing the dataset as a dataframe and getting a first look at the data

In [None]:
df_deaths = pd.read_csv('../input/character-deaths.csv')

In [None]:
df_deaths.head()

In [None]:
df_deaths.shape[0]

We have exactly 917 characters to analyze.

In [None]:
df_deaths.info()

As we can see, the Book of Death and Death Chapter columns have a lot of null values. This is easily explained: these columns only hold a value if the character is dead, so the nulls correspond to characters who are still alive.

In [None]:
df_deaths.shape[0] - df_deaths['Death Chapter'].dropna().shape[0]

As you can see, we have 618 characters who are still alive.
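A more direct way to get the same count is to ask pandas for the missing values explicitly. The sketch below uses a tiny made-up dataframe (the names and values are placeholders, not rows from the dataset) to show that `isna().sum()` gives the same result as the subtraction used above.

```python
import numpy as np
import pandas as pd

# Toy data: placeholder characters, not taken from the real dataset.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Death Chapter": [12.0, np.nan, 47.0, np.nan],
})

# Approach used in the notebook: total rows minus non-null rows.
alive_by_subtraction = toy.shape[0] - toy["Death Chapter"].dropna().shape[0]

# Equivalent and more direct: count the missing values.
alive_by_isna = int(toy["Death Chapter"].isna().sum())

print(alive_by_subtraction, alive_by_isna)
```

Both numbers agree, so either form works; `isna().sum()` just states the intent more plainly.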

In [None]:
df_deaths.describe()

In [None]:
df_deaths.columns

Hmm. The current column names are nicely formatted for reading, but not for manipulation. We'll fix that.

In [None]:
df_deaths = df_deaths.rename(columns=lambda x: x.replace(' ', '_').lower())
df_deaths.columns

Better 👌🏻.
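To see exactly what the renaming lambda does, here is a minimal sketch on an empty dataframe that only carries a few of the original column names. Nothing here touches the real data; it only illustrates the transformation.

```python
import pandas as pd

# An empty frame with a few of the original, human-friendly column names.
toy = pd.DataFrame(columns=["Name", "Book of Death", "Death Chapter"])

# Same transformation as in the notebook: spaces become underscores, then lowercase.
renamed = toy.rename(columns=lambda c: c.replace(" ", "_").lower())

print(list(renamed.columns))
```

Snake-case names like these can be used directly in `query()` strings, which comes in handy later in the analysis.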

<a id='eda'></a>
## Exploratory Data Analysis

### 1️⃣ Is there any trend in the evolution of death through the books chapters ?

To analyze the evolution of deaths through the book chapters, we need to group our dataframe by the death_chapter column. Once we have the grouped dataframe, we just need to count the occurrences for each chapter.

In [None]:
df_deaths_by_chapter = df_deaths.groupby('death_chapter').count()
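The grouping step above can be sketched on toy data. The example assumes a made-up frame with the same two column names; it shows that grouping by `death_chapter` ignores the rows where the chapter is missing, and that `value_counts()` is a shorter way to get the same counts.

```python
import numpy as np
import pandas as pd

# Toy data: five placeholder characters, one of them still alive (NaN).
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "death_chapter": [3.0, 3.0, 10.0, np.nan, 10.0],
})

# Notebook approach: group by chapter and count the names in each group.
by_groupby = toy.groupby("death_chapter")["name"].count()

# Shorter alternative: value_counts() also drops NaN by default.
deaths_per_chapter = toy["death_chapter"].value_counts().sort_index()

print(deaths_per_chapter)
```

In both cases the living character simply does not appear, because a missing chapter is not treated as a group.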

In [None]:
def plot_it(x, y, kind="plot", title="Your chart", xlabel="x-axis", ylabel="y-axis"):
    """Plot a line, scatter, or bar chart depending on the value of `kind`."""
    # Size the figure before drawing so the setting applies to this chart.
    plt.figure(figsize=(26, 10))
    if kind == "plot":
        plt.plot(x, y)
    elif kind == "scatter":
        plt.scatter(x, y)
    elif kind == "bar":
        # One random RGB color per bar.
        plt.bar(x, y, color=np.random.rand(len(x), 3))
    else:
        raise ValueError(kind + ' is not a supported type of chart.')
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show()

In [None]:
plot_it(df_deaths_by_chapter.index, df_deaths_by_chapter['name'], 'plot', 'Evolution of the number of deaths through chapters', 'Chapter Number', 'Deaths')

As we can see, there is no clear trend in the number of deaths across chapters. Thank God, George R. R. Martin is not going mad.
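If you want a number to go with the visual impression, one option is a correlation coefficient between chapter number and death count. This is only a sketch under assumptions: the counts below are made up for illustration, and a Pearson correlation only captures linear trends.

```python
import numpy as np

# Made-up death counts for six chapters (illustration only, not real results).
chapters = np.array([1, 2, 3, 4, 5, 6])
deaths = np.array([4, 2, 5, 3, 4, 3])

# Pearson correlation: close to 0 means no linear trend, close to +/-1 means a strong one.
r = np.corrcoef(chapters, deaths)[0, 1]

print(round(r, 3))
```

On the real data you would pass `df_deaths_by_chapter.index` and `df_deaths_by_chapter['name']` instead of the toy arrays.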

### 2️⃣ What does the distribution of the death proportion look like across allegiances?

We take a first look at all the different allegiances.

In [None]:
df_deaths['allegiances'].unique()

As we did with the column names, we're going to normalize the allegiance labels.

In [None]:
df_deaths['allegiances'] = df_deaths['allegiances'].apply(lambda x: x.replace('House ', '').lower())
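The same normalization can also be written with the pandas `.str` accessor. Here is a minimal sketch, assuming a few illustrative labels; unlike a plain `apply` with a lambda, the `.str` methods pass missing values through instead of raising.

```python
import pandas as pd

# Illustrative labels: the same house written with and without the "House " prefix.
allegiances = pd.Series(["House Stark", "Stark", "Night's Watch", "House Lannister"])

# Drop the literal "House " prefix, then lowercase everything.
normalized = allegiances.str.replace("House ", "", regex=False).str.lower()

print(normalized.tolist())
```

After this step, "House Stark" and "Stark" end up under the same label, so they land in the same group.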

Then we need to group our dataframe by allegiance and count the occurrences. This gives us the number of characters per allegiance.

In [None]:
df_characters_by_allegiances = df_deaths.groupby('allegiances').count()
df_characters_by_allegiances[['name']]

Then we replace null values of the death_chapter column with 'none' to filter more easily.

In [None]:
df_deaths['death_chapter'] = df_deaths['death_chapter'].fillna('none')

Then we create a new dataframe holding only dead characters.

In [None]:
df_dead_characters = df_deaths[df_deaths['death_chapter'] != 'none'].copy()
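As a side note, the same filter can be done without a placeholder string. The sketch below is an alternative, not what this notebook does: it assumes a toy frame and uses `notna()`, which keeps `death_chapter` numeric.

```python
import numpy as np
import pandas as pd

# Toy data: B has no death chapter, so B is still alive.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "death_chapter": [5.0, np.nan, 20.0],
})

# Keep only rows that have a death chapter; no 'none' placeholder needed.
dead = toy[toy["death_chapter"].notna()].copy()

print(dead)
print(dead["death_chapter"].dtype)
```

The trade-off: filling with 'none' makes string comparisons easy, but it turns the column into a mixed-type column, which matters as soon as we do arithmetic on it.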

We group our new dataframe by allegiance and count the occurrences. This gives us the number of deaths per allegiance.

In [None]:
df_deaths_by_allegiances = df_dead_characters.groupby('allegiances').count()
df_deaths_by_allegiances[['name']]

In [None]:
df_deaths_by_allegiances['death_proportion'] = df_deaths_by_allegiances['name'] / df_characters_by_allegiances['name']
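The proportion can also be computed in a single pass by averaging a boolean flag per group. This is a sketch on made-up data; the `is_dead` column is a helper introduced here for illustration and does not exist in the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with three allegiances; "none" has no deaths at all.
toy = pd.DataFrame({
    "allegiances": ["stark", "stark", "lannister", "lannister", "lannister", "none"],
    "death_chapter": [5.0, np.nan, 20.0, 31.0, np.nan, np.nan],
})

# Hypothetical helper column: True when a death chapter is recorded.
toy["is_dead"] = toy["death_chapter"].notna()

# The mean of a boolean column is the share of True values in each group.
death_proportion = toy.groupby("allegiances")["is_dead"].mean()

print(death_proportion)
```

One difference worth knowing: here an allegiance with no deaths shows up with a proportion of 0, whereas a frame built only from dead characters cannot contain such an allegiance.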

In [None]:
plot_it(df_deaths_by_allegiances.index, df_deaths_by_allegiances['death_proportion'], 'bar', 'Proportion of deaths by allegiance', 'Allegiances', 'Proportion of deaths')

As we can see, the highest proportion of deaths is among the wildlings. It's safer to live south of the Wall 😙.

### 3️⃣ Hypothesis Testing: Are we less likely to die if we are a noble?

Before answering this question, we'll convert it into statistical hypotheses.
<br />

Here we're trying to show that the probability of death is lower for a noble than for a non-noble. In other words, the difference in death probability (**$d_{death}$**), computed as the noble probability minus the non-noble probability, should be negative.

We'll translate this question into two distinct hypotheses:
<br />

##### Null Hypothesis:
**$d_{death}$** $\geq$ 0

##### Alternative Hypothesis:
**$d_{death}$** < 0

To start, we're going to bootstrap a sampling distribution:

In [None]:
deaths_diffs = []

for i in range(10000):
    sample = df_deaths.sample(df_deaths.shape[0], replace=True)
    noble_probability = sample.query("nobility == 1 & death_chapter != 'none'").shape[0] / sample.query("nobility == 1").shape[0]
    not_noble_probability = sample.query("nobility == 0 & death_chapter != 'none'").shape[0] / sample.query("nobility == 0").shape[0]
    deaths_diffs.append(noble_probability - not_noble_probability)
    
deaths_diffs = np.array(deaths_diffs)
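The bootstrap loop above can be sketched on synthetic data to see what it produces. Everything below is an assumption made for illustration: the sample size, the death rates (20% for nobles, 40% for others), and the helper name `death_rate_diff`. None of it comes from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the dataset, with made-up death rates.
n = 400
nobility = rng.integers(0, 2, size=n)
is_dead = rng.random(n) < np.where(nobility == 1, 0.2, 0.4)
toy = pd.DataFrame({"nobility": nobility, "is_dead": is_dead})

def death_rate_diff(frame):
    """Death rate of nobles minus death rate of non-nobles."""
    rates = frame.groupby("nobility")["is_dead"].mean()
    return rates.loc[1] - rates.loc[0]

# Resample the rows with replacement and recompute the difference each time.
diffs = np.array([
    death_rate_diff(toy.sample(len(toy), replace=True, random_state=seed))
    for seed in range(200)
])

print(diffs.mean(), diffs.std())
```

The spread of `diffs` is what matters next: it estimates how much the difference would wobble from one sample of characters to another.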

Then we simulate the null hypothesis: a normal distribution centered on 0 (the boundary of the null) with the standard deviation of our bootstrapped sampling distribution.

In [None]:
null_simulation = np.random.normal(0, deaths_diffs.std(), 10000)

Then we use our simulation from the null to calculate the p-value, that is, the share of simulated null values that fall at or below our observed difference:

In [None]:
(deaths_diffs.mean() >= null_simulation).mean()

And it's unanimous: the p-value is small enough to reject the null hypothesis. The data suggest that we're less likely to die if we're a noble.
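The p-value step can be isolated in a tiny helper to make the direction of the comparison explicit. This is a sketch: `one_sided_p_value` is a hypothetical name, and the five null draws are made up so the result can be checked by hand.

```python
import numpy as np

def one_sided_p_value(observed, null_draws):
    """Share of null draws that are less than or equal to the observed value."""
    return float(np.mean(np.asarray(null_draws) <= observed))

# Five made-up draws from a null distribution centered on 0.
null_draws = np.array([-0.3, -0.1, 0.0, 0.1, 0.3])

# Only one draw (-0.3) is at or below -0.2, so the p-value is 1/5.
p = one_sided_p_value(-0.2, null_draws)

print(p)
```

Because the alternative hypothesis says the difference is negative, only the lower tail counts: the more negative the observed difference, the smaller the p-value.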

### 4️⃣ How many chapters does a character take to die?

We create a new column 'alive_chapters' that holds the number of chapters each dead character spent alive.

In [None]:
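# death_chapter became a mixed-type column after fillna('none'), so convert it back to numbers first.
df_dead_characters['death_chapter'] = pd.to_numeric(df_dead_characters['death_chapter'])
df_dead_characters['alive_chapters'] = df_dead_characters['death_chapter'] - df_dead_characters['book_intro_chapter']
# Keep only the rows where the difference is non-negative.
df_dead_characters = df_dead_characters[df_dead_characters['alive_chapters'] >= 0]
df_dead_characters[['alive_chapters']].head()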

In [None]:
df_dead_characters['alive_chapters'].median()

Going by the median, a character takes 11 chapters to die!

In [None]:
df_dead_characters.sort_values('alive_chapters', ascending=False).iloc[0]

The survival award goes to Harma! She was a wildling and she took 75 chapters to die, what a score!
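The whole calculation in this section can be sketched on a toy dataframe. The four characters and their chapter numbers are made up; the point is only to show the subtraction, the filter on non-negative values, the median, and the lookup of the longest survivor.

```python
import pandas as pd

# Toy data: placeholder characters with made-up chapter numbers.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "book_intro_chapter": [10, 40, 5, 60],
    "death_chapter": [25, 44, 80, 12],
})

# Chapters spent alive: 15, 4, 75 and -48.
toy["alive_chapters"] = toy["death_chapter"] - toy["book_intro_chapter"]

# Drop the negative value, as the notebook does.
valid = toy[toy["alive_chapters"] >= 0]

median_alive = valid["alive_chapters"].median()
longest = valid.loc[valid["alive_chapters"].idxmax(), "name"]

print(median_alive, longest)
```

`idxmax()` returns the row label of the largest value, which is a compact alternative to sorting the whole frame and taking the first row.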

<a id='conclusions'></a>
## Conclusions

Thank you very much for your attention 🎉. You've now reached the end of my analysis. We learned so much! Here's a brief recap.

### 1️⃣ Deaths evolution is stable.

The number of deaths doesn't follow a specific trend across chapters.

### 2️⃣ Come south!

By looking at our bar plot, we saw that wildlings have the highest proportion of deaths.

### 3️⃣ I hope you're a noble!

Our hypothesis test suggests that nobles are less likely to die.

### 4️⃣ How fast will you die?

Going by the median, a character takes 11 chapters to die.