# Game of Thrones Tabular Data Cleanup

We're building a toy example that compares predictions from a model based on tabular data alone with predictions from a model enhanced with graph-based features (describing the relationships, communities, and connections between users and events). To build the baseline model, which we'll then seek to improve, let's first compile the available tabular data and clean it up.

In [1]:
import pandas as pd

The first data set we have is 'character-predictions.csv'. It comes from the team at A Song of Ice and Data, who scraped it from http://awoiaf.westeros.org/. It also includes their predictions of which characters will die; the methodology is described here: https://got.show/machine-learning-algorithm-predicts-death-game-of-thrones

For our purposes, we only want the metadata for each individual character: their gender, culture, parents, age, etc. We'll ignore the predictions.
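As a minimal sketch of "ignoring the predictions", the prediction-related columns could be dropped right after loading. The tiny DataFrame below is made-up stand-in data, and the choice of `pred`, `alive`, and `plod` as the prediction columns is an assumption based on their names, not something the data source confirms here.

```python
import pandas as pd

# Made-up stand-in rows; the real data comes from character-predictions.csv.
toy = pd.DataFrame({
    "name": ["Character A", "Character B"],
    "male": [1, 0],
    "culture": ["Northmen", None],
    "pred": [0, 1],
    "alive": [0.2, 0.9],
    "plod": [0.8, 0.1],
})

# Assumption: these columns hold the upstream model's predictions.
PREDICTION_COLUMNS = ["pred", "alive", "plod"]

# errors="ignore" skips any listed column that is not present.
metadata_only = toy.drop(columns=PREDICTION_COLUMNS, errors="ignore")
print(list(metadata_only))
```

Dropping by name rather than by position keeps the step robust if the column order in the file changes.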

In [33]:
# the first data set we have available is the "character predictions" data from https://got.show
character_predictions = pd.read_csv('https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-predictions.csv')

In [24]:
list(character_predictions)

['S.No',
 'actual',
 'pred',
 'alive',
 'plod',
 'name',
 'title',
 'male',
 'culture',
 'dateOfBirth',
 'DateoFdeath',
 'mother',
 'father',
 'heir',
 'house',
 'spouse',
 'book1',
 'book2',
 'book3',
 'book4',
 'book5',
 'isAliveMother',
 'isAliveFather',
 'isAliveHeir',
 'isAliveSpouse',
 'isMarried',
 'isNoble',
 'age',
 'numDeadRelations',
 'boolDeadRelations',
 'isPopular',
 'popularity',
 'isAlive']

In [34]:
# additional data set - "character deaths"
character_deaths = pd.read_csv('https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/character-deaths.csv')

In [None]:
from sklearn.model_selection import TimeSeriesSplit
import numpy as np
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=5)
print(tscv)
# output pasted from an earlier run: TimeSeriesSplit(max_train_size=None, n_splits=5)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    print("X train:", X_train, " X test:", X_test)
    y_train, y_test = y[train_index], y[test_index]
    print("y train",y_train," y test",y_test)

In [26]:
list(character_deaths)


['Name',
 'Allegiances',
 'Death Year',
 'Book of Death',
 'Death Chapter',
 'Book Intro Chapter',
 'Gender',
 'Nobility',
 'GoT',
 'CoK',
 'SoS',
 'FfC',
 'DwD']

In [27]:
battles = pd.read_csv('https://raw.githubusercontent.com/tomasonjo/neo4j-game-of-thrones/master/data/battles.csv')

What data are actually useful? Ultimately, we want to predict, given a character's features, whether they'll die in the next book. We have some time-invariant data (e.g. whether a character is male or female) as well as some time-dependent data (e.g. whether or not their parents are alive). Let's start by combining our time-invariant data into a single dataframe.

Time-invariant features: gender, culture, house, mother, father, spouse, heir, whether they're nobility, whether they have allegiances, and what their allegiance is.

In [40]:
character_deaths[['Book of Death']]

Unnamed: 0,Book of Death
0,
1,3.0
2,
3,5.0
4,
5,
6,4.0
7,5.0
8,
9,


In [48]:
time_invariant_data=character_deaths[['Book of Death','Name','Allegiances','Nobility','Gender']]

In [42]:
# let's add a boolean 'Has_Allegiance' for whether a character has any allegiance (ignore the warning, this is fine)
time_invariant_data['Has_Allegiance'] = 1
time_invariant_data.loc[time_invariant_data['Allegiances'] == 'None', 'Has_Allegiance'] = 0
# likewise, flag a character as alive when no book of death is recorded
time_invariant_data['isAlive'] = 0
time_invariant_data.loc[time_invariant_data['Book of Death'].isna(), 'isAlive'] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
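The warning printed above comes from assigning into a column selection of `character_deaths`. One way to avoid it, sketched here on made-up stand-in rows rather than the real file, is to take an explicit `.copy()` of the selection before adding columns. The frame and the `Extra` column are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows mirroring the columns used above.
deaths = pd.DataFrame({
    "Name": ["Character A", "Character B", "Character C"],
    "Allegiances": ["Lannister", "None", "House Greyjoy"],
    "Book of Death": [np.nan, 3.0, 5.0],
    "Extra": [1, 2, 3],  # hypothetical column that is not selected
})

# .copy() yields an independent frame, so the assignments below do not warn.
subset = deaths[["Book of Death", "Name", "Allegiances"]].copy()

# 1 when the allegiance is anything other than the literal string 'None'.
subset["Has_Allegiance"] = (subset["Allegiances"] != "None").astype(int)

# 1 when no book of death is recorded.
subset["isAlive"] = subset["Book of Death"].isna().astype(int)

print(subset)
```

Building each flag from a boolean comparison also replaces the two-step "set to 1, then overwrite with 0" pattern with a single assignment per column.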


In [49]:
# now let's merge in the relevant columns from the character predictions data set:
# isAlive, culture, mother, father, spouse, house, heir, and boolDeadRelations
time_invariant_data = pd.merge(
    time_invariant_data,
    character_predictions[['isAlive', 'name', 'culture', 'mother', 'father', 'spouse', 'house', 'heir', 'boolDeadRelations']],
    right_on='name', left_on='Name', how='outer')

In [43]:
time_invariant_data.head()

Unnamed: 0,Book of Death,Name,Allegiances,Nobility,Gender,Has_Allegiance,isAlive
0,,Addam Marbrand,Lannister,1,1,1,1
1,3.0,Aegon Frey (Jinglebell),,1,1,0,0
2,,Aegon Targaryen,House Targaryen,1,1,1,1
3,5.0,Adrack Humble,House Greyjoy,1,1,1,0
4,,Aemon Costayne,Lannister,1,1,1,1


In [38]:
# how many rows carry a name from character_predictions?
# (with an outer merge this also counts characters that appear only in that file)
time_invariant_data['name'].notna().sum()

1947
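Because the merge is an outer join, a non-null `name` does not by itself prove that a row matched on both sides. A hedged alternative is pandas' `indicator=True` option, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. The two small frames below are made-up stand-ins for the real files.

```python
import pandas as pd

# Made-up stand-ins for character_deaths and character_predictions.
deaths = pd.DataFrame({
    "Name": ["Character A", "Character B (alias)", "Character C"],
})
predictions = pd.DataFrame({
    "name": ["Character A", "Character B", "Character C"],
    "culture": ["Northmen", "Ironborn", None],
})

# indicator=True records where each merged row came from.
merged = pd.merge(
    deaths, predictions,
    left_on="Name", right_on="name",
    how="outer", indicator=True,
)

# Tally matched rows versus rows found in only one of the inputs.
counts = merged["_merge"].value_counts()
print(counts)

# Rows present only in the deaths data are the ones needing a manual fix.
needs_fixing = merged.loc[merged["_merge"] == "left_only", "Name"].tolist()
print(needs_fixing)
```

In this toy case two rows match on both sides, while the aliased name shows up once as `left_only` and once as `right_only`, which is the pattern to look for in the real merge.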

In [45]:
# let's write out what didn't merge and fix it manually
time_invariant_data.to_csv('GoT_time_invariant_2.csv')

In [12]:
time_invariant_data[time_invariant_data['name'].isna()]

Unnamed: 0,Name,Allegiances,Nobility,Gender,Has_Allegiance,alive,name,culture,mother,father,spouse,house,heir
1,Aegon Frey (Jinglebell),,1,1,0,,,,,,,,
2,Aegon Targaryen,House Targaryen,1,1,1,,,,,,,,
3,Adrack Humble,House Greyjoy,1,1,1,,,,,,,,
6,Aemon Targaryen (son of Maekar I),Night's Watch,1,1,1,,,,,,,,
12,Alan of Rosby,Night's Watch,1,1,1,,,,,,,,
20,Alia of Braavos,,0,0,0,,,,,,,,
39,Anvil Ryn,,0,1,0,,,,,,,,
50,Arryk (Guard),House Tyrell,0,1,1,,,,,,,,
67,Barsena,,0,0,0,,,,,,,,
73,Becca the Baker,,0,0,0,,,,,,,,
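Many of the unmatched names above carry a parenthetical qualifier, so the manual fix could be helped along by suggesting near matches from the other file. The following is only a sketch using the standard library's `difflib`. The helper `suggest_matches`, the candidate list, and the cutoff of 0.5 are all assumptions for illustration, not part of the original workflow, and any suggestion would still need checking by eye.

```python
import difflib


def suggest_matches(unmatched_names, candidate_names, cutoff=0.5):
    """Map each unmatched name to up to three similar candidate names."""
    suggestions = {}
    for name in unmatched_names:
        # get_close_matches returns candidates ordered by similarity score.
        suggestions[name] = difflib.get_close_matches(
            name, candidate_names, n=3, cutoff=cutoff
        )
    return suggestions


# Hypothetical candidate spellings; the real list would be
# character_predictions['name'].
candidates = ["Arryk", "Tommen Baratheon", "Renly Baratheon"]

result = suggest_matches(["Arryk (Guard)", "Zzz Qqq"], candidates)
for name, matches in result.items():
    print(name, "->", matches)
```

A lower cutoff yields more suggestions at the cost of more false positives, so it is worth tuning on a handful of names whose correct match is already known.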


In [21]:
character_predictions[character_predictions['name'].str.contains('Barathe')]

Unnamed: 0,S.No,actual,pred,alive,plod,name,title,male,culture,dateOfBirth,...,isAliveHeir,isAliveSpouse,isMarried,isNoble,age,numDeadRelations,boolDeadRelations,isPopular,popularity,isAlive
5,6,1,0,0.021,0.979,Tommen Baratheon,,1,,,...,1.0,,0,0,,5,1,1,1.0,1
50,51,0,0,0.397,0.603,Joffrey Baratheon,,1,,,...,1.0,,0,0,,5,1,1,1.0,0
172,173,1,0,0.036,0.964,Stannis Baratheon,,1,,,...,1.0,,0,0,,4,1,1,1.0,1
541,542,1,0,0.486,0.514,Gowen Baratheon,,1,,,...,,1.0,1,0,,0,0,0,0.016722,1
1573,1574,1,1,0.583,0.417,Myrcella Baratheon,Princess,0,,290.0,...,,,0,1,15.0,5,1,1,0.561873,1
1586,1587,0,0,0.031,0.969,Orys Baratheon,Storm's End,1,,,...,,1.0,1,1,,5,1,1,0.488294,0
1782,1783,1,0,0.381,0.619,Shireen Baratheon,Princess,0,,289.0,...,,,0,1,16.0,4,1,0,0.230769,1
1784,1785,0,0,0.45,0.55,Renly Baratheon,Lord Paramount of the Stormlands,1,Stormlands,277.0,...,,1.0,1,1,22.0,2,1,1,1.0,0
1811,1812,0,0,0.066,0.934,Steffon Baratheon,Storm's End,1,,246.0,...,,0.0,1,1,32.0,2,1,1,0.371237,0
1861,1862,0,0,0.21,0.79,Lyonel Baratheon,Ser,1,,,...,,,0,1,,0,0,0,0.056856,0


In [50]:
character_predictions[['name', 'boolDeadRelations']].to_csv('num_dead_rels.csv')