In [9]:
from IPython.display import HTML

HTML('''<script>
code_show=true;
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

## Overview

We gathered dialogue data for the Game of Thrones TV show from Kaggle: [Dialogues for each character data](https://www.kaggle.com/datasets/gopinath15/gameofthrones). This dataset contains the dialogue spoken by every character in each episode. We also scraped IMDb data from the web using the ```get_metadata.py``` script. The IMDb data includes character-level information, season images, viewer ratings, etc.

Using the two data sources, we clean and reformat the data. We map each line of dialogue to the correct character by joining the two datasets, and we format the episode and season names as required by the UI.
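As an illustration of the episode and season reformatting, here is a minimal sketch that assumes raw labels of the form `e1-Winter is Coming` and `season-01`, as they appear in the Kaggle CSV. The helper names `split_episode_label` and `season_number` are hypothetical and are not taken from the project code; the actual cleaning script may work differently.

```python
import re

# Assumed label formats: "e<number>" optionally followed by "-<name>",
# and "season-<number>".
EPISODE_RE = re.compile(r"^e(\d+)(?:-(.*))?$")
SEASON_RE = re.compile(r"^season-(\d+)$")


def split_episode_label(label):
    """Split a label such as 'e1-Winter is Coming' into (number, name).

    The name is None when the label carries no episode name.
    """
    match = EPISODE_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unexpected episode label: {label!r}")
    number = int(match.group(1))
    name = (match.group(2) or "").strip() or None
    return number, name


def season_number(label):
    """Extract the integer season number from a label such as 'season-01'."""
    match = SEASON_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unexpected season label: {label!r}")
    return int(match.group(1))


print(split_episode_label("e1-Winter is Coming"))
print(season_number("season-01"))
```

With pandas, helpers like these could be applied to a column through `Series.map`, for example `df_raw["Season"].map(season_number)`.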

### Data Profile

The raw data from Kaggle is stored in ```game-of-thrones.csv``` and the cleaned data is stored in ```ouput_dialogues.csv```. Both files are located in the data folder.

In [10]:
import pandas as pd

# Load the raw Kaggle dialogues and the cleaned dialogues for comparison.
df_raw = pd.read_csv("../thronetalk-game-of-thrones-summarizer/data/game-of-thrones.csv")
df_cleaned = pd.read_csv("../thronetalk-game-of-thrones-summarizer/data/ouput_dialogues.csv")

### Raw data

In [11]:
df_raw.head()

| Unnamed: 0 | Text | Speaker | Episode | Season | Show |
|---|---|---|---|---|---|
| 0 | [First scene opens with three Rangers riding t... | | e1-Winter is Coming | season-01 | Game-of-Thrones |
| 1 | What d’you expect? They’re savages. One lot s... | WAYMAR ROYCE | e1-Winter is Coming | season-01 | Game-of-Thrones |
| 2 | I’ve never seen wildlings do a thing like thi... | WILL | e1-Winter is Coming | season-01 | Game-of-Thrones |
| 3 | How close did you get? | WAYMAR ROYCE | e1-Winter is Coming | season-01 | Game-of-Thrones |
| 4 | Close as any man would. | WILL | e1-Winter is Coming | season-01 | Game-of-Thrones |


### Cleaned data

The cleaned data has no NaNs in the speaker name. We also observed that a few speakers were listed with only a first name and no last name, which created duplicate characters. Joining the IMDb data resolved this issue and gave us a unique list of characters with first and last names.

In [12]:
df_cleaned.head()

| Unnamed: 0.1 | Unnamed: 0 | Text | Speaker | Episode | Season | Show | Episode_name | Episode_Number | Season_Number | dialogue_with_speaker | Character |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | [First scene opens with three Rangers riding t... | narrator | e1 | season-01 | Game-of-Thrones | | 1 | 1 | NARRATOR:First scene opens with three Rangers ... | narrator |
| 1 | 1 | What d’you expect? They’re savages. One lot s... | waymar | e1 | season-01 | Game-of-Thrones | | 1 | 1 | WAYMAR ROYCE: What d’you expect? They’re savag... | waymar royce |
| 2 | 2 | I’ve never seen wildlings do a thing like thi... | will | e1 | season-01 | Game-of-Thrones | | 1 | 1 | WILL: I’ve never seen wildlings do a thing lik... | will |
| 3 | 3 | How close did you get? | waymar | e1 | season-01 | Game-of-Thrones | | 1 | 1 | WAYMAR ROYCE: How close did you get? | waymar royce |
| 4 | 4 | Close as any man would. | will | e1 | season-01 | Game-of-Thrones | | 1 | 1 | WILL: Close as any man would. | will |
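The speaker clean-up described above (missing speakers treated as the narrator, first-name-only speakers mapped to a full character name) can be sketched as follows. This is an illustrative sketch only: the function names and the idea of a first-name index built from a list of full character names are assumptions, not the project's actual join against the IMDb data.

```python
def normalize_speaker(raw):
    """Lowercase a raw speaker label; treat missing values as the narrator."""
    # NaN is the only float that is not equal to itself.
    if raw is None or (isinstance(raw, float) and raw != raw):
        return "narrator"
    cleaned = str(raw).strip().lower()
    return cleaned or "narrator"


def build_first_name_index(full_names):
    """Map first name -> full name, dropping first names shared by several characters."""
    index = {}
    ambiguous = set()
    for full in full_names:
        full = full.strip().lower()
        if not full:
            continue
        first = full.split()[0]
        if first in index and index[first] != full:
            ambiguous.add(first)
        index[first] = full
    for first in ambiguous:
        del index[first]
    return index


def resolve_character(raw, index):
    """Return the full character name when the first name is unambiguous."""
    speaker = normalize_speaker(raw)
    first = speaker.split()[0]
    return index.get(first, speaker)


# Hypothetical character list; in the project this would come from the IMDb data.
index = build_first_name_index(["Waymar Royce", "Jon Snow", "Jon Arryn"])
print(resolve_character("WAYMAR", index))
print(resolve_character(float("nan"), index))
```

Ambiguous first names are deliberately left unresolved here, since guessing between two characters would silently mislabel dialogue.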


The final dialogue data, ```ouput_dialogues```, is then stored as a ```.csv``` file in the ```thronetalk-game-of-thrones-summarizer/data``` folder.


### IMDb Metadata

The ```episodes_metadata.csv``` dataset contains metadata for each episode of Game of Thrones, and the ```show_metadata.json``` dataset contains metadata for the show as a whole. This data was scraped from IMDb using the ```Cinemagoer``` library in ```scripts/get_metadata.py```. The episode metadata is generated by ```get_episode_metadata()``` and the show metadata by ```get_show_metadata()```.

In [13]:
episodes_metadata = pd.read_csv("../thronetalk-game-of-thrones-summarizer/data/episodes_metadata.csv")
episodes_metadata.head(3)

| Unnamed: 0 | title | synopsis | plot outline | season | episode | rating | runtimes | votes | full-size cover url | plot | previous episode imdb id | next episode imdb id | release year | casts | directors | producers | writers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Winter Is Coming | ['In the Seven Kingdoms of Westeros, a soldier... | At the castle Winterfell, Lord Ned Stark begin... | 1 | 1 | 8.9 | ['62'] | 53997 | https://m.media-amazon.com/images/M/MV5BMmVhOD... | Eddard Stark is torn between his family and an... | 31321401 | 1668746.0 | 2011 | ['Sean Bean', 'Mark Addy', 'Nikolaj Coster-Wal... | ['Timothy Van Patten'] | ['David Benioff', 'Jonathan Brytus', 'Jo Burn'... | ['David Benioff', 'D.B. Weiss', 'George R.R. M... |
| 1 | The Kingsroad | ['Tyrion doesn\'t like his dwarf status in the... | Although his son Bran is lying in bed unconsci... | 1 | 2 | 8.6 | ['56'] | 40785 | https://m.media-amazon.com/images/M/MV5BYzhhOT... | While Bran recovers from his fall, Ned takes o... | 1480055 | 1829962.0 | 2011 | ['Sean Bean', 'Mark Addy', 'Nikolaj Coster-Wal... | ['Timothy Van Patten'] | ['David Benioff', 'Jonathan Brytus', 'Guymon C... | ['David Benioff', 'D.B. Weiss', 'George R.R. M... |
| 2 | Lord Snow | ['Ned reaches King\'s Landing and is reminded ... | After the attempted assassination of young Bra... | 1 | 3 | 8.5 | ['58'] | 38620 | https://m.media-amazon.com/images/M/MV5BZjcwNz... | Jon begins his training with the Night's Watch... | 1668746 | 1829963.0 | 2011 | ['Sean Bean', 'Mark Addy', 'Nikolaj Coster-Wal... | ['Timothy Van Patten'] | ['David Benioff', 'Jonathan Brytus', 'Guymon C... | ['David Benioff', 'D.B. Weiss', 'George R.R. M... |
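Several columns in the episode metadata (for example ```runtimes```, ```casts``` and ```directors```) are stored in the CSV as stringified Python lists. A minimal sketch of turning such cells back into lists is shown below, assuming each cell holds a complete list literal (the values displayed above are truncated by pandas for display only). The helper names are hypothetical and not part of the project.

```python
import ast


def parse_list_cell(cell):
    """Parse a cell such as "['Timothy Van Patten']" into a Python list."""
    # ast.literal_eval only evaluates literals, so it is safer than eval().
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got: {cell!r}")
    return value


def runtime_minutes(cell):
    """Return the first runtime in a cell such as "['62']" as an int, or None if empty."""
    values = parse_list_cell(cell)
    return int(values[0]) if values else None


print(parse_list_cell("['Timothy Van Patten']"))
print(runtime_minutes("['62']"))
```

With pandas, these could be applied per column, for example `episodes_metadata["runtimes"].map(runtime_minutes)`.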


In [14]:
import json

with open('../thronetalk-game-of-thrones-summarizer/data/show_metadata.json') as f:
    d = json.load(f)
    print(d)

{'cast': ['Peter Dinklage', 'Lena Headey', 'Emilia Clarke', 'Kit Harington', 'Sophie Turner', 'Maisie Williams', 'Nikolaj Coster-Waldau', 'Iain Glen', 'John Bradley', 'Alfie Allen', 'Conleth Hill', 'Liam Cunningham', 'Gwendoline Christie', 'Aidan Gillen', 'Isaac Hempstead Wright', 'Rory McCann', 'Nathalie Emmanuel', 'Jerome Flynn', 'Daniel Portman', 'Jacob Anderson', 'Ben Crompton', 'Kristofer Hivju', 'Julian Glover', 'Carice van Houten', 'Charles Dance', 'Hannah Murray', 'Natalie Dormer', 'Jack Gleeson', 'Michelle Fairley', 'Ian McElhinney', 'Stephen Dillane', 'Joe Dempsie', 'Kristian Nairn', 'Anton Lesser', 'Mark Stanley', 'Richard Madden', 'Finn Jones', 'Sibel Kekilli', 'Iwan Rheon', 'Michael McElhatton', 'Owen Teale', 'Michiel Huisman', 'Diana Rigg', 'Dean-Charles Chapman', 'Rose Leslie', 'Tom Wlaschiha', 'Hafþór Júlíus Björnsson', "Brenock O'Connor", 'Ian Beattie', 'Natalia Tena'], 'end year': 2019, 'full-size cover url': 'https://m.media-amazon.com/images/M/MV5BN2IzYzBiOTQtNGZmMi
