# Introduction
In this kernel, I explain step by step how I generated the script dataset for the Game of Thrones series.

The site I scraped is genius.com, which hosts the full scripts for every episode of Game of Thrones.

## Scraping Episode URLs
### Loading the Required Packages
Packages used for scraping the episode list are:
* requests
* BeautifulSoup

In [101]:
import requests # Used for doing the http request
from bs4 import BeautifulSoup # Used for assisting the HTML read
import pandas as pd

### URL Scraping
An example URL for an album on genius.com is https://genius.com/albums/Game-of-thrones/Season-1-scripts. In our case, an album corresponds to a season of the series. Each album page lists the URLs that lead to the lyrics of every song in the album, and each song corresponds to an episode of that season.

From the example above, we can see that season URLs follow this pattern:
> 'https://genius.com/albums/Game-of-thrones/Season-' + [Season Number] + '-scripts'

Game of Thrones ended after 8 seasons, so we will have 8 season URLs, one per season.

To build them, we can use a simple list comprehension that follows the pattern we just found.

In [2]:
season_urls = ['https://genius.com/albums/Game-of-thrones/Season-' + str(season_number) + '-scripts' for season_number in range(1,9)]

Let's take a look at our season URLs.

In [3]:
for season_url in season_urls:
    print(season_url)

https://genius.com/albums/Game-of-thrones/Season-1-scripts
https://genius.com/albums/Game-of-thrones/Season-2-scripts
https://genius.com/albums/Game-of-thrones/Season-3-scripts
https://genius.com/albums/Game-of-thrones/Season-4-scripts
https://genius.com/albums/Game-of-thrones/Season-5-scripts
https://genius.com/albums/Game-of-thrones/Season-6-scripts
https://genius.com/albums/Game-of-thrones/Season-7-scripts
https://genius.com/albums/Game-of-thrones/Season-8-scripts


Now that we have the URL for each season, we can look at the HTML behind one of them.
We can send an HTTP request with the `requests` package that we imported, and then store the HTML result as a `BeautifulSoup` object.

`BeautifulSoup` parses the raw string returned by `requests` into a structured object, also called a `BeautifulSoup` object, that can be searched by tag and attribute.

In [4]:
r = requests.get(season_urls[0])
html_doc = r.text
soup = BeautifulSoup(html_doc)

# only view snippet because the result is too large
str(soup)[:1000]

'<!DOCTYPE html>\n<html class="snarly song_stories_public_launch--enabled react_forums--disabled report_abuse--disabled" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">\n<head>\n<base href="//genius.com/" target="_top"/>\n<script type="text/javascript">\n//<![CDATA[\n\n  var _sf_startpt=(new Date()).getTime();\n  if (window.performance && performance.mark) {\n    window.performance.mark(\'parse_start\');\n  }\n\n//]]>\n</script>\n<title>Game of Thrones - Season 1 Scripts Lyrics and Tracklist | Genius</title>\n<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\n<meta content="width=device-width,initial-scale=1" name="viewport"/>\n<meta content="app-id=709482991" name="apple-itunes-app"/>\n<link href="https://assets.genius.com/images/apple-touch-icon.png?1651006365" rel="apple-touch-icon"/>\n<link href="https://assets.genius.com/images/apple-touch-icon.png?1651006365" rel="apple-touch-icon"/>\n<!-- Mobile IE all

If we look closely at the HTML, we can see that the URL of each episode in a season is wrapped in an `a` tag with the class name `u-display_block`.

With a BeautifulSoup object, we can easily find every HTML tag with a specific attribute using the `.find_all()` method.

In [5]:
url_containers = soup.find_all('a', class_='u-display_block')
# take a look at one of the items
url_containers[0]

<a class="u-display_block" href="https://genius.com/Game-of-thrones-winter-is-coming-annotated">
<h3 class="chart_row-content-title">
              Winter is Coming
              <span class="chart_row-content-title-subtitle">Lyrics</span>
</h3>
</a>

From the `a` tag above, we know that the URL of each episode is stored in the `href` attribute. A tag in `BeautifulSoup` behaves much like a Python `dictionary`: each (`attribute`, `value of attribute`) pair of the tag can be treated as a (`key`, `value`) pair. Therefore, we can get the URL of an episode by reading the `href` attribute of the `a` tag the same way we would read a value from a Python dictionary.
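
As a small illustration of `.find_all()` and dictionary-style attribute access, the sketch below runs on a hand-written HTML fragment instead of the live page. The fragment, its `example.com` links, and the name `soup_demo` are made up for the example; only the `u-display_block` class name comes from the page discussed here.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration; not the real genius.com markup.
html = """
<div>
  <a class="u-display_block" href="https://example.com/episode-1">Episode 1</a>
  <a class="other" href="https://example.com/ignored">Ignored</a>
  <a class="u-display_block" href="https://example.com/episode-2">Episode 2</a>
</div>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; class_ filters on the CSS class.
links = soup_demo.find_all("a", class_="u-display_block")

# A tag's attributes can be read like dictionary entries.
hrefs = [link["href"] for link in links]

# .get() returns None instead of raising KeyError when an attribute is missing.
missing = links[0].get("title")

print(hrefs)
print(missing)
```

Using `.get()` is a defensive choice for pages where some anchors may lack the attribute; plain indexing is fine when every tag is known to have it.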

In [6]:
urls = [url_container['href'] for url_container in url_containers]

# Take a look at the URLs inside
for url in urls:
    print(url)

https://genius.com/Game-of-thrones-winter-is-coming-annotated
https://genius.com/Game-of-thrones-the-kingsroad-annotated
https://genius.com/Game-of-thrones-lord-snow-annotated
https://genius.com/Game-of-thrones-cripples-bastards-and-broken-things-annotated
https://genius.com/Game-of-thrones-the-wolf-and-the-lion-annotated
https://genius.com/Game-of-thrones-a-golden-crown-annotated
https://genius.com/Game-of-thrones-you-win-or-you-die-annotated
https://genius.com/Game-of-thrones-the-pointy-end-annotated
https://genius.com/Game-of-thrones-baelor-annotated
https://genius.com/Game-of-thrones-fire-and-blood-annotated


Now we know how to get the URL of each episode in one season of Game of Thrones. Next, we need to extract the URLs of all episodes from all seasons. We can do this with a simple loop over the season URLs that repeats the steps above for each season.

In [7]:
urls = []
for season_url in season_urls:
    
    r = requests.get(season_url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc)
    
    url_containers = soup.find_all('a', class_='u-display_block')
    
    for url_container in url_containers:
        urls.append(url_container['href'])
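
The same loop can also be sketched as a reusable function with a few safety nets. This is only one possible arrangement, not the approach used in this kernel: the function names `fetch_html` and `collect_episode_urls`, the 10 second timeout, and the 1 second pause between requests are all assumptions chosen for the example.

```python
import time

import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def collect_episode_urls(season_urls, fetch=fetch_html, delay=1.0):
    """Return the href of every 'u-display_block' link on each season page."""
    episode_urls = []
    for season_url in season_urls:
        page = BeautifulSoup(fetch(season_url), "html.parser")
        for link in page.find_all("a", class_="u-display_block"):
            # Skip anchors without an href instead of raising KeyError.
            href = link.get("href")
            if href:
                episode_urls.append(href)
        time.sleep(delay)  # pause between requests (assumed value)
    return episode_urls
```

Passing the download step in as the `fetch` argument makes it possible to try the parsing logic on saved or hand-written HTML without touching the network.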

In [8]:
# show number of episodes
len(urls)

75

There are some anomalies in the URLs we collected. Game of Thrones has only 73 episodes in total, but our scrape captured 75 URLs.

In [9]:
for url in urls:
    print(url)

https://genius.com/Game-of-thrones-winter-is-coming-annotated
https://genius.com/Game-of-thrones-the-kingsroad-annotated
https://genius.com/Game-of-thrones-lord-snow-annotated
https://genius.com/Game-of-thrones-cripples-bastards-and-broken-things-annotated
https://genius.com/Game-of-thrones-the-wolf-and-the-lion-annotated
https://genius.com/Game-of-thrones-a-golden-crown-annotated
https://genius.com/Game-of-thrones-you-win-or-you-die-annotated
https://genius.com/Game-of-thrones-the-pointy-end-annotated
https://genius.com/Game-of-thrones-baelor-annotated
https://genius.com/Game-of-thrones-fire-and-blood-annotated
https://genius.com/Game-of-thrones-the-north-remembers-annotated
https://genius.com/Game-of-thrones-the-night-lands-annotated
https://genius.com/Game-of-thrones-what-is-dead-may-never-die-annotated
https://genius.com/Game-of-thrones-garden-of-bones-annotated
https://genius.com/Game-of-thrones-the-ghost-of-harrenhal-annotated
https://genius.com/Game-of-thrones-the-old-gods-and-t

After looking closely at each item in our URL list, we can see preview and trailer entries for seasons 4 and 5, which we do not need. These entries contain the word `season` in their URL, so we can use that to remove them from the list.

In [10]:
urls = [url for url in urls if 'season' not in url]
len(urls)

73
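
Before dropping entries, it can help to look at what the filter removes. The sketch below splits a list into kept and dropped URLs; the trailer URL is hypothetical and written only for this example, and the lower-casing is an added assumption that makes the check case-insensitive.

```python
# Sample list for illustration; the second URL is hypothetical.
sample_urls = [
    "https://genius.com/Game-of-thrones-winter-is-coming-annotated",
    "https://genius.com/Game-of-thrones-season-4-trailer-annotated",
    "https://genius.com/Game-of-thrones-the-kingsroad-annotated",
]

# Split the list in two so the discarded entries can be inspected.
dropped = [u for u in sample_urls if "season" in u.lower()]
kept = [u for u in sample_urls if "season" not in u.lower()]

print(dropped)
print(kept)
```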

Finally, we have a clean and complete list of all Game of Thrones episodes. Now we can move on to scraping the script/conversations from each of those episodes.

## Raw Text Scraping
In this part, we start scraping the content of each episode URL that we found. The data we want to retrieve from each URL consists of:
* Episode Number
* Episode Title
* Season Number
* Release Date
* Conversations

### Path Finding
We start by finding the HTML paths that contain each piece of data. Once a path is found, we transform the data, which initially is raw HTML, into more readable data types and store it in predefined variables.

Before scraping all of the episode URLs, it is better to work through the process on a single URL. That way we can find the steps and methods that we can then apply to the other URLs.

So, we save one of our URLs in a variable for further use.

In [196]:
url = urls[0]
url

'https://genius.com/Game-of-thrones-winter-is-coming-annotated'

As in the previous process, we need to get the HTML response from the URL and store it as a `BeautifulSoup` object.

In [197]:
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc)

# take a look inside
# again only snippet because the result too large
str(soup)[:2000]

'<!DOCTYPE html>\n<html>\n<head>\n<title>Game\xa0of Thrones – Winter is Coming | Genius</title>\n<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\n<meta content="width=device-width,initial-scale=1" name="viewport"/>\n<meta content="app-id=709482991" name="apple-itunes-app"/>\n<link href="https://assets.genius.com/images/apple-touch-icon.png?1651006365" rel="apple-touch-icon"/>\n<link href="https://assets.genius.com/images/apple-touch-icon.png?1651006365" rel="apple-touch-icon"/>\n<!-- Mobile IE allows us to activate ClearType technology for smoothing fonts for easy reading -->\n<meta content="on" http-equiv="cleartype"/>\n<meta content="f63347d284f184b0" name="y_key"/>\n<meta content="Genius" property="og:site_name"/>\n<meta content="265539304824" property="fb:app_id"/>\n<meta content="308252472676410" property="fb:pages"/>\n<link href="https://genius.com/opensearch.xml" rel="search" title="Genius" type="application/opensearchdescription+xml"/>\n<script>\n!function(

#### Episode Number & Title
Based on the HTML that we loaded above, the episode number and title are wrapped in a `div` tag with the class name `AlbumTracklist__TrackName-sc-123giuo-2 guEaas`.
Once again, we can easily access this tag using the `.find_all()` method.

In [198]:
episode = soup.find_all('div', class_='AlbumTracklist__TrackName-sc-123giuo-2 guEaas')
# take a look inside
episode

[<div class="AlbumTracklist__TrackName-sc-123giuo-2 guEaas"><div class="AlbumTracklist__TrackNumber-sc-123giuo-3 epTVob">1. </div>Winter is Coming</div>]

We can see that the result is a list containing a single `div` element. However, the element is still in HTML format. We can get all of the text inside it using the `.text` attribute.

In [199]:
episode = episode[0].text
# take a look inside
episode

'1. Winter is Coming'

Further text processing is needed to get the `number` and `title` of an episode.

In [200]:
# creating a list by splitting the string using '\n'
episode = episode.split('\n')

# remove unused and empty strings
episode = ''.join(e + ' ' for e in episode)
episode = episode.split(' ')
episode = list(filter(None, episode))

# assign episode number and episode title to different variables
episode_number = ''.join('Episode ' + episode[0].split('.')[0])
episode_title = ''.join(e + ' ' for e in episode[1:])[:-1]

# show the results
print(episode_number)
print(episode_title)

Episode 1
Winter is Coming
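
The same split can also be done with a single regular expression. This is an alternative sketch rather than the method used above; the helper name `parse_track_name` is hypothetical, and it assumes the text always has the shape `<number>. <title>`.

```python
import re


def parse_track_name(text):
    """Split a string such as '1. Winter is Coming' into number and title."""
    match = re.match(r"\s*(\d+)\.\s*(.+?)\s*$", text, flags=re.DOTALL)
    if match is None:
        # The text does not start with '<number>.'
        return None
    number, title = match.groups()
    # Collapse any internal runs of whitespace (including newlines).
    title = " ".join(title.split())
    return "Episode " + number, title


print(parse_track_name("1. Winter is Coming"))
```

Returning `None` for unexpected input makes it easy to spot pages whose layout differs from the one inspected here.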


#### Season Number
The season number is wrapped in an `a` tag with the class name `Link-h3isu4-0 gHBbjJ`.

In [201]:
# get all elements inside 'a' tag, remove enters, convert to list splitted by empty space
season = soup.find_all('a', class_='Link-h3isu4-0 gHBbjJ')[0].text.replace('\n','').split(' ')

# remove empty strings and concat all the remaining
season = list(filter(None, season))
season = ''.join(s + ' ' for s in season[:-1])[:-1]

print(season)

Season 1


#### Release Date
The release date is wrapped in a `div` tag with the class name `HeaderMetadata__Section-sc-1p42fnf-3 jROWVH`.

In [202]:
# get all 'div' tags that hold header metadata
release_date = soup.find_all('div', class_='HeaderMetadata__Section-sc-1p42fnf-3 jROWVH')
# the date text sits three nodes into the second metadata section
release_date = release_date[1].next_element.next_element.next_element.text
print(release_date)

April 17, 2011


We want to store the date in a simpler format. We can use the `.strptime()` and `.strftime()` methods of `datetime`. To do this, we need to import `datetime` from the `datetime` package.

In [203]:
from datetime import datetime

release_date = datetime.strptime(release_date, '%B %d, %Y')
release_date = datetime.strftime(release_date, '%Y-%m-%d')

print(release_date)

2011-04-17
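
When the same conversion runs over many pages, a date may be missing or written differently. The sketch below wraps the two calls in a helper that returns `None` instead of raising; the helper name `to_iso_date` is hypothetical, and it assumes English month names (the `%B` directive depends on the active locale).

```python
from datetime import datetime


def to_iso_date(text, source_format="%B %d, %Y"):
    """Convert 'April 17, 2011' to '2011-04-17'; return None if parsing fails."""
    try:
        parsed = datetime.strptime(text.strip(), source_format)
    except ValueError:
        # The text does not match the expected format.
        return None
    return parsed.strftime("%Y-%m-%d")


print(to_iso_date("April 17, 2011"))
print(to_iso_date("release date unknown"))
```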


#### Conversations
Scraping the conversations is a longer and more complex process. It includes getting the raw HTML text, removing unused tags, filtering the tags we need, and several text cleansing steps.

To start, we know that the conversation part is stored in a `div` tag with the class name `Lyrics__Container-sc-1ynbvzw-6 jYfhrf`.

In [288]:
lyrics = soup.find_all("div", class_="Lyrics__Container-sc-1ynbvzw-6 jYfhrf")[0]

# again only snippet because the result too large
str(lyrics)[:2000]

'<div class="Lyrics__Container-sc-1ynbvzw-6 jYfhrf" data-lyrics-container="true"><b>EPISODE 1 - WINTER IS COMING</b><hr/><p>[First scene opens with three Rangers riding through a tunnel, leaving the Wall, and going into the woods. (Eerie music in background) One Ranger splits off and finds a campsite full of mutilated bodies, including a child hanging from a tree branch. <a class="ReferentFragmentVariantdesktop__ClickTarget-sc-1837hky-0 dkZkek" href="/1639510/Game-of-thrones-winter-is-coming/A-birds-eye-view-shows-the-bodies-arranged-in-a-shield-like-pattern"><span class="ReferentFragmentVariantdesktop__Highlight-sc-1837hky-1 jShaMP">A birds-eye view shows the bodies arranged in a shield-like pattern.</span></a><span style="position:absolute;opacity:0;width:0;height:0;pointer-events:none;z-index:-1" tabindex="0"></span><span><span style="position:absolute;opacity:0;width:0;height:0;pointer-events:none;z-index:-1" tabindex="0"></span><span style="position:absolute;opacity:0;width:0;heig

We see that there are a lot of tags in our soup. We can remove the unused tags using the `.extract()` method. To do that, we first re-parse the `lyrics` element as its own `BeautifulSoup` object, and then apply `.extract()` to each unused tag.
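
One thing to keep in mind is that `.extract()` removes a tag without leaving anything in its place. For `<br>` tags this means the lines on either side end up joined together. The sketch below compares that with swapping each `<br>` for a newline using `.replace_with()`. It is an illustration on a hand-written fragment and assumes the lines of a script are separated by `<br/>` tags.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; assumes lines are separated by <br/> tags.
html = "<p>WILL: Close as any man would.<br/>GARED: We should head back to the wall.</p>"

# Variant 1: extract() removes the tag and leaves nothing in its place.
removed = BeautifulSoup(html, "html.parser")
for br in removed.find_all("br"):
    br.extract()

# Variant 2: replace_with() swaps each <br/> for a newline character.
replaced = BeautifulSoup(html, "html.parser")
for br in replaced.find_all("br"):
    br.replace_with("\n")

fused_text = removed.get_text()
split_lines = replaced.get_text().split("\n")

print(fused_text)
print(split_lines)
```

With the second variant, a later split on `'\n'` can separate one speaker's line from the next.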

In [309]:
lyrics = BeautifulSoup(str(lyrics))
[s.extract() for s in lyrics('br')]
[s.extract() for s in lyrics('i')]
[s.extract() for s in lyrics('hr')]
[s.extract() for s in lyrics('h1')]
[s.extract() for s in lyrics('h2')]
[s.extract() for s in lyrics('h3')]

# take a look inside
# again only snippet because the result too large
print(str(lyrics)[:3000])

<html><body><div class="Lyrics__Container-sc-1ynbvzw-6 jYfhrf" data-lyrics-container="true"><b>EPISODE 1 - WINTER IS COMING</b><p>[First scene opens with three Rangers riding through a tunnel, leaving the Wall, and going into the woods. (Eerie music in background) One Ranger splits off and finds a campsite full of mutilated bodies, including a child hanging from a tree branch. <a class="ReferentFragmentVariantdesktop__ClickTarget-sc-1837hky-0 dkZkek" href="/1639510/Game-of-thrones-winter-is-coming/A-birds-eye-view-shows-the-bodies-arranged-in-a-shield-like-pattern"><span class="ReferentFragmentVariantdesktop__Highlight-sc-1837hky-1 jShaMP">A birds-eye view shows the bodies arranged in a shield-like pattern.</span></a><span style="position:absolute;opacity:0;width:0;height:0;pointer-events:none;z-index:-1" tabindex="0"></span><span><span style="position:absolute;opacity:0;width:0;height:0;pointer-events:none;z-index:-1" tabindex="0"></span><span style="position:absolute;opacity:0;width:

We have now removed some of the unused tags from `lyrics`. From this we can see that all of the conversations are wrapped in `p` tags and follow the pattern `[Person]:[Sentences]`.

However, other text that is not conversation is also stored in these tags, so we need to clean the data again later.

In [323]:
# get the 'p' tags inner HTML
paragraphs = lyrics.find_all('p')

# create variable to store the conversations
conversations = []

# iterating all 'p' tags found
for p in paragraphs:
    # get the inner text of p, create list by splitting text using '\n', and extend them to list outside the loop
    conversations.extend(p.text.split('\n'))

# remove empty strings
conversations = list(filter(None, conversations))

# by following the [person]:[sentences] pattern, convert the string inside list to tuple format
conversations = [tuple(s.split(':')) for s in conversations]

for conversation in conversations[:1]:
    print(conversation)

('[First scene opens with three Rangers riding through a tunnel, leaving the Wall, and going into the woods. (Eerie music in background) One Ranger splits off and finds a campsite full of mutilated bodies, including a child hanging from a tree branch. A birds-eye view shows the bodies arranged in a shield-like pattern. The Ranger rides back to the other two.]WAYMAR ROYCE', 'What d’you expect? They’re savages. One lot steals a goat from another lot and before you know it, they’re ripping each other to pieces.WILL', 'I’ve never seen wildlings do a thing like this. I’ve never seen a thing like this, not ever in my life.WAYMAR ROYCE', 'How close did you get?WILL', 'Close as any man would.GARED', 'We should head back to the wall.ROYCE', 'Do the dead frighten you?GARED', 'Our orders were to track the wildlings. We tracked them. They won’t trouble us no more.ROYCE', 'You don’t think he’ll ask us how they died? Get back on your horse.[GARED grumbles.]WILL', 'Whatever did it to them could do it
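
Splitting on every `:` can cut a sentence into more than two pieces when the sentence itself contains a colon. A hedged alternative is to split on the first colon only with `str.partition`. The helper name `split_speaker` and the sample lines are made up for this sketch, and unlike the cell above it also strips surrounding whitespace.

```python
def split_speaker(line):
    """Split 'PERSON: sentence' on the first colon only."""
    person, separator, sentence = line.partition(":")
    if not separator:
        # No colon at all: keep the line as a one-element tuple.
        return (line.strip(),)
    return (person.strip(), sentence.strip())


print(split_speaker("NED: Remember this: winter is coming."))
print(split_speaker("[GARED grumbles.]"))
```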

We now have all the conversations stored in a list of tuples in (`person`, `sentence`) format. Unfortunately, some entries in the list do not match this format, which indicates that they are not conversations. Therefore, we need to remove them.

Our list now holds two types of tuple: tuples with 2 values and tuples with only one value. Let's take a look at both types.

In [324]:
for index, conversation in enumerate(conversations[1:265]):
    if len(conversation) >= 2:
        print(str(index) + ' | 2 values | ' + ''.join(str(c) + ':' for c in conversation)[:-1])
    else:
        print(str(index) + ' | 1 value | ' + ''.join(str(c) + ':' for c in conversation)[:-1])

0 | 2 values | [Riders from Winterfell come up behind a dazed WILL. The scene shifts to the castle: where BRAN is practicing archery and getting frustrated: under the eyes of JON SNOW and ROBB STARK. JON pats BRAN’S shoulder.]JON: Go on. Father’s watching.[We see NED and CATELYN STARK watching from above.]JON: And your mother.[Scene shifts to needlework practice with the girls inside the castle.]SEPTA MORDANE (to SANSA): Fine work: as always. Well done.SANSA: Thank you.SEPTA MORDANE: I love the detail that you’ve managed to get in this corners. … Quite beautiful … the stitching …[As she murmurs to SANSA about the embroidery: ARYA struggles with her needlework and listens to the arrows hitting and the male laughter outside.][Outside: BRAN tries and misses again. Everyone laughs.]NED: And which one of you was a marksman at ten? Keep practicing: Bran. Go on.JON: Don’t think too much: Bran.ROBB: Relax your bow arm.[BRAN pulls the arrow back. An arrow hits the bullseye. BRAN (still with his

Ideally, we could just remove the tuples that have only 1 value and keep the rest as our clean data. However, it turns out that some of the conversations do not follow the `[person]:[sentences]` format. We need some `regex` matching so that we do not lose those conversations.

But before doing that, we need to remove the tuples that represent background situations. Those tuples have their text written inside brackets `[]`. We can remove them using `regex` matching. To get a better feel for `regex`, you can practice here: https://regexr.com/
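
To get a feel for matching bracketed text, here is a small standalone sketch. It uses a different, simpler technique than the cell that follows: it deletes every `[...]` block with `re.sub` instead of capturing the text around it. The helper name `strip_stage_directions` and the sample strings are assumptions made for the example.

```python
import re

# Matches one bracketed block that contains no closing bracket inside it.
STAGE_DIRECTION = re.compile(r"\[[^\]]*\]")


def strip_stage_directions(text):
    """Remove every '[...]' block and tidy the leftover whitespace."""
    without_brackets = STAGE_DIRECTION.sub(" ", text)
    return " ".join(without_brackets.split())


print(strip_stage_directions("Get back on your horse.[GARED grumbles.] Now."))
print(repr(strip_stage_directions("[Scene shifts to the castle.]")))
```

Using `[^\]]*` instead of `.+` keeps each match inside a single pair of brackets, so two separate bracketed blocks on one line are removed individually.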

In [313]:
import re
# regex to find conversations in [ some text ] format
regex = r'(.+)\[.+\](.+)|(.+)\[.+\]|\[.+\]'
pattern = re.compile(regex)

for index, conversation in enumerate(conversations):
    if len(conversation) <= 1:
        match = pattern.findall(conversation[0])
        if len(match) > 0:
            conversations[index] = tuple((''.join(e + ' ' for e in list(filter(None, match[0]))).replace('    ',' ').replace('   ',' ').replace('  ', ' ')).split('\n'))

conversations = list(filter(None, conversations))
conversations = [c for c in conversations if len(c[0]) > 0]

# show
conversations[0:265]

[('[First scene opens with three Rangers riding through a tunnel, leaving the Wall, and going into the woods. (Eerie music in background) One Ranger splits off and finds a campsite full of mutilated bodies, including a child hanging from a tree branch. A birds-eye view shows the bodies arranged in a shield-like pattern. The Ranger rides back to the other two.]WAYMAR ROYCE',
  ' What d’you expect? They’re savages. One lot steals a goat from another lot and before you know it, they’re ripping each other to pieces.WILL',
  ' I’ve never seen wildlings do a thing like this. I’ve never seen a thing like this, not ever in my life.WAYMAR ROYCE',
  ' How close did you get?WILL',
  ' Close as any man would.GARED',
  ' We should head back to the wall.ROYCE',
  ' Do the dead frighten you?GARED',
  ' Our orders were to track the wildlings. We tracked them. They won’t trouble us no more.ROYCE',
  ' You don’t think he’ll ask us how they died? Get back on your horse.[GARED grumbles.]WILL',
  ' Whateve

Even though we have already filtered out some of the background situations, there are cases in which a background situation is split across two lines, which our previous regex does not capture. We need to clean the list again.
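
The pattern used for this second pass matches any string that starts with `[` or ends with `]`. The sketch below shows its effect on three sample strings, which are adapted by hand for illustration and are not taken verbatim from the scraped data.

```python
import re

# Starts with '[' followed by text, or ends with text followed by ']'.
pattern = re.compile(r"^\[.+|.+\]$")

samples = [
    "[The scene shifts to the castle,",    # opening half of a direction
    "where BRAN is practicing archery.]",  # closing half
    "Go on. Father's watching.",           # ordinary dialogue
]

flags = [bool(pattern.search(s)) for s in samples]
print(flags)
```

Because the pattern would also match dialogue that merely ends with a bracketed note, it is only applied to one-valued tuples.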

In [314]:
# regex that match for '[ some text' and 'some text ]' format
regex = r'^\[.+|.+\]$'
pattern = re.compile(regex)

for index, conversation in enumerate(conversations):
    if len(conversation) <= 1:
        match = pattern.search(conversation[0])
        if match:
            conversations[index] = None

conversations = list(filter(None, conversations))
conversations[0:265]

[('[First scene opens with three Rangers riding through a tunnel, leaving the Wall, and going into the woods. (Eerie music in background) One Ranger splits off and finds a campsite full of mutilated bodies, including a child hanging from a tree branch. A birds-eye view shows the bodies arranged in a shield-like pattern. The Ranger rides back to the other two.]WAYMAR ROYCE',
  ' What d’you expect? They’re savages. One lot steals a goat from another lot and before you know it, they’re ripping each other to pieces.WILL',
  ' I’ve never seen wildlings do a thing like this. I’ve never seen a thing like this, not ever in my life.WAYMAR ROYCE',
  ' How close did you get?WILL',
  ' Close as any man would.GARED',
  ' We should head back to the wall.ROYCE',
  ' Do the dead frighten you?GARED',
  ' Our orders were to track the wildlings. We tracked them. They won’t trouble us no more.ROYCE',
  ' You don’t think he’ll ask us how they died? Get back on your horse.[GARED grumbles.]WILL',
  ' Whateve

Now that we have filtered out all of the background situations, our list should contain conversations only. However, we still have not done anything about the conversations that do not follow the `[person]:[sentence]` format.

Let's take a look at them.

In [315]:
for index, conversation in enumerate(conversations):
    if len(conversation) < 2:
        print(str(index) + ' | ' + conversation[0])

If we look closely, these conversations match the pattern `[person in uppercase] (some text) [sentences]`. Besides, there can also be some conversations that do not have the `(some text)` part.

Once again, we will extract the `person` and `sentence` from those conversations using `regex`.
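
As a quick check of what such a pattern captures, the sketch below compares the single-word pattern used in this kernel with a hypothetical variant that also accepts multi-word names and an optional parenthesised note. The variant and the sample line are assumptions for illustration; the variant would also misread an all-caps first word of a sentence as part of the name.

```python
import re

# The pattern from this kernel: two or more capital letters, then the rest.
single_word = re.compile(r"^([A-Z]{2,})(.+)")

# Hypothetical variant: allows spaces inside the name and skips an
# optional parenthesised note such as '(to SANSA)'.
multi_word = re.compile(r"^([A-Z]{2,}(?: [A-Z]{2,})*)(?: \([^)]*\))?\s*(.+)")

line = "SEPTA MORDANE (to SANSA) Fine work, as always."

single_result = single_word.findall(line)
multi_result = multi_word.findall(line)

print(single_result)
print(multi_result)
```

The first pattern stops at the first space, so only `SEPTA` is taken as the person, while the variant keeps the full name together.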

In [316]:
# regex to match with '[person in uppercase] [rest of the text]' and '[person in uppercase] (some text) [rest of the text]' format
regex = '^([A-Z]{2,})(.+)'
pattern = re.compile(regex)

for index, conversation in enumerate(conversations):
    if len(conversation) <= 1:
        match = pattern.findall(conversation[0])
        if len(match) > 0:
            conversations[index] = (match[0][0], match[0][-1])
    
# take a look
conversations[1:135]

[('[Riders from Winterfell come up behind a dazed WILL. The scene shifts to the castle, where BRAN is practicing archery and getting frustrated, under the eyes of JON SNOW and ROBB STARK. JON pats BRAN’S shoulder.]JON',
  ' Go on. Father’s watching.[We see NED and CATELYN STARK watching from above.]JON',
  ' And your mother.[Scene shifts to needlework practice with the girls inside the castle.]SEPTA MORDANE (to SANSA)',
  ' Fine work, as always. Well done.SANSA',
  ' Thank you.SEPTA MORDANE',
  ' I love the detail that you’ve managed to get in this corners. … Quite beautiful … the stitching …[As she murmurs to SANSA about the embroidery, ARYA struggles with her needlework and listens to the arrows hitting and the male laughter outside.][Outside, BRAN tries and misses again. Everyone laughs.]NED',
  ' And which one of you was a marksman at ten? Keep practicing, Bran. Go on.JON',
  ' Don’t think too much, Bran.ROBB',
  ' Relax your bow arm.[BRAN pulls the arrow back. An arrow hits the bu

At this point, all of the conversations are in the desired format. Our `conversations` list should now hold every conversation as a two-valued tuple. Let's check whether any one-valued tuples remain.

In [228]:
for conversation in conversations:
    if len(conversation) < 2:
        print(conversation)

We can now take out all of the one valued tuples from our list.

In [229]:
conversations = [conversation for conversation in conversations if len(conversation) > 1]

# take a look
conversations[:10]

[('[First scene opens with three Rangers riding through a tunnel, leaving the Wall, and going into the woods. (Eerie music in background) One Ranger splits off and finds a campsite full of mutilated bodies, including a child hanging from a tree branch. A birds-eye view shows the bodies arranged in a shield-like pattern. The Ranger rides back to the other two.]WAYMAR ROYCE',
  ' What d’you expect? They’re savages. One lot steals a goat from another lot and before you know it, they’re ripping each other to pieces.WILL',
  ' I’ve never seen wildlings do a thing like this. I’ve never seen a thing like this, not ever in my life.WAYMAR ROYCE',
  ' How close did you get?WILL',
  ' Close as any man would.GARED',
  ' We should head back to the wall.ROYCE',
  ' Do the dead frighten you?GARED',
  ' Our orders were to track the wildlings. We tracked them. They won’t trouble us no more.ROYCE',
  ' You don’t think he’ll ask us how they died? Get back on your horse.[GARED grumbles.]WILL',
  ' Whateve

Finally, we have our desired list of conversations, which is the last piece of data that we want to scrape. Next, we combine all of the data we have gathered: `episode number`, `episode title`, `season number`, `release date`, and `conversation`. We will put them together and store them in a `dataframe`.

#### Create Dataframe
For creating dataframe we need to import the required package.

In [230]:
import pandas as pd

To make a tidier dataframe, we cannot use raw tuples as our entries. Therefore, we need to separate our conversation data into two sets of values, `person` and `sentence`. We are going to create a pandas `Series` for each of them.

In [231]:
person = pd.Series([c[0] for c in conversations])
sentence = pd.Series([c[1] for c in conversations])

Let's have a quick look.

In [232]:
print(person.head())
print(sentence.head())

0    [First scene opens with three Rangers riding t...
1    [Riders from Winterfell come up behind a dazed...
dtype: object
0     What d’you expect? They’re savages. One lot s...
1     Go on. Father’s watching.[We see NED and CATE...
dtype: object


Now we have all the data separated into different variables. We can wrap all of these variables in a single dataframe.

In [234]:
script = pd.DataFrame({
    'Season': season,
    'Episode': episode_number,
    'Episode Title': episode_title,
    'Sentence': sentence,
    'Name': person,
    'Release Date': release_date
})
script = script[['Release Date','Season','Episode','Episode Title','Name','Sentence']]
print(script.info())
script.head()
# `encoding` is omitted: recent pandas versions no longer accept it in to_excel
script.to_excel('Game_of_Thrones_Script_episode1.xlsx', index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Release Date   2 non-null      object
 1   Season         2 non-null      object
 2   Episode        2 non-null      object
 3   Episode Title  2 non-null      object
 4   Name           2 non-null      object
 5   Sentence       2 non-null      object
dtypes: object(6)
memory usage: 224.0+ bytes
None


We can still see some dirty data in this dataframe. At the top we have a row with Name `EPISODE` and Sentence `1 - WINTER IS COMING`. Another case is different representations of the same person, such as `DAENERYS` and `DAENERYS TARGARYEN`. There are other cases as well. However, we will clean these later in the `Post Scrapping Data Cleansing` part. For now, we at least have all of the conversations, with no loss, in our desired format.

### Wrap The Process in Functions
Based on what we have done so far, we know how to scrape and gather all the data required to build a dataframe of conversations from one episode of Game of Thrones. We can now iterate over all of the episode `URLs` and apply the whole process to each of them to get all of the scripts.

However, instead of putting the whole process inside a single loop, it is better to wrap each independent step in its own function.

#### Get Episode

In [184]:
def get_episode(soup):
    episode = soup.find_all('div', class_='AlbumTracklist__TrackName-sc-123giuo-2 guEaas')[0].text.split('\n')
    episode = ''.join(e + ' ' for e in episode)
    episode = episode.split(' ')
    episode = list(filter(None, episode))

    episode_number = ''.join('Episode ' + episode[0].split('.')[0])
    episode_title = ''.join(e + ' ' for e in episode[1:])[:-1]
    
    return episode_number, episode_title

#### Get Season

In [185]:
def get_season(soup):
    season = soup.find_all('a', class_='Link-h3isu4-0 gHBbjJ')[0].text.replace('\n','').split(' ')
    season = list(filter(None, season))
    season = ''.join(s + ' ' for s in season[:-1])[:-1]
    
    return season

#### Get Release Date

In [174]:
#from datetime import datetime

#def get_release_date(soup):
    #release_date = soup.find_all('div', class_='HeaderMetadata__Section-sc-1p42fnf-3 jROWVH')[1].next_element.next_element.next_element.text
    #release_date = datetime.strptime(release_date, '%B %d, %Y')
    #release_date = datetime.strftime(release_date, '%Y-%m-%d')
    
    #return release_date

#### Get Conversations

In [175]:
import re

def get_conversations(soup):
    lyrics = soup.find_all("div", class_="Lyrics__Container-sc-1ynbvzw-6 jYfhrf")[0]

    lyrics = BeautifulSoup(str(lyrics))
    [s.extract() for s in lyrics('br')]
    [s.extract() for s in lyrics('i')]
    [s.extract() for s in lyrics('hr')]
    [s.extract() for s in lyrics('h1')]
    [s.extract() for s in lyrics('h2')]
    [s.extract() for s in lyrics('h3')]

    paragraphs = lyrics.find_all('p')

    conversations = []

    for p in paragraphs:
        conversations.extend(p.text.split('\n'))

    conversations = list(filter(None, conversations))
    conversations = [tuple(s.split(':')) for s in conversations]
    
    regex = r'(.+)\[.+\](.+)|(.+)\[.+\]|\[.+\]'
    pattern = re.compile(regex)
    
    for index, conversation in enumerate(conversations):
        if len(conversation) <= 1:
            match = pattern.findall(conversation[0])
            if len(match) > 0:
                conversations[index] = tuple((''.join(e + ' ' for e in list(filter(None, match[0]))).replace('    ',' ').replace('   ',' ').replace('  ', ' ')).split('\n'))
                
    conversations = list(filter(None, conversations))
    conversations = [c for c in conversations if len(c[0]) > 0]
    
    regex = r'^\[.+|.+\]$'
    pattern = re.compile(regex)
    
    for index, conversation in enumerate(conversations):
        if len(conversation) <= 1:
            match = pattern.search(conversation[0])
            if match:
                conversations[index] = None
                
    conversations = list(filter(None, conversations))
    
    regex = '^([A-Z]{2,})(.+)'
    pattern = re.compile(regex)
    
    for index, conversation in enumerate(conversations):
        if len(conversation) <= 1:
            match = pattern.findall(conversation[0])
            if len(match) > 0:
                conversations[index] = (match[0][0], match[0][-1])
                
    conversations = [conversation for conversation in conversations if len(conversation) > 1]
    
    return conversations
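
One detail of the colon split is worth a closer look: `s.split(':')` splits at every colon, so a sentence that itself contains a colon produces a tuple with more than two parts, and reading only `c[1]` later would drop the rest. A minimal sketch of splitting at the first colon only follows. `split_speaker` is a hypothetical helper and the sample line is made up for illustration.

```python
def split_speaker(line):
    """Split 'NAME: sentence' at the first colon only (hypothetical helper)."""
    name, separator, sentence = line.partition(':')
    if not separator:
        # No colon at all: keep the line as a one-element tuple
        return (line,)
    return (name.strip(), sentence.strip())

# Made-up sample line with a colon inside the sentence
line = 'TYRION: Listen to me: this matters.'

print(tuple(line.split(':')))   # three parts, the sentence is cut in two
print(split_speaker(line))      # two parts, the sentence is kept whole
```

Unlike the original function, this helper also trims the whitespace around the name and the sentence.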

#### Create Dataframe

In [192]:
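def create_dataframe(episode_number, episode_title, season, conversations):
    # Take the inputs as explicit arguments instead of reading global variables.
    # dtype='object' avoids the pandas warning raised when conversations is empty.
    person = pd.Series([c[0] for c in conversations], dtype='object')
    sentence = pd.Series([c[1] for c in conversations], dtype='object')
    
    script = pd.DataFrame({
        'Episode': episode_number,
        'Episode Title': episode_title,
        'Season': season,
        'Sentence': sentence,
        'Name': person
    })
    
    # Put the columns in reading order
    script = script[['Season','Episode','Episode Title','Name','Sentence']]
    
    return script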

### Iterate All Episodes
After wrapping each of our independent steps in its own function, the next thing to do is apply these functions to all of our episode `URLs` to get the whole script of Game of Thrones. To do this we write a simple for loop over our `URLs` and call the functions we made above inside the loop.

In [325]:
# initiate an empty list to store dataframes from each episode
scripts = []
for url in urls:
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc)
    
    episode_number, episode_title = get_episode(soup)
    season = get_season(soup)
    conversations = get_conversations(soup)
    
    df_scripts = create_dataframe(episode_number = episode_number, 
                                  episode_title = episode_title,
                                  season = season,
                                  conversations = conversations)
    
    scripts.append(df_scripts)
    print('Script from: ' + url + ' added')

Script from: https://genius.com/Game-of-thrones-winter-is-coming-annotated added
Script from: https://genius.com/Game-of-thrones-the-kingsroad-annotated added
Script from: https://genius.com/Game-of-thrones-lord-snow-annotated added
Script from: https://genius.com/Game-of-thrones-cripples-bastards-and-broken-things-annotated added
Script from: https://genius.com/Game-of-thrones-the-wolf-and-the-lion-annotated added
Script from: https://genius.com/Game-of-thrones-a-golden-crown-annotated added
Script from: https://genius.com/Game-of-thrones-you-win-or-you-die-annotated added
Script from: https://genius.com/Game-of-thrones-the-pointy-end-annotated added
Script from: https://genius.com/Game-of-thrones-baelor-annotated added
Script from: https://genius.com/Game-of-thrones-fire-and-blood-annotated added
Script from: https://genius.com/Game-of-thrones-the-north-remembers-annotated added
Script from: https://genius.com/Game-of-thrones-the-night-lands-annotated added
Script from: https://geniu

  person = pd.Series([c[0] for c in conversations])
  sentence = pd.Series([c[1] for c in conversations])


Script from: https://genius.com/Game-of-thrones-walk-of-punishment-annotated added
Script from: https://genius.com/Game-of-thrones-and-now-his-watch-is-ended-annotated added
Script from: https://genius.com/Game-of-thrones-kissed-by-fire-annotated added
Script from: https://genius.com/Game-of-thrones-the-climb-annotated added
Script from: https://genius.com/Game-of-thrones-the-bear-and-the-maiden-fair-annotated added
Script from: https://genius.com/Game-of-thrones-second-sons-annotated added
Script from: https://genius.com/Game-of-thrones-the-rains-of-castamere-annotated added
Script from: https://genius.com/Game-of-thrones-mhysa-annotated added
Script from: https://genius.com/Game-of-thrones-two-swords-annotated added
Script from: https://genius.com/Game-of-thrones-the-lion-and-the-rose-annotated added
Script from: https://genius.com/Game-of-thrones-breaker-of-chains-annotated added
Script from: https://genius.com/Game-of-thrones-oathkeeper-annotated added
Script from: https://genius.com/Game-of-thrones-first-of-his-name-annotated added
Script from: https://genius.com/Game-of-thrones-the-laws-of-gods-and-men-annotated added
Script from: https://genius.com/Game-of-thrones-mockingbird-annotated added
Script from: https://genius.com/Game-of-thrones-the-mountain-and-the-viper-annotated added
Script from: https://genius.com/Game-of-thrones-the-watchers-on-the-wall-annotated added
Script from: https://genius.com/Game-of-thrones-the-children-annotated added
Script from: https://
Script from: https://genius.com/Game-of-thrones-beyond-the-wall-annotated added
Script from: https://genius.com/Game-of-thrones-the-dragon-and-the-wolf-annotated added
Script from: https://genius.com/Game-of-thrones-winterfell-annotated added
Script from: https://genius.com/Game-of-thrones-a-knight-of-the-seven-kingdoms-annotated added
Script from: https://genius.com/Game-of-thrones-the-long-night-annotated added
Script from: https://genius.com/Game-of-thrones-the-last-of-the-starks-annotated added
Script from: https://genius.com/Game-of-thrones-the-bells-annotated added
Script from: https://genius.com/Game-of-thrones-the-iron-throne-annotated added

In [326]:
script_dataframe = pd.concat(scripts)
script_dataframe.info()
script_dataframe.to_excel('Game_of_Thrones_Script_avant.xlsx', encoding='utf-8', index=False)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 182 entries, 0 to 3
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Season         182 non-null    object
 1   Episode        182 non-null    object
 2   Episode Title  182 non-null    object
 3   Name           182 non-null    object
 4   Sentence       182 non-null    object
dtypes: object(5)
memory usage: 8.5+ KB
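
The `Int64Index` line in the output above runs from `0` to `3`, which shows that `pd.concat` keeps each episode's own row labels, so the same label appears in several episodes. If unique labels are wanted, passing `ignore_index=True` renumbers the rows. This is a minimal sketch with two made-up frames standing in for the per-episode dataframes:

```python
import pandas as pd

# Two tiny made-up frames standing in for per-episode dataframes
first = pd.DataFrame({'Name': ['A', 'B'], 'Sentence': ['x', 'y']})
second = pd.DataFrame({'Name': ['C'], 'Sentence': ['z']})

# Default behaviour: every frame keeps its own labels, so 0 appears twice
kept = pd.concat([first, second])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([first, second], ignore_index=True)

print(list(kept.index), list(renumbered.index))
```

Repeated labels are harmless for the filtering steps that follow, but they matter for any operation that selects rows by label.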


Finally, we have collected the scripts for all seasons of Game of Thrones in a single dataframe. However, we should keep in mind that the data is not completely clean yet. That is why the next step is post-scraping data cleansing, to make sure the dataset is safe to use.

## Post-Scrapping Data Cleansing

In [40]:
script_dataframe = script_dataframe.dropna()
script_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26099 entries, 0 to 292
Data columns (total 6 columns):
Release Date     26099 non-null object
Season           26099 non-null object
Episode          26099 non-null object
Episode Title    26099 non-null object
Name             26099 non-null object
Sentence         26099 non-null object
dtypes: object(6)
memory usage: 1.4+ MB


In previous sections we encountered some `Name` and `Sentence` values that contain bracketed strings. These describe what the person is thinking or who the person is talking to. We don't want them to pollute our data, so we need to remove them.

In [41]:
import re
def remove_bracketed(text):
    # Remove every parenthesised chunk, then collapse the double spaces left behind
    regex = r'\([^)]*\)'
    text = re.sub(regex, '', text).replace('  ',' ')
    return text

script_dataframe['Name'] = script_dataframe['Name'].apply(remove_bracketed)
script_dataframe['Sentence'] = script_dataframe['Sentence'].apply(remove_bracketed)

script_dataframe.head(3)

Unnamed: 0,Release Date,Season,Episode,Episode Title,Name,Sentence
0,2011-04-17,Season 1,Episode 1,Winter is Coming,EPISODE,1 - WINTER IS COMING
1,2011-04-17,Season 1,Episode 1,Winter is Coming,WAYMAR ROYCE,What d’you expect? They’re savages. One lot s...
2,2011-04-17,Season 1,Episode 1,Winter is Coming,WILL,I’ve never seen wildlings do a thing like thi...
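
The same removal can also be written with pandas string methods, which avoids calling a Python function once per row. This is a sketch on made-up names, not a drop-in replacement: unlike `remove_bracketed`, it also trims leading and trailing whitespace.

```python
import pandas as pd

# Made-up values with parenthesised stage directions
names = pd.Series(['JON (to Sam)', 'ARYA', 'NED (whispering) STARK'])

cleaned = (
    names.str.replace(r'\([^)]*\)', '', regex=True)   # drop "( ... )" chunks
         .str.replace(r'\s{2,}', ' ', regex=True)      # collapse repeated spaces
         .str.strip()                                  # trim both ends
)
print(cleaned.tolist())
```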


We have now eliminated the null values from our dataframe and removed the bracketed text from the `Name` and `Sentence` columns.

For further cleansing, we need to make the values of our `Name` column homogeneous. The first step is to convert the `Name` column to lowercase.

In [42]:
script_dataframe['Name'] = script_dataframe['Name'].apply(lambda x: str(x).lower())
script_dataframe.head(3)

Unnamed: 0,Release Date,Season,Episode,Episode Title,Name,Sentence
0,2011-04-17,Season 1,Episode 1,Winter is Coming,episode,1 - WINTER IS COMING
1,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot s...
2,2011-04-17,Season 1,Episode 1,Winter is Coming,will,I’ve never seen wildlings do a thing like thi...


After converting all names to lowercase, we need to remove all non-alphabetic characters from the `Name` column. We can write a simple function that does a regex substitution.

In [43]:
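import re
def remove_non_alphabetic(text):
    # Keep only letters and whitespace, then collapse double spaces
    regex = r'[^A-Za-z\s]'
    text = re.sub(regex, '', text).replace('  ',' ')
    # rstrip trims trailing spaces and, unlike text[-1], is safe on an empty string
    text = text.rstrip(' ')
    return text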

In [44]:
script_dataframe['Name'] = script_dataframe['Name'].apply(remove_non_alphabetic)
script_dataframe.head(3)

Unnamed: 0,Release Date,Season,Episode,Episode Title,Name,Sentence
0,2011-04-17,Season 1,Episode 1,Winter is Coming,episode,1 - WINTER IS COMING
1,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot s...
2,2011-04-17,Season 1,Episode 1,Winter is Coming,will,I’ve never seen wildlings do a thing like thi...


All names are now in a similar format: lowercase and made up of alphabetic characters only. The next step is to remove the narration text from our dataframe. As we saw previously, some of the background descriptions, which we categorize as narration, are written in the same format as a conversation. Some examples are `EPISODE` and `CUT TO`.

The easiest way to do this is to extract the first word of each name into a new column. This column will later be used to filter the names.

In [45]:
script_dataframe['First Token'] = script_dataframe['Name'].apply(lambda x: str(x).split(' ')[0])
script_dataframe.head(3)

Unnamed: 0,Release Date,Season,Episode,Episode Title,Name,Sentence,First Token
0,2011-04-17,Season 1,Episode 1,Winter is Coming,episode,1 - WINTER IS COMING,episode
1,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot s...,waymar
2,2011-04-17,Season 1,Episode 1,Winter is Coming,will,I’ve never seen wildlings do a thing like thi...,will


Next, we filter the background descriptions out of our dataframe. These entries have a `First Token` such as `episode`, `cut`, `int`, or `ext`.

In [46]:
script_dataframe = script_dataframe[(script_dataframe['First Token'] != 'cut') &
                                    (script_dataframe['First Token'] != 'int') &
                                    (script_dataframe['First Token'] != 'ext') &
                                    (script_dataframe['First Token'] != 'episode')]
script_dataframe.head(3)

Unnamed: 0,Release Date,Season,Episode,Episode Title,Name,Sentence,First Token
1,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot s...,waymar
2,2011-04-17,Season 1,Episode 1,Winter is Coming,will,I’ve never seen wildlings do a thing like thi...,will
3,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,How close did you get?,waymar
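
The four chained comparisons can also be expressed with `Series.isin`, which keeps the list of narration tokens in one place. A minimal sketch on a made-up frame follows; `narration_tokens` is a name introduced here for illustration.

```python
import pandas as pd

# Tokens that mark narration rather than a speaking character
narration_tokens = ['cut', 'int', 'ext', 'episode']

# Made-up stand-in for the 'First Token' column
sample = pd.DataFrame({'First Token': ['episode', 'waymar', 'cut', 'will']})

# isin builds a boolean mask; ~ negates it so only dialogue rows are kept
dialogue_only = sample[~sample['First Token'].isin(narration_tokens)]
print(dialogue_only['First Token'].tolist())
```

Adding another narration marker later then only requires extending the list.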


We have now filtered all of the background descriptions out of our dataframe. Let's focus on the `Name` column again.

We should check whether this column contains only plausible names. Many things can count as an anomaly in a person's name, such as its length. For now we will check whether there are any anomalous lengths in our `Name` column.

In [47]:
script_dataframe['Name Length'] = script_dataframe['Name'].apply(lambda x: len(str(x)))
print(script_dataframe['Name Length'].describe(percentiles=[.8,.9,.95,.99,.999,.9999,.99999,.999999]))
print(script_dataframe.info())
print(script_dataframe['Name Length'].value_counts().sort_values().head())

count       24354.000000
mean            6.734458
std             3.771149
min             2.000000
50%             6.000000
80%             8.000000
90%            12.000000
95%            13.000000
99%            17.000000
99.9%          21.000000
99.99%         23.000000
99.999%       251.158650
99.9999%      315.815865
max           323.000000
Name: Name Length, dtype: float64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24354 entries, 1 to 290
Data columns (total 8 columns):
Release Date     24354 non-null object
Season           24354 non-null object
Episode          24354 non-null object
Episode Title    24354 non-null object
Name             24354 non-null object
Sentence         24354 non-null object
First Token      24354 non-null object
Name Length      24354 non-null int64
dtypes: int64(1), object(7)
memory usage: 1.7+ MB
None
28      1
323     1
20      3
22      3
23     12
Name: Name Length, dtype: int64


Based on the information above, every value in the `Name` column except one is no longer than 28 characters. The exception is an outlier: one name that consists of more than 300 characters.

Once again, just as with null values, there are many ways to handle outliers. In this case we will simply eliminate the outlier because there is only one such entry.

In [48]:
script_dataframe = script_dataframe[script_dataframe['Name Length'] <= 28]
script_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24353 entries, 1 to 290
Data columns (total 8 columns):
Release Date     24353 non-null object
Season           24353 non-null object
Episode          24353 non-null object
Episode Title    24353 non-null object
Name             24353 non-null object
Sentence         24353 non-null object
First Token      24353 non-null object
Name Length      24353 non-null int64
dtypes: int64(1), object(7)
memory usage: 1.7+ MB


### Name Homogenization
As we already know, our `Name` column may contain different values for the same person. This can happen because of aliases, family names, nicknames, and so on. However, homogenizing every `Name` value would take a lot of effort. Instead, we will homogenize only the names of the characters that matter most in the data. Let's see who those characters are.

In [49]:
appearance_counts = script_dataframe.groupby(['Name'])['Sentence'].count().reset_index()
appearance_counts.Sentence.describe(percentiles=[.8,.9,.95,.99,.999])

count     638.000000
mean       38.170846
std       118.213483
min         1.000000
50%         5.000000
80%        32.000000
90%        83.900000
95%       182.750000
99%       637.950000
99.9%    1161.402000
max      1578.000000
Name: Sentence, dtype: float64

Based on the information above, only about 10% of the characters have more than 80 sentences. These are the characters we will focus on and homogenize.

In [50]:
most_sentence_characters = appearance_counts[appearance_counts['Sentence'] > 80].sort_values(by=['Sentence'], ascending=[0])
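
The cut-off of 80 was read off the percentile table by eye. If the threshold should follow the data instead, it can be computed with `Series.quantile`. The sketch below uses made-up counts, so the numbers are not those of the real dataset.

```python
import pandas as pd

# Made-up sentence counts for ten characters
counts = pd.Series([1, 2, 3, 5, 8, 13, 21, 34, 55, 900], name='Sentence')

# 90th percentile with pandas' default linear interpolation
threshold = counts.quantile(0.9)

# Keep only the characters above the threshold
frequent = counts[counts > threshold]
print(threshold, frequent.tolist())
```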

Taking a peek at the dataset, we see that it still contains generic labels such as `man` and `soldier`, which do not belong to a single character but are used for many different characters. These labels should be removed from the dataset.

In [51]:
most_sentence_characters = most_sentence_characters[(most_sentence_characters['Name'] != 'man') &
                                                    (most_sentence_characters['Name'] != 'soldier')]

We have now removed the generic labels from our character dataframe. Let's take a look at its unique values.

In [52]:
char_names = most_sentence_characters['Name'].unique()
print('total: ' + str(len(char_names)))
char_names

total: 63


array(['tyrion', 'jon', 'cersei', 'daenerys', 'jaime', 'sansa', 'arya',
       'davos', 'theon', 'sam', 'bronn', 'varys', 'brienne', 'bran',
       'tywin', 'jorah', 'stannis', 'margaery', 'ramsay', 'melisandre',
       'robb', 'eddard stark', 'jon snow', 'shae', 'gendry',
       'littlefinger', 'joffrey', 'tormund', 'gilly', 'tyrion lannister',
       'missandei', 'catelyn', 'ygritte', 'olenna', 'daario', 'podrick',
       'yara', 'the hound', 'osha', 'oberyn', 'baelish', 'sandor',
       'tommen', 'jaqen', 'grey worm', 'qyburn', 'talisa',
       'petyr baelish', 'meera', 'catelyn stark', 'samwell', 'thoros',
       'daenerys targaryen', 'robert baratheon', 'arya stark', 'shireen',
       'high sparrow', 'beric', 'euron', 'hound', 'sansa stark', 'grenn',
       'jorah mormont'], dtype=object)

Only 63 names with more than 80 sentences (from now on we will call these `important characters`) are left in our character dataframe. Manually homogenizing the names on this list does not take much effort. Also, notice that some of the characters on this list have another name or alias in the series. We are going to make a mapping for these aliases too.

For now, we will make a new dataframe containing the unique names and aliases of these characters.

In [53]:
char_names = ['tyrion lannister', 'jon snow', 'jaime lannister', 'sansa stark', 'arya stark', 'davos',
              'theon greyjoy', 'bronn', 'varys', 'brienne', 'bran stark', 'tywin lannister', 'jorah mormont', 'stannis baratheon',
              'margaery tyrell', 'ramsay bolton', 'melisandre', 'robb stark', 'jon snow', 'shae', 'gendry baratheon',
              'tormund', 'gilly', 'tyrion lannister', 'missandei', 'catelyn stark', 'ygritte', 'olenna tyrell', 'daario',
              'podrick', 'yara greyjoy', 'osha', 'oberyn martell', 'jaqen hghar','grey worm', 'qyburn', 'talisa', 'meera', 'catelyn stark',
              'thoros','robert baratheon', 'arya stark', 'shireen', 'sparrow', 'beric', 'euron greyjoy','sansa stark', 'grenn', 'jorah mormont']

alias_mapper = ['sandor clegane','petyr baelish','petyr baelish','sam tarly','eddard stark','cersei lannister','joffrey lannister',
                'tommen lannister','daenerys targaryen','daenerys targaryen']

alias = ['hound','littlefinger','baelish','samwell tarly','ned stark','cersei baratheon','joffrey baratheon',
         'tommen baratheon','daenerys stormborn','dany']

char_names = sorted(list(pd.Series(char_names).unique()))
char_alias = [None for i in range(0, len(char_names))]
char_names.extend(alias_mapper)
char_alias.extend(alias)
name_dictionary = pd.DataFrame({
    "Base Name": char_names,
    "Alias": char_alias
})

name_dictionary = name_dictionary[['Base Name','Alias']]
name_dictionary = name_dictionary.sort_values(by=['Base Name'])
name_dictionary.head()

Unnamed: 0,Base Name,Alias
0,arya stark,
1,beric,
2,bran stark,
3,brienne,
4,bronn,


After completing the name dictionary for the important characters, the next step is to map these names onto our main dataframe. We will first build a mapper dataframe to be used later for the mapping. To do this we need to make a copy of our main dataframe.

But before we make the copy, one case needs to be highlighted. Some of the character aliases contain words like `high` and `the`, for example `high sparrow` and `the hound`. Keeping these words would cause trouble when scoring string similarity later, because they inflate the score between two different names that share such a prefix. Therefore, we need to remove these words from our character names.

In [54]:
def clean_words(x):
    new_name = x.replace('the ','')
    new_name = new_name.replace('high ', '')
    return new_name
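
A caveat with plain `str.replace` is that it also matches inside longer words: any name containing `the ` or `high ` as the tail of a word is altered too. A regex with a word boundary limits the removal to whole words. This is a sketch of that alternative; `clean_words_bounded` and the name `agathe stone` are made up for illustration.

```python
import re

# \b requires 'the' or 'high' to start at a word boundary
PREFIX_PATTERN = re.compile(r'\b(?:the|high)\s+')

def clean_words_bounded(name):
    """Remove the standalone words 'the' and 'high' (hypothetical variant)."""
    return PREFIX_PATTERN.sub('', name)

print(clean_words_bounded('the hound'))
print(clean_words_bounded('high sparrow'))

# A made-up name whose first word merely ends in 'the'
print('agathe stone'.replace('the ', ''))    # plain replace damages the name
print(clean_words_bounded('agathe stone'))   # bounded version leaves it alone
```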

In [55]:
script_dataframe['Name'] = script_dataframe['Name'].apply(clean_words)

Now we can start building the mapper for our important characters, beginning with a copy of our main dataframe as a new dataframe object.

In [56]:
script_for_mapper = script_dataframe.copy()

The first step in creating this mapper is to generate the cartesian product of our new dataframe and our important characters dataframe. The easiest way to do this is to create a column with the same name and value in both dataframes and use it as the merge key. We add a `Cartesian Key` column with the value `0` to both dataframes, and then do an outer merge on that column. Because every row carries the same key, every row of one dataframe is paired with every row of the other.

In [57]:
script_for_mapper['Cartesian Key'] = 0
name_dictionary['Cartesian Key'] = 0
script_for_mapper = script_for_mapper.merge(name_dictionary, on=['Cartesian Key'], how='outer')
script_for_mapper.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1290709 entries, 0 to 1290708
Data columns (total 11 columns):
Release Date     1290709 non-null object
Season           1290709 non-null object
Episode          1290709 non-null object
Episode Title    1290709 non-null object
Name             1290709 non-null object
Sentence         1290709 non-null object
First Token      1290709 non-null object
Name Length      1290709 non-null int64
Cartesian Key    1290709 non-null int64
Base Name        1290709 non-null object
Alias            243530 non-null object
dtypes: int64(2), object(9)
memory usage: 118.2+ MB


Notice the massive increase in the number of rows of our dataframe. This happens because a cartesian product maps every row of the first dataframe to every row of the second, so the total row count is the product of the two row counts.
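
The numbers above bear this out: 24,353 script rows times 53 dictionary rows gives the 1,290,709 rows reported by `info()`. On pandas 1.2 or newer the same product can be requested directly with `how='cross'`, without a helper key column. A minimal sketch on made-up frames:

```python
import pandas as pd

# Made-up stand-ins for the script names and the name dictionary
left = pd.DataFrame({'Name': ['jon', 'sansa', 'ned']})
right = pd.DataFrame({'Base Name': ['jon snow', 'sansa stark']})

# how='cross' pairs every row of left with every row of right
pairs = left.merge(right, how='cross')

# 3 rows x 2 rows = 6 combinations
print(len(pairs))
```

Since the result grows multiplicatively, it is worth estimating the row count before running the merge on the full data.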

In order to create a proper mapper, we need to make sure that each name in our dataframe really is the name of one of our important characters. In this process we will also deal with typos in our dataframe. To do this we compute a similarity score between each name in the script and each entry of the important characters dataset, using either the base name or the alias.

The function below does the scoring. For the string similarity, after some research I found that the most suitable algorithm is `Jaro Winkler`. I will use the `jellyfish` package, which provides an implementation of it. You can read more about the `Jaro Winkler` algorithm here https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance, and the `jellyfish` documentation here https://jellyfish.readthedocs.io/.

In [58]:
!pip install jellyfish

Collecting jellyfish
  Downloading https://files.pythonhosted.org/packages/3f/80/bcacc7affb47be7279d7d35225e1a932416ed051b315a7f9df20acf04cbe/jellyfish-0.7.2.tar.gz (133kB)
Building wheels for collected packages: jellyfish
  Building wheel for jellyfish (setup.py) ... done
  Created wheel for jellyfish: filename=jellyfish-0.7.2-cp36-cp36m-linux_x86_64.whl size=81352 sha256=983016650beadadb358dbb22492d3f20d57789c2941b9e0cb1732c56d57b89f6
  Stored in directory: /tmp/.cache/pip/wheels/e8/fe/99/d8fa8f2ef7b82a625b0b77a84d319b0b50693659823c4effb4
Successfully built jellyfish
Installing collected packages: jellyfish
Successfully installed jellyfish-0.7.2


In [59]:
from jellyfish import jaro_winkler

def get_similarity(row):
    current_name = row['Name']
    base_name = row['Base Name']
    alias = row['Alias']
    
    score_base_name = 0
    score_alias = 0
    
    if current_name == base_name:
        score_base_name = 1
    else:
        listed_current_name = current_name.split(' ')
        listed_base_name = base_name.split(' ')
        
        if len(listed_current_name) > 1 and len(listed_base_name) > 1:
            family_name_similarity = jaro_winkler(listed_current_name[1], listed_base_name[1])
            if family_name_similarity > .9:
                score_base_name = jaro_winkler(listed_current_name[0], listed_base_name[0])
            else:
                score_base_name = jaro_winkler(current_name, base_name)
        elif len(listed_base_name) > 1:
            score_base_name = jaro_winkler(current_name, listed_base_name[0])
        else:
            score_base_name = jaro_winkler(current_name, base_name)
        
        if isinstance(alias, str):
            # Score against the alias separately so the base-name score is not overwritten
            listed_alias = alias.split(' ')
            if len(listed_current_name) > 1 and len(listed_alias) > 1:
                family_name_similarity = jaro_winkler(listed_current_name[1], listed_alias[1])
                if family_name_similarity > .9:
                    score_alias = jaro_winkler(listed_current_name[0], listed_alias[0])
                else:
                    score_alias = jaro_winkler(current_name, alias)
            elif len(listed_alias) > 1:
                score_alias = jaro_winkler(current_name, listed_alias[0])
            else:
                score_alias = jaro_winkler(current_name, alias)
    
    # Return the better of the two scores
    return max(score_base_name, score_alias)
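
To make the score less of a black box, here is a textbook-style sketch of Jaro-Winkler in plain Python. It is not the `jellyfish` implementation and may differ from it in details such as when the prefix bonus is applied. The names `jaro` and `jaro_winkler_sketch` are introduced here for illustration only.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are this close in position
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions
    s1_matched = [c for c, m in zip(s1, matched1) if m]
    s2_matched = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(s1_matched, s2_matched)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler_sketch(s1, s2, prefix_weight=0.1):
    """Jaro similarity plus a bonus for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


print(round(jaro_winkler_sketch('MARTHA', 'MARHTA'), 4))
print(round(jaro_winkler_sketch('robett', 'robert'), 4))
```

The prefix bonus is the reason two names that start with the same letters, such as `robett` and `robert`, score so highly even though they refer to different characters.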

In [60]:
script_for_mapper['Name Similarity'] = script_for_mapper.apply(get_similarity, axis=1)

After some data exploration I found that the minimum similarity score at which two names can be accepted as the same character is `0.89`. Therefore, we will create a new column named `Homogenized Name`: if the similarity score is greater than `0.89`, it takes the name from the important characters dataset (stored in this dataframe as the `Base Name` column); otherwise it is set to `None`.

In [61]:
def get_homogenized_name(x):
    similarity = x['Name Similarity']
    name = x['Name']
    base_name = x['Base Name']
    
    if similarity > .89:
        return base_name
    else:
        return None

In [62]:
script_for_mapper['Homogenized Name'] = script_for_mapper.apply(get_homogenized_name, axis=1)
script_for_mapper.head()

Unnamed: 0,Release Date,Season,Episode,Episode Title,Name,Sentence,First Token,Name Length,Cartesian Key,Base Name,Alias,Name Similarity,Homogenized Name
0,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot s...,waymar,12,0,arya stark,,0.644444,
1,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot s...,waymar,12,0,beric,,0.427778,
2,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot s...,waymar,12,0,bran stark,,0.572222,
3,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot s...,waymar,12,0,brienne,,0.484127,
4,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot s...,waymar,12,0,bronn,,0.427778,


By now, the rows for our important characters should have both `Name` and `Homogenized Name` filled with non-null values. Next we extract the `Name` and `Homogenized Name` columns and drop the rows containing `None`, so that the dataframe only contains the names of our important characters.

In [63]:
script_for_mapper = script_for_mapper[['Name','Homogenized Name']].dropna().drop_duplicates()
script_for_mapper.head()

Unnamed: 0,Name,Homogenized Name
815,jon,jon snow
992,sansa,sansa stark
1071,ned,eddard stark
1200,robb,robb stark
1383,catelyn,catelyn stark


At this point, there is a special case in this mapper. A character named `Robett Glover` is mapped to `Robert Baratheon`. This happens because the script on `genius.com` writes the name as `robett` only, without the family name. Our scoring function then compares `robett` with `robert`, which gives a high similarity score. For this case, I will manually delete the row from our mapper.

In [64]:
script_for_mapper = script_for_mapper.drop(1031097)
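
The label `1031097` depends on the row order produced by the merge, so it may point at a different row if the data or the dictionary changes. Selecting the row by its values is less fragile. The sketch below uses a made-up mapper with arbitrary labels.

```python
import pandas as pd

# Made-up mapper; the index labels are arbitrary
mapper = pd.DataFrame(
    {'Name': ['jon', 'robett', 'robert'],
     'Homogenized Name': ['jon snow', 'robert baratheon', 'robert baratheon']},
    index=[815, 1031097, 2044],
)

# Identify the wrong pair by content rather than by position
wrong_pair = (
    (mapper['Name'] == 'robett')
    & (mapper['Homogenized Name'] == 'robert baratheon')
)
mapper = mapper[~wrong_pair]
print(mapper['Name'].tolist())
```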

Now that we have a clean mapper for our important characters, we can finally map the names in our main dataframe onto it. We can do the mapping with a simple pandas left merge on the `Name` column of each dataframe.

In [65]:
script_dataframe = script_dataframe.merge(script_for_mapper, on=['Name'], how='left')
script_dataframe.head()

Unnamed: 0,Release Date,Season,Episode,Episode Title,Name,Sentence,First Token,Name Length,Homogenized Name
0,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot s...,waymar,12,
1,2011-04-17,Season 1,Episode 1,Winter is Coming,will,I’ve never seen wildlings do a thing like thi...,will,4,
2,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,How close did you get?,waymar,12,
3,2011-04-17,Season 1,Episode 1,Winter is Coming,will,Close as any man would.,will,4,
4,2011-04-17,Season 1,Episode 1,Winter is Coming,gared,We should head back to the wall.,gared,5,
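
A left merge repeats a script row once for every mapper row with the same `Name`, so a name that maps to more than one homogenized name would duplicate its sentences. A quick check for such names before merging could look like this sketch, which uses a made-up mapper:

```python
import pandas as pd

# Made-up mapper in which 'sam' maps to two different targets
mapper = pd.DataFrame({
    'Name': ['jon', 'sansa', 'sam', 'sam'],
    'Homogenized Name': ['jon snow', 'sansa stark', 'sam tarly', 'samwell tarly'],
})

# Count distinct targets per name and keep the names with more than one
targets_per_name = mapper.groupby('Name')['Homogenized Name'].nunique()
ambiguous = targets_per_name[targets_per_name > 1].index.tolist()
print(ambiguous)
```

An empty list means every name has exactly one target and the merge will not add rows.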


We have now successfully mapped our important characters to their names in the main dataframe. We will now clean the `Name` column by replacing it with the value in the `Homogenized Name` column wherever one is available; in other words, we change the names of our important characters to their clean names.

In [66]:
script_dataframe['Homogenized Name'] = script_dataframe['Homogenized Name'].fillna('')
script_dataframe['Name'] = script_dataframe[['Name','Homogenized Name']].apply(lambda x: x['Homogenized Name'] if x['Homogenized Name'] != '' else x['Name'], axis=1)
script_dataframe.head()

Unnamed: 0,Release Date,Season,Episode,Episode Title,Name,Sentence,First Token,Name Length,Homogenized Name
0,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot s...,waymar,12,
1,2011-04-17,Season 1,Episode 1,Winter is Coming,will,I’ve never seen wildlings do a thing like thi...,will,4,
2,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,How close did you get?,waymar,12,
3,2011-04-17,Season 1,Episode 1,Winter is Coming,will,Close as any man would.,will,4,
4,2011-04-17,Season 1,Episode 1,Winter is Coming,gared,We should head back to the wall.,gared,5,


Finally, we have managed to map and homogenize the names of our important characters. Let's restore our main dataframe to its original format and remove the duplicated rows.

In [67]:
script_dataframe = script_dataframe[['Release Date','Season','Episode','Episode Title','Name','Sentence']].drop_duplicates()
script_dataframe.head()

Unnamed: 0,Release Date,Season,Episode,Episode Title,Name,Sentence
0,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot s...
1,2011-04-17,Season 1,Episode 1,Winter is Coming,will,I’ve never seen wildlings do a thing like thi...
2,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,How close did you get?
3,2011-04-17,Season 1,Episode 1,Winter is Coming,will,Close as any man would.
4,2011-04-17,Season 1,Episode 1,Winter is Coming,gared,We should head back to the wall.


### Sentence Cleansing
The `Sentence` column has fewer problem cases than the `Name` column, partly because some of them were already cleaned during the `Name` cleansing process.

However, a few cases still leave dirty data in the `Sentence` column. First, a sentence may not start properly, meaning that its first character is non-alphanumeric. This can happen because our previous cleansing removed the bracketed text from the `Name` and `Sentence` columns. Second, some sentences are written in a different format from the rest: specifically, some are wrapped in single quotes `''` or double quotes `""`, while others are not. Last, some sentences are empty strings, which can also result from removing the bracketed text.

The function below handles these improperly formed sentences.

In [68]:
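import re

def clean_sentence(text):
    # Split on single spaces; runs of spaces produce empty tokens
    text_list = text.split(' ')

    # Only multi-word sentences are cleaned; single tokens are returned as-is
    if len(text_list) > 1:
        # Rejoin the non-empty tokens so repeated spaces collapse into one
        text = ' '.join(word for word in text_list if word != '')
        # Strip one pair of enclosing double quotes, then one pair of single quotes
        # (the length guard avoids indexing into an empty string)
        if len(text) > 1 and text[0] == '"' and text[-1] == '"':
            text = text[1:-1]
        if len(text) > 1 and text[0] == '\'' and text[-1] == '\'':
            text = text[1:-1]

        # Drop any leading non-alphanumeric characters
        text = re.sub('^[^A-Za-z0-9]*', '', text)
        # Removing the quotes can expose a trailing space
        text = text[:-1] if text.endswith(' ') else text

    return text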

In [69]:
script_dataframe['Clean Sentence'] = script_dataframe['Sentence'].apply(clean_sentence)
script_dataframe.head(3)

Unnamed: 0,Release Date,Season,Episode,Episode Title,Name,Sentence,Clean Sentence
0,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot s...,What d’you expect? They’re savages. One lot st...
1,2011-04-17,Season 1,Episode 1,Winter is Coming,will,I’ve never seen wildlings do a thing like thi...,I’ve never seen wildlings do a thing like this...
2,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,How close did you get?,How close did you get?
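
To make the three cases concrete, here is a small stand-alone sketch run on made-up strings. The helper `tidy` is hypothetical and deliberately simplified; it is not the function used above. In particular, it collapses every kind of whitespace and also cleans single-word input.

```python
import re

def tidy(text):
    """Simplified, hypothetical variant of the cleaning steps described above."""
    # Collapse any run of whitespace into a single space and trim both ends
    text = ' '.join(text.split())
    # Remove one pair of matching enclosing quotes, if present
    if len(text) > 1 and text[0] == text[-1] and text[0] in ('"', "'"):
        text = text[1:-1]
    # Drop leading characters that are not ASCII letters or digits
    text = re.sub(r'^[^A-Za-z0-9]+', '', text)
    return text.strip()

samples = [
    '. We should head back.',   # leftover punctuation at the start
    '"Winter is   coming."',    # quoted, with repeated spaces
    '   ',                      # nothing left after bracket removal
]
cleaned = [tidy(s) for s in samples]
print(cleaned)  # ['We should head back.', 'Winter is coming.', '']
```

The last sample shows why a separate step is still needed afterwards: cleaning can leave an empty string behind.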


Now the values in the `Clean Sentence` column should all be in the same format. The last case to handle is the empty string. We can handle this easily by adding a new column that holds the length of each `Clean Sentence` value in our main dataframe and then using that column as a filter. The filter keeps only sentences longer than one character, so single-character leftovers are removed along with the empty strings.

In [70]:
script_dataframe['Length Sentence'] = script_dataframe['Clean Sentence'].apply(len)
script_dataframe = script_dataframe[script_dataframe['Length Sentence'] > 1]
script_dataframe.head(3)

Unnamed: 0,Release Date,Season,Episode,Episode Title,Name,Sentence,Clean Sentence,Length Sentence
0,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot s...,What d’you expect? They’re savages. One lot st...,136
1,2011-04-17,Season 1,Episode 1,Winter is Coming,will,I’ve never seen wildlings do a thing like thi...,I’ve never seen wildlings do a thing like this...,103
2,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,How close did you get?,How close did you get?,22
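
A minimal sketch of the same kind of filter on made-up data, using the vectorized `Series.str.len` accessor in place of `apply(len)`. Note that a `> 1` threshold drops one-character sentences as well as empty strings.

```python
import pandas as pd

# Hypothetical toy sentences: one empty, one single character, two real lines
toy = pd.DataFrame({'Clean Sentence': ['', 'A', 'No.', 'How close did you get?']})

# Vectorized character count per row
toy['Length Sentence'] = toy['Clean Sentence'].str.len()

# Keep only sentences longer than one character
kept = toy[toy['Length Sentence'] > 1]

print(kept['Clean Sentence'].tolist())   # ['No.', 'How close did you get?']
print(kept['Length Sentence'].tolist())  # [3, 22]
```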


We now have clean sentences and have removed the empty strings from our dataframe. Next, we replace the values in the `Sentence` column with the values from the `Clean Sentence` column and then restore the dataframe to its original structure.

In [71]:
script_dataframe['Sentence'] = script_dataframe['Clean Sentence']
script_dataframe = script_dataframe[['Release Date','Season','Episode','Episode Title','Name','Sentence']].drop_duplicates()
script_dataframe.head()

Unnamed: 0,Release Date,Season,Episode,Episode Title,Name,Sentence
0,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What d’you expect? They’re savages. One lot st...
1,2011-04-17,Season 1,Episode 1,Winter is Coming,will,I’ve never seen wildlings do a thing like this...
2,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,How close did you get?
3,2011-04-17,Season 1,Episode 1,Winter is Coming,will,Close as any man would.
4,2011-04-17,Season 1,Episode 1,Winter is Coming,gared,We should head back to the wall.


As a final touch, and it may be hard to notice, the apostrophes in the `Sentence` column use the typographic `’` instead of the plain `'`. A simple string replace fixes this. While we are at it, we can also expand the contraction `d’` that is visible in the first row (`d’you`), which is a shortened form of `do`. For this we use a regex replace anchored at the start of a word, so that other words containing `d'`, such as the possessive `Ned's`, are left untouched.

In [72]:
script_dataframe['Sentence'] = script_dataframe['Sentence'].apply(lambda x: str(x).replace('’', '\''))
# \b matches only a word-initial d', so possessives such as "Ned's" are not rewritten
script_dataframe['Sentence'] = script_dataframe['Sentence'].str.replace(r"\bd'", 'do ', regex=True)
script_dataframe.head()

Unnamed: 0,Release Date,Season,Episode,Episode Title,Name,Sentence
0,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,What do you expect? They're savages. One lot s...
1,2011-04-17,Season 1,Episode 1,Winter is Coming,will,I've never seen wildlings do a thing like this...
2,2011-04-17,Season 1,Episode 1,Winter is Coming,waymar royce,How close did you get?
3,2011-04-17,Season 1,Episode 1,Winter is Coming,will,Close as any man would.
4,2011-04-17,Season 1,Episode 1,Winter is Coming,gared,We should head back to the wall.
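
Because the two-character sequence `d'` also appears inside possessives, it is worth checking what a replacement actually touches. The sketch below compares a plain substring replacement with one anchored at a word boundary, on two made-up sentences (they are illustrative and not quoted from the dataset).

```python
import re

# Made-up examples: a contraction and a possessive that both contain d'
samples = ["What d'you expect?", "Ned's sword"]

# Plain substring replacement rewrites every d', including the possessive
naive = [s.replace("d'", 'do ') for s in samples]

# \b requires the d to start a word, so only the contraction is expanded
anchored = [re.sub(r"\bd'", 'do ', s) for s in samples]

print(naive)     # ['What do you expect?', 'Nedo s sword']
print(anchored)  # ['What do you expect?', "Ned's sword"]
```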


In [73]:
script_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23911 entries, 0 to 24353
Data columns (total 6 columns):
Release Date     23911 non-null object
Season           23911 non-null object
Episode          23911 non-null object
Episode Title    23911 non-null object
Name             23911 non-null object
Sentence         23911 non-null object
dtypes: object(6)
memory usage: 1.3+ MB
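
The summary reports 23911 entries with index labels running from 0 to 24353. That gap is consistent with how pandas behaves: boolean filtering and `drop_duplicates` remove rows but keep the original index labels of the rows that remain. A minimal sketch on made-up data follows; the `reset_index(drop=True)` call is optional and only illustrates how the labels could be renumbered.

```python
import pandas as pd

# Hypothetical toy frame with the same line repeated at labels 0, 1 and 3
toy = pd.DataFrame({
    'Name': ['will', 'will', 'gared', 'will'],
    'Sentence': ['Close as any man would.', 'Close as any man would.',
                 'We should head back to the wall.', 'Close as any man would.'],
})

# The first occurrence of each duplicated row is kept, with its original label
deduped = toy.drop_duplicates()
print(deduped.index.tolist())     # [0, 2]

# Optional: renumber the remaining rows from 0
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```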


Finally, our process of scraping and cleansing the `Game of Thrones` dataset is finished. The last thing to do is export the dataframe to an external file.

In [74]:
script_dataframe.to_csv('Game_of_Thrones_Script.csv', encoding='utf-8', index=False)
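
Passing `index=False` keeps the index labels out of the exported file. The sketch below illustrates the difference on a made-up one-row frame, writing to an in-memory buffer instead of a file; without `index=False`, reading the CSV back yields an extra `Unnamed: 0` column.

```python
import io
import pandas as pd

# Hypothetical one-row frame whose index label is 3 rather than 0
toy = pd.DataFrame({'Name': ['will'], 'Sentence': ['Close as any man would.']}, index=[3])

# Export without the index, then read the result back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())    # ['Name', 'Sentence']

# Export with the index (the default): the labels come back as an unnamed column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
print(with_index.columns.tolist())  # ['Unnamed: 0', 'Name', 'Sentence']
```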

# Conclusion
Scraping the web to extract data scattered across its pages can take a lot of effort. We need to do a good deal of manual element inspection just to find the exact position of the data we want to collect among all the HTML parts. BeautifulSoup works well, but in a case like this we need to combine it with manual element inspection to get the desired result faster. Some of you might have a different or even better solution for extracting data from online sources, and that is a great thing. As for me, I am still learning and experimenting with different methods to find my own best practice.

To close, you can use this data and mine the information in it as you please. If you have the time, please also give me feedback on both the dataset and the process of building it.

Thank you!

And have a nice day :)