For this assignment, I will be processing and analyzing the transcripts of the first three seasons of the TV series Game of Thrones. I originally planned to analyze transcripts of all eight seasons to see how the style (and quality) changed in the later seasons, but I could only find unprocessed transcripts of the first few seasons. I do have an already processed CSV file covering all eight seasons, so I will probably repeat this analysis for the final assignment.

## Part 1: Loading the Data



In [3]:
#Importing all modules
import spacy
!spacy download en_core_web_sm
import os
from spacy import displacy
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
import plotly.express as px

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 69.9 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [5]:
"""Since Jupyter Notebook wasn't working for me, I am doing this assignment in
Google Colab, with the files stored in my Google Drive. This probably means
that it won't run for anyone else without changing the path below, but given
the constraints, and since this is not a graded assignment, it is the solution
I came up with for the time being."""

from google.colab import drive
drive.mount('/content/drive')
os.chdir("/content/drive/MyDrive/Master/Collecting Data/assignment_2")
os.listdir()

Mounted at /content/drive


['metadata.txt',
 '.DS_Store',
 'GOT_transcripts',
 'metadata.csv',
 'assignment_2.ipynb']

In [26]:
texts_raw = []
file_names = []

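for txt_file in os.listdir('GOT_transcripts'):
    if txt_file.endswith('.txt'):
        # Open each transcript in a context manager so the file is closed after reading
        with open(os.path.join('GOT_transcripts', txt_file), 'r', encoding='utf-8') as f:
            texts_raw.append(f.read())
        file_names.append(txt_file)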

# removing "\n" from the text to make it easier to read
texts = []
for episode in texts_raw:
  texts.append(episode.replace("\n", " "))
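
`os.listdir` returns file names in arbitrary order, so the order of episodes in `texts` is not guaranteed. The following is a minimal sketch, not part of the original notebook, of reading the files in sorted order; `load_transcripts` is a hypothetical helper name, and the demo uses a temporary folder with made-up contents rather than the real transcripts.

```python
import tempfile
from pathlib import Path


def load_transcripts(folder):
    """Return (file_names, texts) for every .txt file in folder, sorted by name."""
    file_names, texts = [], []
    for path in sorted(Path(folder).glob("*.txt")):
        # read_text opens the file, reads it, and closes it again
        raw = path.read_text(encoding="utf-8")
        file_names.append(path.name)
        # Replace line breaks with spaces, as in the cell above
        texts.append(raw.replace("\n", " "))
    return file_names, texts


# Demo on a throwaway folder with made-up contents (not real transcripts)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "01x02.txt").write_text("Second\nepisode", encoding="utf-8")
    Path(tmp, "01x01.txt").write_text("First\nepisode", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    demo_names, demo_texts = load_transcripts(tmp)

print(demo_names)
print(demo_texts)
```

Because the later merge is done on `Filename`, the original loop still pairs each text with the right metadata; sorting mainly makes the row order predictable.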

In [27]:
# Create dictionary object associating each file name with its text
d = {'Filename':file_names,'Text':texts}

# Turn dictionary into a dataframe
GOT_df = pd.DataFrame(d)

In [28]:
# Load metadata.
metadata_df = pd.read_csv('metadata.csv', sep = ",")
metadata_df.rename(columns={"FILE": "Filename"}, inplace=True)
final_data_df = metadata_df.merge(GOT_df,on='Filename')
final_data_df = final_data_df.drop(columns = ["TEXT"])

final_data_df.head()

Unnamed: 0,Filename,SEASON,EPISODE,Text
0,01x01.txt,1,1,"Easy, boy. What do you expect? They’re savages..."
1,01x02.txt,1,2,"You need to drink, child. And eat. Isn’t there..."
2,01x03.txt,1,3,"Welcome, Lord Stark. Grand Maester Pycelle has..."
3,01x04.txt,1,4,The little lord’s been dreaming again. We have...
4,01x05.txt,1,5,Does Ser Hugh have any family in the capital? ...
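
`merge` performs an inner join by default, so a transcript whose file name has no matching metadata row (or the other way around) is dropped without any warning. A small sketch of a sanity check that could be run before merging; `unmatched_names` is a hypothetical helper and the file names are made up for illustration.

```python
def unmatched_names(transcript_files, metadata_files):
    """Return the names found on only one side of the join, as two sorted lists."""
    transcripts, metadata = set(transcript_files), set(metadata_files)
    only_transcripts = sorted(transcripts - metadata)
    only_metadata = sorted(metadata - transcripts)
    return only_transcripts, only_metadata


# Made-up example: one transcript lacks metadata, one metadata row lacks a file
missing_meta, missing_text = unmatched_names(
    ["01x01.txt", "01x02.txt", "01x03.txt"],
    ["01x01.txt", "01x02.txt", "01x04.txt"],
)
print(missing_meta)  # transcripts without a metadata row
print(missing_text)  # metadata rows without a transcript
```

In the notebook, the two arguments would presumably be `GOT_df['Filename']` and `metadata_df['Filename']`; two empty lists would mean nothing is lost in the merge.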


## Part 2: Text Enrichment & Processing




In [9]:
# Load nlp pipeline
nlp = spacy.load('en_core_web_sm')

# Check what functions it performs
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [77]:
#Defining all necessary functions

# Defines a function that runs the nlp pipeline on any given input text
def process_text(text):
    return nlp(text)

# Defines a function to retrieve tokens from a doc object
def get_token(doc):
    return [(token.text) for token in doc]

# Defines a function to retrieve lemmas from a doc object
def get_lemma(doc):
    return [(token.lemma_) for token in doc]

#Return the coarse- and fine-grained part of speech text for each token in the doc
def get_pos(doc):
    return [(token.pos_, token.tag_) for token in doc]

# Defines function to extract proper nouns from Doc object
def extract_proper_nouns(doc):
    return [token.text for token in doc if token.pos_ == 'PROPN']

# Defines function to extract named entities from doc objects
def extract_named_entities(doc):
    return [ent.label_ for ent in doc.ents]

# Define function to extract text tagged with named entities from doc objects
def extract_named_entities(doc):
    return [ent for ent in doc.ents]

# Defines a function to get part of speech tags and counts and append them to a new dictionary
def get_pos_tags(doc):
    dictionary = {}
    num_pos = doc.count_by(spacy.attrs.POS)
    for k,v in sorted(num_pos.items()):
        dictionary[doc.vocab[k].text] = v
    num_list.append(dictionary)

In [30]:
# Apply the function to the "Text" column, so that the nlp pipeline is called on each script
final_data_df['Doc'] = final_data_df['Text'].apply(process_text)

In the next few cells, I apply each of the functions defined above and store the result in a new column.

In [31]:
# Run the token retrieval function on the doc objects in the dataframe
final_data_df['Tokens'] = final_data_df['Doc'].apply(get_token)

In [32]:
# Isolate the tokenized text
tokens = final_data_df[['Text', 'Tokens']].copy()

In [33]:
# Run the lemma retrieval function on the doc objects in the dataframe
final_data_df['Lemmas'] = final_data_df['Doc'].apply(get_lemma)

In [34]:
# Run the part-of-speech retrieval function on the doc objects in the dataframe
final_data_df['POS'] = final_data_df['Doc'].apply(get_pos)

In [35]:
# Apply function to Doc column and store resulting proper nouns in new column
final_data_df['Proper_Nouns'] = final_data_df['Doc'].apply(extract_proper_nouns)

In [36]:
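# Extract the entity labels from the Doc column and store them in a new column.
# The labels are collected inline because extract_named_entities is defined twice
# above, and the second definition (which returns the entity spans) replaces the first.
final_data_df['Named_Entities'] = final_data_df['Doc'].apply(lambda doc: [ent.label_ for ent in doc.ents])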

In [78]:
# Apply function to Doc column and store resulting text in new column
final_data_df['NE_Words'] = final_data_df['Doc'].apply(extract_named_entities)

In [79]:
# Let's see what the final table looks like...
final_data_df.head()

Unnamed: 0,Filename,SEASON,EPISODE,Text,Doc,Tokens,Lemmas,POS,Proper_Nouns,Named_Entities,NE_Words
0,01x01.txt,1,1,"Easy, boy. What do you expect? They’re savages...","(Easy, ,, boy, ., What, do, you, expect, ?, Th...","[Easy, ,, boy, ., What, do, you, expect, ?, Th...","[easy, ,, boy, ., what, do, you, expect, ?, th...","[(ADJ, JJ), (PUNCT, ,), (NOUN, NN), (PUNCT, .)...","[Wall, Father, Bran, Bran, Quick, Bran, Lord, ...","[CARDINAL, ORDINAL, CARDINAL, PERSON, ORG, PER...","[(One), (first), (ten), (Bran), (Quick, ,, Bra..."
1,01x02.txt,1,2,"You need to drink, child. And eat. Isn’t there...","(You, need, to, drink, ,, child, ., And, eat, ...","[You, need, to, drink, ,, child, ., And, eat, ...","[you, need, to, drink, ,, child, ., and, eat, ...","[(PRON, PRP), (VERB, VBP), (PART, TO), (VERB, ...","[Dothraki, Shadow, Lands, Asshai, Dothraki, Kh...","[PERSON, CARDINAL, PRODUCT, PERSON, PERSON, DA...","[(Dothraki), (two), (the, Shadow, Lands), (Ass..."
2,01x03.txt,1,3,"Welcome, Lord Stark. Grand Maester Pycelle has...","(Welcome, ,, Lord, Stark, ., Grand, Maester, P...","[Welcome, ,, Lord, Stark, ., Grand, Maester, P...","[welcome, ,, Lord, Stark, ., Grand, Maester, P...","[(VERB, VBP), (PUNCT, ,), (PROPN, NNP), (PROPN...","[Lord, Stark, Grand, Maester, Pycelle, Small, ...","[PERSON, PERSON, ORG, PERSON, PERSON, NORP, DA...","[(Stark), (Grand, Maester, Pycelle), (the, Sma..."
3,01x04.txt,1,4,The little lord’s been dreaming again. We have...,"(The, little, lord, ’s, been, dreaming, again,...","[The, little, lord, ’s, been, dreaming, again,...","[the, little, lord, ’s, be, dream, again, ., w...","[(DET, DT), (ADJ, JJ), (PROPN, NNP), (AUX, POS...","[lord, Robb, I., Robb, Lord, Winterfell, Hodor...","[DATE, ORG, PERSON, WORK_OF_ART, ORG, WORK_OF_...","[(all, day), (Winterfell), (Hodor), (the, Nigh..."
4,01x05.txt,1,5,Does Ser Hugh have any family in the capital? ...,"(Does, Ser, Hugh, have, any, family, in, the, ...","[Does, Ser, Hugh, have, any, family, in, the, ...","[do, Ser, Hugh, have, any, family, in, the, ca...","[(AUX, VBZ), (PROPN, NNP), (PROPN, NNP), (VERB...","[Ser, Hugh, Mountain, Lord, Stark, Sisters, Tr...","[GPE, TIME, LOC, PERSON, PERSON, PRODUCT, GPE,...","[(Ser, Hugh), (last, night), (Mountain), (Star..."


In [84]:
# Save DataFrame as csv in the current working directory
# (here, the Google Drive folder set with os.chdir above)
final_data_df.to_csv('processed_GOT_s1to3.csv')

#### Nouns & Pronouns

In [51]:
# Create a list of the first 10 part of speech tags of episode 1
list(final_data_df['POS'])[0][:10]

[('ADJ', 'JJ'),
 ('PUNCT', ','),
 ('NOUN', 'NN'),
 ('PUNCT', '.'),
 ('PRON', 'WP'),
 ('AUX', 'VBP'),
 ('PRON', 'PRP'),
 ('VERB', 'VB'),
 ('PUNCT', '.'),
 ('PRON', 'PRP')]

In [62]:
# Create a list of the first 10 proper nouns of episode 1
list(final_data_df.loc[[0], 'Proper_Nouns'])[0][:10]

['Wall',
 'Father',
 'Bran',
 'Bran',
 'Quick',
 'Bran',
 'Lord',
 'Stark',
 'Night',
 'Watch']

In [74]:
# Count the proper nouns of episode 1, to see which is the most featured.
from collections import Counter

pn_list = list(final_data_df['Proper_Nouns'])[0]

# Counter tallies every proper noun in a single pass over the list
pn_counts = dict(Counter(pn_list))

pn_counts

{'Wall': 7,
 'Father': 6,
 'Bran': 5,
 'Quick': 1,
 'Lord': 10,
 'Stark': 11,
 'Night': 2,
 'Watch': 2,
 'Cat': 3,
 'Law': 1,
 'Ned': 9,
 'Robert': 3,
 'House': 4,
 'Baratheon': 1,
 'Andals': 2,
 'First': 2,
 'Men': 2,
 'Seven': 3,
 'Kingdoms': 3,
 'Eddard': 3,
 'Winterfell': 4,
 'Warden': 1,
 'North': 3,
 'Jon': 7,
 'Snow': 1,
 'Casterly': 1,
 'Rock': 1,
 'Lannisters': 4,
 'Arryn': 5,
 'Hand': 4,
 'King': 7,
 'Tyrion': 3,
 'Tommy': 1,
 'Brandon': 1,
 'Arya': 1,
 'Sansa': 2,
 'Robb': 1,
 'Jaime': 1,
 'Lannister': 2,
 'Imp': 3,
 'Iron': 1,
 'Throne': 1,
 'we’d': 1,
 'Mmm': 2,
 'Starks': 1,
 'Targaryens': 1,
 'Illyrio': 2,
 'Dothraki': 7,
 'Athchomar': 1,
 'chomakaan': 1,
 'Khal': 5,
 'Targaryen': 2,
 'Daenerys': 2,
 'Drogo': 5,
 'I.': 1,
 'Joffrey': 1,
 'Rodrik': 1,
 'Uncle': 2,
 'Benjen': 2,
 'Lady': 2,
 'royal': 1,
 'Winter': 1,
 'Northman': 1,
 'Maester': 1,
 'Luwin': 1,
 'Pardon': 1,
 'lord': 1,
 'Eyrie': 2,
 'Lysa': 1,
 'Etayo': 1,
 'Jadi': 1,
 'zhey': 1,
 'Jorah': 3,
 'Mormont': 1}
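
The count above covers episode 1 only. To see which proper nouns dominate across all episodes, the per-episode lists can be tallied together. This is a minimal sketch with toy lists standing in for `final_data_df['Proper_Nouns']`; the values are invented, not taken from the transcripts.

```python
from collections import Counter

# Stand-in for final_data_df['Proper_Nouns']: one list of proper nouns per episode
# (toy values, not taken from the transcripts)
proper_nouns_per_episode = [
    ["Bran", "Stark", "Bran", "Ned"],
    ["Ned", "Stark", "Jon"],
    ["Stark", "Jon", "Tyrion"],
]

# Add each episode's proper nouns to one running tally
total_counts = Counter()
for episode_nouns in proper_nouns_per_episode:
    total_counts.update(episode_nouns)

# The three most frequent proper nouns across all episodes
top_three = total_counts.most_common(3)
print(top_three)
```

Note that spaCy tags titles such as "Lord" and "Lady" as proper nouns too, so in practice the top of such a list may need some manual filtering.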

#### NER Analysis

In [81]:
# Extract the first Doc object
doc = final_data_df['Doc'][0]

# Visualize the named entity tagging in episode 1
displacy.render(doc, style='ent', jupyter=True)

#### POS Analysis

In [94]:
# Create new DataFrame for analysis purposes
pos_analysis_df = final_data_df[['Filename','SEASON', 'Doc']]

# Create list to store each dictionary
num_list = []

# Define a function to get part of speech tags and counts and append them to a new dictionary
def get_pos_tags(doc):
    dictionary = {}
    num_pos = doc.count_by(spacy.attrs.POS)
    for k,v in sorted(num_pos.items()):
        dictionary[doc.vocab[k].text] = v
    num_list.append(dictionary)

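# Apply function to each doc object in DataFrame. get_pos_tags appends its result to
# num_list and returns nothing, so there is no value to store in the DataFrame
# (assigning to .loc['C_POS'] would add an empty row rather than a column).
for doc in pos_analysis_df['Doc']:
    get_pos_tags(doc)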

In [95]:
# Create new dataframe with part of speech counts
pos_counts = pd.DataFrame(num_list)
columns = list(pos_counts.columns)

# Add the season of each episode as a new column to the dataframe
idx = 0
new_col = pos_analysis_df['SEASON']
pos_counts.insert(loc=idx, column='SEASON', value=new_col)

pos_counts # By episode

Unnamed: 0,SEASON,ADJ,ADP,ADV,AUX,CCONJ,DET,INTJ,NOUN,NUM,PART,PRON,PROPN,PUNCT,SCONJ,VERB,X,SPACE,SYM
0,1.0,182,240,183,308,70,294,41,513,25,139,771,201,758,82,606,,,
1,1.0,182,272,187,314,77,295,48,564,25,151,799,205,819,76,654,1.0,1.0,
2,1.0,276,388,255,454,133,454,62,798,36,199,1036,286,1043,112,830,,,
3,1.0,300,417,289,463,130,431,79,807,30,245,1136,356,1123,135,882,2.0,1.0,
4,1.0,289,406,309,464,109,472,47,835,40,215,1123,292,1083,145,875,,,
5,1.0,168,305,160,332,111,312,35,569,20,148,794,254,818,74,616,,,
6,1.0,209,444,223,374,142,454,59,788,23,193,1069,329,1052,101,804,,,
7,1.0,174,370,201,422,133,347,31,777,8,191,947,330,964,109,734,,,
8,1.0,200,336,219,434,135,331,64,715,30,176,1043,233,995,113,766,,,
9,1.0,177,354,200,341,111,322,32,615,20,163,864,225,838,101,661,1.0,,
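
`get_pos_tags` records its result by appending to the global `num_list`, which is why it has no return value. The same counting step can also be written as a function that returns its dictionary. The sketch below assumes plain lists of coarse POS tags as a stand-in for spaCy `Doc` objects, and `count_pos` is a hypothetical name.

```python
from collections import Counter


def count_pos(tags):
    """Return a {tag: count} dictionary with the tags in alphabetical order."""
    counts = Counter(tags)
    return {tag: counts[tag] for tag in sorted(counts)}


# Toy tag sequences standing in for two episodes
episodes = [
    ["ADJ", "PUNCT", "NOUN", "PUNCT", "PRON"],
    ["PRON", "VERB", "PART", "VERB"],
]

# One dictionary per episode, ready to be passed to pd.DataFrame(...)
rows = [count_pos(tags) for tags in episodes]
print(rows)
```

Returning the dictionary avoids the hidden dependency on `num_list` being reset before each run; a tag that never occurs in an episode is simply absent from that episode's dictionary, which is what produces the empty cells in the table above.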


In [96]:
# Get average part of speech counts used in the episodes of each season
average_pos_df = pos_counts.groupby(['SEASON']).mean()

# Round calculations to the nearest whole number
average_pos_df = average_pos_df.round(0)

# Reset index to improve DataFrame readability
average_pos_df = average_pos_df.reset_index()

# Show dataframe
average_pos_df

Unnamed: 0,SEASON,ADJ,ADP,ADV,AUX,CCONJ,DET,INTJ,NOUN,NUM,PART,PRON,PROPN,PUNCT,SCONJ,VERB,X,SPACE,SYM
0,1.0,216.0,353.0,223.0,391.0,115.0,371.0,50.0,698.0,26.0,182.0,958.0,271.0,949.0,105.0,743.0,1.0,1.0,
1,2.0,226.0,362.0,233.0,426.0,116.0,373.0,46.0,711.0,26.0,194.0,1021.0,260.0,962.0,113.0,787.0,2.0,,
2,3.0,232.0,346.0,234.0,378.0,113.0,348.0,47.0,623.0,26.0,200.0,984.0,263.0,919.0,112.0,783.0,1.0,,1.0


Since I only looked at the first three seasons, each of which was highly regarded, I won't interpret these numbers much. It would be interesting to see how they compare to seasons 7 and 8, which were widely regarded as inferior to the early seasons.

In [97]:
# Use plotly to plot the average adjective, verb, and number use per season
fig = px.bar(average_pos_df, x="SEASON", y=["ADJ", 'VERB', "NUM"], title="Average Part-of-Speech Use Per Season of GOT (Season 1-3)", barmode='group')
fig.show()
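
The chart compares raw averages, which also depend on how much dialogue an episode contains. One way to make episodes or seasons of different lengths comparable is to convert the counts into rates per 1,000 tokens. This is a minimal sketch with made-up counts; `per_thousand` is a hypothetical helper, not something from the original notebook.

```python
def per_thousand(pos_counts):
    """Convert raw POS counts into rates per 1,000 tokens."""
    total = sum(pos_counts.values())
    return {tag: 1000 * count / total for tag, count in pos_counts.items()}


# Made-up counts for one episode (100 tokens in total)
toy_counts = {"ADJ": 20, "VERB": 70, "NUM": 10}
rates = per_thousand(toy_counts)
print(rates)
```

For real data, the dictionary would need to contain every POS column of an episode, so that the total reflects the full token count.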