# Assignment 2: Game of Thrones (Extra Credit)
## © Cristian Danescu-Niculescu-Mizil 2020
## CS/INFO 4300 Language and Information
## Due by midnight on Wednesday February 12th

This assignment is **individual**.

By now, you've had some exposure to techniques for uncovering social dynamics in textual data through your assignments working with the "Keeping Up With The Kardashians" data set.

In this assignment we will be giving you the freedom to conduct analysis **of your choosing** on transcripts from the television series "Game of Thrones".

**Extra Credit Policy**

The policy for this extra credit is that you should complete as much as you can.


**Learning Objectives**

This project aims to help you get comfortable working with the following tools / technologies / concepts:

* Applying learned tools to conduct novel analysis

**Academic Integrity and Collaboration**

Note that these projects should be completed individually. As a result, all University-standard academic integrity guidelines must be followed.

**Grading**

This extra credit assignment is open-ended and will ask you to do a free-form analysis.
The number of extra points will be decided based on how interesting, creative and unique your analysis is.  This will be in part a subjective judgement on the part of the TAs.


**Submission**

You are expected to submit this .ipynb as your submission for Assignment 2 Extra Credit. 

In addition, please submit an HTML copy of the notebook (you can create this by clicking File > Download as > HTML (.html)).

In [1]:
import pickle, re

In [2]:
import sys
# Ensure that your kernel is using Python3
assert sys.version_info.major == 3

# Free-form Analysis

## Preprocessing

We have provided you with some pre-cleansed data that contains a subset of the "Game Of Thrones" transcripts. This is by no means a comprehensive data set (many episodes are not represented), but it should be enough data for you to make some meaningful observations. 

The data is represented as a dictionary that maps an episode title to a transcript (represented as a list of speaker-line tuples):
```
{'A Golden Crown': [
      ('EDDARD', '  Your pardon, your Grace.'),
      ('CERSEI', ' Do you know what your wife has done?'),
      ...
    ],
  ...
}
```
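
As a minimal sketch of how this structure can be traversed, the snippet below counts lines per speaker on a tiny dictionary of the same shape. `toy_data` is illustrative only: the first episode reuses the two lines shown above, while the second episode and its placeholder line are assumptions made for the example and do not come from the real `got_data`.

```python
from collections import Counter

# Toy data in the same shape as the real transcripts (illustrative only).
toy_data = {
    'A Golden Crown': [
        ('EDDARD', '  Your pardon, your Grace.'),
        ('CERSEI', ' Do you know what your wife has done?'),
    ],
    # Hypothetical episode and line, added only to show multiple episodes.
    'Hypothetical Episode': [
        ('EDDARD', ' placeholder line'),
    ],
}

# Count how many lines each speaker has across all episodes.
line_counts = Counter(
    speaker
    for transcript in toy_data.values()
    for speaker, _line in transcript
)
print(line_counts.most_common())  # [('EDDARD', 2), ('CERSEI', 1)]
```

The same double loop (episodes, then speaker-line tuples) should carry over to `got_data` once it is loaded.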

**However, the data is messy so you might want to do some additional steps to clean it.**

e.g. Character names may not be consistent throughout the transcripts. There might be spelling errors in the names, and characters may be referenced by their first name only or by their full name.
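
One possible way to handle inconsistent names is to map known variants to a single canonical name before counting. The sketch below assumes a hand-built alias map; `ALIASES`, `normalize_name`, and `normalize_transcripts` are hypothetical names introduced here, and the alias entries are examples rather than a list derived from the data set.

```python
# Hypothetical alias map: the keys and canonical names are assumptions
# made for illustration, not a list taken from the data set.
ALIASES = {
    'NED': 'EDDARD',
    'EDDARD STARK': 'EDDARD',
}

def normalize_name(raw_name, aliases=ALIASES):
    """Strip whitespace, upper-case, and map known aliases to one name."""
    name = raw_name.strip().upper()
    return aliases.get(name, name)

def normalize_transcripts(data, aliases=ALIASES):
    """Return a copy of the transcript dict with normalized speaker names."""
    return {
        title: [(normalize_name(speaker, aliases), line) for speaker, line in lines]
        for title, lines in data.items()
    }

print(normalize_name(' Ned '))   # EDDARD
print(normalize_name('CERSEI'))  # CERSEI
```

Names that are not in the alias map pass through unchanged, so the map can be grown gradually as more variants are spotted in the transcripts.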

In [4]:
with open("GoT_transcripts_c.p", "rb") as file:
    got_data = pickle.load(file)

## Extra Credit Task
For extra credit, we would like you to attempt **ONE** of the following tasks:

## OPTION 1 :
1. Explore the role of gender in social interactions. For example, do people of one gender talk differently to those of the opposite gender? You should base this analysis on the age-based social interaction analysis in Assignment 2. You might also consider comparing GoT results with Kardashians results.
2. Provide at least 1 plot illustrating your results.

**NOTE:** Because there are many characters in GoT, you may have to select a subset of the characters to work with and look up their genders. If you are looking only at a subset of the characters, please explain how and why you arrived at that subset.


## OPTION 2 :
1. Propose an interesting question that you would like to explore in regard to the GoT dataset. 
2. Answer your question by conducting and documenting your analysis on the dataset.
3. Provide at least 1 plot illustrating your results


**NOTE**: You will be awarded points based on the novelty of the question and on the thoughtfulness of the exploration.

Make sure you clearly express the questions you are asking, and document the code. 
Try to use the level of detail we used in our problems from A2.   
Also, interpret the results (highlighting what you found particularly interesting/unexpected). 

You may import additional libraries as you see fit so long as they are standard python libraries (https://docs.python.org/3/library/) 

We have also provided you with a few helper functions to aid you in your analysis.

In [60]:
import numpy as np

In [62]:
def tokenize(text):
    """Returns a list of tokens from an input string.
    
    Params: {text: string}
    Returns: List
    """
    return [x for x in re.findall(r"[a-z]*", text.lower()) if x != ""]
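
To make the helper's behavior concrete, here is a short usage sketch. The function body is copied from the cell above so the snippet runs on its own; the first input is a line from the sample data, and the second sentence is made up for illustration.

```python
import re

def tokenize(text):
    """Copy of the helper above: lower-case, then keep runs of a-z."""
    return [x for x in re.findall(r"[a-z]*", text.lower()) if x != ""]

print(tokenize(" Do you know what your wife has done?"))
# ['do', 'you', 'know', 'what', 'your', 'wife', 'has', 'done']

# Apostrophes are not letters, so contractions are split into fragments.
print(tokenize("I don't know."))
# ['i', 'don', 't', 'know']
```

Because punctuation and digits are dropped and contractions are split, fragments such as `t` can show up as tokens in any word count built on this helper.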

## OPTION 2 :
### Question: Find the least frequent words, and determine which characters speak these words most frequently and which speak them least frequently. ###
**Note:** Build an `n_good_names` by `n_good_words` matrix, where `matrix[i, j]` is the number of times word `j` is spoken by person `i`.

To de-emphasize the ratios of rare words (a word that occurs only a few times is likely to be spoken by just one person), we will apply additive smoothing, also known as Laplace smoothing. This means that we will add one to each person-word count before reweighting.

So, for a given word $w$, person $p$, and person-word count $\text{count}(p, w)$, the weighted occurrence $W(p, w)$ is

$$ W(p, w) = \frac{\text{count}(p, w) + 1}{\displaystyle\sum_{\pi \in \text{good person}} \big(\text{count}(\pi, w) + 1\big)} $$
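
As a small worked example of the formula, the sketch below applies it to one word and three people. The people `A`, `B`, `C`, their counts, and the function name `laplace_weights` are assumptions made for illustration, not values from the transcripts.

```python
# Toy counts of a single word w for three hypothetical people.
counts = {'A': 3, 'B': 0, 'C': 1}

def laplace_weights(counts):
    """Return W(p, w) = (count + 1) / sum over people of (count + 1)."""
    denominator = sum(c + 1 for c in counts.values())
    return {person: (c + 1) / denominator for person, c in counts.items()}

weights = laplace_weights(counts)
print(weights)  # A gets 4/7, B gets 1/7, C gets 2/7
```

The weights for one word always sum to 1, and a person who never says the word still receives a small nonzero weight. Under the same formula with 17 people, a word spoken exactly once gives its speaker $2/18$ and everyone else $1/18$, which is why the smoothed ratios for rare words stay close together.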


In [63]:
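### YOUR ANALYSIS HERE
from collections import Counter
from nltk.corpus import stopwords  # third-party; needs the NLTK 'stopwords' corpus

stop_words = set(stopwords.words('english'))

# Collect every token from every line of every episode.
value = list(got_data.values())
word_tokens = []
for epi in value:
    for sent in epi:
        word_tokens += tokenize(sent[1])  # tokenize() already lower-cases

# Drop stop words, then count the remaining words.
filtered_sentence = [w for w in word_tokens if w not in stop_words]
word_dic = Counter(filtered_sentence)

# Sort words from least frequent to most frequent.
sort_word = sorted(word_dic.items(), key=lambda x: x[1])
word_list = [w for (w, k) in sort_word]
# Map each word to its column index in the matrix.
word_index_reverse = {j: i for (i, j) in enumerate(word_list)}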

In [64]:
name_list = []
for epi in value:
    for sent in epi:
        name_list.append(sent[0])
name_counter = Counter(name_list)
# Keep the 17 most frequent speakers, i.e. the names that appear at least 100 times.
good_name_list = [w for (w, k) in name_counter.most_common(17)]
print(good_name_list)
good_name_index_reverse = {j: i for (i,j) in enumerate(good_name_list)}
print(good_name_index_reverse)

['TYRION', 'JON', 'EDDARD', 'CERSEI', 'DAENERYS', 'JAIME', 'ARYA', 'LITTLEFINGER', 'CATELYN', 'SANSA', 'ROBB', 'ROBERT', 'NED', 'JORAH', 'SAM', 'BRAN', 'BRONN']
{'TYRION': 0, 'JON': 1, 'EDDARD': 2, 'CERSEI': 3, 'DAENERYS': 4, 'JAIME': 5, 'ARYA': 6, 'LITTLEFINGER': 7, 'CATELYN': 8, 'SANSA': 9, 'ROBB': 10, 'ROBERT': 11, 'NED': 12, 'JORAH': 13, 'SAM': 14, 'BRAN': 15, 'BRONN': 16}


In [76]:
n_good_names = len(good_name_list)
n_good_words = len(word_list)
# words_matrix[i, j] = number of times person i says word j
words_matrix = np.zeros((n_good_names, n_good_words))
for epi in got_data.values():
    for sent in epi:
        name = sent[0]
        if name not in good_name_index_reverse:
            continue
        for w in tokenize(sent[1]):
            # Dict lookup is O(1); stop words are absent from the index.
            if w in word_index_reverse:
                words_matrix[good_name_index_reverse[name], word_index_reverse[w]] += 1

array([ 0.,  0.,  0., ..., 45., 19., 11.])

In [77]:
# Create a weighted words matrix with add-one (Laplace) smoothing:
# each column (word) is normalized over the good names so that it sums to 1.
weighted_words_dup = (words_matrix + 1) / (np.sum(words_matrix, axis=0) + words_matrix.shape[0])

In [98]:
# Sum each person's weighted ratios over the 100 least frequent words
# (word_list is sorted from least to most frequent).
words_frequency_ratio_sum = np.sum(weighted_words_dup[:, :100], axis=1)
words_frequency_ratio_sum_sorted_index = np.argsort(words_frequency_ratio_sum)
print(words_frequency_ratio_sum)
print(words_frequency_ratio_sum_sorted_index)
# Print the names from the highest sum to the lowest.
for i in words_frequency_ratio_sum_sorted_index[::-1]:
    print(good_name_list[i])

[5.73856209 5.79411765 5.73856209 5.73856209 6.46078431 5.73856209
 5.96078431 6.29411765 5.8496732  5.73856209 6.07189542 5.73856209
 5.73856209 6.01633987 5.73856209 5.90522876 5.73856209]
[ 0 14 12 11  9 16  3  2  5  1  8 15  6 13 10  7  4]
DAENERYS
LITTLEFINGER
ROBB
JORAH
ARYA
BRAN
CATELYN
JON
JAIME
EDDARD
CERSEI
BRONN
SANSA
ROBERT
NED
SAM
TYRION


Since `word_list` is ordered from least frequent to most frequent, I take the 100 least frequent words and, for each person, sum the weighted ratios over those words.

The values in `words_frequency_ratio_sum` show no large difference between characters in how often they say the 100 least frequent words. Most sums lie between about 5.7 and 6.0, and nine of the 17 characters share the minimum of about 5.74. Only four characters have sums above 6.0, meaning they say slightly more of these rare words in the transcripts:

DAENERYS, LITTLEFINGER, ROBB, JORAH.
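
As a quick check of this reading, the sketch below pairs each name with its printed sum and lists the characters whose sum exceeds 6.0. The names and numbers are copied from the outputs above; the 6.0 cutoff and the variable names `scores`, `ranked`, and `above_six` are assumptions chosen for illustration.

```python
# Names and sums copied from the printed outputs above.
good_name_list = ['TYRION', 'JON', 'EDDARD', 'CERSEI', 'DAENERYS', 'JAIME',
                  'ARYA', 'LITTLEFINGER', 'CATELYN', 'SANSA', 'ROBB', 'ROBERT',
                  'NED', 'JORAH', 'SAM', 'BRAN', 'BRONN']
scores = [5.73856209, 5.79411765, 5.73856209, 5.73856209, 6.46078431, 5.73856209,
          5.96078431, 6.29411765, 5.8496732, 5.73856209, 6.07189542, 5.73856209,
          5.73856209, 6.01633987, 5.73856209, 5.90522876, 5.73856209]

# Rank people from the highest sum to the lowest.
ranked = sorted(zip(good_name_list, scores), key=lambda pair: pair[1], reverse=True)

# Assumed cutoff of 6.0, used only to separate the top group.
above_six = [name for name, score in ranked if score > 6.0]
print(above_six)  # ['DAENERYS', 'LITTLEFINGER', 'ROBB', 'JORAH']
```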