In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from helpers.helper_functions import *

# Handling text 2 exercise
[Handling text exercises ADApted from ADA 2018 final exam]

The Sheldon Cooper we all know and love (OK, some of us might not know him, and some might not love him) from the TV series "The Big Bang Theory" has gotten into an argument with Leonard from the same TV show. Sheldon insists that he knows the show better than anyone, and keeps making various claims about the show, which neither of them knows how to prove or disprove. The two of them have reached out to you ladies and gentlemen, as data scientists, to help them. You will be given the full script of the series, with information on the episode, the scene, the person saying each dialogue line, and the dialogue lines themselves.

Leonard has challenged several of Sheldon's claims about the show, and throughout this exam you will see some of those and you will get to prove or disprove them, but remember: sometimes, we can neither prove a claim, nor disprove it!

## Task A: Picking up the shovel

**Note: You will use the data you preprocess in this task in all the subsequent ones.**

Our friends' argument concerns the entire show. We have given you a file in the `data/` folder that contains the script of every single episode. New episodes are indicated by '>>', new scenes by '>', and the rest of the lines are dialogue lines. Some lines are said by multiple people (for example, lines indicated by 'All' or 'Together'); **you must discard these lines**, for the sake of simplicity. However, you do not need to do it for Q1 in this task -- you'll take care of it when you solve Q2.

**Q1**. Your first task is to extract all lines of dialogue in each scene and episode, creating a dataframe where each row has the episode and scene where a dialogue line was said, the character who said it, and the line itself. You do not need to extract the proper name of the episode (e.g. episode 1 can appear as "Series 01 Episode 01 - Pilot Episode", and doesn't need to appear as "Pilot Episode"). Then, answer the following question: In total, how many scenes are there in each season? We're not asking about unique scenes; the same location appearing in two episodes counts as two scenes. You can use a Pandas dataframe with a season column and a scene count column as the response.

**Note: The data refers to seasons as "series".**

In [9]:
with open('data/all_scripts.txt') as f:
    lines = f.readlines()

In [10]:
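# Walk the script line by line, remembering the current episode ('>>') and scene ('>').
# Every other line is a dialogue line attached to that episode and scene.
df_scritp = pd.DataFrame(columns=['episode','scene','dialogue'])
for line in lines:
    line = line.replace('\n','').replace('\xa0',' ')
    # '>>' must be tested first, because startswith('>') also matches episode headers.
    if line.startswith('>>'):
        episode = line.replace('>>','')
    elif line.startswith('>'):
        scene = line.replace('>','')
    else:
        df_scritp = pd.concat([df_scritp, pd.DataFrame([[episode, scene, line]],columns=['episode','scene','dialogue'])])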

In [None]:
df_scritp.head()

Unnamed: 0,episode,scene,dialogue
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Sheldon: So if a photon is directed through a ...
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,"Leonard: Agreed, what’s your point?"
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,"Sheldon: There’s no point, I just think it’s a..."
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard: Excuse me?
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Receptionist: Hang on.
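
Q1 also asks how many scenes each season contains. A minimal sketch of one way to count them, assuming that every episode header contains the text `Series NN` (as in the rows above) and that each `>` line opens exactly one scene. The function name `count_scenes_per_season` and the toy lines are made up for illustration; on the real data you would pass the `lines` list read from the script file.

```python
import re
import pandas as pd

def count_scenes_per_season(lines):
    """Count '>' scene markers under each 'Series NN' episode header."""
    counts = {}
    season = None
    for raw in lines:
        line = raw.strip()
        if line.startswith('>>'):
            # Episode header: remember which season we are in.
            match = re.search(r'Series (\d+)', line)
            season = int(match.group(1)) if match else None
        elif line.startswith('>') and season is not None:
            # Scene marker: one more scene for the current season.
            counts[season] = counts.get(season, 0) + 1
    return pd.DataFrame(sorted(counts.items()), columns=['season', 'scene_count'])

# Made-up miniature script, only to show the input format.
toy_lines = [
    '>> Series 01 Episode 01 – Pilot Episode',
    '> A corridor at a sperm bank.',
    'Sheldon: Hello.',
    '> The stairs.',
    'Leonard: Hi.',
    '>> Series 02 Episode 01 – Toy Episode',
    '> The stairs.',
    'Penny: Hey.',
]
scene_counts = count_scenes_per_season(toy_lines)
print(scene_counts)
```

Counting the markers while parsing, rather than counting distinct scene names in the dataframe, keeps scenes that share a location and scenes without any dialogue.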


**Q2**. Now, let's define two sets of characters: all the characters, and recurrent characters. Recurrent characters are those who appear in more than one episode. For the subsequent sections, you will need to have a list of recurrent characters. Assume that there are no two _named characters_ (i.e. characters who have actual names and aren't referred to generically as "little girl", "grumpy grandpa", etc.) with the same name, i.e. there are no two Sheldons, etc. Generate a list of recurrent characters who have more than 90 dialogue lines in total, and then take a look at the list you have. If you've done this correctly, you should have a list of 20 names. However, one of these is clearly not a recurrent character. Manually remove that one, and print out your list of recurrent characters. To remove that character, pay attention to the _named character_ assumption we gave you earlier on. **For all the subsequent questions, you must only keep the dialogue lines said by the recurrent characters in your list.**

In [None]:
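# Split only at the first colon: speaker on the left, spoken text on the right.
# With a larger maxsplit, a line containing a second colon would lose everything after it.
df_scritp['cast'] = df_scritp['dialogue'].apply(lambda line: line.split(':',1)[0])
df_scritp['dialogue'] = df_scritp['dialogue'].apply(lambda line: line.split(':',1)[1])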
df_scritp.head()

Unnamed: 0,episode,scene,dialogue,cast
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,So if a photon is directed through a plane wi...,Sheldon
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,"Agreed, what’s your point?",Leonard
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,"There’s no point, I just think it’s a good id...",Sheldon
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Excuse me?,Leonard
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Hang on.,Receptionist


In [None]:
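# Number of dialogue lines per speaker label.
recurrent_char = df_scritp.cast.value_counts()
# Keep speakers with more than 90 lines. 'Man' is a generic label, not a named character, so it is removed by hand.
top_recurrent_char = recurrent_char[(recurrent_char > 90) & (recurrent_char.index != 'Man')].index
print(top_recurrent_char)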

Index(['Sheldon', 'Leonard', 'Penny', 'Howard', 'Raj', 'Amy', 'Bernadette',
       'Stuart', 'Priya', 'Mrs Cooper', 'Emily', 'Beverley', 'Mrs Wolowitz',
       'Zack', 'Arthur', 'Wil', 'Leslie', 'Kripke', 'Bert'],
      dtype='object', name='cast')


In [None]:
df_scritp = df_scritp[df_scritp.cast.isin(top_recurrent_char)]
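
Q2 defines recurrent characters as those who appear in more than one episode, which is a separate condition from the line-count threshold. A minimal sketch that checks both conditions, on made-up data; the helper `recurrent_characters` and its `min_lines` parameter are hypothetical names, and the column names `episode` and `cast` mirror the dataframe used here.

```python
import pandas as pd

# Made-up rows: one row per dialogue line.
toy = pd.DataFrame({
    'episode': ['E1', 'E1', 'E2', 'E2', 'E2'],
    'cast': ['Sheldon', 'Waiter', 'Sheldon', 'Leonard', 'Leonard'],
})

def recurrent_characters(df, min_lines):
    # Number of distinct episodes in which each speaker has at least one line.
    episodes_per_char = df.groupby('cast')['episode'].nunique()
    # Total number of dialogue lines per speaker.
    lines_per_char = df['cast'].value_counts().reindex(episodes_per_char.index)
    keep = (episodes_per_char > 1) & (lines_per_char > min_lines)
    return sorted(keep[keep].index)

result = recurrent_characters(toy, min_lines=1)
print(result)
```

On the full script the threshold would be `min_lines=90`. A generic label can still pass both filters, so the manual check described in Q2 remains necessary.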

## Task B: Read the scripts carefully

### Part 1: Don't put the shovel down just yet

**Q3**. From each dialogue line, replace punctuation marks (listed in the EXCLUDE_CHARS variable provided in `helpers/helper_functions.py`) with whitespaces, and lowercase all the text. **Do not remove any stopwords, leave them be for all the questions in this task.**
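
The replacement can also be done in one pass per line with a translation table. A minimal sketch, assuming a stand-in character set: `EXCLUDE_CHARS_DEMO` is hypothetical and only imitates the `EXCLUDE_CHARS` variable from the helper module, whose actual contents are not shown here.

```python
# Hypothetical stand-in for EXCLUDE_CHARS from helpers/helper_functions.py.
EXCLUDE_CHARS_DEMO = set('!?.,;:\'"’-')

# Map every excluded character to a single space.
PUNCT_TO_SPACE = str.maketrans({ch: ' ' for ch in EXCLUDE_CHARS_DEMO})

def clean_line(text):
    """Replace punctuation with spaces, then lowercase."""
    return text.translate(PUNCT_TO_SPACE).lower()

cleaned = clean_line("Agreed, what’s your point?")
print(cleaned.split())
```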

In [None]:
def remove_punctuation(text):
    # Replace every character listed in EXCLUDE_CHARS with a space.
    for punct in EXCLUDE_CHARS:
        text = text.replace(punct,' ')
    return text
df_scritp['dialogue_clean'] = df_scritp['dialogue'].apply(lambda diag: remove_punctuation(diag).lower())


**Q4**. For each term, calculate its "corpus frequency", i.e. its number of occurrences in the entire series. Visualize the distribution of corpus frequency using a histogram. Explain your observations. What are the appropriate x and y scales for this plot?
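
A minimal sketch of the two counting steps behind this histogram, on made-up token lists: first the corpus frequency of each term, then the number of distinct terms that share each frequency. Word counts in natural-language text are typically heavy-tailed (a handful of very frequent terms, many terms seen only once), which is why logarithmic scales on both axes are the usual choice.

```python
from collections import Counter

# Made-up tokenized lines, standing in for the tokenized dialogue.
toy_lines = [
    ['i', 'think', 'i', 'know'],
    ['you', 'know', 'i', 'do'],
]

# Corpus frequency: total occurrences of each term over all lines.
corpus_freq = Counter(tok for line in toy_lines for tok in line)

# Histogram data: how many distinct terms share each corpus frequency.
freq_of_freq = Counter(corpus_freq.values())

print(corpus_freq.most_common(2))
print(sorted(freq_of_freq.items()))
```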

In [None]:
df_scritp['dialogue_tokenized'] = df_scritp['dialogue_clean'].apply(lambda diag: simple_tokeniser(diag))
df_scritp.head()

Unnamed: 0,episode,scene,dialogue,cast,dialogue_clean,dialogue_tokenized
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,So if a photon is directed through a plane wi...,Sheldon,so if a photon is directed through a plane wi...,"[so, if, a, photon, is, directed, through, a, ..."
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,"Agreed, what’s your point?",Leonard,agreed what s your point,"[agreed, what, s, your, point]"
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,"There’s no point, I just think it’s a good id...",Sheldon,there s no point i just think it s a good id...,"[there, s, no, point, i, just, think, it, s, a..."
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Excuse me?,Leonard,excuse me,"[excuse, me]"
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,"One across is Aegean, eight down is Nabakov, ...",Leonard,one across is aegean eight down is nabakov ...,"[one, across, is, aegean, eight, down, is, nab..."


In [None]:
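# Flatten the per-line token lists into one array, count each distinct term,
# and sort the counts in descending order (position 0 = most frequent term).
tokens = np.array([word for sub_list in df_scritp['dialogue_tokenized'].values for word in sub_list])
tokens_occ = np.unique(tokens,return_counts=True)
occurance = sorted(tokens_occ[1],reverse=True)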


In [None]:
# Ranks start at 1: a rank of 0 cannot be drawn on a logarithmic axis.
plt.plot(range(1,len(occurance)+1),occurance)
plt.xlabel('word rank (log scale)')
plt.ylabel('word occurrences (log scale)')
plt.yscale('log')
plt.xscale('log')


In [None]:
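# plt.hist has no loglog argument: use log-spaced bins and set both axes to log scale.
bins = np.logspace(0, np.log10(max(occurance)), 50)
plt.hist(occurance, bins=bins)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('corpus frequency (log scale)')
plt.ylabel('number of words (log scale)')
plt.show()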


### Part 2: Talkativity
**Q5**. For each of the recurrent characters, calculate their total number of words uttered across all episodes. Based on this, who seems to be the most talkative character?

In [None]:
df_scritp['count_token'] = df_scritp['dialogue_tokenized'].apply(len)
talkativity = df_scritp.groupby('cast',as_index=False)[['count_token']].sum()
talkativity.sort_values('count_token',ascending=False)


## Task D: The Detective's Hat

Sheldon claims that given a dialogue line, he can, with an accuracy of above 70%, say whether it's by himself or by someone else. Leonard contests this claim, since he believes that this claimed accuracy is too high.

**Q6**. Divide the set of all dialogue lines into two subsets: the training set, consisting of all the seasons except the last two, and the test set, consisting of the last two seasons.
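
A minimal sketch of a split that does not hard-code the season labels, assuming every episode label contains `Series NN` as in the rows shown earlier. The toy episode titles are made up, and `train_idx` and `test_idx` are illustrative names.

```python
import pandas as pd

# Made-up episode labels in the same format as the script.
toy = pd.DataFrame({'episode': [
    'Series 01 Episode 01 – Pilot Episode',
    'Series 02 Episode 03 – Toy Episode',
    'Series 03 Episode 02 – Toy Episode',
    'Series 03 Episode 05 – Toy Episode',
]})

# Pull the season number out of the episode label.
season = toy['episode'].str.extract(r'Series (\d+)', expand=False).astype(int)

# The two highest season numbers form the test set.
last_two = sorted(season.unique())[-2:]
is_test = season.isin(last_two)

train_idx = toy.index[~is_test]
test_idx = toy.index[is_test]
print(list(train_idx), list(test_idx))
```

Splitting by season rather than at random keeps whole episodes out of the training set, so the test lines come from material the model has never seen.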

In [None]:
df_scritp.reset_index(drop=True,inplace=True)
test_index = df_scritp[df_scritp.episode.str.contains('Series 09') | df_scritp.episode.str.contains('Series 10')].index
train_index = df_scritp[~df_scritp.episode.str.contains('Series 09') & ~df_scritp.episode.str.contains('Series 10')].index


**Q7**. Find the set of all words in the training set that are only uttered by Sheldon. Is it possible for Sheldon to identify himself only based on these? Use the test set to assess this possibility, and explain your method.
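
One way to read this question literally is as a rule-based classifier: collect the words that only Sheldon uses in the training set, then label a test line as Sheldon's whenever it contains at least one of them. A minimal sketch under that assumption, with made-up rows; `sheldon_only_words` and `evaluate` are hypothetical helpers, and each row is a `(speaker, tokens)` pair.

```python
def sheldon_only_words(train_rows):
    """Words used by Sheldon and by nobody else in the training rows."""
    sheldon, others = set(), set()
    for speaker, tokens in train_rows:
        (sheldon if speaker == 'Sheldon' else others).update(tokens)
    return sheldon - others

def evaluate(test_rows, vocab):
    """Predict 'Sheldon' when a line contains any word from vocab."""
    tp = fp = fn = tn = 0
    for speaker, tokens in test_rows:
        predicted = any(tok in vocab for tok in tokens)
        actual = speaker == 'Sheldon'
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / max(tp + fp + fn + tn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return accuracy, precision, recall

# Made-up rows, only to show the mechanics.
train = [('Sheldon', ['bazinga', 'physics']),
         ('Leonard', ['physics', 'hello']),
         ('Penny', ['hello'])]
test = [('Sheldon', ['bazinga']),
        ('Sheldon', ['hello']),
        ('Leonard', ['physics']),
        ('Penny', ['bazinga', 'hello'])]

vocab = sheldon_only_words(train)
accuracy, precision, recall = evaluate(test, vocab)
print(vocab, accuracy, precision, recall)
```

Reporting precision and recall next to accuracy shows the two ways the rule can fail: Sheldon lines that contain none of his exclusive words, and other characters who start using those words in the test seasons.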

In [None]:
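from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

def preprocess_bow(df, train_index, test_index):
    texts = df.dialogue_clean.values
    # Binary target: 1 if the line is Sheldon's, 0 otherwise.
    Y = ((df.cast == 'Sheldon').values).astype(int)

    # Fit the vocabulary on the training lines only, so no test-season words leak in.
    # The matrices stay sparse: a dense copy of the full bag-of-words can exhaust memory.
    X_train = vectorizer.fit_transform(texts[train_index])
    X_test = vectorizer.transform(texts[test_index])

    Y_train = Y[train_index]
    Y_test = Y[test_index]

    return X_train, Y_train, X_test, Y_test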

X_train, Y_train, X_test, Y_test = preprocess_bow(df_scritp, train_index, test_index)
print(X_train.shape)
print(X_test.shape)


In [None]:
import warnings
warnings.filterwarnings('ignore')

model = LogisticRegressionCV(cv=10).fit(X_train, Y_train)
score_train = model.score(X_train,Y_train)
score_test = model.score(X_test,Y_test)

print('Train Accuracy:',score_train)
print('Test Accuracy:',score_test)
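
An accuracy figure only means something relative to a trivial baseline. If Sheldon speaks a minority of the lines, always answering "someone else" may already score well, so the 70% claim should be compared against that number. A minimal sketch with made-up labels; on the real data the input would be `Y_test` from the cell above.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy of always predicting the most frequent class in y."""
    y = np.asarray(y)
    counts = np.bincount(y)
    return counts.max() / len(y)

# Made-up labels: 1 = Sheldon, 0 = someone else.
y_demo = np.array([0, 0, 0, 1])
baseline = majority_baseline_accuracy(y_demo)
print(baseline)
```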