# Utterances


This data frame lists each movie line (utterance) together with the character who speaks it. The `line_ID` column is the key that `conversations_df` references.

In [1]:
# import packages
import numpy as np
import pandas as pd
import nltk
import spacy

## Loading in the data, basic summary, and initial cleaning

In [2]:
# creating the df
utterances_df = pd.read_csv('./data/movie_lines.txt', sep=r'\s+\+\+\+\$\+\+\+\s?',
                            names=['line_ID', 'character_ID', 'movie_ID', 'character_name', 'utterance'],
                            dtype='string', engine='python', encoding='ISO-8859-1')

The fields in the raw file are separated by ' +++$+++ ' and the file has no header row. The README describes what each column holds, so I used it to create the column names.
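To see what the separator pattern does, here is a minimal sketch that applies the same regular expression to single lines with `re.split`. The two sample lines are assumptions rebuilt from values that appear in this notebook's output and from the separator described above; they are not copied from the file.

```python
import re

# Same pattern passed to pd.read_csv: one or more whitespace characters,
# the literal marker +++$+++, then an optional trailing whitespace character.
SEPARATOR = r'\s+\+\+\+\$\+\+\+\s?'

# Assumed shape of a raw line with all five fields present.
sample_line = 'L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!'
fields = re.split(SEPARATOR, sample_line)
print(fields)

# Assumed shape of a line with nothing after the last separator:
# the split produces an empty final field.
empty_line = 'L474 +++$+++ u5 +++$+++ m0 +++$+++ KAT +++$+++ '
empty_fields = re.split(SEPARATOR, empty_line)
print(empty_fields)
```

If the raw file does contain lines like the second one, an empty final field would be one plausible source of the missing utterances reported further down.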

In [3]:
utterances_df.shape

(304713, 5)

In [4]:
utterances_df.info()
# character_name and utterance have fewer non-null values than rows, so some values are missing

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304713 entries, 0 to 304712
Data columns (total 5 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   line_ID         304713 non-null  string
 1   character_ID    304713 non-null  string
 2   movie_ID        304713 non-null  string
 3   character_name  304670 non-null  string
 4   utterance       304446 non-null  string
dtypes: string(5)
memory usage: 11.6 MB


In [5]:
# replacing missing values with an empty string
utterances_df['character_name'] = utterances_df['character_name'].fillna('')
utterances_df['utterance'] = utterances_df['utterance'].fillna('')
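From the `info()` output, 304713 - 304670 = 43 character names and 304713 - 304446 = 267 utterances are missing. The sketch below shows the same count-then-fill step on a made-up three-row frame; the rows are illustrative only. Assigning the filled column back is an alternative to `inplace=True` on a single column, which recent pandas releases may warn about.

```python
import pandas as pd

# Toy frame (made-up rows) with one missing name and one missing utterance.
toy_df = pd.DataFrame(
    {
        'character_name': ['BIANCA', None, 'KAT'],
        'utterance': ['They do not!', 'She okay?', None],
    }
).astype('string')

# Count missing values per column before filling.
missing_before = toy_df.isna().sum()

# Fill each column and assign the result back to the frame.
for col in ['character_name', 'utterance']:
    toy_df[col] = toy_df[col].fillna('')

# Count again to confirm nothing is missing any more.
missing_after = toy_df.isna().sum()
print(missing_before.to_dict(), missing_after.to_dict())
```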

In [6]:
utterances_df.head()

Unnamed: 0,line_ID,character_ID,movie_ID,character_name,utterance
0,L1045,u0,m0,BIANCA,They do not!
1,L1044,u2,m0,CAMERON,They do to!
2,L985,u0,m0,BIANCA,I hope so.
3,L984,u2,m0,CAMERON,She okay?
4,L925,u0,m0,BIANCA,Let's go.


## Tokenizing and POS Tagging

In [7]:
# tokenizing
utterances_df['sents'] = utterances_df.utterance.map(nltk.sent_tokenize)
utterances_df['tokens'] = utterances_df.utterance.map(nltk.word_tokenize)

In [8]:
# generate nlp object for spacy pos tagging
nlp = spacy.load("en_core_web_sm")

In [9]:
# function that tags POS and returns a list of (token, POS) tuples
def pos_tag(x):
    pos = []
    for token in nlp(x):
        # token is a spaCy Token object, not a plain string
        pos.append((token, token.pos_))
    return pos

In [10]:
# adding POS tags to see if any trends arise
utterances_df['pos_tag'] = utterances_df.utterance.map(pos_tag)

I like the readability of the (word, POS) tuples that the NLTK POS tagger returns, so I wrote the function above to keep that structure while using spaCy.
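One side effect of the function above is that the first element of each tuple is a spaCy `Token` object rather than a string, and, as far as I know, spaCy does not support pickling individual tokens. A possible variant, sketched here, stores `token.text` instead so the tuples contain only plain strings. `FakeToken` and `text_pos_pairs` are hypothetical names, and `FakeToken` is only a stand-in so the example runs without the spaCy model; with spaCy the function would be called on `nlp(x)`.

```python
import pickle
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in exposing the two spaCy Token attributes used here."""
    text: str
    pos_: str


def text_pos_pairs(doc):
    """Return (text, POS) tuples of plain strings for an iterable of tokens."""
    return [(token.text, token.pos_) for token in doc]


# Made-up tokens with illustrative tags, standing in for a parsed utterance.
fake_doc = [
    FakeToken('They', 'PRON'),
    FakeToken('do', 'VERB'),
    FakeToken('not', 'PART'),
    FakeToken('!', 'PUNCT'),
]

pairs = text_pos_pairs(fake_doc)

# Tuples of plain strings survive a pickle round trip unchanged.
restored = pickle.loads(pickle.dumps(pairs))
print(restored)
```

The trade-off is that the strings no longer carry the other token attributes (lemma, dependency label, and so on), so this only fits if the word and its POS tag are all that is needed.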

In [11]:
# sentence and token counts
utterances_df['sent_count'] = utterances_df.sents.map(len)
utterances_df['token_count'] = utterances_df.tokens.map(len)

In [12]:
# average sentence length, in tokens per sentence
utterances_df['avg_sent_length'] = utterances_df.token_count / utterances_df.sent_count

In [13]:
utterances_df.head()

Unnamed: 0,line_ID,character_ID,movie_ID,character_name,utterance,sents,tokens,pos_tag,sent_count,token_count,avg_sent_length
0,L1045,u0,m0,BIANCA,They do not!,[They do not!],"[They, do, not, !]","[(They, PRON), (do, VERB), (not, PART), (!, PU...",1,4,4.0
1,L1044,u2,m0,CAMERON,They do to!,[They do to!],"[They, do, to, !]","[(They, PRON), (do, VERB), (to, PART), (!, PUN...",1,4,4.0
2,L985,u0,m0,BIANCA,I hope so.,[I hope so.],"[I, hope, so, .]","[(I, PRON), (hope, VERB), (so, ADV), (., PUNCT)]",1,4,4.0
3,L984,u2,m0,CAMERON,She okay?,[She okay?],"[She, okay, ?]","[(She, PRON), (okay, ADJ), (?, PUNCT)]",1,3,3.0
4,L925,u0,m0,BIANCA,Let's go.,[Let's go.],"[Let, 's, go, .]","[(Let, VERB), ('s, PRON), (go, VERB), (., PUNCT)]",1,4,4.0


In [14]:
utterances_df.describe()

Unnamed: 0,sent_count,token_count,avg_sent_length
count,304713.0,304713.0,304446.0
mean,1.69385,13.722559,7.855335
std,1.252766,14.711341,5.155574
min,0.0,0.0,1.0
25%,1.0,5.0,4.5
50%,1.0,9.0,7.0
75%,2.0,17.0,10.0
max,45.0,684.0,122.0
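The `count` for `avg_sent_length` (304446) is lower than the row count (304713) by 267, which matches the number of missing utterances in the `info()` output. A likely explanation is that an empty utterance has 0 sentences and 0 tokens, so the division yields NaN, and `describe()` skips NaN values. The sketch below reproduces that on made-up counts.

```python
import pandas as pd

# Toy counts (made-up): the last row stands for an empty utterance.
toy_df = pd.DataFrame({'sent_count': [1, 2, 0], 'token_count': [4, 9, 0]})

# 0 / 0 evaluates to NaN, so the empty row gets no average.
toy_df['avg_sent_length'] = toy_df.token_count / toy_df.sent_count

# describe() counts only non-NaN values, so this column's count is lower.
summary = toy_df.describe()
print(summary.loc['count'])
```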


In [15]:
# some utterances are blank
utterances_df[utterances_df.utterance=='']

Unnamed: 0,line_ID,character_ID,movie_ID,character_name,utterance,sents,tokens,pos_tag,sent_count,token_count,avg_sent_length
538,L474,u5,m0,KAT,,[],[],[],0,0,
5637,L24609,u224,m14,SYKES,,[],[],[],0,0,
36526,L239088,u1125,m74,JANOSZ,,[],[],[],0,0,
45298,L283548,u1356,m90,BRUCE,,[],[],[],0,0,
49894,L303243,u1475,m100,JOE,,[],[],[],0,0,
...,...,...,...,...,...,...,...,...,...,...,...
289507,L624042,u8606,m583,VIXIS,,[],[],[],0,0,
299552,L649938,u8876,m603,LASHER,,[],[],[],0,0,
299714,L649416,u8879,m603,MICHAEL,,[],[],[],0,0,
303350,L663421,u8980,m612,DREIBERG,,[],[],[],0,0,


I am not sure why these utterances are empty or whether they affect `conversations_df`. If they have no impact, I will remove them from the df.
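One way to check the impact is to look up whether any conversation refers to an empty line before dropping it. The sketch below uses made-up rows, and the layout of the conversations frame (a hypothetical `line_IDs` column holding a list of IDs per conversation) is an assumption, since `conversations_df` is not shown in this notebook.

```python
import pandas as pd

# Toy utterances (made-up rows); L473 and L474 are blank.
toy_utterances = pd.DataFrame(
    {'line_ID': ['L472', 'L473', 'L474'], 'utterance': ['Hi.', '', '']}
)

# Hypothetical layout: one row per conversation, with a list of line IDs.
toy_conversations = pd.DataFrame({'line_IDs': [['L472', 'L473']]})

# IDs of the blank utterances.
empty_ids = set(toy_utterances.loc[toy_utterances.utterance == '', 'line_ID'])

# Flatten the lists so every referenced line ID sits in its own row.
referenced = set(toy_conversations['line_IDs'].explode())

# Blank lines that a conversation still points to.
empty_but_referenced = sorted(empty_ids & referenced)

# Drop only the blank rows that no conversation refers to.
droppable = empty_ids - referenced
cleaned = toy_utterances[~toy_utterances.line_ID.isin(list(droppable))]
print(empty_but_referenced, cleaned.line_ID.tolist())
```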

## Saving the data

The spaCy objects stored in the `pos_tag` column mean this data frame cannot be pickled. Instead, I save the df as a CSV in a git-ignored folder so it is not pushed to GitHub, where the file would be too large to host.

In [18]:
# save the CSV to a local folder that is not published to GitHub
utterances_df.to_csv('./private/utterances_df.csv', header=True)
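A point to keep in mind when reading the CSV back: list-valued columns such as `sents` and `tokens` are written as their string representation, so they come back as strings rather than lists. The sketch below shows one way to restore a list of strings with `ast.literal_eval`, using a made-up one-row frame and an in-memory buffer instead of the real file.

```python
import ast
import io

import pandas as pd

# Toy frame (made-up row) with a list-valued column like `tokens`.
toy_df = pd.DataFrame({'line_ID': ['L1045'], 'tokens': [['They', 'do', 'not', '!']]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_df.to_csv(buffer, header=True)
buffer.seek(0)

# After reading back, the list has become a single string.
reloaded = pd.read_csv(buffer, index_col=0)
as_string = reloaded.loc[0, 'tokens']

# ast.literal_eval turns that string back into a real list.
reloaded['tokens'] = reloaded['tokens'].map(ast.literal_eval)
print(type(as_string).__name__, reloaded.loc[0, 'tokens'])
```

This would not work for the `pos_tag` column as it stands, because an entry such as `(They, PRON)` is not a valid Python literal.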

In [None]:
# the CSV is too large to upload to GitHub
# to generate it locally, run this notebook and uncomment the line below
# utterances_df.to_csv('./new_data/utterances_df.csv', header=True)