# Utterances

This data frame lists each movie line (utterance) together with the character who speaks it. The `line_ID` index is what `conversations_df` uses to refer to individual lines.

In [1]:
# import packages
import numpy as np
import pandas as pd
import nltk



## Loading in the data, basic summary, and initial cleaning

In [2]:
# creating the df
# raw string so the regex escapes (\s, \+, \$) are not treated as string escapes
utterances_df = pd.read_csv('./data/movie_lines.txt', sep=r'\s+\+\+\+\$\+\+\+\s?',
                            names=['line_ID', 'character_ID', 'movie_ID', 'character_name', 'utterance'],
                            index_col='line_ID', dtype='string', engine='python', encoding='ISO-8859-1')

The fields in the raw file are separated by ' +++$+++ ' and the file has no header row. The README describes what each column holds, so I used it to create the column names. Where logical, I made the ID column the index of the df.
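
As a quick illustration of how that separator pattern splits a line, here is a minimal sketch using the standard `re` module rather than pandas. The two sample strings are assumptions: they are reconstructed from rows shown in this notebook, not copied from the raw file.

```python
import re

# Same pattern that is passed to pd.read_csv as `sep`.
SEP = re.compile(r'\s+\+\+\+\$\+\+\+\s?')

# Hypothetical raw lines, rebuilt from rows displayed in this notebook.
sample = 'L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!'
empty_sample = 'L474 +++$+++ u5 +++$+++ m0 +++$+++ KAT +++$+++ '

fields = SEP.split(sample)
empty_fields = SEP.split(empty_sample)

print(fields)        # ['L1045', 'u0', 'm0', 'BIANCA', 'They do not!']
print(empty_fields)  # ['L474', 'u5', 'm0', 'KAT', '']
```

With `re.split`, a line that has nothing after the last separator yields an empty final field; pandas reads such a field as a missing value.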

In [3]:
utterances_df.shape

(304713, 4)

In [4]:
utterances_df.info()
# character_name and utterance have fewer non-null values than rows, so some values are missing

<class 'pandas.core.frame.DataFrame'>
Index: 304713 entries, L1045 to L666256
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   character_ID    304713 non-null  string
 1   movie_ID        304713 non-null  string
 2   character_name  304670 non-null  string
 3   utterance       304446 non-null  string
dtypes: string(4)
memory usage: 11.6+ MB


In [5]:
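# replacing missing values with an empty string
# (assigning back avoids the chained-assignment pitfall of fillna(..., inplace=True) on a column)
utterances_df['character_name'] = utterances_df['character_name'].fillna('')
utterances_df['utterance'] = utterances_df['utterance'].fillna('')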

In [6]:
utterances_df.head()

line_ID,character_ID,movie_ID,character_name,utterance
L1045,u0,m0,BIANCA,They do not!
L1044,u2,m0,CAMERON,They do to!
L985,u0,m0,BIANCA,I hope so.
L984,u2,m0,CAMERON,She okay?
L925,u0,m0,BIANCA,Let's go.


## Tokenizing and POS Tagging

In [7]:
# tokenizing
utterances_df['sents'] = utterances_df.utterance.map(nltk.sent_tokenize)
utterances_df['tokens'] = utterances_df.utterance.map(nltk.word_tokenize)

In [8]:
# adding POS tags to see if any trends arise
utterances_df['pos_tag'] = utterances_df.tokens.map(nltk.pos_tag)

In [9]:
# sentence and token counts
utterances_df['sent_count'] = utterances_df.sents.map(len)
utterances_df['token_count'] = utterances_df.tokens.map(len)

In [10]:
# average sentence length in tokens (NaN for empty utterances, where sent_count is 0)
utterances_df['avg_sent_length'] = utterances_df.token_count / utterances_df.sent_count

In [11]:
utterances_df.head()

line_ID,character_ID,movie_ID,character_name,utterance,sents,tokens,pos_tag,sent_count,token_count,avg_sent_length
L1045,u0,m0,BIANCA,They do not!,[They do not!],"[They, do, not, !]","[(They, PRP), (do, VBP), (not, RB), (!, .)]",1,4,4.0
L1044,u2,m0,CAMERON,They do to!,[They do to!],"[They, do, to, !]","[(They, PRP), (do, VBP), (to, TO), (!, .)]",1,4,4.0
L985,u0,m0,BIANCA,I hope so.,[I hope so.],"[I, hope, so, .]","[(I, PRP), (hope, VBP), (so, RB), (., .)]",1,4,4.0
L984,u2,m0,CAMERON,She okay?,[She okay?],"[She, okay, ?]","[(She, PRP), (okay, PRP), (?, .)]",1,3,3.0
L925,u0,m0,BIANCA,Let's go.,[Let's go.],"[Let, 's, go, .]","[(Let, VB), ('s, POS), (go, VB), (., .)]",1,4,4.0


In [12]:
utterances_df.describe()

,sent_count,token_count,avg_sent_length
count,304713.0,304713.0,304446.0
mean,1.69385,13.722559,7.855335
std,1.252766,14.711341,5.155574
min,0.0,0.0,1.0
25%,1.0,5.0,4.5
50%,1.0,9.0,7.0
75%,2.0,17.0,10.0
max,45.0,684.0,122.0


In [13]:
# some utterances are blank
utterances_df[utterances_df.utterance=='']

line_ID,character_ID,movie_ID,character_name,utterance,sents,tokens,pos_tag,sent_count,token_count,avg_sent_length
L474,u5,m0,KAT,,[],[],[],0,0,
L24609,u224,m14,SYKES,,[],[],[],0,0,
L239088,u1125,m74,JANOSZ,,[],[],[],0,0,
L283548,u1356,m90,BRUCE,,[],[],[],0,0,
L303243,u1475,m100,JOE,,[],[],[],0,0,
...,...,...,...,...,...,...,...,...,...,...
L624042,u8606,m583,VIXIS,,[],[],[],0,0,
L649938,u8876,m603,LASHER,,[],[],[],0,0,
L649416,u8879,m603,MICHAEL,,[],[],[],0,0,
L663421,u8980,m612,DREIBERG,,[],[],[],0,0,


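I am not sure why these utterances are empty or whether they affect `conversations_df`. They correspond to the missing `utterance` values that were replaced with empty strings earlier in this notebook. If removing them does not affect `conversations_df`, they will be dropped from the df.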
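
One way to check for an impact is to look for empty line IDs inside the conversations. The sketch below is only an illustration under assumptions: it supposes each conversation is available as a list of line IDs (the layout of `conversations_df` is not shown in this notebook), the helper name `empty_lines_in_conversations` is hypothetical, and the conversation lists are invented toy data.

```python
def empty_lines_in_conversations(empty_ids, conversations):
    """Map each conversation's position to the empty line IDs it references."""
    empty_ids = set(empty_ids)
    hits = {}
    for idx, line_ids in enumerate(conversations):
        found = [lid for lid in line_ids if lid in empty_ids]
        if found:
            hits[idx] = found
    return hits

# Toy data: the IDs in `empty_ids` appear in the table above,
# but the conversation lists are made up for illustration.
empty_ids = ['L474', 'L24609']
conversations = [['L472', 'L473', 'L474'], ['L1044', 'L1045']]

hits = empty_lines_in_conversations(empty_ids, conversations)
print(hits)  # {0: ['L474']}
```

In this notebook the empty IDs could be taken from `utterances_df.index[utterances_df.utterance == '']`. An empty result would suggest the rows can be dropped without touching any conversation.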

## Pickling the data

In [14]:
import pickle

In [15]:
# pickle the data to use in other notebooks for further analysis
# (the with-block closes the file even if dump raises)
with open('utterances_df.pkl', 'wb') as f:
    pickle.dump(utterances_df, f, protocol=pickle.HIGHEST_PROTOCOL)

In [17]:
# the df is too large to upload to GitHub as a csv file
# to generate the csv locally, run this notebook and uncomment the line below
# utterances_df.to_csv('./new_data/utterances_df.csv', header=True)