In [0]:
import pandas as pd
import re

In [0]:
df_raw = pd.read_csv('/content/drive/My Drive/characterUtterances.csv')

In [3]:
df_raw.head()

Unnamed: 0,Season,Episode,Character,Line
0,10,1,Stan,"You guys, you guys! Chef is going away. \n"
1,10,1,Kyle,Going away? For how long?\n
2,10,1,Stan,Forever.\n
3,10,1,Chef,I'm sorry boys.\n
4,10,1,Stan,"Chef said he's been bored, so he joining a gro..."


Quickly check there are no missing values in the data...

In [4]:
df_raw.isnull().sum()

Season       0
Episode      0
Character    0
Line         0
dtype: int64

In [0]:
df = df_raw.copy()

Nice and clean! Lucky us... Let's now remove those trailing newline characters...


In [0]:
# Vectorised equivalent of mapping str.rstrip over every row
df['Line'] = df['Line'].str.rstrip()

In [7]:
df.head()

Unnamed: 0,Season,Episode,Character,Line
0,10,1,Stan,"You guys, you guys! Chef is going away."
1,10,1,Kyle,Going away? For how long?
2,10,1,Stan,Forever.
3,10,1,Chef,I'm sorry boys.
4,10,1,Stan,"Chef said he's been bored, so he joining a gro..."


After the initial cleaning I noticed that ellipses cause a problem when a single regular expression is used to remove punctuation... The ellipses are removed, but when they indicate a stutter, as in "b...but", no spaces separate the two fragments, so they fuse into a single token ("bbut"). In light of this, we first replace every run of two or more full stops with a blank space, then strip the regular punctuation... We map these semi-processed strings to a new column so we can easily extract the unprocessed test lines for human evaluation (much later on)...

In [0]:
# Raw string avoids the invalid '\.' escape; runs of 2+ full stops become one space
df['Processed Line'] = df['Line'].map(lambda x: re.sub(r'\.{2,}', ' ', x))

Remove uppercasing and pesky punctuation...

In [0]:
# Drop everything that is neither a word character nor whitespace, then lowercase
df['Processed Line'] = df['Processed Line'].map(lambda x: re.sub(r'[^\w\s]', '', x).lower())

In [10]:
df.head()

Unnamed: 0,Season,Episode,Character,Line,Processed Line
0,10,1,Stan,"You guys, you guys! Chef is going away.",you guys you guys chef is going away
1,10,1,Kyle,Going away? For how long?,going away for how long
2,10,1,Stan,Forever.,forever
3,10,1,Chef,I'm sorry boys.,im sorry boys
4,10,1,Stan,"Chef said he's been bored, so he joining a gro...",chef said hes been bored so he joining a group...
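
To see why the ellipsis step has to come first, here is a minimal sketch of the two steps above applied to a single string. The helper names and the sample line are my own, for illustration only; the sample is not taken from the dataset.

```python
import re

def strip_punctuation(text):
    """Delete every character that is neither a word character nor whitespace, then lowercase."""
    return re.sub(r'[^\w\s]', '', text).lower()

def process_line(text):
    """Replace runs of two or more full stops with a space, then strip punctuation."""
    spaced = re.sub(r'\.{2,}', ' ', text)
    return strip_punctuation(spaced)

# Hypothetical stuttered line, not from the dataset
stutter = "B...but I don't know..."

print(repr(strip_punctuation(stutter)))  # single pass: the stutter fuses into one token
print(repr(process_line(stutter)))       # two passes: the fragments stay separate
```

Note that a trailing ellipsis leaves a trailing space behind, so a final `strip()` may be worth adding if exact whitespace matters downstream.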


Since we are interested in character diction, we ought to set a minimum length for the strings in the dataframe. Diction is unlikely to be detectable from a single word, wouldn't you agree?

Let us first count the number of words in each string, to better determine a lower bound on the length - we don't want to throw away too much data...

In [0]:
# Words per line (spaces + 1), tallied and ordered by word count
counts = df['Line'].str.count(' ').add(1).value_counts().sort_index()

In [12]:
counts

1      6681
2      4946
3      4758
4      4749
5      4569
       ... 
257       1
261       1
263       2
265       1
305       1
Name: Line, Length: 181, dtype: int64
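
One caveat with counting spaces: it assumes words are separated by exactly one space. The sketch below compares it against whitespace splitting in plain Python; the function names are mine, and the third sample is a hypothetical line with double spaces. In pandas the splitting variant would be `df['Line'].str.split().str.len()`, though I have not checked whether it changes the counts on this dataset.

```python
def words_by_spaces(text):
    """Word count as used above: number of space characters plus one."""
    return text.count(' ') + 1

def words_by_split(text):
    """Word count from whitespace splitting; a run of whitespace counts once."""
    return len(text.split())

samples = [
    "Going away? For how long?",  # from the dataset preview
    "Forever.",                   # from the dataset preview
    "Oh  my  God!",               # hypothetical line containing double spaces
]

for s in samples:
    print(repr(s), words_by_spaces(s), words_by_split(s))
```

The two counts agree on the cleanly spaced lines and differ only where spaces are doubled.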

We can see there are thousands of short utterances. These are not very informative, and the character is unlikely to be identifiable from so few words... The obvious exception is character 'sayings', such as 'screw you guys' (Cartman).

Let us see how much data will be lost if we proceed in this manner...

In [13]:
LOWER_BOUND = 4 # Min words per sentence

# counts is indexed by word count, so .loc slices by label (words >= LOWER_BOUND)
kept = counts.loc[LOWER_BOUND:].sum()
print('Excluding sentences with fewer than {} words drops {:.2f}% of the data'.format(
    LOWER_BOUND, 100 - kept / len(df) * 100)
)

Excluding sentences with fewer than 4 words drops 23.11% of the data
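
To compare several candidate bounds before committing to one, the same calculation can be wrapped in a small function. This is a plain-Python sketch so it runs without the dataset; `percent_dropped` and `toy_lines` are hypothetical names, and the four lines are taken from the preview above rather than being representative of the full data.

```python
from collections import Counter

def percent_dropped(lines, lower_bound):
    """Percentage of lines with fewer than `lower_bound` space-separated words."""
    counts = Counter(line.count(' ') + 1 for line in lines)
    dropped = sum(n for words, n in counts.items() if words < lower_bound)
    return 100 * dropped / len(lines)

# Four lines from the preview above (1, 3, 5 and 9 words respectively)
toy_lines = [
    "Forever.",
    "I'm sorry boys.",
    "Going away? For how long?",
    "What's the meaning of life? Why are we here?",
]

for bound in (2, 4, 6):
    print('bound={}: drops {:.1f}%'.format(bound, percent_dropped(toy_lines, bound)))
```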


We are going to lose some data, but the remaining sentences will have richer information. 

Let us drop those shorter rows...

In [0]:
# A line with at least LOWER_BOUND words contains at least LOWER_BOUND - 1 spaces
df = df[df['Line'].str.count(' ') >= LOWER_BOUND - 1]

In [15]:
df.head()

Unnamed: 0,Season,Episode,Character,Line,Processed Line
0,10,1,Stan,"You guys, you guys! Chef is going away.",you guys you guys chef is going away
1,10,1,Kyle,Going away? For how long?,going away for how long
4,10,1,Stan,"Chef said he's been bored, so he joining a gro...",chef said hes been bored so he joining a group...
6,10,1,Mrs. Garrison,Chef?? What kind of questions do you think adv...,chef what kind of questions do you think adven...
7,10,1,Chef,What's the meaning of life? Why are we here?,whats the meaning of life why are we here


Next we are going to drop all irrelevant characters. Let us look at the top 10 characters and their line counts.

I have printed the line counts for both the filtered (lower bound) dataset and the raw dataset, just to check that no single character has lost too many lines...

In [16]:
df['Character'].value_counts()[:10]

Cartman         7944
Stan            5638
Kyle            5230
Randy           2008
Butters         1923
Mr. Garrison     869
Chef             707
Sharon           696
Mr. Mackey       552
Jimmy            508
Name: Character, dtype: int64

In [17]:
df_raw['Character'].value_counts()[:10]

Cartman         9774
Stan            7680
Kyle            7099
Butters         2602
Randy           2467
Mr. Garrison    1002
Chef             917
Kenny            881
Sharon           862
Mr. Mackey       633
Name: Character, dtype: int64

Some characters have significantly more lines than others, so we are clearly working with an imbalanced dataset. This is fine though; we will simply use an appropriate metric when determining model score.

Another thing we should consider: some characters are going to be easily identified based on a single, repeated word, such as Mr. Mackey, who always says 'mkay', or Kenny, who constantly uses expletives. We want to look at general language structure and not have the model home in on these common phrases. For this reason, we are going to select the following characters:



1.   Cartman
2.   Stan
3.   Kyle
4.   Butters
5.   Randy
6.   Mr. Garrison

(For the time being)

This leaves us with ~23k sentences...



In [18]:
df['Character'].value_counts()[:6].sum()

23612
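
With the six characters fixed, the imbalance mentioned earlier can be put into numbers. The sketch below uses the line counts copied from the filtered `value_counts()` output above and works out each character's share, plus the accuracy a model would get by always predicting the most frequent character. Which metric is eventually used is not decided here; macro-averaged F1 or balanced accuracy would be typical candidates, but that is my assumption.

```python
# Line counts for the six selected characters, copied from the filtered output above
line_counts = {
    'Cartman': 7944, 'Stan': 5638, 'Kyle': 5230,
    'Randy': 2008, 'Butters': 1923, 'Mr. Garrison': 869,
}
total = sum(line_counts.values())

# Fraction of all selected lines spoken by each character
shares = {name: n / total for name, n in line_counts.items()}

# Character with the most lines, i.e. the constant prediction that maximises accuracy
majority = max(line_counts, key=line_counts.get)

for name, share in shares.items():
    print('{:<13}{:6.1%}'.format(name, share))
print('Always predicting {!r} scores {:.1%} accuracy'.format(majority, shares[majority]))
```

Any model has to beat that constant-prediction baseline before plain accuracy tells us anything, which is why a per-class metric is the safer choice.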

In [0]:
# Considered characters
valid_characters = ['Cartman', 'Stan', 'Kyle', 'Butters', 
                    'Randy', 'Mr. Garrison']

In [0]:
char_df = df[df['Character'].isin(valid_characters)]

In [21]:
char_df.head()

Unnamed: 0,Season,Episode,Character,Line,Processed Line
0,10,1,Stan,"You guys, you guys! Chef is going away.",you guys you guys chef is going away
1,10,1,Kyle,Going away? For how long?,going away for how long
4,10,1,Stan,"Chef said he's been bored, so he joining a gro...",chef said hes been bored so he joining a group...
9,10,1,Cartman,I'm gonna miss him. I'm gonna miss Chef and I...,im gonna miss him im gonna miss chef and i an...
10,10,1,Stan,"Dude, how are we gonna go on? Chef was our fuh...",dude how are we gonna go on chef was our fuh f...


Let us save this processed data as a CSV file prior to doing some exploration...

In [0]:
SAVE_AS = 'procCharLines.csv'

char_df.to_csv(SAVE_AS, header=True)