# Create a custom list of words for word2vec play

## The task

Go from this (CSV containing Seinfeld dialogue and metadata):

```
...51,CLAIRE,"Id have to say, uuhh, no.",1,S01E01,1
52,(George shows his note-block to Jerry; it says very largely,NO.),1,S01E01,1
53,CLAIRE,To be polite.,1,S01E01,1
54,GEORGE,To be polite. I rest my case.,1,S01E01,1
55,JERRY,"Good. Did you have fun? You have no idea, what youre talking about, now, come on, come with me. (stands up) I gotta go get my stuff out of the dryer anyway.",1,S01E01,1
56,GEORGE,Im not gonna watch you do laundry.,1,S01E01,1
57,JERRY,"Oh, come on, be a come-with guy.",1,S01E01,1
58,GEORGE,"Come on, Im tired.",1,S01E01,1
59,CLAIRE,"(to Jerry) Dont worry, I gave him a little caffeine. Hell perk up.",1,S01E01,1
60,GEORGE,"(panicking) Right, I knew I felt something!",1,S01E01,1
61,GEORGE,Jerry? I have to tell you something. This is the dullest moment Ive ever experienced.,1,S01E01,1
62,JERRY,"Well, look at this guy. Look, hes got everything, hes got detergents, sprays, fabric softeners.  This is not his first load.",1,S01E01,1...
```

To this (a sorted list of each unique word spoken in Seinfeld):

```
[...'agenda', 'agent', 'agents', 'ages', 'aggravate',...]
```

## Our steps

1. Take a look at the source data
1. Extract the dialogue
1. Remove unwanted punctuation
1. Split text into list
1. Unique the list

## Take a look at the source data

In [19]:
# Import read_csv
from pandas import read_csv

In [20]:
# Create a dataframe object using the csv as the data
df = read_csv('seinfeld_raw.csv')

In [21]:
# Preview the dataframe object
df

Unnamed: 0.1,Unnamed: 0,Character,Dialogue,EpisodeNo,SEID,Season
0,0,JERRY,Do you know what this is all about? Do you kno...,1,S01E01,1
1,1,JERRY,"(pointing at Georges shirt) See, to me, that b...",1,S01E01,1
2,2,GEORGE,Are you through?,1,S01E01,1
3,3,JERRY,"You do of course try on, when you buy?",1,S01E01,1
4,4,GEORGE,"Yes, it was purple, I liked it, I dont actuall...",1,S01E01,1
...,...,...,...,...,...,...
54611,54611,JERRY,Grand theft auto - don't steal any of my jokes.,23,S09E23,9
54612,54612,PRISONER 3,You suck - I'm gonna cut you.,23,S09E23,9
54613,54613,JERRY,"Hey, I don't come down to where you work, and ...",23,S09E23,9
54614,54614,GUARD,"Alright, Seinfeld, that's it. Let's go. Come on.",23,S09E23,9


In [26]:
# Narrow in on only the dialogue column,
# and preview just the first five rows
df['Dialogue'][:5]

0    Do you know what this is all about? Do you kno...
1    (pointing at Georges shirt) See, to me, that b...
2                                     Are you through?
3               You do of course try on, when you buy?
4    Yes, it was purple, I liked it, I dont actuall...
Name: Dialogue, dtype: object

In [27]:
# View that same data as a list
df['Dialogue'].tolist()[:5]

['Do you know what this is all about? Do you know, why were here? To be out, this is out...and out is one of the single most enjoyable experiences of life. People...did you ever hear people talking about We should go out? This is what theyre talking about...this whole thing, were all out now, no one is home. Not one person here is home, were all out! There are people tryin to find us, they dont know where we are. (on an imaginary phone) Did you ring?, I cant find him. Where did he go? He didnt tell me where he was going. He must have gone out. You wanna go out you get ready, you pick out the clothes, right? You take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...Then youre standing around, whatta you do? You go We gotta be getting back. Once youre out, you wanna get back! You wanna go to sleep, you wanna get up, you wanna go out again tomorrow, right? Where ever you are in life, its my feeling, youve gotta go.',
 '(pointing at George

In [30]:
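# Create one big string that contains all dialogue turns.
# astype(str) guards against non-string cells (such as
# missing values), which would make join raise a TypeError
dialogue_list = df['Dialogue'].astype(str).tolist()
dialogue_txt = ' '.join(dialogue_list)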

In [34]:
# Preview the first 400 characters in our big dialogue string
dialogue_txt[:400]

'Do you know what this is all about? Do you know, why were here? To be out, this is out...and out is one of the single most enjoyable experiences of life. People...did you ever hear people talking about We should go out? This is what theyre talking about...this whole thing, were all out now, no one is home. Not one person here is home, were all out! There are people tryin to find us, they dont know'
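The cast to `str` before joining deserves a closer look. A likely reason for it (an assumption; the notebook does not say) is that some rows have an empty Dialogue cell, which pandas reads as a float `NaN`, and `str.join` refuses anything that is not a string. Depending on the pandas version, casting may turn that missing value into the literal text `nan`, which would then show up as a token. The sketch below uses a made-up three-row frame (`toy` is hypothetical) to show the failure, and one alternative, dropping the missing rows before joining.

```python
import pandas as pd

# Hypothetical toy frame; the middle row mimics an empty Dialogue cell
toy = pd.DataFrame(
    {"Dialogue": ["Are you through?", float("nan"), "To be polite."]}
)

# Joining the raw column fails because NaN is a float, not a string
try:
    " ".join(toy["Dialogue"].tolist())
    join_failed = False
except TypeError:
    join_failed = True

# Alternative: drop missing rows first, so nothing non-string reaches join
joined = " ".join(toy["Dialogue"].dropna().tolist())
```

Whether to cast or to drop is a judgment call: casting keeps every row, dropping keeps the vocabulary free of a stray placeholder token.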

## Convert dialogue to token list

In [38]:
# Split the dialogue on single spaces and preview the first 50 items.
# Note that punctuation stays attached to the words ('about?', 'know,')
dialogue_txt.split(' ')[:50]

['Do',
 'you',
 'know',
 'what',
 'this',
 'is',
 'all',
 'about?',
 'Do',
 'you',
 'know,',
 'why',
 'were',
 'here?',
 'To',
 'be',
 'out,',
 'this',
 'is',
 'out...and',
 'out',
 'is',
 'one',
 'of',
 'the',
 'single',
 'most',
 'enjoyable',
 'experiences',
 'of',
 'life.',
 'People...did',
 'you',
 'ever',
 'hear',
 'people',
 'talking',
 'about',
 'We',
 'should',
 'go',
 'out?',
 'This',
 'is',
 'what',
 'theyre',
 'talking',
 'about...this',
 'whole',
 'thing,']

In [43]:
# Import re, which stands for regular expression
import re

# Define a pattern that matches any single character
# that is not a lowercase letter a-z
pattern = r"[^a-z]"

# Lowercase the text, then replace every match
# (punctuation, digits, other symbols) with a space
dialogue_txt = re.sub(pattern, ' ', dialogue_txt.lower())

In [47]:
# Preview the first 400 characters in our new dialogue_txt
dialogue_txt[:400]

'do you know what this is all about  do you know  why were here  to be out  this is out   and out is one of the single most enjoyable experiences of life  people   did you ever hear people talking about we should go out  this is what theyre talking about   this whole thing  were all out now  no one is home  not one person here is home  were all out  there are people tryin to find us  they dont know'
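It helps to see exactly what `[^a-z]` throws away. The sketch below runs the same pattern on a made-up sample line (not from the dataset): apostrophes split contractions in two, digits vanish, and accented letters are dropped. The second pattern, which keeps apostrophes, is only an illustration of one possible variation, not something the notebook uses.

```python
import re

# Same pattern as above: any character that is not a lowercase a-z letter
pattern = r"[^a-z]"

# Made-up sample line with contractions, a digit, and an accented letter
sample = "Don't worry, it's 9 o'clock at the Café!"

cleaned = re.sub(pattern, " ", sample.lower())
tokens = cleaned.split()
# tokens: ['don', 't', 'worry', 'it', 's', 'o', 'clock', 'at', 'the', 'caf']

# Hypothetical variation: also keep apostrophes so contractions survive
tokens_keep_apostrophes = re.sub(r"[^a-z']", " ", sample.lower()).split()
# tokens_keep_apostrophes:
# ["don't", 'worry', "it's", "o'clock", 'at', 'the', 'caf']
```

This matters here because the later episodes in the CSV do contain apostrophes (for example `don't`), so with the original pattern fragments such as `t` and `s` end up in the word list.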

In [48]:
# Split the dialogue text into a list of tokens.
# strip() removes leading/trailing spaces so the split
# does not produce empty strings at either end
dialogue_list = re.split(r" +", dialogue_txt.strip())

In [50]:
# Preview the first few items in the dialogue list
dialogue_list[:50]

['do',
 'you',
 'know',
 'what',
 'this',
 'is',
 'all',
 'about',
 'do',
 'you',
 'know',
 'why',
 'were',
 'here',
 'to',
 'be',
 'out',
 'this',
 'is',
 'out',
 'and',
 'out',
 'is',
 'one',
 'of',
 'the',
 'single',
 'most',
 'enjoyable',
 'experiences',
 'of',
 'life',
 'people',
 'did',
 'you',
 'ever',
 'hear',
 'people',
 'talking',
 'about',
 'we',
 'should',
 'go',
 'out',
 'this',
 'is',
 'what',
 'theyre',
 'talking',
 'about']
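One detail of splitting on runs of spaces: if the text starts or ends with a space, `re.split` returns an empty string at that end, and that empty string would later count as a "word". A small sketch on a made-up padded string, compared with the built-in `str.split()` called with no argument:

```python
import re

# Made-up cleaned text with padding at both ends
text = "  did you ring   i cant find him "

# Splitting on runs of spaces leaves an empty string at each padded end
regex_tokens = re.split(r" +", text)

# str.split() with no argument splits on whitespace runs and drops empties
plain_tokens = text.split()
```

Here `regex_tokens` begins and ends with `''`, while `plain_tokens` contains only the seven words.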

## Unique the list

In [13]:
# Create an empty list where we'll store exactly one
# of each token
unique_tokens = []

# For each token in the dialogue list,
for token in dialogue_list:
    # if (and only if) that token is not yet in the unique list
    if token not in unique_tokens:
        # add it to the unique list
        unique_tokens.append(token)

# Sort the list, so it'll be easier
# to spot duplicates if they exist
unique_tokens.sort()
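The loop above checks membership against a list, which means every token is compared with every unique token found so far. On a few hundred thousand tokens that can be slow. A minimal sketch of the same loop with a helper set for the membership test (the function name `unique_sorted` and the sample data are made up for illustration):

```python
def unique_sorted(tokens):
    """Return each distinct token once, in sorted order."""
    seen = set()   # fast membership checks
    unique = []    # the tokens we keep
    for token in tokens:
        if token not in seen:
            seen.add(token)
            unique.append(token)
    unique.sort()
    return unique

# Made-up sample tokens with repeats
sample_tokens = ["out", "and", "out", "is", "one", "and"]
result = unique_sorted(sample_tokens)
# result: ['and', 'is', 'one', 'out']
```

The output is the same as the list-only loop; only the lookup is faster.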

In [51]:
# Preview a slice of the sorted unique tokens
print(unique_tokens[300:320])

['agenda', 'agent', 'agents', 'ages', 'aggravate', 'aggravated', 'aggravation', 'aggressive', 'agh', 'aghast', 'aghh', 'agian', 'aging', 'agitated', 'agitator', 'ago', 'agonised', 'agony', 'agr', 'agrabah']


## Write to file

In [332]:
# Uncomment to write one token per line to a text file
# with open('seinfeld_tokens.txt', 'w') as f:
#     for token in unique_tokens:
#         f.write(token + '\n')
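To check that the one-token-per-line format round-trips, here is a small sketch that writes a stand-in list to a temporary directory and reads it back. The three tokens and the use of `tempfile` are assumptions made for the example so that it leaves no file behind; the explicit `encoding` argument is an optional addition.

```python
import os
import tempfile

# Stand-in for the real token list
tokens = ["agenda", "agent", "agents"]

# Write to a temporary directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "seinfeld_tokens.txt")

    # One token per line
    with open(path, "w", encoding="utf-8") as f:
        for token in tokens:
            f.write(token + "\n")

    # Read the file back and strip the newlines
    with open(path, encoding="utf-8") as f:
        loaded = f.read().splitlines()
```

After this runs, `loaded` holds the same three tokens in the same order.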

## The shorthand way