In [1]:
import pickle
import pandas as pd
import re

In [2]:
cr_df = pd.read_pickle('../pickle_jar/CritRole.pkl')
cr_df.head()
#I don't like the periods in these column names so I'm changing them

Unnamed: 0,name,text,ts.h,ts.m,ts.s,episode
0,MATT,We've got some cool stuff to talk about. First...,0,0,15,1
1,LAURA,As in nnnn.,0,1,21,1
2,MATT,"Nnnn. But yeah, so they're going to be a long-...",0,1,24,1
3,TRAVIS,(vomiting noises) Would you like some?,0,2,4,1
4,MATT,"So yeah. I'm super excited to have that, guys....",0,2,7,1


In [3]:
cr_df = cr_df.rename(columns={'ts.h': 'ts_h', 'ts.m': 'ts_m', 'ts.s': 'ts_s'})
#cr_df.head()

In [4]:
cr_df["ts_h"] = cr_df["ts_h"].astype(str)
cr_df["ts_h"] = cr_df["ts_h"].str.zfill(2)
cr_df["ts_m"] = cr_df["ts_m"].astype(str)
cr_df["ts_m"] = cr_df["ts_m"].str.zfill(2)
cr_df["ts_s"] = cr_df["ts_s"].astype(str)
cr_df["ts_s"] = cr_df["ts_s"].str.zfill(2)

Written out in a typical timestamp format, values like 0:0:15 and 0:2:7 are unpleasant to look at and inconsistent, so I added a leading 0 wherever the number in a column has fewer than 2 digits. That way I can join the columns into the much easier to read 00:00:15 and 00:02:07 format.
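As a side note, the same padding can be sketched in one pass over the three columns. This is only an illustration on a made-up frame (`toy` and its values are assumptions, not rows from the pickle), and it should produce the same strings as the cell-by-cell version above.

```python
import pandas as pd

# Toy frame with the same three timestamp columns (values are made up).
toy = pd.DataFrame({"ts_h": [0, 0, 3], "ts_m": [0, 2, 46], "ts_s": [15, 7, 13]})

ts_cols = ["ts_h", "ts_m", "ts_s"]
# Cast every column to string, then left-pad each one with zeros to width 2.
padded = toy[ts_cols].astype(str).apply(lambda col: col.str.zfill(2))
# Join the three padded pieces row by row with a colon.
toy["timestamp"] = padded.agg(":".join, axis=1)
print(toy["timestamp"].tolist())
```

Because `zfill` only pads, a value that already has two or more digits is left unchanged.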

This will be helpful when I want to look at crosstalk or overlapping speech. Note that crosstalk is the language I've heard used in production spaces for overlapping speech and I know it's something they do their best to avoid *because* they are a production being recorded and aired. Too much speech at once would be difficult for viewers to parse and follow the ongoing story. It happens, though, as I explore later on in this notebook.

In [5]:
cr_df['timestamp'] = cr_df[['ts_h', 'ts_m', 'ts_s']].agg(':'.join, axis=1)

In [6]:
cr_df = cr_df.drop(columns=['ts_h', 'ts_m', 'ts_s'])
cr_df.head()
#much better looking in my opinion

Unnamed: 0,name,text,episode,timestamp
0,MATT,We've got some cool stuff to talk about. First...,1,00:00:15
1,LAURA,As in nnnn.,1,00:01:21
2,MATT,"Nnnn. But yeah, so they're going to be a long-...",1,00:01:24
3,TRAVIS,(vomiting noises) Would you like some?,1,00:02:04
4,MATT,"So yeah. I'm super excited to have that, guys....",1,00:02:07


So I want to start exploring what the data looks like. I noticed immediately at the top of the data that not every instance in the "text" column is actual speech. How many instances of this do we have?

In [7]:
test = cr_df[cr_df['text'].str.contains(r'\)')]
test

Unnamed: 0,name,text,episode,timestamp
3,TRAVIS,(vomiting noises) Would you like some?,1,00:02:04
5,SAM,(laughter),1,00:02:42
7,SAM,(laughter),1,00:04:35
14,LAURA,"(yells) Okay. Yes, I do. It's a new campaign, ...",1,00:06:13
16,LAURA,"So we released our teaser on socials, and ever...",1,00:06:24
...,...,...,...,...
2689,MATT,"(chuckling) You say that now, wait till next w...",99,03:46:13
2702,MATT,(cheering),99,03:46:36
2714,TRAVIS,Both sides would be like. (groaning),99,03:47:11
2725,MATT,(cheering),99,03:47:52


So the thing is that text in parentheses isn't speech! It shouldn't be tokenized or included in data about speech events. I also spotted some lines marked [no audio], so I'll look into those too.

--talked to Na-Rae, new goal to separate into two columns, one for "sounds" and one for "speech"
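To make the two-column goal concrete, here is a minimal sketch on a toy frame. The column names `sounds` and `speech` and the pattern `paren` are placeholders of mine, and the pattern assumes parentheses are never nested, which may not hold for the whole transcript.

```python
import pandas as pd

# Three sample lines in the style of the transcript.
toy = pd.DataFrame({"text": ["(vomiting noises) Would you like some?",
                             "(laughter)",
                             "As in nnnn."]})

paren = r"\([^)]*\)"  # one parenthesised chunk; nesting is NOT handled

# Collect every parenthesised chunk in a row and join them into one string.
toy["sounds"] = toy["text"].str.findall(paren).str.join(" ")
# Remove the chunks from the text, then collapse leftover whitespace.
toy["speech"] = (toy["text"].str.replace(paren, "", regex=True)
                 .str.replace(r"\s+", " ", regex=True)
                 .str.strip())
print(toy)
```

Rows without any parentheses end up with an empty string in `sounds`, and sound-only rows end up with an empty string in `speech`.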

In [8]:
allnames = cr_df['name'].value_counts()
print(len(allnames))
allnames[0:15]

394


MATT        111729
LAURA        59987
SAM          52150
TRAVIS       49852
MARISHA      49278
LIAM         41441
TALIESIN     40530
ASHLEY       17660
MICA          1548
ASHLY         1407
BRIAN         1324
ALLY           733
ALL            670
DEBORAH        540
KHARY          491
Name: name, dtype: int64

My next thought was about the number of speakers (and while I'm at it, how many speech events our speakers have). Unsurprisingly, our DM has the most speech events in general, but it's the most by so much! 

The next few are interesting.... Sam and Laura both tend to be the ones doing opening show announcements (what's new with the company, new merch etc) and ad reads, so I'm not surprised they're higher than the rest of the cast. Laura is an interesting one.... she was gone for a number of episodes this campaign for maternity leave but she also played a character who was a sort of hyperactive blabbermouth, so she must have really made up for lost time. 

Finally, Ashley having the fewest lines also makes sense to me. She's a regular in the show Blindspot and was filming out of state, and missed a big chunk of episodes. On top of that, her character was a bit stoic and awkward, so she wasn't much of a talker even when she was present.

In [9]:
allnames[-20:]

MARISHA, SAM, ASHLY               1
SAM, TALIESIN, MARISHA            1
SAM, MARISHA, ASHLY               1
ASHLY, LIAM, SAM                  1
II                                1
LIAM, MARISHA, SAM                1
TALIESIN, MARISHA, LAURA          1
MARISHA, SUMALEE                  1
SUMALEE, MATT                     1
LIAM, ASHLY                       1
ASHLY, MARISHA, SUMALEE           1
NILA                              1
AUDIENCE, BRIAN                   1
ASHLY, MATT, AUDIENCE             1
MARISHA, MATT, AUDIENCE           1
ASHLY, LIAM, AUDIENCE             1
ASHLY, AUDIENCE                   1
TRAVIS, MARISHA, MATT             1
MARISHA, TALIESIN, LAURA          1
LAURA, TALIESIN, LIAM, MARISHA    1
Name: name, dtype: int64

In [10]:
cr_df[cr_df['name']=='SAM, TALIESIN, MARISHA']

Unnamed: 0,name,text,episode,timestamp
1654,"SAM, TALIESIN, MARISHA",Whispers?,26,02:11:37


In [11]:
cr_df[cr_df['name']=='TALIESIN, MARISHA, LAURA']

Unnamed: 0,name,text,episode,timestamp
74,"TALIESIN, MARISHA, LAURA",Not it.,31,00:11:09
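Rows like the two above carry several names in one label, which is one way the transcript marks people speaking at the same time. A possible way to flag them, sketched on toy rows (the `speakers` and `n_speakers` columns are my own hypothetical additions, and I am assuming the names are always separated by a comma and a space):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["TRAVIS", "SAM, TALIESIN, MARISHA", "TALIESIN, MARISHA, LAURA", "LAURA"],
    "text": ["Gross.", "Whispers?", "Not it.", "As in nnnn."],
})

# Split each label into a list of individual speaker names.
toy["speakers"] = toy["name"].str.split(", ")
# Count how many names each label holds.
toy["n_speakers"] = toy["speakers"].str.len()
# Keep only the lines attributed to more than one person.
shared = toy[toy["n_speakers"] > 1]
print(shared[["name", "text", "n_speakers"]])
```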


I wanted to explore and take a look at an instance of overlapping speech I remembered from when I was watching (highlights, these episodes are too long for me) this campaign. At the end of episode 2, two characters (played by our talkative Laura and Sam) think they're onto the perpetrator of some crime, and are in this moment hurling accusations at that person (Ornna). 

They overlap *a lot* in this crazy buzz of a moment, and you can more or less see this in the transcripts, but it's not always 100% clear. These transcriptionists did a really thorough job though.

In [12]:
ep2 = cr_df[cr_df['episode']=='2']
ep2[-130:-120]

Unnamed: 0,name,text,episode,timestamp
2564,MATT,"At which point, the flap of the tent opens up ...",2,04:01:13
2565,LAURA,It's Ornna!,2,04:01:17
2566,SAM,"Ornna, you have a lot of explaining to do!",2,04:01:19
2567,MARISHA,Shut up!,2,04:01:20
2568,SAM,You have a lot of explaining to do!,2,04:01:21
2569,MARISHA,Shut up!,2,04:01:22
2570,SAM,We've talked to Toya. She knows it's you who d...,2,04:01:24
2571,LAURA,We know you guys are in a fight all the time.,2,04:01:28
2572,SAM,You're the one behind the whole plot! You did ...,2,04:01:30
2573,LAURA,You!,2,04:01:32


a link to the moment of chaos, just for comparison

https://youtu.be/MPELLuQXVcE?si=13too39HcHj5Cfow&t=14475

### Splitting test run

Now back to this goal of splitting (sounds) and speech into different columns.

In [13]:
#cr_df[cr_df['text'].str.contains('^\(.*\)')]

In [14]:
#cr_df[cr_df['text'].str.contains('^\(.*\) $')]

This works for at least the lines that are *just* sound and not combined sound/speech. I have to do some more reading about moving these specific lines to a new column for sounds while maintaining their index positions. 
I think I can do this by adding the new column and mapping its values onto the rows of the existing text column, but I'll take the time to experiment with that more this coming week to be sure of it. 
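One way this could work, as a rough sketch only: build a boolean mask for the sound-only lines and use it to fill a new column and blank the old one, so every row keeps its position. The frame `toy` and the name `only_sound` are hypothetical, and the trailing space in each line mirrors the pattern used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["MATT", "SAM", "SAM"],
    "text": ["So yeah. ", "(laughter) ", "Wash your stubbly face. "],
})

# True where the whole line is a single (sound) followed by a space.
only_sound = toy["text"].str.contains(r"^\(.*\) $")

# where() keeps the value where the mask is True and writes "" elsewhere.
toy["nonspeech"] = toy["text"].where(only_sound, "")
# mask() does the opposite: it blanks the rows where the mask is True.
toy["text"] = toy["text"].mask(only_sound, "")
print(toy)
```

Since both columns are derived from the same mask, nothing is reordered and no row is dropped.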

I'm not exactly ready to dig heavy into the stats yet but I want to do a little look for our progress report at least

In [15]:
ashley = cr_df[cr_df['name']=='ASHLEY']
matt = cr_df[cr_df['name']=='MATT']
marisha = cr_df[cr_df['name']=='MARISHA']
taliesin = cr_df[cr_df['name']=='TALIESIN']
sam = cr_df[cr_df['name']=='SAM']
liam = cr_df[cr_df['name']=='LIAM']
laura = cr_df[cr_df['name']=='LAURA']
travis = cr_df[cr_df['name']=='TRAVIS']

#ashley

In [16]:
women = [ashley, marisha, laura]
women_df = pd.concat(women)
#women_df

Glad I tried this. I keep running into spots that I need to read up on in detail and that are important for my later work. My vision here is to be able to split cr_df by these groupings, but that may not be as simple as this. 

Thinking further... maybe from these sub-dfs I can just add a "gender" column and remap it onto the main df; that should get the job done at least. Another grouping I would *maybe* like to do is a catchall "other" for the guests and the instances of multiple speakers, so that it doesn't look like such a mess as in the chart below. 

In [17]:
#allnames.plot.pie()

Wow. Looking at the numbers was one thing, but this really puts it into another perspective. Matt really has a lot of talking to do. 

In [18]:
cr_df.shape

(434052, 4)

In [19]:
#the index was making some of my work strange - I'm going to preserve it because I'm sure there's a reason it loaded in like this
#but to make my life easier I'm going to use the pandas-provided index for my work
cr_df = cr_df.reset_index()

In [20]:
cr_df.tail()

Unnamed: 0,index,name,text,episode,timestamp
434047,2722,MATT,"We're going to get through this together, guys.",99,03:47:47
434048,2723,SAM,Wash your stubbly face.,99,03:47:49
434049,2724,TRAVIS,Gross.,99,03:47:51
434050,2725,MATT,(cheering),99,03:47:52
434051,2726,TRAVIS,(dramatic orchestral music),99,03:47:57


Let's get rid of those pesky (sound effects) lines

#### Test run

In [21]:
xnonspeech_df = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'^\(.*\) $')]) #create a df of text lines that contain ONLY (sounds)
#xnonspeech_df.rename(columns={'text': 'nonspeech'}, inplace=True) #rename text to nonspeech
xnonspeech_df = pd.DataFrame(xnonspeech_df['text']) #narrow that down to only the nonspeech column, that's our focus
xnonspeech_df.head()

Unnamed: 0,text
5,(laughter)
7,(laughter)
54,(laughter)
86,(nervous laughter)
135,(laughter)


I was running into an error when trying to add this non speech data to a new column because the df sizes were not equal. 
This tutorial here was a good starting point for learning about this - mapping a series into a new column so it autofills any empty lines with NaN values.<br>
https://www.statology.org/length-of-values-does-not-match-length-of-index/
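A small illustration of the difference, using a made-up four-row frame (`base` and `subset` are hypothetical names): assigning a plain list of the wrong length raises the error, while assigning a Series lines the values up by index label and leaves NaN everywhere else.

```python
import pandas as pd

base = pd.DataFrame({"text": ["So yeah. ", "(laughter) ", "Gross. ", "(cheering) "]})

# A shorter Series that still carries the original labels 1 and 3.
subset = base.loc[[1, 3], "text"]

# A list has no index, so its length must match the frame exactly.
try:
    base["nonspeech"] = ["(laughter) ", "(cheering) "]
    list_failed = False
except ValueError:
    list_failed = True

# A Series is aligned on its index; rows 0 and 2 receive NaN.
base["nonspeech"] = subset
n_missing = int(base["nonspeech"].isna().sum())

# Swap the NaN values for empty strings.
base["nonspeech"] = base["nonspeech"].fillna("")
print(base)
```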

This is a continuation of my initial nonspeech experiment, commented out to keep it from interfering with my later work. It was a nice proof of concept that the Series and fillna method should work.

In [22]:
#xnonspeech = pd.Series(xnonspeech_df['text']) #create series from simple nonspeech data
#xnonspeech

In [23]:
#cr_df['nonspeech']=xnonspeech

In [24]:
#let's check with Sam who I spotted earlier laughing a good bit
#cr_df[cr_df['name']=='SAM']

Let's replace those same simple (sounds) in our base text column with an empty string and look at Sam again to confirm it's working.

In [25]:
#cr_df['text'] = cr_df['text'].str.replace(r'^\(.*\) $', '', regex=True).astype('str')

In [26]:
#cr_df[cr_df['name']=='SAM']

In [27]:
#cr_df['nonspeech'] = cr_df['nonspeech'].fillna('') #get rid of those ugly NaN values

In [28]:
#cr_df[cr_df['name']=='SAM']

This looks really good! Now we just have to replicate this process multiple times with other instances of (sounds). Sometimes it's combined with actual speech information. Sometimes it's leading and sometimes it's trailing. Sometimes it's \[sounds\] in the case of sound effects in the room and not information like laughter or singing. 

Because of this, and because trying to keep the square brackets in the data was complicating things unnecessarily, I'm going to be erasing anything in the square brackets. It's not about an utterance that any specific person is making and is information about the room around them, so I'm okay with not keeping it stored.

After, we'll redo this process above for anything in (), combine all of the series that we make which will preserve the index information, add that full series to the new "nonspeech" column, fill NaN with empty strings, and get moving along! 
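The combining step described here could look roughly like the sketch below. The three small Series stand in for the ones built later in the notebook, and their contents are invented. Joining duplicates per index label is my own precaution in case the same row shows up in more than one Series, not something the plan above requires.

```python
import pandas as pd

# Stand-ins for the per-pattern Series; each keeps the original row labels.
only = pd.Series({1: "(laughter)"})
lead = pd.Series({0: "(vomiting noises)", 3: "(yells)"})
mult = pd.Series({3: "(claps) (laughter)"})

# Stack them end to end; index labels are preserved, not renumbered.
combined = pd.concat([only, lead, mult])
# If a label occurs more than once, join its pieces into one string.
combined = combined.groupby(level=0).agg(" ".join)

toy = pd.DataFrame({"text": ["Would you like some?", "", "As in nnnn.", "Okay."]})
# Index alignment puts each string on its own row; the rest become NaN.
toy["nonspeech"] = combined
toy["nonspeech"] = toy["nonspeech"].fillna("")
print(toy)
```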

### Splitting and cleaning

#### Deleting or treating nonverbal information

In [29]:
#cr_df[cr_df['text'].str.contains(r'♪')]

In [30]:
cr_df['text'] = cr_df['text'].str.replace(r'♪', r'', regex=True).astype('str')
cr_df.text[433078]
#gets rid of instances where music notes indicate song

" D&D, D&D   You got your staffs   You got your swords   You got your stuffs  (laughter)  And you got your invisible wand   It's D&D, D&D   D&D  "

In [31]:
#cr_df[cr_df['text'].str.contains(r'.* \[inaudible\] .*')]

In [32]:
inaudible_df = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'.* \[inaudible\] .*')])

In [33]:
inaudible_df['text'] = inaudible_df['text'].str.replace(r'.* (\[inaudible\]) .*', r'\1', regex=True).astype('str')
#inaudible_df

In [34]:
inaudible = pd.Series(inaudible_df['text'])
#inaudible

In [35]:
cr_df['inaudible_speech'] = inaudible

In [36]:
#cr_df[cr_df['text'].str.contains(r'\[.*\]')]

In [37]:
cr_df['text'] = cr_df['text'].str.replace(r'\[.*\]', r'', regex=True).astype('str')

In [38]:
print(cr_df.text[5770])
print(cr_df.text[430418])

Like . 
Fucking-- another . 
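One thing that may be worth checking with this replacement: `.*` is greedy, so on a line with two bracketed notes it would also swallow any speech sitting between them. I have not verified whether such lines exist in this data. The example string below is invented purely to show the difference between the greedy pattern and a narrower one.

```python
import pandas as pd

# Invented line with two separate [notes] and speech in between.
s = pd.Series(["Like [sound one] this, and [sound two] that."])

# Greedy: matches from the first "[" all the way to the last "]".
greedy = s.str.replace(r"\[.*\]", "", regex=True)
# Narrow: each match stops at the first "]" it meets.
narrow = s.str.replace(r"\[[^\]]*\]", "", regex=True)

print(repr(greedy[0]))
print(repr(narrow[0]))
```

If every affected line only has a single bracketed note, both patterns behave the same.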


In [39]:
cr_df.text[10558] #these are sounds on a soundboard so I'm going to individually alter this line which stuck out during my work

'Whoa! (digital lasers firing) Laser sword! (grunts) (horse neighs) (glass shatters) (cat cries) '

In [40]:
cr_df[cr_df['text'].str.contains(r'.* \(digital lasers firing\) .*')]

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
10558,11,SAM,Whoa! (digital lasers firing) Laser sword! (gr...,102,00:01:46,


In [41]:
cr_df['text'] = cr_df['text'].str.replace(r'(.* )\(digital lasers firing\)( .*!* )\(.*\)', r'\1 \2', regex=True).astype('str')

In [42]:
cr_df.text[10558]

'Whoa!   Laser sword! (grunts) (horse neighs) (glass shatters)  '

In [43]:
cr_df[cr_df['text'].str.contains(r'.* \(horse neighs\) .*')]

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
10558,11,SAM,Whoa! Laser sword! (grunts) (horse neighs) (...,102,00:01:46,


In [44]:
cr_df['text'] = cr_df['text'].str.replace(r'(.* )\(.*\) \(horse neighs\) .*', r'\1', regex=True).astype('str')

In [45]:
cr_df.text[10558]

'Whoa!   Laser sword! '

In [46]:
cr_df[cr_df['text'].str.contains(r'\(t\)')]
#another line I discovered was a problem - this isn't a sound, it's a pun transcribed to show the meaning (not) as a play on
#a character's name (nott) - since we ditched the sound effects in [brackets] I'll transform this so my (sound) catchers skip it

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
323307,197,LAURA,I have Not(t).,69,00:19:27,
323309,199,LAURA,I have Not(t).,69,00:19:32,
323333,223,LAURA,I will not(t).,69,00:21:14,


In [47]:
cr_df['text'] = cr_df['text'].str.replace(r'\(t\)', r'[t]', regex=True).astype('str')

In [48]:
print(cr_df.text[323307])
print(cr_df.text[323309])
print(cr_df.text[323333])

I have Not[t]. 
I have Not[t]. 
I will not[t]. 


In [49]:
cr_df[cr_df['text'].str.contains(r'\(s\)')]
#a similar instance (must have been an enemy with multiple heads) 

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
356636,3063,TRAVIS,Bring me his head(s).,78,04:05:16,


In [50]:
cr_df['text'] = cr_df['text'].str.replace(r'\(s\)', r'[s]', regex=True).astype('str')

In [51]:
print(cr_df.text[356636])

Bring me his head[s]. 


In [52]:
cr_df[cr_df['text'].str.contains(r'\(c\)\(3\)')]
#another single specific instance about their 501(c)(3) nonprofit

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
145738,1183,MATT,"The Critical Role Foundation is a nonprofit, 5...",141.01,01:38:10,


In [53]:
cr_df['text'] = cr_df['text'].str.replace(r'\(c\)\(3\)', r'[c][3]', regex=True).astype('str')

In [54]:
print(cr_df.text[145738])

The Critical Role Foundation is a nonprofit, 501[c][3] charitable organization that partners with other charitable organizations and endeavors. 


In [55]:
cr_df[cr_df['text'].str.contains(r'.* \(water bubbling\) \(thunder rumbling\).*')]
#these ones are the transcription of the show's theme song and so I'm excluding that as well

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
13602,149,MATT,(random cast noises) (water bubbling) (thunder...,103.0,00:09:37,
30174,114,LIAM,(screams) (water bubbling) (thunder rumbling) ...,108.0,00:09:16,
45446,101,SAM,Boomerang! (water bubbling) (thunder rumbling)...,113.0,00:07:21,
62151,68,MATT,(clamoring) (water bubbling) (thunder rumbling...,118.0,00:07:15,
95066,109,SAM,(yells) (water bubbling) (thunder rumbling) (e...,127.0,00:06:38,
126330,61,ALL,Yeah! (water bubbling) (thunder rumbling) (exp...,136.0,00:06:05,
129921,104,TALIESIN,(bleating) (water bubbling) (thunder rumbling)...,137.0,00:10:29,
155737,86,SAM,Yeah! (water bubbling) (thunder rumbling) (exp...,141.03,00:06:37,
165315,59,TRAVIS,(growling) (water bubbling) (thunder rumbling)...,141.0,00:08:10,
342425,104,TRAVIS,(growling) (water bubbling) (thunder rumbling)...,75.0,00:08:32,


In [56]:
cr_df['text'] = cr_df['text'].str.replace(r'(.* )\(water bubbling\) \(thunder rumbling\).*', r'\1', regex=True).astype('str')

In [57]:
cr_df[cr_df['text'].str.contains(r'.* \(roll the dice\) .*')]
#more instances of the theme song (without sound effects transcribed)

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
80565,51,TRAVIS,(water bubbling) (thunder rumbling) (explosion...,123.0,00:06:08,
132482,94,MATT,(water bubbling) (thunder rumbles) (explosion)...,138.0,00:09:36,
155740,89,ASHLEY,They got magic and flair They got falchions...,141.03,00:07:27,
161765,51,ASHLEY,They got magic and flair They got falchions...,141.04,00:23:24,
417747,109,MATT,(roaring) (water bubbling) (booming) (booming)...,95.0,00:10:13,


In [58]:
cr_df['text'] = cr_df['text'].str.replace(r'.* \(roll the dice\) .*', r'', regex=True).astype('str')

In [59]:
cr_df[cr_df['text'].str.contains(r'.* \(water gurgling\) .*')]
#and another theme song instance with "water gurgling" instead of "bubbling"

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
290147,63,LIAM,Everybody sit back and be safe! (water gurglin...,59,00:07:09,


In [60]:
cr_df['text'] = cr_df['text'].str.replace(r'(.* )\(water gurgling\) .*', r'\1', regex=True).astype('str')

In [61]:
#cr_df[cr_df['text'].str.contains(r'.* You got the perfect warlock')]
#more instances of theme song, inconsistently transcribed and at times with a brief spoken line before it
#drop the song, keep the intro line

In [62]:
cr_df['text'] = cr_df['text'].str.replace(r'(.* )You got the perfect warlock.*$', r'\1', regex=True).astype('str')

In [63]:
cr_df[cr_df['text'].str.contains(r'.* You got the perfect warlock')]
#this one example was not captured for some reason, so treat it again

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
433076,1751,MATT,(bright upbeat music) You got the perfect war...,99,02:13:50,


In [64]:
cr_df['text'] = cr_df['text'].str.replace(r'(.* )You got the perfect warlock.*$', r'\1', regex=True).astype('str')

In [65]:
#cr_df[cr_df['text'].str.contains(r'.*?BRIAN \(V\.O\.\):.*')]

In [66]:
cr_df['text'] = cr_df['text'].str.replace(r'.*?BRIAN \(V\.O\.\):.*', r'', regex=True).astype('str')

In [67]:
#cr_df[cr_df['text'].str.contains(r'.*Last time,? on Talks Machina.*')]

In [68]:
cr_df['text'] = cr_df['text'].str.replace(r'.*Last time,? on Talks Machina.*', r'', regex=True).astype('str')

In [69]:
#cr_df[cr_df['text'].str.contains(r'.*TRAVIS \(V\.O\.\)*')]

In [70]:
cr_df['text'] = cr_df['text'].str.replace(r'.*TRAVIS \(V\.O\.\)*', r'', regex=True).astype('str')

#### Unmatched brackets

Once I got my full code working, I found some lines with unmatched brackets. I'm going to fix those one by one so that they will be appropriately ignored or captured by all the treatments below.
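A quick way to list candidates for this kind of fix, sketched on toy lines: count opening and closing parentheses per row and keep the rows where the two counts differ. This only catches a mismatch in number, so a line like `) (` would slip through.

```python
import pandas as pd

toy = pd.Series(['(Metal clanging "Hello?"', "(pigeon cooing ", "(laughter) ", "Gross. "])

# Count each kind of parenthesis per line.
opens = toy.str.count(r"\(")
closes = toy.str.count(r"\)")

# Lines where the counts disagree have at least one unmatched bracket.
unbalanced = toy[opens != closes]
print(unbalanced)
```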

In [71]:
cr_df[cr_df['text'].str.contains(r'\(Metal clanging .*')]

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
397668,1022,MATT,"(Metal clanging ""Hello?""",9,01:37:05,


In [72]:
cr_df['text'] = cr_df['text'].str.replace(r'\(Metal clanging (.*)', r'(Metal clanging) \1', regex=True).astype('str')

In [73]:
cr_df[cr_df['text'].str.contains(r'\(pigeon cooing $')]

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
400946,2067,LAURA,(pigeon cooing,90,02:51:29,


In [74]:
cr_df['text'] = cr_df['text'].str.replace(r'\(pigeon cooing $', r'(pigeon cooing)', regex=True).astype('str')

In [75]:
cr_df[cr_df['text'].str.contains(r'\(as Essek .*')]

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
433803,2478,MATT,"(as Essek ""Outwardly, at least.""",99,03:32:38,


In [76]:
cr_df['text'] = cr_df['text'].str.replace(r'\(as Essek (.*)', r'(as Essek) \1', regex=True).astype('str')

In [77]:
cr_df[cr_df['text'].str.contains(r'.* impressive!" \(')]

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
183223,696,MATT,There's some claps in the back rooms. You see ...,2,01:14:49,


In [78]:
cr_df['text'] = cr_df['text'].str.replace(r'(.* impressive!") \(', r'\1', regex=True).astype('str')

In [79]:
cr_df[cr_df['text'].str.contains(r'\($')]

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
81562,1048,TRAVIS,(,123,00:57:20,


In [80]:
cr_df['text'] = cr_df['text'].str.replace(r'\($', r'', regex=True).astype('str')

Okay, so that gets rid of a lot of the strange one-off cases I've run into in my hours and hours (and hours) of working on the regex to get these columns split into two parts. After that treatment, I'm going to re-run the sample nonspeech process from above to single out examples from the df that ONLY contain (sounds) and no other text.

#### Creating nonspeech column

In [81]:
#cr_df[cr_df['text'].str.contains('^\(.*\) $')]

In [82]:
nonspeech_df = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'^\(.*\) $')])

In [83]:
nonspeech = pd.Series(nonspeech_df['text'])
nonspeech

5                          (laughter) 
7                          (laughter) 
54                         (laughter) 
86                 (nervous laughter) 
135                        (laughter) 
                      ...             
433938                     (laughter) 
433990                    (screaming) 
434027                     (cheering) 
434050                     (cheering) 
434051    (dramatic orchestral music) 
Name: text, Length: 11421, dtype: object

In [84]:
cr_df['text'] = cr_df['text'].str.replace(r'^\(.*\) $', r'', regex=True).astype('str')

In [85]:
#cr_df[cr_df['text'].str.contains('^ \(.*\) $')]

In [86]:
morens_df = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'^ \(.*\) $')])

In [87]:
more_nonspeech = pd.Series(morens_df['text'])
more_nonspeech

10572                                           (laughter) 
90540                                          (explosion) 
123092             (Addams Family theme music)  (laughter) 
149350     (bright music) (bright music) (bright music) ...
162236                                          (laughter) 
162528                                          (laughter) 
162798                                          (laughter) 
162804                                          (laughter) 
162993                                          (laughter) 
282834                                 (slamming) (crying) 
322213                     (Beginning of "Immigrant Song") 
323561                       (Girl from Ipanema)  (laughs) 
333502                                    (dramatic music) 
347831                                          (laughter) 
421679                                          (laughter) 
Name: text, dtype: object

In [88]:
cr_df['text'] = cr_df['text'].str.replace(r'^ \(.*\) $', r'', regex=True).astype('str')

In [89]:
cr_df['text'] = cr_df['text'].str.replace(r'\(ominous music\) \(unlocking\) .*$', r'', regex=True).astype('str')
#this is another theme/opening to the show, not part of the show

No Series too small for this DF - a lot of my regex wasn't capturing complex long examples with many brackets, so I'm just going to narrow them down smaller and smaller. Even if it's just a single line series, it'll be a big help.
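Before writing a separate pattern for each number of brackets, it can help to see how many lines fall into each bucket. The sketch below counts parenthesised chunks per line on toy data; `paren` is a placeholder pattern of mine and assumes no nested parentheses.

```python
import pandas as pd

toy = pd.Series(["(yawn) (grunt) (thumping) (yawn) ",
                 "Whoa! (grunts) Laser sword! ",
                 "(claps) (laughter) ",
                 "Gross. "])

paren = r"\([^)]*\)"  # one parenthesised chunk; nesting is NOT handled

# Number of (sound) chunks on each line.
counts = toy.str.count(paren)
# How many lines have 0, 1, 2, ... chunks.
distribution = counts.value_counts().sort_index()
# Pull out only the lines with exactly four chunks.
four_only = toy[counts == 4]

print(distribution)
print(four_only)
```

The same `counts == n` filter can be reused for any bucket size, which makes it easy to see which cases are worth their own treatment.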

In [90]:
cr_df[cr_df['text'].str.contains(r'^.* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\),*?.*$')]

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
401526,2647,LAURA,Spiritual Weapon! (crappy beatboxing) (laughte...,90,03:48:39,


In [91]:
eight_bracks = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'^.* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\),*?.*$')])

In [92]:
eight_bracks['text'] = eight_bracks['text'].str.replace(r'^.* (\(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\),*?.*$)', r'\1', regex=True).astype('str')
eight_bracks

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
401526,2647,LAURA,(crappy beatboxing) (laughter) (Sam vocalizing...,90,03:48:39,


In [93]:
eight = pd.Series(eight_bracks['text'])
eight

401526    (crappy beatboxing) (laughter) (Sam vocalizing...
Name: text, dtype: object

In [94]:
cr_df['text'] = cr_df['text'].str.replace(r'(^.* )\(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\),*?.*$', r'\1', regex=True).astype('str')

In [95]:
cr_df[cr_df['text'].str.contains(r'^.* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\),*?.*$')]

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
125849,2714,MATT,"I'll say it hits, and you watch as two of the ...",135,03:23:54,


In [96]:
six_bracks = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'^.* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\),*?.*$')])

In [97]:
six_bracks['text'] = six_bracks['text'].str.replace(r'^.* (\(.*\)).* (\(.*\)).* (\(.*\)).* (\(.*\)).* (\(.*\)).* (\(.*\)),*?.*$', r'\1 \2 \3 \4 \5 \6', regex=True).astype('str')
six_bracks

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
125849,2714,MATT,(snarling) (giggling) (laughter) (snarling) (s...,135,03:23:54,


In [98]:
six = pd.Series(six_bracks['text'])
six

125849    (snarling) (giggling) (laughter) (snarling) (s...
Name: text, dtype: object

In [99]:
cr_df['text'] = cr_df['text'].str.replace(r'(^.* )\(.*\)(.* )\(.*\)(.* )\(.*\)(.* )\(.*\)(.* )\(.*\)(.* )\(.*\),*?.*$', r'\1 \2 \3 \4 \5 \6', regex=True).astype('str')

In [100]:
cr_df[cr_df['text'].str.contains(r'^.* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\),*?.*$')]

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
70836,1833,MATT,You watch as Frumpkin runs past (cat yowling) ...,12,02:29:05,
206106,1253,MATT,"As you hold the goodberry out, which, with you...",27,02:21:19,


In [101]:
five_bracks = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'^.* \(.*\).* \(.*\).* \(.*\).* \(.*\).* \(.*\),*?.*$')])

In [102]:
five_bracks['text'] = five_bracks['text'].str.replace(r'^.* (\(.*\)).* (\(.*\)).* (\(.*\)).* (\(.*\)).* (\(.*\)),*?.*$', r'\1 \2 \3 \4 \5', regex=True).astype('str')
five_bracks

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
70836,1833,MATT,(cat yowling) (dog barking) (guard stuttering)...,12,02:29:05,
206106,1253,MATT,(bird song) (bird whistle) (bird whistle) (bir...,27,02:21:19,


In [103]:
five = pd.Series(five_bracks['text'])
five

70836     (cat yowling) (dog barking) (guard stuttering)...
206106    (bird song) (bird whistle) (bird whistle) (bir...
Name: text, dtype: object

In [104]:
cr_df['text'] = cr_df['text'].str.replace(r'(^.* )\(.*\)(.* )\(.*\)(.* )\(.*\)(.* )\(.*\)(.* )\(.*\)(,*?.*$)', r'\1 \2 \3 \4 \5 \6', regex=True).astype('str')

In [105]:
cr_df.text[426372]

'But you\'re now feel the (pounding on door) "(nervous laugh) I\'m just enjoyed some air, don\'t worry." (laughter) (pounding on door) "Son of a..." '

In [106]:
#cr_df[cr_df['text'].str.contains('^.*\(.*\).*\(.*\).* \(.*\).* \(.*\),*?.*$')]

In [107]:
four_bracks = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'^.*\(.*\).*\(.*\).* \(.*\).* \(.*\),*?.*$')])

In [108]:
four_bracks['text'] = four_bracks['text'].str.replace(r'^.*(\(.*\)).*(\(.*\)).* (\(.*\)).* (\(.*\)),*?.*$', r'\1 \2 \3 \4', regex=True).astype('str')
four_bracks

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
103082,2500,TRAVIS,"(crunch) (explosion) (scared yelling, explosio...",129,03:37:13,
117239,2144,NARRATOR,(light ethereal music) (whimsical adventure mu...,133,02:18:04,
122325,3071,MATT,(arrow impact) (laughter) (arrow vibrating) (l...,134,03:28:15,
140486,1773,MATT,(ugh) (guttural animal snorting) (whip crack) ...,14,02:53:34,
184949,2422,MATT,(yawn) (grunt) (thumping) (yawn),2,03:49:28,
198310,1801,MATT,(sighs) (sighs) (crash) (distressed noises),24,02:34:51,
198330,1821,MATT,(cranking) (metallic song) (metallic song) (ch...,24,02:38:33,
238453,2692,MATT,(laughing) (whooshing) (arrow firing) (landing),39,03:30:10,
275459,1760,MATT,(snarl) (escalating snarling) (fssh) (pop),52,03:06:45,
302724,1756,SAM,"(""Spring"" by Vivaldi) (""Spring"" by Vivaldi) (""...",62,02:15:07,


In [109]:
four = pd.Series(four_bracks['text'])
four

103082    (crunch) (explosion) (scared yelling, explosio...
117239    (light ethereal music) (whimsical adventure mu...
122325    (arrow impact) (laughter) (arrow vibrating) (l...
140486    (ugh) (guttural animal snorting) (whip crack) ...
184949                     (yawn) (grunt) (thumping) (yawn)
198310          (sighs) (sighs) (crash) (distressed noises)
198330    (cranking) (metallic song) (metallic song) (ch...
238453      (laughing) (whooshing) (arrow firing) (landing)
275459           (snarl) (escalating snarling) (fssh) (pop)
302724    ("Spring" by Vivaldi) ("Spring" by Vivaldi) ("...
347682      (as Bat 1) (as Bat 2) (as Bat 1) (as Steve Bat)
365149                (whooshes) (pings) (thuds) (growling)
377568      (shouting) (normal voice) (shouting) (laughter)
382572       (gags) (laughs) (Western music) (upbeat music)
426372    (pounding on door) (nervous laugh) (laughter) ...
433090    (kiss) (upbeat jazz music) (intense electronic...
Name: text, dtype: object

In [110]:
cr_df['text'] = cr_df['text'].str.replace(r'(^.*)\(.*\)(.*)\(.*\)(.* )\(.*\)(.* )\(.*\)(,*?.*$)', r'\1 \2 \3 \4 \5', regex=True).astype('str')

In [111]:
cr_df.text[25410]

'(laughter) It does last for one minute, so the wings suddenly (poof). '

In [112]:
#cr_df[cr_df['text'].str.contains('^.*\(.*\).* \(.*\).* "*\(.*\)"*,*?.*$')]

In [113]:
three_bracks = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'^.*\(.*\).* \(.*\).* "*\(.*\)"*,*?.*$')])

In [114]:
three_bracks['text'] = three_bracks['text'].str.replace(r'^.*(\(.*\)).* (\(.*\)).* "*(\(.*\))"*,*?.*$', r'\1 \2 \3', regex=True).astype('str')
#three_bracks

In [115]:
three = pd.Series(three_bracks['text'])
three

8783                     (whizzing) (explosion) (screaming)
15411     (breathing heavily) (whooshes) (surprised laug...
17487                    (whooshing) (whooshing) (flapping)
25846                     (as Marius) (as Orly) (as Marius)
41274               (Luc struggle grunts) (as Pumat) (poof)
                                ...                        
417661                      (grunting) (gasping) (laughter)
417680         (yelling) (digital bleeping) (dial up tones)
426365                      (gasping) (laughter) (laughter)
427855                 (exhale) (heavy wind gust) (rolling)
432836                      (laughs) (explosion) (laughter)
Name: text, Length: 85, dtype: object

In [116]:
cr_df['text'] = cr_df['text'].str.replace(r'(^.*)\(.*\)(.* )\(.*\)(.* "*)\(.*\)("*,*?.*$)', r'\1 \2 \3 \4', regex=True).astype('str')

In [117]:
cr_df.text[25410]

'(laughter) It does last for one minute, so the wings suddenly (poof). '

In [118]:
#cr_df[cr_df['text'].str.contains('^.*\(.*\).*\(.*\),?.*$')]

#captures cases of multiple brackets in as many different formations as possible

In [119]:
mult_bracks = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'^.*\(.*\).*\(.*\),?.*$')])

In [120]:
mult_bracks['text'] = mult_bracks['text'].str.replace(r'^.*(\(.*\)).*(\(.*\)),?.*$', r'\1 \2', regex=True).astype('str')
mult_bracks

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
2031,434,MATT,(metal scraping) (click),10,00:41:41,
3416,1819,MATT,(bursting) (skittering),10,02:49:35,
3592,1995,SAM,(whispers) (louder),10,03:03:33,
4610,360,TRAVIS,(whirring) (laughter),100,00:26:43,
4874,624,TRAVIS,(claps) (laughter),100,00:38:47,
...,...,...,...,...,...,...
430888,3794,MATT,(heavy breathing) (frilling),98,04:47:29,
431266,4172,MATT,(splashing) (sloughing),98,05:04:48,
432074,749,MATT,(groaning) (thudding),99,00:58:31,
432199,874,TRAVIS,(pained) (laughter),99,01:06:13,


In [121]:
mult = pd.Series(mult_bracks['text'])
mult

2031          (metal scraping) (click)
3416           (bursting) (skittering)
3592               (whispers) (louder)
4610             (whirring) (laughter)
4874                (claps) (laughter)
                      ...             
430888    (heavy breathing) (frilling)
431266         (splashing) (sloughing)
432074           (groaning) (thudding)
432199             (pained) (laughter)
432566           (groaning) (laughter)
Name: text, Length: 780, dtype: object

In [122]:
cr_df['text'] = cr_df['text'].str.replace(r'(^.*)\(.*\)(.*)\(.*\)(,?.*$)', r'\1 \2 \3', regex=True).astype('str')
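To make the two-bracket pass concrete, here is a small sketch of the extraction and removal patterns applied to one made-up line with the standard-library `re` module. The sample sentence is an assumption for illustration, not a row from the dataset.

```python
import re

line = 'You hear (metal scraping) and then (click) it opens.'  # made-up example

# Extraction: the capture groups wrap the parentheticals, so only they survive.
extracted = re.sub(r'^.*(\(.*\)).*(\(.*\)),?.*$', r'\1 \2', line)

# Removal: the capture groups wrap the speech, so the parentheticals are dropped.
cleaned = re.sub(r'(^.*)\(.*\)(.*)\(.*\)(,?.*$)', r'\1 \2 \3', line)

print(extracted)        # (metal scraping) (click)
print(cleaned.split())  # ['You', 'hear', 'and', 'then', 'it', 'opens.']
```

Because the replacement string adds its own spaces between groups, the cleaned text can contain runs of several spaces where a parenthetical used to be.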

Okay, so now that the lines with multiple split-up (sound) instances are taken care of and cleared out, let's see what other instances of (sounds) show up in cr_df.

In [123]:
#cr_df[cr_df['text'].str.contains('^\(.*\) ')]

#captures all examples of (sounds) followed by speech

In [124]:
lead_bracks = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'^\(.*\) ')])

In [125]:
#lead_bracks.head()

In [126]:
#find every line that starts with a bracketed string followed by more text
#replace the full line with just the bracketed string
lead_bracks = lead_bracks['text'].str.replace(r'(^\(.*\)) .*$', r'\1', regex=True).astype('str')

In [127]:
lead = pd.Series(lead_bracks)
lead

3                     (vomiting noises)
14                              (yells)
100       (high-pitched Cockney accent)
101               (light German accent)
328                             (gasps)
                      ...              
433822                       (groaning)
433907                        (howling)
433929                        (yelling)
433989                        (yelling)
434014                      (chuckling)
Name: text, Length: 5881, dtype: object

In [128]:
cr_df['text'] = cr_df['text'].str.replace(r'^\(.*\) ', '', regex=True).astype('str')
#cr_df.head()
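The order of the passes matters here because `.*` is greedy: on a line with more than one parenthetical, the leading-bracket pattern runs through to the last `) ` it can find. This sketch shows the effect on made-up strings. `PER_BRACKET` is a hypothetical alternative, not something used in this notebook.

```python
import re

GREEDY = r'^\(.*\) '          # pattern used in the cell above
PER_BRACKET = r'^\([^)]*\) '  # hypothetical alternative: stop at the first ')'

single = '(yells) Hello there'         # made-up example
double = '(laughs) Okay (sighs) fine'  # made-up example

single_greedy = re.sub(GREEDY, '', single)
double_greedy = re.sub(GREEDY, '', double)
double_per = re.sub(PER_BRACKET, '', double)

print(single_greedy)  # Hello there
print(double_greedy)  # fine
print(double_per)     # Okay (sighs) fine
```

With the greedy pattern, the speech between the two parentheticals ("Okay") is removed along with them, which is why the multi-bracket lines had to be handled before this step.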

In [129]:
trail_bracks = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'.* \(.*\)+ $')])

#finds all instances of normal speech followed by (sounds)

In [130]:
trail_bracks = trail_bracks['text'].str.replace(r'.* (\(.*\)+ $)', r'\1', regex=True).astype('str')
trail = pd.Series(trail_bracks)
trail

16        (cheering) 
137       (laughter) 
235       (laughter) 
357         (laughs) 
364       (laughter) 
             ...     
433630    (laughter) 
433870    (chuckles) 
433879    (chuckles) 
433880     (poofing) 
434039    (groaning) 
Name: text, Length: 3657, dtype: object

In [131]:
cr_df['text'] = cr_df['text'].str.replace(r'\(.*\) $', '', regex=True).astype('str')
cr_df.text[16]

'So we released our teaser on socials, and everybody was like, "I want that as a poster!" So we listened to you, and we made it a poster! '
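Both trailing-bracket patterns end in a literal space before `$`, which matches how these transcript lines end. This sketch, on made-up strings, shows the extraction and removal, plus what happens to a line without that trailing space. The `\s*$` variant is a hypothetical, more tolerant alternative and not part of the notebook.

```python
import re

line = 'So we made it a poster! (cheering) '     # ends with a space
no_space = 'So we made it a poster! (cheering)'  # made-up variant, no trailing space

noise = re.sub(r'.* (\(.*\)+ $)', r'\1', line)  # keep only the parenthetical
speech = re.sub(r'\(.*\) $', '', line)          # keep only the speech

# Without the trailing space the notebook's pattern does not match at all.
untouched = re.sub(r'\(.*\) $', '', no_space)

# Hypothetical variant that accepts any amount of trailing whitespace.
tolerant = re.sub(r'\(.*\)\s*$', '', no_space)

print(repr(noise))     # '(cheering) '
print(repr(speech))    # 'So we made it a poster! '
print(untouched == no_space)  # True
print(repr(tolerant))  # 'So we made it a poster! '
```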

In [132]:
#middle brackets (text then (sound) then more text)
cr_df[cr_df['text'].str.contains(r'^.* \(.*\).*$')]

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
113,113,LIAM,"Well, we-- (sighs)-- discussed coming to a big...",1,00:24:04,
358,358,TALIESIN,I am taking this in. All right. (sighs) I'm go...,1,00:50:05,
438,438,TALIESIN,You are all the most charming people I've met ...,1,00:55:58,
532,532,LAURA,"Yeah, you should have seen him. He disguised h...",1,01:02:15,
1104,1104,MATT,"You see his legs shaking as he stands, his che...",1,02:20:09,
...,...,...,...,...,...,...
433078,1753,ASHLEY,"D&D, D&D You got your staffs You got your...",99,02:14:34,
433083,1758,ASHLEY,And thank you for being there for us and to an...,99,02:15:53,
433101,1776,BRIAN,"The next morning I was like, ""Hey man..."" (as ...",99,02:27:19,
433641,2316,MATT,"""Anyway."" (imitating whooshing) And he's gone.",99,03:19:29,


In [133]:
mid_bracks = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'^.* \(.*\).*$')])
#mid_bracks

In [134]:
mid_bracks = mid_bracks['text'].str.replace(r'^.* (\(.*\)).*$', r'\1', regex=True).astype('str')
#mid_bracks

In [135]:
mid = pd.Series(mid_bracks)
mid

113                     (sighs)
358                     (sighs)
438                     (sighs)
532             (clicks tongue)
1104                 (exclaims)
                  ...          
433078               (laughter)
433083                   (kiss)
433101              (as Travis)
433641    (imitating whooshing)
433954               (laughter)
Name: text, Length: 3848, dtype: object

In [136]:
cr_df['text'] = cr_df['text'].str.replace(r'(^.*) \(.*\)(.*$)', r'\1 \2', regex=True).astype('str')
cr_df.text[113]

"Well, we-- -- discussed coming to a bigger town. It's going to be a little more difficult now. You can't go-- it was easier on outskirts, it was easier in farms, but we can't do that here. "

In [137]:
cr_df[cr_df['text'].str.contains(r'.* "*\(.*\)"*.*$')]

#getting more and more niche with the series again
#previous samples didn't capture (sounds) that had a quotation flush on one or both sides

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
3374,1777,MATT,"""I'm sorry, I'm just-- oh god,"" and he looks o...",10,02:46:04,
3484,1887,MATT,"Okay, as he's doing this, he smiles. You can s...",10,02:54:22,
8744,1197,MATT,"""That's cool!"" He attempts it and immediately,...",101,01:28:03,
18033,1594,MATT,You see a blade dangling from one hand and a s...,104,02:06:31,
21230,1379,LIAM,"Every day, it's got to be like, ""(sighs) Anoth...",105,01:48:46,
...,...,...,...,...,...,...
424011,3147,MATT,"Now the arms go lax for a second she goes, ""(g...",96,03:42:12,
424027,3163,MATT,"She stands up, Corrin is like, ""(groans) Breat...",96,03:43:55,
424526,3662,MATT,"She's like, ""(giggles)""",96,04:20:19,
424603,3739,MATT,It's wearing a very faint bit of tattered clot...,96,04:25:49,


In [138]:
quote_bracks = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'.* "*\(.*\)"*.*$')])

In [139]:
quote_bracks['text'] = quote_bracks['text'].str.replace(r'.* "*(\(.*\))"*.*$', r'\1', regex=True).astype('str')
quote_bracks

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
3374,1777,MATT,(panicked screams),10,02:46:04,
3484,1887,MATT,(sniff),10,02:54:22,
8744,1197,MATT,(inhales),101,01:28:03,
18033,1594,MATT,(gasping),104,02:06:31,
21230,1379,LIAM,(sighs),105,01:48:46,
...,...,...,...,...,...,...
424011,3147,MATT,(gasping),96,03:42:12,
424027,3163,MATT,(groans),96,03:43:55,
424526,3662,MATT,(giggles),96,04:20:19,
424603,3739,MATT,(panting),96,04:25:49,


In [140]:
quotes = pd.Series(quote_bracks['text'])
quotes

3374      (panicked screams)
3484                 (sniff)
8744               (inhales)
18033              (gasping)
21230                (sighs)
                 ...        
424011             (gasping)
424027              (groans)
424526             (giggles)
424603             (panting)
425633           (exertions)
Name: text, Length: 72, dtype: object

In [141]:
cr_df['text'] = cr_df['text'].str.replace(r'(.* "*)\(.*\)("*.*$)', r'\1 \2', regex=True).astype('str')

In [142]:
cr_df[cr_df['text'].str.contains(r'^.*?\(.*\),.*$')]

#same as the quotes but with a comma following the brackets

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
35022,2523,MATT,"""(sighs),"" and then just heads off.",109,03:51:29,
335134,942,MATT,"(whooshing), picking them up into the air, you...",72,01:11:21,


In [143]:
comma_bracks = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'^.*?\(.*\),.*$')])
#comma_bracks.head()

In [144]:
comma_bracks = comma_bracks['text'].str.replace(r'^.*?(\(.*\)),.*$', r'\1', regex=True).astype('str')
comma_bracks

35022         (sighs)
335134    (whooshing)
Name: text, dtype: object

In [145]:
comma = pd.Series(comma_bracks)
comma

35022         (sighs)
335134    (whooshing)
Name: text, dtype: object

In [146]:
cr_df['text'] = cr_df['text'].str.replace(r'(^.*?)\(.*\)(,.*$)', r'\1 \2', regex=True).astype('str')

In [147]:
cr_df[cr_df['text'].str.contains(r'"\(.*\)"')]

#a bunch of quoted brackets were still not captured; rather than fix the previous regex, I just made another series

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
6762,2512,MATT,"""(grunts)""",100,02:42:49,
7231,2981,MATT,"""(laughs)"" She punches you in the shoulder and...",100,03:13:21,
7240,2990,MATT,"""(grunts)""",100,03:13:51,
16022,2569,MATT,"""(gasping)""",103,03:08:44,
31849,1789,MATT,"""(sighs)""",108,03:13:46,
...,...,...,...,...,...,...
426312,1599,MATT,"""(laughs)""",97,02:55:02,
426691,1978,MATT,"""(grunting)""",97,03:30:41,
426979,2266,MATT,"""(scoffs)""",97,03:52:59,
427881,787,MATT,"""(skeptical noise)""",98,01:15:21,


In [148]:
more_quotes = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'"\(.*\)"')])

In [149]:
more_quotes['text'] = more_quotes['text'].str.replace(r'^.*"(\(.*\))"? .*$', r'\1', regex=True).astype('str')
more_quotes

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
6762,2512,MATT,(grunts),100,02:42:49,
7231,2981,MATT,(laughs),100,03:13:21,
7240,2990,MATT,(grunts),100,03:13:51,
16022,2569,MATT,(gasping),103,03:08:44,
31849,1789,MATT,(sighs),108,03:13:46,
...,...,...,...,...,...,...
426312,1599,MATT,(laughs),97,02:55:02,
426691,1978,MATT,(grunting),97,03:30:41,
426979,2266,MATT,(scoffs),97,03:52:59,
427881,787,MATT,(skeptical noise),98,01:15:21,


In [150]:
more = pd.Series(more_quotes['text'])
more

6762               (grunts)
7231               (laughs)
7240               (grunts)
16022             (gasping)
31849               (sighs)
                ...        
426312             (laughs)
426691           (grunting)
426979             (scoffs)
427881    (skeptical noise)
429435         (screeching)
Name: text, Length: 73, dtype: object

In [151]:
cr_df['text'] = cr_df['text'].str.replace(r'(^.*")\(.*\)("? .*$)', r'\1 \2', regex=True).astype('str')

In [152]:
#cr_df[cr_df['text'].str.contains(r'\(.*\)\.?"')]

#some odd straggler cases here

In [153]:
odd_ones = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'\(.*\)\.?"')])

In [154]:
odd_ones['text'] = odd_ones['text'].str.replace(r'(\(.*\))\.?".*', r'\1', regex=True).astype('str')
odd_ones

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
33522,1023,MATT,(laughs),109.0,01:17:17,
50085,1012,MATT,(hiccups),114.0,01:26:26,
50092,1019,MATT,(sniffs),114.0,01:27:23,
106551,358,MATT,(nervous laugh),130.0,00:42:46,
115837,742,MATT,(sighs),133.0,00:44:04,
116060,965,MATT,(shouting),133.0,00:59:24,
116065,970,MATT,(grunts),133.0,00:59:39,
126756,487,MATT,(sighs),136.0,00:35:00,
140460,1747,MATT,(heh heh),14.0,02:50:55,
157662,2011,MATT,(laughs),141.03,01:36:13,


In [155]:
odds = pd.Series(odd_ones['text'])
odds

33522            (laughs)
50085           (hiccups)
50092            (sniffs)
106551    (nervous laugh)
115837            (sighs)
116060         (shouting)
116065           (grunts)
126756            (sighs)
140460          (heh heh)
157662           (laughs)
166354            (yawns)
284688           (snorts)
335488            (sighs)
341857           (grunts)
354082         (chuckles)
356316            (sighs)
382414            (sighs)
394550         (chuckles)
416864           (groans)
427897            (spits)
Name: text, dtype: object

In [156]:
cr_df['text'] = cr_df['text'].str.replace(r'\(.*\)(\.?".*)', r'\1', regex=True).astype('str')

In [157]:
cr_df[cr_df['text'].str.contains(r'\(.*\)')]

#one last one, just to scrape up what all our other passes failed to capture

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
171722,2076,MATT,"--(sploosh) swings out at you-- yes, it does.",15,02:48:12,
228675,1853,MATT,(door slamming sound). It does so.,35,02:38:50,
383461,3154,SAM,Castle Ooh--(gurgling noises)?,85,04:19:47,
400946,2067,LAURA,(pigeon cooing),90,02:51:29,


In [158]:
final_bracks = pd.DataFrame(cr_df[cr_df['text'].str.contains(r'\(.*\)')])

In [159]:
final_bracks['text'] = final_bracks['text'].str.replace(r'.*(\(.*\)).*', r'\1', regex=True).astype('str')
final_bracks

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech
171722,2076,MATT,(sploosh),15,02:48:12,
228675,1853,MATT,(door slamming sound),35,02:38:50,
383461,3154,SAM,(gurgling noises),85,04:19:47,
400946,2067,LAURA,(pigeon cooing),90,02:51:29,


In [160]:
final = pd.Series(final_bracks['text'])
final

171722                (sploosh)
228675    (door slamming sound)
383461        (gurgling noises)
400946          (pigeon cooing)
Name: text, dtype: object

In [161]:
cr_df['text'] = cr_df['text'].str.replace(r'(.*)\(.*\)(.*)', r'\1 \2', regex=True).astype('str')

#### FINALLY

That was a whole weekend of work :) It's so satisfying now that it works, though: I confirmed there are no lingering parentheses in the text column of the df.
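A check along these lines can confirm that nothing is left. It is a sketch on a made-up series, not the real `cr_df`. The first pattern is the paired-bracket test used throughout this notebook. The second is an extra, hypothetical check for a stray single parenthesis, which the paired pattern would not flag.

```python
import pandas as pd

# Made-up stand-in for cr_df['text'] after cleaning.
texts = pd.Series(['Would you like some? ', 'Castle Ooh-- ?', 'stray ( paren'])

# Rows that still contain a complete (…) pair.
paired = texts.str.contains(r'\(.*\)', regex=True)

# Rows that contain any parenthesis character, matched or not.
any_paren = texts.str.contains(r'[()]', regex=True)

print(int(paired.sum()))     # 0
print(int(any_paren.sum()))  # 1
```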

The next steps are to concatenate all of the series, check for duplicate indexes to verify that every line was captured and treated only once, add the combined series as a new column, fill the NaN rows with an empty string, and re-pickle this DF into a new version called "split".

In [162]:
sounds = [nonspeech, more_nonspeech, eight, six, five, four, three, mult, lead, trail, mid, comma, quotes, more, odds, final]

In [163]:
noises = pd.concat(sounds)
noises

5                   (laughter) 
7                   (laughter) 
54                  (laughter) 
86          (nervous laughter) 
135                 (laughter) 
                  ...          
427897                  (spits)
171722                (sploosh)
228675    (door slamming sound)
383461        (gurgling noises)
400946          (pigeon cooing)
Name: text, Length: 25878, dtype: object

In [164]:
duplicates = noises.index[noises.index.duplicated(keep=False)]

In [165]:
print(duplicates)

Int64Index([], dtype='int64')
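The empty index means no row label appears in more than one series, so each label maps to at most one annotation when the combined series is attached to the DataFrame. This toy sketch shows the check and the index alignment it protects. The frame, series, and values are made up for illustration.

```python
import pandas as pd

# Made-up stand-ins for cr_df and two of the extracted series.
df = pd.DataFrame({'text': ['Hello.', 'Would you like some?', 'Okay.', 'Fine.']})
lead = pd.Series({1: '(vomiting noises)'}, name='text')
trail = pd.Series({3: '(laughter)'}, name='text')

noises = pd.concat([lead, trail])

# Any label listed here would have been captured by more than one pass.
dupes = noises.index[noises.index.duplicated(keep=False)]
print(len(dupes))  # 0

# Assignment aligns on the index: rows without an annotation become NaN,
# which fillna then turns into empty strings.
df['nonspeech'] = noises
df['nonspeech'] = df['nonspeech'].fillna('')
print(df['nonspeech'].tolist())  # ['', '(vomiting noises)', '', '(laughter)']
```

The alignment is by index label, not by position, which is why the extracted series need to keep the original row labels from `cr_df`.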


In [166]:
cr_df['nonspeech'] = noises
cr_df.head()

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech,nonspeech
0,0,MATT,We've got some cool stuff to talk about. First...,1,00:00:15,,
1,1,LAURA,As in nnnn.,1,00:01:21,,
2,2,MATT,"Nnnn. But yeah, so they're going to be a long-...",1,00:01:24,,
3,3,TRAVIS,Would you like some?,1,00:02:04,,(vomiting noises)
4,4,MATT,"So yeah. I'm super excited to have that, guys....",1,00:02:07,,


In [167]:
cr_df['nonspeech'] = cr_df['nonspeech'].fillna('')
cr_df['inaudible_speech'] = cr_df['inaudible_speech'].fillna('')

In [168]:
cr_df.head()

Unnamed: 0,index,name,text,episode,timestamp,inaudible_speech,nonspeech
0,0,MATT,We've got some cool stuff to talk about. First...,1,00:00:15,,
1,1,LAURA,As in nnnn.,1,00:01:21,,
2,2,MATT,"Nnnn. But yeah, so they're going to be a long-...",1,00:01:24,,
3,3,TRAVIS,Would you like some?,1,00:02:04,,(vomiting noises)
4,4,MATT,"So yeah. I'm super excited to have that, guys....",1,00:02:07,,
