# Peep Show TV Series Script Corpus Creation Process

### Process raw scripts for seasons 1 to 9 into a table and review

#### Thomas Bowe

In [1]:
library(tidyverse)
library(reshape2)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.4     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘reshape2’

The following object is masked from ‘package:tidyr’:

    smiths


In [2]:
# Import test data and preview
ep <- readLines("../input/peep-show-raw-transcripts/s01e03.txt") %>%
  tibble(content = .) %>%
  subset(content != "")
head(ep, 20)

content
<chr>
On The Pull
Season 1
Episode 3
03/10/2003
On The Pull is the third episode of the British sitcom Peep Show. It first aired on 3rd October 2003. It was written by Jesse Armstrong and Sam Bain. It was directed by Jeremy Wooding. Lines in parentheses indicate internal monologues spoken by the characters via voice-over.
Transcript
[Mark is shopping in the supermarket]
"Mark: (Life's all pain. Pain, rejection and gloom. Why do we even pretend there's anything other than a yawning blankness at the heart of-) [Picks up a can of soup] (Hey! 33% extra free. I am doing excellent shopping. My depressed state of mind means I am being even more frugal than normal.)"
[Jez joins him]
Jeremy: Mark!


In [3]:
# Categorise each line of the transcript
ep <- mutate(
  ep,
  text_type = case_when(
    str_detect(content, ":") ~ "Script",
    str_detect(content, "\\[") ~ "Directions",
    TRUE ~ "Other"
  )
)
# Preview data
head(ep, 10)
# Check the lines categorised as "Other"; only the six header lines are expected
head(subset(ep, text_type == "Other"))

content,text_type
<chr>,<chr>
On The Pull,Other
Season 1,Other
Episode 3,Other
03/10/2003,Other
On The Pull is the third episode of the British sitcom Peep Show. It first aired on 3rd October 2003. It was written by Jesse Armstrong and Sam Bain. It was directed by Jeremy Wooding. Lines in parentheses indicate internal monologues spoken by the characters via voice-over.,Other
Transcript,Other
[Mark is shopping in the supermarket],Directions
"Mark: (Life's all pain. Pain, rejection and gloom. Why do we even pretend there's anything other than a yawning blankness at the heart of-) [Picks up a can of soup] (Hey! 33% extra free. I am doing excellent shopping. My depressed state of mind means I am being even more frugal than normal.)",Script
[Jez joins him],Directions
Jeremy: Mark!,Script


content,text_type
<chr>,<chr>
On The Pull,Other
Season 1,Other
Episode 3,Other
03/10/2003,Other
On The Pull is the third episode of the British sitcom Peep Show. It first aired on 3rd October 2003. It was written by Jesse Armstrong and Sam Bain. It was directed by Jeremy Wooding. Lines in parentheses indicate internal monologues spoken by the characters via voice-over.,Other
Transcript,Other
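
Because the ":" check runs first, a stage direction that happens to contain a colon would be labelled "Script". One possible way to reduce this, sketched here in base R, is to test for a wholly bracketed line before looking for a colon. The helper name `classify_line` and the third sample line are hypothetical and not part of the original notebook.

```r
# Hypothetical helper (not in the original notebook): classify transcript lines,
# checking for wholly bracketed stage directions before looking for a colon.
classify_line <- function(content) {
  is_direction <- grepl("^\\[.*\\]$", content)          # whole line wrapped in [ ]
  has_speaker  <- grepl(":", content, fixed = TRUE)     # "Name: speech" pattern
  ifelse(is_direction, "Directions",
         ifelse(has_speaker, "Script", "Other"))
}

sample_lines <- c(
  "[Jez joins him]",
  "Jeremy: Mark!",
  "[Mark reads a sign: No Entry]",   # made-up direction containing a colon
  "Transcript"
)
classify_line(sample_lines)
```

Script lines start with a speaker name rather than a bracket, so under this assumption they are not caught by the direction check.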


In [4]:
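# Extract episode information from the header lines
episode_name <- ep[1, 1] %>% pull
episode_date <- ep[4, 1] %>% pull
# Character 9 of "Episode N" and character 8 of "Season N" hold the single-digit number
episode_number <- ep[3, 1] %>% pull %>% substr(9, 9)
season_number <- ep[2, 1] %>% pull %>% substr(8, 8)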
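
The fixed character positions work while every season and episode number is a single digit. As a hedged alternative, the sketch below pulls out the first run of digits and parses the air date as day/month/year (the preview describes 03/10/2003 as 3rd October 2003). The helper `extract_number` and the `parsed_*` names are hypothetical.

```r
# Header lines as they appear in the preview above
season_line  <- "Season 1"
episode_line <- "Episode 3"
date_line    <- "03/10/2003"

# Hypothetical helper: return the first run of digits in a string as an integer
extract_number <- function(x) {
  as.integer(regmatches(x, regexpr("[0-9]+", x)))
}

parsed_season  <- extract_number(season_line)
parsed_episode <- extract_number(episode_line)
# Assumption: dates in the transcripts are written day/month/year
parsed_date    <- as.Date(date_line, format = "%d/%m/%Y")

parsed_season
parsed_episode
parsed_date
```

Storing the numbers as integers and the date as a `Date` would also make later sorting and filtering less error-prone than working with character columns.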

In [5]:
# Add the episode information as new columns then preview
ep <- mutate(
  ep,
  season_number = season_number,
  episode_number = episode_number,
  episode_name = episode_name,
  episode_date = episode_date
) %>% tail(-6)
head(ep)

content,text_type,season_number,episode_number,episode_name,episode_date
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
[Mark is shopping in the supermarket],Directions,1,3,On The Pull,03/10/2003
"Mark: (Life's all pain. Pain, rejection and gloom. Why do we even pretend there's anything other than a yawning blankness at the heart of-) [Picks up a can of soup] (Hey! 33% extra free. I am doing excellent shopping. My depressed state of mind means I am being even more frugal than normal.)",Script,1,3,On The Pull,03/10/2003
[Jez joins him],Directions,1,3,On The Pull,03/10/2003
Jeremy: Mark!,Script,1,3,On The Pull,03/10/2003
Mark: Hey Jeremy.,Script,1,3,On The Pull,03/10/2003
Jeremy: You do realize tinned food is just for crackheads and wars. [Picks up a bottle of extra-virgin olive oil],Script,1,3,On The Pull,03/10/2003


In [6]:
# Create a character column and assign the name of that character appropriately;
# direction lines have no speaker, so they are labelled "Directions"
ep <- cbind(ep, colsplit(ep$content, ":", names = c("character", "content2"))) %>%
  select(-content) %>%
  rename(content = content2) %>%
  mutate(
    content = ifelse(text_type == "Directions", character, content),
    character = ifelse(text_type == "Directions", "Directions", character)
  )
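
To make the intended split explicit, here is a minimal base R sketch that separates the speaker from the speech at the first colon only, so that colons inside the speech are preserved. This is an illustration of the idea under that assumption, not the notebook's implementation; `split_first_colon` and the second sample line are hypothetical.

```r
# Hypothetical helper: split "Name: speech" at the first colon only
split_first_colon <- function(x) {
  # Position of the first colon, or -1 when there is none
  pos <- as.integer(regexpr(":", x, fixed = TRUE))
  has_colon <- pos > 0
  speaker <- ifelse(has_colon, substr(x, 1, pos - 1), x)
  speech  <- ifelse(has_colon, substr(x, pos + 1, nchar(x)), "")
  data.frame(
    character = trimws(speaker),
    content = trimws(speech),
    stringsAsFactors = FALSE
  )
}

split_result <- split_first_colon(c(
  "Jeremy: Mark!",
  "Mark: Two things: soup and bread.",   # made-up line with a second colon
  "[Jez joins him]"
))
split_result
```

Lines without a colon keep their full text in `character` and get an empty `content`, which is why the direction rows need the follow-up swap.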

In [7]:
# Create a list of all individual transcript files for looping
dir <- "../input/peep-show-raw-transcripts"
# `pattern` is a regular expression, not a glob, so match names ending in ".txt"
filenames <- list.files(path = dir, pattern = "\\.txt$")
datalist <- list()

In [8]:
# Loop through each file, building one table per episode
for (i in filenames) {
  filepath <- file.path(dir, i)
  # Read the file once and drop empty lines
  extract <-
    readLines(filepath) %>%
    tibble(content = .) %>%
    subset(content != "")
  # The first four lines hold the episode name, season, episode number and air date
  episode_name <- extract[1, 1] %>% pull
  season_number <- extract[2, 1] %>% pull
  episode_number <- extract[3, 1] %>% pull
  episode_date <- extract[4, 1] %>% pull
  datalist[[i]] <-
    extract %>%
    mutate(
      text_type = case_when(
        str_detect(content, ":") ~ "Script",
        str_detect(content, "\\[") ~ "Directions",
        TRUE ~ "Other"
      ),
      season_number = season_number,
      episode_number = episode_number,
      episode_name = episode_name,
      episode_date = episode_date
    ) %>%
    # Drop the six header lines
    tail(-6)
}
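
The per-file steps in the loop could also be wrapped in a function and applied with `lapply`. The sketch below uses base R and two tiny made-up files in a temporary directory, so it runs without the real transcripts. The helper `parse_episode`, the file contents and the assumed header layout (name, season, episode, date, summary, "Transcript") are illustrative only.

```r
# Hypothetical helper: read one transcript file into a data frame.
# Assumes six header lines come before the script itself.
parse_episode <- function(filepath) {
  raw <- readLines(filepath)
  raw <- raw[raw != ""]                 # drop empty lines
  data.frame(
    content        = tail(raw, -6),     # everything after the six header lines
    episode_name   = raw[1],
    season_number  = raw[2],
    episode_number = raw[3],
    episode_date   = raw[4],
    stringsAsFactors = FALSE
  )
}

# Two tiny made-up files, for illustration only
tmp <- tempfile("transcripts")
dir.create(tmp)
writeLines(
  c("Demo One", "Season 1", "Episode 1", "01/01/2000", "Summary.",
    "Transcript", "[A room]", "", "Mark: Hello."),
  file.path(tmp, "s01e01.txt")
)
writeLines(
  c("Demo Two", "Season 1", "Episode 2", "08/01/2000", "Summary.",
    "Transcript", "Jeremy: Hi."),
  file.path(tmp, "s01e02.txt")
)

# Parse every file and stack the results into one table
files <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)
episodes <- do.call(rbind, lapply(files, parse_episode))
episodes
```

Keeping the parsing in one function means the single-episode test and the full run share the same code path.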

In [9]:
# First create the base table
concat <-
  bind_rows(datalist, .id = "column_label") %>%
  mutate(episode_index = cumsum(ifelse(
    episode_number != lag(episode_number) |
      is.na(lag(episode_number)), 1, 0
  )))
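
The episode index increments whenever the episode label changes from one row to the next. A toy base R example with made-up labels shows the mechanics. Keying on season as well as episode is an assumption added here for robustness, not something the notebook does.

```r
# Made-up labels for five rows spanning three episodes
season_number  <- c("Season 1", "Season 1", "Season 1", "Season 1", "Season 2")
episode_number <- c("Episode 1", "Episode 1", "Episode 2", "Episode 2", "Episode 1")

# Assumption: combine season and episode so that adjacent episodes sharing an
# episode label across seasons are still told apart
key <- paste(season_number, episode_number)

# TRUE on the first row and wherever the key differs from the previous row
changed <- c(TRUE, key[-1] != key[-length(key)])

# A running count of changes gives a sequential episode index
episode_index <- cumsum(changed)
episode_index
```

For these five rows the index is 1, 1, 2, 2, 3.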

In [10]:
# Split the content field to get the speaking character's name and re-bind
peepshow <-
  select(concat,-column_label) %>%
  cbind(colsplit(concat$content, ":", names = c("character", "content2"))) %>%
  select(-content) %>%
  rename(content = content2) %>%
  mutate(
    content = ifelse(text_type == "Directions", trimws(character), trimws(content)),
    character = ifelse(text_type == "Directions", "Directions", trimws(character))
  ) %>%
  mutate(
    character = str_replace_all(character, "Johnson", "Alan"),
    content = str_replace_all(content, "knuckle head", "knucklehead")) %>%
  select(-text_type) %>%
  # Create an episode index
  mutate(episode_index = cumsum(ifelse(
    episode_number != lag(episode_number) |
      is.na(lag(episode_number)), 1, 0
  ))) %>%
  # Create a unique ID
  rowid_to_column("index") %>%
  # Reorder fields
  select(index,episode_index,season_number,episode_number,episode_name,episode_date,character,content)

In [11]:
# Preview
head(peepshow)

,index,episode_index,season_number,episode_number,episode_name,episode_date,character,content
,<int>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,1,1,Season 1,Episode 1,Warring Factions,19/09/2003,Directions,"[Inside Jez's room, where he his playing a musical track he has composed]"
2,2,1,Season 1,Episode 1,Warring Factions,19/09/2003,Jeremy,"(This is fucking wicked. I am almost definitely a musical genius). [Looks at himself in the mirror]. (Maybe a tattoo on my chest, but of my face. Yeah, double me! Feel it!)"
3,3,1,Season 1,Episode 1,Warring Factions,19/09/2003,Directions,[In the streets; Mark is running towards the bus]
4,4,1,Season 1,Episode 1,Warring Factions,19/09/2003,Mark,"(She's on there, she's on there. Got to get the same bus home. Don't go! Almost there. Yes, I am the lord of the bus, said he! Where is she? Knickers! She's not on here...)"
5,5,1,Season 1,Episode 1,Warring Factions,19/09/2003,Sophie,Hey Mark!
6,6,1,Season 1,Episode 1,Warring Factions,19/09/2003,Mark,Sophie!


In [12]:
# Test for rogue directions that may have been incorrectly allocated as script because they contain a ":"
test <-
  filter(peepshow, character != "Directions") %>%
  filter(str_sub(content, 1, 1) == "[") %>%
  filter(str_sub(content, -1, -1) == "]")
head(test, 100)

,index,episode_index,season_number,episode_number,episode_name,episode_date,character,content
,<int>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,175,1,Season 1,Episode 1,Warring Factions,19/09/2003,Jeremy,"[Writing down ""Cockmuncher"" on his Rizla] (Heh-heh, yeah this is it. Paula's gonna love this. Very Iggy.) [Tries to stick it on Mark's forehead]"
2,642,3,Season 1,Episode 3,On The Pull,03/10/2003,Mark,"[Picking up two pints of beer] (Jez is so great. He's like an idiot savant but not so stupid. I bet he's totally sorting this whole night in his head for us right now.) [Sits down with Jez, Toni and Valerie.]"
3,1615,6,Season 1,Episode 6,Funeral,24/10/2003,Jeremy,"[Stepping out from his hiding place] Oh! It's like that, is it?! That's how it's going to be, is it? Well, excuse me! [Storms out in a huff.]"
4,8366,29,Season 5,Episode 5,Jeremy's Manager,30/05/2008,Jeremy,"[Banging on the door of Cally's trailer] Cally? Cally? [There's no answer, so Jeremy opens the door]"
5,8863,31,Season 6,Episode 1,Jeremy at JLB,18/09/2009,Mark,[Coughs]
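
Most of the rows returned are speech that merely begins and ends with a bracketed direction. A narrower check, sketched here in base R as one possible refinement, flags only content that consists of a single bracketed span. The variable names and the third sample string are hypothetical.

```r
# Sample content values: the first two are taken from the output above,
# the third is made up
sample_content <- c(
  "[Coughs]",
  "[Banging on the door of Cally's trailer] Cally? Cally? [There's no answer, so Jeremy opens the door]",
  "Hey Jeremy."
)

# "[" at the start, no closing bracket until the very end of the string
single_direction <- grepl("^\\[[^]]*\\]$", sample_content)
single_direction
```

Only "[Coughs]" matches here, since the second value closes its first bracket before the end of the string.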


In [13]:
# Write the data to CSV without row names
write.csv(peepshow, "peepshow.csv", row.names = FALSE)
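
As an optional sanity check, the written file can be read back and compared with the table in memory. The sketch below uses a small made-up data frame and a temporary file in place of `peepshow` and `peepshow.csv`, so all names and values are illustrative.

```r
# Toy data frame standing in for `peepshow` (values are illustrative)
demo <- data.frame(
  index = 1:2,
  character = c("Mark", "Directions"),
  content = c("Hey Jeremy, \"hello\".", "[Jez joins him]"),
  stringsAsFactors = FALSE
)

# Write to a temporary file, then read it back
out_file <- tempfile(fileext = ".csv")
write.csv(demo, out_file, row.names = FALSE)
check <- read.csv(out_file, stringsAsFactors = FALSE)

# Same shape, and text with commas and quotes survives the round trip
same_shape   <- identical(dim(check), dim(demo))
same_content <- all(check$content == demo$content)
same_shape
same_content
```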