# Getting the scripts for "The Office"

[OfficeQuotes.net](http://officequotes.net) has all the scripts from one of my favorite TV shows, [The Office](https://www.imdb.com/title/tt0386676/). Since the website is written in PHP and is not a single page application, I should be able to scrape all the data with just [Requests](https://pypi.org/project/requests/) and [Beautiful Soup](https://pypi.org/project/beautifulsoup4/). [Here's a handy post](https://curiouscoder.space/blog/backend/web-scraping-using-python/) on how to do this.

The website has a sidebar with links to each episode, but the URL for each episode is easy to generate because it follows the pattern `no{season_num}-{episode_num}.php`, with the episode number zero-padded to two digits (for example, `no1-01.php`). Inspecting the HTML tells me I want everything in `div`s with the class `quote`. I am going to ignore the 'Deleted Scene 1' headings, which always seem to be in `u` tags, but keep the lines in those scenes. My goal is to build a [Recurrent Neural Network (RNN)](https://en.wikipedia.org/wiki/Recurrent_neural_network) that uses these scripts to generate more scripts, so the lines from deleted scenes give me more training data.
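
As a quick sanity check on the URL pattern, here is a minimal sketch that builds an episode URL without touching the network. The helper name `episode_url` and the constant `BASE_URL` are my own choices for illustration. The zero-padding matches the format string used in the notebook cells.

```python
BASE_URL = "http://officequotes.net"


def episode_url(season: int, episode: int) -> str:
    """Build the page URL for one episode, zero-padding the episode number."""
    return f"{BASE_URL}/no{season}-{episode:02}.php"


print(episode_url(1, 1))   # http://officequotes.net/no1-01.php
print(episode_url(2, 14))  # http://officequotes.net/no2-14.php
```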

First I will try to download the lines from a single episode. After I perfect this method, I should be able to loop through all the episodes very easily. It's time for some code.

In [42]:
import requests
from bs4 import BeautifulSoup

base_url = "http://officequotes.net"

season, episode = 1, 1
episode_url = f"{base_url}/no{season}-{episode:02}.php"
result = requests.get(episode_url)
soup = BeautifulSoup(result.content, "lxml")

In [43]:
quotes = soup.find_all("div", class_="quote")
print(len(quotes))
soup.get_text()

51


'\n\n\n\n\n  window.dataLayer = window.dataLayer || [];\n  function gtag(){dataLayer.push(arguments);}\n  gtag(\'js\', new Date());\n\n  gtag(\'config\', \'UA-123167577-1\');\n\n\n\n\n\n\n\nOfficeQuotes.net - The Comprehensive Source for The Office Quotes!\n\n\nfunction roll(img_name, img_src)\n{\n    document[img_name].src = img_src;\n}\n\n\n(function(d, s, id) {\n  var js, fjs = d.getElementsByTagName(s)[0];\n  if (d.getElementById(id)) return;\n  js = d.createElement(s); js.id = id;\n  js.src = \'https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v3.1\';\n  fjs.parentNode.insertBefore(js, fjs);\n}(document, \'script\', \'facebook-jssdk\'));\n&nbsp\n\n\n\n\n\n\n\n<!--\nfunction random_text()\n{};\nvar random_text = new random_text();\n// Set the number of text strings to zero to start\nvar number = 0;\n// Incremental list of all possible Text\nrandom_text[number++] = "I know that patience and loyalty are good, and virtuous traits. But sometimes I just think you need to grow a 

In [70]:
# Try Selenium, since it doesn't look like I am getting the full data from the above method
from selenium import webdriver
from selenium.webdriver.common.by import By

# make sure installation is completed correctly
# https://selenium-python.readthedocs.io/installation.html
driver = webdriver.Chrome()
driver.implicitly_wait(5)
driver.get(episode_url)
# find_elements_by_class_name was removed in Selenium 4; use find_elements with By
quotes = driver.find_elements(By.CLASS_NAME, "quote")
# driver.close()  # the browser must stay open while working with the elements
print(len(quotes))

51


In [45]:
quotes[0].text

"Michael: All right Jim. Your quarterlies look very good. How are things at the library?\nJim: Oh, I told you. I couldn't close it. So...\nMichael: So you've come to the master for guidance? Is this what you're saying, grasshopper?\nJim: Actually, you called me in here, but yeah.\nMichael: All right. Well, let me show you how it's done."

In [46]:
quotes[-1].text

'Deleted Scene 12\n  Michael: I think better to be a happy idiot then a, um... Then someone who knows the truth.'

In [66]:
import re

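# Remove the heading text, e.g. 'Deleted Scene 12\n'
# (raw string so that \s reaches the regex engine instead of being read as a string escape)
stripped = re.sub(r'Deleted Scene [0-9]+\n\s*', '', quotes[-1].text)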

print(stripped)

Michael: I think better to be a happy idiot then a, um... Then someone who knows the truth.


In [64]:
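def remove_deleted_scene_heading(text):
    # Raw string so that \s reaches the regex engine unchanged
    return re.sub(r'Deleted Scene [0-9]+\n\s*', '', text)

# Simple test: print the start of each quote before and after stripping.
# Note that quotes[-0] is quotes[0], so the first iteration shows the opening scene;
# the remaining iterations walk backwards from the end of the page.
for i in range(15):
    quote = quotes[-i].text
    print(quote[:20])
    print(remove_deleted_scene_heading(quote)[:20])
    print("\n")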

Michael: All right J
Michael: All right J


Deleted Scene 12
  M
Michael: I think bet


Deleted Scene 11
  D
Documentary Crew Mem


Deleted Scene 10
  M
Michael: Do I need t


Deleted Scene 9
  Mi
Michael: What's that


Deleted Scene 8
  Mi
Michael: What you do


Deleted Scene 7
  An
Angela: My name is A


Deleted Scene 6
  Mi
Michael: So this is 


Deleted Scene 5
  Mi
Michael: All these p


Deleted Scene 4
  Mi
Michael: Ah, right h


Deleted Scene 3
  Dw
Dwight: People respo


Deleted Scene 2
  Mi
Michael: Pam! Pam-Pa


Deleted Scene 1
  Dw
Dwight: Dwight Schru


Pam: Hey.
Jim: Hey.

Pam: Hey.
Jim: Hey.



Michael: What is the
Michael: What is the
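
The printed comparison above has to be checked by eye. A minimal sketch of the same check as assertions follows. The sample strings are copied from the outputs shown earlier, and the precompiled `HEADING` pattern is my own addition for illustration.

```python
import re

# Same pattern as in the notebook, precompiled and written as a raw string
HEADING = re.compile(r"Deleted Scene [0-9]+\n\s*")


def remove_deleted_scene_heading(text):
    """Strip a leading 'Deleted Scene N' heading and the whitespace after it."""
    return HEADING.sub("", text)


# A deleted scene loses its heading but keeps the spoken line
deleted = "Deleted Scene 12\n  Michael: I think better to be a happy idiot"
assert remove_deleted_scene_heading(deleted) == "Michael: I think better to be a happy idiot"

# A regular scene passes through unchanged
regular = "Pam: Hey.\nJim: Hey."
assert remove_deleted_scene_heading(regular) == regular

print("all checks passed")
```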




In [72]:
# "at" opens the file in append/text mode, so re-running this cell adds the episode again
with open("script.txt", "at") as f:
    for quote in quotes:
        f.write(remove_deleted_scene_heading(quote.text) + '\n\n')

In [75]:
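from selenium.webdriver.common.by import By

def write_episode_to_file(season, episode):
    """Append every quote block from one episode page to the combined script file."""
    base_url = "http://officequotes.net"
    episode_url = f"{base_url}/no{season}-{episode:02}.php"

    driver = webdriver.Chrome()
    try:
        driver.implicitly_wait(5)
        driver.get(episode_url)
        # find_elements_by_class_name was removed in Selenium 4
        quotes = driver.find_elements(By.CLASS_NAME, "quote")

        with open("the-office-all-episodes.txt", "at") as f:
            for quote in quotes:
                f.write(remove_deleted_scene_heading(quote.text) + '\n\n')
    finally:
        # quit() ends the whole browser session, even if scraping fails
        driver.quit()

write_episode_to_file(1, 5)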

In [80]:
num_episodes_per_season = [6, 22, 23, 14, 26, 24, 24, 24, 23]

# Seasons and episodes are numbered from 1 on the site
for season, num_episodes in enumerate(num_episodes_per_season, start=1):
    for episode in range(1, num_episodes + 1):
        write_episode_to_file(season, episode)
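
Since the full loop opens a browser for every page, it may be worth doing a dry run first. Here is a minimal sketch that only enumerates the (season, episode) pairs the loop would visit. The generator name `all_episodes` is hypothetical. The episode counts are the ones listed above, which I have not verified against the site.

```python
num_episodes_per_season = [6, 22, 23, 14, 26, 24, 24, 24, 23]


def all_episodes(counts):
    """Yield 1-based (season, episode) pairs for every page the loop would fetch."""
    for season, num_episodes in enumerate(counts, start=1):
        for episode in range(1, num_episodes + 1):
            yield season, episode


pairs = list(all_episodes(num_episodes_per_season))
print(len(pairs))           # 186
print(pairs[0], pairs[-1])  # (1, 1) (9, 23)
```

With these counts, the loop makes 186 page loads, which gives a rough idea of how long the full scrape will take.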