# The Office - Transcripts Scraper

The purpose of this notebook is to illustrate how The_Office_Scraper.py and IMDB_Scraper.py collect and store the transcripts and ratings.

## Preparation

The libraries required to run this notebook are Pandas, NumPy and BeautifulSoup.

* This notebook also uses requests, json, and re (regular expressions). json and re are part of the Python 3 standard library; requests is a third-party package and must be installed separately.

In [1]:
# install required libraries
import sys
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install requests
!{sys.executable} -m pip install beautifulsoup4



In [12]:
# import libraries
import requests
import json
import re
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

## Define Function
This framework uses 'soups' as objects to handle the webpage content, so we'll use a function to facilitate the collection of this object.

To request the webpage data from the server, the requests library sends a GET request directly from Python, and BeautifulSoup parses the response. Since many websites block such requests, we change our headers to simulate another user agent, such as a browser.

In [3]:
# get the content of a webpage and return it as a BeautifulSoup object
def get_content(url):
    # Many websites refuse GET requests from python, so we change the header to pretend we're a browser.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
    page = requests.get(url, headers = headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup

## Collect transcripts URLs

The transcripts are stored on different pages of the website; to access them we'll need a list of their addresses.

### PHP REQUESTS
The URLs are distributed in lists of 25 items, across 8 pages.
To access those pages the website uses a PHP 'get' request, as below:

    https://transcripts.foreverdreaming.org/viewforum.php?f=574 *first page  
    https://transcripts.foreverdreaming.org/viewforum.php?f=574&start=25 *second page  

Where **'&start='** introduces the query parameter being passed to the PHP script and **25** is its value (the second page starts after the first 25 items).
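
The pagination scheme described above can be sketched without touching the network. This is only an illustration: `build_page_urls` and its parameters are hypothetical names introduced here, and the defaults of 8 pages and 25 items per page are taken from the description above.

```python
def build_page_urls(forum_url, pages=8, page_size=25):
    """Return one URL per forum page; the first page takes no 'start' parameter."""
    urls = [forum_url]
    for page in range(1, pages):
        # each later page is addressed by the offset of its first item
        urls.append(f"{forum_url}&start={page * page_size}")
    return urls


FORUM = 'https://transcripts.foreverdreaming.org/viewforum.php?f=574'
page_urls = build_page_urls(FORUM)
print(len(page_urls))
print(page_urls[1])
print(page_urls[-1])
```

With the defaults, the offsets run from 25 to 175 in steps of 25, which matches the range used in the cell further down.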

### HTML SEARCH
Once on the correct page, BeautifulSoup is used to retrieve all anchor (`<a>`) tags and check whether their class name matches the one we're looking for. Python then stores the episode name and the URL in lists, to be later converted into a dataframe.  
Besides the episode transcripts, the list of links on the webpage also has two announcements from the website, which are removed with an exception list.
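
The filtering step can be tried offline on a hand-written HTML fragment. The markup below is invented for illustration (including the `./viewtopic.php?t=...` shape of the links, which is an assumption), and `class_='topictitle'` is a slightly looser test than an exact class comparison: it matches any anchor whose class list includes `topictitle`.

```python
from bs4 import BeautifulSoup

# Hand-written HTML standing in for a forum page (not real site content).
SAMPLE_HTML = """
<ul>
  <li><a class="topictitle" href="./viewtopic.php?t=1">Online Store</a></li>
  <li><a class="topictitle" href="./viewtopic.php?t=2">01x01 - Pilot</a></li>
  <li><a class="other" href="./faq.php">FAQ</a></li>
</ul>
"""

HOME = 'https://transcripts.foreverdreaming.org'
EXCEPTIONS = ['Updated: Editors Needed', 'Online Store']

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
episodes, links = [], []
# class_ selects anchors whose class attribute includes 'topictitle'
for anchor in soup.find_all('a', class_='topictitle'):
    title = anchor.get_text()
    if title not in EXCEPTIONS:
        episodes.append(title)
        # drop the leading '.' of the relative href before joining it to HOME
        links.append(HOME + anchor['href'][1:])

print(episodes)
print(links)
```

Because the exception list is compared against the exact link text, an announcement whose title changes would need its new title added to the list.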

In [4]:
# define variables
# web pages:
home='https://transcripts.foreverdreaming.org'
forum = 'https://transcripts.foreverdreaming.org/viewforum.php?f=574'
# php request
r = np.arange(25,176,25) # [25, 50, 75 ... 150, 175]
lim = '&start='
# lists
exceptions=['Updated: Editors Needed', 'Online Store']
url=[]
ep=[]

# collect urls from first page
page = get_content(forum)
for item in page.find_all('a'):
    # keep only anchors whose class list is exactly ['topictitle']
    if item.get('class') == ['topictitle']:
        if item.get_text() not in exceptions:
            # drop the first character of the relative href before joining it to the site root
            url.append(home + str(item.get('href'))[1:])
            ep.append(item.get_text())

# go through every value of the range
for i in r:
    # build the paginated request URL
    page_suffix = lim + str(i)
    forum_list = forum + page_suffix
    # retrieve webpage data
    page = get_content(forum_list)
    # collect urls
    for item in page.find_all('a'):
        if item.get('class') == ['topictitle']:
            if item.get_text() not in exceptions:
                url.append(home + str(item.get('href'))[1:])
                ep.append(item.get_text())

In [5]:
# define a dataframe to hold the episode name and the url
df = pd.DataFrame(ep)
df.columns = ['ep']
df['url'] = url

In [6]:
# first 5 rows
df.head()

| | ep | url |
|---|---|---|
| 0 | Board Updates: Please Read 8/26/19 | https://transcripts.foreverdreaming.org/viewto... |
| 1 | 01x01 - Pilot | https://transcripts.foreverdreaming.org/viewto... |
| 2 | 01x02 - Diversity Day | https://transcripts.foreverdreaming.org/viewto... |
| 3 | 01x03 - Health Care | https://transcripts.foreverdreaming.org/viewto... |
| 4 | 01x04 - The Alliance | https://transcripts.foreverdreaming.org/viewto... |


## Collect Transcripts

To facilitate the cleaning process, the retrieved data will be organized before being saved.  
The dialog text is all inside paragraph tags (`<p>`), and every line of dialog starts with the name of the character followed by what they say, as below:  
  
    Character Name: Sentences the character is saying with [interactions and actions the character is performing]  
  
To break down those dialogs, the character name is separated from the rest at the first ':', and everything within '\[ \]' is removed since it is not part of the dialog.
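
As a small worked example of that rule, the sketch below parses two made-up lines. `parse_line` is a hypothetical helper, not part of the original scripts, and the final `.strip()` is an addition that the notebook cell itself does not perform.

```python
import re

# non-greedy, so each [...] group is removed separately
BRACKETED = re.compile(r"\[.*?\]")


def parse_line(paragraph):
    """Split 'Name: speech' at the first colon and drop bracketed stage directions."""
    if ':' not in paragraph:
        return None
    name, speech = paragraph.split(':', 1)
    return BRACKETED.sub('', name).strip(), BRACKETED.sub('', speech).strip()


first = parse_line("Character Name [waving]: Hello there [smiles], how are you?")
second = parse_line("Note: split at first colon only: see")
print(first)
print(second)
```

Splitting with a limit of 1 keeps any later colons inside the spoken text, and removing the brackets can leave stray spaces behind (for example before a comma), which a later cleaning step may want to normalise.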

In [7]:
char=[]
text=[]
ep=[]

# go through all urls previously collected
for i, row in df.iterrows():
    # retrieve webpage and print episode name
    page = get_content(row.url)
    print(row.ep)
    # go through the paragraphs, keep those containing ':' and remove text inside []
    for item in page.find_all('p'):
        if ':' in item.get_text():
            # split once so colons inside the dialog stay in the text
            temp = item.get_text().split(':', 1)
            char.append(re.sub(r"\[.*?\]", "", temp[0]))
            text.append(re.sub(r"\[.*?\]", "", temp[1]))
            ep.append(row.ep)

Board Updates: Please Read 8/26/19
01x01 - Pilot
01x02 - Diversity Day
01x03 - Health Care
01x04 - The Alliance
01x05 - Basketball
01x06 - Hot Girl
01x99 - Deleted Scenes from Season 1
02x01 - The Dundies
02x02 - Sexual Harassment
02x03 - Office Olympics
02x04 - The Fire
02x05 - Halloween
02x06 - The Fight
02x07 - The Client
02x08 - Performance Review
02x09 - E-Mail Surveillance
02x10 - Christmas Party
02x11 - Booze Cruise
02x12 - The Injury
02x13 - The Secret
02x14 - The Carpet
02x15 - Boys & Girls
02x16 - Valentine's Day
02x17 - Dwight's Speech
02x18 - Take Your Daughter to Work Day
Board Updates: Please Read 8/26/19
02x19 - Michael's Birthday
02x20 - Drug Testing
02x21 - Conflict Resolution
02x22 - Casino Night
02x99 - Deleted Scenes from Season 2
03x00 - The Accountants Webisodes 1-10
03x01 - Gay Witch Hunt
03x02 - The Convention
03x03 - The Coup
03x04 - Grief Counseling
03x05 - Initiation
03x06 - Diwali
03x07 - Branch Closing
03x08 - The Merger
03x09 - The Convict
03x10/11 - A Ben

In [8]:
# Build Dataframe
df_lines = pd.DataFrame(char)
df_lines.columns = ['char']
df_lines['text'] = text
df_lines['ep'] = ep
# Remove blank lines
df_lines = df_lines.drop(df_lines[df_lines['text']==' '].index).copy()
df_lines = df_lines.drop(df_lines[df_lines['text']==''].index).copy()
df_lines = df_lines[:-1]

In [9]:
# save .csv file
df_lines.to_csv('the_office.csv', sep=';', encoding='utf-16')

# Collect Conversations

In [10]:
relations = []
talk = []

counter = 0

for i, row in df.iterrows():
    # print progress (counter/2 reads as a percentage only when df holds about 200 urls)
    counter += 1
    if counter % 20 == 0:
        print(counter/2,'%')
    # collect data
    page = get_content(row.url)
    for item in page.find_all(['p','hr']):
        if item.name == 'p':
            # keep only the speaker name of each dialog line
            if ':' in item.get_text():
                temp = item.get_text().split(':',1)
                talk.append(re.sub(r"\[.*?\]", "", temp[0]))
        else:
            # an <hr> closes the current conversation and starts a new one
            relations.append(talk)
            talk = []

10.0 %
20.0 %
30.0 %
40.0 %
50.0 %
60.0 %
70.0 %
80.0 %
90.0 %
100.0 %
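
The cell above starts a new speaker list every time it meets an `<hr>` tag, which suggests that the site separates scenes with horizontal rules (this is inferred from the code, not documented by the site). The same grouping can be tried on an invented fragment. Unlike the cell above, this sketch also keeps the speakers found after the last `<hr>`.

```python
from bs4 import BeautifulSoup

# Invented markup: two scenes separated by a horizontal rule.
SAMPLE_HTML = """
<p>Jim: Line one.</p>
<p>Pam: Line two.</p>
<hr/>
<p>A paragraph with no speaker</p>
<p>Dwight: Line three.</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
conversations, speakers = [], []
# find_all with a list of names returns the tags in document order
for tag in soup.find_all(['p', 'hr']):
    if tag.name == 'hr':
        conversations.append(speakers)
        speakers = []
    elif ':' in tag.get_text():
        speakers.append(tag.get_text().split(':', 1)[0])
# keep any speakers found after the last <hr>
if speakers:
    conversations.append(speakers)

print(conversations)
```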


## Clean names

In [13]:
# load the name corrections (maps a name as written in the transcripts to its corrected form)
with open('corrections.json') as file:
    corrections = json.load(file)

main_chars = ['Darryl', 'Creed', 'Meredith', 'Kelly', 'Ryan Howard', 'Stanley',
              'Phyllis', 'Oscar', 'Andy', 'Angela', 'Kevin', 'Pam', 'Jim',
              'Dwight', 'Michael']

totals = []
talk_scores = {}

for talk in relations:
    for name in talk:
        n = name
        if name in corrections:
            name = corrections[name]
        if name in main_chars:    
            talk_scores[name] = talk.count(n)
    if(len(talk_scores) > 1):
        totals.append(talk_scores)
    talk_scores = {}
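
One detail of the cell above: it stores `talk.count(n)` under the corrected name, so when two spellings of the same character appear in one conversation, the later count replaces the earlier one instead of being added to it. The sketch below shows an alternative that sums them. The `corrections` and `main_chars` values here are small invented stand-ins, and `score_conversation` is a hypothetical helper.

```python
from collections import Counter

# Invented stand-ins for corrections.json and the main_chars list
corrections = {'Micheal': 'Michael', 'Dwight ': 'Dwight'}
main_chars = ['Michael', 'Dwight', 'Jim']


def score_conversation(talk):
    """Count lines per main character, merging corrected spellings."""
    scores = Counter()
    for raw_name in talk:
        name = corrections.get(raw_name, raw_name)
        if name in main_chars:
            scores[name] += 1
    return dict(scores)


talk = ['Michael', 'Micheal', 'Dwight ', 'Guest', 'Michael']
scores = score_conversation(talk)
print(scores)
```

Whether summing is the intended behaviour depends on how the conversation scores are used downstream, so treat this as one option and not as a correction of the original script.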

In [14]:
# check if there is any missing character and print their names
for name in main_chars:
    flag = True
    for talk in totals:
        if name in talk:
            flag = False
    if flag:
        print(name)
# print number of conversations
print(len(totals))

4463


In [15]:
# Save data to a .json file
with open('conversations.json', 'w') as file:
    json.dump(totals, file)

# IMDB Ratings
Retrieving the ratings from IMDB is very similar to collecting the transcripts, but simpler; the same methods and libraries are used.

## Collection
The script loops through the web page of every season, saving the collected data in temporary lists that are merged into the main lists at the end of each iteration. Those lists are later converted into a dataframe.

### Episode Rating
The ratings are stored in span tags with the class 'ipl-rating-star__rating'. The same tag and class combination is used by the voting widget, so BeautifulSoup also retrieves many irrelevant values. The actual episode rating appears at every 23rd value, and that is what the program collects.
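
The "every 23rd value" rule is a stride slice. The sketch below uses invented filler values to show the idea; the stride of 23 comes from the description above and depends on the page layout at the time of scraping.

```python
RATING_STRIDE = 23

# Invented data: each real rating is followed by 22 values from the voting widget
spans = []
for rating in ['7.5', '8.3']:
    spans.append(rating)
    spans.extend(['x'] * (RATING_STRIDE - 1))

# keep positions 0, 23, 46, ...
episode_ratings = spans[::RATING_STRIDE]
print(episode_ratings)
```

A slice like this silently returns wrong values if the page changes the number of widget spans per episode, so comparing `len(episode_ratings)` with the number of episode names is a cheap sanity check.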

### Episode Name
The episode names are retrieved from anchor tags whose 'itemprop' attribute is equal to 'name'.
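
The attribute lookup can also be expressed directly in `find_all` through its `attrs` argument. The fragment below is invented for illustration (the `href` values are placeholders), and numbering the episodes with `enumerate` is just one way to produce the episode counter.

```python
from bs4 import BeautifulSoup

# Invented markup: two episode links and one unrelated link.
SAMPLE_HTML = """
<div><a itemprop="name" href="/title/a/">Pilot</a></div>
<div><a href="/help/">Help</a></div>
<div><a itemprop="name" href="/title/b/">Diversity Day</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
# attrs filters on the exact attribute value
names = [a.get_text() for a in soup.find_all('a', attrs={'itemprop': 'name'})]
# number the episodes in page order, starting at 1
numbered = list(enumerate(names, start=1))
print(numbered)
```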

In [16]:
ratings, ep_name, ep_num, season = [],[],[],[]

# go through each season page
for s in np.arange(1,10):
    temp_ratings, temp_ep_name, temp_ep_num, temp_season = [],[],[],[]
    page = get_content('https://www.imdb.com/title/tt0386676/episodes?season='+str(s))
    counter = 1
    
    # get the ratings from span tags
    for i in page.find_all('span'):
        class_name = dict(i.attrs).get('class')
        if(class_name == ['ipl-rating-star__rating']):
            temp_ratings.append(i.get_text())
    # get just the rating values    
    temp_ratings = temp_ratings[::23]
    
    # get the episode name from the anchor tags
    for i in page.find_all('a'):
        class_name = dict(i.attrs).get('itemprop')
        if(class_name == 'name'):
            temp_ep_name.append(i.get_text())
            temp_ep_num.append(counter)
            temp_season.append(s)
            counter += 1
    # add data to the main lists
    ratings.extend(temp_ratings)
    ep_name.extend(temp_ep_name)
    ep_num.extend(temp_ep_num)
    season.extend(temp_season)

In [17]:
# Build dataframe
df = pd.DataFrame(ep_name)
df.columns = ['ep_name']
df['ep_num'] = ep_num
df['season'] = season
df['ratings'] = ratings

df.head()

| | ep_name | ep_num | season | ratings |
|---|---|---|---|---|
| 0 | Pilot | 1 | 1 | 7.5 |
| 1 | Diversity Day | 2 | 1 | 8.3 |
| 2 | Health Care | 3 | 1 | 7.9 |
| 3 | The Alliance | 4 | 1 | 8.1 |
| 4 | Basketball | 5 | 1 | 8.4 |


In [18]:
# Save .csv with the ratings per episode
df.to_csv('ratings.csv', sep=';', encoding='utf-16')