
Data source: https://bigbangtrans.wordpress.com/

# 1 Housekeeping

### 1.1 Import libraries

In [1]:
#basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from bs4 import BeautifulSoup

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 200)

### 1.2 Connect to drive

In [2]:
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### 1.3 Custom functions

In [3]:
#function to extract the text between two marker substrings (plain string search, no regex)
def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""    
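
As a quick illustration of how `find_between` behaves, the sketch below repeats the function and applies it to a made-up anchor tag (the `example.com` URL is a placeholder, not part of the scraped site). It returns the substring between the two markers, and an empty string when either marker is missing.

```python
def find_between(s, first, last):
    """Return the text between `first` and the next `last`, or "" if either is missing."""
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        return ""

# Placeholder anchor tag, used only for illustration
sample = '<a href="https://example.com/page/">Series 01</a>'

found = find_between(sample, 'href="', '">Series')    # both markers present
missing = find_between(sample, 'href="', '">Season')  # closing marker absent

print(found)          # https://example.com/page/
print(repr(missing))  # ''
```

Because a missing marker yields `""` rather than an exception, the later cells can simply filter out empty strings.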

# 2 Scraping

### 2.1 Scrape URLs

In [4]:
#grab the html of a site
import requests
url = 'https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/'
result = requests.get(url)
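
The later cells request a few hundred pages in a row, so it may help to fail loudly on HTTP errors and to pause between requests. The following is only a sketch: `fetch_html`, its `timeout` and `pause` parameters, and their default values are hypothetical and not part of the original notebook.

```python
import time
import requests

def fetch_html(url, timeout=10, pause=1.0):
    """Hypothetical helper: download one page and return its HTML text."""
    response = requests.get(url, timeout=timeout)  # give up if the server does not answer in time
    response.raise_for_status()                    # raise requests.HTTPError on 4xx/5xx responses
    time.sleep(pause)                              # short delay so consecutive requests are spaced out
    return response.text
```

With such a helper, `result.text` in the cells below would become `fetch_html(url)`.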


In [5]:
#parse the HTML of the response with BeautifulSoup
doc = BeautifulSoup(result.text, 'html.parser')

In [6]:
#retrieve all anchor (<a>) tags and convert them to strings
htmls = doc.find_all('a')
html_strings = []
for x in htmls:
    html_strings.append(str(x))

html_strings[5:10]

['<a href="https://bigbangtrans.wordpress.com/about/">About</a>',
 '<a aria-current="page" href="https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/">Series 01 Episode 01 – Pilot\xa0Episode</a>',
 '<a href="https://bigbangtrans.wordpress.com/series-1-episode-2-the-big-bran-hypothesis/">Series 01 Episode 02 – The Big Bran\xa0Hypothesis</a>',
 '<a href="https://bigbangtrans.wordpress.com/series-1-episode-3-the-fuzzy-boots-corollary/">Series 01 Episode 03 – The Fuzzy Boots\xa0Corollary</a>',
 '<a href="https://bigbangtrans.wordpress.com/series-1-episode-4-the-luminous-fish-effect/">Series 01 Episode 04 – The Luminous Fish\xa0Effect</a>']

In [7]:
#create a list of cleaned urls
cleaned_urls = []
for x in html_strings:
  cleaned_html = find_between(x, 'href="', '">Series')
  cleaned_urls.append(cleaned_html)
cleaned_urls = [x for x in cleaned_urls if x]  #remove empty strings once, after the loop

cleaned_urls[0:10]

['https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/',
 'https://bigbangtrans.wordpress.com/series-1-episode-2-the-big-bran-hypothesis/',
 'https://bigbangtrans.wordpress.com/series-1-episode-3-the-fuzzy-boots-corollary/',
 'https://bigbangtrans.wordpress.com/series-1-episode-4-the-luminous-fish-effect/',
 'https://bigbangtrans.wordpress.com/series-1-episode-5-the-hamburger-postulate/',
 'https://bigbangtrans.wordpress.com/series-1-episode-6-the-middle-earth-paradigm/',
 'https://bigbangtrans.wordpress.com/series-1-episode-7-the-dumpling-paradox/',
 'https://bigbangtrans.wordpress.com/series-1-episode-8-the-grasshopper-experiment/',
 'https://bigbangtrans.wordpress.com/series-1-episode-9-the-cooper-hofstadter-polarization/',
 'https://bigbangtrans.wordpress.com/series-1-episode-10-the-loobenfeld-decay/']
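
An alternative to string slicing is to read the `href` attribute straight from the parsed tags. The sketch below is not the notebook's method; it runs on a small inline HTML sample (copied from the anchors shown above) instead of the live page, and it assumes that every episode link's text starts with "Series".

```python
from bs4 import BeautifulSoup

# Inline sample so the example runs without a network request
sample_html = """
<a href="https://bigbangtrans.wordpress.com/about/">About</a>
<a href="https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/">Series 01 Episode 01 – Pilot Episode</a>
<a href="https://bigbangtrans.wordpress.com/series-1-episode-2-the-big-bran-hypothesis/">Series 01 Episode 02 – The Big Bran Hypothesis</a>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# Keep only anchors that have an href and whose visible text starts with "Series"
episode_urls = [
    a["href"]
    for a in sample_doc.find_all("a", href=True)
    if a.get_text().startswith("Series")
]

print(episode_urls)
```

This avoids depending on the exact attribute order inside the `<a>` tag, which the `'href="'` / `'">Series'` markers implicitly rely on.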

### 2.2 Season 1

The HTML structure of Season 1 differs from that of the other seasons, so I will extract it separately.
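
The sections below select each season with hard-coded list slices. As a rough sketch of an alternative, the URLs could be grouped by the season number in their slug. This assumes every episode URL follows the `series-N-episode-M` pattern seen in the Season 1 links; the names `season_pattern` and `urls_by_season` are illustrative. Note that the HTML structure does not change exactly at season boundaries (the first episode of Season 2 is parsed with Season 1), so such a grouping is only a starting point.

```python
import re
from collections import defaultdict

# Three URLs taken from the list above; in the notebook this would be `cleaned_urls`
sample_urls = [
    "https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/",
    "https://bigbangtrans.wordpress.com/series-1-episode-2-the-big-bran-hypothesis/",
    "https://bigbangtrans.wordpress.com/series-1-episode-10-the-loobenfeld-decay/",
]

# Assumed slug pattern: /series-<season>-episode-<episode>-...
season_pattern = re.compile(r"/series-(\d+)-episode-(\d+)")

urls_by_season = defaultdict(list)
for url in sample_urls:
    match = season_pattern.search(url)
    if match:  # skip anything that does not look like an episode URL
        urls_by_season[int(match.group(1))].append(url)

print({season: len(urls) for season, urls in urls_by_season.items()})
```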

In [8]:
#grab the html of a site
import requests

season_1 =  pd.DataFrame(columns=['line', 'character_scene	', 'episode'])

season1_urls = cleaned_urls[0:18] #season 1 + first episode of season 2; from the second episode of season 2 onwards I get an error

for url in season1_urls:
  result = requests.get(url)

  #fetch the html of the url
  doc = BeautifulSoup(result.text, 'html.parser')
  #extract the episode title
  title = doc.title.string
  #print(title)

  #find all <p> elements, which hold the transcript lines
  paragraphs = doc.find_all('p')

  #convert the paragraphs to strings 
  string_paragraphs = []
  for x in paragraphs:
    string_paragraphs.append(str(x))
  #print(string_paragraphs)

  #clean the paragraphs
  cleaned_paragraphs = []
  for x in string_paragraphs:
    cleaned_paragraph = find_between(x, 'Calibri;">', "</span>")
    cleaned_paragraph = cleaned_paragraph.replace('<em>', "")  #some weird strings remain that we can remove
    cleaned_paragraph = cleaned_paragraph.replace('</em>', "")
    cleaned_paragraphs.append(cleaned_paragraph )
  cleaned_paragraphs = [x for x in cleaned_paragraphs if x]  #remove empty strings once, after the loop
  #print(cleaned_paragraphs)

  #create df of cleaned paragraphs
  episode_df = pd.DataFrame(cleaned_paragraphs)
  episode_df = episode_df.rename(columns={0: "line"})
  episode_df['character_scene'] = episode_df['line'].str.split(":").str[0]
  episode_df['line'] = episode_df['line'].str.split(":").str[1]
  episode_df['episode'] = title


  #appends single episode df to full df
  season_1 = pd.concat([season_1, episode_df])  #DataFrame.append was removed in pandas 2.0

#the placeholder column 'character_scene\t' (note the trailing tab in its name) never receives data, so drop it
season_1 = season_1.drop('character_scene\t', axis=1)
season_1

Unnamed: 0,line,episode,character_scene
0,A corridor at a sperm bank.,Series 01 Episode 01 – Pilot Episode | Big Ban...,Scene
1,So if a photon is directed through a plane wi...,Series 01 Episode 01 – Pilot Episode | Big Ban...,Sheldon
2,"Agreed, what’s your point?",Series 01 Episode 01 – Pilot Episode | Big Ban...,Leonard
3,"There’s no point, I just think it’s a good id...",Series 01 Episode 01 – Pilot Episode | Big Ban...,Sheldon
4,Excuse me?,Series 01 Episode 01 – Pilot Episode | Big Ban...,Leonard
...,...,...,...
244,"Well, it’s really not that fancy, it’s just a...",Series 02 Episode 01 – The Bad Fish Paradigm |...,Leonard
245,"Right, but I have to have some sort of degree...",Series 02 Episode 01 – The Bad Fish Paradigm |...,Penny
246,That doesn’t matter to me at all.,Series 02 Episode 01 – The Bad Fish Paradigm |...,Leonard
247,"So, it’s fine with you if I’m not smart.",Series 02 Episode 01 – The Bad Fish Paradigm |...,Penny
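
One caveat of `str.split(":").str[1]` is that it keeps only the text between the first and second colon, so a spoken line that itself contains a colon is cut short. A possible variation, sketched below, splits on the first colon only. The second and third sample rows are invented for illustration and are not from the transcripts.

```python
import pandas as pd

sample = pd.DataFrame({"raw": [
    "Scene: A corridor at a sperm bank.",
    "Sheldon: Here is the rule: never touch my food.",  # invented line with a second colon
    "(They sit down.)",                                  # invented line without any colon
]})

# n=1 splits at the first colon only; expand=True returns the two parts as columns 0 and 1
parts = sample["raw"].str.split(":", n=1, expand=True)
sample["character_scene"] = parts[0].str.strip()
sample["line"] = parts[1].str.strip()  # missing (None/NaN) when the row has no colon

print(sample[["character_scene", "line"]])
```

Rows without a colon still end up with the whole text in `character_scene` and a missing `line`, as in the original approach.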


### 2.3 Seasons 2-6 and 8-9

In [69]:
#grab the html of a site
import requests

seasons_except7 =  pd.DataFrame(columns=['line', 'character_scene	', 'episode'])

#create list of urls
season2to6_url = cleaned_urls[18:135]
season8to9_url = cleaned_urls[159:-24]
seasons_except7_url = season2to6_url + season8to9_url  #create a list of seasons 2, 3, 4, 5, 6, 8 and 9

for url in seasons_except7_url:
  result = requests.get(url)

  #fetch the html of the url
  doc = BeautifulSoup(result.text, 'html.parser')
  #extract the episode title
  title = doc.title.string
  #print(title)

  #find all <p> elements, which hold the transcript lines
  paragraphs = doc.find_all('p')

  #convert the paragraphs to strings 
  string_paragraphs = []
  for x in paragraphs:
    string_paragraphs.append(str(x))
  #print(string_paragraphs)

  #clean the paragraphs
  cleaned_paragraphs = []
  for x in string_paragraphs:
    cleaned_paragraph = find_between(x, '<p>', "</p>")  #if you use > and <, the scene is somehow lost
    cleaned_paragraph = cleaned_paragraph.replace('<em>', "")  #some weird strings remain that we can remove
    cleaned_paragraph = cleaned_paragraph.replace('</em>', "")
    cleaned_paragraphs.append(cleaned_paragraph )
  cleaned_paragraphs = [x for x in cleaned_paragraphs if x]  #remove empty strings once, after the loop
  #print(cleaned_paragraphs)

  #create df of cleaned paragraphs
  episode_df = pd.DataFrame(cleaned_paragraphs)
  episode_df=  episode_df.rename(columns={0: "line"})
  episode_df['character_scene'] = episode_df['line'].str.split(":").str[0]
  episode_df['line'] = episode_df['line'].str.split(":").str[1]
  episode_df['episode'] = title


  #appends single episode df to full df
  seasons_except7 = pd.concat([seasons_except7, episode_df])  #DataFrame.append was removed in pandas 2.0

#the placeholder column 'character_scene\t' (note the trailing tab in its name) never receives data, so drop it
seasons_except7 = seasons_except7.drop('character_scene\t', axis=1)
seasons_except7

Unnamed: 0,line,episode,character_scene
0,The building entrance lobby. The guys enter. ...,Series 02 Episode 02 – The Codpiece Topology |...,Scene
1,Worst Renaissance Fair ever.,Series 02 Episode 02 – The Codpiece Topology |...,Sheldon
2,"Please let it go, Sheldon.",Series 02 Episode 02 – The Codpiece Topology |...,Leonard
3,It was rife with historical inaccuracies. For...,Series 02 Episode 02 – The Codpiece Topology |...,Sheldon
4,You’re nitpicking.,Series 02 Episode 02 – The Codpiece Topology |...,Leonard
...,...,...,...
279,I don’t like this at all.,Series 09 Episode 24 – The Convergence Converg...,Sheldon
280,I don’t like it either.,Series 09 Episode 24 – The Convergence Converg...,Leonard
281,Really? ‘Cause I love it.,Series 09 Episode 24 – The Convergence Converg...,Penny
282,//www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&...,Series 09 Episode 24 – The Convergence Converg...,"<a href=""https"
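
The last row above is not dialogue: it is a donation link from the page footer that was picked up because it also sits inside a `<p>` element. A minimal sketch of filtering such rows is shown below, assuming that any `character_scene` value starting with `<` is leftover markup. The sample frame is hand-built from rows shown in the output, with the link text kept in its display-truncated form.

```python
import pandas as pd

sample = pd.DataFrame({
    "line": ["I don’t like it either.", "//www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&..."],
    "episode": ["Series 09 Episode 24", "Series 09 Episode 24"],
    "character_scene": ["Leonard", '<a href="https'],
})

# Flag rows whose speaker field is actually an HTML tag; na=False treats missing values as not markup
is_markup = sample["character_scene"].str.startswith("<", na=False)
cleaned = sample[~is_markup].reset_index(drop=True)

print(cleaned)
```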


### 2.4 Season 7A

In [247]:
#grab the html of a site
import requests

season_7_a =  pd.DataFrame(columns=['line', 'character_scene	', 'episode'])

season_7_urls_a =   cleaned_urls[135: -81]

for url in season_7_urls_a :
  result = requests.get(url)

  #fetch the html of the url
  doc = BeautifulSoup(result.text, 'html.parser')
  #extract the episode title
  title = doc.title.string
  #print(title)

  #find all <p> elements, which hold the transcript lines
  paragraphs = doc.find_all('p')

  #convert the paragraphs to strings 
  string_paragraphs = []
  for x in paragraphs:
    string_paragraphs.append(str(x))
  #print(string_paragraphs)

  #clean the paragraphs
  cleaned_paragraphs = []
  for x in string_paragraphs:
    cleaned_paragraph = find_between(x, ';">', "</")
    cleaned_paragraph = cleaned_paragraph.replace('<i>', "")
    cleaned_paragraphs.append(cleaned_paragraph )
  cleaned_paragraphs = [x for x in cleaned_paragraphs if x]  #remove empty strings once, after the loop
  #print(cleaned_paragraphs)

  #create df of cleaned paragraphs
  episode_df = pd.DataFrame(cleaned_paragraphs)
  episode_df=  episode_df.rename(columns={0: "line"})
  episode_df['character_scene'] = episode_df['line'].str.split(":").str[0]
  episode_df['line'] = episode_df['line'].str.split(":").str[1]
  episode_df['episode'] = title


  #appends single episode df to full df
  season_7_a = pd.concat([season_7_a, episode_df])  #DataFrame.append was removed in pandas 2.0

#the placeholder column 'character_scene\t' (note the trailing tab in its name) never receives data, so drop it
season_7_a = season_7_a.drop('character_scene\t', axis=1)
season_7_a

Unnamed: 0,line,episode,character_scene
0,"On the deck of a ship on the North Sea, in th...",Series 07 Episode 01 – The Hofstadter Insuffic...,Scene
1,"Sheldon, it’s not a great time, what do you w...",Series 07 Episode 01 – The Hofstadter Insuffic...,Leonard
2,,Series 07 Episode 01 – The Hofstadter Insuffic...,Sheldon (in the apartment)
3,What is it?,Series 07 Episode 01 – The Hofstadter Insuffic...,Leonard
4,Back to the Future II was in the Back to the ...,Series 07 Episode 01 – The Hofstadter Insuffic...,Sheldon
...,...,...,...
233,Hmm. It never occurred to me to pick a favour...,Series 07 Episode 15 – The Locomotive Manipula...,Sheldon
234,"Well, give it a go.",Series 07 Episode 15 – The Locomotive Manipula...,Leonard
235,I can’t answer that without collecting additi...,Series 07 Episode 15 – The Locomotive Manipula...,Sheldon
236,Additional data. You dog.,Series 07 Episode 15 – The Locomotive Manipula...,Leonard


### 2.5 Season 7B

In [254]:
#grab the html of a site
import requests

season_7_b =  pd.DataFrame(columns=['line', 'character_scene	', 'episode'])

season_7_urls_b = cleaned_urls[150:159]

for url in season_7_urls_b :
  result = requests.get(url)

  #fetch the html of the url
  doc = BeautifulSoup(result.text, 'html.parser')
  #extract the episode title
  title = doc.title.string
  #print(title)

  #find all <p> elements, which hold the transcript lines
  paragraphs = doc.find_all('p')

  #convert the paragraphs to strings 
  string_paragraphs = []
  for x in paragraphs:
    string_paragraphs.append(str(x))
  #print(string_paragraphs)

  #clean the paragraphs
  cleaned_paragraphs = []
  for x in string_paragraphs:
    cleaned_paragraph = find_between(x, '<p>', "</")
    cleaned_paragraph = cleaned_paragraph.replace('<em>', "")
    cleaned_paragraphs.append(cleaned_paragraph )
  cleaned_paragraphs = [x for x in cleaned_paragraphs if x]  #remove empty strings once, after the loop
  #print(cleaned_paragraphs)

  #create df of cleaned paragraphs
  episode_df = pd.DataFrame(cleaned_paragraphs)
  episode_df=  episode_df.rename(columns={0: "line"})
  episode_df['character_scene'] = episode_df['line'].str.split(":").str[0]
  episode_df['line'] = episode_df['line'].str.split(":").str[1]
  episode_df['episode'] = title


  #appends single episode df to full df
  season_7_b = pd.concat([season_7_b, episode_df])  #DataFrame.append was removed in pandas 2.0

#the placeholder column 'character_scene\t' (note the trailing tab in its name) never receives data, so drop it
season_7_b = season_7_b.drop('character_scene\t', axis=1)
season_7_b

Unnamed: 0,line,episode,character_scene
0,The apartment,Series 07 Episode 16 – The Table Polarisation ...,Scene
1,I’m thinking about growing a goatee.,Series 07 Episode 16 – The Table Polarisation ...,Howard
2,"Oh, actually that’s a Van Dyke. A goatee is j...",Series 07 Episode 16 – The Table Polarisation ...,Raj
3,"Oh. Wait, then what is it if you just have ha...",Series 07 Episode 16 – The Table Polarisation ...,Leonard
4,You mean a moo-stache?,Series 07 Episode 16 – The Table Polarisation ...,Raj
...,...,...,...
267,I really think this is gonna be for the best.,Series 07 Episode 24 – The Status Quo Combusti...,Penny
268,"Me, too. And he was able to take a sabbatical...",Series 07 Episode 24 – The Status Quo Combusti...,Leonard
269,How could you let him go?,Series 07 Episode 24 – The Status Quo Combusti...,Amy
270,//www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&...,Series 07 Episode 24 – The Status Quo Combusti...,"<a href=""https"


### 2.6 Season 10

In [46]:
#grab the html of a site
import requests

season_10 =  pd.DataFrame(columns=['line', 'character_scene	', 'episode'])

season_10_urls =  cleaned_urls[-24:]  #season 10 also has a different html structure

for url in season_10_urls:
  result = requests.get(url)

  #fetch the html of the url
  doc = BeautifulSoup(result.text, 'html.parser')
  #extract the episode title
  title = doc.title.string
  #print(title)

  #find all <p> elements, which hold the transcript lines
  paragraphs = doc.find_all('p')

  #convert the paragraphs to strings 
  string_paragraphs = []
  for x in paragraphs:
    string_paragraphs.append(str(x))
  #print(string_paragraphs)

  #clean the paragraphs
  cleaned_paragraphs = []
  for x in string_paragraphs:
      cleaned_paragraph = find_between(x, 'sans-serif;"', "</")
      cleaned_paragraph = cleaned_paragraph.replace('<i>', "")
      cleaned_paragraph = cleaned_paragraph.replace('>', "")
      cleaned_paragraphs.append(cleaned_paragraph )
  cleaned_paragraphs = [x for x in cleaned_paragraphs if x]  #remove empty strings once, after the loop
  #print(cleaned_paragraphs)

  #create df of cleaned paragraphs
  episode_df = pd.DataFrame(cleaned_paragraphs)
  episode_df=  episode_df.rename(columns={0: "line"})
  episode_df['character_scene'] = episode_df['line'].str.split(":").str[0]
  episode_df['line'] = episode_df['line'].str.split(":").str[1]
  episode_df['episode'] = title


  #appends single episode df to full df
  season_10 = pd.concat([season_10, episode_df])  #DataFrame.append was removed in pandas 2.0

#the placeholder column 'character_scene\t' (note the trailing tab in its name) never receives data, so drop it
season_10 = season_10.drop('character_scene\t', axis=1)
season_10

Unnamed: 0,line,episode,character_scene
0,,Series 10 Episode 01 – The Conjugal Conjecture...,Following a “previously on” sequence.
1,Leonard and Penny’s bedroom.,Series 10 Episode 01 – The Conjugal Conjecture...,Sceme
2,Leonard? Leonard?,Series 10 Episode 01 – The Conjugal Conjecture...,Sheldon
3,What?,Series 10 Episode 01 – The Conjugal Conjecture...,Leonard
4,You realize you and I could become brothers.,Series 10 Episode 01 – The Conjugal Conjecture...,Sheldon
...,...,...,...
234,"And I with you. Question, are you seeking a r...",Series 10 Episode 24 – The Long Distance Disso...,Sheldon
235,What if I were?,Series 10 Episode 24 – The Long Distance Disso...,Ramona
236,"Well, that would raise a number of problems. ...",Series 10 Episode 24 – The Long Distance Disso...,Sheldon
237,Princeton.,Series 10 Episode 24 – The Long Distance Disso...,Scene
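
The five scraping cells above differ only in the URL slice, the two marker strings passed to `find_between`, and the tags that are stripped afterwards. One way to avoid the repetition is sketched below; `parse_episode` and its parameters are hypothetical names, and the inline HTML is a simplified stand-in for a real episode page so that the example runs without network access.

```python
import pandas as pd
from bs4 import BeautifulSoup

def find_between(s, first, last):
    """Return the text between `first` and the next `last`, or "" if either is missing."""
    try:
        start = s.index(first) + len(first)
        return s[start:s.index(last, start)]
    except ValueError:
        return ""

def parse_episode(html, first, last, strip_tags=("<em>", "</em>")):
    """Hypothetical helper: turn the HTML of one episode page into a DataFrame."""
    doc = BeautifulSoup(html, "html.parser")
    title = doc.title.string if doc.title else ""
    lines = []
    for p in doc.find_all("p"):
        text = find_between(str(p), first, last)
        for tag in strip_tags:          # remove leftover inline tags
            text = text.replace(tag, "")
        if text:                        # skip paragraphs where the markers were not found
            lines.append(text)
    df = pd.DataFrame({"line": pd.Series(lines, dtype="object")})
    split = df["line"].str.split(":")   # same splitting rule as the cells above
    df["character_scene"] = split.str[0]
    df["line"] = split.str[1]
    df["episode"] = title
    return df

# Simplified stand-in for an episode page
sample_html = (
    "<html><head><title>Series 02 Episode 02 – The Codpiece Topology</title></head><body>"
    "<p>Scene: The building entrance lobby.</p>"
    "<p>Sheldon: Worst Renaissance Fair ever.</p>"
    "<p><em>Leonard</em>: Please let it go, Sheldon.</p>"
    "</body></html>"
)
episode_df = parse_episode(sample_html, "<p>", "</p>")
print(episode_df)
```

Each season would then call the helper with its own markers, for example `'Calibri;">'` and `"</span>"` for Season 1, collect the per-episode frames in a list, and combine them with `pd.concat`.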


### 2.7 Append and save

In [255]:
#combine all seasons (DataFrame.append was removed in pandas 2.0, so use pd.concat)
episodes = pd.concat([
    season_1,         #season 1 (plus the first episode of season 2)
    seasons_except7,  #seasons 2, 3, 4, 5, 6, 8, 9
    season_7_a,       #season 7a
    season_7_b,       #season 7b
    season_10,        #season 10
])
episodes

Unnamed: 0,line,episode,character_scene
0,A corridor at a sperm bank.,Series 01 Episode 01 – Pilot Episode | Big Ban...,Scene
1,So if a photon is directed through a plane wi...,Series 01 Episode 01 – Pilot Episode | Big Ban...,Sheldon
2,"Agreed, what’s your point?",Series 01 Episode 01 – Pilot Episode | Big Ban...,Leonard
3,"There’s no point, I just think it’s a good id...",Series 01 Episode 01 – Pilot Episode | Big Ban...,Sheldon
4,Excuse me?,Series 01 Episode 01 – Pilot Episode | Big Ban...,Leonard
...,...,...,...
234,"And I with you. Question, are you seeking a r...",Series 10 Episode 24 – The Long Distance Disso...,Sheldon
235,What if I were?,Series 10 Episode 24 – The Long Distance Disso...,Ramona
236,"Well, that would raise a number of problems. ...",Series 10 Episode 24 – The Long Distance Disso...,Sheldon
237,Princeton.,Series 10 Episode 24 – The Long Distance Disso...,Scene
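
The combined frame keeps each episode's own row numbers, which is why the last rows above are labelled 234 to 237 rather than counting up across all episodes. If a unique running index is wanted, `pd.concat` with `ignore_index=True` is one option. The sketch below uses two tiny hand-made frames as stand-ins for the per-season frames.

```python
import pandas as pd

# Hand-made stand-ins for two per-season frames
season_a = pd.DataFrame({
    "line": ["Excuse me?", "Agreed, what’s your point?"],
    "episode": ["Series 01 Episode 01", "Series 01 Episode 01"],
    "character_scene": ["Leonard", "Leonard"],
})
season_b = pd.DataFrame({
    "line": ["What?"],
    "episode": ["Series 10 Episode 01"],
    "character_scene": ["Leonard"],
})

# ignore_index=True discards the per-frame row labels and numbers the rows 0..n-1
combined = pd.concat([season_a, season_b], ignore_index=True)

print(list(combined.index))  # [0, 1, 2]
```

Since the CSV is written with `index=False`, this only affects how the frame looks and behaves inside the notebook.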


In [256]:
#save  dataframe 
episodes.to_csv("raw_episodes.csv", index=False)