<a href="https://colab.research.google.com/github/BrendaLoznik/BigBangTheory/blob/main/Bang_webscraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Data source: https://bigbangtrans.wordpress.com/ (contains 10 of the 12 seasons)

# 1 Housekeeping

### 1.1 Import libraries

In [78]:
#basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from bs4 import BeautifulSoup

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 200)

### 1.2 Connect to drive

In [79]:
from google.colab import drive 
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### 1.3 Custom functions

In [80]:
#helper that returns the substring between two markers (plain string search, no regex)
def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""    
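
A quick check of how `find_between` behaves, using an anchor string of the kind this notebook scrapes. The helper is repeated so the snippet runs on its own; the sample string is only illustrative.

```python
def find_between(s, first, last):
    """Return the text between `first` and the next `last`, or "" if either is missing."""
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        return ""

anchor = '<a href="https://bigbangtrans.wordpress.com/about/">About</a>'

# Both delimiters are present: the text between them is returned.
about_url = find_between(anchor, 'href="', '">')

# A delimiter is missing: str.index raises ValueError, which the helper turns into "".
no_match = find_between(anchor, 'href="', '">Series')

print(about_url)
print(repr(no_match))
```

Returning an empty string for non-matching input is what later allows non-episode links to be filtered out with a simple truthiness check.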

# 2 Scraping

### 2.1 Scrape URLs

In [81]:
#grab the html of a site
import requests
url = 'https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/'
result = requests.get(url)


In [82]:
#parse the response HTML with BeautifulSoup
doc = BeautifulSoup(result.text, 'html.parser')

In [83]:
#retrieve all <a> tags and convert them to strings
htmls = doc.find_all('a')
html_strings = []
for x in htmls:
    html_strings.append(str(x))

html_strings[5:10]

['<a href="https://bigbangtrans.wordpress.com/about/">About</a>',
 '<a aria-current="page" href="https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/">Series 01 Episode 01 – Pilot\xa0Episode</a>',
 '<a href="https://bigbangtrans.wordpress.com/series-1-episode-2-the-big-bran-hypothesis/">Series 01 Episode 02 – The Big Bran\xa0Hypothesis</a>',
 '<a href="https://bigbangtrans.wordpress.com/series-1-episode-3-the-fuzzy-boots-corollary/">Series 01 Episode 03 – The Fuzzy Boots\xa0Corollary</a>',
 '<a href="https://bigbangtrans.wordpress.com/series-1-episode-4-the-luminous-fish-effect/">Series 01 Episode 04 – The Luminous Fish\xa0Effect</a>']

In [84]:
#create a list of cleaned urls
cleaned_urls = []
for x in html_strings:
  cleaned_html = find_between(x, 'href="', '">Series')
  if cleaned_html:  #skip links that are not episode pages (empty result)
    cleaned_urls.append(cleaned_html)

cleaned_urls[0:10]

['https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/',
 'https://bigbangtrans.wordpress.com/series-1-episode-2-the-big-bran-hypothesis/',
 'https://bigbangtrans.wordpress.com/series-1-episode-3-the-fuzzy-boots-corollary/',
 'https://bigbangtrans.wordpress.com/series-1-episode-4-the-luminous-fish-effect/',
 'https://bigbangtrans.wordpress.com/series-1-episode-5-the-hamburger-postulate/',
 'https://bigbangtrans.wordpress.com/series-1-episode-6-the-middle-earth-paradigm/',
 'https://bigbangtrans.wordpress.com/series-1-episode-7-the-dumpling-paradox/',
 'https://bigbangtrans.wordpress.com/series-1-episode-8-the-grasshopper-experiment/',
 'https://bigbangtrans.wordpress.com/series-1-episode-9-the-cooper-hofstadter-polarization/',
 'https://bigbangtrans.wordpress.com/series-1-episode-10-the-loobenfeld-decay/']
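
Every episode URL appears to encode its season number, so the list could also be grouped by season with a regular expression instead of relying on list positions. The following is a minimal sketch under that assumption; `SEASON_PATTERN`, `group_by_season` and the sample list are hypothetical names and data, not part of the original notebook.

```python
import re
from collections import defaultdict

# Small illustrative subset in the same shape as the scraped list.
sample_urls = [
    "https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/",
    "https://bigbangtrans.wordpress.com/series-1-episode-2-the-big-bran-hypothesis/",
    "https://bigbangtrans.wordpress.com/series-6-episode-01-the-date-night-variable/",
    "https://bigbangtrans.wordpress.com/series-6-episode-04-the-re-entry-minimization/",
]

# Assumption: every episode URL contains "series-<season>-episode-<number>".
SEASON_PATTERN = re.compile(r"series-(\d+)-episode-(\d+)")

def group_by_season(urls):
    """Map season number -> list of URLs, keeping the original order."""
    grouped = defaultdict(list)
    for url in urls:
        match = SEASON_PATTERN.search(url)
        if match:  # skip anything that does not look like an episode URL
            grouped[int(match.group(1))].append(url)
    return dict(grouped)

urls_by_season = group_by_season(sample_urls)
print(sorted(urls_by_season))
print(len(urls_by_season[6]))
```

Grouping by the number in the URL would also make it easier to spot episodes that are missing from the link list.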

### 2.2 Season 1

The HTML structure of Season 1 differs from that of the other seasons, so I extract it separately.

In [85]:
#grab the html of a site
import requests

season_1 =  pd.DataFrame(columns=['line', 'character_scene	', 'episode'])

season1_urls = cleaned_urls[0:18] #season 1 + first episode of season 2; from the second episode of season 2 onwards this approach raises an error

for url in season1_urls:
  result = requests.get(url)

  #fetch the html of the url
  doc = BeautifulSoup(result.text, 'html.parser')
  #extract the episode title
  title = doc.title.string
  #print(title)

  #find all <p> tags; the transcript lines are inside them
  paragraphs = doc.find_all('p')

  #convert the paragraphs to strings 
  string_paragraphs = []
  for x in paragraphs:
    string_paragraphs.append(str(x))
  #print(string_paragraphs)

  #clean the paragraphs
  cleaned_paragraphs = []
  for x in string_paragraphs:
    cleaned_paragraph = find_between(x, 'Calibri;">', "</span>")
    cleaned_paragraph = cleaned_paragraph.replace('<em>', "")  #some weird strings remain that we can remove
    cleaned_paragraph = cleaned_paragraph.replace('</em>', "")
    cleaned_paragraphs.append(cleaned_paragraph )
    cleaned_paragraphs = [x for x in cleaned_paragraphs if x]  #this removes empty strings from the list
  #print(cleaned_paragraphs)

  #create df of cleaned paragraphs
  episode_df = pd.DataFrame(cleaned_paragraphs)
  episode_df.head(20)
  episode_df=  episode_df.rename(columns={0: "line"})
  episode_df['character_scene'] = episode_df['line'].str.split(":").str[0]
  episode_df['line'] = episode_df['line'].str.split(":").str[1]
  episode_df['episode'] = title


  #appends single episode df to full df
  season_1 = season_1.append(episode_df)

#the placeholder column 'character_scene\t' (note the trailing tab in its name) is never filled, so we can drop it
season_1 = season_1.drop('character_scene\t', axis=1)
season_1

Unnamed: 0,line,episode,character_scene
0,A corridor at a sperm bank.,Series 01 Episode 01 – Pilot Episode | Big Ban...,Scene
1,So if a photon is directed through a plane wi...,Series 01 Episode 01 – Pilot Episode | Big Ban...,Sheldon
2,"Agreed, what’s your point?",Series 01 Episode 01 – Pilot Episode | Big Ban...,Leonard
3,"There’s no point, I just think it’s a good id...",Series 01 Episode 01 – Pilot Episode | Big Ban...,Sheldon
4,Excuse me?,Series 01 Episode 01 – Pilot Episode | Big Ban...,Leonard
...,...,...,...
244,"Well, it’s really not that fancy, it’s just a...",Series 02 Episode 01 – The Bad Fish Paradigm |...,Leonard
245,"Right, but I have to have some sort of degree...",Series 02 Episode 01 – The Bad Fish Paradigm |...,Penny
246,That doesn’t matter to me at all.,Series 02 Episode 01 – The Bad Fish Paradigm |...,Leonard
247,"So, it’s fine with you if I’m not smart.",Series 02 Episode 01 – The Bad Fish Paradigm |...,Penny
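
One caveat of `str.split(":").str[1]`: it keeps only the text between the first and second colon, so a line of dialogue that itself contains a colon gets truncated. A minimal sketch of the difference in plain Python, using a made-up line; with pandas, passing `n=1` to `Series.str.split` should limit the split in the same way.

```python
# Hypothetical transcript line whose dialogue contains a second colon.
raw = "Sheldon: Here is the rule: no whistling."

# What the approach above keeps: only the piece between the first two colons.
truncated = raw.split(":")[1]

# Splitting on the first colon only keeps the whole line of dialogue.
speaker, _, dialogue = raw.partition(":")
dialogue = dialogue.strip()

print(repr(truncated))
print(repr(speaker), repr(dialogue))
```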


### 2.3 Seasons 2-6 and 8-9

In [86]:
#grab the html of a site
import requests

seasons_except7 =  pd.DataFrame(columns=['line', 'character_scene	', 'episode'])

#create list of urls
season2to6_url = cleaned_urls[18:135]
season8to9_url = cleaned_urls[159:-24]
seasons_except7_url = season2to6_url + season8to9_url  #create a list of seasons 2, 3, 4, 5, 6, 8 and 9

for url in seasons_except7_url:
  result = requests.get(url)

  #fetch the html of the url
  doc = BeautifulSoup(result.text, 'html.parser')
  #extract the episode title
  title = doc.title.string
  #print(title)

  #find all <p> tags; the transcript lines are inside them
  paragraphs = doc.find_all('p')

  #convert the paragraphs to strings 
  string_paragraphs = []
  for x in paragraphs:
    string_paragraphs.append(str(x))
  #print(string_paragraphs)

  #clean the paragraphs
  cleaned_paragraphs = []
  for x in string_paragraphs:
    cleaned_paragraph = find_between(x, '<p>', "</p>")  #if you use > and <, the scene is somehow lost
    cleaned_paragraph = cleaned_paragraph.replace('<em>', "")  #some weird strings remain that we can remove
    cleaned_paragraph = cleaned_paragraph.replace('</em>', "")
    cleaned_paragraphs.append(cleaned_paragraph )
    cleaned_paragraphs = [x for x in cleaned_paragraphs if x]  #this removes empty strings from the list
  #print(cleaned_paragraphs)

  #create df of cleaned paragraphs
  episode_df = pd.DataFrame(cleaned_paragraphs)
  episode_df=  episode_df.rename(columns={0: "line"})
  episode_df['character_scene'] = episode_df['line'].str.split(":").str[0]
  episode_df['line'] = episode_df['line'].str.split(":").str[1]
  episode_df['episode'] = title


  #appends single episode df to full df
  seasons_except7 = seasons_except7.append(episode_df)

#a completely empty column is returned that we can drop
seasons_except7 = seasons_except7.drop('character_scene\t', axis=1)
seasons_except7

Unnamed: 0,line,episode,character_scene
0,The building entrance lobby. The guys enter. ...,Series 02 Episode 02 – The Codpiece Topology |...,Scene
1,Worst Renaissance Fair ever.,Series 02 Episode 02 – The Codpiece Topology |...,Sheldon
2,"Please let it go, Sheldon.",Series 02 Episode 02 – The Codpiece Topology |...,Leonard
3,It was rife with historical inaccuracies. For...,Series 02 Episode 02 – The Codpiece Topology |...,Sheldon
4,You’re nitpicking.,Series 02 Episode 02 – The Codpiece Topology |...,Leonard
...,...,...,...
279,I don’t like this at all.,Series 09 Episode 24 – The Convergence Converg...,Sheldon
280,I don’t like it either.,Series 09 Episode 24 – The Convergence Converg...,Leonard
281,Really? ‘Cause I love it.,Series 09 Episode 24 – The Convergence Converg...,Penny
282,//www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&...,Series 09 Episode 24 – The Convergence Converg...,"<a href=""https"
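
The last row above is not dialogue: it is leftover link markup from the page footer that slipped through the cleaning step. One possible filter, sketched here on a plain list, is to drop any cleaned paragraph that still starts with an HTML tag. `is_transcript_line` is a hypothetical helper, and the placeholder URL stands in for the real one, which is truncated in the output.

```python
# Illustrative cleaned paragraphs; the last entry mimics the leftover link markup.
cleaned_paragraphs = [
    "Sheldon: I don’t like this at all.",
    "Leonard: I don’t like it either.",
    "Penny: Really? ‘Cause I love it.",
    '<a href="https://example.com/placeholder">',
]

def is_transcript_line(text):
    """Heuristic (an assumption): real transcript lines never start with an HTML tag."""
    return not text.lstrip().startswith("<")

transcript_only = [p for p in cleaned_paragraphs if is_transcript_line(p)]
print(len(transcript_only))
```

The same residue shows up wherever the `'<p>'` start marker is used, so the filter would apply there as well.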


### 2.4 Season 7A

In [87]:
#grab the html of a site
import requests

season_7_a =  pd.DataFrame(columns=['line', 'character_scene	', 'episode'])

season_7_urls_a =   cleaned_urls[135: -81]

for url in season_7_urls_a :
  result = requests.get(url)

  #fetch the html of the url
  doc = BeautifulSoup(result.text, 'html.parser')
  #extract the episode title
  title = doc.title.string
  #print(title)

  #find all <p> tags; the transcript lines are inside them
  paragraphs = doc.find_all('p')

  #convert the paragraphs to strings 
  string_paragraphs = []
  for x in paragraphs:
    string_paragraphs.append(str(x))
  #print(string_paragraphs)

  #clean the paragraphs
  cleaned_paragraphs = []
  for x in string_paragraphs:
    cleaned_paragraph = find_between(x, ';">', "</")
    cleaned_paragraph = cleaned_paragraph.replace('<i>', "")
    cleaned_paragraphs.append(cleaned_paragraph )
    cleaned_paragraphs = [x for x in cleaned_paragraphs if x]  #this removes empty strings from the list
  #print(cleaned_paragraphs)

  #create df of cleaned paragraphs
  episode_df = pd.DataFrame(cleaned_paragraphs)
  episode_df=  episode_df.rename(columns={0: "line"})
  episode_df['character_scene'] = episode_df['line'].str.split(":").str[0]
  episode_df['line'] = episode_df['line'].str.split(":").str[1]
  episode_df['episode'] = title


  #appends single episode df to full df
  season_7_a = season_7_a.append(episode_df)

#a completely empty column is returned that we can drop
season_7_a = season_7_a.drop('character_scene\t', axis=1)
season_7_a

Unnamed: 0,line,episode,character_scene
0,"On the deck of a ship on the North Sea, in th...",Series 07 Episode 01 – The Hofstadter Insuffic...,Scene
1,"Sheldon, it’s not a great time, what do you w...",Series 07 Episode 01 – The Hofstadter Insuffic...,Leonard
2,,Series 07 Episode 01 – The Hofstadter Insuffic...,Sheldon (in the apartment)
3,What is it?,Series 07 Episode 01 – The Hofstadter Insuffic...,Leonard
4,Back to the Future II was in the Back to the ...,Series 07 Episode 01 – The Hofstadter Insuffic...,Sheldon
...,...,...,...
233,Hmm. It never occurred to me to pick a favour...,Series 07 Episode 15 – The Locomotive Manipula...,Sheldon
234,"Well, give it a go.",Series 07 Episode 15 – The Locomotive Manipula...,Leonard
235,I can’t answer that without collecting additi...,Series 07 Episode 15 – The Locomotive Manipula...,Sheldon
236,Additional data. You dog.,Series 07 Episode 15 – The Locomotive Manipula...,Leonard


### 2.5 Season 7B

In [88]:
#grab the html of a site
import requests

season_7_b =  pd.DataFrame(columns=['line', 'character_scene	', 'episode'])

season_7_urls_b = cleaned_urls[150:159]

for url in season_7_urls_b :
  result = requests.get(url)

  #fetch the html of the url
  doc = BeautifulSoup(result.text, 'html.parser')
  #extract the episode title
  title = doc.title.string
  #print(title)

  #find all <p> tags; the transcript lines are inside them
  paragraphs = doc.find_all('p')

  #convert the paragraphs to strings 
  string_paragraphs = []
  for x in paragraphs:
    string_paragraphs.append(str(x))
  #print(string_paragraphs)

  #clean the paragraphs
  cleaned_paragraphs = []
  for x in string_paragraphs:
    cleaned_paragraph = find_between(x, '<p>', "</")
    cleaned_paragraph = cleaned_paragraph.replace('<em>', "")
    cleaned_paragraphs.append(cleaned_paragraph )
    cleaned_paragraphs = [x for x in cleaned_paragraphs if x]  #this removes empty strings from the list
  #print(cleaned_paragraphs)

  #create df of cleaned paragraphs
  episode_df = pd.DataFrame(cleaned_paragraphs)
  episode_df=  episode_df.rename(columns={0: "line"})
  episode_df['character_scene'] = episode_df['line'].str.split(":").str[0]
  episode_df['line'] = episode_df['line'].str.split(":").str[1]
  episode_df['episode'] = title


  #appends single episode df to full df
  season_7_b = season_7_b.append(episode_df)

#a completely empty column is returned that we can drop
season_7_b = season_7_b.drop('character_scene\t', axis=1)
season_7_b

Unnamed: 0,line,episode,character_scene
0,The apartment,Series 07 Episode 16 – The Table Polarisation ...,Scene
1,I’m thinking about growing a goatee.,Series 07 Episode 16 – The Table Polarisation ...,Howard
2,"Oh, actually that’s a Van Dyke. A goatee is j...",Series 07 Episode 16 – The Table Polarisation ...,Raj
3,"Oh. Wait, then what is it if you just have ha...",Series 07 Episode 16 – The Table Polarisation ...,Leonard
4,You mean a moo-stache?,Series 07 Episode 16 – The Table Polarisation ...,Raj
...,...,...,...
267,I really think this is gonna be for the best.,Series 07 Episode 24 – The Status Quo Combusti...,Penny
268,"Me, too. And he was able to take a sabbatical...",Series 07 Episode 24 – The Status Quo Combusti...,Leonard
269,How could you let him go?,Series 07 Episode 24 – The Status Quo Combusti...,Amy
270,//www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&...,Series 07 Episode 24 – The Status Quo Combusti...,"<a href=""https"


### 2.6 Season 10

In [89]:
#grab the html of a site
import requests

season_10 =  pd.DataFrame(columns=['line', 'character_scene	', 'episode'])

season_10_urls =  cleaned_urls[-24:]  #season 10 also has a different html structure

for url in season_10_urls:
  result = requests.get(url)

  #fetch the html of the url
  doc = BeautifulSoup(result.text, 'html.parser')
  #extract the episode title
  title = doc.title.string
  #print(title)

  #find all <p> tags; the transcript lines are inside them
  paragraphs = doc.find_all('p')

  #convert the paragraphs to strings 
  string_paragraphs = []
  for x in paragraphs:
    string_paragraphs.append(str(x))
  #print(string_paragraphs)

  #clean the paragraphs
  cleaned_paragraphs = []
  for x in string_paragraphs:
      cleaned_paragraph = find_between(x, 'sans-serif;"', "</")
      cleaned_paragraph = cleaned_paragraph.replace('<i>', "")
      cleaned_paragraph = cleaned_paragraph.replace('>', "")
      cleaned_paragraphs.append(cleaned_paragraph )
      cleaned_paragraphs = [x for x in cleaned_paragraphs if x]  #this removes empty strings from the list
  #print(cleaned_paragraphs)

  #create df of cleaned paragraphs
  episode_df = pd.DataFrame(cleaned_paragraphs)
  episode_df=  episode_df.rename(columns={0: "line"})
  episode_df['character_scene'] = episode_df['line'].str.split(":").str[0]
  episode_df['line'] = episode_df['line'].str.split(":").str[1]
  episode_df['episode'] = title


  #appends single episode df to full df
  season_10 = season_10.append(episode_df)

#a completely empty column is returned that we can drop
season_10 = season_10.drop('character_scene\t', axis=1)
season_10

Unnamed: 0,line,episode,character_scene
0,,Series 10 Episode 01 – The Conjugal Conjecture...,Following a “previously on” sequence.
1,Leonard and Penny’s bedroom.,Series 10 Episode 01 – The Conjugal Conjecture...,Sceme
2,Leonard? Leonard?,Series 10 Episode 01 – The Conjugal Conjecture...,Sheldon
3,What?,Series 10 Episode 01 – The Conjugal Conjecture...,Leonard
4,You realize you and I could become brothers.,Series 10 Episode 01 – The Conjugal Conjecture...,Sheldon
...,...,...,...
234,"And I with you. Question, are you seeking a r...",Series 10 Episode 24 – The Long Distance Disso...,Sheldon
235,What if I were?,Series 10 Episode 24 – The Long Distance Disso...,Ramona
236,"Well, that would raise a number of problems. ...",Series 10 Episode 24 – The Long Distance Disso...,Sheldon
237,Princeton.,Series 10 Episode 24 – The Long Distance Disso...,Scene


# 3 Scraping continued

I later noticed that episodes 6-1 and 6-4 were missing from my dataset.

### 3.1 Episode 6-1

In [90]:
#grab the html of a site
import requests
url = 'https://bigbangtrans.wordpress.com/series-6-episode-01-the-date-night-variable/'
result = requests.get(url)

#parse the response HTML with BeautifulSoup
doc = BeautifulSoup(result.text, 'html.parser')

#extract the title from the html
titles = doc.find_all('title')

#first occurrence of the title tag
tag = doc.title
print(tag.string)

Series 06 Episode 01 – The Date Night Variable | Big Bang Theory Transcripts
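
The page title carries the season number, the episode number and the episode name in one string. If separate columns are wanted later, a regular expression along the following lines might be enough. This is a sketch that assumes every title follows the pattern seen above; `TITLE_PATTERN` and `parse_title` are hypothetical names.

```python
import re

title = "Series 06 Episode 01 – The Date Night Variable | Big Bang Theory Transcripts"

# Assumption: titles follow "Series NN Episode NN – <name> | <site name>".
TITLE_PATTERN = re.compile(r"Series (\d+) Episode (\d+)\s+–\s+(.*?)\s+\|")

def parse_title(page_title):
    """Return (season, episode, name), or None if the title has another shape."""
    match = TITLE_PATTERN.match(page_title)
    if match is None:
        return None
    season, episode, name = match.groups()
    return int(season), int(episode), name

print(parse_title(title))
```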


In [91]:
#find all p-tags
tags = doc.find_all('p') #collect all 'p' tags

In [92]:
p =doc.find_all('p')
paragraphs = []
for x in p:
    paragraphs.append(str(x))

paragraphs[0:10]

['<p class="western"><span style="font-size:small;">Following a “Previously on The Big Bang Theory” section</span></p>',
 '<p class="western"><span style="font-size:small;"><em>Scene: The Comic Book Store.</em></span></p>',
 '<p class="western"><span style="font-size:small;">Stuart: So, Howard’s really in space, huh?</span></p>',
 '<p class="western"><span style="font-size:small;">Leonard: Mm-hmm, International Space Station. 250 miles that way.</span></p>',
 '<p class="western"><span style="font-size:small;">Raj: Right now, Howard’s staring down at our planet like a tiny Jewish Greek god. Zeusowitz.</span></p>',
 '<p class="western"><span style="font-size:small;">Sheldon: I must admit, I can’t help but feel a twinge of envy. He can look out the window and see the majesty of the universe unfolding before his eyes. His dim, uncomprehending eyes. It’s like a cat in an airport carrying case.</span></p>',
 '<p class="western"><span style="font-size:small;">Leonard: You know, it’s not exact

In [93]:
#nice: the function works on the list
cleaned_paragraphs = []
for x in paragraphs:
  cleaned_line = find_between(x, 'small;">', "</")
  cleaned_line = cleaned_line.replace('<em>', "")
  cleaned_paragraphs.append(cleaned_line)

cleaned_paragraphs[0:10]

['Following a “Previously on The Big Bang Theory” section',
 'Scene: The Comic Book Store.',
 'Stuart: So, Howard’s really in space, huh?',
 'Leonard: Mm-hmm, International Space Station. 250 miles that way.',
 'Raj: Right now, Howard’s staring down at our planet like a tiny Jewish Greek god. Zeusowitz.',
 'Sheldon: I must admit, I can’t help but feel a twinge of envy. He can look out the window and see the majesty of the universe unfolding before his eyes. His dim, uncomprehending eyes. It’s like a cat in an airport carrying case.',
 'Leonard: You know, it’s not exactly glamorous up there. The water that the astronauts drink is made from each other’s recycled urine.',
 'Stuart: Must be nice. Nobody wants anything that comes out of me.',
 'Raj: I wonder what he’s doing right this very second.',
 'Leonard: Mm, conducting experiments in zero gravity.']

In [94]:
#create df of cleaned paragraphs
epsidode_6_1 = pd.DataFrame(cleaned_paragraphs)
epsidode_6_1= epsidode_6_1.rename(columns={0: "line"})
epsidode_6_1['character_scene'] = epsidode_6_1['line'].str.split(":").str[0]
epsidode_6_1['line'] = epsidode_6_1['line'].str.split(":").str[1]

#add episode title
title = doc.title
epsidode_6_1['episode'] = title.string

epsidode_6_1.head(10)

Unnamed: 0,line,character_scene,episode
0,,Following a “Previously on The Big Bang Theory...,Series 06 Episode 01 – The Date Night Variable...
1,The Comic Book Store.,Scene,Series 06 Episode 01 – The Date Night Variable...
2,"So, Howard’s really in space, huh?",Stuart,Series 06 Episode 01 – The Date Night Variable...
3,"Mm-hmm, International Space Station. 250 mile...",Leonard,Series 06 Episode 01 – The Date Night Variable...
4,"Right now, Howard’s staring down at our plane...",Raj,Series 06 Episode 01 – The Date Night Variable...
5,"I must admit, I can’t help but feel a twinge ...",Sheldon,Series 06 Episode 01 – The Date Night Variable...
6,"You know, it’s not exactly glamorous up there...",Leonard,Series 06 Episode 01 – The Date Night Variable...
7,Must be nice. Nobody wants anything that come...,Stuart,Series 06 Episode 01 – The Date Night Variable...
8,I wonder what he’s doing right this very second.,Raj,Series 06 Episode 01 – The Date Night Variable...
9,"Mm, conducting experiments in zero gravity.",Leonard,Series 06 Episode 01 – The Date Night Variable...


### 3.2 Episode 6-4

In [95]:
#grab the html of a site
import requests
url = 'https://bigbangtrans.wordpress.com/series-6-episode-04-the-re-entry-minimization/'
result = requests.get(url)

#parse the response HTML with BeautifulSoup
doc = BeautifulSoup(result.text, 'html.parser')

#extract the title from the html
titles = doc.find_all('title')

#first occurrence of the title tag
tag = doc.title
print(tag.string)

Series 06 Episode 04 – The Re-Entry Minimization | Big Bang Theory Transcripts


In [96]:
#find all p-tags
tags = doc.find_all('p') #print all 'p' tags
print(tags)

[<p class="western"><em>Scene: The Apartment.</em></p>, <p class="western">Raj: Howard’s capsule should be re-entering the atmosphere any minute.</p>, <p class="western">Leonard: It’ll be good to have him back.</p>, <p class="western">Raj: The Fantastic Four reunited.</p>, <p class="western">Sheldon: Yeah, you had a good run, Fake Wolowitz. We’ll remember you with nostalgic fondness, the way we do the dial-up modem, the VHS tape, or, or Leonard’s gym membership.</p>, <p class="western">Raj: We’re not kicking him out. Stuart and I have become good friends.</p>, <p class="western">Sheldon: Okay, one vote for, one vote against. Leonard, you’re the tiebreaker.</p>, <p class="western">Leonard: I don’t have a problem with Stuart. Besides, he gives us a twenty percent discount at his comic book store.</p>, <p class="western">Sheldon: Well, I don’t sell my friendship that cheaply.</p>, <p class="western">Stuart: I can go thirty.</p>, <p class="western">Sheldon: Welcome aboard, old chum.</p>, <

In [97]:
p =doc.find_all('p')
paragraphs = []
for x in p:
    paragraphs.append(str(x))

paragraphs[0:10]

['<p class="western"><em>Scene: The Apartment.</em></p>',
 '<p class="western">Raj: Howard’s capsule should be re-entering the atmosphere any minute.</p>',
 '<p class="western">Leonard: It’ll be good to have him back.</p>',
 '<p class="western">Raj: The Fantastic Four reunited.</p>',
 '<p class="western">Sheldon: Yeah, you had a good run, Fake Wolowitz. We’ll remember you with nostalgic fondness, the way we do the dial-up modem, the VHS tape, or, or Leonard’s gym membership.</p>',
 '<p class="western">Raj: We’re not kicking him out. Stuart and I have become good friends.</p>',
 '<p class="western">Sheldon: Okay, one vote for, one vote against. Leonard, you’re the tiebreaker.</p>',
 '<p class="western">Leonard: I don’t have a problem with Stuart. Besides, he gives us a twenty percent discount at his comic book store.</p>',
 '<p class="western">Sheldon: Well, I don’t sell my friendship that cheaply.</p>',
 '<p class="western">Stuart: I can go thirty.</p>']

In [98]:
#nice: the function works on the list
cleaned_paragraphs = []
for x in paragraphs:
  cleaned_line = find_between(x, '"western"', "</")
  cleaned_line = cleaned_line.replace('<em>', "")
  cleaned_line = cleaned_line.replace('>', "")
  cleaned_paragraphs.append(cleaned_line)

cleaned_paragraphs[0:10]

['Scene: The Apartment.',
 'Raj: Howard’s capsule should be re-entering the atmosphere any minute.',
 'Leonard: It’ll be good to have him back.',
 'Raj: The Fantastic Four reunited.',
 'Sheldon: Yeah, you had a good run, Fake Wolowitz. We’ll remember you with nostalgic fondness, the way we do the dial-up modem, the VHS tape, or, or Leonard’s gym membership.',
 'Raj: We’re not kicking him out. Stuart and I have become good friends.',
 'Sheldon: Okay, one vote for, one vote against. Leonard, you’re the tiebreaker.',
 'Leonard: I don’t have a problem with Stuart. Besides, he gives us a twenty percent discount at his comic book store.',
 'Sheldon: Well, I don’t sell my friendship that cheaply.',
 'Stuart: I can go thirty.']
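
Because each season needs its own pair of delimiters, a delimiter-free alternative is to strip everything that looks like a tag. The sketch below swaps in a regular expression for `find_between`; it is a rough heuristic rather than a real HTML parser, and `strip_tags` is a hypothetical helper. With BeautifulSoup already loaded, calling `get_text()` on each `<p>` tag would likely be the more robust route.

```python
import re

# Two paragraph strings copied from the outputs above.
paragraphs = [
    '<p class="western"><em>Scene: The Apartment.</em></p>',
    '<p class="western">Stuart: I can go thirty.</p>',
]

# Swapped-in technique: remove anything of the form <...>.
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_tags(html_fragment):
    """Delete tag-like substrings and trim surrounding whitespace."""
    return TAG_PATTERN.sub("", html_fragment).strip()

stripped = [strip_tags(p) for p in paragraphs]
print(stripped)
```

Unlike `replace('>', "")`, this leaves a literal `>` inside the dialogue untouched as long as it is not part of a tag-like sequence.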

In [99]:
#create df of cleaned paragraphs
epsidode_6_4 = pd.DataFrame(cleaned_paragraphs)
epsidode_6_4= epsidode_6_4.rename(columns={0: "line"})
epsidode_6_4['character_scene'] = epsidode_6_4['line'].str.split(":").str[0]
epsidode_6_4['line'] = epsidode_6_4['line'].str.split(":").str[1]

#add episode title
title = doc.title
epsidode_6_4['episode'] = title.string

epsidode_6_4.head(10)

Unnamed: 0,line,character_scene,episode
0,The Apartment.,Scene,Series 06 Episode 04 – The Re-Entry Minimizati...
1,Howard’s capsule should be re-entering the at...,Raj,Series 06 Episode 04 – The Re-Entry Minimizati...
2,It’ll be good to have him back.,Leonard,Series 06 Episode 04 – The Re-Entry Minimizati...
3,The Fantastic Four reunited.,Raj,Series 06 Episode 04 – The Re-Entry Minimizati...
4,"Yeah, you had a good run, Fake Wolowitz. We’l...",Sheldon,Series 06 Episode 04 – The Re-Entry Minimizati...
5,We’re not kicking him out. Stuart and I have ...,Raj,Series 06 Episode 04 – The Re-Entry Minimizati...
6,"Okay, one vote for, one vote against. Leonard...",Sheldon,Series 06 Episode 04 – The Re-Entry Minimizati...
7,"I don’t have a problem with Stuart. Besides, ...",Leonard,Series 06 Episode 04 – The Re-Entry Minimizati...
8,"Well, I don’t sell my friendship that cheaply.",Sheldon,Series 06 Episode 04 – The Re-Entry Minimizati...
9,I can go thirty.,Stuart,Series 06 Episode 04 – The Re-Entry Minimizati...


# 4 Append and save

In [101]:
episodes = season_1.append(seasons_except7)  #append season 2, 3, 4, 5, 6, 8, 9
episodes = episodes.append(epsidode_6_1) #appends episode 6-1
episodes = episodes.append(epsidode_6_4)
episodes = episodes.append(season_7_a)
episodes = episodes.append(season_7_b) #append season 7b
episodes = episodes.append(season_10) #append season 10
episodes

Unnamed: 0,line,episode,character_scene
0,A corridor at a sperm bank.,Series 01 Episode 01 – Pilot Episode | Big Ban...,Scene
1,So if a photon is directed through a plane wi...,Series 01 Episode 01 – Pilot Episode | Big Ban...,Sheldon
2,"Agreed, what’s your point?",Series 01 Episode 01 – Pilot Episode | Big Ban...,Leonard
3,"There’s no point, I just think it’s a good id...",Series 01 Episode 01 – Pilot Episode | Big Ban...,Sheldon
4,Excuse me?,Series 01 Episode 01 – Pilot Episode | Big Ban...,Leonard
...,...,...,...
234,"And I with you. Question, are you seeking a r...",Series 10 Episode 24 – The Long Distance Disso...,Sheldon
235,What if I were?,Series 10 Episode 24 – The Long Distance Disso...,Ramona
236,"Well, that would raise a number of problems. ...",Series 10 Episode 24 – The Long Distance Disso...,Sheldon
237,Princeton.,Series 10 Episode 24 – The Long Distance Disso...,Scene


In [102]:
#save  dataframe 
episodes.to_csv("raw_episodes2.csv", index=False, sep = '=')
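
Since the file is written with `=` as the separator, it has to be read back with the same separator, presumably via `pd.read_csv("raw_episodes2.csv", sep="=")`. The round trip can be illustrated with the standard-library `csv` module and an in-memory buffer standing in for the file; the two rows are illustrative sample data.

```python
import csv
import io

rows = [
    {"line": "Excuse me?", "episode": "Series 01 Episode 01", "character_scene": "Leonard"},
    {"line": "Agreed, what’s your point?", "episode": "Series 01 Episode 01", "character_scene": "Leonard"},
]

# Write with "=" as the delimiter into an in-memory buffer (standing in for the file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["line", "episode", "character_scene"], delimiter="=")
writer.writeheader()
writer.writerows(rows)

# Read it back with the same delimiter; the comma inside the dialogue is left alone.
buffer.seek(0)
restored = list(csv.DictReader(buffer, delimiter="="))
print(restored[1]["line"])
```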