
Data source: https://bigbangtrans.wordpress.com/

# 1 Housekeeping

### 1.1 Import libraries

In [None]:
#basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from bs4 import BeautifulSoup

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 200)

### 1.2 Connect to drive

In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### 1.3 Custom functions

In [None]:
#function to extract the text between two markers (plain string search, no regex)
def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""    
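
The helper above relies on plain `str.index` lookups rather than regular expressions. For comparison, a regex-based variant could look roughly like the sketch below. The name `find_between_re` and the sample string are hypothetical and not part of the original notebook.

```python
import re

def find_between_re(s, first, last):
    """Hypothetical regex variant: return the text between `first` and `last`, or ""."""
    # re.escape keeps characters such as '"' or '>' from being read as regex syntax
    pattern = re.escape(first) + r"(.*?)" + re.escape(last)
    match = re.search(pattern, s, flags=re.DOTALL)
    return match.group(1) if match else ""

# made-up anchor tag for illustration only
sample = '<a href="https://example.com/series-1/">Series 01</a>'
print(find_between_re(sample, 'href="', '">Series'))
```

Both versions return an empty string when either marker is missing, which is what allows the later cells to filter out non-matching entries.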

# 2 Scraping

### 2.1 Scrape URLs

In [None]:
#grab the html of a site
import requests
url = 'https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/'
result = requests.get(url)
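
When many pages are fetched in a loop, it may be worth adding a timeout, an HTTP status check, and a short pause between requests. The sketch below is one possible way to do that. `fetch_html` and its `timeout` and `delay` parameters are hypothetical names, not part of the original notebook, and the function assumes it is handed a `requests.Session` (or any object with a compatible `get` method).

```python
import time

def fetch_html(session, url, timeout=10, delay=1.0):
    """Hypothetical helper: fetch one page, fail loudly on HTTP errors, then pause."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises on 4xx/5xx responses instead of returning an error page
    time.sleep(delay)            # brief pause so the site is not hit with rapid requests
    return response.text

# Possible usage (assumes the requests library is installed):
# import requests
# session = requests.Session()
# html = fetch_html(session, 'https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/')
```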


In [None]:
#create a html file from the html request
doc = BeautifulSoup(result.text, 'html.parser')

In [None]:
#retrieve all anchor (<a>) tags and convert them to strings
htmls = doc.find_all('a')
html_strings = []
for x in htmls:
    html_strings.append(str(x))

html_strings[5:10]

['<a href="https://bigbangtrans.wordpress.com/about/">About</a>',
 '<a aria-current="page" href="https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/">Series 01 Episode 01 – Pilot\xa0Episode</a>',
 '<a href="https://bigbangtrans.wordpress.com/series-1-episode-2-the-big-bran-hypothesis/">Series 01 Episode 02 – The Big Bran\xa0Hypothesis</a>',
 '<a href="https://bigbangtrans.wordpress.com/series-1-episode-3-the-fuzzy-boots-corollary/">Series 01 Episode 03 – The Fuzzy Boots\xa0Corollary</a>',
 '<a href="https://bigbangtrans.wordpress.com/series-1-episode-4-the-luminous-fish-effect/">Series 01 Episode 04 – The Luminous Fish\xa0Effect</a>']

In [None]:
#create a list of cleaned urls
cleaned_urls = []
for x in html_strings:
  cleaned_html = find_between(x, 'href="', '">Series')
  cleaned_urls.append(cleaned_html)

#remove the empty strings left by links that are not episode pages
cleaned_urls = [x for x in cleaned_urls if x]

cleaned_urls[0:10]

['https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/',
 'https://bigbangtrans.wordpress.com/series-1-episode-2-the-big-bran-hypothesis/',
 'https://bigbangtrans.wordpress.com/series-1-episode-3-the-fuzzy-boots-corollary/',
 'https://bigbangtrans.wordpress.com/series-1-episode-4-the-luminous-fish-effect/',
 'https://bigbangtrans.wordpress.com/series-1-episode-5-the-hamburger-postulate/',
 'https://bigbangtrans.wordpress.com/series-1-episode-6-the-middle-earth-paradigm/',
 'https://bigbangtrans.wordpress.com/series-1-episode-7-the-dumpling-paradox/',
 'https://bigbangtrans.wordpress.com/series-1-episode-8-the-grasshopper-experiment/',
 'https://bigbangtrans.wordpress.com/series-1-episode-9-the-cooper-hofstadter-polarization/',
 'https://bigbangtrans.wordpress.com/series-1-episode-10-the-loobenfeld-decay/']
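
As an alternative to converting each tag to a string and slicing it, the `href` attribute can be read directly from the parsed tag. The sketch below illustrates the idea on a made-up HTML snippet. The sample markup and the variable `episode_urls` are assumptions for illustration, and filtering on link text that starts with "Series" mirrors the `'">Series'` marker used above.

```python
from bs4 import BeautifulSoup

# made-up navigation snippet, loosely modelled on the links listed above
sample_html = """
<a href="https://example.com/about/">About</a>
<a href="https://example.com/series-1-episode-1/">Series 01 Episode 01</a>
<a href="https://example.com/series-1-episode-2/">Series 01 Episode 02</a>
"""

doc = BeautifulSoup(sample_html, "html.parser")

# keep only links whose visible text starts with "Series"
episode_urls = [
    a["href"]
    for a in doc.find_all("a", href=True)
    if a.get_text().startswith("Series")
]
print(episode_urls)
```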

### 2.2 Season 1

The HTML structure of Season 1 differs from that of the other seasons, so I will extract it separately.
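
To illustrate the kind of difference involved, the sketch below parses two made-up paragraphs: one wrapped in a styled `<span>`, as the Season 1 extraction assumes, and one plain `<p>`. The sample markup and the helper `paragraph_texts` are hypothetical, and the real pages may differ in detail. Reading the text with BeautifulSoup's `get_text()` is shown as one possible way to handle both layouts with the same code, not as the method used in this notebook.

```python
from bs4 import BeautifulSoup

# made-up markup imitating the two layouts
season1_style = '<p class="MsoNormal"><span style="font-family:Calibri;">Leonard: Excuse me?</span></p>'
later_style = '<p>Leonard: <em>Please</em> let it go, Sheldon.</p>'

def paragraph_texts(html):
    """Hypothetical helper: return the visible text of every <p> tag."""
    doc = BeautifulSoup(html, "html.parser")
    # get_text() drops nested tags such as <span> and <em> and keeps their text
    return [p.get_text().strip() for p in doc.find_all("p")]

print(paragraph_texts(season1_style))
print(paragraph_texts(later_style))
```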

In [None]:
#scrape the transcript of every season 1 episode
import requests

season1_urls = cleaned_urls[0:18] #season 1 + first episode of season 2, from the second episode onwards I get an error

season1_frames = []  #one dataframe per episode

for url in season1_urls:
  result = requests.get(url)

  #parse the html of the url
  doc = BeautifulSoup(result.text, 'html.parser')
  #extract the episode title
  title = doc.title.string
  #print(title)

  #find all paragraphs (<p> tags)
  paragraphs = doc.find_all('p')

  #convert the paragraphs to strings
  string_paragraphs = []
  for x in paragraphs:
    string_paragraphs.append(str(x))
  #print(string_paragraphs)

  #clean the paragraphs
  cleaned_paragraphs = []
  for x in string_paragraphs:
    cleaned_paragraph = find_between(x, 'Calibri;">', "</span>")
    cleaned_paragraph = cleaned_paragraph.replace('<em>', "")  #remove leftover emphasis tags
    cleaned_paragraph = cleaned_paragraph.replace('</em>', "")
    cleaned_paragraphs.append(cleaned_paragraph)
  cleaned_paragraphs = [x for x in cleaned_paragraphs if x]  #this removes empty strings from the list
  #print(cleaned_paragraphs)

  #create df of cleaned paragraphs
  episode_df = pd.DataFrame(cleaned_paragraphs)
  episode_df = episode_df.rename(columns={0: "line"})
  #note: only the text between the first and second colon is kept as the line
  episode_df['character_scene'] = episode_df['line'].str.split(":").str[0]
  episode_df['line'] = episode_df['line'].str.split(":").str[1]
  episode_df['episode'] = title

  #collect the single episode df
  season1_frames.append(episode_df)

#combine all episodes; DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
season_1 = pd.concat(season1_frames)[['line', 'episode', 'character_scene']]
season_1

Unnamed: 0,line,episode,character_scene
0,A corridor at a sperm bank.,Series 01 Episode 01 – Pilot Episode | Big Ban...,Scene
1,So if a photon is directed through a plane wi...,Series 01 Episode 01 – Pilot Episode | Big Ban...,Sheldon
2,"Agreed, what’s your point?",Series 01 Episode 01 – Pilot Episode | Big Ban...,Leonard
3,"There’s no point, I just think it’s a good id...",Series 01 Episode 01 – Pilot Episode | Big Ban...,Sheldon
4,Excuse me?,Series 01 Episode 01 – Pilot Episode | Big Ban...,Leonard
...,...,...,...
244,"Well, it’s really not that fancy, it’s just a...",Series 02 Episode 01 – The Bad Fish Paradigm |...,Leonard
245,"Right, but I have to have some sort of degree...",Series 02 Episode 01 – The Bad Fish Paradigm |...,Penny
246,That doesn’t matter to me at all.,Series 02 Episode 01 – The Bad Fish Paradigm |...,Leonard
247,"So, it’s fine with you if I’m not smart.",Series 02 Episode 01 – The Bad Fish Paradigm |...,Penny
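
The notebook splits each paragraph on every colon and keeps only the second piece, so a spoken line that itself contains a colon is cut short. A possible refinement, sketched below on made-up sample lines, is to split on the first colon only. The variable names are hypothetical.

```python
import pandas as pd

# made-up sample paragraphs; the second one contains a colon inside the dialogue
raw = pd.Series([
    "Scene: A corridor at a sperm bank.",
    "Sheldon: Here is the rule: never sit in my spot.",
])

# n=1 splits at the first colon only; expand=True returns the pieces as columns
parts = raw.str.split(":", n=1, expand=True)
speaker_lines = pd.DataFrame({
    "character_scene": parts[0].str.strip(),
    "line": parts[1].str.strip(),
})
print(speaker_lines)
```

With this approach everything after the speaker label is kept as the line, including any later colons.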


### 2.3 Other seasons

In [None]:
#scrape the transcript of every remaining episode
import requests

other_episodes_urls = cleaned_urls[18:]  #first episode of season 2 has a different structure and is included in the season 1 df

other_frames = []  #one dataframe per episode

for url in other_episodes_urls:
  result = requests.get(url)

  #parse the html of the url
  doc = BeautifulSoup(result.text, 'html.parser')
  #extract the episode title
  title = doc.title.string
  #print(title)

  #find all paragraphs (<p> tags)
  paragraphs = doc.find_all('p')

  #convert the paragraphs to strings
  string_paragraphs = []
  for x in paragraphs:
    string_paragraphs.append(str(x))
  #print(string_paragraphs)

  #clean the paragraphs
  cleaned_paragraphs = []
  for x in string_paragraphs:
    cleaned_paragraph = find_between(x, '<p>', "</p>")  #if you use > and <, the scene is somehow lost
    cleaned_paragraph = cleaned_paragraph.replace('<em>', "")  #remove leftover emphasis tags
    cleaned_paragraph = cleaned_paragraph.replace('</em>', "")
    cleaned_paragraphs.append(cleaned_paragraph)
  cleaned_paragraphs = [x for x in cleaned_paragraphs if x]  #this removes empty strings from the list
  #print(cleaned_paragraphs)

  #create df of cleaned paragraphs
  episode_df = pd.DataFrame(cleaned_paragraphs)
  episode_df = episode_df.rename(columns={0: "line"})
  #note: only the text between the first and second colon is kept as the line
  episode_df['character_scene'] = episode_df['line'].str.split(":").str[0]
  episode_df['line'] = episode_df['line'].str.split(":").str[1]
  episode_df['episode'] = title

  #collect the single episode df
  other_frames.append(episode_df)

#combine all episodes; DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
other_seasons = pd.concat(other_frames)[['line', 'episode', 'character_scene']]
other_seasons

Unnamed: 0,line,episode,character_scene
0,The building entrance lobby. The guys enter. ...,Series 02 Episode 02 – The Codpiece Topology |...,Scene
1,Worst Renaissance Fair ever.,Series 02 Episode 02 – The Codpiece Topology |...,Sheldon
2,"Please let it go, Sheldon.",Series 02 Episode 02 – The Codpiece Topology |...,Leonard
3,It was rife with historical inaccuracies. For...,Series 02 Episode 02 – The Codpiece Topology |...,Sheldon
4,You’re nitpicking.,Series 02 Episode 02 – The Codpiece Topology |...,Leonard
...,...,...,...
236,"Calibri, sans-serif;"">Sheldon",Series 10 Episode 24 – The Long Distance Disso...,"<span style=""font-family"
237,"Calibri, sans-serif;""><i>Scene",Series 10 Episode 24 – The Long Distance Disso...,"<span style=""font-family"
238,"Calibri, sans-serif;"">Sheldon",Series 10 Episode 24 – The Long Distance Disso...,"<span style=""font-family"
239,//www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&...,Series 10 Episode 24 – The Long Distance Disso...,"<a href=""https"
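
The last rows of the output above still contain HTML fragments such as `<span style="font-family` and a PayPal link in the `character_scene` column. One simple way to flag such rows, sketched below on a made-up sample, is to look for a `<` character in that column. The frame `sample` and the names `has_markup` and `clean` are hypothetical.

```python
import pandas as pd

# made-up sample imitating the clean and the leftover-markup rows shown above
sample = pd.DataFrame({
    "line": ["You're nitpicking.", 'Calibri, sans-serif;">Sheldon', "//www.paypal.com/cgi-bin/webscr"],
    "character_scene": ["Leonard", '<span style="font-family', '<a href="https'],
})

# regex=False treats "<" as a literal character
has_markup = sample["character_scene"].str.contains("<", regex=False)
clean = sample[~has_markup].reset_index(drop=True)
print(clean)
```

Dropping these rows also discards any dialogue buried inside the markup, so re-parsing the affected paragraphs may be preferable if those lines matter.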


### 2.4 Append and save

In [None]:
episodes = pd.concat([season_1, other_seasons])  #DataFrame.append was removed in pandas 2.0
episodes

Unnamed: 0,line,episode,character_scene
0,A corridor at a sperm bank.,Series 01 Episode 01 – Pilot Episode | Big Ban...,Scene
1,So if a photon is directed through a plane wi...,Series 01 Episode 01 – Pilot Episode | Big Ban...,Sheldon
2,"Agreed, what’s your point?",Series 01 Episode 01 – Pilot Episode | Big Ban...,Leonard
3,"There’s no point, I just think it’s a good id...",Series 01 Episode 01 – Pilot Episode | Big Ban...,Sheldon
4,Excuse me?,Series 01 Episode 01 – Pilot Episode | Big Ban...,Leonard
...,...,...,...
236,"Calibri, sans-serif;"">Sheldon",Series 10 Episode 24 – The Long Distance Disso...,"<span style=""font-family"
237,"Calibri, sans-serif;""><i>Scene",Series 10 Episode 24 – The Long Distance Disso...,"<span style=""font-family"
238,"Calibri, sans-serif;"">Sheldon",Series 10 Episode 24 – The Long Distance Disso...,"<span style=""font-family"
239,//www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&...,Series 10 Episode 24 – The Long Distance Disso...,"<a href=""https"


In [None]:
#save  dataframe 
episodes.to_csv("raw_episodes.csv")
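
By default `to_csv` also writes the index as an unnamed first column, which appears as `Unnamed: 0` when the file is read back with default settings. The sketch below demonstrates this on a made-up one-row frame, using an in-memory buffer instead of a real file. The variable names are hypothetical.

```python
import io
import pandas as pd

# made-up one-row frame standing in for the episodes dataframe
episodes_sample = pd.DataFrame({"line": ["Excuse me?"], "character_scene": ["Leonard"]})

buffer = io.StringIO()
episodes_sample.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
default_read = pd.read_csv(buffer)               # index comes back as "Unnamed: 0"
buffer.seek(0)
indexed_read = pd.read_csv(buffer, index_col=0)  # first column is used as the index again

print(list(default_read.columns))
print(list(indexed_read.columns))
```

Passing `index=False` to `to_csv` is another option when the index carries no information that needs to be kept.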