Scraping Hollywood - Beautiful Soup and Movie Data

In my last post about Hollywood movies, I didn't specifically address the engineering challenge of scraping all the data for that analysis. Below, I walk through how to identify the data to scrape and how to extract it with Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

The first step is to identify the data you want to scrape. It helps if the data already sits in some sort of HTML structure that Beautiful Soup can use to locate it on the page. As you can see, the target page I went after already has a table structure, which means there are HTML tags we can find with BS.

Ok, let's get started!

In [25]:
# import libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

# define BS target url (kept in a list so more pages can be added later)
url = ['https://www.the-numbers.com/movie/budgets/all']

# I'm sharing a technique for scraping a large list of pages. It will only execute once now,
# but replace url with a longer list of pages and this will make sense.

raw_list = []

for page_url in url:
    r = requests.get(page_url)
    # name the parser explicitly so BS doesn't have to guess (and warn)
    soup = BeautifulSoup(r.text, 'html.parser')
    # store the text of every table row
    for x in soup.find_all('tr'):
        raw_list.append(x.get_text())

data = pd.Series(raw_list)
df = data.str.split('\n', expand=True)

So what happened above?

We store our string outputs from Beautiful Soup in raw_list.

The for loop iterates through every URL in our url variable (one in the sample, but you can add as many pages as you like) and uses requests.get() to generate a response object r. This response object gives its text output to BS, which creates our soup object.
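As a minimal sketch of the same loop that runs without a network connection, the snippet below feeds two small hand-written HTML strings through the parsing step in place of r.text. The helper name collect_rows and the sample markup are assumptions made up for illustration; only the row values are borrowed from the table on the target page.

```python
from bs4 import BeautifulSoup


def collect_rows(pages):
    """Return the text of every <tr> element found in an iterable of HTML strings."""
    rows = []
    for html in pages:
        soup = BeautifulSoup(html, "html.parser")
        for tr in soup.find_all("tr"):
            rows.append(tr.get_text())
    return rows


# Hypothetical stand-ins for the r.text of two separate pages.
page_one = "<table><tr><td>1</td>\n<td>Avatar</td>\n</tr></table>"
page_two = (
    "<table><tr><td>2</td>\n"
    "<td>Pirates of the Caribbean: On Stranger Tides</td>\n</tr></table>"
)

rows = collect_rows([page_one, page_two])
print(rows)
```

Because every page adds to the same list, going from one page to many only means passing a longer list.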

The power of the soup object is that we can search by HTML tag. In this case, after a lot of trial and error, I found that the <tr> tag demarcated the data I needed.
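One way to shorten that trial and error is to count how often each tag appears on the page, since the tag that wraps each record tends to repeat once per row. This is only a sketch on a hand-written HTML fragment (an assumption, not the real page markup), using find_all(True) to match every tag.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical fragment shaped like a small budget table.
html = """
<table>
<tr><th>Movie</th><th>Production Budget</th></tr>
<tr><td>Avatar</td><td>$425,000,000</td></tr>
<tr><td>Pirates of the Caribbean: On Stranger Tides</td><td>$410,600,000</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all(True) returns every tag in the document, whatever its name.
tag_counts = Counter(tag.name for tag in soup.find_all(True))
print(tag_counts)

# Once a candidate tag stands out, look at what one instance contains.
first_data_row = soup.find_all("tr")[1]
cells = [td.get_text() for td in first_data_row.find_all("td")]
print(cells)
```

On a real page the counts will be noisier, but a tag whose count matches the number of rows you can see in the browser is a good first candidate.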
    
Finally, when I appended the text of everything bracketed by <tr> tags to a list, each row's values came out on separate lines, with a '\n' between elements. This is actually ideal for casting into a dataframe, because I could simply save the list as a pandas Series and then call '.str.split('\n', expand=True)'.

By using 'data.str' I treat the series elements as strings.

By using '.split('\n', expand=True)' I split each element along a delimiter, in this case a newline ('\n'), and expand each item returned into a new column (expand=True).
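To see the split in isolation, here is a small sketch on three hand-typed strings shaped like the scraped rows. The names toy and toy_df are made up for this example, and the header string is a shortened stand-in for the real one.

```python
import pandas as pd

# Hand-typed stand-ins for entries of raw_list: a header with no newlines,
# followed by two data rows that each end with a trailing newline.
toy = pd.Series([
    "Release DateMovie",
    "1\nDec 18, 2009\nAvatar\n",
    "2\nMay 20, 2011\nPirates of the Caribbean: On Stranger Tides\n",
])

# Each piece between newlines becomes its own column.
toy_df = toy.str.split("\n", expand=True)
print(toy_df.shape)  # (3, 4)
print(toy_df)
```

The header string has no newline, so it stays in one cell and the rest of its row is filled with missing values, while the trailing newline on the data rows produces an empty last column. That is why the cleanup later drops the first row and the last column.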

For the scraped page, we get the following output:
 

In [26]:
df.head()

,0,1,2,3,4,5,6
0,Release DateMovieProduction BudgetDomestic Gr...,,,,,,
1,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279",
2,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875",
3,3,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963",
4,4,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747",


Referring back to our source page, we can see what the column headings ought to be and rename them.

In [28]:
# drop the unused columns (0 and 6) and the header row, drop all-NaN rows, then rename the columns
df = (df.drop(columns=[0, 6]).drop([0]).dropna(how='all')
        .rename(columns={1: 'Release Date', 2: 'Movie', 3: 'Production Budget',
                         4: 'Domestic Gross', 5: 'Worldwide Gross'}))
df.head()

,Release Date,Movie,Production Budget,Domestic Gross,Worldwide Gross
1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
3,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
5,"Dec 18, 2015",Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225","$2,053,311,220"


Now we end up with a good-looking dataframe. We still need to check the dtypes, remove some nulls, etc., but you can check my post here on how to remove NaNs.
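For the dtype check, a rough sketch of one possible approach is below. It is an assumption about how the cleanup could go, not a step from the original analysis: it rebuilds two of the rows shown above by hand, strips the '$' and ',' characters from the money columns, and parses the dates.

```python
import pandas as pd

# Two rows copied from the output above, so the example runs on its own.
sample = pd.DataFrame({
    "Release Date": ["Dec 18, 2009", "May 20, 2011"],
    "Movie": ["Avatar", "Pirates of the Caribbean: On Stranger Tides"],
    "Production Budget": ["$425,000,000", "$410,600,000"],
    "Domestic Gross": ["$760,507,625", "$241,063,875"],
    "Worldwide Gross": ["$2,776,345,279", "$1,045,663,875"],
})

# Everything scraped from HTML arrives as text.
print(sample.dtypes)

money_cols = ["Production Budget", "Domestic Gross", "Worldwide Gross"]
for col in money_cols:
    # Remove '$' and ',' and convert; int64 because worldwide gross exceeds the int32 range.
    sample[col] = sample[col].str.replace(r"[$,]", "", regex=True).astype("int64")

# Parse dates such as "Dec 18, 2009".
sample["Release Date"] = pd.to_datetime(sample["Release Date"], format="%b %d, %Y")

print(sample.dtypes)
```

With numeric and datetime columns in place, sorting, filtering by year, and arithmetic such as gross minus budget work directly.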

Thanks for reading!