This notebook shows the scraping process for a larger project I completed on analyzing movie data. The rest of that project is in a different notebook in this same repo.

I began by importing the libraries I would need and creating a soup with the URL from Box Office Mojo.

In [33]:
# Importing the necessary libraries
import re
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import lxml

In [None]:
# Making a soup
# The parser name belongs to BeautifulSoup; passed to requests.get it would be sent as a query string
html_page = requests.get('https://www.boxofficemojo.com/year/world/2019/')
soup = BeautifulSoup(html_page.content, 'lxml')

The next cell is exploratory code where I tested that the various methods would pull the information I wanted, and in the cleanest way possible. I also experimented by feeding in an entry with missing values to make sure it wouldn't break the code.

The pages are organized by year, with the movies listed in ranked order for the year along with their gross figures. I wanted more information, such as budget, genres, and running time. I found that clicking on the link for each movie in this list leads to an overview tab with all of that information. I had to create multiple soups, each layered on the last, in order to click through each movie and scrape the overview tab as well.
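The links found in each table row are relative paths, so each one has to be joined to the site root before it can be requested. A minimal sketch of that step using the standard library is below; the helper name `absolute_url` and the example href are made up for illustration and are not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.boxofficemojo.com"

def absolute_url(href, base=BASE_URL):
    """Resolve an href scraped from a page against the site root."""
    return urljoin(base, href)

# Hypothetical href of the kind a table row's link might hold
example_href = "/releasegroup/gr0000000000/"
print(absolute_url(example_href))
```

Compared with plain string concatenation, `urljoin` also leaves an href alone if it is already a full URL.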

In [None]:
# Found the place where the movies and the data were listed
movie_listings_container = soup.find('table', class_='mojo-body-table')
movies = movie_listings_container.find_all('tr')

# The first entry was the headers for the table so .pop() removed that entry
movies.pop(0)

# Exploring through the html to make sure that I can get each of the values out I want
year_rank = movies[0].find_all('td')[0].text
title = movies[0].find_all('td')[1].text
worldwide = movies[0].find_all('td')[2].text
domestic = movies[0].find_all('td')[3].text
per_dom = movies[0].find_all('td')[4].text
foreign = movies[0].find_all('td')[5].text
per_for = movies[0].find_all('td')[6].text

# Clicking on the link to the movie to get more data from the movie's landing page
moviepage_link = movies[0].find(class_='a-link-normal')['href']

# Making a new soup to follow that link
moviepage_link_req = requests.get('https://www.boxofficemojo.com' + moviepage_link)
soup2 = BeautifulSoup(moviepage_link_req.content, 'lxml')

# Had to make one last click to get to the page with the right info for scraping
overview_tab_link = soup2.find('a', class_='mojo-title-link')['href']
overview_tab_link

# Making another soup to follow that last link
overview_link_req = requests.get('https://www.boxofficemojo.com' + overview_tab_link)
soup3 = BeautifulSoup(overview_link_req.content, 'lxml')

# Found the code that contained the last of the info
budget_container = soup3.find('div', class_='mojo-summary-values')
budget = budget_container.find_all(class_='money')[1].text
budget_container

I took all the exploration from the previous cell and put it into a function. The function starts on a page that lists all the movies for a year in ranked order, then clicks through each movie to grab the information on its overview tab, as I mentioned before. Then it moves on to the next year. The function takes the root URL and the range of years that you want to scrape. Box Office Mojo has a very regular URL scheme, which made it easy for the function to build the next URL it needed. I also had the function add a 'Year' column so that after the scrape I would know which year's list each movie came from.

I included a try and except here because a few entries were still raising errors, but they were so few and far between that I was wasting too much time trying to fix them. I decided to skip those few movies for the greater good of having data to work with at all.
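One step inside the function is worth a small illustration: the text of every span on the overview tab is collected, repeated strings are dropped, and the remaining strings are paired up as label and value. The sketch below isolates that step. The span texts are made up, and the idea that the repeats come from nested spans is my assumption rather than something the notebook states.

```python
def pair_spans(span_texts):
    """Drop repeated strings, then pair what is left as label/value."""
    unique = []
    for text in span_texts:
        if text not in unique:
            unique.append(text)
    return dict(zip(unique[::2], unique[1::2]))

# Hypothetical span texts where a nested span repeats its parent's text
nested = ["Budget", "$90,000,000", "$90,000,000", "MPAA", "R"]
print(pair_spans(nested))

# Pitfall: two different fields that share a value shift the pairing
clash = ["Domestic Opening", "$604", "Budget", "$604", "MPAA", "R"]
print(pair_spans(clash))
```

The second call shows a limitation of this approach: when a later field repeats an earlier value, the repeat is dropped and every following label and value is shifted by one, so a scrape built this way may deserve a spot check.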
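A possible refinement, sketched here rather than taken from the notebook, is to catch only the errors that a missing tag, cell, or attribute would raise and to keep a record of which rows were skipped. The names `scrape_rows` and `first_two` are hypothetical, and plain lists stand in for the table rows.

```python
def scrape_rows(rows, parse_row):
    """Parse each row, keeping a record of the ones that fail."""
    parsed, failures = [], []
    for position, row in enumerate(rows):
        try:
            parsed.append(parse_row(row))
        except (AttributeError, IndexError, KeyError, TypeError) as err:
            # Errors of the kind a missing tag, cell, or attribute would raise
            failures.append((position, repr(err)))
    return parsed, failures

# Stand-in parser: fails on rows with fewer than two cells
def first_two(row):
    return {"rank": row[0], "title": row[1]}

parsed, failures = scrape_rows([["1", "Gladiator"], ["2"]], first_two)
print(parsed)
print(failures)
```

Keeping the failures makes it possible to see afterwards how many movies were skipped and why, without stopping the scrape.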

In [2]:
def scrape_movies(root_url, years):
    # Creating an empty list to append entries to
    movies_list = []
    # Looping through the years in the range given
    for year in years:
        # forming the appropriate URL and forming the first soup
        html_page = requests.get(root_url + '{}/'.format(year))
        soup = BeautifulSoup(html_page.content, 'lxml')
        # Grabbing that container in the html that holds all the movie data
        movie_listings_container = soup.find('table', class_='mojo-body-table')
        # Defining the movie entries to loop through these
        movies = movie_listings_container.find_all('tr')
        # The first entry for each year page was the headers so those get thrown out here with .pop()
        movies.pop(0)
        # Looping through each movie entry to grab all the information
        for movie in movies: 
            try:
                # Setting up my keys to turn all the harvested data into a dictionary
                keys = ['Year_Rank', 'Title', 'Worldwide_Gross', 'Domestic', 'Percent_Domestic',  
                        'Foreign', 'Percent_Foreign', 'Year']
                # Defining all the pieces that get collected for each movie
                year_rank = movie.find_all('td')[0].text
                title = movie.find_all('td')[1].text
                worldwide = movie.find_all('td')[2].text
                domestic = movie.find_all('td')[3].text
                per_dom = movie.find_all('td')[4].text
                foreign = movie.find_all('td')[5].text
                per_for = movie.find_all('td')[6].text
                # Adding in the year entry so that it will be added as a column in the resulting dataframe
                movie_year = year
                # Defining the collected pieces as values for the keys above
                values = [year_rank, title, worldwide, domestic, per_dom, foreign, per_for, movie_year]
                # Zipping the keys and values together to make a dictionary
                movie_dict = dict(zip(keys,values))

                # Following the link to the movie's landing page
                moviepage_link = movie.find(class_='a-link-normal')['href']
                moviepage_link_req = requests.get('https://www.boxofficemojo.com' + moviepage_link)
                soup2 = BeautifulSoup(moviepage_link_req.content, 'lxml')
                
                # Navigating to the overview tab on the landing page
                overview_tab_link = soup2.find('a', class_='mojo-title-link')['href']
                overview_link_req = requests.get('https://www.boxofficemojo.com' + overview_tab_link)
                soup3 = BeautifulSoup(overview_link_req.content, 'lxml')

                # Grabbing the container where all the good data is stored on the page
                movieinfo_container = soup3.find('div', class_='mojo-summary-values')
                # All the pieces were under 'span' tags
                spans = movieinfo_container.find_all('span')
                # Getting just the text out from the span tags
                spans_list = [span.text for span in spans]
                # Creating an empty list to add all the final pieces to
                spans_list_final = []
                # Looping through the texts and keeping each one only the first time it appears
                for span in spans_list:
                    if span not in spans_list_final:
                        spans_list_final.append(span)
                
                # Turning the pieces from a list to a dictionary with keys and values matched up
                movieinfo_dict = dict(zip(spans_list_final[::2],spans_list_final[1::2]))
                # Putting together the two dictionaries for each movie
                entry_dict = {**movie_dict, **movieinfo_dict}
                # Adding the dictionary of all the information for one movie to the list
                movies_list.append(entry_dict)
            
            # Skipping the entries that were making the function error
            # (Exception rather than a bare except, so Ctrl+C can still stop a long scrape)
            except Exception:
                pass

    # Returns the created dictionaries that have all the info for each movie
    return movies_list

I collected data from 2000 to 2020 for this project. When I called the function, I loaded its result straight into a dataframe and then saved the dataframe to a csv file on my computer.

I first ran the function with all the years included in the range but abandoned the effort because it was taking too long. I started splitting the years into chunks so that each run was shorter and I could make sure I had some data saved as I went. The ultimate fear when scraping is that something will go wrong while the scrape is running and you will have nothing to show for it.
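The chunking can also be automated. The sketch below splits a span of years into equal-sized ranges; it is only an illustration, since I picked my own chunks by hand and they were not all the same size. The helper `chunk_years` and the file name pattern are hypothetical.

```python
def chunk_years(start, stop, size):
    """Split range(start, stop) into consecutive ranges of at most `size` years."""
    return [range(first, min(first + size, stop))
            for first in range(start, stop, size)]

for years in chunk_years(2000, 2021, 5):
    # Each chunk would be scraped and saved before moving on, for example:
    # pd.DataFrame(scrape_movies(root_url, years)).to_csv(filename)
    filename = "scrape_{}-{}.csv".format(years[0], years[-1])
    print(filename)
```

Saving after every chunk means a failure partway through costs at most one chunk of work.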

In [10]:
# How I called my function. I updated this line with the new range of years as I went 
df = pd.DataFrame(scrape_movies('https://www.boxofficemojo.com/year/world/', range(2000,2005)))

# Saving the dataframe to my computer with an appropriate file name
df.to_csv(r'C:\Users\drudi\DataScience\Module01\Final Project\Module-1-Final-Project\BoxOfficMojoScrapeFinal2000-2004.csv')
df

Unnamed: 0,Year_Rank,Title,Worldwide_Gross,Domestic,Percent_Domestic,Foreign,Percent_Foreign,Year,Domestic Distributor,Domestic Opening,Budget,Earliest Release Date,MPAA,Running Time,Genres,\n IMDbPro\n
0,1,Mission: Impossible II,"$546,388,108","$215,409,889",39.4%,"$330,978,219",60.6%,2000,Paramount PicturesSee full company information...,"$57,845,297","$125,000,000","May 24, 2000\n (Domestic)",PG-13,2 hr 3 min,Action\n \n Adventure\n \n ...,See more details at IMDbPro\n\n
1,2,Gladiator,"$460,583,960","$187,705,427",40.8%,"$272,878,533",59.2%,2000,DreamWorks DistributionSee full company inform...,"$34,819,017","$103,000,000","May 4, 2000\n (Australia)",R,2 hr 35 min,Action\n \n Adventure\n \n ...,See more details at IMDbPro\n\n
2,3,Cast Away,"$429,632,142","$233,632,142",54.4%,"$196,000,000",45.6%,2000,Twentieth Century FoxSee full company informat...,"$28,883,406","$90,000,000","December 22, 2000\n (Domestic)",PG-13,2 hr 23 min,Adventure\n \n Drama\n \n ...,See more details at IMDbPro\n\n
3,4,What Women Want,"$374,111,707","$182,811,707",48.9%,"$191,300,000",51.1%,2000,Paramount PicturesSee full company information...,"$33,614,543","$70,000,000","December 15, 2000\n (Domestic)",PG-13,2 hr 7 min,Comedy\n \n Fantasy\n \n R...,See more details at IMDbPro\n\n
4,5,Dinosaur,"$349,822,765","$137,748,063",39.4%,"$212,074,702",60.6%,2000,Walt Disney Studios,"$38,854,851","$127,500,000","May 19, 2000\n (Domestic)",PG,1 hr 22 min,Adventure\n \n Animation\n \n ...,See more details at IMDbPro\n\n
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2350,551,Shwaas,"$1,416","$1,416",100%,-,-,2004,Kathi ArtsSee full company information\n\n,"$1,042",,"December 10, 2004\n (Domestic)",,1 hr 47 min,Drama,See more details at IMDbPro\n\n
2351,552,800 Bullets 2004 Re-release,$866,$866,100%,-,-,2004,,,,"October 18, 2002\n (Spain)",R,2 hr 4 min,Action\n \n Comedy\n \n Dr...,See more details at IMDbPro\n\n
2352,553,Anatomy 2 2004 Re-release,$623,$623,100%,-,-,2004,,,,"February 6, 2003\n (Germany)",R,1 hr 41 min,Horror\n \n Sci-Fi\n \n Th...,See more details at IMDbPro\n\n
2353,554,"Jesus, You Know",$604,$604,100%,-,-,2004,Leisure Time FeaturesSee full company informat...,$604,,"December 3, 2004\n (Domestic)",,1 hr 27 min,Documentary,See more details at IMDbPro\n\n


Now it was time to join all the dataframes together to create the single large file I would work with in my project.
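The cell that loaded each saved chunk back into a dataframe is not shown in this notebook. The sketch below is one way that step could look, using two tiny in-memory stand-ins with made-up rows instead of the real files. Reading with `index_col=0` and concatenating with `ignore_index=True` are suggestions, not what I ran; they would avoid the `Unnamed: 0` columns and repeated index values visible in the output below.

```python
import io
import pandas as pd

# Two tiny in-memory stand-ins for saved chunk files (made-up rows)
chunk_a = io.StringIO(",Title,Year\n0,Film A,2020\n1,Film B,2020\n")
chunk_b = io.StringIO(",Title,Year\n0,Film C,2019\n")

# index_col=0 reads the saved index back as the index, not as a data column
frames = [pd.read_csv(chunk, index_col=0) for chunk in (chunk_a, chunk_b)]

# ignore_index=True renumbers the rows instead of repeating each chunk's index
combined = pd.concat(frames, ignore_index=True)
print(combined)
```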

In [22]:
# Listing all the dataframes to be joined
frames = [df2020, df2019, df2018, df2017, df2016, df2010_2015, df2005_2009, df2000_2004]
# Joining the frames together and looking at the result
df = pd.concat(frames)
df

Unnamed: 0.1,Unnamed: 0,Year_Rank,Title,Worldwide_Gross,Domestic,Percent_Domestic,Foreign,Percent_Foreign,Year,Domestic Distributor,Domestic Opening,Budget,Earliest Release Date,MPAA,Running Time,Genres,\n IMDbPro\n
0,0,1,Bad Boys for Life,"$419,074,646","$204,417,855",48.8%,"$214,656,791",51.2%,2020,Sony Pictures Entertainment (SPE)See full comp...,"$62,504,105","$90,000,000","January 15, 2020\n (LATAM, APAC)",R,2 hr 4 min,Action\n \n Comedy\n \n Cr...,See more details at IMDbPro\n\n
1,1,2,Sonic the Hedgehog,"$306,766,470","$146,066,470",47.6%,"$160,700,000",52.4%,2020,Paramount PicturesSee full company information...,"$58,018,348","$85,000,000","February 12, 2020\n (APAC, EMEA)",PG,1 hr 39 min,Action\n \n Adventure\n \n ...,See more details at IMDbPro\n\n
2,2,3,Dolittle,"$224,752,486","$77,047,065",34.3%,"$147,705,421",65.7%,2020,Universal PicturesSee full company information...,"$21,844,045","$175,000,000","January 8, 2020\n (South Korea)",PG,1 hr 41 min,Adventure\n \n Comedy\n \n ...,See more details at IMDbPro\n\n
3,3,4,Birds of Prey: And the Fantabulous Emancipatio...,"$201,858,461","$84,158,461",41.7%,"$117,700,000",58.3%,2020,Warner Bros.See full company information\n\n,"$33,010,017","$84,500,000","February 5, 2020\n (APAC, EMEA)",R,1 hr 49 min,Action\n \n Adventure\n \n ...,See more details at IMDbPro\n\n
4,4,5,The Invisible Man,"$128,251,913","$64,914,050",50.6%,"$63,337,863",49.4%,2020,Universal PicturesSee full company information...,"$28,205,665","$7,000,000","February 26, 2020\n (EMEA, APAC)",R,2 hr 4 min,Horror\n \n Mystery\n \n S...,See more details at IMDbPro\n\n
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2350,2350,551,Shwaas,"$1,416","$1,416",100%,-,-,2004,Kathi ArtsSee full company information\n\n,"$1,042",,"December 10, 2004\n (Domestic)",,1 hr 47 min,Drama,See more details at IMDbPro\n\n
2351,2351,552,800 Bullets 2004 Re-release,$866,$866,100%,-,-,2004,,,,"October 18, 2002\n (Spain)",R,2 hr 4 min,Action\n \n Comedy\n \n Dr...,See more details at IMDbPro\n\n
2352,2352,553,Anatomy 2 2004 Re-release,$623,$623,100%,-,-,2004,,,,"February 6, 2003\n (Germany)",R,1 hr 41 min,Horror\n \n Sci-Fi\n \n Th...,See more details at IMDbPro\n\n
2353,2353,554,"Jesus, You Know",$604,$604,100%,-,-,2004,Leisure Time FeaturesSee full company informat...,$604,,"December 3, 2004\n (Domestic)",,1 hr 27 min,Documentary,See more details at IMDbPro\n\n


In [23]:
# Saving this complete dataframe as a file to my computer
df.to_csv(r'C:\Users\drudi\DataScience\Module01\Final Project\Module-1-Final-Project\BoxOfficeMojoScrapeFinal.csv')

And voilà, I have all my data in one big csv file. I will then pull this file into my other notebook to clean, analyze, and visualize.