<h1> Scraping IMDB Movie Ratings </h1>

We want to scrape movies released in 2019, sorted by number of votes, starting with the first page of results and then moving on to the next page.

We will use the **get()** function from the **requests** module and assign the address of the web page to a variable named url.

In [2]:
from requests import get
url = 'https://www.imdb.com/search/title/?release_date=2019&sort=num_votes,desc&page=2&ref_=adv_nxt'
response = get(url)
print(response.text[:500])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle"
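
The query string in the URL above carries three pieces of information: the release year, the sort order and the page number. As a minimal sketch, the same kind of URL can be assembled from its parts with the standard library. The helper name **build_search_url** is our own, and we have not verified here how the site treats each parameter.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.imdb.com/search/title/'

def build_search_url(year, page=1):
    """Return the search URL for one release year and one result page.

    Hypothetical helper, not part of requests or of the site's API.
    """
    params = {
        'release_date': year,
        'sort': 'num_votes,desc',
        'page': page,
    }
    # safe=',' keeps the comma in the sort value unescaped, as in the URL above
    return BASE_URL + '?' + urlencode(params, safe=',')

url_2019 = build_search_url(2019)
print(url_2019)
```

Building the URL from a dictionary makes it easier to change the year or the page inside a loop later on.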


We’ll use a module called **BeautifulSoup** to parse our HTML document. 

In [2]:
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

bs4.BeautifulSoup

On inspecting the HTML of the containers we are interested in, we’ll notice that the class attribute has two values: lister-item and mode-advanced. This combination is unique to these div containers. Therefore, we will use the **find_all()** method to extract all the div containers that have a class attribute of lister-item mode-advanced.

In [3]:
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
50
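
Passing a string with two class names to **find_all()** matches the class attribute as an exact string, so the order of the names matters. The sketch below uses a small made-up HTML snippet (not taken from the real page) to compare this with a CSS selector, which matches both classes in any order.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
sample_html = '''
<div class="lister-item mode-advanced">A</div>
<div class="mode-advanced lister-item">B</div>
<div class="lister-item">C</div>
'''
soup = BeautifulSoup(sample_html, 'html.parser')

# Exact match on the full class string: only the first div qualifies
exact = soup.find_all('div', class_='lister-item mode-advanced')

# CSS selector: any div carrying both classes, whatever their order
both = soup.select('div.lister-item.mode-advanced')

print([d.text for d in exact])
print([d.text for d in both])
```

On the page scraped here the two approaches should give the same 50 containers, as long as the class names always appear in the same order.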


In [5]:
print(movie_containers[0])

<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt7286456"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt7286456/"> <img alt="Joker" class="loadlate" data-tconst="tt7286456" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BNGVjNWI4ZGUtNzE0MS00YTJmLWE0ZDctN2ZiYTk2YmI3NTYyXkEyXkFqcGdeQXVyMTkxNjUyNQ@@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB466725069_.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt7286456/">Joker</a>
<span class="lister-item-year text-muted unbold">(2019)</span>
</h3>
<p class="text-muted">
<span class="certificate">MA15+</span>
<span class="ghost">|</span>
<span class="runtime">122 min</span>
<span class="ghost">|</span>
<span class="genre">

We will now analyse the first movie from the list movie_containers.

In [10]:
first_movie = movie_containers[0]

In [11]:
first_movie.div

<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt7286456"></div>
</div>

In [12]:
first_movie.a

<a href="/title/tt7286456/"> <img alt="Joker" class="loadlate" data-tconst="tt7286456" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BNGVjNWI4ZGUtNzE0MS00YTJmLWE0ZDctN2ZiYTk2YmI3NTYyXkEyXkFqcGdeQXVyMTkxNjUyNQ@@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB466725069_.png" width="67"/>
</a>

In [13]:
first_movie.h3

<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt7286456/">Joker</a>
<span class="lister-item-year text-muted unbold">(2019)</span>
</h3>

In [14]:
first_movie.h3.a

<a href="/title/tt7286456/">Joker</a>

In [15]:
first_name = first_movie.h3.a.text
print(first_name)

Joker


In [16]:
first_year = first_movie.h3.find('span', class_ = 'lister-item-year text-muted unbold')
first_year

<span class="lister-item-year text-muted unbold">(2019)</span>

In [17]:
first_year = first_year.text
print(first_year)

(2019)


In [18]:
first_movie.strong

<strong>8.5</strong>

In [19]:
first_imdb = float(first_movie.strong.text)
print(first_imdb)

8.5


In [30]:
first_mscore = first_movie.find('span', class_ = 'metascore mixed')

In [31]:
first_mscore = int(first_mscore.text)
print(first_mscore)

59
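
The class string 'metascore mixed' works for the first movie, but the second class name looks like it depends on the score. Assuming other movies carry a different second class (the value 'favorable' below is our assumption), searching on 'metascore' alone is the safer choice, and it is what the loop further down does. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup; the 'favorable' class name is an assumption for illustration
sample_html = '''
<span class="metascore mixed">59        </span>
<span class="metascore favorable">78        </span>
'''
soup = BeautifulSoup(sample_html, 'html.parser')

# Exact class string: finds only the span whose class is 'metascore mixed'
only_mixed = soup.find_all('span', class_='metascore mixed')

# Single class name: finds every span that has 'metascore' among its classes
any_band = soup.find_all('span', class_='metascore')

# int() ignores the surrounding whitespace in the span text
scores = [int(span.text) for span in any_band]
print(len(only_mixed), scores)
```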


In [32]:
first_votes = first_movie.find('span', attrs = {'name':'nv'})
first_votes

<span data-value="821814" name="nv">821,814</span>

In [33]:
first_votes['data-value']

'821814'

In [34]:
first_votes = int(first_votes['data-value'])

In [35]:
print(first_votes)

821814


On analysing the first web page, we can observe that the fourth movie has no Metascore.

In [36]:
fourth_movie_mscore = movie_containers[3].find('div', class_ = 'ratings-metascore')
type(fourth_movie_mscore)

NoneType
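
Because **find()** returns None when nothing matches, reading **.text** from the result of a failed search raises an AttributeError. The sketch below shows the check on two made-up containers, one with a Metascore and one without. The helper name **has_metascore** is our own.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second container has no ratings-metascore div
sample_html = '''
<div class="lister-item mode-advanced">
  <div class="ratings-metascore"><span class="metascore mixed">59</span></div>
</div>
<div class="lister-item mode-advanced"></div>
'''
soup = BeautifulSoup(sample_html, 'html.parser')
with_score, without_score = soup.find_all('div', class_='lister-item mode-advanced')

def has_metascore(container):
    """Return True if the container holds a ratings-metascore div."""
    return container.find('div', class_='ratings-metascore') is not None

print(has_metascore(with_score), has_metascore(without_score))

# Accessing .text on a failed search fails, which is why the check is needed
missing = without_score.find('div', class_='ratings-metascore')
try:
    missing.text
except AttributeError as exc:
    print('AttributeError:', exc)
```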

We will first declare some list variables to store the extracted data, then loop through each container (movie) in movie_containers. We will extract the data points of interest only if the container has a Metascore.

In [38]:
# Lists to store the scraped data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []
# Extract data from individual movie container
for container in movie_containers:
    # If the movie has a Metascore, then extract:
    if container.find('div', class_ = 'ratings-metascore') is not None:
        name = container.h3.a.text
        names.append(name)
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years.append(year)
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        m_score = container.find('span', class_ = 'metascore').text
        metascores.append(int(m_score))
        vote = container.find('span', attrs = {'name':'nv'})['data-value']
        votes.append(int(vote))
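
The same extraction steps can also be wrapped in a function that returns one record per movie, which keeps the five values of a movie together. This is only an alternative sketch, not the approach used in the rest of this notebook. The function name **parse_container** and the simplified markup are our own.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that mimics the structure shown earlier
sample_html = '''
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header">
    <a href="/title/tt7286456/">Joker</a>
    <span class="lister-item-year text-muted unbold">(2019)</span>
  </h3>
  <strong>8.5</strong>
  <div class="ratings-metascore"><span class="metascore mixed">59</span></div>
  <span data-value="821814" name="nv">821,814</span>
</div>
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header"><a href="#">Untitled</a></h3>
</div>
'''

def parse_container(container):
    """Return a dict for one movie, or None if it has no Metascore."""
    if container.find('div', class_='ratings-metascore') is None:
        return None
    return {
        'movie': container.h3.a.text,
        'year': container.h3.find('span', class_='lister-item-year').text,
        'imdb': float(container.strong.text),
        'metascore': int(container.find('span', class_='metascore').text),
        'votes': int(container.find('span', attrs={'name': 'nv'})['data-value']),
    }

soup = BeautifulSoup(sample_html, 'html.parser')
containers = soup.find_all('div', class_='lister-item mode-advanced')

# Keep only the movies for which a record could be built
records = [rec for rec in map(parse_container, containers) if rec is not None]
print(records)
```

A list of such dictionaries can be passed directly to **pd.DataFrame()**, so the five separate lists are not strictly needed.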

We will use **pandas** to store the data in a **DataFrame**.

In [39]:
import pandas as pd
test_df = pd.DataFrame({'movie': names,'year': years,'imdb': imdb_ratings,'metascore': metascores,'votes': votes})
print(test_df.info())
test_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39 entries, 0 to 38
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   movie      39 non-null     object 
 1   year       39 non-null     object 
 2   imdb       39 non-null     float64
 3   metascore  39 non-null     int64  
 4   votes      39 non-null     int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 1.6+ KB
None


Unnamed: 0,movie,year,imdb,metascore,votes
0,Joker,(2019),8.5,59,821814
1,Avengers: Endgame,(2019),8.4,78,735810
2,Once Upon a Time in Hollywood,(2019),7.7,83,487644
3,Parasite,(2019),8.6,96,441668
4,Captain Marvel,(2019),6.9,64,416882
5,Knives Out,(2019),7.9,82,340712
6,1917,(2019),8.3,78,339122
7,Star Wars: The Rise of Skywalker,(2019),6.7,53,337655
8,Spider-Man: Far from Home,(2019),7.5,69,297468
9,The Irishman,(2019),7.9,94,297028


We will now scrape the first 4 pages of results for each release year, starting from 2000.

In [40]:
pages = [str(i) for i in range(1,5)]
years_url = [str(i) for i in range(2000,2020)]
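
Since **range()** excludes its upper bound, it is worth checking what these two lists actually contain and how many requests they imply. A quick check:

```python
pages = [str(i) for i in range(1, 5)]
years_url = [str(i) for i in range(2000, 2020)]

# One request per (year, page) combination
expected_requests = len(pages) * len(years_url)

print(pages)
print(years_url[0], years_url[-1])
print(expected_requests)
```

The lists cover pages 1 to 4 and years 2000 to 2019, which gives 80 combinations. That is more than the limit of 72 requests set in the loop below.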

In [44]:
from time import sleep
from random import randint

In [52]:
from time import time
from IPython.display import clear_output
from warnings import warn

We will now control the crawl rate by using the **sleep()** function. We will also use the **randint()** function to vary the pause between requests and mimic human behavior.
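
As a minimal sketch, the pause can be isolated in a small helper. The name **polite_pause** and its **sleeper** parameter are our own. The parameter lets us swap in a no-op so the example finishes instantly.

```python
from random import randint
from time import sleep

def polite_pause(low=8, high=15, sleeper=sleep):
    """Sleep for a random whole number of seconds in [low, high] and return it."""
    delay = randint(low, high)  # randint includes both endpoints
    sleeper(delay)
    return delay

# Dry run: pass a no-op instead of sleep so nothing actually waits
delay = polite_pause(sleeper=lambda seconds: None)
print(delay)

# Rough lower bound on the crawl time: mean pause times number of requests
mean_pause = (8 + 15) / 2
print(mean_pause * 72 / 60, 'minutes for 72 requests')
```

With pauses of 8 to 15 seconds, 72 requests take roughly 14 minutes of waiting alone, before counting the time spent on the requests themselves.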

In [54]:
# Redeclaring the lists to store data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Preparing the monitoring of the loop
start_time = time()
requests = 0

# For every year in years_url (2000-2019); the request cap below stops parsing after 2017
for year_url in years_url:

    # For every page in the interval 1-4
    for page in pages:

        # Make a get request
        response = get('http://www.imdb.com/search/title?release_date=' + year_url +
        '&sort=num_votes,desc&page=' + page)

        # Pause the loop
        sleep(randint(8,15))

        # Monitor the requests
        requests += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
        clear_output(wait = True)

        # Throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(requests, response.status_code))

        # Break the loop if the number of requests is greater than expected
        if requests > 72:
            warn('Number of requests was greater than expected.')
            break

        # Parse the content of the request with BeautifulSoup
        page_html = BeautifulSoup(response.text, 'html.parser')

        # Select all the 50 movie containers from a single page
        mv_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')

        # For every movie of these 50
        for container in mv_containers:
            # If the movie has a Metascore, then:
            if container.find('div', class_ = 'ratings-metascore') is not None:

                
                name = container.h3.a.text
                names.append(name)

                
                year = container.h3.find('span', class_ = 'lister-item-year').text
                years.append(year)

                
                imdb = float(container.strong.text)
                imdb_ratings.append(imdb)

                
                m_score = container.find('span', class_ = 'metascore').text
                metascores.append(int(m_score))

                
                vote = container.find('span', attrs = {'name':'nv'})['data-value']
                votes.append(int(vote))



Request:74; Frequency: 0.07538665106386494 requests/s
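
The final counter reads 74 even though the limit is 72. One reading of the loop that is consistent with this output is that **break** only leaves the inner (page) loop, so the outer (year) loop moves on to the next year and makes one more request before breaking again. The simulation below reproduces the counting logic without any network access. The function name **simulate** and its **stop_outer** flag are our own.

```python
def simulate(n_years=20, n_pages=4, cap=72, stop_outer=False):
    """Replay the counting logic of the crawl loop without sending requests."""
    requests_made = 0
    pages_parsed = 0
    for _ in range(n_years):
        capped = False
        for _ in range(n_pages):
            requests_made += 1          # the request is counted before the check
            if requests_made > cap:
                capped = True
                break                   # leaves the page loop only
            pages_parsed += 1
        if capped and stop_outer:
            break                       # optional: leave the year loop as well
    return requests_made, pages_parsed

print(simulate())                  # (74, 72)
print(simulate(stop_outer=True))   # (73, 72)
```

In both cases 72 pages are parsed, which matches the data ending in 2017. Stopping the outer loop as well would avoid the extra request.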


We will now combine the scraped lists into a DataFrame.

In [55]:
movie_ratings = pd.DataFrame({'movie': names,'year': years,'imdb': imdb_ratings,'metascore': metascores,'votes': votes})
print(movie_ratings.info())
movie_ratings.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   movie      3276 non-null   object 
 1   year       3276 non-null   object 
 2   imdb       3276 non-null   float64
 3   metascore  3276 non-null   int64  
 4   votes      3276 non-null   int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 128.1+ KB
None


Unnamed: 0,movie,year,imdb,metascore,votes
0,Gladiator,(2000),8.5,67,1297492
1,Memento,(2000),8.4,80,1090155
2,Snatch,(2000),8.3,55,761575
3,Requiem for a Dream,(2000),8.3,68,743113
4,X-Men,(2000),7.4,64,559207
5,Cast Away,(2000),7.8,73,506290
6,American Psycho,(2000),7.6,64,467841
7,Unbreakable,(2000),7.3,62,377913
8,Mission: Impossible II,(2000),6.1,59,306371
9,Meet the Parents,(2000),7.0,73,304704


In [56]:
movie_ratings = movie_ratings[['movie', 'year', 'imdb', 'metascore', 'votes']]
movie_ratings.head()

Unnamed: 0,movie,year,imdb,metascore,votes
0,Gladiator,(2000),8.5,67,1297492
1,Memento,(2000),8.4,80,1090155
2,Snatch,(2000),8.3,55,761575
3,Requiem for a Dream,(2000),8.3,68,743113
4,X-Men,(2000),7.4,64,559207


We will now examine the unique values of the year column.

In [57]:
movie_ratings['year'].unique()

array(['(2000)', '(I) (2000)', '(2001)', '(2002)', '(2003)', '(2004)',
       '(I) (2004)', '(2005)', '(I) (2005)', '(2006)', '(I) (2006)',
       '(2007)', '(I) (2007)', '(2008)', '(I) (2008)', '(2009)',
       '(I) (2009)', '(2010)', '(I) (2010)', '(2011)', '(I) (2011)',
       '(2012)', '(I) (2012)', '(2013)', '(I) (2013)', '(2014)',
       '(I) (2014)', '(II) (2014)', '(2015)', '(I) (2015)', '(II) (2015)',
       '(2016)', '(II) (2016)', '(IX) (2016)', '(I) (2016)', '(2017)',
       '(I) (2017)'], dtype=object)

Counting from the end of each string, the year always occupies the fifth-to-last through the second-to-last characters. We’ll use the **.str** accessor with the slice **[-5:-1]** to select only that interval. We’ll also convert the result to an integer using the **astype()** method.

In [58]:
movie_ratings.loc[:, 'year'] = movie_ratings['year'].str[-5:-1].astype(int)

In [59]:
movie_ratings['year'].head(3)

0    2000
1    2000
2    2000
Name: year, dtype: int64
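
The slice can be checked on a few of the raw values listed above using plain Python strings, which behave the same way as the **.str** accessor does on each element. The regular expression is an alternative we add for comparison. It is not used in this notebook.

```python
import re

raw_years = ['(2000)', '(I) (2004)', '(II) (2014)', '(IX) (2016)']

# Same slice as in the notebook: drop the closing parenthesis, keep 4 characters
by_slice = [int(text[-5:-1]) for text in raw_years]

# Alternative: pick out the run of four digits directly
by_regex = [int(re.search(r'\d{4}', text).group()) for text in raw_years]

print(by_slice)
print(by_slice == by_regex)
```

The slice relies on the year always being the last item in the string. The regular expression relies on the Roman numeral prefix never containing digits.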

In [60]:
movie_ratings.describe().loc[['min', 'max'], ['imdb', 'metascore']]

Unnamed: 0,imdb,metascore
min,4.1,24.0
max,9.0,100.0


In [61]:
movie_ratings['n_imdb'] = movie_ratings['imdb'] * 10
movie_ratings.head(3)

Unnamed: 0,movie,year,imdb,metascore,votes,n_imdb
0,Gladiator,2000,8.5,67,1297492,85.0
1,Memento,2000,8.4,80,1090155,84.0
2,Snatch,2000,8.3,55,761575,83.0


We will save our data to a **.csv** file for future analysis or reporting.

In [62]:
movie_ratings.to_csv('movie_ratings.csv')
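
By default, **to_csv()** also writes the row index as a first, unnamed column. When the file is read back, that column shows up as "Unnamed: 0". The sketch below shows the difference on a tiny made-up frame, writing to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

df = pd.DataFrame({'movie': ['Gladiator', 'Memento'], 'imdb': [8.5, 8.4]})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Passing **index=False** is a reasonable choice when the index is just the default row number, as it is here.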