Note: code adapted from https://www.dataquest.io/blog/web-scraping-beautifulsoup/

First, let's get some standard imports done.

In [1]:
import requests
from bs4 import BeautifulSoup

Great, now let's define a couple of variables and see if we can get the raw HTML for a webpage. We use the `requests.get()` function to fetch the page at the given URL.

In [2]:
release_date = '2017-01-01'
sort = 'num_votes'
ordering = 'desc'
reference = 'adv_next'

BASE_URL = 'https://www.imdb.com/search/title/?release_date={},&sort={},{}&start={}&ref_={}'

page_url = BASE_URL.format(release_date, sort, ordering, 1, reference)
response = requests.get(page_url)
# The .text attribute holds the body of the response as a string.
# The [:500] slice prints only the first 500 characters, since we
# don't want a massive printout.
print(response.text[:500])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle"
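
The URL template above is filled in by position with `.format()`. As an alternative sketch, the same query string can be assembled from a dictionary with the standard library's `urllib.parse.urlencode`, which makes each parameter explicit. The helper name `build_search_url` and the constant `SEARCH_ENDPOINT` are made up for this illustration, and the trailing comma after the release date is kept exactly as it appears in the template.

```python
from urllib.parse import urlencode

# Hypothetical constant: the part of BASE_URL before the query string.
SEARCH_ENDPOINT = 'https://www.imdb.com/search/title/'


def build_search_url(release_date, sort, ordering, start, reference):
    """Assemble the search URL from named query parameters."""
    query = {
        'release_date': release_date + ',',  # trailing comma kept from the template
        'sort': '{},{}'.format(sort, ordering),
        'start': start,
        'ref_': reference,
    }
    # safe=',' keeps the commas literal instead of percent-encoding them
    return SEARCH_ENDPOINT + '?' + urlencode(query, safe=',')


example_url = build_search_url('2017-01-01', 'num_votes', 'desc', 1, 'adv_next')
print(example_url)
```

With the values from the cell above, this should build the same string as `BASE_URL.format(release_date, sort, ordering, 1, reference)`.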


Now that we have the raw HTML, we're going to use `bs4` (Beautiful Soup) to extract the specific info we want.

In [3]:
html_soup = BeautifulSoup(response.text, 'html.parser')
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
50
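
To make the `find_all()` call easier to experiment with offline, here is a minimal sketch that runs the same search on a small, hand-written HTML string. The HTML, the titles, and the `sample_*` names are invented for illustration; only the class name is taken from the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the structure of the IMDb listing
sample_html = """
<div class="lister-item mode-advanced"><h3><a href="/title/tt0000001/">First Movie</a></h3></div>
<div class="lister-item mode-advanced"><h3><a href="/title/tt0000002/">Second Movie</a></h3></div>
<div class="sidebar">Not a movie</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Passing the full class string matches divs whose class attribute is exactly that string
sample_containers = sample_soup.find_all('div', class_='lister-item mode-advanced')

# A ResultSet supports len(), indexing, and iteration, much like a list
print(len(sample_containers))
sample_titles = [container.h3.a.text for container in sample_containers]
print(sample_titles)
```

Only the two matching divs are returned; the `sidebar` div is ignored because its class differs.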


Let's look at the first `bs4` result in `movie_containers` to see what each one contains:

In [9]:
print(movie_containers[0])

<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt4154756"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt4154756/"> <img alt="Avengers: Infinity War" class="loadlate" data-tconst="tt4154756" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BMjMxNjY2MDU1OV5BMl5BanBnXkFtZTgwNzY1MTUwNTM@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB470041630_.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt4154756/">Avengers: Infinity War</a>
<span class="lister-item-year text-muted unbold">(2018)</span>
</h3>
<p class="text-muted">
<span class="certificate">PG-13</span>
<span class="ghost">|</span>
<span class="runtime">149 min</span>
<span class="ghost">|</span>
<span class="genre

Let's use the `h3` tag to get close to the movie name:

In [12]:
print(movie_containers[0].h3)

<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt4154756/">Avengers: Infinity War</a>
<span class="lister-item-year text-muted unbold">(2018)</span>
</h3>


Great! We see that the movie title is in an anchor tag, so now we can just use `.h3.a.text` to get the movie title:

In [14]:
print(movie_containers[0].h3.a.text)

Avengers: Infinity War


The IMDb rating sits further down in the same container, past the point where the printout above is cut off, inside a `<strong>` tag. We can use `.strong.text` to get it:

In [16]:
print(movie_containers[0].strong.text)

8.5
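
Attribute-style navigation such as `.strong` returns the first matching descendant, and returns `None` when there is none, in which case `.text` would raise an `AttributeError`. The sketch below shows one way to guard against that. The helper `extract_rating` and the sample HTML are hypothetical and exist only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: one container with a rating, one without
sample_html = """
<div class="lister-item mode-advanced">
  <h3><a href="/title/tt0000001/">Rated Movie</a></h3>
  <strong>7.5</strong>
</div>
<div class="lister-item mode-advanced">
  <h3><a href="/title/tt0000002/">Unrated Movie</a></h3>
</div>
"""


def extract_rating(container):
    """Return the rating as a float, or None if there is no <strong> tag."""
    strong_tag = container.strong  # None when no <strong> descendant exists
    if strong_tag is None:
        return None
    return float(strong_tag.text)


sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_ratings = {
    container.h3.a.text: extract_rating(container)
    for container in sample_soup.find_all('div', class_='lister-item mode-advanced')
}
print(sample_ratings)
```

Whether to skip unrated entries or store `None` for them is a design choice that depends on what the data is used for later.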


## Putting it Together

Now, let's put it all together: we'll scrape each results page, get the name and rating for each movie on it, and save them in a Python dictionary.

In [19]:
movie_dict = {}

# IMDb displays 50 movies per page, so start takes the values 1, 51, ..., 951
for start in range(1, 1000, 50):
    # .format() inserts the variables defined above into BASE_URL
    page_url = BASE_URL.format(release_date, sort, ordering, start, reference)

    # Fetch the page
    response = requests.get(page_url)

    # Parse the raw HTML and find the div for each movie
    html_soup = BeautifulSoup(response.text, 'html.parser')
    movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')

    # For each movie, get the name and rating from the HTML
    for movie in movie_containers:
        # Skip entries without a rating, where movie.strong is None
        if movie.strong is None:
            continue
        name = movie.h3.a.text
        rating = movie.strong.text
        movie_dict[name] = float(rating)

KeyboardInterrupt: 
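
To see how much work the loop does, it helps to list the `start` values it walks through. The sketch below only reproduces the `range()` call from the cell above; `start_offsets` is a name introduced here for illustration.

```python
# Offsets passed as the start parameter, as produced by the loop's range() call
start_offsets = list(range(1, 1000, 50))

print(len(start_offsets))   # number of pages requested
print(start_offsets[:3])    # first few offsets
print(start_offsets[-1])    # last offset
```

At 50 movies per page, a full run makes 20 separate requests and covers the first 1000 results.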

In [18]:
print(movie_dict)

{'Avengers: Infinity War': 8.5, 'Avengers: Endgame': 8.5, 'Joker': 8.6, 'Logan': 8.1, 'Black Panther': 7.3, 'Thor: Ragnarok': 7.9, 'Guardians of the Galaxy Vol. 2': 7.6, 'Star Wars: Episode VIII - The Last Jedi': 7.0, 'Wonder Woman': 7.4, 'Dunkirk': 7.9, 'Spider-Man: Homecoming': 7.4, 'Get Out': 7.7, 'Deadpool 2': 7.7, 'It': 7.3, 'Blade Runner 2049': 8.0, 'Chernobyl': 9.5, 'Bohemian Rhapsody': 8.0, 'Baby Driver': 7.6, 'Captain Marvel': 6.9, 'Three Billboards Outside Ebbing, Missouri': 8.2, 'Justice League': 6.4, 'A Quiet Place': 7.5, 'Once Upon a Time... in Hollywood': 7.8, 'The Shape of Water': 7.3, 'John Wick: Chapter 2': 7.5, 'Ready Player One': 7.5, 'Aquaman': 7.0, 'Venom': 6.7, 'Coco': 8.4, 'A Star Is Born': 7.7, 'Spider-Man: Into the Spider-Verse': 8.4, 'Jumanji: Welcome to the Jungle': 6.9, 'Green Book': 8.2, 'Ant-Man and the Wasp': 7.1, 'Mission: Impossible - Fallout': 7.8, 'Solo: A Star Wars Story': 6.9, 'Spider-Man: Far from Home': 7.6, 'Beauty and the Beast': 7.1, 'Kong: Sku
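
One thing to keep in mind with this dictionary is that it is keyed by title, so two movies sharing a name would overwrite each other. A possible way around that, sketched below, is to key by the identifier found in each anchor's `href` (for example `/title/tt4154756/`). The helper `title_id_from_href` is hypothetical and assumes hrefs shaped like the ones in the output shown earlier. The second entry in the sample data is made up.

```python
import re

# Matches the identifier in hrefs shaped like "/title/tt4154756/"
TITLE_ID_PATTERN = re.compile(r'/title/(tt\d+)/')


def title_id_from_href(href):
    """Return the title id embedded in href, or None if the href has another shape."""
    match = TITLE_ID_PATTERN.search(href)
    return match.group(1) if match else None


# Keying by id keeps both entries even when two titles are identical.
# The second row uses an invented id and rating, purely for illustration.
movies_by_id = {}
for href, name, rating in [
    ('/title/tt4154756/', 'Avengers: Infinity War', 8.5),
    ('/title/tt0000001/', 'Avengers: Infinity War', 1.0),
]:
    movies_by_id[title_id_from_href(href)] = (name, rating)

print(len(movies_by_id))
```

Inside the scraping loop, the href would come from `movie.h3.a['href']`.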