# Lab | Web Scraping

**Hint**

Your first mission is to familiarize yourself with the IMDb advanced search page. Head over to [IMDb advanced search](https://www.imdb.com/search/title/) and input the following parameters, keeping all other fields to their default values or blank:

- **Title Type**: Feature film
- **Release date**: From 1990 to 1992 (Note: You don't need to specify the day and month)
- **User Rating**: 7.5 or higher (leave the upper bound blank)

Upon searching, you'll land on a page showcasing a list of movies, each displaying vital details such as the title, release year, and crew information. Your task is to scrape this treasure trove of data.

Carefully examine the resulting URL and construct your own URL to include all the necessary parameters for filtering the movies.
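
One way to assemble such a URL programmatically is sketched below. The helper name `build_search_url` and its parameter names are illustrative assumptions, not part of the lab. The query keys (`title_type`, `release_date`, `user_rating`) are the ones that appear in the IMDb search URL at the time of writing and may change.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"


def build_search_url(title_type, start_date, end_date, min_rating, max_rating=10):
    """Return an IMDb advanced-search URL for the given filters (hypothetical helper)."""
    params = {
        "title_type": title_type,
        "release_date": f"{start_date},{end_date}",
        "user_rating": f"{min_rating},{max_rating}",
    }
    # safe="," keeps the commas readable instead of encoding them as %2C
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


print(build_search_url("feature", "1990-01-01", "1992-12-31", 7.5))
```

Building the query from a dictionary keeps each filter in one place, which makes it easier to vary a single parameter later.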


---

**Best of luck! Immerse yourself in the world of movies and may the data be with you!**

**Important note**:

In the fast-changing online world, websites often get updates and make changes. When you try this lab, the IMDb website might be different from what we expect.

If you run into problems because of these changes, like new rules or things that stop you from getting data, don't worry! Instead, get creative.

You can choose another website that interests you and is good for scraping data. Websites like Wikipedia or The New York Times are good options. The main goal is still the same: get useful data and learn how to scrape it from a website that you find interesting. It's a chance to practice your web scraping skills and explore a source of information you like.

In [182]:
import pandas as pd

In [184]:
# bs4 = beautiful soup library
from bs4 import BeautifulSoup # import this class

In [186]:
import requests

In [188]:
# Load imdb from url
url = 'https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,10'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
}

response = requests.get(url, headers=headers)
response

<Response [200]>

In [190]:
# response.content

In [192]:
response.headers

{'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Server': 'Server', 'Date': 'Mon, 20 Jan 2025 16:07:47 GMT', 'Vary': 'Accept-Encoding,Content-Type,Accept-Encoding,User-Agent', 'Strict-Transport-Security': 'max-age=47474747; includeSubDomains; preload', 'x-amz-rid': 'S1E1YCBCP5119T5NB0D4', 'set-cookie': 'session-id=131-6556946-8969911; Domain=.imdb.com; Expires=Tue, 01 Jan 2036 08:00:01 GMT; Path=/; Secure, session-id-time=2082787201l; Domain=.imdb.com; Expires=Tue, 01 Jan 2036 08:00:01 GMT; Path=/; Secure, international-seo=; Max-Age=0; Domain=.imdb.com; Path=/; Secure; SameSite=Strict, next-sid=yW06MFzfLyWwVDK9LtNE2; Path=/; Expires=Thu, 01 Jan 1970 00:00:00 GMT; HttpOnly', 'content-security-policy': "frame-ancestors 'self' imdb.com *.imdb.com *.media-imdb.com withoutabox.com *.withoutabox.com amazon.com *.amazon.com amazon.co.uk *.amazon.co.uk amazon.de *.amazon.de translate.google.com images.google.com www.google.com www.googl

In [194]:
# Parse the HTML into a BeautifulSoup object so it can be searched
soup = BeautifulSoup(response.content, "html.parser")

In [196]:
# soup

In [198]:
# print(soup.prettify())

In [200]:
# user_rating can be found in <span class="ipc-rating-star--rating">7.7</span>
soup.find_all("span", class_='ipc-rating-star--rating')

[<span class="ipc-rating-star--rating">8.7</span>,
 <span class="ipc-rating-star--rating">7.8</span>,
 <span class="ipc-rating-star--rating">8.6</span>,
 <span class="ipc-rating-star--rating">7.7</span>,
 <span class="ipc-rating-star--rating">8.6</span>,
 <span class="ipc-rating-star--rating">8.3</span>,
 <span class="ipc-rating-star--rating">8.2</span>,
 <span class="ipc-rating-star--rating">7.7</span>,
 <span class="ipc-rating-star--rating">7.5</span>,
 <span class="ipc-rating-star--rating">7.6</span>,
 <span class="ipc-rating-star--rating">8.0</span>,
 <span class="ipc-rating-star--rating">7.6</span>,
 <span class="ipc-rating-star--rating">8.0</span>,
 <span class="ipc-rating-star--rating">7.8</span>,
 <span class="ipc-rating-star--rating">8.0</span>,
 <span class="ipc-rating-star--rating">8.0</span>,
 <span class="ipc-rating-star--rating">7.5</span>,
 <span class="ipc-rating-star--rating">7.6</span>,
 <span class="ipc-rating-star--rating">7.6</span>,
 <span class="ipc-rating-star--

In [202]:
# start_date & end_date can be found in <span class="ipc-chip__text">Release Date: January 1, 1990 to December 31, 1992</span>
soup.find_all("span", class_='ipc-chip__text')

[<span class="ipc-chip__text">Movie</span>,
 <span class="ipc-chip__text">TV Series</span>,
 <span class="ipc-chip__text">Short</span>,
 <span class="ipc-chip__text">TV Episode</span>,
 <span class="ipc-chip__text">TV Mini Series</span>,
 <span class="ipc-chip__text">TV Movie</span>,
 <span class="ipc-chip__text">TV Special</span>,
 <span class="ipc-chip__text">TV Short</span>,
 <span class="ipc-chip__text">Video Game</span>,
 <span class="ipc-chip__text">Video</span>,
 <span class="ipc-chip__text">Music Video</span>,
 <span class="ipc-chip__text">Podcast Series</span>,
 <span class="ipc-chip__text">Podcast Episode</span>,
 <span class="ipc-chip__text">Action<span class="ipc-chip__count">91</span></span>,
 <span class="ipc-chip__text">Adventure<span class="ipc-chip__count">32</span></span>,
 <span class="ipc-chip__text">Animation<span class="ipc-chip__count">19</span></span>,
 <span class="ipc-chip__text">Biography<span class="ipc-chip__count">40</span></span>,
 <span class="ipc-chip__

In [226]:
# Show all titles
# The heading text is "<number>. <title>": split once at the first "." and keep the second part.
# Headings without a number (e.g. "Recently viewed") yield a single part and are skipped.
for title in soup.select("h3.ipc-title__text"):
    parts = title.get_text().split(".", 1)
    if len(parts) == 2:
        print(parts[1])

 GoodFellas - Drei Jahrzehnte in der Mafia
 Zeit des Erwachens
 Das Schweigen der Lämmer
 Kevin - Allein zu Haus
 Terminator 2: Tag der Abrechnung
 Reservoir Dogs - Wilde Hunde
 Erbarmungslos
 Eine Frage der Ehre
 Total Recall - Die totale Erinnerung
 Der Pate 3
 Der Duft der Frauen
 Der letzte Mohikaner
 Der mit dem Wolf tanzt
 Misery
 Aladdin
 Die Schöne und das Biest
 Jagd auf Roter Oktober
 Mein Vetter Winnie
 Thelma & Louise
 Boyz n the Hood - Jungs im Viertel
 Edward mit den Scherenhänden
 Glengarry Glen Ross
 Zurück in die Zukunft III
 JFK: Tatort Dallas
 Grüne Tomaten

In [230]:
# Put all titles in a list, skipping headings without a number (e.g. "Recently viewed")
titles = []
for title in soup.select("h3.ipc-title__text"):
    parts = title.get_text().split(".", 1)
    if len(parts) == 2:
        titles.append(parts[1])

In [260]:
titles

[' GoodFellas - Drei Jahrzehnte in der Mafia',
 ' Zeit des Erwachens',
 ' Das Schweigen der Lämmer',
 ' Kevin - Allein zu Haus',
 ' Terminator 2: Tag der Abrechnung',
 ' Reservoir Dogs - Wilde Hunde',
 ' Erbarmungslos',
 ' Eine Frage der Ehre',
 ' Total Recall - Die totale Erinnerung',
 ' Der Pate 3',
 ' Der Duft der Frauen',
 ' Der letzte Mohikaner',
 ' Der mit dem Wolf tanzt',
 ' Misery',
 ' Aladdin',
 ' Die Schöne und das Biest',
 ' Jagd auf Roter Oktober',
 ' Mein Vetter Winnie',
 ' Thelma & Louise',
 ' Boyz n the Hood - Jungs im Viertel',
 ' Edward mit den Scherenhänden',
 ' Glengarry Glen Ross',
 ' Zurück in die Zukunft III',
 ' JFK: Tatort Dallas',
 ' Grüne Tomaten']

In [262]:
# Sanity check: all lists must have the same length (25 here)
len(titles)

25
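
The heading text carries both the number and the title ("1. GoodFellas ..."), and the same selector also matches at least one heading without a number ("Recently viewed"). A minimal sketch of splitting both parts in one step follows. The helper name `split_heading` is an assumption made for illustration, and the sample strings are copied from the scraped output.

```python
def split_heading(heading):
    """Split '<number>. <title>' into (number, title); return None if there is no number."""
    number, sep, title = heading.partition(". ")
    if not sep or not number.isdigit():
        return None
    # Strip leftover whitespace so titles do not start with a space
    return int(number), title.strip()


headings = [
    "1. GoodFellas - Drei Jahrzehnte in der Mafia",
    "24. JFK: Tatort Dallas",
    "Recently viewed",
]
parsed = [pair for pair in map(split_heading, headings) if pair is not None]
print(parsed)
```

Because `partition` splits only at the first ". ", a title that itself contains a period stays intact.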

In [264]:
# Show all movie numbers
# The first part of each heading is the movie number; extract it with split, then store it in a separate list.
for number in soup.select("h3.ipc-title__text"):
    number = number.get_text().split(".")[0]
    print(number)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
Recently viewedRecently viewed


In [266]:
# Put all movie numbers in a list
movie_nr = []
for number in soup.select("h3.ipc-title__text"):
    movie_nr.append(number.get_text().split(".")[0])
movie_nr

['1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23',
 '24',
 '25',
 'Recently viewedRecently viewed']

In [268]:
# Get rid of the last entry: 'Recently viewed'
movie_nr = movie_nr[:-1]
movie_nr

['1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23',
 '24',
 '25']

In [258]:
# Sanity check: all lists must have the same length (25 here)
len(movie_nr)

25

In [270]:
# Show all ratings
for rating in soup.select("span.ipc-rating-star--rating"):
    print(rating.get_text())

8.7
7.8
8.6
7.7
8.6
8.3
8.2
7.7
7.5
7.6
8.0
7.6
8.0
7.8
8.0
8.0
7.5
7.6
7.6
7.8
7.9
7.7
7.5
8.0
7.7


In [272]:
# Put all ratings in a list
ratings = []
for rating in soup.select("span.ipc-rating-star--rating"):
    ratings.append(rating.get_text())
ratings

['8.7',
 '7.8',
 '8.6',
 '7.7',
 '8.6',
 '8.3',
 '8.2',
 '7.7',
 '7.5',
 '7.6',
 '8.0',
 '7.6',
 '8.0',
 '7.8',
 '8.0',
 '8.0',
 '7.5',
 '7.6',
 '7.6',
 '7.8',
 '7.9',
 '7.7',
 '7.5',
 '8.0',
 '7.7']

In [274]:
# Sanity check: all lists must have the same length (25 here)
len(ratings)

25

In [142]:
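# Attempt to read the start and end dates from the filter chips. This does not work:
# span.ipc-chip__text matches every filter chip (title types, genres, ...), not only the release date.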
# for date in soup.select("span.ipc-chip__text"):
#    print(date.get_text())

In [144]:
# Put the chip texts into a list. Not usable for the dates: the selector matches every filter chip.
# dates = []
# for date in soup.select("span.ipc-chip__text"):
#     dates.append(date.get_text())
# dates

In [276]:
# Define the start and end dates manually
from datetime import date
start_date = date(1990,1,1)
end_date = date(1992,12,31)
print(start_date)
print(end_date)

1990-01-01
1992-12-31


In [278]:
# Descriptions list 
descriptions = []
for description in soup.select("div.ipc-html-content-inner-div"):
    descriptions.append(description.get_text())
descriptions

['The story of Henry Hill and his life in the mafia, covering his relationship with his wife Karen and his mob partners Jimmy Conway and Tommy DeVito.',
 "Dr. Sayer is a pioneering neurologist who wants to take a risk and give his patients who suffer from encephalitis a drug used for Parkinson's Disease. He tries it out on one man who miraculously wakes from his perpetual catatonic state.",
 'A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.',
 'An eight-year-old troublemaker, mistakenly left home alone, must defend his home against a pair of burglars on Christmas Eve.',
 'A cyborg, identical to the one who failed to kill Sarah Connor, must now protect her ten year old son John from an even more advanced and powerful cyborg.',
 'When a simple jewelry heist goes horribly wrong, the surviving criminals begin to suspect that one of them is a police informant.',
 'Retired Old W

In [280]:
# Sanity check: all lists must have the same length (25 here)
len(descriptions)

25

In [282]:
# Show all votes
for vote in soup.select("span.ipc-rating-star--voteCount"):
    print(vote.get_text())

 (1.3M)
 (165K)
 (1.6M)
 (689K)
 (1.2M)
 (1.1M)
 (448K)
 (298K)
 (364K)
 (435K)
 (342K)
 (194K)
 (298K)
 (246K)
 (482K)
 (489K)
 (221K)
 (147K)
 (180K)
 (160K)
 (541K)
 (120K)
 (495K)
 (175K)
 (86K)


In [288]:
# Create votes list
# Strip the leading "\xa0(" and then the trailing ")"
votes = []
for vote in soup.select("span.ipc-rating-star--voteCount"):
    votes.append(vote.get_text().strip("\xa0(").strip(")"))
votes

['1.3M',
 '165K',
 '1.6M',
 '689K',
 '1.2M',
 '1.1M',
 '448K',
 '298K',
 '364K',
 '435K',
 '342K',
 '194K',
 '298K',
 '246K',
 '482K',
 '489K',
 '221K',
 '147K',
 '180K',
 '160K',
 '541K',
 '120K',
 '495K',
 '175K',
 '86K']

In [170]:
# Maria's version:
# votes[0].getText()[2:-1]

In [290]:
# Sanity check: all lists must have the same length (25 here)
len(votes)

25
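
The vote counts are abbreviated strings such as "1.3M" or "86K", which cannot be sorted or compared numerically. A possible conversion to integers is sketched below. The helper name `parse_votes` and the suffix table are assumptions for illustration. The result is only as precise as the abbreviation IMDb displays.

```python
from decimal import Decimal

# Multipliers for the abbreviations seen in the scraped output
MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_votes(text):
    """Convert an abbreviated count such as '1.3M' or '(86K)' to an integer."""
    cleaned = text.strip("\xa0 ()")
    suffix = cleaned[-1:].upper()
    if suffix in MULTIPLIERS:
        # Decimal avoids floating-point rounding surprises for values like 1.1
        return int(Decimal(cleaned[:-1]) * MULTIPLIERS[suffix])
    return int(Decimal(cleaned))


print([parse_votes(v) for v in ["1.3M", "165K", "86K"]])
```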

In [318]:
# Show all metadata items; each movie lists release year, runtime, and age rating.
for year in soup.select("span.sc-300a8231-7.eaXxft.dli-title-metadata-item"):
    print(year.get_text())

1990
2h 25m
16
1990
2h 1m
12
1991
1h 58m
16
1990
1h 43m
12
1991
2h 17m
16
1992
1h 39m
18
1992
2h 10m
16
1992
2h 18m
12
1990
1h 53m
18
1990
2h 42m
16
1992
2h 36m
12
1992
1h 52m
16
1990
3h 1m
12
1990
1h 47m
16
1992
1h 30m
0
1991
1h 24m
0
1990
2h 15m
12
1992
2h
6
1991
2h 10m
16
1991
1h 52m
16
1990
1h 45m
6
1992
1h 40m
12
1990
1h 58m
6
1991
3h 9m
12
1991
2h 10m
6


In [344]:
# Collect every metadata item; each movie contributes release year, runtime, and age rating, in that order.
release_year = []
for year in soup.select("span.sc-300a8231-7.eaXxft.dli-title-metadata-item"):
    release_year.append(year.get_text())
release_year

['1990',
 '2h 25m',
 '16',
 '1990',
 '2h 1m',
 '12',
 '1991',
 '1h 58m',
 '16',
 '1990',
 '1h 43m',
 '12',
 '1991',
 '2h 17m',
 '16',
 '1992',
 '1h 39m',
 '18',
 '1992',
 '2h 10m',
 '16',
 '1992',
 '2h 18m',
 '12',
 '1990',
 '1h 53m',
 '18',
 '1990',
 '2h 42m',
 '16',
 '1992',
 '2h 36m',
 '12',
 '1992',
 '1h 52m',
 '16',
 '1990',
 '3h 1m',
 '12',
 '1990',
 '1h 47m',
 '16',
 '1992',
 '1h 30m',
 '0',
 '1991',
 '1h 24m',
 '0',
 '1990',
 '2h 15m',
 '12',
 '1992',
 '2h',
 '6',
 '1991',
 '2h 10m',
 '16',
 '1991',
 '1h 52m',
 '16',
 '1990',
 '1h 45m',
 '6',
 '1992',
 '1h 40m',
 '12',
 '1990',
 '1h 58m',
 '6',
 '1991',
 '3h 9m',
 '12',
 '1991',
 '2h 10m',
 '6']
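
The flat list interleaves three values per movie, so it can be regrouped into one tuple per movie before picking columns. The sketch below assumes every movie exposes exactly three metadata items; if one item is missing on the page, all later values shift, and selecting the items per movie container is the safer approach. The helper name `chunk` is hypothetical.

```python
def chunk(items, size):
    """Split a flat list into consecutive tuples of `size` items."""
    if len(items) % size != 0:
        raise ValueError(f"{len(items)} items cannot be split into groups of {size}")
    return [tuple(items[i:i + size]) for i in range(0, len(items), size)]


# First three movies from the scraped metadata
flat = ["1990", "2h 25m", "16", "1990", "2h 1m", "12", "1991", "1h 58m", "16"]
for year, runtime, age_rating in chunk(flat, 3):
    print(year, runtime, age_rating)
```

The length check turns a silent misalignment into an explicit error when the total count is not a multiple of three.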

In [346]:
# Create a runtimes list.
# The flat list repeats release year, runtime, age rating; take every third item starting at index 1.
# Note: the runtimes are in "_h _m" format and still need to be converted to minutes.
runtimes = release_year[1::3]
runtimes

['2h 25m',
 '2h 1m',
 '1h 58m',
 '1h 43m',
 '2h 17m',
 '1h 39m',
 '2h 10m',
 '2h 18m',
 '1h 53m',
 '2h 42m',
 '2h 36m',
 '1h 52m',
 '3h 1m',
 '1h 47m',
 '1h 30m',
 '1h 24m',
 '2h 15m',
 '2h',
 '2h 10m',
 '1h 52m',
 '1h 45m',
 '1h 40m',
 '1h 58m',
 '3h 9m',
 '2h 10m']
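
The runtimes are still in "_h _m" text form. One possible conversion to minutes is sketched below, using a regular expression that accepts an hours part, a minutes part, or both. The helper name `runtime_to_minutes` is an assumption for illustration.

```python
import re

# Optional hours part, optional minutes part, e.g. "2h 25m", "2h", "45m"
RUNTIME_RE = re.compile(r"(?:(\d+)h)?\s*(?:(\d+)m)?")


def runtime_to_minutes(text):
    """Convert a runtime such as '2h 25m' or '2h' to total minutes."""
    match = RUNTIME_RE.fullmatch(text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized runtime: {text!r}")
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes


print([runtime_to_minutes(r) for r in ["2h 25m", "2h", "3h 9m"]])
```

Raising on unrecognized input means a misaligned value, such as an age rating in the runtime position, fails loudly instead of producing a wrong number.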

In [348]:
# Keep every third item starting at index 0: the release years.
release_year = release_year[::3]
release_year

['1990',
 '1990',
 '1991',
 '1990',
 '1991',
 '1992',
 '1992',
 '1992',
 '1990',
 '1990',
 '1992',
 '1992',
 '1990',
 '1990',
 '1992',
 '1991',
 '1990',
 '1992',
 '1991',
 '1991',
 '1990',
 '1992',
 '1990',
 '1991',
 '1991']

In [350]:
# Sanity check: all lists must have the same length (25 here)
len(release_year)

25

In [352]:
# Sanity check: all lists must have the same length (25 here)
len(runtimes)

25
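
The repeated length checks exist because the lists are combined by position: row i takes item i from every list. A small sketch of doing the check and the combination in one step is shown below. The helper name `build_rows` and the column names are illustrative assumptions. The resulting list of dictionaries can be passed to `pd.DataFrame`.

```python
def build_rows(**columns):
    """Combine equally long lists into one dictionary per row."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


rows = build_rows(
    title=["GoodFellas - Drei Jahrzehnte in der Mafia", "Zeit des Erwachens"],
    year=["1990", "1990"],
    runtime=["2h 25m", "2h 1m"],
)
print(rows[1])
```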

In [356]:
# Alternative: select the runtime spans directly:
# runtime = soup.select("span.sc-300a8231-7.eaXxft.dli-title-metadata-item")
# In a CSS selector, the spaces that separate class names in the HTML class attribute become dots (span.a.b.c).

In [358]:
# runtimes = []
# for runtime in soup.select("span.sc-300a8231-7.eaXxft.dli-title-metadata-item"):
#    runtimes.append(runtime.get_text())
# runtimes
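
The difference between dots and spaces in a selector can be checked on a small hand-written snippet. The HTML below only mimics the class attribute seen on the results page; it is not a copy of IMDb's markup. The variable name `demo_soup` is chosen so that it does not overwrite the notebook's `soup`.

```python
from bs4 import BeautifulSoup

# Hand-written snippet that imitates the class attribute of the metadata spans
html = """
<div>
  <span class="sc-300a8231-7 eaXxft dli-title-metadata-item">1990</span>
  <span class="sc-300a8231-7 eaXxft dli-title-metadata-item">2h 25m</span>
  <span class="ipc-rating-star--rating">8.7</span>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Dots chain classes on ONE element: all three classes must be present
chained = demo_soup.select("span.sc-300a8231-7.eaXxft.dli-title-metadata-item")

# A single class is enough to match the same elements in this snippet
single = demo_soup.select("span.dli-title-metadata-item")

# A space is the descendant combinator: this looks for an <eaXxft> tag
# nested inside the span and therefore finds nothing
spaced = demo_soup.select("span.sc-300a8231-7 eaXxft")

print([tag.get_text() for tag in chained])
print(len(single), len(spaced))
```

Class names such as `sc-300a8231-7` and `eaXxft` look auto-generated, so they may change between site updates; relying on the more descriptive class alone may prove more durable, though that is an assumption rather than something the page guarantees.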

In [436]:
# Task: write a function named scrape_imdb that takes four parameters: title_type, user_rating, start_date, and end_date.
# The function should return a DataFrame with the following columns:
# Movie Nr: The number representing the movie’s position in the list.
# Title: The title of the movie.
# Year: The year the movie was released.
# Rating: The IMDb rating of the movie.
# Runtime (min): The duration of the movie in minutes.
# Description: A brief description of the movie.
# Votes: The number of votes the movie received.

In [490]:
from itertools import islice

def scrape_imdb(title_type, user_rating, start_date, end_date):
    """Scrape the first IMDb advanced-search result page into a DataFrame."""
    url = f'https://www.imdb.com/search/title/?title_type={title_type}&release_date={start_date},{end_date}&user_rating={user_rating}'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # Fail early if the request was blocked or broken
    soup = BeautifulSoup(response.content, "html.parser")

    def text_of(parent, selector):
        # Stripped text of the first match, or "N/A" if the element is missing
        element = parent.select_one(selector)
        return element.get_text().strip() if element else "N/A"

    movie_data = []  # One dictionary per movie

    movies = soup.select("li.ipc-metadata-list-summary-item")
    for movie in islice(movies, 25):  # Limit to 25 movies
        # The heading looks like "1. GoodFellas ...": number before the first ". ", title after it
        number, sep, title = text_of(movie, "h3.ipc-title__text").partition(". ")
        if not sep:  # No number prefix: keep the whole heading as the title
            number, title = "N/A", number

        # The metadata items appear in the order release year, runtime, age rating
        metadata = [item.get_text().strip() for item in
                    movie.select("span.sc-300a8231-7.eaXxft.dli-title-metadata-item")]
        metadata += ["N/A"] * (2 - len(metadata))  # Pad if year or runtime is missing

        movie_data.append({
            "Movie Nr": number,
            "Title": title,
            "Year": metadata[0],
            "Rating": text_of(movie, "span.ipc-rating-star--rating"),
            "Runtime": metadata[1],  # Kept as displayed, e.g. "2h 25m"
            "Description": text_of(movie, "div.ipc-html-content-inner-div"),
            "Votes": text_of(movie, "span.ipc-rating-star--voteCount").strip("\xa0 ()"),
        })

    movies_df = pd.DataFrame(movie_data)
    return movies_df


In [492]:
scrape_imdb("feature", "7.5", "1990-01-01", "1992-12-31")

Unnamed: 0,Movie Nr,Title,Year,Rating,Runtime,Description,Votes
0,1,GoodFellas - Drei Jahrzehnte in der Mafia,1990,8.7,2h 25m,The story of Henry Hill and his life in the ma...,1.3M
1,2,Zeit des Erwachens,1990,7.8,2h 1m,Dr. Sayer is a pioneering neurologist who want...,165K
2,3,Das Schweigen der Lämmer,1991,8.6,1h 58m,A young F.B.I. cadet must receive the help of ...,1.6M
3,4,Kevin - Allein zu Haus,1990,7.7,1h 43m,"An eight-year-old troublemaker, mistakenly lef...",689K
4,5,Terminator 2: Tag der Abrechnung,1991,8.6,2h 17m,"A cyborg, identical to the one who failed to k...",1.2M
5,6,Reservoir Dogs - Wilde Hunde,1992,8.3,1h 39m,When a simple jewelry heist goes horribly wron...,1.1M
6,7,Erbarmungslos,1992,8.2,2h 10m,Retired Old West gunslinger Will Munny relucta...,448K
7,8,Eine Frage der Ehre,1992,7.7,2h 18m,A military lawyer is tasked with defending two...,298K
8,9,Total Recall - Die totale Erinnerung,1990,7.5,1h 53m,When a man goes in to have virtual vacation me...,364K
9,10,Der Pate 3,1990,7.6,2h 42m,"Follows Michael Corleone, now in his 60s, as h...",435K
