# Scraping IMDb

## Gameplan:

Scraping the [Top 250 Movies](https://www.imdb.com/chart/top/) from IMDb.

Afterwards, the results will be analysed in a separate notebook.

In [18]:
# Imports
# Data Handling and Storage
import pandas as pd
import numpy as np

# scraping tools
import requests
from bs4 import BeautifulSoup

# further improvements
import json

In [4]:
# storing page to scrape
url = "https://www.imdb.com/chart/top/"

# setting up a user agent (headers)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

# wrapping the request in try-except to handle request errors
print('Fetching website...')
try:
    page = requests.get(url, headers=headers, timeout=10)
    page.raise_for_status()  # -> raises an HTTPError for 4xx/5xx responses
    print(f'Success! Status Code: {page.status_code}')

    # checking if content is scraped
    soup = BeautifulSoup(page.text, 'html.parser')
    print(f'Page Title: {soup.title.text}')

except requests.exceptions.RequestException as e:
    print(f'Error fetching page: {e}')


Fetching website...
Success! Status Code: 200
Page Title: IMDb Top 250 movies


In [14]:
# finding the big container (ul) on the page
movie_ul = soup.find('ul', class_='ipc-metadata-list')

# guard against a missing list element
if movie_ul:
    movies = movie_ul.find_all('li')
    print(f'Success! {len(movies)} movies were found.')

    # sneak peek at the first movie
    first_movie = movies[0]
    print("\n--- Raw HTML of first movie ---")
    print(first_movie.prettify()[:500])  # only first 500 characters

else:
    print("Error: Could not find movie list.")

Success! 25 movies were found.

--- Raw HTML of first movie ---
<li class="ipc-metadata-list-summary-item">
 <div class="ipc-metadata-list-summary-item__c">
  <div class="ipc-metadata-list-summary-item__tc">
   <span aria-disabled="false" class="ipc-metadata-list-summary-item__t ipc-btn--not-interactable">
   </span>
   <div class="sc-fc35a1ef-1 lmHCrT cli-parent li-compact">
    <div class="sc-fc35a1ef-0 hTMtRz">
     <div class="sc-d0224b4e-0 jfogmY cli-poster-container">
      <div class="ipc-poster ipc-poster--base ipc-poster--media-radius ipc-poster--wl


### Level 1: Scraping the Data

**Plan for scraped data:**
- Title
- Release year
- Duration
- Rating
- Number of votes
- Certification rating

In [12]:
# testing the scraper on movies[0]

# creating empty list to store data
movie_data = []

# grabbing and extracting the elements

# movie title
title_tag = movies[0].find('h3', class_='ipc-title__text')
title = title_tag.text.strip() if title_tag else 'N/A'

# release year, duration and certification
metadata = movies[0].find_all('span', class_='cli-title-metadata-item')
year = metadata[0].text.strip()
duration = metadata[1].text.strip()
certification = metadata[2].text.strip()  # -> may lead to index error

# rating
rating_tag = movies[0].find('span', class_='ipc-rating-star--rating')
rating = rating_tag.text.strip() if rating_tag else 'N/A'

# amount of votes
vote_count_tag = movies[0].find('span', class_='ipc-rating-star--voteCount')
vote_count = vote_count_tag.text.strip() if vote_count_tag else 'N/A'

movie_data.append({
    'Title' : title,
    'Year' : year,
    'Duration' : duration,
    'Certification Rating' : certification,
    'Rating' : rating,
    'Votes' : vote_count
})

print(movie_data)

[{'Title': 'The Shawshank Redemption', 'Year': '1994', 'Duration': '2h 22m', 'Certification Rating': 'R', 'Rating': '9.3', 'Votes': '(3.1M)'}]


In [13]:
# since the test worked:
# looping through all entries to get the data of all 250 movies
# creating a new list to store all dictionaries
imdb_movies = []

# loop through all movies
for movie in movies:

    # movie title
    title_tag = movie.find('h3', class_='ipc-title__text')
    title = title_tag.text.strip() if title_tag else 'N/A'

    # release year, duration and certification
    metadata = movie.find_all('span', class_='cli-title-metadata-item')

    # safety check: fall back to 'N/A' when fewer than 3 metadata items are present
    year = metadata[0].text.strip() if len(metadata) > 0 else 'N/A'
    duration = metadata[1].text.strip() if len(metadata) > 1 else 'N/A'
    certification = metadata[2].text.strip() if len(metadata) > 2 else 'N/A'

    # rating
    rating_tag = movie.find('span', class_='ipc-rating-star--rating')
    rating = rating_tag.text.strip() if rating_tag else 'N/A'

    # amount of votes
    vote_count_tag = movie.find('span', class_='ipc-rating-star--voteCount')
    vote_count = vote_count_tag.text.strip() if vote_count_tag else 'N/A'

    # appending items to list
    imdb_movies.append({
        'Title' : title,
        'Year' : year,
        'Duration' : duration,
        'Certification Rating' : certification,
        'Rating' : rating,
        'Votes' : vote_count
    })

# converting list into DataFrame
df = pd.DataFrame(imdb_movies)

# Check & first results
print(f'Successfully scraped {len(df)} movies!')
df.head()

Successfully scraped 25 movies!


| | Title | Year | Duration | Certification Rating | Rating | Votes |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 2h 22m | R | 9.3 | (3.1M) |
| 1 | The Godfather | 1972 | 2h 55m | R | 9.2 | (2.2M) |
| 2 | The Dark Knight | 2008 | 2h 32m | PG-13 | 9.1 | (3.1M) |
| 3 | The Godfather Part II | 1974 | 3h 22m | R | 9.0 | (1.5M) |
| 4 | 12 Angry Men | 1957 | 1h 36m | Approved | 9.0 | (963K) |


##### Conclusion:

Only 25 movies were scraped instead of 250.

-> The scraper does not scroll through the site, so the HTML it receives contains only the first 25 list items. The remaining movies have not been loaded, so the "hidden map" holding the full data has to be found.

### Level 2: Finding the Hidden Map

In [24]:
# finding the JSON map in which the data is stored -> __NEXT_DATA__
script_tag = soup.find('script', id='__NEXT_DATA__')

# guard against a missing script tag
if script_tag:
    # turning the text inside the script tag into a Python dictionary
    data = json.loads(script_tag.string)

    # navigating JSON path to movies
    try:
        movies_data = data['props']['pageProps']['pageData']['chartTitles']['edges']
        print(f'Success! {len(movies_data)} movies were found (Hidden in JSON)')

    except KeyError:
        print('JSON was found, but keys are different than expected.')

else:
    print('Error: Could not find the __NEXT_DATA__ tag')

Success! 250 movies were found (Hidden in JSON)
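
The deep path above fails with a `KeyError` if any single key is missing or renamed. As a minimal sketch (not part of the original notebook), a small helper can walk such a path and return a default instead of raising. The name `dig` and the `sample` dictionary are hypothetical. The sample only mirrors the `node` / `titleText` / `text` keys seen in the JSON, and its `runtime: None` entry is invented to show the fallback.

```python
def dig(obj, *keys, default=None):
    """Follow keys/indices through nested dicts and lists.

    Returns `default` as soon as one step cannot be resolved.
    """
    current = obj
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            # KeyError: missing dict key, IndexError: list too short,
            # TypeError: current value is None or not subscriptable
            return default
    return current


# hypothetical sample shaped like one entry of the JSON list
sample = {"node": {"titleText": {"text": "The Shawshank Redemption"}, "runtime": None}}

title = dig(sample, "node", "titleText", "text")
runtime = dig(sample, "node", "runtime", "seconds")  # runtime is None -> default

print(title)
print(runtime)
```

With a helper like this, a lookup such as `dig(data, 'props', 'pageProps', 'pageData', 'chartTitles', 'edges', default=[])` would yield an empty list rather than an exception if the page structure changes.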


In [25]:
# inspecting the data -> first movie
first_json_movie = movies_data[0]

# printing the results
print(json.dumps(first_json_movie, indent=2))

{
  "currentRank": 1,
  "node": {
    "id": "tt0111161",
    "titleText": {
      "text": "The Shawshank Redemption",
      "__typename": "TitleText"
    },
    "titleType": {
      "id": "movie",
      "text": "Movie",
      "canHaveEpisodes": false,
      "displayableProperty": {
        "value": {
          "plainText": "",
          "__typename": "Markdown"
        },
        "__typename": "DisplayableTitleTypeProperty"
      },
      "__typename": "TitleType"
    },
    "originalTitleText": {
      "text": "The Shawshank Redemption",
      "__typename": "TitleText"
    },
    "primaryImage": {
      "id": "rm1690056449",
      "width": 1200,
      "height": 1800,
      "url": "https://m.media-amazon.com/images/M/MV5BMDAyY2FhYjctNDc5OS00MDNlLThiMGUtY2UxYWVkNGY2ZjljXkEyXkFqcGc@._V1_.jpg",
      "caption": {
        "plainText": "Tim Robbins in The Shawshank Redemption (1994)",
        "__typename": "Markdown"
      },
      "__typename": "Image"
    },
    "releaseYear": {
      "ye

#### Conclusions

The JSON data includes all the important values (title, year, rating, etc.), and some of them are already in a more convenient format.

**Changes:**
- Votes: exact integer count instead of an abbreviated string such as `(3.1M)`
- Year: integer instead of string
- Duration: given in seconds instead of hours and minutes
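
For comparison, the strings from the HTML scrape in Level 1 would have needed extra parsing before any analysis. The following is a rough sketch of that parsing, assuming the formats seen in the Level 1 output (`(3.1M)`, `(963K)`, `2h 22m`). The function names are hypothetical and other formats may occur on the live page.

```python
import re


def parse_vote_count(text):
    """Convert an abbreviated count like '(3.1M)' or '(963K)' to an approximate int."""
    match = re.fullmatch(r"\(?\s*(\d+(?:\.\d+)?)\s*([KMB]?)\s*\)?", text.strip())
    if match is None:
        return None
    number, suffix = match.groups()
    multiplier = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}[suffix]
    return round(float(number) * multiplier)


def parse_duration_minutes(text):
    """Convert a duration like '2h 22m', '3h' or '45m' to total minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?\s*(?:(\d+)m)?", text.strip())
    if match is None or not any(match.groups()):
        return None
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes


print(parse_vote_count("(3.1M)"))
print(parse_duration_minutes("2h 22m"))
```

Even after parsing, the vote count from the HTML is only a rounded value, whereas the JSON carries the exact number.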

### Level 3: The Final Scrape with JSON

In [29]:
# creating a new list
imdb_json_data = []

# Loop through the JSON list
for item in movies_data:
    node = item['node']

    # 1. title
    title = node['titleText']['text']

    # 2. year -> using .get() in case release year is missing
    year = node['releaseYear']['year'] if node.get('releaseYear') else None

    # 3. duration (in seconds) 
    duration_secs = node['runtime']['seconds'] if node.get('runtime') else None

    # converting duration to minutes to improve readability
    duration_mins = duration_secs // 60 if duration_secs else None

    # 4. Rating
    rating = node['ratingsSummary']['aggregateRating'] if node.get('ratingsSummary') else None

    # 5. Votes (as exact number)
    votes = node['ratingsSummary']['voteCount'] if node.get('ratingsSummary') else None

    # 6. certification
    cert = node['certificate']['rating'] if node.get('certificate') else None

    # storing everything in the list
    imdb_json_data.append({
        'Title' : title,
        'Year' : year,
        'Duration (min)' : duration_mins,
        'Duration (secs)': duration_secs,
        'Rating' : rating,
        'Votes' : votes,
        'Certification Rating' : cert
    })

# Convert list to Dataframe
df_clean = pd.DataFrame(imdb_json_data)

# checking and looking at first few rows
print(f'Successfully extracted {len(df_clean)} movies from JSON!')
df_clean.head()

Successfully extracted 250 movies from JSON!


| | Title | Year | Duration (min) | Duration (secs) | Rating | Votes | Certification Rating |
|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 142 | 8520 | 9.3 | 3131912 | R |
| 1 | The Godfather | 1972 | 175 | 10500 | 9.2 | 2185484 | R |
| 2 | The Dark Knight | 2008 | 152 | 9120 | 9.1 | 3107811 | PG-13 |
| 3 | The Godfather Part II | 1974 | 202 | 12120 | 9.0 | 1469364 | R |
| 4 | 12 Angry Men | 1957 | 96 | 5760 | 9.0 | 962954 | Approved |


In [30]:
# storing the dataframe into a csv
df_clean.to_csv('imdb_top_250.csv', index=False)
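
Before the CSV is used in the follow-up notebook, a few sanity checks on the extracted rows can catch silent problems. The sketch below is an optional addition, not part of the original workflow. It assumes rows shaped like the dictionaries in `imdb_json_data`, and the function name `find_problems` and the chosen checks (row count, missing values, rating range, duplicates) are illustrative. The two sample rows are taken from the output above.

```python
def find_problems(rows, expected_count=250):
    """Return a list of human-readable problems found in the scraped rows."""
    problems = []
    if len(rows) != expected_count:
        problems.append(f"expected {expected_count} rows, got {len(rows)}")

    seen = set()
    for index, row in enumerate(rows):
        # fields that the extraction loop filled with None
        missing = [key for key, value in row.items() if value is None]
        if missing:
            problems.append(f"row {index}: missing {', '.join(missing)}")

        # IMDb ratings lie between 1 and 10
        rating = row.get("Rating")
        if rating is not None and not 1.0 <= rating <= 10.0:
            problems.append(f"row {index}: rating {rating} out of range")

        # the same title and year appearing twice suggests a duplicate entry
        key = (row.get("Title"), row.get("Year"))
        if key in seen:
            problems.append(f"row {index}: duplicate entry {key}")
        seen.add(key)
    return problems


# two rows copied from the extracted data shown above
sample_rows = [
    {"Title": "The Shawshank Redemption", "Year": 1994, "Rating": 9.3, "Votes": 3131912},
    {"Title": "The Godfather", "Year": 1972, "Rating": 9.2, "Votes": 2185484},
]
print(find_problems(sample_rows, expected_count=2))
```

In the notebook this could be called as `find_problems(imdb_json_data)`; an empty list means none of these checks failed.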