<hr/>
<div class="alert alert-success alertsuccess" style="margin-top: 20px">
[Tip]: To execute the Python code in the code cell below, click on the cell to select it and press <kbd>Shift</kbd> + <kbd>Enter</kbd>.
</div>
<hr/>

# Exercise 1: IMDB - Part 1

## This exercise is split into two notebooks

1. Part: Scraping IMDB (this notebook)
2. Part: Exploratory Data Analysis (the second notebook)
    

## Part 1: Scraping IMDB

1. Task: Scrape an IMDB movie
2. Task: Convert the data to a machine readable format
3. Task: Parse the list of top 250 movies

### You have to hand in this exercise via Moodle.

# Installing & Importing Pre-Requisites

First, we need to install the required libraries.

In [1]:
try:
    from bs4 import BeautifulSoup
except ImportError as e:
    !pip install BeautifulSoup4

try:
    import lxml
except ImportError as e:
    !pip install lxml
    
try:
    import html5lib
except ImportError as e:
    !pip install html5lib
    
try:
    import requests_cache
except ImportError as e:
    !pip install requests_cache
    
try:
    from tqdm import tqdm
except ImportError as e:
    !pip install tqdm

import time
import numpy as np
import pandas as pd
import requests
import requests_cache
import warnings

from os.path import exists
from IPython.display import display    

# Scraping IMDB using Beautiful Soup

We will scrape the Internet Movie Database (IMDB) at [imdb.com](https://imdb.com) for the 250 top movies ever made.

### Example

This is an example of the result of scraping the webpage: https://www.imdb.com/title/tt0111161/

<img src="images/imdb.png">

In [2]:
# Load the full list of movies from json format
def load_movies_json():    
    local = "data/movies_full_crawled.json"
    if exists(local):
        print ("Read from local file")
        return pd.read_json(local)
    else:
        print ("Read from hu-box")        
        return pd.read_json("https://box.hu-berlin.de/f/bd7bdd460c55420783aa/?dl=1")

# show only one movie
df_redemption = load_movies_json().head(1)
df_redemption

Read from local file


Unnamed: 0,url,title,ratingValue,ratingCount,year,description,budget,gross,duration,genreList,countryList,castList,characterList,directorList
0,/title/tt0111161/,The Shawshank Redemption,9.3,3100000,1994,A banker convicted of uxoricide forms a friend...,"$25,000,000 (estimated)",29334033,142,"[Epic, Period Drama, Prison Drama, Drama]",[United States],"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...","[Andy Dufresne, Ellis Boyd 'Red' Redding, Ward...",[Frank Darabont]


# List of movies

We will use a JSON file to get the list of movies to scrape. We use `Pandas` to read JSON files.

The file contains three columns:
- *title*,
- *rating*, and
- *href*, the relative link to the movie page.
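
As a minimal sketch of how `Pandas` reads such a file, the snippet below parses a one-row JSON string with the same three columns. The literal `raw` and the name `movies_demo` are made up for illustration; the real file is loaded in the next cell.

```python
import io

import pandas as pd

# Hypothetical one-row excerpt with the same three columns as the movie list.
raw = '[{"title": "Die Verurteilten", "rating": 9.2, "href": "/title/tt0111161/"}]'

# Wrap the literal in StringIO so pandas treats it like an open file.
movies_demo = pd.read_json(io.StringIO(raw))

print(movies_demo.columns.tolist())
print(movies_demo.loc[0, "href"])
```

Each JSON object becomes one row, and the keys become the column names.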

In [3]:
# Load excerpt of the movies from json format
def load_top_movies_short_json():
    local = "data/movies_short.json"
    if exists(local):
        print ("Read from local file")
        return pd.read_json(local)
    else:
        print ("Read from hu-box")        
        return pd.read_json("https://box.hu-berlin.de/f/5d645662ec9e45338b03/?dl=1")

movies = load_top_movies_short_json()
movies

Read from local file


Unnamed: 0,title,rating,href
0,Die Verurteilten,9.2,/title/tt0111161/


## We will now load and cache the HTML pages 

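With the cache, each page is downloaded only once. Repeated runs read the stored copy, which lowers the risk of being blocked by IMDB for sending too many requests.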

In [4]:
requests_cache.install_cache('imdb_cache')

# Download each movie page (or read it from the cache) and collect the responses
def get_webpages(movies):
    films = []
    headers = {
        'Accept-Language': 'en; q=1.0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    }

    for i in tqdm(range(len(movies)), desc='Scrap webpages'):
        href = movies.iloc[i].href
        
        # Make a get request
        req_time = time.perf_counter()
        # href already starts with '/', so do not add a second slash
        response = requests.get(f'https://www.imdb.com{href}', headers=headers)

        # Throw a warning for non-200 status codes
        if response.status_code != 200:
            warnings.warn(f'Request to {href} returned status {response.status_code}')

        elapsed_time = np.round(time.perf_counter() - req_time, 5)
        # print(f"Loading Webpage: {href} done in {elapsed_time} seconds")

        films.append([href, response])

        # Optional: time.sleep(1)  # Uncomment to reduce request rate and avoid blocking
        
    return films

movies_html = get_webpages(movies)

Scrap webpages: 100%|██████████| 1/1 [00:00<00:00, 70.60it/s]
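
The idea behind the cache can be pictured with a small stand-in that uses no network at all. This is a conceptual sketch only, not how `requests_cache` is implemented; `fake_fetch`, `cached_fetch`, and `calls` are hypothetical names.

```python
# Conceptual sketch: a dict-based cache around a stand-in fetch function.
calls = []  # records every simulated request

def fake_fetch(url):
    """Pretend to download a page and note that a 'request' happened."""
    calls.append(url)
    return f"<html>page for {url}</html>"

_cache = {}

def cached_fetch(url):
    """Return the stored page if present, otherwise fetch and store it."""
    if url not in _cache:
        _cache[url] = fake_fetch(url)
    return _cache[url]

url = "https://www.imdb.com/title/tt0111161/"
first = cached_fetch(url)   # triggers one simulated request
second = cached_fetch(url)  # served from the dict, no new request
print(len(calls))
```

The second call returns the same page without another request, which is why re-running the scraping cell is fast once the pages are cached.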


# Task 1: Scrape IMDB using BeautifulSoup

<div class="alert alert-block alert-success">
    
You are expected to complete the following method to parse the webpage using BeautifulSoup.

</div>    

From each webpage, you are supposed to extract the following information:

* `url`: The URL
* `title`: The title
* `year`: The year of production
* `description`: The description of the movie
* `budget`: The budget
* `gross`: The *worldwide* gross
* `ratingValue`: The rating of the movie
* `ratingCount`: The number of votes
* `duration`: The duration of the movie
* `genreList`: A *list* of genres
* `countryList`: A *list* of the countries of origin
* `castList`: A *list* of the cast of the movie
* `characterList`: The *list* of characters played by the cast
* `directorList`: The *list* of directors

A variable ending in *List* such as *genreList* refers to the extracted data type being a list. 

**Hint:**

- Use `html.find_all()` to parse list types and `html.find()` to get a single entry
- Use `try:` and `except:`, or check for `None`, when accessing elements that are missing for some movies, such as `budget` or `gross`.
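
The two hints can be sketched on a tiny hand-written snippet. The HTML below is a simplified assumption, not real IMDB markup, and `text_or_default` is a hypothetical helper; the built-in `html.parser` is used so the sketch does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written snippet; real IMDB pages are larger and may differ.
SAMPLE = """
<ul>
  <li data-testid="title-boxoffice-budget"><div><span>$25,000,000 (estimated)</span></div></li>
  <li data-testid="title-details-origin"><a>United States</a><a>France</a></li>
</ul>
"""

def text_or_default(soup, tag, testid, default="0"):
    """Return the text of the first match, or `default` if the element is missing."""
    node = soup.find(tag, {"data-testid": testid})
    return node.get_text(strip=True) if node is not None else default

soup = BeautifulSoup(SAMPLE, "html.parser")

# find() returns a single element or None.
budget = text_or_default(soup, "li", "title-boxoffice-budget")
gross = text_or_default(soup, "li", "title-boxoffice-cumulativeworldwidegross")

# find_all() returns a list of all matches, here every link inside the origin entry.
origin = soup.find("li", {"data-testid": "title-details-origin"})
countries = [a.get_text(strip=True) for a in origin.find_all("a")]

print(budget, gross, countries)
```

The snippet has no gross entry, so `find()` returns `None` and the helper falls back to the default instead of raising an error.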

In [42]:
def parse_movie(html, href):
    film = {}

    film["url"] = href

    # ADD YOUR CODE HERE
    film["title"] = html.find("span", {"data-testid": "hero__primary-text"}).text
    # The score element holds the rating; the vote count sits two siblings further on
    score = html.find("div", {"data-testid": "hero-rating-bar__aggregate-rating__score"})
    film["ratingValue"] = score.span.text
    film["ratingCount"] = score.next_sibling.next_sibling.text
    # Optional fields: a missing element yields None, so attribute access raises AttributeError
    try:
        film["year"] = html.find("h1", {"data-testid": "hero__pageTitle"}).next_sibling.a.text
    except AttributeError:
        film["year"] = "0"
    film["description"] = html.find("span", {"data-testid": "plot-xl"}).text
    try:
        film["budget"] = html.find("li", {"data-testid": "title-boxoffice-budget"}).div.span.text
    except AttributeError:
        film["budget"] = "0"
    try:
        film["gross"] = html.find("li", {"data-testid": "title-boxoffice-cumulativeworldwidegross"}).div.span.text
    except AttributeError:
        film["gross"] = "0"
    film["duration"] = html.find("li", {"data-testid": "title-techspec_runtime"}).div.span.text

    genreList = [x.text for x in html.find("div", {"data-testid": "interests"}).find_all("span")]
    film["genreList"] = genreList

    countryList = [x.text for x in html.find("li", {"data-testid": "title-details-origin"}).find_all("a")]
    film["countryList"] = countryList

    castList = [x.text for x in html.find_all("a", {"data-testid": "title-cast-item__actor"})]
    film["castList"] = castList

    characterList = [x.text for x in html.find_all("a", {"data-testid": "cast-item-characters-link"})]
    film["characterList"] = characterList

    directorList = [x.text for x in html.find("li", {"data-testid": "title-pc-principal-credit"}).find_all("a")]
    film["directorList"] = directorList

    # DO NOT CHANGE FROM HERE
    return film


# DO NOT CHANGE FROM HERE
# Parse the content of the request with BeautifulSoup
href, html = movies_html[0]
html = BeautifulSoup(html.text, 'lxml')    
movie_parsed = parse_movie(html, href)

# Display the result
print ("your result:")
df = pd.DataFrame([movie_parsed])
display(df)

##### Tests ####
# Check the list types
list_types = ["directorList", 
              "genreList", 
              "countryList", 
              "castList", 
              "characterList", 
              "directorList"]
           
for t in list_types:
    if not isinstance(movie_parsed[t], list):
        print ("Error:", t, " should be a list")
        
        
print ("\n\nexpected (converted) result - we will do conversion next:")        
display(df_redemption)

your result:


Unnamed: 0,url,title,ratingValue,ratingCount,year,description,budget,gross,duration,genreList,countryList,castList,characterList,directorList
0,/title/tt0111161/,The Shawshank Redemption,9.3,3.1M,1994,A banker convicted of uxoricide forms a friend...,"$25,000,000 (estimated)","$29,334,033",2h 22m,"[Epic, Period Drama, Prison Drama, Drama]",[United States],"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...","[Andy Dufresne, Ellis Boyd 'Red' Redding, Ward...",[Frank Darabont]




expected (converted) result - we will do conversion next:


Unnamed: 0,url,title,ratingValue,ratingCount,year,description,budget,gross,duration,genreList,countryList,castList,characterList,directorList
0,/title/tt0111161/,The Shawshank Redemption,9.3,3100000,1994,A banker convicted of uxoricide forms a friend...,"$25,000,000 (estimated)",29334033,142,"[Epic, Period Drama, Prison Drama, Drama]",[United States],"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...","[Andy Dufresne, Ellis Boyd 'Red' Redding, Ward...",[Frank Darabont]


# Task 2: Cleansing

<div class="alert alert-block alert-success" style="margin-top: 20px">
Some values are not yet in a format we can process, such as the `duration` of the movie, the `ratingCount`, or the `gross`.

Implement the following three functions:

- `convert_duration()`: Converts the duration format to minutes, e.g. 2h 22m to 142.
- `convert_rating_count()`: Converts the human-readable format to an integer, e.g. 2.7M to 2700000 and 2.7K to 2700.
- `convert_gross()`: Converts the gross to an integer, e.g. $\$28,884,504$ to 28884504.

</div>
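
The arithmetic behind the three conversions can be checked with a small regex-based sketch. It is one possible approach, not the expected solution; the helper names `duration_to_minutes`, `count_to_int`, and `money_to_int` are hypothetical.

```python
import re

def duration_to_minutes(text):
    """Convert '2h 22m', '3h' or '45m' to minutes."""
    hours = re.search(r"(\d+)\s*h", text)
    mins = re.search(r"(\d+)\s*m", text)
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(mins.group(1)) if mins else 0
    return total

def count_to_int(text):
    """Convert '2.7M', '2.7K' or '950' to an integer."""
    factors = {"K": 1_000, "M": 1_000_000}
    suffix = text[-1].upper()
    if suffix in factors:
        # round() guards against floating-point artifacts before casting
        return int(round(float(text[:-1]) * factors[suffix]))
    return int(float(text))

def money_to_int(text):
    """Keep only the digits of a value such as '$28,884,504'."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else 0

print(duration_to_minutes("2h 22m"))  # 2 * 60 + 22
print(count_to_int("2.7M"))
print(money_to_int("$28,884,504"))
```

Note that `money_to_int` assumes whole amounts: a decimal point would be dropped along with the other non-digit characters.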

In [37]:
# Convert to Minutes
def convert_duration(duration):
    # The column has the form 2h 22m and needs to be converted to minutes
    minutes = 0
    
    # ADD YOUR CODE HERE    
    
    parts = duration.split()  # split on any whitespace, ignoring repeated spaces
    for part in parts:
        value, timeUnit = int(part[:-1]), part[-1]
        if timeUnit == "h":
            minutes += value * 60
        elif timeUnit == "m":
            minutes += value
    
    # DO NOT CHANGE FROM HERE    
    return minutes

# Convert Rating Counts
def convert_rating_count(ratings):
    # The column has the form 2.7M and needs to be converted to an integer
    count = 0
        
    # ADD YOUR CODE HERE
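    # Scale by the suffix; round() guards against floating-point artifacts
    if ratings[-1] == "M":
        count = int(round(float(ratings[:-1]) * 1000000))
    elif ratings[-1] == "K":
        count = int(round(float(ratings[:-1]) * 1000))
    else:
        # No suffix: the value is already a plain number, e.g. "950"
        count = int(float(ratings.replace(",", "")))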
    
    # DO NOT CHANGE FROM HERE        
    return count

# Convert Gross
def convert_gross(value):
    # The column has the form $28,884,504 and needs to be converted to an integer
    gross = 0
    
    # ADD YOUR CODE HERE
    gross = int(value.replace("$","").replace(",",""))
    
    # DO NOT CHANGE FROM HERE        
    return gross



# DO NOT CHANGE FROM HERE        
def convert_df(df):
    df_conv = df.convert_dtypes()
    
    try:
        df_conv.ratingValue = pd.to_numeric(df_conv.ratingValue, errors='coerce')
    except Exception as e:
        print(f"Error converting ratingValue to numeric: {e}")
    
    try:
        # Use nullable integer dtype to handle NaNs safely
        df_conv.year = df_conv.year.astype('Int64')  
    except Exception as e:
        print(f"Error converting year to integer: {e}")
        
    
    df_conv.duration = df_conv.duration.apply(convert_duration)
    df_conv.ratingCount = df_conv.ratingCount.apply(convert_rating_count).astype('Int64')
    df_conv.gross = df_conv.gross.apply(convert_gross).astype('Int64')

    return df_conv

df_conv = convert_df(df)  
df_conv.head()

Unnamed: 0,url,title,ratingValue,ratingCount,year,description,budget,gross,duration,genreList,countryList,castList,characterList,directorList
0,/title/tt0111161/,The Shawshank Redemption,9.3,3100000,1994,A banker convicted of uxoricide forms a friend...,"$25,000,000 (estimated)",29334033,142,"[Epic, Period Drama, Prison Drama, Drama]",[United States],"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...","[Andy Dufresne, Ellis Boyd 'Red' Redding, Ward...",[Frank Darabont]


### We will now look at the types after conversion

We will see `string` for text columns and `object` for list columns.

In [34]:
df_conv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   url            1 non-null      string 
 1   title          1 non-null      string 
 2   ratingValue    1 non-null      Float64
 3   ratingCount    1 non-null      Int64  
 4   year           1 non-null      Int64  
 5   description    1 non-null      string 
 6   budget         1 non-null      string 
 7   gross          1 non-null      Int64  
 8   duration       1 non-null      int64  
 9   genreList      1 non-null      object 
 10  countryList    1 non-null      object 
 11  castList       1 non-null      object 
 12  characterList  1 non-null      object 
 13  directorList   1 non-null      object 
dtypes: Float64(1), Int64(3), int64(1), object(5), string(4)
memory usage: 248.0+ bytes
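
The difference between `Int64` and `int64` in the listing above can be seen on a small made-up column with a missing value. This is a sketch with hypothetical data; `years`, `nullable`, and `plain_ok` are illustrative names.

```python
import pandas as pd

# Hypothetical year column where one movie has no year (NaN).
years = pd.Series([1994.0, float("nan"), 1972.0])

# 'Int64' (capital I) is the nullable integer dtype: integers and <NA> can coexist.
nullable = years.astype("Int64")

# The plain NumPy 'int64' dtype cannot represent a missing value, so the cast fails.
try:
    years.astype("int64")
    plain_ok = True
except ValueError:
    plain_ok = False

print(nullable.dtype, int(nullable.isna().sum()), plain_ok)
```

This is why `convert_df()` uses `Int64` for columns such as `year`, `ratingCount`, and `gross`, where some movies may lack a value.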


<hr/>

# We will finally apply your code to scrape all top 250 IMDB movies in the list

In [35]:
def load_movies_long_json():
    local = "data/movies_long.json"
    if exists(local):
        print ("Read from local file")
        return pd.read_json(local)
    else:
        print ("Read from hu-box")        
        return pd.read_json("https://box.hu-berlin.de/f/cb7631adebe54da9b0da/?dl=1")

movies_long = load_movies_long_json()
html_movies_long = get_webpages(movies_long)

Read from local file


Scrap webpages: 100%|██████████| 250/250 [05:36<00:00,  1.35s/it]


<hr/>

# Task 3: Parse the top 250 movies 

<div class="alert alert-block alert-success" style="margin-top: 20px">
Run the following code to check whether your code runs fine on all movies.
    
<b>You do not have to add any code. Only fix your code if an error pops up.</b>
</div>

In [40]:
# DO NOT CHANGE THIS CODE
movies_parsed = []
for href, html in tqdm(html_movies_long):
    html = BeautifulSoup(html.text, 'lxml')
    movies_parsed.append(parse_movie(html, href))
        
df = pd.DataFrame(movies_parsed)    

100%|██████████| 250/250 [00:22<00:00, 10.97it/s]


In [43]:
# DO NOT CHANGE THIS CODE
df_conv = convert_df(df)
df_conv.head()

Unnamed: 0,url,title,ratingValue,ratingCount,year,description,budget,gross,duration,genreList,countryList,castList,characterList,directorList
0,/title/tt0111161/,The Shawshank Redemption,9.3,3100000,1994,A banker convicted of uxoricide forms a friend...,"$25,000,000 (estimated)",29334033,142,"[Epic, Period Drama, Prison Drama, Drama]",[United States],"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...","[Andy Dufresne, Ellis Boyd 'Red' Redding, Ward...",[Frank Darabont]


### Finally, we write the results to a file

In [44]:
# DO NOT CHANGE THIS CODE
# Write JSON to file
df_conv.to_json("part1_submission.json", force_ascii=False, indent=4)
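
To see what the export does, the sketch below writes a tiny made-up frame to a JSON string and reads it back. The data and the names `demo`, `text`, and `restored` are hypothetical; only `force_ascii=False` and `indent=4` are taken from the cell above.

```python
import io

import pandas as pd

# Hypothetical one-row frame with a non-ASCII title and a list column.
demo = pd.DataFrame([{"title": "Léon", "year": 1994, "genreList": ["Action", "Crime"]}])

# Without a path, to_json returns the JSON text instead of writing a file.
text = demo.to_json(force_ascii=False, indent=4)

# Reading the text back restores the columns, including the list column.
restored = pd.read_json(io.StringIO(text))

print(restored.loc[0, "title"], restored.loc[0, "genreList"])
```

With `force_ascii=False`, characters such as `é` are written as-is rather than as escape sequences, and `indent=4` makes the file easier to read.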

<hr/> 

# Submit via Moodle:
- HTML exports of the notebooks for part 1 and part 2
- Source code of the notebooks for part 1 and part 2
- The JSON export `part1_submission.json` of the top 250 movies.


# Done

- You have learned how to scrape a movie webpage and use JSON files.