<hr/>
<div class="alert alert-success alertsuccess" style="margin-top: 20px">
[Tip]: To execute the Python code in the code cell below, click on the cell to select it and press <kbd>Shift</kbd> + <kbd>Enter</kbd>.
</div>
<hr/>

# Exercise 1: IMDB - Part 1

## This exercise is split into two notebooks

1. Scraping IMDB (this notebook)
2. Exploratory Data Analysis (the second notebook)
    

## Part 1: Scraping IMDB

1. Task: Scrape an IMDB movie page
2. Task: Convert the data to a machine-readable format
3. Task: Parse the list of the top 250 movies

### You have to hand in this exercise via Moodle.

# Installing & Importing Pre-Requisites

First, we need to install the required libraries.
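
Before installing anything, it can help to see which of the required modules are already importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` and the module-to-package mapping are illustrative assumptions, not part of the exercise.

```python
import importlib.util

# Import name -> pip package name (mapping assumed for illustration)
REQUIRED = {
    "bs4": "beautifulsoup4",
    "lxml": "lxml",
    "html5lib": "html5lib",
    "requests_cache": "requests_cache",
    "tqdm": "tqdm",
}

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(REQUIRED[name] for name in missing))
else:
    print("All required packages are available.")
```

`find_spec` only checks whether a module can be located; it does not import it, so the check is cheap and has no side effects.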

In [3]:
try:
    from bs4 import BeautifulSoup
except ImportError:
    !pip install beautifulsoup4
    from bs4 import BeautifulSoup  # retry the import after installing

# lxml and html5lib are parser backends for BeautifulSoup
try:
    import lxml
except ImportError:
    !pip install lxml
    
try:
    import html5lib
except ImportError:
    !pip install html5lib
    
try:
    import requests_cache
except ImportError:
    !pip install requests_cache
    
try:
    from tqdm import tqdm
except ImportError:
    !pip install tqdm
    from tqdm import tqdm  # retry the import after installing

import time
import numpy as np
import pandas as pd
import requests
import requests_cache
import warnings

from os.path import exists
from IPython.display import display    



# Scraping IMDB using Beautiful Soup

We will scrape the Internet Movie Database (IMDB) at [imdb.com](https://imdb.com) for its list of the top 250 movies.

### Example

This is an example of the result of scraping the webpage: https://www.imdb.com/title/tt0111161/

<img src="images/imdb.png">

In [4]:
# Load the full list of movies from json format
def load_movies_json():    
    local = "data/movies_full_crawled.json"
    if exists(local):
        print ("Read from local file")
        return pd.read_json(local)
    else:
        print ("Read from hu-box")        
        return pd.read_json("https://box.hu-berlin.de/f/bd7bdd460c55420783aa/?dl=1")

# show only one movie
df_redemption = load_movies_json().head(1)
df_redemption

Read from local file


Unnamed: 0,url,title,ratingValue,ratingCount,year,description,budget,gross,duration,genreList,countryList,castList,characterList,directorList
0,/title/tt0111161/,The Shawshank Redemption,9.3,3100000,1994,A banker convicted of uxoricide forms a friend...,"$25,000,000 (estimated)",29334033,142,"[Epic, Period Drama, Prison Drama, Drama]",[United States],"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...","[Andy Dufresne, Ellis Boyd 'Red' Redding, Ward...",[Frank Darabont]


# List of movies

We use a JSON file to get the list of movies to scrape and read it with `pandas`.

The file contains three columns:
- *title*,
- *rating*, and
- *href* (the relative link to the movie page).
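
As a small illustration of reading such a file, the sketch below parses an inline JSON string with `pd.read_json`. The record-oriented layout of the string is an assumption made for this example; the actual `movies_short.json` may be organised differently. The single row reuses the entry shown in the output of the next cell.

```python
import io

import pandas as pd

# Assumed record-oriented layout; the real file may use a different orientation.
json_text = """
[
  {"title": "Die Verurteilten", "rating": 9.2, "href": "/title/tt0111161/"}
]
"""

# read_json accepts a file path, a URL, or a file-like object such as StringIO
movies_demo = pd.read_json(io.StringIO(json_text), orient="records")

# Each row is one movie; href holds the relative link to the movie page
first_href = movies_demo.iloc[0].href
print(sorted(movies_demo.columns), first_href)
```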

In [5]:
# Load excerpt of the movies from json format
def load_top_movies_short_json():
    local = "data/movies_short.json"
    if exists(local):
        print ("Read from local file")
        return pd.read_json(local)
    else:
        print ("Read from hu-box")        
        return pd.read_json("https://box.hu-berlin.de/f/5d645662ec9e45338b03/?dl=1")

movies = load_top_movies_short_json()
movies

Read from local file


Unnamed: 0,title,rating,href
0,Die Verurteilten,9.2,/title/tt0111161/


## We will now load and cache the HTML pages 

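Caching means that each page is requested from IMDB only once. Repeated runs are served from the local cache, which reduces the risk of being blocked for sending too many requests.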
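
To make the idea concrete, here is a minimal in-memory sketch of what a cache does. It is not how `requests_cache` is implemented; `PageCache` and `fake_fetch` are hypothetical names, and the fetch function is a stand-in so the example runs without network access.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com"

class PageCache:
    """Hypothetical in-memory cache: every URL is fetched at most once."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable: url -> page text
        self._pages = {}      # url -> cached page text
        self.misses = 0       # number of real fetches performed

    def get(self, href):
        # urljoin avoids a doubled slash when href already starts with "/"
        url = urljoin(BASE_URL, href)
        if url not in self._pages:
            self.misses += 1
            self._pages[url] = self._fetch(url)
        return self._pages[url]

def fake_fetch(url):
    """Stand-in for a real HTTP request (assumption for this sketch)."""
    return f"<html>{url}</html>"

cache = PageCache(fake_fetch)
first = cache.get("/title/tt0111161/")
second = cache.get("/title/tt0111161/")  # answered from the cache
print(cache.misses)
```

The second call does not trigger another fetch, which is the behaviour we rely on when re-running the notebook.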

In [6]:
requests_cache.install_cache('imdb_cache')

# Download (or load from the cache) the webpage of every movie in the list
def get_webpages(movies):
    films = []
    headers = {
        'Accept-Language': 'en; q=1.0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    }

    for i in tqdm(range(len(movies)), desc='Scrap webpages'):
        href = movies.iloc[i].href
        
        # Make a get request
        req_time = time.perf_counter()
        # href already starts with "/", so no extra slash is needed
        response = requests.get(f'https://www.imdb.com{href}', headers=headers)

        # Throw a warning for non-200 status codes
        if response.status_code != 200:
            warnings.warn(f'Request to {href} returned status {response.status_code}')

        elapsed_time = np.round(time.perf_counter() - req_time, 5)
        # print(f"Loading Webpage: {href} done in {elapsed_time} seconds")

        films.append([href, response])

        # Optional: time.sleep(1)  # Uncomment to reduce request rate and avoid blocking
        
    return films

movies_html = get_webpages(movies)

Scrap webpages: 100%|█████████████████████████████| 1/1 [00:00<00:00, 47.58it/s]


# Task 1: Scrape IMDB using BeautifulSoup

<div class="alert alert-block alert-success">
    
You are expected to complete the following method to parse the webpage using BeautifulSoup.

</div>    

From each webpage, you are supposed to extract the following information:

* `url`: The URL
* `title`: The title
* `year`: The year of production
* `description`: The description of the movie
* `budget`: The budget
* `gross`: The *worldwide* gross
* `ratingValue`: The rating of the movie
* `ratingCount`: The number of votes
* `duration`: The duration of the movie
* `genreList`: A *list* of genres
* `countryList`: A *list* of the countries of origin
* `castList`: A *list* of the cast of the movie
* `characterList`: The *list* of characters played by the cast
* `directorList`: The *list* of directors

A variable ending in *List* such as *genreList* refers to the extracted data type being a list. 

**Hint:**

- Use `html.find_all()` to parse list types and `html.find()` to get a single entry.
- Some elements, such as `budget` or `gross`, are missing for some movies. Wrap the access in `try:` and `except:`, or check the result of `find()` for `None`.
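
The two hints can be sketched on a tiny HTML fragment. The markup below is hypothetical and heavily simplified; real IMDB pages differ. The helper `text_or_default` is an illustrative name, and the filter on `/name/` assumes that links to people contain that path segment.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for illustration only
SAMPLE = """
<ul>
  <li data-testid="title-pc-principal-credit">
    <a href="/title/tt0000000/fullcredits">Directors</a>
    <a href="/name/nm0000001/">Jane Doe</a>
    <a href="/name/nm0000002/">John Roe</a>
    <a href="/title/tt0000000/fullcredits"></a>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

def text_or_default(parent, name, attrs, default=None):
    """Return the stripped text of the first match, or `default` if there is none."""
    node = parent.find(name, attrs)
    return node.get_text(strip=True) if node is not None else default

# find() returns None for a missing element, so check before using the result
gross = text_or_default(
    soup, "li", {"data-testid": "title-boxoffice-cumulativeworldwidegross"}, default="0"
)

# find_all() returns a (possibly empty) list of matches
credit = soup.find("li", {"data-testid": "title-pc-principal-credit"})
links = credit.find_all("a") if credit is not None else []

# Keep only links that point to a person and carry visible text
directors = [
    a.get_text(strip=True)
    for a in links
    if "/name/" in a.get("href", "") and a.get_text(strip=True)
]
print(gross, directors)
```

Filtering the links this way is one possible guard against label links or empty links ending up in a list such as `directorList`.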

In [7]:
def parse_movie(html, href):
    film = {}

    film["url"] = href
    
    # ADD YOUR CODE HERE
        
    film["title"] = html.title.string.split("(")[0].strip() #split before the ( and then strip empty chars.

    film["ratingValue"] = html.find("span", class_="sc-4dc495c1-1 lbQcRY").string

    film["ratingCount"] = html.find("div", class_="sc-4dc495c1-3 eNfgcR").string
    
    film["year"] = html.title.string.split("(")[1].strip().split(")")[0].strip()
    
    film["description"] = html.find("span", class_="sc-bf30a0e-2 bRimta").string
    
    # Budget and gross are missing for some movies; fall back to "0".
    # A failed find() returns None, so the chained access raises AttributeError.
    try:
        budget_element = html.find("div", {"data-testid": "title-boxoffice-section"})
        budget = budget_element.find("span", class_="ipc-metadata-list-item__list-content-item ipc-btn--not-interactable").text
    except AttributeError:
        budget = "0"
        
    film["budget"] = budget

    try:
        gross_element = html.find("li", {"data-testid": "title-boxoffice-cumulativeworldwidegross"})
        gross = gross_element.find("span", class_="ipc-metadata-list-item__list-content-item ipc-btn--not-interactable").text
    except AttributeError:
        gross = "0"
        
    film["gross"] = gross
    
    duration_element = html.find("li", {"data-testid":"title-techspec_runtime"}).find("span" , class_="ipc-metadata-list-item__list-content-item ipc-btn--not-interactable")
    duration = duration_element.text.strip()
    
    film["duration"] = duration
    
   
    # Iterating over a Tag yields its direct children: one chip per genre
    genre_container = html.find("div", class_="ipc-chip-list__scroller")
    genreList = [g.get_text(strip=True) for g in genre_container]
    
    film["genreList"] = genreList
    
    country_links = html.find("li", {"data-testid": "title-details-origin"}).find_all("a")
    countryList = [c.get_text() for c in country_links]
    film["countryList"] = countryList
    
    castList = html.find("div" , class_="ipc-shoveler ipc-shoveler--base ipc-shoveler--page0 title-cast__grid").find_all("a",{"data-testid":"title-cast-item__actor"})
    castList = [g.get_text(strip=True) for g in castList]
    film["castList"] = castList

    characterList = html.find("div" , class_="ipc-shoveler ipc-shoveler--base ipc-shoveler--page0 title-cast__grid").find_all("a",{"data-testid":"cast-item-characters-link"})
    characterList= [g.get_text(strip=True) for g in characterList]
    
    film["characterList"] = characterList

    directorList = html.find("li", {"data-testid":"title-pc-principal-credit"}).find_all("a")
    directorList = [g.get_text(strip=True) for g in directorList]
    print(directorList)
    film["directorList"] = directorList

    # DO NOT CHANGE FROM HERE
    return film


# DO NOT CHANGE FROM HERE
# Parse the content of the request with BeautifulSoup
href, html = movies_html[0]
html = BeautifulSoup(html.text, 'lxml')    
movie_parsed = parse_movie(html, href)

# Display the result
print ("your result:")
df = pd.DataFrame([movie_parsed])
display(df)

##### Tests ####
# Check the list types
list_types = ["directorList", 
              "genreList", 
              "countryList", 
              "castList", 
              "characterList", 
              "directorList"]
           
for t in list_types:
    if not isinstance(movie_parsed[t], list):
        print ("Error:", t, " should be a list")
        
        
print ("\n\nexpected (converted) result - we will do conversion next:")        
display(df_redemption)

['Frank Darabont']
your result:


Unnamed: 0,url,title,ratingValue,ratingCount,year,description,budget,gross,duration,genreList,countryList,castList,characterList,directorList
0,/title/tt0111161/,The Shawshank Redemption,9.3,3.1M,1994,A banker convicted of uxoricide forms a friend...,"$25,000,000 (estimated)","$29,334,033",2h 22m,"[Epic, Period Drama, Prison Drama, Drama]",[United States],"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...","[Andy Dufresne, Ellis Boyd 'Red' Redding, Ward...",[Frank Darabont]




expected (converted) result - we will do conversion next:


Unnamed: 0,url,title,ratingValue,ratingCount,year,description,budget,gross,duration,genreList,countryList,castList,characterList,directorList
0,/title/tt0111161/,The Shawshank Redemption,9.3,3100000,1994,A banker convicted of uxoricide forms a friend...,"$25,000,000 (estimated)",29334033,142,"[Epic, Period Drama, Prison Drama, Drama]",[United States],"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...","[Andy Dufresne, Ellis Boyd 'Red' Redding, Ward...",[Frank Darabont]


# Task 2: Cleansing

<div class="alert alert-block alert-success" style="margin-top: 20px">
Some values are not yet in a format we can process directly, such as the `duration` of the movie, the `ratingCount`, or the `gross`.

Implement the following three functions:

- `convert_duration()`: Converts the duration format to minutes, e.g. 2h 22m to 142.
- `convert_rating_count()`: Converts the human-readable format to an integer, e.g. 2.7M to 2700000 and 2.7k to 2700.
- `convert_gross()`: Converts the gross to an integer, e.g. $\$28,884,504$ to 28884504.

</div>
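
The three conversions can also be sketched in a stricter, regex-based style. This is only one possible approach under stated assumptions, not the expected solution; the names `duration_to_minutes`, `count_to_int` and `money_to_int` are hypothetical and chosen so they do not clash with the functions you implement below.

```python
import re

# Optional "<n>h" part followed by an optional "<n>m" part
_DURATION = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$")
_SUFFIX = {"k": 1_000, "m": 1_000_000}

def duration_to_minutes(text):
    """Convert strings such as '2h 22m', '2h' or '45m' to minutes."""
    match = _DURATION.match(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

def count_to_int(text):
    """Convert strings such as '2.7M', '2.7k' or '954' to an integer."""
    text = text.strip()
    factor = _SUFFIX.get(text[-1:].lower())
    if factor is None:
        return int(text)
    # round() instead of int() so floating-point error cannot truncate the result
    return round(float(text[:-1]) * factor)

def money_to_int(text):
    """Keep only the digits of strings such as '$28,884,504'; empty input gives 0."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else 0

print(duration_to_minutes("2h 22m"), count_to_int("2.7M"), money_to_int("$28,884,504"))
```

Note that `money_to_int` simply discards every non-digit character, so it assumes whole amounts without decimal places.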

In [8]:
# Convert to Minutes
def convert_duration(duration):
    # The column has the form 2h 22m and needs to be converted to minutes
    minutes = 0
    # ADD YOUR CODE HERE
    if "h" in duration:
        hours = int(duration.split("h")[0].strip())
        
        # Try to extract minutes, but allow missing minutes
        minute_part = duration.split("h")[1].replace("m", "").strip()
        minutes = int(minute_part) if minute_part else 0
    else:
        hours = 0
        minutes = int(duration.replace("m", "").strip())
    minutes = hours * 60 + minutes
    # DO NOT CHANGE FROM HERE    
    return minutes

# Convert Rating Counts
def convert_rating_count(ratings):
    # The column has the form 2.7M and needs to be converted to an integer
    count = 0
    # ADD YOUR CODE HERE
    ratings = ratings.strip()
    if ratings.endswith("M"):
        # round() instead of int() so floating-point error cannot truncate the result
        count = round(float(ratings[:-1]) * 1_000_000)
    elif ratings.endswith(("k", "K")):
        count = round(float(ratings[:-1]) * 1_000)
    else:
        count = int(ratings)
    # DO NOT CHANGE FROM HERE        
    return count
            
# Convert Gross
def convert_gross(value):
    # The column has the form $28,884,504 and needs to be converted to an integer
    gross = 0
    # ADD YOUR CODE HERE
    value =  value.replace("$","").replace(",","")
    gross = int(value.strip())
    # DO NOT CHANGE FROM HERE        
    return gross


# DO NOT CHANGE FROM HERE        
def convert_df(df):
    df_conv = df.convert_dtypes()
    
    try:
        df_conv.ratingValue = pd.to_numeric(df_conv.ratingValue, errors='coerce')
    except Exception as e:
        print(f"Error converting ratingValue to numeric: {e}")
    
    try:
        # Use nullable integer dtype to handle NaNs safely
        df_conv.year = df_conv.year.astype('Int64')  
    except Exception as e:
        print(f"Error converting year to integer: {e}")
        
    
    df_conv.duration = df_conv.duration.apply(convert_duration)
    df_conv.ratingCount = df_conv.ratingCount.apply(convert_rating_count).astype('Int64')
    df_conv.gross = df_conv.gross.apply(convert_gross).astype('Int64')

    return df_conv

df_conv = convert_df(df)  
df_conv.head()

Unnamed: 0,url,title,ratingValue,ratingCount,year,description,budget,gross,duration,genreList,countryList,castList,characterList,directorList
0,/title/tt0111161/,The Shawshank Redemption,9.3,3100000,1994,A banker convicted of uxoricide forms a friend...,"$25,000,000 (estimated)",29334033,142,"[Epic, Period Drama, Prison Drama, Drama]",[United States],"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...","[Andy Dufresne, Ellis Boyd 'Red' Redding, Ward...",[Frank Darabont]


### We will now look at the types after conversion

After `convert_dtypes()`, text columns show the `string` dtype, while list columns remain `object`.

In [14]:
df_conv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   url            250 non-null    string 
 1   title          250 non-null    string 
 2   ratingValue    250 non-null    Float64
 3   ratingCount    250 non-null    Int64  
 4   year           250 non-null    Int64  
 5   description    250 non-null    string 
 6   budget         250 non-null    string 
 7   gross          250 non-null    Int64  
 8   duration       250 non-null    int64  
 9   genreList      250 non-null    object 
 10  countryList    250 non-null    object 
 11  castList       250 non-null    object 
 12  characterList  250 non-null    object 
 13  directorList   250 non-null    object 
dtypes: Float64(1), Int64(3), int64(1), object(5), string(4)
memory usage: 28.4+ KB


<hr/>

# We will finally apply your code to scrape all top 250 IMDB movies in the list

In [10]:
def load_movies_long_json():
    local = "data/movies_long.json"
    if exists(local):
        print ("Read from local file")
        return pd.read_json(local)
    else:
        print ("Read from hu-box")        
        return pd.read_json("https://box.hu-berlin.de/f/cb7631adebe54da9b0da/?dl=1")

movies_long = load_movies_long_json()
html_movies_long = get_webpages(movies_long)

Read from local file


Scrap webpages: 100%|████████████████████████| 250/250 [00:00<00:00, 661.19it/s]


<hr/>

# Task 3: Parse the top 250 movies 

<div class="alert alert-block alert-success" style="margin-top: 20px">
Run the following code to check whether your code runs on all movies.
    
<b>You do not have to add any code. Only fix your code if an error pops up.</b>
</div>
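
If an error does pop up, it helps to know which pages cause it. The sketch below is a debugging aid under stated assumptions and not part of the required solution: `parse_all` and `demo_parser` are hypothetical names, and the demo pages are plain strings instead of real responses.

```python
def parse_all(pages, parser):
    """Apply parser(page, href) to every page and collect failures instead of stopping."""
    results, failures = [], []
    for href, page in pages:
        try:
            results.append(parser(page, href))
        except Exception as exc:  # broad on purpose: we only want to locate bad pages
            failures.append((href, repr(exc)))
    return results, failures

def demo_parser(page, href):
    """Toy parser that fails on empty pages (assumption for this sketch)."""
    if not page:
        raise ValueError("empty page")
    return {"url": href, "length": len(page)}

demo_pages = [("/title/a/", "<html></html>"), ("/title/b/", "")]
parsed, failed = parse_all(demo_pages, demo_parser)
print(len(parsed), failed)
```

With your own code, `parser` would be `parse_movie` and each page a `BeautifulSoup` object; the `failures` list then names the movies whose pages need a closer look.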

In [11]:
# DO NOT CHANGE THIS CODE
movies_parsed = []
for href, html in tqdm(html_movies_long):
    html = BeautifulSoup(html.text, 'lxml')
    movies_parsed.append(parse_movie(html, href))
        
df = pd.DataFrame(movies_parsed)    

  2%|▊                                          | 5/250 [00:00<00:11, 21.21it/s]

['Frank Darabont']
['Francis Ford Coppola']
['Christopher Nolan']
['Francis Ford Coppola']
['Sidney Lumet']
['Steven Spielberg']


  4%|█▊                                        | 11/250 [00:00<00:10, 21.97it/s]

['Peter Jackson']
['Quentin Tarantino']
['Peter Jackson']
['Sergio Leone']
['Robert Zemeckis']


  7%|██▊                                       | 17/250 [00:00<00:10, 21.29it/s]

['David Fincher']
['Christopher Nolan']
['Peter Jackson']
['Irvin Kershner']
['Lana Wachowski', 'Lilly Wachowski']
['Martin Scorsese']


  9%|███▊                                      | 23/250 [00:01<00:11, 20.23it/s]

['Milos Forman']
['David Fincher']
['Akira Kurosawa']
['Frank Capra']
['Jonathan Demme']
['Kátia Lund', 'Fernando Meirelles']
['Steven Spielberg']
['Roberto Benigni']
['Christopher Nolan']


 13%|█████▍                                    | 32/250 [00:01<00:10, 20.24it/s]

['Frank Darabont']
['George Lucas']
['James Cameron']
['Robert Zemeckis']
['Hayao Miyazaki']
['Alfred Hitchcock']


 14%|█████▉                                    | 35/250 [00:01<00:09, 21.65it/s]

['Roman Polanski']
['Bong Joon Ho']
['Luc Besson']


 16%|██████▉                                   | 41/250 [00:02<00:11, 18.15it/s]

['Roger Allers', 'Rob Minkoff']
['Ridley Scott']
['Tony Kaye']
['Martin Scorsese']
['Bryan Singer']
['Christopher Nolan']


 19%|███████▉                                  | 47/250 [00:02<00:09, 21.57it/s]

['Damien Chazelle']
['Michael Curtiz']
['Olivier Nakache', 'Éric Toledano']
['Masaki Kobayashi']
['Isao Takahata']
['Charles Chaplin']


 21%|████████▉                                 | 53/250 [00:02<00:11, 16.98it/s]

['Sergio Leone']
['Alfred Hitchcock']
['Charles Chaplin']
['Ridley Scott']
['Giuseppe Tornatore']
['Francis Ford Coppola']


 24%|█████████▉                                | 59/250 [00:03<00:09, 20.51it/s]

['Christopher Nolan']
['Steven Spielberg']
['Quentin Tarantino']
['Andrew Stanton']
['Florian Henckel von Donnersmarck']
['Billy Wilder']


 25%|██████████▍                               | 62/250 [00:03<00:08, 21.75it/s]

['Stanley Kubrick']
['Stanley Kubrick']
['Charles Chaplin']


 26%|██████████▉                               | 65/250 [00:03<00:13, 13.26it/s]

['Billy Wilder']
['Anthony Russo', 'Joe Russo']
['James Cameron']
['Sam Mendes']
['Bob Persichetti', 'Peter Ramsey', 'Rodney Rothman']
['Stanley Kubrick']


 30%|████████████▍                             | 74/250 [00:04<00:09, 19.12it/s]

['Christopher Nolan']
['Park Chan-wook']
['Todd Phillips']
['Milos Forman']
['Quentin Tarantino']
['John Lasseter']


 31%|████████████▉                             | 77/250 [00:04<00:08, 20.59it/s]

['Mel Gibson']
['Adrian Molina', 'Lee Unkrich']
['Wolfgang Petersen']
['Anthony Russo', 'Joe Russo']
['Joseph Kosinski']
['Hayao Miyazaki']


 34%|██████████████▍                           | 86/250 [00:04<00:11, 13.85it/s]

['Sergio Leone']
['Gus Van Sant']
['Makoto Shinkai']
['Darren Aronofsky']
['Lee Unkrich']
['Stanley Donen', 'Gene Kelly']


 37%|███████████████▍                          | 92/250 [00:05<00:08, 18.05it/s]

['Rajkumar Hirani']
['Akira Kurosawa']
['Richard Marquand']
['Stanley Kubrick']
['Michel Gondry']
['Nadine Labaki']


 39%|████████████████▍                         | 98/250 [00:05<00:07, 21.30it/s]

['Quentin Tarantino']
['Thomas Vinterberg']
['Orson Welles']
['David Lean']
['Fritz Lang']
['Elem Klimov']


 40%|████████████████▌                        | 101/250 [00:05<00:06, 22.40it/s]

['Alfred Hitchcock']
['Alfred Hitchcock']
['Jean-Pierre Jeunet']
['Stanley Kubrick']
['Billy Wilder']


 43%|█████████████████▌                       | 107/250 [00:06<00:11, 12.53it/s]

['Billy Wilder']
['Stanley Kubrick']
['Akira Kurosawa']
['Brian De Palma']
['Thomas Kail']
['George Roy Hill']


 45%|██████████████████▌                      | 113/250 [00:06<00:08, 16.88it/s]

['Robert Mulligan']
['Michael Mann']
['Pete Docter', 'Bob Peterson']
['Denis Villeneuve']
['Martin Scorsese']
['Fritz Lang']


 48%|███████████████████▌                     | 119/250 [00:06<00:06, 20.39it/s]

['Asghar Farhadi']
['Curtis Hanson']
['Guy Ritchie']
['Vittorio De Sica']
['John McTiernan']
['Steven Spielberg']


 50%|████████████████████▌                    | 125/250 [00:06<00:05, 22.62it/s]

['Aamir Khan']
['Sam Mendes']
['Oliver Hirschbiegel']
['Sergio Leone']
['Christopher Nolan']
['Nitesh Tiwari']


 52%|█████████████████████▍                   | 131/250 [00:07<00:04, 24.16it/s]

['Charles Chaplin']
['Billy Wilder']
['Florian Zeller']
['Joseph L. Mankiewicz']
['Peter Farrelly']


 55%|██████████████████████▍                  | 137/250 [00:08<00:10, 11.09it/s]

['Martin Scorsese']
['Stanley Kramer']
['Akira Kurosawa']
['Martin Scorsese']
['Guillermo del Toro']
['Clint Eastwood']


 57%|███████████████████████▍                 | 143/250 [00:08<00:06, 15.47it/s]

['Paul Thomas Anderson']
['Peter Weir']
['Jon Watts']
['M. Night Shyamalan']
['Ron Howard']
['Akira Kurosawa']


 60%|████████████████████████▍                | 149/250 [00:08<00:05, 19.48it/s]

['Terry Gilliam', 'Terry Jones']
['John Huston']
['Martin Scorsese']
['Steven Spielberg']
['Akira Kurosawa']
['John Sturges']


 62%|█████████████████████████▍               | 155/250 [00:08<00:04, 21.81it/s]

['Quentin Tarantino']
['Ethan Coen', 'Joel Coen']
['Andrew Stanton', 'Lee Unkrich']
['David Lynch']
['John Carpenter']
['Roman Polanski']


 64%|██████████████████████████▍              | 161/250 [00:09<00:03, 23.14it/s]

['Martin Scorsese']
['Victor Fleming']
['James McTeigue']
['Ronnie Del Carmen', 'Pete Docter']
['Guy Ritchie']
['Alfred Hitchcock']


 67%|███████████████████████████▍             | 167/250 [00:09<00:03, 24.18it/s]

['Juan José Campanella']
['Hayao Miyazaki']
['David Lean']
['Martin McDonagh']
['Danny Boyle']
["Gavin O'Connor"]
['Clint Eastwood']


 69%|████████████████████████████▎            | 173/250 [00:10<00:08,  9.39it/s]

['Joel Coen', 'Ethan Coen']
['Denis Villeneuve']
['Hayao Miyazaki']
['Clint Eastwood']
['Steven Spielberg']
['Charles Chaplin']


 72%|█████████████████████████████▎           | 179/250 [00:10<00:05, 13.82it/s]

['Majid Majidi']
['Ridley Scott']
['Elia Kazan']
['Steve McQueen']
['Richard Linklater']
['Carol Reed']


 74%|██████████████████████████████▎          | 185/250 [00:11<00:03, 17.93it/s]

['Ingmar Bergman']
['David Yates']
['William Wyler']
['Clyde Bruckman', 'Buster Keaton']
['David Fincher']
['Michael Cimino']


 76%|███████████████████████████████▎         | 191/250 [00:11<00:02, 21.24it/s]

['Wes Anderson']
['Jim Sheridan']
['Stanley Kubrick']
['Henri-Georges Clouzot']
['Frank Capra']
['Buster Keaton']


 79%|████████████████████████████████▎        | 197/250 [00:11<00:02, 23.18it/s]

['Mel Gibson']
['Bong Joon Ho']
['Carlos Martínez López', 'Sergio Pablos']
['Damián Szifron']
['Ingmar Bergman']
['Lenny Abrahamson']


 81%|█████████████████████████████████▎       | 203/250 [00:11<00:01, 24.15it/s]

['George Miller']
['Adam Elliot']
['Dean DeBlois', 'Chris Sanders']
['Joel Coen', 'Ethan Coen']
['Pete Docter', 'David Silverman', 'Lee Unkrich']
['Steven Spielberg']


 84%|██████████████████████████████████▎      | 209/250 [00:11<00:01, 24.76it/s]

['Yasujirô Ozu']
['Carl Theodor Dreyer']
['Peter Weir']
['Terry George']
['James Mangold']
['John G. Avildsen']


 86%|███████████████████████████████████▎     | 215/250 [00:12<00:01, 24.94it/s]

['Oliver Stone']
['Satyajit Ray']
['Rob Reiner']
['James Cameron']


 87%|███████████████████████████████████▊     | 218/250 [00:13<00:05,  6.22it/s]

['Tom McCarthy']
['Daniel Kwan', 'Daniel Scheinert']
['James Mangold']
['Ron Howard']
['Brad Bird', 'Jan Pinkava']
['Sidney Lumet']


 91%|█████████████████████████████████████▏   | 227/250 [00:13<00:01, 12.35it/s]

['Sean Penn']
['Directors', 'Victor Fleming', 'George Cukor', 'Norman Taurog', '']
['Richard Linklater']
['Harold Ramis']
['William Friedkin']
['T.J. Gnanavel']


 93%|██████████████████████████████████████▏  | 233/250 [00:14<00:01, 16.78it/s]

['Ernst Lubitsch']
['William Wyler']
['Brad Bird']
['Gillo Pontecorvo']
['Lasse Hallström']
['John Ford']


 96%|███████████████████████████████████████▏ | 239/250 [00:14<00:00, 20.41it/s]

['Alfred Hitchcock']
['Çagan Irmak']
['Alejandro G. Iñárritu']
['Gore Verbinski']
['Mathieu Kassovitz']
['Stuart Rosenberg']


 98%|████████████████████████████████████████▏| 245/250 [00:14<00:00, 23.03it/s]

['François Truffaut']
['Ingmar Bergman']
['Frank Capra']
['Park Chan-wook']
['Robert Wise']
['Terry Jones']


100%|█████████████████████████████████████████| 250/250 [00:14<00:00, 16.94it/s]

['Akira Kurosawa']
['Tate Taylor']
['Richard Attenborough']
['Ron Clements', 'John Musker']
['Brad Bird']





In [12]:
# DO NOT CHANGE THIS CODE
df_conv = convert_df(df)
df_conv.head()

Unnamed: 0,url,title,ratingValue,ratingCount,year,description,budget,gross,duration,genreList,countryList,castList,characterList,directorList
0,/title/tt0111161/,The Shawshank Redemption,9.3,3100000,1994,A banker convicted of uxoricide forms a friend...,"$25,000,000 (estimated)",29334033,142,"[Epic, Period Drama, Prison Drama, Drama]",[United States],"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...","[Andy Dufresne, Ellis Boyd 'Red' Redding, Ward...",[Frank Darabont]
1,/title/tt0068646/,The Godfather,9.2,2200000,1972,The aging patriarch of an organized crime dyna...,"$6,000,000 (estimated)",250925379,175,"[Epic, Gangster, Tragedy, Crime, Drama]",[United States],"[Marlon Brando, Al Pacino, James Caan, Richard...","[Don Vito Corleone, Michael, Sonny, Clemenza, ...",[Francis Ford Coppola]
2,/title/tt0468569/,The Dark Knight,9.1,3100000,2008,When a menace known as the Joker wreaks havoc ...,"$185,000,000 (estimated)",1009242873,152,"[Action Epic, Epic, Psychological Drama, Psych...","[United States, United Kingdom]","[Christian Bale, Heath Ledger, Aaron Eckhart, ...","[Bruce Wayne, Joker, Harvey Dent, Alfred, Rach...",[Christopher Nolan]
3,/title/tt0071562/,The Godfather Part II,9.0,1500000,1974,The early life and career of Vito Corleone in ...,"$13,000,000 (estimated)",48152659,202,"[Epic, Gangster, Tragedy, Crime, Drama]",[United States],"[Al Pacino, Robert Duvall, Diane Keaton, Rober...","[Michael, Tom Hagen, Kay, Vito Corleone, Fredo...",[Francis Ford Coppola]
4,/title/tt0050083/,12 Angry Men,9.0,954000,1957,The jury in a New York City murder trial is fr...,"$350,000 (estimated)",2945,96,"[Legal Drama, Psychological Drama, Crime, Drama]",[United States],"[Martin Balsam, John Fiedler, Lee J. Cobb, E.G...","[Juror 1, Juror 2, Juror 3, Juror 4, Juror 5, ...",[Sidney Lumet]


### Finally, we write the results to a file

In [13]:
# DO NOT CHANGE THIS CODE
# Write JSON to file
df_conv.to_json("part1_submission.json", force_ascii=False, indent=4)

<hr/> 

# Submit via Moodle:
- HTML exports of the notebooks for part 1 and part 2
- Source files of the notebooks for part 1 and part 2
- The JSON export `part1_submission.json` of the top 250 movies.


# Done

- You have learned how to scrape a movie webpage and work with JSON files.