# Become a movie director

Let's use BeautifulSoup to collect some information about the 250 top-rated movies on <a href="http://www.imdb.com/" target="_blank">IMDB</a>.

To complete this exercise, feel free to consult the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank">BeautifulSoup documentation</a>.

1. Import the `BeautifulSoup` and `requests` libraries:

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
pd.set_option('display.max_colwidth', None)
from PIL import Image
from io import BytesIO
import base64
from IPython.display import HTML

2. With `requests`, get the source code of the webpage at this url: <a href="http://www.imdb.com/chart/top" target="_blank">http://www.imdb.com/chart/top</a>

In [2]:
url = 'http://www.imdb.com/chart/top'

# Ask the server for the French version of the page
headers = {'Accept-Language': 'fr-FR,en;q=0.8'}

html_content = requests.get(url, headers=headers).text
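The request above does not check whether the server actually returned the page. Below is a minimal sketch of a more defensive fetch. The helper name `fetch_html`, the injectable `get` parameter and the 10-second timeout are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, headers=None, timeout=10, get=requests.get):
    """Return the body of `url` as text, raising on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so that the
    helper can be tried out without network access (hypothetical design).
    """
    response = get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.text


# Possible usage with the variables defined in the cell above:
# html_content = fetch_html(url, headers=headers)
```

Failing early on a bad status code avoids parsing an error page as if it were the movie table.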

3. Use BeautifulSoup to extract the following items from the webpage's HTML code for each of the 250 movies: ranking, title, url, crew, rating and number of voters.

Use the `.select` method to find tags you need on the website, then store those tags into lists.

Finally, create a list named `imdb` in which each item is a dictionary containing the information related to one movie.

**Hint**: You can check out the <a href="https://docs.python.org/3/library/string.html" target="_blank">string documentation</a>, in particular the `.split`, `.join` and `.replace` methods.
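As a quick illustration of the three string methods mentioned in the hint, here is a small sketch. The input string is a hypothetical example of text in a "rank|title|(year)" shape, chosen for illustration only.

```python
# Hypothetical cell text in a "rank|title|(year)" shape
raw = "1.|Les évadés|(1994)"

# .split cuts the string on the separator and returns a list of parts
rank, title, year = raw.split("|")

# .replace removes unwanted characters by replacing them with ""
rank = rank.replace(".", "")
year = year.replace("(", "").replace(")", "")

# .join glues a list of strings back together with a separator
summary = " - ".join([rank, title, year])
print(summary)  # 1 - Les évadés - 1994
```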

In [3]:
soup = BeautifulSoup(html_content, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
  <style>
   body#styleguide-v2 {
                    background: no-repeat fixed center top #000;
                }
  </style>
  <script type="text/javascript">
   var IMDbTimer={starttime: new Date().getTime(),pt:'java'};
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <title>
   IMDb Top 250 - IMDb
  </title>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("be", "LoadTitle", {w

Let's keep things simple so that we can focus on understanding BeautifulSoup and HTML/CSS selectors.

In [4]:
# Every <tr> element of the page, i.e. all table rows
table = soup.select('tr')
# CSS selectors we will use to extract the information from each row
tags = ["td.titleColumn", "a", "td.posterColumn", "td.ratingColumn.imdbRating"]
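To see what these selectors match, here is a minimal sketch that runs them on a single hand-written row. The markup is a hypothetical stand-in shaped to fit the selectors in `tags`; it is not a copy of the real IMDB page, and the `href` and `src` values are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical row, written by hand to match the selectors above
sample_html = """
<table><tr>
  <td class="posterColumn"><img src="poster.jpg"></td>
  <td class="titleColumn">
    1.
    <a href="#" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Les évadés</a>
    <span>(1994)</span>
  </td>
  <td class="ratingColumn imdbRating"><strong>9,2</strong></td>
</tr></table>
"""
sample = BeautifulSoup(sample_html, "html.parser")

# "td.titleColumn" matches <td> elements whose class list contains titleColumn
title_cell = sample.select("td.titleColumn")[0]
# get_text joins the text fragments of the cell with the given separator
cell_text = title_cell.get_text("|", strip=True)

# "a" matches the link nested inside the title cell
link = title_cell.select("a")[0]

# "td.ratingColumn.imdbRating" requires both classes on the same <td>
rating_cell = sample.select("td.ratingColumn.imdbRating")[0]
poster_cell = sample.select("td.posterColumn")[0]

print(cell_text)
print(link.get("title"))
print(rating_cell.strong.get_text())
print(poster_cell.img.get("src"))
```

Note that `.select` always returns a list, even when only one element matches, which is why the sketch indexes with `[0]`.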

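The posters will be saved as JPEG files. The helper functions below turn a saved image into a base64-encoded thumbnail so that it can be displayed inside a DataFrame.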

In [5]:
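def get_thumbnail(path):
    """Open the image at `path` and shrink it to fit within 150x150 pixels."""
    i = Image.open(path)
    i.thumbnail((150, 150), Image.LANCZOS)
    return i

def image_base64(im):
    """Return the image as a base64-encoded JPEG string; `im` is a file path or a PIL image."""
    if isinstance(im, str):
        im = get_thumbnail(im)
    with BytesIO() as buffer:
        im.save(buffer, 'jpeg')
        return base64.b64encode(buffer.getvalue()).decode()

def image_formatter(im):
    """Wrap the base64-encoded image in an <img> tag so it renders in an HTML table."""
    return f'<img src="data:image/jpeg;base64,{image_base64(im)}">'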
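The formatter relies on a data URI: the image bytes are base64-encoded and placed directly in the `src` attribute, so the HTML needs no separate image file. Here is a minimal sketch of that idea using only the standard library. The helper name `bytes_to_img_tag` is hypothetical, and the three placeholder bytes are not a complete image.

```python
import base64


def bytes_to_img_tag(data, mime="image/jpeg"):
    """Embed raw bytes in an <img> tag as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f'<img src="data:{mime};base64,{encoded}">'


# Placeholder bytes for illustration only (not a real picture)
payload = b"\xff\xd8\xff"
tag = bytes_to_img_tag(payload)
print(tag)  # <img src="data:image/jpeg;base64,/9j/">

# Decoding the base64 part gives the original bytes back
encoded_part = tag.split("base64,")[1].rstrip('">')
assert base64.b64decode(encoded_part) == payload
```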

In [6]:
import os

# Make sure the folder that will hold the downloaded posters exists
os.makedirs("./images", exist_ok=True)

# We store one dictionary per movie in this list
imdb = []
for row in table:

    # Each select() call returns a list, here with at most one element per row,
    # so we could simply take the first element.
    # A for loop with zip() is used instead to demonstrate the zip function;
    # rows that lack these cells yield empty lists and are skipped.
    poster_column = row.select(tags[2])
    title_column = row.select(tags[0])
    rating_imdb_column = row.select(tags[3])

    for title_info, poster, rating_imdb in zip(title_column, poster_column, rating_imdb_column):

        # TITLE COLUMN
        # The cell text holds three pieces of information: rank, title and year
        text = title_info.get_text('|', strip=True)
        rank, title, year = text.split("|")
        # Strip the dot from the rank and the parentheses from the year
        rank = rank.replace('.', "")
        for char in ["(", ")"]:
            year = year.replace(char, "")
        # The director and actors are in the title attribute of the <a> tag
        references = title_info.select(tags[1])
        # references is a list of one element
        for reference in references:
            info = reference.get('title')
            director, actors = info.split(" (dir.), ")

        # POSTER COLUMN
        poster_reference = poster.select('img')
        for img in poster_reference:
            poster_link = img.get('src')
            filename = "./images/" + title + ".jpg"
            r = requests.get(poster_link, allow_redirects=True)
            # The context manager closes the file once the poster is written
            with open(filename, 'wb') as poster_file:
                poster_file.write(r.content)

        # RATING IMDB COLUMN
        # The title attribute of <strong> holds the rating and the number of votes
        text_rating = rating_imdb.strong.get('title')
        # Remove the non-breaking spaces used as thousands separators
        text_rating = text_rating.replace(u'\xa0', '')
        rating, _, _, number_vote, *_ = text_rating.split()

        # SAVING INFORMATION
        # Put the information in a dictionary and append it to imdb
        movie_information = {"Rank": rank, "Poster": filename, "Title": title,
                             "Year": year, "Director": director, "Main actors": actors,
                             "Rating IMDB": rating, "Number of vote": number_vote}
        imdb.append(movie_information)

# Display the collected information
display(imdb)

[{'Rank': '1',
  'Poster': './images/Les évadés.jpg',
  'Title': 'Les évadés',
  'Year': '1994',
  'Director': 'Frank Darabont',
  'Main actors': 'Tim Robbins, Morgan Freeman',
  'Rating IMDB': '9,2',
  'Number of vote': '2316345'},
 {'Rank': '2',
  'Poster': './images/Le parrain.jpg',
  'Title': 'Le parrain',
  'Year': '1972',
  'Director': 'Francis Ford Coppola',
  'Main actors': 'Marlon Brando, Al Pacino',
  'Rating IMDB': '9,1',
  'Number of vote': '1599466'},
 {'Rank': '3',
  'Poster': './images/Le parrain, 2ème partie.jpg',
  'Title': 'Le parrain, 2ème partie',
  'Year': '1974',
  'Director': 'Francis Ford Coppola',
  'Main actors': 'Al Pacino, Robert De Niro',
  'Rating IMDB': '9,0',
  'Number of vote': '1117179'},
 {'Rank': '4',
  'Poster': './images/The Dark Knight: Le chevalier noir.jpg',
  'Title': 'The Dark Knight: Le chevalier noir',
  'Year': '2008',
  'Director': 'Christopher Nolan',
  'Main actors': 'Christian Bale, Heath Ledger',
  'Rating IMDB': '9,0',
  'Number of vo
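Every value in the output above is a string, including the rating, which uses a decimal comma because the French page was requested. If numbers are needed later (for sorting or averaging, say), a conversion step along these lines could be applied. This is a sketch only; the helper name `to_numeric_fields` is hypothetical and not part of the original notebook.

```python
def to_numeric_fields(movie):
    """Return a copy of `movie` with its numeric fields converted from strings."""
    converted = dict(movie)
    converted["Rank"] = int(movie["Rank"])
    converted["Year"] = int(movie["Year"])
    # Swap the decimal comma for a dot before calling float()
    converted["Rating IMDB"] = float(movie["Rating IMDB"].replace(",", "."))
    converted["Number of vote"] = int(movie["Number of vote"])
    return converted


# First entry of the list shown above
first_movie = {
    "Rank": "1",
    "Poster": "./images/Les évadés.jpg",
    "Title": "Les évadés",
    "Year": "1994",
    "Director": "Frank Darabont",
    "Main actors": "Tim Robbins, Morgan Freeman",
    "Rating IMDB": "9,2",
    "Number of vote": "2316345",
}
numeric_movie = to_numeric_fields(first_movie)
print(numeric_movie["Rating IMDB"], numeric_movie["Number of vote"])  # 9.2 2316345
```

The whole list could then be converted with `[to_numeric_fields(m) for m in imdb]`.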

4. To check your code, loop over the `imdb` list and print some information for each movie:

In [25]:
# Print a few fields for every movie directed by Alfred Hitchcock
fields = ["Rank", "Title", "Year", "Director", "Main actors"]
for movie in imdb:
    if movie["Director"] == 'Alfred Hitchcock':
        # Look the fields up by key rather than relying on the dictionary order
        for field in fields:
            print(f"{field}: {movie[field]}")
        # Also show the raw (key, value) tuple for the rank
        print(("Rank", movie["Rank"]))

Rank: 40
Title: Psychose
Year: 1960
Director: Alfred Hitchcock
Main actors: Anthony Perkins, Janet Leigh
('Rank', '40')
Rank: 52
Title: Fenêtre sur cour
Year: 1954
Director: Alfred Hitchcock
Main actors: James Stewart, Grace Kelly
('Rank', '52')
Rank: 92
Title: Sueurs froides
Year: 1958
Director: Alfred Hitchcock
Main actors: James Stewart, Kim Novak
('Rank', '92')
Rank: 103
Title: La mort aux trousses
Year: 1959
Director: Alfred Hitchcock
Main actors: Cary Grant, Eva Marie Saint
('Rank', '103')
Rank: 150
Title: Le crime était presque parfait
Year: 1954
Director: Alfred Hitchcock
Main actors: Ray Milland, Grace Kelly
('Rank', '150')
Rank: 236
Title: Rebecca
Year: 1940
Director: Alfred Hitchcock
Main actors: Laurence Olivier, Joan Fontaine
('Rank', '236')


In [9]:
# Column names for the DataFrame that will hold one row per movie
column_names = imdb[0].keys()
display(column_names)

# Empty DataFrame whose columns are the dictionary keys
df = pd.DataFrame(columns=column_names)
display(df)

dict_keys(['Rank', 'Poster', 'Title', 'Year', 'Director', 'Main actors', 'Rating IMDB', 'Number of vote'])

Unnamed: 0,Rank,Poster,Title,Year,Director,Main actors,Rating IMDB,Number of vote


In [10]:
# Build the DataFrame directly from the list of dictionaries.
# DataFrame.append was removed in pandas 2.0, so appending row by row no longer works.
df = pd.DataFrame(imdb)

In [11]:
HTML(df.to_html(formatters={'Poster': image_formatter}, escape=False))

Unnamed: 0,Rank,Poster,Title,Year,Director,Main actors,Rating IMDB,Number of vote
0,1,,Les évadés,1994,Frank Darabont,"Tim Robbins, Morgan Freeman","9,2",2316345
1,2,,Le parrain,1972,Francis Ford Coppola,"Marlon Brando, Al Pacino","9,1",1599466
2,3,,"Le parrain, 2ème partie",1974,Francis Ford Coppola,"Al Pacino, Robert De Niro","9,0",1117179
3,4,,The Dark Knight: Le chevalier noir,2008,Christopher Nolan,"Christian Bale, Heath Ledger","9,0",2278759
4,5,,12 hommes en colère,1957,Sidney Lumet,"Henry Fonda, Lee J. Cobb","8,9",681474
5,6,,La liste de Schindler,1993,Steven Spielberg,"Liam Neeson, Ralph Fiennes","8,9",1201355
6,7,,Le Seigneur des anneaux : Le Retour du roi,2003,Peter Jackson,"Elijah Wood, Viggo Mortensen","8,9",1626225
7,8,,Pulp Fiction,1994,Quentin Tarantino,"John Travolta, Uma Thurman","8,8",1807341
8,9,,"Le Bon, la brute, le truand",1966,Sergio Leone,"Clint Eastwood, Eli Wallach","8,8",682031
9,10,,Le Seigneur des anneaux : La Communauté de l'anneau,2001,Peter Jackson,"Elijah Wood, Ian McKellen","8,8",1642506
