## Let's gather the last piece of data for the Roger Ebert review word clouds now: the movie poster image files. 
### Let's also keep each image's URL to add to the master DataFrame later. 
- I'm going to query the MediaWiki API using _wptools_ to get a movie poster URL via each page object's image attribute.
- Using that URL, I'll programmatically download that image into a folder called bestofrt_posters.
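The download step above needs a destination path for each poster. As a minimal sketch, assuming the file name is built from the ranking, the page title, and the extension found in the image URL, it could look like this. The helper name `poster_path` and the example URL are hypothetical and are not part of the notebook.

```python
import os
from urllib.parse import urlparse


def poster_path(folder, ranking, title, url):
    """Build '<folder>/<ranking>_<title><ext>' (hypothetical helper)."""
    # Take the extension from the URL path only, ignoring any query string
    ext = os.path.splitext(urlparse(url).path)[1]
    return os.path.join(folder, f"{ranking}_{title}{ext}")


# Made-up URL for illustration; real poster URLs come from wptools
example = poster_path('bestofrt_posters', 1, 'The_Wizard_of_Oz_(1939_film)',
                      'https://example.org/images/poster.jpg?width=300')
print(example)
```

Parsing the URL first keeps a query string from leaking into the extension, which a plain `split('.')` on the whole URL would not guard against.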

In [1]:
import pandas as pd
import wptools
import os
import requests
from PIL import Image
from io import BytesIO

In [2]:
title_list = [
 'The_Wizard_of_Oz_(1939_film)',
 'Citizen_Kane',
 'Get_Out_(film)',
 'Mad_Max:_Fury_Road',
 'Inside_Out_(2015_film)',
 'The_Godfather',
 'Metropolis_(1927_film)',
 'E.T._the_Extra-Terrestrial',
 'Casablanca_(film)',
 'Moonlight_(2016_film)',
 'Nosferatu',
 'Snow_White_and_the_Seven_Dwarfs_(1937_film)',
 "A_Hard_Day%27s_Night_(film)",
 'The_Battle_of_Algiers',
 'Dunkirk_(2017_film)',
 'Repulsion_(film)',
 '12_Years_a_Slave_(film)',
 'Gravity_(2013_film)',
 'King_Kong_(1933_film)',
 'Spotlight_(film)',
 'Rear_Window',
 'Selma_(film)',
 'Taxi_Driver',
 'Toy_Story_3',
 'Argo_(2012_film)',
 'Toy_Story_2',
 'The_Big_Sick',
 'Zootopia',
 'M_(1931_film)',
 'Wonder_Woman_(2017_film)',
 '12_Angry_Men_(1957_film)',
 'The_400_Blows',
 'Logan_(film)',
 'All_Quiet_on_the_Western_Front_(1930_film)',
 'Arrival_(film)',
 'Baby_Driver',
 'Star_Wars:_The_Force_Awakens',
 'Manchester_by_the_Sea_(film)',
 'Dr._Strangelove',
 'The_Dark_Knight_(film)',
 'The_Babadook',
 'The_Conformist_(film)',
 'Rebecca_(1940_film)',
 "Rosemary%27s_Baby_(film)",
 'Finding_Nemo',
 'Brooklyn_(film)',
 'The_Wrestler_(2008_film)',
 'The_39_Steps_(1935_film)',
 'L.A._Confidential_(film)',
 'Gone_with_the_Wind_(film)',
 'The_Good,_the_Bad_and_the_Ugly',
 'Skyfall',
 'Rome,_Open_City',
 'Tokyo_Story',
 'Hell_or_High_Water_(film)',
 'Pinocchio_(1940_film)',
 'The_Jungle_Book_(2016_film)',
 'La_La_Land_(film)',
 'Star_Trek_(film)',
 'High_Noon',
 'Apocalypse_Now',
 'On_the_Waterfront',
 'The_Wages_of_Fear',
 'The_Last_Picture_Show',
 'Harry_Potter_and_the_Deathly_Hallows_–_Part_2',
 'The_Grapes_of_Wrath_(film)',
 'Roman_Holiday',
 'Man_on_Wire',
 'Jaws_(film)',
 'Toy_Story',
 'The_Godfather_Part_II',
 'Battleship_Potemkin'
]

In [3]:
folder_name = 'bestofrt_posters'
# Make directory if it doesn't already exist
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

#### Note: the cell below, if correctly implemented, will likely take ~5 minutes to run.

In [4]:
# List of dictionaries to build and convert to a DataFrame later
df_list = []
image_errors = {}
# Enumerate from 1 so the counter doubles as the ranking
for ranking, title in enumerate(title_list, start=1):
    # Reset so a failed page lookup doesn't record the previous title's images
    images = None
    try:
        # This cell is slow so print ranking to gauge time remaining
        print(ranking)
        page = wptools.page(title, silent=True)
        images = page.get().data['image']
        # First image is usually the poster
        first_image_url = images[0]['url']
        r = requests.get(first_image_url)
        # Download movie poster image
        i = Image.open(BytesIO(r.content))
        image_file_format = first_image_url.split('.')[-1]
        i.save(os.path.join(folder_name, str(ranking) + "_" + title + '.' + image_file_format))
        # Append to list of dictionaries
        df_list.append({'ranking': int(ranking),
                        'title': title,
                        'poster_url': first_image_url})

    # Not best practice to catch all exceptions but fine for this short script
    except Exception as e:
        print(str(ranking) + "_" + title + ": " + str(e))
        image_errors[str(ranking) + "_" + title] = images

1
2
2_Citizen_Kane: cannot identify image file <_io.BytesIO object at 0x7f1bd6a5fba0>
3
4
5
6
7
7_Metropolis_(1927_film): cannot identify image file <_io.BytesIO object at 0x7f1bd66daca8>
8
9
9_Casablanca_(film): cannot identify image file <_io.BytesIO object at 0x7f1bd66daf68>
10
11
11_Nosferatu: cannot identify image file <_io.BytesIO object at 0x7f1bd6a5f9e8>
12
13


API error: {'code': 'invalidtitle', 'info': 'Bad title "A_Hard_Day%27s_Night_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes.'}


13_A_Hard_Day%27s_Night_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=A_Hard_Day%2527s_Night_%28film%29
14
15
16
17
18
19
20
21
21_Rear_Window: cannot identify image file <_io.BytesIO object at 0x7f1bd6b31258>
22
23
24
25
26
27
28
29
30
31
31_12_Angry_Men_(1957_film): cannot identify image file <_io.BytesIO object at 0x7f1bd6760048>
32
33
34
34_All_Quiet_on_the_Western_Front_(1930_film): cannot identify image file <_io.BytesIO object at 0x7f1bd6760518>
35
36
37
38
39
40
41
42
42_The_Conformist_(film): cannot identify image file <_io.BytesIO object at 0x7f1bd67cf9e8>
43
43_Rebecca_(1940_film): cannot identify image file <_io.BytesIO object at 0x7f1bd6b31410>
44


API error: {'code': 'invalidtitle', 'info': 'Bad title "Rosemary%27s_Baby_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes.'}


44_Rosemary%27s_Baby_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=Rosemary%2527s_Baby_%28film%29
45
45_Finding_Nemo: cannot identify image file <_io.BytesIO object at 0x7f1bd6760db0>
46
47
48
48_The_39_Steps_(1935_film): cannot identify image file <_io.BytesIO object at 0x7f1bd67609e8>
49
50
50_Gone_with_the_Wind_(film): cannot identify image file <_io.BytesIO object at 0x7f1bd66a7200>
51
52
53
53_Rome,_Open_City: cannot identify image file <_io.BytesIO object at 0x7f1bd6a5fba0>
54
54_Tokyo_Story: cannot identify image file <_io.BytesIO object at 0x7f1bd6b31410>
55
56
57
58
59
60
60_High_Noon: cannot identify image file <_io.BytesIO object at 0x7f1bd67cf9e8>
61
62
62_On_the_Waterfront: cannot identify image file <_io.BytesIO object at 0x7f1bd66a7410>
63
64
65
66
67
67_Roman_Holiday: cannot identify image file <_io.B
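
Pillow reports "cannot identify image file" when the bytes passed to `Image.open` are not in an image format it recognizes. The output above does not show what the server actually returned for these titles, so the cause is not confirmed here. As a hedged sketch, the response could be inspected before opening it. The helper name `download_problem` is hypothetical; it takes plain values so it can be tried without a network call. In the loop it would receive `r.status_code`, `r.headers.get('Content-Type', '')`, and `r.content`.

```python
def download_problem(status_code, content_type, body):
    """Describe why a response is unlikely to be an image, or return None."""
    if status_code != 200:
        return f"HTTP status {status_code}"
    if not content_type.lower().startswith('image/'):
        return f"unexpected Content-Type {content_type!r}"
    if not body:
        return "empty response body"
    return None


# Illustrative values only, not taken from the real responses
print(download_problem(200, 'image/jpeg', b'\xff\xd8\xff'))
print(download_problem(403, 'text/html', b'<html>'))
print(download_problem(200, 'text/html; charset=utf-8', b'<html>'))
```

Recording this description in `image_errors` alongside the title would make the failures easier to interpret than the `BytesIO` address alone.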

Once I have completed the code above, I'll read and run the three cells below and interpret their output.

In [5]:
for key in image_errors.keys():
    print(key)

2_Citizen_Kane
7_Metropolis_(1927_film)
9_Casablanca_(film)
11_Nosferatu
13_A_Hard_Day%27s_Night_(film)
21_Rear_Window
31_12_Angry_Men_(1957_film)
34_All_Quiet_on_the_Western_Front_(1930_film)
42_The_Conformist_(film)
43_Rebecca_(1940_film)
44_Rosemary%27s_Baby_(film)
45_Finding_Nemo
48_The_39_Steps_(1935_film)
50_Gone_with_the_Wind_(film)
53_Rome,_Open_City
54_Tokyo_Story
60_High_Noon
62_On_the_Waterfront
67_Roman_Holiday
72_Battleship_Potemkin
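
Two entries in this list, 13 and 44, differ from the rest: the API rejected them as `invalidtitle`, and the logged request URLs show `%27` re-encoded as `%2527`. That suggests those titles were already percent-encoded when they went into `title_list`. A minimal sketch, assuming that decoding them with the standard library's `urllib.parse.unquote` before the lookup gives titles the API accepts (not verified against the live API here):

```python
from urllib.parse import unquote

# The two titles from title_list that triggered 'invalidtitle'
encoded_titles = ["A_Hard_Day%27s_Night_(film)", "Rosemary%27s_Baby_(film)"]

# unquote turns %27 back into an apostrophe and leaves other characters alone
decoded_titles = [unquote(t) for t in encoded_titles]
print(decoded_titles)
```

Titles without percent escapes pass through `unquote` unchanged, so it could be applied to every entry in `title_list` rather than only these two.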


In [6]:
# Creating DataFrame from list of dictionaries
df = pd.DataFrame(df_list, columns = ['ranking', 'title', 'poster_url'])
df = df.sort_values('ranking').reset_index(drop=True)
df

Unnamed: 0,ranking,title,poster_url
0,1,The_Wizard_of_Oz_(1939_film),https://upload.wikimedia.org/wikipedia/commons...
1,3,Get_Out_(film),https://upload.wikimedia.org/wikipedia/en/a/a3...
2,4,Mad_Max:_Fury_Road,https://upload.wikimedia.org/wikipedia/en/6/6e...
3,5,Inside_Out_(2015_film),https://upload.wikimedia.org/wikipedia/en/0/0a...
4,6,The_Godfather,https://upload.wikimedia.org/wikipedia/en/1/1c...
5,8,E.T._the_Extra-Terrestrial,https://upload.wikimedia.org/wikipedia/en/6/66...
6,10,Moonlight_(2016_film),https://upload.wikimedia.org/wikipedia/en/8/84...
7,12,Snow_White_and_the_Seven_Dwarfs_(1937_film),https://upload.wikimedia.org/wikipedia/en/4/49...
8,14,The_Battle_of_Algiers,https://upload.wikimedia.org/wikipedia/en/a/aa...
9,15,Dunkirk_(2017_film),https://upload.wikimedia.org/wikipedia/en/1/15...
