## Mashup: APIs, Downloading Files Programmatically, JSON
#### Quiz
- Let's gather the last piece of data for the Roger Ebert review word clouds now: the movie poster image files. Let's also keep each image's URL to add to the master DataFrame later.

- Though we're going to use a loop to minimize repetition, here's how the major parts inside that loop will work, in order:

1. We're going to query the MediaWiki API using wptools to get a movie poster URL via each page object's image attribute.
2. Using that URL, we'll programmatically download that image into a folder called bestofrt_posters.

In [3]:
import pandas as pd
import wptools
import os
import requests
from PIL import Image
from io import BytesIO

In [4]:
title_list = [
 'The_Wizard_of_Oz_(1939_film)',
 'Citizen_Kane',
 'The_Third_Man',
 'Get_Out_(film)',
 'Mad_Max:_Fury_Road',
 'The_Cabinet_of_Dr._Caligari',
 'All_About_Eve',
 'Inside_Out_(2015_film)',
 'The_Godfather',
 'Metropolis_(1927_film)',
 'E.T._the_Extra-Terrestrial',
 'Modern_Times_(film)',
 'It_Happened_One_Night',
 "Singin'_in_the_Rain",
 'Boyhood_(film)',
 'Casablanca_(film)',
 'Moonlight_(2016_film)',
 'Psycho_(1960_film)',
 'Laura_(1944_film)',
 'Nosferatu',
 'Snow_White_and_the_Seven_Dwarfs_(1937_film)',
 "A_Hard_Day%27s_Night_(film)",
 'La_Grande_Illusion',
 'North_by_Northwest',
 'The_Battle_of_Algiers',
 'Dunkirk_(2017_film)',
 'The_Maltese_Falcon_(1941_film)',
 'Repulsion_(film)',
 '12_Years_a_Slave_(film)',
 'Gravity_(2013_film)',
 'Sunset_Boulevard_(film)',
 'King_Kong_(1933_film)',
 'Spotlight_(film)',
 'The_Adventures_of_Robin_Hood',
 'Rashomon',
 'Rear_Window',
 'Selma_(film)',
 'Taxi_Driver',
 'Toy_Story_3',
 'Argo_(2012_film)',
 'Toy_Story_2',
 'The_Big_Sick',
 'Bride_of_Frankenstein',
 'Zootopia',
 'M_(1931_film)',
 'Wonder_Woman_(2017_film)',
 'The_Philadelphia_Story_(film)',
 'Alien_(film)',
 'Bicycle_Thieves',
 'Seven_Samurai',
 'The_Treasure_of_the_Sierra_Madre_(film)',
 'Up_(2009_film)',
 '12_Angry_Men_(1957_film)',
 'The_400_Blows',
 'Logan_(film)',
 'All_Quiet_on_the_Western_Front_(1930_film)',
 'Army_of_Shadows',
 'Arrival_(film)',
 'Baby_Driver',
 'A_Streetcar_Named_Desire_(1951_film)',
 'The_Night_of_the_Hunter_(film)',
 'Star_Wars:_The_Force_Awakens',
 'Manchester_by_the_Sea_(film)',
 'Dr._Strangelove',
 'Frankenstein_(1931_film)',
 'Vertigo_(film)',
 'The_Dark_Knight_(film)',
 'Touch_of_Evil',
 'The_Babadook',
 'The_Conformist_(film)',
 'Rebecca_(1940_film)',
 "Rosemary%27s_Baby_(film)",
 'Finding_Nemo',
 'Brooklyn_(film)',
 'The_Wrestler_(2008_film)',
 'The_39_Steps_(1935_film)',
 'L.A._Confidential_(film)',
 'Gone_with_the_Wind_(film)',
 'The_Good,_the_Bad_and_the_Ugly',
 'Skyfall',
 'Rome,_Open_City',
 'Tokyo_Story',
 'Hell_or_High_Water_(film)',
 'Pinocchio_(1940_film)',
 'The_Jungle_Book_(2016_film)',
 'La_La_Land_(film)',
 'Star_Trek_(film)',
 'High_Noon',
 'Apocalypse_Now',
 'On_the_Waterfront',
 'The_Wages_of_Fear',
 'The_Last_Picture_Show',
 'Harry_Potter_and_the_Deathly_Hallows_–_Part_2',
 'The_Grapes_of_Wrath_(film)',
 'Roman_Holiday',
 'Man_on_Wire',
 'Jaws_(film)',
 'Toy_Story',
 'The_Godfather_Part_II',
 'Battleship_Potemkin'
]

In [5]:
folder_name = 'bestofrt_posters'
# Make directory if it doesn't already exist
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

#### Note: the cell below, if correctly implemented, will likely take ~5 minutes to run.

In [6]:
# List of dictionaries to build and convert to a DataFrame later
df_list = []
image_errors = {}
for title in title_list:
    try:
        # This cell is slow so print ranking to gauge time remaining
        ranking = title_list.index(title) + 1
        print(ranking)
        page = wptools.page(title, silent=True)
        # Your code here (three lines)
        images = page.get().data['image']
        # First image is usually the poster
        first_image_url = images[0]['url']
        r = requests.get(first_image_url)
        # Download movie poster image
        i = Image.open(BytesIO(r.content))
        image_file_format = first_image_url.split('.')[-1]
        i.save(folder_name + "/" + str(ranking) + "_" + title + '.' + image_file_format)
        # Append to list of dictionaries
        df_list.append({'ranking': int(ranking),
                        'title': title,
                        'poster_url': first_image_url})
    
    # Not best practice to catch all exceptions but fine for this short script
    except Exception as e:
        print(str(ranking) + "_" + title + ": " + str(e))
        image_errors[str(ranking) + "_" + title] = images

1
1_The_Wizard_of_Oz_(1939_film): cannot identify image file <_io.BytesIO object at 0x000001758CF5A270>
2
2_Citizen_Kane: cannot identify image file <_io.BytesIO object at 0x0000017593771040>
3
3_The_Third_Man: cannot identify image file <_io.BytesIO object at 0x0000017594167F40>
4
5
6
7
8
8_Inside_Out_(2015_film): https://en.wikipedia.org/w/api.php?action=query&exintro&formatversion=2&inprop=url|watchers&list=random&pithumbsize=240&pllimit=500&ppprop=disambiguation|wikibase_item&prop=extracts|info|links|pageassessments|pageimages|pageprops|pageterms|redirects&redirects&rdlimit=500&rnlimit=1&rnnamespace=0&titles=Inside%20Out%20%282015%20film%29&plcontinue=38751266|0|The_Adventures_of_André_&_Wally_B.
9
10
10_Metropolis_(1927_film): cannot identify image file <_io.BytesIO object at 0x00000175941D1F90>
11
12
13
14
14_Singin'_in_the_Rain: cannot identify image file <_io.BytesIO object at 0x0000017594147810>
15
15_Boyhood_(film): 'image'
16
17
18
18_Psycho_(1960_film): cannot identify imag

API error: {'code': 'invalidtitle', 'info': 'Bad title "A_Hard_Day%27s_Night_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes.'}


22_A_Hard_Day%27s_Night_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=A_Hard_Day%2527s_Night_%28film%29
23
24
24_North_by_Northwest: cannot identify image file <_io.BytesIO object at 0x0000017593F12B80>
25
26
27
27_The_Maltese_Falcon_(1941_film): cannot identify image file <_io.BytesIO object at 0x0000017593EF34A0>
28
29
30
31
31_Sunset_Boulevard_(film): cannot identify image file <_io.BytesIO object at 0x00000175940C6A90>
32
32_King_Kong_(1933_film): cannot identify image file <_io.BytesIO object at 0x0000017593F2AD60>
33
34
34_The_Adventures_of_Robin_Hood: cannot identify image file <_io.BytesIO object at 0x00000175941D1F40>
35
35_Rashomon: cannot identify image file <_io.BytesIO object at 0x0000017594167BD0>
36
36_Rear_Window: cannot identify image file <_io.BytesIO object at 0x00000175940E3310>
37
38
39
40
41
42


API error: {'code': 'invalidtitle', 'info': 'Bad title "Rosemary%27s_Baby_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes.'}


72_Rosemary%27s_Baby_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=Rosemary%2527s_Baby_%28film%29
73
74
75
76
77
78
79
80
81
82
82_Tokyo_Story: cannot identify image file <_io.BytesIO object at 0x0000017593F1C810>
83
84
85
86
87
88
88_High_Noon: cannot identify image file <_io.BytesIO object at 0x0000017594166F90>
89
90
90_On_the_Waterfront: cannot identify image file <_io.BytesIO object at 0x0000017593F1C2C0>
91
92
93
94
94_The_Grapes_of_Wrath_(film): cannot identify image file <_io.BytesIO object at 0x000001759410E1D0>
95
95_Roman_Holiday: cannot identify image file <_io.BytesIO object at 0x0000017593F126D0>
96
96_Man_on_Wire: cannot identify image file <_io.BytesIO object at 0x000001759426B270>
97
98
99
100
100_Battleship_Potemkin: cannot identify image file <_io.BytesIO object at 0x0000017591626D60>


#### Once you have completed the above code requirements, read and run the three cells below and interpret their output.

In [7]:
for key in image_errors.keys():
    print(key)

1_The_Wizard_of_Oz_(1939_film)
2_Citizen_Kane
3_The_Third_Man
8_Inside_Out_(2015_film)
10_Metropolis_(1927_film)
14_Singin'_in_the_Rain
15_Boyhood_(film)
18_Psycho_(1960_film)
19_Laura_(1944_film)
22_A_Hard_Day%27s_Night_(film)
24_North_by_Northwest
27_The_Maltese_Falcon_(1941_film)
31_Sunset_Boulevard_(film)
32_King_Kong_(1933_film)
34_The_Adventures_of_Robin_Hood
35_Rashomon
36_Rear_Window
43_Bride_of_Frankenstein
47_The_Philadelphia_Story_(film)
49_Bicycle_Thieves
50_Seven_Samurai
51_The_Treasure_of_the_Sierra_Madre_(film)
57_Army_of_Shadows
60_A_Streetcar_Named_Desire_(1951_film)
61_The_Night_of_the_Hunter_(film)
66_Vertigo_(film)
68_Touch_of_Evil
71_Rebecca_(1940_film)
72_Rosemary%27s_Baby_(film)
82_Tokyo_Story
88_High_Noon
90_On_the_Waterfront
94_The_Grapes_of_Wrath_(film)
95_Roman_Holiday
96_Man_on_Wire
100_Battleship_Potemkin


In [8]:
# Inspect unidentifiable images and download them individually
for rank_title, images in image_errors.items():
    if rank_title == '22_A_Hard_Day%27s_Night_(film)':
        url = 'https://upload.wikimedia.org/wikipedia/en/4/47/A_Hard_Days_night_movieposter.jpg'
    if rank_title == '53_12_Angry_Men_(1957_film)':
        url = 'https://upload.wikimedia.org/wikipedia/en/9/91/12_angry_men.jpg'
    if rank_title == '72_Rosemary%27s_Baby_(film)':
        url = 'https://upload.wikimedia.org/wikipedia/en/e/ef/Rosemarys_baby_poster.jpg'
    if rank_title == '93_Harry_Potter_and_the_Deathly_Hallows_–_Part_2':
        url = 'https://upload.wikimedia.org/wikipedia/en/d/df/Harry_Potter_and_the_Deathly_Hallows_%E2%80%93_Part_2.jpg'
    title = rank_title[3:]
    df_list.append({'ranking': int(title_list.index(title) + 1),
                    'title': title,
                    'poster_url': url})
    r = requests.get(url)
    # Download movie poster image
    i = Image.open(BytesIO(r.content))
    image_file_format = url.split('.')[-1]
    i.save(folder_name + "/" + rank_title + '.' + image_file_format)

ValueError: 'he_Wizard_of_Oz_(1939_film)' is not in list

In [9]:
# Create DataFrame from list of dictionaries
df = pd.DataFrame(df_list, columns = ['ranking', 'title', 'poster_url'])
df = df.sort_values('ranking').reset_index(drop=True)
df

Unnamed: 0,ranking,title,poster_url
0,4,Get_Out_(film),https://upload.wikimedia.org/wikipedia/en/a/a3...
1,5,Mad_Max:_Fury_Road,https://upload.wikimedia.org/wikipedia/en/6/6e...
2,6,The_Cabinet_of_Dr._Caligari,https://upload.wikimedia.org/wikipedia/en/2/2f...
3,7,All_About_Eve,https://upload.wikimedia.org/wikipedia/commons...
4,9,The_Godfather,https://upload.wikimedia.org/wikipedia/en/1/1c...
...,...,...,...
59,92,The_Last_Picture_Show,https://upload.wikimedia.org/wikipedia/en/8/8f...
60,93,Harry_Potter_and_the_Deathly_Hallows_–_Part_2,https://upload.wikimedia.org/wikipedia/en/d/df...
61,97,Jaws_(film),https://upload.wikimedia.org/wikipedia/en/e/eb...
62,98,Toy_Story,https://upload.wikimedia.org/wikipedia/en/1/13...
