## Source: APIs (Application Programming Interfaces)

- Goal: get each movie's poster to add to our word cloud.
- Each movie has its poster on its Wikipedia page, so we can use Wikipedia's API.
- An API lets you access data from the Internet in a reasonably easy manner.
- MediaWiki is the open-source software behind Wikipedia; its API gives access to all of Wikipedia's data.
- Rotten Tomatoes does have an API, but unfortunately it doesn't provide posters or images, and it requires access via a proposal. Example of querying it with the *rtsimple* library:

```python
import rtsimple as rt
rt.API_KEY = 'YOUR API KEY HERE'
movie = rt.Movies('10489')
movie.ratings['audience_score']
```
   
### MediaWiki API

MediaWiki has a great [tutorial](https://www.mediawiki.org/wiki/API:Main_page#A_simple_example) on their website on how their API calls are structured. It's a nice and simple example and they explain the various moving parts:

* The endpoint (important takeaway: there is nothing special about this URL!)
* The format
* The action
* Action-specific parameters
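
The four parts listed above can be seen by assembling a request URL by hand. This is only a minimal sketch, not MediaWiki's own client code: it assumes the English Wikipedia endpoint and the `query` action, the variable names are made up for this example, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# The endpoint: an ordinary URL (English Wikipedia's api.php is assumed here)
endpoint = 'https://en.wikipedia.org/w/api.php'

params = {
    'action': 'query',           # the action
    'format': 'json',            # the format
    'titles': 'Mahatma_Gandhi',  # action-specific parameters from here down
    'prop': 'revisions',
    'rvprop': 'content',
}

# Join the endpoint and the encoded parameters into one request URL
api_url = endpoint + '?' + urlencode(params)
print(api_url)
```

Pasting the printed URL into a browser should return the same JSON that a client library would receive, which is the sense in which the endpoint is "nothing special".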

### wptools Library

There are many client libraries for MediaWiki, covering a variety of programming languages. Here is a [list](https://www.mediawiki.org/wiki/API:Client_code#Python) for Python. This is pretty standard for most APIs. Some libraries are better than others, which, again, is standard. For MediaWiki, the most up-to-date and human-readable one in Python is called [wptools](https://github.com/siznax/wptools). The analogous relationship for Twitter is:

* MediaWiki API → wptools
* Twitter API → tweepy

*wptools* has an even simpler tutorial on their GitHub page using the [Mahatma Gandhi Wikipedia page](https://en.wikipedia.org/wiki/Mahatma_Gandhi) as a working example.

To get a `page` object, the [usage](https://github.com/siznax/wptools/wiki/Usage#page-usage) is as follows:

```python
page = wptools.page('Mahatma_Gandhi')
```

...where *'Mahatma_Gandhi'* is the last part of the Wikipedia URL for that page (https://en.wikipedia.org/wiki/Mahatma_Gandhi). This `page` object has methods that can get us various pieces of data about that Wikipedia page, including all of the images on the page. To get all of the data:

> Simply calling get() on a page will automagically fetch extracts, images, infobox data, wikidata, and other metadata via the MediaWiki, Wikidata, and RESTBase APIs.

```python
page = wptools.page('Mahatma_Gandhi').get()
```

Or if you already have a page object assigned to `page`:

```python
page.get()
```

* `page` now has a `data` attribute: a dictionary holding everything that was fetched, with individual items accessed by key.
* `page.data['image']`, for example, returns a list with data for six images on this specific Wikipedia page.
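
As a rough illustration of that access pattern without calling the API, the sketch below uses a hand-written stand-in for `page.data['image']`. It keeps a single entry and only a few of the keys that *wptools* returns for the Gandhi page, so the list length and contents are assumptions made for the example.

```python
# Stand-in for page.data['image']: one entry, trimmed to a few keys
images = [
    {'kind': 'query-pageimage',
     'file': 'File:Portrait Gandhi.jpg',
     'url': 'https://upload.wikimedia.org/wikipedia/commons/d/d1/Portrait_Gandhi.jpg',
     'width': 2024,
     'height': 3040},
]

# Each list entry is a dictionary, so index into the list first, then look up a key
first_url = images[0]['url']

# A list comprehension collects one field from every entry
all_files = [image['file'] for image in images]

print(first_url)
print(all_files)
```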

In [1]:
import pandas as pd
import wptools

In [19]:
page = wptools.page('Mahatma_Gandhi')
page.get()

en.wikipedia.org (query) Mahatma_Gandhi
en.wikipedia.org (query) Mahatma Gandhi (&plcontinue=19379|0|Giov...
en.wikipedia.org (query) Mahatma Gandhi (&plcontinue=19379|0|Nisa...
en.wikipedia.org (query) Mahatma Gandhi (&plcontinue=19379|0|Vegg...
en.wikipedia.org (parse) 19379
www.wikidata.org (wikidata) Q1001
www.wikidata.org (labels) P1938|Q5460604|Q142534|P1430|P6351|P25|...
www.wikidata.org (labels) Q9441|P3544|Q239344|Q183167|P106|P135|Q...
www.wikidata.org (labels) P5019|P646|P1296|Q5137|P1728|P244|Q2120...
www.wikidata.org (labels) Q194279|Q302736|P463|P1340|P2639|Q987|P...
www.wikidata.org (labels) P3430|Q4722851|P935|P910|P535|P4293|Q13...
en.wikipedia.org (restbase) /page/summary/Mahatma_Gandhi
en.wikipedia.org (imageinfo) File:Portrait Gandhi.jpg|File:MKGandhi.jpg
Mahatma Gandhi (en) data
{
  aliases: <list(12)> Mahatma Mohandas Karamchand Gandhi, M. K. Ga...
  assessments: <dict(10)> Biography, Politics, Alternative Views, ...
  claims: <dict(147)> P27, P19, P20, P26, P157,

<wptools.page.WPToolsPage at 0x4437198>

In [20]:
page.data['image']

[{'kind': 'query-pageimage',
  'file': 'File:Portrait Gandhi.jpg',
  'orig': 'Portrait_Gandhi.jpg',
  'timestamp': '2007-07-08T10:21:04Z',
  'size': 2951123,
  'width': 2024,
  'height': 3040,
  'url': 'https://upload.wikimedia.org/wikipedia/commons/d/d1/Portrait_Gandhi.jpg',
  'descriptionurl': 'https://commons.wikimedia.org/wiki/File:Portrait_Gandhi.jpg',
  'descriptionshorturl': 'https://commons.wikimedia.org/w/index.php?curid=2369294',
  'title': 'File:Portrait Gandhi.jpg',
  'metadata': {'DateTime': {'value': '2007-07-08 10:21:04',
    'source': 'mediawiki-metadata',
    'hidden': ''},
   'ObjectName': {'value': 'Portrait Gandhi',
    'source': 'mediawiki-metadata',
    'hidden': ''},
   'CommonsMetadataExtension': {'value': 1.2,
    'source': 'extension',
    'hidden': ''},
   'Categories': {'value': '', 'source': 'commons-categories', 'hidden': ''},
   'Assessments': {'value': '', 'source': 'commons-categories', 'hidden': ''},
   'ImageDescription': {'value': '<p>Mohandas K. Gan

In [12]:
# Use wptools.page() to get a page object
flannery = wptools.page("Flannery O'Connor")

In [6]:
# Leaving off the title invokes a random lookup in English
page = wptools.page()

en.wikipedia.org (random) 🍭
Alberto Fay (en) data
{
  pageid: 35364884
  requests: <list(1)> random
  title: Alberto Fay
}


In [7]:
# The default language is 'en' (English)
# if you specify only a language, you get a random Wikipedia page in that language
page = wptools.page(lang='zh')

zh.wikipedia.org (random) 🍭
多纹红绵鳚 (zh) data
{
  pageid: 1872332
  requests: <list(1)> random
  title: 多纹红绵鳚
}


In [8]:
# If you specify only a wiki site, you get a random page from that site
page = wptools.page(wiki='en.wikiquote.org')

en.wikiquote.org (random) 🍣
Small Time Crooks (en) data
{
  pageid: 208750
  requests: <list(1)> random
  title: Small Time Crooks
}


In [11]:
# You can also start with a Wikidata item
malcolmx = wptools.page(wikibase='Q43303')

In [13]:
# wptools.category() to get a category object:
cat = wptools.category()

en.wikipedia.org (random:14) 🌯
Category:Anita Lindblom albums (en) data
{
  pageid: 44215485
  requests: <list(1)> random
  title: Category:Anita Lindblom albums
}


In [16]:
# wptools.site() to get a site object:
site = wptools.site('de.wikisource.org')
# get_info() to get info about a site:
site.get_info()

en.wikipedia.org (query) siteinfo|siteviews|mostviewed
en.wikipedia.org (query) siteviews:uniques
Wikipedia (en) data
{
  activeusers: 139,312
  admins: 1,186
  articles: 5,801,262
  edits: 877,647,732
  images: 880,917
  info: <dict(51)> mainpage, base, sitename, logo, generator, phpv...
  jobs: 0
  mostviewed: <list(479)> {'ns': 0, 'title': 'Louis Tomlinson', 'c...
  pages: 47,049,244
  queued-massmessages: 0
  requests: <list(2)> siteinfo, sitevisitors
  site: enwiki
  siteviews: 260,963,204
  users: 35,620,119
  visitors: 71,293,965
}


<wptools.site.WPToolsSite at 0x44372e8>

In [17]:
# top() to show the most popular pages:
site.top('ja.wikipedia.org')

enwiki mostviewed articles:
1. Louis Tomlinson (390,054)
2. Nancy Pelosi (288,854)
3. 21 Savage (239,207)
4. Ted Bundy (222,342)
5. Alexandria Ocasio-Cortez (169,286)
6. Buzz Aldrin (147,907)
7. Tobias Harris (123,088)
8. Stacey Abrams (116,801)
9. XHamster (114,699)
10. State of the Union (113,073)
11. Donald Trump (103,506)
12. Deaths in 2019 (102,338)
13. Liam Neeson (98,240)
14. Tom Brady (96,472)
15. Don Cornelius (93,262)
16. Kristoff St. John (90,752)
17. Freddie Mercury (89,260)
18. Kayden Boche (85,516)
19. K.G.F: Chapter 1 (85,203)
20. Fyre Festival (85,143)
21. Alice Marie Johnson (78,374)
22. Chinese New Year (77,511)
23. Travis Barker (76,981)
24. Productivity (75,363)
25. Andrew Cunanan (72,987)


### Quiz

Get the page object for the [E.T. the Extra-Terrestrial Wikipedia](https://en.wikipedia.org/wiki/E.T._the_Extra-Terrestrial) page.

In [25]:
page = wptools.page("E.T._the_Extra-Terrestrial")
page.get()

en.wikipedia.org (query) E.T._the_Extra-Terrestrial
en.wikipedia.org (parse) 73441
www.wikidata.org (wikidata) Q11621
www.wikidata.org (labels) Q1757366|Q237207|P2061|P3302|Q1044183|Q...
www.wikidata.org (labels) P3135|Q131520|Q443775|P135|Q1748409|P27...
www.wikidata.org (labels) P2631|Q787145|Q586356|P31|P3808|P1552|P...
www.wikidata.org (labels) P1970|Q8877|P2529|P2508|P1874|Q505449|P...
en.wikipedia.org (restbase) /page/summary/E.T._the_Extra-Terrestrial
en.wikipedia.org (imageinfo) File:ET logo 3.svg|File:E t the extr...
E.T. the Extra-Terrestrial (en) data
{
  aliases: <list(2)> E.T., ET
  assessments: <dict(4)> United States, Film, Science Fiction, Lib...
  claims: <dict(94)> P1562, P57, P272, P345, P31, P161, P373, P480...
  description: <str(63)> 1982 American science fiction film direct...
  exhtml: <str(569)> <p><i><b>E.T. the Extra-Terrestrial</b></i> i...
  exrest: <str(548)> E.T. the Extra-Terrestrial is a 1982 American...
  extext: <str(1784)> _**E.T. the Extra-Terrestri

<wptools.page.WPToolsPage at 0x449ed30>

In [32]:
page.data['image']

[{'kind': 'parse-image',
  'file': 'File:E t the extra terrestrial ver3.jpg',
  'orig': 'E t the extra terrestrial ver3.jpg',
  'timestamp': '2016-06-04T10:30:46Z',
  'size': 83073,
  'width': 253,
  'height': 394,
  'url': 'https://upload.wikimedia.org/wikipedia/en/6/66/E_t_the_extra_terrestrial_ver3.jpg',
  'descriptionurl': 'https://en.wikipedia.org/wiki/File:E_t_the_extra_terrestrial_ver3.jpg',
  'descriptionshorturl': 'https://en.wikipedia.org/w/index.php?curid=7419503',
  'title': 'File:E t the extra terrestrial ver3.jpg',
  'metadata': {'DateTime': {'value': '2016-06-04 10:30:46',
    'source': 'mediawiki-metadata',
    'hidden': ''},
   'ObjectName': {'value': 'E t the extra terrestrial ver3',
    'source': 'mediawiki-metadata',
    'hidden': ''},
   'CommonsMetadataExtension': {'value': 1.2,
    'source': 'extension',
    'hidden': ''},
   'Categories': {'value': 'All non-free media|E.T. the Extra-Terrestrial|Fair use images of movie posters|Files with no machine-readable auth

## JSON File Structure

* Most data from APIs comes in JSON or XML format.
* Both are great for representing and accessing complicated data hierarchies.
* A JSON object is a collection of key-value pairs.
* In Python, JSON objects are interpreted as dictionaries.
* JSON object keys must be strings.

**JSON arrays → Python lists.** 

**JSON objects → Python dictionaries.**
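
A small sketch of those two mappings, using the standard-library `json` module on a made-up JSON string (the field names are chosen for illustration and are not the *wptools* schema):

```python
import json

# A JSON object containing a string, a number, and a JSON array
raw = '{"title": "E.T. the Extra-Terrestrial", "year": 1982, "aliases": ["E.T.", "ET"]}'
movie = json.loads(raw)

print(type(movie))             # the JSON object is parsed into a dict
print(type(movie['aliases']))  # the JSON array is parsed into a list
print(movie['aliases'][0])     # lists are indexed by position, dicts by key
```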

**More Information**

[Mashery: API Data Exchange: XML vs. JSON](https://www.tibco.com/blog/2014/01/23/api-data-exchange-xml-vs-json/)

## Quiz 1
Access the first image in the `image` attribute, which is a JSON array.

In [38]:
page.data['image'][0]

{'kind': 'parse-image',
 'file': 'File:E t the extra terrestrial ver3.jpg',
 'orig': 'E t the extra terrestrial ver3.jpg',
 'timestamp': '2016-06-04T10:30:46Z',
 'size': 83073,
 'width': 253,
 'height': 394,
 'url': 'https://upload.wikimedia.org/wikipedia/en/6/66/E_t_the_extra_terrestrial_ver3.jpg',
 'descriptionurl': 'https://en.wikipedia.org/wiki/File:E_t_the_extra_terrestrial_ver3.jpg',
 'descriptionshorturl': 'https://en.wikipedia.org/w/index.php?curid=7419503',
 'title': 'File:E t the extra terrestrial ver3.jpg',
 'metadata': {'DateTime': {'value': '2016-06-04 10:30:46',
   'source': 'mediawiki-metadata',
   'hidden': ''},
  'ObjectName': {'value': 'E t the extra terrestrial ver3',
   'source': 'mediawiki-metadata',
   'hidden': ''},
  'CommonsMetadataExtension': {'value': 1.2,
   'source': 'extension',
   'hidden': ''},
  'Categories': {'value': 'All non-free media|E.T. the Extra-Terrestrial|Fair use images of movie posters|Files with no machine-readable author|Files with no mach

## Quiz 2
Access the director key of the infobox attribute, which is a JSON object.

In [43]:
page.data['infobox']['director']

'[[Steven Spielberg]]'
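
The double square brackets in that value are wiki link markup carried over from the page source. If plain text is needed, one possible cleanup is sketched here; `strip_wikilinks` is a hypothetical helper written for this example, not part of *wptools*, and it only handles the simple `[[Target]]` and `[[Target|Label]]` forms.

```python
import re

def strip_wikilinks(text):
    """Replace [[Target]] or [[Target|Label]] with the visible text."""
    # Optional "Target|" prefix is dropped; the remaining text inside the brackets is kept
    return re.sub(r'\[\[(?:[^\]|]*\|)?([^\]]*)\]\]', r'\1', text)

director = strip_wikilinks('[[Steven Spielberg]]')
print(director)
```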

### More JSON in Python

For the example in this lesson, JSON data was sourced from an API. That isn't always the case, though! Sometimes you're given a text file with human-readable JSON in it. For this situation, the [json](https://docs.python-guide.org/scenarios/json/) library is indispensable. It can parse JSON from strings or files into a Python dictionary or list, and it can convert Python dictionaries or lists into JSON strings. The tutorial on the linked documentation page is handy. This [Reading and Writing JSON to a File in Python](https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/) article from Stack Abuse is also great; it outlines `json.dump`, `json.dumps`, `json.load`, and `json.loads` (four key json library functions) well.
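
To make the difference between the four functions concrete, here is a minimal round-trip sketch. The dictionary contents and the temporary file location are assumptions made for the example.

```python
import json
import os
import tempfile

movie = {'title': 'Citizen_Kane', 'ranking': 2, 'aliases': []}

# dumps / loads: convert to and from a JSON string
as_text = json.dumps(movie)
from_text = json.loads(as_text)

# dump / load: write to and read from a file object
path = os.path.join(tempfile.mkdtemp(), 'movie.json')
with open(path, 'w') as f:
    json.dump(movie, f, indent=2)
with open(path) as f:
    from_file = json.load(f)

# Both round trips give back an equal dictionary
print(from_text == movie, from_file == movie)
```

The trailing `s` is the thing to remember: `dumps` and `loads` work with strings, while `dump` and `load` work with open files.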

*pandas* also has JSON functions (the `read_json` function and the `to_json` DataFrame method), but the hierarchical advantage of JSON is wasted in pandas' tabular DataFrame so the uses are limited.

### Mashup: APIs, Downloading Files Programmatically, and JSON

With APIs, downloading files programmatically from the internet, and JSON under your belt, you now have all of the knowledge to download all of the movie poster images for the Roger Ebert review word clouds. This is your next task.

There are two key things to be aware of before you begin:

1. Wikipedia Page Titles
To access Wikipedia page data via the MediaWiki API with *wptools*, you need each movie's Wikipedia page title, i.e., what comes after the last slash in *en.wikipedia.org/wiki/* in the URL.

2. Downloading Image Files
Downloading images may seem tricky from a reading and writing perspective, in comparison to text files which you can read line by line, for example. But in reality, image files aren't special—they're just binary files. To interact with them, you don't need special software (like Photoshop or something) that "understands" images. You can use regular file opening, reading, and writing techniques, like this:

```python
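import os
import requests

# url, folder_name, and filename are assumed to be defined earlier
r = requests.get(url)
# 'wb' opens the file for writing in binary mode
with open(os.path.join(folder_name, filename), 'wb') as f:
    f.write(r.content)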
```

But this technique can be error-prone. It will work most of the time, but sometimes the file you write to will be damaged. 

This type of error is why the *requests* library maintainers [recommend](http://docs.python-requests.org/en/latest/user/quickstart/#binary-response-content) using the [PIL](https://pillow.readthedocs.io/en/stable/) library (the Python Imaging Library, available today as its maintained fork, Pillow) and `BytesIO` from the *io* module for non-text requests, like images. They recommend accessing the response body as bytes for such requests. For example, to create an image from binary data returned by a request:

```python
import requests
from PIL import Image
from io import BytesIO
r = requests.get(url)
i = Image.open(BytesIO(r.content))
```

Though you may still encounter a similar file error, this code above will at least warn us with an error message, at which point we can manually download the problematic images.
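
One reason a downloaded file can fail to open is that the bytes are not in a raster image format at all. As a rough, standard-library-only stand-in for the kind of check `Image.open` performs (this is not how Pillow is implemented), the sketch below inspects the first few bytes for the JPEG, PNG, and GIF signatures. The name `guess_image_type` is made up for this example.

```python
# File signatures ("magic numbers") for three common raster formats
SIGNATURES = {
    b'\xff\xd8\xff': 'jpeg',
    b'\x89PNG\r\n\x1a\n': 'png',
    b'GIF87a': 'gif',
    b'GIF89a': 'gif',
}

def guess_image_type(data):
    """Return a format name if data starts with a known signature, else None."""
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None

print(guess_image_type(b'\x89PNG\r\n\x1a\n' + b'rest of file'))
print(guess_image_type(b'<!DOCTYPE html>'))
```

In a download loop, a `None` result for `guess_image_type(r.content)` could flag a response worth inspecting by hand before writing it to disk.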

### Quiz

Let's gather the last piece of data for the Roger Ebert review word clouds now: the movie poster image files. Let's also keep each image's URL to add to the master DataFrame later.

Though we're going to use a loop to minimize repetition, here's how the major parts inside that loop will work, in order:

1. We're going to query the MediaWiki API using *wptools* to get a movie poster URL via each page object's `image` attribute.
2. Using that URL, we'll programmatically download that image into a folder called *bestofrt_posters*.

The Jupyter Notebook below contains template code that:

* Contains *title_list*, which is a list of all of the Wikipedia page titles for each movie in the Rotten Tomatoes Top 100 Movies of All Time list. This list is in the same order as the Top 100.
* Creates an empty list, *df_list*, to which dictionaries will be appended. This list of dictionaries will eventually be converted to a pandas DataFrame (this is the [most efficient way of building a DataFrame row by row](https://stackoverflow.com/questions/28056171/how-to-build-and-fill-pandas-dataframe-from-for-loop/28058264#28058264)).
* Creates an empty folder, *bestofrt_posters*, to store the downloaded movie poster image files.
* Creates an empty dictionary, *image_errors*, to fill to keep track of movie poster image URLs that don't work.
* Loops through the Wikipedia page titles in *title_list* and:
    * Stores the ranking of that movie in the Top 100 list based on its position in *title_list*. Ranking is needed so we can join this DataFrame with the master DataFrame later. We can't join on title because the titles of the Rotten Tomatoes pages and the Wikipedia pages differ.
    * Uses [`try` and `except` blocks](https://www.pythonforbeginners.com/error-handling/python-try-and-except) to attempt to query MediaWiki for a movie poster image URL and to attempt to download that image. If the attempt fails and an error is encountered, the offending movie is documented in *image_errors*.
    * Appends a dictionary with ranking, title, and poster_url as the keys and the extracted values for each as the values to df_list.
* Inspects the images that caused errors and downloads the correct image individually (either via another URL in the *image* attribute's list or a URL from Google Images).
* Creates a DataFrame called df by converting df_list using the *pd.DataFrame* [constructor](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).
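
The bookkeeping inside that loop (ranking from list position, a file name built from ranking, title, and the URL's extension, and one dictionary per movie) can be sketched without any network access. `poster_record` is a hypothetical helper, and the one-item list and fixed URL stand in for values that would come from *title_list* and the API.

```python
def poster_record(ranking, title, poster_url, folder_name='bestofrt_posters'):
    """Return (file path, row dictionary) for one movie. Nothing is downloaded."""
    # The file extension is whatever follows the last dot in the URL
    image_file_format = poster_url.split('.')[-1]
    path = folder_name + '/' + str(ranking) + '_' + title + '.' + image_file_format
    row = {'ranking': ranking, 'title': title, 'poster_url': poster_url}
    return path, row

# One-item stand-ins for the real title list and the URL returned by the API
title_list = ['The_Wizard_of_Oz_(1939_film)']
poster_url = 'https://upload.wikimedia.org/wikipedia/commons/c/ca/WIZARD_OF_OZ_ORIGINAL_POSTER_1939.jpg'

df_list = []
# enumerate(..., start=1) yields the ranking directly from the list position
for ranking, title in enumerate(title_list, start=1):
    path, row = poster_record(ranking, title, poster_url)
    df_list.append(row)
    print(path)
```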

In [44]:
import pandas as pd
import wptools
import os
import requests
from PIL import Image
from io import BytesIO

In [46]:
title_list = [
 'The_Wizard_of_Oz_(1939_film)',
 'Citizen_Kane',
 'The_Third_Man',
 'Get_Out_(film)',
 'Mad_Max:_Fury_Road',
 'The_Cabinet_of_Dr._Caligari',
 'All_About_Eve',
 'Inside_Out_(2015_film)',
 'The_Godfather',
 'Metropolis_(1927_film)',
 'E.T._the_Extra-Terrestrial',
 'Modern_Times_(film)',
 'It_Happened_One_Night',
 "Singin'_in_the_Rain",
 'Boyhood_(film)',
 'Casablanca_(film)',
 'Moonlight_(2016_film)',
 'Psycho_(1960_film)',
 'Laura_(1944_film)',
 'Nosferatu',
 'Snow_White_and_the_Seven_Dwarfs_(1937_film)',
 "A_Hard_Day%27s_Night_(film)",
 'La_Grande_Illusion',
 'North_by_Northwest',
 'The_Battle_of_Algiers',
 'Dunkirk_(2017_film)',
 'The_Maltese_Falcon_(1941_film)',
 'Repulsion_(film)',
 '12_Years_a_Slave_(film)',
 'Gravity_(2013_film)',
 'Sunset_Boulevard_(film)',
 'King_Kong_(1933_film)',
 'Spotlight_(film)',
 'The_Adventures_of_Robin_Hood',
 'Rashomon',
 'Rear_Window',
 'Selma_(film)',
 'Taxi_Driver',
 'Toy_Story_3',
 'Argo_(2012_film)',
 'Toy_Story_2',
 'The_Big_Sick',
 'Bride_of_Frankenstein',
 'Zootopia',
 'M_(1931_film)',
 'Wonder_Woman_(2017_film)',
 'The_Philadelphia_Story_(film)',
 'Alien_(film)',
 'Bicycle_Thieves',
 'Seven_Samurai',
 'The_Treasure_of_the_Sierra_Madre_(film)',
 'Up_(2009_film)',
 '12_Angry_Men_(1957_film)',
 'The_400_Blows',
 'Logan_(film)',
 'All_Quiet_on_the_Western_Front_(1930_film)',
 'Army_of_Shadows',
 'Arrival_(film)',
 'Baby_Driver',
 'A_Streetcar_Named_Desire_(1951_film)',
 'The_Night_of_the_Hunter_(film)',
 'Star_Wars:_The_Force_Awakens',
 'Manchester_by_the_Sea_(film)',
 'Dr._Strangelove',
 'Frankenstein_(1931_film)',
 'Vertigo_(film)',
 'The_Dark_Knight_(film)',
 'Touch_of_Evil',
 'The_Babadook',
 'The_Conformist_(film)',
 'Rebecca_(1940_film)',
 "Rosemary%27s_Baby_(film)",
 'Finding_Nemo',
 'Brooklyn_(film)',
 'The_Wrestler_(2008_film)',
 'The_39_Steps_(1935_film)',
 'L.A._Confidential_(film)',
 'Gone_with_the_Wind_(film)',
 'The_Good,_the_Bad_and_the_Ugly',
 'Skyfall',
 'Rome,_Open_City',
 'Tokyo_Story',
 'Hell_or_High_Water_(film)',
 'Pinocchio_(1940_film)',
 'The_Jungle_Book_(2016_film)',
 'La_La_Land_(film)',
 'Star_Trek_(film)',
 'High_Noon',
 'Apocalypse_Now',
 'On_the_Waterfront',
 'The_Wages_of_Fear',
 'The_Last_Picture_Show',
 'Harry_Potter_and_the_Deathly_Hallows_–_Part_2',
 'The_Grapes_of_Wrath_(film)',
 'Roman_Holiday',
 'Man_on_Wire',
 'Jaws_(film)',
 'Toy_Story',
 'The_Godfather_Part_II',
 'Battleship_Potemkin'
]

In [47]:
folder_name = 'bestofrt_posters'
# Make directory if it doesn't already exist
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [52]:
for title in title_list:
    print(title)
    page = wptools.page(title).get()
    break

The_Wizard_of_Oz_(1939_film)


en.wikipedia.org (query) The_Wizard_of_Oz_(1939_film)
en.wikipedia.org (query) The Wizard of Oz (1939 film) (&plcontinu...
en.wikipedia.org (parse) 561315
www.wikidata.org (wikidata) Q193695
www.wikidata.org (labels) P3302|P4969|P436|Q719228|P840|Q319221|Q...
www.wikidata.org (labels) Q22006653|Q448644|P179|Q7704028|P1804|P...
www.wikidata.org (labels) P1877|P17|P1954|P4632|P3138|Q1558|Q5601...
www.wikidata.org (labels) Q21995136|P910|P1265|P856|P361|P2130|Q6...
en.wikipedia.org (restbase) /page/summary/The_Wizard_of_Oz_(1939_film)
en.wikipedia.org (imageinfo) File:WIZARD OF OZ ORIGINAL POSTER 19...
The Wizard of Oz (1939 film) (en) data
{
  aliases: <list(1)> Wizard of Oz
  assessments: <dict(5)> United States, Film, Children's literatur...
  claims: <dict(80)> P31, P345, P373, P57, P144, P480, P58, P162, ...
  description: 1939 movie based on the book by L. Frank Baum
  exhtml: <str(591)> <p><i><b>The Wizard of Oz</b></i> is a 1939 A...
  exrest: <str(563)> The Wizard of Oz is a 1939

In [53]:
page.data['image'][0]['url']

'https://upload.wikimedia.org/wikipedia/commons/c/ca/WIZARD_OF_OZ_ORIGINAL_POSTER_1939.jpg'

In [54]:
# List of dictionaries to build and convert to a DataFrame later
df_list = []
image_errors = {}
for ranking, title in enumerate(title_list, start=1):
    # Reset so a failed lookup is not recorded with the previous movie's images
    images = None
    try:
        # This cell is slow so print ranking to gauge time remaining
        print(ranking)
        page = wptools.page(title, silent=True)
        # Your code here (three lines)
        images = page.get().data['image']
        # First image is usually the poster
        first_image_url = images[0]['url']
        r = requests.get(first_image_url)
        # Download movie poster image
        i = Image.open(BytesIO(r.content))
        image_file_format = first_image_url.split('.')[-1]
        i.save(folder_name + "/" + str(ranking) + "_" + title + '.' + image_file_format)
        # Append to list of dictionaries
        df_list.append({'ranking': ranking,
                        'title': title,
                        'poster_url': first_image_url})

    # Not best practice to catch all exceptions but fine for this short script
    except Exception as e:
        print(str(ranking) + "_" + title + ": " + str(e))
        image_errors[str(ranking) + "_" + title] = images

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22


API error: {'code': 'invalidtitle', 'info': 'Bad title "A_Hard_Day%27s_Night_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.'}


22_A_Hard_Day%27s_Night_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=A_Hard_Day%2527s_Night_%28film%29
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
64_Dr._Strangelove: cannot identify image file <_io.BytesIO object at 0x0000000008266F10>
65
66
67
68
69
70
71
72


API error: {'code': 'invalidtitle', 'info': 'Bad title "Rosemary%27s_Baby_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.'}


72_Rosemary%27s_Baby_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=Rosemary%2527s_Baby_%28film%29
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100


In [55]:
for key in image_errors.keys():
    print(key)

22_A_Hard_Day%27s_Night_(film)
64_Dr._Strangelove
72_Rosemary%27s_Baby_(film)
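
Two of the three failures are `invalidtitle` errors for titles containing `%27`, the percent-encoded form of an apostrophe, and the request URLs in the output show it encoded a second time as `%2527`. By contrast, `Singin'_in_the_Rain`, which uses a literal apostrophe, did not raise an error. One possible fix, offered as an assumption rather than something verified against *wptools*, is to decode the titles before looking them up:

```python
from urllib.parse import unquote

failed_titles = ['A_Hard_Day%27s_Night_(film)', 'Rosemary%27s_Baby_(film)']

# unquote turns percent-escapes such as %27 back into the characters they stand for
decoded_titles = [unquote(title) for title in failed_titles]
print(decoded_titles)
```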


In [56]:
# Inspect unidentifiable images and download them individually
# Manually chosen poster URLs, keyed by the '<ranking>_<title>' strings used in image_errors
manual_urls = {
    '5_Mad_Max:_Fury_Road':
        'https://upload.wikimedia.org/wikipedia/en/4/47/Mad_Max_Fury_Road.jpg',
    '22_A_Hard_Day%27s_Night_(film)':
        'https://upload.wikimedia.org/wikipedia/en/4/47/A_Hard_Days_night_movieposter.jpg',
    '53_12_Angry_Men_(1957_film)':
        'https://upload.wikimedia.org/wikipedia/en/9/91/12_angry_men.jpg',
    '72_Rosemary%27s_Baby_(film)':
        'https://upload.wikimedia.org/wikipedia/en/e/ef/Rosemarys_baby_poster.jpg',
    '93_Harry_Potter_and_the_Deathly_Hallows_–_Part_2':
        'https://upload.wikimedia.org/wikipedia/en/d/df/Harry_Potter_and_the_Deathly_Hallows_%E2%80%93_Part_2.jpg',
}
for rank_title in image_errors:
    url = manual_urls.get(rank_title)
    if url is None:
        # No replacement URL is listed above (64_Dr._Strangelove in this run);
        # skip it rather than reuse the previous movie's URL
        print(rank_title + ": no replacement URL provided")
        continue
    # Split on the first underscore only, so rankings of any length work
    ranking, title = rank_title.split('_', 1)
    df_list.append({'ranking': int(ranking),
                    'title': title,
                    'poster_url': url})
    r = requests.get(url)
    # Download movie poster image
    i = Image.open(BytesIO(r.content))
    image_file_format = url.split('.')[-1]
    i.save(folder_name + "/" + rank_title + '.' + image_file_format)

In [57]:
# Create DataFrame from list of dictionaries
df = pd.DataFrame(df_list, columns = ['ranking', 'title', 'poster_url'])
df = df.sort_values('ranking').reset_index(drop=True)
df

| | ranking | title | poster_url |
|---|---|---|---|
| 0 | 1 | The_Wizard_of_Oz_(1939_film) | https://upload.wikimedia.org/wikipedia/commons... |
| 1 | 2 | Citizen_Kane | https://upload.wikimedia.org/wikipedia/en/c/ce... |
| 2 | 3 | The_Third_Man | https://upload.wikimedia.org/wikipedia/en/2/21... |
| 3 | 4 | Get_Out_(film) | https://upload.wikimedia.org/wikipedia/en/a/a3... |
| 4 | 5 | Mad_Max:_Fury_Road | https://upload.wikimedia.org/wikipedia/en/6/6e... |
| 5 | 6 | The_Cabinet_of_Dr._Caligari | https://upload.wikimedia.org/wikipedia/commons... |
| 6 | 7 | All_About_Eve | https://upload.wikimedia.org/wikipedia/en/2/22... |
| 7 | 8 | Inside_Out_(2015_film) | https://upload.wikimedia.org/wikipedia/en/0/0a... |
| 8 | 9 | The_Godfather | https://upload.wikimedia.org/wikipedia/en/1/1c... |
| 9 | 10 | Metropolis_(1927_film) | https://upload.wikimedia.org/wikipedia/en/0/06... |


### Word Clouds

These word clouds required gathering data from two different sources:
* downloading files from the Internet: the Roger Ebert review text files
* accessing data from an API: the movie poster images

![](wordclouds_white/1_The_Wizard_of_Oz_(1939_film).jpg)