<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Using APIs

---

In this lab we will practice web scraping and using an API to retrieve and store data.

In [1]:
# Imports at the top
import json
import pandas as pd
import numpy as np
import requests
import re
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
%matplotlib inline

## IMDB TV Shows

---

Sometimes an API doesn't provide all the information we would like to get and we need to be creative.

Here we will use a combination of scraping and API calls to find the ratings and networks of famous television shows.

### 1 Get the top TV Shows

The Internet Movie Database contains data about movies and TV shows. Unfortunately it does not have a public API.

The page http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2 lists the top 250 TV shows of all time.

Let's try to get the web page directly:

In [2]:
# Send a request to get the IMDB top TV shows page

web_url = "https://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2"
response = requests.get(web_url)

# Check the response 
if (response.status_code == 200):
    print(response.headers['Content-Type'])
else:
    print(response.reason)

Forbidden


**Forbidden??**
It looks like we cannot scrape the data directly, because the site does not allow automated requests.


However, we can save the HTML page as a file, then parse the file using `BeautifulSoup`.

I have saved it as "Top250TVShows.html".


In [3]:
# Open the saved file and parse it into a BeautifulSoup object;
# the with-block closes the file once it has been read
with open("../python-web_services_apis-lab/files/Top250TVShows.html", "r", encoding="utf-8") as topshows_file:
    soup = BeautifulSoup(topshows_file.read(), 'html.parser')
type(soup)

bs4.BeautifulSoup

**1.1 Find the movie_ids**

Parse the HTML to obtain a list of the `movie_ids` for these shows.


> **Hint:** movie_ids look like this: `tt2582802`
> _Everything after "/title/" and before "/?"_.
> We can use regular expressions (the `re` library) to find the tag with `href` value that contains 
> a string that looks like `/title` using `href=re.compile('/title')`
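
As a minimal sketch of the hint above, the id can also be pulled out with a capturing group rather than by splitting on `/`. The two `href` strings follow the link format of the chart page; the names `id_pattern` and `extract_imdb_id` are made up for this illustration.

```python
import re

# Example href values in the shape used by the chart page links
hrefs = [
    "/title/tt0903747/?ref_=chttvtp_t_1",
    "/title/tt5491994/?ref_=chttvtp_t_2",
]

# "tt" followed by digits, captured between "/title/" and the next "/"
id_pattern = re.compile(r"/title/(tt\d+)/")

def extract_imdb_id(href):
    """Return the IMDB id found in href, or None if the pattern does not match."""
    match = id_pattern.search(href)
    return match.group(1) if match else None

ids = [extract_imdb_id(h) for h in hrefs]
print(ids)  # ['tt0903747', 'tt5491994']
```

Returning `None` for a non-matching link makes it easy to filter out unrelated anchors before building the final list.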


In [9]:
# Where can we find the top 250 tv shows? Inspect the elements to find it.
movies=soup.find_all("a", class_="ipc-title-link-wrapper", href=re.compile('/title'))
movies

[<a class="ipc-title-link-wrapper" href="/title/tt0903747/?ref_=chttvtp_t_1" tabindex="0"><h3 class="ipc-title__text">1. Breaking Bad</h3></a>,
 <a class="ipc-title-link-wrapper" href="/title/tt5491994/?ref_=chttvtp_t_2" tabindex="0"><h3 class="ipc-title__text">2. Planet Earth II</h3></a>,
 <a class="ipc-title-link-wrapper" href="/title/tt0795176/?ref_=chttvtp_t_3" tabindex="0"><h3 class="ipc-title__text">3. Planet Earth</h3></a>,
 <a class="ipc-title-link-wrapper" href="/title/tt0185906/?ref_=chttvtp_t_4" tabindex="0"><h3 class="ipc-title__text">4. Band of Brothers</h3></a>,
 <a class="ipc-title-link-wrapper" href="/title/tt7366338/?ref_=chttvtp_t_5" tabindex="0"><h3 class="ipc-title__text">5. Chernobyl</h3></a>,
 <a class="ipc-title-link-wrapper" href="/title/tt0306414/?ref_=chttvtp_t_6" tabindex="0"><h3 class="ipc-title__text">6. The Wire</h3></a>,
 <a class="ipc-title-link-wrapper" href="/title/tt0417299/?ref_=chttvtp_t_7" tabindex="0"><h3 class="ipc-title__text">7. Avatar: The Las

In [26]:
movies[0]['href'].split('/')[2]

'tt0903747'

In [16]:
# Get the movie IDs only as a list (practice list comprehension...)
top_shows = [movie['href'].split('/')[2] for movie in movies]

In [17]:
top_shows

['tt0903747',
 'tt5491994',
 'tt0795176',
 'tt0185906',
 'tt7366338',
 'tt0306414',
 'tt0417299',
 'tt6769208',
 'tt0141842',
 'tt2395695',
 'tt0081846',
 'tt9253866',
 'tt0944947',
 'tt0071075',
 'tt2861424',
 'tt7678620',
 'tt1355642',
 'tt8420184',
 'tt1533395',
 'tt0052520',
 'tt1475582',
 'tt1877514',
 'tt0103359',
 'tt2560140',
 'tt12392504',
 'tt0386676',
 'tt11126994',
 'tt0296310',
 'tt3032476',
 'tt1806234',
 'tt0303461',
 'tt2092588',
 'tt10541088',
 'tt0877057',
 'tt0081912',
 'tt0098769',
 'tt2098220',
 'tt2356777',
 'tt0098904',
 'tt0092337',
 'tt9735318',
 'tt7920978',
 'tt2802850',
 'tt0213338',
 'tt1865718',
 'tt2297757',
 'tt3530232',
 'tt7660850',
 'tt7137906',
 'tt1508238',
 'tt0108778',
 'tt2571774',
 'tt4742876',
 'tt4934214',
 'tt0472954',
 'tt0063929',
 'tt13675832',
 'tt0200276',
 'tt0081834',
 'tt0264235',
 'tt0388629',
 'tt0072500',
 'tt1831164',
 'tt3398228',
 'tt0112130',
 'tt0193676',
 'tt0096548',
 'tt0098936',
 'tt0214341',
 'tt2707408',
 'tt0353049',
 '

### 2 Get data on the top shows

Although the Internet Movie Database does not have a public API, an open API exists at http://www.tvmaze.com/api.

We will use this API to retrieve information about each of the 250 TV shows you have extracted in the previous step.



**2.1 Find the Correct API**

Check the documentation of TVmaze's API to select the endpoint that returns show data for a given IMDB movie id.
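
One way to keep the endpoint and the id separate is to build the query string with the standard library, as in the sketch below. It assumes the show lookup endpoint that takes an `imdb` query parameter; `LOOKUP_URL` and `build_lookup_url` are illustrative names, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base of the lookup endpoint (assumed); the "imdb" parameter carries the IMDB id
LOOKUP_URL = "https://api.tvmaze.com/lookup/shows"

def build_lookup_url(imdb_id):
    """Return the full lookup URL for one IMDB id without sending a request."""
    return f"{LOOKUP_URL}?{urlencode({'imdb': imdb_id})}"

print(build_lookup_url("tt0903747"))
# https://api.tvmaze.com/lookup/shows?imdb=tt0903747
```

With `requests`, passing `params={'imdb': imdb_id}` to `requests.get` performs the same encoding.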

In [28]:
## What's the endpoint URI?
URI = 'https://api.tvmaze.com/lookup/shows?imdb='

In [86]:
## Test the request with the first movie id that you found
# and decode the json info
response = requests.get(URI + 'tt0903747')
show_info=response.json()

b'{"id":169,"url":"https://www.tvmaze.com/shows/169/breaking-bad","name":"Breaking Bad","type":"Scripted","language":"English","genres":["Drama","Crime","Thriller"],"status":"Ended","runtime":60,"averageRuntime":60,"premiered":"2008-01-20","ended":"2019-10-11","officialSite":"http://www.amc.com/shows/breaking-bad","schedule":{"time":"22:00","days":["Sunday"]},"rating":{"average":9.3},"weight":98,"network":{"id":20,"name":"AMC","country":{"name":"United States","code":"US","timezone":"America/New_York"},"officialSite":null},"webChannel":null,"dvdCountry":null,"externals":{"tvrage":18164,"thetvdb":81189,"imdb":"tt0903747"},"image":{"medium":"https://static.tvmaze.com/uploads/images/medium_portrait/0/2400.jpg","original":"https://static.tvmaze.com/uploads/images/original_untouched/0/2400.jpg"},"summary":"<p><b>Breaking Bad</b> follows protagonist Walter White, a chemistry teacher who lives in New Mexico with his wife and teenage son who has cerebral palsy. White is diagnosed with Stage II

In [78]:
show_info

{'id': 169,
 'url': 'https://www.tvmaze.com/shows/169/breaking-bad',
 'name': 'Breaking Bad',
 'type': 'Scripted',
 'language': 'English',
 'genres': ['Drama', 'Crime', 'Thriller'],
 'status': 'Ended',
 'runtime': 60,
 'averageRuntime': 60,
 'premiered': '2008-01-20',
 'ended': '2019-10-11',
 'officialSite': 'http://www.amc.com/shows/breaking-bad',
 'schedule': {'time': '22:00', 'days': ['Sunday']},
 'rating': {'average': 9.3},
 'weight': 98,
 'network': {'id': 20,
  'name': 'AMC',
  'country': {'name': 'United States',
   'code': 'US',
   'timezone': 'America/New_York'},
  'officialSite': None},
 'webChannel': None,
 'dvdCountry': None,
 'externals': {'tvrage': 18164, 'thetvdb': 81189, 'imdb': 'tt0903747'},
 'image': {'medium': 'https://static.tvmaze.com/uploads/images/medium_portrait/0/2400.jpg',
  'original': 'https://static.tvmaze.com/uploads/images/original_untouched/0/2400.jpg'},
 'summary': "<p><b>Breaking Bad</b> follows protagonist Walter White, a chemistry teacher who lives i

**2.2 Let's get some data**

Let's check that we can get each piece of information we want to store.
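
Some fields in the response are nested, and a value can be `None` rather than missing, in which case chained `.get()` calls fail. The helper below is a rough sketch of one way to handle both cases; `dig` and the two trimmed sample dictionaries are hypothetical and only mimic the shape of the API response.

```python
import math

def dig(data, *keys, default=None):
    """Follow keys through nested dicts; return default if a key is missing or a value is None."""
    current = data
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
    return default if current is None else current

# Trimmed, hypothetical responses in the shape returned by the lookup endpoint
with_network = {"name": "Breaking Bad", "rating": {"average": 9.3}, "network": {"name": "AMC"}}
without_network = {"name": "Example Show", "rating": {"average": None}, "network": None}

print(dig(with_network, "network", "name"))                          # AMC
print(dig(without_network, "network", "name"))                       # None
print(dig(without_network, "rating", "average", default=math.nan))   # nan
```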

In [38]:
# Get the name of the show
show_info['name']

'Breaking Bad'

In [37]:
# Get the average rating, return np.nan if the key is missing or its value is None
rating = (show_info.get('rating') or {}).get('average')
rating if rating is not None else np.nan

9.3

In [42]:
# Get the genres as a string using the string join() 
', '.join(show_info['genres'])

'Drama, Crime, Thriller'

In [44]:
# Get the network name, return None if 'network' is missing or its value is None
(show_info.get('network') or {}).get('name')

'AMC'

In [45]:
# Get the Premiere date
show_info.get('premiered')

'2008-01-20'

In [46]:
# Get the status
show_info.get('status')

'Ended'

In [79]:
# Get the summary
show_info.get('summary')

"<p><b>Breaking Bad</b> follows protagonist Walter White, a chemistry teacher who lives in New Mexico with his wife and teenage son who has cerebral palsy. White is diagnosed with Stage III cancer and given a prognosis of two years left to live. With a new sense of fearlessness based on his medical prognosis, and a desire to secure his family's financial security, White chooses to enter a dangerous world of drugs and crime and ascends to power in this world. The series explores how a fatal diagnosis such as White's releases a typical man from the daily concerns and constraints of normal society and follows his transformation from mild family man to a kingpin of the drug trade.</p>"

**2.3 Write a Function**

- Define a function that returns a Python list with selected information for a given IMDB id:
    - IMDB id
    - Show name
    - Rating (avg)
    - Genre(s)
    - Network name
    - Premiere date
    - Status
    - Summary


In [80]:
# A:
def get_show_info(movie_id):
    """Return a list of selected TVmaze fields for an IMDB id, or [] if the lookup fails."""
    info_list = []
    response = requests.get('https://api.tvmaze.com/lookup/shows?imdb=' + movie_id)
    # Any status other than 200 leaves the list empty so the caller can skip the show
    if response.status_code == 200:
        show_info = response.json()
        info_list.append(movie_id)
        info_list.append(show_info['name'])
        info_list.append(show_info['rating']['average'])
        info_list.append(','.join(show_info['genres']))
        # 'network' can be None, so guard before reading its name
        info_list.append(show_info['network']['name'] if show_info['network'] else None)
        info_list.append(show_info['premiered'])
        info_list.append(show_info['status'])
        info_list.append(show_info['summary'])
    return info_list


In [81]:
# Test the Function
get_show_info('tt0388629')

['tt0388629',
 'One Piece',
 8.9,
 'Action,Adventure,Anime,Fantasy',
 'Fuji TV',
 '1999-10-20',
 'Running',
 '<p><b>One Piece</b> animation is based on the successful comic by Eiichiro Oda. The comic has sold more than 260 million copies. The success doesn\'t stop; the <i>One Piece</i> animation series is in its top 5 TV ratings for kids programs in Japan for past few years and the series\' most recent feature film title <i>"One Piece Film Z" </i>which was released on December 2012 has gathered 5.6 million viewers so far. The success goes beyond borders; receives high popularity on animation at terrestrial channel in Taiwan, no.1 rating on animation at a DTT channel in France, received high popularity among age 3-13 on a terrestrial channel in Germany in year 2010. The animation series has been broadcasted in many parts of the world: USA, UK, Australia, France, Spain, Portugal, Germany, Italy, Greece, Turkey, Israel, South Korea, Taiwan, China, Hong Kong, Philippine, Thailand, Singapor

**2.3 Create a DataFrame**

Let's create a `Pandas` DataFrame to store all the show data:


In [82]:
shows_df = pd.DataFrame(columns=('IMDB_id','Name','Rating','Genres','Network','Premier_Date','Status','Summary'))
shows_df

Unnamed: 0,IMDB_id,Name,Rating,Genres,Network,Premier_Date,Status,Summary


In [50]:
# Let's test getting one show and adding it to the DataFrame
show_info = get_show_info('tt2707408')
show_info

['tt2707408', 'Narcos', 8.4, 'Drama,Action,Crime', None, '2015-08-28', 'Ended']

In [51]:
# DataFrame.append was removed in pandas 2.0, so wrap the list in a one-row
# DataFrame and concatenate it; let's store the result in a temporary dataframe
temp_df = pd.concat([shows_df, pd.DataFrame([show_info], columns=shows_df.columns)], ignore_index=True)
temp_df

Unnamed: 0,IMDB_id,Name,Rating,Genres,Network,Premier_Date,Status
0,tt2707408,Narcos,8.4,"Drama,Action,Crime",,2015-08-28,Ended


In [71]:
# Build a Series only if the lookup returned data (an empty list means it failed)
if len(show_info)>0:
    show_info_series = pd.Series(show_info, index=shows_df.columns)

In [65]:
show_info_series

IMDB_id                  tt2707408
Name                        Narcos
Rating                         8.4
Genres          Drama,Action,Crime
Network                       None
Premier_Date            2015-08-28
Status                       Ended
dtype: object

In [66]:
# Concatenate the Series to the DataFrame
temp_df = pd.concat([shows_df, show_info_series.to_frame().T], ignore_index=True)
temp_df

Unnamed: 0,IMDB_id,Name,Rating,Genres,Network,Premier_Date,Status
0,tt2707408,Narcos,8.4,"Drama,Action,Crime",,2015-08-28,Ended


**2.4 Store all the shows**

Now let's add the data for all 250 shows to `shows_df`.
Before appending to the DataFrame, we need to check that the returned list is non-empty, in case the API lookup for a show failed.
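
Looking up 250 ids back to back may run into a rate limit; many public APIs answer with HTTP 429 when that happens. The sketch below assumes that behaviour and retries only in that case. `fetch_with_retry` is a hypothetical helper, `fetch` stands in for a wrapper around `requests.get` that returns a `(status_code, payload)` pair, and `fake_fetch` simulates one throttled response so the example runs without network access.

```python
import time

def fetch_with_retry(fetch, movie_id, retries=3, pause=0.0):
    """Call fetch(movie_id) until it stops returning 429 or the retries run out.

    Returns the payload on status 200, otherwise None.
    """
    for attempt in range(retries):
        status, payload = fetch(movie_id)
        if status == 200:
            return payload
        if status != 429:
            return None  # any other failure: retrying will not help
        time.sleep(pause * (attempt + 1))  # wait a little longer after each throttled call
    return None

# Stand-in for a real request: throttled on the first call, successful on the second
calls = {"count": 0}

def fake_fetch(movie_id):
    calls["count"] += 1
    if calls["count"] == 1:
        return 429, None
    return 200, {"externals": {"imdb": movie_id}}

result = fetch_with_retry(fake_fetch, "tt0903747")
print(result)  # {'externals': {'imdb': 'tt0903747'}}
```

In real use `pause` would be set to a second or more; it defaults to zero here only so the example finishes instantly.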

In [83]:
# Try the first 10 shows before running all 250
for show in top_shows[:10]:
    print(show)
    show_info=get_show_info(show)
    if len(show_info)>0:
        show_info_series = pd.Series(show_info, index=shows_df.columns)
        shows_df = pd.concat([shows_df, show_info_series.to_frame().T], ignore_index=True)

tt0903747
tt5491994
tt0795176
tt0185906
tt7366338
tt0306414
tt0417299
tt6769208
tt0141842
tt2395695


In [84]:
shows_df

Unnamed: 0,IMDB_id,Name,Rating,Genres,Network,Premier_Date,Status,Summary
0,tt0903747,Breaking Bad,9.3,"Drama,Crime,Thriller",AMC,2008-01-20,Ended,<p><b>Breaking Bad</b> follows protagonist Wal...
1,tt5491994,Planet Earth II,8.8,Nature,BBC One,2016-11-06,Ended,<p>David Attenborough presents a documentary s...
2,tt0795176,Planet Earth,8.9,Nature,BBC One,2006-03-05,Ended,<p>David Attenborough celebrates the amazing v...
3,tt0185906,Band of Brothers,8.9,"Drama,Action,War",HBO,2001-09-09,Ended,<p>Drawn from interviews with survivors of Eas...
4,tt7366338,Chernobyl,8.9,"Drama,History",HBO,2019-05-06,Ended,<p><b>Chernobyl</b> dramatizes the true story ...
5,tt0306414,The Wire,8.9,"Drama,Crime",HBO,2002-06-02,Ended,<p>The first season of <b>The Wire</b> (2002) ...
6,tt0417299,Avatar: The Last Airbender,8.9,"Action,Adventure,Fantasy",Nickelodeon,2005-02-21,Ended,<p>Water. Earth. Fire. Air. Only the Avatar wa...
7,tt6769208,Blue Planet II,9.1,Nature,BBC One,2017-10-29,Ended,"<p>Wildlife documentary series, presented and ..."
8,tt0141842,The Sopranos,8.8,"Drama,Crime",HBO,1999-01-10,Ended,"<p><b>The Sopranos</b>, writer-producer-direct..."
9,tt2395695,Cosmos,8.6,,National Geographic Channel,1980-09-28,Ended,<p>Hosted by renowned astrophysicist Neil deGr...


In [None]:
shows_df.to_csv('Top_shows.csv', index=False)