# Assignment 4

This notebook contains a set of exercises that will guide you through the different steps of this assignment. The aim of this assignment is to extract data about different characters from Marvel's API in order to build a database that contains information about different comics and stories.

<div class="alert alert-danger"><b>Submission deadline:</b> Thursday, January 14, 20:00</div>

### Instructions

Read the following instructions carefully before starting the exercises.

- This notebook is automatically graded. This means that there are several cells embedded into the notebook that take care of checking your code and grading it. It also means that it is important **to follow the instructions for each of the exercises** to make sure that you do everything right.

- Write your code in the dedicated cells. You can use as many cells as you like. Just make sure to include all the necessary code **before the corresponding test**.

- The tests for the introductory exercises will be open for you to see. This will help you understand how the pipeline works and check that you got the basics right. You can run these checks as many times as you want, **as long as you don't modify them**.  

- The tests for the graded exercises will remain hidden. It is important that you **do not write any code in the cells left blank for this purpose, and that you do not remove them**.

- Remember that the tests look for specific variables and objects. This means that in order to receive the points for each exercise, you need to **create those objects**.

Before moving on, please run the following cell. You only need to run it once in order to install the ```nose``` library.

In [1]:
pip install nose

Note: you may need to restart the kernel to use updated packages.


## Getting started

The [Marvel Comics API](https://developer.marvel.com/) allows developers everywhere to access information about Marvel's vast library of comics, from what's coming up to 70 years ago. Through this API you'll be able to access six different types of resources: comics, series, stories, events, creators and characters. You can read all about the different endpoints available by taking a look at the [documentation](https://developer.marvel.com/docs).

<img src='https://www.dropbox.com/s/fizr5sip3f55nhu/marvel.png?raw=1' width=1000>

The following instructions will guide you through the steps to create an account and retrieve your API key to complete the authentication process. Still, it may be a good idea to quickly read the [general information](https://developer.marvel.com/documentation/generalinfo) before getting started.

<img src='https://www.dropbox.com/s/lb1b2uyiig5hw2l/signin.png?raw=1' width=500>

Let's begin by creating an account. Go to the <i>Sign in</i> link on the upper left of the page and fill in the information to create your account. You can use any email you want, as long as it works. Once you have completed the process at the website, make sure you check your inbox and confirm your email address.

<img src='https://www.dropbox.com/s/c0ysk8sj0o0mpbt/accept.png?raw=1' width=600>

You'll now have to go back to the API's webpage, log in and click on <i>Getting started</i>. The first time you log in, the webpage will ask you to accept the terms and conditions in order to use the API. Do so by checking the accept box at the bottom of the page.

<img src='https://www.dropbox.com/s/a9fd0cgzn1imcw6/keys.png?raw=1' width=700>

Once you accept the terms, the website will redirect you to your account settings and you'll be able to retrieve your keys.

<div class="alert alert-info">Create two new variables, <i>privatekey</i> and <i>publickey</i>, that store your private and your public keys, respectively.</div>

<div class="alert alert-warning">Notice that there is a maximum number of requests that you are allowed to make per day with these keys. Take that into consideration to make sure you can finish the assignment in time.</div>

In [2]:
privatekey = '406b15d3da92bca586e9edf59940a3aaa6885955'
publickey = '04b31c1500365f7787d769c8eb410d62'

Great! Now that you have your keys, let's take a look at how the API is built in order to make your first request.

## Making your first request

In this assignment we will only make requests to the public endpoints. The whole list is available in the [docs](https://developer.marvel.com/docs). We'll begin by directing our request to the ```characters``` endpoint in order to retrieve information about a single character of your choice. The corresponding endpoint label has been copied for you below, together with the base URL for all the endpoints in Marvel's API.

In [3]:
base_url = 'https://gateway.marvel.com'
endpoint = '/v1/public/characters'

<div class="alert alert-info">Write the code to obtain the full url for the <i>characters</i> endpoint by concatenating the two strings above. Store the result in a new variable called <i>url</i>.</div>

In [4]:
url = base_url + endpoint

If you check the docs once again, you'll see that the ```characters``` endpoint allows you to retrieve information about characters by providing different parameters, including their name, the comics or series where they appear, etc.

<img src='https://www.dropbox.com/s/phrm53wa066cnvu/characters.png?raw=1' width=500>

We will begin by retrieving information about a single character by ```name```. You can access the list of all the Marvel characters through the following [link](https://www.marvel.com/characters). 

<div class="alert alert-info">Create a new variable called <i>name</i> that stores the name of your character of choice in string form.</div>

In [5]:
name = 'Iron Man'

This means that the body of your request should include the name above as a parameter. Note, however, that this API also expects some additional information in order to build a successful request. In particular, Marvel's API expects you to sign your requests. You can find instructions on how to do that at this [link](https://developer.marvel.com/documentation/authorization).

Apart from the parameters of your query, the API expects you to fill in the values for three additional parameters in all your requests:

- **apikey**. This parameter takes your *public* key.
- **ts**. This parameter takes a timestamp in string form or any other long string which can change on a request-by-request basis.
- **hash**. This parameter takes an MD5 hash of ts+privatekey+publickey.

Let's take a few minutes to see how to obtain the last two. Remember that you'll need to include all three parameters in all your requests.

#### Generating a timestamp

We are going to use the ```time``` library to obtain a timestamp for each request. This library has a function called ```time``` that returns the current time. We can convert this output to a string to use it as our timestamp.

In [6]:
import time

ts = str(time.time())

#### Generating an MD5 hash

In order to obtain the hash, we'll use the ```hashlib``` library. The hash has to be computed over a string that corresponds to the concatenation of ts+privatekey+publickey. You can obtain it by running the cell below. Notice that the output is a long alphanumeric string.

In [7]:
import hashlib

code = ts+privatekey+publickey
md5hash = hashlib.md5(code.encode('utf-8')).hexdigest()

You can now fill in the body of your request using all the ingredients above and send it to the url defined earlier.
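
The signing steps above can also be wrapped in a small helper, so that every request gets a fresh timestamp and a matching hash. This is only a sketch: the function name `build_auth_params` and the placeholder keys are assumptions made for illustration, not part of the assignment. The last lines use `urlencode` from the standard library to preview the query string that such parameters produce.

```python
import hashlib
import time
from urllib.parse import urlencode


def build_auth_params(public_key, private_key, ts=None):
    """Return the apikey, ts and hash parameters for one request."""
    # A fresh timestamp per call keeps every request uniquely signed
    if ts is None:
        ts = str(time.time())
    # The hash is computed over ts + private key + public key, in that order
    code = ts + private_key + public_key
    md5hash = hashlib.md5(code.encode('utf-8')).hexdigest()
    return {'apikey': public_key, 'ts': ts, 'hash': md5hash}


# Placeholder keys for illustration only, not real credentials
params = build_auth_params('my-public-key', 'my-private-key')
params['name'] = 'Iron Man'

# Preview the query string that would be appended to the endpoint URL
query = urlencode(params)
print('https://gateway.marvel.com/v1/public/characters?' + query)
```

Note that the private key never appears in the parameters themselves; only the hash derived from it is sent.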

<div class="alert alert-info">Write the code to make your request. Remember that the body of your request should include the values for <i>apikey</i>, <i>ts</i> and <i>hash</i>, as well as that of the <i>name</i> parameter. Store the response in <b>json</b> format in a new variable called <i>response</i>.</div>

In [8]:
import requests as rq

response = rq.get(url, params={'apikey': publickey, 'ts': ts, 'hash': md5hash, 'name': name}).json()

You can run the following cell to check that you did everything right. You can run it as many times as you want, as long as you **don't modify it**.

In [9]:
from nose.tools import assert_true, assert_is_instance, assert_equal

# check that you created the variable response
assert_true(response)

# check that response is of correct type
assert_is_instance(response, dict)

# check that response has correct content
assert_equal(response['data']['results'][0]['name'].upper(), name.upper())

Take your time to investigate how the response dictionary is organized. Take a look at the different keys and try to understand what they each refer to.
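
One way to explore a nested response is to list its keys level by level. The sketch below does this on a small hand-made dictionary. Its values are invented, and only the nesting (`data`, `results`, `urls`) mirrors what this notebook accesses. The `type` key on each url entry and the helper names `describe` and `find_url` are assumptions for illustration; check your own response before relying on them.

```python
# A hand-made stand-in for the API response; the values are invented and
# only the nesting mirrors what this notebook accesses
sample_response = {
    'code': 200,
    'data': {
        'count': 1,
        'results': [
            {
                'name': 'Iron Man',
                'urls': [
                    {'type': 'detail', 'url': 'https://example.com/detail'},
                    {'type': 'wiki', 'url': 'https://example.com/wiki'},
                ],
            }
        ],
    },
}


def describe(obj, depth=0, max_depth=3):
    """Return indented lines listing the keys of nested dicts and lists."""
    lines = []
    if depth > max_depth:
        return lines
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append('  ' * depth + str(key) + ': ' + type(value).__name__)
            lines.extend(describe(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # Only the first element is inspected, assuming the rest look alike
        lines.extend(describe(obj[0], depth, max_depth))
    return lines


def find_url(character, url_type):
    """Return the url whose 'type' matches, or None when it is missing."""
    for entry in character.get('urls', []):
        if entry.get('type') == url_type:
            return entry.get('url')
    return None


print('\n'.join(describe(sample_response)))
print(find_url(sample_response['data']['results'][0], 'wiki'))
```

Looking an entry up by a label rather than by its position in the list is a little more robust if the order of the entries ever differs between characters.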

<div class="alert alert-info">Write the code to identify the full url address for your character's <i>wiki</i>. Store it in a new variable called <i>url_wiki</i>.</div>

In [10]:
url_wiki = response['data']['results'][0]['urls'][1]['url']

## Poking around

Copy-paste the url above to your browser and take a look at the webpage. It contains a general profile for your character of choice, as well as a bio and some additional information. By default, this page shows a general ```OVERVIEW```, an ```IN COMICS PROFILE``` and a more specific ```IN COMICS FULL REPORT```.

<img src='https://www.dropbox.com/s/5wxjoqwtumd2c7k/thing.png?raw=1' width=1000>

If your character happens to be so popular that he or she or it has been portrayed in a movie, you'll also get access to the ```ON SCREEN PROFILE``` and to the ```ON SCREEN FULL REPORT```. 

<img src='https://www.dropbox.com/s/ej2bno1dslpjkp9/blackwidow.png?raw=1' width=1000>

Since not all characters include these last two pieces of information, let's focus on the other elements. In particular, we are interested in retrieving part of the information contained in the ```IN COMICS FULL REPORT``` tab.

<div class="alert alert-info"><b>Exercise 1 </b>Write the code to complete function <i>get_soup</i>. This function should take a url in <i>string</i> form as input and return a <i>BeautifulSoup</i> object with the corresponding HTML code as output. The necessary libraries have already been imported for you.</div>

<div class="alert alert-warning">Make sure that the output of your function is different for different urls.</div>

In [11]:
import requests
import bs4

def get_soup(url):
    # Use the requests module imported in this cell rather than the earlier rq alias
    response = requests.get(url)
    return bs4.BeautifulSoup(response.text, 'html.parser')

The following cell runs the checks on your code. Please **don't write any code here**. Just leave it as it is.

In [12]:
# LEAVE BLANK

In [13]:
# LEAVE BLANK

Now that we can retrieve the code, let's identify the tag that corresponds to the ```IN COMICS FULL REPORT``` and retrieve the link.

<div class="alert alert-info"><b>Exercise 2 </b>Write the code to complete function <i>get_comics_report</i>. This function should take a BeautifulSoup object with the code for a website as input and return the <b>full url</b> for the <i>IN COMICS FULL REPORT</i> tab in string form as output.</div>

In [14]:
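def get_comics_report(soup):
    base = 'https://www.marvel.com'
    # Look through the profile tabs for the one labelled 'In Comics Full Report'
    for tab in soup.find_all('li', {'class': 'masthead__tabs__li'}):
        if 'In Comics Full Report' in tab.text:
            return base + tab.find('a')['href']
    # No such tab on this page
    return None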
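
As an alternative to plain string concatenation, the standard library's `urljoin` can turn a link's `href` into a full url. The sketch below is only an illustration: both `href` values are hypothetical examples, not links taken from the actual website.

```python
from urllib.parse import urljoin

base = 'https://www.marvel.com'

# Hypothetical href values of the kind a tab link might contain
relative_href = '/characters/iron-man-tony-stark/in-comics/profile'
absolute_href = 'https://www.marvel.com/characters/venom-eddie-brock'

# A relative path is resolved against the base
full_from_relative = urljoin(base, relative_href)
# An href that is already absolute is returned unchanged
full_from_absolute = urljoin(base, absolute_href)

print(full_from_relative)
print(full_from_absolute)
```

The second case is where the two approaches differ: concatenation would prepend the base a second time, while `urljoin` leaves an absolute link as it is.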

The following cell runs the checks on your code. Please **don't write any code here**. Just leave it as it is.

In [15]:
# LEAVE BLANK

In [16]:
# LEAVE BLANK

When accessing the ```IN COMICS FULL REPORT``` for a character, we get information about different attributes, including the height, the weight, the gender, etc. Let's write the code to extract this information.

<div class="alert alert-info"><b>Exercise 3 </b>Write the code to complete functions <i>get_height</i>, <i>get_weight</i>, <i>get_gender</i>, <i>get_eyecolor</i> and <i>get_haircolor</i>. These functions should all take a BeautifulSoup object with the code for a website as input and return, respectively, the height (in <b>float</b> form), the weight (in <b>float</b> form), the gender (in <b>str</b> form), the eyecolor (in <b>str</b> form) and the haircolor (in <b>str</b> form) as output. For the height and the weight, only the number should be returned, nothing else. If no information is provided for any of these items, then the corresponding function should return a <b>None</b>.</div>

<div class="alert alert-warning">When information about different characters is present, return only that corresponding to the first. </div>

<div class="alert alert-warning">For the case of the height, when the value is given in feet and inches, e.g. 6'6", return only the first digit, i.e. 6.</div>

<div class="alert alert-warning">For the case of the eye and haircolors, return the whole string of information, e.g. "White (formerly black)".</div>
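
The parsing rules in the warnings above can be tried out on plain strings before touching any HTML. The sketch below assumes that the text of a stats block is a label followed directly by its value (for instance `height6'6"`). That format, the sample strings and the helper names `first_number` and `strip_label` are assumptions for illustration. Note also that it returns the first run of digits, which matches "the first digit" only when the leading number has a single digit.

```python
import re


def first_number(text):
    """Return the first number in text as a float, or None if there is none."""
    # Match a run of digits, allowing thousands separators such as 1,400
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    return float(match.group().replace(',', ''))


def strip_label(text, label):
    """Remove a leading label such as 'eyes' plus surrounding whitespace."""
    cleaned = text.strip()
    if cleaned.lower().startswith(label):
        cleaned = cleaned[len(label):].strip()
    # An empty remainder means no value was provided
    return cleaned or None


print(first_number("height6'6\""))
print(first_number('weight225 lbs'))
print(first_number('heightVariable'))
print(strip_label('eyesWhite (formerly black)', 'eyes'))
```

Returning `None` explicitly when nothing matches avoids calling `float` on an empty string, which would raise a `ValueError`.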

In [17]:
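def get_height(soup):
    # Check every stats block instead of giving up at the first one without 'height'
    for div in soup.find_all('div', {'class': 'bioheader__stats'}):
        if 'height' in div.text:
            # Return the first digit found (the feet in a value such as 6'6")
            for char in div.text:
                if char.isdigit():
                    return float(char)
            # The label is present but no numeric value is given
            return None
    # No height block on the page
    return None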

In [18]:
def get_weight(soup):
    for div in soup.find_all('div', {'class': 'bioheader__stats'}):
        if 'weight' in div.text:
            # Keep only the first value when several are separated by ';'
            digits = [c for c in div.text.split(';')[0] if c.isdigit()]
            # float('') would raise an error, so return None when there are no digits
            return float(''.join(digits)) if digits else None
    # No weight block on the page
    return None

In [19]:
def get_gender(soup):
    list3 = []
    for w in (soup.find_all('div', {'class':'bioheader__stats'})):
        if 'gender' in w.text:
            list3.append(w.text)
    if len(list3) > 0:
        return list3[0].replace('gender','')
    else:
        return None

In [20]:
def get_eyecolor(soup):
    list4 = []
    for u in (soup.find_all('div', {'class': 'bioheader__stats'})):
        if 'eyes' in u.text:
            list4.append(u.text)
    if len(list4) > 0:
        return list4[0].replace('eyes','')
    else:
        return None 

In [21]:
def get_haircolor(soup):
    list5 = []
    for n in (soup.find_all('div', {'class': 'bioheader__stats'})):
        if 'hair' in n.text:
            list5.append(n.text)
    if len(list5) > 0:
        return list5[0].replace('hair','')
    else:
        return None 

The following cells run the checks on your code. Please **don't write any code here**. Just leave them as they are.

In [22]:
# LEAVE BLANK

In [23]:
# LEAVE BLANK

In [24]:
# LEAVE BLANK

In [25]:
# LEAVE BLANK

In [26]:
# LEAVE BLANK

In [27]:
# LEAVE BLANK

In [28]:
# LEAVE BLANK

In [29]:
# LEAVE BLANK

In [30]:
# LEAVE BLANK

In [31]:
# LEAVE BLANK

In addition to the above, let's also extract information regarding the group affiliation for a given character.

<div class="alert alert-info"><b>Exercise 4 </b>Write the code to complete functions <i>get_group_affiliation</i> and <i>get_groups</i>. The first function should take a BeautifulSoup object with the code for a character report website as input and return a <b>list</b> with all the different affiliations as output. In cases where a single affiliation is provided, the function should return a list with a single element. In cases where no affiliation is provided, the function should return a <b>None</b>. The second function should take a BeautifulSoup object with the code for a character report website as input and return an <b>int</b> with the number of groups the given character is affiliated to. This function <b>should call</b> your previous function and extract the number from the returned list. If the value returned by <i>get_group_affiliation</i> is a <b>None</b>, <i>get_groups</i> should return a 0.</div>

In [32]:
def get_group_affiliation(soup):
    group_affiliations = []
    for g in soup.find_all('li', {'class':'railBioInfo__Item'}):
        if 'Group' in g.text:
            for t in g.find_all('li'):
                group_affiliations.append(t.text)
            return group_affiliations

In [33]:
def get_groups(soup):
    # Call the previous function once and reuse its result
    affiliations = get_group_affiliation(soup)
    if affiliations is None:
        return 0
    return len(affiliations)

The following cells run the checks on your code. Please **don't write any code here**. Just leave them as they are.

In [34]:
# LEAVE BLANK

In [35]:
# LEAVE BLANK

In [36]:
# LEAVE BLANK

In [37]:
# LEAVE BLANK

In [38]:
# LEAVE BLANK

Well done! 

You now have the code to make a request to the API for a given character, to extract the link to the wiki webpage for that character, to request the HTML code underlying that webpage, to identify the link to the website that stores the IN COMICS FULL REPORT and finally to extract different character attributes from the website.

Let's now take a look at the different characters in a given series.

<div class="alert alert-info">Create a new variable called <i>series_title</i> to store the name of a comics series of your choosing in <b>string</b> form. </div>

In [39]:
series_title = 'Venom'

Run the following cell to check that you didn't forget to create this variable. You can run it as many times as you want, as long as you **don't modify it**.

In [40]:
from nose.tools import assert_true, assert_is_instance, assert_equal

# check that you created the variable series_title
assert_true(series_title)

# check that series_title is of correct type
assert_is_instance(series_title, str)

The first thing we need in order to extract information about our series of choice is its ID. For that purpose, we are first going to make a generic get request to the ```series``` endpoint using the title defined above.

<div class="alert alert-info"><b>Exercise 5 </b>Write the code to make a get request to the <i>series</i> endpoint to extract information about the given series. Store your response in <b>json</b> format in a new variable called <i>response_series</i>.</div>

In [41]:
endpoint_series = '/v1/public/series'
url_series = base_url + endpoint_series
response_series = rq.get(url_series, params={'apikey': publickey, 'ts': ts, 'hash': md5hash, 'title': series_title}).json()

The following cells run the checks on your code. Please **don't write any code here**. Just leave them as they are.

In [42]:
# LEAVE BLANK

In [43]:
# LEAVE BLANK

<div class="alert alert-info">Create a new variable called <i>series_id</i> to store the id of the series of your choosing in <b>int</b> form.</div>

In [44]:
series_id = 13911
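
Rather than copying the id by hand, it could also be read from the response. The sketch below works on a hand-made stand-in for `response_series`: the ids and titles are invented, and the `id` and `title` keys as well as the helper names are assumptions for illustration that should be checked against your actual response.

```python
# Hand-made stand-in for response_series; ids and titles are invented
sample_series = {
    'data': {
        'results': [
            {'id': 101, 'title': 'Venom (2003 - 2004)'},
            {'id': 102, 'title': 'Venom (2011 - 2013)'},
        ]
    }
}


def list_series(response):
    """Return (id, title) pairs for every series in the response."""
    return [(item['id'], item['title']) for item in response['data']['results']]


def pick_series_id(response, index=0):
    """Return the id at the given position as an int, or None if absent."""
    results = response['data']['results']
    if index >= len(results):
        return None
    return int(results[index]['id'])


# Several series can share a title, so list them before choosing one
for series_id_candidate, title in list_series(sample_series):
    print(series_id_candidate, title)

print(pick_series_id(sample_series))
```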

Run the following cell to check that you didn't forget to create this variable. You can run it as many times as you want, as long as you **don't modify it**.

In [45]:
from nose.tools import assert_true, assert_is_instance, assert_equal

# check that you created the variable series_id
assert_true(series_id)

# check that series_id is of correct type
assert_is_instance(series_id, int)

Now that you have identified the id, let's extract information about the different characters that appear in your chosen series.

<div class="alert alert-info"><b>Exercise 6 </b>Write the code to make a get request to fetch the list of characters which appear in your chosen series. Store your response in <b>json</b> format in a new variable called <i>response_characters</i>.</div>

<div class="alert alert-warning">Note that Marvel's API returns information in batches of 100 characters at most. Make a single request, so that if the number of characters in your chosen series is larger than 100 you only retrieve the first 100.</div>

In [46]:
endpoint1 = endpoint_series + '/' + str(series_id) + '/' + 'characters'
url_characters_in_series = base_url + endpoint1

response_characters = rq.get(url_characters_in_series, params={'apikey': publickey, 'ts': ts, 'hash': md5hash, 'limit': 100}).json()
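
This exercise asks for a single request, but it may help to see how batches of at most 100 items would be covered if more were needed. The sketch below only computes the parameters for each batch and makes no requests. The `offset` parameter and the helper name `page_params` are assumptions for illustration; check the API documentation before relying on them.

```python
def page_params(total, limit=100):
    """Return one {'limit', 'offset'} dict per batch needed to cover total items."""
    # Offsets advance in steps of limit until total is covered
    return [{'limit': limit, 'offset': offset} for offset in range(0, total, limit)]


# A hypothetical series with 250 characters would need three batches
pages = page_params(250)
for page in pages:
    print(page)
```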

The following cells run the checks on your code. Please **don't write any code here**. Just leave them as they are.

In [47]:
# LEAVE BLANK

In [48]:
# LEAVE BLANK

In order to retrieve information about each of the characters separately, let's store each URI separately.

<div class="alert alert-info"><b>Exercise 7 </b>Write the code to identify the names and the wiki URLs corresponding to each character in your series. Store the results in <b>list</b> form in new variables called <i>names</i> and <i>url_wikis</i>, respectively. The lists have already been initialized for you.</div>

In [49]:
names = []
url_wikis = []

for i in response_characters['data']['results']:
    names.append(i['name'])
    url_wikis.append(i['urls'][1]['url'].split('?')[0])
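
The `split('?')[0]` call above drops the query string from each url. The same step can be sketched with the standard library's `urlsplit`, which parses the url into its parts instead of cutting the string. The example url and the helper name `strip_query` are assumptions for illustration.

```python
from urllib.parse import urlsplit


def strip_query(url):
    """Return url without its query string."""
    # Replace the query component with an empty string and rebuild the url
    return urlsplit(url)._replace(query='').geturl()


# Hypothetical wiki url carrying tracking parameters
example = 'http://marvel.com/universe/Venom?utm_campaign=apiRef&utm_source=abc'
print(strip_query(example))
print(example.split('?')[0])
```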

The following cells run the checks on your code. Please **don't write any code here**. Just leave them as they are.

In [50]:
# LEAVE BLANK

In [51]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 8 </b>Write the code to retrieve the height, weight, gender, eyecolor, haircolor and groups (the number, not the list) of each of the characters in your chosen series. Store these values in lists called <i>height</i>, <i>weight</i>, <i>gender</i>, <i>eyecolor</i>, <i>haircolor</i> and <i>groups</i>, respectively. For those characters for which no IN COMICS FULL REPORT is available, use a <b>None</b>.</div>

<div class="alert alert-warning">Note that you may need to re-define some of the functions defined above in order to retrieve all the information properly. Make sure to modify this code in the cells above, don't redefine any function here. All your functions should be as general as possible.</div>

<div class="alert alert-warning">You'll notice that sometimes, even if the IN COMICS FULL REPORT tab exists, you are not able to retrieve all the information. That's fine, as long as your code is correct.</div>

In [52]:
height = []
weight = []
gender = []
eyecolor = []
haircolor = []
groups = []

for wiki_url in url_wikis:
    # Request the wiki page once and reuse the parsed result
    wiki_soup = get_soup(wiki_url)
    report_url = get_comics_report(wiki_soup)
    # Use the full report when it exists; otherwise fall back to the wiki page itself
    page = wiki_soup if report_url is None else get_soup(report_url)
    height.append(get_height(page))
    weight.append(get_weight(page))
    gender.append(get_gender(page))
    eyecolor.append(get_eyecolor(page))
    haircolor.append(get_haircolor(page))
    groups.append(get_groups(page))

The following cells run the checks on your code. Please **don't write any code here**. Just leave them as they are.

In [53]:
# LEAVE BLANK

In [54]:
# LEAVE BLANK

In [55]:
# LEAVE BLANK

In [56]:
# LEAVE BLANK

In [57]:
# LEAVE BLANK

In [58]:
# LEAVE BLANK

In [59]:
# LEAVE BLANK

In [60]:
# LEAVE BLANK

In [61]:
# LEAVE BLANK

In [62]:
# LEAVE BLANK

In [63]:
# LEAVE BLANK

In [64]:
# LEAVE BLANK

In [65]:
# LEAVE BLANK

In [66]:
# LEAVE BLANK

In [67]:
# LEAVE BLANK

In [68]:
# LEAVE BLANK

In [69]:
# LEAVE BLANK

In [70]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 9 </b>Write the code to save the data above to a DataFrame called <i>df</i>. The names of the different columns should be equal to those of the lists above: <i>names</i>, <i>height</i>, <i>weight</i>, <i>gender</i>, <i>eyecolor</i>, <i>haircolor</i> and <i>groups</i>.</div>

In [71]:
import pandas as pd 

data = {'names': names, 'height': height, 'weight': weight, 'gender': gender, 
        'eyecolor': eyecolor, 'haircolor': haircolor, 'groups': groups}

df = pd.DataFrame(data=data)

The following cells run additional checks on your code. Please **don't write any code here**. Just leave them as they are.

In [72]:
# LEAVE BLANK

In [73]:
# LEAVE BLANK

Information about the height and weight of each character is provided in feet and inches and in pounds, respectively.

<div class="alert alert-info"><b>Exercise 10 </b>Write the code to create two new columns, <i>height (cm)</i> and <i>weight (kg)</i>, that store the information regarding the height and weight of each character in cm and kg, respectively. Assume that 1 lb = 0.453592 kg and 1 inch = 2.54 cm. When you are done creating the new columns, make sure to delete the existing ones.</div>
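
The conversion factors above can be checked on single values first. The sketch below assumes that the stored height is a whole number of feet, as returned by <i>get_height</i> for a value such as 6'6", so it goes through inches (12 per foot) before applying 2.54 cm per inch. The function and constant names are assumptions for illustration.

```python
INCH_TO_CM = 2.54
LB_TO_KG = 0.453592
INCHES_PER_FOOT = 12


def feet_to_cm(feet):
    """Convert a height in whole feet to centimetres; None stays None."""
    if feet is None:
        return None
    # feet -> inches -> centimetres
    return feet * INCHES_PER_FOOT * INCH_TO_CM


def lbs_to_kg(pounds):
    """Convert a weight in pounds to kilograms; None stays None."""
    if pounds is None:
        return None
    return pounds * LB_TO_KG


print(feet_to_cm(6.0))
print(lbs_to_kg(225.0))
print(feet_to_cm(None))
```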

In [74]:
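# get_height returns the feet component, so convert feet -> inches -> cm
df['height (cm)'] = (df['height'] * 12 * 2.54)
# Convert pounds to kilograms
df['weight (kg)'] = (df['weight'] * 0.453592)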

In [76]:
df.drop(['height', 'weight'], axis = 1, inplace = True)

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK