# Marvel API and web scraping project

In this project, we use the Marvel API to retrieve data about one character, and then about all the characters in a series of your choice. The notebook combines API calls with web scraping.

## Creating an account

First, if you have not done so yet, create an account to receive your keys at https://developer.marvel.com/

Then, you can get familiar with the documentation here: https://developer.marvel.com/docs

## Getting started 

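First, store your public and private keys in variables. The values must be quoted strings; leaving only a comment after `=` raises a SyntaxError.

In [1]:
publickey = "YOUR_PUBLIC_KEY"    # replace with your public key
privatekey = "YOUR_PRIVATE_KEY"  # replace with your private key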

## Making your first request

In addition to the query parameters for each endpoint, the Marvel API expects you to sign your requests. Every request must include values for three parameters:

- **apikey**: your public key.
- **ts**: a timestamp in string form, or any other long string that changes from request to request.
- **hash**: an MD5 hash of ts+privatekey+publickey.
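
The three parameters can also be assembled by one small helper, so that each request gets its own timestamp and hash. The following is a minimal sketch, not part of the original notebook: the function name `build_auth_params` and the placeholder keys are assumptions made for illustration.

```python
import hashlib
import time


def build_auth_params(public_key, private_key, ts=None):
    """Return the apikey, ts and hash parameters for one request."""
    # Use a fresh timestamp unless one is supplied (handy for testing).
    if ts is None:
        ts = str(time.time())
    # The hash is the MD5 hex digest of ts + private key + public key, in that order.
    digest = hashlib.md5((ts + private_key + public_key).encode('utf-8')).hexdigest()
    return {'apikey': public_key, 'ts': ts, 'hash': digest}


# Placeholder keys for illustration only; substitute your own.
params = build_auth_params('my-public-key', 'my-private-key', ts='1')
print(params['hash'])
```

The returned dictionary can be merged with endpoint-specific parameters such as `name` before the request is sent.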

## Generating a timestamp
You can obtain a timestamp for each request using the time library. This library has a function called time that returns the current time as a number of seconds since the epoch. Convert this output to a string to use it as a timestamp.

In [None]:
import time
ts = str(time.time())

## Generating an MD5 hash
To obtain the hash, use the hashlib library. The hash is computed over the concatenation ts+privatekey+publickey, in that order. Run the cell below to obtain it. Notice that the output is a 32-character hexadecimal string.

In [None]:
import hashlib
code = ts+privatekey+publickey
md5hash = hashlib.md5(code.encode('utf-8')).hexdigest()

### Choose your character name

In [None]:
character_name = "Loki"

### Send your query

In [None]:
import requests

auth_url = 'https://gateway.marvel.com/v1/public/characters'
# query parameters, including the three authentication values
params = {'name': character_name,
          'apikey': publickey,
          'ts': ts,
          'hash': md5hash}

# send the GET request
response = requests.get(auth_url, params=params)

In [None]:
response.json()['data']['results'][0]['urls']

### URL wiki
Write the code to identify the full URL of your character's wiki page and strip the tracking query string so that only the regular URL remains. Store it in a new variable called url_wiki.

In [None]:
# this extracts the wiki url (here, the second entry in 'urls')
url_wiki = response.json()['data']['results'][0]['urls'][1]['url']

# this splits the url at the tracking parameter and keeps the part before it
url_wiki = url_wiki.split('?utm_campaign=')[0]

url_wiki
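
The cell above relies on the wiki link being the second entry in `urls`. A more defensive variant is sketched below, assuming each entry carries a `type` field next to `url` and that the wiki entry is labelled `'wiki'`. The helper name `find_wiki_url` and the sample data are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def find_wiki_url(urls):
    """Return the 'wiki' URL without its query string, or None if absent."""
    for entry in urls:
        if entry.get('type') == 'wiki':
            parts = urlsplit(entry['url'])
            # Rebuild the URL with an empty query and fragment.
            return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
    return None


# Hypothetical sample shaped like the 'urls' list in the API response.
sample_urls = [
    {'type': 'detail', 'url': 'http://example.com/characters/1/loki?utm_campaign=x'},
    {'type': 'wiki', 'url': 'http://example.com/wiki/loki?utm_campaign=x'},
]
print(find_wiki_url(sample_urls))  # http://example.com/wiki/loki
```

Using `urlsplit` removes any query string, not only one that starts with `utm_campaign`.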

### Poking around

Copy and paste the URL you retrieved in the code above into your browser and take a look at the webpage. It contains a general profile for your character of choice, as well as a bio and some additional information. By default, this page shows a general OVERVIEW, an IN COMICS PROFILE and a more specific IN COMICS FULL REPORT.

Not all characters include these last two tabs, so the code has to handle their absence. We are interested in retrieving part of the information contained in the IN COMICS FULL REPORT tab.

We start by retrieving the page HTML soup with *beautifulsoup*

In [None]:
from bs4 import BeautifulSoup as bs

def get_soup(url):
    response = requests.get(url)
    return bs(response.text, 'html.parser')

In [None]:
soup_wiki = get_soup(url_wiki)

We now write a function called get_comics_full_report. It takes a BeautifulSoup object containing the parsed HTML of a character's wiki page and returns the full URL of the IN COMICS FULL REPORT tab as a string. If no such tab exists, the function returns None.

In [None]:
base_url = 'https://www.marvel.com'

def get_comics_full_report(bsoup):
    # look through every tab link in the page masthead
    for link in bsoup.find_all('a', {'class': 'masthead__tabs__link'}):
        span = link.find('span', {'class': 'masthead__tabs__link-text'})
        if span is not None and span.text.strip() == 'In Comics Full Report':
            # the href is relative, so prepend the site root
            return base_url + link.get('href')
    # no matching tab was found
    return None

get_comics_full_report(soup_wiki)

When accessing the IN COMICS FULL REPORT for a character, we get information about different attributes, including the height, the weight, the gender, etc. Let's write the code to extract this information.

### weight and height

Assume that 1 lb = 0.453592 kg, 1 foot = 30.48 cm and 1 inch = 2.54 cm. The functions return the height in centimetres and the weight in kilograms. If the information is missing, they return None.
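
As a worked example of the conversion factors above, the arithmetic can be written as two small functions. This is only a sketch: the function names are hypothetical and the input values are made up for illustration.

```python
LB_TO_KG = 0.453592   # kilograms per pound
FOOT_TO_CM = 30.48    # centimetres per foot
INCH_TO_CM = 2.54     # centimetres per inch


def feet_inches_to_cm(feet, inches=0):
    """Convert a height given in feet and inches to centimetres."""
    return feet * FOOT_TO_CM + inches * INCH_TO_CM


def pounds_to_kg(pounds):
    """Convert a weight given in pounds to kilograms."""
    return pounds * LB_TO_KG


print(round(feet_inches_to_cm(6, 4), 2))  # 193.04
print(round(pounds_to_kg(525), 2))        # 238.14
```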

In [None]:
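url_report = get_comics_full_report(soup_wiki)

In [None]:
# download and parse the full report page (skipped if the tab is missing)
full_soup = get_soup(url_report) if url_report else None
full_soup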

In [None]:
import re

lbs = 0.453592 # kg
foot = 30.48 # cm
inch = 2.54 # cm
types = ['feet', 'inches', 'lbs']

def get_height(soup):
    """Return the height in cm, or None if it cannot be parsed."""
    try:
        height = soup.find_all(class_='bioheader__stat')[0].text.split(',')[0].split()[-1]
        if height == 'None':
            return None
        # note: the units are not checked here; the text is assumed to be in feet and inches
        numbers = re.findall(r'\d+', height)
        if len(numbers) > 1:
            # feet first, then inches
            return (int(numbers[0]) * foot) + (int(numbers[1]) * inch)
        # only feet are given (an empty list raises IndexError, handled below)
        return int(numbers[0]) * foot
    except (IndexError, ValueError, AttributeError):
        return None

def get_weight(soup):
    """Return the weight in kg, or None if it cannot be parsed."""
    try:
        # last two tokens: a number followed by its unit
        weight = soup.find_all(class_='bioheader__stat')[1].text.split(',')[0].split()[-2:]
        # drop the unit's final character before checking it
        if weight[1][:-1] not in types:
            return None
        return lbs * float(weight[0])
    except (IndexError, ValueError, AttributeError):
        return None

### get the gender, hair colour and eye colour

In [None]:
def get_bio_stat(soup, label):
    """Return the bioheader stat whose label matches, or None if absent."""
    for stat in soup.find_all('div', {'class': 'bioheader__stats'}):
        label_tag = stat.find('p', {'class': 'bioheader__label'})
        value_tag = stat.find('p', {'class': 'bioheader__stat'})
        # skip blocks that lack either the label or the value
        if label_tag is not None and value_tag is not None and label_tag.text == label:
            return str(value_tag.text)
    return None

def get_gender(soup):
    return get_bio_stat(soup, 'gender')

def get_eyes(soup):
    return get_bio_stat(soup, 'eyes')

def get_hair(soup):
    return get_bio_stat(soup, 'hair')

### get the place of origin

In [None]:
def get_place_of_origin(soup):
    """Return the place of origin as a string, or None if it is not listed."""
    for stat in soup.find_all('li', {'class': 'railBioInfo__Item'}):
        label = stat.find('p', {'class': 'railBioInfoItem__label'})
        if label is not None and label.text == 'Place of Origin':
            links = stat.find('ul', {'class': 'railBioLinks'})
            return str(links.text) if links is not None else None
    return None


### get the list of powers
We will now build a list that records whether a character has each of the following powers: *Flight, Hypnosis, Telepathy, Teleportation*
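
The check itself is a substring test against the text of the powers section. The sketch below isolates that step on a plain string. The name `power_flags` and the sample text are hypothetical; the result uses True for a listed power and None otherwise.

```python
POWERS = ['Flight', 'Hypnosis', 'Telepathy', 'Teleportation']


def power_flags(powers_text):
    """Return one flag per entry in POWERS: True if listed, None otherwise."""
    return [True if power in powers_text else None for power in POWERS]


# Made-up text standing in for the scraped powers section.
print(power_flags('Flight Telepathy Shapeshifting'))  # [True, None, True, None]
```

The test is case-sensitive and matches substrings, so it depends on the page spelling each power exactly as it appears in `POWERS`.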

In [None]:
def get_powers(soup):
    # one entry per power, in this order; True if listed, None otherwise
    power_names = ['Flight', 'Hypnosis', 'Telepathy', 'Teleportation']
    powers = [None, None, None, None]
    try:
        for stat in soup.find_all('li', {'class': 'railBioInfo__Item'}):
            if stat.find('p', {'class': 'railBioInfoItem__label'}).text == 'Powers':
                powers_list = stat.find('ul', {'class': 'railBioLinks'}).get_text()
                for i, power in enumerate(power_names):
                    if power in powers_list:
                        powers[i] = True
        return powers
    except AttributeError:
        # keep the four-element shape so callers can still index the result
        return [None, None, None, None]

## Extracting the data

Let's choose a series from this page: https://www.marvel.com/comics/series

In [None]:
series_name = 'Avengers (2018 - Present)'

Write the code to make a GET request to the series/{seriesId}/characters endpoint to fetch the list of the first 100 characters that appear in your chosen series. Store the response in a new variable called response_characters.

*Note that Marvel's API returns information in batches of 100 characters at most. Make a single request, so that if the number of characters in your chosen series is larger than 100 you only retrieve the first 100. If the number of characters in your chosen series is smaller than 100, you should retrieve them all in a single request.*
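
To see whether a single request captured every character, the counters returned alongside the results can be compared. This is a sketch under the assumption that the `data` object of the response carries `offset`, `count` and `total` fields. The helper name `has_more_results` and the sample dictionaries are hypothetical.

```python
def has_more_results(data):
    """Return True if the series has more results than this response holds."""
    return data['offset'] + data['count'] < data['total']


# Hypothetical 'data' containers, shaped like response.json()['data'].
print(has_more_results({'offset': 0, 'limit': 100, 'total': 250, 'count': 100}))  # True
print(has_more_results({'offset': 0, 'limit': 100, 'total': 42, 'count': 42}))    # False
```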

In [None]:
auth_url = 'https://www.marvel.com/comics/series'
soup_id = get_soup(auth_url)
ids = soup_id.find('div',{'class': 'JCAZList-list'})

# find the link whose text matches the series name and read the id from its href
series_id = None
for indi_id in ids.find_all('a'):
    if indi_id.text == series_name:
        href = str(indi_id.get('href'))
        hrefs = href.split('/')
        series_id = hrefs[3]
        break

if series_id is None:
    raise ValueError('series not found: ' + series_name)

series_url = 'https://gateway.marvel.com/v1/public/series/' + series_id + '/characters'

# a fresh timestamp and hash for this request
ts = str(time.time())
code = ts+privatekey+publickey
md5hash = hashlib.md5(code.encode('utf-8')).hexdigest()
params = {'apikey': publickey,
          'ts': ts,
          'hash': md5hash,
          'limit': 100}

response_characters = requests.get(series_url, params=params)
response_characters

In what follows, you will retrieve information about each of the characters in your series above separately. For that purpose, let's first identify their names and wiki urls.

In [None]:
names = []
url_wikis = []

for character in response_characters.json()['data']['results']:
    names.append(character['name'])
    url_wikis.append(character['urls'][1]['url'])

We now write the code to retrieve the name, height, weight, gender, eye color, hair color, place of origin and powers of each of the characters in your chosen series. Store this information in a DataFrame object called marvel. Set the names of the columns to: name, height, weight, gender, eyes, hair, place_of_origin, flight, hypnosis, telepathy and teleportation, respectively. Don't define any index when defining your DataFrame.
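
One way to keep the eleven columns aligned is to build one dictionary per character, with every column present and missing values left as None. The sketch below uses only the standard library. The names `COLUMNS` and `make_row` and the sample values are hypothetical.

```python
COLUMNS = ['name', 'height', 'weight', 'gender', 'eyes', 'hair',
           'place_of_origin', 'flight', 'hypnosis', 'telepathy', 'teleportation']


def make_row(name, **values):
    """Return one row with every column present, defaulting to None."""
    row = dict.fromkeys(COLUMNS)
    row['name'] = name
    for key, value in values.items():
        # Reject anything that is not one of the expected columns.
        if key not in row:
            raise KeyError(key)
        row[key] = value
    return row


rows = [
    make_row('Character A', height=193.04, flight=True),  # made-up values
    make_row('Character B'),                              # no report found
]
print(rows[1])
```

A list of such dictionaries can be passed to `pd.DataFrame(rows, columns=COLUMNS)` as an alternative to maintaining one list per column.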

In [None]:
import pandas as pd

height = []
weight = []
gender = []
eyes = []
hair = []
place_of_origin = []
flight = []
hypnosis = []
telepathy = []
teleportation = []

for url_wiki in url_wikis:
    soup_wiki = get_soup(url_wiki)
    # look up the report url once and reuse it
    url_report = get_comics_full_report(soup_wiki)
    if url_report:
        soup_report = get_soup(url_report)

        height.append(get_height(soup_report))
        weight.append(get_weight(soup_report))
        gender.append(get_gender(soup_report))
        eyes.append(get_eyes(soup_report))
        hair.append(get_hair(soup_report))
        place_of_origin.append(get_place_of_origin(soup_report))
        # fall back to four empty entries if the powers could not be read
        powers = get_powers(soup_report) or [None] * 4
        flight.append(powers[0])
        hypnosis.append(powers[1])
        telepathy.append(powers[2])
        teleportation.append(powers[3])

    else:
        # no full report for this character: fill every column with None
        height.append(None)
        weight.append(None)
        gender.append(None)
        eyes.append(None)
        hair.append(None)
        place_of_origin.append(None)
        flight.append(None)
        hypnosis.append(None)
        telepathy.append(None)
        teleportation.append(None)


In [None]:
marvel = pd.DataFrame({'name':names,
                   'height':height,
                   'weight':weight,
                   'gender':gender,
                   'eyes':eyes,
                   'hair':hair,
                   'place_of_origin':place_of_origin,
                   'flight':flight,
                   'hypnosis':hypnosis,
                   'telepathy':telepathy,
                   'teleportation':teleportation})
marvel