# Exercise sheet \#4
## Using APIs
### Exercise 1
Write a Python script that retrieves the list of astronauts who are currently in space. To do so, you can use the [astros.json](http://api.open-notify.org/astros.json) endpoint of the Open Notify API.

In [2]:
import requests

In [3]:
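def get_astronauts():
    # Query the Open Notify endpoint; fail early on HTTP errors or a hung connection.
    response = requests.get('http://api.open-notify.org/astros.json', timeout=10)
    response.raise_for_status()
    data = response.json()

    # Each entry of data['people'] is a dict with a 'name' key.
    return [person['name'] for person in data['people']]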

In [4]:
astronauts = get_astronauts()
print(astronauts)

['Jasmin Moghbeli', 'Andreas Mogensen', 'Satoshi Furukawa', 'Konstantin Borisov', 'Oleg Kononenko', 'Nikolai Chub', "Loral O'Hara"]
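
Separating the parsing from the request makes the parsing step easy to check offline. The following is only a sketch, assuming the response keeps the `people` / `name` layout used above; `extract_names` and the sample payload are illustrative and not part of the Open Notify API.

```python
def extract_names(payload):
    """Return astronaut names from an astros.json-style payload (hypothetical helper)."""
    # .get avoids a KeyError when the 'people' key is missing from the payload.
    return [person['name'] for person in payload.get('people', []) if 'name' in person]


# Hand-written sample in the same shape as the API response (assumed, not fetched).
sample = {
    'message': 'success',
    'number': 2,
    'people': [
        {'name': 'Jasmin Moghbeli', 'craft': 'ISS'},
        {'name': 'Andreas Mogensen', 'craft': 'ISS'},
    ],
}
print(extract_names(sample))
```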


### Exercise 2

#### Question 2.1
Write a Python program which, for each astronaut found in Exercise 1, retrieves their (English) Wikipedia article and extracts the article's summary and links.

In [5]:
import wikipedia

In [6]:
def get_astronauts_info(astronaut):
    wikipedia.set_lang('en')
    astronaut_page = wikipedia.page(astronaut)
    return {'summary':astronaut_page.summary,'links':astronaut_page.links}

astronauts_info = dict()
for i in astronauts:
    astronauts_info[i] = get_astronauts_info(i)

print(astronauts_info)

{'Jasmin Moghbeli': {'summary': 'Jasmin Moghbeli (born June 24, 1983) is an American U.S. Marine Corps test pilot and NASA astronaut. She is a graduate of the Massachusetts Institute of Technology, Naval Postgraduate School and Naval Test Pilot School. Moghbeli is mission commander for SpaceX’s Crew-7 and flight engineer aboard the International Space Station.\n\n', 'links': ['Achievement Medal', 'Aerospace engineering', 'Air Medal', 'Andreas Mogensen', 'Arizona', 'Artemis program', 'Astronaut ranks and positions', "Bachelor's degree", 'Bachelor of Science', 'Bad Nauheim', 'Baldwin Senior High School (New York)', 'Bell AH-1Z Viper', 'Bell AH-1 SuperCobra', 'CBS Broadcasting', 'CBS News', 'California', 'Canadian Space Agency', 'China', 'Commendation Medal', 'Denmark', 'Expedition 69', 'Expedition 70', 'Extravehicular activity', 'Flight engineer', 'Francisco Rubio (astronaut)', 'Germany', 'ISS year-long mission', 'Information technology', 'Instagram (identifier)', 'International Space St

#### Question 2.2
Extend your Python program so that it only keeps links that point to Wikipedia pages (in any language).

In [7]:
def get_astronauts_info2(astronaut,lang='en'):
    wikipedia.set_lang(lang)
    try:
        astronaut_page = wikipedia.page(astronaut)
    except Exception:  # fall back to English if the article cannot be loaded in `lang`
        wikipedia.set_lang('en')
        astronaut_page = wikipedia.page(astronaut)

    return {'summary':astronaut_page.summary,'links':astronaut_page.links}

astronauts_info2 = dict()
for i in astronauts:
    astronauts_info2[i] = get_astronauts_info2(i,lang='de')

print(astronauts_info2)

{'Jasmin Moghbeli': {'summary': 'Jasmin Moghbeli (* 24. Juni 1983 in Bad Nauheim) ist eine US-amerikanische Testpilotin des United States Marine Corps und Raumfahrerin der NASA.', 'links': ['1983', '24. Juni', 'Afghanistan', 'Artemis-Programm', 'Bad Nauheim', 'Baldwin (New York)', 'Bell AH-1', 'Cambridge (Massachusetts)', 'Deutschland', 'Houston', 'ISS-Expedition 69', 'Ingenieurwissenschaft', 'International Security Assistance Force', 'Internationale Raumstation', 'Iran', 'Islamische Revolution', 'Kalifornien', 'Liste der Raumfahrer nach Auswahlgruppen', 'Luft- und Raumfahrttechnik', 'Lyndon B. Johnson Space Center', 'Marine Expeditionary Unit', 'Maryland', 'Massachusetts', 'Massachusetts Institute of Technology', 'Master', 'Monterey (Kalifornien)', 'NASA', 'Naval Air Station Patuxent River', 'Naval Postgraduate School', 'New York (Bundesstaat)', 'Operation Enduring Freedom', 'Raumfahrer', 'SpaceX Crew-6', 'SpaceX Crew-7', 'Testpilot', 'Texas', 'The New Yorker', 'U.S. Naval Test Pilot 
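
The `links` attribute used above holds article titles rather than URLs, so another way to read the question is as a filter on URLs by host name. The sketch below assumes the links are available as full URLs; `is_wikipedia_url` and `keep_wikipedia_links` are hypothetical helper names, and the sample URLs are made up.

```python
from urllib.parse import urlparse


def is_wikipedia_url(url):
    """True if the host is wikipedia.org or one of its subdomains (en., de., de.m., ...)."""
    host = (urlparse(url).hostname or '').lower()
    return host == 'wikipedia.org' or host.endswith('.wikipedia.org')


def keep_wikipedia_links(urls):
    """Keep only the URLs that point to a Wikipedia page, in any language."""
    return [url for url in urls if is_wikipedia_url(url)]


# Illustrative input, not taken from a real article.
sample_urls = [
    'https://en.wikipedia.org/wiki/NASA',
    'https://de.m.wikipedia.org/wiki/NASA',
    'https://www.nasa.gov/',
    'https://notwikipedia.org/wiki/NASA',
]
print(keep_wikipedia_links(sample_urls))
```

Matching on the parsed host name rather than on a substring avoids accepting look-alike domains such as `notwikipedia.org`.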

#### Question 2.3
Extend your Python program so that it processes these links as follows:
 - it retrieves the corresponding article and then extracts its references


In [8]:
def get_astronauts_info3(astronaut,lang='en'):
    wikipedia.set_lang(lang)
    try:
        astronaut_page = wikipedia.page(astronaut)
    except Exception:  # fall back to English if the article cannot be loaded in `lang`
        wikipedia.set_lang('en')
        astronaut_page = wikipedia.page(astronaut)

    return {'summary':astronaut_page.summary,'references':astronaut_page.references}

astronauts_info3 = dict()
for i in astronauts:
    astronauts_info3[i] = get_astronauts_info3(i)

print(astronauts_info3)

{'Jasmin Moghbeli': {'summary': 'Jasmin Moghbeli (born June 24, 1983) is an American U.S. Marine Corps test pilot and NASA astronaut. She is a graduate of the Massachusetts Institute of Technology, Naval Postgraduate School and Naval Test Pilot School. Moghbeli is mission commander for SpaceX’s Crew-7 and flight engineer aboard the International Space Station.\n\n', 'references': ['https://www.nasa.gov/astronauts/biographies/jasmin-moghbeli/biography', 'http://www.cbsnews.com/news/nasa-new-astronauts/', 'https://www.newyorker.com/news/news-desk/jasmin-moghbeli-americas-badass-immigrant-astronaut', 'https://www.mitathletics.com/sports/w-baskbl/2019-20/releases/20200116nopykv', 'https://www.instagram.com/astrojaws/', 'https://web.archive.org/web/20210315005729/https://www.mitathletics.com/sports/w-baskbl/2019-20/releases/20200116nopykv', 'https://www.nasa.gov/press-release/nasa-s-spacex-crew-7-launches-to-international-space-station', 'https://samagame.com/en/news/exploring-the-journey-b
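
The question asks for the references of each linked article, not only of the astronaut's own page. A possible sketch follows, assuming a loader that returns an object with a `references` attribute (as `wikipedia.page` does in the cells above); `collect_references`, `FakePage` and `fake_loader` are hypothetical names used only to keep the example runnable offline.

```python
def collect_references(titles, load_page):
    """Map each linked title to its references; skip titles that fail to load."""
    references = {}
    for title in titles:
        try:
            page = load_page(title)
        except Exception:
            # Missing or ambiguous articles raise an error; skip them.
            continue
        references[title] = list(page.references)
    return references


class FakePage:
    """Stand-in for a page object, exposing only the attribute the sketch needs."""

    def __init__(self, references):
        self.references = references


def fake_loader(title):
    # Simulates a lookup that fails for one title.
    if title == 'Missing':
        raise LookupError(title)
    return FakePage([f'https://example.org/{title}'])


print(collect_references(['NASA', 'Missing'], fake_loader))
```

With the real library, the call would presumably look like `collect_references(astronaut_page.links, wikipedia.page)`, at the cost of one request per link.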

#### Question 2.4
Extend your Python program to compute the average number of views for each astronaut's main article.

In [9]:
import wptools

In [10]:
def get_astronauts_info4(astronaut,lang='en'):
    wikipedia.set_lang(lang)
    try:
        astronaut_page = wikipedia.page(astronaut)
    except Exception:  # fall back to English if the article cannot be loaded in `lang`
        wikipedia.set_lang('en')
        astronaut_page = wikipedia.page(astronaut)

    views = wptools.page(astronaut).get_more().data['views']
    return {'summary':astronaut_page.summary.strip(),'references':astronaut_page.references,'views':views}



astronauts_info4 = dict()
for i in astronauts:
    astronauts_info4[i] = get_astronauts_info4(i)

print(astronauts_info4)

en.wikipedia.org (querymore) Jasmin Moghbeli
Jasmin Moghbeli (en) data
{
  backlinks: <list(113)> {'pageid': 664, 'ns': 0, 'title': 'Astron...
  categories: <list(19)> Category:1983 births, Category:American p...
  contributors: 208
  files: <list(13)> File:Commons-logo.svg, File:Flag of Denmark.sv...
  languages: <list(25)> {'lang': 'ar', 'title': 'جاسمين مغبل'}, {'...
  pageid: 54253250
  requests: <list(1)> querymore
  title: Jasmin Moghbeli
  views: 1,219
}
en.wikipedia.org (querymore) Andreas Mogensen
en.wikipedia.org (querymore) 22874912 (&blcontinue=0|6961237)
en.wikipedia.org (querymore) 22874912 (&blcontinue=0|37699689)
Andreas Mogensen (en) data
{
  backlinks: <list(1362)> {'pageid': 664, 'ns': 0, 'title': 'Astro...
  categories: <list(11)> Category:1976 births, Category:21st-centu...
  contributors: 131
  files: <list(19)> File:Andreas Mogensen official portrait.jpg, F...
  languages: <list(25)> {'lang': 'ar', 'title': 'أندرياس موغنسن'},...
  pageid: 22874912
  requests: <li

{'Jasmin Moghbeli': {'summary': 'Jasmin Moghbeli (born June 24, 1983) is an American U.S. Marine Corps test pilot and NASA astronaut. She is a graduate of the Massachusetts Institute of Technology, Naval Postgraduate School and Naval Test Pilot School. Moghbeli is mission commander for SpaceX’s Crew-7 and flight engineer aboard the International Space Station.', 'references': ['https://www.nasa.gov/astronauts/biographies/jasmin-moghbeli/biography', 'http://www.cbsnews.com/news/nasa-new-astronauts/', 'https://www.newyorker.com/news/news-desk/jasmin-moghbeli-americas-badass-immigrant-astronaut', 'https://www.mitathletics.com/sports/w-baskbl/2019-20/releases/20200116nopykv', 'https://www.instagram.com/astrojaws/', 'https://web.archive.org/web/20210315005729/https://www.mitathletics.com/sports/w-baskbl/2019-20/releases/20200116nopykv', 'https://www.nasa.gov/press-release/nasa-s-spacex-crew-7-launches-to-international-space-station', 'https://samagame.com/en/news/exploring-the-journey-baldw
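
If per-day view counts are available, the average can be computed explicitly. This is a sketch under the assumption that the counts come as a mapping from date to count, with `None` for days without data; the helper name and the numbers below are made up for illustration.

```python
def average_views(daily_views):
    """Average the per-day counts, ignoring days without data (None)."""
    counts = [count for count in daily_views.values() if count is not None]
    return sum(counts) / len(counts) if counts else 0.0


# Invented sample values, not real page statistics.
sample_views = {'2023-10-01': 1200, '2023-10-02': None, '2023-10-03': 1300}
print(average_views(sample_views))  # 1250.0
```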

#### Question 2.5
Export the extracted information in a CSV file having the following fields:

`Astronaut's name ; Article's summary ; links separated by commas ; number of views`

In [11]:
import pandas as pd

In [12]:
data = []
for name, info in astronauts_info4.items():
    # Join the URLs with commas so the field matches the requested CSV layout.
    data.append([name, info['summary'], ','.join(info['references']), info['views']])

In [13]:
df = pd.DataFrame(data,columns=['Name','Summary','Links','Number of views'])
df.to_csv('astronauts.csv',index=False,sep=';')
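
The same export can be written with the standard `csv` module, which makes the `;` delimiter and the comma-joined links explicit. This is a minimal sketch; `write_astronauts_csv` and the sample record are hypothetical, and the record layout mirrors the `summary` / `references` / `views` dictionaries built above.

```python
import csv
import io


def write_astronauts_csv(info, handle):
    """Write one row per astronaut: name; summary; comma-joined links; views."""
    writer = csv.writer(handle, delimiter=';')
    writer.writerow(['Name', 'Summary', 'Links', 'Number of views'])
    for name, entry in info.items():
        writer.writerow([name, entry['summary'], ','.join(entry['references']), entry['views']])


# Invented record in the same shape as the dictionaries built earlier.
sample_info = {
    'John Smith': {
        'summary': 'An astronaut.',
        'references': ['https://example.org/a', 'https://example.org/b'],
        'views': 42,
    }
}
buffer = io.StringIO()
write_astronauts_csv(sample_info, buffer)
print(buffer.getvalue())
```

To write to disk instead of a buffer, open the file with `newline=''`, as the `csv` documentation recommends.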

### Exercise 3
#### Question 3.1 
Using Wikipedia, compile the list of UEFA Intertoto Cup winners, grouped by country.

In [48]:
uefa = wikipedia.page('List of UEFA Intertoto Cup winners').html().encode("utf-16")

In [49]:
df_club_country = pd.read_html(uefa)[1]

In [50]:
df_club_country

Unnamed: 0,Year,Nation,Winners,Score,Runners-up,Nation.1,Venue,Unnamed: 7
0,1995,France,Bordeaux,2–0,Karlsruher SC,Germany,"Wildparkstadion, Karlsruhe, Germany",
1,1995,France,Bordeaux,2–2,Karlsruher SC,Germany,"Parc Lescure, Bordeaux, France",
2,1995,Bordeaux won 4–2 on aggregate,Bordeaux won 4–2 on aggregate,Bordeaux won 4–2 on aggregate,Bordeaux won 4–2 on aggregate,Bordeaux won 4–2 on aggregate,Bordeaux won 4–2 on aggregate,Bordeaux won 4–2 on aggregate
3,1995,France,Strasbourg,1–1,Tirol Innsbruck,Austria,"Tivoli Neu, Innsbruck, Austria",
4,1995,France,Strasbourg,6–1,Tirol Innsbruck,Austria,"Stade de la Meinau, Strasbourg, France",
...,...,...,...,...,...,...,...,...
100,2007[b],Germany,Hamburger SV,4–0,Dacia Chișinău,Moldova,"HSH Nordbank Arena, Hamburg, Germany",
101,2007[b],Hamburger SV won 5–1 on aggregate,Hamburger SV won 5–1 on aggregate,Hamburger SV won 5–1 on aggregate,Hamburger SV won 5–1 on aggregate,Hamburger SV won 5–1 on aggregate,Hamburger SV won 5–1 on aggregate,Hamburger SV won 5–1 on aggregate
102,2008[c],Portugal,Braga,2–0,Sivasspor,Turkey,"4 Eylül, Sivas, Turkey",
103,2008[c],Portugal,Braga,3–0,Sivasspor,Turkey,"Estádio Municipal de Braga, Braga, Portugal",


In [51]:
pairs = dict()
for _, row in df_club_country.iterrows():
    if row['Nation'] == row['Winners']:
        continue

    if row['Winners'] in pairs:
        continue
    
    pairs[row['Winners']] = row['Nation']

In [52]:
pairs

{'Bordeaux': 'France',
 'Strasbourg': 'France',
 'Karlsruher SC': 'Germany',
 'Guingamp': 'France',
 'Silkeborg': 'Denmark',
 'Auxerre': 'France',
 'Bastia': 'France',
 'Lyon': 'France',
 'Valencia': 'Spain',
 'Werder Bremen': 'Germany',
 'Bologna': 'Italy',
 'Montpellier': 'France',
 'Juventus': 'Italy',
 'West Ham United': 'England',
 'Udinese': 'Italy',
 'Celta Vigo': 'Spain',
 'VfB Stuttgart': 'Germany',
 'Aston Villa': 'England',
 'Paris Saint-Germain': 'France',
 'Troyes': 'France',
 'Málaga': 'Spain',
 'Fulham': 'England',
 'Schalke 04': 'Germany',
 'Villarreal': 'Spain',
 'Perugia': 'Italy',
 'Lille': 'France',
 'Lens': 'France',
 'Marseille': 'France',
 'Hamburger SV': 'Germany',
 'Newcastle United': 'England',
 'Braga': 'Portugal'}

In [53]:
df = pd.read_html(uefa)[2]

In [54]:
df = df[df['Titles']>0]

In [55]:
df

Unnamed: 0,Club,Titles,Runners-up,Years won,Years runner-up
0,Villarreal,2,1,"2003, 2004",2002
1,Hamburger SV,2,1,"2005, 2007",1999
2,VfB Stuttgart,2,0,"2000, 2002",—
3,Schalke 04,2,0,"2003, 2004",—
4,Karlsruher SC,1,1,1996,1995
5,Auxerre,1,1,1997,2000
6,Bologna,1,1,1998,2002
7,Valencia,1,1,1998,2005
8,Montpellier,1,1,1999,1997
9,Lille,1,1,2004,2002


In [56]:
country_winners = dict()

for _, row in df.iterrows():
    if pairs[row['Club']] in country_winners.keys():
        country_winners[pairs[row['Club']]].append(row['Club'])
    else:
        country_winners[pairs[row['Club']]] = [row['Club']]

country_winners

{'Spain': ['Villarreal', 'Valencia', 'Celta Vigo', 'Málaga'],
 'Germany': ['Hamburger SV',
  'VfB Stuttgart',
  'Schalke 04',
  'Karlsruher SC',
  'Werder Bremen'],
 'France': ['Auxerre',
  'Montpellier',
  'Lille',
  'Bordeaux',
  'Strasbourg',
  'Guingamp',
  'Bastia',
  'Lyon',
  'Paris Saint-Germain',
  'Troyes',
  'Lens',
  'Marseille'],
 'Italy': ['Bologna', 'Juventus', 'Udinese', 'Perugia'],
 'England': ['Newcastle United', 'West Ham United', 'Aston Villa', 'Fulham'],
 'Denmark': ['Silkeborg'],
 'Portugal': ['Braga']}

#### Question 3.2
Compute the number of Intertoto winners per country.

In [57]:
country_winners_count = dict()
for k, v in country_winners.items():
    country_winners_count[k] = len(v)

country_winners_count

{'Spain': 4,
 'Germany': 5,
 'France': 12,
 'Italy': 4,
 'England': 4,
 'Denmark': 1,
 'Portugal': 1}
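
The count can also be obtained directly from a club-to-country mapping with `collections.Counter`. The sketch below uses a small hand-copied subset of the `pairs` dictionary shown earlier, so the numbers differ from the full result.

```python
from collections import Counter

# Subset of the club -> country mapping, copied by hand for illustration.
pairs_sample = {
    'Bordeaux': 'France',
    'Strasbourg': 'France',
    'Silkeborg': 'Denmark',
    'Braga': 'Portugal',
}

# Counting the values gives the number of distinct winning clubs per country.
winners_per_country = Counter(pairs_sample.values())
print(winners_per_country.most_common())
```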

### Exercise 4

Let's play with the New York Times API! Register here to get your API access token: https://developer.nytimes.com/apis

#### Question 4.1

For each astronaut, extract the NYT articles from the year 2022 that mention them. Create a dictionary that maps each astronaut's name to the number of articles published about them in 2022, links to the last 10 articles of 2022, and the keywords associated with those articles, including how many times each keyword appeared.

For example:

    {"John Smith": {"articles": ["nytimes.com/The_Best_Astronaut",
                                 "nytimes.com/Do_we_really_believe_in_space_travel"],
                   "num_articles": 2,
                   "keywords": {"space": 2, "celebrities": 1}}}
                   
<span style="color:red">ATTENTION!</span> Never publish your API access key on GitHub or any other public website! It is only to be used by you. (Otherwise, the API service could ban you :) )

In [152]:
astronauts

['Jasmin Moghbeli',
 'Andreas Mogensen',
 'Satoshi Furukawa',
 'Konstantin Borisov',
 'Oleg Kononenko',
 'Nikolai Chub',
 "Loral O'Hara"]

In [62]:
import os
key = os.environ['NYT_API_KEY']  # assumed variable name; read the key from the environment instead of hard-coding it

In [155]:
import time
nyt_astronauts = dict()
for i in astronauts:
    nyt_astronauts_info = {
        'articles':[],
        'num_articles':0,
        'keywords':[]
    }
    # Let requests URL-encode the query (names contain spaces and apostrophes).
    params = {'q': i, 'fq': 'pub_year:(2020)', 'api-key': key}
    response = requests.get('https://api.nytimes.com/svc/search/v2/articlesearch.json', params=params)
    js = response.json()['response']['docs']

    for j in js:
        nyt_astronauts_info['articles'].append(j['web_url'])
        for k in j['keywords']:
            # Compare the keyword's value (not the whole dict) to avoid duplicates.
            if k['value'] not in nyt_astronauts_info['keywords']:
                nyt_astronauts_info['keywords'].append(k['value'])

    nyt_astronauts_info['num_articles'] = len(nyt_astronauts_info['articles'])

    nyt_astronauts[i] = nyt_astronauts_info

    # Pause between calls to stay under the API rate limit.
    time.sleep(12)

In [156]:
print(nyt_astronauts)

{'Jasmin Moghbeli': {'articles': ['https://www.nytimes.com/2020/12/09/science/nasa-astronauts-moon.html'], 'num_articles': 1, 'keywords': ['Moon', 'Artemis Program', 'Space and Astronomy', 'Apollo Project', 'National Aeronautics and Space Administration', 'Bridenstine, James F (1975- )', 'Koch, Christina H', 'Meir, Jessica (1977- )', 'Mann, Nicole Aunapu', 'Acaba, Joseph M', 'Barron, Kayla J', 'Chari, Raja', 'Dominick, Matthew Stuart', 'Glover, Victor J Jr', 'Hoburg, Warren', 'Kim, Jonathan (1984- )', 'Lindgren, Kjell N', 'McClain, Anne C', 'Moghbeli, Jasmin', 'Rubins, Kathleen (1978- )', 'Rubio, Francisco C (1975- )', 'Tingle, Scott D', 'Watkins, Jessica A (1988- )', 'Wilson, Stephanie D']}, 'Andreas Mogensen': {'articles': ['https://www.nytimes.com/interactive/2020/11/02/science/iss-20th-anniversary-timeline.html'], 'num_articles': 1, 'keywords': ['International Space Station', 'Space Shuttles']}, 'Satoshi Furukawa': {'articles': ['https://www.nytimes.com/interactive/2020/11/02/scien
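
The question also asks how many times each keyword appeared, which a `collections.Counter` can track. The sketch below works on a list of documents shaped like the `docs` entries used above (`web_url`, and `keywords` as a list of dicts with a `value` key); `summarize_docs` and the sample documents are hypothetical.

```python
from collections import Counter


def summarize_docs(docs):
    """Build the per-astronaut record: article links, article count, keyword counts."""
    keyword_counts = Counter()
    for doc in docs:
        keyword_counts.update(keyword['value'] for keyword in doc.get('keywords', []))
    return {
        'articles': [doc['web_url'] for doc in docs],
        'num_articles': len(docs),
        'keywords': dict(keyword_counts),
    }


# Invented documents in the assumed response shape.
sample_docs = [
    {'web_url': 'https://example.org/a', 'keywords': [{'value': 'space'}, {'value': 'celebrities'}]},
    {'web_url': 'https://example.org/b', 'keywords': [{'value': 'space'}]},
]
print(summarize_docs(sample_docs))
```

Note that `num_articles` here counts only the documents passed in, that is, one page of results.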

#### Question 4.2. Bonus :) 

What happens when we want to extract links to all the articles and not just the last 10? Extend your program to iterate through pages of the query results :)
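
One way to approach this is to request successive pages until a page comes back empty. The sketch below assumes the Article Search API returns 10 results per request and accepts a zero-based `page` query parameter; the request itself is injected as a function, so the loop can be tried offline. `collect_all_urls`, `fake_fetch` and the sample results are hypothetical.

```python
def collect_all_urls(fetch_page, max_pages=100):
    """Call fetch_page(0), fetch_page(1), ... and gather URLs until a page is empty."""
    urls = []
    for page in range(max_pages):
        docs = fetch_page(page)
        if not docs:
            break  # no more results
        urls.extend(doc['web_url'] for doc in docs)
    return urls


# Fake result set: two full pages of 10 documents and a last page with one document.
fake_results = [
    [{'web_url': f'https://example.org/{page}-{n}'} for n in range(10)] for page in range(2)
] + [[{'web_url': 'https://example.org/2-0'}]]


def fake_fetch(page):
    # Stands in for a real request that would add 'page': page to the query parameters.
    return fake_results[page] if page < len(fake_results) else []


print(len(collect_all_urls(fake_fetch)))  # 21
```

A real `fetch_page` would also need to pause between requests, as the loop above does with `time.sleep`, to respect the rate limit.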