# Data Journalism in Python - 
# A Practical Introduction to Programming


### Natalie Widmann




Winter semester 2022 / 2023


Universität Leipzig







![Timeline](../imgs/timeline.png)

# Part III - APIs and Web Scraping


# Contents

## - what we will learn

- What APIs are
- Retrieving and saving data
- File paths





### Challenges of data files

But why use an API instead of a static CSV dataset you can download from the web? APIs are useful in the following cases:

- The data is changing quickly. An example of this is stock price data. It doesn’t really make sense to regenerate a dataset and download it every minute, since that would use a lot of bandwidth and be pretty slow.
- You want a small piece of a much larger set of data. Reddit comments are one example. What if you want to just pull your own comments on Reddit? It doesn’t make much sense to download the entire Reddit database and then filter it down to your own comments.
- There is repeated computation involved. Spotify has an API that can tell you the genre of a piece of music. You could theoretically create your own classifier and use it to compute music categories, but you’ll never have as much data as Spotify does.


# Was sind APIs?

An API, or Application Programming Interface, is an interface, usually provided by a server, that you can use to retrieve and send data using code. APIs are most commonly used to retrieve data, and that will be the focus of this introduction.

When we want to receive data from an API, we need to make a request. Requests are used all over the web. For instance, when you visit a web page, your web browser makes a request to a web server, which responds with the content of that page.

API requests work in exactly the same way – you make a request to an API server for data, and it responds to your request.


![API](../imgs/API.png)

### Our first API request in Python

For this we use the `requests` Python package.

Documentation: https://requests.readthedocs.io/en/latest/

In [1]:
!pip install requests

Defaulting to user installation because normal site-packages is not writeable


In [2]:
import requests

To retrieve data from an API, you use a so-called `GET` request.
The `requests` package provides the function `requests.get()` for this.
It takes the URL as its argument.

In [32]:
#url = 'https://api.open-notify.org/this-api-doesnt-exist'
url = 'http://api.open-notify.org/astros.json'
response = requests.get(url)

In [30]:
dir(response)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

In [18]:
response

<Response [200]>

### Status Codes

The response to a request contains a status code that tells you whether the request was successful.

The status code is accessed via `.status_code`.

In [8]:
response.status_code

200

### Relevant status codes


| Status Code  | Meaning  | 
|----------|:-------------  |
| **200** | OK. The request succeeded and the API returned a response. |
| **204** | No Content. The request succeeded, but no data was returned. |
| **400** | Bad Request. The server could not process the request, for example because of malformed syntax or invalid parameters. |
| **401** | Unauthorized. Authentication failed or is missing. |
| **403** | Forbidden. Access to the endpoint is not permitted. |
| **404** | Not Found. The API / URL was not found. |
| **500** | Internal Server Error. |
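
The codes in the table can also be looked up programmatically. The following is a minimal sketch that uses only the standard-library `http.HTTPStatus` enum, so it runs without a network connection; the helper name `describe_status` is made up for this illustration and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable description of an HTTP status code."""
    try:
        status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    except ValueError:
        return f"{code}: unknown status code"

    if 200 <= code < 300:
        kind = "success"
    elif code >= 400:
        kind = "error"
    else:
        kind = "other"
    return f"{code} {status.phrase} ({kind})"


# Print a description for every code listed in the table
for code in (200, 204, 400, 401, 403, 404, 500):
    print(describe_status(code))
```

With a real response object, the same function could be called as `describe_status(response.status_code)`.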


### API Data

The data returned by the API can be accessed via `.json()`.

In [24]:
data = response.json()

In [26]:
data

{'people': [{'craft': 'Tiangong', 'name': 'Cai Xuzhe'},
  {'craft': 'Tiangong', 'name': 'Chen Dong'},
  {'craft': 'Tiangong', 'name': 'Liu Yang'},
  {'craft': 'ISS', 'name': 'Sergey Prokopyev'},
  {'craft': 'ISS', 'name': 'Dmitry Petelin'},
  {'craft': 'ISS', 'name': 'Frank Rubio'},
  {'craft': 'ISS', 'name': 'Nicole Mann'},
  {'craft': 'ISS', 'name': 'Josh Cassada'},
  {'craft': 'ISS', 'name': 'Koichi Wakata'},
  {'craft': 'ISS', 'name': 'Anna Kikina'},
  {'craft': 'Shenzhou 15', 'name': 'Fei Junlong'},
  {'craft': 'Shenzhou 15', 'name': 'Deng Qingming'},
  {'craft': 'Shenzhou 15', 'name': 'Zhang Lu'}],
 'number': 13,
 'message': 'success'}

In [None]:
type(data)
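
To see what kind of object `.json()` produces without calling the API, the sketch below parses a JSON string by hand with the standard-library `json` module. This roughly corresponds to what happens to the text body of a response; the string `raw_text` is a hand-written, shortened stand-in for that body and is an assumption made for this example.

```python
import json

# Hand-written, shortened stand-in for the text body of an API response (assumption)
raw_text = '{"people": [{"craft": "ISS", "name": "Frank Rubio"}], "number": 1, "message": "success"}'

# json.loads turns JSON text into Python objects
parsed = json.loads(raw_text)

print(type(parsed))            # a JSON object becomes a dict
print(type(parsed["people"]))  # a JSON array becomes a list
print(parsed["number"])        # a JSON number becomes an int
```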

#### Which data is available?

Show all keys in the dictionary.

dict_keys(['people', 'number', 'message'])

Show the number of people who are currently in space.

Print the names of all people in space.
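
One possible way to approach these three tasks is sketched below. To keep the example runnable without a network connection it works on `sample`, a shortened dictionary in the same shape as the response shown above (only three of the people are included, and `number` is adjusted to match); with the real data you would use `data` instead.

```python
# Shortened sample in the shape of the API response (not the full data)
sample = {
    "people": [
        {"craft": "Tiangong", "name": "Cai Xuzhe"},
        {"craft": "ISS", "name": "Frank Rubio"},
        {"craft": "ISS", "name": "Nicole Mann"},
    ],
    "number": 3,
    "message": "success",
}

# 1. All keys of the dictionary
print(list(sample.keys()))

# 2. Number of people currently in space
print(sample["number"])

# 3. Names of all people: each entry in "people" is itself a dictionary
names = [person["name"] for person in sample["people"]]
for name in names:
    print(name)
```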

## Saving data locally

In [35]:
import json

with open("astronauts.json", "w") as outfile:
    json.dump(data, outfile)

### Reading JSON


In [36]:
import json
 
with open('astronauts.json', 'r') as openfile:
    new_data = json.load(openfile)
 
print(new_data)

{'people': [{'craft': 'Tiangong', 'name': 'Cai Xuzhe'}, {'craft': 'Tiangong', 'name': 'Chen Dong'}, {'craft': 'Tiangong', 'name': 'Liu Yang'}, {'craft': 'ISS', 'name': 'Sergey Prokopyev'}, {'craft': 'ISS', 'name': 'Dmitry Petelin'}, {'craft': 'ISS', 'name': 'Frank Rubio'}, {'craft': 'ISS', 'name': 'Nicole Mann'}, {'craft': 'ISS', 'name': 'Josh Cassada'}, {'craft': 'ISS', 'name': 'Koichi Wakata'}, {'craft': 'ISS', 'name': 'Anna Kikina'}, {'craft': 'Shenzhou 15', 'name': 'Fei Junlong'}, {'craft': 'Shenzhou 15', 'name': 'Deng Qingming'}, {'craft': 'Shenzhou 15', 'name': 'Zhang Lu'}], 'number': 13, 'message': 'success'}

### File Paths
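
A file name such as `"astronauts.json"` is a relative path: the file is looked up in the current working directory. As a minimal sketch of handling paths more explicitly, the example below uses the standard-library `pathlib` module to build a path inside a subfolder, create the folder if it is missing, and write and read a JSON file there. The folder name `data` and the temporary base directory are assumptions chosen for this illustration.

```python
import json
import tempfile
from pathlib import Path

data = {"number": 1, "message": "success"}  # small stand-in for the API data

# A temporary directory keeps the example from touching real files;
# in a project you might use a fixed folder such as Path("data") instead.
base_dir = Path(tempfile.mkdtemp())

# Path objects are joined with "/" regardless of the operating system
out_path = base_dir / "data" / "astronauts.json"
out_path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders

with open(out_path, "w") as outfile:
    json.dump(data, outfile)

with open(out_path, "r") as infile:
    loaded = json.load(infile)

print(out_path.name)      # file name without the folders
print(out_path.suffix)    # file extension
print(out_path.exists())  # whether the file is really there
```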

In [39]:
response.json()

{'meta': {'abgeordnetenwatch_api': {'version': '2.3',
   'changelog': 'https://www.abgeordnetenwatch.de/api/version-changelog/aktuell',
   'licence': 'CC0 1.0',
   'licence_link': 'https://creativecommons.org/publicdomain/zero/1.0/deed.de',
   'documentation': 'https://www.abgeordnetenwatch.de/api/entitaeten/parliament'},
  'status': 'ok',
  'status_message': '',
  'result': {'count': 18, 'total': 18, 'range_start': 0, 'range_end': 100}},
 'data': [{'id': 18,
   'entity_type': 'parliament',
   'label': 'Schleswig-Holstein',
   'api_url': 'https://www.abgeordnetenwatch.de/api/v2/parliaments/18',
   'abgeordnetenwatch_url': 'https://www.abgeordnetenwatch.de/schleswig-holstein',
   'label_external_long': 'Landtag Schleswig-Holstein',
   'current_project': {'id': 138,
    'entity_type': 'parliament_period',
    'label': 'Schleswig-Holstein 2022 - 2027',
    'api_url': 'https://www.abgeordnetenwatch.de/api/v2/parliament-periods/138',
    'abgeordnetenwatch_url': 'https://www.abgeordnetenwat

## Twitter API

In [53]:
with open('.env', 'r') as f:
    # strip the trailing newline, otherwise it ends up in the Authorization header
    TOKEN = f.readline().strip()

In [54]:
def create_headers(token):
    headers = {"Authorization": "Bearer {}".format(token)}
    return headers

In [77]:
def create_url(keyword, max_results = 10):
    
    # Change this to the endpoint you want to collect data from
    search_url = "https://api.twitter.com/2/tweets/search/recent"

    # Change the parameters based on the endpoint you are using
    query_params = {'query': keyword,
                    'max_results': max_results,
                    'expansions': 'author_id,in_reply_to_user_id,geo.place_id',
                    'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source',
                    'user.fields': 'id,name,username,created_at,description,public_metrics,verified',
                    'place.fields': 'full_name,id,country,country_code,geo,name,place_type',
                    'next_token': None}  # set per request in connect_to_endpoint
    return (search_url, query_params)
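
The parameter dictionary returned by `create_url` ends up as the query string of the request URL. The sketch below imitates that step with the standard-library `urllib.parse.urlencode` so the result can be inspected without sending a request. The helper `build_query_string` and the reduced parameter set are assumptions for this illustration; `requests` does this encoding internally when it receives `params`, and the sketch only approximates that behaviour.

```python
from urllib.parse import urlencode


def build_query_string(params):
    """Encode a parameter dict as a URL query string, skipping None values."""
    cleaned = {key: value for key, value in params.items() if value is not None}
    return urlencode(cleaned)


# Reduced parameter set for illustration
example_params = {"query": "Musk lang:de", "max_results": 10, "next_token": None}
query_string = build_query_string(example_params)

# Spaces and colons are percent-encoded so the URL stays valid
print("https://api.twitter.com/2/tweets/search/recent?" + query_string)
```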

In [78]:
def connect_to_endpoint(url, headers, params, next_token = None):
    params['next_token'] = next_token   #params object received from create_url function
    response = requests.request("GET", url, headers = headers, params = params)
    print("Endpoint Response Code: " + str(response.status_code))
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()

In [91]:
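headers = create_headers(TOKEN)
keyword = "Musk lang:de"
# start_time and end_time are defined here but not passed to create_url,
# so they have no effect on the request below
start_time = "2021-03-01T00:00:00.000Z"
end_time = "2021-03-31T00:00:00.000Z"
max_results = 15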

In [92]:
url = create_url(keyword, max_results)
json_response = connect_to_endpoint(url[0], headers, url[1])

Endpoint Response Code: 200
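
The `next_token` argument of `connect_to_endpoint` is meant for fetching results page by page. A minimal sketch of such a loop follows, assuming each response carries the token for the next page under `meta` → `next_token` and omits it on the last page. The names `collect_all_pages`, `fake_fetch` and `FAKE_PAGES` are hypothetical; the fake pages stand in for real responses so the example runs without a token or network access.

```python
def collect_all_pages(fetch_page, max_pages=5):
    """Call fetch_page(next_token) repeatedly and gather the 'data' items."""
    items = []
    next_token = None  # the first request is sent without a token
    for _ in range(max_pages):
        page = fetch_page(next_token)
        items.extend(page.get("data", []))
        # Assumed response layout: the token for the next page sits in meta.next_token
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            break  # no further pages
    return items


# Fake pages standing in for real API responses (hypothetical data)
FAKE_PAGES = {
    None: {"data": [{"id": "1"}, {"id": "2"}], "meta": {"next_token": "abc"}},
    "abc": {"data": [{"id": "3"}], "meta": {}},
}


def fake_fetch(next_token):
    """Return the fake page that belongs to the given token."""
    return FAKE_PAGES[next_token]


tweets = collect_all_pages(fake_fetch)
print([tweet["id"] for tweet in tweets])
```

With the functions defined above, `fetch_page` could be something like `lambda token: connect_to_endpoint(url[0], headers, url[1], token)`; the `max_pages` limit is there to avoid using up the request quota by accident.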


In [93]:
json_response

{'data': [{'in_reply_to_user_id': '1509990162595467265',
   'id': '1599864361287553024',
   'conversation_id': '1599863110000869376',
   'reply_settings': 'everyone',
   'public_metrics': {'retweet_count': 0,
    'reply_count': 0,
    'like_count': 0,
    'quote_count': 0},
   'edit_history_tweet_ids': ['1599864361287553024'],
   'text': '@S04Juulius Schon lustig alles und jeden auf der Welt anprangern aber 9€ im Monat Dauerauftrag an Elon Musk.',
   'referenced_tweets': [{'type': 'replied_to', 'id': '1599863110000869376'}],
   'lang': 'de',
   'source': 'Twitter for iPhone',
   'created_at': '2022-12-05T20:32:42.000Z',
   'author_id': '1033905116540219394'},
  {'id': '1599864273185882113',
   'conversation_id': '1599864273185882113',
   'reply_settings': 'everyone',
   'public_metrics': {'retweet_count': 379,
    'reply_count': 0,
    'like_count': 0,
    'quote_count': 0},
   'edit_history_tweet_ids': ['1599864273185882113'],
   'text': 'RT @maxotte_says: "Was Musk mit #twittergate m