In [1]:
import pandas as pd
import keys
import json

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings('ignore')

# Introduction

In this tutorial we give an overview of how to use `APIs (Application Programming Interfaces)` to retrieve data. An API is a set of protocols and routines for building and interacting with software applications. APIs are hosted in web servers. It is a very effective and quickly way to retrieve data that changes frequently. Via API one is able to retrieve real time data as well as historical data. An example of interesting business application involves combining real time data together with historical data to predict  demand of products. 

Imagine you have a bakery close by a train station in the Netherlands. The historical data from the Dutch Railway, NS (Nederlandse Spoorwegen) could be combined with weather historical data obtained from [KNMI weather API]( https://weerlive.nl/delen.php) to build a forecasting model. Then the actual data would allow  forecasting the demand of clients on a particular day.

We start this tutorial showing a simple API: The [Open Movie database (OMDb) API]( http://www.omdbapi.com/). Then we take a look on how to get information from both NS and KNMI Weer APIs. To close it we check out how to pull data from [Twitter](https://twitter.com/?lang=en).

# Some background in JSON files

A standard form for transferring data through APIs is the human readable file format JSON (JavaScript Object Notation). Hence, before getting data from APIs let's get some basics about JSON files.

The image below is the output of a OMDb API request.

![](../images/omdb_json.JPG)


Notice that JSON consists of key-value pairs like a Python dictionary. That’s why when loading JSONs into Python dictionary is a natural choice. The keys in JSONs are always strings enclosed in quotation marks. The values can be strings, integers, arrays or even objects. An object can even be a JSON and then you have nested JSONs. We can see in the JSON above that all keys are strings between quotation marks. Most of the values are strings but notice that `Ratings` is a list of dictionaries.

The [JSON library]( https://docs.python.org/3.6/library/json.html) has two main methods:

* `dumps` -- Takes in a Python object, and converts it to a string
* `loads` -- Takes a JSON string, and converts it to a Python object

We use the first to **save** an object and the second to **load** it. To exemplify I’ll use some information about a series this time: "The Queen's Gambit".


In [2]:
series_info = {'Title': "The Queen's Gambit", 
               'Year': '2020', 'Rated': 'TV-MA', 
               'Released': '23 Oct 2020', 
               'Runtime': '395 min', 
               'Genre': 'Drama, Sport', 
               'Actors': 'Anya Taylor-Joy, Chloe Pirrie, Bill Camp, Marcin Dorocinski', 
               'Plot': 'Orphaned at the tender age of nine, prodigious introvert Beth Harmon discovers and masters the game of chess in 1960s USA. But child stardom comes at a price.', 
               'Language': 'English', 
               'Country': 'USA', 
               'Awards': 'Nominated for 2 Golden Globes. Another 3 wins & 16 nominations.',
               'imdbRating': '8.6', 
               'imdbVotes': '258,170', 
               'imdbID': 'tt10048342', 
               'Type': 'series', 
               'totalSeasons': '1'}

In [3]:
import json

# convert a python object in string and save it
with open('../data/processed/serie.json', 'w') as fp:
    json.dump(series_info, fp)

Let's recover the content of the JSON file we just saved.

In [4]:
# takes a json string, converts it into a Python object, and save it: json_data
with open("../data/processed/serie.json") as json_file:
    json_data = json.load(json_file)

In [5]:
json_data

{'Title': "The Queen's Gambit",
 'Year': '2020',
 'Rated': 'TV-MA',
 'Released': '23 Oct 2020',
 'Runtime': '395 min',
 'Genre': 'Drama, Sport',
 'Actors': 'Anya Taylor-Joy, Chloe Pirrie, Bill Camp, Marcin Dorocinski',
 'Plot': 'Orphaned at the tender age of nine, prodigious introvert Beth Harmon discovers and masters the game of chess in 1960s USA. But child stardom comes at a price.',
 'Language': 'English',
 'Country': 'USA',
 'Awards': 'Nominated for 2 Golden Globes. Another 3 wins & 16 nominations.',
 'imdbRating': '8.6',
 'imdbVotes': '258,170',
 'imdbID': 'tt10048342',
 'Type': 'series',
 'totalSeasons': '1'}

In [6]:
# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])

Title:  The Queen's Gambit
Year:  2020
Rated:  TV-MA
Released:  23 Oct 2020
Runtime:  395 min
Genre:  Drama, Sport
Actors:  Anya Taylor-Joy, Chloe Pirrie, Bill Camp, Marcin Dorocinski
Plot:  Orphaned at the tender age of nine, prodigious introvert Beth Harmon discovers and masters the game of chess in 1960s USA. But child stardom comes at a price.
Language:  English
Country:  USA
Awards:  Nominated for 2 Golden Globes. Another 3 wins & 16 nominations.
imdbRating:  8.6
imdbVotes:  258,170
imdbID:  tt10048342
Type:  series
totalSeasons:  1


# Using APIS to retrieve information from the web

As commented previously, an API is a set of protocols and routines for building and interacting with software applications which allows two software programs to communicate with each other. For instances, if one wants to stream actual weather information by writing some Python code, he/she would use a weather API such as KNMI streaming Weer API. On the other hand, if someone wants to automate pulling and processing data from the Dutch Railway NS, he/she could use the NS API. 

Using APIs has become normal practice nowadays. Marketing companies and social scientists use APIs from Twitter, Facebook, Instagram, for example. Many other companies and organizations have APIs. [Rapid API]( https://rapidapi.com/?site) is a good way to get informed about APIs available.

Now that we know a bit about JSON including how to save and load JSON files it is time to use APIs and Python to automate data retrieval .

## Omdb

Let's start with the [OMDb API](http://www.omdbapi.com/).

In order to get information over `The Queen's Gambit` series which we saw partially in the previous section I used the following URL to make a request:


First, notice that in place of my key I used `keys.omdb_key`. To protect my API keys, I listed them in a python script `keys.py` and added it to my `.gitinore` file.

I'm providing another script `your_keys.py` with the same format so you can fill in your own keys. Then, to run the API you just need to call your script `import your_keys` and use a specific key `your_keys.KEY`. And remember to don't share this script with others. If sharing your work on GitHub list in your key script in  `.gitignore`.

For OMDb API the `request URL` is `http://www.omdbapi.com/?apikey=[yourkey]&`. Usually, `?` indicates the query part, i.e., where we specify parameters. But before setting your search parameters you need to add your `apikey`. Then here we can say that the part after `&` is referred to as the query string. 


In [7]:
import keys # import script containing keys
import requests 

# Package the request, send the request and catch the response: r
request_url = "http://www.omdbapi.com/?apikey="+keys.omdb_key

parameters = {"t":"The Queen's Gambit"}

response = requests.get(request_url, params=parameters)

print(response)

<Response [200]>


In [8]:
# Get the response data as a Python object.  Verify that it's a dictionary.
json_data = response.json()
print(type(json_data))
print(json_data)

<class 'dict'>
{'Title': "The Queen's Gambit", 'Year': '2020', 'Rated': 'TV-MA', 'Released': '23 Oct 2020', 'Runtime': '395 min', 'Genre': 'Drama, Sport', 'Director': 'N/A', 'Writer': 'N/A', 'Actors': 'Anya Taylor-Joy, Chloe Pirrie, Bill Camp, Marcin Dorocinski', 'Plot': 'Orphaned at the tender age of nine, prodigious introvert Beth Harmon discovers and masters the game of chess in 1960s USA. But child stardom comes at a price.', 'Language': 'English', 'Country': 'USA', 'Awards': 'Nominated for 2 Golden Globes. Another 3 wins & 16 nominations.', 'Poster': 'https://m.media-amazon.com/images/M/MV5BM2EwMmRhMmUtMzBmMS00ZDQ3LTg4OGEtNjlkODk3ZTMxMmJlXkEyXkFqcGdeQXVyMjM5ODk1NDU@._V1_SX300.jpg', 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '8.6/10'}], 'Metascore': 'N/A', 'imdbRating': '8.6', 'imdbVotes': '258,170', 'imdbID': 'tt10048342', 'Type': 'series', 'totalSeasons': '1', 'Response': 'True'}


In [9]:
import keys # import script containing keys
import requests 

# Package the request, send the request and catch the response: r
request_url = "http://www.omdbapi.com/?apikey="+keys.omdb_key

parameters = {"t":"The Queen's Gambit"}

response = requests.get(request_url, params=parameters)

print(response)

# Get the response data as a Python object.  Verify that it's a dictionary.
json_data = response.json()
print(type(json_data))
print(json_data)

<Response [200]>
<class 'dict'>
{'Title': "The Queen's Gambit", 'Year': '2020', 'Rated': 'TV-MA', 'Released': '23 Oct 2020', 'Runtime': '395 min', 'Genre': 'Drama, Sport', 'Director': 'N/A', 'Writer': 'N/A', 'Actors': 'Anya Taylor-Joy, Chloe Pirrie, Bill Camp, Marcin Dorocinski', 'Plot': 'Orphaned at the tender age of nine, prodigious introvert Beth Harmon discovers and masters the game of chess in 1960s USA. But child stardom comes at a price.', 'Language': 'English', 'Country': 'USA', 'Awards': 'Nominated for 2 Golden Globes. Another 3 wins & 16 nominations.', 'Poster': 'https://m.media-amazon.com/images/M/MV5BM2EwMmRhMmUtMzBmMS00ZDQ3LTg4OGEtNjlkODk3ZTMxMmJlXkEyXkFqcGdeQXVyMjM5ODk1NDU@._V1_SX300.jpg', 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '8.6/10'}], 'Metascore': 'N/A', 'imdbRating': '8.6', 'imdbVotes': '258,170', 'imdbID': 'tt10048342', 'Type': 'series', 'totalSeasons': '1', 'Response': 'True'}


Notice that the URL can be constructed using simple string manipulation. However, for more complex requests using the structure presented above can make the task easier and less prone to error.

For instances, we could build the url as follows:

In [10]:
# Assign URL to variable: url
url = "http://www.omdbapi.com/?apikey="+keys.omdb_key+"&t=soul"

# Package the request, send the request and catch the response: r
response = requests.get(url)

# Get the response data as a Python object.  Verify that it's a dictionary.
json_data = response.json()
print(type(json_data))
print(json_data)

<class 'dict'>
{'Title': 'Soul', 'Year': '2020', 'Rated': 'PG', 'Released': '25 Dec 2020', 'Runtime': '100 min', 'Genre': 'Animation, Adventure, Comedy, Family, Fantasy, Music', 'Director': 'Pete Docter, Kemp Powers(co-director)', 'Writer': 'Pete Docter (story & screenplay by), Mike Jones (story & screenplay by), Kemp Powers (story & screenplay by)', 'Actors': 'Jamie Foxx, Tina Fey, Graham Norton, Rachel House', 'Plot': 'After landing the gig of a lifetime, a New York jazz pianist suddenly finds himself trapped in a strange land between Earth and the afterlife.', 'Language': 'English', 'Country': 'USA', 'Awards': 'Nominated for 2 Golden Globes. Another 55 wins & 71 nominations.', 'Poster': 'https://m.media-amazon.com/images/M/MV5BZGE1MDg5M2MtNTkyZS00MTY5LTg1YzUtZTlhZmM1Y2EwNmFmXkEyXkFqcGdeQXVyNjA3OTI0MDc@._V1_SX300.jpg', 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '8.1/10'}, {'Source': 'Metacritic', 'Value': '83/100'}], 'Metascore': '83', 'imdbRating': '8.1', 'imdbVotes'

## NS APIs

Now is time to explore a bit the [NS API](https://apiportal.ns.nl/) which is the API of the *Nederlandse Spoorwegen*, i.e., the Dutch Railway. With this API we can extract actual information while historical data can be retrieved from sites like [**rijdendetreinen.nl**](https://www.rijdendetreinen.nl/over/open-data) and [**NDOV Loket**](https://ndovloket.nl/index.html).

In addition, [GoTrain](https://github.com/rijdendetreinen/gotrain) is a server application for receiving, processing and distributing real-time data about train services in the Netherlands. It is designed to continuously receive data streams offered as open data by the [Dutch Railways (NS)](https://www.ns.nl/).

A list of stations was supposed to be found in the website of NS https://www.ns.nl/en/travel-information/ns-api/documentation-station-list.html. However, the links was not active when we built this notebook. [Wikipedia](https://en.wikipedia.org/wiki/Railway_stations_in_the_Netherlands) provides a list of the code of the stations and at [NDOV loket](http://data.ndovloket.nl/ns/) you can download a list containing the `UCI codes`. An UCI code is an identifier for a railway station in Europe, CIS countries, China, Mongolia, North Africa and the Middle East. I've made a list downloaded at 18/01 available in `data/raw`.

In order to use the NS API, you need to register so you can get an API key. Go to the  [`starter guide`](https://apiportal.ns.nl/startersguide) to register and access some other information about the NS API. 

There are different [APIs available](https://apiportal.ns.nl/docs/services/) and before using a certain API you need to subscribe to the specific API you want to use.

The image below shows the APIs I've subscribed.

![](../images/NS_profile.JPG)


### [`GET` Stations](https://apiportal.ns.nl/docs/services/reisinformatie-api/operations/getStations?)

Here we get some information about stations which is included in the `Ns-App` subscription.

At [Reisinformatie API page](https://apiportal.ns.nl/docs/services/reisinformatie-api/) you can see all available operations and example of code to be used in different languages.

Notice that for NS API we have:

* `Request URL`: Depends on the operation, for `Get Stations`: https://gateway.apiportal.ns.nl/reisinformatie-api/api/v2/stations
* `Request headers`: The NS API key, i.e., `Ocp-Apim-Subscription-Key`
* `Request parameters`: These are our query strings and they vary in accord with the API you are using.

Then, notice that we add `headers` to our request.

For the `GET` Stations we don't have any specific parameter. However, we can perform some filtering after to get information only on Dutch train stations.


In [11]:
import urllib

headers = {
    # Request headers
    'Ocp-Apim-Subscription-Key': keys.ns_app_key,
}

params = urllib.parse.urlencode({
})


response = requests.get("https://gateway.apiportal.ns.nl/reisinformatie-api/api/v2/stations", headers=headers, params=params)
json_data = response.json()

json_data

{'payload': [{'UICCode': '8002084',
   'stationType': 'SNELTREIN_STATION',
   'EVACode': '8000139',
   'code': 'MGZB',
   'sporen': [],
   'synoniemen': ['Gunzburg'],
   'heeftFaciliteiten': True,
   'heeftVertrektijden': True,
   'heeftReisassistentie': False,
   'namen': {'lang': 'Günzburg', 'middel': 'Günzburg', 'kort': 'Günzburg'},
   'land': 'D',
   'lat': 48.460226,
   'lng': 10.278707,
   'radius': 0,
   'naderenRadius': 0,
   'ingangsDatum': '2018-12-16'},
  {'UICCode': '8002140',
   'stationType': 'SNELTREIN_STATION',
   'EVACode': '8000013',
   'code': 'MA',
   'sporen': [],
   'synoniemen': [],
   'heeftFaciliteiten': True,
   'heeftVertrektijden': True,
   'heeftReisassistentie': False,
   'namen': {'lang': 'Augsburg Hbf',
    'middel': 'Augsburg Hbf',
    'kort': 'Augsburg'},
   'land': 'D',
   'lat': 48.3654307143927,
   'lng': 10.88547706604,
   'radius': 0,
   'naderenRadius': 0,
   'ingangsDatum': '2018-12-16'},
  {'UICCode': '8003004',
   'stationType': 'KNOOPPUNT_INT

In [12]:
json_data.keys()

dict_keys(['payload'])

In [13]:
# Building a dataframe using the dictionary within 'payload'
df = pd.DataFrame(json_data['payload'])
df

Unnamed: 0,UICCode,stationType,EVACode,code,sporen,synoniemen,heeftFaciliteiten,heeftVertrektijden,heeftReisassistentie,namen,land,lat,lng,radius,naderenRadius,ingangsDatum
0,8002084,SNELTREIN_STATION,8000139,MGZB,[],[Gunzburg],True,True,False,"{'lang': 'Günzburg', 'middel': 'Günzburg', 'ko...",D,48.460226,10.278707,0,0,2018-12-16
1,8002140,SNELTREIN_STATION,8000013,MA,[],[],True,True,False,"{'lang': 'Augsburg Hbf', 'middel': 'Augsburg H...",D,48.365431,10.885477,0,0,2018-12-16
2,8003004,KNOOPPUNT_INTERCITY_STATION,8010255,BHF,[],[],True,True,False,"{'lang': 'Berlin Ostbahnhof', 'middel': 'Berli...",D,52.510499,13.4347,0,0,2018-12-16
3,8003025,INTERCITY_STATION,8010404,BSPD,[],[],True,True,False,"{'lang': 'Berlin-Spandau', 'middel': 'Berlin-S...",D,52.534315,13.198947,0,0,2018-12-16
4,8007799,INTERCITY_STATION,8011102,GSB,[],[],True,True,False,"{'lang': 'Berlin Gesundbrunnen', 'middel': 'Be...",D,52.548633,13.390427,0,0,2018-12-16
5,8008016,STOPTREIN_STATION,8000037,ESRT,[],[],True,True,False,"{'lang': 'Schwerte (Ruhr)', 'middel': 'Schwert...",D,51.442281,7.55896,0,0,2018-12-16
6,8008073,KNOOPPUNT_SNELTREIN_STATION,8000142,HAGEN,[],[],True,True,False,"{'lang': 'Hagen Hbf', 'middel': 'Hagen Hbf', '...",D,51.362747,7.460249,0,0,2013-05-29
7,8008082,STOPTREIN_STATION,8006718,WUPPV,[],[],True,True,False,"{'lang': 'Wuppertal-Vohwinkel', 'middel': 'Wup...",D,51.23351,7.07237,0,0,2018-12-16
8,8008094,MEGA_STATION,8000085,DUSSEL,[],[Dusseldorf],True,True,False,"{'lang': 'Düsseldorf Hbf', 'middel': 'Düsseldo...",D,51.220146,6.793137,0,0,2017-06-07
9,8008134,STOPTREIN_STATION,8001795,EENP,[],[],True,True,False,"{'lang': 'Ennepetal', 'middel': 'Ennepetal', '...",D,51.304892,7.343285,0,0,2018-12-16


In [14]:
# Checking available countries
df.land.unique()

array(['D', 'B', 'GB', 'A', 'NL', 'CH', 'F'], dtype=object)

In [15]:
# Getting info only about Netherlands
df_NL = df[df.land=='NL']
df_NL.reset_index(drop=True, inplace = True)

In [16]:
df_NL.head()

Unnamed: 0,UICCode,stationType,EVACode,code,sporen,synoniemen,heeftFaciliteiten,heeftVertrektijden,heeftReisassistentie,namen,land,lat,lng,radius,naderenRadius,ingangsDatum
0,8400045,STOPTREIN_STATION,8400045,ATN,"[{'spoorNummer': '1'}, {'spoorNummer': '2'}]",[],True,True,True,"{'lang': 'Aalten', 'middel': 'Aalten', 'kort':...",NL,51.921327,6.578627,200,1200,2020-05-13
1,8400047,STOPTREIN_STATION,8400047,AC,"[{'spoorNummer': '2'}, {'spoorNummer': '3'}]",[],True,True,False,"{'lang': 'Abcoude', 'middel': 'Abcoude', 'kort...",NL,52.2785,4.977,200,1200,2011-05-01
2,8400049,STOPTREIN_STATION,8400049,AKM,"[{'spoorNummer': '2'}, {'spoorNummer': '3'}]",[],True,True,True,"{'lang': 'Akkrum', 'middel': 'Akkrum', 'kort':...",NL,53.046391,5.843611,200,1600,1990-01-01
3,8400050,KNOOPPUNT_INTERCITY_STATION,8400050,AMR,"[{'spoorNummer': '1'}, {'spoorNummer': '2'}, {...",[],True,True,True,"{'lang': 'Alkmaar', 'middel': 'Alkmaar', 'kort...",NL,52.637779,4.739722,525,1200,2003-07-01
4,8400051,KNOOPPUNT_INTERCITY_STATION,8400051,AML,"[{'spoorNummer': '2'}, {'spoorNummer': '2a'}, ...",[],True,True,True,"{'lang': 'Almelo', 'middel': 'Almelo', 'kort':...",NL,52.358055,6.653889,525,1200,2017-12-10


The name of the Station is shown as dictionaries in `df_NL['namen']`. We can for example make it better to access, like this:

In [17]:
df_NL['namen_lang'] = df_NL['namen'].apply(lambda x: x['lang'])
df_NL['namen_middel'] = df_NL['namen'].apply(lambda x: x['middel'])
df_NL['namen_kort'] = df_NL['namen'].apply(lambda x: x['kort'])
df_NL.drop(columns = ['namen'], inplace = True)

In [18]:
df_NL[['namen_lang','namen_middel','namen_kort']]

Unnamed: 0,namen_lang,namen_middel,namen_kort
0,Aalten,Aalten,Aalten
1,Abcoude,Abcoude,Abcoude
2,Akkrum,Akkrum,Akkrum
3,Alkmaar,Alkmaar,Alkmaar
4,Almelo,Almelo,Almelo
5,Alkmaar Noord,Alkmaar N.,Alkmaar N
6,Alphen a/d Rijn,Alphen a/d Rijn,Alphen
7,Amersfoort Schothorst,Schothorst,Schothorst
8,Amersfoort Centraal,Amersfoort C.,Amersfrt C
9,Amsterdam RAI,RAI,RAI


In [19]:
df_NL[['UICCode', 'stationType', 'EVACode', 'code', 'sporen',
       'land', 'lat', 'lng', 'ingangsDatum','namen_lang']].head()

Unnamed: 0,UICCode,stationType,EVACode,code,sporen,land,lat,lng,ingangsDatum,namen_lang
0,8400045,STOPTREIN_STATION,8400045,ATN,"[{'spoorNummer': '1'}, {'spoorNummer': '2'}]",NL,51.921327,6.578627,2020-05-13,Aalten
1,8400047,STOPTREIN_STATION,8400047,AC,"[{'spoorNummer': '2'}, {'spoorNummer': '3'}]",NL,52.2785,4.977,2011-05-01,Abcoude
2,8400049,STOPTREIN_STATION,8400049,AKM,"[{'spoorNummer': '2'}, {'spoorNummer': '3'}]",NL,53.046391,5.843611,1990-01-01,Akkrum
3,8400050,KNOOPPUNT_INTERCITY_STATION,8400050,AMR,"[{'spoorNummer': '1'}, {'spoorNummer': '2'}, {...",NL,52.637779,4.739722,2003-07-01,Alkmaar
4,8400051,KNOOPPUNT_INTERCITY_STATION,8400051,AML,"[{'spoorNummer': '2'}, {'spoorNummer': '2a'}, ...",NL,52.358055,6.653889,2017-12-10,Almelo


### [`GET` Arrivals](https://apiportal.ns.nl/docs/services/reisinformatie-api/operations/getArrivals?)

This operation lists arrivals for a specific station.

[There](https://apiportal.ns.nl/docs/services/reisinformatie-api/operations/getArrivals) you have many parameters to use in your query.

In [20]:
headers = {
    # Request headers
    'Ocp-Apim-Subscription-Key': keys.ns_app_key,
}

params = urllib.parse.urlencode({
    # Request parameters
    'lang': 'nl',
    'station': 'TB',
#     'uicCode': '{string}',
#     'dateTime': '{string}',
    'maxJourneys': 100,
})

response = requests.get("https://gateway.apiportal.ns.nl/reisinformatie-api/api/v2/arrivals?", 
                        headers=headers, 
                        params=params)
data = response.json()

data

{'payload': {'source': 'PPV',
  'arrivals': [{'origin': 'Den Haag Centraal',
    'name': 'NS  1163',
    'plannedDateTime': '2021-03-05T17:51:00+0100',
    'plannedTimeZoneOffset': 60,
    'actualDateTime': '2021-03-05T17:52:59+0100',
    'actualTimeZoneOffset': 60,
    'plannedTrack': '1',
    'product': {'number': '1163',
     'categoryCode': 'IC',
     'shortCategoryName': 'NS Intercity',
     'longCategoryName': 'Intercity',
     'operatorCode': 'NS',
     'operatorName': 'NS',
     'type': 'TRAIN'},
    'trainCategory': 'IC',
    'cancelled': False,
    'messages': [],
    'arrivalStatus': 'INCOMING'},
   {'origin': 'Weert',
    'name': 'NS  5264',
    'plannedDateTime': '2021-03-05T17:51:00+0100',
    'plannedTimeZoneOffset': 60,
    'actualDateTime': '2021-03-05T17:51:00+0100',
    'actualTimeZoneOffset': 60,
    'plannedTrack': '2',
    'product': {'number': '5264',
     'categoryCode': 'SPR',
     'shortCategoryName': 'NS Sprinter',
     'longCategoryName': 'Sprinter',
     'o

In [21]:
data.keys()

dict_keys(['payload'])

In [22]:
df_arrivals = pd.DataFrame(data['payload'])
df_arrivals.head()

Unnamed: 0,source,arrivals
0,PPV,"{'origin': 'Den Haag Centraal', 'name': 'NS 1..."
1,PPV,"{'origin': 'Weert', 'name': 'NS 5264', 'plann..."
2,PPV,"{'origin': 'Zwolle', 'name': 'NS 3661', 'plan..."
3,PPV,"{'origin': 'Dordrecht', 'name': 'NS 1965', 'p..."
4,PPV,"{'origin': 'Roosendaal', 'name': 'NS 3664', '..."


In [23]:
df_arrivals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   source    20 non-null     object
 1   arrivals  20 non-null     object
dtypes: object(2)
memory usage: 448.0+ bytes


In [24]:
df_arrivals['arrivals'][0]

{'origin': 'Den Haag Centraal',
 'name': 'NS  1163',
 'plannedDateTime': '2021-03-05T17:51:00+0100',
 'plannedTimeZoneOffset': 60,
 'actualDateTime': '2021-03-05T17:52:59+0100',
 'actualTimeZoneOffset': 60,
 'plannedTrack': '1',
 'product': {'number': '1163',
  'categoryCode': 'IC',
  'shortCategoryName': 'NS Intercity',
  'longCategoryName': 'Intercity',
  'operatorCode': 'NS',
  'operatorName': 'NS',
  'type': 'TRAIN'},
 'trainCategory': 'IC',
 'cancelled': False,
 'messages': [],
 'arrivalStatus': 'INCOMING'}

In [25]:
df_arrivals['arrivals'][4]

{'origin': 'Roosendaal',
 'name': 'NS  3664',
 'plannedDateTime': '2021-03-05T18:02:00+0100',
 'plannedTimeZoneOffset': 60,
 'actualDateTime': '2021-03-05T18:02:00+0100',
 'actualTimeZoneOffset': 60,
 'plannedTrack': '1',
 'product': {'number': '3664',
  'categoryCode': 'IC',
  'shortCategoryName': 'NS Intercity',
  'longCategoryName': 'Intercity',
  'operatorCode': 'NS',
  'operatorName': 'NS',
  'type': 'TRAIN'},
 'trainCategory': 'IC',
 'cancelled': False,
 'messages': [{'message': 'Vandaag afkomstig uit Breda', 'style': 'INFO'}],
 'arrivalStatus': 'INCOMING'}

### [`GET` Trips](https://apiportal.ns.nl/docs/services/reisinformatie-api/operations/getTravelAdvice?)

For the last example using NS API let’s use operation `GET` Trips which returns a travel advice for the given parameters.

I've asked travel advice with Origin in `Den Bosch` and destination `Tilburg`.

In [26]:
headers = {
    # Request headers
    'Ocp-Apim-Subscription-Key': keys.ns_app_key,
}

params = urllib.parse.urlencode({
    # Request parameters
#     'lang': '{string}', # mostly in Dutch 
#     'fromStation': '{string}',
    'originUicCode': '8400319', #code for 'S-HERTOGENBOSCH
#     'originLat': '{number}',
#     'originLng': '{number}',
#     'originName': '{string}',
#     'toStation': '{string}',
    'destinationUicCode': '8400597', # UIC code for Tilburg
#     'destinationLat': '{number}',
#     'destinationLng': '{number}',
#     'destinationName': '{string}',
#     'viaStation': '{string}',
#     'viaUicCode': '{string}',
#     'viaLat': '{number}',
#     'viaLng': '{number}',
    'originWalk': 'false',
    'originBike': 'false',
    'originCar': 'false',
    'destinationWalk': 'false',
    'destinationBike': 'false',
    'destinationCar': 'false',
#     'dateTime': '{string}',
    'searchForArrival': 'true',
    'departure': 'true',
#     'context': '{string}',
    'shorterChange': 'false',
#     'addChangeTime': '{integer}',
#     'minimalChangeTime': '{integer}',
#     'viaWaitTime': '{integer}',
#     'originAccessible': '{boolean}',
    'travelAssistance': 'false',
#     'travelAssistanceTransferTime': '{integer}',
#     'accessibilityEquipment1': '{string}',
#     'accessibilityEquipment2': '{string}',
    'searchForAccessibleTrip': 'false',
#     'filterTransportMode': '{string}',
    'localTrainsOnly': 'false',
    'excludeHighSpeedTrains': 'false',
    'excludeTrainsWithReservationRequired': 'false',
    'yearCard': 'false',
#     'product': '{string}',
    'discount': 'NO_DISCOUNT',
    'travelClass': '2',
    'polylines': 'false',
    'passing': 'false',
    'travelRequestType': 'DEFAULT',
})

response = requests.get("https://gateway.apiportal.ns.nl/reisinformatie-api/api/v3/trips", 
                        headers=headers, 
                        params=params)
data = response.json()

data


{'source': 'HARP',
 'trips': [{'idx': 0,
   'uid': 'arnu|fromStation=8400319|toStation=8400597|plannedFromTime=2021-03-05T16:42:00+01:00|plannedArrivalTime=2021-03-05T16:57:00+01:00|yearCard=false|excludeHighSpeedTrains=false|searchForAccessibleTrip=false',
   'ctxRecon': 'arnu|fromStation=8400319|toStation=8400597|plannedFromTime=2021-03-05T16:42:00+01:00|plannedArrivalTime=2021-03-05T16:57:00+01:00|yearCard=false|excludeHighSpeedTrains=false|searchForAccessibleTrip=false',
   'plannedDurationInMinutes': 15,
   'actualDurationInMinutes': 15,
   'transfers': 0,
   'status': 'NORMAL',
   'messages': [],
   'legs': [{'idx': '0',
     'name': 'IC 3657',
     'travelType': 'PUBLIC_TRANSIT',
     'direction': 'Roosendaal',
     'cancelled': False,
     'changePossible': True,
     'alternativeTransport': False,
     'journeyDetailRef': 'HARP_S2S-1|554031|0|784|5032021',
     'origin': {'name': "'s-Hertogenbosch",
      'lng': 5.29362,
      'lat': 51.69048,
      'countryCode': 'NL',
      

In [27]:
data.keys()

dict_keys(['source', 'trips', 'scrollRequestBackwardContext', 'scrollRequestForwardContext'])

In [28]:
len(data['trips'])

6

In [29]:
data['trips'][0]

{'idx': 0,
 'uid': 'arnu|fromStation=8400319|toStation=8400597|plannedFromTime=2021-03-05T16:42:00+01:00|plannedArrivalTime=2021-03-05T16:57:00+01:00|yearCard=false|excludeHighSpeedTrains=false|searchForAccessibleTrip=false',
 'ctxRecon': 'arnu|fromStation=8400319|toStation=8400597|plannedFromTime=2021-03-05T16:42:00+01:00|plannedArrivalTime=2021-03-05T16:57:00+01:00|yearCard=false|excludeHighSpeedTrains=false|searchForAccessibleTrip=false',
 'plannedDurationInMinutes': 15,
 'actualDurationInMinutes': 15,
 'transfers': 0,
 'status': 'NORMAL',
 'messages': [],
 'legs': [{'idx': '0',
   'name': 'IC 3657',
   'travelType': 'PUBLIC_TRANSIT',
   'direction': 'Roosendaal',
   'cancelled': False,
   'changePossible': True,
   'alternativeTransport': False,
   'journeyDetailRef': 'HARP_S2S-1|554031|0|784|5032021',
   'origin': {'name': "'s-Hertogenbosch",
    'lng': 5.29362,
    'lat': 51.69048,
    'countryCode': 'NL',
    'uicCode': '8400319',
    'type': 'STATION',
    'plannedTimeZoneOffse

With this query we get a bunch of information about 6 trips from Den Bosch to Tilburg. Including punctuality, how crowded, and different prices depending of the subscription you may have with NS.

## Weather APIs

There are many [weather APIs available](https://rapidapi.com/category/Weather). In the Netherlands we can use, for instance :

* For private and study use: https://weerlive.nl/delen.php
* For commercial use: https://meteoserver.nl/

Using weerlive.nl you can get current weather data from the [KNMI (
KNMI - Koninklijk Nederlands Meteorologisch Instituut)](https://www.knmi.nl/home), i.e., Royal Netherlands Meteorological Institute which is the Dutch national weather forecasting service for free (maximal 300 data requests per day).

Meteoserver has different APIs that you can use for free until the limit of 500 requests/ month. Historical data can be obtained from the [`meteoserver`](https://meteoserver.nl/weerstatistieken-API.php) by paying 60 euro/month.

Like for the NS API, for both weather APIs you need to subscribe in order to obtain an API key.

### Current Weather using [Weerlive.nl](https://weerlive.nl/index.php)

Here we make a request about the current weather in Tilburg.


In [30]:
request_url = "https://weerlive.nl/api/json-data-10min.php?key="+keys.weerlive_key

params = {"locatie":"Tilburg"}

response = requests.get(request_url, 
                        params=params)
data = response.json()

data


{'liveweer': [{'plaats': 'Tilburg',
   'temp': '5.4',
   'gtemp': '2.3',
   'samenv': 'Onbewolkt',
   'lv': '50',
   'windr': 'NNO',
   'windms': '4',
   'winds': '3',
   'windk': '7.8',
   'windkmh': '14.4',
   'luchtd': '1031.0',
   'ldmmhg': '773',
   'dauwp': '-5',
   'zicht': '45',
   'verw': 'Zonnige perioden, morgen in het noorden meer bewolking',
   'sup': '07:13',
   'sunder': '18:29',
   'image': 'zonnig',
   'd0weer': 'halfbewolkt',
   'd0tmax': '6',
   'd0tmin': '-2',
   'd0windk': '2',
   'd0windknp': '6',
   'd0windms': '3',
   'd0windkmh': '11',
   'd0windr': 'NO',
   'd0neerslag': '4',
   'd0zon': '77',
   'd1weer': 'halfbewolkt',
   'd1tmax': '6',
   'd1tmin': '0',
   'd1windk': '2',
   'd1windknp': '4',
   'd1windms': '2',
   'd1windkmh': '7',
   'd1windr': 'N',
   'd1neerslag': '20',
   'd1zon': '40',
   'd2weer': 'bewolkt',
   'd2tmax': '4',
   'd2tmin': '0',
   'd2windk': '2',
   'd2windknp': '4',
   'd2windms': '2',
   'd2windkmh': '7',
   'd2windr': 'N',
   'd2ne

In [31]:
len(data['liveweer'][0])

49

As answer to your request you get 48 variables related to the weather in the chosen location which include among other temperature, some information about the wind, visibility, when the sun sets and rises.

### Hourly Weather Forecasts using [Meteoserver](https://meteoserver.nl/)

In [32]:
request_url = "https://data.meteoserver.nl/api/uurverwachting.php?key="+keys.meteoserver_key

params = {"locatie":"Tilburg"}

response = requests.get(request_url, 
                        params=params)
data = response.json()

data


{'plaatsnaam': [{'plaats': 'Tilburg'}],
 'data': [{'tijd': '1614960000',
   'tijd_nl': '05-03-2021 18:00',
   'offset': '11',
   'loc': 'none',
   'temp': '6',
   'winds': '4',
   'windb': '3',
   'windknp': '8',
   'windkmh': '14.4',
   'windr': '45',
   'windrltr': 'NO',
   'vis': '50000',
   'neersl': '0',
   'luchtd': '1030.9',
   'luchtdmmhg': '773.2',
   'luchtdinhg': '30.44',
   'hw': '0',
   'mw': '0',
   'lw': '0',
   'tw': '0',
   'rv': '42',
   'gr': '88',
   'cape': '-',
   'cond': '5',
   'ico': '7',
   'samenv': 'Zonnig',
   'icoon': 'zonnig'},
  {'tijd': '1614963600',
   'tijd_nl': '05-03-2021 19:00',
   'offset': '12',
   'loc': 'none',
   'temp': '5',
   'winds': '3',
   'windb': '2',
   'windknp': '6',
   'windkmh': '10.8',
   'windr': '45',
   'windrltr': 'NO',
   'vis': '50000',
   'neersl': '0',
   'luchtd': '1031',
   'luchtdmmhg': '773.3',
   'luchtdinhg': '30.45',
   'hw': '0',
   'mw': '0',
   'lw': '0',
   'tw': '0',
   'rv': '41',
   'gr': '38',
   'cape': '-

# Twitter API
Now is time to explore a bit the Twitter API. You will notice that it differs a bit in relation to the ones we have explored so far. For example, for the previous ones we have API keys. For the Twitter API we have keys and access tokens.

We will also need to use some package to help with the authentication process, i.e., an authentication handler such as [Tweepy](https://www.tweepy.org/) or [`python-twitter`](https://python-twitter.readthedocs.io/en/latest/index.html). The json file obtained is also a bit more nested and complex. There are many different fields including information such as tweet text, user, language, time of tweet, location etc.

To gain access to the Twitter API, one needs to create a Twitter account, in case you don’t already have one. Then log into the [Twitter Apps]( https://developer.twitter.com/en) and Apply(https://developer.twitter.com/en/apply-for-access). After that you just need to agree to some terms and conditions to have available your keys and access tokens. These are the authentication credentials that will allow you to access the Twitter API.

There are different Twitter APIs, like for instances, the REST (Representational State Transfer) API which read and write Twitter data, and the Streaming API. The public streaming API streams the public data flowing through Twitter.

In this tutorial we use the REST API to collect tweets from users as well as tweets obtained by using queries. 

Summarizing, in this section we:

1. Collect tweets from the user timeline (`GetUserTimeline`)
2. Collect tweets using queries (`GetSearch`)
3. Select which information from the data retrieved will be kept
4. Save collected data in .csv file

From the previous mentioned Python packages, I'm using [`python-twitter`](https://python-twitter.readthedocs.io/en/latest/index.html), a python wrapper around the Twitter API.


## The Twitter API and authentication

**Attention**: `private_twitter_credentials.py` contains my Twitter credentials. Insert your Twitter credentials in `twitter_credentials.py`
and replace `private_twitter_credentials` by `twitter_credentials` in this notebook.

In [2]:
import private_twitter_credentials
import twitter
import datetime
import time

TodaysDate = time.strftime("%Y-%m-%d-%H-%M")

### Seeting up twitter authentication

I'll be using [`python-twitter`](https://python-twitter.readthedocs.io/en/latest/index.html) a python wrapper around the Twitter API.

In [3]:
consumer_key = private_twitter_credentials.consumer_key
consumer_secret = private_twitter_credentials.consumer_secret
access_token = private_twitter_credentials.access_token
access_token_secret = private_twitter_credentials.access_token_secret

api = twitter.Api(
    consumer_key         =   consumer_key,
    consumer_secret      =   consumer_secret,
    access_token_key     =   access_token,
    access_token_secret  =   access_token_secret,
    tweet_mode = 'extended' # to ensure that we get the full text of the users' original tweets
)

## Getting some Tweets

In this project we want to :

* Access user timeline Tweets, i.e., apply `GetUserTimeline` method on object `api` created in last section
* Access Tweets resulting from some query, i.e., apply `GetSearch` method on `api`.

Consider we want to get some Tweets from `Mark Rutte`, that continues to be in the spotlight during this Corona crises. The account’s Twitter handle of `Mark Rutte` is `@markrutte` and we use the argument `screen_name` as the handle without `@`.

In [4]:
# we will use for now `screen_name` and `count`. Count has limit of 200, i.e., we can only retrieve 200 tweets per call
tweets_mrutte = api.GetUserTimeline(screen_name='markrutte', count = 200)

# see all available info in a dictionary

tweets_mrutte = [ _.AsDict() for _ in tweets_mrutte]

In [5]:
tweets_mrutte

[{'created_at': 'Thu Mar 11 08:28:14 +0000 2021',
  'favorite_count': 111,
  'full_text': 'We zijn nu bezig de eindstreep van de crisis te halen, maar ik wil ook vooruitkijken. \n\nMet de Volkskrant sprak ik onder andere over mijn plannen voor een nieuwe kerncentrale en mijn ideeën over migratie.\nhttps://t.co/NAnGtOLBAp',
  'hashtags': [],
  'id': 1369928136759119875,
  'id_str': '1369928136759119875',
  'lang': 'nl',
  'retweet_count': 20,
  'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>',
  'urls': [{'expanded_url': 'https://www.volkskrant.nl/nieuws-achtergrond/mark-rutte-bij-een-nieuwe-vluchtelingencrisis-moeten-we-bereid-zijn-de-grenzen-te-sluiten~bb6231870/',
    'url': 'https://t.co/NAnGtOLBAp'}],
  'user': {'created_at': 'Sun Oct 12 15:24:42 +0000 2008',
   'description': 'Samen naar de eindstreep. En verder. Lees mijn brief aan alle Nederlanders in de link.',
   'favourites_count': 1,
   'followers_count': 140164,
   'friends_count': 1,
   '

Checking the 1st one:

In [6]:
tweets_mrutte[0]

{'created_at': 'Thu Mar 11 08:28:14 +0000 2021',
 'favorite_count': 111,
 'full_text': 'We zijn nu bezig de eindstreep van de crisis te halen, maar ik wil ook vooruitkijken. \n\nMet de Volkskrant sprak ik onder andere over mijn plannen voor een nieuwe kerncentrale en mijn ideeën over migratie.\nhttps://t.co/NAnGtOLBAp',
 'hashtags': [],
 'id': 1369928136759119875,
 'id_str': '1369928136759119875',
 'lang': 'nl',
 'retweet_count': 20,
 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>',
 'urls': [{'expanded_url': 'https://www.volkskrant.nl/nieuws-achtergrond/mark-rutte-bij-een-nieuwe-vluchtelingencrisis-moeten-we-bereid-zijn-de-grenzen-te-sluiten~bb6231870/',
   'url': 'https://t.co/NAnGtOLBAp'}],
 'user': {'created_at': 'Sun Oct 12 15:24:42 +0000 2008',
  'description': 'Samen naar de eindstreep. En verder. Lees mijn brief aan alle Nederlanders in de link.',
  'favourites_count': 1,
  'followers_count': 140164,
  'friends_count': 1,
  'id': 16708728,
  

Second, we obtain Tweets based on some query, i.e., apply `GetSearch` method on `api`.
Because of the lockdown many sectors of the economy are suffering. During the last weeks there were many reactions, for example, connected with the catering sector since restaurants, bars and similar were closed again. So, let's say that one wants to search Tweets that mention `horeca` and `COVID-19`. In this case you use the method `GetSearch` using your query as argument. 

The easiest way to have the query right is going to [Twitter’s Advanced Search](https://twitter.com/search-advanced) and typing what you want to know. Then using as your `raw_query` the part of search URL after the "?", removing the `&src=typd` portion.

Let's try it out.

The URL I get is:


Therefore, I use `raw_query = q=covid-19%2C%20horeca`.

In [7]:
search_result = api.GetSearch(raw_query='q=covid-19%2C%20horeca')
search_result = [ _.AsDict() for _ in search_result]

In [8]:
search_result

[{'created_at': 'Fri Mar 12 06:30:55 +0000 2021',
  'full_text': 'Branża piekarnicza czeka na odbicie w HoReCa\n\nPandemia COVID-19 spowodowała, że w branży piekarniczej w zeszłym roku odnotowano spadek zamówień z hoteli…\n\nhttps://t.co/xzXFoj2n7A https://t.co/g4b8ybkzzL',
  'hashtags': [],
  'id': 1370261000453242880,
  'id_str': '1370261000453242880',
  'lang': 'pl',
  'media': [{'display_url': 'pic.twitter.com/g4b8ybkzzL',
    'expanded_url': 'https://twitter.com/finansovo/status/1370261000453242880/photo/1',
    'id': 1370260981293678593,
    'media_url': 'http://pbs.twimg.com/media/EwQlGj-WgAEqRPa.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/EwQlGj-WgAEqRPa.jpg',
    'sizes': {'medium': {'w': 936, 'h': 312, 'resize': 'fit'},
     'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'small': {'w': 680, 'h': 227, 'resize': 'fit'},
     'large': {'w': 936, 'h': 312, 'resize': 'fit'}},
    'type': 'photo',
    'url': 'https://t.co/g4b8ybkzzL'}],
  'source': '<a href="ht

In [9]:
len(search_result)

15

The results from `GetSearch` are limited to 7 days. In the last 7 days we got 15 Tweets that mentioned `horeca` and `covid-19`. Details of the most recent is shown below:

In [10]:
search_result[0]

{'created_at': 'Fri Mar 12 06:30:55 +0000 2021',
 'full_text': 'Branża piekarnicza czeka na odbicie w HoReCa\n\nPandemia COVID-19 spowodowała, że w branży piekarniczej w zeszłym roku odnotowano spadek zamówień z hoteli…\n\nhttps://t.co/xzXFoj2n7A https://t.co/g4b8ybkzzL',
 'hashtags': [],
 'id': 1370261000453242880,
 'id_str': '1370261000453242880',
 'lang': 'pl',
 'media': [{'display_url': 'pic.twitter.com/g4b8ybkzzL',
   'expanded_url': 'https://twitter.com/finansovo/status/1370261000453242880/photo/1',
   'id': 1370260981293678593,
   'media_url': 'http://pbs.twimg.com/media/EwQlGj-WgAEqRPa.jpg',
   'media_url_https': 'https://pbs.twimg.com/media/EwQlGj-WgAEqRPa.jpg',
   'sizes': {'medium': {'w': 936, 'h': 312, 'resize': 'fit'},
    'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
    'small': {'w': 680, 'h': 227, 'resize': 'fit'},
    'large': {'w': 936, 'h': 312, 'resize': 'fit'}},
   'type': 'photo',
   'url': 'https://t.co/g4b8ybkzzL'}],
 'source': '<a href="https://mobile.twitt

When retrieving Tweets from user timeline Twitter API limits us to 200 Tweets at a time, and from search to 100 tweets. This is the parameter `count` from both methods, `GetUserTimeline` and `GetSearch`.

Another constraint that we need to deal with is that the Twitter API is rate limited, meaning Twitter puts restrictions on how much data you can take at a time. More details about it [here](https://developer.twitter.com/en/docs/basics/rate-limiting#:~:text=Rate%20limiting%20of%20the%20standard,per%20window%20per%20access%20token.).

Because I want to retrieve much more than 200 Tweets (if possible) I'll write a class with two methods.

Class `TweetMiner` contains two methods: 

* `mine_user_tweets` which mine user's tweets making use of [GetUserTimeline](https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline)

* `search_tweets` which mine tweets using [GetSearch](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets)

Details about both uses will be show respectively, in sections:

- [Getting twitter by user](#Getting-twitter-by-user)

- [Applying GetSearch to search for a defined query](#Applying-GetSearch-to-search-for-a-defined-query)


Notice that in this class I'm also selecting which information from the Tweets I want to keep in a form of a dictionary. I'm almost collecting everything. Probably for the purpose I have in mind now I'll not be using all this but it was my choice to keep what I kept. Feel free to adapt it.

Next, you find a function where I use the list of dictionaries obtained from apply my `TweetMiner`, organize it a bit and save the result in a .csv file


In [11]:
import datetime

class TweetMiner(object):
    """ Make possible obtaining tweets using twitter user id (mine_user_tweets) or performing a standard Twitter 
    API search (search_tweets)"""

    
    def __init__(self, api, result_limit = 20, max_pages = 40):
        """result_limit = count that can take max 200 (mine_user_tweets) and max 100 (search_tweets)"""
        
        self.api = api        
        self.result_limit = result_limit
        self.max_pages = max_pages
        

    def mine_user_tweets(self, user, mine_retweets=False):
        """ Mine tweets of user = screen_name or user_id"""

        data           =  []
        last_tweet_id  =  False
        page           =  1
        
        while page <= self.max_pages:
            
            if last_tweet_id:
                statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.result_limit, max_id=last_tweet_id - 1, 
                                                        include_rts=mine_retweets)
                statuses = [ _.AsDict() for _ in statuses]
            else:
                statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.result_limit, 
                                                        include_rts=mine_retweets)
                statuses = [_.AsDict() for _ in statuses]
                
            for item in statuses:
                # Using try except here.
                # When retweets = 0 we get an error (GetUserTimeline fails to create a key, 'retweet_count')
                try:
                    mined = {
                        'mined_at':         datetime.datetime.now(),
                        'created_at':       item['created_at'],
                        'tweet_id':         item['id'],
                        'tweet_id_str':     item['id_str'],
                        'screen_name':      item['user']['screen_name'],
                        'favorite_count':   item['favorite_count'],
                        'text':             item['full_text'],
                        'source':           item['source'],
                        'hashtags':         item['hashtags'],
                        'urls':             item['urls'],
                        'language':         item['lang'],
                        'retweet_count':    item['retweet_count'],
                        #user info
                        'user_favourites_count': item['user']['favourites_count'],
                        'followers_count':  item['user']['followers_count'],
                        'friends_count':    item['user']['friends_count']
                    }
            
                
                except:
                    mined = {
                        'mined_at':         datetime.datetime.now(),
                        'created_at':       item['created_at'],
                        'tweet_id':         item['id'],
                        'tweet_id_str':     item['id_str'],
                        'screen_name':      item['user']['screen_name'],
#                         'favorite_count':   item['favorite_count'],
                        'text':             item['full_text'],
                        'source':           item['source'],
                        'hashtags':         item['hashtags'],
                        'urls':             item['urls'],
                        'language':         item['lang'],
                        'retweet_count':    0,
                        # user info
                        'user_favourites_count': item['user']['favourites_count'],
                        'followers_count':  item['user']['followers_count'],
                        'friends_count':    item['user']['friends_count']
                        }
                
                last_tweet_id = item['id']
                data.append(mined)
                
            page += 1
            
        return data
    
    def search_tweets(max_pages = 20, count = 20, raw_query = None, result_type = 'mixed'):
        """ Search tweets """

        data           =  []
        last_tweet_id  =  False
        page           =  1
        
        while page <= max_pages:
            
            if last_tweet_id:
                statuses = api.GetSearch(raw_query=raw_query, count = count, result_type=result_type, 
                                         max_id=last_tweet_id - 1)
                statuses = [ _.AsDict() for _ in statuses]
            else:
                statuses = api.GetSearch(raw_query=raw_query, count = count, result_type=result_type)
                statuses = [_.AsDict() for _ in statuses]
                
            for item in statuses:
                # Using try except here.
                try:
                    mined = {
                        'mined_at':                datetime.datetime.now(),
                        'created_at':              item['created_at'],
                        'tweet_id':                item['id'],
                        'tweet_id_str':            item['id_str'],
                        'in_reply_to_screen_name': item['in_reply_to_screen_name'],
                        'in_reply_to_status_id':   item['in_reply_to_status_id'],
                        'in_reply_to_user_id':     item['in_reply_to_user_id'],
                        'language':                item['lang'],
                        'text':                    item['full_text'],
                        'hashtags':                item['hashtags'],
                        'source':                  item['source'],
                       # info about user
                        'user_id':                 item['user']['id'],
                        'user_screen_name':        item['user']['screen_name'],
                        'user_location':           item['user']['location'],
                        'user_favourites_count':   item['user']['favourites_count'],
                        'followers_count':         item['user']['followers_count'],
                        'friends_count':           item['user']['friends_count']
                    }
                    

                    
                except:
                    mined = {
                        'mined_at':                datetime.datetime.now(),
                        'created_at':              item['created_at'],
                        'tweet_id':                item['id'],
                        'tweet_id_str':            item['id_str'],
#                         'in_reply_to_screen_name': item['in_reply_to_screen_name'],
#                         'in_reply_to_status_id':   item['in_reply_to_status_id'],
#                         'in_reply_to_user_id':     item['in_reply_to_user_id'],
                        'language':                item['lang'],
                        'text':                    item['full_text'],
                        'hashtags':                item['hashtags'],
                        'source':                  item['source'],
                       # info about user
                        'user_id':                 item['user']['id'],
                        'user_screen_name':        item['user']['screen_name'],
#                        'user_location':           item['user']['location'],
#                         'user_favourites_count':   item['user']['favourites_count'],
                        'followers_count':         item['user']['followers_count'],
                        'friends_count':           item['user']['friends_count']
                    }
                                            
                
                last_tweet_id = item['id']
                data.append(mined)
                
            page += 1
            
        return data

In [12]:
def process_and_save(df, file_name, mine_user_twitter=1):
    """ Save retrieved tweets in csv file.
    
    Input:
    
    df : dataframe of tweets'data
    file_name: name with which the csv will be saved (without extension)
    mine_user_twitter: Indicates if df came contains tweets from a twitter user (mine_user_twitter=1), i.e., it was 
    obtained using GetUserTimeline since the information obtained from this method is different from an API search 
    from GetSearch. Therefore, when using GetSearch (mine_user_twitter=0).
    
    """
    
    TodaysDate = time.strftime("%Y-%m-%d-%H-%M")

    
    # Create columns 'year', 'month', 'day', 'hour', 'min' from 'created_at'
    df['created_at'] = pd.to_datetime(df['created_at'])
        
    df['year'] = df['created_at'].dt.year 
    df['month'] = df['created_at'].dt.month 
    df['day'] = df['created_at'].dt.day 
    df['hour'] = df['created_at'].dt.hour 
    df['minute'] = df['created_at'].dt.minute
    df['day_of_week'] = df['created_at'].dt.weekday
    
    if mine_user_twitter:
    
        df = df[['mined_at', 'created_at', 'year', 'month', 'day','day_of_week', 'hour', 'minute', 'screen_name', 
                 'tweet_id', 'tweet_id_str',  'retweet_count', 'favorite_count', 'source','hashtags', 'urls', 'language', 
                 'user_favourites_count', 'followers_count','friends_count','text']]
    else:
        
        try:
            df = df[['mined_at','created_at', 'year', 'month', 'day','day_of_week','hour', 'minute', 'tweet_id', 'tweet_id_str', 
                 'in_reply_to_screen_name','in_reply_to_status_id','in_reply_to_user_id', 'hashtags','source','language', 
                 'user_screen_name','user_id','user_location','user_favourites_count','followers_count','friends_count', 
                 'text']]
        except:
            df = df[['mined_at','created_at', 'year', 'month', 'day','day_of_week','hour', 'minute', 'tweet_id', 
                     'tweet_id_str','hashtags','source','language', 'user_screen_name','user_id','followers_count',
                     'friends_count', 'text']]
        
    
    df.sort_values(by='created_at',inplace = True)
    
    # the normal use of drop_duplicates raise typeerror unhashable type 'list' so I needed to adapt using str
    # as pointed out in https://stackoverflow.com/questions/43855462/pandas-drop-duplicates-method-not-working?rq=1
    
    df = df.loc[df.astype(str).drop_duplicates(subset=['created_at','tweet_id','text']).index]
    df.reset_index(drop = True, inplace = True)
    
    
    df.to_csv("../data/tweets/"+file_name+"_"+TodaysDate+".csv", index = False)
    
    return df

Now that we have our class to retrieve Tweets and our function to save the result in .csv let's collect data.

You can use this data, for example, to analyze the sentiment of the users towards `Mark Rutte` or anything else that you decided to point your research too, like how you users perceive your product or service. 

We would like to collect as much data as possible so we can have more confidence in our analysis. However, when using Twitter API we need to consider some limitations.

First, it is difficult to have control on how far in the past we can go when retrieving user timeline data (using `TweetMiner. mine_user_tweets()`. So, we will play mainly with parameters `result_limit`, i.e., count and `max_pages`.

Second, when performing search with API it is only possible to access Tweets of the last 7 days (more details [here](https://developer.twitter.com/en/docs/tweets/search/overview/standard)). So unfortunately, when performing queries, we will be limited and we will not be able to go back months ago.

## Getting twitter by user

To start we will obtain tweets for `Mark Rutte`.

### @markrutte

In [13]:
# Result limit == count parameter from our GetUserTimeline() it can take max 200
# More pages more back in time you can go
miner = TweetMiner(api, result_limit=200, max_pages = 30)
markrutte = miner.mine_user_tweets(user="markrutte")
df_markrutte = process_and_save(pd.DataFrame(markrutte), "markrutte")

In [14]:
df_markrutte

Unnamed: 0,mined_at,created_at,year,month,day,day_of_week,hour,minute,screen_name,tweet_id,tweet_id_str,retweet_count,favorite_count,source,hashtags,urls,language,user_favourites_count,followers_count,friends_count,text
0,2021-03-12 09:34:05.816678,2010-03-11 10:17:31+00:00,2010,3,11,3,10,17,markrutte,10316213327,10316213327,82,146,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",[],[],nl,1,140164,1,"Voor nieuws rondom de VVD, andere Kamerleden o..."
1,2021-03-12 09:34:05.816678,2011-03-31 15:04:20+00:00,2011,3,31,3,15,4,markrutte,53472702136201216,53472702136201216,76,180,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",[],[],nl,1,140164,1,Blijf op de hoogte van het laatste nieuws rond...
2,2021-03-12 09:34:05.816678,2017-02-12 08:36:45+00:00,2017,2,12,6,8,36,markrutte,830697137415581696,830697137415581696,739,1227,"<a href=""http://twitter.com/download/iphone"" r...",[],[],nl,1,140164,1,"Nul procent, Geert. NUL procent.\n\nHet.\nGaat..."
3,2021-03-12 09:34:05.816678,2017-02-17 18:04:45+00:00,2017,2,17,4,18,4,markrutte,832652018472800256,832652018472800256,23,90,"<a href=""http://twitter.com/download/iphone"" r...",[],"[{'expanded_url': 'http://bit.ly/2kQz2KB', 'ur...",nl,1,140164,1,Samen met e-sporter van @AFCAjax Koen Weijland...
4,2021-03-12 09:34:05.816678,2017-02-25 11:14:56+00:00,2017,2,25,5,11,14,markrutte,835447984859009024,835447984859009024,223,408,"<a href=""http://twitter.com/download/iphone"" r...",[],[],nl,1,140164,1,Ik kies voor optimisme! https://t.co/xJtaFCSzzD
5,2021-03-12 09:34:05.816678,2017-03-02 13:15:44+00:00,2017,3,2,3,13,15,markrutte,837290327576887297,837290327576887297,358,1236,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",[],[{'expanded_url': 'https://www.facebook.com/ma...,nl,1,140164,1,"""Hebben jullie enig idee wat ik de hele dag do..."
6,2021-03-12 09:34:05.816678,2017-03-09 07:54:20+00:00,2017,3,9,3,7,54,markrutte,839746158666870784,839746158666870784,43,132,"<a href=""http://twitter.com/download/iphone"" r...",[],[{'expanded_url': 'https://twitter.com/apechto...,nl,1,140164,1,"Eens. Zorgvuldig, maar geen vertraging. https:..."
7,2021-03-12 09:34:05.816678,2017-03-11 09:44:29+00:00,2017,3,11,5,9,44,markrutte,840498654041559040,840498654041559040,220,502,"<a href=""http://twitter.com/download/iphone"" r...",[{'text': 'kiesmaar'}],[{'expanded_url': 'https://twitter.com/geertwi...,nl,1,140164,1,Gaan we voor chaos en een wegloper? \n\nOf bou...
8,2021-03-12 09:34:05.816678,2017-03-14 10:33:49+00:00,2017,3,14,1,10,33,markrutte,841598232434266112,841598232434266112,284,808,"<a href=""https://studio.twitter.com"" rel=""nofo...",[],[],nl,1,140164,1,Ik barst van de energie om door te gaan als mi...
9,2021-03-12 09:34:05.816678,2017-06-14 13:52:15+00:00,2017,6,14,2,13,52,markrutte,874987852391751681,874987852391751681,47,153,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",[],"[{'expanded_url': 'http://facebook.com/VVD', '...",nl,1,140164,1,Vanavond praat ik je graag bij over de formati...


In [15]:
df_markrutte.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   mined_at               35 non-null     datetime64[ns]     
 1   created_at             35 non-null     datetime64[ns, UTC]
 2   year                   35 non-null     int64              
 3   month                  35 non-null     int64              
 4   day                    35 non-null     int64              
 5   day_of_week            35 non-null     int64              
 6   hour                   35 non-null     int64              
 7   minute                 35 non-null     int64              
 8   screen_name            35 non-null     object             
 9   tweet_id               35 non-null     int64              
 10  tweet_id_str           35 non-null     object             
 11  retweet_count          35 non-null     int64              
 

In [16]:
min(df_markrutte.created_at),max(df_markrutte.created_at)

(Timestamp('2010-03-11 10:17:31+0000', tz='UTC'),
 Timestamp('2021-03-11 08:28:14+0000', tz='UTC'))

The first register is from March 11th, 2010 and the most recent is from 22nd, February 2021.

Only 20 registers were retrieved.

In [17]:
print("Mark Rutte's followers", df_markrutte.loc[df_markrutte.shape[0]-1,'followers_count'])
print("Mark Rutte's friends", df_markrutte.loc[df_markrutte.shape[0]-1,'friends_count'])

Mark Rutte's followers 140164
Mark Rutte's friends 1


Only one friend?? 😳

What about [`JADS`](https://www.jads.nl/)?

In [18]:
jads = miner.mine_user_tweets(user="jadatascience")
df_jads = process_and_save(pd.DataFrame(jads), "jads")

In [19]:
df_jads.head()

Unnamed: 0,mined_at,created_at,year,month,day,day_of_week,hour,minute,screen_name,tweet_id,tweet_id_str,retweet_count,favorite_count,source,hashtags,urls,language,user_favourites_count,followers_count,friends_count,text
0,2021-03-12 09:34:11.457002,2016-04-07 15:24:52+00:00,2016,4,7,3,15,24,jadatascience,718097216255180800,718097216255180800,7,4.0,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",[],"[{'expanded_url': 'http://www.jads.nl/', 'url'...",en,547,1532,298,Accreditation @NVAO for our BSc Data Science a...
1,2021-03-12 09:34:11.457002,2016-04-26 14:49:36+00:00,2016,4,26,1,14,49,jadatascience,724973711380611072,724973711380611072,4,5.0,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",[],[{'expanded_url': 'http://www.jads.nl/ds-e-eve...,en,547,1532,298,Join our Data Science and Entrepreneurship eve...
2,2021-03-12 09:34:11.457002,2016-04-27 18:09:58+00:00,2016,4,27,2,18,9,jadatascience,725386520178335744,725386520178335744,7,3.0,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",[{'text': 'datascience'}],"[{'expanded_url': 'http://www.jads.nl/campus',...",en,547,1532,298,A former convent in @shertogenbosch turns into...
3,2021-03-12 09:34:11.457002,2016-04-28 16:59:57+00:00,2016,4,28,3,16,59,jadatascience,725731291224748032,725731291224748032,11,6.0,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",[],[{'expanded_url': 'https://app.studielink.nl/'...,en,547,1532,298,Now open for registration @infostudielink: Joi...
4,2021-03-12 09:34:11.457002,2016-04-29 06:11:10+00:00,2016,4,29,4,6,11,jadatascience,725930405325471746,725930405325471746,12,2.0,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",[],[{'expanded_url': 'http://www.jads.nl/bachelor...,en,547,1532,298,Goodmorning! Now open for registration in Stud...


In [20]:
df_jads.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297 entries, 0 to 296
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   mined_at               297 non-null    datetime64[ns]     
 1   created_at             297 non-null    datetime64[ns, UTC]
 2   year                   297 non-null    int64              
 3   month                  297 non-null    int64              
 4   day                    297 non-null    int64              
 5   day_of_week            297 non-null    int64              
 6   hour                   297 non-null    int64              
 7   minute                 297 non-null    int64              
 8   screen_name            297 non-null    object             
 9   tweet_id               297 non-null    int64              
 10  tweet_id_str           297 non-null    object             
 11  retweet_count          297 non-null    int64              

In [21]:
min(df_jads.created_at),max(df_jads.created_at)

(Timestamp('2016-04-07 15:24:52+0000', tz='UTC'),
 Timestamp('2021-02-08 10:31:16+0000', tz='UTC'))

We went back until April 7th, 2016. And the last data collected was on February 8th, 2021. 297 entries were obtained.

In [22]:
print("JADS's followers", df_jads.loc[df_jads.shape[0]-1,'followers_count'])
print("JADS's friends", df_jads.loc[df_jads.shape[0]-1,'friends_count'])

JADS's followers 1532
JADS's friends 298


We have more friends but less followers than Rutte. Curious about [`MKB datalab`](https://jadsmkbdatalab.nl)?

In [23]:
jadsmkbdatalab = miner.mine_user_tweets(user="jadsmkbdatalab")
df_jadsmkbdatalab = process_and_save(pd.DataFrame(jadsmkbdatalab), "jadsmkbdatalab")

In [24]:
df_jadsmkbdatalab.head()

Unnamed: 0,mined_at,created_at,year,month,day,day_of_week,hour,minute,screen_name,tweet_id,tweet_id_str,retweet_count,favorite_count,source,hashtags,urls,language,user_favourites_count,followers_count,friends_count,text
0,2021-03-12 09:34:16.381542,2019-04-25 13:50:08+00:00,2019,4,25,3,13,50,jadsmkbdatalab,1121411069040365570,1121411069040365570,3,3.0,"<a href=""http://twuffer.com"" rel=""nofollow"">Tw...",[],"[{'expanded_url': 'https://bit.ly/2J1vl3k', 'u...",nl,336,218,1479,"Platform Driven by Data, waar JADS MKB Datalab..."
1,2021-03-12 09:34:16.381542,2019-04-25 13:52:30+00:00,2019,4,25,3,13,52,jadsmkbdatalab,1121411666032373761,1121411666032373761,3,5.0,"<a href=""https://mobile.twitter.com"" rel=""nofo...",[],[{'expanded_url': 'https://www.jadsmkbdatalab....,nl,336,218,1479,Met het MKB Datalab helpt @Jadatascience onder...
2,2021-03-12 09:34:16.381542,2019-04-26 12:59:42+00:00,2019,4,26,4,12,59,jadsmkbdatalab,1121760767626350592,1121760767626350592,2,5.0,"<a href=""https://postfity.com"" rel=""nofollow"">...",[{'text': 'datascience'}],[],nl,336,218,1479,Vandaag werken we met vier bedrijven (en 9 stu...
3,2021-03-12 09:34:16.381542,2019-04-29 06:39:27+00:00,2019,4,29,0,6,39,jadsmkbdatalab,1122752238278070274,1122752238278070274,2,3.0,"<a href=""http://twitter.com/download/iphone"" r...",[{'text': 'datascience'}],[{'expanded_url': 'https://twitter.com/Dinalog...,nl,336,218,1479,We ontvangen zo een groep ondernemers uit de l...
4,2021-03-12 09:34:16.381542,2019-04-30 14:33:48+00:00,2019,4,30,1,14,33,jadsmkbdatalab,1123233999428755458,1123233999428755458,1,1.0,"<a href=""https://postfity.com"" rel=""nofollow"">...",[],"[{'expanded_url': 'https://pos.li/2bu1wf', 'ur...",nl,336,218,1479,Lees hier hoe Bakkerscafé Royal de personeelsp...


In [25]:
df_jadsmkbdatalab.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   mined_at               30 non-null     datetime64[ns]     
 1   created_at             30 non-null     datetime64[ns, UTC]
 2   year                   30 non-null     int64              
 3   month                  30 non-null     int64              
 4   day                    30 non-null     int64              
 5   day_of_week            30 non-null     int64              
 6   hour                   30 non-null     int64              
 7   minute                 30 non-null     int64              
 8   screen_name            30 non-null     object             
 9   tweet_id               30 non-null     int64              
 10  tweet_id_str           30 non-null     object             
 11  retweet_count          30 non-null     int64              
 

In [26]:
min(df_jadsmkbdatalab.created_at),max(df_jadsmkbdatalab.created_at)

(Timestamp('2019-04-25 13:50:08+0000', tz='UTC'),
 Timestamp('2021-01-27 07:34:49+0000', tz='UTC'))

The first date we retrieved was April 25th, 2019, and the last data January 27th, 2021. 30 entries were obtained.

In [27]:
print("MKB datalab's followers", df_jadsmkbdatalab.loc[df_jadsmkbdatalab.shape[0]-1,'followers_count'])
print("MKB datalab's friends", df_jadsmkbdatalab.loc[df_jadsmkbdatalab.shape[0]-1,'friends_count'])

MKB datalab's followers 218
MKB datalab's friends 1479


Wow we are doing good about friends! 🤩

Let's work in getting more followers!

## Applying GetSearch to search for a defined query


Twitter’s search parameters are a bit complex, to perform a particular search, you can consult Twitter’s documentation at https://dev.twitter.com/rest/public/search.

As said before, an easier way is to make use of [Twitter’s Advanced Search](https://twitter.com/search-advanced), and then use the part of search URL after the "?" to use `raw_query`, removing the `&src=type` portion.

In this section we will perform some queries including `covid-19` and `horeca` which a hot topic the last weeks. We will also include two queries about `vaccinatiepassport` which is appearing a lot at the [news lately](https://www.dw.com/en/eu-vaccine-passport-an-ethical-and-legal-minefield/a-56747519).

In addition, the results obtained from Twitter API searches against a sampling of recent Tweets published in the past 7 days (more details [here](https://developer.twitter.com/en/docs/tweets/search/overview/standard)). So, the results you see in the website when using `Twitter’s Advanced Search` will not per se be in the result of the API search.

### Query 1: `horeca` & `covid-19`

Using all these words we get: https://twitter.com/search?q=horeca%2C%20covid-19&src=typed_query


In [28]:
query_01 = 'q=horeca%1C%10covid-19'
result_01 = TweetMiner.search_tweets(raw_query = query_01)
len(result_01)

300

In [29]:
df_query_01 = process_and_save(pd.DataFrame(result_01),"query_01",mine_user_twitter=0)
df_query_01.head()

Unnamed: 0,mined_at,created_at,year,month,day,day_of_week,hour,minute,tweet_id,tweet_id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,hashtags,source,language,user_screen_name,user_id,user_location,user_favourites_count,followers_count,friends_count,text
0,2021-03-12 09:34:23.519406,2021-03-10 09:26:14+00:00,2021,3,10,2,9,26,1369580344136044546,1369580344136044546,,,,"[{'text': 'Covid_19'}, {'text': 'Horeca'}, {'t...","<a href=""https://mobile.twitter.com"" rel=""nofo...",es,horecacadiz,1364548797540491271,,,38,145,Diputación repartirá 50 millones de euros en l...
1,2021-03-12 09:34:25.836627,2021-03-10 13:13:32+00:00,2021,3,10,2,13,13,1369637547727552515,1369637547727552515,,,,"[{'text': 'restaurant'}, {'text': 'Brussels'},...","<a href=""https://mobile.twitter.com"" rel=""nofo...",en,DiageoEU,1253676027823271938,,,1560,762,Watch Yen Pham of Yi Chan #restaurant in #Brus...
2,2021-03-12 09:34:24.218334,2021-03-10 15:00:09+00:00,2021,3,10,2,15,0,1369664378992201729,1369664378992201729,,,,"[{'text': 'restaurant'}, {'text': 'Brussels'},...","<a href=""https://www.hootsuite.com"" rel=""nofol...",en,OHoreca,1344325939916894209,,,20,121,RT @DiageoEU: Watch Yen Pham of Yi Chan #resta...
3,2021-03-12 09:34:24.936415,2021-03-10 17:01:23+00:00,2021,3,10,2,17,1,1369694885637283841,1369694885637283841,,,,"[{'text': 'foodistribution'}, {'text': 'horeca...","<a href=""https://metricool.com"" rel=""nofollow""...",ca,bgrup,312015239,,,201,90,¿Qué tipos de menú deberíamos utilizar en époc...
4,2021-03-12 09:34:25.836627,2021-03-10 17:15:44+00:00,2021,3,10,2,17,15,1369698499764355080,1369698499764355080,,,,[],"<a href=""http://twitter.com/download/android"" ...",nl,Roseforyou007,109992593,,,331,1914,RT @ZZPLoosduinen: TESTBEWIJZEN VOOR TOEGANG T...


In [30]:
df_query_01.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype              
---  ------                   --------------  -----              
 0   mined_at                 15 non-null     datetime64[ns]     
 1   created_at               15 non-null     datetime64[ns, UTC]
 2   year                     15 non-null     int64              
 3   month                    15 non-null     int64              
 4   day                      15 non-null     int64              
 5   day_of_week              15 non-null     int64              
 6   hour                     15 non-null     int64              
 7   minute                   15 non-null     int64              
 8   tweet_id                 15 non-null     int64              
 9   tweet_id_str             15 non-null     object             
 10  in_reply_to_screen_name  1 non-null      object             
 11  in_reply_to_status_id    1 non-nul

Note that we get 300 results but only 15 are unique.

### Query 2: `horeca` & `lockdown` & `covid-19`

Again all words: https://twitter.com/search?q=horeca%2C%20lockdown%2C%20covid-19&src=typed_query

In [31]:
query_02 = 'q=horeca%2C%20lockdown%2C%20covid'
result_02 = TweetMiner.search_tweets(raw_query = query_02)
len(result_02)

200

In [32]:
df_query_02 = process_and_save(pd.DataFrame(result_02),"query_02",mine_user_twitter=0)
df_query_02.head()

Unnamed: 0,mined_at,created_at,year,month,day,day_of_week,hour,minute,tweet_id,tweet_id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,hashtags,source,language,user_screen_name,user_id,user_location,user_favourites_count,followers_count,friends_count,text
0,2021-03-12 09:34:28.263873,2021-03-05 12:29:29+00:00,2021,3,5,4,12,29,1367814522162057221,1367814522162057221,,,,"[{'text': 'tiktok'}, {'text': 'hetkleinecafe'}...","<a href=""http://twitter.com/download/iphone"" r...",nl,NadiaPalesa,68967455,,,5325,1023,Zo of u nog een mening heeft? @PowNed #tiktok ...
1,2021-03-12 09:34:27.564400,2021-03-05 21:31:29+00:00,2021,3,5,4,21,31,1367950922198958081,1367950922198958081,bunnywonho_,1.367948e+18,7.351917e+17,[],"<a href=""http://twitter.com/download/android"" ...",en,wonhos3rdtattoo,2172471809,2️⃣1️⃣🏳️‍🌈 she/her,58271.0,330,271,@bunnywonho_ u know im suddenly very glad that...
2,2021-03-12 09:34:27.564400,2021-03-06 22:45:04+00:00,2021,3,6,5,22,45,1368331825110736898,1368331825110736898,,,,[],"<a href=""http://twitter.com/download/android"" ...",nl,AndreMeekes,313203705,,,501,823,Lockdown tot 2050... Horeca voor altijd geslot...
3,2021-03-12 09:34:26.868907,2021-03-07 11:00:26+00:00,2021,3,7,6,11,0,1368516886074384396,1368516886074384396,,,,"[{'text': 'HORECA'}, {'text': 'COVID'}, {'text...","<a href=""https://www.hootsuite.com"" rel=""nofol...",en,ElliottHygiene,976697785,,,1108,1792,As #HORECA organisations plan ahead for liftin...
4,2021-03-12 09:34:28.714530,2021-03-07 12:19:06+00:00,2021,3,7,6,12,19,1368536686007885825,1368536686007885825,,,,"[{'text': 'HORECA'}, {'text': 'COVID'}, {'text...","<a href=""http://twitter.com/download/iphone"" r...",en,only1mich1972,4827610234,,,44,126,RT @ElliottHygiene: As #HORECA organisations p...


In [33]:
df_query_02.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype              
---  ------                   --------------  -----              
 0   mined_at                 10 non-null     datetime64[ns]     
 1   created_at               10 non-null     datetime64[ns, UTC]
 2   year                     10 non-null     int64              
 3   month                    10 non-null     int64              
 4   day                      10 non-null     int64              
 5   day_of_week              10 non-null     int64              
 6   hour                     10 non-null     int64              
 7   minute                   10 non-null     int64              
 8   tweet_id                 10 non-null     int64              
 9   tweet_id_str             10 non-null     object             
 10  in_reply_to_screen_name  3 non-null      object             
 11  in_reply_to_status_id    3 non-null

### Query 3: `horeca` OR `lockdown` OR `covid-19`

And now using any of these words: https://twitter.com/search?q=(covid-19%2C%20OR%20horeca%2C%20OR%20lockdown)&src=typed_query

In [34]:
query_03 = 'q=horeca%2C%20lockdown%2C%20covid'
result_03 = TweetMiner.search_tweets(raw_query = query_03)
len(result_03)

200

In [35]:
df_query_03 = process_and_save(pd.DataFrame(result_03),"query_03",mine_user_twitter=0)
df_query_03.head()

Unnamed: 0,mined_at,created_at,year,month,day,day_of_week,hour,minute,tweet_id,tweet_id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,hashtags,source,language,user_screen_name,user_id,user_location,user_favourites_count,followers_count,friends_count,text
0,2021-03-12 09:34:33.689731,2021-03-05 12:29:29+00:00,2021,3,5,4,12,29,1367814522162057221,1367814522162057221,,,,"[{'text': 'tiktok'}, {'text': 'hetkleinecafe'}...","<a href=""http://twitter.com/download/iphone"" r...",nl,NadiaPalesa,68967455,,,5325,1023,Zo of u nog een mening heeft? @PowNed #tiktok ...
1,2021-03-12 09:34:32.995926,2021-03-05 21:31:29+00:00,2021,3,5,4,21,31,1367950922198958081,1367950922198958081,bunnywonho_,1.367948e+18,7.351917e+17,[],"<a href=""http://twitter.com/download/android"" ...",en,wonhos3rdtattoo,2172471809,2️⃣1️⃣🏳️‍🌈 she/her,58271.0,330,271,@bunnywonho_ u know im suddenly very glad that...
2,2021-03-12 09:34:32.995926,2021-03-06 22:45:04+00:00,2021,3,6,5,22,45,1368331825110736898,1368331825110736898,,,,[],"<a href=""http://twitter.com/download/android"" ...",nl,AndreMeekes,313203705,,,501,823,Lockdown tot 2050... Horeca voor altijd geslot...
3,2021-03-12 09:34:32.069367,2021-03-07 11:00:26+00:00,2021,3,7,6,11,0,1368516886074384396,1368516886074384396,,,,"[{'text': 'HORECA'}, {'text': 'COVID'}, {'text...","<a href=""https://www.hootsuite.com"" rel=""nofol...",en,ElliottHygiene,976697785,,,1108,1792,As #HORECA organisations plan ahead for liftin...
4,2021-03-12 09:34:34.121999,2021-03-07 12:19:06+00:00,2021,3,7,6,12,19,1368536686007885825,1368536686007885825,,,,"[{'text': 'HORECA'}, {'text': 'COVID'}, {'text...","<a href=""http://twitter.com/download/iphone"" r...",en,only1mich1972,4827610234,,,44,126,RT @ElliottHygiene: As #HORECA organisations p...


In [36]:
df_query_03.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype              
---  ------                   --------------  -----              
 0   mined_at                 10 non-null     datetime64[ns]     
 1   created_at               10 non-null     datetime64[ns, UTC]
 2   year                     10 non-null     int64              
 3   month                    10 non-null     int64              
 4   day                      10 non-null     int64              
 5   day_of_week              10 non-null     int64              
 6   hour                     10 non-null     int64              
 7   minute                   10 non-null     int64              
 8   tweet_id                 10 non-null     int64              
 9   tweet_id_str             10 non-null     object             
 10  in_reply_to_screen_name  3 non-null      object             
 11  in_reply_to_status_id    3 non-null

In [37]:
df_concat = pd.concat([df_query_01,df_query_02,df_query_03])

In [38]:
df_concat.drop_duplicates(subset=['created_at','text'],inplace=True)

In [39]:
df_concat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25 entries, 0 to 9
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype              
---  ------                   --------------  -----              
 0   mined_at                 25 non-null     datetime64[ns]     
 1   created_at               25 non-null     datetime64[ns, UTC]
 2   year                     25 non-null     int64              
 3   month                    25 non-null     int64              
 4   day                      25 non-null     int64              
 5   day_of_week              25 non-null     int64              
 6   hour                     25 non-null     int64              
 7   minute                   25 non-null     int64              
 8   tweet_id                 25 non-null     int64              
 9   tweet_id_str             25 non-null     object             
 10  in_reply_to_screen_name  4 non-null      object             
 11  in_reply_to_status_id    4 non-null

Putting all three queries together we get 23 unique tweets.

In [40]:
df_concat.to_csv("../data/tweets/concat_queries"+TodaysDate+".csv",index=False)

### Query 4: `vaccin` & `vaccinatie` & `covid-19`

A new query about covid-19 vaccin: https://twitter.com/search?lang=nl&q=vaccin%20vaccinatie%20covid-19&src=typed_query

In [41]:
query_04 = 'q=vaccin%20vaccinatie%20covid-19'
result_04 = TweetMiner.search_tweets(raw_query = query_04)
len(result_04)

300

In [42]:
df_query_04 = process_and_save(pd.DataFrame(result_04),"query_04",mine_user_twitter=0)
df_query_04.head()

Unnamed: 0,mined_at,created_at,year,month,day,day_of_week,hour,minute,tweet_id,tweet_id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,hashtags,source,language,user_screen_name,user_id,user_location,user_favourites_count,followers_count,friends_count,text
0,2021-03-12 09:34:38.800670,2021-03-11 21:13:23+00:00,2021,3,11,3,21,13,1370120692121927682,1370120692121927682,,,,[],"<a href=""https://mobile.twitter.com"" rel=""nofo...",nl,HenkJZoer,1363453132697702402,,,6,25,@MrsKrass @SimoneGezond @hugodejonge @EMA_News...
1,2021-03-12 09:34:41.376670,2021-03-11 21:36:36+00:00,2021,3,11,3,21,36,1370126535965954054,1370126535965954054,mzelst,1.370075e+18,98975049.0,[],"<a href=""https://mobile.twitter.com"" rel=""nofo...",nl,Nicole_SMA,1375211450,"Gelderland, Nederland",1911.0,177,387,@mzelst @YorickB @Maarten_vw Wanneer komt er e...
2,2021-03-12 09:34:39.589447,2021-03-12 01:33:06+00:00,2021,3,12,4,1,33,1370186054939066373,1370186054939066373,,,,"[{'text': 'AstraZeneca'}, {'text': 'vaccin'}, ...","<a href=""http://twitter.com/download/iphone"" r...",nl,JanineBorn,1055570294272856064,,,57,101,RT @jaaprog1: @hugodejonge. Er is nog helemaal...
3,2021-03-12 09:34:40.369732,2021-03-12 06:06:05+00:00,2021,3,12,4,6,6,1370254750910922759,1370254750910922759,,,,[],"<a href=""http://twitter.com/download/android"" ...",nl,Zilvera1,1241234821,,,917,423,Dit zou de trombose effecten van het vaccin ku...
4,2021-03-12 09:34:41.376670,2021-03-12 06:09:37+00:00,2021,3,12,4,6,9,1370255641181356041,1370255641181356041,,,,[],"<a href=""https://mobile.twitter.com"" rel=""nofo...",nl,Etherischevogel,712603972192878593,,,288,623,RT @Zilvera1: Dit zou de trombose effecten van...


In [43]:
df_query_04.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype              
---  ------                   --------------  -----              
 0   mined_at                 15 non-null     datetime64[ns]     
 1   created_at               15 non-null     datetime64[ns, UTC]
 2   year                     15 non-null     int64              
 3   month                    15 non-null     int64              
 4   day                      15 non-null     int64              
 5   day_of_week              15 non-null     int64              
 6   hour                     15 non-null     int64              
 7   minute                   15 non-null     int64              
 8   tweet_id                 15 non-null     int64              
 9   tweet_id_str             15 non-null     object             
 10  in_reply_to_screen_name  5 non-null      object             
 11  in_reply_to_status_id    5 non-nul

In [44]:
df_query_04.drop_duplicates(subset=['created_at','text'],inplace=True)
df_query_04.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype              
---  ------                   --------------  -----              
 0   mined_at                 15 non-null     datetime64[ns]     
 1   created_at               15 non-null     datetime64[ns, UTC]
 2   year                     15 non-null     int64              
 3   month                    15 non-null     int64              
 4   day                      15 non-null     int64              
 5   day_of_week              15 non-null     int64              
 6   hour                     15 non-null     int64              
 7   minute                   15 non-null     int64              
 8   tweet_id                 15 non-null     int64              
 9   tweet_id_str             15 non-null     object             
 10  in_reply_to_screen_name  5 non-null      object             
 11  in_reply_to_status_id    5 non-nul

In [45]:
df_query_04.to_csv("../data/tweets/vaccin_querie_"+TodaysDate+".csv",index=False)

### Query 5: `vaccinatiepassport` & `covid-19`

New hot subject involving COVID in EU. The vaccinatiepassport. Are you for or against? What people think about it? : https://twitter.com/search?lang=nl&q=vaccinatiepaspoort%20eu&src=typed_query

In [46]:
query_05 = 'q=vaccinatiepaspoort%20eu'
result_05 = TweetMiner.search_tweets(raw_query = query_05)
len(result_05)

300

In [47]:
result_05[0]

{'mined_at': datetime.datetime(2021, 3, 12, 9, 34, 41, 742140),
 'created_at': 'Fri Mar 12 08:11:10 +0000 2021',
 'tweet_id': 1370286228021649414,
 'tweet_id_str': '1370286228021649414',
 'language': 'nl',
 'text': 'RT @annstrikje: In 2019 kwam de WHO met een plan om IEDEREEN te prikken, waarin gesproken wordt over "sterke surveillancesystemen voor door…',
 'hashtags': [],
 'source': '<a href="http://www.myplume.com/" rel="nofollow">Plume\xa0for\xa0Android</a>',
 'user_id': 1521898405,
 'user_screen_name': 'Ept33Ept',
 'followers_count': 58,
 'friends_count': 45}

In [48]:
df_query_05 = process_and_save(pd.DataFrame(result_05),"query_05",mine_user_twitter=0)
df_query_05.head()

Unnamed: 0,mined_at,created_at,year,month,day,day_of_week,hour,minute,tweet_id,tweet_id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,hashtags,source,language,user_screen_name,user_id,user_location,user_favourites_count,followers_count,friends_count,text
0,2021-03-12 09:34:43.979838,2021-03-11 20:02:02+00:00,2021,3,11,3,20,2,1370102739141804036,1370102739141804036,Beastcoin2,1.370101e+18,1.293114e+18,[],"<a href=""http://twitter.com/download/iphone"" r...",nl,ultrayingyang,2910805487,Planet Earth,16372.0,134,77,@Beastcoin2 @AnnekeHogenes @hans_ruis @carlien...
1,2021-03-12 09:34:46.324630,2021-03-11 20:16:39+00:00,2021,3,11,3,20,16,1370106414945996806,1370106414945996806,SnelleStemwijzr,1.370104e+18,627648900.0,[],"<a href=""http://twitter.com/download/iphone"" r...",nl,annstrikje,171084541,Nederland,25153.0,8611,2246,@SnelleStemwijzr @AusterRyan Je kunt toch het ...
2,2021-03-12 09:34:44.709342,2021-03-11 20:16:57+00:00,2021,3,11,3,20,16,1370106490682552321,1370106490682552321,ManfredFranken,1.370019e+18,92958530.0,"[{'text': 'EU'}, {'text': 'vaccinatiepaspoort'}]","<a href=""https://mobile.twitter.com"" rel=""nofo...",nl,jaaprog1,2342166796,the Netherlands,3286.0,736,697,@ManfredFranken @sas61 En #EU wil dat iedereen...
3,2021-03-12 09:34:45.375198,2021-03-11 20:35:11+00:00,2021,3,11,3,20,35,1370111078588383234,1370111078588383234,,,,"[{'text': 'StemFVD'}, {'text': 'StemPVV'}, {'t...","<a href=""https://mobile.twitter.com"" rel=""nofo...",nl,Steve_ice_t,50577417,,,78,153,Besef niet te laat dat Nederland naar de afgro...
4,2021-03-12 09:34:46.324630,2021-03-11 20:45:14+00:00,2021,3,11,3,20,45,1370113607166533638,1370113607166533638,,,,"[{'text': 'EU'}, {'text': 'vaccinatiepaspoort'}]","<a href=""https://mobile.twitter.com"" rel=""nofo...",nl,sas61,30329938,,,1366,1315,RT @jaaprog1: @ManfredFranken @sas61 En #EU wi...


In [49]:
df_query_05.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype              
---  ------                   --------------  -----              
 0   mined_at                 15 non-null     datetime64[ns]     
 1   created_at               15 non-null     datetime64[ns, UTC]
 2   year                     15 non-null     int64              
 3   month                    15 non-null     int64              
 4   day                      15 non-null     int64              
 5   day_of_week              15 non-null     int64              
 6   hour                     15 non-null     int64              
 7   minute                   15 non-null     int64              
 8   tweet_id                 15 non-null     int64              
 9   tweet_id_str             15 non-null     object             
 10  in_reply_to_screen_name  4 non-null      object             
 11  in_reply_to_status_id    4 non-nul

In [50]:
df_query_05.drop_duplicates(subset=['created_at','text'],inplace=True)
df_query_05.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype              
---  ------                   --------------  -----              
 0   mined_at                 15 non-null     datetime64[ns]     
 1   created_at               15 non-null     datetime64[ns, UTC]
 2   year                     15 non-null     int64              
 3   month                    15 non-null     int64              
 4   day                      15 non-null     int64              
 5   day_of_week              15 non-null     int64              
 6   hour                     15 non-null     int64              
 7   minute                   15 non-null     int64              
 8   tweet_id                 15 non-null     int64              
 9   tweet_id_str             15 non-null     object             
 10  in_reply_to_screen_name  4 non-null      object             
 11  in_reply_to_status_id    4 non-nul

In [51]:
df_query_05.to_csv("../data/tweets/vaccinatiepassport_querie_"+TodaysDate+".csv",index=False)

# Conclusions

APIs are important tools when building Apps, getting information to build forecasting models, and to know better your customers. Only to enumerate some of the interesting ways of using APIs.
A standard form for transferring data through APIs is JSON (JavaScript Object Notation), a human-readable file format that has dictionary format which makes easy to work on Python.

Using API became so common that more and more organizations have APIs.
In this tutorial we explored superficially some APIs so you can have a feeling on how to use them.
Go on and explore even more these APIs or try some other. Check out [Rapid APIs]( https://rapidapi.com/?site) for more APIs. 

To go deeper on Twitter APIs you try, for example, the nice tutorials provided by Twitter [here]( https://developer.twitter.com/en/docs/twitter-api/tutorials). This [article]( https://towardsdatascience.com/how-do-data-scientists-use-twitter-let-us-count-the-ways-50494e2a95c8) shows some ways Data Scientists use Twitter.
