<a href="https://colab.research.google.com/github/MJMortensonWarwick/large_scale_data_for_research/blob/main/lsdc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Large Scale Data Collection for Research
This notebook will walk through a hypothetical data collection process for a research project - going from resource (open data store) through to a NoSQL database we can query.

All research projects are different. All APIs are slightly different and need a different set of instructions (code). There is no "best" database - just different use-cases. However, the ideas here can be fairly easily adapted to __your__ research project :)

We will begin by installing and importing the relevant packages (libraries).


In [None]:
!pip install tinydb

Collecting tinydb
  Downloading tinydb-4.8.0-py3-none-any.whl (24 kB)
Installing collected packages: tinydb
Successfully installed tinydb-4.8.0


In [None]:
import json
import requests
import tinydb

For this project we have identified the [World Bank](https://documents.worldbank.org/en/publication/documents-reports/api) as a suitable data source. World Bank offers an open data resource, behind an API which doesn't require specific keys (i.e. no need to register/request access).

We will be collecting articles on the topic of "open data" (very meta I know) and we will be storing them in a NoSQL (key-value) database called [TinyDB](https://tinydb.readthedocs.io/en/latest/). As the name suggests, this is a small/lightweight database, but with enough features to fit our use-case.

Let's begin by querying the API and seeing what we can get:

In [None]:
# World Bank API endpoint
api_url = "https://search.worldbank.org/api/v2/wds?format=json&qterm="

# query term
q = "open%20data"

# full endpoint
endpoint = api_url + q

So here we have created a variable called _api\_url_ which holds the standard query URL for the World Bank. We have specified we want the output format to be JSON (because we hate XML). The URL ends with "qterm=" so that we can append a query term to it.

The next line specifies the query term (_q_) we will use, "open%20data" (note: the "%20" is the percent encoding for a space, meaning the actual query is "open data").
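
As a side note, the percent encoding does not have to be typed by hand. A minimal sketch, assuming we would rather let Python's standard library build the query string (the `params` dictionary and the `example_` names are just for illustration, and nothing is sent to the API here):

```python
from urllib.parse import quote, urlencode

# percent-encode a single query term: the space becomes %20
encoded_q = quote("open data")

# or build the whole query string from a dictionary of parameters
params = {"format": "json", "qterm": "open data"}
query_string = urlencode(params, quote_via=quote)

example_endpoint = "https://search.worldbank.org/api/v2/wds?" + query_string
print(encoded_q)
print(example_endpoint)
```

This gives the same URL as gluing the strings together manually, but is less error-prone once the query term contains other special characters.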

Finally, we combine both together as the variable _endpoint_ which we are now ready to query:

In [None]:
r = requests.get(endpoint)
r

<Response [200]>

We have now accessed the API and received a response code of 200 (which means success!). Let's check the content:

In [None]:
r.content

b'{"rows":10,"os":0,"page":1,"total":12598,"documents":{"D30920848":{"id":"30920848","geo_region_mdks":{"0":{"geo_region_mdk":"Asia!$!517198"},"1":{"geo_region_mdk":"World!$!517191"},"2":{"geo_region_mdk":"Southeast Asia!$!Southeast Asia"}},"last_modified_date":"2019-10-01T00:00:00Z","admreg":"East Asia and Pacific,East Asia and Pacific","admreg_key":"119225,119225","authors":{"0":{"author":"Andreasson,Kim Johan"},"1":{"author":"Boyera,Stephane"},"2":{"author":"Herzog,Timothy Grant"},"3":{"author":"Kim,Seunghyun"},"4":{"author":"Kuznetsova Morrison,Alla V."},"5":{"author":"Lan Huong,Tran Thi"},"6":{"author":"Lan Huong,Nguyen Thi"}},"count":"Vietnam","count_key":"82695","docna":{"0":{"docna":"Digital Government and Open Data Readiness Assessment"}},"docty":"Working Paper","docty_key":"620264","owner":"Indicators and Data Services (DECIS)","projn":"VN-Open Data Technical Assistance For\\n            Digital Vietnam -- P164025","trustfund":"TF0A6482-Vietnam: Digital Government\\n         

We get some seemingly relevant data. However, we can also see that we only get 10 results - not particularly "large scale". The response also shows that there are a total of 12,598 possible results (maybe not actually large scale, but more than you would get from the average survey!).
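
These counts can be read from the response rather than spotted by eye. A small sketch, using a hand-shortened copy of the response (the `sample` bytes below are an abbreviated stand-in, not the full API output):

```python
import json

# abbreviated stand-in for r.content (documents left empty for brevity)
sample = b'{"rows":10,"os":0,"page":1,"total":12598,"documents":{}}'

meta = json.loads(sample)  # json.loads accepts bytes as well as str
returned = meta["rows"]
available = meta["total"]
print(f"Received {returned} of {available} matching documents")
```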

Let's modify the query so we can request a specific amount:



In [None]:
# World Bank API endpoint
api_url = "https://search.worldbank.org/api/v2/wds?format=json&qterm="

# query term
q = "open%20data"

# total rows to collect
rows = "&rows=20"

# start page
start = "&os=0"

endpoint = api_url + q + rows + start

r = requests.get(endpoint)
r.content

b'{"rows":20,"os":0,"page":1,"total":12598,"documents":{"D30920848":{"id":"30920848","geo_region_mdks":{"0":{"geo_region_mdk":"Asia!$!517198"},"1":{"geo_region_mdk":"World!$!517191"},"2":{"geo_region_mdk":"Southeast Asia!$!Southeast Asia"}},"last_modified_date":"2019-10-01T00:00:00Z","admreg":"East Asia and Pacific,East Asia and Pacific","admreg_key":"119225,119225","authors":{"0":{"author":"Andreasson,Kim Johan"},"1":{"author":"Boyera,Stephane"},"2":{"author":"Herzog,Timothy Grant"},"3":{"author":"Kim,Seunghyun"},"4":{"author":"Kuznetsova Morrison,Alla V."},"5":{"author":"Lan Huong,Tran Thi"},"6":{"author":"Lan Huong,Nguyen Thi"}},"count":"Vietnam","count_key":"82695","docna":{"0":{"docna":"Digital Government and Open Data Readiness Assessment"}},"docty":"Working Paper","docty_key":"620264","owner":"Indicators and Data Services (DECIS)","projn":"VN-Open Data Technical Assistance For\\n            Digital Vietnam -- P164025","trustfund":"TF0A6482-Vietnam: Digital Government\\n         

Again, not necessarily large scale but getting there! It also shows another characteristic of "big data" ... messiness. Let's try to clean it up:

In [None]:
response_dict = r.json()
response_dict

{'rows': 20,
 'os': 0,
 'page': 1,
 'total': 12598,
 'documents': {'D30920848': {'id': '30920848',
   'geo_region_mdks': {'0': {'geo_region_mdk': 'Asia!$!517198'},
    '1': {'geo_region_mdk': 'World!$!517191'},
    '2': {'geo_region_mdk': 'Southeast Asia!$!Southeast Asia'}},
   'last_modified_date': '2019-10-01T00:00:00Z',
   'admreg': 'East Asia and Pacific,East Asia and Pacific',
   'admreg_key': '119225,119225',
   'authors': {'0': {'author': 'Andreasson,Kim Johan'},
    '1': {'author': 'Boyera,Stephane'},
    '2': {'author': 'Herzog,Timothy Grant'},
    '3': {'author': 'Kim,Seunghyun'},
    '4': {'author': 'Kuznetsova Morrison,Alla V.'},
    '5': {'author': 'Lan Huong,Tran Thi'},
    '6': {'author': 'Lan Huong,Nguyen Thi'}},
   'count': 'Vietnam',
   'count_key': '82695',
   'docna': {'0': {'docna': 'Digital Government and Open Data Readiness Assessment'}},
   'docty': 'Working Paper',
   'docty_key': '620264',
   'owner': 'Indicators and Data Services (DECIS)',
   'projn': 'VN-Ope
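
With a nested dictionary like this, it can help to explore it one level at a time. A rough sketch on a cut-down copy of the structure (the `sample_response` dictionary is a hand-made stand-in with only a few of the real fields):

```python
sample_response = {
    "rows": 20,
    "total": 12598,
    "documents": {
        "D30920848": {
            "id": "30920848",
            "docty": "Working Paper",
            "authors": {"0": {"author": "Andreasson,Kim Johan"},
                        "1": {"author": "Boyera,Stephane"}},
        }
    },
}

# what is at the top level?
top_level_keys = list(sample_response.keys())

# take the first document without knowing its ID in advance
first_doc_id = next(iter(sample_response["documents"]))
first_doc = sample_response["documents"][first_doc_id]

# the authors are stored as a dictionary of small dictionaries
author_names = [entry["author"] for entry in first_doc["authors"].values()]

print(top_level_keys)
print(first_doc_id, sorted(first_doc.keys()))
print(author_names)
```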

Let's extract the abstract of the first record:

In [None]:
response_dict['documents']['D30920848']['abstracts']['cdata!']

'This report, composed of two separate\n            themes of Digital Government Readiness Assessment (DGRA) and\n            Open Data Readiness Assessment (ODRA), is intended to help\n            government assess their digital environments and frame their\n            own strategies.In order to assess the potential for a\n            Digital Enabling Government Initiative (DEGI) for Vietnam,\n            this report compiles two chapters of aforementioned DGRA and\n            ODRA. Specifically, it assesses potential opportunities and\n            challenges of improving digital government and open data\n            initiatives in the country. Although DGRA and ODRA are two\n            separate assessments with different dimensions evaluated,\n            they take a similar methodological approach from a broader\n            point of view, starting with the desk research and later\n            expanding to scoping mission. Therefore, both chapters of\n            DGRA and ODRA ar

Success (note I got here by just looking at the structure of the output above). However, we see a lot of "\n" symbols (the newline character). We don't need these so let's replace them:

In [None]:
mod_abstract = response_dict['documents']['D30920848']['abstracts']['cdata!'].replace("\n", "")
mod_abstract

'This report, composed of two separate            themes of Digital Government Readiness Assessment (DGRA) and            Open Data Readiness Assessment (ODRA), is intended to help            government assess their digital environments and frame their            own strategies.In order to assess the potential for a            Digital Enabling Government Initiative (DEGI) for Vietnam,            this report compiles two chapters of aforementioned DGRA and            ODRA. Specifically, it assesses potential opportunities and            challenges of improving digital government and open data            initiatives in the country. Although DGRA and ODRA are two            separate assessments with different dimensions evaluated,            they take a similar methodological approach from a broader            point of view, starting with the desk research and later            expanding to scoping mission. Therefore, both chapters of            DGRA and ODRA are similar in format but outl

Lost the "\n" stuff but still have these weird white spaces. Let's fix that by splitting the text on whitespace and re-joining it with single spaces:

In [None]:
mod_abstract = ' '.join(mod_abstract.split())
mod_abstract

'This report, composed of two separate themes of Digital Government Readiness Assessment (DGRA) and Open Data Readiness Assessment (ODRA), is intended to help government assess their digital environments and frame their own strategies.In order to assess the potential for a Digital Enabling Government Initiative (DEGI) for Vietnam, this report compiles two chapters of aforementioned DGRA and ODRA. Specifically, it assesses potential opportunities and challenges of improving digital government and open data initiatives in the country. Although DGRA and ODRA are two separate assessments with different dimensions evaluated, they take a similar methodological approach from a broader point of view, starting with the desk research and later expanding to scoping mission. Therefore, both chapters of DGRA and ODRA are similar in format but outlined in respective assessment dimension and individual indicators. Since its onset in the fall of 2017, intensive desk research was conducted, and a field
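
Since we will want to repeat this for every abstract, the cleaning can be wrapped in a small helper. This is only a sketch: `clean_text` is a name made up here, and it relies on the fact that `str.split()` with no arguments splits on any run of whitespace, newlines included, so a separate `replace` step becomes optional:

```python
def clean_text(text):
    """Collapse newlines and repeated spaces into single spaces."""
    return " ".join(text.split())

raw = "composed of two separate\n            themes of Digital Government"
print(clean_text(raw))
```

One difference from the cells above: a newline with no spaces around it becomes a space here rather than being deleted, so the words on either side are not glued together.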

Success! We now have the main data we want - the article abstract. However, we may want some other metadata to go with it. Let's put a whole record together for the first item:

In [None]:
record_one= {} # empty dictionary

record_one['id'] = response_dict['documents']['D30920848']['id']
record_one['date'] = response_dict['documents']['D30920848']['last_modified_date']
record_one['topic'] = response_dict['documents']['D30920848']['teratopic']

mod_abstract = response_dict['documents']['D30920848']['abstracts']['cdata!'].replace("\n", "")
mod_abstract = ' '.join(mod_abstract.split())
record_one['abstract'] = mod_abstract

record_one # print

{'id': '30920848',
 'date': '2019-10-01T00:00:00Z',
 'topic': 'Governance,Public Sector Development,Information and Communication Technologies',
 'abstract': 'This report, composed of two separate themes of Digital Government Readiness Assessment (DGRA) and Open Data Readiness Assessment (ODRA), is intended to help government assess their digital environments and frame their own strategies.In order to assess the potential for a Digital Enabling Government Initiative (DEGI) for Vietnam, this report compiles two chapters of aforementioned DGRA and ODRA. Specifically, it assesses potential opportunities and challenges of improving digital government and open data initiatives in the country. Although DGRA and ODRA are two separate assessments with different dimensions evaluated, they take a similar methodological approach from a broader point of view, starting with the desk research and later expanding to scoping mission. Therefore, both chapters of DGRA and ODRA are similar in format but 
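
The same steps can be wrapped in a function so that they work for any document, not just `D30920848`. A hedged sketch: `build_record` and `sample_doc` are illustrative names, and falling back to "N/A" for missing fields is a design choice rather than anything the API prescribes:

```python
def build_record(doc):
    """Pick out the fields we care about from one document dictionary."""
    abstract = doc.get("abstracts", {}).get("cdata!", "N/A")
    return {
        "id": doc.get("id", "N/A"),
        "date": doc.get("last_modified_date", "N/A"),
        "topic": doc.get("teratopic", "N/A"),
        # collapse newlines and repeated spaces
        "abstract": " ".join(abstract.split()),
    }

# cut-down stand-in for one entry of response_dict['documents']
sample_doc = {
    "id": "30920848",
    "last_modified_date": "2019-10-01T00:00:00Z",
    "abstracts": {"cdata!": "This report, composed of two separate\n            themes"},
}
sample_record = build_record(sample_doc)
print(sample_record)
```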

Awesome! We now have some sensible output! However, this is only one record. How about the other 12,597?

Handling this many records by hand will start getting quite ugly, so it's time to use a database:

In [None]:
from tinydb import TinyDB, Query

db = TinyDB('mydb.json')

We have now created a database in Colab. We can check this by looking at the files on the server:

In [None]:
import os
os.listdir("/content")

['.config', 'mydb.json', 'sample_data']

We can see "mydb.json" - our key-value database. Let's start looping through the API to extract all the records and put them in our database. First we need to set up the parameters so we can loop through multiple searches:

In [None]:
# World Bank API endpoint
api_url = "https://search.worldbank.org/api/v2/wds?format=json"

# total rows to collect
rows = "&rows=20&os="

# start page
start = 0

# query term
q = "&qterm=open%20data"

Note we have changed the start value so it is just a number (and moved the "&os=" onto the end of the rows string above). This allows us to increase the offset by 20 for each loop we make.
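
To make the arithmetic concrete, here is a small sketch of the offsets such a loop would step through. The figures come from the responses above (12,598 results, 20 per call); the helper name `page_offsets` is made up for illustration:

```python
import math

def page_offsets(total, page_size):
    """Return the `os` offset for each call needed to cover `total` rows."""
    n_calls = math.ceil(total / page_size)
    return [call * page_size for call in range(n_calls)]

offsets = page_offsets(12598, 20)
print(len(offsets))   # number of calls needed to cover everything
print(offsets[:3])    # first few offsets
print(offsets[-1])    # offset of the final, partly filled page
```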

Next let's put our code from earlier together into a for loop, and add each record to the database:



In [None]:
import time

record_set = {} # empty dictionary
record_count = 0 # reset count of records

# 12,598 / 20 means 630 calls. Let's just take the first 500 (10,000 records)
for call in range(500): # 500 calls
  endpoint = api_url + rows + str(start)  + q # in the first call start = 0

  try:
    r = requests.get(endpoint) # get the data

    response_dict = r.json() # convert to a dictionary

    document_ids = list(response_dict['documents'].keys()) # get document IDs

  except Exception:
    # if the API call fails, skip this page so stale results are not inserted again
    start = start + 20
    time.sleep(1)
    continue

  record_set = {} # empty dictionary

  for record in range(20): # 20 records per call
    record_set[record_count] = {} # nested dictionary for the current count

    for field in ['id', 'last_modified_date', 'teratopic']:
      try: # use "N/A" if the field is missing
        record_set[record_count][field] = response_dict['documents'][document_ids[record_count]][field]
      except:
        record_set[record_count][field] = "N/A"

      try: # use "N/A" if the abstract is missing
        mod_abstract = response_dict['documents'][document_ids[record_count]]['abstracts']['cdata!'].replace("\n", "")
        mod_abstract = ' '.join(mod_abstract.split())
      except:
        mod_abstract = "N/A"
      record_set[record_count]['abstract'] = mod_abstract

    # add to database
    db.insert(record_set[record_count])

    record_count +=1 # increment the count

  record_count = 0 # reset record_count for the next call

  #update start row by 20
  start = start + 20

  time.sleep(1) # wait 1s before calling again

In [None]:
# print the database
for item in db:
  print(item)

{'id': '30920848', 'abstract': 'This report, composed of two separate themes of Digital Government Readiness Assessment (DGRA) and Open Data Readiness Assessment (ODRA), is intended to help government assess their digital environments and frame their own strategies.In order to assess the potential for a Digital Enabling Government Initiative (DEGI) for Vietnam, this report compiles two chapters of aforementioned DGRA and ODRA. Specifically, it assesses potential opportunities and challenges of improving digital government and open data initiatives in the country. Although DGRA and ODRA are two separate assessments with different dimensions evaluated, they take a similar methodological approach from a broader point of view, starting with the desk research and later expanding to scoping mission. Therefore, both chapters of DGRA and ODRA are similar in format but outlined in respective assessment dimension and individual indicators. Since its onset in the fall of 2017, intensive desk rese

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



{'id': '17198763', 'abstract': 'This tenth edition of Doing Business sheds light on how easy or difficult it is for a local entrepreneur to open and run a small to medium-size business when complying with relevant regulations. It measures and tracks changes in regulations affecting eleven areas in the life cycle of a business: starting a business, dealing with construction permits, getting electricity, registering property, getting credit, protecting investors, paying taxes, trading across borders, enforcing contracts, resolving insolvency and employing workers. Doing Business presents quantitative indicators on business regulations and the protection of property rights that can be compared across 185 economies, from Afghanistan to Zimbabwe, over time. The indicators are used to analyze economic outcomes and identify what reforms have worked, where and why. This economy profile presents the Doing Business indicators for Turkey. To allow useful comparison, it also provides data for othe

{'id': '33040005', 'abstract': 'N/A', 'last_modified_date': 'N/A', 'teratopic': 'Social Development,Law and Development,Governance,Public Sector Development,Health, Nutrition and Population'}
{'id': '16833395', 'abstract': 'The development objective of the Southern Africa Trade and Transport Facilitation Project has been identified as the following: to contribute to economic growth in the eastern and southern Africa through facilitation of the movement of goods and people and the fostering of regional integration among the countries served by the corridor. Negative impacts include: noise pollution, air pollution, spills of hazardous material, oil pollution, traffic impacts, water pollution, soil disposal, and erosion. Mitigation measures include: 1) areas cleared of vegetation should be re-vegetated to prevent soil erosion; 2) water sprinkling to reduce the dust at the construction site; 3) use of dust masks to operators and those working in the dusty areas; 4) the wastewater will be 
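
Printing every record at once is what triggers the data rate warning above. A gentler check is to look at only the first few records and shorten any long text. A sketch assuming any iterable of dictionaries (the `records` list below is a stand-in; iterating over `db` yields dictionary-like records in the same way):

```python
from itertools import islice

def preview(records, n=3, width=60):
    """Return the first n records with long string values shortened."""
    shortened = []
    for rec in islice(records, n):
        shortened.append({
            key: (value[:width] + "...")
            if isinstance(value, str) and len(value) > width
            else value
            for key, value in rec.items()
        })
    return shortened

# stand-in data: 100 records with very long abstracts
records = [{"id": str(i), "abstract": "x" * 500} for i in range(100)]
for rec in preview(records):
    print(rec)
```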

Now we have a big database of articles! Let's do a query to test it:

In [None]:
Abstract = Query() # query object

db.search(Abstract.id == '30920848')

[{'id': '30920848',
  'abstract': 'This report, composed of two separate themes of Digital Government Readiness Assessment (DGRA) and Open Data Readiness Assessment (ODRA), is intended to help government assess their digital environments and frame their own strategies.In order to assess the potential for a Digital Enabling Government Initiative (DEGI) for Vietnam, this report compiles two chapters of aforementioned DGRA and ODRA. Specifically, it assesses potential opportunities and challenges of improving digital government and open data initiatives in the country. Although DGRA and ODRA are two separate assessments with different dimensions evaluated, they take a similar methodological approach from a broader point of view, starting with the desk research and later expanding to scoping mission. Therefore, both chapters of DGRA and ODRA are similar in format but outlined in respective assessment dimension and individual indicators. Since its onset in the fall of 2017, intensive desk r

Now we have up to 10,000 records of data in a database (500 calls of 20 records each). You can export that database to your local machine or conduct your analyses on Colab.
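
For the export step, one option is to flatten the database file into a CSV that opens in a spreadsheet or statistics package. This is a sketch resting on an assumption worth checking against your own file: that TinyDB's default JSON storage keeps records under a top-level `"_default"` table, keyed by document ID. The function name is made up for illustration:

```python
import csv
import json

def tinydb_json_to_csv(json_path, csv_path, fields):
    """Write the records from a TinyDB JSON file to a CSV file."""
    with open(json_path, encoding="utf-8") as f:
        tables = json.load(f)
    # assumed layout: {"_default": {"1": {...record...}, "2": {...}}}
    records = tables.get("_default", {}).values()

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=fields, restval="N/A", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

For example, `tinydb_json_to_csv("mydb.json", "mydb.csv", ["id", "last_modified_date", "teratopic", "abstract"])` would write one row per record and return the number of rows written.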