# Data Gathering

This notebook is dedicated to gathering data, because it's not wise to combine these functions with the data analysis. Gathering data from this site takes some time and would slow down the analysis. The data is huge compared to what GitHub allows to be uploaded, so I have to add it to the .gitignore file.


API documentation:

https://www.mtmt.hu/system/files/mtmt2_api_dokumentacio.pdf


So I have to figure out how to make a good query that fits my needs... And here it is:

https://m2.mtmt.hu/api/publication?cond=published;eq;true&cond=core;eq;true&cond=institutes;inia;133&cond=category.mtid;eq;1&cond=publishedYear;range;2014%2C2023&sort=publishedYear,desc&sort=firstAuthor,asc&page=1&size=67027&labelLang=hun&cite_type=2

It's a bit long but now it almost has everything that I need:

1. It's from ELTE
2. Narrowed down to 2014-today (arbitrarily chosen)
3. It does sorting (from most recent to oldest, then by first author)
4. Returns at most 5000 publications
5. It's Hungarian (?)
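
As a minimal sketch, the query above can be assembled from its parts, which makes each condition easier to read and change. The helper name `build_query_url` is hypothetical (it is not part of the MTMT API); the parameter values are copied from the URL above.

```python
BASE_URL = "https://m2.mtmt.hu/api/publication"

def build_query_url(conds, sorts, page=1, size=5000, label_lang="hun", cite_type=2):
    """Join the conditions, sort keys and paging options into one query URL."""
    parts = [f"cond={c}" for c in conds]
    parts += [f"sort={s}" for s in sorts]
    parts += [f"page={page}", f"size={size}",
              f"labelLang={label_lang}", f"cite_type={cite_type}"]
    return BASE_URL + "?" + "&".join(parts)

conds = [
    "published;eq;true",
    "core;eq;true",
    "institutes;inia;133",              # the institute filter (ELTE, per the list above)
    "category.mtid;eq;1",
    "publishedYear;range;2014%2C2023",  # the comma is already URL-encoded as %2C
]
sorts = ["publishedYear,desc", "firstAuthor,asc"]

url = build_query_url(conds, sorts, size=67027)
print(url)
```

The parts are joined by hand instead of with `urllib.parse.urlencode`, because the values already contain the `;` and `%2C` characters in exactly the form the URL above uses.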

**Now I have to figure out how to work around the 5000 publication limit**

The tip was that I can look into the data, quickly get the last publication's number and use that in the next query. But how? I didn't see anything about that in the docs, so I can't wrap my head around the thought.

**THE SOLUTION**

The solution for getting around the limit is to sort by some value and add a limiting condition on that value. But how? With the MTID, using it as the first-level sort key. Something like this:


`https://m2.mtmt.hu/api/publication?`

`cond=published;eq;true&`

`cond=core;eq;true&`

`cond=institutes;inia;133&`

`cond=category.mtid;eq;1&`

`cond=mtid;gt;32597004&`

`cond=publishedYear;range;2014%2C2023&`

`sort=mtid,asc&`

`sort=publishedYear,desc&`

`sort=firstAuthor,asc&`

`size=20&`

`labelLang=hun&`

`cite_type=2&`

`page=1&`

`format=json`

The `mtid;gt;32597004` condition limits the publications in this way. I just need to sort by mtid first, look up the mtid of the 5000th publication and keep raising the limit until nothing is returned.

## cURL

This can be used to grab the returned responses and save them! But you need to have cURL installed. Luckily I have it, as it recently came to Windows, which is handy. Now I have to play with the responses to see when to stop requesting more of them. As far as I know, if the response contains no content, it's a smart query describing that there is no result. This way I don't have to ask the backend to keep calculating how many uncutElements (not listed) there are.
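
A minimal sketch of calling cURL from Python without going through the shell, assuming `curl` is on the PATH. The helper names `curl_command` and `download` are hypothetical, and the URL is a shortened example, not the full query. The flags are standard cURL options: `-sS` hides the progress meter but still shows errors, and `-o` writes the response body to a file.

```python
import subprocess

def curl_command(url, out_path):
    """Return the cURL call as an argument list, so no shell quoting is needed."""
    return ["curl", "-sS", "-o", out_path, url]

def download(url, out_path):
    """Run cURL and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(curl_command(url, out_path), check=True)

# Only build and show the command here; call download(...) to actually fetch.
cmd = curl_command("https://m2.mtmt.hu/api/publication?size=20&format=json",
                   "response.json")
print(" ".join(cmd))
```

Passing the arguments as a list means that the `&` and `;` characters in the URL are not interpreted by the shell, so the URL does not need to be wrapped in quotes.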

## How to grab all content

There is a `5000` paging limit, but my content count is close to 67000. To get around this limit, I have to:

1. Give no limit to get the first 5000 publications
2. See the mtid of the last publication
3. Give the last publication's mtid as a lower limit to download the next 5000 publications
4. Repeat
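
The steps above can be sketched without touching the network. Here `fetch_page` is a hypothetical stand-in for one API call (it mimics `sort=mtid,asc` plus `cond=mtid;gt;...`), the records are made up, and the page size is 3 instead of 5000.

```python
def fetch_page(records, lower_mtid, size):
    """Stand-in for one API call: records with mtid > lower_mtid, sorted by mtid."""
    matching = sorted((r for r in records if r["mtid"] > lower_mtid),
                      key=lambda r: r["mtid"])
    return matching[:size]

def fetch_all(records, size):
    """Walk through every record by raising the lower mtid bound after each page."""
    collected = []
    lower_mtid = 0                        # step 1: no effective limit
    while True:
        page = fetch_page(records, lower_mtid, size)
        collected.extend(page)
        if len(page) < size:              # short (or empty) page: nothing is left
            break
        lower_mtid = page[-1]["mtid"]     # steps 2-3: last mtid becomes the new bound
    return collected

fake_records = [{"mtid": m} for m in (7, 3, 12, 25, 18, 31, 40)]
all_records = fetch_all(fake_records, size=3)
print([r["mtid"] for r in all_records])   # [3, 7, 12, 18, 25, 31, 40]
```

Because every page starts strictly after the last mtid of the previous one, no record is skipped or downloaded twice, as long as mtid values are unique.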

Now I only have to test how to handle cURL with smaller downloads.

In [1]:
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import sys
import os
import pathlib
import glob

import json

import networkx as nx

import re
import uuid

In [2]:
def makeQuery(folder_name,response_name,response_number,limit1=0):
    """
    USES cURL!
    CREATES FOLDER!
    USES FOLDER TO SAVE!
    
    Extra:
        1. Displays the given query (here it is burnt in)
        2. Does the query

    limit1 is the lower mtid bound: only publications with mtid > limit1 are returned.
    """
    
    #creating a folder for our downloaded data, dedicated to me (no error if it already exists)
    os.makedirs(os.path.join('data', str(folder_name)), exist_ok=True)
    query_string = (
                    '"https://m2.mtmt.hu/api/publication?'+
                    'cond=published;eq;true&' +
                    'cond=core;eq;true&' + 
                    'cond=institutes;inia;133&' +
                    'cond=category.mtid;eq;1&' + 
                    'cond=mtid;gt;'+ str(limit1) + '&' +
                    'cond=publishedYear;range;2014%2C2023&' +
                    'sort=mtid,asc&' + 
                    'sort=publishedYear,desc&' + 
                    'sort=firstAuthor,asc&' + 
                    'size=5000&' + 
                    'labelLang=hun&' + 
                    #the query in the text uses cite_type=2; site_type is kept here to match the recorded output
                    'site_type=2&' + 
                    'page=1&' + 
                    'format=json"'                    
                   )
                    
    #building the shell command once, so the printed and the executed command cannot differ
    command = ('curl ' +
               query_string +
               ' > ' +
               'data\\' +
               folder_name +
               '/' +
               response_name +
               '_' +
               str(response_number) +
               '.json')

    #displaying it, then downloading it
    print(command)
    os.system(command)

In [7]:
makeQuery(folder_name='tmp',
          response_name='first',
          response_number=1,
          limit1=3259700400)

curl "https://m2.mtmt.hu/api/publication?cond=published;eq;true&cond=core;eq;true&cond=institutes;inia;133&cond=category.mtid;eq;1&cond=mtid;gt;3259700400&cond=publishedYear;range;2014%2C2023&sort=mtid,asc&sort=publishedYear,desc&sort=firstAuthor,asc&size=5000&labelLang=hun&site_type=2&page=1&format=json" > data\tmp/first_1.json


This seems to work correctly; now I only have to extend it to accommodate the limit and go through the entire content that fulfills my criteria.

## Finals

It seems like it doesn't return a smart query response, but a response that has zero publications in it. This means that the fields `first` and `last` both become **True**, as no content is returned. This can be a good stopping condition: if first and last are both true, the response has no content in it, so the algorithm can stop and the last file can be deleted.

**Checked**: It turns out that a 5000/5000 page also returns last and first as true. I have to switch to checking the number of elements returned.
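
Counting the returned elements can be sketched as follows. This assumes only that the response keeps its publications in a `content` list; the helper name `is_last_page` is hypothetical, and the sample responses are made up.

```python
PAGE_SIZE = 5000  # same value as size= in the query

def is_last_page(data, page_size=PAGE_SIZE):
    """True when the response holds fewer publications than a full page."""
    return len(data.get("content", [])) < page_size

full_page = {"content": [{"mtid": i} for i in range(1, PAGE_SIZE + 1)]}
short_page = {"content": [{"mtid": 1}, {"mtid": 2}]}
empty_page = {"content": []}

print(is_last_page(full_page), is_last_page(short_page), is_last_page(empty_page))
```

With this check, a result set that is an exact multiple of 5000 costs one extra request that comes back empty, but the loop still stops correctly.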

## First quick test

In [17]:
#checkers
remaining = True

#general stuff
fldr_name = 'final'
rspns_name = 'publications'
rspns_nmbr = 1
lmt0 = 0

while remaining:
    #do the query
    makeQuery(folder_name=fldr_name,
              response_name=rspns_name,
              response_number=rspns_nmbr,
              limit1=lmt0)
    
    #read some stuff from it
    path = "data/" + fldr_name + "/" + rspns_name + "_" + str(rspns_nmbr) + ".json"
    with open(path,"rt",encoding="utf-8") as file:   #the file is closed automatically
        data = json.load(file)

    #check if this is the last one: if there are fewer than 5000 elements, we stop
    if data['paging']['totalElements'] < 5000:
        remaining=False
    else:
        #increase incrementals
        rspns_nmbr = rspns_nmbr+1            #get the number for the next file
        lmt0 = data['content'][-1]['mtid']   #the last element's mtid becomes the new lower limit
        
        #this is the end of this loop

curl "https://m2.mtmt.hu/api/publication?cond=published;eq;true&cond=core;eq;true&cond=institutes;inia;133&cond=category.mtid;eq;1&cond=mtid;gt;0&cond=publishedYear;range;2014%2C2023&sort=mtid,asc&sort=publishedYear,desc&sort=firstAuthor,asc&size=5000&labelLang=hun&site_type=2&page=1&format=json" > data\final/publications_1.json
curl "https://m2.mtmt.hu/api/publication?cond=published;eq;true&cond=core;eq;true&cond=institutes;inia;133&cond=category.mtid;eq;1&cond=mtid;gt;2833665&cond=publishedYear;range;2014%2C2023&sort=mtid,asc&sort=publishedYear,desc&sort=firstAuthor,asc&size=5000&labelLang=hun&site_type=2&page=1&format=json" > data\final/publications_2.json
curl "https://m2.mtmt.hu/api/publication?cond=published;eq;true&cond=core;eq;true&cond=institutes;inia;133&cond=category.mtid;eq;1&cond=mtid;gt;2980452&cond=publishedYear;range;2014%2C2023&sort=mtid,asc&sort=publishedYear,desc&sort=firstAuthor,asc&size=5000&labelLang=hun&site_type=2&page=1&format=json" > data\final/publications_3.

# Conclusion

I just simply love cURL. It's simple to grab responses with, yet good enough for scraping. The responses are saved correctly, they all seem to be fine, and so the finalized data is here and present. Now I have to jump into its analysis, which is not that easy, as I have to go through multiple files.
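
One way to go through the multiple files is to read them all into a single list first. This is a minimal sketch, assuming the files follow the `publications_<n>.json` naming used above and keep their publications under `content`; the helper name `load_publications` is hypothetical.

```python
import glob
import json
import os

def load_publications(folder, pattern="publications_*.json"):
    """Read every saved response in `folder` and return all publications in one list."""
    publications = []
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(path, "rt", encoding="utf-8") as file:
            data = json.load(file)
        publications.extend(data.get("content", []))
    return publications

# Returns an empty list if the folder or the files do not exist yet.
publications = load_publications(os.path.join("data", "final"))
print(len(publications))
```

The files are read in alphabetical order (so `publications_10.json` comes before `publications_2.json`); if the original order matters, the combined list can be sorted by `mtid` afterwards.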