# Basic querying of articles

1. Query articles.
2. Store them in a pandas DataFrame. Some data processing happens at this step:
   1. Convert the *authors*, *title*, *abstract*, *journal* and *publisher* strings to lowercase.
   2. Convert *publication_date* to a proper datetime format.
   3. Convert columns that actually hold strings to the pandas string dtype.
   4. Set all other columns to the appropriate Python data types.
3. Save to a *.csv* file.
4. Optionally save to *.xlsx* for manual marking.
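
The processing in step 2 can be sketched with plain pandas. This is only an illustration under stated assumptions, not the code artfinder runs internally: the `normalize` helper and the two toy records are hypothetical, and only a few of the columns are shown.

```python
import pandas as pd

# Hypothetical toy records; the column names follow the list above.
raw = pd.DataFrame({
    "title": ["Example Title One", "Example Title Two"],
    "journal": ["Example Journal A", "Example Journal B"],
    "publication_date": ["2015-03-01", "2019-11-20"],
    "is_referenced_by_count": ["12", "3"],
})

def normalize(df):
    """Hypothetical helper mirroring the processing steps listed above."""
    out = df.copy()
    # Text columns: pandas string dtype, lowercased.
    for col in ("title", "journal"):
        out[col] = out[col].astype("string").str.lower()
    # Dates: parse into datetime64.
    out["publication_date"] = pd.to_datetime(out["publication_date"])
    # Remaining columns: plain Python/NumPy types, here an integer count.
    out["is_referenced_by_count"] = out["is_referenced_by_count"].astype(int)
    return out

clean = normalize(raw)
print(clean.dtypes)
```

Lowercasing the text columns up front makes later string matching and de-duplication case-insensitive without extra work.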

In [None]:
%load_ext autoreload
%autoreload 2

from artfinder import Crossref
import logging
import os

logging.basicConfig(level=logging.INFO)

crosref = Crossref(app='artfinder', email='aapopov1@mephi.ru')

## Make a search query and construct a pandas DataFrame with the results
* Valid fields for query are in Crossref.FIELDS_QUERY
* Valid fields for filter are in Crossref.FILTER_VALIDATOR
* Specifying the same filter several times results in OR semantics, while specifying different filters results in AND semantics
* [REST API documentation](https://github.com/CrossRef/rest-api-doc?tab=readme-ov-file#queries)
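
To make the OR/AND rule concrete, here is a minimal sketch of how repeated and distinct filters end up in a single `filter` parameter. The `build_filter` helper is hypothetical and is not artfinder's implementation; it only reproduces the comma-separated `name:value` form that appears in the request URL logged by the query cell.

```python
from urllib.parse import urlencode

def build_filter(**filters):
    """Hypothetical helper: join filters into a comma-separated `name:value` string."""
    parts = []
    for name, value in filters.items():
        # A list means the same filter is given several times (OR semantics).
        values = value if isinstance(value, (list, tuple)) else [value]
        api_name = name.replace("_", "-")  # from_pub_date -> from-pub-date
        parts.extend(f"{api_name}:{v}" for v in values)
    # Different filter names in the same string are combined with AND.
    return ",".join(parts)

flt = build_filter(from_pub_date="1993",
                   type=["proceedings-article", "journal-article"])
query_string = urlencode({"query.author": "Barcikowski", "filter": flt})
print(flt)
print(query_string)
```

Read this as: published from 1993 onwards AND (type is proceedings-article OR type is journal-article).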



In [2]:
author_name = 'Barcikowski'
df = crosref.query(author=author_name).filter(from_pub_date='1993', type=['proceedings-article', 'journal-article']).get_df()
df.info()

DEBUG:crossref.restful:Request URL: https://api.crossref.org/works
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.crossref.org:443


DEBUG:urllib3.connectionpool:https://api.crossref.org:443 "GET /works?query.author=Barcikowski&filter=from-pub-date%3A1993%2Ctype%3Aproceedings-article%2Ctype%3Ajournal-article&cursor=%2A&rows=100 HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.crossref.org:443
DEBUG:urllib3.connectionpool:https://api.crossref.org:443 "GET /works?query.author=Barcikowski&filter=from-pub-date%3A1993%2Ctype%3Aproceedings-article%2Ctype%3Ajournal-article&cursor=DnF1ZXJ5VGhlbkZldGNoBgAAAAAwY8h2FmJnOVJRVU1xUzlLa2ZpTkltazIzencAAAAAE8594hZWX2tkV3YyWVFVcV9hOFZHWGN4cURnAAAAADHzY9IWUnBjZWFMRGtRSEtfOUZieGxBaklIUQAAAAAuagrwFmZPZjNDVWhLUnh1RDFtUWM4WFIzUHcAAAAAD9P7ZBZpcXRGT3lZOVQ1bUFtcEdTS2xJUTRRAAAAABGAnIYWZjkwZGpsbmpTUG10QXVtRFhYcXlaUQ%3D%3D&rows=100 HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.crossref.org:443
DEBUG:urllib3.connectionpool:https://api.crossref.org:443 "GET /works?query.author=Barcikowski&filter=from-pub-date%3A1993%2C

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 549 entries, 0 to 548
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   publisher               549 non-null    string        
 1   license                 349 non-null    object        
 2   is_referenced_by_count  549 non-null    object        
 3   link                    486 non-null    object        
 4   authors                 549 non-null    object        
 5   abstract                185 non-null    string        
 6   title                   549 non-null    string        
 7   doi                     549 non-null    string        
 8   type                    549 non-null    string        
 9   journal                 549 non-null    string        
 10  issn                    492 non-null    string        
 11  volume                  480 non-null    string        
 12  issue                   369 non-null    string    

## Save data

In [5]:
path_to_save = os.path.join('database', author_name + '.csv')
df.to_csv(path_to_save, index=False)
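
`DataFrame.to_csv` does not create missing directories, so the call fails if the target folder does not exist yet. A small sketch of guarding against that follows; the `csv_path_for` helper is hypothetical, and the demo writes into a temporary directory rather than the real `database` folder.

```python
import os
import tempfile

def csv_path_for(author_name, directory="database"):
    """Hypothetical helper: ensure the directory exists and return the CSV path."""
    os.makedirs(directory, exist_ok=True)  # no error if it already exists
    return os.path.join(directory, author_name + ".csv")

# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "database")
    demo_path = csv_path_for("Barcikowski", directory=demo_dir)
    demo_dir_created = os.path.isdir(demo_dir)
    print(demo_path)
```

In the notebook, the returned path would then be passed to `df.to_csv(path, index=False)` as in the cell above.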

### Save data for manual marking

In [None]:
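# Set this to the target .xlsx file path before running; an empty path will not work.
path_for_processing = ''
# Export only the columns needed for manual marking.
df[['title', 'abstract', 'doi']].to_excel(path_for_processing, index=False)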