In [1]:
import json
import requests
import regex as re
import pandas as pd

In [2]:
%run func_dataset_retrieval.py

## Queries

Queries can be modified, but every dataset retrieved from the sources below must record which queries returned it. Queries are encoded as integers in iteration order (the first query is 0, the second is 1, and so on), so it is important to export the table that maps each query to its id.
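
As a minimal sketch of how such a lookup table could be built and exported, assuming the queries start out as a plain list of strings (the helper name `build_query_table` and the output file name are hypothetical, not part of `func_dataset_retrieval.py`):

```python
import pandas as pd

def build_query_table(query_strings):
    """Hypothetical helper: pair each query with its position-based id."""
    rows = [{"query": q, "id_query": i} for i, q in enumerate(query_strings)]
    return pd.DataFrame(rows, columns=["query", "id_query"])

query_table = build_query_table([
    "Québec AND survey AND species",
    'Québec AND "time series" AND species',
    "Québec AND inventory AND species",
])

# Export alongside the retrieved datasets (file name is an assumption; needs openpyxl).
# index=False keeps the row index out of the file.
# query_table.to_excel("df_query.xlsx", index=False)
```

Writing with `index=False` should also avoid an extra `Unnamed: 0` column when the file is read back.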

In [3]:
path_query='C:/Users/vals3103/Downloads/df_query.xlsx'
df_queries = pd.read_excel(path_query, engine = "openpyxl", converters={'id_query':int})
df_queries

Unnamed: 0,query,id_query
0,Québec AND survey AND species,0
1,"Québec AND ""time series"" AND species",1
2,Québec AND inventory AND species,2
3,Québec AND species,3
4,Québec AND abundance AND species,4
5,Québec AND occurrence AND species,5
6,Québec AND population AND species,6
7,Québec AND sites AND species,7
8,Québec AND sampling AND species,8
9,Québec AND collection AND species,9


Importantly, subset the queries here to skip those that have already been run. For instance, to skip the first 7 queries (keeping only id_query >= 7):

In [8]:
queries = df_queries.query("id_query >= 7").to_dict()['query']

In [9]:
for i, query in queries.items():
    print(i)
    print(query)

7
Québec AND sites AND species
8
Québec AND sampling AND species
9
Québec AND collection AND species
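
Note that `to_dict()['query']` keys the result by the DataFrame row index, which matches `id_query` here only because the two happen to coincide. A hedged sketch of a variant that keys explicitly on `id_query`, using a toy table whose index is deliberately different (all values below are illustrative):

```python
import pandas as pd

# Toy table: the row index (10, 11, 12) differs from id_query (0, 1, 2).
toy_queries = pd.DataFrame(
    {"query": ["q zero", "q one", "q two"], "id_query": [0, 1, 2]},
    index=[10, 11, 12],
)

# Keyed by row index, as in the cell above.
by_index = toy_queries.query("id_query >= 1").to_dict()["query"]

# Keyed by id_query, independent of the row index.
by_id = toy_queries.query("id_query >= 1").set_index("id_query")["query"].to_dict()
```

With the second form, the integer passed on to the retrieval step is always the recorded query id, even if the table has been sorted or filtered beforehand.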


## Zenodo

In [25]:
df_zenodo = retrieve_zenodo(queries)

In [26]:
df_zenodo.head()

Unnamed: 0,url,title,description,method,notes,keywords,locations,publication_date,cited_articles,id_query,source
0,https://doi.org/10.5061/dryad.s1rn8pk7d,Boreal aspen understory diversity along a cont...,<p>This dataset contains vascular plant specie...,<p>Sampling took place in 33 trembling aspen (...,"<p>Associated paper: Crispo, Jean, Fenton, Led...",understory vegetation; trembling aspen; plant ...,,2021-06-09,,987,zenodo
1,https://doi.org/10.5061/dryad.dbrv15f1c,"Range shifts in butternut, a rare, endangered ...",<p><strong>Aim: </strong>Range shifts are a ke...,"<p class=""MsoNormal""><span style=""font-size:11...",<p>Data was cleaned and processed in R - genet...,central-marginal hypothesis; species migration...,,2022-03-10,,987,zenodo
2,https://doi.org/10.5061/dryad.vp530,Data from: Temporal dynamics of plant-soil fee...,1. Pathogens can accumulate on invasive plants...,,"<div class=""o-metadata__file-usage-entry"">Fung...",454-pyrosequencing; pathogen accumulation; Inv...,Canada; Ontario,2018-05-22,https://doi.org/10.1111/1365-2745.12459,7,zenodo
3,https://doi.org/10.5281/zenodo.6246853,Environmental variables measured in 624 lakes ...,<p>The file &ldquo;LakePluse_env_624lakes.csv&...,,,Lakes; Canada; fish; Environmental variables,,2022-02-23,https://doi.org/10.5281/zenodo.4701262; https:...,87,zenodo
4,https://doi.org/10.5061/dryad.0rxwdbs2c,Exploration and diet specialization in eastern...,<p>Individual diet specialization (IDS) is wid...,"<p>From 2012 to 2016, we live-trapped wild eas...",<p>Funding provided by: Natural Sciences and E...,C and N stable isotopes; exploration behavior;...,,2022-03-08,https://doi.org/10.5281/zenodo.5898699,7,zenodo


Number of results:

In [44]:
len(df_zenodo.index)

114

First filter: we check that "Québec" or "Quebec" appears in at least one of the title, method, description, notes, keywords or locations fields.

In [28]:
text_cols = ["title", "method", "description", "notes", "keywords", "locations"]
# Replace missing fields with "" first: with "+", a single NaN field turns the whole concatenation into NaN
df_zenodo["full_text"] = df_zenodo[text_cols].fillna("").astype(str).agg(" ".join, axis=1)
df_zenodo["quebec"] = df_zenodo["full_text"].apply(lambda x: kw_in_text(x, ["Québec", "Quebec"]))
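
The following sketch illustrates, on toy data, why missing fields matter when the text columns are concatenated. `kw_in_text` is defined in `func_dataset_retrieval.py` and is not shown here, so `contains_any_keyword` is only a hypothetical stand-in whose behaviour (case-insensitive substring match returning 0 or 1) is an assumption:

```python
import pandas as pd

def contains_any_keyword(text, keywords):
    """Hypothetical stand-in for kw_in_text: 1 if any keyword occurs in text, else 0."""
    if not isinstance(text, str):
        return 0
    lowered = text.lower()
    return int(any(kw.lower() in lowered for kw in keywords))

toy = pd.DataFrame({
    "title": ["Lakes of Québec", "Butternut range shifts"],
    "locations": [None, "Canada; Ontario"],
})

# Plain "+" propagates missing values: row 0 becomes NaN although its title mentions Québec.
naive = toy["title"] + " " + toy["locations"]

# Filling missing fields with "" keeps the text that is present.
safe = toy[["title", "locations"]].fillna("").agg(" ".join, axis=1)

toy["quebec"] = safe.apply(lambda x: contains_any_keyword(x, ["Québec", "Quebec"]))
```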

Second filter: flag the URLs that were already retrieved in a previous run, so that we can record that the new query detected them as well.

In [45]:
done_articles = pd.read_excel("C://Users//vals3103//Downloads//to_do_11_03_22.xlsx", engine = "openpyxl")
done_urls = list(set(done_articles.url.to_list()))


In [40]:
df_zenodo["done"] = 0
# Assign through .loc rather than chained indexing, which raises SettingWithCopyWarning
df_zenodo.loc[df_zenodo["url"].isin(done_urls), "done"] = 1
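
As an illustration of this second filter, the sketch below flags already-processed URLs in a toy result table and separates the rows whose query id still has to be recorded from the rows that are new. The URLs and the names `to_update` and `to_review` are hypothetical:

```python
import pandas as pd

# Toy results from a new query run (URLs are placeholders).
new_results = pd.DataFrame({
    "url": ["https://example.org/a", "https://example.org/b", "https://example.org/c"],
    "id_query": [7, 8, 7],
})
done_urls = {"https://example.org/b"}

# 1 if the URL was already handled in a previous run, else 0.
new_results["done"] = new_results["url"].isin(done_urls).astype(int)

# Already-handled URLs: keep the query id so the earlier table can be updated.
to_update = new_results.loc[new_results["done"] == 1, ["url", "id_query"]]

# URLs not seen before: these still need to be reviewed.
to_review = new_results.loc[new_results["done"] == 0]
```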


Export the dataset

In [43]:
df_zenodo.to_excel("C://Users//vals3103//Post-doc//Text_mining//zenodo_DD_MM_YYYY.xlsx", header=True, index=False)
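
The `DD_MM_YYYY` part of the file name is a placeholder for the export date. One possible way to fill it in automatically is sketched below; the helper `dated_filename` is hypothetical:

```python
from datetime import date

def dated_filename(prefix, day, extension="xlsx"):
    """Hypothetical helper: build e.g. 'zenodo_11_03_2022.xlsx' from a prefix and a date."""
    return f"{prefix}_{day:%d_%m_%Y}.{extension}"

example = dated_filename("zenodo", date(2022, 3, 11))

# For a real export, date.today() would replace the fixed date:
# df_zenodo.to_excel(dated_filename("zenodo", date.today()), header=True, index=False)
```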