# Novelpy Tutorial

This notebook showcases the capabilities of the Novelpy package on a controlled sample. We discuss the features we have already implemented and those we intend to add in the future. The notebook works exclusively with JSON files. Note, however, that we typically use MongoDB for RAM efficiency and, sometimes, for speed. If you prefer MongoDB, refer to this notebook instead: [Novelpy MongoDB Tutorial](https://github.com/Kwirtz/novelpy/tutorial/tuts_MongoDB.ipynb) (the comments and discussion remain the same; only the scripts change).

First, we recommend creating a dedicated environment: we use SciPy, which can be tricky in terms of compatibility. Then create a project folder and add the en_core_sci_lg folder inside it (you can find it here: https://allenai.github.io/scispacy/). The path to the files should look like this: en_core_sci_lg-0.5.3\en_core_sci_lg\en_core_sci_lg-0.5.3. As you can see, we use version 0.5.3; we will point out what to change if you have another version.

We have provided a small sample of data to help you become acquainted with the package and the required data structure. To obtain this sample, one needs to run the following code in the "project" folder:

In [2]:
from novelpy.utils.get_sample import download_sample
download_sample()

Citation_net_sample.zip: 100%|██████████| 191M/191M [01:05<00:00, 3.08MiB/s] 
Meshterms_sample.zip: 100%|██████████| 149M/149M [00:34<00:00, 4.56MiB/s] 
Ref_Journals_sample.zip: 100%|██████████| 16.0M/16.0M [00:00<00:00, 37.7MiB/s]
Title_abs_sample.zip: 100%|██████████| 784M/784M [02:27<00:00, 5.57MiB/s] 
authors_sample.zip: 100%|██████████| 396M/396M [01:15<00:00, 5.50MiB/s] 


This will create a folder named "Data" with various subfolders inside. Within each subfolder, you will find one JSON file per year. Let's explore each folder to understand its purpose and the structure of the data inside. First, most novelty indicators work at the year level, which explains the choice of one file per year. Second, the information you need depends on which indicator you run. Please refer to this paper, https://arxiv.org/abs/2211.10346, if you want to learn more about what we summarize here.
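To make this layout concrete, here is a minimal sketch that loads the file of one year and checks that it is a list of dictionaries. The helper name `load_year`, the `<year>.json` file naming and the example path are assumptions; adapt them to what `download_sample()` actually created on your machine.

```python
import json
from pathlib import Path


def load_year(collection_dir, year):
    """Return the list of document dictionaries stored for one year.

    `collection_dir` is the folder of one collection. Both the folder
    location and the "<year>.json" naming are assumptions of this sketch.
    """
    path = Path(collection_dir) / f"{year}.json"
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)
    if not isinstance(docs, list):
        raise ValueError(f"{path} should contain a list of dictionaries")
    return docs


# Example usage (hypothetical path):
# docs = load_year("Data/docs/Meshterms_sample", 2000)
# print(len(docs), sorted(docs[0].keys()))
```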

## Co-occurrence-based novelty indicators
Let us start with the indicators that use a co-occurrence matrix. These indicators look at either the combination of journals in the references of a paper or the combination of keywords (MeSH terms in the case of PubMed).
Here is a list of these indicators: Uzzi et al. (2013), Foster et al. (2015), Lee et al. (2015) and Wang et al. (2017).

For these indicators you only need the folders Meshterms_sample or Ref_Journals_sample. For the indicators of Foster et al., Lee et al. and Wang et al. you only need three pieces of information per document: its ID, its year of creation and the entities it uses. Each JSON file is therefore a list of dictionaries. Here is an example of a single dictionary:


dict_Ref_Journals = {"PMID": 16992327, "year": 1896, "c04_referencelist": [{"item": "0022-3751"}]}

OR

dict_Meshterms = {"PMID": 12255534, "year": 1902, "Mesh_year_category": [{"descUI": "D000830"}, {"descUI": "D001695"}]}
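To illustrate what "co-occurrence" means for documents in this format, the following sketch counts how often each pair of entities appears together. It is a simplified illustration of the idea, not Novelpy's implementation: the helper `count_cooccurrences` is hypothetical, the two documents are toy data, and each entity is counted once per document.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(docs, var, sub_var):
    """Count how often each unordered pair of entities appears together."""
    counts = Counter()
    for doc in docs:
        # Unique, sorted entities so that (a, b) and (b, a) are the same pair
        items = sorted({entry[sub_var] for entry in doc.get(var, [])})
        counts.update(combinations(items, 2))
    return counts


# Toy documents following the structure shown above
docs = [
    {"PMID": 1, "year": 1902,
     "Mesh_year_category": [{"descUI": "D000830"}, {"descUI": "D001695"}]},
    {"PMID": 2, "year": 1902,
     "Mesh_year_category": [{"descUI": "D001695"}, {"descUI": "D000830"},
                            {"descUI": "D002318"}]},
]
cooc = count_cooccurrences(docs, "Mesh_year_category", "descUI")
print(cooc[("D000830", "D001695")])  # 2
```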

For the indicator of Uzzi et al. you also need the year of creation of the entities:


dict_Ref_Journals = {"PMID": 16992327, "year": 1896, "c04_referencelist": [{"item": "0022-3751", "year": 1893}]}

OR

dict_Meshterms = {"PMID": 12255534, "year": 1902, "Mesh_year_category": [{"descUI": "D000830", "year": 1999}, {"descUI": "D001695", "year": 1999}]}
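Before running the indicator of Uzzi et al., it can be useful to check that every entity carries its own year. The sketch below is one way to do so; `missing_entity_years` is a hypothetical helper, not part of Novelpy.

```python
def missing_entity_years(docs, var, id_var="PMID"):
    """Return the IDs of documents where at least one entity has no "year" key."""
    return [doc[id_var] for doc in docs
            if any("year" not in entity for entity in doc.get(var, []))]


with_years = {"PMID": 16992327, "year": 1896,
              "c04_referencelist": [{"item": "0022-3751", "year": 1893}]}
without_years = {"PMID": 16992327, "year": 1896,
                 "c04_referencelist": [{"item": "0022-3751"}]}

print(missing_entity_years([with_years], "c04_referencelist"))     # []
print(missing_entity_years([without_years], "c04_referencelist"))  # [16992327]
```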

## Word-embedding-based novelty indicators

To run Shibayama et al. (2021), one needs Citation_net_sample (which contains, among other things, the list of IDs of the papers each document cites) as well as Title_abs_sample, in which you will find the abstract and/or title of papers.

dict_citation_net = {"PMID": 20793277, "year": 1850, "refs_pmid_wos": [20794613, 20794649, 20794685, 20794701, 20794789, 20794829]}

AND

dict_title_abs = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title", "a04_abstract":[{"AbstractText":"This is the abstract"}]}

Alternatively, you can use the following format for the title and abstract. In this case, leave the abstract_sub_variable argument empty:

dict_title_abs = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title", "a04_abstract":"This is the abstract"}
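The two abstract layouts can be read with a single accessor, sketched below. The function `get_abstract` and the parameter name `abstract_variable` are hypothetical; only `abstract_sub_variable` is named in this tutorial, and treating "empty" as `None` is an assumption of this sketch.

```python
def get_abstract(doc, abstract_variable="a04_abstract", abstract_sub_variable=None):
    """Return the abstract of `doc` as one string, whatever its layout."""
    value = doc.get(abstract_variable)
    if not value:
        return ""
    if not abstract_sub_variable:
        # Flat layout: the abstract is stored directly as a string
        return value
    # Nested layout: a list of dictionaries, each holding one piece of text
    return " ".join(part[abstract_sub_variable] for part in value
                    if abstract_sub_variable in part)


nested = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title",
          "a04_abstract": [{"AbstractText": "This is the abstract"}]}
flat = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title",
        "a04_abstract": "This is the abstract"}

print(get_abstract(nested, abstract_sub_variable="AbstractText"))
print(get_abstract(flat))
```

Both calls print the same text, which is why the sub-variable can be left out for the flat layout.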

To run Pelletier and Wirtz (2023), you need Title_abs_sample as well as authors_sample, in which you will find the list of authors of each document.

dict_authors_list = {"PMID": 20793277, "year": 1850, "a02_authorlist": [{"id":201645},{"id":51331354}]}

AND

dict_title_abs = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title", "a04_abstract":[{"AbstractText":"This is the abstract"}]}

Alternatively, you can use the following format for the title and abstract. In this case, leave the abstract_sub_variable argument empty:

dict_title_abs = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title", "a04_abstract":"This is the abstract"}
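With the author lists in this format, a natural first step is to index papers by author ID so that the documents of a given author can be looked up. The sketch below shows one way to build such an index; `papers_by_author` is a hypothetical helper and the documents are toy data. Whether Novelpy builds a similar index internally is not stated here.

```python
from collections import defaultdict


def papers_by_author(docs, var="a02_authorlist", sub_var="id", id_var="PMID"):
    """Map each author ID to the list of document IDs they appear on."""
    index = defaultdict(list)
    for doc in docs:
        for author in doc.get(var, []):
            index[author[sub_var]].append(doc[id_var])
    return dict(index)


# Toy documents following the structure shown above
docs = [
    {"PMID": 20793277, "year": 1850,
     "a02_authorlist": [{"id": 201645}, {"id": 51331354}]},
    {"PMID": 20793278, "year": 1851,
     "a02_authorlist": [{"id": 201645}]},
]
index = papers_by_author(docs)
print(index[201645])  # [20793277, 20793278]
```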


## Disruption indicators




In [2]:
import novelpy

# Build every remaining co-occurrence matrix (all those not already created above)

ref_cooc = novelpy.utils.cooc_utils.create_cooc(
                 collection_name = "Ref_Journals_sample",
                 year_var="year",
                 var = "c04_referencelist",
                 sub_var = "item",
                 time_window = range(1995,2016),
                 weighted_network = True, self_loop = True)

ref_cooc.main()

ref_cooc = novelpy.utils.cooc_utils.create_cooc(
                 collection_name = "Ref_Journals_sample",
                 year_var="year",
                 var = "c04_referencelist",
                 sub_var = "item",
                 time_window = range(1995,2016),
                 weighted_network = False, self_loop = False)

ref_cooc.main()

ref_cooc = novelpy.utils.cooc_utils.create_cooc(
                 collection_name = "Meshterms_sample",
                 year_var="year",
                 var = "Mesh_year_category",
                 sub_var = "descUI",
                 time_window = range(1995,2016),
                 weighted_network = True, self_loop = True)

ref_cooc.main()

ref_cooc = novelpy.utils.cooc_utils.create_cooc(
                 collection_name = "Meshterms_sample",
                 year_var="year",
                 var = "Mesh_year_category",
                 sub_var = "descUI",
                 time_window = range(1995,2016),
                 weighted_network = False, self_loop = False)

ref_cooc.main()


Get item list, loop on every doc: 100%|██████████| 38874/38874 [00:00<00:00, 1164363.93it/s]
Get item list, loop on every doc: 100%|██████████| 40946/40946 [00:00<00:00, 1296914.20it/s]
Get item list, loop on every doc: 100%|██████████| 42302/42302 [00:00<00:00, 1219306.93it/s]
Get item list, loop on every doc: 100%|██████████| 44803/44803 [00:00<00:00, 1312364.01it/s]
Get item list, loop on every doc: 100%|██████████| 46779/46779 [00:00<00:00, 1257492.82it/s]
Get item list, loop on every doc: 100%|██████████| 49872/49872 [00:00<00:00, 1260300.22it/s]
Get item list, loop on every doc: 100%|██████████| 52046/52046 [00:00<00:00, 1181956.60it/s]
Get item list, loop on every doc: 100%|██████████| 54721/54721 [00:00<00:00, 1251883.19it/s]
Get item list, loop on every doc: 100%|██████████| 58439/58439 [00:00<00:00, 1232580.37it/s]
Get item list, loop on every doc: 100%|██████████| 62241/62241 [00:00<00:00, 1131452.08it/s]
Get item list, loop on every doc: 100%|██████████| 67361/67361 [00:00<
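The four `create_cooc` calls above differ only in the collection and in the `weighted_network` / `self_loop` flags, so they can also be generated from a small table. This is a sketch of that refactoring: `build_cooc_kwargs` and `run_all` are hypothetical helper names, while the keyword arguments are copied from the calls above. `novelpy` is imported lazily inside `run_all` so that the configurations can be inspected without launching the computation.

```python
# (collection_name, var, sub_var) for each collection used above
collections = [
    ("Ref_Journals_sample", "c04_referencelist", "item"),
    ("Meshterms_sample", "Mesh_year_category", "descUI"),
]
# (weighted_network, self_loop) pairs used above
network_options = [(True, True), (False, False)]


def build_cooc_kwargs(time_window=range(1995, 2016)):
    """Return one keyword-argument dictionary per co-occurrence matrix."""
    return [
        dict(collection_name=name, year_var="year", var=var, sub_var=sub_var,
             time_window=time_window, weighted_network=weighted, self_loop=loop)
        for name, var, sub_var in collections
        for weighted, loop in network_options
    ]


def run_all():
    """Build every matrix; requires novelpy and the sample data."""
    import novelpy
    for kwargs in build_cooc_kwargs():
        novelpy.utils.cooc_utils.create_cooc(**kwargs).main()


for kwargs in build_cooc_kwargs():
    print(kwargs["collection_name"], kwargs["weighted_network"], kwargs["self_loop"])
```

Calling `run_all()` would then perform the same four computations as the cell above, in the same order.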

In [None]:
import tqdm
import novelpy

# Uzzi et al.(2013) Meshterms_sample
for focal_year in tqdm.tqdm(range(2009,2011), desc = "Computing indicator for window of time"):
    Uzzi = novelpy.indicators.Uzzi2013(collection_name = "Meshterms_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "Mesh_year_category",
                                           sub_variable = "descUI",
                                           focal_year = focal_year,
                                           density = True)
    Uzzi.get_indicator()

In [None]:
import tqdm
import novelpy


# Uzzi et al.(2013) Ref_Journals_sample
for focal_year in tqdm.tqdm(range(2000,2011), desc = "Computing indicator for window of time"):
    Uzzi = novelpy.indicators.Uzzi2013(collection_name = "Ref_Journals_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "c04_referencelist",
                                           sub_variable = "item",
                                           focal_year = focal_year,
                                           density = True)
    Uzzi.get_indicator()
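As we understand Uzzi et al. (2013), each pair of entities is scored with a z-score that compares its observed frequency with its frequency in randomized networks. The sketch below shows that computation for a single pair with made-up numbers. The helper `z_score` is hypothetical and is not Novelpy's code, and using the population standard deviation is a choice of this sketch.

```python
from statistics import mean, pstdev


def z_score(observed, simulated):
    """Compare an observed pair frequency with frequencies from random draws."""
    mu = mean(simulated)
    sigma = pstdev(simulated)
    if sigma == 0:
        # No variation across the random draws: the z-score is undefined
        return float("nan")
    return (observed - mu) / sigma


# A pair seen 12 times, against four made-up randomized frequencies
print(z_score(12, [4, 6, 5, 5]))
```

A strongly positive value indicates a pair that is more common than expected by chance, and a negative value indicates an atypical pair.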

In [None]:
import tqdm
import novelpy




# Foster et al.(2015) Meshterms_sample
for focal_year in tqdm.tqdm(range(2000,2011), desc = "Computing indicator for window of time"):
    Foster = novelpy.indicators.Foster2015(collection_name = "Meshterms_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "Mesh_year_category",
                                           sub_variable = "descUI",
                                           focal_year = focal_year,
                                           starting_year = 1995,
                                           community_algorithm = "Louvain",
                                           density = True)
    Foster.get_indicator()


In [None]:
import tqdm
import novelpy

# Foster et al.(2015) Ref_Journals_sample
for focal_year in tqdm.tqdm(range(2000,2011), desc = "Computing indicator for window of time"):
    Foster = novelpy.indicators.Foster2015(collection_name = "Ref_Journals_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "c04_referencelist",
                                           sub_variable = "item",
                                           focal_year = focal_year,
                                           starting_year = 1995,
                                           community_algorithm = "Louvain",
                                           density = True)
    Foster.get_indicator()

In [1]:
import tqdm
import novelpy

# Lee et al.(2015) Meshterms_sample
for focal_year in tqdm.tqdm(range(2000,2011), desc = "Computing indicator for window of time"):
    Lee = novelpy.indicators.Lee2015(collection_name = "Meshterms_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "Mesh_year_category",
                                           sub_variable = "descUI",
                                           focal_year = focal_year,
                                           density = True)
    Lee.get_indicator()

Computing indicator for window of time:   0%|          | 0/11 [00:00<?, ?it/s]

loading cooc for focal year 2000
cooc loaded !
loading items for papers in 2000


get_papers_item: 100%|██████████| 49872/49872 [00:00<00:00, 53323.68it/s]


items_loaded !
Getting score per year ...
comb_scores


  recip = np.true_divide(1., other)


pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 49872/49872 [01:52<00:00, 443.41it/s]
Computing indicator for window of time:   9%|▉         | 1/11 [02:08<21:22, 128.21s/it]

saved
Done !
loading cooc for focal year 2001
cooc loaded !
loading items for papers in 2001


get_papers_item: 100%|██████████| 52046/52046 [00:01<00:00, 42818.89it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 52046/52046 [01:57<00:00, 441.42it/s]
Computing indicator for window of time:  18%|█▊        | 2/11 [04:24<19:55, 132.82s/it]

saved
Done !
loading cooc for focal year 2002
cooc loaded !
loading items for papers in 2002


get_papers_item: 100%|██████████| 54721/54721 [00:01<00:00, 43166.12it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 54721/54721 [02:01<00:00, 451.70it/s]
Computing indicator for window of time:  27%|██▋       | 3/11 [06:43<18:06, 135.87s/it]

saved
Done !
loading cooc for focal year 2003
cooc loaded !
loading items for papers in 2003


get_papers_item: 100%|██████████| 58439/58439 [00:01<00:00, 47592.20it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 58439/58439 [02:15<00:00, 430.32it/s]
Computing indicator for window of time:  36%|███▋      | 4/11 [09:18<16:42, 143.27s/it]

saved
Done !
loading cooc for focal year 2004
cooc loaded !
loading items for papers in 2004


get_papers_item: 100%|██████████| 62241/62241 [00:01<00:00, 42111.46it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 62241/62241 [02:24<00:00, 432.01it/s]
Computing indicator for window of time:  45%|████▌     | 5/11 [12:02<15:04, 150.82s/it]

saved
Done !
loading cooc for focal year 2005
cooc loaded !
loading items for papers in 2005


get_papers_item: 100%|██████████| 67361/67361 [00:01<00:00, 66996.55it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 67361/67361 [02:31<00:00, 443.36it/s]
Computing indicator for window of time:  55%|█████▍    | 6/11 [14:56<13:12, 158.54s/it]

saved
Done !
loading cooc for focal year 2006
cooc loaded !
loading items for papers in 2006


get_papers_item: 100%|██████████| 70501/70501 [00:01<00:00, 43465.62it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 70501/70501 [02:29<00:00, 471.94it/s]
Computing indicator for window of time:  64%|██████▎   | 7/11 [17:46<10:49, 162.26s/it]

saved
Done !
loading cooc for focal year 2007
cooc loaded !
loading items for papers in 2007


get_papers_item: 100%|██████████| 75717/75717 [00:01<00:00, 50012.35it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 75717/75717 [02:36<00:00, 482.69it/s]
Computing indicator for window of time:  73%|███████▎  | 8/11 [20:43<08:21, 167.24s/it]

saved
Done !
loading cooc for focal year 2008
cooc loaded !
loading items for papers in 2008


get_papers_item: 100%|██████████| 81228/81228 [00:01<00:00, 50687.74it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 81228/81228 [02:48<00:00, 481.42it/s]
Computing indicator for window of time:  82%|████████▏ | 9/11 [23:55<05:49, 174.85s/it]

saved
Done !
loading cooc for focal year 2009
cooc loaded !
loading items for papers in 2009


get_papers_item: 100%|██████████| 84496/84496 [00:01<00:00, 50581.36it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 84496/84496 [02:59<00:00, 471.64it/s]
Computing indicator for window of time:  91%|█████████ | 10/11 [27:17<03:03, 183.20s/it]

saved
Done !
loading cooc for focal year 2010
cooc loaded !
loading items for papers in 2010


get_papers_item: 100%|██████████| 89168/89168 [00:01<00:00, 51556.36it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 89168/89168 [03:09<00:00, 470.51it/s]
Computing indicator for window of time: 100%|██████████| 11/11 [30:50<00:00, 168.24s/it]

saved
Done !





In [2]:
import tqdm
import novelpy



# Lee et al.(2015) Ref_Journals_sample
for focal_year in tqdm.tqdm(range(2000,2011), desc = "Computing indicator for window of time"):
    Lee = novelpy.indicators.Lee2015(collection_name = "Ref_Journals_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "c04_referencelist",
                                           sub_variable = "item",
                                           focal_year = focal_year,
                                           density = True)
    Lee.get_indicator()

Computing indicator for window of time:   0%|          | 0/11 [00:00<?, ?it/s]

loading cooc for focal year 2000
cooc loaded !
loading items for papers in 2000


get_papers_item: 100%|██████████| 49872/49872 [00:00<00:00, 404171.84it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 49872/49872 [00:01<00:00, 25918.47it/s]
Computing indicator for window of time:   9%|▉         | 1/11 [00:05<00:50,  5.05s/it]

saved
Done !
loading cooc for focal year 2001
cooc loaded !
loading items for papers in 2001


get_papers_item: 100%|██████████| 52046/52046 [00:00<00:00, 182525.42it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 52046/52046 [00:02<00:00, 25885.47it/s]
Computing indicator for window of time:  18%|█▊        | 2/11 [00:09<00:44,  4.93s/it]

saved
Done !
loading cooc for focal year 2002
cooc loaded !
loading items for papers in 2002


get_papers_item: 100%|██████████| 54721/54721 [00:00<00:00, 211928.17it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 54721/54721 [00:02<00:00, 24569.50it/s]
Computing indicator for window of time:  27%|██▋       | 3/11 [00:14<00:39,  4.96s/it]

saved
Done !
loading cooc for focal year 2003
cooc loaded !
loading items for papers in 2003


get_papers_item: 100%|██████████| 58439/58439 [00:00<00:00, 218412.22it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 58439/58439 [00:02<00:00, 25539.64it/s]
Computing indicator for window of time:  36%|███▋      | 4/11 [00:19<00:35,  5.01s/it]

saved
Done !
loading cooc for focal year 2004
cooc loaded !
loading items for papers in 2004


get_papers_item: 100%|██████████| 62241/62241 [00:00<00:00, 196304.19it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 62241/62241 [00:02<00:00, 23082.57it/s]
Computing indicator for window of time:  45%|████▌     | 5/11 [00:25<00:31,  5.22s/it]

saved
Done !
loading cooc for focal year 2005
cooc loaded !
loading items for papers in 2005


get_papers_item: 100%|██████████| 67361/67361 [00:00<00:00, 226793.84it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 67361/67361 [00:04<00:00, 16020.55it/s]
Computing indicator for window of time:  55%|█████▍    | 6/11 [00:32<00:29,  5.89s/it]

saved
Done !
loading cooc for focal year 2006
cooc loaded !
loading items for papers in 2006


get_papers_item: 100%|██████████| 70501/70501 [00:00<00:00, 230408.53it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 70501/70501 [00:02<00:00, 28607.79it/s]
Computing indicator for window of time:  64%|██████▎   | 7/11 [00:38<00:22,  5.74s/it]

saved
Done !
loading cooc for focal year 2007
cooc loaded !
loading items for papers in 2007


get_papers_item: 100%|██████████| 75717/75717 [00:00<00:00, 193529.23it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 75717/75717 [00:02<00:00, 29291.18it/s]
Computing indicator for window of time:  73%|███████▎  | 8/11 [00:43<00:17,  5.69s/it]

saved
Done !
loading cooc for focal year 2008
cooc loaded !
loading items for papers in 2008


get_papers_item: 100%|██████████| 81228/81228 [00:00<00:00, 239960.31it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 81228/81228 [00:02<00:00, 30095.45it/s]
Computing indicator for window of time:  82%|████████▏ | 9/11 [00:49<00:11,  5.77s/it]

saved
Done !
loading cooc for focal year 2009
cooc loaded !
loading items for papers in 2009


get_papers_item: 100%|██████████| 84496/84496 [00:00<00:00, 221835.19it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 84496/84496 [00:02<00:00, 30899.75it/s]
Computing indicator for window of time:  91%|█████████ | 10/11 [00:55<00:05,  5.79s/it]

saved
Done !
loading cooc for focal year 2010
cooc loaded !
loading items for papers in 2010


get_papers_item: 100%|██████████| 89168/89168 [00:00<00:00, 244659.78it/s]


items_loaded !
Getting score per year ...
comb_scores
pickle dump
Matrice done !
Getting score per paper ...


start: 100%|██████████| 89168/89168 [00:03<00:00, 27012.61it/s]
Computing indicator for window of time: 100%|██████████| 11/11 [01:01<00:00,  5.63s/it]

saved
Done !
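For intuition about what the Lee et al. (2015) runs above compute: as we understand the paper, the "commonness" of a pair is the ratio between its observed and expected number of co-occurrences, and a paper's novelty is derived from the negative log of a low percentile (the 10th) of the commonness of its pairs. The sketch below covers a single pair with made-up counts. The function names are hypothetical and this is not Novelpy's implementation.

```python
import math


def commonness(n_ij, n_i, n_j, n_total):
    """Observed over expected co-occurrences of the pair (i, j).

    n_ij: number of (i, j) pairs, n_i / n_j: number of pairs involving
    i / j, n_total: number of all pairs in the year.
    """
    return n_ij * n_total / (n_i * n_j)


def pair_novelty(commonness_value):
    """Negative log of commonness: rarer-than-expected pairs score higher."""
    return -math.log(commonness_value)


c = commonness(n_ij=2, n_i=10, n_j=20, n_total=1000)
print(c)  # 10.0
print(pair_novelty(c))
```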





In [4]:
import tqdm
import novelpy

# Wang et al.(2017) Meshterms_sample
for focal_year in tqdm.tqdm(range(2000,2011)):
    Wang = novelpy.indicators.Wang2017(collection_name = "Meshterms_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "Mesh_year_category",
                                           sub_variable = "descUI",
                                           focal_year = focal_year,
                                           time_window_cooc = 3,
                                           n_reutilisation = 1,
                                           starting_year = 1995,
                                           density = True)
    Wang.get_indicator()


100%|██████████| 3/3 [00:03<00:00,  1.18s/it]


loading cooc for focal year 2000
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2000


get_papers_item: 100%|██████████| 49872/49872 [00:00<00:00, 87135.96it/s] 


items_loaded !
Getting score per year ...


  self.futur_adj[self.futur_adj < self.n_reutilisation] = 0
  self._set_arrayXarray(i, j, x)


Matrice done !
Getting score per paper ...


start: 100%|██████████| 49872/49872 [09:37<00:00, 86.42it/s]
  9%|▉         | 1/11 [14:19<2:23:16, 859.70s/it]

saved
Done !


100%|██████████| 3/3 [00:04<00:00,  1.58s/it]


loading cooc for focal year 2001
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2001


get_papers_item: 100%|██████████| 52046/52046 [00:00<00:00, 77191.65it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 52046/52046 [10:04<00:00, 86.08it/s] 
 18%|█▊        | 2/11 [29:27<2:13:10, 887.83s/it]

saved
Done !


100%|██████████| 3/3 [00:04<00:00,  1.61s/it]


loading cooc for focal year 2002
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2002


get_papers_item: 100%|██████████| 54721/54721 [00:00<00:00, 73920.76it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 54721/54721 [10:23<00:00, 87.74it/s] 
 27%|██▋       | 3/11 [45:21<2:02:27, 918.38s/it]

saved
Done !


100%|██████████| 3/3 [00:03<00:00,  1.21s/it]


loading cooc for focal year 2003
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2003


get_papers_item: 100%|██████████| 58439/58439 [00:02<00:00, 23427.73it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 58439/58439 [11:33<00:00, 84.23it/s]
 36%|███▋      | 4/11 [1:02:54<1:53:18, 971.24s/it]

saved
Done !


100%|██████████| 3/3 [00:03<00:00,  1.22s/it]


loading cooc for focal year 2004
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2004


get_papers_item: 100%|██████████| 62241/62241 [00:02<00:00, 25570.52it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 62241/62241 [12:38<00:00, 82.05it/s] 
 45%|████▌     | 5/11 [1:22:03<1:43:31, 1035.29s/it]

saved
Done !


100%|██████████| 3/3 [00:04<00:00,  1.52s/it]


loading cooc for focal year 2005
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2005


get_papers_item: 100%|██████████| 67361/67361 [00:00<00:00, 184603.74it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 67361/67361 [13:20<00:00, 84.18it/s] 
 55%|█████▍    | 6/11 [1:42:08<1:31:05, 1093.16s/it]

saved
Done !


100%|██████████| 3/3 [00:04<00:00,  1.35s/it]


loading cooc for focal year 2006
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2006


get_papers_item: 100%|██████████| 70501/70501 [00:02<00:00, 27513.05it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 70501/70501 [13:07<00:00, 89.51it/s]
 64%|██████▎   | 7/11 [2:02:16<1:15:22, 1130.62s/it]

saved
Done !


100%|██████████| 3/3 [00:04<00:00,  1.44s/it]


loading cooc for focal year 2007
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2007


get_papers_item: 100%|██████████| 75717/75717 [00:02<00:00, 27126.15it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 75717/75717 [13:42<00:00, 92.06it/s] 
 73%|███████▎  | 8/11 [2:23:18<58:37, 1172.62s/it]  

saved
Done !


100%|██████████| 3/3 [00:04<00:00,  1.49s/it]


loading cooc for focal year 2008
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2008


get_papers_item: 100%|██████████| 81228/81228 [00:02<00:00, 30761.85it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 81228/81228 [14:26<00:00, 93.70it/s]
 82%|████████▏ | 9/11 [2:45:09<40:31, 1215.62s/it]

saved
Done !


100%|██████████| 3/3 [00:06<00:00,  2.24s/it]


loading cooc for focal year 2009
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2009


get_papers_item: 100%|██████████| 84496/84496 [00:00<00:00, 85296.54it/s] 


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 84496/84496 [15:33<00:00, 90.54it/s] 
 91%|█████████ | 10/11 [3:08:20<21:09, 1269.79s/it]

saved
Done !


100%|██████████| 3/3 [00:04<00:00,  1.66s/it]


loading cooc for focal year 2010
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2010


get_papers_item: 100%|██████████| 89168/89168 [00:02<00:00, 33509.20it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 89168/89168 [16:29<00:00, 90.08it/s]
100%|██████████| 11/11 [3:32:37<00:00, 1159.75s/it]

saved
Done !





In [5]:
import tqdm
import novelpy

# Wang et al.(2017) Ref_Journals_sample
for focal_year in tqdm.tqdm(range(2000,2011)):
    Wang = novelpy.indicators.Wang2017(collection_name = "Ref_Journals_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "c04_referencelist",
                                           sub_variable = "item",
                                           focal_year = focal_year,
                                           time_window_cooc = 3,
                                           n_reutilisation = 1,
                                           starting_year = 1995,
                                           density = True)
    Wang.get_indicator()

100%|██████████| 3/3 [00:00<00:00,  5.12it/s]


loading cooc for focal year 2000
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2000


get_papers_item: 100%|██████████| 49872/49872 [00:00<00:00, 457552.43it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 49872/49872 [00:02<00:00, 23315.48it/s]
  9%|▉         | 1/11 [01:51<18:32, 111.20s/it]

saved
Done !


100%|██████████| 3/3 [00:00<00:00,  8.62it/s]


loading cooc for focal year 2001
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2001


get_papers_item: 100%|██████████| 52046/52046 [00:00<00:00, 452576.69it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 52046/52046 [00:02<00:00, 21755.20it/s]
 18%|█▊        | 2/11 [03:45<16:55, 112.85s/it]

saved
Done !


100%|██████████| 3/3 [00:00<00:00,  8.13it/s]


loading cooc for focal year 2002
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2002


get_papers_item: 100%|██████████| 54721/54721 [00:00<00:00, 434292.26it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 54721/54721 [00:02<00:00, 20356.87it/s]
 27%|██▋       | 3/11 [05:44<15:26, 115.80s/it]

saved
Done !


100%|██████████| 3/3 [00:00<00:00,  6.83it/s]


loading cooc for focal year 2003
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2003


get_papers_item: 100%|██████████| 58439/58439 [00:00<00:00, 439396.83it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 58439/58439 [00:03<00:00, 19341.30it/s]
 36%|███▋      | 4/11 [07:47<13:51, 118.79s/it]

saved
Done !


100%|██████████| 3/3 [00:00<00:00,  6.65it/s]


loading cooc for focal year 2004
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2004


get_papers_item: 100%|██████████| 62241/62241 [00:01<00:00, 50761.83it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 62241/62241 [00:03<00:00, 17772.24it/s]
 45%|████▌     | 5/11 [09:53<12:06, 121.09s/it]

saved
Done !


100%|██████████| 3/3 [00:00<00:00,  6.02it/s]


loading cooc for focal year 2005
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2005


get_papers_item: 100%|██████████| 67361/67361 [00:00<00:00, 426262.78it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 67361/67361 [00:04<00:00, 14828.80it/s]
 55%|█████▍    | 6/11 [11:59<10:14, 122.91s/it]

saved
Done !


100%|██████████| 3/3 [00:00<00:00,  6.07it/s]


loading cooc for focal year 2006
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2006


get_papers_item: 100%|██████████| 70501/70501 [00:01<00:00, 58994.49it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 70501/70501 [00:03<00:00, 18863.69it/s]
 64%|██████▎   | 7/11 [14:07<08:17, 124.45s/it]

saved
Done !


100%|██████████| 3/3 [00:00<00:00,  5.84it/s]


loading cooc for focal year 2007
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2007


get_papers_item: 100%|██████████| 75717/75717 [00:00<00:00, 596202.01it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 75717/75717 [00:04<00:00, 18729.54it/s]
 73%|███████▎  | 8/11 [16:16<06:18, 126.11s/it]

saved
Done !


100%|██████████| 3/3 [00:00<00:00,  5.52it/s]


loading cooc for focal year 2008
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2008


get_papers_item: 100%|██████████| 81228/81228 [00:01<00:00, 63858.64it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 81228/81228 [00:04<00:00, 16938.12it/s]
 82%|████████▏ | 9/11 [18:28<04:15, 127.96s/it]

saved
Done !


100%|██████████| 3/3 [00:00<00:00,  5.05it/s]


loading cooc for focal year 2009
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2009


get_papers_item: 100%|██████████| 84496/84496 [00:00<00:00, 435547.98it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 84496/84496 [00:06<00:00, 13997.56it/s]
 91%|█████████ | 10/11 [20:42<02:09, 129.81s/it]

saved
Done !


100%|██████████| 3/3 [00:00<00:00,  4.80it/s]


loading cooc for focal year 2010
Calculate past matrix 
Calculate futur matrix
Calculate difficulty matrix
cooc loaded !
loading items for papers in 2010


get_papers_item: 100%|██████████| 89168/89168 [00:01<00:00, 69984.72it/s]


items_loaded !
Getting score per year ...
Matrice done !
Getting score per paper ...


start: 100%|██████████| 89168/89168 [00:08<00:00, 10680.96it/s]
100%|██████████| 11/11 [22:58<00:00, 125.31s/it]

saved
Done !





In [1]:
from novelpy.utils.embedding import Embedding

embedding = Embedding(
            year_variable = 'year',
            time_range = range(2000,2011),
            id_variable = 'PMID',
            references_variable = 'refs_pmid_wos',
            pretrain_path = 'en_core_sci_lg-0.5.3/en_core_sci_lg/en_core_sci_lg-0.5.3',  # change the version here if you downloaded another one
            title_variable = 'ArticleTitle',
            abstract_variable = 'a04_abstract',
            abstract_subvariable = 'AbstractText')

# articles

embedding.get_articles_centroid(
      collection_articles = 'Title_abs_sample',
      collection_embedding = 'embedding')



init_dbs_centroid


  0%|          | 0/11 [00:00<?, ?it/s]

load_data_centroid


100%|██████████| 49872/49872 [25:42<00:00, 32.34it/s]
  9%|▉         | 1/11 [26:13<4:22:19, 1573.91s/it]

load_data_centroid


100%|██████████| 52046/52046 [26:21<00:00, 32.91it/s]
 18%|█▊        | 2/11 [53:09<3:59:46, 1598.48s/it]

load_data_centroid


100%|██████████| 54721/54721 [26:50<00:00, 33.98it/s]
 27%|██▋       | 3/11 [1:20:35<3:36:02, 1620.32s/it]

load_data_centroid


100%|██████████| 58439/58439 [28:35<00:00, 34.06it/s]
 36%|███▋      | 4/11 [1:49:50<3:15:13, 1673.42s/it]

load_data_centroid
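
To give an intuition for what `get_articles_centroid` stores, here is a minimal sketch, assuming that a "centroid" is the arithmetic mean of the word vectors of a title or an abstract. The toy vectors and the `centroid` helper are hypothetical and only illustrate the idea; this is not Novelpy's implementation.

```python
from typing import Dict, List

# Toy 3-dimensional word vectors (made up); real scispaCy vectors are much larger.
TOY_VECTORS: Dict[str, List[float]] = {
    "gene": [1.0, 0.0, 0.0],
    "protein": [0.0, 1.0, 0.0],
    "cell": [0.0, 0.0, 1.0],
}


def centroid(tokens: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the known tokens; return [] if no token is known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return []
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# "unknown" has no vector, so only "gene" and "protein" are averaged.
title_centroid = centroid(["gene", "protein", "unknown"], TOY_VECTORS)
print(title_centroid)
```

Tokens without a vector are simply ignored in this sketch, so a text made only of unknown words yields an empty embedding.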




In [None]:
import novelpy
import tqdm

for focal_year in tqdm.tqdm(range(2000, 2011), desc = "Computing indicator for window of time"):
    shibayama = novelpy.indicators.Shibayama2021(
        collection_name = 'Citation_net_sample',
        collection_embedding_name = 'embedding',
        id_variable = 'PMID',
        year_variable = 'year',
        ref_variable = 'refs_pmid_wos',
        entity = ['title_embedding','abstract_embedding'],
        focal_year = focal_year,
        density = True)

    shibayama.get_indicator()
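
As a rough illustration of the idea behind this indicator, the sketch below assumes that novelty is read from the distribution of cosine distances between the embeddings of the references cited by a paper, summarised by percentiles. The helper names, the toy vectors and the nearest-rank percentile rule are assumptions made for the example, not Novelpy's actual code.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def reference_distances(ref_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Sorted cosine distances over every pair of cited references."""
    return sorted(cosine_distance(u, v) for u, v in combinations(ref_vectors, 2))


def percentile(sorted_values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


# Toy embeddings of three references (made up): two identical, one orthogonal.
refs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
distances = reference_distances(refs)
print(distances, percentile(distances, 100))
```

A paper whose references are semantically far apart gets large distances in the upper percentiles, which is what such an indicator would treat as a sign of novelty.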

In [None]:
from novelpy.utils import Embedding
from novelpy.utils import create_authors_past
import novelpy

# First step: create a collection in which each document holds an author ID and the list of documents that author co-authored
clean = create_authors_past(
                             collection_name = "authors_sample",
                             id_variable = "PMID",
                             variable = "a02_authorlist",
                             sub_variable = "AID")

clean.author2paper()
clean.update_db()

embedding = Embedding(
      year_variable = 'year',
      id_variable = 'PMID',
      references_variable = 'refs_pmid_wos',
      pretrain_path = r'en_core_sci_lg-0.5.3\en_core_sci_lg\en_core_sci_lg-0.5.3',  # change the version here if you downloaded another one
      title_variable = 'ArticleTitle',
      abstract_variable = 'a04_abstract',
      abstract_subvariable = 'AbstractText',
      aut_id_variable = 'AID',
      aut_pubs_variable = 'doc_list')


# Article embeddings were computed in an earlier cell; uncomment to recompute them.
# embedding.get_articles_centroid(
#       collection_articles = 'Title_abs_sample',
#       collection_embedding = 'embedding')



embedding.feed_author_profile(
    aut_id_variable = 'AID',
    aut_pubs_variable = 'doc_list',
    collection_authors = 'authors_sample_cleaned',
    collection_embedding = 'embedding')
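
The first step of the cell above can be pictured as inverting a paper-to-authors mapping. Here is a minimal sketch, assuming each document carries a `PMID` and an `a02_authorlist` whose entries have an `AID`; the toy documents and the `author_to_papers` helper are hypothetical and only mirror the variable names used above.

```python
from collections import defaultdict
from typing import Dict, List

# Toy documents shaped like the authors_sample entries (values are made up).
papers = [
    {"PMID": 1, "a02_authorlist": [{"AID": "A"}, {"AID": "B"}]},
    {"PMID": 2, "a02_authorlist": [{"AID": "A"}]},
]


def author_to_papers(docs: List[dict]) -> Dict[str, List[int]]:
    """Map each author ID to the list of PMIDs the author appears on."""
    mapping = defaultdict(list)
    for doc in docs:
        for author in doc.get("a02_authorlist", []):
            mapping[author["AID"]].append(doc["PMID"])
    return dict(mapping)


# One document per author, with the same field names as in the cell above.
authors_past = [
    {"AID": aid, "doc_list": pmids}
    for aid, pmids in sorted(author_to_papers(papers).items())
]
print(authors_past)
```

Each resulting document has an `AID` and a `doc_list`, which are the two fields passed as `aut_id_variable` and `aut_pubs_variable` above.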

In [None]:
from novelpy.indicators.Author_proximity import Author_proximity

for year in range(2000, 2011):
    author = Author_proximity(
        collection_name = 'authors_sample',
        id_variable = 'PMID',
        year_variable = 'year',
        aut_list_variable = 'a02_authorlist',
        aut_id_variable = 'AID',
        entity = ['title','abstract'],
        focal_year = year,
        windows_size = 5,
        density = True)

    author.get_indicator()