# Novelpy Tutorial

This notebook showcases the capabilities of the Novelpy package on a controlled sample. We discuss the features already implemented and those we intend to add in the future. The notebook works exclusively with JSON files. Note, however, that we typically use MongoDB for RAM efficiency and, sometimes, speed. If you prefer MongoDB, refer to this notebook instead: [Novelpy MongoDB Tutorial](https://github.com/Kwirtz/novelpy/tutorial/tuts_MongoDB.ipynb). Only a few lines change to allow the connection to MongoDB; for comments and explanations, still refer to the present notebook.

Structure of the Notebook:
- [First steps and presentation of the data.](#first-steps-and-presentation-of-the-data)
- [Computation for co-occurrence based novelty indicators.](#computation-for-co-ocurence-based-novelty-indicators)
- [Computation for text based novelty indicators.](#computation-for-text-based-novelty-indicators)
- [Computation for disruption indicators.](#computation-for-disruption-indicators)

<a name="first-steps-and-presentation-of-the-data"></a>
## First steps and presentation of the data.

First, we recommend creating a dedicated environment: we use SciPy, which tends to be tricky in terms of compatibility. Then create a project folder and place the en_core_sci_lg folder inside it (you can find it at https://allenai.github.io/scispacy/). The path to the model files should look like en_core_sci_lg-0.5.3\en_core_sci_lg\en_core_sci_lg-0.5.3. As you can see, we use version 0.5.3; we will tell you when to change it if you have another version.

We provide a small sample of data to help you get acquainted with the package and the required data structure. To obtain this sample, run the following code in the project folder:

In [5]:
from novelpy.utils.get_sample import download_sample
download_sample()

Citation_net_sample.zip: 100%|██████████| 191M/191M [02:54<00:00, 1.15MiB/s] 
Meshterms_sample.zip: 100%|██████████| 149M/149M [02:42<00:00, 965kiB/s]  
Ref_Journals_sample.zip: 100%|██████████| 16.0M/16.0M [00:08<00:00, 2.04MiB/s]
Title_abs_sample.zip: 100%|██████████| 784M/784M [10:48<00:00, 1.27MiB/s]  
authors_sample.zip: 100%|██████████| 396M/396M [07:01<00:00, 987kiB/s]  


This creates a folder named "Data" containing several subfolders. Each subfolder holds one JSON file per year: most novelty indicators work at the year level, hence the choice of one file per year. The information you need then depends on which indicator you run. Please refer to this paper https://arxiv.org/abs/2211.10346 if you want to learn more about the conceptual framework.
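As a rough illustration of the one-file-per-year layout, the sketch below writes a tiny yearly file to a temporary folder and reads it back. The directory pattern (`<data_dir>/<collection>/<year>.json`) and the helper name `load_year` are assumptions made for this example, not something Novelpy guarantees; adjust the path to whatever `download_sample()` actually created on your machine.

```python
import json
import tempfile
from pathlib import Path


def load_year(data_dir, collection_name, year):
    """Return the list of document dictionaries stored for one year.

    Hypothetical layout: <data_dir>/<collection_name>/<year>.json,
    where each file holds a JSON list of dictionaries.
    """
    path = Path(data_dir) / collection_name / f"{year}.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


# Build a throwaway folder so the example runs without the real sample.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "Meshterms_sample"
    folder.mkdir(parents=True)
    toy_docs = [{"PMID": 1, "year": 2000,
                 "Mesh_year_category": [{"descUI": "D000830"}]}]
    (folder / "2000.json").write_text(json.dumps(toy_docs), encoding="utf-8")

    docs_2000 = load_year(tmp, "Meshterms_sample", 2000)

print(len(docs_2000), docs_2000[0]["PMID"])
```

Reading one year at a time keeps memory usage bounded, which is presumably one reason for splitting the data by year.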

### Co-occurrence based novelty indicators
Let us start with the indicators that use a co-occurrence matrix. These indicators look at either the combination of journals in the references of a paper or the combination of keywords (MeSH terms in the case of PubMed). The underlying idea is that new knowledge does not come from scratch: it is a combination of past knowledge. The assumption here is that knowledge is represented by keywords or journal categories. The indicators are: Uzzi et al. (2013), Foster et al. (2015), Lee et al. (2015), and Wang et al. (2017).

For these indicators you only work with the Meshterms_sample or Ref_Journals_sample folders from the sample. For the indicators of Foster et al., Lee et al. and Wang et al. you need only three pieces of information per document: the ID of the document, its year of creation, and the entities it uses. Each JSON file is therefore a list of dictionaries. Here is an example of a single dictionary:

In [None]:
dict_Ref_Journals = {"PMID": 16992327, "year": 1896, "c04_referencelist": [{"item": "0022-3751"}]}

#OR

dict_Meshterms = {"PMID": 12255534, "year": 1902, "Mesh_year_category": [{"descUI": "D000830"}, {"descUI": "D001695"}]}

For the indicator of Uzzi et al. you also need the year of creation of the entities:

In [None]:
dict_Ref_Journals = {"PMID": 16992327, "year": 1896, "c04_referencelist": [{"item": "0022-3751", "year": 1893}]}

#OR

dict_Meshterms = {"PMID": 12255534, "year": 1902, "Mesh_year_category": [{"descUI": "D000830", "year": 1999}, {"descUI": "D001695", "year": 1999}]}
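Before running an indicator on your own data, it can help to check that each dictionary has the fields described above. The sketch below is a hypothetical helper (`check_cooc_doc` is not part of Novelpy); it assumes the ID and year are stored under `PMID` and `year` as in the sample, and it optionally checks the per-item `year` needed for Uzzi et al.

```python
def check_cooc_doc(doc, variable, sub_variable, need_item_year=False):
    """Return a list of problems found in one document dictionary."""
    problems = []
    # Document-level fields: ID, year, and the list of entities.
    for key in ("PMID", "year", variable):
        if key not in doc:
            problems.append(f"missing key: {key}")
    # Item-level fields: the entity identifier and, optionally, its year.
    for position, item in enumerate(doc.get(variable, [])):
        if sub_variable not in item:
            problems.append(f"item {position} lacks '{sub_variable}'")
        if need_item_year and "year" not in item:
            problems.append(f"item {position} lacks 'year'")
    return problems


doc = {"PMID": 12255534, "year": 1902,
       "Mesh_year_category": [{"descUI": "D000830"}, {"descUI": "D001695"}]}

# Enough for Foster, Lee and Wang: no problem reported.
basic_problems = check_cooc_doc(doc, "Mesh_year_category", "descUI")
# Not enough for Uzzi: each item also needs its own year.
uzzi_problems = check_cooc_doc(doc, "Mesh_year_category", "descUI",
                               need_item_year=True)
print(basic_problems)
print(uzzi_problems)
```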


### Text based Novelty indicators


Indicators of novelty derived from text operate on the premise that knowledge is encapsulated within the abstract, title, or keywords of a document. These indicators employ a text embedding technique, typically utilizing word2vec, to establish a measure of distance between words. The greater the distance between two words, the less frequently they co-occur. Therefore, when a document employs words that are particularly distant from each other, it is deemed novel. Novelpy supports two such indicators: one proposed by Shibayama et al. (2021) and another by Pelletier and Wirtz (2023).
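To make the notion of distance concrete, here is a minimal sketch of cosine distance between embedding vectors. The three-dimensional vectors are invented for illustration (real embeddings have hundreds of dimensions and come from a trained model), and the code does not reproduce either indicator; it only shows the kind of distance such indicators build on.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy embeddings, made up for this example.
toy_vectors = {
    "protein": [1.0, 0.0, 0.0],
    "enzyme": [0.9, 0.1, 0.0],
    "galaxy": [0.0, 0.0, 1.0],
}

# Distance for every unordered pair of words.
pair_distances = {
    (a, b): cosine_distance(toy_vectors[a], toy_vectors[b])
    for a, b in combinations(toy_vectors, 2)
}
for pair, distance in pair_distances.items():
    print(pair, round(distance, 3))
```

In this toy setting "protein" and "enzyme" are close while "galaxy" is far from both, so a text combining "protein" and "galaxy" would look more novel than one combining "protein" and "enzyme".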

To run Shibayama et al. (2021), you need Citation_net_sample (i.e. for each document, the list of IDs of the papers it cites) as well as Title_abs_sample, which contains the abstract and/or title of the papers.

In [None]:
dict_citation_net = {"PMID": 20793277, "year": 1850, "refs_pmid_wos": [20794613, 20794649, 20794685, 20794701, 20794789, 20794829]}

#AND

dict_title_abs = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title", "a04_abstract":[{"AbstractText":"This is the abstract"}]}
# Or you can use the following format for the title and abstract. In this case, leave the abstract_sub_variable argument empty.
dict_title_abs = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title", "a04_abstract":"This is the abstract"}
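The two abstract layouts can be handled by a single accessor. The sketch below uses a hypothetical helper, `get_text`, whose parameter names mirror the `abstract_sub_variable` argument mentioned above; it is an illustration of the two formats, not Novelpy's own parsing code.

```python
def get_text(doc, title_variable="ArticleTitle",
             abstract_variable="a04_abstract", abstract_sub_variable=None):
    """Return (title, abstract) for either abstract layout."""
    title = doc.get(title_variable, "")
    raw = doc.get(abstract_variable, "")
    if abstract_sub_variable:
        # Nested layout: a list of dictionaries, each holding a piece of text.
        abstract = " ".join(part.get(abstract_sub_variable, "") for part in raw)
    else:
        # Flat layout: the abstract is already a plain string.
        abstract = raw
    return title, abstract


nested = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title",
          "a04_abstract": [{"AbstractText": "This is the abstract"}]}
flat = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title",
        "a04_abstract": "This is the abstract"}

nested_text = get_text(nested, abstract_sub_variable="AbstractText")
flat_text = get_text(flat)
print(nested_text == flat_text)
```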


To run Pelletier and Wirtz (2023), you need Title_abs_sample as well as authors_sample, which contains the list of authors of each document.


In [None]:
dict_authors_list = {"PMID": 20793277, "year": 1850, "a02_authorlist": [{"id":201645},{"id":51331354}]}

#AND

dict_title_abs = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title", "a04_abstract":[{"AbstractText":"This is the abstract"}]}
# Or you can use the following format for the title and abstract. In this case, leave the abstract_sub_variable argument empty.
dict_title_abs = {"PMID": 20793277, "year": 1850, "ArticleTitle": "Here is the title", "a04_abstract":"This is the abstract"}



### Disruption indicators

Finally, for disruption indicators, you only need Citation_net_sample.

In [None]:
dict_citation_net = {"PMID": 20793277, "year": 1850, "refs_pmid_wos": [20794613, 20794649, 20794685, 20794701, 20794789, 20794829]}
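To show why the citation network alone is enough, the sketch below computes one well-known disruption measure, D = (n_i - n_j) / (n_i + n_j + n_k), on a toy network stored in the format above. Here n_i counts later papers citing the focal paper but none of its references, n_j those citing both, and n_k those citing only its references. The function name, the toy data and the simple year filter are assumptions for illustration; this is not necessarily how Novelpy computes its disruption indicators.

```python
def disruption_index(focal_id, docs):
    """Return D = (n_i - n_j) / (n_i + n_j + n_k) for one focal paper."""
    refs_of = {d["PMID"]: set(d["refs_pmid_wos"]) for d in docs}
    year_of = {d["PMID"]: d["year"] for d in docs}
    focal_refs = refs_of[focal_id]
    n_i = n_j = n_k = 0
    for pmid, refs in refs_of.items():
        # Simplification: only papers not older than the focal paper count.
        if pmid == focal_id or year_of[pmid] < year_of[focal_id]:
            continue
        cites_focal = focal_id in refs
        cites_refs = bool(refs & focal_refs)
        if cites_focal and not cites_refs:
            n_i += 1
        elif cites_focal and cites_refs:
            n_j += 1
        elif cites_refs:
            n_k += 1
    total = n_i + n_j + n_k
    return (n_i - n_j) / total if total else None


# Hypothetical toy network: paper 10 is the focal paper.
toy_net = [
    {"PMID": 1, "year": 1990, "refs_pmid_wos": []},
    {"PMID": 2, "year": 1991, "refs_pmid_wos": [1]},
    {"PMID": 10, "year": 2000, "refs_pmid_wos": [1, 2]},
    {"PMID": 11, "year": 2001, "refs_pmid_wos": [10]},     # focal only
    {"PMID": 12, "year": 2002, "refs_pmid_wos": [10, 1]},  # focal and a reference
    {"PMID": 13, "year": 2002, "refs_pmid_wos": [2]},      # reference only
    {"PMID": 14, "year": 2003, "refs_pmid_wos": [10]},     # focal only
]

focal_score = disruption_index(10, toy_net)
print(focal_score)
```

With two papers citing only the focal paper, one citing both and one citing only a reference, the score is (2 - 1) / 4.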

<a name="computation-for-co-ocurence-based-novelty-indicators"></a>
## Computation for co-occurrence based novelty indicators

There are four co-occurrence based indicators: Uzzi et al., Foster et al., Lee et al. and Wang et al.
We start by computing the co-occurrence matrices (i.e. the pairwise usage of items). Whether the original data is in JSON files or in MongoDB, these co-occurrence matrices are saved in pickle format in Data/docs.
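The following sketch illustrates what "pairwise usage of items" means on two toy documents. It rests on an assumed reading of the two flags used in the cell below: `weighted_network` lets a document contribute more than once to the same pair when an item is repeated, and `self_loop` keeps the pairing of an item with itself. The helper `count_pairs` and the item names are hypothetical, and the package may differ in the details.

```python
from collections import Counter
from itertools import combinations


def count_pairs(docs, variable, sub_variable,
                weighted_network=False, self_loop=False):
    """Count co-occurrences of items across documents (toy version)."""
    pair_counts = Counter()
    for doc in docs:
        occurrences = Counter(entry[sub_variable] for entry in doc[variable])
        if not weighted_network:
            # Each item counts at most once per document.
            occurrences = Counter(dict.fromkeys(occurrences, 1))
        items = sorted(occurrences)
        for a, b in combinations(items, 2):
            pair_counts[(a, b)] += occurrences[a] * occurrences[b]
        if self_loop:
            for a in items:
                pair_counts[(a, a)] += occurrences[a] * occurrences[a]
    return pair_counts


toy_docs = [
    {"PMID": 1, "year": 2000,
     "c04_referencelist": [{"item": "A"}, {"item": "A"}, {"item": "B"}]},
    {"PMID": 2, "year": 2000,
     "c04_referencelist": [{"item": "A"}, {"item": "B"}]},
]

unweighted = count_pairs(toy_docs, "c04_referencelist", "item")
weighted = count_pairs(toy_docs, "c04_referencelist", "item",
                       weighted_network=True, self_loop=True)
print(dict(unweighted))
print(dict(weighted))
```

In the unweighted case the pair (A, B) is counted once per document, whereas the weighted case also reflects that the first document uses A twice.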

In [6]:
import novelpy

# Compute all the co-occurrence matrices used in this tutorial
# (weighted and unweighted, for both samples)

ref_cooc = novelpy.utils.cooc_utils.create_cooc(
                 collection_name = "Ref_Journals_sample",
                 year_var="year",
                 var = "c04_referencelist",
                 sub_var = "item",
                 time_window = range(1995,2016),
                 weighted_network = True, self_loop = True)

ref_cooc.main()

ref_cooc = novelpy.utils.cooc_utils.create_cooc(
                 collection_name = "Ref_Journals_sample",
                 year_var="year",
                 var = "c04_referencelist",
                 sub_var = "item",
                 time_window = range(1995,2016),
                 weighted_network = False, self_loop = False)

ref_cooc.main()

ref_cooc = novelpy.utils.cooc_utils.create_cooc(
                 collection_name = "Meshterms_sample",
                 year_var="year",
                 var = "Mesh_year_category",
                 sub_var = "descUI",
                 time_window = range(1995,2016),
                 weighted_network = True, self_loop = True)

ref_cooc.main()

ref_cooc = novelpy.utils.cooc_utils.create_cooc(
                 collection_name = "Meshterms_sample",
                 year_var="year",
                 var = "Mesh_year_category",
                 sub_var = "descUI",
                 time_window = range(1995,2016),
                 weighted_network = False, self_loop = False)

ref_cooc.main()


Get item list, loop on every doc: 100%|██████████| 38874/38874 [00:00<00:00, 747583.12it/s]
Get item list, loop on every doc: 100%|██████████| 40946/40946 [00:00<00:00, 926831.22it/s]
Get item list, loop on every doc: 100%|██████████| 42302/42302 [00:00<00:00, 1082587.61it/s]
Get item list, loop on every doc: 100%|██████████| 44803/44803 [00:00<00:00, 1018279.66it/s]
Get item list, loop on every doc: 100%|██████████| 46779/46779 [00:00<00:00, 974604.09it/s]
Get item list, loop on every doc: 100%|██████████| 49872/49872 [00:00<00:00, 941002.77it/s]
Get item list, loop on every doc: 100%|██████████| 52046/52046 [00:00<00:00, 442428.24it/s]
Get item list, loop on every doc: 100%|██████████| 54721/54721 [00:00<00:00, 901634.65it/s]
Get item list, loop on every doc: 100%|██████████| 58439/58439 [00:00<00:00, 979883.23it/s]
Get item list, loop on every doc: 100%|██████████| 62241/62241 [00:00<00:00, 881073.51it/s]
Get item list, loop on every doc: 100%|██████████| 67361/67361 [00:00<00:00, 8

After this you can run the indicators. Note that, to limit computation time and storage, we compute them for only 2 years here, but you can compute them for at least 10 years.

In [7]:
import tqdm
import novelpy

# Uzzi et al.(2013) Meshterms_sample
for focal_year in tqdm.tqdm(range(2000,2002), desc = "Computing indicator for window of time"):
    Uzzi = novelpy.indicators.Uzzi2013(collection_name = "Meshterms_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "Mesh_year_category",
                                           sub_variable = "descUI",
                                           focal_year = focal_year,
                                           density = True)
    Uzzi.get_indicator()

Computing indicator for window of time:   0%|          | 0/6 [00:00<?, ?it/s]

loading cooc for indicator focal year uzzi
cooc loaded !
loading items for papers in 2000


get_papers_item: 100%|██████████| 49872/49872 [00:01<00:00, 39144.70it/s]


items_loaded !
Creating sample for Uzzi et al. (2013) ...




start sampling


Create sample network: 100%|██████████| 20/20 [33:13<00:00, 99.67s/it]


Done ! Saved in Data/cooc_sample/Mesh_year_category/
Getting the uzzi novelty score for combination of items in 2000 ...


Get sample network: 100%|██████████| 20/20 [00:00<00:00, 94.02it/s]
100%|██████████| 20/20 [00:30<00:00,  1.52s/it]
100%|██████████| 19378929/19378929 [2:22:45<00:00, 2262.36it/s]


Matrice done !
Attributing the uzzi novelty indicator for 2000  papers ...


start: 100%|██████████| 49872/49872 [03:03<00:00, 271.62it/s]
Computing indicator for window of time:  17%|█▋        | 1/6 [3:02:13<15:11:08, 10933.70s/it]

Results are in Result/uzzi/Mesh_year_category
Done !
loading cooc for indicator focal year uzzi
cooc loaded !
loading items for papers in 2001


get_papers_item: 100%|██████████| 52046/52046 [00:01<00:00, 31269.96it/s]


items_loaded !
Creating sample for Uzzi et al. (2013) ...




start sampling


Create sample network: 100%|██████████| 20/20 [38:00<00:00, 114.03s/it]


Done ! Saved in Data/cooc_sample/Mesh_year_category/
Getting the uzzi novelty score for combination of items in 2001 ...


Get sample network: 100%|██████████| 20/20 [00:00<00:00, 83.60it/s]
100%|██████████| 20/20 [00:34<00:00,  1.72s/it]
100%|██████████| 20434046/20434046 [2:31:44<00:00, 2244.28it/s]


Matrice done !
Attributing the uzzi novelty indicator for 2001  papers ...


start: 100%|██████████| 52046/52046 [02:49<00:00, 307.08it/s]
Computing indicator for window of time:  33%|███▎      | 2/6 [6:18:15<12:41:23, 11420.88s/it]

Results are in Result/uzzi/Mesh_year_category
Done !
loading cooc for indicator focal year uzzi
cooc loaded !
loading items for papers in 2002


get_papers_item: 100%|██████████| 54721/54721 [00:00<00:00, 55151.25it/s]


items_loaded !
Creating sample for Uzzi et al. (2013) ...




start sampling


Create sample network: 100%|██████████| 20/20 [35:14<00:00, 105.72s/it]


Done ! Saved in Data/cooc_sample/Mesh_year_category/
Getting the uzzi novelty score for combination of items in 2002 ...


Get sample network: 100%|██████████| 20/20 [00:00<00:00, 96.62it/s]
100%|██████████| 20/20 [00:32<00:00,  1.63s/it]
100%|██████████| 20825880/20825880 [2:32:48<00:00, 2271.41it/s]


Matrice done !
Attributing the uzzi novelty indicator for 2002  papers ...


start: 100%|██████████| 54721/54721 [02:52<00:00, 317.49it/s]
Computing indicator for window of time:  50%|█████     | 3/6 [9:32:23<9:36:13, 11524.47s/it] 

Results are in Result/uzzi/Mesh_year_category
Done !
loading cooc for indicator focal year uzzi
cooc loaded !
loading items for papers in 2003


get_papers_item: 100%|██████████| 58439/58439 [00:01<00:00, 53564.64it/s]


items_loaded !
Creating sample for Uzzi et al. (2013) ...




start sampling


Create sample network: 100%|██████████| 20/20 [38:28<00:00, 115.44s/it]


Done ! Saved in Data/cooc_sample/Mesh_year_category/
Getting the uzzi novelty score for combination of items in 2003 ...


Get sample network: 100%|██████████| 20/20 [00:00<00:00, 93.46it/s]
100%|██████████| 20/20 [00:34<00:00,  1.75s/it]
100%|██████████| 22103148/22103148 [2:43:02<00:00, 2259.55it/s]


Matrice done !
Attributing the uzzi novelty indicator for 2003  papers ...


start: 100%|██████████| 58439/58439 [03:06<00:00, 313.74it/s]
Computing indicator for window of time:  67%|██████▋   | 4/6 [13:00:24<6:36:43, 11901.98s/it]

Results are in Result/uzzi/Mesh_year_category
Done !
loading cooc for indicator focal year uzzi
cooc loaded !
loading items for papers in 2004


get_papers_item: 100%|██████████| 62241/62241 [00:01<00:00, 55771.50it/s]


items_loaded !
Creating sample for Uzzi et al. (2013) ...




start sampling


Create sample network: 100%|██████████| 20/20 [40:37<00:00, 121.89s/it]


Done ! Saved in Data/cooc_sample/Mesh_year_category/
Getting the uzzi novelty score for combination of items in 2004 ...


Get sample network: 100%|██████████| 20/20 [00:00<00:00, 90.09it/s]
100%|██████████| 20/20 [00:36<00:00,  1.84s/it]
100%|██████████| 22850833/22850833 [2:48:09<00:00, 2264.73it/s]


Matrice done !
Attributing the uzzi novelty indicator for 2004  papers ...


start: 100%|██████████| 62241/62241 [03:22<00:00, 307.65it/s]
Computing indicator for window of time:  83%|████████▎ | 5/6 [16:36:01<3:24:35, 12275.36s/it]

Results are in Result/uzzi/Mesh_year_category
Done !
loading cooc for indicator focal year uzzi
cooc loaded !
loading items for papers in 2005


get_papers_item: 100%|██████████| 67361/67361 [00:01<00:00, 39728.77it/s]


items_loaded !
Creating sample for Uzzi et al. (2013) ...




start sampling


Create sample network: 100%|██████████| 20/20 [43:39<00:00, 130.96s/it]


Done ! Saved in Data/cooc_sample/Mesh_year_category/
Getting the uzzi novelty score for combination of items in 2005 ...


Get sample network: 100%|██████████| 20/20 [00:00<00:00, 71.00it/s]
100%|██████████| 20/20 [00:41<00:00,  2.06s/it]

In [1]:
import tqdm
import novelpy


# Uzzi et al.(2013) Ref_Journals_sample
for focal_year in tqdm.tqdm(range(2000,2002), desc = "Computing indicator for window of time"):
    Uzzi = novelpy.indicators.Uzzi2013(collection_name = "Ref_Journals_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "c04_referencelist",
                                           sub_variable = "item",
                                           focal_year = focal_year,
                                           density = True)
    Uzzi.get_indicator()

Computing indicator for window of time:   0%|          | 0/6 [00:00<?, ?it/s]

loading cooc for indicator focal year uzzi
cooc loaded !
loading items for papers in 2000


get_papers_item: 100%|██████████| 49872/49872 [00:00<00:00, 176225.26it/s]


items_loaded !
Creating sample for Uzzi et al. (2013) ...




start sampling


Create sample network: 100%|██████████| 20/20 [00:46<00:00,  2.32s/it]


Done ! Saved in Data/cooc_sample/c04_referencelist/
Getting the uzzi novelty score for combination of items in 2000 ...


Get sample network: 100%|██████████| 20/20 [00:00<00:00, 216.88it/s]
100%|██████████| 20/20 [00:00<00:00, 25.23it/s]
100%|██████████| 850989/850989 [06:29<00:00, 2186.06it/s]


Matrice done !
Attributing the uzzi novelty indicator for 2000  papers ...


start: 100%|██████████| 49872/49872 [00:02<00:00, 18041.67it/s]
Computing indicator for window of time:  17%|█▋        | 1/6 [07:47<38:59, 467.97s/it]

Results are in Result/uzzi/c04_referencelist
Done !
loading cooc for indicator focal year uzzi
cooc loaded !
loading items for papers in 2001


get_papers_item: 100%|██████████| 52046/52046 [00:00<00:00, 204598.85it/s]


items_loaded !
Creating sample for Uzzi et al. (2013) ...




start sampling


Create sample network: 100%|██████████| 20/20 [00:47<00:00,  2.37s/it]


Done ! Saved in Data/cooc_sample/c04_referencelist/
Getting the uzzi novelty score for combination of items in 2001 ...


Get sample network: 100%|██████████| 20/20 [00:00<00:00, 185.06it/s]
100%|██████████| 20/20 [00:00<00:00, 24.93it/s]
100%|██████████| 885779/885779 [06:49<00:00, 2163.88it/s]


Matrice done !
Attributing the uzzi novelty indicator for 2001  papers ...


start: 100%|██████████| 52046/52046 [00:02<00:00, 18637.14it/s]
Computing indicator for window of time:  33%|███▎      | 2/6 [15:57<32:02, 480.64s/it]

Results are in Result/uzzi/c04_referencelist
Done !
loading cooc for indicator focal year uzzi
cooc loaded !
loading items for papers in 2002


get_papers_item: 100%|██████████| 54721/54721 [00:00<00:00, 199978.66it/s]


items_loaded !
Creating sample for Uzzi et al. (2013) ...




start sampling


Create sample network: 100%|██████████| 20/20 [00:54<00:00,  2.73s/it]


Done ! Saved in Data/cooc_sample/c04_referencelist/
Getting the uzzi novelty score for combination of items in 2002 ...


Get sample network: 100%|██████████| 20/20 [00:00<00:00, 204.06it/s]
100%|██████████| 20/20 [00:00<00:00, 21.08it/s]
100%|██████████| 1042040/1042040 [08:07<00:00, 2136.10it/s]


Matrice done !
Attributing the uzzi novelty indicator for 2002  papers ...


start: 100%|██████████| 54721/54721 [00:03<00:00, 14172.86it/s]
Computing indicator for window of time:  50%|█████     | 3/6 [25:41<26:23, 527.85s/it]

Results are in Result/uzzi/c04_referencelist
Done !
loading cooc for indicator focal year uzzi
cooc loaded !
loading items for papers in 2003


get_papers_item: 100%|██████████| 58439/58439 [00:00<00:00, 176552.31it/s]


items_loaded !
Creating sample for Uzzi et al. (2013) ...




start sampling


Create sample network: 100%|██████████| 20/20 [01:05<00:00,  3.29s/it]


Done ! Saved in Data/cooc_sample/c04_referencelist/
Getting the uzzi novelty score for combination of items in 2003 ...


Get sample network: 100%|██████████| 20/20 [00:00<00:00, 142.66it/s]
100%|██████████| 20/20 [00:01<00:00, 13.77it/s]
100%|██████████| 1043240/1043240 [08:28<00:00, 2053.43it/s]


Matrice done !
Attributing the uzzi novelty indicator for 2003  papers ...


start: 100%|██████████| 58439/58439 [00:03<00:00, 14904.15it/s]
Computing indicator for window of time:  67%|██████▋   | 4/6 [35:55<18:43, 561.81s/it]

Results are in Result/uzzi/c04_referencelist
Done !
loading cooc for indicator focal year uzzi
cooc loaded !
loading items for papers in 2004


get_papers_item: 100%|██████████| 62241/62241 [00:00<00:00, 179887.55it/s]


items_loaded !
Creating sample for Uzzi et al. (2013) ...




start sampling


Create sample network: 100%|██████████| 20/20 [01:15<00:00,  3.78s/it]


Done ! Saved in Data/cooc_sample/c04_referencelist/
Getting the uzzi novelty score for combination of items in 2004 ...


Get sample network: 100%|██████████| 20/20 [00:00<00:00, 137.42it/s]
100%|██████████| 20/20 [00:01<00:00, 14.41it/s]
100%|██████████| 1141269/1141269 [08:47<00:00, 2165.05it/s]


Matrice done !
Attributing the uzzi novelty indicator for 2004  papers ...


start: 100%|██████████| 62241/62241 [00:03<00:00, 16887.31it/s]
Computing indicator for window of time:  83%|████████▎ | 5/6 [46:32<09:48, 588.91s/it]

Results are in Result/uzzi/c04_referencelist
Done !
loading cooc for indicator focal year uzzi
cooc loaded !
loading items for papers in 2005


get_papers_item: 100%|██████████| 67361/67361 [00:00<00:00, 220061.98it/s]


items_loaded !
Creating sample for Uzzi et al. (2013) ...




start sampling


Create sample network: 100%|██████████| 20/20 [01:37<00:00,  4.89s/it]


Done ! Saved in Data/cooc_sample/c04_referencelist/
Getting the uzzi novelty score for combination of items in 2005 ...


Get sample network: 100%|██████████| 20/20 [00:00<00:00, 173.89it/s]
100%|██████████| 20/20 [00:01<00:00, 11.28it/s]
100%|██████████| 1619015/1619015 [11:45<00:00, 2295.31it/s]


Matrice done !
Attributing the uzzi novelty indicator for 2005  papers ...


start: 100%|██████████| 67361/67361 [00:05<00:00, 11434.59it/s]
Computing indicator for window of time: 100%|██████████| 6/6 [1:00:32<00:00, 605.39s/it]

Results are in Result/uzzi/c04_referencelist
Done !





In [2]:
import tqdm
import novelpy




# Foster et al.(2015) Meshterms_sample
for focal_year in tqdm.tqdm(range(2000,2002), desc = "Computing indicator for window of time"):
    Foster = novelpy.indicators.Foster2015(collection_name = "Meshterms_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "Mesh_year_category",
                                           sub_variable = "descUI",
                                           focal_year = focal_year,
                                           starting_year = 1995,
                                           community_algorithm = "Louvain",
                                           density = True)
    Foster.get_indicator()


Computing indicator for window of time:   0%|          | 0/2 [00:00<?, ?it/s]

loading cooc for indicator focal year foster
cooc loaded !
loading items for papers in 2000


get_papers_item: 100%|██████████| 49872/49872 [00:00<00:00, 62650.60it/s]


items_loaded !
Create empty df ...
Empty df created !
Compute community and community appartenance for 1995-2000
Get Partition of community ...
Partition Done !
Updating the score matrix ...
Done ...
Done !
Getting the foster novelty score for combination of items in 2000 ...
Done !
Attributing the foster novelty indicator for 2000  papers ...


start: 100%|██████████| 49872/49872 [00:22<00:00, 2200.21it/s]
Computing indicator for window of time:  50%|█████     | 1/2 [11:47<11:47, 707.68s/it]

Results are in Result/foster/Mesh_year_category
Done !
loading cooc for indicator focal year foster
cooc loaded !
loading items for papers in 2001


get_papers_item: 100%|██████████| 52046/52046 [00:06<00:00, 7609.03it/s] 


items_loaded !
Create empty df ...
Empty df created !
Compute community and community appartenance for 1995-2001
Get Partition of community ...
Partition Done !
Updating the score matrix ...
Done ...
Done !
Getting the foster novelty score for combination of items in 2001 ...
Done !
Attributing the foster novelty indicator for 2001  papers ...


start: 100%|██████████| 52046/52046 [00:25<00:00, 2036.08it/s]
Computing indicator for window of time: 100%|██████████| 2/2 [24:12<00:00, 726.33s/it]

Results are in Result/foster/Mesh_year_category
Done !





In [3]:
import tqdm
import novelpy

# Foster et al.(2015) Ref_Journals_sample
for focal_year in tqdm.tqdm(range(2000,2002), desc = "Computing indicator for window of time"):
    Foster = novelpy.indicators.Foster2015(collection_name = "Ref_Journals_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "c04_referencelist",
                                           sub_variable = "item",
                                           focal_year = focal_year,
                                           starting_year = 1995,
                                           community_algorithm = "Louvain",
                                           density = True)
    Foster.get_indicator()

Computing indicator for window of time:   0%|          | 0/2 [00:00<?, ?it/s]

loading cooc for indicator focal year foster
cooc loaded !
loading items for papers in 2000


get_papers_item: 100%|██████████| 49872/49872 [00:07<00:00, 6541.21it/s]


items_loaded !
Create empty df ...
Empty df created !
Compute community and community appartenance for 1995-2000
Get Partition of community ...
Partition Done !
Updating the score matrix ...
Done ...
Done !
Getting the foster novelty score for combination of items in 2000 ...
Done !
Attributing the foster novelty indicator for 2000  papers ...


start: 100%|██████████| 49872/49872 [00:00<00:00, 52076.38it/s]
Computing indicator for window of time:  50%|█████     | 1/2 [00:43<00:43, 43.72s/it]

Results are in Result/foster/c04_referencelist
Done !
loading cooc for indicator focal year foster
cooc loaded !
loading items for papers in 2001


get_papers_item: 100%|██████████| 52046/52046 [00:00<00:00, 363378.53it/s]


items_loaded !
Create empty df ...
Empty df created !
Compute community and community appartenance for 1995-2001
Get Partition of community ...
Partition Done !
Updating the score matrix ...
Done ...
Done !
Getting the foster novelty score for combination of items in 2001 ...
Done !
Attributing the foster novelty indicator for 2001  papers ...


start: 100%|██████████| 52046/52046 [00:00<00:00, 104310.65it/s]
Computing indicator for window of time: 100%|██████████| 2/2 [01:15<00:00, 37.72s/it]

Results are in Result/foster/c04_referencelist
Done !





In [4]:
import tqdm
import novelpy

# Lee et al.(2015) Meshterms_sample
for focal_year in tqdm.tqdm(range(2000,2002), desc = "Computing indicator for window of time"):
    Lee = novelpy.indicators.Lee2015(collection_name = "Meshterms_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "Mesh_year_category",
                                           sub_variable = "descUI",
                                           focal_year = focal_year,
                                           density = True)
    Lee.get_indicator()

Computing indicator for window of time:   0%|          | 0/2 [00:00<?, ?it/s]

loading cooc for indicator focal year lee
cooc loaded !
loading items for papers in 2000


get_papers_item: 100%|██████████| 49872/49872 [00:01<00:00, 35297.34it/s]


items_loaded !
Getting the lee novelty score for combination of items in 2000 ...
comb_scores


  recip = np.true_divide(1., other)


pickle dump
Matrice done !
Attributing the lee novelty indicator for 2000  papers ...


start: 100%|██████████| 49872/49872 [01:55<00:00, 432.19it/s]
Computing indicator for window of time:  50%|█████     | 1/2 [02:18<02:18, 138.51s/it]

Results are in Result/lee/Mesh_year_category
Done !
loading cooc for indicator focal year lee
cooc loaded !
loading items for papers in 2001


get_papers_item: 100%|██████████| 52046/52046 [00:01<00:00, 34835.48it/s]


items_loaded !
Getting the lee novelty score for combination of items in 2001 ...
comb_scores
pickle dump
Matrice done !
Attributing the lee novelty indicator for 2001  papers ...


start: 100%|██████████| 52046/52046 [02:03<00:00, 419.78it/s]
Computing indicator for window of time: 100%|██████████| 2/2 [04:40<00:00, 140.32s/it]

Results are in Result/lee/Mesh_year_category
Done !





In [5]:
import tqdm
import novelpy



# Lee et al.(2015) Ref_Journals_sample
for focal_year in tqdm.tqdm(range(2000,2002), desc = "Computing indicator for window of time"):
    Lee = novelpy.indicators.Lee2015(collection_name = "Ref_Journals_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "c04_referencelist",
                                           sub_variable = "item",
                                           focal_year = focal_year,
                                           density = True)
    Lee.get_indicator()

Computing indicator for window of time:   0%|          | 0/2 [00:00<?, ?it/s]

loading cooc for indicator focal year lee
cooc loaded !
loading items for papers in 2000


get_papers_item: 100%|██████████| 49872/49872 [00:00<00:00, 374169.49it/s]


items_loaded !
Getting the lee novelty score for combination of items in 2000 ...
comb_scores
pickle dump
Matrice done !
Attributing the lee novelty indicator for 2000  papers ...


start: 100%|██████████| 49872/49872 [00:02<00:00, 24785.71it/s]
Computing indicator for window of time:  50%|█████     | 1/2 [00:05<00:05,  5.07s/it]

Results are in Result/lee/c04_referencelist
Done !
loading cooc for indicator focal year lee
cooc loaded !
loading items for papers in 2001


get_papers_item: 100%|██████████| 52046/52046 [00:00<00:00, 341774.35it/s]


items_loaded !
Getting the lee novelty score for combination of items in 2001 ...
comb_scores
pickle dump
Matrice done !
Attributing the lee novelty indicator for 2001  papers ...


start: 100%|██████████| 52046/52046 [00:02<00:00, 20179.48it/s]
Computing indicator for window of time: 100%|██████████| 2/2 [00:10<00:00,  5.23s/it]

Results are in Result/lee/c04_referencelist
Done !





In [6]:
# Wang et al.(2017) Meshterms_sample
for focal_year in tqdm.tqdm(range(2000,2002)):
    Wang = novelpy.indicators.Wang2017(collection_name = "Meshterms_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "Mesh_year_category",
                                           sub_variable = "descUI",
                                           focal_year = focal_year,
                                           time_window_cooc = 3,
                                           n_reutilisation = 1,
                                           starting_year = 1995,
                                           density = True)
    Wang.get_indicator()


100%|██████████| 3/3 [00:03<00:00,  1.21s/it]


loading cooc for indicator focal year wang
Calculate past matrix for Wang et al.(2017)
Calculate futur matrix for Wang et al.(2017)
Calculate difficulty matrix for Wang et al.(2017)
cooc loaded !
loading items for papers in 2000


get_papers_item: 100%|██████████| 49872/49872 [00:00<00:00, 74303.21it/s]


items_loaded !
Getting the wang novelty score for combination of items in 2000 ...




Matrice done !
Attributing the wang novelty indicator for 2000  papers ...


start: 100%|██████████| 49872/49872 [11:11<00:00, 74.28it/s]
 50%|█████     | 1/2 [16:41<16:41, 1001.08s/it]

Results are in Result/wang/Mesh_year_category_3_1_restricted50
Done !


100%|██████████| 3/3 [00:03<00:00,  1.28s/it]


loading cooc for indicator focal year wang
Calculate past matrix for Wang et al.(2017)
Calculate futur matrix for Wang et al.(2017)
Calculate difficulty matrix for Wang et al.(2017)
cooc loaded !
loading items for papers in 2001


get_papers_item: 100%|██████████| 52046/52046 [00:01<00:00, 29001.10it/s]


items_loaded !
Getting the wang novelty score for combination of items in 2001 ...
Matrice done !
Attributing the wang novelty indicator for 2001  papers ...


start: 100%|██████████| 52046/52046 [10:12<00:00, 84.91it/s] 
100%|██████████| 2/2 [32:07<00:00, 963.54s/it] 

Results are in Result/wang/Mesh_year_category_3_1_restricted50
Done !





In [7]:
import tqdm
import novelpy

# Wang et al.(2017) Ref_Journals_sample
for focal_year in tqdm.tqdm(range(2000,2002)):
    Wang = novelpy.indicators.Wang2017(collection_name = "Ref_Journals_sample",
                                           id_variable = 'PMID',
                                           year_variable = 'year',
                                           variable = "c04_referencelist",
                                           sub_variable = "item",
                                           focal_year = focal_year,
                                           time_window_cooc = 3,
                                           n_reutilisation = 1,
                                           starting_year = 1995,
                                           density = True)
    Wang.get_indicator()

100%|██████████| 3/3 [00:00<00:00,  4.16it/s]


loading cooc for indicator focal year wang
Calculate past matrix for Wang et al.(2017)
Calculate futur matrix for Wang et al.(2017)
Calculate difficulty matrix for Wang et al.(2017)
cooc loaded !
loading items for papers in 2000


get_papers_item: 100%|██████████| 49872/49872 [00:00<00:00, 377818.27it/s]


items_loaded !
Getting the wang novelty score for combination of items in 2000 ...
Matrice done !
Attributing the wang novelty indicator for 2000  papers ...


start: 100%|██████████| 49872/49872 [00:02<00:00, 21245.62it/s]
 50%|█████     | 1/2 [01:19<01:19, 79.73s/it]

Results are in Result/wang/c04_referencelist_3_1_restricted50
Done !


100%|██████████| 3/3 [00:01<00:00,  2.97it/s]


loading cooc for indicator focal year wang
Calculate past matrix for Wang et al.(2017)
Calculate futur matrix for Wang et al.(2017)
Calculate difficulty matrix for Wang et al.(2017)
cooc loaded !
loading items for papers in 2001


get_papers_item: 100%|██████████| 52046/52046 [00:00<00:00, 337442.70it/s]


items_loaded !
Getting the wang novelty score for combination of items in 2001 ...
Matrice done !
Attributing the wang novelty indicator for 2001  papers ...


start: 100%|██████████| 52046/52046 [00:02<00:00, 20972.37it/s]
100%|██████████| 2/2 [02:43<00:00, 81.57s/it]

Results are in Result/wang/c04_referencelist_3_1_restricted50
Done !





<a name="computation-for-text-based-novelty-indicators"></a>
## Computation for text based novelty indicators

Novelpy supports two text-based indicators: Shibayama et al. (2021) and Pelletier and Wirtz (2023).

Both indicators first require embedding each article from its text: the abstract, the title, or the keywords. There is one argument per text unit, although we recommend concatenating all the text you want into a single field and using only that one (for example, in your dict have {"abstract_variable":Abstract+title+keywords}). We use the pretrained scispaCy model en_core_sci_lg-0.5.3.
Note that this model only works for English-language papers, but we made pretrain_path flexible enough that you can use your own embedding. For each article we then compute the centroid of the text.
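To make the notion of a centroid concrete, here is a minimal sketch that is not Novelpy's own code. It assumes the centroid is the component-wise mean of the token vectors of a text; `text_centroid` is a hypothetical helper and the 3-dimensional vectors are made-up numbers standing in for real embeddings.

```python
def text_centroid(token_vectors):
    """Return the component-wise mean of a list of equal-length vectors.

    Returns None for an empty text so callers can skip papers without text.
    """
    if not token_vectors:
        return None
    n_tokens = len(token_vectors)
    # zip(*...) groups the i-th component of every token vector together.
    return [sum(component) / n_tokens for component in zip(*token_vectors)]


# Toy 3-dimensional "embeddings" for a four-token title (illustrative values only).
toy_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [0.0, 2.0, 2.0],
    [0.0, 0.0, 0.0],
]

centroid = text_centroid(toy_vectors)
print(centroid)  # [1.0, 1.0, 1.0]
```

With a real model, the token vectors would come from the pretrained embedding rather than from a hand-written list.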

In [None]:
from novelpy.utils.embedding import Embedding

embedding = Embedding(
            year_variable = 'year',
            time_range = range(2000,2002),
            id_variable = 'PMID',
            references_variable = 'refs_pmid_wos',
            pretrain_path = 'en_core_sci_lg-0.5.3/en_core_sci_lg/en_core_sci_lg-0.5.3',
            title_variable = 'ArticleTitle',
            abstract_variable = 'a04_abstract',
            abstract_subvariable = 'AbstractText',
            keywords_variable = None,
            keywords_subvariable = None)

# articles

embedding.get_articles_centroid(
      collection_articles = 'Title_abs_sample',
      collection_embedding = 'embedding')

Once the centroids for the focal papers are computed, you can directly run Shibayama et al. (2021):

In [None]:
import novelpy
import tqdm

for focal_year in tqdm.tqdm(range(1995,2002), desc = "Computing indicator for window of time"):
    shibayama = novelpy.indicators.Shibayama2021(
        collection_name = 'Citation_net_sample',
        collection_embedding_name = 'embedding',
        id_variable = 'PMID',
        year_variable = 'year',
        ref_variable = 'refs_pmid_wos',
        entity = ['title_embedding','abstract_embedding'],
        focal_year = focal_year,
        density = True)

    shibayama.get_indicator()
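As a rough picture of what this kind of indicator measures, the sketch below computes the cosine distance between every pair of cited-reference embeddings for one focal paper. This is our simplified reading of the idea behind Shibayama et al. (2021), not Novelpy's implementation; `cosine_distance` and `reference_distances` are hypothetical helpers, the 2-dimensional vectors are made up, and all vectors are assumed to be non-zero.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def reference_distances(ref_vectors):
    """Sorted cosine distances over all unordered pairs of reference vectors."""
    return sorted(cosine_distance(u, v) for u, v in combinations(ref_vectors, 2))


# Three cited references with toy embeddings: two identical, one orthogonal.
refs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

distances = reference_distances(refs)
print(distances)  # [0.0, 1.0, 1.0]
```

A paper whose references sit far apart in the embedding space yields a distribution shifted towards large distances.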

For the Pelletier and Wirtz (2023) indicator you need an additional step that creates a profile of each author for each year. That is, for each author and a given year you create a dict that contains information on the centroids of the articles they co-authored.

In [None]:
from novelpy.utils import Embedding
from novelpy.utils import create_authors_past
import novelpy

# First step: create a collection where each doc contains an author ID and the list of documents that author co-authored
clean = create_authors_past(
                             collection_name = "authors_sample",
                             id_variable = "PMID",
                             variable = "a02_authorlist",
                             sub_variable = "AID")

clean.author2paper()
clean.update_db()

embedding = Embedding(
      year_variable = 'year',
      id_variable = 'PMID',
      references_variable = 'refs_pmid_wos',
      pretrain_path = r'en_core_sci_lg-0.5.3\en_core_sci_lg\en_core_sci_lg-0.5.3',
      title_variable = 'ArticleTitle',
      abstract_variable = 'a04_abstract',
      abstract_subvariable = 'AbstractText',
      aut_id_variable = 'AID',
      aut_pubs_variable = 'doc_list')


"""
embedding.get_articles_centroid(
      collection_articles = 'Title_abs_sample',
      collection_embedding = 'embedding')
"""



embedding.feed_author_profile(
    aut_id_variable = 'AID',
    aut_pubs_variable = 'doc_list',
    collection_authors = 'authors_sample_cleaned',
    collection_embedding = 'embedding')
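The author-profile step can be pictured with the following minimal sketch. It is not the code behind `feed_author_profile`: the function `build_author_profiles` and the keys `authors` and `centroid` are hypothetical, and the centroids are made-up 2-dimensional vectors.

```python
from collections import defaultdict


def build_author_profiles(papers):
    """Group paper centroids by author ID and then by publication year."""
    profiles = defaultdict(lambda: defaultdict(list))
    for paper in papers:
        for author_id in paper["authors"]:
            profiles[author_id][paper["year"]].append(paper["centroid"])
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {author: dict(by_year) for author, by_year in profiles.items()}


# Two toy papers: A1 co-authors both, A2 only the first one.
papers = [
    {"PMID": 1, "year": 1999, "authors": ["A1", "A2"], "centroid": [0.0, 1.0]},
    {"PMID": 2, "year": 2000, "authors": ["A1"], "centroid": [1.0, 0.0]},
]

profiles = build_author_profiles(papers)
print(profiles["A1"])  # {1999: [[0.0, 1.0]], 2000: [[1.0, 0.0]]}
print(profiles["A2"])  # {1999: [[0.0, 1.0]]}
```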

Once this data is created, you can run the Pelletier and Wirtz (2023) indicator as follows:

In [None]:
from novelpy.indicators.Author_proximity import Author_proximity

for year in range(2000,2002):
    author = Author_proximity(
        collection_name = 'authors_sample',
        id_variable = 'PMID',
        year_variable = 'year',
        aut_list_variable = 'a02_authorlist',
        aut_id_variable = 'AID',
        entity = ['title','abstract'],
        focal_year = year,
        windows_size = 5,
        density = True)

    author.get_indicator()

<a name="computation-for-disruption-indicators"></a>
## Computation for disruption indicators 

In [None]:
import novelpy

clean = novelpy.utils.preprocess_disruptiveness.create_citation_network(collection_name = "Citation_net_sample",
                                                                        id_variable = "PMID", variable = "refs_pmid_wos")
clean.id2citedby()
clean.update_db()
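Conceptually, this preprocessing turns "paper to references" lists into "paper to citing papers" lists. The sketch below illustrates that inversion on a toy network; it is an assumption about what the step amounts to, not the code of `id2citedby`, and `invert_references` is a hypothetical helper.

```python
from collections import defaultdict


def invert_references(refs_by_paper):
    """Map each cited paper to the list of papers that cite it."""
    cited_by = defaultdict(list)
    for citing_id, refs in refs_by_paper.items():
        for cited_id in refs:
            cited_by[cited_id].append(citing_id)
    return dict(cited_by)


# Toy network: P2 cites P1, and P3 cites both P1 and P2.
refs_by_paper = {"P1": [], "P2": ["P1"], "P3": ["P1", "P2"]}

cited_by = invert_references(refs_by_paper)
print(cited_by)  # {'P1': ['P2', 'P3'], 'P2': ['P3']}
```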

In [None]:
import tqdm
import novelpy


for year in range(2000,2011):
    disruptiveness = novelpy.Disruptiveness(
        collection_name = 'Citation_net_sample_cleaned',
        focal_year = year,
        id_variable = 'PMID',
        refs_list_variable ='refs',
        cits_list_variable = 'cited_by',
        year_variable = 'year')
    
    disruptiveness.get_indicators(parallel = False)
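For intuition, here is a minimal sketch of one commonly used disruption measure, (n_i - n_j) / (n_i + n_j + n_k), where n_i counts papers citing the focal paper but none of its references, n_j counts papers citing both, and n_k counts papers citing only the references. We do not claim this matches Novelpy's output or the variants it computes; `disruption_index` is a hypothetical helper and the network is made up.

```python
def disruption_index(focal_id, refs_by_paper):
    """Return (n_i - n_j) / (n_i + n_j + n_k), or None if no paper qualifies."""
    focal_refs = set(refs_by_paper[focal_id])
    n_i = n_j = n_k = 0
    for paper_id, refs in refs_by_paper.items():
        if paper_id == focal_id:
            continue
        refs = set(refs)
        cites_focal = focal_id in refs
        cites_focal_refs = bool(refs & focal_refs)
        if cites_focal and not cites_focal_refs:
            n_i += 1  # cites the focal paper only
        elif cites_focal and cites_focal_refs:
            n_j += 1  # cites the focal paper and its references
        elif cites_focal_refs:
            n_k += 1  # cites the references only
    total = n_i + n_j + n_k
    return None if total == 0 else (n_i - n_j) / total


# Toy network around focal paper "F", which cites "R1".
network = {
    "R1": [],
    "F": ["R1"],
    "A": ["F"],
    "B": ["F", "R1"],
    "C": ["R1"],
    "D": ["F"],
}

score = disruption_index("F", network)
print(score)  # 0.25
```

Here n_i = 2 (A and D), n_j = 1 (B) and n_k = 1 (C), hence (2 - 1) / 4.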


<a name="To go further"></a>
## To go further

We have added some features that may be useful. The first is computing indicators only on a given list of papers. This can be done with the argument "list_ids" for any indicator computation:

In [None]:
import novelpy

# Compute the Lee et al. (2015) indicator only for the papers with PMID in ["10592257","10594130"].
# Make sure that the ids belong to papers published in focal_year.

focal_year = 2000

Lee = novelpy.indicators.Lee2015(collection_name = "Ref_Journals_sample",
                                        id_variable = 'PMID',
                                        year_variable = 'year',
                                        variable = "c04_referencelist",
                                        sub_variable = "item",
                                        focal_year = focal_year,
                                        density = True,
                                        list_ids=["10592257","10594130"])

Lee.get_indicator()

Another argument you can change is density. With density=True, Novelpy keeps the novelty score of each combination in the paper, which gives you a distribution of novelty scores. If you only want a single score per paper and not the distribution, set density=False.
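To picture the difference between a distribution and a single score, here is a minimal sketch that is not Novelpy code. The per-combination scores are made-up numbers and the summary statistics are purely illustrative, since how a paper-level score is derived depends on the indicator; `summarise_scores` is a hypothetical helper.

```python
import statistics


def summarise_scores(combination_scores):
    """Collapse a list of per-combination scores into a few summary values."""
    return {
        "n_combinations": len(combination_scores),
        "min": min(combination_scores),
        "median": statistics.median(combination_scores),
        "mean": statistics.mean(combination_scores),
    }


# What a density-style output might hold for one paper (illustrative values).
combination_scores = [-1.5, -0.5, 0.0, 2.0]

summary = summarise_scores(combination_scores)
print(summary)
# {'n_combinations': 4, 'min': -1.5, 'median': -0.25, 'mean': 0.0}
```

Keeping the full distribution lets you choose such a summary afterwards, at the cost of larger result files.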