1. Place the JSTOR data to be parsed under the data folder.
2. Place the database jstor-authority.db under the database folder.
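
Both prerequisites can be checked before running anything else. The sketch below assumes the notebook is started from the repository root, that the folders are named data and database as described above, and that the JSTOR archives are .zip files; the helper name missing_prerequisites is illustrative only.

```python
from pathlib import Path


def missing_prerequisites(root="."):
    """Return a list of problems; an empty list means both prerequisites are met."""
    root = Path(root)
    problems = []
    data_dir = root / "data"
    # Assumption: the JSTOR archives are delivered as .zip files.
    if not data_dir.is_dir() or not any(data_dir.glob("*.zip")):
        problems.append("no .zip archives found under data/")
    if not (root / "database" / "jstor-authority.db").is_file():
        problems.append("database/jstor-authority.db is missing")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```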

The steps for running Authority on the JSTOR dataset are listed below. (Note: the commands for each step are given separately after this cell.)

A. Parsing JSTOR data.
1. Parse all the JSTOR XML files into the database.
2. MeSH content is placed in Authority/r_table/parser/results/mesh in <journal-name>.txt format.
3. Get MeSH terms for the above content by running Authority/r_table/parser/fetch_mesh_terms.bat or the .sh equivalent (set the input and output paths in the script first).
4. Store the MeSH terms in the database by executing Authority/r_table/parser/mesh_id_parser/mesh_output_parser.py

B. Creating r_table.
1. Create reference sets: attribute match, non-match, and name set mixed.
2. Compute similarity profiles and compute r.
3. Run smoothing in MATLAB.
4. Post-process the r values (interpolation and extrapolation).

C. Clustering
1. Perform clustering.

D. Gold standard databases:
1. Create the Google Scholar, Biodiversity Heritage Library (BHL), and JSTOR self-citation databases.

E. Evaluation:
1. Evaluate using the gold standard data above.


In [None]:
!pip install -r r_table/requirements.txt

________________________________________________________________________________________________
A. Parsing JSTOR data
1. Parse all the JSTOR XML files into the database.


In [None]:
%cd r_table/parser
!python main.py  --zip_file "../../data/receipt-id-561931-jcodes-ab-part-001.zip"
!python main.py  --zip_file "../../data/receipt-id-561931-jcodes-ab-part-002.zip"
!python main.py  --zip_file "../../data/receipt-id-561931-jcodes-cdef-part-001.zip"
!python main.py  --zip_file "../../data/receipt-id-561931-jcodes-cdef-part-002.zip"
!python main.py  --zip_file "../../data/receipt-id-561931-jcodes-ghij-part-001.zip"
!python main.py  --zip_file "../../data/receipt-id-561931-jcodes-ghij-part-002.zip"
!python main.py  --zip_file "../../data/receipt-id-561931-jcodes-klmnop-part-001.zip"
!python main.py  --zip_file "../../data/receipt-id-561931-jcodes-klmnop-part-002.zip"
!python main.py  --zip_file "../../data/qrstuvwyz.zip"

2. MeSH content is placed in Authority/r_table/parser/results/mesh in <journal-name>.txt format.
3. Download the jar file from https://ii.nlm.nih.gov/Web_API/ and place it under the mesh_id_parser folder. Follow the instructions at https://ii.nlm.nih.gov/Web_API/ to register your email, then set your email and password in GenericBatch.java.
4. Get MeSH terms for the above content by running Authority/r_table/parser/fetch_mesh_terms.bat or the .sh file. (Set the input and output paths in the bat/sh file, and set web_api_examples_path to the "SKR_Web_API_V2_3\SKR_Web_API_V2_3\examples" folder.)
5. Store the MeSH terms using the command below.


In [None]:
%cd mesh_id_parser/
!python mesh_output_parser.py --file ../results/mesh/output/
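
The nine parse invocations in step A.1 differ only in the archive name, so they can be generated from the contents of the data folder. The sketch below assumes every .zip under data/ should be parsed and that it is run from r_table/parser; parse_commands and run_all are hypothetical helpers, and only the --zip_file flag shown in that step is used.

```python
import subprocess
import sys
from pathlib import Path


def parse_commands(data_dir="../../data"):
    """Build one `main.py --zip_file <archive>` command per archive, sorted by name."""
    archives = sorted(Path(data_dir).glob("*.zip"))
    return [[sys.executable, "main.py", "--zip_file", str(a)] for a in archives]


def run_all(data_dir="../../data"):
    """Run the parser once per archive, one after another."""
    for cmd in parse_commands(data_dir):
        print("running:", " ".join(cmd))
        # check=True stops at the first archive that fails to parse.
        subprocess.run(cmd, check=True)
```

Calling run_all() from r_table/parser would then parse the archives sequentially, in file-name order.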

_____________________________________________________________________________________________
B. Creating r_table.
1. Create reference sets: attribute match, non-match, and name set mixed.

In [None]:
%cd ../reference_sets/
!python delete_sets.py
!python create_sets.py

2. Compute similarity profiles and compute r.

In [None]:
%cd ../compute_r/
!python compute_similarity.py
!python compute_r.py
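
For orientation, and assuming the usual Authority definition, the r value of a similarity profile is the ratio of its relative frequency in the match set to its relative frequency in the non-match set. The sketch below illustrates that ratio on in-memory counts. The dictionary layout, the function name compute_r_values, and the handling of profiles unseen in the non-match set are assumptions of this sketch, not a description of compute_r.py.

```python
def compute_r_values(match_counts, nonmatch_counts):
    """Return {profile: P(profile | match) / P(profile | non-match)}.

    Profiles absent from the non-match set are skipped, since the
    ratio is undefined there (an assumption of this sketch).
    """
    total_m = sum(match_counts.values())
    total_nm = sum(nonmatch_counts.values())
    r = {}
    for profile, count_m in match_counts.items():
        count_nm = nonmatch_counts.get(profile, 0)
        if count_nm == 0:
            continue
        r[profile] = (count_m / total_m) / (count_nm / total_nm)
    return r


# Toy counts keyed by a hypothetical two-attribute profile string.
match = {"1,1": 80, "1,0": 15, "0,0": 5}
nonmatch = {"1,1": 10, "1,0": 30, "0,0": 60}
print(compute_r_values(match, nonmatch))
```

A profile that is frequent among matches and rare among non-matches gets an r value well above 1, and the reverse gives a value below 1.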

1. The results of the commands above are placed under the compute_r/results folder. Similarity profiles are stored in x<score_attribute>\_m.json and x<score_attribute>\_nm.json files. R-table values are stored in r_x<score_attribute>.json.
2. Smooth the r values by running the MATLAB script smoothing_quadratic.m.
3. The final results are written to results/r_smoothen.txt.
4. Note: the smoothing step is run on a different node.
 ____________________________________________________________________
5. Interpolate the r values by running the post-processing script:

!python post_processing_r.py

Copy the resulting r_x1.json, r_x2.json, r_x10.json, r_final.json, upper_profiles.txt and nicknames.json from the results folder to the clustering/r_table folder for clustering.


In [None]:
%cd results
%cp r_x1.json r_x2.json r_final.json upper_profiles.txt r_x10.json nicknames.json ../../../clustering/r_table/
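
To make the interpolation and extrapolation in step 5 concrete, here is a one-dimensional toy version: missing entries between two known r values are filled linearly, and gaps at either end repeat the nearest known value. This is only a sketch under those assumptions; fill_missing is a hypothetical name, and post_processing_r.py may use a different scheme over the full r table.

```python
def fill_missing(values):
    """Fill None entries by linear interpolation between known neighbours.

    Leading and trailing gaps are extrapolated by repeating the nearest
    known value (a simple choice made for this sketch).
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("at least one known value is required")
    filled = list(values)
    # Extrapolate before the first and after the last known point.
    for i in range(known[0]):
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    # Interpolate inside each gap between consecutive known points.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


print(fill_missing([None, 1.0, None, None, 4.0, None]))
# [1.0, 1.0, 2.0, 3.0, 4.0, 4.0]
```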

C. Clustering
1. Perform clustering.

2. The script script.sh clusters the blocks in parallel to speed up the process. It uses the command python main_firstinitial.py <starting_block_number> <ending_block_number>, which clusters the blocks from <starting_block_number> to <ending_block_number>. Adjust the degree of parallelism to your available compute.

3. The script store.sh combines the results of the parallel runs above and stores them in the database.

In [None]:
%cd ../../../clustering/
!./script.sh
!./store.sh
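
The block ranges passed to main_firstinitial.py can be computed instead of written by hand. The sketch below splits a range of block numbers into contiguous chunks, one per worker, and prints the corresponding commands. It assumes both block numbers are inclusive and uses blocks 0 to 99 with 4 workers purely for illustration; block_ranges is a hypothetical helper, and script.sh may divide the work differently.

```python
def block_ranges(first_block, last_block, workers):
    """Split [first_block, last_block] into at most `workers` contiguous ranges."""
    total = last_block - first_block + 1
    if total <= 0 or workers <= 0:
        raise ValueError("need at least one block and one worker")
    workers = min(workers, total)
    base, extra = divmod(total, workers)
    ranges, start = [], first_block
    for w in range(workers):
        # The first `extra` workers take one additional block each.
        size = base + (1 if w < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Illustrative values only: 100 blocks split across 4 workers.
for start, end in block_ranges(0, 99, 4):
    print(f"python main_firstinitial.py {start} {end} &")
```

Raising or lowering the worker count is then the only change needed to match the available cores.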

E. Evaluation.
1. Google Scholar. Results will be placed in evaluation/google_scholar/evaluation_results_gs.txt

In [None]:
%cd ../evaluation/google scholar/
!python evaluate_gs.py

2. Self citations. Results will be placed in evaluation/self-citations/final_eval_results_self.txt

In [None]:
%cd ../self-citations/
!./script.sh
!python combine_results.py

3. BHL. Results will be placed in evaluation/bhl/evaluation_results_bhl.txt

In [None]:
%cd ../bhl/
!python evaluate_bhl.py

In [None]:
# Compute pairwise evaluation metrics from confusion-matrix counts.
def print_metrics(name, tp, fp, fn, tn):
    s = tp + fp + tn + fn
    accuracy = (tp + tn) / s
    print(name + " results:")
    print("accuracy : ", accuracy)
    pairwise_precision = tp / (tp + fp)
    print("pairwise_precision : ", pairwise_precision)
    pairwise_recall = tp / (tp + fn)
    print("pairwise_recall : ", pairwise_recall)
    pairwise_f1 = (2 * pairwise_precision * pairwise_recall) / (pairwise_recall + pairwise_precision)
    print("pairwise_f1 : ", pairwise_f1)
    ler = fp / (tp + fp)
    print("pairwise lumping error rate : ", ler)
    ser = fn / (tn + fn)
    print("pairwise splitting error rate : ", ser)
    er = (fp + fn) / s
    print("pairwise error rate : ", er)

# BHL
print_metrics("bhl", tp=6995, fp=477, fn=6, tn=49)

# Google Scholar
print_metrics("gs", tp=291938, fp=18028, fn=238, tn=5198)

