# __8 Considering impact__

Goal:
- Assess the impact of publications over time for different:
  - Countries
  - Topics

Approach
- For each country/species/topic and each year, determine:
  - Average publication impact, weighted by the journal's:
    - overall percentile rank
    - SJR
    - cites/doc (2 years)
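
One way to read the approach above is as a publication-weighted average of a journal-level metric. The block below is a minimal sketch under that assumption, not the notebook's actual implementation: the journal titles, SJR values, and publication counts are made up, and using `rank(pct=True)` as the percentile rank is one possible definition.

```python
import pandas as pd

# Toy journal table for a single year; titles and SJR values are hypothetical.
journals = pd.DataFrame({
  "Title": ["J1", "J2", "J3", "J4"],
  "SJR": [0.5, 2.0, 1.0, 4.0],
})

# Percentile rank: fraction of journals whose SJR is <= this journal's SJR.
journals["pct_rank"] = journals["SJR"].rank(pct=True)

# Hypothetical publication counts per journal for one country (or topic).
n_pubs = pd.Series({"J1": 3, "J2": 1, "J4": 1})

# Averaging over publications equals weighting each journal by its count.
metric = journals.set_index("Title")["pct_rank"]
weighted_avg = (metric.loc[n_pubs.index] * n_pubs).sum() / n_pubs.sum()
print(round(weighted_avg, 3))
```

The same weighting could be applied to SJR or cites/doc by swapping the `metric` column.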

Data source
- [Medline journal list](https://ftp.ncbi.nih.gov/pubmed/J_Medline.txt)
- [Scimago Journal & Country Rank](https://www.scimagojr.com/) for journal ranking data

Key numbers:
- PubMed journals
  - \# with jtitle and ISSN: 31624
  - \# with jtitle, no ISSN: 4257
  - \# with ISSN, no jtitle: 0
  - 23 ISSNs are each assigned to two or more JournalTitles.
  - Some JournalTitles are assigned 2 ISSNs beyond the print/online pair.
- Of the 421307 records, 163 have journal names that do not match those in Medline.

## ___Setup___

### Module import

In conda env `base`

In [26]:
import pickle
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm import tqdm
from datetime import datetime
from urllib import request
from time import sleep

### Key variables

In [2]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "8_impact"
work_dir.mkdir(parents=True, exist_ok=True)

# plant science corpus with topic assignment info
dir42      = proj_dir / "4_topic_model/4_2_outlier_assign"
corpus_file = dir42 / "table4_2_corpus_with_topic_assignment.tsv.gz"
#corpus_file = dir42 / "test.tsv"

# timestamp bins
dir44            = proj_dir / "4_topic_model/4_4_over_time"
ts_for_bins_file = dir44 / "table4_4_bin_timestamp_date.tsv"
file_topic_name   = dir44 / "fig4_4_tot_heatmap_weighted_xscaled_names.txt"

# country data
dir74            = proj_dir / "7_countries/7_4_consolidate_all"
ci_file          = dir74 / 'country_info_final_a3.txt'



# Embed fonts as TrueType (Type 42) so that text in saved PDFs stays editable
mpl.rcParams['pdf.fonttype'] = 42
plt.rcParams["font.family"] = "sans-serif"

## ___Journal list___

### Download

Example
```
--------------------------------------------------------
JrId: 1
JournalTitle: AADE editors' journal
MedAbbr: AADE Ed J
ISSN (Print): 0160-6999
ISSN (Online): 
IsoAbbr: AADE Ed J
NlmId: 7708172
--------------------------------------------------------
JrId: 2
...
```

In [5]:
# Get journal list
jm_url  = "https://ftp.ncbi.nih.gov/pubmed/J_Medline.txt"
jm_file = work_dir / "J_Medline.txt"
_ = request.urlretrieve(jm_url, jm_file)

### Establish journal dictionary

Considerations
- ISSNs have the format `xxxx-xxxx`; the `-` is removed to simplify matching with SJR data.
- Some journal names have multiple ISSNs, and some ISSNs refer to different journal names.
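
The two considerations above can be illustrated with a small, self-contained sketch. It is not the parser used in this notebook: `normalize_issn`, the record tuples, and the journal names are hypothetical and only show hyphen removal and the many-to-many mapping between ISSNs and titles.

```python
from collections import defaultdict

def normalize_issn(issn: str) -> str:
  """Drop whitespace and the hyphen, e.g. '0160-6999' -> '01606999'."""
  return issn.strip().replace("-", "")

# Hypothetical (title, print ISSN, online ISSN) records.
records = [
  ("Journal A", "1111-1111", "2222-2222"),
  ("Journal B", "1111-1111", ""),  # shares its print ISSN with Journal A
]

issn_to_titles = defaultdict(list)  # {ISSN: [titles]}
title_to_issns = defaultdict(list)  # {title: [ISSNs]}
for title, issn_print, issn_online in records:
  # dict.fromkeys drops an online ISSN identical to the print one, keeping order
  for issn in dict.fromkeys(normalize_issn(i) for i in (issn_print, issn_online)):
    if issn:  # skip missing ISSNs
      issn_to_titles[issn].append(title)
      title_to_issns[title].append(issn)

print(dict(issn_to_titles))
print(dict(title_to_issns))
```

Keeping lists as values, rather than a single string, preserves the cases where one ISSN points to several titles.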

In [90]:
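def add_to_dict(i_to_j, j_to_i, issn, jtitle):
  '''Register an ISSN-JournalTitle pair in both lookup dictionaries.

  Empty ISSNs are skipped. An ISSN that is already in i_to_j is also logged in
  the global list redun_issn, which must exist before the first call.
  '''
  if issn != "":
    if issn not in i_to_j:
      i_to_j[issn] = [jtitle]
    else:
      redun_issn.append([issn, jtitle, i_to_j[issn]])
      i_to_j[issn].append(jtitle)

    if jtitle not in j_to_i:
      j_to_i[jtitle] = [issn]
    else:
      j_to_i[jtitle].append(issn)

  return i_to_j, j_to_i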

In [95]:
with open(jm_file) as f:
  i_to_j = {} # {ISSN: [JournalTitle]}
  j_to_i = {} # {JournalTitle: [ISSN]}

  # issn : issn print
  # issn2: issn online
  jtitle = issn = issn2 = ""
  wi_m_no_i = wi_i_no_m = 0
  redun_issn   = []
  f.readline() # skip the first separator line
  for line in f:
    # new record
    if line.startswith("----"):
      if jtitle != "" and (issn != "" or issn2 != ""):
        i_to_j, j_to_i = add_to_dict(i_to_j, j_to_i, issn, jtitle)
        if issn != issn2:
          i_to_j, j_to_i = add_to_dict(i_to_j, j_to_i, issn2, jtitle)

      if jtitle == "":
        wi_i_no_m += 1
        print("With ISSN, no jtitle:", issn, issn2)
      if issn == "" and issn2 == "":
        wi_m_no_i += 1
      # reset
      jtitle = issn = issn2 = ""
    elif line.startswith("JournalTitle"):
      jtitle_tokens = line.strip().split("JournalTitle: ")
      if len(jtitle_tokens) == 2:
        jtitle = jtitle_tokens[1]
    elif line.startswith("ISSN (Print)"):
      issn_tokens = line.strip().split("ISSN (Print): ")
      if len(issn_tokens) == 2:
        issn = "".join(issn_tokens[1].split("-"))
    elif line.startswith("ISSN (Online)"):
      issn2_tokens = line.strip().split("ISSN (Online): ")
      if len(issn2_tokens) == 2:
        issn2 = "".join(issn2_tokens[1].split("-"))

# Add the last record
if jtitle != "" and (issn != "" or issn2 != ""):
  i_to_j, j_to_i = add_to_dict(i_to_j, j_to_i, issn, jtitle)
  if issn != issn2:
    i_to_j, j_to_i = add_to_dict(i_to_j, j_to_i, issn2, jtitle)

if jtitle == "":
  wi_i_no_m += 1
  print("With ISSN, no jtitle:", issn, issn2)
if issn == "" and issn2 == "":
  wi_m_no_i += 1

In [96]:
len(redun_issn), len(i_to_j), len(j_to_i), wi_m_no_i, wi_i_no_m

(23, 44090, 31556, 4257, 0)

In [97]:
i_to_j['02785846']

['Progress in neuro-psychopharmacology & biological psychiatry',
 'Progress in neuro-psychopharmacology']

In [98]:
redun_issn

[['07306652',
  'Journal of cellular biochemistry. Supplement',
  ['Journal of supramolecular structure and cellular biochemistry. Supplement',
   'Journal of cellular biochemistry. Supplement']],
 ['07077270',
  'The Journal of otolaryngology. Supplement',
  ['The Journal of otolaryngology',
   'The Journal of otolaryngology. Supplement']],
 ['02785846',
  'Progress in neuro-psychopharmacology',
  ['Progress in neuro-psychopharmacology & biological psychiatry',
   'Progress in neuro-psychopharmacology']],
 ['10969888',
  'Journal of mass spectrometry : JMS',
  ['Biological mass spectrometry', 'Journal of mass spectrometry : JMS']],
 ['03027430',
  'JBR-BTR : organe de la Societe royale belge de radiologie (SRBR) = orgaan van de Koninklijke Belgische Vereniging voor Radiologie (KBVR)',
  ['Journal belge de radiologie',
   'JBR-BTR : organe de la Societe royale belge de radiologie (SRBR) = orgaan van de Koninklijke Belgische Vereniging voor Radiologie (KBVR)']],
 ['03855716',
  'Memai h

## ___Journal impact data___ 

### Download


In [29]:
sjr_base_url = "https://www.scimagojr.com/journalrank.php?out=xls&year="

sjr_dir   = work_dir / "_sjr"
sjr_dir.mkdir(parents=True, exist_ok=True)

for year in range(2001,2023):
  sjr_file = sjr_dir / f"scimagojr_{year}.csv"
  print(sjr_file)
  if not sjr_file.exists():
    _ = request.urlretrieve(sjr_base_url + str(year), sjr_file)
    sleep(5)


/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2001.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2002.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2003.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2004.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2005.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2006.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2007.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2008.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2009.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2010.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2011.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2012.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2013.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2014.csv
/home/shius/projects/plant_sci_his

### Inspect the SJR data

In [65]:
sjr_file = sjr_dir / "scimagojr_2001.csv"
df = pd.read_csv(sjr_file, sep=";")
df.head(2)

Unnamed: 0,Rank,Sourceid,Title,Type,Issn,SJR,SJR Best Quartile,H index,Total Docs. (2001),Total Docs. (3years),...,Total Cites (3years),Citable Docs. (3years),Cites / Doc. (2years),Ref. / Doc.,Country,Region,Publisher,Coverage,Categories,Areas
0,1,16801,Annual Review of Biochemistry,journal,"15454509, 00664154",39946,Q1,316,23,88,...,4143,88,3246,18043,United States,Northern America,Annual Reviews Inc.,"1946-1948, 1950-1960, 1962-2022",Biochemistry (Q1),"Biochemistry, Genetics and Molecular Biology"
1,2,20651,Annual Review of Immunology,journal,"15453278, 07320582",36369,Q1,317,24,84,...,4437,84,4808,17092,United States,Northern America,Annual Reviews Inc.,1983-2022,Immunology (Q1); Immunology and Allergy (Q1),Immunology and Microbiology; Medicine


In [66]:
df.columns

Index(['Rank', 'Sourceid', 'Title', 'Type', 'Issn', 'SJR', 'SJR Best Quartile',
       'H index', 'Total Docs. (2001)', 'Total Docs. (3years)', 'Total Refs.',
       'Total Cites (3years)', 'Citable Docs. (3years)',
       'Cites / Doc. (2years)', 'Ref. / Doc.', 'Country', 'Region',
       'Publisher', 'Coverage', 'Categories', 'Areas'],
      dtype='object')
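
The `Issn` column packs several ISSNs into one string, and the SJR values printed above (e.g., 39946) look as if a decimal comma may have been lost somewhere. The sketch below shows one possible way to handle both, assuming the export uses `,` as the decimal mark; that is an assumption to verify against the raw file, and the two rows are made up.

```python
import io
import pandas as pd

# Two hypothetical rows mimicking the semicolon-separated layout shown above.
raw = io.StringIO(
  'Rank;Title;Issn;SJR\n'
  '1;Journal A;"11111111, 22222222";3,5\n'
  '2;Journal B;33333333;1,25\n'
)

# decimal="," is an assumption about the export; dtype keeps ISSNs as strings.
sjr = pd.read_csv(raw, sep=";", decimal=",", dtype={"Issn": str})

# Split the packed ISSN string and give every ISSN its own row.
sjr["Issn"] = sjr["Issn"].str.split(", ")
per_issn = sjr.explode("Issn").set_index("Issn")

print(per_issn.loc["22222222", "SJR"])
```

A per-ISSN index like this makes it straightforward to look up a journal metric from any of the ISSNs stored in `j_to_i`.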

## ___Topical impact___

### Read topic assignment

- Code adapted from script_5_3.
- Uses the duplicate-free corpus file generated in 7_5.

In [56]:
# topic data-frame
corpus_file_nodup = dir42 / 'table7_5_corpus_with_topic_assignment_nodup.tsv.gz'
tdf = pd.read_csv(corpus_file_nodup, sep='\t', compression='gzip', index_col=[0])
print("topic dataframe:", tdf.shape)

topic dataframe: (421307, 12)


In [57]:
tdf.head(2)

Unnamed: 0,Index_1385417,PMID,Date,Journal,Title,Abstract,Initial filter qualifier,Corpus,reg_article,Text classification score,Preprocessed corpus,Topic
0,3,61,1975-12-11,Biochimica et biophysica acta,Identification of the 120 mus phase in the dec...,After a 500 mus laser flash a 120 mus phase in...,spinach,Identification of the 120 mus phase in the dec...,1,0.716394,identification 120 mus phase decay delayed flu...,52
1,4,67,1975-11-20,Biochimica et biophysica acta,Cholinesterases from plant tissues. VI. Prelim...,Enzymes capable of hydrolyzing esters of thioc...,plant,Cholinesterases from plant tissues. VI. Prelim...,1,0.894874,cholinesterases plant tissues . vi . prelimina...,48


### Get pmid, date, issn, and topic

In [137]:
tdf_issns = []
not_found = []
for journal in tdf.Journal.values:
  if journal.find("&amp;") != -1:
    journal = journal.replace("&amp;", "&")
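  # Titles that differ between the corpus and Medline, often in the part after
  # a period; these are handled case by case below.
  # e.g., "Biology bulletin of the Academy of Sciences of the USSR. Akademiia nauk SSSR"
  #       "Comptes rendus hebdomadaires des seances de l'Academie des sciences. Serie D: Sciences naturelles"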
  if journal.find("Comptes rendus hebdomadaires des seances de l'Academie des sciences") != -1 or \
     journal == "Development (Cambridge, England). Supplement" or \
     journal == "Nucleic acids research. Supplement (2001)":
    journal = journal.split(".")[0]
  elif journal == "Biology bulletin of the Academy of Sciences of the USSR":
    journal = "Biology bulletin of the Academy of Sciences of the USSR. Akademiia nauk SSSR"
  elif journal == "Journal of chromatography":
    journal = "Journal of chromatography. A"
  elif journal.find("Ukrains'kyi biokhimichnyi zhurnal") != -1:
    journal = "Ukrains'kyi biokhimichnyi zhurnal"
  try:
    issns = j_to_i[journal]
    tdf_issns.append(",".join(issns))
  except KeyError:
    not_found.append(journal)
    tdf_issns.append("")

In [132]:
len(not_found), sorted(not_found)

(163,
 ['Acta biochimica et biophysica; Academiae Scientiarum Hungaricae',
  'Acta biochimica et biophysica; Academiae Scientiarum Hungaricae',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Agricultural systems',
  'Agricultural systems',
  'Ai zheng = Aizheng = Chinese journal of cancer',
  'Ai zheng = Aizheng = Chinese journal of cancer',
  'Applied mathematical modelling',
  'Bulletin de la Societe de pathologie exotique et de ses filiales',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH proto

In [138]:
len(tdf.Journal.values), len(tdf_issns)

(421307, 421307)

In [144]:
# Create pmid-topic dataframe; .copy() makes it independent of tdf so that
# adding columns below does not trigger SettingWithCopyWarning
pdjit = tdf[['PMID', 'Date', 'Journal', 'Topic']].copy()

# Insert ISSN
pdjit.insert(3, 'ISSN', tdf_issns)

# Add year
pdjit['Year'] = [int(date.split('-')[0]) for date in pdjit['Date'].values]

# Save the dataframe
pdjit.to_csv(work_dir / 'table_pdjity.tsv', sep='\t', index=False)
pdjit.head(2)


Unnamed: 0,PMID,Date,Journal,ISSN,Topic,Year
0,61,1975-12-11,Biochimica et biophysica acta,"00063002,18782434",52,1975
1,67,1975-11-20,Biochimica et biophysica acta,"00063002,18782434",48,1975


### Get topic indices and names

In [145]:
# topic assignments
toc_array = pdjit['Topic'].values

# topic indices
tocs = np.unique(toc_array)

# exclude topic=-1 (unassigned)
tocs_90 = tocs[tocs != -1]

# number of topic=-1
n_rec_toc_unassigned = int((toc_array == -1).sum())

# Total number of docs. Originally considered subtracting the unassigned docs,
# but the total for taxa would be the total number of docs, so it does not
# make sense to remove the unassigned ones.
n_rec_total  = len(toc_array)

print("number of topic=-1:", n_rec_toc_unassigned)
print("number of docs with topic assignment:", n_rec_total)

number of topic=-1: 49192
number of docs with topic assignment: 421307


### Get topical impact for each topic, each year
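
A possible shape for this step is sketched below with toy data. It is an illustration under assumptions rather than the final analysis: `pubs` stands in for `pdjit`, `sjr` stands in for a hypothetical per-ISSN, per-year table built from the downloaded SJR files, and taking the first matching ISSN per publication is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for pdjit; ISSNs are comma-joined as in the table above.
pubs = pd.DataFrame({
  "PMID": [1, 2, 3, 4],
  "ISSN": ["11111111,22222222", "33333333", "", "33333333"],
  "Topic": [5, 5, 5, 7],
  "Year": [2001, 2001, 2001, 2002],
})
# Hypothetical per-ISSN, per-year journal metric table.
sjr = pd.DataFrame({
  "ISSN": ["22222222", "33333333", "33333333"],
  "Year": [2001, 2001, 2002],
  "SJR": [3.0, 1.0, 2.0],
})

# One row per (publication, ISSN) so that either ISSN can match.
pubs_long = pubs.assign(ISSN=pubs["ISSN"].str.split(",")).explode("ISSN")
merged = pubs_long.merge(sjr, on=["ISSN", "Year"], how="inner")

# Keep a single metric value per publication, even if several ISSNs matched.
merged = merged.drop_duplicates("PMID")

# Mean journal SJR of the publications in each topic and year.
topic_year = merged.groupby(["Topic", "Year"])["SJR"].mean()
print(topic_year)
```

Publications without an ISSN match (PMID 3 here) drop out of the inner join, so the number of matched records per group is worth reporting alongside the mean.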

## ___Testing___

### SJR data

In [67]:
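found_issn = 0
not_found = []
for issns in df.Issn:
  issns = issns.strip().split(", ")
  # A journal (one row in df) counts as found if any of its ISSNs is in Medline
  if any(issn in i_to_j for issn in issns):
    found_issn += 1
  else:
    # Record only the last ISSN listed for an unmatched journal
    not_found.append(issns[-1])

# Note: both counts are per journal row, not per individual ISSN
print(f"Total ISSNs: {len(df)}, in Pubmed: {found_issn}")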

Total ISSNs: 18024, in Pubmed: 10558


In [69]:
not_found[:10]

['09642536',
 '15734463',
 '10950761',
 '03044157',
 '00796379',
 '-',
 '08940347',
 '15740048',
 '13864181',
 '00653055']