# __8 Considering impact__

Goal:
- Assess the impacts of pubs over time (1999-2020) for different:
  - Countries
  - Topics

Approach
- For each country/sp/toc for each year, determine:
  - Average pub weighted against journal
    - Rank percentile
    - Normalized (0-1):
      - SJR
      - H index
      - Cites / Doc. (2years)
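
The approach above can be sketched on toy data. This is a minimal illustration, assuming a hypothetical table with one row per publication and an already normalized journal metric; the column name `sjr_norm` and all values are made up for the example and are not the notebook's data.

```python
import pandas as pd

# Hypothetical toy data: one row per publication, with the normalized (0-1)
# SJR of the journal it appeared in for that year.
pubs = pd.DataFrame({
    "Topic":    [0, 0, 0, 1],
    "Year":     [1999, 1999, 2000, 1999],
    "sjr_norm": [0.2, 0.4, 0.1, 0.9],
})

# Average journal metric of the publications in each topic-year
avg_impact = pubs.groupby(["Topic", "Year"])["sjr_norm"].mean()
print(avg_impact)
```

The same grouping would apply to countries or taxa by swapping the first grouping column.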

Data source
- [Medline journal list](https://ftp.ncbi.nih.gov/pubmed/J_Medline.txt)
- [Scimago Journal & Country Rank](https://www.scimagojr.com/) for journal ranking data

Key numbers:
- Pubmed journal
  - \# with jtitle and ISSN: 31624
  - \# with jtitle, no ISSN: 4257
  - \# with ISSN, no jtitle: 0
  - 23 ISSNs are assigned to >=2 JournalTitle
  - Some JournalTitles are assigned to 2 ISSNs beyond print/online.
- Of 421307 records, 163 have journal names that are not consistent with those in Medline.

## ___Setup___

### Module import

In conda env `base`

In [1]:
import pickle, xlsxwriter
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm import tqdm
from datetime import datetime
from urllib import request
from time import sleep

### Key variables

In [2]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "8_impact"
work_dir.mkdir(parents=True, exist_ok=True)

# plant science corpus with topic assignment info
dir42      = proj_dir / "4_topic_model/4_2_outlier_assign"
corpus_file = dir42 / "table4_2_corpus_with_topic_assignment.tsv.gz"
#corpus_file = dir42 / "test.tsv"

# timestamp bins
dir44            = proj_dir / "4_topic_model/4_4_over_time"
ts_for_bins_file = dir44 / "table4_4_bin_timestamp_date.tsv"
file_topic_name   = dir44 / "fig4_4_tot_heatmap_weighted_xscaled_names.txt"

# country data
dir74            = proj_dir / "7_countries/7_4_consolidate_all"
ci_file          = dir74 / 'country_info_final_a3.txt'



# So PDF is saved in a format properly
mpl.rcParams['pdf.fonttype'] = 42
plt.rcParams["font.family"] = "sans-serif"

## ___Journal list___

### Download

Example
```
--------------------------------------------------------
JrId: 1
JournalTitle: AADE editors' journal
jtitle: AADE Ed J
ISSN (Print): 0160-6999
ISSN (Online): 
IsoAbbr: AADE Ed J
NlmId: 7708172
--------------------------------------------------------
JrId: 2
...
```

In [3]:
# Get journal list
jm_url  = "https://ftp.ncbi.nih.gov/pubmed/J_Medline.txt"
jm_file = work_dir / "J_Medline.txt"
if not jm_file.exists():
  _ = request.urlretrieve(jm_url, jm_file)

### Establish journal dictionary

Considerations
- ISSNs have the format `xxxx-xxxx`; remove the `-` to simplify matching with SJR data.
- Some journal names have multiple ISSNs, and some ISSNs refer to different journal names.
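
A minimal sketch of these two considerations, using hypothetical journal titles and ISSNs (the helper `normalize_issn` and the `records` list are assumptions for illustration, not part of the notebook):

```python
from collections import defaultdict

def normalize_issn(issn):
    """Drop the hyphen from an ISSN, e.g. '0160-6999' -> '01606999'."""
    return issn.strip().replace("-", "")

# Hypothetical records: (journal title, print ISSN, online ISSN)
records = [
    ("Journal A", "1111-1111", "2222-2222"),
    ("Journal B", "3333-3333", ""),
    ("Journal A (old title)", "1111-1111", ""),
]

issn_to_titles = defaultdict(list)  # one ISSN can point to several titles
title_to_issns = defaultdict(list)  # one title can have several ISSNs
for title, issn_print, issn_online in records:
    # dict.fromkeys removes a duplicated print/online ISSN but keeps order
    for issn in dict.fromkeys([normalize_issn(issn_print),
                               normalize_issn(issn_online)]):
        if issn:
            issn_to_titles[issn].append(title)
            title_to_issns[title].append(issn)

print(dict(issn_to_titles))
```

Keeping lists on both sides makes the many-to-many cases visible instead of silently overwriting earlier entries.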

In [4]:
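def add_to_dict(i_to_j, j_to_i, issn, jtitle):
  '''Add an ISSN-title pair to both dictionaries.
  ISSNs already in i_to_j are also logged in the global list redun_issn.
  '''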
  if issn != "":
    if issn not in i_to_j:
      i_to_j[issn] = [jtitle]
    else:
      redun_issn.append([issn, jtitle, i_to_j[issn]])
      i_to_j[issn].append(jtitle)

    if jtitle not in j_to_i:
      j_to_i[jtitle] = [issn]
    else:
      j_to_i[jtitle].append(issn)

  return i_to_j, j_to_i

In [5]:
with open(jm_file) as f:
  i_to_j = {} # {ISSN: [JournalTitle]}
  j_to_i = {} # {JournalTitle: [ISSN]}

  # issn : issn print
  # issn2: issn online
  jtitle = issn = issn2 = ""
  wi_m_no_i = wi_i_no_m = 0
  redun_issn   = []
  f.readline() # rid of 1st line
  for line in f:
    # new record
    if line.startswith("----"):
      if jtitle != "" and (issn != "" or issn2 != ""):
        i_to_j, j_to_i = add_to_dict(i_to_j, j_to_i, issn, jtitle)
        if issn != issn2:
          i_to_j, j_to_i = add_to_dict(i_to_j, j_to_i, issn2, jtitle)

      if jtitle == "":
        wi_i_no_m += 1
        print("With ISSN, no jtitle:", issn, issn2)
      if issn == "" and issn2 == "":
        wi_m_no_i += 1
      # reset
      jtitle = issn = issn2 = ""
    elif line.startswith("JournalTitle"):
      jtitle_tokens = line.strip().split("JournalTitle: ")
      if len(jtitle_tokens) == 2:
        jtitle = jtitle_tokens[1]
    elif line.startswith("ISSN (Print)"):
      issn_tokens = line.strip().split("ISSN (Print): ")
      if len(issn_tokens) == 2:
        issn = "".join(issn_tokens[1].split("-"))
    elif line.startswith("ISSN (Online)"):
      issn2_tokens = line.strip().split("ISSN (Online): ")
      if len(issn2_tokens) == 2:
        issn2 = "".join(issn2_tokens[1].split("-"))

# Add the last record
if jtitle != "" and (issn != "" or issn2 != ""):
  i_to_j, j_to_i = add_to_dict(i_to_j, j_to_i, issn, jtitle)
  if issn != issn2:
    i_to_j, j_to_i = add_to_dict(i_to_j, j_to_i, issn2, jtitle)

if jtitle == "":
  wi_i_no_m += 1
  print("With ISSN, no jtitle:", issn, issn2)
if issn == "" and issn2 == "":
  wi_m_no_i += 1

In [6]:
len(redun_issn), len(i_to_j), len(j_to_i), wi_m_no_i, wi_i_no_m

(23, 44090, 31556, 4257, 0)

In [7]:
i_to_j['02785846']

['Progress in neuro-psychopharmacology & biological psychiatry',
 'Progress in neuro-psychopharmacology']

In [8]:
redun_issn

[['07306652',
  'Journal of cellular biochemistry. Supplement',
  ['Journal of supramolecular structure and cellular biochemistry. Supplement',
   'Journal of cellular biochemistry. Supplement']],
 ['07077270',
  'The Journal of otolaryngology. Supplement',
  ['The Journal of otolaryngology',
   'The Journal of otolaryngology. Supplement']],
 ['02785846',
  'Progress in neuro-psychopharmacology',
  ['Progress in neuro-psychopharmacology & biological psychiatry',
   'Progress in neuro-psychopharmacology']],
 ['10969888',
  'Journal of mass spectrometry : JMS',
  ['Biological mass spectrometry', 'Journal of mass spectrometry : JMS']],
 ['03027430',
  'JBR-BTR : organe de la Societe royale belge de radiologie (SRBR) = orgaan van de Koninklijke Belgische Vereniging voor Radiologie (KBVR)',
  ['Journal belge de radiologie',
   'JBR-BTR : organe de la Societe royale belge de radiologie (SRBR) = orgaan van de Koninklijke Belgische Vereniging voor Radiologie (KBVR)']],
 ['03855716',
  'Memai h

## ___Journal impact data___ 

### Download

__IMPORTANT__: SJR data use "," as the decimal point.
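
One way to handle the decimal comma at import time is the `decimal` argument of `pd.read_csv`. The sketch below is an assumption-laden illustration: the two-row excerpt only mimics the semicolon-separated SJR layout, and its journal names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt in the SJR export layout (semicolon-separated,
# comma as the decimal mark); the values are illustrative.
sample = io.StringIO(
    "Rank;Title;Issn;SJR;H index\n"
    '1;Journal A;"1111111X, 22222222";39,946;316\n'
    "2;Journal B;33333333;0,1;12\n"
)

# decimal="," makes pandas parse "39,946" as the float 39.946
df_demo = pd.read_csv(sample, sep=";", decimal=",")
print(df_demo[["SJR", "H index"]])
```

With this option the numeric columns arrive as floats, so no string replacement is needed downstream.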


In [40]:
sjr_base_url = "https://www.scimagojr.com/journalrank.php?out=xls&year="

sjr_dir   = work_dir / "_sjr"
sjr_dir.mkdir(parents=True, exist_ok=True)
yr_range  = range(1999,2021)
for year in yr_range:
  sjr_file = sjr_dir / f"scimagojr_{year}.csv"
  print(sjr_file)
  if not sjr_file.exists():
    _ = request.urlretrieve(sjr_base_url + str(year), sjr_file)
    sleep(5)


/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_1999.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2000.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2001.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2002.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2003.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2004.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2005.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2006.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2007.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2008.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2009.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2010.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2011.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2012.csv
/home/shius/projects/plant_sci_his

### Check out sjr data

In [10]:
sjr_file = sjr_dir / "scimagojr_2001.csv"
df = pd.read_csv(sjr_file, sep=";")
df.head(2)

Unnamed: 0,Rank,Sourceid,Title,Type,Issn,SJR,SJR Best Quartile,H index,Total Docs. (2001),Total Docs. (3years),...,Total Cites (3years),Citable Docs. (3years),Cites / Doc. (2years),Ref. / Doc.,Country,Region,Publisher,Coverage,Categories,Areas
0,1,16801,Annual Review of Biochemistry,journal,"15454509, 00664154",39946,Q1,316,23,88,...,4143,88,3246,18043,United States,Northern America,Annual Reviews Inc.,"1946-1948, 1950-1960, 1962-2022",Biochemistry (Q1),"Biochemistry, Genetics and Molecular Biology"
1,2,20651,Annual Review of Immunology,journal,"15453278, 07320582",36369,Q1,317,24,84,...,4437,84,4808,17092,United States,Northern America,Annual Reviews Inc.,1983-2022,Immunology (Q1); Immunology and Allergy (Q1),Immunology and Microbiology; Medicine


In [11]:
df.columns

Index(['Rank', 'Sourceid', 'Title', 'Type', 'Issn', 'SJR', 'SJR Best Quartile',
       'H index', 'Total Docs. (2001)', 'Total Docs. (3years)', 'Total Refs.',
       'Total Cites (3years)', 'Citable Docs. (3years)',
       'Cites / Doc. (2years)', 'Ref. / Doc.', 'Country', 'Region',
       'Publisher', 'Coverage', 'Categories', 'Areas'],
      dtype='object')

### Functions for processing SJR data

In [166]:
def normalize(vals, invert=0):
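  '''Do min-max normalization to range 0-1
  Args:
    vals (list): list of values; strings may use "," as the decimal point
    invert (int): 0: do not invert, 1: invert (treat larger values as smaller)
  Return:
    vals (list): normalized values
    vmin (float): minimum of the input values
    vmax (float): maximum of the input values
  '''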

  # convert to float
  vals2 = []
  for v in vals:
    if type(v) == str:
      # Check whether a comma is used as the decimal point
      if v.find(",") == -1:
        vals2.append(float(v))
      else:
        vals2.append(float(v.replace(",", ".")))
    else:
      vals2.append(v)

  #if sum(np.isnan(vals2)) > 0:
  #  print("ERR: has NA")
  #  print(vals)
  #  isnan = np.isnan(vals2)
  #  print([vals[i] for i, x in enumerate(isnan) if x])

  vals = vals2.copy()
  vals2 = np.array(vals2)

  if invert == 1:
    vmax = max(vals)
    vmin = min(vals)
    vals = [(vmax - v + vmin) for v in vals]
  
  # normalize
  vmin = min(vals)
  vmax = max(vals)
  vals = [(v - vmin) / (vmax - vmin) for v in vals]

  return vals, vmin, vmax
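
For comparison, a vectorized sketch of the same min-max idea with NumPy. It assumes the values are already numeric (no decimal-comma strings) and that they are not all identical; `minmax_normalize` is a hypothetical name, not a function used elsewhere in this notebook.

```python
import numpy as np

def minmax_normalize(values, invert=False):
    """Min-max scale to 0-1, ignoring NaN; optionally flip so small -> 1."""
    arr = np.asarray(values, dtype=float)
    vmin, vmax = np.nanmin(arr), np.nanmax(arr)
    scaled = (arr - vmin) / (vmax - vmin)
    # Inverting (vmax - v + vmin) and then scaling equals 1 - scaled
    return 1.0 - scaled if invert else scaled

ranks = [1, 2, 3, 4, 5]
print(minmax_normalize(ranks, invert=True))   # rank 1 maps to 1.0
```

Using `np.nanmin`/`np.nanmax` keeps a missing value from affecting the range; the missing entry itself stays NaN in the output.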

In [167]:
def get_sjr_df(yr):
  '''Get SJR metrics for a given year
  Args:
    yr (int): year
  Return:
    d_metric (dict): {issn: [Prank, SJR, Hidx, Cite]}
    sjr_df (DataFrame): SJR table for that year
    min_max (list): [pmin, pmax, smin, smax, hmin, hmax, cmin, cmax]
  '''

  # process SJR data for that yr
  # From 2014 on, column 5 (ISSN) has mixed types.
  # Specify dtype option on import or set low_memory=False.
  sjr_file  = sjr_dir / f"scimagojr_{yr}.csv"
  sjr_df    = pd.read_csv(sjr_file, sep=";", low_memory=False)

  # get rank  
  j_prank, pmin, pmax = normalize(sjr_df['Rank'].values, 1)  # percentile rank
  j_sjr  , smin, smax = normalize(sjr_df['SJR'].values)      # SJR
  j_hidx , hmin, hmax = normalize(sjr_df['H index'].values)  # H-index
  j_cite , cmin, cmax = normalize(sjr_df['Cites / Doc. (2years)'].values) # cites per doc

  # build metric dictionary
  d_metric = {} # {issn: [Prank, SJR, Hidx, Cite]}
  issns    = sjr_df['Issn'].values
  for idx, issn in enumerate(issns):
    # cast to str first: the ISSN column can contain non-string values
    issn = str(issn).split(", ")
    
    for token in issn:
      d_metric[token] = [j_prank[idx], j_sjr[idx], j_hidx[idx], j_cite[idx]]

  return d_metric, sjr_df, [pmin, pmax, smin, smax, hmin, hmax, cmin, cmax]
  

### Establish a dictionary with metrics for different years

In [169]:
log_d_d_metric = work_dir / "log_d_d_metric.txt"

# for debugging purposes
m_yr_min_max = {} # {yr: [pmin, pmax, smin, smax, hmin, hmax, cmin, cmax]}
with open(log_d_d_metric, "w") as f:
  f.write(f"Year\t[pmin, pmax, smin, smax, hmin, hmax, cmin, cmax]\n")
  d_d_metric = {} # {yr: d_metric}
  for yr in tqdm(yr_range):
    # Get the metric for that year
    d_metric, _, min_max = get_sjr_df(yr)
    d_d_metric[yr]       = d_metric
    m_yr_min_max[yr]     = min_max

    f.write(f"{yr}\t{min_max}\n")


100%|██████████| 22/22 [00:09<00:00,  2.32it/s]


In [170]:
# Save metric dictionary as pickle
with open(work_dir / "sjr_metric_dicts.pkl", "wb") as f:
  pickle.dump(d_d_metric, f)

## ___Topical impact___

### Data processing

#### Read topic assignment

- Lifted from script_5_3
- Use the no dup file from 7_5

In [17]:
# topic data-frame
corpus_file_nodup = dir42 / 'table7_5_corpus_with_topic_assignment_nodup.tsv.gz'
tdf = pd.read_csv(corpus_file_nodup, sep='\t', compression='gzip', index_col=[0])
print("topic dataframe:", tdf.shape)

topic dataframe: (421307, 12)


In [18]:
tdf.head(2)

Unnamed: 0,Index_1385417,PMID,Date,Journal,Title,Abstract,Initial filter qualifier,Corpus,reg_article,Text classification score,Preprocessed corpus,Topic
0,3,61,1975-12-11,Biochimica et biophysica acta,Identification of the 120 mus phase in the dec...,After a 500 mus laser flash a 120 mus phase in...,spinach,Identification of the 120 mus phase in the dec...,1,0.716394,identification 120 mus phase decay delayed flu...,52
1,4,67,1975-11-20,Biochimica et biophysica acta,Cholinesterases from plant tissues. VI. Prelim...,Enzymes capable of hydrolyzing esters of thioc...,plant,Cholinesterases from plant tissues. VI. Prelim...,1,0.894874,cholinesterases plant tissues . vi . prelimina...,48


#### Get pmid, date, issn, and topic

In [19]:
tdf_issns = []
not_found = []
for journal in tdf.Journal.values:
  if journal.find("&amp;") != -1:
    journal = journal.replace("&amp;", "&")
  # anomaly with period
  # e.g., "Biology bulletin of the Academy of Sciences of the USSR. Akademiia nauk SSSR"
  #       Comptes rendus hebdomadaires des seances de l'Academie des sciences. Serie D: Sciences naturelles
  if journal.find("Comptes rendus hebdomadaires des seances de l'Academie des sciences") != -1 or \
     journal == "Development (Cambridge, England). Supplement" or \
     journal == "Nucleic acids research. Supplement (2001)":
    journal = journal.split(".")[0]
  elif journal == "Biology bulletin of the Academy of Sciences of the USSR":
    journal = "Biology bulletin of the Academy of Sciences of the USSR. Akademiia nauk SSSR"
  elif journal == "Journal of chromatography":
    journal = "Journal of chromatography. A"
  elif journal.find("Ukrains'kyi biokhimichnyi zhurnal") != -1:
    journal = "Ukrains'kyi biokhimichnyi zhurnal"
  try:
    issns = j_to_i[journal]
    tdf_issns.append(",".join(issns))
  except KeyError:
    not_found.append(journal)
    tdf_issns.append("")
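
The `&amp;` replacement above handles one HTML entity. A more general sketch uses `html.unescape` from the standard library; whether other entities actually occur in the corpus has not been checked here, and `clean_journal_name` plus the example title are illustrative assumptions.

```python
import html

def clean_journal_name(name):
    """Decode HTML entities such as '&amp;' left over in a journal name."""
    return html.unescape(name)

# Illustrative title containing an encoded ampersand
print(clean_journal_name("Plant &amp; cell physiology"))
```

Names without entities pass through unchanged, so the function can be applied to every record.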

In [20]:
len(not_found), sorted(not_found)

(163,
 ['Acta biochimica et biophysica; Academiae Scientiarum Hungaricae',
  'Acta biochimica et biophysica; Academiae Scientiarum Hungaricae',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Acta physiologiae plantarum',
  'Agricultural systems',
  'Agricultural systems',
  'Ai zheng = Aizheng = Chinese journal of cancer',
  'Ai zheng = Aizheng = Chinese journal of cancer',
  'Applied mathematical modelling',
  'Bulletin de la Societe de pathologie exotique et de ses filiales',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH protocols',
  'CSH proto

In [21]:
len(tdf.Journal.values), len(tdf_issns)

(421307, 421307)

In [22]:
# Create pmid-topic dataframe (copy to avoid SettingWithCopyWarning when adding columns)
pdjit = tdf[['PMID', 'Date', 'Journal', 'Topic']].copy()

# Insert ISSN
pdjit.insert(3, 'ISSN', tdf_issns)

# Add year
year = [int(date.split('-')[0]) for date in pdjit['Date'].values]
pdjit['Year'] = year

# Save the dataframe
pdjit.to_csv(work_dir / 'table_pdjity.tsv', sep='\t', index=False)
pdjit.head(2)


Unnamed: 0,PMID,Date,Journal,ISSN,Topic,Year
0,61,1975-12-11,Biochimica et biophysica acta,"00063002,18782434",52,1975
1,67,1975-11-20,Biochimica et biophysica acta,"00063002,18782434",48,1975


#### Get topic indices and names

In [23]:
# topic assignments
toc_array = pdjit['Topic'].values

# topic indices
tocs = np.unique(toc_array)

# exclude topic=-1
tocs_90 = tocs[1:]

# number of topic=-1
n_rec_toc_unassigned = sum((toc_array==-1).astype(int))

# total number of docs. Originally considered subtracting the unassigned
# docs, but the total for taxa would be the number of all docs, so the
# unassigned docs are kept.
n_rec_total  = len(toc_array)

print("number of topic=-1:", n_rec_toc_unassigned)
print("number of docs with topic assignment:", n_rec_total)

number of topic=-1: 49192
number of docs with topic assignment: 421307
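
Note that `tocs[1:]` drops the unassigned topic only because `np.unique` returns sorted values and -1 is the smallest index. A slightly more explicit sketch, on a hypothetical assignment array, uses a boolean mask instead:

```python
import numpy as np

toc_demo = np.array([-1, 3, 0, -1, 2, 3])   # hypothetical topic assignments

tocs_demo     = np.unique(toc_demo)           # sorted: [-1, 0, 2, 3]
tocs_assigned = tocs_demo[tocs_demo != -1]    # explicit mask instead of [1:]
n_unassigned  = int((toc_demo == -1).sum())   # docs with topic = -1

print(tocs_assigned, n_unassigned)
```

The mask gives the same result as the slice here, but it would still be correct if no document were unassigned.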


### Get topical impact for each topic, each year

#### Get metric lists for each topic

4 metrics in order:
- Percentile rank
- SJR
- H-index
- Cite/doc
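
Because a journal can map to more than one ISSN, a record may match several metric lists. A minimal sketch of reducing them to one 4-metric vector in the order listed above, assuming a hypothetical per-year dictionary `d_metric_demo` and helper `metrics_for`:

```python
import numpy as np

# Hypothetical per-ISSN metrics for one year: [Prank, SJR, Hidx, Cite]
d_metric_demo = {
    "11111111": [0.90, 0.10, 0.30, 0.20],
    "22222222": [0.70, 0.30, 0.10, 0.40],
}

def metrics_for(issn_field, d_metric):
    """Average the 4 metrics over all ISSNs of a record that have SJR data."""
    hits = [d_metric[i] for i in issn_field.split(",") if i in d_metric]
    # None signals that no ISSN of this record was found in the SJR data
    return np.mean(hits, axis=0) if hits else None

print(metrics_for("11111111,22222222", d_metric_demo))
```

Returning `None` for unmatched records keeps them distinguishable from records with a genuine metric of zero.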

In [171]:
def get_m_lst_lst(toc, yr):
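  '''Get a list of metric lists for a given topic and year
  Args:
    toc (int): topic index
    yr (int): year
  Return:
    m_lst_lst (list): [[pmid, journal, [issn], [Prank, SJR, Hidx, Cite]]]
    not_found (dict): {journal: [[issn], [pmids]]} for journals without SJR data
  '''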
  
  df        = pdjit.loc[(pdjit['Topic']==toc) & (pdjit['Year']==yr)]
  issns     = df['ISSN'].values
  journals  = df['Journal'].values
  pmids     = df['PMID'].values
  d_metric  = d_d_metric[yr] # {issn: [Prank, SJR, Hidx, Cite]}
  not_found = {}             # {journal: [[issn], [pmids]]}
  m_lst_lst = []  # [[pmid, journal, [issn], m_list]]
  for idx, issn in enumerate(issns):
    issn    = issn.split(",")
    journal = journals[idx]
    pmid    = pmids[idx]
    m_list = []
    for issn_token in issn:
      if issn_token in d_metric:
        m_list.append(d_metric[issn_token])
    if m_list == []:
      if journal not in not_found:
        not_found[journal] = [issn, [pmid]]
      else:
        not_found[journal][1].append(pmid)
    else:
      # average each of the 4 metrics across the matched ISSNs
      # (m_idx avoids shadowing the outer loop variable idx)
      m_list2 = []
      for m_idx in range(4):
        m_sum = 0
        for ms in m_list:
          m_sum += ms[m_idx]
        m_avg = m_sum / len(m_list)
        m_list2.append(m_avg)

      # for debugging
      m_lst_lst.append([pmid,journal,issn,m_list2])

  return m_lst_lst, not_found

#### Calculate average

Determine average impact
- For pubs in topic T in year Y
  - Go through journal of each pub
  - Get the metric for that journal
  - Add the metrics
  - Divide each metric total by the number of pubs in topic T in year Y (excluding NaN values)
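
The steps above, including the handling of missing values, can be sketched on a small hypothetical array (`m_2d_demo` and its values are made up for illustration):

```python
import numpy as np

# Hypothetical metrics for three publications in one topic-year;
# columns are [Prank, SJR, Hidx, Cite], NaN marks a missing metric.
m_2d_demo = np.array([
    [0.9, 0.10,   0.3, 0.2],
    [0.7, np.nan, 0.1, 0.4],
    [0.8, 0.30,   0.2, np.nan],
])

# Number of publications with a value, counted separately for each metric
n_valid = (~np.isnan(m_2d_demo)).sum(axis=0)

# Sum while skipping NaN, then divide by the per-metric counts
m_avg = np.nansum(m_2d_demo, axis=0) / n_valid
print(n_valid, m_avg)
```

When every column has at least one value, `np.nanmean(m_2d_demo, axis=0)` gives the same averages in one call.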


In [186]:
log_toc_yr_issn_not_found = work_dir / "log_toc_yr_issn_not_found.txt"

with open(log_toc_yr_issn_not_found, "w") as f:
  t_y_avg = {} # {topic: {year: [Prank, SJR, Hidx, Cite]}}

  # For each topic
  zero_pub = []
  for toc in tqdm(tocs):
    # output to log file
    f.write(f"-----\ntopic={toc}\n")
    t_y_avg[toc] = {}

    # For each year
    for yr in yr_range:
      # [[prank, sjr, hidx, cite]] for all records in a given topic-year
      m_lst_lst, not_found = get_m_lst_lst(toc, yr) 
      #print(len(m_lst_lst))
      
      # compile metrics into a 2d array
      m_2d = []
      for m_list in m_lst_lst:
        m_2d.append(m_list[3])
      m_2d  = np.array(m_2d)

      # determine n_pub for each metric after removing NA
      n_pub = np.subtract([m_2d.shape[0]]*4, sum(np.isnan(m_2d)))
      # For a few cases without publication, set to NaN
      if 0 in n_pub:
        t_y_avg[toc][yr] = [np.nan]*4
        zero_pub.append([toc, yr])
      else:
        # calculate average and store in dict
        # Issue: RuntimeWarning: invalid value encountered in true_divide
        # https://www.geeksforgeeks.org/how-to-fix-invalid-value-encountered-in-true_divide/
        m_sum = np.nansum(m_2d, axis=0)
        m_avg = np.divide(m_sum, n_pub)
        t_y_avg[toc][yr] = m_avg
      
      # output to log file
      f.write(f" {yr}:\n")
      for journal in not_found:
        f.write(f"  {journal}:\n" + \
                f"    ISSN:{not_found[journal][0]}\n" + \
                f"    PMID:{not_found[journal][1]}\n")

100%|██████████| 91/91 [00:05<00:00, 15.62it/s]


In [187]:
zero_pub

[[5, 1999],
 [5, 2000],
 [5, 2002],
 [5, 2003],
 [5, 2004],
 [5, 2006],
 [5, 2007],
 [5, 2008],
 [5, 2009],
 [5, 2011],
 [5, 2012],
 [5, 2015],
 [8, 2001],
 [8, 2004],
 [8, 2007],
 [8, 2008]]

#### Generate output

In [188]:
t_y_avg

{-1: {1999: array([0.93132899, 0.06693772, 0.20763649, 0.08632451]),
  2000: array([0.90052219, 0.07144787, 0.19605551, 0.07628635]),
  2001: array([0.88038837, 0.06071294, 0.18281353, 0.07585605]),
  2002: array([0.88976276, 0.06079344, 0.18228675, 0.06574801]),
  2003: array([0.880466  , 0.05927028, 0.17409542, 0.06547605]),
  2004: array([0.87593081, 0.05758664, 0.1673759 , 0.06685195]),
  2005: array([0.86111752, 0.05330474, 0.15830547, 0.07191964]),
  2006: array([0.88752926, 0.05479186, 0.16445434, 0.07384339]),
  2007: array([0.87356704, 0.05456691, 0.1551519 , 0.07407889]),
  2008: array([0.87502924, 0.05401463, 0.15359121, 0.08395089]),
  2009: array([0.88841267, 0.06399907, 0.1617655 , 0.06862464]),
  2010: array([0.88907512, 0.04865923, 0.15465756, 0.06165152]),
  2011: array([0.88472422, 0.04615543, 0.15382801, 0.06021781]),
  2012: array([0.89103239, 0.04183726, 0.15738658, 0.03925749]),
  2013: array([0.89805486, 0.04601646, 0.16167972, 0.03718126]),
  2014: array([0.8914

## ___Testing___

### SJR data

In [None]:
found_issn = 0
not_found = []
for issns in df.Issn:
  issns = issns.strip().split(", ")
  found = 0
  for issn in issns:
    if issn in i_to_j:
      found_issn += 1
      found = 1
      break
  
  if not found:
    not_found.append(issn)

print(f"Total SJR journals: {len(df)}, with an ISSN in Pubmed: {found_issn}")

In [None]:
not_found[:10]

### SJR data calculation

Example:
- [pmid, journal, [issn], [Prank, SJR, H-idx, Cite/doc]]
- [9872323, 'Oncogene', ['09509232', '14765594'], [0.9908989489753978, 0.09022571303899402, 0.270473328324568, 0.13452256033578175]]



In [117]:
sjr_file  = sjr_dir / f"scimagojr_1999.csv"
sjr_df    = pd.read_csv(sjr_file, sep=";", low_memory=False)
sjr_df[sjr_df["Title"] == "Oncogene"]

Unnamed: 0,Rank,Sourceid,Title,Type,Issn,SJR,SJR Best Quartile,H index,Total Docs. (1999),Total Docs. (3years),...,Total Cites (3years),Citable Docs. (3years),Cites / Doc. (2years),Ref. / Doc.,Country,Region,Publisher,Coverage,Categories,Areas
155,156,12523,Oncogene,journal,"14765594, 09509232",4649,Q1,360,924,2088,...,13818,2087,641,4954,United Kingdom,Western Europe,Nature Publishing Group,1987-2022,Cancer Research (Q1); Genetics (Q1); Molecular...,"Biochemistry, Genetics and Molecular Biology"


In [112]:
def get_val(journal, yr):
  m_vals = []
  # metric type
  m_types = ["Rank", "SJR", "H index", "Cites / Doc. (2years)"]
  for m_type in m_types:
    val = sjr_df[sjr_df["Title"] == journal][m_type].values[0]
    if type(val) == str and val.find(",") != -1:
      val = val.replace(",", ".")
    m_vals.append(float(val))

  print(m_vals)

  min_max = m_yr_min_max[yr]
  print(normalize([min_max[0], min_max[1], m_vals[0]], invert=1))
  print(normalize([min_max[2], min_max[3], m_vals[1]]))
  print(normalize([min_max[4], min_max[5], m_vals[2]]))
  print(normalize([min_max[6], min_max[7], m_vals[3]]))

In [116]:
get_val("Oncogene", 1999)

[156.0, 4.649, 360.0, 6.41]
([1.0, 0.0, 0.9908989489753978], 1, 17032)
([0.0, 1.0, 0.09022571303899402], 0.1, 50.518)
([0.0, 1.0, 0.270473328324568], 0, 1331)
([0.0, 1.0, 0.13452256033578175], 0.0, 47.65)
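
As a cross-check, the normalized Oncogene values can be recomputed by hand from the raw values and the 1999 min/max pairs printed above. This is only a sketch of the arithmetic; the variable names are chosen for the example.

```python
# Values reported above for Oncogene in 1999 and the 1999 min/max of each metric
rank, sjr, hidx, cite = 156.0, 4.649, 360.0, 6.41
pmin, pmax = 1, 17032
smin, smax = 0.1, 50.518
hmin, hmax = 0, 1331
cmin, cmax = 0.0, 47.65

prank_norm = (pmax - rank) / (pmax - pmin)   # inverted: rank 1 -> 1.0
sjr_norm   = (sjr - smin) / (smax - smin)
hidx_norm  = (hidx - hmin) / (hmax - hmin)
cite_norm  = (cite - cmin) / (cmax - cmin)
print(prank_norm, sjr_norm, hidx_norm, cite_norm)
```

These agree with the `[Prank, SJR, H-idx, Cite/doc]` list shown for PMID 9872323 in the example at the top of this section.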


In [120]:
sjr_df[sjr_df["Title"] == "Plant Cell"]

Unnamed: 0,Rank,Sourceid,Title,Type,Issn,SJR,SJR Best Quartile,H index,Total Docs. (1999),Total Docs. (3years),...,Total Cites (3years),Citable Docs. (3years),Cites / Doc. (2years),Ref. / Doc.,Country,Region,Publisher,Coverage,Categories,Areas
52,53,16594,Plant Cell,journal,"1532298X, 10404651",8834,Q1,380,196,535,...,5620,522,994,5443,United States,Northern America,Oxford University Press,1989-2022,Cell Biology (Q1); Plant Science (Q1),Agricultural and Biological Sciences; Biochemi...


In [121]:
get_val("Plant Cell", 1999)

[53.0, 8.834, 380.0, 9.94]
([1.0, 0.0, 0.9969467441723915], 1, 17032)
([0.0, 1.0, 0.17323178229997224], 0.1, 50.518)
([0.0, 1.0, 0.28549962434259957], 0, 1331)
([0.0, 1.0, 0.20860440713536202], 0.0, 47.65)


In [108]:
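# For Oncogene. Earlier attempt, superseded by get_val(): m_yr_min_max is keyed
# by year, so m_yr_min_max[0] raises the KeyError below.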
print(normalize([m_yr_min_max[0], m_yr_min_max[1], m_vals[0]], invert=1))
print(normalize([m_yr_min_max[2], m_yr_min_max[3], m_vals[1]]))
print(normalize([m_yr_min_max[4], m_yr_min_max[5], m_vals[2]]))
print(normalize([m_yr_min_max[6], m_yr_min_max[7], m_vals[3]]))

KeyError: 0

### Get topic=0, year=2001

In [None]:
len(tocs), len(yr_range), pdjit.head(2)

(91,
 21,
    PMID        Date                        Journal               ISSN  Topic  \
 0    61  1975-12-11  Biochimica et biophysica acta  00063002,18782434     52   
 1    67  1975-11-20  Biochimica et biophysica acta  00063002,18782434     48   
 
    Year  
 0  1975  
 1  1975  )

In [None]:
pdjit.loc[(pdjit['Topic']==0) & (pdjit['Year']==2001)]

Unnamed: 0,PMID,Date,Journal,ISSN,Topic,Year
48287,11139585,2001-01-15,The Journal of biological chemistry,"00219258,1083351X",0,2001
49696,11251404,2001-03-17,Allergy,"01054538,13989995",0,2001
49697,11251633,2001-03-17,Clinical and experimental allergy : journal of...,"09547894,13652222",0,2001
49698,11251634,2001-03-17,Clinical and experimental allergy : journal of...,"09547894,13652222",0,2001
49936,11270469,2001-03-29,Asian Pacific journal of allergy and immunology,0125877X,0,2001
50467,11306929,2001-04-18,International archives of allergy and immunology,"10182438,14230097",0,2001
50973,11344353,2001-05-10,The Journal of allergy and clinical immunology,"00916749,10976825",0,2001
51031,11350307,2001-05-15,Allergy,"01054538,13989995",0,2001
51907,11419719,2001-06-23,"Journal of chromatography. B, Biomedical scien...","13872273,18785603",0,2001
51908,11419723,2001-06-23,"Journal of chromatography. B, Biomedical scien...","13872273,18785603",0,2001


### Deal with nan

In [None]:
arr = np.array([5,4,2,2,4,np.nan,np.nan,6])
np.isnan(arr)

array([False, False, False, False, False,  True,  True, False])

In [None]:
arr[np.isnan(arr)]

array([nan, nan])