# __8.1 Considering topical impact__

Goal:
- Assess the impact of pubs over time (1999-2020) for different:
  - Topics
  - Countries in 8.2

Approach
- For each country/sp/toc for each year, determine:
  - Average impact score
- Before calculating the average, impact scores are processed for each year:
  - Rank
    - Inverted, min-max normalized, then squared
    - Squared to put more distance between top ranks and bottom ranks
  - The rest are not normalized because the max values can differ substantially across years, which leads to anomalous results:
    - SJR
    - H index
    - Cites / Doc. (2years)
    - For anomalous results, see `impact_topics_MOD_ANOMALY.xlsx`

Data source
- [Medline journal list](https://ftp.ncbi.nih.gov/pubmed/J_Medline.txt)
- [Scimago Journal & Country Rank](https://www.scimagojr.com/) for journal ranking data

Key numbers:
- Pubmed journal
  - \# with jtitle and ISSN: 31624
  - \# with jtitle, no ISSN: 4257
  - \# with ISSN, no jtitle: 0
  - 23 ISSNs are assigned to >=2 JournalTitles
  - Some JournalTitles are assigned to 2 ISSNs beyond print/online.
- Of 421307 records, 163 have journal names that are not consistent with those in Medline.

## ___Setup___

### Module import

In conda env `base`

In [3]:
import pickle
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm import tqdm
from urllib import request
from time import sleep

### Key variables

In [4]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "8_impact"
work_dir.mkdir(parents=True, exist_ok=True)

# plant science corpus with topic assignment info
dir42      = proj_dir / "4_topic_model/4_2_outlier_assign"
corpus_file = dir42 / "table4_2_corpus_with_topic_assignment.tsv.gz"
#corpus_file = dir42 / "test.tsv"

# timestamp bins
dir44            = proj_dir / "4_topic_model/4_4_over_time"
ts_for_bins_file = dir44 / "table4_4_bin_timestamp_date.tsv"
file_topic_name   = dir44 / "fig4_4_tot_heatmap_weighted_xscaled_names.txt"

# country data
dir74            = proj_dir / "7_countries/7_4_consolidate_all"
ci_file          = dir74 / 'country_info_final_a3.txt'



# Embed TrueType fonts (type 42) so text in saved PDFs stays editable
mpl.rcParams['pdf.fonttype'] = 42
plt.rcParams["font.family"] = "sans-serif"

## ___Journal list___

### Download

Example
```
--------------------------------------------------------
JrId: 1
JournalTitle: AADE editors' journal
jtitle: AADE Ed J
ISSN (Print): 0160-6999
ISSN (Online): 
IsoAbbr: AADE Ed J
NlmId: 7708172
--------------------------------------------------------
JrId: 2
...
```

In [5]:
# Get journal list
jm_url  = "https://ftp.ncbi.nih.gov/pubmed/J_Medline.txt"
jm_file = work_dir / "J_Medline.txt"
if not jm_file.exists():
  _ = request.urlretrieve(jm_url, jm_file)

### Establish journal dictionary

Considerations
- The ISSN has the format `xxxx-xxxx`; the `-` is removed to simplify matching with SJR data.
- Some journal names have multiple ISSNs, and some ISSNs refer to different journal names.
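
A minimal sketch of the hyphen removal described above. The helper name `normalize_issn` is hypothetical and not used elsewhere in this notebook; the parsing cell below does the same thing inline with `split` and `join`.

```python
def normalize_issn(issn):
  """Strip whitespace and drop the hyphen from an ISSN such as '0160-6999'."""
  return issn.strip().replace("-", "")

# ISSN taken from the J_Medline.txt example record above
issn_demo = normalize_issn("0160-6999")
print(issn_demo)
```

An empty ISSN field stays an empty string, which is what the dictionary-building code checks for.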

In [6]:
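def add_to_dict(i_to_j, j_to_i, issn, jtitle):
  # Map ISSN to JournalTitle and JournalTitle to ISSN in the two dictionaries.
  # An ISSN already mapped to a title is also logged in the global redun_issn.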
  if issn != "":
    if issn not in i_to_j:
      i_to_j[issn] = [jtitle]
    else:
      redun_issn.append([issn, jtitle, i_to_j[issn]])
      i_to_j[issn].append(jtitle)

    if jtitle not in j_to_i:
      j_to_i[jtitle] = [issn]
    else:
      j_to_i[jtitle].append(issn)

  return i_to_j, j_to_i

In [7]:
with open(jm_file) as f:
  i_to_j = {} # {ISSN: [JournalTitle]}
  j_to_i = {} # {JournalTitle: [ISSN]}

  # issn : issn print
  # issn2: issn online
  jtitle = issn = issn2 = ""
  wi_m_no_i = wi_i_no_m = 0
  redun_issn   = []
  f.readline() # rid of 1st line
  for line in f:
    # new record
    if line.startswith("----"):
      if jtitle != "" and (issn != "" or issn2 != ""):
        i_to_j, j_to_i = add_to_dict(i_to_j, j_to_i, issn, jtitle)
        if issn != issn2:
          i_to_j, j_to_i = add_to_dict(i_to_j, j_to_i, issn2, jtitle)

      if jtitle == "":
        wi_i_no_m += 1
        print("With ISSN, no jtitle:", issn, issn2)
      if issn == "" and issn2 == "":
        wi_m_no_i += 1
      # reset
      jtitle = issn = issn2 = ""
    elif line.startswith("JournalTitle"):
      jtitle_tokens = line.strip().split("JournalTitle: ")
      if len(jtitle_tokens) == 2:
        jtitle = jtitle_tokens[1]
    elif line.startswith("ISSN (Print)"):
      issn_tokens = line.strip().split("ISSN (Print): ")
      if len(issn_tokens) == 2:
        issn = "".join(issn_tokens[1].split("-"))
    elif line.startswith("ISSN (Online)"):
      issn2_tokens = line.strip().split("ISSN (Online): ")
      if len(issn2_tokens) == 2:
        issn2 = "".join(issn2_tokens[1].split("-"))

# Add the last record
if jtitle != "" and (issn != "" or issn2 != ""):
  i_to_j, j_to_i = add_to_dict(i_to_j, j_to_i, issn, jtitle)
  if issn != issn2:
    i_to_j, j_to_i = add_to_dict(i_to_j, j_to_i, issn2, jtitle)

if jtitle == "":
  wi_i_no_m += 1
  print("With ISSN, no jtitle:", issn, issn2)
if issn == "" and issn2 == "":
  wi_m_no_i += 1

In [8]:
len(redun_issn), len(i_to_j), len(j_to_i), wi_m_no_i, wi_i_no_m

(23, 44090, 31556, 4257, 0)

In [9]:
i_to_j['02785846']

['Progress in neuro-psychopharmacology & biological psychiatry',
 'Progress in neuro-psychopharmacology']

In [10]:
redun_issn

[['07306652',
  'Journal of cellular biochemistry. Supplement',
  ['Journal of supramolecular structure and cellular biochemistry. Supplement',
   'Journal of cellular biochemistry. Supplement']],
 ['07077270',
  'The Journal of otolaryngology. Supplement',
  ['The Journal of otolaryngology',
   'The Journal of otolaryngology. Supplement']],
 ['02785846',
  'Progress in neuro-psychopharmacology',
  ['Progress in neuro-psychopharmacology & biological psychiatry',
   'Progress in neuro-psychopharmacology']],
 ['10969888',
  'Journal of mass spectrometry : JMS',
  ['Biological mass spectrometry', 'Journal of mass spectrometry : JMS']],
 ['03027430',
  'JBR-BTR : organe de la Societe royale belge de radiologie (SRBR) = orgaan van de Koninklijke Belgische Vereniging voor Radiologie (KBVR)',
  ['Journal belge de radiologie',
   'JBR-BTR : organe de la Societe royale belge de radiologie (SRBR) = orgaan van de Koninklijke Belgische Vereniging voor Radiologie (KBVR)']],
 ['03855716',
  'Memai h

## ___Journal impact data___ 

### Download

__IMPORTANT__: SJR data use "," as the decimal separator, so numeric columns may be read in as strings.
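
One possible way to handle the decimal comma at read time is the `decimal` option of `pd.read_csv`. This is only a sketch on two made-up rows in the SJR layout; the notebook itself converts the strings later in `process_vals` instead.

```python
import io
import pandas as pd

# Made-up rows: semicolon-separated fields, comma as decimal separator
sample = (
  "Rank;Title;SJR;Cites / Doc. (2years)\n"
  "1;Journal A;39,946;32,46\n"
  "2;Journal B;0,100;0,00\n"
)

# decimal="," makes pandas parse 39,946 as the float 39.946
sjr_demo = pd.read_csv(io.StringIO(sample), sep=";", decimal=",")
print(sjr_demo.dtypes)
```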


In [11]:
sjr_base_url = "https://www.scimagojr.com/journalrank.php?out=xls&year="

sjr_dir   = work_dir / "_sjr"
sjr_dir.mkdir(parents=True, exist_ok=True)
yr_range  = range(1999,2021)
for year in yr_range:
  sjr_file = sjr_dir / f"scimagojr_{year}.csv"
  print(sjr_file)
  if not sjr_file.exists():
    _ = request.urlretrieve(sjr_base_url + str(year), sjr_file)
    sleep(5)


/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_1999.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2000.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2001.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2002.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2003.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2004.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2005.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2006.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2007.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2008.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2009.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2010.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2011.csv
/home/shius/projects/plant_sci_hist/8_impact/_sjr/scimagojr_2012.csv
/home/shius/projects/plant_sci_his

### Check out sjr data

In [12]:
sjr_file = sjr_dir / "scimagojr_2001.csv"
df = pd.read_csv(sjr_file, sep=";")
df.head(2)

Unnamed: 0,Rank,Sourceid,Title,Type,Issn,SJR,SJR Best Quartile,H index,Total Docs. (2001),Total Docs. (3years),...,Total Cites (3years),Citable Docs. (3years),Cites / Doc. (2years),Ref. / Doc.,Country,Region,Publisher,Coverage,Categories,Areas
0,1,16801,Annual Review of Biochemistry,journal,"15454509, 00664154",39946,Q1,316,23,88,...,4143,88,3246,18043,United States,Northern America,Annual Reviews Inc.,"1946-1948, 1950-1960, 1962-2022",Biochemistry (Q1),"Biochemistry, Genetics and Molecular Biology"
1,2,20651,Annual Review of Immunology,journal,"15453278, 07320582",36369,Q1,317,24,84,...,4437,84,4808,17092,United States,Northern America,Annual Reviews Inc.,1983-2022,Immunology (Q1); Immunology and Allergy (Q1),Immunology and Microbiology; Medicine


In [13]:
df.columns

Index(['Rank', 'Sourceid', 'Title', 'Type', 'Issn', 'SJR', 'SJR Best Quartile',
       'H index', 'Total Docs. (2001)', 'Total Docs. (3years)', 'Total Refs.',
       'Total Cites (3years)', 'Citable Docs. (3years)',
       'Cites / Doc. (2years)', 'Ref. / Doc.', 'Country', 'Region',
       'Publisher', 'Coverage', 'Categories', 'Areas'],
      dtype='object')

### Functions for processing SJR data

Originally, all metrics in the SJR data were min-max normalized for each year:
- In some years, the max values are anomalous.
- In particular, the max cites/doc in 2015 is 52, while in neighboring years (2014, 2016) it is about twice as high.
- This inflates all normalized values for 2015, so they look much higher. There are probably other issues like this.

Thus, I decided not to normalize per year for any metric except rank:
- Rank is still normalized and inverted because the top value (rank 1) is always present.
- Rank is inverted so that it behaves like a percentile, then squared.
  - Squaring puts more distance between higher ranks; among lower-ranked journals, the differences in values do not matter as much.
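
The rank processing described above can be sketched with NumPy as follows. The function name `rank_to_score` is hypothetical; `process_vals` below implements the same three steps with lists.

```python
import numpy as np

def rank_to_score(ranks):
  """Invert ranks, min-max normalize to [0, 1], then square."""
  ranks = np.asarray(ranks, dtype=float)
  rmin, rmax = ranks.min(), ranks.max()
  inverted = rmax - ranks + rmin              # rank 1 becomes the largest value
  scaled = (inverted - rmin) / (rmax - rmin)  # 1 for the top rank, 0 for the last
  return scaled ** 2                          # squaring spreads out the top ranks

scores = rank_to_score([1, 2, 3, 4, 5])
print(scores)
```

For five ranks this gives 1, 0.5625, 0.25, 0.0625, and 0: the gap between ranks 1 and 2 is much larger than the gap between ranks 4 and 5.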


In [14]:
def process_vals(vals, rank_data=0):
  '''Convert all values to floating point numbers and, for rank data,
  transform them
  Args:
    vals (list): list of values
    rank_data (int): 1 for rank data that need to be inverted, normalized,
     and squared.
  Return:
    vals (list): floating point values
    vmin (float): minimum value, returned for debugging purposes
    vmax (float): maximum value, returned for debugging purposes
  '''

  # convert to float
  vals2 = []
  for v in vals:
    if type(v) == str:
      # See if there is a comma used as the decimal separator
      if v.find(",") == -1:
        vals2.append(float(v))
      else:
        vals2.append(float(v.replace(",", ".")))
    else:
      vals2.append(v)

  #if sum(np.isnan(vals2)) > 0:
  #  print("ERR: has NA")
  #  print(vals)
  #  isnan = np.isnan(vals2)
  #  print([vals[i] for i, x in enumerate(isnan) if x])

  vals = vals2.copy()
  vals2 = np.array(vals2)

  if rank_data == 1:
    vmax = max(vals)
    vmin = min(vals)
    # invert
    vals = [(vmax - v + vmin) for v in vals]
    # normalize
    vals = [(v - vmin) / (vmax - vmin) for v in vals]
    # square
    vals = [v*v for v in vals]
  
  vmin = min(vals)
  vmax = max(vals)

  return vals, vmin, vmax

In [15]:
def get_sjr_df(yr):
  '''Get SJR metrics for a given year
  Args:
    yr (int): year
  Return:
    d_metric (dict): {issn: [Prank, SJR, Hidx, Cite]}
    sjr_df (DataFrame): SJR table for that year
    min_max (list): [pmin, pmax, smin, smax, hmin, hmax, cmin, cmax]
  '''

  # process SJR data for that yr
  # From 2014 on, column 5 (ISSN) has mixed types.
  # Specify dtype option on import or set low_memory=False.
  sjr_file  = sjr_dir / f"scimagojr_{yr}.csv"
  sjr_df    = pd.read_csv(sjr_file, sep=";", low_memory=False)

  # get metric
  j_prank, pmin, pmax = process_vals(sjr_df['Rank'].values, 1)  
  j_sjr  , smin, smax = process_vals(sjr_df['SJR'].values)      
  j_hidx , hmin, hmax = process_vals(sjr_df['H index'].values)  
  j_cite , cmin, cmax = process_vals(sjr_df['Cites / Doc. (2years)'].values) 

  # build metric dictionary
  d_metric = {} # {issn: [Prank, SJR, Hidx, Cite]}
  issns    = sjr_df['Issn'].values
  for idx, issn in enumerate(issns):
    if issn.find(", ") != -1:
      issn = issn.split(", ")
    else:
      issn = [str(issn)]
    
    for token in issn:
      d_metric[token] = [j_prank[idx], j_sjr[idx], j_hidx[idx], j_cite[idx]]

  return d_metric, sjr_df, [pmin, pmax, smin, smax, hmin, hmax, cmin, cmax]
  

### Establish a dictionary with metrics for different years

d_d_metric: {yr: d_metric}

In [16]:
log_d_d_metric = work_dir / "log_d_d_metric.txt"

# for debugging purposes
m_yr_min_max = {} # {yr: [pmin, pmax, smin, smax, hmin, hmax, cmin, cmax]}
with open(log_d_d_metric, "w") as f:
  f.write(f"Year\t[pmin, pmax, smin, smax, hmin, hmax, cmin, cmax]\n")
  d_d_metric = {} # {yr: d_metric}
  for yr in tqdm(yr_range):
    # Get the metric for that year
    d_metric, _, min_max = get_sjr_df(yr)
    d_d_metric[yr]       = d_metric
    m_yr_min_max[yr]     = min_max

    f.write(f"{yr}\t{min_max}\n")


100%|██████████| 22/22 [00:07<00:00,  3.12it/s]


In [17]:
# Save metric dictionary as pickle
with open(work_dir / "sjr_metric_dicts.pkl", "wb") as f:
  pickle.dump(d_d_metric, f)

## ___Topical impact___

### Data processing

#### Read topic assignment and create pdjit

- Lifted from script_5_3
- Use the no dup file from 7_5
- pdjit dataframe: pmid, date, journal, issn, topic, year

In [20]:
def get_pdjit():
  print("Read topic data-frame")
  corpus_file_nodup = dir42 / 'table7_5_corpus_with_topic_assignment_nodup.tsv.gz'
  tdf = pd.read_csv(corpus_file_nodup, sep='\t', compression='gzip', index_col=[0])
  print("", tdf.shape)

  # Get pmid, date, issn, and topic
  tdf_issns = []
  not_found = []
  for journal in tdf.Journal.values:
    if journal.find("&amp;") != -1:
      journal = journal.replace("&amp;", "&")
    # anomaly with period
    # e.g., "Biology bulletin of the Academy of Sciences of the USSR. Akademiia nauk SSSR"
    #       Comptes rendus hebdomadaires des seances de l'Academie des sciences. Serie D: Sciences naturelles
    if journal.find("Comptes rendus hebdomadaires des seances de l'Academie des sciences") != -1 or \
      journal == "Development (Cambridge, England). Supplement" or \
      journal == "Nucleic acids research. Supplement (2001)":
      journal = journal.split(".")[0]
    elif journal == "Biology bulletin of the Academy of Sciences of the USSR":
      journal = "Biology bulletin of the Academy of Sciences of the USSR. Akademiia nauk SSSR"
    elif journal == "Journal of chromatography":
      journal = "Journal of chromatography. A"
    elif journal.find("Ukrains'kyi biokhimichnyi zhurnal") != -1:
      journal = "Ukrains'kyi biokhimichnyi zhurnal"
    try:
      issns = j_to_i[journal]
      tdf_issns.append(",".join(issns))
    except KeyError:
      not_found.append(journal)
      tdf_issns.append("")

  print("Len not found:", len(not_found))
  print(sorted(not_found))
  print("Len tdf.Journal.values, tdf_issns")
  print(len(tdf.Journal.values), len(tdf_issns))

  # Create dataframe with pmid, date, journal, topic; copy so that the
  # columns added below do not trigger SettingWithCopyWarning
  pdjit = tdf[['PMID', 'Date', 'Journal', 'Topic']].copy()

  # Insert ISSN
  pdjit.insert(3, 'ISSN', tdf_issns)

  # Add year to pdjit
  year = [int(date.split('-')[0]) for date in pdjit['Date'].values]
  pdjit['Year'] = year

  return pdjit

In [23]:

# Check if this file is already generated
pdjit_file = work_dir / 'table_pdjity.tsv'
if pdjit_file.is_file():
  print("pdjit file exists")
else:
  pdjit = get_pdjit()
  pdjit.to_csv(pdjit_file, sep='\t', index=False)

# Regardless if the file was generated or not, read it.
pdjit = pd.read_csv(pdjit_file, sep='\t')
pdjit.head(2)

pdjit file exists


Unnamed: 0,PMID,Date,Journal,ISSN,Topic,Year
0,61,1975-12-11,Biochimica et biophysica acta,"00063002,18782434",52,1975
1,67,1975-11-20,Biochimica et biophysica acta,"00063002,18782434",48,1975


#### Get topic indices and names

In [24]:
# topic assignments
toc_array = pdjit['Topic'].values

# topic indices
tocs = np.unique(toc_array)

# exclude topic=-1
tocs_90 = tocs[1:]

# number of topic=-1
n_rec_toc_unassigned = sum((toc_array==-1).astype(int))

# number of docs with topic assignment. Originally was thinking about subtracting
# unassigned, but realized that the total for taxa would be the number of total
# docs, so it does not make sense to remove unassigned.
n_rec_total  = len(toc_array)

print("number of topic=-1:", n_rec_toc_unassigned)
print("number of docs with topic assignment:", n_rec_total)

number of topic=-1: 49192
number of docs with topic assignment: 421307


### Get topical impact for each topic, each year

#### Get metric lists for each topic

4 metrics in order:
- Percentile rank
- SJR
- H-index
- Cite/doc

In [34]:
def get_m_lst_lst(toc, yr):
  '''Get a list of metric lists for a given topic and year
  Args:
    toc (int): topic index
    yr (int): year
  Return:
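    m_lst_lst (list): [[pmid, journal, issn, averaged metrics, raw metrics]]
    not_found (dict): {journal: [issn, [pmids]]} for journals without metrics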
  '''
  
  df        = pdjit.loc[(pdjit['Topic']==toc) & (pdjit['Year']==yr)]
  issns     = df['ISSN'].values
  journals  = df['Journal'].values
  pmids     = df['PMID'].values
  d_metric  = d_d_metric[yr] # {issn: [Prank, SJR, Hidx, Cite]}
  not_found = {}             # {journal: [issn, [pmids]]}
  m_lst_lst = []  # [m_lst]
  for idx, issn in enumerate(issns):
    journal = journals[idx]
    pmid    = pmids[idx]

    # first check if issn is np.nan
    if type(issn) == float:
      if not np.isnan(issn):
        print("ERR: float but not nan", issn)
      issn = []
    else:
      issn = issn.split(",")

    # make sure issn, if exist, is in d_metric, then append to a metric list
    m_list = []  
    for issn_token in issn:
      if issn_token in d_metric:
        metrics = d_metric[issn_token]
        m_list.append(metrics)

    # check if this journal is found in d_metric
    if m_list == []:
      # not found, add to a dictionary for logging purpose
      if journal not in not_found:
        not_found[journal] = [issn, [pmid]]
      else:
        not_found[journal][1].append(pmid)
    else:
      # get average if multiple issns; m_idx avoids reusing the outer loop's idx
      m_list2 = []
      for m_idx in range(0,4):
        m_sum = 0
        for ms in m_list:
          m_sum += ms[m_idx]
        m_avg = m_sum / len(m_list)
        m_list2.append(m_avg)

      # need m_list2, but add more info for debugging
      m_lst_lst.append([pmid,journal,issn,m_list2, m_list])

  return m_lst_lst, not_found

#### Calculate average

Determine average impact
- For pubs in topic T in year Y
  - Go through journal of each pub
  - Get the metric for that journal
  - Add the metrics
  - Divide the metric total by the number of pubs in topic T in year Y
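
A small sketch of the averaging step on made-up numbers, assuming that a missing metric (NaN) should be left out of both the sum and the count for that metric. The array `m_2d_demo` is hypothetical; columns follow the order Prank, SJR, Hidx, Cite.

```python
import numpy as np

# Hypothetical metrics for three publications in one topic-year
m_2d_demo = np.array([
  [0.50, 1.0,  40.0, 1.0],
  [0.70, 2.0, 120.0, np.nan],   # Cite is missing for this journal
  [0.90, 3.0, 200.0, 3.0],
])

# Count publications with a value for each metric, then divide the
# NaN-ignoring sum by that count
n_valid = np.sum(~np.isnan(m_2d_demo), axis=0)
m_avg_demo = np.nansum(m_2d_demo, axis=0) / n_valid
print(n_valid, m_avg_demo)
```

With no missing counts of zero, this is equivalent to `np.nanmean(m_2d_demo, axis=0)`; the explicit count makes it easy to flag topic-years that have no usable publications.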

In [35]:
log_toc_yr_issn_not_found = work_dir / "log_toc_yr_issn_not_found.txt"

with open(log_toc_yr_issn_not_found, "w") as f:
  t_y_avg = {} # {topic: {year: [Prank, SJR, Hidx, Cite]}}

  # For each topic
  zero_pub = []
  for toc in tqdm(tocs):
    # output to log file
    f.write(f"-----\ntopic={toc}\n")
    t_y_avg[toc] = {}

    # For each year
    for yr in yr_range:
      # [[prank, sjr, hidx, cite]] for all records in a given topic-year
      m_lst_lst, not_found = get_m_lst_lst(toc, yr) 
      #print(len(m_lst_lst))
      
      # compile metrics into a 2d array
      m_2d = []
      for m_list in m_lst_lst:
        m_2d.append(m_list[3])
      m_2d  = np.array(m_2d)

      # determine n_pub for each metric after removing NA
      n_pub = np.subtract([m_2d.shape[0]]*4, sum(np.isnan(m_2d)))
      # For a few cases without publication, set to NaN
      if 0 in n_pub:
        t_y_avg[toc][yr] = [np.nan]*4
        zero_pub.append([toc, yr])
      else:
        # calculate average and store in dict
        # Issue: RuntimeWarning: invalid value encountered in true_divide
        # https://www.geeksforgeeks.org/how-to-fix-invalid-value-encountered-in-true_divide/
        m_sum = np.nansum(m_2d, axis=0)
        m_avg = np.divide(m_sum, n_pub)
        t_y_avg[toc][yr] = m_avg
      
      # output to log file
      f.write(f" {yr}:\n")
      for journal in not_found:
        f.write(f"  {journal}:\n" + \
                f"    ISSN:{not_found[journal][0]}\n" + \
                f"    PMID:{not_found[journal][1]}\n")

100%|██████████| 91/91 [00:09<00:00,  9.50it/s]


In [36]:
zero_pub

[[5, 1999],
 [5, 2000],
 [5, 2002],
 [5, 2003],
 [5, 2004],
 [5, 2006],
 [5, 2007],
 [5, 2008],
 [5, 2009],
 [5, 2011],
 [5, 2012],
 [5, 2015],
 [8, 2001],
 [8, 2004],
 [8, 2007],
 [8, 2008]]

#### Generate output

Build dataframes
- A dataframe for each metric: prank, sjr, hidx, cite

Use the dataframes to generate an xlsx with 8 sheets:
- Two sheets for each of Prank, SJR, Hidx, Cite
- Rows are topics, columns are years
- The two sheets for each metric are:
  - Original vs min-max normalized
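
As an illustration of the per-topic min-max normalization, here is a sketch on a made-up table (rows are topics, columns are years). It uses pandas row-wise operations rather than the list-based loop in the cell below, and relies on pandas skipping NaN in `min` and `max` by default.

```python
import numpy as np
import pandas as pd

# Hypothetical averages for two topics over three years
df_demo = pd.DataFrame(
  [[1.0, 2.0, 3.0], [4.0, np.nan, 8.0]],
  index=[0, 1], columns=[1999, 2000, 2001])

row_min = df_demo.min(axis=1)   # NaN is skipped by default
row_max = df_demo.max(axis=1)

# Subtract each row's minimum and divide by its range
df_norm = df_demo.sub(row_min, axis=0).div(row_max - row_min, axis=0)
print(df_norm)
```

Topic-years without data stay NaN, while the remaining years of that topic are still scaled to the 0 to 1 range.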

In [37]:
t_y_avg[0][1999]

array([  0.79302746,   1.40879412, 229.02941176,   2.77529412])

In [38]:
toc_excel_file    = work_dir / 'impact_topics.xlsx'
toc_excel_writer  = pd.ExcelWriter(toc_excel_file, engine='xlsxwriter')

metric_names = ['prank', 'sjr', 'hidx', 'cite']
for metric_idx in range(4):
  metric_nm      = metric_names[metric_idx]
  metric_2d      = [] # a 2D list: toc, then year
  metric_2d_norm = [] # a 2D list: toc, then year, normalized for each toc
  for toc in tocs:
    metric_toc = []
    for yr in yr_range:
      metric_toc.append(t_y_avg[toc][yr][metric_idx])
    metric_2d.append(metric_toc)

    # do min-max normalization; nan-aware because some topic-years have no
    # publications (see zero_pub)
    m_min = np.nanmin(metric_toc)
    m_max = np.nanmax(metric_toc)
    metric_toc_norm = [(m-m_min)/(m_max-m_min) for m in metric_toc]
    metric_2d_norm.append(metric_toc_norm)
  
  df_metric  = pd.DataFrame(metric_2d, index=tocs, columns=yr_range)
  df_metric2 = pd.DataFrame(metric_2d_norm, index=tocs, columns=yr_range)
  print(metric_nm, df_metric.shape)
  
  df_metric.to_excel(toc_excel_writer, sheet_name=metric_nm)
  df_metric2.to_excel(toc_excel_writer, sheet_name=metric_nm+"_norm")

toc_excel_writer.close()
  

prank (91, 22)
sjr (91, 22)
hidx (91, 22)
cite (91, 22)


## ___Testing___

### Check anomalies

In the spreadsheet, there are some strange patterns:
- cite_norm: 2015 is abnormally high compared to neighboring years.
- sjr_norm: 2019 is almost all 0, 2015 is abnormally high (though not as bad as cite_norm)

Looking at log_d_d_metric.txt, the issue is with cmax (the max cites/doc value), which in 2015 is about half of that in neighboring years.
- year [pmin, pmax, smin, smax, hmin, hmax, cmin, cmax]
- 2014	[1, 33044, 0.1, 47.751, 0, 1331, 0.0, 117.0]
- 2015	[1, 33787, 0.1, 35.501, 0, 1331, 0.0, 52.14]
- 2016	[1, 34401, 0.1, 43.002, 0, 1331, 0.0, 101.2]

#### Check avg and median cites/doc

Looks fine

In [39]:
def get_stat(year, field):
  sjr_file = sjr_dir / f"scimagojr_{year}.csv"
  sjr_df   = pd.read_csv(sjr_file, sep=";", low_memory=False)
  vals     = sjr_df[field].values

  #print(sjr_df.dtypes)

  # convert to float
  vals2 = []
  for v in vals:
    if type(v) == str:
      # See if there is a comma used as the decimal separator
      if v.find(",") == -1:
        vals2.append(float(v))
      else:
        vals2.append(float(v.replace(",", ".")))
    else:
      vals2.append(v)

  sjr_nan  = np.sum(np.isnan(vals2))
  sjr_med  = np.nanmedian(vals2)
  sjr_avg  = np.nanmean(vals2)

  print(f"{year}-{field}: med={sjr_med}, avg={sjr_avg}, n_nan={sjr_nan}")

In [40]:
get_stat(2014, "Cites / Doc. (2years)")
get_stat(2015, "Cites / Doc. (2years)")
get_stat(2016, "Cites / Doc. (2years)")

2014-Cites / Doc. (2years): med=0.45, avg=1.0064662268490496, n_nan=0
2015-Cites / Doc. (2years): med=0.47, avg=1.0252990795276289, n_nan=0
2016-Cites / Doc. (2years): med=0.51, avg=1.0648251504316735, n_nan=0


#### Check pubs in topic 0 for cites/doc

Ok, for toc=0, there is a large difference in mean and median between 2014, 2015, and 2016.
- These are based on normalized values, something wrong with normalization?
- Then why just 2015?

In [41]:
toc0_2014, nf_2014 = get_m_lst_lst(0, 2014)
toc0_2015, nf_2015 = get_m_lst_lst(0, 2015)
toc0_2016, nf_2016 = get_m_lst_lst(0, 2016)

In [42]:
toc0_2014[:2]

[[24383968,
  'Asian Pacific journal of allergy and immunology',
  ['0125877X'],
  [0.5194481823679986, 0.483, 39.0, 1.21],
  [[0.5194481823679986, 0.483, 39, 1.21]]],
 [24468678,
  'International immunopharmacology',
  ['15675769', '18781705'],
  [0.7741422240396723, 1.061, 126.0, 2.83],
  [[0.7741422240396723, 1.061, 126, 2.83],
   [0.7741422240396723, 1.061, 126, 2.83]]]]

In [43]:
# So journal not found is similar
len(nf_2014), len(nf_2015), len(nf_2016)

(1, 1, 0)

In [44]:
def get_cites_per_doc_list(m_lst_lst):
  cpd_list = [rec[3][3] for rec in m_lst_lst]
  cpd_nan  = np.sum(np.isnan(cpd_list))
  cpd_avg  = np.nanmean(cpd_list)
  cpd_med  = np.nanmedian(cpd_list)
  print(f"len:{len(cpd_list)},nan={cpd_nan},avg={cpd_avg},med={cpd_med}")


In [45]:
get_cites_per_doc_list(toc0_2014)
get_cites_per_doc_list(toc0_2015)
get_cites_per_doc_list(toc0_2016)

len:28,nan=0,avg=3.236428571428571,med=3.0549999999999997
len:31,nan=0,avg=3.766451612903225,med=3.51
len:21,nan=0,avg=2.8785714285714286,med=2.86


#### Check same journal across years

Problem: very few journals are the same across years. But for those found in at least two years
- The normalized values are generally 2x higher in 2015.
- E.g., PloS one [0.03170940170940171, 0.06731875719217491, nan]
  - Look at SJR record, Cites / Doc. (2years)
    - 2014: 3.71
    - 2015: 3.51
  - These are fine. So something is not right about how these values are normalized or populated.
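
A quick arithmetic check, assuming the normalized values came from min-max normalization with a yearly minimum of 0.0: dividing the raw PloS one cites/doc by each year's cmax from log_d_d_metric.txt reproduces the two normalized values listed above. The variable names here are for illustration only.

```python
# Raw PloS one cites/doc and yearly cmax, both taken from this notebook
cpd_raw  = {2014: 3.71, 2015: 3.51}
cmax_raw = {2014: 117.0, 2015: 52.14}

# Min-max normalization with a minimum of 0.0 reduces to value / max
cpd_norm = {yr: cpd_raw[yr] / cmax_raw[yr] for yr in cpd_raw}
for yr in sorted(cpd_norm):
  print(yr, cpd_norm[yr])
```

So the raw values barely differ between the two years, and the roughly 2x jump in 2015 comes from the smaller denominator.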

In [46]:
def build_j_cpd(j_cpd, m_lst_lst, yr):
  idx = 0
  if yr == 2015:
    idx = 1
  elif yr == 2016:
    idx = 2
  
  for m_lst in m_lst_lst:
    journal = m_lst[1]
    metric  = m_lst[3][3]
    if journal in j_cpd:
      j_cpd[journal][idx] = metric
    else:
      j_cpd[journal] = [np.nan, np.nan, np.nan]
      j_cpd[journal][idx] = metric

  return j_cpd

In [47]:
j_cpd = {} # {journal:[cpd2014, cpd2015, cpd2016]}
j_cpd = build_j_cpd(j_cpd, toc0_2014, 2014)
j_cpd = build_j_cpd(j_cpd, toc0_2015, 2015)
j_cpd = build_j_cpd(j_cpd, toc0_2016, 2016)

In [48]:
# Look at Plos one as an example
for j in j_cpd:
  if sum(np.isnan(j_cpd[j])) < 2:
    print(j, "\t", j_cpd[j])

Molecular nutrition &amp; food research 	 [4.91, nan, 4.54]
Clinical and experimental allergy : journal of the British Society for Allergy and Clinical Immunology 	 [4.78, nan, 4.69]
Allergy 	 [6.67, 6.77, nan]
Journal of agricultural and food chemistry 	 [3.28, 3.19, 3.48]
BMC plant biology 	 [4.4, nan, 4.31]
The Journal of allergy and clinical immunology 	 [7.18, 7.37, nan]
Journal of investigational allergology &amp; clinical immunology 	 [1.45, 1.27, nan]
PloS one 	 [3.71, 3.51, nan]
Scientific reports 	 [nan, 5.78, 4.65]
Biomolecular NMR assignments 	 [nan, 0.67, 0.4]


#### Check just Plos one

The original cites/doc values look normal in 2015, not off from other years
- Points to an issue with normalization again.

In [49]:
toc0_2014[0][1]

'Asian Pacific journal of allergy and immunology'

In [50]:
for rec in toc0_2014:
  if rec[1] == "PloS one":
    print(rec[3][3], rec[4][0][3])

3.71 3.71


In [51]:
for rec in toc0_2015:
  if rec[1] == "PloS one":
    print(rec[3][3], rec[4][0][3])

3.51 3.51
3.51 3.51
3.51 3.51
3.51 3.51
3.51 3.51


#### Check SJR data

In [52]:
found_issn = 0
not_found = []
for issns in df.Issn:
  issns = issns.strip().split(", ")
  found = 0
  for issn in issns:
    if issn in i_to_j:
      found_issn += 1
      found = 1
      break
  
  if found:
    found = 0
  else:
    not_found.append(issn)

print(f"Total ISSNs: {len(df)}, in Pubmed: {found_issn}")

Total ISSNs: 18024, in Pubmed: 10588


In [53]:
not_found[:10]

['09642536',
 '15734463',
 '10950761',
 '03044157',
 '00796379',
 '-',
 '08940347',
 '15740048',
 '13864181',
 '00653055']

#### Check SJR data calculation

Example:
- [pmid, journal, [issn], [Prank, SJR, H-idx, Cite/doc]]
- [9872323, 'Oncogene', ['09509232', '14765594'], [0.9908989489753978, 0.09022571303899402, 0.270473328324568, 0.13452256033578175]]

Nothing going on here.



In [54]:
sjr_file  = sjr_dir / f"scimagojr_1999.csv"
sjr_df    = pd.read_csv(sjr_file, sep=";", low_memory=False)
sjr_df[sjr_df["Title"] == "Oncogene"]

Unnamed: 0,Rank,Sourceid,Title,Type,Issn,SJR,SJR Best Quartile,H index,Total Docs. (1999),Total Docs. (3years),...,Total Cites (3years),Citable Docs. (3years),Cites / Doc. (2years),Ref. / Doc.,Country,Region,Publisher,Coverage,Categories,Areas
155,156,12523,Oncogene,journal,"14765594, 09509232",4649,Q1,360,924,2088,...,13818,2087,641,4954,United Kingdom,Western Europe,Nature Publishing Group,1987-2022,Cancer Research (Q1); Genetics (Q1); Molecular...,"Biochemistry, Genetics and Molecular Biology"


In [59]:
def get_val(journal, yr):
  m_vals = []
  # metric type
  m_types = ["Rank", "SJR", "H index", "Cites / Doc. (2years)"]
  for m_type in m_types:
    val = sjr_df[sjr_df["Title"] == journal][m_type].values[0]
    if type(val) == str and val.find(",") != -1:
      val = val.replace(",", ".")
    m_vals.append(float(val))

  print(m_vals)

  min_max = m_yr_min_max[yr]
  print(process_vals([min_max[0], min_max[1], m_vals[0]], rank_data=1))
  print(process_vals([min_max[2], min_max[3], m_vals[1]]))
  print(process_vals([min_max[4], min_max[5], m_vals[2]]))
  print(process_vals([min_max[6], min_max[7], m_vals[3]]))

In [60]:
get_val("Oncogene", 1999)

[156.0, 4.649, 360.0, 6.41]
([1.0, 0.9872205785667325, 0.0], 0.0, 1.0)
([0.1, 50.518, 4.649], 0.1, 50.518)
([0, 1331, 360.0], 0, 1331)
([0.0, 47.65, 6.41], 0.0, 47.65)


In [61]:
sjr_df[sjr_df["Title"] == "Plant Cell"]

Unnamed: 0,Rank,Sourceid,Title,Type,Issn,SJR,SJR Best Quartile,H index,Total Docs. (1999),Total Docs. (3years),...,Total Cites (3years),Citable Docs. (3years),Cites / Doc. (2years),Ref. / Doc.,Country,Region,Publisher,Coverage,Categories,Areas
52,53,16594,Plant Cell,journal,"1532298X, 10404651",8834,Q1,380,196,535,...,5620,522,994,5443,United States,Northern America,Oxford University Press,1989-2022,Cell Biology (Q1); Plant Science (Q1),Agricultural and Biological Sciences; Biochemi...


In [62]:
get_val("Plant Cell", 1999)

[53.0, 8.834, 380.0, 9.94]
([1.0, 0.9626201495194019, 0.0], 0.0, 1.0)
([0.1, 50.518, 8.834], 0.1, 50.518)
([0, 1331, 380.0], 0, 1331)
([0.0, 47.65, 9.94], 0.0, 47.65)


### Get topic=0, year=2001

In [65]:
len(tocs), len(yr_range), pdjit.head(2)

(91,
 22,
    PMID        Date                        Journal               ISSN  Topic  \
 0    61  1975-12-11  Biochimica et biophysica acta  00063002,18782434     52   
 1    67  1975-11-20  Biochimica et biophysica acta  00063002,18782434     48   
 
    Year  
 0  1975  
 1  1975  )

In [66]:
pdjit.loc[(pdjit['Topic']==0) & (pdjit['Year']==2001)]

Unnamed: 0,PMID,Date,Journal,ISSN,Topic,Year
48287,11139585,2001-01-15,The Journal of biological chemistry,"00219258,1083351X",0,2001
49696,11251404,2001-03-17,Allergy,"01054538,13989995",0,2001
49697,11251633,2001-03-17,Clinical and experimental allergy : journal of...,"09547894,13652222",0,2001
49698,11251634,2001-03-17,Clinical and experimental allergy : journal of...,"09547894,13652222",0,2001
49936,11270469,2001-03-29,Asian Pacific journal of allergy and immunology,0125877X,0,2001
50467,11306929,2001-04-18,International archives of allergy and immunology,"10182438,14230097",0,2001
50973,11344353,2001-05-10,The Journal of allergy and clinical immunology,"00916749,10976825",0,2001
51031,11350307,2001-05-15,Allergy,"01054538,13989995",0,2001
51907,11419719,2001-06-23,"Journal of chromatography. B, Biomedical scien...","13872273,18785603",0,2001
51908,11419723,2001-06-23,"Journal of chromatography. B, Biomedical scien...","13872273,18785603",0,2001


### Deal with nan

In [67]:
arr = np.array([5,4,2,2,4,np.nan,np.nan,6])
np.isnan(arr)

array([False, False, False, False, False,  True,  True, False])

In [68]:
arr[np.isnan(arr)]

array([nan, nan])