# __Step 4.4b: Retrieve specific records__

With the topic assignment done, a new dataframe was created to include the assignment info in the dataframe containing date, journal, and corpus info. Goals here are to:
- Get records in a specific time frame for interpretation.

## ___Set up___

### Module import

In [54]:
import pickle, gzip
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime

### Key variables

In [57]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "4_topic_model/4_4_over_time"
work_dir.mkdir(parents=True, exist_ok=True)

# modified corpus file
dir42       = proj_dir / "4_topic_model/4_2_outlier_assign"
corpus_file = dir42 / "table4_2_corpus_with_topic_assignment.tsv.gz"
# saved model
topic_model_file = dir42 / "topic_model_updated"

# for writing dataframe of a specific topic and duration
toc_dur_dir = work_dir / "_select_topic_duration"
toc_dur_dir.mkdir(parents=True, exist_ok=True)

ctfidf_dir = work_dir / "ctfidf_over_time"

## ___Corpus dataframe___

### Read dataframe tsv

In [9]:
#https://stackoverflow.com/questions/28200404/pandas-read-table-use-first-column-as-index
corpus_df = pd.read_csv(corpus_file, sep='\t', index_col=0, compression='gzip')

In [15]:
corpus_df.head(2)

Unnamed: 0,Index_1385417,PMID,Date,Journal,Title,Abstract,Initial filter qualifier,Corpus,reg_article,Text classification score,Preprocessed corpus,Topic
0,3,61,1975-12-11,Biochimica et biophysica acta,Identification of the 120 mus phase in the dec...,After a 500 mus laser flash a 120 mus phase in...,spinach,Identification of the 120 mus phase in the dec...,1,0.716394,identification 120 mus phase decay delayed flu...,52
1,4,67,1975-11-20,Biochimica et biophysica acta,Cholinesterases from plant tissues. VI. Prelim...,Enzymes capable of hydrolyzing esters of thioc...,plant,Cholinesterases from plant tissues. VI. Prelim...,1,0.894874,cholinesterases plant tissues . vi . prelimina...,48


In [16]:
corpus_df.shape

(421658, 12)

### Get timestamp

In [13]:
# Turn all dates into timestamps 
dates      = corpus_df['Date'].values
timestamps = []
for date in dates:
  [yr, mo, da] = date.split('-') # year, month, day
  dt   = datetime(int(yr), int(mo), int(da))
  ts   = dt.timestamp()
  timestamps.append(ts)
len(timestamps)

421658

In [17]:
### Add timestamp column to corpus_df
corpus_df["Timestamps"] = timestamps

In [18]:
corpus_df.head(1)

Unnamed: 0,Index_1385417,PMID,Date,Journal,Title,Abstract,Initial filter qualifier,Corpus,reg_article,Text classification score,Preprocessed corpus,Topic,Timestamps
0,3,61,1975-12-11,Biochimica et biophysica acta,Identification of the 120 mus phase in the dec...,After a 500 mus laser flash a 120 mus phase in...,spinach,Identification of the 120 mus phase in the dec...,1,0.716394,identification 120 mus phase decay delayed flu...,52,187506000.0


## ___Retrieval records for a specific topic___

### Testing

In [28]:
topic = 61
corpus_df_topic = corpus_df[corpus_df['Topic'] == topic].sort_values('Timestamps')
corpus_df_topic.shape

(16183, 13)

In [32]:
# passing date string
date_start = "1900-1-1"
date_end   = "1980-1-1"

# get year, month, and day strings
y_sta, m_sta, d_sta = date_start.split("-")
y_end, y_end, y_end = date_end.split("-")

# get timestamps
ts_start   = datetime(int(y_sta), int(m_sta), int(d_sta)).timestamp()
ts_end     = datetime(int(y_end), int(y_end), int(y_end)).timestamp()

In [35]:
# For multiple conditions: cannnot use and/or need &/|. Expect 385 records.
corpus_df_topic_dur = corpus_df_topic.loc[
                          (corpus_df_topic['Timestamps'] >= ts_start) &
                          (corpus_df_topic['Timestamps'] < ts_end)]
corpus_df_topic_dur.shape

(385, 13)

In [82]:
corpus_df_topic_dur.head(2)

Unnamed: 0,Index_1385417,PMID,Date,Journal,Title,Abstract,Initial filter qualifier,Corpus,reg_article,Text classification score,Preprocessed corpus,Topic,Timestamps
128650,568762,17732689,1960-01-22,"Science (New York, N.Y.)",Relation of Antigens of Melampsora lini and Li...,A specific antigen was found in each of four r...,flax,Relation of Antigens of Melampsora lini and Li...,1,0.812838,relation antigens melampsora lini linum usitat...,61,-313786800.0
73981,404936,13718505,1960-11-04,"Science (New York, N.Y.)",Effects of supernumerary chromosomes on produc...,One of the two types of supernumerary chromoso...,haplopappus,Effects of supernumerary chromosomes on produc...,1,0.771993,effects supernumerary chromosomes production p...,61,-288990000.0


In [39]:
tsv_file = toc_dur_dir / f"corpus_df_{topic}_{date_start}_{date_end}.tsv"

In [42]:
corpus_df_topic_dur.to_csv(tsv_file, sep='\t')

In [44]:
# search for specific term
term = "qtl"
txts  = corpus_df_topic_dur["Preprocessed corpus"].values
for txt in txts:
  if term in txt:
    print(txt[(txt.find(term)-20) : (txt.find(term)+len(term)+20)])

### Set up function

In [88]:
def get_corpus_df_topic_dur(topic, date_start, date_end):
  '''Get corpus_df for a specific topic within a specific time frame
  Args
    topic (int): topic number
    date_start (str): starting date of the timeframe, year-month-day, inclusive
    date_end (str): ending date of the time frame
  Return
    corpus_df_topic_dur (DataFrame): corpus df of a specific topic/duration
  Output
    corpus_df_topic_dur (csv)
  '''
  corpus_df_topic = corpus_df[
                        corpus_df['Topic'] == topic].sort_values('Timestamps')

  # get year, month, and day strings
  y_sta, m_sta, d_sta = date_start.split("-")
  y_end, m_end, d_end = date_end.split("-")

  # get timestamps
  ts_start   = datetime(int(y_sta), int(m_sta), int(d_sta)).timestamp()
  ts_end     = datetime(int(y_end), int(m_end), int(d_end)).timestamp()
  corpus_df_topic_dur = corpus_df_topic.loc[
                            (corpus_df_topic['Timestamps'] >= ts_start) &
                            (corpus_df_topic['Timestamps'] < ts_end)]
  tsv_file = toc_dur_dir / f"corpus_df_{topic}_{date_start}_{date_end}.tsv"
  corpus_df_topic_dur.to_csv(tsv_file, sep='\t')

  return corpus_df_topic_dur


In [97]:
def find_term(corpus_df, term):
  '''Search for specific term
  Args
    corpus_df (DataFrame): returned from get_corpus_df_topic_dur()
    term (str): term to find
  '''
  txts  = corpus_df_topic_dur["Preprocessed corpus"].values
  pmids = corpus_df_topic_dur["PMID"].values
  for idx, txt in enumerate(txts):
    tlist = txt.split(" ")
    if term in tlist:
      print(pmids[idx], tlist[(tlist.index(term)-3) : (tlist.index(term)+4)])

### Topic 44. Light, CO2, photosynthesis

In [101]:
corpus_df_topic_dur44 = get_corpus_df_topic_dur(44, 
                                                "1917-1-1",
                                                "1978-1-1")

In [104]:
find_term(corpus_df_topic_dur44, "co")

24425354 ['(', '60', ')', 'co', 'gamma', 'source', '.']
24317906 ['.', ')', 'clone', 'co', '527.', 'vigorous', 'fast']


### Topic 45. Hormone

In [None]:
corpus_df_topic_dur45 = get_corpus_df_topic_dur(45, 
                                                "1917-1-1",
                                                "1978-1-1")

In [None]:
find_term(corpus_df_topic_dur44, "co")

24425354 ['(', '60', ')', 'co', 'gamma', 'source', '.']
24317906 ['.', ')', 'clone', 'co', '527.', 'vigorous', 'fast']


# __Testing__

## ___Check topic over time correctness___

8/25/22: Spot check topic 61 where the term "qtl" is among the top term in the 1st time bin.
- But going back to the corpus for topic 61, the term is not present in any records.an
- Re-run script4_4 with `global_tuning` set to `False`.

### Load c-Tf-Idf for 1st time bin and the word file

In [58]:
with open(ctfidf_dir / "ctfidf_0.pickle", 'rb') as f:
  ctfidf_0 = pickle.load(f)
ctfidf_0.shape

(78, 18850331)

In [60]:
word_file = ctfidf_dir / f"word_list_18850331.pickle"
with open(word_file, 'rb') as f:
  words = pickle.load(f)

### Check c-Tf-Idf value for qtl

In [76]:
index_qtl = words.index("qtl")
index_qtl

14023797

In [62]:
# Topic 61 has index=50 in the ctfidf file
# see script_4_4, "Test run the 1st time bin with manual topic_over_time"
ctfidf_0[50, index_qtl]

0.0009095433190337633

### Check top terms

In [74]:
ctfidf_0_toc61 = ctfidf_0[50,].toarray()[0]
ctfidf_0_toc61.shape

(18850331,)

In [79]:
top_x   = 10
top_idx = np.argpartition(ctfidf_0_toc61, -top_x)[-top_x:]

# So qtl is among the top 10 terms for time bin 0, even though its starting
# c-tf-idf is 0.
index_qtl in top_idx

True