# __Step 3.1: PubMed vocab to get tf and tf-idf__

The goals for step 3 are to:
- Define the most prominent terms in plant biology overall
  - Based on Tf and Tf-idf
  - Based on science vocabs (oxford dictionary)
- Define the most prominent terms chronologically
  - Monthly
  - Year
  - Decade with step size of 1??

To do:
- Remove stop words, build vocab from corpus:
  - Get tf and tf-idf overall
    - Remove stop words
  - Get tf and tf-idf monthly
    - Remove stop words
- Remove stop words, build vocab using Oxford dictionaries (3.2)
  - Get tf and tf-idf overall 
  - Get tf and tf-idf monthly

Thoughts:
- After looking into Tf-based features:
  - There are simply too many non-scientific words. Not particularly helpful.
    - See if looking at only the top X helps.
  - The total features returned with max_df=0.8 and min_df=0.01 may be a bit constraining, resulting in only 2088 features.
    - Set min_df=0.001
    - Set max_features=2000
  - There are a number of features that are just numbers, some apparently are years.
    - Filter those out.
  - Some terms are synonyms.
    - May use semantic similarity to group them.
  - Given the above issues, Tf-idf is a better choice, particularly for dealing with the 1st problem. So focus on Tf-idf.
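
To make the last point concrete, here is a minimal pure-Python sketch of tf-idf weighting. It assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which are the documented defaults of scikit-learn's `TfidfVectorizer`. The toy corpus and the helper name `tfidf` are made up for illustration and are not part of this notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: c * idf[t] for t, c in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Hypothetical toy corpus: "study" is in every document, "auxin" in only one.
toy_docs = [
    "study auxin transport".split(),
    "study leaf growth".split(),
    "study root growth".split(),
]
weights = tfidf(toy_docs)
print(weights[0])
```

In the first toy document, "study" and "auxin" have the same raw count, but "auxin" ends up with the larger weight because it is rarer across documents. That is the effect hoped for with generic, non-scientific words in the real corpus.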

## ___Set up___

### Module import

In [1]:
import argparse
import json
import pickle
import pandas as pd
import numpy as np
import joblib
import sys
import os
from pathlib import Path

from nltk.corpus import stopwords
from sklearn import feature_extraction, feature_selection, metrics, \
                    model_selection

### Key variables

In [None]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "3_key_term_temporal/3_1_pubmed_vocab"
work_dir.mkdir(parents=True, exist_ok=True)

os.chdir(work_dir)

# specify plant science corpus
corpus_dir  = proj_dir / "2_text_classify/2_5_predict_pubmed"
corpus_file = corpus_dir / "corpus_plant_421658.tsv.gz"

# vectorizer parameters
max_features = 2000
ngram_range  = (1,3)
max_df       = 0.8
min_df       = 0.001
stop_words   = list(stopwords.words('english'))
lowercase    = True
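
As a rough illustration of how these parameters interact, the sketch below imitates vocabulary selection in plain Python. It makes several assumptions: stop words are removed before n-grams are built, fractional `min_df` and `max_df` are compared against document frequency, and `max_features` keeps the most frequent remaining terms. The helpers `ngrams` and `select_vocab` and the toy documents are hypothetical, and whitespace tokenization is a simplification of the vectorizer's default token pattern.

```python
from collections import Counter

def ngrams(tokens, ngram_range=(1, 3)):
    """Return all word n-grams with n in [lo, hi], joined by single spaces."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

def select_vocab(docs, stop_words, ngram_range, min_df, max_df, max_features):
    """Imitate vocabulary selection by document frequency and total count."""
    n_docs = len(docs)
    tf, df = Counter(), Counter()
    for text in docs:
        # Lowercase, split on whitespace, and drop stop words first.
        tokens = [t for t in text.lower().split() if t not in stop_words]
        grams = ngrams(tokens, ngram_range)
        tf.update(grams)       # total occurrences across the corpus
        df.update(set(grams))  # number of documents containing the term
    # Keep terms whose document frequency lies within the two thresholds.
    kept = [g for g in tf if min_df * n_docs <= df[g] <= max_df * n_docs]
    # Cap the vocabulary at the most frequent terms (ties broken by name).
    kept.sort(key=lambda g: (-tf[g], g))
    return sorted(kept[:max_features])

# Hypothetical toy documents, not taken from the corpus.
toy_docs = [
    "the plant cell wall",
    "the plant cell cycle",
    "the plant hormone auxin",
    "a rare term",
]
vocab = select_vocab(toy_docs, stop_words={"the", "a"}, ngram_range=(1, 2),
                     min_df=0.3, max_df=0.7, max_features=5)
print(vocab)
```

Here "plant" is dropped for being too common (3 of 4 documents exceeds `max_df=0.7`), while terms seen in a single document fall below `min_df=0.3`. Lowering `min_df` widens the candidate pool, and `max_features` then caps how many terms survive.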



## ___Set up vectorizers and transform corpus___

### Read corpus into dataframe

In [3]:
corpus_df = pd.read_csv(corpus_file, sep='\t', compression='gzip')
corpus_df.shape

(421658, 11)

In [4]:
corpus_df.head(2)

Unnamed: 0.1,Unnamed: 0,PMID,Date,Journal,Title,Abstract,QualifiedName,txt,reg_article,y_prob,y_pred
0,3,61,1975-12-11,Biochimica et biophysica acta,Identification of the 120 mus phase in the dec...,After a 500 mus laser flash a 120 mus phase in...,spinach,Identification of the 120 mus phase in the dec...,1,0.716394,1
1,4,67,1975-11-20,Biochimica et biophysica acta,Cholinesterases from plant tissues. VI. Prelim...,Enzymes capable of hydrolyzing esters of thioc...,plant,Cholinesterases from plant tissues. VI. Prelim...,1,0.894874,1


### Define X

In [None]:
# corpus
X = corpus_df['txt']

### Initialize vectorizers and fit/transform corpus

In [5]:
# term-frequency vectorizer
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
vec_TF = feature_extraction.text.CountVectorizer(max_features = max_features, 
                                                 ngram_range  = ngram_range,
                                                 stop_words   = stop_words,
                                                 max_df       = max_df,
                                                 min_df       = min_df,
                                                 lowercase    = lowercase)

In [6]:
# Fit term-frequency vectorizer: ~15 min
vec_TF.fit(X)

# Transform corpus: ~3 min
X_TF = vec_TF.transform(X)

# Get feature names
vocab_TF = vec_TF.vocabulary_
feat_TF  = vec_TF.get_feature_names_out()
type(vocab_TF), type(feat_TF)

# Convert to dataframe
X_TF_df  = pd.DataFrame(X_TF.toarray(), columns=feat_TF)

In [21]:
# Keep only non-numeric columns
feat_non_numeric = [feat for feat in feat_TF if not feat.isnumeric()]
X_TF_df = X_TF_df[feat_non_numeric]
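
One caveat with `str.isnumeric()` is that it returns False for any string containing a space, so an n-gram made up entirely of numbers (for example two consecutive years) would pass the filter above. Whether such features actually occur in this vocabulary is not checked here. The sketch below shows a token-wise variant; the helper name `is_numeric_feature` and the example features are hypothetical.

```python
def is_numeric_feature(feat):
    """True if every whitespace-separated token of the feature is numeric."""
    tokens = feat.split()
    return bool(tokens) and all(tok.isnumeric() for tok in tokens)

# Hypothetical feature names for illustration.
toy_feats = ["16s", "1975", "2010 2011", "abscisic acid", "10 μm"]

# Plain isnumeric() only catches the single-token number.
plain = [f for f in toy_feats if not f.isnumeric()]

# The token-wise check also drops the all-numeric bigram.
kept = [f for f in toy_feats if not is_numeric_feature(f)]
print(plain)
print(kept)
```

Mixed features such as "16s" or "10 μm" are kept by both versions, since at least one token is not purely numeric.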

In [23]:
X_TF_df['PMID'] = corpus_df['PMID']

In [24]:
X_TF_df.head(1)

Unnamed: 0,16s,aba,ability,abiotic,abiotic stress,abiotic stresses,able,abscisic,abscisic acid,abscisic acid aba,...,yielded,yields,young,zea,zea mays,zinc,zn,zone,μm,PMID
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,61


In [31]:
# These are saved with setting: max_feature=10000, min_df=0.01
with open(work_dir / 'corpus_txt_tf_sparse_matrix', 'wb') as f:
  pickle.dump(X_TF, f)
with open(work_dir / 'corpus_txt_tf_features', 'wb') as f:
  pickle.dump(feat_TF, f)

### Get Tf-idf vectorizer and fit/transform X

In [32]:
# tf-idf vectorizer    
vec_TI = feature_extraction.text.TfidfVectorizer(max_features = max_features, 
                                                 ngram_range  = ngram_range,
                                                 stop_words   = stop_words,
                                                 max_df       = max_df,
                                                 min_df       = min_df,
                                                 lowercase    = lowercase)

In [33]:
# Fit tf-idf vectorizer: ~15 min
vec_TI.fit(X)

In [None]:
# Transform corpus: ~3 min
X_TI = vec_TI.transform(X)

In [None]:
# Get feature names from the tf-idf vectorizer
feat_TI  = vec_TI.get_feature_names_out()

In [None]:
# Output sparse matrix and features
with open(work_dir / 'corpus_txt_tfidf_sparse_matrix', 'wb') as f:
  pickle.dump(X_TI, f)
with open(work_dir / 'corpus_txt_tfidf_features', 'wb') as f:
  pickle.dump(feat_TI, f)

In [None]:
# Convert the tf-idf matrix to a dataframe
X_TI_df  = pd.DataFrame(X_TI.toarray(), columns=feat_TI)

In [None]:
# Keep only non-numeric columns
feat_non_numeric = [feat for feat in feat_TI if not feat.isnumeric()]
X_TI_df = X_TI_df[feat_non_numeric]

## ___Overall term frequency___

### Tf total

Number of occurrences of a term across the corpus, regardless of whether it occurs multiple times in a document.
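
The distinction between total occurrences and the number of documents containing a term can be sketched as follows. This assumes "Tf total" means the corpus-wide sum of counts, which is one reading of the sentence above; the toy documents are made up.

```python
from collections import Counter

# Hypothetical tokenized documents for illustration.
toy_docs = [
    "auxin auxin transport".split(),
    "auxin signaling".split(),
    "root growth".split(),
]

# Total term frequency: every occurrence counts, including repeats in a doc.
tf_total = Counter(tok for doc in toy_docs for tok in doc)

# Document frequency: each document counts at most once per term.
doc_freq = Counter(tok for doc in toy_docs for tok in set(doc))

print(tf_total["auxin"], doc_freq["auxin"])
```

On the notebook's sparse count matrix, the analogous totals would presumably come from a column sum such as `X_TF.sum(axis=0)`, with document frequency obtained by counting the non-zero entries in each column instead.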