# __Step 2a: Bag of words and TF-idf models__

Contruct bag of word and Tf-idf models. Hyperparameters include:
- Feature extraction
  - Cleaned or not in step 1
  - `CountVectorizer`
    - `max_features`: try 1e4, 2e4, 5e4, 1e5
    - `ngram_range`: default [1,1], try also [1,2], [1,3]
    - `max_df`: default 1.0, try also 0.9, 0.7, 0.5
    - `min_df`: default 1, try also 2, 4, 8

### Setup

In [133]:
## For reproducibility
rand_state = 20220609

## for data
import json
import pandas as pd
import numpy as np
import joblib
from os import chdir
from pathlib import Path

## for bag-of-words
from sklearn import feature_extraction, feature_selection, metrics
from sklearn import model_selection
from xgboost import XGBClassifier

## for deep learning
from tensorflow import keras
from tensorflow.keras import models, layers, preprocessing
from tensorflow.keras import backend as K

## for bert language model
import transformers


In [16]:
# Set up working directory and corpus file location
proj_dir          = Path('/home/shius/projects/plant_sci_hist')
work_dir          = proj_dir / "2_text_classify"
corpus_combo_file = work_dir / "corpus_combo"

chdir(work_dir)

## ___Train/test split___

### Read data

The data are serialized as a json file from the last step in `script_text_preprocess.ipynb`.


In [17]:
# Load json file
with corpus_combo_file.open("r+") as f:
    corpus_combo_json = json.load(f)

# Convert json back to dataframe
corpus_combo = pd.read_json(corpus_combo_json)
corpus_combo.sample(5)

Unnamed: 0,PMID,Date,Journal,Title,Abstract,QualifiedName,txt,label,txt_clean
1418117,32535930,2020-06-15,The New phytologist,Allelic differences of clustered terpene synth...,Plant volatile emissions can recruit predators...,plant,Allelic differences of clustered terpene synth...,1,allelic difference clustered terpene synthases...
294489,10575286,1999-11-27,Forschende Komplementarmedizin,[Results of a standardized survey on the medic...,The plant Cannabis sativa has a long history o...,plant,[Results of a standardized survey on the medic...,0,result standardized survey medical use cannabi...
79180,2575122,1989-01-01,The Journal of experimental zoology. Supplemen...,Interspecific variation in sugar and amino aci...,Previous studies of cecal sugar and amino acid...,sage,Interspecific variation in sugar and amino aci...,0,interspecific variation sugar amino acid trans...
1219916,29643861,2018-04-13,Frontiers in plant science,"Targeted Sequencing by Gene Synteny, a New Str...",Sugarcane exhibits a complex genome mainly due...,sugarcane,"Targeted Sequencing by Gene Synteny, a New Str...",1,targeted sequencing gene synteny new strategy ...
504520,16631830,2006-04-25,Phytochemistry,"Immunosuppressive diacetylenes, ceramides and ...","Three C-17 diacetylenic compounds (1-3), one m...",hydrocotyle,"Immunosuppressive diacetylenes, ceramides and ...",1,immunosuppressive diacetylenes ceramides cereb...


### Split data

- Stratify based on labels
- Set random state: The setting will ensure two splits have the same instances.

In [18]:
# Original corpus
corpus_ori = corpus_combo[['label','txt']]
train_ori, test_ori = model_selection.train_test_split(corpus_ori, 
    test_size=0.2, stratify=corpus_ori['label'], random_state=rand_state)

# Cleaned corpus
corpus_cln = corpus_combo[['label','txt_clean']]
corpus_cln.rename(columns={'txt_clean': 'txt'}) # make column names consistent
train_cln, test_cln = model_selection.train_test_split(corpus_cln, 
    test_size=0.2, stratify=corpus_cln['label'], random_state=rand_state)

In [19]:
train_ori.head(5)

Unnamed: 0,label,txt
853778,0,"Update: Exertional hyponatremia, active compon..."
1165206,1,Stable megadalton TOC-TIC supercomplexes as ma...
1107511,0,Effects of population-related variation in pla...
513902,1,Water Relations in Pulvini from Samanea saman:...
177616,0,Inhibitory activities of substances present in...


In [20]:
train_cln.head(5)

Unnamed: 0,label,txt_clean
853778,0,update exertional hyponatremia active componen...
1165206,1,stable megadalton toctic supercomplexes major ...
1107511,0,effect populationrelated variation plant prima...
513902,1,water relation pulvini samanea saman ii effect...
177616,0,inhibitory activity substance present plant se...


In [24]:
train_ori['label'].value_counts(), test_ori['label'].value_counts()

(0    34658
 1    34658
 Name: label, dtype: int64,
 0    8665
 1    8665
 Name: label, dtype: int64)

In [25]:
train_cln['label'].value_counts(), test_cln['label'].value_counts()

(0    34658
 1    34658
 Name: label, dtype: int64,
 0    8665
 1    8665
 Name: label, dtype: int64)

## ___Bag of words model___

### Relevant functions

#### _Hyperparameters_

Run a testing XgBoost model in the section `Models for original text` and found that the test F1 is already 0.94. Not sure how much tuning on this is needed. So reduce the scope to focus on: 
- `ngram_range`, `stop_words` (want to see how stop word impact classification), and feature selection.

Thoughts:
- Cleaned or not in step 1.
- `CountVectorizer`
  - `max_features`: 1e4, 5e4, 1e5
    - Originally thought that it did not make sense to do this because at p<1e-2, there are only 8k features. But here the features are selected based on freuqencies. Some less frequency terms can still be only present in one class.
  - `ngram_range`: [1,1], [1,2], [1,3]
  - NOT TESTED: `max_df`: default 1.0, try also 0.9, 0.7, 0.5
  - NOT TESTED: `min_df`: default 1, try also 2, 4, 8
  - For original text
    - `stop_words`: None, try also `english`
- Feature selection
  - `p_threshold`: 1e-2, 1e-3, 1e-4, 1e-5


In [57]:
def get_hyperparameters(stopw=0):
    ''' Return a dictionary with hyperparameters
    Args
      stopw (int): whether to rid of english stopwords (1) or not (0)
    Return:
      hyperp (dict): a dictionary with hyperparameters
    '''
   
    param_grid = {"max_features": [1e4, 5e4, 1e5],
                  "ngram_range": [(1,1), (1,2), (1,3)],
                  "stop_words:": [None],
                  "p_threshold": [1e-2, 1e-3, 1e-4, 1e-5]}
    if stopw:
        param_grid["Stop_words"].append("english")
    
    return param_grid


#### _Feature extraction_

- Build a vocab with the number of words = `max_features`. 
- Consider: unigrams, bigrams, and trigrams.

In [60]:
def extract_feat(X_train, param=[], vocab=""):
    '''Extracting features as term frequencies
    Args:
      param (list): contains max_features, ngram_range, max_df, min_df,
        and stop_words
      X_train (pandas series): the txt column in the training data frame
    Returns:
      vectorizer (sklearn.feature_extraction.text.CountVectorizer) 
      X_train (pandas series): the transformed X_train
    '''
    # vectorizerd term frequencies
    if vocab == "":
      [max_features, ngram_range, max_df, min_df, stop_words] = param
      vectorizer = feature_extraction.text.CountVectorizer(
                              max_features = max_features, 
                              ngram_range  = ngram_range,
                              max_df       = max_df,
                              min_df       = min_df,
                              stop_words   = stop_words)
    else:
      vectorizer = feature_extraction.text.CountVectorizer(vocabulary=vocab)

    # fit the vectorizer with training corpus
    vectorizer.fit(X_train)

    # transform the training corpus
    X_train = vectorizer.transform(X_train)

    return vectorizer, X_train


#### _Feature selection_

With chi-square test

In [58]:
def select_feat(X_train, y_train, vectorizer, p_threshold):
    '''Select features based on chi-square test results
    Args:
      X_train (pandas series): the txt column in the training data frame
      y_train (pandas series): the label column in the training data frame
      vecorizer: fitted with original X_train and returned from get_vectorizer()
      p_threshold (float): p is derived from chi-square test. Features with p <= 
        p_threshold_s are selected.
    Return:
      X_names (list): names of selected features
    '''
    y            = y_train
    X_names      = vectorizer.get_feature_names_out()
    dtf_features = pd.DataFrame()
    for cat in np.unique(y):
        _, p = feature_selection.chi2(X_train, y==cat)
        dtf_features = pd.concat([dtf_features, 
                    pd.DataFrame({"feature":X_names, "p":p, "y":cat})])
        dtf_features = dtf_features.sort_values(
                    ["y","p"], ascending=[True,False])
        dtf_features = dtf_features[dtf_features["p"] <= p_threshold]
    
    X_names = dtf_features["feature"].unique().tolist()

    for cat in np.unique(y):
        print("{}:".format(cat), " selected features:",
                len(dtf_features[dtf_features["y"]==cat]))
        print(" top features:", ",".join(
                dtf_features[dtf_features["y"]==cat]["feature"].values[:10]))

    print('Total selected:', len(X_names))

    return X_names

#### _Model training, hyperparameter tuning, and cross-validation_

Do [hyperparameter grid search with XGBoost](https://www.kaggle.com/code/tilii7/hyperparameter-grid-search-with-xgboost/notebook)

In [None]:
def run_xgboost(X_train, y_train, rand_state):
    '''Do hyperparameter tuning and cross-validation of XgBoost models
    Args:
      X_train (pandas dataframe): features
      y_train (pandas series): labels
      rand_state (int): rand
    Return:
      rand_search (RandomizedSearchCV): fitted obj
    '''

    params = {'min_child_weight': [1, 5, 10],
              'gamma': [0.5, 1, 1.5, 2, 5],
              'subsample': [0.6, 0.8, 1.0],
              'colsample_bytree': [0.6, 0.8, 1.0],
              'max_depth': [3, 4, 5]}
    folds       = 5
    param_comb  = 5
    n_jobs      = 14

    # Initialize classifier
    # 06/11/2022: the silent parameter is deprecated, use verbosity=0
    xgb = XGBClassifier(learning_rate=0.02, 
                        n_estimators=600, 
                        objective='binary:logistic',
                        verbosity=1, 
                        nthread=1)

    # Initilize stratified k fold obj
    skf = model_selection.StratifiedKFold(
                        n_splits=folds, shuffle = True, random_state = rand_state)
    
    # initiate randomized search CV obj
    rand_search = model_selection.RandomizedSearchCV(
                        xgb                , param_distributions = params, 
                        n_iter = param_comb, scoring      = 'roc_auc', 
                        n_jobs = n_jobs    , cv = skf.split(X_train,y_train), 
                        verbose = 3        , random_state =rand_state)

    # Train
    rand_search.fit(X_train, y_train)

    return rand_search


### Models for original text

#### _Testing run_

In [72]:
# Get the training/testing corpus and labels
X_train_ori = train_ori['txt']
y_train_ori = train_ori['label']
X_test_ori  = test_ori['txt']
y_test_ori  = test_ori['label']

In [73]:
# Set parameters
param       = [10000, [1,1], 1.0, 1, None]

# Get vectorizer and fitted X_train
vectorizer, X_train_ori_vec = extract_feat(X_train_ori, param=param)
print("Train dim:", X_train_ori_vec.shape)

Train dim: (69316, 10000)


In [74]:
# Get selected feature names
p_threshold = 1e-4
X_names_ori = select_feat(X_train_ori_vec, y_train_ori, vectorizer, p_threshold)

0:  selected features: 8130
 top features: 210,appearing,exotic,application,su,stratification,antifungal,projected,depending,holds
1:  selected features: 8130
 top features: 210,appearing,exotic,application,su,stratification,antifungal,projected,depending,holds
Total selected: 8130


In [75]:
# Refit vectorizer with selected features and re-transform X_train_ori
vectorizer_sel, X_train_ori_vec_sel = extract_feat(X_train_ori, vocab=X_names_ori)
print("Train dim:", X_train_ori_vec_sel.shape)

Train dim: (69316, 8130)


In [76]:
# Also apply the refitted vecorizer to testing data
X_test_ori_vec_sel = vectorizer_sel.transform(X_test_ori)
print("Test dim:", X_test_ori_vec_sel.shape)

Test dim: (17330, 8130)


In [117]:
# Get xgboost model and cv results
rand_search = run_xgboost(X_train_ori_vec_sel, y_train_ori, rand_state)


Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV 3/5] END colsample_bytree=0.8, gamma=5, max_depth=3, min_child_weight=1, subsample=1.0;, score=0.979 total time= 2.6min
[CV 4/5] END colsample_bytree=0.8, gamma=5, max_depth=3, min_child_weight=1, subsample=1.0;, score=0.977 total time= 2.6min
[CV 1/5] END colsample_bytree=0.8, gamma=5, max_depth=3, min_child_weight=1, subsample=1.0;, score=0.980 total time= 2.6min
[CV 2/5] END colsample_bytree=0.8, gamma=5, max_depth=3, min_child_weight=1, subsample=1.0;, score=0.978 total time= 2.7min
[CV 1/5] END colsample_bytree=1.0, gamma=5, max_depth=4, min_child_weight=5, subsample=1.0;, score=0.983 total time= 4.0min
[CV 2/5] END colsample_bytree=1.0, gamma=5, max_depth=4, min_child_weight=5, subsample=1.0;, score=0.982 total time= 4.1min
[CV 3/5] END colsample_bytree=1.0, gamma=5, max_depth=4, min_child_weight=5, subsample=1.0;, score=0.982 total time= 4.1min
[CV 4/5] END colsample_bytree=1.0, gamma=5, max_depth=4, min_child_weigh

In [124]:
corpus_ori_best_est   = rand_search.best_estimator_
corpus_ori_best_param = rand_search.best_params_
corpus_ori_best_score = rand_search.best_score_
print("best auROC:", corpus_ori_best_score)
print("best parameters:", corpus_ori_best_param)

best auROC: 0.9844870097228213
best parameters: {'subsample': 1.0, 'min_child_weight': 10, 'max_depth': 5, 'gamma': 5, 'colsample_bytree': 1.0}


In [125]:
# Save the best model
model_ori_name = work_dir / f'model_txt_original_randomizedseachcv_.sav'
joblib.dump(corpus_ori_best_est, model_ori_name)

['/home/shius/projects/plant_sci_hist/2_text_classify/model_txt_original_randomizedseachcv_.sav']

In [126]:
# Load the saved model
corpus_ori_best_est_loaded = joblib.load(model_ori_name)

In [132]:
#help(corpus_ori_best_est_loaded)

In [134]:
y_pred_ori = corpus_ori_best_est_loaded.predict(X_test_ori_vec_sel)
report = metrics.classification_report(y_test_ori, y_pred_ori)

In [140]:
metrics.f1_score(y_test_ori, y_pred_ori)

0.9413039758142718

In [144]:
param = [10000.0, (1, 1), None, 0.01]
param_str  = \
    f"{int(param[0])}-{'|'.join(map(str,param[1]))}-{param[2]}-{param[3]}"
param_str

'10000-1|1-None-0.01'

__For term freuqency models, move to `script_text_classify_tf.py`.__