# __Step 1: data pre-processing__

The downloaded abstracts include many articles that are biomedical rather than plant science. This notebook has two components:

1. Identify a subset of abstracts to use as training data
2. Train a text classification model to distinguish plant and non-plant science texts.
   - For the second component, the code is based on:
     - [News dataset (Kaggle)](https://www.kaggle.com/rmisra/news-category-dataset)
     - A series of three Towards Data Science articles by Mauro Di Pietro: [1](https://towardsdatascience.com/text-analysis-feature-engineering-with-nlp-502d6ea9225d), [2](https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794), [3](https://towardsdatascience.com/text-classification-with-no-model-training-935fe0e42180).

### Setup

In [1]:
# My conda environment
# !conda activate tf

## for data
import json
import pandas as pd
import numpy as np
from pathlib import Path
from tqdm import tqdm
from numpy.random import randint
from os import chdir

## for plotting
import matplotlib.pyplot as plt
import seaborn as sns

## for processing
import re
import nltk

## for bag-of-words
from sklearn import feature_extraction, feature_selection, metrics, pipeline
from sklearn import model_selection, naive_bayes, manifold, preprocessing

## for explainer
from lime import lime_text

## for word embedding
import gensim
import gensim.downloader as gensim_api

## for deep learning
from tensorflow import keras
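from tensorflow.keras import models, layers
# Aliased so it does not shadow sklearn's `preprocessing` imported above
from tensorflow.keras import preprocessing as keras_preprocessing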
from tensorflow.keras import backend as K

## for bert language model
import transformers

#from nlp_utils import *



In [2]:
# Set up working directory and corpus file location
proj_dir         = Path('/home/shius/projects/plant_sci_hist')
corpus_dir       = proj_dir / "1_obtaining_corpus"
corpus_file_name = 'pubmed_qualified.tsv'
corpus_file      = corpus_dir / corpus_file_name

work_dir         = proj_dir / "2_text_classify"
chdir(work_dir)

## ___Identifying training data___

### Load data and pre-processing

- Starting records: 1497511
- After removing duplicates: 1475989
- Create a new column 'txt' by concatenating 'Title' and 'Abstract'
- After removing records with missing values (mainly a missing title or abstract): 1385417

In [3]:
corpus_df_raw = pd.read_csv(corpus_file, delimiter='\t')
corpus_df_raw.shape

(1497511, 6)

In [4]:
# Count # of duplicates
corpus_df_raw.duplicated().value_counts()

False    1475989
True       21522
dtype: int64

In [5]:
# Drop duplicated rows; copy so that adding a column later does not
# trigger SettingWithCopyWarning
corpus_df = corpus_df_raw.drop_duplicates().copy()
corpus_df.shape

(1475989, 6)

In [6]:
# Create a new column 'txt' by concatenating 'Title' and 'Abstract'
corpus_df['txt'] = corpus_df['Title'] + ". " + corpus_df['Abstract']


In [7]:
# Deal with NA
print("Title NAN:", corpus_df['Title'].isnull().sum())
print("Abstract NAN:", corpus_df['Abstract'].isnull().sum())
print("Txt NAN:", corpus_df['txt'].isnull().sum())

Title NAN: 142
Abstract NAN: 90428
Txt NAN: 90570


In [8]:
# Drop all records with NAs
corpus_df = corpus_df.dropna(axis=0)
corpus_df.shape

(1385417, 7)
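
The three steps above (de-duplication, concatenation, NA removal) can be checked on a small frame. The sketch below uses made-up rows, not corpus records, and assumes only the `Title` and `Abstract` column names from the notebook. It shows why the `txt` NA count equals the sum of the title and abstract NA counts: string concatenation propagates missing values.

```python
import numpy as np
import pandas as pd

# Toy records (hypothetical, not taken from the corpus)
toy = pd.DataFrame({
    "Title":    ["A", "A", "B", np.nan, "D"],
    "Abstract": ["a", "a", np.nan, "c", "d"],
})

# Drop exact duplicate rows; .copy() gives an independent frame to modify
toy = toy.drop_duplicates().copy()

# Concatenation propagates NaN: a missing title or abstract
# yields a missing 'txt'
toy["txt"] = toy["Title"] + ". " + toy["Abstract"]

# Drop rows with a missing value in any column
toy_clean = toy.dropna(axis=0)
print(toy_clean)
```

Only the rows with both a title and an abstract survive the last step.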

In [9]:
# Write the processed corpus into a file: decide to write later
#corpus_df.to_csv(work_dir / (corpus_file_name + ".noredun_nona"), sep='\t')

### Get a list of journals

In [10]:
corpus_df.head(3)

| | PMID | Date | Journal | Title | Abstract | QualifiedName | txt |
|---|---|---|---|---|---|---|---|
| 0 | 36 | 1975-11-01 | The British journal of nutrition | The effects of processing of barley-based supp... | 1. In one experiment the effect on rumen pH of... | barley | The effects of processing of barley-based supp... |
| 1 | 52 | 1975-12-02 | Biochemistry | Evidence of the involvement of a 50S ribosomal... | The functional role of the Bacillus stearother... | rose | Evidence of the involvement of a 50S ribosomal... |
| 2 | 60 | 1975-12-11 | Biochimica et biophysica acta | The reaction between the superoxide anion radi... | 1. The superoxide anion radical (O2-) reacts w... | tuna | The reaction between the superoxide anion radi... |


In [11]:
# This is a pandas series
journal_counts = corpus_df['Journal'].value_counts()
journal_counts.to_csv(work_dir / 'out_raw_journal_counts')

### Define positive and negative set journals

- Number negative journals : 7360
- Number negative instances: 43323
- Number positive journals : 17
- Number positive instances: 98937

In [12]:
positives = ['Plant physiology', 'Frontiers in plant science', 'Planta',
             'The Plant journal : for cell and molecular biology', 
             'Journal of experimental botany', 'Plant molecular biology',
             'The New phytologist', 'The Plant cell', 'Phytochemistry',
             'Plant &amp; cell physiology', 'American journal of botany',
             'Annals of botany', 'BMC plant biology', 'Tree physiology',
             'Molecular plant-microbe interactions : MPMI',
             '"Plant biology (Stuttgart, Germany)"', 
             'Plant biotechnology journal']

# Any journal with <=20 papers will be regarded as negative
negative_threshold = 20 

# Journal names to exclude: decide not to use these, concerned that connection
# between plant science and medical science may be eliminated.
negative_keywords  = ['medicine', 'pharmaceutical', 'pharmacological', 
                      'psychiatry']

In [13]:
# Count totals for the negative and positive sets
total_negative = 0
total_positive = 0
num_negative_journals = 0
for journal, count in journal_counts.items():
    if count <= negative_threshold:
        total_negative += count
        num_negative_journals += 1
    elif journal in positives:
        total_positive += count

print("Number negative journals :", num_negative_journals)
print("Number negative instances:", total_negative)
print("Number positive journals :", len(positives))
print("Number positive instances:", total_positive)

Number negative journals : 7360
Number negative instances: 43323
Number positive journals : 17
Number positive instances: 98937


In [14]:
# Proportion of articles to subsample from each positive journal
prop_to_sample = total_negative/total_positive
print(prop_to_sample)

0.4378847145152976
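
The same totals can also be computed without an explicit loop. The following is a minimal sketch on made-up journal counts (the numbers and the `Journal A/B/C` names are hypothetical); it mirrors the loop's logic, where a journal at or below the threshold counts as negative even if it appears in `positives`.

```python
import pandas as pd

# Toy journal counts (hypothetical numbers, not from the corpus)
journal_counts = pd.Series(
    {"Journal A": 3, "Journal B": 20, "Plant physiology": 100,
     "Planta": 60, "Journal C": 500})
positives = ["Plant physiology", "Planta"]
negative_threshold = 20

# Negative set: journals with at most `negative_threshold` papers
is_negative = journal_counts <= negative_threshold
# Positive set: listed journals that are not already negative
is_positive = ~is_negative & journal_counts.index.isin(positives)

total_negative = int(journal_counts[is_negative].sum())
total_positive = int(journal_counts[is_positive].sum())

# Fraction of each positive journal to sample so that classes are balanced
prop_to_sample = total_negative / total_positive
print(int(is_negative.sum()), total_negative, total_positive, prop_to_sample)
```

Journals that are neither small nor in `positives` (here `Journal C`) are left out of both sets, as in the loop above.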


### Construct a dataframe with positive and negative examples

In [15]:
# This step is not useful downstream, just playing
journal_counts = journal_counts.sort_values()
journal_counts.head(3), journal_counts.tail(3)

(Neurotrauma reports                                                     1
 Brain and neuroscience advances                                         1
 Wiley interdisciplinary reviews. Data mining and knowledge discovery    1
 Name: Journal, dtype: int64,
 Journal of agricultural and food chemistry    15942
 Plant physiology                              21236
 PloS one                                      24095
 Name: Journal, dtype: int64)

In [16]:
# Collect positive or negative examples for each journal as a dataframe,
# then put the dataframes into a list for concatenation later.
df_positve  = []
df_negative = []

positives_sampled = 0                   # Keeping track how many are sampled
journal_items = journal_counts.items()  # so the content is iterable

# The total is for tqdm to show a progress bar
for journal, count in tqdm(journal_items, total=len(journal_counts)):
    # Specify a subset dataframe for a journal
    subset = corpus_df.loc[corpus_df["Journal"] == journal]

    # Negative example
    if count <= negative_threshold:
        df_negative.append(subset)
    
    # Positive examples
    elif journal in positives:
        # Plus 1 to round things up.
        num_to_sample = int(count*prop_to_sample) + 1
        positives_sampled += num_to_sample
        subset = subset.sample(n=num_to_sample)
        df_positve.append(subset)

print("Positives sampled:", positives_sampled)

100%|██████████| 12457/12457 [13:50<00:00, 15.00it/s] 

Positives sampled: 43329
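
Because the per-journal sample size is rounded up with `+ 1`, the number of positives sampled slightly exceeds the number of negatives (43329 vs. 43323 in the run above). The sketch below reproduces this on made-up records and then trims the overshoot. The toy frame and the `random_state` arguments are assumptions added for reproducibility; the notebook itself does not set a seed.

```python
import pandas as pd

# Toy positive-journal records (hypothetical, not from the corpus)
toy = pd.DataFrame({
    "Journal": ["Planta"] * 10 + ["Plant physiology"] * 7,
    "PMID": range(17),
})
total_negative = 5                      # hypothetical size of the negative set
prop_to_sample = total_negative / len(toy)

# Per-journal sample size, rounded up as in the cell above
sampled = [
    group.sample(n=int(len(group) * prop_to_sample) + 1, random_state=0)
    for _, group in toy.groupby("Journal")
]
positives_sampled = sum(len(s) for s in sampled)
print("Positives sampled:", positives_sampled)

# Trim the overshoot so both classes have the same size
corpus_pos = pd.concat(sampled)
if positives_sampled > total_negative:
    corpus_pos = corpus_pos.sample(n=total_negative, random_state=0)
print(corpus_pos.shape)
```

Grouping by journal also avoids filtering the full dataframe once per journal, which is what makes the loop above slow.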





In [17]:
corpus_pos = pd.concat(df_positve)
corpus_pos['label'] = [1]*corpus_pos.shape[0]

# Subsample because rounding up yields slightly more positives than negatives
if positives_sampled > total_negative:
    corpus_pos = corpus_pos.sample(n=total_negative)
corpus_pos.shape

(43323, 8)

In [18]:
corpus_neg = pd.concat(df_negative)
corpus_neg['label'] = [0]*corpus_neg.shape[0]
corpus_neg.shape

(43323, 8)

In [19]:
# Concatenate positives and negatives
corpus_combo = pd.concat([corpus_pos, corpus_neg])
corpus_combo.shape

(86646, 8)

## ___Pre-processing___


### Apply pre-processing function

- Lowercase
- Remove characters that are neither alphanumeric nor whitespace (a few symbols such as parentheses and commas are kept)
- Stopword removal
- Lemmatisation

In [20]:
# Function based on Mauro Di Pietro (2020):
#  https://towardsdatascience.com/text-classification-with-no-model-training-935fe0e42180
def utils_preprocess_text(text, lst_stopwords, flg_stemm=False, flg_lemm=True):
    '''
    Preprocess a string.
    :parameter
        :param text: string - text to be cleaned
        :param lst_stopwords: list - list of stopwords to remove
        :param flg_stemm: bool - whether stemming is to be applied
        :param flg_lemm: bool - whether lemmatisation is to be applied
    :return
        cleaned text
    '''
    ## clean: lowercasing, stripping, then removing punctuations
    text = str(text).lower().strip()
    
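    # RE: replace any character that is not a word character (alphanumeric or
    #  underscore) or whitespace with ''. Biological terms also contain
    #  special characters such as Greek letters, parentheses, and ",", so
    #  these are kept. Note that ")-," is read as the range ) * + , so
    #  hyphens are removed (e.g., "pathogenesis-related" becomes
    #  "pathogenesisrelated").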
    text = re.sub(r'[^\w\s(α-ωΑ-Ω)-,]', '', text)

    ## Tokenize (convert from string to list)
    lst_text = text.split()    
    
    ## remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in 
                    lst_stopwords]
                
    ## Stemming (remove -ing, -ly, ...)
    if flg_stemm:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]
                
    ## Lemmatisation (convert the word into root word)
    if flg_lemm:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]
            
    ## back to string from list
    text = " ".join(lst_text)
    return text
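
The behaviour of the character class in `utils_preprocess_text` can be checked in isolation. In the sketch below, the first pattern is the one used in the function; inside the brackets, `)-,` is parsed as a character range rather than three literals, so the hyphen is not among the kept characters. The second pattern, `pattern_keep_dash`, is a hypothetical variant (not used in this notebook) that places `-` last in the class so that it is treated literally. The sample string is made up.

```python
import re

# Pattern used in utils_preprocess_text
pattern = r'[^\w\s(α-ωΑ-Ω)-,]'
text = "pathogenesis-related β-1,3-glucanase (pr-2) 50%"

cleaned = re.sub(pattern, '', text)
print(cleaned)

# Hypothetical variant: '-' placed last in the class is a literal hyphen.
# Greek letters are already covered by \w for str patterns in Python 3.
pattern_keep_dash = r'[^\w\s(),-]'
cleaned_keep_dash = re.sub(pattern_keep_dash, '', text)
print(cleaned_keep_dash)
```

Switching to the variant would change `txt_clean` for every hyphenated term, so any model trained on the current output would need to be retrained.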

In [21]:
tqdm.pandas(desc="Clean text")
lst_stopwords = nltk.corpus.stopwords.words("english")

corpus_combo["txt_clean"] = corpus_combo["txt"].progress_apply(lambda x: 
    utils_preprocess_text(x, lst_stopwords))
corpus_combo.sample(5)

Clean text: 100%|██████████| 86646/86646 [01:20<00:00, 1073.93it/s]


| | PMID | Date | Journal | Title | Abstract | QualifiedName | txt | label | txt_clean |
|---|---|---|---|---|---|---|---|---|---|
| 655796 | 19574514 | 2009-07-04 | Academic psychiatry : the journal of the Ameri... | The perceptions and habits of alcohol consumpt... | The authors aim to quantify the extent, and to... | tobacco | The perceptions and habits of alcohol consumpt... | 0 | perception habit alcohol consumption smoking a... |
| 514872 | 16666646 | 1989-03-01 | Plant physiology | Identification of Several Pathogenesis-Related... | Inoculation of tomato (Lycopersicon esculentum... | plant | Identification of Several Pathogenesis-Related... | 1 | identification several pathogenesisrelated pro... |
| 1446818 | 32954081 | 2020-09-22 | Kidney international reports | Creatinine Fluctuations Forecast Cross-Harvest... | Chronic kidney disease of unknown origin (CKDu... | sugarcane | Creatinine Fluctuations Forecast Cross-Harvest... | 0 | creatinine fluctuation forecast crossharvest k... |
| 858093 | 23613273 | 2013-04-25 | Plant physiology | Gene discovery of modular diterpene metabolism... | Plants produce over 10,000 different diterpene... | plants | Gene discovery of modular diterpene metabolism... | 1 | gene discovery modular diterpene metabolism no... |
| 29574 | 1234701 | 1975-01-01 | Acta physiologica latino americana | Effect of ouabain on the renal response to fur... | The aim of this work was to investigate whethe... | ouabain | Effect of ouabain on the renal response to fur... | 0 | effect ouabain renal response furosemide aim w... |


### Output the pre-processed data

In [22]:
# 6/10/22 Shiu
# Tried the following but it did not work: got FileNotFoundError. It turned out
# that Path does not expand "~" (e.g., ~/blah). Changed work_dir to an absolute
# path and it works.
corpus_combo_file = work_dir / 'corpus_combo'
corpus_combo_json = corpus_combo.to_json()
with corpus_combo_file.open("w+") as f:
    json.dump(corpus_combo_json, f)
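
Note that `to_json()` already returns a JSON string, so `json.dump()` encodes it a second time and the file holds a JSON-encoded string rather than a JSON object. The reading side is not part of this notebook; the sketch below is an assumption of how such a file could be loaded back, shown on a made-up two-row frame written to a temporary directory instead of `work_dir`.

```python
import json
from io import StringIO
from pathlib import Path
from tempfile import TemporaryDirectory

import pandas as pd

# Toy frame (hypothetical, not from the corpus)
toy = pd.DataFrame({"label": [1, 0], "txt": ["plant text", "other text"]})

with TemporaryDirectory() as tmp:
    toy_file = Path(tmp) / "corpus_combo"   # temporary stand-in for work_dir

    # Write: to_json() returns a str, which json.dump() encodes once more
    with toy_file.open("w") as f:
        json.dump(toy.to_json(), f)

    # Read: json.load() undoes the outer encoding, read_json() the inner one
    with toy_file.open("r") as f:
        toy_json = json.load(f)
    toy_back = pd.read_json(StringIO(toy_json))

print(toy_back)
```

Writing with `corpus_combo.to_json(path)` directly would avoid the double encoding, but any reader that expects the current format would then need to change as well.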

Continue with text classification with `script_text_classify.ipynb`.

# __DEPRECATED__

### Creating a dataframe with only the target columns

In [23]:
target_column   = ["label","txt"]
label_txt = corpus_combo[target_column]
label_txt.shape

(86646, 2)

In [24]:
label_txt.sample(10)

| | label | txt |
|---|---|---|
| 344282 | 0 | From tobacco to health care and beyond--a crit... |
| 170289 | 0 | [Optimization against the occurrence of termin... |
| 824387 | 1 | Biphenyl-type neolignans from Magnolia officin... |
| 1127939 | 1 | Seed-specific transcription factor HSFA9 links... |
| 1190421 | 1 | Ethylene Receptors Signal via a Noncanonical P... |
| 268714 | 1 | Resistance gene candidates identified by PCR w... |
| 88586 | 0 | MSH/ACTH 4-10 and aging effects on delayed res... |
| 1205665 | 0 | Environmental metabolomics with data science f... |
| 916935 | 0 | High HIV prevalence and associated factors in ... |
| 202941 | 1 | Expression of ferredoxin-dependent glutamate s... |
