# Data Pre-processing
Author: Sean Flannery [sflanner@purdue.edu](mailto:sflanner@purdue.edu)

Last Updated: June 13th, 2019

This notebook was developed to satisfy the modeling needs of work with 
Professor Daisuke Kihara [dkihara@purdue.edu](mailto:dkihara@purdue.edu).
### Description
This notebook shows how to preprocess text data for eventual use with Natural Language Processing models like Doc2Vec.

**Libraries Needed:** 
[pandas](https://pandas.pydata.org/pandas-docs/stable/install.html), 
[numpy](https://www.numpy.org), 
[gensim](https://radimrehurek.com/gensim/install.html), 
[scikit-learn](https://scikit-learn.org/stable/install.html),
[pattern](https://github.com/clips/pattern),
[nltk](https://www.nltk.org)

In [1]:
# Standard imports
import random
random.seed(42)
import collections
import pandas as pd
import numpy as np
np.random.seed(42)
# Document similarity
import gensim
import gensim.parsing.preprocessing as gpp

#### Specification 
We want to preprocess the text data in `nar-complete-data.{csv,json}` so that we may generate useful embeddings in a later module. 

In [2]:
db = pd.read_csv('complete-article-data.csv')
db = db.reset_index(drop=True)
db.head()

| Unnamed: 0 | year | article-link | local-path | title | abstract | authors | introduction |
|---|---|---|---|---|---|---|---|
| 0 | 2019 | https://doi.org/10.1093/nar/gky993 | articles/2019/1-NAR.html | Database Resources of the BIG Data Center in 2019 | The BIG Data Center at Beijing Institute of Ge... | ['BIG Data Center Members'] | The BIG Data Center (http://bigd.big.ac.cn) at... |
| 1 | 2019 | https://doi.org/10.1093/nar/gky1124 | articles/2019/2-NAR.html | The European Bioinformatics Institute in 2018:... | The European Bioinformatics Institute (https:/... | ['Charles E Cook', 'Rodrigo Lopez', 'Oana Stro... | A primary mission of EMBL-EBI is to collect, o... |
| 2 | 2019 | https://doi.org/10.1093/nar/gky1069 | articles/2019/3-NAR.html | Database resources of the National Center for ... | The National Center for Biotechnology Informat... | ['Eric W Sayers', 'Richa Agarwala', 'Evan E Bo... | The National Center for Biotechnology Informat... |
| 3 | 2019 | https://doi.org/10.1093/nar/gky843 | articles/2019/4-NAR.html | AmtDB: a database of ancient human mitochondri... | Ancient mitochondrial DNA is used for tracing ... | ['Edvard Ehler', 'Jiří Novotný', 'Anna Juras',... | Ancient DNA (aDNA) is a genetic material obtai... |
| 4 | 2019 | https://doi.org/10.1093/nar/gky822 | articles/2019/5-NAR.html | AnimalTFDB 3.0: a comprehensive resource for a... | The Animal Transcription Factor DataBase (Anim... | ['Hui Hu', 'Ya-Ru Miao', 'Long-Hao Jia', 'Qing... | Transcription factors (TFs) are special protei... |


#### Grab our text
We need to define a function, `read_corpus`, that yields a lightly preprocessed list of tokens for each article we've read in.

We want to remove stopwords, numbers, punctuation, etc. Here we use the preprocessing filters described in the gensim documentation: [https://radimrehurek.com/gensim/parsing/preprocessing.html](https://radimrehurek.com/gensim/parsing/preprocessing.html)

We will also use the Natural Language Tool Kit (NLTK) and pattern's 
powerful [lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) tools.

In [3]:
import nltk
nltk.download('wordnet')
from pattern.en import lemma

[nltk_data] Downloading package wordnet to /Users/sean/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Basic Preprocessing Function: `basic_preprocess`
- Takes as input a string representing a document
- [`remove_stopwords`](https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.remove_stopwords): Remove [stopwords](https://en.wikipedia.org/wiki/Stop_words) from the given string
- [`strip_non_alphanum`](https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.strip_non_alphanum): Remove non-alphanumeric characters
- [`strip_punctuation*`](https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.strip_punctuation): Remove punctuation characters (covers both `strip_punctuation` and `strip_punctuation2`)
- [`strip_tags`](https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.strip_tags): Remove any remaining HTML tags
- [`strip_numeric`](https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.strip_numeric): Remove digits
- [`strip_multiple_whitespaces`](https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.strip_multiple_whitespaces): Collapse runs of whitespace (spaces, tabs, line breaks) into a single space

In [4]:
def basic_preprocess(string):
    string_list = gpp.preprocess_string(string.lower(), 
        filters=[
            gpp.remove_stopwords, 
            gpp.strip_non_alphanum, 
            gpp.strip_punctuation, gpp.strip_punctuation2,
            gpp.strip_tags, 
            gpp.strip_numeric,
            gpp.strip_multiple_whitespaces])
    return [ent.lower().strip()
            for ent in string_list]

In [5]:
print(basic_preprocess('''
                This is an example of what will be returned
                if we use just basic_preprocessing.
                '''))

['example', 'returned', 'use', 'basic', 'preprocessing']
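
The order of the filters can matter when the input still contains markup. The sketch below uses two plain `re` stand-ins, `approx_strip_non_alphanum` and `approx_strip_tags`; both are hypothetical helpers that only approximate the gensim filters of similar name. It suggests that stripping non-alphanumeric characters before tags may leave tag names behind as tokens. Whether this affects the pipeline depends on whether the text has any HTML left in it.

```python
import re

# Hypothetical stand-ins that approximate the gensim filters (assumption)
def approx_strip_non_alphanum(s):
    """Replace every non-word character with a space."""
    return re.sub(r'\W', ' ', s)

def approx_strip_tags(s):
    """Delete anything that looks like <...>."""
    return re.sub(r'<[^>]+>', '', s)

text = '<span>gene</span> expression'

# Same order as basic_preprocess: the angle brackets are gone before the
# tag filter runs, so the tag names survive as ordinary words
non_alphanum_first = approx_strip_tags(approx_strip_non_alphanum(text)).split()

# Tags removed first: only the visible text remains
tags_first = approx_strip_non_alphanum(approx_strip_tags(text)).split()

print(non_alphanum_first)  # ['span', 'gene', 'span', 'expression']
print(tags_first)          # ['gene', 'expression']
```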


[Lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) Function: `lemmatize_strings`:
- This function takes as input a list of strings
- Reduces inflectional forms of words to a base form
- Returns list of lemmatized strings

In [6]:
def lemmatize_strings(string_list):
    return [lemma(ent.lower().strip()) 
                for ent in string_list ]

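This works around a small runtime issue with generators: on Python 3.7+ the first call to pattern's `lemma` can raise a `RuntimeError` (a `StopIteration` escaping a generator), while later calls succeed. The cell below makes that first call and discards the error.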

In [7]:
try:
    lemmatize_strings('test me'.split())
except (StopIteration, RuntimeError):
    pass
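
When an `except` clause should catch more than one exception type, Python expects a tuple of classes. A boolean expression such as `A and B` is evaluated first and collapses to a single class. The following minimal sketch illustrates the difference; `catches` is a hypothetical helper written only for this demonstration.

```python
# `StopIteration and RuntimeError` is an ordinary boolean expression: the
# first operand is truthy, so the expression evaluates to the second one.
combined = StopIteration and RuntimeError
print(combined is RuntimeError)  # True

def catches(exc_spec, exc_type):
    """Return True if `except exc_spec` catches an instance of exc_type."""
    try:
        raise exc_type()
    except exc_spec:
        return True
    except Exception:
        return False

# The `and` form only covers RuntimeError
print(catches(StopIteration and RuntimeError, StopIteration))  # False
# The tuple form covers both
print(catches((StopIteration, RuntimeError), StopIteration))   # True
print(catches((StopIteration, RuntimeError), RuntimeError))    # True
```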

In [8]:
print(lemmatize_strings('''
                This is an example of what will be returned
                if we use just lemmatization.
                '''.split()))

['thi', 'be', 'an', 'example', 'of', 'what', 'will', 'be', 'return', 'if', 'we', 'use', 'just', 'lemmatization.']


In [9]:
print(lemmatize_strings(
        basic_preprocess(
                '''
                This is an example of what will be returned
                if we use lemmatization after we 
                use our basic_preprocessing function.
                ''')))

['example', 'return', 'use', 'lemmatization', 'use', 'basic', 'preprocess', 'function']


Read Corpus from Pandas DB: `read_corpus`:
- Takes as input a pandas DataFrame and yields one list of preprocessed, lemmatized tokens per row
- `features_desired`: optional input of which DataFrame columns to concatenate into each document
- `min_size`: optional minimum length, in characters, for a token to be kept AFTER preprocessing and lemmatization

In [10]:
def read_corpus(pandas_db, features_desired=('introduction',), min_size=0):
    for _, row in pandas_db.iterrows():
        # Join the desired feature columns into a single string
        row_str = ' '.join(str(row[feature]) for feature in features_desired)
        tokens = lemmatize_strings(basic_preprocess(row_str))
        # Keep only tokens with at least min_size characters
        yield [tok for tok in tokens if len(tok) >= min_size]
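
One thing to watch for with `str(row[feature])`: pandas represents an empty CSV cell as a floating-point NaN, and `str()` turns that into the literal text `nan`, which would then pass through as a token. The sketch below uses two hypothetical helpers, `is_missing` and `join_features`, to skip such values. It operates on a plain dict standing in for a DataFrame row and is not part of the original pipeline.

```python
import math

def is_missing(value):
    """True for None or a float NaN (how pandas marks empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def join_features(row, features_desired):
    """Join the requested fields of `row` into one string, skipping missing ones."""
    return ' '.join(str(row[f]) for f in features_desired
                    if not is_missing(row[f]))

# A dict standing in for one DataFrame row with an empty introduction
row = {'abstract': 'Ancient mitochondrial DNA', 'introduction': float('nan')}

print(str(row['introduction']))                          # nan
print(join_features(row, ['abstract', 'introduction']))  # Ancient mitochondrial DNA
```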

Now, we will generate a list of token lists (one per article) that we will eventually train our models on.

In [11]:
pp_corpus = list(read_corpus(
                    db, 
                    features_desired=['abstract', 'introduction'],
                    min_size=3))

## TF-IDF Computation



You will defeat the whole purpose of IDF weighting if it's not based on a large corpus, as (a) your vocabulary becomes too small and (b) you have limited ability to observe the behavior of the words that you do know about.

The code below is sourced from [here](http://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.XQBj3y3MxQI)

In [12]:
# Join each token list back into one whitespace-separated string
docs_corpus = [' '.join(ent) for ent in pp_corpus]

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
import re

In [14]:
tfidf = TfidfVectorizer(
    # Maximum fraction of documents in which a word may appear
    max_df=0.50, # ignore words that appear in more than 50% of documents
    # Extra stopword filter provided by sklearn
    stop_words='english',
    smooth_idf=True,
    use_idf=True,
    strip_accents='unicode')
word_count_vector = tfidf.fit_transform(docs_corpus)
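
To make the `smooth_idf` and `max_df` settings above more concrete, here is a small pure-Python sketch of the smoothed IDF formula documented by scikit-learn, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, together with the document-frequency cutoff. The toy corpus and the names `toy_docs`, `doc_freq`, and `smoothed_idf` are illustrative assumptions; this is not how the vectorizer is implemented internally.

```python
import math

# Toy corpus of already-tokenized documents (illustrative only)
toy_docs = [
    ["gene", "database", "protein"],
    ["gene", "database", "rna"],
    ["gene", "structure", "rna"],
    ["database", "annotation"],
]
n_docs = len(toy_docs)

# Document frequency: the number of documents that contain each term
doc_freq = {}
for doc in toy_docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

def smoothed_idf(term):
    """Smoothed inverse document frequency, as with smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

# Terms whose document frequency is strictly above max_df are ignored
max_df = 0.50
dropped = sorted(t for t, df in doc_freq.items() if df / n_docs > max_df)

print(dropped)  # ['database', 'gene']
# 'rna' appears in exactly 50% of the documents, so it is kept
print('rna' in dropped)  # False
# Rarer terms receive a higher IDF weight than common ones
print(smoothed_idf('protein') > smoothed_idf('gene'))  # True
```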

I recommend reviewing the set of words below, which the `max_df` cutoff excluded from the vocabulary, to check that you are comfortable dropping them. You can adjust the `max_df` value above to your needs.

In [15]:
list(tfidf.stop_words_)

['study',
 'contain',
 'genome',
 'include',
 'database',
 'http',
 'data',
 'user',
 'provide',
 'protein',
 'tool',
 'search',
 'base',
 'new',
 'information',
 'analysi',
 'sequence',
 'develop',
 'available',
 'web',
 'number',
 'www',
 'resource',
 'gene']

Remove the new stopwords from our text.

In [16]:
new_stopwords = set(tfidf.stop_words_)

In [17]:
tmp_res = []
for pp_list in pp_corpus:
    tmp = []
    for word in pp_list:
        if word not in new_stopwords:
            tmp.append(word)
    tmp_res.append(tmp)
pp_corpus = tmp_res

### Considering High-Count Words
You also have the option of removing high-count words that were not captured in the original TF-IDF filtering. 

In [18]:
dictionary = gensim.corpora.Dictionary(pp_corpus)
print(dictionary)
#print(dictionary.token2id)

Dictionary(30329 unique tokens: ['academia', 'academy', 'access', 'accessible', 'activity']...)


We will convert this to a gensim [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model)-style corpus for later processing.

In [19]:
bow_corpus = [dictionary.doc2bow(text) for text in pp_corpus]
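
For readers unfamiliar with the format: in a bag-of-words corpus each document becomes a list of `(token_id, count)` pairs. The following pure-Python sketch illustrates the idea on a toy corpus. `build_vocab` and `to_bow` are hypothetical helpers, and the ids they assign will generally differ from those in a gensim `Dictionary`.

```python
from collections import Counter

def build_vocab(token_lists):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for the tokens found in vocab."""
    counts = Counter(vocab[tok] for tok in tokens if tok in vocab)
    return sorted(counts.items())

toy_corpus = [['structure', 'human', 'structure'], ['human', 'rna']]
vocab = build_vocab(toy_corpus)
toy_bow = [to_bow(doc, vocab) for doc in toy_corpus]

print(vocab)    # {'structure': 0, 'human': 1, 'rna': 2}
print(toy_bow)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```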

We might also be interested in flattening every token in our corpus into one large list and counting it with a Python `Counter` to view the most common entries.

In [20]:
ctr = collections.Counter([inner
    for outer in pp_corpus
        for inner in outer])

Look at the top 10 most common words now, and if you see fit, you may remove them. We chose to retain them for this study.

In [21]:
n = 10
counter_result = ctr.most_common(n)
print(counter_result)
topn = [word[0] for word in counter_result]

[('structure', 6836), ('human', 5582), ('annotation', 5500), ('site', 4352), ('interaction', 4085), ('domain', 4035), ('rna', 3982), ('function', 3962), ('model', 3772), ('genomic', 3706)]
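
If you did want to drop the most common words, one possible approach (not applied in this study) is to turn the top-n list into a set and filter each document against it. This is a minimal sketch on a toy corpus; the names `toy_corpus`, `top_words`, and `filtered` are illustrative.

```python
from collections import Counter

toy_corpus = [
    ['structure', 'human', 'rna'],
    ['structure', 'annotation'],
    ['structure', 'human', 'domain'],
]

# Count every token across all documents
counts = Counter(tok for doc in toy_corpus for tok in doc)

# Use a set for fast membership tests while filtering
top_words = {word for word, _ in counts.most_common(2)}

filtered = [[tok for tok in doc if tok not in top_words]
            for doc in toy_corpus]

print(sorted(top_words))  # ['human', 'structure']
print(filtered)           # [['rna'], ['annotation'], ['domain']]
```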


### Save Results

In [22]:
# Join each filtered token list into one whitespace-separated string
docs_corpus = [' '.join(ent) for ent in pp_corpus]

In [23]:
db['preprocessed_data'] = docs_corpus

In [25]:
db.to_csv('preprocessed-data.csv', index=False)