4.4.5 Challenge: Build your own NLP model
For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

Data cleaning / processing / language parsing
Create features using two different NLP methods: For example, BoW vs tf-idf.
Use the features to fit supervised learning models for each feature set to predict the category outcomes.
Assess your models using cross-validation and determine whether one model performed better.
Pick one of the models and try to increase accuracy by at least 5 percentage points.

In [2]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests

# NLP 
import spacy
import re
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
abstract1 = requests.get(r'http://api.plos.org/search?q=title:%22transmission%22&fl=abstract&wt=json&api_key=7ujScsFm2osdMw6ozx4g')
abstract2 = requests.get(r'http://api.plos.org/search?q=title:%22disease%22&fl=abstract&wt=json&api_key=7ujScsFm2osdMw6ozx4g')
abstract3 = requests.get(r'http://api.plos.org/search?q=title:%22health%22&fl=abstract&wt=json&api_key=7ujScsFm2osdMw6ozx4g')
abstract4 = requests.get(r'http://api.plos.org/search?q=title:%22communicable%22&fl=abstract&wt=json&api_key=7ujScsFm2osdMw6ozx4g')
abstract5 = requests.get(r'http://api.plos.org/search?q=title:%22prevalence%22&fl=abstract&wt=json&api_key=7ujScsFm2osdMw6ozx4g')
abstract6 = requests.get(r'http://api.plos.org/search?q=title:%22healthcare%22&fl=abstract&wt=json&api_key=7ujScsFm2osdMw6ozx4g')

In [4]:
transmission_1 = abstract1.json()
disease_1= abstract2.json()
health_1 = abstract3.json()
communicable_1 = abstract4.json()
prevalence_1 = abstract5.json()
healthcare_1 = abstract6.json()


Extracting only the abstracts of each article into one string.

In [5]:
transmission = ''
for article in transmission_1['response']['docs']:
    art = re.sub(r'[\[\]]','', str(article['abstract']))
    transmission = transmission + art

transmission[0:200]

"'\\nIn Japan, the fraction of norovirus outbreaks attributable to human-to-human transmission has increased with time, and the timing of the increased fraction has coincided with the increase in the ob"

In [6]:
disease = ''
for article in disease_1['response']['docs']:
    art = re.sub(r'[\[\]]', '', str(article['abstract']))
    disease = disease+art
    
disease[0:200]
                                      

"'Background: Exposure to beryllium may lead to granuloma formation and fibrosis in those who develop chronic beryllium disease (CBD). Although disease presentation varies from mild to severe, little i"

In [7]:
health = ''
for article in health_1['response']['docs']:
    art = re.sub(r'[\[\]]', '', str(article['abstract']))
    health = health+art
    
health[0:200]
             

"'Background: For most rural households in sub-Saharan Africa, healthy livestock play a key role in averting the burden associated with zoonotic diseases, and in meeting household nutritional and socio"

In [8]:
communicable = ''
for article in communicable_1['response']['docs']:
    art = re.sub(r'[\[\]]', '', str(article['abstract']))
    communicable= communicable+art
    
communicable[0:200]

"'\\nIn sub-Saharan Africa (SSA), epidemiological data for chronic kidney disease (CKD) are scarce. We conducted a prospective cross-sectional study including 952 patients in an outpatient clinic in Tan"

In [9]:
prevalence = ''
for article in prevalence_1['response']['docs']:
    art = re.sub(r'[\[\]]', '', str(article['abstract']))
    prevalence= prevalence+art
    
prevalence[0:200]

"'Introduction: Frailty is an important concept in modern healthcare due to its association with adverse outcomes. Its prevalence varies in the literature and there is a paucity of literature looking a"

In [10]:
healthcare = ''
for article in healthcare_1['response']['docs']:
    art = re.sub(r'[\[\]]', '', str(article['abstract']))
    healthcare= healthcare+art
    
healthcare[0:200]

"'Background: Culturally and linguistically diverse patients access healthcare services less than the host populations and are confronted with different barriers such as language barriers, legal restri"

Text cleaning

In [11]:
def text_cleaner(text):
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'[\']', '', text)
    text = re.sub(r'[\\]', '', text)
    text = re.sub(r'\d', '', text)
    return text

In [12]:
transmission_clean = text_cleaner(transmission)
disease_clean = text_cleaner(disease)
health_clean = text_cleaner(health)
communicable_clean = text_cleaner(communicable)
prevalence_clean = text_cleaner(prevalence)
healthcare_clean = text_cleaner(healthcare)

In [13]:
len(transmission_clean)

15268

In [14]:
len(disease_clean)

11803

Language Parsing using spaCy

In [15]:
nlp = spacy.load('en_core_web_sm')

In [16]:
transmission_doc = nlp(transmission_clean)
disease_doc = nlp(disease_clean)
health_doc = nlp(health_clean)
communicable_doc = nlp(communicable_clean)
prevalence_doc = nlp(prevalence_clean)
healthcare_doc = nlp(healthcare_clean)

Splitting each topic into individual sentences.

In [17]:
transmission_sents = [[sent, 'transmission'] for sent in transmission_doc.sents]
disease_sents = [[sent, 'disease'] for sent in disease_doc.sents]
health_sents = [[sent, 'health'] for sent in health_doc.sents]
communicable_sents = [[sent, 'communicable'] for sent in communicable_doc.sents]
prevalence_sents = [[sent, 'prevalence'] for sent in prevalence_doc.sents]
healthcare_sents = [[sent, 'healthcare'] for sent in healthcare_doc.sents]

sentences = pd.DataFrame(transmission_sents+disease_sents+health_sents+communicable_sents+prevalence_sents+healthcare_sents)
sentences.head()
print(len(sentences))

719


In [18]:
# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]
    

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 100 == 0:
            print("Processing row {}".format(i))
            
    return df

Finding common words

In [19]:
transmission_words = bag_of_words(transmission_doc)
disease_words = bag_of_words(disease_doc)
health_words = bag_of_words(health_doc)
communicable_words = bag_of_words(communicable_doc)
prevalence_words = bag_of_words(prevalence_doc)
healthcare_words = bag_of_words(healthcare_doc)


common_words = set(transmission_words + disease_words + health_words + communicable_words + prevalence_words + healthcare_words)

Creating features

In [26]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words)
word_counts.head()


Processing row 0
Processing row 100
Processing row 200
Processing row 300
Processing row 400
Processing row 500
Processing row 600
Processing row 700


Unnamed: 0,host,maintain,expansion,unanticipated,restriction,correlate,equivalent,count,leisure,belief,...,BPS,choice,prespecifie,≥,iii,prevention,early,Jakob,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(nIn, Japan, ,, the, fraction, of, norovirus, ...",transmission
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(The, present, study, aimed, to, estimate, the...",transmission
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(The, effective, reproduction, number, (, Ry, ...",transmission
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Ry, was, estimated, by, using, the, fraction,...",transmission
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(The, Ry, estimates, ranged, from, .)",transmission


In [21]:
word_counts.shape

(719, 2086)

Creating tf-idf features
Converting sentences into numeric vectors.
Creating list

In [22]:
#Converting the sentences into numeric vectors
abstract_list = []
for topic in [transmission_1, disease_1, health_1, communicable_1, prevalence_1, healthcare_1]:
    for article in topic['response']['docs']:
        abstract_list = abstract_list + article['abstract']

Vectorizing the texts

In [24]:
from sklearn.model_selection import train_test_split
X = abstract_list
X_train, X_test = train_test_split(X, test_size=0.4, random_state=42)

vectorizer = TfidfVectorizer(max_df=0.7, # drop words that occur in more than half the paragraphs
                             min_df=2, # only use words that appear at least twice
                             stop_words='english', 
                             lowercase=False, #convert everything to lower case 
                             use_idf=True,#we definitely want to use inverse document frequencies in our weighting
                             norm=u'l2', #Applies a correction factor so that longer paragraphs and shorter paragraphs get treated equally
                             smooth_idf=True #Adds 1 to all document frequencies, as if an extra document existed that used every word once.  Prevents divide-by-zero errors
                            )


#Applying the vectorizer
abstract_tfidf=vectorizer.fit_transform(abstract_list)
print("Number of features: %d" % abstract_tfidf.get_shape()[1])

#splitting into training and test sets
X_train_tfidf, X_test_tfidf= train_test_split(abstract_tfidf, test_size=0.4, random_state=0)


#Reshapes the vectorizer output into something people can read
X_train_tfidf_csr = X_train_tfidf.tocsr()

#number of paragraphs
n = X_train_tfidf_csr.shape[0]
#A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]
#List of features
terms = vectorizer.get_feature_names()
#for each paragraph, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]

#Keep in mind that the log base 2 of 1 is 0, so a tf-idf score of 0 indicates that the word was present once in that sentence.
print('Original sentence:', X_train[5])
print('Tf_idf vector:', tfidf_bypara[1])

Number of features: 918
Original sentence: 
In sub-Saharan Africa (SSA), epidemiological data for chronic kidney disease (CKD) are scarce. We conducted a prospective cross-sectional study including 952 patients in an outpatient clinic in Tanzania to explore CKD prevalence estimates and the association with cardiovascular and infectious disorders. According to KDIGO, we measured albumin-to-creatinine ratio and calculated eGFR using CKD-EPI formula. Factors associated with CKD were calculated by logistic regression. Venn diagrams were modelled to visualize interaction between associated factors and CKD. Overall, the estimated CKD prevalence was 13.6% (95% CI 11–16%). Ninety-eight patients (11.2%) (95% CI 9–14%) were categorized as moderate, 12 (1.4%) (95% CI 0–4%) as high, and 9 (1%) (95% CI 0–3%) as very high risk according to KDIGO. History of tuberculosis (OR 3.75, 95% CI 1.66–8.18; p = 0.001) and schistosomiasis (OR 2.49, 95% CI 1.13–5.18; p = 0.02) were associated with CKD. A trend 

In [25]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

#Our SVD data reducer.  We are going to reduce the feature space from 260 to 130.
svd= TruncatedSVD(130)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(X_train_tfidf)

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)

#Looking at what sorts of paragraphs our solution considers similar, for the first five identified topics
paras_by_component=pd.DataFrame(X_train_lsa,index=X_train)
for i in range(5):
    print('Component {}:'.format(i))
    print(paras_by_component.loc[:,i].sort_values(ascending=False)[0:10])

Percent variance captured by all components: 100.0
Component 0:
: Jonathan Quick and colleagues discuss how women's health world-wide can be improved through universal health coverage.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                

Name: 0, dtype: float64
Component 1:
\nPseudoperonospora cubensis, an obligate biotrophic oomycete causing devastating foliar disease in species of the Cucurbitaceae family, was never reported in seeds or transmitted by seeds. We now show that P. cubensis occurs in fruits and seeds of downy mildew-infected plants but not in fruits or seeds of healthy plants. About 6.7% of the fruits collected during 2012–2014 have developed downy mildew when homogenized and inoculated onto detached leaves and 0.9% of the seeds collected developed downy mildew when grown to the seedling stage. This is the first report showing that P. cubensis has become seed-transmitted in cucurbits. Species-specific PCR assays showed that P. cubensis occurs in ovaries, fruit seed cavity and seed embryos of cucurbits. We propose that international trade of fruits or seeds of cucurbits might be associated with the recent global change in the population structure of P. cubensis.\n                                          

Name: 1, dtype: float64
Component 2:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   

Name: 2, dtype: float64
Component 3:
\n        There has been growing recognition in the international community that health should be considered a human right. Much less attention has been paid, however, to the ensuing legal obligation to provide international assistance.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     

Name: 3, dtype: float64
Component 4:
Background: Race and ethnicity, typically defined as how individuals self-identify, are complex social constructs. Self-identified racial/ethnic minorities are less likely to receive preventive care and more likely to report healthcare discrimination than self-identified non-Hispanic whites. However, beyond self-identification, these outcomes may vary depending on whether racial/ethnic minorities are perceived by others as being minority or white; this perception is referred to as socially-assigned race. Purpose: To examine the associations between socially-assigned race and healthcare discrimination and receipt of selected preventive services. Methods: Cross-sectional analysis of the 2004 Behavioral Risk Factor Surveillance System “Reactions to Race” module. Respondents from seven states and the District of Columbia were categorized into 3 groups, defined by a composite of self-identified race/socially-assigned race: Minority/Minority (M/M, n = 6,8

Name: 4, dtype: float64
