## Natural language processing with spaCy

In [3]:
# note: conda environment data_review is set up for this notebook
import os

import IPython

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# spaCy model to use
spacy_mod = 'en_core_web_lg'  # "lg" is the large model; en_core_web_md and en_core_web_sm are the medium and small options

# note: the model must be downloaded first; on the command line run "python -m spacy download en_core_web_lg"
import spacy

In [5]:
# load the data

test = pd.read_csv('./data/drugsComTest_raw.csv')
train = pd.read_csv('./data/drugsComTrain_raw.csv')
merge = [train,test]
merged_data = pd.concat(merge,ignore_index=True)
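# .copy() makes bc_merged an independent DataFrame, so assigning to its columns later does not act on a slice of merged_data
bc_merged = merged_data[merged_data['condition'] == 'Birth Control'].copy()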

In [7]:
# This removes the HTML-escaped characters
# Melissa note: the first time I run this cell it reports a problem, but a second run is fine
# (possibly pandas' SettingWithCopyWarning, which appears when a column is assigned on a filtered slice of another DataFrame)
from html import unescape

def clean_review(text):
    return unescape(text.strip(' "\'')).replace('\ufeff1', '')

bc_merged.review = bc_merged.review.apply(clean_review)
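
To make the effect of `clean_review` concrete, here is a minimal sketch that applies the same function to one string. The sample text is made up for illustration and is not taken from the data set.

```python
from html import unescape

def clean_review(text):
    # same steps as the cell above: strip surrounding quotes/spaces, then unescape HTML entities
    return unescape(text.strip(' "\'')).replace('\ufeff1', '')

# made-up sample written in the style of a raw, quoted review
raw = '"I&#039;ve had no side effects &amp; feel great"'
cleaned = clean_review(raw)
print(cleaned)
```

The surrounding double quotes are stripped first, and the entities `&#039;` and `&amp;` are then turned back into `'` and `&`.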

## spaCy lemmatization

Note: as a first pass I am not removing stop words, just to see what I am working with. Some people argue that stop words are actually useful to keep, so leaving them in may be beneficial.
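
As a rough illustration of why keeping stop words can matter, the sketch below filters a token list against a tiny, hand-picked stop list. The list, the `drop_stop_words` helper, and the sample tokens are all hypothetical; with spaCy itself the equivalent check would be `token.is_stop`.

```python
# hypothetical, deliberately tiny stop list for illustration only
STOP_WORDS = {"i", "the", "a", "it", "and", "to", "was", "not", "no"}

def drop_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

# made-up tokens: "It was not painful and I had no cramps"
tokens = ["It", "was", "not", "painful", "and", "I", "had", "no", "cramps"]
kept = drop_stop_words(tokens)
print(kept)
```

With this stop list the negations disappear, so "not painful" and "no cramps" collapse to "painful" and "cramps", which reverses the meaning of the review.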

In [17]:
# natural language processor
nlp = spacy.load(spacy_mod, disable = ['parser', 'ner'])

In [18]:
# pick one example review to work with

num = 100 # change this index to get a different review
review_example = bc_merged.review.iloc[num]
print(review_example)

I am 22, no prior children, I have endometriosis & was told Mirena was the best option. The implementation process was hell. The worst pain I have ever been in. I fainted & had to have a friend pick me up. I left it in for 10 months. Within those 10 months I'd gained 18 pounds (on a 5 foot tall person that's horrible), I was moody, had the worst cramps, hair loss, acne, horrible headaches, the appetite of a sumo wrestler, & didn't want to do anything. I felt lazy and borderline depressed. They don't tell you those side effects for obvious reasons. Yesterday I finally had it removed. Being traumatized by having it put in I about made myself sick with nerves. The removal was quick and painless and I already feel like myself again.


In [19]:
# we can see what the lemmatization of this review is
doc = nlp(review_example)
review_token = " ".join([token.lemma_ for token in doc])
print(review_token)
type(doc)

I be 22 , no prior child , I have endometriosis & be tell Mirena be the good option . the implementation process be hell . the bad pain I have ever be in . I faint & have to have a friend pick I up . I leave it in for 10 month . within those 10 month I have gain 18 pound ( on a 5 foot tall person that be horrible ) , I be moody , have the bad cramp , hair loss , acne , horrible headache , the appetite of a sumo wrestler , & do n't want to do anything . I feel lazy and borderline depressed . they do n't tell you those side effect for obvious reason . yesterday I finally have it remove . be traumatize by have it put in I about make myself sick with nerve . the removal be quick and painless and I already feel like myself again .


spacy.tokens.doc.Doc

In [21]:
tokens = nlp(review_example.replace('/', ' / '))

In [22]:
type(tokens)

spacy.tokens.doc.Doc

In [23]:
tokens

I am 22, no prior children, I have endometriosis & was told Mirena was the best option. The implementation process was hell. The worst pain I have ever been in. I fainted & had to have a friend pick me up. I left it in for 10 months. Within those 10 months I'd gained 18 pounds (on a 5 foot tall person that's horrible), I was moody, had the worst cramps, hair loss, acne, horrible headaches, the appetite of a sumo wrestler, & didn't want to do anything. I felt lazy and borderline depressed. They don't tell you those side effects for obvious reasons. Yesterday I finally had it removed. Being traumatized by having it put in I about made myself sick with nerves. The removal was quick and painless and I already feel like myself again.

In [24]:
type(tokens[10].text)

str

In [15]:
def review2token(text):
    # tokenize
    doc = nlp(text)
    # now to turn into a list of strings that are the tokens
    words = []
    for token in doc:
        #skip spaces
        if token.text.isspace():
            continue 
        
        words.append(token.text)
    return words

In [26]:
# the result is a list (of strings, see the next cell)
a = review2token(review_example)
type(a)

list

In [27]:
type(a[1])

str

In [145]:
# tokenize the reviews in the data set
# note: review2token returns the raw token text, not lemmas, despite the reviews_lemma name

# this is a big data set, so for now I am only running the first n reviews
#reviews_lemma = [nlp(review) for review in bc_merged.review]
n = 1000
bc_merged_sub = bc_merged.iloc[:n]

reviews_lemma = [review2token(review) for review in bc_merged_sub.review]

In [146]:
len(reviews_lemma)

1000

In [148]:
#reviews_lemma[1]

In [149]:
# count the number of times each token appears
token_count = {}

for rev in reviews_lemma:
    for token in rev:
        token_count[token] = token_count.get(token, 0) + 1

In [150]:
token_count['I']

7005

In [151]:
# sort the counts
token_count = dict(sorted(token_count.items(), key = lambda pair: pair[1], reverse=True))
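
The counting and sorting steps above can also be written with `collections.Counter` from the standard library. This is a sketch on a small stand-in for `reviews_lemma`; the sample token lists are invented.

```python
from collections import Counter
from itertools import chain

# invented stand-in for reviews_lemma: one list of tokens per review
sample_reviews = [
    ["I", "love", "it", "."],
    ["I", "hate", "it"],
    ["I", "stopped"],
]

# flatten the lists and count every token in a single pass
counts = Counter(chain.from_iterable(sample_reviews))

# most_common returns (token, count) pairs, highest count first
top = counts.most_common(2)
print(top)
```

`Counter` behaves like a dict, so `counts['I']` works the same way as `token_count['I']`.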

In [153]:
# display a table of the most frequent words

# number of tokens to show (should not exceed the vocabulary size)
show_words = 500
# number of token/count pairs per table row
columns = 5

markdown_rows = []
markdown_rows.append(f'#### Top {show_words} most frequent tokens\n')
markdown_rows.append('| Token | Count '*columns + '|')
markdown_rows.append('| ---: | :---- '*columns + '|')

row = ''
for index, (word, count) in enumerate(token_count.items()):
    if index >= show_words: break
    if index%columns == 0 and row:
        markdown_rows.append(row + '|')
        row = ''
    row += f'| {word} | {count} '
if row:
    markdown_rows.append(row + '|')
    
IPython.display.Markdown('\n'.join(markdown_rows))

#### Top 500 most frequent tokens

| Token | Count | Token | Count | Token | Count | Token | Count | Token | Count |
| ---: | :---- | ---: | :---- | ---: | :---- | ---: | :---- | ---: | :---- |
| I | 7005 | . | 5860 | and | 3129 | , | 3079 | the | 2539 |
| it | 2281 | to | 2081 | my | 1967 | a | 1957 | have | 1558 |
| for | 1467 | was | 1399 | had | 1169 | of | 1058 | on | 1022 |
| this | 923 | n't | 874 | but | 870 | that | 815 | is | 789 |
| in | 740 | ! | 682 | me | 673 | been | 671 | period | 634 |
| not | 617 | with | 574 | so | 559 | about | 549 | months | 548 |
| pill | 543 | 've | 541 | 'm | 526 | no | 495 | birth | 472 |
| control | 467 | now | 441 | 's | 436 | first | 414 | get | 384 |
| month | 379 | did | 376 | weight | 366 | am | 364 | all | 362 |
| at | 357 | after | 355 | do | 354 | My | 350 | ) | 348 |
| periods | 347 | ( | 344 | out | 342 | It | 339 | 3 | 339 |
| The | 334 | just | 331 | got | 327 | or | 320 | would | 320 |
| very | 318 | days | 316 | like | 315 | years | 311 | - | 310 |
| as | 296 | side | 294 | has | 292 | only | 288 | started | 284 |
| be | 284 | time | 282 | taking | 279 | because | 276 | are | 265 |
| you | 265 | 2 | 255 | cramps | 251 | week | 250 | effects | 246 |
| mood | 245 | bleeding | 242 | acne | 237 | before | 234 | bad | 228 |
| sex | 226 | never | 225 | since | 223 | any | 219 | gain | 219 |
| when | 218 | from | 215 | which | 214 | feel | 212 | year | 212 |
| up | 211 | more | 207 | also | 202 | one | 200 | if | 200 |
| day | 197 | back | 193 | weeks | 192 | swings | 190 | spotting | 185 |
| then | 181 | two | 178 | some | 176 | take | 175 | than | 173 |
| really | 173 | getting | 172 | having | 168 | pain | 168 | were | 164 |
| every | 164 | over | 161 | will | 160 | pregnant | 157 | an | 157 |
| they | 150 | light | 149 | even | 148 | 5 | 147 | doctor | 147 |
| could | 145 | little | 145 | almost | 142 | This | 141 | went | 141 |
| drive | 141 | 6 | 139 | off | 139 | pills | 138 | still | 138 |
| few | 138 | other | 135 | gained | 134 | can | 133 | much | 133 |
| 4 | 131 | ever | 131 | great | 130 | made | 129 | heavy | 129 |
| recommend | 129 | cramping | 127 | experience | 126 | far | 124 | going | 123 |
| good | 120 | No | 118 | painful | 118 | go | 117 | felt | 117 |
| But | 116 | body | 116 | love | 116 | headaches | 115 | inserted | 115 |
| took | 114 | put | 113 | last | 112 | pounds | 110 | noticed | 110 |
| always | 108 | insertion | 108 | normal | 105 | its | 103 | say | 102 |
| well | 100 | does | 99 | again | 99 | horrible | 98 | ago | 97 |
| skin | 97 | used | 95 | So | 95 | / | 95 | thing | 94 |
| lot | 94 | pregnancy | 94 | different | 94 | being | 93 | After | 93 |
| implant | 93 | ca | 93 | stopped | 92 | experienced | 92 | during | 89 |
| 10 | 88 | better | 88 | shot | 88 | ... | 86 | removed | 86 |
| nothing | 86 | while | 85 | depression | 84 | problems | 84 | away | 84 |
| IUD | 84 | by | 82 | think | 81 | know | 81 | switched | 80 |
| work | 80 | until | 80 | nausea | 80 | too | 79 | there | 79 |
| try | 79 | : | 79 | Implanon | 79 | three | 78 | long | 78 |
| same | 77 | want | 77 | your | 77 | stop | 76 | second | 75 |
| due | 75 | reviews | 75 | " | 75 | worst | 74 | what | 74 |
| anything | 74 | though | 73 | symptoms | 73 | depressed | 73 | tried | 73 |
| use | 73 | using | 72 | Nexplanon | 72 | sure | 72 | & | 71 |
| most | 70 | something | 70 | completely | 69 | life | 69 | another | 68 |
| moody | 68 | around | 68 | these | 68 | old | 68 | extremely | 68 |
| pretty | 68 | anxiety | 68 | thought | 67 | gotten | 67 | definitely | 66 |
| taken | 65 | next | 65 | how | 65 | help | 64 | them | 64 |
| emotional | 63 | decided | 63 | told | 63 | fine | 62 | down | 62 |
| 1 | 62 | Mirena | 62 | worse | 60 | half | 60 | crazy | 60 |
| times | 59 | way | 59 | lighter | 59 | bit | 58 | make | 58 |
| terrible | 58 | change | 58 | blood | 58 | said | 57 | lost | 57 |
| hair | 57 | i | 57 | best | 57 | lasted | 56 | where | 56 |
| less | 56 | everything | 56 | 7 | 56 | severe | 56 | And | 55 |
| things | 55 | 8 | 55 | happy | 55 | issues | 55 | into | 55 |
| However | 55 | breasts | 54 | face | 54 | wanted | 54 | new | 54 |
| absolutely | 54 | through | 54 | When | 53 | pack | 53 | right | 53 |
| ? | 53 | once | 53 | couple | 53 | everyone | 53 | effective | 52 |
| worth | 52 | patch | 51 | longer | 51 | boyfriend | 51 | feeling | 51 |
| arm | 51 | Also | 50 | see | 50 | hormones | 50 | form | 49 |
| 'll | 49 | gone | 49 | anymore | 49 | reason | 49 | regular | 49 |
| sometimes | 48 | 15 | 48 | anyone | 48 | .. | 48 | 20 | 48 |
| everyday | 48 | If | 48 | Loestrin | 48 | breast | 48 | give | 47 |
| later | 46 | break | 46 | medication | 46 | BC | 46 | past | 46 |
| trying | 45 | insurance | 45 | starting | 45 | switching | 45 | bleed | 45 |
| Lo | 45 | Not | 44 | people | 44 | Nuvaring | 44 | finally | 44 |
| notice | 44 | effect | 44 | hurt | 44 | loss | 44 | myself | 44 |
| cycle | 43 | negative | 43 | 'd | 43 | constantly | 43 | super | 43 |
| yet | 42 | problem | 42 | works | 42 | today | 41 | soon | 41 |
| tired | 41 | read | 41 | worked | 41 | Now | 41 | point | 41 |
| night | 41 | Skyla | 40 | who | 40 | already | 40 | straight | 40 |
| without | 40 | she | 40 | worry | 39 | nauseous | 39 | eat | 39 |
| actually | 39 | appetite | 39 | gave | 38 | awful | 38 | free | 37 |
| many | 37 | changed | 37 | clear | 37 | caused | 37 | stomach | 37 |
| Ortho | 37 | maybe | 36 | became | 36 | switch | 36 | migraines | 36 |
| cry | 36 | Sprintec | 36 | may | 35 | low | 35 | person | 35 |
| helped | 35 | hoping | 35 | loved | 35 | done | 34 | those | 34 |
| however | 34 | kids | 34 | increased | 34 | makes | 34 | start | 34 |
| enough | 34 | At | 33 | came | 33 | 2015 | 33 | baby | 33 |
| Then | 33 | eating | 33 | constant | 33 | within | 33 | husband | 33 |
| Since | 33 | ring | 33 | Depo | 33 | Fe | 33 | least | 33 |
| lbs | 32 | They | 32 | found | 32 | For | 32 | bled | 32 |
| hours | 32 | keep | 32 | sore | 32 | libido | 31 | why | 31 |
| job | 31 | between | 31 | pains | 31 | ; | 31 | nexplanon | 31 |
| except | 31 | able | 31 | sick | 31 | recently | 31 | 9 | 31 |
| Tri | 31 | high | 31 | he | 31 | end | 30 | method | 30 |
| remember | 30 | slight | 30 | women | 30 | uncomfortable | 30 | find | 30 |
| cause | 30 | place | 30 | 18 | 30 | Overall | 30 | else | 30 |
| extreme | 30 | hormonal | 30 | minutes | 30 | procedure | 30 | ended | 29 |
| immediately | 29 | bc | 29 | hope | 29 | honestly | 29 | come | 29 |
| A | 29 | skyla | 29 | nt | 29 | usually | 29 | 're | 29 |
| recommended | 29 | changes | 29 | daily | 28 | lower | 28 | should | 28 |
| we | 28 | deal | 28 | part | 28 | major | 28 | whole | 28 |
| pressure | 28 | each | 28 | mild | 28 | entire | 28 | hour | 28 |
| experiencing | 27 | Insertion | 27 | bloating | 27 | easy | 27 | brand | 27 |
| prior | 27 | depo | 26 | 30 | 26 | lose | 26 | began | 26 |
| working | 26 | angry | 26 | non | 26 | 3rd | 26 | size | 26 |
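
The table counts tokens case-sensitively, so "my" and "My" or "the" and "The" show up as separate entries. A minimal sketch of merging them by lowercasing, using four counts copied from the table (the variable names are illustrative):

```python
from collections import Counter

# counts copied from the table above
token_count_sample = {"the": 2539, "my": 1967, "My": 350, "The": 334}

# add up the counts of tokens that differ only in case
merged = Counter()
for token, count in token_count_sample.items():
    merged[token.lower()] += count

print(merged.most_common())
```

Whether to merge case variants is a modeling choice: it shrinks the vocabulary, but it also loses distinctions such as sentence-initial words or capitalized brand names.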

In [109]:
def text2tokens(text):
    """text -> tokenize -> lemmatize/normalize"""
    
    tokens = nlp(
        # also split on "/"
        text.replace('/', ' / '),
        
        # parser and ner are already disabled at load time;
        # caution: the lemmatizer relies on POS tags, so also disabling the tagger can degrade the lemmas
        disable=['tagger', 'parser', 'ner']
    )
    
    lexemes = []
    for token in tokens:
        
        # sometimes whitespace gets recognized as a token...
        if token.text.isspace():
            continue
            
        # prefer more general representations
        # but only if they have an embedding
        if nlp.vocab[token.lemma_.lower()].has_vector:
            lexeme = token.lemma_.lower()
        elif nlp.vocab[token.norm_.lower()].has_vector:
            lexeme = token.norm_.lower()
        else:
            lexeme = token.lower_
        
        lexemes.append(lexeme)
        
    return lexemes
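
The fallback order inside `text2tokens` (lemma, then norm, then lowercased text) can be isolated in a small helper. This is a sketch only: `pick_lexeme`, the `known` set standing in for `nlp.vocab[...].has_vector`, and the candidate triples are all hypothetical.

```python
def pick_lexeme(candidates, has_vector):
    """Return the first candidate that has a vector; fall back to the last one."""
    for candidate in candidates[:-1]:
        if has_vector(candidate):
            return candidate
    return candidates[-1]

# hypothetical vocabulary standing in for nlp.vocab[...].has_vector
known = {"gain", "pound", "do", "not"}

# invented (lemma, norm, lowercased text) candidates for three tokens
first = pick_lexeme(["gain", "gained", "gained"], known.__contains__)
second = pick_lexeme(["n't", "not", "n't"], known.__contains__)
third = pick_lexeme(["mirena", "mirena", "mirena"], known.__contains__)
print(first, second, third)
```

The first token resolves to its lemma, the second to its norm because the lemma has no vector, and the third falls through to the lowercased text because none of its forms has a vector.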

In [110]:
num = 100 # change this index to get a different review
text = bc_merged.review.iloc[num]

In [111]:
text

"I am 22, no prior children, I have endometriosis & was told Mirena was the best option. The implementation process was hell. The worst pain I have ever been in. I fainted & had to have a friend pick me up. I left it in for 10 months. Within those 10 months I'd gained 18 pounds (on a 5 foot tall person that's horrible), I was moody, had the worst cramps, hair loss, acne, horrible headaches, the appetite of a sumo wrestler, & didn't want to do anything. I felt lazy and borderline depressed. They don't tell you those side effects for obvious reasons. Yesterday I finally had it removed. Being traumatized by having it put in I about made myself sick with nerves. The removal was quick and painless and I already feel like myself again."

In [112]:
t2 = text2tokens(text)

In [113]:
type(t2)

list

In [114]:
type(t2[0])

str