# Text Preprocessing 

##### Author: Saurabh Kumar

**What is NLP?**

What can we do with NLP?

* `information extraction`
* `Text classification`
   - Bag of words (tf-idf)
   - deep learning (RNN/LSTM/Transformer)
* `Language modeling/ Natural Text Generation`
* `Text Similarity`
* `Topic modeling`
* Translation
* Chat bot
* Question answering
* Text-to-speech and Speech-to-text

The general workflow for any Natural Language Processing Project

![nlp_pract.PNG](attachment:nlp_pract.PNG)

* NLP
* Intro to Kaggle Kernels / Jupyter Notebook
* SpaCy- Text Tokenization, POS Tagging, Parsing, NER
* Python Fundamentals: Collections, list comprehensions, sorted, apply

## SpaCy

"SpaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?

SpaCy is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

SpaCy is not research software. It's built on the latest research, but it's designed to get things done. This leads to fairly different design decisions than NLTK or CoreNLP, which were created as platforms for teaching and research. The main difference is that SpaCy is integrated and opinionated. SpaCy tries to avoid asking the user to choose between multiple algorithms that deliver equivalent functionality. Keeping the menu small lets SpaCy deliver generally better performance and developer experience."

### SpaCy Features 

NAME |	DESCRIPTION |
:----- |:------|
Tokenization|Segmenting text into words, punctuation marks etc.|
Part-of-speech (POS) Tagging|Assigning word types to tokens, like verb or noun.|
Dependency Parsing|	Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.|
Lemmatization|	Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".|
Sentence Boundary Detection (SBD)|	Finding and segmenting individual sentences.|
Named Entity Recognition (NER)|	Labelling named "real-world" objects, like persons, companies or locations.|
Similarity|	Comparing words, text spans and documents and how similar they are to each other.|
Text Classification|	Assigning categories or labels to a whole document, or parts of a document.|
Rule-based Matching|	Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.|
Training|	Updating and improving a statistical model's predictions.|
Serialization|	Saving objects to files or byte strings.|

SOURCE: https://spacy.io/usage/spacy-101

'He killed the man with **fire**'

In [2]:
print('He killed the man with fire') 

He killed the man with fire


In [24]:
# importing spacy 
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')

In [22]:
# review text

text="This is one of the greatest films ever made. Brilliant acting by George C. Scott and Diane Riggs. This movie is both disturbing and extremely deep. Don't be fooled into believing this is just a comedy. It is a brilliant satire about the medical profession. It is not a pretty picture. Healthy patients are killed by incompetent surgeons, who spend all their time making money outside the hospital. And yet, you really believe that this is a hospital. The producers were very careful to include real medical terminology and real medical cases. This movie really reveals how difficult in is to run a hospital, and how badly things already were in 1971. I loved this movie."
print(text)

This is one of the greatest films ever made. Brilliant acting by George C. Scott and Diane Riggs. This movie is both disturbing and extremely deep. Don't be fooled into believing this is just a comedy. It is a brilliant satire about the medical profession. It is not a pretty picture. Healthy patients are killed by incompetent surgeons, who spend all their time making money outside the hospital. And yet, you really believe that this is a hospital. The producers were very careful to include real medical terminology and real medical cases. This movie really reveals how difficult in is to run a hospital, and how badly things already were in 1971. I loved this movie.


In [25]:
# instantiate the document text
doc = nlp(text)  # optionally skip components: nlp(text, disable=['parser', 'tagger', 'ner'])
# list the spaCy Doc methods and attributes
print(dir(doc))

['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '_bulk_merge', '_py_tokens', '_realloc', '_vector', '_vector_norm', 'cats', 'char_span', 'count_by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', 'from_disk', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'is_nered', 'is_parsed', 'is_sentenced', 'is_tagged', 'lang', 'lang_', 'mem', 'merge', 'noun_chunks', 'noun_chunks_iterator', 'print_tree', 'remove_extension', 'retokenize', 'sentiment', 'sents', 'set_extension', 'similarity', 'tensor', 'text', 'text_with_ws', 'to_array', 'to_bytes', 'to_disk', 'to_json', 'user_data', 'user_hooks', 'user_span_hooks'

### NLP Pipeline

When you read text into spaCy, e.g. `doc = nlp(text)`, you apply a pipeline of NLP processes to it.
By default spaCy applies a tagger, a parser, and an NER component, but you can add, replace, or remove these steps.
Note: Removing steps that a given task does not need can substantially decrease processing time.

In [6]:
from IPython.display import display, HTML
# SpaCy pipeline
spacy_url = 'https://spacy.io/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg'
iframe = '<iframe src={} width=1000 height=200></iframe>'.format(spacy_url)
HTML(iframe)



### Tokenization

SpaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas "U.K." should remain one token. 

In [7]:
# spaCy tokenization diagram
spacy_url = 'https://spacy.io/tokenization-57e618bd79d933c4ccd308b5739062d6.svg'
iframe = '<iframe src={} width=1500 height=200></iframe>'.format(spacy_url)
HTML(iframe)

In [8]:
# compare spaCy's tokenizer with a plain whitespace split
[ token.text for token in nlp(" let's go to N.Y.!")]," let's go to N.Y.!".split()

([' ', 'let', "'s", 'go', 'to', 'N.Y.', '!'], ["let's", 'go', 'to', 'N.Y.!'])

In [9]:
tok_doc=nlp("Some\nspaces  and\ttab characters") # Let's go to N.Y.!'
# tok_doc=nlp("Let's go to N.Y.!")
tokens_text = [t.text for t in tok_doc]
tokens_text

['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters']

spaCy

In [11]:
article2 = 'ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456'

In [12]:
import spacy
print('spaCy Version: %s' % spacy.__version__)

spaCy Version: 2.1.3


In [13]:
spacy_nlp = spacy.load('en_core_web_sm')

print('Original Article: %s' % (article2))
print()
doc = spacy_nlp(article2)
tokens = [token.text for token in doc]
print(tokens)

Original Article: ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456

['ConcateStringAnd123', 'ConcateSepcialCharacter_!@', '#', '!', '@#$%^&*()_+', '0123456']


spaCy first splits the text on whitespace and then applies further rules to each chunk, such as tokenizer exceptions and prefix and suffix splitting.
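The idea of splitting on whitespace and then peeling punctuation off each chunk can be sketched in plain Python. This is only a toy illustration, not spaCy's tokenizer: the exception set, the prefix and suffix characters, and the name `toy_tokenize` are all assumptions made for the example, and contractions such as "'s" are not handled.

```python
# Toy tokenizer: whitespace split, then prefix/suffix peeling with exceptions.
# EXCEPTIONS, PREFIX_CHARS and SUFFIX_CHARS are hypothetical, not spaCy's rules.
EXCEPTIONS = {"U.K.", "N.Y."}
PREFIX_CHARS = "([{\"'"
SUFFIX_CHARS = ")]}\"'.,!?;:"


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel characters off both ends until the chunk is a known
        # exception or has no leading/trailing punctuation left.
        while chunk and chunk not in EXCEPTIONS:
            if chunk[0] in PREFIX_CHARS:
                prefixes.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIX_CHARS:
                suffixes.append(chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        tokens.extend(prefixes)
        if chunk:
            tokens.append(chunk)
        # Suffixes were collected from the outside in, so reverse them.
        tokens.extend(reversed(suffixes))
    return tokens


print(toy_tokenize("Let's go to N.Y.!"))
print(toy_tokenize("(text)."))
```

Because "N.Y." is listed as an exception, the trailing "!" is split off while the abbreviation stays whole, which mirrors the "U.K." behaviour described earlier.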

NLTK

In [14]:
import nltk
print('NLTK Version: %s' % nltk.__version__)

NLTK Version: 3.4


In [15]:
print('Original Article: %s' % (article2))
print()
print(nltk.word_tokenize(article2))

Original Article: ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456

['ConcateStringAnd123', 'ConcateSepcialCharacter_', '!', '@', '#', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_+', '0123456']


The behavior differs slightly from spaCy's. NLTK treats most special characters as separate tokens, except "_", which stays attached to a neighbouring token. The number is kept as a single token, as in spaCy.

#### Stopwords

spaCy

In [16]:
# import a list of stop words from SpaCy
from spacy.lang.en.stop_words import STOP_WORDS

print('Number of Spacy stop words: %d' % len(list(STOP_WORDS)))
print('Example stop words: {}'.format(list(STOP_WORDS)[0:10]))

Number of Spacy stop words: 312
Example stop words: ['make', 'therein', 'we', 'quite', 'serious', 'hundred', 'again', 'most', 'enough', 'more']


In [17]:
# removing stopwords
article = 'In computing, stop words are words which are filtered out before or after processing of natural language data (text).[1] Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.'

doc = spacy_nlp(article)
tokens = [token.text for token in doc if not token.is_stop]
print('Original Article: %s' % (article))
print()
print(tokens)

Original Article: In computing, stop words are words which are filtered out before or after processing of natural language data (text).[1] Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

['computing', ',', 'stop', 'words', 'words', 'filtered', 'processing', 'natural', 'language', 'data', '(', 'text).[1', ']', '"', 'stop', 'words', '"', 'usually', 'refers', 'common', 'words', 'language', ',', 'single', 'universal', 'list', 'stop', 'words', 'natural', 'language', 'processing', 'tools', ',', 'tools', 'use', 'list', '.', 'tools', 'specifically', 'avoid', 'removing', 'stop', 'words', 'support', 'phrase', 'search', '.']


In [18]:
# Add custom stop words

customize_stop_words = ['computing', 'filtered']

for w in customize_stop_words:
    spacy_nlp.vocab[w].is_stop = True
    
doc = spacy_nlp(article)
tokens = [token.text for token in doc if not token.is_stop]
print('Original Article: %s' % (article))
print()
print(tokens)

Original Article: In computing, stop words are words which are filtered out before or after processing of natural language data (text).[1] Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

[',', 'stop', 'words', 'words', 'processing', 'natural', 'language', 'data', '(', 'text).[1', ']', '"', 'stop', 'words', '"', 'usually', 'refers', 'common', 'words', 'language', ',', 'single', 'universal', 'list', 'stop', 'words', 'natural', 'language', 'processing', 'tools', ',', 'tools', 'use', 'list', '.', 'tools', 'specifically', 'avoid', 'removing', 'stop', 'words', 'support', 'phrase', 'search', '.']


NLTK

In [20]:
import nltk 
print('NLTK Version: %s' % (nltk.__version__))

nltk.download('stopwords')


nltk_stopwords = nltk.corpus.stopwords.words('english')
print('Number of stop words: %d' % len(nltk_stopwords))
print('First ten stop words: %s' % list(nltk_stopwords)[:10])

NLTK Version: 3.4


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\saurabhkumar9\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Number of stop words: 179
First ten stop words: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


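Common words such as "are" and "the" are removed by both libraries, but the lists differ. spaCy's list is larger (312 entries against 179 for NLTK), so "indeed", "used" and "even" are removed by spaCy but kept by NLTK. NLTK also keeps "In", "Though" and "Some", because its list is lower-case and the membership test below is case-sensitive.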

In [21]:
tokens = nltk.tokenize.word_tokenize(article)
tokens = [token for token in tokens if token not in nltk_stopwords]

print('Original Article: %s' % (article))
print()
print(tokens)

Original Article: In computing, stop words are words which are filtered out before or after processing of natural language data (text).[1] Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

['In', 'computing', ',', 'stop', 'words', 'words', 'filtered', 'processing', 'natural', 'language', 'data', '(', 'text', ')', '.', '[', '1', ']', 'Though', '``', 'stop', 'words', "''", 'usually', 'refers', 'common', 'words', 'language', ',', 'single', 'universal', 'list', 'stop', 'words', 'used', 'natural', 'language', 'processing', 'tools', ',', 'indeed', 'tools', 'even', 'use', 'list', '.', 'Some', 'tools', 'specifically', 'avoid', 'removing', 'stop', 'words', 'support', 'phrase', 'search', '.']
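The NLTK output above still contains capitalized words such as "In" and "Though", since the stop list holds lower-case entries only. A minimal sketch of a case-insensitive filter follows; the tiny `STOP_WORDS` set and the function name `remove_stop_words` are hypothetical and stand in for a real list such as `nltk_stopwords`.

```python
# Hypothetical mini stop list for illustration; real lists are much longer.
STOP_WORDS = {"in", "are", "which", "the", "of", "though", "some"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lower-cased form appears in the stop list."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ["In", "computing", "stop", "words", "are", "words", "which", "are", "filtered"]
print(remove_stop_words(tokens))
```

Converting the stop list to a `set` also makes each membership test constant-time, which matters on large corpora.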


#### Stemming and lemmatization

From the [Information Retrieval](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) textbook:

Are the below words the same?

`organize, organizes, and organizing`

`democracy, democratic, and democratization`

Stemming and lemmatization both reduce a word to a root form.

Lemmatization uses rules about the language, so the resulting tokens are all actual words.

"Stemming is the poor-man’s lemmatization." (Noah Smith, 2011) Stemming is a crude heuristic that chops the ends off of words. The resulting tokens may not be actual words, but stemming is faster.
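The contrast can be sketched with two toy functions: a stemmer that blindly chops suffixes and a lemmatizer that looks words up in a table. Both are assumptions made for illustration (the suffix list and `LEMMA_TABLE` are hypothetical); neither is the Porter algorithm nor a real lemmatizer.

```python
# Toy suffix list, checked longest-first; hypothetical, not the Porter rules.
SUFFIXES = ("izing", "izes", "ize", "ing", "es", "s")

# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMA_TABLE = {"organizes": "organize", "organizing": "organize", "feet": "foot"}


def crude_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word itself."""
    return LEMMA_TABLE.get(word, word)


words = ["organize", "organizes", "organizing", "feet"]
print([crude_stem(w) for w in words])
print([lookup_lemma(w) for w in words])
```

The stemmer needs no vocabulary and is cheap, but it cannot map an irregular form like "feet" to "foot"; the lookup can, at the cost of maintaining the table.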

In [11]:
# word_list = ['feet', 'foot', 'foots', 'footing']
word_list=['organize', 'organizes', 'organizing']

In [12]:
from nltk import stem
wnl = stem.WordNetLemmatizer()
porter = stem.porter.PorterStemmer()

In [13]:
[porter.stem(word) for word in word_list]

['organ', 'organ', 'organ']

In [14]:
# Lemmatization using nltk
[wnl.lemmatize(word) for word in word_list]

['organize', 'organizes', 'organizing']

**Spacy Lemmatization example**
* Adjectives: happier, happiest → happy
* Adverbs: worse, worst → badly
* Nouns: dogs, children → dog, child
* Verbs: writes, writing, wrote, written → write

In [15]:
# Lemmatization using spacy
for token in nlp(" ".join(word_list)):
    print(token.text, token.lemma_)

organize organize
organizes organize
organizing organize


In [22]:
# spaCy vs NLTK

article = "Lemmatisation (or lemmatization) in linguistics is the process of grouping together \
the inflected forms of a word so they can be analysed as a single item, identified by the word's \
lemma, or dictionary form."

#.............................................................

import spacy
print('spaCy Version: %s' % (spacy.__version__))
spacy_nlp = spacy.load('en_core_web_sm')

doc = spacy_nlp(article)
tokens = [token.text for token in doc]

print('Original Article: %s' % (article))
print()

for token in doc:
    if token.text != token.lemma_:
        print('Original : %s, New: %s' % (token.text, token.lemma_))
        
#.............................................................

import nltk 
print('NLTK Version: %s' % (nltk.__version__))

nltk.download('wordnet')

wordnet_lemmatizer = nltk.stem.WordNetLemmatizer()

tokens = nltk.word_tokenize(article)

print('Original Article: %s' % (article))
print()

for token in tokens:
    lemmatized_token = wordnet_lemmatizer.lemmatize(token)
    
    if token != lemmatized_token:
        print('Original : %s, New: %s' % (token, lemmatized_token))

spaCy Version: 2.1.3
Original Article: Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

Original : Lemmatisation, New: lemmatisation
Original : linguistics, New: linguistic
Original : is, New: be
Original : grouping, New: group
Original : inflected, New: inflect
Original : forms, New: form
Original : they, New: -PRON-
Original : analysed, New: analyse
Original : identified, New: identify
NLTK Version: 3.4


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\saurabhkumar9\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


Original Article: Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

Original : forms, New: form
Original : as, New: a


spaCy converts the word to lower case and reduces inflected forms (past tense, gerund and so on) to the base form. It also normalizes "they" to "-PRON-", the placeholder lemma it uses for pronouns.

The NLTK result is very different. Only two words are lemmatized, and one of them, "as", looks wrong. `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part of speech is passed, so the trailing "s" is stripped as if it were a plural ending and "as" becomes "a".

### Spelling Correction

#### Correcting spelling errors within an edit distance of 2

When dealing with text, we often have to deal with misspelled text. Character embeddings and word embeddings can still produce similar vectors for a misspelled word, which helps with unseen data and out-of-vocabulary (OOV) words. However, results are better if we can correct the typos.

Typos arise in several scenarios. If you work on optical character recognition (OCR), post-processing the OCR output is critical, because errors are introduced by poor image quality and by the OCR engine itself. Another source of typos is human input. When you work on a chatbot project, the input comes from people and will inevitably include typos.

To achieve a better result, it is best to correct typos as early as possible.

In 2007 Norvig published a very simple but remarkably effective spelling corrector. It generates possible candidate corrections in several ways and picks the most likely word among them. The corrector works in two phases.

First, it generates new words whose edit distance from the original word is 1, using four kinds of edit. Unlike plain Levenshtein distance, these include transposition:

1. Deletion: Remove one letter
2. Transposition: Swap two adjacent letters
3. Replacement: Change one letter to another
4. Insertion: Add one letter

Taking “edward” as an example, “edwar”, “edwadr”, “edwadd” and “edwward” are examples of “Deletion”, “Transposition”, “Replacement” and “Insertion” respectively. Obviously, this generates a large number of invalid words, so the candidates are filtered against a given vocabulary (the library calls these “known words”). To expand the set of potential candidates, the algorithm repeats this step with an edit distance of 2.

The second phase selects one candidate based on probability. For example, if “Edward” accounts for 2% of the word occurrences in the given dictionary, its probability is 0.02. The candidate with the highest probability is chosen.
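The two phases described above can be sketched in a few lines of plain Python, in the spirit of Norvig's corrector. This is a minimal sketch under stated assumptions: the word counts in `WORD_COUNTS` are made up for the example (in practice they would come from a corpus), and the function names are chosen here for illustration.

```python
import string
from collections import Counter

# Hypothetical word counts; a real corrector would count words in a corpus.
WORD_COUNTS = Counter({"edward": 20, "edwards": 5, "reward": 3, "the": 100})
TOTAL = sum(WORD_COUNTS.values())


def edits1(word):
    """All strings one edit away: deletion, transposition, replacement, insertion."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings two edits away."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def known(words):
    """Keep only the words that appear in the vocabulary."""
    return {w for w in words if w in WORD_COUNTS}


def candidates(word):
    # Prefer the word itself, then distance 1, then distance 2.
    return known([word]) or known(edits1(word)) or known(edits2(word)) or {word}


def correction(word):
    """Pick the candidate with the highest relative frequency."""
    return max(candidates(word), key=lambda w: WORD_COUNTS[w] / TOTAL)


print(correction("edwardd"))
```

With these toy counts both "edward" and "edwards" are one edit away from "edwardd", and the more frequent "edward" wins.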

Implementation:

A corpus is needed to support the spell check. For ease of demonstration, I simply use a dataset from the sklearn library with only minimal pre-processing. You should use a domain-specific dataset to build a better corpus for your data.

In [1]:
# Building Corpus:

from collections import Counter
from sklearn.datasets import fetch_20newsgroups
import re
corpus = []
for line in fetch_20newsgroups().data:
    line = line.replace('\n', ' ').replace('\t', ' ').lower()
    line = re.sub('[^a-z ]', ' ', line)
    tokens = line.split(' ')
    tokens = [token for token in tokens if len(token) > 0]
    corpus.extend(tokens)
corpus = Counter(corpus)

In [3]:
%reload_ext autoreload
%autoreload 2

import sys, os
def add_aion(curr_path=None):
    if curr_path is None:
        dir_path = os.getcwd()
        target_path = os.path.dirname(os.path.dirname(dir_path))
        if target_path not in sys.path:
#             print('Added %s into sys.path.' % (target_path))
            sys.path.insert(0, target_path)
            
add_aion()

In [4]:
# Correction

from aion.util.spell_corrector import SpellCorrector

spell_corrector = SpellCorrector(dictionary=corpus, verbose=1)
spell_corrector.correction('edwardd')

ModuleNotFoundError: No module named 'aion'

### String Matching

In [7]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process



Formula: 2*(Matched Characters) / (len(String A) + len(String B))
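The formula can be checked by hand with the standard library. The sketch below uses `difflib.SequenceMatcher` to count matched characters and applies the formula directly; the helper name `simple_ratio` is an assumption for this example, and fuzzywuzzy's own scorers may differ in preprocessing and backend.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Return 2 * matched characters / (len(a) + len(b)), scaled to 0-100."""
    if not a and not b:
        return 100
    matcher = SequenceMatcher(None, a, b)
    # Sum the sizes of all matching blocks to get the matched character count.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return round(100 * 2 * matched / (len(a) + len(b)))


# "Edward" (6 chars) matches fully inside "Edwards" (7 chars): 2 * 6 / 13
print(simple_ratio("Edward", "Edwards"))  # 92
```

The value agrees with the score of 92 reported for 'Edward' against 'Edwards' in the `fuzz.ratio` cell of this notebook.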

In [20]:
# Default scorer is Weighted Ratio (fuzz.WRatio)

countries = ['Afghanistan','Åland Islands','Albania', 'Algeria','American Samoa', 'HongKung']

for location in ['Hong Kong', 'jepen', 'United tates']:
    result = process.extract(location, countries, limit=2)
    print(result)

[('HongKung', 82), ('Afghanistan', 30)]
[('American Samoa', 21), ('Afghanistan', 18)]
[('Åland Islands', 42), ('Afghanistan', 35)]


In [19]:
# ratio
process.extract('Edward', ['Edwards', 'Edwards2', 'drawdE'], scorer=fuzz.ratio, limit =1)
# limit caps the number of matches returned

[('Edwards', 92)]

In [17]:
# quick ratio (QRatio)
process.extract('Edward', ['Edwards', 'Edwards2', 'drawdE', 'Edy'], scorer=fuzz.QRatio, limit = 4)

[('Edwards', 92), ('Edwards2', 86), ('Edy', 44), ('drawdE', 17)]

In [None]:
# returning file path
>>> process.extractOne("System of a down - Hypnotize - Heroin", songs)
    ('/music/library/good/System of a Down/2005 - Hypnotize/01 - Attack.mp3', 86)
>>> process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio)
    ("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61)

### Part-of-speech (POS) Tagging

After tokenization, spaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following "the" in English is most likely a noun.

Annotation | Description
:----- |:------|
Text |The original word text|
Lemma |The base form of the word.|
POS |The simple part-of-speech tag.|
Tag |The detailed part-of-speech tag.|
Dep |Syntactic dependency, i.e. the relation between tokens.|
Shape |The word shape – capitalisation, punctuation, digits.|
Is Alpha |Is the token an alpha character?|
Is Stop |Is the token part of a stop list, i.e. the most common words of the language?|

In [26]:
# review document
doc

This is one of the greatest films ever made. Brilliant acting by George C. Scott and Diane Riggs. This movie is both disturbing and extremely deep. Don't be fooled into believing this is just a comedy. It is a brilliant satire about the medical profession. It is not a pretty picture. Healthy patients are killed by incompetent surgeons, who spend all their time making money outside the hospital. And yet, you really believe that this is a hospital. The producers were very careful to include real medical terminology and real medical cases. This movie really reveals how difficult in is to run a hospital, and how badly things already were in 1971. I loved this movie.

In [27]:
# check if POS tags were added to the doc in the NLP pipeline
doc.is_tagged

True

In [28]:
# print column headers
print('{:15} | {:15} | {:8} | {:8} | {:11} | {:8} | {:8} | {:8} | '.format(
    'TEXT','LEMMA_','POS_','TAG_','DEP_','SHAPE_','IS_ALPHA','IS_STOP'))

# print various SpaCy POS attributes
for token in doc:
    print('{:15} | {:15} | {:8} | {:8} | {:11} | {:8} | {:8} | {:8} |'.format(
          token.text, token.lemma_, token.pos_, token.tag_, token.dep_
        , token.shape_, token.is_alpha, token.is_stop))

TEXT            | LEMMA_          | POS_     | TAG_     | DEP_        | SHAPE_   | IS_ALPHA | IS_STOP  | 
This            | This            | DET      | DT       | nsubj       | Xxxx     |        1 |        1 |
is              | be              | VERB     | VBZ      | ROOT        | xx       |        1 |        1 |
one             | one             | NUM      | CD       | attr        | xxx      |        1 |        1 |
of              | of              | ADP      | IN       | prep        | xx       |        1 |        1 |
the             | the             | DET      | DT       | det         | xxx      |        1 |        1 |
greatest        | great           | ADJ      | JJS      | amod        | xxxx     |        1 |        0 |
films           | film            | NOUN     | NNS      | pobj        | xxxx     |        1 |        0 |
ever            | ever            | ADV      | RB       | advmod      | xxxx     |        1 |        1 |
made            | make            | VERB     | VBN    

* Text: The original word text.
* Lemma: The base form of the word.
* POS: The simple part-of-speech tag.
* Tag: The detailed part-of-speech tag.
* Dep: Syntactic dependency, i.e. the relation between tokens.
* Shape: The word shape (capitalization, punctuation, digits).
* Is alpha: Is the token an alpha character?
* Is stop: Is the token part of a stop list, i.e. the most common words of the language?

In [19]:
spacy.explain('JJ')

'adjective'

In [20]:
previous_token = doc[0]  # set first token

for token in doc[1:]:    
    # identify adjective noun pairs
    if previous_token.pos_ == 'ADJ' and token.pos_ == 'NOUN':
        print(f'{previous_token.text}_{token.text}')
    
    previous_token = token

greatest_films
brilliant_satire
medical_profession
pretty_picture
Healthy_patients
incompetent_surgeons
medical_terminology
medical_cases


##### Word sense disambiguation via part-of-speech tags

In [21]:
for token in doc[0:20]:
    print(f'{token.text}_{token.pos_}')

This_DET
is_AUX
one_NUM
of_ADP
the_DET
greatest_ADJ
films_NOUN
ever_ADV
made_VERB
._PUNCT
Brilliant_ADJ
acting_VERB
by_ADP
George_PROPN
C._PROPN
Scott_PROPN
and_CCONJ
Diane_PROPN
Riggs_PROPN
._PUNCT


In [22]:
spacy.explain('CARDINAL')

'Numerals that do not fall under another type'

### Named Entity Recognition (NER)

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product, or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Other libraries for named entity recognition include Stanford NER and NLTK NE_Chunk.

spaCy

In [29]:
ner_text = "When I told John that I wanted to move to Alaska, he warned me that I'd have trouble finding a Starbucks there."
ner_doc = nlp(ner_text)

In [30]:
print('{:10} | {:15}'.format('LABEL','ENTITY'))

for ent in ner_doc.ents[0:20]:
    print('{:10} | {:50}'.format(ent.label_, ent.text))

LABEL      | ENTITY         
PERSON     | John                                              
GPE        | Alaska                                            
ORG        | Starbucks                                         


In [31]:
# ent methods and attributes
print(dir(ent))

['_', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_recalculate_indices', '_vector', '_vector_norm', 'as_doc', 'conjuncts', 'doc', 'end', 'end_char', 'ent_id', 'ent_id_', 'ents', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'label', 'label_', 'lefts', 'lemma_', 'lower_', 'merge', 'n_lefts', 'n_rights', 'noun_chunks', 'orth_', 'remove_extension', 'rights', 'root', 'sent', 'sentiment', 'set_extension', 'similarity', 'start', 'start_char', 'string', 'subtree', 'text', 'text_with_ws', 'to_array', 'upper_', 'vector', 'vector_norm', 'vocab']


In [32]:
displacy.render(docs=ner_doc, style='ent', jupyter=True)

In [33]:
spacy.explain('GPE')

'Countries, cities, states'

In [34]:
doc1 = nlp("Larry Page founded Google") #"Apple is looking at buying U.K. startup for $1 billion"
displacy.render(doc1, style="ent")

In [35]:
# entity visualization for the review text

# show visualization in Jupyter Notebook
displacy.render(docs=doc, style='ent', jupyter=True)

In [40]:
article = 'Original Sentence: The university was founded in 1885 by Leland and Jane Stanford in memory of their only child, Leland Stanford Jr., who had died of typhoid fever at age 15 the previous year. Stanford was a former Governor of California and U.S. Senator; he made his fortune as a railroad tycoon. The school admitted its first students on October 1, 1891,[2][3] as a coeducational and non-denominational institution.'

document = nlp(article)

print('Original Sentence: %s' % (article))
for element in document.ents:
    print('Type: %s, Value: %s' % (element.label_, element))

Original Sentence: Original Sentence: The university was founded in 1885 by Leland and Jane Stanford in memory of their only child, Leland Stanford Jr., who had died of typhoid fever at age 15 the previous year. Stanford was a former Governor of California and U.S. Senator; he made his fortune as a railroad tycoon. The school admitted its first students on October 1, 1891,[2][3] as a coeducational and non-denominational institution.
Type: DATE, Value: 1885
Type: GPE, Value: Leland
Type: PERSON, Value: Jane Stanford
Type: PERSON, Value: Leland Stanford Jr.
Type: DATE, Value: age 15 the previous year
Type: PERSON, Value: Stanford
Type: GPE, Value: California
Type: GPE, Value: U.S.
Type: ORDINAL, Value: first
Type: DATE, Value: October 1


In [42]:
article2 = 'New York, New York , NY N.Y. new york'

document = nlp(article2)

print('Original Sentence: %s' % (article2))
for element in document.ents:
    print('Type: %s, Value: %s' % (element.label_, element))

Original Sentence: New York, New York , NY N.Y. new york
Type: GPE, Value: New York
Type: GPE, Value: New York
Type: GPE, Value: N.Y.


In [44]:
displacy.render(docs=document, style='ent', jupyter=True)

1. Able to recognize both occurrences of “New York”
2. Tagged “N.Y.” as a location, but not “NY”
3. Unable to tag the lower-case “new york”

NLTK NE_Chunk

In [46]:
import nltk
print('NLTK version: %s' % (nltk.__version__))

from nltk import word_tokenize, pos_tag, ne_chunk

nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')

NLTK version: 3.4


[nltk_data] Downloading package words to
[nltk_data]     C:\Users\saurabhkumar9\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\saurabhkumar9\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\saurabhkumar9\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\saurabhkumar9\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping chunkers\maxent_ne_chunker.zip.


True

NLTK separates word tokenization, part-of-speech (POS) tagging and named entity recognition, so we need to call three functions to recognize named entities.

In [47]:
results = ne_chunk(pos_tag(word_tokenize(article)))

print('Original Sentence: %s' % (article))
print()
for x in str(results).split('\n'):
    if '/NNP' in x:
        print(x)

Original Sentence: Original Sentence: The university was founded in 1885 by Leland and Jane Stanford in memory of their only child, Leland Stanford Jr., who had died of typhoid fever at age 15 the previous year. Stanford was a former Governor of California and U.S. Senator; he made his fortune as a railroad tycoon. The school admitted its first students on October 1, 1891,[2][3] as a coeducational and non-denominational institution.

  (GPE Leland/NNP)
  (PERSON Jane/NNP Stanford/NNP)
  (GPE Leland/NNP)
  Stanford/NNP
  Jr./NNP
  (PERSON Stanford/NNP)
  Governor/NNP
  (GPE California/NNP)
  (GPE U.S/NNP)
  Senator/NNP
  October/NNP
  ]/NNP


In [50]:
results = ne_chunk(pos_tag(word_tokenize(article2)))
print('Original Sentence: %s' % (article2))
print()
for x in str(results).split('\n'):
    if '/NNP' in x:
        print(x)

Original Sentence: New York, New York , NY N.Y. new york

  (GPE New/NNP York/NNP)
  (GPE New/NNP York/NNP)
  (ORGANIZATION NY/NNP)
  N.Y./NNP


Stanford

https://towardsdatascience.com/named-entity-recognition-3fad3f53c91e

Stanford NER needs extra code to assemble an entity that spans more than one word. Its tagging was also the slowest of the three libraries.
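The extra step for multi-word entities can be sketched in plain Python. This is a minimal sketch, assuming the tagger returns a list of `(token, tag)` pairs with `'O'` marking non-entity tokens. The helper name `merge_entities` and the sample tags are hypothetical and written by hand; they are not actual Stanford NER output.

```python
from itertools import groupby

def merge_entities(tagged_tokens, outside_tag='O'):
    """Merge runs of consecutive tokens that share a tag into one entity."""
    entities = []
    # groupby yields one group per run of adjacent pairs with the same tag
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside_tag:
            entities.append((" ".join(tok for tok, _ in group), tag))
    return entities

# Hypothetical tagger output, written by hand for illustration
tagged = [('Jane', 'PERSON'), ('Stanford', 'PERSON'), ('was', 'O'),
          ('born', 'O'), ('in', 'O'), ('New', 'LOCATION'), ('York', 'LOCATION')]
print(merge_entities(tagged))
# [('Jane Stanford', 'PERSON'), ('New York', 'LOCATION')]
```

One limitation of this approach: two different entities of the same type that sit directly next to each other are merged into one.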

spaCy is the easiest of the three libraries to get entities from, and it needs no extra setup. Besides NER, it also supports GPUs and deep learning approaches.

NLTK NE_Chunk needs more setup (downloading pre-trained files), but that is a one-off cost. Its results look worse than those of the other two libraries.

### Fast Sentence Boundary Detection (SBD)

In [30]:
"This is a sentence. This is another sentence. let's go to N.Y.!".split('.')

['This is a sentence', ' This is another sentence', " let's go to N", 'Y', '!']

In [31]:
doc = nlp("This is a sentence. This is another sentence. let's go to N.Y.!") 

for sent in doc.sents:
    print(sent.text)

This is a sentence.
This is another sentence.
let's go to N.Y.!


In [32]:
tokens = ['i', 'want', 'to', 'go', 'to', 'school']  # expected bigrams: "i_want", "want_to", ...

In [33]:
def ngrams(tokens, n):
    length = len(tokens)
    grams = []
    for i in range(length - n + 1):
        grams.append("_".join(tokens[i:i+n]))
    return grams

In [34]:
spacy.load('en')

<spacy.lang.en.English at 0x7f04c5d07f28>

In [35]:
print(ngrams(tokens, 2))

['i_want', 'want_to', 'to_go', 'go_to', 'to_school']
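The same n-grams can also be built with `zip` over staggered slices of the token list. This is only an alternative sketch; the name `ngrams_zip` is made up here to avoid clashing with the `ngrams` function above.

```python
def ngrams_zip(tokens, n):
    # zip over n staggered views of the list: tokens[0:], tokens[1:], ...
    # zip stops at the shortest view, which gives len(tokens) - n + 1 grams
    return ["_".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

tokens = ['i', 'want', 'to', 'go', 'to', 'school']
print(ngrams_zip(tokens, 2))
# ['i_want', 'want_to', 'to_go', 'go_to', 'to_school']
print(ngrams_zip(tokens, 3))
# ['i_want_to', 'want_to_go', 'to_go_to', 'go_to_school']
```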


### Python Fundamentals
A brief overview of some advanced Python which will be used in future lessons

In [52]:
from collections import defaultdict,Counter

In [53]:
review = """This is one of the greatest films ever made. Brilliant acting by George C. Scott and Diane Riggs.
This movie is both disturbing and extremely deep. Don't be fooled into believing this is just a comedy. 
It is a brilliant satire about the medical profession. It is not a pretty picture. Healthy patients are killed by incompetent surgeons, who spend all their time making money outside the hospital. And yet, you really believe that this is a hospital. The producers were very careful to include real medical terminology and real medical cases.
This movie really reveals how difficult in is to run a hospital, and how badly things already were in 1971. I loved this movie.""".strip()

example_doc = nlp(review)

In [54]:
# CLUMSY APPROACH - a plain dict raises KeyError on a missing key,
# so every new word has to be handled with try/except

vocab = {}
for word in example_doc:
    try:
        vocab[word.text] += 1
    except KeyError:
        vocab[word.text] = 1

print(vocab)

{'This': 3, 'is': 7, 'one': 1, 'of': 1, 'the': 3, 'greatest': 1, 'films': 1, 'ever': 1, 'made': 1, '.': 11, 'Brilliant': 1, 'acting': 1, 'by': 2, 'George': 1, 'C.': 1, 'Scott': 1, 'and': 4, 'Diane': 1, 'Riggs': 1, '\n': 3, 'movie': 3, 'both': 1, 'disturbing': 1, 'extremely': 1, 'deep': 1, 'Do': 1, "n't": 1, 'be': 1, 'fooled': 1, 'into': 1, 'believing': 1, 'this': 3, 'just': 1, 'a': 5, 'comedy': 1, 'It': 2, 'brilliant': 1, 'satire': 1, 'about': 1, 'medical': 3, 'profession': 1, 'not': 1, 'pretty': 1, 'picture': 1, 'Healthy': 1, 'patients': 1, 'are': 1, 'killed': 1, 'incompetent': 1, 'surgeons': 1, ',': 3, 'who': 1, 'spend': 1, 'all': 1, 'their': 1, 'time': 1, 'making': 1, 'money': 1, 'outside': 1, 'hospital': 3, 'And': 1, 'yet': 1, 'you': 1, 'really': 2, 'believe': 1, 'that': 1, 'The': 1, 'producers': 1, 'were': 2, 'very': 1, 'careful': 1, 'to': 2, 'include': 1, 'real': 2, 'terminology': 1, 'cases': 1, 'reveals': 1, 'how': 2, 'difficult': 1, 'in': 2, 'run': 1, 'badly': 1, 'things': 1, '

In [55]:
??defaultdict

In [56]:
d = defaultdict(int)  # define the type of data the dict stores

for word in example_doc:
    d[word.text] += 1  # can add to unassigned keys

print(d)

defaultdict(<class 'int'>, {'This': 3, 'is': 7, 'one': 1, 'of': 1, 'the': 3, 'greatest': 1, 'films': 1, 'ever': 1, 'made': 1, '.': 11, 'Brilliant': 1, 'acting': 1, 'by': 2, 'George': 1, 'C.': 1, 'Scott': 1, 'and': 4, 'Diane': 1, 'Riggs': 1, '\n': 3, 'movie': 3, 'both': 1, 'disturbing': 1, 'extremely': 1, 'deep': 1, 'Do': 1, "n't": 1, 'be': 1, 'fooled': 1, 'into': 1, 'believing': 1, 'this': 3, 'just': 1, 'a': 5, 'comedy': 1, 'It': 2, 'brilliant': 1, 'satire': 1, 'about': 1, 'medical': 3, 'profession': 1, 'not': 1, 'pretty': 1, 'picture': 1, 'Healthy': 1, 'patients': 1, 'are': 1, 'killed': 1, 'incompetent': 1, 'surgeons': 1, ',': 3, 'who': 1, 'spend': 1, 'all': 1, 'their': 1, 'time': 1, 'making': 1, 'money': 1, 'outside': 1, 'hospital': 3, 'And': 1, 'yet': 1, 'you': 1, 'really': 2, 'believe': 1, 'that': 1, 'The': 1, 'producers': 1, 'were': 2, 'very': 1, 'careful': 1, 'to': 2, 'include': 1, 'real': 2, 'terminology': 1, 'cases': 1, 'reveals': 1, 'how': 2, 'difficult': 1, 'in': 2, 'run': 1,

In [57]:
somedict = {'a':2}
print(somedict[3]) # KeyError



KeyError: 3

In [58]:
someddict = defaultdict(int)
print(someddict[3]) # a missing key gets the default int(), i.e. 0

0
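For comparison, the same counting can be done with a plain dict and `dict.get`, and `defaultdict` accepts other factories such as `list`. A small sketch on a hand-written word list (not the review above):

```python
from collections import defaultdict

words = ['to', 'be', 'or', 'not', 'to', 'be']

# Plain dict: get() returns the default 0 when the key is missing
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1
print(counts)           # {'to': 2, 'be': 2, 'or': 1, 'not': 1}

# defaultdict(list): collect word positions without initialising each key
positions = defaultdict(list)
for i, w in enumerate(words):
    positions[w].append(i)
print(dict(positions))  # {'to': [0, 4], 'be': [1, 5], 'or': [2], 'not': [3]}
```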


In [59]:
# wrap the counts in a Counter, which prints its items from most to least common
print(Counter(d))

Counter({'.': 11, 'is': 7, 'a': 5, 'and': 4, 'This': 3, 'the': 3, '\n': 3, 'movie': 3, 'this': 3, 'medical': 3, ',': 3, 'hospital': 3, 'by': 2, 'It': 2, 'really': 2, 'were': 2, 'to': 2, 'real': 2, 'how': 2, 'in': 2, 'one': 1, 'of': 1, 'greatest': 1, 'films': 1, 'ever': 1, 'made': 1, 'Brilliant': 1, 'acting': 1, 'George': 1, 'C.': 1, 'Scott': 1, 'Diane': 1, 'Riggs': 1, 'both': 1, 'disturbing': 1, 'extremely': 1, 'deep': 1, 'Do': 1, "n't": 1, 'be': 1, 'fooled': 1, 'into': 1, 'believing': 1, 'just': 1, 'comedy': 1, 'brilliant': 1, 'satire': 1, 'about': 1, 'profession': 1, 'not': 1, 'pretty': 1, 'picture': 1, 'Healthy': 1, 'patients': 1, 'are': 1, 'killed': 1, 'incompetent': 1, 'surgeons': 1, 'who': 1, 'spend': 1, 'all': 1, 'their': 1, 'time': 1, 'making': 1, 'money': 1, 'outside': 1, 'And': 1, 'yet': 1, 'you': 1, 'believe': 1, 'that': 1, 'The': 1, 'producers': 1, 'very': 1, 'careful': 1, 'include': 1, 'terminology': 1, 'cases': 1, 'reveals': 1, 'difficult': 1, 'run': 1, 'badly': 1, 'thing

In [60]:
most_common=Counter(d).most_common(10)
most_common

[('.', 11),
 ('is', 7),
 ('a', 5),
 ('and', 4),
 ('This', 3),
 ('the', 3),
 ('\n', 3),
 ('movie', 3),
 ('this', 3),
 ('medical', 3)]

In [61]:
most_common=Counter(review.split()).most_common(4)
most_common

[('is', 7), ('a', 5), ('and', 4), ('This', 3)]
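`Counter` can also be built directly from any iterable, without going through a `defaultdict` first. A short sketch on toy data:

```python
from collections import Counter

c = Counter(['a', 'b', 'a', 'c', 'a', 'b'])
print(c.most_common(2))    # [('a', 3), ('b', 2)]
print(c['missing'])        # 0, missing keys do not raise KeyError

c.update(['c', 'c', 'c'])  # add more observations to the existing counts
print(c.most_common(1))    # [('c', 4)]
```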

#### List
Unpacking and slicing

In [46]:
elems = [1, 2, 3, 4]
a, b, c, d = elems
print(a, b, c, d)

1 2 3 4


In [47]:
a, *new_elems, d = elems
print(a)
print(new_elems)
print(d)

1
[2, 3]
4


In [48]:
elems = list(range(10))
print(elems)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [49]:
print(elems[::-1]) # reversed copy: 9, 8, 7, 6, ...

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]


In [50]:
elems[::2],elems[-2::-2]

([0, 2, 4, 6, 8], [8, 6, 4, 2, 0])

#### List Comprehension

In [62]:
nums = [1,2,3,4,5]
nums_squared = [num * num for num in nums if num %2==0]
print(nums_squared)

[4, 16]


In [52]:
# the equivalent for loop
nums_squared = []
for num in nums:
    if num % 2 == 0:
        nums_squared.append(num * num)

nums_squared

[4, 16]

#### Lambda, map, filter, reduce

In [63]:
def square_fn(x):
    return x * x

square_ld = lambda x: x * x

In [66]:
print(square_fn(5))
print(square_ld(2))

25
4


In [67]:
nums

[1, 2, 3, 4, 5]

In [68]:
nums_squared_1 = map(square_fn, nums)
nums_squared_2 = map(lambda x: x * x, nums)
print(list(nums_squared_1))
print(list(nums_squared_2))


[1, 4, 9, 16, 25]
[1, 4, 9, 16, 25]


In [70]:
filtered_Values = filter(lambda x: x*x > 10, nums)
print(list(filtered_Values))

[4, 5]
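`reduce` lives in `functools` in Python 3 and folds a sequence into a single value by applying a two-argument function cumulatively. A minimal sketch:

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]

# ((((1 + 2) + 3) + 4) + 5)
total = reduce(lambda acc, x: acc + x, nums)

# the third argument is the initial value of the accumulator
product = reduce(lambda acc, x: acc * x, nums, 1)

print(total)    # 15
print(product)  # 120
```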


#### IMDB REVIEW DATA EXPLORATION 

In [58]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

/kaggle/input/usinlppracticum/sample_submission.csv
/kaggle/input/usinlppracticum/imdb_test.csv
/kaggle/input/usinlppracticum/imdb_train.csv


In [59]:
data= pd.read_csv('/kaggle/input/usinlppracticum/imdb_train.csv')
data.head()

| | review | sentiment |
|:--|:--|:--|
| 0 | We had STARZ free weekend and I switched on th... | negative |
| 1 | I'll admit that this isn't a great film. It pr... | negative |
| 2 | I finally found a version of Persuasion that I... | positive |
| 3 | The BBC surpassed themselves with the boundari... | positive |
| 4 | Much praise has been lavished upon Farscape, b... | negative |


In [60]:
data.iloc[1,0]

'I\'ll admit that this isn\'t a great film. It practically screams "low-budget" yet oddly I still found myself liking the film because although it lacked quality it abounded with energy. It was like the Little Engine That Could and a movie merged into one! <br /><br />The film takes place at a radio network and concerns some of their low-level employees--two page boys (one very pushy and brash and the other one a wuss) as well as a new receptionist. All three have visions of radio stardom but must for now content themselves with their lowly jobs.<br /><br />Into this story appears a murder that seems somewhat out of the blue. I didn\'t know that this was a murder mystery film and was taken a bit by surprise. However, like most B-mysteries, the cops are lamebrains and it\'s up to our pushy hero (Moran) to try to save the day. Throughout all this, I had a hard time deciding if Moran was obnoxious or endearing. I\'m still not sure!! <br /><br />There is a moment in the film that is high o

##### Pandas Apply

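`apply` is a convenient way to apply a function to every element of a Series (a single column), or to every row or column of a DataFrame. `applymap` (renamed `DataFrame.map` in pandas 2.1 and later) does the same for every element of the entire DataFrame (e.g. converting all ints to floats).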

Example: https://chrisalbon.com/python/data_wrangling/pandas_apply_operations_to_dataframes/
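The difference between element-wise and row-wise `apply` can be sketched on a toy dataframe. The data below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'col1': range(0, 3), 'col2': range(3, 6)})

# Series.apply: the function receives one element at a time
doubled = df['col1'].apply(lambda x: x * 2)

# DataFrame.apply with axis=1: the function receives one row (a Series) at a time
row_sums = df.apply(lambda row: row['col1'] + row['col2'], axis=1)

print(doubled.tolist())   # [0, 2, 4]
print(row_sums.tolist())  # [3, 5, 7]
```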

In [61]:
# create a small dataframe with example data
example_data = {'col1':range(0,3),'col2':range(3,6)}
test_df = pd.DataFrame(example_data)
test_df

| | col1 | col2 |
|:--|:--|:--|
| 0 | 0 | 3 |
| 1 | 1 | 4 |
| 2 | 2 | 5 |


In [62]:
# apply a built-in function to each element in a column
test_df['col1'].apply(float)

0    0.0
1    1.0
2    2.0
Name: col1, dtype: float64

In [63]:
# apply a custom function to every element in a column
def add_five(x):
    return x + 5

test_df['col1'].apply(add_five)

0    5
1    6
2    7
Name: col1, dtype: int64

In [65]:
# apply an anonymous function to every element in a column
test_df['col1'].apply(lambda x: x+5)

0    5
1    6
2    7
Name: col1, dtype: int64

##### Sorted

sorted(iterable, key=None, reverse=False)

- Return a new sorted list from the items in iterable.
- Has two optional arguments which must be specified as keyword arguments.
- key specifies a function of one argument that is used to extract a comparison key from each list element: key=str.lower. The default value is None (compare the elements directly).
- reverse is a boolean value. If set to True, then the list elements are sorted as if each comparison were reversed.

SOURCE: https://docs.python.org/3/library/functions.html#sorted
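A short sketch of the `key` and `reverse` arguments on a toy list of words:

```python
words = ['banana', 'apple', 'Cherry']

# default comparison: uppercase letters sort before lowercase ones
print(sorted(words))                         # ['Cherry', 'apple', 'banana']

# key=str.lower compares the lowercased strings instead
print(sorted(words, key=str.lower))          # ['apple', 'banana', 'Cherry']

# longest first; items with equal length keep their original order
print(sorted(words, key=len, reverse=True))  # ['banana', 'Cherry', 'apple']

# sorted returns a new list and leaves the input untouched
print(words)                                 # ['banana', 'apple', 'Cherry']
```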

In [66]:
articles =[('article2', 3, 'za'),('article3', 2, 'yb'),('article1', 1, 'xc')]

In [67]:
sorted(articles)

[('article1', 1, 'xc'), ('article2', 3, 'za'), ('article3', 2, 'yb')]

In [68]:
sorted(articles, key=lambda x: x[1])

[('article1', 1, 'xc'), ('article3', 2, 'yb'), ('article2', 3, 'za')]

In [69]:
# sort by the second character of the last element in each tuple
sorted(articles, key=lambda x: x[2][1])

[('article2', 3, 'za'), ('article3', 2, 'yb'), ('article1', 1, 'xc')]

In [70]:
imdb_sample=data.sample(1000).reset_index(drop=True)

In [71]:
from tqdm import tqdm
tqdm.pandas()

IMDB sample: `apply` (`len`, tokenize) and `value_counts`


In [72]:
def clean_review(x):
    doc = nlp(x)  # run the spaCy pipeline on the raw review text
    # lemmatize and lowercase each token, skipping spaCy's '-PRON-' pronoun lemma
    narrative = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
    # drop stop words
    narrative = [i for i in narrative if i not in STOP_WORDS]
    return " ".join(narrative)

In [73]:
%%time
imdb_sample['clean'] = imdb_sample['review'].apply(clean_review)  # or .progress_apply(clean_review) for a tqdm progress bar

CPU times: user 58.2 s, sys: 704 ms, total: 58.9 s
Wall time: 56.2 s


**Next Lesson**
* Text Vectorization
* [Regex cheatsheet](https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/)


### References:
* https://spacy.io/api/language
* https://www.fast.ai/2019/07/08/fastai-nlp/
* https://github.com/makcedward/nlp

