# Fundamental NLP Concepts with spaCy

## spaCy is an NLP package for Python (implemented in Cython)
#### Common Tasks include:
* Tokenization
* Lemmatization - reducing inflectional forms of words into base or root word
* Part-of-speech Tagging - marking a word based on POS (definition and context)
* Entity Recognition - identify and segment Named Entities, and categorize them
* Dependency Parsing - analyze structure, relationships between Head words, and modifiers
* Sentence Recognition
* Word-to-vector Transformations
* Methods for Cleaning and Normalizing Text

#### Load spaCy's Pipeline (stored as nlp) - Invoke NLP on sample text to create a Doc object

In [4]:
import spacy

# The "en" shortcut only resolves in spaCy v2; v3 requires the full model name
nlp = spacy.load("en_core_web_sm")

doc = nlp("The big grey dog ate all of the chocolate, but fortunately he wasn't sick!")

## Tokenization
#### Splitting text into words, symbols, punctuation, spaces, etc. - creating Tokens
#### Can simply split the string on Whitespace

In [5]:
doc.text.split()

['The',
 'big',
 'grey',
 'dog',
 'ate',
 'all',
 'of',
 'the',
 'chocolate,',
 'but',
 'fortunately',
 'he',
 "wasn't",
 'sick!']

#### This Naive approach fails to separate *wasn't*, and leaves punctuation attached to words (*chocolate,* and *sick!*) - spaCy can handle this
#### Using .orth_ returns a string representation of each token, rather than a Token object

In [6]:
[token.orth_ for token in doc]

['The',
 'big',
 'grey',
 'dog',
 'ate',
 'all',
 'of',
 'the',
 'chocolate',
 ',',
 'but',
 'fortunately',
 'he',
 'was',
 "n't",
 'sick',
 '!']
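To make the difference from whitespace splitting concrete, here is a minimal standard-library sketch that reproduces the token list above for this one sentence. The pattern and the `simple_tokenize` name are assumptions made for illustration; this is not how spaCy's tokenizer works internally, and it only knows about the `n't` contraction.

```python
import re

# Illustrative pattern only; spaCy's tokenizer relies on language-specific
# rules and exception tables rather than a single regular expression.
TOKEN_PATTERN = re.compile(
    r"n't"            # the contraction suffix as its own token
    r"|\w+(?=n't)"    # a word stem that is directly followed by n't
    r"|\w+"           # any other run of word characters
    r"|[^\w\s]"       # a single punctuation character
)

def simple_tokenize(text):
    """Split text into word, contraction, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sentence = "The big grey dog ate all of the chocolate, but fortunately he wasn't sick!"
tokens = simple_tokenize(sentence)
print(tokens)
```

Other clitics, such as the possessive `'s`, would need their own rules, which is part of why a dedicated tokenizer is preferable.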

#### Token attributes offer string and integer representations of processed text
#### Return Strings - attributes with an underscore suffix (such as .orth_)
#### Return Integers - attributes without the underscore suffix (such as .orth)

In [7]:
[(token, token.orth_, token.orth) for token in doc]

[(The, 'The', 5059648917813135842),
 (big, 'big', 15511632813958231649),
 (grey, 'grey', 10475807793332549289),
 (dog, 'dog', 7562983679033046312),
 (ate, 'ate', 10806788082624814911),
 (all, 'all', 13409319323822384369),
 (of, 'of', 886050111519832510),
 (the, 'the', 7425985699627899538),
 (chocolate, 'chocolate', 10946593968795032542),
 (,, ',', 2593208677638477497),
 (but, 'but', 14560795576765492085),
 (fortunately, 'fortunately', 13851269277375979931),
 (he, 'he', 1655312771067108281),
 (was, 'was', 9921686513378912864),
 (n't, "n't", 2043519015752540944),
 (sick, 'sick', 14841597609857081305),
 (!, '!', 17494803046312582752)]
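The large integers above are hash values that stand in for the strings. The toy class below sketches the general idea of a two-way lookup between strings and integer IDs. `ToyStringStore` and its use of SHA-256 are assumptions made for illustration only; it is not spaCy's implementation and will not reproduce the numbers shown above.

```python
import hashlib

class ToyStringStore:
    """Two-way lookup between strings and integer IDs (illustration only)."""

    def __init__(self):
        self._by_id = {}

    def add(self, text):
        # Derive a stable 64-bit integer from the string's UTF-8 bytes.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        key = int.from_bytes(digest[:8], "big")
        self._by_id[key] = text
        return key

    def lookup(self, key):
        # Recover the original string from its integer ID.
        return self._by_id[key]

store = ToyStringStore()
dog_id = store.add("dog")
print(dog_id, store.lookup(dog_id))
```

Because the same string always maps to the same integer, tokens can be compared by ID without comparing the strings themselves.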

## Lemmatization
#### Reducing a word to its Base form - desirable approach to Standardizing
#### Use .lemma_

In [8]:
practice = "practice practiced practicing"

nlp_practice = nlp(practice)
[word.lemma_ for word in nlp_practice]

['practice', 'practice', 'practicing']

#### Doing this prior to certain tasks (such as Bag-of-words) can avoid duplication
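As a rough illustration of the duplication point, the sketch below counts words before and after mapping them to a base form. The `LEMMAS` dictionary is a hypothetical hand-written stand-in for `token.lemma_`; a real pipeline would take the lemmas from the processed Doc.

```python
from collections import Counter

# Hypothetical lemma lookup standing in for token.lemma_.
LEMMAS = {"dogs": "dog", "ate": "eat", "eats": "eat", "eating": "eat"}

words = ["dog", "dogs", "ate", "eats", "eating"]

# Bag-of-words on the raw strings: every inflected form is its own entry.
raw_counts = Counter(words)

# Bag-of-words on the base forms: inflected forms share one entry.
lemma_counts = Counter(LEMMAS.get(word, word) for word in words)

print(raw_counts)
print(lemma_counts)
```

The raw counts hold five separate entries, while the lemma counts collapse them into two.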

## POS Tagging
#### Assign grammatical properties to words (such as noun, verb, adjective, etc.) - useful in rule-based processes
#### For example, to determine who owns what in a given description - Exploit possessives 
#### Coarse-grained POS - Use .pos_
#### Fine-grained POS - Use .tag_

In [9]:
doc2 = nlp("Conor's dog's toy was hidden under the man's sofa in the woman's house")

pos_tags = [(i, i.tag_) for i in doc2]
pos_tags

[(Conor, 'NNP'),
 ('s, 'POS'),
 (dog, 'NN'),
 ('s, 'POS'),
 (toy, 'NN'),
 (was, 'VBD'),
 (hidden, 'VBN'),
 (under, 'IN'),
 (the, 'DT'),
 (man, 'NN'),
 ('s, 'POS'),
 (sofa, 'NN'),
 (in, 'IN'),
 (the, 'DT'),
 (woman, 'NN'),
 ('s, 'POS'),
 (house, 'NN')]

#### This can be used to Extract the owner, and what they own

In [11]:
owners_possessions = []
for i in pos_tags:
    if i[1] == "POS":
        owner = i[0].nbor(-1)
        possession = i[0].nbor(1)
        owners_possessions.append((owner, possession))

owners_possessions

[(Conor, dog), (dog, toy), (man, sofa), (woman, house)]

#### This returns a List of Tuples - It can also be done as a List Comprehension

In [12]:
[(i[0].nbor(-1), i[0].nbor(+1)) for i in pos_tags if i[1] == "POS"]

[(Conor, dog), (dog, toy), (man, sofa), (woman, house)]
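Both versions look up a neighbor on each side of the possessive marker, which assumes the marker never sits at the very start or end of the text. The sketch below applies the same rule to plain `(text, tag)` pairs copied from the `pos_tags` output and adds a bounds check. The `owner_possession_pairs` helper is a hypothetical name, and it works on tuples rather than Token objects so that it runs without a loaded model.

```python
# (text, fine-grained tag) pairs copied from the pos_tags output above.
tagged = [
    ("Conor", "NNP"), ("'s", "POS"), ("dog", "NN"), ("'s", "POS"),
    ("toy", "NN"), ("was", "VBD"), ("hidden", "VBN"), ("under", "IN"),
    ("the", "DT"), ("man", "NN"), ("'s", "POS"), ("sofa", "NN"),
    ("in", "IN"), ("the", "DT"), ("woman", "NN"), ("'s", "POS"),
    ("house", "NN"),
]

def owner_possession_pairs(pairs):
    """Return (owner, possession) for each possessive marker that has a neighbor on both sides."""
    results = []
    for index, (_, tag) in enumerate(pairs):
        # Skip markers at either edge, where a neighbor lookup would go out of range.
        if tag == "POS" and 0 < index < len(pairs) - 1:
            results.append((pairs[index - 1][0], pairs[index + 1][0]))
    return results

print(owner_possession_pairs(tagged))
```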

## Entity Recognition
#### Classify Named Entities found in text into Predefined Categories

In [13]:
wiki_obama = """Barack Obama is an American politician who served as the 44th 
President of the United States from 2009 to 2017. He is the first African American 
to have served as president, as well as the first born outside the contiguous United
States"""

nlp_obama = nlp(wiki_obama)
[(i, i.label_, i.label) for i in nlp_obama.ents]

[(Barack Obama, 'PERSON', 380),
 (American, 'NORP', 381),
 (44th, 'ORDINAL', 396),
 (the United States, 'GPE', 384),
 (2009, 'DATE', 391),
 (2017, 'DATE', 391),
 (first, 'ORDINAL', 396),
 (African, 'NORP', 381),
 (first, 'ORDINAL', 396),
 (United
  States, 'ORG', 383)]

#### Entities have been identified - PERSON, NORP, ORDINAL, GPE, DATE, ORG
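When a text contains many entities, grouping them by label can make the categories easier to scan. The sketch below does this with plain `(text, label)` pairs copied from the output above, so it runs without a loaded model; with a Doc, the same loop could presumably read `ent.text` and `ent.label_` from `doc.ents` instead.

```python
from collections import defaultdict

# (text, label) pairs copied from the entity output above,
# with the line break inside "United States" written as a space.
entities = [
    ("Barack Obama", "PERSON"), ("American", "NORP"), ("44th", "ORDINAL"),
    ("the United States", "GPE"), ("2009", "DATE"), ("2017", "DATE"),
    ("first", "ORDINAL"), ("African", "NORP"), ("first", "ORDINAL"),
    ("United States", "ORG"),
]

# Collect the entity texts under their label, preserving document order.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```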
## Sentence Recognition
#### NLP tasks frequently need to split a document into sentences - Use .sents

In [14]:
for ix, sent in enumerate(nlp_obama.sents, 1):
    print("Sentence number {}: {}".format(ix, sent))

Sentence number 1: Barack Obama is an American politician who served as the 44th 
President of the United States from 2009 to 2017.
Sentence number 2: He is the first African American 
to have served as president, as well as the first born outside the contiguous United
States
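The hard line breaks in `wiki_obama` carry through into the output above, for example in the entity printed as `United` / `States` across two lines. One simple cleaning step, sketched here with a hypothetical `normalize_whitespace` helper, is to collapse all whitespace before passing the text to `nlp`. Whether this changes the labels the model assigns has not been checked here.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(text.split())

wiki_obama = """Barack Obama is an American politician who served as the 44th
President of the United States from 2009 to 2017. He is the first African American
to have served as president, as well as the first born outside the contiguous United
States"""

clean_text = normalize_whitespace(wiki_obama)
print(clean_text)
```

The cleaned string can then be processed in the usual way, for example with `nlp(clean_text)`.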
