# Final Project
### Alex Ledgerwood

Github Repo: https://github.com/ALedgerwood/Module-7-Final-Project
Link to Tutorial: https://realpython.com/natural-language-processing-spacy-python/
Link to Text to be analyzed: https://readingwise.com/assets/uploads/pdf/The_Hill_We_Climb_Transcript.pdf

### import spacy and define nlp

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp


<spacy.lang.en.English at 0x150938efa60>

### call the .text attribute on each token to get the text contained within that token.

In [3]:
introduction_doc = nlp("This tutorial is about Natural Language Processing in spaCy.")
type(introduction_doc)

[token.text for token in introduction_doc]

['This',
 'tutorial',
 'is',
 'about',
 'Natural',
 'Language',
 'Processing',
 'in',
 'spaCy',
 '.']

### do the same thing but reading from a file, not typing it in

I could not find the introduction.txt file that the tutorial reads, so the code below, copied exactly from the tutorial, raises a FileNotFoundError.

In [4]:
import pathlib
file_name = "introduction.txt"
introduction_doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))
print([token.text for token in introduction_doc])

FileNotFoundError: [Errno 2] No such file or directory: 'introduction.txt'
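
A bare file name such as introduction.txt is resolved against the current working directory, so the error above means no file with that name exists there. A minimal sketch of creating the file before reading it follows. It assumes the file should hold the same sentence used in the previous cell, and the temporary directory is a hypothetical stand-in for the notebook's folder:

```python
import pathlib
import tempfile

# Assumption: the file should contain the sentence from the previous cell.
sample_text = "This tutorial is about Natural Language Processing in spaCy."

# A temporary directory stands in for the notebook's folder (hypothetical).
with tempfile.TemporaryDirectory() as folder:
    file_path = pathlib.Path(folder) / "introduction.txt"

    # Create the file first, then read it back the way the tutorial does.
    file_path.write_text(sample_text, encoding="utf-8")
    loaded_text = file_path.read_text(encoding="utf-8")

# A bare name such as "introduction.txt" is looked up in this directory.
print(pathlib.Path.cwd())
print(loaded_text)  # this string is what would be passed to nlp(...)
```

Saving introduction.txt next to the notebook, or passing a full path to pathlib.Path, should have the same effect.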

### the .sents property is used to extract sentences from the Doc object

In [5]:
>>> about_text = (
...     "Gus Proto is a Python developer currently"
...     " working for a London-based Fintech"
...     " company. He is interested in learning"
...     " Natural Language Processing."
... )
>>> about_doc = nlp(about_text)
>>> sentences = list(about_doc.sents)
>>> len(sentences)
2
>>> for sentence in sentences:
...     print(f"{sentence[:5]}...")

Gus Proto is a Python...
He is interested in learning...


### Creating custom delimiters in sentence detection

This uses an ellipsis (...) as a sentence delimiter.

In [6]:
>>> ellipsis_text = (
...     "Gus, can you, ... never mind, I forgot"
...     " what I was saying. So, do you think"
...     " we should ..."
... )

>>> from spacy.language import Language
>>> @Language.component("set_custom_boundaries")
... def set_custom_boundaries(doc):
...     """Add support to use `...` as a delimiter for sentence detection"""
...     for token in doc[:-1]:
...         if token.text == "...":
...             doc[token.i + 1].is_sent_start = True
...     return doc
...

>>> custom_nlp = spacy.load("en_core_web_sm")
>>> custom_nlp.add_pipe("set_custom_boundaries", before="parser")
>>> custom_ellipsis_doc = custom_nlp(ellipsis_text)
>>> custom_ellipsis_sentences = list(custom_ellipsis_doc.sents)
>>> for sentence in custom_ellipsis_sentences:
...     print(sentence)

Gus, can you, ...
never mind, I forgot what I was saying.
So, do you think we should ...


### Tokenization - showing that the token’s original index position in the string is still available as an attribute on Token

This could be useful for in-place word replacement later on.

In [7]:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> about_text = (
...     "Gus Proto is a Python developer currently"
...     " working for a London-based Fintech"
...     " company. He is interested in learning"
...     " Natural Language Processing."
... )
>>> about_doc = nlp(about_text)

>>> for token in about_doc:
...     print(token, token.idx)

Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32
working 42
for 50
a 54
London 56
- 62
based 63
Fintech 69
company 77
. 84
He 86
is 89
interested 92
in 103
learning 106
Natural 115
Language 123
Processing 132
. 142
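
One way the character offsets could support in-place replacement is sketched below. It uses a blank English pipeline (tokenizer only), an assumption made so that no trained model is needed, and the replacement word is hypothetical:

```python
import spacy

# Assumption: a blank pipeline is enough here because only the tokenizer is needed.
nlp = spacy.blank("en")
text = "Gus Proto is a Python developer."
doc = nlp(text)

target, replacement = "Python", "Go"  # hypothetical replacement

pieces = []
cursor = 0
for token in doc:
    if token.text == target:
        # Copy everything up to the token, then substitute the new word.
        pieces.append(text[cursor:token.idx])
        pieces.append(replacement)
        cursor = token.idx + len(token.text)
pieces.append(text[cursor:])  # remainder of the original string

new_text = "".join(pieces)
print(new_text)
```

Because token.idx points into the original string, the text around the replaced word, including spacing and punctuation, is carried over unchanged.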


### Other attributes for the token class
Use f-string formatting to output a table of some common attributes for each Token in the Doc:

.text_with_ws prints the token text along with any trailing space, if present.

.is_alpha indicates whether the token consists of alphabetic characters or not.

.is_punct indicates whether the token is a punctuation symbol or not.

.is_stop indicates whether the token is a stop word or not. 

In [8]:
>>> print(
...     f"{'Text with Whitespace':22}"
...     f"{'Is Alphabetic?':15}"
...     f"{'Is Punctuation?':18}"
...     f"{'Is Stop Word?'}"
... )
>>> for token in about_doc:
...     print(
...         f"{str(token.text_with_ws):22}"
...         f"{str(token.is_alpha):15}"
...         f"{str(token.is_punct):18}"
...         f"{str(token.is_stop)}"
...     )
...

Text with Whitespace  Is Alphabetic? Is Punctuation?   Is Stop Word?
Gus                   True           False             False
Proto                 True           False             False
is                    True           False             True
a                     True           False             True
Python                True           False             False
developer             True           False             False
currently             True           False             False
working               True           False             False
for                   True           False             True
a                     True           False             True
London                True           False             False
-                     False          True              False
based                 True           False             False
Fintech               True           False             False
company               True           False             False
.                  

### custom tokenization

In [9]:
>>> custom_about_text = (
...     "Gus Proto is a Python developer currently"
...     " working for a London@based Fintech"
...     " company. He is interested in learning"
...     " Natural Language Processing."
... )

>>> print([token.text for token in nlp(custom_about_text)[8:15]])

['for', 'a', 'London@based', 'Fintech', 'company', '.', 'He']


### To include the @ symbol as a custom infix, you need to build your own Tokenizer object

In [10]:
>>> import re
>>> from spacy.tokenizer import Tokenizer

>>> custom_nlp = spacy.load("en_core_web_sm")
>>> prefix_re = spacy.util.compile_prefix_regex(
...     custom_nlp.Defaults.prefixes
... )
>>> suffix_re = spacy.util.compile_suffix_regex(
...     custom_nlp.Defaults.suffixes
... )

>>> custom_infixes = [r"@"]

>>> infix_re = spacy.util.compile_infix_regex(
...     list(custom_nlp.Defaults.infixes) + custom_infixes
... )

>>> custom_nlp.tokenizer = Tokenizer(
...     custom_nlp.vocab,
...     prefix_search=prefix_re.search,
...     suffix_search=suffix_re.search,
...     infix_finditer=infix_re.finditer,
...     token_match=None,
... )

>>> custom_tokenizer_about_doc = custom_nlp(custom_about_text)

>>> print([token.text for token in custom_tokenizer_about_doc[8:15]])

['for', 'a', 'London', '@', 'based', 'Fintech', 'company']
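
Building a new Tokenizer as above replaces the whole tokenizer, so any exception rules that are not passed in again are left out. A lighter alternative, sketched here as an assumption rather than the tutorial's method, swaps only the infix rule on the existing tokenizer. It uses a blank English pipeline so that no trained model is required:

```python
import spacy

# Assumption: a blank English pipeline provides the same default tokenizer rules.
custom_nlp = spacy.blank("en")

# Extend the default infix patterns with "@" and recompile them.
infixes = list(custom_nlp.Defaults.infixes) + [r"@"]
infix_re = spacy.util.compile_infix_regex(infixes)

# Replace only the infix matcher; prefixes, suffixes, and exceptions stay as they are.
custom_nlp.tokenizer.infix_finditer = infix_re.finditer

tokens = [token.text for token in custom_nlp("a London@based Fintech company")]
print(tokens)
```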


### Stop Words
Stop words are typically defined as the most common words in a language.

With NLP, stop words are generally removed because they aren’t significant, and they heavily distort any word frequency analysis. 

In [11]:
>>> import spacy
>>> spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
>>> len(spacy_stopwords)
326
>>> for stop_word in list(spacy_stopwords)[:10]:
...     print(stop_word)

did
they
thru
yet
’m
would
whole
hereafter
say
under


### .is_stop attribute
You don’t need to access this list directly, though. You can remove stop words from the input text by using the .is_stop attribute of each token:

In [12]:
>>> custom_about_text = (
...     "Gus Proto is a Python developer currently"
...     " working for a London-based Fintech"
...     " company. He is interested in learning"
...     " Natural Language Processing."
... )
>>> nlp = spacy.load("en_core_web_sm")
>>> about_doc = nlp(custom_about_text)
>>> print([token for token in about_doc if not token.is_stop])

[Gus, Proto, Python, developer, currently, working, London, -, based, Fintech, company, ., interested, learning, Natural, Language, Processing, .]


### Lemmatization
Lemmatization reduces the inflected forms of a word to its root form, called a lemma, so that they can be analyzed as a single item. It can also help you normalize the text.

In [13]:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> conference_help_text = (
...     "Gus is helping organize a developer"
...     " conference on Applications of Natural Language"
...     " Processing. He keeps organizing local Python meetups"
...     " and several internal talks at his workplace."
... )
>>> conference_help_doc = nlp(conference_help_text)
>>> for token in conference_help_doc:
...     if str(token) != str(token.lemma_):
...         print(f"{str(token):>20} : {str(token.lemma_)}")

                  is : be
                  He : he
               keeps : keep
          organizing : organize
             meetups : meetup
               talks : talk


### Word Frequency
Once you've lemmatized and removed stop words, you can perform statistical analysis on the text.

The following code removes stop words and punctuation, then lists the most common words, which suggest what the text is about.

In [14]:
>>> import spacy
>>> from collections import Counter
>>> nlp = spacy.load("en_core_web_sm")
>>> complete_text = (
...     "Gus Proto is a Python developer currently"
...     " working for a London-based Fintech company. He is"
...     " interested in learning Natural Language Processing."
...     " There is a developer conference happening on 21 July"
...     ' 2019 in London. It is titled "Applications of Natural'
...     ' Language Processing". There is a helpline number'
...     " available at +44-1234567891. Gus is helping organize it."
...     " He keeps organizing local Python meetups and several"
...     " internal talks at his workplace. Gus is also presenting"
...     ' a talk. The talk will introduce the reader about "Use'
...     ' cases of Natural Language Processing in Fintech".'
...     " Apart from his work, he is very passionate about music."
...     " Gus is learning to play the Piano. He has enrolled"
...     " himself in the weekend batch of Great Piano Academy."
...     " Great Piano Academy is situated in Mayfair or the City"
...     " of London and has world-class piano instructors."
... )
>>> complete_doc = nlp(complete_text)

>>> words = [
...     token.text
...     for token in complete_doc
...     if not token.is_stop and not token.is_punct
... ]

>>> print(Counter(words).most_common(5))

[('Gus', 4), ('London', 3), ('Natural', 3), ('Language', 3), ('Processing', 3)]


### Part of Speech Tagging

In [15]:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> about_text = (
...     "Gus Proto is a Python developer currently"
...     " working for a London-based Fintech"
...     " company. He is interested in learning"
...     " Natural Language Processing."
... )
>>> about_doc = nlp(about_text)
>>> for token in about_doc:
...     print(
...         f"""
... TOKEN: {str(token)}
... =====
... TAG: {str(token.tag_):10} POS: {token.pos_}
... EXPLANATION: {spacy.explain(token.tag_)}"""
...     )


TOKEN: Gus
=====
TAG: NNP        POS: PROPN
EXPLANATION: noun, proper singular

TOKEN: Proto
=====
TAG: NNP        POS: PROPN
EXPLANATION: noun, proper singular

TOKEN: is
=====
TAG: VBZ        POS: AUX
EXPLANATION: verb, 3rd person singular present

TOKEN: a
=====
TAG: DT         POS: DET
EXPLANATION: determiner

TOKEN: Python
=====
TAG: NNP        POS: PROPN
EXPLANATION: noun, proper singular

TOKEN: developer
=====
TAG: NN         POS: NOUN
EXPLANATION: noun, singular or mass

TOKEN: currently
=====
TAG: RB         POS: ADV
EXPLANATION: adverb

TOKEN: working
=====
TAG: VBG        POS: VERB
EXPLANATION: verb, gerund or present participle

TOKEN: for
=====
TAG: IN         POS: ADP
EXPLANATION: conjunction, subordinating or preposition

TOKEN: a
=====
TAG: DT         POS: DET
EXPLANATION: determiner

TOKEN: London
=====
TAG: NNP        POS: PROPN
EXPLANATION: noun, proper singular

TOKEN: -
=====
TAG: HYPH       POS: PUNCT
EXPLANATION: punctuation mark, hyphen

TOKEN: based
=====
TAG

### Pull out words by part of speech/category

In [16]:
>>> nouns = []
>>> adjectives = []
>>> for token in about_doc:
...     if token.pos_ == "NOUN":
...         nouns.append(token)
...     if token.pos_ == "ADJ":
...         adjectives.append(token)
...

>>> nouns


[developer, company]

### Visualization using spaCy's built-in visualizer, displaCy

Each token is assigned a POS tag, written just below the token.

In [2]:
>>> import spacy
>>> from spacy import displacy
>>> nlp = spacy.load("en_core_web_sm")

>>> about_interest_text = (
...     "He is interested in learning Natural Language Processing."
... )
>>> about_interest_doc = nlp(about_interest_text)
>>> displacy.serve(about_interest_doc, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


### Creating Preprocessing Functions
To bring your text into a format ideal for analysis, you can write preprocessing functions to encapsulate your cleaning process.

Note that complete_filtered_tokens doesn’t contain any stop words or punctuation symbols, and it consists purely of lemmatized lowercase tokens.

In [2]:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> complete_text = (
...     "Gus Proto is a Python developer currently"
...     " working for a London-based Fintech company. He is"
...     " interested in learning Natural Language Processing."
...     " There is a developer conference happening on 21 July"
...     ' 2019 in London. It is titled "Applications of Natural'
...     ' Language Processing". There is a helpline number'
...     " available at +44-1234567891. Gus is helping organize it."
...     " He keeps organizing local Python meetups and several"
...     " internal talks at his workplace. Gus is also presenting"
...     ' a talk. The talk will introduce the reader about "Use'
...     ' cases of Natural Language Processing in Fintech".'
...     " Apart from his work, he is very passionate about music."
...     " Gus is learning to play the Piano. He has enrolled"
...     " himself in the weekend batch of Great Piano Academy."
...     " Great Piano Academy is situated in Mayfair or the City"
...     " of London and has world-class piano instructors."
... )
>>> complete_doc = nlp(complete_text)
>>> def is_token_allowed(token):
...     return bool(
...         token
...         and str(token).strip()
...         and not token.is_stop
...         and not token.is_punct
...     )
...
>>> def preprocess_token(token):
...     return token.lemma_.strip().lower()
...
>>> complete_filtered_tokens = [
...     preprocess_token(token)
...     for token in complete_doc
...     if is_token_allowed(token)
... ]

>>> complete_filtered_tokens

['gus',
 'proto',
 'python',
 'developer',
 'currently',
 'work',
 'london',
 'base',
 'fintech',
 'company',
 'interested',
 'learn',
 'natural',
 'language',
 'processing',
 'developer',
 'conference',
 'happen',
 '21',
 'july',
 '2019',
 'london',
 'title',
 'application',
 'natural',
 'language',
 'processing',
 'helpline',
 'number',
 'available',
 '+44',
 '1234567891',
 'gus',
 'helping',
 'organize',
 'keep',
 'organize',
 'local',
 'python',
 'meetup',
 'internal',
 'talk',
 'workplace',
 'gus',
 'present',
 'talk',
 'talk',
 'introduce',
 'reader',
 'use',
 'case',
 'natural',
 'language',
 'processing',
 'fintech',
 'apart',
 'work',
 'passionate',
 'music',
 'gus',
 'learn',
 'play',
 'piano',
 'enrol',
 'weekend',
 'batch',
 'great',
 'piano',
 'academy',
 'great',
 'piano',
 'academy',
 'situate',
 'mayfair',
 'city',
 'london',
 'world',
 'class',
 'piano',
 'instructor']
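
Because the tokens are now lowercased lemmas, a frequency count groups variants such as Piano and piano together. A minimal sketch with collections.Counter, using the token list copied from the output above:

```python
from collections import Counter

# Tokens copied from the complete_filtered_tokens output above.
complete_filtered_tokens = (
    "gus proto python developer currently work london base fintech company"
    " interested learn natural language processing developer conference happen"
    " 21 july 2019 london title application natural language processing"
    " helpline number available +44 1234567891 gus helping organize keep"
    " organize local python meetup internal talk workplace gus present talk"
    " talk introduce reader use case natural language processing fintech"
    " apart work passionate music gus learn play piano enrol weekend batch"
    " great piano academy great piano academy situate mayfair city london"
    " world class piano instructor"
).split()

counts = Counter(complete_filtered_tokens)
top_two = counts.most_common(2)
print(top_two)
```

Here piano ties with gus at four occurrences, whereas a count on the raw token text keeps Piano (3) and piano (1) apart.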

### Rule-based matching using spaCy
For example, with rule-based matching, you can extract a first name and a last name, which are always proper nouns.

In [3]:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> about_text = (
...     "Gus Proto is a Python developer currently"
...     " working for a London-based Fintech"
...     " company. He is interested in learning"
...     " Natural Language Processing."
... )
>>> about_doc = nlp(about_text)

>>> from spacy.matcher import Matcher
>>> matcher = Matcher(nlp.vocab)

>>> def extract_full_name(nlp_doc):
...     pattern = [{"POS": "PROPN"}, {"POS": "PROPN"}]
...     matcher.add("FULL_NAME", [pattern])
...     matches = matcher(nlp_doc)
...     for _, start, end in matches:
...         span = nlp_doc[start:end]
...         yield span.text
...

>>> next(extract_full_name(about_doc))

'Gus Proto'
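
Note that extract_full_name registers the pattern again each time it is called. A variant that registers it once is sketched below. To stay runnable without a trained model, it builds a Doc by hand using the POS tags shown earlier for these words; that hand-built Doc is an assumption for illustration, not how the notebook obtains its tags:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-built Doc (assumption): words and POS tags copied from the tagging output above.
words = ["Gus", "Proto", "is", "a", "Python", "developer"]
pos = ["PROPN", "PROPN", "AUX", "DET", "PROPN", "NOUN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# Register the two-proper-noun pattern once, outside the function.
matcher = Matcher(nlp.vocab)
matcher.add("FULL_NAME", [[{"POS": "PROPN"}, {"POS": "PROPN"}]])


def extract_full_names(nlp_doc):
    """Yield the text of every span matching the FULL_NAME pattern."""
    for _, start, end in matcher(nlp_doc):
        yield nlp_doc[start:end].text


full_names = list(extract_full_names(doc))
print(full_names)
```

Python is tagged PROPN as well, but it is followed by a common noun, so only the two-token name matches.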

### Dependency Parsing Using spaCy

Dependency parsing extracts the dependency graph of a sentence to represent its grammatical structure. It defines the dependency relationships between headwords and their dependents.

The dependencies can be mapped in a directed graph representation where:

Words are the nodes.

Grammatical relationships are the edges.

In this example, the sentence contains three relationships:

nsubj is the subject of the word, and its headword is a verb.

aux is an auxiliary word, and its headword is a verb.

dobj is the direct object of the verb, and its headword is also a verb.

In [3]:
>>> import spacy
>>> from spacy import displacy
>>> nlp = spacy.load("en_core_web_sm")
>>> piano_text = "Gus is learning piano"
>>> piano_doc = nlp(piano_text)
>>> for token in piano_doc:
...     print(
...         f"""
... TOKEN: {token.text}
... =====
... {token.tag_ = }
... {token.head.text = }
... {token.dep_ = }"""
...     )


TOKEN: Gus
=====
token.tag_ = 'NNP'
token.head.text = 'learning'
token.dep_ = 'nsubj'

TOKEN: is
=====
token.tag_ = 'VBZ'
token.head.text = 'learning'
token.dep_ = 'aux'

TOKEN: learning
=====
token.tag_ = 'VBG'
token.head.text = 'learning'
token.dep_ = 'ROOT'

TOKEN: piano
=====
token.tag_ = 'NN'
token.head.text = 'learning'
token.dep_ = 'dobj'


In [4]:
>>> displacy.serve(piano_doc, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
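
displacy.serve starts a web server and holds the cell until it is shut down. When only the markup is needed, displacy.render returns it as a string instead. The sketch below uses displaCy's manual mode with the tags, heads, and labels copied from the output above, so it runs without a trained model; feeding hand-written data this way is an assumption made for illustration:

```python
from spacy import displacy

# Words and arcs transcribed from the dependency output above (manual mode).
parse = {
    "words": [
        {"text": "Gus", "tag": "NNP"},
        {"text": "is", "tag": "VBZ"},
        {"text": "learning", "tag": "VBG"},
        {"text": "piano", "tag": "NN"},
    ],
    "arcs": [
        # start/end are token positions; dir says which end carries the arrowhead.
        {"start": 0, "end": 2, "label": "nsubj", "dir": "left"},
        {"start": 1, "end": 2, "label": "aux", "dir": "left"},
        {"start": 2, "end": 3, "label": "dobj", "dir": "right"},
    ],
}

# jupyter=False forces a plain string return instead of inline display.
svg = displacy.render(parse, style="dep", manual=True, jupyter=False)
print(svg[:60])
```

With a real parse, passing piano_doc in place of the dictionary and dropping manual=True should give the same kind of markup.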


### Tree and subtree navigation
spaCy provides attributes like .children, .lefts, .rights, and .subtree to make navigating the parse tree easier.

In [1]:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> one_line_about_text = (
...     "Gus Proto is a Python developer"
...     " currently working for a London-based Fintech company"
... )
>>> one_line_about_doc = nlp(one_line_about_text)

>>> # Extract children of `developer`
>>> print([token.text for token in one_line_about_doc[5].children])


['a', 'Python', 'working']


### Shallow Parsing/Chunking

Noun Phrase Detection

In [2]:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")

>>> conference_text = (
...     "There is a developer conference happening on 21 July 2019 in London."
... )
>>> conference_doc = nlp(conference_text)

>>> # Extract Noun Phrases
>>> for chunk in conference_doc.noun_chunks:
...     print(chunk)

a developer conference
21 July
London


### Named-Entity Recognition
Named-entity recognition (NER) locates named entities in unstructured text and classifies them into predefined categories, such as person names, organizations, and locations.

You can use NER to learn more about the meaning of your text.

spaCy has the property .ents on Doc objects. You can use it to extract named entities.

In [10]:
>>> import spacy
>>> from spacy import displacy
>>> nlp = spacy.load("en_core_web_sm")

>>> piano_class_text = (
...     "Great Piano Academy is situated"
...     " in Mayfair or the City of London and has"
...     " world-class piano instructors."
... )
>>> piano_class_doc = nlp(piano_class_text)

>>> for ent in piano_class_doc.ents:
...     print(
...         f"""
... {ent.text = }
... {ent.start_char = }
... {ent.end_char = }
... {ent.label_ = }
... spacy.explain('{ent.label_}') = {spacy.explain(ent.label_)}"""
... )


ent.text = 'Great Piano Academy'
ent.start_char = 0
ent.end_char = 19
ent.label_ = 'ORG'
spacy.explain('ORG') = Companies, agencies, institutions, etc.

ent.text = 'Mayfair'
ent.start_char = 35
ent.end_char = 42
ent.label_ = 'GPE'
spacy.explain('GPE') = Countries, cities, states

ent.text = 'the City of London'
ent.start_char = 46
ent.end_char = 64
ent.label_ = 'GPE'
spacy.explain('GPE') = Countries, cities, states


In [11]:
>>> displacy.serve(piano_class_doc, style="ent")




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


### Using NER to Redact names

In [4]:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> survey_text = (
...     "Out of 5 people surveyed, James Robert,"
...     " Julie Fuller and Benjamin Brooks like"
...     " apples. Kelly Cox and Matthew Evans"
...     " like oranges."
... )


>>> def replace_person_names(token):
...     if token.ent_iob != 0 and token.ent_type_ == "PERSON":
...         return "[REDACTED] "
...     return token.text_with_ws
...

>>> def redact_names(nlp_doc):
...     with nlp_doc.retokenize() as retokenizer:
...         for ent in nlp_doc.ents:
...             retokenizer.merge(ent)
...     tokens = map(replace_person_names, nlp_doc)
...     return "".join(tokens)
...

>>> survey_doc = nlp(survey_text)
>>> print(redact_names(survey_doc))

Out of 5 people surveyed, [REDACTED] , [REDACTED] and [REDACTED] like apples. [REDACTED] and [REDACTED] like oranges.
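
The output above has a stray space before each comma because replace_person_names always appends a space after [REDACTED]. One alternative, sketched here under assumptions, rebuilds the string from each entity's start_char and end_char so that the original spacing is kept. The PERSON spans are labeled by hand on a blank pipeline to keep the example runnable without a trained model; in the notebook they would come from the NER component:

```python
import spacy

nlp = spacy.blank("en")
survey_text = (
    "Out of 5 people surveyed, James Robert,"
    " Julie Fuller and Benjamin Brooks like"
    " apples. Kelly Cox and Matthew Evans"
    " like oranges."
)
survey_doc = nlp(survey_text)

# Hand-labeled entities (assumption) standing in for the model's predictions.
names = ["James Robert", "Julie Fuller", "Benjamin Brooks", "Kelly Cox", "Matthew Evans"]
spans = []
for name in names:
    start = survey_text.index(name)
    spans.append(survey_doc.char_span(start, start + len(name), label="PERSON"))
survey_doc.ents = spans


def redact_names(nlp_doc):
    """Replace every PERSON entity with [REDACTED], keeping surrounding text as is."""
    pieces = []
    cursor = 0
    for ent in nlp_doc.ents:
        if ent.label_ == "PERSON":
            pieces.append(nlp_doc.text[cursor:ent.start_char])
            pieces.append("[REDACTED]")
            cursor = ent.end_char
    pieces.append(nlp_doc.text[cursor:])
    return "".join(pieces)


redacted = redact_names(survey_doc)
print(redacted)
```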
