<a href="https://colab.research.google.com/github/Ben-Ogega/Machine-Learning-Projects/blob/Part_of_speech/NLP_using_Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Natural Language Processing in Python with spaCy
### In this notebook we will be doing some Natural Language Processing (NLP) in Python using spaCy



#### Implement **NLP in spaCy**

1.   Customize and extend built-in functionalities in spaCy
2.   Perform basic statistical analysis on a text
3.   Create a **pipeline** to process **unstructured text**
4.   Parse a sentence and extract meaningful insights from it

This notebook follows the Real Python tutorial on [NLP with spaCy](https://realpython.com/natural-language-processing-spacy-python/).

### NLP is a subfield of artificial intelligence, and it’s all about allowing computers to comprehend human language.
> NLP involves **Analyzing, Quantifying, Understanding, and Deriving** meaning from natural languages. Read more [here](https://realpython.com/natural-language-processing-spacy-python/)

Well-known NLP models include:

1.   BERT from Google
2.   The GPT family from OpenAI

### NLP helps you extract insights from unstructured text and has many use cases, such as:

> **Automatic summarization**

> **Named-entity recognition**

> **Question answering systems**

> **Sentiment analysis**

## Installation of spaCy

In [1]:
!pip install spacy --quiet

## Step 0. Load spaCy and the English Model

### The default model for the English language is designated as **en_core_web_sm**. Since the models are quite large, they are **installed separately** from the library (for example with `python -m spacy download en_core_web_sm`); *including all languages in one package would make the download too massive.*
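Because the model is a separate package, `spacy.load` can fail on a fresh environment. The following is a minimal sketch of a defensive loader; `load_pipeline` is a hypothetical helper name, and falling back to a blank English pipeline is an assumption made here for illustration (a blank pipeline only tokenizes, so tagging, parsing, and NER would not be available).

```python
import spacy
from spacy.language import Language


def load_pipeline(name: str = "en_core_web_sm") -> Language:
    """Load a trained pipeline, or fall back to a blank English one."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed.
        # It can usually be installed with:
        #   python -m spacy download en_core_web_sm
        print(f"Model {name!r} not found; using a blank English pipeline.")
        return spacy.blank("en")


nlp = load_pipeline()
print(nlp.pipe_names)  # empty list for the blank fallback
```

Checking `nlp.pipe_names` afterwards is a quick way to see which components are actually available.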

Import spaCy and load the model

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp  # a callable spaCy Language object

<spacy.lang.en.English at 0x79339163e410>

In [3]:
# To start processing the input, construct a Doc object.
# A Doc is a sequence of Token objects, each representing a lexical token.
# A token is an individual unit: a word, punctuation mark, symbol, or whitespace.

introduction_doc = nlp("This tutorial is about Natural Language Processing in spaCy.")
type(introduction_doc)


spacy.tokens.doc.Doc

In [4]:
# Generate tokens from the Doc
tokens = [token.text for token in introduction_doc]
tokens[0]

'This'

### We can also read from a file

In [5]:
# import pathlib
# file_name = "introduction.txt"
# introduction_doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))
# print ([token.text for token in introduction_doc])
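The commented cell above assumes that `introduction.txt` already exists. As a self-contained sketch, the example below first writes a small file into a temporary directory and then reads it back. The file contents are made up for illustration, and a blank English pipeline is used so that only tokenization is needed.

```python
import pathlib
import tempfile

import spacy

nlp_file = spacy.blank("en")  # tokenizer only; no trained model required

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = pathlib.Path(tmp_dir) / "introduction.txt"
    # Illustrative file contents (an assumption, not a file from the notebook)
    file_path.write_text(
        "This tutorial is about Natural Language Processing in spaCy.",
        encoding="utf-8",
    )
    file_doc = nlp_file(file_path.read_text(encoding="utf-8"))

file_tokens = [token.text for token in file_doc]
print(file_tokens)
```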

## Sentence Detection
Sentence detection is the process of locating where **sentences start** and **end in a given text**.

This allows us **to divide a text into linguistically meaningful units.**

In [6]:
about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)
about_doc = nlp(about_text)
sentences = list(about_doc.sents) # .sents property is used to extract sentences from the Doc object
len(sentences)


2

In [7]:
# Print the first five tokens of each sentence
for sentence in sentences:
    print(f"{sentence[:5]}...")

Gus Proto is a Python...
He is interested in learning...
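Sentence boundaries do not have to come from the trained parser. As a rough sketch, spaCy's rule-based `sentencizer` component can be added to a blank pipeline; it splits on sentence-final punctuation only, so it is faster but less accurate on unusual text. The variable names below are illustrative.

```python
import spacy

nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")  # punctuation-based sentence splitter

rule_doc = nlp_rules(
    "Gus Proto is a Python developer. "
    "He is interested in learning Natural Language Processing."
)
sentence_texts = [sent.text for sent in rule_doc.sents]
for text in sentence_texts:
    print(text)
```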


## **Tokens** in **spaCy**



Building the Doc container involves **tokenizing the text**. The process of **tokenization breaks a text down into its basic units**—or **tokens**—which are represented in spaCy as **Token objects**.

In [8]:
nlp = spacy.load("en_core_web_sm")
about_text = ('''Gus Proto is a Python developer currently
              working for a London-based Fintech
               company. He is interested in learning
                Natural Language Processing.''')
about_doc = nlp(about_text)

# The .idx attribute is the character offset at which
# the token starts in the original text
for token in about_doc:
    print(token, token.idx)

Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32

               42
working 57
for 65
a 69
London 71
- 77
based 78
Fintech 84

                91
company 107
. 114
He 116
is 119
interested 122
in 133
learning 136

                 144
Natural 161
Language 169
Processing 178
. 188
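To make the meaning of `.idx` concrete, the sketch below checks that slicing the original string from `token.idx` over the token's length gives back the token text. It uses a blank English pipeline and a shortened sentence, both chosen here for illustration.

```python
import spacy

nlp_tok = spacy.blank("en")  # tokenizer only
text = "Gus Proto is a Python developer."
tok_doc = nlp_tok(text)

offsets = []
for token in tok_doc:
    end = token.idx + len(token)  # len(token) is the number of characters
    # The slice of the original text must equal the token text
    offsets.append((token.text, token.idx, text[token.idx:end] == token.text))

for token_text, start, matches in offsets:
    print(f"{token_text:<10} start={start:<3} slice_matches={matches}")
```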


## Stop Words

Stop words are typically defined as the most common words in a language. In the English language, some examples of stop words are **the**, **are**, **but**, and **they**. Most sentences need to contain stop words in order to be **full sentences that make grammatical sense.**

With NLP, **stop words are generally removed because they carry little meaning on their own** and they heavily distort any word frequency analysis. spaCy stores a list of stop words for the English language.

In [9]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)

for stop_word in list(spacy_stopwords)[:10]:
  print(stop_word)

unless
yourself
whereby
get
really
another
myself
above
he
where


### Removing **stop words** from the input text by making use of the **.is_stop** attribute of each token

In [10]:
custom_about_text = (
 """Ben Ogega is a teacher and Mechanical engineering trainer currently
working for a Nairobi-based  Road Safety
company. He is interested in learning
Natural Language Processing.""")
nlp = spacy.load("en_core_web_sm")
about_doc = nlp(custom_about_text)
print([token for token in about_doc if not token.is_stop])

[Ben, Ogega, teacher, Mechanical, engineering, trainer, currently, 
, working, Nairobi, -, based,  , Road, Safety, 
, company, ., interested, learning, 
, Natural, Language, Processing, .]
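The output above still contains newline tokens and punctuation. A small sketch of a stricter filter, assuming you also want to drop whitespace and punctuation, combines `.is_stop` with `.is_space` and `.is_punct`. These are lexical attributes, so a blank English pipeline is enough here.

```python
import spacy

nlp_sw = spacy.blank("en")  # stop-word flags are lexical; no model needed
sw_doc = nlp_sw("He is interested in learning\nNatural Language Processing.")

content_tokens = [
    token.text
    for token in sw_doc
    if not token.is_stop and not token.is_space and not token.is_punct
]
print(content_tokens)
```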


## **Lemmatization**
**Lemmatization** is the process of **reducing the inflected forms** of a word to a common base form, while still ensuring that the reduced form belongs to the language. This **reduced form, or root word**, is called a **lemma**.

For example, **organizes, organized and organizing** are all forms of **organize.** Here, **organize is the lemma**.

In [11]:
conference_help_text = ("""Ben Ogega is a teacher and Mechanical engineering
trainer currently
working for a Nairobi-based  Road Safety
company. He is interested in learning
Natural Language Processing.""")

conference_help_doc = nlp(conference_help_text)
for token in conference_help_doc:
  if str(token) != str(token.lemma_):
    print(f"{str(token):>20} : {str(token.lemma_)}")

                  is : be
          Mechanical : mechanical
             working : work
               based : base
                  He : he
                  is : be
            learning : learn


## Word Frequency
We can now convert a given text into **tokens** and perform **statistical analysis** on it.

In [12]:
from collections import Counter
# nlp = spacy.load("en_core_web_sm")
complete_text = ("""Ben Ogega is a mechanical engineer witha penchant for
 programming in Python currently
working for a Nairobi-based Road Safety company. He is
interested in learning Natural Language Processing.
He hopes to one day attend a developer conference in his city.
If he gets anooportunity he will give his keynote speech a title
 "Applications of Natural
Language Processing in Road Safety and Defensive Driving".
He purposes to start organizing local Python meetups and several
internal talks at his workplace. Ben is also an avid reader.
His new found passion is in Machine Learning especially the  "Use
cases of Natural Language Processing in in Road Safety and Defensive Driving.
Apart from his work, he is very passionate about music and history.
He hopes to learn how to play a guiter.""")


In [13]:

print(complete_text[:25])

Ben Ogega is a mechanical


In [14]:
complete_doc = nlp(complete_text)

words = [
    token.text
    for token in complete_doc
    if not token.is_stop and not token.is_punct
]

print(Counter(words).most_common(5))

[('\n', 11), ('Road', 3), ('Safety', 3), ('Natural', 3), ('Language', 3)]


In [15]:
# For comparison: the same count with stop words kept (only punctuation removed)
Counter([token.text for token in complete_doc if not token.is_punct]).most_common(5)

[('\n', 11), ('in', 7), ('is', 5), ('a', 5), ('He', 4)]
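In both counts the newline token comes out on top, and `He` and `he` would be counted separately. A possible refinement, sketched below on a short made-up text, is to skip whitespace tokens and count lowercased forms. Whether lowercasing is appropriate depends on the analysis, so treat this as one option rather than the notebook's method.

```python
from collections import Counter

import spacy

nlp_freq = spacy.blank("en")  # tokenizer and lexical flags only
sample_text = (
    "Road safety matters.\n"
    "Road safety saves lives.\n"
    "Natural Language Processing helps road safety."
)
freq_doc = nlp_freq(sample_text)

# Drop stop words, punctuation and whitespace, then normalize case
clean_words = [
    token.lower_
    for token in freq_doc
    if not (token.is_stop or token.is_punct or token.is_space)
]
word_counts = Counter(clean_words)
print(word_counts.most_common(2))
```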

## Part-of-Speech Tagging
**Part of speech** or POS is a grammatical role that explains how a **particular word is used in a sentence**. There are typically **eight parts of speech:**

  1. **Noun**
  2. **Pronoun**
  3. **Adjective**
  4. **Verb**
  5. **Adverb**
  6. **Preposition**
  7. **Conjunction**
  8. **Interjection**

  POS tagging assigns a **syntactic** category, such as **noun** or **verb**, to each **token**.

In [16]:
nlp = spacy.load("en_core_web_sm")

about_text = ("""Gus Proto is a Python developer currently working
              for a London-based Fintech company.
              He is interested in learning Natural Language Processing.""")

about_doc = nlp(about_text)

for token in about_doc:
    print(f"""TOKEN: {str(token)} ===== TAG: {str(token.tag_):10}
POS: {token.pos_}
EXPLANATION: {spacy.explain(token.tag_)}""")

TOKEN: Gus ===== TAG: NNP        
POS: PROPN
EXPLANATION: noun, proper singular
TOKEN: Proto ===== TAG: NNP        
POS: PROPN
EXPLANATION: noun, proper singular
TOKEN: is ===== TAG: VBZ        
POS: AUX
EXPLANATION: verb, 3rd person singular present
TOKEN: a ===== TAG: DT         
POS: DET
EXPLANATION: determiner
TOKEN: Python ===== TAG: NNP        
POS: PROPN
EXPLANATION: noun, proper singular
TOKEN: developer ===== TAG: NN         
POS: NOUN
EXPLANATION: noun, singular or mass
TOKEN: currently ===== TAG: RB         
POS: ADV
EXPLANATION: adverb
TOKEN: working ===== TAG: VBG        
POS: VERB
EXPLANATION: verb, gerund or present participle
TOKEN: 
               ===== TAG: _SP        
POS: SPACE
EXPLANATION: whitespace
TOKEN: for ===== TAG: IN         
POS: ADP
EXPLANATION: conjunction, subordinating or preposition
TOKEN: a ===== TAG: DT         
POS: DET
EXPLANATION: determiner
TOKEN: London ===== TAG: NNP        
POS: PROPN
EXPLANATION: noun, proper singular
TOKEN: - ===== TAG: HYP

Here, two attributes of the Token class are accessed and printed using f-strings:

1.   **.tag_** displays a fine-grained tag.
2.   **.pos_** displays a coarse-grained tag, which is a reduced version of the fine-grained tags.

You also use **spacy.explain()** to give descriptive details about a particular tag, which can be a valuable reference tool.

In [17]:
# Use POS tags to extract a particular category of words:

nouns = []
adjectives = []
for token in about_doc:
    if token.pos_ == "NOUN":
        nouns.append(token)
    if token.pos_ == "ADJ":
        adjectives.append(token)
print(nouns)
print(adjectives)

[developer, company]
[interested]
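The same idea generalizes to every category at once. The sketch below groups tokens by their coarse-grained tag with a `defaultdict`. To stay independent of any model version, the `Doc` is annotated by hand through the `pos` argument of the `Doc` constructor; the sentence and its tags are assumptions for illustration, not tagger output.

```python
from collections import defaultdict

import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["Gus", "is", "a", "keen", "Python", "developer", "."]
# Hand-assigned coarse-grained tags (illustrative, not predicted by a model)
pos = ["PROPN", "AUX", "DET", "ADJ", "PROPN", "NOUN", "PUNCT"]
pos_doc = Doc(vocab, words=words, pos=pos)

by_pos = defaultdict(list)
for token in pos_doc:
    by_pos[token.pos_].append(token.text)

for tag, members in by_pos.items():
    print(f"{tag:<6} {members}")
```

With a trained pipeline, the loop works unchanged on `nlp(text)`.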


## Visualization: Using displaCy

**spaCy** comes with a built-in visualizer called **displaCy**. We can use it to visualize a **dependency parse** or **named entities** in a browser or a Jupyter notebook.

The dependency view also shows the POS tag beneath each token:

In [18]:
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
about_interest_text = (
    "He is interested in learning Natural Language Processing."
)
about_interest_doc = nlp(about_interest_text)
displacy.render(about_interest_doc, style="dep", jupyter=True)
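Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below passes `jupyter=False` and keeps the resulting SVG in a variable, which could then be written to a file. The `Doc` is annotated by hand (words, tags, heads, and dependency labels are assumptions for illustration) so the example does not depend on a trained model.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
# Hand-built parse: every token attaches to "learning" (index 2), the root
svg_doc = Doc(
    vocab,
    words=["He", "is", "learning", "Python"],
    pos=["PRON", "AUX", "VERB", "PROPN"],
    heads=[2, 2, 2, 2],
    deps=["nsubj", "aux", "ROOT", "dobj"],
)

# jupyter=False makes render() return the markup rather than display it
svg = displacy.render(svg_doc, style="dep", jupyter=False)
print(svg[:80])
```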


## Preprocessing Functions

To bring your text into a format suited to analysis, you can write **preprocessing functions** to encapsulate your cleaning process. A **preprocessing function** converts text to an analyzable format, which is typical for most NLP tasks. The preprocessor in this section applies the following operations:

1.   Lowercases the text
2.   Lemmatizes each token
3.   Removes punctuation symbols
4.   Removes stop words

In [20]:
import spacy
nlp = spacy.load("en_core_web_sm")
complete_text = ('''Gus Proto is a Python developer currently.
There is a developer conference happening on
21 July interested in learning Natural Language Processing.
2019 in London. It is titled Applications of Natural
Language Processing. There is a helpline number
available at +44-1234567891. Gus is helping organize
He keeps organizing local Python meetups and several
internal talks at his workplace. Gus is also presentin
a talk. The talk will introduce the reader about
cases of Natural Language Processing in Fintech.
Apart from his work, he is very passionate about music
Gus is learning to play the Piano. He has enrolled
himself in the weekend batch of Great Piano Academy.
Great Piano Academy is situated in Mayfair or the City of London and has world-class piano instructors.''')

complete_doc = nlp(complete_text)


def is_token_allowed(token):
    # Keep a token only if it is not whitespace, a stop word, or punctuation
    return bool(
        token
        and str(token).strip()
        and not token.is_stop
        and not token.is_punct
    )


def preprocess_token(token):
    # Reduce the token to its lemma, then normalize case
    return token.lemma_.strip().lower()


complete_filtered_tokens = [
    preprocess_token(token)
    for token in complete_doc
    if is_token_allowed(token)
]

complete_filtered_tokens[:10]

['gus',
 'proto',
 'python',
 'developer',
 'currently',
 'developer',
 'conference',
 'happen',
 '21',
 'july']

## Rule-Based Matching Using spaCy

#### Rule-based matching is one of the **steps in extracting information from unstructured text**. It’s used to identify and extract tokens and phrases **according to patterns (such as their lowercase form)** and **grammatical features (such as part of speech).**

In [21]:
# For example, with rule-based matching you can extract a first name and
# a last name, which are typically tagged as two consecutive proper nouns
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
about_text2 = (
   """Ben Ogega is a Python developer currently working for a London-based Fintech company.
   He is interested in learning Natural Language Processing."""
)
about_doc = nlp(about_text2)

# Register the pattern once, outside the function, so that repeated
# calls do not keep re-adding it to the matcher
matcher = Matcher(nlp.vocab)
matcher.add("FULL_NAME", [[{"POS": "PROPN"}, {"POS": "PROPN"}]])


def extract_full_name(nlp_doc):
    for _, start, end in matcher(nlp_doc):
        yield nlp_doc[start:end].text


next(extract_full_name(about_doc))

'Ben Ogega'

In [22]:
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
conference_org_text = ('''There is a developer conference
    happening on 21 July 2019 in London. It is titled
    Applications of Natural Language Processing
    There is a helpline number available
    at (254) 456-7891''')
def extract_phone_number(nlp_doc):
    pattern = [
        {"ORTH": "("},
        {"SHAPE": "ddd"},
        {"ORTH": ")"},
        {"SHAPE": "ddd"},
        {"ORTH": "-", "OP": "?"},
        {"SHAPE": "dddd"},
    ]
    matcher = Matcher(nlp.vocab)
    matcher.add("PHONE_NUMBER", [pattern])
    matches = matcher(nlp_doc)
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        return span.text  # return only the first match
conference_org_doc = nlp(conference_org_text)
extract_phone_number(conference_org_doc)

'(254) 456-7891'
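Patterns on the lowercase form of a token give case-insensitive matching. As a small sketch, the pattern below uses the `LOWER` attribute to find a three-word phrase regardless of capitalization. The pattern name `NLP_PHRASE` and the sample text are made up for illustration, and because `LOWER` is a lexical attribute, a blank pipeline is sufficient.

```python
import spacy
from spacy.matcher import Matcher

nlp_match = spacy.blank("en")
phrase_matcher = Matcher(nlp_match.vocab)

# One dict per token; LOWER compares against the lowercased token text
pattern = [{"LOWER": "natural"}, {"LOWER": "language"}, {"LOWER": "processing"}]
phrase_matcher.add("NLP_PHRASE", [pattern])

match_doc = nlp_match(
    "Natural Language Processing is fun. "
    "Many teams use natural language processing daily."
)
phrases = [match_doc[start:end].text for _, start, end in phrase_matcher(match_doc)]
print(phrases)
```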

## Dependency Parsing Using spaCy

**Dependency parsing** is the **process of extracting the dependency graph of a sentence** to **represent its grammatical structure**. It defines the **dependency relationship between headwords and their dependents**. The head of the whole sentence has no head of its own and is called the root of the sentence; it is usually the main verb. Every other word is linked to the root, either directly or through other words.

The dependencies can be mapped in a directed graph representation where:

1.   *Words are the nodes.*
2.   *Grammatical relationships are the edges.*

Dependency parsing helps you know what role a word plays in the text and how different words relate to each other.
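The nodes and edges described above can be read off each token through `.head` and `.dep_`. The sketch below builds a tiny `Doc` with a hand-written parse (the heads and labels are assumptions for illustration, not parser output) and lists every edge; in spaCy the root is the token whose head is itself.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
# heads are absolute token indices; "learning" (index 2) points to itself
dep_doc = Doc(
    vocab,
    words=["Gus", "is", "learning", "piano"],
    heads=[2, 2, 2, 2],
    deps=["nsubj", "aux", "ROOT", "dobj"],
)

# Each edge: (dependent, relation label, head)
edges = [(token.text, token.dep_, token.head.text) for token in dep_doc]
root = next(token for token in dep_doc if token.head.i == token.i)

for child, relation, head in edges:
    print(f"{child:<10} --{relation}--> {head}")
print("root:", root.text)
```

With a trained pipeline, the same two attributes are filled in by the parser.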





## **Tree and Subtree Navigation**

The **dependency graph has all the properties of a tree**. This tree contains information about **sentence structure and grammar** and can be traversed in different ways to extract relationships.

spaCy provides **attributes like .children, .lefts, .rights, and .subtree** to make navigating the parse tree easier. Here are a few examples of using those attributes:

In [23]:

nlp = spacy.load("en_core_web_sm")
one_line_about_text = (
    "Gus Proto is a Python developer"
    " currently working for a London-based Fintech company"
)
one_line_about_doc = nlp(one_line_about_text)
developer = one_line_about_doc[5]

# Extract children of `developer`
print([token.text for token in developer.children])

# Extract previous neighboring node of `developer`
print(developer.nbor(-1))

# Extract next neighboring node of `developer`
print(developer.nbor())

# Extract all tokens on the left of `developer`
print([token.text for token in developer.lefts])

# Extract tokens on the right of `developer`
print([token.text for token in developer.rights])

# Print subtree of `developer`
print(list(developer.subtree))

['a', 'Python', 'working']
Python
currently
['a', 'Python']
['working']
[a, Python, developer, currently, working, for, a, London, -, based, Fintech, company]


## Shallow Parsing
Shallow parsing, or chunking, is the process of extracting phrases from unstructured text.


This involves chunking groups of adjacent tokens into phrases on the basis of their POS tags. There are some standard **well-known chunks** such as **noun phrases, verb phrases, and prepositional phrases**.

## Noun Phrase Detection
A noun phrase is a phrase that has **a noun as its head**. It could also include other **kinds of words, such as adjectives, ordinals, and determiners.**

In [24]:
# spaCy has the property .noun_chunks on the Doc object.

nlp = spacy.load("en_core_web_sm")

conference_text = (
    "There is a developer conference happening on 21 July 2019 in London."
)
conference_doc = nlp(conference_text)
# Extract Noun Phrases
for chunk in conference_doc.noun_chunks:
    print (chunk)

a developer conference
21 July
London


## Verb Phrase Detection
A verb phrase is a **syntactic unit** composed of at least one verb. This verb can be joined by other chunks, such as noun phrases. **Verb phrases** are useful for understanding the actions that nouns are involved in.

In [25]:
!pip install textacy --quiet

We can use **textacy** to extract verb phrases based on grammatical rules:

In [26]:
import textacy
about_talk_text = (
    ''' I do not have enough information about that person to help with your request.
    I am a large language model, and I am able to communicate and generate human-like
    text in response to a wide range of prompts and questions, but my knowledge about this person is limited.
    Is there anything else I can do to help you with this request?''')

patterns = [{"POS": "AUX"}, {"POS": "VERB"}]
about_talk_doc = textacy.make_spacy_doc(
     about_talk_text, lang="en_core_web_sm"
 )
verb_phrases = textacy.extract.token_matches(
     about_talk_doc, patterns=patterns
 )

# Print all verb phrases
for chunk in verb_phrases:
    print(chunk.text)

# Extract noun phrases to see which nouns are involved
for chunk in about_talk_doc.noun_chunks:
    print(chunk)

can do
I
enough information
that person
your request
I
a large language model
I
human-like 
    text
response
a wide range
prompts
questions
my knowledge
this person
anything
I
you
this request
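The same auxiliary-plus-verb pattern can also be expressed with spaCy's own `Matcher`, without textacy. The sketch below runs it on a hand-tagged `Doc`; the sentence and its tags are assumptions for illustration, so the result does not depend on a tagger. With a trained pipeline, you would pass `nlp(text)` instead.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["The", "talk", "will", "introduce", "the", "topic", "."]
# Hand-assigned coarse-grained tags (illustrative, not model output)
pos = ["DET", "NOUN", "AUX", "VERB", "DET", "NOUN", "PUNCT"]
vp_doc = Doc(vocab, words=words, pos=pos)

vp_matcher = Matcher(vocab)
# An auxiliary immediately followed by a verb
vp_matcher.add("VERB_PHRASE", [[{"POS": "AUX"}, {"POS": "VERB"}]])

matched_phrases = [vp_doc[start:end].text for _, start, end in vp_matcher(vp_doc)]
print(matched_phrases)
```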


## Named-Entity Recognition (NER)

**Named-entity recognition (NER)** is **the process of locating named entities in unstructured text** and **then classifying them into predefined categories,** such as **person names**, **organizations**, **locations**, **monetary values**, **percentages**, and **time expressions**.

### spaCy has the property **.ents on Doc objects**. You can use it to extract named entities:


In [27]:
# import spacy
nlp = spacy.load("en_core_web_sm")

piano_class_text = (
    "Great Piano Academy is situated"
    " in Mayfair or the City of London and has"
    " world-class piano instructors."
)
piano_class_doc = nlp(piano_class_text)

for ent in piano_class_doc.ents:
    print(
        f"""
{ent.text = }
{ent.start_char = }
{ent.end_char = }
{ent.label_ = }
spacy.explain('{ent.label_}') = {spacy.explain(ent.label_)}"""
)


ent.text = 'Great Piano Academy'
ent.start_char = 0
ent.end_char = 19
ent.label_ = 'ORG'
spacy.explain('ORG') = Companies, agencies, institutions, etc.

ent.text = 'Mayfair'
ent.start_char = 35
ent.end_char = 42
ent.label_ = 'GPE'
spacy.explain('GPE') = Countries, cities, states

ent.text = 'the City of London'
ent.start_char = 46
ent.end_char = 64
ent.label_ = 'GPE'
spacy.explain('GPE') = Countries, cities, states


1.   **.text** gives the Unicode text representation of the entity.
2.   **.start_char** denotes the character offset for the start of the entity.
3.   **.end_char** denotes the character offset for the end of the entity.
4.   **.label_** gives the label of the entity.

**spacy.explain** gives descriptive details about each entity label. You can also use displaCy to visualize the dependency parse of a longer text sentence by sentence, or the entities themselves:

In [28]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
text = """In ancient Rome, some neighbors live in three adjacent houses. In the center is the house of Senex, who lives there with wife Domina, son Hero, and several slaves, including head slave Hysterium and the musical's main character Pseudolus. A slave belonging to Hero, Pseudolus wishes to buy, win, or steal his freedom. One of the neighboring houses is owned by Marcus Lycus, who is a buyer and seller of beautiful women; the other belongs to the ancient Erronius, who is abroad searching for his long-lost children (stolen in infancy by pirates). One day, Senex and Domina go on a trip and leave Pseudolus in charge of Hero. Hero confides in Pseudolus that he is in love with the lovely Philia, one of the courtesans in the House of Lycus (albeit still a virgin)."""
doc = nlp(text)
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, style="dep", jupyter=True)

In [29]:
import spacy
from spacy import displacy
displacy.render(piano_class_doc, style="ent", jupyter=True)
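Entities can also come from hand-written rules rather than a statistical model. As a minimal sketch, the example below adds spaCy's `entity_ruler` component to a blank pipeline and registers two phrase patterns; the patterns and the shortened sentence are assumptions for illustration. The resulting spans expose the same `.text`, `.label_`, `.start_char`, and `.end_char` attributes as model-predicted entities.

```python
import spacy

nlp_ner = spacy.blank("en")
ruler = nlp_ner.add_pipe("entity_ruler")
# Illustrative patterns: exact phrases mapped to entity labels
ruler.add_patterns(
    [
        {"label": "ORG", "pattern": "Great Piano Academy"},
        {"label": "GPE", "pattern": "Mayfair"},
    ]
)

ner_doc = nlp_ner("Great Piano Academy is situated in Mayfair.")
entities = [
    (ent.text, ent.label_, ent.start_char, ent.end_char) for ent in ner_doc.ents
]
for entity in entities:
    print(entity)
```

Rules like these only find what they are told to look for, so they tend to complement a trained NER component rather than replace it.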
