
## Sentiment Analysis in Python
### In this notebook we will do some Natural Language Processing (NLP) using Python, NLTK, and spaCy



#### Implement **NLP in spaCy**

1.   Customize and extend built-in functionalities in spaCy
2.   Perform basic statistical analysis on a text
3.   Create a **pipeline** to process **unstructured text**
4.   Parse a sentence and extract meaningful insights from it

This notebook follows [this Real Python tutorial](https://realpython.com/natural-language-processing-spacy-python/).

### NLP is a subfield of artificial intelligence, and it’s all about allowing computers to comprehend human language.
> NLP involves **Analyzing, Quantifying, Understanding, and Deriving** meaning from natural languages. Read more [here](https://realpython.com/natural-language-processing-spacy-python/)

Examples of NLP applications include:

1.   BERT from Google

2.  GPT family from OpenAI





### NLP helps you extract insights from unstructured text and has many use cases, such as:

> **Automatic summarization**

> **Named-entity recognition**

> **Question answering systems**

> **Sentiment analysis**

## Installation of spaCy

In [13]:
!pip install spacy
!python -m spacy download en_core_web_sm



## Step 0. Load the spaCy Language Model

### The default model for the English language is **en_core_web_sm**. Because the models are quite large, it is best to **install them separately**: *including all languages in one package would make the download too massive.*
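
Because the model is installed separately, it can help to check for it before calling `spacy.load`. The sketch below is one possible way to do that with the standard library only. The helper name `is_model_installed` is made up for this notebook, and the check assumes the model was installed as a regular Python package, which is what `python -m spacy download` does.

```python
import importlib.util


def is_model_installed(model_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(model_name) is not None


MODEL_NAME = "en_core_web_sm"

if is_model_installed(MODEL_NAME):
    print(f"{MODEL_NAME} is installed, so spacy.load('{MODEL_NAME}') should work.")
else:
    print(f"{MODEL_NAME} is missing. Install it with:")
    print(f"    python -m spacy download {MODEL_NAME}")
```

The check only looks for the package; it does not verify that the model version matches the installed spaCy version.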

Import spaCy and load the model

In [14]:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp  # nlp is a callable spaCy Language object

<spacy.lang.en.English at 0x79d01b1c7f10>

In [15]:
# To start processing the input, construct a Doc object.
# A Doc object is a sequence of Token objects, each representing a lexical token.
# A token is an individual unit of text: a word, punctuation mark, symbol, or whitespace.

introduction_doc = nlp("This tutorial is about Natural Language Processing in spaCy.")
type(introduction_doc)


spacy.tokens.doc.Doc

In [16]:
# Generate tokens from the Doc
tokens = [token.text for token in introduction_doc]
tokens[0]

'This'

### We can also read from a file

In [17]:
# import pathlib
# file_name = "introduction.txt"
# introduction_doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))
# print ([token.text for token in introduction_doc])

## Sentence Detection
Sentence detection is the process of locating where **sentences start** and **end in a given text**.

This allows us **to divide a text into linguistically meaningful units.**

In [18]:
about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)
about_doc = nlp(about_text)
sentences = list(about_doc.sents)  # the .sents property yields the sentences in the Doc
len(sentences)


2

In [19]:
# Print the first five tokens of each sentence span
for sentence in sentences:
    print(f"{sentence[:5]}...")
    # print(type(sentence))

Gus Proto is a Python...
He is interested in learning...
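
Sentence boundaries can also be found without a trained model. As a minimal sketch, the cell below assumes spaCy v3 and uses a blank English pipeline with the built-in rule-based `sentencizer` component, which splits on sentence-final punctuation. The variable names are chosen for this example only.

```python
import spacy

# A blank pipeline has a tokenizer but no trained components.
blank_nlp = spacy.blank("en")
# The built-in "sentencizer" marks sentence boundaries using punctuation rules.
blank_nlp.add_pipe("sentencizer")

sample_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)
sample_doc = blank_nlp(sample_text)
rule_based_sentences = [sent.text for sent in sample_doc.sents]

for number, sentence in enumerate(rule_based_sentences, start=1):
    print(number, sentence)
```

A punctuation rule is fast, but it can split in the wrong place on abbreviations, which is one reason the trained pipeline is used in the rest of this notebook.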


## **Tokens** in **spaCy**



Building the Doc container involves **tokenizing the text**. The process of **tokenization breaks a text down into its basic units**—or **tokens**—which are represented in spaCy as **Token objects**.

In [27]:
nlp = spacy.load("en_core_web_sm")
about_text = ('''Gus Proto is a Python developer currently
              working for a London-based Fintech
               company. He is interested in learning
                Natural Language Processing.''')
about_doc = nlp(about_text)

# The line breaks and indentation inside the triple-quoted string show up as whitespace tokens
for token in about_doc:
    # .idx is the character offset at which the token starts in the original text
    print(token, token.idx)

Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32

               42
working 57
for 65
a 69
London 71
- 77
based 78
Fintech 84

                91
company 107
. 114
He 116
is 119
interested 122
in 133
learning 136

                 144
Natural 161
Language 169
Processing 178
. 188
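
Each Token also carries boolean flags that describe what kind of token it is. The sketch below assumes a blank English pipeline, which is enough for tokenization, and uses `is_space` to skip whitespace tokens like the ones in the output above. The sample sentence is shortened for illustration.

```python
import spacy

# Tokenization needs no trained model, so a blank pipeline is sufficient here.
tokenizer_only = spacy.blank("en")
flags_doc = tokenizer_only("Gus Proto works for a\n London-based Fintech company.")

for token in flags_doc:
    if token.is_space:
        # Skip tokens that consist only of whitespace, such as the line break
        continue
    print(f"{token.text:<10} idx={token.idx:<3} alpha={token.is_alpha} punct={token.is_punct}")

visible_tokens = [token.text for token in flags_doc if not token.is_space]
```

Note that the hyphenated word is split into three tokens, just as `London-based` is in the output above.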


## Stop Words

Stop words are typically defined as the most common words in a language. In English, examples include **the**, **are**, **but**, and **they**. Most sentences need stop words in order to be **full sentences that make grammatical sense.**

With NLP, **stop words are generally removed because they aren’t significant,** and they heavily distort any word frequency analysis. spaCy stores a list of stop words for the English language.

In [33]:
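spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)  # not displayed, because only a cell's last expression is echoed

# STOP_WORDS is a set, so which ten words are printed can differ between runs
for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)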

back
really
this
nevertheless
'll
amount
an
would
however
why
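
To check individual words rather than print part of the list, one option is to test membership in `STOP_WORDS` directly, or to read the `is_stop` flag from the vocabulary. This is a small sketch assuming a blank English pipeline; the word list is picked only for illustration.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp_blank = spacy.blank("en")

for word in ["the", "are", "but", "they", "Python"]:
    in_list = word.lower() in STOP_WORDS      # plain set membership
    flagged = nlp_blank.vocab[word].is_stop   # the same information as a lexeme flag
    print(f"{word:<8} in STOP_WORDS={in_list}  is_stop={flagged}")
```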


### Removing **stop words** from the input text by making use of the **.is_stop** attribute of each token

In [35]:
custom_about_text = (
 """Ben Ogega is a teacher and Mechanical engineering trainer currently
working for a Nairobi-based  Road Safety
company. He is interested in learning
Natural Language Processing.""")
nlp = spacy.load("en_core_web_sm")
about_doc = nlp(custom_about_text)
print([token for token in about_doc if not token.is_stop])

[Ben, Ogega, teacher, Mechanical, engineering, trainer, currently, 
, working, Nairobi, -, based,  , Road, Safety, 
, company, ., interested, learning, 
, Natural, Language, Processing, .]


## **Lemmatization**
**Lemmatization** is the process of **reducing inflected forms** of a word while still ensuring that the reduced form belongs to the language. This **reduced form, or root word**, is called a **lemma**.

For example, **organizes, organized and organizing** are all forms of **organize.** Here, **organize is the lemma**.

In [36]:
conference_help_text = ("""Ben Ogega is a teacher and Mechanical engineering
trainer currently
working for a Nairobi-based  Road Safety
company. He is interested in learning
Natural Language Processing.""")

conference_help_doc = nlp(conference_help_text)
for token in conference_help_doc:
  if str(token) != str(token.lemma_):
    print(f"{str(token):>20} : {str(token.lemma_)}")

                  is : be
          Mechanical : mechanical
             working : work
               based : base
                  He : he
                  is : be
            learning : learn
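
The idea of mapping several inflected forms to one lemma can be illustrated with a toy lookup table. This is not how spaCy's lemmatizer is implemented; the table and the `toy_lemmatize` helper are hypothetical and cover only a handful of words.

```python
# Hypothetical, hand-written table: inflected form -> lemma.
TOY_LEMMA_TABLE = {
    "organizes": "organize",
    "organized": "organize",
    "organizing": "organize",
    "is": "be",
    "working": "work",
}


def toy_lemmatize(word: str) -> str:
    """Return the lemma from the toy table, or the lowercased word if it is unknown."""
    lowered = word.lower()
    return TOY_LEMMA_TABLE.get(lowered, lowered)


for word in ["organizes", "Organized", "organizing", "meetups"]:
    print(f"{word:>12} : {toy_lemmatize(word)}")
```

A real lemmatizer has to handle words it has never seen and words whose lemma depends on their part of speech, which a fixed table like this cannot do.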


## Word Frequency
We can now convert a given text into **tokens** and **perform statistical** analysis on it.

In [48]:
from collections import Counter
# nlp = spacy.load("en_core_web_sm")
complete_text = ("""Ben Ogega is a mechanical engineer with a penchant for
 programming in Python currently
working for a Nairobi-based Road Safety company. He is
interested in learning Natural Language Processing.
He hopes to one day attend a developer conference in his city.
If he gets an opportunity he will give his keynote speech a title
 "Applications of Natural
Language Processing in Road Safety and Defensive Driving".
He purposes to start organizing local Python meetups and several
internal talks at his workplace. Ben is also an avid reader.
His new found passion is in Machine Learning especially the  "Use
cases of Natural Language Processing in Road Safety and Defensive Driving".
Apart from his work, he is very passionate about music and history.
He hopes to learn how to play a guitar.""")


In [50]:

print(complete_text[:25])

Ben Ogega is a mechanical


In [51]:
complete_doc = nlp(complete_text)

# Keep only tokens that are neither stop words nor punctuation
words = [
    token.text
    for token in complete_doc
    if not token.is_stop and not token.is_punct
]

print(Counter(words).most_common(5))

[('\n', 11), ('Road', 3), ('Safety', 3), ('Natural', 3), ('Language', 3)]


In [52]:
# For comparison, keep the stop words and drop only punctuation
Counter([token.text for token in complete_doc if not token.is_punct]).most_common(5)

[('\n', 11), ('a', 6), ('in', 6), ('is', 5), ('He', 4)]
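
In both results above the most frequent "word" is the line break `'\n'`, because whitespace tokens are neither stop words nor punctuation. One way to keep them out of the counts is to also filter on `is_space`. The sketch below assumes a blank English pipeline and a short made-up sample text.

```python
from collections import Counter

import spacy

nlp_counts = spacy.blank("en")
sample = "Road Safety matters.\nRoad Safety saves lives.\nPython helps Road Safety."
sample_doc = nlp_counts(sample)

# Drop stop words, punctuation, and whitespace-only tokens before counting
content_words = [
    token.text
    for token in sample_doc
    if not token.is_stop and not token.is_punct and not token.is_space
]
word_counts = Counter(content_words)
print(word_counts.most_common(2))
```

Applying the same three-part filter to `complete_doc` would remove the `'\n'` entry from the lists above.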

## Part-of-Speech Tagging
**Part of speech** or POS is a grammatical role that explains how a **particular word is used in a sentence**. There are typically **eight parts of speech:**

  1. **Noun**
  2. **Pronoun**
  3. **Adjective**
  4. **Verb**
  5. **Adverb**
  6. **Preposition**
  7. **Conjunction**
  8. **Interjection**

POS tags are useful for assigning a **syntactic** category, like **noun or verb**, to each **word**.
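
To show how POS tags are stored on tokens, the sketch below builds a Doc by hand instead of running a trained tagger. It assumes spaCy v3, where the `Doc` constructor accepts a `pos` argument. The tags are assigned manually for illustration and are not model predictions.

```python
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Gus", "is", "learning", "Python", "quickly", "."]
# Hand-assigned Universal POS tags, one per word (an assumption, not tagger output)
hand_assigned_pos = ["PROPN", "AUX", "VERB", "PROPN", "ADV", "PUNCT"]

tagged_doc = Doc(Vocab(), words=words, pos=hand_assigned_pos)

for token in tagged_doc:
    # spacy.explain returns a short human-readable description of a tag
    print(f"{token.text:<10} {token.pos_:<6} {spacy.explain(token.pos_)}")

pos_by_word = {token.text: token.pos_ for token in tagged_doc}
```

With a trained pipeline such as `en_core_web_sm`, the same `token.pos_` attribute is filled in automatically when the text is processed.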

### Reading Data
