## Natural Language Processing (NLP)

For NLP tasks, we will be using the spaCy library.

To install from inside a notebook cell:
> `!pip install spacy`

To install from CMD or Anaconda Prompt (recommended):
> `conda install -c conda-forge spacy`

Download the English language model (`en_core_web_sm`) for spaCy via CMD or Anaconda Prompt:
> `python -m spacy download en_core_web_sm`

### What is NLP?
NLP is an area of computer science (CS) and artificial intelligence (AI) concerned with the interactions between machines and human (natural) languages. In short, it is about programming machines to process and analyze natural language at scale.

In general, NLP processing looks like this:
> ***Natural Human Language (speech and text in English, French, Bengali) <br> → Raw Data (text, speech data) <br> → NLP processing (tokenization, normalization, segmentation) <br> → Feature Extraction <br> → Machine-recognizable vector representation (BoW, TF-IDF, Word Embeddings) <br> → Machine analysis (ML-DL tasks) <br> → NLP Applications (spam detection, chatbots, sentiment analysis, etc.)***

### NLP Processing Steps

Processing natural language data involves quite a few steps. In a traditional pipeline, the work can be divided into 9 simple steps:

#### Step-01: Text Cleaning
Removing irrelevant or noisy elements from the text. For example:
```
"Hey! 🙂 Check out this link ###https://abc.com!!!"
>> ["Hey! Check out this link"]
```
<br>
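
As a rough illustration of this step, here is a minimal sketch using only Python's `re` module. The `clean_text` name and both patterns are assumptions made for this example; they keep only ASCII letters, digits, and basic punctuation, so real projects will likely need a broader rule set.

```python
import re

# Assumption: anything starting with http:// or https:// up to the next space is a URL.
URL_PATTERN = re.compile(r"https?://\S+")
# Assumption: keep ASCII letters, digits, whitespace, and basic punctuation;
# everything else (emoji, stray symbols such as '#') is treated as noise.
NOISE_PATTERN = re.compile(r"[^A-Za-z0-9\s.,!?']")


def clean_text(text):
    """Remove URLs, emoji, and stray symbols, then tidy up whitespace."""
    text = URL_PATTERN.sub("", text)    # drop URLs first
    text = NOISE_PATTERN.sub("", text)  # drop emoji and stray symbols
    text = re.sub(r"\s+", " ", text)    # collapse repeated whitespace
    return text.strip()


print(clean_text("Hey! 🙂 Check out this link ###https://abc.com!!!"))
# Hey! Check out this link
```

Note that this pattern would also strip non-English letters, so it only suits English-only text.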

#### Step-02: Lowercasing
Converting text to lowercase to ensure consistency across all texts. For example:
```
"Hello, Adam Gross! I'm Senat Brown."
>> ["hello, Adam Gross! i'm Senat Brown."] (✔ Preferable)
>> ["hello, adam gross! i'm senat brown."] (❌ Not preferable, as it loses the capitalization that Named Entity Recognition relies on)
```
<br>
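
One way to get the preferable result above is to lowercase everything except a list of known names. This is only a sketch: the `lowercase_except` helper and the hand-written `names` set are hypothetical, and in practice the names would more likely come from a Named Entity Recognition step.

```python
def lowercase_except(text, protected):
    """Lowercase every whitespace-separated word unless it is a protected name."""
    words = []
    for word in text.split():
        core = word.strip(".,!?;:")  # ignore surrounding punctuation when matching
        words.append(word if core in protected else word.lower())
    return " ".join(words)


names = {"Adam", "Gross", "Senat", "Brown"}  # hypothetical list of known names
print(lowercase_except("Hello, Adam Gross! I'm Senat Brown.", names))
# hello, Adam Gross! i'm Senat Brown.
```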

#### Step-03: Sentence Segmentation
Splitting text into separate sentences. For example:
```
"Sally is mumbling. She might be nervous of speaking."
>> ["Sally is mumbling.", "She might be nervous of speaking."]
```
★ Useful for parsing and document-level analysis.
<br><br>
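
A very simple way to approximate sentence segmentation is a regular expression. The sketch below assumes that a sentence ends with `.`, `!`, or `?` followed by whitespace; the `split_sentences` name is made up for this example, and the rule will wrongly split after abbreviations such as "Dr. Smith".

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


print(split_sentences("Sally is mumbling. She might be nervous of speaking."))
# ['Sally is mumbling.', 'She might be nervous of speaking.']
```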

#### Step-04: Tokenization
Breaking sentences into smaller pieces (tokens) such as words and symbols. For example:
```
"I'm Gary Hunson. A DIY shop owner at Brisbey."
>> ["I'm", "Gary", "Hunson", ".", "A", "DIY", "shop", "owner", "at", "Brisbey", "."]
```
★ Tokenization depends on the task at hand and the context we're working with. It will become clearer in upcoming notebooks.
<br><br>
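
The token list above can be reproduced with a small regular expression. This is a sketch under one assumption: a token is either a word (optionally with an internal apostrophe, as in "I'm") or a single punctuation character. The `TOKEN_PATTERN` and `tokenize` names are illustrative; spaCy applies different rules and, for instance, splits contractions into separate tokens.

```python
import re

# A token is a run of word characters (allowing internal apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(text):
    """Return the list of word and punctuation tokens found in the text."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("I'm Gary Hunson. A DIY shop owner at Brisbey."))
# ["I'm", 'Gary', 'Hunson', '.', 'A', 'DIY', 'shop', 'owner', 'at', 'Brisbey', '.']
```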

#### Step-05: Normalization
Standardizing text formats to ensure consistency across all types of text. This often results in better accuracy in classification-related tasks. It includes:
- Expanding contractions (don't → do not)
- Removing extra punctuation (Hello!! → Hello!)
- Converting numbers to text [optional] (3 → three)

For example:
```
"I can't do 9 to 5 anymore!!!"
>> ["I cannot do nine to five anymore!"]
```
<br>
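
The three normalization rules above could be sketched as follows. The lookup tables are tiny and purely illustrative, and the `normalize` helper is a hypothetical name; a real project would need a much larger contraction list and a proper number-to-words routine.

```python
import re

# Illustrative tables only; far from complete.
CONTRACTIONS = {"can't": "cannot", "don't": "do not", "i'm": "i am"}
NUMBER_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
                "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

CONTRACTION_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def normalize(text):
    """Expand contractions, collapse repeated '!'/'?', spell out single digits."""
    # Expansion is looked up in lowercase, so "Don't" becomes "do not".
    text = CONTRACTION_PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)
    text = re.sub(r"([!?])\1+", r"\1", text)
    text = re.sub(r"\b[0-9]\b", lambda m: NUMBER_WORDS[m.group(0)], text)
    return text


print(normalize("I can't do 9 to 5 anymore!!!"))
# I cannot do nine to five anymore!
```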

#### Step-06: Stopword Removal
Stopword removal is the process of eliminating very common words (am, was, is, to, a, an) that carry little standalone meaning and often do not help a model distinguish between texts. For example:
```
"I am learning representation learning techniques."
>> ["learning", "representation", "learning", "techniques"]
```
★ Removing stopwords:
- Reduces noise in text
- Reduces vocabulary size
- Improves model efficiency
- Focuses on content-bearing words
<br>
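
A minimal sketch of stopword removal, assuming a small hand-written stopword set (`STOP_WORDS` here is illustrative, not spaCy's list) and a simple letters-only tokenizer:

```python
import re

STOP_WORDS = {"i", "am", "is", "was", "to", "a", "an", "the"}  # small illustrative set


def remove_stopwords(text):
    """Tokenize into alphabetic words and drop those found in STOP_WORDS."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]


print(remove_stopwords("I am learning representation learning techniques."))
# ['learning', 'representation', 'learning', 'techniques']
```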

#### Step-07: Stemming or Lemmatization
**Stemming** is fast, but it is not always a good choice because it simply chops off suffixes and can produce stems that are not real words.
> `studying → studi` | Fast, but not a proper word<br>
> `running → run` | A proper stem

**Lemmatization**, on the other hand, is more accurate and usually preferable in practice, but slower.
> `studying → study` | A proper lemma <br>
> `running → run` | A proper lemma
<br>
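
To make the contrast concrete, here is a toy comparison. `crude_stem` is a deliberately simplistic suffix stripper written for this example (it is not the Porter algorithm, so its outputs differ from a real stemmer's), and `LEMMAS` is a hypothetical lookup table standing in for a real lemmatizer's dictionary.

```python
def crude_stem(word):
    """Toy stemmer: strip one common suffix, then undo consonant doubling."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "runn" -> "run": drop a doubled final consonant
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


# Hypothetical dictionary mapping inflected forms to their lemma.
LEMMAS = {"studying": "study", "studies": "study", "running": "run", "ran": "run"}


def lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)


for word in ("running", "studies", "ran"):
    print(word, "| stem:", crude_stem(word), "| lemma:", lemmatize(word))
# running | stem: run | lemma: run
# studies | stem: studi | lemma: study
# ran | stem: ran | lemma: run
```

The rule-based stemmer needs no dictionary but produces the non-word "studi" and misses the irregular form "ran", while the dictionary-based lemmatizer handles both at the cost of a lookup.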

#### Step-08: Handling Rare Words and Noise
Replacing very rare or unknown words with an `<UNK>` token often makes processing more robust. For example:

```
"qwerty keyboard"
>> ["<UNK>", "keyboard"]
```
<br>
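
One common way to decide which words count as "rare" is a frequency threshold over the corpus. The sketch below assumes a `min_count` cutoff of 2 and a made-up three-document corpus; `build_vocab` and `replace_rare` are hypothetical helper names.

```python
from collections import Counter


def build_vocab(corpus_tokens, min_count=2):
    """Keep only the tokens that appear at least min_count times in the corpus."""
    counts = Counter(tok for doc in corpus_tokens for tok in doc)
    return {tok for tok, count in counts.items() if count >= min_count}


def replace_rare(tokens, vocab, unk="<UNK>"):
    """Replace every token that is not in the vocabulary with the unk marker."""
    return [tok if tok in vocab else unk for tok in tokens]


# Made-up corpus: "keyboard" appears 3 times, every other word once.
corpus = [["mechanical", "keyboard"], ["wireless", "keyboard"], ["qwerty", "keyboard"]]
vocab = build_vocab(corpus, min_count=2)
print(replace_rare(["qwerty", "keyboard"], vocab))
# ['<UNK>', 'keyboard']
```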

#### Step-09: Vectorization
This is the crucial final step: converting the tokenized text into a numerical vector representation that a machine can process. Common techniques include:
1. **Bag of Words (BoW)** = Represents text by counting the number of times each word appears and ignores the word order.
```
Text = "learning representation learning techniques"
Tokens = ["learning", "representation", "learning", "techniques"]
Vector = [2, 1, 1] (learning 2x, representation 1x, techniques 1x)
```
2. **TF-IDF (Term Frequency–Inverse Document Frequency)** = Improves on BoW by reducing the importance of very common words (is, was, am) and highlighting distinctive words.<br>
TF: how often a word appears in a document<br>
IDF: how rare the word is across documents
```
learning ‚Üí 0.32,        representation ‚Üí 0.78,        techniques ‚Üí 0.64
Vector = [0.32, 0.78, 0.64]
```
3. **Word Embeddings** = Word embeddings represent each word as a dense numerical vector that captures semantic meaning. Words with similar meanings have similar vectors.<br>
```
learning        ‚Üí [0.21, -0.34, 0.87, ...]
representation  ‚Üí [0.19, -0.30, 0.82, ...]
techniques     ‚Üí [0.25, -0.40, 0.90, ...]
```
- Key Learning: **representation learning ≈ feature learning** (the word "*feature*" would have a vector quite similar to that of "*representation*")

4. **Token IDs** = Used mostly by modern deep learning models (LSTMs and Transformers such as BERT and GPT), which take the integer token IDs produced by a tokenizer as input.<br>
```
learning       ‚Üí 1045
representation ‚Üí 2381
techniques     ‚Üí 7129
Tokens         = ["learning", "representation", "learning", "techniques"]
Token IDs      = [1045, 2381, 1045, 7129]
```
- **Important Note**
    - Stopword removal is often skipped for transformers
    - Tokenizers handle casing, subwords, and punctuation
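
The BoW, TF-IDF, and Token ID ideas above can be sketched in plain Python. Everything here is an assumption made for illustration: the two-document corpus, the helper names, and the unsmoothed `tf * log(N / df)` weighting (libraries often use smoothed variants). The resulting numbers therefore differ from the illustrative values shown earlier, which depend on the corpus and the weighting variant.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct token to an integer ID, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc)
    return [counts[tok] for tok in vocab]  # dicts keep insertion order


def tf_idf(doc, docs, vocab):
    """Weight each word by term frequency times log(N / document frequency)."""
    counts = Counter(doc)
    vector = []
    for tok in vocab:
        tf = counts[tok] / len(doc)
        df = sum(1 for d in docs if tok in d)  # df >= 1: vocab is built from docs
        vector.append(tf * math.log(len(docs) / df))
    return vector


def encode(doc, vocab):
    """Turn a token list into the corresponding list of integer token IDs."""
    return [vocab[tok] for tok in doc]


docs = [
    ["learning", "representation", "learning", "techniques"],
    ["learning", "is", "fun"],
]
vocab = build_vocabulary(docs)
print(bag_of_words(docs[0], vocab))  # [2, 1, 1, 0, 0]
print(encode(docs[0], vocab))        # [0, 1, 0, 2]
print([round(w, 3) for w in tf_idf(docs[0], docs, vocab)])
```

In this toy corpus "learning" appears in every document, so its TF-IDF weight is zero even though its BoW count is the highest.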

## spaCy Basics
- Loading the library
- Building a pipeline object
- Using tokens
- Part-of-Speech tagging
- Understanding token attributes

In [1]:
#load spacy
import spacy

In [2]:
# loads language processing model
nlp = spacy.load('en_core_web_sm')

In [3]:
# A simple text is passed through the `nlp()` object for NLP processing
doc = nlp(u'Hartell is looking for a company with $500m asset to start business in the U.A.E')

In [4]:
# Show each token generated from the text
for token in doc:
    print(token.text)

Hartell
is
looking
for
a
company
with
$
500
m
asset
to
start
business
in
the
U.A.E


In [5]:
# Show the POS of each token as an integer ID (use `pos_` for the readable label)
for token in doc:
    print(token.text, token.pos)

Hartell 96
is 87
looking 100
for 85
a 90
company 92
with 85
$ 99
500 93
m 92
asset 92
to 94
start 100
business 92
in 85
the 90
U.A.E 96


In [6]:
# Show the readable POS label of each token
for token in doc:
    print(token.text, '=', token.pos_)

Hartell = PROPN
is = AUX
looking = VERB
for = ADP
a = DET
company = NOUN
with = ADP
$ = SYM
500 = NUM
m = NOUN
asset = NOUN
to = PART
start = VERB
business = NOUN
in = ADP
the = DET
U.A.E = PROPN


In [7]:
# Show the POS label and syntactic dependency label of each token
for token in doc:
    print(token.text, token.pos_, token.dep_)

Hartell PROPN nsubj
is AUX aux
looking VERB ROOT
for ADP prep
a DET det
company NOUN pobj
with ADP prep
$ SYM nmod
500 NUM nummod
m NOUN quantmod
asset NOUN pobj
to PART aux
start VERB advcl
business NOUN dobj
in ADP prep
the DET det
U.A.E PROPN pobj


In [8]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x2d4c4abd100>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x2d4c4abd880>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x2d4c4ab6120>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x2d4c4b63d80>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x2d4c4b75640>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x2d4c4ab62e0>)]

`nlp.pipeline` is an *ordered list* of components that a text passes through for processing. When you call the `nlp` object on a text, the text is first tokenized, and then each component in the pipeline is applied to the `Doc` object in sequence, modifying it and adding annotations.

`Text → nlp() → [Tokenizer → tok2vec → tagger → parser → attribute_ruler → lemmatizer → ner] → Doc`
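
To make the "applied in sequence" idea concrete, here is a toy pipeline in plain Python. It is not spaCy's implementation: `ToyPipeline`, `toy_tokenizer`, `title_tagger`, and `length_counter` are hypothetical names, and the "document" is just a dict that each component annotates in turn.

```python
def toy_tokenizer(text):
    """Create the toy document: the raw text plus its whitespace tokens."""
    return {"text": text, "tokens": text.split()}


def title_tagger(doc):
    """Annotate each token with whether it starts with a capital letter."""
    doc["is_title"] = [tok.istitle() for tok in doc["tokens"]]
    return doc


def length_counter(doc):
    """Annotate each token with its length in characters."""
    doc["lengths"] = [len(tok) for tok in doc["tokens"]]
    return doc


class ToyPipeline:
    def __init__(self, tokenizer, components):
        self.tokenizer = tokenizer
        self.pipeline = list(components)  # ordered (name, callable) pairs

    def __call__(self, text):
        doc = self.tokenizer(text)          # tokenization always runs first
        for _name, component in self.pipeline:
            doc = component(doc)            # each component adds annotations
        return doc


toy_nlp = ToyPipeline(
    toy_tokenizer,
    [("title_tagger", title_tagger), ("length_counter", length_counter)],
)
toy_doc = toy_nlp("Hartell is looking")
print([name for name, _ in toy_nlp.pipeline])  # ['title_tagger', 'length_counter']
print(toy_doc["is_title"])                     # [True, False, False]
print(toy_doc["lengths"])                      # [7, 2, 7]
```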

In [94]:
doc2 = nlp(u"Hartell's aren't    looking for a company anymore. Are they? Almost...around 1.5K applied and 200 got in 2025!")

In [99]:
# Show the POS tag and dependency label of each token, with a description
print(f"{'Token':<10} {'POS':<10} {'Dependency':<12} Description")
print("-"*60)

for token in doc2:
    print(f"{token.text:<10} {token.pos_:<10} {token.dep_:<12} {spacy.explain(token.dep_)}")

Token      POS        Dependency   Description
------------------------------------------------------------
Hartell    PROPN      nsubj        nominal subject
's         PART       case         case marking
are        AUX        aux          auxiliary
n't        PART       neg          negation modifier
           SPACE      dep          unclassified dependent
looking    VERB       ROOT         root
for        ADP        prep         prepositional modifier
a          DET        det          determiner
company    NOUN       pobj         object of preposition
anymore    ADV        advmod       adverbial modifier
.          PUNCT      punct        punctuation
Are        AUX        ROOT         root
they       PRON       nsubj        nominal subject
?          PUNCT      punct        punctuation
Almost     ADV        advmod       adverbial modifier
...        PUNCT      punct        punctuation
around     ADP        advmod       adverbial modifier
1.5        NUM        nummod       numeric modifier

In [100]:
doc2[0] #show the tokenized word at index 0

Hartell

In [102]:
doc2[5].text # similar but as text

'looking'

In [104]:
doc2[5].pos_ # show the POS of the token

'VERB'

In [105]:
doc2[5].dep_ # show the syntactic dependency label of the token

'ROOT'

In [106]:
doc2[5].lemma_ # show the Base form of the word

'look'

In [107]:
doc2[5].tag_ # show the detailed POS tag

'VBG'

In [17]:
doc2[0].shape_ # the word's shape: capitalization, punctuation, digits

'Xxxxx'

In [111]:
doc2[0].is_alpha # whether the token consists only of alphabetic characters

True

In [112]:
doc2[17]

1.5

In [113]:
doc2[17].is_alpha # 1.5 is numeric, not alphabetic; so it's False

False

In [117]:
doc2[7]

a

In [118]:
doc2[7].is_stop # whether the token is a stop word

True

In [119]:
doc3 = nlp(
    u"On this I took comfort in spite of all my sorrow, and said, ‘I know, then, about these two; tell me, therefore, about the third man of whom you spoke; is he still alive, but at sea, and unable to get home? or is he dead? Tell me, no matter how much it may grieve me."
)

In [124]:
quote = doc3[15:25]

In [125]:
print(quote)

‘I know, then, about these two;


In [126]:
type(quote)

spacy.tokens.span.Span

In [24]:
type(doc3)

spacy.tokens.doc.Doc

In [42]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [43]:
for sentence in doc4.sents:
    print(sentence)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [44]:
doc4[6] #7th token

This

In [45]:
doc4[6].is_sent_start # to check if the token is the first token in a sentence

True

In [47]:
doc4[7]

is

In [46]:
doc4[7].is_sent_start

False

In [29]:
doc4[6].is_sent_end # to check if the token is the last token in a sentence

False

In [48]:
doc4[5]

.

In [49]:
doc4[5].is_sent_end

True