## Natural Language Processing (NLP)

For NLP tasks, we will be using the spaCy library.

To install it from inside a notebook cell:
> `!pip install spacy`

To install it from CMD or Anaconda Prompt instead (RECOMMENDED!):
> `conda install -c conda-forge spacy`

Download the English language model (`en_core_web_sm`) for spaCy via CMD or Anaconda Prompt:
> `python -m spacy download en_core_web_sm`

### What is NLP?
NLP is an area of computer science (CS) and artificial intelligence (AI) concerned with the interaction between machines and human (natural) languages. In short, it means programming machines to process and analyze natural language at scale.

In general, NLP processing looks like this:
> ***Natural Human Language (speech and text in English, French, Bengali) <br> → Raw Data (text, speech data) <br> → NLP processing (tokenization, normalization, segmentation) <br> → Feature Extraction <br> → Machine-recognizable vector representation (BoW, TF-IDF, Word Embeddings) <br> → Machine analysis (ML-DL tasks) <br> → NLP Applications (spam detection, chatbots, sentiment analysis etc.)***

### NLP Processing Steps

Processing natural language data involves quite a few steps. In a traditional pipeline, they can be divided into 9 simple steps:

#### Step-01: Text Cleaning
Removing irrelevant or noisy elements from the text. For example:
```
"Hey! 🙂 Check out this link ###https://abc.com!!!"
>> ["Hey! Check out this link"]
```
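
The cleaning step can be sketched with Python's built-in `re` module. The function name `clean_text` and its three rules are illustrative assumptions rather than a standard recipe. In particular, dropping every non-ASCII character only makes sense for English-only text.

```python
import re

def clean_text(text: str) -> str:
    """Illustrative cleaner (assumed rules): strip URLs, non-ASCII symbols, extra spaces."""
    # Remove URLs together with any symbols glued to them (e.g. "###https://...").
    text = re.sub(r"\S*https?://\S+", "", text)
    # Remove non-ASCII characters such as emoji (assumption: English-only text).
    text = re.sub(r"[^\x00-\x7F]+", "", text)
    # Collapse the whitespace left behind and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

raw = "Hey! \U0001F642 Check out this link ###https://abc.com!!!"
print(clean_text(raw))  # Hey! Check out this link
```

What counts as noise depends on the task, so the rules above would normally be adjusted per dataset.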
<br>

#### Step-02: Lowercasing
Converting text to lowercase letters to ensure consistency across all texts. For example:
```
"Hello, Adam Gross! I'm Senat Brown."
>> ["hello, Adam Gross! i'm Senat Brown."] (✔ Preferable way)
>> ["hello, adam gross! i'm senat brown."] (❌ Not preferable, as it loses the capitalization that Named Entity Recognition relies on)
```
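
One possible way to get the preferable output is to lowercase every word except those in a list of known names. This is only a sketch: the helper `lowercase_except` and the `names` set are hypothetical, and in practice the names would more likely come from an NER model than from a hand-written list.

```python
import re

def lowercase_except(text: str, protected: set) -> str:
    """Lowercase every alphabetic word unless it appears in `protected`."""
    def convert(match):
        word = match.group(0)
        return word if word in protected else word.lower()
    return re.sub(r"[A-Za-z]+", convert, text)

names = {"Adam", "Gross", "Senat", "Brown"}  # hypothetical list of known names
print(lowercase_except("Hello, Adam Gross! I'm Senat Brown.", names))
# hello, Adam Gross! i'm Senat Brown.
```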
<br>

#### Step-03: Sentence Segmentation
Splitting text into individual sentences. For example:
```
"Sally is mumbling. She might be nervous of speaking."
>> ["Sally is mumbling.", "She might be nervous of speaking."]
```
★ Useful for parsing and document-level analysis
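
A naive version of sentence segmentation can be written as a single regular expression. The function `split_sentences` is an illustrative assumption. It breaks on any `.`, `!` or `?` followed by whitespace, so it would wrongly split abbreviations such as "Dr. Smith"; trained segmenters handle such cases better.

```python
import re

def split_sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(split_sentences("Sally is mumbling. She might be nervous of speaking."))
# ['Sally is mumbling.', 'She might be nervous of speaking.']
```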
<br><br>

#### Step-04: Tokenization
Breaking sentences into suitable pieces (tokens) such as words and symbols. For example:
```
"I'm Gary Hunson. A DIY shop owner at Brisbey."
>> ["I'm", "Gary", "Hunson", ".", "A", "DIY", "shop", "owner", "at", "Brisbey", "."]
```
★ Tokenization depends on the task at hand and the context we're working with. It will become clearer in upcoming notebooks.
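
As a rough sketch, a regex tokenizer that keeps words (including inner apostrophes) whole and emits each punctuation mark as its own token could look like this. The function `tokenize` and its pattern are assumptions chosen to reproduce the example above, not the rules of any particular library.

```python
import re

def tokenize(text: str) -> list:
    """Keep words (with inner apostrophes) whole; emit each punctuation mark separately."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

print(tokenize("I'm Gary Hunson. A DIY shop owner at Brisbey."))
# ["I'm", 'Gary', 'Hunson', '.', 'A', 'DIY', 'shop', 'owner', 'at', 'Brisbey', '.']
```

Some tokenizers instead split contractions such as "I'm" into two tokens, which is one example of how the right choice depends on the task.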
<br><br>

#### Step-05: Normalization
Standardizing text formats to ensure consistency across all types of text. This often improves accuracy in classification tasks. It includes:
- Expanding contractions (don't → do not)
- Removing extra punctuation (Hello!! → Hello!)
- Converting numbers to text [optional] (3 → three)

For example:
```
"I can't do 9 to 5 anymore!!!"
>> ["I cannot do nine to five anymore!"]
```
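
The three normalization rules listed above can be sketched as follows. The lookup tables `CONTRACTIONS` and `NUMBER_WORDS` are small hypothetical samples, and only single digits are converted to words; a real pipeline would need much larger tables or a dedicated library.

```python
import re

# Hypothetical sample tables; real ones would be far larger.
CONTRACTIONS = {"can't": "cannot", "don't": "do not", "won't": "will not"}
NUMBER_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
                "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(text: str) -> str:
    # 1. Expand contractions found in the table; leave unknown ones untouched.
    text = re.sub(r"\b\w+'\w+\b",
                  lambda m: CONTRACTIONS.get(m.group(0).lower(), m.group(0)),
                  text)
    # 2. Collapse repeated '!' or '?' into a single mark.
    text = re.sub(r"([!?])\1+", r"\1", text)
    # 3. Spell out standalone single digits (optional step).
    text = re.sub(r"\b\d\b", lambda m: NUMBER_WORDS[m.group(0)], text)
    return text

print(normalize("I can't do 9 to 5 anymore!!!"))
# I cannot do nine to five anymore!
```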
<br>

#### Step-06: Stopword Removal
Eliminating very common words (am, was, is, to, a, an) that carry little standalone meaning in a sentence and often do not help a model distinguish between texts. For example:
```
"I am learning representation learning techniques."
>> ["learning", "representation", "learning", "techniques"]
```
★ Removing stopwords:
- Reduces noise in text
- Reduces vocabulary size
- Improves model efficiency
- Focuses on content-bearing words
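
A minimal sketch of stopword filtering, assuming a tiny hand-written stopword set: `STOPWORDS` here is hypothetical and far shorter than the lists shipped with NLP libraries. The sketch also lowercases the text and drops punctuation, matching the example above.

```python
import re

STOPWORDS = {"i", "am", "is", "was", "to", "a", "an", "the"}  # hypothetical mini list

def remove_stopwords(text: str) -> list:
    """Lowercase, keep alphabetic words only, then drop anything in STOPWORDS."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in STOPWORDS]

print(remove_stopwords("I am learning representation learning techniques."))
# ['learning', 'representation', 'learning', 'techniques']
```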
<br>

#### Step-07: Stemming or Lemmatization
**Stemming** is a fast technique that strips word endings, but the result is not always a proper word, so it is not always a good choice.
> `studying → studi` | Faster processing but not a proper stem<br>
> `running → run` | A proper stem

On the other hand, **Lemmatization** maps each word to its dictionary form (lemma). It is more accurate and usually preferable in practice, but slower to process.
> `studying → study` | A proper lemma <br>
> `running → run` | A proper lemma
<br>
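
The difference between the two can be illustrated with a toy example. Both functions below are assumptions made for this sketch: `toy_stem` applies three crude suffix rules loosely inspired by rule-based stemmers (it is not a full Porter stemmer), and `toy_lemmatize` uses a tiny hypothetical lookup table in place of a real dictionary.

```python
def toy_stem(word: str) -> str:
    """Crude rule-based stemmer for illustration only."""
    stem = word.lower()
    # Rule 1: strip "-ing" when a reasonably long stem remains.
    if stem.endswith("ing") and len(stem) > 5:
        stem = stem[:-3]
        # Rule 2: undouble a trailing double consonant ("runn" -> "run").
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
            stem = stem[:-1]
    # Rule 3: turn a final "y" into "i" ("study" -> "studi").
    if stem.endswith("y") and len(stem) > 2:
        stem = stem[:-1] + "i"
    return stem

# Hypothetical mini dictionary standing in for a real lemma lexicon.
LEMMA_TABLE = {"studying": "study", "studies": "study", "running": "run", "ran": "run"}

def toy_lemmatize(word: str) -> str:
    """Look the word up; fall back to the lowercased word if it is unknown."""
    return LEMMA_TABLE.get(word.lower(), word.lower())

for word in ["studying", "running"]:
    print(word, toy_stem(word), toy_lemmatize(word))
# studying studi study
# running run run
```

The stemmer needs no dictionary, which is why it is fast, while the lemmatizer is only as good as its lexicon.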

#### Step-08: Handling Rare Words and Noise
Replacing very rare or unknown words with an `<UNK>` token often makes processing more robust. For example:

```
"qwerty keyboard"
>> ["<UNK>", "keyboard"]
```
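
One common way to decide which words count as rare is a frequency threshold over the training corpus. The sketch below assumes that approach; the helper names, the `min_count` parameter, and the three-sentence corpus are all made up for illustration.

```python
from collections import Counter

def build_vocab(corpus_tokens, min_count=2):
    """Keep only tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(token for sentence in corpus_tokens for token in sentence)
    return {token for token, count in counts.items() if count >= min_count}

def replace_rare(tokens, vocab, unk="<UNK>"):
    """Swap every out-of-vocabulary token for the `unk` placeholder."""
    return [token if token in vocab else unk for token in tokens]

# Hypothetical tokenized corpus.
corpus = [["mechanical", "keyboard"], ["wireless", "keyboard"], ["qwerty", "keyboard"]]
vocab = build_vocab(corpus, min_count=2)
print(replace_rare(["qwerty", "keyboard"], vocab))
# ['<UNK>', 'keyboard']
```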
<br>

#### Step-09: Vectorization
This is the crucial final step: converting the tokenized text into a numerical vector representation that a machine can process. Several techniques are used, such as:
1. **Bag of Words (BoW)** = Represents text by counting how many times each word appears, ignoring word order.
```
Text = "learning representation learning techniques"
Vocabulary = ["learning", "representation", "techniques"]
Vector = [2, 1, 1] (learning 2x, representation 1x, techniques 1x)
```
2. **TF-IDF (Term Frequency–Inverse Document Frequency)** = Improves on BoW by reducing the weight of very common words (is, was, am) and highlighting distinctive words.<br>
TF: how often a word appears in a document<br>
IDF: how rare the word is across documents
```
learning → 0.32,        representation → 0.78,        techniques → 0.64
Vector = [0.32, 0.78, 0.64]
```
3. **Word Embeddings** = Represent each word as a dense numerical vector that captures semantic meaning. Words with similar meanings have similar vectors.<br>
```
learning        → [0.21, -0.34, 0.87, ...]
representation  → [0.19, -0.30, 0.82, ...]
techniques      → [0.25, -0.40, 0.90, ...]
Key Learning:
representation learning ≈ feature learning (the word "feature" would have a vector quite similar to that of "representation")
```
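
The three techniques can be sketched in plain Python. Everything here is a simplified illustration under stated assumptions: `tf_idf` uses the plain `tf * log(N / df)` weighting (libraries often apply smoothed variants, so their numbers differ, and the values in the TF-IDF example above are not reproduced), the two-document corpus is made up, and the embedding vectors use only the three components displayed above, since the elided ones are unknown. Cosine similarity is one common way to compare embeddings.

```python
import math
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term occurs in `tokens` (order is ignored)."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def tf_idf(documents):
    """Plain tf * log(N / df) weighting over a list of tokenized documents."""
    vocabulary = sorted({term for doc in documents for term in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = {term: sum(term in doc for doc in documents) for term in vocabulary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; values near 1 mean 'similar'."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

tokens = ["learning", "representation", "learning", "techniques"]
print(bag_of_words(tokens, ["learning", "representation", "techniques"]))  # [2, 1, 1]

# Hypothetical two-document corpus, made up for this sketch.
corpus = [tokens, ["learning", "is", "fun"]]
vocabulary, vectors = tf_idf(corpus)
# "learning" appears in both documents, so its IDF (and weight) is 0.
print(dict(zip(vocabulary, vectors[0])))

# Only the three displayed components of each embedding are used.
learning = [0.21, -0.34, 0.87]
representation = [0.19, -0.30, 0.82]
print(cosine_similarity(learning, representation))  # close to 1
```

Note how TF-IDF drives the weight of a word shared by every document to zero, while BoW would still count it.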