<h1 style="text-align: center"> Natural Language Processing (NLP) </h1>
<h5 style="text-align: center"> An introduction to Machine Learning </h5>

NLP is a fascinating field at the intersection of computer science and linguistics, and it's a key component of many of the technologies we use every day, from search engines to virtual assistants.

In this article, we'll dive into the core concepts of NLP, explore various techniques, and see how we can apply them to real-world problems. Whether you're a seasoned data scientist or just starting out, there's something here for you.

If you're interested in Machine Learning and Data Science, I invite you to follow me on my various platforms:

- [LinkedIn](https://www.linkedin.com/in/md-rishat-talukder-a22157239/)
- [GitHub](https://github.com/RishatTalukder/learning_machine_learning/tree/main)
- [YouTube](https://www.youtube.com/channel/UCEEkKXpAkSwcMaIHnGi3Niw)

Let's get started!

# Introduction

Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and humans using natural language. It's a fascinating field that has seen rapid growth in recent years, thanks to advances in machine learning and deep learning.

Some real-world applications of NLP are:

- `Chatbots`
- `ChatGPT`
- `Copilot`

> Even 30% of this article was written by Copilot 😁

## What & How of NLP

### What is NLP?

`Natural Language Processing (NLP)` is a very broad field that encompasses a wide range of tasks, from simple text processing to complex language understanding. As it is a `subfield of artificial intelligence`, it is concerned with the interaction between computers and humans using natural language.

Suppose you work for a `customer service department` and you receive hundreds of `emails` every day. Reading and responding to each one manually would be impractical.

Or you are a `doctor` and you have to go through hundreds of `medical records` to `diagnose` a patient. It would be very time-consuming and error-prone for you to do this manually.

There are hundreds of examples like these, where doing the work manually would be very time-consuming and error-prone. This is where `NLP` comes in. It can help you `automate` these tasks and make your life easier.

`NLP` can help you `extract information` from `text`, `classify` text into different categories, `summarize` text, `translate` text from one language to another, and much more.

So, think of the first scenario where you receive hundreds of emails every day. You can use `NLP` to `automatically read` and `classify` these emails into different categories. This way, you can `prioritize` which emails to respond to first and which ones to respond to later. Roughly, this works as follows:

- `Compile` all the emails into a single collection of documents.
- `Featurize` the text data, meaning you convert the text into a numeric format that a machine learning model can use.
- `Compare` the features of each email to a set of predefined categories.

These are the basic steps involved in `NLP` but there are many more advanced techniques that can be used to `extract information` from text data.
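The three steps above can be sketched in a few lines of plain Python. This is only a toy illustration, not a real classifier: the categories, the keyword sets, and the example emails are all made up for this sketch, and a real system would learn the categories from labeled data instead of using hand-written keywords.

```python
from collections import Counter

# Hypothetical categories and keywords, invented for this sketch.
CATEGORY_KEYWORDS = {
    "billing": {"invoice", "refund", "payment", "charge"},
    "technical": {"error", "crash", "login", "password"},
}


def featurize(email):
    """Turn an email into word counts (lowercased, basic punctuation stripped)."""
    words = email.lower().split()
    return Counter(word.strip(".,!?") for word in words)


def classify(email):
    """Pick the category whose keywords appear most often in the email."""
    features = featurize(email)
    scores = {
        category: sum(features[word] for word in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"


# Step 1: compile the emails (made-up examples).
emails = [
    "I was charged twice, please send a refund.",
    "The app shows an error after login.",
]

# Steps 2 and 3: featurize each email and compare it to the categories.
labels = [classify(email) for email in emails]
print(labels)  # ['billing', 'technical']
```

An email that matches none of the keywords falls into `"other"`, which is one simple way to handle messages the categories don't cover.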

### How does NLP work?

Here's a simple example to illustrate how `NLP` works:

Suppose you have two `documents`:

- Document 1: "Bob Likes Apples"
- Document 2: "Sam Likes Oranges"

You want to `compare` these two documents to see if they are `similar` or `different`. We can:

- `Tokenize` the documents, meaning we would split the documents into individual words. So, the tokenized version of the documents would be:
    - Document 1: ["Bob", "Likes", "Apples"]
    - Document 2: ["Sam", "Likes", "Oranges"]

- `Vectorize` the documents, meaning we would convert the words into numbers. We can use a technique called `Bag of Words` to do this.

`Bag of Words` is a simple technique that converts text data into a matrix of word counts. Each row in the matrix represents a document, and each column represents a word. The value in each cell represents the count of the word in the document.

So, we compile all the words in the documents into a single list:

`["Bob", "Sam", "Likes", "Apples", "Oranges"]`

Now, we can convert the documents into vectors:

- Document 1: `"Bob Likes Apples"` -> `{"Bob": 1, "Sam": 0, "Likes": 1, "Apples": 1, "Oranges": 0}` -> `[1, 0, 1, 1, 0]`
- Document 2: `"Sam Likes Oranges"` -> `{"Bob": 0, "Sam": 1, "Likes": 1, "Apples": 0, "Oranges": 1}` -> `[0, 1, 1, 0, 1]`
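Here is a minimal sketch of the tokenize and vectorize steps in plain Python. The function names `tokenize` and `vectorize` are my own for this illustration, and splitting on whitespace is the simplest possible tokenizer. The vocabulary is written out by hand in the same order as the word list above, since the column order of a `Bag of Words` matrix is arbitrary as long as it is the same for every document.

```python
from collections import Counter


def tokenize(document):
    """Split a document into words on whitespace."""
    return document.split()


def vectorize(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Same word order as the list in the text.
vocabulary = ["Bob", "Sam", "Likes", "Apples", "Oranges"]

vec1 = vectorize(tokenize("Bob Likes Apples"), vocabulary)
vec2 = vectorize(tokenize("Sam Likes Oranges"), vocabulary)

print(vec1)  # [1, 0, 1, 1, 0]
print(vec2)  # [0, 1, 1, 0, 1]
```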

Now, we have a fully `vectorized` version of each document. We can now `compare` these vectors to see if they are `similar` or `different`. This is very useful for `document classification` because we are treating the documents as `vectors` of `features`. So, we can perform `mathematical operations` like `dot products` and `cosine similarity` to compare the documents.

Now, I'm not going to go deep into the `mathematical details` of how these operations work, GO DO YOUR OWN RESEARCH 😁

> `COSINE SIMILARITY` is the `dot product` of two vectors `divided` by the `product` of their `magnitudes` (their `lengths` measured from the `origin`).

![cosine](attachment:image.png)

The `equation` for `cosine similarity` is:

$$ \text{Cosine Similarity} = \frac{A \cdot B}{||A|| \times ||B||} $$

Where `A` and `B` are the two vectors and `||A||` and `||B||` are the magnitudes of the two vectors.

We can use `cosine similarity` to `compare` two `documents`. For word-count vectors, the value ranges from 0 to 1: if it is `close to 1`, the documents are `similar`; if it is `close to 0`, they share few or no words and are `different`.
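To make the formula concrete, here is a small sketch that computes the `cosine similarity` of the two `Bag of Words` vectors from the example. The function name `cosine_similarity` is my own here, and returning 0 for an all-zero vector is a convention I picked for this sketch, since the formula is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for empty documents (assumption of this sketch)
    return dot / (norm_a * norm_b)


doc1 = [1, 0, 1, 1, 0]  # "Bob Likes Apples"
doc2 = [0, 1, 1, 0, 1]  # "Sam Likes Oranges"

similarity = cosine_similarity(doc1, doc2)
print(round(similarity, 3))  # 0.333
```

The two documents share only the word "Likes", so the dot product is 1, each magnitude is the square root of 3, and the similarity comes out to 1/3.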

We can also improve the `Bag of Words` model by adjusting the `word counts` based on the `frequency` of the words in the `corpus`. (A `corpus` is a collection of `documents`.)
- `TF-IDF` (Term Frequency-Inverse Document Frequency) is a technique that does this. It `weights` the `word counts` based on the `frequency` of the words in the `corpus`.

`TF(Term Frequency)` is the `importance` of the `term` or `word` in the `document`. It is calculated as the `number of times` the `term` appears in the `document`. We represent it as:

$$ \text{TF}(t, d) = \text{Number of times term t appears in document d} $$

`IDF(Inverse Document Frequency)` is the `importance` of the `term` or `word` in the `corpus` meaning `all the documents`. It also means how `rare` the `term` is in the `corpus`. It is calculated as the `logarithm` of the `total number of documents` divided by the `number of documents` that contain the `term`. We represent it as:

$$ \text{IDF}(t) = \log\left(\frac{N}{\text{df}(t)}\right) $$

Where `N` is the `total number of documents` and `df(t)` is the `number of documents` that contain the `term t`.

`TF-IDF` is calculated as the `product` of `TF` and `IDF`. It `weights` the `word counts` based on the `frequency` of the words in the `corpus`.

$$ \text{W(x, y)} = \text{TF(x, y)} \times \text{IDF(x)} $$
$$ \text{W(x, y)} = \text{Number of times term x appears in document y} \times \log(\frac{\text{N}}{\text{df(x)}}) $$

Here, 

- `N` is the `total number of documents`
- `df(x)` is the `number of documents` that contain the `term x`
- `TF(x, y)` is the `number of times` the `term x` appears in the `document y`

`TF-IDF` is a very powerful technique that can help you `extract important information` from `text data`. We do this to get, not just the `word counts`, but the `importance` of the `words` in the `document`.
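As a quick check of the formula, here is a minimal sketch that computes `TF-IDF` weights exactly as written above: raw counts for `TF` and the natural logarithm of `N / df(x)` for `IDF`. The function name `tf_idf` is my own, and libraries such as scikit-learn use smoothed and normalized variants by default, so their numbers will not match this sketch exactly.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using TF * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # df[term] = number of documents that contain the term at least once
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


docs = ["Bob Likes Apples", "Sam Likes Oranges"]
weights = tf_idf(docs)

print(weights[0]["Likes"])          # 0.0, because "Likes" is in every document
print(round(weights[0]["Bob"], 3))  # 0.693, which is log(2)
```

Notice how "Likes" gets a weight of 0: it appears in both documents, so it tells us nothing about which document we are looking at, while "Bob" and "Apples" stand out.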

This is just a `brief overview` of how `NLP` works. There are many more `advanced techniques` that can be used to `extract information` from `text data`.

# Natural Language Processing using Python

Now that we have a basic understanding of `NLP`, let's see how we can use `Python` to `perform NLP` tasks. We'll use the `Natural Language Toolkit (NLTK)` library, which is a `popular library` for `NLP` in `Python`.

We have to `install` the `NLTK` library first. We can do this using the following command:

```bash

conda install nltk # If you are using Anaconda

pip install nltk # If you are using pip

```

In this article, I'll show you the workings of `NLP` using the `NLTK` library and build a `spam filter` with it. Along the way, we'll learn about `tokenization`, `stemming`, `lemmatization`, and `TF-IDF`.

Let's get started!

## NLTK Basics

The `Natural Language Toolkit (NLTK)` is a `popular library` for `NLP` in `Python`. It provides a wide range of tools and resources for `text processing` and `analysis`. I hope you have already installed the `NLTK` library.

Let's import the `NLTK` library and `download` some `resources`:


```python
import nltk
```

Before going to the code, let me give you an overview. I'll use `nltk.download_shell()` to show you how to `download` the `resources` and `corpora` that you need for your `NLP` tasks.

This method opens an interactive `shell` where you can `download` the `resources` and `corpora` that you need. To `download` a `resource`, enter `d` and then type the `identifier` of the `resource` (for example, `stopwords`).

The shell will give you choices like:

- `d` to `download` the `resource`
- `q` to `quit` the `shell`
- `l` to `list` the `resources`
- `u` to `update` the `resources`

So, let's download the `corpus` named `stopwords`.

```python
nltk.download_shell()
```

```text
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------

Downloader>  d

Download which package (l=list; x=cancel)?

  Identifier>  stopwords

    Downloading package stopwords to /home/rishat/nltk_data...
      Package stopwords is already up-to-date!

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------

Downloader>  l

Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] bcp47............... BCP-47 Language Tags
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                           Extraction Systems in Biology)
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] brown............... Brown Corpus
  [ ] brown_tei........... Brown Corpus (TEI XML Version)
  [ ] cess_cat............ CESS-CAT Treebank
  [ ] cess_esp............ CESS-ESP Treebank
  [ ] chat80.............. Chat-80 Data Files
  [ ] city_database....... City Database
  [ ] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
  [ ] comparative_sentences Comparative Sentence Dataset

Hit Enter to continue:

  [ ] conll2002........... CONLL 2002 Named Entity Recognition Corpus
  [ ] conll2007........... Dependency Treebanks from CoNLL 2007 (Catalan
                           and Basque Subset)
  [ ] crubadan............ Crubadan Corpus
  [ ] dependency_treebank. Dependency Parsed Treebank
  [ ] dolch............... Dolch Word List
  [ ] europarl_raw........ Sample European Parliament Proceedings Parallel
                           Corpus
  [ ] extended_omw........ Extended Open Multilingual WordNet
  [ ] floresta............ Portuguese Treebank
  [ ] framenet_v15........ FrameNet 1.5
  [ ] framenet_v17........ FrameNet 1.7
  [ ] gazetteers.......... Gazeteer Lists
  [ ] genesis............. Genesis Corpus
  [ ] gutenberg........... Project Gutenberg Selections
  [ ] ieer................ NIST IE-ER DATA SAMPLE
  [ ] inaugural........... C-Span Inaugural Address Corpus
  [ ] indian.............. Indian Language POS-Tagged Corpus
  [ ] jeita............... JEITA Public Morphologically Tagged

Hit Enter to continue:

  [ ] knbc................ KNB Corpus (Annotated blog corpus)
  [ ] large_grammars...... Large context-free and feature-based grammars
                           for parser comparison
  [ ] lin_thesaurus....... Lin's Dependency Thesaurus
  [ ] mac_morpho.......... MAC-MORPHO: Brazilian Portuguese news text with
                           part-of-speech tags
  [ ] machado............. Machado de Assis -- Obra Completa
  [ ] masc_tagged......... MASC Tagged Corpus
  [ ] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
  [ ] maxent_treebank_pos_tagger Treebank Part of Speech Tagger (Maximum entropy)
  [ ] moses_sample........ Moses Sample Models
  [ ] movie_reviews....... Sentiment Polarity Dataset Version 2.0
  [ ] mte_teip5........... MULTEXT-East 1984 annotated corpus 4.0
  [ ] mwa_ppdb............ The monolingual word aligner (Sultan et al.
                           2015) subset of the Paraphrase Database.
  [ ] names............... Names Corpus, Version 1.3 (1994-03-2

Hit Enter to continue:

  [ ] opinion_lexicon..... Opinion Lexicon
  [ ] panlex_swadesh...... PanLex Swadesh Corpora
  [ ] paradigms........... Paradigm Corpus
  [ ] pe08................ Cross-Framework and Cross-Domain Parser
                           Evaluation Shared Task
  [ ] perluniprops........ perluniprops: Index of Unicode Version 7.0.0
                           character properties in Perl
  [ ] pil................. The Patient Information Leaflet (PIL) Corpus
  [ ] pl196x.............. Polish language of the XX century sixties
  [ ] porter_test......... Porter Stemmer Test Files
  [ ] ppattach............ Prepositional Phrase Attachment Corpus
  [ ] problem_reports..... Problem Report Corpus
  [ ] product_reviews_1... Product Reviews (5 Products)
  [ ] product_reviews_2... Product Reviews (9 Products)
  [ ] propbank............ Proposition Bank Corpus 1.0
  [ ] pros_cons........... Pros and Cons
  [ ] ptb................. Penn Treebank
  [ ] punkt............... Punkt Tokenizer Models
  [ ] qc...

Hit Enter to continue:

  [ ] rslp................ RSLP Stemmer (Removedor de Sufixos da Lingua
                           Portuguesa)
  [ ] rte................. PASCAL RTE Challenges 1, 2, and 3
  [ ] sample_grammars..... Sample Grammars
  [ ] semcor.............. SemCor 3.0
  [ ] senseval............ SENSEVAL 2 Corpus: Sense Tagged Text
  [ ] sentence_polarity... Sentence Polarity Dataset v1.0
  [ ] sentiwordnet........ SentiWordNet
  [ ] shakespeare......... Shakespeare XML Corpus Sample
  [ ] sinica_treebank..... Sinica Treebank Corpus Sample
  [ ] smultron............ SMULTRON Corpus Sample
  [ ] snowball_data....... Snowball Data
  [ ] spanish_grammars.... Grammars for Spanish
  [ ] state_union......... C-Span State of the Union Address Corpus
  [*] stopwords........... Stopwords Corpus
  [ ] subjectivity........ Subjectivity Dataset v1.0
  [ ] swadesh............. Swadesh Wordlists
  [ ] switchboard......... Switchboard Corpus Sample
  [ ] tagsets............. Help on Tagsets
  [ ] timit...............

Hit Enter to continue:

  [ ] treebank............ Penn Treebank Sample
  [ ] twitter_samples..... Twitter Samples
  [ ] udhr2............... Universal Declaration of Human Rights Corpus
                           (Unicode Version)
  [ ] udhr................ Universal Declaration of Human Rights Corpus
  [ ] unicode_samples..... Unicode Samples
  [ ] universal_tagset.... Mappings to the Universal Part-of-Speech Tagset
  [ ] universal_treebanks_v20 Universal Treebanks Version 2.0
  [ ] vader_lexicon....... VADER Sentiment Lexicon
  [ ] verbnet3............ VerbNet Lexicon, Version 3.3
  [ ] verbnet............. VerbNet Lexicon, Version 2.1
  [ ] webtext............. Web Text Corpus
  [ ] wmt15_eval.......... Evaluation data from WMT15
  [ ] word2vec_sample..... Word2Vec Sample
  [ ] wordnet2021......... Open English Wordnet 2021
  [ ] wordnet2022......... Open English Wordnet 2022
  [ ] wordnet31........... Wordnet 3.1
  [ ] wordnet............. WordNet
  [ ] wordnet_ic.......... WordNet-InfoContent
  [ ] word

Hit Enter to continue:

Collections:
  [P] all-corpora......... All the corpora
  [P] all-nltk............ All packages available on nltk_data gh-pages
                           branch
  [P] all................. All packages
  [P] book................ Everything used in the NLTK Book
  [P] popular............. Popular packages
  [ ] tests............... Packages for running tests
  [ ] third-party......... Third-party data packages

([*] marks installed packages; [P] marks partially installed collections)

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------

Downloader>  d

Download which package (l=list; x=cancel)?

  Identifier>  brown

    Downloading package brown to /home/rishat/nltk_data...
      Unzipping corpora/brown.zip.

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------

Downloader>  q
```

Here I have downloaded the `stopwords` corpus (and, as a second example, the `brown` corpus) using the `nltk.download_shell()` method. You can `download` whichever `resources` and `corpora` you need for your `NLP` tasks the same way.
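If you'd rather skip the interactive shell, `NLTK` also lets you download a package directly with `nltk.download()`. Below is a small sketch of that, together with a helper that filters `stopwords` out of a list of tokens. The helper names `load_english_stopwords` and `remove_stopwords` are my own, and the tiny stop list at the bottom is hand-written just so the example runs without a download.

```python
def load_english_stopwords():
    """Download the stopwords corpus if needed and return it as a set."""
    import nltk  # imported here so remove_stopwords works without NLTK data
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)  # non-interactive download
    return set(stopwords.words("english"))


def remove_stopwords(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


# Tiny hand-written stop list, used here only so the example runs offline.
demo_stop_words = {"the", "is", "a", "of"}
filtered = remove_stopwords(
    ["The", "price", "of", "apples", "is", "low"], demo_stop_words
)
print(filtered)  # ['price', 'apples', 'low']
```

With the corpus installed, you would call `remove_stopwords(tokens, load_english_stopwords())` to filter against the full English list that `NLTK` ships.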

For the data, I'll use the `SMS Spam Collection` dataset from the `UCI Machine Learning Repository`. This dataset contains `SMS` messages that are `labeled` as `spam` or `ham` (not spam). We'll use this dataset to build a `spam filter` using `NLP`.

You can `download` the `dataset` from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/228/sms+spam+collection).

And I also have the dataset in my `GitHub` repository. You can `download` it from there too.
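As a rough sketch of how the raw file could be read, the snippet below assumes each line has the form `label<TAB>message` (`ham` or `spam`, a tab, then the text). Please check this against the file you downloaded. The function name `parse_sms_lines` is my own, and the two sample lines are made-up placeholders, not real rows from the dataset.

```python
def parse_sms_lines(lines):
    """Split each 'label<TAB>message' line into a (label, message) pair."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, message = line.split("\t", 1)  # split on the first tab only
        records.append((label, message))
    return records


# Made-up placeholder lines in the assumed format (not taken from the dataset).
sample_lines = [
    "ham\tSee you at lunch tomorrow\n",
    "spam\tYou won a prize! Reply now to claim\n",
]

records = parse_sms_lines(sample_lines)
print(records[0])  # ('ham', 'See you at lunch tomorrow')
```

For the real data, you would pass an open file object instead of `sample_lines`, using whatever path you saved the dataset to.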