![logo](https://drive.google.com/uc?export=view&id=1QJ9PAT9q-Ksv_Vs_pLXtLHxjjV-9FMTz)

Created by: [Noureldin Mohamed](mailto:s-noureldin.hamedo@zewailcity.edu.eg)


# Table of Contents

- [Introduction](#scrollTo=qWnhhhtY0psV)
- [Building Blocks](#scrollTo=rufmCSTm0-Tb)


# **Introduction**

<p>
<b>NLP</b>, or <b><font color='green'>Natural Language Processing</font></b>, is the field of artificial intelligence focused on enabling machines to understand, interpret, and generate human language. It involves developing algorithms and models that analyze and process written and spoken data. NLP plays a crucial role in applications such as:

<ul>
<li> Text Understanding and Extraction
<li> Chatbots and Virtual Assistants
<li> Sentiment Analysis
<li> Named Entity Recognition
<li> Text Generation
</ul>
</p>

<h2><b>How does NLP Work?</b></h2>

<img src="https://www.shaip.com/wp-content/uploads/2022/10/How-NLP-Works-760px.jpg" width="600">



# **Building Blocks**

Before diving deeper into the vast realm of NLP models, we need to understand the essential steps required to get things working correctly.
<br><br>
<h3><b>Step 1: Preprocessing</b></h3>
Text preprocessing prepares text data for model input by removing noise such as emoticons, punctuation, and variations in case. Human language can express the same idea in many different ways, which is a significant challenge because machines, unlike humans, require numerical input. We therefore need to convert text into meaningful numerical representations, and clean, consistent text makes that conversion more reliable.

We will look at some common, essential ways to preprocess your text.
<h4><font color='gold'><b>Normalization</b></font></h4>

<b>What is it?</b>

Normalization is like tidying up your text. It involves making all text uniform, such as converting everything to lowercase. This helps the model treat words like <font color='green'><b>"apple"</b></font> and <font color='green'><b>"Apple"</b></font> the same way.


```
text = "Apple and apple are the same word."  # Example input (placeholder)
text = text.lower()  # Convert to lowercase
```



<b>Why do it?</b>

It ensures consistency in your text data, so the model doesn't get confused by variations in capitalization.
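
As a slightly fuller sketch of normalization, the snippet below lowercases the text and also collapses repeated whitespace. The `normalize` helper and the sample string are illustrative assumptions, not part of any library.

```python
import re

def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace into single spaces."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(normalize("  Apple   pie\tIS  Great "))  # apple pie is great
```

Whether to go beyond lowercasing depends on the task: for some problems, capitalization carries a signal worth keeping.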

<h4><font color='gold'><b>Tokenization</b></font></h4>
<b>What is it?</b>

Tokenization breaks text into smaller units called tokens, typically individual words or sentences. It lets the model process each unit separately.


```
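import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')  # Download the Punkt tokenizer models (newer NLTK releases may ask for 'punkt_tab' instead)

text = "NLP is fun. Tokenization splits text into pieces."  # Example input (placeholder)

# Tokenize into words
words = word_tokenize(text)

# Tokenize into sentences
sentences = sent_tokenize(text)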
```


<b>Why do it?</b>

It's like giving the model a set of building blocks <font color='green'><b>(words)</b></font> to work with, making it easier to analyze and understand the text.
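
To see what a tokenizer does under the hood, here is a minimal regex-based sketch that needs no downloads. The function names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical, and the rules are deliberately crude: for instance, they would split abbreviations such as "Dr." incorrectly, which is one reason to prefer a trained tokenizer in practice.

```python
import re

def simple_word_tokenize(text: str) -> list:
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text: str) -> list:
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "NLP is fun. Is it hard?"
print(simple_word_tokenize(sample))  # ['NLP', 'is', 'fun', '.', 'Is', 'it', 'hard', '?']
print(simple_sent_tokenize(sample))  # ['NLP is fun.', 'Is it hard?']
```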


<h4><font color='gold'><b>Lemmatization</b></font></h4>

<b>What is it?</b>

Lemmatization reduces a word to its base dictionary form (its lemma). For example, <font color='green'><b>"running"</b></font> becomes <font color='green'><b>"run"</b></font>. It helps the model focus on the core meaning.


```
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # Download the Punkt tokenizer models
nltk.download('wordnet')  # Download the WordNet lemmatizer data

text = "He was running while she runs"  # Example input (placeholder)

# Tokenize into words
words = word_tokenize(text)

# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize each word. lemmatize() assumes nouns by default (pos='n'),
# so "running" only becomes "run" when it is treated as a verb (pos='v').
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
```


<b>Why do it?</b>

It reduces words to their essential forms, so the model does not treat related forms such as <font color='green'><b>(running, run, runs)</b></font> as unrelated words.
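
A lemma depends on the word's part of speech, which is why NLTK's `lemmatize` accepts a `pos` argument. The toy sketch below illustrates the idea with a hand-written lookup table. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names for illustration only; a real lemmatizer relies on a full dictionary such as WordNet.

```python
# Toy lookup table keyed by (word, part of speech); hand-written for illustration only.
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("runs", "v"): "run",
    ("ran", "v"): "run",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma from the toy table, or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("running", pos="v"))  # run
print(toy_lemmatize("running"))           # running (no noun entry, so unchanged)
print(toy_lemmatize("mice"))              # mouse
```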


<h4><font color='gold'><b>Stop Words Removal</b></font></h4>

<b>What is it?</b>

Stop words are common words like <font color='green'><b>"the"</b></font> and <font color='green'><b>"and"</b></font> that don't carry much meaning. Removing them helps the model focus on the important words.


```
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # Download the Punkt tokenizer models
nltk.download('stopwords')  # Download the stopwords data

text = "The cat and the dog are playing in the garden"  # Example input (placeholder)

# Tokenize into words
words = word_tokenize(text)

# Get English stop words
stop_words = set(stopwords.words('english'))

# Remove stop words (compare in lowercase so "The" is also removed)
filtered_words = [word for word in words if word.lower() not in stop_words]
```


<b>Why do it?</b>

It cleans up the text, leaving only the meaningful words that contribute to the overall meaning.
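
One caveat: many stop-word lists include negations such as "not", and dropping them can flip the meaning of a sentence, which matters for tasks like sentiment analysis. The sketch below uses a tiny hand-picked stop-word set (an assumption for illustration, not NLTK's list) and a hypothetical `keep_negations` switch to show the difference.

```python
# Tiny hand-picked stop-word set for illustration; real lists are much longer.
TOY_STOP_WORDS = {"the", "is", "a", "and", "this", "not"}
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(tokens, keep_negations=False):
    """Drop stop words; optionally keep negation words that flip the meaning."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        is_protected = keep_negations and lowered in NEGATIONS
        if lowered in TOY_STOP_WORDS and not is_protected:
            continue
        kept.append(token)
    return kept

tokens = ["This", "movie", "is", "not", "good"]
print(remove_stop_words(tokens))                       # ['movie', 'good']
print(remove_stop_words(tokens, keep_negations=True))  # ['movie', 'not', 'good']
```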

<h4><font color='gold'><b>Removing Punctuation</b></font></h4>

<b>What is it?</b>

Removing punctuation involves getting rid of symbols like <font color='green'><b>"."</b></font>, <font color='green'><b>","</b></font>, and <font color='green'><b>"!"</b></font> from the text.


```
import re

# Replace every character that is not an ASCII letter or digit with a space
text = re.sub(r"[^0-9a-zA-Z]", " ", text)
```


<b>Why do it?</b>

Punctuation usually adds little meaning in text analysis, so removing it makes the text cleaner and easier for the model to process.
<br><br>
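
Note that the pattern `[^0-9a-zA-Z]` also removes non-ASCII letters such as "é". If that is not desired, one possible alternative is to target punctuation characters only. The sketch below uses Python's built-in `string.punctuation` (ASCII punctuation only); the helper name `strip_punctuation` is an assumption for illustration.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def strip_punctuation(text: str) -> str:
    """Replace ASCII punctuation with spaces, then collapse extra whitespace."""
    return " ".join(text.translate(PUNCT_TO_SPACE).split())

print(strip_punctuation("Café, naïve... really?!"))  # Café naïve really
```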



<h3><b>Step 2: Feature Extraction</b></h3>

Now that our text is prepared, it is time to convert it into a numerical representation that the computer can understand. This process is called <font color='green'><b>Word Embedding</b></font>.
<br><br>
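
The simplest numerical representation is an integer id per distinct word. The sketch below builds such a vocabulary from a token list. `build_vocab` and `encode` are hypothetical helper names, and this is only a starting point: the ids themselves carry no meaning.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token with its integer id."""
    return [vocab[token] for token in tokens]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = build_vocab(tokens)
print(vocab)                  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(encode(tokens, vocab))  # [0, 1, 2, 3, 0, 4]
```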

Word embedding can be implemented in two different ways.
<br><br>

<h4><b>Frequency-Based Embedding</b></h4>

Frequency-based embedding represents words by how often they occur in the text corpus. Depending on the scheme, raw counts can be used directly (Bag of Words) or reweighted so that words appearing in nearly every document count for less (TF-IDF).


A <font color='green'><b>text corpus</b></font> refers to a large and structured collection of texts in a particular language or domain. It serves as a representative sample of a language, providing a basis for linguistic analysis, natural language processing (NLP), and machine learning tasks.
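
As a minimal sketch of frequency-based weighting, the code below computes term frequencies (a normalized Bag of Words) and TF-IDF weights over a three-document toy corpus. It assumes the plain `log(N / df)` form of IDF; libraries often use smoothed variants, so their numbers will differ. The corpus and function names are illustrative.

```python
import math
from collections import Counter

# Toy corpus: each document is already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "dog", "barked"],
]

def term_frequency(doc):
    """Bag-of-words counts divided by the document length."""
    counts = Counter(doc)
    return {term: count / len(doc) for term, count in counts.items()}

def inverse_document_frequency(docs):
    """idf(t) = log(N / df(t)), where df(t) counts documents containing t."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    idf = inverse_document_frequency(docs)
    return [{t: tf * idf[t] for t, tf in term_frequency(doc).items()} for doc in docs]

for weights in tf_idf(docs):
    print(weights)
```

In this corpus "the" appears in every document, so its TF-IDF weight is zero, while the rarer "cat" and "barked" receive the largest weights.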

<br><br>
<h4><b>Prediction-Based Embedding</b></h4>

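Prediction-based embedding learns word vectors by training a model to predict a word from its context (or the context from a word), as in learned word embeddings such as Word2Vec. The frequency-based techniques, by contrast, include Bag of Words, TF-IDF, and N-Grams (Unigram, Bigram, ...).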
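
An n-gram is a contiguous window of n tokens: unigrams for n = 1, bigrams for n = 2, and so on. A minimal sketch, with `ngrams` as a hypothetical helper name and a made-up token list:

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["natural", "language", "processing", "is", "fun"]
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
```

Counting these tuples instead of single words gives a Bag of N-Grams, which keeps some local word order.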

<br><br>
<h3><b>Step 3: Classification </b></h3>

<ul>
<li> RNN, GRU, LSTM
<li> Attention Models, Transformers
<li> BERT
</ul>