In [1]:
import pandas as pd

df: pd.DataFrame = pd.read_csv('data/sentiment_analysis.csv')
corpus = df['text']

## Introduction to NLP

NLP (Natural Language Processing) is all about enabling computers to understand, interpret, and work with human language in a meaningful way. Think about things like sentiment analysis, machine translation, chatbots, or information retrieval—NLP powers all these applications and more.

## Basic Text Processing

### Tokenization

Tokenization is breaking text into smaller pieces, like sentences or words. It’s one of the first steps in processing language data.

- **Sentence Tokenization**: Splits text into sentences.  
- **Word Tokenization**: Splits sentences into words. 
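
The two levels of tokenization can be sketched with plain regular expressions. This is a simplified illustration, not how NLTK's tokenizers work internally: the helper names `sentence_tokenize` and `word_tokenize_simple` are hypothetical, and the sentence splitter naively breaks after any `.`, `!` or `?` followed by whitespace, so it will mis-split abbreviations such as "Dr. Smith".

```python
import re

def sentence_tokenize(text: str) -> list[str]:
    """Naive sentence splitter: break on whitespace that follows ., ! or ?."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def word_tokenize_simple(sentence: str) -> list[str]:
    """A token is a run of word characters (optionally with an internal
    apostrophe) or a single punctuation character."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "What a great day!!! Looks like dream."
sentences = sentence_tokenize(text)
words = [word_tokenize_simple(s) for s in sentences]

print(sentences)  # ['What a great day!!!', 'Looks like dream.']
print(words)
```

For real text, a trained tokenizer such as the NLTK one used in the next cell is usually the safer choice.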


In [3]:
from nltk.tokenize import NLTKWordTokenizer

tokenizer = NLTKWordTokenizer()
tokens = tokenizer.tokenize_sents(corpus[0:2])

print(corpus[0:2])
print(tokens)

0             What a great day!!! Looks like dream.
1    I feel sorry, I miss you here in the sea beach
Name: text, dtype: object
[['What', 'a', 'great', 'day', '!', '!', '!', 'Looks', 'like', 'dream', '.'], ['I', 'feel', 'sorry', ',', 'I', 'miss', 'you', 'here', 'in', 'the', 'sea', 'beach']]


## Text Vectorization

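Techniques that convert text into numeric features are generally known as "text vectorization" techniques, because they all turn text into vectors (arrays of numbers) that can then be fed to a machine learning model. In the table below, each vector position corresponds to one word of the vocabulary `[the, cat, sat, in, hat, with]` (ignoring case), and a 1 means that the word occurs in the sentence.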

| Sentence                  | Tokens                                    | One-Hot Encoding Vector |
|---------------------------|-------------------------------------------|-------------------------|
| "The cat sat in the hat"  | ["The", "cat", "sat", "in", "the", "hat"] | [1, 1, 1, 1, 1, 0]      |
| "The cat with the hat"    | ["The", "cat", "with", "the", "hat"]      | [1, 1, 0, 0, 1, 1]      |

In [4]:
vocabulary = sorted(set(word for sentence in tokens for word in sentence))
print(f'Vocabulary: {vocabulary} \n{"-"*20}')

print('Sentence 0:',tokens[0])
print('Vector 0:',[1 if word in tokens[0] else 0 for word in vocabulary])

print('Sentence 1:',tokens[1])
print('Vector 1:',[1 if word in tokens[1] else 0 for word in vocabulary])

Vocabulary: ['!', ',', '.', 'I', 'Looks', 'What', 'a', 'beach', 'day', 'dream', 'feel', 'great', 'here', 'in', 'like', 'miss', 'sea', 'sorry', 'the', 'you'] 
--------------------
Sentence 0: ['What', 'a', 'great', 'day', '!', '!', '!', 'Looks', 'like', 'dream', '.']
Vector 0: [1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
Sentence 1: ['I', 'feel', 'sorry', ',', 'I', 'miss', 'you', 'here', 'in', 'the', 'sea', 'beach']
Vector 1: [0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]


### Part-of-Speech (POS) Tagging

POS tagging assigns a part of speech (e.g., noun, verb, adjective) to each token in a sentence.

![alt text](<images/Pasted image 20241109035814.png>)

In [48]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('universal_tagset')

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

print("John's big idea isn't all that bad :",pos_tag(word_tokenize("John's big idea isn't all that bad.")) )

print("John's big idea isn't all that bad :", pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal'))

John's big idea isn't all that bad : [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'), ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
John's big idea isn't all that bad : [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'), ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Riyadh\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\Riyadh\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


### Text Normalization

Text normalization is a key step in preparing raw text data for analysis. It involves transforming text into a consistent format, making it easier for models to interpret and analyze. This process reduces noise and variability, ensuring that different forms of the same word or expression are treated consistently in NLP tasks.

![alt text](<images/Pasted image 20241021005733.png>)

A **text preprocessing pipeline** cleans and standardizes text by **lowercasing**, **removing repetitions and punctuation**, and **normalizing words** with stemming or lemmatization. This improves consistency and prepares the text for NLP tasks.

#### To lower case

In [6]:
corpus[0].lower()

'what a great day!!! looks like dream.'

#### Remove repeated characters

In [7]:
corpus[167]

'writing report cards  soooo tired but what an amazing day. check it out on fb soon!'

In [9]:
import re

re.sub(r"(.)\1{2,}", r"\1", corpus[167])

'writing report cards  so tired but what an amazing day. check it out on fb soon!'

#### Special Characters removal

In [10]:
# Use a negated character class to remove every character that is not a letter or a space
re.sub(r'[^a-zA-Z ]', '', corpus[0])

'What a great day Looks like dream'
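
The three steps shown so far (lowercasing, collapsing repeated characters, removing special characters) can be chained into one function. The sketch below is only one possible ordering; the name `normalize_text` is hypothetical, and the final whitespace-squeezing step is an addition that the cells above do not perform.

```python
import re

def normalize_text(text: str) -> str:
    """Apply the normalization steps from this section in sequence."""
    text = text.lower()                          # 1. lowercase
    text = re.sub(r"(.)\1{2,}", r"\1", text)     # 2. collapse runs of 3+ identical characters
    text = re.sub(r"[^a-z ]", "", text)          # 3. keep only letters and spaces
    text = re.sub(r" {2,}", " ", text).strip()   # 4. (assumption) squeeze leftover spaces
    return text

print(normalize_text("What a great day!!! Looks like dream."))
# what a great day looks like dream
print(normalize_text("writing report cards  soooo tired but what an amazing day."))
# writing report cards so tired but what an amazing day
```

Note that step 2 is lossy: it would also shorten a legitimate run of three identical characters, so it is best suited to informal text such as social media posts.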

#### Lemmatization & Stemming

![alt text](<images/Pasted image 20241021005855.png>)

These techniques simplify the vocabulary and reduce the number of features, but they can reduce the **accuracy of word meaning** and affect the **contextual interpretation**. For example, stemming may group words too aggressively, leading to errors in sentiment analysis, while lemmatization preserves more of the original meaning.

| Technique         | Pros                                                          | Cons                                                                 |
| ----------------- | ------------------------------------------------------------- | -------------------------------------------------------------------- |
| **Stemming**      | - Faster, computationally light                               | - Less accurate, may distort meaning (e.g., "universal" → "univers") |
| **Lemmatization** | - More accurate, retains meaning (e.g., "am, are, is" → "be") | - Slower, requires more computation and linguistic resources         |

##### Lemmatization

In [11]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [12]:
corpus[167].split(' ')[2]

'cards'

In [13]:
lemmatizer.lemmatize(corpus[167].split(' ')[2])

'card'

##### Stemming

In [14]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

In [15]:
corpus[167]

'writing report cards  soooo tired but what an amazing day. check it out on fb soon!'

In [16]:
stemmer.stem(corpus[167].split(' ')[9]) # amazing

'amaz'

### Stopwords

**Stopwords** are common words in a language, such as "the," "is," "in," and "and," which typically carry minimal semantic value and do not contribute significantly to the meaning of the text. In natural language processing (NLP), removing stopwords is a common preprocessing step to reduce the noise in text data, allowing models to focus on more meaningful terms.

In [17]:
pass


#### Why Remove Stopwords?
- **Enhances Model Efficiency**: Removing stopwords reduces the vocabulary size, making the model more efficient and reducing computational costs.
- **Improves Relevance**: Helps to focus on words with higher semantic importance, which can lead to better model accuracy in tasks like text classification, search relevance, and topic modeling.

Stopwords removal is generally useful but should be carefully applied based on the context, as some stopwords may carry meaning in specific tasks, like sentiment analysis.
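
A minimal sketch of stopword filtering is shown below. The `STOPWORDS` set is a tiny hand-written list used only for illustration, and `remove_stopwords` is a hypothetical helper; in practice a fuller list such as `nltk.corpus.stopwords.words('english')` (available after `nltk.download('stopwords')`) would typically be used.

```python
# Tiny illustrative stopword list (an assumption, not NLTK's list).
STOPWORDS = {"a", "an", "the", "is", "in", "and", "i", "you", "here"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens_example = ["I", "feel", "sorry", ",", "I", "miss", "you",
                  "here", "in", "the", "sea", "beach"]
filtered = remove_stopwords(tokens_example)
print(filtered)
# ['feel', 'sorry', ',', 'miss', 'sea', 'beach']
```

Passing the stopword set as a parameter makes it easy to keep task-relevant words, for example leaving "not" out of the set for sentiment analysis.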


## Word Embeddings

Word embeddings are a way of representing words as vectors in a multi-dimensional space, where the distance and direction between vectors reflect the similarity and relationships among the corresponding words.

### Two approaches to word embeddings

#### Frequency-based embeddings

Frequency-based embeddings refer to word representations that are derived from the frequency of words in a corpus. These embeddings are based on the idea that the importance or significance of a word can be inferred from how frequently it occurs in the text.

##### Bag-of-words

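The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. Machine learning algorithms cannot work with raw text directly; the text must first be converted into fixed-length numeric vectors. In a bag-of-words vector, each position holds the number of times a vocabulary word occurs in the document, and word order is ignored.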

Example:
1. “The cat sat” 
2. “The cat sat in the hat” 
3. “The cat with the hat”

![alt text](<images/Pasted image 20241021005545.png>)
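
The three example sentences can be turned into bag-of-words count vectors with a few lines of plain Python. This is a sketch under simple assumptions: tokens are obtained by lowercasing and splitting on whitespace, and the vocabulary is sorted alphabetically, so the column order may differ from the figure above. The helper name `bag_of_words` is hypothetical.

```python
from collections import Counter

docs = ["The cat sat", "The cat sat in the hat", "The cat with the hat"]
tokenized = [d.lower().split() for d in docs]

# Vocabulary: every distinct word in the corpus, in a fixed (sorted) order.
vocab = sorted({w for doc in tokenized for w in doc})

def bag_of_words(doc_tokens, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocab]

bow_vectors = [bag_of_words(doc, vocab) for doc in tokenized]

print(vocab)  # ['cat', 'hat', 'in', 'sat', 'the', 'with']
for doc, vec in zip(docs, bow_vectors):
    print(f"{doc!r:28} -> {vec}")
```

Unlike the binary vectors in the vectorization section, "the" gets a count of 2 in the second and third sentences.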


##### Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF is used to emphasize words that are common within a specific document but relatively rare across the entire corpus, helping to highlight key terms for each document.

The TF-IDF score for a term is calculated as:

$$
\textit{TF-IDF} = TF \times IDF
$$

$$
\begin{align*}
    \text{Let}\ t&=\text{Term} \\
    \text{Let}\ IDF&=\text{Inverse Document Frequency} \\
    \text{Let}\ TF&=\text{Term Frequency} \\[2em]
    TF \:&=\: \frac{\text{term frequency in document}}{\text{total words in document}} \\[1em]
    IDF(t) \:&=\: \log_2\left(\frac{\text{total documents in corpus}}{\text{documents with term}}\right)
\end{align*}
$$

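**Note**: If a term does not appear in any document of the corpus, the IDF denominator is zero. To avoid a divide-by-zero error, a common remedy is to add 1 to the document counts (add-one smoothing).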


###### Step 1: Calculate Term Frequencies (TF)

To find Term Frequency (TF), divide the count of each term by the total number of terms in the document.

Sample documents:

1. **Document 1 (D1)**: "The cat sat" – Total terms: 3
2. **Document 2 (D2)**: "The cat sat in the hat" – Total terms: 6
3. **Document 3 (D3)**: "The cat with the hat" – Total terms: 5

| Term |     TF in Document 1 (D1)      |     TF in Document 2 (D2)      |  TF in Document 3 (D3)  |
| :--: | :----------------------------: | :----------------------------: | :---------------------: |
| the  | $$ \frac{1}{3} \approx 0.33 $$ | $$ \frac{2}{6} \approx 0.33 $$ | $$ \frac{2}{5} = 0.4 $$ |
| cat  | $$ \frac{1}{3} \approx 0.33 $$ | $$ \frac{1}{6} \approx 0.17 $$ | $$ \frac{1}{5} = 0.2 $$ |
| sat  | $$ \frac{1}{3} \approx 0.33 $$ | $$ \frac{1}{6} \approx 0.17 $$ |            0            |
|  in  |               0                | $$ \frac{1}{6} \approx 0.17 $$ |            0            |
| hat  |               0                | $$ \frac{1}{6} \approx 0.17 $$ | $$ \frac{1}{5} = 0.2 $$ |
| with |               0                |               0                | $$ \frac{1}{5} = 0.2 $$ |

###### Step 2: Calculate Inverse Document Frequency (IDF)

Calculate the IDF for each term:

- **the**: $$\log_2\left(\frac{3}{3}\right) = 0$$
- **cat**: $$\log_2\left(\frac{3}{3}\right) = 0$$
- **sat**: $$\log_2\left(\frac{3}{2}\right) \approx 0.585$$
- **in**: $$\log_2\left(\frac{3}{1}\right) \approx 1.585$$
- **hat**: $$\log_2\left(\frac{3}{2}\right) \approx 0.585$$
- **with**: $$\log_2\left(\frac{3}{1}\right) \approx 1.585$$

###### Step 3: Calculate TF-IDF Scores

Multiply the IDF and TF values for each term in each document.

| Term | TF-IDF (D1)                         | TF-IDF (D2)                        | TF-IDF (D3)                        |
| ---- | ----------------------------------- | ---------------------------------- | ---------------------------------- |
| the  | $$0 \times 0.33 = 0$$               | $$0 \times 0.33 = 0$$              | $$0 \times 0.4 = 0$$               |
| cat  | $$0 \times 0.33 = 0$$               | $$0 \times 0.17 = 0$$              | $$0 \times 0.2 = 0$$               |
| sat  | $$0.585 \times 0.33 \approx 0.193$$ | $$0.585 \times 0.17 \approx 0.1$$  | 0                                  |
| in   | $$1.585 \times 0 = 0$$              | $$1.585 \times 0.17 \approx 0.27$$ | 0                                  |
| hat  | $$0.585 \times 0 = 0$$              | $$0.585 \times 0.17 \approx 0.1$$  | $$0.585 \times 0.2 \approx 0.117$$ |
| with | $$1.585 \times 0 = 0$$              | 0                                  | $$1.585 \times 0.2 \approx 0.317$$ |
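
The three steps can be reproduced in plain Python as a check on the tables. This is a minimal sketch that follows the formulas in this section (base-2 logarithm, no smoothing); the variable names are illustrative, and library implementations such as scikit-learn's `TfidfVectorizer` use a different IDF variant and normalization, so their numbers will not match.

```python
import math

docs = ["The cat sat", "The cat sat in the hat", "The cat with the hat"]
tokenized = [d.lower().split() for d in docs]
n_docs = len(tokenized)
vocab = sorted({w for doc in tokenized for w in doc})

def tf(term, doc_tokens):
    """Term frequency: occurrences of the term divided by document length."""
    return doc_tokens.count(term) / len(doc_tokens)

# Document frequency: in how many documents does each term occur?
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Inverse document frequency with a base-2 logarithm, as in the formula above.
idf = {t: math.log2(n_docs / doc_freq[t]) for t in vocab}

# One {term: score} dictionary per document.
tfidf = [{t: tf(t, doc) * idf[t] for t in vocab} for doc in tokenized]

for i, scores in enumerate(tfidf, start=1):
    print(f"D{i}:", {t: round(s, 3) for t, s in scores.items()})
```

Because the code multiplies unrounded values, some results differ from the table in the third decimal place (the table multiplies already-rounded TF and IDF values).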


In [18]:
pass

##### N-grams

An N-gram is a sequence of N consecutive words (word level) or N consecutive characters (character level) in a text. By capturing these sequences, N-grams preserve some of the local word order and context that single-word features discard.


![alt text](<images/Pasted image 20241021005447.png>)

In [19]:
from nltk import ngrams

sentence = 'this is a foo bar sentences and I want to ngramize it'

n = 2
bigrams = ngrams(sentence.split(), n)  # n = 2 yields word-level bigrams

for gram in bigrams:
    print(gram)

('this', 'is')
('is', 'a')
('a', 'foo')
('foo', 'bar')
('bar', 'sentences')
('sentences', 'and')
('and', 'I')
('I', 'want')
('want', 'to')
('to', 'ngramize')
('ngramize', 'it')
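
The cell above produces word-level N-grams. Character-level N-grams follow the same sliding-window idea and need no library; the helper `char_ngrams` below is a hypothetical name used for illustration.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Slide a window of n characters over the text, one position at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("ngram", 3))
# ['ngr', 'gra', 'ram']
```

If `n` is larger than the text length, the function returns an empty list.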



#### Prediction-based embeddings

Prediction-based embeddings are word representations derived from models that are trained to predict certain aspects of a word's context or neighboring words. Unlike frequency-based embeddings that focus on word occurrence statistics, prediction-based embeddings capture semantic relationships and contextual information, providing richer representations of word meanings.

##### Word2Vec

Word2Vec embeddings place similar words near each other in vector space, capturing relationships between them. For example, the model can understand that 'man' is to 'woman' as 'king' is to 'queen,' showing how words relate through meaning. This ability to recognize patterns and analogies is a big advantage of Word2Vec.

![alt text](<images/Pasted image 20241109040300.png>)

Word2Vec consists of two main models for generating vector representations: Continuous Bag of Words (CBOW) and Continuous Skip-gram. 

![alt text](<images/Pasted image 20241021001505.png>)

In the context of Word2Vec, the **Continuous Bag of Words (CBOW) model** aims to predict a target word based on its surrounding context words within a given window. It uses the context words to predict the target word, and the learned embeddings capture semantic relationships between words.

The **Continuous Skip-gram model**, on the other hand, takes a target word as input and aims to predict the surrounding context words.
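
The difference between the two models is easiest to see in the training examples they are built from. The sketch below only generates those examples for one sentence; it does not train anything, and the names `context_windows`, `cbow_examples` and `skipgram_examples` are hypothetical. A window of 1 (one word on each side) is assumed to keep the output short.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

sentence_tokens = "the cat sat in the hat".split()

# CBOW: input = all context words, output = the target word.
cbow_examples = [(context, target)
                 for target, context in context_windows(sentence_tokens, window=1)]

# Skip-gram: input = the target word, output = one context word per example.
skipgram_examples = [(target, c)
                     for target, context in context_windows(sentence_tokens, window=1)
                     for c in context]

print(cbow_examples[:3])
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat', 'in'], 'sat')]
print(skipgram_examples[:3])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

CBOW produces one example per position, while skip-gram produces one example per (target, context word) pair.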

###### Limitations of Word2Vec

Do you know the Ozone Layer?

Using **Word2Vec**, let's examine the associations for "Ozone" (the scores below are illustrative):

| Word  | Human | Food | Liquid | Chemical |
| ----- | ----- | ---- | ------ | -------- |
| Cake  | 0.0   | 0.9  | 0.1    | 0.0      |
| Juice | 0.0   | 0.5  | 0.9    | 0.0      |
| Acid  | 0.0   | 0.0  | 0.1    | 0.9      |
| Ozone | 0.0   | 0.7  | 0.0    | 0.1      |

![alt text](<images/Pasted image 20241009234647.png>)

"Ozone" is classified as Food

This example shows that **Word2Vec can make associations** based on statistical relationships in text, but it lacks real-world understanding of concepts. Here, "Ozone" is incorrectly associated with "Food," despite being a gas.



Another limitation of Word2Vec is its inability to distinguish specific subtypes within a category. For instance, **different types of coffee** like espresso and cappuccino may be interpreted as nearly identical due to the small cosine distance between them, even though they are distinct.

![alt text](<images/Pasted image 20241005065755.png>)

This highlights that while Word2Vec captures statistical patterns, it doesn’t fully understand nuanced distinctions between items within a category.
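
Cosine similarity, the measure behind the "small cosine distance" mentioned above, can be sketched with the illustrative four-dimensional vectors from the Ozone table. These numbers are taken from that table and are not real Word2Vec output; the helper `cosine_similarity` is written out by hand so the example needs no external library.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Columns: Human, Food, Liquid, Chemical (values from the table above).
vectors = {
    "Cake":  [0.0, 0.9, 0.1, 0.0],
    "Juice": [0.0, 0.5, 0.9, 0.0],
    "Acid":  [0.0, 0.0, 0.1, 0.9],
}
ozone = [0.0, 0.7, 0.0, 0.1]

similarities = {word: cosine_similarity(ozone, vec) for word, vec in vectors.items()}
nearest = max(similarities, key=similarities.get)

for word, score in similarities.items():
    print(f"{word}: {score:.3f}")
print("Nearest to Ozone:", nearest)  # Cake
```

With these vectors, "Ozone" ends up closest to "Cake", which mirrors the misclassification described in this section.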

In [27]:
import gensim


Here we train a word embedding using the Brown Corpus:

In [49]:
import nltk
nltk.download('brown')

from nltk.corpus import brown

train_set = brown.sents()[:10000]
model = gensim.models.Word2Vec(train_set)


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Riyadh\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


Training can take some time, so once the model is trained it is worth saving it to disk and reloading it later:

In [50]:
model.save('brown.embedding')
new_model = gensim.models.Word2Vec.load('brown.embedding')

The model stores each word in its vocabulary together with its embedding, so the vector representation of a word is easy to look up:


In [32]:
len(new_model.wv['university'])

100

Gensim already implements several helper functions for working with word embeddings. For example, to compute the cosine similarity between two words:

In [33]:
new_model.wv.similarity('university','school') > 0.3

True

**Using the pre-trained model**

NLTK includes a pruned sample of a pre-trained Word2Vec model that was trained on about 100 billion words from the Google News dataset. The full model (about 3 GB) is available from https://code.google.com/p/word2vec/.

In [35]:
nltk.download('word2vec_sample')

from nltk.data import find

word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)


[nltk_data] Downloading package word2vec_sample to
[nltk_data]     C:\Users\Riyadh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping models\word2vec_sample.zip.


The sample is pruned to include only the most common words (~44k words).

In [None]:
len(model)

43981


Each word is represented in the space of 300 dimensions:


In [None]:
len(model['university'])

300


Finding the top n words that are most similar to a target word is simple. The result is a list of n (word, similarity score) pairs.


In [39]:
model.most_similar(positive=['university'], topn = 3)

[('universities', 0.7003918290138245),
 ('faculty', 0.6780906319618225),
 ('undergraduate', 0.6587095856666565)]

Finding the word that does not belong in a list is also supported, although implementing this yourself is simple:

In [41]:
model.doesnt_match('breakfast cereal dinner lunch'.split())

'cereal'

Vector arithmetic captures analogies: "King - Man + Woman" is close to "Queen", and "Germany - Berlin + Paris" is close to "France".


In [42]:
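# A notebook cell displays only the value of its last expression,
# so only the result of the second query appears in the output.
model.most_similar(positive=['woman','king'], negative=['man'], topn = 1)
model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 1)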

[('France', 0.7884091138839722)]

##### GloVe

Unlike the Word2Vec models (CBOW and Skip-gram), which focus on predicting context words given a target word or vice versa, GloVe uses a different approach that involves optimizing word vectors based on their co-occurrence probabilities. The training process is designed to learn embeddings that effectively capture the semantic relationships between words.

[Continue reading](https://github.com/stanfordnlp/glove)
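
The starting point for GloVe is a word-word co-occurrence table built from the corpus. The sketch below builds such a table for three toy sentences; it is only the counting step, not GloVe itself. The function name `cooccurrence_counts` is hypothetical, plain unweighted counts are used for simplicity, and the fitting step (learning vectors whose dot products approximate the logarithm of these counts) is not shown.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` positions of each other."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at the previous `window` tokens and count the pair symmetrically.
            for j in range(max(0, i - window), i):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts

toy_sentences = [s.split() for s in
                 ["the cat sat", "the cat sat in the hat", "the cat with the hat"]]
counts = cooccurrence_counts(toy_sentences, window=1)

print(counts[("the", "cat")])  # 3
print(counts[("the", "hat")])  # 2
```

Because the table is computed once over the whole corpus, GloVe works from global statistics rather than from one context window at a time.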

In [None]:
# No code

##### BERT

**BERT (Bidirectional Encoder Representations from Transformers)** is a model that creates context-aware embeddings by reading text in both directions. Unlike simpler embeddings, BERT can understand words based on the surrounding words, handling polysemy (multiple meanings) and capturing deep semantic relationships.

- **Bidirectional**: Analyzes each word with both left and right context, providing a fuller understanding of meaning.
- **Handles Polysemy and Synonymy**: Differentiates meanings based on context, useful for tasks like question answering and sentiment analysis.
- **Pretrained**: Trained on massive text data, making it adaptable to various language tasks.

BERT's embeddings are widely used in NLP tasks where deep contextual understanding is required.

[Continue reading](https://huggingface.co/docs/transformers/en/model_doc/bert)

In [None]:
# No code


## Comparison of Word Embedding Techniques

| Feature                       | Count Vectorization | TF-IDF Vectorization | Word2Vec (CBOW, Skip-gram) | GloVe |
| ----------------------------- | ------------------- | -------------------- | -------------------------- | ----- |
| **Interpretable**             | ✅                   | ✅                    | ❌                          | ❌     |
| **Captures Semantic Meaning** | ❌                   | ❌                    | ✅                          | ✅     |
| **Sparse Representation**     | ✅                   | ✅                    | ❌                          | ❌     |
| **Handles Large Vocabulary**  | ❌                   | ❌                    | ✅                          | ✅     |
| **Learns from Word Context**  | ❌                   | ❌                    | ✅                          | ✅     |
| **Easy to Compute**           | ✅                   | ✅                    | ❌                          | ❌     |
| **Suitable for Short Text**   | ✅                   | ✅                    | ✅                          | ✅     |
| **Requires Large Dataset**    | ❌                   | ❌                    | ✅                          | ✅     |
| **Handles Synonymy**          | ❌                   | ❌                    | ✅                          | ✅     |
| **Handles Polysemy**          | ❌                   | ❌                    | ~                          | ~     |

In this table:
- **Polysemy Handling** (`~`): Word2Vec and GloVe learn from context during training, but they assign a single vector to each word, so the different senses of a polysemous word are blended into one representation. They cannot separate meanings the way contextual embeddings (e.g., BERT) do.

---
## Conclusion and Next Steps

In this guide, we've covered the essential components of Natural Language Processing, from text preprocessing to advanced word embeddings. Understanding these foundational steps is crucial for building powerful NLP models and tackling real-world language tasks.

This repository will include Jupyter notebooks to provide hands-on practice with these concepts. These notebooks will guide you through practical projects, such as building a sentiment classifier or performing named entity recognition. You'll also find tools and techniques for preprocessing text, implementing embeddings, and evaluating NLP models.

### What’s Next?
1. **Deepen Your Knowledge**: Continue exploring advanced topics in NLP, such as dependency parsing, topic modeling, and more sophisticated embeddings like contextualized embeddings (e.g., BERT).
2. **Practice with Real Data**: Applying these techniques on real datasets is the best way to solidify your understanding.
3. **Explore Other NLP Libraries**: After mastering NLTK, consider learning SpaCy or the Hugging Face Transformers library for modern NLP workflows.

With these resources and hands-on practice, you’ll be well-prepared to tackle more complex NLP projects and keep advancing your skills.
