## *Basic Terminology:*
1. *`Corpus (Blog)`: A large, organized collection of text. The text can be anything: articles, books, social media posts, or recorded conversations. It is the raw data used to train language models.*

2. *`Document`: A single piece of text within a corpus. It can be a single sentence, a paragraph, or an article.*
    - *A corpus consists of multiple documents.*

3. *`Vocabulary (Dictionary)`: The list of all unique words that appear in the entire corpus. It contains no duplicates; each word is recorded only once.*

4. *`Word`: The simplest unit of text. A single word can occur multiple times in a document or corpus.*
    - *Each document consists of a collection of words.*
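
The four terms above can be made concrete with a small sketch. The two-sentence corpus and the whitespace tokenization are assumptions chosen for illustration, not part of the dataset used later in this notebook.

```python
# Toy corpus (illustrative): a list of documents, one sentence each.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Each document is a sequence of words (split on whitespace for simplicity).
documents = [doc.split() for doc in corpus]

# The vocabulary records every distinct word exactly once.
vocabulary = sorted({word for doc in documents for word in doc})

total_words = sum(len(doc) for doc in documents)

print("Documents  ->", len(documents))
print("Words      ->", total_words)
print("Vocabulary ->", vocabulary)
```

The corpus holds 12 words in total, but the vocabulary holds only 7 entries, because repeated words such as "the" are recorded once.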

### `Another way to convert a word into a vector:`
- `One-Hot Encoding`:
The simplest way to represent words as vectors. The idea is to create a vector as long as the entire dictionary. For each word, the vector is all zeros except for a single position set to 1, and the index of that position is the word's position in the dictionary.

Advantages:
 - Very simple and easy to understand and implement.

Disadvantages:
 - `The Curse of Dimensionality`: If the dictionary is large (tens of thousands of words), the vector becomes very large and full of zeros, consuming significant memory and increasing processing time.

 - `No semantic relationships`: Each vector is completely independent of the others. There is no relationship between the vector for "king" and the vector for "queen," despite the similarity in meaning.

In [1]:
import numpy as np

vocab = {'Cat': 0, 'sat': 1, 'on': 2, 'the': 3, 'mat': 4}  # word -> 0-based index
vocab_size = len(vocab)

ohe_vector = np.zeros(vocab_size)

word = 'Cat'
ohe_vector[vocab[word]] = 1  # set the word's position to 1

print("Word ->", word)
print("Vocab ->", vocab)
print("One-Hot Encoding ->", ohe_vector)

Word -> Cat
Vocab -> {'Cat': 0, 'sat': 1, 'on': 2, 'the': 3, 'mat': 4}
One-Hot Encoding -> [1. 0. 0. 0. 0.]
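
The claim that one-hot vectors carry no relationship between words can be checked directly. The sketch below assumes a hypothetical three-word vocabulary and uses plain Python lists; `one_hot` and `cosine` are illustrative helper names, not library functions.

```python
import math

vocab = ["king", "queen", "apple"]  # hypothetical toy vocabulary

def one_hot(word, vocab):
    """Return a list of zeros with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

king, queen, apple = (one_hot(w, vocab) for w in vocab)

print("king vs queen ->", cosine(king, queen))
print("king vs apple ->", cosine(king, apple))
print("king vs king  ->", cosine(king, king))
```

Any two distinct words get a similarity of 0, so "king" is exactly as far from "queen" as it is from "apple". The vectors are also sparse: with a vocabulary of size V, each one holds V - 1 zeros.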


### `N-Grams`:
N-grams are not a method for converting words to vectors like One-Hot Encoding. They are a way of building units out of adjacent words: take every consecutive group of n words.

- If n = 1, we call them Unigrams (single words).
- If n = 2, we call them Bigrams (pairs of words).
- If n = 3, we call them Trigrams (triples of words).

This helps capture some context and meaning because it takes into account the order of words in a sentence.
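
As a minimal sketch of the idea, n-grams can be produced with a sliding window over the token list. The helper name `make_ngrams` and the whitespace tokenization are assumptions for illustration; the notebook itself uses NLTK for this.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love natural language processing".split()

unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print("Unigrams ->", unigrams)
print("Bigrams  ->", bigrams)
print("Trigrams ->", trigrams)
```

A sentence of k tokens yields k - n + 1 n-grams, so the five tokens here give 5 unigrams, 4 bigrams, and 3 trigrams. If n is larger than the number of tokens, the result is an empty list.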

Advantages:
 - It takes word order into account: something neither Bag of Words nor One-Hot Encoding does.
 - Relatively simple: easy to understand and implement.

Disadvantages:
 - It still suffers from the curse of dimensionality: the number of possible N-grams grows rapidly with n, so the dictionary becomes much larger than a word-level one.
 - The rare N-gram problem (sparsity): if a particular N-gram never appears in the training data, its count is zero, so the model has no information about it.
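
Both disadvantages can be seen on a toy example. The training sentence below is an assumption for illustration: the sketch counts its bigrams, looks up one that was never observed, and prints the upper bound V^n on the number of distinct n-grams for a vocabulary of size V.

```python
from collections import Counter

def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))

train = "the cat sat on the mat".split()  # illustrative training text
counts = Counter(bigrams(train))

seen = counts[("the", "cat")]    # this bigram occurs in the training text
unseen = counts[("the", "dog")]  # never observed, so Counter returns 0

print("count('the cat') ->", seen)
print("count('the dog') ->", unseen)

# Upper bound on the number of distinct n-grams: V ** n
vocab_size = len(set(train))
for n in (1, 2, 3):
    print(f"n={n}: at most {vocab_size ** n} possible n-grams")
```

With only 5 distinct words there are already up to 125 possible trigrams, and almost none of them occur in the text.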

In [2]:
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

text = "I love natural language processing"
tokens = word_tokenize(text)

bigrams = list(ngrams(tokens, 2))
print(bigrams)

trigrams = list(ngrams(tokens, 3))
print(trigrams)

[('I', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing')]
[('I', 'love', 'natural'), ('love', 'natural', 'language'), ('natural', 'language', 'processing')]


### *Summary :*
1. `One-Hot Encoding`: a method for representing words as numerical vectors.
2. `N-grams`: a method for grouping words with their neighbours to keep some context before transforming them into vectors (using One-Hot Encoding or other methods).

*These methods are important fundamentals, but they are `rarely used on their own in modern models, which rely on Word Embeddings and Transformers`, because they `cannot capture meaning or the complex relationships between words`.*
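
To illustrate how the two ideas combine, the sketch below first extracts bigrams and then turns each document into a count vector over the bigram vocabulary. The two documents and all variable names are assumptions for illustration only.

```python
from collections import Counter

def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))

docs = ["I love natural language processing", "I love language"]
doc_bigrams = [bigrams(d.split()) for d in docs]

# Bigram vocabulary: each distinct bigram gets one column index.
all_bigrams = sorted({bg for doc in doc_bigrams for bg in doc})
bigram_vocab = {bg: i for i, bg in enumerate(all_bigrams)}

# One count vector per document, with columns in vocabulary order.
vectors = []
for doc in doc_bigrams:
    counts = Counter(doc)
    vectors.append([counts[bg] for bg in bigram_vocab])

print("Bigram vocabulary ->", bigram_vocab)
print("Vectors ->", vectors)
```

The two documents share only the bigram `('I', 'love')`, so their vectors overlap in a single column even though they also share the word "language".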

### Practical implementation of everything covered above

In [40]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import nltk

In [24]:
corpus = """
Mohamed Salah Hamed Mahrous Ghaly Egyptian Arabic pronunciation:
Salah began his senior career in 2010 at Al-Mokawloon, departing in 2012 to join Basel, where he won two Swiss Super League titles. 
In 2014, he joined Chelsea for a reported fee of £11 million, but limited gametime led to successive loans to Fiorentina and Roma, who later signed him permanently for €15 million.
In the 2016–17 season, Salah was a key figure in Roma's unsuccessful title bid, reaching double figures in both goals and assists.
"""

corpus

"\nMohamed Salah Hamed Mahrous Ghaly Egyptian Arabic pronunciation:\nSalah began his senior career in 2010 at Al-Mokawloon, departing in 2012 to join Basel, where he won two Swiss Super League titles. \nIn 2014, he joined Chelsea for a reported fee of £11 million, but limited gametime led to successive loans to Fiorentina and Roma, who later signed him permanently for €15 million.\nIn the 2016–17 season, Salah was a key figure in Roma's unsuccessful title bid, reaching double figures in both goals and assists.\n"

In [25]:
# Tokenization (Word and Sentences)

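# Uncomment on first run to fetch the required NLTK data:
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')  # needed by WordNetLemmatizer in the lemmatization cell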

sentences = sent_tokenize(corpus)
tokens = word_tokenize(corpus)

print("Sentences ->", sentences)
print("Number of sentences ->", len(sentences))
print(type(sentences))

print("Tokens ->", tokens)
print("Number of tokens ->", len(tokens))
print(type(tokens))

Sentences -> ['\nMohamed Salah Hamed Mahrous Ghaly Egyptian Arabic pronunciation:\nSalah began his senior career in 2010 at Al-Mokawloon, departing in 2012 to join Basel, where he won two Swiss Super League titles.', 'In 2014, he joined Chelsea for a reported fee of £11 million, but limited gametime led to successive loans to Fiorentina and Roma, who later signed him permanently for €15 million.', "In the 2016–17 season, Salah was a key figure in Roma's unsuccessful title bid, reaching double figures in both goals and assists."]
Number of sentences -> 3
<class 'list'>
Tokens -> ['Mohamed', 'Salah', 'Hamed', 'Mahrous', 'Ghaly', 'Egyptian', 'Arabic', 'pronunciation', ':', 'Salah', 'began', 'his', 'senior', 'career', 'in', '2010', 'at', 'Al-Mokawloon', ',', 'departing', 'in', '2012', 'to', 'join', 'Basel', ',', 'where', 'he', 'won', 'two', 'Swiss', 'Super', 'League', 'titles', '.', 'In', '2014', ',', 'he', 'joined', 'Chelsea', 'for', 'a', 'reported', 'fee', 'of', '£11', 'million', ',', 'b

In [26]:
# Stopword Removal

stop_words_english = set(stopwords.words('english'))
print("Stopwords ->", stop_words_english)
print("Number of Stopwords ->", len(stop_words_english))
print(type(stop_words_english))

cleaned_tokens = [t for t in tokens if t.lower() not in stop_words_english] 
print("Cleaned Tokens ->", cleaned_tokens)
print("Number of Cleaned Tokens ->", len(cleaned_tokens))
print(type(cleaned_tokens))

Stopwords -> {"mightn't", "she's", 'i', 'if', 'being', 'hadn', "shan't", "i'd", 'mustn', 'for', 'on', 'weren', 'out', "you've", 'into', 'of', 'once', 'where', 'over', 'this', 'his', 'through', 'when', 'he', 'most', 's', 'with', "it'd", "hasn't", 'were', 'she', 'whom', 'had', 'their', 'to', 'off', 'my', 'is', 'ours', 'yours', 'during', 'aren', 'hers', 'up', 'after', "wasn't", 'nor', 'here', 'and', 'me', "i'll", "she'll", "wouldn't", "couldn't", "they'd", 'no', 'more', 'all', 're', 'ourselves', "it's", 'wouldn', 'was', 'have', 'how', 'been', 'now', 'down', 'o', 'your', 'them', 'any', 'who', 'why', "won't", "he'll", 'yourself', 'doing', 'above', 'will', 'such', 'myself', 'the', 'both', "i've", 'same', 'wasn', 'do', "mustn't", "doesn't", 'which', 'our', 'her', "hadn't", 'itself', 'an', 'between', 'a', 'am', 'should', 'not', "that'll", 'those', 'these', "they've", "weren't", "i'm", 'but', 'against', 'other', "we'll", 'because', "aren't", 'before', "haven't", 'than', "they're", "didn't", 'ab

In [27]:
# Stemming
stemmer = PorterStemmer()

stemmed_words = [stemmer.stem(word) for word in cleaned_tokens]
print("Stemmed Words ->", stemmed_words)
print("Number of Stemmed Words ->", len(stemmed_words))
print(type(stemmed_words))

Stemmed Words -> ['moham', 'salah', 'hame', 'mahrou', 'ghali', 'egyptian', 'arab', 'pronunci', ':', 'salah', 'began', 'senior', 'career', '2010', 'al-mokawloon', ',', 'depart', '2012', 'join', 'basel', ',', 'two', 'swiss', 'super', 'leagu', 'titl', '.', '2014', ',', 'join', 'chelsea', 'report', 'fee', '£11', 'million', ',', 'limit', 'gametim', 'led', 'success', 'loan', 'fiorentina', 'roma', ',', 'later', 'sign', 'perman', '€15', 'million', '.', '2016–17', 'season', ',', 'salah', 'key', 'figur', 'roma', "'s", 'unsuccess', 'titl', 'bid', ',', 'reach', 'doubl', 'figur', 'goal', 'assist', '.']
Number of Stemmed Words -> 68
<class 'list'>


In [28]:
# Lemmatization
# Note: lemmatize() defaults to pos='n' (noun), so verb forms such as 'joined' are left unchanged.
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in cleaned_tokens]
print("Lemmatized Words ->", lemmatized_words)
print("Number of Lemmatized Words ->", len(lemmatized_words))
print(type(lemmatized_words))

Lemmatized Words -> ['Mohamed', 'Salah', 'Hamed', 'Mahrous', 'Ghaly', 'Egyptian', 'Arabic', 'pronunciation', ':', 'Salah', 'began', 'senior', 'career', '2010', 'Al-Mokawloon', ',', 'departing', '2012', 'join', 'Basel', ',', 'two', 'Swiss', 'Super', 'League', 'title', '.', '2014', ',', 'joined', 'Chelsea', 'reported', 'fee', '£11', 'million', ',', 'limited', 'gametime', 'led', 'successive', 'loan', 'Fiorentina', 'Roma', ',', 'later', 'signed', 'permanently', '€15', 'million', '.', '2016–17', 'season', ',', 'Salah', 'key', 'figure', 'Roma', "'s", 'unsuccessful', 'title', 'bid', ',', 'reaching', 'double', 'figure', 'goal', 'assist', '.']
Number of Lemmatized Words -> 68
<class 'list'>


In [31]:
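import re

# Note: `corpus` is rebound here from the raw string to a list of cleaned sentences.
corpus = []
for sentence in sentences:
    # Replace every non-alphanumeric character with a space, then lowercase
    cleaned = re.sub(r'[^a-zA-Z0-9]', ' ', sentence)
    corpus.append(cleaned.lower())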

corpus

[' mohamed salah hamed mahrous ghaly egyptian arabic pronunciation  salah began his senior career in 2010 at al mokawloon  departing in 2012 to join basel  where he won two swiss super league titles ',
 'in 2014  he joined chelsea for a reported fee of  11 million  but limited gametime led to successive loans to fiorentina and roma  who later signed him permanently for  15 million ',
 'in the 2016 17 season  salah was a key figure in roma s unsuccessful title bid  reaching double figures in both goals and assists ']

In [None]:
# Bag of Words
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(corpus).toarray()
print("Bag of Words ->", X)
print("Shape of Bag of Words ->", X.shape)

Bag of Words -> [[0 0 0 1 1 0 0 1 0 1 0 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0 1 2 1
  0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 0 2 0 1 0 0 1 1 0 0 1 1 1 0 0 1 0 1]
 [1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 2 1 0 0 0 1 1 0 1 0
  1 0 1 0 1 1 1 0 2 0 0 1 1 0 0 1 1 0 0 0 1 1 0 0 0 0 0 2 0 0 0 0 1 0]
 [0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 3 0
  0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0]]
Shape of Bag of Words -> (3, 70)
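
To make the matrix above less opaque, here is a minimal from-scratch sketch of a bag-of-words count. It is not scikit-learn's implementation. The toy documents and the `tokenize` helper are assumptions; the helper imitates `CountVectorizer`'s default behaviour of lowercasing and keeping only tokens of two or more word characters.

```python
import re
from collections import Counter

docs = ["The cat sat on the mat.", "The dog sat."]  # illustrative documents

def tokenize(text):
    """Lowercase and keep runs of 2+ word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]

# Alphabetically sorted vocabulary: word -> column index.
all_words = sorted({w for doc in tokenized for w in doc})
vocabulary = {w: i for i, w in enumerate(all_words)}

# One row per document, one column per vocabulary word.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[w] for w in vocabulary])

print("Vocabulary ->", vocabulary)
print("Bag of Words ->", matrix)
```

Each row is a document and each column is a vocabulary word, so the value 2 in the first row's last column records that "the" occurs twice in the first document. The two-character minimum is also why single-character tokens such as "a" and "s" are absent from the vocabulary printed in this notebook.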


In [38]:
print(vectorizer.get_feature_names_out())
print("Length of Vocabulary ->", len(vectorizer.get_feature_names_out()))

['11' '15' '17' '2010' '2012' '2014' '2016' 'al' 'and' 'arabic' 'assists'
 'at' 'basel' 'began' 'bid' 'both' 'but' 'career' 'chelsea' 'departing'
 'double' 'egyptian' 'fee' 'figure' 'figures' 'fiorentina' 'for'
 'gametime' 'ghaly' 'goals' 'hamed' 'he' 'him' 'his' 'in' 'join' 'joined'
 'key' 'later' 'league' 'led' 'limited' 'loans' 'mahrous' 'million'
 'mohamed' 'mokawloon' 'of' 'permanently' 'pronunciation' 'reaching'
 'reported' 'roma' 'salah' 'season' 'senior' 'signed' 'successive' 'super'
 'swiss' 'the' 'title' 'titles' 'to' 'two' 'unsuccessful' 'was' 'where'
 'who' 'won']
Length of Vocabulary -> 70


In [39]:
print(vectorizer.vocabulary_)
print("Length of Vocabulary Dictionary ->", len(vectorizer.vocabulary_))

{'mohamed': 45, 'salah': 53, 'hamed': 30, 'mahrous': 43, 'ghaly': 28, 'egyptian': 21, 'arabic': 9, 'pronunciation': 49, 'began': 13, 'his': 33, 'senior': 55, 'career': 17, 'in': 34, '2010': 3, 'at': 11, 'al': 7, 'mokawloon': 46, 'departing': 19, '2012': 4, 'to': 63, 'join': 35, 'basel': 12, 'where': 67, 'he': 31, 'won': 69, 'two': 64, 'swiss': 59, 'super': 58, 'league': 39, 'titles': 62, '2014': 5, 'joined': 36, 'chelsea': 18, 'for': 26, 'reported': 51, 'fee': 22, 'of': 47, '11': 0, 'million': 44, 'but': 16, 'limited': 41, 'gametime': 27, 'led': 40, 'successive': 57, 'loans': 42, 'fiorentina': 25, 'and': 8, 'roma': 52, 'who': 68, 'later': 38, 'signed': 56, 'him': 32, 'permanently': 48, '15': 1, 'the': 60, '2016': 6, '17': 2, 'season': 54, 'was': 66, 'key': 37, 'figure': 23, 'unsuccessful': 65, 'title': 61, 'bid': 14, 'reaching': 50, 'double': 20, 'figures': 24, 'both': 15, 'goals': 29, 'assists': 10}
Length of Vocabulary Dictionary -> 70
