#***One-Hot Encoding***
*One-hot encoding represents each word in a vocabulary as a vector of 0s and 1s, so that computers can work with words as numbers*.

*One "Hot" Element: In each vector, only one element is set to "1" (hot) to indicate the presence of the word, and all other elements are set to "0"*.


**Example:**

"apple" is represented as [1, 0, 0]
The first position is "hot" (1) because "apple" is the first word in our vocabulary.

"orange" is represented as [0, 1, 0]
The second position is "hot" (1) because "orange" is the second word in our vocabulary.

"banana" is represented as [0, 0, 1]
The third position is "hot" (1) because "banana" is the third word in our vocabulary.
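
The mapping above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements it; the helper name `one_hot` is hypothetical, and the vocabulary order is taken from the example.

```python
# Vocabulary in the same order as the example above.
vocabulary = ["apple", "orange", "banana"]

def one_hot(word, vocabulary):
    """Return a list with 1 at the word's vocabulary index and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1  # raises ValueError for unknown words
    return vector

for word in vocabulary:
    print(word, one_hot(word, vocabulary))
```

Each vector has the same length as the vocabulary, so the vectors grow as the vocabulary grows.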

In [3]:
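import numpy as np

# reshape(-1, 1) turns the 1-D array into a single column of shape (3, 1),
# the 2-D layout that OneHotEncoder expects.
words = np.array(['apple', 'orange', 'banana']).reshape(-1, 1)
print(words)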

[['apple']
 ['orange']
 ['banana']]


In [4]:
import numpy as np

# Without reshape the array stays 1-D, with shape (3,).
words = np.array(['apple', 'orange', 'banana'])
print(words)

['apple' 'orange' 'banana']


***Sparse Matrix*** : *This is a matrix that is primarily filled with zeros and only a few non-zero elements. Storing such a matrix efficiently in memory can save space. In the context of one-hot encoding, where there are many zeros, a sparse matrix can be more memory efficient*.

**When sparse_output=True (the default),** *the OneHotEncoder returns a sparse matrix. This can be **memory efficient**, especially when dealing with large vocabularies and datasets*.

**When sparse_output=False,** *the OneHotEncoder returns a dense NumPy array, which is easier to print and inspect but stores every zero and so uses more memory*.
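
To make the memory argument concrete, here is a rough sketch that compares how many values each layout stores. It assumes a toy five-word vocabulary and uses a plain list of (row, column) coordinates as the sparse form; scikit-learn itself returns SciPy sparse matrices, so this is only an illustration of the idea.

```python
# Five words in a five-word vocabulary: each word is "hot" at its own index.
hot_indices = [0, 1, 2, 3, 4]
vocab_size = 5

# Dense form: every cell is stored, zeros included.
dense = [[1 if col == hot else 0 for col in range(vocab_size)]
         for hot in hot_indices]

# Sparse (coordinate) form: only the (row, column) positions of the 1s.
sparse_coords = [(row, hot) for row, hot in enumerate(hot_indices)]

dense_cells = sum(len(row) for row in dense)
print("dense cells stored:", dense_cells)
print("sparse entries stored:", len(sparse_coords))
```

The dense layout grows with rows × vocabulary size, while the sparse layout grows only with the number of non-zero entries.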

In [7]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
encoder

In [11]:
words = np.array(['apple','banana', 'orange']).reshape(-1, 1)
print(words)

[['apple']
 ['banana']
 ['orange']]


In [12]:
one_hot_encoded = encoder.fit_transform(words)
print(one_hot_encoded)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


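#***Bag of Words (BoW)***
*Bag of Words represents a text by counting how many times each vocabulary word occurs in it, ignoring word order*.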

***Example***

For the sentences **"I love apples" and "I love oranges"**:

**Vocabulary**: **{I, love, apples, oranges}**

"I love apples" → [1, 1, 1, 0]

"I love oranges" → [1, 1, 0, 1]

In [21]:
sentences = ["I love apples", "I love oranges"]
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

vectorizer

*The **CountVectorizer** lowercases each sentence and splits it into individual words (tokens). By default it keeps only tokens of two or more characters, so the single-letter word "I" is dropped*.

**Tokenization:** "I love apples" → ['love', 'apples']

**Vocabulary Creation:** *It creates a vocabulary of all unique tokens from the entire corpus (all sentences). The vocabulary is sorted alphabetically by default*.

**Vocabulary:** ['apples', 'love', 'oranges']

**Word Counting:** *For each sentence, it counts the occurrences of each word in the vocabulary*.

"I love apples" → {'apples': 1, 'love': 1, 'oranges': 0}

"I love oranges" → {'apples': 0, 'love': 1, 'oranges': 1}

In [25]:
bow_encoded = vectorizer.fit_transform(sentences).toarray()
print(bow_encoded)


[[1 1 0]
 [0 1 1]]
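
The tokenization, vocabulary, and counting steps can be approximated without scikit-learn. The sketch below assumes CountVectorizer's documented default token pattern (`(?u)\b\w\w+\b`) and lowercasing; the `tokenize` helper is hypothetical, and the real class has many more options.

```python
import re

# Default token pattern documented for CountVectorizer: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text, then keep tokens of at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())

sentences = ["I love apples", "I love oranges"]
tokenized = [tokenize(s) for s in sentences]

# Unique tokens across the corpus, sorted alphabetically.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One count per vocabulary word, per sentence.
rows = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(tokenized)
print(vocabulary)
print(rows)
```

The single-letter "I" never reaches the vocabulary, which is why each row has three columns rather than four.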


In [28]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'text':["people of odishaare good","odisha love naveen pattnaik","odisha is not a good state"]})
df

|   | text                        |
|---|-----------------------------|
| 0 | people of odishaare good    |
| 1 | odisha love naveen pattnaik |
| 2 | odisha is not a good state  |


In [30]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
cv

In [31]:
bow = cv.fit_transform(df['text'])
bow

<3x11 sparse matrix of type '<class 'numpy.int64'>'
	with 13 stored elements in Compressed Sparse Row format>

In [32]:
print(cv.vocabulary_)
print("Index positions ")

{'people': 9, 'of': 7, 'odishaare': 6, 'good': 0, 'odisha': 5, 'love': 2, 'naveen': 3, 'pattnaik': 8, 'is': 1, 'not': 4, 'state': 10}
Index positions 


Here's the tabulated representation of the vectors with the index and word mapping for each sentence:

### Sentence 1: "people of odishaare good"

| Index | Word      | Count |
|-------|-----------|-------|
| 0     | good      | 1     |
| 1     | is        | 0     |
| 2     | love      | 0     |
| 3     | naveen    | 0     |
| 4     | not       | 0     |
| 5     | odisha    | 0     |
| 6     | odishaare | 1     |
| 7     | of        | 1     |
| 8     | pattnaik  | 0     |
| 9     | people    | 1     |
| 10    | state     | 0     |

### Sentence 2: "odisha love naveen pattnaik"

| Index | Word      | Count |
|-------|-----------|-------|
| 0     | good      | 0     |
| 1     | is        | 0     |
| 2     | love      | 1     |
| 3     | naveen    | 1     |
| 4     | not       | 0     |
| 5     | odisha    | 1     |
| 6     | odishaare | 0     |
| 7     | of        | 0     |
| 8     | pattnaik  | 1     |
| 9     | people    | 0     |
| 10    | state     | 0     |

### Sentence 3: "odisha is not a good state"

| Index | Word      | Count |
|-------|-----------|-------|
| 0     | good      | 1     |
| 1     | is        | 1     |
| 2     | love      | 0     |
| 3     | naveen    | 0     |
| 4     | not       | 1     |
| 5     | odisha    | 1     |
| 6     | odishaare | 0     |
| 7     | of        | 0     |
| 8     | pattnaik  | 0     |
| 9     | people    | 0     |
| 10    | state     | 1     |

These tables show the count of each vocabulary word in each sentence. The word "a" in Sentence 3 has no column because CountVectorizer drops single-character tokens by default.

In [33]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())

[[1 0 0 0 0 0 1 1 0 1 0]]
[[0 0 1 1 0 1 0 0 1 0 0]]
[[1 1 0 0 1 1 0 0 0 0 1]]
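
To read a row back as words, the vocabulary mapping can be inverted. This sketch copies the `cv.vocabulary_` dictionary and the third row from the outputs above; `index_to_word` is a hypothetical helper name.

```python
# Vocabulary mapping copied from the cv.vocabulary_ output above.
vocabulary = {'people': 9, 'of': 7, 'odishaare': 6, 'good': 0, 'odisha': 5,
              'love': 2, 'naveen': 3, 'pattnaik': 8, 'is': 1, 'not': 4,
              'state': 10}

# Invert the mapping: column index -> word.
index_to_word = {index: word for word, index in vocabulary.items()}

# Third row printed above ("odisha is not a good state").
row = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]

words_present = [index_to_word[i] for i, count in enumerate(row) if count > 0]
print(words_present)
```

The words come back in column order, not sentence order, because a Bag of Words vector does not keep word positions.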


In [35]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [36]:
from nltk.tokenize import word_tokenize
sentence = "Lithi is looking for a new role in NLP"


tokens = word_tokenize(sentence)

uni_grams = tokens

print("Uni-grams:", uni_grams)

Uni-grams: ['Lithi', 'is', 'looking', 'for', 'a', 'new', 'role', 'in', 'NLP']


In [38]:
from nltk.util import ngrams
from nltk.tokenize import word_tokenize


sentence = "Lithi is looking for a new role in NLP"
tokens = word_tokenize(sentence)
# Trigrams (3-grams)
tri_grams = list(ngrams(tokens, 3))
print("Trigrams:", tri_grams)

Trigrams: [('Lithi', 'is', 'looking'), ('is', 'looking', 'for'), ('looking', 'for', 'a'), ('for', 'a', 'new'), ('a', 'new', 'role'), ('new', 'role', 'in'), ('role', 'in', 'NLP')]
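
An n-gram is a sliding window of n consecutive tokens. The sketch below shows the idea without NLTK; `make_ngrams` is a hypothetical helper, and `str.split` is used in place of `word_tokenize`, which only gives the same tokens here because the sentence has no punctuation.

```python
def make_ngrams(tokens, n):
    """Slide a window of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Lithi is looking for a new role in NLP".split()

bi_grams = make_ngrams(tokens, 2)
tri_grams = make_ngrams(tokens, 3)

print("Bigrams:", bi_grams)
print("Trigrams:", tri_grams)
```

A sentence with k tokens yields k - n + 1 n-grams, so the nine tokens here give eight bigrams and seven trigrams.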


#***TF-IDF (Term Frequency-Inverse Document Frequency)***

**Term Frequency (TF):** Measures how often a term t appears in a document d, relative to the total number of terms in that document.

$$\mathrm{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$

**Example: If the word "cat" appears 3 times in a document with a total of 100 words, the TF for "cat" in that document is 3/100 = 0.03**

**Inverse Document Frequency (IDF):** Measures how informative a term is across the entire corpus D. It reduces the weight of terms that appear in many documents and increases the weight of terms that are rare in the corpus. Adding 1 to the denominator avoids division by zero for terms that appear in no document.

$$\mathrm{IDF}(t, D) = \log\left(\frac{\text{total number of documents in the corpus } D}{1 + \text{number of documents containing term } t}\right)$$

**Example: If there are 1,000,000 documents in the corpus and the word "cat" appears in 100,000 documents, the IDF for "cat" is log(1,000,000 / (100,000 + 1)) ≈ 2.30 (using the natural logarithm)**

**TF-IDF:** The final score is the product of the two:

$$\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$
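
The "cat" example can be worked through in a few lines. This sketch assumes the +1 belongs in the denominator of the IDF fraction and that the logarithm is natural; the function names are hypothetical.

```python
import math

def term_frequency(term_count, total_terms):
    """Share of the document's terms that are this term."""
    return term_count / total_terms

def inverse_document_frequency(total_docs, docs_with_term):
    # +1 in the denominator, as in the formula above; natural log assumed.
    return math.log(total_docs / (docs_with_term + 1))

tf = term_frequency(3, 100)
idf = inverse_document_frequency(1_000_000, 100_000)
tf_idf = tf * idf

print(f"TF     = {tf:.2f}")
print(f"IDF    = {idf:.4f}")
print(f"TF-IDF = {tf_idf:.4f}")
```

With a base-10 logarithm the IDF would be close to 1 instead of about 2.30; the ranking of terms stays the same, only the scale changes.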

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)
# Get the TF-IDF matrix
tfidf_matrix = X.toarray()

tfidf_matrix

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])

In [42]:
feature_names = vectorizer.get_feature_names_out()
print("\nFeature Names:", feature_names)



Feature Names: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
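
The numbers in the matrix do not follow the textbook formula exactly. With its default settings, `TfidfVectorizer` uses raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below reproduces the first row by hand under those assumptions; the helper names are hypothetical.

```python
import math
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

docs = [tokenize(doc) for doc in corpus]
vocabulary = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocabulary:
    doc_freq = sum(term in doc for doc in docs)
    idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_row(tokens):
    """Raw count times IDF per vocabulary term, then L2-normalise the row."""
    weights = [tokens.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

first_row = tfidf_row(docs[0])
print(vocabulary)
print([round(w, 8) for w in first_row])
```

The printed row should match the first row of `tfidf_matrix` above, with "first" scoring highest because it appears in only two of the four documents.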


#***What is Vector Representation?***
*In NLP, a vector representation is a way to turn words or sentences into numbers so that computers can understand and work with them*.



### Vocabulary
| Index | Word     |
|-------|----------|
| 0     | cats     |
| 1     | are      |
| 2     | cute     |
| 3     | dogs     |
| 4     | friendly |
| 5     | and      |
| 6     | pets     |

### Vector Representation of Sentences
| Sentence                    | Vector             |
|-----------------------------|--------------------|
| "cats are cute"             | [1, 1, 1, 0, 0, 0, 0] |
| "dogs are friendly"         | [0, 1, 0, 1, 1, 0, 0] |
| "cats and dogs are pets"    | [1, 1, 0, 1, 0, 1, 1] |

### Breaking Down the Vectors

#### For "cats are cute":
| Word      | Count |
|-----------|-------|
| cats      | 1     |
| are       | 1     |
| cute      | 1     |
| dogs      | 0     |
| friendly  | 0     |
| and       | 0     |
| pets      | 0     |

#### For "dogs are friendly":
| Word      | Count |
|-----------|-------|
| cats      | 0     |
| are       | 1     |
| cute      | 0     |
| dogs      | 1     |
| friendly  | 1     |
| and       | 0     |
| pets      | 0     |

#### For "cats and dogs are pets":
| Word      | Count |
|-----------|-------|
| cats      | 1     |
| are       | 1     |
| cute      | 0     |
| dogs      | 1     |
| friendly  | 0     |
| and       | 1     |
| pets      | 1     |


**So the vector for "cats and dogs are pets" is [1, 1, 0, 1, 0, 1, 1]**
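
The tables above can be rebuilt with a short sketch. It assumes the vocabulary is indexed in order of first appearance, which matches the Vocabulary table here (unlike `CountVectorizer`, which sorts alphabetically).

```python
sentences = ["cats are cute", "dogs are friendly", "cats and dogs are pets"]

# Build the vocabulary in order of first appearance, as in the table above.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One count per vocabulary word, per sentence.
vectors = [[sentence.split().count(word) for word in vocabulary]
           for sentence in sentences]

print(vocabulary)
for sentence, vector in zip(sentences, vectors):
    print(f'"{sentence}" -> {vector}')
```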

#***Word2Vec Models***


In [43]:
from gensim.models import Word2Vec
#common_texts: A small sample dataset provided by gensim that contains simple sentences.
from gensim.test.utils import common_texts


In [46]:
# vector_size=100: Each word will be represented by a vector of 100 numbers.

# window=5: This is the number of words around the target word that will be considered
#  (context window size). For example, if the target word is "computer",
# the words within 5 positions before and after "computer" are considered.

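# min_count=1: Words that appear fewer than min_count times are ignored,
# so with min_count=1 every word is kept.

# workers=4: Number of worker threads to use for training.

# sg=0: Use the CBOW architecture (sg=1 would select skip-gram).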

model_cbow = Word2Vec(sentences=common_texts, vector_size=100,
                      window=5, min_count=1, workers=4, sg=0)
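
To see what the `window` parameter means, the sketch below lists the context words for each target word in one sentence. It is a simplified illustration with a hypothetical helper and made-up tokens; gensim's actual training involves further details that are ignored here.

```python
def context_pairs(tokens, window):
    """Pair each target word with up to `window` words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((target, left + right))
    return pairs

# Illustrative tokens; window=2 keeps the output short.
tokens = ["human", "interface", "computer", "survey", "user"]

for target, context in context_pairs(tokens, window=2):
    print(target, "->", context)
```

In CBOW (`sg=0`) the model predicts the target word from its context words; in skip-gram (`sg=1`) it predicts the context words from the target.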

In [45]:
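# Passing `sentences` to the Word2Vec constructor already trains the model;
# this call continues training for 10 more epochs.
model_cbow.train(common_texts, total_examples=len(common_texts), epochs=10)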




(36, 290)

In [47]:
vector = model_cbow.wv['computer']


***Let’s imagine we have these sentences***:

*"I love machine learning"*

*"Machine learning is fascinating"*

*"I enjoy studying computer science"*

***We use Word2Vec to create vectors for each word***.

**For example**:

"I" might be represented as [0.2, -0.3, ...] (these numbers are weights learned during training, not occurrence counts as in Bag of Words or TF-IDF)

"love" might be [0.5, 0.1, ...]

"machine" might be [0.1, -0.2, ...]

"learning" might be [0.4, 0.2, ...]

"computer" might be [0.3, 0.4, ...]

"science" might be [0.6, -0.1, ...]

In [48]:
vector = model_cbow.wv['computer']
print(vector)


[-0.00515774 -0.00667028 -0.0077791   0.00831315 -0.00198292 -0.00685696
 -0.0041556   0.00514562 -0.00286997 -0.00375075  0.0016219  -0.0027771
 -0.00158482  0.0010748  -0.00297881  0.00852176  0.00391207 -0.00996176
  0.00626142 -0.00675622  0.00076966  0.00440552 -0.00510486 -0.00211128
  0.00809783 -0.00424503 -0.00763848  0.00926061 -0.00215612 -0.00472081
  0.00857329  0.00428459  0.0043261   0.00928722 -0.00845554  0.00525685
  0.00203994  0.0041895   0.00169839  0.00446543  0.0044876   0.0061063
 -0.00320303 -0.00457706 -0.00042664  0.00253447 -0.00326412  0.00605948
  0.00415534  0.00776685  0.00257002  0.00811905 -0.00138761  0.00808028
  0.0037181  -0.00804967 -0.00393476 -0.0024726   0.00489447 -0.00087241
 -0.00283173  0.00783599  0.00932561 -0.0016154  -0.00516075 -0.00470313
 -0.00484746 -0.00960562  0.00137242 -0.00422615  0.00252744  0.00561612
 -0.00406709 -0.00959937  0.00154715 -0.00670207  0.0024959  -0.00378173
  0.00708048  0.00064041  0.00356198 -0.00273993 -0.0

#***Word2Vec Implementation***

In [49]:
!pip install gensim



In [50]:
import gensim
from gensim.models import Word2Vec, KeyedVectors

import gensim.downloader as api

wv = api.load('word2vec-google-news-300')



In [51]:
vector_king = wv['king']
vector_king


array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

In [52]:
vector_king.shape

(300,)

In [53]:
wv['cricket']

array([-3.67187500e-01, -1.21582031e-01,  2.85156250e-01,  8.15429688e-02,
        3.19824219e-02, -3.19824219e-02,  1.34765625e-01, -2.73437500e-01,
        9.46044922e-03, -1.07421875e-01,  2.48046875e-01, -6.05468750e-01,
        5.02929688e-02,  2.98828125e-01,  9.57031250e-02,  1.39648438e-01,
       -5.41992188e-02,  2.91015625e-01,  2.85156250e-01,  1.51367188e-01,
       -2.89062500e-01, -3.46679688e-02,  1.81884766e-02, -3.92578125e-01,
        2.46093750e-01,  2.51953125e-01, -9.86328125e-02,  3.22265625e-01,
        4.49218750e-01, -1.36718750e-01, -2.34375000e-01,  4.12597656e-02,
       -2.15820312e-01,  1.69921875e-01,  2.56347656e-02,  1.50146484e-02,
       -3.75976562e-02,  6.95800781e-03,  4.00390625e-01,  2.09960938e-01,
        1.17675781e-01, -4.19921875e-02,  2.34375000e-01,  2.03125000e-01,
       -1.86523438e-01, -2.46093750e-01,  3.12500000e-01, -2.59765625e-01,
       -1.06933594e-01,  1.04003906e-01, -1.79687500e-01,  5.71289062e-02,
       -7.41577148e-03, -

In [54]:
wv.most_similar('happy')

[('glad', 0.7408890724182129),
 ('pleased', 0.6632170677185059),
 ('ecstatic', 0.6626912355422974),
 ('overjoyed', 0.6599286794662476),
 ('thrilled', 0.6514049172401428),
 ('satisfied', 0.6437949538230896),
 ('proud', 0.636042058467865),
 ('delighted', 0.627237856388092),
 ('disappointed', 0.6269949674606323),
 ('excited', 0.6247665286064148)]

In [55]:
wv.similarity('women','queen')

0.2019103
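
The scores returned by `most_similar` and `similarity` are cosine similarities between word vectors. The sketch below computes the same measure by hand on toy 3-dimensional vectors; the vectors are made up for illustration and are not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, made up for illustration (not real embeddings).
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]    # same direction as v1
v3 = [-3.0, 0.0, 1.0]   # perpendicular to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Vectors pointing the same way score close to 1 and perpendicular vectors score 0, so a value such as 0.20 for 'women' and 'queen' indicates only a weak relationship in this model.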