## NLP_Assignment_1
1. Explain One-Hot Encoding
2. Explain Bag of Words
3. Explain Bag of N-Grams
4. Explain TF-IDF
5. What is the OOV problem?
6. What are word embeddings?
7. Explain Continuous bag of words (CBOW)
8. Explain SkipGram
9. Explain GloVe Embeddings.

In [1]:
'''Ans 1:- One-Hot Encoding is a technique used to represent
categorical data as binary vectors. Each category is assigned a unique
vector whose length equals the number of categories, with exactly one
element set to 1 and the rest set to 0. For example, the Python code
below converts the 'Color' column into one-hot encoded vectors, with
one binary column for each color category.'''

import pandas as pd

data = {'Color': ['Red', 'Green', 'Blue']}
df = pd.DataFrame(data)

one_hot_encoded = pd.get_dummies(df, columns=['Color'])
print(one_hot_encoded)

   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           0            1          0
2           1            0          0
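To make the mapping explicit, here is a minimal sketch of the same encoding without pandas. The helper name `one_hot_encode` is hypothetical, and sorting the categories alphabetically is an assumption chosen so the column order matches the output above.

```python
def one_hot_encode(values):
    """Return (categories, vectors) for a list of categorical values."""
    # Sort the distinct categories so every category gets a fixed column index.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}

    vectors = []
    for value in values:
        row = [0] * len(categories)   # all zeros ...
        row[index[value]] = 1         # ... except the column of this category
        vectors.append(row)
    return categories, vectors


categories, vectors = one_hot_encode(["Red", "Green", "Blue"])
print(categories)  # ['Blue', 'Green', 'Red']
print(vectors)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Each row contains exactly one 1, so the vector length grows with the number of distinct categories.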


In [2]:
'''Ans 2:- Bag of Words (BoW) is a text representation method where
the frequency of each word in a document is counted,
disregarding word order. It creates a sparse vector of word frequencies
for each document. The Python code below converts a list of text
documents into a BoW representation using scikit-learn's
CountVectorizer.

Each row corresponds to a document, and each column
corresponds to a unique word in the corpus (the columns are sorted
alphabetically: and, document, first, is, one, second, the, third,
this). This BoW matrix allows us to represent text data in a numerical
format suitable for machine learning algorithms.

For example, in the first document ("This is the first document."):
The word "this" appears 1 time.
The word "is" appears 1 time.
The word "the" appears 1 time.
The word "first" appears 1 time.
The word "document" appears 1 time.
The word "second" appears 0 times.
The word "and" appears 0 times.
The word "third" appears 0 times.
The word "one" appears 0 times.'''

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document.",
          "This document is the second document.",
          "And this is the third one."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]]
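The matrix above is easier to read once each column is tied to its word. The following is a rough pure-Python sketch of the counting step; the regular expression is an assumption that imitates CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters), not the library's actual code.

```python
import re
from collections import Counter

corpus = ["This is the first document.",
          "This document is the second document.",
          "And this is the third one."]


def tokenize(text):
    # Lowercase and keep tokens of at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is the sorted set of all tokens; its order defines the columns.
vocab = sorted({token for doc in tokenized for token in doc})

# One row per document, one count per vocabulary word.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Reading the first row against `vocab` shows that "document", "first", "is", "the", and "this" each occur once in the first document.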


In [4]:
'''Ans 3:- Bag of N-Grams is an extension of the Bag of Words (BoW)
model that considers sequences of N consecutive words (N-grams)
in text data. It captures not only individual words but also
word combinations, providing better context. For example, in
the sentence "I love machine learning," the 2-grams are
"I love," "love machine," and "machine learning," which preserve
some of the local word order that BoW discards. In this code,
ngram_range=(1, 2) specifies 1-grams (individual words) and
2-grams (word pairs). The resulting matrix includes both single
words and word pairs as features.

Note that CountVectorizer lowercases the text and, with its default
token pattern, drops single-character tokens, so "i" and "i love"
do not appear among the features in the output.

This Bag of N-Grams representation captures both
individual words and meaningful word combinations, providing richer
information than a simple Bag of Words model.'''

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love machine learning", "Machine learning is fun"]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())

['fun' 'is' 'is fun' 'learning' 'learning is' 'love' 'love machine'
 'machine' 'machine learning']
[[0 0 0 1 0 1 1 1 1]
 [1 1 1 1 1 0 0 1 1]]
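The sliding-window idea behind N-grams can be sketched in a few lines. The helper `ngrams` is a hypothetical name, and plain whitespace splitting is assumed here, so unlike CountVectorizer this sketch keeps the one-letter token "i".

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love machine learning".lower().split()

print(ngrams(tokens, 1))  # ['i', 'love', 'machine', 'learning']
print(ngrams(tokens, 2))  # ['i love', 'love machine', 'machine learning']
print(ngrams(tokens, 3))  # ['i love machine', 'love machine learning']
```

A sentence with k tokens yields k - n + 1 N-grams, which is why the feature space grows quickly as n increases.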


In [6]:
'''Ans 4:- TF-IDF (Term Frequency-Inverse Document Frequency) is a
numerical statistic used to evaluate the importance of a term in a
document within a corpus. It combines the frequency of a term
in a document (TF) with how rare the term is across the entire
corpus (IDF). A term gets a high TF-IDF score when it is frequent
in a document but appears in few other documents. This code calculates
TF-IDF values for words in the given text documents. In the output,
words that occur in every document ("is", "the", "this") receive the
lowest non-zero weights in each row.'''

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is the first document.",
          "This document is the second document.",
          "And this is the third one."]

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

print(tfidf_vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
[[0.         0.46941728 0.61722732 0.3645444  0.         0.
  0.3645444  0.         0.3645444 ]
 [0.         0.7284449  0.         0.28285122 0.         0.47890875
  0.28285122 0.         0.28285122]
 [0.49711994 0.         0.         0.29360705 0.49711994 0.
  0.29360705 0.49711994 0.29360705]]
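The numbers above can be reproduced by hand. The sketch below assumes the formula that scikit-learn documents for its default settings: a smoothed IDF of ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count, followed by L2 normalisation of each row. The tokenizer regex is again an approximation of the library default rather than its actual code.

```python
import math
import re
from collections import Counter

corpus = ["This is the first document.",
          "This document is the second document.",
          "And this is the third one."]


def tokenize(text):
    # Lowercase and keep tokens of at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [tokenize(d) for d in corpus]
vocab = sorted({t for doc in docs for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word occurs.
df = {w: sum(1 for doc in docs if w in doc) for w in vocab}

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}


def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))  # L2 norm of the row
    return [x / norm for x in raw]


first_row = tfidf_row(docs[0])
print([round(x, 4) for x in first_row])
```

"first" occurs in only one document, so it gets the largest weight in the first row, while "is", "the", and "this" occur everywhere and get the smallest.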


In [6]:
'''Ans 5:- The out-of-vocabulary (OOV) problem arises in
natural language processing (NLP) when a word in the input text
is not present in the vocabulary of the model. This can
happen when the model is trained on a limited dataset or when the
input text contains new or rare words. In this example, the
handle_oov_words() function replaces any word in the text that is not in the
vocabulary with the placeholder token <OOV>. This allows the model to
continue processing the text even if it encounters words that it
does not know. The OOV problem can be a challenge for NLP
models, but a number of techniques can be used to
handle it:

1. Providing a larger vocabulary: This can be done by training the model
   on a larger dataset or by adding new words to the vocabulary manually.
2. Using a statistical language model: This can help the model to predict 
   the meaning of OOV words based on the context in which they appear.
3. Using a neural machine translation model: This can learn to translate 
   OOV words from one language to another.'''
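In practice the placeholder idea is usually applied at the integer-encoding step, where one index is reserved for unknown words. The following is a minimal sketch under that assumption; the helper names `build_vocab` and `encode`, the `min_count` parameter, and the choice of index 0 for `<OOV>` are all illustrative.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_count=1):
    """Map each sufficiently frequent token to an integer id; id 0 is <OOV>."""
    counts = Counter(t for doc in tokenized_docs for t in doc)
    vocab = {"<OOV>": 0}
    for token, count in sorted(counts.items()):
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab):
    # Unknown tokens fall back to the reserved <OOV> id.
    return [vocab.get(token, vocab["<OOV>"]) for token in tokens]


train = [["i", "love", "nlp"], ["nlp", "is", "fun"]]
vocab = build_vocab(train)

ids = encode(["i", "love", "transformers"], vocab)
print(vocab)
print(ids)  # the last id is 0 because "transformers" was never seen
```

Raising `min_count` shrinks the vocabulary and sends more rare words to `<OOV>`, which is the trade-off behind the first technique in the list above.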

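# Example vocabulary, assumed here for illustration; a real model supplies its own.
vocabulary = {"this", "is", "a", "sentence", "with", "an", "word"}

def handle_oov_words(text):
    """Replace every word that is not in the vocabulary with <OOV>."""
    result = []
    for token in text.split():
        # Compare case-insensitively and ignore surrounding punctuation.
        word = token.strip(".,!?").lower()
        result.append(token if word in vocabulary else "<OOV>")
    return " ".join(result)

text = "This is a sentence with an OOV word."

# Replace OOV words with <OOV>
text = handle_oov_words(text)

print(text)

This is a sentence with an <OOV> word.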


In [7]:
'''Ans 6:- Word embeddings are numerical representations of words
that capture semantic relationships. They map words
to dense vectors in a continuous vector space. Word2Vec is a
popular technique to create word embeddings. This code uses
Word2Vec to create word embeddings and retrieve the vector
representation for the word "NLP."

The output we get is a 100-dimensional word embedding
vector for the word "NLP" generated by Word2Vec. The individual
dimensions are not directly interpretable; the meaning of a word
is encoded in the vector as a whole, learned from the contexts in
which the word appears in the training data. With a corpus as tiny
as this one, the values stay close to their random initialization,
so meaningful embeddings require far more text. These embeddings are
used to capture semantic similarities between words and are widely
used in natural language processing tasks.'''

from gensim.models import Word2Vec

sentences = [['I', 'love', 'NLP'], ['Word', 'embeddings', 'are', 'useful']]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

vector = model.wv['NLP']
print(vector)

[-0.00713902  0.00124103 -0.00717672 -0.00224462  0.0037193   0.00583312
  0.00119818  0.00210273 -0.00411039  0.00722533 -0.00630704  0.00464722
 -0.00821997  0.00203647 -0.00497705 -0.00424769 -0.00310898  0.00565521
  0.0057984  -0.00497465  0.00077333 -0.00849578  0.00780981  0.00925729
 -0.00274233  0.00080022  0.00074665  0.00547788 -0.00860608  0.00058446
  0.00686942  0.00223159  0.00112468 -0.00932216  0.00848237 -0.00626413
 -0.00299237  0.00349379 -0.00077263  0.00141129  0.00178199 -0.0068289
 -0.00972481  0.00904058  0.00619805 -0.00691293  0.00340348  0.00020606
  0.00475375 -0.00711994  0.00402695  0.00434743  0.00995737 -0.00447374
 -0.00138926 -0.00731732 -0.00969783 -0.00908026 -0.00102275 -0.00650329
  0.00484973 -0.00616403  0.00251919  0.00073944 -0.00339215 -0.00097922
  0.00997913  0.00914589 -0.00446183  0.00908303 -0.00564176  0.00593092
 -0.00309722  0.00343175  0.00301723  0.00690046 -0.00237388  0.00877504
  0.00758943 -0.00954765 -0.00800821 -0.0076379   0.
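Semantic similarity between embeddings is commonly measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors that are purely hypothetical (they are not Word2Vec output) to show how the comparison works.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])

print(round(sim_king_queen, 3))
print(round(sim_king_apple, 3))
```

Vectors that point in nearly the same direction score close to 1, so "king" and "queen" come out far more similar than "king" and "apple" in this toy setup.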

In [9]:
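'''Ans 7:- Continuous Bag of Words (CBOW) is one of the two Word2Vec
architectures for learning word embeddings. Unlike
Skip-gram, which predicts context words from a target word, CBOW
predicts a target word from its surrounding context words. It
learns word representations by minimizing the prediction
error. CBOW trains faster than Skip-gram and tends to represent
frequent words well, while Skip-gram is usually reported to do
better on small datasets and rare words.
To train a CBOW model with Gensim's Word2Vec, we pass the tokenized
sentences to the constructor with sg=0; Gensim builds the vocabulary
from the training data automatically.'''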

from gensim.models import Word2Vec

# Sample sentences
sentences = [
    ['i', 'love', 'machine', 'learning'],
    ['word', 'embeddings', 'are', 'useful'],
    ['machine', 'learning', 'is', 'fun'],
]

# Create and train the CBOW model
model = Word2Vec(sentences, vector_size=100, window=2, sg=0, min_count=1, workers=4)

# Access the word vectors
vector = model.wv['machine']
print(vector)

[-8.6196875e-03  3.6657380e-03  5.1898835e-03  5.7419385e-03
  7.4669183e-03 -6.1676754e-03  1.1056137e-03  6.0472824e-03
 -2.8400505e-03 -6.1735227e-03 -4.1022300e-04 -8.3689485e-03
 -5.6000124e-03  7.1045388e-03  3.3525396e-03  7.2256695e-03
  6.8002474e-03  7.5307419e-03 -3.7891543e-03 -5.6180597e-04
  2.3483764e-03 -4.5190323e-03  8.3887316e-03 -9.8581640e-03
  6.7646410e-03  2.9144168e-03 -4.9328315e-03  4.3981876e-03
 -1.7395747e-03  6.7113843e-03  9.9648498e-03 -4.3624435e-03
 -5.9933780e-04 -5.6956373e-03  3.8508223e-03  2.7866268e-03
  6.8910765e-03  6.1010956e-03  9.5384968e-03  9.2734173e-03
  7.8980681e-03 -6.9895042e-03 -9.1558648e-03 -3.5575271e-04
 -3.0998408e-03  7.8943167e-03  5.9385742e-03 -1.5456629e-03
  1.5109634e-03  1.7900408e-03  7.8175711e-03 -9.5101865e-03
 -2.0553112e-04  3.4691966e-03 -9.3897223e-04  8.3817719e-03
  9.0107834e-03  6.5365066e-03 -7.1162102e-04  7.7104042e-03
 -8.5343346e-03  3.2071066e-03 -4.6379971e-03 -5.0889552e-03
  3.5896183e-03  5.37033
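To see what CBOW is trained on, the sketch below lists the (context, target) examples that a window of 2 produces for one sentence. The helper `cbow_pairs` is hypothetical and only illustrates the data layout; it does not reproduce Gensim's internal training loop.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a token list."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target form the context.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


pairs = cbow_pairs(["i", "love", "machine", "learning"], window=2)
for context, target in pairs:
    print(context, "->", target)
```

Each example asks the model to predict the single target word from the combined context, which is why CBOW produces one training example per word position.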

In [11]:
'''Ans 8:- Skip-gram is the Word2Vec architecture that predicts the
surrounding context words from a given target word, the reverse of
CBOW. This code uses Gensim's Word2Vec to train a Skip-gram word
embedding model on a small dataset. It learns to represent words as
dense vectors of size 100 while considering a window of 2 words
around each target word. The 'sg' parameter is set to 1 for
Skip-gram. Finally, it retrieves the vector for the word
"learning" learned from its context in the sentences.'''

from gensim.models import Word2Vec

# Sample sentences
sentences = [
    ['i', 'love', 'machine', 'learning'],
    ['word', 'embeddings', 'are', 'useful'],
    ['machine', 'learning', 'is', 'fun'],
]

# Create and train the Skip-gram model
model = Word2Vec(sentences, vector_size=100, window=2, sg=1, min_count=1, workers=4)

# Access the word vectors
vector = model.wv['learning']
print(vector)

[-5.3622725e-04  2.3643136e-04  5.1033497e-03  9.0092728e-03
 -9.3029495e-03 -7.1168090e-03  6.4588725e-03  8.9729885e-03
 -5.0154282e-03 -3.7633716e-03  7.3805046e-03 -1.5334714e-03
 -4.5366134e-03  6.5540518e-03 -4.8601604e-03 -1.8160177e-03
  2.8765798e-03  9.9187379e-04 -8.2852151e-03 -9.4488179e-03
  7.3117660e-03  5.0702621e-03  6.7576934e-03  7.6286553e-04
  6.3508903e-03 -3.4053659e-03 -9.4640139e-04  5.7685734e-03
 -7.5216377e-03 -3.9361035e-03 -7.5115822e-03 -9.3004224e-04
  9.5381187e-03 -7.3191668e-03 -2.3337686e-03 -1.9377411e-03
  8.0774371e-03 -5.9308959e-03  4.5162440e-05 -4.7537340e-03
 -9.6035507e-03  5.0072931e-03 -8.7595852e-03 -4.3918253e-03
 -3.5099984e-05 -2.9618145e-04 -7.6612402e-03  9.6147433e-03
  4.9820580e-03  9.2331432e-03 -8.1579173e-03  4.4957981e-03
 -4.1370760e-03  8.2453608e-04  8.4986202e-03 -4.4621765e-03
  4.5175003e-03 -6.7869602e-03 -3.5484887e-03  9.3985079e-03
 -1.5776526e-03  3.2137157e-04 -4.1406299e-03 -7.6826881e-03
 -1.5080082e-03  2.46979
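The Skip-gram training signal can be sketched as a list of (target, context) pairs. The helper `skipgram_pairs` is a hypothetical name, and the sketch uses a fixed window; real implementations add refinements such as negative sampling and subsampling of frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs for a token list."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = skipgram_pairs(["machine", "learning", "is", "fun"], window=2)
for target, context in pairs:
    print(target, "->", context)
```

The four-word sentence yields ten pairs, one per (target, context) combination, so Skip-gram generates more training examples per sentence than CBOW does.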

In [1]:
'''Ans 9:- GloVe (Global Vectors for Word Representation) is an
unsupervised word embedding technique designed to capture the semantic
meaning of words in a large corpus of text. It focuses on the
co-occurrence statistics of words within a context window to create word
vectors.

GloVe constructs a global word-to-word co-occurrence
matrix that quantifies how often pairs of words appear together
in the same context. It then learns word embeddings by fitting
the vectors to the logarithm of these counts, an approach closely
related to matrix factorization. The key insight behind GloVe is
that word vectors should reflect the relationships between
words accurately, which means words with similar meanings will
have similar vectors.

GloVe embeddings have several advantages, including their ability
to capture semantic relationships and to generalize well to various
NLP tasks. Like other fixed-vocabulary embeddings, they provide no
vector for words that were absent from the training corpus.
Pre-trained GloVe vectors, trained on large text corpora, are
publicly available, making them useful for downstream applications
like sentiment analysis, machine translation, and named entity
recognition. GloVe has become a widely used tool in natural language
processing for representing words in a continuous vector space.'''
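The first stage of GloVe, building the co-occurrence counts, can be sketched in plain Python. The helper `cooccurrence` is hypothetical; the 1/distance weighting follows the scheme described for GloVe, while the window size and toy sentences are assumptions. The subsequent step of fitting vectors to these counts is not shown.

```python
from collections import defaultdict


def cooccurrence(sentences, window=2):
    """Symmetric co-occurrence counts, weighted by 1 / distance."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at up to `window` preceding words.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts


sentences = [
    ["i", "love", "machine", "learning"],
    ["machine", "learning", "is", "fun"],
]
counts = cooccurrence(sentences, window=2)

print(counts[("machine", "learning")])  # 2.0: adjacent in both sentences
print(counts[("i", "machine")])         # 0.5: two positions apart, once
```

Pairs that appear together often and close to each other accumulate large counts, and these counts are what the GloVe embeddings are trained to explain.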


