# Word Representations

In Natural Language Processing (NLP), word representations are ways of encoding words as numbers so that computers can process them. Think of it as translating words into the language computers speak: numbers!

Common methods:

1. **Traditional Methods:**
   - *One-Hot Encoding:* Each word becomes a unique vector with a single "1" (at that word's position) and zeros everywhere else. Simple, but it doesn't capture relationships between words.
   - *TF-IDF:* Considers how frequently a word appears in a document (Term Frequency) and how rare it is across all documents (Inverse Document Frequency). Helps identify the words that matter most in a given document.

2. **Word Embeddings:**
   - *Word2Vec:* Learns dense word vectors by predicting words that appear near each other in a large text corpus. The vectors capture semantic relationships (e.g., "king" - "man" + "woman" ≈ "queen").
   - *GloVe:* Produces similar dense vectors, but learns them from global word co-occurrence statistics gathered over the whole corpus rather than from local context windows alone.
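To make the one-hot idea from the traditional methods concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the helper names `one_hot` and `dot` are assumptions chosen for illustration, not part of any library.

```python
# Toy vocabulary (an assumption for illustration; any word list works).
vocab = ["cat", "dog", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a list with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


print(one_hot("cat"))  # [1, 0, 0]
print(one_hot("dog"))  # [0, 1, 0]

# Any two different one-hot vectors have a dot product of 0,
# so the encoding carries no notion of similarity between words.
print(dot(one_hot("cat"), one_hot("dog")))  # 0
```

The vector length equals the vocabulary size, which is why one-hot vectors become very large and sparse for realistic vocabularies.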

**Why Word Embeddings Matter:**

**Dimensionality Reduction:** Representing words in a dense, lower-dimensional space (typically a few hundred dimensions, instead of one dimension per vocabulary word as in one-hot encoding) makes computation more efficient.

**Semantic Capture:** Words with similar meanings have similar vector representations, allowing models to pick up on relationships and analogies.

**Transfer Learning:** Pre-trained word embeddings can be reused across different NLP tasks, saving time and resources.
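"Similar vector representations" is usually measured with cosine similarity. The sketch below shows the calculation on hand-picked 3-dimensional toy vectors. These numbers are hypothetical and chosen only to illustrate the idea; real embeddings are learned from data and have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, NOT taken from a trained model.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

With these toy values, "cat" is closer to "dog" than to "car", which is the kind of pattern trained embeddings are expected to show for related words.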

In a nutshell, word representations are essential for NLP tasks like machine translation, sentiment analysis, and text classification, enabling computers to understand the meaning and context of human language.

## **Bag of Words**

The Bag-of-Words (BoW) model is a way to represent text data for machine learning algorithms. It's like putting all the words from a document into a bag and counting how many times each word appears, ignoring the order or grammar.

Imagine you have two sentences:

- "The cat sat on the mat."
- "The dog chased the cat."

Steps to create a BoW representation:

1. **Vocabulary:** Combine all unique words from both sentences, after lowercasing and dropping punctuation: {"the", "cat", "sat", "on", "mat", "dog", "chased"}.

2. **Counting:** For each sentence, count how many times each vocabulary word appears:
   - Sentence 1: {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1, "dog": 0, "chased": 0}
   - Sentence 2: {"the": 2, "cat": 1, "sat": 0, "on": 0, "mat": 0, "dog": 1, "chased": 1}

3. **Result:** Each sentence is now represented as a vector of word counts. This numerical representation can be used as input for machine learning algorithms.
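The steps above can be sketched by hand with only the standard library. The `tokenize` helper and its simple regular expression are assumptions made for this example; real tokenizers handle many more cases.

```python
import re
from collections import Counter

sentences = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(sentence) for sentence in sentences]

# Step 1: vocabulary, in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Step 2: count each vocabulary word in each sentence.
bow_vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
# ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'chased']
print(bow_vectors)
# [[2, 1, 1, 1, 1, 0, 0], [2, 1, 0, 0, 0, 1, 1]]
```

The column order here follows first appearance, so it matches the vocabulary listed in the steps.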

Key points:

- BoW ignores word order and grammar.
- It focuses on word frequency.
- It's simple and efficient for basic text analysis tasks.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat."
]

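# By default, CountVectorizer lowercases the text and extracts word tokens,
# so punctuation is dropped.
vectorizer = CountVectorizer()
# Learn the vocabulary and build the (sparse) document-term count matrix
bow_matrix = vectorizer.fit_transform(corpus)

# Vocabulary in column order (alphabetical)
print(vectorizer.get_feature_names_out())
print()
# Dense view: one row per sentence, one column per vocabulary word
print(bow_matrix.toarray())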

['cat' 'chased' 'dog' 'mat' 'on' 'sat' 'the']

[[1 0 0 1 1 1 2]
 [1 1 1 0 0 0 2]]


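The code above outputs the BoW representation of the two sentences as a matrix of counts: one row per sentence and one column per vocabulary word, in the alphabetical order returned by `get_feature_names_out()`.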

**Limitations**:

- BoW loses information about word order and context.
- It doesn't capture semantic relationships between words.

Despite its limitations, BoW is a useful tool for basic text analysis tasks and can be a starting point for more advanced NLP techniques.


## **TF-IDF (Term Frequency - Inverse Document Frequency)**

TF-IDF is a numerical statistic used in Natural Language Processing (NLP) to reflect how important a word is to a document within a collection of documents (a corpus).

Here's how it works with an example:

Imagine you have a corpus of 3 documents:

- Document 1: "The cat sat on the mat."
- Document 2: "The dog chased the cat."
- Document 3: "The dog barked loudly."

Let's calculate the TF-IDF for the word "cat":

1. Term Frequency (TF):

   Count how many times "cat" appears in each document:
   - Document 1: 1
   - Document 2: 1
   - Document 3: 0

   Calculate the term frequency (TF) for each document by dividing the count by the total number of words in that document: 1/6 for Document 1, 1/5 for Document 2, and 0 for Document 3.

2. Inverse Document Frequency (IDF):

   Count how many documents contain the word "cat": 2 (Document 1 and Document 2).

   Calculate IDF: log(Total number of documents / Number of documents containing the word) = log(3/2)

3. TF-IDF:

   Multiply TF and IDF for each document:
   - Document 1: TF(cat) * IDF(cat) = (1/6) * log(3/2), about 0.068 with the natural logarithm
   - Document 2: TF(cat) * IDF(cat) = (1/5) * log(3/2), about 0.081 with the natural logarithm
   - Document 3: 0 (since "cat" doesn't appear)

Interpretation:

A high TF-IDF score means the word is frequent in a specific document but rare across the corpus, indicating its importance for that document.
In our example, "cat" has a higher TF-IDF score in Documents 1 and 2 than a word like "the", which appears in every document: IDF(the) = log(3/3) = 0, so its TF-IDF score is 0 everywhere.
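The walkthrough above can be checked with a few lines of plain Python. This is a sketch under two assumptions: the logarithm is the natural log (the text only says "log"), and tokenization is a simple lowercase letter match. The helper names `tf`, `idf` and `tf_idf` are made up for this example.

```python
import math
import re

documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The dog barked loudly.",
]

# Simplified tokenization: lowercase, letters only (an assumption).
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]


def tf(term, tokens):
    """Term count divided by the number of words in the document."""
    return tokens.count(term) / len(tokens)


def idf(term, docs):
    """log(N / DF). Assumes the term occurs in at least one document."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / df)


def tf_idf(term, tokens, docs):
    return tf(term, tokens) * idf(term, docs)


for term in ("cat", "the"):
    scores = [round(tf_idf(term, tokens, tokenized), 4) for tokens in tokenized]
    print(term, scores)
# cat [0.0676, 0.0811, 0.0]
# the [0.0, 0.0, 0.0]
```

Because "the" occurs in all three documents, its IDF is log(3/3) = 0 and every one of its scores is zero, while "cat" keeps a positive score in the two documents that contain it.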


Key Points:

- TF-IDF helps identify words that are unique and relevant to a specific document within a larger corpus.
- It's used in various NLP tasks like information retrieval, text classification, and keyword extraction.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The dog barked loudly."
]

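# Note: TfidfVectorizer's defaults (smoothed IDF, L2-normalized rows) differ
# from the textbook formula, so the values below will not match a hand calculation.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)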

print(vectorizer.get_feature_names_out())
print()
print(tfidf_matrix.toarray())

['barked' 'cat' 'chased' 'dog' 'loudly' 'mat' 'on' 'sat' 'the']

[[0.         0.34101521 0.         0.         0.         0.44839402
  0.44839402 0.44839402 0.52965746]
 [0.         0.40352536 0.53058735 0.40352536 0.         0.
  0.         0.         0.62674687]
 [0.5844829  0.         0.         0.44451431 0.5844829  0.
  0.         0.         0.34520502]]


**TF-IDF Calculation**

**TF (Term Frequency):** The number of times a term appears in a document divided by the total number of terms in that document.



---


**IDF (Inverse Document Frequency):** Measures how rare, and therefore how informative, a term is across the entire corpus. It is calculated as:

$$\mathrm{IDF}(t) = \log\left(\frac{N}{\mathrm{DF}(t)}\right)$$

Where:

- $N$ is the total number of documents.
- $\mathrm{DF}(t)$ is the number of documents that contain the term $t$.

**TF-IDF:** The TF-IDF value for a term in a document is calculated as:

$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$
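The matrix printed by `TfidfVectorizer` earlier does not follow these textbook formulas exactly. With its default settings, scikit-learn uses the raw count as TF, a smoothed IDF of ln((1 + N) / (1 + DF(t))) + 1, and then scales each document row to unit L2 length. The sketch below reimplements that variant by hand for the three-document corpus. The tokenizer is a simplified stand-in (an assumption) that happens to give the same tokens on this corpus, and the helper names are made up for this example.

```python
import math
import re
from collections import Counter

documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The dog barked loudly.",
]

# Simplified tokenization (an assumption): lowercase, letters only.
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(tokenized)


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + DF)) + 1."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(tokens):
    """Raw count times smoothed IDF, then L2-normalize the row."""
    counts = Counter(tokens)
    raw = [counts[term] * smoothed_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


rows = [tfidf_row(tokens) for tokens in tokenized]

print(vocabulary)
for row in rows:
    print([round(value, 4) for value in row])
```

This also explains why "the" gets the largest weight in the first two rows of that matrix: the smoothed IDF never drops below 1, so a word that occurs in every document is not zeroed out, and "the" appears twice in each of those documents.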

### **Limitations of Bag of Words and TF-IDF**





Bag of Words (BoW):

**Loss of Word Order and Context:** BoW completely ignores the order of words in a sentence. This means it can't distinguish between "The cat chased the dog" and "The dog chased the cat," even though the meaning is completely different.


**No Semantic Understanding:** BoW only considers word frequencies and doesn't capture the semantic relationships between words. It treats all words as equally important, regardless of their meaning or context.
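The word-order point is easy to verify. In this minimal sketch (the `bag_of_words` helper is an assumption for illustration), the two sentences from the example produce exactly the same bag of counts:

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


bow_a = bag_of_words("The cat chased the dog")
bow_b = bag_of_words("The dog chased the cat")

print(bow_a == bow_b)  # True
```

Any model that only sees these counts has no way to tell who chased whom.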


---



TF-IDF:

**Limited Contextual Awareness:** While TF-IDF considers the rarity of words across documents, it still doesn't fully capture the context in which words are used. For example, it might give high scores to words that are frequent in a specific topic but not necessarily important for understanding the overall meaning of a document.

**Sparsity:** The TF-IDF matrix can become very sparse (mostly filled with zeros) when dealing with large vocabularies, which can pose challenges for some machine learning algorithms.


---

General Limitations of Both:

**Inability to Handle Out-of-Vocabulary Words:** Both methods struggle with words that were not present in the training corpus. They can't assign meaningful representations to these new words.

**Limited Representation of Complex Linguistic Phenomena:** BoW and TF-IDF don't capture more complex linguistic features like syntax, negation, or sarcasm, which can be crucial for understanding natural language.
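The out-of-vocabulary problem can be seen in a small sketch. The vocabulary is built from the two training sentences used earlier, and the new sentence, along with the `tokenize` and `vectorize` helpers, is a hypothetical example:

```python
import re

training_corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


# Vocabulary fixed at training time.
vocabulary = sorted({token for doc in training_corpus for token in tokenize(doc)})


def vectorize(text):
    """Count only words that are already in the training vocabulary."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocabulary]


new_sentence = "The hamster sat quietly."
unseen = [token for token in tokenize(new_sentence) if token not in vocabulary]

print(vocabulary)               # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(unseen)                   # ['hamster', 'quietly']
print(vectorize(new_sentence))  # [0, 0, 0, 0, 0, 1, 1]
```

The unseen words simply vanish from the vector, so the representation of the new sentence keeps only "sat" and "the".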

To overcome these limitations, more advanced techniques like word embeddings (Word2Vec, GloVe) and deep learning models (RNNs, Transformers) are often used in modern NLP tasks. These methods can capture semantic relationships and context, and models that operate on subword units can even handle out-of-vocabulary words to some extent. Plain Word2Vec and GloVe still have a fixed vocabulary.


## **Word2Vec**

Word2Vec is a technique for learning word embeddings, which are dense vector representations of words that capture semantic relationships between them. Unlike BoW and TF-IDF, Word2Vec represents words in a continuous vector space, where words with similar meanings are located closer to each other.



There are two main architectures for Word2Vec:

**Continuous Bag-of-Words (CBOW):** Predicts a target word based on its surrounding context words. It essentially tries to guess a word given its neighbors.

**Skip-gram:** Predicts context words within a certain window given a target word. It tries to guess the neighbors of a given word.
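The difference between the two architectures shows up in how training examples are built from a sentence. The sketch below only generates those examples and does not train anything. The function names and the window size of 1 are assumptions for illustration, and real implementations add details such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=1):
    """(target, context) pairs: the target word is used to predict each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=1):
    """(context words, target) examples: the neighbors are used to predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples


tokens = ["the", "cat", "sat"]

print(skipgram_pairs(tokens))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
print(cbow_examples(tokens))
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
```

Skip-gram produces one training pair per neighbor, while CBOW groups all neighbors of a position into a single example.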

In [None]:
# Note: the code below is not working

In [None]:
!pip install gensim==4.2.0



In [None]:
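import gensim.downloader as api

# Download (on first use) and load the pre-trained Google News vectors.
# Note: this is a large download (over 1 GB).
wv = api.load('word2vec-google-news-300')

# Number of words in the model's vocabulary
vocab_len = len(wv.key_to_index)
print(vocab_len)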