## Bag of Words

Bag of Words (BoW) involves converting text documents into numerical vectors by counting the occurrences of each word in the document, disregarding grammar and word order. The resulting vector represents the frequency distribution of words in the document, forming a "bag" of words without considering their sequence or structure. BoW is commonly used for text classification.

Because it ignores order, BoW treats each document as an unordered collection of words, which gives text a fixed-length numerical form that machine learning models can consume. The resulting word counts make it possible to measure the similarity between documents, classify texts into different categories, perform sentiment analysis, and more. The trade-off is that two sentences containing the same words in a different order receive exactly the same vector.
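To make the idea concrete before reaching for a library, here is a minimal sketch that builds a BoW matrix by hand. The two toy documents, the regex tokenizer, and the names `toy_docs`, `toy_vocab`, and `toy_bow` are assumptions made for illustration only; a real vectorizer handles tokenization more carefully.

```python
import re
from collections import Counter

# Hypothetical two-document corpus, used only for illustration
toy_docs = ["The cat sat on the mat.", "The dog sat."]

def tokenize(text):
    # Lowercase and keep alphabetic runs; a deliberately simple tokenizer
    return re.findall(r"[a-z]+", text.lower())

# Count word occurrences in each document
toy_counts = [Counter(tokenize(doc)) for doc in toy_docs]

# The vocabulary is the sorted set of all words seen in the corpus
toy_vocab = sorted(set().union(*toy_counts))

# One row per document, one column per vocabulary word
toy_bow = [[counts[word] for word in toy_vocab] for counts in toy_counts]

print(toy_vocab)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(toy_bow)    # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each row has the same length as the vocabulary, and a word that never appears in a document simply gets a count of 0.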

In [1]:
corpus = ["This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",]

Here we have a corpus of four text documents. We use the CountVectorizer from scikit-learn to convert the text into a bag of words representation. The fit_transform method creates a sparse matrix where each row corresponds to a document, and each column represents a unique word in the corpus. The cell values indicate how many times each word occurs in the respective document. By default, CountVectorizer lowercases the text and drops punctuation and single-character tokens, which is why the vocabulary below is all lowercase.

The get_feature_names_out() method returns the list of unique words in the corpus, which forms the vocabulary. The BoW matrix and the vocabulary are then printed for inspection.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Fit and transform the corpus to create the BoW representation
bow_matrix = vectorizer.fit_transform(corpus)

# Get the vocabulary (list of unique words) from the vectorizer
vocabulary = vectorizer.get_feature_names_out()

# Print the BoW representation and the vocabulary
print(vocabulary)
print(bow_matrix.toarray())


['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
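One way to see what the matrix offers in practice is to compare documents by cosine similarity. The sketch below copies the rows from the printed output above into a plain list; the helper `cosine_similarity` and the name `bow_rows` are written here for illustration and are not part of scikit-learn's API being shown.

```python
import math

# Rows copied from the BoW matrix printed above
bow_rows = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]

def cosine_similarity(u, v):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Compare the first document with each of the others
for i in range(1, len(bow_rows)):
    sim = cosine_similarity(bow_rows[0], bow_rows[i])
    print(f"document 1 vs document {i + 1}: {sim:.3f}")
# document 1 vs document 2: 0.791
# document 1 vs document 3: 0.548
# document 1 vs document 4: 1.000
```

Documents 1 and 4 ("This is the first document." and "Is this the first document?") come out as identical, which shows both the strength and the limitation of BoW: the same words give the same vector, regardless of order.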


## N-grams


N-grams are contiguous sequences of N items (words, characters, or tokens) extracted from a text. The core intuition is to capture local context: by counting short sequences rather than isolated words (unigrams), N-grams retain some information about which words appear next to each other. This is particularly useful for capturing phrases, collocations, and local syntactic patterns, at the cost of a larger and sparser vocabulary as N grows.

For example, consider the sentence: "I love to code." Here are some examples of different N-grams for this sentence:

1-grams (unigrams): ["I", "love", "to", "code"] (counting these is exactly the Bag of Words representation)

2-grams (bigrams): ["I love", "love to", "to code"]

3-grams (trigrams): ["I love to", "love to code"]

4-grams (fourgrams): ["I love to code"]
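The lists above can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes whitespace tokenization and a hypothetical helper named `word_ngrams`; it is meant to show the sliding-window idea, not to replace a library tokenizer.

```python
def word_ngrams(tokens, n):
    # Slide a window of size n over the token list and join each window
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Punctuation is left out here so that a simple split is enough
tokens = "I love to code".split()

for n in range(1, 5):
    print(n, word_ngrams(tokens, n))
# 1 ['I', 'love', 'to', 'code']
# 2 ['I love', 'love to', 'to code']
# 3 ['I love to', 'love to code']
# 4 ['I love to code']
```

A sentence with k tokens yields k - n + 1 N-grams, so asking for N larger than the sentence length returns an empty list.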

In [10]:
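import nltk
from nltk.util import ngrams

# word_tokenize needs NLTK's tokenizer data; if it is missing, run
# nltk.download("punkt") once (newer NLTK releases may ask for "punkt_tab")
tokenized_corpus = [nltk.word_tokenize(doc) for doc in corpus]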

# Function to generate N-grams for the entire corpus
def generate_ngrams_for_corpus(corpus, n):
    n_grams_corpus = []
    for doc in corpus:
        n_grams = ngrams(doc, n)
        n_grams_doc = [' '.join(grams) for grams in n_grams]
        n_grams_corpus.append(n_grams_doc)
    return n_grams_corpus

# Generate and store N-grams for the corpus
N = 2
n_grams_corpus = generate_ngrams_for_corpus(tokenized_corpus, N)

# Print the N-grams for each document in the corpus
for i, doc_ngrams in enumerate(n_grams_corpus):
    print(f"N-grams for document {i + 1}: {doc_ngrams}")

N-grams for document 1: ['This is', 'is the', 'the first', 'first document', 'document .']
N-grams for document 2: ['This document', 'document is', 'is the', 'the second', 'second document', 'document .']
N-grams for document 3: ['And this', 'this is', 'is the', 'the third', 'third one', 'one .']
N-grams for document 4: ['Is this', 'this the', 'the first', 'first document', 'document ?']


## TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a numerical weighting that measures a word's importance in a document relative to a collection of documents. The core intuition is to prioritize words that are frequent in the document (TF) but rare across the corpus (IDF), since such words are more discriminative and informative for text analysis tasks.

### Steps for the TF-IDF vectorizer
Tokenization: Break down input documents into individual words (tokens).

Counting: Count the frequency of each word in each document, creating a document-term matrix.

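Term Frequency (TF) Calculation: Divide the number of times a term appears in a document by the total number of words in that document. This gives the relative frequency of the term within the document. Note that scikit-learn's TfidfVectorizer uses the raw count as TF and normalizes each document vector at the end instead.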

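Inverse Document Frequency (IDF) Calculation: IDF measures how rare a term is across the entire corpus: $\mathrm{idf}(t) = \ln\left(N / \mathrm{df}(t)\right) + 1$, where $N$ is the total number of documents in the corpus and $\mathrm{df}(t)$ is the number of documents containing the term. Adding 1 keeps the IDF of a term that appears in every document from being zero (since $\ln 1 = 0$), so the term is not wiped out when TF and IDF are multiplied in the next step. By default, scikit-learn also smooths the ratio and uses $\ln\left((1 + N) / (1 + \mathrm{df}(t))\right) + 1$.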

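TF-IDF Score Calculation: Multiply each term's TF by its IDF. The result is a TF-IDF matrix with one row per document and one column per term, where higher values mark terms that are more distinctive for that document within the corpus. By default, TfidfVectorizer then scales each row to unit Euclidean (L2) length, which keeps every value between 0 and 1.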
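The steps above can be sketched in plain Python. This sketch assumes the variant that scikit-learn's TfidfVectorizer applies by default: raw counts as TF, the smoothed IDF ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. The regex tokenizer and the names `doc_counts`, `idf`, and `manual_tfidf` are illustrative simplifications, not library code.

```python
import math
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Lowercase and keep word tokens of two or more characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Steps 1-2: tokenize and count terms per document
doc_counts = [Counter(tokenize(doc)) for doc in corpus]
vocab = sorted(set().union(*doc_counts))
n_docs = len(doc_counts)

# Step 4: smoothed inverse document frequency (assumed scikit-learn default)
df = {t: sum(1 for c in doc_counts if t in c) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

# Steps 3 and 5: raw count times IDF, then L2-normalize each row
manual_tfidf = []
for counts in doc_counts:
    row = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in row))
    manual_tfidf.append([v / norm for v in row])

for row in manual_tfidf:
    print([round(v, 4) for v in row])
```

In the first document, "first" (found in two documents) receives a higher weight than "document" (found in three), and both outweigh "is", "the", and "this", which occur in every document.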

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the corpus to get the TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Convert the matrix to an array for better visualization (optional)
tfidf_array = tfidf_matrix.toarray()

# Print the TF-IDF matrix
print(tfidf_array)


[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
