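Some common terms to remember:
1. Corpus: the full collection of texts being analyzed
2. Vocabulary: the set of unique words in the corpus
3. Document: a single text in the corpus (here, one row of the DataFrame)
4. Word: a single token within a document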

# Bag of words

In [2]:
import numpy as np
import pandas as pd


In [3]:
df = pd.DataFrame({
    "text": ["people watch dswithbappy",
             "dswithbappy watch dswithbappy",
             "people write comment",
             "dswithbappy write comment"],
    "output": [1, 1, 0, 0],
})

In [4]:
df

                            text  output
0       people watch dswithbappy       1
1  dswithbappy watch dswithbappy       1
2           people write comment       0
3      dswithbappy write comment       0


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()

CountVectorizer from sklearn.feature_extraction.text is a tool used in natural language processing (NLP) to convert a collection of text documents into a numerical format that machine learning models can understand.

Here’s what happens when you create and use it:

It tokenizes the text: lowercases it (by default) and breaks it into words or tokens.

It counts the frequency of each token (word) in the documents.

It creates a document-term matrix, where each row is a document/text, and each column corresponds to a unique word from the entire set of documents.

The value in the matrix cells is the count of how many times the word appears in that document.
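The steps above can be sketched in plain Python. This is a simplified illustration rather than scikit-learn's actual implementation: it assumes whitespace splitting is a good enough tokenizer for these texts, whereas CountVectorizer's default tokenizer uses a regular expression that also strips punctuation and drops one-character tokens.

```python
from collections import Counter

docs = [
    "people watch dswithbappy",
    "dswithbappy watch dswithbappy",
    "people write comment",
    "dswithbappy write comment",
]

# Step 1: tokenize (simplified: lowercase, then split on whitespace).
tokenized = [doc.lower().split() for doc in docs]

# Step 2: build the vocabulary, mapping each unique word to a column index
# in alphabetical order.
unique_words = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {word: index for index, word in enumerate(unique_words)}

# Step 3: build the document-term matrix, one row of counts per document.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)  # a missing word counts as 0
    matrix.append([counts[word] for word in unique_words])

print(vocabulary)
for row in matrix:
    print(row)
```

For this small corpus, the vocabulary and rows agree with what `cv.vocabulary_` and `bow.toarray()` report in the cells that follow.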

In [6]:
bow=cv.fit_transform(df['text'])

In [7]:
# vocabulary: maps each word to its column index
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'dswithbappy': 1, 'write': 4, 'comment': 0}


In [8]:
bow.toarray()

array([[0, 1, 1, 1, 0],
       [0, 2, 0, 1, 0],
       [1, 0, 1, 0, 1],
       [1, 1, 0, 0, 1]], dtype=int64)

In [9]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray()) 

[[0 1 1 1 0]]
[[0 2 0 1 0]]
[[1 0 1 0 1]]


How bag-of-words vectors are formed:

For each text, the vector is of length 5 (number of unique words). Each element in the vector shows how many times that word occurs in the sentence.

Let's map the vocabulary:

Index	Word
0	comment
1	dswithbappy
2	people
3	watch
4	write

Vector for each example:

For text = "people watch dswithbappy"

Counts:

comment: 0

dswithbappy: 1

people: 1

watch: 1

write: 0

Vector: `[0, 1, 1, 1, 0]`

For text = "dswithbappy watch dswithbappy"

Counts:

comment: 0

dswithbappy: 2 (appears twice)

people: 0

watch: 1

write: 0

Vector: `[0, 2, 0, 1, 0]`

For text = "people write comment"

Counts:

comment: 1

dswithbappy: 0

people: 1

watch: 0

write: 1

Vector: `[1, 0, 1, 0, 1]`

Summary:

The rows of the sparse matrix shown above:

[[0 1 1 1 0]]
[[0 2 0 1 0]]
[[1 0 1 0 1]]

represent the count of each unique word (in the fixed vocabulary order) in each corresponding text from your DataFrame.

Each position corresponds to the count of a specific word.

The length of each vector is the number of unique words (5 here).

The values are word counts.

So this is exactly how the BoW sparse matrix is formed from your text data using CountVectorizer.

In [10]:
cv.transform(['Bappy watch dswithbappy']).toarray()

array([[0, 1, 0, 1, 0]], dtype=int64)
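In the result above, only the `dswithbappy` and `watch` columns are non-zero. `transform` reuses the vocabulary learned during `fit_transform`, lowercases the input by default, and ignores any word it has not seen before (here "bappy"). A minimal sketch of that behaviour follows; `transform_with_vocab` is a hypothetical helper written for illustration, and whitespace splitting stands in for the real tokenizer.

```python
from collections import Counter

# Vocabulary as learned from the four training texts.
vocabulary = {"comment": 0, "dswithbappy": 1, "people": 2, "watch": 3, "write": 4}

def transform_with_vocab(text, vocabulary):
    """Count only the words already in the vocabulary; skip everything else."""
    counts = Counter(text.lower().split())
    row = [0] * len(vocabulary)
    for word, count in counts.items():
        index = vocabulary.get(word)
        if index is not None:  # unseen words such as "bappy" are dropped
            row[index] = count
    return row

new_row = transform_with_vocab("Bappy watch dswithbappy", vocabulary)
print(new_row)
```

Because unseen words are silently dropped, a new text made up entirely of unknown words would become an all-zero vector.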

In [11]:
X=bow.toarray()
y=df['output']


# N-grams

In [12]:

df = pd.DataFrame({
    "text": ["people watch dswithbappy",
             "dswithbappy watch dswithbappy",
             "people write comment",
             "dswithbappy write comment"],
    "output": [1, 1, 0, 0],
})
df

                            text  output
0       people watch dswithbappy       1
1  dswithbappy watch dswithbappy       1
2           people write comment       0
3      dswithbappy write comment       0


In [14]:
# Bigrams: each feature is a pair of consecutive words
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(2,2))

In [16]:
bow =cv.fit_transform(df['text'])

In [17]:
print(cv.vocabulary_)

{'people watch': 2, 'watch dswithbappy': 4, 'dswithbappy watch': 0, 'people write': 3, 'write comment': 5, 'dswithbappy write': 1}


In [18]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())

[[0 0 1 0 1 0]]
[[1 0 0 0 1 0]]
[[0 0 0 1 0 1]]


In [19]:
# Trigrams: each feature is a run of three consecutive words
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(3,3))


In [20]:
bow =cv.fit_transform(df['text'])


In [21]:
print(cv.vocabulary_)

{'people watch dswithbappy': 2, 'dswithbappy watch dswithbappy': 0, 'people write comment': 3, 'dswithbappy write comment': 1}


In [22]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())

[[0 0 1 0]]
[[1 0 0 0]]
[[0 0 0 1]]
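The bigram and trigram vocabularies above can be reproduced by sliding a window of n words over each text. The sketch below is an illustration under simplifying assumptions, not scikit-learn's implementation: `word_ngrams` and `ngram_vocabulary` are hypothetical helpers, and whitespace splitting replaces the real tokenizer.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in the text, joined by spaces."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_vocabulary(docs, n):
    """Map each unique n-gram to a column index, in alphabetical order."""
    unique = sorted({gram for doc in docs for gram in word_ngrams(doc, n)})
    return {gram: index for index, gram in enumerate(unique)}

docs = [
    "people watch dswithbappy",
    "dswithbappy watch dswithbappy",
    "people write comment",
    "dswithbappy write comment",
]

bigram_vocab = ngram_vocabulary(docs, 2)
trigram_vocab = ngram_vocabulary(docs, 3)

print(word_ngrams(docs[0], 2))
print(bigram_vocab)
print(trigram_vocab)
```

Each three-word text yields two bigrams but only one trigram, which is why every trigram row above contains a single 1. To keep single words alongside bigrams, CountVectorizer accepts a range such as `ngram_range=(1, 2)`.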


# TF-IDF (Term Frequency-Inverse Document Frequency)

In [23]:

df = pd.DataFrame({
    "text": ["people watch dswithbappy",
             "dswithbappy watch dswithbappy",
             "people write comment",
             "dswithbappy write comment"],
    "output": [1, 1, 0, 0],
})

df

                            text  output
0       people watch dswithbappy       1
1  dswithbappy watch dswithbappy       1
2           people write comment       0
3      dswithbappy write comment       0


In [25]:
# Note: the class is spelled TfidfVectorizer. Importing TfidVectorizer
# (missing the second "f") raises an ImportError.

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfid = TfidfVectorizer()


In [27]:
arr=tfid.fit_transform(df['text']).toarray()

In [28]:
arr

array([[0.        , 0.49681612, 0.61366674, 0.61366674, 0.        ],
       [0.        , 0.8508161 , 0.        , 0.52546357, 0.        ],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.57735027],
       [0.61366674, 0.49681612, 0.        , 0.        , 0.61366674]])

### The TfidfVectorizer

TfidfVectorizer in scikit-learn is similar in usage to CountVectorizer, but instead of just counting word frequencies, it computes the TF-IDF score for each word in each document.

How TfidfVectorizer works (using your example texts):

Your text data:

"people watch dswithbappy"

"dswithbappy watch dswithbappy"

"people write comment"

"dswithbappy write comment"

TfidfVectorizer transforms this corpus into a matrix where each element represents the TF-IDF score of a word in a particular document.

What is TF-IDF?
TF (Term Frequency): Measures how frequently a term occurs in a document, similar to the count in Bag of Words.

IDF (Inverse Document Frequency): Measures how rare a term is across all documents. Words common to many documents get lower weight; rare words get higher weight.

The TF-IDF score multiplies the two, so it is highest for words that are frequent in a document but rare across the corpus.
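As a worked formula, and assuming scikit-learn's default settings (`smooth_idf=True`, `sublinear_tf=False`), the score for a term $t$ in a document $d$, with $n$ documents in the corpus and $\mathrm{df}(t)$ the number of documents that contain $t$, is:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

For the four example texts ($n = 4$), "comment" appears in 2 documents and "dswithbappy" in 3:

$$
\mathrm{idf}(\text{comment}) = \ln\frac{5}{3} + 1 \approx 1.5108,
\qquad
\mathrm{idf}(\text{dswithbappy}) = \ln\frac{5}{4} + 1 \approx 1.2231
$$

Other TF-IDF variants define IDF slightly differently, so these numbers are specific to that default.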

Workflow:
Build vocabulary: Extract unique words from all documents (like in CountVectorizer).

Compute term frequency (TF): Count how often each word appears in each document.

Compute inverse document frequency (IDF): Calculate how rare each word is across all documents.

Calculate TF-IDF: Multiply TF by IDF to generate weighted scores.

Normalize: By default, each document vector is scaled to unit length (L2 norm), so long and short documents are comparable.
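The workflow above can be sketched by hand in plain Python. This is a minimal illustration, assuming scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 normalization; whitespace splitting stands in for the real tokenizer, and the variable names are chosen for this example.

```python
import math
from collections import Counter

docs = [
    "people watch dswithbappy",
    "dswithbappy watch dswithbappy",
    "people write comment",
    "dswithbappy write comment",
]

# 1. Build the vocabulary (alphabetical order, as in CountVectorizer).
tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(docs)

# 2. Document frequency: in how many documents does each word appear?
doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocab}

# 3. Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1.
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

# 4. TF * IDF per document, then 5. L2-normalize each row.
tfidf = []
for tokens in tokenized:
    counts = Counter(tokens)
    weighted = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in weighted)) or 1.0
    tfidf.append([value / norm for value in weighted])

print([round(idf[word], 4) for word in vocab])
for row in tfidf:
    print([round(value, 4) for value in row])
```

For this corpus the rows agree with the `arr` output above to the displayed precision. The `or 1.0` guard only matters for a document with no known words, where it avoids dividing by zero.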

### Key outcomes:
A word like "dswithbappy" that appears twice in one document gets a higher TF there, but because it appears in three of the four documents, its IDF is lower (1.22 versus 1.51 for the other words), which reduces its weight somewhat.

Words that appear in fewer documents (like "people", "comment", "write", each in two of the four) get a higher IDF and therefore relatively higher TF-IDF scores in the documents that contain them.

The resulting matrix is sparse and L2-normalized, and it holds weighted features rather than raw counts.

Why use TF-IDF over CountVectorizer?
Bag of Words treats all words equally regardless of their frequency across documents.

TF-IDF weights words to reduce the effect of common words that are less informative (like "watch" if it appears everywhere).

TF-IDF often gives a better feature representation than raw counts for tasks like text classification or clustering.

In [29]:

print(tfid.idf_)

[1.51082562 1.22314355 1.51082562 1.51082562 1.51082562]
