In [None]:
import pandas as pd 
import numpy as np 

In [11]:
df = pd.DataFrame({'text': ['people watch campusx', 'campusx watch campusx', 'people write comment', 'campusx write comment'],
                   'label': [1, 1, 0, 0]})

In [3]:
df.head()

,text,label
0,people watch campusx,1
1,campusx watch campusx,1
2,people write comment,0
3,campusx write comment,0


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [12]:
bow = cv.fit_transform(df['text'])

In [13]:
# vocabulary learned by the vectorizer: token -> column index
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


In [15]:
# count vectors for the first two documents (columns follow the vocabulary indices)
print(bow[0].toarray())
print(bow[1].toarray())

[[1 0 1 1 0]]
[[2 0 0 1 0]]


<h3> TF-IDF (Term Frequency-Inverse Document Frequency) </h3>

TF-IDF scores a word's importance in a document by combining how often the word appears in that document (Term Frequency) with how rare it is across a collection of documents (Inverse Document Frequency). The final score is the product of the two.

1. Term Frequency (TF):
* Concept: TF measures how often a specific word appears in a document. 
* Calculation: It's calculated by dividing the number of times a word appears in a document by the total number of words in that document. 
* Example: If the word "cat" appears 5 times in a document with 50 words, the TF of "cat" is 5/50 = 0.1. 

2. Inverse Document Frequency (IDF):
* Concept: IDF measures how rare a word is across the entire collection of documents. Words that appear in few documents get a higher weight than words that appear almost everywhere.
* Calculation: It's calculated as the log of the total number of documents (N) divided by the number of documents containing the word (DF): IDF = log(N / DF).
* Example: If "cat" appears in 2 out of 10 documents, the IDF is log(10/2) ≈ 1.6 (using the natural logarithm).

3. TF-IDF Score:
* Calculation: The TF-IDF score is calculated by multiplying the TF and IDF values. 
* Interpretation: A higher TF-IDF score indicates that the word is both frequent in the current document and relatively rare across the entire collection, suggesting it's a more important or relevant term. 
* Example: If the TF of "cat" is 0.1 and the IDF is 1.6, the TF-IDF score would be 0.1 * 1.6 = 0.16. 
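The three steps above can be checked numerically. The sketch below plugs in the "cat" numbers from the examples. It assumes the natural logarithm for IDF, since the text does not name a base and an IDF of about 1.6 for log(10/2) only holds for base e. The variable names are illustrative.

```python
import math

# Term frequency: "cat" appears 5 times in a 50-word document
tf = 5 / 50

# Inverse document frequency: "cat" appears in 2 of 10 documents
# (natural log is an assumption; the text does not state the base)
idf = math.log(10 / 2)

# TF-IDF score is the product of the two
tfidf_score = tf * idf

print(round(tf, 2), round(idf, 2), round(tfidf_score, 2))
# prints: 0.1 1.61 0.16
```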

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf.fit_transform(df['text']).toarray()

array([[0.49681612, 0.        , 0.61366674, 0.61366674, 0.        ],
       [0.8508161 , 0.        , 0.        , 0.52546357, 0.        ],
       [0.        , 0.57735027, 0.57735027, 0.        , 0.57735027],
       [0.49681612, 0.61366674, 0.        , 0.        , 0.61366674]])

In [17]:
print(tfidf.vocabulary_)

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


In [19]:
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

[1.22314355 1.51082562 1.51082562 1.51082562 1.51082562]
['campusx' 'comment' 'people' 'watch' 'write']
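The `idf_` values above do not match the plain log(N / DF) formula, and the rows of the TF-IDF matrix are not raw TF × IDF products. With its default settings, `TfidfVectorizer` uses raw term counts as TF, a smoothed IDF of ln((1 + N) / (1 + DF)) + 1, and then scales each row to unit L2 norm. The sketch below rebuilds the numbers by hand under those defaults (`smooth_idf=True`, `norm='l2'`); `manual_idf` and `manual_tfidf` are illustrative names, not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']

# Raw term counts (the bag-of-words matrix)
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF, as used when smooth_idf=True
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 norm
weighted = counts * manual_idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against scikit-learn's own results
tfidf = TfidfVectorizer()
expected = tfidf.fit_transform(docs).toarray()
print(np.allclose(manual_idf, tfidf.idf_))
print(np.allclose(manual_tfidf, expected))
```

For example, "campusx" appears in 3 of the 4 documents, so its smoothed IDF is ln(5/4) + 1 ≈ 1.223, which matches the first entry of `idf_`. The other four words each appear in 2 documents, giving ln(5/3) + 1 ≈ 1.511.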
