# Numerical Representation of Text

We convert text into numbers because machine learning models operate on numerical data.
A numerical representation lets a model perform mathematical operations, find patterns, and learn relationships in text. Common methods include Bag of Words (BoW), TF-IDF, and embeddings.

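## Bag of Words

Bag of Words (BoW) represents each document as a vector of word counts over a vocabulary shared by all documents; word order is discarded.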

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
data = ['Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [17]:
countvec = CountVectorizer()

countvec_fit = countvec.fit_transform(data)

print(countvec.get_feature_names_out())

['10' 'about' 'admirable' 'ahead' 'are' 'as' 'attacks' 'back' 'bait'
 'beach' 'blood' 'can' 'carol' 'decided' 'directions' 'drank' 'drawer'
 'efficiency' 'feet' 'from' 'giving' 'gruff' 'grumbling' 'handful' 'he'
 'himself' 'if' 'in' 'is' 'man' 'most' 'mountains' 'occur' 'of' 'old'
 'only' 'out' 'paired' 'people' 'quite' 'road' 'said' 'sat' 'scooped'
 'see' 'shark' 'she' 'shop' 'sign' 'since' 'so' 'socks' 'speed' 'that'
 'the' 'them' 'there' 'to' 'up' 'vampire' 'was' 'were' 'west' 'when'
 'where' 'which' 'with' 'work' 'works' 'worms' 'you']


In [18]:
bag_of_words = pd.DataFrame(countvec_fit.toarray(), columns=countvec.get_feature_names_out())

print(bag_of_words)

   10  about  admirable  ahead  are  as  attacks  back  bait  beach  ...  \
0   1      1          0      0    1   0        1     0     0      1  ...   
1   0      0          1      0    0   0        0     0     0      0  ...   
2   0      0          0      0    0   1        0     0     0      0  ...   
3   0      0          0      0    1   0        0     0     0      0  ...   
4   0      0          0      1    0   0        0     0     0      0  ...   
5   0      0          0      0    0   1        0     1     1      0  ...   

   were  west  when  where  which  with  work  works  worms  you  
0     0     0     0      1      0     0     0      0      0    0  
1     0     0     0      0      1     1     0      0      0    0  
2     1     0     0      0      0     0     0      0      0    0  
3     0     1     1      0      0     0     0      1      0    1  
4     0     0     0      0      0     0     1      0      0    0  
5     0     0     0      0      0     0     0      0      1    0 
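
The vocabulary printed above is lowercased and has no entry for the single-letter word "a", even though it appears in the sentences. A minimal pure-Python sketch of the same counting idea is shown below. It assumes the default `CountVectorizer` behaviour of lowercasing and keeping only tokens of two or more word characters; the helper names `tokenize` and `bag_of_words` are made up for illustration and are not part of scikit-learn.

```python
import re
from collections import Counter

# Assumed to mirror CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


def bag_of_words(docs):
    """Return (sorted vocabulary, count matrix as a list of rows)."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


docs = [
    "carol drank the blood as if she were a vampire",
    "the efficiency with which he paired the socks in the drawer was quite admirable",
]
vocab, matrix = bag_of_words(docs)

print(vocab)
print(matrix[1][vocab.index("the")])  # "the" occurs 3 times in the second sentence
print("a" in vocab)                   # False: single-character tokens are dropped
```

Each row of `matrix` corresponds to one document and each column to one vocabulary term, which is the same layout as the `bag_of_words` DataFrame.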

## Term Frequency – Inverse Document Frequency (TF–IDF)


#### Step 1: Term Frequency (TF)

$$
TF(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total terms in document } d}
$$

Example: the document `"I love NLP I love AI"` contains 6 terms, and "love" appears twice:

* TF("love") = 2 / 6 ≈ **0.33**
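
The same calculation can be sketched in a few lines of Python. This is an illustration only: it assumes whitespace tokenization and case-insensitive matching, and `term_frequency` is a hypothetical helper name.

```python
from collections import Counter


def term_frequency(term, document):
    """Count of `term` divided by the total number of tokens in `document`."""
    tokens = document.lower().split()  # assumption: split on whitespace
    if not tokens:
        return 0.0
    return Counter(tokens)[term.lower()] / len(tokens)


print(round(term_frequency("love", "I love NLP I love AI"), 2))  # 0.33
```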

---

#### Step 2: Inverse Document Frequency (IDF)

$$
IDF(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)
$$

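Words that appear in most documents (like "the") therefore get a **low IDF**, while rare words get a **high IDF**. With this formula, a term that occurs in every document gets an IDF of log(1) = 0.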
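
A small sketch of the IDF formula follows. The three-sentence `corpus` is made up for illustration, the logarithm is assumed to be the natural log (the formula does not fix a base), and `inverse_document_frequency` is a hypothetical helper name.

```python
import math


def inverse_document_frequency(term, documents):
    """log(total documents / documents containing `term`), natural log assumed."""
    term = term.lower()
    n_containing = sum(1 for doc in documents if term in doc.lower().split())
    if n_containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(documents) / n_containing)


# Hypothetical toy corpus, not the notebook's data.
corpus = ["the cat sat", "the dog barked", "the vampire slept"]

print(inverse_document_frequency("the", corpus))                # 0.0
print(round(inverse_document_frequency("vampire", corpus), 4))  # 1.0986
```

"the" occurs in all three documents, so its IDF is log(3/3) = 0, while "vampire" occurs in only one and scores log(3/1).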

---

#### Step 3: TF–IDF

$$
TF\text{-}IDF(t,d) = TF(t,d) \times IDF(t)
$$
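
Putting the three steps together, a minimal sketch might look like the following. It assumes whitespace tokenization and the natural log, uses a made-up toy corpus, and the function names `tf`, `idf`, and `tf_idf` are illustrative only.

```python
import math
from collections import Counter


def tf(term, tokens):
    """Step 1: share of the document's tokens that equal `term`."""
    return Counter(tokens)[term] / len(tokens)


def idf(term, tokenized_docs):
    """Step 2: log of (number of documents / documents containing `term`)."""
    n_containing = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(len(tokenized_docs) / n_containing)


def tf_idf(term, tokens, tokenized_docs):
    """Step 3: product of TF and IDF."""
    return tf(term, tokens) * idf(term, tokenized_docs)


# Hypothetical toy corpus, not the notebook's data.
corpus = ["the cat sat", "the dog barked", "the vampire slept"]
tokenized = [doc.lower().split() for doc in corpus]

print(tf_idf("the", tokenized[2], tokenized))                # 0.0
print(round(tf_idf("vampire", tokenized[2], tokenized), 4))  # 0.3662
```

The word shared by every document scores zero, while the word unique to one document carries that document's weight.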

---

In [19]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
data = ['Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [21]:
tfidfvec = TfidfVectorizer()

tfidfvec_fit = tfidfvec.fit_transform(data)
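
The numbers `TfidfVectorizer` produces do not follow the formulas in Steps 1 to 3 exactly. To my understanding, with default settings it uses raw counts as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The pure-Python sketch below re-implements that variant under those assumptions so the output can be checked by hand; `tokenize` and `smoothed_tfidf` are hypothetical helper names, not scikit-learn functions.

```python
import math
import re
from collections import Counter

data = ['Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

# Assumed default tokenization: lowercase, tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def smoothed_tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized per row."""
    token_counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: each Counter contributes each of its terms once.
    doc_freq = Counter(term for counts in token_counts for term in counts)
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    rows = []
    for counts in token_counts:
        raw = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows


weights = smoothed_tfidf(data)
print(round(weights[2]["were"], 4))  # 0.3565
print(round(weights[2]["as"], 4))    # 0.2923
```

These values line up with the `were` and `as` entries for row 2 in the table printed in the next cell (0.356474 and 0.292313).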

In [22]:
tfidf_bag = pd.DataFrame(tfidfvec_fit.toarray(), columns=tfidfvec.get_feature_names_out())

print(tfidf_bag)

         10     about  admirable     ahead       are        as   attacks  \
0  0.257061  0.257061   0.000000  0.000000  0.210794  0.000000  0.257061   
1  0.000000  0.000000   0.293641  0.000000  0.000000  0.000000  0.000000   
2  0.000000  0.000000   0.000000  0.000000  0.000000  0.292313  0.000000   
3  0.000000  0.000000   0.000000  0.000000  0.222257  0.000000  0.000000   
4  0.000000  0.000000   0.000000  0.290766  0.000000  0.000000  0.000000   
5  0.000000  0.000000   0.000000  0.000000  0.000000  0.178615  0.000000   

      back     bait     beach  ...      were     west     when     where  \
0  0.00000  0.00000  0.257061  ...  0.000000  0.00000  0.00000  0.257061   
1  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
2  0.00000  0.00000  0.000000  ...  0.356474  0.00000  0.00000  0.000000   
3  0.00000  0.00000  0.000000  ...  0.000000  0.27104  0.27104  0.000000   
4  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
5  0.21782 