# NLP - Natural language processing

### Term frequency

Term frequency $tf(t,d)$ is the relative frequency of term $t$ in document $d$, i.e. how often the term occurs in the document relative to the document's length.

$$
tf(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
$$

where $f_{t,d}$ is the raw count, i.e. the number of times term $t$ occurs in document $d$, and the denominator is the total number of terms in the document. $tf(t,d)$ can also be defined in several other ways; the simplest is to set it equal to the raw count $f_{t,d}$, which is what the code below does.
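The relative form of the formula can be sketched as follows. This is a minimal illustration, assuming plain whitespace tokenization and lowercasing; the helper name `relative_term_frequency` is hypothetical and not part of the notebook.

```python
from collections import Counter


def relative_term_frequency(document):
    """Return {term: count / total_terms} for a whitespace-tokenized document."""
    tokens = document.lower().split()
    counts = Counter(tokens)   # raw counts f_{t,d}
    total = len(tokens)        # denominator: total number of terms in d
    return {term: count / total for term, count in counts.items()}


tf = relative_term_frequency("I LOVE this book about love")
print(tf["love"])  # "love" accounts for 2 of the 6 tokens
```

Because every count is divided by the same total, the values for one document sum to 1, which makes documents of different lengths comparable.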

In [None]:
import numpy as np

review1 = "I LOVE this book about love"
review2 = "no this book was okay"

all_words = [text.lower().split() for text in [review1,review2]]

print(all_words)

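# Flatten the per-review token lists into one list of words
all_words = [word for words in all_words for word in words]

# The vocabulary is the set of distinct words across both reviews
unique_words = set(all_words)

print(unique_words)
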
[['i', 'love', 'this', 'book', 'about', 'love'], ['no', 'this', 'book', 'was', 'okay']]


{'about', 'book', 'i', 'love', 'no', 'okay', 'this', 'was'}

In [13]:
vocabulary = {word: index for index, word in enumerate(unique_words)}

print(vocabulary)

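def term_frequency_vectorizer(document, vocabulary):
    """Return the raw count of each vocabulary word in `document`."""
    term_frequency = np.zeros(len(vocabulary))

    for word in document.lower().split():
        # Skip words that are not in the vocabulary instead of raising KeyError
        index = vocabulary.get(word)
        if index is not None:
            term_frequency[index] += 1

    return term_frequency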

review1_term_freq = term_frequency_vectorizer(review1, vocabulary)
review2_term_freq = term_frequency_vectorizer(review2, vocabulary)


print(review1_term_freq, review2_term_freq)

{'about': 0, 'love': 1, 'okay': 2, 'no': 3, 'this': 4, 'book': 5, 'was': 6, 'i': 7}
[1. 2. 0. 0. 1. 1. 0. 1.] [0. 0. 1. 1. 1. 1. 1. 0.]


In [14]:
import pandas as pd
bag_of_words = pd.DataFrame([review1_term_freq,review2_term_freq], columns=vocabulary.keys(), dtype="int16")

bag_of_words

| | about | love | okay | no | this | book | was | i |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |


### TF-IDF

TF-IDF stands for term frequency-inverse document frequency. It measures how important a word is to a document relative to a corpus of documents: each document is represented as a vector with one numeric weight per word, and words that appear in many documents receive lower weights.
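One common textbook definition, given here as a sketch rather than the exact formula any particular library uses, multiplies the term frequency by the log of the inverse document frequency. The symbols $D$ (the corpus) and $N$ (the number of documents in $D$) are introduced here for illustration and do not appear elsewhere in these notes:

$$
idf(t,D) = \log \frac{N}{|\{d \in D : t \in d\}|}, \qquad tfidf(t,d,D) = tf(t,d) \cdot idf(t,D)
$$

Libraries often use smoothed or normalized variants of this formula, so their numbers can differ from a direct evaluation of it.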

sklearn imports:

- CountVectorizer - creates a bag-of-words model
- TfidfTransformer - transforms a bag-of-words count matrix using TF-IDF
- TfidfVectorizer - does both steps, equivalent to CountVectorizer followed by TfidfTransformer

In [16]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

countvectorizer = CountVectorizer()
bag_of_words_sparse = countvectorizer.fit_transform([review1,review2])
bag_of_words_sparse.todense(), countvectorizer.get_feature_names_out()

pd.DataFrame(bag_of_words_sparse.todense(),columns=countvectorizer.get_feature_names_out())

| | about | book | love | no | okay | this | was |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
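This table has no `i` column, although the hand-built vocabulary earlier does. The likely reason is that `CountVectorizer`'s documented default `token_pattern` only keeps tokens of two or more word characters. The sketch below applies that pattern with the standard `re` module to show the effect; it only approximates the tokenization step and is not a substitute for the vectorizer itself.

```python
import re

# Default token pattern documented for CountVectorizer:
# a token is a run of 2 or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

review1 = "I LOVE this book about love"

# CountVectorizer lowercases by default, so lowercase before matching.
tokens = re.findall(TOKEN_PATTERN, review1.lower())
print(tokens)  # the single-character token "i" is dropped
```

To keep single-character words, a different `token_pattern` can be passed to `CountVectorizer`.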


In [17]:
# TF-IDF

tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit_transform(bag_of_words_sparse).todense()

matrix([[0.4078241 , 0.29017021, 0.81564821, 0.        , 0.        ,
         0.29017021, 0.        ],
        [0.        , 0.35520009, 0.        , 0.49922133, 0.49922133,
         0.35520009, 0.49922133]])

In [18]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit_transform([review1,review2]).todense()

matrix([[0.4078241 , 0.29017021, 0.81564821, 0.        , 0.        ,
         0.29017021, 0.        ],
        [0.        , 0.35520009, 0.        , 0.49922133, 0.49922133,
         0.35520009, 0.49922133]])
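The weights above can be reproduced by hand, which makes the transformation less of a black box. The sketch below assumes scikit-learn's documented defaults (`smooth_idf=True`, `norm="l2"`, `sublinear_tf=False`), under which the idf is $\ln\frac{1+n}{1+df(t)} + 1$ and each row is scaled to unit Euclidean length. It uses only the standard library, with the count matrix copied from the `CountVectorizer` output.

```python
import math

# Raw counts from the CountVectorizer output; columns are
# about, book, love, no, okay, this, was.
counts = [
    [1, 1, 2, 0, 0, 1, 0],
    [0, 1, 0, 1, 1, 1, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]       # tf * idf
    norm = math.sqrt(sum(v * v for v in weighted))     # L2 norm of the row
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 8) for v in row])
```

Terms that occur in both reviews (`book`, `this`) get an idf of exactly 1, so they end up with lower weights than terms unique to one review.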