## TF-IDF

TF-IDF (Term Frequency - Inverse Document Frequency) is a statistical measure of how relevant a word is to a document within a collection of documents. It is a feature extraction technique commonly used in text mining and information retrieval. The score is the product of two metrics: how often a word appears in a document (term frequency), and the inverse document frequency of the word across the collection.
- Term Frequency (TF): This measures how frequently a term occurs in a document. Because documents differ in length, a term is likely to appear many more times in a long document than in a short one. The raw count is therefore often divided by the document length (the total number of terms in the document) as a normalization.
$$TF(t) = \frac{\text{Number of times term t appears in a document}}{\text{Total number of terms in the document}}$$
- Inverse Document Frequency (IDF): This measures how informative a term is. TF alone treats all terms as equally important. However, certain terms, such as "is", "of", and "that", appear very often but carry little meaning. We therefore weigh down the frequent terms and scale up the rare ones by computing:
$$IDF(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with term t in it}}\right)$$
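Here the total number of documents (often written $N$) is the number of documents in the corpus, and the TF-IDF score of a term in a document is the product $TF(t) \times IDF(t)$.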
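
The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used later in the notebook: the function names `tf`, `idf`, `tf_idf` and the English `toy_corpus` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` divided by the document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency: log(N / number of documents containing `term`).

    Assumes `term` occurs in at least one document; otherwise this divides by zero.
    """
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF score of `term` in one document of the corpus."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hypothetical toy corpus: three already-tokenized documents
toy_corpus = [
    "the room is clean".split(),
    "the breakfast is good".split(),
    "the staff is friendly and the room is quiet".split(),
]

# "the" appears in every document, so its IDF (and TF-IDF) is 0
score_the = tf_idf("the", toy_corpus[0], toy_corpus)
# "clean" appears in only one document, so it gets the highest weight in document 0
score_clean = tf_idf("clean", toy_corpus[0], toy_corpus)
```

Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF by default, $\ln\frac{1 + N}{1 + df(t)} + 1$, works with raw term counts, and L2-normalizes each document vector, so its values will not match this sketch exactly.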


In [7]:
import re
import string
import pandas as pd

df = pd.read_csv("../data/AERA02_AptitudeAssessment_Dataset_NLP_cleaned_vi.csv")
df.fillna("", inplace=True)

with open("../data/external/stopwords.txt", encoding="utf-8") as f:
    # One stopword per line; a set keeps the membership test in process_text fast
    stopwords = {line.strip() for line in f}

def process_text(text):
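    text = re.sub(r"&#\d+;", "", text)        # drop numeric HTML entities
    text = re.sub(r"<.*?>", "", text)         # drop HTML tags
    text = re.sub(r"https?://\S+", "", text)  # drop URLs anywhere in the text, while "/" is still intact
    text = re.sub(r"[/-]", " ", text)         # treat "/" and "-" as word separators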
    text = "".join([i for i in text if i not in string.punctuation + "…"])
    text = text.lower()
    text = " ".join([word for word in text.split() if word not in stopwords])
    return text

def process_corpus(corpus):
    _WORD_SPLIT = re.compile("([.,!?\"/':;)(])")
    def basic_tokenizer(sentence):
        words = []
        for space_separated_fragment in sentence.strip().split():
            words.extend(_WORD_SPLIT.split(space_separated_fragment))
        return [w.lower() for w in words if w != '' and w != ' ' and w not in string.punctuation]
    
    corpus = corpus.replace("\n", " ").split(" ")
    
    return " ".join(basic_tokenizer(" ".join(corpus)))

In [9]:
sentences = " ".join(df["title2review"].apply(process_text).tolist())

In [3]:
sentences[:50]

'trải nghiệm tốt đầy đủ dịch vụ tiện nghi ăn sáng b'

In [5]:
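from sklearn.feature_extraction.text import TfidfVectorizer

# One document per review: IDF is based on document counts, so fitting on a
# single concatenated string would give every term the same IDF.
docs = df["title2review"].apply(process_text).tolist()
tf = TfidfVectorizer(max_features=3000, sublinear_tf=True)
X = tf.fit_transform(docs)  # sparse matrix with one row per review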
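
The two vectorizer options used above, `max_features` and `sublinear_tf`, can be illustrated with a small pure-Python sketch. As documented for scikit-learn, `max_features` keeps only the terms with the highest frequency across the corpus, and `sublinear_tf=True` replaces a raw count `tf` with `1 + log(tf)`. The helper names `top_k_vocabulary` and `sublinear_tf_weight` and the sample data are hypothetical, and tie-breaking between equally frequent terms may differ from the library.

```python
import math
from collections import Counter

def top_k_vocabulary(docs_tokens, k):
    """Keep the k terms with the highest total count across all documents."""
    counts = Counter(tok for doc in docs_tokens for tok in doc)
    return [term for term, _ in counts.most_common(k)]

def sublinear_tf_weight(count):
    """Dampen raw counts: 1 + ln(count) for count > 0, and 0 for absent terms."""
    return 1.0 + math.log(count) if count > 0 else 0.0

# Hypothetical tokenized documents
sample_docs = [["a", "b", "a"], ["a", "c", "b"]]
vocab = top_k_vocabulary(sample_docs, 2)

# A term seen 100 times weighs only a few times more than a term seen once
weights = [sublinear_tf_weight(c) for c in (0, 1, 10, 100)]
```

Sublinear scaling is useful when a few very frequent words would otherwise dominate a document's vector.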



In [10]:
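import numpy as np
from sklearn.preprocessing import Binarizer

# NOTE: y_score is not defined in this section. It is assumed to be a 2D
# array-like of numeric scores, shape (n_samples, 1), since Binarizer needs 2D input.
binarizer = Binarizer(threshold=6)  # scores > 6 become 1, all others 0
y = np.asarray(binarizer.fit_transform(y_score)).ravel()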

NameError: name 'y_score' is not defined