**Data preprocessing**

**1. TF-IDF**

- The TF-IDF representation of a document $d$ in a corpus (set of documents) $D$:

$r_d = [\text{tf-idf}(w_1, d, D), \text{tf-idf}(w_2, d, D), ..., \text{tf-idf}(w_{|V|}, d, D)]$

where $r_d \in \mathbb{R}^{|V|}$ is a $|V|$-dimensional vector and $V = \{w_i\}$ is the vocabulary of $D$ (the set of words that occur in $D$).

- Each component is defined as:

$\text{tf-idf}(w_i, d, D) = tf(w_i, d) \cdot idf(w_i, D)$

with

$tf(w_i, d) = \dfrac{f(w_i, d)}{\max\{f(w_j, d): w_j \in V\}}$

$idf(w_i, D) = \log_{10}\dfrac{|D|}{|\{d' \in D: w_i \in d'\}|}$

Here $f(w_i, d)$ is the number of times $w_i$ occurs in $d$.
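The definitions above can be sketched directly in plain Python. This is a minimal illustration rather than a reference implementation: the toy corpus and the names `tf`, `idf`, and `tfidf_vector` are assumptions made for this example, and each document is assumed to be non-empty and already tokenised.

```python
import math
from collections import Counter

def tf(word, doc_tokens):
    """Count of `word` divided by the count of the most frequent word in the document."""
    counts = Counter(doc_tokens)
    return counts[word] / max(counts.values())

def idf(word, corpus):
    """log10 of (number of documents / number of documents containing `word`)."""
    doc_freq = sum(1 for doc_tokens in corpus if word in doc_tokens)
    return math.log10(len(corpus) / doc_freq)

def tfidf_vector(doc_tokens, corpus, vocabulary):
    """One tf-idf component per vocabulary word, in vocabulary order."""
    return [tf(w, doc_tokens) * idf(w, corpus) for w in vocabulary]

# Hypothetical toy corpus, already tokenised.
corpus = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
]
vocabulary = sorted({w for doc in corpus for w in doc})
r_0 = tfidf_vector(corpus[0], corpus, vocabulary)
print(dict(zip(vocabulary, r_0)))
```

In this toy corpus `banana` occurs in both documents, so its idf is $\log_{10}(2/2) = 0$ and its component vanishes, while `apple` keeps the weight $1 \cdot \log_{10} 2$.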

- Building the vocabulary $V$:

  - For each document $d$ in $D$:
    - Split $d$ into words at punctuation (and whitespace) to obtain $W_d$
    - Remove stop words from $W_d$
    - Reduce the remaining words to their root form (stemming)
    - The result is the final $W_d$
  - Finally:
    $V = \bigcup_{d \in D} W_d$, the union of the $W_d$ over all $d \in D$
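The vocabulary-building steps can be sketched as follows. Everything named here is an assumption for illustration: `STOP_WORDS` is a deliberately tiny hand-written list, and `toy_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's. The per-document word sets are combined with a union, which matches the definition of $V$ as the set of words that occur in $D$.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"is", "the", "of", "for", "a", "an", "and"}

def toy_stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def words_of(document):
    """Return W_d: the stemmed, stop-word-free set of words in one document."""
    tokens = re.findall(r"\w+", document.lower())  # split on punctuation and whitespace
    return {toy_stem(t) for t in tokens if t not in STOP_WORDS}

def build_vocabulary(corpus):
    """V is the union of W_d over every document d in D."""
    vocabulary = set()
    for document in corpus:
        vocabulary |= words_of(document)
    return vocabulary

corpus = [
    "Data Science is the sexiest job of the 21st century",
    "machine learning is the key for data science",
]
V = build_vocabulary(corpus)
print(sorted(V))
```

With this toy stemmer, `learning` is reduced to `learn`; the other words in these two sentences are left unchanged.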

**Example Demo**

In [4]:
import pandas as pd
import sklearn as skl
import math

In [1]:
# Create data

first_sentence = "Data Science is the sexiest job of the 21st century"
second_sentence = "machine learning is the key for data science"

first_sentence = first_sentence.lower()
second_sentence = second_sentence.lower()

first_sentence = first_sentence.split(" ")
second_sentence = second_sentence.split(" ")
total = set(first_sentence).union(set(second_sentence))

print(first_sentence)
print(second_sentence)
print(total)

['data', 'science', 'is', 'the', 'sexiest', 'job', 'of', 'the', '21st', 'century']
['machine', 'learning', 'is', 'the', 'key', 'for', 'data', 'science']
{'the', 'sexiest', '21st', 'century', 'machine', 'learning', 'for', 'job', 'data', 'is', 'key', 'of', 'science'}


In [2]:
# Filter sentence by stopwords

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

ft_first_sentence = [w for w in first_sentence if w not in stop_words]
ft_second_sentence = [w for w in second_sentence if w not in stop_words]
ft_total = set(ft_first_sentence).union(set(ft_second_sentence))

print(ft_first_sentence)
print(ft_second_sentence)
print(ft_total)

['data', 'science', 'sexiest', 'job', '21st', 'century']
['machine', 'learning', 'key', 'data', 'science']
{'sexiest', '21st', 'century', 'machine', 'learning', 'job', 'data', 'key', 'science'}


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [5]:
# Create DataFrame

wordDictA = dict.fromkeys(ft_total, 0) 
wordDictB = dict.fromkeys(ft_total, 0)

for word in ft_first_sentence:
    wordDictA[word] += 1
for word in ft_second_sentence:
    wordDictB[word] += 1

data = pd.DataFrame([wordDictA, wordDictB])
data

   sexiest  21st  century  machine  learning  job  data  key  science
0        1     1        1        0         0    1     1    0        1
1        0     0        0        1         1    0     1    1        1


In [6]:
# tf section

def computeTF(wordDict):
    # Normalise each count by the count of the most frequent word in the document
    maxCount = max(wordDict.values())
    tfDict = {}
    for word, count in wordDict.items():
        tfDict[word] = count / maxCount
    return tfDict

tf_A = computeTF(wordDictA)
tf_B = computeTF(wordDictB)

tf = pd.DataFrame([tf_A, tf_B])
tf                                                            

   sexiest  21st  century  machine  learning  job  data  key  science
0      1.0   1.0      1.0      0.0       0.0  1.0   1.0  0.0      1.0
1      0.0   0.0      0.0      1.0       1.0  0.0   1.0  1.0      1.0


In [7]:
# idf section

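def computeIDF(data):
  N = data.shape[0]  # number of documents |D|
  idfDict = {}
  for word in data.columns:
    docCount = (data[word] != 0).sum()  # number of documents containing the word
    # No +1 smoothing in the denominator: with N = 2, a word found in both
    # documents would get log10(2/3) < 0
    idfDict[word] = math.log10(N / float(docCount))
  return idfDict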

idf = computeIDF(data)
idf

{'21st': 0.3010299956639812,
 'century': 0.3010299956639812,
 'data': 0.0,
 'job': 0.3010299956639812,
 'key': 0.3010299956639812,
 'learning': 0.3010299956639812,
 'machine': 0.3010299956639812,
 'science': 0.0,
 'sexiest': 0.3010299956639812}
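
As a check on the values above, here is the idf formula worked by hand with $|D| = 2$, for a word found in only one sentence (`sexiest`) and a word found in both (`data`):

$idf(\text{sexiest}, D) = \log_{10}\dfrac{2}{1} \approx 0.30103$

$idf(\text{data}, D) = \log_{10}\dfrac{2}{2} = 0$

A word shared by every document therefore gets weight zero in the tf-idf vector, whatever its tf.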

In [8]:
# tf-idf section

def computeTFIDF(tfWordDict, idfs):
  tfidf = {}
  for word, val in tfWordDict.items():
    tfidf[word] = val * idfs[word]
  return tfidf

idfs = computeIDF(data)
tfidf_A = computeTFIDF(tf_A, idfs)
tfidf_B = computeTFIDF(tf_B, idfs)

tfidf = pd.DataFrame([tfidf_A, tfidf_B])
tfidf

   sexiest     21st  century  machine  learning      job     data      key  science
0  0.30103  0.30103  0.30103      0.0       0.0  0.30103      0.0      0.0      0.0
1      0.0      0.0      0.0  0.30103   0.30103      0.0      0.0  0.30103      0.0


In [9]:
# Shortcut: the same pipeline in a few lines with sklearn

from sklearn.feature_extraction.text import TfidfVectorizer

vectorize = TfidfVectorizer(stop_words='english')

first_sentence = "Data Science is the sexiest job of the 21st century"
second_sentence = "machine learning is the key for data science"

response = vectorize.fit_transform([first_sentence, second_sentence])

print(response)

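# Each entry is "(x, y) value": x is the document index and y is the column index
# into vectorize.get_feature_names_out() (get_feature_names() in scikit-learn releases before 1.0)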

  (0, 1)	0.4466561618018052
  (0, 0)	0.4466561618018052
  (0, 3)	0.4466561618018052
  (0, 8)	0.4466561618018052
  (0, 7)	0.31779953783628945
  (0, 2)	0.31779953783628945
  (1, 4)	0.4992213265230509
  (1, 5)	0.4992213265230509
  (1, 6)	0.4992213265230509
  (1, 7)	0.35520008546852583
  (1, 2)	0.35520008546852583
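
These numbers differ from the hand-computed table because `TfidfVectorizer` uses a different formula by default: raw counts for tf, a smoothed natural-log idf, $\ln\dfrac{1 + |D|}{1 + df(w)} + 1$, and L2 normalisation of each document vector. The sketch below reproduces the printed values in plain Python under those defaults. The token lists are assumed to match what the vectorizer keeps after its English stop-word filter, and `sklearn_style_tfidf` is a name chosen for this illustration, not a library function.

```python
import math
from collections import Counter

# Tokens assumed to remain after scikit-learn's English stop-word filtering
# (the same words as the filtered lists produced earlier in this demo).
docs = [
    ["data", "science", "sexiest", "job", "21st", "century"],
    ["machine", "learning", "key", "data", "science"],
]
n_docs = len(docs)
# Number of documents that contain each word.
doc_freq = Counter(word for doc in docs for word in set(doc))

def sklearn_style_tfidf(doc):
    """Raw-count tf times smoothed natural-log idf, then L2 normalisation."""
    counts = Counter(doc)
    weights = {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }
    norm = math.sqrt(sum(value ** 2 for value in weights.values()))
    return {word: value / norm for word, value in weights.items()}

for doc in docs:
    print({word: round(value, 4) for word, value in sklearn_style_tfidf(doc).items()})
```

Because of the `+ 1` term, words that occur in every document (`data`, `science`) keep a non-zero weight here, unlike in the hand-computed table.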
