## Today, we're teaching computers how to read


### What's a Corpus?
Let's start with a small corpus. A corpus is a collection of documents; ours contains two short ones.

In [2]:
docA = "the cat sat on my face"
docB = "the dog sat on my bed"

### Tokenizing
When working with text, a common starting point is the 'Bag of Words' (BOW) model. In the BOW model, each document is treated as an unordered bag of its words: word order is discarded and only the words themselves are kept.

In [3]:
bowA = docA.split(" ")
bowB = docB.split(" ")

In [4]:
bowB

['the', 'dog', 'sat', 'on', 'my', 'bed']

Splitting a document up into the component words like this is called 'tokenizing'.
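Splitting on a single space is enough for these two documents, but it would leave punctuation attached to words and treat `The` and `the` as different tokens. Here is a minimal sketch of a slightly more forgiving tokenizer, assuming lowercase ASCII words are all we care about. The `tokenize` helper is hypothetical and is not used elsewhere in this notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("The cat sat on my face.")
print(tokens)  # ['the', 'cat', 'sat', 'on', 'my', 'face']
```

On the two documents above, this gives the same tokens as `split(" ")`.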

OK, so the documents are tokenized, but how do we convert a tokenized BOW into numbers?

There are a few strategies. One simple strategy could be to create a vector of all possible words, and for each document count how many times each word appears.

In [5]:
wordSet = set(bowA).union(set(bowB))

In [6]:
wordSet

{'bed', 'cat', 'dog', 'face', 'my', 'on', 'sat', 'the'}

In [7]:
# Now I will create dictionaries to keep my word counts.
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

In [8]:
wordDictB

{'bed': 0, 'cat': 0, 'dog': 0, 'face': 0, 'my': 0, 'on': 0, 'sat': 0, 'the': 0}

In [9]:
#Now I will count the words in my bags
for word in bowA:
    wordDictA[word]+=1

for word in bowB:
    wordDictB[word]+=1

In [10]:
wordDictB

{'bed': 1, 'cat': 0, 'dog': 1, 'face': 0, 'my': 1, 'on': 1, 'sat': 1, 'the': 1}

In [11]:
wordDictA

{'bed': 0, 'cat': 1, 'dog': 0, 'face': 1, 'my': 1, 'on': 1, 'sat': 1, 'the': 1}

In [12]:
#Lastly I will stick those into a matrix.
import pandas as pd
pd.DataFrame([wordDictA, wordDictB])

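|   | bed | cat | dog | face | my | on | sat | the |
|---|-----|-----|-----|------|----|----|-----|-----|
| 0 | 0   | 1   | 0   | 1    | 1  | 1  | 1   | 1   |
| 1 | 1   | 0   | 1   | 0    | 1  | 1  | 1   | 1   |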
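The counting steps above can also be written more compactly with `collections.Counter` from the standard library. This is only an alternative sketch, not the approach the rest of the notebook relies on; the names `bags`, `vocab`, `counters`, and `count_rows` are introduced here for illustration.

```python
from collections import Counter

docA = "the cat sat on my face"
docB = "the dog sat on my bed"

# One bag of words per document, and a sorted vocabulary over both.
bags = [doc.split(" ") for doc in (docA, docB)]
vocab = sorted(set().union(*bags))

# A Counter returns 0 for words it has never seen,
# so no pre-filled dictionary is needed.
counters = [Counter(bag) for bag in bags]
count_rows = [[counter[word] for word in vocab] for counter in counters]

print(vocab)
print(count_rows)
```

Each inner list in `count_rows` is one row of the matrix shown above, with columns in the order given by `vocab`.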


## TF-IDF - A Better Strategy

Rather than just counting, we can use the TF-IDF score of a word to rank its importance.

The TF-IDF score of a word w in a document is: tfidf(w) = tf(w) * idf(w)

where tf(w) = (number of times w appears in the document) / (total number of words in the document)

and idf(w) = log(number of documents in the corpus / number of documents that contain w), using the natural logarithm.
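To make the formulas concrete, here is the arithmetic for two words of docA, worked by hand from the definitions. It assumes the natural logarithm, and the variable names are only for illustration.

```python
import math

# 'cat' appears once among the 6 words of docA, and in 1 of the 2 documents.
tf_cat = 1 / 6
idf_cat = math.log(2 / 1)
tfidf_cat = tf_cat * idf_cat

# 'the' appears in both documents, so its idf (and hence its tf-idf) is zero.
idf_the = math.log(2 / 2)

print(round(tfidf_cat, 6), idf_the)  # 0.115525 0.0
```

A word that occurs in every document carries no information for telling documents apart, which is why its score drops to zero.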


In [16]:
# Compute the term frequency (TF) of every word in one document
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

In [17]:
tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)

In [20]:
tfBowB

{'bed': 0.16666666666666666,
 'cat': 0.0,
 'dog': 0.16666666666666666,
 'face': 0.0,
 'my': 0.16666666666666666,
 'on': 0.16666666666666666,
 'sat': 0.16666666666666666,
 'the': 0.16666666666666666}

In [21]:
tfBowA

{'bed': 0.0,
 'cat': 0.16666666666666666,
 'dog': 0.0,
 'face': 0.16666666666666666,
 'my': 0.16666666666666666,
 'on': 0.16666666666666666,
 'sat': 0.16666666666666666,
 'the': 0.16666666666666666}

In [24]:
# Compute the inverse document frequency (IDF) of every word in the corpus
import math

def computeIDF(docList):
    N = len(docList)

    # Count the number of documents that contain each word w
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1

    # Divide N by the document count above, then take the natural log.
    # Every word is assumed to occur in at least one document, so val > 0.
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))

    return idfDict

In [25]:
idfs = computeIDF([wordDictA, wordDictB])

In [26]:
idfs

{'bed': 0.6931471805599453,
 'cat': 0.6931471805599453,
 'dog': 0.6931471805599453,
 'face': 0.6931471805599453,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'the': 0.0}

In [27]:
# Compute the TF-IDF score of every word in one document
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf

In [29]:
tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

In [30]:
import pandas as pd
pd.DataFrame([tfidfBowA, tfidfBowB])

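|   | bed      | cat      | dog      | face     | my  | on  | sat | the |
|---|----------|----------|----------|----------|-----|-----|-----|-----|
| 0 | 0.0      | 0.115525 | 0.0      | 0.115525 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.115525 | 0.0      | 0.115525 | 0.0      | 0.0 | 0.0 | 0.0 | 0.0 |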
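The whole pipeline (tokenize, count, TF, IDF, multiply) can be folded into one function that goes straight from raw strings to TF-IDF rows. This is a minimal sketch under the same assumptions as the notebook: space-separated tokens, the natural logarithm, and no smoothing. The `tfidf_rows` helper is hypothetical and not part of any library.

```python
import math

def tfidf_rows(docs):
    """Return (vocab, rows), where rows[i][j] is the TF-IDF of vocab[j] in docs[i]."""
    bags = [doc.split(" ") for doc in docs]
    vocab = sorted(set().union(*bags))
    n_docs = len(bags)

    # Document frequency: in how many documents each word appears.
    # Every vocab word comes from some bag, so this is always at least 1.
    doc_freq = {w: sum(w in bag for bag in bags) for w in vocab}

    rows = []
    for bag in bags:
        rows.append([
            (bag.count(w) / len(bag)) * math.log(n_docs / doc_freq[w])
            for w in vocab
        ])
    return vocab, rows

vocab, rows = tfidf_rows(["the cat sat on my face", "the dog sat on my bed"])
for row in rows:
    print([round(value, 6) for value in row])
```

The printed rows match the table above: only `cat` and `face` score above zero for the first document, and only `bed` and `dog` for the second.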
