# **Data Representation - From words to numbers**

**Term Document Matrix:**

Document 1: "The dog is a nice dog"

Document 2: "The ant is no dog"


Algorithm:

```
1. Assign each unique word (= type) in the corpus an index:
      the -> 0
      dog -> 1
      is -> 2
      a -> 3
      nice -> 4
      ant -> 5
      no -> 6

The set of unique words is referred to as the vocabulary.
```



```
2. Create a matrix filled with zeros with the following dimensions:
      rows = number of documents
      columns = size of the vocabulary

Matrix:
            the [0]  dog [1]  is [2]  a [3]  nice [4]  ant [5]  no [6]
Doc 1  [0]       0        0       0      0         0        0        0
Doc 2  [1]       0        0       0      0         0        0        0
```

```
3. Iterate over the words in each document and increment the value
in the matrix at position (docIndex, wordIndex) by one:
"the dog is a nice dog"
=>

            the [0]  dog [1]  is [2]  a [3]  nice [4]  ant [5]  no [6]
Doc 1  [0]       1        2       1      1         1        0        0
Doc 2  [1]       0        0       0      0         0        0        0


"the ant is no dog"
=>

            the [0]  dog [1]  is [2]  a [3]  nice [4]  ant [5]  no [6]
Doc 1  [0]       1        2       1      1         1        0        0
Doc 2  [1]       1        1       1      0         0        1        1
```

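The resulting matrix is a so-called "term-document" matrix. In the layout used here, each row represents one document and each column one vocabulary term (this orientation is also called a document-term matrix).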
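
The three steps above can be sketched with plain Python lists before moving to tensors. This is a minimal illustration, not the notebook's tensor-based implementation; the names `vocabularyIndex` and `countMatrix` are chosen here for the example only.

```python
# Tokenized example documents (tokenization itself is omitted here).
documents = [
    ["the", "dog", "is", "a", "nice", "dog"],
    ["the", "ant", "is", "no", "dog"],
]

# Step 1: give every unique word an index, in order of first appearance.
vocabularyIndex = {}
for document in documents:
    for token in document:
        if token not in vocabularyIndex:
            vocabularyIndex[token] = len(vocabularyIndex)

# Step 2: one row per document, one column per vocabulary entry, all zeros.
countMatrix = [[0] * len(vocabularyIndex) for _ in documents]

# Step 3: increment the cell (docIndex, wordIndex) once per occurrence.
for docIndex, document in enumerate(documents):
    for token in document:
        countMatrix[docIndex][vocabularyIndex[token]] += 1

print(vocabularyIndex)
for row in countMatrix:
    print(row)
```

The two printed rows should match the final matrix in step 3 of the walkthrough.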



In [None]:
import torch
def getDocMatrix():
    # The documents are represented as a nested list (steps such as tokenization are omitted for simplicity).

    documents = [["the","dog","is","a","nice","dog"],["the","ant","is","no","dog"]]

    # determine the number of documents
    nrDocuments = len(documents)

    # vocabulary = set(sum(documents, []))  # simple, but implicit (and a set has no fixed order)
    # Determine the vocabulary explicitly instead:

    #init empty vocab
    vocabulary = []

    # iterate over documents
    for document in documents:
        # iterate over words in document
        for token in document:
            # check if a token is in the vocabulary; if not, add it
            if token not in vocabulary:
                vocabulary.append(token)
    # print the vocabulary as a list
    print (vocabulary)

    #determine the size of the vocabulary
    vocSize = len(vocabulary)
    print (vocSize)

    #create a range of indices needed (i.e. the numbers 0-6)
    indices = range(len(vocabulary))
    print (indices)

    #create a dictionary out of the vocabulary and the range of indices
    vocabularyIndex = dict(zip(vocabulary, indices))
    print (vocabularyIndex)

    #initialize the term document matrix with zeros
    docMatrix = torch.zeros((nrDocuments,vocSize))

    for i, document in enumerate(documents):
        print (f"\nProcessing document {i} with content {document}")

        # replace the words in the current document by their indices and turn the list into a tensor
        tmp = torch.tensor([vocabularyIndex[token] for token in document])
        print (f"\nDocument {i} as type indices from the vocabularyIndex: ")
        print (tmp) # document with words replaced by indices
        print (f"The shape of the tmp tensor is: {tmp.shape} and its number of dimensions is: {tmp.dim()}")

        # Build a one-hot matrix with one row per token of the document
        # and one column per vocabulary entry, then set the column of
        # each token to 1. This is a very sparse representation.

        # number of rows (tokens) and columns (vocabulary entries)
        nrTokens, cols = tmp.shape[0], vocSize
        oneHot = torch.zeros((nrTokens, cols)) # intermediate matrix where each row represents one token

        rows = torch.arange(nrTokens)
        print (f"\nDocument {i} row indices (tokens):")
        print (rows)
        oneHot[rows, tmp] = 1 # for every token (row), set the column of its vocabulary index (tmp) to 1

        print (f"\nDocument {i} will have {nrTokens} rows and {cols} columns in the one-hot encoded matrix.")
        print (oneHot)
        print (f"The shape of the oneHot tensor is: {oneHot.shape} and its number of dimensions is: {oneHot.dim()}")


        # Sum the one-hot rows to obtain the document vector (one count per vocabulary entry)
        docVector = torch.sum(oneHot, 0) # dim 0 sums over the rows, like axis=0 in NumPy
        print (f"\nThe document vector for document {i} is:")
        print (docVector)

        #assign the document vector to the respective row in the document matrix
        docMatrix[i] = docVector
    print ("\nFinally, this is the term-document matrix:")
    return docMatrix
docMatrix = getDocMatrix()
print (docMatrix)

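**TF-IDF:** a weighting that takes rare words and text length into account

Side effect: dividing the counts by document length normalizes them, so documents of different length get values in a comparable range.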
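
Written out, the computation in the following cell corresponds to the formulas below. The symbols $n_{t,d}$ (count of term $t$ in document $d$, i.e. an entry of `docMatrix`) and $N$ (number of documents) are notation introduced here for illustration. The idf shown is the variant that this notebook's code evaluates; other definitions of idf are in common use.

$$
\begin{aligned}
\mathrm{tf}(t,d) &= \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \\
\mathrm{df}(t) &= \left|\{\, d : n_{t,d} > 0 \,\}\right| \\
\mathrm{idf}(t) &= \log\!\left(\frac{N}{\mathrm{df}(t)} + 1\right) \\
\text{tf-idf}(t,d) &= \mathrm{tf}(t,d)\cdot \mathrm{idf}(t)
\end{aligned}
$$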

In [None]:
print ("Term-Document Matrix")
print (docMatrix)

# Word frequency for whole corpus:
wf = torch.sum(docMatrix,0) # sum over rows (documents)
print ("\nHow often does a token occur in the corpus:")
print (wf)

# Nr. of words per document
wc = torch.sum(docMatrix,1) # sum over columns (tokens) per document
print ("\nHow many words are in the respective documents:")
print (wc)

# term frequency
tf = docMatrix / wc.unsqueeze(1) # divide each row by its document length (broadcast over the columns)
print ("\nNormalise token frequency by document length:")
print (tf)

# document frequency
# Counts the non-zero entries per column (token) of docMatrix,
# so the result has shape (vocSize,)
df = torch.count_nonzero(docMatrix,0)
print ("\nIn how many documents of the corpus does a token occur:")
print (df)

# inverse document frequency
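# shape, number of documents and number of dimensions of docMatrix
print (docMatrix.shape, docMatrix.shape[0], docMatrix.dim())
# Evaluates as (N / df) + 1: the +1 is added after the division, so it keeps the
# logarithm below strictly positive. It does not guard against division by zero;
# df is never 0 here because every vocabulary token occurs in at least one document.
idf = (docMatrix.shape[0] / df) + 1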
print ("\nInverse of the document frequency:")
print (idf)
#log idf
logIdf = torch.log(idf) #taking the logarithm
print ("\nLog inverse document frequency: (specificity of a term)")
print (logIdf)

#tf-idf
tfidf = tf*logIdf
print ("\nTF-IDF - Term Frequency * Term Specificity")
print (tfidf)
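
To make the numbers in the output easier to check, two entries of the first document can be recomputed by hand. The sketch below uses only the standard library and assumes the same idf variant as the cell above, log(N / df + 1); the variable names are illustrative.

```python
import math

nrDocuments = 2

# "dog" in document 1: 2 of 6 tokens, and it occurs in both documents.
tfDog = 2 / 6
idfDog = math.log(nrDocuments / 2 + 1)   # log(2)
tfidfDog = tfDog * idfDog

# "nice" in document 1: 1 of 6 tokens, and it occurs in one document only.
tfNice = 1 / 6
idfNice = math.log(nrDocuments / 1 + 1)  # log(3)
tfidfNice = tfNice * idfNice

print(f"dog:  tf={tfDog:.4f}  idf={idfDog:.4f}  tf-idf={tfidfDog:.4f}")
print(f"nice: tf={tfNice:.4f}  idf={idfNice:.4f}  tf-idf={tfidfNice:.4f}")
```

Although "dog" occurs in every document, it still receives a non-zero weight because of the +1 inside the logarithm. With the plain log(N / df), its idf would be log(1) = 0.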