
# Document-term matrix generation

In this exercise, you will generate a document-term matrix from a list of preprocessed documents.
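
To make the target structure concrete, here is a minimal sketch on a toy corpus. The names `toy_docs`, `toy_vocabulary`, and `toy_matrix` are hypothetical and not part of the exercise data; the vocabulary order is chosen by hand, since any order is acceptable.

```python
# Hypothetical toy corpus: two tokenized documents.
toy_docs = [["red", "blue", "red"], ["blue", "green"]]

# One column per distinct term; the order here is an arbitrary choice.
toy_vocabulary = ["red", "blue", "green"]

# Row i holds the counts for toy_docs[i]; column j counts toy_vocabulary[j].
toy_matrix = [[doc.count(term) for term in toy_vocabulary] for doc in toy_docs]

print(toy_matrix)  # [[2, 1, 0], [0, 1, 1]]
```

The first row says that `"red"` occurs twice and `"blue"` once in the first document, while `"green"` does not occur in it at all.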

In [2]:
!pip install ipytest

In [3]:
from typing import List, Tuple
import ipytest
import pytest

ipytest.autoconfig()

Input documents are given as lists of tokenized terms.

In [8]:
DOCUMENTS = [
    ["aaa", "bbb", "ccc"],
    ["eee", "fff"],
    ["aaa", "eee", "aaa", "ccc", "fff", "fff", "ggg", "aaa"],
    ["bbb", "bbb", "bbb"],
    ["ggg", "fff", "ccc", "aaa", "ccc"],
]

Your task is to complete this function:

# Solution 1

In [30]:
def get_doc_term_matrix(docs: List[List[str]]) -> Tuple[List[List[int]], List[str]]:
    """Generates a document-term matrix and the corresponding vocabulary.
    
    Args:
        docs: List of documents, each given by a list of tokenized terms.
        
    Returns:
        Tuple consisting of the document-term matrix and the corresponding vocabulary.
        In the document-term matrix row `i` corresponds to `docs[i]` and column `j`
        corresponds to the jth element of the vocabulary. Values represent the number
        of times the term appears in the document.
        Terms may be in any order in the vocabulary.
    """
    # First pass: collect every distinct term, in order of first appearance.
    vocabulary = []
    for doc in docs:
        for word in doc:
            if word not in vocabulary:
                vocabulary.append(word)

    # Second pass: build one count vector per document.
    doc_term_matrix = []
    for doc in docs:
        vector = [0] * len(vocabulary)
        for word in doc:
            for i, voc_word in enumerate(vocabulary):
                if word == voc_word:
                    vector[i] += 1
                    break  # vocabulary terms are unique, so stop at the first match
        doc_term_matrix.append(vector)
    return doc_term_matrix, vocabulary

# Solution 2

In [37]:
def get_doc_term_matrix(docs: List[List[str]]) -> Tuple[List[List[int]], List[str]]:
    """Generates a document-term matrix and the corresponding vocabulary.
    
    Args:
        docs: List of documents, each given by a list of tokenized terms.
        
    Returns:
        Tuple consisting of the document-term matrix and the corresponding vocabulary.
        In the document-term matrix row `i` corresponds to `docs[i]` and column `j`
        corresponds to the jth element of the vocabulary. Values represent the number
        of times the term appears in the document.
        Terms may be in any order in the vocabulary.
    """
    vocabulary = []  # all distinct words found across all documents
    doc_term_matrix = []
    vector = []  # one dictionary per document, mapping each word (key) to its count (value)

    for doc in docs:  # visit each document
        dictionary = {}  # word (key) -> number of occurrences in this document (value)
        for word in doc:  # visit each word of the current document
            if word in dictionary:  # the word was already seen in this document
                dictionary[word] += 1  # increment its count
            else:
                dictionary[word] = 1  # first occurrence of the word in this document
            if word not in vocabulary:
                vocabulary.append(word)
        vector.append(dictionary)

    for doc in vector:
        list_vec = []  # row of the matrix for this document
        for word in vocabulary:
            if word in doc:
                list_vec.append(doc[word])
            else:
                list_vec.append(0)
        doc_term_matrix.append(list_vec)
    return doc_term_matrix, vocabulary
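
The per-document dictionaries in Solution 2 do the same job as `collections.Counter` from the standard library. The following is a sketch of that variant, not a third official solution; the function name `doc_term_matrix_with_counter` and the `sample_docs` list are hypothetical names introduced for illustration.

```python
from collections import Counter
from typing import List, Tuple


def doc_term_matrix_with_counter(
    docs: List[List[str]],
) -> Tuple[List[List[int]], List[str]]:
    # dict.fromkeys drops duplicates while keeping first-seen order.
    vocabulary = list(dict.fromkeys(term for doc in docs for term in doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)  # term -> number of occurrences in this document
        # A Counter returns 0 for missing keys, so no explicit check is needed.
        matrix.append([counts[term] for term in vocabulary])
    return matrix, vocabulary


# Hypothetical sample input, smaller than DOCUMENTS.
sample_docs = [["aaa", "bbb", "ccc"], ["eee", "fff"], ["aaa", "eee", "aaa"]]
matrix, vocabulary = doc_term_matrix_with_counter(sample_docs)
print(vocabulary)  # ['aaa', 'bbb', 'ccc', 'eee', 'fff']
print(matrix)      # [[1, 1, 1, 0, 0], [0, 0, 0, 1, 1], [2, 0, 0, 1, 0]]
```

Building the vocabulary through a dictionary also avoids the linear `word not in vocabulary` scan over a list, which may matter for large vocabularies.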
      


Tests.

In [38]:
%%ipytest

def test_num_docs():
    doc_term_matrix, _ = get_doc_term_matrix(DOCUMENTS)
    assert len(doc_term_matrix) == len(DOCUMENTS)
    
def test_vocabulary():
    _, vocabulary = get_doc_term_matrix(DOCUMENTS)
    assert set(vocabulary) == {"aaa", "bbb", "ccc", "eee", "fff", "ggg"}
    
def test_term_counts():
    doc_term_matrix, vocabulary = get_doc_term_matrix(DOCUMENTS)
    idx_aaa = vocabulary.index("aaa")
    idx_ccc = vocabulary.index("ccc")
    idx_fff = vocabulary.index("fff")
    assert doc_term_matrix[0][idx_aaa] == 1
    assert doc_term_matrix[0][idx_ccc] == 1
    assert doc_term_matrix[0][idx_fff] == 0
    assert doc_term_matrix[2][idx_aaa] == 3
    assert doc_term_matrix[2][idx_ccc] == 1
    assert doc_term_matrix[2][idx_fff] == 2

...                                                                                          [100%]
3 passed in 0.03s

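To inspect a result by eye, the matrix can be printed with the vocabulary as a header row. This is only a sketch: `format_doc_term_matrix` is a hypothetical helper, and the small `example_matrix` and `example_vocabulary` stand in for the values returned by `get_doc_term_matrix`.

```python
from typing import List


def format_doc_term_matrix(matrix: List[List[int]], vocabulary: List[str]) -> str:
    # Header: a "doc" label followed by one right-aligned column per term.
    header = "doc  " + " ".join(f"{term:>4}" for term in vocabulary)
    # One line per document: its index, then its term counts.
    rows = [
        f"{i:<4} " + " ".join(f"{count:>4}" for count in row)
        for i, row in enumerate(matrix)
    ]
    return "\n".join([header, *rows])


# Hypothetical stand-in values; in the notebook these would come from
# get_doc_term_matrix(DOCUMENTS).
example_vocabulary = ["aaa", "bbb", "ccc"]
example_matrix = [[1, 1, 1], [0, 3, 0]]
table = format_doc_term_matrix(example_matrix, example_vocabulary)
print(table)
```

The fixed column width of 4 assumes short terms such as those in `DOCUMENTS`; longer terms would need a wider format.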