# Term-Document Matrix

## Definition

Matrix with **N** rows and **V** columns:
* **N** is the number of documents
* **V** is the size of the vocabulary

Each row is the Bag-of-Words vector of one document in the corpus.

Each column represents a single term of the vocabulary.
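As a minimal sketch of this definition, the matrix can be built by hand for a toy corpus. The three short documents below are made up for illustration and are not taken from the story; tokenisation is a plain whitespace split.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only): N = 3 documents
docs = [
    "the cat sat",
    "the dog sat",
    "the cat saw the dog",
]

# Vocabulary: sorted unique tokens, its size is V
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One Bag-of-Words row per document: the count of each vocabulary term
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each inner list is one row (one document) and each position in it is one column (one term), so the result has N = 3 rows and V = 5 columns.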

In [1]:
import numpy as np

## Build a Term-Document Matrix

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
import requests

r = requests.get('https://sherlock-holm.es/stories/plain-text/scan.txt')

assert r.status_code == 200

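# Write and read with an explicit encoding so the result does not depend on the OS default
with open('scandal_in_bohemia.txt', 'w', encoding='utf-8') as out:
    out.write(r.content.decode('utf-8'))

# Keep only the non-empty lines
with open('scandal_in_bohemia.txt', encoding='utf-8') as f:
    lines = [txt for txt in f if len(txt.strip()) > 0]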

In [4]:
from nltk.tokenize import sent_tokenize

book = ' '.join([x.strip() for x in lines[7:]])
sentences = sent_tokenize(book)

In [5]:
corpus = sentences[:10]

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(
    stop_words='english',
    ngram_range=(1, 2),
)
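To get a feel for what `stop_words='english'` and `ngram_range=(1, 2)` do, here is a rough pure-Python sketch, not scikit-learn's actual implementation. The stop-word set is a tiny hypothetical subset, the function name is made up, and the regular expression is assumed to behave like the vectorizer's default token pattern (lowercased tokens of at least two word characters).

```python
import re

# Hypothetical tiny stop-word subset, just enough for the example sentence
STOP_WORDS = {'to', 'she', 'is', 'always', 'the'}


def unigrams_and_bigrams(text, stop_words=STOP_WORDS):
    """Return the unigrams, then the bigrams, left after stop-word removal."""
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    tokens = [t for t in tokens if t not in stop_words]
    # Bigrams are built from the filtered tokens, so they can span removed words
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


terms = unigrams_and_bigrams('To Sherlock Holmes she is always the woman.')
print(terms)
```

Because the bigrams are formed after filtering, a term such as `holmes woman` can appear even though the two words are not adjacent in the original sentence.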

In [7]:
term_doc = count.fit_transform(corpus)

In [12]:
print(vocab)

['abhorrent', 'abhorrent cold', 'actions', 'adjusted', 'adjusted temperament', 'adler', 'admirable', 'admirable things', 'admirably', 'admirably balanced', 'admit', 'admit intrusions', 'akin', 'akin love', 'balanced', 'balanced mind', 'cold', 'cold precise', 'crack', 'crack high', 'delicate', 'delicate finely', 'distracting', 'distracting factor', 'disturbing', 'disturbing strong', 'doubt', 'doubt mental', 'drawing', 'drawing veil', 'eclipses', 'eclipses predominates', 'emotion', 'emotion akin', 'emotion nature', 'emotions', 'emotions particularly', 'excellent', 'excellent drawing', 'eyes', 'eyes eclipses', 'factor', 'factor throw', 'false', 'false position', 'felt', 'felt emotion', 'finely', 'finely adjusted', 'gibe', 'gibe sneer', 'grit', 'grit sensitive', 'heard', 'heard mention', 'high', 'high power', 'holmes', 'holmes woman', 'instrument', 'instrument crack', 'introduce', 'introduce distracting', 'intrusions', 'intrusions delicate', 'irene', 'irene adler', 'lenses', 'lenses distur

In [13]:
print(corpus)

['To Sherlock Holmes she is always the woman.', 'I have seldom heard him mention her under any other name.', 'In his eyes she eclipses and predominates the whole of her sex.', 'It was not that he felt any emotion akin to love for Irene Adler.', 'All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.', 'He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position.', 'He never spoke of the softer passions, save with a gibe and a sneer.', "They were admirable things for the observer--excellent for drawing the veil from men's motives and actions.", 'But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results.', 'Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more

In [8]:
# scikit-learn >= 1.0; older versions expose get_feature_names() instead
vocab = list(count.get_feature_names_out())
print(f'Size of vocabulary: {len(vocab)}')
print(f'Size of corpus    : {len(corpus)}')

print(f'Shape of Term-Document Matrix: {term_doc.shape} - {term_doc.shape[0]} Rows x {term_doc.shape[1]} Columns')

Size of vocabulary: 139
Size of corpus    : 10
Shape of Term-Document Matrix: (10, 139) - 10 Rows x 139 Columns


## Understand Sparse Matrices

`term_doc` is a SciPy CSR (Compressed Sparse Row) matrix. This is a type of **sparse** matrix: a data structure that stores only the non-zero cells in memory. That pays off here because very few cells hold a value other than 0.

Calling `toarray()` converts it into a **dense** matrix, which uses far more memory because it allocates space for every cell of the matrix.
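The idea behind the CSR layout can be sketched in plain Python. The toy matrix below is made up; the three lists are named after the `data`, `indices` and `indptr` attributes that SciPy's CSR matrices expose, but this is an illustration of the layout, not SciPy's code.

```python
# A small dense matrix with mostly zeros (toy values)
dense = [
    [0, 0, 2, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 3, 0, 4],
]

data = []      # the non-zero values, row by row
indices = []   # the column index of each non-zero value
indptr = [0]   # indptr[i]:indptr[i + 1] delimits row i inside data/indices

for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            data.append(value)
            indices.append(col)
    indptr.append(len(data))

print(data)
print(indices)
print(indptr)

# Reading row 3 back: only its non-zero cells are visited
start, end = indptr[3], indptr[4]
row_3 = dict(zip(indices[start:end], data[start:end]))
print(row_3)
```

Only 4 values and their positions are stored instead of 16 cells, and an empty row costs a single repeated entry in `indptr`.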

In [9]:
print(f'Number of non-zero cells: {term_doc.nnz}')
print(f'Total number of cells   : {np.prod(term_doc.shape)}')
print(f'Density                 : {term_doc.nnz / np.prod(term_doc.shape):0.2f}')

Number of non-zero cells: 140
Total number of cells   : 1390
Density                 : 0.10


## View Term-Document Matrix

In [10]:
import pandas as pd

term_doc_df = pd.DataFrame(term_doc.toarray(), columns=vocab, index=corpus)

In [11]:
term_doc_df

| | abhorrent | abhorrent cold | actions | adjusted | adjusted temperament | adler | admirable | admirable things | admirably | admirably balanced | ... | things observer | throw | throw doubt | trained | trained reasoner | veil | veil men | woman | world | world seen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| To Sherlock Holmes she is always the woman. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| I have seldom heard him mention her under any other name. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| In his eyes she eclipses and predominates the whole of her sex. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| It was not that he felt any emotion akin to love for Irene Adler. | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| He never spoke of the softer passions, save with a gibe and a sneer. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| They were admirable things for the observer--excellent for drawing the veil from men's motives and actions. | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results. | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more disturbing than a strong emotion in a nature such as his. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


With a bigger corpus, the matrix density drops quickly: the vocabulary keeps growing, while each sentence still contains only a handful of terms.

In [None]:
term_doc_book = count.fit_transform(sentences)

# scikit-learn >= 1.0; older versions expose get_feature_names() instead
vocab_book = list(count.get_feature_names_out())
print(f'Size of vocabulary: {len(vocab_book)}')
print(f'Size of corpus    : {len(sentences)}')
print(f'Shape of Term-Document Matrix: {term_doc_book.shape} - {term_doc_book.shape[0]} Rows x {term_doc_book.shape[1]} Columns')

print(f'Number of non-zero cells: {term_doc_book.nnz}')
print(f'Total number of cells   : {np.prod(term_doc_book.shape)}')
print(f'Density                 : {term_doc_book.nnz / np.prod(term_doc_book.shape):0.4f}')

Size of vocabulary: 4282
Size of corpus    : 673
Shape of Term-Document Matrix: (673, 4282) - 673 Rows x 4282 Columns
Number of non-zero cells: 5960
Total number of cells   : 2881786
Density                 : 0.0021
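To put this density into perspective, here is a back-of-the-envelope memory estimate based on the numbers printed above. The byte sizes are assumptions (64-bit counts, 32-bit indices) and real overheads will differ slightly.

```python
# Numbers taken from the output above
n_docs, n_terms, nnz = 673, 4282, 5960

value_bytes = 8   # assumption: each count is stored as a 64-bit integer
index_bytes = 4   # assumption: each index is stored as a 32-bit integer

# Dense: every cell is stored
dense_bytes = n_docs * n_terms * value_bytes

# CSR: non-zero values + their column indices + one row pointer per row (plus one)
csr_bytes = nnz * value_bytes + nnz * index_bytes + (n_docs + 1) * index_bytes

density = nnz / (n_docs * n_terms)
print(f'Density     : {density:.4f}')
print(f'Dense matrix: {dense_bytes:,} bytes')
print(f'CSR matrix  : {csr_bytes:,} bytes')
print(f'Ratio       : {dense_bytes / csr_bytes:.0f}x')
```

Under these assumptions the dense version needs roughly 23 MB while the sparse one stays well under 100 KB.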


## Explore Term-Document Matrix

In which documents does a given term appear?
* Find the column for this term
* Read the row indices of its non-zero elements

In [None]:
term = 'sherlock holmes'
index_in_vocab = vocab_book.index(term)

column = term_doc_book[:, index_in_vocab]  # Still a sparse matrix, with only 1 column
print(f'Number of documents with term "{term}": {column.nnz}')

rows, _ = column.nonzero()
for r in rows:
    print(f'{column[r, 0]} times in document with index {r}: "{sentences[r]}"')

Number of documents with term "sherlock holmes": 11
1 times in document with index 0: "To Sherlock Holmes she is always the woman."
1 times in document with index 437: "The house was just such as I had pictured it from Sherlock Holmes' succinct description, but the locality appeared to be less private than I expected."
1 times in document with index 564: "He was searching his pockets for the key when someone passing said: "Good-night, Mister Sherlock Holmes.""
1 times in document with index 570: "he cried, grasping Sherlock Holmes by either shoulder and looking eagerly into his face."
1 times in document with index 599: ""Mr. Sherlock Holmes, I believe?""
1 times in document with index 606: "Sherlock Holmes staggered back, white with chagrin and surprise."
1 times in document with index 616: "The photograph was of Irene Adler herself in evening dress, the letter was superscribed to "Sherlock Holmes, Esq."
1 times in document with index 619: "It was dated at midnight of the preceding ni

Which terms appear in a given document?
* Find the row for this document
* Read the column indices of its non-zero elements

In [None]:
index = 665
document = sentences[index]

print(f'Document: "{document}"')

row = term_doc_book[index, :]    # Still a sparse matrix, with only 1 row
print(f'Number of terms in the document: {row.nnz}')

_, cols = row.nonzero()
for c in cols:
    print(f'{vocab_book[c]:<25}: {row[0, c]:>2}')


Document: "And that was how a great scandal threatened to affect the kingdom of Bohemia, and how the best plans of Mr. Sherlock Holmes were beaten by a woman's wit."
Number of terms in the document: 27
sherlock                 :  1
holmes                   :  1
woman                    :  1
sherlock holmes          :  1
bohemia                  :  1
great                    :  1
best                     :  1
scandal                  :  1
kingdom                  :  1
mr                       :  1
plans                    :  1
mr sherlock              :  1
threatened               :  1
affect                   :  1
beaten                   :  1
wit                      :  1
great scandal            :  1
scandal threatened       :  1
threatened affect        :  1
affect kingdom           :  1
kingdom bohemia          :  1
bohemia best             :  1
best plans               :  1
plans mr                 :  1
holmes beaten            :  1
beaten woman             :  1
woman wit         

# Documents Similarity Matrix

This similarity matrix is a square matrix **N x N**, where N is the number of documents in the corpus.

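* The cell at row `i` and column `j` holds the cosine similarity of the documents at indices `i` and `j`.
* `sim[i, j] = cosine(doc[i], doc[j])`
* This matrix is symmetric: `sim[i, j] = sim[j, i]`
* The diagonal is made of `1`, since every document is identical to itself. The exception is a document whose vector is all zeros (nothing left after stop-word removal), which gets `0`.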

It derives from the Term-Document Matrix.
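The properties listed above can be checked on a toy example. The sketch below computes the cosine similarity by hand, as the dot product divided by the product of the vector lengths. The count vectors and the helper name `cosine` are made up for illustration.

```python
import math

# Toy term-document matrix: 3 documents x 4 terms (hypothetical counts)
toy_docs = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 2],
]


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|). Assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# N x N matrix of all pairwise similarities
sim = [[cosine(u, v) for v in toy_docs] for u in toy_docs]

for row in sim:
    print([round(value, 3) for value in row])
```

Documents 0 and 1 share one of their two terms, so their similarity is 0.5. Document 2 shares no term with the others, so its similarity with them is 0.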

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

doc_sim = cosine_similarity(term_doc_book)

The similarity matrix is a dense matrix: a regular NumPy array.

In [None]:
print(f'Type / Shape of Documents similarity matrix: {type(doc_sim)} {doc_sim.shape}')

Type / Shape of Documents similarity matrix: <class 'numpy.ndarray'> (673, 673)


The 10 documents most similar to a given document:

In [None]:
import pandas as pd

index = 665
query = sentences[index]

print(f'Query: "{query}"')
print()

sims = pd.Series(doc_sim[index, :], index=sentences).sort_values(ascending=False)[1:11]
print(sims)

Query: "And that was how a great scandal threatened to affect the kingdom of Bohemia, and how the best plans of Mr. Sherlock Holmes were beaten by a woman's wit."

"Mr. Sherlock Holmes, I believe?"                                                                                                             0.363696
To Sherlock Holmes she is always the woman.                                                                                                   0.344265
"Well, I followed you to your door, and so made sure that I was really an object of interest to the celebrated Mr. Sherlock Holmes.           0.233380
It was dated at midnight of the preceding night and ran in this way: "My dear Mr. Sherlock Holmes: "You really did it very well.              0.200643
I leave a photograph which he might care to possess; and I remain, dear Mr. Sherlock Holmes, "Very truly yours, "Irene Norton, née Adler."    0.185185
Sherlock Holmes staggered back, white with chagrin and surprise.                

# Terms Similarity Matrix

When we transpose the Term-Document Matrix, we have terms as rows and documents as columns.

If we consider these rows as term vectors, we have a vector representation for terms.

The intuition is that similar words (words with close meanings) tend to appear in similar documents.
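A minimal sketch of the transposition, on made-up data: the three-term vocabulary and the counts are hypothetical, and `zip(*rows)` stands in for the `.T` of a real matrix.

```python
import math

terms = ['holmes', 'sherlock', 'violin']   # hypothetical toy vocabulary
doc_rows = [                               # one row per document, one column per term
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
]

# Transpose: one row per term, one column per document
term_rows = [list(column) for column in zip(*doc_rows)]
term_vectors = dict(zip(terms, term_rows))


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


print(term_vectors)
print(round(cosine(term_vectors['holmes'], term_vectors['sherlock']), 4))
print(round(cosine(term_vectors['sherlock'], term_vectors['violin']), 4))
```

Terms that occur in the same documents get a high score, and terms that never share a document get 0.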

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

term_sim = cosine_similarity(term_doc_book.T)

In [None]:
print(f'Type / Shape of Terms similarity matrix: {type(term_sim)} {term_sim.shape}')

Type / Shape of Terms similarity matrix: <class 'numpy.ndarray'> (4282, 4282)


In [None]:
import pandas as pd

query = 'sherlock'
index = vocab_book.index(query)

column = term_doc_book[:, index]  # Still a sparse matrix, with only 1 column
print(f'Query: "{query}" appears in {column.nnz} documents')
print()

sims = pd.Series(term_sim[index, :], index=vocab_book).sort_values(ascending=False)[1:11]
print(sims)

Query: "sherlock" appears in 12 documents

sherlock holmes    0.957427
mr sherlock        0.645497
mr                 0.481125
holmes             0.458333
dear mr            0.408248
dress letter       0.288675
best plans         0.288675
holmes esq         0.288675
said good          0.288675
remain dear        0.288675
dtype: float64
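These scores can be reproduced from document counts alone. The sketch below assumes that every count involved is 1 (as in the occurrences printed earlier) and that every document containing the bigram `sherlock holmes` also contains `sherlock`. Under those assumptions, the cosine similarity of two terms reduces to the number of shared documents divided by the square root of the product of their document counts.

```python
import math

docs_with_sherlock = 12          # printed above
docs_with_sherlock_holmes = 11   # printed in the term exploration earlier

# Assumption: all counts are 1, and the bigram never occurs without 'sherlock'
shared = docs_with_sherlock_holmes
score_bigram = shared / math.sqrt(docs_with_sherlock * docs_with_sherlock_holmes)

# A term that occurs in a single document, which also contains 'sherlock'
score_single = 1 / math.sqrt(docs_with_sherlock * 1)

print(f'{score_bigram:.6f}')
print(f'{score_single:.6f}')
```

The second value explains the long tail of identical `0.288675` scores: they would all be terms that occur once in the whole story, in a sentence that also mentions `sherlock`. This is a weak signal of similarity, so such scores deserve caution on a corpus this small.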
