# Vector Space Model of Text

This notebook builds a simple term-document matrix and uses it to show word similarity.

In [1]:
from csv import QUOTE_NONE
import pandas as pd
import numpy as np

# https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

pd.set_option("display.precision", 4)  # for display only: show floats to 4 decimal places

In [2]:
# NLTK setup: uncomment and run the first time you use NLTK
#import nltk
#nltk.download('punkt')
#nltk.download('stopwords')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

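# use a new name so the imported `stopwords` module is not shadowed (re-running the cell would otherwise fail)
english_stopwords = set(stopwords.words('english'))
print(" ".join(english_stopwords))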

our down haven after until ll they few there she didn't shouldn too t have mustn had needn't won't into you'll to them these aren't mustn't so ourselves which having is or wouldn't nor did then hers against when shan other o how will i with her an being by do own now before just why hasn't on more again than most if should won my yourselves but no under ve don re it some can below their its s shan't should've me we through very the where up ours be are because mightn of isn't she's as while here couldn doesn ain you don't yours theirs same both a you've was at your does m aren off in itself whom ma for themselves between herself his during once hadn needn wouldn what that not couldn't didn only weren haven't you're about d out each doesn't further those yourself above hasn mightn't weren't been all any shouldn't and this myself you'd such isn over doing him am y that'll who it's has were wasn hadn't wasn't himself he from


## Load data

Using QNLI as unannotated corpus (ignoring column "label")

In [3]:
# using the dev.tsv split here; train.tsv is the larger dataset

df_qnli = pd.read_csv("QNLI/dev.tsv",delimiter="\t",quoting=QUOTE_NONE)
df_qnli.head(3)

Unnamed: 0,index,question,sentence,label
0,0,What came into force after the new constitutio...,"As of that day, the new constitution heralding...",entailment
1,1,What is the first major city in the stream of ...,The most important tributaries in this area ar...,not_entailment
2,2,What is the minimum required if you want to te...,In most provinces a second Bachelor's Degree s...,not_entailment


## Vector Space model of Words and Documents

Word: 1 token (after NLTK tokenize)
Document: 1 row in column "sentence"
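
To make the two definitions concrete, here is a minimal sketch in plain Python that builds a tiny term-document matrix by hand. The three toy sentences are made up for illustration (they are not from QNLI), and whitespace splitting stands in for NLTK's `word_tokenize`.

```python
from collections import Counter

# Toy corpus (made up for illustration, not taken from QNLI)
toy_docs = [
    "the building has gothic architecture",
    "the bank reported financial results",
    "architecture of the bank building",
]

# Tokenize on whitespace; the notebook uses NLTK's word_tokenize instead
tokenized = [doc.split() for doc in toy_docs]

# Vocabulary: sorted word types, each mapped to a column index
toy_vocab = sorted({tok for doc in tokenized for tok in doc})
col = {w: j for j, w in enumerate(toy_vocab)}

# N x K count matrix: rows = documents, columns = word types
toy_matrix = [[0] * len(toy_vocab) for _ in tokenized]
for i, doc in enumerate(tokenized):
    for tok, cnt in Counter(doc).items():
        toy_matrix[i][col[tok]] = cnt

# A word vector is a column: one entry per document
architecture_vector = [row[col["architecture"]] for row in toy_matrix]
print(toy_vocab)
print(architecture_vector)
```

Each row of `toy_matrix` represents a document and each column represents a word type, which is the same layout the `CountVectorizer` cells below produce at full scale.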

In [4]:
# convert dataset to array of (short) lowercase texts/documents
texts = [sent.lower() for sent in df_qnli.sentence]

N_DOCUMENTS = len(texts)
print("N=# texts/documents:",N_DOCUMENTS)

print("\nOne sample text:")
print(texts[100])

N=# texts/documents: 5463

One sample text:
newton completed 4 of 4 passes for 51 yards and rushed twice for 25 yards, while jonathan stewart finished the drive with a 1-yard touchdown run, cutting the score to 10–7 with 11:28 left in the second quarter.


In [5]:
MIN_DOC_FREQ_TO_KEEP = 2  # ignore words that appear in fewer than 2 documents

vectorizer = CountVectorizer(stop_words="english",  # note this is using different set than NLTK stopwords in cell 2
                             binary=False,  # detect counts of words in text
                             tokenizer=word_tokenize, 
                             min_df=MIN_DOC_FREQ_TO_KEEP) 

# Apply to data -> build N x K matrix: count of word type k in document n
simple_term_document_matrix = vectorizer.fit_transform(raw_documents=texts).todense()

K_VOCAB_SIZE = simple_term_document_matrix.shape[1]
print("K=Size of vocabulary:", K_VOCAB_SIZE)

print("Shape of term-document matrix (NxK): ", simple_term_document_matrix.shape)


# peek: vocabulary:
print("\nSample of vocabulary:")
vocab = sorted(vectorizer.vocabulary_.keys(), key=vectorizer.vocabulary_.get)
print(vocab[3000:3015])  # looking at a slice of 15 word types in the middle, starting at index 3000



K=Size of vocabulary: 9469
Shape of term-document matrix (NxK):  (5463, 9469)

Sample of vocabulary:
['djingis', 'dna', 'dnas', 'dock', 'docking—', 'doctor', 'doctors', 'doctor—an', 'doctrinal', 'doctrine', 'document', 'documentary', 'documents', 'does', 'doing']


In [6]:
# Mostly zeros: a single text uses only a small subset of the vocabulary
simple_term_document_matrix[:10,:10]

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [7]:
pct_nonzero = (simple_term_document_matrix>0).mean()
print(f"% of nonzero cells: {pct_nonzero:.2%}") 

% of nonzero cells: 0.17%
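
A quick back-of-the-envelope check shows why the sparsity matters. The sketch below only does arithmetic on the figures printed above. The 8 bytes per cell is an assumption based on `CountVectorizer`'s default `int64` dtype, and 0.17% is the rounded value, so the nonzero count is approximate.

```python
# Figures taken from the outputs above
n_docs, vocab_size = 5463, 9469
nonzero_fraction = 0.0017  # 0.17%, rounded as printed

n_cells = n_docs * vocab_size
approx_nonzero = round(n_cells * nonzero_fraction)

# Assumption: 8 bytes per cell (int64), the default dtype of CountVectorizer output
dense_megabytes = n_cells * 8 / 1e6

print(f"cells: {n_cells:,}")
print(f"approx. nonzero cells: {approx_nonzero:,}")
print(f"dense storage: ~{dense_megabytes:.0f} MB")
```

For a corpus much larger than this one, it would likely be worth keeping the sparse result of `fit_transform` and skipping `.todense()`.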


## Word Similarity (and Difference)


Examine 4 words (two pairs): Which terms are similar? Which are different?

In [8]:
word_arr = ["architecture", "building","economic", "financial", ]
word_index_arr = [vocab.index(w) for w in word_arr]

print(word_index_arr)

for w, w_idx in zip(word_arr, word_index_arr):
    # with binary=False this sums term counts, so it is the total number of occurrences
    total_count = simple_term_document_matrix[:,w_idx].sum()
    print(f"{w:12}  index:{w_idx}  count:{total_count}")


[1085, 1641, 3158, 3709]
architecture  index:1085  count:17
building      index:1641  count:67
economic      index:3158  count:73
financial     index:3709  count:31


In [9]:
# Peek: vector for word 'architecture' 
word = "architecture"
word_idx = vocab.index(word)

word_vector = simple_term_document_matrix[:, word_idx]
print(f"Vector of word type '{word}' with dimensions {word_vector.shape} :")
print(word_vector)


print(f"\n\nTexts containing word type '{word}':\n")
for i in range(simple_term_document_matrix.shape[0]):
    if word_vector[i]>0:
        sentence = df_qnli.sentence.iloc[i]
        show_sentence = sentence.replace(word, "___"+word.upper()+"___")
        print(i, show_sentence)
        print()
        

Vector of word type 'architecture' with dimensions (5463, 1) :
[[0]
 [0]
 [0]
 ...
 [0]
 [0]
 [0]]


Texts containing word type 'architecture':

141 Likewise the tower above the main entrance has an open work crown surmounted by a statue of fame, a feature of late Gothic ___ARCHITECTURE___ and a feature common in Scotland, but the detail is Classical.

422 Norman ___ARCHITECTURE___ typically stands out as a new stage in the architectural history of the regions they subdued.

899 Presently, a firm that is nominally an "___ARCHITECTURE___" or "construction management" firm may have experts from all related fields as employees, or to have an associated company that provides each necessary skill.

1114 The Neoclassical revival affected all aspects of ___ARCHITECTURE___, the most notable are the Great Theater (1825–1833) and buildings located at Bank Square (1825–1828).

1291 The main ___ARCHITECTURE___ gallery has a series of pillars from various buildings and different periods, for exampl

In [10]:
# get N x 4 matrix for 4 words of interest:
matrix_slice = simple_term_document_matrix[:,word_index_arr]

#normalize each word vector
matrix_slice_normed = matrix_slice/np.linalg.norm(matrix_slice, axis=0)

similarity_matrix = np.dot(matrix_slice_normed.T, matrix_slice_normed)

pd.DataFrame(similarity_matrix, columns=word_arr, index=word_arr)


Unnamed: 0,architecture,building,economic,financial
architecture,1.0,0.0965,0.0,0.0
building,0.0965,1.0,0.0,0.0152
economic,0.0,0.0,1.0,0.0469
financial,0.0,0.0152,0.0469,1.0


In [11]:
# what if we don't normalize word vectors? 
similarity_matrix_unnormed = np.dot(matrix_slice.T, matrix_slice)

# convert to pandas dataframe (easier to read!)
pd.DataFrame(similarity_matrix_unnormed, columns=word_arr, index=word_arr)

Unnamed: 0,architecture,building,economic,financial
architecture,17,4,0,0
building,4,101,0,1
economic,0,0,95,3
financial,0,1,3,43
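
The two matrices are directly related: the diagonal of the unnormalized matrix holds each word vector's squared length, so dividing an off-diagonal entry by the two corresponding lengths recovers the cosine similarity from cell 10. The sketch below checks this by hand, with the entries copied from the output above. It also explains why the diagonal for 'building' (101) exceeds its count of 67: the diagonal sums squared counts, and some texts use the word more than once.

```python
import math

# Entries copied from the unnormalized matrix above
dot = {
    ("architecture", "architecture"): 17,
    ("building", "building"): 101,
    ("economic", "economic"): 95,
    ("financial", "financial"): 43,
    ("architecture", "building"): 4,
    ("building", "financial"): 1,
    ("economic", "financial"): 3,
}

def cosine(a, b):
    """Cosine similarity = dot(a, b) / (|a| * |b|), where |a|^2 = dot(a, a)."""
    return dot[(a, b)] / math.sqrt(dot[(a, a)] * dot[(b, b)])

pairs = [
    ("architecture", "building"),
    ("building", "financial"),
    ("economic", "financial"),
]
for a, b in pairs:
    print(f"{a:12} {b:10} {cosine(a, b):.4f}")
```

Without normalization, frequent words get large raw dot products simply because they occur often, which makes scores hard to compare across word pairs.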


## Beyond counts: Tf-Idf Vectorization

`TfidfVectorizer` has a similar interface to `CountVectorizer`, but weights each word by how rare it is across documents.

In [12]:
tfidf_vectorizer = TfidfVectorizer(
    stop_words="english", 
    binary=True, # binary term frequency: 1 = word in text, 0 = word not in text (counts are ignored)
    tokenizer=word_tokenize, 
    min_df=MIN_DOC_FREQ_TO_KEEP
) 

tfidf_matrix = tfidf_vectorizer.fit_transform(raw_documents=texts).todense()

K_VOCAB_SIZE = tfidf_matrix.shape[1]
print("K=Size of vocabulary:", K_VOCAB_SIZE)

print("Shape of term-document matrix (NxK): ", tfidf_matrix.shape)

tfidf_matrix[:10,:10]



K=Size of vocabulary: 9469
Shape of term-document matrix (NxK):  (5463, 9469)


matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.20990859, 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.11565219, 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.    

In [13]:
pct_nonzero = (tfidf_matrix>0).mean()  # same as count-vectorizer
print(f"% of nonzero cells: {pct_nonzero:.2%}") 

% of nonzero cells: 0.17%
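
The weights themselves can be reproduced by hand. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm="l2"`), under which the documented formula is idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document row. The toy matrix is made up for illustration.

```python
import math

# Toy binary term-document matrix (made up): 4 documents x 3 word types
toy_terms = ["bank", "building", "gothic"]
toy_binary = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]
n_docs = len(toy_binary)

# Document frequency: number of documents containing each word type
doc_freq = [sum(row[j] for row in toy_binary) for j in range(len(toy_terms))]

# Smoothed idf, as documented for scikit-learn's default smooth_idf=True
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

# tf-idf row = binary tf * idf, then L2-normalized per document
toy_tfidf = [l2_normalize([tf * w for tf, w in zip(row, idf)]) for row in toy_binary]

for term, df, w in zip(toy_terms, doc_freq, idf):
    print(f"{term:10} df={df}  idf={w:.4f}")
for row in toy_tfidf:
    print([round(v, 4) for v in row])
```

The row normalization is why the same word can carry different weights in different texts, as in the two nonzero cells of the matrix preview above: its share of a text's unit length shrinks as the text contains more (or rarer) words.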


In [14]:
# get N x 4 matrix for 4 words of interest:
matrix_slice = tfidf_matrix[:,word_index_arr]

#normalize each word vector
matrix_slice_normed = matrix_slice/np.linalg.norm(matrix_slice, axis=0)

similarity_matrix = np.dot(matrix_slice_normed.T, matrix_slice_normed)

pd.DataFrame(similarity_matrix, columns=word_arr, index=word_arr)

Unnamed: 0,architecture,building,economic,financial
architecture,1.0,0.1014,0.0,0.0
building,0.1014,1.0,0.0,0.0243
economic,0.0,0.0,1.0,0.0171
financial,0.0,0.0243,0.0171,1.0


## Vector representation of texts

In [15]:
SAMPLE_DOC_IDX = 100
print(texts[SAMPLE_DOC_IDX])

newton completed 4 of 4 passes for 51 yards and rushed twice for 25 yards, while jonathan stewart finished the drive with a 1-yard touchdown run, cutting the score to 10–7 with 11:28 left in the second quarter.


In [16]:
doc_vector_1 = simple_term_document_matrix[SAMPLE_DOC_IDX, :]
doc_vector_2 = tfidf_matrix[SAMPLE_DOC_IDX, :]

# add all non-stop words to dataframe for review:
wordtype_info = [(word, doc_vector_1[0,j], doc_vector_2[0,j]) 
                 for j, word in enumerate(vocab)  
                 if doc_vector_1[0,j]>0]

df_wordtype_info = pd.DataFrame(wordtype_info, columns = ["wordtype","count","tfidf"]) 
df_wordtype_info.sort_values('tfidf', ascending=False)

Unnamed: 0,wordtype,count,tfidf
3,10–7,1,0.2467
4,11:28,1,0.2467
7,51,1,0.2467
2,1-yard,1,0.24
12,jonathan,1,0.2299
18,rushed,1,0.2224
9,cutting,1,0.2224
21,stewart,1,0.2192
16,quarter,1,0.2163
22,touchdown,1,0.2091


## Multi-word representations: Texts and Queries

In [17]:
# query: find the text most similar to a query made of 'building' and 'financial'

query_vector = tfidf_vectorizer.transform(["building financial"]).todense()

text_scores = np.dot(tfidf_matrix, query_vector.T)
query_match_index = np.argmax(text_scores)
texts[query_match_index]

'mortgage bankers, accountants, and cost engineers are likely participants in creating an overall plan for the financial management of the building construction project.'
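
The retrieval step above boils down to a dot product between every document vector and the query vector, followed by an argmax. Here is a minimal sketch in plain Python with made-up weights over three hypothetical columns. Because the vectors are unit length (as tf-idf rows are under the default `norm="l2"`), the dot product equals cosine similarity.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

# Toy weighted vectors over the columns [building, financial, gothic]; values are made up
toy_doc_vectors = [
    l2_normalize([1.5, 0.0, 1.9]),  # doc 0: building and gothic
    l2_normalize([0.0, 1.7, 0.0]),  # doc 1: financial only
    l2_normalize([1.5, 1.7, 0.0]),  # doc 2: building and financial
]
toy_query = l2_normalize([1.5, 1.7, 0.0])  # query "building financial"

# Score = dot product; with unit-length vectors this equals cosine similarity
scores = [sum(d * q for d, q in zip(doc, toy_query)) for doc in toy_doc_vectors]
best = max(range(len(scores)), key=scores.__getitem__)

print([round(s, 4) for s in scores])
print("best match: doc", best)
```

A text that contains both query words, and little else, scores highest, which is consistent with the QNLI sentence returned above mentioning both 'financial' and 'building'.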