# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Vector-Space-Models" data-toc-modified-id="Vector-Space-Models-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Vector Space Models</a></div><div class="lev2 toc-item"><a href="#Term-document-incidence" data-toc-modified-id="Term-document-incidence-11"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Term-document incidence</a></div><div class="lev2 toc-item"><a href="#Term-Frequency-(TF)" data-toc-modified-id="Term-Frequency-(TF)-12"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Term Frequency (TF)</a></div><div class="lev2 toc-item"><a href="#Inverse-Document-Frequency-(IDF)" data-toc-modified-id="Inverse-Document-Frequency-(IDF)-13"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Inverse Document Frequency (IDF)</a></div><div class="lev2 toc-item"><a href="#TF-IDF" data-toc-modified-id="TF-IDF-14"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>TF-IDF</a></div>

# Vector Space Models

We had a first glimpse of vector space models in Module 1, where we converted each document into a vector of term (feature) counts. In this module we explore the idea in more depth. According to [Wikipedia](https://en.wikipedia.org/wiki/Vector_space_model), the **vector space model** or **term vector model** represents text documents as algebraic vectors of terms.


## Term-document incidence 

This is the simplest numerical representation of documents. Why do we need a numerical representation at all? Because it lets us apply many data mining and machine learning methods to text without modifying them. A **term** is an indexed unit within a corpus (broadly speaking, terms are words). For each document we record whether each term is present or not, so we get a boolean vector for a document and a boolean matrix for a corpus. This matrix is known as the **term-document incidence matrix**.
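As a minimal sketch of the idea above, the following builds a term-document incidence matrix for a tiny corpus. The three documents, the whitespace tokenization, and the names `docs`, `tokenized`, `vocab`, and `incidence` are illustrative assumptions, not part of the module's material.

```python
# Toy corpus (hypothetical, for illustration only)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Naive tokenization (an assumption): lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: sorted list of the distinct terms in the corpus
vocab = sorted({term for tokens in tokenized for term in tokens})

# incidence[i][j] is True when term vocab[i] appears in document j
incidence = [[term in tokens for tokens in tokenized] for term in vocab]

# Print one row per term, one column per document
for term, row in zip(vocab, incidence):
    print(f"{term:>8}", [int(present) for present in row])
```

Each row is a term and each column a document. Note that "the" is marked once for the first document even though it occurs twice there, which is exactly the information an incidence matrix discards.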


## Term Frequency (TF)

If a term occurs multiple times within a document, it should carry more importance in the analysis, and a term-document incidence vector cannot express that. Instead, we can assign a **weight** to each term based on the number of occurrences of the term in the document. The simplest approach is to set the weight equal to the number of occurrences of the term in the document. This weighting scheme is called **term frequency**. The term frequencies (TFs) of a document give a quantitative representation of that document. This representation is also known as the **Bag of Words** model. Why? Because the exact ordering of the terms in a document is ignored, while the number of occurrences of each term is kept. So in this representation, the document ``A is greater than B`` is identical to the document ``B is greater than A``.
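The bag-of-words behavior described above can be checked with a short sketch. It assumes naive lowercase, whitespace tokenization, and `term_frequencies` is a hypothetical helper name introduced here for illustration.

```python
from collections import Counter

def term_frequencies(text):
    """Return raw term counts using naive whitespace tokenization (an assumption)."""
    return Counter(text.lower().split())

tf_1 = term_frequencies("A is greater than B")
tf_2 = term_frequencies("B is greater than A")

# Word order is discarded, so both documents map to the same bag of words
print(tf_1 == tf_2)  # True

# Repeated terms get larger weights; absent terms count as 0
tf_3 = term_frequencies("the cat saw the other cat")
print(tf_3["the"], tf_3["cat"], tf_3["dog"])  # 2 2 0
```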


## Inverse Document Frequency (IDF)

Raw term frequency poses a problem: all terms are considered equally important. Consider a corpus on sports in which the word "sport" occurs in every document. Does this word have any discriminating power for assessing relevance or for classification? No. To reduce the effect of terms that occur in many documents, the notion of **inverse document frequency** (IDF) is introduced. Formally, idf is defined as follows:

$$
\text{idf}_t = \log \frac{N}{\text{df}_t}
$$

Here,

- $N$: number of documents in the corpus
- $\text{df}_t$: document frequency of term $t$, i.e. the number of documents that contain $t$
- $\text{idf}_t$: inverse document frequency of $t$
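A small sketch of the definition above, using a made-up four-document sports corpus. The text does not fix the base of the logarithm, so the natural logarithm is assumed here; the helper name `idf` and the corpus are illustrative only.

```python
import math

# Toy sports corpus (hypothetical); every document contains "sport"
docs = [
    "sport football match",
    "sport tennis match",
    "sport swimming race",
    "sport football league",
]
# Sets are enough here: document frequency ignores repeated occurrences
tokenized = [set(doc.split()) for doc in docs]
N = len(tokenized)

def idf(term):
    """idf_t = log(N / df_t), natural log assumed; the term must occur in the corpus."""
    df = sum(term in tokens for tokens in tokenized)
    return math.log(N / df)

print(idf("sport"))     # occurs in all 4 documents -> 0.0
print(idf("football"))  # occurs in 2 documents -> log(4 / 2)
print(idf("tennis"))    # occurs in 1 document  -> log(4 / 1)
```

The term that appears everywhere gets an idf of zero, while rarer terms get progressively larger values. Practical libraries often use smoothed variants of this formula, so their numbers may differ from this sketch.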


## TF-IDF

Term frequency and inverse document frequency can be combined to give a composite score (importance) for each term in each document. Formally, the tf-idf weighting scheme assigns to term $t$ a weight in document $d$ as follows:

$$
\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t.
$$

**Observations: (see Ch 6 of Introduction to Information Retrieval)**

The TF-IDF score of a term $t$ in a document is:

1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
3. lowest when the term occurs in virtually all documents.
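The three properties above can be seen in a minimal sketch that combines the two formulas. The corpus, the natural-log base, the whitespace tokenization, and the names `df`, `tf_idf`, and `weights` are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical)
docs = [
    "sport football football match",
    "sport tennis match",
    "sport swimming race",
]
tokenized = [doc.split() for doc in docs]
N = len(tokenized)

# df[t]: number of documents that contain term t (each document counted once)
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Map each term of one document to tf * log(N / df), natural log assumed."""
    tf = Counter(tokens)
    return {term: count * math.log(N / df[term]) for term, count in tf.items()}

weights = [tf_idf(tokens) for tokens in tokenized]
for doc_weights in weights:
    print({term: round(value, 3) for term, value in doc_weights.items()})
```

In the first document, "football" (frequent in the document, rare in the corpus) scores highest, "match" (shared with another document) scores lower, and "sport" (present in every document) scores zero.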

## Additional Reading
 - [Video on TF and IDF](https://web.dsa.missouri.edu/static/videos/DMIR/DATA_SCI_8630_IDF.mp4)
 - [What does tf-idf mean?](http://www.tfidf.com/)
 - [Term Frequency and Weighting](https://nlp.stanford.edu/IR-book/html/htmledition/term-frequency-and-weighting-1.html)
 - [Inverse Document Frequency](https://nlp.stanford.edu/IR-book/html/htmledition/inverse-document-frequency-1.html)
 - [TFIDF Overview](https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html)
 - Extra Reading
     - [Even More TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
