# INFO 3350/6350

## Lecture 02: Vectorization

* **Go to section, or you will fail the course.**
* **Do the readings for the day that they are assigned, and at latest by Friday.**
* **Watch/star the course repo so you don't miss any updates or schedule changes.**
* **First homework is released next Friday.**

## Review: What did we go over last Wednesday?

* Turning documents into vectors
* Comparing vectors with distance/similarity metrics
* Tokenization (split up strings into smaller units)

## TF-IDF weighting for word counts

* Why do we sometimes remove stopwords from our features?
    * High-frequency words shared by many documents don't tell us (in many, but not all, cases) much about the similarities or differences between documents
* But stopword lists are binary: a word is either a stopword (hence, removed) or it isn't
* Can we define a continuous adjustment for "stoppiness" that we apply to *every* word, depending on how widely used it is?
* One approach is "term frequency-inverse document frequency" (TF-IDF) weighting.
    * You can think of this as multiplying the count of each term in a document by the inverse of the fraction of all documents in which that term occurs (hence "term frequency [multiplied by] inverse document frequency"). The actual formula is a bit more complicated (see below), but that's the idea: it upweights words that occur in relatively few documents.
    * The count of a word that occurs in every document would be multiplied by one, so it gets no boost. A word that occurs in just one document in a corpus of 100 documents would have its count multiplied by 100 in the one document that contains it.
* There are several tweaks to TF-IDF to smooth it out and to modulate the boost it provides. `scikit-learn`'s `TfidfVectorizer`, with its default `smooth_idf=True` setting, applies the reweighting:

$$\text{idf}(t) = \ln\frac{1+n}{1 + \text{df}(t)} + 1$$

Where:

* $t$ is the term in question
* $\text{idf}(t)$ is the inverse document frequency weight to be applied to the count of term $t$
* $n$ is the number of documents in the corpus
* $\text{df}(t)$ is the number of documents in the corpus that contain term $t$
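To get a feel for how much the logarithm and the add-one smoothing tame the boost, here is a minimal sketch that compares the simple ratio described earlier with the formula above. The function names `naive_idf` and `smoothed_idf` are our own, invented for illustration; they are not part of `scikit-learn`.

```python
import math


def naive_idf(n_docs: int, doc_freq: int) -> float:
    """Inverse of the fraction of documents that contain the term."""
    return n_docs / doc_freq


def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed, log-scaled weight: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# Hypothetical corpus size, matching the 100-document example in the text
n_docs = 100
for doc_freq in (1, 10, 100):
    naive = naive_idf(n_docs, doc_freq)
    smoothed = smoothed_idf(n_docs, doc_freq)
    print(f"df={doc_freq:>3}  naive={naive:6.1f}  smoothed={smoothed:.3f}")
```

Both versions give a weight of exactly 1 to a word that appears in every document, but the smoothed version grows far more slowly as words get rarer.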

A toy example: Consider two documents:

* Document 1: `"cat dog"`
* Document 2: `"dog dog"`

`cat` occurs in just one document; `dog` occurs in both documents. So we want (and expect) to upweight the count of `cat` in document 1.

Calculate the `idf` weight for `cat` (the weight depends only on the corpus, so it is the same in every document that contains the word):

* $n = 2$
* $\text{df}(\text{`cat'}) = 1$

$$\text{idf}(\text{`cat'}) = \ln\frac{1 + 2}{1 + 1} + 1 = \ln\frac{3}{2} + 1 \approx 1.405$$

And for `dog`:

* $n = 2$
* $\text{df}(\text{`dog'}) = 2$

$$\text{idf}(\text{`dog'}) = \ln\frac{1 + 2}{1 + 2} + 1 = \ln\frac{3}{3} + 1 = 1.0$$

So, `cat` will be upweighted relative to `dog`, because it is the less widely used word across documents in the corpus.

Our non-normalized but IDF-weighted feature matrix (raw count times `idf` weight, one row per document) will look like this:

```
       cat  dog
doc 1  1.4  1.0
doc 2  0.0  2.0
```

In code:

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "cat dog",
    "dog dog",
]
print("### Corpus ###\n", corpus, "\n")

# without IDF weighting (note l2 norm)
vectorizer_no_idf = TfidfVectorizer(
    use_idf=False
)
features_no_idf = vectorizer_no_idf.fit_transform(corpus)
print("### Feature matrix *without* IDF weighting ###")
print("Feature names:", vectorizer_no_idf.get_feature_names_out())
print(features_no_idf.toarray())

# with IDF weighting
vectorizer_with_idf = TfidfVectorizer(
    use_idf=True,
)
features_with_idf = vectorizer_with_idf.fit_transform(corpus)
print("\n### Feature matrix *with* IDF weighting ###")
print("Feature names:", vectorizer_with_idf.get_feature_names_out())
print(features_with_idf.toarray())

### Corpus ###
 ['cat dog', 'dog dog'] 

### Feature matrix *without* IDF weighting ###
Feature names: ['cat' 'dog']
[[0.70710678 0.70710678]
 [0.         1.        ]]

### Feature matrix *with* IDF weighting ###
Feature names: ['cat' 'dog']
[[0.81480247 0.57973867]
 [0.         1.        ]]


Notice that, in document 1, `cat` has been upweighted while `dog` has been downweighted. There's no change in document 2 because it contains only a single word type, and `TfidfVectorizer`'s default `l2` norm rescales each document vector so that its squared feature weights sum to 1.

In [2]:
# Check our hand calculation against code version
import numpy as np
vec = np.array([1.405, 1.0])     # hand calculation
l2_vec = vec/np.linalg.norm(vec) # calculate l2 normalized version
print("l2-normed, hand calculated, IDF weighted features for document 1")
print(l2_vec)

l2-normed, hand calculated, IDF weighted features for document 1
[0.81471182 0.57986606]


In [3]:
# do our two versions match?
assert np.allclose(l2_vec, features_with_idf[0,].toarray(), atol=0.01)

## Using custom functions

`sklearn` vectorizers have [settings for common options](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). But sometimes, you need to plug in your own code for a special case. For example, what about Chinese-language input? Written Chinese doesn't put spaces between words, so the default tokenizer has nothing to split on except punctuation.

In [4]:
# Strings
zh = '因受新型冠状病毒危机对足球和其他体育赛事的持续影响，早已面临越来越多亏损的英格兰超级足球联赛周四宣布，因为无法解决与中国合作伙伴的纠纷，已终止了其最赚钱的海外转播合同。'
en = 'The English Premier League, already facing mounting losses because of the continued impact of the coronavirus crisis on soccer and other sporting events, announced on Thursday that it had canceled its most lucrative overseas broadcast contract after it was unable to resolve a dispute with its Chinese partner.'

vectorizer_default = TfidfVectorizer(input='content')
chinese_features = vectorizer_default.fit_transform([zh])
print("Feature names:", vectorizer_default.get_feature_names_out())
print("Values:", chinese_features.toarray())

Feature names: ['因为无法解决与中国合作伙伴的纠纷' '因受新型冠状病毒危机对足球和其他体育赛事的持续影响' '已终止了其最赚钱的海外转播合同'
 '早已面临越来越多亏损的英格兰超级足球联赛周四宣布']
Values: [[0.5 0.5 0.5 0.5]]
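Why did we get whole clauses as features? By default, `TfidfVectorizer` finds tokens with the regular expression `(?u)\b\w\w+\b`, which matches runs of two or more word characters. A minimal sketch of what that pattern does, using only Python's `re` module and two shortened sample strings of our own (taken from the end of the sentences above):

```python
import re

# scikit-learn's documented default token_pattern
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Shortened samples (our own excerpts, for illustration)
sample_zh = "因为无法解决与中国合作伙伴的纠纷，已终止了其最赚钱的海外转播合同。"
sample_en = "it had canceled its most lucrative overseas broadcast contract."

# Chinese characters all count as word characters, so with no spaces
# between them, each clause between punctuation marks becomes one "token"
zh_tokens = re.findall(DEFAULT_TOKEN_PATTERN, sample_zh)
en_tokens = re.findall(DEFAULT_TOKEN_PATTERN, sample_en)

print(len(zh_tokens), "Chinese tokens:", zh_tokens)
print(len(en_tokens), "English tokens:", en_tokens)
```

The pattern works for English because spaces separate the words. For Chinese, we need a tokenizer that knows where the word boundaries are.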


In [5]:
# Custom tokenizer
import jieba

def chinese_tokenizer(x):
    '''Tokenize a Chinese-language string'''
    return jieba.lcut(x)
    
vectorizer_chinese = TfidfVectorizer(
    input='content',
    tokenizer=chinese_tokenizer,
    token_pattern=None
)
chinese_features = vectorizer_chinese.fit_transform([zh])
print("Feature names:", vectorizer_chinese.get_feature_names_out())
print("Values:", chinese_features.toarray())



Feature names: ['。' '与' '中国' '了' '亏损' '体育赛事' '其' '其他' '冠状病毒' '危机' '合作伙伴' '合同' '周四' '和'
 '因为' '因受' '多' '宣布' '对' '已' '影响' '持续' '新型' '无法' '早已' '最' '海外' '的' '纠纷'
 '终止' '英格兰' '解决' '赚钱' '超级' '越来越' '足球' '足球联赛' '转播' '面临' '，']
Values: [[0.12598816 0.12598816 0.12598816 0.12598816 0.12598816 0.12598816
  0.12598816 0.12598816 0.12598816 0.12598816 0.12598816 0.12598816
  0.12598816 0.12598816 0.12598816 0.12598816 0.12598816 0.12598816
  0.12598816 0.12598816 0.12598816 0.12598816 0.12598816 0.12598816
  0.12598816 0.12598816 0.12598816 0.50395263 0.12598816 0.12598816
  0.12598816 0.12598816 0.12598816 0.12598816 0.12598816 0.12598816
  0.12598816 0.12598816 0.12598816 0.37796447]]


In [6]:
# check norm = l2
# cf. https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html

print("Chinese vector length:", np.linalg.norm(chinese_features.toarray().T, ord=2))

Chinese vector length: 1.0


You should take a look at all of the other parameters for the `sklearn` implementation of TF-IDF. The [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) is very clear and includes examples similar to the above.

## Discussion

* What are some obvious limitations of TF-IDF? 
    * Say you fit a TF-IDF model to a large number of documents. What would you expect the resulting vectors to look like? What properties would they have?
    * Say you add some new documents to your corpus after already fitting a TF-IDF model to it. Can you update it?
    * How might you compare two or more TF-IDF vectors? 
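
If you want to poke at these questions empirically, here is one possible starting point. It is a sketch, not an answer key: the toy corpus is invented for illustration, and cosine similarity is just one of several reasonable ways to compare vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, invented for this illustration
train_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_docs)

# 1. What do the vectors look like? Check shape and sparsity.
n_docs, n_terms = train_features.shape
density = train_features.nnz / (n_docs * n_terms)
print(f"{n_docs} documents x {n_terms} terms, {density:.0%} nonzero")

# 2. What happens with new documents? transform() reuses the fitted
#    vocabulary and IDF weights; it does not learn new words.
new_features = vectorizer.transform(["the parrot sat on the cat"])
print("Columns after transform:", new_features.shape[1])
print("'parrot' in vocabulary?", "parrot" in vectorizer.vocabulary_)

# 3. How can we compare vectors? Cosine similarity is one option.
similarities = cosine_similarity(new_features, train_features)
print("Similarity to each training document:", similarities.round(2))
```

Try swapping in a larger corpus and watch what happens to the number of columns and the fraction of nonzero entries.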