# Tokenization
- You'd think this was isolated to [Natural Language Processing](../categories/nlp.md), but no.
- if you want to use an [Unsupervised](../categories/ml.md) model to cluster documents into categories, you need to tokenize the words there as well
- tokenization is breaking strings into chunks (tokens). They're often words, but not always: you might keep the period after a word to capture that the word ends a sentence (and so may mean something different), or you might split the `ing` off the end of a word to indicate an ongoing action.
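
A minimal sketch of the idea above, using only Python's `re` module. The helpers `tokenize` and `split_suffix` (and the `keep_punct` flag) are hypothetical names made up for illustration; real tokenizers handle many more cases than this.

```python
import re

def tokenize(text, keep_punct=False):
    """Split text into lowercase tokens (hypothetical helper, illustration only)."""
    # \w+ grabs runs of letters/digits; [.!?] optionally keeps sentence-ending punctuation
    pattern = r"\w+|[.!?]" if keep_punct else r"\w+"
    return re.findall(pattern, text.lower())

def split_suffix(token, suffix="ing"):
    """Crudely split a suffix off a token, e.g. 'playing' -> ['play', 'ing']."""
    # the length check avoids splitting short words like 'sing'
    if token.endswith(suffix) and len(token) > len(suffix) + 2:
        return [token[: -len(suffix)], suffix]
    return [token]

words = tokenize("The cat sat on the mat.")
with_punct = tokenize("The cat sat on the mat.", keep_punct=True)
print(words)                    # word tokens only
print(with_punct)               # same tokens, plus the final period as its own token
print(split_suffix("playing"))  # word stem and suffix as separate tokens
```

Whether punctuation or suffixes become their own tokens is a design choice: it changes the vocabulary, and therefore every count computed from it.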

(tf-idf)=
## TF-IDF (Term Frequency - Inverse Document Frequency) Matrix
- **TF-IDF** is a weighted measure of how important a word is to a document in a collection. It can be used to make a matrix. 
- **TF (Term Frequency)** - how often a specific word appears in a single document; a high **TF** separates words the document keeps coming back to from one-off mentions (e.g. a word that only shows up in the bibliography)
- **IDF (Inverse Document Frequency)** - how exclusive a word is to a single document (words that appear a lot in all documents will have a low **IDF**, filtering out common words like `"the"`)
- so basically a word that appears many times in the current document (high **TF**) but doesn't appear much in *most* of the *other* documents (high **IDF**) will have a really high **TF-IDF** score
    - each unique word in a document can be assigned a **TF-IDF** value
    - you can build a matrix out of the **TF-IDF** values, with one row per document and one column per term
    - then you could cluster from the matrix with K-Means or something
    - you'd assume that documents with similar high-**TF-IDF** words are on similar topics
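
The assumption in the last bullet can be checked on a toy corpus. This is a rough pure-Python sketch, not a clustering implementation: K-Means is not run here, and cosine similarity between rows of the matrix is swapped in to show the signal a clustering algorithm would pick up. The four documents are made up for this sketch, TF is count divided by document length, and IDF uses a base-10 log.

```python
import math
from collections import Counter

# toy corpus, made up for this sketch
docs = {
    "pets_1": "the cat chased the dog",
    "pets_2": "the dog chased the cat again",
    "money_1": "the bank raised the interest rate",
    "money_2": "the interest rate at the bank fell",
}

tokens = {name: text.split() for name, text in docs.items()}
vocab = sorted({t for toks in tokens.values() for t in toks})
n_docs = len(docs)
# document frequency: number of documents that contain each term
df = {t: sum(t in toks for toks in tokens.values()) for t in vocab}

def tfidf_row(toks):
    """One matrix row: a TF-IDF value for every term in the vocabulary."""
    counts = Counter(toks)
    return [(counts[t] / len(toks)) * math.log10(n_docs / df[t]) for t in vocab]

matrix = {name: tfidf_row(toks) for name, toks in tokens.items()}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

same_topic = cosine(matrix["pets_1"], matrix["pets_2"])
cross_topic = cosine(matrix["pets_1"], matrix["money_1"])
print(f"pets_1 vs pets_2:  {same_topic:.3f}")
print(f"pets_1 vs money_1: {cross_topic:.3f}")
```

The only word the two topics share is `the`, which appears in every document and so gets an IDF of 0. That is why the cross-topic similarity collapses to zero while the two pet documents stay close.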

### Equations
- $TF\text{-}IDF$ ([Term Frequency - Inverse Document Frequency](https://www.geeksforgeeks.org/machine-learning/understanding-tf-idf-term-frequency-inverse-document-frequency/))
    - $TF\text{-}IDF(t,d,D) = TF(t,d) \times IDF(t,D)$
- $TF$ (Term Frequency)
    - $TF(t,d) = \Large\frac{\text{number of times term } t \text{ appears in document }d}{\text{total number of terms in document }d}$
- $IDF$ (Inverse Document Frequency)
    - $IDF(t,D) = \log\Large(\frac{\text{total number of documents in corpus } D}{\text{number of documents containing term }t})$

| factor | interpretation |
| --- | --- |
| high $TF$            | term appears frequently in document $d$ |
| high $IDF$           | term $t$ appears in few of the documents in $D$ |
| high $TF\text{-}IDF$ | term $t$ is frequent in document $d$ and rare in the rest of $D$ |

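### Example - TF-IDF of "cat"
- `Cats` in doc $c$ is counted as a match for `cat` (i.e. the words are lowercased and stemmed first), and $\log$ is base 10
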
| "documents" $D$ | $TF(t,d)$ | $IDF(t,D)$ | $TF\text{-}IDF(t,d,D)$ |
| --- | --- | --- | --- |
| doc $a=$ `The cat sat on the mat.       ` | $1/6$ | $\log(3/2)\approx 0.176$ | $1/6\times 0.176 \approx 0.029$ |
| doc $b=$ `The dog played in the ok park.` | $0/7$ | $\log(3/2)\approx 0.176$ | $0/7\times 0.176 = 0$           |
| doc $c=$ `Cats and dogs are great pets. ` | $1/6$ | $\log(3/2)\approx 0.176$ | $1/6\times 0.176 \approx 0.029$ |
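
The table can be reproduced with a few lines of plain Python. This is a sketch under two assumptions that the table implies but does not state: the text is lowercased and a trailing `s` is stripped (a crude stand-in for a real stemmer, so `Cats` counts as `cat`), and the log is base 10. The `normalize` helper is a hypothetical name for this sketch.

```python
import math

docs = {
    "a": "The cat sat on the mat.",
    "b": "The dog played in the ok park.",
    "c": "Cats and dogs are great pets.",
}

def normalize(text):
    """Lowercase, drop periods, strip a trailing 's' (crude stand-in for a stemmer)."""
    words = text.lower().replace(".", "").split()
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]

tokens = {name: normalize(text) for name, text in docs.items()}
term = "cat"

# IDF: log10(total documents / documents containing the term)
n_containing = sum(term in toks for toks in tokens.values())
idf = math.log10(len(docs) / n_containing)

tfidf = {}
for name, toks in tokens.items():
    tf = toks.count(term) / len(toks)  # TF: occurrences / total terms in the document
    tfidf[name] = tf * idf
    print(f"doc {name}: TF={toks.count(term)}/{len(toks)}  IDF={idf:.3f}  TF-IDF={tfidf[name]:.3f}")
```

Without the stemming step, only doc `a` would contain `cat`, so the IDF would become $\log(3/1)$ and doc `c` would score 0.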

### Example - Scikit-Learn
- use `sklearn`'s `TfidfVectorizer.fit_transform()` to calculate the TF-IDF values
- use very short "documents" to keep the table small, since the matrix holds one TF-IDF value per (document, term) pair, i.e. $\text{number of documents}\times\text{number of terms}$ values

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# create three super-short "documents"
docs = {
    "d0": "Geeks for geeks",
    "d1": "Geeks",
    "d2": "Sweet peas",
}
# apply the TF-IDF vectorizer to the "documents"
vectorizer = TfidfVectorizer()
sparse_tfidf_matrix = vectorizer.fit_transform(docs.values())
```

```python
# make a table with term indices and IDF's
from tabulate import tabulate

# extract data
terms = vectorizer.get_feature_names_out()  # each term is a feature name
idfs = vectorizer.idf_  # NumPy array of IDF values (each term has one IDF value)
vocab = vectorizer.vocabulary_  # dict mapping each term to its column index
dense_matrix = sparse_tfidf_matrix.toarray()  # can convert to a dense matrix (takes up much more space)
# combine the term indexes and IDF's into a printable table
indx_idfs = [{"term": term, "index": vocab[term], "idf": idf} for term, idf in zip(terms, idfs)]
idf_table = tabulate(indx_idfs, headers="keys", tablefmt="simple")
# label the rows/columns of the dense matrix
dense_table = tabulate(dense_matrix, headers=terms, showindex=docs.keys())
# print out the IDF's and TF-IDF's
print(f"\ndocuments:\n{docs}")
print(f"\nterm indexes and IDF's:\n{idf_table}")
print(f"\nTF-IDF dense  matrix:\n{dense_table}")
print(f"\nTF-IDF sparse matrix:\n{sparse_tfidf_matrix}\nCoords = (doc index, term index),  Values = TF-IDFs")
```


```text
documents:
{'d0': 'Geeks for geeks', 'd1': 'Geeks', 'd2': 'Sweet peas'}

term indexes and IDF's:
term      index      idf
------  -------  -------
for           0  1.69315
geeks         1  1.28768
peas          2  1.69315
sweet         3  1.69315

TF-IDF dense  matrix:
         for     geeks      peas     sweet
--  --------  --------  --------  --------
d0  0.549351  0.835592  0         0
d1  0         1         0         0
d2  0         0         0.707107  0.707107

TF-IDF sparse matrix:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 5 stored elements and shape (3, 4)>
  Coords	Values
  (0, 1)	0.8355915419449176
  (0, 0)	0.5493512310263033
  (1, 1)	1.0
  (2, 3)	0.7071067811865476
  (2, 2)	0.7071067811865476
Coords = (doc index, term index),  Values = TF-IDFs
```


#### TF-IDF Matrix
- the dense matrix lists a TF-IDF value for every combination of term and document
- the sparse matrix only lists the cells with nonzero TF-IDF values, by coordinate (saves a lot of space/memory)
- many cells of the dense matrix are `0`, because the TF-IDF is `0` wherever a term isn't in a document
- the sparse matrix above can be read as this table:

| Doc | Term | Coords | tf-idf value |
| --- | ---  | --- | --- |
| d0 | for   | (0, 0) | 0.549 |
| d0 | geeks | (0, 1) | 0.836 |
| d1 | geeks | (1, 1) | 1.000 |
| d2 | peas  | (2, 2) | 0.707 |
| d2 | sweet | (2, 3) | 0.707 |
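
A small sketch of the dense-to-sparse idea, in plain Python. The values are copied from the dense matrix printed above. Storing `(row, col, value)` triplets is a simplification chosen for this sketch: the Compressed Sparse Row format reported in the output stores its data differently, but the principle of skipping the zeros is the same.

```python
terms = ["for", "geeks", "peas", "sweet"]
doc_names = ["d0", "d1", "d2"]

# dense matrix: one row per document, one column per term (values from the output above)
dense = [
    [0.549351, 0.835592, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.707107, 0.707107],
]

# sparse view: keep only the nonzero cells as (row, col, value) triplets
sparse = [(r, c, v) for r, row in enumerate(dense) for c, v in enumerate(row) if v != 0]

for r, c, v in sparse:
    print(f"{doc_names[r]:<3} {terms[c]:<6} ({r}, {c})  {v:.3f}")

n_cells = len(dense) * len(dense[0])
print(f"{len(sparse)} stored values instead of {n_cells} cells")
```

With three tiny documents the saving is small, but the share of zero cells grows quickly with vocabulary size, since most terms are missing from most documents.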

So now the documents are vectorized: each document (a string) has become a vector of numbers, with one entry per term in the vocabulary.
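
The numbers in the Scikit-Learn output do not match the formulas in the Equations section, because `TfidfVectorizer` computes a variant by default. As far as I understand its default settings, it uses the raw term count for TF (not divided by document length), a smoothed natural-log IDF of $\ln\frac{1+n}{1+df(t)}+1$, and then scales each document's row to unit length (L2 norm). The sketch below redoes that by hand in plain Python, assuming those defaults and a simple lowercase-and-split tokenizer, which is enough for these three documents.

```python
import math
from collections import Counter

docs = {"d0": "Geeks for geeks", "d1": "Geeks", "d2": "Sweet peas"}

# simple tokenizer (an assumption for this sketch): lowercase and split on spaces
tokens = {name: text.lower().split() for name, text in docs.items()}
vocab = sorted({t for toks in tokens.values() for t in toks})
n = len(docs)

# document frequency: number of documents containing each term
df = {t: sum(t in toks for toks in tokens.values()) for t in vocab}
# smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(toks):
    """Raw count times IDF for each vocabulary term, then L2-normalize the row."""
    counts = Counter(toks)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

rows = {name: tfidf_row(toks) for name, toks in tokens.items()}

print(vocab)
print({t: round(v, 5) for t, v in idf.items()})
for name, row in rows.items():
    print(name, [round(x, 6) for x in row])
```

The L2 normalization is why `d1`, which contains a single term, gets a value of exactly 1, and why both terms of `d2` get $1/\sqrt{2}\approx 0.707$.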