# TF-IDF and cosine similarity
Part one of this lesson is based on a tutorial that gives a more in-depth introduction to TF-IDF:<br>
https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf

Part two is an example of how lists of words weighted using TF-IDF can be used to calculate document similarity.

**TF-IDF (term frequency-inverse document frequency)**: a measure of how important or relevant a word is to one document, relative to a collection of documents<br>
**Cosine similarity**: a measure of similarity between two documents, based on the angle between their term vectors

Python libraries
- sklearn (machine learning)
- nltk (natural language processing)
- pandas (process tabular data)



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np
import os
from collections import Counter

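import nltk
# Download the tokenizer models used by word_tokenize
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
from nltk import word_tokenize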

In [None]:
!unzip lesson-files.zip

## 1. TF-IDF
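
To make the definition concrete, here is a minimal sketch that computes TF-IDF by hand for a toy corpus. The three sentences and the helper name `tf_idf` are made up for illustration and are not part of the lesson files. The sketch assumes the smoothed IDF variant that the scikit-learn documentation describes as its default, and it skips the length normalization that `TfidfVectorizer` applies unless `norm=None` is set, so the numbers will differ from normalized output.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the lesson files)
docs = [
    "the cat sat",
    "the dog sat",
    "the cat ran",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Raw term count multiplied by a smoothed inverse document frequency."""
    tf = tokens.count(term)
    idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
    return tf * idf

# Score every term in the second document ("the dog sat")
scores = {term: round(tf_idf(term, tokenized[1]), 3) for term in tokenized[1]}
print(scores)
```

"the" occurs in every document and gets the lowest score, while "dog" occurs in only one document and gets the highest.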

### 1.1 Load text data

In [None]:
# Get a list of the filenames
input_files = os.listdir('lesson-files/txt')
input_files.sort()
input_files[:5]

In [None]:
# Open the files and append their content to a list
text_data = []
for filename in input_files:
  with open("lesson-files/txt/" + filename) as input_file:
    text = input_file.read()
    text_data.append(text)

In [None]:
# Inspect the first 400 characters in the first document
text_data[0][:400]

### 1.2 Tokenization: splitting text into words and punctuation

In [None]:
tokens = text_data[0].split()
Counter(tokens).most_common(20)

In [None]:
tokens = word_tokenize(text_data[0])
Counter(tokens).most_common(20)
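
The two cells above differ mainly in how punctuation is handled: `split()` only breaks on whitespace, so punctuation stays attached to the neighbouring word. The sketch below shows the effect on a made-up sentence. It uses a simple regular expression as a rough stand-in for `word_tokenize` so that it runs without downloading any models; this is an assumption for illustration, not how NLTK tokenizes internally.

```python
import re
from collections import Counter

# Made-up example sentence (not from the lesson files)
text = "Hello, world! Hello again."

# Whitespace splitting keeps punctuation glued to the words
split_tokens = text.split()
split_counts = Counter(split_tokens)

# Regex stand-in for a tokenizer: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
regex_counts = Counter(regex_tokens)

print(split_tokens)           # ['Hello,', 'world!', 'Hello', 'again.']
print(regex_tokens)           # ['Hello', ',', 'world', '!', 'Hello', 'again', '.']
print(split_counts["Hello"])  # 1, because 'Hello,' is counted as a different token
print(regex_counts["Hello"])  # 2
```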

### 1.3 TfidfVectorizer

Parameters:<br>
* <code>max_df=.65</code>: ignore terms that appear in more than 65% of the documents
* <code>min_df=1</code>: ignore terms that occur in fewer than one document, which excludes nothing. Set the value higher than 1 for the parameter to have any effect
* <code>stop_words</code>: a list of words that you want to ignore
* <code>max_features</code>: limit the number of features (words)
* <code>norm=None</code>: disable normalization (explained in the tutorial at Programming Historian). The default, <code>norm='l2'</code>, scales each document vector to unit length
* <code>tokenizer=None</code>: use the default tokenization. This parameter allows us to override the tokenization process, for instance by using the tokenizer from nltk: <code>tokenizer=word_tokenize</code>

Other possible parameters are listed in the documentation:<br>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
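
The effect of `max_df` and `min_df` can be sketched in plain Python by counting in how many documents each term occurs. The helper `filter_vocabulary` and the toy documents are hypothetical and only illustrate the idea of filtering on document frequency; this is not scikit-learn's own code.

```python
from collections import Counter

# Toy corpus, already tokenized (hypothetical, not from the lesson files)
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]

def filter_vocabulary(docs, max_df=0.65, min_df=1):
    """Keep terms whose document frequency is at least min_df (a count)
    and at most max_df (a proportion of all documents)."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    return sorted(
        term
        for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )

# "the" occurs in 4 of 4 documents (100% > 65%) and is dropped
print(filter_vocabulary(docs, max_df=0.65, min_df=1))
# Raising min_df to 2 also drops the terms that occur in a single document
print(filter_vocabulary(docs, max_df=0.65, min_df=2))
```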

In [None]:
# Create a vectorizer object with the parameters described above
vectorizer = TfidfVectorizer(max_df=.65, min_df=1, stop_words=None, max_features=None, tokenizer=None)

In [None]:
# Fit the vectorizer to our data
transformed_documents = vectorizer.fit_transform(text_data)

The TfidfVectorizer stores the calculated values in a sparse matrix: a data structure that saves memory by storing only the non-zero values. Any value that is not stored is assumed to be zero.
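
The idea behind sparse storage can be sketched with a plain dictionary that maps (row, column) positions to non-zero values. This is only a conceptual illustration with made-up numbers; the sparse matrix returned by scikit-learn uses a more compact compressed format, and the helper `get_value` is hypothetical.

```python
# A small dense table with mostly zeros (made-up values)
dense = [
    [0.0, 0.0, 1.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.7],
]

# Sparse representation: store only the cells that are not zero
sparse = {
    (row, col): value
    for row, values in enumerate(dense)
    for col, value in enumerate(values)
    if value != 0
}

def get_value(sparse, row, col):
    """Look up a cell; anything that is not stored counts as zero."""
    return sparse.get((row, col), 0.0)

total_cells = len(dense) * len(dense[0])
print(len(sparse), "stored values out of", total_cells, "cells")
print(get_value(sparse, 0, 2))  # 1.5
print(get_value(sparse, 1, 1))  # 0.0
```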

In [None]:
transformed_documents

In [None]:
# Sparse data: 285947 non-zero values across roughly 13 million cells (documents x terms)
366 * 36269

In [None]:
# Convert the sparse matrix to a regular table or array
transformed_documents_as_array = transformed_documents.toarray()

# Inspect the array and verify that it represents the same number of documents that we have in the file list
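# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions no longer provide
feature_table = pd.DataFrame(transformed_documents_as_array, columns = vectorizer.get_feature_names_out())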
feature_table

### 1.4 Read and write CSV files with pandas

In [None]:
metadata = pd.read_csv("lesson-files/metadata.csv", index_col=0)
metadata[:4]

In [None]:
# Create output folder if it does not exist
if not os.path.exists("tf_idf_output"):
    os.mkdir("tf_idf_output")

# Loop over the list of input files and the array of transformed documents in parallel
for filename, doc in zip(input_files, transformed_documents_as_array):
  # Pair each term with its score and convert the result to a dataframe, highest score first
  # (get_feature_names_out() replaces get_feature_names() in newer scikit-learn versions)
  terms_and_scores = zip(vectorizer.get_feature_names_out(), doc)
  one_doc_as_df = pd.DataFrame(terms_and_scores, columns = ["term", "score"]).sort_values(by='score', ascending=False)

  # Write the output to a csv file named after the input file
  one_doc_as_df.to_csv("tf_idf_output/" + filename.replace('.txt', '.csv'), index = False)

In [None]:
pd.read_csv("tf_idf_output/0101.csv").head(4)

In [None]:
pd.read_csv("tf_idf_output/0104.csv").head(4)

In [None]:
# Load the n highest-scoring terms for the document at a given position in input_files
def load_terms(document_index, n = 4):
  return pd.read_csv("tf_idf_output/"+input_files[document_index].replace('.txt', '.csv')).head(n)

## 2.  Cosine similarity

1. Treat the list of TF-IDF weighted values for each document as coordinates in a multi-dimensional space, with one dimension per term.
2. Use the "angle" between two documents to calculate their similarity: the smaller the angle, the more similar the documents
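
The two steps above can be sketched by hand for short vectors: the cosine of the angle is the dot product of the two vectors divided by the product of their lengths. The vectors and the helper name `cosine` are made up for illustration, and returning 0.0 for an all-zero vector is a choice made in this sketch.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)

# Made-up term weights for three tiny "documents"
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a, only longer
doc_c = [0.0, 0.0, 3.0]  # shares no terms with doc_a

print(round(cosine(doc_a, doc_b), 3))  # 1.0, same direction
print(round(cosine(doc_a, doc_c), 3))  # 0.0, no terms in common
```

Because only the direction matters, a long document and a short document that use the same words in the same proportions get a similarity close to 1.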

In [None]:
# Create a table of similarities and display it as a dataframe
similarities = cosine_similarity(transformed_documents)
pd.DataFrame(similarities)

In [None]:
# Sort the document indices from low to high similarity
similar_sorted = similarities[3].argsort()
# Flip it (high to low similarity)
similar_sorted = np.flip(similar_sorted)
# Inspect the first element (the index of the most similar document, normally the document itself)
similar_sorted[0]

In [None]:
# List the indices of the top five most similar documents, skipping position 0 (the document itself)
similar_docs = similar_sorted[1:6]
similar_docs
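
The ranking logic in the cells above can be sketched in plain Python on a short list of made-up similarity scores. Sorting the indices by score, highest first, mirrors what `argsort` followed by `np.flip` does in the notebook when there are no tied scores; the scores and variable names here are hypothetical.

```python
# Made-up similarity scores between document 3 and six documents (including itself)
scores = [0.12, 0.80, 0.33, 1.0, 0.05, 0.64]

# Indices sorted from most to least similar
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Position 0 is the document itself (similarity 1.0), so start at position 1
top_three = ranked[1:4]

print(ranked)     # [3, 1, 5, 2, 0, 4]
print(top_three)  # [1, 5, 2]
```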

In [None]:
load_terms(3)

In [None]:
load_terms(287)

In [None]:
# Filter the metadata table on the indices
metadata.iloc[similar_docs]

In [None]:
pd.set_option('display.max_rows', 30)
load_terms(57, n=30)

In [None]:
load_terms(0, n=30)