# CS 1656 – Introduction to Data Science 

## Instructors: Alexandros Labrinidis, Xiaowei Jia
## Teaching Assistants: Evangelos Karageorgos, Xiaoting Li, Zi Han Ding
## Recitation 10: (Optional) - Information Retrieval

In this lab, you will use scikit-learn's TF-IDF tools to find the most important terms in a large collection of text. You will then implement your own IDF calculation and compare it with scikit-learn's results.

In [2]:
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

As an example, we will create a small list of documents and compute TF-IDF values for them.

In [3]:
#Generate a corpus of documents
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

Next, we will use a CountVectorizer object from scikit-learn to analyze the text, produce the term (word) vocabulary, and calculate the term frequencies per document.

In [4]:
# CountVectorizer - analyzes the text, produces term vocabulary and frequencies
count_vectorizer = CountVectorizer()

# analyze the corpus to produce the vocabulary and the number of word occurrences per document
TF_matrix = count_vectorizer.fit_transform(corpus)

# print the terms (word vocabulary)
print(count_vectorizer.get_feature_names_out())

# print the term frequencies
print(TF_matrix.toarray())

# Convert the frequencies matrix to a DataFrame for better readability
TF_df = pd.DataFrame(TF_matrix.toarray(), columns=count_vectorizer.get_feature_names_out())

# Display the DataFrame
TF_df

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


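| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 3 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |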
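The vocabulary above contains only lowercase words without punctuation. A minimal sketch of why, assuming the default `CountVectorizer` settings (lowercasing enabled, and a token pattern that keeps only tokens of two or more alphanumeric characters): `build_analyzer()` returns the function the vectorizer applies to each document, so it can be called directly to inspect the tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable that CountVectorizer applies to each document
analyzer = CountVectorizer().build_analyzer()

# Text is lowercased and punctuation is discarded
tokens = analyzer('Is this the first document?')
print(tokens)  # ['is', 'this', 'the', 'first', 'document']

# Single-character tokens ("I", "a") are dropped by the default token pattern
short_tokens = analyzer('I saw a cat.')
print(short_tokens)  # ['saw', 'cat']
```

This is why 'This' and 'this' are counted as the same term in the matrix above.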


Next, we will use a transformer object that calculates TF-IDF values from the term frequencies.

In [5]:
# TfidfTransformer - transforms term frequency values to tf-idf values
TF_IDF_transformer = TfidfTransformer()
TF_IDF_matrix = TF_IDF_transformer.fit_transform(TF_matrix)

# Convert to DataFrame (note that we get the feature names from the count vectorizer)
TF_IDF_df = pd.DataFrame(TF_IDF_matrix.toarray(), columns=count_vectorizer.get_feature_names_out())
TF_IDF_df

| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |
| 1 | 0.0 | 0.687624 | 0.0 | 0.281089 | 0.0 | 0.538648 | 0.281089 | 0.0 | 0.281089 |
| 2 | 0.511849 | 0.0 | 0.0 | 0.267104 | 0.511849 | 0.0 | 0.267104 | 0.511849 | 0.267104 |
| 3 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |
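The values in this table are not simply count times IDF. As a rough sketch, assuming the default `TfidfTransformer` settings (IDF weighting on, L2 normalization), each row can be reproduced by multiplying the raw counts by the fitted IDF values and then dividing the row by its Euclidean length. The variable names below are illustrative and not part of the lab.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

count_matrix = CountVectorizer().fit_transform(corpus)
transformer = TfidfTransformer()
sklearn_tfidf = transformer.fit_transform(count_matrix).toarray()

# Step 1: weight each raw count by the IDF of its term (broadcast across rows)
weighted = count_matrix.toarray() * transformer.idf_

# Step 2: divide every row by its Euclidean (L2) length
row_norms = np.linalg.norm(weighted, axis=1, keepdims=True)
manual_tfidf = weighted / row_norms

# Check that the manual result agrees with scikit-learn's output
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Because of the normalization step, every row has length 1, which is why the same term can receive different scores in documents of different lengths.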


Finally, we will use a scikit-learn object that analyzes the corpus and calculates the term frequencies and TF-IDF values in a single operation.

In [6]:
# TfidfVectorizer - calculates TF-IDF values from the corpus in a single step
TF_IDF_vectorizer = TfidfVectorizer()

# Analyze the corpus and produce the tf-idf values
TF_IDF_matrix = TF_IDF_vectorizer.fit_transform(corpus)

# Convert the matrix to a DataFrame
TF_IDF_df = pd.DataFrame(TF_IDF_matrix.toarray(), columns=TF_IDF_vectorizer.get_feature_names_out())
TF_IDF_df

| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |
| 1 | 0.0 | 0.687624 | 0.0 | 0.281089 | 0.0 | 0.538648 | 0.281089 | 0.0 | 0.281089 |
| 2 | 0.511849 | 0.0 | 0.0 | 0.267104 | 0.511849 | 0.0 | 0.267104 | 0.511849 | 0.267104 |
| 3 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |


Furthermore, we can examine the IDF values of the vectorizer.

In [7]:
TF_IDF_vectorizer.idf_

array([1.91629073, 1.22314355, 1.51082562, 1.        , 1.91629073,
       1.91629073, 1.        , 1.91629073, 1.        ])

## Tasks
For this lab, you will run the TF-IDF calculations on books from Project Gutenberg (https://www.gutenberg.org/). These books make up the corpus of documents you are searching through. The books are:
- Alice's Adventures in Wonderland by Lewis Carroll
- A Christmas Carol in Prose; Being a Ghost Story of Christmas by Charles Dickens
- The Works of Edgar Allan Poe — Volume 2 by Edgar Allan Poe
- Great Expectations by Charles Dickens
- The Iliad by Homer
- Les Misérables by Victor Hugo
- Macbeth by William Shakespeare
- The Odyssey by Homer
- Oliver Twist by Charles Dickens
- Poirot Investigates by Agatha Christie
- Pride and Prejudice by Jane Austen
- Romeo and Juliet by William Shakespeare
- On the Origin of Species By Means of Natural Selection by Charles Darwin
- The Time Machine by H. G. Wells
- A Tale of Two Cities by Charles Dickens
- The War of the Worlds by H. G. Wells
- An Inquiry into the Nature and Causes of the Wealth of Nations by Adam Smith

### Task 1
Given a folder of books, read all books as a single corpus of documents (every book is a single document), and run TF-IDF calculations on them. Produce a dataframe like the ones in the examples.

#### Task 1.1
Given a folder, read all files as a single corpus of documents (every file is a single document). Produce a single list of strings.

In [8]:
# Get the path to the directory containing the corpus
path = os.getcwd() + "/public"
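One possible way to approach the file reading is sketched below. The helper name `read_corpus` is hypothetical, and the sketch assumes that the folder contains only plain-text files encoded as UTF-8; adjust the encoding if your files differ.

```python
import os

def read_corpus(folder):
    """Return the contents of every regular file in `folder`, one string per file."""
    documents = []
    # Sort the file names so that the document order is reproducible
    for name in sorted(os.listdir(folder)):
        file_path = os.path.join(folder, name)
        if os.path.isfile(file_path):
            # Assumption: files are UTF-8; undecodable bytes are replaced, not fatal
            with open(file_path, encoding='utf-8', errors='replace') as f:
                documents.append(f.read())
    return documents
```

Calling `read_corpus(path)` would then return a list with one string per book, in alphabetical order of the file names.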


#### Task 1.2
Given a corpus (a list of strings), produce a TF-IDF dataframe, as in the examples.

### Task 2
Given a dataframe of TF-IDF values, find the term with the highest TF-IDF value for each document. You must produce a dataframe with columns **Term** and **Max TF-IDF**. For example, given the simple TF-IDF dataframe above, you should produce a dataframe with rows ('first', 0.580286), ('document', 0.687624), ('and', 0.511849), and ('first', 0.580286), as shown in the example below. NOTE: If multiple terms share the same maximum score, you should just produce the first one of these terms in the original order of the columns.

In [9]:
pd.DataFrame([('first', 0.580286), ('document', 0.687624), ('and', 0.511849),('first', 0.580286)], columns=['Term', 'Max TF-IDF'])

| | Term | Max TF-IDF |
|---|---|---|
| 0 | first | 0.580286 |
| 1 | document | 0.687624 |
| 2 | and | 0.511849 |
| 3 | first | 0.580286 |


### Task 3
Implement your own IDF function that manually calculates the IDF for each term of the corpus. You must use the natural logarithm ( ```np.log(x)``` ) instead of the base-2 logarithm ( ```np.log2(x)``` ) in your implementation. Produce a dataframe with the columns **Term** and **IDF**.

### Task 4
scikit-learn uses its own formula to calculate IDF values, so compare your IDF calculation with scikit-learn's implementation. Given a tolerance value **e**, treat floats **a** and **b** as equal if they are within **e** of each other. Given a corpus and a tolerance value, calculate the percentage of terms whose scikit-learn IDF value equals the value produced by your IDF implementation. The percentage must be a float between 0 and 100. HINT: a TfidfVectorizer object stores the IDF values in the attribute *idf_*.

In [10]:
# check if the difference is within the tolerance
tolerance = 1e-6
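The tolerance comparison itself might look like the sketch below. The function name `percent_within_tolerance` and the two example arrays are made up for illustration; in the task, the inputs would be your own IDF values and the vectorizer's `idf_` array, aligned term by term.

```python
import numpy as np

def percent_within_tolerance(a, b, tolerance):
    """Percentage (0 to 100) of positions where |a - b| <= tolerance."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Element-wise check: True where the two values count as equal
    matches = np.abs(a - b) <= tolerance
    return float(matches.mean() * 100)

# Made-up values: three of the four pairs are within 1e-6 of each other
example_a = [1.0, 2.0, 3.0, 4.0]
example_b = [1.0, 2.0000001, 3.5, 4.0]
print(percent_within_tolerance(example_a, example_b, 1e-6))  # 75.0
```

Taking the mean of a boolean array gives the fraction of `True` entries, so multiplying by 100 yields the percentage directly.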
