# Feature Extraction & TF-IDF

Today, we're going to implement our TF-IDF counter and sketch out the broad outlines of our feature extraction code. Keep in mind that we want everything we write to be compatible with our data cleaning and loading code, since that's the data that we'll be extracting features from!

In [1]:
from math import log

### Finding TF: Dictionary counting

For our TF (term frequency) function, we want to count the number of times a word occurs in a document, then divide by the total number of words in the document to find the TF value of the word for that document.

For now, we will have each document be a list of individual words, rather than one long string. This makes it easier to work with. We'll learn more about formatting data tomorrow.

In [2]:
# TODO:
# - find the number of times keyword shows up in document
# - find the length of the document
# - output the TF value
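def find_tf(keyword, document):
    # an empty document contains no words, so return 0 rather than dividing by zero
    if len(document) == 0:
        return 0.0
    keyword_count = document.count(keyword)
    return keyword_count / len(document)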
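
To get a feel for TF values, here is a small sketch on a made-up document. The document contents and the helper name `tf_for_all_words` are illustrative assumptions, not part of the course data. The helper uses `collections.Counter` to compute the TF of every word in one pass, which should agree with calling `find_tf` one word at a time.

```python
from collections import Counter

def tf_for_all_words(document):
    """Return {word: term frequency} for a document given as a list of words."""
    if not document:
        return {}
    counts = Counter(document)
    total = len(document)
    return {word: count / total for word, count in counts.items()}

# hypothetical toy document, already split into words
toy_document = ["the", "cat", "sat", "on", "the", "mat"]
tf_values = tf_for_all_words(toy_document)
print(tf_values["the"])  # 2 occurrences out of 6 words
print(tf_values["cat"])  # 1 occurrence out of 6 words
```

Counting every word at once is handy when we need TF values for many keywords, since `document.count` rescans the whole list on each call.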

### IDF scaling

What we're going to do now is write a function that measures how rare a word is across all documents. This is the IDF (inverse document frequency): the log of the total number of documents divided by the number of documents that contain the word. We will later use this value to scale the term frequency of the word in each text document.
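
The function we'll write takes the log of the total number of documents over the number of documents containing the word. As a formula, with $N$ the total number of documents and $n_t$ the number of documents containing the word $t$ (symbols chosen here for illustration), and using the natural logarithm that `math.log` computes by default:

```latex
\mathrm{idf}(t) = \log\frac{N}{n_t}
\qquad\text{e.g. for } N = 7:\quad
n_t = 7 \;\Rightarrow\; \mathrm{idf}(t) = \log 1 = 0,
\qquad
n_t = 1 \;\Rightarrow\; \mathrm{idf}(t) = \log 7 \approx 1.946
```

So a word that appears in every document gets a weight of 0, and the rarer a word is, the larger its weight.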

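We're going to structure this function to read from a dictionary of text bodies. The keys in the dictionary are IDs, while the values are the documents. As in the TF section, each document is a list of words for now. (If a document were one long string instead, `keyword in doc` would also match substrings, such as "an" inside "and".) Tomorrow, during data cleaning, this is the format we will use to represent the Fake News Challenge data.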

Here is the documentation for the dictionary type. We're going to want a method that lets us loop through the keys and values in a dictionary together. Can you find it?

https://docs.python.org/3/tutorial/datastructures.html#dictionaries


In [3]:
# Find the IDF of a keyword in a corpus, which is a dictionary mapping IDs to documents
def find_idf(keyword, corpus):
    docs_containing = 0
    
    # TODO: loop through the items in corpus using a dictionary method
    for (doc_id, doc) in corpus.items():
        if keyword in doc:
            docs_containing += 1
    
    total_docs = len(corpus)
    
    if docs_containing == 0:
        # no document contains the keyword, so the IDF doesn't mean anything;
        # return infinity as a placeholder
        return float('inf')
    else:
        return log(total_docs / docs_containing)
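
Here is a small sketch of the same IDF logic on made-up data, showing why it matters whether each document is a list of words or one long string. The function name `toy_idf` and the three toy documents are assumptions for illustration only.

```python
from math import log

def toy_idf(keyword, corpus):
    """IDF of keyword over a {doc_id: document} dictionary."""
    docs_containing = sum(1 for doc in corpus.values() if keyword in doc)
    if docs_containing == 0:
        return float("inf")  # placeholder: no document contains the keyword
    return log(len(corpus) / docs_containing)

# hypothetical data: the same three documents as word lists and as raw strings
tokenized = {
    1: ["brazil", "and", "peru"],
    2: ["many", "visit", "brazil"],
    3: ["an", "animal"],
}
raw = {doc_id: " ".join(words) for doc_id, words in tokenized.items()}

print(toy_idf("brazil", tokenized))  # in 2 of 3 documents: log(3/2)
print(toy_idf("an", tokenized))      # in 1 of 3 documents: log(3)
print(toy_idf("an", raw))            # substring of all 3 strings: log(1) = 0.0
```

With word lists, `in` checks for a whole-word match. With strings, "an" is also found inside "and", "many", and "animal", so the word looks far more common than it really is.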

In [5]:
# Let's test your IDF function! Here is some example data (7 news articles about Brazil) which
# we have prepared for you. You're welcome to read over this code, but you don't need to do 
# anything with it for now: just run it to get the example data.
# We will learn more about preparing and cleaning data tomorrow.

from nltk.tokenize import word_tokenize

def get_tokenized_article(article_name):
    # use a context manager so the file is closed once we've read it
    with open(article_name, "r") as file:
        raw_file = file.read()
    tokenized_file = word_tokenize(raw_file)
    return tokenized_file

def get_corpus(article_names):
    corpus = {}
    for article_name in article_names:
        corpus[article_name] = get_tokenized_article(article_name)
    return corpus

# the seven example articles: Week_2/article1.txt through Week_2/article7.txt
example_article_names = ["Week_2/article" + str(i) + ".txt" for i in range(1, 8)]
example_corpus = get_corpus(example_article_names)

FileNotFoundError: [Errno 2] No such file or directory: 'Week_2/article2.txt'

In [None]:
# Let's take a look at some of the idf values. Do these look about right to you?
print(find_idf("the", example_corpus))
print(find_idf("a", example_corpus))
print(find_idf("an", example_corpus))
print(find_idf("we", example_corpus))
print(find_idf("Brazil", example_corpus))
print(find_idf("animal", example_corpus))
print(find_idf("of", example_corpus))

### Putting it all together

Now, we've written functions that can calculate TF and IDF values for any word in a corpus of documents. Let's put it together to write a TF-IDF function that finds the TF-IDF values for a word in a corpus!

In [None]:
def tf_idf(keyword, corpus):
    idf = find_idf(keyword, corpus)
    tf_idf_values = {}
    for (doc_id, doc) in corpus.items():
        tf = find_tf(keyword, doc)
        # if the keyword isn't in this document, its score is 0; checking this first
        # also avoids 0 * inf = nan when no document contains the keyword
        tf_idf_values[doc_id] = tf * idf if tf > 0 else 0.0
    
    return tf_idf_values

In [None]:
# Test out your function below! Do your results make sense?
tf_idf("Amazon", example_corpus)
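
If the example articles aren't available on your machine, you can still check the idea on a tiny in-memory corpus. This is a minimal sketch: the name `toy_tf_idf` and the lowercase toy documents are made up for illustration, and it assumes every document is a non-empty list of words.

```python
from math import log

def toy_tf_idf(keyword, corpus):
    """Return {doc_id: TF-IDF of keyword} for a {doc_id: list of words} corpus."""
    docs_containing = sum(1 for doc in corpus.values() if keyword in doc)
    if docs_containing == 0:
        # the keyword appears nowhere, so score every document 0
        return {doc_id: 0.0 for doc_id in corpus}
    idf = log(len(corpus) / docs_containing)
    # assumes no document is empty, so len(doc) is never 0
    return {doc_id: doc.count(keyword) / len(doc) * idf
            for doc_id, doc in corpus.items()}

# hypothetical toy corpus
toy_corpus = {
    "a": ["the", "amazon", "river"],
    "b": ["the", "amazon", "amazon", "forest"],
    "c": ["the", "city", "of", "rio"],
}
print(toy_tf_idf("the", toy_corpus))     # "the" is in every document, so IDF is 0
print(toy_tf_idf("amazon", toy_corpus))  # highest for "b", zero for "c"
```

Notice that "the" scores 0 everywhere even though it is frequent in each document, while "amazon" scores highest in the document where it makes up the largest share of the words.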

### Challenge: Search function

If you have extra time, try using our TF-IDF calculations to return the most relevant document from a corpus, based on a list of keywords!

In [None]:
# TODO:
# - get the TF-IDF value of each keyword for each document
# - sum the TF-IDF values to find the total TF-IDF value of those keywords for that document
# - return the ID of the document with the highest value
def get_most_relevant(keywords, corpus):
    pass
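
If you get stuck, here is one possible sketch of the steps in the TODO list, written to run on its own with a made-up corpus. The names `keyword_score` and `most_relevant_sketch` and the toy documents are assumptions for illustration; it is one way to approach the challenge, not the only one, and it assumes a non-empty corpus of non-empty word lists.

```python
from math import log

def keyword_score(keyword, doc, corpus):
    """TF-IDF of keyword in one document (a non-empty list of words)."""
    docs_containing = sum(1 for other in corpus.values() if keyword in other)
    if docs_containing == 0:
        return 0.0  # keyword appears nowhere, so it can't favor any document
    tf = doc.count(keyword) / len(doc)
    return tf * log(len(corpus) / docs_containing)

def most_relevant_sketch(keywords, corpus):
    """Return the ID of the document with the highest summed TF-IDF score."""
    totals = {
        doc_id: sum(keyword_score(keyword, doc, corpus) for keyword in keywords)
        for doc_id, doc in corpus.items()
    }
    # max over the IDs, compared by their total score
    return max(totals, key=totals.get)

# hypothetical toy corpus
toy_corpus = {
    "rivers": ["the", "amazon", "river", "floods"],
    "cities": ["the", "city", "of", "rio"],
    "forest": ["the", "amazon", "forest", "is", "an", "amazon", "treasure"],
}
print(most_relevant_sketch(["amazon", "river"], toy_corpus))  # rivers
```

If two documents tie, `max` returns whichever comes first in the dictionary, so you may want a different tie-breaking rule for real searches.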