# Feature Extraction & TF-IDF

Today, we're going to implement our TF-IDF counter and sketch out the broad outlines of our feature extraction code.

In [None]:
from math import log

### Finding TF: Dictionary counting

For our TF function, we're going to want to count the number of times a word occurs in a document. Then, we must divide by the total length of the document to find the TF value of the word for that document.

For now, each document will be a list of individual words rather than one long string, which makes it easier to work with. We'll learn more about formatting data tomorrow.
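To make the TF arithmetic concrete, here is a minimal sketch on a made-up six-word document. It uses `collections.Counter` from the standard library as one possible way to do dictionary counting; the names `toy_document`, `word_counts`, `tf_the`, and `tf_dog` are illustrative assumptions and not part of the exercise below.

```python
from collections import Counter

# Toy document (hypothetical data), already split into a list of words
toy_document = ["the", "cat", "sat", "on", "the", "mat"]

# Counter builds a dictionary that maps each word to its number of occurrences
word_counts = Counter(toy_document)

# TF = occurrences of the word / total number of words in the document
tf_the = word_counts["the"] / len(toy_document)

# A Counter returns 0 for words it has never seen, so the TF is 0.0
tf_dog = word_counts["dog"] / len(toy_document)

print(tf_the)
print(tf_dog)
```

Here "the" appears 2 times out of 6 words, so its TF is 2/6, while a word that never appears gets a TF of 0.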

In [None]:
# TODO:
# - find the number of times keyword shows up in document (document is a list of words) -> ["The", "Prime", "Minister", ...]
# - find the length of the document
# - output the TF value
def find_tf(keyword, document):
    keyword_count = _______
    return ________/________

### IDF scaling

What we're going to do now is write a function that measures how rare a word is across all documents. We start from the portion of documents that contain the word, then take the log of its inverse: IDF = log(total number of documents / number of documents containing the word). We will later use this value to scale the individual term frequencies for each text document.
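As a quick sanity check on the formula log(total documents / documents containing the word), the sketch below plugs in a few hypothetical counts. The corpus size of 7 and the document counts are assumptions chosen for illustration, and `log` is the natural logarithm from Python's `math` module.

```python
from math import log

total_docs = 7  # hypothetical corpus size

# Map "number of documents containing the word" to the resulting IDF
idf_by_doc_count = {}
for docs_containing in (7, 3, 1):
    idf_by_doc_count[docs_containing] = log(total_docs / docs_containing)

for docs_containing, idf in idf_by_doc_count.items():
    print(docs_containing, round(idf, 3))
```

A word found in all 7 documents gets an IDF of exactly 0, and the IDF grows as the word appears in fewer documents. That is why very common words such as "the" contribute almost nothing after scaling.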

We're going to structure this function to read from a dictionary of text bodies. The keys in the dictionary are IDs, while the values are the documents. In today's example each document is a list of words, just like in the TF section. Tomorrow, during data cleaning, we will use this same ID-to-document dictionary to represent the Fake News Challenge data.

Here is the documentation for the dictionary type. We're going to want a method that lets us loop through the keys and values in a dictionary. Can you find it?

https://docs.python.org/3/tutorial/datastructures.html#dictionaries
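For reference, here is a small sketch of looping over a dictionary's keys and values together with `dict.items()`, which the tutorial linked above describes. The dictionary `toy_corpus` and the result `doc_lengths` are made-up names used only for this illustration.

```python
# Hypothetical two-document corpus: ID -> list of words
toy_corpus = {
    "doc1": ["brazil", "wins"],
    "doc2": ["rain", "in", "brazil"],
}

# items() yields (key, value) pairs, so we can unpack both in the loop header
doc_lengths = {}
for doc_id, words in toy_corpus.items():
    doc_lengths[doc_id] = len(words)

print(doc_lengths)
```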


In [None]:
# Find the idf for a particular keyword for a corpus of documents, which is a dictionary of ids and documents
def find_idf(keyword, corpus):
    docs_containing = 0
    
    # TODO: loop through the items in corpus using a dictionary method
    for (doc_id, doc) in corpus.items():
        if _________:
            docs_containing += 1
    
    total_docs = ______ # total number of documents
    
    if docs_containing == 0:
        return float('inf') # this shouldn't matter since the IDF doesn't mean anything if no documents contain the word
    else:
        return ______ # the idf --> log(total # of documents / # of documents containing the keyword)

In [None]:
# Let's test your IDF function! Here is some example data (7 news articles about Brazil) which
# we have prepared for you. You're welcome to read over this code, but you don't need to do 
# anything with it for now: just run it to get the example data.
# We will learn more about preparing and cleaning data tomorrow.

from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

def get_tokenized_article(article_name):
    # the with-block closes the file once we are done reading it
    with open(article_name, "r") as file:
        raw_file = file.read()
    tokenized_file = word_tokenize(raw_file)
    return tokenized_file

def get_corpus(article_names):
    corpus = {}
    for article_name in article_names:
        corpus[article_name] = get_tokenized_article(article_name)
    return corpus

example_article_names = ["article1.txt", "article2.txt", "article3.txt", "article4.txt", "article5.txt", "article6.txt", "article7.txt"]
example_corpus = get_corpus(example_article_names)
print(example_corpus)

In [None]:
# Let's take a look at some of the tf and idf values. Do these look about right to you?
print("TF")
print(find_tf("the", _______)) #find the tf of "the" in the first article
print(find_tf("a", _______)) #copy what you did in the previous line
print(find_tf("an", _______))
print(find_tf("we", _______))
print(find_tf("Brazil", _______))
print(find_tf("animal", _______))
print(find_tf("of", _______))

print("\nIDF")
print(find_idf("the", ______))
print(find_idf("a", ______))
print(find_idf("an", ______))
print(find_idf("we", ______))
print(find_idf("Brazil", ______))
print(find_idf("animal", ______))
print(find_idf("of", ______))

### Putting it all together

Now, we've written functions that can calculate TF and IDF values for any word in a corpus of documents. Let's put it together to write a TF-IDF function that finds the TF-IDF values for a word in a corpus!
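Before writing the function, it may help to see the multiplication on plain numbers. The sketch below uses invented counts for two words in one 500-word document from a 7-document corpus; every number, along with the names `examples` and `scores`, is an assumption for illustration and not real data from the example articles.

```python
from math import log

total_docs = 7  # hypothetical corpus size

# word -> (occurrences in this document, document length, documents containing the word)
examples = {
    "the": (30, 500, 7),
    "Amazon": (5, 500, 2),
}

scores = {}
for word, (count, doc_length, docs_containing) in examples.items():
    tf = count / doc_length                  # how prominent the word is in this document
    idf = log(total_docs / docs_containing)  # how rare the word is across the corpus
    scores[word] = tf * idf

print(scores)
```

Even though "the" is far more frequent in the document, its IDF is 0 because it appears in every document, so its TF-IDF is 0. The rarer word ends up with the higher score.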

In [None]:
def tf_idf(keyword, corpus):
    idf = _____ #find the idf of the keyword in corpus
    tf_values = {}
    tf_idf_values = {}
    for ____ in ____: #traverse through the corpus dictionary
        tf_values[doc_id] = _____ # find the tf score of the keyword in the doc
        tf_idf_values[doc_id] = ____  #multiply the tf score by the idf
    
    return tf_idf_values

In [None]:
# Test out your function below! Do your results make sense?
tf_idf("Amazon", example_corpus)

### Let's Get Back to the Fake News Challenge
Now let's relate this back to the Fake News Challenge. How can solving a search query help us determine what's fake and what's real? How can TF-IDF serve as a metric that represents each document and headline? Brainstorm among yourselves. (Hint: think about the collection of TF-IDF scores for each unique word in a document.)

To finish off the day, we'll be creating a matrix of TF-IDF scores. Each row is the list of TF-IDF scores for one word in the vocabulary of the corpus, and each column corresponds to a separate document.
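To picture the layout, here is a small sketch that arranges some invented scores into a list of rows, one row per word and one column per document. The scores, `toy_scores`, `doc_ids`, and `matrix` are hypothetical and only illustrate the shape. The exercise below builds the real matrix from `example_corpus`.

```python
# Hypothetical TF-IDF scores: word -> {doc_id: score}
toy_scores = {
    "amazon": {"doc1": 0.02, "doc2": 0.0},
    "rain": {"doc1": 0.0, "doc2": 0.05},
    "brazil": {"doc1": 0.0, "doc2": 0.0},
}

doc_ids = ["doc1", "doc2"]  # fixed column order
words = sorted(toy_scores)  # fixed row order

# One row per word, one column per document
matrix = [[toy_scores[word][doc_id] for doc_id in doc_ids] for word in words]

for word, row in zip(words, matrix):
    print(word, row)
```

Reading down a single column gives one number per vocabulary word for that document, which is the kind of collection of scores that can represent a document as features.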

In [None]:
vocabulary_corpus = ____ #which Python collection keeps only unique elements?
for _____ in ____: #traverse through the example_corpus
    vocabulary_corpus = vocabulary_corpus.union(set(example_corpus[____]))
#print(vocabulary_corpus)
tfidf_matrix = []
r = 0 #row index counter
for ____ in ____: #traverse through vocabulary_corpus
    tfidf = tf_idf(____, example_corpus)
    tfidf_matrix.append([]) #start a new row for this word
    for (doc_id, score) in tfidf.items():
        tfidf_matrix[r].append(______)
    r += 1
r = 0
for val in vocabulary_corpus:
    print(val, end = ' ')
    print(tfidf_matrix[r])
    r += 1

### Challenge: Search function

If you have extra time, try using our TF-IDF calculations to return the most relevant document from a corpus, based on a list of keywords!
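One piece of this challenge is picking the document with the highest total score. As a hint, the sketch below shows one way to do that step with the built-in `max` and a `key` function. The totals in `toy_totals` are invented, and computing the real totals is left to you.

```python
# Hypothetical summed TF-IDF totals: doc_id -> total score for the keywords
toy_totals = {"doc1": 0.012, "doc2": 0.047, "doc3": 0.0}

# max() iterates over the keys; key=toy_totals.get ranks each key by its value
best_doc = max(toy_totals, key=toy_totals.get)

print(best_doc)
```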

In [None]:
# TODO:
# - get the TF-IDF value of each keyword for each document
# - sum the TF-IDF values to find the total TF-IDF value of those keywords for that document
# - return the ID of the document with the highest value
def get_most_relevant(keywords, corpus):
    pass