# Feature Extraction & TF-IDF

Today, we're going to implement our TF-IDF counter and sketch out the broad outlines of our feature extraction code. Keep in mind that everything we write should be compatible with the cleaning and loading code we wrote yesterday, since that's the data we'll be extracting features from!

In [None]:
from math import log

### Finding TF: Dictionary counting

For our TF function, we're going to want to count the number of times a word occurs in a document. Then, we divide that count by the total number of words in the document to find the TF value of the word for that document.

For now, we will have each document be a list of individual words, rather than one long string. This makes it easier to work with, because counting in a list matches whole words only. We'll learn more about formatting data tomorrow.

In [None]:
# TODO:
# - find the number of times keyword shows up in document
# - find the length of the document
# - output the TF value
def find_tf(keyword, document):
    # document is expected to be a list of words; an empty document has a TF of 0
    if len(document) == 0:
        return 0.0
    keyword_count = document.count(keyword)
    return keyword_count / len(document)
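
As a quick sanity check of the definition above, here is a minimal sketch on a made-up document. The helper `term_frequency` and the sample text are illustrative stand-ins (not part of the notebook's own code) so that the example runs on its own. It also shows why we count over a list of words rather than a raw string.

```python
# Hypothetical stand-in for find_tf so this example is self-contained.
def term_frequency(keyword, words):
    return words.count(keyword) / len(words)

# An illustrative document, already split into a list of 12 words.
document = "it was the best of times it was the worst of times".split()

print(term_frequency("times", document))  # 2 occurrences out of 12 words
print(term_frequency("best", document))   # 1 occurrence out of 12 words
print(term_frequency("hope", document))   # 0 occurrences, so the TF is 0.0

# Counting on a raw string matches substrings, not whole words:
# "the" is also found inside "other".
print("the other way".count("the"))          # 2
print("the other way".split().count("the"))  # 1
```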

### IDF scaling

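What we're going to do now is write a function that finds the inverse document frequency (IDF) of a word: the log of the total number of documents divided by the number of documents that contain the word. Words that show up in nearly every document get an IDF close to zero, while rarer words get a larger one. We will later use this value to scale the term frequency of the word in each text document.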

We're going to structure this function to read from a dictionary of text bodies. The keys in the dictionary are IDs, while the values are the documents, which are long strings. Tomorrow, during data cleaning, this is the format we will use to represent the Fake News Challenge data.

Here is the documentation for the dictionary type. We're going to want a method that lets us loop through the keys and values in a dictionary at the same time. Can you find it?

https://docs.python.org/3/tutorial/datastructures.html#dictionaries


In [None]:
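# Find the IDF of a keyword for a corpus of documents, which is a dictionary of IDs and documents
def find_idf(keyword, corpus):
    docs_containing = 0

    # TODO: loop through the items in corpus using a dictionary method
    for (doc_id, doc) in corpus.items():
        # Split the string into words so that "a" only matches the word "a",
        # not every document containing the letter a. Punctuation still
        # sticks to words here (e.g. "times,"), which cleaning will handle.
        if keyword in doc.split():
            docs_containing += 1

    total_docs = len(corpus)

    # A keyword found in no document would cause a division by zero.
    # Its TF is 0 everywhere anyway, so we return 0.0 in that case.
    if docs_containing == 0:
        return 0.0

    return log(total_docs / docs_containing)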
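
To see how IDF behaves, here is a small worked sketch on a made-up corpus. The names `toy_corpus` and `inverse_document_frequency` are hypothetical and exist only for this illustration. It assumes whole-word matching on whitespace-split text and the natural log from `math.log`.

```python
from math import log

# Hypothetical toy corpus: IDs mapped to document strings.
toy_corpus = {
    0: "the cat sat on the mat",
    1: "the dog chased the cat",
    2: "the bird sang",
    3: "a dog barked",
}

# Illustrative helper: log(total documents / documents containing the word).
# Assumes the keyword appears in at least one document.
def inverse_document_frequency(keyword, corpus):
    docs_containing = sum(1 for doc in corpus.values() if keyword in doc.split())
    return log(len(corpus) / docs_containing)

idf_the = inverse_document_frequency("the", toy_corpus)    # in 3 of 4 documents
idf_cat = inverse_document_frequency("cat", toy_corpus)    # in 2 of 4 documents
idf_bird = inverse_document_frequency("bird", toy_corpus)  # in 1 of 4 documents

print(round(idf_the, 3))   # 0.288
print(round(idf_cat, 3))   # 0.693
print(round(idf_bird, 3))  # 1.386
```

The rarer the word, the larger its IDF, which is what lets distinctive words outweigh common ones like "the".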

In [None]:
# Let's test your IDF function! Here is some example data from Charles Dickens. Each document
# is one sentence in the paragraph.

dickens_text = "It was the best of times, \
it was the worst of times, \
it was the age of wisdom, \
it was the age of foolishness, \
it was the epoch of belief, \
it was the epoch of incredulity, \
it was the season of Light, \
it was the season of Darkness, \
it was the spring of hope, \
it was the winter of despair, \
we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way— in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.\
There were a king with a large jaw and a queen with a plain face, on the throne of England; there were a king with a large jaw and a queen with a fair face, on the throne of France. In both countries it was clearer than crystal to the lords of the State preserves of loaves and fishes, that things in general were settled for ever.\
It was the year of Our Lord one thousand seven hundred and seventy-five. Spiritual revelations were conceded to England at that favoured period, as at this. Mrs. Southcott had recently attained her five-and-twentieth blessed birthday, of whom a prophetic private in the Life Guards had heralded the sublime appearance by announcing that arrangements were made for the swallowing up of London and Westminster. Even the Cock-lane ghost had been laid only a round dozen of years, after rapping out its messages, as the spirits of this very year last past (supernaturally deficient in originality) rapped out theirs. Mere messages in the earthly order of events had lately come to the English Crown and People, from a congress of British subjects in America: which, strange to relate, have proved more important to the human race than any communications yet received through any of the chickens of the Cock-lane brood.\
France, less favoured on the whole as to matters spiritual than her sister of the shield and trident, rolled with exceeding smoothness down hill, making paper money and spending it. Under the guidance of her Christian pastors, she entertained herself, besides, with such humane achievements as sentencing a youth to have his hands cut off, his tongue torn out with pincers, and his body burned alive, because he had not kneeled down in the rain to do honour to a dirty procession of monks which passed within his view, at a distance of some fifty or sixty yards. It is likely enough that, rooted in the woods of France and Norway, there were growing trees, when that sufferer was put to death, already marked by the Woodman, Fate, to come down and be sawn into boards, to make a certain movable framework with a sack and a knife in it, terrible in history. It is likely enough that in the rough outhouses of some tillers of the heavy lands adjacent to Paris, there were sheltered from the weather that very day, rude carts, bespattered with rustic mire, snuffed about by pigs, and roosted in by poultry, which the Farmer, Death, had already set apart to be his tumbrils of the Revolution. But that Woodman and that Farmer, though they work unceasingly, work silently, and no one heard them as they went about with muffled tread: the rather, forasmuch as to entertain any suspicion that they were awake, was to be atheistical and traitorous.\
In England, there was scarcely an amount of order and protection to justify much national boasting. Daring burglaries by armed men, and highway robberies, took place in the capital itself every night; families were publicly cautioned not to go out of town without removing their furniture to upholsterers' warehouses for security; the highwayman in the dark was a City tradesman in the light, and, being recognised and challenged by his fellow-tradesman whom he stopped in his character of the Captain, gallantly shot him through the head and rode away; the mail was waylaid by seven robbers, and the guard shot three dead, and then got shot dead himself by the other four, in consequence of the failure of his ammunition: after which the mail was robbed in peace; that magnificent potentate, the Lord Mayor of London, was made to stand and deliver \
on Turnham Green, by one highwayman, who despoiled the illustrious creature in sight of all his retinue; prisoners in London gaols fought battles with their turnkeys, and the majesty of the law fired blunderbusses in among them, loaded with rounds of shot and ball; thieves snipped off diamond crosses from the necks of noble lords at Court drawing-rooms; musketeers went into St. Giles's, to search for contraband goods, and the mob fired on the musketeers, and the musketeers fired on the mob, and nobody thought any of these occurrences much out of the common way. In the midst of them, the hangman, ever busy and ever worse than useless, was in constant requisition; now, stringing up long rows of miscellaneous criminals; now, hanging a housebreaker on Saturday who had been taken on Tuesday; now, burning people in the hand at Newgate by the dozen, and now burning pamphlets at the door of Westminster Hall; to-day, taking the life of an atrocious murderer, and to-morrow of a wretched pilferer who had robbed a farmer's boy of sixpence.\
All these things, and a thousand like them, came to pass in and close upon the dear old year one thousand seven hundred and seventy-five. Environed by them, while the Woodman and the Farmer worked unheeded, those two of the large jaws, and those other two of the plain and the fair faces, trod with stir enough, and carried their divine rights with a high hand. Thus did the year one thousand seven hundred and seventy-five conduct their Greatnesses, and myriads of small creatures—the creatures of this chronicle among the rest—along the roads that lay before them."
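# Split on periods to get rough sentences, dropping empty pieces.
# (The text ends with a period, so the last piece would otherwise be an empty document.)
example_sentences = [s.strip() for s in dickens_text.split('.') if s.strip()]
example_corpus = {}
for i, sentence in enumerate(example_sentences):
    example_corpus[i] = sentence
print(example_corpus[0])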

In [None]:
# Let's take a look at some of the idf values. Do these look about right to you?
print(find_idf("the", example_corpus))
print(find_idf("a", example_corpus))
print(find_idf("an", example_corpus))
print(find_idf("we", example_corpus))
print(find_idf("of", example_corpus))
print(find_idf("France", example_corpus))

### Putting it all together

Now, we've written functions that can calculate TF and IDF values for any word in a corpus of documents. Let's put it together to write a TF-IDF function that finds the TF-IDF values for a word in a corpus!

In [None]:
def tf_idf(keyword, corpus):
    idf = find_idf(keyword, corpus)
    tf_idf_values = {}
    for (doc_id, doc) in corpus.items():
        # Corpus documents are strings, but find_tf expects a list of words
        tf = find_tf(keyword, doc.split())
        tf_idf_values[doc_id] = tf * idf

    return tf_idf_values
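
To make the combination concrete, here is a self-contained sketch on a tiny made-up corpus. `toy_corpus` and `toy_tf_idf` are hypothetical names used only for this illustration. It assumes whitespace tokenization and a keyword that appears in at least one document.

```python
from math import log

# Hypothetical three-document corpus: IDs mapped to document strings.
toy_corpus = {
    0: "the cat sat on the mat",
    1: "the dog chased the cat",
    2: "the bird sang",
}

# Illustrative helper: TF of the keyword in each document, scaled by its IDF.
def toy_tf_idf(keyword, corpus):
    tokenized = {doc_id: doc.split() for doc_id, doc in corpus.items()}
    docs_containing = sum(1 for words in tokenized.values() if keyword in words)
    idf = log(len(corpus) / docs_containing)
    return {
        doc_id: words.count(keyword) / len(words) * idf
        for doc_id, words in tokenized.items()
    }

# "the" appears in every document, so its IDF is log(3/3) = 0
# and every TF-IDF score is 0.0, however often it occurs.
scores_the = toy_tf_idf("the", toy_corpus)
print(scores_the)

# "cat" appears in two of three documents. Document 1 is shorter than
# document 0, so its single "cat" gives a higher TF and a higher score.
scores_cat = toy_tf_idf("cat", toy_corpus)
print(scores_cat)
```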

In [None]:
# Test out your function below! Do your results make sense?


### Challenge: Search function

If you have extra time, try using our TF-IDF calculations to return the most relevant document from a corpus, based on a list of keywords!

In [None]:
# TODO:
# - get the TF-IDF value of each keyword for each document
# - sum the TF-IDF values to find the total TF-IDF value of those keywords for that document
# - return the ID of the document with the highest value
def get_most_relevant(keywords, corpus):
    pass
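
If you get stuck, here is one possible approach, written as a self-contained sketch rather than a definitive solution. The name `rank_documents_sketch` and the toy corpus are hypothetical. It assumes whitespace tokenization, skips keywords that appear in no document, and expects a non-empty corpus.

```python
from math import log

def rank_documents_sketch(keywords, corpus):
    # Tokenize each document once so we don't re-split it for every keyword.
    tokenized = {doc_id: doc.split() for doc_id, doc in corpus.items()}
    totals = {doc_id: 0.0 for doc_id in corpus}

    for keyword in keywords:
        docs_containing = sum(1 for words in tokenized.values() if keyword in words)
        if docs_containing == 0:
            continue  # keyword appears nowhere, so it contributes nothing
        idf = log(len(corpus) / docs_containing)
        for doc_id, words in tokenized.items():
            if words:  # skip empty documents to avoid dividing by zero
                totals[doc_id] += words.count(keyword) / len(words) * idf

    # Return the ID whose summed TF-IDF score is highest.
    return max(totals, key=totals.get)

# Hypothetical toy corpus for trying the sketch out.
toy_corpus = {
    0: "the cat sat on the mat",
    1: "the dog chased the cat",
    2: "a dog barked loudly",
}
best_id = rank_documents_sketch(["dog", "barked"], toy_corpus)
print(best_id)  # 2
```

Summing the per-keyword scores is only one way to combine them. Ties go to whichever document comes first in the dictionary.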