# Headline Cleaning Part 2: Tidying the Data
## An incomplete notebook to get you started. 
---
The purpose of this file is to get you started on the full headline cleaning challenge.

---

Now that we're familiar with the headline data, we're ready to _really_ clean this data up.

In this workbook, we're going to make two output files. Here's what should go in them:
* `clean_headlines.txt`: This should be a tab-delimited text file with a column for the paper, the date (in `YYYY-MM-DD` format), and the text of the headline. 
* `headline_word_count.txt`: This file should be tab-delimited with columns for the paper, the date (same format as above), each word the paper used on that date, and the count of that word. A word is a run of characters separated from other characters by whitespace, with punctuation removed and letters cast to lowercase. For example, "Mr. Bean" would be two words, "mr" and "bean". 

Let's work through an example so you can see what I mean. Here's a list of the headlines for the _Missoulian_ for 2015-09-23:
* After EWU air raid, Bobcats switch focus to Cal Poly option
* Alternative art gallery FrontierSpace to hold annual art auction
* Does state gets passing grade for education funding? Group aims to find out
* Fall films have issues
* Family in need after Bonner fire destroyed home, killed cat
* Headwaters Dance Company's "last hurrah" concerts are next week
* Montana coaching tree gathers for fundraiser
* Mountain Line to detour Saturday due to UM Homecoming Parade
* Rabid bat found at picnic area north of Helena
* Ravalli County attorney's daughter jailed in sibling assault
* Search continues today for missing Kalispell bow hunter
* Tester to escort Pope Francis to House chambers

In `clean_headlines.txt` we'd expect to see a row that looks like this:
Missoulian    2015-09-23    After EWU air raid, Bobcats switch focus to Cal Poly option

In `headline_word_count.txt` we'd expect to see a row that looks like this:
Missoulian    2015-09-23    after    2

(Note that each gap between fields should be a single tab.)


In [None]:
from collections import defaultdict
from pprint import pprint
from string import punctuation


working_dir =  "C:\\Users\\jchan\\Dropbox\\Teaching\\AppliedDataAnalytics\\Code\\headline-cleaning\\"

input_files = ["missoula.txt","sidney.txt","butte.txt","bozeman.txt","billings.txt"]
paper_names = ["Missoulian","Sidney Herald","Montana Standard","Bozeman Daily Chronicle","Billings Gazette"]

Let's open up one of these files and see what we have. 

In [None]:
this_input = input_files[0]

# Use the same encoding as the cells below so odd characters don't break the read.
with open(working_dir + this_input, encoding="Latin-1") as infile :
    for idx,row in enumerate(infile) :
        print(row)
        if idx > 2 :
            break

Still a mess. I'm going to write a function that takes the paper in and builds a well-formatted date from these ugly dates. 

First, I'm just going to read in all the dates and keep them in order so that I can see what's going on. This step could be done in Excel too.

In [None]:
dates = defaultdict(list) # paper keying a list of dates

for this_input in input_files : # notice, by not hardcoding the input file I can quickly wrap the above cell in a `for`
    with open(working_dir + this_input,'r',encoding="Latin-1") as infile :
        
        these_dates = infile.readline().strip().split("\t")
        paper = paper_names[input_files.index(this_input)] # note the use of index here. 
        
        dates[paper] = these_dates


In [None]:
# Now I'm going to print the paper and the dates in "clean" fashion 
# to try to figure out where the 2015/2016 break is for each paper.
for paper in dates :
    print(paper + " : " + " ".join(dates[paper]))
    print("\n\n")
    

Okay, now we're in a position to write a function that takes an ugly date and makes a pretty one. We're going to need the paper as an input into the function. I've put a stub here as well as a set of `assert` statements you can use to test your code.

In [None]:
def reformat_date(ugly_date, the_paper) :
    '''
        Takes as input a date of the form "D-MMM" (for example "23-Sep") and
        returns a date of the form "YYYY-MM-DD". Note that we have to do some
        work on years. Dates in Oct, Nov, Dec are *almost* all 2015. 
        
        For every paper, dates that fall after 21-Sep on the calendar are in
        2015, and dates on or before 21-Sep are in 2016. The one exception is
        Bozeman Daily Chronicle (BZN), where 28-Sep and 5-Oct are in 2016.
    '''
    # Here's where you'll put your code to handle the formatting of 
    # ugly dates into pretty dates. 
    pass
    

In [None]:
# Lots of testing here to make sure I've got what I expect.
assert("2015-09-23" == reformat_date("23-Sep","Missoulian"))
assert("2015-09-30" == reformat_date("30-Sep","Missoulian"))
assert("2015-12-23" == reformat_date("23-Dec","Missoulian"))
assert("2016-01-06" == reformat_date("6-Jan","Missoulian"))
assert("2016-03-09" == reformat_date("9-Mar","Missoulian"))
assert("2016-06-01" == reformat_date("1-Jun","Missoulian"))
assert("2016-09-21" == reformat_date("21-Sep","Missoulian"))

assert("2015-09-22" == reformat_date("22-Sep","Sidney Herald"))
assert("2015-09-30" == reformat_date("30-Sep","Sidney Herald"))
assert("2016-06-01" == reformat_date("1-Jun","Sidney Herald"))
assert("2016-09-21" == reformat_date("21-Sep","Sidney Herald"))

assert("2015-09-22" == reformat_date("22-Sep","Montana Standard"))
assert("2015-09-30" == reformat_date("30-Sep","Montana Standard"))
assert("2015-10-28" == reformat_date("28-Oct","Montana Standard"))
assert("2016-06-01" == reformat_date("1-Jun","Montana Standard"))
assert("2016-09-21" == reformat_date("21-Sep","Montana Standard"))

assert("2015-09-22" == reformat_date("22-Sep","Billings Gazette"))
assert("2015-09-30" == reformat_date("30-Sep","Billings Gazette"))
assert("2015-10-28" == reformat_date("28-Oct","Billings Gazette"))
assert("2016-06-01" == reformat_date("1-Jun","Billings Gazette"))
assert("2016-09-21" == reformat_date("21-Sep","Billings Gazette"))

# Bozeman: Standard ones plus special cases
assert("2015-09-23" == reformat_date("23-Sep","Bozeman Daily Chronicle"))
assert("2015-09-30" == reformat_date("30-Sep","Bozeman Daily Chronicle"))
assert("2015-10-28" == reformat_date("28-Oct","Bozeman Daily Chronicle"))
assert("2016-06-01" == reformat_date("1-Jun","Bozeman Daily Chronicle"))
assert("2016-09-21" == reformat_date("21-Sep","Bozeman Daily Chronicle"))

# Bozeman special cases (21-Sep is already covered above)
assert("2016-09-28" == reformat_date("28-Sep","Bozeman Daily Chronicle"))
assert("2016-10-05" == reformat_date("5-Oct","Bozeman Daily Chronicle"))
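
If you want something to compare your own attempt against, here is one way the year rule could be coded. Treat it as a sketch rather than the intended solution: the name `reformat_date_sketch` and the `MONTHS` and `BOZEMAN_2016` lookups are made up for this example, and the rule it encodes (after 21-Sep means 2015, otherwise 2016, plus the two Bozeman exceptions) is taken from the docstring and the assertions above.

```python
# Hypothetical lookup from month abbreviation to month number.
MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

# Assumption: these are the only Bozeman dates that break the general rule.
BOZEMAN_2016 = {"28-Sep", "5-Oct"}


def reformat_date_sketch(ugly_date, the_paper):
    """Turn a "D-MMM" date into "YYYY-MM-DD" using the 21-Sep cutoff."""
    ugly_date = ugly_date.strip()
    day_str, month_str = ugly_date.split("-")
    day, month = int(day_str), MONTHS[month_str]

    # Dates after 21-Sep on the calendar are 2015; everything else is 2016.
    year = 2015 if (month, day) > (9, 21) else 2016

    # Bozeman has two late dates that belong to 2016.
    if the_paper == "Bozeman Daily Chronicle" and ugly_date in BOZEMAN_2016:
        year = 2016

    return f"{year}-{month:02d}-{day:02d}"


print(reformat_date_sketch("23-Sep", "Missoulian"))
print(reformat_date_sketch("5-Oct", "Bozeman Daily Chronicle"))
```

Comparing `(month, day)` tuples keeps the cutoff in one place, and the `:02d` format spec handles the zero-padding for single-digit days and months.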

---

Okay, now it's time for you to shine. Write some code that loops through the files and creates our first output file. This file should have columns for the newspaper, the date, and the headline. 

In [None]:
# Code for creating the first output file can start here.

In [None]:
# Here's a place to put code that writes out the data to `clean_headlines.txt`
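
As a starting point, here is a rough sketch of turning the wide input layout into one row per headline and joining the fields with tabs. It rests on an assumption you should check against the real files: that the first line holds one date per column (as the date-reading cell above suggests) and that each later line holds one headline per column. The function name `wide_to_rows` and the sample data are hypothetical, and the second column of the sample is placeholder text, not a real headline.

```python
def wide_to_rows(lines, paper):
    """Yield (paper, raw_date, headline) tuples from a wide, tab-delimited layout.

    ASSUMPTION: the first line holds one date per column and every later line
    holds one headline per column, with empty cells where a date has fewer
    headlines. Check this against the real files before relying on it.
    """
    lines = iter(lines)
    dates = next(lines).rstrip("\n").split("\t")
    for line in lines:
        cells = line.rstrip("\n").split("\t")
        for raw_date, headline in zip(dates, cells):
            headline = headline.strip()
            if headline:  # skip empty cells
                yield (paper, raw_date, headline)


# Tiny in-memory stand-in for a file; the second column is placeholder text.
sample = [
    "23-Sep\t30-Sep\n",
    "Fall films have issues\tplaceholder headline\n",
    "Rabid bat found at picnic area north of Helena\t\n",
]

rows = list(wide_to_rows(sample, "Missoulian"))

# Join each row's fields with a tab, one row per line.
output_text = "".join("\t".join(row) + "\n" for row in rows)
print(output_text)
```

In the real notebook you would pass each raw date through `reformat_date` before writing, and send the text to a file opened with `open(..., "w")` instead of printing it. Plain `"\t".join(...)` is used here so that headlines containing double quotes are written exactly as they appear.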

---
Having done that, we want to create the second output file, which does word counts by paper and date. Our lives may be easier if we start by writing a function that takes a headline as input and returns a list of the words in that headline with punctuation removed and everything in lowercase. For reasons that are unclear to me, I'm just giving you my function that does this. Let me know if you have any questions.

In [None]:
def clean_headline(hl) :
    ''' Removes punctuation, changes to lowercase, and splits on whitespace.
        Returns a list of the words in a headline'''
    
    # Let's define our excluded characters. 
    excluded_chars = set(punctuation)
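    # Curly quotes are missed by `punctuation`. Use `update` so each character
    # is added on its own; `add` would insert the whole tuple as one element.
    excluded_chars.update(("`","‘","’"))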

    # now, take the headline, cast it to lower, remove the punctuation characters
    # and return a list of words. 
    hl = hl.lower()
    hl = ''.join([ch for ch in hl if ch not in excluded_chars])
    return(hl.split())


In [None]:
# And, as always, let's test.
assert(clean_headline("A headline test.") == ['a','headline','test'])
assert(clean_headline("Moar testin'.") == ['moar','testin'])
assert(clean_headline("Bobcats' woes.") == ['bobcats','woes'])

Now take the headlines, split them into words, and record the counts by paper, date, and word. 

In [None]:
# your code for building the word count data can go here.

In [None]:
# Here's a place to put code that writes out the data to `headline_word_count.txt`
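
One possible shape for the counting step is sketched below, using a `Counter` keyed by `(paper, date, word)`. It is only an illustration: `tokenize` is a hypothetical stand-in for `clean_headline` so that the block runs on its own, and the two input rows are copied from the _Missoulian_ example at the top of the notebook.

```python
from collections import Counter
from string import punctuation


def tokenize(headline):
    """Stand-in for clean_headline: lowercase, drop punctuation, split on whitespace."""
    drop = set(punctuation) | {"‘", "’"}
    return "".join(ch for ch in headline.lower() if ch not in drop).split()


# Rows shaped like the lines of clean_headlines.txt: (paper, date, headline).
rows = [
    ("Missoulian", "2015-09-23",
     "After EWU air raid, Bobcats switch focus to Cal Poly option"),
    ("Missoulian", "2015-09-23",
     "Family in need after Bonner fire destroyed home, killed cat"),
]

# Count every (paper, date, word) combination.
counts = Counter()
for paper, date, headline in rows:
    for word in tokenize(headline):
        counts[(paper, date, word)] += 1

# Build tab-delimited output lines, sorted by paper, date, then word.
lines = [f"{paper}\t{date}\t{word}\t{n}"
         for (paper, date, word), n in sorted(counts.items())]
print(lines[0])
```

With only these two headlines, the count for "after" is 2, which matches the example row shown at the top of the notebook. Writing `lines` to `headline_word_count.txt` is then a matter of joining them with newlines.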