# Headline Cleaning Part 1
## Solution

Our goal in this phase is to read in one of the headline files, extract the list of dates, reformat them as `YYYY-MM-DD`, and count the words in the file.

In [None]:
# Note: change this to the path on your machine, or make it an empty
# string if the file you're reading is in the same folder.
working_dir = "C:\\Users\\jchan\\Dropbox\\Teaching\\AppliedDataAnalytics\\Code\\headline-cleaning\\"
# working_dir = ""

input_file = "missoula.txt" # Let's work with Missoula
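
As an aside, here is a minimal sketch of building the same path with `pathlib` instead of string concatenation, which avoids having to remember the trailing backslashes. The names `example_dir` and `example_path` are made up for illustration, and the rest of this solution keeps using `working_dir + input_file`.

```python
from pathlib import Path

# Hypothetical stand-ins for working_dir and input_file.
example_dir = ""               # empty means "the current folder"
example_file = "missoula.txt"

# The / operator joins path pieces with the right separator for the OS.
example_path = Path(example_dir) / example_file
print(example_path)

# A folder name works the same way, with no trailing slash needed.
nested_path = Path("data") / example_file
print(nested_path.name)
```

A `Path` object can be passed straight to `open()`, so it can stand in wherever the string path is used.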

Let's open up one of these files and see what we have. 

In [None]:
with open(working_dir + input_file,'r',encoding="Latin-1") as infile :
    for idx,row in enumerate(infile.readlines()) :
        print(row)
        if idx == 1 :
            break

Okay, this is a mess. The first row has all the dates, and the subsequent rows have headlines with many blanks inserted. Let's start by getting the dates from the first row.

In [None]:
with open(working_dir + input_file,'r',encoding="Latin-1") as infile :
    dates = infile.readline().strip().split("\t")

dates[:5]
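
To see what `strip().split("\t")` does, here is a small sketch on made-up strings (these are not copied from `missoula.txt`, they only imitate its tab-separated layout). It also shows how splitting on `"\t"` differs from a bare `split()`, which matters once we get to the headline rows.

```python
# Hypothetical header row: tab-separated dates, one empty column, trailing tab.
sample_header = "23-Sep\t24-Sep\t\t25-Sep\t\n"

# strip() removes the trailing tab and newline; split("\t") keeps the
# empty column in the middle as an empty string.
sample_dates = sample_header.strip().split("\t")
print(sample_dates)   # ['23-Sep', '24-Sep', '', '25-Sep']

# Hypothetical headline row with repeated blanks.
sample_row = "Griz  win\t\tagain\n"

# split() with no argument splits on any run of whitespace, so blanks vanish.
sample_words = sample_row.strip().split()
print(sample_words)   # ['Griz', 'win', 'again']
```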

This is an irksome date format. There are fancy ways to solve this, but let's use a simple one: a lookup dictionary that maps each month abbreviation to its two-digit month number.

In [None]:
month_abbr = ["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
month_num = [n+1 for n in range(12)]
month_str = ["{:02d}".format(n) for n in month_num]

month_lu = dict(zip(month_abbr,month_str))

In [None]:
clean_dates = []
for date in dates :
    d,m = date.split("-")
    m_num = month_lu[m]
    d_str = "{:02d}".format(int(d))
    
    if ((m == "Sep" and int(d) > 22) or
        m in ["Oct","Nov","Dec"]) :
        y = "2015"
    else :
        y = "2016" 

    clean_dates.append("-".join([y,m_num,d_str]))

We can eyeball the result by pairing each original date with its cleaned version:

In [None]:
dict(zip(dates,clean_dates))
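
For comparison, one of the "fancy" approaches might look like the sketch below, which uses `datetime.strptime` from the standard library. This is an alternative, not the method used in the rest of this solution. The function name `reformat_with_strptime` and the `cutoff` parameter are made up for illustration, and the sketch assumes an English locale, since `%b` matches locale-dependent month abbreviations.

```python
from datetime import datetime

def reformat_with_strptime(ugly_date, cutoff=(9, 22)):
    """Hypothetical alternative: convert "D-MMM" to "YYYY-MM-DD"."""
    # Parse against 2016 first. It is a leap year, so "29-Feb" is accepted;
    # parsing "%d-%b" alone would default to 1900 and reject that date.
    parsed = datetime.strptime(ugly_date + "-2016", "%d-%b-%Y")

    # Dates after the cutoff (22-Sep for Missoula) belong to 2015.
    if (parsed.month, parsed.day) > cutoff:
        parsed = parsed.replace(year=2015)

    return parsed.strftime("%Y-%m-%d")

print(reformat_with_strptime("23-Sep"))   # 2015-09-23
print(reformat_with_strptime("8-Jun"))    # 2016-06-08
```

Comparing `(month, day)` tuples handles the September cutoff and the Oct/Nov/Dec case in a single test.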

This next step goes above and beyond, but it'll be useful later. Let's make a function that takes one of these dates as input and returns the correctly formatted date.

In [None]:
def reformat_missoula_date(ugly_date) :
    '''
        Takes as input a date of the form "D-MMM" and returns a date of the form
        "YYYY-MM-DD". Note that we have to do some work on years. Dates in Oct, Nov,
        Dec are all 2015. Also, September dates _after_ 22-Sep are 2015. Note that
        this cutoff only works for Missoula; we'd have to come up with other
        cutoffs for other papers.
    '''
    
    month_abbr = ["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
    month_num = [n+1 for n in range(12)]
    month_str = ["{:02d}".format(n) for n in month_num]
    month_lu = dict(zip(month_abbr,month_str))
    
    d,m = ugly_date.split("-")
    m_num = month_lu[m]
    d_str = "{:02d}".format(int(d))
    
    if ((m == "Sep" and int(d) > 22) or
        m in ["Oct","Nov","Dec"]) :
        y = "2015"
    else :
        y = "2016" 

    pretty_date = "-".join([y,m_num,d_str])
    
    return pretty_date
    

In [None]:
# Let's test it
assert reformat_missoula_date("21-Sep") == "2016-09-21"
assert reformat_missoula_date("23-Sep") == "2015-09-23"
assert reformat_missoula_date("8-Jun") == "2016-06-08"
assert reformat_missoula_date("30-Mar") == "2016-03-30"

# This one fails on purpose (27-Sep falls in 2015), so you can see what a
# failed assertion looks like.
assert reformat_missoula_date("27-Sep") == "2016-09-27"

Now we'd like to count all the words in the file. I'm going to do this three ways. The first will use really basic techniques; the second will simplify the counting logic with `defaultdict`; the third will use the `Counter` data type, which is meant for exactly this.

In [None]:
dict_counter = {}

with open(working_dir + input_file,'r',encoding="Latin-1") as infile :
    next(infile) # skip the first row--the dates

    for line in infile.readlines() :
        split_line = line.strip().split()
        
        for word in split_line :
            if word not in dict_counter :
                dict_counter[word] = 1
            else :
                dict_counter[word] += 1
        

In [None]:
# Let's do a little testing of this one
print(dict_counter["the"])
print(dict_counter["The"])
print(dict_counter["Griz"])

In [None]:
# Now with default dict
from collections import defaultdict

ddict_counter = defaultdict(int)

with open(working_dir + input_file,'r',encoding="Latin-1") as infile :
    next(infile) # skip the first row--the dates

    for line in infile.readlines() :
        split_line = line.strip().split()
        
        for word in split_line :
            ddict_counter[word] += 1


In [None]:
# Let's do a little testing of this one
print(ddict_counter["the"])
print(ddict_counter["The"])
print(ddict_counter["Griz"])

In [None]:
# Now let's use counter. Easiest way is to make a list of all the words.
from collections import Counter
all_words = []

with open(working_dir + input_file,'r',encoding="Latin-1") as infile :
    next(infile) # skip the first row--the dates

    for line in infile.readlines() :
        split_line = line.strip().split()
        all_words.extend(split_line)

word_cnt = Counter(all_words)

In [None]:
word_cnt.most_common(10)
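
As a sanity check, the sketch below runs all three counting approaches on a couple of made-up lines (hypothetical stand-ins for headline rows, not taken from `missoula.txt`) and confirms they agree. It also shows how each structure behaves when asked for a word it has never seen.

```python
from collections import Counter, defaultdict

# Made-up sample lines, for illustration only.
sample_lines = ["Griz win the game\n", "The Griz lose\t\tthe game\n"]

plain = {}
dd = defaultdict(int)
words = []

for line in sample_lines:
    for word in line.strip().split():
        # Plain dict: check for the key before incrementing.
        if word not in plain:
            plain[word] = 1
        else:
            plain[word] += 1
        # defaultdict: missing keys start at int(), which is 0.
        dd[word] += 1
        words.append(word)

cnt = Counter(words)

# All three hold the same counts.
all_agree = plain == dd == cnt
print(all_agree)                 # True
print(cnt["the"], cnt["The"])    # 2 1

# Counter returns 0 for an unseen word without storing it;
# a plain dict would raise KeyError instead.
print(cnt["missing"], "missing" in cnt)   # 0 False
```

Note that `"the"` and `"The"` are counted as different words, just as in the checks on the real file above.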