---

_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Assignment 1

In this assignment, you'll be working with messy medical data and using regex to extract relevant infromation from the data. 

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. 

Here is a list of some of the variants you might encounter in this dataset:
* 04/20/2009; 04/20/09; 4/20/09; 4/3/09
* Mar-20-2009; Mar 20, 2009; March 20, 2009;  Mar. 20, 2009; Mar 20 2009;
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
* Feb 2009; Sep 2009; Oct 2010
* 6/2008; 12/2009
* 2009; 2010

Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order accoring to the following rules:
* Assume all dates in xx/xx/xx format are mm/dd/yy
* Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* Watch out for potential typos as this is a raw, real-life derived dataset.

With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.

For example if the original series was this:

    0    1999
    1    2010
    2    1978
    3    2015
    4    1985

Your function should return this:

    0    2
    1    4
    2    0
    3    1
    4    3

Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.

*This function should return a Series of length 500 and dtype int.*

In [2]:
import pandas as pd

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.DataFrame(doc, columns=['text'])
# df.head(10)

df.head(10)

Unnamed: 0,text
0,03/25/93 Total time of visit (in minutes):\n
1,6/18/85 Primary Care Doctor:\n
2,sshe plans to move as of 7/8/71 In-Home Servic...
3,7 on 9/27/75 Audit C Score Current:\n
4,2/6/96 sleep studyPain Treatment Pain Level (N...
5,.Per 7/06/79 Movement D/O note:\n
6,"4, 5/18/78 Patient's thoughts about current su..."
7,10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8,3/7/86 SOS-10 Total Score:\n
9,(4/10/71)Score-1Audit C Score Current:\n


In [72]:
import pandas as pd
def date_sorter():
    
    doc = []
    with open('dates.txt') as file:
        for line in file:
            doc.append(line)
 
            
#     current plan as of 10/26
# INITIAL PROCESSING
#  map out each iteration of the regex to a unique dataframe with the OriginalMatch, Month, Day, Year
# DATA PROCESSING
# drop any np.nan amounts from those dfs where it makes sense EG if only grabbing the year...should probably key off of the year column as there will be a lot of false positives if matching based on days and month
# add the data for any of those that are missing data and potentially do a conversion if need be for a key value of SEPT = 9, OCT=10 etc
# combine all values based on columns and create a new data frame with real pandas datetime objects

# SORT!
       

# done 04/20/2009; 04/20/09; 4/20/09; 4/3/09
# done grouping see iloc[1]
    pattern1 = '(?P<OriginalMatch>(?P<Month>\d?\d)[/|-](?P<Day>\d?\d)[/|-](?P<Year>\d{2,4}))'
    
    
# Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
# couldn't find any wild examples, copied and apsted the above and validated
    pattern2 =  '(?P<OriginalMatch>(?P<Month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[\S]*[^a-zA-Z\d:](?P<Day>\d{1,2})[\S]*[^a-zA-Z\d:](?P<Year>\d{4}))'  

    
    
# done 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
#  done grouping see iloc[150]
    pattern3 = '(?P<OriginalMatch>(?P<Day>\d{1,2})[+\s|-](?P<Month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[\S]*[\s|-](?P<Year>\d{4}))'

    # done Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
#     cannot find any "in the wild" examples but added the above and it is grouping correctly
# tested and done grouping
    pattern4 = '(?P<OriginalMatch>(?P<Month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[\S]*[\s](?P<Day>\d{1,2})[\S]*[,][\s](?P<Year>\d{4}))'

    # done  grouping Feb 2009; Sep 2009; Oct 2010
# breakdown for pattern 5
# [^\d{1,2}] the first group does not match 1 to two digits
# [+\s|-] matches whitespace (note different than the ^ on the first block, which is a DOES NOT MATCH) aOR it matches a hyphen
# (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) then it is grabbing the group (the [parenthesis] ) which matches any of the variations on monhts
# [\S]*[\s] is not grabbing (hence the brackets) any string characters (eg a-zA-Z) so January. followed by white space But it is not including those in the grouping
# (\d{4} then it is grabbing the 4 digit year
# done 
# lookup 258
# done grouping
    pattern5 = '(?P<OriginalMatch>[^\d{1,2}]?\S?(?P<Month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[\S]*[\s|-](?P<Year>\d{4}))'
   

    # done  6/2008; 12/2009
    # iloc[453] 
#     done grouping
    pattern6 = '(?P<OriginalMatch>[\s|^/|\~|\.|\(|\|)|\d/]?(?P<Month>\d{1,2})/(?P<Year>\d{4}))'

# done 2009; 2010
# works for lookup 457, middle of the line
# works for lookup 477, beginning of the line  
# does not match look up 445, 7/1981
# this works pretty well...but I don't want to code EVERYTHING with this...it is hungry so APR 2010 would be captured.
# so maybe run pattern7 on the remaining dataframe after removing anything that matched before
    pattern7 = '(?P<OriginalMatch>(?<!/)[^/|\s|\~|\.|\(|\|)]?(?P<Year>[1|2]\d{3})(?!\d{1,2}/))'


    df = pd.DataFrame(doc, columns=['text'])
    
    month_dictionary = {
        'Jan': 1,
        'Feb': 2,
        'Mar': 3,
        'Apr': 4,
        'May': 5,
        'Jun': 6,
        'Jul': 7,
        'Aug': 8,
        'Sep': 9,
        'Oct': 10,
        'Nov': 11,
        'Dec': 12
    }
    
#     PATTERN 1 PROCESSING
# 
    
    df_pattern1 = df['text'].str.extract(pattern1)
    
    df_pattern1 = df_pattern1.dropna()
    
    pattern1_indexes = df_pattern1.index.values.tolist()

#     some year dates are only two digits, so check their length, and if they are two digits, change them into a string and add a '19' to the front, per the assignment's instructions
    df_pattern1["Year"] = df_pattern1['Year'].apply(lambda x: '19'+str(x) if len(x) == 2 else x)
#     need to convert the year back to a number at some point. Oh Well.
#     df_pattern1 = df_pattern1.to_numeric
    
    df = df.drop(pattern1_indexes) 
    
#     PATTERN 2 PROCESSING
    
    df_pattern2 = df['text'].str.extract(pattern2)
    
    df_pattern2 = df_pattern2.dropna()
    
    df_pattern2_indexes = df_pattern2.index.values.tolist()
    
    df_pattern2["Month"] = df_pattern2['Month'].apply(lambda x: month_dictionary[x])
    
    df = df.drop(df_pattern2_indexes) 
    
#     PATTERN 3 PROCESSING

    df_pattern3 = df['text'].str.extract(pattern3)
    
    df_pattern3 = df_pattern3.dropna()
    
    df_pattern3["Month"] = df_pattern3['Month'].apply(lambda x: month_dictionary[x])
    
    df_pattern3_indexes = df_pattern3.index.values.tolist()
    
    df = df.drop(df_pattern3_indexes) 
    
# PATTERN 4 (nothing in this one)

    df_pattern4 = df['text'].str.extract(pattern4)
    
    df_pattern4 = df_pattern4.dropna()
    
    df_pattern4_indexes = df_pattern4.index.values.tolist()
    
    df = df.drop(df_pattern4_indexes) 

# PATTERN 5
    df_pattern5 = df['text'].str.extract(pattern5)
    
    df_pattern5 = df_pattern5.dropna()
    
    df_pattern5["Month"] = df_pattern5['Month'].apply(lambda x: month_dictionary[x])
    
#     this pattern doesn't have a day value, so a 1 is hardcoded in the day column

    df_pattern5['Day']  = 1
    
    df_pattern5_indexes = df_pattern5.index.values.tolist()
    
    df = df.drop(df_pattern5_indexes) 
# PATTERN 6
    df_pattern6 = df['text'].str.extract(pattern6)
    
    df_pattern6 = df_pattern6.dropna()
    
    df_pattern6['Day']  = 1
    
    df_pattern6_indexes = df_pattern6.index.values.tolist()
    
    df = df.drop(df_pattern6_indexes) 
# PATTERN 7
    df_pattern7 = df['text'].str.extract(pattern7)
    
    df_pattern7 = df_pattern7.dropna()
    
    df_pattern7['Day']  = 1
    
    df_pattern7['Month']  = 1
    
    df_pattern7_indexes = df_pattern7.index.values.tolist()
    
    df = df.drop(df_pattern7_indexes) 
    
    
    normalized_df = pd.concat([df_pattern1, df_pattern2, df_pattern3, df_pattern4,
                               df_pattern5, df_pattern6, df_pattern7], ignore_index=False, sort=True)
    
    sorted_df = normalized_df.sort_values(by=['Year', 'Month', 'Day'])
    
    sorted_df = sorted_df.reset_index()
    
    s = sorted_df.ix[:,0]
    
    return s
    
date_sorter()

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


0        9
1       84
2       53
3        2
4       28
5      474
6      153
7      129
8      225
9      171
10     191
11      13
12      98
13     111
14      31
15     486
16     335
17     323
18     345
19     405
20      57
21     415
22      36
23     422
24     375
25     380
26     481
27     299
28     162
29     154
      ... 
470    208
471    139
472    320
473    383
474    393
475     34
476    244
477    286
478    480
479    279
480    198
481    431
482    463
483    255
484    381
485    439
486    401
487    366
488    475
489    257
490    152
491    235
492    464
493    253
494    231
495    141
496    186
497    161
498    413
499    427
Name: index, Length: 500, dtype: int64