# Assignment 1

In this assignment, you'll be working with messy medical data and using regex to extract relevant infromation from the data. 

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. 

Here is a list of some of the variants you might encounter in this dataset:
* 04/20/2009; 04/20/09; 4/20/09; 4/3/09
* Mar-20-2009; Mar 20, 2009; March 20, 2009;  Mar. 20, 2009; Mar 20 2009;
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
* Feb 2009; Sep 2009; Oct 2010
* 6/2008; 12/2009
* 2009; 2010

Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order accoring to the following rules:
* Assume all dates in xx/xx/xx format are mm/dd/yy
* Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* Watch out for potential typos as this is a raw, real-life derived dataset.

With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices. **This Series should be sorted by a tie-break sort in the format of ("extracted date", "original row number").**

For example if the original series was this:

    0    1999
    1    2010
    2    1978
    3    2015
    4    1985

Your function should return this:

    0    2
    1    4
    2    0
    3    1
    4    3

Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.

*This function should return a Series of length 500 and dtype int.*

In [1]:
import pandas as pd

doc = []
with open('assets/dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)
df.head(10)

0         03/25/93 Total time of visit (in minutes):\n
1                       6/18/85 Primary Care Doctor:\n
2    sshe plans to move as of 7/8/71 In-Home Servic...
3                7 on 9/27/75 Audit C Score Current:\n
4    2/6/96 sleep studyPain Treatment Pain Level (N...
5                    .Per 7/06/79 Movement D/O note:\n
6    4, 5/18/78 Patient's thoughts about current su...
7    10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8                         3/7/86 SOS-10 Total Score:\n
9             (4/10/71)Score-1Audit C Score Current:\n
dtype: object

In [2]:
def date_sorter():

    pattern = (
        # 04/20/2009; 04/20/09; 4/20/09; 4/3/09 (handle leading zero in months)
        r'\b(?:0?([1-9]|1[0-2]))[\/\-](\d{1,2})[\/\-](\d{2,4})\b|'

        # Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009
        r'\b([A-Za-z]{3,})\s+(\d{1,2})(?:th|st|nd|rd)?,?\s+(\d{4})\b|'

        # Feb 2009; Sep 2009; Oct 2010
        r'\b([A-Za-z]{3,})\s+(\d{4})\b|'

        # 6/2008; 12/2009
        r'\b(0?[1-9]|1[0-2])/(\d{4})\b|'

        # 2009; 2010 (skip false positives like phone numbers)
        r'\b(?:s|y|r)?(19\d{2}|20\d{2})\b(?![-\d])|'

        # Years in parentheses (e.g., (1988-now))
        r'\(\s*(19\d{2}|20\d{2})\s*(?:[-–]\s*now)?\s*\)|'

        # Years followed by dashes (e.g., 1990--)
        r'\b(19\d{2}|20\d{2})\b(?=\s*[-–])'
    )
    # YOUR CODE HERE
    #raise NotImplementedError()

    # Extract potential dates
    extracted = df.astype(str).str.extract(pattern)
    
    # Get the first non-null match in each row
    years = extracted.bfill(axis=1).iloc[:, 0]
    
    # Convert to numeric, handling two-digit years
    years = pd.to_numeric(years, errors='coerce')
    years = years.apply(lambda x: x + 1900 if pd.notna(x) and x < 100 else x)
    
    # Create a tuple of (year, original_index) for tie-breaking
    year_index_pairs = list(enumerate(years))
    
    # Sort by year and get the indices
    sorted_indices = years.sort_values().index
    
    return pd.Series(sorted_indices)

In [3]:
order = date_sorter()

order.to_string()

'0       39\n1       77\n2       68\n3      445\n4      410\n5      399\n6      450\n7      381\n8       15\n9       13\n10     369\n11     371\n12     358\n13      45\n14     397\n15     376\n16     124\n17     389\n18       4\n19      49\n20      23\n21      36\n22     395\n23     106\n24     105\n25     104\n26      62\n27     436\n28     415\n29      56\n30     420\n31     352\n32     405\n33      97\n34     412\n35       0\n36      35\n37     354\n38     452\n39     355\n40      22\n41      30\n42       8\n43     347\n44     388\n45      16\n46      95\n47     422\n48     393\n49       9\n50      91\n51      71\n52     424\n53      73\n54     432\n55     431\n56      92\n57     438\n58      18\n59     446\n60      32\n61      78\n62     402\n63      43\n64     391\n65      11\n66     118\n67     115\n68      25\n69     123\n70     384\n71     372\n72     108\n73     349\n74     392\n75      98\n76      84\n77     398\n78     350\n79     419\n80     427\n81      10\n82     448\n83 