### Date recovery

For more context on this code, please read the [README file](https://github.com/Chinnasf/Text-Mining/blob/master/README.md).

---

**Date of Creation**: Aug 05, 2020<br>
**Author**: Karina Chiñas Fuentes<br>
**Email**: chinnasf@outlook.de

---


**OBJECTIVE**: read a txt file, extract the date recorded for each event, and sort the events by date.

The code below searches for the following types of dates (and their variants) inside paragraphs:

- Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
- Feb 2009; Sep 2009; Oct 2010
- Jan.29.2020; September.17.1995; 08.17.20 
- 20 March, 2009; 1 Mar. 2020; Sept. 2004
- Mar-20-2009; Mar.1.09; September/2/09
- 10.March.19; 2/DEC/89
- December 10, 19; Dec 15, 1919; Dec. 19; Dec. 1919
- 10-10-10; 5-19; 05/1919; 10.15.19 
- 1/19; 01/19; 10;1919
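- 1919; a year preceded by whitespace, such as {whitespace}2020
- a year attached directly to a preceding character, with no whitespace in between, such as {character}2010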


The code then checks month names for typos and substitutes the closest correct spelling, so that each extracted date can be converted to datetime format.
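
The search step described above can be illustrated with a much smaller set of patterns. The following is only a minimal sketch, not the notebook's full expressions: the names `MONTH`, `PATTERNS` and `first_date` are hypothetical, the patterns cover only a few of the listed formats, and matching is case-sensitive. Patterns are tried from most to least specific, so a full date wins over a bare year.

```python
import re

# Simplified, illustrative patterns (assumption: not the notebook's own regexes).
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"
PATTERNS = [
    r"\b\d{1,2}\s" + MONTH + r",?\s\d{4}\b",   # 20 March, 2009; 1 Mar. 2020
    r"\b" + MONTH + r"\s\d{1,2},\s\d{4}\b",    # Mar 20, 2009; Dec 15, 1919
    r"\b" + MONTH + r"\s\d{4}\b",              # Feb 2009
    r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b",    # 03/25/93; 10-10-10; 10.15.19
    r"\b\d{1,2}/\d{2,4}\b",                    # 1/19; 05/1919
    r"\b\d{4}\b",                              # bare year
]

def first_date(text):
    """Return the first match of the first pattern that matches, else None."""
    for pattern in PATTERNS:
        match = re.search(pattern, text)
        if match:
            return match.group(0)
    return None

print(first_date("sshe plans to move as of 7/8/71 In-Home Servic"))  # 7/8/71
print(first_date("30 May 2016 SOS-10 Total Score:"))                 # 30 May 2016
```

Ordering the patterns this way matters: if the bare-year pattern came first, "30 May 2016" would be reduced to "2016".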

In [1]:
import pandas as pd

# Turning the .txt file into a DataFrame 
doc = []
with open('Data/dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.DataFrame(doc,columns=["data"])
df.head(10)

,data
0,03/25/93 Total time of visit (in minutes):\n
1,6/18/85 Primary Care Doctor:\n
2,sshe plans to move as of 7/8/71 In-Home Servic...
3,7 on 9/27/75 Audit C Score Current:\n
4,2/6/96 sleep studyPain Treatment Pain Level (N...
5,.Per 7/06/79 Movement D/O note:\n
6,"4, 5/18/78 Patient's thoughts about current su..."
7,10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8,3/7/86 SOS-10 Total Score:\n
9,(4/10/71)Score-1Audit C Score Current:\n
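
The preview shows that every row keeps its trailing `\n`, because iterating over a file yields lines with their newline characters. If that is unwanted, the lines can be stripped while reading. A minimal sketch, assuming an in-memory sample in place of `Data/dates.txt`; the helper name `read_lines` is hypothetical:

```python
import io

# Hypothetical in-memory sample standing in for Data/dates.txt.
sample = io.StringIO(
    "03/25/93 Total time of visit (in minutes):\n"
    "6/18/85 Primary Care Doctor:\n"
)

def read_lines(handle):
    """Return the lines of a text stream without trailing newline characters."""
    return [line.rstrip("\n") for line in handle]

lines = read_lines(sample)
print(lines)
```

The same helper accepts the object returned by `open(...)`, so its result could be passed to `pd.DataFrame` in the same way as `doc`.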


In [2]:
import re
from nltk.metrics.distance import edit_distance
from nltk.corpus import words
# The word list requires the NLTK "words" corpus, e.g. via nltk.download("words").
correct_spellings = words.words()

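# Variables needed for dates with string format.
months = "(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z.]*" 
# Ordinal suffixes (st, nd, rd, th). Note that str.title() upper-cases a letter
# that follows a digit ("20th" -> "20Th"), so this lowercase class may not match
# ordinals in title-cased input.
formal = "[sndrht]{2}"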

def raw_date_filter(s):
    # Title-case the input so that month names match regardless of the original casing
    text = s.title()
    # First try all dates starting with string format
    type_a1 = "("+months+r"(\s)\d{4})|("+months+r"(,\s)\d{2,4})|("+months+r"(\s)\d{0,2}"+formal+r"(,\s)\d{2,4})"
    type_a2 = "("+months+r"(/|-|\.)\d{1,2}(/|-|\.)\d{2,4})|("+months+r"\s\d{0,2}(,|-|\.)?\s\d{2,4})"
    
    clf_a = r"(\d{2}\s"+months+r"\s\d{4})|"+type_a1+"|"+type_a2
    x = re.search(clf_a, text)
    if x is not None:
        return x.group(0)
    # Then try the other types of dates
    type_b1 = r"(\d{0,2}\s"+months+r"(,|-|\.)?\s\d{4})|(\d{1,2}(/|-|\.)"+months+r"(/|-|\.)\d{2,4})"
    type_b2 = r"(\d{0,2}(/|-|\.)\d{1,2}(/|-|\.)\d{2,4})|(\d{1,2}(/)\d{2,4})|((\s|\s\W|\w)\d{4})|(\d{4})"

    clf_b = type_b1+"|"+type_b2
    y = re.search(clf_b, text)
    if y is None:
        # Fail with a clear message instead of an IndexError on an empty match list
        raise ValueError(f"no date found in: {s!r}")
    return y.group(0)

def typo_cleaner(s):
    # Tokenizing string
    tk = [x for x in re.split(r"\s|-|\.|,|~|\(|\)|\[|\]|/", s.title()) if x != ""]
    
    # If there is only a year with a string, collect the year.
    if len(tk) == 1: 
        return re.findall(r"\d+", s)[0] # returns only year xxxx

    # Checking and correcting spelling typos in string format
    pos_word = [(i, t) for i, t in enumerate(tk) if not t.isdigit()]
    # Short (3-letter) month forms and purely numerical dates are left unchanged
    if pos_word and len(pos_word[0][1]) != 3:
        # Check for spellings in full months: take the word with the smallest
        # edit distance among the known words sharing the same first letter
        i, word = pos_word[0]
        spelling_suggestions = [w for w in correct_spellings if w[0] == word[0]]
        distance = [edit_distance(word, suggestion) for suggestion in spelling_suggestions]
        tk[i] = spelling_suggestions[distance.index(min(distance))]
    return '/'.join(tk) # returns xx/xx/xxxx

# Date recover:         
df["rdate"] = df["data"].apply(raw_date_filter)
df["date"] = df["rdate"].apply(typo_cleaner)

# Convert the extracted dates to timestamps with pd.to_datetime() and sort the events by date.
df["dates"] = pd.to_datetime(df["date"])
df.drop(["rdate","date"],axis=1,inplace=True)
df.set_index(["dates"],inplace=True)
df.sort_index(inplace=True)

# The result is a DataFrame sorted from the earliest event to the most recent one.
df

dates,data
1971-04-10,(4/10/71)Score-1Audit C Score Current:\n
1971-05-18,5/18/71 Total time of visit (in minutes):\n
1971-07-08,sshe plans to move as of 7/8/71 In-Home Servic...
1971-07-11,7/11/71 SOS-10 Total Score:\n
1971-09-12,9/12/71 [report_end]\n
...,...
2016-05-01,50 yo DWF with a history of alcohol use disord...
2016-05-30,30 May 2016 SOS-10 Total Score:\n
2016-10-13,13 Oct 2016 Primary Care Doctor:\n
2016-10-19,19 Oct 2016 Communication with referring physi...
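
The spelling-correction step in `typo_cleaner` compares a misspelled word against every NLTK word that starts with the same letter. Since only month names are expected at that position, one possible alternative is to compare against the twelve month names alone. The sketch below is an assumption-laden illustration, not the notebook's method: it swaps NLTK's `edit_distance` for the standard library's `difflib.get_close_matches` (a similarity ratio rather than an edit distance), and the names `MONTHS` and `correct_month` are hypothetical.

```python
import difflib

# Candidate spellings are limited to month names (assumption for this sketch).
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def correct_month(word, cutoff=0.6):
    """Return the month name closest to `word`, or `word` unchanged if none is close."""
    matches = difflib.get_close_matches(word.title(), MONTHS, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct_month("Decemeber"))  # December
print(correct_month("Janaury"))    # January
```

Restricting the candidates keeps the correction from landing on an unrelated dictionary word, and the `cutoff` leaves tokens that resemble no month untouched.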
