In [19]:
import pandas as pd
import numpy as np 
import re

In [20]:
df = pd.read_csv("date_parser_testcases.csv")

In [21]:
df.head()

Unnamed: 0,Input,Expected Output
0,"The event will take place on March 5, 2023.",05/03/2023
1,Her birthday is on 07/08/1990.,07/08/1990
2,The deadline is 2022-12-31.,31/12/2022
3,We met on 1st of January 2000.,01/01/2000
4,"The concert is scheduled for 15th September, 2...",15/09/2021


In [22]:
df.head(50)

Unnamed: 0,Input,Expected Output
0,"The event will take place on March 5, 2023.",05/03/2023
1,Her birthday is on 07/08/1990.,07/08/1990
2,The deadline is 2022-12-31.,31/12/2022
3,We met on 1st of January 2000.,01/01/2000
4,"The concert is scheduled for 15th September, 2...",15/09/2021
5,Let's catch up on 02.04.2022.,02/04/2022
6,The project started on 5/6/19.,05/06/2019
7,He was born on 1987/11/23.,23/11/1987
8,Christmas is on 25th Dec 2024.,25/12/2024
9,"The meeting is set for April 03, 2020.",03/04/2020


In [23]:
df.info

<bound method DataFrame.info of                                                 Input Expected Output
0         The event will take place on March 5, 2023.      05/03/2023
1                      Her birthday is on 07/08/1990.      07/08/1990
2                         The deadline is 2022-12-31.      31/12/2022
3                      We met on 1st of January 2000.      01/01/2000
4   The concert is scheduled for 15th September, 2...      15/09/2021
..                                                ...             ...
95  We celebrate Independence Day on 2023-07-04, a...      04/07/2023
96  The final date for submission is 30th November...      30/11/2022
97  The annual conference is on 15th October 2023,...      15/10/2023
98  His birthdate, noted as 1990-05-20, is in the ...      20/05/1990
99  The festival will be celebrated on 12th August...      12/08/2024

[100 rows x 2 columns]>

In [24]:
df.isna().sum()

Input              0
Expected Output    0
dtype: int64

In [25]:
# Convert a full English month name to its two-digit number (returns None for anything else)
def month_name_to_number(month):
    month_dict = {
        'january': '01', 'february': '02', 'march': '03', 'april': '04',
        'may': '05', 'june': '06', 'july': '07', 'august': '08',
        'september': '09', 'october': '10', 'november': '11', 'december': '12'
    }
    return month_dict.get(month.lower(), None)

In [26]:
def extract_date(text):
    # Define regex patterns for different date formats
    patterns = [
        r'(\d{1,2})(st|nd|rd|th)?[\s\-]*(of)?[\s\-]*(\b\w+\b)[\s\-,]*(\d{4})',  # 1st of January 2000, 15th September, 2021
        r'(\d{4})[\-\/\.](\d{2})[\-\/\.](\d{2})',  # 2022-12-31, 2022.12.31
        r'(\d{1,2})[\-\/\.](\d{1,2})[\-\/\.](\d{4})',  # 07/08/1990, 07.08.1990
        r'(\b\w+\b)[\s\-]*(\d{1,2})(st|nd|rd|th)?[\s\-,]*(\d{4})'  # March 5, 2023
    ]
    
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            if len(match.groups()) == 5:
                # Pattern 1 (day-first with month name): groups are day, suffix, "of", month, year
                day = match.group(1).zfill(2)
                month = month_name_to_number(match.group(4) if match.group(4) else match.group(3))
                year = match.group(5)
                return f"{day}/{month}/{year}" if month else None
            elif len(match.groups()) == 3:
                # Patterns 2 and 3 both have 3 groups; this branch always assumes year-first order
                year, month, day = match.group(1), match.group(2).zfill(2), match.group(3).zfill(2)
                return f"{day}/{month}/{year}"
            elif len(match.groups()) == 4:
                # Pattern 4 (month name, day, suffix, year) lands here and is unpacked as day, month, year
                day, month, year = match.group(1).zfill(2), match.group(2).zfill(2), match.group(3)
                return f"{day}/{month}/{year}"
    
    # If no patterns match, return None
    return None

In [27]:
# Apply the function to the dataset
df['Parsed Date'] = df['Input'].apply(extract_date)


In [28]:
# Compare with the expected output
df['Match'] = df['Parsed Date'] == df['Expected Output']

In [29]:
# Output the results
incorrect_rows = df[~df['Match']]
print(f"Number of incorrect rows: {len(incorrect_rows)}")
print("\nIncorrectly parsed dates:")
print(incorrect_rows[['Input', 'Expected Output', 'Parsed Date']])


Number of incorrect rows: 52

Incorrectly parsed dates:
                                                Input  Expected Output  \
0         The event will take place on March 5, 2023.       05/03/2023   
1                      Her birthday is on 07/08/1990.       07/08/1990   
5                       Let's catch up on 02.04.2022.       02/04/2022   
6                      The project started on 5/6/19.       05/06/2019   
8                      Christmas is on 25th Dec 2024.       25/12/2024   
9              The meeting is set for April 03, 2020.       03/04/2020   
13                      They got married on 12/12/12.       12/12/2012   
14            The workshop is on February 15th, 2022.       15/02/2022   
15                  Submit your report by 08/31/2021.       31/08/2021   
19                 The new year begins on 01-01-2023.       01/01/2023   
20                      The seminar is on 03/14/2022.       14/03/2022   
21                         My last day is 31.08.2020.   

# Overall Performance:
Total cases analyzed: 100 rows.
Correctly parsed: 48 rows, where the parsed date equals the expected output exactly.
Incorrectly parsed: 52 rows, where the function returned None or a string that differs from the expected output.
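
The 52 failures are not all of the same kind, so it can help to split them by outcome before reading them one by one. The sketch below is one possible way to do that in plain Python. The helper name `classify_result` and the category labels are assumptions made for illustration, and the last sample pair is hypothetical, included only to exercise the final branch.

```python
import re
from collections import Counter

# Shape of a well-formed result: DD/MM/YYYY
DATE_SHAPE = re.compile(r"\d{2}/\d{2}/\d{4}")

def classify_result(parsed, expected):
    """Label one (parsed, expected) pair with a coarse outcome (hypothetical helper)."""
    if parsed == expected:
        return "correct"
    if parsed is None:
        return "no match"      # no regex pattern matched the text
    if DATE_SHAPE.fullmatch(parsed) is None:
        return "malformed"     # something was extracted, but not as DD/MM/YYYY
    return "wrong value"       # right shape, wrong date

# Sample pairs; the first four parsed values are the ones quoted in this notebook.
pairs = [
    ("31/12/2022", "31/12/2022"),
    ("March/05/None", "05/03/2023"),
    (None, "12/12/2012"),
    ("2021/31/08", "31/08/2021"),
    ("08/07/1990", "07/08/1990"),  # hypothetical pair, not taken from the dataset
]
summary = Counter(classify_result(parsed, expected) for parsed, expected in pairs)
print(summary)
```

The same function could be applied row by row to the `Parsed Date` and `Expected Output` columns, assuming missing results are stored as None.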

In [30]:
# List the rows where the parsed date matches the expected output
correct_rows = df[df['Match']]
print(f"\nNumber of correctly parsed rows: {len(correct_rows)}")
print("\nCorrectly parsed dates:")
print(correct_rows[['Input', 'Expected Output', 'Parsed Date']])


Number of correctly parsed rows: 48

Correctly parsed dates:
                                                Input Expected Output  \
2                         The deadline is 2022-12-31.      31/12/2022   
3                      We met on 1st of January 2000.      01/01/2000   
4   The concert is scheduled for 15th September, 2...      15/09/2021   
7                          He was born on 1987/11/23.      23/11/1987   
10  Her birthdate, noted as 1997-05-20, is in the ...      20/05/1997   
11      Her appointment is on the 2nd of March, 2021.      02/03/2021   
12                       The exam date is 2021.11.10.      10/11/2021   
16                The course starts on 1st July 2023.      01/07/2023   
17          Independence Day is on 4th of July, 2022.      04/07/2022   
18                        His birthday is 1995/10/30.      30/10/1995   
22                        The due date is 2020-02-28.      28/02/2020   
24       The conference will be held on 5th May 2023.      05/

# Patterns in Incorrectly Parsed Dates:
Looking at the incorrect results, several patterns and issues emerge:

Month Names Handling:

Example: "The event will take place on March 5, 2023." was parsed as "March/05/None" instead of "05/03/2023".
Cause: The month-first pattern has four capture groups (month name, day, ordinal suffix, year), so extract_date routes it to the four-group branch. That branch unpacks the first three groups as day, month, and year and never calls month_name_to_number. The month name lands in the day slot, the day lands in the month slot, and the unmatched ordinal group (None) lands in the year slot.
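
One way to avoid this kind of mix-up is to give each pattern named groups and read the fields by name instead of dispatching on the number of groups. The following is a minimal sketch under that assumption; `MONTH_FIRST`, `MONTH_NUMBERS`, and `parse_month_first` are illustrative names, not part of the original notebook.

```python
import re

MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]
MONTH_NUMBERS = {name: f"{index:02d}" for index, name in enumerate(MONTH_NAMES, start=1)}

# Month-first dates such as "March 5, 2023" or "February 15th, 2022".
# The ordinal suffix sits in a non-capturing group, so it is never read as a field.
MONTH_FIRST = re.compile(
    r"\b(?P<month>[A-Za-z]+)\s+(?P<day>\d{1,2})(?:st|nd|rd|th)?,?\s+(?P<year>\d{4})\b"
)

def parse_month_first(text):
    """Return DD/MM/YYYY for the first month-first date found in text, else None."""
    for match in MONTH_FIRST.finditer(text):
        month = MONTH_NUMBERS.get(match.group("month").lower())
        if month is not None:  # skip words that are not full month names
            return f"{match.group('day').zfill(2)}/{month}/{match.group('year')}"
    return None

print(parse_month_first("The event will take place on March 5, 2023."))
print(parse_month_first("The workshop is on February 15th, 2022."))
```

Because every field is looked up by name, adding or removing an optional group in the pattern does not change which value is treated as the day, month, or year.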
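
Ambiguous Date Formats:

Example: "Her birthday is on 07/08/1990." appears among the incorrect rows even though its expected output is identical to the date in the input, and "Submit your report by 08/31/2021." was parsed as "2021/31/08".
Cause: There are two separate problems. First, the DD/MM/YYYY pattern and the YYYY-MM-DD pattern both have three capture groups, so every three-group match goes through the year-first branch and comes out reversed. Second, the function has no logic to tell MM/DD/YYYY from DD/MM/YYYY, so a value such as 08/31/2021 would still be wrong with the groups in the right order.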
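
A simple heuristic for the second problem is to assume day-first order and swap only when the second number cannot be a month. The sketch below illustrates that idea. `parse_numeric_date` is a hypothetical helper, and the day-first default is an assumption that happens to agree with the expected outputs shown above (for example 07/08/1990 and 02.04.2022). Dates where both numbers are 12 or less remain ambiguous.

```python
import re

# Day/month/year (or month/day/year) with a four-digit year and -, / or . as separator
NUMERIC_DATE = re.compile(r"\b(\d{1,2})[-/.](\d{1,2})[-/.](\d{4})\b")

def parse_numeric_date(text):
    """Return DD/MM/YYYY, swapping day and month when the second number exceeds 12."""
    match = NUMERIC_DATE.search(text)
    if match is None:
        return None
    first, second, year = int(match.group(1)), int(match.group(2)), match.group(3)
    if second > 12 and first <= 12:
        day, month = second, first   # must be MM/DD/YYYY
    else:
        day, month = first, second   # assume DD/MM/YYYY
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return None                  # reject impossible values
    return f"{day:02d}/{month:02d}/{year}"

print(parse_numeric_date("Her birthday is on 07/08/1990."))
print(parse_numeric_date("Submit your report by 08/31/2021."))
```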

Edge Cases in Shortened Year Formats:

Example: "They got married on 12/12/12." returned None, although the expected output is "12/12/2012".
Cause: Every pattern requires a four-digit year (\d{4}), so dates written with a two-digit year never match.
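
Two-digit years need both a dedicated pattern and a rule for choosing the century. The sketch below is one possible approach. The names `expand_two_digit_year` and `parse_short_year_date` are hypothetical, and the pivot of 68 is an assumption that mirrors the convention Python's time module documents for `%y` (00 to 68 map to 2000 to 2068, 69 to 99 map to 1969 to 1999).

```python
import re

# Numeric date whose year has exactly two digits; the trailing \b keeps
# four-digit years such as 1990 from being cut down to "19".
SHORT_YEAR_DATE = re.compile(r"\b(\d{1,2})[-/.](\d{1,2})[-/.](\d{2})\b")

def expand_two_digit_year(two_digits, pivot=68):
    """Expand a two-digit year; values up to pivot go to the 2000s, the rest to the 1900s."""
    value = int(two_digits)
    century = 2000 if value <= pivot else 1900
    return str(century + value)

def parse_short_year_date(text):
    """Return DD/MM/YYYY for a day-first date with a two-digit year, else None."""
    match = SHORT_YEAR_DATE.search(text)
    if match is None:
        return None
    day, month, short_year = match.groups()
    return f"{day.zfill(2)}/{month.zfill(2)}/{expand_two_digit_year(short_year)}"

print(parse_short_year_date("They got married on 12/12/12."))
print(parse_short_year_date("The project started on 5/6/19."))
```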
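
Ordinal Indicators and Suffixes:

Example: "The holiday starts on Dec 20th, 2021." was parsed as "Dec/20/th".
Cause: The four-group branch reads the third capture group, which holds the ordinal suffix, as the year. In addition, month_name_to_number only knows full month names, so abbreviations such as "Dec" cannot be converted.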
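
Both issues can be handled before any pattern matching takes place. The sketch below shows one way to do it: a month lookup that accepts abbreviations of three or more letters, and a helper that drops ordinal suffixes. `month_to_number` and `strip_ordinal_suffixes` are hypothetical names introduced here for illustration.

```python
import re

MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]

def month_to_number(token):
    """Map a full or abbreviated month name (3+ letters) to a two-digit string, else None."""
    key = token.lower().rstrip(".")
    if len(key) < 3:
        return None  # too short to identify a month unambiguously
    for index, name in enumerate(MONTH_NAMES, start=1):
        if name.startswith(key):
            return f"{index:02d}"
    return None

def strip_ordinal_suffixes(text):
    """Remove st/nd/rd/th only when they directly follow a digit."""
    return re.sub(r"(?<=\d)(?:st|nd|rd|th)\b", "", text)

print(month_to_number("Dec"))
print(month_to_number("Sept."))
print(strip_ordinal_suffixes("The holiday starts on Dec 20th, 2021."))
```

The lookbehind `(?<=\d)` keeps words such as "August" intact, since their "st" does not follow a digit.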

# Correctly Parsed Dates:
The correctly parsed dates fall into the formats that the branch logic handles as intended, such as:

YYYY-MM-DD formats: "The deadline is 2022-12-31." -> "31/12/2022".
Day-Month-Year with ordinals: "We met on 1st of January 2000." -> "01/01/2000".
Year-first dates with other separators: "The official exam date is now 2021.11.10." -> "10/11/2021".
These cases show that the function is reliable for year-first numeric dates and for day-first dates with a full month name. Day-first and month-first numeric dates, two-digit years, month-first names, and abbreviated month names still fail.