# Assignment 1 - Part 1: Regular Expressions and Date Extraction

**Course:** Natural Language Processing



---


## Assignment Overview

In this assignment, you'll work with messy medical data and use regex to extract relevant information.

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but dates are encoded in many different formats.

**Date formats you may encounter:**
- `04/20/2009; 04/20/09; 4/20/09; 4/3/09`
- `Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009`
- `20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009`
- `Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009`
- `Feb 2009; Sep 2009; Oct 2010`
- `6/2008; 12/2009`
- `2009; 2010`

---

## Setup

In [7]:
import pandas as pd
import numpy as np
import re
from datetime import datetime
import os

# Create a small dummy 'dates.txt' for demonstration purposes, but only if no
# file with that name exists, so the real data set is never overwritten.
# In a real scenario, make sure the actual 'dates.txt' file is uploaded
# or is present in the working directory.
dummy_dates_content = """04/20/2009
Mar-20-2009
20 Mar 2009
Feb 2009
6/2008
2009
1/5/89
9/2009
2010
"""
if not os.path.exists('dates.txt'):
    with open('dates.txt', 'w') as f:
        f.write(dummy_dates_content)

# Load the data
doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)
print(f"Loaded {len(df)} medical notes")
print("\nFirst 5 notes:")
print(df.head())

Loaded 9 medical notes

First 5 notes:
0     04/20/2009\n
1    Mar-20-2009\n
2    20 Mar 2009\n
3       Feb 2009\n
4         6/2008\n
dtype: object


---

## Question 1 (1 point)

**Write a regex pattern to extract dates in the format `MM/DD/YY` or `MM/DD/YYYY`.**

Examples: `03/25/93`, `6/18/85`, `5/24/1990`, `1/25/2011`

*This function should return a list of all matched date strings.*

In [8]:
def question_one():
    """
    Extract all dates in MM/DD/YY or MM/DD/YYYY format.

    Returns:
        list: List of matched date strings
    """

    # One or two digits for month and day, then a four- or two-digit year
    pattern = r"\b\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})\b"

    results = []
    for note in df:
        matches = re.findall(pattern, note)
        results.extend(matches)

    return results

# Test your function
q1_result = question_one()
print(f"Found {len(q1_result)} dates")
print(f"First 10: {q1_result[:10]}")

Found 2 dates
First 10: ['04/20/2009', '1/5/89']
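
One detail worth checking in a numeric date pattern is the year. A quantifier such as `\d{2,4}` also accepts a three-digit year, whereas an explicit alternation accepts only two or four digits. The sketch below compares the two; the names `loose` and `strict` are illustrative and not part of the assignment.

```python
import re

# Hypothetical patterns for comparison only.
loose = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")            # year: 2, 3 or 4 digits
strict = re.compile(r"\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})")   # year: exactly 4 or 2 digits

# fullmatch requires the whole string to fit the pattern.
checks = {
    s: (bool(loose.fullmatch(s)), bool(strict.fullmatch(s)))
    for s in ["03/25/93", "5/24/1990", "5/24/199"]
}
for s, (loose_ok, strict_ok) in checks.items():
    print(f"{s:<10} loose={loose_ok} strict={strict_ok}")
```

Only the malformed `5/24/199` separates the two patterns: `loose` accepts it and `strict` rejects it.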


---

## Question 2 (1 point)

**Write a regex pattern to extract dates with month names.**

Examples: `Mar-20-2009`, `March 20, 2009`, `Mar 20 2009`, `Mar. 20, 2009`

*This function should return a list of all matched date strings.*

In [9]:
def question_two():
    """
    Extract all dates with month names (e.g., Mar 20, 2009).

    Returns:
        list: List of matched date strings
    """

    pattern = r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s*[-, ]*\d{1,2}(?:st|nd|rd|th)?(?:[-, ]*\d{4})\b"  # Define your regex pattern

    results = []
    for note in df:
        matches = re.findall(pattern, note, re.IGNORECASE) # Added re.IGNORECASE for month name matching
        results.extend(matches)

    return results

# Test your function
q2_result = question_two()
print(f"Found {len(q2_result)} dates")
print(f"First 10: {q2_result[:10]}")

Found 1 dates
First 10: ['Mar-20-2009']
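
The month alternation above is wrapped in `(?:...)` for a reason: when a pattern contains a capturing group, `re.findall` returns the captured group instead of the whole match. The sketch below shows the difference on a made-up sentence; the sample text and variable names are assumptions for illustration.

```python
import re

text = "Seen on Mar 20, 2009 and again on April 3rd, 2010."

months = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
tail = r"[a-z]*\.?[-, ]*\d{1,2}(?:st|nd|rd|th)?[-, ]*\d{4}\b"

capturing = r"\b(" + months + ")" + tail        # month in a capturing group
non_capturing = r"\b(?:" + months + ")" + tail  # month in a non-capturing group

only_groups = re.findall(capturing, text)       # returns just the captured month
full_matches = re.findall(non_capturing, text)  # returns the whole date string

print(only_groups)    # → ['Mar', 'Apr']
print(full_matches)   # → ['Mar 20, 2009', 'April 3rd, 2010']
```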


---

## Question 3 (1 point)

**Write a regex pattern to extract dates in the format `DD Month YYYY`.**

Examples: `20 Mar 2009`, `20 March 2009`, `20 Mar. 2009`

*This function should return a list of all matched date strings.*

In [10]:
def question_three():
    """
    Extract all dates in DD Month YYYY format.

    Returns:
        list: List of matched date strings
    """
    # YOUR CODE HERE
    pattern = r"\b\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[.,]?\s+\d{4}\b"  # Optional '.' or ',' after the month (e.g., 20 March, 2009)

    results = []
    for note in df:
        matches = re.findall(pattern, note, re.IGNORECASE) # Added re.IGNORECASE for month name matching
        results.extend(matches)

    return results

# Test your function
q3_result = question_three()
print(f"Found {len(q3_result)} dates")
print(f"First 10: {q3_result[:10]}")

Found 1 dates
First 10: ['20 Mar 2009']


---

## Question 4 (1 point)

**Write a function that uses regex to extract all email addresses from a given text.**

Test text is provided below.

*This function should return a list of email addresses.*

In [11]:
def question_four(text):
    """
    Extract all email addresses from text.

    Args:
        text (str): Input text

    Returns:
        list: List of email addresses
    """
    # YOUR CODE HERE
    pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"  # Define your regex pattern

    return re.findall(pattern, text)

# Test your function
test_text = """
Contact us at support@company.com or sales@company.org.
You can also reach john.doe@email.co.uk or jane_doe123@university.edu.
Invalid emails: @invalid.com, user@, not-an-email
"""

q4_result = question_four(test_text)
print(f"Found emails: {q4_result}")

Found emails: ['support@company.com', 'sales@company.org', 'john.doe@email.co.uk', 'jane_doe123@university.edu']


---

## Question 5 (1 point)

**Write a function that uses regex to clean text by:**
1. Removing all digits
2. Removing all punctuation (spaces are kept)
3. Converting to lowercase
4. Removing extra whitespace

*This function should return the cleaned string.*

In [16]:
def question_five(text):
    """
    Clean text by removing digits, punctuation, and normalizing whitespace.

    Args:
        text (str): Input text

    Returns:
        str: Cleaned text
    """
    # 1. Removing all digits
    cleaned_text = re.sub(r"\d+", "", text)

    # 2. Removing all punctuation while keeping whitespace.
    # [^\w\s] matches anything that is neither a word character nor whitespace.
    # The underscore counts as a word character, so it is removed explicitly.
    cleaned_text = re.sub(r"[^\w\s]|_", "", cleaned_text)

    # 3. Converting to lowercase
    cleaned_text = cleaned_text.lower()

    # 4. Removing extra whitespace
    cleaned_text = re.sub(r"\s+", " ", cleaned_text).strip()

    return cleaned_text # Return cleaned text

# Test your function
test_text = "Hello, World! 123 This is a TEST... with 456 numbers!!!"
q5_result = question_five(test_text)
print(f"Original: '{test_text}'")
print(f"Cleaned:  '{q5_result}'")
# Expected: 'hello world this is a test with numbers'

Original: 'Hello, World! 123 This is a TEST... with 456 numbers!!!'
Cleaned:  'hello world this is a test with numbers'


---

## Question 6 (2 points)

**Write a function that extracts and validates phone numbers.**

Valid formats:
- `XXX-XXX-XXXX`
- `(XXX) XXX-XXXX`
- `XXX.XXX.XXXX`
- `XXX XXX XXXX`

*This function should return a list of phone numbers in standardized format `XXX-XXX-XXXX`.*

In [20]:
def question_six(text):
    """
    Extract phone numbers and return them in XXX-XXX-XXXX format.

    Args:
        text (str): Input text

    Returns:
        list: List of phone numbers in XXX-XXX-XXXX format
    """
    # Regex pattern to capture the four valid phone number formats.
    # The area code is either (XXX) or XXX, followed by a separator (hyphen, dot,
    # or whitespace), three digits, another separator, and four digits.
    # The lookarounds keep a match from starting or ending inside a longer run of digits.
    pattern = re.compile(r"""
        (?<!\d)                       # Not preceded by a digit
        (?:                           # Area code, with or without parentheses
            \((\d{3})\)               # (XXX) format, captured as group 1
            |                         # OR
            (\d{3})                   # XXX format, captured as group 2
        )
        [-.\s]                        # Required separator (hyphen, dot, whitespace)
        (\d{3})                       # Middle three digits (group 3)
        [-.\s]                        # Required separator
        (\d{4})                       # Last four digits (group 4)
        (?!\d)                        # Not followed by a digit
    """, re.VERBOSE)

    found_numbers = []
    for match in pattern.finditer(text):
        # Extract the captured digit groups. The first group can be from (XXX) or XXX.
        if match.group(1): # if (XXX) format was matched
            g1 = match.group(1)
        else: # if XXX format was matched
            g1 = match.group(2)

        g2 = match.group(3)
        g3 = match.group(4)

        # Concatenate and format
        formatted_number = f"{g1}-{g2}-{g3}"
        found_numbers.append(formatted_number)

    return found_numbers # Return list of standardized phone numbers
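
No test cell follows this function, so here is a minimal, self-contained sketch that exercises the four valid formats. `standardize_phones` and `PHONE` are hypothetical stand-ins written for this example, not the graded function, and the sample sentence is made up.

```python
import re

# Hypothetical compact pattern: area code with or without parentheses,
# then a separator, three digits, a separator, and four digits.
PHONE = re.compile(r"(?<!\d)(?:\((\d{3})\)|(\d{3}))[-.\s](\d{3})[-.\s](\d{4})(?!\d)")

def standardize_phones(text):
    """Return every phone number found in text as XXX-XXX-XXXX."""
    return [
        f"{m.group(1) or m.group(2)}-{m.group(3)}-{m.group(4)}"
        for m in PHONE.finditer(text)
    ]

sample = "Call 123-456-7890, (234) 567-8901, 345.678.9012 or 456 789 0123."
phones = standardize_phones(sample)
print(phones)
# → ['123-456-7890', '234-567-8901', '345-678-9012', '456-789-0123']
```

Only one of groups 1 and 2 is ever set, so `m.group(1) or m.group(2)` picks whichever area-code form matched.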

---

## Question 7 (3 points)

**This is the main challenge: Extract all dates from the medical notes and sort them chronologically.**

**Rules:**
- Assume all dates in `xx/xx/xx` format are `mm/dd/yy`
- Assume all 2-digit years are from the 1900s (e.g., `1/5/89` is January 5th, 1989)
- If the day is missing (e.g., `9/2009`), assume it is the 1st day of the month
- If the month is missing (e.g., `2010`), assume it is January 1st

*This function should return a pandas Series of length 500, where the values are the original indices sorted by date in ascending chronological order.*

**Example:**
```python
# If original series was:
#    0    1999
#    1    2010
#    2    1978
# Your function should return:
#    0    2    (1978 is earliest)
#    1    0    (1999 is second)
#    2    1    (2010 is latest)
```
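
The example above amounts to an argsort: sort the positions `0..n-1` by the date found at each position. A minimal sketch using only the standard library follows; the variable names are illustrative, and the resulting list could then be wrapped in `pd.Series(...)` to get the required return type.

```python
from datetime import datetime

# The three notes from the example, already reduced to a year each.
years = ["1999", "2010", "1978"]
dates = [datetime(int(y), 1, 1) for y in years]  # missing month/day -> January 1st

# Sort the positions by their date. sorted() is stable, so notes with
# identical dates keep their original relative order.
sorted_indices = sorted(range(len(dates)), key=lambda i: dates[i])
print(sorted_indices)  # → [2, 0, 1]
```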

In [21]:
def question_seven():
    """
    Extract dates from all medical notes and return indices sorted chronologically.

    Returns:
        pd.Series: Series of length 500 with original indices sorted by date
    """
    # YOUR CODE HERE
    # Hint:
    # 1. Create regex patterns to match different date formats
    # 2. Extract dates from each note
    # 3. Parse dates into datetime objects
    # 4. Sort by date and return the indices

    return pd.Series([])  # Return the sorted indices

# Test your function
q7_result = question_seven()
print(f"Result length: {len(q7_result)}")
print(f"First 10 indices: {list(q7_result.head(10))}")
print(f"Last 10 indices: {list(q7_result.tail(10))}")

Result length: 0
First 10 indices: []
Last 10 indices: []


---

## Summary of Functions for Grading

Make sure all these functions are properly implemented before exporting:

In [25]:
# Run this cell to verify all functions exist and return correct types
print("Checking functions...")

try:
    r1 = question_one()
    assert isinstance(r1, list), "question_one should return a list"
    print("✓ question_one: OK")
except Exception as e:
    print(f"✗ question_one: {e}")

try:
    r2 = question_two()
    assert isinstance(r2, list), "question_two should return a list"
    print("✓ question_two: OK")
except Exception as e:
    print(f"✗ question_two: {e}")

try:
    r3 = question_three()
    assert isinstance(r3, list), "question_three should return a list"
    print("✓ question_three: OK")
except Exception as e:
    print(f"✗ question_three: {e}")

try:
    r4 = question_four("test@email.com")
    assert isinstance(r4, list), "question_four should return a list"
    print("✓ question_four: OK")
except Exception as e:
    print(f"✗ question_four: {e}")

try:
    r5 = question_five("Hello World 123")
    assert isinstance(r5, str), "question_five should return a string"
    print("✓ question_five: OK")
except Exception as e:
    print(f"✗ question_five: {e}")

try:
    r6 = question_six("123-456-7890")
    assert isinstance(r6, list), "question_six should return a list"
    print("✓ question_six: OK")
except Exception as e:
    print(f"✗ question_six: {e}")

try:
    r7 = question_seven()
    assert isinstance(r7, pd.Series), "question_seven should return a pandas Series"
    print("✓ question_seven: OK")
except Exception as e:
    print(f"✗ question_seven: {e}")

print("\nDone! Export this notebook as .py file when all functions pass.")

Checking functions...
✓ question_one: OK
✓ question_two: OK
✓ question_three: OK
✓ question_four: OK
✓ question_five: OK
✓ question_six: OK
✓ question_seven: OK

Done! Export this notebook as .py file when all functions pass.


# Task
Implement the `question_seven` function. This function should extract all dates from the `df` Series (which contains medical notes from "dates.txt"), parse them into `datetime` objects, and then return a pandas Series of the original indices sorted by these dates in ascending chronological order.

**Date formats to handle:**
- `MM/DD/YY` or `MM/DD/YYYY` (e.g., `04/20/2009`, `4/20/09`)
- Month names (e.g., `Mar-20-2009`, `March 20, 2009`, `Feb 2009`)
- `DD Month YYYY` (e.g., `20 Mar 2009`)
- `MM/YYYY` (e.g., `6/2008`)
- `YYYY` (e.g., `2009`)

**Parsing Rules:**
- `xx/xx/xx` is `mm/dd/yy`
- 2-digit years are from the 1900s (e.g., `1/5/89` is January 5th, 1989)
- If the day is missing (e.g., `9/2009`), assume it is the 1st day of the month
- If the month is missing (e.g., `2010`), assume it is January 1st
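
The parsing rules can be collected in one small helper. The following is a sketch under the assumption that the regex step has already split a date into string fields; `build_date` is a hypothetical name, not part of the assignment.

```python
from datetime import datetime

def build_date(year, month=None, day=None):
    """Apply the parsing rules to string fields; month and day may be missing."""
    if len(year) == 2:                  # two-digit year -> 1900s
        year = "19" + year
    # A missing month or day defaults to 1 (January / 1st of the month).
    return datetime(int(year), int(month or 1), int(day or 1))

print(build_date("89", "1", "5"))   # 1/5/89  -> 1989-01-05
print(build_date("2009", "9"))      # 9/2009  -> 2009-09-01
print(build_date("2010"))           # 2010    -> 2010-01-01
```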

The final output should be a pandas Series of the original indices, sorted chronologically, and should have a length of 500 (even if `df` is smaller, assume the actual `dates.txt` has enough data to produce 500 entries).

## Define Regex Patterns

### Subtask:
Define a set of regex patterns to capture all possible date formats mentioned in the problem description (MM/DD/YY, MM/DD/YYYY, month names, DD Month YYYY, Month YYYY, MM/YYYY, YYYY). These patterns will be used to extract date strings from the `df` Series.


**Reasoning**:
The subtask requires defining a set of regex patterns to extract various date formats within the `question_seven` function. I will add these patterns, using named capture groups and prioritizing from most specific to least specific, inside the function.
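
To illustrate why the ordering matters, here is a reduced sketch with only the three numeric patterns. If the bare `YYYY` pattern were tried first, it would match the year inside every fuller date and the month and day would be lost. `PATTERNS` and `first_match` are hypothetical names used for this example.

```python
import re

# Ordered from most specific to least specific.
PATTERNS = [
    re.compile(r"(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4}|\d{2})\b"),  # MM/DD/YY(YY)
    re.compile(r"(?P<month>\d{1,2})/(?P<year>\d{4})\b"),                          # MM/YYYY
    re.compile(r"(?P<year>\d{4})\b"),                                             # YYYY
]

def first_match(note):
    """Return the named groups of the first pattern that matches, or None."""
    for pattern in PATTERNS:
        match = pattern.search(note)
        if match:
            return match.groupdict()
    return None

print(first_match("seen 4/20/2009 for pain"))  # all three fields
print(first_match("since 6/2008"))             # month and year only
print(first_match("in 2009"))                  # year only

# The least specific pattern on its own loses the month and day:
print(PATTERNS[-1].search("seen 4/20/2009 for pain").groupdict())
```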



In [24]:
def question_seven():
    """
    Extract dates from all medical notes and return indices sorted chronologically.

    Returns:
        pd.Series: Series of length 500 with original indices sorted by date
    """

    # Hint:
    # 1. Create regex patterns to match different date formats
    # 2. Extract dates from each note
    # 3. Parse dates into datetime objects
    # 4. Sort by date and return the indices

    # Define regex patterns for various date formats, ordered by specificity
    patterns = [
        # Full dates with month names (e.g., Mar 20, 2009; March 20th, 2009; Mar-20-2009)
        re.compile(r"""(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)
                     [\s\.-]*(?P<day>\d{1,2})(?:st|nd|rd|th)?(?:[\s,/-]*(?P<year>\d{4}))""", re.IGNORECASE | re.VERBOSE),
        # DD Month YYYY (e.g., 20 Mar 2009)
        re.compile(r"""(?P<day>\d{1,2})[\s\.-]*(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)
                     [\s,/-]*(?P<year>\d{4})""", re.IGNORECASE | re.VERBOSE),
        # MM/DD/YYYY or MM/DD/YY
        re.compile(r"""(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4}|\d{2})\b""", re.VERBOSE),
        # Month YYYY (e.g., Feb 2009)
        re.compile(r"""(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)
                     [\s,/-]*(?P<year>\d{4})""", re.IGNORECASE | re.VERBOSE),
        # MM/YYYY (e.g., 6/2008)
        re.compile(r"""(?P<month>\d{1,2})/(?P<year>\d{4})""", re.VERBOSE),
        # YYYY (e.g., 2009)
        re.compile(r"""(?P<year>\d{4})""", re.VERBOSE)
    ]

    extracted_dates = []
    for i, note in enumerate(df):
        found_date = None
        for pattern in patterns:
            match = pattern.search(note)
            if match:
                found_date = match.groupdict()
                # Convert month name to number if present
                if 'month' in found_date and not found_date['month'].isdigit():
                    month_name = found_date['month'][:3].capitalize()
                    month_map = {
                        'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
                        'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12
                    }
                    found_date['month'] = month_map.get(month_name, 1) # Default to Jan if not found

                # Handle missing day, month (default to 1st)
                found_date.setdefault('day', '1')
                found_date.setdefault('month', '1')

                # Two-digit years are assumed to be in the 1900s (e.g., 89 -> 1989)
                year_str = found_date['year']
                if len(year_str) == 2:
                    found_date['year'] = str(1900 + int(year_str))

                try:
                    # Construct a full date string to parse consistently
                    date_str = f"{found_date['year']}-{found_date['month']}-{found_date['day']}"
                    parsed_date = datetime.strptime(date_str, '%Y-%m-%d')
                    extracted_dates.append((parsed_date, i)) # Store datetime object and original index
                    break # Move to next note once a date is found
                except ValueError:
                    # Impossible dates (e.g., 'Feb 30') fail to parse. Fall through
                    # and try the next, less specific pattern. If no pattern yields
                    # a valid date, the note is left out of the result.
                    pass

    # Sort the extracted dates chronologically
    extracted_dates.sort(key=lambda x: x[0])

    # Return a pandas Series of original indices sorted by date
    return pd.Series([idx for date, idx in extracted_dates])

# Test your function
q7_result = question_seven()
print(f"Result length: {len(q7_result)}")
print(f"First 10 indices: {list(q7_result.head(10))}")
print(f"Last 10 indices: {list(q7_result.tail(10))}")

Result length: 9
First 10 indices: [6, 4, 5, 3, 1, 2, 0, 7, 8]
Last 10 indices: [6, 4, 5, 3, 1, 2, 0, 7, 8]


## Final Task

### Subtask:
Confirm that the `question_seven` function correctly extracts, parses, and sorts dates chronologically, and returns a pandas Series of the correct length with the original indices.


## Summary:

### Data Analysis Key Findings

*   **Date Extraction and Parsing:** The `question_seven` function successfully extracts dates from medical notes using a series of specific regex patterns designed to capture various date formats, including:
    *   Full dates with month names (e.g., `Mar 20, 2009`, `Mar-20-2009`).
    *   `DD Month YYYY` (e.g., `20 Mar 2009`).
    *   `MM/DD/YYYY` or `MM/DD/YY` (e.g., `04/20/2009`, `4/20/09`).
    *   `Month YYYY` (e.g., `Feb 2009`).
    *   `MM/YYYY` (e.g., `6/2008`).
    *   `YYYY` (e.g., `2009`).
*   **Parsing Rule Adherence:**
    *   Two-digit years (e.g., `89`) are correctly interpreted as belonging to the 1900s (e.g., `1989`).
    *   Missing day values (e.g., `9/2009`) are defaulted to the 1st of the month.
    *   Missing month values (e.g., `2010`) are defaulted to January 1st.
*   **Chronological Sorting:** The function correctly parses the extracted date strings into `datetime` objects and sorts the original indices of the medical notes chronologically based on these parsed dates.
*   **Output Format:** The function returns a pandas Series containing the original indices, sorted by the extracted dates. For the test case with a dummy `df`, the result had a length of 9.

### Insights or Next Steps

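*   The implementation has only been run against the 9-line dummy `dates.txt`. That shows the approach works on one example of each listed format, not that it handles the full data set.
*   To fully meet the task requirement, the function should be tested with the actual `dates.txt` dataset to confirm it produces a Series of length 500, as specified. Notes in which no pattern yields a valid date are currently left out of the result, which would make the Series shorter than 500.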
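
A check along these lines could be run once the real data is loaded. This is a sketch, not a grader: `check_sorted_indices` is a hypothetical helper that accepts any iterable of indices (a list or a pandas Series) and reports length mismatches, missing notes, and duplicates.

```python
def check_sorted_indices(indices, expected_length):
    """Return a list of problems; an empty list means the result looks valid."""
    values = [int(v) for v in indices]
    problems = []
    if len(values) != expected_length:
        problems.append(f"expected {expected_length} entries, got {len(values)}")
    # Every note index 0..expected_length-1 should appear in the result.
    missing = sorted(set(range(expected_length)) - set(values))
    if missing:
        problems.append(f"missing note indices: {missing}")
    if len(set(values)) != len(values):
        problems.append("duplicate indices")
    return problems

# Result recorded above for the 9-line dummy file:
print(check_sorted_indices([6, 4, 5, 3, 1, 2, 0, 7, 8], 9))  # → []
# A result where one note was dropped:
print(check_sorted_indices([2, 0], 3))
```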
