# Advanced Exercises – `dateutil` Parser

In this notebook you'll practice **advanced usage patterns** of the `dateutil.parser` module:

* Handling ambiguous date formats with `dayfirst` / `yearfirst`
* Using `default` to fill in missing components
* Fuzzy parsing of noisy text and robust error handling
* Custom time zone resolution with `tzinfos`
* End‑to‑end data cleaning and normalization pipeline

Each exercise is followed by a **worked solution** that follows best practices (clear naming, comments, tests on sample data, and separation of concerns).


In [1]:
from __future__ import annotations

from datetime import datetime
from typing import Any, Dict, List, Optional, Sequence

from dateutil import parser, tz


## Exercise 1 – Parsing Ambiguous Date Formats

You receive timestamps from two partners:

* Partner **US** uses `M/D/YYYY` (e.g. `3/6/2020` is **March 6**).
* Partner **EU** uses `D/M/YYYY` (e.g. `3/6/2020` is **6 March**).

You get a list of raw date strings *without* knowing which partner they came from, but you **do** know that:

* If the first number is greater than 12, it must be a **day** (EU format).
* Otherwise, you assume **US format**.

### Task

Implement a function:

```python
def parse_mixed_american_european(date_str: str) -> datetime:
    ...
```

Requirements:

1. Use `dateutil.parser.parse`.
2. Use **`dayfirst`** to disambiguate when needed.
3. Return **naive** `datetime` objects (no timezone).
4. Raise a `ValueError` with a clear error message if parsing fails.
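
Requirement 4 depends on how `dateutil` reports failures. The sketch below assumes `dateutil` 2.8.1 or newer, where `ParserError` was introduced as a subclass of `ValueError`. It shows that an `except ValueError` clause also catches `ParserError`, and that `raise ... from exc` keeps the original error attached as `__cause__`. The helper name `describe_failure` is hypothetical and exists only for this illustration.

```python
from dateutil import parser


def describe_failure(date_str: str) -> str:
    """Hypothetical helper: report how a parse failure is wrapped and chained."""
    try:
        try:
            parser.parse(date_str)
        except ValueError as exc:  # also catches parser.ParserError
            raise ValueError(f"Could not parse date string {date_str!r}") from exc
    except ValueError as wrapped:
        cause = type(wrapped.__cause__).__name__
        return f"{wrapped} (caused by {cause})"
    return "parsed successfully"


print(issubclass(parser.ParserError, ValueError))
print(describe_failure("not a date"))
print(describe_failure("3/6/2020"))
```

Chaining with `from exc` means the traceback still shows the parser's own message, which helps when the wrapper message alone is not enough to debug a bad input.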

Test it on:

```python
samples = [
    "3/6/2020",   # 2020-03-06 00:00:00 (US)
    "16/2/2020",  # 2020-02-16 00:00:00 (EU)
    "11/12/2020"  # 2020-11-12 00:00:00 (US)
]
```


In [2]:
def parse_mixed_american_european(date_str: str) -> datetime:
    """Parse a date string that could be in US (M/D/Y) or EU (D/M/Y) format.

    Heuristic:
    - If the first number > 12 => must be a day => treat as day-first.
    - Otherwise => treat as US-style month-first.
    """    
    date_str = date_str.strip()
    if not date_str:
        raise ValueError("Empty date string")

    # Extract the first numeric token to decide the heuristic.
    first_token = date_str.split("/")[0].strip()
    try:
        first_number = int(first_token)
    except ValueError as exc:
        raise ValueError(f"Cannot interpret leading token '{first_token}' as a day/month number") from exc

    dayfirst = first_number > 12

    try:
        # We only care about the date portion, so time defaults are fine.
        return parser.parse(date_str, dayfirst=dayfirst)
    except (ValueError, parser.ParserError) as exc:
        raise ValueError(f"Could not parse date string '{date_str}' with heuristic dayfirst={dayfirst}") from exc


# Quick sanity checks
samples = [
    "3/6/2020",   # March 6, 2020 (US)
    "16/2/2020",  # 16 February 2020 (EU)
    "11/12/2020", # November 12, 2020 (US)
]

for s in samples:
    print(s, "->", parse_mixed_american_european(s))


3/6/2020 -> 2020-03-06 00:00:00
16/2/2020 -> 2020-02-16 00:00:00
11/12/2020 -> 2020-11-12 00:00:00
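
To see why the heuristic only needs to set `dayfirst` for the ambiguous cases, the following small sketch compares the parser's behaviour with and without the flag. It uses only `parser.parse` and the sample strings from this exercise; the variable names are illustrative.

```python
from datetime import datetime
from dateutil import parser

# Ambiguous: both numbers could be a month, so `dayfirst` decides.
us_style = parser.parse("3/6/2020")                 # month-first (the default)
eu_style = parser.parse("3/6/2020", dayfirst=True)  # day-first

# Unambiguous: 16 cannot be a month, so the result does not depend on `dayfirst`.
forced_default = parser.parse("16/2/2020")
forced_dayfirst = parser.parse("16/2/2020", dayfirst=True)

print(us_style.date(), eu_style.date())
print(forced_default.date(), forced_dayfirst.date())
```

Run as written, the first line should print `2020-03-06 2020-06-03` and the second `2020-02-16 2020-02-16`. In other words, the parser already resolves a leading number above 12 as a day by itself; passing `dayfirst=True` in that case simply makes the intent explicit.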


## Exercise 2 – Using `default` to Fill Missing Components

You receive log records from a service that uses **partial timestamps**:

```text
"2020-03-01"
"03-02 10:15"
"10:30"
"2020/03/04 23"
```

Rules:

1. If the **year** is missing, assume the year of the **reference datetime**.
2. If the **date** is missing, assume the date of the **reference datetime**.
3. If the **time** is missing, assume `00:00:00`.

### Task

Implement:

```python
def parse_with_reference(ts: str, reference: datetime) -> datetime:
    ...
```

Use `parser.parse(..., default=reference)` and then **normalize** the parts so that:

* Any components explicitly present in `ts` override the default.
* Missing parts come from `reference` or from zero‑time when required.
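
As a quick illustration of why a normalization step is needed at all, the sketch below shows what `default` does by itself: every field that is absent from the string is copied from the default, including the time of day. The reference value matches the one used in this exercise.

```python
from datetime import datetime
from dateutil import parser

reference = datetime(2020, 3, 1, 8, 0, 0)

# Time only: the date is copied from `reference`.
time_only = parser.parse("10:30", default=reference)

# Date only: the time is ALSO copied from `reference` (08:00),
# which conflicts with rule 3 unless it is reset afterwards.
date_only = parser.parse("2020-03-04", default=reference)

print(time_only)
print(date_only)
```

This should print `2020-03-01 10:30:00` followed by `2020-03-04 08:00:00`. The second result is the case that rule 3 is meant to correct.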

Test with:

```python
reference = datetime(2020, 3, 1, 8, 0, 0)
inputs = [
    "2020-03-01",
    "03-02 10:15",
    "10:30",
    "2020/03/04 23"
]
```


In [3]:
def parse_with_reference(ts: str, reference: datetime) -> datetime:
    """Parse a partial timestamp using a reference datetime.

    Uses dateutil.parser with `default=reference` and then enforces:
    - If year/month/day are missing, take them from `reference`.
    - If hour/minute/second are missing, use 0.
    """    
    ts = ts.strip()
    if not ts:
        raise ValueError("Empty timestamp string")

    # First, parse using reference as the default.
    parsed = parser.parse(ts, default=reference)

    # Now we need to determine which date components were actually present.
    # A simple (not fully general) heuristic is to inspect the "-" or "/"
    # separated tokens of the string.
    has_year = any(len(tok) == 4 and tok.isdigit() for tok in ts.replace("/", "-").split("-"))
    has_month_or_day = any(tok.isdigit() and len(tok) in (1, 2) for tok in ts.replace("/", "-").split("-"))

    # Start from reference and selectively override components from parsed.
    year = parsed.year if has_year else reference.year
    month = parsed.month if has_year or has_month_or_day else reference.month
    day = parsed.day if has_year or has_month_or_day else reference.day

    # For the time, re-parse against midnight: explicit parts (even an
    # hour-only value such as "23") are kept, and anything missing falls back
    # to zero instead of leaking from `reference`.
    midnight = reference.replace(hour=0, minute=0, second=0, microsecond=0)
    timed = parser.parse(ts, default=midnight)

    return datetime(year, month, day, timed.hour, timed.minute, timed.second, timed.microsecond)


# Demonstration
reference = datetime(2020, 3, 1, 8, 0, 0)
inputs = [
    "2020-03-01",   # full date, no time
    "03-02 10:15",  # month-day time, no year
    "10:30",        # time only
    "2020/03/04 23" # full date, hour only
]

for s in inputs:
    print(f"{s!r} -> {parse_with_reference(s, reference)}")


'2020-03-01' -> 2020-03-01 00:00:00
'03-02 10:15' -> 2020-03-02 10:15:00
'10:30' -> 2020-03-01 10:30:00
'2020/03/04 23' -> 2020-03-04 23:00:00


## Exercise 3 – Fuzzy Parsing of Noisy Text

You get human‑written messages like:

```text
"Payment received on 2020-04-03 at 5pm, thanks!"
"Let's meet next Tuesday (2020/04/07 14:30 CET) at the office."
"created: 04-08-2020 09:00 GMT; last updated: 04-10-2020 18:00 GMT"
"no date here, just some text"
```

### Task

1. Implement a function

   ```python
   def extract_first_datetime(text: str, tzinfo=None) -> Optional[datetime]:
       ...
   ```

   that:

   * Uses `parser.parse(..., fuzzy_with_tokens=True)` to try to extract a datetime.
   * Returns **timezone‑aware** datetimes when `tzinfo` is provided, otherwise returns naive ones.
   * Catches `ParserError` and returns `None` instead of raising.

2. Apply it to each line in the list above and print either the parsed datetime or `"NO DATE"`.

This is a typical pattern when mining timestamps from logs, emails, or free‑form notes.
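
For orientation, here is a minimal sketch of what `fuzzy_with_tokens=True` returns: a tuple whose first element is the parsed datetime and whose second element is a tuple of the text fragments the parser skipped. The exact fragment boundaries are an implementation detail, so the sketch prints them without relying on their content.

```python
from datetime import datetime
from dateutil import parser

text = "Payment received on 2020-04-03 at 5pm, thanks!"
dt, skipped = parser.parse(text, fuzzy_with_tokens=True)

print(dt)       # the datetime assembled from the recognized tokens
print(skipped)  # tuple of text fragments the parser ignored
```

Inspecting the skipped fragments can be a cheap sanity check: if almost nothing was skipped in a long sentence, the parser may have consumed words or numbers that were not part of a timestamp.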


In [4]:
from dateutil.parser import ParserError  # type: ignore[attr-defined]


def extract_first_datetime(text: str, tzinfo=None) -> Optional[datetime]:
    """Extract a datetime from a noisy string.

    - Uses `fuzzy_with_tokens=True` so that unrelated text is ignored.
    - The whole string is parsed as ONE timestamp, so text that mentions
      several dates usually cannot be reconciled and yields None.
    - Returns a timezone-aware datetime if `tzinfo` is provided.
    - Returns None if no datetime can be parsed.
    """
    try:
        dt, _tokens = parser.parse(text, fuzzy_with_tokens=True)
    except (ValueError, ParserError):
        return None

    if tzinfo is not None and dt.tzinfo is None:
        # Localize naive datetime to the provided timezone.
        dt = dt.replace(tzinfo=tzinfo)

    return dt


examples = [
    "Payment received on 2020-04-03 at 5pm, thanks!",
    "Let's meet next Tuesday (2020/04/07 14:30 CET) at the office.",
    "created: 04-08-2020 09:00 GMT; last updated: 04-10-2020 18:00 GMT",
    "no date here, just some text",
]

default_tz = tz.UTC

for line in examples:
    dt = extract_first_datetime(line, tzinfo=default_tz)
    print(f"{line!r} -> {dt if dt is not None else 'NO DATE'}")


'Payment received on 2020-04-03 at 5pm, thanks!' -> 2020-04-03 17:00:00+00:00
"Let's meet next Tuesday (2020/04/07 14:30 CET) at the office." -> 2020-04-07 14:30:00+00:00
'created: 04-08-2020 09:00 GMT; last updated: 04-10-2020 18:00 GMT' -> NO DATE
'no date here, just some text' -> NO DATE
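
Note that the `CET` example above comes back as `14:30:00+00:00`: the parser does not know the abbreviation `CET`, so the result is naive and the fallback zone is then stamped onto it unchanged. One possible way to keep the original offset is to combine fuzzy parsing with `tzinfos`. The sketch below assumes that `CET` means a fixed UTC+01:00 offset for this data source; the mapping name `TZ_ABBREVIATIONS` is hypothetical.

```python
from datetime import datetime, timedelta
from dateutil import parser, tz

# Assumption: "CET" is a fixed UTC+01:00 offset in this data source.
TZ_ABBREVIATIONS = {"CET": tz.tzoffset("CET", 3600)}

text = "Let's meet next Tuesday (2020/04/07 14:30 CET) at the office."
dt = parser.parse(text, fuzzy=True, tzinfos=TZ_ABBREVIATIONS)

print(dt)                     # aware datetime carrying the CET offset
print(dt.astimezone(tz.UTC))  # the same instant expressed in UTC
```

With the abbreviation resolved during parsing, converting to UTC shifts the wall-clock time by one hour instead of silently relabelling 14:30 as UTC.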




## Exercise 4 – Custom Time Zone Abbreviations with `tzinfos`

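By default, `dateutil` resolves only a handful of timezone abbreviations on its own (`UTC`, `GMT`, `Z`, and the names of the machine's local zone). Any other abbreviation is dropped with an `UnknownTimezoneWarning` and the result stays naive. In real‑world systems you will often encounter **custom or ambiguous** abbreviations, and `tzinfos` is the hook for resolving them.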

Suppose a legacy system uses:

* `EET` for `UTC+02:00`
* `EEST` for `UTC+03:00` (summer time)
* `NYT` for America/New_York local time

You receive strings like:

```text
"2020-05-01 10:00 EET"
"2020-07-01 10:00 EEST"
"2020-03-01 09:30 NYT"
```

### Task

1. Build a `tzinfos` mapping that resolves these abbreviations to concrete `tzinfo` objects using `dateutil.tz.gettz`.
2. Implement

   ```python
   def parse_with_custom_tz(dt_str: str) -> datetime:
       ...
   ```

   that uses `parser.parse(..., tzinfos=TZINFOS)` and always returns a **timezone‑aware** datetime.

3. Demonstrate that `NYT` respects daylight saving time automatically (hint: parse one date in winter and one in summer).
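
As background for step 1, the values of a `tzinfos` mapping do not have to be `tzinfo` objects. To the best of my knowledge of the `dateutil` parser, an integer is interpreted as a UTC offset in seconds. The sketch below shows both forms side by side; the mapping name `TZINFOS_SKETCH` is hypothetical and is not the mapping the exercise asks you to build.

```python
from datetime import timedelta
from dateutil import parser, tz

# Hypothetical mapping, for illustration only.
TZINFOS_SKETCH = {
    "EET": 2 * 3600,                        # int: UTC offset in seconds
    "EEST": tz.tzoffset("EEST", 3 * 3600),  # explicit tzinfo object
}

eet = parser.parse("2020-05-01 10:00 EET", tzinfos=TZINFOS_SKETCH)
eest = parser.parse("2020-07-01 10:00 EEST", tzinfos=TZINFOS_SKETCH)

print(eet, eet.utcoffset())
print(eest, eest.utcoffset())
```

Fixed offsets are fine for abbreviations that always mean the same offset. For a region with daylight saving rules, such as `NYT` here, a zone obtained from `tz.gettz` is the better fit.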


In [5]:
# Define our custom time zone mapping.
TZINFOS: Dict[str, Any] = {
    "EET": tz.gettz("Etc/GMT-2"),          # fixed offset +02:00
    "EEST": tz.gettz("Etc/GMT-3"),         # fixed offset +03:00
    "NYT": tz.gettz("America/New_York"),   # real region with DST rules
}


def parse_with_custom_tz(dt_str: str) -> datetime:
    """Parse a datetime string using custom timezone abbreviations.

    Ensures the result is timezone-aware by providing `tzinfos`.
    """    
    dt = parser.parse(dt_str, tzinfos=TZINFOS)
    if dt.tzinfo is None:
        # Should not happen when abbreviations are present, but be defensive.
        dt = dt.replace(tzinfo=tz.UTC)
    return dt


examples = [
    "2020-05-01 10:00 EET",
    "2020-07-01 10:00 EEST",
    "2020-01-15 09:30 NYT",  # winter in New York
    "2020-07-15 09:30 NYT",  # summer in New York (DST)
]

for s in examples:
    dt = parse_with_custom_tz(s)
    print(f"{s!r} -> {dt}  (UTC offset: {dt.utcoffset()})")


'2020-05-01 10:00 EET' -> 2020-05-01 10:00:00+02:00  (UTC offset: 2:00:00)
'2020-07-01 10:00 EEST' -> 2020-07-01 10:00:00+03:00  (UTC offset: 3:00:00)
'2020-01-15 09:30 NYT' -> 2020-01-15 09:30:00-05:00  (UTC offset: -1 day, 19:00:00)
'2020-07-15 09:30 NYT' -> 2020-07-15 09:30:00-04:00  (UTC offset: -1 day, 20:00:00)
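
The DST effect is easiest to see after converting both results to UTC. The following sketch repeats the two `NYT` examples and assumes that a time zone database is available to `tz.gettz` (which is normally the case, since `dateutil` can fall back to its bundled copy).

```python
from dateutil import parser, tz

NEW_YORK = tz.gettz("America/New_York")  # assumes tz data is available

winter = parser.parse("2020-01-15 09:30 NYT", tzinfos={"NYT": NEW_YORK})
summer = parser.parse("2020-07-15 09:30 NYT", tzinfos={"NYT": NEW_YORK})

# Same local wall-clock time, different instants in UTC.
print(winter.astimezone(tz.UTC))
print(summer.astimezone(tz.UTC))
```

The same local time of 09:30 maps to 14:30 UTC in January and 13:30 UTC in July, which is the one-hour shift that a fixed-offset mapping would miss.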


## Exercise 5 – Robust Normalization Pipeline

You are given a list of heterogeneous date strings coming from different systems:

```python
raw_dates = [
    "2020-01-01T10:30:00Z",
    "01/02/2020 15:00",
    "2/1/2020 3:00 pm",
    "March 5, 2020 11:45",
    "2020-04-01 08:00 +02:00",
    "Invalid 2020-13-01 date",
]
```

Business requirements:

1. Interpret **unambiguous** dates using `dateutil.parser.parse`.
2. Use the following heuristics for ambiguous `M/D/Y` vs `D/M/Y` cases:
   * If the first number can only be a day (it is greater than 12) ⇒ interpret as **day-first**.
   * Otherwise ⇒ interpret as **month-first**.
3. Normalize everything to **UTC** and return ISO‑8601 strings.
4. For any entry that cannot be parsed, return `None` and log a useful error message.
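
Requirements 3 and 4 can be sketched for a single value as follows. This is a minimal illustration rather than the full pipeline: it ignores the day-first heuristic, uses the standard `logging` module for the error message, and assumes that naive datetimes are already in UTC. The logger name and the function name `to_utc_iso_or_none` are hypothetical.

```python
import logging
from typing import Optional

from dateutil import parser, tz

logger = logging.getLogger("date_normalizer")  # hypothetical logger name


def to_utc_iso_or_none(raw: str) -> Optional[str]:
    """Hypothetical helper: parse one string and return an ISO-8601 UTC string."""
    try:
        dt = parser.parse(raw)
    except (ValueError, OverflowError) as exc:
        # Log the offending input together with the parser's own message.
        logger.warning("Could not parse %r: %s", raw, exc)
        return None

    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=tz.UTC)  # assumption: naive values are UTC

    return dt.astimezone(tz.UTC).isoformat()


print(to_utc_iso_or_none("2020-04-01 08:00 +02:00"))
print(to_utc_iso_or_none("Invalid 2020-13-01 date"))
```

Using `logging` instead of `print` lets the caller decide where failures are recorded, while the return value stays a plain `Optional[str]`.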

### Task

Implement:

```python
def normalize_dates_to_utc_iso(raw_dates: Sequence[str]) -> List[Optional[str]]:
    ...
```

and run it on the `raw_dates` list above.


In [6]:
def _guess_dayfirst_flag(date_str: str) -> bool:
    """Return True if we should treat `date_str` as day-first, False otherwise.

    Heuristic:
    - If the first numeric token is > 12, it must be the day => day-first.
    - Otherwise assume month-first.
    """    
    # Very similar heuristic to Exercise 1, but kept private for this pipeline.
    for sep in ("/", "-", "."):
        if sep in date_str:
            first_token = date_str.split(sep)[0].strip()
            break
    else:
        return False  # no obvious separator; fall back to default

    try:
        first_number = int(first_token)
    except ValueError:
        return False

    # Values above 31 (e.g. a leading four-digit year as in "2020-04-01")
    # cannot be a day, so only 13..31 signals a day-first date. Without this
    # guard, year-first strings would be parsed with day and month swapped.
    return 12 < first_number <= 31


def normalize_dates_to_utc_iso(raw_dates: Sequence[str]) -> List[Optional[str]]:
    """Normalize a collection of date strings to ISO-8601 UTC strings.

    Returns a list aligned with `raw_dates`:
    - Successful parse => ISO 8601 string in UTC (e.g. '2020-01-01T15:00:00+00:00').
    - Failed parse => None.
    """
    normalized: List[Optional[str]] = []

    for original in raw_dates:
        s = original.strip()
        if not s:
            print("Skipping empty date string.")
            normalized.append(None)
            continue

        dayfirst = _guess_dayfirst_flag(s)

        try:
            dt = parser.parse(s, dayfirst=dayfirst)
        except (ValueError, parser.ParserError) as exc:
            print(f"Failed to parse '{original}': {exc}")
            normalized.append(None)
            continue

        # Ensure timezone-aware; assume naive datetimes are in UTC by default
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=tz.UTC)

        # Convert to UTC
        dt_utc = dt.astimezone(tz.UTC)
        normalized.append(dt_utc.isoformat())

    return normalized


raw_dates = [
    "2020-01-01T10:30:00Z",
    "01/02/2020 15:00",
    "2/1/2020 3:00 pm",
    "March 5, 2020 11:45",
    "2020-04-01 08:00 +02:00",
    "Invalid 2020-13-01 date",
]

normalized = normalize_dates_to_utc_iso(raw_dates)

for raw, norm in zip(raw_dates, normalized):
    print(f"{raw!r} -> {norm}")


Failed to parse 'Invalid 2020-13-01 date': Unknown string format: Invalid 2020-13-01 date
'2020-01-01T10:30:00Z' -> 2020-01-01T10:30:00+00:00
'01/02/2020 15:00' -> 2020-01-02T15:00:00+00:00
'2/1/2020 3:00 pm' -> 2020-02-01T15:00:00+00:00
'March 5, 2020 11:45' -> 2020-03-05T11:45:00+00:00
'2020-04-01 08:00 +02:00' -> 2020-04-01T06:00:00+00:00
'Invalid 2020-13-01 date' -> None
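
One pitfall is worth keeping in mind whenever a `dayfirst` heuristic meets ISO-style input: for year-first strings, `dayfirst=True` makes the parser read the remaining two numbers as day then month. The sketch below demonstrates this with plain `parser.parse` and, as a possible alternative for strictly ISO-8601 input, `parser.isoparse` (available in `dateutil` 2.7.0 and later), which has no such flag.

```python
from datetime import datetime
from dateutil import parser

iso_like = "2020-04-01"

default_order = parser.parse(iso_like)                 # April 1
with_dayfirst = parser.parse(iso_like, dayfirst=True)  # day and month swapped
strict_iso = parser.isoparse(iso_like)                 # ISO-8601 only, never swaps

print(default_order.date(), with_dayfirst.date(), strict_iso.date())
```

This should print `2020-04-01 2020-01-04 2020-04-01`, so any heuristic that inspects the first number should leave `dayfirst` off when that number is a year.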
