# DateTracker Testing

This notebook tests the `DateTracker` class to verify that it reads from and writes to S3 correctly.

In [1]:
import sys
sys.path.insert(0, '/opt/notebooks')

from scripts.date_tracker import DateTracker

## Test 1: Initialize DateTracker

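Create a new tracker for NFL. On initialization it should:
- Connect to S3
- Try to load the existing tracking file
- Fall back to an empty list if the file doesn't exist

The expected output below assumes that no tracking file exists yet.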

In [2]:
tracker = DateTracker(sport="NFL")
print("Tracker initialized successfully!")
print(f"S3 path: {tracker.s3_path}")
print(f"Currently scraped dates: {tracker.get_all_scraped()}")

Tracker initialized successfully!
S3 path: s3://dfscrunch-data-lake/staging/metadata/NFL/scraped_dates.json
Currently scraped dates: []
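
The load-or-empty behavior checked above can be illustrated without S3. This is a minimal sketch, not the actual `DateTracker` implementation: `LocalDateTracker` and its local JSON file are hypothetical stand-ins, used only to show the fallback to an empty list when the tracking file is missing.

```python
import json
import tempfile
from pathlib import Path


class LocalDateTracker:
    """Hypothetical local-file stand-in for DateTracker (an assumption, not the real class)."""

    def __init__(self, path):
        self.path = Path(path)
        self._dates = self._load()

    def _load(self):
        # A missing tracking file means nothing has been scraped yet.
        try:
            return json.loads(self.path.read_text())
        except FileNotFoundError:
            return []

    def get_all_scraped(self):
        # Return a copy so callers cannot mutate the internal list.
        return list(self._dates)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "scraped_dates.json"
    fresh_dates = LocalDateTracker(path).get_all_scraped()      # file missing
    path.write_text(json.dumps(["2025-09-01"]))
    reloaded_dates = LocalDateTracker(path).get_all_scraped()   # file present

print(fresh_dates)
print(reloaded_dates)
```

The first instance sees no file and starts empty; the second picks up whatever was written in between, which is the same property Test 4 checks against S3.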


## Test 2: Check if Date is Scraped

Test the `is_scraped()` method with a date that hasn't been scraped.

In [3]:
test_date = "2025-09-01"
is_scraped = tracker.is_scraped(test_date)
print(f"Is {test_date} scraped? {is_scraped}")

Is 2025-09-01 scraped? False


## Test 3: Mark Date as Scraped

Mark a date as scraped and confirm that it appears in the tracker's list. Persistence to S3 is verified separately in Test 4.

In [4]:
tracker.mark_scraped(test_date)
print(f"Marked {test_date} as scraped")
print(f"All scraped dates: {tracker.get_all_scraped()}")

Marked 2025-09-01 as scraped
All scraped dates: ['2025-09-01']


## Test 4: Verify Persistence

Create a new tracker instance, which reloads the tracking file from S3, to verify that the date was persisted.

In [5]:
tracker2 = DateTracker(sport="NFL")
print(f"New tracker instance loaded dates: {tracker2.get_all_scraped()}")
print(f"Is {test_date} scraped in new instance? {tracker2.is_scraped(test_date)}")

New tracker instance loaded dates: ['2025-09-01']
Is 2025-09-01 scraped in new instance? True


## Test 5: Mark Multiple Dates

Add several dates out of order to test that the tracker keeps them sorted. Deduplication is covered in Test 6.

In [6]:
dates_to_add = ["2025-09-05", "2025-09-03", "2025-09-02", "2025-09-04"]

for date in dates_to_add:
    tracker.mark_scraped(date)
    print(f"Added {date}")

print(f"\nAll scraped dates (should be sorted): {tracker.get_all_scraped()}")

Added 2025-09-05
Added 2025-09-03
Added 2025-09-02
Added 2025-09-04

All scraped dates (should be sorted): ['2025-09-01', '2025-09-02', '2025-09-03', '2025-09-04', '2025-09-05']


## Test 6: Try to Add Duplicate

Adding the same date twice should not create duplicates.

In [7]:
before_count = len(tracker.get_all_scraped())
tracker.mark_scraped("2025-09-03")  # Already added above
after_count = len(tracker.get_all_scraped())

print(f"Before: {before_count} dates, After: {after_count} dates")
print(f"Duplicate handled correctly: {before_count == after_count}")
print(f"All dates: {tracker.get_all_scraped()}")

Before: 5 dates, After: 5 dates
Duplicate handled correctly: True
All dates: ['2025-09-01', '2025-09-02', '2025-09-03', '2025-09-04', '2025-09-05']
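
The sorting and deduplication behavior exercised in Tests 5 and 6 can be reproduced with a sorted insert. This is one possible approach, not necessarily how `DateTracker.mark_scraped()` works internally; `insert_sorted_unique` is a hypothetical helper. It relies on ISO `YYYY-MM-DD` strings sorting chronologically when compared as plain strings.

```python
import bisect


def insert_sorted_unique(dates, new_date):
    """Insert new_date into the sorted list `dates` unless it is already present.

    Returns True if the date was inserted, False if it was a duplicate.
    """
    idx = bisect.bisect_left(dates, new_date)
    if idx < len(dates) and dates[idx] == new_date:
        return False  # duplicate: leave the list unchanged
    dates.insert(idx, new_date)
    return True


dates = ["2025-09-01"]
for d in ["2025-09-05", "2025-09-03", "2025-09-02", "2025-09-04"]:
    insert_sorted_unique(dates, d)

# Re-adding an existing date is a no-op.
was_inserted = insert_sorted_unique(dates, "2025-09-03")

print(dates)
print(was_inserted)
```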


## Test 7: Check S3 File Directly

Read the S3 file directly to verify the JSON structure.

In [8]:
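import json
import pandas as pd

# Read the file directly.
# Note: pd.io.common.get_handle is not part of pandas' documented public API.
with pd.io.common.get_handle(
    tracker.s3_path, mode="r", storage_options=tracker.storage_options
) as handles:
    raw_content = handles.handle.read()
    print("Raw S3 file content:")
    print(raw_content)

# Confirm the content parses as a JSON array (raises if it does not).
assert isinstance(json.loads(raw_content), list)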

Raw S3 file content:
[
  "2025-09-01",
  "2025-09-02",
  "2025-09-03",
  "2025-09-04",
  "2025-09-05"
]
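
Beyond reading the raw output by eye, the structure can be checked programmatically. The sketch below assumes the file is a sorted, duplicate-free JSON array of ISO date strings, as the output above suggests; `validate_scraped_dates` is a hypothetical helper, not part of `DateTracker`.

```python
import json
from datetime import date


def validate_scraped_dates(raw):
    """Parse raw JSON text and check it is a sorted, duplicate-free list of ISO dates."""
    parsed = json.loads(raw)
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON array")
    for item in parsed:
        if not isinstance(item, str):
            raise ValueError(f"expected a string, got {item!r}")
        date.fromisoformat(item)  # raises ValueError if not a valid ISO date
    if parsed != sorted(set(parsed)):
        raise ValueError("dates must be sorted and unique")
    return parsed


# Sample content mirroring the raw file shown above.
sample = json.dumps(
    ["2025-09-01", "2025-09-02", "2025-09-03", "2025-09-04", "2025-09-05"],
    indent=2,
)
validated = validate_scraped_dates(sample)
print(len(validated))
```

In the notebook, the same helper could be called on `raw_content` from the cell above.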


## Summary

If all tests above pass, the following have been verified:
- ✅ `DateTracker` can initialize and connect to S3
- ✅ It can check whether a date has been scraped
- ✅ It can mark dates as scraped
- ✅ Data persists across instances
- ✅ Dates are sorted and deduplicated
- ✅ The S3 file is a valid JSON array of date strings

**Next Steps:**
1. Deploy DAG to Airflow
2. Test with manual trigger
3. Verify skip logic
4. Start backfilling
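
For the skip-logic and backfill steps, one way to decide which dates still need scraping is to subtract the tracked dates from the target range. This is a sketch under assumptions: `dates_to_scrape` is a hypothetical helper, the date range is illustrative, and in the DAG the scraped list would presumably come from `tracker.get_all_scraped()`.

```python
from datetime import date, timedelta


def dates_to_scrape(start, end, already_scraped):
    """Return ISO dates in [start, end] (inclusive) that are not in already_scraped."""
    scraped = set(already_scraped)
    n_days = (end - start).days + 1  # zero or negative yields an empty range
    candidates = (start + timedelta(days=i) for i in range(n_days))
    return [d.isoformat() for d in candidates if d.isoformat() not in scraped]


scraped_so_far = ["2025-09-01", "2025-09-02", "2025-09-03", "2025-09-04", "2025-09-05"]
pending = dates_to_scrape(date(2025, 8, 30), date(2025, 9, 7), scraped_so_far)
print(pending)
```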