
# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [4]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

Wrote contacts_raw.txt with sample DalaShop data.


## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
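
As a minimal sketch of rules 2 and 3 above, the snippet below contrasts partial and full regex matching and applies the last-10-digits rule. The helper names `is_valid_email` and `last_ten_digits` are illustrative assumptions and are not required by the assignment.

```python
import re

# Same pattern as in step 2 of the task list.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_valid_email(raw):
    """Return True only if the whole stripped string matches the pattern."""
    return EMAIL_RE.fullmatch(raw.strip()) is not None


def last_ten_digits(raw):
    """Drop non-digits; keep the last 10 digits, or return "" if fewer than 10."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


# search() accepts a hit anywhere inside the string; fullmatch() does not.
print(bool(EMAIL_RE.search("<alice@example.com>")))  # True (partial match)
print(is_valid_email("<alice@example.com>"))         # False (brackets are not allowed)
print(is_valid_email("  sara@mail.co  "))            # True after strip()
print(is_valid_email("bob[at]example.com"))          # False
print(last_ten_digits("+1 (469) 555-1234"))          # 4695551234
print(repr(last_ten_digits("972-555-777")))          # ''
```

The first two lines of output show why full matching matters: a partial search would accept a string that merely contains an email.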

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments

In [7]:
# Write your answer here
from pathlib import Path
import re


email_check = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize_phone(raw):
    """Return the last 10 digits of raw, or "" if it has fewer than 10 digits."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) >= 10:
        return digits[-10:]
    return ""


def extract_fields(line):
    """Split one raw contact line into (name, email, phone) without validating them."""
    email_match = re.search(r"<([^>]+)>", line)
    if email_match:
        # Form: Name <email> , phone
        email = email_match.group(1).strip()
        name_part = line.split("<")[0].strip().strip('"').strip()
    else:
        # Form: Name , email , phone
        email_plain = email_check.search(line)
        email = email_plain.group(0) if email_plain else ""
        name_part = line.split(",")[0].strip().strip('"').strip()

    # The phone number is always the last comma-separated field.
    phone = line.split(",")[-1].strip()
    return name_part, email, phone


# The setup cell writes contacts_raw.txt beside the notebook, so read it from there.
contact_path = Path('contacts_raw.txt')
output_path = Path('contacts_clean.csv')

try:
    with contact_path.open('r', encoding='utf-8') as f:
        contents = f.readlines()
except FileNotFoundError:
    print(f"File {contact_path.name!r} not found. Make sure it is in the same folder as the notebook.")
else:
  cleaned_rows = []
  seen_emails = set()

  for content in contents:
      name, email, phone = extract_fields(content)
      name = name.strip()
      email = email.strip()
      phone = phone.strip()

      if not email_check.fullmatch(email):
          continue

      email_key = email.casefold()
      if email_key in seen_emails:
          continue
      seen_emails.add(email_key)


      phone = normalize_phone(phone)

      cleaned_rows.append((name, email, phone))

  # Write the header and the cleaned rows as UTF-8, as the task requires.
  with output_path.open('w', encoding='utf-8') as f:
    f.write("name,email,phone\n")
    for name, email, phone in cleaned_rows:
        f.write(f"{name},{email},{phone}\n")
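
Writing rows with plain f-strings works for the sample data, but it would produce a malformed row if a name ever contained a comma. As a hedged alternative, the sketch below uses the standard `csv` module, which quotes such fields automatically. The function name `rows_to_csv` and the second sample row are hypothetical and used only for illustration.

```python
import csv
import io

rows = [
    ("Alice Johnson", "alice@example.com", "4695551234"),
    ("Roberts, Bob", "bob@example.com", "9725557777"),  # hypothetical name with a comma
]


def rows_to_csv(rows):
    """Render the header plus rows as CSV text, quoting fields only when needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "email", "phone"])
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(rows))
```

To write to disk instead of a string, the same writer can wrap a file opened with `output_path.open("w", encoding="utf-8", newline="")`; the `csv` documentation recommends `newline=""` for files passed to `csv.writer`.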





## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
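
For item 4, one option is to factor the de-duplication loop into a small function so that the test calls the same code the script runs, rather than repeating the loop inside the test. The sketch below assumes a helper named `dedupe_by_email`; that name is hypothetical and is not prescribed by the assignment.

```python
import unittest


def dedupe_by_email(rows):
    """Keep the first row for each email, comparing emails case-insensitively."""
    seen = set()
    unique = []
    for name, email, phone in rows:
        key = email.casefold()
        if key not in seen:
            seen.add(key)
            unique.append((name, email, phone))
    return unique


class TestDedupe(unittest.TestCase):
    def test_case_variant_duplicate_is_dropped(self):
        rows = [
            ("Alice Johnson", "alice@example.com", "4695551234"),
            ("duplicate", "Alice@Example.com", "4695551234"),
            ("Bob", "bob@example.com", "9725557777"),
        ]
        expected = [
            ("Alice Johnson", "alice@example.com", "4695551234"),
            ("Bob", "bob@example.com", "9725557777"),
        ]
        # The first occurrence survives and input order is preserved.
        self.assertEqual(dedupe_by_email(rows), expected)
```

In a notebook, the test case can be run with `unittest.main(argv=[''], exit=False)`; from a script, a plain `unittest.main()` under `if __name__ == "__main__":` is enough.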




In [6]:
# Write your answer here
import unittest

class TestCRMCleanup(unittest.TestCase):

  def test_emails(self):
    emails = [
      ('alice@goodexample1.com', True),
      ('johnnycage@goodexample2.com', True),
      ('kellyo.badexample1.com', False),
      ('badexample@', False),
    ]
    for email, expected in emails:
      with self.subTest(email=email):
        result = bool(email_check.fullmatch(email))
        self.assertEqual(result, expected)

  def test_phone_numbers(self):
    phone_numbers = [
      ("(943) 625-1258", "9436251258"),
      ("972-915-2323", "9729152323"),
      ("214 215 1247", "2142151247"),
      ("+1-532-921-6432", "5329216432"),
      ("5559876", ""),
    ]
    for raw, expected in phone_numbers:
      with self.subTest(raw=raw):
        result = normalize_phone(raw)
        self.assertEqual(result, expected)

  def test_parsing(self):
    multi_line = """Alice Johnson <alice@example.com> , +1 (469) 555-1234
    Sara M. , sara@mail.co , 214 555 8888
    "Mehdi A." <mehdi.ay@example.org> , (469)555-9999
    """

    lines = multi_line.strip().split("\n")

    results = [extract_fields(line) for line in lines]

    expected = [
      ("Alice Johnson", "alice@example.com", "+1 (469) 555-1234"),
      ("Sara M.", "sara@mail.co", "214 555 8888"),
      ("Mehdi A.", "mehdi.ay@example.org", "(469)555-9999"),
    ]

    self.assertEqual(results, expected)

  def test_de_dup(self):
    dup_test = [
      ("Alice Johnson", "alice@example.com", "4695551234"),
      ("alice johnson", "Alice@Example.com", "4695551234"),
      ("Bob", "bob@example.com", "9725557777"),
    ]

    seen_emails = set()
    cleaned_rows = []

    # Same keep-first, case-insensitive rule as the cleanup script.
    for name, email, phone in dup_test:
        key = email.casefold()
        if key in seen_emails:
            continue
        seen_emails.add(key)
        cleaned_rows.append((name, email, phone))

    expected_result = [
      ("Alice Johnson", "alice@example.com", "4695551234"),
      ("Bob", "bob@example.com", "9725557777"),
    ]

    self.assertEqual(cleaned_rows, expected_result)


unittest.main(argv=[''], exit=False, verbosity=2)

test_de_dup (__main__.TestCRMCleanup.test_de_dup) ... ok
test_emails (__main__.TestCRMCleanup.test_emails) ... ok
test_parsing (__main__.TestCRMCleanup.test_parsing) ... ok
test_phone_numbers (__main__.TestCRMCleanup.test_phone_numbers) ... ok

----------------------------------------------------------------------
Ran 4 tests in 0.009s

OK


<unittest.main.TestProgram at 0x7a46795858e0>

## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
