
# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [None]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
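
Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the required solution: the helper names `looks_like_email` and `last_ten_digits` are hypothetical, and only the two patterns are taken from the steps above. It shows why full matching matters: a partial search would accept a string that merely contains an address.

```python
import re

# Pattern copied from step 2; the helper names below are illustrative only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def looks_like_email(text: str) -> bool:
    """Return True only if the whole stripped string matches the pattern."""
    return EMAIL_PATTERN.fullmatch(text.strip()) is not None


def last_ten_digits(raw: str) -> str:
    """Drop non-digits; keep the last 10 digits, or return "" if fewer than 10."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


candidate = "<alice@example.com>"
# search() finds the address inside the angle brackets (partial match).
print(EMAIL_PATTERN.search(candidate) is not None)  # True
# fullmatch() rejects the same string because of the brackets.
print(looks_like_email(candidate))                  # False

print(last_ten_digits("+1 (469) 555-1234"))  # 4695551234
print(repr(last_ten_digits("972-555-777")))  # ''
```

A reasonable design is to use a partial search to *extract* a candidate address from a raw line and a full match to *validate* it afterwards.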

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments
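
The file-handling and de-duplication requirements can be sketched like this. It is one possible shape under stated assumptions: `read_lines_or_empty` and `dedupe_by_email` are hypothetical helper names, the sample rows are made up for illustration, and an in-memory `io.StringIO` buffer stands in for `contacts_clean.csv` so the sketch does not touch your real files.

```python
import csv
import io
import tempfile
from pathlib import Path


def read_lines_or_empty(path: Path) -> list[str]:
    """Return the file's lines, or an empty list if the file is missing."""
    try:
        with path.open("r", encoding="utf-8") as f:
            return f.read().splitlines()
    except FileNotFoundError:
        print(f"{path.name} not found; nothing to clean.")
        return []


def dedupe_by_email(rows):
    """Keep the first row for each email, comparing case-insensitively."""
    seen = set()
    unique = []
    for name, email, phone in rows:
        key = email.casefold()
        if key not in seen:
            seen.add(key)
            unique.append((name, email, phone))
    return unique


# A path inside a fresh temporary directory is guaranteed not to exist.
missing = Path(tempfile.mkdtemp()) / "missing_contacts.txt"
print(read_lines_or_empty(missing))  # prints the friendly message, then []

# Illustrative rows only (not the assignment data).
rows = [
    ("n1", "USER@Example.com", "1112223333"),
    ("n2", "other@example.com", "4445556666"),
    ("n3", "user@example.com", "7778889999"),  # case-variant duplicate of n1
]

buffer = io.StringIO()  # stands in for contacts_clean.csv
writer = csv.writer(buffer)
writer.writerow(["name", "email", "phone"])
writer.writerows(dedupe_by_email(rows))
print(buffer.getvalue())
```

Because rows are appended in the order they are first seen and never sorted, the output order follows the input order.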

In [13]:
# q1-style implementation inside the notebook (file I/O, exceptions, regex, de-dup, CSV)
from pathlib import Path
import csv
import re
from typing import Iterable, List, Tuple, Optional

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
NON_DIGIT_RE = re.compile(r"\D")

def is_valid_email(email: Optional[str]) -> bool:
    """Return True if the stripped string fully matches EMAIL_RE."""
    if email is None:
        return False
    return EMAIL_RE.fullmatch(email.strip()) is not None

def normalize_phone(raw: Optional[str]) -> str:
    """Strip non-digits and return the last 10 digits, or "" if fewer than 10."""
    if raw is None:
        return ""
    digits = NON_DIGIT_RE.sub("", raw)
    if len(digits) >= 10:
        return digits[-10:]
    return ""

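def _strip_quotes_preserve(s: str) -> str:
    """Trim surrounding whitespace only; any quotes around the name are kept."""
    return s.strip()

def extract_first_email(line: str) -> Optional[str]:
    """Return the first substring of the line that looks like an email, if any."""
    m = EMAIL_RE.search(line)
    return m.group(0) if m else None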

def extract_name_and_phone(line: str, email: str) -> Tuple[str, str]:
    """Return (name, raw phone) from one line in either supported layout."""
    raw = line.strip()
    name = ""
    phone_raw = ""
    parts = [p.strip() for p in raw.split(",")]

    if "<" in raw and ">" in raw:
        # Layout: Name <email> , phone  (the name is everything before "<")
        before = raw.split("<", 1)[0]
        name = _strip_quotes_preserve(before).strip(",")
    else:
        # Layout: Name , email , phone  (the name is the first alphabetic
        # field that does not contain the email)
        for p in parts:
            if p and email not in p and any(ch.isalpha() for ch in p):
                name = _strip_quotes_preserve(p)
                break

    # In both layouts the phone is the last comma-separated field with a digit.
    for part in reversed(parts):
        if any(ch.isdigit() for ch in part):
            phone_raw = part
            break

    return name.strip(), phone_raw.strip()

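def parse_contacts_lines(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (name, email, phone) rows with valid emails, de-duplicated
    case-insensitively by email (first occurrence kept, input order preserved)."""
    seen = set()
    rows: List[Tuple[str, str, str]] = []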
    for line in lines:
        if not line or not line.strip():
            continue
        email_found = extract_first_email(line)
        if not email_found:
            continue
        email_clean = email_found.strip()
        if not is_valid_email(email_clean):
            continue
        key = email_clean.casefold()
        if key in seen:
            continue
        name, phone_raw = extract_name_and_phone(line, email_clean)
        phone_norm = normalize_phone(phone_raw)
        rows.append((name, email_clean, phone_norm))
        seen.add(key)
    return rows

def run_cleanup(raw_filename: str = "contacts_raw.txt",
                out_filename: str = "contacts_clean.csv") -> None:
    """Read raw file, clean & dedup, then write CSV. Gracefully handle missing file."""
    raw_path = Path(raw_filename)
    out_path = Path(out_filename)
    try:
        with raw_path.open("r", encoding="utf-8") as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        print(f"{raw_path.name} not found. Please ensure the file exists beside this notebook.")
        return

    rows = parse_contacts_lines(lines)

    with out_path.open("w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["name", "email", "phone"])
        for name, email, phone in rows:
            writer.writerow([name, email, phone])

    print(f"Wrote {out_path.name} with {len(rows)} rows.")

run_cleanup()


contacts_raw.txt not found. Please ensure the file exists beside this notebook.


## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
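
One way to cover several input formats in a single test method is `unittest`'s `subTest` context manager, which reports each failing case separately. The sketch below is self-contained and illustrative only: `normalize_phone_sketch` is a hypothetical stand-in for your own helper, written to follow the Q1 phone rule, and the class name is likewise an assumption.

```python
import re
import unittest


def normalize_phone_sketch(raw: str) -> str:
    """Hypothetical stand-in for the Q1 helper: last 10 digits, or ""."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


class TestNormalizePhoneSketch(unittest.TestCase):
    def test_formats_reduce_to_last_ten_digits(self):
        cases = {
            "+1 (469) 555-1234": "4695551234",  # country code and parentheses
            "972.777.2121": "9727772121",       # dots
            "214 555 8888": "2145558888",       # spaces
        }
        for raw, expected in cases.items():
            # Each case is reported on its own if it fails.
            with self.subTest(raw=raw):
                self.assertEqual(normalize_phone_sketch(raw), expected)

    def test_too_short_returns_empty_string(self):
        self.assertEqual(normalize_phone_sketch("972-555-777"), "")


# Load only this class, so tests defined elsewhere are not picked up.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNormalizePhoneSketch)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Loading tests from a specific `TestCase` class, rather than from the whole module, keeps a notebook run from including test classes left over from earlier cells.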




In [12]:
import unittest

class TestEmailValidation(unittest.TestCase):
    def test_valid_emails(self):
        self.assertTrue(is_valid_email("alice@example.com"))
        self.assertTrue(is_valid_email("MEHDI.AY+vip@example.org"))
        self.assertTrue(is_valid_email("x_y.z-1@sub.example.co"))

    def test_invalid_emails(self):
        self.assertFalse(is_valid_email("bob[at]example.com"))
        self.assertFalse(is_valid_email("no-at-symbol.example.com"))
        self.assertFalse(is_valid_email("name@example"))
        self.assertFalse(is_valid_email("name@.com"))

class TestPhoneNormalization(unittest.TestCase):
    def test_various_formats_to_last10(self):
        self.assertEqual(normalize_phone("+1 (469) 555-1234"), "4695551234")
        self.assertEqual(normalize_phone("(469)555-9999"), "4695559999")
        self.assertEqual(normalize_phone("972.777.2121"), "9727772121")
        self.assertEqual(normalize_phone("1-972-777-2121"), "9727772121")
        self.assertEqual(normalize_phone("214 555 8888"), "2145558888")

    def test_too_short(self):
        self.assertEqual(normalize_phone("972-555-77"), "")

    def test_none_and_empty(self):
        self.assertEqual(normalize_phone(None), "")
        self.assertEqual(normalize_phone(""), "")

class TestParsingAndDedup(unittest.TestCase):
    def test_parse_multiline_and_structure(self):
        raw = [
            'Alice Johnson <alice@example.com> , +1 (469) 555-1234',
            'Bob Roberts <bob[at]example.com> , 972-555-777',
            'Sara M. , sara@mail.co , 214 555 8888',
            '"Mehdi A." <mehdi.ay@example.org> , (469)555-9999',
            'Delaram <delaram@example.io>, +1-972-777-2121',
            'Nima <NIMA@example.io> , 972.777.2121',
            'duplicate <Alice@Example.com> , 469 555 1234'
        ]
        rows = parse_contacts_lines(raw)
        expected = [
            ('Alice Johnson', 'alice@example.com', '4695551234'),
            ('Sara M.', 'sara@mail.co', '2145558888'),
            ('"Mehdi A."', 'mehdi.ay@example.org', '4695559999'),
            ('Delaram', 'delaram@example.io', '9727772121'),
            ('Nima', 'NIMA@example.io', '9727772121'),
        ]
        self.assertEqual(rows, expected)

    def test_dedup_case_insensitive(self):
        raw = [
            "n1 <USER@Example.com>, 111-222-3333",
            "n2 <user@example.com>, 444-555-6666",
        ]
        rows = parse_contacts_lines(raw)
        self.assertEqual(rows, [('n1', 'USER@Example.com', '1112223333')])

unittest.TextTestRunner(verbosity=2).run(
    unittest.defaultTestLoader.loadTestsFromModule(__import__(__name__))
)


test_email_validity (__main__.TestCRM.test_email_validity) ... ok
test_parse_and_dedup (__main__.TestCRM.test_parse_and_dedup) ... ok
test_phone_normalization (__main__.TestCRM.test_phone_normalization) ... ok
test_invalid_emails (__main__.TestEmailValidation.test_invalid_emails) ... ok
test_valid_emails (__main__.TestEmailValidation.test_valid_emails) ... ok
test_dedup_case_insensitive (__main__.TestParsingAndDedup.test_dedup_case_insensitive) ... ok
test_parse_multiline_and_structure (__main__.TestParsingAndDedup.test_parse_multiline_and_structure) ... FAIL
test_none_and_empty (__main__.TestPhoneNormalization.test_none_and_empty) ... ok
test_too_short (__main__.TestPhoneNormalization.test_too_short) ... ok
test_various_formats_to_last10 (__main__.TestPhoneNormalization.test_various_formats_to_last10) ... ok

FAIL: test_parse_multiline_and_structure (__main__.TestParsingAndDedup.test_parse_multiline_and_structure)
----------------------------------------------------------------------


<unittest.runner.TextTestResult run=10 errors=0 failures=1>

## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
