
# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [1]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

Wrote contacts_raw.txt with sample DalaShop data.


## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
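
The rules in steps 2, 3, and 5 can be tried out in isolation before wiring them into the script. The sketch below is only an illustration, not the required solution: `EMAIL_PATTERN`, `last_ten_digits`, and `first_by_email` are hypothetical names, and the sample strings are made up for the demonstration. It shows why full matching matters (a partial search accepts a string that merely contains an address) and how an order-preserving, case-insensitive de-duplication might look.

```python
from __future__ import annotations

import re

# Pattern copied from step 2; every other name here is illustrative only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

candidate = "bob@example.com (work)"

# search() succeeds because a valid address appears somewhere in the string.
partial_hit = EMAIL_PATTERN.search(candidate) is not None
# fullmatch() fails because the whole string must be an address.
full_hit = EMAIL_PATTERN.fullmatch(candidate) is not None


def last_ten_digits(raw: str) -> str:
    """Strip non-digits; keep the last 10 digits, or return "" if fewer than 10."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


def first_by_email(emails: list[str]) -> list[str]:
    """Drop later case-variant duplicates while preserving first-seen order."""
    seen = set()
    kept = []
    for email in emails:
        key = email.casefold()
        if key not in seen:
            seen.add(key)
            kept.append(email)
    return kept


print(partial_hit, full_hit)
print(last_ten_digits("+1 (469) 555-1234"))
print(last_ten_digits("972-555-777"))
print(first_by_email(["alice@example.com", "NIMA@example.io", "Alice@Example.com"]))
```

Because `first_by_email` appends to a list rather than building a set of results, the output order follows the first appearance in the input, which is what step 7 asks for.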

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments

In [12]:
%%writefile q1_crm_cleanup.py

from __future__ import annotations

from pathlib import Path
import csv
import re
from typing import Iterable, List, Dict, Tuple

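# No trailing "$" is needed: fullmatch() already requires the whole string to match.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def is_valid_email(raw: str) -> bool:
    """Return True if the stripped string is, in full, a valid email address."""
    if raw is None:
        return False
    return bool(EMAIL_RE.fullmatch(raw.strip()))

def normalize_phone(raw: str) -> str:
    """Return the last 10 digits of raw, or "" if it has fewer than 10 digits."""
    if raw is None:
        return ""
    digits = re.sub(r"\D", "", raw)
    if len(digits) >= 10:
        return digits[-10:]
    return ""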

def _extract_name_email_phone(line: str) -> Tuple[str, str, str]:
    """Split one raw line into (name, email, phone); missing fields become ""."""
    s = line.strip()

    # Angle-bracket style
    m = re.search(r"^(?P<name>.+?)\s*<(?P<email>[^>]+)>\s*,\s*(?P<phone>.*)$", s)
    if m:
        return m.group("name").strip(), m.group("email").strip(), m.group("phone").strip()

    # Comma-separated fields
    parts = [p.strip() for p in re.split(r"\s*,\s*", s)]
    if len(parts) >= 3:
        name, email, phone = parts[0], parts[1], parts[2]
        return name, email, phone
    if len(parts) == 2:
        name, email = parts
        return name, email, ""

    # Fallback: find any email in the string
    m2 = re.search(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", s)
    email = m2.group(0).strip() if m2 else ""
    if email:
        before, _, after = s.partition(email)
        name = before.replace("<", "").replace(">", "").replace(",", " ").strip()
        phone = after.replace(",", " ").strip()
        return name, email, phone

    # Nothing recognized; treat as name only
    return s, "", ""

def parse_contacts_text(text: str) -> List[Dict[str, str]]:
    """Parse raw CRM text into name/email/phone rows, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, email, phone_raw = _extract_name_email_phone(line)
        rows.append({"name": name.strip(), "email": email.strip(), "phone": phone_raw.strip()})
    return rows

def clean_contacts(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep valid-email rows, drop case-insensitive duplicates (first wins), normalize phones."""
    seen = set()
    cleaned: List[Dict[str, str]] = []
    for r in rows:
        name = (r.get("name") or "").strip()
        email = (r.get("email") or "").strip()
        phone_raw = (r.get("phone") or "").strip()

        if not is_valid_email(email):
            continue

        key = email.casefold()
        if key in seen:
            continue
        seen.add(key)

        phone = normalize_phone(phone_raw)
        cleaned.append({"name": name, "email": email, "phone": phone})
    return cleaned

def read_contacts_file(path: Path) -> List[Dict[str, str]]:
    """Read and parse the contacts file; print a message and return [] if it is missing."""
    try:
        text = path.read_text(encoding="utf-8")
    except FileNotFoundError:
        print(f"contacts file not found: {path}")
        return []
    return parse_contacts_text(text)

def write_contacts_csv(rows: Iterable[Dict[str, str]], path: Path) -> None:
    """Write rows to a UTF-8 CSV with the columns name,email,phone."""
    with path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "email", "phone"])
        writer.writeheader()
        for r in rows:
            writer.writerow({"name": r["name"], "email": r["email"], "phone": r["phone"]})

def main() -> None:
    raw_path = Path("contacts_raw.txt")
    out_path = Path("contacts_clean.csv")
    rows = read_contacts_file(raw_path)
    if not rows:
        return
    cleaned = clean_contacts(rows)
    write_contacts_csv(cleaned, out_path)
    print(f"Wrote {out_path} with {len(cleaned)} rows.")

if __name__ == "__main__":
    main()


Overwriting q1_crm_cleanup.py


from q1_crm_cleanup import (
    is_valid_email,
    normalize_phone,
    parse_contacts_text,
    clean_contacts,
)

## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
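
When a test needs to cover many input variations, `unittest`'s `subTest` context manager is one way to keep them in a single readable table. The sketch below is a self-contained illustration under stated assumptions: `digits_only` is a hypothetical stand-in for a helper under test (it is not one of the Q1 functions), and the cases are made up. It also runs the suite through `TextTestRunner` instead of `unittest.main()`, which avoids argument parsing and `sys.exit()` inside a notebook cell.

```python
import re
import unittest


def digits_only(raw: str) -> str:
    """Hypothetical helper under test: keep only the digits of a string."""
    return re.sub(r"\D", "", raw)


class TestDigitsOnly(unittest.TestCase):
    def test_formats(self):
        cases = [
            ("(469) 555-1234", "4695551234"),
            ("972.777.2121", "9727772121"),
            ("abc", ""),
        ]
        for raw, expected in cases:
            # subTest reports each failing case separately instead of
            # stopping the whole test method at the first failure.
            with self.subTest(raw=raw):
                self.assertEqual(digits_only(raw), expected)


# Build and run the suite explicitly; unlike unittest.main(), this does not
# read sys.argv or call sys.exit(), so it is safe inside a notebook cell.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDigitsOnly)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print(result.wasSuccessful())
```

To run a test file such as `test_crm_cleanup.py` itself from a notebook cell, one option is the shell escape `!python -m unittest -v test_crm_cleanup`.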




In [14]:
%%writefile test_crm_cleanup.py


import unittest
from q1_crm_cleanup import (
    is_valid_email,
    normalize_phone,
    parse_contacts_text,
    clean_contacts,
)

class TestEmailValidation(unittest.TestCase):
    def test_valid_emails(self):
        self.assertTrue(is_valid_email("alice@example.com"))
        self.assertTrue(is_valid_email("A.B-c+d_e@sub.example.co"))
        self.assertTrue(is_valid_email("  mehdi.ay@example.org  "))

    def test_invalid_emails(self):
        self.assertFalse(is_valid_email("bob[at]example.com"))
        self.assertFalse(is_valid_email("alice@example"))
        self.assertFalse(is_valid_email("alice@.com"))
        self.assertFalse(is_valid_email("not-an-email"))
        self.assertFalse(is_valid_email(""))

class TestPhoneNormalization(unittest.TestCase):
    def test_basic_forms(self):
        self.assertEqual(normalize_phone("(469) 555-1234"), "4695551234")
        self.assertEqual(normalize_phone("972.555.7777"), "9725557777")
        self.assertEqual(normalize_phone("+1 (214) 555 8888"), "2145558888")
        self.assertEqual(normalize_phone("  +1-972-777-2121  "), "9727772121")

    def test_too_short(self):
        self.assertEqual(normalize_phone("972-555-777"), "")  # 9 digits
        self.assertEqual(normalize_phone("abc"), "")
        self.assertEqual(normalize_phone(""), "")

class TestParsingAndDedup(unittest.TestCase):
    def test_parse_and_clean(self):
        raw = (
            'Alice Johnson <alice@example.com> , +1 (469) 555-1234\n'
            'Bob Roberts <bob[at]example.com> , 972-555-777\n'
            'Sara M. , sara@mail.co , 214 555 8888\n'
            '"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\n'
            'Delaram <delaram@example.io>, +1-972-777-2121\n'
            'Nima <NIMA@example.io> , 972.777.2121\n'
            'duplicate <Alice@Example.com> , 469 555 1234\n'
        )
        rows = parse_contacts_text(raw)
        cleaned = clean_contacts(rows)

        expected = [
            {"name": "Alice Johnson", "email": "alice@example.com", "phone": "4695551234"},
            {"name": "Sara M.", "email": "sara@mail.co", "phone": "2145558888"},
            {"name": '"Mehdi A."', "email": "mehdi.ay@example.org", "phone": "4695559999"},
            {"name": "Delaram", "email": "delaram@example.io", "phone": "9727772121"},
            {"name": "Nima", "email": "NIMA@example.io", "phone": "9727772121"},
        ]
        self.assertEqual(cleaned, expected)

    def test_dedup_case_insensitive(self):
        raw = (
            "A <x@example.com> , 111-111-1111\n"
            "B <X@EXAMPLE.com> , 222-222-2222\n"
        )
        rows = parse_contacts_text(raw)
        cleaned = clean_contacts(rows)
        self.assertEqual(
            cleaned,
            [{"name": "A", "email": "x@example.com", "phone": "1111111111"}]
        )

if __name__ == "__main__":
    unittest.main(verbosity=2)


Overwriting test_crm_cleanup.py


## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
