
# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [1]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

Wrote contacts_raw.txt with sample DalaShop data.


## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
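
As a rough illustration of steps 2, 3, and 5, the sketch below shows why full matching matters and one possible way to keep the first occurrence of each email. The pattern and the phone rule come from the steps above; the helper names `is_valid_email` and `dedupe_by_email` and the sample strings are hypothetical and not required by the assignment.

```python
import re

# Pattern taken from step 2; it only validates when applied with fullmatch().
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_valid_email(raw: str) -> bool:
    """Hypothetical helper: strip whitespace, then require a full match."""
    return EMAIL_RE.fullmatch(raw.strip()) is not None


def normalize_phone(raw: str) -> str:
    """Keep the last 10 digits if at least 10 are present, else return ''."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


def dedupe_by_email(rows):
    """Hypothetical helper: keep the first row per email, ignoring case."""
    seen = set()
    unique = []
    for name, email, phone in rows:
        key = email.casefold()
        if key not in seen:
            seen.add(key)
            unique.append((name, email, phone))
    return unique


# search() accepts any string that merely contains an email-like substring;
# fullmatch() rejects the same string because of the trailing '>'.
sample = "alice@example.com>"
partial_ok = EMAIL_RE.search(sample) is not None
full_ok = is_valid_email(sample)
print(partial_ok, full_ok)
```

Because `dedupe_by_email` appends rows in the order it meets them and never sorts, the first-appearance order from step 7 is preserved as a side effect.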

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments

In [5]:

from pathlib import Path
import csv
import re

# Email validation pattern; apply it with fullmatch() so the whole string must match
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def normalize_phone(raw: str) -> str:
    """Remove non-digits; if ≥10 digits, keep last 10 digits, else return empty string."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) >= 10:
        return digits[-10:]
    return ""

def extract_email(text: str) -> str | None:
    """Try to extract an email-like token from a messy contact line."""
    if not text:
        return None
    # Prefer angle-bracket form <...>
    m = re.search(r"<([^>]+)>", text)
    if m:
        return m.group(1).strip()
    # Otherwise, split by comma/space and look for a token with '@'
    for token in re.split(r"[,\s]+", text.strip()):
        if "@" in token:
            return token.strip()
    return None

def extract_name(text: str) -> str:
    """Extract the name before '<' or before the first comma, without surrounding quotes."""
    if not text:
        return ""
    if "<" in text:
        head = text.split("<", 1)[0]
    elif "," in text:
        head = text.split(",", 1)[0]
    else:
        head = text
    # Drop padding, stray commas, and quotes such as those around "Mehdi A."
    return head.strip(' \t,"')

def extract_phone(text: str) -> str:
    """Extract last comma-separated field containing digits."""
    parts = [p.strip() for p in text.split(",")]
    for part in reversed(parts):
        if any(ch.isdigit() for ch in part):
            return part
    return ""

def main() -> None:
    src = Path("contacts_raw.txt")

    # Read the input via pathlib; a missing file is reported instead of raising
    try:
        raw_lines = src.read_text(encoding="utf-8").splitlines()
    except FileNotFoundError:
        print("⚠️ contacts_raw.txt not found. Please create the dataset file and rerun.")
        return

    cleaned_rows: list[tuple[str, str, str]] = []
    seen_emails: set[str] = set()  # case-insensitive dedup

    for line in raw_lines:
        line = line.strip()
        if not line:
            continue

        # Extract name
        name = extract_name(line)

        # Extract and validate email
        candidate_email = extract_email(line)
        if not candidate_email:
            continue
        email = candidate_email.strip()

        # Full match validation (no partial)
        if EMAIL_RE.fullmatch(email) is None:
            continue

        # Normalize phone
        raw_phone = extract_phone(line)
        phone = normalize_phone(raw_phone)

        # Deduplicate by case-insensitive email
        key = email.casefold()
        if key in seen_emails:
            continue
        seen_emails.add(key)

        # Preserve order
        cleaned_rows.append((name, email, phone))

    # Write output CSV (UTF-8)
    out_path = Path("contacts_clean.csv")
    with out_path.open("w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(["name", "email", "phone"])
        w.writerows(cleaned_rows)

    print(f"✅ Wrote {out_path} with {len(cleaned_rows)} cleaned contacts.")

if __name__ == "__main__":
    main()


✅ Wrote contacts_clean.csv with 5 cleaned contacts.


## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
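
One way to cover several input variations in a single test method is a table of cases combined with `subTest`, so each failing case is reported separately. The sketch below is only an illustration: it defines a local stand-in for `normalize_phone` so that it runs on its own, whereas a real test file would import the function from `q1_crm_cleanup`. The class name and the case list are assumptions, not required names.

```python
import re
import unittest


def normalize_phone(raw: str) -> str:
    """Local stand-in for the Q1 helper so this sketch is self-contained."""
    digits = re.sub(r"\D", "", raw or "")
    return digits[-10:] if len(digits) >= 10 else ""


class TestNormalizePhone(unittest.TestCase):
    def test_formats(self):
        # (raw input, expected normalized value)
        cases = [
            ("(469) 555-1234", "4695551234"),   # parentheses
            ("972.777.2121", "9727772121"),     # dots
            ("214 555 8888", "2145558888"),     # spaces
            ("+1-972-777-2121", "9727772121"),  # country code
            ("972-555-777", ""),                # too short (9 digits)
            ("", ""),                           # empty input
        ]
        for raw, expected in cases:
            with self.subTest(raw=raw):
                self.assertEqual(normalize_phone(raw), expected)


# Run the suite programmatically, which also works inside a notebook cell.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNormalizePhone)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The same table-driven shape can be reused for email validation by pairing each address with the expected `True` or `False`.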




In [7]:
%%writefile q1_crm_cleanup.py
from pathlib import Path
import csv
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def normalize_phone(raw: str) -> str:
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) >= 10:
        return digits[-10:]
    return ""

def extract_email(text: str) -> str | None:
    if not text:
        return None
    m = re.search(r"<([^>]+)>", text)
    if m:
        return m.group(1).strip()
    for token in re.split(r"[,\s]+", text.strip()):
        if "@" in token:
            return token.strip()
    return None

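def extract_name(text: str) -> str:
    """Extract the name before '<' or before the first comma, without surrounding quotes."""
    if not text:
        return ""
    if "<" in text:
        head = text.split("<", 1)[0]
    elif "," in text:
        head = text.split(",", 1)[0]
    else:
        head = text
    # Drop padding, stray commas, and quotes such as those around "Mehdi A."
    return head.strip(' \t,"')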

def extract_phone(text: str) -> str:
    parts = [p.strip() for p in text.split(",")]
    for part in reversed(parts):
        if any(ch.isdigit() for ch in part):
            return part
    return ""

def main() -> None:
    src = Path("contacts_raw.txt")
    try:
        raw_lines = src.read_text(encoding="utf-8").splitlines()
    except FileNotFoundError:
        print("contacts_raw.txt not found. Please create it and rerun.")
        return

    cleaned_rows = []
    seen = set()
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        name = extract_name(line)
        cand_email = extract_email(line)
        if not cand_email:
            continue
        email = cand_email.strip()
        if EMAIL_RE.fullmatch(email) is None:
            continue
        phone = normalize_phone(extract_phone(line))
        key = email.casefold()
        if key in seen:
            continue
        seen.add(key)
        cleaned_rows.append((name, email, phone))

    out_path = Path("contacts_clean.csv")
    with out_path.open("w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(["name", "email", "phone"])
        w.writerows(cleaned_rows)

    print(f"Wrote {out_path} with {len(cleaned_rows)} cleaned contacts.")

if __name__ == "__main__":
    main()


Writing q1_crm_cleanup.py


In [8]:
%%writefile test_crm_cleanup.py
import unittest
from q1_crm_cleanup import (
    EMAIL_RE,
    normalize_phone,
    extract_email,
    extract_name,
    extract_phone,
)

def parse_lines_pipeline(multiline: str):
    """Mirror the per-line loop of main() on an in-memory string, so tests need no file."""
    cleaned_rows = []
    seen = set()
    for raw in multiline.splitlines():
        line = raw.strip()
        if not line:
            continue
        name = extract_name(line)
        cand_email = extract_email(line)
        if not cand_email:
            continue
        email = cand_email.strip()
        if EMAIL_RE.fullmatch(email) is None:
            continue
        phone = normalize_phone(extract_phone(line))
        key = email.casefold()
        if key in seen:
            continue
        seen.add(key)
        cleaned_rows.append((name, email, phone))
    return cleaned_rows


class TestCRM(unittest.TestCase):
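    def test_email_validation(self):
        """fullmatch accepts well-formed addresses and rejects malformed ones."""
        valid = ["alice@example.com", "mehdi.ay@example.org"]
        invalid = ["bad@", "a@b", "x.com", "bob[at]example.com"]
        for e in valid:
            with self.subTest(email=e):
                self.assertIsNotNone(EMAIL_RE.fullmatch(e))
        for e in invalid:
            with self.subTest(email=e):
                self.assertIsNone(EMAIL_RE.fullmatch(e))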

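    def test_phone_normalization(self):
        self.assertEqual(normalize_phone("(469) 555-1234"), "4695551234")  # parentheses
        self.assertEqual(normalize_phone("214 555 8888"), "2145558888")    # spaces
        self.assertEqual(normalize_phone("+1-972-777-2121"), "9727772121")  # country code
        self.assertEqual(normalize_phone("972-555-777"), "")               # too short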

    def test_parsing_and_dedup(self):
        data = (
            'Alice Johnson <alice@example.com> , +1 (469) 555-1234\n'
            'Duplicate <ALICE@EXAMPLE.COM> , 469 555 1111\n'
            'Sara M. , sara@mail.co , 214 555 8888\n'
        )
        result = parse_lines_pipeline(data)
        expected = [
            ("Alice Johnson", "alice@example.com", "4695551234"),
            ("Sara M.", "sara@mail.co", "2145558888"),
        ]
        self.assertEqual(result, expected)


if __name__ == "__main__":
    unittest.main(verbosity=2)


Writing test_crm_cleanup.py


In [9]:
!python -m unittest test_crm_cleanup.py


...
----------------------------------------------------------------------
Ran 3 tests in 0.001s

OK


## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
