
# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [29]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

Wrote contacts_raw.txt with sample DalaShop data.


## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
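
Steps 2 and 3 can be sketched as follows. This is a minimal illustration rather than the required solution, and the helper names `looks_like_email` and `last_ten_digits` are hypothetical. It shows why full matching matters: `re.search` accepts any string that merely contains an email, while `re.fullmatch` requires the whole string to match.

```python
import re

# Pattern copied from step 2 of the assignment text.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def looks_like_email(text: str) -> bool:
    """Hypothetical helper: strip whitespace, then require a full match."""
    return EMAIL_RE.fullmatch(text.strip()) is not None


def last_ten_digits(raw: str) -> str:
    """Hypothetical helper: keep the last 10 digits, or '' if fewer than 10."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


# A partial match would wrongly accept text that only contains an email.
noisy = "contact: alice@example.com (work)"
print(EMAIL_RE.search(noisy) is not None)    # partial match succeeds
print(looks_like_email(noisy))               # full match rejects the string
print(last_ten_digits("+1 (469) 555-1234"))  # leading country code is dropped
print(repr(last_ten_digits("972-555-777")))  # only 9 digits, so empty string
```
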

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments
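
The keep-first, case-insensitive de-duplication named in the rubric could look roughly like this. The function name `dedupe_by_email` and the sample rows are assumptions made for illustration. A `set` of case-folded emails tracks what has been seen, while a plain list preserves first-appearance order without any sorting.

```python
def dedupe_by_email(rows):
    """Hypothetical helper: drop later rows whose email repeats, ignoring case."""
    seen = set()
    kept = []
    for name, email, phone in rows:
        key = email.casefold()  # comparison key only; the stored email is unchanged
        if key in seen:
            continue
        seen.add(key)
        kept.append((name, email, phone))
    return kept


rows = [
    ("Alice Johnson", "alice@example.com", "4695551234"),
    ("Nima", "NIMA@example.io", "9727772121"),
    ("duplicate", "Alice@Example.com", "4695551234"),
]
unique_rows = dedupe_by_email(rows)
print(unique_rows)
```

Note that only the comparison key is case-folded, so the first occurrence is written out with its original casing.
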

In [30]:
from pathlib import Path
import re
import csv
from typing import Optional, Tuple

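# Simple email pattern from the assignment; always used with full matching.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

def is_valid_email(email_text: str) -> bool:
    """Return True if the stripped text fully matches EMAIL_PATTERN."""
    email_clean = email_text.strip()
    return re.fullmatch(EMAIL_PATTERN, email_clean) is not None

def normalize_phone(raw_phone: str) -> str:
    """Return the last 10 digits of raw_phone, or "" if it has fewer than 10."""
    only_digits = re.sub(r"\D", "", raw_phone or "")
    return only_digits[-10:] if len(only_digits) >= 10 else ""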

def parse_contact_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Split one raw line into (name, email, phone); return None if unparseable."""
    s = line.strip()
    if not s:
        return None

    if "<" in s and ">" in s and "," in s:
        m = re.match(
            r'^\s*"?(?P<name>[^"<]+)"?\s*<(?P<email>[^>]+)>\s*,\s*(?P<phone>.+?)\s*$',
            s
        )
        if m:
            name = m.group("name").strip(' \t"')
            email = m.group("email").strip(' \t"')
            phone = m.group("phone").strip(' \t"')
            return name, email, phone

    parts = [p.strip() for p in s.split(",", 2)]
    if len(parts) >= 3:
        name = parts[0].strip(' \t<>"')
        email = parts[1].strip(' \t<>"')
        phone = parts[2].strip(' \t<>"')
        return name, email, phone

    return None

def main() -> None:
    """Read contacts_raw.txt, clean and de-duplicate rows, write contacts_clean.csv."""
    input_path = Path("contacts_raw.txt")
    output_path = Path("contacts_clean.csv")

    try:
        all_lines = input_path.read_text(encoding="utf-8").splitlines()
    except FileNotFoundError:
        print("contacts_raw.txt not found. Place it beside this script and run again.")
        return

    emails_seen_lower = set()
    cleaned_rows = []

    for one_line in all_lines:
        parsed = parse_contact_line(one_line)
        if not parsed:
            continue

        name_text, email_text, phone_text = parsed

        if not is_valid_email(email_text):
            continue

        email_lower = email_text.casefold()
        if email_lower in emails_seen_lower:
            continue
        emails_seen_lower.add(email_lower)

        phone_norm = normalize_phone(phone_text)

        cleaned_rows.append((name_text, email_text, phone_norm))

    with output_path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "email", "phone"])
        writer.writerows(cleaned_rows)

    print(f"contacts_clean.csv written with {len(cleaned_rows)} cleaned records.")

if __name__ == "__main__":
    main()

contacts_clean.csv written with 5 cleaned records.
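
One way to check the result is to read the CSV back and print it. The sketch below assumes the file was written with the `name,email,phone` header described above; `load_clean_rows` is a hypothetical helper, not part of the assignment.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_clean_rows(csv_path: Path) -> List[Dict[str, str]]:
    """Hypothetical helper: read the cleaned CSV back, or return [] if missing."""
    try:
        with csv_path.open(encoding="utf-8", newline="") as handle:
            return list(csv.DictReader(handle))
    except FileNotFoundError:
        print(f"{csv_path} not found; run the cleanup cell first.")
        return []


rows = load_clean_rows(Path("contacts_clean.csv"))
for row in rows:
    # Rows come back in file order, which should match first appearance in the input.
    print(row["name"], row["email"], row["phone"], sep=" | ")
```
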


import os, sys
# Ensure current working directory is on sys.path
sys.path.append(os.getcwd())

## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
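
Items 3 and 4 are easier to test when the cleaning logic lives in a pure function that takes text and returns rows. The sketch below assumes such a function, here called `clean_lines` (a hypothetical name). For brevity it handles only the plain `name, email, phone` layout, not the `Name <email>` form found in the sample data.

```python
import re
import unittest

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean_lines(text):
    """Hypothetical pure function: 'name, email, phone' lines -> cleaned rows."""
    seen, rows = set(), []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",", 2)]
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, email, phone = parts
        if EMAIL_RE.fullmatch(email) is None:
            continue  # invalid email: drop the row
        if email.casefold() in seen:
            continue  # case-variant duplicate: keep only the first
        seen.add(email.casefold())
        digits = re.sub(r"\D", "", phone)
        rows.append((name, email, digits[-10:] if len(digits) >= 10 else ""))
    return rows


class CleanLinesTest(unittest.TestCase):
    def test_exact_rows_from_multiline_string(self):
        raw = (
            "Alice Johnson, alice@example.com, +1 (469) 555-1234\n"
            "duplicate, Alice@Example.com, 469 555 1234\n"
            "\n"
            "Bad Email, bob[at]example.com, 972-555-7777\n"
            "Short Phone, sam@example.org, 555-1234\n"
        )
        expected = [
            ("Alice Johnson", "alice@example.com", "4695551234"),
            ("Short Phone", "sam@example.org", ""),
        ]
        self.assertEqual(clean_lines(raw), expected)


# Load only this test case so other classes in the notebook session are not picked up.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanLinesTest)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Loading the suite explicitly, instead of calling `unittest.main`, avoids re-running test classes left over from earlier cells in the same kernel.
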




In [31]:
import sys, os, unittest
if os.getcwd() not in sys.path:
    sys.path.append(os.getcwd())
sys.modules.pop("q1_crm_cleanup", None)
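# Q1 defines is_valid_email and normalize_phone; alias them to the names used in these tests.
from q1_crm_cleanup import is_valid_email as validemail, normalize_phone as normalphone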

class TestCRMCleanup(unittest.TestCase):
    def test_valid_email(self):
        self.assertTrue(validemail("alice@example.com"))
        self.assertTrue(validemail("john.doe123@domain.co.uk"))
        self.assertTrue(validemail("  alice@example.com  "))
        self.assertFalse(validemail("bob[at]example.com"))
        self.assertFalse(validemail("sara@mail"))
        self.assertFalse(validemail("@missinglocal.com"))

    def test_normal_phone(self):
        self.assertEqual(normalphone("+1 (469) 555-1234"), "4695551234")
        self.assertEqual(normalphone("972-555-7777"), "9725557777")
        self.assertEqual(normalphone("214 555 8888"), "2145558888")
        self.assertEqual(normalphone("(469)555-9999"), "4695559999")
        self.assertEqual(normalphone("12345"), "")
        self.assertEqual(normalphone("011-1-972-555-0000"), "9725550000")

    def test_parsing_and_dedup_order(self):
        lines = [
            "Alice Johnson, alice@example.com, +1 (469) 555-1234",
            "duplicate, Alice@Example.com, 469 555 1234",
            "Mehdi, mehdi@example.org, (469)555-9999",
            "Bad Email, bob[at]example.com, 972-555-7777",
        ]

        emails_seen_lower = set()
        cleaned_rows = []

        for text in lines:
            parts = text.split(",")
            if len(parts) < 3:
                continue
            name_text = parts[0].strip()
            email_text = parts[1].strip()
            phone_text = parts[2].strip()

            if not validemail(email_text):
                continue

            email_lower = email_text.casefold()
            if email_lower in emails_seen_lower:
                continue
            emails_seen_lower.add(email_lower)

            phone_norm = normalphone(phone_text)
            cleaned_rows.append((name_text, email_text, phone_norm))

        expected_rows = [
            ("Alice Johnson", "alice@example.com", "4695551234"),
            ("Mehdi", "mehdi@example.org", "4695559999"),
        ]
        self.assertEqual(cleaned_rows, expected_rows)

unittest.main(argv=[''], verbosity=2, exit=False)

test_mail (__main__.T.test_mail) ... ok
test_parse_dedup (__main__.T.test_parse_dedup) ... ok
test_phone (__main__.T.test_phone) ... ok
test_normal_phone (__main__.TestCRMCleanup.test_normal_phone) ... ok
test_parsing_and_dedup_order (__main__.TestCRMCleanup.test_parsing_and_dedup_order) ... ok
test_valid_email (__main__.TestCRMCleanup.test_valid_email) ... ok

----------------------------------------------------------------------
Ran 6 tests in 0.008s

OK


<unittest.main.TestProgram at 0x7bc78a7476b0>

## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
