
# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [10]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

Wrote contacts_raw.txt with sample DalaShop data.


## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
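
To make steps 2 and 3 concrete, here is a minimal sketch of the difference between partial and full matching, and of the "last 10 digits" rule. The helper names `match_modes` and `last_ten_digits` are hypothetical and exist only for this illustration; the regex is the one given in step 2, and the phone strings come from the sample file.

```python
import re

# Pattern copied from step 2 of the assignment.
EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def match_modes(raw):
    """Return (partial_match, full_match) for the stripped input (illustrative helper)."""
    text = raw.strip()
    partial = EMAIL_REGEX.search(text) is not None     # pattern found anywhere in the text
    full = EMAIL_REGEX.fullmatch(text) is not None     # the whole text must be an address
    return partial, full


def last_ten_digits(raw):
    """Drop non-digits, then keep the last 10 digits or return "" (illustrative helper)."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


# Hypothetical email inputs chosen to show where search() and fullmatch() disagree.
for candidate in ["  alice@example.com  ", "alice@example.com, extra", "bob[at]example.com"]:
    print(repr(candidate), match_modes(candidate))

# Phone strings taken from the sample data: 11 digits and 9 digits respectively.
for phone in ["+1 (469) 555-1234", "972-555-777"]:
    print(repr(phone), repr(last_ten_digits(phone)))
```

The second email input is the reason step 2 asks for full matching: `search()` finds an address inside the string, while `fullmatch()` rejects it because of the trailing text.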

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments

In [6]:
from pathlib import Path
import re
import csv

EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def is_valid_email(email: str) -> bool:
    email = email.strip()
    return bool(EMAIL_REGEX.fullmatch(email))

def normalize_phone(phone: str) -> str:
    digits = re.sub(r"\D", "", phone)
    if len(digits) >= 10:
        return digits[-10:]
    return ""

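def parse_contact_line(line: str):
    """Return (name, email, phone) for one raw line, or (None, None, None) if malformed."""
    line = line.strip()
    if "<" in line:
        # Format: Name <email> , phone
        name_part, _, rest = line.partition("<")
        email_part, closing, phone_part = rest.partition(">")
        if not closing:
            return None, None, None
    else:
        # Format: Name , email , phone
        parts = line.split(",", 2)
        if len(parts) != 3:
            return None, None, None
        name_part, email_part, phone_part = parts
    name = name_part.strip().strip('"')
    phone = phone_part.strip().lstrip(",").strip()
    return name, email_part.strip(), phone

def crm_cleanup(input_file="contacts_raw.txt", output_file="contacts_clean.csv"):
    """Read raw contacts, keep rows with valid unique emails, and write a clean CSV."""
    input_path = Path(input_file)
    try:
        with input_path.open(encoding="utf-8") as f:
            lines = f.readlines()
    except FileNotFoundError:
        # Handle a missing input file gracefully instead of crashing
        print(f"Could not find {input_file}. Create the file and run this cell again.")
        return 1

    cleaned = []
    seen_emails = set()

    for line in lines:
        name, email, phone = parse_contact_line(line)

        # Skip malformed lines and rows with an invalid email
        if email is None or not is_valid_email(email):
            continue

        # Case-insensitive de-duplication: keep the first occurrence only
        key = email.casefold()
        if key in seen_emails:
            continue

        seen_emails.add(key)
        cleaned.append((name, email, normalize_phone(phone)))

    # Write output CSV in first-appearance order
    with Path(output_file).open("w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["name", "email", "phone"])
        writer.writerows(cleaned)

    print(f"Wrote {output_file} with {len(cleaned)} contacts.")
    return 0


if __name__ == "__main__":
    crm_cleanup()


Wrote contacts_clean.csv with 5 contacts.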


## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
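
One possible way to cover many "variations" without writing one test method per input is a table of cases combined with `subTest`. The sketch below is only an illustration under stated assumptions: the class name `TestHelpersTableDriven` is hypothetical, and the two helpers are stand-in copies so the block runs on its own (in a real `test_crm_cleanup.py` they would more likely be imported from `q1_crm_cleanup`).

```python
import re
import unittest

# Stand-in copies of the Q1 helpers so this sketch is self-contained.
EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_valid_email(email):
    return bool(EMAIL_REGEX.fullmatch(email.strip()))


def normalize_phone(phone):
    digits = re.sub(r"\D", "", phone)
    return digits[-10:] if len(digits) >= 10 else ""


class TestHelpersTableDriven(unittest.TestCase):
    def test_email_variations(self):
        cases = [
            ("alice@example.com", True),
            ("  sara@mail.co  ", True),      # surrounding whitespace is stripped first
            ("bob[at]example.com", False),   # no "@" character
            ("user@domain", False),          # no dot followed by a top-level domain
        ]
        for email, expected in cases:
            # subTest reports each failing case separately instead of stopping at the first
            with self.subTest(email=email):
                self.assertEqual(is_valid_email(email), expected)

    def test_phone_variations(self):
        cases = [
            ("+1 (469) 555-1234", "4695551234"),  # country code is dropped
            ("972.777.2121", "9727772121"),
            ("972-555-777", ""),                  # only 9 digits, so too short
        ]
        for raw, expected in cases:
            with self.subTest(raw=raw):
                self.assertEqual(normalize_phone(raw), expected)


if __name__ == "__main__":
    # argv and exit=False keep unittest from parsing notebook arguments or stopping the kernel
    unittest.main(argv=["ignored"], exit=False)
```

With this layout, adding a new variation is a one-line change to the `cases` list, and a failure message names the exact input that broke.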




In [26]:
# test_crm_cleanup.py
import unittest
import re

# Helper functions copied from the Q1 cell so this cell runs on its own.
# EMAIL_REGEX is the pattern given in the assignment prompt.
EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def is_valid_email(email: str) -> bool:
    email = email.strip()
    # Full match against the prompt's regex; no extra checks are applied.
    return bool(EMAIL_REGEX.fullmatch(email))

def normalize_phone(phone: str) -> str:
    digits = re.sub(r"\D", "", phone)
    if len(digits) >= 10:
        return digits[-10:]
    return ""

# Parsing helper used by test_parsing_and_deduplication below
def parse_contact_line(line: str):
    line = line.strip()
    # Pattern for <email> format
    match_angle_brackets = re.match(r"(.*?)\s*<(.*?)>\s*,?\s*(.*)", line)
    if match_angle_brackets:
        name, email, phone = match_angle_brackets.groups()
        name = name.strip()
        # Handle quoted names
        if name.startswith('"') and name.endswith('"'):
            name = name[1:-1]
        return name, email.strip(), phone.strip()

    # Pattern for comma-separated format (name, email, phone)
    match_comma_separated = re.match(r"(.*?)\s*,\s*(.*?)\s*,\s*(.*)", line)
    if match_comma_separated:
        name, email, phone = match_comma_separated.groups()
        name = name.strip()
        # Handle quoted names
        if name.startswith('"') and name.endswith('"'):
            name = name[1:-1]
        return name, email.strip(), phone.strip()

    return None, None, None


class TestCRMHelpers(unittest.TestCase):
    def test_valid_emails(self):
        self.assertTrue(is_valid_email("alice@example.com"))
        self.assertTrue(is_valid_email("bob.smith+promo@company.org"))
        self.assertTrue(is_valid_email("user123@sub.domain.co.uk"))
        # The prompt's simple regex allows consecutive dots in the domain, so this counts as valid
        self.assertTrue(is_valid_email("user@domain..com"))


    def test_invalid_emails(self):
        self.assertFalse(is_valid_email("alice@"))
        self.assertFalse(is_valid_email("bob[at]example.com"))
        self.assertFalse(is_valid_email("user@domain"))
        self.assertFalse(is_valid_email("user@.com")) # domain starts with dot
        self.assertFalse(is_valid_email("user@domain.")) # domain ends with dot


    def test_phone_normalization(self):
        self.assertEqual(normalize_phone("+1 (469) 555-1234"), "4695551234")
        self.assertEqual(normalize_phone("555-1234"), "")  # too short
        self.assertEqual(normalize_phone("001-214-987-6543"), "2149876543")
        self.assertEqual(normalize_phone("(972)555-9999"), "9725559999")
        self.assertEqual(normalize_phone("972.777.2121"), "9727772121")
        self.assertEqual(normalize_phone("214 555 8888"), "2145558888")
        self.assertEqual(normalize_phone("123"), "") # too short
        self.assertEqual(normalize_phone(""), "") # empty string


    def test_parsing_and_deduplication(self):
        raw_data = """Alice Johnson <alice@example.com> , +1 (469) 555-1234
Bob Roberts <bob[at]example.com> , 972-555-777
Sara M. , sara@mail.co , 214 555 8888
"Mehdi A." <mehdi.ay@example.org> , (469)555-9999
Delaram <delaram@example.io>, +1-972-777-2121
Nima <NIMA@example.io> , 972.777.2121
duplicate <Alice@Example.com> , 469 555 1234
Malformed Line Without Email
"""
        # Simulate the cleanup logic without file I/O
        cleaned = []
        seen_emails = set()

        for line in raw_data.strip().split('\n'):
            name, email, phone = parse_contact_line(line)

            if not email or not is_valid_email(email): # Check if email is not None and valid
                continue

            if email.casefold() in seen_emails:
                continue

            seen_emails.add(email.casefold())
            phone_clean = normalize_phone(phone)
            cleaned.append((name, email, phone_clean))

        # Expected rows after dropping invalid emails and case-variant duplicates
        expected = [
            ("Alice Johnson", "alice@example.com", "4695551234"),
            ("Sara M.", "sara@mail.co", "2145558888"),
            ("Mehdi A.", "mehdi.ay@example.org", "4695559999"),
            ("Delaram", "delaram@example.io", "9727772121"),
            ("Nima", "NIMA@example.io", "9727772121"),  # same phone as Delaram, but a different email
        ]

        self.assertEqual(cleaned, expected)


if __name__ == "__main__":
    unittest.main(argv=['first-arg-is-ignored'], exit=False)  # argv/exit settings let unittest run inside a notebook

....
----------------------------------------------------------------------
Ran 4 tests in 0.007s

OK


## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
