# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [9]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

Wrote contacts_raw.txt with sample DalaShop data.


## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
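
To see why step 2 asks for full matching, the minimal sketch below compares `re.search` with `re.fullmatch` on the regex from that step, then applies the step 3 phone rule to two values from the sample file. The names `EMAIL_RE` and `last_ten_digits` are illustrative assumptions, not names the assignment requires.

```python
import re

# Regex copied from step 2; EMAIL_RE is only an illustrative name.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

candidate = "Alice Johnson <alice@example.com>"

# search() succeeds because a valid address appears somewhere in the text.
partial = EMAIL_RE.search(candidate) is not None
# fullmatch() fails because the whole string is not an address.
full = EMAIL_RE.fullmatch(candidate.strip()) is not None
print(partial, full)  # True False


def last_ten_digits(raw):
    """Hypothetical helper: strip non-digits, keep the last 10 or return ''."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


print(repr(last_ten_digits("+1 (469) 555-1234")))  # '4695551234'
print(repr(last_ten_digits("972-555-777")))        # '' (only 9 digits)
```

The first result shows that a partial match would accept a whole `Name <email>` field as a "valid email", which is why the address has to be isolated before it is validated.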

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments
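
The "keep first" de-duplication rule can be sketched with a set of case-folded keys and a list that preserves input order. This is one possible approach under the assumption that rows are dictionaries with an `email` key; `dedupe_by_email` is a hypothetical helper name.

```python
def dedupe_by_email(rows):
    """Hypothetical helper: keep the first row for each case-insensitive email."""
    seen = set()
    unique = []
    for row in rows:
        key = row["email"].casefold()  # case-insensitive comparison key
        if key not in seen:
            seen.add(key)
            unique.append(row)  # original row (and original casing) is kept
    return unique


rows = [
    {"name": "Alice Johnson", "email": "alice@example.com"},
    {"name": "Nima", "email": "NIMA@example.io"},
    {"name": "duplicate", "email": "Alice@Example.com"},
]
result = dedupe_by_email(rows)
print([row["name"] for row in result])  # ['Alice Johnson', 'Nima']
```

Because rows are appended only on first sight and never sorted, the output order matches the first-appearance order of the input.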

In [10]:
import csv
import re
from pathlib import Path

# Simple email pattern given in the assignment (used with full matching).
EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches the 'Name <email>' layout used by most rows in contacts_raw.txt.
NAME_EMAIL_REGEX = re.compile(r"^(?P<name>.*?)\s*<(?P<email>[^<>]*)>$")

def normalize_phone(phone_raw):
    """Removes non-digits and returns the last 10 digits or an empty string."""
    digits = re.sub(r"\D", "", phone_raw)
    if len(digits) >= 10:
        return digits[-10:]
    return ""

def is_valid_email(email_raw):
    """Validates email format using a simple regex with full match."""
    return EMAIL_REGEX.fullmatch(email_raw.strip()) is not None

def parse_line(line):
    """Splits one raw line into (name, email, phone) strings.

    Handles both 'Name <email> , phone' and 'Name , email , phone' layouts.
    """
    parts = [part.strip() for part in line.split(",")]
    match = NAME_EMAIL_REGEX.match(parts[0])
    if match:
        # The first field holds both the name and the bracketed email.
        name, email = match.group("name"), match.group("email")
        phone = parts[1] if len(parts) > 1 else ""
    else:
        name = parts[0]
        email = parts[1] if len(parts) > 1 else ""
        phone = parts[2] if len(parts) > 2 else ""
    # Drop surrounding quotes such as "Mehdi A." from the name.
    return name.strip('" '), email.strip(), phone

def clean_crm_data(input_filename="contacts_raw.txt", output_filename="contacts_clean.csv"):
    """
    Reads raw contacts, cleans them, and writes to a clean CSV file.
    Handles FileNotFoundError gracefully.
    """
    contacts = []
    seen_emails = set()

    try:
        with Path(input_filename).open("r", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                name, email, phone_raw = parse_line(line)
                if not is_valid_email(email):
                    continue  # filter out rows without a valid email
                email_key = email.casefold()
                if email_key in seen_emails:
                    continue  # keep only the first occurrence
                seen_emails.add(email_key)
                contacts.append({
                    "name": name,
                    "email": email,
                    "phone": normalize_phone(phone_raw)
                })

    except FileNotFoundError:
        print(f"Error: The input file '{input_filename}' was not found.")
        return

    # Write the header and the rows in first-appearance order
    with Path(output_filename).open("w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=["name", "email", "phone"])
        writer.writeheader()
        writer.writerows(contacts)

    print(f"CRM cleanup complete. Check {output_filename}.")

# Run the cleanup process only when executed directly (not when imported by tests)
if __name__ == "__main__":
    clean_crm_data()

CRM cleanup complete. Check contacts_clean.csv.


## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
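
Item 3 asks for parsing to be tested from a string instead of a file. One way to make that possible is to keep the parsing logic in a function that accepts text. The sketch below assumes a hypothetical `parse_contacts(text)` helper that handles only the `name , email , phone` layout, and uses `subTest` to cover several phone formats in one test method.

```python
import re
import unittest

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def parse_contacts(text):
    """Hypothetical parser for 'name , email , phone' lines only."""
    rows = []
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            continue  # this sketch ignores other layouts
        name, email, phone_raw = parts
        if EMAIL_RE.fullmatch(email) is None:
            continue  # drop rows with an invalid email
        digits = re.sub(r"\D", "", phone_raw)
        phone = digits[-10:] if len(digits) >= 10 else ""
        rows.append({"name": name, "email": email, "phone": phone})
    return rows


class TestParseContacts(unittest.TestCase):
    def test_exact_rows_from_string(self):
        text = (
            "Sara M. , sara@mail.co , 214 555 8888\n"
            "No Mail , nomail@ , 214 555 0000\n"
        )
        expected = [
            {"name": "Sara M.", "email": "sara@mail.co", "phone": "2145558888"},
        ]
        self.assertEqual(parse_contacts(text), expected)

    def test_phone_variants(self):
        cases = {
            "(469)555-9999": "4695559999",
            "+1-972-777-2121": "9727772121",
            "555-1234": "",  # too short
        }
        for raw, expected in cases.items():
            with self.subTest(raw=raw):
                rows = parse_contacts(f"X , x@example.com , {raw}")
                self.assertEqual(rows[0]["phone"], expected)
```

Testing against a string keeps the test free of temporary files and makes the expected rows easy to read next to the input.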




In [11]:
%%writefile test_crm_cleanup.py
# Unit tests for the helpers in q1_crm_cleanup.py
import csv
import unittest
from pathlib import Path

from q1_crm_cleanup import normalize_phone, is_valid_email, clean_crm_data

class TestCRMCleanup(unittest.TestCase):

    def test_normalize_phone_valid(self):
        self.assertEqual(normalize_phone("+1 (469) 555-1234"), "4695551234")
        self.assertEqual(normalize_phone("214 555 8888"), "2145558888")
        self.assertEqual(normalize_phone("972.777.2121"), "9727772121")

    def test_normalize_phone_too_short(self):
        self.assertEqual(normalize_phone("972-555-777"), "")
        self.assertEqual(normalize_phone("12345"), "")

    def test_normalize_phone_empty_or_invalid(self):
        self.assertEqual(normalize_phone(""), "")
        self.assertEqual(normalize_phone("abc"), "")

    def test_is_valid_email_valid(self):
        self.assertTrue(is_valid_email("alice@example.com"))
        self.assertTrue(is_valid_email("sara@mail.co"))
        self.assertTrue(is_valid_email("test.email@example.co.uk"))

    def test_is_valid_email_invalid(self):
        self.assertFalse(is_valid_email("bob[at]example.com"))
        self.assertFalse(is_valid_email("invalid-email"))
        self.assertFalse(is_valid_email("invalid@.com"))

    def test_parsing_and_deduplication(self):
        # Small multi-line sample covering both layouts, a duplicate, and an invalid email
        test_data = """Alice Johnson <alice@example.com> , +1 (469) 555-1234
Sara M. , sara@mail.co , 214 555 8888
duplicate <Alice@Example.com> , 469 555 1234
Another Valid <another@example.com>, 9876543210
Invalid Email, invalid@, 1234567890
"""
        input_path = Path("test_contacts_raw.txt")
        output_path = Path("test_contacts_clean.csv")
        # Remove the temporary files even if an assertion fails
        self.addCleanup(input_path.unlink, missing_ok=True)
        self.addCleanup(output_path.unlink, missing_ok=True)

        input_path.write_text(test_data, encoding="utf-8")
        clean_crm_data(str(input_path), str(output_path))

        expected_output = [
            {"name": "Alice Johnson", "email": "alice@example.com", "phone": "4695551234"},
            {"name": "Sara M.", "email": "sara@mail.co", "phone": "2145558888"},
            {"name": "Another Valid", "email": "another@example.com", "phone": "9876543210"},
        ]

        with output_path.open("r", newline="", encoding="utf-8") as csvfile:
            actual_output = list(csv.DictReader(csvfile))

        self.assertEqual(actual_output, expected_output)

    def test_file_not_found(self):
        """Test graceful handling of FileNotFoundError."""
        output_filename = "non_existent_output.csv"
        # Use a non-existent input file
        clean_crm_data("non_existent_input.txt", output_filename)
        # Assert that the output file was not created
        self.assertFalse(Path(output_filename).exists())


# Allows running the file directly, e.g. `!python test_crm_cleanup.py` in Colab
if __name__ == '__main__':
    unittest.main()

Overwriting test_crm_cleanup.py
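
Inside a notebook kernel, `sys.argv` carries the kernel's own arguments, so a plain `unittest.main()` call in a cell may misread them. One hedged alternative is to build a suite and run it explicitly. The sketch below uses a tiny stand-in test case (`TestSmoke` is a hypothetical name) so that it is self-contained; in the notebook, `TestCRMCleanup` would be passed instead.

```python
import io
import unittest


class TestSmoke(unittest.TestCase):
    """Stand-in test case so the sketch is self-contained."""

    def test_casefold_matches(self):
        self.assertEqual("Alice@Example.com".casefold(), "alice@example.com")


# Load the tests from the class and run them without reading sys.argv
# and without calling sys.exit(), so the notebook kernel keeps running.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSmoke)
stream = io.StringIO()
result = unittest.TextTestRunner(stream=stream, verbosity=2).run(suite)
print(stream.getvalue())
print("all passed:", result.wasSuccessful())
```

From a shell cell, `!python -m unittest test_crm_cleanup.py` is another option once both `.py` files exist beside the notebook.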


## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
