**Note: For this assignment, you may only use standard Python and the `re` (Regular Expression) module. Advanced libraries such as NumPy and Pandas are not permitted.**

## Exercise 1

Use `re.search` to find whether a string contains a phone number. The pattern that you write should detect a phone number in the following strings.  
```
"Call me at 382-384-3840."  
"my number is (510) 849-3519. Call me!"
```  
It should not find a match in the following strings.
```
"my number is 510-849-35192"  
"here’s my number: 510-849.3519"
``` 
Consider writing your own tests as well.

In [20]:
import re
from datetime import datetime, timedelta

In [3]:
# YOUR CODE HERE
# Accept "ddd-ddd-dddd" or "(ddd) ddd-dddd". The lookarounds reject extra digits
# on either side, so the match does not depend on trailing punctuation.
phone_pattern = r'(?<!\d)(?:\(\d{3}\) |\d{3}-)\d{3}-\d{4}(?!\d)'

tests = [
    "Call me at 382-384-3840.",
    "my number is (510) 849-3519. Call me!",
    "my number is 510-849-35192",
    "here’s my number: 510-849.3519"
]

for text in tests:
    match = re.search(phone_pattern, text)
    if match:
        print(f"Found: {match.group()} in → {text}")
    else:
        print(f"No match in → {text}")

Found: 382-384-3840 in → Call me at 382-384-3840.
Found: (510) 849-3519 in → my number is (510) 849-3519. Call me!
No match in → my number is 510-849-35192
No match in → here’s my number: 510-849.3519
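
One way to act on the suggestion to write extra tests is a small table of `(text, expected)` pairs. The sketch below is only illustrative: `check_pattern`, the two extra strings, and both pattern names are hypothetical. It shows why extra cases matter: a pattern that anchors on a trailing period passes the four given strings but misses a number followed by anything else.

```python
import re

# (text, should_match) pairs; the last two are hypothetical extra cases
cases = [
    ("Call me at 382-384-3840.", True),
    ("my number is (510) 849-3519. Call me!", True),
    ("my number is 510-849-35192", False),
    ("here's my number: 510-849.3519", False),
    ("Call 382-384-3840 now", True),   # no trailing period
    ("id 1382-384-3840", False),       # extra leading digit
]

def check_pattern(pattern, cases):
    """Return the texts on which re.search disagrees with the expectation."""
    return [text for text, expected in cases
            if bool(re.search(pattern, text)) != expected]

period_pattern = r'\(?\d{3}\)?[ -]\d{3}-\d{4}\.'
lookaround_pattern = r'(?<!\d)(?:\(\d{3}\) |\d{3}-)\d{3}-\d{4}(?!\d)'

print(check_pattern(period_pattern, cases))      # texts the period-anchored pattern gets wrong
print(check_pattern(lookaround_pattern, cases))  # texts the lookaround pattern gets wrong
```

An empty list means the pattern agrees with every expectation in the table.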


## Exercise 2

Use `re.sub` to alter the string below so that the dates have a common format that uses a dash for the day, month, and year separator.  
```
03/12/2018, 03.13.18, 03/14/2018, 03:15:2018
```

In [4]:
# YOUR CODE HERE
input_string = "03/12/2018, 03.13.18, 03/14/2018, 03:15:2018"
date_pattern = r'(\d{2})[/.:](\d{2})[/.:](\d{2,4})'
replacement = r'\1-\2-\3'
output_string = re.sub(date_pattern, replacement, input_string)
output_string

'03-12-2018, 03-13-18, 03-14-2018, 03-15-2018'
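
The result above unifies the separators but still mixes two-digit and four-digit years. If a fully uniform format is wanted, `re.sub` also accepts a function as the replacement. The following is a minimal sketch, assuming every two-digit year belongs to the 2000s (an assumption the exercise does not state); `normalize_date` is a hypothetical helper.

```python
import re

def normalize_date(match):
    """Rejoin the three date parts with dashes; two-digit years are ASSUMED to be 20YY."""
    first, second, year = match.groups()
    if len(year) == 2:
        year = "20" + year
    return f"{first}-{second}-{year}"

input_string = "03/12/2018, 03.13.18, 03/14/2018, 03:15:2018"
# Try a four-digit year first, then fall back to two digits
date_pattern = r'(\d{2})[/.:](\d{2})[/.:](\d{4}|\d{2})'
normalized = re.sub(date_pattern, normalize_date, input_string)
print(normalized)
```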

## Exercise 3

Consider the first five sentences of the novel “Little Women” below. Extract the spoken dialog from each sentence.

In [5]:
text = '''
"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.
"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff.
"We've got Father and Mother, and each other," said Beth contentedly from her corner.
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."
'''
print(text)


"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.
"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff.
"We've got Father and Mother, and each other," said Beth contentedly from her corner.
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."



In [6]:
# YOUR CODE HERE
sentences = [s.strip() for s in text.split('\n') if s.strip()]
dialogue_pattern = r'"(.*?)"'
dialogues = []

for sentence in sentences:
    match = re.search(dialogue_pattern, sentence)
    if match:
        dialogues.append(match.group(1))

for i, dialogue in enumerate(dialogues):
    print(f"Sentence {i+1}: \"{dialogue}\"")

Sentence 1: "Christmas won't be Christmas without any presents,"
Sentence 2: "It's so dreadful to be poor!"
Sentence 3: "I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all,"
Sentence 4: "We've got Father and Mother, and each other,"
Sentence 5: "We haven't got Father, and shall not have him for a long time."
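
If the per-sentence bookkeeping is not needed, `re.findall` over the whole text is a possible alternative that returns every quoted span in one call. The sketch below uses a shortened sample and a negated character class, `[^"]*`, in place of the lazy `.*?`; both choices are illustrative rather than required.

```python
import re

sample = '''
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.
"We've got Father and Mother, and each other," said Beth contentedly from her corner.
'''

# [^"]* consumes everything up to the next double quote
quotes = re.findall(r'"([^"]*)"', sample)
for quote in quotes:
    print(quote)
```

Unlike the `re.search` loop, this returns one item per quoted span, so a sentence containing two quotations would contribute two entries.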


## Exercise 4

In this exercise, you will work with the attached `email_test.txt` file using regular expressions.\
Original dataset: https://www.kaggle.com/datasets/rtatman/fraudulent-email-corpus

In [23]:
# YOUR CODE HERE: open file
with open("email_test.txt", "r", encoding="utf-8") as f:
    data = f.read()

split_pattern = r'(?=^From r\s+)'
emails = re.split(split_pattern, data, flags=re.M)
emails = [email for email in emails if email.strip()]
total_emails = len(emails)
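
To see what the lookahead split does, here is a small sketch on toy data (the two messages below are invented stand-ins, not taken from the file). Because `(?=...)` consumes no characters, each piece keeps its own `From r` line. Splitting on a zero-width match requires Python 3.7 or later.

```python
import re

# Toy stand-in for the file contents
data = (
    "From r  Wed Oct 30 21:41:56 2002\n"
    "Subject: first\n"
    "body one\n"
    "From r  Thu Oct 31 08:11:39 2002\n"
    "Subject: second\n"
    "body two\n"
)

# re.M lets ^ match at the start of every line, not only at the start of the string
pieces = [p for p in re.split(r'(?=^From r\s+)', data, flags=re.M) if p.strip()]
print(len(pieces))
print(pieces[1].splitlines()[0])
```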

#### Simple Fraudulent email detection

1. Count how many emails contain urgency-related words (URGENT, IMMEDIATELY, QUICK, ASSISTANCE, CONFIDENTIAL). 
Calculate what percentage of emails use these tactics.

In [4]:
# YOUR CODE HERE
keywords = ['URGENT', 'IMMEDIATELY', 'QUICK', 'ASSISTANCE', 'CONFIDENTIAL']
search_pattern = r'\b(' + '|'.join(keywords) + r')\b'
    
count_with_keywords = 0
for email_content in emails:
    if re.search(search_pattern, email_content, re.IGNORECASE):
        count_with_keywords += 1
        
percentage = (count_with_keywords / total_emails) * 100
        
print(f"Total emails analyzed: {total_emails}")
print(f"Emails containing urgency keywords: {count_with_keywords}")
print(f"Percentage of emails using these tactics: {percentage:.2f}%")

Total emails analyzed: 1330
Emails containing urgency keywords: 1143
Percentage of emails using these tactics: 85.94%


2. Find all mentions of money amounts in the email bodies (e.g., `US$25M`, `$100,000.00`, `USD$31,000,000.00`). Calculate:
- Total number of money mentions across all emails
- The largest amount mentioned
- The smallest amount mentioned
- Average amount per email

In [12]:
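all_amounts = []
# Groups: (1) currency prefix, (2) USD/US part, (3) number, (4) optional magnitude suffix.
# Caveat: with re.IGNORECASE, "US ?" also matches the word "us", and the suffix
# group matches a bare "m" or "b" at the start of the next word (e.g. "$100 by"),
# so counts and extreme values can be inflated.
money_pattern = r'((USD|US ?)\$?|\$)\s*([\d,.]+)\s*(M|Million|B|Billion)?'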

for email_content in emails:
    mentions = re.findall(money_pattern, email_content, re.IGNORECASE)
    for match in mentions:
        num_str = match[2]
        suffix_str = match[3]
        amount = None
        
        num_str = num_str.replace(',', '')
        try:
            amount = float(num_str)
        except ValueError:
            continue 
    
        if suffix_str:
            suffix_str_lower = suffix_str.lower()
            if suffix_str_lower in ('m', 'million'):
                amount *= 1_000_000
            elif suffix_str_lower in ('b', 'billion'):
                amount *= 1_000_000_000
        
        if amount is not None:
            all_amounts.append(amount)

if all_amounts:     
    total_mentions = len(all_amounts)
    max_amount = max(all_amounts)
    min_amount = min(all_amounts)
    total_sum = sum(all_amounts)
    
    average_per_email = 0
    if total_emails > 0:
        average_per_email = total_sum / total_emails
            
    print(f"1. Total money mentions: {total_mentions}")
    print(f"2. Largest amount mentioned: ${max_amount:,.2f}")
    print(f"3. Smallest amount mentioned: ${min_amount:,.2f}")
    print(f"4. Average amount per email (total amount / total emails): ${average_per_email:,.2f}")
else:
    print("No money mentions found.")

1. Total money mentions: 2169
2. Largest amount mentioned: $80,500,000,000,000,000.00
3. Smallest amount mentioned: $0.00
4. Average amount per email (total amount / total emails): $121,485,353,774,730.48
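
The pattern in the cell above is fairly permissive: under `re.IGNORECASE` its optional suffix can match a bare `m` or `b` that merely starts the next word, and `US ?` can match the pronoun "us". A stricter variant might require a dollar sign and a word boundary after the suffix. The sketch below is one such variant; `strict_money_pattern`, `MULTIPLIERS`, `parse_amounts`, and the sample sentence are hypothetical, and no claim is made about what it would report on the actual dataset.

```python
import re

strict_money_pattern = re.compile(
    r'(?:USD|US)?\s?\$\s*'               # optional USD/US prefix, mandatory dollar sign
    r'(\d[\d,]*(?:\.\d+)?)'              # digits with optional thousands commas and decimals
    r'(?:\s*(million|billion|m|b)\b)?',  # optional suffix that must end at a word boundary
    re.IGNORECASE,
)

MULTIPLIERS = {'m': 1e6, 'million': 1e6, 'b': 1e9, 'billion': 1e9}

def parse_amounts(text):
    """Return every dollar amount in text as a float, with suffixes expanded."""
    amounts = []
    for number, suffix in strict_money_pattern.findall(text):
        value = float(number.replace(',', ''))
        amounts.append(value * MULTIPLIERS.get(suffix.lower(), 1))
    return amounts

sample = "Send $100 by Friday. We hold US$25M and USD$31,000,000.00; give us 20 percent."
print(parse_amounts(sample))
```

In this sample, "$100 by" is read as one hundred dollars rather than one hundred billion, and "us 20" is not counted as money at all.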


3. Extract all mentions of deaths or deceased persons (e.g., "late father", "died", "deceased", "death of").\
   What percentage of emails use death as part of their story?

In [14]:
simple_words = [
    'died',
    'deceased',
    'death',
    'dead',
    'assassinated',
    'assassination',
    'demise',
    'passed away',
    'killed',
    'fatalities',
    'casualties'
]

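# These are matched without word boundaries, so 'late ' also hits words such as
# "calculate " or "translate "; the reported percentage may be an overestimate.
special_patterns = [
    'late ',
    'lost .* lives'
]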

p1 = r'\b(' + '|'.join(simple_words) + r')\b'
p2 = r'(' + '|'.join(special_patterns) + r')'
death_pattern = p1 + '|' + p2

count_with_death = 0
for email_content in emails:
    if re.search(death_pattern, email_content, re.IGNORECASE):
        count_with_death += 1
                
percentage_death = 0.0
if total_emails > 0:
    percentage_death = (count_with_death / total_emails) * 100
  
print(f"Emails using a death-related story: {count_with_death}")
print(f"Percentage of emails using this tactic: {percentage_death:.2f}%")

Emails using a death-related story: 837
Percentage of emails using this tactic: 62.93%
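
Substring patterns such as `'late '` can match inside unrelated words. The sketch below contrasts an unanchored pattern with a bounded one that expects "late" as a whole word followed by a relation or title. The noun list is a hypothetical choice for illustration, not a vetted vocabulary.

```python
import re

loose = re.compile(r'late ', re.IGNORECASE)
# Hypothetical noun list; extend it to suit the corpus
bounded = re.compile(r'\blate\s+(?:father|mother|husband|wife|president)\b', re.IGNORECASE)

samples = [
    "my late father left a deposit",
    "please calculate the fees",
    "do not translate this message",
]

for text in samples:
    print(bool(loose.search(text)), bool(bounded.search(text)), text)
```

The unanchored pattern fires on all three sentences, while the bounded one fires only on the first.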


4. Many emails mention percentage splits of money (e.g., "70% for us", "20% for you", "10% for expenses"). \
   Extract all percentage distributions and identify the most common split pattern offered to recipients.

   Example of the most common split patterns:\
   70% - 20% - 10%: appears 15 times\
   75% - 20% - 5%: appears 12 times\
   60% - 30% - 10%: appears 8 times\
   80% - 15% - 5%: appears 6 times\
   55% - 30% - 10% - 5%: appears 4 times

In [15]:
# YOUR CODE HERE
split_patterns_counter = {}
percentage_pattern = r'(\d+)%'

for email_content in emails:       
    percentages_str = re.findall(percentage_pattern, email_content)
    percentages_int = []
    for p_str in percentages_str:
        p_int = int(p_str)
        if 0 < p_int < 100:
            percentages_int.append(p_int)
        
    if len(percentages_int) >= 2:
        percentages_int.sort(reverse=True)
        pattern_tuple = tuple(percentages_int)
        current_count = split_patterns_counter.get(pattern_tuple, 0)
        split_patterns_counter[pattern_tuple] = current_count + 1
    
sorted_splits = sorted(
    split_patterns_counter.items(), 
    key=lambda item: item[1], 
    reverse=True
)
        
for pattern_tuple, count in sorted_splits[:5]:
    pattern_str = " - ".join([f"{p}%" for p in pattern_tuple])
    print(f"{pattern_str}: appears {count} times")

60% - 30% - 10%: appears 78 times
70% - 25% - 5%: appears 72 times
70% - 20% - 10%: appears 50 times
75% - 20% - 5%: appears 37 times
60% - 35% - 5%: appears 36 times
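
The manual dictionary counting and sorting can also be expressed with `collections.Counter`, assuming the standard library counts as "standard Python" under the assignment rules. This is a sketch on three invented email bodies; `toy_emails` and `split_counter` are hypothetical names.

```python
import re
from collections import Counter

# Invented bodies for illustration only
toy_emails = [
    "70% for us, 20% for you and 10% for expenses",
    "you keep 20%, we take 70%, and 10% covers costs",
    "60% to me, 30% to you, 10% set aside",
]

split_counter = Counter()
for body in toy_emails:
    # Keep plausible shares only and sort descending so the order of mention does not matter
    shares = sorted(
        (int(p) for p in re.findall(r'(\d+)%', body) if 0 < int(p) < 100),
        reverse=True,
    )
    if len(shares) >= 2:
        split_counter[tuple(shares)] += 1

for shares, count in split_counter.most_common(5):
    print(" - ".join(f"{p}%" for p in shares) + f": appears {count} times")
```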


5. Create a "scam score" for each email based on:\
    Urgency keywords (1 point each)\
    Money mentions (2 points each)\
    Percentages offered (1 point each)\
    Death mentions (1 point)\
    ALL CAPS usage (1 point if >20% of text)

    ***Rank the top 10 highest-scoring emails*** 

In [16]:
# YOUR CODE HERE
scam_scores = []

for email_content in emails:
    score = 0
    
    urgency_matches = re.findall(search_pattern, email_content, re.IGNORECASE)
    score += len(urgency_matches) * 1

    money_mentions = re.findall(money_pattern, email_content, re.IGNORECASE)
    score += len(money_mentions) * 2

    percentages_str = re.findall(percentage_pattern, email_content)
    percentages_int = [int(p) for p in percentages_str if 0 < int(p) < 100]
    score += len(percentages_int) * 1

    if re.search(death_pattern, email_content, re.IGNORECASE):
        score += 1

    letters = [c for c in email_content if c.isalpha()]
    if letters:
        num_upper = sum(1 for c in letters if c.isupper())
        if num_upper / len(letters) > 0.2:
            score += 1

    scam_scores.append(score)

email_scores_with_index = list(enumerate(scam_scores))
top_emails = sorted(email_scores_with_index, key=lambda x: x[1], reverse=True)[:10]

for rank, (idx, score) in enumerate(top_emails, start=1):
    email_preview = emails[idx][:100].replace("\n", " ") + "..." 
    print(f"{rank}. Email {idx+1}: Scam score = {score}")
    print(f"   Preview: {email_preview}\n")

1. Email 788: Scam score = 71
   Preview:  Wed Apr  7 09:38:50 2004 Return-Path: <monicamtf11@galmail.co.za> From: "monica martins" <monicamtf...

2. Email 593: Scam score = 43
   Preview:  Tue Dec  9 19:48:27 2003 Return-Path: <dr_usmanbello@fsmail.net> Message-ID: <30547158.107101727432...

3. Email 134: Scam score = 34
   Preview:  Thu Feb 27 08:25:50 2003 Return-Path: <drabdullrasaq@phantomemail.com> Message-Id: <200302271324.h1...

4. Email 591: Scam score = 32
   Preview:  Fri Dec  5 12:06:18 2003 Return-Path: <ebaye52@yahoo.co.uk> Message-ID: <20031205170606.77721.qmail...

5. Email 244: Scam score = 30
   Preview:  Fri May 30 05:50:47 2003 Return-Path: <chuksanthony05@netscape.net> Message-Id: <200305300950.h4U9o...

6. Email 533: Scam score = 30
   Preview:  Wed Nov  5 08:17:16 2003 Return-Path: <jamesmorgan@fsmail.net> Message-ID: <30401587.1068038230059....

7. Email 945: Scam score = 29
   Preview:  Mon Jun 14 17:00:08 2004 	by webmail.postino.it (IMP) with HTTP  	for <cont
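
The ALL CAPS criterion ("more than 20% of text") can be read in more than one way. The scoring cell measures the share of uppercase letters, which also rewards title-cased text. Another reading counts only words written entirely in capitals. The sketch below compares the two on invented strings; both helper names are hypothetical, and neither reading is claimed to be the intended one.

```python
import re

def upper_letter_ratio(text):
    """Share of alphabetic characters that are uppercase."""
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

def caps_word_ratio(text):
    """Share of words (two or more letters) written entirely in capitals."""
    words = re.findall(r'[A-Za-z]{2,}', text)
    return sum(w.isupper() for w in words) / len(words) if words else 0.0

title_case = "Dear Sir, I Am Mr. John Smith From Lagos"
shouting = "URGENT REPLY NEEDED, contact me today"

for text in (title_case, shouting):
    print(round(upper_letter_ratio(text), 2), round(caps_word_ratio(text), 2), text)
```

The title-cased greeting exceeds the 20% threshold under the letter-based measure even though it contains no all-caps word.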

6. Identify emails that appear to be duplicates or near-duplicates (same sender, similar subject, sent within 24 hours). How many duplicate emails exist?

In [22]:
# YOUR CODE HERE
# Indices of every email that belongs to at least one duplicate pair
duplicate_indices = set()

for i in range(len(emails)):
    for j in range(i + 1, len(emails)):
        e1 = emails[i]
        e2 = emails[j]
    
        sender_match1 = re.search(r'^From:\s*(.+)$', e1, re.M)
        sender_match2 = re.search(r'^From:\s*(.+)$', e2, re.M)
        sender1 = sender_match1.group(1).strip() if sender_match1 else "unknown1"
        sender2 = sender_match2.group(1).strip() if sender_match2 else "unknown2"
        
        if sender1 != sender2:
            continue

        subject_match1 = re.search(r'^Subject:\s*(.+)$', e1, re.M)
        subject_match2 = re.search(r'^Subject:\s*(.+)$', e2, re.M)
        subject1 = subject_match1.group(1).strip().lower() if subject_match1 else ""
        subject2 = subject_match2.group(1).strip().lower() if subject_match2 else ""
        
        if subject1 != subject2:
            continue
        
        date_match1 = re.search(r'^Date:\s*(.+)$', e1, re.M)
        date_match2 = re.search(r'^Date:\s*(.+)$', e2, re.M)

        if not date_match1 or not date_match2:
            continue

        date_str1 = date_match1.group(1).strip()
        date_str2 = date_match2.group(1).strip()
        
        try:
            date_str1_clean = re.sub(r'\s*\([A-Za-z]+\)\s*$', '', date_str1)
            date_str2_clean = re.sub(r'\s*\([A-Za-z]+\)\s*$', '', date_str2)

            date_format = "%a, %d %b %Y %H:%M:%S %z"
            
            dt1 = datetime.strptime(date_str1_clean, date_format)
            dt2 = datetime.strptime(date_str2_clean, date_format)
            time_difference = abs(dt1 - dt2)

            if time_difference > timedelta(hours=24):
                continue
                
        except ValueError:
            try:
                date_format_no_day = "%d %b %Y %H:%M:%S %z"
                dt1 = datetime.strptime(date_str1_clean, date_format_no_day)
                dt2 = datetime.strptime(date_str2_clean, date_format_no_day)
                
                time_difference = abs(dt1 - dt2)
                if time_difference > timedelta(hours=24):
                    continue
            except ValueError:
                continue
        except Exception:
            continue
    
        duplicate_indices.add(i)
        duplicate_indices.add(j)

print(f"Total duplicate or near-duplicate emails: {len(duplicate_indices)}")

Total duplicate or near-duplicate emails: 356
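
The nested loop above re-parses the headers for every pair of emails. One possible alternative is to parse each email once, group by `(sender, subject)`, and compare only neighbouring timestamps within each group. The sketch below works on invented records and assumes a single date format; it illustrates the grouping idea and is not a drop-in replacement for the cell above.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Toy records: (sender, subject, Date header), invented for illustration
records = [
    ("a@example.com", "Urgent Reply", "Mon, 03 Mar 2003 10:00:00 +0000"),
    ("a@example.com", "urgent reply", "Mon, 03 Mar 2003 21:30:00 +0000"),
    ("a@example.com", "urgent reply", "Fri, 07 Mar 2003 09:00:00 +0000"),
    ("b@example.com", "Hello", "Mon, 03 Mar 2003 10:05:00 +0000"),
]

# Parse every record once and bucket it by (sender, normalized subject)
groups = defaultdict(list)
for index, (sender, subject, date_str) in enumerate(records):
    sent = datetime.strptime(date_str, "%a, %d %b %Y %H:%M:%S %z")
    groups[(sender, subject.strip().lower())].append((sent, index))

duplicate_indices = set()
for members in groups.values():
    members.sort()  # chronological order within the group
    # If any two emails are within 24 hours, so are the neighbours between them
    for (t1, i1), (t2, i2) in zip(members, members[1:]):
        if t2 - t1 <= timedelta(hours=24):
            duplicate_indices.update((i1, i2))

print(sorted(duplicate_indices))
```

Sorting each group means only adjacent timestamps need comparing, which avoids the pairwise scan over all emails.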
