---
---
Recitation 4: Regular Expressions

Applied Data Science using Python

New York University, Abu Dhabi

Dated: 20th Sept 2023

---
---
# Start Here
## Learning Goals
### General Goals
- Learn the basics of text processing
- Learn the basics of regular expressions

### Specific Goals
- Learn basic regex functions and operators
- Learn patterns and character classes
- Learn how to use quantifiers
- Learn how to use groups
- Learn about look-ahead and look-behind matching

## Distribution of Class Materials
These problem sets and recitations are the intellectual property of NYUAD. We ask students **not** to distribute them or their solutions to students who have not signed up for this class or who intend to sign up in the future. We also ask that you not post these problem sets and recitations online or on any public platform.

## Submission
You will submit all your code as a Python Notebook through [Brightspace](https://brightspace.nyu.edu/) as **R4_YOUR NETID.ipynb**.

---




# General Instructions
This recitation is worth 50 points and has 2 parts. All parts must be completed in the Jupyter (Colab) Notebook attached to this handout.



# Part I: Building an Email spam detector (25 points)

The faculty in the NYUAD Social Science division have been receiving a lot of spam emails lately. The default spam detector in Gmail fails to classify these emails as spam, so the faculty have reached out to the data scientists in the community to help create a customized spam detector.

From visual inspection, the spam emails appear to have one thing in common: in the body of the email, they ask the addressee to contact the sender at an email address with a blacklisted email domain. What is an email domain? An email domain is the web address that comes after the **@** symbol in an email address.

Consider the following spam email:

> URGENT! This email is being sent to all the employees with NYUAD email address. For security reasons, we are resetting the passwords of all the users. To verify your identity, please send us your NYUAD email address, and password to NYUAD-ADMIN@mailinator.net. Note that, except a few email addresses, you should not share your password with anyone else. More specifically, only trusted email addresses are NYUAD-ADMIN@mailinator.net, cybersecurity-awareness-training@bobmail.info, and nyuad-it@sogetthis.com.

In the above email body, there are three unique email addresses: NYUAD-ADMIN@mailinator.net, cybersecurity-awareness-training@bobmail.info, and nyuad-it@sogetthis.com. Their domain names are *mailinator.net*, *bobmail.info*, and *sogetthis.com*. All three of these domains are blacklisted.
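As a minimal sketch of the idea (the `sample` string and `domain_pattern` are illustrative assumptions, not a required solution), `re.findall` with a single capturing group returns only the part of each address that follows the **@** symbol:

```python
import re

# Hypothetical sample text built from the addresses mentioned above.
sample = ("Contact NYUAD-ADMIN@mailinator.net, "
          "cybersecurity-awareness-training@bobmail.info, and nyuad-it@sogetthis.com.")

# One possible pattern: a label, then one or more '.label' parts after '@'.
# Every '.' must be followed by at least one label character, so the
# period that ends the sentence is not captured as part of the domain.
domain_pattern = r'@([\w-]+(?:\.[\w-]+)+)'

domains = re.findall(domain_pattern, sample)
print(domains)  # ['mailinator.net', 'bobmail.info', 'sogetthis.com']
```

Because the inner group is non-capturing (`(?:...)`), `re.findall` returns a flat list of domain strings rather than tuples.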

In this task, you will use your knowledge of regular expressions to create an email spam detector using the logic above.

More precisely, write a function `is_spam(email_str, blacklisted_domains)` that takes in the body of the email as a string (as shown above), and a list of blacklisted domains, and returns `True` if the email body has a blacklisted domain name anywhere in the body of the email, and `False` if it does not.

In [59]:
import re

def is_spam(email_str, blacklisted_domains=[]):
  # Write your implementation of the function below this line

  ######### SOLUTION #########
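  # Use regex to find every email domain (the text after '@') in the email string.
  # A domain is a label followed by one or more '.label' parts, so multi-part domains
  # such as 'ms9.mailslite.com' are captured in full, while a trailing sentence period is not.
  domains = re.findall(r'@([\w-]+(?:\.[\w-]+)+)', email_str)
  # Compare in lowercase, since domain names are not case-sensitive
  blacklist = {d.lower() for d in blacklisted_domains}
  return any(domain.lower() in blacklist for domain in domains)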

  ######### SOLUTION END #########

In [60]:
# Reading blacklisted domain names from file
# In Google Colab, you need to mount your drive to access your files. If you are running Jupyter Notebook locally, you can skip this step.
# from google.colab import drive
# drive.mount('/content/drive')

# NOTE: This notebook was run locally, so the code above is commented out.
# If you are using Colab, uncomment the code above and set the path to the file containing the blacklisted domains in the cell below.

In [61]:
with open("blacklisted.txt","r") as file:
    # Read one domain per line into a list called blacklisted_domains,
    # stripping surrounding whitespace and newlines
    blacklisted_domains = [line.strip() for line in file]

print(blacklisted_domains)

['mailinator.com', 'mailinator2.com', 'mailinator.net', 'chammy.info', 'binkmail.com', 'bobmail.info', 'devnullmail.com', 'gawab.com', 'letthemeatspam.com', 'notmailinator.com', 'putthisinyourspamdatabase.com', 'tempimbox.com', 'thisisnotmyrealemail.com', 'tradermail.info', 'safetymail.info', 'sendspamhere.com', 'sogetthis.com', 'spam.la', 'spamherelots.com', 'spamthisplease.com', 'supergreatmail.com', 'suremail.info', 'veryrealemail.com', 'zippymail.info', 'guerrillamail.com', 'guerrillamail.org', 'guerrillamail.net', 'guerrillamail.biz', 'guerrillamail.de', 'guerrillamailblock.com', 'sharklasers.com', 'spam4.me', 'nwldx.com', 'rmqkr.net', 'ms9.mailslite.com', 'zoaxe.com', 'fakemailgenerator.com', 'teleworm.com', 'dayrep.com', 'onewaymail.com', 'mobi.web.id', 'ag.us.to', 'gelitik.in', 'fixmail.tk', 'shitmail.org', 'crapmail.org', '1ce.us', 'big1.us', 'garliclife.com', 'irish2me.com', 'lifebyfood.com', 'lr7.us', 'lr78.com', 'luv2.us', 'soodomail.com', 'soodonims.com', 'winemaven.info',

In [62]:
# How we will test your implementation

# Example of an email that should be classified as spam by your code
example_spam = "URGENT! This email is being sent to all the employees with NYUAD email address. For security reasons, we are resetting the passwords of all the users. To verify your identity, please send us your NYUAD email address, and password to NYUAD-ADMIN@mailinator.net. Note that, except a few email addresses, you should not share your password with anyone else. More specifically, only trusted email addresses are NYUAD-ADMIN@mailinator.net, cybersecurity-awareness-training@bobmail.info, and nyuad-it@sogetthis.com."

# Example of an email that should be classified as not-spam by your code
example_non_spam = "Dear NYUAD community. This is to inform that we will have a scheduled workday maintainance on 28/3/2021. As a result, workday will be inaccessible on 28/3/2021. For any questions contact nyuad.it@nyu.edu. Best, NYUAD IT Team"

# Example of an email that is spam but should be classified as non-spam by your code
example_non_spam2 = "URGENT! I am a student at NYUAD, and I am seeking your help in transferring a sum of $100000 into your bank account. Please go to the following link: www.winbig.com"

# If your implementation is correct, the following lines should not raise an error
assert(True == is_spam(example_spam, blacklisted_domains))
assert(False == is_spam(example_non_spam, blacklisted_domains))
assert(False == is_spam(example_non_spam2, blacklisted_domains))

## Rubric

- +20 points for correctness (regex pattern returns the desired output)
- +5 points for conciseness (the code and the regex pattern are concise)

# Part II: You have just won $10000000 (25 points)

If you give it some thought, you will see that creating a perfect spam detector is a tough task. Modern spam detectors use machine learning-based algorithms. They are not perfect, though, and arguably never will be, simply because as spam detectors get better, so do the spammers.

For a long time, though, many email spam detectors were ensembles of *rule-based classifiers* like the one you just created using regex. These rules can be arbitrarily complex.
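To make the idea of an ensemble concrete, here is a small sketch. The two rules, the `RULES` list, and the `min_votes` threshold are all hypothetical choices made for illustration; a real detector would use different and more numerous rules.

```python
import re

# Hypothetical rules for illustration only; each takes the email body
# and returns True if the rule "fires".
def mentions_large_amount(email_str):
    # Fires on a '$' immediately followed by four or more digits.
    return re.search(r'\$\d{4,}', email_str) is not None

def has_urgent_opening(email_str):
    # Fires when the body opens with the word "urgent" (any letter case).
    return re.match(r'\s*urgent\b', email_str, flags=re.IGNORECASE) is not None

RULES = [mentions_large_amount, has_urgent_opening]

def ensemble_is_spam(email_str, rules=RULES, min_votes=1):
    # Count how many rules fire and compare against a vote threshold.
    votes = sum(rule(email_str) for rule in rules)
    return votes >= min_votes

print(ensemble_is_spam("URGENT! Claim your $100000 today."))  # True
print(ensemble_is_spam("Dear all, the meeting is at 5 pm."))   # False
```

Raising `min_votes` trades recall for precision: with `min_votes=2`, an email is flagged only when both rules fire.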

In this part, we want you to improve your implementation of the spam detector above. Many spam emails, like `example_non_spam2` above, mention a large amount of money that will allegedly be transferred to your bank account: the classic *foreign lottery scam*. In this task, write a function `is_lottery_spam(email_str)` that takes an email string as above and returns `True` if the email body contains an amount greater than \$100, and `False` otherwise.

Notes:

1. For simplicity, you can assume that the amount is always preceded by a `$` sign and does not contain any decimals or commas.

2. Your solution should be regex based, should not use iterations (loops, map, etc.), and should be no more than a couple of lines.

3. Note that your final regex solution should not match \$100; only amounts *greater* than \$100 should match. You can start forming your regex by relaxing that condition and then tightening it.
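One way to satisfy Notes 2 and 3 with no iteration at all is to let a negative look-ahead exclude exactly \$100. The following is a sketch under the assumptions of Note 1; the function name `is_lottery_spam_no_loop` is hypothetical, and this is one possible pattern rather than the only valid one.

```python
import re

def is_lottery_spam_no_loop(email_str):
    # \$            a literal dollar sign
    # 0*            optional leading zeros (so $0101 is read as 101)
    # (?!100(?!\d)) reject the case where the remaining digits are exactly 100
    # [1-9]\d{2,}   a number with at least three significant digits
    pattern = r'\$0*(?!100(?!\d))[1-9]\d{2,}'
    return re.search(pattern, email_str) is not None

print(is_lottery_spam_no_loop("a sum of $100 into"))   # False
print(is_lottery_spam_no_loop("a sum of $101 into"))   # True
print(is_lottery_spam_no_loop("a sum of $1000 into"))  # True
```

The inner `(?!\d)` matters: without it, the look-ahead would also reject \$1000 and every other amount that merely starts with the digits 100.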


In [63]:
# Write your code below this line

######### SOLUTION #########
def is_lottery_spam(email_str):
    # Use regex to find any amounts greater than $100 in the email string
    
    # Finding any strings with $ followed by 3 or more digits
    # i.e., $100, $101, $999 etc.
    amounts = re.findall(r'\$(\d{3,})', email_str)
    # Check if any of the amounts is greater than $100
    return any(int(amount) > 100 for amount in amounts)

# Note that even something like $010, etc should work with our implementation
# Since we cast all string amounts to integers before checking if they are greater than 100

######### SOLUTION END #########


In [64]:
'''
- If your implementation is correct, the following lines should not raise an error.
- As you may notice, we have created a simple ensemble here that checks whether the
email is spam using both of the detectors you have created.
'''

# Parentheses make the comparison apply to the combined (ensemble) result,
# since == binds more tightly than `or`
assert(True == (is_lottery_spam(example_spam) or is_spam(example_spam, blacklisted_domains)))
assert(False == (is_lottery_spam(example_non_spam) or is_spam(example_non_spam, blacklisted_domains)))
assert(True == (is_lottery_spam(example_non_spam2) or is_spam(example_non_spam2, blacklisted_domains)))

# Testing edge cases
example_non_spam3 = "URGENT! I am a student at NYUAD, and I am seeking your help in transferring a sum of 100 into your bank account. Please go to the following link: www.winbig.com"
example_non_spam4 = "URGENT! I am a student at NYUAD, and I am seeking your help in transferring a sum of $100 into your bank account. Please go to the following link: www.winbig.com"
example_non_spam5 = "URGENT! I am a student at NYUAD, and I am seeking your help in transferring a sum of $ 1010 into your bank account. Please go to the following link: www.winbig.com"
example_non_spam6 = "URGENT! I am a student at NYUAD, and I am seeking your help in transferring a sum of $000 into your bank account. Please go to the following link: www.winbig.com"
example_non_spam7 = "URGENT! I am a student at NYUAD, and I am seeking your help in transferring a sum of $99 into your bank account. Please go to the following link: www.winbig.com"
example_non_spam8 = "URGENT! I am a student at NYUAD, and I am seeking your help in transferring a sum of $11 into your bank account. Please go to the following link: www.winbig.com"

assert(False == is_lottery_spam(example_non_spam3))
assert(False == is_lottery_spam(example_non_spam4))
assert(False == is_lottery_spam(example_non_spam5))
assert(False == is_lottery_spam(example_non_spam6))
assert(False == is_lottery_spam(example_non_spam7))
assert(False == is_lottery_spam(example_non_spam8))

example_spam2 = "URGENT! I am a student at NYUAD, and I am seeking your help in transferring a sum of $1000 into your bank account. Please go to the following link: www.winbig.com"
example_spam3 = "URGENT! I am a student at NYUAD, and I am seeking your help in transferring a sum of $101 into your bank account. Please go to the following link: www.winbig.com"
example_spam4 = "URGENT! I am a student at NYUAD, and I am seeking your help in transferring a sum of $2010 into your bank account. Please go to the following link: www.winbig.com"

assert(True == is_lottery_spam(example_spam2))
assert(True == is_lottery_spam(example_spam3))
assert(True == is_lottery_spam(example_spam4))
print("All Passed")

All Passed


## Rubric

- +20 points for correctness (regex pattern returns the desired output)
- +5 points for conciseness (the code and the regex pattern are concise)