# Basic NLP Course

## Regular Expression

This is the first class of the Basic NLP Course, where we will focus on understanding and applying Regular Expressions (Regex). Regex is a powerful tool for text processing and pattern matching, which forms the foundation for many Natural Language Processing (NLP) tasks.

A useful reference for testing patterns interactively: [Regex101](https://regex101.com/)

### Applications in Chemical and Process Engineering

Regular Expressions can be applied in various chemical and process engineering tasks, such as:

- **Data Cleaning and Preprocessing**: Extracting and formatting experimental data from unstructured text files or reports.
- **Parsing Chemical Formulas**: Identifying and validating chemical formulas or reaction equations from text.
- **Log File Analysis**: Analyzing process logs to detect anomalies or patterns in operational data.
- **Simulation Input Validation**: Ensuring that input files for simulation software adhere to specific formats.
- **Text Mining**: Extracting relevant information from research papers, patents, or technical documents.

Regex provides a systematic way to handle these tasks efficiently, saving time and reducing errors.
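
To make the formula-parsing item above concrete, here is a minimal sketch. It assumes flat formulas only (no parentheses, charges, or hydrates), and the helper name `parse_formula` is hypothetical, introduced just for this illustration.

```python
import re

# One element symbol (capital letter, optional lowercase letter)
# followed by an optional count.
ELEMENT_PATTERN = re.compile(r'([A-Z][a-z]?)(\d*)')

def parse_formula(formula):
    """Return {element: count} for a flat formula such as 'C6H12O6'."""
    counts = {}
    for symbol, number in ELEMENT_PATTERN.findall(formula):
        # An empty count string means a single atom, e.g. the 'O' in 'H2O'.
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts

print(parse_formula('C6H12O6'))  # {'C': 6, 'H': 12, 'O': 6}
print(parse_formula('NaCl'))     # {'Na': 1, 'Cl': 1}
print(parse_formula('CH3COOH'))  # {'C': 2, 'H': 4, 'O': 2}
```

The pattern does not validate that a symbol is a real element; a check against a list of known symbols would be needed for that.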

In [1]:
import re

In [2]:
# Examples of noisy, unstructured texts; some contain phone numbers, some do not
noisy_texts = [
    "C@ll me @t 1234567890 wh3n you g3t th!s m3ss@ge!!",
    "H3y, my n3w numb3r is 9876543210. S@ve !t, plz.",
    "Th3 ev3nt st@rts @t 6 PM. C0nt@ct 5556667777 4 d3tails.",
    "R@nd0m t3xt w!th a numb3r 1112223333 in b3tw33n.",
    "Re@ch 0ut t0 supp0rt @t 4445556666 f0r @ssist@nc3.",
    "Call me at 1231231234 when you get this message.",
    "Hey, my new number is 3213214321. Save it, please.",
    "The quick brown fox jumps over the lazy dog. No numbers here!",
    "Contact us at support@example.com for more information.",
    "Meeting rescheduled to 3 PM tomorrow. No phone number included."
]

# Display the noisy texts
for text in noisy_texts:
    print(text)

C@ll me @t 1234567890 wh3n you g3t th!s m3ss@ge!!
H3y, my n3w numb3r is 9876543210. S@ve !t, plz.
Th3 ev3nt st@rts @t 6 PM. C0nt@ct 5556667777 4 d3tails.
R@nd0m t3xt w!th a numb3r 1112223333 in b3tw33n.
Re@ch 0ut t0 supp0rt @t 4445556666 f0r @ssist@nc3.
Call me at 1231231234 when you get this message.
Hey, my new number is 3213214321. Save it, please.
The quick brown fox jumps over the lazy dog. No numbers here!
Contact us at support@example.com for more information.
Meeting rescheduled to 3 PM tomorrow. No phone number included.


In [None]:
# Let's identify the phone numbers: ten digits in a row
# (a raw string keeps Python from interpreting the backslashes)
pattern = r'\d\d\d\d\d\d\d\d\d\d'

for text in noisy_texts:
    matches = re.findall(pattern, text)

    print('Chat:', text)
    for match in matches:
        print('Phones: ', match)

Chat: C@ll me @t 1234567890 wh3n you g3t th!s m3ss@ge!!
Phones:  1234567890
Chat: H3y, my n3w numb3r is 9876543210. S@ve !t, plz.
Phones:  9876543210
Chat: Th3 ev3nt st@rts @t 6 PM. C0nt@ct 5556667777 4 d3tails.
Phones:  5556667777
Chat: R@nd0m t3xt w!th a numb3r 1112223333 in b3tw33n.
Phones:  1112223333
Chat: Re@ch 0ut t0 supp0rt @t 4445556666 f0r @ssist@nc3.
Phones:  4445556666
Chat: Call me at 1231231234 when you get this message.
Phones:  1231231234
Chat: Hey, my new number is 3213214321. Save it, please.
Phones:  3213214321
Chat: The quick brown fox jumps over the lazy dog. No numbers here!
Chat: Contact us at support@example.com for more information.
Chat: Meeting rescheduled to 3 PM tomorrow. No phone number included.


In [11]:
# An equivalent, shorter pattern: {10} repeats \d exactly ten times
pattern = r'\d{10}'

for text in noisy_texts:
    matches = re.findall(pattern, text)

    print('Chat:', text)
    for match in matches:
        print('Phones: ', match)

Chat: C@ll me @t 1234567890 wh3n you g3t th!s m3ss@ge!!
Phones:  1234567890
Chat: H3y, my n3w numb3r is 9876543210. S@ve !t, plz.
Phones:  9876543210
Chat: Th3 ev3nt st@rts @t 6 PM. C0nt@ct 5556667777 4 d3tails.
Phones:  5556667777
Chat: R@nd0m t3xt w!th a numb3r 1112223333 in b3tw33n.
Phones:  1112223333
Chat: Re@ch 0ut t0 supp0rt @t 4445556666 f0r @ssist@nc3.
Phones:  4445556666
Chat: Call me at 1231231234 when you get this message.
Phones:  1231231234
Chat: Hey, my new number is 3213214321. Save it, please.
Phones:  3213214321
Chat: The quick brown fox jumps over the lazy dog. No numbers here!
Chat: Contact us at support@example.com for more information.
Chat: Meeting rescheduled to 3 PM tomorrow. No phone number included.
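
One caveat with `\d{10}` is that it also matches the first ten digits of a longer digit run. The sketch below shows one possible way to guard against that with lookarounds; the second sample sentence and the names `loose` and `strict` are made up for this illustration.

```python
import re

samples = [
    "Call me at 1231231234 when you get this message.",
    "Order ID 123456789012345 shipped today.",  # 15 digits, not a phone number
]

loose = r'\d{10}'
# The lookarounds require that no digit sits directly before or after the ten digits.
strict = r'(?<!\d)\d{10}(?!\d)'

for text in samples:
    print('Chat:', text)
    print('  loose: ', re.findall(loose, text))
    print('  strict:', re.findall(strict, text))
```

For the second sample, the loose pattern returns `['1234567890']` while the strict one returns an empty list.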


In [16]:
# First step towards the (XXX)-XXXXXXX format: match only the area code, (XXX)
# \( and \) match literal parentheses
pattern = r'\(\d{3}\)'

# Example texts with (XXX)-XXXXXXX format
formatted_texts = [
    "Call me at (123)-4567890 when you get this message.",
    "Hey, my new number is (987)-6543210. Save it, please.",
    "The event starts at 6 PM. Contact (555)-6667777 for details.",
    "Random text with a number (111)-2223333 in between.",
    "Reach out to support at (444)-5556666 for assistance.",
    "No phone number here, just some random text.",
    "Another example: (321)-3214321."
]

# Display matches for the area-code pattern
for text in formatted_texts:
    matches = re.findall(pattern, text)
    print('Chat:', text)
    for match in matches:
        print('Phones: ', match)

Chat: Call me at (123)-4567890 when you get this message.
Phones:  (123)
Chat: Hey, my new number is (987)-6543210. Save it, please.
Phones:  (987)
Chat: The event starts at 6 PM. Contact (555)-6667777 for details.
Phones:  (555)
Chat: Random text with a number (111)-2223333 in between.
Phones:  (111)
Chat: Reach out to support at (444)-5556666 for assistance.
Phones:  (444)
Chat: No phone number here, just some random text.
Chat: Another example: (321)-3214321.
Phones:  (321)


In [17]:
# Pattern for phone numbers in the format (XXX)-XXXXXXX
pattern = r'\(\d{3}\)-\d{7}'

# Example texts with (XXX)-XXXXXXX format
formatted_texts = [
    "Call me at (123)-4567890 when you get this message.",
    "Hey, my new number is (987)-6543210. Save it, please.",
    "The event starts at 6 PM. Contact (555)-6667777 for details.",
    "Random text with a number (111)-2223333 in between.",
    "Reach out to support at (444)-5556666 for assistance.",
    "No phone number here, just some random text.",
    "Another example: (321)-3214321."
]

# Display matches for the (XXX)-XXXXXXX pattern
for text in formatted_texts:
    matches = re.findall(pattern, text)
    print('Chat:', text)
    for match in matches:
        print('Phones: ', match)

Chat: Call me at (123)-4567890 when you get this message.
Phones:  (123)-4567890
Chat: Hey, my new number is (987)-6543210. Save it, please.
Phones:  (987)-6543210
Chat: The event starts at 6 PM. Contact (555)-6667777 for details.
Phones:  (555)-6667777
Chat: Random text with a number (111)-2223333 in between.
Phones:  (111)-2223333
Chat: Reach out to support at (444)-5556666 for assistance.
Phones:  (444)-5556666
Chat: No phone number here, just some random text.
Chat: Another example: (321)-3214321.
Phones:  (321)-3214321
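
If the area code and the rest of the number are needed separately, capturing groups can be added. This is a small sketch under the same (XXX)-XXXXXXX assumption; the names `grouped_pattern`, `area_code`, and `subscriber` are illustrative only.

```python
import re

# Unescaped parentheses create capturing groups;
# the escaped \( and \) still match the literal brackets.
grouped_pattern = r'\((\d{3})\)-(\d{7})'

text = "Call me at (123)-4567890 or (987)-6543210."

# With two groups, re.findall returns one (group1, group2) tuple per match.
for area_code, subscriber in re.findall(grouped_pattern, text):
    print('Area code:', area_code, '| Number:', subscriber)
```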


In [19]:
# Combine both lists
combined_texts = noisy_texts + formatted_texts

# Pattern to match both plain 10-digit numbers and (XXX)-XXXXXXX format
pattern1 = r'\d{10}'
pattern2 = r'\(\d{3}\)-\d{7}'
pattern = f"({pattern1}|{pattern2})"

# | is the OR operator: match either alternative
# Display matches for the combined pattern
for text in combined_texts:
    matches = re.findall(pattern, text)
    print('Chat:', text)
    for match in matches:
        print('Phones: ', match)

Chat: C@ll me @t 1234567890 wh3n you g3t th!s m3ss@ge!!
Phones:  1234567890
Chat: H3y, my n3w numb3r is 9876543210. S@ve !t, plz.
Phones:  9876543210
Chat: Th3 ev3nt st@rts @t 6 PM. C0nt@ct 5556667777 4 d3tails.
Phones:  5556667777
Chat: R@nd0m t3xt w!th a numb3r 1112223333 in b3tw33n.
Phones:  1112223333
Chat: Re@ch 0ut t0 supp0rt @t 4445556666 f0r @ssist@nc3.
Phones:  4445556666
Chat: Call me at 1231231234 when you get this message.
Phones:  1231231234
Chat: Hey, my new number is 3213214321. Save it, please.
Phones:  3213214321
Chat: The quick brown fox jumps over the lazy dog. No numbers here!
Chat: Contact us at support@example.com for more information.
Chat: Meeting rescheduled to 3 PM tomorrow. No phone number included.
Chat: Call me at (123)-4567890 when you get this message.
Phones:  (123)-4567890
Chat: Hey, my new number is (987)-6543210. Save it, please.
Phones:  (987)-6543210
Chat: The event starts at 6 PM. Contact (555)-6667777 for details.
Phones:  (555)-6667777
Chat: Rand
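
In the combined pattern, the outer parentheses form a single capturing group, so `re.findall` returns the content of that group, which here happens to be the whole match. If further groups were added, `re.findall` would return tuples instead. A possible alternative, sketched below with the hypothetical names `plain`, `bracketed`, and `phone_re`, is a non-capturing group together with a precompiled pattern.

```python
import re

plain = r'\d{10}'
bracketed = r'\(\d{3}\)-\d{7}'

# (?:...) groups the alternatives without creating a capture group,
# so findall returns the full match as a plain string.
phone_re = re.compile(f"(?:{plain}|{bracketed})")

text = "Old number 1231231234, new number (321)-3214321."
print(phone_re.findall(text))  # ['1231231234', '(321)-3214321']
```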

In [21]:
# Example texts with email addresses
email_texts = [
    "Contact us at support@example.com for more information.",
    "Send your feedback to feedback@company.org.",
    "Reach out to john.doe123@university.edu for academic queries.",
    "No email here, just some random text.",
    "Another example: jane-doe@sub.domain.co.uk."
]

# First attempt: zero or more lowercase letters followed by '@'
email_pattern1 = r'[a-z]*@'

def extract_matches(texts, pattern):
    """
    Extracts and prints matches from a list of texts based on the provided regex pattern.

    Args:
        texts (list): List of texts to search for matches.
        pattern (str): Regular expression pattern to match.
    """
    for text in texts:
        print('Chat:', text)
        matches = re.findall(pattern, text)
        for match in matches:
            print('Match:', match)

# Call the function with the first pattern
extract_matches(email_texts, email_pattern1)

Chat: Contact us at support@example.com for more information.
Match: support@
Chat: Send your feedback to feedback@company.org.
Match: feedback@
Chat: Reach out to john.doe123@university.edu for academic queries.
Match: @
Chat: No email here, just some random text.
Chat: Another example: jane-doe@sub.domain.co.uk.
Match: doe@
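
The lone `@` match for the third text comes from the `*` quantifier, which also accepts zero characters: the character right before `@` is the digit `3`, which `[a-z]` does not cover. The short sketch below contrasts `*` with `+` (one or more); the variable names are illustrative.

```python
import re

text = "Reach out to john.doe123@university.edu for academic queries."

star_matches = re.findall(r'[a-z]*@', text)  # '*' allows zero letters before '@'
plus_matches = re.findall(r'[a-z]+@', text)  # '+' needs at least one letter before '@'

print(star_matches)  # ['@']
print(plus_matches)  # []
```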


In [23]:
# Let's improve the pattern a little: allow uppercase letters too
email_pattern2 = r'[a-zA-Z]*@'

extract_matches(email_texts, email_pattern2)

Chat: Contact us at support@example.com for more information.
Match: support@
Chat: Send your feedback to feedback@company.org.
Match: feedback@
Chat: Reach out to john.doe123@university.edu for academic queries.
Match: @
Chat: No email here, just some random text.
Chat: Another example: jane-doe@sub.domain.co.uk.
Match: doe@


In [24]:
# Improve it a little more: allow digits as well
email_pattern3 = r'[a-zA-Z0-9]*@'

extract_matches(email_texts, email_pattern3)

Chat: Contact us at support@example.com for more information.
Match: support@
Chat: Send your feedback to feedback@company.org.
Match: feedback@
Chat: Reach out to john.doe123@university.edu for academic queries.
Match: doe123@
Chat: No email here, just some random text.
Chat: Another example: jane-doe@sub.domain.co.uk.
Match: doe@


In [25]:
# Also allow dots and underscores
email_pattern4 = r'[a-zA-Z0-9._]*@'

extract_matches(email_texts, email_pattern4)

Chat: Contact us at support@example.com for more information.
Match: support@
Chat: Send your feedback to feedback@company.org.
Match: feedback@
Chat: Reach out to john.doe123@university.edu for academic queries.
Match: john.doe123@
Chat: No email here, just some random text.
Chat: Another example: jane-doe@sub.domain.co.uk.
Match: doe@


In [30]:
# Also allow %, + and - in the local part
email_pattern5 = r'[a-zA-Z0-9._\%\+\-]*@'
extract_matches(email_texts, email_pattern5)

Chat: Contact us at support@example.com for more information.
Match: support@
Chat: Send your feedback to feedback@company.org.
Match: feedback@
Chat: Reach out to john.doe123@university.edu for academic queries.
Match: john.doe123@
Chat: No email here, just some random text.
Chat: Another example: jane-doe@sub.domain.co.uk.
Match: jane-doe@
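
Inside a character class, `%` and `+` have no special meaning, and `-` is literal when it is placed last, so the backslashes are optional there. A quick check, using the hypothetical names `escaped` and `unescaped`:

```python
import re

escaped = r'[a-zA-Z0-9._\%\+\-]*@'
unescaped = r'[a-zA-Z0-9._%+-]*@'  # '-' is placed last, so it is literal

text = "Another example: jane-doe@sub.domain.co.uk."

print(re.findall(escaped, text))    # ['jane-doe@']
print(re.findall(unescaped, text))  # ['jane-doe@']
```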


In [33]:
# Let's capture the domain too
# Note: the final [a-zA-Z] matches only ONE letter of the top-level domain
email_pattern6 = r'[a-zA-Z0-9._%+-]*@[a-zA-Z0-9.-]*\.[a-zA-Z]'
extract_matches(email_texts, email_pattern6)

Chat: Contact us at support@example.com for more information.
Match: support@example.c
Chat: Send your feedback to feedback@company.org.
Match: feedback@company.o
Chat: Reach out to john.doe123@university.edu for academic queries.
Match: john.doe123@university.e
Chat: No email here, just some random text.
Chat: Another example: jane-doe@sub.domain.co.uk.
Match: jane-doe@sub.domain.co.u
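
The matches above stop after the first letter of the top-level domain because `[a-zA-Z]` has no quantifier. One possible next step, sketched here under the name `email_pattern7` (not part of the original notebook), adds `{2,}` and swaps `*` for `+` so that empty local parts and domains are rejected. It is a practical approximation, not a full validator for every legal email address.

```python
import re

email_texts = [
    "Contact us at support@example.com for more information.",
    "Send your feedback to feedback@company.org.",
    "Reach out to john.doe123@university.edu for academic queries.",
    "No email here, just some random text.",
    "Another example: jane-doe@sub.domain.co.uk.",
]

# '+' requires at least one character; '{2,}' requires two or more letters in the TLD.
email_pattern7 = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

results = [re.findall(email_pattern7, text) for text in email_texts]
for text, matches in zip(email_texts, results):
    print('Chat:', text)
    for match in matches:
        print('Match:', match)
```

With this pattern the trailing sentence period is left out, because the match must end in letters.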
