# Regular Expressions in NLP Assignment: Solutions and Analysis

Welcome to my NLP assignment notebook focused on regular expressions! 
In this project, I explore the power of regex patterns for solving various language-related tasks. Whether you're a fellow student, an aspiring NLP practitioner, or just curious, I hope you find this notebook informative.

## Background

Regular expressions (regex) are essential tools for text processing. They allow us to search, match, and manipulate strings based on specific patterns. In NLP, regex plays a crucial role in tasks like text cleaning, entity extraction, and validation.
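
As a minimal sketch of those three operations with Python's built-in `re` module (the sample sentence and variable names are made up for illustration):

```python
import re

sample = "Order 66 shipped on 2024-07-15 to lab 9."

# Search: locate the first run of digits
first_number = re.search(r'\d+', sample).group()

# Match: collect every run of digits
all_numbers = re.findall(r'\d+', sample)

# Manipulate: replace every run of digits with a placeholder
redacted = re.sub(r'\d+', '#', sample)

print(first_number)
print(all_numbers)
print(redacted)
```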


## Problem Statement

Our assignment revolves around solving NLP challenges using regular expressions. From detecting repeated words to validating email addresses, we'll dive into practical applications of regex.

##### Q#1: Write a regular expression function for the following. By “word”, we mean an alphabetic string separated from other words by whitespace, any relevant punctuation, line breaks, and so forth.

## 1 - Two Consecutive Repeated Words

In [1]:
import re

In [2]:
def find_repeated_words(text):
    # [A-Za-z]+ keeps a "word" alphabetic, as the question defines it; \1 must repeat it exactly
    pattern = r'\b([A-Za-z]+)\s+\1\b'
    return re.findall(pattern, text)

In [3]:
text = "Humbert Humbert and the the quick brown fox"
repeated_words=find_repeated_words(text)
print("Repeated words:", repeated_words)

Repeated words: ['Humbert', 'the']
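
One thing to watch: the backreference `\1` compares text exactly, so "The the" would be missed. A possible variant, assuming case should be ignored and a word is purely alphabetic (the function name `find_repeated_words_ci` is mine, not part of the assignment):

```python
import re

def find_repeated_words_ci(text):
    # re.IGNORECASE also applies to the backreference \1
    pattern = r'\b([A-Za-z]+)\s+\1\b'
    return re.findall(pattern, text, flags=re.IGNORECASE)

print(find_repeated_words_ci("The the cat sat sat."))
print(find_repeated_words_ci("the theory"))
```

The trailing `\b` is what keeps "the theory" from being reported as a repeat.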


## 2 - Starts with an Integer and Ends with a Word

In [4]:
def validate_start_end(text):
    # Leading integer, anything in between, then a final alphabetic word
    pattern = r'^\d+\b.*\b[A-Za-z]+$'
    return bool(re.match(pattern, text))

In [5]:
text = "42 is the answer"
is_valid = validate_start_end(text)
print("Valid:", is_valid)

Valid: True
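
Because `\w` matches digits as well as letters, a pattern that ends in `\w+` will also accept a string that ends in a number. A small sketch comparing a `\w`-based pattern with an alphabetic-only one (the names `LOOSE` and `STRICT` and the sample strings are just for illustration):

```python
import re

LOOSE = r'^\d.*\b(\w+)$'            # last token may be digits
STRICT = r'^\d+\b.*\b[A-Za-z]+$'    # last token must be alphabetic

samples = ["42 is the answer", "42 is 7", "answer is 42"]
results = {
    s: (bool(re.match(LOOSE, s)), bool(re.match(STRICT, s)))
    for s in samples
}
for s, (loose, strict) in results.items():
    print(f"{s!r}: loose={loose}, strict={strict}")
```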


## 3 - Contains Both "grotto" and "raven"

In [6]:
def find_grotto_raven_strings(text):
    # Both words must appear, in either order (grotto ... raven or raven ... grotto)
    pattern = r'\bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b'
    return re.findall(pattern, text)

In [7]:
text = "The raven flew into the grotto. Grottos are fascinating."
matching_strings = find_grotto_raven_strings(text)
print("Matching strings:", matching_strings)

Matching strings: ['raven flew into the grotto']
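
If only a yes/no answer is needed, two lookaheads are another way to say "contains both words, in any order". This is a sketch, not the required solution; the function name `contains_grotto_and_raven` is hypothetical, and `re.DOTALL` is my assumption so that the two words may sit on different lines:

```python
import re

def contains_grotto_and_raven(text):
    # Each lookahead scans from the start of the string without consuming it
    pattern = r'^(?=.*\bgrotto\b)(?=.*\braven\b)'
    return bool(re.search(pattern, text, flags=re.DOTALL))

print(contains_grotto_and_raven("The raven flew into the grotto."))
print(contains_grotto_and_raven("A raven met another raven."))
print(contains_grotto_and_raven("the grotto\nhid a raven"))
```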


## 4 - Valid Email Address

In [8]:
def validate_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

In [9]:
email_address = "w4s.tila@gmail.com"
is_valid_email = validate_email(email_address)
print("Valid email:", is_valid_email)

Valid email: True
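
Two details of this kind of pattern are easy to miss, so here is a quick check (the sample addresses are invented). Without `re.MULTILINE`, `$` also matches just before a trailing newline, so `re.match` accepts input ending in `\n`, while `re.fullmatch` does not. The character class for the local part also allows consecutive dots.

```python
import re

EMAIL_PATTERN = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

# Trailing newline: accepted by match, rejected by fullmatch
with_newline = "user@example.com\n"
match_accepts = bool(re.match(EMAIL_PATTERN, with_newline))
fullmatch_accepts = bool(re.fullmatch(EMAIL_PATTERN, with_newline))

# Consecutive dots in the local part pass, a missing top-level domain does not
double_dot = bool(re.fullmatch(EMAIL_PATTERN, "a..b@example.com"))
no_tld = bool(re.fullmatch(EMAIL_PATTERN, "user@example"))

print(match_accepts, fullmatch_accepts, double_dot, no_tld)
```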


## 5 - Pakistani Mobile Network Phone Numbers

In [10]:
def validate_pakistani_number(number):
    # +92 country code, then 3 (mobile prefix) and nine more digits
    pattern = r'^\+923\d{9}$'
    return bool(re.match(pattern, number))

In [11]:
phone_number = "+923001234567"
is_valid = validate_pakistani_number(phone_number)
print("Valid phone number:", is_valid)

Valid phone number: True
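
Numbers are often written in local form too. The sketch below is an assumption on my part rather than part of the assignment: it treats a mobile number as `+92` or a leading `0`, then a `3` plus two more digits for the network code, an optional hyphen, and seven subscriber digits. The helper name `normalize_pk_mobile` is hypothetical.

```python
import re

# Assumed layout: (+92 | 0) + 3XX network code + optional hyphen + 7 digits
MOBILE = re.compile(r'(?:\+92|0)(3\d{2})-?(\d{7})')

def normalize_pk_mobile(number):
    m = MOBILE.fullmatch(number.strip())
    if m is None:
        return None
    # Rebuild the number in international form
    return "+92" + m.group(1) + m.group(2)

for raw in ["+923001234567", "03001234567", "0300-1234567", "+921234567890"]:
    print(raw, "->", normalize_pk_mobile(raw))
```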


## 6 - Remove Symbols and Non-Alphanumeric Characters

In [12]:
def remove_symbols(text):
    return re.sub(r'[^\w\s]+', '', text)

In [13]:
text = "Hello, world! This is my first NLP assignment."
cleaned_text = remove_symbols(text)
print("Cleaned text:", cleaned_text)

Cleaned text: Hello world This is my first NLP assignment
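
Note that `\w` includes the underscore, so `[^\w\s]+` leaves underscores in place. A short comparison with an explicit ASCII class, assuming underscores should go too (the sample string is invented):

```python
import re

text = "snake_case & co."

# \w keeps letters, digits, and the underscore
keep_underscore = re.sub(r'[^\w\s]+', '', text)

# An explicit class removes the underscore as well
ascii_only = re.sub(r'[^A-Za-z0-9\s]+', '', text)

print(repr(keep_underscore))
print(repr(ascii_only))
```

The trade-off is that `\w` is Unicode-aware for `str` patterns, so the ASCII class would also strip accented letters.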


## 7 - Remove URLs and HTML Tags

In [35]:
def remove_urls_and_html(text):
    # Remove HTML tags first; otherwise a URL inside an href attribute
    # makes the URL pattern swallow the closing tag and the text after it
    cleaned_text = re.sub(r'<[^>]*>', '', text)
    # Remove URLs
    cleaned_text = re.sub(r'https?://\S+', '', cleaned_text)
    return cleaned_text

In [37]:
text = """
<p>This is an example <a href="https://example.com">HTML</a> text with <b>tags</b>.</p>
Visit our website at https://example.com for more information.
"""
cleaned_text = remove_urls_and_html(text)
print("Cleaned text:")
print(cleaned_text)

Cleaned text:

This is an example HTML text with tags.
Visit our website at  for more information.



## 8 - Find Acronyms (Uppercase Letters)

In [16]:
def find_acronyms(text):
    pattern = r'\b[A-Z]{2,}\b'
    return re.findall(pattern, text)

In [17]:
text = "NASA and FBI are well-known acronyms."
acronyms = find_acronyms(text)
print("Acronyms:", acronyms)

Acronyms: ['NASA', 'FBI']
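
Acronyms written with periods, such as "U.S.A.", are not covered by `[A-Z]{2,}`. One possible extension, offered as a sketch (the function name `find_acronyms_with_dots` is hypothetical):

```python
import re

def find_acronyms_with_dots(text):
    # Either two or more "letter + period" pairs, or a plain run of capitals
    pattern = r'\b(?:[A-Z]\.){2,}|\b[A-Z]{2,}\b'
    return re.findall(pattern, text)

print(find_acronyms_with_dots("The U.S.A. and NASA signed."))
```

A caveat of this approach: when a dotted acronym ends a sentence, the sentence's final period is absorbed into the match.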


## 9 - Mask Sensitive Info (Phone Numbers)

In [18]:
def mask_phone_numbers(text):
    return re.sub(r'\b\d{3}-\d{3}-\d{4}\b', 'XXX-XXX-XXXX', text)

In [19]:
text = "Call me at 123-456-7890."   
masked_text = mask_phone_numbers(text)
print("Masked text:", masked_text)

Masked text: Call me at XXX-XXX-XXXX.
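
Sometimes it helps to keep the last four digits visible. A capture group plus a backreference in the replacement string can do that; this is a sketch, and `mask_keep_last_four` is a name I made up:

```python
import re

def mask_keep_last_four(text):
    # Group 1 holds the last four digits; \1 puts them back in the replacement
    return re.sub(r'\b\d{3}-\d{3}-(\d{4})\b', r'XXX-XXX-\1', text)

print(mask_keep_last_four("Call me at 123-456-7890."))
```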


## 10 - Extract Dates (Various Formats)

In [25]:
def extract_dates(text):
    # Date formats: DD-MM-YYYY, MM/DD/YYYY, YYYY-MM-DD
    pattern = r'\b(?:\d{2}-\d{2}-\d{4}|\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2})\b'
    return re.findall(pattern, text)

In [26]:
text = "Meeting scheduled for 15-07-2024 and 2024-07-15. Also, 07/15/2024."
dates = extract_dates(text)
print("Extracted dates:", dates)

Extracted dates: ['15-07-2024', '2024-07-15', '07/15/2024']
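
The pattern only checks the shape of a date, so a string like 31-02-2024 is extracted too. A possible follow-up step, sketched here with `datetime.strptime` from the standard library, keeps only candidates that parse as real calendar dates. The function name `extract_valid_dates` and the sample text are my own:

```python
import re
from datetime import datetime

DATE_PATTERN = r'\b(?:\d{2}-\d{2}-\d{4}|\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2})\b'
# One strptime format per alternative: DD-MM-YYYY, MM/DD/YYYY, YYYY-MM-DD
FORMATS = ('%d-%m-%Y', '%m/%d/%Y', '%Y-%m-%d')

def extract_valid_dates(text):
    valid = []
    for candidate in re.findall(DATE_PATTERN, text):
        for fmt in FORMATS:
            try:
                datetime.strptime(candidate, fmt)
            except ValueError:
                continue  # wrong format or impossible date, try the next format
            valid.append(candidate)
            break
    return valid

print(extract_valid_dates("Due 31-02-2024, moved to 15-07-2024 or 2024-13-01."))
```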


## 11 - Extract Currency Amounts

In [33]:
def extract_currency_amounts(text):
    # Match an amount with a currency symbol before or after it (optional two-digit decimal part)
    pattern = r'[£$€]\s*\d+(?:\.\d{2})?|\d+(?:\.\d{2})?\s*[£$€]'
    return re.findall(pattern, text)

In [34]:
text = "Total cost: $100.00 and €50."
amounts = extract_currency_amounts(text)
print("Currency amounts:", amounts)

Currency amounts: ['$100.00', '€50']
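
Amounts with thousands separators, such as $1,250.50, get cut off at the comma by a plain `\d+`. A sketch of one way to allow them, assuming the symbol comes first and commas group three digits (the function name `extract_amounts_with_commas` is hypothetical):

```python
import re

def extract_amounts_with_commas(text):
    # Digits, then optional ",ddd" groups, then an optional two-digit decimal part
    pattern = r'[£$€]\s*\d+(?:,\d{3})*(?:\.\d{2})?'
    return re.findall(pattern, text)

sample = "Paid $1,250.50 plus €50."
plain = re.findall(r'[£$€]\s*\d+(?:\.\d{2})?', sample)
print(plain)
print(extract_amounts_with_commas(sample))
```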


## 12 - Find Capitalized Words

In [29]:
def find_capitalized_words(text):
    pattern = r'\b[A-Z][a-z]*\b'
    return re.findall(pattern, text)

In [30]:
text = "The Quick brown Fox Jumps over the Lazy Dog."
capitalized_words = find_capitalized_words(text)
print("Capitalized words:", capitalized_words)

Capitalized words: ['The', 'Quick', 'Fox', 'Jumps', 'Lazy', 'Dog']
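
Since `[a-z]*` allows only lowercase letters after the first capital, words with inner capitals (like "McDonald") and all-caps words are skipped. If those should count as capitalized too, one option is to allow any letter after the first; this is a sketch, and the name `find_capitalized_tokens` is mine:

```python
import re

def find_capitalized_tokens(text):
    # First letter uppercase, remaining letters in any case
    return re.findall(r'\b[A-Z][A-Za-z]*\b', text)

sample = "Dr McDonald joined NASA in Houston."
print(re.findall(r'\b[A-Z][a-z]*\b', sample))
print(find_capitalized_tokens(sample))
```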


## 13 - Find Repeated Consecutive Words

In [31]:
def find_repeated_consecutive_words(text):
    # Alphabetic word, whitespace, then the same word again
    pattern = r'\b([A-Za-z]+)\s+\1\b'
    return re.findall(pattern, text)

In [32]:
text = "The quick brown brown fox jumps over the lazy lazy dog."
repeated_words = find_repeated_consecutive_words(text)
print("Repeated consecutive words:", repeated_words)

Repeated consecutive words: ['brown', 'lazy']
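
Finding repeats is often a first step toward removing them. A minimal sketch using `re.sub`, where the whole run of repeats is replaced by a single copy of the word (the function name `collapse_repeats` is hypothetical):

```python
import re

def collapse_repeats(text):
    # (?:\s+\1\b)+ absorbs one or more repeats of the captured word
    return re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text)

print(collapse_repeats("The quick brown brown fox jumps over the lazy lazy dog."))
print(collapse_repeats("very very very good"))
```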


## Conclusion

In this NLP assignment, we explored the power of regular expressions (regex) for solving various language-related tasks. Here are our main takeaways:

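- A handful of regex patterns covered validation (emails, phone numbers), extraction (dates, currency amounts, acronyms), and cleaning (symbols, URLs, HTML tags).
- Task-specific patterns are essential for accurate results (e.g., detecting repeated words, validating phone numbers).

Overall, regex provides a solid foundation for NLP preprocessing, but it checks the form of a string, not its meaning: the date pattern above, for example, would accept 99-99-2024. That limitation is worth keeping in mind, and more advanced techniques are worth exploring in future projects.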

## Acknowledgments

Special thanks to my trainer, Miss Mahnoor Salman, for her valuable guidance and lessons during the classes. Her insights and feedback significantly contributed to the success of this assignment. I am an AI trainee at AtomCamp, and I appreciate the opportunity to learn and grow under her mentorship. Thank you, Ma'am!