# Economic Python: Enhanced CS50 Introduction to Programming
## **Lecture 7: Regular Expressions**

Welcome to our exploration of Regular Expressions (regex)! This notebook is based on the seventh lecture of CS50's Introduction to Programming with Python, taught by David J. Malan. We'll dive into the powerful world of pattern matching in Python.

### **Why Regular Expressions Matter for Economists**
In economics, you often need to:
- **Extract Economic Data:** Parse economic reports to extract specific figures, dates, and indicators
- **Process Survey Data:** Clean and standardize text responses from economic surveys
- **Analyze Financial Documents:** Extract information from financial statements and reports
- **Filter Economic News:** Identify relevant economic news from large text corpora
- **Validate Economic Inputs:** Ensure user input follows expected economic data formats

Regular expressions provide a powerful way to search, extract, and manipulate text data, making them invaluable for economic text analysis.

### **Table of Contents**
1.  [Introduction to Regular Expressions](#section-1)
2.  [Basic Regex Patterns](#section-2)
3.  [Email Validation with Regex](#section-3)
4.  [Regex Special Characters](#section-4)
5.  [Anchors and Boundaries](#section-5)
6.  [Character Classes](#section-6)
7.  [Regex Flags](#section-7)
8.  [Extracting Data with Regex](#section-8)
9.  [Problem Set: IPv4 Validation](#section-9)
10. [Problem Set: YouTube URL Parsing](#section-10)
11. [Problem Set: Time Conversion](#section-11)
12. [Problem Set: Word Counting](#section-12)
13. [Problem Set: Email Validation](#section-13)

<a id='section-1'></a>
## 1. Introduction to Regular Expressions

A **regular expression** (regex) is a pattern that specifies a set of strings. In Python, we use the `re` module to work with regular expressions. Regular expressions are incredibly useful for:
- Validating input (like email addresses)
- Extracting specific data from text
- Replacing parts of strings
- Splitting strings based on patterns
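
As a rough sketch, each of the four tasks above maps onto one function of the `re` module. The sample string and variable names below are invented for illustration and are not part of the lecture.

```python
import re

# Hypothetical sample line, invented for illustration
line = "GDP: 6.5%; Inflation: 7.2%; Unemployment: 4.8%"

# Validating: does the whole string look like a percentage?
is_percentage = bool(re.fullmatch(r"\d+(?:\.\d+)?%", "6.5%"))

# Extracting: pull out every percentage figure
figures = re.findall(r"\d+\.\d+%", line)

# Replacing: mask every decimal number with a placeholder
masked = re.sub(r"\d+\.\d+", "X", line)

# Splitting: break the line on semicolons plus optional whitespace
parts = re.split(r";\s*", line)

print(is_percentage)
print(figures)
print(masked)
print(parts)
```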

#### Economic Context
In economics, regular expressions are like specialized search tools for text data. Just as economists use statistical methods to identify patterns in numerical data, regular expressions help identify patterns in text data. For example, you might use regex to:

- Extract all monetary values from an economic report
- Find all dates in a specific format within a document
- Identify and standardize country names in a dataset
- Extract economic indicators from news articles

Let's start by importing the `re` module:

In [None]:
import re

<a id='section-2'></a>
## 2. Basic Regex Patterns

Let's start with some basic patterns. The simplest regex is just a literal string match:

In [None]:
# Simple literal match
text = "Hello, my name is Siddiqur Rahman. I am an Economics graduate."
pattern = "Siddiqur"

match = re.search(pattern, text)
if match:
    print(f"Found '{match.group()}' in the text")
else:
    print("Pattern not found")

#### Economic Application
In economic analysis, you might need to find specific economic terms or indicators in large documents. Let's see how we can search for economic terms in a text.

In [None]:
# Simple literal match in economic context
economic_report = """
Bangladesh Economic Review 2023

The GDP of Bangladesh has shown remarkable growth over the past decade.
Inflation rates have remained relatively stable, with minor fluctuations.
The unemployment rate has decreased due to various government initiatives.
Foreign direct investment has increased significantly in the technology sector.
"""

# Search for economic terms
pattern = "GDP"
match = re.search(pattern, economic_report)
if match:
    print(f"Found '{match.group()}' in the economic report")
else:
    print("Pattern not found")

# Find all occurrences of a term
pattern = "inflation"
matches = re.findall(pattern, economic_report, re.IGNORECASE)  # Case-insensitive search
print(f"Found 'inflation' {len(matches)} times: {matches}")

Let's try a more complex example with monetary values:

In [None]:
# Find monetary values in economic text
financial_text = """
The national budget allocated $2.5 trillion for infrastructure development.
Education received an additional $450 billion in funding.
The healthcare sector was granted $320.5 million for new initiatives.
Foreign aid amounted to $75 million from various international partners.
"""

# Pattern to match dollar amounts and capture the scale word that follows
pattern = r"\$([\d,]+(?:\.\d+)?) (trillion|billion|million)"
matches = re.findall(pattern, financial_text)
print(f"Monetary values found: {matches}")

# Convert each amount to plain dollars so the values can be added together
scale = {"trillion": 1e12, "billion": 1e9, "million": 1e6}
values = []
for amount, unit in matches:
    # Remove commas, convert to float, then apply the scale
    value = float(amount.replace(',', '')) * scale[unit]
    values.append(value)

print(f"Extracted values: {values}")
print(f"Total funding: ${sum(values):,.2f}")

<a id='section-3'></a>
## 3. Email Validation with Regex

Let's explore how we can validate email addresses using regular expressions. We'll start with a simple approach and gradually improve it.

In [None]:
# Simple email validation - just checking for @ symbol
def validate_email_simple(email):
    if "@" in email:
        return True
    return False

# Test the function
print(validate_email_simple("siddiqur@example.com"))  # True
print(validate_email_simple("siddiqur@example"))     # True (no top-level domain, yet accepted!)
print(validate_email_simple("@example.com"))         # True (but not a valid email!)

This simple approach has clear limitations: it accepts any string containing `@`, including `siddiqur@example`, `@example.com`, and even a bare `@`. Let's improve it:

In [None]:
# Better email validation - checking for both @ and .
def validate_email_better(email):
    if "@" in email and "." in email:
        return True
    return False

# Test the function
print(validate_email_better("siddiqur@example.com"))  # True
print(validate_email_better("siddiqur@example"))       # False
print(validate_email_better("@example.com"))          # True (still not perfect!)

Let's use the `re` module to create a more robust email validator:

In [None]:
# Using regex for email validation
def validate_email_regex(email):
    # Pattern: one or more characters, then @, then one or more characters, then ., then more characters
    pattern = r".+@.+\..+"
    if re.search(pattern, email):
        return True
    return False

# Test the function
print(validate_email_regex("siddiqur@example.com"))  # True
print(validate_email_regex("siddiqur@example"))       # False
print(validate_email_regex("@example.com"))          # False (.+ requires at least one character before the @)
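
Because `.` matches almost any character and `re.search` looks anywhere in the string, the pattern `.+@.+\..+` still accepts inputs with spaces or several `@` signs. One possible tightening, sketched here under the assumption that a simple "no whitespace, exactly one `@`" rule is good enough, uses a negated character class with `re.fullmatch`. The function name `validate_email_stricter` is hypothetical, and the pattern is still far from a complete email grammar.

```python
import re

def validate_email_stricter(email):
    # [^@\s]+ means "one or more characters that are neither @ nor whitespace"
    pattern = r"[^@\s]+@[^@\s]+\.[^@\s]+"
    # fullmatch requires the entire string to fit the pattern
    return re.fullmatch(pattern, email) is not None

for candidate in ["siddiqur@example.com", "a@b@c.com", "my email is a@b.com"]:
    print(candidate, validate_email_stricter(candidate))
```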

#### Economic Context
In economic research, you might collect contact information for economists, institutions, or survey respondents. Validating email addresses ensures you can communicate with your research participants or collaborators.

In [None]:
# Simple email validation - just checking for @ symbol
def validate_email_simple(email):
    if "@" in email:
        return True
    return False

# Test the function with economic institution emails
emails = [
    "siddiqur@econ.ju.edu",
    "worldbank@economics.org",
    "imf@monetary.fund",
    "invalid-email",
    "@missingdomain.com"
]

print("Simple email validation:")
for email in emails:
    print(f"{email}: {validate_email_simple(email)}")

This simple approach has limitations: it accepts `@missingdomain.com`, which has no username at all. Let's improve it:

In [None]:
# Better email validation - checking for both @ and .
def validate_email_better(email):
    if "@" in email and "." in email:
        return True
    return False

print("\nBetter email validation:")
for email in emails:
    print(f"{email}: {validate_email_better(email)}")

Let's use the `re` module to create a more robust email validator:

In [None]:
# Using regex for email validation
def validate_email_regex(email):
    # Pattern: one or more characters, then @, then one or more characters, then ., then more characters
    pattern = r".+@.+\..+"
    if re.search(pattern, email):
        return True
    return False

print("\nRegex email validation:")
for email in emails:
    print(f"{email}: {validate_email_regex(email)}")

Even this improved version has limitations. Let's create a more sophisticated pattern specifically for academic and institutional emails:

In [None]:
# More sophisticated email validation for academic/institutional emails
def validate_academic_email(email):
    # Pattern for academic emails: name@department.institution.domain
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.(edu|org|gov|ac\.[a-zA-Z]{2})$"
    if re.search(pattern, email):
        return True
    return False

academic_emails = [
    "siddiqur@econ.ju.edu",  # Valid
    "research@worldbank.org",  # Valid
    "analyst@imf.gov",  # Valid
    "student@econ.ac.bd",  # Valid
    "professor@university.edu",  # Valid
    "spam@gmail.com",  # Invalid (not academic)
    "invalid@domain",  # Invalid
]

print("\nAcademic email validation:")
for email in academic_emails:
    print(f"{email}: {validate_academic_email(email)}")

<a id='section-4'></a>
## 4. Regex Special Characters

Regular expressions use special characters to create more flexible patterns:

- `.` - Matches any character except newline
- `*` - Matches 0 or more repetitions of the preceding pattern
- `+` - Matches 1 or more repetitions of the preceding pattern
- `?` - Matches 0 or 1 repetition of the preceding pattern
- `{m}` - Matches exactly m repetitions
- `{m,n}` - Matches between m and n repetitions
- `\` - Escape character, used to match special characters literally

Let's see these in action:

In [None]:
# Demonstrating special characters
text = "The rain in Spain falls mainly in the plain."

# . matches any character
print(re.findall(r"r..n", text))  # ['rain'] (the only 'r' in the text is in "rain")

# * matches 0 or more repetitions
print(re.findall(r"ab*c", "ac abc abbbc abbbbbc"))  # ['ac', 'abc', 'abbbc', 'abbbbbc']

# + matches 1 or more repetitions
print(re.findall(r"ab+c", "ac abc abbbc abbbbbc"))  # ['abc', 'abbbc', 'abbbbbc']

# ? matches 0 or 1 repetition
print(re.findall(r"ab?c", "ac abc abbbc abbbbbc"))  # ['ac', 'abc']
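
The counted quantifiers `{m}` and `{m,n}` and the escape character `\` from the list above can be sketched in the same way. The sample text is made up for illustration.

```python
import re

# Hypothetical sample text, invented for illustration
text = "Codes: A1, A12, A123, A1234. Price: 3.50 (was 3x50)."

# {m}: the letter A followed by exactly three digits
exactly_three = re.findall(r"A\d{3}\b", text)

# {m,n}: the letter A followed by two or three digits
two_to_three = re.findall(r"A\d{2,3}\b", text)

# An unescaped . matches any character; \. matches only a literal dot
any_char = re.findall(r"3.50", text)
literal_dot = re.findall(r"3\.50", text)

print(exactly_three)
print(two_to_three)
print(any_char)
print(literal_dot)
```

The trailing `\b` stops `A\d{3}` from matching the first four characters of `A1234`.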

#### Economic Application
These special characters are particularly useful when working with economic data that might have variations in formatting. For example, monetary values can be written in different ways, and regex special characters help you match all these variations.

In [None]:
# Demonstrating special characters with economic data
economic_text = """
The GDP growth rate was 6.5% in 2022.
Inflation reached 7.2% in the same year.
Unemployment stood at 4.8%.
The interest rate was adjusted to 6.75%.
Trade surplus was recorded at 2.1% of GDP.
"""

# . matches any character
print(re.findall(r"\d.\%", economic_text))  # Digit, any ONE character, %: only ['75%'] (from 6.75%) fits

# * matches 0 or more repetitions
print(re.findall(r"\d*%", economic_text))  # Matches 0 or more digits followed by %

# + matches 1 or more repetitions
print(re.findall(r"\d+%", economic_text))  # Matches 1 or more digits followed by %

# ? matches 0 or 1 repetition
print(re.findall(r"\d\.?\d%", economic_text))  # Matches digit, optional decimal, digit, %

Let's look at a more complex example with monetary values:

In [None]:
# More complex monetary value patterns
financial_report = """
The national budget: $2,500,000,000.00
Education allocation: $450,000,000.50
Healthcare funding: $320,500,000
Foreign aid: $75,000,000
Emergency fund: $5,000,000.99
"""

# Pattern to match monetary values with optional cents
pattern = r"\$[\d,]+(?:\.\d{2})?"  # (?:...) creates a non-capturing group
matches = re.findall(pattern, financial_report)
print(f"Monetary values found: {matches}")

# Using alternation and character classes
# Match years between 2000 and 2023
year_pattern = r"\b20(?:[01]\d|2[0-3])\b"  # 2000-2019 or 2020-2023
text_with_years = "Data from 1998, 2005, 2010, 2015, 2020, 2023, 2025"
years = re.findall(year_pattern, text_with_years)
print(f"Years found: {years}")

<a id='section-5'></a>
## 5. Anchors and Boundaries

Anchors are used to match positions in the text rather than characters:

- `^` - Matches the start of the string
- `$` - Matches the end of the string
- `\b` - Matches a word boundary
- `\B` - Matches a non-word boundary

Let's see how these work:

In [None]:
# Demonstrating anchors
text = "Python is powerful. Python is easy to learn."

# ^ matches start of string
print(re.search(r"^Python", text))  # <re.Match object...>
print(re.search(r"^powerful", text))  # None

# $ matches end of string
print(re.search(r"learn.$", text))  # <re.Match object...>
print(re.search(r"Python.$", text))  # None

# \b matches word boundary
print(re.findall(r"\bPython\b", text))  # ['Python', 'Python']
print(re.findall(r"\bPy\w+\b", text))  # ['Python', 'Python']
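
The cell above does not exercise `\B`. A small sketch of how it differs from `\b`, using a made-up sentence:

```python
import re

# Hypothetical sample sentence, invented for illustration
text = "The cat scattered catalog pages; concatenate them."

# \bcat\b: "cat" as a whole word only
whole_word = re.findall(r"\bcat\b", text)

# \Bcat\B: "cat" strictly inside a longer word (word characters on both sides)
inside_word = re.findall(r"\Bcat\B", text)

# \bcat\w*: any word that starts with "cat"
starts_with = re.findall(r"\bcat\w*", text)

print(whole_word)
print(inside_word)
print(starts_with)
```

Here `\Bcat\B` finds the "cat" inside "scattered" and "concatenate" but skips both the standalone word and "catalog", where "cat" sits at a word boundary.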

#### Economic Application
Anchors are particularly useful when you need to validate economic data that must appear in specific positions or formats. For example, validating that an economic report starts with a specific header or that a monetary value appears at the end of a line.

In [None]:
# Demonstrating anchors with economic text
report = "GDP Growth Report: 6.5%\nInflation Rate: 7.2%\nUnemployment: 4.8%"

# ^ matches start of string
print(re.search(r"^GDP", report))  # <re.Match object...>
print(re.search(r"^Inflation", report))  # None

# $ matches end of string
print(re.search(r"4\.8%$", report))  # <re.Match object...>
print(re.search(r"GDP.*$", report))  # None: without re.MULTILINE, $ only matches at the end of the whole string

# \b matches word boundary (no \b after %, because % is not a word character)
print(re.findall(r"\b\d+\.\d+%", report))  # ['6.5%', '7.2%', '4.8%']

# Multiline example
multiline_report = """GDP Growth Report
Bangladesh Economic Review 2023
GDP Growth: 6.5%
Inflation Rate: 7.2%
Unemployment: 4.8%
GDP Growth: 6.5%
End of Report"""

# With re.MULTILINE flag, ^ and $ match start/end of each line
print(re.findall(r"^GDP.*$", multiline_report, re.MULTILINE))  # Matches lines starting with GDP

Let's look at a practical example of validating economic report headers:

In [None]:
# Validate economic report headers
def validate_report_header(header):
    # Pattern: starts with country name, followed by "Economic Report", followed by a year from 2000 to 2023
    pattern = r"^[A-Za-z\s]+Economic Report\s20(?:[01][0-9]|2[0-3])$"
    return bool(re.search(pattern, header))

headers = [
    "Bangladesh Economic Report 2023",  # Valid
    "India Economic Report 2022",  # Valid
    "Sri Lanka Economic Report 2021",  # Valid
    "Economic Report 2023",  # Invalid (missing country)
    "Bangladesh Economic Report 1999",  # Invalid (year too old)
    "Bangladesh Economic Report 2025",  # Invalid (year too new)
    "Bangladesh Financial Report 2023",  # Invalid (wrong report type)
]

for header in headers:
    print(f"'{header}': {validate_report_header(header)}")

<a id='section-6'></a>
## 6. Character Classes

Character classes allow you to specify a set of characters to match:

- `[abc]` - Matches any of a, b, or c
- `[^abc]` - Matches any character except a, b, or c
- `[a-z]` - Matches any lowercase letter
- `[A-Z]` - Matches any uppercase letter
- `[0-9]` - Matches any digit
- `\d` - Matches any digit (equivalent to [0-9])
- `\D` - Matches any non-digit
- `\w` - Matches any word character (alphanumeric + underscore)
- `\W` - Matches any non-word character
- `\s` - Matches any whitespace character
- `\S` - Matches any non-whitespace character

Let's explore character classes:

In [None]:
# Demonstrating character classes
text = "My phone number is 123-456-7890. Call me at 987-654-3210."

# [0-9] matches any digit
print(re.findall(r"[0-9]{3}-[0-9]{3}-[0-9]{4}", text))  # ['123-456-7890', '987-654-3210']

# \d is shorthand for [0-9]
print(re.findall(r"\d{3}-\d{3}-\d{4}", text))  # ['123-456-7890', '987-654-3210']

# \w matches word characters
print(re.findall(r"\w+", text))  # ['My', 'phone', 'number', 'is', '123', '456', '7890', 'Call', 'me', 'at', '987', '654', '3210']

# [A-Za-z] matches any letter
print(re.findall(r"[A-Za-z]+", text))  # ['My', 'phone', 'number', 'is', 'Call', 'me', 'at']
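
The negated and "opposite" classes from the list (`[^...]`, `\D`, `\W`, `\S`) work the same way. A minimal sketch with invented sample strings:

```python
import re

# Hypothetical sample text, invented for illustration
text = "Call 123-456-7890 now!"

# \D matches any non-digit, so removing every \D leaves only the digits
digits_only = re.sub(r"\D", "", text)

# [^aeiou ] matches anything except lowercase vowels and spaces
consonant_runs = re.findall(r"[^aeiou ]+", "rate cut")

# \S+ grabs runs of non-whitespace characters
tokens = re.findall(r"\S+", text)

# \W matches non-word characters such as spaces, hyphens and "!"
non_word = re.findall(r"\W", text)

print(digits_only)
print(consonant_runs)
print(tokens)
print(non_word)
```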

#### Economic Application
Character classes are extremely useful for extracting specific types of economic data from text. For example, extracting all monetary values, percentages, or dates from economic reports.

In [None]:
# Demonstrating character classes with economic data
economic_text = """
The GDP growth was 6.5% in 2022.
Inflation reached 7.2% in the same year.
The unemployment rate was 4.8%.
Foreign exchange reserves stood at $48.5 billion.
The budget deficit was 5.1% of GDP.
"""

# [0-9] matches any digit
print(re.findall(r"[0-9]", economic_text))  # All individual digits

# \d is shorthand for [0-9]
print(re.findall(r"\d+", economic_text))  # All numbers (one or more digits)

# Extract percentages
print(re.findall(r"\d+\.\d%", economic_text))  # Numbers with decimal followed by %

# Extract monetary values
print(re.findall(r"\$\d+\.\d", economic_text))  # Dollar amounts

# Extract years
print(re.findall(r"20[0-9][0-9]", economic_text))  # Years in 2000s

# Extract economic indicators (words followed by numbers)
print(re.findall(r"[A-Za-z]+ was \d+\.\d%", economic_text))

Let's look at a more complex example with economic data validation:

In [None]:
# Validate economic data formats
def validate_economic_data(data):
    """
    Validate different types of economic data.
    Returns a dictionary with validation results.
    """
    results = {}
    
    # Validate percentage (0-100 with optional decimal)
    percentage_pattern = r"^(100|\d{1,2}(\.\d+)?)%$"
    results['percentage'] = bool(re.search(percentage_pattern, data))
    
    # Validate year (2000-2023)
    year_pattern = r"^20(0[0-9]|1[0-9]|2[0-3])$"
    results['year'] = bool(re.search(year_pattern, data))
    
    # Validate monetary value (with optional commas and decimals)
    money_pattern = r"^\$[\d,]+(\.\d{1,2})?$"
    results['money'] = bool(re.search(money_pattern, data))
    
    # Validate economic indicator code (3 letters followed by 2-4 digits)
    indicator_pattern = r"^[A-Z]{3}\d{2,4}$"
    results['indicator'] = bool(re.search(indicator_pattern, data))
    
    return results

# Test the validation function
test_data = [
    "6.5%",      # Valid percentage
    "100%",      # Valid percentage
    "125%",      # Invalid percentage
    "2022",      # Valid year
    "1999",      # Invalid year
    "2025",      # Invalid year
    "$2,500.50", # Valid money
    "$100",      # Valid money
    "$100.123",  # Invalid money (too many decimals)
    "GDP2022",   # Valid indicator
    "INF5",      # Invalid indicator
]

for data in test_data:
    result = validate_economic_data(data)
    valid_types = [k for k, v in result.items() if v]
    print(f"'{data}': Valid as {', '.join(valid_types) if valid_types else 'nothing'}")

<a id='section-7'></a>
## 7. Regex Flags

Regex flags modify the behavior of the pattern matching:

- `re.IGNORECASE` or `re.I` - Makes the pattern case-insensitive
- `re.MULTILINE` or `re.M` - Makes `^` and `$` match the start and end of each line
- `re.DOTALL` or `re.S` - Makes `.` match newline characters as well

Let's see these flags in action:

In [None]:
# Demonstrating regex flags
text = "Python is fun. python is powerful. PYTHON is versatile."

# Case-sensitive search
print(re.findall(r"python", text))  # ['python']

# Case-insensitive search using re.IGNORECASE
print(re.findall(r"python", text, re.IGNORECASE))  # ['Python', 'python', 'PYTHON']

# Multiline example
multiline_text = "First line\nSecond line\nThird line"
print(re.findall(r"^\w+", multiline_text))  # ['First'] (only matches start of string)
print(re.findall(r"^\w+", multiline_text, re.MULTILINE))  # ['First', 'Second', 'Third'] (matches start of each line)

#### Economic Application
Flags are particularly useful when processing economic text that might have inconsistent formatting. For example, economic reports might use different capitalization for terms, or data might be spread across multiple lines.

In [None]:
# Demonstrating regex flags with economic text
economic_text = """
The GDP growth rate shows positive trends.
gdp growth has been consistent over the past decade.
Gdp growth is expected to continue in the next quarter.
GDP GROWTH is a key indicator of economic health.
"""

# Case-sensitive search
print(re.findall(r"gdp growth", economic_text))  # ['gdp growth']

# Case-insensitive search using re.IGNORECASE
print(re.findall(r"gdp growth", economic_text, re.IGNORECASE))  # All variations

# Multiline example
multiline_text = """Bangladesh Economic Report
GDP Growth: 6.5%
Inflation Rate: 7.2%
Unemployment: 4.8%
Trade Balance: +2.1%
Foreign Reserves: $48.5B"""

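# Find all economic indicators (one per line)
# A literal space is used instead of \s, because \s also matches newlines
# and would merge the title line into the first indicator
print(re.findall(r"^[A-Z][a-zA-Z ]+:.*$", multiline_text, re.MULTILINE))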

# DOTALL example - match across multiple lines
report_with_sections = """Executive Summary:
The economy has shown remarkable growth.
Key indicators are positive.

Detailed Analysis:
GDP growth has been consistent.
Inflation remains under control.
"""

# Without DOTALL - doesn't match across newlines
print(re.search(r"Executive Summary:.*Analysis:", report_with_sections))

# With DOTALL - matches across newlines
print(re.search(r"Executive Summary:.*Analysis:", report_with_sections, re.DOTALL))

Let's look at a practical example of extracting economic data from a report with inconsistent formatting:

In [None]:
# Extract economic indicators from a report with inconsistent formatting
report = """
BANGLADESH ECONOMIC REVIEW 2023
=====================================

Key Economic Indicators:
- gdp growth: 6.5%
- Inflation Rate: 7.2%
- unemployment: 4.8%
- Trade Balance: +2.1% of GDP
- Foreign Reserves: $48.5 billion

Detailed Analysis:
The GDP growth rate has exceeded expectations.
Inflation remains within the target range.
Unemployment has decreased compared to last year.
"""

# Extract all economic indicators with case-insensitive matching
indicators = re.findall(r"- (\w[\w\s]*): (.+)$", report, re.MULTILINE | re.IGNORECASE)
print("Economic indicators found:")
for indicator, value in indicators:
    print(f"{indicator.strip()}: {value.strip()}")

# Extract the report title (case-insensitive)
title = re.search(r"^(bangladesh .+ review \d{4})", report, re.MULTILINE | re.IGNORECASE)
if title:
    print(f"\nReport title: {title.group(1)}")

# Extract the analysis section (multiline)
analysis = re.search(r"Detailed Analysis:(.*)", report, re.DOTALL)
if analysis:
    print(f"\nAnalysis section: {analysis.group(1).strip()}")

<a id='section-8'></a>
## 8. Extracting Data with Regex

One of the most powerful uses of regex is extracting specific data from text. We can use groups to capture parts of a match:

In [None]:
# Extracting data with groups
text = "Siddiqur Rahman, siddiqur@example.com, 2023-05-15"

# Extract name, email, and date
pattern = r"(\w+ \w+), (\w+@\w+\.\w+), (\d{4}-\d{2}-\d{2})"
match = re.search(pattern, text)

if match:
    name = match.group(1)
    email = match.group(2)
    date = match.group(3)
    print(f"Name: {name}")
    print(f"Email: {email}")
    print(f"Date: {date}")
else:
    print("No match found")

We can also use named groups to make our code more readable:

In [None]:
# Using named groups
text = "Siddiqur Rahman, siddiqur@example.com, 2023-05-15"

# Extract name, email, and date using named groups
pattern = r"(?P<name>\w+ \w+), (?P<email>\w+@\w+\.\w+), (?P<date>\d{4}-\d{2}-\d{2})"
match = re.search(pattern, text)

if match:
    name = match.group("name")
    email = match.group("email")
    date = match.group("date")
    print(f"Name: {name}")
    print(f"Email: {email}")
    print(f"Date: {date}")
else:
    print("No match found")
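
When every group is named, `match.groupdict()` returns all captures at once as a dictionary, which can be handier than calling `group()` for each name. A short sketch reusing the text and pattern above (the variable name `record` is just illustrative):

```python
import re

text = "Siddiqur Rahman, siddiqur@example.com, 2023-05-15"
pattern = r"(?P<name>\w+ \w+), (?P<email>\w+@\w+\.\w+), (?P<date>\d{4}-\d{2}-\d{2})"

match = re.search(pattern, text)
# groupdict() maps each group name to the text it captured
record = match.groupdict() if match else {}
print(record)
```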

#### Economic Application
Data extraction is crucial for economic analysis. You might need to extract specific values from economic reports, standardize data from different sources, or parse structured information from unstructured text.

In [None]:
# Extracting economic data with groups
economic_report = """
Siddiqur Rahman, siddiqur@econ.ju.edu, 2023-05-15
GDP Growth: 6.5%
Inflation Rate: 7.2%
Unemployment Rate: 4.8%
"""

# Extract name, email, and date
# [\w.]+ allows dots inside the domain, so multi-part domains like econ.ju.edu match
pattern = r"(\w+ \w+), (\w+@[\w.]+\.\w+), (\d{4}-\d{2}-\d{2})"
match = re.search(pattern, economic_report)

if match:
    name = match.group(1)
    email = match.group(2)
    date = match.group(3)
    print(f"Name: {name}")
    print(f"Email: {email}")
    print(f"Date: {date}")
else:
    print("No match found")

We can also use named groups to make our code more readable:

In [None]:
# Using named groups
economic_report = """
Siddiqur Rahman, siddiqur@econ.ju.edu, 2023-05-15
GDP Growth: 6.5%
Inflation Rate: 7.2%
Unemployment Rate: 4.8%
"""

# Extract name, email, and date using named groups
# [\w.]+ allows dots inside the domain, so multi-part domains like econ.ju.edu match
pattern = r"(?P<name>\w+ \w+), (?P<email>\w+@[\w.]+\.\w+), (?P<date>\d{4}-\d{2}-\d{2})"
match = re.search(pattern, economic_report)

if match:
    name = match.group("name")
    email = match.group("email")
    date = match.group("date")
    print(f"Name: {name}")
    print(f"Email: {email}")
    print(f"Date: {date}")
else:
    print("No match found")

Let's look at a more complex example of extracting economic indicators from a report:

In [None]:
# Extract economic indicators from a report
report = """
Bangladesh Economic Review 2023
=================================

Key Economic Indicators:
- GDP Growth: 6.5%
- Inflation Rate: 7.2%
- Unemployment Rate: 4.8%
- Trade Balance: +2.1% of GDP
- Foreign Reserves: $48.5 billion
- Budget Deficit: 5.1% of GDP

Sector-wise Performance:
- Agriculture: 3.2% growth
- Industry: 8.7% growth
- Services: 6.9% growth
"""

# Extract all economic indicators with named groups
indicator_pattern = r"- (?P<indicator>[\w\s]+): (?P<value>[+\$\d\.\s%]+)"
matches = re.finditer(indicator_pattern, report)

indicators = {}
for match in matches:
    indicator = match.group("indicator").strip()
    value = match.group("value").strip()
    indicators[indicator] = value

print("Extracted Economic Indicators:")
for indicator, value in indicators.items():
    print(f"{indicator}: {value}")

# Extract and convert percentage values (an optional + or - sign is allowed,
# so "Trade Balance: +2.1%" is not silently skipped)
print("\nPercentage Values:")
percentage_pattern = r"- (?P<indicator>[\w ]+): (?P<value>[+-]?\d+\.\d+)%"
matches = re.finditer(percentage_pattern, report)

for match in matches:
    indicator = match.group("indicator").strip()
    value = float(match.group("value"))
    print(f"{indicator}: {value}%")

Let's look at a practical example of parsing economic data from a news article:

In [None]:
# Parse economic data from a news article
news_article = """
Bangladesh's economy grew by 6.5% in 2022, according to the latest report from the 
Bangladesh Bureau of Statistics. The inflation rate stood at 7.2%, while unemployment 
was recorded at 4.8%. The country's foreign exchange reserves reached $48.5 billion, 
up from $45.2 billion in the previous year. The trade surplus was $2.1 billion, 
representing 2.1% of GDP. The budget deficit was 5.1% of GDP, slightly higher than 
the target of 4.8%. The central bank has maintained the policy interest rate at 6.75%, 
with plans to adjust it based on economic conditions.
"""

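# The article is hard-wrapped, so collapse line breaks and repeated spaces
# into single spaces; otherwise phrases split across lines would not match
news_article = re.sub(r"\s+", " ", news_article).strip()

# Extract all economic data points
data_points = {}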

# Extract GDP growth
gdp_match = re.search(r"economy grew by (\d+\.\d+)%", news_article)
if gdp_match:
    data_points["GDP Growth"] = float(gdp_match.group(1))

# Extract inflation rate
inflation_match = re.search(r"inflation rate stood at (\d+\.\d+)%", news_article)
if inflation_match:
    data_points["Inflation Rate"] = float(inflation_match.group(1))

# Extract unemployment rate
unemployment_match = re.search(r"unemployment was recorded at (\d+\.\d+)%", news_article)
if unemployment_match:
    data_points["Unemployment Rate"] = float(unemployment_match.group(1))

# Extract foreign reserves (current and previous)
reserves_matches = re.findall(r"foreign exchange reserves reached \$(\d+\.\d+) billion, up from \$(\d+\.\d+) billion", news_article)
if reserves_matches:
    current, previous = reserves_matches[0]
    data_points["Foreign Reserves (Current)"] = float(current)
    data_points["Foreign Reserves (Previous)"] = float(previous)

# Extract trade surplus
trade_match = re.search(r"trade surplus was \$(\d+\.\d+) billion", news_article)
if trade_match:
    data_points["Trade Surplus"] = float(trade_match.group(1))

# Extract budget deficit
deficit_match = re.search(r"budget deficit was (\d+\.\d+)% of GDP", news_article)
if deficit_match:
    data_points["Budget Deficit"] = float(deficit_match.group(1))

# Extract interest rate
interest_match = re.search(r"policy interest rate at (\d+\.\d+)", news_article)
if interest_match:
    data_points["Interest Rate"] = float(interest_match.group(1))

# Display extracted data
print("Extracted Economic Data:")
for indicator, value in data_points.items():
    if "Rate" in indicator or "Growth" in indicator or "Deficit" in indicator:
        print(f"{indicator}: {value}%")
    else:
        print(f"{indicator}: ${value} billion")

<a id='section-9'></a>
## 9. Problem Set: IPv4 Validation

Let's tackle our first problem: validating IPv4 addresses. An IPv4 address is formatted as #.#.#.# where each # should be a number between 0 and 255, inclusive.

**Task:** Implement a function called `validate` that expects an IPv4 address as input as a str and then returns True or False, respectively, if that input is a valid IPv4 address or not.

In [None]:
# TODO: Implement IPv4 validation function
def validate(ip):
    # Your code here
    pass

#### Unit Tests for IPv4 Validation

In [None]:
# Unit tests for IPv4 validation
def test_validate():
    # Test valid IPs
    assert validate("255.255.255.255") == True
    assert validate("192.168.1.1") == True
    assert validate("0.0.0.0") == True
    assert validate("127.0.0.1") == True
    
    # Test invalid IPs
    assert validate("256.100.100.100") == False  # 256 is out of range
    assert validate("192.168.1") == False      # Not enough octets
    assert validate("192.168.1.1.1") == False  # Too many octets
    assert validate("192.168.01.1") == False  # Leading zero
    assert validate("192.168.1.a") == False    # Non-numeric
    assert validate("192.168.1.256") == False # 256 is out of range
    assert validate("275.3.6.28") == False    # From the NUMB3RS example
    
    print("All tests passed!")

# Run the tests (expect an AssertionError until validate is implemented)
test_validate()

#### Solution for IPv4 Validation

In [None]:
# Solution for IPv4 validation
def validate(ip):
    # Split the IP address by dots
    octets = ip.split(".")
    
    # Check if there are exactly 4 octets
    if len(octets) != 4:
        return False
    
    # Check each octet
    for octet in octets:
        # Check if octet is numeric
        if not octet.isdigit():
            return False
            
        # Check for leading zeros (but allow "0")
        if len(octet) > 1 and octet[0] == "0":
            return False
            
        # Convert to integer and check range (isdigit() already rules out negative numbers)
        num = int(octet)
        if num > 255:
            return False
    
    return True

# Run the tests again
test_validate()
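
The solution above relies on string methods rather than a regular expression. Since this lecture is about regex, here is one possible regex-only alternative, offered as a sketch rather than the official answer. The names `OCTET`, `IPV4`, and `validate_regex` are hypothetical.

```python
import re

# One octet without leading zeros: 250-255, 200-249, 100-199, 10-99, or 0-9
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# Four octets separated by literal dots
IPV4 = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")

def validate_regex(ip):
    # fullmatch requires the entire string to fit the pattern;
    # [0-9] is used instead of \d so that only ASCII digits are accepted
    return IPV4.fullmatch(ip) is not None

print(validate_regex("192.168.1.1"))
print(validate_regex("275.3.6.28"))
```

The range check lives inside the pattern itself, which makes the regex longer but removes the need for `int()` conversion.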

<a id='section-10'></a>
## 10. Problem Set: YouTube URL Parsing

Now let's tackle extracting YouTube URLs from HTML. We need to extract the YouTube URL that's the value of a src attribute of an iframe element and convert it to the shorter youtu.be format.

**Task:** Implement a function called `parse` that expects a str of HTML as input, extracts any YouTube URL that's the value of a src attribute of an iframe element therein, and returns its shorter, shareable youtu.be equivalent as a str.

In [None]:
# TODO: Implement YouTube URL parsing function
def parse(s):
    # Your code here
    pass

#### Unit Tests for YouTube URL Parsing

In [None]:
# Unit tests for YouTube URL parsing
def test_parse():
    # Test with standard iframe
    html1 = '<iframe src="https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>'
    assert parse(html1) == "https://youtu.be/xvFZjo5PgG0"
    
    # Test with http instead of https
    html2 = '<iframe src="http://youtube.com/embed/xvFZjo5PgG0"></iframe>'
    assert parse(html2) == "https://youtu.be/xvFZjo5PgG0"
    
    # Test with no www
    html3 = '<iframe src="https://youtube.com/embed/xvFZjo5PgG0"></iframe>'
    assert parse(html3) == "https://youtu.be/xvFZjo5PgG0"
    
    # Test with additional attributes
    html4 = '<iframe width="560" height="315" src="https://www.youtube.com/embed/xvFZjo5PgG0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>'
    assert parse(html4) == "https://youtu.be/xvFZjo5PgG0"
    
    # Test with no YouTube URL
    html5 = '<iframe src="https://example.com/video"></iframe>'
    assert parse(html5) is None
    
    # Test with no iframe
    html6 = '<div>Not an iframe</div>'
    assert parse(html6) is None
    
    print("All tests passed!")

# Run the tests
test_parse()

#### Solution for YouTube URL Parsing

In [None]:
# Solution for YouTube URL parsing
import re

def parse(s):
    # Match a YouTube embed URL in the src attribute of an iframe tag.
    # [^>]* keeps the search inside the opening <iframe ...> tag.
    pattern = r'<iframe[^>]*\bsrc="https?://(?:www\.)?youtube\.com/embed/([\w-]+)"'
    
    match = re.search(pattern, s)
    if match:
        video_id = match.group(1)
        return f"https://youtu.be/{video_id}"
    
    return None

# Run the tests again
test_parse()
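
A pattern like the one in the solution packs a lot into one line. As a sketch of how it can be made easier to read, the version below spells out a similar pattern with the `re.VERBOSE` flag, which ignores whitespace in the pattern and allows inline comments. The names `YOUTUBE_IFRAME` and `parse_verbose` and the group name `video_id` are hypothetical and chosen for illustration.

```python
import re

YOUTUBE_IFRAME = re.compile(
    r"""
    <iframe                 # start of the iframe tag
    [^>]*?                  # any other attributes, staying inside the tag
    \bsrc="                 # the src attribute and its opening quote
    https?://               # http or https
    (?:www\.)?              # optional www.
    youtube\.com/embed/     # the embed path
    (?P<video_id>[\w-]+)    # the video ID: letters, digits, underscore, hyphen
    "                       # closing quote of the src value
    """,
    re.VERBOSE,
)

def parse_verbose(s):
    match = YOUTUBE_IFRAME.search(s)
    if match:
        return f"https://youtu.be/{match.group('video_id')}"
    return None

html = '<iframe width="560" src="https://www.youtube.com/embed/xvFZjo5PgG0"></iframe>'
print(parse_verbose(html))  # https://youtu.be/xvFZjo5PgG0
```

For anything beyond simple, well-formed snippets, a real HTML parser is usually a safer choice than a regex.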

<a id='section-11'></a>
## 11. Problem Set: Time Conversion

Now let's tackle converting 12-hour time format to 24-hour format. This requires careful handling of AM/PM and edge cases like 12:00 AM and 12:00 PM.

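**Task:** Implement a function called `convert` that expects a str in any of the 12-hour formats below and returns the corresponding str in 24-hour format. It should raise a `ValueError` if the input matches none of these formats or contains an invalid time.

- `9:00 AM to 5:00 PM`
- `9 AM to 5 PM`
- `9:00 AM to 5 PM`
- `9 AM to 5:00 PM`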

In [None]:
# TODO: Implement time conversion function
def convert(s):
    # Your code here
    pass

#### Unit Tests for Time Conversion

In [None]:
# Unit tests for time conversion
def test_convert():
    # Test basic conversions
    assert convert("9:00 AM to 5:00 PM") == "09:00 to 17:00"
    assert convert("9 AM to 5 PM") == "09:00 to 17:00"
    assert convert("9:00 AM to 5 PM") == "09:00 to 17:00"
    assert convert("9 AM to 5:00 PM") == "09:00 to 17:00"
    
    # Test edge cases
    assert convert("12:00 AM to 12:00 PM") == "00:00 to 12:00"
    assert convert("12:00 PM to 12:00 AM") == "12:00 to 00:00"
    assert convert("1:00 AM to 1:00 PM") == "01:00 to 13:00"
    assert convert("11:59 AM to 11:59 PM") == "11:59 to 23:59"
    
    # Test overnight shifts
    assert convert("5:00 PM to 9:00 AM") == "17:00 to 09:00"
    assert convert("11:00 PM to 1:00 AM") == "23:00 to 01:00"
    
    print("All tests passed!")

# Run the tests
test_convert()

#### Solution for Time Conversion

In [None]:
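# Solution for time conversion
import re

def convert(s):
    # Pattern to match the time format; the minutes are optional
    pattern = r"(\d{1,2})(?::(\d{2}))? (AM|PM) to (\d{1,2})(?::(\d{2}))? (AM|PM)"
    
    # fullmatch rejects input with extra text before or after the time range
    match = re.fullmatch(pattern, s)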
    if not match:
        raise ValueError("Invalid time format")
    
    # Extract components
    start_hour = int(match.group(1))
    start_min = int(match.group(2)) if match.group(2) else 0
    start_period = match.group(3)
    
    end_hour = int(match.group(4))
    end_min = int(match.group(5)) if match.group(5) else 0
    end_period = match.group(6)
    
    # Validate hours and minutes
    if start_hour < 1 or start_hour > 12 or end_hour < 1 or end_hour > 12:
        raise ValueError("Invalid hour")
    
    if start_min < 0 or start_min > 59 or end_min < 0 or end_min > 59:
        raise ValueError("Invalid minute")
    
    # Convert start time
    if start_period == "AM":
        if start_hour == 12:
            start_hour_24 = 0
        else:
            start_hour_24 = start_hour
    else:  # PM
        if start_hour == 12:
            start_hour_24 = 12
        else:
            start_hour_24 = start_hour + 12
    
    # Convert end time
    if end_period == "AM":
        if end_hour == 12:
            end_hour_24 = 0
        else:
            end_hour_24 = end_hour
    else:  # PM
        if end_hour == 12:
            end_hour_24 = 12
        else:
            end_hour_24 = end_hour + 12
    
    # Format the result
    start_time = f"{start_hour_24:02d}:{start_min:02d}"
    end_time = f"{end_hour_24:02d}:{end_min:02d}"
    
    return f"{start_time} to {end_time}"

# Run the tests again
test_convert()
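
The start and end times in the solution are converted with the same rule, written out twice. As a sketch of one way to avoid the repetition, the rule can live in a small helper. The name `to_24_hour` is hypothetical, and the helper assumes the hour has already been validated to lie between 1 and 12.

```python
def to_24_hour(hour, period):
    # hour % 12 maps 12 to 0 and leaves 1-11 unchanged;
    # PM then adds 12, so 12 PM -> 12 and 1 PM -> 13.
    return hour % 12 + (12 if period == "PM" else 0)

print(to_24_hour(12, "AM"))  # 0
print(to_24_hour(12, "PM"))  # 12
print(to_24_hour(5, "PM"))   # 17
```

Inside `convert`, the two `if`/`else` blocks could then be replaced by two calls to this helper.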

<a id='section-12'></a>
## 12. Problem Set: Word Counting

Our next problem is counting occurrences of the word "um" in a text, but only when it appears as a standalone word, not as part of another word.

**Task:** Implement a function called `count` that expects a line of text as input as a str and returns, as an int, the number of times that "um" appears in that text, case-insensitively, as a word unto itself, not as a substring of some other word.

In [None]:
# TODO: Implement word counting function
def count(s):
    # Your code here
    pass

#### Unit Tests for Word Counting

In [None]:
# Unit tests for word counting
def test_count():
    # Test basic cases
    assert count("um") == 1
    assert count("um um") == 2
    assert count("Um, thanks for the album.") == 1
    assert count("Um, thanks, um...") == 2
    
    # Test case insensitivity
    assert count("UM") == 1
    assert count("Um") == 1
    assert count("uM") == 1
    
    # Test that it doesn't match substrings
    assert count("yummy") == 0
    assert count("summer") == 0
    assert count("humming") == 0
    
    # Test with punctuation
    assert count("um?") == 1
    assert count("um!") == 1
    assert count("um.") == 1
    assert count("um,") == 1
    
    # Test complex sentences
    assert count("This is, um, a test. Um, I think it's working.") == 2
    assert count("Hello, um, world. This is um, a test.") == 2
    
    print("All tests passed!")

# Run the tests
test_count()

#### Solution for Word Counting

In [None]:
# Solution for word counting
import re

def count(s):
    # Pattern to match "um" as a whole word, case-insensitive
    # \b matches word boundaries
    pattern = r"\bum\b"
    
    # Find all matches, case-insensitive
    matches = re.findall(pattern, s, re.IGNORECASE)
    
    # Return the count of matches
    return len(matches)

# Run the tests again
test_count()
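
To see what the word boundaries contribute, here is a short comparison of the pattern with and without `\b`, using one of the test sentences from above. The variable names are for illustration only.

```python
import re

text = "Um, thanks for the album."

# Without boundaries, the "um" at the end of "album" is counted too
naive = re.findall(r"um", text, re.IGNORECASE)

# With \b on both sides, only the standalone word matches
bounded = re.findall(r"\bum\b", text, re.IGNORECASE)

print(len(naive))    # 2
print(len(bounded))  # 1
```

`\b` matches the position between a word character (`\w`) and a non-word character or the start or end of the string, which is why punctuation such as the comma after "Um" does not prevent a match.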

<a id='section-13'></a>
## 13. Problem Set: Email Validation

Our final problem is validating email addresses using a library rather than writing our own regex.

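**Task:** Using either `validator-collection` or `validators` from PyPI, implement a program that prompts the user for an email address via `input` and then prints `Valid` or `Invalid`, depending on whether the input is a syntactically valid email address. In this notebook, the check is wrapped in a function, `validate_email`, that returns `"Valid"` or `"Invalid"` so that it can be unit-tested.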

In [None]:
# TODO: Implement email validation function
def validate_email(email):
    # Your code here
    pass

#### Unit Tests for Email Validation

In [None]:
# Unit tests for email validation
def test_validate_email():
    # Test valid emails
    assert validate_email("siddiqur@example.com") == "Valid"
    assert validate_email("test.email+tag@example.co.uk") == "Valid"
    assert validate_email("user_name@domain.org") == "Valid"
    assert validate_email("firstname-lastname@example.com") == "Valid"
    
    # Test invalid emails
    assert validate_email("plainaddress") == "Invalid"
    assert validate_email("@missingdomain.com") == "Invalid"
    assert validate_email("missing@.com") == "Invalid"
    assert validate_email("missing@domain") == "Invalid"
    assert validate_email("spaces @domain.com") == "Invalid"
    assert validate_email("email@domain..com") == "Invalid"
    
    print("All tests passed!")

# Run the tests
test_validate_email()

#### Solution for Email Validation

In [None]:
# Solution for email validation using validators library
# First, you would need to install the validators library: pip install validators
import validators

def validate_email(email):
    if validators.email(email):
        return "Valid"
    else:
        return "Invalid"

# Alternative solution using validator-collection
# from validator_collection import checkers
# def validate_email(email):
#     if checkers.is_email(email):
#         return "Valid"
#     else:
#         return "Invalid"

# Run the tests again
test_validate_email()

## Conclusion

In this notebook, we've explored the powerful world of regular expressions in Python. We've learned:

1. Basic regex patterns and special characters
2. How to use anchors and boundaries
3. Character classes and their shorthands
4. Regex flags to modify behavior
5. How to extract data using groups
6. How to apply these concepts to real-world problems

### Economic Applications of Regular Expressions
Regular expressions are fundamental to modern economic text analysis:

1. **Data Extraction:** Extracting economic indicators, dates, and monetary values from reports and news articles.

2. **Text Standardization:** Cleaning and standardizing economic data from different sources with varying formats.

3. **Content Analysis:** Identifying key economic terms, themes, and patterns in large text corpora.

4. **Survey Processing:** Parsing and validating responses from economic surveys with text answers.

5. **Financial Document Analysis:** Extracting specific information from financial statements and reports.
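
As a small illustration of the data-extraction use case, the sketch below pulls percentages, dollar amounts, and quarters out of a sentence. The sentence is made up for this example and does not come from any real report, and the patterns assume US-style number formatting (comma as thousands separator, period as decimal point).

```python
import re

# Invented sample text, for illustration only
report = "GDP grew 2.4% in Q3 2023, while exports reached $1,250.75 million."

# A number with an optional decimal part, followed by a percent sign
percentages = re.findall(r"\d+(?:\.\d+)?%", report)

# A dollar sign, 1-3 digits, optional comma-separated groups, optional decimals
amounts = re.findall(r"\$\d{1,3}(?:,\d{3})*(?:\.\d+)?", report)

# A quarter label such as "Q3 2023"
quarters = re.findall(r"Q[1-4] \d{4}", report)

print(percentages)  # ['2.4%']
print(amounts)      # ['$1,250.75']
print(quarters)     # ['Q3 2023']
```

Real reports vary widely in formatting, so patterns like these usually need to be adapted and tested against samples from each source.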

### Best Practices for Economic Programming

- **Start Simple:** Begin with basic patterns and gradually add complexity as needed.

- **Test Thoroughly:** Create comprehensive tests for your regex patterns, especially with economic data.

- **Document Patterns:** Document what your regex patterns are designed to match and why.

- **Handle Edge Cases:** Consider unusual formats or edge cases in economic data.

- **Use Named Groups:** For complex patterns, use named groups to improve code readability.
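
To make the named-groups advice concrete, here is a minimal sketch. The line format (`YYYY-MM: Indicator = value`) and the sample data are hypothetical; the point is that `(?P<name>...)` lets you refer to each captured piece by name instead of by position.

```python
import re

# Hypothetical line format: "YYYY-MM: Indicator Name = value"
pattern = re.compile(
    r"(?P<year>\d{4})-(?P<month>\d{2}): "
    r"(?P<indicator>[A-Za-z ]+) = "
    r"(?P<value>-?\d+(?:\.\d+)?)"
)

match = pattern.fullmatch("2023-09: Unemployment Rate = 3.8")
if match:
    # groupdict() returns all named groups as a dictionary
    record = match.groupdict()
    print(record["indicator"])     # Unemployment Rate
    print(float(record["value"]))  # 3.8
```

Compared with `match.group(3)`, `match.group("indicator")` keeps working even if another group is later added earlier in the pattern.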

Regular expressions are a fundamental tool in text processing and data validation. While they can seem complex at first, they become more intuitive with practice.

Keep practicing, and don't hesitate to refer to the Python `re` module documentation when you need to craft more complex patterns!