# Basic NLP Course

## Regular Expressions

This is the first class of the Basic NLP Course, where we will focus on understanding and applying Regular Expressions (Regex). Regex is a powerful tool for text processing and pattern matching, which forms the foundation for many Natural Language Processing (NLP) tasks.

A useful reference for building and testing patterns interactively: [Regex101](https://regex101.com/)

### Applications in Chemical and Process Engineering

Regular Expressions can be applied in various chemical and process engineering tasks, such as:

- **Data Cleaning and Preprocessing**: Extracting and formatting experimental data from unstructured text files or reports.
- **Parsing Chemical Formulas**: Identifying and validating chemical formulas or reaction equations from text.
- **Log File Analysis**: Analyzing process logs to detect anomalies or patterns in operational data.
- **Simulation Input Validation**: Ensuring that input files for simulation software adhere to specific formats.
- **Text Mining**: Extracting relevant information from research papers, patents, or technical documents.

Regex provides a systematic way to handle these tasks efficiently, saving time and reducing errors.

In [1]:
import re

In [2]:
# Noisy, unstructured example texts, some of which contain phone numbers
noisy_texts = [
    "C@ll me @t 1234567890 wh3n you g3t th!s m3ss@ge!!",
    "H3y, my n3w numb3r is 9876543210. S@ve !t, plz.",
    "Th3 ev3nt st@rts @t 6 PM. C0nt@ct 5556667777 4 d3tails.",
    "R@nd0m t3xt w!th a numb3r 1112223333 in b3tw33n.",
    "Re@ch 0ut t0 supp0rt @t 4445556666 f0r @ssist@nc3.",
    "Call me at 1231231234 when you get this message.",
    "Hey, my new number is 3213214321. Save it, please.",
    "The quick brown fox jumps over the lazy dog. No numbers here!",
    "Contact us at support@example.com for more information.",
    "Meeting rescheduled to 3 PM tomorrow. No phone number included."
]

# Display the noisy texts
for text in noisy_texts:
    print(text)

C@ll me @t 1234567890 wh3n you g3t th!s m3ss@ge!!
H3y, my n3w numb3r is 9876543210. S@ve !t, plz.
Th3 ev3nt st@rts @t 6 PM. C0nt@ct 5556667777 4 d3tails.
R@nd0m t3xt w!th a numb3r 1112223333 in b3tw33n.
Re@ch 0ut t0 supp0rt @t 4445556666 f0r @ssist@nc3.
Call me at 1231231234 when you get this message.
Hey, my new number is 3213214321. Save it, please.
The quick brown fox jumps over the lazy dog. No numbers here!
Contact us at support@example.com for more information.
Meeting rescheduled to 3 PM tomorrow. No phone number included.


In [3]:
# let's identify the phone numbers: ten consecutive digits
# (raw string, so the backslashes reach the regex engine unchanged)
pattern = r'\d\d\d\d\d\d\d\d\d\d'

for text in noisy_texts:
    matches = re.findall(pattern, text)

    print('Chat:', text)
    for match in matches:
        print('Phones: ', match)

Chat: C@ll me @t 1234567890 wh3n you g3t th!s m3ss@ge!!
Phones:  1234567890
Chat: H3y, my n3w numb3r is 9876543210. S@ve !t, plz.
Phones:  9876543210
Chat: Th3 ev3nt st@rts @t 6 PM. C0nt@ct 5556667777 4 d3tails.
Phones:  5556667777
Chat: R@nd0m t3xt w!th a numb3r 1112223333 in b3tw33n.
Phones:  1112223333
Chat: Re@ch 0ut t0 supp0rt @t 4445556666 f0r @ssist@nc3.
Phones:  4445556666
Chat: Call me at 1231231234 when you get this message.
Phones:  1231231234
Chat: Hey, my new number is 3213214321. Save it, please.
Phones:  3213214321
Chat: The quick brown fox jumps over the lazy dog. No numbers here!
Chat: Contact us at support@example.com for more information.
Chat: Meeting rescheduled to 3 PM tomorrow. No phone number included.


In [4]:
# an equivalent, shorter pattern: {10} repeats \d exactly ten times
pattern = r'\d{10}'

for text in noisy_texts:
    matches = re.findall(pattern, text)

    print('Chat:', text)
    for match in matches:
        print('Phones: ', match)

Chat: C@ll me @t 1234567890 wh3n you g3t th!s m3ss@ge!!
Phones:  1234567890
Chat: H3y, my n3w numb3r is 9876543210. S@ve !t, plz.
Phones:  9876543210
Chat: Th3 ev3nt st@rts @t 6 PM. C0nt@ct 5556667777 4 d3tails.
Phones:  5556667777
Chat: R@nd0m t3xt w!th a numb3r 1112223333 in b3tw33n.
Phones:  1112223333
Chat: Re@ch 0ut t0 supp0rt @t 4445556666 f0r @ssist@nc3.
Phones:  4445556666
Chat: Call me at 1231231234 when you get this message.
Phones:  1231231234
Chat: Hey, my new number is 3213214321. Save it, please.
Phones:  3213214321
Chat: The quick brown fox jumps over the lazy dog. No numbers here!
Chat: Contact us at support@example.com for more information.
Chat: Meeting rescheduled to 3 PM tomorrow. No phone number included.
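
One caveat with `\d{10}`: it also matches the first ten digits of any longer digit run. A minimal sketch of one way to guard against that, assuming phone numbers are delimited by non-word characters, is to wrap the pattern in word boundaries (`\b`). The sample string is made up for illustration.

```python
import re

# Hypothetical sample: a 13-digit order ID and a real 10-digit phone number
sample = "Order 1234567890123 was confirmed; call 5556667777 for help."

# Without boundaries, the first ten digits of the order ID are reported too
loose = re.findall(r'\d{10}', sample)

# With \b on both sides, only standalone 10-digit runs are reported
strict = re.findall(r'\b\d{10}\b', sample)

print('loose: ', loose)
print('strict:', strict)
```

Whether the stricter form is appropriate depends on the data: it would also reject a phone number glued directly to letters.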


In [5]:
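# First step towards the (XXX)-XXXXXXX format: match only the area code, (XXX)
# The parentheses are escaped because ( and ) are special characters in regex
pattern = r'\(\d{3}\)'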

# Example texts with (XXX)-XXXXXXX format
formatted_texts = [
    "Call me at (123)-4567890 when you get this message.",
    "Hey, my new number is (987)-6543210. Save it, please.",
    "The event starts at 6 PM. Contact (555)-6667777 for details.",
    "Random text with a number (111)-2223333 in between.",
    "Reach out to support at (444)-5556666 for assistance.",
    "No phone number here, just some random text.",
    "Another example: (321)-3214321."
]

# Display matches for the (XXX)-XXXXXXX pattern
for text in formatted_texts:
    matches = re.findall(pattern, text)
    print('Chat:', text)
    for match in matches:
        print('Phones: ', match)

Chat: Call me at (123)-4567890 when you get this message.
Phones:  (123)
Chat: Hey, my new number is (987)-6543210. Save it, please.
Phones:  (987)
Chat: The event starts at 6 PM. Contact (555)-6667777 for details.
Phones:  (555)
Chat: Random text with a number (111)-2223333 in between.
Phones:  (111)
Chat: Reach out to support at (444)-5556666 for assistance.
Phones:  (444)
Chat: No phone number here, just some random text.
Chat: Another example: (321)-3214321.
Phones:  (321)


In [6]:
# Pattern for phone numbers in the format (XXX)-XXXXXXX
pattern = r'\(\d{3}\)-\d{7}'

# Example texts with (XXX)-XXXXXXX format
formatted_texts = [
    "Call me at (123)-4567890 when you get this message.",
    "Hey, my new number is (987)-6543210. Save it, please.",
    "The event starts at 6 PM. Contact (555)-6667777 for details.",
    "Random text with a number (111)-2223333 in between.",
    "Reach out to support at (444)-5556666 for assistance.",
    "No phone number here, just some random text.",
    "Another example: (321)-3214321."
]

# Display matches for the (XXX)-XXXXXXX pattern
for text in formatted_texts:
    matches = re.findall(pattern, text)
    print('Chat:', text)
    for match in matches:
        print('Phones: ', match)

Chat: Call me at (123)-4567890 when you get this message.
Phones:  (123)-4567890
Chat: Hey, my new number is (987)-6543210. Save it, please.
Phones:  (987)-6543210
Chat: The event starts at 6 PM. Contact (555)-6667777 for details.
Phones:  (555)-6667777
Chat: Random text with a number (111)-2223333 in between.
Phones:  (111)-2223333
Chat: Reach out to support at (444)-5556666 for assistance.
Phones:  (444)-5556666
Chat: No phone number here, just some random text.
Chat: Another example: (321)-3214321.
Phones:  (321)-3214321


In [7]:
# Combine both lists
combined_texts = noisy_texts + formatted_texts

# Pattern to match both plain 10-digit numbers and (XXX)-XXXXXXX format
pattern1 = r'\d{10}'
pattern2 = r'\(\d{3}\)-\d{7}'
pattern = f"({pattern1}|{pattern2})"

# | is the OR operator
# Display matches for either format
for text in combined_texts:
    matches = re.findall(pattern, text)
    print('Chat:', text)
    for match in matches:
        print('Phones: ', match)

Chat: C@ll me @t 1234567890 wh3n you g3t th!s m3ss@ge!!
Phones:  1234567890
Chat: H3y, my n3w numb3r is 9876543210. S@ve !t, plz.
Phones:  9876543210
Chat: Th3 ev3nt st@rts @t 6 PM. C0nt@ct 5556667777 4 d3tails.
Phones:  5556667777
Chat: R@nd0m t3xt w!th a numb3r 1112223333 in b3tw33n.
Phones:  1112223333
Chat: Re@ch 0ut t0 supp0rt @t 4445556666 f0r @ssist@nc3.
Phones:  4445556666
Chat: Call me at 1231231234 when you get this message.
Phones:  1231231234
Chat: Hey, my new number is 3213214321. Save it, please.
Phones:  3213214321
Chat: The quick brown fox jumps over the lazy dog. No numbers here!
Chat: Contact us at support@example.com for more information.
Chat: Meeting rescheduled to 3 PM tomorrow. No phone number included.
Chat: Call me at (123)-4567890 when you get this message.
Phones:  (123)-4567890
Chat: Hey, my new number is (987)-6543210. Save it, please.
Phones:  (987)-6543210
Chat: The event starts at 6 PM. Contact (555)-6667777 for details.
Phones:  (555)-6667777
Chat: Rand
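
A note on the parentheses in the combined pattern: when a pattern contains capturing groups, `re.findall` returns the group contents rather than the whole match. With a single group around everything this makes no difference, but with one group per alternative it returns tuples. The sketch below, using a made-up sentence, compares the three variants.

```python
import re

# Hypothetical sentence containing both phone formats
text = "Old: 1234567890, new: (987)-6543210."

# One capturing group around the whole alternation: list of strings
one_group = re.findall(r'(\d{10}|\(\d{3}\)-\d{7})', text)

# One capturing group per alternative: list of tuples, one slot per group
two_groups = re.findall(r'(\d{10})|(\(\d{3}\)-\d{7})', text)

# Non-capturing group (?:...): list of whole matches
non_capturing = re.findall(r'(?:\d{10}|\(\d{3}\)-\d{7})', text)

print(one_group)
print(two_groups)
print(non_capturing)
```

Using `(?:...)` keeps the grouping needed for `|` without changing what `re.findall` returns.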

In [14]:
# Example texts with email addresses
email_texts = [
    "Contact us at support@example.com for more information.",
    "Send your feedback to feedback@company.org.",
    "Reach out to john.doe123@university.edu for academic queries.",
    "No email here, just some random text.",
    "Another example: jane-doe@sub.domain.co.uk."
]

# first attempt: lowercase letters followed by @
email_pattern1 = r'[a-z]*@'

def extract_matches(texts, pattern):
    """
    Extracts and prints matches from a list of texts based on the provided regex pattern.

    Args:
        texts (list): List of texts to search for matches.
        pattern (str): Regular expression pattern to match.
    """
    for text in texts:
        print('Chat:', text)
        matches = re.findall(pattern, text)
        for match in matches:
            print('Match:', match)

# Call the function
extract_matches(email_texts, email_pattern1)

Chat: Contact us at support@example.com for more information.
Match: support@
Chat: Send your feedback to feedback@company.org.
Match: feedback@
Chat: Reach out to john.doe123@university.edu for academic queries.
Match: @
Chat: No email here, just some random text.
Chat: Another example: jane-doe@sub.domain.co.uk.
Match: doe@
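
The bare `@` reported for `john.doe123@university.edu` comes from the `*` quantifier, which accepts zero characters: the digit `3` right before the `@` is not in `[a-z]`, so the pattern matches the `@` alone. A short sketch comparing `*` with `+` (one or more) on that sentence:

```python
import re

text = "Reach out to john.doe123@university.edu for academic queries."

# * allows zero letters, so a lone @ still counts as a match
star_matches = re.findall(r'[a-z]*@', text)

# + requires at least one letter directly before the @, so nothing matches here
plus_matches = re.findall(r'[a-z]+@', text)

print(star_matches)
print(plus_matches)
```

With `+`, an incomplete character class shows up as a missing match instead of a misleading partial one.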


In [15]:
# let's improve the pattern a little: allow uppercase letters too
email_pattern2 = r'[a-zA-Z]*@'

extract_matches(email_texts, email_pattern2)

Chat: Contact us at support@example.com for more information.
Match: support@
Chat: Send your feedback to feedback@company.org.
Match: feedback@
Chat: Reach out to john.doe123@university.edu for academic queries.
Match: @
Chat: No email here, just some random text.
Chat: Another example: jane-doe@sub.domain.co.uk.
Match: doe@


In [16]:
# improve it a little more: allow digits as well
email_pattern3 = r'[a-zA-Z0-9]*@'

extract_matches(email_texts, email_pattern3)

Chat: Contact us at support@example.com for more information.
Match: support@
Chat: Send your feedback to feedback@company.org.
Match: feedback@
Chat: Reach out to john.doe123@university.edu for academic queries.
Match: doe123@
Chat: No email here, just some random text.
Chat: Another example: jane-doe@sub.domain.co.uk.
Match: doe@


In [17]:
# also allow dots and underscores
email_pattern4 = r'[a-zA-Z0-9._]*@'

extract_matches(email_texts, email_pattern4)

Chat: Contact us at support@example.com for more information.
Match: support@
Chat: Send your feedback to feedback@company.org.
Match: feedback@
Chat: Reach out to john.doe123@university.edu for academic queries.
Match: john.doe123@
Chat: No email here, just some random text.
Chat: Another example: jane-doe@sub.domain.co.uk.
Match: doe@


In [18]:
# also allow %, + and - in the local part
email_pattern5 = r'[a-zA-Z0-9._\%\+\-]*@'
extract_matches(email_texts, email_pattern5)

Chat: Contact us at support@example.com for more information.
Match: support@
Chat: Send your feedback to feedback@company.org.
Match: feedback@
Chat: Reach out to john.doe123@university.edu for academic queries.
Match: john.doe123@
Chat: No email here, just some random text.
Chat: Another example: jane-doe@sub.domain.co.uk.
Match: jane-doe@


In [20]:
# let's capture the domain as well (note the sentence-ending period picked up in some matches)
email_pattern6 = r'[a-zA-Z0-9._%+-]*@[a-zA-Z0-9.-]*\.[a-zA-Z]*'
extract_matches(email_texts, email_pattern6)

Chat: Contact us at support@example.com for more information.
Match: support@example.com
Chat: Send your feedback to feedback@company.org.
Match: feedback@company.org.
Chat: Reach out to john.doe123@university.edu for academic queries.
Match: john.doe123@university.edu
Chat: No email here, just some random text.
Chat: Another example: jane-doe@sub.domain.co.uk.
Match: jane-doe@sub.domain.co.uk.


In [23]:
# improving the domain pattern: require at least one character on each side of
# the @ and a top-level domain of two or more letters, so a sentence-ending
# period is no longer swallowed
email_pattern7 = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
extract_matches(email_texts, email_pattern7)

Chat: Contact us at support@example.com for more information.
Match: support@example.com
Chat: Send your feedback to feedback@company.org.
Match: feedback@company.org
Chat: Reach out to john.doe123@university.edu for academic queries.
Match: john.doe123@university.edu
Chat: No email here, just some random text.
Chat: Another example: jane-doe@sub.domain.co.uk.
Match: jane-doe@sub.domain.co.uk
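
Once an address is matched, it is often useful to split it into its parts. The sketch below is one possible way to do that with named groups; the group names `local` and `domain` and the exact character classes are illustrative choices, not a complete email grammar.

```python
import re

# Named groups (?P<name>...) label the two halves of the address.
# The + quantifiers and the {2,} top-level domain are assumptions made here
# to avoid empty parts and a trailing sentence period.
email_re = re.compile(
    r'(?P<local>[a-zA-Z0-9._%+-]+)@(?P<domain>[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'
)

m = email_re.search("Another example: jane-doe@sub.domain.co.uk.")
local_part = m.group('local')
domain_part = m.group('domain')

print('Local part:', local_part)
print('Domain:', domain_part)
```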


# Practical Exercises: Regex in Chemical and Process Engineering

Below are some practical exercises to help you understand how Regular Expressions (Regex) can be applied to solve real-world problems in chemical and process engineering.

---

## Exercise 1: Extracting Chemical Formulas
Given a list of unstructured texts containing chemical formulas, extract all valid chemical formulas (e.g., `H2O`, `C6H12O6`, `NaCl`).

**Example Input:**
```
texts = [
    "The reaction produces H2O and CO2.",
    "Glucose (C6H12O6) is a simple sugar.",
    "NaCl is common table salt.",
    "Random text without formulas.",
    "Another example: CH3COOH (acetic acid)."
]
```

**Expected Output:**
```
['H2O', 'CO2', 'C6H12O6', 'NaCl', 'CH3COOH']
```

---

## Exercise 2: Validating Reaction Equations
Write a regex to validate simple chemical reaction equations. For example, `H2 + O2 -> H2O` or `C + O2 -> CO2`.

**Example Input:**
```
equations = [
    "H2 + O2 -> H2O",
    "C + O2 -> CO2",
    "Invalid equation ->",
    "NaCl + H2O -> NaOH + HCl"
]
```

**Expected Output:**
```
['H2 + O2 -> H2O', 'C + O2 -> CO2', 'NaCl + H2O -> NaOH + HCl']
```

---

## Exercise 3: Extracting Units from Experimental Data
Extract all units (e.g., `kg`, `m^3`, `Pa`, `mol`) from a list of experimental data.

**Example Input:**
```
data = [
    "The pressure is 101325 Pa.",
    "Flow rate: 1.5 m^3/s.",
    "Mass: 25 kg.",
    "Concentration: 0.1 mol/L."
]
```

**Expected Output:**
```
['Pa', 'm^3/s', 'kg', 'mol/L']
```

---

## Exercise 4: Parsing Log Files for Anomalies
Identify timestamps and error codes in process log files.

**Example Input:**
```
logs = [
    "[2023-10-01 12:00:00] ERROR: Code 1234 - Reactor temperature too high.",
    "[2023-10-01 12:05:00] INFO: System stable.",
    "[2023-10-01 12:10:00] ERROR: Code 5678 - Pressure drop detected."
]
```

**Expected Output:**
```
Timestamps: ['2023-10-01 12:00:00', '2023-10-01 12:10:00']
Error Codes: ['1234', '5678']
```

---

## Exercise 5: Extracting Simulation Parameters
Extract key-value pairs of simulation parameters from input files.

**Example Input:**
```
parameters = """
Temperature: 300 K
Pressure: 101325 Pa
FlowRate: 1.5 m^3/s
Concentration: 0.1 mol/L
"""
```

**Expected Output:**
```
{'Temperature': '300 K', 'Pressure': '101325 Pa', 'FlowRate': '1.5 m^3/s', 'Concentration': '0.1 mol/L'}
```

---

## Exercise 6: Identifying CAS Numbers
Extract valid CAS (Chemical Abstracts Service) numbers from a list of texts. A CAS number has three hyphen-separated parts: two to seven digits, then two digits, then a single check digit (for example, `7732-18-5`).

**Example Input:**
```
texts = [
    "The CAS number for water is 7732-18-5.",
    "Methane has a CAS number of 74-82-8.",
    "Invalid CAS: 123-456.",
    "Another valid CAS: 50-00-0."
]
```

**Expected Output:**
```
['7732-18-5', '74-82-8', '50-00-0']
```

---

## Exercise 7: Extracting Numerical Data
Extract all numerical values (including decimals and scientific notation) from experimental results.

**Example Input:**
```
results = [
    "The reaction rate is 1.23e-4 mol/s.",
    "Temperature: 300 K.",
    "Pressure: 1.01325e5 Pa.",
    "Flow rate: 1.5 m^3/s."
]
```

**Expected Output:**
```
['1.23e-4', '300', '1.01325e5', '1.5']
```

In [94]:
# exercise 1 - extracting chemical formulas
texts = [
    "The reaction produces H2O and CO2.",
    "Glucose (C6H12O6) is a simple sugar.",
    "NaCl is common table salt.",
    "Random text without formulas.",
    "Another example: CH3COOH (acetic acid)."
]

pattern = r'([A-Z]*\d+[A-Z]*)+|[A-Z][a-z][A-Z][a-z]'
for text in texts:
    print(text)
    print('Matches:', [match.group() for match in re.finditer(pattern, text)])

The reaction produces H2O and CO2.
Matches: ['H2O', 'CO2']
Glucose (C6H12O6) is a simple sugar.
Matches: ['C6H12O6']
NaCl is common table salt.
Matches: ['NaCl']
Random text without formulas.
Matches: []
Another example: CH3COOH (acetic acid).
Matches: ['CH3COOH']


### Pattern Breakdown

1. `([A-Z]*\d+[A-Z]*)+`:
    - Matches one or more sequences of:
      - Zero or more uppercase letters (`[A-Z]*`),
      - Followed by one or more digits (`\d+`),
      - Followed by zero or more uppercase letters (`[A-Z]*`).
    - This part matches chemical formulas that contain at least one digit, like `H2O`, `C6H12O6`, or `CH3COOH`. Because the letters are optional, it also matches bare numbers such as `42`.

2. `[A-Z][a-z][A-Z][a-z]`:
    - Matches a sequence of:
      - An uppercase letter (`[A-Z]`),
      - Followed by a lowercase letter (`[a-z]`),
      - Followed by another uppercase letter (`[A-Z]`),
      - Followed by another lowercase letter (`[a-z]`).
    - This alternative covers digit-free formulas made of two two-letter element symbols, such as `NaCl`. It is narrow: formulas like `HCl` or `NaOH` match neither alternative.
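
A more general alternative is to describe a formula as a sequence of element symbols (an uppercase letter, an optional lowercase letter, an optional count). The sketch below is one possible pattern, not the only solution; it requires at least two symbol units, an assumption made here so that ordinary capitalised words are not reported.

```python
import re

texts = [
    "The reaction produces H2O and CO2.",
    "Glucose (C6H12O6) is a simple sugar.",
    "NaCl is common table salt.",
    "Random text without formulas.",
    "Another example: CH3COOH (acetic acid)."
]

# One unit = element symbol ([A-Z][a-z]?) plus an optional count (\d*).
# {2,} asks for two or more units; \b keeps the match on word boundaries.
formula_re = re.compile(r'\b(?:[A-Z][a-z]?\d*){2,}\b')

formulas = [f for text in texts for f in formula_re.findall(text)]
print(formulas)
```

The pattern only checks the shape of a token, so all-caps acronyms such as `NLP` would match too, and single-element formulas such as `C` are skipped.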

In [128]:
# exercise 2 - validating reaction equations
equations = [
    "H2 + O2 -> H2O",
    "C + O2 -> CO2",
    "Invalid equation ->",
    "NaCl + H2O -> NaOH + HCl"
]

pattern1 = r'([A-Z]+\d*[A-Z]*)\s\+\s([A-Z]+\d*[A-Z]*)\s\-\>\s([A-Z]+\d*[A-Z]*)'
pattern2 = r'([A-Z]\d*[a-z]?)+\s\+\s([A-Z]\d*[a-z]?)+\s\-\>\s([A-Z]\d*[a-z]?)+\s\+\s([A-Z]\d*[a-z]?)+'
pattern = pattern1 + "|" + pattern2

for eq in equations:
    print('Matches: ', [match.group() for match in re.finditer(pattern, eq)])

Matches:  ['H2 + O2 -> H2O']
Matches:  ['C + O2 -> CO2']
Matches:  []
Matches:  ['NaCl + H2O -> NaOH + HCl']


### Explanation of the Pattern

The regex pattern used in the previous cell is designed to validate and extract chemical reaction equations. It consists of two main parts combined with the `|` (OR) operator:

1. **Pattern 1**: `([A-Z]+\d*[A-Z]*)\s\+\s([A-Z]+\d*[A-Z]*)\s\-\>\s([A-Z]+\d*[A-Z]*)`
    - Matches simple chemical reactions with two reactants and one product.
    - `([A-Z]+\d*[A-Z]*)`: Matches a chemical formula, which starts with one or more uppercase letters (`[A-Z]+`), optionally followed by digits (`\d*`), and optionally followed by more uppercase letters (`[A-Z]*`).
    - `\s\+\s`: Matches the `+` operator surrounded by spaces.
    - `\s\-\>\s`: Matches the `->` operator surrounded by spaces.

2. **Pattern 2**: `([A-Z]\d*[a-z]?)+\s\+\s([A-Z]\d*[a-z]?)+\s\-\>\s([A-Z]\d*[a-z]?)+\s\+\s([A-Z]\d*[a-z]?)+`
    - Matches reactions with exactly two reactants and two products, and allows two-letter element symbols such as `Na` or `Cl`.
    - `([A-Z]\d*[a-z]?)`: Matches a chemical formula with an uppercase letter (`[A-Z]`), optionally followed by digits (`\d*`), and optionally followed by a lowercase letter (`[a-z]?`).
    - `\s\+\s`: Matches the `+` operator surrounded by spaces.
    - `\s\-\>\s`: Matches the `->` operator surrounded by spaces.

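Together, the two alternatives cover the two shapes in the example list: two reactants with one product, and two reactants with two products. Equations with other numbers of species are not handled reliably, and because the pattern is searched with `re.finditer` rather than anchored to the whole string, it extracts matching substrings instead of strictly validating each equation.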
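
For strict validation with any number of reactants and products, one option is to build the pattern from smaller pieces and use `re.fullmatch`, which only succeeds if the entire string fits. The names `FORMULA`, `SIDE`, `EQUATION_RE`, and `is_valid_equation` below are introduced for this sketch only.

```python
import re

# A formula is one or more element symbols, each with an optional count
FORMULA = r'(?:[A-Z][a-z]?\d*)+'
# One side of the equation: formulas separated by " + "
SIDE = rf'{FORMULA}(?:\s\+\s{FORMULA})*'
# Full equation: side -> side
EQUATION_RE = re.compile(rf'{SIDE}\s->\s{SIDE}')

def is_valid_equation(equation):
    """Return True if the whole string has the shape 'A + B -> C + D'."""
    return EQUATION_RE.fullmatch(equation) is not None

equations = [
    "H2 + O2 -> H2O",
    "C + O2 -> CO2",
    "Invalid equation ->",
    "NaCl + H2O -> NaOH + HCl"
]

valid = [eq for eq in equations if is_valid_equation(eq)]
print(valid)
```

This checks syntax only. It does not check that element symbols exist or that the equation is balanced (`H2 + O2 -> H2O` passes even though it is not balanced).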

In [152]:
# exercise 3 - extracting units
data = [
    "The pressure is 101325 Pa.",
    "Flow rate: 1.5 m^3/s.",
    "Mass: 25 kg.",
    "Concentration: 0.1 mol/L.",
    'Heat Capacity: 1 cal/mol.C'
]

pattern1 = r'(\d\s?[A-z][a-z]?\^?\d?[a-z]*\/?[A-Za-z]*\.?[A-Z]*)+'
pattern2 = r''
pattern = pattern1 # + "|" + pattern2

for text in data:
    matches = [match.group() for match in re.finditer(pattern, text)]
    # each match looks like "<digit> <unit>"; keep only the unit
    cleaned = [m.split()[1] for m in matches]
    # drop a sentence-ending period if the match picked one up
    print('Matches: ', [c[:-1] if c[-1] == '.' else c for c in cleaned])

Matches:  ['Pa']
Matches:  ['m^3/s']
Matches:  ['kg']
Matches:  ['mol/L']
Matches:  ['cal/mol.C']


### Explanation of the Pattern

The regex pattern `(\d\s?[A-z][a-z]?\^?\d?[a-z]*\/?[A-Za-z]*\.?[A-Z]*)+` is designed to extract units from text. Here's a breakdown:

1. `\d\s?`: Matches a digit (`\d`) optionally followed by a space (`\s?`), ensuring the unit is associated with a numerical value.
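2. `[A-z]`: Intended to match any letter. Note that the range from `A` to `z` also covers the six ASCII characters that sit between `Z` and `a` (`[`, `\`, `]`, `^`, `_`, and the backtick), so `[A-Za-z]` is the safer way to write it.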
3. `[a-z]?`: Optionally matches a lowercase letter, allowing for units like `kg` or `Pa`.
4. `\^?\d?`: Optionally matches a caret (`^`) and, independently, an optional digit, capturing units with exponents like `m^3`.
5. `[a-z]*`: Matches zero or more lowercase letters, accommodating extended unit names like `mol`.
6. `\/?`: Optionally matches a forward slash (`/`), supporting compound units like `m^3/s`.
7. `[A-Za-z]*`: Matches zero or more letters, allowing for additional characters in complex units.
8. `\.?[A-Z]*`: Optionally matches a period (`.`) followed by uppercase letters, capturing units with suffixes like `mol.C`.
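
The difference between `[A-z]` and `[A-Za-z]` is easy to check directly. The test string below is made up to include the ASCII punctuation that lies between `Z` and `a`.

```python
import re

# Hypothetical test string mixing letters with _, ^, [ and ]
sample = 'a_Z^[b]'

# [A-z] spans ASCII 65 to 122, which includes [ \ ] ^ _ and the backtick
wide = re.findall(r'[A-z]', sample)

# [A-Za-z] matches letters only
letters_only = re.findall(r'[A-Za-z]', sample)

print(wide)
print(letters_only)
```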

In [163]:
# exercise 4 - parsing logs
logs = [
    "[2023-10-01 12:00:00] ERROR: Code 1234 - Reactor temperature too high.",
    "[2023-10-01 12:05:00] INFO: System stable.",
    "[2023-10-01 12:10:00] ERROR: Code 5678 - Pressure drop detected."
]

tms_pattern = r'\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}'
error_pattern = r'ERROR\:\sCode\s\d{4}'

for log in logs:
    timestamp = re.findall(tms_pattern, log)
    error_code = re.findall(error_pattern, log)

    print('Log: ', log)
    print('Timestamps: ', timestamp)
    print('Error Codes: ', [e.split()[2] for e in error_code])

Log:  [2023-10-01 12:00:00] ERROR: Code 1234 - Reactor temperature too high.
Timestamps:  ['2023-10-01 12:00:00']
Error Codes:  ['1234']
Log:  [2023-10-01 12:05:00] INFO: System stable.
Timestamps:  ['2023-10-01 12:05:00']
Error Codes:  []
Log:  [2023-10-01 12:10:00] ERROR: Code 5678 - Pressure drop detected.
Timestamps:  ['2023-10-01 12:10:00']
Error Codes:  ['5678']


### Breakdown of Patterns

#### 1. `tms_pattern = r'\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}'`
- Matches timestamps in the format `YYYY-MM-DD HH:MM:SS`:
    - `\d{4}`: Matches a 4-digit year.
    - `\-`: Matches the hyphen (`-`) separating year, month, and day.
    - `\d{2}`: Matches a 2-digit month and day.
    - `\s`: Matches a space separating the date and time.
    - `\d{2}\:\d{2}\:\d{2}`: Matches the time in `HH:MM:SS` format, where `\:` matches the colon (`:`).

#### 2. `error_pattern = r'ERROR\:\sCode\s\d{4}'`
- Matches error codes in the format `ERROR: Code XXXX`:
    - `ERROR\:`: Matches the word `ERROR` followed by a colon (`:`).
    - `\s`: Matches a space.
    - `Code`: Matches the word `Code`.
    - `\s`: Matches another space.
    - `\d{4}`: Matches a 4-digit error code.
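
The expected output of the exercise lists timestamps for the `ERROR` lines only. One way to get that, sketched below, is a single pattern with two capturing groups, so each timestamp stays paired with its error code. The name `error_re` is introduced for this sketch.

```python
import re

logs = [
    "[2023-10-01 12:00:00] ERROR: Code 1234 - Reactor temperature too high.",
    "[2023-10-01 12:05:00] INFO: System stable.",
    "[2023-10-01 12:10:00] ERROR: Code 5678 - Pressure drop detected."
]

# Group 1 captures the timestamp inside the brackets, group 2 the error code.
# INFO lines do not contain "ERROR: Code", so they do not match at all.
error_re = re.compile(
    r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] ERROR: Code (\d{4})'
)

timestamps, error_codes = [], []
for log in logs:
    m = error_re.search(log)
    if m:
        timestamps.append(m.group(1))
        error_codes.append(m.group(2))

print('Timestamps:', timestamps)
print('Error Codes:', error_codes)
```

Capturing the code with a group also removes the need for the `split()` step.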


In [183]:
# exercise 5 - extracting simulation parameters
parameters = """
Temperature: 300 K
Pressure: 101325 Pa
FlowRate: 1.5 m^3/s
Concentration: 0.1 mol/L
"""

pattern = r'[A-Za-z]*\:\s\d*\.?\d*\s[A-Za-z]\^?\d*[a-z]*/?[A-Za-z]*'

parameters_list = re.findall(pattern, parameters)
sim_params = {}
for item in parameters_list:
    key, value, unit = item.split()
    # drop the trailing colon so the keys are clean parameter names
    sim_params[key.rstrip(':')] = {
        'value': float(value),
        'unit': unit
    }

print(sim_params)

{'Temperature': {'value': 300.0, 'unit': 'K'}, 'Pressure': {'value': 101325.0, 'unit': 'Pa'}, 'FlowRate': {'value': 1.5, 'unit': 'm^3/s'}, 'Concentration': {'value': 0.1, 'unit': 'mol/L'}}


### Breakdown of Patterns

The regex pattern `[A-Za-z]*\:\s\d*\.?\d*\s[A-Za-z]\^?\d*[a-z]*/?[A-Za-z]*` is designed to extract simulation parameters in the format `Key: Value Unit`. Here's a detailed breakdown:

1. `[A-Za-z]*`: Matches the parameter name (key), which consists of zero or more letters (uppercase or lowercase).
2. `\:`: Matches the colon (`:`) that separates the key from the value.
3. `\s`: Matches a space after the colon.
4. `\d*`: Matches zero or more digits, representing the numerical value.
5. `\.?`: Optionally matches a decimal point (`.`) for floating-point numbers.
6. `\d*`: Matches zero or more digits after the decimal point.
7. `\s`: Matches a space between the value and the unit.
8. `[A-Za-z]`: Matches the first character of the unit, which must be a letter (uppercase or lowercase).
9. `\^?`: Optionally matches a caret (`^`), used for exponents in units like `m^3`.
10. `\d*`: Matches zero or more digits after the caret, representing the exponent.
11. `[a-z]*`: Matches zero or more lowercase letters, allowing for extended unit names like `mol`.
12. `/?`: Optionally matches a forward slash (`/`), supporting compound units like `m^3/s`.
13. `[A-Za-z]*`: Matches zero or more letters, allowing for additional characters in complex units.
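
If only the flat `key: value` mapping from the exercise statement is needed, a simpler line-based pattern may be enough. The sketch below assumes one `Key: value` pair per line and uses `re.MULTILINE` so that `^` and `$` anchor at line boundaries; the group names `key` and `value` are illustrative.

```python
import re

parameters = """
Temperature: 300 K
Pressure: 101325 Pa
FlowRate: 1.5 m^3/s
Concentration: 0.1 mol/L
"""

# key   = letters at the start of a line
# value = everything after the colon, without surrounding spaces or tabs
param_re = re.compile(
    r'^(?P<key>[A-Za-z]+):[ \t]*(?P<value>\S.*?)[ \t]*$',
    re.MULTILINE
)

sim_params = {m.group('key'): m.group('value') for m in param_re.finditer(parameters)}
print(sim_params)
```

This version does not validate the value or unit; it trades strictness for tolerance of units the stricter pattern does not anticipate.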

In [184]:
# exercise 6 - extracting valid CAS number
texts = [
    "The CAS number for water is 7732-18-5.",
    "Methane has a CAS number of 74-82-8.",
    "Invalid CAS: 123-456.",
    "Another valid CAS: 50-00-0."
]

pattern = r'\d*\-\d*\-\d'

for text in texts:
    print('Text: ', text)
    print('CAS number: ', re.findall(pattern, text))

Text:  The CAS number for water is 7732-18-5.
CAS number:  ['7732-18-5']
Text:  Methane has a CAS number of 74-82-8.
CAS number:  ['74-82-8']
Text:  Invalid CAS: 123-456.
CAS number:  []
Text:  Another valid CAS: 50-00-0.
CAS number:  ['50-00-0']


### Breakdown of Patterns

#### Pattern: `\d*\-\d*\-\d`
This pattern is designed to match CAS (Chemical Abstracts Service) numbers, which consist of three hyphen-separated parts: two to seven digits, two digits, and a single check digit. Here's the breakdown:

1. `\d*`: Matches zero or more digits. This captures the first part of the CAS number, which can have varying lengths.
2. `\-`: Matches the hyphen (`-`) that separates the sections of the CAS number.
3. `\d*`: Matches zero or more digits. This captures the second part of the CAS number, which has two digits.
4. `\-`: Matches the second hyphen (`-`) in the CAS number.
5. `\d`: Matches exactly one digit. This captures the final part of the CAS number, which always has a single digit.

This pattern matches valid CAS numbers like `7732-18-5` or `50-00-0` and rejects `123-456`, which has only one hyphen. Because `\d*` also accepts zero digits or arbitrarily long runs, it is permissive: strings such as `--5` or `123456789-1-2` would match as well.
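
A stricter check can combine a tighter pattern with the CAS check-digit rule: the last digit equals the weighted sum of the other digits (weights 1, 2, 3, ... counted from the right) modulo 10. The sketch below is one way to implement this; the names `CAS_RE`, `has_valid_checksum`, and `find_cas_numbers` are introduced here for illustration.

```python
import re

# 2-7 digits, 2 digits, 1 check digit, delimited by word boundaries
CAS_RE = re.compile(r'\b(\d{2,7})-(\d{2})-(\d)\b')

def has_valid_checksum(first, second, check):
    """Apply the CAS check-digit rule to the three parts of a candidate."""
    digits = (first + second)[::-1]  # rightmost digit gets weight 1
    total = sum(int(d) * weight for weight, d in enumerate(digits, start=1))
    return total % 10 == int(check)

def find_cas_numbers(text):
    """Return candidates in text that have the right shape and check digit."""
    return [m.group(0) for m in CAS_RE.finditer(text)
            if has_valid_checksum(*m.groups())]

texts = [
    "The CAS number for water is 7732-18-5.",
    "Methane has a CAS number of 74-82-8.",
    "Invalid CAS: 123-456.",
    "Another valid CAS: 50-00-0."
]

found = [cas for text in texts for cas in find_cas_numbers(text)]
print(found)
```

For water, the sum is 8·1 + 1·2 + 2·3 + 3·4 + 7·5 + 7·6 = 105, and 105 mod 10 = 5, which matches the final digit of `7732-18-5`.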

In [193]:
# exercise 7 - extracting numerical data
results = [
    "The reaction rate is 1.23e-4 mol/s.",
    "Temperature: 300 K.",
    "Pressure: 1.01325e5 Pa.",
    "Flow rate: 1.5 m^3/s."
]

pattern = r'\s\d+\.?\d*e?\-?\d*'

for result in results:
    print('Text: ', result)
    print('Numerical Data: ', [s.strip() for s in re.findall(pattern, result)])

Text:  The reaction rate is 1.23e-4 mol/s.
Numerical Data:  ['1.23e-4']
Text:  Temperature: 300 K.
Numerical Data:  ['300']
Text:  Pressure: 1.01325e5 Pa.
Numerical Data:  ['1.01325e5']
Text:  Flow rate: 1.5 m^3/s.
Numerical Data:  ['1.5']


### Breakdown of Patterns

#### Pattern: `\s\d+\.?\d*e?\-?\d*`
This pattern is designed to extract numerical data, including integers, decimals, and scientific notation. Here's the breakdown:

1. `\s`: Matches a space before the number, ensuring the number is not part of a word.
2. `\d+`: Matches one or more digits, representing the integer or the whole number part of a decimal.
3. `\.?`: Optionally matches a decimal point (`.`), allowing for both integers and decimals.
4. `\d*`: Matches zero or more digits after the decimal point, capturing the fractional part of a decimal number.
5. `e?`: Optionally matches the character `e`, which is used in scientific notation.
6. `\-?`: Optionally matches a minus sign (`-`), which may appear in the exponent of scientific notation.
7. `\d*`: Matches zero or more digits, representing the exponent in scientific notation.
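
Requiring a leading space means a number at the very start of a string is missed, and the optional pieces can combine in unintended ways (for example, `5-3` would be read as one token). A possible tighter variant, sketched below, groups the decimal part and the exponent and uses a negative lookbehind instead of `\s`. The name `number_re` and the extra test string are assumptions for illustration.

```python
import re

results = [
    "The reaction rate is 1.23e-4 mol/s.",
    "Temperature: 300 K.",
    "Pressure: 1.01325e5 Pa.",
    "Flow rate: 1.5 m^3/s."
]

# (?<![\w^.])         : not preceded by a word character, ^ or . (skips the 3 in m^3)
# [-+]?\d+            : optional sign and integer part
# (?:\.\d+)?          : optional decimal part, kept together as one unit
# (?:[eE][-+]?\d+)?   : optional exponent, kept together as one unit
number_re = re.compile(r'(?<![\w^.])[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?')

numbers = [n for result in results for n in number_re.findall(result)]
print(numbers)

# A number at the start of the string is found as well (hypothetical input)
leading = number_re.findall("42 kg of catalyst")
print(leading)
```

No `strip()` is needed afterwards, because the lookbehind checks the preceding character without including it in the match.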

## Class Summary

This class focused on the practical application of Regular Expressions (Regex) in various contexts, including text processing, chemical and process engineering, and data extraction. Key highlights include:

- **Regex Basics**: Introduction to regex patterns and their components, such as character classes, quantifiers, and special characters.
- **Text Cleaning**: Extracting phone numbers, email addresses, and CAS numbers from noisy or formatted text.
- **Chemical Applications**:
    - Extracting chemical formulas from unstructured text.
    - Validating chemical reaction equations.
- **Data Extraction**:
    - Parsing experimental data to extract units and numerical values.
    - Extracting simulation parameters as key-value pairs.
- **Log Analysis**: Identifying timestamps and error codes in process logs.
- **Pattern Refinement**: Iterative improvement of regex patterns to handle complex cases, such as scientific notation, compound units, and domain-specific formats.

The class emphasized the versatility of regex in automating repetitive tasks, improving data quality, and enabling efficient information retrieval.