# File Parsing using Regular Expressions

```{admonition} Overview
:class: overview

Questions:

* How do I pull information from files when I need to match a certain pattern of text?

* What are regular expressions?

Objectives:

* Use regular expressions to pull chemical formulas from a text file.
```

In the file parsing lesson, we learned how to parse files using `open` and `readlines`, then looping through the lines to find specific phrases. This works fine for some cases, but there are times when it will be either slow or impossible.

Sometimes, you will want to look for text which resembles a certain pattern.

Regular expressions will allow us to define patterns of text we are looking for rather than hardcoding specific phrases.
They are particularly useful when the text you are looking for follows a specific pattern, but can't be defined using simple logic rules.

A commonly used example is looking for phone numbers.
Phone numbers all follow a specific pattern, but have different digits. 
If we thought of rules for a phone number, we might say that it should be three numbers (area code) followed by a dash, followed by three numbers, another dash, then four numbers, giving the format XXX-XXX-XXXX. This would be difficult to look for using the strategies we have learned so far.
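
As a preview of where this lesson is heading, the phone number rules above can be written as one short pattern. This is a minimal sketch using Python's built-in `re` module; the sample text and variable names are made up for illustration, and the pattern syntax is explained later in this lesson.

```python
import re

# Hypothetical sample text; the phone numbers are invented for illustration.
sample = "Call 555-123-4567 or 555-987-6543. Order 12-3456 is not a phone number."

# Three digits, a dash, three digits, a dash, then four digits.
phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

phone_numbers = phone_pattern.findall(sample)
print(phone_numbers)  # ['555-123-4567', '555-987-6543']
```

The order number is skipped because it does not have the three-three-four digit layout, even though it contains digits and a dash.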

For this section, we will be parsing some text pulled from Wikipedia on chemical formulas. 
Our goal will be to extract all of the chemical formulas for hydrocarbons in the text.

````{admonition} Exercise
:class: exercise

Take a moment to think about what rules you would define for:

1. A chemical formula.  
2. Chemical formulas for hydrocarbons.

````


## Practicing Regular Expressions using Pythex

We will be prototyping our regular expression on a website called [pythex](https://pythex.org/). 
Go to this website and paste in the text from the file `data/chemical_formula.txt`.

### Basic Matching
At its simplest, a regular expression matches the exact pattern you specify. For example, if we wanted to find all the places the word `formula` occurred in the text, we could do that by typing the word `formula` into our regular expression field.

### More complicated matching - using metacharacters
Exact matching like this is not that interesting on its own. If you search for `chemical`, for example, you can see that the word is found eight times in the text.
You'll notice that the text must match exactly: Pythex only shows us where the word "chemical" (lowercase c) is found, not "Chemical".
This is where the power of regular expressions comes in.
Let's imagine that you want to look for either the word "chemical" or the word "Chemical". You can modify your regular expression to use special characters that tell it either letter is okay.

To match something other than literal characters, you use special characters in your regular expressions (referred to as metacharacters). Here are some examples of metacharacters:


```
. ^ $ * + ? { } [ ] \ | ( )
```

Each of these metacharacters has a different meaning and can be used to match different patterns.

The first we will use is `[ ]`. For regular expressions, these indicate a class of characters you want to search for. For example, to look for f or F, we could use `[Ff]`. 
Try typing `[Cc]hemical` in the box "Your regular expression" on Pythex.

Another useful metacharacter is `.`, which matches any single character except a newline.

For example, if we wanted to match any character for the first letter of "chemical", we could have written `.hemical` instead of `[Cc]hemical`.
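
To compare literal matching, a character class, and the `.` metacharacter side by side, here is a minimal sketch using Python's `re` module (introduced later in this lesson). The sample sentence is made up for illustration and is not taken from `data/chemical_formula.txt`.

```python
import re

# Hypothetical sample text with three variants of the same word.
sample = "Chemical formulas describe a chemical compound; xhemical is a typo."

# Literal pattern: only the exact lowercase spelling matches.
literal = re.findall(r"chemical", sample)

# Character class: the first letter may be C or c.
either_case = re.findall(r"[Cc]hemical", sample)

# Dot: the first character may be anything except a newline.
any_first = re.findall(r".hemical", sample)

print(literal)      # ['chemical']
print(either_case)  # ['Chemical', 'chemical']
print(any_first)    # ['Chemical', 'chemical', 'xhemical']
```

The dot is the most permissive of the three, so it also picks up the misspelled word. A character class is usually the safer choice when you know which characters are acceptable.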

### Matching Digits and Letters
To match any lowercase letter, we can use `[a-z]`. Similarly, any uppercase letter is written `[A-Z]`. Each of these patterns matches a single letter.

To match any digit, we can use `[0-9]`. You can also use the pattern `\d`.
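
To see these character classes in action outside of Pythex, here is a minimal sketch using Python's `re` module (introduced later in this lesson). The sample string is made up for illustration.

```python
import re

# Hypothetical sample text containing letters and digits.
sample = "C2H6 and H2O"

# Each pattern matches one character at a time.
uppercase = re.findall(r"[A-Z]", sample)
lowercase = re.findall(r"[a-z]", sample)
digits_class = re.findall(r"[0-9]", sample)
digits_shorthand = re.findall(r"\d", sample)

print(uppercase)         # ['C', 'H', 'H', 'O']
print(lowercase)         # ['a', 'n', 'd']
print(digits_class)      # ['2', '6', '2']
print(digits_shorthand)  # ['2', '6', '2']
```

For plain ASCII text like this, `[0-9]` and `\d` give the same result.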

### Repeating Patterns
You can also specify that a pattern repeats by following the pattern with metacharacters. The metacharacter `{N}` can be used to specify an exact number of repeats.

If you want to allow an unspecified number of repeats, you can use `*` for any number of repeats including 0, or `+` for a pattern repeating one or more times. The `?` matches either 0 or 1 times.
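
The difference between the repetition metacharacters above is easiest to see when they are applied to the same text. The following is a minimal sketch using Python's `re` module (introduced later in this lesson); the sample string is made up for illustration.

```python
import re

# Hypothetical sample text: H followed by zero, one, or two digits.
sample = "C CH CH4 C10H22"

# {2}: exactly two digits in a row.
exactly_two = re.findall(r"\d{2}", sample)

# *: H followed by zero or more digits.
zero_or_more = re.findall(r"H\d*", sample)

# +: H followed by at least one digit.
one_or_more = re.findall(r"H\d+", sample)

# ?: H followed by at most one digit.
zero_or_one = re.findall(r"H\d?", sample)

print(exactly_two)   # ['10', '22']
print(zero_or_more)  # ['H', 'H4', 'H22']
print(one_or_more)   # ['H4', 'H22']
print(zero_or_one)   # ['H', 'H4', 'H2']
```

Note that `H\d?` stops after one digit, so it only captures `H2` from `H22`.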

## Matching Hydrocarbons

Now that we know some basic rules of regular expressions, let's try using what we have learned to find the chemical formulas for hydrocarbons in the text.

To find hydrocarbons, we must first think about what rules define a hydrocarbon formula.
To do this, you might think of how you would tell someone who is not a chemist to look for hydrocarbon formulas. 
After deciding this, we will translate the rules to regular expressions.

1. A hydrocarbon formula will be a word (surrounded by spaces) containing only letters and numbers.
2. The letters will only be capital C's and H's.
3. The C's and H's might be followed by numbers, as is the case for C2H6. However, they might not be followed by numbers (CH3).
4. The formula could be a pattern of several CH groups together (CH3CH2CH3).

Using these rules, we might initially build the following regular expression:

```
C\dH\d
```

This says "capital C, followed by a digit, followed by an H, followed by a digit."


Next, we will account for the part of rule 3 that says "However, they might not be followed by numbers (CH3)." Replacing each `\d` with `\d*` allows zero or more digits after each letter:

```
C\d*H\d*
```

Add parentheses around your regular expression to group it.

```
(C\d*H\d*)
```

In Pythex, you should now see a list of matches on the side.
Notably, you will see that the formula "CH3CH2CH2CH2CH2CH3" appears as several matches instead of one formula.
We can add a plus after this pattern to indicate that it can repeat one or more times, then get the full match by adding another set of parentheses:

```
((C\d*H\d*)+)
```

The use of parentheses creates a `group` in regular expressions, and groups also define what is returned to us for a match.
If we don't want the hydrocarbon subunit returned, we can make it into a "non-capturing" group by adding `?:` at the beginning of the group.

```
((?:C\d*H\d*)+)
```
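
To make the effect of grouping concrete, the three versions of the pattern can be compared in Python. This is a minimal sketch using `re.findall` from Python's `re` module (introduced later in this lesson); the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text containing one condensed formula.
sample = "Hexane can be written as CH3CH2CH2CH2CH2CH3 in condensed form."

# One group, no repeat: every CH subunit is a separate match.
subunits = re.findall(r"(C\d*H\d*)", sample)

# Two capturing groups: findall returns a tuple per match,
# holding the full formula and the last subunit that was matched.
both_groups = re.findall(r"((C\d*H\d*)+)", sample)

# Non-capturing inner group: only the full formula is returned.
full_only = re.findall(r"((?:C\d*H\d*)+)", sample)

print(subunits)     # ['CH3', 'CH2', 'CH2', 'CH2', 'CH2', 'CH3']
print(both_groups)  # [('CH3CH2CH2CH2CH2CH3', 'CH3')]
print(full_only)    # ['CH3CH2CH2CH2CH2CH3']
```

With the non-capturing group, each match comes back as a plain string rather than a tuple, which is usually easier to work with.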

Finally, we will want to make sure we are not getting chemical formulas that contain other elements.
We could ensure this by requiring spaces or punctuation around our pattern. In our example, the formulas are followed by a space or a comma, so we could write the following at the end of our regular expression:

```
[\s,]
```

Making our full regular expression:

```
((?:C\d*H\d*)+)[\s,]
```
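
To check what the trailing `[\s,]` changes, the pattern can be run with and without it on a short string. This is a minimal sketch using Python's `re` module; the sample sentence (including the glucose formula) is made up for illustration and is not taken from `data/chemical_formula.txt`.

```python
import re

# Hypothetical sample text: two hydrocarbons and one formula containing oxygen.
sample = "Ethane is C2H6, glucose is C6H12O6, and propane is CH3CH2CH3 in condensed form."

# Without the trailing class, part of the glucose formula is matched.
without_boundary = re.findall(r"((?:C\d*H\d*)+)", sample)

# With the trailing class, the formula must be followed by whitespace or a comma.
with_boundary = re.findall(r"((?:C\d*H\d*)+)[\s,]", sample)

print(without_boundary)  # ['C2H6', 'C6H12', 'CH3CH2CH3']
print(with_boundary)     # ['C2H6', 'CH3CH2CH3']
```

One limitation of this approach is that a formula at the very end of the text, or one followed by other punctuation such as a period, would not be matched. If your text contains those cases, the character class would need to be extended.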

## Using Regular Expressions in Python

You can use regular expressions in Python through the `re` module.
The following code block uses the regular expression we've built to pull matches from the file.

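```python
import re

# Read the whole file into a single string
with open("data/chemical_formula.txt") as f:
    text = f.read()

# Raw string (r"...") so the backslashes reach the regex engine unchanged
pattern = re.compile(r"((?:C\d*H\d*)+)[\s,]")

# findall returns the contents of the capturing group for each match
matches = pattern.findall(text)

print(matches)
```

```
['C6H14', 'CH3CH2CH2CH2CH2CH3']
```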


You can read more about Python regular expressions in the [Python Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html) and in the [Python documentation](https://docs.python.org/3/library/re.html) for the `re` module.


````{admonition} Exercise
:class: exercise

Use what you've learned about regular expressions to pull data from [this Spartan file](https://raw.githubusercontent.com/doylelab/rxnpredict/master/spartan_molecules/1-bromo-4-(trifluoromethyl)benzene.spardir/M0001/output).

Try to:

1. Pull out data matching the pattern "data : value"
2. Pull out tables.

````