# Regular Expressions

Regular expressions match patterns in text. Sometimes described as "wildcards on steroids," they offer powerful features for processing text, including:
- **Validate:** Ensuring text input matches a specific pattern, like a 5-digit zip code (`^\d{5}$`).
- **Extract:** Pulling specific text out of a document, such as all spelling variations of the word "color" (`colou?r`) or "hematology" (`ha?ematology`).
- **Replace:** Finding and replacing text, such as changing all instances of 'OH' to 'Ohio' (`\bOH\b`).
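
The three uses listed above can be sketched with Python's `re` module. This is a minimal illustration; the sample strings and variable names are made up for the example.

```python
import re

# Validate: the whole string must be exactly five digits.
zip_pattern = r"^\d{5}$"
is_valid = bool(re.match(zip_pattern, "43210"))
is_invalid = bool(re.match(zip_pattern, "4321"))

# Extract: find both spellings of "color" in a sentence.
spellings = re.findall(r"colou?r", "The color red and the colour blue.")

# Replace: swap the standalone abbreviation OH for Ohio.
replaced = re.sub(r"\bOH\b", "Ohio", "Columbus, OH 43210")

print(is_valid, is_invalid)  # True False
print(spellings)             # ['color', 'colour']
print(replaced)              # Columbus, Ohio 43210
```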

Regular expressions are supported in many programming languages and software programs, including Python, JavaScript, and Tableau.

Character sets and repetition counts are the basic building blocks:
- `[A-Z]` matches any one uppercase letter.
- `[a-z]` matches any one lowercase letter.
- `[0-9]` matches any one digit.
- `[0-9]{5}` matches a run of exactly 5 digits.
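
As a quick check of these character sets, the sketch below runs each one against a short sample string (the string is a made-up example, not course data).

```python
import re

sample = "Columbus, OH 43210"  # hypothetical sample text

uppercase = re.findall(r"[A-Z]", sample)       # each uppercase letter
lowercase = re.findall(r"[a-z]", sample)       # each lowercase letter
digits = re.findall(r"[0-9]", sample)          # each single digit
five_digits = re.findall(r"[0-9]{5}", sample)  # runs of exactly five digits

print(uppercase)    # ['C', 'O', 'H']
print(lowercase)    # ['o', 'l', 'u', 'm', 'b', 'u', 's']
print(digits)       # ['4', '3', '2', '1', '0']
print(five_digits)  # ['43210']
```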


[Metacharacters](https://www.w3schools.com/python/gloss_python_regex_metacharacters.asp) have special meaning in regular expressions and must be escaped with a `\` when you want to match them literally. Metacharacters include:
- `[ ]` A set of characters, e.g., `[a-f]`.
- `\` Start of a special sequence, e.g., `\w`, or an escape for a metacharacter, e.g., `\.`.
- `.` Any character except newline, e.g., `d.g`.
- `^` Starts with, e.g., `^Ohio`.
- `$` Ends with, e.g., `State$`.
- `*` Zero or more occurrences of the preceding character, e.g., `ha*ematology`.
- `+` One or more occurrences of the preceding character, e.g., `spe+s`.
- `?` Zero or one occurrence of the preceding character, e.g., `pa?ediatric`.
- `{ }` Exactly the specified number of occurrences, e.g., `[0-9]{5}`.
- `|` Either or, e.g., `Ohio State|OSU`.
- `( )` Capture or group.
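
The sketch below illustrates how several of these metacharacters behave, including the difference between an escaped and an unescaped `.`. All sample strings are made up for the example.

```python
import re

# An unescaped "." matches any character; "\." matches only a literal period.
any_char = re.findall(r"3.5", "3.5 and 3x5")      # ['3.5', '3x5']
literal_dot = re.findall(r"3\.5", "3.5 and 3x5")  # ['3.5']

# "*" allows zero or more of the preceding character; "+" requires at least one.
star = re.findall(r"lo*k", "lk lok look")  # ['lk', 'lok', 'look']
plus = re.findall(r"lo+k", "lk lok look")  # ['lok', 'look']

# "?" makes the preceding character optional (zero or one occurrence).
optional = re.findall(r"pa?ediatric", "pediatric and paediatric")

# "|" matches either alternative.
either = re.findall(r"Ohio State|OSU", "OSU is The Ohio State University")

# "^" and "$" anchor the pattern to the start and end of the string.
starts = bool(re.search(r"^Ohio", "Ohio State"))  # True
ends = bool(re.search(r"State$", "Ohio State"))   # True

print(any_char, literal_dot, star, plus, optional, either, starts, ends)
```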

## Learning Resources

### [regex101: build, test, debug](https://regex101.com/)
Helps you write regular expressions for multiple programming languages. Allows you to test your regular expression against a sample string of your data, and explains your regex as you type. Also includes a searchable quick reference of regex syntax.

### [Learning Regular Expressions](https://library.ohio-state.edu/record=b9497158~S7)
This ebook by Ben Forta is available through the Libraries' [Safari Books](https://library.ohio-state.edu/record=e1002334T~S7) package of technical books and videos. Each chapter is organized as a lesson teaching you how to match a single character or set of characters, utilize metacharacters, and more.

# [re module](https://docs.python.org/3/library/re.html)
To use regular expressions in Python, import the regular expressions module.

`import re`



## [re.match( )](https://docs.python.org/3/library/re.html#re.match)
`re.match(pattern, string, flags=0)`

Returns a match object if zero or more characters at the **beginning** of the string match the regular expression pattern; otherwise returns `None`.
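
A minimal, self-contained sketch of this behavior follows. The two affiliation strings are hypothetical stand-ins for rows of a CSV file, chosen so that one starts with the pattern and one does not.

```python
import re

# Hypothetical affiliation strings (not taken from the course data).
affiliations = [
    "The Ohio State University, Columbus, OH",
    "Department of Surgery, The Ohio State University",
]

pattern = r"The Ohio State University"
results = [re.match(pattern, text) for text in affiliations]

# The first string starts with the pattern, so a match object is returned.
print(results[0].group())  # The Ohio State University

# The second string contains the pattern, but not at the beginning.
print(results[1])          # None
```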

In [None]:
import pandas as pd
import re

# Load the affiliations and drop rows with no affiliation text.
addresses=pd.read_csv('pubmed_author_affiliations.csv')
addresses=addresses.dropna(subset=['affiliation'])

# Check the first 200 rows.
for idx, row in addresses.iloc[0:200].iterrows():
    affiliation=str(row.affiliation)
    # print(affiliation)
    # re.match succeeds only when the pattern occurs at the very start of the string.
    osu_match=re.match(r"The Ohio State University",affiliation)
    if osu_match:
        print(f" MATCH {osu_match.group()}: {affiliation}")
    else:
        print("no match")


## [re.search( )](https://docs.python.org/3/library/re.html#re.search)
`re.search(pattern, string, flags=0)`

Scans through the string and returns a match object for the first location where the regular expression pattern matches; returns `None` if there is no match anywhere in the string.
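
To show how this differs from `re.match( )`, the sketch below applies both functions to one string in which the pattern appears partway through. The affiliation string is a made-up example.

```python
import re

# Hypothetical affiliation where the university name is not at the start.
affiliation = "Department of Surgery, The Ohio State University, Columbus, OH"
pattern = r"The Ohio State University"

match_result = re.match(pattern, affiliation)    # pattern is not at the start
search_result = re.search(pattern, affiliation)  # pattern is found later on

print(match_result)           # None
print(search_result.group())  # The Ohio State University

# start() and end() give the position of the match within the string.
start, end = search_result.start(), search_result.end()
print(affiliation[start:end])  # The Ohio State University
```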

In [None]:
import pandas as pd
import re

# Load the affiliations and drop rows with no affiliation text.
addresses=pd.read_csv('pubmed_author_affiliations.csv')
addresses=addresses.dropna(subset=['affiliation'])

# Check the first 200 rows.
for idx, row in addresses.iloc[0:200].iterrows():
    affiliation=str(row.affiliation)
    # print(affiliation)
    # re.search finds the pattern anywhere in the string, not just at the start.
    osu_search=re.search(r"The Ohio State University",affiliation)
    if osu_search:
        print(osu_search.group())
    else:
        print(f"No match: affiliation = {affiliation}")
    


## [re.findall( )](https://docs.python.org/3/library/re.html#re.findall)
`re.findall(pattern, string, flags=0)`

Scans the string from left to right and returns all non-overlapping matches as a list of strings, or as a list of tuples if the pattern contains more than one capture group.
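
The sketch below shows both return shapes on a made-up sentence: a list of strings when the pattern has no capture groups, and a list of tuples when it has several.

```python
import re

text = "Call 614-555-0100 or 614-555-0199."  # hypothetical sample text

# No capture groups: each list item is the full matched string.
no_groups = re.findall(r"\d{3}-\d{3}-\d{4}", text)

# Three capture groups: each list item is a tuple of the captured parts.
with_groups = re.findall(r"(\d{3})-(\d{3})-(\d{4})", text)

print(no_groups)    # ['614-555-0100', '614-555-0199']
print(with_groups)  # [('614', '555', '0100'), ('614', '555', '0199')]
```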


In [None]:
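# FIND TORTOISES AT THE NATIONAL ZOO

import pandas as pd
import re

df=pd.read_csv('i_met_the_animals.csv')
# Join every common name into one comma-separated string to search.
animals=df.common_name.tolist()
animals=','.join(animals)
# "A", optional lowercase words, then " tortoise" followed by a comma.
pattern=r"A[a-z]* ?[a-z]* tortoise,"
tortoises=re.findall(pattern, animals)
# Strip the trailing comma from each match before printing.
for each_tortoise in tortoises:
    print(each_tortoise.replace(',',''))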

## [re.sub( )](https://docs.python.org/3/library/re.html#re.sub)
`re.sub(pattern, repl, string, count=0, flags=0)`

Replaces every non-overlapping match of the pattern in the string with `repl` and returns the new string. Set `count` to limit the number of replacements.
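
A minimal sketch of `re.sub( )` on made-up strings follows. It shows replacing every match, limiting replacements with `count`, and reusing captured groups in the replacement text with backreferences such as `\1`.

```python
import re

text = "Columbus, OH and Cleveland, OH"  # hypothetical sample text

# Replace every standalone "OH".
all_replaced = re.sub(r"\bOH\b", "Ohio", text)

# Replace only the first match by setting count=1.
first_only = re.sub(r"\bOH\b", "Ohio", text, count=1)

# Backreferences (\1, \2) insert the text captured by each group.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Forta, Ben")

print(all_replaced)  # Columbus, Ohio and Cleveland, Ohio
print(first_only)    # Columbus, Ohio and Cleveland, OH
print(swapped)       # Ben Forta
```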

In [None]:
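# FIND TORTOISES AT THE NATIONAL ZOO AND REPLACE THE COMMON_NAME WITH "SLOW TORTOISE"
import pandas as pd
import re

df=pd.read_csv('i_met_the_animals.csv')
# Join every common name into one comma-separated string to search.
animals=df.common_name.tolist()
animals=','.join(animals)
# "A", optional lowercase words, then " tortoise" followed by a comma.
pattern=r"A[a-z]* ?[a-z]* tortoise,"
# Replace each matching name, keeping the comma separator.
tortoises_slow=re.sub(pattern,"SLOW TORTOISE,",animals)
tortoises_slow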

# Activity: The Lantern-Part 3

Using the `latern_text.csv` file created last week, remove the paragraph tags in the `publication_text` column.