# Python Regular Expressions for Strings

What if we want to keep all variables starting with "l"? In Python, we can use the built-in **re** module for pattern matching.

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern.

`re.search()` searches a string for a pattern and returns a Match object for the first match, or `None` if there is no match.

In [1]:
import re
m = re.search(pattern=r"Ben", string="My name is Ben") 
m

<re.Match object; span=(11, 14), match='Ben'>

`m.span()` returns the start and end positions of the match (the end index is excluded). \
`m.group(0)` returns the matched substring.

In [2]:
m.span()

(11, 14)

In [3]:
m.group(0)

'Ben'
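Because `re.search()` returns `None` when nothing matches, calling `.span()` or `.group()` directly on the result can raise an `AttributeError`. The sketch below shows one way to guard against that; the helper name `find_name` is hypothetical and exists only for this illustration.

```python
import re

def find_name(text, name):
    """Return the (start, end) span of the first occurrence of `name`, or None."""
    # re.escape makes sure `name` is treated as literal text, not as a pattern
    m = re.search(re.escape(name), text)
    if m is None:  # no match: there is no Match object to query
        return None
    return m.span()

hit = find_name("My name is Ben", "Ben")
miss = find_name("My name is Ben", "Tom")
print(hit)   # (11, 14)
print(miss)  # None
```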

### 1. Build a pattern

To use `re`, we need to first specify the string pattern that we are looking for.

### Meta Characters

Metacharacters are characters with a special meaning. In most cases, they only work together with other characters.

![meta char](./images/meta_char.png)

In [4]:
import re

text = "I love ECO301"
pattern = r"ECO[0-9]"
bool(re.search(pattern,text))

True

In [5]:
pattern = r".+I"  # requires at least one character before "I", but "I" is the first character
bool(re.search(pattern, text))

False
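To make the behaviour of the most common metacharacters concrete, here is a small sketch that runs several patterns against the same string. The patterns are chosen for illustration only and are not a complete list.

```python
import re

text = "I love ECO301"

# Each pattern exercises one metacharacter; the selection is illustrative.
patterns = [
    r"^I",           # ^  : match at the start of the string
    r"301$",         # $  : match at the end of the string
    r"ECO.01",       # .  : any single character (except a newline)
    r"ECO[0-9]{3}",  # {} : exactly three repetitions of the preceding set
    r"ECO[0-9]{4}",  #      four digits are required, but only three exist
    r"love|hate",    # |  : either alternative
    r".+I",          # +  : one or more characters must come before "I"
    r".*I",          # *  : zero or more characters may come before "I"
]

results = {p: bool(re.search(p, text)) for p in patterns}
for p, found in results.items():
    print(f"{p!r:16} {found}")
```

Only `ECO[0-9]{4}` and `.+I` fail to match here: the first asks for a fourth digit that does not exist, and the second needs at least one character in front of an `I` that sits at position 0.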

### Special Sequences
A special sequence is a `\` followed by one of the characters in the list below, and has a special meaning:

![Special Sequences](./images/special_seq.png)
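As a quick illustration of a few special sequences, the sketch below applies `\d`, `\w`, `\s`, `\b`, and `\A` to a made-up string (the sample text is an assumption, not part of the course data).

```python
import re

text = "Room 12B, floor 3"  # hypothetical sample string

digits = re.findall(r"\d+", text)        # \d : digit characters
words = re.findall(r"\w+", text)         # \w : letters, digits and underscore
spaces = re.findall(r"\s", text)         # \s : whitespace characters
whole = re.findall(r"\bfloor\b", text)   # \b : word boundary on both sides
at_start = re.search(r"\Afloor", text)   # \A : only at the start of the string

print(digits)    # ['12', '3']
print(words)     # ['Room', '12B', 'floor', '3']
print(len(spaces))  # 3
print(whole)     # ['floor']
print(at_start)  # None
```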

### Sets

![re sets](./images/re_sets.png)
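A set is written with square brackets and matches exactly one character out of the listed ones. The following sketch, using a made-up string, shows ranges, a negated set, and sets used for case-insensitive matching.

```python
import re

text = "ECO301 and eco202"  # hypothetical sample string

upper = re.findall(r"[A-Z]+", text)      # runs of uppercase letters
lower = re.findall(r"[a-z]+", text)      # runs of lowercase letters
numbers = re.findall(r"[0-9]+", text)    # runs of digits
no_digits = re.findall(r"[^0-9 ]+", text)  # ^ inside a set negates it: no digits, no spaces
courses = re.findall(r"[Ee][Cc][Oo]\d{3}", text)  # either case for each letter

print(upper)      # ['ECO']
print(lower)      # ['and', 'eco']
print(numbers)    # ['301', '202']
print(no_digits)  # ['ECO', 'and', 'eco']
print(courses)    # ['ECO301', 'eco202']
```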

### 2. re Methods

The `re` module offers many useful functions for searching a string for a match. Some examples are:
- `re.search(pattern, string)`: returns a Match object for the first match anywhere in the string, or `None` if there is none
- `re.findall(pattern, string)`: returns a list containing all non-overlapping matches
- `re.sub(pattern, repl, string)`: replaces every match with the string `repl`
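The three functions listed above can be compared side by side on one string. This is a minimal sketch; the sample sentence and the replacement text are assumptions made for the example.

```python
import re

text = "ECO301 meets Monday, ECO302 meets Friday"  # hypothetical sample string
pat = r"ECO(\d+)"  # the parentheses capture the course number

first = re.search(pat, text).group(0)        # first match only
all_codes = re.findall(r"ECO\d+", text)      # every match, as a list of strings
renamed = re.sub(pat, r"ECON \1", text)      # \1 re-inserts the captured number
renamed_once = re.sub(pat, r"ECON \1", text, count=1)  # limit to one replacement

print(first)         # ECO301
print(all_codes)     # ['ECO301', 'ECO302']
print(renamed)       # ECON 301 meets Monday, ECON 302 meets Friday
print(renamed_once)  # ECON 301 meets Monday, ECO302 meets Friday
```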

### 3. Some Applications

1. Import the gpa1 data and find all variables whose names contain `senior`.

In [6]:
import wooldridge as woo
df = woo.data("gpa1")

In [7]:
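pat = r"senior[0-9]"  # "senior" followed by a digit, so plain "senior" is not matched
ls = []
for c in df.columns:
    if re.search(pat, c):  # a Match object is truthy, None is falsy
        ls.append(c)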

In [8]:
ls

['senior5']

In [9]:
[c for c in df.columns if "senior" in c]

['senior', 'senior5']
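The same idea answers the opening question about variables starting with "l": anchor the pattern at the start of the name with `^`. The column names below are hypothetical and only stand in for a real dataset.

```python
import re

# Hypothetical column names, used only for illustration
columns = ["wage", "lwage", "educ", "lsalary", "female", "senior", "senior5"]

# ^l : the name must begin with "l" ("female" contains an "l" but does not start with one)
starts_with_l = [c for c in columns if re.search(r"^l", c)]

# ^senior\d*$ : "senior" optionally followed by digits, and nothing else
senior_vars = [c for c in columns if re.search(r"^senior\d*$", c)]

print(starts_with_l)  # ['lwage', 'lsalary']
print(senior_vars)    # ['senior', 'senior5']
```

With a pandas DataFrame, `df.filter(regex=r"^l")` should select the same columns directly, since `DataFrame.filter` accepts a `regex` argument that is matched against the column labels.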

2. Assume we have the following column names \
["gdp2011", "gdp2012", "gdp2013", ..., "gdp2022"]. \
Obtain gdp variables on or before 2018.

In [10]:
columns = ["gdp"+str(i) for i in range(2011,2023)]

In [11]:
[c for c in columns if re.search(r"gdp201[1-8]", c)]

['gdp2011',
 'gdp2012',
 'gdp2013',
 'gdp2014',
 'gdp2015',
 'gdp2016',
 'gdp2017',
 'gdp2018']
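A character range such as `201[1-8]` works for this cutoff, but it becomes awkward when the range crosses a decade. One alternative, sketched below, is to capture the year with a group and compare it as a number; the cutoff variable `last_year` is an assumption introduced for the example.

```python
import re

columns = ["gdp" + str(year) for year in range(2011, 2023)]
last_year = 2018  # hypothetical cutoff; change it without rewriting the pattern

pat = re.compile(r"^gdp(\d{4})$")  # the group captures the four-digit year

selected = []
for c in columns:
    m = pat.search(c)
    # keep the column only if it matched and its year is within the cutoff
    if m and int(m.group(1)) <= last_year:
        selected.append(c)

print(selected)
```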

3. Assume we have the following column names \
`["male", "name", "Q1_m","Q1_f","Q2_m","Q2_f"]`. \
Obtain `["Q1_m", "Q2_m"]`

In [12]:
columns = ["male", "name", "Q1_m","Q1_f", "Q2_m","Q2_f"]

In [13]:
[c for c in columns if re.search(r"_m$", c)]

['Q1_m', 'Q2_m']

4. Assume we have a column of email addresses. Create an indicator variable `icloud` that equals 1 if the user has an `@icloud` email address and 0 otherwise.

In [14]:
import pandas as pd
email = pd.Series(["aaa@gmail.com","icloud@gmail.com","bbb@icloud.com"])

In [15]:
[1 if re.search(r".+@icloud\.com$", e) else 0 for e in email]

[0, 0, 1]
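Two details are worth checking in a pattern like this: an unescaped `.` matches any character, and without `$` the match may stop before the end of the address. The sketch below uses a plain list of made-up addresses and a compiled, case-insensitive pattern; these choices are assumptions for illustration.

```python
import re

# Hypothetical addresses, including two edge cases
emails = [
    "aaa@gmail.com",
    "icloud@gmail.com",
    "bbb@icloud.com",
    "CCC@ICLOUD.COM",               # same domain, different case
    "ddd@icloud.com.example.org",   # a different domain that merely starts with icloud.com
]

# \. is a literal dot, $ pins the match to the end, IGNORECASE accepts any casing
pat = re.compile(r".+@icloud\.com$", flags=re.IGNORECASE)
icloud = [1 if pat.search(e) else 0 for e in emails]
print(icloud)  # [0, 0, 1, 1, 0]

# An unescaped dot also accepts other characters in that position
loose = bool(re.search(r".+@icloud.com", "x@icloudXcom"))
print(loose)  # True
```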

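5. Replace runs of two or more whitespace characters with a single space, then split the string into words.

In [16]: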
s = "I love      ECO301"

In [17]:
pat = r"\s{2,}"
s=re.sub(pat," ",s)
s

'I love ECO301'

In [18]:
re.findall(r"\w+",s)

['I', 'love', 'ECO301']

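6. Check whether a string contains a phone number of the form `ddd-ddd-dddd` whose first digit is not 0.

In [19]: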
p1 = "123-566-1111"
pat = r"[1-9]\d{2}-\d{3}-\d{4}"
bool(re.search(pat,p1))

True
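Note that `re.search()` succeeds if the pattern appears anywhere in the string. When the whole string must be a phone number, `re.fullmatch()` is a stricter alternative. The sketch below compares the two on a few made-up inputs.

```python
import re

pat = r"[1-9]\d{2}-\d{3}-\d{4}"

# Hypothetical inputs chosen to show where the two functions differ
candidates = [
    "123-566-1111",           # a valid number on its own
    "call 123-566-1111 now",  # a valid number embedded in other text
    "023-566-1111",           # starts with 0
    "123-566-11112",          # one digit too many
]

contains = [bool(re.search(pat, s)) for s in candidates]
is_exact = [bool(re.fullmatch(pat, s)) for s in candidates]

print(contains)  # [True, True, False, True]
print(is_exact)  # [True, False, False, False]
```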

7. Extract the numeric part from strings such as `"2.5%"` and `"2.5 APR"`.

In [20]:
ls = ["2.5%", "2.5 APR"]
pat = r"\d+\.\d*"  # escape the dot so that it matches only a literal decimal point

[re.search(pat, x).group(0) for x in ls if re.search(pat, x)]

['2.5', '2.5']
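If the extracted values are needed as numbers, they can be converted with `float()`. The sketch below also makes the decimal part optional so that whole numbers are picked up; the extra inputs `"rate: 10"` and `"n/a"` are assumptions added to show the edge cases.

```python
import re

ls = ["2.5%", "2.5 APR", "rate: 10", "n/a"]  # last two entries are hypothetical

# \d+ : integer part; (?:\.\d+)? : optional literal dot followed by digits
pat = re.compile(r"\d+(?:\.\d+)?")

values = []
for s in ls:
    m = pat.search(s)
    if m:  # skip strings without any number, such as "n/a"
        values.append(float(m.group(0)))

print(values)  # [2.5, 2.5, 10.0]

# With an unescaped dot, the "." can swallow a non-numeric character
loose = re.search(r"\d+.\d*", "10 APR").group(0)
print(repr(loose))  # '10 '
```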