# **<span style="color:cornflowerblue">Regular expressions</span> for more efficient work**

Regular expressions (regex) are a powerful tool for **<span style="color:cornflowerblue">pattern matching</span>** and text processing. Python provides the `re` module.



In [60]:
import re

### 1. **<span style="color:darksalmon">Match</span>** vs **<span style="color:darksalmon">Search</span>** vs **<span style="color:darksalmon">Findall</span>** patterns

In [61]:
pattern = 'hello'

string_1 = 'hello CAB, hello you'
string_2 = 'CAB, hello; you, hello'

The `re.match()` function checks for a match **only at the beginning of the string**.

In [62]:
print(re.match(pattern, string_1))
print(re.match(pattern, string_2))

<re.Match object; span=(0, 5), match='hello'>
None


The `re.search()` function scans the **entire string** and returns the **first** match it finds.


In [63]:
print(re.search(pattern, string_1))
print(re.search(pattern, string_2))

<re.Match object; span=(0, 5), match='hello'>
<re.Match object; span=(5, 10), match='hello'>


The `re.findall()` function scans the **entire** string and returns **ALL** non-overlapping matches as a list.


In [64]:
print(re.findall(pattern, string_1))
print(re.findall(pattern, string_2))

['hello', 'hello']
['hello', 'hello']
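`re.findall()` returns only the matched text. If the positions are needed as well, one option is `re.finditer()`, which yields one match object per match. A minimal sketch, reusing `string_2` from above:

```python
import re

pattern = 'hello'
string_2 = 'CAB, hello; you, hello'

# re.finditer yields one Match object per non-overlapping match, left to right
spans = [m.span() for m in re.finditer(pattern, string_2)]
print(spans)
```

Each span is the same `(start, end)` pair that `re.search()` reports for a single match.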


### 2. **<span style="color:darksalmon">Actions</span>** on the matches


- **Grouping**: Parentheses `()` create groups; the text matched by each group can be retrieved with `.group(n)`.
- **Substitution**: The `re.sub()` function replaces matches with a specified string.

In [65]:
# Example Grouping
pattern = ', (hello)'
string = 'hello CAB, hello you'

print(re.search(pattern, string))
print(re.search(pattern, string).group())
print(re.search(pattern, string).group(0))
print(re.search(pattern, string).group(1))


<re.Match object; span=(9, 16), match=', hello'>
, hello
, hello
hello
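A pattern may contain several groups, and groups can also be given names with `(?P<name>...)`. The sketch below is only illustrative: the group names `greeting` and `place` are made up for this example.

```python
import re

# Two named groups; the names "greeting" and "place" are hypothetical
pattern = r'(?P<greeting>hello) (?P<place>CAB)'
string = 'hello CAB, hello you'

m = re.search(pattern, string)
print(m.groups())          # all groups as a tuple
print(m.group(2))          # second group, by number
print(m.group('place'))    # same group, by name
```

Named groups tend to be easier to read than numbers once a pattern has more than two or three groups.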


In [66]:
# Example Substitution
pattern = 'work'
string = 'I work at CAB'

new = 'do my PhD'

print(re.sub(pattern, new, string))

I do my PhD at CAB
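Two further features of `re.sub()` are often useful: the `count` argument limits how many matches are replaced, and `\1`, `\2`, ... in the replacement refer to the groups of the match. A small sketch on the string used earlier:

```python
import re

string = 'hello CAB, hello you'

# count=1 replaces only the first match
first_only = re.sub('hello', 'bye', string, count=1)
print(first_only)

# \1 and \2 in the replacement insert the text of groups 1 and 2
swapped = re.sub(r'(hello) (CAB)', r'\2 \1', string)
print(swapped)
```

The replacement is written as a raw string (`r'...'`) so that Python does not interpret the backslashes itself.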


### 3. **<span style="color:cornflowerblue">Example</span>** of use

Note: the patterns are explained in Sect. 4; this section only gives the overall picture before going into the details.

In [67]:
# I have a list of files (observations from different observatories) and I want to get the observatory name from each path
list_files = ['User/Desktop/obs/CAHA/car-20180129T23h12m35s.fits',
              'User/Desktop/obs/ElRoque/hn-20200129T23h12m35s.fits',
              'User/Desktop/obs/LaSilla/h-20250129T23h12m35s.fits',
              'User/Desktop/obs/Parannal/esp-20210129T23h12m35s.fits']


def find_observatory(path):
    pattern = r"obs/([^/]+)/"   # the text matched inside the parentheses is what I am looking for; it is stored in .group(1)
    observatory = re.search(pattern, path).group(1)
    return observatory


for file in list_files:
    print(find_observatory(file))

CAHA
ElRoque
LaSilla
Parannal
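`re.search()` returns `None` when nothing matches, so calling `.group(1)` directly raises an `AttributeError` for a path without an `obs/<name>/` part. A possible defensive variant is sketched below; the function name `find_observatory_or_none` and the second test path are hypothetical.

```python
import re

def find_observatory_or_none(path):
    """Return the directory name after 'obs/', or None if the path has no such part."""
    match = re.search(r"obs/([^/]+)/", path)
    return match.group(1) if match else None

found = find_observatory_or_none('User/Desktop/obs/CAHA/car-20180129T23h12m35s.fits')
missing = find_observatory_or_none('User/Desktop/other/file.fits')  # hypothetical path
print(found)
print(missing)
```

Whether to return `None` or raise an explicit error is a design choice that depends on how the result is used afterwards.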


In [68]:
# I want to build a new path that places each file in a sub-directory named after the year of observation
import os

def newpath_subdir_year(path, observatory):
    # Get the 4 digits after "-"; "[^/]*$" ensures that no "/" follows, so the match is in the file name
    pattern = r"-(\d{4})[^/]*$"
    year = re.search(pattern, path).group(1)

    # Replace "observatory/" with "observatory/year/" in the path
    # (re.escape keeps any special character in the name literal)
    new_path = re.sub(re.escape(observatory + "/"),
                      observatory + "/" + year + "/",
                      path)
    return new_path

for file in list_files:
    observatory = find_observatory(file)
    print(newpath_subdir_year(file, observatory))

User/Desktop/obs/CAHA/2018/car-20180129T23h12m35s.fits
User/Desktop/obs/ElRoque/2020/hn-20200129T23h12m35s.fits
User/Desktop/obs/LaSilla/2025/h-20250129T23h12m35s.fits
User/Desktop/obs/Parannal/2021/esp-20210129T23h12m35s.fits
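When a plain string such as a directory name is passed to `re.sub()` as the pattern, any metacharacters it contains are interpreted as regex syntax. The sketch below illustrates the effect with a made-up directory name `Obs.1` and a made-up path, and shows how `re.escape()` avoids it.

```python
import re

name = 'Obs.1'                          # hypothetical name containing a "."
path = 'data/ObsX1/Obs.1/file.fits'     # hypothetical path

# Unescaped: "." matches any character, so "ObsX1/" is replaced as well
unescaped = re.sub(name + '/', name + '/2018/', path)
print(unescaped)

# Escaped: re.escape makes every character of the name literal
escaped = re.sub(re.escape(name + '/'), name + '/2018/', path)
print(escaped)
```

For names made only of letters, as in the list above, both versions give the same result.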


### 4. Common **<span style="color:darksalmon">Patterns</span>** and **<span style="color:darksalmon">Metacharacters</span>**

#### 4.1 Metacharacters

- `.`: Matches any character except a newline.
- `^`: Matches the beginning of the string (with `re.search`, this behaves like `re.match`).
- `$`: Matches the end of the string (or just before a trailing newline).
- `*`: Matches 0 or more repetitions of the preceding element (a character, a set, or a group).
- `+`: Matches 1 or more repetitions of the preceding element.
- `?`: Matches 0 or 1 repetition of the preceding element.
- `{}`: Specifies the number of repetitions: `{n}` means exactly `n`, `{m,n}` means between `m` and `n`.

##### Examples:

In [83]:
# . (Matches any character except a newline)
print(re.search(r'B.', 'ABCD').group())
print(re.search(r'B.', 'AB\nCD'))             # "B" is followed by a newline, so there is no match
print(re.search(r'B\..', 'AB.CD').group())    # "\." matches a literal dot

BC
None
B.C


In [86]:
# ^ (Matches the beginning of the string)
pattern = '^A'
print(re.search(pattern, 'A1'))
print(re.search(pattern, '1A'))

<re.Match object; span=(0, 1), match='A'>
None


In [87]:
# $ (Matches the end of the string)
pattern = 'A$'
print(re.search(pattern, 'A1'))
print(re.search(pattern, '1A'))

None
<re.Match object; span=(1, 2), match='A'>


In [90]:
# * (Matches 0 or more repetitions of the preceding character)
pattern = 'A*1'
print(re.search(pattern, '1'))
print(re.search(pattern, '1A'))
print(re.search(pattern, 'A1'))
print(re.search(pattern, 'AAA1'))

<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(0, 2), match='A1'>
<re.Match object; span=(0, 4), match='AAA1'>


In [91]:
# + (Matches 1 or more repetitions of the preceding character)
pattern = 'A+1'
print(re.search(pattern, '1'))
print(re.search(pattern, 'A1'))
print(re.search(pattern, 'AAA1'))

None
<re.Match object; span=(0, 2), match='A1'>
<re.Match object; span=(0, 4), match='AAA1'>


In [92]:
# ? (Matches 0 or 1 repetition of the preceding character)
pattern = 'A?1'
print(re.search(pattern, '1'))
print(re.search(pattern, 'A1'))
print(re.search(pattern, 'AAA1'))

<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(0, 2), match='A1'>
<re.Match object; span=(2, 4), match='A1'>


In [93]:
# {} (Specifies a specific number of repetitions)
pattern = r'A{2}1'
print(re.search(pattern, '1'))
print(re.search(pattern, 'A1'))
print(re.search(pattern, 'AAA1'))

None
None
<re.Match object; span=(1, 4), match='AA1'>
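Braces also accept a range. As a short sketch, `{2,3}` allows two or three repetitions and `{2,}` allows two or more:

```python
import re

m_range = re.search(r'A{2,3}1', 'AAAA1')   # at most three "A" before the "1"
m_open = re.search(r'A{2,}1', 'AAAA1')     # two or more "A" before the "1"
m_none = re.search(r'A{2,3}1', 'A1')       # only one "A": no match

print(m_range)
print(m_open)
print(m_none)
```

With `{2,3}` the match starts at the second `A`, because no more than three `A` may precede the `1`.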


#### 4.2 Character Sets


- `[abc]`: Matches any one of the characters `a`, `b`, or `c`.
- `[a-z]`: Matches any lowercase letter.
- `[0-9]`: Matches any digit.
- `[^abc]`: Matches any character **except** the ones specified inside the brackets.

##### Examples:

In [76]:
pattern = 'ES[AO]'

print(re.search(pattern,'La ESA, existe'))
print(re.search(pattern,'La ESO, existe'))
print(re.search(pattern,'La ESU, no existe'))

<re.Match object; span=(3, 6), match='ESA'>
<re.Match object; span=(3, 6), match='ESO'>
None


In [77]:
pattern_1 = '/[a-z]/'
pattern_2 = '/[a-z]+/'

print(re.search(pattern_1,'User/name/Desktop'))
print(re.search(pattern_2,'User/name/Desktop'))

None
<re.Match object; span=(4, 10), match='/name/'>


In [78]:
pattern = '[0-9]+'

print(re.search(pattern,'file_number_423.txt').group())
print(re.search(pattern,'file_number_001.txt').group())

423
001


In [79]:
pattern = 'obs/([^/]+)/' # matches any characters except "/" between "obs/" and the next "/"
string = 'User/Desktop/obs/CAHA/car-20180129T23h12m35s.fits'

print(re.search(pattern, string).group(1))

CAHA


#### 4.3 Special Sequences

- `\d`: Matches any digit (equivalent to `[0-9]` for ASCII text).
- `\D`: Matches any non-digit.
- `\w`: Matches any word character: letters, digits, and the underscore (equivalent to `[a-zA-Z0-9_]` for ASCII text).
- `\W`: Matches any non-word character.
- `\s`: Matches any whitespace character.
- `\S`: Matches any non-whitespace character.

##### Example:

In [80]:
pattern = r'\d+'
string = 'The PhD candidates at CAB are 67% men and 33% women.'
print(re.findall(pattern, string))

['67', '33']
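The other sequences work the same way. The sketch below uses a made-up string to show `\S`, `\w`, and `\D`:

```python
import re

string = 'obs_id:  42\tCAHA'   # hypothetical text with spaces and a tab

chunks = re.findall(r'\S+', string)          # runs of non-whitespace characters
words = re.findall(r'\w+', string)           # runs of word characters ("_" counts, ":" does not)
digits = re.sub(r'\D', '', '2018-01-29')     # remove every non-digit

print(chunks)
print(words)
print(digits)
```

Note that `\w` keeps the underscore in `obs_id` but drops the colon, while `\S` keeps both.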
