# Regular Expressions



*   Regular expressions (regex), introduced in the 1950s by Stephen Kleene, were originally a formal notation for describing patterns in text.
*   They are now widely used in natural language processing tasks, from simple text searches (e.g., identifying "the" in text) to more complex applications like tokenization and text normalization.

*   This practical language is supported in virtually every programming language, in text-processing tools such as the Unix tool grep, and in editors such as vim and Emacs.



## **Using grep in Linux**



In [1]:
# Writing text to a file
text = """The woodchuck lives in the forest.
It digs burrows and forages for food.
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")


File 'example.txt' created with content.


In [3]:
# Search for the "woodchuck" pattern in example.txt
!grep "woodchuck" example.txt


The woodchuck lives in the forest.


Now, what happens if I add the same word on another line?


In [4]:
# Writing text to a file
text = """Line 1 The woodchuck lives in the forest.
Line 2 It digs burrows and forages for food.
Line 3 The woodchuck lives in the forest.
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")


File 'example.txt' created with content.


In [5]:
# Search for the "woodchuck" pattern in example.txt
!grep "woodchuck" example.txt


Line 1 The woodchuck lives in the forest.
Line 3 The woodchuck lives in the forest.


## **Python**

NOTE: This section only demonstrates that regular expressions are also available in programming languages such as Python.

In [None]:
# For searching for patterns using regular expressions in Python
import re

`re.search` returns the first instance it finds

In [9]:
# Define the pattern
pattern = 'woodchuck'

# Multi-line text
text = """Line 1 The woodchuck lives in the forest.
Line 2 It digs burrows and forages for food.
Line 3 The woodchuck lives in the forest.
"""

# Search for pattern
match = re.search(pattern, text)

# Show the result
if match:
    print(f"Match found: {match.group()} at position {match.start()}")
else:
    print("No match found.")


Match found: woodchuck at position 11


In [10]:
# The index at which our pattern starts in the string
match.start()

11

To find all occurrences of a pattern in the text, you can use `re.finditer`

In [11]:
import re

# Define the pattern
pattern = 'woodchuck'

# Multi-line text
text = """Line 1 The woodchuck lives in the forest.
Line 2 It digs burrows and forages for food.
Line 3 The woodchuck lives in the forest.
"""

# Find all matches
matches = re.finditer(pattern, text)

# Display the results
for match in matches:
    print(f"Match found: {match.group()} at position {match.start()}")


Match found: woodchuck at position 11
Match found: woodchuck at position 98


In [9]:
# Index into the string to see what it has found
text[87:113]

'Line 3 The woodchuck lives'
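
The line-oriented behaviour of grep shown earlier can be approximated in Python by testing each line separately. The sketch below is only an illustration; `grep_lines` is a hypothetical helper name, not part of the `re` module.

```python
import re

def grep_lines(pattern, text):
    """Return the lines of `text` that contain a match for `pattern`."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]

text = """Line 1 The woodchuck lives in the forest.
Line 2 It digs burrows and forages for food.
Line 3 The woodchuck lives in the forest.
"""

# Print every line that contains the pattern, as grep does
for line in grep_lines("woodchuck", text):
    print(line)
```

Like grep, this returns whole lines rather than match positions.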

We will continue our discussion on regular expressions, focusing on the use of the grep tool in Linux.



## Simple regex patterns

In [12]:
# Writing text to a file
text = """interesting links to woodchucks and lemurs
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")


File 'example.txt' created with content.


In [13]:
# Search for the pattern in the txt file (return the entire line where a pattern match occurs)
!grep "lem" example.txt


interesting links to woodchucks and lemurs


In [14]:
# If you want grep to return only the matched words or substrings, you can use the -o option
!grep -o "lem" example.txt


lem


## Regular expressions are case-sensitive

In [13]:
# Writing text to a file
text = """interesting links to woodchucks and lemurs
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")


File 'example.txt' created with content.


In [15]:
# Search for the pattern in the txt file (return the entire line where a pattern match occurs)
!grep -o "Woodchuck" example.txt


It did not return anything because of the uppercase "W" in our pattern: the file contains only the lowercase "woodchucks".

## Disjunction of characters (OR)

We can solve this problem with the square brackets [ and ].

The string of characters inside the brackets specifies a disjunction of characters to match.

 Match: Woodchuck or woodchuck

In [16]:
# Writing text to a file
text = """interesting links to woodchucks and lemurs
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")

File 'example.txt' created with content.


In [17]:
# Adjusting our pattern using square brackets
# Search for the pattern in the txt file
!grep "[wW]oodchuck" example.txt


interesting links to woodchucks and lemurs
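
The same bracket pattern can be tried in Python's `re` module. The sketch below also shows the `re.IGNORECASE` flag as an alternative; that flag is not used elsewhere in this notebook and is included only for comparison.

```python
import re

text = "interesting links to woodchucks and lemurs"

# A plain pattern is case-sensitive, so the capital W fails to match
plain = re.search("Woodchuck", text)

# The bracket expression accepts either case for the first letter only
bracket = re.search("[wW]oodchuck", text)

# re.IGNORECASE relaxes case for every letter in the pattern
ignore_case = re.search("Woodchuck", text, flags=re.IGNORECASE)

print(plain)
print(bracket.group())
print(ignore_case.group())
```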


 Match: ‘a’, ‘b’, or ‘c’

In [18]:
# Writing text to a file
text = """In uomini, in soldati
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")

File 'example.txt' created with content.


In [19]:
# Search for the pattern in the txt file
!grep -o "[abc]" example.txt


a


 Match: any digit

In [20]:
# Writing text to a file
text = """plenty of 7 to 5
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")

File 'example.txt' created with content.


In [21]:
# Search for the pattern in the txt file
!grep -o "[1234567890]" example.txt


7
5
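
As a rough Python counterpart to `grep -o`, `re.findall` returns every non-overlapping match as a list. The snippet below reuses the two example strings from this section.

```python
import re

# Each bracket expression matches exactly one character from the listed set
letters = re.findall("[abc]", "In uomini, in soldati")
digits = re.findall("[1234567890]", "plenty of 7 to 5")

print(letters)
print(digits)
```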


## Range

In cases where there is a well-defined sequence associated with a set of characters, the brackets can be used with the dash (-) to specify any one character in a range. For example, `[A-Z]` matches any single uppercase letter and `[0-9]` matches any single digit.

In [22]:
# Writing text to a file
text = """we should call it Drenched Blossoms
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")

File 'example.txt' created with content.


In [23]:
# Search for the pattern in the txt file
!grep -o "[A-Z]" example.txt


D
B
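
Ranges behave the same way in Python. This is a small sketch; the second input string is an illustrative assumption and does not come from the notebook's files.

```python
import re

# [A-Z] matches one uppercase letter at a time
uppercase = re.findall("[A-Z]", "we should call it Drenched Blossoms")

# [0-9] is the range form of [1234567890]
digits = re.findall("[0-9]", "Chapter 1: Down the Rabbit Hole")

print(uppercase)
print(digits)
```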


## Negation

Square brackets can also be used to specify what a single character cannot be, by use of the caret ^.

*   If the caret ^ is the first symbol after the opening square bracket, the resulting pattern is negated: it matches any single character that is not listed.


In [24]:
# Writing text to a file
text = """Oyfn pripetchik
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")

File 'example.txt' created with content.


In [25]:
# Search for the pattern in the txt file
!grep -o "[^A-Z]" example.txt


y
f
n
 
p
r
i
p
e
t
c
h
i
k


In [26]:
# Search for the pattern in the txt file
!grep -o "[A-Z]" example.txt


O
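
A quick Python check of the negated class, using the same string. Joining the single-character matches makes the result easier to read than one character per line.

```python
import re

text = "Oyfn pripetchik"

# [^A-Z] matches any single character that is NOT an uppercase letter
not_upper = re.findall("[^A-Z]", text)

# [A-Z] matches only the uppercase letters
upper = re.findall("[A-Z]", text)

print("".join(not_upper))
print(upper)
```

Note that the space counts as a character that is not an uppercase letter, so it is matched as well.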


**But what if the caret is not used right after** [ **?**

In [27]:
# Writing text to a file
text = """look up ^ now
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")

File 'example.txt' created with content.


The pattern below looks for either 'e' OR '^'. Because the caret is not the first symbol inside the brackets, it is treated as an ordinary character.

In [28]:
# Search for the pattern in the txt file
!grep -o "[e^]" example.txt


^


**But what if we want to look for the pattern 'a^b'**

In [29]:
# Writing text to a file
text = """look up a^b now
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")

File 'example.txt' created with content.


In [30]:
# Search for the pattern in the txt file
# There are no square brackets here, so the caret is matched literally
!grep -o "a^b" example.txt


a^b
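
One caveat when moving between tools: grep's default (basic) syntax treats a caret in the middle of a pattern as a literal character, but Python's `re` module always treats an unescaped `^` outside brackets as an anchor. A minimal sketch of the difference:

```python
import re

text = "look up a^b now"

# In Python, the unescaped ^ asserts "start of string" after the 'a',
# which can never hold, so nothing is found
unescaped = re.search("a^b", text)

# Escaping the caret makes it a literal character
escaped = re.search(r"a\^b", text)

# Inside brackets (and not in first position) the caret is already literal
in_class = re.findall("[e^]", "look up ^ now")

print(unescaped)
print(escaped.group())
print(in_class)
```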


## Optional Elements

How can we talk about optional elements, like an optional s in woodchuck and woodchucks?


We can’t use the square brackets, because while they allow us to say
“s or S”, they don’t allow us to say “s or nothing”.

For this we use the question mark **/?/**, which means **"the preceding character or nothing"**

In [31]:
# Writing text to a file
text = """The woodchuck is an amazing creature, but woodchucks are rarely seen in the wild.
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")

File 'example.txt' created with content.


In [32]:
# Search for the pattern in the txt file
!grep -o "woodchucks?" example.txt



**It returned NOTHING!!! WHY????**


The `grep` command does not treat the question mark `?` as an operator here because, by default, `grep` uses basic regular expressions, in which an unescaped `?` is an ordinary character. The pattern therefore looks for the literal text `woodchucks?`, which does not occur in the file.

The `?` operator is part of extended regular expressions, so you need to use the `-E` option (for extended regex) with `grep`.

In [33]:
!grep -E -o "woodchucks?" example.txt


woodchuck
woodchucks


Let's see another example:

In [34]:
# Writing text to a file
text = """I prefer the colour blue, but the color red is also nice.
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")

File 'example.txt' created with content.


In [35]:
!grep -E -o "colou?r" example.txt


colour
color
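
Python's `re` module needs no special option for `?`. The sketch below repeats both examples from this section.

```python
import re

woodchuck_text = (
    "The woodchuck is an amazing creature, "
    "but woodchucks are rarely seen in the wild."
)
colour_text = "I prefer the colour blue, but the color red is also nice."

# s? means "an s or nothing"
woodchucks = re.findall("woodchucks?", woodchuck_text)

# u? means "a u or nothing"
colours = re.findall("colou?r", colour_text)

print(woodchucks)
print(colours)
```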


## Kleene *

The set of operators that allows us to say things like **"some number of a's"** is based on the asterisk or *, commonly called the Kleene * (generally pronounced "cleany star").

In other words: The Kleene star means "zero or more occurrences
of the immediately previous character or regular expression".

So  **/a*/** means "**any string of zero or more a's**".

This will match **a** or **aaaaaa**, but it will also match the
empty string at the start of *Off Minor* since the string *Off Minor* starts with **zero a's**.

In [36]:
# Writing text to a file
text = """b!
baa!
baaa!
baaaa!
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")


File 'example.txt' created with content.


In [37]:
# Search for 'ba*!' where 'a*' means zero or more occurrences of 'a'
!grep -E -o "ba*!" example.txt


b!
baa!
baaa!
baaaa!
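
The "zero occurrences" case is easy to overlook, so here is a small Python sketch of it. The *Off Minor* string comes from the explanation above; the rest mirrors the grep example.

```python
import re

text = "b!\nbaa!\nbaaa!\nbaaaa!\n"

# a* allows zero a's, so "b!" matches too
star_matches = re.findall("ba*!", text)

# re.match anchors at the start of the string; a* succeeds there
# by matching the empty string, because "Off Minor" starts with zero a's
empty = re.match("a*", "Off Minor")

print(star_matches)
print(repr(empty.group()), empty.span())
```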


## Kleene +

A shorter way to specify "at least one" of some character

Kleene + means "*one* or more occurrences of the immediately preceding
character or regular expression"

In [45]:
# Writing text to a file
text = """b!
ba!
baa!
baaa!
12345 is a sequence of digits.
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")


File 'example.txt' created with content.


In [46]:
# Search for 'ba+!' where 'a+' means one or more occurrences of 'a'
!grep -E -o "ba+!" example.txt

# Search for sequences of digits using '[0-9]+'
!grep -E -o "[0-9]+" example.txt


ba!
baa!
baaa!
12345
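
The same two patterns in Python, as a short sketch. Note that `b!` is no longer matched, since `a+` requires at least one `a`.

```python
import re

text = "b!\nba!\nbaa!\nbaaa!\n12345 is a sequence of digits.\n"

# a+ requires one or more a's between b and !
plus_matches = re.findall("ba+!", text)

# [0-9]+ matches a run of one or more digits as a single match
numbers = re.findall("[0-9]+", text)

print(plus_matches)
print(numbers)
```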


## Wildcard Expression

One very important special character is the period (/./), a wildcard expression that matches any single character

In [47]:
# Writing text to a file
text = """begin
began
beg'n
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")


File 'example.txt' created with content.


In [48]:
# Match words with any character between beg and n
!grep -E -o "beg.n" example.txt


begin
began
beg'n


Another example:
suppose we want to find any line in which a particular word, for example, aardvark, appears twice.

In [47]:
# Writing text to a file
text = """
aardvark is a rare animal. The aardvark loves the wild.
Many researchers study the aardvark for its unique characteristics.
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")


File 'example.txt' created with content.


In [49]:
# Finding any line in which aardvark appears twice
!grep -E "aardvark.*aardvark" example.txt


aardvark is a rare animal. The aardvark loves the wild.
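
Both wildcard examples can be reproduced in Python. One detail worth knowing: in the `re` module the period does not match a newline by default, so `.*` stays within a single line. The sketch below checks each line separately to mimic grep.

```python
import re

# The period matches any single character between "beg" and "n"
wildcard = re.findall("beg.n", "begin\nbegan\nbeg'n\n")

# By default the period does not match a newline
across_newline = re.search("beg.n", "beg\nn")

text = """aardvark is a rare animal. The aardvark loves the wild.
Many researchers study the aardvark for its unique characteristics.
"""

# Keep only the lines in which "aardvark" appears at least twice
twice = [
    line for line in text.splitlines()
    if re.search("aardvark.*aardvark", line)
]

print(wildcard)
print(across_newline)
print(twice)
```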


## Anchors

Anchors are special characters that anchor regular expressions to particular places in a string.

### Start of Line

In [39]:
# Writing text to a file
text = """The dog barked loudly.
A cat ran across the street.
The dog wagged its tail.
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")


File 'example.txt' created with content.


In [41]:
# Command to match lines starting with The
!grep -E "^The" example.txt


The dog barked loudly.
The dog wagged its tail.


### End of Line

In [43]:
# Writing text to a file
text = """The dog barked loudly.
The cat is on the roof.
It again barked.
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")


File 'example.txt' created with content.


In [44]:
# Command to match lines ending with "barked." (the backslash makes the period literal)
!grep -E "barked\.$" example.txt


It again barked.
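
grep applies anchors to every line automatically. In Python, `^` and `$` refer to the whole string unless the `re.MULTILINE` flag is passed. A minimal sketch of the difference, reusing the text above:

```python
import re

text = """The dog barked loudly.
The cat is on the roof.
It again barked.
"""

# Without the flag, ^ matches only at the very start of the string
first_only = re.findall("^The", text)

# With re.MULTILINE, ^ and $ match at the start and end of every line
starts = re.findall(r"^The.*", text, flags=re.MULTILINE)
ends = re.findall(r"^.*barked\.$", text, flags=re.MULTILINE)

print(first_only)
print(starts)
print(ends)
```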


### Word Boundary

`\b` matches a word boundary.

Thus, `\bthe\b` matches the word `the` but not the word `other`.

**NOTE:** A “word” for the purposes of a regular expression is defined based on the
definition of words in programming languages as a sequence of digits, underscores,
or letters.

In [45]:
# Writing text to a file
text = """The dog barked.
They walked the dog.
Dogmatic approaches rarely help.
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")


File 'example.txt' created with content.


In [46]:
# Command to match the word dog as a whole word:
!grep -E "\bdog\b" example.txt


The dog barked.
They walked the dog.
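
In Python, patterns containing `\b` should be written as raw strings (`r"..."`), because in an ordinary string literal `\b` means a backspace character. The sketch below uses the lines from this section; the other sample strings are illustrative assumptions.

```python
import re

lines = [
    "The dog barked.",
    "They walked the dog.",
    "Dogmatic approaches rarely help.",
]

# Keep the lines that contain "dog" as a whole word
whole_word = [line for line in lines if re.search(r"\bdog\b", line)]

# "the" inside "other" or "theme" is not surrounded by word boundaries
the_matches = re.findall(r"\bthe\b", "the other theme, the")

# Digits are word characters, so 99 inside 299 has no boundary before it
in_price = re.search(r"\b99\b", "It costs $99")
in_number = re.search(r"\b99\b", "There are 299 bottles")

print(whole_word)
print(the_matches)
print(in_price is not None, in_number is not None)
```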


## Disjunction

We might want to search for either the string `cat` or
the string `dog`.

We can't use the square brackets to search for “cat or dog” (why can't we say [catdog]?).

We need a new operator, the disjunction operator, also
called the pipe symbol `|`.

In [50]:
# Writing text to a file
text = """Cats are independent creatures.
Dogs are loyal companions.
Some people have both cats and dogs as pets.
Birds are also common pets.
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")


File 'example.txt' created with content.


In [51]:
# Search for lines containing either "cat" or "dog"
!grep -E "cat|dog" example.txt


Some people have both cats and dogs as pets.


In [52]:
# Search for lines containing either "cat" or "dog" (both lower and upper case)
!grep -E "[cC]at|[dD]og" example.txt

Cats are independent creatures.
Dogs are loyal companions.
Some people have both cats and dogs as pets.
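
To see why `[catdog]` does not work, it helps to compare the two patterns directly. A small Python sketch follows; the string `act` is an illustrative assumption.

```python
import re

text = "Some people have both cats and dogs as pets."

# The pipe chooses between whole strings
words = re.findall("cat|dog", text)

# Brackets choose between single characters, so [catdog] matches
# each letter of "act" even though it is neither "cat" nor "dog"
single_chars = re.findall("[catdog]", "act")

print(words)
print(single_chars)
```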


## Precedence

Sometimes we need to use this disjunction operator in the midst of a larger sequence.

How can I specify both `guppy` and `guppies`? We cannot simply
say `guppy|ies`, because that would match only the strings `guppy` and `ies`.

This is because sequences like guppy take precedence over the disjunction operator `|`.


Enclosing a pattern in parentheses makes it act like
a single character for the purposes of neighboring operators like the pipe `|`.

In [56]:
# Writing text to a file
text = """I have a guppy in my aquarium.
guppies are colorful and lively fish.
Guppy food is available in most pet stores.
The guppylike behavior of the fish amazed us.
"""

file_name = "example.txt"
with open(file_name, "w") as file:
    file.write(text)

print(f"File '{file_name}' created with content.")


File 'example.txt' created with content.


In [57]:
# Search for "guppy" or "guppies" using parentheses to ensure proper precedence
!grep -E "gupp(y|ies)" example.txt


I have a guppy in my aquarium.
guppies are colorful and lively fish.
The guppylike behavior of the fish amazed us.
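
The precedence issue can be checked in Python as well. One assumption to flag: the sketch uses a non-capturing group `(?:...)`, which is Python syntax rather than something shown in the grep example, because `re.findall` returns only the group contents when a pattern contains a capturing group.

```python
import re

# Without parentheses the alternatives are "guppy" and "ies"
no_group = re.findall("guppy|ies", "guppies")

# Parentheses limit the alternation to the ending; (?:...) groups
# without capturing, so findall returns the whole match
grouped = re.findall("gupp(?:y|ies)", "guppy guppies")

# With a capturing group, findall returns only what the group matched
captured = re.findall("gupp(y|ies)", "guppy guppies")

print(no_group)
print(grouped)
print(captured)
```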
