# Python: Regular expressions (Regex)

## Introduction

A **regular expression (regex)** is a sequence of characters that describes a search pattern. Regexes are widely used in data science to **search for and extract information from data**.

In [1]:
# Example
ai = ["data science", "big data", "statistics", "computer science"]
regex = "data"
ai_after_regex = ["data science", "big data"]
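
The filtering shown above can be sketched as runnable code. This is a minimal sketch that assumes the filter is done with `re.search()` (introduced later in this notebook) inside a list comprehension; the original cell only shows the expected result.

```python
import re

ai = ["data science", "big data", "statistics", "computer science"]
regex = "data"

# Keep only the strings in which the pattern occurs somewhere.
ai_after_regex = [s for s in ai if re.search(regex, s) is not None]

print(ai_after_regex)  # ['data science', 'big data']
```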

## Generic characters

One of the most important generic characters in regex is the **dot (.)**. It matches any single character (except a newline, by default).

In [2]:
# Example
strings = ["but", "bat", "robotics"]
strings_regex = "b.t" # matches "but", "bat", and the "bot" in "robotics"
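
To see the dot in action, the sketch below checks each string against the pattern. The extra strings "bt" and "boot" are not in the original example; they are added here as assumptions to show that the dot stands for exactly one character.

```python
import re

# "bt" and "boot" are extra, illustrative strings (not from the original example).
strings = ["but", "bat", "robotics", "bt", "boot"]
strings_regex = "b.t"

# The dot must consume exactly one character between "b" and "t".
matches = [s for s in strings if re.search(strings_regex, s) is not None]

print(matches)  # ['but', 'bat', 'robotics']
```

"bt" fails because nothing sits between the two letters, and "boot" fails because two characters do.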

## Find the beginning and the end of a string

Two symbols anchor a pattern to the beginning or the end of a string: the **caret (^)** matches at the **beginning** of a string and the **dollar sign ($)** matches at the **end** of a string.

In [3]:
# Example
strings = ["he throws his bat", "robot"]
bad_string = "it is good but not generous"
regex_for_strings = "b.t$"
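
The anchors can be checked with a short sketch. The `"^he"` pattern is an assumption added for illustration; only `"b.t$"` comes from the example above.

```python
import re

strings = ["he throws his bat", "robot"]
bad_string = "it is good but not generous"
regex_for_strings = "b.t$"

candidates = strings + [bad_string]

# "$" requires the match to sit at the very end of the string.
end_matches = [s for s in candidates if re.search(regex_for_strings, s) is not None]

# "^" requires the match to sit at the very beginning (illustrative pattern).
start_matches = [s for s in candidates if re.search("^he", s) is not None]

print(end_matches)    # ['he throws his bat', 'robot']
print(start_matches)  # ['he throws his bat']
```

`bad_string` contains "but", but not at the end, so `"b.t$"` rejects it.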

## Dataset

In [4]:
import csv

# Open the file in a context manager so it is closed after reading.
with open("askreddit_2015.csv", encoding="utf-8", newline="") as f:
    posts = list(csv.reader(f))

In [5]:
print(posts[0:5])

[['Title', 'Score', 'Time', 'Gold', 'NumComs'], ['What\'s your internet "white whale", something you\'ve been searching for years to find with no luck?', '11510', '1433213314.0', '1', '26195'], ["What's your favorite video that is 10 seconds or less?", '8656', '1434205517.0', '4', '8479'], ['What are some interesting tests you can take to find out about yourself?', '8480', '1443409636.0', '1', '4055'], ["PhD's of Reddit. What is a dumbed down summary of your thesis?", '7927', '1440188623.0', '0', '13201']]


In [6]:
# Drop the header row.
posts = posts[1:]

for post in posts[:10]:
    print(post)

['What\'s your internet "white whale", something you\'ve been searching for years to find with no luck?', '11510', '1433213314.0', '1', '26195']
["What's your favorite video that is 10 seconds or less?", '8656', '1434205517.0', '4', '8479']
['What are some interesting tests you can take to find out about yourself?', '8480', '1443409636.0', '1', '4055']
["PhD's of Reddit. What is a dumbed down summary of your thesis?", '7927', '1440188623.0', '0', '13201']
['What is cool to be good at, yet uncool to be REALLY good at?', '7711', '1440082910.0', '0', '20325']
['[Serious] Redditors currently in a relationship, besides dinner and a movie, what are your favorite activities for date night?', '7598', '1439993280.0', '2', '5389']
["Parents of Reddit, what's something that your kid has done that you pretended to be angry about but secretly impressed or amused you?", '7553', '1439161809.0', '0', '11520']
['What is a good subreddit to binge read the All Time Top Posts of?', '7498', '1438822288.0',

## Counting matches with the re module

In this section, we will use the **re** module. A useful function of this module is **search()**. It checks whether a regex **matches anywhere in a string**: it returns a match object if it does, and None otherwise.

In [7]:
import re

In [8]:
# Example 1

if re.search("artificial intelligence", "big data") is not None:
    print("Match found!")
else:
    print("Match not found!")

Match not found!


In [9]:
# Example 2

if re.search("b.", "big data") is not None:
    print("Match found!")
else:
    print("Match not found!")

Match found!


### Training

Let's count the number of posts whose title contains the pattern "of Reddit".

In [10]:
of_reddit_count = 0

for row in posts:
    if re.search("of Reddit", row[0]) is not None:
        of_reddit_count += 1

In [11]:
print(of_reddit_count)

76


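## Square brackets to match several characters

To match any one character from a set, we list the allowed characters inside square brackets **\[ \]**. For example, \[Rr\] matches either R or r.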
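
As a small sketch of how a bracketed set behaves, the code below tests a few strings. They are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical strings, not taken from the dataset.
samples = ["of Reddit", "of reddit", "of Xeddit", "of Rreddit"]

# [Rr] stands for exactly ONE character, which must be "R" or "r".
pattern = "of [Rr]eddit"

matches = [s for s in samples if re.search(pattern, s) is not None]

print(matches)  # ['of Reddit', 'of reddit']
```

"of Rreddit" is rejected because the set consumes a single character, not a run of them.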

### Training

Let's count the number of posts whose title contains "of Reddit" or "of reddit".

In [12]:
of_reddit_count = 0

for row in posts:
    if re.search("of [Rr]eddit", row[0]) is not None:
        of_reddit_count += 1

In [13]:
print(of_reddit_count)

102


## Escape special characters

To match a special character literally, escape it with a backslash (**\\**).
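
The difference between escaped and unescaped brackets can be sketched as follows. The title and the word "house" are hypothetical inputs chosen for illustration. The patterns use raw strings (`r"..."`) so that Python passes the backslashes to the regex engine unchanged.

```python
import re

title = "[Serious] What is your best advice?"  # hypothetical title

# Unescaped, the brackets define a set: any ONE of S, e, r, i, o, u, s.
unescaped = re.search("[Serious]", "house")

# Escaped, the brackets are matched as literal characters.
escaped = re.search(r"\[Serious\]", title)
escaped_miss = re.search(r"\[Serious\]", "house")

print(unescaped.group())  # o
print(escaped.group())    # [Serious]
print(escaped_miss)       # None
```

The unescaped pattern matches "house" at the letter "o", which is rarely what is intended when searching for a tag.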

### Training

Let's count the number of posts whose title contains the tag [Serious].

In [14]:
serious_count = 0

for row in posts:
    # Raw string (r"...") so the backslashes reach the regex engine unchanged.
    if re.search(r"\[Serious\]", row[0]) is not None:
        serious_count += 1

In [15]:
print(serious_count)

69


## Improve our regex

### Training

Let's count the posts with any of the following tags in the title: (Serious), (serious), [Serious], [serious]. Note that the pattern used here also accepts mixed brackets such as (Serious].

In [16]:
serious_count = 0

for row in posts:
    if re.search(r"[\[\(][Ss]erious[\)\]]", row[0]) is not None:
        serious_count += 1

In [17]:
print(serious_count)

80


## Combine multiple regex

To combine several regexes, we use the **|** character, which means "or": the combined pattern matches if either alternative matches.
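
A minimal sketch of alternation, using hypothetical strings and a hypothetical pattern:

```python
import re

# Hypothetical strings chosen for illustration.
strings = ["cat food", "hot dog", "my cat and dog food"]

# "|" has the lowest precedence, so this reads as (^cat) OR (dog$).
pattern = r"^cat|dog$"

matches = [s for s in strings if re.search(pattern, s) is not None]

print(matches)  # ['cat food', 'hot dog']
```

The third string contains both words, but neither at the anchored position, so it does not match.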

### Training

Let's count, across all posts, the tags that appear at the beginning of the title, at the end of the title, and in either position.

In [18]:
serious_start_count = 0
serious_end_count = 0
serious_final_count = 0

for row in posts:
    if re.search(r"^[\[\(][Ss]erious[\)\]]", row[0]) is not None:
        serious_start_count += 1
    if re.search(r"[\[\(][Ss]erious[\)\]]$", row[0]) is not None:
        serious_end_count += 1
    if re.search(r"^[\[\(][Ss]erious[\)\]]|[\[\(][Ss]erious[\)\]]$", row[0]) is not None:
        serious_final_count += 1

In [19]:
print(serious_start_count)
print(serious_end_count)
print(serious_final_count)

69
11
80


## Editing strings with regex

The **sub()** function of the **re** module replaces every match of a pattern with another string. It takes the pattern, the replacement, and the string to edit, and returns the new string.

In [20]:
# Example 1
re.sub("hello", "hi", "hello world!") 

'hi world!'

In [21]:
# Example 2
re.sub("hey", "hi", "hello world!")

'hello world!'

### Training

Let's replace every variant of the tag with [Serious] in the titles.

In [22]:
posts_new = []

for row in posts:
    row[0] = re.sub(r"[\[\(][Ss]erious[\)\]]", "[Serious]", row[0])
    posts_new.append(row)

In [23]:
print(posts_new[0:10])

[['What\'s your internet "white whale", something you\'ve been searching for years to find with no luck?', '11510', '1433213314.0', '1', '26195'], ["What's your favorite video that is 10 seconds or less?", '8656', '1434205517.0', '4', '8479'], ['What are some interesting tests you can take to find out about yourself?', '8480', '1443409636.0', '1', '4055'], ["PhD's of Reddit. What is a dumbed down summary of your thesis?", '7927', '1440188623.0', '0', '13201'], ['What is cool to be good at, yet uncool to be REALLY good at?', '7711', '1440082910.0', '0', '20325'], ['[Serious] Redditors currently in a relationship, besides dinner and a movie, what are your favorite activities for date night?', '7598', '1439993280.0', '2', '5389'], ["Parents of Reddit, what's something that your kid has done that you pretended to be angry about but secretly impressed or amused you?", '7553', '1439161809.0', '0', '11520'], ['What is a good subreddit to binge read the All Time Top Posts of?', '7498', '143882

## Matching years with regex

To match digits and letters, use character ranges: **[0-9]** for digits, **[a-z]** for lowercase letters, and **[A-Z]** for capital letters.

In [24]:
# Example
strings = ['he throws his bat in 1995', 'robot']
year_string = []

for string in strings:
    if re.search("[1-2][0-9]{3}", string) is not None: # [1-2][0-9]{3} <==> [1-2][0-9][0-9][0-9]
        year_string.append(string)

In [25]:
year_string

['he throws his bat in 1995']

## Extract all years

To extract every match of a pattern, use the **findall()** function of the **re** module. It returns a list of all non-overlapping matches.

In [26]:
# Example 1
re.findall("[a-z]", "abcd1234")

['a', 'b', 'c', 'd']

In [27]:
# Example 2
years_string = "We are already in 2022, one year more than 2021 and less than 2023!"
years = re.findall("[2][0-9][0-9][0-9]", years_string)
years

['2022', '2021', '2023']
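
The same idea extends to a list of strings. The sketch below collects every year found in a few hypothetical titles (they are not rows from the dataset), reusing the `[1-2][0-9]{3}` pattern from the previous section.

```python
import re

# Hypothetical titles, made up for illustration.
titles = [
    "Best albums of 1999 and 2005?",
    "No year here",
    "What changed between 2010 and 2015?",
]

year_pattern = "[1-2][0-9]{3}"

all_years = []
for title in titles:
    # findall() returns an empty list when nothing matches.
    all_years.extend(re.findall(year_pattern, title))

print(all_years)  # ['1999', '2005', '2010', '2015']
```

This pattern also matches inside longer digit runs (for example, it finds "1234" in "12345"), so a stricter pattern may be needed on real data.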