# String Operations Using Regular Expressions

Regular expressions (often abbreviated as **regex** or **regexp**) are sequences of characters used to define search patterns. They provide a concise and flexible way to match, search, and manipulate text based on specific patterns.

Regular expressions consist of a combination of literal characters and special characters called metacharacters. The metacharacters have special meanings and allow you to define complex patterns. Here are some commonly used metacharacters:

- `.` (dot): Matches any single character except a newline.
- `^` (caret): Matches the start of a string.
- `$` (dollar): Matches the end of a string.
- `*` (asterisk): Matches zero or more occurrences of the previous character or group.
- `+` (plus): Matches one or more occurrences of the previous character or group.
- `?` (question mark): Matches zero or one occurrence of the previous character or group.
- `[ ]` (square brackets): Matches any character within the brackets.
- `[^ ]` (caret within square brackets): Matches any character not in the brackets.
- `|` (pipe): Matches either the expression before or after the pipe.
- `()` (parentheses): Groups patterns together.
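- `\` (backslash): Escapes a metacharacter or introduces a special sequence such as the two below.
- `\w`: Matches any word character (a letter, a digit, or an underscore).
- `\d`: Matches any digit.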
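
As a rough illustration of the metacharacters listed above, the sketch below runs a few of them through `re.findall()`. The sample strings and the `examples` and `results` names are made up for this illustration.

```python
import re

# Each pair is (pattern, text); the sample strings are invented for illustration.
examples = [
    (r"c.t", "cat cot cut"),          # . matches any single character
    (r"^Hello", "Hello world"),       # ^ anchors the match at the start
    (r"world$", "Hello world"),       # $ anchors the match at the end
    (r"colou?r", "color colour"),     # ? makes the preceding "u" optional
    (r"[aeiou]", "regex"),            # [ ] matches any one listed character
    (r"cat|dog", "a dog and a cat"),  # | matches either alternative
    (r"\d+", "room 42, floor 7"),     # \d+ matches runs of digits
]

# Map each pattern to the list of matches found in its sample text.
results = {pattern: re.findall(pattern, text) for pattern, text in examples}

for pattern, matches in results.items():
    print(pattern, "->", matches)
```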

Regular expressions allow you to perform various operations, such as:

- Matching: Determine if a string matches a specific pattern.
- Searching: Find the first occurrence of a pattern within a larger text.
- Extraction: Extract specific portions of a text that match a pattern.
- Substitution: Replace occurrences of a pattern with new text.
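
The four operations above could look roughly like this with Python's `re` module. The `log_line` text and all variable names are assumptions chosen for the example, not part of the notebook.

```python
import re

log_line = "2024-01-15 ERROR disk full on /dev/sda1"  # made-up sample text

# Matching: does the whole string have the expected shape?
is_log_line = re.fullmatch(r"\d{4}-\d{2}-\d{2} \w+ .*", log_line) is not None

# Searching: find the first occurrence of a pattern (here, a run of capitals).
first_caps = re.search(r"[A-Z]+", log_line)

# Extraction: pull out the parts of interest with capture groups.
year, month, day = re.search(r"(\d{4})-(\d{2})-(\d{2})", log_line).groups()

# Substitution: replace every digit with a placeholder character.
masked = re.sub(r"\d", "#", log_line)

print(is_log_line, first_caps.group(), (year, month, day), masked)
```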

> A very useful website for RegEx tips and experimentation: https://regex101.com/

In [1]:
import re

### Find the first instance of a string using `search()`

In [2]:
s = 'this is a sample string'
pattern = 'is'

re.search(pattern, s)

<re.Match object; span=(2, 4), match='is'>

In [4]:
s[2:4]

'is'
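
`re.search()` returns a match object, or `None` when nothing matches, so it is usually worth checking the result before using it. The sketch below reuses the string from above; the variable names and the `\b` word-boundary variant are additions for illustration.

```python
import re

s = 'this is a sample string'

match = re.search('is', s)
if match:                               # None would mean "no match"
    print(match.group())                # the matched text
    print(match.start(), match.end())   # the same positions as span()

# \b marks a word boundary, so this finds the standalone word "is"
# instead of the "is" inside "this".
word_match = re.search(r'\bis\b', s)
print(word_match.span())

# A pattern that does not occur yields None rather than raising an error.
no_match = re.search('xyz', s)
print(no_match)
```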

### Search and capture all instances of a string using `findall()`

In [5]:
s2 = 'This is a sentence. Here is my sentence. Lastly, I am showing my 3rd sentence'

re.findall('sentence', s2)

['sentence', 'sentence', 'sentence']

In [7]:
s3 = 'It is raining in Spain. Pain'

re.findall('ain', s3)

['ain', 'ain', 'ain']
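
`findall()` returns only the matched text. If the positions matter too, `re.finditer()` yields one match object per hit. This is a small sketch on the same string; the names `positions` and `words` are made up here.

```python
import re

s3 = 'It is raining in Spain. Pain'

# finditer() yields match objects, so each hit carries its own span.
positions = [m.span() for m in re.finditer('ain', s3)]
print(positions)

# Surrounding the literal with \w* captures the whole word containing "ain".
words = re.findall(r'\w*ain\w*', s3)
print(words)
```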

### Selecting words using the `\w` character class

In [8]:
emails = '''
            bassel@gmail.com
            mark@yahoo.com
            betty@msn.net
            john@gmail.com
        '''

In [9]:
pattern = re.compile(r'\w+')  # \w+ matches runs of word characters (letters, digits, underscore); the r prefix keeps the backslash literal

re.findall(pattern, emails)

['bassel',
 'gmail',
 'com',
 'mark',
 'yahoo',
 'com',
 'betty',
 'msn',
 'net',
 'john',
 'gmail',
 'com']

In [10]:
# make the list unique
set(re.findall(pattern, emails))

{'bassel', 'betty', 'com', 'gmail', 'john', 'mark', 'msn', 'net', 'yahoo'}

Capture the top-level domain of each email address

In [11]:
pattern = re.compile(r'\w+@\w+\.(\w+)')  # \. matches a literal dot; the parentheses form a capture group, so findall() returns only that part

re.findall(pattern, emails)

['com', 'com', 'net', 'com']
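
When a pattern has more than one capture group, `findall()` returns a list of tuples, one tuple per match. The sketch below assumes the same `emails` text as above; the three-group pattern and the `parts` name are illustrative.

```python
import re

emails = '''
            bassel@gmail.com
            mark@yahoo.com
            betty@msn.net
            john@gmail.com
        '''

# Three groups: user name, email service, top-level domain.
pattern = re.compile(r'(\w+)@(\w+)\.(\w+)')
parts = pattern.findall(emails)

for user, service, tld in parts:
    print(user, service, tld)
```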

### Splitting A String Using `split()`

In [12]:
text = 'Apples - Bananas - Oranges - Grapes'

# split the text and convert it into a list

fruit_list = re.split('-', text)
fruit_list

['Apples ', ' Bananas ', ' Oranges ', ' Grapes']

In [13]:
# strip the white spaces from the list items
fruit_list = [fruit.strip() for fruit in fruit_list]
fruit_list

['Apples', 'Bananas', 'Oranges', 'Grapes']

In [14]:
fruit_list = re.split(' - ', text)
fruit_list

['Apples', 'Bananas', 'Oranges', 'Grapes']

What if the string uses several different delimiters?

In [18]:
text = 'Apples-Bananas|Oranges,Grapes'

fruit_list = re.split('[-|,]', text) # inside square brackets, | is a literal pipe character, not the alternation operator
fruit_list

['Apples', 'Bananas', 'Oranges', 'Grapes']
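
The splitting and stripping steps can also be folded into a single pattern by letting it absorb the whitespace around each delimiter. The spaced-out sample string and the variable names below are assumptions for illustration.

```python
import re

spaced_text = 'Apples - Bananas | Oranges , Grapes'  # made-up sample

# \s* on both sides swallows any spaces around the delimiter,
# so no separate strip() pass is needed.
fruit_list = re.split(r'\s*[-|,]\s*', spaced_text)
print(fruit_list)

# maxsplit limits the number of splits; the remainder stays in one piece.
head_and_rest = re.split(r'[-|,]', 'Apples-Bananas|Oranges,Grapes', maxsplit=1)
print(head_and_rest)
```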

### Using RegEx with Digits (using `\d`)

In [23]:
phone_nums = '''
            Bassel: 234-7890018
            Mark: 564-9873254
            Becky: 346-0981238
            '''

In [22]:
pattern = re.compile(r'\d\d\d-\d\d\d\d\d\d\d')  # three digits, a dash, then seven digits
re.findall(pattern, phone_nums)

['234-7890018', '564-9873254', '346-0981238']

In [24]:
# a more concise way is to use quantifiers: {n} repeats the previous item exactly n times

pattern = re.compile(r'\d{3}-\d{7}')
re.findall(pattern, phone_nums)

['234-7890018', '564-9873254', '346-0981238']

Select the area code from the phone numbers

In [26]:
pattern = re.compile(r'(\d{3})-\d{7}')  # the capture group returns only the three-digit area code
user_area_codes = re.findall(pattern, phone_nums)
user_area_codes

['234', '564', '346']
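
Named groups, written `(?P<name>...)`, are one way to keep track of several captured pieces at once. This is a sketch on the same `phone_nums` text; the group names `name`, `area`, and `number` are chosen here for illustration.

```python
import re

phone_nums = '''
            Bassel: 234-7890018
            Mark: 564-9873254
            Becky: 346-0981238
            '''

# Each (?P<label>...) group can be read back by its label.
pattern = re.compile(r'(?P<name>\w+): (?P<area>\d{3})-(?P<number>\d{7})')

# groupdict() turns each match into a {label: text} dictionary.
records = [m.groupdict() for m in pattern.finditer(phone_nums)]

for record in records:
    print(record['name'], record['area'], record['number'])
```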

### Replacing Strings Using `sub()`

**Exercise:** clean the following text using RegEx

In [34]:
text = ''' The BEST $mvie ever made about writer's block and one of the scariest tales ever made regarding cabin fever, 
        The Shining took a simple concept of a      haunted hotel and built it ~up into an unforgettable, 
        psychological ^horror mvie that will withstand the test of 
        time despite being slated by it's original creator. scary moovie ---!!!!'''


1. fix the misspellings of the word movie

In [35]:
text = re.sub('mvie|moovie', 'movie', text) # | (alternation) matches either mvie or moovie
print(text)

 The BEST $movie ever made about writer's block and one of the scariest tales ever made regarding cabin fever, 
        The Shining took a simple concept of a      haunted hotel and built it ~up into an unforgettable, 
        psychological ^horror movie that will withstand the test of 
        time despite being slated by it's original creator. scary movie ---!!!!


2. get rid of special characters

In [36]:
pattern = r"[^a-zA-Z0-9\s.']" # keep letters, digits, whitespace, periods, and apostrophes; remove everything else
text = re.sub(pattern, '', text)
print(text)

 The BEST movie ever made about writer's block and one of the scariest tales ever made regarding cabin fever 
        The Shining took a simple concept of a      haunted hotel and built it up into an unforgettable 
        psychological horror movie that will withstand the test of 
        time despite being slated by it's original creator. scary movie 


3. get rid of extra spaces

In [37]:
text = re.sub(r'\s+', ' ', text)  # collapse every run of whitespace (including newlines) into one space
print(text)

 The BEST movie ever made about writer's block and one of the scariest tales ever made regarding cabin fever The Shining took a simple concept of a haunted hotel and built it up into an unforgettable psychological horror movie that will withstand the test of time despite being slated by it's original creator. scary movie 


In [74]:
# remove the first space 
text = re.sub(r'^\s', '', text)  # ^ anchors the match to the start of the string
text

"The BEST movie ever made about writer's block and one of the scariest tales ever made regarding cabin fever The Shining took a simple concept of a haunted hotel and built it up into an unforgettable psychological horror movie that will withstand the test of time despite being slated by it's original creator. scary movie "

In [38]:
text.lower()

"the best movie ever made about writer's block and one of the scariest tales ever made regarding cabin fever the shining took a simple concept of a haunted hotel and built it up into an unforgettable psychological horror movie that will withstand the test of time despite being slated by it's original creator. scary movie "

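The cleaning steps in this section could be bundled into one helper so they are applied in a fixed order. This is only a sketch: `clean_text` and the `sample` string are hypothetical names, and it trims both ends with `strip()` rather than removing only the leading space.

```python
import re

def clean_text(raw):
    """Apply the cleaning steps from this section in order."""
    text = re.sub(r'mvie|moovie', 'movie', raw)    # fix known misspellings
    text = re.sub(r"[^a-zA-Z0-9\s.']", '', text)   # drop special characters
    text = re.sub(r'\s+', ' ', text)               # collapse whitespace runs
    return text.strip().lower()                    # trim both ends, lowercase

sample = "  A GREAT $mvie,   truly ~scary moovie ---!!!! "  # made-up input
cleaned = clean_text(sample)
print(cleaned)
```

The order matters: removing special characters first can leave extra spaces behind, which is why the whitespace step comes after it.
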
## Using RegEx with Pandas

In many cases you don't need the `re` module with `pandas`, because the `pandas` string methods (under the `.str` accessor) accept regular expressions directly.

### Using `contains()`

In [39]:
import pandas as pd

In [41]:
data = {
    'names': [  'bassel'
                ,'mark'
                ,'betty'
                ,'john'],
    'emails': [ 'bassel@gmail.com'
                ,'mark@yahoo.com'
                ,'betty@msn.net'
                ,'john@gmail.com']
}

df = pd.DataFrame(data)
df

| | names | emails |
|---|---|---|
| 0 | bassel | bassel@gmail.com |
| 1 | mark | mark@yahoo.com |
| 2 | betty | betty@msn.net |
| 3 | john | john@gmail.com |


Build a flag for gmail accounts

In [43]:
df['has_gmail'] = df['emails'].str.contains('gmail')
df

| | names | emails | has_gmail |
|---|---|---|---|
| 0 | bassel | bassel@gmail.com | True |
| 1 | mark | mark@yahoo.com | False |
| 2 | betty | betty@msn.net | False |
| 3 | john | john@gmail.com | True |
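
`str.contains()` treats its argument as a regular expression by default and accepts `case` and `na` parameters. The sketch below assumes a small made-up Series with mixed case and a missing value; the variable names are illustrative.

```python
import pandas as pd

# Made-up data: one upper-case service name and one missing value.
emails = pd.Series(['bassel@GMAIL.com', 'mark@yahoo.com', None, 'john@gmail.com'])

# case=False ignores letter case; na=False counts missing values as non-matches.
has_gmail = emails.str.contains(r'@gmail\.', case=False, na=False)

# (?:...) is a non-capturing group, used here for alternation.
gmail_or_yahoo = emails.str.contains(r'@(?:gmail|yahoo)\.', case=False, na=False)

print(has_gmail.tolist())
print(gmail_or_yahoo.tolist())
```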


### Using `split()`

Extract the top-level domain and the email service into 2 columns

In [46]:
df['email_domain'] = df['emails'].str.split('.').str[1] # split on the literal '.' and take the second piece (the part after the dot)
df

| | names | emails | has_gmail | email_domain |
|---|---|---|---|---|
| 0 | bassel | bassel@gmail.com | True | com |
| 1 | mark | mark@yahoo.com | False | com |
| 2 | betty | betty@msn.net | False | net |
| 3 | john | john@gmail.com | True | com |


In [50]:
df['email_service'] = df['emails'].str.split('@').str[1].str.split('.').str[0]
df

| | names | emails | has_gmail | email_domain | email_service |
|---|---|---|---|---|---|
| 0 | bassel | bassel@gmail.com | True | com | gmail |
| 1 | mark | mark@yahoo.com | False | com | yahoo |
| 2 | betty | betty@msn.net | False | net | msn |
| 3 | john | john@gmail.com | True | com | gmail |
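
An alternative to chaining `split()` calls is `str.extract()`, which pulls capture groups into columns in one step. This is a sketch on the same email addresses; the pattern and the reuse of the column names `email_service` and `email_domain` as group names are choices made for this example.

```python
import pandas as pd

df = pd.DataFrame({'emails': ['bassel@gmail.com', 'mark@yahoo.com',
                              'betty@msn.net', 'john@gmail.com']})

# Named groups become the column names of the extracted DataFrame.
parts = df['emails'].str.extract(r'@(?P<email_service>\w+)\.(?P<email_domain>\w+)$')

# Attach the new columns to the original frame by index.
df = df.join(parts)
print(df)
```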


#### Using `expand=True` with `split()`

In [51]:
data = {'info':['Bassel-85', 'Mark-70', 'Becky-92', 'Mike-79']}


df = pd.DataFrame(data)
df

| | info |
|---|---|
| 0 | Bassel-85 |
| 1 | Mark-70 |
| 2 | Becky-92 |
| 3 | Mike-79 |


Split the info values into 2 separate columns

In [52]:
df[['name','score']] = df['info'].str.split('-', expand=True)
df

| | info | name | score |
|---|---|---|---|
| 0 | Bassel-85 | Bassel | 85 |
| 1 | Mark-70 | Mark | 70 |
| 2 | Becky-92 | Becky | 92 |
| 3 | Mike-79 | Mike | 79 |
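
Note that `split()` produces strings, so the `score` column needs a conversion before any arithmetic. The sketch below shows one way to do that; the `n=1` argument and the `mean_score` name are additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'info': ['Bassel-85', 'Mark-70', 'Becky-92', 'Mike-79']})

# n=1 splits on the first '-' only, so any later dash stays in the second column.
df[['name', 'score']] = df['info'].str.split('-', n=1, expand=True)

# The split pieces are strings; convert before doing numeric work.
df['score'] = df['score'].astype(int)
mean_score = df['score'].mean()
print(mean_score)
```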


### Using `re` with `pandas`

In [53]:
email_addr='catherine@gmail.com'
email_service = re.findall('@(.+)[.]', email_addr)
print(email_service)

['gmail']


In [54]:
data = {
    'names': [  'bassel'
                ,'mark'
                ,'betty'
                ,'john'],
    'emails': [ 'bassel@gmail.com'
                ,'mark@yahoo.com'
                ,'betty@msn.net'
                ,'john@gmail.com']
}

df = pd.DataFrame(data)
df

| | names | emails |
|---|---|---|
| 0 | bassel | bassel@gmail.com |
| 1 | mark | mark@yahoo.com |
| 2 | betty | betty@msn.net |
| 3 | john | john@gmail.com |


In [71]:
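def extract_dom(text):
    # capture everything between '@' and the last '.' (the email service name)
    matches = re.findall(r'@(.+)[.]', text)
    # return None instead of raising IndexError when nothing matches
    return matches[0] if matches else None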

In [72]:
df['domain'] = df['emails'].apply(extract_dom)
df

| | names | emails | domain |
|---|---|---|---|
| 0 | bassel | bassel@gmail.com | gmail |
| 1 | mark | mark@yahoo.com | yahoo |
| 2 | betty | betty@msn.net | msn |
| 3 | john | john@gmail.com | gmail |
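
One caveat about the pattern `'@(.+)[.]'`: the `.+` is greedy, so on an address with several dots it runs up to the last one. The sketch below compares it with two alternatives on a hypothetical address; `addr` and the variable names are made up for illustration.

```python
import re

addr = 'ann@mail.example.co.uk'  # hypothetical address with several dots

# Greedy: .+ takes as much as possible, so the match ends at the LAST dot.
greedy = re.findall(r'@(.+)[.]', addr)

# Lazy: .+? takes as little as possible, so the match ends at the FIRST dot.
lazy = re.findall(r'@(.+?)[.]', addr)

# Negated class: [^.]+ cannot cross a dot at all.
no_dots = re.findall(r'@([^.]+)[.]', addr)

print(greedy, lazy, no_dots)
```

For the single-dot addresses used in this section, all three patterns return the same result.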
