## Regular Expressions (Regex) in Python

In [1]:
# regex basically looks for patterns
# and manipulates these patterns in a dataset

import re

## Special Characters in Regex

#### the backslash ( \ )
used to escape special characters, so that they are matched literally.

In [2]:
s = "Today.was.good."
# without the (\), '.' matches any character, so the first match is 'T'
match = re.search(r'.', s)
print(match)


<re.Match object; span=(0, 1), match='T'>


In [3]:
s = "Today.was.good."
# with the (\), '\.' matches only a literal dot
match = re.search(r'\.', s)
print(match)


<re.Match object; span=(5, 6), match='.'>
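As a small extension of the two cells above, the sketch below collects every literal dot with re.findall() and lets re.escape() add the backslash for us. The variable names are illustrative only.

```python
import re

s = "Today.was.good."

# An escaped dot matches only literal '.' characters
dots = re.findall(r'\.', s)
print(dots)  # ['.', '.', '.']

# re.escape() inserts the backslash needed to match text literally
escaped = re.escape('.')
print(escaped)  # \.

# Unescaped, '.' matches every character in this string
any_char_count = len(re.findall(r'.', s))
print(any_char_count == len(s))  # True
```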


#### the square bracket
represents a set of characters we want to match

In [4]:
# lists every occurrence of the selected letters
# [abc] matches a single 'a', 'b' or 'c'
s = "when it's all said and done, be happy"
match = re.findall(r'[abc]', s)
print(match)

['a', 'a', 'a', 'b', 'a']


In [5]:
# matching is case sensitive, so the capital 'A' is ignored
s = "when it's all said and done, be hAppy"
match = re.findall(r'[abc]', s)
print(match)

['a', 'a', 'a', 'b']
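If the capital letter should be matched as well, there are two common options: list both cases inside the brackets, or pass the re.IGNORECASE flag. A minimal sketch using the same sentence:

```python
import re

s = "when it's all said and done, be hAppy"

# Option 1: list both cases inside the set
both_cases = re.findall(r'[abcABC]', s)

# Option 2: keep the set as it is and ignore case with a flag
ignore_case = re.findall(r'[abc]', s, flags=re.IGNORECASE)

print(both_cases)   # ['a', 'a', 'a', 'b', 'A']
print(ignore_case)  # ['a', 'a', 'a', 'b', 'A']
```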


#### the caret (^)
checks whether a string begins with a given pattern

In [6]:
z = "greet, grew, grey, grow, "
match = re.findall(r'^gre', z)
print(match)

['gre']
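Only one 'gre' is returned even though several words contain it, because the caret anchors the pattern to the start of the string. The sketch below makes this visible and also shows the re.MULTILINE flag, which lets the caret match at the start of every line. The multi-line string is a made-up example.

```python
import re

z = "greet, grew, grey, grow, "

# '^gre' matches once, at the very start of the string
start_only = re.findall(r'^gre', z)
print(start_only)  # ['gre']

# 'grew' is in the string, but not at the start, so nothing is found
not_at_start = re.findall(r'^grew', z)
print(not_at_start)  # []

# With re.MULTILINE, '^' also matches right after each newline
lines = "greet\ngrew\nblue\ngrey"
per_line = re.findall(r'^gre', lines, flags=re.MULTILINE)
print(per_line)  # ['gre', 'gre', 'gre']
```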


## Functions in the Module 're' 

## 1. re.search()
syntax = re.search(pattern, string)

In [7]:
# finds the first occurrence of the pattern
poem = 'Peter Piper picked a peck of pickled peppers; A peck of pickled peppers Peter Piper picked;'
re.search('picked', poem)

<re.Match object; span=(12, 18), match='picked'>

## 2. re.match()

syntax = re.match(pattern, string)

In [8]:
# finds the pattern only if it occurs at the start of the text
poem = 'Peter Piper picked a peck of pickled peppers; A peck of pickled peppers Peter Piper picked;'
z = re.match('picked', poem)
print(z)

None
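The result is None because the poem starts with 'Peter', not 'picked'. A short side-by-side sketch of re.match() and re.search() on the same poem:

```python
import re

poem = 'Peter Piper picked a peck of pickled peppers; A peck of pickled peppers Peter Piper picked;'

# re.match() succeeds only when the pattern is at position 0
at_start = re.match('Peter', poem)
print(at_start.span())  # (0, 5)

# 'picked' is not at position 0, so re.match() returns None
not_at_start = re.match('picked', poem)
print(not_at_start)  # None

# re.search() scans the whole string and returns the first hit
anywhere = re.search('picked', poem)
print(anywhere.span())  # (12, 18)
```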


## 3. re.findall()

syntax = re.findall(pattern, string)

In [9]:
# returns a list of all the times the pattern occurred
poem = 'Peter Piper picked a peck of pickled peppers; A peck of pickled peppers Peter Piper picked;'
re.findall('picked', poem)

['picked', 'picked']

## 4. re.split()

syntax = re.split(pattern, string)

In [10]:
poem = 'Peter Piper picked a peck of pickled peppers; A peck of pickled peppers Peter Piper picked;'

# the '\s' splits on white spaces
re.split(r'\s', poem)

['Peter',
 'Piper',
 'picked',
 'a',
 'peck',
 'of',
 'pickled',
 'peppers;',
 'A',
 'peck',
 'of',
 'pickled',
 'peppers',
 'Peter',
 'Piper',
 'picked;']
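re.split() accepts any pattern, not only whitespace, and an optional maxsplit argument. The sketch below splits the poem into its two lines at each semicolon and then limits a whitespace split to two cuts. Note the empty string at the end of the first result: the poem ends with a semicolon, so there is nothing after the last split point.

```python
import re

poem = 'Peter Piper picked a peck of pickled peppers; A peck of pickled peppers Peter Piper picked;'

# Split on a semicolon followed by optional whitespace
lines = re.split(r';\s*', poem)
print(lines)

# maxsplit=2 stops after two splits; the rest stays in one piece
first_words = re.split(r'\s', poem, maxsplit=2)
print(first_words[:2])  # ['Peter', 'Piper']
```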

## 5. re.sub()

syntax = re.sub(pattern, repl, string)
simplified syntax = re.sub(old pattern, replacement, string)

In [11]:
poem = 'Peter Piper picked a peck of pickled peppers; A peck of pickled peppers Peter Piper picked;'
re.sub('picked', 'chewed', poem)

'Peter Piper chewed a peck of pickled peppers; A peck of pickled peppers Peter Piper chewed;'
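Two variations that are often handy: the count argument limits how many matches are replaced, and re.subn() also reports how many replacements were made. A minimal sketch on the same poem:

```python
import re

poem = 'Peter Piper picked a peck of pickled peppers; A peck of pickled peppers Peter Piper picked;'

# Replace only the first occurrence
first_only = re.sub('picked', 'chewed', poem, count=1)
print(first_only)

# re.subn() returns the new string and the number of replacements
new_poem, n_replaced = re.subn('picked', 'chewed', poem)
print(n_replaced)  # 2
```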

## Operations to Perform Using Regex


## 1. Finding a word



In [12]:
if re.search("self", "Emancipate yourself from mental slavery...screams self-development."):
    print("Self is present here")

Self is present here


## 2. Finding the Index of a Match
Here we find the start and end index of a particular word in the sentence, using the match object returned by re.search()

In [13]:
s = """ DataInsight has a large community and support for learners"""
match = re.search(r'community', s)
print('Start Index:', match.start())
print('End Index:', match.end())

Start Index: 25
End Index: 34
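re.search() reports only the first match. To walk through every match together with its position, re.finditer() returns an iterator of match objects. The sketch below reuses the poem from the earlier sections rather than the sentence above, since it contains the word twice.

```python
import re

poem = 'Peter Piper picked a peck of pickled peppers; A peck of pickled peppers Peter Piper picked;'

# re.finditer() yields one match object per occurrence
spans = []
for m in re.finditer(r'picked', poem):
    print('Start Index:', m.start(), 'End Index:', m.end())
    spans.append((m.start(), m.end()))

# Each (start, end) pair can be used to slice the original string
print([poem[start:end] for start, end in spans])  # ['picked', 'picked']
```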


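In the code example above, the 'r' prefix marks a raw string rather than a regular string. The difference is in how Python reads the backslash: in a regular string a backslash starts an escape sequence (for example, '\n' becomes a newline), while in a raw string the backslash is kept as an ordinary character. This means a pattern such as '\s' or '\.' reaches the regex engine exactly as written. The prefix does not change how the regex engine itself treats special characters like '.' and '*' (see the special characters section above).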
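A quick way to see what the 'r' prefix does is to compare the same literal with and without it. In the sketch below, the sample text is made up for illustration. '\b' is a good test case because Python reads it as a backspace character in a regular string, whereas the regex engine reads r'\b' as a word boundary.

```python
import re

# A regular string turns '\n' into one newline character;
# a raw string keeps the backslash and the 'n' as two characters
print(len('\n'))   # 1
print(len(r'\n'))  # 2

text = "self-development starts with self"

# Regular string: the pattern contains backspace characters, so nothing matches
without_raw = re.findall('\bself\b', text)
print(without_raw)  # []

# Raw string: the regex engine receives \b and treats it as a word boundary
with_raw = re.findall(r'\bself\b', text)
print(with_raw)  # ['self', 'self']
```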

## 3. Match words with a pattern

In [14]:
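# [shmp] matches one letter from the set 's', 'h', 'm', 'p',
# so '[shmp]at' matches that letter followed by 'at'

# named 'words' rather than 'str' to avoid shadowing the built-in str
words = "sat, hat, mat, pat"
allstr = re.findall("[shmp]at", words)
for i in allstr:
    print(i)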

sat
hat
mat
pat


## 4. Series of a Range of Characters

In [15]:
# named 'words' rather than 'str' to avoid shadowing the built-in str
words = "sat, hat, mat, pat"

# prints words which begin with a letter in the range 'm' to 'p'
allstr = re.findall("[m-p]at", words)
for i in allstr:
    print(i)

mat
pat
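A caret placed as the first character inside the brackets negates the set, which is a different job from the start-of-string caret described earlier. A minimal sketch on the same list of words:

```python
import re

words = "sat, hat, mat, pat"

# [^m-p] matches any single character that is NOT in the range m to p
not_m_to_p = re.findall(r'[^m-p]at', words)
print(not_m_to_p)  # ['sat', 'hat']
```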


## 5. Replace a String

In [16]:
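birds = 'A hen, emu, kiwi cannot fly.'
# compile a pattern that matches the bird emu in the string birds
# ([e] is a one-letter set, so '[e]mu' is the same as 'emu')
pattern = re.compile('[e]mu')
# replace every match with 'dodo'
real = pattern.sub('dodo', birds)
print(real)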

A hen, dodo, kiwi cannot fly.
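re.compile() turns a pattern into a reusable pattern object. That object offers the same operations as the module-level functions (search, findall, sub and so on), without passing the pattern each time. A short sketch:

```python
import re

birds = 'A hen, emu, kiwi cannot fly.'

# Compile once, then reuse the same pattern object
pattern = re.compile(r'emu')

found = pattern.search(birds)
print(found.span())            # (7, 10)

all_found = pattern.findall(birds)
print(all_found)               # ['emu']

replaced = pattern.sub('dodo', birds)
print(replaced)                # A hen, dodo, kiwi cannot fly.
```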


## Applications of Regex
1. Extracting emails
2. Web scraping
3. Data wrangling
4. Data cleaning

## Extracting E-mails

In [17]:
# Extracting e-mails
# no trailing backslashes here: a backslash at the end of a line joins it
# to the next one, which would glue each address to the following word
mail = """From: adwumapa27@gmail.com
Sent: 16th October, 2021
To: owusea15@yahoo.com
Subject: Paper Towel Ventures
Thank you for choosing us. For bulk purchases, email our Ghanaian correspondent through
plemanbee1vent@gmail.com
best,
Joana :D"""

# raw string, so that '\w' reaches the regex engine unchanged
re.findall(r"[\w.-]+@[\w.-]+", mail)

['adwumapa27@gmail.com',
 'owusea15@yahoo.com',
 'plemanbee1vent@gmail.com']
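Parentheses in a pattern create groups, and re.findall() then returns one tuple per match instead of the whole address. The sketch below uses this to separate the user name from the domain. It works on a shortened copy of the message, and the pattern is a simple illustration rather than a full e-mail validator.

```python
import re

# Shortened copy of the message above, one header per line
mail = """From: adwumapa27@gmail.com
To: owusea15@yahoo.com"""

# Group 1 captures the user name, group 2 the domain
pairs = re.findall(r"([\w.-]+)@([\w.-]+)", mail)
print(pairs)  # [('adwumapa27', 'gmail.com'), ('owusea15', 'yahoo.com')]

# Collect the distinct domains
domains = sorted({domain for _, domain in pairs})
print(domains)  # ['gmail.com', 'yahoo.com']
```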

## Data Cleaning

This example uses a practice project on DataCamp, The Android App Market on Google Play.
The dataset is available on Kaggle (https://www.kaggle.com/lava18/google-play-store-apps?select=googleplaystore.csv).
After importing the dataset into a pandas dataframe, some of the columns contained special characters such as $, + and the comma.


         Category  Rating  Reviews  Size     Installs  Type Price  \
0  ART_AND_DESIGN     4.1      159  19.0      10,000+  Free     0   
1  ART_AND_DESIGN     3.9      967  14.0     500,000+  Free     0   
2  ART_AND_DESIGN     4.7    87510   8.7   5,000,000+  Free     0   
3  ART_AND_DESIGN     4.5   215644  25.0  50,000,000+  Free     0   
4  ART_AND_DESIGN     4.3      967   2.8     100,000+  Free     0 


A regex saved the situation

In [None]:
# List of column names to clean
cols_to_clean = ['Installs', 'Price']

# Loop for each column in cols_to_clean
for col in cols_to_clean:
    # Remove every character that is not a digit or a decimal point.
    # This drops '+', ',' and '$' and leaves any decimal point in Price intact.
    # regex=True is passed explicitly so the pattern is treated as a regex.
    apps[col] = apps[col].astype(str).str.replace(r'[^\d.]', '', regex=True)

# Print a summary of the apps dataframe
print(apps.head())

The output now looks like this:

         Category  Rating  Reviews  Size  Installs  Type Price  \
0  ART_AND_DESIGN     4.1      159  19.0     10000  Free     0   
1  ART_AND_DESIGN     3.9      967  14.0    500000  Free     0   
2  ART_AND_DESIGN     4.7    87510   8.7   5000000  Free     0   
3  ART_AND_DESIGN     4.5   215644  25.0  50000000  Free     0   
4  ART_AND_DESIGN     4.3      967   2.8    100000  Free     0

Regex to the rescue.

For the code above, the line
        apps[col] = apps[col].astype(str).str.replace(r'[^\d.]', '', regex=True)
replaces every character in the column that is neither a digit nor a decimal point with an empty string. The shorter pattern \D matches any non-digit and would be enough for Installs, but it would also strip the decimal point from a price, so the narrower set [^\d.] is used here. Each column is cleaned on its own (apps[col]), so the values of one column are never written into another.
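The same cleaning idea can be tried without pandas, using re.sub() on plain lists. In the sketch below, the install counts are copied from the table above, while '$4.99' is a hypothetical price added only to show what happens to a decimal point.

```python
import re

installs = ['10,000+', '500,000+', '5,000,000+']
prices = ['0', '$4.99']  # '$4.99' is a hypothetical sample value

# \D matches any non-digit, which is enough for whole numbers
clean_installs = [int(re.sub(r'\D', '', value)) for value in installs]
print(clean_installs)  # [10000, 500000, 5000000]

# \D would also delete the decimal point and turn '$4.99' into '499'
print(re.sub(r'\D', '', '$4.99'))  # 499

# [^\d.] keeps digits and the decimal point
clean_prices = [float(re.sub(r'[^\d.]', '', value)) for value in prices]
print(clean_prices)  # [0.0, 4.99]
```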

Find the full project, 'The Android App Market on Google Play' on DataCamp.com