# Regular Expression

Regular expressions, often abbreviated as RegEx or regex, are a powerful tool for pattern matching and text manipulation. They provide a concise and flexible way to search, extract, and manipulate strings of text based on specific patterns.

However, regular expressions can be tricky to work with, especially for complex patterns. It's essential to understand their syntax and behaviour before using them. Python's `re` module provides functions such as `search()`, `match()`, `findall()` and `sub()`, which make it easier to leverage regular expressions within Python code.

Regular expressions are also instrumental in extracting information from text such as log files, spreadsheets, or even textual documents.

**For example:** Below are some cases where regular expressions can save you a lot of time.
* Searching and replacing text in files.
* Validating text input, such as passwords and email addresses.
* Renaming a hundred files at a time. For example, you can change the extension of all the files using a regex pattern.

We will start this topic by using the `re` module, a built-in Python module that provides all the functionality needed for handling patterns and regular expressions.

In [1]:
import re

In [2]:
#print(help(re))

## Metacharacters

Metacharacters are characters with a special meaning that affect how the regular expression around them is interpreted.

Metacharacters don't match themselves. Instead, they indicate that some rule should be applied. Characters such as `|`, `+`, or `*` are metacharacters. For example, the `^` (caret) metacharacter is used to match the start of the string.

Metacharacters are also called operators, signs, or symbols.

First, let's see the list of regex metacharacters we can use in Python and their meaning.

**(. Dot)**       Matches any character except a newline character `(any character except newline)`.

**(^ Caret)**     Matches the pattern only at the start of the string `(starts with)`.

**($ Dollar)**    Matches the pattern at the end of the string `(ends with)`.

**(* Asterisk)**  Matches 0 or more repetitions of the preceding regex `(zero or more occurrences)`.

**(+ Plus)**      Matches 1 or more repetitions of the preceding regex `(one or more occurrences)`.

**(? Question mark)**     Matches 0 or 1 repetition of the preceding regex `(zero or one occurrence)`.

**([] Square brackets)**     Used to indicate a set of characters. Matches any single character in the brackets `(a set of characters)`. For example, `[abc]` will match either a, b, or c. Characters are listed without separators: `[a,b,c]` would also match a comma.

**(| Pipe)**           Used to specify multiple patterns `(either or)`. For example `P1|P2`, where P1 and P2 are different regexes.

**(\ Backslash)**       Used to escape special characters or to signal a special sequence. For example, to search for a literal dot you can write `\.`.

**[^...]**       Matches any single character not in the brackets.

**(...)**       Matches whatever regular expression is inside the parentheses and groups it. For example, `(abc)` will match the substring 'abc'.

**{}**        Matches the specified number of occurrences. For example, `{3}` means exactly three repetitions and `{2,4}` means between two and four.

### Examples of Metacharacters

In [3]:
# . metacharacter

txt = "hello planet"
x = re.findall("he..o", txt)
print(x)

['hello']


In [4]:
# ^ metacharacter

txt = "Hello planet"
x = re.findall("^Hello", txt)
print(x)

['Hello']


In [5]:
# $ metacharacter

txt = "Hello planet"
x = re.findall("planet$", txt)
print(x)

['planet']
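
The cells above cover `.`, `^` and `$`. As a minimal sketch of the remaining metacharacters from the list, the snippet below applies each one to a short sample string. The sample strings and variable names are made up for illustration.

```python
import re

# * : zero or more repetitions of the preceding "b"
star = re.findall(r"ab*", "a ab abb")

# + : one or more repetitions of the preceding "b"
plus = re.findall(r"ab+", "a ab abb")

# ? : the preceding "u" is optional
optional = re.findall(r"colou?r", "color colour")

# [] : any single character from the set
vowels = re.findall(r"[aeiou]", "planet")

# [^...] : any single character NOT in the set
not_vowels = re.findall(r"[^aeiou]", "planet")

# | : either the pattern on the left or the one on the right
either = re.findall(r"cat|dog", "cat, dog, bird")

# {m,n} : between m and n repetitions (here two or three digits)
repeated = re.findall(r"\d{2,3}", "7 42 1234")

# \ : escape a metacharacter so it matches itself
literal_dot = re.findall(r"3\.14", "3.14 3x14")

print(star, plus, optional, vowels, not_vowels, either, repeated, literal_dot)
```

Note that `\d{2,3}` takes as many digits as it can, so `1234` yields `123` and the leftover `4` is too short to match.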


## Special Sequences

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning. In the examples, the "r" prefix makes sure that the string is treated as a "raw string", so that Python does not interpret the backslash itself.

**\A** - Matches only if the specified characters are at the beginning of the string. Example: r"\AThe"

**\b** - Matches where the specified characters are at the beginning or at the end of a word. Examples: r"\bain" and r"ain\b"

**\B** - Matches where the specified characters are present, but NOT at the beginning (or at the end) of a word. Examples: r"\Bain" and r"ain\B"

**\d** - Matches any digit (numbers from 0-9)

**\D** - Matches any character that is NOT a digit

**\s** - Matches any white space character

**\S** - Matches any character that is NOT a white space character

**\w** - Matches any word character (letters from a to z and A to Z, digits from 0-9, and the underscore _ character)

**\W** - Matches any character that is NOT a word character

**\Z** - Matches only if the specified characters are at the end of the string. Example: r"Spain\Z"
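
To make the list concrete, here is a small sketch that tries several special sequences on one sample sentence. The sentence and the variable names are assumptions chosen for illustration.

```python
import re

text = "The rain in Spain stays mainly in the plain 2024"

# \A anchors the pattern to the very beginning of the string
starts_with_the = re.search(r"\AThe", text) is not None

# \b: "ain" at the start of a word (none here) or at the end of a word
word_start = re.findall(r"\bain", text)
word_end = re.findall(r"ain\b", text)

# \B: "ain" strictly inside a word (only "mainly" qualifies)
inside_word = re.findall(r"\Bain\B", text)

# \d matches digits, \s matches white space characters
digits = re.findall(r"\d+", text)
space_count = len(re.findall(r"\s", text))

# \Z anchors the pattern to the very end of the string
ends_with_digits = re.search(r"\d+\Z", text) is not None

print(starts_with_the, word_start, word_end, inside_word)
print(digits, space_count, ends_with_digits)
```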

## RegEx Functions

The `re` module offers a set of functions that allow us to search a string for a match:

* **findall** - Returns a list containing all matches
* **search** - Returns a Match object if there is a match anywhere in the string
* **match** - Returns a Match object if the pattern matches at the beginning of the string
* **split** - Returns a list where the string has been split at each match
* **sub** - Replaces one or many matches with a string

## 1. findall() Function

`re.findall()` scans the target string from left to right as per the regular expression pattern and returns all matches in the order they were found.

It returns an empty list if it fails to locate any occurrence of the pattern in the target string.
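
As a quick check of the return value, the sketch below calls `re.findall()` once with a pattern that has no match and once with the case-insensitive flag. The sample text and variable names are illustrative.

```python
import re

txt = "Data science uses that data"

# No digits in the text, so the result is an empty list (not None)
no_match = re.findall(r"\d+", txt)

# An empty list is falsy, which makes the "no match" case easy to test
found_digits = bool(no_match)

# flags=re.IGNORECASE finds both "Data" and "data"
both_cases = re.findall(r"data", txt, flags=re.IGNORECASE)

print(no_match, found_digits, both_cases)
```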

#### Matching Specific Pattern

In [6]:
import re

pattern = "Data science"
txt = "Data science uses that data to make companies more profitable part."
matches = re.findall(pattern, txt)
print(matches)

['Data science']


#### Extracting digits from a string

In [7]:
txt ="Virat Kohli scored 90 runs in his 100th match"

pattern = r"\d"

x = re.findall(pattern, txt)
print(x)

['9', '0', '1', '0', '0']


In [8]:
# write a regular expression to search digits inside a string

txt ="Virat Kohli scored 90 runs in his 100th match"

pattern = r"\d+"

x = re.findall(pattern, txt)
print(x)

['90', '100']


In [9]:
txt = "12ivzxllvz222szq1u45xzep"
pattern = "[0-9]+"

x = re.findall(pattern, txt)
print(x)

['12', '222', '1', '45']


In [10]:
# Extracting digits from a list of string

price = ['Apple costs Rs 50', 'Rs 60 for each Pineapple', 'Rs 120 for each Watermelon']

for i in price:
    matches = re.findall(r"\d+", i)
    print (matches)

['50']
['60']
['120']


In [11]:
# \w matches a word character (letter, digit, or underscore); \w{5} takes exactly five of them in a row

string = "Data science has an added trickiness, in that it’s such a difficult field to manage, given the technical acumen that’s required."
result = re.findall(r"\w{5}", string)

print(result)

['scien', 'added', 'trick', 'iness', 'diffi', 'field', 'manag', 'given', 'techn', 'acume', 'requi']


In [12]:
# \w+ matches a run of one or more word characters, i.e. a whole word

string = "Data science has an added trickiness, in that it’s such a difficult field to manage, given the technical acumen that’s required."
result = re.findall(r"\w+", string)

print(result)

['Data', 'science', 'has', 'an', 'added', 'trickiness', 'in', 'that', 'it', 's', 'such', 'a', 'difficult', 'field', 'to', 'manage', 'given', 'the', 'technical', 'acumen', 'that', 's', 'required']


## 2. search() Function

The `re.search()` method looks for the first occurrence of the regex pattern anywhere in the target string and returns the corresponding Match object. If there is no match, it returns None.

The regular expression pattern and target string are the mandatory arguments:

* **pattern:** The first argument is the regular expression pattern we want to search for inside the target string.
* **string:** The second argument is the variable pointing to the target string (in which we want to look for occurrences of the pattern).

In [13]:
test = "India China USA UAE"
x = re.search("India", test)
print(x)

# span=(0, 5) gives the start and end index of the match.
# The match starts at index 0 ('I') and ends just before index 5, so the last matched character, 'a', is at index 4.

<re.Match object; span=(0, 5), match='India'>


In [14]:
# group() is a method of the Match object returned by search()

test = "India China USA UAE"
x = re.search("India", test)
print(x.group()) # to retrieve the matched text, use group()

India


In [15]:
msgs = ['I grabbed a banana from the fruit bowl and quickly peeled it before taking a bite.',
       'The bright yellow banana added a pop of color to the green smoothie.',
       'My breakfast routine always includes a bowl of cereal topped with sliced bananas.']

for msg in msgs:
    search = re.search('banana', msg)
    print(search)

<re.Match object; span=(12, 18), match='banana'>
<re.Match object; span=(18, 24), match='banana'>
<re.Match object; span=(73, 79), match='banana'>


In [16]:
msg1 ="This product is really Great"

search = re.search("^This.*Great$", msg1)
print(search)

<re.Match object; span=(0, 28), match='This product is really Great'>


## 3. match() Function

The `re.match()` method looks for the regex pattern only at the beginning of the target string and returns a Match object if a match is found; otherwise it returns None.

The Match object contains the positions at which the match starts and ends, and the actual matched value.

**Note:** If there is no match, the value None will be returned instead of the Match object, so check the result before calling methods such as `group()` on it.

In [17]:
target = 'Virat Kohli is a world-class cricketer known for his exceptional batting skills and leadership on the field.'

# "cricketer" occurs in the string, but not at the beginning
result = re.match(r"cricketer", target)
print(result)

None


If you use the `match()` method to look for a word that appears later in the string, you will get None, because it returns a match only if the pattern is located at the beginning of the string. As we can see, the word "cricketer" is not present at the start.

So, to match the regex pattern anywhere in the string you need to use either the `search()` or the `findall()` function of the `re` module.
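
The difference can be sketched side by side on one sentence. The target string below is shortened from the cell above, and the variable names are illustrative.

```python
import re

target = "Virat Kohli is a world-class cricketer"

# match() only succeeds when the pattern is found at index 0
at_start = re.match(r"Virat", target)
not_at_start = re.match(r"cricketer", target)

# search() scans the whole string and returns the first hit
anywhere = re.search(r"cricketer", target)

# findall() collects every hit as a list of strings
all_hits = re.findall(r"\b\w{5}\b", target)

print(at_start.group())   # the matched text at the start
print(not_at_start)       # None, because the word is not at the start
print(anywhere.span())    # where search() found the word
print(all_hits)           # every five-letter word
```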

### Match object attributes and methods

The Match object has attributes and methods used to retrieve information about the search and the result. They are available on any Match object, whether it was returned by `search()`, `match()` or `finditer()`:
* **span()** - returns a tuple containing the start and end positions of the match.
* **string** - an attribute (not a method) holding the string passed into the function.
* **group()** - returns the part of the string where there was a match.

In [18]:
msgs = ['I grabbed a banana from the fruit bowl and quickly peeled it before taking a bite.',
       'The bright yellow banana added a pop of color to the green smoothie.',
       'My breakfast routine always includes a bowl of cereal topped with sliced bananas.']

for msg in msgs:
    search = re.search('banana', msg)
    print(search)
    print(search.group())

<re.Match object; span=(12, 18), match='banana'>
banana
<re.Match object; span=(18, 24), match='banana'>
banana
<re.Match object; span=(73, 79), match='banana'>
banana


In [19]:
# print the start and end position of the first match occurrence

msg = "Eating two bananas a day can do wonders for your health"
x = re.search(r"\bbana\w+", msg)
print(x.span())

(11, 18)


In [20]:
# print the string that was passed into search()

msg = "Eating two bananas a day can do wonders for your health"
x = re.search(r"\bbana\w+", msg)
print(x.string)

Eating two bananas a day can do wonders for your health


In [21]:
msgs = 'I grabbed a banana from the fruit bowl and quickly peeled it before taking a bite.'
x = re.search(r"\bbana\w+", msgs)
print(x.group())

banana


The `re.search()` function returns a Match object (i.e., `re.Match`). This Match object contains the following two items.

1. A tuple with the start and end index of the successful match, available through `span()`.
2. The actual matching value, which we can retrieve using the `group()` method.

In [22]:
string = "APJ Abdul Kalam was an eminent Indian scientist and the 11th President of India, who inspired millions with his vision, humility, and dedication to the advancement of science and technology."
result = re.search(r"\w{5}", string)

print(result)
print(result.group())

# search() always returns only the first occurrence. To find all five-character matches, use findall()

<re.Match object; span=(4, 9), match='Abdul'>
Abdul


## split() Function

`re.split()` splits the target string at each match of the pattern and returns a list. The regex pattern and target string are the mandatory arguments. The maxsplit and flags arguments are optional.

* **pattern:** the regex pattern used for splitting the target string.
* **string:**  the variable pointing to the target string (i.e., the string we want to split).
* **maxsplit:**  the maximum number of splits to perform. If maxsplit is 2, at most two splits occur, and the remainder of the string is returned as the final element of the list.

In [23]:
# split at each white space character (\s matches a space, tab, or newline)

text = "The monkey swung from branch to branch, skillfully peeling a banana and devouring it with delight."
x = re.split(r"\s", text)
print(x)

['The', 'monkey', 'swung', 'from', 'branch', 'to', 'branch,', 'skillfully', 'peeling', 'a', 'banana', 'and', 'devouring', 'it', 'with', 'delight.']


In [24]:
# limit the number of splits with maxsplit: only the first white space is used

text = "How is this possible?"
x = re.split(r"\s", text, maxsplit=1)
print(x)

['How', 'is this possible?']


#### Limit the number of splits

In [25]:
targeted_string = "23-12-1987"
result = re.split(r"\D", targeted_string, maxsplit=1) # split only on the first occurrence
print(result)

['23', '12-1987']


In [26]:
targeted_string = "23-12-1987_24"
result = re.split(r"\D", targeted_string, maxsplit=3) # split on at most three occurrences
print(result)

['23', '12', '1987', '24']


#### split string by two delimiters

In [27]:
string ="12,08,24-23,87-89"
result = re.split(r"-|,", string)
print(result)

# Here we are using 2 delimiters: "-" (hyphen) or "," (comma). The "|" (or) operator combines them into one pattern

['12', '08', '24', '23', '87', '89']


## 4. sub()

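The `sub()` function replaces every match of the pattern with a replacement string, so it is also known as the replace method. The related `subn()` function does the same but also returns the number of replacements made.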

In [28]:
suba = "The baker added slices of fresh banana to the top of the moist chocolate cake"
x = re.sub(r"\s", "----", suba)
print(x)

The----baker----added----slices----of----fresh----banana----to----the----top----of----the----moist----chocolate----cake


In [29]:
y = "Rs"
#x = r"[oe]"
text = "This item cost Rs 2500"

replace_text = re.sub(y,"$",text)
print(replace_text)

This item cost $ 2500


In [30]:
target = "APJ Abdul Kalam  contributions to the field of aerospace and his commitment to education made him a beloved figure and a role model for aspiring scientists and students worldwide."
re_st = re.sub(r"\s+", "", target) # remove all white space, whether single or repeated
print(re_st)

APJAbdulKalamcontributionstothefieldofaerospaceandhiscommitmenttoeducationmadehimabelovedfigureandarolemodelforaspiringscientistsandstudentsworldwide.


In [31]:
# Removing Leading spaces

cake = "       The baker added slices of fresh banana to the top of the moist chocolate cake"
replace = re.sub(r"^\s+", "", cake)
print(replace)

The baker added slices of fresh banana to the top of the moist chocolate cake


In [32]:
# Removing leading spaces and spaces from the end
# (^\s+ strips the start, \s+$ strips the end; "|" applies either one)

cake = "       The baker added slices of fresh banana to the top of the moist chocolate cake     "
replace = re.sub(r"^\s+|\s+$", "", cake)
print(replace)

The baker added slices of fresh banana to the top of the moist chocolate cake


In [33]:
#Try substitute: re.sub(regexStr, replacementStr, inStr) -> outStr
        
substitute = re.sub(r"[0-9]","*", "A15B22C808D10_115")
print(substitute)

A**B**C***D**_***


In [34]:
#Try substitute: re.sub(regexStr, replacementStr, inStr) -> outStr
        
substitute = re.sub(r"[0-9]+","*", "A15B22C808D10_115")
print(substitute)

A*B*C*D*_*


In [35]:
#Try substitute: re.subn(regexStr, replacementStr, inStr) -> (outStr, numberOfReplacements)

substitute = re.subn(r"[0-9]+",r"*", "AB22C808D10_115")
print(substitute)

('AB*C*D*_*', 4)
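
Two further options of `re.sub()` are worth a short sketch: the `count` argument, which limits how many matches are replaced, and a function as the replacement, which is called with each Match object. The sample string and variable names are illustrative.

```python
import re

text = "A15B22C808"

# count=1 replaces only the first run of digits
first_only = re.sub(r"[0-9]+", "*", text, count=1)

# A function used as the replacement receives each Match object
# and returns the text to put in its place (here: the number doubled)
doubled = re.sub(r"[0-9]+", lambda m: str(int(m.group()) * 2), text)

print(first_only)
print(doubled)
```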


# Group in Regex

A group is a part of a regex pattern enclosed in parentheses, the `(` and `)` metacharacters. We create a group by placing the regex pattern inside a set of parentheses.

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses.

We can use the `group()` method to extract each group's result separately by passing the group index as an argument. Capturing groups are numbered by counting their opening parentheses from left to right. In the example below, we use four groups.

**Note:** Unlike string indexing, which always starts at 0, group numbering starts at 1.
Group number 0 is always the entire match. If you call the `group()` method with no arguments at all or with 0 as an argument, you will get the whole matched text, not the whole target string.
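
A minimal sketch of the numbering rule, using a made-up date string with three groups:

```python
import re

m = re.search(r"(\d+)-(\d+)-(\d+)", "Date of birth: 23-12-1987")

whole_match = m.group()      # same as m.group(0): the entire matched text
day = m.group(1)             # first pair of parentheses
year = m.group(3)            # third pair of parentheses
all_groups = m.groups()      # every group as a tuple
target_string = m.string     # the full string that was searched

print(whole_match, day, year, all_groups)
print(target_string)
```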

In [36]:
string = "Rohit Sharma has scored 43 CENTURIES and 91 CENTURIES in his cricket career"
result = re.search(r"(\b\d+).+(\b[A-Z]+\b).+(\b\d+).+(\b[A-Z]+\b)", string)

print(result.groups()) # Extract matching Values of group

print(result.group(1)) # Extract matching Values of group 1

print(result.group(2)) # Extract matching Values of group 2

('43', 'CENTURIES', '91', 'CENTURIES')
43
CENTURIES


In [37]:
print(result.group(1,2))

('43', 'CENTURIES')


With the `groups()` method of a Match object, we can extract all the group matches at once. It returns them as a tuple. Passing several indexes to `group()`, as in `group(1, 2)`, returns a tuple of just those groups.

### Regex Capture Group Multiple Times

The `search()` function returns only the first match for each group. But what if a string contains multiple occurrences of a regex group and you want to extract all matches?

To capture all matches of a regex group we can use the `finditer()` function.

The `finditer()` function finds all matches and returns an iterator yielding Match objects for the regex pattern. Next, we can iterate over the Match objects and extract their values.

**Note:** Don't use `findall()` for this: it returns a list, so the `group()` and `groups()` methods cannot be applied to its result. If you try, you will get **AttributeError: 'list' object has no attribute 'groups'**.

So use `finditer()` if you want to capture all the matches of a group as Match objects.
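
The contrast can be sketched on a shortened version of the cricket sentence. With groups in the pattern, `findall()` returns plain tuples, while `finditer()` yields Match objects that still offer `group()` and `span()`. The variable names are illustrative.

```python
import re

text = "scored 43 CENTURIES and 91 HALFCENTURIES"
pattern = re.compile(r"(\d+) ([A-Z]+)")

# findall(): a list of tuples, one tuple of group values per match
as_tuples = pattern.findall(text)

# finditer(): Match objects, so group() and span() remain available
as_matches = [(m.group(1), m.group(2), m.span()) for m in pattern.finditer(text)]

print(as_tuples)
print(as_matches)
```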

In [38]:
string1 = "Rohit Sharma has scored 43 CENTURIES and 91 HALFCENTURIES in his cricket career"

pattern = re.compile(r"(\b\d+\b).(\b[A-Z]+\b)")

# find all matches of both groups:
for match in pattern.finditer(string1): 
    # extract the number (group 1)
    print(match.group(1))
    # extract the upper-case word (group 2)
    print(match.group(2))

43
CENTURIES
91
HALFCENTURIES


# Remove all consecutive duplicate words

In [39]:
import re
string_1 ="Ram went went to to his home"
regex = r'\b(\w+)(?:\W+\1\b)+'
x = re.sub(regex, r'\1',string_1)
print(x)

Ram went to his home


The details of the above regular expression can be understood as:
* "\b" - A word boundary. Boundaries are needed for special cases. For example, in "My thesis is great", the "is" at the end of "thesis" won't be treated as a duplicate of the following "is"
* "\w+" - One or more word characters: [a-zA-Z0-9_]
* "\W+" - One or more non-word characters: [^\w]
* "(?:...)" - A non-capturing group: it groups the pattern without creating a numbered group
* "\1" - Matches whatever was matched in the 1st group of parentheses, which in this case is the (\w+)
* "+" - Matches whatever it's placed after 1 or more times
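
To see why the word boundaries matter, this sketch runs the pattern with and without `\b` on the "My thesis is great" sentence. The variable names are illustrative.

```python
import re

with_boundaries = r"\b(\w+)(?:\W+\1\b)+"
without_boundaries = r"(\w+)(?:\W+\1)+"

sentence = "My thesis is great"

# With \b, only whole repeated words count, so nothing is removed
kept = re.sub(with_boundaries, r"\1", sentence)

# Without \b, the "is" ending "thesis" pairs up with the next "is"
damaged = re.sub(without_boundaries, r"\1", sentence)

print(kept)
print(damaged)
```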

In [40]:
str2 = "hello hello world world"
regex = r'\b(\w+)(?:\W+\1\b)+'

x = re.sub(regex, r'\1', str2)
print(x)

hello world


# Extract Url from Text

# Extract Email Address from a text file with Regular Expression:

Here is a brief explanation of the regular expression `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`:
* `[a-zA-Z0-9._%+-]+` matches one or more characters that can be lowercase or uppercase letters, digits, or any of the characters `._%+-`
* `@[a-zA-Z0-9.-]+` matches a literal `@` followed by one or more characters that can be lowercase or uppercase letters, digits, or any of the characters `.-`
* `\.` matches a literal period (the backslash escapes the period so that it is treated as a literal character, rather than a special character in the regular expression)
* `[a-zA-Z]{2,}` matches two or more characters that can be lowercase or uppercase letters.
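
Putting the four parts together, a minimal sketch of email extraction might look like this. The sample text and addresses are made up, and the pattern is a simplification that does not cover every valid email address.

```python
import re

# The four pieces described above, joined into one pattern
email_pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Hypothetical sample text; in practice this could be the contents of a file
text = "Contact alice.smith@example.com or bob_99@mail.example.org for details."

emails = email_pattern.findall(text)
print(emails)
```

To run it on a file, the same `findall()` call can be applied to the string returned by reading the file.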

# Regex with Pandas

In [41]:
import pandas as pd

pandas contains several functions that support pattern matching with regex.

Below are the three major functions:
* **Series.str.contains(pattern)** - This function checks for a pattern in a column (Series) and returns True or False values (a mask) where the pattern matches. The mask can then be applied to the entire dataframe to return only the rows where it is True.
* **Series.str.extract(pattern, expand, flags)** - To use this function, we must define groups using parentheses inside the pattern. The function extracts the matches and returns the groups as columns in a dataframe. When you have only one group in the pattern, use expand=False to return a Series instead of a dataframe object.
* **Series.str.replace(pattern, repl, flags)** - Similar to re.sub(), this function replaces matches with the repl string. In pandas 2.0 and later, the pattern is treated as a literal string unless `regex=True` is passed.
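
Before moving to the full dataset, here is a small sketch of the three functions on a hand-made Series. It assumes pandas is installed; the three names are copied from the Titanic rows shown in this notebook, and the variable names are illustrative.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
])

# contains(): boolean mask, True where the pattern is found
is_mr = names.str.contains(r"\bMr\.")

# extract(): the single group is returned as a Series because expand=False
titles = names.str.extract(r",\s(\w+)\.", expand=False)

# replace(): regex=True is required for a regex pattern and a callable replacement
upper_titles = names.str.replace(
    r"\b(Mr|Miss)\.", lambda m: m.group().upper(), regex=True
)

print(names[is_mr].tolist())
print(titles.tolist())
print(upper_titles.tolist())
```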

In [42]:
df =pd.read_csv("titanic_train.csv")
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


### Filtering a dataframe - s.str.contains(pattern)

In [43]:
pattern =r'C\.?A\.?'
mask = df['Ticket'].str.contains(pattern)
df[mask].head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
33,34,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5,,S
56,57,1,2,"Rugg, Miss. Emily",female,21.0,0,0,C.A. 31026,10.5,,S
58,59,1,2,"West, Miss. Constance Mirium",female,5.0,1,2,C.A. 34651,27.75,,S
59,60,0,3,"Goodwin, Master. William Frederick",male,11.0,5,2,CA 2144,46.9,,S
66,67,1,2,"Nye, Mrs. (Elizabeth Ramell)",female,29.0,0,0,C.A. 29395,10.5,F33,S
70,71,0,2,"Jenkin, Mr. Stephen Curnow",male,32.0,0,0,C.A. 33111,10.5,,S
71,72,0,3,"Goodwin, Miss. Lillian Amy",female,16.0,5,2,CA 2144,46.9,,S
93,94,0,3,"Dean, Mr. Bertram Frank",male,26.0,1,2,C.A. 2315,20.575,,S
134,135,0,2,"Sobey, Mr. Samuel James Hayden",male,25.0,0,0,C.A. 29178,13.0,,S
145,146,0,2,"Nicholls, Mr. Joseph Charles",male,19.0,1,1,C.A. 33112,36.75,,S


In [44]:
mask.sum()

42

### Extracting data - s.str.extract(pattern)

In [45]:
df['Name'].sample(5)

843    Lemberopolous, Mr. Peter L
371     Wiklund, Mr. Jakob Alfred
231      Larsson, Mr. Bengt Edvin
113       Jussila, Miss. Katriina
239        Hunt, Mr. George Henry
Name: Name, dtype: object

In [46]:
#Extract all unique titles such as Mr, Miss, and Mrs from the passengers' names

pattern = r'\s(\w+)\.'
all_ts = df['Name'].str.extract(pattern, expand = False)
unique_ts = all_ts.value_counts()
unique_ts

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: Name, dtype: int64

In [47]:
#extract last name, title, and first name

pattern =r'(\w+), (\w+\.) (\w+).*'
df_name = df['Name'].str.extract(pattern, flags= re.I) #re.I is the IGNORECASE flag (case-insensitive matching)
df_name

Unnamed: 0,0,1,2
0,Braund,Mr.,Owen
1,Cumings,Mrs.,John
2,Heikkinen,Miss.,Laina
3,Futrelle,Mrs.,Jacques
4,Allen,Mr.,William
...,...,...,...
886,Montvila,Rev.,Juozas
887,Graham,Miss.,Margaret
888,Johnston,Miss.,Catherine
889,Behr,Mr.,Karl


### Replacing values in a column - s.str.replace(pattern, repl)

In [48]:
# Replace all the titles with capital letters
# (regex=True is required in pandas 2.0+ for a regex pattern with a callable replacement)

pattern = r'\s(\w+)\.'
df['Name'].str.replace(pattern, lambda m:m.group().upper(), regex=True)


0                                Braund, MR. Owen Harris
1      Cumings, MRS. John Bradley (Florence Briggs Th...
2                                 Heikkinen, MISS. Laina
3           Futrelle, MRS. Jacques Heath (Lily May Peel)
4                               Allen, MR. William Henry
                             ...                        
886                                Montvila, REV. Juozas
887                         Graham, MISS. Margaret Edith
888             Johnston, MISS. Catherine Helen "Carrie"
889                                Behr, MR. Karl Howell
890                                  Dooley, MR. Patrick
Name: Name, Length: 891, dtype: object

In [49]:
# Replace only Mr and Mrs from the titles with capital letters
# (regex=True is required in pandas 2.0+ for a regex pattern with a callable replacement)

pattern = r'\s(Mr|Mrs)\.\s'
df['Name'].str.replace(pattern, lambda m:m.group().upper(), regex=True)


0                                Braund, MR. Owen Harris
1      Cumings, MRS. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, MRS. Jacques Heath (Lily May Peel)
4                               Allen, MR. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, MR. Karl Howell
890                                  Dooley, MR. Patrick
Name: Name, Length: 891, dtype: object

In [50]:
df = pd.DataFrame ({"SUMMARY":["hello world!", "xxxx test", "123four", "five:", "six...."]})
df

Unnamed: 0,SUMMARY
0,hello world!
1,xxxx test
2,123four
3,five:
4,six....


In [51]:
# Remove everything that is not a letter or white space.
# X{2,} is case-sensitive, so the lower-case "xxxx" is left untouched.
df.SUMMARY.str.replace(r'[^a-zA-Z\s]+|X{2,}', "", regex=True)


0    hello world
1      xxxx test
2           four
3           five
4            six
Name: SUMMARY, dtype: object

## More Methods

Several pandas string methods accept a regex to find a pattern in the strings of a Series or DataFrame column.

These methods work along the same lines as Python's re module. This is really helpful if you want to find the names starting with a particular character, search for a pattern within a dataframe column, or extract dates from text.

Here are the pandas functions that accept regular expressions:

* **count()** - Count occurrences of the pattern in each string of the Series/Index
* **replace()** - Replace the search string or pattern with the given value
* **contains()** - Test if the pattern or regex is contained within a string of a Series or Index. Calls re.search() and returns a boolean
* **extract()** - Extract the groups in the regex pattern as columns in a DataFrame and return the captured groups.
* **findall()** - Find all occurrences of the pattern or regular expression in the Series/Index. Equivalent to applying re.findall() on all elements
* **match()** - Determine if each string matches a regular expression. Calls re.match() and returns a boolean
* **split()** - Equivalent to str.split() and accepts a string or regular expression to split on
* **rsplit()** - Equivalent to str.rsplit() and splits the string in the Series/Index from the end
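
A short sketch of a few methods from this list on a hand-made Series follows. It assumes pandas is installed; the sample values and variable names are illustrative.

```python
import pandas as pd

s = pd.Series(["Switzerland", "Sweden", "spain", "Peru"])

# match(): True where the pattern matches at the start of the string
starts_with_s = s.str.match(r"[Ss]")

# count(): number of non-overlapping matches in each string
vowel_counts = s.str.count(r"[aeiou]")

# findall(): list of all matches for each string
vowels = s.str.findall(r"[aeiou]")

# split(): a multi-character pattern is treated as a regular expression
parts = pd.Series(["12,08-24"]).str.split(r"-|,")

print(starts_with_s.tolist())
print(vowel_counts.tolist())
print(vowels.tolist())
print(parts.tolist())
```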

In [52]:
data = pd.read_csv('happiness_score_dataset.csv')
data

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.43630,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,Norway,Western Europe,4,7.522,0.03880,1.45900,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176
...,...,...,...,...,...,...,...,...,...,...,...,...
153,Rwanda,Sub-Saharan Africa,154,3.465,0.03464,0.22208,0.77370,0.42864,0.59201,0.55191,0.22628,0.67042
154,Benin,Sub-Saharan Africa,155,3.340,0.03656,0.28665,0.35386,0.31910,0.48450,0.08010,0.18260,1.63328
155,Syria,Middle East and Northern Africa,156,3.006,0.05015,0.66320,0.47489,0.72193,0.15684,0.18906,0.47179,0.32858
156,Burundi,Sub-Saharan Africa,157,2.905,0.08658,0.01530,0.41587,0.22396,0.11850,0.10062,0.19727,1.83302


## Pandas Extract

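Extract the first 5 characters of each country name using `^` (start of the string) and `{5}` (exactly 5 characters), and create a new column first_five_letter. Names whose first five characters are not all word characters, such as Peru, get NaN.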

In [53]:
import numpy as np

data['first_five_letter'] = data['Country'].str.extract(r'(^\w{5})')
data.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,first_five_letter
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,Switz
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,Icela
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,Denma
3,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,Norwa
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,Canad


In [54]:
data['Country'].value_counts()

Switzerland    1
Bangladesh     1
Greece         1
Lebanon        1
Hungary        1
              ..
Kazakhstan     1
Slovenia       1
Lithuania      1
Nicaragua      1
Togo           1
Name: Country, Length: 158, dtype: int64

In [55]:
data['Country'].unique()

array(['Switzerland', 'Iceland', 'Denmark', 'Norway', 'Canada', 'Finland',
       'Netherlands', 'Sweden', 'New Zealand', 'Australia', 'Israel',
       'Costa Rica', 'Austria', 'Mexico', 'United States', 'Brazil',
       'Luxembourg', 'Ireland', 'Belgium', 'United Arab Emirates',
       'United Kingdom', 'Oman', 'Venezuela', 'Singapore', 'Panama',
       'Germany', 'Chile', 'Qatar', 'France', 'Argentina',
       'Czech Republic', 'Uruguay', 'Colombia', 'Thailand',
       'Saudi Arabia', 'Spain', 'Malta', 'Taiwan', 'Kuwait', 'Suriname',
       'Trinidad and Tobago', 'El Salvador', 'Guatemala', 'Uzbekistan',
       'Slovakia', 'Japan', 'South Korea', 'Ecuador', 'Bahrain', 'Italy',
       'Bolivia', 'Moldova', 'Paraguay', 'Kazakhstan', 'Slovenia',
       'Lithuania', 'Nicaragua', 'Peru', 'Belarus', 'Poland', 'Malaysia',
       'Croatia', 'Libya', 'Russia', 'Jamaica', 'North Cyprus', 'Cyprus',
       'Algeria', 'Kosovo', 'Turkmenistan', 'Mauritius', 'Hong Kong',
       'Estonia', 'Indonesi

## Pandas Count

In [56]:
s = pd.Series(['Switzerland', 'Iceland', 'Denmark', 'Norway', 'Canada', 'Finland',
               'Netherlands', 'Sweden', 'New Zealand', 'Australia'])
s[s.str.count(r'(^S.*)')==1]

0    Switzerland
7         Sweden
dtype: object

In [57]:
s = pd.Series(['Switzerland', 'Sweden', 'Syria', 'Sudan', 'Swaziland', 'spain'])
s[s.str.count(r'(^S.*)')==1]

0    Switzerland
1         Sweden
2          Syria
3          Sudan
4      Swaziland
dtype: object

In [58]:
s = pd.Series(['Switzerland', 'Sweden', 'Syria', 'Sudan', 'Swaziland', 'spain'])
s[s.str.count(r'^[Ss].*')==1]  # [Ss] matches an upper- or lower-case "s" at the start

0    Switzerland
1         Sweden
2          Syria
3          Sudan
4      Swaziland
5          spain
dtype: object

We can use sum() to find the total number of elements matching the pattern

In [59]:
s.str.count(r'(^S.*)').sum()

5

In [60]:
data[data['Country'].str.count('^[pP].*')>0]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,first_five_letter
24,Panama,Latin America and Caribbean,25,6.786,0.0491,1.06353,1.1985,0.79661,0.5421,0.0927,0.24434,2.84848,Panam
52,Paraguay,Latin America and Caribbean,53,5.878,0.04563,0.75985,1.30477,0.66098,0.53899,0.08242,0.3424,2.18896,Parag
57,Peru,Latin America and Caribbean,58,5.824,0.04615,0.90019,0.97459,0.73017,0.41496,0.05989,0.14982,2.5945,
59,Poland,Central and Eastern Europe,60,5.791,0.04263,1.12555,1.27948,0.77903,0.53122,0.04212,0.16759,1.86565,Polan
80,Pakistan,Southern Asia,81,5.194,0.03726,0.59543,0.41411,0.51466,0.12102,0.10464,0.33671,3.10709,Pakis
87,Portugal,Western Europe,88,5.102,0.04802,1.15991,1.13935,0.87519,0.51469,0.01078,0.13719,1.26462,Portu
89,Philippines,Southeastern Asia,90,5.073,0.04934,0.70532,1.03516,0.58114,0.62545,0.12279,0.24991,1.7536,Phili
107,Palestinian Territories,Middle East and Northern Africa,108,4.715,0.04394,0.59867,0.92558,0.66015,0.24499,0.12905,0.11251,2.04384,Pales


In our original dataframe we are finding all the countries that start with the character "P" or "p" (upper or lower case). Basically, we are keeping only the rows for which count() returns a value > 0.
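To make the role of the count clearer, here is a minimal sketch on a small made-up Series (the names `countries`, `a_counts` and `starts_with_p` are illustrative, not from the notebook). It shows that `str.count()` returns an integer per element, and that the `flags` argument can be used for case-insensitive matching instead of listing both cases in the pattern.

```python
import re
import pandas as pd

# Made-up sample; 'peru' is deliberately lower case
countries = pd.Series(['Panama', 'peru', 'Canada'])

# count() returns the number of non-overlapping matches in each element
a_counts = countries.str.count('a')

# Case-insensitive "starts with p": a count above 0 means the pattern matched
starts_with_p = countries.str.count(r'^p', flags=re.IGNORECASE) > 0
filtered = countries[starts_with_p]

print(a_counts.tolist())
print(filtered.tolist())
```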

## Pandas Match

`match()` is equivalent to Python's `re.match()`: it checks whether the pattern matches at the beginning of each string and returns a boolean value for every element.

In [61]:
S= pd.Series(['Kazakhstan', 'Slovenia', 'Lithuania', 'Nicaragua', 'Peru', 'Belarus', 'Poland'])
S[S.str.match(r'(^P.*)')==True]

4      Peru
6    Poland
dtype: object

Here we are finding all the countries in the pandas Series that start with the character "P" (upper case).

In [62]:
data[data['Country'].str.match(r'(^P.*)')==True]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,first_five_letter
24,Panama,Latin America and Caribbean,25,6.786,0.0491,1.06353,1.1985,0.79661,0.5421,0.0927,0.24434,2.84848,Panam
52,Paraguay,Latin America and Caribbean,53,5.878,0.04563,0.75985,1.30477,0.66098,0.53899,0.08242,0.3424,2.18896,Parag
57,Peru,Latin America and Caribbean,58,5.824,0.04615,0.90019,0.97459,0.73017,0.41496,0.05989,0.14982,2.5945,
59,Poland,Central and Eastern Europe,60,5.791,0.04263,1.12555,1.27948,0.77903,0.53122,0.04212,0.16759,1.86565,Polan
80,Pakistan,Southern Asia,81,5.194,0.03726,0.59543,0.41411,0.51466,0.12102,0.10464,0.33671,3.10709,Pakis
87,Portugal,Western Europe,88,5.102,0.04802,1.15991,1.13935,0.87519,0.51469,0.01078,0.13719,1.26462,Portu
89,Philippines,Southeastern Asia,90,5.073,0.04934,0.70532,1.03516,0.58114,0.62545,0.12279,0.24991,1.7536,Phili
107,Palestinian Territories,Middle East and Northern Africa,108,4.715,0.04394,0.59867,0.92558,0.66015,0.24499,0.12905,0.11251,2.04384,Pales


Here we are doing the same thing on the original dataframe: running match() on the Country column and filtering on the boolean value True gives us all the countries starting with "P".
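The boolean result of `match()` can be used directly as a mask. The sketch below, on a made-up Series that includes a missing value, assumes you want missing entries treated as non-matches; the `na=False` argument is one way to get a clean boolean mask in that case.

```python
import pandas as pd

# Made-up sample with a missing value
countries = pd.Series(['Peru', None, 'Poland', 'Chile'])

# match() is anchored at the start of each string, like re.match();
# na=False makes missing entries count as "no match"
mask = countries.str.match(r'P', na=False)

# The boolean mask selects the matching rows
selected = countries[mask]

print(mask.tolist())
print(selected.tolist())
```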

## Pandas Contain

In [63]:
S = pd.Series(['Finland','Kuwait', 'Suriname', 'France', 'Argentina','Florida'])
S.str.contains('^F.*')

0     True
1    False
2    False
3     True
4    False
5     True
dtype: bool

In [64]:
S = pd.Series(['Finland','Kuwait', 'Suriname', 'France', 'Argentina','florida'])
S.str.contains('^[Ff].*')

0     True
1    False
2    False
3     True
4    False
5     True
dtype: bool

Here contains() returns a boolean Series that is True for every country in the series starting with the character 'F' or 'f'. The result can be used as a mask to filter the series down to the matching values.
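Unlike `match()`, `contains()` searches anywhere in the string, so the `^` anchor is what restricts it to the start. The following sketch uses made-up sample strings and the `case=False` argument as an alternative way to ignore case.

```python
import pandas as pd

# Made-up sample strings
places = pd.Series(['Finland', 'florida', 'Suriname', 'Corfu'])

# Anchored: True only when the string starts with 'f' or 'F'
starts_with_f = places.str.contains(r'^f', case=False)

# Not anchored: True when 'f' or 'F' appears anywhere in the string
has_f = places.str.contains('f', case=False)

print(starts_with_f.tolist())
print(has_f.tolist())
```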

## Pandas Split

This is equivalent to str.split() and accepts a regex. If no pattern is passed, the string is split on whitespace by default.

In [65]:
s =pd.Series(["TheStatueofLiberty built-on 28-Oct-1886"])
s.str.split(r'\s', n=-1, expand = True)

Unnamed: 0,0,1,2
0,TheStatueofLiberty,built-on,28-Oct-1886


Here we are splitting the text on whitespace, and setting expand to True spreads the pieces across 3 different columns.
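As a further sketch on the same string, a character class lets the split happen on more than one delimiter, and `n` limits the number of splits. The variable names `parts` and `first_only` are illustrative.

```python
import pandas as pd

s = pd.Series(["TheStatueofLiberty built-on 28-Oct-1886"])

# Split on either a hyphen or a whitespace character
parts = s.str.split(r'[-\s]', expand=True)

# n=1 stops after the first split, leaving the rest of the string intact
first_only = s.str.split(r'\s', n=1, expand=True)

print(parts.iloc[0].tolist())
print(first_only.iloc[0].tolist())
```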

## Pandas rsplit

It is equivalent to str.rsplit(); the only difference from split() is that it splits the string starting from the end.
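The direction only matters when the number of splits is limited. A minimal sketch, assuming a literal `-` delimiter and `n=1` on the string used in the split example:

```python
import pandas as pd

s = pd.Series(["TheStatueofLiberty built-on 28-Oct-1886"])

# split() with n=1 cuts at the first '-' counting from the left
from_left = s.str.split('-', n=1, expand=True)

# rsplit() with n=1 cuts at the first '-' counting from the right
from_right = s.str.rsplit('-', n=1, expand=True)

print(from_left.iloc[0].tolist())
print(from_right.iloc[0].tolist())
```

With no limit on `n`, both methods return the same pieces.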