# Regex in Pandas

- count(): counts occurrences of a pattern in each string of the Series/Index
- replace(): replaces the search string or pattern with the given value
- contains(): tests whether a pattern or regex is contained within each string of a Series/Index; calls re.search() and returns a boolean
- extract(): extracts the capture groups of the regex pattern and returns them as columns of a DataFrame
- findall(): finds all occurrences of a pattern or regular expression in the Series/Index; equivalent to applying re.findall() to every element
- match(): determines whether each string matches a regular expression from its start; calls re.match() and returns a boolean
- split(): equivalent to str.split(); accepts a string or regular expression to split on
- rsplit(): equivalent to str.rsplit(); splits each string in the Series/Index from the end

In [7]:
import pandas as pd
import numpy as np

In [2]:
df_happiness = pd.read_csv("./world-happiness-report-2019.csv")

In [4]:
df_happiness.head()

Unnamed: 0,Country (region),Ladder,SD of Ladder,Positive affect,Negative affect,Social support,Freedom,Corruption,Generosity,Log of GDP\nper capita,Healthy life\nexpectancy
0,Finland,1,4,41.0,10.0,2.0,5.0,4.0,47.0,22.0,27.0
1,Denmark,2,13,24.0,26.0,4.0,6.0,3.0,22.0,14.0,23.0
2,Norway,3,8,16.0,29.0,3.0,3.0,8.0,11.0,7.0,12.0
3,Iceland,4,9,3.0,3.0,1.0,7.0,45.0,3.0,15.0,13.0
4,Netherlands,5,1,12.0,25.0,15.0,19.0,12.0,7.0,12.0,18.0


In [5]:
df_happiness.shape

(156, 11)

In [6]:
df_happiness.columns

Index(['Country (region)', 'Ladder', 'SD of Ladder', 'Positive affect',
       'Negative affect', 'Social support', 'Freedom', 'Corruption',
       'Generosity', 'Log of GDP\nper capita', 'Healthy life\nexpectancy'],
      dtype='object')

##### 1. Count


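DataFrame.count(axis=0, numeric_only=False) counts the non-NA cells in each column (or in each row with axis=1). It takes no pattern; the regex counterpart is Series.str.count(pattern), which counts pattern matches in each string.

Example with DataFrame.count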

In [8]:
df = pd.DataFrame({
    "Person": ["John", "Myla", "Lewis", "John", "Myla"],
    "Age": [24., np.nan, 21., 33, 26],
    "Single": [False, True, True, True, False]
    })

In [10]:
df

Unnamed: 0,Person,Age,Single
0,John,24.0,False
1,Myla,,True
2,Lewis,21.0,True
3,John,33.0,True
4,Myla,26.0,False


In [11]:
df.count()

Person    5
Age       4
Single    5
dtype: int64

In [12]:
df.count(axis=1)

0    3
1    2
2    3
3    3
4    3
dtype: int64

In [13]:
s = pd.Series(['Finland','Colombia','Florida','Japan','Puerto Rico','Russia','france'])
s

0        Finland
1       Colombia
2        Florida
3          Japan
4    Puerto Rico
5         Russia
6         france
dtype: object

In [14]:
s[s.str.count(r'(^F.*)')==1]

0    Finland
2    Florida
dtype: object

In [15]:
s.str.count(r'(^F.*)').sum()

2
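
Because the pattern above is anchored with ^, each string can only score 0 or 1. The sketch below, which reuses the same country names, shows str.count returning larger counts with an unanchored pattern; the inline (?i) flag is just one way to ignore case.

```python
import pandas as pd

s = pd.Series(['Finland', 'Colombia', 'Florida', 'Japan', 'Puerto Rico', 'Russia', 'france'])

# Unanchored character class: every lowercase vowel is a separate match.
vowel_counts = s.str.count(r'[aeiou]')

# (?i) makes the anchored pattern case-insensitive, so 'france' is counted too.
starts_with_f = s.str.count(r'(?i)^f').sum()

print(vowel_counts.tolist())
print(starts_with_f)
```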

##### 2. Replace

In [27]:
s=pd.Series(['Finland-1','Colombia-2','Florida-3','Japan-4','Puerto Rico-5','Russia-6','france-7'])

In [28]:
s

0        Finland-1
1       Colombia-2
2        Florida-3
3          Japan-4
4    Puerto Rico-5
5         Russia-6
6         france-7
dtype: object

In [29]:
s.replace(r'-\d','',regex=True,inplace=True)

In [30]:
s

0        Finland
1       Colombia
2        Florida
3          Japan
4    Puerto Rico
5         Russia
6         france
dtype: object
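
Series.str.replace is the string-accessor counterpart of the Series.replace call above. A minimal sketch, using a shortened copy of the same data, that also reuses capture groups in the replacement; regex=True is passed explicitly because the default of that parameter has changed between pandas versions.

```python
import pandas as pd

s = pd.Series(['Finland-1', 'Colombia-2', 'Puerto Rico-5'])

# Strip a trailing "-<digits>" suffix from every string.
names_only = s.str.replace(r'-\d+$', '', regex=True)

# \1 and \2 in the replacement refer back to the two capture groups.
swapped = s.str.replace(r'^(.*)-(\d+)$', r'\2: \1', regex=True)

print(names_only.tolist())
print(swapped.tolist())
```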

##### 3. Contains

In [31]:
s=pd.Series(['Finland','Colombia','Florida','Japan','Puerto Rico','Russia','france'])

In [32]:
s

0        Finland
1       Colombia
2        Florida
3          Japan
4    Puerto Rico
5         Russia
6         france
dtype: object

In [33]:
s.str.contains('^F.*')

0     True
1    False
2     True
3    False
4    False
5    False
6    False
dtype: bool

In [34]:
df_happiness[df_happiness['Country (region)'].str.contains('^I.*')]

Unnamed: 0,Country (region),Ladder,SD of Ladder,Positive affect,Negative affect,Social support,Freedom,Corruption,Generosity,Log of GDP\nper capita,Healthy life\nexpectancy
3,Iceland,4,9,3.0,3.0,1.0,7.0,45.0,3.0,15.0,13.0
12,Israel,13,14,104.0,69.0,38.0,93.0,74.0,24.0,31.0,11.0
15,Ireland,16,34,33.0,32.0,6.0,33.0,10.0,9.0,6.0,20.0
35,Italy,36,31,99.0,123.0,23.0,132.0,128.0,48.0,29.0,7.0
91,Indonesia,92,108,9.0,104.0,94.0,48.0,129.0,2.0,83.0,98.0
98,Ivory Coast,99,134,88.0,130.0,137.0,100.0,62.0,114.0,118.0,147.0
116,Iran,117,109,109.0,150.0,134.0,117.0,44.0,28.0,54.0,77.0
125,Iraq,126,147,151.0,154.0,124.0,130.0,66.0,73.0,64.0,107.0
139,India,140,41,93.0,115.0,142.0,41.0,73.0,65.0,103.0,105.0
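
Real columns often contain missing values and mixed case. The sketch below uses a small made-up Series (not the happiness data) to show the case and na parameters of str.contains; passing na=False keeps the mask purely boolean so it can be used for filtering.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value.
s = pd.Series(['Finland', 'france', np.nan, 'Japan'])

# case=False ignores case; na=False treats missing values as non-matches.
mask = s.str.contains(r'^f', case=False, na=False)
matches = s[mask]

print(matches.tolist())
```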


##### 4. Extract

In [36]:
df_happiness['first_five_Letter']=df_happiness['Country (region)'].str.extract(r'(^\w{5})')
df_happiness.head()

Unnamed: 0,Country (region),Ladder,SD of Ladder,Positive affect,Negative affect,Social support,Freedom,Corruption,Generosity,Log of GDP\nper capita,Healthy life\nexpectancy,first_five_Letter
0,Finland,1,4,41.0,10.0,2.0,5.0,4.0,47.0,22.0,27.0,Finla
1,Denmark,2,13,24.0,26.0,4.0,6.0,3.0,22.0,14.0,23.0,Denma
2,Norway,3,8,16.0,29.0,3.0,3.0,8.0,11.0,7.0,12.0,Norwa
3,Iceland,4,9,3.0,3.0,1.0,7.0,45.0,3.0,15.0,13.0,Icela
4,Netherlands,5,1,12.0,25.0,15.0,19.0,12.0,7.0,12.0,18.0,Nethe
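
When a pattern has several capture groups, str.extract returns one column per group. A minimal sketch with named groups on made-up date strings (the group names day, month and year are illustrative choices):

```python
import pandas as pd

# Illustrative data; the last entry deliberately does not match.
s = pd.Series(['28-Oct-1886', '04-Jul-1776', 'unknown'])

# Each named group becomes a column; non-matching rows are filled with NaN.
parts = s.str.extract(r'^(?P<day>\d{2})-(?P<month>[A-Za-z]{3})-(?P<year>\d{4})$')

print(parts)
```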


##### 5. Findall

In [37]:
s=pd.Series(['Finland','Colombia','Florida','Japan','Puerto Rico','Russia','france'])

In [38]:
s

0        Finland
1       Colombia
2        Florida
3          Japan
4    Puerto Rico
5         Russia
6         france
dtype: object

In [40]:
[itm[0] for itm in s.str.findall('^[Ff].*') if len(itm)>0]

['Finland', 'Florida', 'france']
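
The list comprehension above is needed because str.findall returns one list per element, and that list is empty when nothing matches. A short sketch on a subset of the same names makes that shape visible:

```python
import pandas as pd

s = pd.Series(['Finland', 'Colombia', 'Puerto Rico'])

# One list of matches per element; no match gives an empty list.
found = s.str.findall(r'[ou]')

# Number of matches per row, taken from the length of each list.
n_found = found.map(len)

print(found.tolist())
print(n_found.tolist())
```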

##### 6. Match


In [41]:
s=pd.Series(['Finland','Colombia','Florida','Japan','Puerto Rico','Russia','france'])

In [42]:
s

0        Finland
1       Colombia
2        Florida
3          Japan
4    Puerto Rico
5         Russia
6         france
dtype: object

In [43]:
s[s.str.match('^P.*')]

4    Puerto Rico
dtype: object
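
Since match() calls re.match(), the pattern is already tied to the start of each string and the leading ^ is optional. The sketch below contrasts it with contains(), which searches anywhere in the string:

```python
import pandas as pd

s = pd.Series(['Finland', 'Colombia', 'Puerto Rico'])

# contains() finds 'land' in the middle of 'Finland'.
has_land = s.str.contains(r'land')

# match() only succeeds if the pattern matches from the first character.
starts_land = s.str.match(r'land')
starts_fin = s.str.match(r'Fin')

print(has_land.tolist(), starts_land.tolist(), starts_fin.tolist())
```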

##### 7. Split

In [44]:
s = pd.Series(["StatueofLiberty built-on 28-Oct-1886"])

In [45]:
s

0    StatueofLiberty built-on 28-Oct-1886
dtype: object

In [46]:
s.str.split(r'\s',n=-1,expand=True)

Unnamed: 0,0,1,2
0,StatueofLiberty,built-on,28-Oct-1886
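
A character class lets split() break on more than one delimiter, and rsplit() with n set shows the effect of splitting from the end. A minimal sketch on the same string; it relies on pandas treating a multi-character pattern as a regular expression by default, and note that rsplit() takes a plain string separator here.

```python
import pandas as pd

s = pd.Series(["StatueofLiberty built-on 28-Oct-1886"])

# Split on either whitespace or a hyphen.
pieces = s.str.split(r'[\s-]', expand=True)

# Split at most twice, starting from the right-hand end.
tail = s.str.rsplit('-', n=2, expand=True)

print(pieces.iloc[0].tolist())
print(tail.iloc[0].tolist())
```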
