## Pandas Regex
- [reference](https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/)
- [reference](https://stackoverflow.com/questions/26318287/what-does-r-mean-before-a-regex-pattern#:~:text=The%20r%20means%20that%20the,escape%20codes%20will%20be%20ignored.)
- [reference](https://www.w3schools.com/python/python_regex.asp)
- [reference](https://docs.python.org/3/library/re.html)

## Regular expressions in Python
A RegEx, or Regular Expression, is **a sequence of characters that forms a search pattern.** A RegEx can be used to check whether a string contains the specified search pattern. Patterns are built from **Metacharacters**, **Special sequences**, and **Sets**.

## What does 'r' mean before a Regex pattern?
The r prefix means that the string is treated as a raw string, so Python does not process backslash escape sequences in it. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Patterns are usually written in Python code using this raw string notation.

In [1]:
print('\n') # Prints a newline character

print(r'\n') # Escape sequence is not processed

print('\b') # Prints a backspace character

print(r'\b') # Escape sequence is not processed



\n

\b
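The difference matters most for sequences that mean one thing to Python and another to the regex engine, such as \b. A minimal sketch using the standard `re` module; the sample sentence is made up for illustration:

```python
import re

text = "The rain in Spain"  # illustrative sample sentence

# Raw string: the regex engine receives backslash + 'b', i.e. a word boundary.
raw_hits = re.findall(r"\bS\w+", text)

# Normal string: Python turns '\b' into a backspace character before the
# regex engine sees it, so the pattern looks for a literal backspace.
plain_hits = re.findall("\bS\\w+", text)

print(len(r"\n"), len("\n"))  # 2 1
print(raw_hits)               # ['Spain']
print(plain_hits)             # []
```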


[reference](https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/)

|<font size="5">Methods</font>|<font size="5">Description</font>|
| :--- | :--- |
|<font size="4">count()</font>|<font size="4">Count occurrences of pattern in each string of the Series/Index</font>|
|<font size="4">replace()</font>|<font size="4">Replace the search string or pattern with the given value</font>|
|<font size="4">contains()</font>|<font size="4">Test if pattern or regex is contained within a string of a Series or Index. Calls re.search() and returns a boolean</font>|
|<font size="4">extract()</font>|<font size="4">Extract capture groups in the regex pat as columns in a DataFrame and return the captured groups</font>|
|<font size="4">findall()</font>|<font size="4">Find all occurrences of pattern or regular expression in the Series/Index. Equivalent to applying re.findall() on all elements. findall() returns list</font>|
|<font size="4">match()</font>|<font size="4">Determine if each string matches a regular expression. Calls re.match() and returns a boolean</font>|
|<font size="4">split()</font>|<font size="4">Equivalent to str.split(); accepts a string or regular expression to split on</font>|
|<font size="4">rsplit()</font>|<font size="4">Equivalent to str.rsplit(); splits each string in the Series/Index from the end</font>|
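As a quick orientation, the sketch below runs most of the methods from the table on one tiny Series. The sample strings and variable names are made up for illustration and do not come from the dataset used later.

```python
import pandas as pd

# Illustrative data, not taken from the dataset used later
s = pd.Series(["a1b22", "c333", "none"])

digit_runs = s.str.count(r"\d+")                   # number of digit runs per string
has_digit = s.str.contains(r"\d")                  # re.search semantics
starts_ok = s.str.match(r"[a-z]\d")                # anchored at the start, like re.match
all_runs = s.str.findall(r"\d+")                   # every match, as a list per row
first_run = s.str.extract(r"(\d+)", expand=False)  # first capture group, NaN if no match
no_digits = s.str.replace(r"\d+", "", regex=True)  # remove every digit run
pieces = s.str.split(r"\d+")                       # split on digit runs

print(digit_runs.tolist())  # [2, 1, 0]
print(no_digits.tolist())   # ['ab', 'c', 'none']
```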


[reference](https://www.w3schools.com/python/python_regex.asp)
### <font size="5">Metacharacters</font>
Metacharacters are characters with a special meaning:

|Character|	Description|	Example	|
|:---   |:---   | :---   |
|[]	|A set of characters	|"[a-m]"	|
|\	|Signals a special sequence (can also be used to escape special characters)	|"\d"|	
|.	|Any character (except newline character)	|"he..o"	|
|^	|Starts with	|"^hello"|
|\$	|Ends with	|"planet$"|	
|*	|Zero or more occurrences|"he.\*o"|
|+	|One or more occurrences	|"he.+o"|	
|?	|Zero or one occurrences	|"he.?o"|	
|{}	|Exactly the specified number of occurrences|"he.{2}o"|	
|\|	|Either or	|"falls\|stays"|	
|()	|Capture and group	| |
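The metacharacters above can be tried directly with the standard `re` module. This is a small sketch on a made-up sample string, not part of the original notebook:

```python
import re

text = "hello planet"  # illustrative sample

starts = bool(re.search(r"^hello", text))        # ^ anchors at the start
ends = bool(re.search(r"planet$", text))         # $ anchors at the end
braces = re.findall(r"he.{2}o", text)            # . with {2}: exactly two characters
in_set = re.findall(r"[a-e]", text)              # any single character from a to e
either = re.findall(r"hello|planet", text)       # | means either or
groups = re.search(r"(\w+) (\w+)", text).groups()  # () captures each word

print(starts, ends)  # True True
print(braces)        # ['hello']
print(in_set)        # ['e', 'a', 'e']
print(either)        # ['hello', 'planet']
print(groups)        # ('hello', 'planet')
```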


### <font size="5">Special Sequences</font>

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

|<font size="4"> Character</font>|<font size="4">Description</font>	|<font size="4">Example </font>|
|:--- |:--- |:--- |
|\A	|Returns a match if the specified characters are at the beginning of the string	|"\AThe"	|
|\b	|Returns a match where the specified characters are at the beginning or at the end of a word (the "r" in the beginning is making sure that the string is being treated as a "raw string")	|r"\bain" r"ain\b"	|
|\B	|Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the "r" in the beginning is making sure that the string is being treated as a "raw string")	|r"\Bain" r"ain\B"	|
|\d	|Returns a match where the string contains digits (numbers from 0-9)	|"\d"	|
|\D	|Returns a match where the string contains a character that is NOT a digit	|"\D"	|
|\s	|Returns a match where the string contains a white space character	|"\s"	|
|\S	|Returns a match where the string contains a character that is NOT a white space character	|"\S"	|
|\w	|Returns a match where the string contains any word characters (characters from a to z and A to Z, digits from 0-9, and the underscore _ character)	|"\w"	|
|\W	|Returns a match where the string contains a character that is NOT a word character	|"\W"	|
|\Z	|Returns a match if the specified characters are at the end of the string	|"Spain\Z"	|
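A short sketch of several special sequences, again with the standard `re` module. The sample sentence is an assumption chosen so that "ain" appears only at the end of words:

```python
import re

text = "The rain in Spain 2019"  # illustrative sample

word_start = re.findall(r"\bain", text)   # 'ain' at the beginning of a word
word_end = re.findall(r"ain\b", text)     # 'ain' at the end of a word
not_start = re.findall(r"\Bain", text)    # 'ain' NOT at the beginning of a word
digits = re.findall(r"\d", text)          # every single digit
spaces = re.findall(r"\s", text)          # every whitespace character
at_begin = bool(re.search(r"\AThe", text))
at_end = bool(re.search(r"2019\Z", text))

print(word_start)        # []
print(word_end)          # ['ain', 'ain']
print(not_start)         # ['ain', 'ain']
print(digits)            # ['2', '0', '1', '9']
print(len(spaces))       # 4
print(at_begin, at_end)  # True True
```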

### <font size="5">Sets</font>
A set is a set of characters inside a pair of square brackets [] with a special meaning:

|<font size="4"> Set</font>|<font size="4">Description</font>	|
|:--- |:--- |
|[arn]	|Returns a match where one of the specified characters (a, r, or n) is present	|
|[a-n]	|Returns a match for any lower case character, alphabetically between a and n	|
|[^arn]	|Returns a match for any character EXCEPT a, r, and n	|
|[0123]|	Returns a match where any of the specified digits (0, 1, 2, or 3) is present	|
|[0-9]	|Returns a match for any digit between 0 and 9	|
|[0-5][0-9]	|Returns a match for any two-digit number from 00 to 59	|
|[a-zA-Z]	|Returns a match for any character alphabetically between a and z, lower case OR upper case	|
|[+]	|In sets, +, \*, ., \|, (), $,{} have no special meaning, so [+] means: return a match for any + character in the string|
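The sets in the table behave as follows on a few made-up strings (a minimal sketch with the standard `re` module):

```python
import re

letters = re.findall(r"[arn]", "The rain")        # any of a, r, n
minutes = re.findall(r"[0-5][0-9]", "08:45:99")   # two-digit numbers from 00 to 59
plus = re.findall(r"[+]", "1+1=2")                # + is literal inside a set
non_digits = re.findall(r"[^0-9]", "1+1=2")       # ^ negates the set

print(letters)     # ['r', 'a', 'n']
print(minutes)     # ['08', '45']
print(plus)        # ['+']
print(non_digits)  # ['+', '=']
```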

In [2]:
import pandas as pd
df = pd.read_csv('world-happiness-report-2019.csv')

df.head()

Unnamed: 0,Country (region),Ladder,SD of Ladder,Positive affect,Negative affect,Social support,Freedom,Corruption,Generosity,Log of GDP\nper capita,Healthy life\nexpectancy
0,Finland,1,4,41.0,10.0,2.0,5.0,4.0,47.0,22.0,27.0
1,Denmark,2,13,24.0,26.0,4.0,6.0,3.0,22.0,14.0,23.0
2,Norway,3,8,16.0,29.0,3.0,3.0,8.0,11.0,7.0,12.0
3,Iceland,4,9,3.0,3.0,1.0,7.0,45.0,3.0,15.0,13.0
4,Netherlands,5,1,12.0,25.0,15.0,19.0,12.0,7.0,12.0,18.0


### Extract the first 5 consecutive word characters from "Country (region)"

In [3]:
df['first_5_letter']=df['Country (region)'].str.extract(r'(\w{5})')
df.head()

Unnamed: 0,Country (region),Ladder,SD of Ladder,Positive affect,Negative affect,Social support,Freedom,Corruption,Generosity,Log of GDP\nper capita,Healthy life\nexpectancy,first_5_letter
0,Finland,1,4,41.0,10.0,2.0,5.0,4.0,47.0,22.0,27.0,Finla
1,Denmark,2,13,24.0,26.0,4.0,6.0,3.0,22.0,14.0,23.0,Denma
2,Norway,3,8,16.0,29.0,3.0,3.0,8.0,11.0,7.0,12.0,Norwa
3,Iceland,4,9,3.0,3.0,1.0,7.0,45.0,3.0,15.0,13.0,Icela
4,Netherlands,5,1,12.0,25.0,15.0,19.0,12.0,7.0,12.0,18.0,Nethe
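Note that `(\w{5})` finds the first run of five word characters anywhere in the string, so names shorter than five letters (such as Peru) yield NaN. The sketch below contrasts it with plain slicing on a made-up Series; the variable names are illustrative only.

```python
import pandas as pd

countries = pd.Series(["Finland", "Peru", "New Zealand"])  # illustrative sample

# Regex: first run of exactly five word characters, wherever it occurs
by_regex = countries.str.extract(r"(\w{5})", expand=False)

# Slicing: simply the first five characters (fewer if the string is shorter)
by_slice = countries.str[:5]

print(by_regex.tolist())
print(by_slice.tolist())  # ['Finla', 'Peru', 'New Z']
```

For "New Zealand" the regex skips "New" (only three word characters) and returns "Zeala", which may or may not be what is wanted.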


### Count the countries starting with character 'F'

In [4]:
S=pd.Series(['Finland','Colombia','Florida','Japan','Puerto Rico','Russia','france'])
print(S[S.str.count(r'(^F.*)')==True]) # Evaluate whether there is any country name starting with F
print()
print(S[S.str.count(r'(^F.*)')==1] )

0    Finland
2    Florida
dtype: object

0    Finland
2    Florida
dtype: object


In [5]:
S.str.count(r'(^F.*)').sum()

2

### Find all the countries that start with the character ‘P’ or ‘p’ (both upper and lower case)

In [6]:
df[df['Country (region)'].str.count(r'(^[Pp].*)')>0] # 0 means False (not exist) and any number other than 0 is True (exists). 

Unnamed: 0,Country (region),Ladder,SD of Ladder,Positive affect,Negative affect,Social support,Freedom,Corruption,Generosity,Log of GDP\nper capita,Healthy life\nexpectancy,first_5_letter
30,Panama,31,121,7.0,48.0,41.0,32.0,104.0,88.0,51.0,33.0,Panam
39,Poland,40,28,76.0,33.0,44.0,52.0,108.0,77.0,41.0,36.0,Polan
62,Paraguay,63,90,1.0,39.0,30.0,34.0,76.0,67.0,90.0,81.0,Parag
64,Peru,65,114,36.0,127.0,77.0,61.0,132.0,126.0,76.0,47.0,
65,Portugal,66,73,97.0,100.0,47.0,37.0,135.0,122.0,39.0,22.0,Portu
66,Pakistan,67,53,130.0,111.0,130.0,114.0,55.0,58.0,110.0,114.0,Pakis
68,Philippines,69,119,42.0,116.0,75.0,15.0,49.0,115.0,97.0,99.0,Phili
109,Palestinian Territories,110,110,128.0,140.0,82.0,134.0,90.0,147.0,112.0,,Pales
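Instead of listing both cases in a set, a case-insensitive flag can be passed. A minimal sketch on the small Series of names used earlier, assuming `flags=re.IGNORECASE` is acceptable for the task:

```python
import re
import pandas as pd

S = pd.Series(['Finland', 'Colombia', 'Florida', 'Japan', 'Puerto Rico', 'Russia', 'france'])

# flags=re.IGNORECASE makes ^f match both 'F' and 'f'
mask = S.str.contains(r'^f', flags=re.IGNORECASE)
matches = S[mask].tolist()

print(matches)  # ['Finland', 'Florida', 'france']
```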


## The match() function is equivalent to Python’s re.match() and returns a boolean value. Here we find all the countries in a pandas Series that start with the character ‘P’ (upper case).



In [7]:
# Get countries starting with letter P
S=pd.Series(['Finland','Colombia','Florida','Japan','Puerto Rico','Russia','france', 
             'Portugal', 'Peru', 'Philippines', 'Pakistan'])
S[S.str.match(r'(^P.*)')==True]

4     Puerto Rico
7        Portugal
8            Peru
9     Philippines
10       Pakistan
dtype: object

In [8]:
df[df['Country (region)'].str.match(r'^P.*') == True]

Unnamed: 0,Country (region),Ladder,SD of Ladder,Positive affect,Negative affect,Social support,Freedom,Corruption,Generosity,Log of GDP\nper capita,Healthy life\nexpectancy,first_5_letter
30,Panama,31,121,7.0,48.0,41.0,32.0,104.0,88.0,51.0,33.0,Panam
39,Poland,40,28,76.0,33.0,44.0,52.0,108.0,77.0,41.0,36.0,Polan
62,Paraguay,63,90,1.0,39.0,30.0,34.0,76.0,67.0,90.0,81.0,Parag
64,Peru,65,114,36.0,127.0,77.0,61.0,132.0,126.0,76.0,47.0,
65,Portugal,66,73,97.0,100.0,47.0,37.0,135.0,122.0,39.0,22.0,Portu
66,Pakistan,67,53,130.0,111.0,130.0,114.0,55.0,58.0,110.0,114.0,Pakis
68,Philippines,69,119,42.0,116.0,75.0,15.0,49.0,115.0,97.0,99.0,Phili
109,Palestinian Territories,110,110,128.0,140.0,82.0,134.0,90.0,147.0,112.0,,Pales


## Replace: remove a dash (-) followed by a numeric digit (represented by \d) by replacing it with an empty string

In [9]:
# Without regex=True, replace() looks for values equal to the literal string, so nothing is replaced here. The default is regex=False.

# Remove the dash(-) followed by number from all countries in the Series
S=pd.Series(['Finland-1','Colombia-2','Florida-3','Japan-4','Puerto Rico-5','Russia-6','france-7'])
S.replace('(-\d)','', inplace = True)


In [10]:
# Without regex=True, replace() still does nothing, even with a raw string

# Remove the dash(-) followed by number from all countries in the Series
S=pd.Series(['Finland-1','Colombia-2','Florida-3','Japan-4','Puerto Rico-5','Russia-6','france-7'])
S.replace(r'(-\d)','', inplace = True)

S

0        Finland-1
1       Colombia-2
2        Florida-3
3          Japan-4
4    Puerto Rico-5
5         Russia-6
6         france-7
dtype: object

In [11]:
# Remove the dash(-) followed by number from all countries in the Series
S=pd.Series(['Finland-1','Colombia-2','Florida-3','Japan-4','Puerto Rico-5','Russia-6','france-7'])
S.replace('(-\d)','', regex=True, inplace = True) # with regex=True the pattern is interpreted as a regex, so \d matches a digit

S

0        Finland
1       Colombia
2        Florida
3          Japan
4    Puerto Rico
5         Russia
6         france
dtype: object

In [12]:
# Remove the dash(-) followed by number from all countries in the Series
S=pd.Series(['Finland-1','Colombia-2','Florida-3','Japan-4','Puerto Rico-5','Russia-6','france-7'])
S.replace(r'(-\d)','', regex=True, inplace = True)

S

0        Finland
1       Colombia
2        Florida
3          Japan
4    Puerto Rico
5         Russia
6         france
dtype: object
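The behaviour above comes from the difference between value replacement and pattern replacement. The sketch below puts the variants side by side on a shortened, illustrative Series and avoids `inplace` so that each result can be inspected:

```python
import pandas as pd

S = pd.Series(['Finland-1', 'Japan-4'])  # shortened illustrative sample

# regex=False (the default): only cells equal to the literal string '-\d' would change
literal = S.replace(r'-\d', '')

# regex=True: the pattern is applied inside each string
as_regex = S.replace(r'-\d', '', regex=True)

# The .str accessor offers the same substitution per string
via_str = S.str.replace(r'-\d', '', regex=True)

print(literal.tolist())   # ['Finland-1', 'Japan-4']
print(as_regex.tolist())  # ['Finland', 'Japan']
print(via_str.tolist())   # ['Finland', 'Japan']
```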

## Pandas findall(): It calls re.findall() and finds all occurrences of matching patterns.

In [13]:
S=pd.Series(['Finland','Colombia','Florida','Japan','Puerto Rico','Russia','france'])
S.str.findall('^[Ff].*')

0    [Finland]
1           []
2    [Florida]
3           []
4           []
5           []
6     [france]
dtype: object

In [14]:
S=pd.Series(['Finland','Colombia','Florida','Japan','Puerto Rico','Russia','france'])
[itm[0] for itm in S.str.findall('^[Ff].*') if len(itm)>0]

['Finland', 'Florida', 'france']
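The list comprehension can also be written with pandas methods only. One possible sketch, assuming `Series.explode()` is available (pandas 0.25 or later):

```python
import pandas as pd

S = pd.Series(['Finland', 'Colombia', 'Florida', 'Japan', 'Puerto Rico', 'Russia', 'france'])

found = S.str.findall(r'^[Ff].*')  # one list per row, empty when nothing matches

# explode() puts each list element on its own row; empty lists become NaN
flat = found.explode().dropna().tolist()

print(flat)  # ['Finland', 'Florida', 'france']
```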

## Pandas Contains: It uses re.search() and returns a boolean value.

In [15]:
S=pd.Series(['Finland','Colombia','Florida','Japan','Puerto Rico','Russia','france'])
S.str.contains('^F.*')

0     True
1    False
2     True
3    False
4    False
5    False
6    False
dtype: bool

In [16]:
S=pd.Series(['Finland','Colombia','Florida','Japan','Puerto Rico','Russia','france'])
S.str.contains('^F.*').sum()

2

In [17]:
df[df['Country (region)'].str.contains('^I.*')==True]

Unnamed: 0,Country (region),Ladder,SD of Ladder,Positive affect,Negative affect,Social support,Freedom,Corruption,Generosity,Log of GDP\nper capita,Healthy life\nexpectancy,first_5_letter
3,Iceland,4,9,3.0,3.0,1.0,7.0,45.0,3.0,15.0,13.0,Icela
12,Israel,13,14,104.0,69.0,38.0,93.0,74.0,24.0,31.0,11.0,Israe
15,Ireland,16,34,33.0,32.0,6.0,33.0,10.0,9.0,6.0,20.0,Irela
35,Italy,36,31,99.0,123.0,23.0,132.0,128.0,48.0,29.0,7.0,Italy
91,Indonesia,92,108,9.0,104.0,94.0,48.0,129.0,2.0,83.0,98.0,Indon
98,Ivory Coast,99,134,88.0,130.0,137.0,100.0,62.0,114.0,118.0,147.0,Ivory
116,Iran,117,109,109.0,150.0,134.0,117.0,44.0,28.0,54.0,77.0,
125,Iraq,126,147,151.0,154.0,124.0,130.0,66.0,73.0,64.0,107.0,
139,India,140,41,93.0,115.0,142.0,41.0,73.0,65.0,103.0,105.0,India
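Because contains() uses re.search() while match() uses re.match(), the two differ whenever the pattern is not anchored. A small sketch on made-up names; `fullmatch()` is included for comparison and assumes pandas 1.1 or later:

```python
import pandas as pd

S = pd.Series(['Panama', 'Japan', 'Spain'])  # illustrative sample

anchored = S.str.match(r'pa')       # must match at the start of the string
anywhere = S.str.contains(r'pa')    # may match at any position
whole = S.str.fullmatch(r'\w{5}')   # the entire string must match

print(anchored.tolist())  # [False, False, False]
print(anywhere.tolist())  # [False, True, True]
print(whole.tolist())     # [False, True, True]
```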


## Pandas Split: This is equivalent to str.split() and accepts a regex. If no pattern is passed, the string is split on whitespace.

In [18]:
s = pd.Series(["StatueofLiberty built-on 28-Oct-1886 in New York"])
s.str.split(r"\s", n=-1,expand=True) # n=-1 means split the string at every whitespace character

Unnamed: 0,0,1,2,3,4,5
0,StatueofLiberty,built-on,28-Oct-1886,in,New,York


In [19]:
s = pd.Series(["StatueofLiberty built-on 28-Oct-1886 in New York"])
s.str.split(r"\s", n=1,expand=True) # split the string at the first space

Unnamed: 0,0,1
0,StatueofLiberty,built-on 28-Oct-1886 in New York


In [20]:
s = pd.Series(["StatueofLiberty built-on 28-Oct-1886 in New York"])
s.str.split(r"\s", n=2,expand=True) # split the string at the first and the second space

Unnamed: 0,0,1,2
0,StatueofLiberty,built-on,28-Oct-1886 in New York


In [21]:
s = pd.Series(["StatueofLiberty built-on 28-Oct-1886 in New York"])
s.str.split(r"\s", n=3,expand=True) # split the string at the first, second and third spaces

Unnamed: 0,0,1,2,3
0,StatueofLiberty,built-on,28-Oct-1886,in New York
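A regex becomes useful for split() when several delimiters are involved. The sketch below splits the same sentence on whitespace or hyphens, and also shows rsplit(), which splits from the end; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(["StatueofLiberty built-on 28-Oct-1886 in New York"])

# Character set: split on any whitespace character or a hyphen
parts = s.str.split(r"[\s-]", expand=True)

# rsplit counts splits from the right: n=1 separates only the last word
tail = s.str.rsplit(" ", n=1, expand=True)

print(parts.iloc[0].tolist())
# ['StatueofLiberty', 'built', 'on', '28', 'Oct', '1886', 'in', 'New', 'York']
print(tail.iloc[0].tolist())
# ['StatueofLiberty built-on 28-Oct-1886 in New', 'York']
```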


In [22]:
pd.set_option('display.max_colwidth', None)

In [23]:
data = [['Evert van Dijk', 'Carmine-pink, salmon-pink streaks, stripes, flecks.  Warm pink, clear carmine pink, rose pink shaded salmon.  Mild fragrance.  Large, very double, in small clusters, high-centered bloom form.  Blooms in flushes throughout the season.'],
        ['Every Good Gift', 'Red.  Flowers velvety red.  Moderate fragrance.  Average diameter 4".  Medium-large, full (26-40 petals), borne mostly solitary bloom form.  Blooms in flushes throughout the season.'], 
        ['Evghenya', 'Orange-pink.  75 petals.  Large, very double bloom form.  Blooms in flushes throughout the season.'], 
        ['Evita', 'White or white blend.  None to mild fragrance.  35 petals.  Large, full (26-40 petals), high-centered bloom form.  Blooms in flushes throughout the season.'],
        ['Evrathin', 'Light pink. [Deep pink.]  Outer petals white. Expand rarely.  Mild fragrance.  35 to 40 petals.  Average diameter 2.5".  Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form.  Prolific, once-blooming spring or summer.  Glandular sepals, leafy sepals, long sepals buds.'],
        ['Evita 2', 'White, blush shading.  Mild, wild rose fragrance.  20 to 25 petals.  Average diameter 1.25".  Small, very double, cluster-flowered bloom form.  Blooms in flushes throughout the season.']]
  
df = pd.DataFrame(data, columns = ['NAME', 'BLOOM']) 
  
df 

Unnamed: 0,NAME,BLOOM
0,Evert van Dijk,"Carmine-pink, salmon-pink streaks, stripes, flecks. Warm pink, clear carmine pink, rose pink shaded salmon. Mild fragrance. Large, very double, in small clusters, high-centered bloom form. Blooms in flushes throughout the season."
1,Every Good Gift,"Red. Flowers velvety red. Moderate fragrance. Average diameter 4"". Medium-large, full (26-40 petals), borne mostly solitary bloom form. Blooms in flushes throughout the season."
2,Evghenya,"Orange-pink. 75 petals. Large, very double bloom form. Blooms in flushes throughout the season."
3,Evita,"White or white blend. None to mild fragrance. 35 petals. Large, full (26-40 petals), high-centered bloom form. Blooms in flushes throughout the season."
4,Evrathin,"Light pink. [Deep pink.] Outer petals white. Expand rarely. Mild fragrance. 35 to 40 petals. Average diameter 2.5"". Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form. Prolific, once-blooming spring or summer. Glandular sepals, leafy sepals, long sepals buds."
5,Evita 2,"White, blush shading. Mild, wild rose fragrance. 20 to 25 petals. Average diameter 1.25"". Small, very double, cluster-flowered bloom form. Blooms in flushes throughout the season."


In [24]:
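# Extract the first parenthesised group, then strip the literal parentheses.
# regex=False is required here: '(' on its own is not a valid regex.
df['PETALS'] = df['BLOOM'].str.extract(r'(\(.*?\))', expand=False).str.strip()
df['PETALS']=df['PETALS'].str.replace('(', '', regex=False).str.replace(')', '', regex=False)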
df

Unnamed: 0,NAME,BLOOM,PETALS
0,Evert van Dijk,"Carmine-pink, salmon-pink streaks, stripes, flecks. Warm pink, clear carmine pink, rose pink shaded salmon. Mild fragrance. Large, very double, in small clusters, high-centered bloom form. Blooms in flushes throughout the season.",
1,Every Good Gift,"Red. Flowers velvety red. Moderate fragrance. Average diameter 4"". Medium-large, full (26-40 petals), borne mostly solitary bloom form. Blooms in flushes throughout the season.",26-40 petals
2,Evghenya,"Orange-pink. 75 petals. Large, very double bloom form. Blooms in flushes throughout the season.",
3,Evita,"White or white blend. None to mild fragrance. 35 petals. Large, full (26-40 petals), high-centered bloom form. Blooms in flushes throughout the season.",26-40 petals
4,Evrathin,"Light pink. [Deep pink.] Outer petals white. Expand rarely. Mild fragrance. 35 to 40 petals. Average diameter 2.5"". Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form. Prolific, once-blooming spring or summer. Glandular sepals, leafy sepals, long sepals buds.",17-25 petals
5,Evita 2,"White, blush shading. Mild, wild rose fragrance. 20 to 25 petals. Average diameter 1.25"". Small, very double, cluster-flowered bloom form. Blooms in flushes throughout the season.",


In [25]:
# copying the content inside all brackets in column BLOOM into new column ALL_PETALS_BRACKETS
df['ALL_PETALS_BRACKETS'] = df['BLOOM'].str.findall(r'(\(.*?\))') # findall returns a list
df[['NAME','BLOOM','PETALS', 'ALL_PETALS_BRACKETS']]

Unnamed: 0,NAME,BLOOM,PETALS,ALL_PETALS_BRACKETS
0,Evert van Dijk,"Carmine-pink, salmon-pink streaks, stripes, flecks. Warm pink, clear carmine pink, rose pink shaded salmon. Mild fragrance. Large, very double, in small clusters, high-centered bloom form. Blooms in flushes throughout the season.",,[]
1,Every Good Gift,"Red. Flowers velvety red. Moderate fragrance. Average diameter 4"". Medium-large, full (26-40 petals), borne mostly solitary bloom form. Blooms in flushes throughout the season.",26-40 petals,[(26-40 petals)]
2,Evghenya,"Orange-pink. 75 petals. Large, very double bloom form. Blooms in flushes throughout the season.",,[]
3,Evita,"White or white blend. None to mild fragrance. 35 petals. Large, full (26-40 petals), high-centered bloom form. Blooms in flushes throughout the season.",26-40 petals,[(26-40 petals)]
4,Evrathin,"Light pink. [Deep pink.] Outer petals white. Expand rarely. Mild fragrance. 35 to 40 petals. Average diameter 2.5"". Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form. Prolific, once-blooming spring or summer. Glandular sepals, leafy sepals, long sepals buds.",17-25 petals,"[(17-25 petals), (26-40 petals)]"
5,Evita 2,"White, blush shading. Mild, wild rose fragrance. 20 to 25 petals. Average diameter 1.25"". Small, very double, cluster-flowered bloom form. Blooms in flushes throughout the season.",,[]


We can extract such patterns with the following RegExs:

- 2 digits to 2 digits (26 to 40) -  r'(\d{2}\s+to\s+\d{2})'
- 2 digits - 2 digits (26-40) -  r'(\d{2}-\d{2})'
- or as 2 digits followed by the word "petals" and a period (35 petals.) - r'(\d{2}\s+petals+\.)'

In the RegExs above I use:

- r to mark the RegEx (string) as a raw string, in which <font  color='red'>Python does not process backslash escape sequences, so \d and \s reach the regex engine unchanged.</font>
- \d - Matches any decimal digit. Equivalent to any single numeral 0 to 9.
- {} - Whatever precedes the braces {n} is repeated exactly n times ({n,} means at least n times).
- \s - Matches any whitespace character, such as a space, tab, or newline character. (Think of this as matching "space" characters.)
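The quantifier forms can be compared on a short made-up string. This is a sketch with the standard `re` module; the \b word boundaries are added here so that each number is matched as a whole:

```python
import re

text = "7 35 120 petals"  # illustrative sample

exactly_two = re.findall(r"\b\d{2}\b", text)    # exactly two digits
two_or_more = re.findall(r"\b\d{2,}\b", text)   # at least two digits
one_or_two = re.findall(r"\b\d{1,2}\b", text)   # between one and two digits
unbounded = re.findall(r"\d{2}", text)          # no boundaries: also cuts into 120

print(exactly_two)  # ['35']
print(two_or_more)  # ['35', '120']
print(one_or_two)   # ['7', '35']
print(unbounded)    # ['35', '12']
```

The last result shows why the notebook's `\d{2}` patterns can pick up part of a longer number unless something like a lookbehind or word boundary guards them.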


Positive and Negative Lookahead

- (?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, *Isaac (?=Asimov)* will match 'Isaac ' only if it’s followed by 'Asimov'.
 
- (?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, *Isaac (?!Asimov)* will match 'Isaac ' only if it’s not followed by 'Asimov'.


Positive and Negative Lookbehind

- (?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position.

- (?<!...)
Matches if the current position in the string is not preceded by a match for .... 
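The four assertions can be checked with the standard `re` module. The sample strings below are made up for illustration; match positions are used for the lookahead cases because both matches have the same text:

```python
import re

names = "Isaac Asimov, Isaac Newton"  # illustrative sample

# Lookahead: where does 'Isaac ' match when (not) followed by 'Asimov'?
followed = [m.start() for m in re.finditer(r"Isaac (?=Asimov)", names)]
not_followed = [m.start() for m in re.finditer(r"Isaac (?!Asimov)", names)]

prices = "cost $45, qty 12"  # illustrative sample

# Lookbehind: digits that are (not) preceded by a dollar sign
after_dollar = re.findall(r"(?<=\$)\d+", prices)
# Excluding digits in the lookbehind as well stops a match starting inside '45'
no_dollar = re.findall(r"(?<![$\d])\d+", prices)

print(followed)      # [0]
print(not_followed)  # [14]
print(after_dollar)  # ['45']
print(no_dollar)     # ['12']
```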

In [26]:
# ?<! is for negative look behind
# works on sample df but not on the df with my scraped data
df['PETALS1'] = df['BLOOM'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal', expand=False)
df

Unnamed: 0,NAME,BLOOM,PETALS,ALL_PETALS_BRACKETS,PETALS1
0,Evert van Dijk,"Carmine-pink, salmon-pink streaks, stripes, flecks. Warm pink, clear carmine pink, rose pink shaded salmon. Mild fragrance. Large, very double, in small clusters, high-centered bloom form. Blooms in flushes throughout the season.",,[],
1,Every Good Gift,"Red. Flowers velvety red. Moderate fragrance. Average diameter 4"". Medium-large, full (26-40 petals), borne mostly solitary bloom form. Blooms in flushes throughout the season.",26-40 petals,[(26-40 petals)],
2,Evghenya,"Orange-pink. 75 petals. Large, very double bloom form. Blooms in flushes throughout the season.",,[],
3,Evita,"White or white blend. None to mild fragrance. 35 petals. Large, full (26-40 petals), high-centered bloom form. Blooms in flushes throughout the season.",26-40 petals,[(26-40 petals)],
4,Evrathin,"Light pink. [Deep pink.] Outer petals white. Expand rarely. Mild fragrance. 35 to 40 petals. Average diameter 2.5"". Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form. Prolific, once-blooming spring or summer. Glandular sepals, leafy sepals, long sepals buds.",17-25 petals,"[(17-25 petals), (26-40 petals)]",35 to 40
5,Evita 2,"White, blush shading. Mild, wild rose fragrance. 20 to 25 petals. Average diameter 1.25"". Small, very double, cluster-flowered bloom form. Blooms in flushes throughout the season.",,[],20 to 25


In [27]:
# matches the regex pattern: two digits, "to", two digits
df['PETALS2'] = df['BLOOM'].str.extract(r'(\d{2}\s+to\s+\d{2})', expand=False).str.strip()

# matches the regex pattern: two digits followed by the word "petals" and a period
df['PETALS3'] = df['BLOOM'].str.extract(r'(\d{2}\s+petals+\.)', expand=False).str.strip()

# matches the regex pattern: two digits, hyphen, two digits followed by the word "petals"
df['PETALS4'] = df['BLOOM'].str.extract(r'(\d{2}-\d{2}\s+petals)', expand=False).str.strip() 

df

Unnamed: 0,NAME,BLOOM,PETALS,ALL_PETALS_BRACKETS,PETALS1,PETALS2,PETALS3,PETALS4
0,Evert van Dijk,"Carmine-pink, salmon-pink streaks, stripes, flecks. Warm pink, clear carmine pink, rose pink shaded salmon. Mild fragrance. Large, very double, in small clusters, high-centered bloom form. Blooms in flushes throughout the season.",,[],,,,
1,Every Good Gift,"Red. Flowers velvety red. Moderate fragrance. Average diameter 4"". Medium-large, full (26-40 petals), borne mostly solitary bloom form. Blooms in flushes throughout the season.",26-40 petals,[(26-40 petals)],,,,26-40 petals
2,Evghenya,"Orange-pink. 75 petals. Large, very double bloom form. Blooms in flushes throughout the season.",,[],,,75 petals.,
3,Evita,"White or white blend. None to mild fragrance. 35 petals. Large, full (26-40 petals), high-centered bloom form. Blooms in flushes throughout the season.",26-40 petals,[(26-40 petals)],,,35 petals.,26-40 petals
4,Evrathin,"Light pink. [Deep pink.] Outer petals white. Expand rarely. Mild fragrance. 35 to 40 petals. Average diameter 2.5"". Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form. Prolific, once-blooming spring or summer. Glandular sepals, leafy sepals, long sepals buds.",17-25 petals,"[(17-25 petals), (26-40 petals)]",35 to 40,35 to 40,40 petals.,17-25 petals
5,Evita 2,"White, blush shading. Mild, wild rose fragrance. 20 to 25 petals. Average diameter 1.25"". Small, very double, cluster-flowered bloom form. Blooms in flushes throughout the season.",,[],20 to 25,20 to 25,25 petals.,


## Create one master column for petals data. The master column will be column PETALS1

In [28]:
# if column PETALS1 is null then replace with value in column PETALS4
# (fillna updates the column itself; assigning to a row from iterrows() may only change a copy)
df['PETALS1'] = df['PETALS1'].fillna(df['PETALS4'])
df

Unnamed: 0,NAME,BLOOM,PETALS,ALL_PETALS_BRACKETS,PETALS1,PETALS2,PETALS3,PETALS4
0,Evert van Dijk,"Carmine-pink, salmon-pink streaks, stripes, flecks. Warm pink, clear carmine pink, rose pink shaded salmon. Mild fragrance. Large, very double, in small clusters, high-centered bloom form. Blooms in flushes throughout the season.",,[],,,,
1,Every Good Gift,"Red. Flowers velvety red. Moderate fragrance. Average diameter 4"". Medium-large, full (26-40 petals), borne mostly solitary bloom form. Blooms in flushes throughout the season.",26-40 petals,[(26-40 petals)],26-40 petals,,,26-40 petals
2,Evghenya,"Orange-pink. 75 petals. Large, very double bloom form. Blooms in flushes throughout the season.",,[],,,75 petals.,
3,Evita,"White or white blend. None to mild fragrance. 35 petals. Large, full (26-40 petals), high-centered bloom form. Blooms in flushes throughout the season.",26-40 petals,[(26-40 petals)],26-40 petals,,35 petals.,26-40 petals
4,Evrathin,"Light pink. [Deep pink.] Outer petals white. Expand rarely. Mild fragrance. 35 to 40 petals. Average diameter 2.5"". Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form. Prolific, once-blooming spring or summer. Glandular sepals, leafy sepals, long sepals buds.",17-25 petals,"[(17-25 petals), (26-40 petals)]",35 to 40,35 to 40,40 petals.,17-25 petals
5,Evita 2,"White, blush shading. Mild, wild rose fragrance. 20 to 25 petals. Average diameter 1.25"". Small, very double, cluster-flowered bloom form. Blooms in flushes throughout the season.",,[],20 to 25,20 to 25,25 petals.,


In [29]:
# if column PETALS1 is still null then replace with value in column PETALS3
# (fillna updates the column itself; assigning to a row from iterrows() may only change a copy)
df['PETALS1'] = df['PETALS1'].fillna(df['PETALS3'])
df


Unnamed: 0,NAME,BLOOM,PETALS,ALL_PETALS_BRACKETS,PETALS1,PETALS2,PETALS3,PETALS4
0,Evert van Dijk,"Carmine-pink, salmon-pink streaks, stripes, flecks. Warm pink, clear carmine pink, rose pink shaded salmon. Mild fragrance. Large, very double, in small clusters, high-centered bloom form. Blooms in flushes throughout the season.",,[],,,,
1,Every Good Gift,"Red. Flowers velvety red. Moderate fragrance. Average diameter 4"". Medium-large, full (26-40 petals), borne mostly solitary bloom form. Blooms in flushes throughout the season.",26-40 petals,[(26-40 petals)],26-40 petals,,,26-40 petals
2,Evghenya,"Orange-pink. 75 petals. Large, very double bloom form. Blooms in flushes throughout the season.",,[],75 petals.,,75 petals.,
3,Evita,"White or white blend. None to mild fragrance. 35 petals. Large, full (26-40 petals), high-centered bloom form. Blooms in flushes throughout the season.",26-40 petals,[(26-40 petals)],26-40 petals,,35 petals.,26-40 petals
4,Evrathin,"Light pink. [Deep pink.] Outer petals white. Expand rarely. Mild fragrance. 35 to 40 petals. Average diameter 2.5"". Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form. Prolific, once-blooming spring or summer. Glandular sepals, leafy sepals, long sepals buds.",17-25 petals,"[(17-25 petals), (26-40 petals)]",35 to 40,35 to 40,40 petals.,17-25 petals
5,Evita 2,"White, blush shading. Mild, wild rose fragrance. 20 to 25 petals. Average diameter 1.25"". Small, very double, cluster-flowered bloom form. Blooms in flushes throughout the season.",,[],20 to 25,20 to 25,25 petals.,
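The two fill steps above can also be chained in one expression. The sketch below uses a small stand-in frame with the same column names (the values are abbreviated from the outputs above) and a hypothetical MASTER column, so the original PETALS1 is left untouched:

```python
import numpy as np
import pandas as pd

# Stand-in frame: same column names, abbreviated illustrative values
demo = pd.DataFrame({
    'PETALS1': ['35 to 40', np.nan, np.nan, np.nan],
    'PETALS4': [np.nan, '26-40 petals', np.nan, np.nan],
    'PETALS3': ['40 petals.', np.nan, '75 petals.', np.nan],
})

# combine_first keeps the left value and falls back to the right one where it is null
demo['MASTER'] = demo['PETALS1'].combine_first(demo['PETALS4']).combine_first(demo['PETALS3'])

print(demo['MASTER'].tolist()[:3])  # ['35 to 40', '26-40 petals', '75 petals.']
```

Rows where all three source columns are null stay null in MASTER.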


## More examples of replace()
[reference](https://www.geeksforgeeks.org/replace-values-in-pandas-dataframe-using-regex/)

In [30]:
import pandas as pd

df = pd.DataFrame({'City':['New York', 'Parague', 'New Delhi', 'Venice', 'new Orleans'],
                    'Event':['Music', 'Poetry', 'Theatre', 'Comedy', 'Tech_Summit'],
                    'Cost':[10000, 5000, 15000, 2000, 12000]})

# create the index
index_ = [pd.Period('02-2018'), pd.Period('04-2018'),
        pd.Period('06-2018'), pd.Period('10-2018'), pd.Period('12-2018')]

# Set the index
df.index = index_

df


Unnamed: 0,City,Event,Cost
2018-02,New York,Music,10000
2018-04,Parague,Poetry,5000
2018-06,New Delhi,Theatre,15000
2018-10,Venice,Comedy,2000
2018-12,new Orleans,Tech_Summit,12000


In [31]:
# replace the matching strings
df_updated = df.replace(to_replace ='[nN]ew', value = 'New_', regex = True)

df_updated


Unnamed: 0,City,Event,Cost
2018-02,New_ York,Music,10000
2018-04,Parague,Poetry,5000
2018-06,New_ Delhi,Theatre,15000
2018-10,Venice,Comedy,2000
2018-12,New_ Orleans,Tech_Summit,12000


In [32]:
# importing pandas as pd
import pandas as pd

# Let's create a Dataframe
df = pd.DataFrame({'City':['New York (City)', 'Parague', 'New Delhi (Delhi)', 'Venice', 'new Orleans'],
                    'Event':['Music', 'Poetry', 'Theatre', 'Comedy', 'Tech_Summit'],
                    'Cost':[10000, 5000, 15000, 2000, 12000]})


# Let's create the index (note: it is not assigned to df here, so the default 0-4 index is kept)
index_ = [pd.Period('02-2018'), pd.Period('04-2018'),
        pd.Period('06-2018'), pd.Period('10-2018'), pd.Period('12-2018')]
df


Unnamed: 0,City,Event,Cost
0,New York (City),Music,10000
1,Parague,Poetry,5000
2,New Delhi (Delhi),Theatre,15000
3,Venice,Comedy,2000
4,new Orleans,Tech_Summit,12000


In [33]:
import re

def Clean_names(City_name):
    # Search for an opening bracket in the name followed by any characters repeated any number of times
    match = re.search(r'\(.*', City_name)
    if match:
        # Return the name up to the position where the pattern begins
        return City_name[:match.start()]

    # If no clean-up is needed, return the same name
    return City_name

# Update the City column
df['City'] = df['City'].apply(Clean_names)

df


Unnamed: 0,City,Event,Cost
0,New York,Music,10000
1,Parague,Poetry,5000
2,New Delhi,Theatre,15000
3,Venice,Comedy,2000
4,new Orleans,Tech_Summit,12000
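The same clean-up can be expressed without a helper function by using the .str accessor. This is an alternative sketch on a stand-in Series, not the referenced article's method; the leading `\s*` is an addition that also removes the space before the bracket:

```python
import pandas as pd

cities = pd.Series(['New York (City)', 'Parague', 'New Delhi (Delhi)'])  # stand-in data

# Remove optional whitespace, an opening bracket, and everything after it
cleaned = cities.str.replace(r'\s*\(.*', '', regex=True)

print(cleaned.tolist())  # ['New York', 'Parague', 'New Delhi']
```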


In [34]:
import pandas as pd

df = pd.read_csv('internshala_dataset_raw.csv')
df

Unnamed: 0,internship,company_name,skills,perks,location,duration,stipend,applicants,ifSkillsorPerksMissingUseThis
0,Software Testing,Times Internet,Software Testing,"Certificate, 5 days a week",Noida,6 Months,8000 /month,119 applicants,"Software Testing, Certificate\n5 days a week"
1,Technical Operations - Networking And Monitoring,Paytm Payments Bank,"Java, SQL, Unix, Oracle, MS SQL Server, Hibernate (Java), Shell Scripting, Spring MVC, REST API","Certificate, Letter of recommendation, 5 days a week, Job offer",Noida,6 Months,10000 /month,194 applicants,"Java\nSQL\nUnix\nOracle\nMS SQL Server\nHibernate (Java)\nShell Scripting\nSpring MVC\nREST API, Certificate\nLetter of recommendation\n5 days a week\nJob offer"
2,Software Project Management,IIT Bombay,"English Proficiency (Spoken), English Proficiency (Written), Hindi Proficiency (Spoken), Hindi Proficiency (Written)","Certificate, Letter of recommendation, Flexible work hours, 5 days a week",Work From Home,6 Months,1000-2000 /month,113 applicants,"English Proficiency (Spoken)\nEnglish Proficiency (Written)\nHindi Proficiency (Spoken)\nHindi Proficiency (Written), Certificate\nLetter of recommendation\nFlexible work hours\n5 days a week"
3,Web Development,IIT Bombay,"HTML, CSS, Flask, Python, Django","Certificate, Letter of recommendation, Flexible work hours, 5 days a week",Work From Home,6 Months,1000-2000 /month,183 applicants,"HTML\nCSS\nFlask\nPython\nDjango, Certificate\nLetter of recommendation\nFlexible work hours\n5 days a week"
4,Front End Development,IIT Bombay,"HTML, CSS, JavaScript, ReactJS, Redux","Certificate, Letter of recommendation, Flexible work hours, 5 days a week",Work From Home,6 Months,1000-2000 /month,205 applicants,"HTML\nCSS\nJavaScript\nReactJS\nRedux, Certificate\nLetter of recommendation\nFlexible work hours\n5 days a week"
...,...,...,...,...,...,...,...,...,...
2562,Internet Of Things (IoT),Aditya Pratap Law Offices,"AutoCAD, Autodesk Inventor, Arduino, Circuit Design, Internet of Things (IoT)",,Lucknow,3 Months,2000-5000 /month,30 applicants,"AutoCAD\nAutodesk Inventor\nArduino\nCircuit Design\nInternet of Things (IoT), Certificate\nLetter of recommendation"
2563,Summer Research,"Indraprastha Institute Of Information Technology, Delhi",,,Delhi,2 Months,5000 /month,Be an early applicant,
2564,Academic Research (Computer Science),Max Planck Institute,,,Work From Home,3 Months,Not provided,Be an early applicant,
2565,Computer Science,Aryabhatta Research Institute Of Observational Sciences,,,Nainital,5 Months,10000 /month,Be an early applicant,


### Replace single string value of " applicants"

In [35]:
df['applicants'].str.replace(r'\sapplicants', '', regex=True)

0                         119
1                         194
2                         113
3                         183
4                         205
                ...          
2562                       30
2563    Be an early applicant
2564    Be an early applicant
2565    Be an early applicant
2566    Be an early applicant
Name: applicants, Length: 2567, dtype: object
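
Note that `regex=True` matters here: since pandas 2.0, `str.replace()` treats the pattern as a literal string by default. A minimal sketch on a hypothetical two-row sample shows the difference.

```python
import pandas as pd

# Hypothetical sample using values seen in the applicants column
sample = pd.Series(["119 applicants", "Be an early applicant"])

# Pattern interpreted as a regex: \s matches the space before "applicants"
as_regex = sample.str.replace(r"\sapplicants", "", regex=True)

# Pattern interpreted literally: the text "\sapplicants" never occurs, so nothing changes
as_literal = sample.str.replace(r"\sapplicants", "", regex=False)

print(as_regex.tolist())    # ['119', 'Be an early applicant']
print(as_literal.tolist())  # ['119 applicants', 'Be an early applicant']
```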

### Replace multiple string values


In [36]:
df['applicants'].str.replace(r'(\sapplicants|Be an early applicant)', '', regex=True).tolist()

['119',
 '194',
 '113',
 '183',
 '205',
 '457',
 '54',
 '618',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '34',
 '',
 '',
 '28',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '32',
 '',
 '',
 '',
 '',
 '',
 '41',
 '',
 '',
 '',
 '',
 '',
 '41',
 '',
 '',
 '',
 '35',
 '',
 '',
 '',
 '',
 '',
 '31',
 '',
 '',
 '46',
 '',
 '',
 '49',
 '',
 '',
 '34',
 '',
 '',
 '57',
 '',
 '32',
 '',
 '',
 '38',
 '',
 '',
 '49',
 '35',
 '',
 '',
 '',
 '43',
 '',
 '',
 '44',
 '',
 '',
 '',
 '',
 '',
 '',
 '75',
 '',
 '',
 '',
 '',
 '',
 '39',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '37',
 '',
 '34',
 '',
 '',
 '48',
 '',
 '47',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '47',
 '',
 '',
 '55',
 '',
 '',
 '',
 '',
 '',
 '66',
 '',
 '31',
 '',
 '',
 '',
 '31',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '27',
 '',
 '',
 '26',
 '',
 '',
 '32',
 '',
 '',
 '',
 '52',
 '',
 '',
 '',
 '',
 '',
 '',
 '62',
 '31',
 '',
 '',
 '68',
 '',
 '37',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
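
The alternation `|` tries each branch in turn, and the surrounding parentheses are optional when the whole pattern is the alternation. As a sketch, the variant below also makes the trailing `s` optional so that a singular value would be handled. The `"1 applicant"` row is a hypothetical case, not one confirmed to exist in the dataset.

```python
import pandas as pd

# Hypothetical sample; "1 applicant" is an assumed edge case
sample = pd.Series(["119 applicants", "Be an early applicant", "1 applicant"])

# Either " applicant(s)" or the full placeholder text is replaced with ''
result = sample.str.replace(r"\sapplicants?|Be an early applicant", "", regex=True)

print(result.tolist())  # ['119', '', '1']
```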


### Replace multiple string values with different replacements
- "Be an early applicant" to 0
- "49 applicants" to 49

In [37]:
df['applicants'].replace([r'(\d+) applicants', 'Be an early applicant'],[r'\1',0], regex=True)

0       119
1       194
2       113
3       183
4       205
       ... 
2562     30
2563      0
2564      0
2565      0
2566      0
Name: applicants, Length: 2567, dtype: object
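
The result above still has `dtype: object`, because the captured numbers remain strings. One possible follow-up, sketched on a hypothetical sample, is to replace with the string `"0"` and then convert the whole column with `pd.to_numeric()`. The `^` and `$` anchors are an addition here, so that only whole-cell matches are replaced.

```python
import pandas as pd

# Hypothetical sample using values seen in the applicants column
sample = pd.Series(["119 applicants", "30 applicants", "Be an early applicant"])

# Each pattern in the first list is paired with the replacement at the same position
as_text = sample.replace(
    [r"^(\d+) applicants$", r"^Be an early applicant$"],
    [r"\1", "0"],
    regex=True,
)

# Convert the cleaned strings to numbers
counts = pd.to_numeric(as_text)

print(counts.tolist())  # [119, 30, 0]
```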

### Replace with capture group


In [38]:
df['applicants'].replace(to_replace=r"([0-9,\.]+)(.*)", value=r"\1", regex=True)

0                         119
1                         194
2                         113
3                         183
4                         205
                ...          
2562                       30
2563    Be an early applicant
2564    Be an early applicant
2565    Be an early applicant
2566    Be an early applicant
Name: applicants, Length: 2567, dtype: object
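
In the cell above, `\1` keeps the first group (the leading number) and drops the second group (the rest). Rows without a number do not match at all, so they are returned unchanged. A related option is `str.extract()`, which returns `NaN` for such rows. The sketch below compares both on a hypothetical sample; the `"1,200 applicants"` value is invented to show why the comma is in the character class.

```python
import pandas as pd

# Hypothetical sample; "1,200 applicants" is an assumed value
sample = pd.Series(["119 applicants", "1,200 applicants", "Be an early applicant"])

# replace(): keep group 1, discard group 2; non-matching rows stay as they are
kept = sample.replace(to_replace=r"([0-9,\.]+)(.*)", value=r"\1", regex=True)

# extract(): return group 1 only; non-matching rows become NaN
numbers = sample.str.extract(r"([0-9,.]+)", expand=False)

print(kept.tolist())     # ['119', '1,200', 'Be an early applicant']
print(numbers.tolist())  # ['119', '1,200', nan]
```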

## Regex: replace only special characters with an empty string
- r'[^0-9a-zA-Z:,\s]+' - keep numbers, letters, colon, comma and whitespace
- r'[^0-9a-zA-Z:,]+' - keep numbers, letters, colon and comma <br>
The first pattern will change: Internet Of Things (IoT) to Internet Of Things IoT

In [39]:
df['internship'].str.replace(r'[^0-9a-zA-Z:,\s]+', '', regex=True)

0                                      Software Testing
1       Technical Operations  Networking And Monitoring
2                           Software Project Management
3                                       Web Development
4                                 Front End Development
                             ...                       
2562                             Internet Of Things IoT
2563                                    Summer Research
2564                 Academic Research Computer Science
2565                                   Computer Science
2566                                   Computer Science
Name: internship, Length: 2567, dtype: object
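
Removing a character that sits between two spaces leaves a double space behind, as in row 1 of the output above. A minimal sketch of one way to tidy this up is to collapse runs of whitespace in a second pass:

```python
import pandas as pd

# Two internship titles taken from the dataset shown earlier
titles = pd.Series(["Technical Operations - Networking And Monitoring",
                    "Internet Of Things (IoT)"])

# Step 1: drop everything except letters, digits, colon, comma and whitespace
no_specials = titles.str.replace(r"[^0-9a-zA-Z:,\s]+", "", regex=True)

# Step 2: collapse repeated whitespace into one space and trim the ends
tidy = no_specials.str.replace(r"\s+", " ", regex=True).str.strip()

print(tidy.tolist())
# ['Technical Operations Networking And Monitoring', 'Internet Of Things IoT']
```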

## Replace all non-numeric characters and map missing values
Replace everything that is not a digit using a regex. If a value contains no digits, the result is an empty string, which we map to 0.

In [40]:
df['applicants'].str.replace(r'\D+', '', regex=True).replace({'':0}).astype('int')

0       119
1       194
2       113
3       183
4       205
       ... 
2562     30
2563      0
2564      0
2565      0
2566      0
Name: applicants, Length: 2567, dtype: int32
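
One caveat: `\D+` joins together all the digits in a cell, which works for `applicants` but would merge a range such as the `1000-2000 /month` stipend into a single number. A sketch of the difference, with `str.extract()` as one possible alternative that keeps only the first number:

```python
import pandas as pd

# Stipend values as they appear in the dataset shown earlier
stipend = pd.Series(["8000 /month", "1000-2000 /month", "Not provided"])

# Removing all non-digits glues the two ends of the range together
digits_only = stipend.str.replace(r"\D+", "", regex=True)
print(digits_only.tolist())  # ['8000', '10002000', '']

# Extracting the first run of digits keeps the lower bound only
first_number = stipend.str.extract(r"(\d+)", expand=False)
lower_bound = first_number.fillna("0").astype(int)
print(lower_bound.tolist())  # [8000, 1000, 0]
```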

## Replace all numbers in a given column with a space

In [41]:
df['applicants'].replace(to_replace=r"\d+", value=r" ", regex=True)

0                  applicants
1                  applicants
2                  applicants
3                  applicants
4                  applicants
                ...          
2562               applicants
2563    Be an early applicant
2564    Be an early applicant
2565    Be an early applicant
2566    Be an early applicant
Name: applicants, Length: 2567, dtype: object

## Replace all numbers across the whole DataFrame

In [42]:
df.replace(to_replace=r"\d+", value=r" ", regex=True)


Unnamed: 0,internship,company_name,skills,perks,location,duration,stipend,applicants,ifSkillsorPerksMissingUseThis
0,Software Testing,Times Internet,Software Testing,"Certificate, days a week",Noida,Months,/month,applicants,"Software Testing, Certificate\n days a week"
1,Technical Operations - Networking And Monitoring,Paytm Payments Bank,"Java, SQL, Unix, Oracle, MS SQL Server, Hibernate (Java), Shell Scripting, Spring MVC, REST API","Certificate, Letter of recommendation, days a week, Job offer",Noida,Months,/month,applicants,"Java\nSQL\nUnix\nOracle\nMS SQL Server\nHibernate (Java)\nShell Scripting\nSpring MVC\nREST API, Certificate\nLetter of recommendation\n days a week\nJob offer"
2,Software Project Management,IIT Bombay,"English Proficiency (Spoken), English Proficiency (Written), Hindi Proficiency (Spoken), Hindi Proficiency (Written)","Certificate, Letter of recommendation, Flexible work hours, days a week",Work From Home,Months,- /month,applicants,"English Proficiency (Spoken)\nEnglish Proficiency (Written)\nHindi Proficiency (Spoken)\nHindi Proficiency (Written), Certificate\nLetter of recommendation\nFlexible work hours\n days a week"
3,Web Development,IIT Bombay,"HTML, CSS, Flask, Python, Django","Certificate, Letter of recommendation, Flexible work hours, days a week",Work From Home,Months,- /month,applicants,"HTML\nCSS\nFlask\nPython\nDjango, Certificate\nLetter of recommendation\nFlexible work hours\n days a week"
4,Front End Development,IIT Bombay,"HTML, CSS, JavaScript, ReactJS, Redux","Certificate, Letter of recommendation, Flexible work hours, days a week",Work From Home,Months,- /month,applicants,"HTML\nCSS\nJavaScript\nReactJS\nRedux, Certificate\nLetter of recommendation\nFlexible work hours\n days a week"
...,...,...,...,...,...,...,...,...,...
2562,Internet Of Things (IoT),Aditya Pratap Law Offices,"AutoCAD, Autodesk Inventor, Arduino, Circuit Design, Internet of Things (IoT)",,Lucknow,Months,- /month,applicants,"AutoCAD\nAutodesk Inventor\nArduino\nCircuit Design\nInternet of Things (IoT), Certificate\nLetter of recommendation"
2563,Summer Research,"Indraprastha Institute Of Information Technology, Delhi",,,Delhi,Months,/month,Be an early applicant,
2564,Academic Research (Computer Science),Max Planck Institute,,,Work From Home,Months,Not provided,Be an early applicant,
2565,Computer Science,Aryabhatta Research Institute Of Observational Sciences,,,Nainital,Months,/month,Be an early applicant,
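
A regex replacement on a whole DataFrame only touches string values; numeric columns are left as they are. The sketch below illustrates this on a small hypothetical frame (the `rank` column is invented for the example).

```python
import pandas as pd

# Hypothetical frame: two text columns and one integer column
df_small = pd.DataFrame({
    "duration": ["6 Months", "3 Months"],
    "stipend": ["8000 /month", "Not provided"],
    "rank": [1, 2],
})

# Remove every run of digits from the string cells
replaced = df_small.replace(to_replace=r"\d+", value="", regex=True)

print(replaced["duration"].tolist())  # [' Months', ' Months']
print(replaced["stipend"].tolist())   # [' /month', 'Not provided']
print(replaced["rank"].tolist())      # [1, 2]
```

To limit the replacement to chosen columns, the same call can be applied to a selection such as `df_small[["duration"]]`.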
