# Python for Data Science Bootcamp
## Lecture 8 - Part 1
## Textbook reference: PY4E 
### Chapter 11
### Lesson 12: Regular Expressions

Here are the topics for this lecture:

 * Introduction to Regular Expressions
 * Findall function

Let's get started...

**Regular Expressions:** A RegEx, or Regular Expression, is a **sequence of characters that forms a search pattern**. Typically, these patterns are used for four main tasks:

* Find text within a larger body of text
* Validate that a string conforms to a desired format
* Replace or insert text
* Split strings
    
As a data scientist or developer, a solid understanding of RegEx makes many data manipulation and text mining tasks much easier.

RegEx can be used to check if a string contains the specified search pattern. 

Python has a built-in module called re for working with Regular Expressions. To use regular expressions in your code, you must first import the module: **import re**

The re module provides a set of RegEx functions (see below). Each function takes a pattern as an argument, and patterns are built from ordinary characters and metacharacters (see below).

**RegEx Functions**

The re module offers a set of functions that allow us to search a string for a match:

* findall - Returns a list containing all matches
* search - Returns a Match object if there is a match anywhere in the string
* split - Returns a list where the string has been split at each match
* sub - Replaces one or many matches with a string
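
To make the four functions concrete, here is a minimal sketch that applies each one to the sentence used in this lecture. The variable names (`text`, `all_is`, `parts`, `replaced`) and the replacement name "Ada" are illustrative choices, not part of the original notebook.

```python
import re

text = "My name is Luis Morales, what is yours?"

# findall: every non-overlapping match, returned as a list of strings
all_is = re.findall(r"is", text)
print(all_is)                        # ['is', 'is', 'is'] (the middle one is inside "Luis")

# search: a Match object for the first match, or None if nothing matches
match = re.search(r"Luis", text)
if match:
    print(match.group(), match.start())   # Luis 11

# split: break the string wherever the pattern matches
parts = re.split(r",\s*", text)
print(parts)                         # ['My name is Luis Morales', 'what is yours?']

# sub: replace every match with a replacement string
replaced = re.sub(r"Luis Morales", "Ada", text)
print(replaced)                      # My name is Ada, what is yours?
```

Note that `search` returns a Match object rather than a string, so the matched text has to be read with `group()`.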

**Metacharacters**

* ^ Matches the beginning of a line 
* $ Matches the end of the line 
* . Matches any character 
* \s Matches whitespace 
* \S Matches any non-whitespace character
* "*" Repeats a character zero or more times 
* "*?" Repeats a character zero or more times (non-greedy)
* "+" Repeats a character one or more times 
* "+?" Repeats a character one or more times (non-greedy)
* [aeiou] Matches a single character in the listed set 
* [^XYZ] Matches a single character not in the listed set 
* [a-z0-9] The set of characters can include a range 
* ( Indicates where string extraction is to start 
* ) Indicates where string extraction is to end
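
A short sketch can show several of these metacharacters in action, in particular the difference between greedy and non-greedy repetition. The sample `line` below is a made-up string for illustration only.

```python
import re

# Hypothetical sample line, not taken from the lecture data
line = "From: luis <lm338@duke.edu> to <staff@duke.edu>"

# ^ anchors the pattern at the beginning of the line
at_start = re.findall(r"^From", line)      # ['From']

# $ anchors the pattern at the end of the line
at_end = re.findall(r"edu>$", line)        # ['edu>']

# Greedy: .+ takes as much as it can, from the first < to the last >
greedy = re.findall(r"<.+>", line)         # ['<lm338@duke.edu> to <staff@duke.edu>']

# Non-greedy: .+? stops at the first > it reaches
lazy = re.findall(r"<.+?>", line)          # ['<lm338@duke.edu>', '<staff@duke.edu>']

# Parentheses: only the text inside ( ) is returned
users = re.findall(r"<(\S+?)@", line)      # ['lm338', 'staff']

print(at_start, at_end, greedy, lazy, users, sep="\n")
```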

In [1]:
import re
str = "My name is Luis Morales, what is yours?"

![image.png](attachment:image.png)


**Findall Function:** Returns a list containing all matches

In [2]:
re.findall?

In [3]:
#Check if the string starts with "My":
# \A - Matches if the specified characters are at the start of a string.

x = re.findall(r"\AMy", str)
if (x):
  print("Yes, there is a match!")
else:
  print("No match")

Yes, there is a match!


In [4]:
#Find all lower case characters alphabetically between "a" and "m":
x = re.findall("[a-m]", str)
print(x)

['a', 'm', 'e', 'i', 'i', 'a', 'l', 'e', 'h', 'a', 'i']


In [5]:
#Check if "Mor" is present at the beginning of a word:
# The 'r' prefix on the pattern string marks a Python "raw" string,
# which passes backslashes through unchanged

x = re.findall(r"\bMor", str)

# Python raw string treats backslash (\) as a literal character. 
# This is useful when we want to have a string that contains backslash 
# and don't want it to be treated as an escape character.

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['Mor']
Yes, there is at least one match!
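
The effect of the raw-string prefix is easiest to see by comparing string lengths. The following is a minimal sketch; the shortened `text` value is only for illustration.

```python
import re

# In a normal string, "\b" is the backspace character: one character.
normal_len = len("\b")     # 1

# In a raw string, r"\b" is a backslash followed by "b": two characters.
# This is the form the regex engine needs to interpret it as a word boundary.
raw_len = len(r"\b")       # 2

text = "My name is Luis Morales"
with_raw = re.findall(r"\bMor", text)     # ['Mor']
without_raw = re.findall("\bMor", text)   # [] because the text contains no backspace

print(normal_len, raw_len, with_raw, without_raw)
```

Writing every pattern as a raw string is a simple habit that avoids this class of silent mismatch.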


In [6]:
#Return a match at every non-digit character:

x = re.findall(r"\D", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['M', 'y', ' ', 'n', 'a', 'm', 'e', ' ', 'i', 's', ' ', 'L', 'u', 'i', 's', ' ', 'M', 'o', 'r', 'a', 'l', 'e', 's', ',', ' ', 'w', 'h', 'a', 't', ' ', 'i', 's', ' ', 'y', 'o', 'u', 'r', 's', '?']
Yes, there is at least one match!


In [7]:
#Check if the string has any a, r, or n characters:

x = re.findall("[arn]", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['n', 'a', 'r', 'a', 'a', 'r']
Yes, there is at least one match!


In [8]:
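count = 0
with open('aipi.txt') as fhand: # Open file (closed automatically after the loop)
    for line in fhand: # For each line
        # NOTE: "\Duke" is read as \D (any non-digit) followed by "uke",
        # so it matches "Duke" or "duke" anywhere in the line.
        # To match only lines that start with "Duke", use r"^Duke" instead.
        x = re.findall(r"\Duke", line)
        print(x)
        if x:
            count = count + 1 # Counts lines that contain at least one match
print("I found {} matches!".format(count))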

[]
['Duke']
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
['Duke']
[]
['Duke']
[]
[]
[]
['Duke', 'Duke']
[]
[]
[]
['Duke']
[]
['Duke']
[]
['Duke']
[]
['Duke']
[]
[]
[]
['Duke', 'Duke']
[]
['duke']
I found 10 matches!
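
The lowercase 'duke' in the output above is a hint that the pattern `\Duke` is not anchored to the start of the line: the regex engine reads it as `\D` (any non-digit) followed by `uke`. Here is a small sketch of the difference, using made-up lines rather than the contents of aipi.txt.

```python
import re

# Hypothetical lines for illustration (not the contents of aipi.txt)
lines = ["Duke University", "I study at Duke", "the duke of York", "Luke 4"]

loose_results = []
anchored_results = []
for line in lines:
    loose = re.findall(r"\Duke", line)     # \D (non-digit) followed by "uke", anywhere
    anchored = re.findall(r"^Duke", line)  # "Duke" only at the start of the line
    loose_results.append(loose)
    anchored_results.append(anchored)
    print(line, loose, anchored)

# loose_results    -> [['Duke'], ['Duke'], ['duke'], ['Luke']]
# anchored_results -> [['Duke'], [], [], []]
```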


In [9]:
# Find the run of digits (zero or more) at the beginning of the line
import re
line='123abc'
re.findall(r'^[0-9]*',line)

['123']
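
Because `*` also accepts zero repetitions, it can produce empty matches when the pattern is not anchored. The sketch below contrasts `*` with `+`; the second sample string is a made-up example.

```python
import re

line = "123abc"

# Anchored with ^: only the run of digits at the start of the line
leading = re.findall(r"^[0-9]*", line)
print(leading)            # ['123']

# Without the anchor, * also matches "zero digits" at later positions,
# so the result contains empty strings after '123'
zero_or_more = re.findall(r"[0-9]*", line)
print(zero_or_more)

# + requires at least one digit, so only real runs of digits are returned
runs = re.findall(r"[0-9]+", "a1b22c333")
print(runs)               # ['1', '22', '333']
```

When the goal is to collect all numbers in a string, `+` is usually the safer choice.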

In [10]:
# Match from the beginning of the line:
# one or more ('+') characters in 0-9 or a-z
# before @duke.edu (dots are escaped so they match a literal '.')
import re
line='lm338@duke.edu'
print(re.findall(r'^([0-9a-z]+)@duke\.edu',line))

['lm338']


In [11]:
# Match from the beginning of the line:
# one or more word characters "\w" (letters, digits, underscore)
# before a literal "*" followed by @duke.edu
line='lm338*@duke.edu Luis Morales'
print(re.findall(r'^(\w+)\*@duke\.edu',line))
print(re.findall(r'^([0-9a-z]+)\*@duke\.edu',line))

['lm338']
['lm338']


In [12]:
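# Match from the beginning of the line:
# one or more word characters
# before @duke.edu or @dukekunshan.edu.cn
# (?:a|b) groups the alternatives without capturing them;
# square brackets would match only a single character from the set
line='lm338@duke.edu'
print(re.findall(r'^(\w+)@(?:duke\.edu|dukekunshan\.edu\.cn)',line))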

['lm338']


In [14]:
# Match from the beginning of the line:
# one or more word characters or periods
# before @duke.edu or @yahoo.com
#line='morales.luis@dukekunshan.edu.cn'
line='wiki_morales@yahoo.com'
print(re.findall(r'^([\w.]+)@(?:duke\.edu|yahoo\.com)',line))

['wiki_morales']
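
Square brackets define a set of single characters, so a pattern such as `[duke.edu|yahoo.com]` matches one character from that set rather than one of two domain names. Choosing between whole strings needs a group with `|`. A minimal sketch of the difference, using a made-up address from a domain that should not match:

```python
import re

# Hypothetical address whose domain is neither duke.edu nor yahoo.com
other = "wiki_morales@example.org"
yahoo = "wiki_morales@yahoo.com"

# Character class: any ONE of the characters d, u, k, e, ., |, y, a, h, o, c, m
# after the @ is enough, so "example.org" is accepted because it starts with "e"
loose = re.findall(r"^([\w.]+)@[duke.edu|yahoo.com]", other)
print(loose)       # ['wiki_morales']  (a false positive)

# Alternation inside a non-capturing group: the whole domain must match
strict_pattern = r"^([\w.]+)@(?:duke\.edu|yahoo\.com)$"
strict_other = re.findall(strict_pattern, other)
strict_yahoo = re.findall(strict_pattern, yahoo)
print(strict_other)   # []
print(strict_yahoo)   # ['wiki_morales']
```

The `(?:...)` form groups the alternatives without adding a second captured value to the `findall` result.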


In [15]:
# Find one or more non-whitespace characters 
# before @ 
line='wiki_morales@yahoo.com'
#print(re.findall(r'(\S+)@\S+',line))
print(re.findall(r'(\S+)@',line))

['wiki_morales']
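
The same pattern also works on text that contains several addresses, and adding a second pair of parentheses changes the shape of the result: with more than one group, `findall` returns a list of tuples. A brief sketch, assuming a made-up sentence:

```python
import re

# Hypothetical sentence containing two addresses
text = "Contact lm338@duke.edu or wiki_morales@yahoo.com for details"

# One group: a list of strings (the user names)
users = re.findall(r"(\S+)@", text)
print(users)    # ['lm338', 'wiki_morales']

# Two groups: a list of (user, domain) tuples
pairs = re.findall(r"(\S+)@(\S+)", text)
print(pairs)    # [('lm338', 'duke.edu'), ('wiki_morales', 'yahoo.com')]
```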
