# Regular Expressions (RegEx)

In [None]:
#https://i.redd.it/nac35ntlfg831.jpg

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern.

- https://docs.python.org/3/howto/regex.html
- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
- https://www.dataquest.io/blog/regular-expressions-data-scientists/
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

### First things first

For the standard case, **import re** should be enough, since `re` ships with Python. For the extended third-party `regex` module, **pip3 install regex** should install it.

In [1]:
import re

### Syntax
-> Special Characters:
- `.` Matches any character except a newline.
- `^` Matches the start of the string.
- `$` Matches the end of the string or just before the newline at the end of the string.
- `*` Matches 0 or more repetitions of the preceding RE.
- `+` Matches 1 or more repetitions of the preceding RE.
- `?` Matches 0 or 1 repetitions of the preceding RE.

https://docs.python.org/3/library/re.html#re-syntax
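
As a rough illustration of the special characters listed above, the sketch below runs each one against short sample strings. The strings and variable names are invented for this example and do not come from the notebook's data.

```python
import re

# Illustrative sample (three "lines" separated by newlines)
sample = "cat\ncot\ncoat"

# '.' matches any single character except a newline
dot = re.findall(r"c.t", sample)            # ['cat', 'cot']

# '^' anchors at the start of the string, '$' at the end
starts = re.search(r"^cat", sample)         # match object for 'cat'
ends = re.search(r"coat$", sample)          # match object for 'coat'

# '*' = 0 or more, '+' = 1 or more, '?' = 0 or 1 of the preceding item
star = re.findall(r"co*t", "ct cot coot")   # ['ct', 'cot', 'coot']
plus = re.findall(r"co+t", "ct cot coot")   # ['cot', 'coot']
opt = re.findall(r"co?t", "ct cot coot")    # ['ct', 'cot']

print(dot, star, plus, opt)
```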

-> Syntax:
- **Literals** `a` 
- **Alternation** `a|b`
- **Character sets** `[ab]`, `[^ab]`
- **Wildcards** `.`
- **Escape special characters** `\` (`?`, `*`, `+`, `^`, `$`)
- **Ranges** `[a-d]`, `[1-9]`
- **Character classes** `\w`, `\d`, `\s`, `\W`, `\D`, `\S` (plus escapes such as `\n` for a newline)
- **Quantifiers** `{2}`, `{2,}`, `{2,4}`, `?`, `*`, `+`
- **Grouping** `()`
- **Anchors** `^`, `$`
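
The syntax elements above can be tried out one at a time. This is a minimal sketch on an invented sample sentence; the variable names are only for illustration.

```python
import re

text = "Order 66: 3 apples, 12 pears and 250 plums"   # invented sample

alternation = re.findall(r"apples|pears", text)   # ['apples', 'pears']
negated_set = re.findall(r"[^\w\s]", text)        # [':', ','] (neither word char nor space)
ranges = re.findall(r"[0-9]+", text)              # ['66', '3', '12', '250']
classes = re.findall(r"\d+", text)                # same result using the \d class
quantified = re.findall(r"\b\d{2,3}\b", text)     # ['66', '12', '250'] (2 to 3 digits)
grouped = re.findall(r"(\d+) (\w+)", text)        # [('3', 'apples'), ('12', 'pears'), ('250', 'plums')]
escaped = re.findall(r"\?", "Really?")            # ['?'] (literal question mark)
anchored = re.search(r"^Order", text)             # matches only at the start

print(alternation, negated_set, ranges, quantified, grouped, escaped)
```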



### Methods

- **sub()**
Replaces one or many matches with a string
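
A minimal sketch of the "one or many" part: `re.sub` takes an optional `count` argument that caps the number of replacements, and the default of 0 replaces every match. The sample string mirrors the one used in this section.

```python
import re

txt = "fran, felipe, Marc, Clara and Blanca are TA's??"

# Default count=0: every 'f' is replaced
all_matches = re.sub(r"f", "F", txt)

# count=1: only the first 'f' is replaced
first_only = re.sub(r"f", "F", txt, count=1)

print(all_matches)   # Fran, Felipe, Marc, Clara and Blanca are TA's??
print(first_only)    # Fran, felipe, Marc, Clara and Blanca are TA's??
```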

In [120]:
txt = "fran, felipe, Marc, Clara and Blanca are TA's??"

In [121]:
#re.sub
#Literals
re.sub('f','F',txt)

"Fran, Felipe, Marc, Clara and Blanca are TA's??"

In [122]:
#Ranges
re.sub('[A-Z]','',txt)

"fran, felipe, arc, lara and lanca are 's??"

In [123]:
#Escape special character, quantifiers
re.sub(r'\?{2}','.',txt)

"fran, felipe, Marc, Clara and Blanca are TA's."

- **search()**
Scan through a string, looking for any location where this RE matches.
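
A short sketch of how `search()` behaves: it returns the first location that matches, or `None` when nothing does, and matching is case-sensitive unless a flag such as `re.IGNORECASE` is passed. The variable names are chosen for this example only.

```python
import re

txt = "The rain in Spain"

# Case-sensitive by default: there is no uppercase 'R' in txt
sensitive = re.search(r"R\w*", txt)

# With re.IGNORECASE the first 'r' (in 'rain') is found
insensitive = re.search(r"R\w*", txt, re.IGNORECASE)

print(sensitive)             # None
print(insensitive.group())   # rain
print(insensitive.span())    # (4, 8)
```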

In [124]:
#re.search
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt) 
x

<re.Match object; span=(0, 17), match='The rain in Spain'>

In [125]:
txt = "The rain in Spain"
#\b is a word boundary, so S must be the first letter of a word
x = re.search(r"\bS\w+", txt)
#span() returns a tuple with the start and end positions of the match
print(x.span())
#start() returns the start position of the match
print(x.start())
#end() returns the end position of the match
print(x.end())
#string is the string passed into the function (variable 'txt')
print(x.string)
#group() returns the part of the string where there was a match
print(x.group())

(12, 17)
12
17
The rain in Spain
Spain


In [116]:
print(re.search(r'r\w*', txt))
print(re.search(r'R\w*', txt))
print(re.search(r'^r\w*', txt))
print(re.search(r'^T\w*', txt))

<re.Match object; span=(4, 8), match='rain'>
None
None
<re.Match object; span=(0, 3), match='The'>


- **match()**
Determine if the RE matches at the beginning of the string.

In [95]:
#re.match
pattern = r"Cookie"
sequence = "Cookie"
if re.match(pattern, sequence):
    print("Match!")
else: 
    print("Not a match!")

Match!


In [156]:
txt = "The rain in Spain"
#matches at the beginning of the string
print(re.match(r'r\w*', txt))
print(re.match(r'^r\w*', txt))
print(re.match(r'^T\w*', txt))
print(re.match(r'T\w*', txt))

None
None
<re.Match object; span=(0, 3), match='The'>
<re.Match object; span=(0, 3), match='The'>


In [157]:
email_address = 'Please contact us at: support@datamad.com'
match = re.search(r'(\w+)@([\w\.]+)', email_address)
if match:
    print(match.group()) # The whole matched text
    print(match.group(1)) # The username (group 1)
    print(match.group(2)) # The host (group 2)

support@datamad.com
support
datamad.com


- **findall()**
Finds all substrings where the RE matches and returns them as a list.

In [158]:
#re.findall
email_address = "Please contact us at: support.data@data-mad.com, xyz@ironhack.com"

#'addresses' is a list that stores all the matches
addresses = re.findall(r'[\w\.]+@[\w\.-]+', email_address)
addresses

['support.data@data-mad.com', 'xyz@ironhack.com']

In [137]:
print(re.findall(r'[^aeiou\s]',email_address))
print(re.findall(r'\sc\w*',email_address))
print(re.findall(r'^P\w*',email_address))

['P', 'l', 's', 'c', 'n', 't', 'c', 't', 's', 't', ':', 's', 'p', 'p', 'r', 't', '.', 'd', 't', '@', 'd', 't', '-', 'm', 'd', '.', 'c', 'm', ',', 'x', 'y', 'z', '@', 'r', 'n', 'h', 'c', 'k', '.', 'c', 'm']
[' contact']
['Please']


- **split()**
Returns a list where the string has been split at each match

In [193]:
#re.split
prophet=['the{7}',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was{34}',
 'a',
 'dawn',
 'unto{6}der',
 'his']

In [200]:
def reference(x):
    return re.split(r"\{\d+\}",x)

In [201]:
prophet_reference=list(map(reference,prophet))
print(prophet_reference)

[['the', ''], ['chosen'], ['and'], ['the\nbeloved,'], ['who'], ['was', ''], ['a'], ['dawn'], ['unto', 'der'], ['his']]
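
The empty strings in the output above appear because `re.split` keeps whatever lies on each side of a match, even when that is nothing, as with a marker at the end of a word. A small sketch of this, with a hypothetical `drop_empty` helper that filters the empty pieces out:

```python
import re

def drop_empty(parts):
    """Hypothetical helper: remove empty strings left behind by re.split."""
    return [p for p in parts if p]

trailing = re.split(r"\{\d+\}", "the{7}")       # marker at the end
middle = re.split(r"\{\d+\}", "unto{6}der")     # marker in the middle

print(trailing)               # ['the', '']
print(drop_empty(trailing))   # ['the']
print(middle)                 # ['unto', 'der']
```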



-----------------------------------------------------------------------------------------------------------


In [5]:
with open("emails.txt", "r") as f:
    fh = f.read()

In [7]:
for line in re.findall("From:.*", fh):
    print(line)

From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
From: "Maryam Abacha" <m_abacha03@www.com>
From: Kuta David <davidkuta@postmark.net>
From: "Barrister tunde dosumu" <tunde_dosumu@lycos.com>
From: "William Drallo" <william2244drallo@maktoob.com>
From: "MR USMAN ABDUL" <abdul_817@rediffmail.com>
From: "Tunde  Dosumu" <barrister_td@lycos.com>
From: MR TEMI JOHNSON <temijohnson2@rediffmail.com>
From: "Dr.Sam jordan" <sjordan@diplomats.com>
From: p_brown2@lawyer.com
From: Barrister Peter Brown
From: mic_k1@post.com
From: "COL. MICHAEL BUNDU" <mikebunduu1@rediffmail.com>
From: "MRS MARIAM ABACHA" <elixwilliam@usa.com>
From: " DR. ANAYO AWKA " <anayoawka@hotmail.com>
From: " DR. ANAYO AWKA " <anayoawka@hotmail.com>
From: "Victor Aloma" <victorloma@netscape.net>
From: "Victor Aloma" <victorloma@netscape.net>
From: "JAMES NGOLA" <james_

In [10]:
match = re.findall("From:.*", fh)

for line in match:
    print(re.findall(r'[\w\.]+@[\w\.-]+', line))

['james_ngola2002@maktoob.com']
['bensul2004nng@spinfinder.com']
['obong_715@epatra.com']
['obong_715@epatra.com']
['m_abacha03@www.com']
['davidkuta@postmark.net']
['tunde_dosumu@lycos.com']
['william2244drallo@maktoob.com']
['abdul_817@rediffmail.com']
['barrister_td@lycos.com']
['temijohnson2@rediffmail.com']
['sjordan@diplomats.com']
['p_brown2@lawyer.com']
[]
['mic_k1@post.com']
['mikebunduu1@rediffmail.com']
['elixwilliam@usa.com']
['anayoawka@hotmail.com']
['anayoawka@hotmail.com']
['victorloma@netscape.net']
['victorloma@netscape.net']
['james_ngola2002@maktoob.com']
['martinchime@usa.com']
['mboro1555@post.com']
['martinchime@borad.com']
['martinchime@borad.com']
['edema_mb@phantomemail.com']
['edema_mb@phantomemail.com']
['adewilliams_ade@lawyer.com']
['smithkam2@post.com']
['sesm@omaninfo.com']
['obinaokoro@37.com']
['jamesalbert0@eircom.net']
['seminar@eecs.UM']
['adamuohiroma.adamuohiroma@caramail.com']
['fredobi3@omaninfo.com']
['mbengu@37.com']
['mmahoyi@caramail.com']
[

['koutab0078@hotmail.com']
['bullfrog812@charter.net']
['asinthe_nignan33@hotmail.fr']
['robertwills@pre.sltnet.lk']
['emekaben@aol.co.uk']
['edwardmoore99@yahoo.co.uk']
['gmboma2@hotmail.com']
['mrcolinshand20@poczta.pf.pl']
['richard_bongo112za@hotmail.com']
['richard_bongo112za@hotmail.com']
['koutab0025@hotmail.com']
['drahmed_sh14@hotmail.com']
['collins_lyons73@yahoo.co.uk']
['rahim.attah1@virgilio.it']
[]
['fred_williams05@o2.pl']
['amija_1035@hotmail.com']
['mamud_islandoil@yahoo.com.hk']
['mamud_islandoil@yahoo.com.hk']
['chris_anderson@universia.com.br']
['bencoroma12@hotmail.com']
['larisasosnkayapawou@mail2russia.com']
['kmashaba@myway.com']
['williamkabila05@gmail.com']
['joemelosi@charter.net']
['larisasosnkayanxly@mail2russia.com']
['james_roberts007@navegante.com.sv']
[]
['drpaulsule1950@o2.pl']
['johnali20@hotmail.com']
['alimoheamed1@mynet.com']
['info_unionbankofchain007@adelphia.net']
['patrickokonta10@virgilio.it']
[]
[]
['patrickokonta10@virgilio.it']
[]
[]
[]
['n

In [13]:
contents = re.split(r"From r", fh)
#pop(1) returns the first email and removes it from the list; contents[1] would only inspect it
contents.pop(1)

'  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <james_ngola2002@maktoob.com>\nMessage-Id: <200210310241.g9V2fNm6028281@cs.CU>\nFrom: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\nReply-To: james_ngola2002@maktoob.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\nStatus: O\n\nFROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: (james_ngola2002@maktoob.com).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.\n\n\nTHE INCIDENT OCC
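
Since `emails.txt` is not included here, the split can be tried on a tiny stand-in. The sample text below is invented and only imitates the `From r` separator seen in the corpus. Note that the first chunk is an empty string when the text starts with the separator.

```python
import re

# Invented stand-in for emails.txt, using the same "From r" separator
sample = (
    "From r  Wed Oct 30 21:41:56 2002\n"
    "From: a@example.com\n"
    "Subject: first\n"
    "From r  Thu Oct 31 08:11:39 2002\n"
    "From: b@example.com\n"
    "Subject: second\n"
)

chunks = re.split(r"From r", sample)

# chunks[0] is '' because the text begins with the separator
emails = [c for c in chunks if c.strip()]

print(len(chunks))   # 3
print(len(emails))   # 2
```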

In [None]:
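date_field = re.search(r"Date:.*", fh)  #assumed definition: the first "Date:" header line in the file
date = re.search(r"\d+\s\w+\s\d+", date_field.group())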
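
A self-contained sketch of the same extraction, assuming `date_field` holds the match for a `Date:` header line. The header used here is the one visible in the email printed earlier; the guard against `None` is an addition for safety.

```python
import re

# Header line taken from the first email shown above
header = "Date: Thu, 31 Oct 2002 02:38:20 +0000"

# Assumption: date_field is the match for the whole "Date:" line
date_field = re.search(r"Date:.*", header)

if date_field is not None:
    # digits, whitespace, word, whitespace, digits -> day, month, year
    date = re.search(r"\d+\s\w+\s\d+", date_field.group())
    print(date.group())   # 31 Oct 2002
```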