----------------------------------------------------------------------------------------------------------

# Regular Expressions (RegEx)

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern.

**references**

- https://docs.python.org/3/howto/regex.html
- https://docs.python.org/3/library/re.html

- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial

- https://www.dataquest.io/blog/regular-expressions-data-scientists/
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

**may save your life**

- https://regex101.com/

### First things first

For the standard library module, **import re** is enough. If the third-party **regex** module is needed later, **pip3 install regex** should install it.

In [1]:
import re
import pandas as pd

### Syntax
-> Special Characters:
- `.` Matches any character except a newline.
- `^` Matches the start of the string.
- `$` Matches the end of the string or just before the newline at the end of the string.
- `*` Matches 0 or more repetitions of the preceding RE.
- `+` Matches 1 or more repetitions of the preceding RE.
- `?` Matches 0 or 1 repetitions of the preceding RE.
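A minimal sketch of these special characters in action. The sample strings are made up for illustration and are not part of the notebook's data.

```python
import re

# '.' matches any character except a newline
dot_same_line = re.search(r"a.c", "abc")
dot_newline = re.search(r"a.c", "a\nc")       # None: '.' does not match '\n'

# '$' matches at the very end, or just before a trailing newline
end_plain = re.search(r"end$", "the end")
end_newline = re.search(r"end$", "the end\n")

# '*', '+' and '?' control how many repetitions are allowed
star = re.findall(r"ab*", "a ab abb")         # zero or more 'b'
plus = re.findall(r"ab+", "a ab abb")         # one or more 'b'
optional = re.findall(r"ab?", "a ab abb")     # zero or one 'b'

print(dot_same_line, dot_newline)
print(end_plain, end_newline)
print(star, plus, optional)
```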


-> Syntax:
- **Literals** `a` 
- **Alternation** `a|b`
- **Character sets** `[ab]`, `[^ab]`
- **Wildcards** `.`
- **Escape special characters** `\` (?,*,+,^,$)
- **Ranges** `[a-d]`, `[1-9]`
- **Character classes** `\w`, `\d`, `\s`, `\W`, `\D`, `\S` (and escapes such as `\n` for a newline)
- **Quantifiers** `{2}`, `{2,}`, `{2,4}`, `?`, `*`, `+`
- **Grouping** `()`
- **Anchors** `^`, `$`
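The syntax elements above can be tried out one by one. This is only a sketch: the `sample` string and the variable names are invented for the example.

```python
import re

sample = "cat bat rat 2021-07-15 id42"

alternation = re.findall(r"cat|rat", sample)      # either literal
char_set = re.findall(r"[cb]at", sample)          # 'c' or 'b', then 'at'
negated_set = re.findall(r"[^cb ]at", sample)     # anything but 'c', 'b' or space, then 'at'
ranges = re.findall(r"[a-d]", "abcdef")           # characters a through d
classes = re.findall(r"\d+", sample)              # runs of digits
quantified = re.findall(r"\d{4}", sample)         # exactly four digits
groups = re.search(r"(\d{4})-(\d{2})-(\d{2})", sample).groups()
escaped = re.findall(r"\?", "why? how?")          # a literal question mark

print(alternation, char_set, negated_set)
print(ranges, classes, quantified)
print(groups, escaped)
```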



e.g. The expression `(?:a{6})*` matches any run of 'a' characters whose length is a multiple of six (including the empty string).
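A small check of that expression, assuming we want the whole string to match (hence `re.fullmatch`). The `(?:...)` part is a non-capturing group; the name `six_a` and the tested lengths are chosen just for this sketch.

```python
import re

six_a = re.compile(r"(?:a{6})*")

# Try strings of 'a' with different lengths
results = {n: six_a.fullmatch("a" * n) is not None for n in (0, 5, 6, 12, 13)}
print(results)
```

Lengths 0, 6 and 12 match; 5 and 13 do not.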

### Methods

- **sub()**
Replaces one or many matches with a string

In [8]:
txt = "fran, felipe, Marc?, Clara and Blanca are TA's??"

In [3]:
#re.sub
#Literals
re.sub('f','F',txt)

"Fran, Felipe, Marc?, Clara and Blanca are TA's??"

In [4]:
#Ranges
re.sub('[A-Z]','',txt)

"fran, felipe, arc?, lara and lanca are 's??"

In [10]:
#Escape special character, quantifiers (raw string so that \? reaches the regex engine untouched)
re.sub(r'\?{2}','.',txt)

"fran, felipe, Marc?, Clara and Blanca are TA's."
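Two more things `re.sub` can do, sketched on the same `txt`: limit the number of replacements with `count`, and compute the replacement with a function. The variable names are made up for this example.

```python
import re

txt = "fran, felipe, Marc?, Clara and Blanca are TA's??"

# count=1 replaces only the first match
first_only = re.sub(r"f", "F", txt, count=1)

# The replacement can be a function that receives each match object
capitalised = re.sub(r"\b[fF]\w+", lambda m: m.group().capitalize(), txt)

print(first_only)
print(capitalised)
```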

- **search()**
Scan through a string, looking for any location where this RE matches.

In [13]:
#re.search
txt = "The rain in Spain"
x = re.search("The.*Spain", txt) 
x

<re.Match object; span=(0, 17), match='The rain in Spain'>

In [14]:
txt = "The rain in Spain"
#\b matches a word boundary, so the S must start a word
x = re.search(r"\bS\w+", txt)
print(x)
print(x.span())    #tuple with the start and end positions of the match
print(x.start())   #start position of the match
print(x.end())     #end position of the match
print(x.string)    #the string passed into the function (variable 'txt')
print(x.group())   #the part of the string where there was a match

<re.Match object; span=(12, 17), match='Spain'>
(12, 17)
12
17
The rain in Spain
Spain


In [15]:
print(re.search(r'r\w*', txt))
print(re.search(r'R\w*', txt))
print(re.search(r'^r\w*', txt))
print(re.search(r'^T\w*', txt))

<re.Match object; span=(4, 8), match='rain'>
None
None
<re.Match object; span=(0, 3), match='The'>


- **match()**
Determine if the RE matches at the beginning of the string.

In [19]:
#re.match
pattern = r"I"
sequence = "I love you,  Honey"
if re.match(pattern, sequence):
    print("Match!")
    print(re.match(pattern, sequence))
else: 
    print("Not a match!")

Match!
<re.Match object; span=(0, 1), match='I'>


In [20]:
txt = "The rain in Spain"
#matches at the beginning of the string
print(re.match(r'r\w*', txt))
print(re.match(r'^r\w*', txt))
print(re.match(r'^T\w*', txt))
print(re.match(r'T\w*', txt))

None
None
<re.Match object; span=(0, 3), match='The'>
<re.Match object; span=(0, 3), match='The'>
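To make the difference between the functions explicit, here is a short side-by-side sketch. `re.fullmatch` is not used elsewhere in this notebook; it is added here only for contrast.

```python
import re

txt = "The rain in Spain"

at_start = re.match(r"rain", txt)          # None: 'rain' is not at position 0
anywhere = re.search(r"rain", txt)         # found further into the string
whole = re.fullmatch(r"The.*Spain", txt)   # pattern must cover the whole string
partial = re.fullmatch(r"The", txt)        # None: the rest of the string is left over

print(at_start, anywhere, whole, partial, sep="\n")
```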


In [21]:
email_address = 'Please contact us at: support@datamad.com'
match = re.search(r'(\w+)@([\w\.]+)', email_address)
if match:
    print(match.group()) # The whole matched text
    print(match.group(1)) # The username (group 1)
    print(match.group(2)) # The host (group 2)

support@datamad.com
support
datamad.com
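Groups can also be given names, which reads better than numeric indexes. A sketch of the same example with named groups; the names `user` and `host` are chosen here and are not from the original cell.

```python
import re

email_address = 'Please contact us at: support@datamad.com'

# (?P<name>...) defines a named group
match = re.search(r'(?P<user>\w+)@(?P<host>[\w.]+)', email_address)
if match:
    print(match.group('user'))   # the username
    print(match.group('host'))   # the host
    print(match.groupdict())     # all named groups as a dict
```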


- **findall()**
Finds all substrings where the RE matches and returns them as a list.

In [24]:
#re.findall
email_address = "Please c contact us at: support.data@data-mad.com, xyz@ironhack.com"

#'addresses' is a list that stores all the matches
addresses = re.findall(r'[\w\.]+@[\w\.-]+', email_address)
addresses

['support.data@data-mad.com', 'xyz@ironhack.com']

In [26]:
#raw strings keep \s and \w from being treated as Python string escapes
print(re.findall(r'[^aeiou\s]',email_address))
print(re.findall(r'\sc\w*',email_address))
print(re.findall(r'^P\w*',email_address))

['P', 'l', 's', 'c', 'c', 'n', 't', 'c', 't', 's', 't', ':', 's', 'p', 'p', 'r', 't', '.', 'd', 't', '@', 'd', 't', '-', 'm', 'd', '.', 'c', 'm', ',', 'x', 'y', 'z', '@', 'r', 'n', 'h', 'c', 'k', '.', 'c', 'm']
[' c', ' contact']
['Please']
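One detail worth knowing: when the pattern contains groups, `findall` returns the groups instead of the whole match. A sketch on the same string, with `pairs` as a name invented for the example:

```python
import re

email_address = "Please c contact us at: support.data@data-mad.com, xyz@ironhack.com"

# With two groups, each result is a (user, host) tuple
pairs = re.findall(r'([\w.]+)@([\w.-]+)', email_address)
print(pairs)
```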


- **split()**
Returns a list where the string has been split at each match

In [27]:
#re.split
prophet=['the{7}',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was{34}',
 'a',
 'dawn',
 'unto{6}der',
 'his']

In [28]:
def reference(x):
    #split on markers such as {7}; braces are escaped so they are read literally
    return re.split(r"\{\d+\}",x)

In [29]:
prophet_reference=list(map(reference,prophet))
print(prophet_reference)

[['the', ''], ['chosen'], ['and'], ['the\nbeloved,'], ['who'], ['was', ''], ['a'], ['dawn'], ['unto', 'der'], ['his']]
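Notice the empty strings left behind when a marker sits at the end of a word. A possible way to drop them, and to keep the marker itself by using a capturing group, is sketched below. `split_clean` is a hypothetical helper name, not part of the original notebook.

```python
import re

def split_clean(x):
    # Split on markers such as {7} and discard empty pieces
    return [part for part in re.split(r"\{\d+\}", x) if part]

# A capturing group keeps the separator in the result
with_marker = re.split(r"(\{\d+\})", "unto{6}der")

print(split_clean("the{7}"))
print(split_clean("unto{6}der"))
print(with_marker)
```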


-----------------------------------------------------------------------------------------------------------


## Some Practice

https://www.fbi.gov/scams-and-safety/common-fraud-schemes/nigerian-letter-or-419-fraud

In [30]:
emails_info={}

In [31]:
#read the whole file into one string; the with block closes the file afterwards
with open("emails.txt", "r") as f:
    fh = f.read()
fh[0:100]

'From r  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nR'

In [32]:
contents = re.split(r"From r", fh)
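Because the file starts with the separator, the first element of the split is an empty string. A sketch on a tiny made-up sample (modelled on the first lines shown above, not the real file) that also anchors the separator to the start of a line:

```python
import re

# Hypothetical two-message sample in the same layout as emails.txt
sample = (
    "From r  Wed Oct 30 21:41:56 2002\nSubject: first\n"
    "From r  Fri Nov  1 04:48:39 2002\nSubject: second\n"
)

pieces = re.split(r"From r", sample)
messages = [p for p in pieces if p.strip()]    # drop the empty leading piece

# Anchoring with ^ and re.M avoids splitting on 'From r' inside a line
anchored = re.split(r"^From r", sample, flags=re.M)

print(len(pieces), len(messages))
```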

#### What we are looking for:

- sender_email
- sender_name
- date_sent
- time_sent
- subject

In [60]:
print(contents.pop(2))

  Fri Nov  1 04:48:39 2002
Return-Path: <m_abacha03@www.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <m_abacha03@www.com>
Message-Id: <200211010948.gA19mLu22932@perfectworld.mr.itd.UM>
From: "Maryam Abacha" <m_abacha03@www.com>
Reply-To: m_abacha03@www.com
To: R@M
Date: Fri, 1 Nov 2002 01:45:04 +0100
Subject: I Need Your Assistance.
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id gA19mVW29040
Status: RO

Dear sir, 
 
It is with a heart full of hope that I write to seek your help in respect of the context below. I am Mrs. Maryam Abacha the former first lady of the former Military Head of State of Nigeria General Sani Abacha whose sudden death occurred on 8th of June 1998 as a result of cardiac arrest (heart attack) while on the seat of power. 
I have no doubt about your capability and good-will to assist me in receiv

### Info Sender

Get the Sender's Email & Name

In [48]:
info_sender=re.findall(r"^From:.*", fh, re.M)
info_sender

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>',
 'From: "Maryam Abacha" <m_abacha03@www.com>',
 'From: Kuta David <davidkuta@postmark.net>',
 'From: "Barrister tunde dosumu" <tunde_dosumu@lycos.com>',
 'From: "William Drallo" <william2244drallo@maktoob.com>',
 'From: "MR USMAN ABDUL" <abdul_817@rediffmail.com>',
 'From: "Tunde  Dosumu" <barrister_td@lycos.com>',
 'From: MR TEMI JOHNSON <temijohnson2@rediffmail.com>',
 'From: "Dr.Sam jordan" <sjordan@diplomats.com>',
 'From: p_brown2@lawyer.com',
 'From: mic_k1@post.com',
 'From: "COL. MICHAEL BUNDU" <mikebunduu1@rediffmail.com>',
 'From: "MRS MARIAM ABACHA" <elixwilliam@usa.com>',
 'From: " DR. ANAYO AWKA " <anayoawka@hotmail.com>',
 'From: " DR. ANAYO AWKA " <anayoawka@hotmail.com>',
 'From: "Victor Aloma" <victorloma@netscape.net>',
 'From: "Victor Aloma" <vi

In [55]:
#sender_email
emails_info['sender_email']=[]
for line in info_sender:
    #\w already covers digits; a + inside [] is a literal plus sign
    res=re.findall(r"[\w+]+@\w+\.\w+", line)
    if res:
        #store the address itself (a string), not the one-element list
        emails_info['sender_email'].append(res[0])
    else:
        emails_info['sender_email'].append('')
        
emails_info['sender_email']

['james_ngola2002@maktoob.com',
 'bensul2004nng@spinfinder.com',
 'obong_715@epatra.com',
 'obong_715@epatra.com',
 'm_abacha03@www.com',
 'davidkuta@postmark.net',
 'tunde_dosumu@lycos.com',
 'william2244drallo@maktoob.com',
 'abdul_817@rediffmail.com',
 'barrister_td@lycos.com',
 'temijohnson2@rediffmail.com',
 'sjordan@diplomats.com',
 'p_brown2@lawyer.com',
 'mic_k1@post.com',
 'mikebunduu1@rediffmail.com',
 'elixwilliam@usa.com',
 'anayoawka@hotmail.com',
 'anayoawka@hotmail.com',
 'victorloma@netscape.net',
 'victorloma@netscape.net',
 'james_ngola2002@maktoob.com',
 'martinchime@usa.com',
 'mboro1555@post.com',
 'martinchime@borad.com',
 'martinchime@borad.com',
 'edema_mb@phantomemail.com',
 'edema_mb@phantomemail.com',
 'adewilliams_ade@lawyer.com',
 'smithkam2@post.com',
 'sesm@omaninfo.com',
 'obinaokoro@37.com',
 'jamesalbert0@eircom.net',
 'adamuohiroma@caramail.com',
 'fredobi3@omaninfo.com',
 'mbengu@37

In [58]:
#sender name
emails_info['sender_name']=[]
for line in info_sender:
    res=re.findall(r":.*<", line)
    if res:
        emails_info['sender_name'].append(res[0][1:-1])
    else:
        emails_info['sender_name'].append('')
emails_info['sender_name']

[' "MR. JAMES NGOLA." ',
 ' "Mr. Ben Suleman" ',
 ' "PRINCE OBONG ELEME" ',
 ' "PRINCE OBONG ELEME" ',
 ' "Maryam Abacha" ',
 ' Kuta David ',
 ' "Barrister tunde dosumu" ',
 ' "William Drallo" ',
 ' "MR USMAN ABDUL" ',
 ' "Tunde  Dosumu" ',
 ' MR TEMI JOHNSON ',
 ' "Dr.Sam jordan" ',
 '',
 '',
 ' "COL. MICHAEL BUNDU" ',
 ' "MRS MARIAM ABACHA" ',
 ' " DR. ANAYO AWKA " ',
 ' " DR. ANAYO AWKA " ',
 ' "Victor Aloma" ',
 ' "Victor Aloma" ',
 ' "JAMES NGOLA" ',
 ' "MARTIN CHIME" ',
 ' "Mr George Mboro" ',
 ' "MARTIN  CHIME" ',
 ' "MARTIN  CHIME" ',
 ' ',
 ' ',
 ' "ADE WILLIAMS" ',
 '',
 ' "MRS. M SESE-SEKO" ',
 ' "obina okoro" ',
 ' "DR. JAMES  ALBERT" ',
 ' adamuohiroma adamuohiroma ',
 ' "MR FRED OBI." ',
 ' "mbeki ngumeni" ',
 ' mahoyi mamudu ',
 ' "Mr. David Agu" ',
 ' "bell.idr bell.idr" ',
 ' "idris.bello idris.bello" ',
 '',
 ' "CLEMENT APUTE" ',
 ' "MR GODWIN IGBUNU" ',
 ' "bell.idr bell.idr" ',
 ' "Mr. David Agu" ',
 ' "deborah kabila" ',
 ' Khalifa Sese ',
 ' Khalifa Sese ',
 ' "MO
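The names above still carry the surrounding spaces and quotes. One way to get them clean, sketched on three of the header lines shown earlier, is to capture only the name with a group. `name_pattern` is a pattern written for this sketch, not the one used in the cell above.

```python
import re

lines = [
    'From: "Maryam Abacha" <m_abacha03@www.com>',
    'From: Kuta David <davidkuta@postmark.net>',
    'From: p_brown2@lawyer.com',
]

# Optional quotes around the name, then the opening '<' of the address
name_pattern = re.compile(r'^From:\s*"?([^"<]*?)"?\s*<')

names = []
for line in lines:
    m = name_pattern.search(line)
    # Lines without a '<...>' address have no separate name
    names.append(m.group(1).strip() if m else '')

print(names)
```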

### Info Dates

Get the Date & the Time

In [62]:
#DATES
dates=re.findall(r"Date:.*", fh)

dates

['Date: Thu, 31 Oct 2002 02:38:20 +0000',
 'Date: Thu, 31 Oct 2002 05:10:00',
 'Date: Thu, 31 Oct 2002 22:17:55 +0100',
 'Date: Thu, 31 Oct 2002 22:44:20',
 'Date: Fri, 1 Nov 2002 01:45:04 +0100',
 'Date: Sat, 02 Nov 2002 06:23:11 +0000',
 'Date: Sun, 3 Nov 2002 23:56:20 +0000',
 'Date: Mon, 04 Nov 2002 23:41:26',
 'Date: Tue, 6 Nov 2001 16:52:34 -0000',
 'Date: Fri, 08 Nov 2002 04:15:33',
 'Date: Fri, 8 Nov 2002 10:12:26 +0100',
 'Date: Mon, 11 Nov 2002 17:26:54 +0100',
 'Date: Tue, 13 Nov 2001 16:10:50 -0000',
 'Date: Thu, 14 Nov 2002 16:46:11 +0100',
 'Date: Fri, 15 Nov 2002 00:40:13',
 'Date: Fri, 15 Nov 2002 01:18:50',
 'Date: Sat, 16 Nov 2002 14:06:46',
 'Date: Sat, 16 Nov 2002 14:06:54',
 'Date: Sun, 17 Nov 2002 01:09:22 +0000',
 'Date: Wed, 20 Nov 2002 06:05:30 -0800',
 'Date: Wed, 20 Nov 2002 08:01:52',
 'Date: Wed, 20 Nov 2002 22:53:16 -0800',
 'Date: Wed, 20 Nov 2002 22:53:35 -0800',
 'Date: Fri, 22 Nov 2002 20:33:44',
 'Date: Sat, 23 Nov 2002 15:01:14 +0100',
 'Date: Mon, 2

In [65]:
#email date
emails_info['date_sent']=[]
for dat in dates:
    res=re.findall(r"\d+\s\w{3}\s\d{4}", dat)
    if res:
        emails_info['date_sent'].append(res[0])
    else:
        emails_info['date_sent'].append('')

emails_info['date_sent']

['31 Oct 2002',
 '31 Oct 2002',
 '31 Oct 2002',
 '31 Oct 2002',
 '1 Nov 2002',
 '02 Nov 2002',
 '3 Nov 2002',
 '04 Nov 2002',
 '6 Nov 2001',
 '08 Nov 2002',
 '8 Nov 2002',
 '11 Nov 2002',
 '13 Nov 2001',
 '14 Nov 2002',
 '15 Nov 2002',
 '15 Nov 2002',
 '16 Nov 2002',
 '16 Nov 2002',
 '17 Nov 2002',
 '20 Nov 2002',
 '20 Nov 2002',
 '20 Nov 2002',
 '20 Nov 2002',
 '22 Nov 2002',
 '23 Nov 2002',
 '25 Nov 2002',
 '25 Nov 2002',
 '25 Nov 2002',
 '26 Nov 2002',
 '26 Nov 2002',
 '27 Nov 2002',
 '28 Nov 2002',
 '30 Nov 2002',
 '03 Dec 2002',
 '04 Dec 2002',
 '05 Dec 2002',
 '6 Dec 2002',
 '5 Dec 2002',
 '6 Dec 2002',
 '05 Dec 2002',
 '09 Dec 2002',
 '9 Dec 2002',
 '11 Dec 2002',
 '11 Dec 2002',
 '13 Dec 2002',
 '12 Dec 2002',
 '17 Dec 2002',
 '24 Dec 2002',
 '28 Dec 2002',
 '1 Jan 2003',
 '4 Jan 2000',
 '1 Jan 1999',
 '2 Jan 1999',
 '14 Jan 2003',
 '16 Jan 2003',
 '16 Jan 2003',
 '16 Jan 2003',
 '16 Jan 2003',
 '16 Jan 2003',
 '15 Jan 2003',
 '16 Jan 2003',
 '17 Jan 2003',
 '17 Jan 2003',
 '17
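Once extracted, these strings can be turned into real date objects. A sketch using `datetime.strptime` on a few of the values above; it assumes English month abbreviations, which is what `%b` expects under the default locale.

```python
from datetime import datetime

date_strings = ['31 Oct 2002', '1 Nov 2002', '02 Nov 2002']

# %d accepts both '1' and '01'; %b is the abbreviated month name
parsed = [datetime.strptime(s, "%d %b %Y").date() for s in date_strings]
iso = [d.isoformat() for d in parsed]

print(iso)
```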

In [85]:
emails_info['time_sent']=[]
for dat in dates:
    res=re.findall(r"\d{2}:\d{2}.*", dat)
    if res:
        emails_info['time_sent'].append(res[0])
    else:
        emails_info['time_sent'].append('')

emails_info['time_sent']

['02:38:20 +0000',
 '05:10:00',
 '22:17:55 +0100',
 '22:44:20',
 '01:45:04 +0100',
 '06:23:11 +0000',
 '23:56:20 +0000',
 '23:41:26',
 '16:52:34 -0000',
 '04:15:33',
 '10:12:26 +0100',
 '17:26:54 +0100',
 '16:10:50 -0000',
 '16:46:11 +0100',
 '00:40:13',
 '01:18:50',
 '14:06:46',
 '14:06:54',
 '01:09:22 +0000',
 '06:05:30 -0800',
 '08:01:52',
 '22:53:16 -0800',
 '22:53:35 -0800',
 '20:33:44',
 '15:01:14 +0100',
 '16:04:42 +0000',
 '09:00:25 -0800',
 '22:20:56 -0500',
 '04:34:05 +0100',
 '19:53:58 GMT+1',
 '13:45:46 +0100',
 '08:34:54 -0800',
 '17:32:20 GMT+1',
 '19:47:42',
 '15:09:17 GMT+1',
 '14:19:23 GMT+1',
 '01:03:19 +0000',
 '19:03:38 -0800',
 '01:28:42 -0800',
 '13:32:50 GMT+1',
 '04:02:43',
 '05:10:49 -0800',
 '05:23:51 -0800 (PST)',
 '06:23:59 -0800 (PST)',
 '03:52:57 +0000',
 '08:40:42 -0500',
 '19:58:30 -0000',
 '16:52:18',
 '07:36:51',
 '08:50:53 -0500 (EST)',
 '09:42:53 -0800',
 '22:08:55 +0100',
 '01:43:13 +0100',
 '21:06:26 -0800',
 '02:14:07 +0800',
 '02:48:26 -0500',
 '
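The pattern above keeps the time zone glued to the time. If they are wanted separately, two groups can do it. A sketch on three of the `Date:` lines shown earlier; `time_pattern` and `time_and_zone` are names invented here.

```python
import re

dates = [
    'Date: Thu, 31 Oct 2002 02:38:20 +0000',
    'Date: Thu, 31 Oct 2002 05:10:00',
    'Date: Wed, 20 Nov 2002 06:05:30 -0800',
]

# Group 1: hh:mm:ss, group 2: whatever follows (may be empty)
time_pattern = re.compile(r"(\d{2}:\d{2}:\d{2})\s*(.*)")

time_and_zone = []
for dat in dates:
    m = time_pattern.search(dat)
    time_and_zone.append((m.group(1), m.group(2)) if m else ('', ''))

print(time_and_zone)
```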

### Subject

Get the Subject of the email

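In [None]:
#one 'Subject: ...' header line per email
subjects=re.findall(r"^Subject:.*", fh, re.M)
subjects

In [None]:
emails_info['subject']=[]
for sub in subjects:
    #everything from the first colon on; [2:] then drops the ': '
    res=re.findall(r":.*", sub)
    if res:
        emails_info['subject'].append(res[0][2:])
    else:
        emails_info['subject'].append('')

emails_info['subject']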