# Regular Expressions (RegEx)

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern.

![regex](https://miro.medium.com/max/1200/1*ZVlIZ1ZYC6rASz-dYPzhZQ.jpeg)

**references**

- https://docs.python.org/3/howto/regex.html
- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
- https://www.dataquest.io/blog/regular-expressions-data-scientists/
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

**may save your life**

- https://regex101.com/

### First things first

For most cases the standard library module is enough: **import re**. If you need the third-party `regex` module instead, **pip3/pip install regex** should install it.

In [1]:
import re
import numpy as np
import pandas as pd

## Syntax
### Special Characters:
- `.` Matches any character except a newline.
- `^` Matches the start of the string.
- `$` Matches the end of the string or just before the newline at the end of the string.
- `*` Matches 0 or more repetitions of the preceding RE.
- `+` Matches 1 or more repetitions of the preceding RE.
- `?` Matches 0 or 1 repetitions of the preceding RE.
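- `(?<=...)` Positive lookbehind: matches only if the current position is preceded by `...` --> https://www.regular-expressions.info/lookaround.html
- NOTE: `re.M` enables multiline mode, in which `^` and `$` also match at the start and end of each line.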
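
A minimal sketch of the special characters above, run against a made-up two-line string. The variable names are illustrative only.

```python
import re

text = "cat\ncaat"  # made-up two-line sample

dot = re.findall(r"c.t", text)            # '.' is exactly one non-newline character
star = re.findall(r"ca*t", text)          # zero or more 'a'
plus = re.findall(r"ca+t", text)          # one or more 'a'
optional = re.findall(r"ca?t", text)      # zero or one 'a'
behind = re.findall(r"(?<=c)a+", text)    # 'a's preceded by 'c'; the 'c' is not part of the match

start_default = re.findall(r"^c", text)          # '^' only at the start of the string
start_multiline = re.findall(r"^c", text, re.M)  # with re.M, also after each newline

print(dot, star, plus, optional, behind, start_default, start_multiline)
```

Comparing `start_default` with `start_multiline` shows what `re.M` changes: the pattern is the same, only the positions where `^` may match differ.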

### Special Sequences:

- **Literals** `a` 
- **Alternation** `a|b`
- **Character sets** `[ab]`, `[^ab]` <- a caret right after the opening bracket negates the set (match anything except `a` or `b`)
- **Wildcards** `.`
- **Escape special characters** `\` (?,*,+,^,$)
- **Ranges** `[a-d]`, `[1-9]`, `[A-D]`

- **Quantifiers** `{2}`, `{2,}`, `{2,4}`, `?`, `*`, `+`
- **Grouping** `()`
- **Anchors** `^`, `$`
- **Character classes** `\w`, `\d`, `\s`, `\n`, `\W`, `\D`, `\S`
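
The constructs in this list can be tried out one at a time. The following is a small sketch on made-up strings, not an exhaustive tour:

```python
import re

# Alternation: either side of the pipe
alternation = re.findall(r"cat|dog", "cat dog bird")

# Character set and its negation
in_set = re.findall(r"[ab]", "abc")
not_in_set = re.findall(r"[^ab]", "abc")

# Range inside a set
in_range = re.findall(r"[a-d]", "a1e2d")

# Quantifier: between 2 and 4 digits (greedy, so it takes as many as allowed)
quantified = re.findall(r"\d{2,4}", "1 22 333 55555")

# Escaping: a literal dollar sign followed by digits
escaped = re.findall(r"\$\d+", "cost: $42")

# Grouping: capture the parts on each side of the '@'
groups = re.search(r"(\w+)@(\w+)", "user@example").groups()

print(alternation, in_set, not_in_set, in_range, quantified, escaped, groups)
```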

**\w** - Matches any alphanumeric character (letters and digits) plus the underscore `_`. Equivalent to `[a-zA-Z0-9_]` for ASCII text.

**\d** - Matches any digit. Equivalent to `[0-9]` 

**\s** - Matches any whitespace character. Equivalent to `[ \t\n\r\f\v]`

**\W** - Matches any non-alphanumeric character. Equivalent to `[^a-zA-Z0-9_]`

**\D** - Matches any non digit. Equivalent to `[^0-9]` 

**\S** - Matches any non-whitespace character. Equivalent to `[^ \t\n\r\f\v]`
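
A short sketch of these classes on a made-up string. One detail worth knowing: for `str` patterns in Python 3, `\w`, `\d` and `\s` are Unicode-aware by default, so the bracket equivalents listed here hold exactly only when the `re.ASCII` flag is set.

```python
import re

sample = "Room_42 opens\tat 9"  # made-up sample with a space, a tab and an underscore

words = re.findall(r"\w+", sample)    # runs of letters, digits and underscores
digits = re.findall(r"\d+", sample)   # runs of digits
spaces = re.findall(r"\s", sample)    # each whitespace character
non_word = re.findall(r"\W", sample)  # everything \w does not match

# Unicode vs. ASCII behaviour of \w
unicode_word = re.findall(r"\w+", "café")
ascii_word = re.findall(r"\w+", "café", re.ASCII)

print(words, digits, spaces, non_word, unicode_word, ascii_word)
```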



-----------------------------------------------------------------------------------------------------------


## Some practice 
Now it's your turn.

#### Simple validation of a username
https://www.codewars.com/kata/56a3f08aa9a6cc9b75000023
    
Write a simple regex to validate a username. Allowed characters are:

- lowercase letters,
- numbers,
- underscore

Length should be between 4 and 16 characters (both included).


In [2]:
# your solution

#### Regex validate PIN code 

https://www.codewars.com/kata/55f8a9c06c018a0d6e000132

ATM machines allow 4 or 6 digit PIN codes and PIN codes cannot contain anything but exactly 4 digits or exactly 6 digits.

If the function is passed a valid PIN string, return true, else return false.

Examples:
```python
"1234"   -->  True
"12345"  -->  False
"a234"   -->  False
```
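
Try it yourself first. If you get stuck, here is one possible approach, not necessarily the solution the kata intends. The function name `validate_pin` is made up for this sketch. It uses `re.fullmatch` rather than `^...$`, because `$` also matches just before a trailing newline, so `"1234\n"` would otherwise slip through.

```python
import re

def validate_pin(pin: str) -> bool:
    """Return True only for strings made of exactly 4 or exactly 6 ASCII digits."""
    # fullmatch requires the whole string to match one of the two alternatives
    return re.fullmatch(r"\d{4}|\d{6}", pin, flags=re.ASCII) is not None

print(validate_pin("1234"), validate_pin("12345"), validate_pin("a234"))

# The pitfall fullmatch avoids: '$' accepts a trailing newline
loose = re.match(r"^\d{4}$", "1234\n") is not None
strict = validate_pin("1234\n")
print(loose, strict)
```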

-----------------------------------------------------------------------------------------------------------

# The FBI challenge

- https://www.fbi.gov/scams-and-safety/common-fraud-schemes/nigerian-letter-or-419-fraud
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

It's your first day at the FBI office and your boss has just sent you a `txt` file: `emails.txt`. She has asked you to do some analysis, but first of all you need to build a dataframe like the following. You'll need some Python knowledge and some regex to get there.

In [47]:
df.head()

Unnamed: 0,sender_email,sender_name,date_sent,time_sent,subject
0,james_ngola2002@maktoob.com,"""MR. JAMES NGOLA.""",31 Oct 2002,02:38:20,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
1,bensul2004nng@spinfinder.com,"""Mr. Ben Suleman""",31 Oct 2002,05:10:00,URGENT ASSISTANCE /RELATIONSHIP (P)
2,obong_715@epatra.com,"""PRINCE OBONG ELEME""",31 Oct 2002,22:17:55,GOOD DAY TO YOU
3,obong_715@epatra.com,"""PRINCE OBONG ELEME""",31 Oct 2002,22:44:20,GOOD DAY TO YOU
4,m_abacha03@www.com,"""Maryam Abacha""",1 Nov 2002,01:45:04,I Need Your Assistance.


---------------------------------------------------------------------------------------------------------------------------

#### Since we are good people, here is a proposed solution

In [86]:
emails_info={}

In [87]:
# Read the whole file into one string; the with-block closes the file afterwards
with open("emails.txt", "r") as f:
    fh = f.read()

In [88]:
fh.count("From r")

3977

In [89]:
contents = re.split(r"From r", fh)

In [90]:
contents[:2]

['',
 '  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <james_ngola2002@maktoob.com>\nMessage-Id: <200210310241.g9V2fNm6028281@cs.CU>\nFrom: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\nReply-To: james_ngola2002@maktoob.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\nStatus: O\n\nFROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: (james_ngola2002@maktoob.com).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.\n\n\nTHE INCIDE

In [91]:
contents[0]

''

In [92]:
contents.pop(0)

''
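
The empty first element comes from the way splitting works: the text begins with the delimiter, so there is nothing in front of the first match. The following is a small sketch on a made-up two-message string (not the real file). The `^` anchored variant is an optional tightening, in case `From r` could also appear in the middle of a line.

```python
import re

# Made-up miniature of the mailbox file: every message starts with "From r"
raw = "From r  Wed Oct 30\nSubject: one\nFrom r  Thu Oct 31\nSubject: two\n"

parts = re.split(r"From r", raw)
print(len(parts), repr(parts[0]))  # one more piece than delimiters; the first is empty

# Dropping empty pieces gives one entry per message
messages = [p for p in parts if p]

# Stricter variant: only split where "From r" starts a line
anchored = [p for p in re.split(r"^From r", raw, flags=re.M) if p]
```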

- sender_email
- sender_name
- date_sent
- time_sent
- subject


### Info Sender

In [93]:
# Get Senders
print(contents[2])

re.findall(r'[\w\.]+@[\w\.]+', contents[0])

  Thu Oct 31 17:27:16 2002
Return-Path: <obong_715@epatra.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <obong_715@epatra.com>
Message-Id: <200210312227.g9VMQvDj017948@bluewhale.cs.CU>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
Reply-To: obong_715@epatra.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 22:17:55 +0100
Subject: GOOD DAY TO YOU
X-Mailer: Microsoft Outlook Express 5.00.2919.6900DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9VMRBW20642
Status: RO

FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF ELEME KINGDOM 
CHIEF DANIEL ELEME, PHD, EZE 1 OF ELEME.E-MAIL 
ADDRESS:obong_715@epatra.com  

ATTENTION:PRESIDENT,CEO Sir/ Madam. 

This letter might surprise you because we have met
neither in person nor by correspondence. But I believe
it is one day that you got to know somebody either in
physical or through correspondence. 

I got your contact through 

['james_ngola2002@maktoob.com',
 'james_ngola2002@maktoob.com',
 '200210310241.g9V2fNm6028281@cs.CU',
 'james_ngola2002@maktoob.com',
 'james_ngola2002@maktoob.com',
 'webmaster@aclweb.org',
 'james_ngola2002@maktoob.com']

In [94]:
ls_emails = [re.search(r'[\w\.]+@[\w\.]+', x).group() for x in contents]
emails_info['sender_email'] = ls_emails

In [95]:
pd.DataFrame(emails_info)

Unnamed: 0,sender_email
0,james_ngola2002@maktoob.com
1,bensul2004nng@spinfinder.com
2,obong_715@epatra.com
3,obong_715@epatra.com
4,m_abacha03@www.com
...,...
3972,michealagu0255@zipmail.com.br
3973,ali_sherif252@hotmail.fr
3974,drusmanibrahimtg08@hotmail.fr
3975,motherdorisk61@hotmail.com


In [96]:
re.search(r'(From: )(".*")( <)', contents[2]).groups()

('From: ', '"PRINCE OBONG ELEME"', ' <')

In [97]:
contents[2]

'  Thu Oct 31 17:27:16 2002\nReturn-Path: <obong_715@epatra.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <obong_715@epatra.com>\nMessage-Id: <200210312227.g9VMQvDj017948@bluewhale.cs.CU>\nFrom: "PRINCE OBONG ELEME" <obong_715@epatra.com>\nReply-To: obong_715@epatra.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 22:17:55 +0100\nSubject: GOOD DAY TO YOU\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9VMRBW20642\nStatus: RO\n\nFROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF ELEME KINGDOM \nCHIEF DANIEL ELEME, PHD, EZE 1 OF ELEME.E-MAIL \nADDRESS:obong_715@epatra.com  \n\nATTENTION:PRESIDENT,CEO Sir/ Madam. \n\nThis letter might surprise you because we have met\nneither in person nor by correspondence. But I believe\nit is one day that you got to know somebody either in\nphysical or through correspondence. \n

In [98]:
emails_info['sender_name'] = []
for x in contents:
    try:
        name = re.search(r'(From: )(".*")( <)', x).group(2)
    except AttributeError:  # re.search returned None: no quoted name found
        name = np.nan
    
    emails_info['sender_name'].append(name)
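
The same search-or-fallback pattern repeats for every column, so it can be wrapped in a small helper. The following is a sketch only; `first_group` and its parameters are hypothetical names, and `None` is used as the default fallback to keep the example free of dependencies (pass `np.nan` as `default` to mirror the loop above).

```python
import re

def first_group(pattern, text, group=0, default=None):
    """Return the requested group of the first match, or `default` when nothing matches."""
    match = re.search(pattern, text)
    return match.group(group) if match else default

with_name = 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>\n'
without_name = 'From: someone@example.com\n'  # made-up header without a quoted display name

found = first_group(r'From: (".*") <', with_name, group=1)
missing = first_group(r'From: (".*") <', without_name, group=1)
print(found, missing)
```

Testing the match object directly avoids the `try`/`except` and makes the "no match" case explicit.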

In [99]:
df = pd.DataFrame(emails_info)
df

Unnamed: 0,sender_email,sender_name
0,james_ngola2002@maktoob.com,"""MR. JAMES NGOLA."""
1,bensul2004nng@spinfinder.com,"""Mr. Ben Suleman"""
2,obong_715@epatra.com,"""PRINCE OBONG ELEME"""
3,obong_715@epatra.com,"""PRINCE OBONG ELEME"""
4,m_abacha03@www.com,"""Maryam Abacha"""
...,...,...
3972,michealagu0255@zipmail.com.br,
3973,ali_sherif252@hotmail.fr,
3974,drusmanibrahimtg08@hotmail.fr,
3975,motherdorisk61@hotmail.com,


### Info Dates

In [100]:
print(contents[0])

  Wed Oct 30 21:41:56 2002
Return-Path: <james_ngola2002@maktoob.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <james_ngola2002@maktoob.com>
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311
Status: O

FROM:MR. JAMES NGOLA.
CONFIDENTIAL TEL: 233-27-587908.
E-MAIL: (james_ngola2002@maktoob.com).

URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.


DEAR FRIEND,

I AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.


THE INCIDENT OCCURRED IN OUR PRESENCE WHILE WE

In [101]:
print(contents[9])

  Tue Nov  5 05:25:07 2002
Return-Path: <barrister_td@mailcity.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <barrister_td@mailcity.com>
From: "Tunde  Dosumu" <barrister_td@lycos.com>
Message-ID: <JOKFPGMHGGBECAAA@mailcity.com>
Mime-Version: 1.0
X-Sent-Mail: off
Reply-To: barrister_td@lycos.com
X-Expiredinmiddle: true
X-Mailer: MailCity Service
X-Priority: 3
Subject: Urgent Assistance
X-Sender-Ip: 208.14.9.28
Organization: Lycos Mail  (http://www.mail.lycos.com:80)
Content-Type: text/plain; charset=us-ascii
Content-Language: en
Content-Transfer-Encoding: 7bit
Status: RO

Dear Sir,

I am Barrister Tunde Dosumu (SAN) solicitor at law. I am the personal attorney to Mr. Eton Simon, a national of your country, who used to work with Shell Petroleum Development Company (SPDC)here in Nigeria. Here in after shall be referred to  me as my client.  On the 21st of April 2000, my client, his wife And  their three children were involved in a car accident along Sagbama Express Road. All occupants of the  

In [102]:
emails_info['date_sent'] = []
for x in contents:
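    try:
        # e.g. "31 Oct 2002"
        date = re.search(r'\d{1,2} [a-zA-Z]{3} \d{4}', x).group()
    except AttributeError:
        try:
            # fallback, e.g. "Tue Nov  5" (single-digit day padded with an extra space)
            date = re.search(r'([a-zA-Z]{3}\s){2}\s\d{1,2}', x).group()
        except AttributeError:
            date = np.nan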
    emails_info['date_sent'].append(date)

df = pd.DataFrame(emails_info)


In [103]:
df

Unnamed: 0,sender_email,sender_name,date_sent
0,james_ngola2002@maktoob.com,"""MR. JAMES NGOLA.""",31 Oct 2002
1,bensul2004nng@spinfinder.com,"""Mr. Ben Suleman""",31 Oct 2002
2,obong_715@epatra.com,"""PRINCE OBONG ELEME""",31 Oct 2002
3,obong_715@epatra.com,"""PRINCE OBONG ELEME""",31 Oct 2002
4,m_abacha03@www.com,"""Maryam Abacha""",1 Nov 2002
...,...,...,...
3972,michealagu0255@zipmail.com.br,,
3973,ali_sherif252@hotmail.fr,,17 Sep 2007
3974,drusmanibrahimtg08@hotmail.fr,,18 Sep 2007
3975,motherdorisk61@hotmail.com,,19 Sep 2007


In [104]:
df[df['date_sent'].isnull()]

Unnamed: 0,sender_email,sender_name,date_sent
21,findictment@yahoo.com,,
53,williamsdaniel1@phantomemail.com,,
54,obileonard@yahoo.com,,
79,chris_adamu2003@yahoo.com,,
87,b_kambili@yahoo.co.uk,,
...,...,...,...
3929,fhoorizadeh@natrol.com,,
3931,jackpot_inter2000@yahoo.fr,,
3940,onuoha5a2003@yahoo.co.jp,,
3943,yusufu019@yahoo.ca,,


In [105]:
emails_info['time_sent'] = []
for x in contents:
    try:
        # first HH:MM:SS timestamp found anywhere in the message
        time = re.search(r'(\d{2}:){2}\d{2}', x).group()
    except AttributeError:
        time = np.nan
    emails_info['time_sent'].append(time)

df = pd.DataFrame(emails_info)
df

Unnamed: 0,sender_email,sender_name,date_sent,time_sent
0,james_ngola2002@maktoob.com,"""MR. JAMES NGOLA.""",31 Oct 2002,21:41:56
1,bensul2004nng@spinfinder.com,"""Mr. Ben Suleman""",31 Oct 2002,08:11:39
2,obong_715@epatra.com,"""PRINCE OBONG ELEME""",31 Oct 2002,17:27:16
3,obong_715@epatra.com,"""PRINCE OBONG ELEME""",31 Oct 2002,17:53:56
4,m_abacha03@www.com,"""Maryam Abacha""",1 Nov 2002,04:48:39
...,...,...,...,...
3972,michealagu0255@zipmail.com.br,,,03:36:40
3973,ali_sherif252@hotmail.fr,,17 Sep 2007,18:28:28
3974,drusmanibrahimtg08@hotmail.fr,,18 Sep 2007,06:55:05
3975,motherdorisk61@hotmail.com,,19 Sep 2007,19:52:38


In [106]:
df.isnull().sum()

sender_email       0
sender_name     1653
date_sent        316
time_sent          1
dtype: int64
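
The expected table at the start of the challenge lists `02:38:20` for the first message, which is the value in its `Date:` header. An unanchored time pattern instead returns the first timestamp in the text (`21:41:56`, from the envelope line). The sketch below targets the `Date:` header with named groups. It assumes the header looks like `Date: Thu, 31 Oct 2002 02:38:20 +0000`; messages without such a header simply yield `None`.

```python
import re

# Shortened copy of the first message shown in the output above
message = (
    "  Wed Oct 30 21:41:56 2002\n"
    "Return-Path: <james_ngola2002@maktoob.com>\n"
    "Date: Thu, 31 Oct 2002 02:38:20 +0000\n"
    "Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\n"
)

# Unanchored search: picks up the envelope timestamp
first_time = re.search(r"(\d{2}:){2}\d{2}", message).group()

# Anchored to the Date header; (?P<name>...) gives each part a name
date_header = re.compile(
    r"^Date: \w{3}, (?P<date>\d{1,2} \w{3} \d{4}) (?P<time>\d{2}:\d{2}:\d{2})",
    re.M,
)
match = date_header.search(message)
date_sent = match.group("date") if match else None
time_sent = match.group("time") if match else None

print(first_time, date_sent, time_sent)
```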

### Subject

In [107]:
# Get Subjects
re.findall(r'(?<=Subject: ).*(?=\n)', contents[0])
re.findall(r'(Subject: )(.*)(\n)', contents[0])
re.search(r'(Subject: )(.*)(\n)', contents[0]).group(2)

'URGENT BUSINESS ASSISTANCE AND PARTNERSHIP'
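
The three calls in this cell reach the same text in different ways, and only the last expression is displayed. The following sketch, on a shortened made-up header block, shows what each style returns:

```python
import re

header = "Subject: GOOD DAY TO YOU\nX-Mailer: Microsoft Outlook Express\n"

# Lookbehind/lookahead: the surrounding text is checked but not returned
with_lookaround = re.findall(r"(?<=Subject: ).*(?=\n)", header)

# Three capturing groups: findall returns one tuple per match
with_groups = re.findall(r"(Subject: )(.*)(\n)", header)

# A single capturing group is enough when only the subject is needed
single_group = re.search(r"Subject: (.*)", header).group(1)

print(with_lookaround, with_groups, single_group)
```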

In [108]:
emails_info['subject'] = []

for x in contents:
    try:
        sub = re.search(r'(Subject: )(.*)(\n)', x).group(2)
    except AttributeError:  # no Subject header in this message
        sub = np.nan

    emails_info['subject'].append(sub)

df = pd.DataFrame(emails_info)
df


Unnamed: 0,sender_email,sender_name,date_sent,time_sent,subject
0,james_ngola2002@maktoob.com,"""MR. JAMES NGOLA.""",31 Oct 2002,21:41:56,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
1,bensul2004nng@spinfinder.com,"""Mr. Ben Suleman""",31 Oct 2002,08:11:39,URGENT ASSISTANCE /RELATIONSHIP (P)
2,obong_715@epatra.com,"""PRINCE OBONG ELEME""",31 Oct 2002,17:27:16,GOOD DAY TO YOU
3,obong_715@epatra.com,"""PRINCE OBONG ELEME""",31 Oct 2002,17:53:56,GOOD DAY TO YOU
4,m_abacha03@www.com,"""Maryam Abacha""",1 Nov 2002,04:48:39,I Need Your Assistance.
...,...,...,...,...,...
3972,michealagu0255@zipmail.com.br,,,03:36:40,=?iso-8859-1?Q?CONTACT=20GLOBAL=20MAX=20SHIPIN...
3973,ali_sherif252@hotmail.fr,,17 Sep 2007,18:28:28,TREAT AS URGENT.
3974,drusmanibrahimtg08@hotmail.fr,,18 Sep 2007,06:55:05,From Dr Usman Ibrahim / Mr Wahid Yoffe property.
3975,motherdorisk61@hotmail.com,,19 Sep 2007,19:52:38,My Beloved In Christ.


In [109]:
print(contents[3972])

  Mon Sep 17 03:36:40 2007
Return-Path: <michealagu0255@zipmail.com.br>
X-Sieve: CMU Sieve 2.3
From: michealagu0255@zipmail.com.br
Subject: =?iso-8859-1?Q?CONTACT=20GLOBAL=20MAX=20SHIPING=20COMPANY?=
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
To: undisclosed-recipients:;
X-SIG5: 4335510e40341724d00e160f927a3b4d
X-Spam-Score: 6.994 (******) ADVANCE_FEE_2 CU_SUBJ_ALL_CAPS URG_BIZ
X-Scanned-By: MIMEDefang 2.48 on 128.59.28.169
Status: RO

Atten: My Dear ,
 
I have Paid the fee for your Cheque Draft.Because the manager of EcoBank
Benin told me that before the check will get to you that it willexpire.
So i told him to cash $850,000.00  however all the necessary arrangement
of delivering the $850,000.00 in cash was made with  Global Max  shiping
Courier Company. This is the information they need to delivery your packa=
ge
to you The only money you have to send to them is there security keeping
fee which is $95.00 Us Dollars to
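
The subject of this message is stored in MIME encoded-word form (`=?charset?Q?...?=`), which is why it looks garbled in the table. Decoding it is a job for the standard library's `email.header` module rather than for a regex. The following is a minimal sketch:

```python
from email.header import decode_header, make_header

raw_subject = "=?iso-8859-1?Q?CONTACT=20GLOBAL=20MAX=20SHIPING=20COMPANY?="

# decode_header splits the value into (bytes, charset) parts;
# make_header joins them back into a readable header
decoded = str(make_header(decode_header(raw_subject)))
print(decoded)

# Plain subjects pass through unchanged
plain = str(make_header(decode_header("GOOD DAY TO YOU")))
```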

### Creating DataFrame

In [111]:
# Create DataFrame
df

Unnamed: 0,sender_email,sender_name,date_sent,time_sent,subject
0,james_ngola2002@maktoob.com,"""MR. JAMES NGOLA.""",31 Oct 2002,21:41:56,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
1,bensul2004nng@spinfinder.com,"""Mr. Ben Suleman""",31 Oct 2002,08:11:39,URGENT ASSISTANCE /RELATIONSHIP (P)
2,obong_715@epatra.com,"""PRINCE OBONG ELEME""",31 Oct 2002,17:27:16,GOOD DAY TO YOU
3,obong_715@epatra.com,"""PRINCE OBONG ELEME""",31 Oct 2002,17:53:56,GOOD DAY TO YOU
4,m_abacha03@www.com,"""Maryam Abacha""",1 Nov 2002,04:48:39,I Need Your Assistance.
...,...,...,...,...,...
3972,michealagu0255@zipmail.com.br,,,03:36:40,=?iso-8859-1?Q?CONTACT=20GLOBAL=20MAX=20SHIPIN...
3973,ali_sherif252@hotmail.fr,,17 Sep 2007,18:28:28,TREAT AS URGENT.
3974,drusmanibrahimtg08@hotmail.fr,,18 Sep 2007,06:55:05,From Dr Usman Ibrahim / Mr Wahid Yoffe property.
3975,motherdorisk61@hotmail.com,,19 Sep 2007,19:52:38,My Beloved In Christ.


### Now you can start your analysis!