# Regular Expressions (RegEx)

In [60]:
#https://i.redd.it/nac35ntlfg831.jpg

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern.

**references**

- https://docs.python.org/3/howto/regex.html
- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
- https://www.dataquest.io/blog/regular-expressions-data-scientists/
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

**may save your life**

- https://regex101.com/

### First things first

For most cases the standard-library module is enough: **import re**. The third-party `regex` module is a separate package; **pip3 install regex** should install it.

In [61]:
import re
import pandas as pd

### Syntax
-> Special Characters:
- `.` Matches any character except a newline.
- `^` Matches the start of the string.
- `$` Matches the end of the string or just before the newline at the end of the string.
- `*` Matches 0 or more repetitions of the preceding RE.
- `+` Matches 1 or more repetitions of the preceding RE.
- `?` Matches 0 or 1 repetitions of the preceding RE.
- NOTE: `re.M` enables multiline mode, in which `^` and `$` also match at the start and end of each line.

https://docs.python.org/3/library/re.html#re-syntax
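
The special characters listed above can be tried on a couple of toy strings. This is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

text = "cat\ncot\ncoat"

# '.' matches any character except a newline
dot = re.findall(r"c.t", text)

# Without flags, '^' and '$' anchor to the whole string, so nothing matches here
whole = re.findall(r"^c\w+$", text)

# With re.M (re.MULTILINE), '^' and '$' also match at each line boundary
per_line = re.findall(r"^c\w+$", text, flags=re.M)

# '*', '+' and '?' control how many times the preceding 'o' may repeat
words = "ct cot coot"
star = re.findall(r"co*t", words)      # zero or more 'o'
plus = re.findall(r"co+t", words)      # one or more 'o'
optional = re.findall(r"co?t", words)  # zero or one 'o'

print(dot, whole, per_line)
print(star, plus, optional)
```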

--> Syntax:
- **Literals** `a` 
- **Alternation** `a|b`
- **Character sets** `[ab]`, `[^ab]`
- **Wildcards** `.`
- **Escape special characters** `\` (?,*,+,^,$)
- **Ranges** `[a-d]`, `[1-9]`
- **Character classes** `\w`, `\d`, `\s`, `\n`, `\W`, `\D`, `\S`
- **Quantifiers** `{2}`, `{2,}`, `{2,4}`, `?`, `*`, `+`
- **Grouping** `()`
- **Anchors** `^`, `$`
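
Each syntax element in the list above can be seen in isolation on a small string. The sketch below uses an invented sample string `s`; it is only meant to show one pattern per element.

```python
import re

s = "cat bat rat 2024-03-05 a1 b22 c333"

alternation = re.findall(r"cat|rat", s)          # either literal
char_set = re.findall(r"[cb]at", s)              # 'c' or 'b', then 'at'
negated_set = re.findall(r"[^cb]at", s)          # anything but 'c' or 'b', then 'at'
quantified = re.findall(r"[a-c]\d{2,}", s)       # a range plus "two or more digits"
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", s)  # three groups: year, month, day
escaped = re.findall(r"\?", "why? how?")         # a literal '?', not a quantifier
anchored = re.search(r"^cat", s) is not None and re.search(r"333$", s) is not None

print(alternation, char_set, negated_set)
print(quantified, date.groups(), escaped, anchored)
```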



### Methods

- **sub()**
Replaces one or many matches with a string

In [62]:
txt = "fran, felipe, Marc, Clara and Blanca are TA's??"

In [63]:
#re.sub
#Literals
re.sub('f','F',txt)

"Fran, Felipe, Marc, Clara and Blanca are TA's??"

In [64]:
#Ranges
re.sub('[A-Z]','',txt)

"fran, felipe, arc, lara and lanca are 's??"

In [65]:
#Escape special character, quantifiers (raw string so the backslash reaches the regex engine unchanged)
re.sub(r'\?{2}','.',txt)

"fran, felipe, Marc, Clara and Blanca are TA's."

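Beyond plain string replacements, `re.sub` also accepts a `count` argument, a function as the replacement, and backreferences such as `\1`. The following is a small sketch of those three options, reusing the `txt` sentence from the cells above.

```python
import re

txt = "fran, felipe, Marc, Clara and Blanca are TA's??"

# count limits how many matches are replaced, scanning left to right
first_only = re.sub(r"f", "F", txt, count=1)

# A function receives each match object and returns the replacement text
capitalized = re.sub(r"\bf\w+", lambda m: m.group().capitalize(), txt)

# \1 and \2 refer back to the captured groups
swapped = re.sub(r"(\w+), (\w+)", r"\2, \1", "fran, felipe")

print(first_only)
print(capitalized)
print(swapped)
```
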
- **search()**
Scan through a string, looking for any location where this RE matches.

In [66]:
#re.search
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt) 
x

<re.Match object; span=(0, 17), match='The rain in Spain'>

In [67]:
txt = "The rain in Spain"
#\b matches at a word boundary, so the match must start at the beginning of a word
x = re.search(r"\bS\w+", txt)
print(x)
#span() returns a tuple with the start and end positions of the match
print(x.span())
#start() returns the start position of the match
print(x.start())
#end() returns the end position of the match
print(x.end())
#string is the string passed into the function (variable 'txt')
print(x.string)
#group() returns the part of the string where there was a match
print(x.group())

<re.Match object; span=(12, 17), match='Spain'>
(12, 17)
12
17
The rain in Spain
Spain


In [92]:
print(re.search(r'r\w*', txt))
print(re.search(r'R\w*', txt))
print(re.search(r'^T\w*', txt))
print(re.search(r'^t\w*', txt))

<re.Match object; span=(4, 8), match='rain'>
None
<re.Match object; span=(0, 3), match='The'>
None
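
The two `None` results above come from case sensitivity: `R` and `t` do not match `r` and `T`. A minimal sketch of how a flag changes that, using the same `txt`:

```python
import re

txt = "The rain in Spain"

# Patterns are case-sensitive by default, so an uppercase 'R' finds nothing
strict = re.search(r"R\w*", txt)

# re.IGNORECASE (or re.I) makes letters match regardless of case
relaxed = re.search(r"R\w*", txt, flags=re.IGNORECASE)

# The same flag can be written inline at the start of the pattern
inline = re.search(r"(?i)^t\w*", txt)

print(strict)
print(relaxed.group(), relaxed.span())
print(inline.group())
```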


- **match()**
Determine if the RE matches at the beginning of the string.

In [69]:
#re.match
pattern = r"Cookie"
sequence = "I want a Cookie"
sequence2= "Cookie, I want you!"
if re.match(pattern, sequence2):
    print("Match!")
else: 
    print("Not a match!")

Match!
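
For comparison, here is a short sketch of how `re.match`, `re.search` and `re.fullmatch` differ on the same strings: `match` only looks at the beginning, `search` looks anywhere, and `fullmatch` requires the whole string to match.

```python
import re

pattern = r"Cookie"

# match: only succeeds if the pattern starts at position 0
at_start = re.match(pattern, "I want a Cookie")

# search: succeeds wherever the pattern first occurs
anywhere = re.search(pattern, "I want a Cookie")

# fullmatch: the pattern must cover the entire string
entire = re.fullmatch(pattern, "Cookie, I want you!")
exact = re.fullmatch(pattern, "Cookie")

print(at_start, anywhere.span(), entire, exact.group())
```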


In [70]:
txt = "The rain in Spain"
#matches at the beginning of the string
print(re.match(r'r\w*', txt))
print(re.match(r'^r\w*', txt))
print(re.match(r'^T\w*', txt))
print(re.match(r'T\w*', txt))

None
None
<re.Match object; span=(0, 3), match='The'>
<re.Match object; span=(0, 3), match='The'>


In [71]:
email_address = 'Please contact us at: support@datamad.com'
match = re.search(r'(\w+)@([\w\.]+)', email_address)
if match:
    print(match.group()) # The whole matched text
    print(match.group(1)) # The username (group 1)
    print(match.group(2)) # The host (group 2)

support@datamad.com
support
datamad.com
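
Numbered groups can get hard to read as patterns grow. As an alternative sketch, the same pattern can use named groups with the `(?P<name>...)` syntax; the group names `user` and `host` are chosen here for illustration.

```python
import re

email_address = 'Please contact us at: support@datamad.com'

# (?P<name>...) gives each group a name instead of relying on its position
match = re.search(r'(?P<user>\w+)@(?P<host>[\w.]+)', email_address)

if match:
    print(match.group('user'))  # same text as match.group(1)
    print(match.group('host'))  # same text as match.group(2)
    print(match.groupdict())    # all named groups as a dictionary
```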


- **findall()**
Finds all substrings where the RE matches and returns them as a list.

In [72]:
#re.findall
email_address = "Please contact us at: support.data@data-mad.com, xyz@ironhack.com"

#'addresses' is a list that stores all the matches
addresses = re.findall(r'[\w\.]+@[\w\.-]+', email_address)
addresses

['support.data@data-mad.com', 'xyz@ironhack.com']

In [73]:
#raw strings keep \s and \w intact for the regex engine
print(re.findall(r'[^aeiou\s]',email_address))
print(re.findall(r'\sc\w*',email_address))
print(re.findall(r'^P\w*',email_address))

['P', 'l', 's', 'c', 'n', 't', 'c', 't', 's', 't', ':', 's', 'p', 'p', 'r', 't', '.', 'd', 't', '@', 'd', 't', '-', 'm', 'd', '.', 'c', 'm', ',', 'x', 'y', 'z', '@', 'r', 'n', 'h', 'c', 'k', '.', 'c', 'm']
[' contact']
['Please']
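
When the pattern contains groups, `re.findall` returns the groups (as tuples) instead of the whole match, and `re.finditer` yields match objects when positions are also needed. A minimal sketch on the same `email_address` string:

```python
import re

email_address = "Please contact us at: support.data@data-mad.com, xyz@ironhack.com"

# With two groups, findall returns one (user, host) tuple per match
pairs = re.findall(r'([\w.]+)@([\w.-]+)', email_address)

# finditer yields match objects, so the full text and its span stay available
found = [(m.group(), m.span()) for m in re.finditer(r'[\w.]+@[\w.-]+', email_address)]

print(pairs)
for text, span in found:
    print(text, span)
```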


- **split()**
Returns a list where the string has been split at each match

In [74]:
#re.split
prophet=['the{7}',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was{34}',
 'a',
 'dawn',
 'unto{6}der',
 'his']

In [75]:
def reference(x):
    #raw string, with the literal braces escaped explicitly
    return re.split(r"\{\d+\}",x)

In [76]:
prophet_reference=list(map(reference,prophet))
print(prophet_reference)

[['the', ''], ['chosen'], ['and'], ['the\nbeloved,'], ['who'], ['was', ''], ['a'], ['dawn'], ['unto', 'der'], ['his']]
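
The empty strings in the output above appear because the separator sits at the end of the word. The sketch below shows three variations of `re.split` that may be useful here: dropping empty pieces, keeping the separators with a capturing group, and limiting the number of splits with `maxsplit`.

```python
import re

# Filter out the empty strings left by a separator at the end
no_empties = [part for part in re.split(r"\{\d+\}", "the{7}") if part]

# A capturing group in the pattern keeps the separators in the result
with_separators = re.split(r"(\{\d+\})", "unto{6}der")

# maxsplit stops after the given number of splits
limited = re.split(r",\s*", "a, b,c,  d", maxsplit=2)

print(no_empties)
print(with_separators)
print(limited)
```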


-----------------------------------------------------------------------------------------------------------


## Some Practice

In [2]:
emails_info={}

In [27]:
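import re
#read the whole corpus into one string; the with-block closes the file afterwards
with open("emails.txt", "r") as f:
    fh = f.read()
#every message starts with a "From r" separator line
contents = re.split(r"From r", fh)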

In [28]:
len(contents)

3978

Fields to extract from each email:

- sender_email
- sender_name
- date_sent
- time_sent
- subject

In [79]:
print(contents.pop(1))

  Wed Oct 30 21:41:56 2002
Return-Path: <james_ngola2002@maktoob.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <james_ngola2002@maktoob.com>
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311
Status: O

FROM:MR. JAMES NGOLA.
CONFIDENTIAL TEL: 233-27-587908.
E-MAIL: (james_ngola2002@maktoob.com).

URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.


DEAR FRIEND,

I AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.


THE INCIDENT OCCURRED IN OUR PRESENCE WHILE WE

### Info Sender

In [5]:
info_sender=re.findall("From:.*", fh)
info_sender

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>',
 'From: "Maryam Abacha" <m_abacha03@www.com>',
 'From: Kuta David <davidkuta@postmark.net>',
 'From: "Barrister tunde dosumu" <tunde_dosumu@lycos.com>',
 'From: "William Drallo" <william2244drallo@maktoob.com>',
 'From: "MR USMAN ABDUL" <abdul_817@rediffmail.com>',
 'From: "Tunde  Dosumu" <barrister_td@lycos.com>',
 'From: MR TEMI JOHNSON <temijohnson2@rediffmail.com>',
 'From: "Dr.Sam jordan" <sjordan@diplomats.com>',
 'From: p_brown2@lawyer.com',
 'From: Barrister Peter Brown',
 'From: mic_k1@post.com',
 'From: "COL. MICHAEL BUNDU" <mikebunduu1@rediffmail.com>',
 'From: "MRS MARIAM ABACHA" <elixwilliam@usa.com>',
 'From: " DR. ANAYO AWKA " <anayoawka@hotmail.com>',
 'From: " DR. ANAYO AWKA " <anayoawka@hotmail.com>',
 'From: "Victor Aloma" <victorloma@netscape.n

In [6]:
#sender_email
emails_info['sender_email']=[]
for line in info_sender:
    res=re.findall(r'[\w\.]+@[\w\.-]+', line)
    if res:
        print(res)
        emails_info['sender_email'].append(res[0])
    else:
        emails_info['sender_email'].append('Nan')
        
emails_info['sender_email']

['james_ngola2002@maktoob.com']
['bensul2004nng@spinfinder.com']
['obong_715@epatra.com']
['obong_715@epatra.com']
['m_abacha03@www.com']
['davidkuta@postmark.net']
['tunde_dosumu@lycos.com']
['william2244drallo@maktoob.com']
['abdul_817@rediffmail.com']
['barrister_td@lycos.com']
['temijohnson2@rediffmail.com']
['sjordan@diplomats.com']
['p_brown2@lawyer.com']
['mic_k1@post.com']
['mikebunduu1@rediffmail.com']
['elixwilliam@usa.com']
['anayoawka@hotmail.com']
['anayoawka@hotmail.com']
['victorloma@netscape.net']
['victorloma@netscape.net']
['james_ngola2002@maktoob.com']
['martinchime@usa.com']
['mboro1555@post.com']
['martinchime@borad.com']
['martinchime@borad.com']
['edema_mb@phantomemail.com']
['edema_mb@phantomemail.com']
['adewilliams_ade@lawyer.com']
['smithkam2@post.com']
['sesm@omaninfo.com']
['obinaokoro@37.com']
['jamesalbert0@eircom.net']
['seminar@eecs.UM']
['adamuohiroma.adamuohiroma@caramail.com']
['fredobi3@omaninfo.com']
['mbengu@37.com']
['mmahoyi@caramail.com']
['ad

['kone404@msn.com']
['kone404@msn.com']
['drkwesij@indiatimes.com']
['David.Henson@state.co.us']
['jamesmasonwebmail@msn.com']
['mothermelissa11@msn.com']
['usman_be90@she.com']
['charleslilian01@edumail.co.za']
['jocelynjones@edumail.co.za']
['elenaowen@sify.com']
['elenaowen@sify.com']
['willzungu14@hotmail.com']
['phiri_aboa6@yahoo.co.in']
['danabdul2@yahoo.co.uk']
['tony_silver1@hotmail.com']
['tony_silver1@hotmail.com']
['susan_v13@msn.com']
['susan_v13@msn.com']
['m_aj03@msn.com']
['ritad_tt@yahoo.com']
['ken211@o2.pl']
['ken211@o2.pl']
['franktetteh@afrik.com']
['smith_a96@virgilio.it']
['smith_a96@virgilio.it']
['smith_a96@virgilio.it']
['barristerfm2006@msn.com']
['abuse@capitalone.com']
['all@eecs.UM']
['lilian_chrr@aumara.zzn.com']
['villaran_nenitaonline@hotmail.com']
['engkentas@yahoo.com']
['manoni112@msn.com']
['kingsley.enterprise@tiscali.co.uk']
['john2mmadu@ozu.es']
['ibrahim_bell13@hotmail.com']
['personaltreasurer@walla.com']
['lingeng@bochk.com']
['andersonzuma2006

['euphilipson@atmail.org']
['adamu_ali23@hotmail.fr']
['ab.marinho8@zipmail.com.br']
['lmnandi02@pnetmail.co.za']
['favour6@excite.com']
['dr_johnson12@yahoo.com']
['dr_johnson12@yahoo.com']
['miriamkolo67@hotmail.com']
['drdinka015dogon@hotmail.fr']
['brownofficeaa@virgilio.it']
['drfwest12@hotmail.fr']
['ken_sorowiwajr@excite.com']
['jamescamrar57@hotmail.com']
['charles_greene126@yahoo.co.uk']
['bill_cole_arts002@myway.com']
['compensationsfund@redtreesolutions.com']
['compensationsfund@redtreesolutions.com']
['aku_ubahxxxxx@yahoo.com']
['att22us6@hotmail.com']
['helen@ab-sa.co.za']
['muhammd_amin30@hotmail.fr']
['abdulfaye@yahoo.fr']
['adriaanc36@yahoo.gr']
['g.ajithkumar@msn.com']
['mustafa_kamel11@hotmail.fr']
['mr_hassanbilly01@latinmail.com']
['idr_biko8@hotmail.com']
['sandrawilliams@pre.sltnet.lk']
['rjpalaw@uku.co.uk']
['frankkone0011@hotmail.com']
['frankkone0011@hotmail.com']
['darekking2007@yahoo.co.uk']
['pae.za@terra.es', 'pae.za@terra.es']
['larisasosnkayapofey@mail2ru

['james_ngola2002@maktoob.com',
 'bensul2004nng@spinfinder.com',
 'obong_715@epatra.com',
 'obong_715@epatra.com',
 'm_abacha03@www.com',
 'davidkuta@postmark.net',
 'tunde_dosumu@lycos.com',
 'william2244drallo@maktoob.com',
 'abdul_817@rediffmail.com',
 'barrister_td@lycos.com',
 'temijohnson2@rediffmail.com',
 'sjordan@diplomats.com',
 'p_brown2@lawyer.com',
 'Nan',
 'mic_k1@post.com',
 'mikebunduu1@rediffmail.com',
 'elixwilliam@usa.com',
 'anayoawka@hotmail.com',
 'anayoawka@hotmail.com',
 'victorloma@netscape.net',
 'victorloma@netscape.net',
 'james_ngola2002@maktoob.com',
 'martinchime@usa.com',
 'mboro1555@post.com',
 'martinchime@borad.com',
 'martinchime@borad.com',
 'edema_mb@phantomemail.com',
 'edema_mb@phantomemail.com',
 'adewilliams_ade@lawyer.com',
 'smithkam2@post.com',
 'sesm@omaninfo.com',
 'obinaokoro@37.com',
 'jamesalbert0@eircom.net',
 'seminar@eecs.UM',
 'adamuohiroma.adamuohiroma@caramail.com',
 'fredobi3@omaninfo.com',
 'mbengu@37.com',
 'mmahoyi@caramail.co

In [9]:
#sender name
emails_info['sender_name']=[]
for line in info_sender:
    res=re.findall(r':.*<', line)
    if res:
        emails_info['sender_name'].append(res[0][1:-1])
    else:
        emails_info['sender_name'].append('Nan')
emails_info['sender_name']

[' "MR. JAMES NGOLA." ',
 ' "Mr. Ben Suleman" ',
 ' "PRINCE OBONG ELEME" ',
 ' "PRINCE OBONG ELEME" ',
 ' "Maryam Abacha" ',
 ' Kuta David ',
 ' "Barrister tunde dosumu" ',
 ' "William Drallo" ',
 ' "MR USMAN ABDUL" ',
 ' "Tunde  Dosumu" ',
 ' MR TEMI JOHNSON ',
 ' "Dr.Sam jordan" ',
 'Nan',
 'Nan',
 'Nan',
 ' "COL. MICHAEL BUNDU" ',
 ' "MRS MARIAM ABACHA" ',
 ' " DR. ANAYO AWKA " ',
 ' " DR. ANAYO AWKA " ',
 ' "Victor Aloma" ',
 ' "Victor Aloma" ',
 ' "JAMES NGOLA" ',
 ' "MARTIN CHIME" ',
 ' "Mr George Mboro" ',
 ' "MARTIN  CHIME" ',
 ' "MARTIN  CHIME" ',
 ' ',
 ' ',
 ' "ADE WILLIAMS" ',
 'Nan',
 ' "MRS. M SESE-SEKO" ',
 ' "obina okoro" ',
 ' "DR. JAMES  ALBERT" ',
 'Nan',
 ' adamuohiroma adamuohiroma ',
 ' "MR FRED OBI." ',
 ' "mbeki ngumeni" ',
 ' mahoyi mamudu ',
 ' "Mr. David Agu" ',
 ' "bell.idr bell.idr" ',
 ' "idris.bello idris.bello" ',
 'Nan',
 ' "CLEMENT APUTE" ',
 ' "MR GODWIN IGBUNU" ',
 ' "bell.idr bell.idr" ',
 ' "Mr. David Agu" ',
 ' "deborah kabila" ',
 ' Khalifa Sese 
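
The names extracted above still carry the surrounding spaces and quotes. One possible cleanup, sketched below, captures only the text between the optional quotes; the helper name `sender_name` is hypothetical, and it returns `None` rather than the string `'Nan'` when no name is present.

```python
import re

def sender_name(line):
    """Return the display name from a 'From:' header line, or None if absent."""
    # Optional quote, then the name (no quotes or '<'), optional quote, then '<'
    m = re.search(r'^From:\s*"?([^"<]*?)"?\s*<', line)
    if not m:
        return None
    name = m.group(1).strip()
    return name or None

lines = [
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
    'From: Kuta David <davidkuta@postmark.net>',
    'From: " DR. ANAYO AWKA " <anayoawka@hotmail.com>',
    'From: p_brown2@lawyer.com',
]
names = [sender_name(line) for line in lines]
print(names)
```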

### Info Dates

In [11]:
#DATES
dates=re.findall(r"Date:.*", fh)
dates

['Date: Thu, 31 Oct 2002 02:38:20 +0000',
 'Date: Thu, 31 Oct 2002 05:10:00',
 'Date: Thu, 31 Oct 2002 22:17:55 +0100',
 'Date: Thu, 31 Oct 2002 22:44:20',
 'Date: Fri, 1 Nov 2002 01:45:04 +0100',
 'Date: Sat, 02 Nov 2002 06:23:11 +0000',
 'Date: Sun, 3 Nov 2002 23:56:20 +0000',
 'Date: Mon, 04 Nov 2002 23:41:26',
 'Date: Tue, 6 Nov 2001 16:52:34 -0000',
 'Date: Fri, 08 Nov 2002 04:15:33',
 'Date: Fri, 8 Nov 2002 10:12:26 +0100',
 'Date: Mon, 11 Nov 2002 17:26:54 +0100',
 'Date: Tue, 13 Nov 2001 16:10:50 -0000',
 'Date: Thu, 14 Nov 2002 16:46:11 +0100',
 'Date: Fri, 15 Nov 2002 00:40:13',
 'Date: Fri, 15 Nov 2002 01:18:50',
 'Date: Sat, 16 Nov 2002 14:06:46',
 'Date: Sat, 16 Nov 2002 14:06:54',
 'Date: Sun, 17 Nov 2002 01:09:22 +0000',
 'Date: Wed, 20 Nov 2002 06:05:30 -0800',
 'Date: Wed, 20 Nov 2002 08:01:52',
 'Date: Wed, 20 Nov 2002 22:53:16 -0800',
 'Date: Wed, 20 Nov 2002 22:53:35 -0800',
 'Date: Fri, 22 Nov 2002 20:33:44',
 'Date: Sat, 23 Nov 2002 15:01:14 +0100',
 'Date: Mon, 2

In [12]:
#email date
emails_info['date_sent']=[]
for dat in dates:
    res=re.findall(r"\d+\s\w{3}\s\d+", dat)
    if res:
        emails_info['date_sent'].append(res[0])
    else:
        emails_info['date_sent'].append('Nan')

emails_info['date_sent']

['31 Oct 2002',
 '31 Oct 2002',
 '31 Oct 2002',
 '31 Oct 2002',
 '1 Nov 2002',
 '02 Nov 2002',
 '3 Nov 2002',
 '04 Nov 2002',
 '6 Nov 2001',
 '08 Nov 2002',
 '8 Nov 2002',
 '11 Nov 2002',
 '13 Nov 2001',
 '14 Nov 2002',
 '15 Nov 2002',
 '15 Nov 2002',
 '16 Nov 2002',
 '16 Nov 2002',
 '17 Nov 2002',
 '20 Nov 2002',
 '20 Nov 2002',
 '20 Nov 2002',
 '20 Nov 2002',
 '22 Nov 2002',
 '23 Nov 2002',
 '25 Nov 2002',
 '25 Nov 2002',
 '25 Nov 2002',
 '26 Nov 2002',
 '26 Nov 2002',
 '27 Nov 2002',
 '28 Nov 2002',
 '30 Nov 2002',
 '03 Dec 2002',
 '04 Dec 2002',
 '05 Dec 2002',
 '6 Dec 2002',
 '5 Dec 2002',
 '6 Dec 2002',
 '05 Dec 2002',
 '09 Dec 2002',
 '9 Dec 2002',
 '11 Dec 2002',
 '11 Dec 2002',
 '13 Dec 2002',
 '12 Dec 2002',
 '17 Dec 2002',
 '24 Dec 2002',
 '28 Dec 2002',
 '1 Jan 2003',
 '4 Jan 2000',
 '1 Jan 1999',
 '2 Jan 1999',
 '14 Jan 2003',
 '16 Jan 2003',
 '16 Jan 2003',
 '16 Jan 2003',
 '16 Jan 2003',
 '16 Jan 2003',
 '15 Jan 2003',
 '16 Jan 2003',
 '17 Jan 2003',
 '17 Jan 2003',
 '17
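
The extracted dates mix `1 Nov 2002` and `02 Nov 2002` styles. If real date objects are wanted, one option is to hand the matched pieces to `datetime.strptime`. This is a sketch with a hypothetical helper `parse_date`; it assumes an English locale, since `%b` parses month abbreviations according to the current locale.

```python
import re
from datetime import datetime

def parse_date(header):
    """Return a datetime.date from a 'Date:' header line, or None if not found."""
    m = re.search(r"(\d{1,2})\s([A-Za-z]{3})\s(\d{4})", header)
    if not m:
        return None
    # %d accepts both '1' and '01'; %b is the abbreviated month name
    return datetime.strptime(" ".join(m.groups()), "%d %b %Y").date()

parsed = [
    parse_date('Date: Thu, 31 Oct 2002 02:38:20 +0000'),
    parse_date('Date: Fri, 1 Nov 2002 01:45:04 +0100'),
    parse_date('Date: unknown'),
]
print(parsed)
```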

In [16]:
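#email time
#NOTE: the trailing \s requires whitespace after the time, so a Date line that ends right after the seconds (no timezone) does not match and is stored as 'Nan'
emails_info['time_sent']=[]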
for dat in dates:
    res=re.findall(r"\d{2}:\d{2}.*\s", dat)
    if res:
        emails_info['time_sent'].append(res[0][:-1])
    else:
        emails_info['time_sent'].append('Nan')

emails_info['time_sent']

['02:38:20',
 'Nan',
 '22:17:55',
 'Nan',
 '01:45:04',
 '06:23:11',
 '23:56:20',
 'Nan',
 '16:52:34',
 'Nan',
 '10:12:26',
 '17:26:54',
 '16:10:50',
 '16:46:11',
 'Nan',
 'Nan',
 'Nan',
 'Nan',
 '01:09:22',
 '06:05:30',
 'Nan',
 '22:53:16',
 '22:53:35',
 'Nan',
 '15:01:14',
 '16:04:42',
 '09:00:25',
 '22:20:56',
 '04:34:05',
 '19:53:58',
 '13:45:46',
 '08:34:54',
 '17:32:20',
 'Nan',
 '15:09:17',
 '14:19:23',
 '01:03:19',
 '19:03:38',
 '01:28:42',
 '13:32:50',
 'Nan',
 '05:10:49',
 '05:23:51 -0800',
 '06:23:59 -0800',
 '03:52:57',
 '08:40:42',
 '19:58:30',
 'Nan',
 'Nan',
 '08:50:53 -0500',
 '09:42:53',
 '22:08:55',
 '01:43:13',
 '21:06:26',
 '02:14:07',
 '02:48:26',
 'Nan',
 '09:03:50',
 '17:38:58',
 '14:58:04',
 '18:47:42',
 'Nan',
 'Nan',
 'Nan',
 'Nan',
 'Nan',
 'Nan',
 '11:44:49',
 '15:49:19',
 '04:57:06',
 '04:33:16',
 '04:33:42',
 '18:23:20 -0000',
 '14:53:48',
 'Nan',
 '22:12:21',
 '18:17:36',
 '17:39:00',
 '20:19:07',
 'Nan',
 '16:45:52',
 '01:41:05',
 '18:30:21',
 '15:00:35',
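
Many entries above are `'Nan'` even though the corresponding `Date:` line has a time (for example `'Date: Thu, 31 Oct 2002 05:10:00'`). The pattern ends in `\s`, so it only matches when something follows the time. A possible alternative, sketched here with a hypothetical helper `parse_time`, matches the time itself and treats the timezone offset as an optional second group.

```python
import re

# HH:MM:SS, optionally followed by whitespace and a +hhmm / -hhmm offset
TIME = re.compile(r"\b(\d{2}:\d{2}:\d{2})(?:\s([+-]\d{4}))?")

def parse_time(header):
    """Return (time, offset) from a 'Date:' header line; either may be None."""
    m = TIME.search(header)
    if not m:
        return (None, None)
    return (m.group(1), m.group(2))

with_offset = parse_time('Date: Thu, 31 Oct 2002 02:38:20 +0000')
without_offset = parse_time('Date: Thu, 31 Oct 2002 05:10:00')
print(with_offset, without_offset)
```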

### Subject

In [18]:
subjects=re.findall(r"Subject:.*",fh)
subjects

['Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP',
 'Subject: URGENT ASSISTANCE /RELATIONSHIP (P)',
 'Subject: GOOD DAY TO YOU',
 'Subject: GOOD DAY TO YOU',
 'Subject: I Need Your Assistance.',
 'Subject: Partnership',
 'Subject: Urgent Attention',
 'Subject: URGENT BUSINESS PRPOSAL',
 'Subject: THANK YOU',
 'Subject: Urgent Assistance',
 'Subject: IMPORTANT',
 'Subject: URGENT ASSISTANCE.',
 'Subject: From: Barrister Peter Brown',
 'Subject: MICHAEL',
 'Subject: *****SPAM***** IMPORTANT',
 'Subject: TRUST TRANSACTION',
 'Subject: REQUEST FOR YOUR UNRESERVED ASSISTANCE',
 'Subject: REQUEST FOR YOUR UNRESERVED ASSISTANCE',
 'Subject: Urgent Assistance',
 'Subject: Urgent Assistance',
 'Subject: URGENT BUSINESS PROPOSAL',
 'Subject: WANTED FOR FRAUD: Fawwaz Ulaby',
 'Subject: URGENT',
 'Subject: Offer',
 'Subject: URGENT',
 'Subject: URGENT',
 'Subject: BUSINESS RELATIONSHIP',
 'Subject: BUSINESS RELATIONSHIP',
 'Subject: REPLY NOW',
 'Subject: MICHAEL!',
 'Subject: Trustee',
 'Sub

In [20]:
emails_info['subject']=[]
for sub in subjects:
    res=re.findall(r":.*", sub)
    if res:
        emails_info['subject'].append(res[0][2:])
    else:
        emails_info['subject'].append('Nan')

emails_info['subject']

['URGENT BUSINESS ASSISTANCE AND PARTNERSHIP',
 'URGENT ASSISTANCE /RELATIONSHIP (P)',
 'GOOD DAY TO YOU',
 'GOOD DAY TO YOU',
 'I Need Your Assistance.',
 'Partnership',
 'Urgent Attention',
 'URGENT BUSINESS PRPOSAL',
 'THANK YOU',
 'Urgent Assistance',
 'IMPORTANT',
 'URGENT ASSISTANCE.',
 'From: Barrister Peter Brown',
 'MICHAEL',
 '*****SPAM***** IMPORTANT',
 'TRUST TRANSACTION',
 'REQUEST FOR YOUR UNRESERVED ASSISTANCE',
 'REQUEST FOR YOUR UNRESERVED ASSISTANCE',
 'Urgent Assistance',
 'Urgent Assistance',
 'URGENT BUSINESS PROPOSAL',
 'WANTED FOR FRAUD: Fawwaz Ulaby',
 'URGENT',
 'Offer',
 'URGENT',
 'URGENT',
 'BUSINESS RELATIONSHIP',
 'BUSINESS RELATIONSHIP',
 'REPLY NOW',
 'MICHAEL!',
 'Trustee',
 'urgent reply',
 'Request',
 'URGENT ASSISTANCE',
 'URGENT BUSINESS TRUSTEE',
 'STRICTLY CONFIDENTIAL & URGENT',
 'respond asap',
 'INHERITANCE CLAIM',
 'BUSINESS PROPOSAL',
 'BUSINESS PROPOSAL',
 'INVESTMENT BASED ON TRUST',
 'PARTNERSHIP',
 'BUSINESS PROPOSAL',
 'ASSISTANCE',
 'BU

### Creating DataFrame

In [22]:
emails_info.keys()

dict_keys(['sender_email', 'sender_name', 'date_sent', 'time_sent', 'subject'])

In [26]:
len(contents)

3978

In [29]:
#keep only the first 30 entries of each list so every column has the same length
emails_info={k:v[:30] for k,v in emails_info.items()}
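
Truncating equalizes the list lengths, but the lists were collected independently over the whole file, so position `i` in one list is not guaranteed to describe the same email as position `i` in another. For instance, the line `Subject: From: Barrister Peter Brown` also matched `From:.*`, which added an extra sender entry. One way to keep fields aligned, sketched below, is to parse each message separately with patterns anchored to the start of a line. The `sample` text is a hypothetical two-message excerpt in the layout shown earlier (the second separator date is invented), and `parse_message` is an illustrative helper.

```python
import re

# Hypothetical excerpt; the real notebook would use the contents of emails.txt
sample = (
    "From r  Wed Oct 30 21:41:56 2002\n"
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    "Date: Thu, 31 Oct 2002 02:38:20 +0000\n"
    "Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\n"
    "\n"
    "body text\n"
    "From r  Thu Oct 31 08:11:39 2002\n"
    "From: p_brown2@lawyer.com\n"
    "Date: Thu, 31 Oct 2002 05:10:00\n"
    "Subject: From: Barrister Peter Brown\n"
)

# '^' with re.M only matches at the start of a line, so "Subject: From: ..."
# is not mistaken for a From header
HEADERS = {
    "sender": re.compile(r"^From:[ \t]*(.*)$", re.M),
    "date": re.compile(r"^Date:[ \t]*(.*)$", re.M),
    "subject": re.compile(r"^Subject:[ \t]*(.*)$", re.M),
}

def parse_message(message):
    """Return one dict per message; missing headers become None."""
    row = {}
    for field, pattern in HEADERS.items():
        m = pattern.search(message)
        row[field] = m.group(1) if m else None
    return row

messages = [m for m in re.split(r"^From r", sample, flags=re.M) if m.strip()]
rows = [parse_message(m) for m in messages]
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, with one row per email.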

In [90]:
df=pd.DataFrame(emails_info)
df

| | sender_email | sender_name | date_sent | time_sent | subject |
|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | "MR. JAMES NGOLA." | 31 Oct 2002 | 02:38:20 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP |
| 1 | bensul2004nng@spinfinder.com | "Mr. Ben Suleman" | 31 Oct 2002 | | URGENT ASSISTANCE /RELATIONSHIP (P) |
| 2 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:17:55 | GOOD DAY TO YOU |
| 3 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | | GOOD DAY TO YOU |
| 4 | m_abacha03@www.com | "Maryam Abacha" | 1 Nov 2002 | 01:45:04 | I Need Your Assistance. |
| 5 | davidkuta@postmark.net | Kuta David | 02 Nov 2002 | 06:23:11 | Partnership |
| 6 | tunde_dosumu@lycos.com | "Barrister tunde dosumu" | 3 Nov 2002 | 23:56:20 | Urgent Attention |
| 7 | william2244drallo@maktoob.com | "William Drallo" | 04 Nov 2002 | | URGENT BUSINESS PRPOSAL |
| 8 | abdul_817@rediffmail.com | "MR USMAN ABDUL" | 6 Nov 2001 | 16:52:34 | THANK YOU |
| 9 | barrister_td@lycos.com | "Tunde Dosumu" | 08 Nov 2002 | | Urgent Assistance |


In [91]:
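#interesting!! HINT!
#regex=True is needed because recent pandas versions treat the pattern as a literal string by default
df.subject=df.subject.str.replace(r".*SPAM.*",'SPAM',regex=True)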
df.head(20)

| | sender_email | sender_name | date_sent | time_sent | subject |
|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | "MR. JAMES NGOLA." | 31 Oct 2002 | 02:38:20 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP |
| 1 | bensul2004nng@spinfinder.com | "Mr. Ben Suleman" | 31 Oct 2002 | | URGENT ASSISTANCE /RELATIONSHIP (P) |
| 2 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:17:55 | GOOD DAY TO YOU |
| 3 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | | GOOD DAY TO YOU |
| 4 | m_abacha03@www.com | "Maryam Abacha" | 1 Nov 2002 | 01:45:04 | I Need Your Assistance. |
| 5 | davidkuta@postmark.net | Kuta David | 02 Nov 2002 | 06:23:11 | Partnership |
| 6 | tunde_dosumu@lycos.com | "Barrister tunde dosumu" | 3 Nov 2002 | 23:56:20 | Urgent Attention |
| 7 | william2244drallo@maktoob.com | "William Drallo" | 04 Nov 2002 | | URGENT BUSINESS PRPOSAL |
| 8 | abdul_817@rediffmail.com | "MR USMAN ABDUL" | 6 Nov 2001 | 16:52:34 | THANK YOU |
| 9 | barrister_td@lycos.com | "Tunde Dosumu" | 08 Nov 2002 | | Urgent Assistance |
