# Regular Expressions (RegEx)

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern.

![regex](https://miro.medium.com/max/1200/1*ZVlIZ1ZYC6rASz-dYPzhZQ.jpeg)

**references**

- https://docs.python.org/3/howto/regex.html
- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
- https://www.dataquest.io/blog/regular-expressions-data-scientists/
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

**may save your life**

- https://regex101.com/

### First things first

For the standard case, **import re** should be enough. If you need the third-party `regex` module instead, **pip3/pip install regex** should install it.

In [59]:
import re
import numpy as np
import pandas as pd

## Syntax
### Special Characters:
- `.` Matches any character except a newline.
- `^` Matches the start of the string.
- `$` Matches the end of the string or just before the newline at the end of the string.
- `*` Matches 0 or more repetitions of the preceding RE.
- `+` Matches 1 or more repetitions of the preceding RE.
- `?` Matches 0 or 1 repetitions of the preceding RE.
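- `(?<=...)` Positive lookbehind: matches only if the current position is preceded by `...`. See https://www.regular-expressions.info/lookaround.html
- NOTE: `re.M` (`re.MULTILINE`) enables multiline mode, where `^` and `$` also match at the start and end of each line.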
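
A minimal sketch of how some of the characters above behave. The sample strings are made up for illustration.

```python
import re

text = "first line\nsecond line"  # made-up sample with two lines

# Without flags, ^ only matches at the very start of the string.
starts_default = re.findall(r"^\w+", text)

# With re.M (re.MULTILINE), ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, flags=re.M)

# ? makes the preceding item optional, so both spellings match.
optional_u = re.findall(r"colou?r", "color colour")

# (?<=\$) is a lookbehind: digits are matched only when preceded by "$".
prices = re.findall(r"(?<=\$)\d+", "cost: $25, qty: 3")

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(optional_u)        # ['color', 'colour']
print(prices)            # ['25']
```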

### Special Sequences:

- **Literals** `a` 
- **Alternation** `a|b`
- **Character sets** `[ab]`, `[^ab]` (a caret right after the opening bracket negates the set)
- **Wildcards** `.`
- **Escape special characters** `\` (?,*,+,^,$)
- **Ranges** `[a-d]`, `[1-9]`, `[A-D]`

- **Quantifiers** `{2}`, `{2,}`, `{2,4}`, `?`, `*`, `+`
- **Grouping** `()`
- **Anchors** `^`, `$`
- **Character classes** `\w`, `\d`, `\s`, `\n`, `\W`, `\D`, `\S`

**\w** - Matches any word character: letters, digits and the underscore `_`. Equivalent to `[a-zA-Z0-9_]` for ASCII text; with `str` patterns Python also matches non-ASCII letters and digits unless the `re.ASCII` flag is set.

**\d** - Matches any digit. Equivalent to `[0-9]` for ASCII text.

**\s** - Matches any whitespace character. Equivalent to `[ \t\n\r\f\v]` for ASCII text.

**\W** - Matches any non-word character. Equivalent to `[^a-zA-Z0-9_]` for ASCII text.

**\D** - Matches any non-digit. Equivalent to `[^0-9]` for ASCII text.

**\S** - Matches any non-whitespace character. Equivalent to `[^ \t\n\r\f\v]` for ASCII text.
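
A short sketch of the character classes in action, on a made-up sample string. The last two lines show that, for `str` patterns, `\w` is only the same as `[a-zA-Z0-9_]` when the `re.ASCII` flag is set.

```python
import re

sample = "Order 66: 3 items, total_cost = 42 EUR"  # made-up sample

digits = re.findall(r"\d+", sample)  # runs of digits
words = re.findall(r"\w+", sample)   # runs of word characters (note total_cost)

unicode_word = re.findall(r"\w+", "año")                  # ñ counts as a word character
ascii_word = re.findall(r"\w+", "año", flags=re.ASCII)    # ñ is excluded with re.ASCII

print(digits)        # ['66', '3', '42']
print(words)         # ['Order', '66', '3', 'items', 'total_cost', '42', 'EUR']
print(unicode_word)  # ['año']
print(ascii_word)    # ['a', 'o']
```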



### Methods

### re.sub(pattern, repl, string, count=0)
Replaces one or many matches with a string
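
The `count` argument in the signature limits how many matches are replaced. A minimal sketch on a made-up string:

```python
import re

txt_demo = "a-b-c-d"  # made-up sample

all_replaced = re.sub(r"-", "+", txt_demo)           # count=0 (default) replaces every match
first_two = re.sub(r"-", "+", txt_demo, count=2)     # only the first two matches are replaced

print(all_replaced)  # a+b+c+d
print(first_two)     # a+b+c-d
```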

In [60]:
txt = "gabriel, DDio & CClara are TA's??"

In [65]:
a = '2'

In [66]:
txt = "gabriel, DDDio & CClara are TA's??"
re.sub('[A-Z]', a, txt)

"gabriel, 222io & 22lara are 22's??"

In [19]:
re.sub('[A-Z]{2}', a, txt)

"gabriel, 2Dio & 2lara are 2's??"

In [20]:
#re.sub
#Literals
#{2}: each run of 2 consecutive [A-Z] characters is replaced by a
txt = "gabriel, DDDio & CClara are TA's??"
txt = re.sub('[A-Z]{2}', a, txt)
txt

"gabriel, 2Dio & 2lara are 2's??"

In [67]:
#Ranges
txt = "gabriel, DDDio & CClara are TA's??"
re.sub('[A-Z]','',txt)

"gabriel, io & lara are 's??"

In [68]:
#Escape special characters, quantifiers (raw string so that \? reaches the regex engine untouched)
re.sub(r'\?{2}','.',txt)

"gabriel, DDDio & CClara are TA's."

In [72]:
import os
path = os.getcwd()
re.sub(r'\!{3}', '', path)  # there is no "!!!" in the path, so it comes back unchanged

'c:\\Users\\alber\\Documents\\GitHub\\TheBridge\\SVL_DS_Marzo_TB\\2-Data Analysis\\4-Feature Engineering\\4-Text\\RegEx'

In [75]:
re.sub(r'\\SVL_DS_Marzo_TB.+', '', path)

'c:\\Users\\alber\\Documents\\GitHub\\TheBridge'

### re.search(pattern, string, flags=0)
Scans through a string looking for the first location where the RE matches. If the search is successful, `re.search()` returns a match object (`re.Match`) that contains information about the matching part of the string; otherwise it returns `None`.

Because it stops after the first match, `re.search()` is better suited to testing whether a pattern occurs than to extracting every occurrence.

In [84]:
#re.search
txt = "The rain in Spain and the sun in England, Tower"
x = re.search("^The.*Spain", txt) 
x.span()

(0, 17)

In [89]:
txt = "The rain in Spain"
# \b is a word boundary, so \bS matches an "S" at the start of a word
# \w+ then consumes the rest of that word
x = re.search(r"\bS\w+", txt)
print(x)
print(x.span())
#returns a tuple containing the start-, and end positions of the match
print(x.start())
#contains the start position of the match
print(x.end())
#contains the end position of the match
print(x.string)
#print the string passed into the function (variable 'txt')
print(x.group())
#Print the part of the string where there was a match

<re.Match object; span=(12, 17), match='Spain'>
(12, 17)
12
17
The rain in Spain
Spain


In [92]:
txt2 = "Hola soy Yo"
if re.search(r"\bS\w+", txt2):
    print("There is a word that starts with S")
else:
    print("There is no word that starts with S")

There is no word that starts with S


In [93]:
passw = input("Enter a name:")
while True:
    if re.search('^[A-Z][a-z]+$', passw):
        print("Hello {}".format(passw))
        break
    else:
        passw = input("Invalid, enter a valid name:")

Hello Manuel


In [95]:
for i in range(x.start(), x.end()):
    print(x.string[i])

S
p
a
i
n


In [105]:
print(r"Hola \n que tal")
print("Hola \n que tal")

Hola \n que tal
Hola 
 que tal


In [96]:
print(txt)
print(re.search(r'r\w+\s\w+', txt).group())
print(re.search(r'R\w+', txt))
print(re.search(r'T\w*', txt).group())
print(re.search(r'^t\w*', txt))

The rain in Spain
rain in
None
The
None


In [108]:
txt3 = "Me llamo P. Perez"
print(re.search(r'\w+\.\s', txt3).group())

P. 


### re.match(pattern, string)
Determine if the RE matches at the beginning of the string.

The two functions differ in where they look. Both return the first match found, but `re.match()` only tries to match at the beginning of the string: if the pattern occurs somewhere in the middle, it returns `None`.
`re.search()`, in contrast, scans the whole string, across all of its lines if it has several, and returns the first match wherever it occurs.
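
A small sketch of that difference on a made-up two-line string. It also shows that `re.MULTILINE` changes what `^` means for `re.search()`, but does not make `re.match()` look beyond the start of the string.

```python
import re

lines = "first line\nCookie on the second line"  # made-up two-line sample

at_start = re.match(r"Cookie", lines)                  # None: the string does not start with it
anywhere = re.search(r"Cookie", lines)                 # found on the second line

match_multiline = re.match(r"^Cookie", lines, flags=re.M)    # still None
search_multiline = re.search(r"^Cookie", lines, flags=re.M)  # ^ matches after the newline

print(at_start)                 # None
print(anywhere.span())          # (11, 17)
print(match_multiline)          # None
print(search_multiline.span())  # (11, 17)
```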

In [110]:
#re.match
pattern = r"Cookie"
sequence = "I want a Cookie"
sequence2= "Cookie, I want you!"
if re.match(pattern, sequence):
    print("Match!")
else: 
    print("Not a match!")

Not a match!


In [111]:
txt = "The rain in Spain"
#matches at the beginning of the string
print(re.match(r'r\w*', txt))
print(re.match(r'^r\w*', txt))

print(re.search(r'^T\w*', txt).group()) # search with ^ is the same as match

print(re.match(r'T\w*', txt).group())

None
None
The
The


In [114]:
text = 'El caballo blanco de Santiago'

# Each call builds a new Match object and == compares identity, so this is False
print(re.match(r'E\w+', text) == re.search(r'^E\w+', text))

False


In [118]:
re.match(r'E\w+', text).group() == re.search(r'^E\w+', text).group()

True

In [116]:
re.search(r'^E\w+', text)

<re.Match object; span=(0, 2), match='El'>

In [119]:
email_address = 'Please contact us at: support@thebridge.com'
matchi = re.search(r'(\w+)@([\w\.]+)', email_address)
if matchi:
    print(matchi.group(0)) # The whole matched text
    print(matchi.group(1)) # The username (group 1)
    print(matchi.group(2)) # The host (group 2)

support@thebridge.com
support
thebridge.com


In [130]:
matchi.groups()

('support', 'thebridge.com')

In [132]:
email_address = 'Please contact us at: support@thebridge.com'
matchi = re.match(r'(\w+)@([\w\.]+)', email_address)
if matchi:
    print(matchi.group(0)) # The whole matched text
    print(matchi.group(1)) # The username (group 1)
    print(matchi.group(2)) # The host (group 2)

In [134]:
type(matchi)

NoneType

### re.fullmatch(pattern, string)

The difference between `re.match()` and `re.fullmatch()` is that `re.match()` only requires a match at the beginning of the string, while `re.fullmatch()` requires the pattern to match the whole string, from beginning to end.

In [30]:
class_names = ["Andrea", "Ariadna", "Estela", "Anais", "Xeles", "Maria", "Mar"]
for name in class_names:
    if re.fullmatch("Maria", name):
        print(f"{name} is desired name")
    else:
        print(f"{name} is not desired name")

Andrea is not desired name
Ariadna is not desired name
Estela is not desired name
Anais is not desired name
Xeles is not desired name
Maria is desired name
Mar is not desired name


In [141]:
txt = 'nombre_apellido@dominio.com'
txt2 = 'Email: nombre_apellido@dominio.com'
txt3 = 'nombre_apellido@dominio.com es mi email'

# First we apply search
print("SEARCH:")

for tx in [txt, txt2, txt3]:
    matchi = re.search(r'(\w+)@([\w\.]+)', tx)
    if matchi:
        print(matchi.group())
    else:
        print("None")

print("\n\n")

print("MATCH:")

for tx in [txt, txt2, txt3]:
    matchi = re.match(r'(\w+)@([\w\.]+)', tx)
    if matchi:
        print(matchi.group())
    else:
        print("None")

print('\n'*2)

print("FULLMATCH:")

for tx in [txt, txt2, txt3]:
    matchi = re.fullmatch(r'(\w+)@([\w\.]+)', tx)
    if matchi:
        print(matchi.group())
    else:
        print("None")

SEARCH:
nombre_apellido@dominio.com
nombre_apellido@dominio.com
nombre_apellido@dominio.com



MATCH:
nombre_apellido@dominio.com
None
nombre_apellido@dominio.com



FULLMATCH:
nombre_apellido@dominio.com
None
None


### re.findall (pattern, string)
Finds all substrings where the RE matches and returns them as a list.

In [142]:
#re.findall
email_address = "Please contact us at: support.data@data-science.com, xyz@thebridge.com"

#'addresses' is a list that stores all the matches
addresses = re.findall(r'\w+\.?\w+@[\w\.-]+', email_address)
addresses

['support.data@data-science.com', 'xyz@thebridge.com']

In [143]:
print(re.findall(r'[^aeiou\s]',email_address))
print(re.findall(r'\sc\w*',email_address))
print(re.findall(r'^P\w*',email_address))

['P', 'l', 's', 'c', 'n', 't', 'c', 't', 's', 't', ':', 's', 'p', 'p', 'r', 't', '.', 'd', 't', '@', 'd', 't', '-', 's', 'c', 'n', 'c', '.', 'c', 'm', ',', 'x', 'y', 'z', '@', 't', 'h', 'b', 'r', 'd', 'g', '.', 'c', 'm']
[' contact']
['Please']


In [144]:
with open('info.txt', 'r') as file:
    client_info = file.read()

In [145]:
emails_clients=re.findall(r"[\w\.]+@[\w\.]+", client_info)
print(emails_clients[:5])

['sit.amet.metus@egestasnunc.ca', 'arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk', 'Nulla.eu.neque@Fuscealiquetmagna.com', 'Nullam.velit@non.ca', 'In@gravidamolestiearcu.co.uk']


In [35]:
client_numbers_normal = re.findall(r"[0-9]{2}-\d{3}-\d{3}-\d{3}", client_info)

In [146]:
print(client_numbers_normal[:5])

['34-739-941-941', '34-278-870-242', '34-999-876-292', '34-345-887-949', '34-905-089-682']


In [165]:
client_numbers = re.findall(r"(([0-9]{2}-)\d{3}-\d{3}(-\d{3}))", client_info)
client_suff = re.findall(r"[0-9]{2}-\d{3}-\d{3}(-\d{3})", client_info)

In [166]:
print(client_numbers[:5])

[('34-739-941-941', '34-', '-941'), ('34-278-870-242', '34-', '-242'), ('34-999-876-292', '34-', '-292'), ('34-345-887-949', '34-', '-949'), ('34-905-089-682', '34-', '-682')]


In [168]:
client_suff[:5]

['-941', '-242', '-292', '-949', '-682']
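
The tuples and suffixes printed above come from how `re.findall` handles capturing groups: with no group it returns the whole match, with one group it returns only that group, and with several groups it returns a tuple per match. A minimal sketch, reusing two of the phone numbers from the output above:

```python
import re

phones = "34-739-941-941 and 34-278-870-242"

no_groups = re.findall(r"\d{2}-\d{3}-\d{3}-\d{3}", phones)
one_group = re.findall(r"\d{2}-\d{3}-\d{3}-(\d{3})", phones)
two_groups = re.findall(r"(\d{2})-\d{3}-\d{3}-(\d{3})", phones)
# (?:...) groups without capturing, so the whole match is returned again.
non_capturing = re.findall(r"(?:\d{2}-)\d{3}-\d{3}-\d{3}", phones)

print(no_groups)      # ['34-739-941-941', '34-278-870-242']
print(one_group)      # ['941', '242']
print(two_groups)     # [('34', '941'), ('34', '242')]
print(non_capturing)  # ['34-739-941-941', '34-278-870-242']
```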

### re.split(pattern, string, maxsplit=0)
Returns a list where the string has been split at each match
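
A short sketch of splitting on a pattern rather than a fixed separator, and of the `maxsplit` argument from the signature. The sample row is made up.

```python
import re

row = "name;  age ,city;country"  # made-up sample with mixed separators

# Split on a comma or semicolon, swallowing any whitespace around it.
fields = re.split(r"\s*[;,]\s*", row)

# maxsplit=1 stops after the first split and leaves the rest untouched.
head_and_rest = re.split(r"\s*[;,]\s*", row, maxsplit=1)

print(fields)         # ['name', 'age', 'city', 'country']
print(head_and_rest)  # ['name', 'age ,city;country']
```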

In [169]:
#re.split
sente = "Hello,\n Please, contact me the sooner.\n Thank you,\n Me"

In [40]:
reg = re.split("\n", sente)
reg

['Hello,', ' Please, contact me the sooner.', ' Thank you,', ' Me']

In [41]:
"".join(reg)

'Hello, Please, contact me the sooner. Thank you, Me'

https://stackoverflow.com/questions/10804732/what-is-the-difference-between-and-in-regex

In [54]:
client_info[:1000]

'Wyoming Dudley sit.amet.metus@egestasnunc.ca 2518 Nulla Road 34-739-941-941 Aristotle Grant arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk 8453 Nostra, St. 34-278-870-242 Zephania Copeland Nulla.eu.neque@Fuscealiquetmagna.com P.O. Box 733, 3179 Ligula. Av. 34-999-876-292 Scarlett Ortiz Nullam.velit@non.ca 6082 Massa Road 34-345-887-949 Ocean Bell In@gravidamolestiearcu.co.uk P.O. Box 370, 440 Suspendisse Rd. 34-905-089-682 Meghan Henson non.bibendum@ipsumdolorsit.edu Ap #393-8999 Maecenas Rd. 34-773-463-479 Cole Cantrell rutrum.Fusce.dolor@purusNullamscelerisque.ca Ap #897-8561 Vitae, Rd. 34-017-915-525 Leandra Shaw quam.quis@ac.net P.O. Box 345, 1982 Ipsum Road 34-274-204-840 Michelle Rollins Nulla.eu.neque@idmollis.com 2891 Eget St. 34-575-459-881 Pamela Webster lacus.Cras@quisaccumsan.net 1418 Non Avenue 34-249-358-256 Kieran Aguilar commodo.at@sit.org 393-8798 Phasellus Rd. 34-299-478-659 Harrison Bartlett libero.et.tristique@sodaleseliterat.com Ap #803-4228 Accumsan Rd. 34-094-

- `(?:...)` is a non-capturing group
- `(?=...)` is a positive lookahead
- `(?!...)` is a negative lookahead
- `(?<=...)` is a positive lookbehind
- `(?<!...)` is a negative lookbehind
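
A minimal sketch of each construct with `re.findall`, on a made-up string. The `\b` word boundaries in the negative cases are there so that the engine cannot satisfy the assertion by matching only part of a number.

```python
import re

text = "price: 100 EUR, discount: 20 EUR, weight: 5 kg"  # made-up sample

labelled = re.findall(r"(?:price|weight): (\d+)", text)      # non-capturing group
followed_by_eur = re.findall(r"\d+(?= EUR)", text)           # positive lookahead
not_followed_by_eur = re.findall(r"\b\d+\b(?! EUR)", text)   # negative lookahead
after_discount = re.findall(r"(?<=discount: )\d+", text)     # positive lookbehind
not_after_discount = re.findall(r"(?<!discount: )\b\d+", text)  # negative lookbehind

print(labelled)             # ['100', '5']
print(followed_by_eur)      # ['100', '20']
print(not_followed_by_eur)  # ['5']
print(after_discount)       # ['20']
print(not_after_discount)   # ['100', '5']
```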

In [46]:
# Non-capturing group: the value used for the split is not included
client_list = re.split(r"(?:[0-9]{2}-\d{3}-\d{3}-\d{3})",client_info)
print(client_list[:3])

['Wyoming Dudley sit.amet.metus@egestasnunc.ca 2518 Nulla Road ', ' Aristotle Grant arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk 8453 Nostra, St. ', ' Zephania Copeland Nulla.eu.neque@Fuscealiquetmagna.com P.O. Box 733, 3179 Ligula. Av. ']


In [47]:
# Lookahead: the split value stays at the start of the next chunk
client_list = re.split(r"(?=[0-9]{2}-\d{3}-\d{3}-\d{3})",client_info)
print(client_list[:3])

['Wyoming Dudley sit.amet.metus@egestasnunc.ca 2518 Nulla Road ', '34-739-941-941 Aristotle Grant arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk 8453 Nostra, St. ', '34-278-870-242 Zephania Copeland Nulla.eu.neque@Fuscealiquetmagna.com P.O. Box 733, 3179 Ligula. Av. ']


In [55]:
# Negative lookahead: splits at every position that is not followed by the pattern
client_list = re.split(r"(?![0-9]{2}-\d{3}-\d{3}-\d{3})",client_info)
print(client_list[:100])

['', 'W', 'y', 'o', 'm', 'i', 'n', 'g', ' ', 'D', 'u', 'd', 'l', 'e', 'y', ' ', 's', 'i', 't', '.', 'a', 'm', 'e', 't', '.', 'm', 'e', 't', 'u', 's', '@', 'e', 'g', 'e', 's', 't', 'a', 's', 'n', 'u', 'n', 'c', '.', 'c', 'a', ' ', '2', '5', '1', '8', ' ', 'N', 'u', 'l', 'l', 'a', ' ', 'R', 'o', 'a', 'd', ' 3', '4', '-', '7', '3', '9', '-', '9', '4', '1', '-', '9', '4', '1', ' ', 'A', 'r', 'i', 's', 't', 'o', 't', 'l', 'e', ' ', 'G', 'r', 'a', 'n', 't', ' ', 'a', 'r', 'c', 'u', '.', 'A', 'l', 'i']


In [52]:
# Lookbehind: the split value stays at the end of the previous chunk
client_list = re.split(r"(?<=[0-9]{2}-\d{3}-\d{3}-\d{3})",client_info)
print(client_list[:3])

['Wyoming Dudley sit.amet.metus@egestasnunc.ca 2518 Nulla Road 34-739-941-941', ' Aristotle Grant arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk 8453 Nostra, St. 34-278-870-242', ' Zephania Copeland Nulla.eu.neque@Fuscealiquetmagna.com P.O. Box 733, 3179 Ligula. Av. 34-999-876-292']


### re.compile(pattern, flags=0)
Compiles a RE into a regular expression object.

In [177]:
name_check = re.compile(r"[^a-z ]", re.IGNORECASE)  # any character that is not a letter or a space

In [175]:
name_check

re.compile(r'[^A-Za-z ]', re.UNICODE)

In [178]:
name = input("Please insert your name:")
while name_check.search(name):
    # keep looping while an invalid character is found
    print("Please enter your name correctly!")
    name = input("Please insert your name:")
print("Finally mate, I thought you'd never do it")

Finally mate, I thought you'd never do it
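
The same compiled-pattern idea also works without `input()`, which makes it easier to test. A minimal sketch with a made-up list of names; `invalid_char` is a name chosen here for illustration.

```python
import re

# Compiled once, reused for every name: any character that is not a letter or a space.
invalid_char = re.compile(r"[^a-z ]", re.IGNORECASE)

candidates = ["Ada Lovelace", "R2D2", "Grace Hopper", "n/a"]  # made-up names

# A name is valid when the pattern finds no invalid character in it.
valid = [name for name in candidates if not invalid_char.search(name)]

print(valid)  # ['Ada Lovelace', 'Grace Hopper']
```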


-----------------------------------------------------------------------------------------------------------


## Some practice 
Now it's your turn.

#### Simple validation of a username
https://www.codewars.com/kata/56a3f08aa9a6cc9b75000023
    
Write a simple regex to validate a username. Allowed characters are:

- lowercase letters,
- numbers,
- underscore

Length should be between 4 and 16 characters (both included).


In [191]:
# Fran
def validate_usr(username):
    import re
    str_list = re.findall("[a-z0-9_]",username)
    compare = ""
    compare = compare.join(str_list)
    if compare == username and len(compare)<=16 and len(compare)>=4:
        return True
    else:
        return False

validate_usr('user_name123')

True

In [192]:
# Selu

def validate_usr(username):
    # import re
    check = re.compile(r'^[a-z0-9_]{4,16}$')
    if check.search(username):
        return True
    else:
        return False

validate_usr('user_name123')

True

In [193]:
# Ouissam
import re
def validate_usr(username):
    pattern = re.compile(r"[a-z0-9_]{4,16}")
    if re.fullmatch(pattern, username):
        return True
    else:
        return False

validate_usr('user_name123')

True

In [194]:
# Mario
def validate_usr(username):
    return bool(re.fullmatch('[a-z0-9_]{4,16}', username))

validate_usr('user_name123')

True

In [188]:
# Nani
def validate_usr(username):
    if re.match(r"[a-z\d_]{4,16}$",username):
        return True
    else:
        return False

validate_usr('user_name123')

True
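
One caveat when comparing these solutions: `$` also matches just before a newline at the end of the string, so the variants that rely on `$` accept a username with a trailing newline, while `re.fullmatch` does not. A small sketch with a made-up input:

```python
import re

candidate = "user_name\n"  # made-up input with a trailing newline

with_dollar = bool(re.search(r"^[a-z0-9_]{4,16}$", candidate))
with_fullmatch = bool(re.fullmatch(r"[a-z0-9_]{4,16}", candidate))
# \Z only matches at the very end of the string, newline or not.
with_z = bool(re.search(r"^[a-z0-9_]{4,16}\Z", candidate))

print(with_dollar)     # True
print(with_fullmatch)  # False
print(with_z)          # False
```

Whether the trailing newline matters depends on where the input comes from; the kata's own test cases are not shown here.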

#### Regex validate PIN code 

https://www.codewars.com/kata/55f8a9c06c018a0d6e000132

ATM machines allow 4 or 6 digit PIN codes and PIN codes cannot contain anything but exactly 4 digits or exactly 6 digits.

If the function is passed a valid PIN string, return true, else return false.

Examples:
```python
"1234"   -->  True
"12345"  -->  False
"a234"   -->  False
```

In [196]:
# Fran

def validate_pin(pin):
    import re
    if len(pin) in [4, 6]:
        return bool(re.fullmatch(r"[0-9]+", pin))
    else:
        return False

print(validate_pin('456789'))
print(validate_pin('45678'))

True
False


In [197]:
# Ouissam

import re
def validate_pin(pin):
    pattern1 = re.compile(r"[0-9]{4}")
    pattern2 = re.compile(r"[0-9]{6}")

    if re.fullmatch(pattern1, pin) or re.fullmatch(pattern2, pin):
        return True
    else: 
        return False

print(validate_pin('456789'))
print(validate_pin('45678'))

True
False


In [200]:
# Selu

def validate_pin(pin):
    #return true or false
    if re.fullmatch((r"\d{4}|\d{6}"), pin):
        return True
    else:
        return False

print(validate_pin('456789'))
print(validate_pin('45678'))

True
False


In [199]:
# Nani

def validate_pin(pin):
    if re.fullmatch(r"\d{4}|\d{6}", pin):
        return True
    else:
        return False

print(validate_pin('456789'))
print(validate_pin('45678'))

True
False


In [201]:
# Mario

def validate_pin(pin):
    return bool(re.fullmatch("[0-9]{4}|[0-9]{6}", pin))

print(validate_pin('456789'))
print(validate_pin('45678'))

True
False
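
A subtlety these solutions do not show: for `str` patterns, `\d` also accepts digits from other scripts, while `[0-9]` does not. A small sketch, assuming that such input should be rejected by a PIN check:

```python
import re

arabic_indic = "\u0661\u0662\u0663\u0664"  # the digits 1234 written in Arabic-Indic script

with_backslash_d = bool(re.fullmatch(r"\d{4}|\d{6}", arabic_indic))
with_range = bool(re.fullmatch(r"[0-9]{4}|[0-9]{6}", arabic_indic))
# re.ASCII restricts \d to 0-9.
with_ascii_flag = bool(re.fullmatch(r"\d{4}|\d{6}", arabic_indic, flags=re.ASCII))

print(with_backslash_d)  # True
print(with_range)        # False
print(with_ascii_flag)   # False
```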


-----------------------------------------------------------------------------------------------------------

# The FBI challenge

- https://www.fbi.gov/scams-and-safety/common-fraud-schemes/nigerian-letter-or-419-fraud
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

It's your first day at the FBI office and your boss has just sent you a text file, `emails.txt`. She has asked you to run some analysis, but first of all you need to build a dataframe like the following. You'll need some Python knowledge and some regex to get there.

In [47]:
df.head()

| | sender_email | sender_name | date_sent | time_sent | subject |
|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | "MR. JAMES NGOLA." | 31 Oct 2002 | 02:38:20 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP |
| 1 | bensul2004nng@spinfinder.com | "Mr. Ben Suleman" | 31 Oct 2002 | 05:10:00 | URGENT ASSISTANCE /RELATIONSHIP (P) |
| 2 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:17:55 | GOOD DAY TO YOU |
| 3 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:44:20 | GOOD DAY TO YOU |
| 4 | m_abacha03@www.com | "Maryam Abacha" | 1 Nov 2002 | 01:45:04 | I Need Your Assistance. |


---------------------------------------------------------------------------------------------------------------------------

#### Since we are good people, here is a proposed solution

In [69]:
emails_info={}

In [70]:
with open("emails.txt", "r") as file:
    fh = file.read()

In [71]:
fh.count("From r")

3977

In [72]:
contents = re.split(r"From r", fh)

In [73]:
contents[0]

''

In [74]:
contents.pop(0)

''

The fields to extract from each email are:

- sender_email
- sender_name
- date_sent
- time_sent
- subject


### Info Sender

In [75]:

info_sender=[]
for e in contents:
    # Keep the first "From:" line of each email, or a placeholder if there is none
    found = re.search("From:.*", e)
    info_sender.append(found.group() if found else "not found")

In [78]:
len(info_sender)

3977

In [38]:
#sender_email
emails_info['sender_email']=[]
for line in info_sender:
    res=re.findall(r'[\w\.]+@[\w\.-]+', line)
    if res:
        emails_info['sender_email'].append(res[0])
    else:
        emails_info['sender_email'].append(np.nan)
        
len(emails_info['sender_email'])

3977

In [39]:
#sender name
emails_info['sender_name']=[]
for line in info_sender:
    res=re.findall(r':.*<', line)
    if res:
        emails_info['sender_name'].append(res[0][1:-1])
    else:
        emails_info['sender_name'].append(np.nan)
len(emails_info['sender_name'])

3977
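
As an alternative to the two passes above, the name and the address could be pulled out together with named groups. This is only a sketch: the header line below is a hypothetical example in the shape the cells above seem to assume, and `sender` is a name chosen here.

```python
import re

# Hypothetical "From:" line; real lines in the corpus may differ.
line = 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>'

# (?P<name>...) and (?P<email>...) are named groups, read back by name.
sender = re.compile(r'From:\s*(?P<name>.*?)\s*<(?P<email>[\w.]+@[\w.-]+)>')

found = sender.search(line)
sender_name = found.group("name") if found else None
sender_email = found.group("email") if found else None

print(sender_name)   # "MR. JAMES NGOLA."
print(sender_email)  # james_ngola2002@maktoob.com
```

Lines without the `<...>` part would not match this pattern at all, so the two-pass approach above may still recover more addresses.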

### Info Dates

In [40]:
#DATES
dates=[]
for e in contents:
    # Keep the first "Date:" line of each email, or a placeholder if there is none
    found = re.search("Date:.*", e)
    dates.append(found.group() if found else "not found")
len(dates)

3977

In [41]:
#email date
emails_info['date_sent']=[]
for dat in dates:
    res=re.findall(r"\d+\s\w{3}\s\d+", dat)
    if res:
        emails_info['date_sent'].append(res[0])
    else:
        emails_info['date_sent'].append(np.nan)

len(emails_info['date_sent'])

3977

In [42]:
emails_info['time_sent']=[]
for dat in dates:
    res=re.findall(r"\d{2}:\d{2}:\d{2}", dat)
    if res:
        emails_info['time_sent'].append(res[0])
    else:
        emails_info['time_sent'].append(np.nan)

len(emails_info['time_sent'])

3977
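
The date and the time can also be captured in a single search by combining the two patterns above into one with two groups. A minimal sketch on a hypothetical `Date:` line; the real corpus may use other layouts.

```python
import re

# Hypothetical "Date:" line, made up for illustration.
date_line = "Date: Thu, 31 Oct 2002 02:38:20 +0000"

stamp = re.search(
    r"(?P<date>\d+\s\w{3}\s\d+)\s(?P<time>\d{2}:\d{2}:\d{2})",
    date_line,
)
date_sent = stamp.group("date") if stamp else None
time_sent = stamp.group("time") if stamp else None

print(date_sent)  # 31 Oct 2002
print(time_sent)  # 02:38:20
```

Note that this version yields nothing when either part is missing, whereas the separate loops above can still recover one of the two.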

### Subject

In [43]:
subject=[]
for e in contents:
    # Keep the first "Subject:" line of each email, or a placeholder if there is none
    found = re.search("Subject:.*", e)
    subject.append(found.group() if found else "not found")
len(subject)

3977

In [44]:
emails_info['subject']=[]
for sub in subject:
    res=re.findall(r":.*", sub)
    if res:
        emails_info['subject'].append(res[0][2:])
    else:
        emails_info['subject'].append(np.nan)

len(emails_info['subject'])

3977
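
The three "find the header line" loops follow the same pattern, so they could be folded into one helper. The function `first_header` and the sample email below are hypothetical, written only to illustrate the idea.

```python
import re

def first_header(field, raw_email):
    """Return the text after 'field:' on the first line starting with it, or None."""
    # re.MULTILINE lets ^ and $ anchor at each line; re.escape protects the field name.
    found = re.search(rf"^{re.escape(field)}:[ \t]*(.*)$", raw_email, flags=re.MULTILINE)
    return found.group(1) if found else None

# Hypothetical raw email, made up for illustration.
raw = (
    'From: "Maryam Abacha" <m_abacha03@www.com>\n'
    "Date: Fri, 1 Nov 2002 01:45:04 +0000\n"
    "Subject: I Need Your Assistance.\n"
    "\n"
    "Dear friend, ...\n"
)

print(first_header("Subject", raw))   # I Need Your Assistance.
print(first_header("Reply-To", raw))  # None
```

Anchoring with `^` means a "Subject:" that appears in the middle of a body line is ignored, which is a slightly stricter rule than the plain `re.search("Subject:.*", e)` used above.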

### Creating DataFrame

In [45]:
df=pd.DataFrame(emails_info)
df.isnull().sum()

sender_email    476
sender_name     837
date_sent       614
time_sent       618
subject          27
dtype: int64

In [46]:
df.head()

| | sender_email | sender_name | date_sent | time_sent | subject |
|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | "MR. JAMES NGOLA." | 31 Oct 2002 | 02:38:20 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP |
| 1 | bensul2004nng@spinfinder.com | "Mr. Ben Suleman" | 31 Oct 2002 | 05:10:00 | URGENT ASSISTANCE /RELATIONSHIP (P) |
| 2 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:17:55 | GOOD DAY TO YOU |
| 3 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:44:20 | GOOD DAY TO YOU |
| 4 | m_abacha03@www.com | "Maryam Abacha" | 1 Nov 2002 | 01:45:04 | I Need Your Assistance. |


### Now you can start your analysis!