# 作業目標: 使用python正規表達式對資料進行清洗處理

這份作業我們會使用詐欺郵件的文本資料來作為清洗與處理的操作。
[資料集](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### 讀入資料文本
因原始文本較大，先使用部份擷取的**sample_emails.txt**來進行練習

In [2]:
#讀取文本資料
#<your code>#
with open('data/sample_emails.txt') as file:
    text_mail = file.read()
print(text_mail)

From r  Wed Oct 30 21:41:56 2002
Return-Path: <james_ngola2002@maktoob.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <james_ngola2002@maktoob.com>
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311
Status: O

FROM:MR. JAMES NGOLA.
CONFIDENTIAL TEL: 233-27-587908.
E-MAIL: (james_ngola2002@maktoob.com).

URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.


DEAR FRIEND,

I AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.


THE INCIDENT OCCURRED IN OUR PRESENCE WH

In [None]:
sample_corpus

'From r  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <james_ngola2002@maktoob.com>\nMessage-Id: <200210310241.g9V2fNm6028281@cs.CU>\nFrom: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\nReply-To: james_ngola2002@maktoob.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\nStatus: O\n\nFROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: (james_ngola2002@maktoob.com).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.\n\n\nTHE INCIDE

### 讀取寄件者資訊
觀察文本資料可以發現, 寄件者資訊都符合以下格式

`From: <收件者姓名> <收件者電子郵件>`

In [4]:
import re

In [109]:
pattern = r"From:\s\".+\"\s<\w+?@\w+\..+>"
match = re.findall(pattern ,text_mail, flags=re.M|re.I)

In [110]:
match

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']

### 只讀取寄件者姓名

In [123]:
for ma in match:
    pattern = r"\".+\""
    match_name = re.search(pattern, ma)
    print(match_name.group())

"MR. JAMES NGOLA."
"Mr. Ben Suleman"
"PRINCE OBONG ELEME"


### 只讀取寄件者電子信箱

In [125]:
for ma in match:
    pattern = r"<(.+)>"
    match_email = re.search(pattern, ma)
    print(match_email.group(1))

james_ngola2002@maktoob.com
bensul2004nng@spinfinder.com
obong_715@epatra.com


### 只讀取電子信箱中的寄件機構資訊
ex: james_ngola2002@maktoob.com --> 取maktoob

In [134]:
# 參考
pattern2 = r"From:\s\".+\"\s<(\w+)@\w+\..+>"
match_domain2 = re.findall(pattern2, text_mail, flags=re.I)
print(*match_domain2)

james_ngola2002 bensul2004nng obong_715


In [132]:
for ma in match:
    pattern = r"@(\w+)\..+"
    match_domain = re.search(pattern, ma, flags=re.I)
    print(match_domain.group(1))

maktoob
spinfinder
epatra


### 結合上面的配對方式, 將寄件者的帳號與機構訊返回
ex: james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [142]:
pattern_name_domain = r"From:\s\"(.+)\"\s<\w+@(\w+)\..+>"
match_name_domain = re.findall(pattern_name_domain, text_mail, re.I|re.M)
print(*[', '.join(item) for item in match_name_domain], sep='\n')

MR. JAMES NGOLA., maktoob
Mr. Ben Suleman, spinfinder
PRINCE OBONG ELEME, epatra


### 使用正規表達式對email資料進行處理
這裡我們會使用到python其他的套件協助處理(ex: pandas, email, etc)，這裡我們只需要專注在正規表達式上即可，其他的套件是方便我們整理與處理資料。

### 讀取與切分Email
讀入的email為一個長字串，利用正規表達式切割讀入的資料成一封一封的email，並將結果以list表示。

輸出: [email_1, email_2, email_3, ....]

In [157]:
import re
import pandas as pd
import email

from email.policy import default

###讀取文本資料:fradulent_emails.txt###
with open('data/all_emails.txt', 'rb') as file:
    email_text = file.read().decode('utf-8', 'ignore')

In [230]:
###切割讀入的資料成一封一封的email###
###我們可以使用list來儲存每一封email###
###注意！這裡請仔細觀察sample資料，看資料是如何切分不同email###
pattern_split_email = r"From\sr.+\d+\:\d+\:\d+\s\d+\n"
list_split_email = re.split(pattern_split_email, email_text)

In [231]:
#查看有多少封email
len(list_split_email)

3977

In [233]:
print(list_split_email[1])

Return-Path: <james_ngola2002@maktoob.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <james_ngola2002@maktoob.com>
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311
Status: O

FROM:MR. JAMES NGOLA.
CONFIDENTIAL TEL: 233-27-587908.
E-MAIL: (james_ngola2002@maktoob.com).

URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.


DEAR FRIEND,

I AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.


THE INCIDENT OCCURRED IN OUR PRESENCE WHILE WE WERE HOLDING MEETING WITH 

In [153]:
print(email_text[:20000])

b'From r  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <james_ngola2002@maktoob.com>\nMessage-Id: <200210310241.g9V2fNm6028281@cs.CU>\nFrom: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\nReply-To: james_ngola2002@maktoob.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\nStatus: O\n\nFROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: (james_ngola2002@maktoob.com).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.\n\n\nTHE INCID

### 從文本中擷取所有寄件者與收件者的姓名和地址

In [264]:
emails_list = [] #創建空list來儲存所有email資訊

for mail in list_split_email[:21]: #只取前20筆資料 (處理速度比較快)
    if not mail:
        continue
    emails_dict = dict() #創建空字典儲存資訊
    ###取的寄件者姓名與地址###
    
    #Step1: 取的寄件者資訊 (hint: From:)
    #<your code>#
    pattern_from_info = r"From:\s\"(.+)\"\s<(\w+?@\w+\..+)>"
    match_from_info = re.search(pattern_from_info, mail)
    
    #Step2: 取的姓名與地址 (hint: 要注意有時會有沒取到配對的情況)
    #<your code>#
    str_sender_name = match_info.group(1)
    str_sender_mail = match_info.group(2)
    
    #Step3: 將取得的姓名與地址存入字典中
    #<your code>#
    emails_dict['sender_name'] = str_sender_name
    emails_dict['sender_mail'] = str_sender_mail
    
    ###取的收件者姓名與地址###
    #Step1: 取的寄件者資訊 (hint: To:)
    #<your code>#
    pattern_to_info = r"To:\s(.+@\w+\..+)?(.+)?"
#     pattern_to_info = r"To:((.+))"
    match_to_info = re.search(pattern_to_info, mail)
    
    #Step2: 取的姓名與地址 (hint: 要注意有時會有沒取到配對的情況)
    #<your code>#
    str_recipient_name = match_to_info.group(2)
    str_recipient_mail = match_to_info.group(1)
    
    #Step3: 將取得的姓名與地址存入字典中
    #<your code>#
    emails_dict['recipient_name'] = str_recipient_name
    emails_dict['recipient_mail'] = str_recipient_mail
        
    ###取得信件日期###
    #Step1: 取得日期資訊 (hint: To:)
    #Date: Thu, 31 Oct 2002 22:17:55 +0100
    #<your code>#
    pattern_date = r"Date:.+(\d+\s\w+\s\d+).+"
    match_date = re.search(pattern_date, mail)
    
    #Step2: 取得詳細日期(只需取得DD MMM YYYY)
    #<your code>#
    try:
        str_date = match_date.group(1)
        emails_dict['date'] = str_date
    except:
        emails_dict['date'] = ""
        
    #Step3: 將取得的日期資訊存入字典中
    #<your code>#
    
        
    ###取得信件主旨###
    #Step1: 取得主旨資訊 (hint: Subject:)
    #<your code>#
    pattern_sub = r"Subject:\s(.*)\s*"
    match_sub = re.search(pattern_sub, mail)
    
    #Step2: 移除不必要文字 (hint: Subject: )
    #<your code>#
    str_sub = match_sub.group(1)
    
    #Step3: 將取得的主旨存入字典中
    #<your code>#
    emails_dict['subject'] = str_sub
    
    ###取得信件內文###
    #這裡我們使用email package來取出email內文 (可以不需深究，本章節重點在正規表達式)
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except:
        emails_dict["email_body"] = None
    
    ###將字典加入list###
    emails_list.append(emails_dict)
    
    #<your code>#

In [265]:
#將處理結果轉化為dataframe
emails_df = pd.DataFrame(emails_list)
emails_df

Unnamed: 0,sender_name,sender_mail,recipient_name,recipient_mail,date,subject,email_body
0,MR. JAMES NGOLA.,james_ngola2002@maktoob.com,,james_ngola2002@maktoob.com,1 Oct 2002,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...
1,MR. JAMES NGOLA.,james_ngola2002@maktoob.com,R@M,,1 Oct 2002,URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ..."
2,MR. JAMES NGOLA.,james_ngola2002@maktoob.com,,obong_715@epatra.com,1 Oct 2002,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...
3,MR. JAMES NGOLA.,james_ngola2002@maktoob.com,,webmaster@aclweb.org,1 Oct 2002,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...
4,MR. JAMES NGOLA.,james_ngola2002@maktoob.com,,m_abacha03@www.com,1 Nov 2002,I Need Your Assistance.,"Dear sir, \n \nIt is with a heart full of hope..."
5,MR. JAMES NGOLA.,james_ngola2002@maktoob.com,,davidkuta@yahoo.com,2 Nov 2002,Partnership,ATTENTION: ...
6,MR. JAMES NGOLA.,james_ngola2002@maktoob.com,,tunde_dosumu@lycos.com,,Urgent Attention,"Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)..."
7,MR. JAMES NGOLA.,james_ngola2002@maktoob.com,,william2244drallo@maktoob.com,3 Nov 2002,URGENT BUSINESS PRPOSAL,FROM: WILLIAM DRALLO.\nCONFIDENTIAL TEL: 233-2...
8,MR. JAMES NGOLA.,james_ngola2002@maktoob.com,R@M,,4 Nov 2002,THANK YOU,"CHALLENGE SECURITIES LTD.\nLAGOS, NIGERIA\n\n\..."
9,MR. JAMES NGOLA.,james_ngola2002@maktoob.com,,barrister_td@lycos.com,,Urgent Assistance,"Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)..."
