# 作業目標: 使用python正規表達式對資料進行清洗處理

這份作業我們會使用詐欺郵件的文本資料來作為清洗與處理的操作。
[資料集](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### 讀入資料文本
因原始文本較大，先使用部份擷取的**sample_emails.txt**來進行練習

In [1]:
#讀取文本資料
fn = "sample_emails.txt"

with open(fn) as file_obj:
    sample_corpus = file_obj.read()

In [2]:
sample_corpus

'From r  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <james_ngola2002@maktoob.com>\nMessage-Id: <200210310241.g9V2fNm6028281@cs.CU>\nFrom: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\nReply-To: james_ngola2002@maktoob.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\nStatus: O\n\nFROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: (james_ngola2002@maktoob.com).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.\n\n\nTHE INCIDE

### 讀取寄件者資訊
觀察文本資料可以發現, 寄件者資訊都符合以下格式

`From: <收件者姓名> <收件者電子郵件>`

In [3]:
import re
pattern = r"^From:.*"

match = re.findall(pattern, sample_corpus, re.M)

In [4]:
match

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']

### 只讀取寄件者姓名

In [5]:
pattern = r"(?<=From:)[a-zA-Z.\s\"]+"

match = re.findall(pattern, sample_corpus, re.M)

In [6]:
for name in match:
    print(name)

 "MR. JAMES NGOLA." 
 "Mr. Ben Suleman" 
 "PRINCE OBONG ELEME" 


### 只讀取寄件者電子信箱

In [7]:
pattern = r"^From:.*(?<=\<)(.*@\w+\.com)"

match = re.findall(pattern, sample_corpus, re.M)

In [8]:
match

['james_ngola2002@maktoob.com',
 'bensul2004nng@spinfinder.com',
 'obong_715@epatra.com']

### 只讀取電子信箱中的寄件機構資訊
ex: james_ngola2002@maktoob.com --> 取maktoob

In [9]:
pattern = r"^From:.*(?<=\<).*@(\w+)\.com"

match = re.findall(pattern, sample_corpus, re.M)

In [10]:
for info in match:
    print(info)

maktoob
spinfinder
epatra


### 結合上面的配對方式, 將寄件者的帳號與機構訊返回
ex: james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [11]:
pattern = r"^From:.*(?<=\<)(.*)@(\w+)\.com"

match = re.findall(pattern, sample_corpus, re.M)

In [12]:
match

[('james_ngola2002', 'maktoob'),
 ('bensul2004nng', 'spinfinder'),
 ('obong_715', 'epatra')]

### 使用正規表達式對email資料進行處理
這裡我們會使用到python其他的套件協助處理(ex: pandas, email, etc)，這裡我們只需要專注在正規表達式上即可，其他的套件是方便我們整理與處理資料。

### 讀取與切分Email
讀入的email為一個長字串，利用正規表達式切割讀入的資料成一封一封的email，並將結果以list表示。

輸出: [email_1, email_2, email_3, ....]

In [13]:
import re
import pandas as pd
import email
from email.parser import Parser
from email.header import decode_header
from email.utils import parseaddr

###讀取文本資料###
with open('all_emails.txt', 'r', encoding="utf8", errors='ignore') as f:
    corpus = f.read()
    
###切割讀入的資料成一封一封的email###
###我們可以使用list來儲存每一封email###
###注意！這裡請仔細觀察sample資料，看資料是如何切分不同email###
emails = re.split(r"From r", corpus, flags=re.M)
emails = emails[1:] #移除第一項的空元素
len(emails) #查看有多少封email

3977

### 從文本中擷取所有寄件者與收件者的姓名和地址

In [14]:
emails_list = [] #創建空list來儲存所有email資訊

for mail in emails[:20]: #只取前20筆資料 (處理速度比較快)
    emails_dict = dict() #創建空字典儲存資訊
    
    ###取的寄件者姓名與地址###
    #Step1: 取的寄件者資訊 (hint: From:)
    pattern = r"^From:.*"
    From = re.findall(pattern, mail, re.M)
    emails_dict['from_info'] = From
    
    #Step2: 取的姓名與地址 (hint: 要注意有時會有沒取到配對的情況)
    pattern = r"(?<=From:)[a-zA-Z.\s\"]+"
    name = re.findall(pattern, mail, re.M)
    emails_dict['from_name'] = name

    pattern = r"^From:.*(?<=\<)(.*@\w+\.com)"
    email = re.findall(pattern, mail, re.M)
    emails_dict['from_email'] = email
    

    ###取的收件者姓名與地址### 
    #Step1: 取的收件者資訊 (hint: to:)
    pattern = r"^To:.*"
    to = re.findall(pattern, mail, re.M)
    emails_dict['to_info'] = to
    

    pattern = r"^To:.*(?<=\<)(.*@\w+\.com)"
    email = re.findall(pattern, mail, re.M)
    emails_dict['to_email'] = email

    ###取得信件主旨###
    #Step1: 取得主旨資訊 (hint: Subject:)
    pattern = r"^Subject:.*"
    subject = re.findall(pattern, mail, re.M)
    emails_dict['subject'] = subject
    
    
    ###取得信件內文###
    #這裡我們使用email package來取出email內文 (可以不需深究，本章節重點在正規表達式)
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except:
        emails_dict["email_body"] = None
    emails_list.append(emails_dict)

In [15]:
emails_list

[{'from_info': ['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>'],
  'from_name': [' "MR. JAMES NGOLA." '],
  'from_email': ['james_ngola2002@maktoob.com'],
  'to_info': ['To: webmaster@aclweb.org'],
  'to_email': [],
  'subject': ['Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP'],
  'email_body': None},
 {'from_info': ['From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>'],
  'from_name': [' "Mr. Ben Suleman" '],
  'from_email': ['bensul2004nng@spinfinder.com'],
  'to_info': ['To: R@M'],
  'to_email': [],
  'subject': ['Subject: URGENT ASSISTANCE /RELATIONSHIP (P)'],
  'email_body': None},
 {'from_info': ['From: "PRINCE OBONG ELEME" <obong_715@epatra.com>'],
  'from_name': [' "PRINCE OBONG ELEME" '],
  'from_email': ['obong_715@epatra.com'],
  'to_info': ['To: webmaster@aclweb.org'],
  'to_email': [],
  'subject': ['Subject: GOOD DAY TO YOU'],
  'email_body': None},
 {'from_info': ['From: "PRINCE OBONG ELEME" <obong_715@epatra.com>'],
  'from_name': [' "PRINCE OBONG EL

In [16]:
#將處理結果轉化為dataframe
emails_df = pd.DataFrame(emails_list)
emails_df

Unnamed: 0,from_info,from_name,from_email,to_info,to_email,subject,email_body
0,"[From: ""MR. JAMES NGOLA."" <james_ngola2002@mak...","[ ""MR. JAMES NGOLA."" ]",[james_ngola2002@maktoob.com],[To: webmaster@aclweb.org],[],[Subject: URGENT BUSINESS ASSISTANCE AND PARTN...,
1,"[From: ""Mr. Ben Suleman"" <bensul2004nng@spinfi...","[ ""Mr. Ben Suleman"" ]",[bensul2004nng@spinfinder.com],[To: R@M],[],[Subject: URGENT ASSISTANCE /RELATIONSHIP (P)],
2,"[From: ""PRINCE OBONG ELEME"" <obong_715@epatra....","[ ""PRINCE OBONG ELEME"" ]",[obong_715@epatra.com],[To: webmaster@aclweb.org],[],[Subject: GOOD DAY TO YOU],
3,"[From: ""PRINCE OBONG ELEME"" <obong_715@epatra....","[ ""PRINCE OBONG ELEME"" ]",[obong_715@epatra.com],[To: webmaster@aclweb.org],[],[Subject: GOOD DAY TO YOU],
4,"[From: ""Maryam Abacha"" <m_abacha03@www.com>]","[ ""Maryam Abacha"" ]",[m_abacha03@www.com],[To: R@M],[],[Subject: I Need Your Assistance.],
5,[From: Kuta David <davidkuta@postmark.net>],[ Kuta David ],[],[To: davidkuta@yahoo.com],[],[Subject: Partnership],
6,"[From: ""Barrister tunde dosumu"" <tunde_dosumu@...","[ ""Barrister tunde dosumu"" ]",[tunde_dosumu@lycos.com],[],[],[Subject: Urgent Attention],
7,"[From: ""William Drallo"" <william2244drallo@mak...","[ ""William Drallo"" ]",[william2244drallo@maktoob.com],[To: webmaster@aclweb.org],[],[Subject: URGENT BUSINESS PRPOSAL],
8,"[From: ""MR USMAN ABDUL"" <abdul_817@rediffmail....","[ ""MR USMAN ABDUL"" ]",[abdul_817@rediffmail.com],[To: R@M],[],[Subject: THANK YOU],
9,"[From: ""Tunde Dosumu"" <barrister_td@lycos.com>]","[ ""Tunde Dosumu"" ]",[barrister_td@lycos.com],[],[],[Subject: Urgent Assistance],
