# Assignment goal: clean and process data with Python regular expressions

This assignment uses the text of fraudulent emails as the data for the cleaning and processing exercises.
[Dataset](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### Reading the text data
Because the original text is large, we first practice on a partial extract, **sample_emails.txt**.

In [1]:
# Read the text data
#<your code>#
with open('sample_emails.txt', 'r') as file:
    sample_corpus = file.read()
    #print(sample_corpus)

In [2]:
sample_corpus

'From r  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <james_ngola2002@maktoob.com>\nMessage-Id: <200210310241.g9V2fNm6028281@cs.CU>\nFrom: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\nReply-To: james_ngola2002@maktoob.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\nStatus: O\n\nFROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: (james_ngola2002@maktoob.com).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.\n\n\nTHE INCIDE

### Extracting the sender information
Looking at the text, the sender information always follows this format:

`From: <sender name> <sender email>`

In [3]:
#<your code>#
import re
pattern = r"From:\s*.+<\w+@\w+\..+>"  # raw string avoids invalid-escape warnings; matches the same text
match = re.findall(pattern, sample_corpus)
print(match)

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>', 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>', 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']


In [4]:
match

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']
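The pattern above matches `From:` anywhere in the text. As a minimal sketch, assuming the headers look like the sample output shown earlier, the pattern can be anchored to the start of a line and given named groups so that the name and address come out in one pass. The `sample` string and the names `from_line` and `senders` are made up for this illustration.

```python
import re

# Hypothetical excerpt that follows the header layout of the sample corpus.
sample = (
    'Return-Path: <james_ngola2002@maktoob.com>\n'
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    'Reply-To: james_ngola2002@maktoob.com\n'
    '\n'
    'FROM:MR. JAMES NGOLA.\n'
    'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>\n'
)

# With re.MULTILINE, ^ matches at the start of every line, not only at the
# start of the string. [^"]* and [^<>\s]+ stop each group at its delimiter.
from_line = re.compile(
    r'^From:\s*"(?P<name>[^"]*)"\s*<(?P<address>[^<>\s]+@[^<>\s]+)>',
    re.MULTILINE,
)

senders = [m.groupdict() for m in from_line.finditer(sample)]
for sender in senders:
    print(sender['name'], '|', sender['address'])
```

The upper-case `FROM:` line in the body is skipped because the pattern is case-sensitive and expects a quoted name followed by an address in angle brackets.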

### Extracting only the sender name

In [5]:
#<your code>#
for m in match:
    #print(type(m), m)
    
    #print(m.split('"')[1])
        
    print(re.search(r'\".+\"', m).group())

"MR. JAMES NGOLA."
"Mr. Ben Suleman"
"PRINCE OBONG ELEME"
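The output above still carries the surrounding double quotes, because `group()` returns the whole match. One possible way to drop them, sketched here on two of the matched lines, is to put a capturing group around the part between the quotes and read `group(1)`. The variable names are illustrative only.

```python
import re

# Two of the matched lines shown earlier in this notebook.
lines = [
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
    'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
]

names = []
for line in lines:
    found = re.search(r'"([^"]+)"', line)
    # group(1) is only the text inside the parentheses, so the quotes are excluded.
    names.append(found.group(1) if found else None)

print(names)
```

Using `[^"]+` instead of `.+` also keeps the match from running on to a later quote on the same line.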


### Extracting only the sender email address

In [6]:
#<your code>#
for m in match:
    #print(m.split('<')[1].strip('>'))
    print(re.search(r'\w+@\w+\.com', m).group())

james_ngola2002@maktoob.com
bensul2004nng@spinfinder.com
obong_715@epatra.com
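The pattern `\w+@\w+\.com` works for these three senders because they all use a `.com` address. The sketch below compares it with a looser pattern that accepts other top-level domains as well. The looser pattern is an assumption chosen for illustration; it is not a full email-address validator.

```python
import re

lines = [
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
    'To: webmaster@aclweb.org',
]

com_only = re.compile(r'\w+@\w+\.com')
# Local part: word characters plus . + -
# Domain: one label followed by one or more dot-separated labels.
general = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

com_hits = [bool(com_only.search(line)) for line in lines]
general_hits = [general.search(line).group() for line in lines]

print(com_hits)
print(general_hits)
```

The `.com`-only pattern finds nothing in the `To:` line, while the looser pattern returns both addresses.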


### Extracting only the sender's organization from the email address
e.g. james_ngola2002@maktoob.com --> maktoob

In [7]:
#<your code>#
for m in match:
    print(m.split('@')[1].split('.')[0])
    #print(re.search(r'(?<=@)\w\S*(?=\.)', m).group())
    #print(re.findall(r'\w+@(\w\S*)\.com', m))

maktoob
spinfinder
epatra
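The commented-out lookaround pattern in the cell above, `(?<=@)\w\S*(?=\.)`, is greedy, so on an address with several dots it runs up to the last dot. A small sketch of a stricter variant follows. It takes only the first label after `@`; the third address is a hypothetical example added to show the difference.

```python
import re

addresses = [
    'james_ngola2002@maktoob.com',
    'webmaster@aclweb.org',
    'user@mail.example.co.uk',  # hypothetical address with several dots
]

greedy = re.compile(r'(?<=@)\w\S*(?=\.)')        # pattern from the cell above
first_label = re.compile(r'(?<=@)[^.@\s]+(?=\.)')  # stops at the first dot

greedy_orgs = [greedy.search(a).group() for a in addresses]
first_orgs = [first_label.search(a).group() for a in addresses]

print(greedy_orgs)
print(first_orgs)
```

Both patterns agree on the two addresses from the corpus; they differ only when the domain has more than one dot.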


### Combining the patterns above to return the sender's account and organization
e.g. james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [8]:
#<your code>#
for m in match:
    person_info = re.findall(r'(\w+)@(\w\S*)\.com', m)
    for info in person_info:
        print('{}, {}'.format(info[0], info[1]))

james_ngola2002, maktoob
bensul2004nng, spinfinder
obong_715, epatra
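As a variation on the cell above, the two capturing groups can be named and wrapped in a small helper that returns the `[account, organization]` list from the example. This is only a sketch: `ADDRESS` and `account_and_org` are hypothetical names, and the pattern assumes the organization is the first label after `@`.

```python
import re

ADDRESS = re.compile(r'(?P<account>[\w.+-]+)@(?P<org>[\w-]+)\.[\w.-]+')

def account_and_org(text):
    """Return [account, org] for the first address found in text, or None."""
    found = ADDRESS.search(text)
    if found is None:
        return None
    return [found['account'], found['org']]

result = account_and_org('From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>')
missing = account_and_org('no address on this line')
print(result)
print(missing)
```

Returning `None` for a line without an address keeps the caller from having to handle an exception.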


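### Processing the email data with regular expressions
Here we also use other Python packages to help (e.g. pandas, email). We only need to focus on the regular expressions; the other packages just make it easier to organize and process the data.

### Reading and splitting the emails
The email file is read in as one long string. Use a regular expression to split it into individual emails and store the result as a list.

Output: [email_1, email_2, email_3, ....]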

In [9]:
import re
import pandas as pd
import email

### Read the text data: fradulent_emails.txt ###
#<your code>#
with open('all_emails.txt', 'r', encoding="utf8", errors='ignore') as f:
    contents = f.read()
    
print(contents)
### Split the data into individual emails ###
### A list can hold each email ###
### Note: look closely at the sample data to see how the emails are separated ###
#<your code>#
# From r  Sat Nov 16 08:34:18 2002
emails = contents.split(r'From r  ')
len(emails)  # number of pieces produced by the split

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.


3977
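The cell above splits with `str.split`, which cuts at every occurrence of `From r  `, including one in the middle of a line. A minimal sketch of the regular-expression version follows, assuming that every email starts with a `From r  ` separator at the beginning of a line as in the sample. The `raw` string is a shortened, partly invented excerpt; the body sentence is hypothetical.

```python
import re

raw = (
    'From r  Wed Oct 30 21:41:56 2002\n'
    'Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\n'
    '\n'
    'I GOT YOUR CONTACT From r  A FRIEND.\n'  # hypothetical body line
    'From r  Thu Oct 31 08:11:39 2002\n'
    'Subject: URGENT ASSISTANCE /RELATIONSHIP (P)\n'
)

# Plain string split: also cuts inside the body line.
naive = raw.split('From r  ')

# Regex split: ^ with re.MULTILINE only accepts the separator at a line start.
anchored = re.split(r'^From r  ', raw, flags=re.MULTILINE)

# The piece before the first separator is empty, so drop blank pieces.
messages = [piece for piece in anchored if piece.strip()]

print(len(naive), len(anchored), len(messages))
```

With the anchored pattern, the hypothetical body line stays inside the first message instead of starting a new one.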

### Extracting the names and addresses of all senders and recipients from the text

In [10]:
emails_list = []  # list that collects one dict per email

# emails[0] is the text before the first separator, so start at index 1;
# only 19 emails are processed to keep this fast
for mail in emails[1:20]:
    #print(type(mail))
    #print(str(mail))
    emails_dict = dict()  # dict that holds the fields of this email
    ### Get the sender name and address ###

    # Step 1: get the sender line (hint: From:)
    #<your code>#
    # From: "Victor Aloma" <victorloma@netscape.net>
    sender = re.search(r'From:\s*(\".+\")\s+\<.+\>', mail)
    #print(type(sender), sender)

    # Step 2: get the name and address (hint: sometimes there is no match)
    #<your code>#
    if sender is not None:
        s_name = re.search(r'(?<=\").+(?=\")', sender.group())
        #print(s_name.group())
        s_address = re.search(r'\<(\w+.*@.+)\>', sender.group())
        #print(s_address.group())
    else:
        s_name = None
        s_address = None

    # Step 3: store the name and address in the dict
    #<your code>#
    # group() returns the whole match, so the stored address keeps its angle brackets
    emails_dict['sender_name'] = s_name.group() if s_name else None
    emails_dict['sender_address'] = s_address.group() if s_address else None
    #print(emails_dict)
    
    ### Get the recipient name and address ###
    # Step 1: get the recipient line (hint: To:)
    #<your code>#
    # To: webmaster@aclweb.org
    # Note: the pattern is not anchored to the start of a line, so it also
    # matches the "To:" inside an earlier "Reply-To:" header
    receiver = re.search(r'To:\s*.+@.+', mail)
    #print(receiver.group())

    # Step 2: get the name and address (hint: sometimes there is no match)
    #<your code>#
    if receiver is not None:
        receiver_name = re.search(r'(?<= ).+(?=@)', receiver.group())
        #print(receiver_name.group())
        receiver_address = re.search(r'(?<= ).+', receiver.group())
        #print(receiver_address.group())
    else:
        receiver_name = None
        receiver_address = None

    # Step 3: store the name and address in the dict
    #<your code>#
    emails_dict['receiver_name'] = receiver_name.group() if receiver_name else None
    emails_dict['receiver_address'] = receiver_address.group() if receiver_address else None
    
    ### Get the email date ###
    # Step 1: get the date line (hint: Date:)
    #<your code>#
    #Date: Tue, 6 Nov 2001 16:52:34 -0000
    #Date: Fri, 08 Nov 2002 04:15:33
    #Date: Sat, 23 Nov 2002 15:01:14 +0100
    email_date_info = re.search(r'Date:.+\d{2}:\d{2}:\d{2}', mail)
    #print(email_date_info)

    # Step 2: get the date itself (only DD MMM YYYY is needed)
    #<your code>#
    if email_date_info is not None:
        email_date = re.search(r'\d+\s+\w{3}\s+\d{4}', email_date_info.group())
        #print(email_date)
    else:
        email_date = None

    # Step 3: store the date in the dict
    #<your code>#
    emails_dict['date'] = email_date.group() if email_date else None
        
    ### Get the email subject ###
    # Step 1: get the subject line (hint: Subject:)
    #<your code>#
    # Subject: MICHAEL!
    subject_info = re.search(r'Subject:.+', mail)
    #print(subject_info.group())

    # Step 2: remove the unneeded text (hint: Subject: )
    #<your code>#
    if subject_info is not None:
        email_subject = re.sub(r'Subject: ', '', subject_info.group())
    else:
        email_subject = None
    #print(email_subject)

    # Step 3: store the subject in the dict
    #<your code>#
    emails_dict['subject'] = email_subject

    print(emails_dict)
    

    ### Get the email body ###
    # The email package extracts the body here (no need to study it closely; this chapter focuses on regular expressions)
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except Exception:
        emails_dict["email_body"] = None

    ### Append the dict to the list ###
    #<your code>#
    emails_list.append(emails_dict)

{'sender_name': 'MR. JAMES NGOLA.', 'sender_address': '<james_ngola2002@maktoob.com>', 'receiver_name': 'james_ngola2002', 'receiver_address': 'james_ngola2002@maktoob.com', 'date': '31 Oct 2002', 'subject': 'URGENT BUSINESS ASSISTANCE AND PARTNERSHIP'}
{'sender_name': 'Mr. Ben Suleman', 'sender_address': '<bensul2004nng@spinfinder.com>', 'receiver_name': 'R', 'receiver_address': 'R@M', 'date': '31 Oct 2002', 'subject': 'URGENT ASSISTANCE /RELATIONSHIP (P)'}
{'sender_name': 'PRINCE OBONG ELEME', 'sender_address': '<obong_715@epatra.com>', 'receiver_name': 'obong_715', 'receiver_address': 'obong_715@epatra.com', 'date': '31 Oct 2002', 'subject': 'GOOD DAY TO YOU'}
{'sender_name': 'PRINCE OBONG ELEME', 'sender_address': '<obong_715@epatra.com>', 'receiver_name': 'webmaster', 'receiver_address': 'webmaster@aclweb.org', 'date': '31 Oct 2002', 'subject': 'GOOD DAY TO YOU'}
{'sender_name': 'Maryam Abacha', 'sender_address': '<m_abacha03@www.com>', 'receiver_name': 'm_abacha03', 'receiver_add
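In the first record above, `receiver_address` equals the sender's address even though the sample's `To:` header is `webmaster@aclweb.org`. The unanchored pattern `To:\s*.+@.+` first matches the `To:` inside the earlier `Reply-To:` header. One possible fix, sketched below under the assumption that each header sits on its own line, is a small helper that anchors the header name to the start of a line. `header_value` is a hypothetical name introduced for this example.

```python
import re

def header_value(name, message):
    """Return the value of the first header line called `name`, or None."""
    pattern = rf'^{re.escape(name)}:[ \t]*(.*)$'
    found = re.search(pattern, message, flags=re.MULTILINE)
    return found.group(1).strip() if found else None

# Header lines taken from the first email in the sample corpus.
message = (
    'Return-Path: <james_ngola2002@maktoob.com>\n'
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    'Reply-To: james_ngola2002@maktoob.com\n'
    'To: webmaster@aclweb.org\n'
    'Date: Thu, 31 Oct 2002 02:38:20 +0000\n'
    'Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\n'
)

unanchored = re.search(r'To:\s*.+@.+', message).group()
recipient = header_value('To', message)
date_line = header_value('Date', message)
cc_line = header_value('Cc', message)

print(unanchored)
print(recipient, '|', date_line, '|', cc_line)
```

The same helper covers `From`, `Date` and `Subject`, and it returns `None` for a missing header, which removes most of the repeated `if ... is not None` checks in the loop.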

In [11]:
# Convert the results into a DataFrame
emails_df = pd.DataFrame(emails_list)
emails_df

| | date | email_body | receiver_address | receiver_name | sender_address | sender_name | subject |
|---|---|---|---|---|---|---|---|
| 0 | 31 Oct 2002 | `Wed Oct 30 21:41:56 2002\nReturn-Path: <james_...` | james_ngola2002@maktoob.com | james_ngola2002 | `<james_ngola2002@maktoob.com>` | MR. JAMES NGOLA. | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP |
| 1 | 31 Oct 2002 | `Thu Oct 31 08:11:39 2002\nReturn-Path: <bensul...` | R@M | R | `<bensul2004nng@spinfinder.com>` | Mr. Ben Suleman | URGENT ASSISTANCE /RELATIONSHIP (P) |
| 2 | 31 Oct 2002 | `Thu Oct 31 17:27:16 2002\nReturn-Path: <obong_...` | obong_715@epatra.com | obong_715 | `<obong_715@epatra.com>` | PRINCE OBONG ELEME | GOOD DAY TO YOU |
| 3 | 31 Oct 2002 | `Thu Oct 31 17:53:56 2002\nReturn-Path: <obong_...` | webmaster@aclweb.org | webmaster | `<obong_715@epatra.com>` | PRINCE OBONG ELEME | GOOD DAY TO YOU |
| 4 | 1 Nov 2002 | `Fri Nov 1 04:48:39 2002\nReturn-Path: <m_abac...` | m_abacha03@www.com | m_abacha03 | `<m_abacha03@www.com>` | Maryam Abacha | I Need Your Assistance. |
| 5 | 02 Nov 2002 | `Sat Nov 2 00:18:06 2002\nReturn-Path: <davidk...` | davidkuta@yahoo.com | davidkuta | | | Partnership |
| 6 | | `Sat Nov 2 05:10:24 2002\nReturn-Path: <tunde_...` | tunde_dosumu@lycos.com | tunde_dosumu | `<tunde_dosumu@lycos.com>` | Barrister tunde dosumu | Urgent Attention |
| 7 | 3 Nov 2002 | `Sun Nov 3 19:00:11 2002\nReturn-Path: <willia...` | william2244drallo@maktoob.com | william2244drallo | `<william2244drallo@maktoob.com>` | William Drallo | URGENT BUSINESS PRPOSAL |
| 8 | 04 Nov 2002 | `Mon Nov 4 17:41:46 2002\nReturn-Path: <abdul_...` | R@M | R | `<abdul_817@rediffmail.com>` | MR USMAN ABDUL | THANK YOU |
| 9 | | `Tue Nov 5 05:25:07 2002\nReturn-Path: <barris...` | barrister_td@lycos.com | barrister_td | `<barrister_td@lycos.com>` | Tunde Dosumu | Urgent Assistance |
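The `date` column mixes one-digit and two-digit days (`1 Nov 2002`, `02 Nov 2002`) and has empty entries. As a rough sketch, the same `DD MMM YYYY` pattern can feed `datetime.strptime` to produce a uniform ISO date. `to_iso_date` is a hypothetical helper, and it assumes English month abbreviations, since `%b` depends on the active locale.

```python
import re
from datetime import datetime

def to_iso_date(text):
    """Return 'YYYY-MM-DD' for the first 'DD MMM YYYY' found in text, or None."""
    found = re.search(r'(\d{1,2})\s+([A-Za-z]{3})\s+(\d{4})', text or '')
    if found is None:
        return None
    day, month, year = found.groups()
    try:
        parsed = datetime.strptime(f'{day} {month} {year}', '%d %b %Y')
    except ValueError:
        # The text looked like a date but was not a valid one.
        return None
    return parsed.date().isoformat()

iso_dates = [to_iso_date(value) for value in ['31 Oct 2002', '1 Nov 2002', '02 Nov 2002', None]]
print(iso_dates)
```

The helper also accepts a full header line such as `Date: Thu, 31 Oct 2002 02:38:20 +0000`, because the pattern searches for the date anywhere in the text.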
