# Cleaning Data with Python Regular Expressions

We use the text of a fraudulent email corpus as the data for the cleaning and processing exercises.
[Dataset](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### Reading the text data
Because the original corpus is large, we first practice on an excerpt, **sample_emails.txt**.

In [1]:
with open('./D4/sample_emails.txt') as f:
    sample_corpus = f.read()

In [2]:
print(sample_corpus)

From r  Wed Oct 30 21:41:56 2002
Return-Path: <james_ngola2002@maktoob.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <james_ngola2002@maktoob.com>
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311
Status: O

FROM:MR. JAMES NGOLA.
CONFIDENTIAL TEL: 233-27-587908.
E-MAIL: (james_ngola2002@maktoob.com).

URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.


DEAR FRIEND,

I AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.


THE INCIDENT OCCURRED IN OUR PRESENCE WH

### Extracting sender information
Looking at the sample text, the sender information follows this format:

`From: "sender name" <sender email>`

In [3]:
import re
pattern = r'From: ".*?" <.*?>'
result = re.findall(pattern,sample_corpus)
print('\n'.join(result))

From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
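
The `.*?` in the pattern above is a lazy quantifier: it stops at the first position where the rest of the pattern can match. Below is a minimal sketch of how it differs from the greedy `.*`. The `line` string is a made-up header for illustration and is not taken from the corpus.

```python
import re

# Hypothetical header line with two quoted names, for illustration only.
line = 'From: "Alice A." <alice@example.com>, "Bob B." <bob@example.com>'

greedy = re.findall(r'".*"', line)   # runs to the last quote on the line
lazy = re.findall(r'".*?"', line)    # stops at the nearest closing quote

print(greedy)
print(lazy)
```

With the greedy form, a single match swallows everything between the first and last quote, so the lazy form is the safer default when a line may hold more than one quoted name.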


### Extracting only the sender's name

In [4]:
pattern = '".*?"'
result_name = re.findall(pattern,'\n'.join(result))
print('\n'.join(result_name))

"MR. JAMES NGOLA."
"Mr. Ben Suleman"
"PRINCE OBONG ELEME"


### Extracting only the sender's email address

In [5]:
pattern = '(?<=<).*?@.*?(?=>)'
result_email = re.findall(pattern,'\n'.join(result))
print('\n'.join(result_email))

james_ngola2002@maktoob.com
bensul2004nng@spinfinder.com
obong_715@epatra.com
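
The pattern above uses a lookbehind `(?<=<)` and a lookahead `(?=>)`. Both are zero-width: the angle brackets must be present, but they are not part of the returned match. The sketch below compares this with a capture group, which gives the same result here, and shows one limitation to keep in mind: Python's `re` module only accepts fixed-width lookbehinds. The variable names are illustrative.

```python
import re

header = 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>'

# Lookarounds: "<" and ">" are required but excluded from the match.
with_lookaround = re.findall(r'(?<=<).*?@.*?(?=>)', header)

# Capture group: findall returns only the text inside the parentheses.
with_group = re.findall(r'<(.*?@.*?)>', header)

# A variable-width lookbehind such as (?<=From: .*) is rejected by re.
try:
    re.compile(r'(?<=From: .*)<')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

print(with_lookaround, with_group, lookbehind_ok)
```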


### Extracting only the sending organization from the email address
e.g. james_ngola2002@maktoob.com --> extract maktoob

In [6]:
pattern = r'(?<=@)\w+'
result_com = re.findall(pattern,'\n'.join(result_email))
print('\n'.join(result_com))

maktoob
spinfinder
epatra
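
Note that `\w+` matches only letters, digits and underscores, so it stops at the first `.` or `-` after the `@`. That is fine for the three addresses above, but a hyphenated domain would be cut short. A small sketch, using a made-up address that does not come from the corpus:

```python
import re

# Hypothetical address with a hyphenated domain, for illustration only.
address = 'someone@my-company.example.org'

word_only = re.findall(r'(?<=@)\w+', address)       # \w does not match "-"
with_hyphen = re.findall(r'(?<=@)[\w-]+', address)  # also allow hyphens

print(word_only)
print(with_hyphen)
```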


### Combining the patterns above to return the sender's account and organization
e.g. james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [7]:
pattern = r'.+(?=@)|(?<=@)\w+'
result_combine = [', '.join(re.findall(pattern,i)) for i in result_email]
print('\n'.join(result_combine))

james_ngola2002, maktoob
bensul2004nng, spinfinder
obong_715, epatra
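
An alternative to the alternation above is a single pattern with named groups, which returns both parts from one match. This is only a sketch: `address_pattern` and `split_address` are illustrative names, and the character classes are an assumption about what a reasonable address looks like, not a full email grammar.

```python
import re

# account: everything before "@" that is not whitespace or "@"
# org: the first label after "@" (word characters and hyphens)
address_pattern = re.compile(r'(?P<account>[^@\s]+)@(?P<org>[\w-]+)')

def split_address(address):
    """Return (account, organization), or None if no address is found."""
    match = address_pattern.search(address)
    return (match.group('account'), match.group('org')) if match else None

print(split_address('james_ngola2002@maktoob.com'))
print(split_address('no address here'))
```

Returning `None` for a non-match keeps the missing-value case explicit, which matters later when some headers carry no address at all.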


### Processing the email data with regular expressions
This part also uses other Python packages (e.g. pandas, email). They only help organize and process the data; the focus stays on regular expressions.

### Reading and splitting the emails
The emails are read in as one long string. We split that string into individual emails and store the result as a list.

Output: [email_1, email_2, email_3, ....]

In [8]:
import re
import pandas as pd
import numpy as np
import email

### Read the full text corpus (all_emails.txt) ###
with open('./D4/all_emails.txt', encoding="utf-8", errors='ignore') as f:
    s = f.read()

### Split the text that was read into individual emails ###
### A list can hold each email ###
### Note: look closely at the sample data to see how the emails are separated ###
emails = ['From r'+i for i in s.split('From r')[1:]]

# Check how many emails there are
print(len(emails))

3977
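
The split above relies on the literal text `From r` appearing only at the start of each message. A more defensive sketch, assuming every message begins with `From r` at the start of a line, anchors the split there with a multiline lookahead (splitting on a zero-width match needs Python 3.7 or later). `toy_corpus` is a made-up two-message string. Whether the real corpus contains `From r` inside message bodies has not been checked here, so this variant may give a count other than 3977.

```python
import re

# Made-up corpus: two messages, the first with "From r" inside its body.
toy_corpus = (
    'From r  Wed Oct 30 21:41:56 2002\n'
    'Subject: A\n'
    '\n'
    'Greetings From r friend\n'
    'From r  Thu Oct 31 08:00:00 2002\n'
    'Subject: B\n'
    '\n'
    'Bye\n'
)

# Plain string split: also cuts at the "From r" inside the body.
naive = ['From r' + part for part in toy_corpus.split('From r')[1:]]

# Regex split: only cuts where a line starts with "From r ".
anchored = [part for part in re.split(r'(?m)^(?=From r )', toy_corpus) if part]

print(len(naive), len(anchored))
```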


### Extracting the names and addresses of all senders and recipients from the text

In [9]:
emails_list = []  # list that collects the information for every email

for mail in emails:  # loop over every email (slice to emails[:20] for a quicker trial run)
    emails_dict = dict()  # dictionary for this email's information
    ### Get the sender's name and address ###

    # Step 1: get the sender line (hint: From:)
    pattern = r'From: .*'
    sender = ' '.join(re.findall(pattern, mail))

    # Step 2: get the name and address (hint: sometimes there is no match)
    pattern = r'(?<=").*(?=")'
    sender_name = re.findall(pattern, sender)
    pattern = r'\S*@\S*'
    sender_email = re.findall(pattern, sender)

    # Step 3: store the name and address in the dictionary
    emails_dict['sender_name'] = sender_name[0] if sender_name else np.nan
    emails_dict['sender_emial'] = sender_email[0].lstrip('<').rstrip('>') if sender_email else np.nan
    
    ### Get the recipients' names and addresses ###
    # Step 1: get the recipient lines (hint: To:)
    # Note: this pattern also matches "Reply-To:" lines; r'(?m)^To: .*' would exclude them
    pattern = r'To: .*'
    receivers = re.findall(pattern, mail)

    # Step 2: get the name and address (hint: sometimes there is no match)
    receiver_pair = []
    for receiver in receivers:
        pattern = r'(?<=").*(?=")'
        receiver_name = re.findall(pattern, receiver)
        pattern = r'\S*@\S*'
        receiver_email = [i.lstrip('<').rstrip('>') for i in re.findall(pattern, receiver)]
        receiver_pair.append((receiver_name[0] if receiver_name else np.nan,
                              receiver_email[0] if receiver_email else np.nan))

    # Step 3: store the names and addresses in the dictionary
    emails_dict['receiver_name_email_pair'] = receiver_pair if receiver_pair else np.nan
        
        
    ### Get the email date ###
    # Step 1: get the date information (hint: From r)
    pattern = r'(?<=From r  ).*'
    date_information = re.findall(pattern, mail)

    # Step 2: keep only the day, month and year (DD MMM YYYY)
    date = [v for i, v in enumerate(date_information[0].split()) if i in (1, 2, 4)] if date_information else []

    # Step 3: store the date in the dictionary
    emails_dict['date'] = date[1] + ' ' + date[0] + ' ' + date[2] if date else np.nan
        
        
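    ### Get the email subject ###
    # Step 1: get the subject (hint: Subject:)
    pattern = r'(?<=Subject: ).*'
    subject = re.findall(pattern, mail)

    # Step 2: remove unneeded text (the "Subject: " prefix); the lookbehind in Step 1 already leaves it out

    # Step 3: store the subject in the dictionary
    emails_dict['subject'] = subject[0] if subject else np.nan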

    ### Get the email body ###
    # The email package extracts the body here (no need to study it closely; this chapter is about regular expressions)
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except Exception:
        emails_dict["email_body"] = None

    ### Append the dictionary to the list ###
    emails_list.append(emails_dict)
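
As a cross-check on the regex results, the standard library can split a header value into a display name and an address. The sketch below uses `email.utils.parseaddr` on two header values from the sample. It is offered as a way to verify the regex output, not as the method used in this notebook.

```python
from email.utils import parseaddr

# parseaddr returns a (display name, address) tuple.
name, address = parseaddr('"MR. JAMES NGOLA." <james_ngola2002@maktoob.com>')
print(name)
print(address)

# A bare address has no display name, so the first element is empty.
bare = parseaddr('webmaster@aclweb.org')
print(bare)
```

When a header has no display name, `parseaddr` returns an empty string for the name, where the regex approach in the cell above stores `NaN`.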

In [10]:
# Convert the results to a DataFrame
emails_df = pd.DataFrame(emails_list)
print(emails_df)

             sender_name                   sender_emial  \
0       MR. JAMES NGOLA.    james_ngola2002@maktoob.com   
1        Mr. Ben Suleman   bensul2004nng@spinfinder.com   
2     PRINCE OBONG ELEME           obong_715@epatra.com   
3     PRINCE OBONG ELEME           obong_715@epatra.com   
4          Maryam Abacha             m_abacha03@www.com   
...                  ...                            ...   
3972                 NaN  michealagu0255@zipmail.com.br   
3973                 NaN       ali_sherif252@hotmail.fr   
3974                 NaN  drusmanibrahimtg08@hotmail.fr   
3975                 NaN     motherdorisk61@hotmail.com   
3976                 NaN                            NaN   

                               receiver_name_email_pair         date  \
0     [(nan, james_ngola2002@maktoob.com), (nan, web...  30 Oct 2002   
1                                          [(nan, R@M)]  31 Oct 2002   
2     [(nan, obong_715@epatra.com), (nan, webmaster@...  31 Oct 2002   
3  