In [1]:
import re
import email as email_lib
import pandas as pd

# Loading the file

In [2]:
# Read the whole file into memory; the context manager closes the handle afterwards
with open("fradulent_emails.txt", "r") as f:
    textFile = f.read()

# Creating an initial split of the file
Most emails appear to start with "From" at the beginning of a line, and that same line also carries a date and time.

In [3]:
# Raw string so that \d reaches the regex engine without an invalid-escape warning
emails_level_1 = re.split(r"\nFrom .*?\d{2}:\d{2}:\d{2}.*", textFile)

# Checking the count
print( "Emails found: " +str(len(emails_level_1)) )

Emails found: 3978
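
To make the behaviour of this separator easier to see, here is a minimal sketch on a made-up two-message string (the sample text is hypothetical, not taken from the dataset). It suggests two things worth keeping in mind: the first chunk keeps its own "From " line, because nothing precedes it to supply the leading newline, and every later chunk starts with a newline.

```python
import re

# Hypothetical two-message sample in mbox style (not from the dataset).
sample = (
    "From r  Wed Oct 30 21:41:56 2002\n"
    "Subject: first\n"
    "\n"
    "body one\n"
    "\n"
    "From r  Thu Oct 31 08:11:39 2002\n"
    "Subject: second\n"
    "\n"
    "body two\n"
)

# Same separator as in the cell above: a line that starts with "From "
# and contains an hh:mm:ss time somewhere on it.
separator = r"\nFrom .*?\d{2}:\d{2}:\d{2}.*"
parts = re.split(separator, sample)

print(len(parts))
print(repr(parts[0][:20]))  # first chunk still begins with its "From " line
print(repr(parts[1][:20]))  # later chunks begin with a newline
```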


# Checking whether the first-level chunks still contain emails that were not separated

Let's try using one of the header metadata fields as an identifier.

In [4]:
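# Function that performs the check

def checkSubEmail( identifier , emailList ):
    candidateSubs = list()
    
    # Escape the identifier so characters such as "-" are matched literally
    regexPattern = "\n" + re.escape(identifier)

    for email in emailList:
        reResult = re.findall(regexPattern, email)

        # More than one occurrence suggests several messages in the same chunk
        if len(reResult) > 1:
            candidateSubs.append( email )
            
    return candidateSubs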
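
As a rough illustration of what this check counts, the sketch below uses a hypothetical helper, `count_header_lines`, on three made-up chunks. It is not part of the notebook's pipeline; it only shows that a header name is counted when it starts a line, so an indented, quoted `Subject:` is not counted.

```python
import re

def count_header_lines(identifier, raw_email):
    # Count the lines that begin with the given header name (hypothetical helper).
    return len(re.findall(r"\n" + re.escape(identifier), raw_email))

# Made-up chunks for illustration only.
single = "\nSubject: a\nFrom: x@example.com\n\nbody\n"
nested = "\nSubject: outer\n\nforwarded text\nSubject: inner\n\nbody\n"
quoted = "\nSubject: held\n\n    Subject: quoted in the body\n"

counts = [count_header_lines("Subject:", chunk) for chunk in (single, nested, quoted)]
print(counts)
```

Only the `nested` chunk would be flagged as a sub-email candidate under the "more than one occurrence" rule.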

In [5]:
# Helper to make debugging easier
def debug_checkSubEmail( identifier , emailList ):
    candidateSubs = checkSubEmail( identifier , emailList)
    print( 'Splitting by "' +str(identifier) +'":\n\tSub-email candidates: ' + str(len( candidateSubs ) ) +"\n" )
    return

### Using the functions above, we look at what could serve as a secondary email separator

In [6]:
debug_checkSubEmail( "Subject:" , emails_level_1 )
debug_checkSubEmail( "From:" , emails_level_1 )
debug_checkSubEmail( "Status:" , emails_level_1 )
debug_checkSubEmail( "Return-Path:" , emails_level_1 )
debug_checkSubEmail( "Message-Id:" , emails_level_1 )

Splitting by "Subject:":
	Sub-email candidates: 25

Splitting by "From:":
	Sub-email candidates: 149

Splitting by "Status:":
	Sub-email candidates: 1

Splitting by "Return-Path:":
	Sub-email candidates: 60

Splitting by "Message-Id:":
	Sub-email candidates: 10



# Analysis

### After inspecting a sample of the cases above, we noticed that many of them are messages that originated in another system and were then sent by email, which explains the similar-looking headers (two examples follow)

In these examples we also notice another structure in common: "Content-Type: text/plain;"

In [7]:
subs = checkSubEmail( "Message-Id:" , emails_level_1 )
print(subs[1])


Return-Path: <apache@nsi25.miniserver.de>
X-Sieve: CMU Sieve 2.3
Date: Tue, 9 Jan 2007 23:04:09 -0500
Message-Id: <200701100404.l0A449IR029401@brazil.mail.UM>
Subject: I WILL ASSIST YOU GET YOUR FUNDS
From: DAVID MARK <"CENTRALBANK_ FOREGINREMINTANCE.COM"@nsi25.miniserver.de>
Reply-To: david02_mark@sify.com
MIME-Version: 1.0
Status: O

Content-Type: text/plain

Message-Id: <20070110035833.DA5848D1DFD@nsi25.miniserver.de>
Date: Wed, 10 Jan 2007 04:58:33 +0100 (CET)
Content-Transfer-Encoding: quoted-printable

=0D
FROM: =0D
DR DAVID  MARK     PRIVATE AND CONFIDENTIAL=0D
      # 3 Onyeama Street, Zone 2 =0D
      Wuse Abuja.                =0D
 =0D
My Dear, =0D
=0D
I know that this letter will come to you with a great suprise, my name =0D
is DR =0D
DR DAVID  MARK chairman consultative monetary panel, and this is in =0D
regards to your outstanding payment.=0D
 =0D
I took my time to carrying out a proper verification exercise on this =0D
subject matter and all the complications, which was 

In [8]:
subs = checkSubEmail( "Subject:" , emails_level_1 )
print(subs[1])


Return-Path: <mailman-bounces@krusty.si.UM>
X-Original-To: ilist-owner@lists.si.UM
Delivered-To: ilist-owner@lists.si.UM
	Tue, 18 May 2004 17:17:00 -0400 (EDT)
Subject: Ilist post from drallo_wd4@yahoo.com requires approval
From: ilist-owner@krusty.si.UM
To: ilist-owner@krusty.si.UM
MIME-Version: 1.0
Message-ID: <mailman.23.1084915016.1197.ilist@lists.si.UM>
Date: Tue, 18 May 2004 17:16:56 -0400
Precedence: bulk
X-BeenThere: ilist@lists.si.UM
X-Mailman-Version: 2.1.3
List-Id: <ilist.lists.si.UM>
X-List-Administrivia: yes
Sender: mailman-bounces@krusty.si.UM
Errors-To: mailman-bounces@krusty.si.UM
Status: RO

Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

As list administrator, your authorization is requested for the
following mailing list posting:

    List:    Ilist@lists.si.UM
    From:    drallo_wd4@yahoo.com
    Subject: urgent response needed
    Reason:  Post by non-member to a members-only list

At your convenience, visit:

    h

# Partial Conclusion

### It is not clear that we should split these emails any further, so we will test 2 results: one using only the level already computed, and one subdividing a little more

In [9]:
# functions to extract the DataFrame we want

def findEmailSubject( emailRaw ):
    result = re.search("\nSubject:(?P<subject>.*)", emailRaw)

    if result is not None:
        s_subject =  result.group("subject").strip()

    else:
        s_subject = None
        
    return s_subject


def findEmailBody( emailRaw ):
    full_email = email_lib.message_from_string( emailRaw )
    body = full_email.get_payload()

    # Keep only what appears after the first instance of "Status:"
    result = re.search("\nStatus:[^\n]*(?P<content>.*)", body , re.DOTALL )
    if result is not None:
        content =  result.group("content").strip()

    else:
        # If that search failed, keep the whole email
        content = body.strip()
        
    return content


def buildEmailsDF( emailsRaw ):
    # Build the list of dictionaries
    emails = list()
    for item in emailsRaw:
        emails_dict = {}

        # Step 1: find the Subject
        emails_dict["subject"] = findEmailSubject(item)

        # Step 2: find the email body
        emails_dict["content"] = findEmailBody(item)
        
        # Store it
        emails.append(emails_dict)
        
    # Build the DataFrame
    emails_df = pd.DataFrame(emails)

    return emails_df
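
The sketch below, on a made-up chunk, illustrates why the body extraction has to look for "Status:" itself. Each chunk produced by the split starts with a blank line, and as far as the standard `email` parser is concerned a blank line ends the header block, so the parsed message ends up with no headers and everything lands in the payload. This is an illustration under that assumption, not a description of every message in the dataset.

```python
import email
import re

# Made-up chunk shaped like the ones produced by the first-level split:
# it starts with a newline, followed by headers, a blank line and the body.
raw = (
    "\nReturn-Path: <a@example.com>\n"
    "Subject: hi\n"
    "Status: O\n"
    "\n"
    "Hello there\n"
)

msg = email.message_from_string(raw)
header_names = msg.keys()   # expected to be empty because of the leading blank line
body = msg.get_payload()    # the header lines end up inside the payload

# Same idea as findEmailBody: keep what follows the first "Status:" line.
match = re.search(r"\nStatus:[^\n]*(?P<content>.*)", body, re.DOTALL)
content = match.group("content").strip() if match else body.strip()

print(header_names)
print(content)
```

Stripping the leading newline before parsing (for example with `raw.lstrip("\n")`) would let the parser see the headers, at the cost of changing how the rest of this notebook extracts the body.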

# Case 1: No subdivisions

In [10]:
caso1_df = buildEmailsDF( emails_level_1 )
caso1_df.head()

|   | content | subject |
|---|---------|---------|
| 0 | FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2... | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP |
| 1 | Dear Friend,\n\nI am Mr. Ben Suleman a custom ... | URGENT ASSISTANCE /RELATIONSHIP (P) |
| 2 | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... | GOOD DAY TO YOU |
| 3 | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... | GOOD DAY TO YOU |
| 4 | Dear sir, \n \nIt is with a heart full of hope... | I Need Your Assistance. |


### Examples of content

In [11]:
print( caso1_df["content"][0] )

FROM:MR. JAMES NGOLA.
CONFIDENTIAL TEL: 233-27-587908.
E-MAIL: (james_ngola2002@maktoob.com).

URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.


DEAR FRIEND,

I AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.


THE INCIDENT OCCURRED IN OUR PRESENCE WHILE WE WERE HOLDING MEETING WITH HIS EXCELLENCY OVER THE FINANCIAL RETURNS FROM THE DIAMOND SALES IN THE AREAS CONTROLLED BY (D.R.C.) DEMOCRATIC REPUBLIC OF CONGO FORCES AND THEIR FOREIGN ALLIES ANGOLA AND ZIMBABWE, HAVING RECEIVED THE PREVIOUS DAY (USD$100M) ONE HUNDRED MILLION UNITED STATES DOLLARS, CASH IN THREE DIPLOMATIC BOXES ROUTED THROUGH ZIMBABWE.

MY PURPOSE OF WRITING YOU THIS LETTER IS TO SOLICIT FOR YOUR ASSISTANCE AS TO BE A COVER TO THE FUND AND ALSO COLLABORATION IN MOVING THE SAID FUND INTO YOUR BANK ACCOUNT THE SUM OF (USD$25M) TWENTY FIVE MILLION UNITED STATES DOLLARS ONLY, WHICH I DEPOSITED WITH A SECURITY COMPANY IN GH

In [12]:
print( caso1_df["content"][1] )

Dear Friend,

I am Mr. Ben Suleman a custom officer and work as Assistant controller of the Customs and Excise department Of the Federal Ministry of Internal Affairs stationed at the Murtala Mohammed International Airport, Ikeja, Lagos-Nigeria.

After the sudden death of the former Head of state of Nigeria General Sanni Abacha on June 8th 1998 his aides and immediate members of his family were arrested while trying to escape from Nigeria in a Chartered jet to Saudi Arabia with 6 trunk boxes Marked "Diplomatic Baggage". Acting on a tip-off as they attempted to board the Air Craft,my officials carried out a thorough search on the air craft and discovered that the 6 trunk boxes contained foreign currencies amounting to US$197,570,000.00(One Hundred and  Ninety-Seven Million Five Hundred Seventy Thousand United States Dollars).

I declared only (5) five boxes to the government and withheld one (1) in my custody containing the sum of (US$30,000,000.00) Thirty Million United States Dollars O

# Case 2: Subdividing by "Content-Type: text/plain;"

In [13]:
# functions to extract the DataFrame we want

def findSubEmails( emailRaw ):
    emails_dict = {}
    
    # Step 1: find the Subject
    emails_dict["subject"] = findEmailSubject( emailRaw )

    # Step 2: find the body
    full_email = email_lib.message_from_string( emailRaw )
    body = full_email.get_payload()

    # Try to keep only what appears after the first instance of "Status:"
    result = re.search("\nStatus:[^\n]*(?P<content>.*)", body , re.DOTALL )
    if result is not None:
        emails_dict["content"] = result.group("content").strip()
        return emails_dict

    # Try to work out where the headers end, i.e. the first line without ":"
    # (note: [^:]* also matches newlines, so the match can span several lines)
    result = re.search("\n[^:]*\n(?P<content>.*)", body , re.DOTALL )
    if result is not None:
        emails_dict["content"] = result.group("content").strip()
        return emails_dict
    
    # If everything else fails, return the body as is
    emails_dict["content"] = body.strip()  
    return emails_dict


def buildEmailsDF_2( emailsRaw ):
    # Build the list of dictionaries
    emails = list()
    for item in emailsRaw:
        emails_dict = {}
        # Appended before being filled in; the dict is mutated in place below
        emails.append(emails_dict)

        # Step 1: find the Subject
        emails_dict["subject"] = findEmailSubject(item)

        # Step 2: find the email body
        emailContent = findEmailBody(item)

        # Step 3: does the body contain "Content-Type: text/plain;" ?
        subEmails = re.split( "Content-Type: .*?text/plain.*" , emailContent )
        
        if len(subEmails) == 1:
            emails_dict["content"] = emailContent
        else:
            emails_dict["content"] = subEmails[0]
            
            # Subdivide the email
            for i in range(1,len(subEmails)):
                sub_emails_dict = findSubEmails( subEmails[i] )
                emails.append( sub_emails_dict )
            
    # Build the DataFrame
    emails_df = pd.DataFrame(emails)

    return emails_df
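
One consequence of splitting on the "Content-Type" line is worth seeing on a small, made-up body: when the text begins with that header, the first piece returned by `re.split` is an empty string, and a "Content-Type: message/rfc822" line is not treated as a separator. The sample below is hypothetical and only sketches that behaviour.

```python
import re

# Made-up body that starts with a text/plain header and embeds a second one.
content = (
    'Content-Type: text/plain; charset="us-ascii"\n'
    "MIME-Version: 1.0\n"
    "\n"
    "notice text\n"
    "\n"
    "Content-Type: message/rfc822\n"
    "\n"
    "Subject: inner\n"
    "Content-Type: text/plain; charset=iso-8859-1\n"
    "\n"
    "inner body\n"
)

# Same separator as in buildEmailsDF_2.
pieces = re.split(r"Content-Type: .*?text/plain.*", content)

print(len(pieces))
print(repr(pieces[0]))          # empty: nothing precedes the first separator
print(pieces[2].strip())
```

In `buildEmailsDF_2` that empty first piece becomes the `content` of the parent row, which is one way a row with empty content can appear.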

### Case 2: Result

In [14]:
caso2_df = buildEmailsDF_2( emails_level_1 )
caso2_df.head()

|   | content | subject |
|---|---------|---------|
| 0 | FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2... | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP |
| 1 | Dear Friend,\n\nI am Mr. Ben Suleman a custom ... | URGENT ASSISTANCE /RELATIONSHIP (P) |
| 2 | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... | GOOD DAY TO YOU |
| 3 | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... | GOOD DAY TO YOU |
| 4 | Dear sir, \n \nIt is with a heart full of hope... | I Need Your Assistance. |


### Checking the difference in size

In [15]:
print( "DataFrame size WITHOUT subdivisions: " +str(caso1_df.shape[0]) )
print( "DataFrame size WITH subdivisions: " +str(caso2_df.shape[0]) )

DataFrame size WITHOUT subdivisions: 3978
DataFrame size WITH subdivisions: 4846


# Inspecting a case where the subdivision happened

In [16]:
# From the example found earlier...
subs = checkSubEmail( "Subject:" , emails_level_1 )
emails_check = list()
emails_check.append( subs[1] )

# Run both pipelines
check_df_1 = buildEmailsDF( emails_check )
check_df_2 = buildEmailsDF_2( emails_check )

### WITHOUT subdivision
Recall that we took only 1 email as an example

In [17]:
check_df_1.shape

(1, 2)

In [18]:
# Viewing the "before" version
print(check_df_1['content'][0])

Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

As list administrator, your authorization is requested for the
following mailing list posting:

    List:    Ilist@lists.si.UM
    From:    drallo_wd4@yahoo.com
    Subject: urgent response needed
    Reason:  Post by non-member to a members-only list

At your convenience, visit:

    http://lists.si.UM/mailman/admindb/ilist
        
to approve or deny the request.

Content-Type: message/rfc822
MIME-Version: 1.0

Return-Path: <drallo_wd4@yahoo.com>
X-Original-To: ilist@krusty.si.UM
Delivered-To: ilist@si.UM
Message-ID: <20040518211651.47848.qmail@web41206.mail.yahoo.com>
From: =?iso-8859-1?q?william=20drallo?= <drallo_wd4@yahoo.com>
Subject: urgent response needed
To: drallo_wd4@yahoo.com
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit

                     MR WILLIAM DRALLO
                     BANQUE TOGOLAISE DE
                     DEVELOPPE

### WITH subdivision
Recall that we took only 1 email as an example, which has now been split into 4

In [19]:
check_df_2.shape

(4, 2)

In [20]:
print(check_df_2['content'][0])




In [21]:
print(check_df_2['content'][1])

following mailing list posting:

    List:    Ilist@lists.si.UM
    From:    drallo_wd4@yahoo.com
    Subject: urgent response needed
    Reason:  Post by non-member to a members-only list

At your convenience, visit:

    http://lists.si.UM/mailman/admindb/ilist
        
to approve or deny the request.

Content-Type: message/rfc822
MIME-Version: 1.0

Return-Path: <drallo_wd4@yahoo.com>
X-Original-To: ilist@krusty.si.UM
Delivered-To: ilist@si.UM
Message-ID: <20040518211651.47848.qmail@web41206.mail.yahoo.com>
From: =?iso-8859-1?q?william=20drallo?= <drallo_wd4@yahoo.com>
Subject: urgent response needed
To: drallo_wd4@yahoo.com
MIME-Version: 1.0
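
The piece printed above begins at "following mailing list posting:", whereas the unsplit content begins one line earlier, at "As list administrator, ...". A likely explanation is the header-end pattern `\n[^:]*\n`: because `[^:]` also matches newlines, the match can swallow the first body line. The sketch below reproduces this on a shortened copy of that text and contrasts it with a hypothetical alternative that stops at the first blank line; the alternative is an assumption for illustration, not something the notebook uses.

```python
import re

# Shortened copy of the text that follows the first "Content-Type" separator.
chunk = (
    "MIME-Version: 1.0\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "As list administrator, your authorization is requested for the\n"
    "following mailing list posting:\n"
    "\n"
    "    List:    Ilist\n"
)

# Pattern used in findSubEmails: [^:]* is free to cross line boundaries.
greedy = re.search(r"\n[^:]*\n(?P<content>.*)", chunk, re.DOTALL)
greedy_first_line = greedy.group("content").splitlines()[0]

# Hypothetical alternative: treat the first blank line as the end of the headers.
blank = re.search(r"\n[ \t]*\n(?P<content>.*)", chunk, re.DOTALL)
blank_first_line = blank.group("content").splitlines()[0]

print(greedy_first_line)
print(blank_first_line)
```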


In [22]:
print(check_df_2['content'][2])

http://promo.yahoo.com/sbc/

Content-Type: message/rfc822
MIME-Version: 1.0


In [23]:
print(check_df_2['content'][3])

If you reply to this message, keeping the Subject: header intact,
Mailman will discard the held message.  Do this if the message is
spam.  If you reply to this message and include an Approved: header
with the list password in it, the message will be approved for posting
to the list.  The Approved: header can also appear in the first line
of the body of the reply.


# Conclusion

### Subdividing by "Content-Type: text/plain;" does not look like a good idea. There may be a better pattern for subdividing the emails; here we only wanted to explore how to do it with regular expressions.
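
For comparison only, here is a minimal sketch of a non-regex route: letting the standard `email` parser walk the MIME structure and collecting the `text/plain` parts. It is not the approach used in this notebook, and it assumes a well-formed multipart message whose boundary lines are intact. The excerpts printed above do not show any boundary lines, so this may not apply to the dataset as it stands. The sample message and the `BOUND` boundary are made up.

```python
import email

# Made-up multipart message; it starts with a newline like the split chunks do.
raw = (
    "\nSubject: Post requires approval\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="BOUND"\n'
    "\n"
    "--BOUND\n"
    'Content-Type: text/plain; charset="us-ascii"\n'
    "\n"
    "As list administrator, your authorization is requested.\n"
    "--BOUND\n"
    "Content-Type: message/rfc822\n"
    "\n"
    "Subject: urgent response needed\n"
    "Content-Type: text/plain; charset=iso-8859-1\n"
    "\n"
    "MR WILLIAM DRALLO\n"
    "--BOUND--\n"
)

# Drop the leading newline so the parser sees the header block.
msg = email.message_from_string(raw.lstrip("\n"))

texts = []
for part in msg.walk():
    # walk() visits the container, each part, and the embedded message in turn.
    if part.get_content_type() == "text/plain":
        texts.append((part.get("Subject"), part.get_payload().strip()))

for subject, text in texts:
    print(subject, "|", text)
```

With this structure the notice and the embedded message come out as two separate text parts, and only the embedded one carries its own Subject.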

# Best result: CASE 1