Challenge Author: Ervin
# Welcome Challenger!

Welcome to the Phishing Section of the div0 Artificial Intelligence Capture-The-Flag Challenge! 

This challenge covers a simplified version of an evasion attack: you are supposed to send an email that forces a false positive in the Machine Learning Model.

Background Information:

This challenge involves forcing a false positive in the Artificial Intelligence, which is generally the less damaging of the two kinds of misclassification in phishing detection. So why are we going through false positives? Because false positives have caused more lost time than the other variant. A false positive can waste investigation time, or cause legitimate work emails to be flagged, which may ultimately cause confusion among workgroups due to communication breakdown. Productivity suffers greatly in such circumstances, which is why we are doing this CTF.
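
To make the term concrete, here is a minimal sketch of how false positives and false negatives could be counted. The `labels` and `predictions` lists are made-up illustration values (1 = phishing, 0 = non-phishing), not output from the challenge model.

```python
# Hypothetical ground-truth labels and model predictions (1 = phishing, 0 = non-phishing).
labels      = [0, 0, 1, 1, 0, 1]
predictions = [1, 0, 1, 0, 0, 1]

# False positive: a normal email that the model flags as phishing.
false_positives = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)

# False negative: a phishing email that the model lets through as normal.
false_negatives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)

print("false positives:", false_positives)
print("false negatives:", false_negatives)
```

In this challenge, your goal is to produce one email that lands in the first group.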

First of all, we are going to go through some rules.

- You, the participant, are required to force a false positive (a normal email that the model classifies as a phishing email) on the Artificial Intelligence.
- Successfully forcing a false positive will cause the flag to be sent to you.
- No prior experience in hacking is required; we will give you hints along the way. (But you are to interpret what they mean!)
- An understanding of Artificial Intelligence, though, will be very helpful, and you should attempt the following modules below before attempting the challenge:
- Hacking Artificial Intelligence is vastly different from conventional hacking, but maybe your methodologies can benefit you?

Challenge:

The Machine Learning Model used in this challenge was built with a poor dataset and has been found to produce false positives for a specific kind of email! Can you find out what this kind of email is?

Now that you have the rules down, we are going to go through the dataset that you are given. It consists of 100 phishing emails and 100 non-phishing emails that are included in the training dataset of the AI. The dataset can be found at the GitHub link, in case you haven't downloaded it yet:
- Link here

Additionally, the imports needed for this challenge include:
- email 
- BeautifulSoup

Now that you have the dataset, you may want to traverse to the directory with the files and open them to do analysis on them, to see the differences between phishing and non-phishing emails.

In [1]:
import os
import email

os.chdir("Dataset")
filenames_phishing = os.listdir("phishing")
filenames_nonphish = os.listdir("non_phishing")

files_phishing = []
files_nonphish = []

for file in filenames_phishing:
    filename = "phishing/" + file
    fp = open(filename, 'rb')
    mail = fp.read()
    msg = email.message_from_bytes(mail)
    files_phishing.append(msg)
    fp.close()

for file in filenames_nonphish:
    filename = "non_phishing/" + file
    fp = open(filename, 'rb')
    mail = fp.read()
    msg = email.message_from_bytes(mail)
    files_nonphish.append(msg)
    fp.close()

Now, we have our files in the list, but what kinds of information can we extract from those files that may be indicative of a phishing email? Let's see... Oh!

- email headers
- attachments
- email body
- email links
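
As a rough illustration of where each of these pieces lives in a parsed message, here is a small sketch using Python's standard `email` module. The raw message below is invented for the example (it is not taken from the dataset), and the variable names are placeholders.

```python
import email

# A tiny hand-written message, used only for illustration.
raw = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Re: Meeting notes\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"\r\n"
    b'<p>See the <a href="https://example.com/notes">notes</a>.</p>\r\n'
)

msg = email.message_from_bytes(raw)

subject = msg["Subject"]               # email headers
content_type = msg.get_content_type()  # tells you whether the body is HTML or plain text
body = msg.get_payload()               # email body (a string for non-multipart messages)

# Attachments: any part that carries a filename.
has_attachment = any(part.get_filename() for part in msg.walk())

print(subject, content_type, has_attachment)
```

The links are not a separate field; they have to be pulled out of the body, which is what the later cells do.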

Let's go on to extracting information on the email subject!

In [2]:
from email.parser import BytesParser, Parser
from email.policy import default

emailheaders_phishing = []
emailheaders_nonphish = []

for file in filenames_phishing:
    filename = "phishing/" + file
    # Use a context manager so each file handle is closed after parsing
    with open(filename, 'rb') as fp:
        headers = BytesParser(policy=default).parse(fp)
    emailheaders_phishing.append(headers)

for file in filenames_nonphish:
    filename = "non_phishing/" + file
    with open(filename, 'rb') as fp:
        headers = BytesParser(policy=default).parse(fp)
    emailheaders_nonphish.append(headers)

Now that we have the email subjects, let's think about certain attributes of an email subject that may lead a machine learning model to believe that it is a phishing email. Only one feature used in the actual model will be given here, so you will have to figure out the other features yourself!

In [3]:
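def subj_reply(header):
    # Guard against emails without a Subject header, like the helpers below
    try:
        subj_reply = header['subject'].lower().startswith("re:")
    except Exception:
        subj_reply = False
    return subj_reply

#CONTINUE FROM HERE
# ADDED 3 MORE EXAMPLE FEATURES
#ADDED ANSWER CODE
def subj_forward(header):
    try:
        subj_forward = header['subject'].lower().startswith("fwd:")
    except Exception:
        # Return a boolean so the column has a single type
        subj_forward = False
    return subj_forward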

def subj_noOfWords(header):
    try:
        subj_noOfWords = len(header['subject'].split())
    except:
        subj_noOfWords = 0
    return subj_noOfWords

def subj_noOfChar(header):
    try:
        subj_noOfChar = len(header['subject'])
    except:
        subj_noOfChar = 0
    return subj_noOfChar

In [4]:
subj_reply_list_phishing = []
subj_reply_list_nonphish = []
subj_forward_phishing = []
subj_forward_nonphish = []
subj_noWords_phishing = []
subj_noWords_nonphish = []
subj_noChars_phishing = []
subj_noChars_nonphish = []

for header in emailheaders_phishing:
    subj_reply_list_phishing.append(subj_reply(header))
for header in emailheaders_nonphish:
    subj_reply_list_nonphish.append(subj_reply(header))
    
'''Go on and be wild'''
"""-----------------"""
#ADDED ANSWER CODE
for header in emailheaders_phishing:
    subj_forward_phishing.append(subj_forward(header))
for header in emailheaders_nonphish:
    subj_forward_nonphish.append(subj_forward(header))
    
for header in emailheaders_phishing:
    subj_noWords_phishing.append(subj_noOfWords(header))
for header in emailheaders_nonphish:
    subj_noWords_nonphish.append(subj_noOfWords(header))
    
for header in emailheaders_phishing:
    subj_noChars_phishing.append(subj_noOfChar(header))
for header in emailheaders_nonphish:
    subj_noChars_nonphish.append(subj_noOfChar(header))

We can also look at whether there is an attachment. This may be another important feature that makes a machine learning model more likely to think that an email is a phishing email. Let's take a look! For this part, you are on your own, so figure out the code yourself!

In [5]:
#ADDED ANSWER CODE
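def CheckAttachment(msg):
    attachment = 0
    for part in msg.walk():
        # Container parts hold other parts and are never attachments themselves
        if part.get_content_maintype() == 'multipart':
            continue
        if part.get('Content-Disposition') is None:
            continue
        # The filename check must run for every part, not only the last one
        fileName = part.get_filename()
        if bool(fileName):
            attachment = 1
    return attachment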

check_attachment_phishing = []
check_attachment_nonphish = []

for file in files_phishing:
    check_attachment_phishing.append(CheckAttachment(file))
for file in files_nonphish:
    check_attachment_nonphish.append(CheckAttachment(file))

Now, let's try to extract the body of the email! Here, you are going to have to figure out the code yourself!

In [6]:
#ADDED ANSWER CODE
def ExtractBody(msg):
    content = ""
    if msg.is_multipart():
        for payload in msg.get_payload():
            content += str(payload.get_payload())
    else:
        content += str(msg.get_payload())
    return content

body_phishing = []
body_nonphish = []

for file in files_phishing:
    body_phishing.append(ExtractBody(file))
for file in files_nonphish:
    body_nonphish.append(ExtractBody(file))

Now, maybe we can look at the several attributes of the email body that may lead a machine learning model to believe that it is a phishing email. Like the one for email subjects, only one feature used in the actual model will be given here, so you will have to figure out the other features by yourself!

In [7]:
def checkSuspension(body_content):
    body_suspension = "suspension" in body_content.lower()
    return body_suspension

#ADDED ANSWER CODE
def checkNoWords(body_content):
    body_NoWords = len(body_content.split())
    return body_NoWords

def checkNoChars(body_content):
    body_NoChars = len(body_content) - body_content.count(' ') - body_content.count('\n')
    return body_NoChars

In [8]:
check_suspension_phishing = []
check_suspension_nonphish = []
#ADDED ANSWER CODE
check_noWords_phishing = []
check_noWords_nonphish = []
check_noChars_phishing = []
check_noChars_nonphish = []

for body in body_phishing:
    check_suspension_phishing.append(checkSuspension(body))
for body in body_nonphish:
    check_suspension_nonphish.append(checkSuspension(body))
'''Go on and be wild'''
"""-----------------"""
#ADDED ANSWER CODE    
for body in body_phishing:
    check_noWords_phishing.append(checkNoWords(body))
for body in body_nonphish:
    check_noWords_nonphish.append(checkNoWords(body))
    
for body in body_phishing:
    check_noChars_phishing.append(checkNoChars(body))
for body in body_nonphish:
    check_noChars_nonphish.append(checkNoChars(body))

We can also look at the email links right? So let's go on and extract the links, before we extract some features of the links, shall we? For extracting links and URLs, you are on your own, but we will give you a hint here.

HINT: Use BeautifulSoup to extract the links and URLs

In [9]:
from bs4 import BeautifulSoup
import re, os, sys

'''EXTRACT LINKS AND URLS'''
#ADDED ANSWER CODE 
def getLinks(body_content):
    links = []
    soup = BeautifulSoup(body_content, "lxml")
    # Collect the href of every anchor tag (None if the tag has no href)
    for link in soup.find_all('a'):
        links.append(link.get("href"))
    return links

def getURLs(body_content):
    urls = re.findall(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", body_content)
    return urls

links_phishing = []
links_nonphish = []
urls_phishing = []
urls_nonphish = []

for body in body_phishing:
    links_phishing.append(getLinks(body))
for body in body_nonphish:
    links_nonphish.append(getLinks(body))
for body in body_phishing:
    urls_phishing.append(getURLs(body))
for body in body_nonphish:
    urls_nonphish.append(getURLs(body))

We now have the links and URLs, so you should extract features from them and figure out which features are important here! One feature will be given here, so take the opportunity and think of other features that may be considered important by the model!

In [10]:
def noOfLinks(links):
    noLinks = len(links)
    return noLinks

#ADDED ANSWER CODE 
def check_img(body_content):
    # Name the parser explicitly, as getLinks does, to avoid the bs4 parser warning
    soup = BeautifulSoup(body_content, "lxml")
    ImgLinks = soup.find_all('img')
    noOfImgLinks = len(ImgLinks)
    return noOfImgLinks

def check_doubleslash(links):
    no_doubleslash = 1
    for link in links:
        xd = str(link)[10:]
        if u'//' in xd:
            no_doubleslash = 0
    return no_doubleslash

nolinks_phishing = []
nolinks_nonphish = []
img_phishing = []
img_nonphish = []
doubleslash_phishing = []
doubleslash_nonphish = []

for body in body_phishing:
    img_phishing.append(check_img(body))
for body in body_nonphish:
    img_nonphish.append(check_img(body))
    
for link in links_phishing:
    nolinks_phishing.append(noOfLinks(link))
    doubleslash_phishing.append(check_doubleslash(link))
for link in links_nonphish:
    nolinks_nonphish.append(noOfLinks(link))
    doubleslash_nonphish.append(check_doubleslash(link))
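
If you are looking for ideas for further link features, here is one possibility, sketched with the standard library only. It counts links whose host is a raw IP address. This is a hypothetical feature for illustration; nothing here implies that the challenge model uses it, and the function names are made up.

```python
import ipaddress
from urllib.parse import urlparse

def link_uses_ip_host(link):
    """Return True if the link's host is a literal IP address (hypothetical feature)."""
    # str() mirrors check_doubleslash, since href values may be None
    host = urlparse(str(link)).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True

def count_ip_links(links):
    """Count how many links in one email point at a raw IP address."""
    return sum(link_uses_ip_host(link) for link in links)

example_links = ["http://192.168.0.1/login", "https://example.com/", None]
print(count_ip_links(example_links))
```

A helper like this can be applied to each entry of `links_phishing` and `links_nonphish` in the same way as `noOfLinks`.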





Now that we have all our features down, maybe we can turn them into a pandas DataFrame so that we can look at the extracted data with the pandas tools?

In [11]:
import numpy
import pandas as pd

'''Add the features gathered from your feature extraction, and the phishing label'''
features = ['subj_reply', 'subj_forward', 'subj_noWords', 'subj_noChars', 'check_suspension', 'check_noWords', 'check_noChars', 'nolinks', 'img', 'doubleslash','Phishing']

'''Conversion to Pandas Dataframe'''
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for x in range(0,100):
    entry_phishing = [subj_reply_list_phishing[x], subj_forward_phishing[x], subj_noWords_phishing[x], subj_noChars_phishing[x], check_suspension_phishing[x], check_noWords_phishing[x], check_noChars_phishing[x], nolinks_phishing[x], img_phishing[x], doubleslash_phishing[x],'1']
    entry_nonphish = [subj_reply_list_nonphish[x], subj_forward_nonphish[x], subj_noWords_nonphish[x], subj_noChars_nonphish[x], check_suspension_nonphish[x], check_noWords_nonphish[x], check_noChars_nonphish[x], nolinks_nonphish[x], img_nonphish[x], doubleslash_nonphish[x],'0']
    # Interleave phishing and non-phishing rows (even index = phishing)
    rows.append(entry_phishing)
    rows.append(entry_nonphish)

# dtype=object keeps every column as object until the explicit conversion below
df = pd.DataFrame(rows, columns = features, dtype = object)

Now, we can take a look at the many features of the pandas dataframe created, and look at the correlation between the features and the probability of an email being a phishing email!

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   subj_reply        200 non-null    object
 1   subj_forward      200 non-null    object
 2   subj_noWords      200 non-null    object
 3   subj_noChars      200 non-null    object
 4   check_suspension  200 non-null    object
 5   check_noWords     200 non-null    object
 6   check_noChars     200 non-null    object
 7   nolinks           200 non-null    object
 8   img               200 non-null    object
 9   doubleslash       200 non-null    object
 10  Phishing          200 non-null    object
dtypes: object(11)
memory usage: 17.3+ KB


EXAMPLE CODE for converting the object dtype to other dtypes:

df = df.convert_dtypes()

In [13]:
df.astype('int32').dtypes

subj_reply          int32
subj_forward        int32
subj_noWords        int32
subj_noChars        int32
check_suspension    int32
check_noWords       int32
check_noChars       int32
nolinks             int32
img                 int32
doubleslash         int32
Phishing            int32
dtype: object

In [14]:
df = df.convert_dtypes()

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   subj_reply        200 non-null    boolean
 1   subj_forward      200 non-null    boolean
 2   subj_noWords      200 non-null    Int64  
 3   subj_noChars      200 non-null    Int64  
 4   check_suspension  200 non-null    boolean
 5   check_noWords     200 non-null    Int64  
 6   check_noChars     200 non-null    Int64  
 7   nolinks           200 non-null    Int64  
 8   img               200 non-null    Int64  
 9   doubleslash       200 non-null    Int64  
 10  Phishing          200 non-null    string 
dtypes: Int64(7), boolean(3), string(1)
memory usage: 15.2 KB


In [16]:
df.dtypes

subj_reply          boolean
subj_forward        boolean
subj_noWords          Int64
subj_noChars          Int64
check_suspension    boolean
check_noWords         Int64
check_noChars         Int64
nolinks               Int64
img                   Int64
doubleslash           Int64
Phishing             string
dtype: object

In [17]:
df.max()

subj_reply            True
subj_forward         False
subj_noWords            17
subj_noChars            94
check_suspension      True
check_noWords         4442
check_noChars       173586
nolinks                 18
img                     65
doubleslash              1
Phishing                 1
dtype: object

In [18]:
df.min()

subj_reply          False
subj_forward        False
subj_noWords            1
subj_noChars            4
check_suspension    False
check_noWords           0
check_noChars           0
nolinks                 0
img                     0
doubleslash             0
Phishing                0
dtype: object

In [19]:
df.loc[df['Phishing'] == '1'] .head(10)

Unnamed: 0,subj_reply,subj_forward,subj_noWords,subj_noChars,check_suspension,check_noWords,check_noChars,nolinks,img,doubleslash,Phishing
0,True,False,8,50,False,57,315,0,0,1,1
2,False,False,5,35,False,1214,40572,7,12,1,1
4,False,False,7,53,False,369,3929,2,2,1,1
6,False,False,9,63,False,46,3478,0,0,1,1
8,False,False,5,44,False,847,17759,14,11,1,1
10,False,False,4,43,False,350,6079,1,0,1,1
12,False,False,6,45,False,980,12656,10,11,1,1
14,False,False,5,28,False,573,5165,1,0,1,1
16,False,False,2,12,False,2262,20016,7,8,1,1
18,False,False,3,23,False,829,8589,10,11,1,1


In [20]:
df.loc[df['Phishing'] == '0'] .head(10)

Unnamed: 0,subj_reply,subj_forward,subj_noWords,subj_noChars,check_suspension,check_noWords,check_noChars,nolinks,img,doubleslash,Phishing
1,True,False,4,24,False,225,1329,0,0,1,0
3,True,False,4,24,False,143,796,0,0,1,0
5,True,False,5,23,False,62,277,0,0,1,0
7,False,False,5,36,False,104,758,0,0,1,0
9,False,False,6,47,False,59,555,0,0,1,0
11,True,False,4,24,False,352,1922,0,0,1,0
13,False,False,6,38,False,77,577,0,0,1,0
15,True,False,8,47,False,155,884,0,0,1,0
17,False,False,8,45,False,175,874,0,0,1,0
19,False,False,4,25,False,89,554,0,0,1,0


In [24]:
phishing = df.loc[df['Phishing'] == '1']
phishing["nolinks"].value_counts()

1     22
0     20
2     13
10    11
3      8
7      6
5      5
11     3
4      2
8      2
13     2
14     2
18     2
9      1
12     1
Name: nolinks, dtype: Int64

In [25]:
normal = df.loc[df['Phishing'] == '0']
normal["nolinks"].value_counts()

0    99
2     1
Name: nolinks, dtype: Int64
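
One way to put a number on comparisons like these is a correlation check against the label. The sketch below uses a small made-up DataFrame called `toy` rather than the real `df`, so the values are purely illustrative; the same calls should apply to `df` once its columns are converted to integers.

```python
import pandas as pd

# Toy stand-in for the notebook's DataFrame (values invented for illustration).
toy = pd.DataFrame({
    "subj_reply": [True, False, False, True, False, False],
    "nolinks":    [0, 7, 2, 0, 0, 10],
    "Phishing":   ["0", "1", "1", "0", "0", "1"],
})

# Cast everything to integers so that a Pearson correlation can be computed.
numeric = toy.astype("int64")

# Correlation of every feature with the label, without the label itself.
corr = numeric.corr()["Phishing"].drop("Phishing")

# Rank features by the strength of the relationship, ignoring its sign.
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```

A strongly positive value suggests that the feature tends to go with phishing emails, and a strongly negative value suggests the opposite. Correlation is only a rough guide, since the model may combine features in ways that a single number does not capture.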

Now that you have done the correlation check, do you know which features generally characterise a phishing email?

Now let's use these features against the machine learning model and try to break it by creating a false positive, where the model thinks a non-phishing email is a phishing email.

One method you can try is to craft the email with Thunderbird.

Once your email is crafted, it will be sent to a server to test whether it succeeds in breaking the machine learning model!

Link: ???? (TO BE ADDED)

In [22]:
from IPython.display import IFrame
IFrame("https://ctf-crisis-2.herokuapp.com/challenge1", width=800, height=300)

BONUS QUESTION:

Are you able to determine the most important feature to the Machine Learning Model?
(HINT: It is one of the features that we have shown you!)

In [23]:
from IPython.display import IFrame
IFrame("https://ctf-crisis-2.herokuapp.com/challenge1_p2", width=800, height=300)