Challenge Author: Ervin
# Welcome Challenger!

Welcome to the Phishing Section of the div0 Artificial Intelligence Capture-The-Flag Challenge! 

This challenge covers a simplified version of an evasion attack: you will send an email crafted to force a false positive in the Machine Learning Model.

Background Information:

This challenge involves forcing a false positive in the Artificial Intelligence, which is generally the less damaging of the two forms of attack on a phishing detector. So why are we going through false positives? Because false positives have caused more time loss than the other variant. A false positive can waste investigation time, or cause legitimate work emails to be flagged, which may ultimately confuse workgroups as communication breaks down. In short, productivity suffers greatly in such circumstances, and that is why we are doing this CTF.

First of all, we are going to go through some rules.

- You, the participant, are required to force a false positive (a normal email that the model classifies as a phishing email) on the Artificial Intelligence.
- Successfully forcing a false positive will cause the flag to be sent to you.
- No prior experience in hacking is required; we will give you hints along the way. (But you have to interpret what they mean!)
- An understanding of Artificial Intelligence will be very helpful, though, and you should attempt the following modules before attempting the challenge:
- Hacking Artificial Intelligence is vastly different from conventional hacking, but maybe your methodologies can still benefit you?

Challenge:

The Machine Learning Model used in this challenge was built with a poor dataset and has been found to produce false positives for a specific kind of email! Can you find out what kind of email this is?

Now that you have the rules down, let's go through the dataset you are given. It consists of 100 phishing emails and 100 non-phishing emails that are included in the training dataset of the AI. The dataset can be found at the GitHub link, in case you haven't downloaded it yet:
- Link here

Additionally, the Python libraries you need to import for this challenge include:
- email 
- BeautifulSoup (now bs4)

Now that you have the dataset, you may want to traverse to the directory with the files and open them to do analysis on them, to see the differences between phishing and non-phishing emails.

In [None]:
import os
import email

# Work from the dataset directory so the relative paths below resolve
os.chdir("../Dataset")
filenames_phishing = os.listdir("phishing")
filenames_nonphish = os.listdir("non_phishing")

files_phishing = []
files_nonphish = []

# Parse every phishing email into a Message object
for file in filenames_phishing:
    filename = os.path.join("phishing", file)
    # The with-block closes the file even if parsing fails
    with open(filename, 'rb') as fp:
        msg = email.message_from_bytes(fp.read())
    files_phishing.append(msg)

# Parse every non-phishing email the same way
for file in filenames_nonphish:
    filename = os.path.join("non_phishing", file)
    with open(filename, 'rb') as fp:
        msg = email.message_from_bytes(fp.read())
    files_nonphish.append(msg)
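
To get a feel for what each parsed message holds, it can help to inspect a single one. The sketch below parses a small hand-written message (the bytes in `raw` are made up for illustration and are not taken from the dataset) the same way as the loops above, then reads a few headers from it.

```python
import email

# A made-up raw message, standing in for the bytes read from one dataset file
raw = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Re: Meeting notes\r\n"
    b"\r\n"
    b"See you at 3pm.\r\n"
)

msg = email.message_from_bytes(raw)

# Header lookup is case-insensitive and returns None when a header is missing
print(msg["subject"])      # Re: Meeting notes
print(msg["X-Not-There"])  # None

# keys() lists the header names in the order they appear in the message
print(msg.keys())          # ['From', 'To', 'Subject']

# is_multipart() tells you whether the message is split into several parts
print(msg.is_multipart())  # False
```

The same calls work on any entry of `files_phishing` or `files_nonphish`.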

Now, we have our files in the list, but what kinds of information can we extract from those files that may be indicative of a phishing email? Let's see... Oh!

- email headers
- attachments
- email body
- email links

Let's start by extracting the email headers, which include the email subject!

In [None]:
from email.parser import BytesParser, Parser
from email.policy import default

emailheaders_phishing = []
emailheaders_nonphish = []

# Parse each file with the default policy so headers come back as decoded strings
for file in filenames_phishing:
    filename = "phishing/" + file
    # The with-block makes sure every file handle is closed after parsing
    with open(filename, 'rb') as fp:
        headers = BytesParser(policy=default).parse(fp)
    emailheaders_phishing.append(headers)

for file in filenames_nonphish:
    filename = "non_phishing/" + file
    with open(filename, 'rb') as fp:
        headers = BytesParser(policy=default).parse(fp)
    emailheaders_nonphish.append(headers)

Now that we have the email headers, including the subjects, let's think about attributes of an email subject that may lead a machine learning model to believe that it is a phishing email... Only one feature used in the actual model is given here, so you will have to figure out the other features yourself!

In [None]:
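def subj_reply(header):
    # A missing Subject header comes back as None, so fall back to an empty string
    subject = header['subject'] or ""
    subj_reply = subject.lower().startswith("re:")
    return subj_reply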
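
As a quick sanity check, the feature can be tried on a few hand-made messages before running it over the dataset. This is only a sketch: the three messages are invented, and the function is a None-safe variant of `subj_reply` repeated here so that the snippet runs on its own.

```python
from email import message_from_bytes
from email.policy import default

def subj_reply(header):
    # Treat a missing Subject header as an empty subject
    subject = header['subject'] or ""
    return subject.lower().startswith("re:")

# Invented messages: a reply, a fresh email, and one without a Subject header
reply = message_from_bytes(b"Subject: RE: Invoice\r\n\r\nbody\r\n", policy=default)
fresh = message_from_bytes(b"Subject: Invoice\r\n\r\nbody\r\n", policy=default)
no_subject = message_from_bytes(b"From: a@example.com\r\n\r\nbody\r\n", policy=default)

print(subj_reply(reply))       # True (the check is case-insensitive)
print(subj_reply(fresh))       # False
print(subj_reply(no_subject))  # False
```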

#CONTINUE FROM HERE

In [None]:
subj_reply_list_phishing = []
subj_reply_list_nonphish = []

for header in emailheaders_phishing:
    subj_reply_list_phishing.append(subj_reply(header))
for header in emailheaders_nonphish:
    subj_reply_list_nonphish.append(subj_reply(header))
    
'''Go on and be wild'''
"""-----------------"""

We can also look at whether the email has an attachment. This may be another important feature that makes a machine learning model more likely to think that an email is a phishing email. Let's take a look! For this part you are on your own, so figure out the code yourself!
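
If you are unsure where to start, here is one possible sketch, shown on two toy messages rather than on the dataset. The helper name `has_attachment` is made up for this example. It relies on `iter_attachments()`, which exists on `EmailMessage` objects, meaning messages created or parsed with `policy=default`. Messages parsed without a policy are legacy `Message` objects and do not have this method.

```python
from email.message import EmailMessage

def has_attachment(msg):
    """Return True if the message carries at least one attachment part."""
    return any(True for _ in msg.iter_attachments())

# Toy message with a text body only
plain = EmailMessage()
plain["Subject"] = "No attachment"
plain.set_content("Just text.")

# Toy message with a text body plus a small binary attachment
with_file = EmailMessage()
with_file["Subject"] = "With attachment"
with_file.set_content("See the attached file.")
with_file.add_attachment(b"hello", maintype="application",
                         subtype="octet-stream", filename="note.bin")

print(has_attachment(plain))      # False
print(has_attachment(with_file))  # True
```

Whether this matches how the actual model defines the attachment feature is not stated in this notebook, so treat it as a starting point.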

Now, let's try to extract the body of the email! Here, you are going to have to figure out the code yourself!
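
One way to approach this, sketched on a made-up message, is to let the `email` package pick the main body part with `get_body()` and decode it with `get_content()`. Both methods require a message parsed with `policy=default`. The helper name `extract_body` is hypothetical, and real emails with broken encodings may need extra error handling that is left out here.

```python
from email import message_from_bytes
from email.policy import default

# A made-up message, standing in for one email from the dataset
raw = (
    b"From: alice@example.com\r\n"
    b"Subject: Hello\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Your account is fine.\r\n"
)
msg = message_from_bytes(raw, policy=default)

def extract_body(msg):
    """Return the text of the main body part, or '' if none is found."""
    # Prefer a plain-text part and fall back to HTML if there is none
    part = msg.get_body(preferencelist=("plain", "html"))
    return part.get_content() if part is not None else ""

body = extract_body(msg)
print(body.strip())  # Your account is fine.
```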

Now, maybe we can look at the several attributes of the email body that may lead a machine learning model to believe that it is a phishing email. Like the one for email subjects, only one feature used in the actual model will be given here, so you will have to figure out the other features by yourself!

In [None]:
def checkSuspension(body_content):
    body_suspension = "suspension" in body_content.lower()
    return body_suspension

In [None]:
check_suspension_phishing = []
check_suspension_nonphish = []

for body in __________________:
    check_suspension_phishing.append(checkSuspension(body))
for body in __________________:
    check_suspension_nonphish.append(checkSuspension(body))

'''Go on and be wild'''
"""-----------------"""

We can also look at the email links, right? So let's extract the links first, and then extract some features from them. For extracting links and URLs you are on your own, but here is a hint.

HINT: Use BeautifulSoup to extract the links and URLs

In [None]:
from bs4 import BeautifulSoup

'''EXTRACT LINKS AND URLS'''
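
To illustrate the hint, the sketch below runs BeautifulSoup over a made-up HTML snippet and collects the `href` of every `<a>` tag. The HTML is invented for this example. With the dataset, you would pass in the HTML body of each email instead.

```python
from bs4 import BeautifulSoup

# A made-up HTML body, standing in for the HTML part of an email
html = """
<html><body>
  <p>Please <a href="http://example.com/login">verify your account</a>.</p>
  <a href="https://example.org/help">Help</a>
  <a name="no-href">anchor without a link</a>
</body></html>
"""

# html.parser is the parser bundled with Python, so no extra install is needed
soup = BeautifulSoup(html, "html.parser")

# Keep only the <a> tags that actually carry an href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)  # ['http://example.com/login', 'https://example.org/help']
```

Plain-text emails have no `<a>` tags, so URLs written directly in the text would need a different approach.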

We now have the links and URLs, so the next step is to extract features from them and figure out which features are important! One feature is given here, so take the opportunity to think of other features that the model may consider important!

In [None]:
def noOfLinks(links):
    noLinks = len(links)
    return noLinks

Now that we have all our features down, maybe we can turn them into a pandas DataFrame so that we can look at the extracted data with the pandas tools?

In [None]:
import numpy
import pandas as pd

'''Add the features gathered from your feature extraction, and the phishing label'''
features = ['''FEATURES''','Phishing']

'''Conversion to Pandas Dataframe'''
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for x in range(0,100):
    # The labels are integers so that mean() works in the correlation check
    entry_phishing = ['''ADD YOUR FEATURES HERE''',1]
    entry_nonphish = ['''ADD YOUR FEATURES HERE''',0]
    rows.append(entry_phishing)
    rows.append(entry_nonphish)

df = pd.DataFrame(rows, columns = features)

Now, we can take a look at the many features of the pandas dataframe created, and look at the correlation between the features and the probability of an email being a phishing email!

One example of using pandas groupby to do correlation checks:
- df[['subjReply', 'Phishing']].groupby(['subjReply'], as_index=False).mean()
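
To show what such a check returns, here is a small sketch on invented data. The frame `toy` and its values are made up for illustration, and the column name `subjReply` simply follows the example above. Because the label is 0 or 1, the mean of `Phishing` within each group is the share of phishing emails in that group.

```python
import pandas as pd

# Made-up rows: subjReply is a boolean feature, Phishing is the 0/1 label
toy = pd.DataFrame({
    "subjReply": [True, True, True, False, False, False],
    "Phishing":  [1,    1,    0,    0,     0,     1],
})

# One row per feature value, with the phishing rate for that value
rates = toy[["subjReply", "Phishing"]].groupby(["subjReply"], as_index=False).mean()
print(rates)
```

In this toy data, two of the three emails with `subjReply == True` are phishing, against one of the three with `subjReply == False`. A large gap between the groups suggests that the feature separates the two classes.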

In [None]:
df.info()

Now that you have done the correlation check, do you know which features generally characterise a phishing email?

Now let's use these features against the machine learning model and try to break it by creating a false positive, where the model thinks a non-phishing email is a phishing email.

One way to craft the email is to use Thunderbird.
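
If you would rather stay in Python, the standard `email` package can also build a message. The sketch below is only an illustration: the addresses, subject, and body are placeholders, and this notebook does not state the format the challenge server expects, so check that before relying on it.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default

# Placeholder content: replace with the email you want to test
msg = EmailMessage()
msg["From"] = "me@example.com"
msg["To"] = "colleague@example.com"
msg["Subject"] = "Lunch on Friday?"
msg.set_content("Hi, are you free for lunch on Friday?")

# Serialise the message to raw bytes, the same form the dataset files are read in
eml_bytes = msg.as_bytes()

# Round-trip check: parse the bytes back and confirm the subject survived
parsed = message_from_bytes(eml_bytes, policy=default)
print(parsed["Subject"])  # Lunch on Friday?
```

Writing `eml_bytes` to a file opened in `'wb'` mode gives you a `.eml` file, which you can run through your own feature extraction functions before submitting.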

Once your email is crafted, it will be sent to a server to test whether it succeeds in breaking the machine learning model!

Link: ???? (TO BE ADDED)

In [1]:
from IPython.display import IFrame
IFrame("https://ctf-crisis-2.herokuapp.com/challenge1", width=800, height=300)

BONUS QUESTION:

Are you able to determine the most important feature to the Machine Learning Model?
(HINT: It is one of the features that we have shown you!)

In [2]:
from IPython.display import IFrame
IFrame("https://ctf-crisis-2.herokuapp.com/challenge1_p2", width=800, height=300)