# IS 567 - Final Project - Identifying insider threats using text data from emails

In this notebook, we create lemmatized tokens from emails that were flagged as insider threats. We then use this list of tokens to see if we can correctly identify insider threat emails from a separate list. Afterwards, we create a custom stoplist of "threatening" terms to see if we can correctly identify insider threats in a random sample.

In [2]:
import pandas as pd
import nltk
import numpy as np
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tjeon\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tjeon\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tjeon\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Functions
We'll be using insiders2 & 3 to create our flags of insider threat terms. insiders1 will be the test data that we'll use to see if we can correctly predict insider emails.

In [3]:
insiders1 = pd.read_csv('./data/combined/insiderOnly/insiderEmailsOnly.csv', low_memory=False)
insiders2 = pd.read_csv('./data/combined/insiderOnly/insiderEmailsOnly2.csv', low_memory=False)
insiders3 = pd.read_csv('./data/combined/insiderOnly/insiderEmailsOnly3.csv', low_memory=False, index_col=False)
emails = pd.read_csv('./data/combined/insiderOnly/insidersWithNormal.csv', low_memory=False)

In [3]:
df1 = pd.DataFrame(insiders1)
df2 = pd.DataFrame(insiders2)
df3 = pd.DataFrame(insiders3)
all_emails = pd.DataFrame(emails)

In [4]:
def threat_flags(contents: list):
    """Creates a list of flags for insider threats using the example data from CERT's Insider Threat Test Dataset. Uses
    nltk to create lemmatized tokens for each flag.
    
    contents: a list of dataframes where the column containing the email contents is labeled 'content'
    """
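    #Creates a list of tokens from insider threat emails to be used as flags.
    #Emails are joined with a space so the last word of one email does not fuse with the first word of the next.
    texts = []
    for df in contents:
        for row in df['content']:
            if not isinstance(row, str):
                continue
            texts.append(row)
    string = ' '.join(texts)
    stringtokens = nltk.word_tokenize(string)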
    cleanedstringtokens = [word.lower() for word in stringtokens if word[0].isalpha()]
    
    #Lemmatizes the tokens
    wnl = WordNetLemmatizer()
    cleanedtokenslemmas = [wnl.lemmatize(word) for word in cleanedstringtokens]
    
    stopwords = nltk.corpus.stopwords.words("english")
    cleanedstoppedlemmas = [wnl.lemmatize(word) for word in cleanedtokenslemmas if word not in stopwords]
    
    return cleanedstoppedlemmas

In [5]:
def threat_list(emails, contents: list):
    """Creates a list of emails that contain a flag from the list developed from threat_flags.
    
    emails: a single dataframe where the column containing the email contents is labeled 'content'
    contents: a list of dataframes where the column containing the email contents is labeled 'content'
    """
    threats = threat_flags(contents)
    threatlist = []
    for i in range(len(emails)):
        #Creates a list of tokens using nltk
        if type(emails['content'][i]) != str:
            continue
        emailtokens = nltk.word_tokenize(emails['content'][i])
        cleanedemailtokens = [word.lower() for word in emailtokens if word[0].isalpha()]
        
        #Lemmatize the tokens
        wnl = WordNetLemmatizer()
        cleanedemaillemmas = [wnl.lemmatize(word) for word in cleanedemailtokens]
        
        
        #Adds emails with a flag to 'threatlist'
        for word in cleanedemaillemmas:
            if word in threats:
                info = [emails['id'][i], emails['date'][i], emails['content'][i]]
                if info not in threatlist:
                    threatlist.append(info)
    return threatlist

## Tests
In this initial test, we'll build the flag list from df2 and df3 and see if threat_list identifies the df1 emails as threats.

In [6]:
contents = [df2, df3]
example = threat_list(df1, contents)
example

[['{M7S3-H3EG03DM-7642VKON}',
  '08/16/2010 09:41:20',
  'whitewater acadian ccc footgear usd50 matriarchial conifer mergansers huerles usd50 cherry huerles anthracite alleghenian mississippian woolly hemlock annabelle swainsons mountainsides household waterthrushes mud 350 ccc creation use riparian snowmobiling ledgy whitewater usd50 ledgy whitewater forksville waterthrushes canoes sapsucker whitewater swingler vandellas motown jamz centerpoint bluessoul vandellas nederlander motown graystone 5700000 5700000 graystone 713777 saidmy atemenge atemengue guotu resume salary position management experience'],
 ['{C2E6-I9NS27OO-7245QCIV}',
  '08/19/2010 13:20:37',
  'grard spion hillsborough jaunde ewondo ngonoa edzimbi kamerun grandparents he suppan glove rashima crunchy bambino pujols cabrera storrow suppan russa timlin sox varitek storrow tedy sparky by yastrzemski francona xxxviii lucchino suppan suppan berra taco diamondbacks suppan embree pinch mccarver xxxviii francona schilling cinci

A total of 22 out of 22 insider threat emails were identified as threats.

In [7]:
print('Emails labeled as threats: ', len(example))
print('Total number of insider threat emails: ' , len(df1))

Emails labeled as threats:  22
Total number of insider threat emails:  22


In this next test, we'll be using a random sample that mixes normal emails with insider threat emails. There are a total of 13580 emails and 254 of them are classified as "insider threats".

In [8]:
example2 = threat_list(all_emails, contents)
example2

[['{M7S3-H3EG03DM-7642VKON}',
  '8/16/2010 9:41',
  'whitewater acadian ccc footgear usd50 matriarchial conifer mergansers huerles usd50 cherry huerles anthracite alleghenian mississippian woolly hemlock annabelle swainsons mountainsides household waterthrushes mud 350 ccc creation use riparian snowmobiling ledgy whitewater usd50 ledgy whitewater forksville waterthrushes canoes sapsucker whitewater swingler vandellas motown jamz centerpoint bluessoul vandellas nederlander motown graystone 5700000 5700000 graystone 713777 saidmy atemenge atemengue guotu resume salary position management experience'],
 ['{C2E6-I9NS27OO-7245QCIV}',
  '8/19/2010 13:20',
  'grard spion hillsborough jaunde ewondo ngonoa edzimbi kamerun grandparents he suppan glove rashima crunchy bambino pujols cabrera storrow suppan russa timlin sox varitek storrow tedy sparky by yastrzemski francona xxxviii lucchino suppan suppan berra taco diamondbacks suppan embree pinch mccarver xxxviii francona schilling cincinnati var

In [9]:
print('Emails labeled as threats: ', len(example2))
print('Total number of emails: ' , len(all_emails))

Emails labeled as threats:  13575
Total number of emails:  13580


Problems:
1. Incredibly slow
2. Generating a ton of false positives (13575 of 13580 emails were flagged)
3. A majority of the words in the flag list are irrelevant
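
Part of the slowness is structural: threat_flags returns a list, so every `word in threats` check scans the whole list, and the inner loop keeps scanning an email after its first match. The following is a minimal sketch of the same check with a set and an early exit. It assumes rows are plain dicts (for example from `emails.to_dict('records')`) and uses a lowercase whitespace split as a stand-in for the nltk tokenizing and lemmatizing steps. `flag_emails` is a hypothetical name, not part of the notebook.

```python
def flag_emails(rows, flags):
    """Return [id, date, content] for each email that contains at least one flag term."""
    flag_set = set(flags)  # average O(1) membership test instead of scanning a list
    flagged = []
    for row in rows:
        content = row["content"]
        if not isinstance(content, str):
            continue  # skip missing (NaN) bodies, as the notebook does
        # Assumption: whitespace split stands in for nltk tokenization and lemmatization
        tokens = content.lower().split()
        # any() stops at the first token found in the flag set
        if any(tok in flag_set for tok in tokens):
            flagged.append([row["id"], row["date"], content])
    return flagged


# Made-up rows for illustration only
demo_rows = [
    {"id": "a", "date": "1/1/2011", "content": "resume salary position"},
    {"id": "b", "date": "1/2/2011", "content": "lunch at noon"},
    {"id": "c", "date": "1/3/2011", "content": float("nan")},
]
flagged_demo = flag_emails(demo_rows, ["resume", "salary"])
print(flagged_demo)
```

This only addresses speed. The false positives and the irrelevant flag terms come from the contents of the flag list itself.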

## Creating a custom stoplist by using insider threat lemmas

After this initial test, we decided to create a custom stoplist using two resources: an emotion lexicon retrieved from https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm and a knowledge base of insider threat terminology from the Center for Threat-Informed Defense, https://github.com/center-for-threat-informed-defense/insider-threat-ttp-kb

In [4]:
emotion_lexicon = pd.read_csv('./data/NRC-Emotion-Lexicon.csv', low_memory=False)
threat_terms = pd.read_csv('./data/insider-threat-ttp-kb.csv', low_memory=False)

In [11]:
df4 = pd.DataFrame(emotion_lexicon)
df5 = pd.DataFrame(threat_terms)

In [12]:
negative_list = df4['English (en)'].loc[df4['Negative']==1].tolist()

In [13]:
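threat_term_col = df5['Technique Title']
threat_term_list = []
for term in threat_term_col:
    tokens = nltk.word_tokenize(term)
    #Each pass overwrites 'term', so it ends up holding the last token of the title in lowercase
    for word in tokens:
        term = word.lower()
    #One entry per technique title: its last token
    threat_term_list.append(term)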
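
The loop above keeps a single token per technique title. If every word of a title should become a flag, a sketch along these lines would collect them all. The sample titles are made up for illustration rather than taken from the knowledge base, the regex is a stand-in for nltk.word_tokenize, and `title_tokens` is a hypothetical helper.

```python
import re


def title_tokens(titles):
    """Lowercase every alphabetic token of every title, preserving order."""
    tokens = []
    for title in titles:
        if not isinstance(title, str):
            continue  # skip missing values
        # Assumption: a regex word match stands in for nltk.word_tokenize
        for word in re.findall(r"[A-Za-z][A-Za-z'-]*", title):
            tokens.append(word.lower())
    return tokens


# Hypothetical titles, not rows from the knowledge base
sample_titles = ["Data Destruction", "Exfiltration Over Web Service"]
sample_tokens = title_tokens(sample_titles)
print(sample_tokens)
```

Common words such as "over" would still be removed later by the stopword filter in threat_flags2.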

In [14]:
def threat_flags2(contents: list):
    """Creates a list of flags for insider threats using terms from the NRC Emotion Lexicon as well as the Insider Threat
    Knowledge base from The Center for Threat-Informed Defense.
    
    contents: a list of lists where the values from all lists are single tokens
    """
    #Combines tokens into a single list
    flag_list = []
    for lst in contents:
        flag_list = flag_list + lst
        
    cleanedflags = [word.lower() for word in flag_list]
    
    #Lemmatizes the tokens
    wnl = WordNetLemmatizer()
    cleanedflagslemmas = [wnl.lemmatize(word) for word in cleanedflags]
    
    stopwords = nltk.corpus.stopwords.words("english")
    cleanedstoppedflags = [wnl.lemmatize(word) for word in cleanedflagslemmas if word not in stopwords]
    
    return cleanedstoppedflags

In [15]:
def threat_list2(emails, contents: list):
    """Creates a list of emails that contain a flag from the list developed from threat_flags.
    
    emails: a single dataframe where the column containing the email contents is labeled 'content'
    contents: a list of lists of single-token flag terms, passed to threat_flags2
    """
    #Retrieve the flags
    threats = threat_flags2(contents)
    threatlist = []
    for i in range(len(emails)):
        #Creates a list of tokens using nltk
        if type(emails['content'][i]) != str:
            continue
        emailtokens = nltk.word_tokenize(emails['content'][i])
        cleanedemailtokens = [word.lower() for word in emailtokens if word[0].isalpha()]
        
        #Lemmatize the tokens
        wnl = WordNetLemmatizer()
        cleanedemaillemmas = [wnl.lemmatize(word) for word in cleanedemailtokens]
        

        
        #Adds emails with a flag to 'threatlist'
        for word in cleanedemaillemmas:
            if word in threats:
                info = [emails['date'][i], emails['id'][i], emails['content'][i]]
                if info not in threatlist:
                    threatlist.append(info)
    return threatlist

### Correctly identifying insider threats
In this test, we are seeing if this function can correctly identify emails that were all flagged as insider threats.

In [16]:
testlist2 = threat_list2(df1, [negative_list, threat_term_list])

In [17]:
testlist2

[['08/16/2010 09:41:20',
  '{M7S3-H3EG03DM-7642VKON}',
  'whitewater acadian ccc footgear usd50 matriarchial conifer mergansers huerles usd50 cherry huerles anthracite alleghenian mississippian woolly hemlock annabelle swainsons mountainsides household waterthrushes mud 350 ccc creation use riparian snowmobiling ledgy whitewater usd50 ledgy whitewater forksville waterthrushes canoes sapsucker whitewater swingler vandellas motown jamz centerpoint bluessoul vandellas nederlander motown graystone 5700000 5700000 graystone 713777 saidmy atemenge atemengue guotu resume salary position management experience'],
 ['08/24/2010 09:28:37',
  '{W5P6-C8PU34MC-9435YFUX}',
  'telnitz czech buxhowden paltry brna kutuzovs sokolnitz teetering langeron bosenitz detachments josphine kutuzovs bernadottes vandammes murat olmutz telnice brunn sokolnitz bellowitz sokolnitz the 1092 maternus quadripartitus gerard precentor dxxiv bequest anselm henrici canons thurgot resume salary position management experience

In [18]:
print('Emails labeled as threats: ', len(testlist2))
print('Total number of emails: ' , len(df1))

Emails labeled as threats:  11
Total number of emails:  22


As we can see, it labeled only 11 of the 22 emails as threats, which is a 50% false negative rate.

### Identifying insider threats from a random sample

In [19]:
testlist = threat_list2(all_emails, [negative_list, threat_term_list])

In [20]:
testlist

[['8/16/2010 9:41',
  '{M7S3-H3EG03DM-7642VKON}',
  'whitewater acadian ccc footgear usd50 matriarchial conifer mergansers huerles usd50 cherry huerles anthracite alleghenian mississippian woolly hemlock annabelle swainsons mountainsides household waterthrushes mud 350 ccc creation use riparian snowmobiling ledgy whitewater usd50 ledgy whitewater forksville waterthrushes canoes sapsucker whitewater swingler vandellas motown jamz centerpoint bluessoul vandellas nederlander motown graystone 5700000 5700000 graystone 713777 saidmy atemenge atemengue guotu resume salary position management experience'],
 ['8/24/2010 9:28',
  '{W5P6-C8PU34MC-9435YFUX}',
  'telnitz czech buxhowden paltry brna kutuzovs sokolnitz teetering langeron bosenitz detachments josphine kutuzovs bernadottes vandammes murat olmutz telnice brunn sokolnitz bellowitz sokolnitz the 1092 maternus quadripartitus gerard precentor dxxiv bequest anselm henrici canons thurgot resume salary position management experience'],
 ['8/3

In [21]:
print('Emails labeled as threats: ', len(testlist))
print('Total number of emails: ' , len(all_emails))

Emails labeled as threats:  12140
Total number of emails:  13580


However, the number of flagged emails did drop compared to our initial test (12140 versus 13575 out of 13580), which means fewer false positives. We then wanted to see if we could lower that number further by considering an email an insider threat only if it contained at least a certain number of flagged terms (2 or more, 3 or more, etc.).

### Identifying insider threats if they have at least 2, 3, 4, or more insider threat terms

In [22]:
def threat_list3(emails, contents: list, flag_count: int):
    """Creates a list of emails that contain a flag from the list developed from threat_flags. Also can adjust the number
    of terms needed to be discovered in order to be flagged as a threat.
    
    emails: a single dataframe where the column containing the email contents is labeled 'content'
    contents: a list of lists of single-token flag terms, passed to threat_flags2
    flag_count: the minimum number of "threat" terms an email would need to be considered a threat
    """
    #Retrieve the flags
    threats = threat_flags2(contents)
    
    threatlist = []
    for i in range(len(emails)):
        #Creates a list of tokens using nltk
        if type(emails['content'][i]) != str:
            continue
        emailtokens = nltk.word_tokenize(emails['content'][i])
        cleanedemailtokens = [word.lower() for word in emailtokens if word[0].isalpha()]
        
        #Lemmatize the tokens
        wnl = WordNetLemmatizer()
        cleanedemaillemmas = [wnl.lemmatize(word) for word in cleanedemailtokens]
        
        #Counts the flag terms in the email and adds it to 'threatlist' if it meets flag_count
        threat_count = 0
        for word in cleanedemaillemmas:
            if word in threats:
                threat_count += 1
        if threat_count >= flag_count:
            info = [emails['date'][i], emails['id'][i], emails['content'][i]]
            if info not in threatlist:
                threatlist.append(info)
    return threatlist
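
Since threat_list3 re-tokenizes every email each time it is called, testing several thresholds repeats the same work. One possible alternative, sketched below, counts the flag terms per email once and then applies each threshold to the stored counts. It assumes rows are plain dicts with unique ids and uses a whitespace split in place of the nltk steps. Both function names are hypothetical.

```python
def count_flag_terms(rows, flags):
    """Map each email id to the number of its tokens that appear in flags."""
    flag_set = set(flags)
    counts = {}
    for row in rows:
        content = row["content"]
        if not isinstance(content, str):
            continue  # skip missing bodies
        # Assumption: whitespace split stands in for nltk tokenization and lemmatization
        tokens = content.lower().split()
        # Repeated terms count each time, matching threat_list3
        counts[row["id"]] = sum(1 for tok in tokens if tok in flag_set)
    return counts


def flagged_per_threshold(counts, thresholds):
    """For each threshold, count the emails with at least that many flag terms."""
    return {t: sum(1 for c in counts.values() if c >= t) for t in thresholds}


# Made-up rows for illustration only
demo_rows = [
    {"id": "a", "date": "7/8/2010", "content": "angry angry outraged fed up"},
    {"id": "b", "date": "7/9/2010", "content": "angry lunch"},
    {"id": "c", "date": "7/9/2010", "content": "lunch at noon"},
]
counts_demo = count_flag_terms(demo_rows, ["angry", "outraged"])
sweep_demo = flagged_per_threshold(counts_demo, [1, 2, 3, 4])
print(counts_demo)
print(sweep_demo)
```

With the counts stored, adding another threshold costs one pass over a dictionary instead of another pass over all email text.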

### Identifying the emails with 2 or more Insider Threat Terms

In [23]:
testlist2 = threat_list3(all_emails, [negative_list, threat_term_list], 2)

In [24]:
print('Emails labeled as threats with two or more terms: ', len(testlist2))
print('Total number of emails: ' , len(all_emails))

Emails labeled as threats with two or more terms:  9694
Total number of emails:  13580


In [25]:
testlist3 = threat_list3(all_emails, [negative_list, threat_term_list], 3)

In [26]:
print('Emails labeled as threats with three or more terms: ', len(testlist3))
print('Total number of emails: ' , len(all_emails))

Emails labeled as threats with three or more terms:  7124
Total number of emails:  13580


In [27]:
testlist4 = threat_list3(all_emails, [negative_list, threat_term_list], 4)

In [28]:
print('Emails labeled as threats with four or more terms: ', len(testlist4))
print('Total number of emails: ' , len(all_emails))

Emails labeled as threats with four or more terms:  4868
Total number of emails:  13580


In [29]:
testlist5 = threat_list3(all_emails, [negative_list, threat_term_list], 5)

In [30]:
print('Emails labeled as threats with five or more terms: ', len(testlist5))
print('Total number of emails: ' , len(all_emails))

Emails labeled as threats with five or more terms:  3214
Total number of emails:  13580


In [31]:
testlist6 = threat_list3(all_emails, [negative_list, threat_term_list], 6)

In [32]:
print('Emails labeled as threats with six or more terms: ', len(testlist6))
print('Total number of emails: ' , len(all_emails))

Emails labeled as threats with six or more terms:  2060
Total number of emails:  13580


In [33]:
testlist6

[['7/8/2010 11:51',
  '{M5E7-W6HQ93IE-2962BFDO}',
  'my work not appreciated i work weekends fed up my work not appreciated complaints i work weekends no gratitude no gratitude my work not appreciated my work not appreciated my work not appreciated my work not appreciated i work holidays i work holidays i work after-hours company will suffer i work holidays company will suffer i work weekends company will suffer i work holidays fed up company will suffer fed up i work weekends i may leave too much my work not appreciated my work not appreciated i work holidays my work not appreciated company will suffer i may leave too much i may leave i work holidays complaints i work holidays too much i work after-hours my work not appreciated company will suffer company will suffer my work not appreciated'],
 ['7/8/2010 13:15',
  '{A0G8-T3CH95YI-0492KKLI}',
  'take me seriously i will leave i am irreplaceable two faced bad things angry outraged angry i am irreplaceable bad things i am irreplaceable 

In [34]:
#Collects the ids of the known insider threat emails in df3
threat_ids = df3['id'].tolist()

In [40]:
count=0
#testlist6 holds the emails flagged with six or more terms (2060 emails)
for email in testlist6:
    if email[1] in threat_ids:
        count+=1
        print(email)
print('\nNumber of correctly identified insider threats out of 256: ', count)

['1/11/2011 10:21', '{K9T6-R0GF99NK-6738TEYS}', 'dynamic analyze sales permanent sales growth team responsibilities start degree call initiative resume job platform strong part-time degree team benefits responsibilities report skills management multitask dynamic years responsibilities required multitask multiple passion multitask customer responsibilities concepts sales equivalent responsibilities sales self benefits guidance job job engineer start people benefits relocation benefits customer degree required contribute contribute sales customer initiative team dynamic']
['1/11/2011 13:34', '{R7I4-D1JU74MQ-5665YNKG}', 'people people customer concepts multiple strong technologies multiple people contribute technologies management required resume call passion dynamic analyze guidance develop compensation responsibilities multitask guidance resume years skills opening customer team industry platform team develop resume required customer on-time call starter on-time growth benefits responsi

We were able to correctly identify 50 out of 256 insider threats, but about 2010 of the flagged emails were false positives (2060 emails flagged with six or more terms, minus the 50 correct ones).
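
To put these figures in standard terms, precision and recall can be computed from them directly. The sketch below takes the reported counts at face value and reads the 2010 figure as emails that were flagged but were not insider threats. `precision_recall` is a hypothetical helper, and the inputs are the notebook's reported numbers rather than recomputed values.

```python
def precision_recall(true_pos, false_pos, total_actual):
    """Precision = TP / (TP + FP); recall = TP / total actual positives."""
    flagged = true_pos + false_pos
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / total_actual if total_actual else 0.0
    return precision, recall


# Reported counts: 50 correct flags, about 2010 incorrect flags, 256 known threats
precision, recall = precision_recall(true_pos=50, false_pos=2010, total_actual=256)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Under that reading, roughly 2.4% of the flagged emails were real threats and about 19.5% of the known threats were caught.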