# Questionnaire extraction

Note to anyone who wants to reuse this notebook: the code is not optimized. Because the dataset is small, the processing is quick-and-dirty and ad hoc. If you want to use this code for your own analysis, you will probably have to adjust some things.

## Sources
- Usernames: https://github.com/danielmiessler/SecLists/tree/master/Usernames
- Passwords: a part of the password file that is in none of the test sets
- Names: https://forebears.io/earth/forenames

In [1]:
import pandas as pd
import numpy as np
import re
import random
from datasets import DatasetDict, Dataset
random.seed(42)

In [2]:
df = pd.read_excel('Password Research (Responses).xlsx', converters={'How old are you?': int,})
df

Unnamed: 0,Timestamp,I have read the above information and I consent to participating in this investigation. I am aware that the data I fill in will be used for research purposes and might be published or shared for research purposes.,"You are sharing an account for a website with a friend and you are sending them a message (e.g. a text or through a popular messaging app) with the login information (username and password). How would you format that message? Please use [PASSWORD] where you would put the password and [USERNAME] where you would put the username. If you want to, you can use other placeholders such as [WEBSITE] or a placeholder you wish to introduce yourself. If you want to indicate that you would send it in different messages, insert [NEW MESSAGE] between the different messages. Do not forget to answer in your native language.","You are sending a colleague the login information (username and password) for a shared account over an email. How would you format that email? Please use [PASSWORD] where you would put the password and [USERNAME] where you would put the username. If you want to, you can use other placeholders such as [WEBSITE] or a placeholder you wish to introduce yourself. If you want to indicate that you would send it in different mails, insert [NEW EMAIL] between the different mails. Do not forget to answer in your native language.","You want to write down your password in a note on your phone. What would this note look like? Please use [PASSWORD] where you would put the password. If you want to, you can use other placeholders such as [WEBSITE] or a placeholder you wish to introduce yourself. Do not forget to answer in your native language.",What language did you answer the previous questions in?,How old are you?
0,2023-11-07 19:25:27.425,I consent,"Hey, here is the login: [USERNAME] [PASSWORD]","Hi,\nHereby I send you the login information: ...",[PASSWORD] [WEBSITE],English,22
1,2023-11-07 19:28:34.321,I consent,Hey here is the login info for [WEBSITE]: \nNa...,"Hello [NAME],\n\nHere is the login information...",Pw [WEBSITE]:\n[PASSWORD],English,21
2,2023-11-07 19:44:14.797,I consent,[USERNAME]\n[NEWMESSAGE]\n[PASSWORD],"Hoi [NAME],\n\nVoor [WEBSITE] kan je inloggen ...",[WEBSITE]\nusername: [USERNAME]\nww: [PASSWORD],Dutch,21
3,2023-11-07 20:43:22.404,I consent,That's the login details you asked for: \n[NEW...,"Hi,\n\nThese are the login details for [WEBSIT...",[WEBSITE]:\n[PASSWORD],English,23
4,2023-11-07 21:08:17.706,I consent,[USERNAME]\n[PASSWORD],"Hallo [NAME],\nhier zijn is het wachtwoord:\nu...",[USERNAME]\n[PASSWORD],Dutch,22
5,2023-11-07 22:05:29.284,I consent,here is the info [USERNAME]:[PASSWORD],"Hey [NAME],\n\nHere is the info for [WEBSITE]:...",[PASSWORD],English,23
6,2023-11-08 10:27:04.602,I consent,[USERNAME]\n[NEW TEXT]\n[PASSWORD]\n[NEW TEXT]...,"Hi [COLLEGA],\n\nHierbij de inloggegevens die ...",[TITLE NOTE]\n[WEBSITE]\n[CONTENT NOTE]\n[USER...,Dutch,21
7,2023-11-08 17:13:08.772,I consent,Tady máš údaje na [WEBSITE]\n[NEW MESSAGE]\n[U...,"Ahoj,\n\nposílám přihlašovací údaje na [WEBSIT...",[WEBSITE]\n[USERNAME]\n[PASSWORD],Czech,24
8,2023-11-10 12:26:13.670,I consent,Hi! Voor [WEBSITE]:\nGn: [USERNAME]\nWw: [PASS...,"Hoi [FIRST NAME],\nHierbij de inloggegevens va...",[WEBSITE]\n[PASSWORD],Dutch,23
9,2023-11-10 12:26:34.243,I consent,[USERNAME]\n[PASSWORD],"Hallo,\nHier zijn de gegevens: [USERNAME]\n[PA...",[WEBSITE]\n[USERNAME]\n[PASSWORD],Dutch,18


## Demographic data


In [3]:
print(f"The average age was {df['How old are you?'].mean()}")
print(f"The oldest person was {df['How old are you?'].max()}")
print(f"The youngest person was {df['How old are you?'].min()}")

df['What language did you answer the previous questions in?'].value_counts()

The average age was 22.314285714285713
The oldest person was 53
The youngest person was 18


Dutch                                                         14
English                                                       11
Slovak                                                         2
French                                                         1
Turkish                                                        1
Spanish                                                        1
Two in Dutch, 1 in English (because my work is in English)     1
Azerbaijani                                                    1
Czech                                                          1
Portuguese                                                     1
Polish                                                         1
Name: What language did you answer the previous questions in?, dtype: int64

## Extracting the dataset

In [4]:
Q1 = "You are sharing an account for a website with a friend and you are sending them a message (e.g. a text or through a popular messaging app) with the login information (username and password). How would you format that message? Please use [PASSWORD] where you would put the password and [USERNAME] where you would put the username. If you want to, you can use other placeholders such as [WEBSITE] or a placeholder you wish to introduce yourself. If you want to indicate that you would send it in different messages, insert [NEW MESSAGE] between the different messages. Do not forget to answer in your native language."
Q2 = "You are sending a colleague the login information (username and password) for a shared account over an email. How would you format that email? Please use [PASSWORD] where you would put the password and [USERNAME] where you would put the username. If you want to, you can use other placeholders such as [WEBSITE] or a placeholder you wish to introduce yourself. If you want to indicate that you would send it in different mails, insert [NEW EMAIL] between the different mails. Do not forget to answer in your native language."

In [5]:
# Select the right columns for our purposes

texts_df = pd.DataFrame() 
texts_df['text'] = df[Q1]
texts_df['emails'] = df[Q2]
texts_df

Unnamed: 0,text,emails
0,"Hey, here is the login: [USERNAME] [PASSWORD]","Hi,\nHereby I send you the login information: ..."
1,Hey here is the login info for [WEBSITE]: \nNa...,"Hello [NAME],\n\nHere is the login information..."
2,[USERNAME]\n[NEWMESSAGE]\n[PASSWORD],"Hoi [NAME],\n\nVoor [WEBSITE] kan je inloggen ..."
3,That's the login details you asked for: \n[NEW...,"Hi,\n\nThese are the login details for [WEBSIT..."
4,[USERNAME]\n[PASSWORD],"Hallo [NAME],\nhier zijn is het wachtwoord:\nu..."
5,here is the info [USERNAME]:[PASSWORD],"Hey [NAME],\n\nHere is the info for [WEBSITE]:..."
6,[USERNAME]\n[NEW TEXT]\n[PASSWORD]\n[NEW TEXT]...,"Hi [COLLEGA],\n\nHierbij de inloggegevens die ..."
7,Tady máš údaje na [WEBSITE]\n[NEW MESSAGE]\n[U...,"Ahoj,\n\nposílám přihlašovací údaje na [WEBSIT..."
8,Hi! Voor [WEBSITE]:\nGn: [USERNAME]\nWw: [PASS...,"Hoi [FIRST NAME],\nHierbij de inloggegevens va..."
9,[USERNAME]\n[PASSWORD],"Hallo,\nHier zijn de gegevens: [USERNAME]\n[PA..."


## Replace placeholders
Respondents could use any placeholder they wanted. To handle this properly, we first look at which placeholders they actually used, and then replace them using a few different functions.

In [6]:
def find_placeholders(df):
    columns = df.columns
    placeholders = []
    for column in columns:
        for index, row in df.iterrows():
            placeholders += re.findall(r"\[[\w\s]*\]", row[column])
    return placeholders



find_placeholders(texts_df)

['[USERNAME]',
 '[PASSWORD]',
 '[WEBSITE]',
 '[USERNAME]',
 '[PASSWORD]',
 '[USERNAME]',
 '[NEWMESSAGE]',
 '[PASSWORD]',
 '[NEW MESSAGE]',
 '[USERNAME]',
 '[NEW MESSAGE]',
 '[PASSWORD]',
 '[USERNAME]',
 '[PASSWORD]',
 '[USERNAME]',
 '[PASSWORD]',
 '[USERNAME]',
 '[NEW TEXT]',
 '[PASSWORD]',
 '[NEW TEXT]',
 '[WEBSITE]',
 '[NEW MESSAGE]',
 '[USERNAME]',
 '[NEW MESSAGE]',
 '[PASSWORD]',
 '[WEBSITE]',
 '[USERNAME]',
 '[PASSWORD]',
 '[USERNAME]',
 '[PASSWORD]',
 '[WEBSITE]',
 '[USERNAME]',
 '[NEW MESSAGE]',
 '[PASSWORD]',
 '[WEBSITE]',
 '[NEW TEXT]',
 '[USERNAME]',
 '[PASSWORD]',
 '[USERNAME]',
 '[NEW MESSAGE]',
 '[PASSWORD]',
 '[NEW MESSAGE]',
 '[NAME]',
 '[WEBSITE]',
 '[USERNAME]',
 '[PASSWORD]',
 '[USERNAME]',
 '[NEW MESSAGE]',
 '[PASSWORD]',
 '[WEBSITE]',
 '[USERNAME]',
 '[PASSWORD]',
 '[NEW MESSAGE]',
 '[USERNAME]',
 '[NEW MESSAGE]',
 '[PASSWORD]',
 '[WEBSITE]',
 '[NEW MESSAGE]',
 '[USERNAME]',
 '[PASSWORD]',
 '[USERNAME]',
 '[NEW MESSAGE]',
 '[PASSWORD]',
 '[NEW MESSAGE]',
 '[WEBSITE]
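
Because the raw list repeats every match, a tally of the distinct placeholders is easier to scan. The following is a minimal sketch, not part of the original notebook: `sample_responses` and `tally_placeholders` are hypothetical names, and the sample strings only imitate the questionnaire answers. The regex is the one used in `find_placeholders`.

```python
import re
from collections import Counter

# Hypothetical stand-in for the answers in texts_df (not the real data)
sample_responses = [
    "Hey, here is the login: [USERNAME] [PASSWORD]",
    "[USERNAME]\n[NEWMESSAGE]\n[PASSWORD]",
    "Hi [COLLEGA],\nvoor [WEBSITE]: [USERNAME]\n[NEW MESSAGE]\n[PASSWORD]",
]

# Same pattern as find_placeholders: square brackets around word/space characters
PLACEHOLDER_RE = re.compile(r"\[[\w\s]*\]")

def tally_placeholders(responses):
    """Count how often each distinct placeholder occurs across all responses."""
    counts = Counter()
    for response in responses:
        counts.update(PLACEHOLDER_RE.findall(response))
    return counts

placeholder_counts = tally_placeholders(sample_responses)
for placeholder, count in placeholder_counts.most_common():
    print(f"{placeholder}: {count}")
```

A tally like this makes near-duplicates such as `[NEWMESSAGE]` and `[NEW MESSAGE]` stand out, which is what the replacement step has to account for.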

In [7]:
def count_nr_of_placeholders_per_row(row, column, placeholder):
    return len(re.findall(placeholder, row[column]))

def count_nr_of_placeholders(df, placeholder):
    columns = df.columns
    counts = pd.DataFrame()
    for column in columns:
        counts[column] = df.apply(count_nr_of_placeholders_per_row, args=(column, placeholder, ), axis=1)
    return counts.sum().sum()

def read_long_list_from_file(n, filename):
    '''
    Idea here is to read more from the file than needed, so that we can later on select n 
    items at random from the long list. This should help against alphabetical order effects
    '''
    total_n = 1000*n
    instances = []
    i = 0
    with open(filename, 'r') as f:
        while i < total_n:
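            try: 
                line = f.readline()
                if not line:
                    return instances
                # Strip only the newline; line[:-1] would cut a real character
                # off a final line that lacks a trailing newline
                instances.append(line.rstrip("\n"))
                i += 1
            except UnicodeDecodeError:
                # Stop at the first undecodable line and keep what was read so far
                return instances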
    return instances

def randomly_truncate(url, threshold=0.8):
    '''
    The urls are randomly truncated, as it seems likely that people do not write
    down the complete url, but rather write down the first part, i.e. "google" instead
    of "google.com"
    '''
    r = random.uniform(0,1)
    if r <= threshold:
        spl = url.split(".")
        return spl[0]
    else:
        return url

def read_tranco_file(filename):
    '''
    The Tranco file has a different format than the other files, and is therefore read in
    differently
    '''
    df = pd.read_csv(filename)
    websites = list(df['domain'])
    split_websites = list(map(randomly_truncate, websites))
    return split_websites

def read_name_file(filename):
    '''
    The name file is too short to be read in as a long name file and can easily be read in as
    a whole, which is what this function does
    '''
    with open(filename, 'r') as f:
        names = f.readlines()
        # with .strip() we remove whitespace characters like `\n` at the end of each line
        names = [x.strip() for x in names] 
    return names
    

def replace_placeholder(df, placeholder, replacement_list):
    '''
    This function replaces the placeholders by a random item from the replacement list.
    All occurrences of the placeholder within one cell get the same replacement.
    '''
    placeholder_pattern = re.compile(placeholder)
    # The number of occurrences is an upper bound on the number of cells to fill
    n = count_nr_of_placeholders(df, placeholder)
    columns = df.columns
    
    # select the necessary items randomly from the replacement list
    replacements = random.sample(replacement_list, n) 
    
    i = 0
    for column in columns:
        for index, row in df.iterrows():
            if placeholder_pattern.search(row[column]):
                # Write through df.at: assigning to the row yielded by iterrows()
                # is not guaranteed to modify the dataframe
                df.at[index, column] = placeholder_pattern.sub(replacements[i], row[column])
                i += 1
    return df

def replace_placeholder_by_char(df, placeholder, replacement_char):
    '''
    This function helps replace placeholders by a single character that is the same for all 
    occurrences of the placeholder
    '''
    placeholder_pattern = re.compile(placeholder)
    columns = df.columns
    for column in columns:
        for index, row in df.iterrows():
            if placeholder_pattern.search(row[column]):
                # Write through df.at so the change reliably lands in the dataframe
                df.at[index, column] = placeholder_pattern.sub(replacement_char, row[column])
    return df


def replace_using_long_list(df, placeholder, filename):
    '''
    Function that ties together the above functions of replacing a placeholder by items from
    a long list of possible replacements
    '''
    nr_placeholders = count_nr_of_placeholders(df, placeholder)
    repl_list = read_long_list_from_file(nr_placeholders, filename)
    # replace_placeholder modifies df in place; return it so callers can reassign
    return replace_placeholder(df, placeholder, repl_list)

print(f"We started with {len(find_placeholders(texts_df))} placeholders")

# We now replace each placeholder by its appropriate replacements. This is a lot of 
# calling the same functions over and over, because people used a lot of similar but
# slightly different placeholders

# Raw strings keep the regex escapes (\[ and \]) from being read as string escapes
replace_using_long_list(texts_df, r"\[EMAIL\]", "xato-net-10-million-usernames.txt")
replace_using_long_list(texts_df, r"\[USERNAME\]", "xato-net-10-million-usernames.txt")

texts_df = replace_placeholder_by_char(texts_df, r"\[On Signal messaging app\]", "")
texts_df = replace_placeholder_by_char(texts_df, r"\[PHONE CALL\]", "")
texts_df = replace_placeholder_by_char(texts_df, r"\[RANDOMCHARACTERS\]", "")
texts_df = replace_placeholder_by_char(texts_df, r"\[xyz\]", "")

websites = read_tranco_file("tranco.csv")
texts_df = replace_placeholder(texts_df, r"\[Name website\]", websites)
texts_df = replace_placeholder(texts_df, r"\[WEBSITE\]", websites)
names = read_name_file("names.txt")
texts_df = replace_placeholder(texts_df, r"\[NAME\]", names)
texts_df = replace_placeholder(texts_df, r"\[COLLEGA\]", names)
texts_df = replace_placeholder(texts_df, r"\[Collega\]", names)
texts_df = replace_placeholder(texts_df, r"\[FIRST NAME\]", names)
texts_df = replace_placeholder(texts_df, r"\[COLLEAGUE\]", names)
texts_df = replace_placeholder(texts_df, r"\[NAME OF COLLEAGUE\]", names)
texts_df = replace_placeholder(texts_df, r"\[MY NAME\]", names)
texts_df = replace_placeholder(texts_df, r"\[MYNAME\]", names)
texts_df = replace_placeholder(texts_df, r"\[COLLEAGUE_NAME\]", names)
print(f"We ended with {len(find_placeholders(texts_df))} placeholders")



We started with 253 placeholders
We ended with 102 placeholders
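
The long run of near-identical calls above could also be driven by a lookup table. The following is only a sketch under assumptions, not the notebook's method: `pools`, `placeholder_to_pool` and `fill_placeholders` are hypothetical names, the pools are tiny made-up lists, and it works on plain strings instead of the dataframe. Unlike `replace_placeholder`, it draws with `rng.choice` (so values can repeat) and gives every occurrence its own draw rather than one per cell.

```python
import random
import re

rng = random.Random(42)  # local generator so the sketch is reproducible

# Hypothetical stand-ins for the username, website and name lists
pools = {
    "usernames": ["ert", "rod", "plato", "romana"],
    "websites": ["iqiyi", "myspace", "notion"],
    "names": ["Esther", "Mei", "George"],
}

# Each placeholder variant points at the pool it should draw from;
# an empty string means "delete the placeholder"
placeholder_to_pool = {
    "[USERNAME]": "usernames",
    "[EMAIL]": "usernames",
    "[WEBSITE]": "websites",
    "[Name website]": "websites",
    "[NAME]": "names",
    "[COLLEGA]": "names",
    "[PHONE CALL]": "",
}

def fill_placeholders(message):
    """Replace every mapped placeholder; each occurrence gets its own draw."""
    for placeholder, pool_name in placeholder_to_pool.items():
        pattern = re.escape(placeholder)  # escapes the square brackets
        if pool_name:
            pool = pools[pool_name]
            # A callable replacement is inserted literally, occurrence by occurrence
            message = re.sub(pattern, lambda _m: rng.choice(pool), message)
        else:
            message = re.sub(pattern, "", message)
    return message

messages = [
    "Hoi [COLLEGA],\nvoor [WEBSITE]: [USERNAME]\n[PASSWORD]",
    "[PHONE CALL][USERNAME] and [USERNAME]",
]
filled = [fill_placeholders(m) for m in messages]
print(filled)
```

Two design points: `re.escape` removes the need to hand-write `\[` and `\]` for every variant, and a callable replacement keeps `re.sub` from interpreting backslashes that might occur in a replacement string.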


In [8]:
print("These placeholders are left (check to see if you got them all, except for the new message and password ones)")
print(find_placeholders(texts_df))
print("This is what the dataframe looks like")
texts_df

These placeholders are left (check to see if you got them all, except for the new message and password ones)
['[PASSWORD]', '[PASSWORD]', '[NEWMESSAGE]', '[PASSWORD]', '[NEW MESSAGE]', '[NEW MESSAGE]', '[PASSWORD]', '[PASSWORD]', '[PASSWORD]', '[NEW TEXT]', '[PASSWORD]', '[NEW TEXT]', '[NEW MESSAGE]', '[NEW MESSAGE]', '[PASSWORD]', '[PASSWORD]', '[PASSWORD]', '[NEW MESSAGE]', '[PASSWORD]', '[NEW TEXT]', '[PASSWORD]', '[NEW MESSAGE]', '[PASSWORD]', '[NEW MESSAGE]', '[PASSWORD]', '[NEW MESSAGE]', '[PASSWORD]', '[PASSWORD]', '[NEW MESSAGE]', '[NEW MESSAGE]', '[PASSWORD]', '[NEW MESSAGE]', '[PASSWORD]', '[NEW MESSAGE]', '[PASSWORD]', '[NEW MESSAGE]', '[NEW MESSAGE]', '[PASSWORD]', '[NEW MESSAGE]', '[PASSWORD]', '[NEW MESSAGE]', '[NEW MESSAGE]', '[PASSWORD]', '[PASSWORD]', '[NEW MESSAGE]', '[PASSWORD]', '[NEW MESSAGE]', '[PASSWORD]', '[PASSWORD]', '[NEW MESSAGE]', '[PASSWORD]', '[PASSWORD]', '[NEW MESSAGE]', '[PASSWORD]', '[PASSWORD]', '[NEW MESSAGE]', '[PASSWORD]', '[PASSWORD]', '[PASSWORD

Unnamed: 0,text,emails
0,"Hey, here is the login: ert [PASSWORD]","Hi,\nHereby I send you the login information: ..."
1,Hey here is the login info for iqiyi: \nName: ...,"Hello Esther,\n\nHere is the login information..."
2,Plato\n[NEWMESSAGE]\n[PASSWORD],"Hoi Mei,\n\nVoor myspace kan je inloggen met s..."
3,That's the login details you asked for: \n[NEW...,"Hi,\n\nThese are the login details for cbc tha..."
4,dsfs\n[PASSWORD],"Hallo George,\nhier zijn is het wachtwoord:\nu..."
5,here is the info Holmes:[PASSWORD],"Hey Andrea,\n\nHere is the info for notion:\nu..."
6,diabolo\n[NEW TEXT]\n[PASSWORD]\n[NEW TEXT]\nH...,"Hi Santosh,\n\nHierbij de inloggegevens die je..."
7,Tady máš údaje na amazon.es\n[NEW MESSAGE]\nfa...,"Ahoj,\n\nposílám přihlašovací údaje na telegra..."
8,Hi! Voor deviantart:\nGn: Robo\nWw: [PASSWORD]\n,"Hoi Fernando,\nHierbij de inloggegevens van az..."
9,curly\n[PASSWORD],"Hallo,\nHier zijn de gegevens: extacy\n[PASSWO..."


We now convert the dataframe to a list. We probably should have done that earlier for ease of working, but oh well. Quick-and-dirty, I said. 

With all the data in one big list, we can easily split every response that contains several messages into separate messages. 

In [9]:
merged_list = list(texts_df.text) + list(texts_df.emails)



def split_at_new(old_list, placeholder):
    '''
    Splits up at the placeholder (which should be a NEW MESSAGE or similar placeholder)
    and appends the parts to a new list. The idea is that these new messages will also be found
    separately on someone's device and should therefore be considered separately. 
    '''
    new_list = []
    placeholder_pattern = re.compile(".*" + placeholder + ".*", flags=re.DOTALL)
    for message in old_list:
        if placeholder_pattern.match(message):
            parts = re.split(placeholder, message, flags=re.DOTALL)
            for part in parts:
                new_list.append(part)
        else: 
            new_list.append(message)
    return new_list

merged_list = split_at_new(merged_list, r"\[NEWMESSAGE\]")
merged_list = split_at_new(merged_list, r"\[NEW MESSAGE\]")
merged_list = split_at_new(merged_list, r"\[NEW TEXT\]")
merged_list = split_at_new(merged_list, r"\[NEW EMAIL\]")
print(merged_list)



['Hey, here is the login: ert [PASSWORD]', 'Hey here is the login info for iqiyi: \nName: rod \nPw: [PASSWORD] ', 'Plato\n', '\n[PASSWORD]', "That's the login details you asked for: \n", ' romana\n', ' [PASSWORD]', 'dsfs\n[PASSWORD]', 'here is the info Holmes:[PASSWORD]', 'diabolo\n', '\n[PASSWORD]\n', '\nHeb je het ingevuld? Dan verwijder ik het bericht met het wachtwoord', 'Tady máš údaje na amazon.es\n', '\nfafa\n', '\n[PASSWORD]', 'Hi! Voor deviantart:\nGn: Robo\nWw: [PASSWORD]\n', 'curly\n[PASSWORD]', 'Hey, hierbij de login gegevens voor jd. Ik geef niet zoveel om dit account dus ik ga niet moeilijk doen met verschillende communicatie kanalen. \nGebruikersnaam: falling\n', ' \nmet [PASSWORD] als ww, de A is een 4 en de tweede S een $.', 'Salam, bu bizim təzə ortaq hesab üçün linkdir binance.com\n\n', '\nIstifadəçi adı mənim emailimdir, yəni glenno, şifrə isə [PASSWORD]', 'jahjah\n', '\n[PASSWORD]\n', '', 'Hoi Victor, dit is voor de account als je wil inloggen op ieee; gebruikersna
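
The output above contains empty strings and parts that start with a newline. That is harmless for the later whitespace split, but if you want a tidier list, the four passes can be folded into one. A minimal sketch, assuming the four separator variants used above are the only ones; `SEPARATOR_RE` and `split_messages` are hypothetical names and the sample strings are made up in the style of the responses.

```python
import re

# One pattern for the separator variants seen in the responses
SEPARATOR_RE = re.compile(r"\[NEW ?MESSAGE\]|\[NEW TEXT\]|\[NEW EMAIL\]")

def split_messages(messages):
    """Split each message at any separator and drop whitespace-only parts."""
    parts = []
    for message in messages:
        for part in SEPARATOR_RE.split(message):
            part = part.strip()
            if part:
                parts.append(part)
    return parts

# Hypothetical examples in the style of the responses
sample = [
    "Plato\n[NEWMESSAGE]\n[PASSWORD]",
    "diabolo\n[NEW TEXT]\n[PASSWORD]\n[NEW TEXT]\n",
    "here is the info Holmes:[PASSWORD]",
]
split_sample = split_messages(sample)
print(split_sample)
```

Messages without a separator pass through unchanged apart from the stripping, so no separate "does it match" check is needed.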

In [10]:
# Another check to see what placeholders are left. We only want to see PASSWORDs here
for message in merged_list:
    print(re.findall(r"\[[\w\s]*\]", message))

['[PASSWORD]']
['[PASSWORD]']
[]
['[PASSWORD]']
[]
[]
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
[]
['[PASSWORD]']
[]
[]
[]
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
[]
['[PASSWORD]']
[]
['[PASSWORD]']
[]
['[PASSWORD]']
[]
['[PASSWORD]']
[]
['[PASSWORD]']
['[PASSWORD]']
[]
[]
['[PASSWORD]']
[]
['[PASSWORD]']
[]
['[PASSWORD]']
[]
[]
['[PASSWORD]']
[]
['[PASSWORD]']
[]
[]
['[PASSWORD]']
['[PASSWORD]']
[]
['[PASSWORD]']
[]
['[PASSWORD]']
['[PASSWORD]']
[]
['[PASSWORD]']
['[PASSWORD]']
[]
['[PASSWORD]']
['[PASSWORD]']
[]
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
[]
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
[]
['[PASSWORD]']
['[PASSWORD]']
[]
['[PASSWORD]']
['[PASSWORD]']
[]
['[PASSWORD]']
['[PASSWORD]']
[]
['[PASSWORD]']
['[PASSWORD]']
['[PASSWORD]']
['[PASS

We split each message in the list at whitespace (by using .split()) and then flatten the list so that we have a list of words. 

In [11]:
def flatten(xss):
    return [x for xs in xss for x in xs]

# Code from https://stackoverflow.com/questions/952914/how-do-i-make-a-flat-list-out-of-a-list-of-lists

In [12]:

words = [m.split() for m in merged_list]
words = flatten(words)

print(words)

['Hey,', 'here', 'is', 'the', 'login:', 'ert', '[PASSWORD]', 'Hey', 'here', 'is', 'the', 'login', 'info', 'for', 'iqiyi:', 'Name:', 'rod', 'Pw:', '[PASSWORD]', 'Plato', '[PASSWORD]', "That's", 'the', 'login', 'details', 'you', 'asked', 'for:', 'romana', '[PASSWORD]', 'dsfs', '[PASSWORD]', 'here', 'is', 'the', 'info', 'Holmes:[PASSWORD]', 'diabolo', '[PASSWORD]', 'Heb', 'je', 'het', 'ingevuld?', 'Dan', 'verwijder', 'ik', 'het', 'bericht', 'met', 'het', 'wachtwoord', 'Tady', 'máš', 'údaje', 'na', 'amazon.es', 'fafa', '[PASSWORD]', 'Hi!', 'Voor', 'deviantart:', 'Gn:', 'Robo', 'Ww:', '[PASSWORD]', 'curly', '[PASSWORD]', 'Hey,', 'hierbij', 'de', 'login', 'gegevens', 'voor', 'jd.', 'Ik', 'geef', 'niet', 'zoveel', 'om', 'dit', 'account', 'dus', 'ik', 'ga', 'niet', 'moeilijk', 'doen', 'met', 'verschillende', 'communicatie', 'kanalen.', 'Gebruikersnaam:', 'falling', 'met', '[PASSWORD]', 'als', 'ww,', 'de', 'A', 'is', 'een', '4', 'en', 'de', 'tweede', 'S', 'een', '$.', 'Salam,', 'bu', 'bizim', '
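
A small illustration of what this step does to individual messages, using made-up inputs in the style of the list above. It uses `itertools.chain.from_iterable` in place of the `flatten` helper, which is only an alternative spelling of the same idea.

```python
from itertools import chain

messages = [
    "Hi! Voor deviantart:\nGn: Robo\nWw: [PASSWORD]\n",
    "here is the info Holmes:[PASSWORD]",
    "",
]

# str.split() with no argument splits on any run of whitespace
# (spaces, tabs, newlines) and never yields empty strings
words = list(chain.from_iterable(m.split() for m in messages))
print(words)
```

Note that the empty message contributes no words, and that a placeholder glued to other text, as in `Holmes:[PASSWORD]`, stays a single token.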




We generate the labels by checking which words are [PASSWORD] and which are not. We use 1.0 for passwords and 0.0 for non-passwords. 

In [13]:
placeholder_pattern = re.compile(r".*\[PASSWORD\].*", flags=re.DOTALL)

def is_password(word):
    return placeholder_pattern.match(word)

labels = [1.0 if is_password(word) else 0.0 for word in words]
print(f"The first 15 labels are {labels[:15]}")
n = int(np.sum(labels))

The first 15 labels are [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
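
To make the labelling rule concrete, here is a minimal sketch on a handful of made-up tokens. It uses a plain substring test, which for this purpose should behave like the `.*\[PASSWORD\].*` pattern above; the variable names are only illustrative.

```python
words = ["Pw:", "[PASSWORD]", "info", "Holmes:[PASSWORD]", "wachtwoord"]

# A token counts as a password if it contains the placeholder anywhere,
# so "Holmes:[PASSWORD]" is labelled 1.0 as a whole
labels = [1.0 if "[PASSWORD]" in word else 0.0 for word in words]

# Every word needs exactly one label before the two lists are saved together
assert len(words) == len(labels)

n_passwords = int(sum(labels))
print(labels, n_passwords)
```

The consequence is that a token such as `Holmes:[PASSWORD]` is labelled as a password even though part of it is a username.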


In [14]:
def replace_with_passwords(n, words):
    '''
    Replaces the PASSWORD placeholder with passwords, using some of the functions from above
    '''
    passwords = read_long_list_from_file(n, "random_other_pws.txt")
    replacements = random.sample(passwords, n) 
    i = 0
    new_list = []
    for word in words:
        if placeholder_pattern.match(word):
            # str.replace inserts the password literally; re.sub would try to
            # interpret any backslashes in the password as escape sequences
            word = word.replace('[PASSWORD]', replacements[i])
            i += 1
        new_list.append(word)
    return new_list

# Keep the result: the function returns a new list and leaves `words` untouched
words = replace_with_passwords(n, words)
words

['Hey,',
 'here',
 'is',
 'the',
 'login:',
 'ert',
 'drache2',
 'Hey',
 'here',
 'is',
 'the',
 'login',
 'info',
 'for',
 'iqiyi:',
 'Name:',
 'rod',
 'Pw:',
 'critter76',
 'Plato',
 'cheecher',
 "That's",
 'the',
 'login',
 'details',
 'you',
 'asked',
 'for:',
 'romana',
 'eb3383',
 'dsfs',
 'deathman1988',
 'here',
 'is',
 'the',
 'info',
 'Holmes:elan13',
 'diabolo',
 'dutchies',
 'Heb',
 'je',
 'het',
 'ingevuld?',
 'Dan',
 'verwijder',
 'ik',
 'het',
 'bericht',
 'met',
 'het',
 'wachtwoord',
 'Tady',
 'máš',
 'údaje',
 'na',
 'amazon.es',
 'fafa',
 'dad1219',
 'Hi!',
 'Voor',
 'deviantart:',
 'Gn:',
 'Robo',
 'Ww:',
 'fi1234',
 'curly',
 'ethome',
 'Hey,',
 'hierbij',
 'de',
 'login',
 'gegevens',
 'voor',
 'jd.',
 'Ik',
 'geef',
 'niet',
 'zoveel',
 'om',
 'dit',
 'account',
 'dus',
 'ik',
 'ga',
 'niet',
 'moeilijk',
 'doen',
 'met',
 'verschillende',
 'communicatie',
 'kanalen.',
 'Gebruikersnaam:',
 'falling',
 'met',
 'forbiddens',
 'als',
 'ww,',
 'de',
 'A',
 'is',
 'ee

Finally, we save the data in the format required by the models and the rest of the code

In [15]:
data = DatasetDict({
                'test': Dataset.from_dict({
                    'text': words,
                    'label': labels
                }),
            })
data.save_to_disk("questionnaire_data")

Saving the dataset (0/1 shards):   0%|          | 0/1016 [00:00<?, ? examples/s]