### vi. Attacker relationship (how the attacker claims to know the victim: 1) met online, 2) friend of a friend, 3) met the victim in person before)

#### Example sentences:
- I **got your contact during** one of my recent official trips to S/korea March/April this year as one of the  Economic Community of West African States (ECOWAS) Delegates accredited by the Korea Trade Promotion Agency (KOTRA) office to attend your  Trade Missions-the Korean International Commercial and Special Vehicle show tagged "CSV SHOW 2006" held there between Wednesday the 29th of March to Saturday Ist April 2006.
- I **got your contact from** a close Associate of mine who works with the Nigerian Chamber of Commerce and Industry  who visited your country for an International Trade Fair upon my quest for a trusted and reliable foreign businessman or company. 
- I GUESS THIS LETTER MAY COME TO YOU AS SURPRISE SINCE I **HAD NO PREVIOUS CORRESPONDENCE WITH YOU**.
- I know this letter will come to you as surprise. **I managed to get your contact line from** the Cote Divoire Chamber of Commerce, hence I did not waste time to contact you on a business proposal. 
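
The bolded cue phrases in the example sentences suggest a simple rule-based baseline that can be run before any model is trained. The following is a minimal sketch, not the assignment's prescribed method: the category names, the `RELATIONSHIP_PATTERNS` table, and the `guess_relationship` helper are all hypothetical, and the patterns are hand-derived from the four examples only, so they will miss most other phrasings.

```python
import re

# Hypothetical cue patterns, hand-derived from the example sentences above.
# The category names are illustrative assumptions, not part of the assignment.
RELATIONSHIP_PATTERNS = {
    "met_in_person": re.compile(
        r"got your contact during|met you (?:at|in|during)", re.I),
    "third_party_referral": re.compile(
        r"(?:got|get) your contact (?:line )?from", re.I),
    "no_prior_contact": re.compile(
        r"no previous correspondence|come to you as (?:a )?surprise", re.I),
}


def guess_relationship(text):
    """Return the first category whose cue pattern matches, else 'unknown'."""
    normalized = re.sub(r"\s+", " ", text)  # collapse line breaks and repeated spaces
    for label, pattern in RELATIONSHIP_PATTERNS.items():
        if pattern.search(normalized):
            return label
    return "unknown"


print(guess_relationship("I got your contact from a close Associate of mine"))
```

Patterns are checked in dictionary order, so an email that contains both a referral cue and a "surprise" opener is labelled by the more specific referral cue.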


#### Boolean Question Answering (train a model on the Google boolean-questions dataset and apply it to the emails)

Model training code is taken from here: [Deep Learning has (almost) all the answers: Yes/No Question Answering with Transformers](https://medium.com/illuin/deep-learning-has-almost-all-the-answers-yes-no-question-answering-with-transformers-223bebb70189)

In [10]:
import random
import torch
import numpy as np
import pandas as pd
from tqdm import tqdm
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
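from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated; the PyTorch optimizer takes the same lr/eps arguments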

In [11]:
# Use a GPU if you have one available (Runtime -> Change runtime type -> GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Set seeds for reproducibility
random.seed(26)
np.random.seed(26)
torch.manual_seed(26)

tokenizer = AutoTokenizer.from_pretrained("roberta-base") 

model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
model.to(device) # Send the model to the GPU if we have one

learning_rate = 1e-5
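# weight_decay is set explicitly so the result does not depend on which AdamW implementation is imported
optimizer = AdamW(model.parameters(), lr=learning_rate, eps=1e-8, weight_decay=0.0)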

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifie

In [12]:
def encode_data(tokenizer, questions, passages, max_length):
    """Encode the question/passage pairs into features that can be fed to the model."""
    input_ids = []
    attention_masks = []

    for question, passage in zip(questions, passages):
        # padding/truncation replace the deprecated pad_to_max_length/truncation_strategy arguments
        encoded_data = tokenizer.encode_plus(question, passage, max_length=max_length, padding="max_length", truncation="longest_first")
        encoded_pair = encoded_data["input_ids"]
        attention_mask = encoded_data["attention_mask"]

        input_ids.append(encoded_pair)
        attention_masks.append(attention_mask)

    return np.array(input_ids), np.array(attention_masks)

# Loading data
train_data_df = pd.read_json("Data/google-research-datasets-boolean-questions/train.jsonl", lines=True, orient='records')
dev_data_df = pd.read_json("Data/google-research-datasets-boolean-questions/dev.jsonl", lines=True, orient="records")

passages_train = train_data_df.passage.values
questions_train = train_data_df.question.values
answers_train = train_data_df.answer.values.astype(int)

passages_dev = dev_data_df.passage.values
questions_dev = dev_data_df.question.values
answers_dev = dev_data_df.answer.values.astype(int)

# Encoding data
max_seq_length = 256
input_ids_train, attention_masks_train = encode_data(tokenizer, questions_train, passages_train, max_seq_length)
input_ids_dev, attention_masks_dev = encode_data(tokenizer, questions_dev, passages_dev, max_seq_length)

train_features = (input_ids_train, attention_masks_train, answers_train)
dev_features = (input_ids_dev, attention_masks_dev, answers_dev)


In [13]:
# Building Dataloaders
batch_size = 32

train_features_tensors = [torch.tensor(feature, dtype=torch.long) for feature in train_features]
dev_features_tensors = [torch.tensor(feature, dtype=torch.long) for feature in dev_features]

train_dataset = TensorDataset(*train_features_tensors)
dev_dataset = TensorDataset(*dev_features_tensors)

train_sampler = RandomSampler(train_dataset)
dev_sampler = SequentialSampler(dev_dataset)

train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size)
dev_dataloader = DataLoader(dev_dataset, sampler=dev_sampler, batch_size=batch_size)

In [14]:
epochs = 5
grad_acc_steps = 1
train_loss_values = []
dev_acc_values = []

for _ in tqdm(range(epochs), desc="Epoch"):
    # Training
    epoch_train_loss = 0 # Cumulative loss
    model.train()
    model.zero_grad()

    for step, batch in enumerate(train_dataloader):

        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        labels = batch[2].to(device)     

        outputs = model(input_ids, token_type_ids=None, attention_mask=attention_masks, labels=labels)

        loss = outputs[0]
        loss = loss / grad_acc_steps
        epoch_train_loss += loss.item()

        loss.backward()
        if (step+1) % grad_acc_steps == 0: # Gradient accumulation is over
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # Clipping gradients
            optimizer.step()
            model.zero_grad()

    epoch_train_loss = epoch_train_loss / len(train_dataloader)          
    train_loss_values.append(epoch_train_loss)
  
    # Evaluation
    epoch_dev_accuracy = 0 # Cumulative accuracy
    model.eval()

    for batch in dev_dataloader:
    
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        labels = batch[2]
                    
        with torch.no_grad():        
            outputs = model(input_ids, token_type_ids=None, attention_mask=attention_masks)
                        
        logits = outputs[0]
        logits = logits.detach().cpu().numpy()
        
        predictions = np.argmax(logits, axis=1).flatten()
        labels = labels.numpy().flatten()
        
        epoch_dev_accuracy += np.sum(predictions == labels) / len(labels)

    epoch_dev_accuracy = epoch_dev_accuracy / len(dev_dataloader)
    dev_acc_values.append(epoch_dev_accuracy)

Epoch:   0%|          | 0/5 [04:06<?, ?it/s]


KeyboardInterrupt: 
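
The evaluation loop above averages per-batch accuracies, which gives a final batch smaller than `batch_size` the same weight as a full one. A minimal sketch of the difference, assuming predictions and labels are available as plain lists per batch (the helper names `exact_accuracy` and `mean_of_batch_accuracies` are illustrative, not from the original notebook):

```python
def exact_accuracy(batches):
    """Accuracy over all examples, given (predictions, labels) pairs per batch."""
    correct = 0
    total = 0
    for predictions, labels in batches:
        correct += sum(int(p == y) for p, y in zip(predictions, labels))
        total += len(labels)
    return correct / total if total else 0.0


def mean_of_batch_accuracies(batches):
    """Accuracy as computed in the loop above: the mean of per-batch accuracies."""
    per_batch = [
        sum(int(p == y) for p, y in zip(predictions, labels)) / len(labels)
        for predictions, labels in batches
    ]
    return sum(per_batch) / len(per_batch) if per_batch else 0.0


# One full batch of four correct predictions, then a final batch with one wrong prediction.
toy_batches = [([1, 1, 0, 0], [1, 1, 0, 0]), ([1], [0])]
print(exact_accuracy(toy_batches), mean_of_batch_accuracies(toy_batches))
```

With a batch size of 32 and a dev set of a few thousand examples the two numbers are usually close, so this matters mainly when comparing against published figures.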

In [15]:
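def predict(question, passage):
    """Print the yes/no probabilities the model assigns to a question about a passage."""
    sequence = tokenizer.encode_plus(question, passage, return_tensors="pt")['input_ids'].to(device)
    model.eval()  # turn off dropout; the model stays in train mode if training was interrupted
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(sequence)[0]
    probabilities = torch.softmax(logits, dim=1).cpu().tolist()[0]
    proba_yes = round(probabilities[1], 2)
    proba_no = round(probabilities[0], 2)

    print(f"Question: {question}, Yes: {proba_yes}, No: {proba_no}")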

    
passage_superbowl = """Super Bowl 50 was an American football game to determine the champion of the National Football League
                    (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated
                    the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title.
                    The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara,
                    California. As this was the 50th Super Bowl, the league emphasized the 'golden anniversary' with various
                    gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game
                    with Roman numerals (under which the game would have been known as 'Super Bowl L'), so that the logo could
                    prominently feature the Arabic numerals 50."""
 
passage_illuin = """Illuin designs and builds solutions tailored to your strategic needs using Artificial Intelligence
                  and the new means of human interaction this technology enables."""

superbowl_questions = [
"Did the Denver Broncos win the Super Bowl 50?", 
"Did the Carolina Panthers win the Super Bowl 50?",
"Was the Super Bowl played at Levi's Stadium?", 
"Was the Super Bowl 50 played in Las Vegas?", 
"Was the Super Bowl 50 played in February?", 
"Was the Super Bowl 50 played in March?"
]

question_illuin = "Is Illuin the answer to your strategic needs?"

for s_question in superbowl_questions:
    predict(s_question, passage_superbowl)

predict(question_illuin, passage_illuin)

Question: Did the Denver Broncos win the Super Bowl 50?, Yes: 0.6, No: 0.4
Question: Did the Carolina Panthers win the Super Bowl 50?, Yes: 0.6, No: 0.4
Question: Was the Super Bowl played at Levi's Stadium?, Yes: 0.62, No: 0.38
Question: Was the Super Bowl 50 played in Las Vegas?, Yes: 0.63, No: 0.37
Question: Was the Super Bowl 50 played in February?, Yes: 0.59, No: 0.41
Question: Was the Super Bowl 50 played in March?, Yes: 0.68, No: 0.32
Question: Is Illuin the answer to your strategic needs?, Yes: 0.58, No: 0.42
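
The "Yes" scores printed above all fall between 0.58 and 0.68 whether the correct answer is yes or no, which is what one would expect from a classifier head that has barely been trained (the training cell was interrupted during the first epoch). Once a trained model is available, its yes-probabilities can be mapped to one of the relationship categories. The following is a minimal sketch under several assumptions: `proba_yes` is a hypothetical callable that returns the "yes" probability instead of printing it, and the question wording, label names, and the 0.7 threshold are illustrative choices.

```python
def pick_relationship(proba_yes, passage, threshold=0.7):
    """Ask one yes/no question per category and keep the most confident 'yes'.

    proba_yes: callable (question, passage) -> probability of 'yes' in [0, 1].
    Returns (label, scores); label is 'unknown' if no score reaches the threshold.
    """
    # Hypothetical question wording, one question per category.
    questions = {
        "met_online": "Did we meet online?",
        "met_in_person": "Did we meet in person?",
        "friend_of_friend": "Did a friend give me your contact?",
    }
    scores = {label: proba_yes(q, passage) for label, q in questions.items()}
    best = max(scores, key=scores.get)
    label = best if scores[best] >= threshold else "unknown"
    return label, scores


# Stand-in scorer for illustration only; a real one would wrap the fine-tuned model.
def fake_proba_yes(question, passage):
    return 0.9 if "online" in question else 0.2


print(pick_relationship(fake_proba_yes, "We met in a chat room last year.")[0])
```

The threshold keeps near-uniform scores like the ones above from being reported as a confident label.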


In [8]:
import json
import os
from ftfy import fix_text
import re
from transformers import pipeline

nlp = pipeline("question-answering")


PATH = '../data/separated by email/'

for email in os.listdir(PATH):
    print(email)
    
    with open(os.path.join(PATH, email)) as f:  # close the file handle after reading
        j = json.load(f)
    if 'X-TIKA:content' not in j:
        print(email, "does not have 'X-TIKA:content' key")
        continue
    text = j['X-TIKA:content']
    text_cleaned = fix_text(text).strip()
    text_cleaned = re.sub(r'(\n\s*)+', ' ', text_cleaned) # collapse each run of newlines (and trailing whitespace) into a single space
    
    context = text_cleaned
    
    questions = [
        "How did I get your contact?",
        "How did we met?"
    ]
    
    # Yes/no questions intended for the boolean QA model; not used by the loop below
    booleanQuestions = [
        "Did we meet online?",
        "Did we meet in person?"
    ]
    
    for question in questions:
        print(question)
        print(nlp(question=question, context=context))
        
    print('\n-------\n')
        

3721.json
How did I get your contact?
{'score': 0.30042335391044617, 'start': 2344, 'end': 2357, 'answer': 'telephone/fax'}
How did we met?
{'score': 0.24091817438602448, 'start': 2344, 'end': 2357, 'answer': 'telephone/fax'}


ValueError: only one element tensors can be converted to Python scalars
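
The notebook does not establish what triggers the `ValueError` above. One plausible factor is the length of the email body passed as `context`, so a workaround worth trying is to split the context into overlapping word windows, query each window separately, and keep the highest-scoring answer. The following is a minimal sketch under that assumption: `split_into_windows` and `best_answer` are hypothetical helpers, and `qa` stands for any callable with the same `question=`/`context=` interface as the `nlp` pipeline used above.

```python
def split_into_windows(text, window_words=200, overlap_words=50):
    """Split text into overlapping windows of at most window_words words."""
    if overlap_words >= window_words:
        raise ValueError("overlap_words must be smaller than window_words")
    words = text.split()
    step = window_words - overlap_words
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + window_words]))
        if start + window_words >= len(words):
            break  # the last window already reaches the end of the text
    return windows


def best_answer(qa, question, context, **window_kwargs):
    """Run qa on every window and return the result with the highest score.

    qa: callable(question=..., context=...) -> dict with 'score' and 'answer'.
    Returns None for an empty context.
    """
    candidates = [
        qa(question=question, context=window)
        for window in split_into_windows(context, **window_kwargs)
    ]
    return max(candidates, key=lambda c: c["score"]) if candidates else None


print(split_into_windows("a b c d e f g h i j", window_words=4, overlap_words=2))
```

Note that any `start` and `end` offsets in the returned answer are relative to the window, not to the full email text.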

In [34]:
from transformers import T5Tokenizer

T5Tokenizer.from_pretrained('t5-base')

### ix. Attacker estimated age. You will use the USC Data Science AgePredictor, here: https://github.com/USCDataScience/AgePredictor