## Playground for developing the data preprocessing pipeline

### Examine the personal dataset with Pandas

`personal_fraud_email.csv` contains three columns:
- `Sender`: the sender's display name and email address (string)
- `Raw`: the raw email content, including headers (string)
- `Fraud`: whether the email is fraudulent (stored as 0/1)
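
Because `Sender` packs the display name and the address into one string, it may help to split them before feature extraction. A minimal sketch using the standard library's `email.utils.parseaddr`; the sample value is the first sender in the dataset, and the `domain` feature is an assumption for illustration, not part of the original pipeline.

```python
from email.utils import parseaddr

# Sample value taken from the first row of the `Sender` column.
sender = '"ABC Shark Tank" <A@gaobie.fathetic.com>'

# parseaddr returns a (display_name, address) tuple.
name, address = parseaddr(sender)

# Hypothetical extra feature: the sending domain, lower-cased.
domain = address.rsplit("@", 1)[-1].lower()

print(name)
print(address)
print(domain)
```

Applied to the whole column, the same call could be passed to `p_df['Sender'].apply(parseaddr)`.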

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

folder_path = "fraud_emails"
csv_path = "personal_fraud_email.csv"

In [3]:
p_df = pd.read_csv(csv_path)

In [4]:
p_df.head()

Unnamed: 0,Sender,Raw,Fraud
0,"""ABC Shark Tank"" <A@gaobie.fathetic.com>",Received: from 10.217.135.43\n by atlas108.fre...,1
1,"""ABC Shark Tank"" <A@herz.fathetic.com>",Received: from 127.0.0.1\n by atlas-production...,1
2,"""ABC Shark Tank"" <A@suipo.gangoulionectomy.com>",Received: from 10.196.241.214\n by atlas302.fr...,1
3,"""%% Nicknguyen3 Camplejeunesuit %%"" <utjnwaDhM...",Received: from 127.0.0.1\n by atlas-production...,1
4,"""PROGRESSIVE"" <A@kudoke.iguanopy.com>",Received: from 10.197.37.9\n by atlas306.free....,1


In [5]:
p_df['Raw'][0]

'Received: from 10.217.135.43\n by atlas108.free.mail.ne1.yahoo.com with HTTPS; Fri, 23 Sep 2022 20:36:13 +0000\nReceived: from 81.95.5.156 (EHLO gaobie.fathetic.com)\n by 10.217.135.43 with SMTP;\n Fri, 23 Sep 2022 20:36:13 +0000\nFrom: "ABC Shark Tank" <A@gaobie.fathetic.com>\nTo: <nicknguyen3@yahoo.com>\nSubject: #1 Weight Loss Supplement ever\nDate: Fri, 23 Sep 2022 15:36:13 -0500\nMessage-ID: <724141433_848235704_895257853@gaobie.fathetic.com>\nMIME-Version: 1.0\nContent-Type: multipart/alternative;\n\tboundary="----=_NextPart_000_0AF5_01D8D2AF.71BEF780"\nX-Mailer: Microsoft Outlook 16.0\nX-Originating-Ip: [81.95.5.156]\nX-Originating-Ip: 81.95.5.156\nAuthentication-Results: atlas108.free.mail.ne1.yahoo.com;\n dkim=unknown;\n spf=pass smtp.mailfrom=gaobie.fathetic.com;\n dmarc=unknown header.from=gaobie.fathetic.com;\nX-Apparently-To: nicknguyen3@yahoo.com; Fri, 23 Sep 2022 20:36:14 +0000\nX-YMailISG: FDq_T_4WLDszG_9ocF2vWzkKzKZoMDdJEJDLnUIDgjBjN48W\n jTYYWaWPjVmt1I.ylZ1R.b3aqC8rV
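
The raw string above is a full message with headers, so individual fields can be pulled out with the standard library's `email` parser instead of string slicing. The sketch below uses a shortened copy of the headers shown above (body and most headers omitted); which fields are useful as features is an assumption.

```python
from email import message_from_string

# Shortened copy of a few headers from p_df['Raw'][0]; the body is a placeholder.
raw = (
    "Received: from 10.217.135.43\n"
    " by atlas108.free.mail.ne1.yahoo.com with HTTPS; Fri, 23 Sep 2022 20:36:13 +0000\n"
    'From: "ABC Shark Tank" <A@gaobie.fathetic.com>\n'
    "Subject: #1 Weight Loss Supplement ever\n"
    "Date: Fri, 23 Sep 2022 15:36:13 -0500\n"
    "X-Originating-Ip: [81.95.5.156]\n"
    "\n"
    "(body omitted)\n"
)

msg = message_from_string(raw)

# Header lookup is case-insensitive and returns None for missing headers.
subject = msg["Subject"]
from_header = msg["From"]
origin_ip = msg["X-Originating-Ip"].strip("[]")

print(subject)
print(from_header)
print(origin_ip)
```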

### Examine the Kaggle Dataset

`kaggle_fraud_email.csv` contains two columns, with emails collected from 1998 to 2007:
- `Text`: the raw email content (string)
- `Class`: whether the email is fraudulent (1) or not (0)

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

kaggle_path = "kaggle_fraud_email.csv"

In [7]:
kaggle_df = pd.read_csv(kaggle_path, encoding='utf-8')
kaggle_df.head()

Unnamed: 0,Text,Class
0,Supply Quality China's EXCLUSIVE dimensions at...,1
1,over. SidLet me know. Thx.,0
2,"Dear Friend,Greetings to you.I wish to accost ...",1
3,MR. CHEUNG PUIHANG SENG BANK LTD.DES VOEUX RD....,1
4,Not a surprising assessment from Embassy.,0


In [8]:
kaggle_df.describe()

Unnamed: 0,Class
count,11929.0
mean,0.434823
std,0.495754
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0
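
Since `Class` is a 0/1 column, its mean is the fraction of fraud emails, so the summary above already describes the class balance. A small sketch of that arithmetic, assuming the rounded values printed by `describe()`; the majority-class baseline is added only as a reference point for later model accuracy.

```python
import math

# Values read off the describe() output (the mean is rounded to 6 decimals).
n_rows = 11929
fraud_rate = 0.434823

# Approximate class counts recovered from the mean of the 0/1 column.
n_fraud = round(n_rows * fraud_rate)
n_legit = n_rows - n_fraud

# Sample standard deviation of a 0/1 column, with the n - 1 denominator pandas uses.
std = math.sqrt(fraud_rate * (1 - fraud_rate) * n_rows / (n_rows - 1))

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(n_fraud, n_legit) / n_rows

print(n_fraud, n_legit)
print(round(std, 6))
print(round(majority_baseline, 3))
```

The recomputed standard deviation agrees with the `std` row above, which is a quick check that `Class` really holds only 0 and 1.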


In [9]:
text_np = kaggle_df['Text'].astype(str).to_numpy()
# train_txt, test_txt = text_np[:int(len(text_np)*0.8)], text_np[int(len(text_np)*0.8):]
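
The commented-out line splits the data 80/20 by position, which can give a skewed test set if the rows happen to be ordered by class or date. A minimal sketch of a shuffled split over row indices; `split_indices` is a hypothetical helper, and the seed and fraction are arbitrary choices.

```python
import random

def split_indices(n_rows, train_fraction=0.8, seed=0):
    """Shuffle row indices and split them into train/test lists (hypothetical helper)."""
    indices = list(range(n_rows))
    # A seeded Random instance keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

The resulting index lists can be used to index the text and label arrays with the same ordering, so texts and labels stay aligned.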

In [10]:
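# Note: this leaves every value unchanged (strings stay strings; anything else,
# such as NaN, is passed through). Non-string rows are filtered out in a later cell.
kaggle_df['Text'] = kaggle_df['Text'].apply(lambda x: str(x) if isinstance(x, str) else x)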
kaggle_df['Text']

0        Supply Quality China's EXCLUSIVE dimensions at...
1                               over. SidLet me know. Thx.
2        Dear Friend,Greetings to you.I wish to accost ...
3        MR. CHEUNG PUIHANG SENG BANK LTD.DES VOEUX RD....
4                Not a surprising assessment from Embassy.
                               ...                        
11924    Travel well. I'll look forward to hearing your...
11925    Dear friend, I wish to begin by way of introdu...
11926    Follow Up Flag: Follow upFlag Status: FlaggedM...
11927    sbwhoeop B6Saturday January 23 2010 4:09 PMRe:...
11928    FYI. We are revising call sheet for call to Ka...
Name: Text, Length: 11929, dtype: object

In [13]:
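# Keep only rows whose text is a real string; any empty cells would be read
# by pandas as float NaN, which the tokenizer cannot encode.
text_np = []
label_np = []
orig_text = kaggle_df['Text'].to_numpy()
orig_label = kaggle_df['Class'].to_numpy()
for text, label in zip(orig_text, orig_label):
    if isinstance(text, str):
        text_np.append(text)
        label_np.append(label)
text_np = np.array(text_np)
label_np = np.array(label_np)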


In [16]:
from transformers import BertTokenizer
import numpy as np
import torch

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the sentences
input_ids = []
attention_masks = []

for sentence in text_np:
    encoded_dict = tokenizer.encode_plus(
                        sentence,                      # Sentence to encode
                        add_special_tokens = True,     # Add '[CLS]' and '[SEP]'
                        max_length = 64,               # Pad & truncate all sentences.
                        padding = 'max_length',
                        truncation = True,
                        return_attention_mask = True,  # Construct attn. masks.
                        return_tensors = 'pt'          # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list
    input_ids.append(encoded_dict['input_ids'])
    
    # Add its attention mask (differentiates padding from non-padding)
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists to tensors
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)

# Print the encoded sentences
print("Input IDs:\n", input_ids)

Input IDs:
 tensor([[  101,  4425,  3737,  ...,  1997,  9812,   102],
        [  101,  2058,  1012,  ...,     0,     0,     0],
        [  101,  6203,  2767,  ...,  2009,  1010,   102],
        ...,
        [  101,  3582,  2039,  ...,     0,     0,     0],
        [  101, 24829,  2860,  ..., 10643,  3046,   102],
        [  101,  1042, 10139,  ...,     0,     0,     0]])
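
In the output above, every row starts with 101 (`[CLS]`), truncated rows end with 102 (`[SEP]`), and short rows are filled with 0 (`[PAD]`). The toy function below imitates that behaviour in plain Python to make the attention mask concrete. It is a hypothetical illustration, not the tokenizer's actual implementation, and the token IDs in the long example are arbitrary placeholders.

```python
# Special-token IDs in the bert-base-uncased vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_or_truncate(token_ids, max_length):
    """Toy version of max_length padding/truncation (hypothetical helper)."""
    # Reserve two positions for [CLS] and [SEP], then cut the text tokens to fit.
    body = list(token_ids)[: max_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    n_pad = max_length - len(ids)
    # Real tokens get mask 1, padding positions get mask 0.
    mask = [1] * len(ids) + [0] * n_pad
    return ids + [PAD_ID] * n_pad, mask

# Short input: the first two word-piece IDs of the second row above.
short_ids, short_mask = pad_or_truncate([2058, 1012], max_length=8)

# Long input: 20 placeholder IDs, more than fit in 8 positions.
long_ids, long_mask = pad_or_truncate(range(1000, 1020), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length = 64`, anything beyond the first 62 word pieces of an email is dropped, which is worth keeping in mind for long messages.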


In [17]:
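from transformers import BertModel

# Load the BERT model
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()  # inference only: make sure dropout is disabled

# Pass the input tensors through the BERT model.
# torch.no_grad() avoids building the autograd graph, which is not needed
# for feature extraction and uses a lot of memory.
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_masks)

# Print the final hidden states of the last layer of the model
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states)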

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tensor([[[-0.0720,  0.0785, -0.0230,  ..., -0.6062,  0.4007,  0.0155],
         [ 0.1959, -0.1165,  0.4657,  ..., -0.2816,  0.5292,  0.1090],
         [ 0.0668, -0.2880,  0.5972,  ..., -0.4028,  0.2332,  0.1278],
         ...,
         [-0.5414,  0.3113,  0.2904,  ..., -0.4755,  0.0379, -1.0177],
         [-0.1821, -0.4495, -0.2900,  ..., -0.1531,  0.5506, -1.3873],
         [ 0.5896,  0.0736, -0.2528,  ..., -0.3069, -0.4103, -0.2354]],

        [[-0.2655, -0.1559,  0.6394,  ..., -0.5136,  0.2868,  0.7654],
         [ 0.7003, -0.2258,  0.7779,  ..., -0.3866,  0.4659,  0.6049],
         [-0.3132, -0.6611,  0.6402,  ...,  0.0217,  0.3639,  0.3035],
         ...,
         [ 0.2197, -0.0922,  0.8347,  ..., -0.0221,  0.0108,  0.2884],
         [-0.0812, -0.1998,  0.9985,  ..., -0.0438,  0.0640,  0.3740],
         [ 0.1776, -0.1862,  1.1277,  ..., -0.2804,  0.1428,  0.6102]],

        [[-0.0615, -0.0929,  0.2170,  ..., -0.2440,  0.0712,  0.6325],
         [ 0.5470,  0.6775,  0.5776,  ..., -0
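
Passing every row through the model in a single call keeps all intermediate activations in memory at once. One common way to bound memory is to process fixed-size batches. The sketch below only computes the row ranges for such batches; `batch_ranges` is a hypothetical helper, 256 is an arbitrary batch size, and 11929 is the row count of `kaggle_df` (the filtered arrays may be shorter).

```python
def batch_ranges(n_rows, batch_size):
    """Return (start, stop) index pairs covering n_rows in order (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [(start, min(start + batch_size, n_rows))
            for start in range(0, n_rows, batch_size)]

# 11929 is the row count of kaggle_df; 256 is an arbitrary batch size.
ranges = batch_ranges(11929, 256)

print(len(ranges))
print(ranges[0], ranges[-1])
```

Each `(start, stop)` pair could slice `input_ids[start:stop]` and `attention_masks[start:stop]`, with the per-batch hidden states joined along dimension 0 afterwards, for example with `torch.cat`.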

In [18]:
torch.save(last_hidden_states, 'kaggle_hidden_states.pt')
# label_np is a NumPy array; torch.save pickles it as a NumPy array, not a tensor
torch.save(label_np, 'kaggle_labels.pt')

In [19]:
# input_ids[:100, :].shape, attention_masks[:100, :].shape, last_hidden_states.shape
# Up to this point, the pipeline produces BERT's last hidden states

In [None]:
# labels = kaggle_df['Class'].to_numpy()

In [None]:
kaggle_df['Class'].to_numpy().shape 

(11929,)