## Playground for developing a data preprocessing pipeline

### Examine the personal dataset with Pandas

`personal_fraud_email.csv` contains three columns:
- `Sender`: String of the sender's name and email address
- `Raw`: String of the raw email content
- `Fraud`: Flag for whether the email is fraudulent (stored as `0`/`1`)
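
Because `Sender` packs a display name and an address into one string, it may help to split it before building features. The following is a minimal sketch using the standard library's `email.utils.parseaddr`; the helper name `split_sender` is hypothetical and not part of the notebook.

```python
from email.utils import parseaddr

def split_sender(sender: str) -> tuple[str, str]:
    """Split a 'Name <address>' string into (display name, sender domain)."""
    name, address = parseaddr(sender)
    # Everything after the last '@' is the domain; empty if no address was parsed.
    domain = address.rpartition("@")[2].lower() if "@" in address else ""
    return name, domain

# Example value taken from the first row of the `Sender` column.
sender_name, sender_domain = split_sender('"ABC Shark Tank" <A@gaobie.fathetic.com>')
print(sender_name, sender_domain)
```

With pandas, the same helper could be applied row by row, for example via `p_df['Sender'].apply(split_sender)`.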

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

folder_path = "fraud_emails"
csv_path = "personal_fraud_email.csv"

In [12]:
p_df = pd.read_csv(csv_path)

In [13]:
p_df.head()

Unnamed: 0,Sender,Raw,Fraud
0,"""ABC Shark Tank"" <A@gaobie.fathetic.com>",Received: from 10.217.135.43\n by atlas108.fre...,1
1,"""ABC Shark Tank"" <A@herz.fathetic.com>",Received: from 127.0.0.1\n by atlas-production...,1
2,"""ABC Shark Tank"" <A@suipo.gangoulionectomy.com>",Received: from 10.196.241.214\n by atlas302.fr...,1
3,"""%% Nicknguyen3 Camplejeunesuit %%"" <utjnwaDhM...",Received: from 127.0.0.1\n by atlas-production...,1
4,"""PROGRESSIVE"" <A@kudoke.iguanopy.com>",Received: from 10.197.37.9\n by atlas306.free....,1


In [11]:
p_df['Raw'][0]

'Received: from 10.217.135.43\n by atlas108.free.mail.ne1.yahoo.com with HTTPS; Fri, 23 Sep 2022 20:36:13 +0000\nReceived: from 81.95.5.156 (EHLO gaobie.fathetic.com)\n by 10.217.135.43 with SMTP;\n Fri, 23 Sep 2022 20:36:13 +0000\nFrom: "ABC Shark Tank" <A@gaobie.fathetic.com>\nTo: <nicknguyen3@yahoo.com>\nSubject: #1 Weight Loss Supplement ever\nDate: Fri, 23 Sep 2022 15:36:13 -0500\nMessage-ID: <724141433_848235704_895257853@gaobie.fathetic.com>\nMIME-Version: 1.0\nContent-Type: multipart/alternative;\n\tboundary="----=_NextPart_000_0AF5_01D8D2AF.71BEF780"\nX-Mailer: Microsoft Outlook 16.0\nX-Originating-Ip: [81.95.5.156]\nX-Originating-Ip: 81.95.5.156\nAuthentication-Results: atlas108.free.mail.ne1.yahoo.com;\n dkim=unknown;\n spf=pass smtp.mailfrom=gaobie.fathetic.com;\n dmarc=unknown header.from=gaobie.fathetic.com;\nX-Apparently-To: nicknguyen3@yahoo.com; Fri, 23 Sep 2022 20:36:14 +0000\nX-YMailISG: FDq_T_4WLDszG_9ocF2vWzkKzKZoMDdJEJDLnUIDgjBjN48W\n jTYYWaWPjVmt1I.ylZ1R.b3aqC8rV
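
The `Raw` value is a full RFC 822 style message, so individual headers can be read with the standard library's `email` package instead of string slicing. This is only a sketch: `sample_raw` is a shortened, partly hypothetical stand-in for one `Raw` entry (the `To` address and body are invented), not the actual data.

```python
from email import message_from_string

# Shortened stand-in for one `Raw` value; `To` and the body are made up.
sample_raw = (
    'From: "ABC Shark Tank" <A@gaobie.fathetic.com>\n'
    "To: <user@example.com>\n"
    "Subject: #1 Weight Loss Supplement ever\n"
    "MIME-Version: 1.0\n"
    "\n"
    "Body text goes here.\n"
)

msg = message_from_string(sample_raw)
subject = msg["Subject"]      # header lookup is case-insensitive
sender = msg["From"]
body = msg.get_payload()      # a plain string for non-multipart messages
print(subject)
```

For multipart messages such as the one shown above, `get_payload()` returns a list of parts rather than a string; `msg.walk()` iterates over them.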

### Examine the Kaggle Dataset

`kaggle_fraud_email.csv` contains two columns of emails collected from 1998 to 2007:
- `Text`: String of the raw email content
- `Class`: Flag for whether the email is fraudulent (stored as `0`/`1`)

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

kaggle_path = "kaggle_fraud_email.csv"

In [151]:
kaggle_df = pd.read_csv(kaggle_path, encoding='utf-8')
kaggle_df.head()

Unnamed: 0,Text,Class
0,Supply Quality China's EXCLUSIVE dimensions at...,1
1,over. SidLet me know. Thx.,0
2,"Dear Friend,Greetings to you.I wish to accost ...",1
3,MR. CHEUNG PUIHANG SENG BANK LTD.DES VOEUX RD....,1
4,Not a surprising assessment from Embassy.,0


In [152]:
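# Drop missing rows first: astype(str) would otherwise turn NaN into the string 'nan'
text_np = kaggle_df['Text'].dropna().astype(str).to_numpy()
split_idx = int(len(text_np) * 0.8)
train_txt, test_txt = text_np[:split_idx], text_np[split_idx:]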
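
The split above takes the first 80% of the rows in file order. If the file happens to be ordered by date or class, a shuffled split may be more representative. The sketch below assumes NumPy only; `shuffled_split` and the demo data are hypothetical names introduced for illustration.

```python
import numpy as np

def shuffled_split(texts, labels, train_frac=0.8, seed=0):
    """Shuffle texts and labels together, then split them into train/test parts."""
    texts = np.asarray(texts, dtype=object)
    labels = np.asarray(labels)
    order = np.random.default_rng(seed).permutation(len(texts))
    cut = int(len(texts) * train_frac)
    train_idx, test_idx = order[:cut], order[cut:]
    return (texts[train_idx], labels[train_idx]), (texts[test_idx], labels[test_idx])

# Tiny made-up data set standing in for the `Text` and `Class` columns.
demo_texts = [f"email {i}" for i in range(10)]
demo_labels = [i % 2 for i in range(10)]
(train_part, train_y), (test_part, test_y) = shuffled_split(demo_texts, demo_labels)
print(len(train_part), len(test_part))
```

Fixing `seed` keeps the split reproducible between notebook runs.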

In [155]:
# Display the column; the trailing rows are NaN (missing text).
# (The earlier apply(...) call here left every value unchanged, so it is omitted.)
kaggle_df['Text']

0        Supply Quality China's EXCLUSIVE dimensions at...
1                               over. SidLet me know. Thx.
2        Dear Friend,Greetings to you.I wish to accost ...
3        MR. CHEUNG PUIHANG SENG BANK LTD.DES VOEUX RD....
4                Not a surprising assessment from Embassy.
                               ...                        
11924                                                  NaN
11925                                                  NaN
11926                                                  NaN
11927                                                  NaN
11928                                                  NaN
Name: Text, Length: 11929, dtype: object

In [164]:
# Keep only the entries that are real strings (drops the NaN rows)
text_np = np.array([text for text in kaggle_df['Text'] if isinstance(text, str)])
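
The cell above filters only the texts, so the result no longer lines up row for row with the `Class` labels. One way to keep them aligned is to drop the missing rows on the DataFrame itself. This is a sketch on a tiny made-up frame (`demo_df`) with the same column names, not the real data.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the same column names as the Kaggle file.
demo_df = pd.DataFrame({
    "Text": ["Dear Friend, ...", np.nan, "Not a surprising assessment.", np.nan],
    "Class": [1, 0, 0, 1],
})

n_missing = int(demo_df["Text"].isna().sum())                    # rows without text
clean_df = demo_df.dropna(subset=["Text"]).reset_index(drop=True)

# Texts and labels now share the same row order.
clean_texts = clean_df["Text"].to_numpy()
clean_labels = clean_df["Class"].to_numpy()
print(n_missing, len(clean_df))
```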


In [166]:
from transformers import BertTokenizer
import numpy as np
import torch

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the sentences
input_ids = []
attention_masks = []

for sentence in text_np:
    encoded_dict = tokenizer.encode_plus(
                        sentence,                      # Sentence to encode
                        add_special_tokens = True,     # Add '[CLS]' and '[SEP]'
                        max_length = 64,               # Pad & truncate all sentences.
                        padding = 'max_length',
                        truncation = True,
                        return_attention_mask = True,  # Construct attn. masks.
                        return_tensors = 'pt'          # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list
    input_ids.append(encoded_dict['input_ids'])
    
    # Add its attention mask (differentiates padding from non-padding)
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists to tensors
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)

# Print the encoded sentences
print("Input IDs:\n", input_ids)

Input IDs:
 tensor([[ 101, 4425, 3737,  ..., 1997, 9812,  102],
        [ 101, 2058, 1012,  ...,    0,    0,    0],
        [ 101, 6203, 2767,  ..., 2009, 1010,  102],
        ...,
        [ 101, 2633, 1010,  ..., 6852, 1012,  102],
        [ 101, 2720, 1027,  ..., 2052, 2022,  102],
        [ 101, 1045, 1005,  ...,    0,    0,    0]])
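
In the output above, `101` and `102` are the `[CLS]` and `[SEP]` ids of `bert-base-uncased`, and the trailing `0`s are `[PAD]`. What `padding='max_length'` and the attention mask do can be sketched in plain Python. The helper `pad_or_truncate` is hypothetical and deliberately simplified: the real tokenizer also keeps the final `[SEP]` when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, mask), each exactly max_length long.

    Simplified illustration: cuts off extra tokens and pads with pad_id.
    """
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)              # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding

ids, mask = pad_or_truncate([101, 2058, 1012, 102], max_length=6)
print(ids)   # [101, 2058, 1012, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```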


In [176]:
from transformers import BertModel

# Load the BERT model
model = BertModel.from_pretrained('bert-base-uncased')

# Pass the first 100 examples through BERT; no_grad() skips gradient tracking to save memory
with torch.no_grad():
    outputs = model(input_ids=input_ids[:100, :], attention_mask=attention_masks[:100, :])

# Print the final hidden states of the last layer of the model
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tensor([[[-7.1951e-02,  7.8548e-02, -2.3038e-02,  ..., -6.0618e-01,
           4.0065e-01,  1.5508e-02],
         [ 1.9590e-01, -1.1646e-01,  4.6568e-01,  ..., -2.8163e-01,
           5.2920e-01,  1.0901e-01],
         [ 6.6792e-02, -2.8796e-01,  5.9722e-01,  ..., -4.0285e-01,
           2.3325e-01,  1.2778e-01],
         ...,
         [-5.4138e-01,  3.1125e-01,  2.9037e-01,  ..., -4.7552e-01,
           3.7880e-02, -1.0177e+00],
         [-1.8214e-01, -4.4945e-01, -2.9000e-01,  ..., -1.5307e-01,
           5.5064e-01, -1.3873e+00],
         [ 5.8956e-01,  7.3573e-02, -2.5281e-01,  ..., -3.0691e-01,
          -4.1028e-01, -2.3539e-01]],

        [[-2.6554e-01, -1.5594e-01,  6.3940e-01,  ..., -5.1358e-01,
           2.8684e-01,  7.6536e-01],
         [ 7.0034e-01, -2.2581e-01,  7.7793e-01,  ..., -3.8660e-01,
           4.6591e-01,  6.0488e-01],
         [-3.1319e-01, -6.6110e-01,  6.4022e-01,  ...,  2.1683e-02,
           3.6389e-01,  3.0354e-01],
         ...,
         [ 2.1970e-01, -9

In [178]:
# Up to this step, we can obtain the last hidden states
input_ids[:100, :].shape, attention_masks[:100, :].shape, last_hidden_states.shape

(torch.Size([100, 64]), torch.Size([100, 64]), torch.Size([100, 64, 768]))
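
The hidden states have one 768-dimensional vector per token, so a classifier would usually need them reduced to one vector per email. One common option, which this notebook does not itself do, is an attention-masked mean over the token axis. The sketch below uses NumPy with random stand-in data; `masked_mean` is a hypothetical helper, and the same arithmetic applies to the tensors above.

```python
import numpy as np

def masked_mean(hidden, mask):
    """Average token vectors over the sequence axis, ignoring padded positions.

    hidden: (batch, seq_len, dim) array; mask: (batch, seq_len) array of 0/1.
    """
    mask = mask[:, :, None].astype(hidden.dtype)    # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)            # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1, None)     # (batch, 1), avoids division by zero
    return summed / counts

rng = np.random.default_rng(0)
demo_hidden = rng.normal(size=(2, 4, 3))            # stand-in for last_hidden_states
demo_mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])  # second row has two padded tokens
pooled = masked_mean(demo_hidden, demo_mask)
print(pooled.shape)  # (2, 3)
```

A simpler alternative is to take only the `[CLS]` position, i.e. `last_hidden_states[:, 0, :]`.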

### Old method

In [46]:
import torchtext


# define the tokenizer function
def tokenize(text):
    return text.split()

# define the text field
# (legacy API: `Field` moved to torchtext.legacy.data in torchtext 0.9 and was removed in 0.12)
TEXT = torchtext.data.Field(tokenize=tokenize)

# tokenize a text string
# tokenized_text_list = []
# for text in text_np:
#     tokenized_text_list.append(TEXT.tokenize(text))
#     print(len)

# process the text sequences using the TEXT field
TEXT.build_vocab(text_np)
processed_text = TEXT.process(text_np)


print('finished processing text')


finished processing text
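
Since `torchtext.data.Field` is no longer available in recent torchtext releases, the two steps it performs here (building a vocabulary and turning texts into padded id sequences) can be approximated in plain Python. This is a rough sketch of the idea, not torchtext's implementation; `build_vocab`, `numericalize`, and the demo strings are hypothetical.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Map each token to an integer id, most frequent tokens first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials) + [tok for tok, _ in counts.most_common()]
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(token_lists, vocab):
    """Convert token lists to equal-length id lists, padding with <pad>."""
    max_len = max(len(toks) for toks in token_lists)
    unk, pad = vocab["<unk>"], vocab["<pad>"]
    return [
        [vocab.get(tok, unk) for tok in toks] + [pad] * (max_len - len(toks))
        for toks in token_lists
    ]

# Whitespace tokenization, as in the `tokenize` function above.
demo_tokens = [text.split() for text in ["let me know", "know thx"]]
demo_vocab = build_vocab(demo_tokens)
demo_ids = numericalize(demo_tokens, demo_vocab)
print(demo_ids)  # [[3, 4, 2], [2, 5, 1]]
```

Note that this sketch is batch-first (one row per text), whereas `Field` defaults to sequence-first tensors of shape `(seq_len, batch)`.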


In [52]:
# Only the last expression of a cell is displayed, so print the shape explicitly
print(processed_text.shape)
torch.save(processed_text, 'processed_kaggle_text.pt')

torch.Size([77053, 11929])

In [53]:
kaggle_df.describe()

Unnamed: 0,Class
count,11929.0
mean,0.434823
std,0.495754
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0
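
Because `Class` is a 0/1 column, its `mean` is the fraction of rows labelled as fraud, so the summary above implies roughly 43.5% fraud emails. The approximate class counts can be recovered with a little arithmetic; the numbers below are copied from the output, and the counts are inferred from the rounded mean rather than read from the data.

```python
n_rows = 11929          # `count` from the describe() output above
fraud_rate = 0.434823   # `mean` of the 0/1 `Class` column

n_fraud = round(n_rows * fraud_rate)   # rows with Class == 1
n_legit = n_rows - n_fraud             # rows with Class == 0
print(n_fraud, n_legit)  # 5187 6742
```

On the DataFrame itself, `kaggle_df['Class'].value_counts()` would give the exact counts directly.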
