# Phishing Dialogue Dataset Inspection

This notebook briefly inspects the dialogue dataset and performs the following processing steps:
1. Load and inspect the dataset structure
2. Create an ID column following the pattern "dia-xxxx"
3. Rename the "labels" column to "label"
4. Save the processed dataset as "phising_dialogue_dataset.csv"


In [6]:
import pandas as pd
import numpy as np

# Load the dataset

raw_data_directory = "../data/raw"
output_directory = "../data/cleaned"
df = pd.read_csv(f'{raw_data_directory}/single-agent-scam-dialogue_all.csv')

print("Dataset shape:", df.shape)
print("\nColumn names:")
print(df.columns.tolist())
print("\nFirst few rows:")
df.head()


Dataset shape: (1600, 3)

Column names:
['dialogue', 'type', 'labels']

First few rows:


,dialogue,type,labels
0,"Suspect: Hi, I'm calling from XYZ Medical Cent...",appointment,0
1,"Suspect: Hi, I'm calling from XYZ Medical Cent...",appointment,0
2,"Suspect: Hi, I'm calling to confirm your appoi...",appointment,0
3,"Suspect: Hi, I'm calling to confirm your appoi...",appointment,0
4,"Suspect: Hi, I'm calling to confirm your appoi...",appointment,0
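Since the later steps depend on the `dialogue`, `type`, and `labels` columns being present, a small guard right after loading can catch a changed or mistyped header early. The following is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the inline CSV text is a stand-in for the real file.

```python
import io

import pandas as pd

# Assumption: these are the columns the notebook expects, taken from the output above.
EXPECTED_COLUMNS = ["dialogue", "type", "labels"]


def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the frame (hypothetical helper)."""
    return [name for name in expected if name not in frame.columns]


# Stand-in for the raw CSV; the row content is illustrative only.
csv_text = 'dialogue,type,labels\n"Suspect: Hi, ...",appointment,0\n'
sample = pd.read_csv(io.StringIO(csv_text))

missing = missing_columns(sample)
if missing:
    raise ValueError(f"Missing expected columns: {missing}")
print("All expected columns are present.")
```

With the real data, the same check could be run on `df` directly after `pd.read_csv`.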


In [7]:
# Basic dataset information
print("Dataset Info:")
print(f"Number of rows: {len(df)}")
print(f"Number of columns: {len(df.columns)}")
print("Missing values per column:")
print(df.isnull().sum())
print("\nData types:")
print(df.dtypes)


Dataset Info:
Number of rows: 1600
Number of columns: 3
Missing values per column:
dialogue    0
type        0
labels      0
dtype: int64

Data types:
dialogue    object
type        object
labels       int64
dtype: object
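The checks above cover shape, missing values, and dtypes. A natural next inspection step is to count rows per `type` and `labels` value to see how the classes are distributed. The sketch below is an assumption about how that could be done: `summarize_labels` is a hypothetical helper, and the small frame (including the `delivery` type) is made-up illustrative data, not taken from the real dataset.

```python
import pandas as pd


def summarize_labels(frame, label_col="labels", type_col="type"):
    """Count rows for each (type, label) pair; hypothetical helper."""
    return frame.groupby([type_col, label_col]).size().unstack(fill_value=0)


# Tiny stand-in frame; the values are illustrative only.
sample = pd.DataFrame(
    {
        "dialogue": ["a", "b", "c", "d"],
        "type": ["appointment", "appointment", "delivery", "delivery"],
        "labels": [0, 0, 1, 0],
    }
)

summary = summarize_labels(sample)
print(summary)
```

Rows of the result are dialogue types and columns are label values, so a type that only ever appears with one label shows a zero in the other column.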


In [8]:
# Create ID column following the pattern "dia-xxxx"
df['id'] = ['dia-' + str(i).zfill(4) for i in range(1, len(df) + 1)]

# Check if 'labels' column exists and rename to 'label'
if 'labels' in df.columns:
    df = df.rename(columns={'labels': 'label'})
    print("Renamed 'labels' column to 'label'")
else:
    print("Column 'labels' not found. Current columns:", df.columns.tolist())

print("\nDataset after processing:")
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
df.head()


Renamed 'labels' column to 'label'

Dataset after processing:
Shape: (1600, 4)
Columns: ['dialogue', 'type', 'label', 'id']


,dialogue,type,label,id
0,"Suspect: Hi, I'm calling from XYZ Medical Cent...",appointment,0,dia-0001
1,"Suspect: Hi, I'm calling from XYZ Medical Cent...",appointment,0,dia-0002
2,"Suspect: Hi, I'm calling to confirm your appoi...",appointment,0,dia-0003
3,"Suspect: Hi, I'm calling to confirm your appoi...",appointment,0,dia-0004
4,"Suspect: Hi, I'm calling to confirm your appoi...",appointment,0,dia-0005
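It can be worth confirming that the generated IDs are unique and really follow the "dia-xxxx" pattern. Note that zero-padding to four digits only pads and never truncates, so a dataset with 10,000 or more rows would produce five-digit IDs. The sketch below illustrates both points; `make_ids`, `ids_are_valid`, and `ID_PATTERN` are hypothetical names introduced here, not part of the notebook.

```python
import re

# Assumption: "dia-xxxx" means the prefix "dia-" followed by exactly four digits.
ID_PATTERN = re.compile(r"^dia-\d{4}$")


def make_ids(n_rows, prefix="dia-", width=4):
    """Build zero-padded IDs starting at 1, mirroring the list comprehension above."""
    return [f"{prefix}{i:0{width}d}" for i in range(1, n_rows + 1)]


def ids_are_valid(ids):
    """Return True when every ID is unique and matches the four-digit pattern."""
    ids = list(ids)
    all_match = all(ID_PATTERN.match(value) is not None for value in ids)
    return len(set(ids)) == len(ids) and all_match


ids = make_ids(1600)
print(ids[0], ids[-1], ids_are_valid(ids))

# With 10,000 rows the last ID becomes "dia-10000", which no longer fits the pattern.
print(ids_are_valid(make_ids(10000)))
```

On the processed frame, the same check could be applied as `ids_are_valid(df['id'])`.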


In [11]:
# Save the processed dataset
output_filename = 'phising_dialogue_dataset.csv'
df.to_csv(output_directory + '/' + output_filename, index=False)

print(f"Processed dataset saved as '{output_directory}/{output_filename}'")
print(f"Final dataset shape: {df.shape}")
print("Processing completed successfully!")


Processed dataset saved as '../data/cleaned/phising_dialogue_dataset.csv'
Final dataset shape: (1600, 4)
Processing completed successfully!
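A quick way to confirm the save step is to read the file back and compare it with the frame in memory. The sketch below is a self-contained illustration under stated assumptions: it writes a small stand-in frame to a temporary directory rather than to `../data/cleaned`, and it creates the output directory first, since `to_csv` does not create missing directories.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the processed frame; the rows are illustrative only.
processed = pd.DataFrame(
    {
        "dialogue": ["Suspect: Hi ...", "Suspect: Hello ..."],
        "type": ["appointment", "appointment"],
        "label": [0, 0],
        "id": ["dia-0001", "dia-0002"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "cleaned")
    os.makedirs(out_dir, exist_ok=True)  # make sure the target directory exists
    out_path = os.path.join(out_dir, "phising_dialogue_dataset.csv")
    processed.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Compare structure and key columns between the saved and in-memory data.
same_columns = reloaded.columns.tolist() == processed.columns.tolist()
same_shape = reloaded.shape == processed.shape
same_ids = reloaded["id"].tolist() == processed["id"].tolist()
print("Round trip preserved columns, shape, and IDs:", same_columns and same_shape and same_ids)
```

Because the file is written with `index=False`, the reloaded frame should not gain an extra index column.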
