<a href="https://colab.research.google.com/github/Nishan-Charlie/AA-NoteBooks/blob/main/Quantum_Anomaly_HaiEnd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import os

In [None]:
root_path = "/content/drive/MyDrive/Research/Anomaly Detection_ QML/hai-security-dataset/haiend-23.05"

In [None]:
files = os.listdir(root_path)
print(files)

['summary_label1.txt', 'summary_label2.txt', 'label-test1.csv', 'label-test2.csv', 'end-test1.csv', 'end-train3.csv', 'end-train4.csv', 'end-test2.csv', 'end-train1.csv', 'end-train2.csv']


# Task
Load the file "end-train1.csv" into a dataframe, display the dataframe's information, and preprocess the data.

## Load data

### Subtask:
Load 'end-train1.csv' into a pandas DataFrame.


**Reasoning**:
Construct the full file path and load the CSV file into a pandas DataFrame.



In [None]:
file_path = os.path.join(root_path, 'end-train1.csv')
df = pd.read_csv(file_path)

In [None]:
df.head()

Unnamed: 0,Timestamp,DM-PP01-R,DM-FT01Z,DM-FT02Z,DM-FT03Z,1001.2-OUT,1001.7-OUT1,1001.7-OUT2,1001.8-OUT,1002.2-OUT,...,DM-PCV01-D,DM-PCV01-Z,DM-PCV02-D,DM-PCV02-Z,DM-PIT01,DM-PIT02,DM-PWIT-03,DM-TIT01,DM-TIT02,DM-TWIT-03
0,2022-08-04 18:00:00,0,287.202423,3166.977539,983.985901,0,0,0,0,0,...,27.862734,28.642271,12.0,11.819457,1.260528,0.375519,0.000585,30.187988,35.20813,27.224792
1,2022-08-04 18:00:01,0,293.160736,3162.2229,982.681396,0,0,0,0,0,...,27.824905,28.306576,12.0,11.819457,1.249084,0.376282,0.000585,30.187988,35.192867,27.224792
2,2022-08-04 18:00:02,0,286.209137,3164.099609,983.550964,0,0,0,0,0,...,27.85302,28.283689,12.0,11.819457,1.242981,0.376282,0.000585,30.187988,35.177608,27.224792
3,2022-08-04 18:00:03,0,287.202423,3159.595459,983.550964,0,0,0,0,0,...,27.844429,28.29895,12.0,11.819457,1.239166,0.376282,0.000585,30.203249,35.192867,27.224792
4,2022-08-04 18:00:04,0,286.209137,3159.845459,983.550964,0,0,0,0,0,...,27.837145,28.268433,12.0,11.819457,1.23764,0.375519,0.000585,30.187988,35.177608,27.224792


In [None]:
# Count the number of null (NaN) values in each column
# Note: This does not count empty strings or the string 'None' unless converted to NaN
df.isnull().sum()

Column,Null count
Timestamp,0
DM-PP01-R,0
DM-FT01Z,0
DM-FT02Z,0
DM-FT03Z,0
...,...
DM-PIT02,0
DM-PWIT-03,0
DM-TIT01,0
DM-TIT02,0
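
The comment in the cell above notes that `isnull()` does not count empty strings or the literal string 'None'. A minimal sketch of that caveat on a toy frame follows; the column names and the set of placeholder strings are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical columns): one genuine NaN plus two string placeholders
toy = pd.DataFrame({
    "sensor_a": [1.0, np.nan, 3.0],
    "sensor_b": ["4.2", "", "None"],
})

# isnull() only sees the genuine NaN in sensor_a
raw_nulls = toy.isnull().sum()

# Map the placeholder strings to NaN first, then count again
cleaned = toy.replace({"": np.nan, "None": np.nan})
clean_nulls = cleaned.isnull().sum()

print(raw_nulls.to_dict())
print(clean_nulls.to_dict())
```

If the real CSV used such placeholders, the zero counts reported above would understate the missing data, so a check like this is worth running once on the object columns.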


In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import os
import glob

# Step 1: Load and Concatenate Data (assuming files are in directories 'train/' and 'test/')
def load_and_concat_csv(directory):
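    # Sort the paths: glob returns files in arbitrary order, and time-series data should stay in order
    files = sorted(glob.glob(os.path.join(directory, '*.csv')))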
    df_list = [pd.read_csv(f) for f in files]
    return pd.concat(df_list, ignore_index=True)

# Load train (normal) and test (with anomalies)
train_df = load_and_concat_csv('path/to/haiend/train/')  # Replace with your path, e.g., end-train*.csv
test_df = load_and_concat_csv('path/to/haiend/test/')    # Replace with your path, e.g., end-test*.csv

# Step 2: Preprocessing
def preprocess(df, scaler=None):
    # Parse the timestamp column (named 'Timestamp' in the HAIEnd CSVs) and set it as index
    df = df.copy()
    df['Timestamp'] = pd.to_datetime(df['Timestamp'])
    df = df.set_index('Timestamp')

    # Extract features (everything except the optional 'attack' label column).
    # NOTE: 'attack' is assumed to have been merged in beforehand; this dataset
    # ships its labels in separate label-test*.csv files.
    label_col = 'attack' if 'attack' in df.columns else None
    features = [c for c in df.columns if c != label_col]  # Numerical features

    # Handle missing values: forward fill (common for time-series), then backfill leading gaps
    df[features] = df[features].ffill().bfill()

    # Normalization: Min-Max Scaler, fitted on train only and reused for test
    if scaler is None:
        scaler = MinMaxScaler().fit(df[features])
    df[features] = scaler.transform(df[features])

    return df, scaler, features, label_col

train_df, train_scaler, features, _ = preprocess(train_df)  # Train has no 'attack'; fits the scaler
test_df, _, _, test_label_col = preprocess(test_df, scaler=train_scaler)  # Test is scaled once, with the train scaler

# Step 3: Create Sequences for RNN (sliding window)
def create_sequences(data, seq_length=30):  # seq_length=30 seconds, adjustable
    sequences = []
    for i in range(len(data) - seq_length):
        seq = data[i:i + seq_length]
        sequences.append(seq)
    return np.array(sequences)

seq_length = 30
train_sequences = create_sequences(train_df[features].values, seq_length)
test_sequences = create_sequences(test_df[features].values, seq_length)
# Window i covers rows i .. i + seq_length - 1, so label each window by its last row
test_labels = test_df[test_label_col].values[seq_length - 1:-1]

# Step 4: PyTorch Dataset and DataLoader
class TimeSeriesDataset(Dataset):
    def __init__(self, sequences):
        self.sequences = torch.tensor(sequences, dtype=torch.float32)

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx]

train_dataset = TimeSeriesDataset(train_sequences)
test_dataset = TimeSeriesDataset(test_sequences)

batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Step 5: LSTM Autoencoder Model
class LSTMAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim=64, num_layers=2):
        super(LSTMAutoencoder, self).__init__()
        self.encoder = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, input_dim, num_layers, batch_first=True)

    def forward(self, x):
        _, (hidden, _) = self.encoder(x)
        decoded, _ = self.decoder(hidden[-1].unsqueeze(1).repeat(1, x.size(1), 1))
        return decoded

input_dim = len(features)  # 225 for end-train1.csv (226 columns minus 'Timestamp')
model = LSTMAutoencoder(input_dim)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Step 6: Training (on normal data only)
epochs = 50
model.train()
for epoch in range(epochs):
    train_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        output = model(batch)
        loss = criterion(output, batch)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    print(f'Epoch {epoch+1}/{epochs}, Loss: {train_loss / len(train_loader):.4f}')

# Step 7: Evaluation (Reconstruction Error for Anomaly Detection)
model.eval()
errors = []
with torch.no_grad():
    for batch in test_loader:
        output = model(batch)
        batch_errors = torch.mean((batch - output) ** 2, dim=[1, 2]).cpu().numpy()
        errors.extend(batch_errors)

# Determine threshold. Preferably derive it from errors on normal data (e.g., mean + 3*std of train errors).
# For simplicity, this uses a percentile of the test errors, which flags the top 5% of windows by construction.
threshold = np.percentile(errors, 95)  # Adjustable; tune for better F1
predictions = (np.array(errors) > threshold).astype(int)

# Metrics
precision, recall, f1, _ = precision_recall_fscore_support(test_labels, predictions, average='binary')
auc = roc_auc_score(test_labels, errors)  # Use errors as scores for AUC
print(f'Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}, AUC: {auc:.4f}')
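
The comment in Step 7 mentions deriving the threshold as mean + 3*std of the training reconstruction errors instead of taking a percentile of the test errors. A minimal sketch of that rule follows. The error arrays are synthetic stand-ins, not model outputs, and `threshold_from_train`, `train_errors`, and `test_errors` are hypothetical names.

```python
import numpy as np

def threshold_from_train(train_errors, k=3.0):
    """Return mean + k * std of the reconstruction errors on normal (training) data."""
    train_errors = np.asarray(train_errors, dtype=float)
    return train_errors.mean() + k * train_errors.std()

# Synthetic stand-ins for per-window reconstruction errors (not real results)
rng = np.random.default_rng(0)
train_errors = rng.normal(loc=0.010, scale=0.002, size=1000)
test_errors = np.concatenate([
    rng.normal(loc=0.010, scale=0.002, size=95),  # normal-looking windows
    np.full(5, 0.050),                            # clearly anomalous windows
])

threshold = threshold_from_train(train_errors)
predictions = (test_errors > threshold).astype(int)
print(f"threshold={threshold:.4f}, flagged={predictions.sum()}")
```

In the notebook, `train_errors` would come from running the Step 7 loop over an unshuffled loader of the training windows. That keeps the threshold independent of the test labels.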

## Get info

### Subtask:
Display basic information about the DataFrame, such as data types and non-null values.


**Reasoning**:
Display the DataFrame's information using the .info() method.



In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280800 entries, 0 to 280799
Columns: 226 entries, Timestamp to DM-TWIT-03
dtypes: float64(60), int64(165), object(1)
memory usage: 484.2+ MB


## Preprocessing

### Subtask:
Perform necessary preprocessing steps, such as handling missing values, converting data types, or feature engineering.


**Reasoning**:
Identify object columns and attempt to convert them to numeric, then check for missing values introduced by coercion.



In [None]:
object_cols = df.select_dtypes(include='object').columns

for col in object_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

newly_missing = df[object_cols].isnull().sum()
print("Missing values introduced after conversion:")
print(newly_missing[newly_missing > 0])

Missing values introduced after conversion:
Timestamp    280800
dtype: int64


**Reasoning**:
'Timestamp' was the only object column, and coercing it to numeric turned all 280,800 values into NaN because it holds datetime strings (for example `2022-08-04 18:00:00`), not numbers. Since the column cannot be used directly as a numeric feature, it is dropped here.



In [None]:
if 'Timestamp' in df.columns:
    df = df.drop('Timestamp', axis=1)
    print("Dropped 'Timestamp' column due to failed conversion.")

# Verify that the 'Timestamp' column is dropped and check info again
df.info()

Dropped 'Timestamp' column due to failed conversion.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280800 entries, 0 to 280799
Columns: 225 entries, DM-PP01-R to DM-TWIT-03
dtypes: float64(60), int64(165)
memory usage: 482.0 MB


## Summary:

### Data Analysis Key Findings

* The dataset 'end-train1.csv' was successfully loaded into a pandas DataFrame named `df`.
* The initial DataFrame contains 280,800 entries and 226 columns.
* The data types are primarily `float64` (60 columns) and `int64` (165 columns), with one column of type `object`.
* Initially, there were no missing values in the DataFrame.
* The 'Timestamp' column was identified as the only column with an 'object' data type.
* Attempting to convert the 'Timestamp' column to numeric resulted in 280,800 missing values.
* The 'Timestamp' column was subsequently dropped from the DataFrame due to the failed conversion and resulting missing values.
* After preprocessing, the DataFrame consists entirely of numerical columns (`float64` or `int64`) and has no missing values.

### Insights or Next Steps

* The 'Timestamp' column holds datetime strings such as `2022-08-04 18:00:00`, which is why numeric coercion produced only missing values. If time-based features or a datetime index are relevant to the analysis, parse the original column with `pd.to_datetime` instead of dropping it.
* The dataset is now ready for further analysis or model training as it contains only numerical features and no missing values.
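
As a hedged illustration of the first point, the timestamp strings shown in the `df.head()` output parse cleanly with `pd.to_datetime`. The sketch below uses a three-row stand-in frame, not the real file, and the derived `hour` and `dayofweek` features are illustrative assumptions that are not part of the original notebook.

```python
import pandas as pd

# Stand-in for the first rows of end-train1.csv (values copied from the head() output)
sample = pd.DataFrame({
    "Timestamp": ["2022-08-04 18:00:00", "2022-08-04 18:00:01", "2022-08-04 18:00:02"],
    "DM-FT01Z": [287.202423, 293.160736, 286.209137],
})

# Parse the strings into datetime64 values instead of coercing them to numeric
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"], format="%Y-%m-%d %H:%M:%S")

# Optional numeric features derived from the timestamp (hypothetical choices)
sample["hour"] = sample["Timestamp"].dt.hour
sample["dayofweek"] = sample["Timestamp"].dt.dayofweek  # Monday=0

# Sampling interval between consecutive rows
steps = sample["Timestamp"].diff().dropna()
print(sample.dtypes)
print(steps.tolist())
```

Checking `steps` on the full column would also reveal gaps in the recording, which matters when sliding windows are built over consecutive rows.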


# Task
Build an RNN model for anomaly detection using the dataset "end-train1.csv", preprocess the data, train the model, evaluate its performance, and make predictions.
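
For the preprocessing part of this task, note that copying 280,800 rows × 225 features into overlapping 30-step windows needs roughly 30 times the memory of the input array. One way to avoid that copy, sketched here under the assumption that NumPy 1.20 or newer is available, is `numpy.lib.stride_tricks.sliding_window_view`, which returns a read-only view. The helper name `make_windows` and the toy array are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(data, seq_length=30):
    """Return a (n_windows, seq_length, n_features) view over a (n_rows, n_features) array."""
    # sliding_window_view appends the window axis last: (n_windows, n_features, seq_length)
    view = sliding_window_view(data, window_shape=seq_length, axis=0)
    # Reorder to the (batch, time, feature) layout used with batch_first=True LSTMs
    return view.transpose(0, 2, 1)

toy = np.arange(20, dtype=np.float32).reshape(10, 2)
windows = make_windows(toy, seq_length=4)
print(windows.shape)
```

This yields `n_rows - seq_length + 1` windows, one more than a loop over `range(len(data) - seq_length)`. The memory saving only holds while individual windows are converted to tensors on demand, for example inside a `Dataset.__getitem__`, because converting the whole view at once copies it.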