# PBL DATA PREPROCESS

## Load the dataset

In [6]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
import warnings
warnings.filterwarnings('ignore')


In [7]:
# Load the dataset
data = pd.read_csv('random_dataset.csv')

## Data variables explanation

In [8]:
# Display the first 5 rows of the dataset   
data.head()

| Unnamed: 0 | ts | uid | id.orig_h | id.orig_p | id.resp_h | id.resp_p | proto | service | duration | orig_bytes | resp_bytes | conn_state | local_orig | local_resp | missed_bytes | history | orig_pkts | orig_ip_bytes | resp_pkts | resp_ip_bytes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-01-01 00:00:00.000 | C9330 | 192.168.1.207 | 61952 | 172.16.0.131 | 405 | icmp | ssh | 221.33 | 90838 | 89159 | SF | T | F | 69 | Dd | 92 | 62728 | 6 | 7497 |
| 1 | 2025-01-01 00:00:00.030 | C5474 | 192.168.1.12 | 15105 | 172.16.0.152 | 622 | icmp | ftp | 213.18 | 44275 | 18818 | S0 | F | F | 69 | Dd | 28 | 62721 | 35 | 14632 |
| 2 | 2025-01-01 00:00:00.060 | C9309 | 192.168.1.63 | 24108 | 172.16.0.168 | 43 | tcp | ssl | 81.39 | 72933 | 86718 | RSTR | T | F | 59 | Srv | 73 | 14000 | 45 | 15733 |
| 3 | 2025-01-01 00:00:00.090 | C2868 | 192.168.1.52 | 23834 | 172.16.0.240 | 59 | tcp | dns | 233.96 | 93500 | 9335 | S1 | T | F | 57 | Srv | 45 | 61279 | 21 | 60510 |
| 4 | 2025-01-01 00:00:00.120 | C2946 | 192.168.1.172 | 6791 | 172.16.0.68 | 47 | udp | dns | 226.5 | 35881 | 98591 | SH | T | T | 21 | ShAD | 45 | 1401 | 35 | 38361 |


## Description of the variables 


- ts: Timestamp of when the event was recorded.
- uid: Unique identifier for each connection.
- id.orig_h: Source IP address initiating the connection.
- id.orig_p: Source port from where the connection is initiated.
- id.resp_h: Destination IP address receiving the connection.
- id.resp_p: Destination port receiving the connection.
- proto: Protocol used in the connection (e.g., TCP, UDP, ICMP).
- service: Application-level protocol detected on the connection.
- duration: Length of time the connection was active.
- orig_bytes: Total bytes sent by the source.
- resp_bytes: Total bytes sent by the destination.
- conn_state: State of the connection at the time of logging.
- local_orig: Indicates if the source is local.
- local_resp: Indicates if the destination is local.
- missed_bytes: Number of bytes missed during the connection.
- history: Sequence of events in the connection (e.g., packets sent/received).
- orig_pkts: Number of packets sent by the source.
- orig_ip_bytes: Total IP bytes sent by the source, including headers.
- resp_pkts: Number of packets sent by the destination.
- resp_ip_bytes: Total IP bytes sent by the destination, including headers.
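
The preview shows a leftover `Unnamed: 0` index column and `T`/`F` strings in `local_orig` and `local_resp`. A minimal sketch of how these could be tidied before further preprocessing is given below. The helper name `tidy_types` and the small `sample` frame are hypothetical and only illustrate the idea; they are not part of the notebook.

```python
import pandas as pd

# Tiny stand-in frame that reuses column names and values from the preview.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "ts": ["2025-01-01 00:00:00.000", "2025-01-01 00:00:00.030"],
    "local_orig": ["T", "F"],
    "local_resp": ["F", "F"],
    "orig_bytes": [90838, 44275],
})


def tidy_types(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop the saved index column and cast flags and timestamps."""
    # drop() returns a new frame, so the input is left untouched.
    out = df.drop(columns=["Unnamed: 0"], errors="ignore")
    # Assumption: the flags only ever contain the strings "T" and "F".
    for col in ("local_orig", "local_resp"):
        if col in out.columns:
            out[col] = out[col].map({"T": True, "F": False})
    if "ts" in out.columns:
        out["ts"] = pd.to_datetime(out["ts"])
    return out


tidy = tidy_types(sample)
```

If the real dataset encodes the flags differently, the mapping dictionary would need to change accordingly.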


## Data analysis

This will be done when the real dataset is obtained. For now, we will just preprocess the data.

## Data preprocessing

We will explore two main strategies for building the dataset:
- Time windows based on the timestamp, with window lengths of 0.5 s, 1 s, 5 s, or 10 s.
- Packet-count windows based on the number of packets in a connection, with candidate sizes such as 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, or 20000. The sizes should be chosen according to the packet counts typically observed in the real network traffic.
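
The packet-count strategy can be read in more than one way. The sketch below assumes one possible reading: connections are ordered by timestamp, their packets (`orig_pkts + resp_pkts`) are accumulated, and a new bucket starts every `packets_per_bucket` packets. The function name `assign_packet_bucket`, the `packet_bucket` column, and the bucketing rule itself are assumptions made for illustration, not a method defined in this notebook.

```python
import pandas as pd

# Toy rows with the timestamps and packet counts from the preview above.
sample = pd.DataFrame({
    "ts": pd.to_datetime([
        "2025-01-01 00:00:00.000",
        "2025-01-01 00:00:00.030",
        "2025-01-01 00:00:00.060",
        "2025-01-01 00:00:00.090",
        "2025-01-01 00:00:00.120",
    ]),
    "orig_pkts": [92, 28, 73, 45, 45],
    "resp_pkts": [6, 35, 45, 21, 35],
})


def assign_packet_bucket(df: pd.DataFrame, packets_per_bucket: int) -> pd.DataFrame:
    """Hypothetical helper: label rows by the running packet count."""
    out = df.sort_values("ts").copy()
    total = out["orig_pkts"] + out["resp_pkts"]
    # Packets seen strictly before each row, so the first row falls in bucket 0.
    seen_before = total.cumsum() - total
    out["packet_bucket"] = seen_before // packets_per_bucket
    return out


bucketed = assign_packet_bucket(sample, packets_per_bucket=100)
```

A connection is assigned to the bucket in which it starts, so a single large connection is never split across buckets. Whether that is acceptable depends on the packet counts seen in the real traffic.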

## Time windows based

In [9]:
# Convert the ts to datetime

data['ts'] = pd.to_datetime(data['ts'])
# Sort the data by the timestamp
data = data.sort_values(by='ts')

In [10]:
# Assign each row to a 1-second time window by flooring the timestamp
data['time_window'] = data['ts'].dt.floor('s')
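
The cell above fixes the window length at 1 s. As a sketch of how the other candidate lengths (0.5 s, 5 s, 10 s) might be handled, the floor frequency can be passed as a parameter and the rows then aggregated per window. The helper `add_time_window`, the toy `sample` frame, and the choice of `count` and `sum` as aggregates are illustrative assumptions.

```python
import pandas as pd

# Toy frame; the timestamps are chosen to span several 0.5 s windows.
sample = pd.DataFrame({
    "ts": pd.to_datetime([
        "2025-01-01 00:00:00.000",
        "2025-01-01 00:00:00.030",
        "2025-01-01 00:00:00.600",
        "2025-01-01 00:00:01.200",
    ]),
    "orig_bytes": [100, 200, 300, 400],
})


def add_time_window(df: pd.DataFrame, freq: str) -> pd.DataFrame:
    """Hypothetical helper: floor `ts` to a pandas frequency string such as '500ms' or '5s'."""
    out = df.copy()
    out["time_window"] = out["ts"].dt.floor(freq)
    return out


# 0.5 s windows, then one row per window with connection count and total bytes.
windowed = add_time_window(sample, "500ms")
per_window = windowed.groupby("time_window")["orig_bytes"].agg(["count", "sum"])
```

Passing `"5s"` or `"10s"` as `freq` gives the longer windows without changing the rest of the pipeline.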


## Packet count based
In this
