# Major Review: ROSpace

This notebook contains the code generated to address the Scientific Data reviewer's concerns regarding the paper: ROSpace: Intrusion Detection Dataset for a ROS2-Based Cyber-Physical System.

## Reviewer #2, Concern #1

My main issue with the approach is that the timing is extremely strict with 30 secs of normal operation, followed by an attack. 
I assume the timings would vary greatly with different attacks.
The timeout for the attack is 60 secs, so this makes the task very simple, since every 30 sec + 60 secs * i, an attack will start.
- The timings should be randomized or the authors should explain why these strict timing requirements are sufficient.
- Wait 30 seconds, without performing any action. In this period, the target system is behaving normally, in an attack-free scenario. Why not randomize the times? Since the time between every attack is 30 seconds, can’t an LSTM just monitor the time and learn the frequency of the attacks? Having a fixed time here might give clues on when the next attack will occur.

### Author actions: randomize normal traffic sequences duration

We've developed a script to efficiently randomize the duration of both normal and attack sequences. This is done to prevent time series models from learning the frequency patterns of attacks.
 

#### Import the complete dataset (only the 'timestamp' column is needed)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

In [None]:
df = pd.read_csv('/data/puccetti/space_data/major_review_data/complete_dataset_new_all.csv', usecols=['timestamp'])

In [None]:
print(df.shape)

Change the type of 'timestamp' column to datetime to facilitate operations between timestamps

In [None]:
df['timestamp'] = pd.to_datetime(df['timestamp'])
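As a small illustration of why the conversion helps, the sketch below parses two timestamp strings (copied from the dataset preview shown later in this notebook) and subtracts them. The variable names are illustrative only.

```python
import pandas as pd

# Two consecutive timestamps as they appear in the CSV (strings)
toy = pd.Series(['2023-03-16 14:22:34.379963904',
                 '2023-03-16 14:22:34.380381696'])

# After conversion, subtraction yields a Timedelta instead of raising an error
toy = pd.to_datetime(toy)
gap = toy.iloc[1] - toy.iloc[0]
print(gap)
```

Once the column is datetime-typed, durations of whole sequences can be computed the same way, by subtracting the first timestamp from the last.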

### Load sequence indices
Load the indices of the data points in each sequence. sequence_indices is structured as follows:
- sequence_indices[i][0] : the label of the data points in the i-th sequence, e.g. 'observe' (normal) or 'ros2 reconnaissance'.
- sequence_indices[i][1] : the list of indices for the i-th sequence.
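To make this layout concrete, here is a minimal sketch on toy data. The labels match those used in the dataset, but the index values and the `toy_*` names are invented for illustration.

```python
import numpy as np

# Toy stand-in for sequence_indices: one (label, list of row indices) pair per sequence.
# The index values are invented for illustration only.
toy = [
    ('observe', [0, 1, 2, 3]),
    ('ros2 reconnaissance', [4, 5]),
    ('observe', [6, 7, 8]),
]

# Build a (n_sequences, 2) object array, filling it cell by cell so that
# lists of different lengths are stored as-is.
toy_sequence_indices = np.empty((len(toy), 2), dtype=object)
for i, (label, idx) in enumerate(toy):
    toy_sequence_indices[i, 0] = label
    toy_sequence_indices[i, 1] = idx

# Split the sequences by label, mirroring the normal/attack split used in this notebook
toy_normal = [idx for label, idx in toy_sequence_indices if label == 'observe']
toy_attack = [idx for label, idx in toy_sequence_indices if label != 'observe']
print(len(toy_normal), len(toy_attack))
```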

In [None]:
sequence_indices = np.load('/data/puccetti/space_data/major_review_data/sequence_indices.npy', allow_pickle=True)

In [None]:
# Extract the lists from the second column
lists = sequence_indices[:, 1]

lists_0 = [lst for value, lst in zip(sequence_indices[:, 0], lists) if value == 'observe']
lists_1 = [lst for value, lst in zip(sequence_indices[:, 0], lists) if value != 'observe']

print(len(lists_0))
print(len(lists_1))

list_lengths_0 = np.array([len(lst) for lst in lists_0])
list_lengths_1 = np.array([len(lst) for lst in lists_1])

lists_0 is a list of index lists labelled 'observe' (normal). Each sublist holds the indices of one normal sequence.

In [None]:
np.save('/data/puccetti/space_data/major_review_data/sequence_indices_normal.npy', np.array(lists_0, dtype=object))

lists_1 is a list of index lists with any other label (attack). Each sublist holds the indices of one attack sequence.

In [None]:
np.save('/data/puccetti/space_data/major_review_data/sequence_indices_attacks.npy', np.array(lists_1, dtype=object))

### Compute sequence durations
To make it easier to cut part of each sequence, we build a new list of sequence indices that includes, for each sequence:
- sequence_indices[i][0] : the list of indices for the i-th sequence.
- sequence_indices[i][1] : the duration of the i-th sequence.
- sequence_indices[i][2] : the length of the i-th sequence.
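The sketch below shows one way such a triple could be computed for a single sequence, on invented toy timestamps. The helper name `sequence_duration` is hypothetical, and the ordering check uses `is_monotonic_increasing`, which is stricter than the max/min comparison used in this notebook.

```python
import pandas as pd

# Toy timestamp column; the values are invented for illustration.
toy_df = pd.DataFrame({'timestamp': pd.to_datetime([
    '2023-03-16 14:22:34.0', '2023-03-16 14:22:34.5',
    '2023-03-16 14:22:35.0', '2023-03-16 14:22:36.0',
])})

def sequence_duration(frame, indices):
    """Return (indices, duration, length) for one sequence of row indices."""
    ts = frame.loc[indices, 'timestamp']
    # Refuse to compute a duration if the rows are not in chronological order
    if not ts.is_monotonic_increasing:
        raise ValueError('timestamps are not sorted within the sequence')
    return indices, ts.iloc[-1] - ts.iloc[0], len(indices)

_, toy_duration, toy_length = sequence_duration(toy_df, [0, 1, 2, 3])
print(toy_duration, toy_length)
```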

In [None]:
durations = []
for sequence in sequence_indices:
    timestamps = df.loc[sequence[1], 'timestamp']
    duration_1 = timestamps.max() - timestamps.min()
    duration_2 = timestamps.loc[sequence[1][-1]] - timestamps.loc[sequence[1][0]]
    if(duration_1 != duration_2):
        print('THE DATASET IS NOT ORDERED, ORDER AND LAUNCH AGAIN')
        break
    durations.append((sequence[1], duration_2, len(sequence[1])))

In [None]:
np.save('/data/puccetti/space_data/major_review_data/sequence_indices_all_durations.npy', np.array(durations, dtype=object))

In [None]:
lists_0 = np.load('/data/puccetti/space_data/major_review_data/sequence_indices_normal.npy', allow_pickle=True)

In [None]:
lists_1 = np.load('/data/puccetti/space_data/major_review_data/sequence_indices_attacks.npy', allow_pickle=True) 

In [None]:
normal_durations = []
for sequence in lists_0:
    timestamps = df.loc[sequence, 'timestamp']
    duration_1 = timestamps.max() - timestamps.min()
    duration_2 = timestamps.loc[sequence[-1]] - timestamps.loc[sequence[0]]
    if(duration_1 != duration_2):
        print('THE DATASET IS NOT ORDERED, ORDER AND LAUNCH AGAIN')
        break
    normal_durations.append((sequence, duration_2, len(sequence)))

In [None]:
np.save('/data/puccetti/space_data/major_review_data/sequence_indices_normal_durations.npy', np.array(normal_durations, dtype=object))

In [None]:
attack_durations = []
for sequence in lists_1:
    timestamps = df.loc[sequence, 'timestamp']
    duration_1 = timestamps.max() - timestamps.min()
    duration_2 = timestamps.loc[sequence[-1]] - timestamps.loc[sequence[0]]
    if(duration_1 != duration_2):
        print('THE DATASET IS NOT ORDERED, ORDER AND LAUNCH AGAIN')
        break
    attack_durations.append((sequence, duration_2, len(sequence)))

In [None]:
np.save('/data/puccetti/space_data/major_review_data/sequence_indices_attack_durations.npy', np.array(attack_durations, dtype=object))

### Calculate where to cut sequences
- We want to shorten each sequence by a random percentage: from 1% to 30% for normal sequences and from 1% to 70% for attack sequences (the bounds passed to the randomization function below).
- We need to calculate how many data points to delete to match the desired new time interval.
- The output is an updated version of the 'sequence_indices' list in which the duration of each sequence is updated to the duration after the cut, and the indices of the sequence are updated accordingly.
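The cut point can also be located with a binary search instead of stepping a pivot one row at a time. The following is a minimal sketch on toy data, not the procedure used to produce the dataset: `keep_head`, `keep_tail` and the fixed `percent_to_cut` are hypothetical names and values chosen for illustration, and the timestamps are assumed to be sorted.

```python
import numpy as np
import pandas as pd

def keep_head(ts, new_seconds):
    """Keep the leading rows whose elapsed time from the first row is <= new_seconds."""
    elapsed = (ts - ts.iloc[0]).dt.total_seconds().to_numpy()
    n_keep = int(np.searchsorted(elapsed, new_seconds, side='right'))
    return ts.iloc[:n_keep]

def keep_tail(ts, new_seconds):
    """Keep the trailing rows whose time to the last row is <= new_seconds."""
    remaining = (ts.iloc[-1] - ts).dt.total_seconds().to_numpy()
    # 'remaining' decreases along the sequence, so search on its negation
    n_drop = int(np.searchsorted(-remaining, -new_seconds, side='left'))
    return ts.iloc[n_drop:]

# Toy sequence: 10 rows, one second apart (9 s in total)
toy_ts = pd.Series(pd.Timestamp('2023-01-01') + pd.to_timedelta(np.arange(10), unit='s'))

percent_to_cut = 30  # fixed here for reproducibility; the notebook draws it at random
full_seconds = (toy_ts.iloc[-1] - toy_ts.iloc[0]).total_seconds()
new_seconds = full_seconds * (1 - percent_to_cut / 100)

head = keep_head(toy_ts, new_seconds)  # shortened from the end
tail = keep_tail(toy_ts, new_seconds)  # shortened from the start
print(len(head), len(tail))
```

Keeping the head removes rows from the end of a sequence, while keeping the tail removes rows from its beginning. Which side is trimmed for normal and for attack sequences is decided by the functions defined in the cells below.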


In [None]:
durations = np.load('/data/puccetti/space_data/major_review_data/sequence_indices_all_durations.npy', allow_pickle=True)

In [None]:
attack_durations = np.load('/data/puccetti/space_data/major_review_data/sequence_indices_attack_durations.npy', allow_pickle=True)

In [None]:
normal_durations = np.load('/data/puccetti/space_data/major_review_data/sequence_indices_normal_durations.npy', allow_pickle=True)

In [None]:
import random
from datetime import timedelta

In [None]:
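def cut_seq_indices_atk(timestamps, sequence, lenght, new_duration):
    # Keep the leading part of an attack sequence: starting from the midpoint,
    # move the pivot until the time elapsed since the first row reaches new_duration.
    pivot = lenght // 2
    print('start index ' + str(sequence[0]), 'end index ' + str(sequence[-1]))
    print('length ' + str(lenght))
    print('pivot ' + str(pivot))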
    print('timestamp pivot: '+ str(timestamps.loc[sequence[pivot]]))
    print('start timestamp: ' + str(timestamps.loc[sequence[0]]))
    print('temp duration ' + str(timestamps.loc[sequence[pivot]] - timestamps.loc[sequence[0]]))
    stop = False
    temp_duration = timestamps.loc[sequence[pivot]] - timestamps.loc[sequence[0]]
    while temp_duration < new_duration:
        stop = True
        pivot = pivot + 1 
        temp_duration = timestamps.loc[sequence[pivot]] - timestamps.loc[sequence[0]]
    while temp_duration > new_duration and stop == False:   
        pivot = pivot - 1
        temp_duration = timestamps.loc[sequence[pivot]] - timestamps.loc[sequence[0]]
    new_sequence = sequence[0:pivot]
    new_duration = temp_duration
    new_lenght = len(new_sequence)
    print('final temp duration: ' + str(temp_duration))
    print('final new duration' + str(new_duration))
    return new_sequence, new_duration, new_lenght

In [None]:
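def cut_seq_indices_norm(timestamps, sequence, lenght, new_duration):
    # Keep the trailing part of a normal sequence: starting from the midpoint,
    # move the pivot until the time remaining to the last row reaches new_duration.
    pivot = lenght // 2
    print('start index ' + str(sequence[0]), 'end index ' + str(sequence[-1]))
    print('length ' + str(lenght))
    print('pivot ' + str(pivot))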
    print('timestamp pivot: '+ str(timestamps.loc[sequence[pivot]]))
    print('start timestamp: ' + str(timestamps.loc[sequence[0]]))
    print('temp duration ' + str(timestamps.loc[sequence[-1]] - timestamps.loc[sequence[pivot]]))
    stop = False
    temp_duration = timestamps.loc[sequence[-1]] - timestamps.loc[sequence[pivot]]
    while temp_duration < new_duration:
        stop = True
        pivot = pivot -1 
        temp_duration = timestamps.loc[sequence[-1]] - timestamps.loc[sequence[pivot]]
    while temp_duration > new_duration and stop == False:   
        pivot = pivot + 1
        temp_duration = timestamps.loc[sequence[-1]] - timestamps.loc[sequence[pivot]]
    new_sequence = sequence[pivot:-1]
    new_duration = temp_duration
    new_lenght = len(new_sequence)
    print('final temp duration: ' + str(temp_duration))
    print('final new duration' + str(new_duration))
    return new_sequence, new_duration, new_lenght

In [None]:
def randomize_sequence_duration(sequences_durations, percentual, cut):
    new_duration_sequences = []
    for n_uple in sequences_durations:
        percent_to_cut = random.randint(1, percentual)
    
        print("initial duration: " + str(n_uple[1]))
        print("percentage to cut: " + str(percent_to_cut))
    
        dur_to_cut = (n_uple[1].total_seconds()* percent_to_cut) / 100
        new_dur = n_uple[1].total_seconds() - dur_to_cut
        timestamps = df.loc[n_uple[0], 'timestamp']
    
        print("duration to cut :" + str(dur_to_cut))

        new_sequence = n_uple[0]
        new_duration = timedelta(seconds=new_dur)
        new_lenght = n_uple[2]
        if(n_uple[2] > 100):
            if(cut == 'normal'):
                new_sequence, new_duration, new_lenght = cut_seq_indices_norm(timestamps, n_uple[0], n_uple[2], timedelta(seconds=new_dur))
            else:
                new_sequence, new_duration, new_lenght = cut_seq_indices_atk(timestamps, n_uple[0], n_uple[2], timedelta(seconds=new_dur))
        new_duration_sequences.append((new_sequence, new_duration, new_lenght))
        print('-------------------------------------')
    return new_duration_sequences

In [None]:
new_duration_attack_sequences = randomize_sequence_duration(attack_durations, 70, 'attack')

In [None]:
new_duration_normal_sequences = randomize_sequence_duration(normal_durations, 30, 'normal')

In [None]:
np.save('/data/puccetti/space_data/major_review_data/sequence_indices_attack_new_durations.npy', np.array(new_duration_attack_sequences, dtype=object))

In [None]:
np.save('/data/puccetti/space_data/major_review_data/sequence_indices_normal_new_durations.npy', np.array(new_duration_normal_sequences, dtype=object))

### Use new sequences to compose the updated dataset

In [26]:
new_duration_attack_sequences = np.load('/data/puccetti/space_data/major_review_data/sequence_indices_attack_new_durations.npy', allow_pickle=True)

In [27]:
new_duration_normal_sequences = np.load('/data/puccetti/space_data/major_review_data/sequence_indices_normal_new_durations.npy', allow_pickle=True)

In [11]:
merged_list.append(new_duration_normal_sequences[-1][0])

In [25]:
np.save('/data/puccetti/space_data/major_review_data/sequence_indices_merged_new_durations.npy', np.array(merged_list_flat, dtype=object))
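The cells that build `merged_list` and `merged_list_flat` do not appear in this export. The sketch below shows one way they could be built, assuming (this is an assumption, not something confirmed by the notebook) that normal and attack sequences alternate, starting and ending with a normal sequence. The toy data and `toy_*` names are invented.

```python
from itertools import chain

# Toy (indices, duration, length) triples; durations are left as None for brevity
toy_normal = [([0, 1, 2], None, 3), ([5, 6], None, 2), ([9, 10], None, 2)]
toy_attack = [([3, 4], None, 2), ([7, 8], None, 2)]

# Interleave the index lists: normal, attack, normal, attack, ...
toy_merged = []
for normal_seq, attack_seq in zip(toy_normal, toy_attack):
    toy_merged.append(normal_seq[0])
    toy_merged.append(attack_seq[0])

# The capture is assumed to end with a normal sequence that zip() did not pair
toy_merged.append(toy_normal[-1][0])

# Flatten the list of index lists into a single list of row indices
toy_merged_flat = list(chain.from_iterable(toy_merged))
print(toy_merged_flat)
```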

### Compose the new dataset

In [None]:
# Define paths and chunk size
original_file_path = '/data/puccetti/space_data/final_repo/rospace/complete_dataset.csv'
output_file_path = '/data/puccetti/space_data/major_review_data/complete_dataset_new_filtered.csv'
chunk_size = 10000

# Keep only the rows of a chunk whose index appears in merged_list_flat
def process_chunk(chunk):
    # Filter rows whose index is listed in merged_list_flat
    filtered_chunk = chunk[chunk.index.isin(merged_list_flat)]
    # Drop from merged_list_flat the indices already written, so later chunks check a shorter list
    merged_list_flat[:] = [idx for idx in merged_list_flat if idx not in filtered_chunk.index]
    return filtered_chunk

# Iterate over chunks of the original DataFrame and process each chunk
for chunk in pd.read_csv(original_file_path, chunksize=chunk_size):
    filtered_chunk = process_chunk(chunk)
    # Append filtered_chunk to the output CSV file
    filtered_chunk.to_csv(output_file_path, mode='a', index=False, header=not os.path.exists(output_file_path))
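The same chunked filtering can be sketched with a set of row indices, which avoids rebuilding the index list after every chunk. This is a minimal illustration on an in-memory toy CSV. `rows_to_keep` is a hypothetical stand-in for `merged_list_flat`, and the CSV content is invented.

```python
import io
import pandas as pd

# Toy CSV with 10 rows; the content is invented for illustration
toy_csv = io.StringIO(
    "timestamp,attack\n" + "\n".join(f"t{i},observe" for i in range(10))
)

# Hypothetical stand-in for merged_list_flat, stored as a set
rows_to_keep = {1, 2, 7, 9}

kept = []
# The row index keeps counting across chunks (0-3, 4-7, 8-9), so global
# row numbers can be compared directly with chunk.index
for chunk in pd.read_csv(toy_csv, chunksize=4):
    kept.append(chunk[chunk.index.isin(rows_to_keep)])

toy_filtered = pd.concat(kept)
print(toy_filtered.index.tolist())
```

On the real file, each filtered chunk would be appended to the output CSV rather than collected in memory.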


In [52]:
df = pd.read_csv('/data/puccetti/space_data/major_review_data/complete_dataset_new_filtered.csv',usecols=['timestamp', 'attack'], index_col=False)

In [53]:
df

Unnamed: 0,timestamp,attack
0,2023-03-16 14:22:34.379963904,observe
1,2023-03-16 14:22:34.380381696,observe
2,2023-03-16 14:22:34.381272832,observe
3,2023-03-16 14:22:34.381770496,observe
4,2023-03-16 14:22:34.382237696,observe
...,...,...
20611046,2023-06-16 19:58:12.266677760,observe
20611047,2023-06-16 19:58:12.267390208,observe
20611048,2023-06-16 19:58:12.267472640,observe
20611049,2023-06-16 19:58:12.267490048,observe


### Change timestamps 

In [3]:
df = pd.read_csv('/data/puccetti/space_data/major_review_data/complete_dataset_new_filtered.csv',usecols=['timestamp', 'attack'], index_col=False)


In [5]:
# Find consecutive sequences of the same label value
sequences = []
sequence_value = None
sequence_start_index = None

for index, row in df.iterrows():
    if row['attack'] == sequence_value:
        continue
    else:
        if sequence_value is not None:
            sequences.append((sequence_value, sequence_start_index, index - 1))
        sequence_value = row['attack']
        sequence_start_index = index

# Append the last sequence
if sequence_value is not None:
    sequences.append((sequence_value, sequence_start_index, len(df) - 1))

# Retrieve indices for each sequence
sequence_indices = []
for seq_value, start_idx, end_idx in sequences:
    indices = df.index[start_idx:end_idx + 1].tolist()
    sequence_indices.append((seq_value, indices))
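Row-by-row iteration over roughly 20 million rows is slow. A vectorized alternative is sketched below on toy labels. It is an illustration rather than the code used for the dataset, and the `toy_*` names are invented.

```python
import pandas as pd

# Toy label column; values are invented for illustration
toy_labels = pd.Series(['observe', 'observe', 'ros2 reflection',
                        'observe', 'observe', 'observe'])

# A new run starts wherever the label differs from the previous row;
# the cumulative sum of those change points gives each row a run identifier
run_id = toy_labels.ne(toy_labels.shift()).cumsum()

# One (label, list of indices) pair per run, in order of appearance
toy_runs = [(grp.iloc[0], grp.index.tolist())
            for _, grp in toy_labels.groupby(run_id)]
print(toy_runs)
```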

In [6]:
sequence_indices[1163]

('ros2 reflection', [2104247])

In [7]:
np.save('/data/puccetti/space_data/major_review_data/sequence_indices_cut.npy', np.array(sequence_indices, dtype=object))

In [9]:
indices_list = np.load('/data/puccetti/space_data/major_review_data/sequence_indices_cut.npy', allow_pickle=True)

In [None]:
df = pd.read_csv('/data/puccetti/space_data/major_review_data/complete_dataset_new_filtered.csv',usecols=['timestamp'], index_col=False)

In [11]:
df['timestamp'] = pd.to_datetime(df['timestamp'])

In [12]:
print(indices_list.shape)

(2341, 2)


In [17]:
count = 0
for seq_label, seq_indices in indices_list:
    print(count)
    
    # Use the first inter-row time delta of the sequence as a constant spacing
    if len(seq_indices) > 1:
        time_deltas = df['timestamp'].diff().iloc[seq_indices[1:]].dt.total_seconds().values
        delta = pd.Timedelta(seconds=time_deltas[0])
    else:
        delta = pd.Timedelta(seconds=0)

    # Every sequence is re-based on the same reference timestamp
    start_timestamp = pd.Timestamp('2023-01-01 00:00:00')

    # Rewrite the timestamps of the sequence as evenly spaced values from the reference timestamp
    if delta.total_seconds() == 0:
        df.loc[seq_indices, 'timestamp'] = start_timestamp
    else:
        df.loc[seq_indices, 'timestamp'] = pd.date_range(start=start_timestamp, periods=len(seq_indices), freq=delta)
    
    count += 1

print(df)


In [None]:
df['timestamp'] = pd.to_datetime(df['timestamp'])

In [19]:
df['formatted_timestamp'] = df['timestamp'].dt.strftime('%b %d, %Y %H:%M:%S.%f %Z')

In [20]:
df['formatted_timestamp']

0           Jan 01, 2023 00:00:00.000000 
1           Jan 01, 2023 00:00:00.000417 
2           Jan 01, 2023 00:00:00.000835 
3           Jan 01, 2023 00:00:00.001253 
4           Jan 01, 2023 00:00:00.001671 
                        ...              
20611046    Jan 01, 2023 00:00:07.858296 
20611047    Jan 01, 2023 00:00:07.860724 
20611048    Jan 01, 2023 00:00:07.863151 
20611049    Jan 01, 2023 00:00:07.865579 
20611050    Jan 01, 2023 00:00:07.868007 
Name: formatted_timestamp, Length: 20611051, dtype: object

In [21]:
df.to_csv('/data/puccetti/space_data/major_review_data/new_timestamp_columns.csv')

In [24]:
import pandas as pd
import os

# Define file paths
original_file_path = '/data/puccetti/space_data/major_review_data/complete_dataset_new_filtered.csv'
new_file_path = '/data/puccetti/space_data/major_review_data/new_timestamp_columns.csv'
output_file_path = '/data/puccetti/space_data/major_review_data/complete_dataset_new_filtered_new_time.csv'
chunk_size = 10000

# Function to load chunks from two DataFrames and substitute timestamps
def process_chunks(original_chunk, new_chunk):
    # Substitute the 'timestamp' column in the original chunk with the 'timestamp' column from the new chunk
    original_chunk['timestamp'] = new_chunk['timestamp']
    return original_chunk

# Iterate over chunks of the original DataFrame and process each chunk
for original_chunk, new_chunk in zip(pd.read_csv(original_file_path, chunksize=chunk_size),
                                     pd.read_csv(new_file_path, chunksize=chunk_size)):
    processed_chunk = process_chunks(original_chunk, new_chunk)
    processed_chunk.to_csv(output_file_path, mode='a', index=False, header=not os.path.exists(output_file_path))


In [10]:
df = pd.read_csv('/data/puccetti/space_data/major_review_data/complete_dataset_new_filtered_new_time.csv', usecols=['timestamp'])

In [11]:
df

Unnamed: 0,timestamp
0,2023-01-01 00:00:00.000000000
1,2023-01-01 00:00:00.000417792
2,2023-01-01 00:00:00.000835584
3,2023-01-01 00:00:00.001253376
4,2023-01-01 00:00:00.001671168
...,...
20611046,2023-01-01 00:00:07.858296576
20611047,2023-01-01 00:00:07.860724224
20611048,2023-01-01 00:00:07.863151872
20611049,2023-01-01 00:00:07.865579520


In [12]:
df = pd.read_csv('/data/puccetti/space_data/major_review_data/new_timestamp_columns.csv')
df

Unnamed: 0.1,Unnamed: 0,timestamp,attack,formatted_timestamp
0,0,2023-01-01 00:00:00.000000000,observe,"Jan 01, 2023 00:00:00.000000"
1,1,2023-01-01 00:00:00.000417792,observe,"Jan 01, 2023 00:00:00.000417"
2,2,2023-01-01 00:00:00.000835584,observe,"Jan 01, 2023 00:00:00.000835"
3,3,2023-01-01 00:00:00.001253376,observe,"Jan 01, 2023 00:00:00.001253"
4,4,2023-01-01 00:00:00.001671168,observe,"Jan 01, 2023 00:00:00.001671"
...,...,...,...,...
20611046,20611046,2023-01-01 00:00:07.858296576,observe,"Jan 01, 2023 00:00:07.858296"
20611047,20611047,2023-01-01 00:00:07.860724224,observe,"Jan 01, 2023 00:00:07.860724"
20611048,20611048,2023-01-01 00:00:07.863151872,observe,"Jan 01, 2023 00:00:07.863151"
20611049,20611049,2023-01-01 00:00:07.865579520,observe,"Jan 01, 2023 00:00:07.865579"


In [17]:
df = pd.read_csv('/data/puccetti/space_data/major_review_data/complete_dataset_new_filtered_new_time.csv', nrows=3000)



In [19]:
df.columns

Index(['Unnamed: 0', 'timestamp', 'layers.frame.frame.time',
       'layers.frame.frame.time_delta',
       'layers.frame.frame.time_delta_displayed',
       'layers.frame.frame.time_relative', 'layers.frame.frame.number',
       'layers.frame.frame.len', 'layers.frame.frame.cap_len',
       'layers.frame.frame.protocols',
       ...
       'Active', 'pgalloc_dma', 'pgmajfault', 'SwapFree', 'src_topic',
       'subscribers_count', 'publishers_count', 'msg_type', 'msg_data',
       'attack'],
      dtype='object', length=483)

In [23]:
import pandas as pd
import os

# Define file paths
original_file_path = '/data/puccetti/space_data/major_review_data/complete_dataset_new_filtered_new_time.csv'
new_file_path = '/data/puccetti/space_data/major_review_data/new_timestamp_columns.csv'
output_file_path = '/data/puccetti/space_data/major_review_data/consegna_time.csv'
chunk_size = 10000

# Iterate over chunks of the original DataFrame and process each chunk
for original_chunk, new_chunk in zip(pd.read_csv(original_file_path, chunksize=chunk_size),
                                     pd.read_csv(new_file_path, chunksize=chunk_size)):
    original_chunk['layers.frame.frame.time'] = new_chunk['formatted_timestamp']
    original_chunk.drop(columns=['Unnamed: 0'], inplace=True)
    original_chunk.to_csv(output_file_path, mode='a', index=False, header=not os.path.exists(output_file_path))


In [37]:
reduced_path = '/data/puccetti/space_data/final_repo/rospace/reduced_final.csv'

In [38]:
df = pd.read_csv(reduced_path, index_col=False)



In [39]:
df.drop(columns=['Unnamed: 0'], inplace=True)

In [40]:
df.to_csv('/data/puccetti/space_data/major_review_data/consegna_reduced_old.csv', mode='a', index=False)

In [42]:
cols = df.columns

In [45]:
import pandas as pd
import os

# Define file paths
original_file_path = '/data/puccetti/space_data/major_review_data/consegna_time.csv'
output_file_path = '/data/puccetti/space_data/major_review_data/consegna_time_reduced.csv'
chunk_size = 10000

# Iterate over chunks of the original DataFrame and process each chunk
for original_chunk in pd.read_csv(original_file_path, chunksize=chunk_size):
    original_chunk = original_chunk[cols] 
    original_chunk.to_csv(output_file_path, mode='a', index=False, header=not os.path.exists(output_file_path))


In [4]:
cols = ['timestamp','layers.sll.sll.pkttype', 'layers.sll.sll.hatype',
       'layers.sll.sll.unused', 'layers.ip.ip.version',
       'layers.ip.ip.dsfield_tree.ip.dsfield.ecn', 'layers.ip.ip.flags',
       'layers.ip.ip.flags_tree.ip.flags.rb',
       'layers.ip.ip.flags_tree.ip.flags.df', 'layers.ip.ip.checksum',
       'layers.ip.ip.checksum.status', 'layers.tcp.tcp.stream',
       'layers.tcp.tcp.ack', 'layers.tcp.tcp.flags_tree.tcp.flags.ack',
       'layers.tcp.tcp.flags_tree.tcp.flags.syn',
       'layers.tcp.tcp.flags_tree.tcp.flags.syn_tree._ws.expert._ws.expert.message',
       'layers.tcp.tcp.flags_tree.tcp.flags.syn_tree._ws.expert._ws.expert.severity',
       'layers.tcp.tcp.flags_tree.tcp.flags.syn_tree._ws.expert._ws.expert.group',
       'layers.tcp.tcp.window_size_value', 'layers.tcp.tcp.window_size',
       'layers.tcp.tcp.options', 'layers.tcp.tcp.options_tree.tcp.options.nop',
       'layers.tcp.tcp.options_tree.tcp.options.nop_tree.tcp.option_kind',
       'layers.tcp.tcp.analysis.tcp.analysis.acks_frame',
       'layers.tcp.tcp.analysis.tcp.analysis.initial_rtt',
       'layers.tcp.tcp.payload',
       'layers.ssl.ssl.record.ssl.record.content_type', 'layers._ws.short',
       'layers.ipv6.ip.version',
       'layers.ipv6.ipv6.tclass_tree.ipv6.tclass.ecn',
       'layers.icmpv6.icmpv6.type', 'Net_Sent', 'pgpgin', 'pgactivate',
       'Disk_Read', 'pgfault', 'Net_Received', 'MemFree', 'Inactive',
       'pgdeactivate', 'Tcp_Close', 'pgfree', 'nr_active_file', 'Cached',
       'nr_inactive_file', 'Disk_Write', 'pgpgout', 'Tcp_Syn', 'Buffers',
       'Tcp_TimeWait', 'Tcp_Listen', 'Tcp_Established', 'Active',
       'pgalloc_dma', 'pgmajfault', 'SwapFree', 'src_topic',
       'subscribers_count', 'publishers_count', 'msg_type', 'msg_data',
       'attack']

In [5]:
import pandas as pd
import os

# Define file paths
original_file_path = '/data/puccetti/space_data/major_review_data/final_files/rospace_complete.csv'
output_file_path = '/data/puccetti/space_data/major_review_data/consegna_time_reduced.csv'
chunk_size = 10000

# Iterate over chunks of the original DataFrame and process each chunk
for original_chunk in pd.read_csv(original_file_path, chunksize=chunk_size):
    original_chunk = original_chunk[cols] 
    original_chunk.to_csv(output_file_path, mode='a', index=False, header=not os.path.exists(output_file_path))

