# Notebook for preprocessing the datasets
### Here we clean the datasets and leave them in a format that is easy to use for exploratory analysis and modeling

# TODO: reformat the notebook, since the goal is now only to clean 3 datasets separately

In [62]:
# Import the libraries
import pandas as pd

## First, we load the first dataset, [Internet Firewall Data](https://archive.ics.uci.edu/dataset/542/internet+firewall+data)


In [63]:
# Load the dataset
baseDf = pd.read_csv('DataSets/log2.csv')
# baseDf is the format we want to bring all the data into

## Log2.csv has 12 columns:
1. Source Port
2. Destination Port
3. NAT Source Port
4. NAT Destination Port
5. Action
6. Bytes (sum of Bytes Sent and Bytes Received)
7. Bytes Sent
8. Bytes Received
9. Packets
10. Elapsed Time (sec)
11. pkts_sent
12. pkts_received

Action is our target column.
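
The column list above states that Bytes is the sum of Bytes Sent and Bytes Received. A minimal sketch of how the expected schema and that relationship could be checked follows. The helper name `check_firewall_schema` and the two toy rows are illustrative assumptions, not part of the original notebook or the real data.

```python
import pandas as pd

# The 12 column names listed above for log2.csv
EXPECTED_COLUMNS = [
    "Source Port", "Destination Port", "NAT Source Port", "NAT Destination Port",
    "Action", "Bytes", "Bytes Sent", "Bytes Received", "Packets",
    "Elapsed Time (sec)", "pkts_sent", "pkts_received",
]

def check_firewall_schema(df: pd.DataFrame) -> bool:
    """Hypothetical helper: True if columns match and Bytes = Bytes Sent + Bytes Received."""
    if list(df.columns) != EXPECTED_COLUMNS:
        return False
    return bool((df["Bytes"] == df["Bytes Sent"] + df["Bytes Received"]).all())

# Made-up rows purely for illustration (not taken from the real dataset)
toy = pd.DataFrame(
    [[57222, 53, 54587, 53, "allow", 177, 94, 83, 2, 30, 1, 1],
     [50553, 3389, 0, 0, "drop", 70, 70, 0, 1, 0, 1, 0]],
    columns=EXPECTED_COLUMNS,
)
schema_ok = check_firewall_schema(toy)
print(schema_ok)
```

On the real data the same helper would be called as `check_firewall_schema(baseDf)`.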

First, let's verify that the dataset is usable.

In [64]:
baseDf['Action'].value_counts()

Action
allow         37640
deny          14987
drop          12851
reset-both       54
Name: count, dtype: int64
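
The counts above can also be read as proportions, which makes the size of each class easier to compare. The sketch below rebuilds the counts by hand from the printed output, so it runs without the CSV; on the real frame, `baseDf['Action'].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Counts copied from the value_counts() output above
action_counts = pd.Series(
    {"allow": 37640, "deny": 14987, "drop": 12851, "reset-both": 54},
    name="count",
)

# Share of each class relative to the total number of rows
action_share = action_counts / action_counts.sum()
print(action_share.round(4))
```

The `reset-both` class accounts for fewer than 0.1% of the rows, which is worth keeping in mind when splitting the data or choosing metrics.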

In [65]:
# Verify that there are no null values
assert baseDf.isnull().sum().sum() == 0
# Check the column dtypes
baseDf.dtypes

Source Port              int64
Destination Port         int64
NAT Source Port          int64
NAT Destination Port     int64
Action                  object
Bytes                    int64
Bytes Sent               int64
Bytes Received           int64
Packets                  int64
Elapsed Time (sec)       int64
pkts_sent                int64
pkts_received            int64
dtype: object

In [66]:
len(baseDf)

65532

Here we see something interesting: the NAT Source Port column has many 0 values (28432 of the 65532 rows in the output below). The dataset does not document what this means, so we will need to keep it in mind for the model.

In [67]:
baseDf['NAT Source Port'].value_counts()

NAT Source Port
0        28432
48817       83
58638       51
50116       15
7986         5
         ...  
2063         1
33661        1
36797        1
14122        1
13485        1
Name: count, Length: 29152, dtype: int64
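
One way to follow up on the zero values is to measure their share and see how they are distributed across the target classes. This is only a sketch: the rows are made up, so the printed table says nothing about the real data, and it does not answer what a 0 port means.

```python
import pandas as pd

# Made-up rows for illustration only; on the real data use baseDf instead of toy
toy = pd.DataFrame({
    "NAT Source Port": [0, 0, 48817, 58638, 0, 50116],
    "Action": ["deny", "drop", "allow", "allow", "deny", "allow"],
})

# Share of rows whose NAT Source Port is 0
zero_share = toy["NAT Source Port"].eq(0).mean()

# Label each row as "zero" or "nonzero" and cross-tabulate against the target column
nat_state = (
    toy["NAT Source Port"].eq(0)
    .map({True: "zero", False: "nonzero"})
    .rename("NAT Source Port")
)
zero_by_action = pd.crosstab(nat_state, toy["Action"])
print(zero_share)
print(zero_by_action)
```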

In [68]:
# Save the dataset so we can use it in the exploratory analysis
baseDf.to_csv('DataSets/Internet_firewall_data.csv',index=False)
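
The save step above passes `index=False`. The sketch below, which uses a made-up two-row frame and in-memory buffers instead of files, illustrates why: when the index is written, reading the file back produces an extra `Unnamed: 0` column.

```python
import io
import pandas as pd

# Made-up frame for illustration
toy = pd.DataFrame({"Action": ["allow", "deny"], "Bytes": [177, 70]})

# Write once with the index and once without, into in-memory buffers
with_index = io.StringIO()
without_index = io.StringIO()
toy.to_csv(with_index)
toy.to_csv(without_index, index=False)

# Read both back to compare the resulting columns
with_index.seek(0)
without_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)
cols_without_index = list(pd.read_csv(without_index).columns)
print(cols_with_index)
print(cols_without_index)
```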

## Next, we load the second dataset, [UNSW-NB15](https://research.unsw.edu.au/projects/unsw-nb15-dataset)
It is split into 5 CSV files, but here we use only UNSW_NB15_training-set.csv and UNSW_NB15_testing-set.csv.

The columns and their descriptions can be found in NUSW-NB15_features.csv; there are too many to list here.

In [69]:
unswDf = pd.read_csv('DataSets/UNSW_NB15_training-set.csv')
print("Dataset size: ",len(unswDf))
# Both files include a header row that read_csv uses for the column names, so no skiprows is needed before appending the testing set to the first df
unswTestDf = pd.read_csv('DataSets/UNSW_NB15_testing-set.csv')
unswDf = pd.concat([unswDf,unswTestDf],ignore_index=True)
# Verify that the two parts were concatenated correctly
print("Dataset size: ",len(unswDf))

Dataset size:  175341
Dataset size:  257673
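
The two sizes printed above imply that the testing file contributes 257673 - 175341 = 82332 rows. A minimal sketch of the same concatenation on made-up frames is shown next, with one optional addition: a `split` column, which is a hypothetical name not used in the notebook, that records which file each row came from.

```python
import pandas as pd

# Made-up stand-ins for the training and testing files
train_toy = pd.DataFrame({"id": [1, 2, 3], "label": [0, 1, 0]})
test_toy = pd.DataFrame({"id": [1, 2], "label": [1, 0]})

# Hypothetical 'split' column so the original partition can be recovered later
combined = pd.concat(
    [train_toy.assign(split="train"), test_toy.assign(split="test")],
    ignore_index=True,
)

# The combined length should equal the sum of both parts
assert len(combined) == len(train_toy) + len(test_toy)
print(combined["split"].value_counts().to_dict())
```

With `ignore_index=True` the result gets a fresh 0..n-1 index, so rows from the second file do not repeat index labels from the first.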


In [70]:
unswDf.dtypes

id                     int64
dur                  float64
proto                 object
service               object
state                 object
spkts                  int64
dpkts                  int64
sbytes                 int64
dbytes                 int64
rate                 float64
sttl                   int64
dttl                   int64
sload                float64
dload                float64
sloss                  int64
dloss                  int64
sinpkt               float64
dinpkt               float64
sjit                 float64
djit                 float64
swin                   int64
stcpb                  int64
dtcpb                  int64
dwin                   int64
tcprtt               float64
synack               float64
ackdat               float64
smean                  int64
dmean                  int64
trans_depth            int64
response_body_len      int64
ct_srv_src             int64
ct_state_ttl           int64
ct_dst_ltm             int64
ct_src_dport_l

The dtypes show nothing unusual, but there are categorical columns (such as proto, service and state), which we will need to account for during modeling.
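
The categorical columns can also be listed programmatically rather than read off the dtypes output. The sketch below assumes a made-up frame that reuses a few column names from the output above; on the real data the same calls would be applied to `unswDf`.

```python
import pandas as pd

# Made-up rows using a few column names from the dtypes output above
toy = pd.DataFrame({
    "dur": [0.121478, 0.649902, 1.681642],
    "proto": ["tcp", "tcp", "udp"],
    "service": ["-", "ftp", "-"],
    "state": ["FIN", "FIN", "INT"],
    "spkts": [6, 14, 12],
})

# Non-numeric columns are the candidates for categorical handling
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values per categorical column
cardinality = toy[categorical_cols].nunique()
print(categorical_cols)
print(cardinality)
```

The cardinality of each column is useful later when deciding how to encode it for a model.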

In [71]:
unswDf.head()

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,0.121478,tcp,-,FIN,6,4,258,172,74.08749,...,1,1,0,0,0,1,1,0,Normal,0
1,2,0.649902,tcp,-,FIN,14,38,734,42014,78.473372,...,1,2,0,0,0,1,6,0,Normal,0
2,3,1.623129,tcp,-,FIN,8,16,364,13186,14.170161,...,1,3,0,0,0,2,6,0,Normal,0
3,4,1.681642,tcp,ftp,FIN,12,12,628,770,13.677108,...,1,3,1,1,0,2,1,0,Normal,0
4,5,0.449454,tcp,-,FIN,10,6,534,268,33.373826,...,1,40,0,0,0,2,39,0,Normal,0


In [72]:
unswDf.isnull().sum()

id                   0
dur                  0
proto                0
service              0
state                0
spkts                0
dpkts                0
sbytes               0
dbytes               0
rate                 0
sttl                 0
dttl                 0
sload                0
dload                0
sloss                0
dloss                0
sinpkt               0
dinpkt               0
sjit                 0
djit                 0
swin                 0
stcpb                0
dtcpb                0
dwin                 0
tcprtt               0
synack               0
ackdat               0
smean                0
dmean                0
trans_depth          0
response_body_len    0
ct_srv_src           0
ct_state_ttl         0
ct_dst_ltm           0
ct_src_dport_ltm     0
ct_dst_sport_ltm     0
ct_dst_src_ltm       0
is_ftp_login         0
ct_ftp_cmd           0
ct_flw_http_mthd     0
ct_src_ltm           0
ct_srv_dst           0
is_sm_ips_ports      0
attack_cat 

In [73]:
# Now save this dataset
unswDf.to_csv('DataSets/UNSW.csv',index=False)

## Now, we clean the [KDDCup99](https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset
The dataset description is at the link above, but we had some problems downloading the data from there. It can also be downloaded from [here](https://www.kaggle.com/datasets/galaxyh/kdd-cup-1999-data).

In [74]:
# First, list the columns the dataset has
cols="""duration, protocol_type,
service,
flag,
src_bytes,
dst_bytes,
land,
wrong_fragment,
urgent,
hot,
num_failed_logins,
logged_in,
num_compromised,
root_shell,
su_attempted,
num_root,
num_file_creations,
num_shells,
num_access_files,
num_outbound_cmds,
is_host_login,
is_guest_login,
count,
srv_count,
serror_rate,
srv_serror_rate,
rerror_rate,
srv_rerror_rate,
same_srv_rate,
diff_srv_rate,
srv_diff_host_rate,
dst_host_count,
dst_host_srv_count,
dst_host_same_srv_rate,
dst_host_diff_srv_rate,
dst_host_same_src_port_rate,
dst_host_srv_diff_host_rate,
dst_host_serror_rate,
dst_host_srv_serror_rate,
dst_host_rerror_rate,
dst_host_srv_rerror_rate,
attack"""

# Now turn this into a list of names for our dataframe, stripping stray whitespace so no column name starts with a space
cols = [c.strip() for c in cols.replace("\n", "").split(",")]
print(cols)


['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'attack']


In [75]:
# Load the dataset. The file is comma-separated and has no header row, so the default separator works and we pass the column names explicitly
kddDf = pd.read_csv('DataSets/kddcup.data_10_percent',names=cols)


In [76]:
kddDf.isnull().sum()

duration                       0
protocol_type                  0
service                        0
flag                           0
src_bytes                      0
dst_bytes                      0
land                           0
wrong_fragment                 0
urgent                         0
hot                            0
num_failed_logins              0
logged_in                      0
num_compromised                0
root_shell                     0
su_attempted                   0
num_root                       0
num_file_creations             0
num_shells                     0
num_access_files               0
num_outbound_cmds              0
is_host_login                  0
is_guest_login                 0
count                          0
srv_count                      0
serror_rate                    0
srv_serror_rate                0
rerror_rate                    0
srv_rerror_rate                0
same_srv_rate                  0
diff_srv_rate                  0
srv_diff_h
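
The load step for this dataset relies on `names=` because the file has no header row. The sketch below shows the effect on two made-up records with a shortened, illustrative column list (`short_cols` is not a name from the notebook): without `names=`, pandas treats the first record as the header and that record is lost from the data.

```python
import io
import pandas as pd

# Two made-up comma-separated records with no header row
raw = "0,tcp,http,SF,normal.\n0,icmp,ecr_i,SF,smurf.\n"
# Shortened, illustrative column list
short_cols = ["duration", "protocol_type", "service", "flag", "attack"]

# names= supplies the header, so both records are kept as data
named = pd.read_csv(io.StringIO(raw), names=short_cols)
# Without names=, the first record is consumed as the column names
unnamed = pd.read_csv(io.StringIO(raw))
print(len(named), len(unnamed))
```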

In [77]:
kddDf.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack
0,0,tcp,http,SF,181,5450,0,0,0,0,...,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,normal.
1,0,tcp,http,SF,239,486,0,0,0,0,...,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,normal.
2,0,tcp,http,SF,235,1337,0,0,0,0,...,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,normal.
3,0,tcp,http,SF,219,1337,0,0,0,0,...,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,normal.
4,0,tcp,http,SF,217,2032,0,0,0,0,...,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,normal.


In [78]:
# Something interesting: attack has many distinct labels. The file training_attack_types.txt lists 22 attack labels but only 4 attack categories, so we group the labels by category

tipos_de_ataques = {
    'normal': 'normal',
    'back': 'dos',
    'buffer_overflow': 'u2r',
    'ftp_write': 'r2l',
    'guess_passwd': 'r2l',
    'imap': 'r2l',
    'ipsweep': 'probe',
    'land': 'dos',
    'loadmodule': 'u2r',
    'multihop': 'r2l',
    'neptune': 'dos',
    'nmap': 'probe',
    'perl': 'u2r',
    'phf': 'r2l',
    'pod': 'dos',
    'portsweep': 'probe',
    'rootkit': 'u2r',
    'satan': 'probe',
    'smurf': 'dos',
    'spy': 'r2l',
    'teardrop': 'dos',
    'warezclient': 'r2l',
    'warezmaster': 'r2l',
}

# Raw labels end with a trailing '.', e.g. 'normal.'; drop it before the lookup
kddDf['Attack Type'] = kddDf.attack.apply(lambda r:tipos_de_ataques[r[:-1]])

kddDf.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack,Attack Type
0,0,tcp,http,SF,181,5450,0,0,0,0,...,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,normal.,normal
1,0,tcp,http,SF,239,486,0,0,0,0,...,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,normal.,normal
2,0,tcp,http,SF,235,1337,0,0,0,0,...,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,normal.,normal
3,0,tcp,http,SF,219,1337,0,0,0,0,...,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,normal.,normal
4,0,tcp,http,SF,217,2032,0,0,0,0,...,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,normal.,normal
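
The grouping above looks up each label in a dictionary after dropping its trailing dot, and raises a `KeyError` if a label is missing from the dictionary. A possible vectorized alternative is sketched here on made-up labels; the dictionary is a small subset of the one above, and `attack_categories` and `unknown_attack.` are hypothetical names used only for illustration.

```python
import pandas as pd

# Small subset of the label-to-category mapping defined above
attack_categories = {"normal": "normal", "smurf": "dos", "nmap": "probe"}

# Made-up labels in the raw format, each ending with a trailing '.'
labels = pd.Series(["normal.", "smurf.", "nmap.", "unknown_attack."])

# Strip the trailing dot, then map; labels missing from the dict become NaN instead of raising
mapped = labels.str.rstrip(".").map(attack_categories)

# Collect the raw labels that could not be mapped, for inspection
unmapped = labels[mapped.isna()].tolist()
print(mapped.tolist())
print(unmapped)
```

Whether a silent NaN or an immediate error is preferable depends on the workflow; the version in the notebook fails fast, while this one lets unknown labels be reviewed afterwards.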


In [79]:
# Now add a binary column indicating whether the record is an attack, for modeling
kddDf['Is attack'] = kddDf['Attack Type'].map({'normal': 0, 'dos': 1, 'u2r': 1, 'r2l': 1, 'probe': 1})
kddDf.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack,Attack Type,Is attack
0,0,tcp,http,SF,181,5450,0,0,0,0,...,0.0,0.11,0.0,0.0,0.0,0.0,0.0,normal.,normal,0
1,0,tcp,http,SF,239,486,0,0,0,0,...,0.0,0.05,0.0,0.0,0.0,0.0,0.0,normal.,normal,0
2,0,tcp,http,SF,235,1337,0,0,0,0,...,0.0,0.03,0.0,0.0,0.0,0.0,0.0,normal.,normal,0
3,0,tcp,http,SF,219,1337,0,0,0,0,...,0.0,0.03,0.0,0.0,0.0,0.0,0.0,normal.,normal,0
4,0,tcp,http,SF,217,2032,0,0,0,0,...,0.0,0.02,0.0,0.0,0.0,0.0,0.0,normal.,normal,0
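
The binary column above is built with an explicit dictionary over the five categories. An equivalent formulation, sketched here on a made-up series, is to flag everything that is not `normal`.

```python
import pandas as pd

# Made-up category values for illustration
attack_type = pd.Series(["normal", "dos", "probe", "normal", "r2l"], name="Attack Type")

# Anything that is not 'normal' counts as an attack
is_attack = (attack_type != "normal").astype(int)

# The explicit dictionary mapping used in the notebook, for comparison
explicit = attack_type.map({"normal": 0, "dos": 1, "u2r": 1, "r2l": 1, "probe": 1})
print(is_attack.tolist())
```

The two approaches agree on the five known categories but differ on missing or unexpected values: the comparison flags them as attacks, while the dictionary mapping leaves them as NaN.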


For now, we leave the categorical variables unchanged, since we want to do an exploratory analysis first.

In [80]:
# Now save this dataset
kddDf.to_csv('DataSets/KDDCup99.csv',index=False)