<div style="
    display: flex;
    justify-content: center;
    align-items: center;
    background: linear-gradient(to bottom, #F5F5DC, #C0C0C0);
    padding: 10px;
    border-radius: 10px;
">
    <img src="../samsung.png" alt="Samsung Innovation Campus" style="border-radius: 5px;">
</div>
<div style="
    text-align: center;
    font-style: italic;
">
    This project was developed as part of the Samsung Innovation Campus 2024 program
</div>

<b>PRESENTED BY</b>: José Armando Ramírez Islas & Jorge Octavio Nicolás Díaz

# __Preprocessing__

The dataset has the following features (named as they appear in the DataFrame columns):

| Feature      | Description                                                                 |
|--------------|-----------------------------------------------------------------------------|
| Dur          | Connection duration in seconds.                                             |
| Proto        | Communication protocol used (e.g. TCP, UDP, ICMP).                          |
| Dir          | Direction of the traffic flow (e.g. → from source to destination, or ← from destination to source). |
| State        | Connection state (e.g. CON for established connections, INT for interrupted ones). |
| sTos / dTos  | Type of service (ToS) of the sent and received traffic. These values indicate the packet's priority on the network. |
| TotPkts      | Total number of packets sent in the connection.                             |
| TotBytes     | Total number of bytes transferred.                                          |
| SrcBytes     | Number of bytes sent from the source IP.                                    |
| Label        | Label indicating whether the traffic is normal or belongs to a botnet (malicious traffic). |
| BOTNET_NAME  | Botnet family detected (e.g. Neris, Rbot, Virut, Murlo).                    |

### 1️⃣ __Importing modules__

In [88]:
import dask.dataframe as dd
import os

### 2️⃣ __Read the CSV, forcing the conflicting columns to 'object' (string)__

In [None]:
ctu13_df = dd.read_csv('../data/total.csv', dtype={"Dport": "object", "Sport": "object"}, assume_missing=True)
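
As a rough illustration of why `Sport` and `Dport` are read as strings, the sketch below parses a few rows with the standard `csv` module and lists the values that cannot be parsed as base-10 integers. The sample rows are made up, and the presence of hex-style values such as `0x0303` is an assumption based on the hex conversion step applied later in this notebook.

```python
import csv
import io

# Made-up rows for illustration only; not taken from the real dataset.
SAMPLE_CSV = """Sport,Dport
7525,50012
0x0303,0x0008
51686,
"""

def non_decimal_values(csv_text, column):
    """Return the non-empty values in `column` that int() cannot parse as base 10."""
    bad = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row[column]
        if not value:
            continue  # empty field: a missing value, not a parsing conflict
        try:
            int(value)
        except ValueError:
            bad.append(value)
    return bad

print(non_decimal_values(SAMPLE_CSV, "Sport"))
print(non_decimal_values(SAMPLE_CSV, "Dport"))
```

A column that mixes values like these cannot be given a single numeric dtype up front, so reading it as text and converting afterwards is one way around the problem.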

### 3️⃣ __Check the data types before processing__

In [None]:
print(ctu13_df.dtypes)

StartTime      string[pyarrow]
Dur                    float64
Proto          string[pyarrow]
SrcAddr        string[pyarrow]
Sport          string[pyarrow]
Dir            string[pyarrow]
DstAddr        string[pyarrow]
Dport          string[pyarrow]
State          string[pyarrow]
sTos                   float64
dTos                   float64
TotPkts                float64
TotBytes               float64
SrcBytes               float64
Label          string[pyarrow]
BOTNET_NAME    string[pyarrow]
dtype: object


### 4️⃣ __Function to convert hex to int while avoiding errors__

In [None]:
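def safe_hex_to_int(value):
    """Convert a decimal or '0x'-prefixed hex string to int; return None if it cannot be parsed."""
    try:
        if isinstance(value, str) and value.startswith("0x"):
            return int(value, 16)
        return int(value)
    except (ValueError, TypeError):  # TypeError covers missing values such as None
        return None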
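
A quick check of the conversion on a few made-up inputs may help here. The function is repeated so the snippet runs on its own, in a variant that also catches `TypeError` so that `None` is handled; the sample values are illustrative and not taken from the dataset.

```python
def safe_hex_to_int(value):
    """Convert a decimal or '0x'-prefixed hex string to int, or return None."""
    try:
        if isinstance(value, str) and value.startswith("0x"):
            return int(value, 16)
        return int(value)
    except (ValueError, TypeError):
        return None

# Illustrative inputs: decimal text, hex text, garbage, and a missing value.
samples = ["443", "0x0050", "not-a-port", None]
converted = [safe_hex_to_int(s) for s in samples]
print(converted)  # [443, 80, None, None]
```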

### 5️⃣ __Apply the conversion to the columns__

In [None]:
ctu13_df["Dport"] = ctu13_df["Dport"].map(safe_hex_to_int, meta=("Dport", "float64"))
ctu13_df["Sport"] = ctu13_df["Sport"].map(safe_hex_to_int, meta=("Sport", "float64"))

In [None]:
ctu13_df.head()

|  | StartTime | Dur | Proto | SrcAddr | Sport | Dir | DstAddr | Dport | State | sTos | dTos | TotPkts | TotBytes | SrcBytes | Label | BOTNET_NAME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011/08/16 13:51:24.049047 | 1277.465088 | udp | 147.32.84.59 | 7525 | `<->` | 213.239.192.34 | 50012 | CON | 0.0 | 0.0 | 1606.0 | 1508937.0 | 1245441.0 | flow=Background-Established-cmpgw-CVUT | Sogou |
| 1 | 2011/08/16 13:51:24.049051 | 1200.943726 | udp | 147.32.84.59 | 7525 | `<->` | 188.40.100.105 | 50012 | CON | 0.0 | 0.0 | 1240.0 | 992275.0 | 425194.0 | flow=Background-Established-cmpgw-CVUT | Sogou |
| 2 | 2011/08/16 13:51:24.049832 | 1276.75061 | tcp | 80.98.130.52 | 51686 | `<?>` | 147.32.84.229 | 13363 | PA_PA | 0.0 | 0.0 | 112.0 | 12484.0 | 8482.0 | flow=Background | Sogou |
| 3 | 2011/08/16 13:51:24.049954 | 1190.095215 | udp | 147.32.84.59 | 7525 | `<->` | 78.46.38.219 | 55012 | CON | 0.0 | 0.0 | 1841.0 | 1692037.0 | 1429143.0 | flow=Background-Established-cmpgw-CVUT | Sogou |
| 4 | 2011/08/16 13:51:24.050042 | 1257.621094 | udp | 147.32.84.59 | 7525 | `<->` | 213.239.199.195 | 51012 | CON | 0.0 | 0.0 | 1890.0 | 1782878.0 | 1324527.0 | flow=Background-Established-cmpgw-CVUT | Sogou |


### 6️⃣ __Which columns have null values?__

In [None]:
ctu13_df.isnull().sum().compute()

StartTime           0
Dur                 0
Proto               0
SrcAddr             0
Sport           74231
Dir                 0
DstAddr             0
Dport          134262
State            1235
sTos            84629
dTos           671524
TotPkts             0
TotBytes            0
SrcBytes            0
Label               0
BOTNET_NAME         0
dtype: int64
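
To put these counts in proportion, the sketch below turns them into percentages of the whole dataset. The total row count is taken as the sum of the two `is_botnet` counts reported later in this notebook (339,700 + 8,376,526); the variable names are illustrative.

```python
# Null counts copied from the output above.
null_counts = {
    "Sport": 74231,
    "Dport": 134262,
    "State": 1235,
    "sTos": 84629,
    "dTos": 671524,
}

# Total rows: sum of the is_botnet value counts reported later in this notebook.
total_rows = 339700 + 8376526

# Percentage of rows with a missing value, per column.
null_pct = {col: 100 * n / total_rows for col, n in null_counts.items()}

for col, pct in sorted(null_pct.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<6} {pct:6.3f}%")
```

By this calculation `dTos` is missing in roughly 7.7% of the rows, while every other column is missing in under 2%.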

### 7️⃣ __Filling in the records that have null values__

In [32]:
ctu13_df['State'].value_counts().nlargest(1).compute()

State
CON    6429439
Name: count, dtype: int64

In [33]:
ctu13_df['State'] = ctu13_df.State.fillna(value='CON')
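
The `State` fill above is a mode imputation: missing entries receive the most frequent value, which is `CON`. A minimal plain-Python sketch of the same idea follows, using a made-up list of states and a hypothetical helper named `fill_with_mode`.

```python
from collections import Counter

def fill_with_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    present = [v for v in values if v is not None]
    mode, _ = Counter(present).most_common(1)[0]
    return [mode if v is None else v for v in values]

# Made-up sample; None marks a missing state.
states = ["CON", "PA_PA", None, "CON", "S_RA", None]
print(fill_with_mode(states))
```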

In [39]:
ctu13_df['sTos'].describe().compute()

count    8.631597e+06
mean     6.714609e-02
std      3.544077e+00
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      1.920000e+02
Name: sTos, dtype: float64

In [40]:
ctu13_df['dTos'].describe().compute()

count    8.044702e+06
mean     4.371821e-04
std      3.388975e-02
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      3.000000e+00
Name: dTos, dtype: float64

In [42]:
ctu13_df['sTos'] = ctu13_df.sTos.fillna(value=0.0)
ctu13_df['dTos'] = ctu13_df.dTos.fillna(value=0.0)

In [46]:
ctu13_df['Sport'] = ctu13_df['Sport'].ffill().bfill()
ctu13_df['Dport'] = ctu13_df['Dport'].ffill().bfill()
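
To make the effect of the `ffill().bfill()` chain concrete, here is a plain-Python sketch of forward fill followed by backward fill on a made-up list of ports. It only illustrates the behaviour and is not how pandas or Dask implement it; the helper names are hypothetical.

```python
def ffill(values):
    """Forward fill: each None takes the last non-missing value seen before it."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

def bfill(values):
    """Backward fill: each None takes the next non-missing value after it."""
    return ffill(values[::-1])[::-1]

# Made-up ports; the leading None can only be filled by the backward pass.
ports = [None, 80, None, 443, None]
filled = bfill(ffill(ports))
print(filled)  # [80, 80, 80, 443, 443]
```

Note that the filled port comes from whichever row happens to be adjacent, so it is a placeholder and not a recovered value.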

### 8️⃣ __Checking whether a flow belongs to a botnet__

In [47]:
def convert_label(sample):
    if isinstance(sample, str) and "Botnet" in sample: 
        return 1
    else: 
        return 0
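
A short check of this rule on label strings that appear in this notebook's outputs, plus a missing value. The function is repeated so the snippet runs on its own.

```python
def convert_label(sample):
    """Return 1 if the label string mentions 'Botnet', else 0."""
    if isinstance(sample, str) and "Botnet" in sample:
        return 1
    return 0

labels = [
    "flow=From-Botnet-V49-UDP-Established",
    "flow=Background-Established-cmpgw-CVUT",
    "flow=Background",
    None,  # a missing label is treated as non-botnet
]
flags = [convert_label(label) for label in labels]
print(flags)  # [1, 0, 0, 0]
```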

### 9️⃣ __Add a column that identifies whether the flow belongs to a botnet__

In [59]:
ctu13_df['is_botnet'] = ctu13_df['Label'].apply(convert_label, meta=('Label', 'int64'))

In [60]:
ctu13_df[ctu13_df['is_botnet'] == 1].sample(frac=0.1, random_state=45).compute()

|  | StartTime | Dur | Proto | SrcAddr | Sport | Dir | DstAddr | Dport | State | sTos | dTos | TotPkts | TotBytes | SrcBytes | Label | BOTNET_NAME | is_botnet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 280624 | 2011/08/16 14:50:17.687498 | 36.043262 | udp | 147.32.84.165 | 21963.0 | `<->` | 183.60.49.125 | 8000.0 | CON | 0.0 | 0.0 | 29.0 | 11218.0 | 3760.0 | flow=From-Botnet-V49-UDP-Established | Murlo | 1 |
| 290426 | 2011/08/16 14:52:38.993736 | 20.849554 | udp | 147.32.84.165 | 21975.0 | `<->` | 183.60.49.30 | 8000.0 | CON | 0.0 | 0.0 | 9.0 | 2618.0 | 1148.0 | flow=From-Botnet-V49-UDP-Established-Custom-En... | Murlo | 1 |
| 264298 | 2011/08/16 14:46:39.499428 | 76.865601 | udp | 147.32.84.165 | 21941.0 | `<->` | 112.95.240.134 | 8000.0 | CON | 0.0 | 0.0 | 107.0 | 44654.0 | 11848.0 | flow=From-Botnet-V49-UDP-Established-Custom-En... | Murlo | 1 |
| 245469 | 2011/08/16 14:43:16.831111 | 1296.112793 | udp | 147.32.84.165 | 7100.0 | `<->` | 218.83.161.9 | 9009.0 | CON | 0.0 | 0.0 | 10.0 | 1500.0 | 480.0 | flow=From-Botnet-V49-UDP-Established | Murlo | 1 |
| 285610 | 2011/08/16 14:51:31.544709 | 61.295368 | tcp | 147.32.84.165 | 4047.0 | `->` | 222.189.228.111 | 3389.0 | FSPA_FSPA | 0.0 | 0.0 | 10.0 | 1076.0 | 437.0 | flow=From-Botnet-V49-TCP-CC74-HTTP-Custom-Port... | Murlo | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 210021 | 2011/08/16 08:02:41.613561 | 1.307678 | tcp | 147.32.84.165 | 20310.0 | `->` | 184.173.217.40 | 443.0 | S_RA | 0.0 | 0.0 | 6.0 | 366.0 | 186.0 | flow=From-Botnet-V54-TCP-Attempt | Virut | 1 |
| 139986 | 2011/08/16 07:19:18.407383 | 1.307373 | tcp | 147.32.84.165 | 18892.0 | `->` | 184.173.217.40 | 443.0 | S_RA | 0.0 | 0.0 | 6.0 | 366.0 | 186.0 | flow=From-Botnet-V54-TCP-Attempt | Virut | 1 |
| 229286 | 2011/08/16 08:11:49.613636 | 1.307503 | tcp | 147.32.84.165 | 20582.0 | `->` | 184.173.217.40 | 443.0 | S_RA | 0.0 | 0.0 | 6.0 | 366.0 | 186.0 | flow=From-Botnet-V54-TCP-Attempt | Virut | 1 |
| 79599 | 2011/08/16 06:39:20.520277 | 1.307597 | tcp | 147.32.84.165 | 17706.0 | `->` | 184.173.217.40 | 443.0 | S_RA | 0.0 | 0.0 | 6.0 | 366.0 | 186.0 | flow=From-Botnet-V54-TCP-Attempt | Virut | 1 |


### 🔟 __Columns that contain categorical data__

In [57]:
# Categorical feature names
ctu13_df.select_dtypes(exclude='number').columns

Index(['StartTime', 'Proto', 'SrcAddr', 'Dir', 'DstAddr', 'State', 'Label',
       'BOTNET_NAME'],
      dtype='object')

### 1️⃣1️⃣ __Columns that contain numeric data__

In [58]:
# Numeric feature names
ctu13_df.select_dtypes(include='number').columns

Index(['Dur', 'Sport', 'Dport', 'sTos', 'dTos', 'TotPkts', 'TotBytes',
       'SrcBytes', 'is_botnet'],
      dtype='object')

### 1️⃣2️⃣ __Number of records that are botnet traffic and number that are not__

In [None]:
print(ctu13_df['is_botnet'].value_counts().compute())

is_botnet
1     339700
0    8376526
Name: count, dtype: int64
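
These two counts show a strong class imbalance. The short sketch below quantifies it from the numbers above; the variable names are illustrative.

```python
# Counts copied from the is_botnet output above.
botnet_rows = 339700
normal_rows = 8376526

total_rows = botnet_rows + normal_rows
botnet_share = botnet_rows / total_rows          # fraction of rows labelled as botnet
imbalance_ratio = normal_rows / botnet_rows      # non-botnet rows per botnet row

print(f"total rows: {total_rows}")
print(f"botnet share: {botnet_share:.2%}")
print(f"non-botnet rows per botnet row: {imbalance_ratio:.1f}")
```

Botnet flows make up a bit under 4% of the rows, which is worth keeping in mind when choosing evaluation metrics or a resampling strategy later on.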


### 1️⃣3️⃣ __Filter only the records where is_botnet == 1 to count the botnet types__

In [86]:
# Keep only the records where is_botnet == 1
print(ctu13_df[ctu13_df['is_botnet'] == 1]['BOTNET_NAME'].value_counts().compute())

BOTNET_NAME
RBot      106352
Virut      40003
Sogou         63
Neris     184987
NsisAy      2168
Murlo       6127
Name: count, dtype: int64
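
As a sanity check, the per-family counts should add up to the 339,700 botnet rows reported in the previous step. The sketch below verifies that and prints each family's share; the variable names are illustrative.

```python
# Counts copied from the BOTNET_NAME output above.
family_counts = {
    "RBot": 106352,
    "Virut": 40003,
    "Sogou": 63,
    "Neris": 184987,
    "NsisAy": 2168,
    "Murlo": 6127,
}

total_botnet = sum(family_counts.values())

# Share of botnet rows per family, in percent.
shares = {name: 100 * n / total_botnet for name, n in family_counts.items()}

print(f"total botnet rows: {total_botnet}")
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<7} {pct:6.2f}%")
```

Neris and RBot together account for most of the botnet rows, while Sogou contributes only 63 flows, so the families are themselves heavily imbalanced.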


### 1️⃣4️⃣ __Generate a new preprocessed dataset__

In [None]:
os.makedirs('preprocessing_data', exist_ok=True)  # make sure the output folder exists
ctu13_df.compute().to_csv(os.path.join('preprocessing_data', 'dataset_procesado.csv'), index=False)