# Introduction
This dataset, provided for auditing, contains a wide range of intrusions simulated in a research organization.
The environment imitated a typical e-commerce service on the internet in order to capture raw TCP/IP dump data for a network. The network was operated as if it were a real setting, and various attacks were launched against it.

A connection is a sequence of TCP packets that starts and ends at well-defined times, during which data flows from a source IP address to a target IP address under a well-defined protocol.
In addition, each connection is labeled as either normal or an attack. Each connection record is around 100 bytes long.

For each TCP/IP connection, the file loaded below provides 19 columns: 18 features (3 qualitative and 15 quantitative) plus the class label. The class variable has the following categories:
   1. **buffer_overflow:** This label is often associated with attempts to exploit buffer overflow vulnerabilities in software. Buffer overflows can be used to execute arbitrary code or crash a program, and they are often seen as a security threat.
   2. **ipsweep:** This label indicates activities that involve scanning a range of IP addresses to gather information about potential targets. It is a common precursor to various types of network attacks.
   3. **normal:** This label represents network traffic that is considered normal and does not raise any security concerns. It is used as a baseline for comparison with potentially suspicious or malicious activities.
   4. **rootkit:** Rootkits are malicious tools designed to hide an attacker's presence and keep privileged access to a compromised machine. This label may be used when there are signs of rootkit activity.
   5. **sqlattack:** This label is associated with SQL injection attacks, where attackers attempt to manipulate or exploit vulnerabilities in web applications or databases by injecting SQL code.
   6. **worm:** Worms are self-replicating malware that can spread across networks without human intervention. This label may be used when there is evidence of worm-like behavior in network traffic.


The class of each connection is stored in the last column, which pandas names `normal.` when the file is loaded below.


In [50]:
import numpy as np 
import pandas as pd 

data = pd.read_csv("malware analysis.csv")
data 

Unnamed: 0,0,udp,private,SF,105,146,1,1.1,1.2,0.1,0.2,255,254,1.3,0.01,0.3,0.4,0.5,normal.
0,0,udp,private,SF,105,146,1,1,1.0,0.0,0.0,255,254,1.00,0.01,0.00,0.00,0.0,normal.
1,0,udp,private,SF,105,146,1,1,1.0,0.0,0.0,255,254,1.00,0.01,0.00,0.00,0.0,normal.
2,0,udp,domain_u,SF,29,0,2,1,0.5,1.0,0.0,10,3,0.30,0.30,0.30,0.00,0.0,normal.
3,0,udp,private,SF,105,146,1,1,1.0,0.0,0.0,255,253,0.99,0.01,0.00,0.00,0.0,normal.
4,0,tcp,http,SF,223,185,4,4,1.0,0.0,0.0,71,255,1.00,0.00,0.01,0.01,0.0,normal.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60932,0,udp,private,SF,105,147,2,2,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.
60933,0,udp,private,SF,105,147,4,4,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.
60934,0,udp,private,SF,105,147,2,2,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.
60935,0,udp,private,SF,105,147,4,4,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.


In [51]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60937 entries, 0 to 60936
Data columns (total 19 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   0        60937 non-null  int64  
 1   udp      60937 non-null  object 
 2   private  60937 non-null  object 
 3   SF       60937 non-null  object 
 4   105      60937 non-null  int64  
 5   146      60937 non-null  int64  
 6   1        60937 non-null  int64  
 7   1.1      60937 non-null  int64  
 8   1.2      60937 non-null  float64
 9   0.1      60937 non-null  float64
 10  0.2      60937 non-null  float64
 11  255      60937 non-null  int64  
 12  254      60937 non-null  int64  
 13  1.3      60937 non-null  float64
 14  0.01     60937 non-null  float64
 15  0.3      60937 non-null  float64
 16  0.4      60937 non-null  float64
 17  0.5      60937 non-null  float64
 18  normal.  60937 non-null  object 
dtypes: float64(8), int64(7), object(4)
memory usage: 8.8+ MB
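The column names reported above (`0`, `udp`, `private`, `SF`, `105`, ..., `1`, `1.1`, `1.2`) look like the values of a data record with repeated values renamed, which suggests the file has no header row and `read_csv` used the first record as the header. A minimal sketch of loading such a file without losing that record, assuming a headerless file; the names `f0` ... `f17` and `label` are hypothetical placeholders, and the two sample records are copied from the table shown earlier:

```python
import io
import pandas as pd

# Two records copied from the table above; the real file is not read here.
sample = (
    "0,udp,private,SF,105,146,1,1,1.0,0.0,0.0,255,254,1.00,0.01,0.00,0.00,0.0,normal.\n"
    "0,udp,domain_u,SF,29,0,2,1,0.5,1.0,0.0,10,3,0.30,0.30,0.30,0.00,0.0,normal.\n"
)

# Hypothetical placeholder names: 18 features plus the class label.
columns = [f"f{i}" for i in range(18)] + ["label"]

# Default behaviour: the first record is consumed as the header.
with_header_guess = pd.read_csv(io.StringIO(sample))

# header=None keeps every record as data; names= supplies the column names.
without_header = pd.read_csv(io.StringIO(sample), header=None, names=columns)

print(with_header_guess.shape)
print(without_header.shape)
```

The rest of this notebook keeps the original column names (`105`, `normal.`, and so on), so switching to this loading style would also require renaming the columns used in later cells.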


In [52]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
0.0,60937.0,10.747067,536.645424,0.0,0.0,0.0,0.0,54451.0
105.0,60937.0,748.052497,34170.266255,0.0,105.0,226.0,302.0,6291668.0
146.0,60937.0,3582.526577,36166.319605,0.0,146.0,570.0,2507.0,5203179.0
1.0,60937.0,14.194151,53.988606,0.0,1.0,4.0,11.0,511.0
1.1,60937.0,16.794493,54.906647,0.0,2.0,5.0,15.0,511.0
1.2,60937.0,0.994402,0.059121,0.0,1.0,1.0,1.0,1.0
0.1,60937.0,0.007781,0.080519,0.0,0.0,0.0,0.0,1.0
0.2,60937.0,0.111514,0.234407,0.0,0.0,0.0,0.12,1.0
255.0,60937.0,163.517961,102.75336,0.0,50.0,245.0,255.0,255.0
254.0,60937.0,231.014179,60.728053,0.0,253.0,255.0,255.0,255.0


In [53]:
num_inst, num_features = data.shape

for f in range(num_features):
    print (f, np.unique(data.iloc[:,f]))

0 [    0     1     2     3     4     5     6     7     8     9    10    11
    12    13    14    15    16    17    18    20    21    22    23    24
    25    26    27    28    29    30    31    32    33    34    35    36
    37    38    39    40    41    43    44    46    47    48    49    50
    51    53    54    56    57    58    59    60    61    62    63    64
    65    67    68    69    71    72    73    74    75    77    78    79
    81    82    84    87    88    89    90    91    92    93    94    95
    96    97    99   104   106   107   109   112   113   115   116   117
   118   120   121   126   127   128   131   133   136   137   138   139
   140   141   142   143   144   148   149   154   157   158   160   161
   162   163   164   170   171   172   181   182   183   185   186   189
   190   192   194   198   202   205   207   208   212   214   219   228
   235   238   239   245   249   253   254   261   265   292   293   297
   308   315   328   342   345   360   366   373 
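Printing every distinct value gives very long output for columns with many values. A more compact sketch of the same inspection, assuming the goal is only to see how many distinct values each column has and which categories the text columns contain; the small `demo` frame is a stand-in for `data`:

```python
import pandas as pd

# Small stand-in for `data`, using three of its column names.
demo = pd.DataFrame({
    "udp": ["udp", "udp", "tcp"],
    "105": [105, 29, 223],
    "normal.": ["normal.", "normal.", "normal."],
})

# Number of distinct values per column.
summary = demo.nunique()
print(summary)

# Category frequencies for the non-numeric columns only.
for col in demo.select_dtypes(exclude="number").columns:
    print(demo[col].value_counts())
```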

# List of Features with Explanations

1. **0** : this feature contains integers from 0 to 54451.
2. **udp** : the transport protocol of the connection, one of three values:
    - **tcp** : Transmission Control Protocol
    - **udp** : User Datagram Protocol
    - **icmp** : Internet Control Message Protocol 
3. **private** : the network service involved in the connection. The column has 23 distinct values, and each one names a service or protocol. In several of them a suffix appears to mark the transport: `_u` for UDP and `_i` for ICMP. Here's a brief explanation of each label:

    1. **IRC:** Internet Relay Chat, a protocol used for real-time text messaging or chat over the internet.

    2. **X11:** The X Window System, a network protocol used for graphical user interfaces in Unix-like operating systems.

    3. **auth:** Authentication service, often related to the process of user authentication.

    4. **domain_u:** Domain Name System (DNS) queries over UDP.

    5. **eco_i:** ICMP echo request (ping).

    6. **ecr_i:** ICMP echo reply.

    7. **finger:** The Finger protocol, used for querying information about users on a network.

    8. **ftp:** File Transfer Protocol, used for transferring files between computers over a network.

    9. **ftp_data:** FTP data transfer service, used for the actual data transfer in FTP.

    10. **http:** Hypertext Transfer Protocol, the protocol used for web browsing.

    11. **icmp:** Internet Control Message Protocol, used for network diagnostics and error reporting.

    12. **link:** Network link service.

    13. **ntp_u:** Network Time Protocol (NTP) over UDP.

    14. **other:** A catch-all category for other or unclassified services.

    15. **pop_3:** Post Office Protocol version 3, used for retrieving email from a server.

    16. **private:** A label often used for private or non-standard services.

    17. **remote_job:** Remote job entry service.

    18. **smtp:** Simple Mail Transfer Protocol, used for sending email.

    19. **telnet:** Telnet, a network protocol for remote terminal access.

    20. **tftp_u:** Trivial File Transfer Protocol (TFTP) over UDP.

    21. **tim_i:** Most likely ICMP timestamp messages.

    22. **time:** Time service.

    23. **urp_i:** An ICMP-based service; the dataset does not document what the abbreviation stands for.

    These labels are commonly used in network traffic analysis, intrusion detection systems, and log analysis to classify and understand the nature of network traffic and the services being accessed or used. They help network administrators and security professionals identify and respond to potential security threats and anomalies in network activity.

4. **SF** : the status flag of the connection. The values **"REJ," "RSTO," "RSTOS0," "RSTR," "S1," "S2," "S3," and "SF"** are not results of a port-scanning tool such as nmap. They match the connection-state codes used by the Bro (now Zeek) network monitor, which summarize how a TCP connection was set up and torn down: 
    - **REJ** : The connection attempt was rejected.
    - **RSTO** : The connection was established and the originator aborted it by sending a reset (RST).
    - **RSTOS0** : The originator sent a SYN followed by an RST; no SYN-ACK from the responder was seen.
    - **RSTR** : The responder sent an RST.
    - **S1, S2, S3** : These codes describe connections that were established but did not close cleanly:
        - **S1** : Connection established, not terminated.
        - **S2**: Connection established, and a close attempt by the originator was seen with no reply from the responder.
        - **S3**: Connection established, and a close attempt by the responder was seen with no reply from the originator.
    - **SF**: Normal establishment and termination. This is the value in almost every row shown in the tables of this notebook.
 
    Each code summarizes a whole connection rather than a single packet, so the column describes how a connection ended, not which ports are open on a host.
   
5. **105,146,1,1,1.0,0.0,0.00,255,254,1.00,0.01,0.00,0.00,0.00** : the remaining 14 columns are numerical attributes of the connection. The file has no legend, so their exact meanings are not stated, but they are not IP addresses. Their value ranges give some hints:
    - **105, 146** : non-negative integers with maxima in the millions (6291668 and 5203179), plausibly byte counts.
    - **1, 1.1** : integers from 0 to 511, and **255, 254** : integers from 0 to 255, plausibly counts of related connections.
    - **1.2, 0.1, 0.2, 1.3, 0.01, 0.3, 0.4, 0.5** : the values shown lie between 0 and 1, plausibly rates.

    This layout resembles the KDD Cup 1999 connection records, but that correspondence is an assumption and is not stated by the file.

6. **normal.** : the class label. It takes six values, **"buffer_overflow," "ipsweep," "normal," "rootkit," "sqlattack," and "worm"** (each stored with a trailing period, for example `normal.`), which are described in the Introduction. The labels classify each connection as normal traffic or as a specific type of attack, which is the kind of decision made by Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS).



Dataset source: https://www.kaggle.com/datasets/paytonjabir/comprehensive-malware-datasets/data

# Clean the Data 
After a first analysis, I decided to use this dataset because it has enough instances to train a predictive model.
The next step is to clean the dataset and prepare it for the first part of the study.

In [54]:
from sklearn.model_selection import train_test_split

# drop label columns
X = data.drop(columns=['normal.'])

# isolate y
y = data['normal.']

# split in Train-set(80%) and Testing-set(20%)
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.20) 

In [55]:
X_train.shape 

(48749, 18)

In [56]:
X.shape, X_train.shape, X_test.shape

((60937, 18), (48749, 18), (12188, 18))
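The split above is random, so the row indices shown in later outputs change on every run, and the rare attack classes may land almost entirely in one of the two parts. A minimal sketch of a reproducible, stratified split, assuming every class has at least two instances (a requirement of `stratify`); the synthetic `X_demo` and `y_demo` and the seed `42` are illustrative choices, not part of the original notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 rows, 20 of them (2%) in a rare class.
X_demo = pd.DataFrame({"a": rng.normal(size=1000)})
y_demo = pd.Series(np.where(np.arange(1000) < 20, "ipsweep.", "normal."))

# stratify keeps the class proportions equal in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42
)

print(y_tr.value_counts())
print(y_te.value_counts())
```

With stratification the rare class keeps its 2% share in both the training and the test part, which makes later per-class evaluation more meaningful.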

In [57]:
y_test

30765    normal.
25772    normal.
2519     normal.
28743    normal.
397      normal.
          ...   
17766    normal.
18878    normal.
24097    normal.
9067     normal.
1687     normal.
Name: normal., Length: 12188, dtype: object

In [58]:
data

Unnamed: 0,0,udp,private,SF,105,146,1,1.1,1.2,0.1,0.2,255,254,1.3,0.01,0.3,0.4,0.5,normal.
0,0,udp,private,SF,105,146,1,1,1.0,0.0,0.0,255,254,1.00,0.01,0.00,0.00,0.0,normal.
1,0,udp,private,SF,105,146,1,1,1.0,0.0,0.0,255,254,1.00,0.01,0.00,0.00,0.0,normal.
2,0,udp,domain_u,SF,29,0,2,1,0.5,1.0,0.0,10,3,0.30,0.30,0.30,0.00,0.0,normal.
3,0,udp,private,SF,105,146,1,1,1.0,0.0,0.0,255,253,0.99,0.01,0.00,0.00,0.0,normal.
4,0,tcp,http,SF,223,185,4,4,1.0,0.0,0.0,71,255,1.00,0.00,0.01,0.01,0.0,normal.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60932,0,udp,private,SF,105,147,2,2,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.
60933,0,udp,private,SF,105,147,4,4,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.
60934,0,udp,private,SF,105,147,2,2,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.
60935,0,udp,private,SF,105,147,4,4,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.


In this way the dataset is separated into the features `X` and the label `y` (the column `normal.`), and then into a training set and a test set.
The original `data` frame is not modified, because `drop` and `train_test_split` return new objects.

## Process X 

In [59]:
X_train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
0.0,48749.0,8.466369,428.609137,0.0,0.0,0.0,0.0,54451.0
105.0,48749.0,736.368726,33834.64666,0.0,105.0,226.0,302.0,6291668.0
146.0,48749.0,3377.49601,25915.639775,0.0,146.0,567.0,2507.0,2881112.0
1.0,48749.0,14.136618,53.670248,0.0,1.0,4.0,11.0,511.0
1.1,48749.0,16.711481,54.508614,0.0,2.0,5.0,15.0,511.0
1.2,48749.0,0.994418,0.059174,0.0,1.0,1.0,1.0,1.0
0.1,48749.0,0.007732,0.080335,0.0,0.0,0.0,0.0,1.0
0.2,48749.0,0.111439,0.234274,0.0,0.0,0.0,0.12,1.0
255.0,48749.0,163.182096,102.791933,0.0,49.0,243.0,255.0,255.0
254.0,48749.0,230.878685,60.861156,0.0,253.0,255.0,255.0,255.0


In [60]:
X_train.head()

Unnamed: 0,0,udp,private,SF,105,146,1,1.1,1.2,0.1,0.2,255,254,1.3,0.01,0.3,0.4,0.5
47691,0,tcp,http,SF,230,1295,4,4,1.0,0.0,0.0,178,255,1.0,0.0,0.01,0.02,0.0
33686,0,udp,private,SF,28,0,91,91,1.0,0.0,0.0,116,91,0.78,0.03,0.78,0.0,0.0
44616,0,tcp,http,SF,173,21704,1,1,1.0,0.0,0.0,255,255,1.0,0.0,0.0,0.0,0.0
58903,0,tcp,http,SF,213,1274,7,17,1.0,0.0,0.29,255,255,1.0,0.0,0.0,0.0,0.0
4651,0,tcp,http,SF,223,197,19,26,1.0,0.0,0.19,255,255,1.0,0.0,0.0,0.0,0.0


In [61]:
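# Heuristic: treat a column as numerical when it has more than 10 distinct values.
# It also flags the text column 'private' (23 services), so it is not used below.
is_numerical  = np.array( [ len(np.unique(X_train[col]))>10 for col in X_train] )

# numerical_idx = np.flatnonzero(is_numerical) 

# Hand-picked numerical columns; note that column '0' is not in this list.
numerical_idx = ['105','146','1','1.1','1.2','0.1','0.2','255','254','1.3','0.01','0.3','0.4','0.5']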


In [62]:
print (is_numerical)
print (numerical_idx)
print ("Number of numerical features:", sum(is_numerical))

[ True False  True False  True  True  True  True  True  True  True  True
  True  True  True  True  True  True]
['105', '146', '1', '1.1', '1.2', '0.1', '0.2', '255', '254', '1.3', '0.01', '0.3', '0.4', '0.5']
Number of numerical features: 16
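The heuristic reports 16 numerical features while the hand-written list has 14: the two extra columns are `0` and the text column `private`, which passes the test only because it has 23 distinct services. A sketch of an alternative that selects columns by dtype instead of by number of distinct values; the small `demo` frame is a stand-in for `X_train`:

```python
import pandas as pd

# Small stand-in for X_train, using a few of its column names.
demo = pd.DataFrame({
    "0": [0, 0, 0],
    "udp": ["tcp", "udp", "tcp"],
    "private": ["http", "private", "http"],
    "SF": ["SF", "SF", "S2"],
    "105": [230, 28, 173],
    "1.2": [1.0, 1.0, 0.5],
})

# Columns with a numeric dtype (int or float).
numeric_cols = demo.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical.
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

This selection would also include column `0`, so whether to keep that column remains a separate decision.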


In [63]:
# convert numerical to floats (keep NaN)
new_X = X_train[ numerical_idx ].apply(pd.to_numeric, errors='coerce')
#  invalid parsing will be set as NaN.

In [64]:
new_X.head()

Unnamed: 0,105,146,1,1.1,1.2,0.1,0.2,255,254,1.3,0.01,0.3,0.4,0.5
47691,230,1295,4,4,1.0,0.0,0.0,178,255,1.0,0.0,0.01,0.02,0.0
33686,28,0,91,91,1.0,0.0,0.0,116,91,0.78,0.03,0.78,0.0,0.0
44616,173,21704,1,1,1.0,0.0,0.0,255,255,1.0,0.0,0.0,0.0,0.0
58903,213,1274,7,17,1.0,0.0,0.29,255,255,1.0,0.0,0.0,0.0,0.0
4651,223,197,19,26,1.0,0.0,0.19,255,255,1.0,0.0,0.0,0.0,0.0


In [65]:
new_X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 48749 entries, 47691 to 40704
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   105     48749 non-null  int64  
 1   146     48749 non-null  int64  
 2   1       48749 non-null  int64  
 3   1.1     48749 non-null  int64  
 4   1.2     48749 non-null  float64
 5   0.1     48749 non-null  float64
 6   0.2     48749 non-null  float64
 7   255     48749 non-null  int64  
 8   254     48749 non-null  int64  
 9   1.3     48749 non-null  float64
 10  0.01    48749 non-null  float64
 11  0.3     48749 non-null  float64
 12  0.4     48749 non-null  float64
 13  0.5     48749 non-null  float64
dtypes: float64(8), int64(6)
memory usage: 5.6 MB


In [66]:
X_train.loc[411,numerical_idx]

105      307
146     1984
1         15
1.1       39
1.2      1.0
0.1      0.0
0.2      0.1
255      255
254      255
1.3      1.0
0.01     0.0
0.3     0.01
0.4      0.0
0.5      0.0
Name: 411, dtype: object

In [67]:
new_X.isna().astype(int).head(10)

Unnamed: 0,105,146,1,1.1,1.2,0.1,0.2,255,254,1.3,0.01,0.3,0.4,0.5
47691,0,0,0,0,0,0,0,0,0,0,0,0,0,0
33686,0,0,0,0,0,0,0,0,0,0,0,0,0,0
44616,0,0,0,0,0,0,0,0,0,0,0,0,0,0
58903,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4651,0,0,0,0,0,0,0,0,0,0,0,0,0,0
23861,0,0,0,0,0,0,0,0,0,0,0,0,0,0
60533,0,0,0,0,0,0,0,0,0,0,0,0,0,0
12785,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1464,0,0,0,0,0,0,0,0,0,0,0,0,0,0
60271,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [68]:
new_X.head(10)

Unnamed: 0,105,146,1,1.1,1.2,0.1,0.2,255,254,1.3,0.01,0.3,0.4,0.5
47691,230,1295,4,4,1.0,0.0,0.0,178,255,1.0,0.0,0.01,0.02,0.0
33686,28,0,91,91,1.0,0.0,0.0,116,91,0.78,0.03,0.78,0.0,0.0
44616,173,21704,1,1,1.0,0.0,0.0,255,255,1.0,0.0,0.0,0.0,0.0
58903,213,1274,7,17,1.0,0.0,0.29,255,255,1.0,0.0,0.0,0.0,0.0
4651,223,197,19,26,1.0,0.0,0.19,255,255,1.0,0.0,0.0,0.0,0.0
23861,1256,182830,1,1,1.0,0.0,0.0,29,3,0.1,0.1,0.03,0.0,0.0
60533,105,147,1,1,1.0,0.0,0.0,255,253,0.99,0.01,0.0,0.0,0.0
12785,192,1440,9,13,1.0,0.0,0.15,9,255,1.0,0.0,0.11,0.02,0.0
1464,194,371,6,26,1.0,0.0,0.12,7,255,1.0,0.0,0.14,0.02,0.0
60271,105,147,2,2,1.0,0.0,0.0,255,254,1.0,0.01,0.0,0.0,0.0


In [69]:
new_X.shape

(48749, 14)

In [70]:
# Fill NA/NaN values with 0.0
new_X=new_X.fillna(0.0)

In [71]:
new_X.head()

Unnamed: 0,105,146,1,1.1,1.2,0.1,0.2,255,254,1.3,0.01,0.3,0.4,0.5
47691,230,1295,4,4,1.0,0.0,0.0,178,255,1.0,0.0,0.01,0.02,0.0
33686,28,0,91,91,1.0,0.0,0.0,116,91,0.78,0.03,0.78,0.0,0.0
44616,173,21704,1,1,1.0,0.0,0.0,255,255,1.0,0.0,0.0,0.0,0.0
58903,213,1274,7,17,1.0,0.0,0.29,255,255,1.0,0.0,0.0,0.0,0.0
4651,223,197,19,26,1.0,0.0,0.19,255,255,1.0,0.0,0.0,0.0,0.0


# Categorical Data
1. **Ordinal Encoding** : for the second feature (`udp`, the protocol)
2. **One-Hot Encoding** : for the third and fourth features (`private`, the service, and `SF`, the status flag)

In [87]:

# categorical_idx = np.flatnonzero(is_numerical==False)
categorical_idx = ['private','SF']
categorical_idx_ord = ['udp']

In [89]:
categorical_idx, categorical_idx_ord

(['private', 'SF'], ['udp'])

In [90]:
len(categorical_idx), len(categorical_idx_ord)

(2, 1)

In [102]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

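oh = OneHotEncoder(sparse_output=False)
oh.fit(X_train[categorical_idx])

# Keep the fitted encoder in its own variable so it can be reused on X_test;
# enc holds the encoded training column.
ord_enc = OrdinalEncoder()
enc = ord_enc.fit_transform(X_train[categorical_idx_ord])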


In [101]:
oh.categories_, enc

([array(['IRC', 'X11', 'auth', 'domain_u', 'eco_i', 'ecr_i', 'finger',
         'ftp', 'ftp_data', 'http', 'icmp', 'link', 'ntp_u', 'other',
         'pop_3', 'private', 'remote_job', 'smtp', 'telnet', 'tftp_u',
         'tim_i', 'time', 'urp_i'], dtype=object),
  array(['REJ', 'RSTO', 'RSTOS0', 'RSTR', 'S1', 'S2', 'S3', 'SF'],
        dtype=object)],
 array([[1.],
        [2.],
        [1.],
        ...,
        [1.],
        [1.],
        [1.]]))

In [100]:
encoded = oh.transform(X_train[categorical_idx])
encoded.shape

(48749, 31)

In [78]:
encoded

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [79]:
oh.get_feature_names_out()

array(['private_IRC', 'private_X11', 'private_auth', 'private_domain_u',
       'private_eco_i', 'private_ecr_i', 'private_finger', 'private_ftp',
       'private_ftp_data', 'private_http', 'private_icmp', 'private_link',
       'private_ntp_u', 'private_other', 'private_pop_3',
       'private_private', 'private_remote_job', 'private_smtp',
       'private_telnet', 'private_tftp_u', 'private_tim_i',
       'private_time', 'private_urp_i', 'SF_REJ', 'SF_RSTO', 'SF_RSTOS0',
       'SF_RSTR', 'SF_S1', 'SF_S2', 'SF_S3', 'SF_SF'], dtype=object)

In [95]:
for i,col in enumerate(oh.get_feature_names_out()):
    new_X[col] = encoded[:,i]
new_X[categorical_idx_ord]=enc

In [96]:
new_X

Unnamed: 0,105,146,1,1.1,1.2,0.1,0.2,255,254,1.3,...,private_urp_i,SF_REJ,SF_RSTO,SF_RSTOS0,SF_RSTR,SF_S1,SF_S2,SF_S3,SF_SF,udp
47691,230,1295,4,4,1.0,0.0,0.00,178,255,1.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
33686,28,0,91,91,1.0,0.0,0.00,116,91,0.78,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0
44616,173,21704,1,1,1.0,0.0,0.00,255,255,1.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
58903,213,1274,7,17,1.0,0.0,0.29,255,255,1.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
4651,223,197,19,26,1.0,0.0,0.19,255,255,1.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16708,302,392,1,2,1.0,0.0,1.00,255,255,1.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
6010,231,1819,5,5,1.0,0.0,0.00,46,255,1.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
3243,190,261,3,5,1.0,0.0,0.40,4,255,1.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
47,240,2172,59,59,1.0,0.0,0.00,255,255,1.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0


In [110]:
# Check that new_X was updated correctly
num_inst, num_features = new_X.shape

for f in range(num_features):
    print (f, np.unique(new_X.iloc[:,f]))

0 [      0       1       2 ... 2194619 3131464 6291668]
1 [      0       1       4 ... 1868080 2099247 2881112]
2 [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 155 156 157 158 159 160 161 162
 163 164 165 166 168 169 170 171 172 173 175 176 177 178 179 181 182 183
 185 186 187 188 189 191 192 193 195 196 197 198 199 200 201 202 203 204
 205 206 208 209 210 211 212 213 215 216 217 218 219 221 222 223 225 226
 227 228 2

In [111]:
X_train.head(100)

Unnamed: 0,0,udp,private,SF,105,146,1,1.1,1.2,0.1,0.2,255,254,1.3,0.01,0.3,0.4,0.5
47691,0,tcp,http,SF,230,1295,4,4,1.0,0.0,0.00,178,255,1.00,0.00,0.01,0.02,0.0
33686,0,udp,private,SF,28,0,91,91,1.0,0.0,0.00,116,91,0.78,0.03,0.78,0.00,0.0
44616,0,tcp,http,SF,173,21704,1,1,1.0,0.0,0.00,255,255,1.00,0.00,0.00,0.00,0.0
58903,0,tcp,http,SF,213,1274,7,17,1.0,0.0,0.29,255,255,1.00,0.00,0.00,0.00,0.0
4651,0,tcp,http,SF,223,197,19,26,1.0,0.0,0.19,255,255,1.00,0.00,0.00,0.00,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30219,0,tcp,http,SF,206,750,10,14,1.0,0.0,0.21,11,255,1.00,0.00,0.09,0.06,0.0
33382,0,tcp,http,SF,220,864,40,40,1.0,0.0,0.00,137,255,1.00,0.00,0.01,0.03,0.0
33980,0,tcp,http,SF,298,8968,8,15,1.0,0.0,0.13,25,255,1.00,0.00,0.04,0.03,0.0
10289,0,tcp,http,S2,261,529,1,1,1.0,0.0,0.00,1,245,1.00,0.00,1.00,0.04,0.0


## Process y 

In [112]:
y_train    

47691    normal.
33686    normal.
44616    normal.
58903    normal.
4651     normal.
          ...   
16708    normal.
6010     normal.
3243     normal.
47       normal.
40704    normal.
Name: normal., Length: 48749, dtype: object

In [113]:
y_train.value_counts()

normal.
normal.             48471
ipsweep.              246
buffer_overflow.       16
rootkit.               12
worm.                   2
sqlattack.              2
Name: count, dtype: int64

In [115]:
baseline_accuracy = y_train.value_counts().max()/y_train.value_counts().sum()
print (f"Majority class accuracy: {baseline_accuracy:.3f}")

Majority class accuracy: 0.994
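A baseline of 0.994 means that a model which always predicts `normal.` is already correct for 48471 of 48749 training connections while detecting no attack at all. A short sketch that reproduces this number from the class counts shown above and compares it with balanced accuracy (the mean of the per-class recalls); the use of balanced accuracy is a suggestion, not something the notebook does:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Class counts taken from y_train.value_counts() above.
counts = {
    "normal.": 48471,
    "ipsweep.": 246,
    "buffer_overflow.": 16,
    "rootkit.": 12,
    "worm.": 2,
    "sqlattack.": 2,
}

# Rebuild a label vector with these counts and a constant majority prediction.
y_true = np.repeat(list(counts.keys()), list(counts.values()))
y_pred = np.full(y_true.shape, "normal.")

acc = accuracy_score(y_true, y_pred)           # 48471 / 48749
bal = balanced_accuracy_score(y_true, y_pred)  # recall is 1 for one class, 0 for five

print(f"Accuracy: {acc:.3f}")
print(f"Balanced accuracy: {bal:.3f}")
```

Plain accuracy hides the fact that every attack is missed, so a per-class metric is a more informative target for models trained on this data.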
