# Introduction
This dataset, which contains a wide range of intrusions simulated in a research environment, was provided for auditing.
The environment imitated a typical e-commerce function on the internet in order to capture raw TCP/IP dump data for a network. The network was operated as if it were a real setting, and various attacks were launched against it.

A connection is a sequence of TCP packets that starts and ends at well-defined times, during which data flows from a source IP address to a target IP address under a well-defined protocol.
In addition, each connection is labeled as either normal or an attack. Each connection record is around 100 bytes long.

For each TCP/IP connection, 18 features are recorded for both normal and attack traffic (3 qualitative and 15 quantitative), plus a class label, which gives the 19 columns reported by `data.info()`. The class variable has the following categories:
   1. **buffer_overflow:** This label is often associated with attempts to exploit buffer overflow vulnerabilities in software. Buffer overflows can be used to execute arbitrary code or crash a program, and they are often seen as a security threat.
   2. **ipsweep:** This label indicates activities that involve scanning a range of IP addresses to gather information about potential targets. It is a common precursor to various types of network attacks.
   3. **normal:** This label represents network traffic that is considered normal and does not raise any security concerns. It is used as a baseline for comparison with potentially suspicious or malicious activities.
   4. **rootkit:** Rootkits are malicious tools designed to hide an attacker's presence and maintain privileged access on a compromised system. This label may be used when there are signs of rootkit activity.
   5. **sqlattack:** This label is associated with SQL injection attacks, where attackers attempt to manipulate or exploit vulnerabilities in web applications or databases by injecting SQL code.
   6. **worm:** Worms are self-replicating malware that can spread across networks without human intervention. This label may be used when there is evidence of worm-like behavior in network traffic.


The class variable is stored in the last column of the dataset, named `normal.`.


In [63]:
import numpy as np 
import pandas as pd 

data = pd.read_csv("malware analysis.csv")
data 

Unnamed: 0,0,udp,private,SF,105,146,1,1.1,1.2,0.1,0.2,255,254,1.3,0.01,0.3,0.4,0.5,normal.
0,0,udp,private,SF,105,146,1,1,1.0,0.0,0.0,255,254,1.00,0.01,0.00,0.00,0.0,normal.
1,0,udp,private,SF,105,146,1,1,1.0,0.0,0.0,255,254,1.00,0.01,0.00,0.00,0.0,normal.
2,0,udp,domain_u,SF,29,0,2,1,0.5,1.0,0.0,10,3,0.30,0.30,0.30,0.00,0.0,normal.
3,0,udp,private,SF,105,146,1,1,1.0,0.0,0.0,255,253,0.99,0.01,0.00,0.00,0.0,normal.
4,0,tcp,http,SF,223,185,4,4,1.0,0.0,0.0,71,255,1.00,0.00,0.01,0.01,0.0,normal.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60932,0,udp,private,SF,105,147,2,2,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.
60933,0,udp,private,SF,105,147,4,4,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.
60934,0,udp,private,SF,105,147,2,2,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.
60935,0,udp,private,SF,105,147,4,4,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.
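
The column labels in the frame above (`0`, `udp`, `private`, `SF`, `105`, ...) look like a data record rather than names, which suggests that the CSV has no header row and that `read_csv` consumed the first record as the header. The sketch below illustrates the effect on two made-up rows; the names in `placeholder_names` are hypothetical, since the file itself provides no legend.

```python
import io
import pandas as pd

# Two hypothetical header-less rows, shortened to five columns for brevity.
raw = "0,udp,private,SF,normal.\n0,tcp,http,SF,normal.\n"

# Default behaviour: the first record is consumed as the header row.
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every record as data; names= supplies placeholder labels.
placeholder_names = ["f0", "f1", "f2", "f3", "label"]  # hypothetical names
with_names = pd.read_csv(io.StringIO(raw), header=None, names=placeholder_names)

print(with_default.shape)   # one row fewer than the file contains
print(with_names.shape)     # every record kept as data
```

If the real file is loaded this way, the frame gains one row and the column names used in later cells (for example `'normal.'`) would have to be replaced by the chosen names.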


In [64]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60937 entries, 0 to 60936
Data columns (total 19 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   0        60937 non-null  int64  
 1   udp      60937 non-null  object 
 2   private  60937 non-null  object 
 3   SF       60937 non-null  object 
 4   105      60937 non-null  int64  
 5   146      60937 non-null  int64  
 6   1        60937 non-null  int64  
 7   1.1      60937 non-null  int64  
 8   1.2      60937 non-null  float64
 9   0.1      60937 non-null  float64
 10  0.2      60937 non-null  float64
 11  255      60937 non-null  int64  
 12  254      60937 non-null  int64  
 13  1.3      60937 non-null  float64
 14  0.01     60937 non-null  float64
 15  0.3      60937 non-null  float64
 16  0.4      60937 non-null  float64
 17  0.5      60937 non-null  float64
 18  normal.  60937 non-null  object 
dtypes: float64(8), int64(7), object(4)
memory usage: 8.8+ MB
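
The split between qualitative and quantitative features can be read off the dtypes instead of being counted by hand. This is a minimal sketch on a toy frame; the column names and values are hypothetical and only mimic the mix of dtypes listed above.

```python
import pandas as pd

# Toy frame with the same mix of dtypes as the output above (hypothetical values).
toy = pd.DataFrame({
    "duration_like": [0, 0, 3],
    "protocol_like": ["udp", "tcp", "udp"],
    "rate_like": [1.0, 0.5, 0.0],
    "label": ["normal.", "normal.", "ipsweep."],
})

# Leave the label out so that only features are counted.
features = toy.drop(columns=["label"])

# Numeric dtypes (int, float) are quantitative; everything else is qualitative.
quantitative = features.select_dtypes(include="number").columns.tolist()
qualitative = features.select_dtypes(exclude="number").columns.tolist()

print(len(qualitative), "qualitative:", qualitative)
print(len(quantitative), "quantitative:", quantitative)
```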


In [65]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
0.0,60937.0,10.747067,536.645424,0.0,0.0,0.0,0.0,54451.0
105.0,60937.0,748.052497,34170.266255,0.0,105.0,226.0,302.0,6291668.0
146.0,60937.0,3582.526577,36166.319605,0.0,146.0,570.0,2507.0,5203179.0
1.0,60937.0,14.194151,53.988606,0.0,1.0,4.0,11.0,511.0
1.1,60937.0,16.794493,54.906647,0.0,2.0,5.0,15.0,511.0
1.2,60937.0,0.994402,0.059121,0.0,1.0,1.0,1.0,1.0
0.1,60937.0,0.007781,0.080519,0.0,0.0,0.0,0.0,1.0
0.2,60937.0,0.111514,0.234407,0.0,0.0,0.0,0.12,1.0
255.0,60937.0,163.517961,102.75336,0.0,50.0,245.0,255.0,255.0
254.0,60937.0,231.014179,60.728053,0.0,253.0,255.0,255.0,255.0


In [66]:
num_inst, num_features = data.shape

for f in range(num_features):
    print (f, np.unique(data.iloc[:,f]))

0 [    0     1     2     3     4     5     6     7     8     9    10    11
    12    13    14    15    16    17    18    20    21    22    23    24
    25    26    27    28    29    30    31    32    33    34    35    36
    37    38    39    40    41    43    44    46    47    48    49    50
    51    53    54    56    57    58    59    60    61    62    63    64
    65    67    68    69    71    72    73    74    75    77    78    79
    81    82    84    87    88    89    90    91    92    93    94    95
    96    97    99   104   106   107   109   112   113   115   116   117
   118   120   121   126   127   128   131   133   136   137   138   139
   140   141   142   143   144   148   149   154   157   158   160   161
   162   163   164   170   171   172   181   182   183   185   186   189
   190   192   194   198   202   205   207   208   212   214   219   228
   235   238   239   245   249   253   254   261   265   292   293   297
   308   315   328   342   345   360   366   373 
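
Printing every unique value produces very long output for high-cardinality columns. A more compact alternative, sketched here on a toy frame with hypothetical column names, is to report the number of distinct values per column and to list frequencies only for columns with few distinct values.

```python
import pandas as pd

# Hypothetical miniature of the dataset.
toy = pd.DataFrame({
    "protocol": ["udp", "tcp", "udp", "icmp"],
    "src_bytes": [105, 223, 105, 29],
})

# Number of distinct values per column: one line per feature.
distinct = toy.nunique()
print(distinct)

# For a low-cardinality column, value frequencies are more informative.
protocol_counts = toy["protocol"].value_counts()
print(protocol_counts)
```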

# List of Features with Explanations

1. **0** : An integer feature with values ranging from 0 to 54451.
2. **udp** : The transport protocol, which takes one of three values:
    - **tcp** : Transmission Control Protocol
    - **udp** : User Datagram Protocol
    - **icmp** : Internet Control Message Protocol
3. **private** : The network service associated with the connection. Each label names a service or protocol, and such labels are commonly used to classify traffic in network monitoring and security contexts. Here is a brief explanation of each label:

    1. **IRC:** Internet Relay Chat, a protocol used for real-time text messaging or chat over the internet.

    2. **X11:** The X Window System, a network protocol used for graphical user interfaces in Unix-like operating systems.

    3. **auth:** Authentication service, often related to the process of user authentication.

    4. **domain_u:** Domain Name System (DNS) queries carried over UDP; the `_u` suffix marks a UDP service.

    5. **eco_i:** ICMP echo request (the `_i` suffix marks an ICMP message type).

    6. **ecr_i:** ICMP echo reply.

    7. **finger:** The Finger protocol, used for querying information about users on a network.

    8. **ftp:** File Transfer Protocol, used for transferring files between computers over a network.

    9. **ftp_data:** FTP data transfer service, used for the actual data transfer in FTP.

    10. **http:** Hypertext Transfer Protocol, the protocol used for web browsing.

    11. **icmp:** Internet Control Message Protocol, used for network diagnostics and error reporting.

    12. **link:** Network link service.

    13. **ntp_u:** Network Time Protocol (NTP) over UDP.

    14. **other:** A catch-all category for other or unclassified services.

    15. **pop_3:** Post Office Protocol version 3, used for retrieving email from a server.

    16. **private:** A label often used for private or non-standard services.

    17. **remote_job:** Remote job entry service.

    18. **smtp:** Simple Mail Transfer Protocol, used for sending email.

    19. **telnet:** Telnet, a network protocol for remote terminal access.

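    20. **tftp_u:** Trivial File Transfer Protocol (TFTP) over UDP.

    21. **tim_i:** ICMP timestamp message.

    22. **time:** Time service.

    23. **urp_i:** An ICMP message type, as the `_i` suffix indicates; it is usually read as an ICMP "unreachable" message, although the file gives no legend to confirm this.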

    These labels are commonly used in network traffic analysis, intrusion detection systems, and log analysis to classify and understand the nature of network traffic and the services being accessed or used. They help network administrators and security professionals identify and respond to potential security threats and anomalies in network activity.
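
One way to check which protocol each service label travels over is to cross-tabulate the two columns. The sketch below uses a toy frame; the rows and the column names `protocol` and `service` are hypothetical stand-ins for the second and third columns of the dataset.

```python
import pandas as pd

# Hypothetical rows pairing a protocol with a service label.
toy = pd.DataFrame({
    "protocol": ["udp", "udp", "tcp", "icmp", "icmp"],
    "service": ["domain_u", "private", "http", "ecr_i", "eco_i"],
})

# Rows are services, columns are protocols, cells are connection counts.
table = pd.crosstab(toy["service"], toy["protocol"])
print(table)
```

On the real data, a service that appears under a single protocol column supports the reading of its suffix.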

4. **SF** : The connection status flag. The values **"REJ," "RSTO," "RSTOS0," "RSTR," "S1," "S2," "S3," and "SF"** summarize how each TCP connection was set up and torn down. They are not nmap scan results; they match the connection-state codes of the Bro (now Zeek) network monitor, which the KDD Cup 1999 data also uses:
    - **REJ** : The connection attempt was rejected.
    - **RSTO** : The connection was established and then aborted by the originator with a reset (RST).
    - **RSTOS0** : The originator sent a SYN followed by an RST; no SYN-ACK from the responder was seen.
    - **RSTR** : The responder aborted the connection with a reset (RST).
    - **S1, S2, S3** : The connection was established but not closed cleanly:
        - **S1** : Established, not terminated.
        - **S2**: Established; a close attempt by the originator was seen, with no reply from the responder.
        - **S3**: Established; a close attempt by the responder was seen, with no reply from the originator.
    - **SF**: Normal establishment and termination of the connection.

    All the rows displayed above carry the flag SF. The remaining codes describe connections that were rejected, reset, or left half-open, which makes this feature a useful signal when separating normal traffic from attacks.
   
5. **105,146,1,1,1.0,0.0,0.00,255,254,1.00,0.01,0.00,0.00,0.00** : The remaining 14 numeric columns. The file provides no legend, so their exact meaning cannot be confirmed from the data alone. They are unlikely to be IP addresses, because the first two columns reach values in the millions and several columns hold fractions between 0 and 1. Based on the ranges reported by `describe()`, a more plausible reading is:
    - **105, 146** : byte counts or similar sizes (maxima of 6291668 and 5203179).
    - **1, 1.1** and **255, 254** : counters (maxima of 511 and 255 respectively).
    - **the other columns** : rates or proportions between 0 and 1.

6. **normal.** : The class label of each connection. It takes the six values described in the Introduction (**buffer_overflow**, **ipsweep**, **normal**, **rootkit**, **sqlattack**, and **worm**); the value shown in the outputs above carries a trailing period (`normal.`). The label is not a TCP/IP field. It categorizes the kind of activity a connection represents, which is what tools such as Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS) try to identify.
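
The label shown in the outputs above ends with a period (`normal.`). If cleaner class names or a binary normal-versus-attack target are wanted, the column can be normalized as sketched below. The series `y_toy` is hypothetical, and the assumption that the attack labels also carry a trailing period has not been checked against the file.

```python
import pandas as pd

# Hypothetical label column in the same style as the dataset.
y_toy = pd.Series(["normal.", "ipsweep.", "normal.", "worm."], name="normal.")

# Remove the trailing period from every label.
clean = y_toy.str.rstrip(".")

# Binary target: 0 for normal traffic, 1 for any kind of attack.
is_attack = (clean != "normal").astype(int)

print(clean.tolist())
print(is_attack.tolist())
```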



Dataset source: https://www.kaggle.com/datasets/paytonjabir/comprehensive-malware-datasets/data

# Clean the Data 
After a first analysis, I decided to use this dataset because it has enough instances to make a good prediction.
The next step is to clean the dataset to prepare it for the first part of the study.

In [67]:
from sklearn.model_selection import train_test_split

# Features: every column except the label
X = data.drop(columns=['normal.'])

# Target: the label column
y = data['normal.']

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [68]:
X_train.shape

(48749, 18)

In [69]:
X.shape, X_train.shape, X_test.shape

((60937, 18), (48749, 18), (12188, 18))
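
The split above is random on every run and does not control how the classes are distributed between the two parts. A possible variant, sketched here on toy data, fixes `random_state` for reproducibility and passes `stratify` so that both parts keep the same class proportions. The toy frame and the seed value are assumptions; note also that `stratify` raises an error when a class has fewer than two samples, which may matter for rare attack types.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 rows, 80% normal and 20% ipsweep.
X_toy = pd.DataFrame({"f0": range(50)})
y_toy = pd.Series(["normal."] * 40 + ["ipsweep."] * 10)

# random_state makes the split repeatable; stratify keeps class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().to_dict())
```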

In [70]:
y_test

17998    normal.
25472    normal.
52491    normal.
8293     normal.
28903    normal.
          ...   
32317    normal.
57614    normal.
51863    normal.
7376     normal.
47019    normal.
Name: normal., Length: 12188, dtype: object

In [71]:
data

Unnamed: 0,0,udp,private,SF,105,146,1,1.1,1.2,0.1,0.2,255,254,1.3,0.01,0.3,0.4,0.5,normal.
0,0,udp,private,SF,105,146,1,1,1.0,0.0,0.0,255,254,1.00,0.01,0.00,0.00,0.0,normal.
1,0,udp,private,SF,105,146,1,1,1.0,0.0,0.0,255,254,1.00,0.01,0.00,0.00,0.0,normal.
2,0,udp,domain_u,SF,29,0,2,1,0.5,1.0,0.0,10,3,0.30,0.30,0.30,0.00,0.0,normal.
3,0,udp,private,SF,105,146,1,1,1.0,0.0,0.0,255,253,0.99,0.01,0.00,0.00,0.0,normal.
4,0,tcp,http,SF,223,185,4,4,1.0,0.0,0.0,71,255,1.00,0.00,0.01,0.01,0.0,normal.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60932,0,udp,private,SF,105,147,2,2,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.
60933,0,udp,private,SF,105,147,4,4,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.
60934,0,udp,private,SF,105,147,2,2,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.
60935,0,udp,private,SF,105,147,4,4,1.0,0.0,0.0,255,255,1.00,0.00,0.01,0.00,0.0,normal.


In this way the dataset is divided into a training part and a test part, with the **feature normal.** as the target.
The original `data` frame is not modified: the split only creates new objects.

## Real Dataset Cleaning

In [72]:
X_train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
0.0,48749.0,9.329771,510.862516,0.0,0.0,0.0,0.0,54451.0
105.0,48749.0,814.913865,38117.203874,0.0,105.0,226.0,301.0,6291668.0
146.0,48749.0,3573.197604,37220.133939,0.0,146.0,561.0,2507.0,5203179.0
1.0,48749.0,14.108782,53.61127,0.0,1.0,4.0,11.0,511.0
1.1,48749.0,16.684281,54.511128,0.0,2.0,5.0,15.0,511.0
1.2,48749.0,0.994422,0.059209,0.0,1.0,1.0,1.0,1.0
0.1,48749.0,0.007636,0.07963,0.0,0.0,0.0,0.0,1.0
0.2,48749.0,0.11152,0.234735,0.0,0.0,0.0,0.12,1.0
255.0,48749.0,163.355802,102.795617,0.0,49.0,245.0,255.0,255.0
254.0,48749.0,230.955712,60.854536,0.0,253.0,255.0,255.0,255.0


In [73]:
X_train.head()

Unnamed: 0,0,udp,private,SF,105,146,1,1.1,1.2,0.1,0.2,255,254,1.3,0.01,0.3,0.4,0.5
42273,0,tcp,http,SF,205,5089,2,17,1.0,0.0,0.18,122,255,1.0,0.0,0.01,0.04,0.0
29566,0,udp,private,SF,105,146,2,2,1.0,0.0,0.0,255,255,1.0,0.0,0.0,0.0,0.0
34900,0,udp,private,SF,105,0,1,1,1.0,0.0,0.0,255,253,0.99,0.01,0.0,0.0,0.0
39121,0,tcp,http,SF,291,1366,43,44,1.0,0.0,0.05,47,255,1.0,0.0,0.02,0.03,0.0
7789,0,tcp,ftp_data,SF,15377,0,22,22,1.0,0.0,0.0,218,89,0.3,0.03,0.3,0.02,0.0


In [74]:
is_numerical  = np.array( [ len(np.unique(X_train[col]))>10 for col in X_train] )
numerical_idx = np.flatnonzero(is_numerical) 

In [75]:
print (is_numerical)
print (numerical_idx)
print ("Number of numerical features:", sum(is_numerical))

[ True False  True False  True  True  True  True  True  True  True  True
  True  True  True  True  True  True]
[ 0  2  4  5  6  7  8  9 10 11 12 13 14 15 16 17]
Number of numerical features: 16
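
In the output above, position 2 is flagged as numerical even though `data.info()` reports that column (`private`) as `object`: a text column with more than 10 distinct values passes the unique-count test. The sketch below compares that heuristic with a dtype-based check on a toy frame; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame covering the four possible cases.
toy = pd.DataFrame({
    "service": [f"svc_{i}" for i in range(12)],   # text, 12 distinct values
    "flag": ["SF"] * 12,                          # text, 1 distinct value
    "src_bytes": np.arange(12) * 10,              # numeric, 12 distinct values
    "same_rate": [0.0, 1.0] * 6,                  # numeric, 2 distinct values
})

# Heuristic used above: more than 10 distinct values means "numerical".
by_count = np.array([toy[col].nunique() > 10 for col in toy])

# Alternative: ask pandas whether the column has a numeric dtype.
by_dtype = np.array([pd.api.types.is_numeric_dtype(toy[col]) for col in toy])

print(by_count)   # flags the text column "service", misses "same_rate"
print(by_dtype)   # flags exactly the two numeric columns
```

The dtype check treats low-cardinality numeric columns as numerical, so whether it is preferable depends on how such columns should be handled later.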


In [76]:
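# numerical_idx holds column positions, so select with .iloc
# (X_train[numerical_idx] would look the integers up as column labels).
# Convert to numeric; values that cannot be parsed are set to NaN.
# Note: position 2 is the text column `private`, which passed the
# ">10 unique values" test above, so it is coerced entirely to NaN.
new_X = X_train.iloc[:, numerical_idx].apply(pd.to_numeric, errors='coerce')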
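
An array of integer positions such as `numerical_idx` selects columns through `.iloc`, whereas plain `[]` indexing looks the integers up as column labels. The following sketch shows positional selection followed by numeric coercion on a toy frame with hypothetical values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose column labels are strings, as in the dataset.
toy = pd.DataFrame({
    "0": [0, 0, 5],
    "udp": ["udp", "tcp", "udp"],
    "105": ["105", "oops", "223"],   # numeric column polluted by a bad entry
})
idx = np.array([0, 2])   # positional indices, like numerical_idx

# toy[idx] would search for columns *labelled* 0 and 2 and raise KeyError,
# because the labels here are the strings "0", "udp" and "105".
subset = toy.iloc[:, idx].apply(pd.to_numeric, errors="coerce")

print(subset)
print(subset.isna().sum())
```

Counting the NaN values afterwards shows how many entries could not be parsed in each column.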