
```
Import Libraries

```

In [None]:
# Core data manipulation and analysis libraries
import pandas as pd  # For data manipulation and analysis
import numpy as np   # For numerical operations and arrays

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Enable inline plotting in Jupyter notebooks
%matplotlib inline



```
Load the full dataset, which was produced by combining the raw dataset supplied as 4 CSV files.

```

In [None]:
    
# Read the combined dataset
# full_data.csv is the single file produced by combining the 4 raw CSV files
df = pd.read_csv('C:/Users/raman/OneDrive/Important/1UnisaSTUDY/Courses/Capstone_Project_1/Github/Code Working/Data Cleaning and EDA/full_data.csv', header=0) 

df.head()
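
The combining step itself is not shown in this notebook. A minimal sketch of how several CSV files could be stacked with a list comprehension is given below. The helper name `combine_csv_files`, the file names, and the toy data are hypothetical, and the sketch assumes every file shares the same header row, which may not hold for the actual raw files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_csv_files(paths):
    """Read each CSV in `paths` and stack the rows into one DataFrame."""
    frames = [pd.read_csv(p, header=0) for p in paths]
    # ignore_index=True renumbers the rows 0..n-1 across all files
    return pd.concat(frames, ignore_index=True)


# Demonstration with two tiny stand-in files (hypothetical names and data)
with tempfile.TemporaryDirectory() as tmp:
    part_1 = Path(tmp) / "part_1.csv"
    part_2 = Path(tmp) / "part_2.csv"
    part_1.write_text("service,ct_ftp_cmd\nhttp,0\n-,1\n")
    part_2.write_text("service,ct_ftp_cmd\ndns,0\n")
    combined = combine_csv_files([part_1, part_2])

print(combined.shape)  # (3, 2)
```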


```

In this section we look for the unique values in the dataset and for columns that contain NaN or null values.

The following three columns have null values:

- ct_flw_http_mthd: Number of flows that have methods such as GET and POST in the HTTP service.

- is_ftp_login: 1 if the FTP session is accessed with a user name and password, otherwise 0.

- attack_cat: The name of each attack category. This dataset has nine categories: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms.

```

In [None]:
# Count how many times each unique row (combination of values across all columns) appears
# Note: DataFrame.value_counts() excludes rows that contain NaN by default
df.value_counts()


In [None]:
# Find rows with missing values by checking for NaN values across all columns (axis=1)
missing_rows = df[df.isna().any(axis=1)]
# Get column names where missing values were found in the subset of rows with missing values
missing_cols = missing_rows.columns[missing_rows.isna().any()]
# Print summary of number of rows found with missing values
print(f"\nFound {len(missing_rows)} rows with missing values in columns:")
# Print names of columns containing missing values
print(missing_cols)
# Print the subset of rows and columns containing missing values
print(missing_rows[missing_cols])


In [None]:
# checking the sum for null values
df.isnull().sum()
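
On a wide DataFrame the full `isnull().sum()` listing is long. The sketch below, run on a small made-up frame rather than the real data, shows one way to keep only the columns that actually contain nulls. The frame `toy` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (values are made up)
toy = pd.DataFrame({
    "service": ["http", "-", "ftp"],
    "ct_flw_http_mthd": [1.0, np.nan, np.nan],
    "attack_cat": [np.nan, "DoS", np.nan],
})

# Count nulls per column, then keep only columns with at least one null
null_counts = toy.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```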

```
In this section we cleared the null values from the following column:

- ct_flw_http_mthd: Number of flows that have methods such as GET and POST in the HTTP service.

There were 1348145 null values, which were converted to 0.

```

In [None]:
# Check for null values before filling
print("Number of null values before:", df['ct_flw_http_mthd'].isnull().sum())

# Check unique values before transformation
print("\nUnique values before:")
print(df['ct_flw_http_mthd'].unique())

# Show value counts before transformation
print(df['ct_flw_http_mthd'].value_counts())

# Fill null values with 0
df['ct_flw_http_mthd'] = df['ct_flw_http_mthd'].fillna(0)

# Check unique values after transformation
print("\nUnique values after:")
print(df['ct_flw_http_mthd'].unique())

# Show value counts after transformation
print("\nValue counts after:")
print(df['ct_flw_http_mthd'].value_counts())
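
One quick way to confirm that a fill behaved as intended is to check that the number of zeros grew by exactly the number of nulls. The sketch below does this on a small made-up Series. The variable names and values are illustrative assumptions, not output from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy column standing in for df['ct_flw_http_mthd'] (values are made up)
toy = pd.Series([1.0, np.nan, 0.0, np.nan], name="ct_flw_http_mthd")

n_null = int(toy.isnull().sum())
zeros_before = int((toy == 0).sum())

filled = toy.fillna(0)
zeros_after = int((filled == 0).sum())

# Every null should have become a zero, and nothing else should have changed
print(zeros_after - zeros_before == n_null)
```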


```
In this section we cleared the null values from the following column:

- is_ftp_login: 1 if the FTP session is accessed with a user name and password, otherwise 0.

There were 1429879 null values, which were converted to 0.

In addition, the value 4.0 appeared 156 times and the value 2.0 appeared 30 times in this column.

These values greater than 1 were converted to 1.

```

In [None]:
# Check initial null values and unique values
print("Initial null values:", df['is_ftp_login'].isnull().sum())
print("\nInitial unique values:")
print(df['is_ftp_login'].unique())
print(df['is_ftp_login'].value_counts())

# Fill nulls with 0 and cap values at 1 to create binary column
df['is_ftp_login'] = df['is_ftp_login'].fillna(0)
df['is_ftp_login'] = np.where(df['is_ftp_login']>1, 1, df['is_ftp_login'])

# Show final value distribution
print("\nFinal value counts:")
print(df['is_ftp_login'].value_counts())
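
The same fill-and-cap step can also be written with `Series.clip`. The sketch below applies it to a small made-up Series, and the final cast to `int` is an extra step that the cell above does not perform.

```python
import numpy as np
import pandas as pd

# Toy column standing in for df['is_ftp_login'] (values are made up)
toy = pd.Series([np.nan, 0.0, 1.0, 2.0, 4.0], name="is_ftp_login")

# Fill nulls with 0, cap everything above 1 at 1, then store as integers
binary = toy.fillna(0).clip(upper=1).astype(int)
print(binary.tolist())  # [0, 0, 1, 1, 1]
```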


```
In this section we cleared the null values from the following column:

- attack_cat: The name of each attack category.

There were 2218764 null values, which were converted to Normal.


```

In [None]:
# 'attack_cat' has no "Normal" label: records without an attack category are null
# This cell performs two operations on the 'attack_cat' column:
# 1. Fills any null/missing values with the string 'Normal' using fillna()
# 2. Removes leading/trailing whitespace from every label using strip()
# Check for null values before filling
print("Number of null values before:", df['attack_cat'].isnull().sum())

# Check unique values before transformation
print("\nUnique values before:")
print(df['attack_cat'].unique())

# Apply the transformation: fill nulls with 'Normal', then strip surrounding whitespace
df['attack_cat'] = df['attack_cat'].fillna('Normal').str.strip()

# Check for null values after filling
print("\nNumber of null values after:", df['attack_cat'].isnull().sum())

# Check unique values after transformation
print("\nUnique values after:")
print(df['attack_cat'].unique())

# Get value counts to see the distribution of the filled and whitespace-stripped labels
print("\nValue counts:")
print(df['attack_cat'].value_counts())
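
Stripping whitespace matters because labels that differ only by padding are otherwise counted as separate categories. The sketch below illustrates this on a small made-up Series. The padded label is a hypothetical example, not a value confirmed from the real data.

```python
import pandas as pd

# Toy column standing in for df['attack_cat'] (values are made up)
toy = pd.Series([" Fuzzers ", "Fuzzers", None, "DoS"], name="attack_cat")

# Before cleaning, ' Fuzzers ' and 'Fuzzers' count as two different labels
print(toy.value_counts())

# Fill nulls with 'Normal', then strip whitespace so the two labels merge
cleaned = toy.fillna("Normal").str.strip()
counts = cleaned.value_counts()
print(counts)
```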


```
The unique values of every column were checked once more, and some unexpected values were found in the following columns:

- service: http, ftp, ssh, dns, and so on; otherwise a dash (-).

This column had 1246397 values equal to -, which were converted to none.

- ct_ftp_cmd: Number of flows that have a command in the FTP session.

This column holds numeric values of mixed types, for example [0, 1, 6, 2, 4, 8, 5, 3, '0', '1', ' ', '2', '4']. String digits such as '1' were converted to the integer 1.



```

In [None]:
# This code iterates through each column in the DataFrame 'df'
# For each column, it:
# 1. Prints the column name
# 2. Uses value_counts() to display how many times each unique value appears in that column
# Display unique values and their counts for each column
for column in df.columns:
    print(f"\n{column}:")
    print(df[column].value_counts())


In [None]:
# Count the number of occurrences of each unique value in the 'service' column
# Returns a Series with the value counts in descending order
df['service'].value_counts()

In [None]:
# Replace all instances of "-" with the string 'none' in the 'service' column
df['service'] = df['service'].replace('-', 'none')
# Get the count of each unique value in the 'service' column after replacing "-" with 'none'
df['service'].value_counts()


In [None]:
# Count the frequency of each unique value in the 'ct_ftp_cmd' column
# Returns a Series with unique values as index and their counts as values
df['ct_ftp_cmd'].value_counts()


In [None]:
# Get unique values in the 'ct_ftp_cmd' column of the dataframe
# Returns an array of the distinct values (FTP command flow counts), including any string entries

df['ct_ftp_cmd'].unique()


In [None]:
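# Map each raw value to its integer count (both integers and string digits occur)
df['ct_ftp_cmd'] = df['ct_ftp_cmd'].map({
    0: 0,
    '0': 0,  # Convert string '0' to int 0
    1: 1,
    '1': 1,  # Convert string '1' to int 1
    2: 2,
    '2': 2,
    3: 3,
    4: 4,
    '4': 4,
    5: 5,
    6: 6,
    8: 8
})
# Any value not listed in the mapping, such as the blank string ' ', becomes NaN

# Verify the cleanup (dropna=False keeps NaN in the counts so unmapped values stay visible)
print(df['ct_ftp_cmd'].value_counts(dropna=False).sort_index())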
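
An explicit mapping has to list every value by hand. As an alternative sketch, `pd.to_numeric` with `errors="coerce"` converts both integers and string digits in one call and turns anything unparseable into NaN. The toy Series is made up, and treating a blank entry as 0 in the last step is an assumption that should be checked against the data description before use.

```python
import pandas as pd

# Toy column standing in for df['ct_ftp_cmd'], mixing ints and strings
toy = pd.Series([0, 1, 6, "0", "1", " ", "2", "4"], name="ct_ftp_cmd")

# Convert everything that looks like a number; the blank string becomes NaN
numeric = pd.to_numeric(toy, errors="coerce")
print(numeric.isna().sum())  # 1

# ASSUMPTION: a blank entry means "no FTP command", so it is filled with 0
as_int = numeric.fillna(0).astype(int)
print(as_int.tolist())  # [0, 1, 6, 0, 1, 0, 2, 4]
```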

```
Most of the dataset is now cleaned, so it can be saved to a CSV file named Cleaned_full_data.csv.

```

In [None]:
# Export the cleaned DataFrame to a CSV file
# The file will be saved in the current directory as 'Cleaned_full_data.csv'
# index=False prevents the DataFrame index from being written to the CSV
# The file is too large to upload to GitHub, so change the path to another location


# df.to_csv('./Cleaned_full_data.csv', index=False)
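
If file size is the concern, one option is to write the CSV with gzip compression, which `to_csv` supports through its `compression` parameter. The sketch below round-trips a small made-up frame through a temporary directory. Whether the compressed file would be small enough for GitHub is not something this sketch can confirm.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned df (values are made up)
toy = pd.DataFrame({"service": ["http", "dns"], "ct_ftp_cmd": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "Cleaned_full_data.csv.gz"
    # Write a gzip-compressed CSV without the index column
    toy.to_csv(out_path, index=False, compression="gzip")
    # read_csv infers gzip compression from the .gz extension
    restored = pd.read_csv(out_path)

print(restored.shape)
```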
