# <center>  **Data Wrangling** 

### <center> **Data Collection**

Goal: Organize your data to streamline the next steps of your capstone

Time estimate: 1-2 hours

<center> Data Loading

Importing Packages

In [1]:
# Import the necessary packages
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


Importing the data

In [35]:
# Import the data and see what it looks like
path = r'C:\Users\jdrel\OneDrive\Documents\Data_Science\Springboard\Capstone-2\data\raw\kddcup.data_10_percent'
data = pd.read_csv(path)
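By default, `pd.read_csv` treats the first line of a file as the header. If the raw file has no header row (an assumption worth checking against the file itself), the first record is used as column names and dropped from the data. The sketch below uses a tiny made-up, in-memory sample and a shortened, hypothetical column list to show how `header=None` and `names=` keep every record.

```python
import io
import pandas as pd

# Two made-up records in a headerless, comma-separated layout (illustrative only)
raw = io.StringIO("0,tcp,http,SF,100\n0,tcp,http,SF,200\n")

# Hypothetical short column list for this sketch; the real data has 42 columns
col_names = ["duration", "protocol_type", "service", "flag", "src_bytes"]

# header=None tells pandas the first line is data; names= supplies the labels
sample = pd.read_csv(raw, header=None, names=col_names)
print(sample.shape)
```

With this approach the column names are assigned at load time, so a separate `data.columns = [...]` step would not be needed.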

### <center> **Data Definition**

<center> Column Names

The column names can be found in the about section on the data's webpage: [KDD Dataset](https://www.kaggle.com/datasets/slashtea/kdd-cyberattack?resource=download)

In [6]:
# I copied and pasted the column names into this list
data.columns = ["duration", "protocol_type", "service", "flag", "src_bytes",
                "dst_bytes", "land", "wrong_fragment", "urgent", "hot",
                "num_failed_logins", "logged_in", "num_compromised",
                "root_shell", "su_attempted", "num_root", "num_file_creations",
                "num_shells", "num_access_files", "num_outbound_cmds", "is_host_login",
                "is_guest_login", "count", "srv_count", "serror_rate","srv_serror_rate",
                "rerror_rate", "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate",
                "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
                "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate", "dst_host_serror_rate", 
                "dst_host_srv_serror_rate", "dst_host_rerror_rate", "dst_host_srv_rerror_rate", "labels"]
data.head(3)

|   | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | http | SF | 239 | 486 | 0 | 0 | 0 | 0 | ... | 19 | 1.0 | 0.0 | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 1 | 0 | tcp | http | SF | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 29 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 2 | 0 | tcp | http | SF | 219 | 1337 | 0 | 0 | 0 | 0 | ... | 39 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |


## <center> **Data Wrangling**

Now that the data has been imported and the column names have been set, the next step is to put the data into a format suitable for mathematical analysis: every value must be a float or an integer. For categorical variables such as "protocol_type", we do this by creating dummy variables, that is, one new 0/1 feature for each category of each categorical variable. Note that "labels" is also a categorical column, so it is encoded in the same way.

In [21]:
# Create the list of categorical columns (cat_cols) to iterate over
cat_cols = list(data.select_dtypes(include = 'object').columns)

# Initialize num_data, which will hold all information in numerical form
# (copy so that later changes to num_data cannot affect data)
num_data = data.copy()

# Iterate over all the categorical variables
for cat_col in cat_cols:
    # Create the dummy variables for each category
    dummy_df = pd.get_dummies(data[cat_col])
    # Join dummy_df to the existing dataset
    num_data = pd.concat([num_data, dummy_df], axis = 1, join = 'outer')

# Create a numeric only dataset that can be easier to analyze
num_only_data = num_data.drop(cat_cols, axis = 1)
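As a compact alternative to the loop above, `pd.get_dummies` can encode several columns in one call through its `columns` argument. The sketch below runs on a small made-up frame (the values are illustrative, not taken from the dataset). Unlike the loop, this form prefixes each new column with the name of its source column, so the resulting column names differ from those in `num_only_data`.

```python
import pandas as pd

# Small made-up frame with one numeric and two categorical columns
toy = pd.DataFrame({
    "duration": [0, 2, 0],
    "protocol_type": ["tcp", "udp", "tcp"],
    "flag": ["SF", "SF", "REJ"],
})

# Columns to encode, listed explicitly for this sketch
cat_cols = ["protocol_type", "flag"]

# Encode all listed columns at once; the originals are dropped automatically
# and each dummy column is named <source column>_<category>
encoded = pd.get_dummies(toy, columns=cat_cols, dtype=int)
print(list(encoded.columns))
```

The prefix helps avoid name clashes when two categorical columns share a category value, at the cost of longer column names.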

In [29]:
num_only_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 494020 entries, 0 to 494019
Columns: 141 entries, duration to warezmaster.
dtypes: float64(15), int64(23), uint8(103)
memory usage: 191.8 MB


In [20]:
num_only_data.head()

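|   | duration | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | num_failed_logins | logged_in | num_compromised | ... | phf. | pod. | portsweep. | rootkit. | satan. | smurf. | spy. | teardrop. | warezclient. | warezmaster. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 239 | 486 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 235 | 1337 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 219 | 1337 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 217 | 2032 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 217 | 2032 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |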


Save Data with the column names

In [32]:
path =  r'C:\Users\jdrel\OneDrive\Documents\Data_Science\Springboard\Capstone-2\data\interim\KDD Data.csv'
data.to_csv(path)
path2 = r'C:\Users\jdrel\OneDrive\Documents\Data_Science\Springboard\Capstone-2\data\interim\Num Only Data.csv'
num_only_data.to_csv(path2)
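One detail to keep in mind when saving: `to_csv` writes the row index as an extra first column by default, and reading the file back then produces a column named `Unnamed: 0`. The sketch below uses a made-up two-row frame and in-memory buffers (both are assumptions for illustration) to compare the default behaviour with `index=False`.

```python
import io
import pandas as pd

# Made-up frame standing in for the real data
frame = pd.DataFrame({"duration": [0, 0], "labels": ["normal.", "smurf."]})

# Default behaviour: the index is written as an unnamed first column
buf_with_index = io.StringIO()
frame.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index)

# index=False: only the real columns are written
buf_no_index = io.StringIO()
frame.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reloaded_no_index = pd.read_csv(buf_no_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_no_index.columns))
```

Since the index here is just a default row counter, dropping it loses no information.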

In [34]:
data.head()

|   | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | http | SF | 239 | 486 | 0 | 0 | 0 | 0 | ... | 19 | 1.0 | 0.0 | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 1 | 0 | tcp | http | SF | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 29 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 2 | 0 | tcp | http | SF | 219 | 1337 | 0 | 0 | 0 | 0 | ... | 39 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 3 | 0 | tcp | http | SF | 217 | 2032 | 0 | 0 | 0 | 0 | ... | 49 | 1.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 4 | 0 | tcp | http | SF | 217 | 2032 | 0 | 0 | 0 | 0 | ... | 59 | 1.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |


Gain an overview of the dataset.

In [None]:
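# Importing pandas_profiling registers profile_report() on DataFrames
# (newer releases of the package are published as ydata_profiling)
import pandas_profiling
profile = data.profile_report()
profile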

## <center> **Data Cleaning**

Goal: Clean up the data in order to prepare it for the next steps of your project.

Time estimate: 1-2 hours


NA or Missing Values

In [10]:
# Find out how complete the data set is by counting missing values per column
na_col = [data[column].isna().sum() for column in data.columns]
print(dict(zip(data.columns, na_col)))



{'duration': 0, 'protocol_type': 0, 'service': 0, 'flag': 0, 'src_bytes': 0, 'dst_bytes': 0, 'land': 0, 'wrong_fragment': 0, 'urgent': 0, 'hot': 0, 'num_failed_logins': 0, 'logged_in': 0, 'num_compromised': 0, 'root_shell': 0, 'su_attempted': 0, 'num_root': 0, 'num_file_creations': 0, 'num_shells': 0, 'num_access_files': 0, 'num_outbound_cmds': 0, 'is_host_login': 0, 'is_guest_login': 0, 'count': 0, 'srv_count': 0, 'serror_rate': 0, 'srv_serror_rate': 0, 'rerror_rate': 0, 'srv_rerror_rate': 0, 'same_srv_rate': 0, 'diff_srv_rate': 0, 'srv_diff_host_rate': 0, 'dst_host_count': 0, 'dst_host_srv_count': 0, 'dst_host_same_srv_rate': 0, 'dst_host_diff_srv_rate': 0, 'dst_host_same_src_port_rate': 0, 'dst_host_srv_diff_host_rate': 0, 'dst_host_serror_rate': 0, 'dst_host_srv_serror_rate': 0, 'dst_host_rerror_rate': 0, 'dst_host_srv_rerror_rate': 0, 'labels': 0}


In [11]:
# isnull() is an alias of isna(), so this repeats the check above as a sanity check
null_col = [data[column].isnull().sum() for column in data.columns]
print(dict(zip(data.columns, null_col)))

{'duration': 0, 'protocol_type': 0, 'service': 0, 'flag': 0, 'src_bytes': 0, 'dst_bytes': 0, 'land': 0, 'wrong_fragment': 0, 'urgent': 0, 'hot': 0, 'num_failed_logins': 0, 'logged_in': 0, 'num_compromised': 0, 'root_shell': 0, 'su_attempted': 0, 'num_root': 0, 'num_file_creations': 0, 'num_shells': 0, 'num_access_files': 0, 'num_outbound_cmds': 0, 'is_host_login': 0, 'is_guest_login': 0, 'count': 0, 'srv_count': 0, 'serror_rate': 0, 'srv_serror_rate': 0, 'rerror_rate': 0, 'srv_rerror_rate': 0, 'same_srv_rate': 0, 'diff_srv_rate': 0, 'srv_diff_host_rate': 0, 'dst_host_count': 0, 'dst_host_srv_count': 0, 'dst_host_same_srv_rate': 0, 'dst_host_diff_srv_rate': 0, 'dst_host_same_src_port_rate': 0, 'dst_host_srv_diff_host_rate': 0, 'dst_host_serror_rate': 0, 'dst_host_srv_serror_rate': 0, 'dst_host_rerror_rate': 0, 'dst_host_srv_rerror_rate': 0, 'labels': 0}


There are no missing values anywhere in the dataset. Although that is unusual for real-world data, this dataset comes from computer research, so it is not surprising that every piece of information could be collected for every observation.
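The per-column counts above can also be obtained without a list comprehension, since `DataFrame.isna().sum()` already returns one count per column. The sketch below uses a small made-up frame with two deliberately missing entries (an assumption for illustration; the real data has none).

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column
toy = pd.DataFrame({
    "src_bytes": [239, np.nan, 219],
    "flag": ["SF", "SF", None],
})

# One missing-value count per column, as a Series indexed by column name
na_counts = toy.isna().sum()

# Single number summarising the whole frame
total_missing = int(na_counts.sum())

print(na_counts.to_dict())
print(total_missing)
```

A total of zero from the second line would confirm in one step that the dataset is complete.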

It doesn't make sense to check for duplicates in this dataset: the observations are machine-generated and highly consistent, so many legitimate observations will be identical to one another.
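If a measure of how repetitive the observations are is ever wanted, the repeated rows can be counted without removing any of them. The sketch below is a minimal illustration on a made-up frame; the values are hypothetical and the approach is one option among several.

```python
import pandas as pd

# Made-up frame in which three of the four rows are identical
toy = pd.DataFrame({
    "protocol_type": ["icmp", "icmp", "tcp", "icmp"],
    "src_bytes": [1032, 1032, 239, 1032],
    "labels": ["smurf.", "smurf.", "normal.", "smurf."],
})

# duplicated() marks every row that is identical to an earlier row
n_dupes = int(toy.duplicated().sum())

# Share of rows that repeat an earlier row
dupe_share = n_dupes / len(toy)

print(n_dupes, dupe_share)
```

Nothing is dropped here; the count only describes the data, which is consistent with keeping the repeated observations.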