# Cyber Security – Network Security (Binomial Classification)
### Network Intrusion Detection Case Study

### Business Context:
We are in a time where businesses are more digitally advanced than ever, and as technology improves, organizations’ security postures must be enhanced as well. Failure to do so could result in a costly data breach, as we’ve seen happen with many businesses. The cybercrime landscape has evolved, and threat actors are going after any type of organization, so in order to protect your business’s data, money and reputation, it is critical that you invest in an advanced security system.

Cyber security can be described as the collective methods, technologies, and processes to help protect the confidentiality, integrity, and availability of computer systems, networks and data, against cyber-attacks or unauthorized access.

###### a. Information Security vs. Cyber Security vs. Network Security:

Information security (also known as InfoSec) ensures that both physical and digital data is protected from unauthorized access, use, disclosure, disruption, modification, inspection, recording or destruction. Information security differs from cyber security in that InfoSec aims to keep data in any form secure, whereas cyber security protects only digital data.

Cyber security, a subset of information security, is the practice of defending your organization’s networks, computers and data from unauthorized digital access, attack or damage by implementing various processes, technologies and practices. With countless sophisticated threat actors targeting all types of organizations, it is critical that your IT infrastructure is secured at all times to prevent a full-scale attack on your network that could expose your company’s data and damage its reputation.

Network security, a subset of cyber security, aims to protect any data that is being sent through devices in your network to ensure that the information is not changed or intercepted. The role of network security is to protect the organization’s IT infrastructure from all types of cyber threats including:


a. Viruses, worms and Trojan horses

b. Zero-day attacks

c. Hacker attacks

d. Denial of service attacks

e. Spyware and adware



Your network security team implements the hardware and software necessary to guard your security architecture. With the proper network security in place, your system can detect emerging threats before they infiltrate your network and compromise your data.

There are many components to a network security system that work together to improve your security posture. The most common network security components include:

a. Firewalls

b. Anti-virus software

c. Intrusion detection and prevention systems (IDS/IPS)

d. Virtual private networks (VPN)



### Network Intrusions vs. Computer intrusions vs. Cyber Attacks

### 1. Computer Intrusions:
Computer intrusions occur when someone tries to gain access to any part of your computer system. Computer intruders or hackers typically use automated computer programs when they try to compromise a computer’s security. There are several ways an intruder can try to gain access to your computer.

An intruder can:

a. Access your computer to view, change, or delete information on it

b. Crash or slow down your computer

c. Access your private data by examining the files on your system

d. Use your computer to access other computers on the Internet

### 2. Network Intrusions:
A network intrusion refers to any unauthorized activity on a digital network. Network intrusions often involve stealing valuable network resources and almost always jeopardize the security of networks and/or their data. 

To proactively detect and respond to network intrusions, organizations and their cyber-security teams need a thorough understanding of how network intrusions work, and must implement network intrusion detection and response systems that are designed with attack techniques and cover-up methods in mind.

#### Network Intrusion Attack Techniques:
Given the amount of normal activity constantly taking place on digital networks, it can be very difficult to pinpoint anomalies that could indicate a network intrusion has occurred. Below are some of the most common network intrusion attack techniques that organizations should continually look for:

#### Living Off the Land: 
Attackers increasingly use existing tools and processes, along with stolen credentials, when compromising networks. These tools, such as operating system utilities, business productivity software and scripting languages, are clearly not malware and have legitimate uses. In fact, in most cases the vast majority of their usage is business justified, which allows an attacker to blend in.

#### Multi-Routing:
If a network allows for asymmetric routing, attackers will often leverage multiple routes to access the targeted device or network. This allows them to avoid being detected by having a large portion of suspicious packets bypass certain network segments and any relevant network intrusion systems.

#### Buffer Overwriting:
By overwriting certain sections of computer memory on a network device, attackers can replace normal data in those memory locations with a slew of commands that can later be used as part of a network intrusion. This attack technique is a lot harder to accomplish if boundary-checking logic is installed and executable code or malicious strings are identified before they can be written to the buffer.

#### Covert CGI Scripts: 
Unfortunately, the Common Gateway Interface (CGI), which allows servers to pass user requests to relevant applications and receive data back to then forward to users, serves as an easy opening for attackers to access network system files. For instance, if networks don’t require input verification or scan for backtracking, attackers can use a covert CGI script to add the directory label “..” or the pipe “|” character to any file path name, allowing them to access files that shouldn’t be accessible via the Web. Fortunately, CGI is much less popular today and there are far fewer devices that provide this interface.
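
The input verification described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `is_safe_cgi_path` and the web root `/var/www/cgi-data` are hypothetical, and a production server would normally rely on its framework's own path handling rather than a hand-written check.

```python
import posixpath

WEB_ROOT = "/var/www/cgi-data"  # hypothetical directory the script may read from

def is_safe_cgi_path(user_path: str, web_root: str = WEB_ROOT) -> bool:
    """Return True only if user_path stays inside web_root and has no pipe character."""
    # Reject the pipe character, which a careless script might pass on to a shell.
    if "|" in user_path:
        return False
    # Resolve ".." components, then check the result is still under web_root.
    resolved = posixpath.normpath(posixpath.join(web_root, user_path))
    return resolved == web_root or resolved.startswith(web_root + "/")

print(is_safe_cgi_path("reports/summary.txt"))          # True
print(is_safe_cgi_path("../../etc/passwd"))             # False
print(is_safe_cgi_path("report.txt|cat /etc/passwd"))   # False
```

Resolving the path before comparing it with the web root is what catches backtracking: a plain substring search for ".." would miss absolute paths such as `/etc/passwd`.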

#### Protocol-Specific Attacks: 
Protocols such as ARP, IP, TCP, UDP, ICMP, and various application protocols can inadvertently leave openings for network intrusions. Case in point: Attackers will often impersonate protocols or spoof protocol messages to perform man-in-the-middle attacks and thus access data they wouldn’t have access to otherwise, or to crash targeted devices on a network.

#### Traffic Flooding: 
By creating traffic loads that are too large for systems to adequately screen, attackers can induce chaos and congestion in network environments, which allows them to execute attacks without ever being detected.
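
One simple way to notice a flood is to count packets per source inside a sliding time window. The sketch below is an assumption-laden illustration, not a description of any particular IDS: the class `FloodDetector`, its thresholds, and the sample address are all hypothetical.

```python
from collections import defaultdict, deque

class FloodDetector:
    """Flag a source when it sends more than max_packets within window_s seconds."""

    def __init__(self, max_packets: int, window_s: float):
        self.max_packets = max_packets
        self.window_s = window_s
        self._seen = defaultdict(deque)  # source -> timestamps inside the window

    def observe(self, source: str, timestamp: float) -> bool:
        """Record one packet; return True if the source now exceeds the limit."""
        times = self._seen[source]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window_s:
            times.popleft()
        return len(times) > self.max_packets

detector = FloodDetector(max_packets=3, window_s=1.0)
flags = [detector.observe("10.0.0.5", t) for t in (0.0, 0.1, 0.2, 0.3, 5.0)]
print(flags)  # [False, False, False, True, False]
```

The fourth packet arrives within one second of the first three and trips the limit; the packet at `t = 5.0` falls into a fresh window and is not flagged.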

#### Trojan Horse Malware: 
As the name suggests, Trojan horse malware creates network backdoors that give attackers easy access to systems and any available data. Unlike viruses and worms, Trojans don’t reproduce by infecting other files, and they don’t self-replicate. Trojans can be introduced from online archives and file repositories, and often originate from peer-to-peer file exchanges.

#### Worms: 
One of the easiest and most damaging network intrusion techniques is the worm, a standalone program that replicates itself. Often spread through email attachments or instant messaging, worms take up large amounts of network resources, preventing authorized activity from occurring. Some worms are designed to steal specific kinds of confidential information, such as financial information or any personal data relating to social security numbers, and they then relay that data to attackers waiting outside an organization’s network.


### Network Intrusion Cover-Up Methods

Once attackers have employed common network intrusion attack techniques, they’ll often incorporate additional measures to cover their tracks and avoid detection. As mentioned above, using non-malware, living-off-the-land tools has the dual advantage of being powerful while blending into business-justified usage, which makes it hard to detect. In addition, below are three practices that are frequently used to circumvent cyber security teams and network intrusion detection systems:

#### Deleting logs:
By deleting access logs, attackers can make it nearly impossible to determine where and what they’ve accessed (that is, without enlisting the help of an extensive cyber forensics team). Regularly scheduled log reviews and centralized logging can help combat this problem by preventing attackers from tampering with any type and/or location of logs.

#### Using encryption on departing data:
Encrypting the data that’s being stolen from an organization’s network environment (or simply cloaking any outbound traffic so it looks normal) is one of the most straightforward tactics attackers can leverage to hide their movements from network-based detections.

#### Installing rootkits: 
Rootkits, or software that enables unauthorized users to gain control of a network without ever being detected, are particularly effective in covering attackers’ tracks, as they allow attackers to leisurely inspect systems and exploit them over long periods of time.

### 3. Cyber Attack:
A cyber-attack is any type of offensive action that targets computer information systems, infrastructures, computer networks or personal computer devices, using various methods to steal, alter or destroy data or information systems.

Common cyber-attack types:

a. Denial-of-service (DoS) and distributed denial-of-service (DDoS) attacks

b. Man-in-the-middle (MitM) attack

c. Phishing and spear phishing attacks

d. Drive-by attack

e. Password attack

f. SQL injection attack

g. Cross-site scripting (XSS) attack

h. Eavesdropping attack

i. Birthday attack

j. Malware attack


## Business Objective:
With the enormous growth in computer network usage and the huge increase in the number of applications running on top of these networks, network security is becoming increasingly important. All computer systems suffer from security vulnerabilities that are both technically difficult and economically costly for manufacturers to fix. Therefore, the role of Intrusion Detection Systems (IDSs), as special-purpose devices that detect anomalies and attacks in the network, is becoming more important.

Research in the intrusion detection field has long focused on anomaly-based and misuse-based detection techniques. While misuse-based detection is generally favoured in commercial products due to its predictability and high accuracy, in academic research anomaly detection is typically conceived as a more powerful method due to its theoretical potential for addressing novel attacks.
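
The difference between the two approaches can be illustrated with a toy example. Everything in it is hypothetical: the signature set, the baseline values and the three-standard-deviation threshold are assumptions chosen for illustration, not rules from a real product.

```python
import statistics

# Hypothetical signatures: (service, flag) pairs already known to indicate an attack.
KNOWN_ATTACK_SIGNATURES = {("telnet", "REJ"), ("ftp", "S0")}

def misuse_detect(record: dict) -> bool:
    """Misuse-based: flag only records that match a known signature."""
    return (record["service"], record["flag"]) in KNOWN_ATTACK_SIGNATURES

def anomaly_detect(value: float, baseline: list, threshold: float = 3.0) -> bool:
    """Anomaly-based: flag values more than `threshold` standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    std = statistics.stdev(baseline)
    return abs(value - mean) > threshold * std

normal_src_bytes = [200, 220, 210, 190, 205, 215]   # made-up baseline traffic
novel = {"service": "http", "flag": "SF", "src_bytes": 50000}

print(misuse_detect(novel))                                   # False: no signature matches
print(anomaly_detect(novel["src_bytes"], normal_src_bytes))   # True: far outside the baseline
```

The record matches no stored signature, so the misuse-based check stays silent, while the anomaly-based check flags it because its byte count is far from normal behaviour. The reverse trade-off also holds: unusual but legitimate traffic would be flagged by the anomaly check as well.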

As part of this project, your task is to build a network intrusion detection system that detects anomalies and attacks in the network.

## Problems:

###### Binomial classification: 
Detect anomalies by predicting whether an activity is normal or an attack.

In [None]:
# Importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import scipy.stats as stats

In [None]:
# Importing datasets

# Folder that holds the CSV files
DATA_DIR = "C:/Users/navee/OneDrive/Desktop/Data_Science_360/Case_studies/Completed/Machine_learning/Cyber_Security_case_study/"

Data_of_Attack_Back = pd.read_csv(DATA_DIR + "Data_of_Attack_Back.csv")
Data_of_Attack_Back_BufferOverflow = pd.read_csv(DATA_DIR + "Data_of_Attack_Back_BufferOverflow.csv")
# This file has no header row, so its column names are assigned later
Data_of_Attack_Back_FTPWrite = pd.read_csv(DATA_DIR + "Data_of_Attack_Back_FTPWrite.csv", header=None)
Data_of_Attack_Back_GuessPassword = pd.read_csv(DATA_DIR + "Data_of_Attack_Back_GuessPassword.csv")
Data_of_Attack_Back_Neptune = pd.read_csv(DATA_DIR + "Data_of_Attack_Back_Neptune.csv")
Data_of_Attack_Back_NMap = pd.read_csv(DATA_DIR + "Data_of_Attack_Back_NMap.csv")
Data_of_Attack_Back_Normal = pd.read_csv(DATA_DIR + "Data_of_Attack_Back_Normal.csv")
Data_of_Attack_Back_PortSweep = pd.read_csv(DATA_DIR + "Data_of_Attack_Back_PortSweep.csv")
Data_of_Attack_Back_RootKit = pd.read_csv(DATA_DIR + "Data_of_Attack_Back_RootKit.csv")
Data_of_Attack_Back_Satan = pd.read_csv(DATA_DIR + "Data_of_Attack_Back_Satan.csv")
Data_of_Attack_Back_Smurf = pd.read_csv(DATA_DIR + "Data_of_Attack_Back_Smurf.csv")


In [None]:
Data_of_Attack_Back.head()

In [None]:
Data_of_Attack_Back.info()

In [None]:
Data_of_Attack_Back.nunique()

In [None]:
Data_of_Attack_Back_BufferOverflow.head()

In [None]:
Data_of_Attack_Back_BufferOverflow.info()

In [None]:
Data_of_Attack_Back_FTPWrite

In [None]:
Data_of_Attack_Back_FTPWrite.info()

In [None]:
Data_of_Attack_Back_GuessPassword.head()

In [None]:
Data_of_Attack_Back_GuessPassword.info()

In [None]:
Data_of_Attack_Back_Neptune.head()

In [None]:
Data_of_Attack_Back_Neptune.info()

In [None]:
Data_of_Attack_Back_NMap.head()

In [None]:
Data_of_Attack_Back_NMap.info()

In [None]:
Data_of_Attack_Back_Normal.head()

In [None]:
Data_of_Attack_Back_Normal.info()

In [None]:
Data_of_Attack_Back_PortSweep.head()

In [None]:
Data_of_Attack_Back_PortSweep.info()

In [None]:
Data_of_Attack_Back_RootKit.head()

In [None]:
Data_of_Attack_Back_RootKit.info()

In [None]:
Data_of_Attack_Back_Satan.head()

In [None]:
Data_of_Attack_Back_Satan.info()

In [None]:
Data_of_Attack_Back_Smurf.head()

In [None]:
Data_of_Attack_Back_Smurf.info()

In [None]:
# Adding column names to dataset "Data_of_Attack_Back_FTPWrite"


Data_of_Attack_Back_FTPWrite.columns = Data_of_Attack_Back.columns


In [None]:
# Creating the target column 'activity', which labels every observation as 'attack' or 'normal' according to the dataset it comes from


Data_of_Attack_Back['activity'] = 'attack'
Data_of_Attack_Back_BufferOverflow['activity'] = 'attack'
Data_of_Attack_Back_FTPWrite['activity'] = 'attack'
Data_of_Attack_Back_GuessPassword['activity'] = 'attack'
Data_of_Attack_Back_Neptune['activity'] = 'attack'
Data_of_Attack_Back_NMap['activity'] = 'attack'
Data_of_Attack_Back_Normal['activity'] = 'normal'
Data_of_Attack_Back_PortSweep['activity'] = 'attack'
Data_of_Attack_Back_RootKit['activity'] = 'attack'
Data_of_Attack_Back_Satan['activity'] = 'attack'
Data_of_Attack_Back_Smurf['activity'] = 'attack'


In [None]:
# Merging all the datasets into one; ignore_index=True gives the combined rows a fresh, unique index

new_dataset = pd.concat([Data_of_Attack_Back, Data_of_Attack_Back_BufferOverflow, Data_of_Attack_Back_FTPWrite, Data_of_Attack_Back_GuessPassword, Data_of_Attack_Back_Neptune, Data_of_Attack_Back_NMap, Data_of_Attack_Back_Normal, Data_of_Attack_Back_PortSweep, Data_of_Attack_Back_RootKit, Data_of_Attack_Back_Satan, Data_of_Attack_Back_Smurf], axis=0, ignore_index=True)
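
When frames are stacked with `pd.concat(..., axis=0)`, each one keeps its own row labels unless `ignore_index=True` is passed, so the combined index can contain repeated labels. The small example below uses two made-up frames to show the difference.

```python
import pandas as pd

# Two tiny, made-up frames standing in for the per-attack datasets.
attack = pd.DataFrame({"duration": [0, 2], "activity": "attack"})
normal = pd.DataFrame({"duration": [5, 7], "activity": "normal"})

kept = pd.concat([attack, normal], axis=0)                      # original labels kept
fresh = pd.concat([attack, normal], axis=0, ignore_index=True)  # labels renumbered

print(kept.index.tolist())    # [0, 1, 0, 1]
print(fresh.index.tolist())   # [0, 1, 2, 3]
print(kept.index.is_unique, fresh.index.is_unique)  # False True
```

A unique index matters later whenever frames are aligned by label, for example when joining columns side by side with `axis=1`.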

In [None]:
new_dataset.head()

In [None]:
new_dataset.info()

In [None]:
new_dataset

In [None]:
new_dataset.columns = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes',
       'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot',
       'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell',
       'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
       'num_access_files', 'num_outbound_cmds', 'is_host_login',
       'is_guest_login', 'count', 'srv_count', 'serror_rate',
       'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
       'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
       'dst_host_srv_count', 'dst_host_same_srv_rate',
       'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
       'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
       'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
       'dst_host_srv_rerror_rate','activity']

In [None]:
# Checking for duplicate records


new_dataset.duplicated().sum()

In [None]:
new_dataset = new_dataset.drop_duplicates()

In [None]:
# User-defined function (UDF) that summarises one continuous variable
def continuous_var_summary( x ):
    
    # freq and missings
    n_total = x.shape[0]
    n_miss = x.isna().sum()
    perc_miss = n_miss * 100 / n_total
    
    # outliers - iqr
    q1 = x.quantile(0.25)
    q3 = x.quantile(0.75)
    iqr = q3 - q1
    lc_iqr = q1 - 1.5 * iqr
    uc_iqr = q3 + 1.5 * iqr
    
    return pd.Series( [ x.dtype, x.nunique(), n_total, x.count(), n_miss, perc_miss,
                       x.sum(), x.mean(), x.std(), x.var(), 
                       lc_iqr, uc_iqr, 
                       x.min(), x.quantile(0.01), x.quantile(0.05), x.quantile(0.10), 
                       x.quantile(0.25), x.quantile(0.5), x.quantile(0.75), 
                       x.quantile(0.90), x.quantile(0.95), x.quantile(0.99), x.max() ], 
                     
                    index = ['dtype', 'cardinality', 'n_tot', 'n', 'nmiss', 'perc_miss',
                             'sum', 'mean', 'std', 'var',
                        'lc_iqr', 'uc_iqr',
                        'min', 'p1', 'p5', 'p10', 'p25', 'p50', 'p75', 'p90', 'p95', 'p99', 'max']) 
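
The IQR fences computed inside the function (`lc_iqr` and `uc_iqr`) are easiest to follow on a tiny series. The values below are made up for illustration.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 100])

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # same formulas as lc_iqr / uc_iqr

# Values outside the fences are the ones the IQR rule treats as outliers.
outliers = x[(x < lower) | (x > upper)]
print(lower, upper)         # -1.0 7.0
print(outliers.tolist())    # [100]
```

Here the quartiles are 2 and 4, so the IQR is 2 and the fences sit at -1 and 7; only the value 100 falls outside them.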

In [None]:
new_dataset.select_dtypes(['int', 'float']).apply(continuous_var_summary)

In [None]:
# Dropping columns with cardinality as 1

In [None]:
cardinality = pd.DataFrame(new_dataset.select_dtypes(['int', 'float']).nunique()).reset_index()

In [None]:
cardinality.columns = ['index', 'unique_values']

In [None]:
card_1 = list(cardinality.loc[cardinality.unique_values == 1]['index'])

In [None]:
print(card_1)

In [None]:
new_dataset = new_dataset.drop(columns=card_1)

In [None]:
# Extracting the variables with low cardinality, as they should be considered categorical variables

In [None]:
cardinality = pd.DataFrame(new_dataset.select_dtypes(['int', 'float']).nunique()).reset_index()

In [None]:
cardinality.columns = ['index', 'unique_values']

In [None]:
low_card = list(cardinality.loc[cardinality.unique_values<=12]['index'])

In [None]:
print(low_card)

In [None]:
# service: destination network service used
# It is therefore also treated as a categorical variable

In [None]:
low_card = ['service'] + low_card

In [None]:
print(low_card)

In [None]:
# The remaining variables are treated as continuous variables

In [None]:
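# A list comprehension keeps the original column order (a set difference does not guarantee it)
high_card = [col for col in new_dataset.columns if col not in low_card and col != 'activity']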

In [None]:
print(high_card)

#### The target (Y) variable is 'activity'

In [None]:
# Treating outliers in continuous variables

In [None]:
cont_data = new_dataset.loc[:, high_card].apply( lambda x: x.clip( upper = x.quantile(0.99), lower = x.quantile(0.01)) )
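
The capping step above replaces values below the 1st percentile or above the 99th percentile with those percentile values, without removing any rows. A minimal illustration on made-up data:

```python
import pandas as pd

x = pd.Series(range(101), dtype="float")   # 0.0, 1.0, ..., 100.0

# Cap the series at its own 1st and 99th percentiles.
capped = x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99))

print(capped.min(), capped.max())   # the extremes move in to the percentile values
print(len(capped) == len(x))        # True: clipping never drops rows
```

Unlike filtering on the IQR fences, this keeps the sample size unchanged, which is why the capped frame can still be joined back to the other columns row by row.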

In [None]:
cont_data

In [None]:
# Treating nulls in continuous variables

In [None]:
new_dataset.loc[:, high_card].isna().sum()

# No nulls to treat

In [None]:
new_dataset.loc[:, low_card]

In [None]:
# Treating nulls in categorical variables

In [None]:
new_dataset.loc[:, low_card].isna().sum()

# There is no null value to treat

In [None]:
# Creating dummies for categorical variables

In [None]:
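# dtype='uint8' keeps the dummy columns selectable by select_dtypes(['uint8', ...]) later (newer pandas defaults to bool)
cat_data = pd.get_dummies(new_dataset.loc[:, low_card].astype('object'), drop_first=True, dtype='uint8')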
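
The effect of `drop_first=True` is easiest to see on a single made-up column. The explicit `dtype="uint8"` in this sketch is an assumption made so that the output is the same across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"flag": [0, 1, 2, 1]})   # made-up numeric codes for a categorical variable

# Casting to object makes get_dummies treat the codes as categories.
dummies = pd.get_dummies(df.astype("object"), drop_first=True, dtype="uint8")

print(dummies.columns.tolist())   # ['flag_1', 'flag_2']
print(dummies.values.tolist())    # [[0, 0], [1, 0], [0, 1], [1, 0]]
```

The first level (`flag_0`) is dropped and becomes the reference category: a row of all zeros means `flag` was 0. Dropping one level avoids a set of dummy columns that always sums to one.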

In [None]:
print(list(cat_data.columns))

In [None]:
cat_data.columns = [column.replace(".", "_") for column in cat_data.columns]

In [None]:
print(list(cat_data.columns))

In [None]:
# Merging the two datasets and y_variable to create a new dataset


data = pd.concat([new_dataset.loc[:, 'activity'], cat_data, cont_data], axis=1)

In [None]:
print(data.info())

## Feature selection

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif, RFE

from sklearn.linear_model import LogisticRegression

In [None]:
# Using "SelectKBest"

In [None]:
skb_feat = SelectKBest(score_func=f_classif, k=25)

In [None]:
skb_feat.fit(data.select_dtypes(['uint8', 'float', 'int']), data.loc[:, 'activity'])

In [None]:
skb_list = list((data.select_dtypes(['uint8', 'float', 'int']).columns)[skb_feat.get_support()])
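
`f_classif` scores each feature with a one-way ANOVA F-value: the variation between class means divided by the variation within classes. The NumPy sketch below computes that statistic by hand on made-up data; the helper `anova_f` is hypothetical and is meant only to show what a high or low score means.

```python
import numpy as np

def anova_f(x: np.ndarray, y: np.ndarray) -> float:
    """One-way ANOVA F-statistic of feature x across the classes in y."""
    classes = np.unique(y)
    grand_mean = x.mean()
    # Between-group and within-group sums of squares.
    ss_between = sum(len(x[y == c]) * (x[y == c].mean() - grand_mean) ** 2 for c in classes)
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    df_between = len(classes) - 1
    df_within = len(x) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

y = np.array(["attack"] * 4 + ["normal"] * 4)
informative = np.array([9.0, 10.0, 11.0, 10.0, 1.0, 2.0, 1.0, 2.0])  # class means differ
noisy = np.array([1.0, 5.0, 2.0, 6.0, 2.0, 6.0, 1.0, 5.0])           # class means are equal

scores = {"informative": anova_f(informative, y), "noisy": anova_f(noisy, y)}
best = max(scores, key=scores.get)
print(scores, best)
```

The feature whose values separate the two classes gets a large score, while the feature with identical class means scores zero. `SelectKBest` simply keeps the `k` features with the highest scores.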

In [None]:
# Using "RFE"

In [None]:
rfe_feat = RFE(estimator=LogisticRegression(), n_features_to_select= 25)

In [None]:
rfe_feat.fit(data.select_dtypes(['uint8', 'float', 'int']), data.loc[:, 'activity'])

In [None]:
rfe_list = list((data.select_dtypes(['uint8', 'float', 'int']).columns)[rfe_feat.get_support()])

In [None]:
var_selected = list(set(rfe_list + skb_list))

In [None]:
formula_like = 'activity ~ ' + ' + '.join(var_selected)

In [None]:
# Removing multicollinearity

In [None]:
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
data = pd.concat([data.loc[:, 'activity'], data.loc[:, var_selected]], axis=1)

In [None]:
y, x = dmatrices(formula_like=formula_like, data=data, return_type='dataframe')

In [None]:
vif = pd.DataFrame()

In [None]:
vif['features'] = x.columns

In [None]:
vif['vif_factor'] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]

In [None]:
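# Keep features with VIF <= 5; exclude patsy's 'Intercept' column, which is not a feature in `data`
var_selected = list(vif.loc[(vif.vif_factor <= 5) & (vif.features != 'Intercept'), 'features'])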
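
The variance inflation factor of a feature is `1 / (1 - R²)`, where `R²` comes from regressing that feature on all the others. The NumPy sketch below implements this definition directly on synthetic data; the helper name `vif` and the data are assumptions for illustration, not the statsmodels implementation.

```python
import numpy as np

def vif(X: np.ndarray, i: int) -> float:
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    # Add an intercept so R^2 is measured around the mean of the target column.
    design = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r2 = 1.0 - residuals.var() / target.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)   # nearly a copy of column a
X = np.column_stack([a, b, c])

vifs = [vif(X, i) for i in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

Columns `a` and `c` carry almost the same information, so both get a very large VIF, while the independent column `b` stays close to 1. With a cut-off of 5, as used above, both `a` and `c` would be dropped in a single pass; removing the highest-VIF feature one at a time and recomputing is a common, more conservative alternative.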

## Making data balanced

In [None]:
# Checking whether the data is balanced or imbalanced


data.activity.value_counts() / data.activity.count()


# Data is imbalanced

In [None]:
# Making data balanced

# Using Under sampling technique for this

from imblearn.under_sampling import RandomUnderSampler

In [None]:
undersampler = RandomUnderSampler(sampling_strategy='auto', random_state=42)

In [None]:
x_resampled, y_resampled = undersampler.fit_resample(data.loc[:, var_selected], data.loc[:, 'activity'])

In [None]:
print(x_resampled.shape)
print(y_resampled.shape)

In [None]:
data = pd.concat([y_resampled, x_resampled], axis=1)
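
Random under-sampling reduces the larger class to the size of the smaller one by discarding randomly chosen rows. The pandas-only sketch below reproduces the idea on made-up data; it illustrates the technique and is not the `imblearn` implementation.

```python
import pandas as pd

# Made-up imbalanced data: 8 attack rows and 2 normal rows.
df = pd.DataFrame({
    "src_bytes": range(10),
    "activity": ["attack"] * 8 + ["normal"] * 2,
})

# Size of the smallest class; every class is reduced to this many rows.
n_min = df["activity"].value_counts().min()
balanced = df.groupby("activity").sample(n=n_min, random_state=42)

print(balanced["activity"].value_counts().to_dict())
```

Both classes end up with two rows each. The price of under-sampling is that the discarded majority rows are no longer available for training.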

## Train test split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
trainx, testx, train_y, test_y = train_test_split(data.loc[:, var_selected], data.loc[:, 'activity'], random_state=42, test_size=0.3)

## Standardize the data

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
std = StandardScaler()

In [None]:
train_x = std.fit_transform(trainx)

In [None]:
test_x = std.transform(testx)

In [None]:
train_x = pd.DataFrame(train_x, columns=trainx.columns)
test_x = pd.DataFrame(test_x, columns=testx.columns)
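
The scaler is fitted on the training rows only, and the test rows are then transformed with the training mean and standard deviation. The NumPy sketch below shows the same arithmetic on made-up numbers.

```python
import numpy as np

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics come from the training rows only (population std, ddof=0).
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # the test set reuses the training statistics

print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1, while the test value is simply expressed in training-set units. Fitting the scaler on the test data as well would leak information about the test set into the model.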

# Model 1: Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [None]:
param_grid = {'penalty' : ['l1', 'l2'], 'C': [0.1, 0.01, 0.5, 0.05, 0.25, 0.75]}

In [None]:
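# liblinear supports both the 'l1' and 'l2' penalties in the grid; the default lbfgs solver does not support 'l1'
LR = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, verbose=True)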

In [None]:
LR.fit(train_x, train_y)

In [None]:
df_LR_train = pd.DataFrame({'Actual': train_y, 'Predicted': LR.predict(train_x)})
df_LR_test = pd.DataFrame({'Actual': test_y, 'Predicted': LR.predict(test_x)})

In [None]:
# Label encoding for y_variable



from sklearn.preprocessing import LabelEncoder


In [None]:
LE = LabelEncoder()

In [None]:
# Fit the encoder once so that 'attack' and 'normal' map to the same codes in every column
LE.fit(df_LR_train['Actual'])

df_LR_test['Actual'] = LE.transform(df_LR_test['Actual'])
df_LR_test['Predicted'] = LE.transform(df_LR_test['Predicted'])

df_LR_train['Actual'] = LE.transform(df_LR_train['Actual'])
df_LR_train['Predicted'] = LE.transform(df_LR_train['Predicted'])

# Model evaluation

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score

In [None]:
# Classification report of Test data

print(classification_report(df_LR_test.Actual, df_LR_test.Predicted))

In [None]:
# Classification report of Train data

print(classification_report(df_LR_train.Actual, df_LR_train.Predicted))
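
The precision, recall and F1 values in a classification report follow directly from the counts of true positives, false positives and false negatives. The pure-Python sketch below computes them for one class on made-up labels; the helper `binary_metrics` is hypothetical.

```python
def binary_metrics(actual, predicted, positive=1):
    """Precision, recall and F1 for the class labelled `positive`."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(actual, predicted))   # (0.75, 0.75, 0.75)
```

With three true positives, one false positive and one false negative, precision and recall are both 3/4. In an intrusion detection setting, recall on the attack class shows how many attacks were caught, and precision shows how many alerts were real.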

In [None]:
plt.figure(figsize=(6,5))
# LabelEncoder sorts the classes alphabetically, so 0 = attack and 1 = normal
sns.heatmap(confusion_matrix(df_LR_test.Actual, df_LR_test.Predicted), annot=True, fmt='.0f', xticklabels = ["attack", "normal"] , yticklabels = ["attack", "normal"])
plt.title('Confusion matrix (test data)', pad=15)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()

In [None]:
print('roc auc score of train data',roc_auc_score(df_LR_train.Actual, df_LR_train.Predicted))
print('roc auc score of test data',roc_auc_score(df_LR_test.Actual, df_LR_test.Predicted))
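
ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. It is usually computed from continuous scores such as predicted probabilities; when hard 0/1 predictions are supplied, as in the cell above, the value reduces to the average of the true-positive and true-negative rates. The sketch below, on made-up numbers, shows the difference; `roc_auc` is a hypothetical helper, not the scikit-learn function.

```python
def roc_auc(actual, scores):
    """AUC as the probability that a positive outranks a negative (ties count as half)."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

actual = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]   # hypothetical predicted P(class 1)
hard_labels = [int(p >= 0.5) for p in probabilities]

print(roc_auc(actual, probabilities))   # 8/9, about 0.889
print(roc_auc(actual, hard_labels))     # 6/9, about 0.667
```

The probability-based value uses the full ranking of the examples, whereas the hard-label value only sees the single 0.5 threshold, so the two can differ noticeably for the same model.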

# The model performs well: accuracy is reported as 100%, with recall and precision at 99%.