# Assignment 2
## Author: Emily McAfee
### Bad v. Good connections

You are involved in a project where you are tasked with building a machine learning algorithm that distinguishes between "bad" connections (called intrusions or attacks) and "good" (normal) connections. Note that the number of normal connections is greater than that of bad ones. The following code trains a model to predict "bad" connections.

In [1]:
import pandas as pd
import numpy as np

import imblearn
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

import matplotlib.pyplot as plt 
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
%matplotlib inline

### 1. Read data

In [2]:
# Create file name
filename = 'https://library.startlearninglabs.uw.edu/DATASCI420/2019/Datasets/Intrusion%20Detection.csv'

In [3]:
# Create dataframe
cdf = pd.read_csv(filename)

In [4]:
# Check data
cdf.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,Class
0,0,tcp,http,SF,181,5450,0,0,0,0,...,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,0
1,0,tcp,http,SF,239,486,0,0,0,0,...,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0
2,0,tcp,http,SF,235,1337,0,0,0,0,...,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0
3,0,tcp,http,SF,219,1337,0,0,0,0,...,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0
4,0,tcp,http,SF,217,2032,0,0,0,0,...,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0


### 2. Build a classifier

In [5]:
# Check data
#cdf.dtypes
#sum(cdf.Class)
#len(cdf.Class)

# Check how many groups there are in 'object' features
#cdf.protocol_type.unique()
#cdf.flag.unique()
cdf.service.unique()

array(['http', 'smtp', 'finger', 'domain_u', 'auth', 'telnet', 'ftp',
       'eco_i', 'ntp_u', 'ecr_i', 'other', 'pop_3', 'ftp_data', 'ssh',
       'domain', 'private', 'time', 'shell', 'IRC', 'urh_i', 'X11',
       'urp_i', 'tftp_u', 'tim_i', 'red_i'], dtype=object)

In [6]:
# Further check service data to see if groups can be reduced
cdf.service.value_counts()

http        61886
smtp         9598
private      7366
domain_u     5862
other        5632
ftp_data     3806
urp_i         537
finger        468
eco_i         389
ntp_u         380
ftp           374
ecr_i         345
telnet        240
auth          220
pop_3          79
time           52
IRC            42
urh_i          14
X11             9
domain          3
tim_i           2
shell           1
ssh             1
red_i           1
tftp_u          1
Name: service, dtype: int64

In [7]:
# Reduce groups in 'service' feature to avoid overfitting to training data
cdf2 = cdf.copy()
# Keep the six most frequent services; fold every other value into 'other'
keep = ['http', 'smtp', 'private', 'domain_u', 'other', 'ftp_data']
cdf2['service'] = cdf2['service'].where(cdf2['service'].isin(keep), 'other')

cdf2.service.unique()

array(['http', 'smtp', 'other', 'domain_u', 'ftp_data', 'private'],
      dtype=object)
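
The grouping above was chosen by reading the `value_counts()` output. As a minimal sketch of the same idea driven by a count threshold instead of a hand-written list, the helper below folds every category seen fewer than `min_count` times into `'other'`. The function name `collapse_rare`, the threshold, and the toy counts are all assumptions for illustration, and the sketch uses plain Python lists rather than the notebook's DataFrame.

```python
from collections import Counter

def collapse_rare(values, min_count, other_label="other"):
    """Replace every category seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Toy data standing in for the 'service' column (counts are made up).
services = ["http"] * 6 + ["smtp"] * 3 + ["ssh"] + ["X11"]
collapsed = collapse_rare(services, min_count=3)
print(Counter(collapsed))
```

A threshold makes the cut-off explicit and repeatable, at the cost of having to justify the number that is chosen.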

In [8]:
# Show which variables are categorical
cdf2.dtypes

duration                         int64
protocol_type                   object
service                         object
flag                            object
src_bytes                        int64
dst_bytes                        int64
land                             int64
wrong_fragment                   int64
urgent                           int64
hot                              int64
num_failed_logins                int64
logged_in                        int64
num_compromised                  int64
root_shell                       int64
su_attempted                     int64
num_root                         int64
num_file_creations               int64
num_shells                       int64
num_access_files                 int64
num_outbound_cmds                int64
is_host_login                    int64
is_guest_login                   int64
count                            int64
srv_count                        int64
serror_rate                    float64
srv_serror_rate          

In [9]:
# Create dummy variables for categorical variables
cdf3 = cdf2.copy()
protocol_type = pd.get_dummies(cdf3['protocol_type'], prefix = 'protocol_type')
service = pd.get_dummies(cdf3['service'], prefix = 'service')
flag = pd.get_dummies(cdf3['flag'], prefix = 'flag')

# Bring together in one big dataframe
cdf3 = pd.concat([cdf3, protocol_type, service, flag], axis = 1)

# Get rid of old categorical columns
cdf4 = cdf3.drop(['protocol_type','service', 'flag'], axis = 1)

# Check new data frame
cdf4.dtypes

duration                         int64
src_bytes                        int64
dst_bytes                        int64
land                             int64
wrong_fragment                   int64
urgent                           int64
hot                              int64
num_failed_logins                int64
logged_in                        int64
num_compromised                  int64
root_shell                       int64
su_attempted                     int64
num_root                         int64
num_file_creations               int64
num_shells                       int64
num_access_files                 int64
num_outbound_cmds                int64
is_host_login                    int64
is_guest_login                   int64
count                            int64
srv_count                        int64
serror_rate                    float64
srv_serror_rate                float64
rerror_rate                    float64
srv_rerror_rate                float64
same_srv_rate            

In [10]:
# Establish features
x = cdf4.loc[:, cdf4.columns != 'Class']
#x = x2.drop(['serror_rate', 'rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_count',
          #'srv_serror_rate', 'srv_rerror_rate', 'srv_diff_host_rate'], axis = 1)

# Establish target label (variable we are trying to predict)
y = cdf4.Class
# y = y.to_frame()

In [11]:
# Split the data into training (75%) and test (25%) sets
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=0)

In [12]:
# Start model
logreg = LogisticRegression(max_iter = 10000)

# Fit model with data
logreg.fit(x_train,y_train)

ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=10000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
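
The warning above suggests scaling the data. Our features range from byte counts in the thousands to rates between 0 and 1, which can make the solver struggle. The following is a minimal sketch of standardization (z-scoring) in plain Python, assuming toy rows and hypothetical helper names `fit_scaler` and `transform`. It is not the code that produced the results in this notebook. scikit-learn's `StandardScaler` performs the same transformation.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Return per-column (mean, std) computed from the training rows only."""
    columns = list(zip(*rows))
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    return [(mean(c), pstdev(c) or 1.0) for c in columns]

def transform(rows, params):
    """Apply z-scoring with previously fitted parameters."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in rows]

# Toy rows: (src_bytes, same_srv_rate); values are illustrative only.
train = [[181, 1.0], [239, 0.5], [235, 0.0], [54000, 0.5]]
test = [[217, 1.0]]

params = fit_scaler(train)
train_scaled = transform(train, params)
test_scaled = transform(test, params)  # reuse the training statistics
```

The scaling parameters come from the training rows only and are then reused on the test rows, so no information from the test set leaks into the fit.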

### 3. Determine your model accuracy

In [None]:
# See how well we predicted with accuracy (with class imbalances)
y_pred = logreg.predict(x_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(x_test, y_test)))

In [None]:
# See how well we predicted with a confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)  # separate name so the imported function is not overwritten
print(cm)
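
Accuracy alone can hide what happens to the minority class, so it helps to read precision and recall for the 'bad' class straight off the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, so the matrix is `[[TN, FP], [FN, TP]]`). The counts are hypothetical and are not the output of the cell above.

```python
def summarize(cm):
    """Return (accuracy, precision, recall) for the positive class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against empty denominators when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts: a model that labels every connection as 'good'.
cm = [[990, 0], [10, 0]]
accuracy, precision, recall = summarize(cm)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With these made-up counts the accuracy is 0.99 while recall on 'bad' connections is 0, which is the pattern to watch for on imbalanced data.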

### 4. Modify data by handling class imbalance

In [None]:
# What did our imbalance look like before SMOTE
print('Original dataset shape {}'.format(Counter(y)))

# Apply SMOTE
sm = SMOTE(random_state=42)
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
x_res, y_res = sm.fit_resample(x, y)
print('Resampled dataset shape {}'.format(Counter(y_res)))
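
To make the resampling step less of a black box, here is a minimal sketch of the core SMOTE idea: a synthetic minority sample is placed at a random position on the line segment between a minority point and one of its minority-class neighbors. The helper name `synthetic_point` and the toy points are assumptions. The real algorithm also performs a k-nearest-neighbor search, which is skipped here.

```python
import random

def synthetic_point(sample, neighbor, rng):
    """Interpolate a new point on the segment between sample and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(42)
minority = [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]

# Real SMOTE picks the neighbor among the k nearest minority points;
# here the pair is chosen by hand to keep the sketch short.
new_point = synthetic_point(minority[0], minority[1], rng)
print(new_point)
```

Because the new values are interpolated, 0/1 dummy columns can end up with fractional values. imbalanced-learn provides `SMOTENC` for data that mixes numeric and categorical features, which may be worth considering here.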

### 5. Use the same model on updated (balanced) data

In [None]:
# Train/test for SMOTE data
x_train,x_test,y_train,y_test=train_test_split(x_res,y_res,test_size=0.25,random_state=0)

logreg = LogisticRegression(max_iter = 10000)  # same settings as the first model

# Fit the model with now balanced data
logreg.fit(x_train,y_train)
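
One caveat with the cell above: the data is resampled before it is split, so the test set contains synthetic points and is balanced, unlike real traffic. A common alternative is to split first and resample only the training portion. The sketch below illustrates that ordering in plain Python under simplifying assumptions: a fixed every-fourth-row holdout stands in for `train_test_split`, and random duplication stands in for SMOTE. The helper name `oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def oversample(rows, labels, rng):
    """Duplicate minority rows at random until both classes are equal in size."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    pool = [r for r, l in zip(rows, labels) if l == minority]
    n_extra = max(counts.values()) - counts[minority]
    extra = [rng.choice(pool) for _ in range(n_extra)]
    return rows + extra, labels + [minority] * n_extra

rng = random.Random(0)
rows = [[i] for i in range(100)]
labels = [1 if i < 10 else 0 for i in range(100)]  # 10% 'bad' connections

# Step 1: hold out the test rows before any resampling happens.
test_idx = [i for i in range(len(rows)) if i % 4 == 0]
train_idx = [i for i in range(len(rows)) if i % 4 != 0]
train_x = [rows[i] for i in train_idx]
train_y = [labels[i] for i in train_idx]
test_x = [rows[i] for i in test_idx]
test_y = [labels[i] for i in test_idx]

# Step 2: balance the training rows only; the test rows stay untouched.
train_x_bal, train_y_bal = oversample(train_x, train_y, rng)

print(Counter(train_y_bal), Counter(test_y))
```

With this ordering the model is still trained on balanced data, but it is evaluated on the original class ratio, which is closer to what it would see in practice.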

### 6. What is the accuracy

In [None]:
# New accuracy
y_pred=logreg.predict(x_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(x_test, y_test)))

# New confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)  # separate name so the imported function is not overwritten
print(cm)

In [None]:
# Show information in ROC curve
logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(x_test)[:,1])  # score on probabilities so the area matches the plotted curve
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(x_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
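
The area under the ROC curve can be read as the probability that a randomly chosen 'bad' connection receives a higher score than a randomly chosen 'good' one. The sketch below computes it that way in plain Python (the helper name `auc_from_scores` and the toy scores are assumptions) and shows that thresholded 0/1 labels can give a different value than the underlying probabilities.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.45, 0.8]
hard_labels = [int(p >= 0.5) for p in probabilities]  # thresholded at 0.5

auc_prob = auc_from_scores(y_true, probabilities)
auc_hard = auc_from_scores(y_true, hard_labels)
print(auc_prob, auc_hard)
```

In this toy case the probabilities rank every 'bad' example above every 'good' one (AUC 1.0), while the hard labels lose that ordering and score 0.75. This is why the AUC is normally computed from `predict_proba` output rather than from `predict`.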

### 7. Describe your findings
The data we are working with here contains information on connections between computers. We wanted to know whether we could predict which connections are 'bad' so that we can avoid being hacked or infected with viruses. Luckily, there are far more 'good' connections than bad ones, but this means that our data has a major class imbalance. To see the impact a class imbalance can have on a model, we first applied a logistic regression model to the imbalanced data.

When doing this, we found a 99% (wow!) accuracy rate in predicting the connection type. However, the large discrepancy between the class sizes means that the model will tend to favor the larger class (in this case the 'good' connections). As a result, even if the model only pays attention to those cases, the accuracy will still be very high. We can see this in the confusion matrix, where we got zero correct predictions for 'bad' connections, which is not good for us, since detecting them is the entire point of the model. To improve the model, we can apply the SMOTE algorithm. By creating synthetic samples of our minority class, we can balance the two classes. After this balancing, our model still performs very well (93% accuracy) on our test data and has a more reasonable confusion matrix. This is echoed in the ROC visualization, where our curve (blue) is far away from what would be a purely random classifier (red dotted).