# Assignment 2

# KDD Cup '99 Dataset Analysis and Classification


# Project Introduction

In this project, we will work with the KDD Cup 1999 dataset. The task involves building a machine learning model for network intrusion detection. However, there are specific conditions and steps we need to follow:

## Project Conditions

1. **Use pandas Profiling**: We will use the pandas profiling library (now distributed as `ydata-profiling`) to gain insights into the dataset's characteristics, such as data distributions and missing values.
2. **Plot Heatmap for Categorical Variables**: We will create a heatmap to visualize the relationships and correlations between categorical variables without encoding them.
3. **Plot Heatmap Using Scatter Plot**: Another heatmap will be generated using scatter plots to visualize the data distribution, particularly focusing on how different categorical variables relate to each other.
4. <font color="red">**Avoid Encoding**: We will not encode categorical variables using techniques like one-hot encoding or label encoding. Instead, we will work with the raw categorical data. (Not done: the categorical columns are label-encoded before model fitting.)</font>
5. **No Scaling Techniques**: We will not apply scaling techniques, such as standardization or normalization, to the data.
6. **Model Fitting**: We will build a machine learning model using these unencoded and unscaled categorical variables.
7. **Confusion Matrix**: We will evaluate the model's performance by displaying a confusion matrix.

## Let's Get Started!

Now that we have outlined the conditions and steps, let's dive into the project and meet each requirement one by one.


In [None]:
#conda install -c conda-forge ydata-profiling

In [1]:
import pandas as pd
import warnings
from ydata_profiling import ProfileReport
warnings.filterwarnings('ignore')



In [2]:
import gzip  # to decompress the .gz archive
import shutil  # for file copy operations
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:

# Define the column names 
cols = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment',
    'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted',
    'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login',
    'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate',
    'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
    'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'target'
]


In [4]:
input_file2 = 'kddcup.data_10_percent.gz'
output_file2 = 'kddcup2.csv'

In [6]:
with gzip.open(input_file2, 'rb') as f_in2, open(output_file2, 'wb') as f_out2:
    shutil.copyfileobj(f_in2, f_out2)


In [7]:
df= pd.read_csv(output_file2, header=None, names=cols)
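
The separate decompression step is optional in practice, since pandas can read gzip-compressed CSV files directly. A minimal sketch, assuming a tiny synthetic file (the name `sample.csv.gz` and its three-column layout are made up for illustration and are not the KDD file):

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical miniature stand-in for kddcup.data_10_percent.gz
rows = "0,tcp,normal.\n2,udp,smurf.\n"
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.csv.gz")
with gzip.open(gz_path, "wt") as f:
    f.write(rows)

# Read the compressed file without writing an uncompressed copy first
sample_df = pd.read_csv(
    gz_path,
    header=None,
    names=["duration", "protocol_type", "target"],
    compression="gzip",
)
print(sample_df.shape)
```

With the real data, the same call would take `input_file2` and `cols` in place of the illustrative path and column names.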

In [8]:
profile = ProfileReport(df, config_file="profile_config.yaml")

In [9]:
# Convert the report to an HTML file
profile.to_file("profile_report.html")


In [10]:
# Map each attack label to one of five categories
attacks_types = {
    'normal': 'normal',
    'back': 'dos',
    'buffer_overflow': 'u2r',
    'ftp_write': 'r2l',
    'guess_passwd': 'r2l',
    'imap': 'r2l',
    'ipsweep': 'probe',
    'land': 'dos',
    'loadmodule': 'u2r',
    'multihop': 'r2l',
    'neptune': 'dos',
    'nmap': 'probe',
    'perl': 'u2r',
    'phf': 'r2l',
    'pod': 'dos',
    'portsweep': 'probe',
    'rootkit': 'u2r',
    'satan': 'probe',
    'smurf': 'dos',
    'spy': 'r2l',
    'teardrop': 'dos',
    'warezclient': 'r2l',
    'warezmaster': 'r2l',
}


In [11]:
# Add the category as a new column; r[:-1] strips the trailing '.' from each label
df['Attack_Type'] = df.target.apply(lambda r:attacks_types[r[:-1]])
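
The lambda above raises a `KeyError` if a label is missing from the dictionary. As an alternative sketch, `Series.map` returns `NaN` for unmapped labels, which makes them easy to list. The small `attack_map` and `target` objects below are hypothetical stand-ins, not the notebook's data:

```python
import pandas as pd

# Hypothetical miniature versions of the notebook's mapping and target column
attack_map = {"normal": "normal", "smurf": "dos", "nmap": "probe"}
target = pd.Series(["normal.", "smurf.", "nmap.", "unknown_attack."])

# Strip the trailing '.' and look each label up; unmapped labels become NaN
attack_type = target.str.rstrip(".").map(attack_map)

# List the raw labels that have no category before going any further
unmapped = target[attack_type.isna()].unique().tolist()
print(unmapped)
```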

In [12]:
df.shape

(494021, 43)

In [15]:
from scipy.stats import chi2_contingency


# Function to calculate Cramer's V
def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(confusion_matrix)
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

# Calculate Cramer's V for each pair of categorical variables in your DataFrame
categorical_columns = df.select_dtypes(include=['object']).columns
cramer_matrix = pd.DataFrame(np.zeros((len(categorical_columns), len(categorical_columns))), columns=categorical_columns, index=categorical_columns)

for var1 in categorical_columns:
    for var2 in categorical_columns:
        cramer_matrix.loc[var1, var2] = cramers_v(df[var1], df[var2])


plt.figure(figsize=(10, 8))
sns.heatmap(cramer_matrix, cmap="coolwarm", annot=True, fmt=".2f")
plt.title("Cramer's V Correlation Matrix")
plt.savefig("cramerheatmap.png")



![Cramer's V Heatmap](cramerheatmap.png)
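
For reference, the `cramers_v` function above computes the bias-corrected form of Cramér's V. Written out to mirror the code, with $\chi^2$ the chi-squared statistic of the $r \times k$ contingency table and $n$ the total number of observations:

```latex
\begin{aligned}
\phi^2 &= \frac{\chi^2}{n}, \\
\tilde{\phi}^2 &= \max\!\left(0,\ \phi^2 - \frac{(k-1)(r-1)}{n-1}\right), \\
\tilde{r} &= r - \frac{(r-1)^2}{n-1}, \qquad \tilde{k} = k - \frac{(k-1)^2}{n-1}, \\
V &= \sqrt{\frac{\tilde{\phi}^2}{\min(\tilde{k}-1,\ \tilde{r}-1)}}.
\end{aligned}
```

The result lies between 0 (no association) and 1 (perfect association). Unlike a correlation coefficient, it carries no sign.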


protocol_type vs. service: High association (0.867).

protocol_type vs. flag: Moderate association (0.494).

protocol_type vs. target: High association (0.756).

protocol_type vs. Attack_Type: Moderate association (0.444).

service vs. flag: Low association (0.315).

service vs. target: Low association (0.380).

service vs. Attack_Type: Moderate association (0.595).

flag vs. target: Moderate association (0.465).

flag vs. Attack_Type: Low association (0.258).

In [16]:
numeric_df = df.select_dtypes(exclude=['object'])  # Select only numeric columns

In [18]:

# Calculate the correlation matrix
corr_matrix = numeric_df.corr()

# lists to store coordinates, sizes, and colors
x_coords, y_coords, bubble_sizes, bubble_colors = [], [], [], []


cmap = plt.get_cmap('coolwarm')

# threshold for considering correlations as "highly correlated"
threshold = 0.7  

# Iterate through the variables
for i in range(corr_matrix.shape[0]):
    for j in range(corr_matrix.shape[1]):
        correlation = corr_matrix.iloc[i, j]
        x_coords.append(i)
        y_coords.append(j)
        
        # Check if the correlation is above the threshold for "highly correlated"
        if abs(correlation) >= threshold:
            # Make the bubble size larger for highly correlated variables
            bubble_sizes.append(100)
            # Use a dark color for highly correlated variables
            bubble_colors.append('darkred')
        else:
            # Make the bubble size smaller for less correlated variables
            bubble_sizes.append(abs(correlation) * 100)
            # Rescale the correlation from [-1, 1] to [0, 1], the range the colormap expects
            bubble_colors.append(cmap((correlation + 1) / 2))

# Create a bubble plot with varying colors and sizes
plt.figure(figsize=(12, 10))
plt.scatter(x_coords, y_coords, s=bubble_sizes, c=bubble_colors, alpha=0.7)

# Add variable names to the axes
plt.xticks(range(corr_matrix.shape[0]), numeric_df.columns, rotation=90)
plt.yticks(range(corr_matrix.shape[1]), numeric_df.columns)

# Set axis labels
plt.xlabel('Features')
plt.ylabel('Features')

# Title
plt.title('Correlation Bubble Plot')

# Create a colorbar that matches the colormap used for the bubbles
# (explicit per-point colors carry no scale, so the mappable is built by hand)
color_scale = plt.cm.ScalarMappable(norm=plt.Normalize(vmin=-1, vmax=1), cmap=cmap)
cbar = plt.colorbar(color_scale, ax=plt.gca())
cbar.set_label('Correlation')

# Save the plot
plt.savefig('Numcorrmat.png')


![corrmat](Numcorrmat.png)
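
The same 0.7 threshold can also be read off as a list of feature pairs rather than from the plot. A minimal sketch on synthetic data (the `demo` frame and its columns `a`, `b`, `c` are invented for illustration; with the notebook's data, `numeric_df.corr()` would take the place of `demo.corr()`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for numeric_df: 'b' tracks 'a' closely, 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

corr = demo.corr()
threshold = 0.7  # same cut-off as the bubble plot

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong_pairs = pairs[pairs.abs() >= threshold]
print(strong_pairs)
```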


In [19]:
categorical_columns

Index(['protocol_type', 'service', 'flag', 'target', 'Attack_Type'], dtype='object')

In [20]:
from sklearn.preprocessing import LabelEncoder

In [21]:
# Initialize label encoders
label_encoders = {}

for column in categorical_columns:
    label_encoder = LabelEncoder()
    df[column + '_encoded'] = label_encoder.fit_transform(df[column])
    label_encoders[column] = label_encoder
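
`LabelEncoder` assigns integer codes in sorted order of the labels, which is why the class numbers in the later report follow alphabetical order. A small sketch of recovering the code-to-label mapping from `classes_`, using a hand-written list of the five categories rather than the full column:

```python
from sklearn.preprocessing import LabelEncoder

# The five attack categories used in this notebook, in arbitrary order with a repeat
categories = ["normal", "dos", "probe", "r2l", "u2r", "dos"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

# classes_ holds the sorted unique labels; position i is the label encoded as i
code_to_label = {i: str(label) for i, label in enumerate(encoder.classes_)}
print(code_to_label)
```

In the notebook, the equivalent lookup would be `label_encoders['Attack_Type'].classes_`.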



In [22]:
df.head(5)

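| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | target | Attack_Type | protocol_type_encoded | service_encoded | flag_encoded | target_encoded | Attack_Type_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | http | SF | 181 | 5450 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | normal. | normal | 1 | 22 | 9 | 11 | 1 |
| 1 | 0 | tcp | http | SF | 239 | 486 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | normal. | normal | 1 | 22 | 9 | 11 | 1 |
| 2 | 0 | tcp | http | SF | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | normal. | normal | 1 | 22 | 9 | 11 | 1 |
| 3 | 0 | tcp | http | SF | 219 | 1337 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | normal. | normal | 1 | 22 | 9 | 11 | 1 |
| 4 | 0 | tcp | http | SF | 217 | 2032 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | normal. | normal | 1 | 22 | 9 | 11 | 1 |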


In [23]:
df=df.drop(columns=['protocol_type', 'service', 'flag', 'target', 'Attack_Type','target_encoded'])

In [24]:
df.head()

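| | duration | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | num_failed_logins | logged_in | num_compromised | ... | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | protocol_type_encoded | service_encoded | flag_encoded | Attack_Type_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 181 | 5450 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 22 | 9 | 1 |
| 1 | 0 | 239 | 486 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 22 | 9 | 1 |
| 2 | 0 | 235 | 1337 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 22 | 9 | 1 |
| 3 | 0 | 219 | 1337 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 22 | 9 | 1 |
| 4 | 0 | 217 | 2032 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 22 | 9 | 1 |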


In [25]:
df.shape

(494021, 42)

In [26]:
# Split your data into features (X) and target variable (y)
X = df.drop('Attack_Type_encoded', axis=1)
y = df['Attack_Type_encoded']

In [27]:
# Split your data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
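
Some attack categories are very rare, so a plain random split can leave them with only a handful of test instances. One optional variation, which the notebook does not use, is a stratified split that keeps class proportions equal in both sets. A sketch on synthetic labels (`X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 990 samples of class 0 and 10 of class 1
y_demo = np.array([0] * 990 + [1] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(np.bincount(y_te))
```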


In [28]:
from sklearn.tree import DecisionTreeClassifier

In [29]:
# Create a Decision Tree Classifier
model = DecisionTreeClassifier()

In [30]:
# Train the Decision Tree model
model.fit(X_train, y_train)

In [31]:
y_pred = model.predict(X_test)
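
Because `DecisionTreeClassifier()` is created without a `random_state`, the fitted tree, and with it the reported metrics, may differ slightly from run to run. A minimal sketch of pinning the seed, assuming synthetic data from `make_classification` rather than the KDD data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification problem used only to demonstrate reproducibility
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Two trees fitted with the same seed on the same data
tree_a = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
tree_b = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Compare the split features of every node and the held-out predictions
same_structure = np.array_equal(tree_a.tree_.feature, tree_b.tree_.feature)
same_predictions = np.array_equal(tree_a.predict(X_te), tree_b.predict(X_te))
print(same_structure, same_predictions)
```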

In [32]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

In [33]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.9993927432822226


In [38]:
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,square=True,linewidths=0.5, linecolor="k")
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix')

![Confusion Matrix](confusion_matrix.png)
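
Per-class precision and recall can be read directly off a confusion matrix: scikit-learn puts true classes in rows and predicted classes in columns. A short sketch with a hypothetical 3-class matrix (the numbers in `cm_demo` are invented, not taken from the model above):

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class)
cm_demo = np.array([
    [50, 0, 0],
    [2, 45, 3],
    [0, 5, 5],
])

correct = np.diag(cm_demo)
recall = correct / cm_demo.sum(axis=1)     # correct / actual instances per class
precision = correct / cm_demo.sum(axis=0)  # correct / predicted instances per class
print(recall.round(2), precision.round(2))
```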


In [39]:
from sklearn.metrics import classification_report

In [40]:
report = classification_report(y_test, y_pred)

In [37]:
print(report)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     78355
           1       1.00      1.00      1.00     19353
           2       0.99      0.99      0.99       850
           3       0.93      0.94      0.93       235
           4       0.70      0.58      0.64        12

    accuracy                           1.00     98805
   macro avg       0.92      0.90      0.91     98805
weighted avg       1.00      1.00      1.00     98805
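
The two average rows differ because the macro average treats every class equally, while the weighted average weights each class by its support. A sketch that recomputes both for precision from the rounded per-class values in the report above, so the results are approximate:

```python
# Per-class precision and support copied from the report above
precision = [1.00, 1.00, 0.99, 0.93, 0.70]
support = [78355, 19353, 850, 235, 12]

# Macro average: plain mean over classes
macro_precision = sum(precision) / len(precision)

# Weighted average: mean weighted by the number of true instances per class
weighted_precision = sum(p * s for p, s in zip(precision, support)) / sum(support)

print(round(macro_precision, 2), round(weighted_precision, 2))
```

The rare classes pull the macro average down to about 0.92, while the weighted average stays near 1.00 because the two largest classes dominate the test set.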



Class 0 (DoS):

    Precision: Precision for class 0 rounds to 1.00, so when the model predicts an instance as "DoS," it is almost always correct.
    Recall: Recall for class 0 rounds to 1.00, meaning the model rarely misses an actual "DoS" instance.
    F1-Score: The F1-score for class 0 is 1.00. As the harmonic mean of precision and recall, it indicates excellent performance for this class.
    Support: The support for class 0 is 78,355, which is the number of actual "DoS" instances in the test dataset.

Class 1 (Normal):

    Precision: Precision for class 1 rounds to 1.00. As with class 0, when the model predicts an instance as "Normal," it is almost always correct.
    Recall: Recall for class 1 also rounds to 1.00, meaning the model rarely misses an actual "Normal" instance.
    F1-Score: The F1-score for class 1 is 1.00. The model performs exceptionally well for this class.
    Support: The support for class 1 is 19,353, which is the number of actual "Normal" instances in the test dataset.

Class 2 (Probe):

    Precision: Precision for class 2 is 0.99, indicating that nearly all positive predictions for this class were correct. The model is very accurate in predicting "Probe" instances.
    Recall: Recall for class 2 is 0.99, meaning that nearly all actual instances of "Probe" were correctly predicted by the model. The model rarely misses any actual instances of this class.
    F1-Score: The F1-score for class 2 is 0.99, indicating a high balance between precision and recall. The model performs very well for this class.
    Support: The support for class 2 is 850, representing the number of actual "Probe" instances in the test dataset.

Class 3 (R2L):

    Precision: Precision for class 3 is 0.93, indicating that a high proportion of positive predictions for this class were correct. The model is quite accurate in predicting "R2L" instances.
    Recall: Recall for class 3 is 0.94, meaning that the model correctly predicts most actual "R2L" instances but misses a few.
    F1-Score: The F1-score for class 3 is 0.93, showing a good balance between precision and recall. The model performs well for this class.
    Support: The support for class 3 is 235, which is the number of actual "R2L" instances in the test dataset.

Class 4 (U2R):

    Precision: Precision for class 4 is 0.70, indicating that 70% of the positive predictions for this class were correct. The model's precision for "U2R" instances is relatively low.
    Recall: Recall for class 4 is 0.58, meaning that the model correctly predicts more than half of the actual "U2R" instances but misses the rest.
    F1-Score: The F1-score for class 4 is 0.64, indicating a moderate balance between precision and recall. The model's performance for this class is not as high as for the other classes.
    Support: The support for class 4 is 12, which is the number of actual "U2R" instances in the test dataset. With so few instances, these scores are sensitive to individual predictions.

# Conclusion




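In conclusion, we completed the project and met the specified conditions, with the exception of condition 4 (avoid encoding): the categorical columns were label-encoded before model fitting. For more detailed insights, please refer to the attached Word document.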

[Download Word Document](https://docs.google.com/document/d/1Hj2OXi_rgy5z8cp_RkU_PmW9qegpf5VJ/edit?usp=sharing&ouid=110633205482258762189&rtpof=true&sd=true)

Thank you for the opportunity!