Principal Component Analysis
PCA is a technique used for dimensionality reduction and feature extraction. It transforms the original dataset into a new set of linearly uncorrelated variables called principal components. The first principal component has the highest variance, and each subsequent component has the highest variance possible while remaining orthogonal to all preceding components.
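As a minimal sketch of the definition above, the principal components can be computed with NumPy's singular value decomposition. The toy data and every variable name here are assumptions made for illustration and are not part of the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): 200 samples, 3 features, the first two strongly correlated
base = rng.normal(size=(200, 1))
X_toy = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.5 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center the data; the principal directions are the right singular vectors
X_centered = X_toy - X_toy.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project the data onto the principal directions
scores = X_centered @ Vt.T

# Variance captured by each component, in decreasing order
variances = S**2 / (len(X_toy) - 1)

# The covariance of the projected data is diagonal up to rounding error,
# which is what "linearly uncorrelated" means here
cov_scores = np.cov(scores, rowvar=False)
print(variances)
print(cov_scores.round(6))
```

The rows of Vt are orthonormal, which corresponds to the orthogonality constraint between components.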
The number of principal components to keep is usually chosen from the share of the total variance that the components explain. In this notebook, n_components=2 is specified, which means the data is transformed into a new two-dimensional space spanned by the first two principal components.
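One way to look at the variance explained by each component is scikit-learn's explained_variance_ratio_ attribute. The sketch below uses synthetic data, and the 90% threshold is a hypothetical choice made for illustration, not a value taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data (made up): 300 samples, 5 features, two of them strongly correlated
X_demo = rng.normal(size=(300, 5))
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=300)

# Fit PCA with all components to see how the variance is distributed
X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_full = PCA().fit(X_demo_scaled)

ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.90  # hypothetical threshold
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(ratios.round(3))
print(n_keep)
```

scikit-learn's PCA also accepts a float between 0 and 1 for n_components, in which case it keeps enough components to reach that share of the variance.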
The transformed data is visualized using a scatter plot, where the x-axis corresponds to the first principal component and the y-axis corresponds to the second principal component. The plot shows how the data is distributed in the new two-dimensional space.
Preprocessing
This code takes the dataset from the df DataFrame and preprocesses it for the principal component analysis (PCA).
It drops the 'datetime', 'srcstr', 'localeabbr', 'time', 'hour', and 'postalcode' columns from the dataset as they are not useful for the analysis.
It also encodes the categorical columns host, proto, cc, country, locale, and month as integers using LabelEncoder() from the sklearn.preprocessing module.
The resulting dataset, which contains only the features, is stored in the X DataFrame.
Finally, it drops the rows containing null values using the dropna() method and displays the first few rows of the processed dataset using the head() method.
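To illustrate what the encoding step does, here is a small sketch of LabelEncoder on a made-up column. The values and the names toy and le_demo are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical column
toy = pd.DataFrame({'proto': ['TCP', 'UDP', 'ICMP', 'TCP']})

le_demo = LabelEncoder()
toy['proto_code'] = le_demo.fit_transform(toy['proto'])

# classes_ holds the sorted unique labels; each code is an index into it
print(list(le_demo.classes_))
print(toy)

# inverse_transform maps the codes back to the original labels
restored = le_demo.inverse_transform(toy['proto_code'])
```

The integer codes follow the sorted order of the labels and carry no numeric meaning of their own, which is worth keeping in mind when the encoded columns are later standardized and passed to PCA.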


In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Start from the DataFrame df, which must already be loaded in an earlier cell
d = df

# Drop the columns that are not useful for the analysis
d = d.drop(['datetime', 'srcstr', 'localeabbr', 'time', 'hour', 'postalcode'], axis=1)

# Encode the categorical columns as integer codes
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for col in ['host', 'proto', 'cc', 'country', 'locale', 'month']:
    d[col] = le.fit_transform(d[col])

# Keep the features in X and drop the rows that contain null values
X = d.dropna()
X.head()

NameError: name 'df' is not defined

In [None]:
X.isnull().sum()

In [None]:
# Standardize the feature values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Visualize the transformed data
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
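The cell above standardizes the features before running PCA. The sketch below, on synthetic data, suggests why that step matters when the columns have very different ranges. The two columns are made-up stand-ins for a port-like and a coordinate-like feature, not values from the dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-ins: one column with a very large range, one with a small range
ports = rng.integers(0, 65536, size=500).astype(float)
coords = rng.normal(0.0, 1.0, size=500)
X_demo = np.column_stack([ports, coords])

# Without scaling, the large-range column dominates the first component
raw_ratio = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# After standardization, both columns contribute on a comparable scale
X_demo_scaled = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA(n_components=2).fit(X_demo_scaled).explained_variance_ratio_

print(raw_ratio.round(4))
print(scaled_ratio.round(4))
```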

In [None]:
# Rebuild the preprocessed data as t, again starting from df
t = df
t = t.drop(['datetime', 'srcstr', 'localeabbr', 'time', 'hour', 'postalcode'], axis=1)

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# Encode the categorical columns; 'country' keeps its original values
# so that it can label the points in the plot further down
for col in ['host', 'proto', 'cc', 'locale', 'month']:
    t[col] = le.fit_transform(t[col])
t = t.dropna()
t.head()

This code first imports the libraries required for PCA and for 2D and 3D plotting.
Then, it selects the columns host, spt, dpt, latitude, and longitude from the dataframe t.
The selected columns are standardized using StandardScaler(), and PCA is performed on them with 2 components using PCA(n_components=2).
Another PCA is performed with 3 components using PCA(n_components=3); its result, A_pca_3d, is intended for a 3D visualization but is not plotted in this cell.
The unique values of the country column are identified and a color map with one color per country is created.
The 2D-transformed data is then drawn as a scatter plot, with each data point colored according to its country value.
Finally, the axis labels, title, legend, and colorbar are set using the set_xlabel(), set_ylabel(), set_title(), and legend() methods of the Axes object and the colorbar() method of the Figure.
The resulting plot is displayed using the plt.show() function.

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Select the columns for PCA visualization
columns = ['host','spt', 'dpt', 'latitude', 'longitude']

# Extract the selected feature columns and drop any rows with null values
A = t[columns]
A=A.dropna()
# Standardize the feature values
scaler = StandardScaler()
A_scaled = scaler.fit_transform(A)

# Perform PCA with 2 components
pca = PCA(n_components=2)
A_pca_2d = pca.fit_transform(A_scaled)

# Perform PCA with 3 components
pca = PCA(n_components=3)
A_pca_3d = pca.fit_transform(A_scaled)

# Create a color map with one color per country value
country_values = t['country'].unique()
color_map = plt.get_cmap('hsv', len(country_values))

# Visualize the transformed data with color-coded data points in 2D
fig, ax = plt.subplots(figsize=(8, 5))
for i, country in enumerate(country_values):
    # Boolean NumPy mask selecting the rows that belong to this country
    mask = (t['country'] == country).to_numpy()
    ax.scatter(A_pca_2d[mask, 0], A_pca_2d[mask, 1], color=color_map(i), marker='o', label=country)

# Set the plot labels and title
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_title('AWS Honeypot Dataset - PCA Visualization (2D)')

# Add legends and colorbar
ax.legend(bbox_to_anchor=(1.95, 1.50), fontsize=8)
sm = plt.cm.ScalarMappable(cmap=color_map, norm=plt.Normalize(vmin=0, vmax=len(country_values)-1))
sm.set_array([])
# Pass ax explicitly so the colorbar knows which Axes to take space from
fig.colorbar(sm, ax=ax)

# Show the plot
plt.show()

This code takes the preprocessed dataset d as dx, normalizes the values in each column (except for certain columns), drops any rows with missing values, and then checks for the presence of missing values in the resulting dataset.

The normalization is done by dividing each value in a column by the maximum value in that column. For a column that holds only non-negative values, this scales the values to lie between 0 and 1. A column that can hold negative values, such as latitude or longitude, is only bounded above by 1.

The specific columns excluded from normalization are 'datetime', 'host', 'src', 'proto', and 'type'; the code leaves them at their existing values. Of these, 'datetime' was already dropped during preprocessing, so excluding it has no effect.

The code then drops any rows in dx that contain missing values, and finally uses the isnull() and sum() methods to count the number of missing values per column in the resulting dataset.
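The difference between dividing by the column maximum and a full min-max rescaling can be seen on a small made-up frame. The column names mirror the dataset, but the values and the min-max variant are assumptions added for comparison; the notebook itself only divides by the maximum.

```python
import pandas as pd

# Made-up values: one non-negative column and one with negative entries
demo = pd.DataFrame({
    'dpt': [22.0, 80.0, 443.0],
    'longitude': [-120.0, 10.0, 60.0],
})

# Division by the column maximum, as done in this notebook
max_scaled = demo / demo.max()

# Min-max rescaling, shown only for comparison
minmax_scaled = (demo - demo.min()) / (demo.max() - demo.min())

print(max_scaled)
print(minmax_scaled)
```

In this sketch the max-scaled longitude column reaches -2.0, while the min-max version stays within 0 and 1 for both columns.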

In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

# Start from the preprocessed DataFrame d; copy it so that d itself is not modified
dx = d.copy()

# Normalize each column by its maximum value, skipping the excluded columns
excluded = ['datetime', 'host', 'src', 'proto', 'type']
for col in dx.columns:
    if col not in excluded:
        dx[col] = dx[col].astype('float')
        dx[col] = dx[col] / dx[col].max()

dx = dx.dropna()
dx.isnull().sum()

In [None]:
dx


This code performs PCA (Principal Component Analysis) with three components on the normalized dataset dx, after dropping the host, src, and proto columns, and visualizes the PCA dimensions in 3D and 2D.

First, the code creates a 3D plot using the scatter function of matplotlib and the projection='3d' parameter. It then plots the transformed dataset pca_transformed on the x, y, and z axes, with the color of each point determined by the corresponding country value in the dx DataFrame.

Next, the code creates a 2D scatter plot using the sns.scatterplot function from the seaborn library. It plots the first two principal components (PCA1 and PCA2) on the x and y axes, respectively, with the color of each point determined by the corresponding country value in the dx DataFrame.

Each of the two plots is displayed with its own call to the plt.show() function.

In [None]:
# Perform PCA
pca = PCA(n_components=3)
pca_transformed = pca.fit_transform(dx.drop(['host', 'src', 'proto'], axis=1))

# Visualize PCA dimensions
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(pca_transformed[:,0], pca_transformed[:,1], pca_transformed[:,2], c=dx['country'], cmap='viridis')
ax.set_xlabel('PCA1')
ax.set_ylabel('PCA2')
ax.set_zlabel('PCA3')
plt.show()

# 2D Visualization
plt.figure(figsize=(10,8))
sns.scatterplot(x=pca_transformed[:,0], y=pca_transformed[:,1], hue=dx['country'], palette='hls')
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.show()