Author: Nik Alleyne <br>
Author Blog: https://www.securitynik.com <br>
Author GitHub: github.com/securitynik <br>
Author Books: [  <br>
                "https://www.amazon.ca/Learning-Practicing-Leveraging-Practical-Detection/dp/1731254458/",  <br>
                "https://www.amazon.ca/Learning-Practicing-Mastering-Network-Forensics/dp/1775383024/" <br>
            ] <br>


## 01 - Beginning Numpy

This post is part of my beginning machine learning series.  <br>
The series includes the following: <br>

01 - Beginning Numpy <br>
02 - Beginning Tensorflow  <br>
03 - Beginning PyTorch <br>
04 - Beginning Pandas <br>
05 - Beginning Matplotlib <br>
06 - Beginning Data Scaling <br>
07 - Beginning Principal Component Analysis (PCA) <br>
08 - Beginning Machine Learning Anomaly Detection - Isolation Forest and Local Outlier Factor <br>
09 - Beginning Unsupervised Machine Learning - Clustering - KMeans and DBSCAN <br>
10 - Beginning Supervised Learning - Machine Learning - Logistic Regression, Decision Trees and Metrics <br>
11 - Beginning Linear Regression - Machine Learning <br>
12 - Beginning Deep Learning - Anomaly Detection with AutoEncoders, Tensorflow <br>
13 - Beginning Deep Learning - Anomaly Detection with AutoEncoders, PyTorch <br>
14 - Beginning Deep Learning, - Linear Regression, Tensorflow <br>
15 - Beginning Deep Learning, - Linear Regression, PyTorch <br>
16 - Beginning Deep Learning, - Classification, Tensorflow <br>
17 - Beginning Deep Learning, - Classification, Pytorch <br>
18 - Beginning Deep Learning, - Classification - regression - MIMO - Functional API Tensorflow <br> 
19 - Beginning Deep Learning, - Convolution Networks - Tensorflow <br>
20 - Beginning Deep Learning, - Convolution Networks, PyTorch <br>
21 - Beginning Regularization - Early Stopping, Dropout, L2 (Ridge), L1 (Lasso) <br>
23 - Beginning Model TFServing <br>

But conn.log is not the only log Zeek produces. Let's build some models for the DNS and HTTP logs. <br>
I chose unsupervised learning because these data come with no labels. <br>

24 - Continuing Anomaly Learning - Zeek DNS Log - Machine Learning <br>
25 - Continuing Unsupervised Learning - Zeek HTTP Log - Machine Learning <br> <br>

This was a specific ask by someone in one of my classes. <br>
26 - Beginning - Reading Executables and Building a Neural Network to make predictions on suspicious vs suspicious  <br><br>

With 26 notebooks in this series, it is quite possible there are things I could have or should have done differently.  <br>
If you find anything you think fits those criteria, drop me a line. <br>

In [3]:
# First up, import the numpy library
import numpy as np

In [4]:
# Get the current numpy version
np.__version__

'1.23.5'

In [10]:
# Setup an integer numpy array with 1 item
# A numpy array is a collection of one or more items of the same type
x = np.array([10])
x

array([10])

In [11]:
# Confirm the data type of x
x.dtype

dtype('int32')

In [12]:
# Above shows x is of data type int32
# If we want to find the type of object x is, we use type
# This returns a Numpy N-dimensional array
# https://numpy.org/doc/stable/reference/arrays.ndarray.html
type(x)

numpy.ndarray

In [14]:
# Setup the array with multiple integer items and confirm the data type of the array
x = np.array([10, 20, 30])
x, x.dtype

(array([10, 20, 30]), dtype('int32'))

In [15]:
# Setup the array with multiple float items without explicitly specifying the data type
x = np.array([10., 20., 30.])
x

array([10., 20., 30.])

In [16]:
# Confirming the data type is float.
x.dtype

dtype('float64')

In [17]:
# Alternatively, we can cast (convert) the integers to float16 values
# Setup the array with multiple integer items
x = np.array([10, 20, 30], dtype=np.float16)
x, x.dtype

(array([10., 20., 30.], dtype=float16), dtype('float16'))
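
The cast above happens when the array is created. As a minimal sketch, an existing array can also be converted with `astype`, and the smaller type trades precision for memory. The variable names here are illustrative only.

```python
import numpy as np

# Start with an integer array; the default integer width depends on the platform
ints = np.array([10, 20, 30])

# astype returns a new array with the requested type; ints itself is unchanged
halves = ints.astype(np.float16)
doubles = ints.astype(np.float64)

# itemsize is the number of bytes used per element
print(halves.dtype, halves.itemsize)    # float16 2
print(doubles.dtype, doubles.itemsize)  # float64 8

# float16 cannot represent every integer above 2048, so 2049 gets rounded
big = np.array([2049], dtype=np.float16)
print(big)
```

For small values such as 10, 20 and 30 the conversion is exact; the rounding only matters once values exceed what the narrower type can hold.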

In [18]:
# Add a new dimension to the vector of integers
# Make it 2 dimensional
# Note the two brackets to open and close the data
x = np.array([10, 20, 30], dtype=float, ndmin=2)
x

array([[10., 20., 30.]])

In [20]:
# Make it 3 dimensional the same way, this time with ndmin=3
x = np.array([10, 20, 30], dtype=float, ndmin=3)
x

array([[[10., 20., 30.]]])

In [21]:
# Alternatively, we could have added those dimensions manually
x = np.array([[[10, 20, 30]]])
x

array([[[10, 20, 30]]])

In [22]:
# Reshape the x array
# In this case, (-1, 1) means any amount of rows but only one column
# For this scenario, since x is 1 row and 3 columns, this transitions it to 3 rows and 1 column
x = np.array([10, 20, 30], dtype=float, ndmin=2)
x = x.reshape(-1, 1)
x

array([[10.],
       [20.],
       [30.]])

In [23]:
# Alternatively, reshape the x array to have 1 row and any amount of columns
# In this case, (1, -1) means 1 row and any amount of columns
# For this scenario, since x is already 1 row and 3 columns, the shape stays (1, 3)
# Notice the 2 dimensions
x = np.array([10, 20, 30], dtype=float, ndmin=2)
x = x.reshape(1, -1)
x

array([[10., 20., 30.]])
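
To make the role of -1 concrete, here is a small sketch on a 6-element array (an illustrative array, not the one from the cells above). NumPy works out the -1 dimension from the total number of elements, and raises an error when no whole number fits.

```python
import numpy as np

x = np.arange(6, dtype=float)      # 6 elements: [0., 1., 2., 3., 4., 5.]

col = x.reshape(-1, 1)             # -1 is inferred as 6 -> shape (6, 1)
row = x.reshape(1, -1)             # -1 is inferred as 6 -> shape (1, 6)
grid = x.reshape(2, -1)            # -1 is inferred as 3 -> shape (2, 3)

print(col.shape, row.shape, grid.shape)

# The total number of elements must stay the same, so this fails
try:
    x.reshape(4, -1)               # 6 is not divisible by 4
except ValueError as err:
    print(f'reshape failed: {err}')
```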

In [24]:
# Alternatively, use np.newaxis to go from a 1 dimension vector to 2D
# This results in 1 row and multiple columns
# Notice the transition from 1 to 2 dimensions
x = np.array([10, 20, 30], dtype=float, ndmin=1)
x[np.newaxis, :]

array([[10., 20., 30.]])

In [25]:
# Reshape with np.newaxis to any amount of rows and 1 column
x = np.array([10, 20, 30], dtype=float, ndmin=1)
x[:, np.newaxis ]

array([[10.],
       [20.],
       [30.]])

In [26]:
# If you are not too excited about using np.reshape or np.newaxis
# you can still reshape using np.expand_dims
# This results in multiple rows with 1 column
x = np.array([10, 20, 30], dtype=float, ndmin=1)

# Notice axis=1. The new axis is inserted at position 1, so the shape goes from (3,) to (3, 1)
np.expand_dims(x, axis=1)

array([[10.],
       [20.],
       [30.]])

In [27]:
# If not too excited about using np.reshape or np.newaxis
# you can still reshape using np.expand_dims
# In this case, target axis 0. This results in 1 row multiple columns
x = np.array([10, 20, 30], dtype=float, ndmin=1)

# Notice the axis=0. The new axis is inserted at position 0, so the shape goes from (3,) to (1, 3)
np.expand_dims(x, axis=0)

array([[10., 20., 30.]])
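
The cells above show three routes to the same two shapes. A short sketch that checks they really produce identical arrays:

```python
import numpy as np

x = np.array([10, 20, 30], dtype=float)

# Three ways to get a (3, 1) column
col_a = x.reshape(-1, 1)
col_b = x[:, np.newaxis]
col_c = np.expand_dims(x, axis=1)

# Three ways to get a (1, 3) row
row_a = x.reshape(1, -1)
row_b = x[np.newaxis, :]
row_c = np.expand_dims(x, axis=0)

print(np.array_equal(col_a, col_b) and np.array_equal(col_b, col_c))   # True
print(np.array_equal(row_a, row_b) and np.array_equal(row_b, row_c))   # True

# np.newaxis is simply an alias for None
print(np.newaxis is None)   # True
```

Which one to use is mostly a matter of taste; `reshape` states the target shape, while `np.newaxis` and `np.expand_dims` state where the new axis goes.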

In [28]:
# Create two vectors to stack
x = np.array([[1, 2, 3, 4, 5]], dtype=float)
y = np.array([[6, 7, 8, 9, 0]], dtype=float)

x, y

(array([[1., 2., 3., 4., 5.]]), array([[6., 7., 8., 9., 0.]]))

In [29]:
# Stack x and y vertically
# Note this needs to be a tuple
# This is very helpful, if you would like to stack two datasets to create 1 larger one
z = np.vstack((x, y))
z

array([[1., 2., 3., 4., 5.],
       [6., 7., 8., 9., 0.]])

In [30]:
# Stack horizontally
# This is helpful when you want to add new features to your dataset
# Remember this needs to be a tuple
z = np.hstack((x, y))
z

array([[1., 2., 3., 4., 5., 6., 7., 8., 9., 0.]])

In [31]:
# If not interested in using np.vstack
# You can instead concatenate the items along axis 0
z = np.concatenate((x, y), axis=0)
z

array([[1., 2., 3., 4., 5.],
       [6., 7., 8., 9., 0.]])

In [32]:
# If not interested in using np.hstack
# You can instead concatenate the items along axis 1
z = np.concatenate((x, y), axis=1)
z

array([[1., 2., 3., 4., 5., 6., 7., 8., 9., 0.]])
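
As a sketch of the two use cases mentioned above (adding features and adding samples), assume a hypothetical dataset of 3 samples with 2 features. The shapes have to line up along the axis that is not being extended.

```python
import numpy as np

# Hypothetical dataset: 3 samples, 2 features
features = np.array([[1., 2.], [3., 4.], [5., 6.]])
new_feature = np.array([7., 8., 9.])   # 1D, shape (3,)

# hstack needs matching row counts, so make the new feature a (3, 1) column first
wider = np.hstack((features, new_feature.reshape(-1, 1)))
print(wider.shape)   # (3, 3)

# vstack needs matching column counts: append one more sample with 2 features
new_sample = np.array([[10., 11.]])
taller = np.vstack((features, new_sample))
print(taller.shape)  # (4, 2)
```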

In [33]:
# Finding the difference between the smallest number and the largest
x = np.array([[1, 2, 3, 4, 5]], dtype=float)
np.ptp(x)

4.0

In [34]:
# Find the index within x where the value equals 5
x = np.array([10, 9, 8, 7, 6, 5, 4], dtype=float)
z = np.where((x == 5))
z

(array([5], dtype=int64),)

In [35]:
# Confirm the value at the returned position
x[5]

5.0
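
A brief sketch of what np.where is doing here: the comparison produces a boolean mask, and np.where turns that mask into index positions. The three-argument form shown at the end is a related use that is not part of the cells above.

```python
import numpy as np

x = np.array([10, 9, 8, 7, 6, 5, 4], dtype=float)

# The comparison alone produces a boolean mask
mask = x == 5
print(mask)

# np.where(mask) returns a tuple with one index array per dimension
(idx,) = np.where(mask)
print(idx)          # [5]

# The mask can also index the array directly
print(x[mask])      # [5.]

# With three arguments, np.where picks from two alternatives element by element
capped = np.where(x > 7, 7, x)
print(capped)       # [7. 7. 7. 7. 6. 5. 4.]
```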

In [36]:
# Generate a 4*4 matrix of ones
np.ones((4,4))

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [37]:
# Create a 6x6 matrix with all zeros
x = np.zeros((6,6))
x

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [38]:
# Update the items at position row 0, column 0 with 2
# Counting for both the rows and columns starts at 0
# As a result, even though this matrix is 6x6, you 
# will be going from 0 to 5 for the indexes
x[0,0] = 2
x

array([[2., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [39]:
# Update the items at position row 3, column 5 with 100
# Remember, index starts from 0
x[3,5] = 100
x


array([[  2.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0., 100.],
       [  0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.]])

In [40]:
# Maybe instead, manipulate columns 1, 2 and 3 of the last row. 
# Remember, the last row is row 5
# Could do x[5, 1:4] = 23
x[-1, 1:4] = 23
x

array([[  2.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0., 100.],
       [  0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,  23.,  23.,  23.,   0.,   0.]])

In [41]:
# One more. Change the values from the second column to the last in the second row (row index 1)
# Giving them a value of -10
x[1, -5:] = -10
x

array([[  2.,   0.,   0.,   0.,   0.,   0.],
       [  0., -10., -10., -10., -10., -10.],
       [  0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0., 100.],
       [  0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,  23.,  23.,  23.,   0.,   0.]])
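
The indexing used in the last few cells can be summarized in a small sketch on a fresh 6x6 matrix. The final part, showing that a basic slice is a view onto the original data, goes slightly beyond the cells above.

```python
import numpy as np

x = np.zeros((6, 6))

# Negative indexes count from the end: -1 is the last row, same as 5 here
x[-1, 1:4] = 23
print(np.array_equal(x[-1], x[5]))   # True

# A slice stops before its end index: 1:4 covers columns 1, 2 and 3
print(x[5, 1:4])                     # [23. 23. 23.]

# -5: on a 6-column row is the same as 1:, i.e. every column except the first
x[1, -5:] = -10
print(np.array_equal(x[1, -5:], x[1, 1:]))   # True

# Basic slices are views: writing through the slice changes x
row = x[0, :]
row[0] = 2
print(x[0, 0])                       # 2.0
```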

In [42]:
# Get the max value of all the items in the matrix
np.max(x)

100.0

In [43]:
# Get the max value in the matrix going down the columns 
# Going across axis=0
np.max(x, axis=0)

array([  2.,  23.,  23.,  23.,   0., 100.])

In [44]:
# Get the max value in the matrix across each row
np.max(x, axis=1)

array([  2.,   0.,   0., 100.,   0.,  23.])

In [45]:
# Create a 2D array of integers to be transposed
x = np.array([10, 9, 8, 7, 5], ndmin=2, dtype=int)
x

array([[10,  9,  8,  7,  5]])

In [46]:
# Use the full transpose function to change from a row vector to a column vector
x.transpose()

array([[10],
       [ 9],
       [ 8],
       [ 7],
       [ 5]])

In [47]:
# Maybe you like the shorter way to transpose
# simply use .T on the array
x.T

array([[10],
       [ 9],
       [ 8],
       [ 7],
       [ 5]])

In [48]:
# Create a 4 * 4 eye matrix
# Notice all the ones on the diagonal
# This is helpful also when you think about one-hot encoding
# In one hot encoding, only one item is "hot", i.e. turned on
x = np.eye(N=4, M=4)
x

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
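
To illustrate the one-hot remark, here is a minimal sketch that uses the rows of the identity matrix as one-hot vectors. The labels are made up for the example.

```python
import numpy as np

# Hypothetical class labels for 5 samples, drawn from 4 classes (0..3)
labels = np.array([2, 0, 3, 1, 2])

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the rows with the labels encodes every sample at once
one_hot = np.eye(4)[labels]
print(one_hot)

# argmax along each row recovers the original labels
print(np.argmax(one_hot, axis=1))   # [2 0 3 1 2]
```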

In [49]:
# Preparing to do some math
# Define a matrix with axes 0 and 1
x = np.array([[6,4,3], [9, 8, 0]])
x

array([[6, 4, 3],
       [9, 8, 0]])

In [51]:
# Get the sum of x
x_sum = np.sum(x)
x_sum

30

In [52]:
# Rather than get the sum of the entire matrix, 
# Get the sum of the rows, i.e axis=1
x_sum = np.sum(x, axis=1)
x_sum

array([13, 17])

In [53]:
# Similarly get the sum of the columns, i.e axis=0
x_sum = np.sum(x, axis=0)
x_sum

array([15, 12,  3])

In [54]:
# Get the average, going across the rows, axis=1
x_avg = np.average(x, axis=1)
x_avg

array([4.33333333, 5.66666667])

In [55]:
# Get the average going down the columns, axis=0
x_avg = np.average(x, axis=0)
x_avg

array([7.5, 6. , 1.5])

In [57]:
# Get the max of each column in x
np.max(a=x, axis=0)

array([9, 8, 3])

In [58]:
# Get the max of each row in x
np.max(a=x, axis=1)

array([6, 9])
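
One way to keep the axis argument straight for np.sum, np.average and np.max is to look at the shapes: reducing over an axis removes that axis. The sketch below also uses `keepdims=True`, which is not used in the cells above, to keep the reduced axis so the result can be combined with the original matrix.

```python
import numpy as np

x = np.array([[6, 4, 3], [9, 8, 0]])   # shape (2, 3)

# Reducing over an axis removes that axis from the shape
print(np.sum(x, axis=0).shape)   # (3,)  one value per column
print(np.sum(x, axis=1).shape)   # (2,)  one value per row

# keepdims=True keeps the reduced axis with length 1
row_totals = np.sum(x, axis=1, keepdims=True)   # shape (2, 1)

# ...which lets the result broadcast back against x, e.g. row-wise proportions
proportions = x / row_totals
print(proportions.sum(axis=1))   # each row of proportions sums to 1
```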

In [59]:
# Generate a random integer from 10 up to, but not including, 20
np.random.randint(low=10, high=20)

16

In [60]:
# There may be times you wish to generate the same random number
# Maybe for demonstration purposes. Like in this notebook :-) 
# In this case, first set the random seed, then generate the number
# Because the random seed is reset before each draw, we get the same number every time

for idx in range(5): 
    np.random.seed(11)
    print(f'Run: {idx} number: {np.random.randint(low=10, high=20)}')

Run: 0 number: 19
Run: 1 number: 19
Run: 2 number: 19
Run: 3 number: 19
Run: 4 number: 19
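
As an alternative sketch, NumPy also offers the Generator API via `np.random.default_rng`. It is not used elsewhere in this notebook; the point here is only that a seeded generator gives a repeatable sequence without touching the global seed.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(11)
rng_b = np.random.default_rng(11)

draws_a = rng_a.integers(low=10, high=20, size=5)
draws_b = rng_b.integers(low=10, high=20, size=5)

print(draws_a)
print(np.array_equal(draws_a, draws_b))   # True

# high is exclusive, so every value falls in 10..19
print(draws_a.min() >= 10 and draws_a.max() < 20)   # True
```

The numbers drawn this way will generally differ from those produced by `np.random.seed` with the same seed, since the two APIs use different underlying generators.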


In [61]:
# In looking for max values, you might instead want the index of that value
# This is helpful when using activation functions such as softmax
# in the output layer of a neural network
# The softmax activation function is used when predicting in multiclass problems
# This output shows 1. Looking at x, we see that index position 1 has 10.
# 10 is the largest value in the array
x = np.array([2, 10,5,2,3])
np.argmax(x), x

(1, array([ 2, 10,  5,  2,  3]))
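
To connect argmax with softmax, here is a minimal NumPy sketch of a softmax function. The `softmax` helper and the `logits` values are illustrative assumptions, not part of any library used in this notebook.

```python
import numpy as np

def softmax(logits):
    # Subtracting the max does not change the result but avoids overflow in exp
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical raw outputs (logits) from a 5-class output layer
logits = np.array([2., 10., 5., 2., 3.])
probs = softmax(logits)

print(probs.sum())          # sums to 1, up to floating point rounding
print(np.argmax(probs))     # 1, the same index as np.argmax(logits)
```

Softmax preserves the ordering of its inputs, which is why argmax on the probabilities picks the same class as argmax on the raw values.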

In [62]:
# Multiply two matrices
# First create the matrices 
x = np.array([[2,3,4]])
y = np.array([[5],[4],[3]])
x, y

(array([[2, 3, 4]]),
 array([[5],
        [4],
        [3]]))

In [63]:
# In the next cell, we get the dot product of two matrices
# To get the dot product of two matrices, we need to ensure the inner dimensions match
# Below shows the inner dimensions are 3 and 3
# Looking at the outer dimensions, we can see we will get a 1x1 output
x.shape, y.shape

((1, 3), (3, 1))

In [64]:
# Get the dot product of the two vectors
# As mentioned above, the output here is a single value
np.dot(x, y)

array([[34]])

In [65]:
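# Alternatively, perform an element-wise (Hadamard) multiplication
# Because the shapes are (1, 3) and (3, 1), broadcasting expands both and the result is 3x3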
np.multiply(x, y)

array([[10, 15, 20],
       [ 8, 12, 16],
       [ 6,  9, 12]])
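
A short sketch that spells out the arithmetic behind the two results above. The `@` operator and `np.outer` are not used in the original cells; they are shown here only to make the comparison explicit.

```python
import numpy as np

x = np.array([[2, 3, 4]])        # shape (1, 3)
y = np.array([[5], [4], [3]])    # shape (3, 1)

# Matrix product: 2*5 + 3*4 + 4*3 = 34, shape (1, 1)
print(x @ y)                     # [[34]]

# np.multiply is element-wise. With shapes (1, 3) and (3, 1) broadcasting
# expands both to (3, 3), which equals the outer product of y and x
elementwise = np.multiply(x, y)
print(np.array_equal(elementwise, np.outer(y, x)))   # True

# An element-wise product of equal shapes: multiply x by y transposed
same_shape = np.multiply(x, y.T)
print(same_shape)                # [[10 12 12]]
```

Summing the last result gives 34 again, which is the dot product.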

In [66]:
# Before getting the cumulative sum of x, revisit x
x

array([[2, 3, 4]])

In [67]:
# Get the cumulative sum
# Notice how the first value in x remains the same, then the first and second are added
# Then the third is added to that running total (2 + 3 + 4 = 9)
np.cumsum(x)

array([2, 5, 9])
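
The running total can also be taken along a chosen axis. A small sketch, using an illustrative 2x3 matrix `m`:

```python
import numpy as np

x = np.array([[2, 3, 4]])

# Running total: 2, 2+3, 2+3+4
print(np.cumsum(x))              # [2 5 9]

# Without an axis the input is flattened first; with an axis the shape is kept
m = np.array([[1, 2, 3], [4, 5, 6]])
flat_total = np.cumsum(m)        # 1, 3, 6, 10, 15, 21
down_columns = np.cumsum(m, axis=0)
across_rows = np.cumsum(m, axis=1)

print(flat_total)
print(down_columns)
print(across_rows)
```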

In [68]:
# Create x with 4 dimensions
# When working with convolution networks, we will need to get this data in 4 dimensions
#   19 - Beginning Deep Learning, - Convolution Networks - Tensorflow
#   20 - Beginning Deep Learning, - Convolution Networks, PyTorch

x = np.array([2,3,4,5,6], ndmin=4)
x

array([[[[2, 3, 4, 5, 6]]]])

In [69]:
# Flatten x to a vector
# This is needed especially when building architectures that have, for example
# convolution layer that then needs to transition to a dense layer
# You will have to flatten the convolution layer before passing it to the dense layer
#   19 - Beginning Deep Learning, - Convolution Networks - Tensorflow
#   20 - Beginning Deep Learning, - Convolution Networks, PyTorch

x.flatten()

array([2, 3, 4, 5, 6])

In [70]:
# Alternatively, we could have used np.ravel() to get a 1D array
x, np.ravel(x)

(array([[[[2, 3, 4, 5, 6]]]]), array([2, 3, 4, 5, 6]))
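
Both calls give the same values, but they differ in whether the result shares memory with the original. A minimal sketch of that difference:

```python
import numpy as np

x = np.array([2, 3, 4, 5, 6], ndmin=4)

flat_copy = x.flatten()   # always a copy
flat_view = np.ravel(x)   # a view when possible (it is for this contiguous array)

# Changing the copy leaves x alone
flat_copy[0] = 100
print(x[0, 0, 0, 0])      # 2

# Changing the view changes x as well
flat_view[0] = -1
print(x[0, 0, 0, 0])      # -1

print(np.shares_memory(x, flat_copy), np.shares_memory(x, flat_view))  # False True
```

If the flattened result is going to be modified, `flatten` is the safer choice; `ravel` avoids the copy when only reading.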

In [78]:
# Create a matrix of shape 2x3 (2 rows, 3 columns)
x = np.array([[3,4,5], [6,7,8]])
x

array([[3, 4, 5],
       [6, 7, 8]])

In [80]:
# Delete the second row from the matrix above
# Remember, indexing starts at 0
# Note the 0 at the end, I left out axis=0 and only used 0
np.delete(x, 1, 0)

array([[3, 4, 5]])

In [81]:
# Note with that operation above, the original x was not modified
x

array([[3, 4, 5],
       [6, 7, 8]])

In [82]:
# Delete the middle column
# Notice this time I use axis, specifying axis=1
np.delete(arr=x, obj=1, axis=1)

array([[3, 5],
       [6, 8]])

In [83]:
# One more before moving on, broadcasting
# Taking a 2x3 matrix and multiplying it by a 3-element vector
# https://numpy.org/doc/stable/user/basics.broadcasting.html
np.multiply(np.array([[10,2,3], [2,1,3]]), np.array([4,5,6]))

array([[40, 10, 18],
       [ 8,  5, 18]])

In [84]:
# The above is the same as multiplying 
# [[10,2,3],   * [[4,5,6]
# [2,1,3]]        [4,5,6]]
np.multiply(np.array([[10,2,3], [2,1,3]]), np.array([[4,5,6], [4,5,6]]))

array([[40, 10, 18],
       [ 8,  5, 18]])
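
As a sketch of how NumPy decides whether two shapes can be broadcast, the helpers `np.broadcast_shapes` and `np.broadcast_to` (not used in the cells above) make the stretching visible:

```python
import numpy as np

a = np.array([[10, 2, 3], [2, 1, 3]])   # shape (2, 3)
b = np.array([4, 5, 6])                 # shape (3,)

# Shapes are compared from the right; (3,) is treated as (1, 3) and stretched to (2, 3)
print(np.broadcast_shapes(a.shape, b.shape))   # (2, 3)

# broadcast_to shows the stretched array without copying the data
stretched = np.broadcast_to(b, a.shape)
print(stretched)

# Incompatible trailing dimensions raise an error
try:
    a * np.array([1, 2])                # (2, 3) vs (2,): 3 != 2
except ValueError as err:
    print(f'broadcast failed: {err}')
```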

In [85]:
# While I typically will use Pandas to read in data,
# You also have the opportunity to do so with numpy
# Maybe you want to read in content from a .csv file. 
# Here is a simple way of doing that.
# Note, because the data in the Zeek conn log is a mix of integers and strings, I specified U here for unicode.
# We will fix this shortly.

conn_data = np.loadtxt(fname=r'conn.log', dtype='U', encoding=None, delimiter=None, skiprows=0)
conn_data

array([['id.orig_h', 'id.orig_p', 'id.resp_h', ..., 'orig_ip_bytes',
        'resp_pkts', 'resp_ip_bytes'],
       ['127.0.0.1', '27762', '127.0.0.1', ..., '0', '0', '0'],
       ['192.168.0.4', '27761', '192.168.0.4', ..., '0', '0', '0'],
       ...,
       ['192.168.0.4', '37244', '192.168.0.4', ..., '0', '1', '40'],
       ['192.168.0.4', '37246', '192.168.0.4', ..., '0', '1', '40'],
       ['192.168.0.4', '37254', '192.168.0.4', ..., '0', '1', '40']],
      dtype='<U38')

Read in tabular data from the Zeek conn.log file
Zeek is a framework used for Network Security Monitoring. 
This entire series is based on using Zeek's data. 
The majority of the notebooks use the conn.log
You can learn more about Zeek here:
    https://zeek.org/

Alternatively, come hang out with us in the:
SANS SEC595: Applied Data Science and Machine Learning for Cybersecurity Professionals
https://www.sans.org/cyber-security-courses/applied-data-science-machine-learning/
OR
SANS SEC503: Network Monitoring and Threat Detection In-Depth
https://www.sans.org/cyber-security-courses/network-monitoring-threat-detection/

if you wish to learn more about using Zeek for your security needs

In [86]:
# Notice above, the first row has what looks like column headers.
# Let's remove those headers by skipping the first row. 
# skiprows=1 skips one line, which here is the header row
# Notice also everything is in quotes, hence we know it is all strings.
conn_data = np.loadtxt(fname=r'conn.log', dtype='U', encoding=None, delimiter=None, skiprows=1)
conn_data

array([['127.0.0.1', '27762', '127.0.0.1', ..., '0', '0', '0'],
       ['192.168.0.4', '27761', '192.168.0.4', ..., '0', '0', '0'],
       ['192.168.0.4', '27761', '192.168.0.4', ..., '0', '0', '0'],
       ...,
       ['192.168.0.4', '37244', '192.168.0.4', ..., '0', '1', '40'],
       ['192.168.0.4', '37246', '192.168.0.4', ..., '0', '1', '40'],
       ['192.168.0.4', '37254', '192.168.0.4', ..., '0', '1', '40']],
      dtype='<U38')

In [87]:
# Get the shape of the dataset
# We always want to know the shape of our dataset
# As we prepare to feed it into our machine or deep learning models
# Below states we have 4430188 samples and 12 columns/features
conn_data.shape

(4430188, 12)

In [88]:
# Everything above is a string because the data contains "-" where a numeric value should be
# Maybe there is an easier way to read in these "-" as ints or floats
# At this point, you might be thinking I am better off using Pandas
# Which would not be a bad idea. However, this notebook is about using Numpy
#   04 - Beginning Pandas
# Let's stick with Numpy

# Giving it another try
np.genfromtxt(fname=r'conn.log', encoding=None, delimiter=None, skip_header=0)

array([[       nan,        nan,        nan, ...,        nan,        nan,
               nan],
       [       nan, 2.7762e+04,        nan, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       [       nan, 2.7761e+04,        nan, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       ...,
       [       nan, 3.7244e+04,        nan, ..., 0.0000e+00, 1.0000e+00,
        4.0000e+01],
       [       nan, 3.7246e+04,        nan, ..., 0.0000e+00, 1.0000e+00,
        4.0000e+01],
       [       nan, 3.7254e+04,        nan, ..., 0.0000e+00, 1.0000e+00,
        4.0000e+01]])

In [89]:
# Above we see "nan", not a number
# Let's replace those with 0s, by specifying the missing_values '-' and filling them in with 0
# Let's also add the column names. 
# The column names will be added from first row
# Note the skip_header=0
conn_data = np.genfromtxt(fname=r'conn.log', encoding=None, delimiter=None, \
                          skip_header=0, missing_values='-', filling_values=0., names=True)
conn_data

array([(0., 27762., 0., 58552., 0., 0.e+00, 0., 0., 0., 0., 0.,  0.),
       (0., 27761., 0., 48798., 0., 0.e+00, 0., 0., 0., 0., 0.,  0.),
       (0., 27761., 0., 48804., 0., 0.e+00, 0., 0., 0., 0., 0.,  0.), ...,
       (0., 37244., 0.,  9200., 0., 5.e-06, 0., 0., 0., 0., 1., 40.),
       (0., 37246., 0.,  9200., 0., 5.e-06, 0., 0., 0., 0., 1., 40.),
       (0., 37254., 0.,  9200., 0., 5.e-06, 0., 0., 0., 0., 1., 40.)],
      dtype=[('idorig_h', '<f8'), ('idorig_p', '<f8'), ('idresp_h', '<f8'), ('idresp_p', '<f8'), ('service', '<f8'), ('duration', '<f8'), ('orig_bytes', '<f8'), ('resp_bytes', '<f8'), ('orig_pkts', '<f8'), ('orig_ip_bytes', '<f8'), ('resp_pkts', '<f8'), ('resp_ip_bytes', '<f8')])

In [90]:
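# Alternatively, add the names of the columns directly
# Note: with skip_header=0 the header line itself is parsed as data,
# which is why the first row in the output below is all zeros. Use skip_header=1 to drop it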
conn_data = np.genfromtxt(fname=r'conn.log', encoding=None, delimiter=None, \
                          skip_header=0, missing_values='-', filling_values=0., \
                            names=['id.orig_h',  'id.orig_p',  'id.resp_h',  'id.resp_p'\
                                   ,  'service', 'duration', 'orig_bytes', 'resp_bytes'\
                                    , 'orig_pkts', 'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes'])
conn_data

array([(0.,     0., 0.,     0., 0., 0.e+00, 0., 0., 0., 0., 0.,  0.),
       (0., 27762., 0., 58552., 0., 0.e+00, 0., 0., 0., 0., 0.,  0.),
       (0., 27761., 0., 48798., 0., 0.e+00, 0., 0., 0., 0., 0.,  0.), ...,
       (0., 37244., 0.,  9200., 0., 5.e-06, 0., 0., 0., 0., 1., 40.),
       (0., 37246., 0.,  9200., 0., 5.e-06, 0., 0., 0., 0., 1., 40.),
       (0., 37254., 0.,  9200., 0., 5.e-06, 0., 0., 0., 0., 1., 40.)],
      dtype=[('idorig_h', '<f8'), ('idorig_p', '<f8'), ('idresp_h', '<f8'), ('idresp_p', '<f8'), ('service', '<f8'), ('duration', '<f8'), ('orig_bytes', '<f8'), ('resp_bytes', '<f8'), ('orig_pkts', '<f8'), ('orig_ip_bytes', '<f8'), ('resp_pkts', '<f8'), ('resp_ip_bytes', '<f8')])
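
The behaviour of `names`, `skip_header`, `missing_values` and `filling_values` can be tried without the full conn.log. The sketch below uses a made-up three-line sample held in memory via `io.StringIO`; the sample values and the shortened column list are assumptions for illustration only.

```python
import io
import numpy as np

# Hypothetical sample in the same whitespace-delimited layout, numeric columns only
sample = (
    'id.orig_p duration resp_pkts\n'
    '37244 0.000005 1\n'
    '27762 - 0\n'
)

# names=True: the first line supplies the field names and is not treated as data
data = np.genfromtxt(io.StringIO(sample), names=True,
                     missing_values='-', filling_values=0.)
print(data.shape)          # (2,)
print(data['duration'])    # the "-" became 0.

# Explicit names with skip_header=0: the header line is parsed as a data row,
# and its unparseable text is replaced by the filling value
names = ['orig_p', 'duration', 'resp_pkts']
with_header = np.genfromtxt(io.StringIO(sample), names=names,
                            missing_values='-', filling_values=0.)
print(with_header.shape)   # (3,)

# skip_header=1 drops the header line before parsing
without_header = np.genfromtxt(io.StringIO(sample), names=names, skip_header=1,
                               missing_values='-', filling_values=0.)
print(without_header.shape)   # (2,)
```

Individual columns of the resulting structured array are then available by field name, for example `data['duration']`.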

In [91]:
# Above is much better
# Also notice every column is float64 ('<f8'), which is the default dtype for np.genfromtxt
# Good stuff! If we pay close attention above, we see values such as "0.e+00".
# Let's remove this scientific notation and instead have more human readable values
np.set_printoptions(suppress=True, precision=5)

In [92]:
# Show the connection data again
# This time, take a snapshot of 5 records
conn_data[:5]

array([(0.,     0., 0.,     0., 0., 0., 0., 0., 0., 0., 0., 0.),
       (0., 27762., 0., 58552., 0., 0., 0., 0., 0., 0., 0., 0.),
       (0., 27761., 0., 48798., 0., 0., 0., 0., 0., 0., 0., 0.),
       (0., 27761., 0., 48804., 0., 0., 0., 0., 0., 0., 0., 0.),
       (0., 27762., 0., 58568., 0., 0., 0., 0., 0., 0., 0., 0.)],
      dtype=[('idorig_h', '<f8'), ('idorig_p', '<f8'), ('idresp_h', '<f8'), ('idresp_p', '<f8'), ('service', '<f8'), ('duration', '<f8'), ('orig_bytes', '<f8'), ('resp_bytes', '<f8'), ('orig_pkts', '<f8'), ('orig_ip_bytes', '<f8'), ('resp_pkts', '<f8'), ('resp_ip_bytes', '<f8')])

In [94]:
# Next up, you may wish to save your matrix
np.save(file='connection_data', arr=conn_data, allow_pickle=True)

# View the saved file
!dir connection_* /b

connection_data.npy


In [97]:
# Since you saved it, you may need to reload it
loaded_arr = np.load(file=r'connection_data.npy',mmap_mode='r+')
loaded_arr

memmap([(0.,     0., 0.,     0., 0., 0.     , 0., 0., 0., 0., 0.,  0.),
        (0., 27762., 0., 58552., 0., 0.     , 0., 0., 0., 0., 0.,  0.),
        (0., 27761., 0., 48798., 0., 0.     , 0., 0., 0., 0., 0.,  0.),
        ...,
        (0., 37244., 0.,  9200., 0., 0.00001, 0., 0., 0., 0., 1., 40.),
        (0., 37246., 0.,  9200., 0., 0.00001, 0., 0., 0., 0., 1., 40.),
        (0., 37254., 0.,  9200., 0., 0.00001, 0., 0., 0., 0., 1., 40.)],
       dtype=[('idorig_h', '<f8'), ('idorig_p', '<f8'), ('idresp_h', '<f8'), ('idresp_p', '<f8'), ('service', '<f8'), ('duration', '<f8'), ('orig_bytes', '<f8'), ('resp_bytes', '<f8'), ('orig_pkts', '<f8'), ('orig_ip_bytes', '<f8'), ('resp_pkts', '<f8'), ('resp_ip_bytes', '<f8')])
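
A self-contained sketch of the save and load round trip, using a small illustrative array and a temporary directory so that nothing is left on disk. The file name `demo_array.npy` is hypothetical.

```python
import os
import tempfile
import numpy as np

arr = np.arange(12, dtype=float).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_array.npy')   # hypothetical file name

    np.save(path, arr)

    # A plain load reads the whole array into memory
    in_memory = np.load(path)

    # mmap_mode='r' maps the file read-only; data is read from disk as it is accessed
    mapped = np.load(path, mmap_mode='r')
    matches = np.array_equal(arr, in_memory) and np.array_equal(arr, mapped)
    mapped_type = type(mapped).__name__

    # Release the memory map before the temporary directory is removed
    del mapped

print(matches, mapped_type)   # True memmap
```

Memory mapping is mainly useful for large files, such as the multi-million-row array above, where only part of the data is needed at a time.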

In [None]:
# That's it for beginning numpy!

References: <br>
https://numpy.org/doc/stable/reference/