# DS/CMPSC 410 MiniProject Deliverable #3

# Spring 2025
### Instructor: Prof. John Yen
### TA: Peng Jin and Jingxi Zhu

### Learning Objectives
- Be able to apply PCA to reduce the high dimensional feature space to facilitate ML for high dimensional data.
- Like MiniProject Deliverable #2, focus on the clustering of non-extreme multi-port scanners based on the top 120 ports they scanned.
- Be able to reduce the high dimensionality of One-Hot-Encoded features using Principal Component Analysis (PCA)
- Be able to perform k-means with and without dimension reduction using PCA, and compare the results using the silhouette score and external Mirai labels.
- Be able to obtain cluster centers of the original feature space for clustering results using PCA and k-means.
- After successful clustering of the small Darknet dataset (with and without dimension reduction using PCA), conduct clustering on the large Darknet dataset in the cluster mode.
- Compare the clustering results of the large dataset with and without dimension reduction using PCA.

## Submit the following items:
- Successfully completed Jupyter Notebook (run in local mode), in html format.
- Items for cluster mode (Exercise 8):
  - Submit the .py file (5 points)
  - Submit the log file that contains the run time information for a successful execution in the cluster mode. (5 points)
  - Submit the output file that records the cluster summary in the cluster mode (without PCA) (10 points)
  - Submit the output file that records the cluster summary in the cluster mode (with PCA) (10 points)
  - Discuss the Silhouette score and Mirai ratio of clusters generated by k-means clustering with PCA and without PCA (in a separate Word document) (10 points)

### Total points: 100 
- Exercise 1: 1 point
- Exercise 2: 9 points 
- Exercise 3: 5 points 
- Exercise 4: 5 points
- Exercise 5: 5 points
- Exercise 6: 10 points
- Exercise 7: 25 points
- Exercise 8: 40 points
  
### Due: 11:59 pm, April 20th, 2025
### Early Submission bonus (before midnight April 13th): 10 points

In [None]:
import pyspark
import csv

In [None]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, LongType
from pyspark.sql.functions import col, column
from pyspark.sql.functions import expr
from pyspark.sql.functions import split
from pyspark.sql.functions import array_contains
from pyspark.sql import Row
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, IndexToString
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.feature import PCA

In [None]:
ss = SparkSession.builder.master("local").appName("MiniProject 3 PCA for k-means Clustering using 120 OHE").getOrCreate()

In [None]:
ss.sparkContext.setLogLevel("WARN")

# We can use the header of data files to infer schema.

## Exercise 1 (1 point)
Complete the path for input file in the code below and enter your name in this Markdown cell:
- Name: 
### Note: You will need to change the name of the input file in the cluster mode to `Day_2020_profile.csv`

In [None]:
Scanners_df = ss.read.csv("/storage/home/???/work/MiniProj3/sampled_profile.csv", header= True, inferSchema=True )

## We can use printSchema() to display the schema of the DataFrame Scanners_df to see whether it was inferred correctly.

In [None]:
Scanners_df.printSchema()

# In this mini project, our goal is to implement, interpret, and evaluate the results for reducing the dimensionality of the feature space using PCA.

## Like MiniProject 2, we can filter out those scanners that scan only 1 port, because they can be easily grouped using groupBy on ``ports_scanned_str``

### Because the feature `numports` records the total number of ports scanned by each scanner, we can use it to separate 1-port scanners from multi-port scanners.

In [None]:
multi_port_scanners= Scanners_df.where(col("numports")>1)

In [None]:
multi_port_scanners_count = multi_port_scanners.count()

In [None]:
print(multi_port_scanners_count)

# Like MiniProject 2, we select a threshold for extreme scanners based on the largest number of ports that is scanned by at least two scanners

In [None]:
ScannersCount_byNumPorts = multi_port_scanners.groupby("numports").count()

In [None]:
ScannersCount_byNumPorts.show(3)

# Exercise 2 (9 points)
Complete the code below to find non-extreme multi-port scanners, using the ``ScannersCount_byNumPorts`` DataFrame, in three steps:
- Step 1: Among the ``numports`` values shared by at least two multi-port scanners, find the largest. We can use the DataFrame aggregation method ``agg({ "numports" : "max" })``. The result is a DataFrame with a single column named ``max(numports)``. Obtain the value from the DataFrame and save it as the threshold for extreme scanners.
- Step 2: Keep the multi-port scanners whose ``numports`` is at or below the threshold for extreme scanners, and save the result in the ``non_extreme_multi_port_scanners`` DataFrame. This is the data we want to cluster using k-means, with and without PCA.
- Step 3: Save the scanners whose ``numports`` is above the threshold for extreme scanners in a CSV file.
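
The thresholding logic in these steps can be rehearsed on a toy list before writing the Spark version. The sketch below is plain Python rather than PySpark, and the `numports_per_scanner` values are made up for illustration. It assumes the threshold is the largest `numports` value shared by more than one multi-port scanner.

```python
from collections import Counter

# Hypothetical numports values, one per scanner (not taken from the dataset).
numports_per_scanner = [1, 1, 2, 2, 3, 3, 3, 5, 5, 40, 97]

# Keep multi-port scanners only (numports > 1).
multi_port = [n for n in numports_per_scanner if n > 1]

# Count how many scanners share each numports value (like groupby + count).
scanners_by_numports = Counter(multi_port)

# Step 1: the largest numports value shared by more than one scanner.
threshold = max(n for n, cnt in scanners_by_numports.items() if cnt > 1)

# Step 2: non-extreme multi-port scanners are at or below the threshold.
non_extreme = [n for n in multi_port if n <= threshold]

# Step 3: extreme scanners are above the threshold.
extreme = [n for n in multi_port if n > threshold]

print(threshold, non_extreme, extreme)
```

In this toy list, 40 and 97 are each scanned by a single scanner, so they fall above the threshold and are treated as extreme.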

## Step 1

In [None]:
Scanners_not_unique_numports =  ScannersCount_byNumPorts.where( col("count") > 1) 

In [None]:
ExtremeScannersNumports_thresholdDF = Scanners_not_unique_numports.agg({ ??? : ??? })

In [None]:
ExtremeScannersNumports_thresholdDF.show()

In [None]:
max_non_rare_NumPorts_rdd = ExtremeScannersNumports_thresholdDF.rdd.map(lambda x: ???)
max_non_rare_NumPorts_rdd.take(1)

In [None]:
max_non_rare_NumPorts_list = max_non_rare_NumPorts_rdd.???
print(max_non_rare_NumPorts_list)

In [None]:
max_non_rare_NumPorts=max_non_rare_NumPorts_list[0]
print(max_non_rare_NumPorts)

## Step 2

In [None]:
non_extreme_multi_port_scanners = Scanners_df.where(col("numports") <= ???).where(col("numports") > ???)

In [None]:
non_extreme_multi_port_scanners.count()

## Step 3

In [None]:
extreme_scanners = Scanners_df.where(col("numports") > ???)

In [None]:
path2="/storage/home/???/work/MiniProj3/local/Extreme_Scanners.csv"
extreme_scanners.write.option("header",True).csv(path2)

# Part A: One Hot Encoding of Top 120 Ports
- Like Miniproject 2, we want to apply one hot encoding to the top 120 ports scanned by scanners.  
- Unlike Miniproject 2, however, we will apply PCA to reduce the dimensionality (from 120 port features to 30 PCA features )

In [None]:
non_extreme_multi_port_scanners.select("ports_scanned_str").show(4)

# For each port scanned, count the Total Number of Scanners that Scan the Given Port
Like MiniProject 2, to calculate this, we need to:
- (a) Convert the ``ports_scanned_str`` column into an array/list of ports
- (b) Convert the DataFrame into an RDD
- (c) Use flatMap to count the total number of scanners for each port.

## (a) Split the column "ports_scanned_str" into an array of ports, stored in a new column "Ports_Array".

In [None]:
# (a)
NEMP_Scanners_df=non_extreme_multi_port_scanners.withColumn("Ports_Array", split(col("ports_scanned_str"), "-") )
NEMP_Scanners_df.show(2)

## (b) We convert the column ```Ports_Array``` into an RDD so that we can apply flatMap for counting.

In [None]:
Ports_Scanned_RDD = NEMP_Scanners_df.select("Ports_Array").rdd

In [None]:
Ports_Scanned_RDD.take(2)

## (c) Like MiniProject 2, we can count the total number of scanners for each port by counting the total occurrences of each port number through flatMap.
### We can then count the total number of occurrences of a port using map and reduceByKey, like counting word/hashtag frequency in tweets.
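
As a rough illustration of the flatMap / map / reduceByKey pattern, here is a plain-Python sketch on made-up data (it is not PySpark, and the `ports_scanned_strs` values are hypothetical). It assumes each scanner lists a port at most once, so the number of occurrences of a port equals the number of scanners that scan it.

```python
from collections import Counter
from itertools import chain

# Hypothetical ports_scanned_str values: ports joined by "-".
ports_scanned_strs = ["23-80", "23-2323", "80-8080-23"]

# Split each string into a list of ports, like split(col(...), "-").
ports_arrays = [s.split("-") for s in ports_scanned_strs]

# Flatten the lists into one sequence of ports, like flatMap.
all_ports = list(chain.from_iterable(ports_arrays))

# Pair each port with 1 (like map), then add the 1s per port (like reduceByKey).
port_counts = Counter()
for port, one in ((p, 1) for p in all_ports):
    port_counts[port] += one

print(port_counts)
```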

In [None]:
Ports_Scanned_RDD.take(3)

In [None]:
Ports_list_RDD = Ports_Scanned_RDD.flatMap(lambda row: row.Ports_Array )

In [None]:
Ports_list_RDD.take(3)

In [None]:
Port_1_RDD = Ports_list_RDD.map(lambda x: (x, 1))
Port_1_RDD.take(2)

In [None]:
Port_count_RDD = Port_1_RDD.reduceByKey(lambda x,y: x+y, 5)
Port_count_RDD.take(3)

In [None]:
Port_count_RDD.count()

## Exercise 3 (5 points) Complete the code below to find the top 120 ports scanned by non-extreme multi-port scanners. We use the top 120 ports for MiniProject 3 (both local mode and cluster mode).

In [None]:
Sorted_Count_Port_RDD = Port_count_RDD.map(lambda x: (???, ???)).sortByKey( ascending = False)

In [None]:
top_k_ports = 120

In [None]:
Sorted_Ports_RDD= Sorted_Count_Port_RDD.map(lambda x: ??? )
Top_Ports_list = Sorted_Ports_RDD.take(???)

In [None]:
Top_Ports_list

#  A.2 One Hot Encoding of Top K Ports
## One-Hot-Encoded Feature/Column Name
Like Mini Project 2, we need to create a name for each one-hot-encoded feature. We adopt the convention that the column name for each top k port is "PortXXXX", where "XXXX" is a port number. This can be done by concatenating "Port" with a port number (string) in the sorted list ``Top_Ports_list`` using ``+``.

The code below shows the OHE feature name for the last of the top k ports (i.e., the 120th top port).

In [None]:
Top_Ports_list[top_k_ports - 1]

In [None]:
FeatureName = "Port"+Top_Ports_list[top_k_ports - 1]

In [None]:
FeatureName

## One-Hot-Encoding using withColumn and array_contains

In [None]:
from pyspark.sql.functions import array_contains

## Similar to MiniProject 2, generate a One-Hot Encoded feature for each of the top k ports in the Top_Ports_list

- Iterate through the Top_Ports_list so that each top port is one-hot encoded into the DataFrame for non-extreme multi-port scanners (i.e., `NEMP_Scanners2_df`).
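
The idea behind this iteration can be sketched in plain Python, with dictionaries standing in for DataFrame rows. The `scanners` records and the short `top_ports_list` below are made up for illustration; the membership test plays the role that `array_contains` plays in Spark.

```python
# Hypothetical top ports and scanner records (not taken from the dataset).
top_ports_list = ["23", "80", "2323"]
scanners = [
    {"id": "a", "Ports_Array": ["23", "80"]},
    {"id": "b", "Ports_Array": ["2323", "9999"]},
]

for port in top_ports_list:
    # Feature name follows the "PortXXXX" convention.
    feature_name = "Port" + port
    for scanner in scanners:
        # True when the scanner's port list contains this top port.
        scanner[feature_name] = port in scanner["Ports_Array"]

for scanner in scanners:
    print(scanner)
```

Port 9999 is not in the top list, so it produces no feature; scanner `b` is represented only by `Port2323`.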

## Exercise 4 (5 points) Complete the following PySpark code for encoding the n top ports using One Hot Encoding, where n is specified by the variable ```top_k_ports```

In [None]:
top_k_ports

In [None]:
Top_Ports_list[top_k_ports - 1]

In [None]:
# Initialize NEMP_Scanners2_df
NEMP_Scanners2_df = NEMP_Scanners_df
NEMP_Scanners_df.persist()

In [None]:
for i in range(0, ????):
    # "Port" + Top_Ports_list[i]  is the name of each new feature created through One Hot Encoding Top_Ports_list
    NEMP_Scanners3_df = NEMP_Scanners2_df.???(???, ????(???, Top_Ports_list[??]))
    NEMP_Scanners2_df = NEMP_Scanners3_df

In [None]:
NEMP_Scanners2_df.printSchema()

## Exercise 5 (5 points)  Complete the code below to use k-means to cluster non-extreme multi-port scanners using one-hot-encoded top 120 ports.

# Specify One-Hot Encoded Top k Ports as Input Features for k Means Clustering

In [None]:
input_features = [ ]
for i in range(0, ??? ):
    input_features.append( ??? )

In [None]:
print(input_features)

# Part B k-Means Clustering using 120 OHE features (number of clusters k=200)

In [None]:
va = VectorAssembler().setInputCols(input_features).setOutputCol("features")

In [None]:
NEMP_Scanners2_df.printSchema()

In [None]:
data= va.transform(???)

In [None]:
data.show(3)

In [None]:
data.persist()

In [None]:
total_clusters = 200
km = KMeans(featuresCol= "features", predictionCol="prediction").setK(total_clusters).setSeed(123)
km.explainParams()

In [None]:
kmModel=km.fit(data)

In [None]:
NEMP_Scanners_df.unpersist()

In [None]:
kmModel

In [None]:
predictions = kmModel.transform(data)

In [None]:
predictions.persist()

In [None]:
predictions.show(1)

In [None]:
Cluster1_df=predictions.where(col("prediction")==0)

In [None]:
Cluster1_df.count()

In [None]:
summary = kmModel.summary

In [None]:
summary.clusterSizes

In [None]:
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)

In [None]:
print('Silhouette Score of the Clustering Result using', top_k_ports, 'OHE Top Ports, without using PCA, is', silhouette)
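
For intuition about what the silhouette score measures, here is a minimal plain-Python sketch on four made-up points. It is not Spark's implementation; it assumes squared Euclidean distance (to my knowledge the default distance measure of `ClusteringEvaluator`), and the helper names `squared_euclidean` and `mean_silhouette` are hypothetical.

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_silhouette(points, labels):
    """Average silhouette over all points; assumes at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so divide by len(own) - 1).
        a = sum(squared_euclidean(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(squared_euclidean(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = mean_silhouette(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = mean_silhouette(points, [0, 1, 0, 1])   # each cluster mixes far-apart points
print(good, bad)
```

A score near 1 indicates compact, well-separated clusters, while a negative score suggests that many points sit closer to another cluster than to their own.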

In [None]:
centers = kmModel.clusterCenters()

# We will later use this two-dimensional array ``centers[i][j]`` to access the cluster center value of the jth top port for cluster i generated by the k-means model.

In [None]:
print(centers)

# Similar to MiniProject 2, record cluster index, cluster size, percentage of Mirai scanners, and cluster centers for each cluster formed.
## The value of a cluster center for an OHE top port is the fraction of scanners in the cluster that scan the top port. For example, a cluster center `[0.094, 0.8, 0, ...]` indicates the following:
- 9.4% of the scanners in the cluster scan Top_Ports_list[0]: port 17132
- 80% of the scanners in the cluster scan Top_Ports_list[1]: port 17130
- No scanners in the cluster scan Top_Ports_list[2]: port 17140
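
This interpretation follows from a k-means center being the mean of its members' feature vectors: for 0/1 features, the mean is the fraction of members with a 1. The plain-Python sketch below illustrates this on one made-up cluster; the `is_mirai` flag is a hypothetical stand-in for the Mirai label, not the actual column name in the dataset.

```python
# Hypothetical members of one cluster: OHE vectors over three top ports,
# plus a made-up Mirai flag.
cluster_members = [
    {"ohe": [1, 1, 0], "is_mirai": True},
    {"ohe": [0, 1, 0], "is_mirai": False},
    {"ohe": [0, 1, 0], "is_mirai": False},
    {"ohe": [0, 1, 0], "is_mirai": True},
]

size = len(cluster_members)
num_features = len(cluster_members[0]["ohe"])

# Center value for feature j = mean of the 0/1 values = fraction of members scanning port j.
center = [
    sum(member["ohe"][j] for member in cluster_members) / size
    for j in range(num_features)
]

# Mirai ratio = fraction of members flagged as Mirai.
mirai_ratio = sum(1 for member in cluster_members if member["is_mirai"]) / size

print(size, mirai_ratio, center)
```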

# Exercise 6 (10 points) Complete the code below for computing the percentage of Mirai scanners for each cluster, and record it together with the cluster centers for each cluster (without PCA).

In [None]:
import pandas as pd
import numpy as np
import math

In [None]:
c1_centers = centers[0][0:top_k_ports]

In [None]:
print(c1_centers)

In [None]:
# Define columns of the Pandas dataframe
column_list = ['cluster ID', 'size', 'mirai_ratio' ]
for feature in input_features:
    column_list.append(feature)
clusters_summary_df = pd.DataFrame( columns = column_list )
for i in range(0, total_clusters):
    cluster_i = predictions.where(??? )
    ???.persist()
    cluster_i_size = cluster_i.count()
    cluster_i_mirai_count = cluster_i.where(???).???
    cluster_i_mirai_ratio = cluster_i_mirai_count/cluster_i_size
    if cluster_i_mirai_count > 0:
        print("Cluster ", i, "; Mirai Ratio:", cluster_i_mirai_ratio, "; Cluster Size: ", cluster_i_size)
    cluster_row = [i, cluster_i_size, cluster_i_mirai_ratio]
    for j in range(0, len(input_features)):
        cluster_row.append(centers[i][j] )
    clusters_summary_df.loc[i]= cluster_row
    ???.unpersist()

In [None]:
path4= "/storage/home/???/work/MiniProj3/local/Clusters_Mirai_Ratio_120OHE_k200.csv"
clusters_summary_df.to_csv(path4, header=True)

In [None]:
predictions.unpersist()

# Part C Using PCA for Dimension Reduction

# We use PCA to reduce the input dimension from 120 to 30.

In [None]:
reduced_dimension = 30
pca_model = PCA(k= reduced_dimension, inputCol = "features", outputCol="pca_features")

# The `pca_model` is a template for constructing a PCA model.
## After we apply `fit` to a PCA template, we obtain an actual mapping from the original feature space to the PCA space.
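
To make the fit/transform distinction concrete, here is a minimal plain-Python sketch of PCA that reduces two made-up 0/1 features to one component. It is not Spark's implementation: it uses the closed-form eigen-decomposition of a 2x2 covariance matrix, and it centers the data before projecting, which may differ from what a given library does. The helpers `fit_pca_2d` and `transform_pca_2d` and the toy `points` are hypothetical.

```python
import math


def fit_pca_2d(points):
    """Learn the first principal component of 2-D points (the 'fit' step)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]
    # Entries of the sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Eigenvector for the larger eigenvalue.
    if b != 0:
        vx, vy = b, lam1 - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    component = (vx / norm, vy / norm)
    explained = lam1 / (lam1 + lam2) if (lam1 + lam2) > 0 else 0.0
    return (mean_x, mean_y), component, explained


def transform_pca_2d(points, mean, component):
    """Project points onto the learned component (the 'transform' step)."""
    return [
        (p[0] - mean[0]) * component[0] + (p[1] - mean[1]) * component[1]
        for p in points
    ]


# Two made-up OHE features that usually occur together.
points = [(1, 1), (1, 1), (0, 0), (0, 0), (1, 0)]
mean, component, explained = fit_pca_2d(points)
projected = transform_pca_2d(points, mean, component)
print(component, explained, projected)
```

Because the two toy features are strongly correlated, a single component retains most of the variance, which is the property that makes reducing 120 OHE features to 30 PCA features plausible.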

In [None]:
p_model = pca_model.fit(data)

In [None]:
p_data = p_model.transform(data)

In [None]:
p_data.persist()

## Notice we change the `featuresCol` for k-means clustering to `"pca_features"` because we want to use its reduced dimension for clustering.

In [None]:
total_clusters = 200
km2 = KMeans(featuresCol= "pca_features", predictionCol="pca_prediction").setK(total_clusters).setSeed(123)
km2.explainParams()

In [None]:
data.unpersist()

In [None]:
kmModel_p=km2.fit(p_data)

In [None]:
kmModel_p

In [None]:
p_predictions = kmModel_p.transform(p_data)

In [None]:
p_predictions.persist()

In [None]:
summary_p = kmModel_p.summary

In [None]:
summary_p.clusterSizes

# Notice that we need to specify `featuresCol` explicitly because the default is "features", but we are using "pca_features" as input features for Part C.

In [None]:
evaluator = ClusteringEvaluator(predictionCol="pca_prediction", featuresCol="pca_features")
silhouette = evaluator.evaluate(p_predictions)

In [None]:
print('Silhouette Score of the Clustering Result Using PCA-reduced dimension ',reduced_dimension,' on ', top_k_ports, ' OHE port features is ', silhouette)

In [None]:
top_k_ports

# Compute the Cluster Centers from clustering on PCA-reduced dimensions in the original dimensions (One Hot Encoded top port features)

## Record the cluster centers in the ``cluster_center_array`` where the row index refers to clusters, and the column index refers to OHE features (i.e., top ports).

In [None]:
import numpy as np
# Initialize the cluster center array (for the original port dimensions) to zeros
cluster_center_array = np.zeros([total_clusters, top_k_ports])

## For each OHE top port (say top_port_i), do the following:
- Step 1: Filter the ``p_predictions`` DataFrame on the OHE top port feature, which returns all scanners that scan the specific top port.
- Step 2: Apply ``groupBy`` on ``pca_prediction``, which groups the filtered scanners by the cluster they belong to (based on clustering on PCA-reduced dimensions).
- Step 3: Apply ``count()`` to the grouped data returned by ``groupBy``, which computes the total number of scanners that scan top_port_i in each cluster from PCA_kMeans_clustering.
- Step 4: Convert the resulting DataFrame to an RDD and use ``collect()`` to save it as a list. Each element of the list has the form ``( <cluster_id> , <number of scanners that scan top_port_i in the cluster> )``.
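
The four steps can be sketched in plain Python on a handful of made-up rows (this is not PySpark; `rows`, the short `top_ports_list`, and the two-cluster setup are hypothetical). Dividing each count by the cluster size at the end yields the cluster center in the original OHE feature space.

```python
from collections import Counter

# Hypothetical rows: (pca_prediction, set of top ports the scanner scans).
rows = [
    (0, {"23", "80"}),
    (0, {"23"}),
    (1, {"2323"}),
    (1, {"2323", "80"}),
    (1, {"80"}),
]
top_ports_list = ["23", "80", "2323"]
total_clusters = 2

cluster_sizes = Counter(cluster for cluster, _ in rows)

# counts[c][i] = number of scanners in cluster c that scan the ith top port.
counts = [[0] * len(top_ports_list) for _ in range(total_clusters)]
for i, port in enumerate(top_ports_list):
    # Steps 1-3: keep scanners that scan this port, group by cluster, count.
    per_cluster = Counter(cluster for cluster, ports in rows if port in ports)
    # Step 4: walk the (cluster_id, count) pairs and record them.
    for cluster, count in per_cluster.items():
        counts[cluster][i] = count

# Center in the original space: fraction of the cluster's scanners that scan each port.
centers = [
    [counts[c][i] / cluster_sizes[c] for i in range(len(top_ports_list))]
    for c in range(total_clusters)
]
print(centers)
```

Clusters that contain no scanner for a given port never appear in the grouped counts, which is why the array is initialized to zeros first. This sketch also assumes every cluster is non-empty; an empty cluster would cause a division by zero.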

## Below is an example of Step 1, 2, and 3 for the OHE feature for the first top port.

In [None]:
i=0
feature_name= "Port" + Top_Ports_list[i]
feature_i_count_by_clusters = p_predictions.where(col(feature_name)).groupBy("pca_prediction").count()

In [None]:
fc_bc_rdd = feature_i_count_by_clusters.rdd

In [None]:
fc_bc_rdd.take(10)

## The DataFrame ``feature_i_count_by_clusters`` contains two columns: ``pca_prediction`` and ``count``.
## The RDD converted from the DataFrame consists of Row objects with these two columns, where ``count`` is the number of scanners in the ``pca_prediction`` cluster that scan a given top port (i.e., the first top port in this example).
## Can you explain what the first two elements of the RDD ``fc_bc_rdd`` mean?

## Naming convention: We use ``fc_bc`` in variable names as shorthand for "feature count by cluster".

In [None]:
fc_bc_list =fc_bc_rdd.collect()

In [None]:
print(fc_bc_list)

# By iterating through ``fc_bc_list`` generated above, we can find the number of scanners that scan a given top port in each cluster generated by PCA_kmeans_clustering.

In [None]:
# For the first top port, the number of scanners in each PCA-generated cluster that scan the port.
for row in fc_bc_list:
    print("PCA_kmeans Cluster Index ", row[0], ": contains ", row[1], " scanners that scan Port ", Top_Ports_list[0] )

In [None]:
p_predictions.persist()
for i in range(0, top_k_ports):
    feature_i_name = "Port" + Top_Ports_list[i]
    feature_i_count_by_clusters_DF = p_predictions.where(col(feature_i_name)).groupBy(col("pca_prediction")).count()
    fic_bc_list = feature_i_count_by_clusters_DF.rdd.collect()
    # fic_bc_list is a list of (cluster_index, count of scanners that scan ith top port)
    for row in fic_bc_list:
        cluster_center_array[row[0]][i] = row[1]

In [None]:
# total_clusters = 200

In [None]:
import pandas as pd

# Exercise 7 (25 points)
## Complete the code below for computing cluster centers for each cluster using the ``cluster_center_array`` calculated above.

In [None]:
# The number of total clusters (`total_clusters`) was specified earlier when we created k-means model template. 
# Define columns of the Pandas dataframe
column_list = ['cluster ID', 'size', 'mirai_ratio' ]
for feature in input_features:
    column_list.append(feature)
clusters_summary_df = pd.DataFrame( columns = column_list )
for i in range(0, total_clusters):
    cluster_i = p_predictions.where(??? == i)
    ???.persist()
    cluster_i_size = cluster_i.???
    cluster_i_mirai_count = cluster_i.where(????).???
    cluster_i_mirai_ratio = cluster_i_mirai_count/cluster_i_size
    if cluster_i_mirai_count > 0:
        print("Cluster ", i, "; Mirai Ratio:", cluster_i_mirai_ratio, "; Cluster Size: ", cluster_i_size)
    cluster_row = [i, cluster_i_size, cluster_i_mirai_ratio]
    for j in range(0, len(input_features)):
        # compute the center for the original jth feature (i.e., jth top port)
        feature_j = "Port" + Top_Ports_list[j]
        count_i_j = cluster_center_array[i][j]
        # count_j = cluster_i.where(col(feature_j)).count()
        center_j = count_i_j / cluster_i_size
        cluster_row.append(center_j)
    clusters_summary_df.loc[i]= cluster_row
    ???.unpersist()

In [None]:
path5= "/storage/home/????/work/MiniProj3/local/Mirai_Ratio_120OHE_PCA30_k200.csv"
clusters_summary_df.to_csv(path5, header=True)

In [None]:
clusters_summary_df

# Exercise 8 (40 points)
Modify the Jupyter Notebook to run in cluster mode using the big dataset (`Day_2020_profile.csv`). Make sure you change the output directory from `../local/..` to
`../cluster/..` so that it does not overwrite the results you obtained in local mode.
Run the .py file in the cluster mode to calculate cluster centers and Mirai percentage for each cluster, with and without PCA.
- Submit the .py file (5 points)
- Submit the log file that contains the run time information for a successful execution in the cluster mode. (5 points)
- Submit the output file that records the cluster summary in the cluster mode (without PCA) (10 points)
- Submit the output file that records the cluster summary in the cluster mode (with PCA) (10 points)
- Discuss the Silhouette score and Mirai ratio of clusters generated by k-means clustering with PCA and without PCA (in a separate Word document) (10 points)

In [None]:
ss.stop()