## Web References

### System

- [How to copy a file without using scp inside an ssh session?](https://superuser.com/questions/291423/how-to-copy-a-file-without-using-scp-inside-an-ssh-session)

### PySpark

- [Complete Machine Learning Project with PySpark MLlib Tutorial](https://www.youtube.com/watch?v=1a7bB1ZcZ3k)
- [The ONLY PySpark Tutorial You Will Ever Need.](https://www.youtube.com/watch?v=cZS5xYYIPzk)
- [PySpark When Otherwise | SQL Case When Usage](https://sparkbyexamples.com/pyspark/pyspark-when-otherwise/)
- [Spark rlike() Working with Regex Matching Examples](https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/)

### Anomaly Detection

- [How to Build an Anomaly Detection Engine with Spark, Akka and Cassandra](https://learning.oreilly.com/videos/how-to-build/9781491955253/9781491955253-video244545/)
- [Real Time Detection of Anomalies in the Database Infrastructure using Apache Spark](https://www.youtube.com/watch?v=1IsMMmug5q0)

### Other

- [What is CRISP DM?](https://www.datascience-pm.com/crisp-dm-2/)

### Internet Traffic

- [Data mining approach for predicting the daily Internet data traffic of a smart university](https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0176-5)

## Import Libraries

In [71]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, asc, desc, col

## HDFS Preparation

In [2]:
%%bash
#!/bin/bash

# reset the output directory if router/ already exists; otherwise create and populate it
if hadoop fs -test -d router; then
    # delete the output directory
    hadoop fs -rm -r router/output

    # create a new output directory
    hadoop fs -mkdir router/output
else
    # create the router directory and upload the input files
    hadoop fs -mkdir router
    hadoop fs -mkdir router/raw
    hadoop fs -put data/bandwidth.csv router/raw/

    # create the output directory
    hadoop fs -mkdir router/output
fi

hadoop fs -ls router/raw


Deleted router/output
Found 1 items
-rw-r--r--   3 jfoul001 users    1136415 2022-03-04 12:05 router/raw/bandwidth.csv


## Initialize the Spark Session

In [3]:
spark = SparkSession.builder.appName('cw02').getOrCreate()
spark

Setting default log level to "ERROR".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


## I. Data Understanding

Identify, collect, and analyze the data sets that will help accomplish the project goals.

### A. Collect Initial Data

In [4]:
# the file has no header row, so name the seven columns explicitly
columns = ['Direction', 'Interval Length', 'Intervals Saved', 'IP', 'Interval Start', 'Interval End', 'Bytes Used']
df_raw = spark.read.csv('router/raw/bandwidth.csv', header=False, inferSchema=True).toDF(*columns)

### B. Describe Data

Examine the data and document its surface properties, such as data format, number of records, and field identities.

#### 1. Data Format

In [5]:
df_raw.printSchema()

root
 |-- Direction: string (nullable = true)
 |-- Interval Length: string (nullable = true)
 |-- Intervals Saved: string (nullable = true)
 |-- IP: string (nullable = true)
 |-- Interval Start: string (nullable = true)
 |-- Interval End: integer (nullable = true)
 |-- Bytes Used: long (nullable = true)



In [9]:
df_raw.count()

20278

In [6]:
df_raw.show(5)

+---------+---------------+---------------+--------+--------------+------------+----------+
|Direction|Interval Length|Intervals Saved|      IP|Interval Start|Interval End|Bytes Used|
+---------+---------------+---------------+--------+--------------+------------+----------+
| download|              2|            449|COMBINED|    1646313474|  1646313476|    937174|
| download|              2|            449|COMBINED|    1646313476|  1646313478|    479125|
| download|              2|            449|COMBINED|    1646313478|  1646313480|    779950|
| download|              2|            449|COMBINED|    1646313480|  1646313482|   1241356|
| download|              2|            449|COMBINED|    1646313482|  1646313484|    440434|
+---------+---------------+---------------+--------+--------------+------------+----------+
only showing top 5 rows
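
The `Interval Start` and `Interval End` values in the preview look like Unix epoch timestamps in seconds. That reading is an assumption; nothing in the file states it. A minimal sketch in plain Python, using the first previewed row, checks that the two timestamps are `Interval Length` seconds apart and derives an average throughput. The helper name `describe_interval` is hypothetical.

```python
from datetime import datetime, timezone


def describe_interval(start, end, bytes_used):
    """Summarise one interval, ASSUMING start/end are Unix epoch seconds (UTC)."""
    length = end - start
    return {
        'start_utc': datetime.fromtimestamp(start, tz=timezone.utc).isoformat(),
        'length_s': length,
        'bytes_per_s': bytes_used / length,
    }


# first row of df_raw.show(5): Interval Length 2, Bytes Used 937174
summary = describe_interval(1646313474, 1646313476, 937174)
print(summary)
```

If the assumption holds, the gap between start and end should equal the `Interval Length` column (2 for this row), which gives a quick consistency check before any aggregation.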



#### 2. Unique Categories

In [76]:
# if the IP column holds an IPv4 address, recode it as simply 'IP'
# (raw string so that \. reaches the regex engine unchanged)
df_interval_type = df_raw.withColumn('interval_type', 
    when(df_raw['IP']
    .rlike(r'^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$'), 'IP')
    .otherwise(df_raw['IP'])
)

df_interval_type.groupBy(['interval_type']).count().show()

+-------------+-----+
|interval_type|count|
+-------------+-----+
|           15|   86|
|           31|  323|
|          449| 2256|
|           24|  286|
|           IP|12106|
|           12|   71|
|     COMBINED| 5150|
+-------------+-----+
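
The IPv4 pattern passed to `rlike` can be sanity-checked outside Spark. The sketch below applies the same expression with Python's `re` module to a few hand-picked strings; the sample values are illustrative and not drawn from `bandwidth.csv`, and the helper name `interval_type` is hypothetical. Spark's `rlike` uses Java regular expressions, which are assumed (not verified on the cluster) to behave the same way for this simple anchored pattern.

```python
import re

# same expression as in the rlike() call, split across two lines for readability
IPV4_PATTERN = re.compile(
    r'^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$'
)


def interval_type(value):
    """Return 'IP' for a dotted-quad IPv4 address, otherwise the value unchanged."""
    return 'IP' if IPV4_PATTERN.search(value) else value


# illustrative inputs: a valid address, an out-of-range octet, and two non-address labels
samples = ['192.168.1.10', '256.1.1.1', 'COMBINED', '449']
recoded = [interval_type(s) for s in samples]
print(recoded)
```

Values that are not valid addresses pass through untouched, which is why labels such as `COMBINED` and bare numbers still appear as their own categories in the grouped counts.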



In [78]:
df_interval_type.groupBy(['Interval Length', 'interval_type', 'Intervals Saved']) \
    .count() \
    .orderBy(['Interval Length'], ascending=True) \
    .show(df_interval_type.count())

+---------------+-------------+---------------+-----+
|Interval Length|interval_type|Intervals Saved|count|
+---------------+-------------+---------------+-----+
|            180|     COMBINED|            479|  960|
|              2|     COMBINED|            449| 1800|
|              2|           IP|            449| 2700|
|           7200|     COMBINED|            359|  720|
|            900|           IP|             24| 1165|
|            900|     COMBINED|             24|   50|
|            day|     COMBINED|            365|  732|
|            day|     COMBINED|             31|   64|
|            day|           IP|             31| 3363|
|       dclass_1|           15|         minute|   16|
|       dclass_1|          449|              2|  450|
|       dclass_1|           24|            900|   25|
|       dclass_1|           12|          month|   13|
|       dclass_1|           24|           hour|   25|
|       dclass_1|           31|            day|   32|
|       dclass_2|           

#### 3. How many records are available for the various intervals?

In [94]:
df_interval_type.drop_duplicates(['Interval Length', 'interval_type', 'Interval Start', 'Interval End']) \
    .groupBy(['Interval Length', 'interval_type']) \
    .count() \
    .orderBy(['Interval Length'], ascending=True) \
    .show(df_interval_type.count())



+---------------+-------------+-----+
|Interval Length|interval_type|count|
+---------------+-------------+-----+
|            180|     COMBINED|  480|
|              2|     COMBINED|  452|
|              2|           IP|  450|
|           7200|     COMBINED|  360|
|            900|           IP|   25|
|            900|     COMBINED|   25|
|            day|           IP|   32|
|            day|     COMBINED|  366|
|       dclass_1|           12|   13|
|       dclass_1|           15|   16|
|       dclass_1|           24|   44|
|       dclass_1|          449|  450|
|       dclass_1|           31|   32|
|       dclass_2|           24|   44|
|       dclass_2|           12|   13|
|       dclass_2|          449|  450|
|       dclass_2|           31|   32|
|       dclass_2|           15|   16|
|       dclass_3|           31|   32|
|       dclass_3|           15|    1|
|       dclass_3|          449|    1|
|       dclass_3|           12|    3|
|       dclass_3|           24|    2|
|       dcla