## Web References

### System

- [How to copy a file without using scp inside an ssh session?](https://superuser.com/questions/291423/how-to-copy-a-file-without-using-scp-inside-an-ssh-session)

### PySpark

- [Complete Machine Learning Project with PySpark MLlib Tutorial](https://www.youtube.com/watch?v=1a7bB1ZcZ3k)
- [The ONLY PySpark Tutorial You Will Ever Need.](https://www.youtube.com/watch?v=cZS5xYYIPzk)

### Anomaly Detection

- [How to Build an Anomaly Detection Engine with Spark, Akka and Cassandra](https://learning.oreilly.com/videos/how-to-build/9781491955253/9781491955253-video244545/)
- [Real Time Detection of Anomalies in the Database Infrastructure using Apache Spark](https://www.youtube.com/watch?v=1IsMMmug5q0)

### Other

- [What is CRISP DM?](https://www.datascience-pm.com/crisp-dm-2/)

## Import Libraries

In [1]:
import pyspark
from pyspark.sql import SparkSession

## HDFS Preparation

In [2]:
%%bash
#!/bin/bash

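# if the router directory already exists, reset only its output directory;
# otherwise create the directory tree and upload the input file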
if hadoop fs -test -d router; then
    # delete the output directory
    hadoop fs -rm -r router/output

    # create a new output directory
    hadoop fs -mkdir router/output
else
    # create the router directory and upload the input files
    hadoop fs -mkdir router
    hadoop fs -mkdir router/raw
    hadoop fs -put data/bandwidth.csv router/raw/

    # create the output directory
    hadoop fs -mkdir router/output
fi

hadoop fs -ls router/raw


Deleted router/output
Found 1 items
-rw-r--r--   3 jfoul001 users    1136415 2022-03-04 12:05 router/raw/bandwidth.csv


## Initialize the Spark Session

In [3]:
spark = SparkSession.builder.appName('cw02').getOrCreate()
spark

Setting default log level to "ERROR".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


## I. Data Understanding

Identify, collect, and analyze the data sets that will help accomplish the project goals.

### A. Collect Initial Data

In [4]:
# column names for the headerless CSV, in file order
columns = ['Direction', 'Interval Length', 'Intervals Saved', 'IP',
           'Interval Start', 'Interval End', 'Bytes Used']
df_raw = spark.read.csv('router/raw/bandwidth.csv', header=False, inferSchema=True).toDF(*columns)


### B. Describe Data

Examine the data and document its surface properties, such as data format, number of records, and field identities.

#### 1. Data Format

In [5]:
df_raw.printSchema()

root
 |-- Direction: string (nullable = true)
 |-- Interval Length: string (nullable = true)
 |-- Intervals Saved: string (nullable = true)
 |-- IP: string (nullable = true)
 |-- Interval Start: string (nullable = true)
 |-- Interval End: integer (nullable = true)
 |-- Bytes Used: long (nullable = true)



In [6]:
df_raw.show(5)

+---------+---------------+---------------+--------+--------------+------------+----------+
|Direction|Interval Length|Intervals Saved|      IP|Interval Start|Interval End|Bytes Used|
+---------+---------------+---------------+--------+--------------+------------+----------+
| download|              2|            449|COMBINED|    1646313474|  1646313476|    937174|
| download|              2|            449|COMBINED|    1646313476|  1646313478|    479125|
| download|              2|            449|COMBINED|    1646313478|  1646313480|    779950|
| download|              2|            449|COMBINED|    1646313480|  1646313482|   1241356|
| download|              2|            449|COMBINED|    1646313482|  1646313484|    440434|
+---------+---------------+---------------+--------+--------------+------------+----------+
only showing top 5 rows
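
The `Interval Start` and `Interval End` values in the sample rows look like Unix epoch timestamps in seconds. That is an assumption; the data itself does not state it. Under that assumption, a minimal plain-Python sketch for sanity-checking a single row (the helper name `epoch_to_utc` is hypothetical):

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds: int) -> datetime:
    """Interpret an integer as Unix epoch seconds and return a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# 'Interval Start' and 'Interval End' from the first sample row above
first_start = epoch_to_utc(1646313474)
first_end = epoch_to_utc(1646313476)

print(first_start.isoformat())                    # 2022-03-03T13:17:54+00:00
print((first_end - first_start).total_seconds())  # 2.0
```

The two-second gap matches the row's `Interval Length` of 2, and the resulting date falls one day before the upload date in the HDFS listing, both of which are consistent with the epoch-seconds reading.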

