## Spark Group Assignment

Group O-2-8

### Agenda
#### 1. Load Data
#### 2. Inspect Data
#### 3. Preprocess Data
#### 4. Create A Model
#### 5. Make Predictions
#### 6. Evaluate Predictions

### 1. Spark Setup

In [31]:
import os
print(os.environ['SPARK_HOME'])
dataset_path="/home/ubuntu/challenge_1/"

/usr/local/software/spark


In [60]:
import pandas as pd

In [32]:
#import findspark
#findspark.init()
import pyspark

In [33]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Adolfo-Dataset") \
    .getOrCreate()

In [34]:
spark.version

'2.2.0'

### 2. Data Loading

The raw file has no header row, so Spark assigns default column names (`_c0`, `_c1`, ...) when reading it. Meaningful column names are attached after loading.

In [35]:
# ---------
# Option 1: Use SparkSession and infer the schema, then add column names
# ---------

df = spark.read \
    .option("inferSchema", "true") \
    .csv("file://"+dataset_path+"full.data")

In [36]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: integer (nullable = true)
 |-- _c5: integer (nullable = true)
 |-- _c6: integer (nullable = true)
 |-- _c7: integer (nullable = true)
 |-- _c8: integer (nullable = true)
 |-- _c9: integer (nullable = true)
 |-- _c10: integer (nullable = true)
 |-- _c11: integer (nullable = true)
 |-- _c12: integer (nullable = true)
 |-- _c13: integer (nullable = true)
 |-- _c14: integer (nullable = true)
 |-- _c15: integer (nullable = true)
 |-- _c16: integer (nullable = true)
 |-- _c17: integer (nullable = true)
 |-- _c18: integer (nullable = true)
 |-- _c19: integer (nullable = true)
 |-- _c20: integer (nullable = true)
 |-- _c21: integer (nullable = true)
 |-- _c22: integer (nullable = true)
 |-- _c23: integer (nullable = true)
 |-- _c24: double (nullable = true)
 |-- _c25: double (nullable = true)
 |-- _c26: double (nullable = true)
 |-- _c27: d

In [51]:
features=["duration", "protocol_type", "service", "flag", "src_bytes","dst_bytes", \
          "land","wrong_fragment","urgent","hot","num_failed_logins","logged_in", \
          "num_compromised","root_shell","su_attempted","num_root","num_file_creations", \
          "num_shells","num_access_files","num_outbound_cmds","is_host_login","is_guest_login", \
          "count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate",\
          "same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count", \
          "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate", \
          "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate",\
          "dst_host_srv_rerror_rate"]

target=["connection"]

fieldnames=features+target

rawnames=df.schema.names

# Rename columns positionally: oldnames[i] becomes newnames[i]
def updateColNames(df, oldnames, newnames):
    for old, new in zip(oldnames, newnames):
        df = df.withColumnRenamed(old, new)
    return df

df=updateColNames(df,rawnames,fieldnames)

df.printSchema()

root
 |-- duration: integer (nullable = true)
 |-- protocol_type: string (nullable = true)
 |-- service: string (nullable = true)
 |-- flag: string (nullable = true)
 |-- src_bytes: integer (nullable = true)
 |-- dst_bytes: integer (nullable = true)
 |-- land: integer (nullable = true)
 |-- wrong_fragment: integer (nullable = true)
 |-- urgent: integer (nullable = true)
 |-- hot: integer (nullable = true)
 |-- num_failed_logins: integer (nullable = true)
 |-- logged_in: integer (nullable = true)
 |-- num_compromised: integer (nullable = true)
 |-- root_shell: integer (nullable = true)
 |-- su_attempted: integer (nullable = true)
 |-- num_root: integer (nullable = true)
 |-- num_file_creations: integer (nullable = true)
 |-- num_shells: integer (nullable = true)
 |-- num_access_files: integer (nullable = true)
 |-- num_outbound_cmds: integer (nullable = true)
 |-- is_host_login: integer (nullable = true)
 |-- is_guest_login: integer (nullable = true)
 |-- count: integer (nullable = true
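
The renaming loop above issues one `withColumnRenamed` call per column. As a minimal sketch of an alternative, `DataFrame.toDF` replaces all column names in a single positional call. The helper name `rename_all` and its length check are assumptions added for illustration; they are not part of the original notebook.

```python
def rename_all(df, new_names):
    """Return df with every column renamed positionally.

    `df` is assumed to be a pyspark.sql.DataFrame; `rename_all` is a
    hypothetical helper, not part of the original notebook.
    """
    # Guard against a silent mismatch, e.g. a forgotten target column name.
    if len(df.columns) != len(new_names):
        raise ValueError(
            "expected %d names, got %d" % (len(df.columns), len(new_names))
        )
    # toDF takes the new names as positional arguments.
    return df.toDF(*new_names)
```

In this notebook it would be called as `df = rename_all(df, fieldnames)`.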

### 3. Data Inspection

In [54]:
# Print the number of records in the data frame
print('Nb. of records  : %d' % df.count())

Nb. of records  : 4898431


In [61]:
df.describe().toPandas().to_csv("data_summary")

In [58]:
summary.write.csv('data_summary.csv')

AttributeError: 'NoneType' object has no attribute 'write'

In [59]:
summary.toPandas().to_csv('data_summary.csv')

AttributeError: 'NoneType' object has no attribute 'toPandas'
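
Both errors mean that `summary` was `None` when the cells ran. The cell that assigned `summary` is not shown, so the cause is an assumption: a common one is binding the result of `df.describe().show()`, since `show()` prints the table and returns `None`. A minimal sketch that keeps the DataFrame and displays it in separate steps (the helper name `describe_and_show` is hypothetical):

```python
def describe_and_show(df):
    """Print summary statistics and return them as a DataFrame.

    `df` is assumed to be a pyspark.sql.DataFrame.
    """
    summary = df.describe()  # a new DataFrame with count, mean, stddev, min, max
    summary.show()           # prints the table; returns None, so do not assign it
    return summary
```

With `summary = describe_and_show(df)`, the call `summary.toPandas().to_csv('data_summary.csv')` has a DataFrame to work on, which is what cell In [61] achieves by calling `df.describe()` directly. Note that `summary.write.csv(...)` writes a directory of part files rather than a single CSV file.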

In [None]:
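# 3a. Register the DataFrame as a temporary view so it can be queried with SQL
# (createOrReplaceTempView replaces registerTempTable, deprecated since Spark 2.0)
df.createOrReplaceTempView("network_data")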
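
Once a view named `network_data` is registered, it can be queried through `spark.sql`. A minimal sketch that counts records per value of the `connection` column, assuming the registered DataFrame carries the renamed columns; the helper name `count_by_connection` and the alias `n` are assumptions for illustration:

```python
# Aggregate query against the temporary view registered above.
QUERY = """
    SELECT connection, COUNT(*) AS n
    FROM network_data
    GROUP BY connection
    ORDER BY n DESC
"""

def count_by_connection(spark):
    """Run the aggregate on the given SparkSession and return a DataFrame."""
    return spark.sql(QUERY)
```

Calling `count_by_connection(spark).show()` would print one row per connection label, most frequent first.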

In [12]:
df.select("duration", "hot").show()

+--------+---+
|duration|hot|
+--------+---+
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
|       0|  0|
+--------+---+
only showing top 20 rows
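
The first 20 rows are all zeros in both columns, which says little about the remaining records. A minimal sketch for looking past them keeps only rows where at least one of the selected columns is positive. The helper names are hypothetical; `DataFrame.where` accepts a SQL expression string.

```python
def positive_condition(columns):
    """Build a SQL predicate that is true when any listed column is > 0."""
    # Backticks quote the identifiers, which is safe for names such as `count`.
    return " OR ".join("`%s` > 0" % name for name in columns)

def show_positive(df, columns, n=20):
    """Show up to n rows of `columns` where any of them is positive.

    `df` is assumed to be a pyspark.sql.DataFrame.
    """
    df.where(positive_condition(columns)).select(*columns).show(n)
```

For the two columns above, the call would be `show_positive(df, ["duration", "hot"])`.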

