# Spark Lab 3 - Process Data Files with Spark (Scala Solution)

Assuming you're working on the experimental VM, the default path is /vagrant/IPNB

In [46]:
! pwd

/vagrant/IPNB


## Data Scrubbing - Fixing Issues in devicestatus.txt

In [56]:
!head dev1data/devicestatus.txt

2014-03-15:10:10:20,Sorrento F41L,8cc3b47e-bd01-4482-b500-28f2342679af,7,24,39,enabled,disabled,connected,55,67,12,33.6894754264,-117.543308253
2014-03-15:10:10:20|MeeToo 1.0|ef8c7564-0a1a-4650-a655-c8bbd5f8f943|0|31|63|70|39|27|enabled|enabled|enabled|37.4321088904|-121.485029632
2014-03-15:10:10:20|MeeToo 1.0|23eba027-b95a-4729-9a4b-a3cca51c5548|0|20|21|86|54|34|enabled|enabled|enabled|39.4378908349|-120.938978486
2014-03-15:10:10:20,Sorrento F41L,707daba1-5640-4d60-a6d9-1d6fa0645be0,8,22,60,enabled,enabled,disabled,68,91,17,39.3635186767,-119.400334708
2014-03-15:10:10:20,Ronin Novelty Note 1,db66fe81-aa55-43b4-9418-fc6e7a00f891,0,13,47,70,enabled,enabled,enabled,10,45,33.1913581092,-116.448242643
2014-03-15:10:10:20,Sorrento F41L,ffa18088-69a0-433e-84b8-006b2b9cc1d0,3,10,36,enabled,connected,enabled,53,58,42,33.8343543748,-117.330000857
2014-03-15:10:10:20,Sorrento F33L,66d678e6-9c87-48d2-a415-8d5035e54a23,1,34,74,enabled,disabled,enabled,57,42,15,37.3803954321,-121.840756755
2014-

### Step 1. Load the file from local path
When you load data in Spark, you must use the full path. Spark does not expand shell variables such as $DEV1DATA (which is defined only in the local bash shell).

In [61]:
val mydata = sc.textFile("file:/vagrant/IPNB/dev1data/devicestatus.txt")

mydata: org.apache.spark.rdd.RDD[String] = file:/vagrant/IPNB/dev1data/devicestatus.txt MapPartitionsRDD[11] at textFile at <console>:25
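Because Spark does not expand `$DEV1DATA`, one option is to resolve the variable on the driver and build the full URI before calling `sc.textFile`. The following is a minimal sketch, not part of the lab. The helper name `devicestatusUri` is hypothetical, and it assumes the variable was exported in the environment that launched the notebook kernel; otherwise it falls back to the path used above.

```scala
// Hypothetical helper: build the input URI from an environment map.
// Falls back to the lab's default directory when DEV1DATA is not set.
def devicestatusUri(env: Map[String, String]): String = {
  val base = env.getOrElse("DEV1DATA", "/vagrant/IPNB/dev1data")
  s"file:$base/devicestatus.txt"
}

// sys.env is the environment of the driver JVM, not of your interactive shell.
val inputUri = devicestatusUri(sys.env)

// In the notebook you would then load the file with:
//   val mydata = sc.textFile(inputUri)
```

Passing the environment in as a `Map` keeps the helper easy to check without changing real environment variables.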


### Step 2. Determine which delimiter to use
**hint**: the character at index 19 (zero-based, immediately after the 19-character timestamp) is the first use of the delimiter

In [63]:
mydata.map(_(19)).distinct().collect()

res18: Array[Char] = Array(|, ,, /)
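To see why index 19 works, the same lookup can be tried on plain Scala strings without Spark. This is a small sketch using two records copied from the `head` output above; the names `sampleLines` and `timestampLength` are only illustrative.

```scala
// Two records copied from the head of devicestatus.txt shown earlier.
val sampleLines = Seq(
  "2014-03-15:10:10:20,Sorrento F41L,8cc3b47e-bd01-4482-b500-28f2342679af,7,24,39,enabled,disabled,connected,55,67,12,33.6894754264,-117.543308253",
  "2014-03-15:10:10:20|MeeToo 1.0|ef8c7564-0a1a-4650-a655-c8bbd5f8f943|0|31|63|70|39|27|enabled|enabled|enabled|37.4321088904|-121.485029632"
)

// The timestamp occupies indices 0 to 18, so index 19 holds the delimiter.
val timestampLength = "2014-03-15:10:10:20".length

// Same logic as mydata.map(_(19)).distinct(), applied to a local Seq.
val sampleDelimiters = sampleLines.map(line => line(timestampLength)).distinct
```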


### Step 3. Filter out any records which do not parse correctly
**hint**: each record should have exactly 14 values

In [66]:
val parsed  = mydata.map(line=>line.split(line(19)))

parsed: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[20] at map at <console>:27
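One detail worth knowing about the split above: `line(19)` is a `Char`, and Scala splits on a `Char` literally. If the delimiter were passed as a `String`, it would be interpreted as a regular expression, and `|` has a special meaning there. The sketch below illustrates the difference on one record from the file; the variable names are illustrative.

```scala
import java.util.regex.Pattern

// A pipe-delimited record copied from the head of devicestatus.txt.
val pipeLine =
  "2014-03-15:10:10:20|MeeToo 1.0|ef8c7564-0a1a-4650-a655-c8bbd5f8f943|0|31|63|70|39|27|enabled|enabled|enabled|37.4321088904|-121.485029632"

// Splitting on a Char treats the delimiter literally (this is what the cell above does).
val byChar = pipeLine.split(pipeLine(19))

// Splitting on a String uses a regex, so the delimiter should be quoted first.
val byQuoted = pipeLine.split(Pattern.quote(pipeLine(19).toString))

// Unquoted, "|" is an empty alternation and does not yield the 14 fields.
val byRawRegex = pipeLine.split("|")
```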


Count rows

In [71]:
parsed.count()

res20: Long = 459540


Create an RDD that keeps only the rows that parsed correctly (i.e. rows containing exactly 14 values)

In [69]:
val filtered = parsed.filter(_.length==14)

filtered: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[21] at filter at <console>:29


Compare the number of records in the two RDDs

In [70]:
filtered.count()

res19: Long = 459540
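Both counts are 459540 in this run, so the filter did not drop any record. To see the filter reject something, here is a sketch on a local `Seq`, assuming one complete record from the file and one invented, truncated record (the truncated line is made up for illustration and does not come from the data).

```scala
// A complete record copied from the file (14 fields).
val completeLine =
  "2014-03-15:10:10:20,Sorrento F41L,707daba1-5640-4d60-a6d9-1d6fa0645be0,8,22,60,enabled,enabled,disabled,68,91,17,39.3635186767,-119.400334708"

// Invented, truncated record used only to exercise the filter.
val truncatedLine = "2014-03-15:10:10:20,Sorrento F41L,enabled"

// Same parsing rule as the notebook: split on the character at index 19.
val localParsed = Seq(completeLine, truncatedLine).map(line => line.split(line(19)))

// partition returns the rows that pass the test and the rows that fail it.
val (validRows, rejectedRows) = localParsed.partition(_.length == 14)

// On the RDDs above, the number of rejected rows would be:
//   parsed.count() - filtered.count()
```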


### Step 4. Extract the fields
Extract the date (first field), model (second field), device ID (third field), and latitude and longitude (13th and 14th fields respectively)

In [87]:
val selectedFields = filtered.map(r=>(r(0),r(1),r(2),r(12),r(13)))

selectedFields: org.apache.spark.rdd.RDD[(String, String, String, String, String)] = MapPartitionsRDD[23] at map at <console>:31


Inspect the results by printing the first 10 rows

In [88]:
selectedFields.take(10).foreach(row=>println(s"${row._1}, ${row._2}, ${row._3}, ${row._4}, ${row._5}"))    

2014-03-15:10:10:20, Sorrento F41L, 8cc3b47e-bd01-4482-b500-28f2342679af, 33.6894754264, -117.543308253
2014-03-15:10:10:20, MeeToo 1.0, ef8c7564-0a1a-4650-a655-c8bbd5f8f943, 37.4321088904, -121.485029632
2014-03-15:10:10:20, MeeToo 1.0, 23eba027-b95a-4729-9a4b-a3cca51c5548, 39.4378908349, -120.938978486
2014-03-15:10:10:20, Sorrento F41L, 707daba1-5640-4d60-a6d9-1d6fa0645be0, 39.3635186767, -119.400334708
2014-03-15:10:10:20, Ronin Novelty Note 1, db66fe81-aa55-43b4-9418-fc6e7a00f891, 33.1913581092, -116.448242643
2014-03-15:10:10:20, Sorrento F41L, ffa18088-69a0-433e-84b8-006b2b9cc1d0, 33.8343543748, -117.330000857
2014-03-15:10:10:20, Sorrento F33L, 66d678e6-9c87-48d2-a415-8d5035e54a23, 37.3803954321, -121.840756755
2014-03-15:10:10:20, MeeToo 4.1, 673f7e4b-d52b-44fc-8826-aea460c3481a, 34.1841062345, -117.9435329
2014-03-15:10:10:20, Ronin Novelty Note 2, a678ccc3-b0d2-452d-bf89-85bd095e28ee, 32.2850556785, -111.819583734
2014-03-15:10:10:20, Sorrento F41L, 86bef6ae-2f1c-42ec-aa67-6
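Tuples work, but positional accessors such as `_4` are easy to mix up. As an alternative sketch, the extracted fields could be held in a case class with named fields. `DeviceRecord` and `toRecord` are hypothetical names that are not part of the lab, and converting latitude and longitude to `Double` is an assumption about how they would be used later.

```scala
// Hypothetical case class; the field names are illustrative only.
case class DeviceRecord(
  date: String,
  model: String,
  deviceId: String,
  latitude: Double,
  longitude: Double
)

// Pick fields 1, 2, 3, 13 and 14 (indices 0, 1, 2, 12, 13) from a parsed row.
def toRecord(r: Array[String]): DeviceRecord =
  DeviceRecord(r(0), r(1), r(2), r(12).toDouble, r(13).toDouble)

// A record copied from the file, split on its comma delimiter.
val sampleFields =
  "2014-03-15:10:10:20,Sorrento F41L,8cc3b47e-bd01-4482-b500-28f2342679af,7,24,39,enabled,disabled,connected,55,67,12,33.6894754264,-117.543308253"
    .split(',')

val sampleRecord = toRecord(sampleFields)

// On the RDD this would be: filtered.map(toRecord)
```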

### Step 5. Split the 2nd field into two
The second field contains the device manufacturer and model name (e.g. Ronin S2). Split this field on spaces to separate the manufacturer from the model (e.g. manufacturer Ronin, model S2).


In [89]:
val splitted = selectedFields.map(r=>(r._1,r._2.split(' ')(0),r._2.split(' ')(1),r._3,r._4,r._5))

splitted: org.apache.spark.rdd.RDD[(String, String, String, String, String, String)] = MapPartitionsRDD[24] at map at <console>:33
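Note that some model names contain more than one space (e.g. Ronin Novelty Note 1), so taking `split(' ')(1)` keeps only the first word of the model (Novelty). If the full model name is wanted, one possible variation is to split at the first space only, using the two-argument `split` with a limit of 2. The helper name `splitModel` is hypothetical; this is a sketch, not the lab's reference solution.

```scala
// Hypothetical helper: split "manufacturer model" at the first space only.
def splitModel(name: String): (String, String) =
  name.split(" ", 2) match {
    case Array(manufacturer, model) => (manufacturer, model)
    case Array(manufacturer)        => (manufacturer, "") // no space: model left empty
  }

val roninParts    = splitModel("Ronin Novelty Note 1")
val sorrentoParts = splitModel("Sorrento F41L")

// For comparison, the approach used in the cell above keeps only one word.
val firstWordOnly = "Ronin Novelty Note 1".split(' ')(1)
```

The helper also calls `split` once per record instead of twice.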


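### Step 6. Save CSV to HDFS
Save the extracted data to comma-delimited text files. The lab targets the **/loudacre/devicestatus_etl** directory on HDFS; the cell below writes to the local directory **/vagrant/IPNB/devicestatus_etl** instead.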


In [92]:
// .productIterator.toList converts the tuple to a List of its fields.
// .mkString joins the elements into one string, separated by the delimiter.
splitted.map(_.productIterator.toList.mkString(",")).saveAsTextFile("file:/vagrant/IPNB/devicestatus_etl")

org.apache.hadoop.mapred.FileAlreadyExistsException:  Output directory file:/vagrant/IPNB/devicestatus_etl already exists
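The exception above appears because `saveAsTextFile` refuses to write into an output directory that already exists, which happens when the cell is run a second time. The sketch below shows two things on plain Scala values: joining a tuple into a CSV line, and removing a local output directory before re-running. `toCsvLine` and `deleteRecursively` are hypothetical helpers, and the deletion applies only to the local file system, not to HDFS.

```scala
import java.io.File

// Join the fields of a tuple into one CSV line.
// An Iterator already has mkString, so .toList is optional.
def toCsvLine(row: Product): String = row.productIterator.mkString(",")

// Hypothetical helper: delete a local file or directory tree.
def deleteRecursively(f: File): Unit = {
  val children = f.listFiles // null when f is not a directory
  if (children != null) children.foreach(deleteRecursively)
  f.delete()
  ()
}

val csvExample = toCsvLine(("2014-03-15:10:10:20", "Sorrento", "F41L"))

// Before re-running the save cell, the old local output could be removed with:
//   deleteRecursively(new File("/vagrant/IPNB/devicestatus_etl"))
```

Be careful with the path you pass to `deleteRecursively`, since it removes everything underneath it.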

### Step 7.  Verify Results
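Verify the results with shell commands (the output was written to the local file system rather than to **/loudacre/devicestatus_etl** on HDFS):
- first list the contents of the output directory
- then take sample rows from the part files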

In [95]:
!ls -l /vagrant/IPNB/devicestatus_etl

total 43517
-rw-rw-r-- 1 vagrant vagrant 22281863 Nov  7 00:11 part-00000
-rw-rw-r-- 1 vagrant vagrant 22278372 Nov  7 00:11 part-00001
-rw-rw-r-- 1 vagrant vagrant        0 Nov  7 00:11 _SUCCESS


In [94]:
!cat /vagrant/IPNB/devicestatus_etl/* | head

2014-03-15:10:10:20,Sorrento,F41L,8cc3b47e-bd01-4482-b500-28f2342679af,33.6894754264,-117.543308253
2014-03-15:10:10:20,MeeToo,1.0,ef8c7564-0a1a-4650-a655-c8bbd5f8f943,37.4321088904,-121.485029632
2014-03-15:10:10:20,MeeToo,1.0,23eba027-b95a-4729-9a4b-a3cca51c5548,39.4378908349,-120.938978486
2014-03-15:10:10:20,Sorrento,F41L,707daba1-5640-4d60-a6d9-1d6fa0645be0,39.3635186767,-119.400334708
2014-03-15:10:10:20,Ronin,Novelty,db66fe81-aa55-43b4-9418-fc6e7a00f891,33.1913581092,-116.448242643
2014-03-15:10:10:20,Sorrento,F41L,ffa18088-69a0-433e-84b8-006b2b9cc1d0,33.8343543748,-117.330000857
2014-03-15:10:10:20,Sorrento,F33L,66d678e6-9c87-48d2-a415-8d5035e54a23,37.3803954321,-121.840756755
2014-03-15:10:10:20,MeeToo,4.1,673f7e4b-d52b-44fc-8826-aea460c3481a,34.1841062345,-117.9435329
2014-03-15:10:10:20,Ronin,Novelty,a678ccc3-b0d2-452d-bf89-85bd095e28ee,32.2850556785,-111.819583734
2014-03-15:10:10:20,Sorrento,F41L,86bef6ae-2f1c-42ec-aa67-6acecd7b0675,45.2400522984,-122.377467861
cat: write 
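Beyond eyeballing the output, a quick structural check can confirm that each saved line has the expected shape. The following sketch assumes the six-column layout produced in Step 5 and uses two lines copied from the output above; `looksValid` is a hypothetical helper.

```scala
import scala.util.Try

// Two lines copied from the saved output shown above.
val etlLines = Seq(
  "2014-03-15:10:10:20,Sorrento,F41L,8cc3b47e-bd01-4482-b500-28f2342679af,33.6894754264,-117.543308253",
  "2014-03-15:10:10:20,MeeToo,1.0,ef8c7564-0a1a-4650-a655-c8bbd5f8f943,37.4321088904,-121.485029632"
)

// Hypothetical check: six comma-separated fields, with numeric latitude and longitude.
def looksValid(line: String): Boolean = {
  val f = line.split(',')
  f.length == 6 && Try((f(4).toDouble, f(5).toDouble)).isSuccess
}

val allValid = etlLines.forall(looksValid)

// On the saved files this could be applied as:
//   sc.textFile("file:/vagrant/IPNB/devicestatus_etl").filter(line => !looksValid(line)).count()
```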