# Spark Lab 3 - Process Data Files with Spark (Solution)


## Data Scrubbing - Fixing Issues in devicestatus.txt

Data scrubbing is a common part of the ETL process. In the following steps, you will process data to get it into a standardized format for later processing.

Review the contents of the file **$DEV1DATA/devicestatus.txt**. This file contains data collected from mobile devices on Loudacre’s network, including device ID, current status, location, and so on. Because Loudacre previously acquired other mobile providers’ networks, the data from different subnetworks has different formats. Note that the records in this file use different field delimiters: some use commas, some use pipes (|), and so on.

In [1]:
!head $DEV1DATA/devicestatus.txt

2014-03-15:10:10:20,Sorrento F41L,8cc3b47e-bd01-4482-b500-28f2342679af,7,24,39,enabled,disabled,connected,55,67,12,33.6894754264,-117.543308253
2014-03-15:10:10:20|MeeToo 1.0|ef8c7564-0a1a-4650-a655-c8bbd5f8f943|0|31|63|70|39|27|enabled|enabled|enabled|37.4321088904|-121.485029632
2014-03-15:10:10:20|MeeToo 1.0|23eba027-b95a-4729-9a4b-a3cca51c5548|0|20|21|86|54|34|enabled|enabled|enabled|39.4378908349|-120.938978486
2014-03-15:10:10:20,Sorrento F41L,707daba1-5640-4d60-a6d9-1d6fa0645be0,8,22,60,enabled,enabled,disabled,68,91,17,39.3635186767,-119.400334708
2014-03-15:10:10:20,Ronin Novelty Note 1,db66fe81-aa55-43b4-9418-fc6e7a00f891,0,13,47,70,enabled,enabled,enabled,10,45,33.1913581092,-116.448242643
2014-03-15:10:10:20,Sorrento F41L,ffa18088-69a0-433e-84b8-006b2b9cc1d0,3,10,36,enabled,connected,enabled,53,58,42,33.8343543748,-117.330000857
2014-03-15:10:10:20,Sorrento F33L,66d678e6-9c87-48d2-a415-8d5035e54a23,1,34,74,enabled,disabled,enabled,57,42,15,37.3803954321,-121.840756755

### Step 1. Load the file from a local path
When you load data in Spark, you must use the full path. PySpark does not expand shell variables such as $DEV1DATA, which is defined in the local bash shell.

In [2]:
mydata = sc.textFile('file:/home/cloudera/training_materials/data/devicestatus.txt')
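If the notebook's Python process happens to inherit `DEV1DATA` from the shell that launched it (an assumption, not something the lab guarantees), the full path can be built in Python before it is handed to `sc.textFile`. The helper name `to_file_uri` below is hypothetical and is only a sketch of the idea.

```python
import os

def to_file_uri(path_with_vars):
    """Expand $VARS in a path and return it as a file: URI string."""
    expanded = os.path.expandvars(path_with_vars)
    # expandvars leaves unknown variables untouched, so a leftover "$"
    # means the variable was not visible to this Python process.
    if "$" in expanded:
        raise ValueError("unresolved variable in: " + expanded)
    return "file:" + os.path.abspath(expanded)

# Usage inside the notebook (assumes DEV1DATA is visible to Python):
# mydata = sc.textFile(to_file_uri("$DEV1DATA/devicestatus.txt"))
```

Failing loudly on an unresolved variable avoids passing a literal `$DEV1DATA` path to Spark, which would only surface as an error when the first action runs.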

### Step 2. Determine which delimiter to use
**Hint**: the character at position 19 (zero-based index) is the first use of the delimiter

In [3]:
mydata.map(lambda line:line[19]).distinct().collect()

[u'/', u',', u'|']
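The hint can be checked without Spark. The sketch below uses two records copied from the `head` output above and plain Python; it shows why index 19 works (the timestamp is exactly 19 characters long) and counts each delimiter rather than only listing the distinct ones.

```python
from collections import Counter

# Two records copied from the head of devicestatus.txt shown earlier.
sample_lines = [
    "2014-03-15:10:10:20,Sorrento F41L,8cc3b47e-bd01-4482-b500-28f2342679af,"
    "7,24,39,enabled,disabled,connected,55,67,12,33.6894754264,-117.543308253",
    "2014-03-15:10:10:20|MeeToo 1.0|ef8c7564-0a1a-4650-a655-c8bbd5f8f943|"
    "0|31|63|70|39|27|enabled|enabled|enabled|37.4321088904|-121.485029632",
]

timestamp = "2014-03-15:10:10:20"
# The timestamp occupies indexes 0..18, so index 19 holds the delimiter.
assert len(timestamp) == 19

delimiter_counts = Counter(line[19] for line in sample_lines)
print(sorted(delimiter_counts.items()))  # [(',', 1), ('|', 1)]
```

On the full RDD, a similar per-delimiter tally should be available through `mydata.map(lambda line: line[19]).countByValue()`, which can hint at how many records come from each subnetwork format.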

### Step 3. Filter out any records which do not parse correctly
**Hint**: each record should have exactly 14 values

In [4]:
parsed = mydata.map(lambda line: line.split(line[19]))

Alternatively, you may define a function.

In [5]:
def my_split(s):
    if s[19]==',':
        return s.split(',')
    elif s[19]=='|':
        return s.split("|")
    else:
        return s.split("/")

parsed = mydata.map(my_split)
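Both versions index `line[19]` directly, which raises an `IndexError` for any line shorter than 20 characters (a blank line, for example). A defensive variant is sketched below; the name `safe_split` is hypothetical.

```python
def safe_split(line):
    """Split a record on the delimiter found at index 19.

    Returns an empty list for lines too short to contain a delimiter,
    so that a later length check can discard them.
    """
    if len(line) < 20:
        return []
    return line.split(line[19])

# Usage with the RDD from Step 1:
# parsed = mydata.map(safe_split)
```

Returning an empty list instead of raising lets the 14-value filter handle malformed lines in the same place as every other bad record.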

In [6]:
parsed.count()

459540

Create an RDD that drops the rows that did not parse correctly (i.e. those that do not contain exactly 14 values).

In [7]:
filtered = parsed.filter(lambda record:len(record)==14)

Compare the number of records in the two RDDs.

In [8]:
filtered.count()

459540

### Step 4. Extract the fields
Extract the date (first field), model (second field), device ID (third field), and latitude and longitude (13th and 14th fields, respectively).


In [9]:
selectedFields = filtered.map(lambda r:(r[0],r[1],r[2],r[12],r[13]))

Inspect the results by printing the first 10 rows.

In [10]:
for row in selectedFields.take(10):
    print("{}, {}, {}, {}, {}".format(*row))

2014-03-15:10:10:20, Sorrento F41L, 8cc3b47e-bd01-4482-b500-28f2342679af, 33.6894754264, -117.543308253
2014-03-15:10:10:20, MeeToo 1.0, ef8c7564-0a1a-4650-a655-c8bbd5f8f943, 37.4321088904, -121.485029632
2014-03-15:10:10:20, MeeToo 1.0, 23eba027-b95a-4729-9a4b-a3cca51c5548, 39.4378908349, -120.938978486
2014-03-15:10:10:20, Sorrento F41L, 707daba1-5640-4d60-a6d9-1d6fa0645be0, 39.3635186767, -119.400334708
2014-03-15:10:10:20, Ronin Novelty Note 1, db66fe81-aa55-43b4-9418-fc6e7a00f891, 33.1913581092, -116.448242643
2014-03-15:10:10:20, Sorrento F41L, ffa18088-69a0-433e-84b8-006b2b9cc1d0, 33.8343543748, -117.330000857
2014-03-15:10:10:20, Sorrento F33L, 66d678e6-9c87-48d2-a415-8d5035e54a23, 37.3803954321, -121.840756755
2014-03-15:10:10:20, MeeToo 4.1, 673f7e4b-d52b-44fc-8826-aea460c3481a, 34.1841062345, -117.9435329
2014-03-15:10:10:20, Ronin Novelty Note 2, a678ccc3-b0d2-452d-bf89-85bd095e28ee, 32.2850556785, -111.819583734
2014-03-15:10:10:20, Sorrento F41L, 86bef6ae-2f1c-42ec-aa67-6
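As an alternative to spelling out each index in a lambda, the same selection can be written with `operator.itemgetter` from the standard library. This is only a sketch on one record copied from the input file; the name `pick_fields` is hypothetical.

```python
from operator import itemgetter

# Zero-based indexes: date, model, device ID, latitude, longitude.
pick_fields = itemgetter(0, 1, 2, 12, 13)

record = (
    "2014-03-15:10:10:20,Sorrento F41L,8cc3b47e-bd01-4482-b500-28f2342679af,"
    "7,24,39,enabled,disabled,connected,55,67,12,33.6894754264,-117.543308253"
).split(",")

selected = pick_fields(record)
print(selected)

# Usage with the RDD from Step 3:
# selectedFields = filtered.map(pick_fields)
```

Keeping the index list in one place makes it easier to see that the 13th and 14th fields correspond to indexes 12 and 13.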

### Step 5. Split the 2nd field into two
The second field contains the device manufacturer and model name (e.g. Ronin S2). Split this field by spaces to separate the manufacturer from the model (e.g. manufacturer Ronin, model S2).

In [11]:
splitted = selectedFields.map(lambda r:(r[0],r[1].split(' ')[0],r[1].split(' ')[1],r[2],r[3],r[4]))

Inspect the results by printing the first 10 rows.

In [12]:
for row in splitted.take(10):
    print("{}, {}, {}, {}, {}, {}".format(*row))

2014-03-15:10:10:20, Sorrento, F41L, 8cc3b47e-bd01-4482-b500-28f2342679af, 33.6894754264, -117.543308253
2014-03-15:10:10:20, MeeToo, 1.0, ef8c7564-0a1a-4650-a655-c8bbd5f8f943, 37.4321088904, -121.485029632
2014-03-15:10:10:20, MeeToo, 1.0, 23eba027-b95a-4729-9a4b-a3cca51c5548, 39.4378908349, -120.938978486
2014-03-15:10:10:20, Sorrento, F41L, 707daba1-5640-4d60-a6d9-1d6fa0645be0, 39.3635186767, -119.400334708
2014-03-15:10:10:20, Ronin, Novelty, db66fe81-aa55-43b4-9418-fc6e7a00f891, 33.1913581092, -116.448242643
2014-03-15:10:10:20, Sorrento, F41L, ffa18088-69a0-433e-84b8-006b2b9cc1d0, 33.8343543748, -117.330000857
2014-03-15:10:10:20, Sorrento, F33L, 66d678e6-9c87-48d2-a415-8d5035e54a23, 37.3803954321, -121.840756755
2014-03-15:10:10:20, MeeToo, 4.1, 673f7e4b-d52b-44fc-8826-aea460c3481a, 34.1841062345, -117.9435329
2014-03-15:10:10:20, Ronin, Novelty, a678ccc3-b0d2-452d-bf89-85bd095e28ee, 32.2850556785, -111.819583734
2014-03-15:10:10:20, Sorrento, F41L, 86bef6ae-2f1c-42ec-aa67-6acec
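In the output above, `Ronin Novelty Note 1` is reduced to the model `Novelty`, because only the second space-separated token is kept. The lab does not say whether that is intended. If the full model name is wanted, a sketch that splits only at the first space could look like this (the name `split_device` is hypothetical):

```python
def split_device(name):
    """Return (manufacturer, model), splitting only at the first space."""
    manufacturer, _, model = name.partition(" ")
    return manufacturer, model

print(split_device("Ronin Novelty Note 1"))  # ('Ronin', 'Novelty Note 1')
print(split_device("Sorrento F41L"))         # ('Sorrento', 'F41L')

# Usage with the RDD from Step 4:
# splitted = selectedFields.map(
#     lambda r: (r[0],) + split_device(r[1]) + (r[2], r[3], r[4]))
```

`str.partition` also avoids an `IndexError` when a name contains no space at all; in that case the model comes back as an empty string.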

### Step 6. Save as CSV
Save the extracted data as comma-delimited text files in the **devicestatus_etl** directory on the local file system.


In [16]:
# if you need to rerun this, remove the output directory first
# ! hadoop fs -rm -r -f /loudacre/devicestatus_etl

In [None]:
splitted.map(lambda r: ",".join(r)).saveAsTextFile('devicestatus_etl')
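Joining with `","` is fine as long as no field itself contains a comma. Should that ever be a concern, the standard `csv` module can quote such fields. The sketch below assumes Python 3 (the notebook output above looks like Python 2, where `csv` and `io.StringIO` do not combine this way), and the name `to_csv_line` is hypothetical.

```python
import csv
import io

def to_csv_line(fields):
    """Format one record as a CSV line, quoting fields when needed."""
    buffer = io.StringIO()
    # An empty lineterminator keeps the result to a single line of text,
    # since saveAsTextFile adds its own newline per element.
    csv.writer(buffer, lineterminator="").writerow(fields)
    return buffer.getvalue()

print(to_csv_line(("2014-03-15:10:10:20", "Sorrento", "F41L")))
# 2014-03-15:10:10:20,Sorrento,F41L

# Usage with the RDD from Step 5:
# splitted.map(to_csv_line).saveAsTextFile('devicestatus_etl')
```

For records without embedded commas or quotes, this produces the same text as `",".join(r)`.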

### Step 7.  Verify Results
Verify the results with Linux commands:
- first, list the contents of **devicestatus_etl**
- then, sample a few rows from the files


In [18]:
!ls -l devicestatus_etl

total 43520
-rw-r--r-- 1 cloudera cloudera 22281863 Jul 31 19:43 part-00000
-rw-r--r-- 1 cloudera cloudera 22278372 Jul 31 19:43 part-00001
-rw-r--r-- 1 cloudera cloudera        0 Jul 31 19:43 _SUCCESS


In [20]:
!cat devicestatus_etl/* | head

2014-03-15:10:10:20,Sorrento,F41L,8cc3b47e-bd01-4482-b500-28f2342679af,33.6894754264,-117.543308253
2014-03-15:10:10:20,MeeToo,1.0,ef8c7564-0a1a-4650-a655-c8bbd5f8f943,37.4321088904,-121.485029632
2014-03-15:10:10:20,MeeToo,1.0,23eba027-b95a-4729-9a4b-a3cca51c5548,39.4378908349,-120.938978486
2014-03-15:10:10:20,Sorrento,F41L,707daba1-5640-4d60-a6d9-1d6fa0645be0,39.3635186767,-119.400334708
2014-03-15:10:10:20,Ronin,Novelty,db66fe81-aa55-43b4-9418-fc6e7a00f891,33.1913581092,-116.448242643
2014-03-15:10:10:20,Sorrento,F41L,ffa18088-69a0-433e-84b8-006b2b9cc1d0,33.8343543748,-117.330000857
2014-03-15:10:10:20,Sorrento,F33L,66d678e6-9c87-48d2-a415-8d5035e54a23,37.3803954321,-121.840756755
2014-03-15:10:10:20,MeeToo,4.1,673f7e4b-d52b-44fc-8826-aea460c3481a,34.1841062345,-117.9435329
2014-03-15:10:10:20,Ronin,Novelty,a678ccc3-b0d2-452d-bf89-85bd095e28ee,32.2850556785,-111.819583734
2014-03-15:10:10:20,Sorrento,F41L,86bef6ae-2f1c-42ec-aa67-6acecd7b0675,45.2400522984,-122.377467861
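Beyond eyeballing the first few rows, the whole output can be checked programmatically. The sketch below assumes the part files sit on the local file system, as in the `ls` listing above, and that every saved row should have 6 comma-separated fields; the name `count_fields` is hypothetical.

```python
import glob
import os

def count_fields(output_dir, expected=6):
    """Return (total_lines, bad_lines) across all part-* files."""
    total = 0
    bad = 0
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as handle:
            for line in handle:
                total += 1
                if len(line.rstrip("\n").split(",")) != expected:
                    bad += 1
    return total, bad

# Usage in the notebook:
# total, bad = count_fields("devicestatus_etl")
```

If everything was written, `total` should match the 459540 reported by `filtered.count()` and `bad` should be 0.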