# Spark Lab 2 - Use RDDs to Transform a Dataset (Solution)


In [1]:
sc

### Note: this lab assumes that Spark runs in local mode, so all file paths default to the local file system.


## Explore the Loudacre web log files

### Data Preparation
First, run the following command to prepare the data on the local host.

In [14]:
# Note: `!cd` runs in a subshell and does not change the notebook's working directory.
! cd training_materials/data/weblogs

### Step 1. Explore the data files

First, explore the contents of the local **weblogs** directory.


In [None]:
!ls training_materials/data/weblogs

Then view the first few lines of the web log data.

In [16]:
!cat training_materials/data/weblogs/* | head

3.94.78.5 - 69827 [15/Sep/2013:23:58:36 +0100] "GET /KBDOC-00033.html HTTP/1.0" 200 14417 "http://www.loudacre.com"  "Loudacre Mobile Browser iFruit 1"
3.94.78.5 - 69827 [15/Sep/2013:23:58:36 +0100] "GET /theme.css HTTP/1.0" 200 3576 "http://www.loudacre.com"  "Loudacre Mobile Browser iFruit 1"
19.38.140.62 - 21475 [15/Sep/2013:23:58:34 +0100] "GET /KBDOC-00277.html HTTP/1.0" 200 15517 "http://www.loudacre.com"  "Loudacre Mobile Browser Ronin S1"
19.38.140.62 - 21475 [15/Sep/2013:23:58:34 +0100] "GET /theme.css HTTP/1.0" 200 13353 "http://www.loudacre.com"  "Loudacre Mobile Browser Ronin S1"
129.133.56.105 - 2489 [15/Sep/2013:23:58:34 +0100] "GET /KBDOC-00033.html HTTP/1.0" 200 10590 "http://www.loudacre.com"  "Loudacre Mobile Browser Sorrento F00L"
129.133.56.105 - 2489 [15/Sep/2013:23:58:34 +0100] "GET /theme.css HTTP/1.0" 200 12295 "http://www.loudacre.com"  "Loudacre Mobile Browser Sorrento F00L"
217.150.149.167 - 4712 [15/Sep/2013:23:56:06 +0100] "GET /ronin_s4_sales.html HT


### Step 2.  Set a variable for the data file so you do not have to retype it each time.

In [17]:
logfile="training_materials/data/weblogs"

### Step 3. Create an RDD from the data file.

In [18]:
logs = sc.textFile(logfile)

### Step 4. Create an RDD containing only those lines that are requests for JPG files.

In [19]:
jpglogs = logs.filter(lambda line:".jpg" in line)
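
The predicate passed to `filter` is an ordinary Python function, so it can be checked without Spark. A minimal sketch, using two sample lines copied from the log output in this lab; `sample_lines` and `is_jpg` are names introduced here for illustration only:

```python
# Two sample lines taken from the web log output shown in this lab.
sample_lines = [
    '114.56.156.55 - 62826 [02/Dec/2013:12:10:36 +0100] "GET /ifruit_4.jpg HTTP/1.0" 200 4649 "http://www.loudacre.com"  "Loudacre Mobile Browser iFruit 3"',
    '3.94.78.5 - 69827 [15/Sep/2013:23:58:36 +0100] "GET /theme.css HTTP/1.0" 200 3576 "http://www.loudacre.com"  "Loudacre Mobile Browser iFruit 1"',
]

# Same test as the lambda above: a substring match anywhere in the line.
is_jpg = lambda line: ".jpg" in line

# The built-in filter() plays the role of RDD.filter() on a local list.
jpg_lines = list(filter(is_jpg, sample_lines))
print(len(jpg_lines))  # 1
```

Because the check is a substring match, it would also keep a line whose referrer or user agent happened to contain `.jpg`.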

### Step 5. View the first 10 lines of the data using `take`

In [20]:
jpglogs.take(10)

[u'114.56.156.55 - 62826 [02/Dec/2013:12:10:36 +0100] "GET /ifruit_4.jpg HTTP/1.0" 200 4649 "http://www.loudacre.com"  "Loudacre Mobile Browser iFruit 3"',
 u'86.125.145.15 - 7362 [02/Dec/2013:12:10:35 +0100] "GET /ifruit_3a.jpg HTTP/1.0" 200 16517 "http://www.loudacre.com"  "Loudacre Mobile Browser iFruit 1"',
 u'241.32.165.119 - 4237 [02/Dec/2013:12:05:31 +0100] "GET /meetoo_4.0.jpg HTTP/1.0" 200 19283 "http://www.loudacre.com"  "Loudacre Mobile Browser Ronin S1"',
 u'215.56.56.60 - 16622 [02/Dec/2013:12:02:16 +0100] "GET /ronin_s2.jpg HTTP/1.0" 200 9524 "http://www.loudacre.com"  "Loudacre Mobile Browser MeeToo 3.0"',
 u'159.134.45.232 - 72027 [02/Dec/2013:11:57:54 +0100] "GET /ifruit_5a.jpg HTTP/1.0" 200 18828 "http://www.loudacre.com"  "Loudacre Mobile Browser Sorrento F21L"',
 u'64.90.36.213 - 21932 [02/Dec/2013:11:57:30 +0100] "GET /ronin_s2.jpg HTTP/1.0" 200 6012 "http://www.loudacre.com"  "Loudacre Mobile Browser Titanic 1100"',
 u'85.177.200.35 - 18442 [02/Dec/2013:11:54:19 +

### Step 6. Combine Previous Steps
Sometimes you do not need to store intermediate objects in a variable, in which case you can combine the steps into a single line of code. Combine the previous commands into a single command to count the number of JPG requests.

**Tip**: To break a long line, end it with a space and a backslash (`\`).

In [22]:
sc.textFile(logfile) \
    .filter(lambda line:".jpg" in line) \
    .count()

64978
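
As an alternative to trailing backslashes, Python continues an expression across lines whenever it is enclosed in parentheses. The sketch below shows that layout with plain string methods on a sample request string (not part of the lab); the same layout should also work for the chained RDD calls above.

```python
# Wrapping the whole expression in parentheses lets each chained call
# start on its own line, with no trailing backslashes.
parts = (
    '  "GET /ifruit_4.jpg HTTP/1.0"  '
    .strip()      # drop surrounding spaces
    .strip('"')   # drop the surrounding quotes
    .split()      # split on whitespace
)
print(parts)  # ['GET', '/ifruit_4.jpg', 'HTTP/1.0']
```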

### Step 7. Compute line lengths using `map`
Now try using the `map` function to define a new RDD. Start with a simple map that returns the length of each line in the log file. With `take(5)`, this returns a list of five integers: the lengths of the **first five lines** in the file.


In [23]:
logs.map(lambda line:len(line)).take(5)

[153, 158, 152, 159, 152]

### Step 8. Words
That is not very useful. Instead, try mapping each line to an array of its words.

This time, you will print out five arrays, each containing the words in the corresponding log file line.


In [24]:
words = logs.map(lambda line: line.split())

In [25]:
for word in words.take(5):
    print(word)

[u'151.21.109.132', u'-', u'36918', u'[02/Dec/2013:12:12:12', u'+0100]', u'"GET', u'/theme.css', u'HTTP/1.0"', u'200', u'10552', u'"http://www.loudacre.com"', u'"Loudacre', u'Mobile', u'Browser', u'Titanic', u'1100"']
[u'142.152.5.137', u'-', u'88890', u'[02/Dec/2013:12:11:45', u'+0100]', u'"GET', u'/KBDOC-00122.html', u'HTTP/1.0"', u'200', u'8938', u'"http://www.loudacre.com"', u'"Loudacre', u'Mobile', u'Browser', u'Titanic', u'2300"']
[u'142.152.5.137', u'-', u'88890', u'[02/Dec/2013:12:11:45', u'+0100]', u'"GET', u'/theme.css', u'HTTP/1.0"', u'200', u'14348', u'"http://www.loudacre.com"', u'"Loudacre', u'Mobile', u'Browser', u'Titanic', u'2300"']
[u'193.68.215.33', u'-', u'108180', u'[02/Dec/2013:12:11:24', u'+0100]', u'"GET', u'/KBDOC-00270.html', u'HTTP/1.0"', u'200', u'9680', u'"http://www.loudacre.com"', u'"Loudacre', u'Mobile', u'Browser', u'Titanic', u'2400"']
[u'193.68.215.33', u'-', u'108180', u'[02/Dec/2013:12:11:24', u'+0100]', u'"GET', u'/theme.css', u'HTTP/1.0"', u'200',
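
Splitting on whitespace also breaks quoted fields, such as the request and the user agent, into several tokens, so it helps to know which index holds which value. A short sketch on one sample line from this lab (`sample` and `fields` are illustrative names):

```python
# One sample line copied from the JPG output shown earlier in this lab.
sample = ('114.56.156.55 - 62826 [02/Dec/2013:12:10:36 +0100] '
          '"GET /ifruit_4.jpg HTTP/1.0" 200 4649 '
          '"http://www.loudacre.com"  "Loudacre Mobile Browser iFruit 3"')

fields = sample.split()  # same call as in the map above

print(len(fields))  # 16
print(fields[0])    # 114.56.156.55  (IP address)
print(fields[2])    # 62826          (user ID)
print(fields[6])    # /ifruit_4.jpg  (requested file)
```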

### Step 9.  IP addresses
Now that you know how `map` works, define a new RDD containing just the IP address from each line in the log file. (The IP address is the first “word” in each line.)

In [27]:
ips = logs.map(lambda line: line.split()[0])
ips.take(5)

[u'151.21.109.132',
 u'142.152.5.137',
 u'142.152.5.137',
 u'193.68.215.33',
 u'193.68.215.33']

### Step 10. Print Returned Results
Although `take` and `collect` are useful ways to look at data in an RDD, their output is not very readable. Fortunately, they return lists, which you can iterate over:

In [13]:
# print() with a single argument works in both Python 2 and Python 3
for ip in ips.take(10):
    print(ip)

3.94.78.5
3.94.78.5
19.38.140.62
19.38.140.62
129.133.56.105
129.133.56.105
217.150.149.167
217.150.149.167
217.150.149.167
217.150.149.167


### Step 11. Save IP addresses as a text file
Finally, save the list of IP addresses as text to the output folder `iplist`.

In [16]:
# To recreate iplist, delete the existing local output folder first:
# !rm -r iplist

In [28]:
ips.saveAsTextFile("iplist")

### Step 12. Display result folder

In a terminal window, list the contents of the local **iplist** folder. You should see multiple files, including several part-xxxxx files, which contain the output data. (“Part” (partition) files are numbered because there may be results from multiple tasks running on the cluster; you will learn more about this later.) Review the contents of one of the files to confirm that they were created correctly.


In [30]:
!ls iplist

part-00000  part-00052	part-00104  part-00156	part-00208  part-00260
part-00001  part-00053	part-00105  part-00157	part-00209  part-00261
part-00002  part-00054	part-00106  part-00158	part-00210  part-00262
part-00003  part-00055	part-00107  part-00159	part-00211  part-00263
part-00004  part-00056	part-00108  part-00160	part-00212  part-00264
part-00005  part-00057	part-00109  part-00161	part-00213  part-00265
part-00006  part-00058	part-00110  part-00162	part-00214  part-00266
part-00007  part-00059	part-00111  part-00163	part-00215  part-00267
part-00008  part-00060	part-00112  part-00164	part-00216  part-00268
part-00009  part-00061	part-00113  part-00165	part-00217  part-00269
part-00010  part-00062	part-00114  part-00166	part-00218  part-00270
part-00011  part-00063	part-00115  part-00167	part-00219  part-00271
part-00012  part-00064	part-00116  part-00168	part-00220  part-00272
part-00013  part-00065	part-00117  part-00169	part-00221  part-00273
part-00014  part-000
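
One way to review the output from Python rather than a terminal is to read the part files back in order. The sketch below first builds a small stand-in directory so that it runs on its own; the directory contents and the helper name `read_parts` are assumptions for illustration, and on a real run you would point it at `iplist` instead.

```python
import glob
import os
import tempfile

def read_parts(output_dir):
    """Return all lines from the part-* files in output_dir, in file order."""
    lines = []
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as f:
            lines.extend(line.rstrip("\n") for line in f)
    return lines

# Stand-in for a Spark output folder (assumed layout: part files plus _SUCCESS).
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "part-00000"), "w") as f:
    f.write("151.21.109.132\n142.152.5.137\n")
with open(os.path.join(demo_dir, "part-00001"), "w") as f:
    f.write("193.68.215.33\n")
open(os.path.join(demo_dir, "_SUCCESS"), "w").close()

print(read_parts(demo_dir))
```

The `part-*` pattern skips the `_SUCCESS` marker, so only data files are read.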

### Step 13. Create a User-IP dataset
Use RDD transformations to create a dataset consisting of the IP address and corresponding user ID for each request for an HTML file. (Disregard requests for other file types.) The user ID is the third field in each log file line. Rows in the new RDD should look like:
```
(165.32.101.206, 8)
(100.219.90.44, 102)
...
```

In [31]:
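# Keep HTML requests, split each line once, then pair IP (field 0) with user ID (field 2)
ipusers = logs.filter(lambda line: ".html" in line) \
    .map(lambda line: line.split()) \
    .map(lambda fields: (fields[0], fields[2]))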
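
The same filter-and-map logic can be tried on a local list before running it on the full RDD. This sketch tests the requested path instead of the whole line; treating field 6 as the path is an observation from the sample output in this lab, not a documented format. The names `to_ip_user` and `sample_lines` are hypothetical, and the sample lines are reconstructed from the token lists shown in Step 8.

```python
sample_lines = [
    '142.152.5.137 - 88890 [02/Dec/2013:12:11:45 +0100] "GET /KBDOC-00122.html HTTP/1.0" 200 8938 "http://www.loudacre.com"  "Loudacre Mobile Browser Titanic 2300"',
    '142.152.5.137 - 88890 [02/Dec/2013:12:11:45 +0100] "GET /theme.css HTTP/1.0" 200 14348 "http://www.loudacre.com"  "Loudacre Mobile Browser Titanic 2300"',
]

def to_ip_user(line):
    """Return (ip, user_id) for an HTML request, or None otherwise."""
    fields = line.split()
    if fields[6].endswith(".html"):
        return (fields[0], fields[2])
    return None

# Keep only the lines that produced a pair.
pairs = [p for p in map(to_ip_user, sample_lines) if p is not None]
print(pairs)  # [('142.152.5.137', '88890')]
```

On the RDD, a function like this could be applied with `map`, followed by a `filter` that drops the `None` values.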

Display the first 10 rows in the form ipaddress/userid, e.g.:
```
165.32.101.206/8
100.219.90.44/102
182.4.148.56/173
246.241.6.175/45395
175.223.172.207/4115
```

In [32]:
# display 10 records
for ipuser in ipusers.take(10):
    print("{}/{}".format(*ipuser))

142.152.5.137/88890
193.68.215.33/108180
97.76.130.234/96
184.216.140.241/202
200.149.118.18/61733
171.148.136.192/171
114.56.156.55/62826
86.125.145.15/7362
138.147.182.19/47
122.3.151.2/177


Display the number of records.

In [20]:
ipusers.count()

474360