# Spark Lab 4: Use Pair RDDs to join two datasets (Solution)

Data used in this lab:
- weblog data: `/home/cloudera/training_materials/data/weblogs/`
- user accounts data: `/home/cloudera/training_materials/data/static_data/accounts/`


**Tip**: In this lab you will be reducing and joining large datasets, which can take a long time. You may wish to work through the exercises with a smaller dataset consisting of a single web log file rather than all of them, for example **`2013-09-15.log`**.

## Explore Web Log Files

The following Python function parses a log line into a tuple using a regular expression. You can use it later for log line parsing.

In [148]:
import re

def parse_log_line(line):
    # Captures eight fields; the first is the IP address and the second is the user ID.
    pattern = r'^([\d.]+) - (\d+) \[(.+?)\] "(.+?)" (\d{3}) (\d+) "(.+?)"  "(.+)"'
    match = re.search(pattern, line)
    # Lines that do not match the pattern yield None.
    return match.groups() if match else None
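
To check the parser without Spark, the sketch below applies the lab's pattern (written here as a raw string) to a single line. The raw line is an assumption: it is rebuilt from the parsed fields in the sample output of this lab and the layout the pattern implies, including the two spaces before the last field. The function is repeated so the block runs on its own.

```python
import re

# The lab's pattern, written as a raw string.
LOG_PATTERN = r'^([\d.]+) - (\d+) \[(.+?)\] "(.+?)" (\d{3}) (\d+) "(.+?)"  "(.+)"'

def parse_log_line(line):
    """Return the eight captured fields as a tuple, or None if the line does not match."""
    match = re.search(LOG_PATTERN, line)
    return match.groups() if match else None

# Assumed raw line, rebuilt from the parsed sample output in this lab.
sample_line = ('3.94.78.5 - 69827 [15/Sep/2013:23:58:36 +0100] '
               '"GET /KBDOC-00033.html HTTP/1.0" 200 14417 '
               '"http://www.loudacre.com"  "Loudacre Mobile Browser iFruit 1"')

fields = parse_log_line(sample_line)
print(fields[0])                          # IP address (first field)
print(fields[1])                          # user ID (second field)
print(parse_log_line("not a log line"))   # None for lines that do not match
```

Because a non-matching line produces `None`, later steps that index into `row` would fail on such a line. If the full dataset contains malformed lines, adding `.filter(lambda row: row is not None)` after the `map` is one way to guard against that.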

### Step 1. Load and parse data
Load the data from the weblog data location and parse it with the provided `parse_log_line` function.
- While testing, read just one file, **2013-09-15.log**. Once your code works, you can load the entire dataset, which may be time-consuming to process.
- You can cache the data for later reuse with `.cache()`, so that you don't have to reload and reparse it at every step.

In [149]:
logs = sc.textFile("file:/home/cloudera/training_materials/data/weblogs/2013-09-15.log").map(parse_log_line).cache()

In [150]:
# print some samples to verify the result
for row in logs.take(2):
    print(row)

(u'3.94.78.5', u'69827', u'15/Sep/2013:23:58:36 +0100', u'GET /KBDOC-00033.html HTTP/1.0', u'200', u'14417', u'http://www.loudacre.com', u'Loudacre Mobile Browser iFruit 1')
(u'3.94.78.5', u'69827', u'15/Sep/2013:23:58:36 +0100', u'GET /theme.css HTTP/1.0', u'200', u'3576', u'http://www.loudacre.com', u'Loudacre Mobile Browser iFruit 1')


### Step 2. Using map-reduce to count the number of requests from each user

- Use `map` to create a Pair RDD with the user ID as the key and the integer 1 as the value. (The user ID is the second field in each line.) After mapping, your data will look something like this:
```
(userid,1) 
(userid,1) 
(userid,1)
```
- Use reduceByKey to sum the values for each user ID. Your RDD data will be similar to:
```
(userid1,5) 
(userid2,7) 
(userid3,2)
```

In [151]:
userCount = logs.map(lambda row:(row[1],1)).reduceByKey(lambda a,b:a+b)

In [152]:
# use type function to verify the type of the variable
type(userCount)

pyspark.rdd.PipelinedRDD

In [153]:
# print the outcome in a formatted fashion
for row in userCount.take(5):
    print("{},{}".format(*row))  # the * operator unpacks the elements of the `row` tuple

60988,4
45741,2
53488,4
58228,4
63239,2
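
To make the map and `reduceByKey` steps concrete, here is a minimal plain-Python sketch of the same computation on a small in-memory list. The helper `reduce_by_key` and the user IDs are hypothetical and only illustrate the logic; Spark performs the same combination per key, but distributed across partitions.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func, like a local reduceByKey."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals

# Hypothetical user IDs, one entry per request.
user_ids = ['60988', '45741', '60988', '53488', '45741', '60988']

pairs = [(user_id, 1) for user_id in user_ids]          # the map step
user_count = reduce_by_key(pairs, lambda a, b: a + b)   # the reduce step

for user_id, count in sorted(user_count.items()):
    print("{},{}".format(user_id, count))
```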


### Step 3. Determine how many users visited the site for each frequency (histogram) 
Determine how many users visited the site for each frequency. That is, how many users visited once, twice, three times and so on.
- Use `map` to reverse the key and value, like this:
```
(5,userid)
(7,userid)
(2,userid)
```
- Use the `countByKey` action to return a dictionary of `frequency:user-count` pairs, such as:
```
(5,10)
(7,50)
(2,100)
```
- Save the result in the variable `freqCount`


In [154]:

# swap (user, count) to (count, user); indexing works in both Python 2 and 3
freqCount = userCount.map(lambda pair: (pair[1], pair[0])) \
    .countByKey()

- Determine the type of the variable `freqCount`

In [155]:
type(freqCount)

collections.defaultdict

- Print the outcome in a formatted fashion, like this:
```
1:25
2:32
3:44
```

In [156]:
for item in freqCount.items():
    print("{}:{}".format(*item))

2:757
3:2
4:302
6:62
8:27
10:10
12:2
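
The swap-and-count logic can be sketched in plain Python with `collections.Counter`. The `(userid, count)` pairs below are hypothetical values shaped like `userCount`. Sorting the items before printing keeps the output order predictable, since a dictionary does not promise sorted iteration.

```python
from collections import Counter

# Hypothetical (userid, hit count) pairs, shaped like userCount.
user_count = [('60988', 4), ('45741', 2), ('53488', 4), ('58228', 4), ('63239', 2)]

# Swap key and value, then count how often each frequency occurs as a key.
swapped = [(count, user_id) for user_id, count in user_count]
freq_count = Counter(freq for freq, _ in swapped)

for freq, users in sorted(freq_count.items()):
    print("{}:{}".format(freq, users))
```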


### Step 4. Build an IP address list for each user

#### A. First build a list of `<user_id, ip>` pairs sorted by user id
- Create an RDD where the user id is the key, and the value is the IP addresses that user has connected from. (IP address is the first field in each request line.)
- Sort result by userid
- Save result in **`userIpSorted`**



In [157]:
userIpSorted = logs.map(lambda row:(row[1],row[0])) \
    .sortByKey()

- Print the first 10 results in the form of `userid<tab>ip_address`, such as:
```
1         127.0.0.1
123       127.0.0.2
```

In [158]:
for item in userIpSorted.take(10):
    print("{}\t{}".format(*item))

1	225.237.182.117
1	225.237.182.117
100	187.142.238.101
100	187.142.238.101
100	208.253.247.9
100	208.253.247.9
10022	6.46.42.13
10022	6.46.42.13
10022	6.46.42.13
10022	6.46.42.13
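
Note that the output above is ordered `1`, `100`, `10022` because the user IDs are strings, so they are compared character by character. The plain-Python sketch below contrasts string order with numeric order on hypothetical pairs. In Spark, passing a key function to `sortByKey` (for example `sortByKey(keyfunc=int)`) should give the numeric order, assuming every key is a valid integer string.

```python
# Hypothetical (userid, ip) pairs; user IDs are strings, as in the parsed logs.
pairs = [('100', '187.142.238.101'), ('2', '10.0.0.2'),
         ('10022', '6.46.42.13'), ('1', '225.237.182.117')]

by_string = sorted(pairs, key=lambda kv: kv[0])        # lexicographic order
by_number = sorted(pairs, key=lambda kv: int(kv[0]))   # numeric order

print([user_id for user_id, _ in by_string])
print([user_id for user_id, _ in by_number])
```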


#### B. Create an RDD where the user id is the key, and the value is the list of all the IP addresses that user has connected from. 
- Hint: Map to (userid,ip_address) and then use **`groupByKey`**
- Hint: **`groupByKey`** returns a PySpark iterable object; you can convert it to a list using **`mapValues`** with a type conversion function: **`list()`**
- Save results in **userIpList**

In [159]:
userIpList = logs.map(lambda row:(row[1],row[0])) \
    .groupByKey() \
    .mapValues(list)

- Print the first five results as tab-delimited rows, such as:
```
123    [ip_1, ip_2, ...]
2334   [ip_1, ip_4]
```

In [160]:
for item in userIpList.take(5):
    print("{}\t{}".format(item[0],item[1]))

60988	[u'248.254.226.12', u'248.254.226.12', u'248.254.226.12', u'248.254.226.12']
45741	[u'75.226.179.34', u'75.226.179.34']
53488	[u'52.101.213.105', u'52.101.213.105', u'52.101.213.105', u'52.101.213.105']
58228	[u'135.41.174.97', u'135.41.174.97', u'135.41.174.97', u'135.41.174.97']
63239	[u'240.193.143.197', u'240.193.143.197']


#### C. Create an RDD where the user id is the key, and the value is the *distinct* list of all the IP addresses that user has connected from. 
- Similar to B, but now we want a distinct list of values. Consider using **`set`** instead of `list`.
- Save results to **userIpListDistinct**

In [161]:
userIpListDistinct = logs.map(lambda row:(row[1],row[0])) \
    .groupByKey() \
    .mapValues(set)

- Print the first five results as tab-delimited rows, such as:
```
123    [ip_1, ip_2, ...]
2334   [ip_1, ip_4]
```

In [162]:
for item in userIpListDistinct.take(5):
    print("{}\t{}".format(item[0],list(item[1])))

60988	[u'248.254.226.12']
45741	[u'75.226.179.34']
53488	[u'52.101.213.105']
58228	[u'135.41.174.97']
63239	[u'240.193.143.197']
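
The difference between parts B and C comes down to which container the grouped values are converted to. Here is a small plain-Python sketch of that idea, using a `defaultdict` as a local stand-in for `groupByKey`. The pairs are hypothetical, and the second address for user `45741` is invented so that the distinct list has more than one entry.

```python
from collections import defaultdict

# Hypothetical (userid, ip) pairs with a repeated IP for one user.
pairs = [('60988', '248.254.226.12'), ('45741', '75.226.179.34'),
         ('60988', '248.254.226.12'), ('45741', '75.226.179.35')]

grouped = defaultdict(list)
for user_id, ip in pairs:
    grouped[user_id].append(ip)   # local stand-in for groupByKey

# Part B: keep every value, duplicates included (mapValues(list)).
ip_lists = {user: list(ips) for user, ips in grouped.items()}
# Part C: keep distinct values only (mapValues(set)), sorted here for display.
ip_sets = {user: sorted(set(ips)) for user, ips in grouped.items()}

for user in sorted(ip_lists):
    print("{}\t{}\t{}".format(user, len(ip_lists[user]), len(ip_sets[user])))
```

A set has no defined order, so if the order of the distinct addresses matters, sort them after conversion as the sketch does.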


## Join Web Log Data with Account Data

The accounts data is located in `/home/cloudera/training_materials/data/static_data/accounts/`. Sample rows look like this:
```
1,2008-10-23 16:05:05.0,\N,Donald,Becton,2275 Washburn Street,Oakland,CA,94660,5100032418,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
2,2008-11-12 03:00:01.0,\N,Donna,Jones,3885 Elliott Street,San Francisco,CA,94171,4150835799,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
3,2008-12-21 09:19:50.0,\N,Dorthy,Chalmers,4073 Whaley Lane,San Mateo,CA,94479,6506877757,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
```
- The first field in each line is the user ID, which corresponds to the user ID in the web server logs. 
- The other fields include account details such as creation date, first and last name and so on.


### Step 1. Load and parse data
- Load the data from the account data location and parse it.
- Cache the data for later reuse
- Save result in **accounts**

In [163]:
accounts = sc.textFile("file:/home/cloudera/training_materials/data/static_data/accounts/*") \
    .map(lambda line:line.split(',')) \
    .cache()

- Verify the result by printing the first element

In [164]:
accounts.take(1)

[[u'1',
  u'2008-10-23 16:05:05.0',
  u'\\N',
  u'Donald',
  u'Becton',
  u'2275 Washburn Street',
  u'Oakland',
  u'CA',
  u'94660',
  u'5100032418',
  u'2014-03-18 13:29:47.0',
  u'2014-03-18 13:29:47.0']]
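
The later steps refer to fields by position, so it helps to see how the positions map to list indices. The sketch below splits the first sample row from this lab in plain Python. It assumes, as the `split(',')` call in the solution does, that no field contains a comma.

```python
# First sample row from the accounts data in this lab.
line = ('1,2008-10-23 16:05:05.0,\\N,Donald,Becton,2275 Washburn Street,'
        'Oakland,CA,94660,5100032418,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0')

fields = line.split(',')

user_id = fields[0]                       # 1st field: user ID
full_name = fields[3] + ' ' + fields[4]   # 4th and 5th fields: first and last name
postal_code = fields[8]                   # 9th field: postal code

print("{} {} {}".format(user_id, full_name, postal_code))
```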

### Step 2. Join the accounts data with userCount data. 
The goal of this step is to produce a dataset with userid, hitcount, and name. To do this, we need to:
- Create an RDD based on the accounts data consisting of key/value pairs such as:
```
(userid1,[userid1,2008-11-24 10:04:08,\N,Cheryl,West,4905 Olive Street,San Francisco,CA,…])
(userid2,[userid2,2008-11-23 14:05:07,\N,Elizabeth,Kerns,4703 Eva Pearl Street,Richmond,CA,…])
(userid3,[userid3,2008-11-02 17:12:12,2013-07-18 16:42:36,Melissa,Roman,3539 James Martin Circle,…])
```
- Join this Pair RDD with the `userCount` Pair RDD calculated in the first section. The joined data will look like this:
```
(userid1,([userid1,2008-11-24 10:04:08,\N,Cheryl,West,4905 Olive Street,San Francisco,CA,…], 3))
(userid2,([userid2,2008-11-23 14:05:07,\N,Elizabeth,Kerns,4703 Eva Pearl Street,Richmond,CA,…], 8))
(userid3,([userid3,2008-11-02 17:12:12,2013-07-18 16:42:36,Melissa,Roman,3539 James Martin Circle,…], 10))
```
- Save results in **joined**

In [165]:
# After the join, each element is (userid, (row, count)).
joined = accounts.keyBy(lambda row: row[0]) \
    .join(userCount) \
    .map(lambda kv: (kv[0], kv[1][1], kv[1][0][3] + ' ' + kv[1][0][4]))

- Verify the results by printing the first 5 rows

In [166]:
joined.take(5)

[(u'25062', 2, u'Cynthia Anderson'),
 (u'57298', 2, u'Judith Caballero'),
 (u'9256', 2, u'Brett King'),
 (u'623', 2, u'Anthony Valenzuela'),
 (u'72369', 4, u'Evelyn Simmons')]
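
The shape of the joined data can be sketched in plain Python as an inner join of two dictionaries. The names come from the sample account rows in this lab, with the other fields omitted, so the name indices here are 1 and 2 rather than 3 and 4. The hit counts are hypothetical.

```python
# Account rows keyed by user ID (trimmed to id, first name, last name).
accounts_by_id = {'1': ['1', 'Donald', 'Becton'],
                  '2': ['2', 'Donna', 'Jones'],
                  '3': ['3', 'Dorthy', 'Chalmers']}

# Hypothetical hit counts; user '3' has no requests and user '9' has no account.
user_count = {'1': 4, '2': 2, '9': 6}

# Inner join: keep only keys present on both sides, as (key, (left, right)).
joined = [(uid, (row, user_count[uid]))
          for uid, row in sorted(accounts_by_id.items())
          if uid in user_count]

# Reshape each element to (userid, count, "First Last").
result = [(uid, count, row[1] + ' ' + row[2]) for uid, (row, count) in joined]
print(result)
```

As with `join` on Pair RDDs, keys that appear on only one side (`'3'` and `'9'` here) are dropped from the result.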

### Step 3. Create a list of names by Postal Code
- Use **keyBy** to create an RDD of account data with the postal code (9th field in the CSV data) as the key
- Convert the RDD to a new RDD with postal code as the key and a list of names (Last Name,First Name) in that postal code as the value.
    - Hint: First name and last name are the 4th and 5th fields respectively
- Save results in **postalRoster**

In [167]:
postalRoster = accounts.keyBy(lambda row:row[8]) \
    .mapValues(lambda row:(row[3],row[4])) \
    .groupByKey() \
    .mapValues(list)

- Take the first five postal codes sorted by postal code

In [168]:
first5postal = postalRoster.takeOrdered(5, key=lambda kv: kv[0])
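
`takeOrdered` returns the N smallest elements according to the key function, in ascending order. A rough plain-Python analogue is `heapq.nsmallest`, sketched below on a small roster whose names are taken from the sample output of this lab. Here `postal_roster` is an ordinary list standing in for the RDD.

```python
import heapq

# Small roster: (postal code, list of (first name, last name)) pairs.
postal_roster = [('85003', [('Mark', 'Martin')]),
                 ('85000', [('Harvey', 'Allen'), ('Daniel', 'Prinz')]),
                 ('85002', [('Ruby', 'Whitney')]),
                 ('85001', [('Frances', 'Mendelsohn')])]

# Keep the two smallest entries by postal code, in ascending order.
first2 = heapq.nsmallest(2, postal_roster, key=lambda kv: kv[0])

for postal, names in first2:
    print("--- {}".format(postal))
    for first, last in names:
        print("{}, {}".format(last, first))
```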

- Display the names for the first five postal codes in the following format, e.g.:
```
--- 85003
Jenkins, Thad 
Rick, Edward 
Lindsay, Ivy
...
--- 85004
Morris, Eric 
Reiser, Hazel 
Gregg, Alicia 
Preston, Elizabeth
...
```

In [169]:
for row in first5postal:
    print("--- {}".format(row[0]))
    for name in row[1]:
        print("{}, {}".format(name[1],name[0]))

--- 85000
Allen, Harvey
Prinz, Daniel
Pascale, Robert
Brookes, Donna
Mackenzie, James
Chamberlain, Robert
Cunningham, Richard
Sewell, Bailey
Marin, Daniel
--- 85001
Mendelsohn, Frances
Watson, Mary
Brookover, Donald
Hathaway, Brandon
Leonard, Crystal
Moran, Carrie
Kirksey, Marie
Lance, Issac
Barnes, Vesta
Fiore, Eva
Tucker, Keith
Medford, Danielle
Spell, Norman
Soto, Shelley
Frantz, Kathy
Wilkins, Timothy
Snyder, Joseph
Flores, Delbert
Eakes, Gail
Daniels, Bert
Carpenter, Vincent
--- 85002
Whitney, Ruby
Perry, David
James, Marianne
Holiman, Nancy
Roman, Allen
Manus, Donna
Reed, Nancy
Baird, Estella
Gilbert, James
McKay, David
Clark, Laura
Horn, John
Payne, Jessica
Stewart, Bryant
Jones, Jose
Robinson, Wesley
--- 85003
Martin, Mark
Ross, Vivian
Tabor, Harry
Strickland, Kyle
Dvorak, Kevin
Wisniewski, Virginia
Gibson, Catherine
Thies, Lindsey
--- 85004
Kitts, Mary
Viola, Kevin
Meadows, Tonya
Royalty, Sherry
Collins, Greg
Shirley, Joseph
White, Sandra
Stern, Timothy
Johnson, Dominic
Dewitt