# Assignment 3

*Objectives*: Wrangle a data set using two new tools, [Trifacta Wrangler](https://www.trifacta.com/products/wrangler/) and [Apache Spark](https://spark.apache.org/).  Results should include a cleaned-up data set and summary statistics.

*Grading criteria*: The tasks should all be completed, and questions should all be answered with clear responses, with shell commands and markdown cells explaining your work as appropriate in the cells provided (add more as needed).  The notebook itself should be completely reproducible (using an AWS EC2 instance based on the class AMI) from start to finish; another person should be able to use the code to obtain the same results as yours.  Note that you will receive no more than partial credit if you do not add text/markdown cells explaining your thinking where required.

*Attestation*: **Work individually**.  At the end of your submitted notebook, state that you did all of the substantial work on this assignment yourself, and acknowledge any assistance you received.

*Deadline*: Sunday, October 22, 12pm.  Zip your notebook and wrangled dataset and submit it to Blackboard as a single file.

## Part 1 - Wrangle a dataset with Trifacta

For this part, select a dataset from the [OKFN US City Open Data Census](http://us-city.census.okfn.org/).  Choose one according to your interest, but try to choose one that's "green" and has somewhere between 10,000 and 1,000,000 rows.  Try to choose a dataset that is less than 50MB (to save your instructors some time and space during grading!).

Document your process by answering each of the following questions.

### Q1.1 - Choose your dataset

Which dataset did you choose?  What is it called, and what is it about?  Provide a link to its main web page (not its data link, which you'll include next).

**Answer**

* I chose the Lobbyist Activity data from the City and County of San Francisco. It lists the name of each officer of the City and County of San Francisco with whom a lobbyist made contact. Lobbyists registered with the Ethics Commission disclose their contacts with public officials on a monthly basis. The dataset is updated through 10/27/2017.
* https://data.sfgov.org/City-Management-and-Ethics/Lobbyist-Activity-Contacts-of-Public-Officials/hr5m-xnxc

### Q1.2 - Get your data

Use `wget` to download your data onto your instance. 

**Answer**

In [1]:
!wget https://data.sfgov.org/api/views/hr5m-xnxc/rows.csv?accessType=DOWNLOAD

--2017-10-29 03:12:55--  https://data.sfgov.org/api/views/hr5m-xnxc/rows.csv?accessType=DOWNLOAD
Resolving data.sfgov.org (data.sfgov.org)... 52.206.140.205
Connecting to data.sfgov.org (data.sfgov.org)|52.206.140.205|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘rows.csv?accessType=DOWNLOAD’

rows.csv?accessType     [      <=>           ]   4.81M  4.47MB/s    in 1.1s    

Last-modified header invalid -- time-stamp ignored.
2017-10-29 03:12:56 (4.47 MB/s) - ‘rows.csv?accessType=DOWNLOAD’ saved [5044507]



In [2]:
!mv 'rows.csv?accessType=DOWNLOAD' sflobbyist.csv

### Q1.3 - Explore your data

Use command line tools of your choice (CSVKit, XSV, or other UNIX commands we've seen in class already) to explore your data.  How long is it?  Does it seem relatively clean, or do you see data issues that need wrangling?

**Answer**

In [3]:
!head sflobbyist.csv | xsv table

Date        Lobbyist           Lobbyist_Firm                              Official            Official_Department                                 Lobbyist_Client                                                                 MunicipalDecision                                                                                              DesiredOutcome                                                                         FileNumber     LobbyingSubjectArea
06/24/2015  Junius, Andrew     Reuben, Junius & Rose, Llp                 Mendrin, Shaunn     Planning, Department Of                             Udr                                                                             399 - Fee Litigation                                                                                           Approval                                                                               CPF-14-513661  Legal
05/31/2016  Olson, Daniel      Morgan Stanley Investment Management Inc.  Wang, Art           

In [4]:
!xsv headers sflobbyist.csv

1   Date
2   Lobbyist
3   Lobbyist_Firm
4   Official
5   Official_Department
6   Lobbyist_Client
7   MunicipalDecision
8   DesiredOutcome
9   FileNumber
10  LobbyingSubjectArea


In [5]:
!xsv count sflobbyist.csv

25224


In [6]:
!xsv select 8 sflobbyist.csv | xsv sort | uniq

DesiredOutcome
200109077811
2015-08833CUA
815 Tennessee Street
A better running transportation system that does not place an undue burden on taxpayers
A fair application of the tax exclusion on stock option compensation
A fair fee increase for commercial development
A policy that can both better support mothers/families and not become too burdensome for businesses to administer.
A workable solution for scheduling and hiring needs of both employees and employers
ABC and CUP Applications
ADVOCATE FOR A CONTIGUOUS CURB
"AFFIRMING THE EXEMPTION DETERMINATION - AT&T NETWORK ""LIGHTSPEED"" UPGRADE"
APPEAL
APPROVAL
APPROVAL OF CONTRACT
Acceptance and Approval by School Board
Access Agreement
Access to Tours
Accommodations on Powell street
Accurate characterization of material ban
Achieve compromise on the three proposed legislation to limit sales and ads for sweetened beverages.
Achieving local hire commitment
Acquisition of new recreation facilities
Acquisition of the 

In [7]:
!xsv select 9 sflobbyist.csv | xsv sort | uniq

FileNumber
""
#150168
(BOS Reference #160425)
011375;010927;011372;011148;011370;011222;012260;012263;012257;012269;012272;011231
041013-01 Acti
09-151
090228
090584
091251
091269
091430
091443
100053
100102
100104
100161
100265
100455
100472
100644
100674
100750
100755
100756
100759
100865
100899
100992/100993
1010 16th Street
101027
101057
101091
101105
101190
101225
101311
101351
101352
101537
1057.30
1062.50
110027
110070
110102
110155
110182
110207
110300
110332
110337
110344
110345
110462
11047
110506
110546
110548
110565
110623
110775
110798
110899
110998
111080
111201
111212
111331
111337
111371
120020
120021
120023
120220
"120266, 120267, 120268, 120269"
120299
120301
120407
120474
120475 & 120753
120554
120629
120669
120681
120681-120682
120802
120898
120941
130248
"1303308, 1303309, 130310, 130311"
130374
130459
130481
130527
130528
130556
130786
130788
13347_ENF
140120
140122
140307
140381
140709
140880
141024
141038
141095
141107
141298
149 9th Street
15-128
15-145
150241


* I used the csvkit and xsv tools to explore the data. It contains 25,224 records, and columns 8 and 9 (DesiredOutcome and FileNumber) have data issues: missing values are indicated in several different ways. Another issue is that a few columns contain commas inside their values, which may break a later split on commas.
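
The two issues noted above can also be checked programmatically. The following is a minimal sketch in Python that runs on a tiny made-up sample rather than the real file; the `MISSING_TOKENS` set and the `profile_column` helper are assumptions introduced for illustration, not part of the assignment or of xsv.

```python
import csv
import io
from collections import Counter

# Hypothetical set of placeholder strings treated as "missing" (an assumption).
MISSING_TOKENS = {"", "n/a", "na", "none"}

def profile_column(csv_text, column):
    """Count missing-value placeholders and comma-containing values in one column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    placeholders = Counter()
    with_commas = 0
    for record in reader:
        value = record[column].strip()
        if value.lower() in MISSING_TOKENS:
            placeholders[value] += 1   # keep the original spelling to see the variants
        if "," in value:
            with_commas += 1
    return placeholders, with_commas

# Tiny made-up sample, not taken from the real data set.
sample = (
    "Lobbyist,FileNumber\n"
    '"Doe, Jane",N/A\n'
    '"Roe, Richard",n/a\n'
    '"Poe, Edgar","120266, 120267"\n'
    '"Low, Ann",\n'
)

placeholders, with_commas = profile_column(sample, "FileNumber")
print(dict(placeholders))
print(with_commas)
```

Counting the placeholders by their original spelling shows how many variants need to be normalized before the column can be treated consistently.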

### Q1.4 - Wrangle your data with Trifacta

Use Trifacta to import your data.  Find at least two columns you want to wrangle and clean them up - you can split values into new columns, remove bad values, whatever you like.

Execute your recipe, generating a summary you can review, and save your recipe.

Paste your recipe into the cell below using the markdown provided.

**Answer**

```

* splitrows table: MISSING col: column1 on: '\n' quote: '\"'

* split col: column1 on: ',' limit: 9 quote: '\"'

* header table: MISSING

* derive table: MISSING value: FileNumber as: 'column1'

* replace col: column1 with: '' on: `N/A` global: true

* replace col: column1 with: '' on: `n/a` global: true

* replace col: column1 with: '' on: `NA` global: true

* replace col: column1 with: '' on: `None` global: true

* rename col: column1 to: 'FileNumber2'

* derive value: DesiredOutcome as: 'column1'

* rename col: column1 to: 'DesiredOutcome2'

* replace col: DesiredOutcome2 with: '' on: `n/a` global: true

* replace col: DesiredOutcome2 with: '' on: `N/A` global: true

* replace col: DesiredOutcome2 with: '' on: `none` global: true

* replace col: Lobbyist with: '' on: `,` global: true

* replace col: Lobbyist_Firm with: '' on: `,` global: true

* replace col: Official with: '' on: `,` global: true

* replace col: Official_Department with: '' on: `,` global: true

* replace col: Lobbyist_Client with: '' on: `,` global: true

* replace col: MunicipalDecision with: '' on: `,` global: true

* replace col: LobbyingSubjectArea with: '' on: `,` global: true

* drop table: MISSING col: DesiredOutcome

* drop table: MISSING col: FileNumber

* set col: Date value: dateformat($col, 'yyyy-MM-dd')

```
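
For readers without access to Trifacta, the three kinds of cleaning in the recipe (blanking missing-value placeholders, removing commas, and reformatting the date) can be approximated in plain Python. This is only a sketch under stated assumptions: the helper names are hypothetical, the sample `FileNumber` value is made up, the columns are cleaned in place rather than renamed, and a cell is treated as missing only when its whole trimmed value matches a placeholder, which may differ from how the recipe's `replace` steps behave.

```python
from datetime import datetime

# Placeholder strings treated as missing; an assumption based on the recipe steps.
MISSING_TOKENS = {"n/a", "na", "none"}

def clean_missing(value):
    """Return '' when the whole cell is a missing-value placeholder."""
    return "" if value.strip().lower() in MISSING_TOKENS else value

def drop_commas(value):
    """Remove commas so a later split on ',' cannot break the field apart."""
    return value.replace(",", "")

def to_iso_date(value):
    """Convert MM/dd/yyyy (as in the raw file) to yyyy-MM-dd."""
    return datetime.strptime(value, "%m/%d/%Y").strftime("%Y-%m-%d")

def clean_record(record):
    """Apply the three kinds of cleaning to one record (a dict of column -> value)."""
    cleaned = dict(record)
    cleaned["Date"] = to_iso_date(record["Date"])
    for col in ("Lobbyist", "Lobbyist_Firm", "Official", "Official_Department"):
        cleaned[col] = drop_commas(record[col])
    for col in ("DesiredOutcome", "FileNumber"):
        cleaned[col] = clean_missing(record[col])
    return cleaned

# Name and date values follow the first row shown earlier; FileNumber is made up.
record = {
    "Date": "06/24/2015",
    "Lobbyist": "Junius, Andrew",
    "Lobbyist_Firm": "Reuben, Junius & Rose, Llp",
    "Official": "Mendrin, Shaunn",
    "Official_Department": "Planning, Department Of",
    "DesiredOutcome": "Approval",
    "FileNumber": "N/A",
}

cleaned = clean_record(record)
print(cleaned)
```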

### Q1.5 - Evaluate

How did it go?  Did your recipe work on the whole dataset?  Did you run into any problems?

**Answer**

The recipe went well. It worked on the whole dataset.

## Part 2 - Summary statistics with Spark

Use Spark to load your data and compute basic summary statistics (counts, or average, min/max, and mean).  You may borrow liberally from the example we saw in class, just change a few things as appropriate.

This is just to give you a taste... we'll do more with Spark next week and in Project 3.

### Q2.1 - Start Spark

First, load up Spark by executing the following cells.  You can just execute them!

In [8]:
import os

In [9]:
os.environ['SPARK_HOME'] = '/usr/local/lib/spark'

In [10]:
import findspark

In [11]:
findspark.init()

In [12]:
from pyspark import SparkContext

In [13]:
spark = SparkContext(appName='assignment-3')

In [14]:
spark

If it worked, you should see the description of your **SparkContext** and a link (that you can visit by replacing its IP address with your EC2 instance host name).

### Q2.2 - Upload your wrangled data

Upload the data you wrangled with Trifacta in Part 1.  You may use Jupyter's upload function for this; it doesn't need to be captured here.  You may want to compress your data before uploading it.

In a few cells below, ensure that your data uploaded correctly, and uncompress it if necessary.  Count its lines, check its filesize, or look at the first few lines as you deem appropriate until you're confident you have all your data to use here in the notebook.

**Answer**

In [15]:
!wc -l Lobbyist_Activity.csv

25225 Lobbyist_Activity.csv


In [16]:
!ls -lh Lobbyist_Activity.csv

-rw-rw-r-- 1 ubuntu ubuntu 5.1M Oct 29 03:01 Lobbyist_Activity.csv


In [17]:
!csvcut -n Lobbyist_Activity.csv

  1: Date
  2: Lobbyist
  3: Lobbyist_Firm
  4: Official
  5: Official_Department
  6: Lobbyist_Client
  7: MunicipalDecision
  8: DesiredOutcome2
  9: FileNumber2
 10: LobbyingSubjectArea


In [18]:
!head Lobbyist_Activity.csv | xsv table

Date        Lobbyist                Lobbyist_Firm                              Official                 Official_Department                                      Lobbyist_Client                                                                 MunicipalDecision                                                                                              DesiredOutcome2                                                                        FileNumber2    LobbyingSubjectArea
2015-06-24  """Junius Andrew"""     """Reuben Junius & Rose Llp"""             """Mendrin Shaunn"""     """Planning Department Of"""                             Udr                                                                             399 - Fee Litigation                                                                                           Approval                                                                               CPF-14-513661  Legal
2016-05-31  """Olson Daniel"""      Morgan Stanley Investment Ma

Check that the data has been cleaned up in the cell below:

In [19]:
!xsv search -s8,9 "N/A" Lobbyist_Activity.csv | xsv select 8,9 | head | xsv table

DesiredOutcome2  FileNumber2


### Q2.3 - Load your data into a Spark RDD

Load up your data using the techniques we reviewed in class.  Extract the header. Get a count to verify that it's working correctly.

Modify the cells below to get started.

**Answer**

In [20]:
# Edit this cell to point to your file!
data = spark.textFile('Lobbyist_Activity.csv')

In [21]:
header = data.first()
header

'Date,Lobbyist,Lobbyist_Firm,Official,Official_Department,Lobbyist_Client,MunicipalDecision,DesiredOutcome2,FileNumber2,LobbyingSubjectArea'

In [22]:
data.count()

25225

### Q2.4 - Summarize your data

Choose one of the two techniques we saw in class to compute some basic numbers on one of your columns.  Your options are:

 * Use `map` and `filter` and `reduceByKey` with `lambda` functions to find min/max values and to count frequencies in one column
 * Use the `Statistics` module to compute count, mean, min/max (don't forget to import it and numpy)
 
It's your choice.

**Answer**

I will use the first method to find min/max values and to count frequencies, because the data I chose has no numeric columns and only one date column. I will find the min and max of the 'Date' column and count frequencies for the 'LobbyingSubjectArea' column.

In [23]:
Lobbyist = data.filter(lambda row: row != header) \
    .map(lambda row: row.split(","))\
    .take(5)

In [24]:
from operator import add

In [25]:
data.filter(lambda row: row != header) \
    .map(lambda row: row.split(","))\
    .map(lambda cols: cols[9])\
    .take(5)

['Legal',
 'Economic Development',
 'Technology',
 'Environment',
 'Planning and Building Permits']

In [26]:
count_num = data.filter(lambda row: row != header) \
    .map(lambda row: row.split(",")) \
    .map(lambda cols: (cols[9], 1)) \
    .reduceByKey(add) \
    .takeOrdered(40,key=lambda pair: -pair[1])
for subject_area, count in count_num:
    print("{}\t{}".format(count, subject_area))

13843	Planning and Building Permits
2173	Economic Development
1619	Transportation
1171	City Employee Benefits
1101	Housing/Property Tax
895	Public Safety
827	Technology
716	Health
677	Government Administration
519	Public Utilities
363	Accessibility
279	Recreation and Parks
261	Public Works
248	Environment
218	Legal
80	Social Services
51	Education
46	Open Government
46	Airport
35	Arts
28	Human Rights
14	Elections
13	Animal Welfare
1	Library


In [27]:
Lobbyist = data.filter(lambda row: row != header) \
    .map(lambda row: row.split(","))

In [28]:
lobby_date = Lobbyist.map(lambda cols: cols[0])

In [29]:
lobby_date.min()

'2010-01-01'

In [30]:
lobby_date.max()

'2017-09-30'
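
The cells above rely on two ideas that are easy to check outside Spark: counting keys by summing ones, and taking `min`/`max` of date strings. The sketch below is a plain-Python analogue on made-up rows, not the Spark code itself. It also shows why the date range is only trustworthy because the dates were reformatted to `yyyy-MM-dd` during wrangling.

```python
from collections import Counter

# Made-up rows standing in for the split records (date first, subject area last).
rows = [
    ["2015-06-24", "Legal"],
    ["2016-05-31", "Economic Development"],
    ["2010-01-01", "Legal"],
    ["2017-09-30", "Technology"],
]

# Plain-Python analogue of map(lambda cols: (cols[-1], 1)).reduceByKey(add)
subject_counts = Counter(cols[-1] for cols in rows)

# Analogue of takeOrdered(1, key=lambda pair: -pair[1])
most_common = sorted(subject_counts.items(), key=lambda pair: -pair[1])[0]

# yyyy-MM-dd strings sort chronologically as plain text,
# so min() and max() on the strings give the real date range.
dates = [cols[0] for cols in rows]
earliest, latest = min(dates), max(dates)

# MM/dd/yyyy strings do not: the month is compared before the year.
raw_dates = ["12/31/2010", "01/15/2017"]
wrong_earliest = min(raw_dates)

print(most_common, earliest, latest, wrong_earliest)
```

With the raw `MM/dd/yyyy` format, the string minimum is the 2017 date even though the 2010 date is earlier, so the date reformatting in Part 1 is what makes the string comparison valid.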

### Q2.5 - Evaluate

How did it go?  Did it work as you expected?  Did you run into any issues?

What do you like about using Spark?  Or do you dislike it?

**Answer**

I ran into some errors at first. The cause was a data quality issue: values that contain commas were broken apart when the map step split each row into columns. I went back to Q1.4 and redid the data wrangling, and after that it worked well.
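
The comma problem comes from splitting each line with `split(",")`, which ignores CSV quoting. As an alternative to removing commas during wrangling, each line could be parsed with Python's standard `csv` module. The sketch below uses an abbreviated, illustrative line (four fields rather than the full ten columns), and `parse_csv_line` is a hypothetical helper name.

```python
import csv

def parse_csv_line(line):
    """Parse one CSV line, honoring quoted fields that contain commas."""
    return next(csv.reader([line]))

# Abbreviated illustrative line: four fields, two of them quoted with commas inside.
line = '06/24/2015,"Junius, Andrew","Reuben, Junius & Rose, Llp",Legal'

naive = line.split(",")          # breaks quoted fields apart
parsed = parse_csv_line(line)    # keeps each quoted field whole

print(len(naive))
print(parsed)
```

In principle a helper like this could be passed to `map` in place of `lambda row: row.split(",")`, although that was not tried in this notebook. Parsing line by line also assumes that no field contains an embedded newline.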

I like the speed of Spark. However, it is less intuitive than the command line tools, and I will need more practice to become familiar with it.