# Exercise 3

*Objectives*: Wrangle a data set using two new tools, [Trifacta Wrangler](https://www.trifacta.com/start-wrangling/) and [Apache Spark](https://spark.apache.org/).  Results should include a cleaned-up data set and summary statistics.

*Grading criteria*: The tasks should all be completed, and questions should all be answered with clear responses, with shell commands and markdown cells explaining your work as appropriate in the cells provided (add more as needed).  The notebook itself should be completely reproducible (using an AWS EC2 instance based on the class AMI) from start to finish; another person should be able to use the code to obtain the same results as yours.  Note that you will receive no more than partial credit if you do not fill in the text/markdown cells provided explaining your thinking where required.

*Attestation*: **Work individually**.  At the end of your submitted notebook, state that you did all of the substantial work on this assignment yourself, and acknowledge any assistance you received.

*Deadline*: Monday, November 5, 1pm.  Zip your notebook and wrangled dataset and submit it to Blackboard as a single zip (`.zip`) file.

## Part 1 - Wrangle a dataset with Trifacta

For this part, select a CSV dataset from the [OKFN US City Open Data Census](http://us-cities.survey.okfn.org/).  Choose one according to your interest, but try to choose one that's "green" and has somewhere between 10,000 and 1,000,000 rows.  Try also to choose a dataset that is less than 50MB (to save us some time and space during grading!).

Document your process by answering each of the following questions.

### Q1.1 - Choose your dataset

Which dataset did you choose?  What is it called, and what is it about?  Provide a link to its main web page (not its data link, which you'll include next).

**Answer**

The dataset I chose is the 2017 employee salaries for Baltimore, MD. It collects data about employees' salaries in Baltimore, including their names, job information, hire dates, annual salaries, and gross pay.

link : http://us-cities.survey.okfn.org/entry/baltimore/employee-salaries/

### Q1.2 - Get your data

If possible, use `wget` to download your data onto your instance. **If you cannot**, make sure that the link you provided works, note that it will need to be uploaded manually, and upload it manually so you can inspect it in the next sections.

**Answer**

In [7]:
!wget  https://data.baltimorecity.gov/api/views/fh59-3d3c/rows.csv

--2018-11-02 13:53:04--  https://data.baltimorecity.gov/api/views/fh59-3d3c/rows.csv
Resolving data.baltimorecity.gov (data.baltimorecity.gov)... 52.206.140.205, 52.206.140.199, 52.206.68.26
Connecting to data.baltimorecity.gov (data.baltimorecity.gov)|52.206.140.205|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: 'rows.csv'

rows.csv                [ <=>                ]   1.46M  --.-KB/s    in 0.03s   

Last-modified header invalid -- time-stamp ignored.
2018-11-02 13:53:04 (46.6 MB/s) - 'rows.csv' saved [1526371]




Use command line tools of your choice (CSVKit, XSV, or other UNIX commands we've seen in class already) to explore your data.  How long is it?  Does it seem relatively clean? Do you see data issues that need wrangling?

**Answer**

In [12]:
!csvstat rows.csv

  1. "NAME"

	Type of data:          Text
	Contains null values:  False
	Unique values:         13439
	Longest value:         31 characters
	Most common values:    Baylor Thompson,Joyce M (2x)
	                       Brown,Kevin M (2x)
	                       Brown,Robert L (2x)
	                       Canan,Ruth E (2x)
	                       Carter,Angela (2x)

  2. "JOBTITLE"

	Type of data:          Text
	Contains null values:  False
	Unique values:         1077
	Longest value:         30 characters
	Most common values:    Police Officer (1514x)
	                       Laborer (Hourly) (556x)
	                       RECREATION ARTS INSTRUCTOR (366x)
	                       EMT Firefighter Suppression (300x)
	                       OFFICE SUPPORT SPECIALIST III (280x)

  3. "DEPTID"

	Type of data:          Text
	Contains null values:  False
	Unique values:         665
	Longest value:         6 characters
	Most common values:    P04001 (368x)
	                       C90786 (241x)
	 

In [None]:
!csvcut -n rows.csv

The output above shows that this dataset contains 7 columns. The Gross column has some null values, and the HIRE_DT format needs fixing: its time component appears to be redundant. We can use Trifacta to change the date format and to fill the null Gross values with ANNUAL_RT (this introduces some error, but the values should be relatively close). The NAME, JOBTITLE, and department description columns cause problems when the file is converted to an RDD, so they are all removed.
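
One likely reason the text columns cause trouble later is that `NAME` values such as `Brown,Kevin M` contain commas, so splitting a line on `,` yields more fields than there are columns. The following is a minimal sketch using Python's standard `csv` module. The sample row is hypothetical (only the name, job title, and department ID are taken from the `csvstat` output; the other values are made up), and it assumes such fields are quoted in the file.

```python
import csv
import io

# Hypothetical row: name, job title and DEPTID appear in the csvstat output,
# the remaining values are invented for illustration.
sample = ('"Brown,Kevin M","Police Officer","A99416","Police Department",'
          '"01/01/2010","65009.00","70000.00"')

# A naive split treats the comma inside the quoted name as a separator.
naive = sample.split(",")

# csv.reader honours the quotes and returns one value per column.
parsed = next(csv.reader(io.StringIO(sample)))

print(len(naive))   # one field too many: the name was split in two
print(len(parsed))  # one field per column
print(parsed[0])
```

Dropping the comma-bearing columns, as done here, is one way around the problem; parsing each line with `csv.reader` would be another.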

### Q1.4 - Wrangle your data with Trifacta

Use Trifacta to import your data.  You will have to create an account, which is free, to use Trifacta Wrangler.  

Find **at least two columns** you want to wrangle and clean them up - you can split values into new columns, remove bad values, whatever you like.

Execute your recipe, generating a summary you can review, and download your recipe.

Paste the text of your recipe into the cell below using the markdown provided.

**Answer**

```
dateformat col: HIRE_DT type: slashdate
set col: Gross value: IFMISSING($col, ANNUAL_RT)
drop col: NAME action: Drop
drop col: JOBTITLE action: Drop
drop col: DESCR action: Drop
```
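
For comparison, the fill and drop steps of the recipe can be approximated in plain Python. This is a sketch under stated assumptions, not Trifacta's implementation: `wrangle_row` and `DROPPED` are hypothetical names, a missing `Gross` is assumed to show up as an empty string, the sample row is invented, and the `dateformat` step is left out because the raw `HIRE_DT` format is not shown here.

```python
DROPPED = ("NAME", "JOBTITLE", "DESCR")

def wrangle_row(row):
    """Return a copy of row with Gross filled in and the text columns dropped."""
    out = {k: v for k, v in row.items() if k not in DROPPED}
    # Mirror IFMISSING($col, ANNUAL_RT): fall back to the annual rate.
    if not (out.get("Gross") or "").strip():
        out["Gross"] = out["ANNUAL_RT"]
    return out

# Hypothetical input row with a missing Gross value.
row = {"NAME": "Brown,Kevin M", "JOBTITLE": "Police Officer",
       "DEPTID": "A99416", "DESCR": "Police Department",
       "HIRE_DT": "01/01/2010", "ANNUAL_RT": "65009.00", "Gross": ""}

result = wrangle_row(row)
print(result)
```

A function like this could be applied to each row yielded by `csv.DictReader`.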

### Q1.5 - Evaluate

How did it go?  Did your recipe work on the whole dataset?  Did you run into any problems?

**Answer**

The recipe ran smoothly on the whole dataset. It took a while because the dataset is relatively large, but in the end I got the dataset I wanted.

## Part 2 - Summary statistics with Spark

Use Spark to load your data and compute basic summary statistics (counts, or average, min/max, and mean).  You may borrow liberally from the example we saw in class, just change a few things as appropriate.

This is just to give you a taste... we'll do more with Spark next week and in Project 3.

### Q2.1 - Start Spark

First, load up Spark by executing the following cells.  You can just execute them!

In [5]:
import findspark

In [6]:
findspark.init()

In [7]:
from pyspark import SparkContext

In [8]:
spark = SparkContext(appName='exercise-3')

In [9]:
spark

If it worked, you should see the description of your **SparkContext** and a link (that you can visit by replacing its IP address with your EC2 instance host name).

### Q2.2 - Upload your wrangled data

Upload the data you wrangled with Trifacta in Part 1.  You may use Jupyter's upload function for this; it doesn't need to be captured here.  You may want to compress your data before uploading it.

In a few cells below, ensure that your data uploaded correctly, and uncompress it if necessary.  Count its lines, check its filesize, or look at the first few lines as you deem appropriate until you're confident you have all your data to use here in the notebook.

**Answer**

In [91]:
!csvstat Baltimore.csv

  1. "DEPTID"

	Type of data:          Text
	Contains null values:  False
	Unique values:         665
	Longest value:         6 characters
	Most common values:    P04001 (368x)
	                       C90786 (241x)
	                       P04002 (230x)
	                       A99416 (157x)
	                       A64003 (134x)

  2. "HIRE_DT"

	Type of data:          Date
	Contains null values:  False
	Unique values:         4639
	Smallest value:        1962-04-03
	Largest value:         2017-10-02
	Most common values:    2017-05-27 (90x)
	                       2007-06-23 (66x)
	                       2006-07-01 (52x)
	                       2007-12-12 (49x)
	                       2017-07-12 (49x)

  3. "ANNUAL_RT"

	Type of data:          Number
	Contains null values:  False
	Unique values:         1769
	Smallest value:        1,800
	Largest value:         250,000
	Sum:                   740,215,086
	Mean:                  54,899.88
	Median:                50,656
	StDev:            

In [92]:
!csvcut -n Baltimore.csv

  1: DEPTID
  2: HIRE_DT
  3: ANNUAL_RT
  4: Gross


### Q2.3 - Load your data into a Spark RDD

Load up your data using the techniques we reviewed in class.  Extract the header. Get a count to verify that it's working correctly.

Modify the cells below to get started.

**Answer**

In [70]:
# Edit this cell to point to your file!
data = spark.textFile('Baltimore.csv')

In [90]:
header = data.first()
header

'"DEPTID","HIRE_DT","ANNUAL_RT","Gross"'

In [72]:
data.count()

13484

### Q2.4 - Summarize your data

Choose one of the two techniques we saw in class to compute some basic numbers on one of your columns.  Your options are:

 * Use `map` and `filter` and `reduceByKey` with `lambda` functions to find min/max values and to count frequencies in one column
 * Use the `Statistics` module to compute count, mean, min/max (don't forget to import it and numpy)
 
It's your choice.

**Answer**

To analyze the data, we first use Spark to find the 10 most frequent salary amounts in the dataset.

In [73]:
from operator import add

In [76]:
salary_top10 = data.filter(lambda row: row != header) \
    .map(lambda row: row.split(",")) \
    .map(lambda cols: (cols[2], 1)) \
    .reduceByKey(add)

In [77]:
salary_top10.takeOrdered(10, key=lambda r: -r[1])

[('"20800.00"', 269),
 ('"19240.00"', 197),
 ('"65009.00"', 196),
 ('"48971.00"', 149),
 ('"30430.00"', 139),
 ('"24960.00"', 134),
 ('"31345.00"', 118),
 ('"83881.00"', 110),
 ('"32260.00"', 100),
 ('"10864.00"', 93)]
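
The keys in this result still carry the CSV quote characters (for example `'"20800.00"'`). Below is a small sketch of the same frequency count with the quotes stripped, written in plain Python with `collections.Counter` standing in for `reduceByKey(add)`. The `lines` list is hypothetical sample data, `salary_key` is an illustrative helper that could also be passed to `map` on the RDD, and the plain `split(",")` assumes the remaining columns contain no embedded commas.

```python
from collections import Counter

def salary_key(line):
    """Return the ANNUAL_RT field (third column) without surrounding quotes."""
    return line.split(",")[2].strip('"')

# Hypothetical lines in the same column layout as Baltimore.csv (header removed).
lines = [
    '"P04001","2017-05-27","20800.00","1500.00"',
    '"C90786","2007-06-23","20800.00","21000.00"',
    '"A99416","2006-07-01","65009.00","70000.00"',
]

counts = Counter(salary_key(line) for line in lines)
print(counts.most_common(2))
```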

In [86]:
salary = data.filter(lambda row: row != header) \
    .map(lambda row: row.split(","))\
    .map(lambda col: float(col[2].replace('"', '')))

In [87]:
salary.take(10)

[57863.0,
 78600.0,
 54486.0,
 37415.0,
 72800.0,
 65009.0,
 67218.0,
 83576.0,
 45893.0,
 22464.0]

In [88]:
salary.max()

250000.0

In [89]:
salary.min()

1800.0
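
The assignment also asks for a count and a mean. One way to get count, mean, min, and max in a single pass is to fold each value into a small tuple. The sketch below defines the two combining functions in plain Python and checks them with `functools.reduce`; with Spark they could be passed as `salary.aggregate(ZERO, seq_op, comb_op)`. The names `ZERO`, `seq_op`, and `comb_op` are illustrative, and the input is only the ten values shown by `salary.take(10)`, not the full dataset.

```python
from functools import reduce

# Accumulator layout: (count, total, minimum, maximum)
ZERO = (0, 0.0, float("inf"), float("-inf"))

def seq_op(acc, x):
    """Fold one value into an accumulator."""
    n, total, lo, hi = acc
    return (n + 1, total + x, min(lo, x), max(hi, x))

def comb_op(a, b):
    """Merge two partial accumulators (one per partition in Spark)."""
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))

sample = [57863.0, 78600.0, 54486.0, 37415.0, 72800.0,
          65009.0, 67218.0, 83576.0, 45893.0, 22464.0]

# Simulate two partitions, then merge their partial results.
left = reduce(seq_op, sample[:5], ZERO)
right = reduce(seq_op, sample[5:], ZERO)
count, total, lo, hi = comb_op(left, right)
print(count, total / count, lo, hi)
```

Keeping the merge step separate from the per-value step is what lets the same logic run across partitions.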

### Q2.5 - Evaluate

How did it go?  Did it work as you expected?  Did you run into any issues?

What do you like about using Spark?  Or do you dislike it?

**Answer**

When trying to get the max/min/mean values of the data, I found that Spark sometimes went wrong, even though I chose the right column and converted it to float for comparison. After looking back at the CSV file, I found that the 'NAME', 'JOBTITLE', and 'DESCR' columns cause trouble when the CSV file is converted to an RDD, so they had to be removed, and this solved the problem.