Create an RDD named `products` with `parallelize`, containing the elements shown in the output below.

In [28]:
products = sc.parallelize(['Apple', 'Apple', 'Cheese', 'Apple', 'Orange'])
products.collect()

['Apple', 'Apple', 'Cheese', 'Apple', 'Orange']

Count the number of elements in `products`.

In [29]:
products.count()

5

Count the number of apples in `products`. Tip: use `filter`.

In [30]:
products.filter(lambda x: x == 'Apple').count()

3

Show the distinct products.

In [31]:
products.distinct().collect()

['Apple', 'Orange', 'Cheese']

Download the babynames file from https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv?accessType=DOWNLOAD and store it locally. Load its contents into an RDD called `babynames` with `textFile`, then show the first 5 lines.

In [32]:
babynames = sc.textFile('../rows.csv')
babynames.take(5)

['Year,First Name,County,Sex,Count',
 '2013,GAVIN,ST LAWRENCE,M,9',
 '2013,LEVI,ST LAWRENCE,M,9',
 '2013,LOGAN,NEW YORK,M,44',
 '2013,HUDSON,NEW YORK,M,49']

The first line in the file is a header. Filter it out to keep only the lines with actual data.

In [33]:
firstline = babynames.first()
test = babynames.filter(lambda x: x != firstline)
test.take(5)

['2013,GAVIN,ST LAWRENCE,M,9',
 '2013,LEVI,ST LAWRENCE,M,9',
 '2013,LOGAN,NEW YORK,M,44',
 '2013,HUDSON,NEW YORK,M,49',
 '2013,GABRIEL,NEW YORK,M,50']

Each element in this RDD is a line of text. Transform each element into a tuple or list of the 5 CSV columns by splitting the line on commas. Show the first 5. Tip: you need `map` and the `split` method on Python strings.

In [34]:
splitb = test.map(lambda x: x.split(','))
splitb.take(5)

[['2013', 'GAVIN', 'ST LAWRENCE', 'M', '9'],
 ['2013', 'LEVI', 'ST LAWRENCE', 'M', '9'],
 ['2013', 'LOGAN', 'NEW YORK', 'M', '44'],
 ['2013', 'HUDSON', 'NEW YORK', 'M', '49'],
 ['2013', 'GABRIEL', 'NEW YORK', 'M', '50']]
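A plain `split(',')` is enough for the rows shown above, but it would break a field that contains a quoted comma. As a hedged sketch, the standard `csv` module can parse a single line instead. The helper name `parse_line` and the quoted sample row are invented for illustration and do not come from the dataset.

```python
import csv


def parse_line(line):
    """Parse one CSV line into a list of fields, honoring quoted commas.

    `parse_line` is a hypothetical helper, not part of the exercise.
    """
    return next(csv.reader([line]))


# A row taken from the output above.
row = parse_line('2013,GAVIN,ST LAWRENCE,M,9')

# An invented row with a comma inside a quoted field.
quoted = parse_line('2013,"DOE, JANE",NEW YORK,F,1')

print(row)
print(quoted)
```

Assuming the same `test` RDD as above, the helper could be passed to `map` in place of the lambda, for example `test.map(parse_line)`.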

Count how many male babies are in the RDD.

In [35]:
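# Keep only the rows for male babies (column 3 is Sex).
males = splitb.filter(lambda x: x[3] == 'M')
# Convert Count to int up front so every value has the same type.
males = males.map(lambda x: (x[3], int(x[4])))
# Sum the counts under the single key 'M'.
males.reduceByKey(lambda x, y: x + y).collect()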

[('M', 667585)]
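One detail worth checking when the conversion to `int` happens inside the reducer: the reducing function is only called when a key has at least two values, so a key with a single value keeps its original string. The plain-Python sketch below imitates that behavior with a hypothetical `reduce_by_key` helper; it is an illustration of the semantics, not Spark's implementation, and the `('F', '5')` pair is invented.

```python
def reduce_by_key(pairs, func):
    """Combine values that share a key, roughly like RDD.reduceByKey.

    Hypothetical stand-in for illustration; it runs in a single process.
    """
    result = {}
    for key, value in pairs:
        # The reducer only runs when the key has been seen before.
        result[key] = func(result[key], value) if key in result else value
    return result


rows = [('M', '9'), ('M', '44'), ('F', '5')]

# Converting inside the reducer: the lone 'F' value stays a string.
late = reduce_by_key(rows, lambda x, y: int(x) + int(y))

# Converting before reducing: every value is an int.
early = reduce_by_key([(k, int(v)) for k, v in rows], lambda x, y: x + y)

print(late)
print(early)
```

For the male count above this makes no difference, because the key `'M'` has many values, but it matters once keys such as rare names occur only once.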

The next objective is to find the most frequently given baby name.

First, convert the RDD into a (key, value) structure. Since we do not need anything but the name, we can convert every element into (name, 1). Show the first 5.

In [36]:
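# Use the Count column (x[4]) as the value rather than 1, so that the sum in
# the next step is the number of babies and not the number of rows.
kv = splitb.map(lambda x: (x[1], int(x[4])))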

Now you can aggregate the elements that have the same key and sum the values to get the number of occurrences per name. Show the first 5; these might differ from the ones displayed below. Tip: use `reduceByKey`.

In [37]:
kv = kv.reduceByKey(lambda x, y: int(x) + int(y))
kv.take(1000)

[('GAVIN', 3618),
 ('LEVI', 1218),
 ('LOGAN', 6118),
 ('HUDSON', 873),
 ('GABRIEL', 5393),
 ('ELIZA', 341),
 ('MADELEINE', 459),
 ('ZARA', 329),
 ('DAISY', 412),
 ('JONATHAN', 5511),
 ('JACKSON', 3836),
 ('JUDY', 91),
 ('DAVID', 8585),
 ('SEBASTIAN', 4027),
 ('SAMUEL', 5358),
 ('DEVORA', 194),
 ('JAYDEN', 10770),
 ('MICHAEL', 12749),
 ('MATTHEW', 11161),
 ('CHARLES', 3054),
 ('LUNA', 511),
 ('ADELE', 143),
 ('LIAM', 7683),
 ('DYLAN', 7664),
 ('DANIEL', 10353),
 ('RYAN', 9978),
 ('ETHAN', 10524),
 ('WYATT', 1288),
 ('SURI', 159),
 ('ZISSY', 246),
 ('YIDES', 270),
 ('WILLIAM', 6935),
 ('ALEXANDER', 9401),
 ('LENA', 353),
 ('CORA', 267),
 ('GIA', 485),
 ('MADELINE', 1673),
 ('ANDREA', 980),
 ('TRINITY', 444),
 ('LEILANI', 429),
 ('HARMONY', 158),
 ('AMANDA', 1194),
 ('RACHEL', 3195),
 ('MARGOT', 88),
 ('NOA', 270),
 ('JESSICA', 1629),
 ('ABBY', 150),
 ('JENNY', 288),
 ('MILANA', 71),
 ('ADDISON', 1890),
 ('MACKENZIE', 1318),
 ('ADRIANNA', 1000),
 ('ATHENA', 309),
 ('HANNA', 221),
 ('ANIYA

Now `map` the (name, frequency) pairs so that you only have the values, and use the `max` action to get the highest value.

In [38]:
maxname = kv.max(lambda x: int(x[1]))
maxname

('MICHAEL', 12749)

Then go back to the (name, frequency) pairs and filter for the pair(s) whose frequency equals the max you found.
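The two ways of finding the top name can be compared in plain Python, using three pairs taken from the output above. This is a sketch of the logic only: Python's built-in `max` with a `key` function stands in for the RDD `max` action, and a list comprehension stands in for `filter`.

```python
# Three (name, frequency) pairs copied from the aggregated output above.
pairs = [('MICHAEL', 12749), ('MATTHEW', 11161), ('JAYDEN', 10770)]

# One step: let max compare the pairs by their second element.
top_pair = max(pairs, key=lambda p: p[1])

# Two steps: find the highest value, then keep every pair that has it.
top_value = max(v for _, v in pairs)
all_top = [p for p in pairs if p[1] == top_value]

print(top_pair)
print(all_top)
```

Both give the same name here. The one-step version returns a single pair, while the two-step version would return every pair that shares the top frequency if there were a tie.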

### There is no reason to map to a single value and then filter to match that value against the tuples, when `max` accepts a key function that looks at the second element of each tuple.
That two-step approach is also more error-prone.

### Also, the original 'correct' answer did not aggregate the Count column, so it found the name that was present in the most counties rather than the name that was actually given most often.
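The difference between counting rows and summing the Count column can be seen on a toy example. The rows below are invented for illustration and are not taken from the dataset; they only follow its column layout (Year, First Name, County, Sex, Count).

```python
from collections import Counter

# Invented rows: ANNA appears once with a large count,
# BETH appears in three counties with small counts.
rows = [
    ['2013', 'ANNA', 'KINGS', 'F', '100'],
    ['2013', 'BETH', 'KINGS', 'F', '5'],
    ['2013', 'BETH', 'QUEENS', 'F', '5'],
    ['2013', 'BETH', 'BRONX', 'F', '5'],
]

row_counts = Counter()  # what summing (name, 1) would give
totals = Counter()      # what summing (name, Count) would give
for row in rows:
    row_counts[row[1]] += 1
    totals[row[1]] += int(row[4])

most_rows = max(row_counts.items(), key=lambda kv: kv[1])
most_babies = max(totals.items(), key=lambda kv: kv[1])

print(most_rows)
print(most_babies)
```

With (name, 1) the winner is the name that occurs in the most rows, while summing the Count column picks the name given to the most babies.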