# In this brief notebook, we'll explore a few methods in Spark Core

### We'll start with an online-dating dataset, described here: https://sites.google.com/a/insightdatascience.com/spark-lab/s3-data/dating-profiles

### Here we create an RDD from a CSV stored in S3 and use the collect() action, which returns the RDD's contents as a Python list

In [1]:
# a CSV of id, user name
rawUsersRDD = sc.textFile("s3n://insight-spark-after-dark/users-sm.csv")
rawUsersRDD.collect()

[u'10001,Tony',
 u'10002,Mike',
 u'10003,Pat',
 u'10004,Chris',
 u'10005,Paco',
 u'10006,Eddie',
 u'90001,Lisa',
 u'90002,Cindy',
 u'90003,Paula',
 u'90004,Leslie',
 u'90005,Allman',
 u'90006,Kimberly']

### The collect() action sends the data across the network from the worker nodes to the driver (where you are running the Jupyter notebook or your data analysis), so use it only on data small enough to fit in the driver's memory
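When an RDD is too large to collect safely, one common alternative is the RDD method take(n), which returns only the first n records to the driver. The sketch below wraps it in a helper; the name `preview` and the default of 5 are assumptions for illustration, not part of the original notebook.

```python
def preview(rdd, n=5):
    """Return at most n records to the driver instead of the whole RDD.

    `preview` is a hypothetical helper; it only relies on the RDD's take() method.
    """
    return rdd.take(n)

# In the notebook this could be called as, for example:
#   preview(rawUsersRDD, 3)
```

For the small files used here collect() is fine; the helper matters only once the data no longer fits comfortably on the driver.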

In [2]:
# a CSV of id, gender
rawGendersRDD = sc.textFile("s3n://insight-spark-after-dark/gender-sm.csv")
rawGendersRDD.collect()

[u'10001,M',
 u'10002,M',
 u'10003,M',
 u'10004,M',
 u'10005,M',
 u'10006,M',
 u'90001,F',
 u'90002,F',
 u'90003,F',
 u'90004,F',
 u'90005,F',
 u'90006,F']

In [3]:
def rec_tup(record):
    """Parse one 'id,value' CSV line into an (int id, str value) tuple."""
    tokens = record.split(",")
    return (int(tokens[0]), str(tokens[1]))
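To see what rec_tup produces before handing it to map(), it can be tried on a single line in plain Python, with no Spark involved. The sample line is copied from the users output above; the variable names `sample` and `pair` are only illustrative.

```python
def rec_tup(record):
    # Split "id,value" on the comma and cast the id to an integer
    tokens = record.split(",")
    return (int(tokens[0]), str(tokens[1]))

sample = "10001,Tony"   # one line from users-sm.csv, as shown above
pair = rec_tup(sample)
print(pair)             # (10001, 'Tony')
```

The first element of the tuple acts as the Key and the second as the Value once the function is applied to every line of an RDD.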

### Moving toward a join, we can use the map() method to turn each CSV line into a Key/Value tuple

In [5]:
usersRDD = rawUsersRDD.map(rec_tup)
usersRDD.collect()

[(10001, 'Tony'),
 (10002, 'Mike'),
 (10003, 'Pat'),
 (10004, 'Chris'),
 (10005, 'Paco'),
 (10006, 'Eddie'),
 (90001, 'Lisa'),
 (90002, 'Cindy'),
 (90003, 'Paula'),
 (90004, 'Leslie'),
 (90005, 'Allman'),
 (90006, 'Kimberly')]

gendersRDD = rawGendersRDD.map(rec_tup)
gendersRDD.collect()

### Now that we have two RDDs of Key/Value pairs, we can use the join() method to join them on the Key

In [8]:
usersWithGenderJoinedRDD = usersRDD.join(gendersRDD)
usersWithGenderJoinedRDD.collect()

[(90004, ('Leslie', 'F')),
 (10004, ('Chris', 'M')),
 (90005, ('Allman', 'F')),
 (90001, ('Lisa', 'F')),
 (10005, ('Paco', 'M')),
 (10001, ('Tony', 'M')),
 (90006, ('Kimberly', 'F')),
 (10002, ('Mike', 'M')),
 (90002, ('Cindy', 'F')),
 (10006, ('Eddie', 'M')),
 (10003, ('Pat', 'M')),
 (90003, ('Paula', 'F'))]
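The output above has the shape (key, (left_value, right_value)) and comes back in no particular order. As a rough mental model, join() behaves like an inner join on the key: only keys present in both RDDs survive. The plain-Python sketch below mimics that behaviour without Spark. The function `inner_join` is a hypothetical stand-in, and the sample rows are made up for illustration (user 10002 is given no gender row and id 90009 no user row, unlike the real files).

```python
def inner_join(left, right):
    """Inner-join two lists of (key, value) pairs, mimicking the shape of RDD.join()."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    # Emit (key, (left_value, right_value)) only for keys found on both sides
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

# Illustrative data, not the contents of the S3 files
users = [(10001, "Tony"), (10002, "Mike"), (90001, "Lisa")]
genders = [(10001, "M"), (90001, "F"), (90009, "F")]

joined = inner_join(users, genders)
print(sorted(joined))  # [(10001, ('Tony', 'M')), (90001, ('Lisa', 'F'))]
```

Sorting before printing makes the result easy to compare, since a distributed join gives no ordering guarantee.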

### Next Steps:

#### Question 1: There is another, gzipped CSV at s3n://insight-spark-after-dark/ratings-sm.csv.gz. Create an RDD from it called rawRatingsRDD

#### Question 2: Call the collect() method. What is the structure of the data?
