# PySpark: Working with RDDs

## Setup

FindSpark: pointing findspark at your Spark installation makes pyspark importable and avoids most problems with Python failing to locate Spark.

In [1]:
import findspark
findspark.init('c:/users/andy/spark')

Load Libraries

In [2]:
from pyspark import SparkConf, SparkContext

Set the file path

In [3]:
data_folder = "C:/Users/Andy/Dropbox/FactoryFloor/Repositories/Tutorial_Udemy_SparkPython/Course_Resources/ml-100k/"

Create the Spark Context

In [4]:
# configure Spark: run locally on all available cores (local[*])
conf = SparkConf().setMaster("local[*]").setAppName("MovieSimilarities")

# create a Spark context object
sc = SparkContext(conf=conf)

Load your data. textFile returns an RDD with one string per line of the file.

In [5]:
data = sc.textFile(data_folder + "u.data")
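Each element of data is still a raw line of text. As a minimal sketch, assuming u.data follows the standard MovieLens 100k layout (tab-separated user id, item id, rating, timestamp), a small helper can turn one line into typed fields. The name parse_rating and the sample line are illustrative and not part of the original notebook.

```python
def parse_rating(line):
    """Split one tab-separated u.data line into (user_id, movie_id, rating)."""
    # Assumed column order: user id, item id, rating, timestamp
    user_id, movie_id, rating, _timestamp = line.split("\t")
    return int(user_id), int(movie_id), float(rating)


# Illustrative sample line in the assumed format
sample = "196\t242\t3\t881250949"
parsed = parse_rating(sample)
print(parsed)
```

With the Spark context above, the helper would typically be applied to every line with data.map(parse_rating).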

## Creating RDDs

A list of mixed-type elements

In [10]:
rdd1 = sc.parallelize([1, 5, 60, 'a', 9, 'c', 4, 'z', 'f'])

Key/value pairs (a list of tuples)

In [9]:
rdd2 = sc.parallelize([('a', 6),
                      ('a', 1),
                      ('b', 2),
                      ('c', 5),
                      ('c', 8),
                      ('c', 11)])

Key/value pairs whose values are lists

In [8]:
rdd3 = sc.parallelize([("a",[1, 2, 3]), ("b",[4, 5])])

## Inspecting

You can look at the entire RDD with collect(), but only if it's small, because collect() copies every element back to the driver:

print(rdd.collect())

Your RDD is more likely to be huge, so it's better to use take(n), which returns only the first n elements:

In [11]:
print(rdd1.take(2))
print(rdd2.take(2))
print(rdd3.take(2))

[1, 5]
[('a', 6), ('a', 1)]
[('a', [1, 2, 3]), ('b', [4, 5])]


## Summarizing RDDs

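* **.count()**: returns the number of elements in the RDD.
* **.countByKey()**: returns a dictionary mapping each key to the number of elements that have that key (key/value RDDs only).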

In [14]:
print('RDD Count:', rdd3.count())
print('RDD Count by key:', rdd3.countByKey())

RDD Count: 2
RDD Count by key: defaultdict(<class 'int'>, {'a': 1, 'b': 1})
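Every key in rdd3 is unique, so each count above is 1. The effect is easier to see on data with repeated keys, such as the pairs in rdd2. The following is a plain-Python sketch that mirrors what the two methods compute; it needs no Spark, and pairs is simply a copy of rdd2's contents.

```python
from collections import Counter

# Same pairs that were used to build rdd2
pairs = [('a', 6), ('a', 1), ('b', 2), ('c', 5), ('c', 8), ('c', 11)]

# Mirrors count(): the total number of elements
total = len(pairs)

# Mirrors countByKey(): how many elements share each key (values are ignored)
counts_by_key = Counter(key for key, _value in pairs)

print('Count:', total)                       # Count: 6
print('Count by key:', dict(counts_by_key))  # Count by key: {'a': 2, 'b': 1, 'c': 3}
```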


**.collectAsMap()**: returns the key/value pairs as a dictionary. A dictionary cannot hold repeated keys, so when a key appears more than once only its last value is kept.

In [17]:
print('RDD Collect as map:',  rdd2.collectAsMap())

RDD Collect as map: {'a': 1, 'b': 2, 'c': 11}
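The values 6, 5 and 8 are missing from the output above because later pairs with the same key overwrote them. The plain-Python sketch below reproduces that behaviour and shows one way to keep every value by grouping first. It needs no Spark, and pairs is a copy of rdd2's contents.

```python
from collections import defaultdict

# Same pairs that were used to build rdd2
pairs = [('a', 6), ('a', 1), ('b', 2), ('c', 5), ('c', 8), ('c', 11)]

# Building a dict keeps only the last value seen for each key
as_map = dict(pairs)

# To keep every value, collect them into a list per key
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

print(as_map)         # {'a': 1, 'b': 2, 'c': 11}
print(dict(grouped))  # {'a': [6, 1], 'b': [2], 'c': [5, 8, 11]}
```

In Spark, the grouped version would typically be written as rdd2.groupByKey().mapValues(list).collectAsMap().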


**.stats()**: returns descriptive statistics (count, mean, standard deviation, max and min) for an RDD of numbers.

In [20]:
rdd4 = sc.parallelize(range(100))
rdd4.stats()

(count: 100, mean: 49.5, stdev: 28.86607004772212, max: 99.0, min: 0.0)
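The stdev reported here is the population standard deviation (dividing by n), not the sample standard deviation (dividing by n - 1). As a quick check, this sketch recomputes the figures with Python's statistics module, without Spark; the variable names are illustrative.

```python
import statistics

values = list(range(100))

mean = statistics.mean(values)             # arithmetic mean
population_sd = statistics.pstdev(values)  # divides by n, matches stats() above
sample_sd = statistics.stdev(values)       # divides by n - 1, slightly larger

print('mean:', mean)
print('population stdev:', population_sd)
print('sample stdev:', sample_sd)
```

If the sample version is needed in Spark, the object returned by stats() also provides sampleStdev().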

## Resources

https://hackersandslackers.com/working-with-pyspark-rdds/