# XYZ Research

Apply PySpark transformations and actions to a group of local lists distributed across a cluster.

## Import Module and Initialize SparkSession

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("xyzResearch").getOrCreate()

25/02/20 12:46:40 WARN Utils: Your hostname, Cesars-MBP.local resolves to a loopback address: 127.0.0.1; using 192.168.7.230 instead (on interface en0)
25/02/20 12:46:40 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


25/02/20 12:46:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/02/20 12:46:42 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## Create a Group of Lists and Convert Them to RDDs

In [3]:
data2001List = ['RIN1', 'RIN2', 'RIN3', 'RIN4', 'RIN5', 'RIN6', 'RIN7']
data2002List = ['RIN3', 'RIN4', 'RIN7', 'RIN8', 'RIN9']
data2003List = ['RIN4', 'RIN8', 'RIN10', 'RIN11', 'RIN12']

In [4]:
data2001rdd = spark.sparkContext.parallelize(data2001List)
data2002rdd = spark.sparkContext.parallelize(data2002List)
data2003rdd = spark.sparkContext.parallelize(data2003List)

## Tasks

### Task 1

How many research projects were initiated in the three years?

In [5]:
data2001and2002rdd = data2001rdd.union(data2002rdd)
data2001and2002rdd.collect()

['RIN1',
 'RIN2',
 'RIN3',
 'RIN4',
 'RIN5',
 'RIN6',
 'RIN7',
 'RIN3',
 'RIN4',
 'RIN7',
 'RIN8',
 'RIN9']

In [6]:
all_data_rdd = data2001and2002rdd.union(data2003rdd)
all_data_rdd.collect()

['RIN1',
 'RIN2',
 'RIN3',
 'RIN4',
 'RIN5',
 'RIN6',
 'RIN7',
 'RIN3',
 'RIN4',
 'RIN7',
 'RIN8',
 'RIN9',
 'RIN4',
 'RIN8',
 'RIN10',
 'RIN11',
 'RIN12']

In [7]:
all_data_rdd = data2001and2002rdd.union(data2003rdd).distinct()
all_data_rdd.collect()

['RIN4',
 'RIN10',
 'RIN2',
 'RIN5',
 'RIN11',
 'RIN6',
 'RIN1',
 'RIN9',
 'RIN12',
 'RIN3',
 'RIN8',
 'RIN7']

In [8]:
all_data_rdd.count()

12

***Summary***: The data is spread across three lists that overlap one another, so we apply several transformations and actions to find the total number of projects.

- Create an RDD (`data2001and2002rdd`) by applying the `union()` transformation to the 2001 and 2002 datasets.
- Apply `union()` again with the 2003 dataset to combine all three lists into a single RDD (`all_data_rdd`).
- Apply `distinct()` to remove the duplicate values in the combined dataset, then use the `count()` action to get the total of 12 projects.
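
As a cluster-free sanity check, the same total can be reproduced with plain Python sets, since `union()` followed by `distinct()` behaves like a set union. This is only a verification sketch and not part of the original notebook; the name `all_projects` is illustrative.

```python
# Same lists as in the notebook.
data2001List = ['RIN1', 'RIN2', 'RIN3', 'RIN4', 'RIN5', 'RIN6', 'RIN7']
data2002List = ['RIN3', 'RIN4', 'RIN7', 'RIN8', 'RIN9']
data2003List = ['RIN4', 'RIN8', 'RIN10', 'RIN11', 'RIN12']

# A set union drops duplicates, mirroring union() + distinct() on the RDDs.
all_projects = set(data2001List) | set(data2002List) | set(data2003List)

print(len(all_projects))  # 12, matching all_data_rdd.count()
```

Unlike the RDD version, the set is built on the driver only, so it is suitable for small inputs like these lists.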

### Task 2

How many projects were completed in the first year?

In [13]:
firstYearCompleted = data2001rdd.subtract(data2002rdd)
firstYearCompleted.collect()

['RIN6', 'RIN2', 'RIN5', 'RIN1']

In [14]:
firstYearCompleted.count()

4

***Summary***: To find the number of projects completed in the first year, apply the `subtract()` transformation to the two RDDs. It returns an RDD containing the elements that are present in `data2001rdd` but not in `data2002rdd`, that is, the projects that appear in 2001 and no longer appear in 2002. The `count()` action then returns 4.
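
For comparison, a set difference in plain Python gives the same answer here. This is a sketch for cross-checking the result, not part of the original notebook; `first_year_completed` is an illustrative name. Note that a set difference always deduplicates, whereas `subtract()` keeps duplicates from the left-hand RDD; the two agree in this case because `data2001List` has no repeated values.

```python
# Same lists as in the notebook.
data2001List = ['RIN1', 'RIN2', 'RIN3', 'RIN4', 'RIN5', 'RIN6', 'RIN7']
data2002List = ['RIN3', 'RIN4', 'RIN7', 'RIN8', 'RIN9']

# Projects listed in 2001 that no longer appear in 2002.
first_year_completed = set(data2001List) - set(data2002List)

print(sorted(first_year_completed))  # ['RIN1', 'RIN2', 'RIN5', 'RIN6']
print(len(first_year_completed))     # 4
```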

### Task 3

How many projects were completed in the first two years?

In [16]:
firstTwoYearsCompleted = data2001and2002rdd.subtract(data2003rdd)
firstTwoYearsCompleted.collect()

['RIN2', 'RIN5', 'RIN6', 'RIN1', 'RIN9', 'RIN3', 'RIN3', 'RIN7', 'RIN7']

In [17]:
firstTwoYearsCompleted.distinct().count()

7

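***Summary***: This follows the same steps as Task 2, but starts from `data2001and2002rdd`, which contains the elements from the first two years, and uses `subtract()` to remove the elements found in `data2003rdd`. Because `data2001and2002rdd` was built with `union()` alone, the result still contains duplicates (`RIN3` and `RIN7` each appear twice), so `distinct()` is applied before `count()`, giving 7 projects.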
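
The need for `distinct()` can be illustrated in plain Python by filtering the concatenated lists, which keeps duplicates the way `subtract()` does on the left-hand RDD. This is a verification sketch rather than part of the original notebook; `first_two_years` and `remaining` are illustrative names, and the element order will differ from the RDD output.

```python
# Same lists as in the notebook.
data2001List = ['RIN1', 'RIN2', 'RIN3', 'RIN4', 'RIN5', 'RIN6', 'RIN7']
data2002List = ['RIN3', 'RIN4', 'RIN7', 'RIN8', 'RIN9']
data2003List = ['RIN4', 'RIN8', 'RIN10', 'RIN11', 'RIN12']

# Concatenation keeps duplicates, like union() without distinct().
first_two_years = data2001List + data2002List

# Keep every element not present in 2003; duplicates survive the filter.
projects_2003 = set(data2003List)
remaining = [p for p in first_two_years if p not in projects_2003]

print(len(remaining))       # 9 elements, duplicates included
print(len(set(remaining)))  # 7 unique projects
```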