# Theory

## Earlier implementation
An RDD has a compute function that produces an Iterator[T] for the
data that will be stored in the RDD.

The compute function (or computation) is opaque to Spark. That is, Spark does
not know what you are doing in the compute function. Whether you are performing
a join, filter, select, or aggregation, Spark only sees a lambda expression.
Another problem is that the Iterator[T] data type is also opaque for Python
RDDs; Spark only knows that it’s a generic object in Python.
Because Spark cannot inspect the computation or expression in the function,
it has no way to optimize it.
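
To make the idea of an opaque compute function concrete, here is a minimal pure-Python sketch. `ToyRDD` and its `fn` attribute are hypothetical names invented for this illustration and are not Spark's actual classes; the sketch only assumes that the engine holds a callable that yields an iterator of rows.

```python
from typing import Callable, Iterator, List, Tuple

Row = Tuple[str, int]


class ToyRDD:
    """Hypothetical stand-in for an RDD: some rows plus a compute function."""

    def __init__(self, rows: List[Row], fn: Callable[[Iterator[Row]], Iterator[Row]]):
        self.rows = rows
        self.fn = fn

    def compute(self) -> Iterator[Row]:
        # The engine can only call the function; it cannot look inside it.
        return self.fn(iter(self.rows))


people = [("Brooke", 20), ("Denny", 31), ("TD", 35)]

# Two very different computations: a filter and a per-row transformation.
keep_over_30 = ToyRDD(people, lambda rows: (r for r in rows if r[1] > 30))
add_one_year = ToyRDD(people, lambda rows: ((name, age + 1) for name, age in rows))

# From the outside, both are just anonymous callables with the same name.
opaque_names = {keep_over_30.fn.__name__, add_one_year.fn.__name__}
print(opaque_names)

filtered = list(keep_over_30.compute())
aged = list(add_one_year.compute())
print(filtered)
print(aged)
```

Nothing visible from outside distinguishes the filter from the transformation, so an engine in this position has nothing to reason about before running them.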


## New implementation
Spark 2.x introduced a few key schemes for structuring Spark.
One is to express computations by using common patterns found in data
analysis. These patterns are expressed as high-level operations such as
filtering, selecting, counting, aggregating, averaging, and grouping.
This provides added clarity and simplicity.

This specificity is further narrowed through the use of a set of common operators in a
domain-specific language (DSL). These operators are available as APIs in Spark’s
supported languages (Java, Python, Scala, R, and SQL). They let you tell Spark what
you wish to compute with your data, and as a result, it can construct an efficient
query plan for execution.
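
The benefit of declarative operators can be sketched in plain Python. This is a toy model, not Spark's query planner: the `Filter`, `GroupBy`, and `Avg` classes and the `columns_needed` helper are hypothetical names made up for this example. The point is only that when operations are described as data, an engine can inspect them before running anything.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass(frozen=True)
class Filter:
    column: str
    minimum: int  # keep rows where column >= minimum


@dataclass(frozen=True)
class GroupBy:
    column: str


@dataclass(frozen=True)
class Avg:
    column: str


Op = Union[Filter, GroupBy, Avg]


def columns_needed(plan: List[Op]) -> Set[str]:
    """Collect every column the plan refers to, without executing it."""
    return {op.column for op in plan}


# A declarative description of "filter, group by name, average the ages".
plan = [Filter("age", 21), GroupBy("name"), Avg("age")]

needed = columns_needed(plan)
print(sorted(needed))  # ['age', 'name']
```

Because the plan names its columns explicitly, an engine could, for instance, skip reading any column that does not appear in `needed`. The same question cannot be answered for an arbitrary lambda.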

The final scheme of order and structure is to allow you to arrange your data in a
tabular format, like a SQL table or spreadsheet, with supported structured data types.

# Getting to work
We want to compute the average age for each name: group the records by
name, aggregate the ages in each group, and then average them.
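
As a reference point, the same computation can be written in plain Python without Spark. This sketch uses the same (name, age) records as the notebook cells; the variable names `records`, `totals`, and `averages` are chosen for this illustration only.

```python
from collections import defaultdict

records = [("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)]

# For each name, keep a running [sum of ages, number of records].
totals = defaultdict(lambda: [0, 0])
for name, age in records:
    totals[name][0] += age
    totals[name][1] += 1

# Divide each sum by its count to get the average age per name.
averages = {name: total / count for name, (total, count) in totals.items()}
print(averages)
```

Keeping a (sum, count) pair per name is the same trick the low-level RDD version relies on: sums and counts can be combined in any order, whereas averages cannot be merged directly.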

## Low-level RDD API

In [None]:
from pyspark import SparkContext

In [20]:
# Create an RDD of (name, age) tuples
sc = SparkContext("local", "SparkOldImplementation")
data_rdd = sc.parallelize([("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)])

# Pair each age with a count of 1, sum ages and counts per name, then divide sum by count
ages_rdd = (
    data_rdd
    .map(lambda x: (x[0], (x[1], 1)))
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    .map(lambda x: (x[0], x[1][0] / x[1][1]))
)
for age_info in ages_rdd.collect():
    print(age_info)
sc.stop()

('Brooke', 22.5)
('Denny', 31.0)
('Jules', 30.0)
('TD', 35.0)


## Using structured APIs

In [23]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

In [25]:
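# Create (or reuse) a SparkSession, the entry point to the structured APIs
spark = SparkSession.builder.appName("SparkStructuredApis").getOrCreate()

# Create a DataFrame with named columns
data_df = spark.createDataFrame([("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)], ["name", "age"])
# Group rows by name and average the ages in each group
ages_df = data_df.groupBy("name").agg(avg("age"))
ages_df.show()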

+------+--------+
|  name|avg(age)|
+------+--------+
|Brooke|    22.5|
| Jules|    30.0|
|    TD|    35.0|
| Denny|    31.0|
+------+--------+

