#### The Spark Session

`SparkContext` is the main entry point for Spark functionality: it connects the application to the cluster.

To create a `SparkContext`, we need a `SparkConf` object that specifies some parameters, including the master node's IP address.

To read DataFrames, we need the Spark SQL equivalent, the `SparkSession`, which similarly needs some parameters to be specified.

Here, `log_of_songs` is an example list of thousands of song strings, which is now passed into the `parallelize` method of the `spark.sparkContext` object:

`distributed_song_log_rdd = spark.sparkContext.parallelize(log_of_songs)`

To apply processing such as `.lower()` to every element, we can first define a function:

```python
def convert_to_lower_case(song):
    return song.lower()
```

We can now use the Spark function `map` to apply the `convert_to_lower_case` function to every song, like:

`distributed_song_log_rdd.map(convert_to_lower_case)`

This can also be written using the anonymous function `lambda`:

`distributed_song_log_rdd.map(lambda song: song.lower())`

#### Distributed Data Stores

Large datasets need distributed storage. Distributed file systems and databases store data in a fault-tolerant way, so that when a machine breaks or becomes unavailable, the data is not lost.

Hadoop has [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) for its data storage. It splits files into 64 or 128 megabyte blocks, and these blocks are replicated across the cluster.

#### Imperative vs Declarative Programming

Imperative programming, for example Spark DataFrames and Python, is about "how?", while declarative programming, for example SQL, is about "what?".

#### Data Wrangling with DataFrames Extra Tips

Extra Tips for Working With PySpark DataFrame Functions


General functions

We have used the following general functions, which are quite similar to the methods of pandas DataFrames:

`select()`: returns a new DataFrame with the selected columns

`filter()`: filters rows using the given condition

`where()`: is just an alias for `filter()`

`groupBy()`: groups the DataFrame using the specified columns, so we can run aggregation on them

`sort()`: returns a new DataFrame sorted by the specified column(s). By default, the `ascending` parameter is `True`.

`dropDuplicates()`: returns a new DataFrame with unique rows based on all or just a subset of columns

`withColumn()`: returns a new DataFrame by adding a column or replacing the existing column that has the same name. The first parameter is the name of the new column, the second is an expression of how to compute it.

Aggregate functions

Spark SQL provides built-in methods for the most common aggregations, such as `count()`, `countDistinct()`, `avg()`, `max()`, and `min()`, in the `pyspark.sql.functions` module. These are not the same as Python's built-in functions of the same name (`min()` and `max()`, for example), so be careful not to use them interchangeably.

In many cases, there are multiple ways to express the same aggregation. If we would like to compute one type of aggregate for one or more columns of the DataFrame, we can simply chain the aggregate method after a `groupBy()`. If we would like to use different functions on different columns, `agg()` comes in handy. For example, `agg({"salary": "avg", "age": "max"})` computes the average salary and the maximum age.

User defined functions (UDF)

In Spark SQL we can define our own functions with the `udf` function from the `pyspark.sql.functions` module. The default return type for UDFs is string. If we would like to return another type, we need to specify it explicitly using the types from the `pyspark.sql.types` module.

Window functions

Window functions are a way of combining the values of ranges of rows in a DataFrame. When defining the window, we can choose how to group the rows (with the `partitionBy` method), how to sort them (with `orderBy`), and how wide a window we'd like to use (described by `rangeBetween` or `rowsBetween`).

[Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html), [Spark Python API](https://spark.apache.org/docs/latest/api/python/index.html)

#### Spark SQL

Spark allows querying DataFrames using SQL code, similar to what can be used in MySQL or PostgreSQL.

* [Spark built-in functions](https://spark.apache.org/docs/latest/api/sql/index.html)

* [Spark SQL Guide](https://spark.apache.org/docs/latest/sql-getting-started.html)