<font size=6> Spark Data Frames and SQL</font><br><br>
# **MSTC MLlab**

## Sources:
* [Introduction to Spark with Python, by Jose A. Dianes](http://jadianes.github.io/spark-py-notebooks)
* [Complete Guide on DataFrame Operations in PySpark](https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/)
* [Understanding-DataFrames](https://github.com/awantik/pyspark-tutorial/wiki/Understanding-DataFrames)
* [From Pandas to Spark Dataframes](https://github.com/awantik/pyspark-tutorial/wiki/Migrating-from-Pandas-to-Apache-Spark%E2%80%99s-DataFrame)
* [Also ML](https://www.analyticsvidhya.com/blog/2016/09/comprehensive-introduction-to-apache-spark-rdds-dataframes-using-pyspark/)

<font size=5 color=brown> This notebook introduces Spark's capabilities for dealing with data in a structured way. Basically, everything revolves around the concept of the *Data Frame* and using the *SQL language* to query it.</font>
<br><br>

<font size=5> In Apache Spark, a DataFrame is a **distributed collection of rows under named columns**. In simple terms, it is the same as a table in a relational database or an Excel sheet with column headers. It also shares some common characteristics with the RDD:</font>

*    <font size=5 color=red>Immutable</font> <font size=4>in nature: once created, a DataFrame / RDD cannot be changed. Applying a transformation does not modify it but returns a new DataFrame / RDD.
*    **Lazy evaluation**: a task is not executed until an action is performed.
*    **Distributed**: both RDDs and DataFrames are distributed in nature.</font>
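
As a rough plain-Python analogy (this is not Spark code), a generator behaves like a lazy transformation: defining it does no work, and only consuming it, the equivalent of an action, triggers the computation. The names `lazy_filter`, `log` and `ages` are made up for this illustration.

```python
def lazy_filter(rows, predicate, log):
    """Generator: describes the filtering but runs nothing until iterated."""
    for row in rows:
        log.append(row)              # record that this row was really processed
        if predicate(row):
            yield row


log = []
ages = [47, 22, 28]

# "Transformation": only builds the recipe, no row is touched yet.
over_25 = lazy_filter(ages, lambda age: age > 25, log)
work_before_action = len(log)

# "Action": consuming the generator forces the evaluation.
result = list(over_25)
```

Note that `ages` itself is left untouched, which mirrors immutability: the filter produces a new collection instead of changing the original one.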
 

### PERFORMANCE:

![How to create a DataFrame](https://camo.githubusercontent.com/cc93c064c6fd754df0209d42ec054998edd81fa0/68747470733a2f2f7777772e736166617269626f6f6b736f6e6c696e652e636f6d2f6c6962726172792f766965772f6c6561726e696e672d7079737061726b2f393738313738363436333730382f67726170686963732f4230353739335f30335f30332e6a7067)

 ## How to create a DataFrame ?
 
 ![How to create a DataFrame](https://www.analyticsvidhya.com/wp-content/uploads/2016/10/DataFrame-in-Spark.png)

* ### A Spark `DataFrame` is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Pandas. It can be constructed from a wide array of sources, such as an existing RDD, as in our case.

* ### The entry point into all SQL functionality in Spark is the `SQLContext` class (in Spark 2.0 and later, `SparkSession` takes over this role). To create a basic instance, all we need is a `SparkContext` reference. Since we are running Spark in shell mode (using PySpark), we can use the global context object `sc` for this purpose.

In [None]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

## <font color=#AA1B5A> DataFrame from an RDD of Row objects

From: http://www.cs.sfu.ca/CourseCentral/732/ggbaker/content/spark-sql.html

### Think of a DataFrame as being implemented with an RDD of Row objects.
### <font color=#F01B5A>Nicest way to create Rows: create a custom subclass for your data:

In [None]:
from pyspark.sql import Row

NameAge = Row('fname', 'lname', 'age') # build a Row subclass
data_rows = [
    NameAge('John', 'Smith', 47),
    NameAge('Jane', 'Smith', 22),
    NameAge('Frank', 'Jones', 28),
]

In [None]:
# create a DataFrame from an RDD of Rows
data_rdd = sc.parallelize(data_rows)
data = sqlContext.createDataFrame(data_rdd)

In [None]:
type(data)

In [None]:
# ... or from a list (equivalent for small data)
data = sqlContext.createDataFrame(data_rows)

In [None]:
type(data)

In [None]:
data.show()

### To use Spark SQL, our data needs a schema.

In [None]:
data.printSchema()
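
In the cell above Spark infers the schema from the `Row` objects. A minimal sketch of stating the schema explicitly instead, assuming Spark 2.0 or later, where `createDataFrame` also accepts a data type string such as `"fname: string, lname: string, age: int"`. The names `PEOPLE_FIELDS`, `schema_string` and `people_dataframe` are hypothetical helpers, not part of Spark.

```python
# Hypothetical field list mirroring the fname / lname / age example.
PEOPLE_FIELDS = [('fname', 'string'), ('lname', 'string'), ('age', 'int')]


def schema_string(fields):
    """Build a data type string such as 'fname: string, age: int'."""
    return ', '.join('{}: {}'.format(name, dtype) for name, dtype in fields)


def people_dataframe(sql_context, rows, fields=PEOPLE_FIELDS):
    """Create a DataFrame with an explicit schema instead of an inferred one.

    Assumption: sql_context is a SQLContext (or SparkSession) from Spark 2.0+.
    """
    return sql_context.createDataFrame(rows, schema=schema_string(fields))
```

It could be used as `people_dataframe(sqlContext, data_rows).printSchema()` to check that the declared types were applied.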

## <font color=#AA1B5A>Creating a Data Frame from CSV file

## <font color=#F01B5A>We will read our Orange Churn dataset 

In [None]:
CV_data = sqlContext.read.load('/resources/data/MSTC/churn-bigml-80.csv', 
                          format='com.databricks.spark.csv', 
                          header='true', 
                          inferSchema='true')


In [None]:
type(CV_data)

In [None]:
CV_data.count()

### Spark SQL schema

As before, Spark SQL needs a schema for our data. Here it was inferred from the CSV file.

In [None]:
CV_data.printSchema()

## COLUMNS?

## <font color=#F81B5A>...worth mentioning PARQUET

![Parquet](https://parquet.apache.org/assets/img/parquet_logo.png)
https://parquet.apache.org/

### Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

In [None]:
CV_data.columns

In [None]:
CV_data.head(5)

## <font color=#AA1B5A> In Python, you can also convert freely between Pandas DataFrame and Spark DataFrame</font>

In [None]:
import pandas as pd

In [None]:
pd.DataFrame(CV_data.take(5), columns=CV_data.columns)

## or... 

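<font color=red size=6>BUT discuss this in terms of efficiency: `toPandas()` collects the whole DataFrame on the driver before `head(5)` trims it, whereas `take(5)` fetches only five rows.</font>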

In [None]:
CV_data.toPandas().head(5)
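
One way to keep the convenience of `toPandas()` without collecting every row is to limit the DataFrame first. This is a minimal sketch; `preview` is a hypothetical helper name, not a Spark function.

```python
def preview(df, n=5):
    """Return the first n rows of a Spark DataFrame as a pandas DataFrame.

    limit(n) is applied on the Spark side, so only n rows are sent to the
    driver, instead of the whole dataset as with df.toPandas().head(n).
    """
    return df.limit(n).toPandas()
```

For instance, `preview(CV_data)` should display the same kind of table as the cell above while moving far less data.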

## Spark DataFrames include some built-in functions, for example summary statistics

## `describe`:
* ### get the summary statistics (count, mean, standard deviation, min, max) of the numerical columns in a DataFrame


In [None]:
CV_data.describe().show()

In [None]:
CV_data.describe().toPandas().transpose()

## <font color=#F81B5A>Methods on Data Frames feel very SQL-like:
http://www.cs.sfu.ca/CourseCentral/732/ggbaker/content/spark-sql.html

In [None]:
CV_data.select('Customer service calls','Churn').toPandas().head(5)

## Number of distinct states in `CV_data`?

In [None]:
CV_data.select('State').distinct().count()

## Crosstab: contingency table

In [None]:
CV_data.crosstab('State', 'Churn').show()

### Filter and count

In [None]:
CV_data.filter(CV_data['Customer service calls'] > 3).count()

## `groupby`:
* ### How to find Churn vs no_Churn cases?

In [None]:
# show() prints the table and returns None, so there is nothing useful to assign
CV_data.groupby('Churn').count().show()

* ### <font color=#F81BA0 size=5>TO DO:</font>

<font color=#F81B5A size=5>How to find the mean of 'Customer service calls' in the Churn vs no_Churn groups of `CV_data`?

In [None]:
CV_data.groupby('Churn').agg({'Customer service calls': 'mean'}).show()

<font color=#F81B5A size=5>And the mean of 'Total day minutes' and 'Customer service calls' for each State in `CV_data`?

In [None]:
CV_data.groupby('State').agg({'Total day minutes': 'mean', 'Customer service calls': 'mean'}).toPandas()

# <font color=#F81B5A>SQL Syntax

## There is also a `sqlContext.sql` function (`spark.sql` on a `SparkSession`) that lets you do the same things with SQL query syntax.

### Apply SQL Queries on DataFrame

* ### <font color=brown>To apply SQL queries to a DataFrame, we first need to register it as a table. Let's register the `CV_data` DataFrame as a table.

In [None]:
CV_data.registerTempTable('CV_data_table')

In [None]:
Day_min = sqlContext.sql("""
    SELECT State, MEAN(`Total day minutes`), MEAN(`Customer service calls`) 
    FROM CV_data_table GROUP BY State
""")

In [None]:
Day_min.toPandas()
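
The backticks in the query above are needed because the column names contain spaces. As a sketch of how such a query could be assembled for any list of columns, the hypothetical helper `mean_by_query` below quotes each name and gives every mean an explicit alias. The `mean_` alias prefix is an arbitrary choice made for this example.

```python
def mean_by_query(table, key, columns):
    """Build a GROUP BY query that averages the given columns per key.

    Column names are wrapped in backticks so that names containing spaces
    (e.g. 'Total day minutes') are valid in Spark SQL.
    """
    means = ', '.join(
        'MEAN(`{0}`) AS `mean_{1}`'.format(col, col.replace(' ', '_'))
        for col in columns
    )
    return 'SELECT `{0}`, {1} FROM {2} GROUP BY `{0}`'.format(key, means, table)


query = mean_by_query(
    'CV_data_table', 'State', ['Total day minutes', 'Customer service calls']
)
```

The resulting string can then be passed to `sqlContext.sql(query)` exactly like the hand-written query.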

### <font color=red>...NOW order the result by average day minutes, descending

In [None]:
Day_min = sqlContext.sql("""
    SELECT State, MEAN(`Total day minutes`) as average_DayMin, MEAN(`Customer service calls`) 
    FROM CV_data_table GROUP BY State order by average_DayMin desc
""")

In [None]:
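# pass the column names explicitly; otherwise pandas labels the columns 0, 1, 2
pd.DataFrame(Day_min.take(5), columns=Day_min.columns)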

## <font color=#F81B5A>... same as before but using SQL-like methods:

In [None]:
import pyspark.sql.functions as fn 

Day_min2=CV_data.groupby('State').agg(fn.mean('Total day minutes').alias("average_DayMin")
                            , fn.mean('Customer service calls')) \
                            .orderBy(fn.desc("average_DayMin"))

In [None]:
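# pass the column names explicitly; otherwise pandas labels the columns 0, 1, 2
pd.DataFrame(Day_min2.take(5), columns=Day_min2.columns)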

### <font color=brown>UDFs: We can register a user-defined function (UDF) from Python

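<font color=red size=6>BUT AGAIN discuss this in terms of efficiency: a Python UDF ships every row between the JVM and a Python worker, so it is usually slower than Spark's built-in column functions.</font>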

In [None]:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import UserDefinedFunction

binary_map = {'Yes':1.0, 'No':0.0, 'True':1.0, 'False':0.0}

toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())
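
The lambda above raises a `KeyError` for any value that is missing from `binary_map`, for example if a flag column was inferred as boolean instead of string. A more defensive variant is sketched below, assuming the flag columns hold yes/no or true/false values. `to_num` and `flags_to_double` are hypothetical names, and the Spark imports are kept inside the function so that the plain-Python mapping can be tested on its own.

```python
def to_num(value):
    """Map yes/no or true/false flags (str or bool) to 1.0 / 0.0."""
    text = str(value).strip().lower()
    if text in ('yes', 'true'):
        return 1.0
    if text in ('no', 'false'):
        return 0.0
    return None                      # unknown values become null in Spark


def flags_to_double(df, columns):
    """Return a new DataFrame with the given flag columns converted to doubles."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    to_num_udf = udf(to_num, DoubleType())
    for name in columns:
        df = df.withColumn(name, to_num_udf(df[name]))
    return df
```

A call such as `flags_to_double(CV_data, ['Churn', 'International plan', 'Voice mail plan'])` would play the same role as the chained `withColumn` calls in this notebook.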

In [None]:
pd.DataFrame(CV_data.take(5), columns=CV_data.columns)

In [None]:
CV_data = CV_data.withColumn('Churn', toNum(CV_data['Churn'])) \
    .withColumn('International plan', toNum(CV_data['International plan'])) \
    .withColumn('Voice mail plan', toNum(CV_data['Voice mail plan']))

### <font color=red>...NOTE that DataFrames are immutable: `withColumn` returns a NEW DataFrame, so you MUST assign the result (CV_data = ...)

In [None]:
# drop() is a no-op when the column does not exist; no 'Voice mail plan2' column was created above
CV_data=CV_data.drop('Voice mail plan2')

In [None]:
CV_data.columns

In [None]:
pd.DataFrame(CV_data.take(5), columns=CV_data.columns)

## `sample`:
* ### How to create a sample DataFrame from the base DataFrame?

The `sample` method returns a new DataFrame containing a random sample of the base DataFrame. It takes 3 parameters:

* `withReplacement`: True or False, to select an observation with or without replacement.
* `fraction`: x, where x = .5 means that we want roughly 50% of the data in the sample DataFrame.
* `seed`: to reproduce the result.

Let's create a DataFrame t1 from `CV_data` holding a 50% sample, and count its number of rows.

In [None]:
t1 = CV_data.sample(False, 0.5, 42)

In [None]:
t1.count()
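
The count above will generally not be exactly half of the rows: when sampling without replacement, Spark keeps each row independently with probability `fraction`, so the sample size is only approximately `fraction` times the total. The plain-Python sketch below imitates that idea (it is an illustration, not Spark's implementation) and shows why a fixed seed makes the result reproducible. `bernoulli_sample` is a made-up name.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)        # seeded generator, so results repeat
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000))
sample_a = bernoulli_sample(rows, 0.5, seed=42)
sample_b = bernoulli_sample(rows, 0.5, seed=42)   # same seed, same sample
```

Here `len(sample_a)` is close to 500 but need not equal it, and `sample_a == sample_b` because both runs use the same seed.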

## `map`: apply a map operation on DataFrame columns

We can apply a function to each row of a DataFrame using the map operation (through the DataFrame's underlying RDD, `df.rdd.map`, in Spark 2.0 and later). The result is an RDD, not a DataFrame. For example, we can map the `State` column of `CV_data` to pairs (x, 1) with a lambda function and print the first 5 elements of the mapped RDD.
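
A minimal sketch of such a map, assuming the `State` column of `CV_data` is the one being mapped. `state_to_pair` and `first_pairs` are hypothetical helper names. Going through `df.rdd` is needed in Spark 2.0 and later, where the DataFrame itself no longer offers `map`.

```python
def state_to_pair(row):
    """Turn one row into a (State, 1) pair, the usual first step of a count."""
    return (row['State'], 1)


def first_pairs(df, n=5):
    """Map every row of the DataFrame's RDD to a pair and return the first n."""
    return df.rdd.map(state_to_pair).take(n)
```

Calling `first_pairs(CV_data)` returns a plain Python list of tuples, since `take` is an action that brings the results back to the driver.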

## SEE NEXT notebook: the typical Map-Reduce Word Count example

https://www.youtube.com/watch?v=V6DkTVvy9vk
https://www.youtube.com/watch?v=vfiJQ7wg81Y