In [1]:
displayHTML("<font size=8 color='green'>Introduction to Spark Data Frames and SQL using PySpark</font>")

### [MSTC](http://mstc.ssr.upm.es/big-data-track) and MUIT:

## Sources:
* [Databricks: introduction-to-dataframes-python](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html)
* [Introduction to Spark with Python, by Jose A. Dianes](http://jadianes.github.io/spark-py-notebooks)
* [Complete Guide on DataFrame Operations in PySpark](https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/)
* [Understanding-DataFrames](https://github.com/awantik/pyspark-tutorial/wiki/Understanding-DataFrames)
* [From Pandas to Spark Dataframes](https://github.com/awantik/pyspark-tutorial/wiki/Migrating-from-Pandas-to-Apache-Spark%E2%80%99s-DataFrame)
* [Also ML](https://www.analyticsvidhya.com/blog/2016/09/comprehensive-introduction-to-apache-spark-rdds-dataframes-using-pyspark/)

## This notebook will introduce Spark capabilities to deal with data in a structured way.
* ### Basically, everything revolves around the concept of the *DataFrame* and using *SQL* to query it.

## In Apache Spark, a DataFrame is a **distributed collection of rows under named columns**.
- ### In simple terms, it is the same as a table in a relational database or an Excel sheet with column headers.

## It also shares some common characteristics with RDD:<br>

*    **Immutable** in nature: once created, a DataFrame / RDD cannot be changed. Applying a transformation returns a new DataFrame / RDD instead.
*    **Lazy evaluation**: a task is not executed until an action is performed.
*    **Distributed**: RDD and DataFrame both are distributed in nature.

### PERFORMANCE:

![How to create a DataFrame](https://camo.githubusercontent.com/cc93c064c6fd754df0209d42ec054998edd81fa0/68747470733a2f2f7777772e736166617269626f6f6b736f6e6c696e652e636f6d2f6c6962726172792f766965772f6c6561726e696e672d7079737061726b2f393738313738363436333730382f67726170686963732f4230353739335f30335f30332e6a7067)

## How to create a DataFrame?
 
 ![How to create a DataFrame](https://www.analyticsvidhya.com/wp-content/uploads/2016/10/DataFrame-in-Spark.png)

* ### A Spark `DataFrame` is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Pandas. DataFrames can be constructed from a wide array of sources, such as an existing RDD, as in our case.

## <font color=#AA1B5A> DataFrame: an RDD of Row objects</font>

From: http://www.cs.sfu.ca/CourseCentral/732/ggbaker/content/spark-sql.html

### Think of a DataFrame being implemented with an RDD of Row objects.
- ### Row is a generic row object with an ordered collection of fields
- ### Nicest way to create Rows: create a custom subclass for your data

In [12]:
from pyspark.sql import Row

NameAge = Row('fname lname', 'age') # build a Row subclass

user1 = NameAge('John Smith', 47)
user2 = NameAge('Jane Smith', 22)
user3 = NameAge('Frank Jones', 28)

data_rows = [ user1, user2, user3 ]

print(data_rows)

In [13]:
df1 = spark.createDataFrame(data_rows)

df1.show()

In [14]:
# Databricks DISPLAY
display(df1)

## TO DO: create another DataFrame df2 with the same users but with their weights:

fname lname|  weight

- 'John Smith' 80.5
- 'Jane Smith' 62.3
- 'Frank Jones' 71.5

In [16]:
NameWeight = Row('fname lname', 'weight') # build a Row subclass

df2 =  spark.createDataFrame([NameWeight('John Smith', 80.5), 
                              NameWeight('Jane Smith', 62.3),
                              NameWeight('Frank Jones', 71.5)])

display(df2)

## TO DO: Join both DataFrames into df

In [18]:
df = df1.join(df2, "fname lname")

display(df)

## We can apply functions to Columns using `pyspark.sql.functions` or our own User-Defined Functions (UDFs)

### for example:

- 1.- `select(*cols)`: projects a set of expressions and returns a new DataFrame.<br>
- 2.- Apply the `split` function to the "fname lname" column to separate fname and lname.
- 3.- `alias` returns the column aliased with a new name or names (in the case of expressions that return more than one column, such as `explode`).

In [20]:
import pyspark.sql.functions as f

df_new= df.select(f.split(df['fname lname'],' ').alias('sep names'))

df_new.show()

- ## `explode(col)`: this function returns a new row for each element in the given array or map.

In [22]:
import pyspark.sql.functions as f

df_new = df.select(f.explode(f.split(df['fname lname'],' ')).alias('all'))

df_new.show()

# Creating a Data Frame from CSV file

## <font color=#F01B5A>We will read our Orange Churn dataset

In [25]:
# File location and type
file_location = "/FileStore/tables/churn_bigml_80-bf1a8.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

In [26]:
type(df)

In [27]:
df.printSchema()

In [28]:
display(df.describe())

In [29]:
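# Add a helper column where 'Yes' is replaced by '1' (it is still a string column);
# the original 'Voice mail plan' column is kept unchanged for the UDF step below
df = df.withColumn('Voice mail plan2', f.regexp_replace(df['Voice mail plan'],'Yes','1'))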

In [30]:
display(df)

In [31]:
df.count()

In [32]:
df.columns

## `groupby`:
* ### How to find Churn vs no_Churn cases?

In [34]:
df.groupby('Churn').count().show()

In [35]:
df.crosstab('State', 'Churn').show()

In [36]:
dc=df.groupBy("State").agg(f.count("Churn").alias('Num Churn'))

In [37]:
dc.show()

## Use `filter()` to return the rows that match a predicate

In [39]:
filterDF = df.filter( df.State == "CA" )
#filterDF = df.filter( (df.State == "CA") & (df.Churn == 'False') )
#filterDF = df.filter( (df.State == "CA") & (df['Total day calls'] >  90) )

display(filterDF)

In [40]:
filterDF.count()

In [41]:
countDistinctDF = df.select("State", "Churn")\
  .groupBy("State")\
  .agg(f.countDistinct("Churn"))

In [42]:
countDistinctDF.show()

# Spark SQL schema

## To use Spark SQL, our data needs a schema.

In [45]:
df.printSchema()

## COLUMNS?

## <font color=#F81B5A>...worth mentioning PARQUET

![Parquet](https://parquet.apache.org/assets/img/parquet_logo.png)
https://parquet.apache.org/

### Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

## Before SQL: note that you can also convert freely between Pandas DataFrames and Spark DataFrames</font>

In [48]:
import pandas as pd

In [49]:
pd.DataFrame(df.take(5), columns=df.columns)

## or...

In [51]:
df.toPandas().head(5)

In [52]:
df.groupby('Churn').agg({'Customer service calls': 'mean'}).show()

### <font color=#F81BA0 size=5>TO DO:</font>

- ### How to find the mean of 'Customer service calls' in every state

In [54]:
df.groupby('State').agg({'Total day minutes': 'mean', 'Customer service calls': 'mean'}).toPandas()

In [55]:
df.groupby('State').agg({'Total day minutes': 'mean', 'Customer service calls': 'mean'}).toPandas()

# <font color=#F81B5A>SQL Syntax

## There is also a `spark.sql` function that lets you do the same things with SQL query syntax.

### Apply SQL Queries on DataFrame

* ### <font color=brown>To apply SQL queries to a DataFrame, we first need to register it as a table. Let’s register our DataFrame `df` as the table `df_table`.</font>

In [58]:
# createOrReplaceTempView replaces registerTempTable, which is deprecated since Spark 2.0
df.createOrReplaceTempView('df_table')

In [59]:
Mean_DayMin_ServiceCalls = spark.sql("""
    SELECT State, MEAN(`Total day minutes`), MEAN(`Customer service calls`) 
    FROM df_table GROUP BY State
""")

In [60]:
type(Mean_DayMin_ServiceCalls)

In [61]:
Mean_DayMin_ServiceCalls.show()

In [62]:
Mean_DayMin_ServiceCalls.toPandas()

### <font color=red>...NOW order by average Day Minutes, descending</font>

In [64]:
Day_min = spark.sql("""
    SELECT State, MEAN(`Total day minutes`) as average_DayMin, MEAN(`Customer service calls`) 
    FROM df_table GROUP BY State order by average_DayMin desc
""")

In [65]:
pd.DataFrame(Day_min.take(5))

## <font color=#F81B5A>... same as before but using SQL-like methods:

In [67]:
import pyspark.sql.functions as f

Day_min2=df.groupby('State').agg(f.mean('Total day minutes').alias("average_DayMin")
                            , f.mean('Customer service calls')) \
                            .orderBy(f.desc("average_DayMin"))

In [68]:
pd.DataFrame(Day_min2.take(5))

### <font color=brown>UDFs: we can register a user-defined function (UDF) from Python</font>

In [70]:
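from pyspark.sql.types import DoubleType
from pyspark.sql.functions import udf

# Map the textual yes/no and true/false values to 1.0 / 0.0
binary_map = {'Yes':1.0, 'No':0.0, 'True':1.0, 'False':0.0}

# str(k) also covers the case where inferSchema read 'Churn' as a boolean;
# values that are not in the map become null instead of raising KeyError
toNum = udf(lambda k: binary_map.get(str(k)), DoubleType())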

In [71]:
pd.DataFrame(df.take(5), columns=df.columns)

In [72]:
df = df.withColumn('Churn', toNum(df['Churn'])) \
    .withColumn('International plan', toNum(df['International plan'])) \
    .withColumn('Voice mail plan', toNum(df['Voice mail plan']))
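
Because a UDF wraps an ordinary Python function, the mapping logic can be checked without Spark before it is applied to a column. The sketch below is a variant under stated assumptions: the helper name `to_num` is hypothetical, it uses `str(value)` so that it also accepts Python booleans (which is what the UDF would receive if `inferSchema` typed `Churn` as boolean), and it returns `None` for unknown values rather than raising `KeyError`.

```python
binary_map = {'Yes': 1.0, 'No': 0.0, 'True': 1.0, 'False': 0.0}

def to_num(value):
    """Map yes/no and true/false values to 1.0 / 0.0 (hypothetical helper)."""
    # str() lets the same lookup handle both 'True' (string) and True (bool);
    # unknown values and None map to None, which Spark stores as null
    return binary_map.get(str(value))

print(to_num('Yes'), to_num('No'), to_num(True), to_num('maybe'))
```

Such a function could then be wrapped with `udf(to_num, DoubleType())` from `pyspark.sql.functions` and used in `withColumn` in the same way as `toNum`.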

### <font color=red>...NOTE that you MUST assign the result (df = ...) to a variable: DataFrames are immutable, so `withColumn` returns a NEW DataFrame</font>

In [74]:
df = df.drop('Voice mail plan2')

In [75]:
df.columns

In [76]:
pd.DataFrame(df.take(5), columns=df.columns)

## `sample`:
- ###   How to create a sample DataFrame from the base DataFrame?

### The `sample` method on a DataFrame returns a new DataFrame containing a sample of the base DataFrame. It takes 3 parameters:

- ### `withReplacement` = True or False, to select observations with or without replacement; `fraction` = x, where x = .5 means that we want roughly 50% of the data in the sample DataFrame; `seed`, to reproduce the result.

### Let’s create a DataFrame t1 with a 50% sample of df and count its rows.

In [78]:
t1 = df.sample(withReplacement=False, fraction=0.5, seed=42)

In [79]:
t1.count()

## `map`: apply a map operation on DataFrame columns

We can apply a function to each row of a DataFrame using a map operation. In PySpark the map is applied to the DataFrame's underlying RDD (`df.rdd`), so the result comes back as an RDD. For example, we can map a column such as `State` to (x, 1) pairs with a lambda function and print the first 5 elements of the mapped RDD.

## RETURN TO: Notebook with Word Count Example