## Pandas Dataframes - Spark Dataframes - Spark RDDs

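Spark DataFrames are built on top of Spark RDDs. Together with Pandas DataFrames, these are three distinct data types that can be converted into one another.

Be aware that Spark DataFrames, unlike Pandas DataFrames, are distributed and therefore suitable for cluster processing. Pandas keeps all data in the memory of a single machine, so it soon reaches its limits with large amounts of data.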

### Introduction

- RDDs can contain any kind of unstructured data
- Spark DataFrames are Datasets of Row objects. These Row objects contain structured data (i.e. they have a schema: names and types)
- Big advantages:
    - You can run SQL queries on DataFrames
    - You can read from and write to JSON, Hive, and Parquet
    - You can communicate with JDBC/ODBC

In [0]:
# pandas and spark
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()



In [0]:
# STEP 1: RUN THIS CELL TO INSTALL BAMBOOLIB

# You can also install bamboolib on the cluster. Just talk to your cluster admin for that
%pip install bamboolib  

# Heads up: this will restart your python kernel, so you may need to re-execute some of your other code cells.

Python interpreter will be restarted.
Collecting bamboolib
  Downloading bamboolib-1.30.19-py3-none-any.whl (2.8 MB)
Collecting ipyslickgrid==0.0.3
  Downloading ipyslickgrid-0.0.3.tar.gz (51.4 MB)
Collecting toml>=0.10.0
  Downloading toml-0.10.2-py2.py3-none-any.whl (16 kB)
Collecting ppscore<2.0.0,>=1.2.0
  Downloading ppscore-1.3.0.tar.gz (17 kB)
Collecting xlrd>=1.0.0
  Downloading xlrd-2.0.1-py2.py3-none-any.whl (96 kB)
Building wheels for collected packages: ipyslickgrid, ppscore
  Building wheel for ipyslickgrid (setup.py): started
  Building wheel for ipyslickgrid (setup.py): finished with status 'done'
  Created wheel for ipyslickgrid: filename=ipyslickgrid-0.0.3-py2.py3-none-any.whl size=1823285 sha256=950c03f481f0c4a70e7932108a8072e6c2b8fea13e6d2bcdc97e2b0e65feb2ec
  Stored in directory: /root/.cache/pip/wheels/5b/66/e0/f1d70e1b3787f5c8455f962245613a388d9a6e2373c15383ea
  Building wheel for ppscore (setup.py): started
  Building wheel for ppscore (setup.py): finished with s

In [0]:
# STEP 2: RUN THIS CELL TO IMPORT AND USE BAMBOOLIB

import bamboolib as bam

# This opens a UI from which you can import your data
bam  

# Already have a pandas data frame? Just display it!
# Here's an example
# import pandas as pd
# df_test = pd.DataFrame(dict(a=[1,2]))
# df_test  # <- You will see a green button above the data set if you display it

### create a Pandas Dataframe

In [0]:
pdf = pd.DataFrame([("foo", 1), ("bar", 2)], columns=("k", "v"))
pdf

     k  v
0  foo  1
1  bar  2


### convert Pandas Dataframe to Spark Dataframe
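The schema (column names and types) is an essential part of a Spark DataFrame.

When a Spark DataFrame is created from a Pandas DataFrame, Spark infers the schema automatically from the Pandas column names and dtypes.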

In [0]:
sdf = spark.createDataFrame(pdf)
sdf

Out[3]: DataFrame[k: string, v: bigint]

In [0]:
sdf.show()
sdf.schema

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|bar|  2|
+---+---+

Out[4]: StructType([StructField('k', StringType(), True), StructField('v', LongType(), True)])

In [0]:
sdf.printSchema()

root
 |-- k: string (nullable = true)
 |-- v: long (nullable = true)



### convert Spark Dataframe to RDD

In [0]:
rdd1 = sdf.rdd
rdd1

Out[6]: MapPartitionsRDD[4] at javaToPython at NativeMethodAccessorImpl.java:0

In [0]:
rdd1.collect()

Out[7]: [Row(k='foo', v=1), Row(k='bar', v=2)]

Apply a map transformation to the RDD. Watch the schema: if the function returns plain tuples instead of Row objects, the column names are lost.

In [0]:
# append "s" to k; the result is a plain tuple, so the field names are dropped
rdd2 = rdd1.map(lambda x: (x.k + "s", x.v))

In [0]:
rdd2.collect()

Out[9]: [('foos', 1), ('bars', 2)]

### convert RDD back to Spark Dataframe
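Pass the schema explicitly if it was lost during the RDD transformation. Here the schema of the original Spark DataFrame is reused.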

In [0]:
sdf2 = spark.createDataFrame(rdd2, sdf.schema)
sdf2.show()

+----+---+
|   k|  v|
+----+---+
|foos|  1|
|bars|  2|
+----+---+



In [0]:
sdf2.schema

Out[11]: StructType([StructField('k', StringType(), True), StructField('v', LongType(), True)])

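### convert Spark Dataframe back to Pandas Dataframe
`toPandas()` collects all rows on the driver, so use it only when the result fits into the driver's memory.

In [0]:
pdf2 = sdf2.toPandas()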

In [0]:
pdf2

      k  v
0  foos  1
1  bars  2
