Goal: access the data in a Spark DataFrame.

There are several ways to access the data in a Spark DataFrame. This notebook shows several of them in three programming languages: Python, Scala, and Java.

Scala and Java do not run in this Jupyter (Python) notebook, so the Scala and Java cells show code and output copied from IntelliJ instead.

Ref:
- [1] https://x1.inkenkun.com/archives/1114#SELECT
- [2] http://mogile.web.fc2.com/spark/spark210/sql-programming-guide.html

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
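
The setup above depends on JAVA_HOME and SPARK_HOME pointing at real directories. As a minimal sketch (not part of the original notebook), the check below reports which of the two variables are unset or point at a missing directory. The helper name missing_spark_env and its parameters are hypothetical.

```python
import os


def missing_spark_env(names=("JAVA_HOME", "SPARK_HOME"), environ=None):
    """Return the variables that are unset or do not point at a directory."""
    environ = os.environ if environ is None else environ
    problems = []
    for name in names:
        path = environ.get(name)
        # An empty value and a path that is not a directory are both reported.
        if not path or not os.path.isdir(path):
            problems.append(name)
    return problems


problems = missing_spark_env()
if problems:
    print("Check these variables before calling findspark.init():",
          ", ".join(problems))
```

Running it right after the cell above, and before findspark.init(), gives an early hint if the download or the unpacking step failed.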

In [10]:
!git clone https://github.com/damiannolan/iris-neural-network.git

Cloning into 'iris-neural-network'...
remote: Enumerating objects: 77, done.
remote: Total 77 (delta 0), reused 0 (delta 0), pack-reused 77
Unpacking objects: 100% (77/77), done.


In [0]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession\
         .builder\
         .master("local[*]")\
         .appName("sample_app")\
         .enableHiveSupport()\
         .getOrCreate()


In [0]:
df_py = spark.read.csv("iris-neural-network/iris-data-set.csv", header=True)

In [0]:
# Register the DataFrame as a temporary view named df_py_SQL so it can be queried with SQL
df_py.createOrReplaceTempView("df_py_SQL")

In [48]:
df_py.show(5)
print("=============================================")
spark.sql("SELECT * FROM df_py_SQL LIMIT 5").show()

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 5 rows

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
+------------+-----------+------------+-----------+-------+

In [45]:
df_py.select("*").show(5)

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 5 rows



In [69]:
from pyspark.sql import functions as F

df_py.select(F.col("sepal_length")).show(5) # Spark function approach
df_py.select(df_py["sepal_length"], df_py[0]).show(5) # Python indexing: by column name and by position

+------------+
|sepal_length|
+------------+
|         5.1|
|         4.9|
|         4.7|
|         4.6|
|         5.0|
+------------+
only showing top 5 rows

+------------+------------+
|sepal_length|sepal_length|
+------------+------------+
|         5.1|         5.1|
|         4.9|         4.9|
|         4.7|         4.7|
|         4.6|         4.6|
|         5.0|         5.0|
+------------+------------+
only showing top 5 rows
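
The selections above can also be expressed as SQL against the df_py_SQL view registered earlier. Below is a minimal sketch (not part of the original notebook) of a small helper that builds such a query from (column, alias) pairs. The helper name build_select and its parameters are hypothetical; only spark.sql and the view name come from the cells above.

```python
import re

# Plain SQL identifiers only, so nothing needs quoting or escaping.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(name):
    if not _IDENTIFIER.match(name):
        raise ValueError(f"not a plain identifier: {name!r}")
    return name


def build_select(view, columns, limit=None):
    """Build a SELECT statement from (column, alias) pairs.

    alias may be None, in which case the column keeps its own name.
    """
    if not columns:
        raise ValueError("at least one column is required")
    parts = []
    for column, alias in columns:
        _check(column)
        if alias is None:
            parts.append(column)
        else:
            parts.append(f"{column} AS {_check(alias)}")
    query = f"SELECT {', '.join(parts)} FROM {_check(view)}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


query = build_select(
    "df_py_SQL",
    [("sepal_length", None), ("sepal_length", "column_one")],
    limit=5,
)
print(query)
# With the session from the cells above: spark.sql(query).show()
```

The DataFrame counterpart of the aliased column is F.col("sepal_length").alias("column_one").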



In the Scala and Java examples the column names are SepalLength and so on, because those examples read a different iris CSV file. This difference does not matter for the purpose of this notebook.
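
To reproduce the Scala and Java column names from the Python DataFrame, the columns can be renamed. This is a minimal sketch (not part of the original notebook). The helper name snake_to_pascal is hypothetical, and the notebook only shows the name SepalLength, so treating every column as PascalCase is an assumption.

```python
def snake_to_pascal(name):
    """Convert a snake_case name such as 'sepal_length' to 'SepalLength'."""
    return "".join(part.capitalize() for part in name.split("_") if part)


# Column names as they appear in the Python output above.
python_columns = ["sepal_length", "sepal_width", "petal_length",
                  "petal_width", "species"]

# Assumed PascalCase equivalents; only SepalLength is confirmed by the notebook.
scala_columns = [snake_to_pascal(c) for c in python_columns]
print(scala_columns)
# With the DataFrame from the cells above: df_py.toDF(*scala_columns)
```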

In [0]:
// Scala (run in IntelliJ, not in this notebook)
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

import org.apache.spark.sql.functions.column
import org.apache.spark.sql.functions.col

val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("sample_app")
    .getOrCreate()

import spark.implicits._  // enables the $"..." and 'symbol column syntax used below

val df_scala = spark.read.format("csv")
    .option("header", "true")
    .load("C:/Users/CU - teminal/Desktop/Spark/iris.csv")

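// Several ways to refer to the same column in a Scala Spark DataFrame
df_scala.select($"SepalLength",        // string interpolator from spark.implicits._
                'SepalLength,          // Scala symbol, also from spark.implicits._
                col("SepalLength"),    // org.apache.spark.sql.functions.col
                column("SepalLength"), // org.apache.spark.sql.functions.column
                'SepalLength as 'column_one) // rename with AS
        .show(5)


/*
Roughly equivalent to the following SQL, assuming df_scala has first been
registered with df_scala.createOrReplaceTempView("df_scala_SQL"):

    spark.sql("SELECT SepalLength, " +
      "SepalLength AS column_one " +
      "FROM df_scala_SQL " +
      "LIMIT 5").show()
*/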

// +-----------+-----------+-----------+-----------+----------+
// |SepalLength|SepalLength|SepalLength|SepalLength|column_one|
// +-----------+-----------+-----------+-----------+----------+
// |        5.1|        5.1|        5.1|        5.1|       5.1|
// |        4.9|        4.9|        4.9|        4.9|       4.9|
// |        4.7|        4.7|        4.7|        4.7|       4.7|
// |        4.6|        4.6|        4.6|        4.6|       4.6|
// |        5.0|        5.0|        5.0|        5.0|       5.0|
// +-----------+-----------+-----------+-----------+----------+


In [0]:
// Java (run in IntelliJ, not in this notebook)

import org.apache.spark.sql.SparkSession;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.SparkConf;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.column;

SparkSession spark = SparkSession
                .builder()
                .master("local[*]")
                .appName("sample_app")
                .getOrCreate();

Dataset<Row> df_java = spark.read()
        .format("csv")
        .option("header", "true")
        .load("C:/Users/CU - teminal/Desktop/Spark/iris.csv");

df_java.select(col("SepalLength"),
               column("SepalLength"))
        .show(5);
    
// +-----------+-----------+
// |SepalLength|SepalLength|
// +-----------+-----------+
// |        5.1|        5.1|
// |        4.9|        4.9|
// |        4.7|        4.7|
// |        4.6|        4.6|
// |        5.0|        5.0|
// +-----------+-----------+