## Lab Prerequisites
1) The MySQL connector JAR must be on Spark's classpath
> You can do this by creating a symbolic link to the MySQL JAR in SPARK_HOME/jars
>> `~$` ln -s /usr/share/java/mysql-connector-java-5.1.45.jar /home/talentum/spark/jars/mysql-connector-java.jar
Ref - https://www.cyberciti.biz/faq/creating-soft-link-or-symbolic-link/

2) cd to ~/test-jupyter/test/ on your Apache sandbox

3) test`$` cp salaries.txt /tmp

4) test`$` mysql -u bigdata -p
> Enter the password Bigdata@123 when prompted

5) mysql> create database test;

6) mysql> use test;

7) mysql> drop table if exists salaries;

8) mysql> create table salaries (
gender varchar(1),
age int,
salary double,
zipcode int);

9) mysql> load data local infile '/tmp/salaries.txt' into table salaries fields terminated by ',';

10) mysql> alter table salaries add column `id` int(10) unsigned primary key auto_increment;

11) Quit MySQL
> mysql> quit;
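
A broken symbolic link from step 1 typically surfaces only later, as a driver class-not-found error when Spark first opens the JDBC connection. The sketch below is one possible way to check from Python that a connector JAR in SPARK_HOME/jars resolves to a real file. The helper name `find_connector_jars` and the `mysql-connector*.jar` pattern are assumptions made for illustration; they are not part of the lab.

```python
import glob
import os


def find_connector_jars(spark_home):
    """Return MySQL connector JARs under <spark_home>/jars that resolve to a real file."""
    pattern = os.path.join(spark_home, "jars", "mysql-connector*.jar")
    # os.path.isfile follows symbolic links, so a dangling link is filtered out.
    return sorted(path for path in glob.glob(pattern) if os.path.isfile(path))


# Falls back to the lab's SPARK_HOME when the variable is not set yet.
jars = find_connector_jars(os.environ.get("SPARK_HOME", "/home/talentum/spark"))
print(jars if jars else "No usable MySQL connector JAR found in SPARK_HOME/jars")
```

An empty result suggests the link is missing or points at a file that does not exist.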

In [1]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6" 
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List whichever packages you need here (comma-separated Maven coordinates).
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
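
The `--packages` value above is a comma-separated list of Maven coordinates followed by `pyspark-shell`. As a minimal sketch, the string can be assembled from a Python list so that packages are easier to add or remove. The helper name `build_submit_args` is hypothetical, and the variable has to be set before the SparkSession is created for it to take effect.

```python
import os


def build_submit_args(packages):
    """Join Maven coordinates into a PYSPARK_SUBMIT_ARGS value."""
    if not packages:
        return "pyspark-shell"
    return "--packages " + ",".join(packages) + " pyspark-shell"


# The same two packages as in the cell above.
packages = [
    "com.databricks:spark-xml_2.11:0.6.0",
    "org.apache.spark:spark-avro_2.11:2.4.3",
]
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(packages)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```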

In [2]:
# Entry point for Spark 2.x
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").getOrCreate()

# On yarn:
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()
# specify .master("yarn")

sc = spark.sparkContext

## PySpark working with MySQL


In [3]:
url = "jdbc:mysql://127.0.0.1:3306/test?useSSL=false&allowPublicKeyRetrieval=true"
driver = "com.mysql.jdbc.Driver"
user = "bigdata"
password = "Bigdata@123"

# https://youtu.be/ray3YvnIohM

In [4]:
df =  spark.read\
    .format("jdbc")\
    .option("driver", driver)\
    .option("url", url)\
    .option("user", user)\
    .option("password", password)\
    .option("dbtable", "salaries")\
    .load()
df.count()

50

In [5]:
df =  spark.read\
    .format("jdbc")\
    .option("driver", driver)\
    .option("url", url)\
    .option("user", user)\
    .option("password", password)\
    .option("dbtable", "salaries")\
    .load()
df.show(5)

+------+---+-------+-------+---+
|gender|age| salary|zipcode| id|
+------+---+-------+-------+---+
|     F| 66|41000.0|  95103|  1|
|     M| 40|76000.0|  95102|  2|
|     F| 58|95000.0|  95103|  3|
|     F| 68|60000.0|  95105|  4|
|     M| 85|14000.0|  95102|  5|
+------+---+-------+-------+---+
only showing top 5 rows



## Reading from a Database in Parallel

When reading a large table, we would like to read it in parallel, which can dramatically improve read performance. We can pass the `numPartitions` option to Spark's JDBC reader; it sets the maximum number of partitions, and therefore the maximum number of concurrent JDBC connections, used for the read.

In [9]:
df =  spark.read\
    .format("jdbc")\
    .option("driver", driver)\
    .option("url", url)\
    .option("user", user)\
    .option("password", password)\
    .option("dbtable", "salaries")\
    .option("numPartitions", 10)\
    .load()
 
df.rdd.getNumPartitions()

1

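In our case, it still shows only 1 partition. Setting `numPartitions` alone does not split a JDBC read: Spark also needs the `partitionColumn`, `lowerBound`, and `upperBound` options to generate one range query per partition, and without them the whole table is read through a single partition.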
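
To get more than one partition from a JDBC read, Spark has to be told how to split the table: a numeric (or date/timestamp) `partitionColumn` plus `lowerBound` and `upperBound`, alongside `numPartitions`. The sketch below assumes the auto-increment `id` column added in step 10 runs from 1 to 50; the helper names `partitioned_read_options` and `read_partitioned` are hypothetical. The bounds only determine the partition stride and do not filter rows.

```python
def partitioned_read_options(column, lower, upper, num_partitions):
    """Build the four options Spark's JDBC source needs to split a read into ranges."""
    if upper < lower:
        raise ValueError("upperBound must not be smaller than lowerBound")
    if num_partitions < 1:
        raise ValueError("numPartitions must be at least 1")
    return {
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


def read_partitioned(spark, connection_options, table, partition_options):
    """Load `table` over JDBC using range partitioning on the chosen column."""
    return (
        spark.read.format("jdbc")
        .options(**connection_options)
        .option("dbtable", table)
        .options(**partition_options)
        .load()
    )


# Assumption: id values are 1..50, so [1, 51) covers the table in 10 ranges.
opts = partitioned_read_options("id", 1, 51, 10)
print(opts)

# Usage in the notebook, reusing driver, url, user and password from In [3]:
# conn = {"driver": driver, "url": url, "user": user, "password": password}
# df = read_partitioned(spark, conn, "salaries", opts)
# df.rdd.getNumPartitions()
```

With these options in place, `df.rdd.getNumPartitions()` should report more than one partition.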