# Spark ETL with a PostgreSQL Database

1. Install the required Spark libraries
2. Create a connection to the PostgreSQL database
3. Read data from the PostgreSQL database
4. Transform the data
5. Write the data back to the PostgreSQL database

### 1- Spark Libraries

Start a Spark session and load the required libraries.

In [1]:
from pyspark.sql import SparkSession

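The PostgreSQL JDBC driver is published on Maven (https://mvnrepository.com/artifact/org.postgresql/postgresql/42.6.0). Its coordinates, 'org.postgresql:postgresql:42.6.0', are passed to 'spark.jars.packages' so that Spark downloads the driver when the session starts.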

In [2]:
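# Start the Spark session and pull the PostgreSQL JDBC driver from Maven
spark_postgres = SparkSession.builder.appName("postgreSQL")\
        .config('spark.jars.packages', 'org.postgresql:postgresql:42.6.0')\
        .getOrCreate()
# A SparkSession already exposes .sql(), so reuse it instead of wrapping it again
sqlContext = spark_postgres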

In [3]:
spark_postgres

### 2- Create Connection and Read Data

In [9]:
# Load the PostgreSQL table "employee" into a DataFrame over JDBC
postgre_df = spark_postgres.read \
    .format("jdbc") \
    .option("driver","org.postgresql.Driver") \
    .option("url", "jdbc:postgresql://127.0.0.1:5432/Spark_db") \
    .option("dbtable", "employee") \
    .option("user", "postgres") \
    .option("password", "xxxx") \
    .load()
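
The connection settings in this cell are repeated later when writing, and the password is hard-coded. A minimal sketch of one way to keep them in a single place, assuming the credentials are supplied through the environment variables PGUSER and PGPASSWORD (the helper name jdbc_options and these variable names are illustrative and not part of the original notebook):

```python
import os


def jdbc_options(table, host="127.0.0.1", port=5432, database="Spark_db"):
    """Build the option dict for a Spark JDBC read or write against PostgreSQL."""
    return {
        "driver": "org.postgresql.Driver",
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        # Assumption: credentials come from the environment, not from the notebook
        "user": os.environ.get("PGUSER", "postgres"),
        "password": os.environ.get("PGPASSWORD", ""),
    }


employee_options = jdbc_options("employee")
print(employee_options["url"])

# With a live session, the dict can replace the chained .option() calls:
# postgre_df = spark_postgres.read.format("jdbc").options(**employee_options).load()
```

Passing the dict through .options(**...) should behave like the individual .option() calls, while keeping the password out of the saved notebook.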

In [10]:
postgre_df.show(5)

+-----------+----------------+--------------------+-----------+--------------------+------+---------+---+----------+--------------------+--------------------+-------------+--------+---------+
|employee_id|       full_name|           job_title|departement|       business_unit|gendre|ethnicity|age| hire_date|       annual_salary|               bonus|      country|    city|exit_date|
+-----------+----------------+--------------------+-----------+--------------------+------+---------+---+----------+--------------------+--------------------+-------------+--------+---------+
|     E02002|          Kai Le|   Controls Engineer|Engineering|       Manufacturing|  Male|    Asian| 47|02/05/2022|92.36800000000000...|               0E-18|United States|Columbus|     null|
|     E02003|    Robert Patel|             Analyst|      Sales|           Corporate|  Male|    Asian| 58|10/23/2013|45.70300000000000...|               0E-18|United States| Chicago|     null|
|     E02004|      Cameron Lo|Network Ad

In [11]:
postgre_df.printSchema()

root
 |-- employee_id: string (nullable = true)
 |-- full_name: string (nullable = true)
 |-- job_title: string (nullable = true)
 |-- departement: string (nullable = true)
 |-- business_unit: string (nullable = true)
 |-- gendre: string (nullable = true)
 |-- ethnicity: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- hire_date: string (nullable = true)
 |-- annual_salary: decimal(38,18) (nullable = true)
 |-- bonus: decimal(38,18) (nullable = true)
 |-- country: string (nullable = true)
 |-- city: string (nullable = true)
 |-- exit_date: string (nullable = true)
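
The schema reports hire_date and exit_date as strings, so sorting or comparing them would work on text rather than on dates. A hedged sketch of how they might be parsed with Spark SQL's to_date, assuming both columns follow the MM/dd/yyyy pattern seen in the hire_date sample values (exit_date is null in the rows shown, so its format is an assumption, and typed_select_exprs is a hypothetical helper):

```python
# Columns reported as strings that are assumed to hold MM/dd/yyyy dates
DATE_COLUMNS = ("hire_date", "exit_date")


def typed_select_exprs(columns, date_columns=DATE_COLUMNS, date_format="MM/dd/yyyy"):
    """Return Spark SQL expressions that parse the date columns and pass the rest through."""
    exprs = []
    for name in columns:
        if name in date_columns:
            # to_date(column, format) is a built-in Spark SQL function
            exprs.append(f"to_date({name}, '{date_format}') AS {name}")
        else:
            exprs.append(name)
    return exprs


exprs = typed_select_exprs(["employee_id", "hire_date", "exit_date"])
print(exprs)

# With a live session:
# typed_df = postgre_df.selectExpr(*typed_select_exprs(postgre_df.columns))
# typed_df.printSchema()
```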



### 3- Transform data

In [15]:
# Register the DataFrame as a temporary view so it can be queried with SQL
postgre_df.createOrReplaceTempView("tempPostgres")

In [20]:
postgre_df.count()

1000

In [18]:
postgre_test = sqlContext.sql("select * from tempPostgres where age > 35")

In [19]:
postgre_test.count()

736
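
The age threshold is written directly into the SQL string. A small sketch of how the query might be parameterized, assuming the threshold is a whole number (age_filter_query is an illustrative helper name, not part of the original notebook):

```python
def age_filter_query(min_age, view="tempPostgres"):
    """Return the SQL that keeps rows whose age is strictly greater than min_age."""
    # int() raises on non-numeric text, so arbitrary strings never reach the SQL
    return f"SELECT * FROM {view} WHERE age > {int(min_age)}"


query = age_filter_query(35)
print(query)

# With a live session:
# postgre_test = spark_postgres.sql(query)
# The DataFrame API offers an equivalent without SQL strings:
# postgre_test = postgre_df.filter(postgre_df.age > 35)
```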

### 4- Write data

In [21]:
# Write the filtered rows (age > 35) to the table "postgre_test"
postgre_test.write \
    .format("jdbc") \
    .option("driver","org.postgresql.Driver") \
    .option("url", "jdbc:postgresql://127.0.0.1:5432/Spark_db") \
    .option("dbtable", "postgre_test") \
    .option("user", "postgres") \
    .option("password", "xxxx") \
    .save()
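
By default the writer raises an error when the target table already exists, so running this cell a second time fails. A minimal sketch of making the save mode explicit with DataFrameWriter.jdbc; the helper names normalize_save_mode and save_to_postgres are illustrative, and note that "overwrite" drops and recreates the table, which may not suit every pipeline:

```python
# Commonly used Spark save modes ("error" and "errorifexists" are synonyms)
VALID_SAVE_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def normalize_save_mode(mode):
    """Lower-case the mode and reject values outside the common save modes."""
    normalized = mode.strip().lower()
    if normalized not in VALID_SAVE_MODES:
        raise ValueError(f"unsupported save mode: {mode!r}")
    return normalized


def save_to_postgres(df, table, url, properties, mode="errorifexists"):
    """Write df to a PostgreSQL table over JDBC with an explicit save mode."""
    df.write.jdbc(url=url, table=table,
                  mode=normalize_save_mode(mode), properties=properties)


print(normalize_save_mode("Overwrite"))

# With a live session (same placeholders as in the notebook):
# save_to_postgres(
#     postgre_test, "postgre_test",
#     "jdbc:postgresql://127.0.0.1:5432/Spark_db",
#     {"user": "postgres", "password": "xxxx", "driver": "org.postgresql.Driver"},
#     mode="overwrite",
# )
```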