<a href="https://colab.research.google.com/github/NonRoute/Data-Sci-Eng-Project/blob/main/Datasci_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Engineering (DE) Spark

## Spark Preparation
We first check whether the notebook is running in Google Colab. If it is, we install all the necessary packages.

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 3.3.2 with Hadoop 3.3, Java 8, and findspark, which locates the Spark installation on the system. The installation can be carried out from within the Colab notebook itself.
Learn more from [A Must-Read Guide on How to Work with PySpark on Google Colab for Data Scientists!](https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/)

In [1]:
# Detect Google Colab: the google.colab package is only importable there.
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

In [2]:
if IN_COLAB:
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    !wget -q https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
    !tar xf spark-3.3.2-bin-hadoop3.tgz
    !mv spark-3.3.2-bin-hadoop3 spark
    !pip install -q findspark
    import os
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
    os.environ["SPARK_HOME"] = "/content/spark"
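Before starting Spark, it can help to confirm that both environment variables point at real directories. The helper below is a minimal sketch of such a check; `check_env_dirs` is a hypothetical name introduced here, not part of Spark or findspark.

```python
import os

def check_env_dirs(names=("JAVA_HOME", "SPARK_HOME"), environ=None):
    """Map each variable name to True if it is set and points to an existing directory."""
    env = os.environ if environ is None else environ
    # A missing variable falls back to "", which os.path.isdir reports as False.
    return {name: os.path.isdir(env.get(name, "")) for name in names}

print(check_env_dirs())
```

A `False` entry suggests that the install step did not run or that the path differs on the current machine.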

## Start a Local Cluster

In [3]:
import findspark
findspark.init()

In [4]:
spark_url = 'local'

In [5]:
from pyspark.sql import SparkSession

In [6]:
spark = SparkSession.builder\
        .master(spark_url)\
        .appName('Spark ML')\
        .getOrCreate()

## Spark SQL Data Preparation

In [8]:
from pyspark import SparkFiles

# Data from https://www.traffy.in.th/?page_id=27351
# The dump is updated every 3 hours

url = 'https://publicapi.traffy.in.th/dump-csv-chadchart/bangkok_traffy.csv'
spark.sparkContext.addFile(url)

In [9]:
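# Read the downloaded CSV. Free-text fields may span several lines and use a
# doubled quote ("") for a literal quote, hence multiline=true and escape='"'.
# Setting "escape" more than once keeps only the last value, so it is set once here.
df = (
    spark.read
    .option("header", "true")
    .option("delimiter", ",")
    .option("quote", '"')
    .option("escape", '"')
    .option("multiline", "true")
    .csv("file://" + SparkFiles.get("bangkok_traffy.csv"))
)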
df.show()

+-----------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------------+-----------------+--------------------+--------------------+--------------+----+------------+--------------------+
|  ticket_id|                type|        organization|             comment|               photo|         photo_after|            coords|             address|     subdistrict|         district|            province|           timestamp|         state|star|count_reopen|       last_activity|
+-----------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------------+-----------------+--------------------+--------------------+--------------+----+------------+--------------------+
|2021-9LHDM6|                  {}|                null|            ไม่มีภาพ|https://storage.g...|                null|100.48661,13
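The multi-line and quote settings matter because free-text fields such as `comment` can contain line breaks and quote characters. The sketch below illustrates the same idea with Python's standard `csv` module rather than Spark. The sample text and ticket ids are hypothetical, and it assumes quotes are escaped by doubling, which is what the `escape` setting of `"` implies.

```python
import csv
import io

# Hypothetical sample: the first comment contains a line break and a
# doubled quote ("") that stands for one literal quote character.
sample = 'ticket_id,comment\n"A1","line one\nsays ""hello"""\n"A2","plain"\n'

rows = list(csv.reader(io.StringIO(sample)))
for row in rows:
    print(row)
```

The quoted field is kept as a single value even though it spans two physical lines, so the sample yields a header plus two records rather than four broken rows.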

In [10]:
df.count()

254679

In [11]:
df.printSchema()

root
 |-- ticket_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- organization: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- photo: string (nullable = true)
 |-- photo_after: string (nullable = true)
 |-- coords: string (nullable = true)
 |-- address: string (nullable = true)
 |-- subdistrict: string (nullable = true)
 |-- district: string (nullable = true)
 |-- province: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- state: string (nullable = true)
 |-- star: string (nullable = true)
 |-- count_reopen: string (nullable = true)
 |-- last_activity: string (nullable = true)



In [12]:
# Drop columns that are not used in the analysis
cols = ['ticket_id','photo', 'photo_after']
df = df.drop(*cols)

## Convert to Proper Data Types

In [13]:
from pyspark.sql.functions import col
cols = ['star', 'count_reopen']
for c in cols:
    df = df.withColumn(c, col(c).cast('int'))

In [14]:
cols = ['timestamp', 'last_activity']
for c in cols:
    df = df.withColumn(c, col(c).cast('timestamp'))

In [15]:
from pyspark.sql.functions import split, regexp_replace
cols = ['type']
for c in cols:
    df = df.withColumn(c, split(regexp_replace(col(c), "[{}]", ""), ","))
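To see what this transformation does to individual values, here is a plain-Python counterpart of the same expression. It is a sketch for illustration only; `parse_type` is a hypothetical helper and the input strings are made-up examples in the `{a,b}` format shown in the raw data.

```python
import re

def parse_type(raw):
    """Strip the curly braces and split on commas, like the Spark expression above."""
    if raw is None:
        return None  # nulls stay null
    return re.sub(r"[{}]", "", raw).split(",")

print(parse_type("{ความสะอาด,ถนน}"))
print(parse_type("{}"))
```

Note that `{}` becomes a list holding one empty string rather than an empty list. Spark's `split` is expected to behave the same way, which would explain why such rows display as `[]` yet are not counted as null.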

In [16]:
cols = ['organization', 'coords']
for c in cols:
    df = df.withColumn(c, split(col(c), ","))
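After this split, `coords` is still an array of strings. If numeric coordinates are needed later, each pair could be converted along the lines of the sketch below. `parse_coords` is a hypothetical helper and the sample value is invented. The first element is treated as longitude because the values shown in the data are around 100, which matches Bangkok's longitude rather than its latitude.

```python
def parse_coords(raw):
    """Convert 'lon,lat' text into a (lon, lat) tuple of floats, or None if malformed."""
    if raw is None:
        return None
    parts = raw.split(",")
    if len(parts) != 2:
        return None
    try:
        lon, lat = float(parts[0]), float(parts[1])
    except ValueError:
        return None  # non-numeric text
    return lon, lat

print(parse_coords("100.5,13.75"))  # hypothetical sample value
```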

In [17]:
df.printSchema()

root
 |-- type: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- organization: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- comment: string (nullable = true)
 |-- coords: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- address: string (nullable = true)
 |-- subdistrict: string (nullable = true)
 |-- district: string (nullable = true)
 |-- province: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- state: string (nullable = true)
 |-- star: integer (nullable = true)
 |-- count_reopen: integer (nullable = true)
 |-- last_activity: timestamp (nullable = true)



In [18]:
df.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+--------------------+--------------------+--------------+----+------------+--------------------+
|                type|        organization|             comment|              coords|             address|     subdistrict|         district|            province|           timestamp|         state|star|count_reopen|       last_activity|
+--------------------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+--------------------+--------------------+--------------+----+------------+--------------------+
|                  []|                null|            ไม่มีภาพ|[100.48661, 13.79...|1867 จรัญสนิทวงศ์...|         บางพลัด|          บางพลัด|       กรุงเทพมหานคร|2021-09-01 10:44:...|กำลังดำเนินการ|null|        null|2022-02-22 04:59:...|
|         [ความสะอาด]|        [เขตบางซื่อ]|     

In [19]:
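# Count nulls per column. For rows where the column is null, F.when(...) yields the
# column name as a literal (non-null) string, and null otherwise, so F.count
# tallies exactly the null rows.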
import pyspark.sql.functions as F

df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])

In [20]:
df_agg.show()

+----+------------+-------+------+-------+-----------+--------+--------+---------+-----+------+------------+-------------+
|type|organization|comment|coords|address|subdistrict|district|province|timestamp|state|  star|count_reopen|last_activity|
+----+------------+-------+------+-------+-----------+--------+--------+---------+-----+------+------------+-------------+
|  97|         969|   2378|     0|   2378|         70|      72|      23|        0|    0|160341|      117728|           14|
+----+------------+-------+------+-------+-----------+--------+--------+---------+-----+------+------------+-------------+
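Raw counts are easier to judge as a share of all rows. The sketch below reuses the numbers printed above (254679 rows in total) and computes the null fraction per column in plain Python. The variable names are introduced here for illustration.

```python
total_rows = 254679  # df.count() reported earlier

# Null counts copied from the table above.
null_counts = {
    "type": 97, "organization": 969, "comment": 2378, "coords": 0,
    "address": 2378, "subdistrict": 70, "district": 72, "province": 23,
    "timestamp": 0, "state": 0, "star": 160341, "count_reopen": 117728,
    "last_activity": 14,
}

null_share = {name: count / total_rows for name, count in null_counts.items()}

# Print columns from most to least incomplete.
for name, share in sorted(null_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>14}: {share:6.2%}")
```

On these numbers, `star` and `count_reopen` are missing in a large fraction of rows, while `last_activity` is missing in only a handful, which makes dropping those few rows a low-cost choice.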



In [21]:
df.filter("last_activity is NULL").show()

+-------------+------------+--------------------+--------------------+--------------------+------------+-----------+--------------------+--------------------+-----------+----+------------+-------------+
|         type|organization|             comment|              coords|             address| subdistrict|   district|            province|           timestamp|      state|star|count_reopen|last_activity|
+-------------+------------+--------------------+--------------------+--------------------+------------+-----------+--------------------+--------------------+-----------+----+------------+-------------+
|        [ถนน]|        null|จอดรถกันข้างทางใน...|[100.50468, 13.70...|48/1 ถ. เจริญกรุง...| วัดพระยาไกร|  บางคอแหลม|       กรุงเทพมหานคร|2022-07-30 07:15:...|รอรับเรื่อง|null|        null|         null|
|           []|        null|     วางของบนทางเท้า|[100.53675, 13.70...|395 4-5 ซอย นราธิ...|   ช่องนนทรี|    ยานนาวา|จังหวัดกรุงเทพมหานคร|2022-08-05 05:53:...|รอรับเรื่อง|null|        null|

In [22]:
# Drop rows where last_activity is null
df = df.na.drop(subset=["last_activity"])

In [23]:
df.count()

254665
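As a quick consistency check, the number of rows removed should equal the number of null `last_activity` values counted earlier. The sketch below hard-codes the three figures reported in this notebook. In a live session they would come from `df.count()` and the null tally instead.

```python
rows_before = 254679        # count before dropping
null_last_activity = 14     # nulls found in last_activity
rows_after = 254665         # count after dropping

rows_dropped = rows_before - rows_after
print(rows_dropped == null_last_activity)
```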