### Motivation

- Create a pipeline that moves data from simple files into a database using PySpark.

- Create the tables in the database.

- Connect to the database and pull the data for analysis using PySpark.

- After the analysis, write the updated or new table back to the database.

- Write the final analysis report to a CSV file on the local file system.

To analyse the data, it has to be in memory. Spark starts a JVM in which the data can be loaded and processed. PySpark provides several routes for loading data into the session: reading the files inside a folder, connecting to a database server such as Postgres and loading tables from it, or reading a file directly through the SparkContext.

PySpark also allows the analysis to be moved to the SQL engine / Hive engine underneath the Spark session, by registering a DataFrame with createOrReplaceTempView.
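The database route described above (pull a table from Postgres, then query it with SQL) could look roughly like the sketch below. It is a minimal illustration, not the notebook's own code: the helper names `jdbc_options` and `read_table` and the environment variables `PGUSER` / `PGPASSWORD` are assumptions made for this example, while the URL and driver class are the ones used in the cells further down.

```python
import os


def jdbc_options(table, url="jdbc:postgresql://localhost:5432/airbnb"):
    """Collect the JDBC options for one table in a single dict."""
    return {
        "url": url,
        "dbtable": table,
        # Credentials are read from the environment (hypothetical variable
        # names) so that they are not hard-coded in the notebook.
        "user": os.environ.get("PGUSER", "postgres"),
        "password": os.environ.get("PGPASSWORD", ""),
        "driver": "org.postgresql.Driver",
    }


def read_table(spark, table):
    """Load a Postgres table into a DataFrame and register it for spark.sql."""
    df = spark.read.format("jdbc").options(**jdbc_options(table)).load()
    df.createOrReplaceTempView(table)
    return df
```

With a running session this would be called as `read_table(spark, "calendar")`, after which `spark.sql("SELECT * FROM calendar LIMIT 5")` queries the same data through the SQL engine.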

In [1]:
import pyspark
from pyspark import SparkContext,SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.window import *

In [2]:
airbnb_root = "/home/****/spark-warehouse/airbnb/"

In [3]:
spark = SparkSession.builder \
        .appName('AirBnB Session') \
        .config('spark.jars','/usr/share/java/postgresql-42.2.26.jar') \
        .getOrCreate()

22/11/23 11:14:07 WARN Utils: Your hostname, codeStation resolves to a loopback address: 127.0.1.1; using 192.168.126.83 instead (on interface wlo1)
22/11/23 11:14:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/11/23 11:14:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [4]:
sparkCon = spark.sparkContext

In [5]:
#The URL below points to the Postgres server running on the local machine, with the airbnb database
url = "jdbc:postgresql://localhost:5432/airbnb"

In [6]:
calendar = sparkCon.textFile(airbnb_root + "calendar.csv") #This method seems redundant
#The method above still needs to be reviewed with different kinds of datasets
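`textFile` returns an RDD of raw text lines, so every row still has to be split into fields by hand, which is the part `spark.read.csv` does automatically. A minimal sketch of that extra step, assuming the function name `parse_partition` (hypothetical) and a made-up sample line in the layout of calendar.csv:

```python
import csv


def parse_partition(lines, header=None):
    """Turn an iterator of raw CSV lines into lists of string fields.

    Rows equal to `header` are dropped, because textFile() returns the
    header line as ordinary data.
    """
    for row in csv.reader(lines):
        if row and row != header:
            yield row


# Local check with plain strings (illustrative sample, not real data).
header = ["listing_id", "date", "available", "price"]
sample = ["listing_id,date,available,price", "2818,2019-12-05,f,"]
rows = list(parse_partition(sample, header))
print(rows)  # [['2818', '2019-12-05', 'f', '']]
```

On the RDD the same function could be applied with `calendar.mapPartitions(lambda it: parse_partition(it, header))`. Note that this line-by-line approach breaks on quoted fields that contain newlines, which is one reason to prefer the DataFrame reader.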

In [7]:
#In HDFS, databases and tables are just folders. On a local file system (NTFS/FAT32) they
#exist the same way
spark.sql("CREATE DATABASE IF NOT EXISTS airbnb_pyspark")
spark.sql("use airbnb_pyspark")

DataFrame[]

In [8]:
#The tables shown below are the ones inside the PySpark environment. The list is empty
#here because the necessary data has not been brought in yet
spark.sql("SHOW TABLES").show()

+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
+---------+---------+-----------+



In [9]:
#Creating the Spark DataFrame to get the data into PySpark
calendarTab = spark.read.csv(airbnb_root+"calendar.csv",inferSchema=True,header=True)
#Registering a matching view inside the PySpark environment for SQL-based analysis
calendarTab.createOrReplaceTempView("airbnb_pyspark.calendar")

                                                                                

In [10]:
calendarTab.show(5)

+----------+-------------------+---------+-----+
|listing_id|               date|available|price|
+----------+-------------------+---------+-----+
|      2818|2019-12-05 00:00:00|        f| null|
|     73208|2019-08-30 00:00:00|        f| null|
|     73208|2019-08-29 00:00:00|        f| null|
|     73208|2019-08-28 00:00:00|        f| null|
|     73208|2019-08-27 00:00:00|        f| null|
+----------+-------------------+---------+-----+
only showing top 5 rows



In [None]:
#Note that the activity below is required only once. With overwrite mode it can be
#repeated; each run involves a call to the database
#Initialising the database connection and writing the table to it
calendarTab.write.format("jdbc") \
        .option("url","jdbc:postgresql://localhost:5432/airbnb") \
        .option("dbtable","calendar") \
        .option("user","postgres") \
        .option("password",1234) \
        .option("driver","org.postgresql.Driver") \
        .save(mode='overwrite')
#Saving the DataFrame as a permanent table in the Spark warehouse
calendarTab.write.saveAsTable("calendar",mode='overwrite')
#Check the output files: they are stored as Parquet, compressed with Snappy

In [21]:
#The command below drops the table and deletes its files in the underlying file system
spark.sql("DROP TABLE calendartab")

22/11/22 20:58:03 WARN HadoopFSUtils: The directory file:/run/media/solverbot/repoA/gitFolders/moreDE/spark-warehouse/airbnb_pyspark.db/calendartab was not found. Was it deleted very recently?


DataFrame[]

In [11]:
listingTab = spark.read.csv(airbnb_root+"listings.csv",inferSchema=True,header=True)
#Registering a temp view so the listings can be queried with spark.sql
listingTab.createOrReplaceTempView("listings")

In [12]:
listingTab.printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- host_id: string (nullable = true)
 |-- host_name: string (nullable = true)
 |-- neighbourhood_group: string (nullable = true)
 |-- neighbourhood: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- room_type: string (nullable = true)
 |-- price: string (nullable = true)
 |-- minimum_nights: integer (nullable = true)
 |-- number_of_reviews: string (nullable = true)
 |-- last_review: string (nullable = true)
 |-- reviews_per_month: string (nullable = true)
 |-- calculated_host_listings_count: double (nullable = true)
 |-- availability_365: integer (nullable = true)
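Several columns that look numeric (id, price, latitude, longitude) were inferred as string above. One way to take control of this is to pass an explicit schema instead of `inferSchema=True`. The sketch below is only an illustration: the types in `LISTING_COLUMNS` are guesses from the column names, the helper names are hypothetical, and the `multiLine` / `escape` options assume that the string inference was caused by quoted text fields spanning several lines, which has not been verified here.

```python
# Column types are assumptions based on the column names, not on the data.
LISTING_COLUMNS = {
    "id": "BIGINT",
    "name": "STRING",
    "host_id": "BIGINT",
    "host_name": "STRING",
    "neighbourhood_group": "STRING",
    "neighbourhood": "STRING",
    "latitude": "DOUBLE",
    "longitude": "DOUBLE",
    "room_type": "STRING",
    "price": "INT",
    "minimum_nights": "INT",
    "number_of_reviews": "INT",
    "last_review": "DATE",
    "reviews_per_month": "DOUBLE",
    "calculated_host_listings_count": "INT",
    "availability_365": "INT",
}


def ddl_schema(columns):
    """Build a DDL-formatted schema string such as 'id BIGINT, name STRING'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())


def read_listings(spark, path):
    """Read listings.csv with a fixed schema instead of schema inference."""
    return spark.read.csv(
        path,
        header=True,
        schema=ddl_schema(LISTING_COLUMNS),
        multiLine=True,  # assumption: some quoted fields contain newlines
        escape='"',      # assumption: quotes inside fields are doubled
    )
```

Values that do not fit the declared type come back as null in the default permissive mode, so a quick null count per column after loading is a reasonable sanity check.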



In [15]:
#Selecting only the columns of interest using the Spark DataFrame API
listingTab.select("name","price","availability_365").show(5,truncate=False)

+-------------------------------------------------+-----+----------------+
|name                                             |price|availability_365|
+-------------------------------------------------+-----+----------------+
|Quiet Garden View Room & Super Fast WiFi         |59   |44              |
|Quiet apt near center, great view                |160  |47              |
|100%Centre-Studio 1 Private Floor/Bathroom       |80   |198             |
|Lovely apt in City Centre (Jordaan)              |125  |141             |
|Romantic, stylish B&B houseboat in canal district|150  |199             |
+-------------------------------------------------+-----+----------------+
only showing top 5 rows



In [18]:
listings_truncated = spark.sql("SELECT name, price, availability_365 FROM listings")

In [22]:
listings_truncated.show(2,truncate=False)

+----------------------------------------+-----+----------------+
|name                                    |price|availability_365|
+----------------------------------------+-----+----------------+
|Quiet Garden View Room & Super Fast WiFi|59   |44              |
|Quiet apt near center, great view       |160  |47              |
+----------------------------------------+-----+----------------+
only showing top 2 rows



In [20]:
listings_truncated.printSchema()

root
 |-- name: string (nullable = true)
 |-- price: string (nullable = true)
 |-- availability_365: integer (nullable = true)



In [None]:
#This creates a listings_tab folder and writes the CSV part files inside it
listings_truncated.write.csv("listings_tab")
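Because Spark writes a folder of part files rather than one CSV, a final report for the local file system usually needs one more step. A minimal sketch, assuming the output is small enough to be written as a single partition and that the folder is on the local disk; the helper name `promote_single_part` and the target file name are hypothetical:

```python
import glob
import os
import shutil


def promote_single_part(folder, target):
    """Move the only part-*.csv file found in `folder` to `target`."""
    parts = glob.glob(os.path.join(folder, "part-*.csv"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    shutil.move(parts[0], target)
    return target
```

A possible use is `listings_truncated.coalesce(1).write.csv("listings_tab", header=True, mode="overwrite")` followed by `promote_single_part("listings_tab", "listings_report.csv")`. `coalesce(1)` funnels all rows through one task, so this only makes sense for report-sized results.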

In [None]:
#This connects to the external Postgres database, creates the table
#listings_truncated and writes the data to it
listings_truncated.write.format("jdbc") \
        .option("url","jdbc:postgresql://localhost:5432/airbnb") \
        .option("dbtable","listings_truncated") \
        .option("user","postgres") \
        .option("password",1234) \
        .option("driver","org.postgresql.Driver") \
        .save(mode='overwrite')
listingTab.write.saveAsTable("listings",mode='overwrite')

In [None]:
#This command stores the table as files in the local Spark warehouse
listings_truncated.write.saveAsTable("listings_truncated")

In [None]:
reviewsTab.write.format('jdbc') \
    .option("url",url) \
    .option('dbtable','reviews') \
    .option('user','postgres') \
    .option('password',1234) \
    .option('driver','org.postgresql.Driver') \
    .save(mode='ignore')
reviewsTab.write.saveAsTable('reviews')

In [12]:
#Creating the reviews tables in postgres database, local folder system and in pyspark env
reviewsTab = spark.read.csv(airbnb_root+"reviews.csv",inferSchema=True,header=True)
reviewsTab.createOrReplaceTempView("reviews")


In [None]:
reviewsDetails.write.format('jdbc').option('url',url).option("dbtable",'reviews_details') \
    .option("user","postgres").option("password",1234).option("driver","org.postgresql.Driver") \
    .save(mode='ignore')
reviewsDetails.write.saveAsTable("review_tables") #permanent to folder

In [13]:
#Creating the review details tables in postgres database, local folder system and in pyspark env
reviewsDetails = spark.read.csv(airbnb_root+"reviews_details.csv",inferSchema=True,header=True)
reviewsDetails.createOrReplaceTempView("review_details") #temporary

                                                                                

In [None]:
neighbourhood.write.format('jdbc').option('url',url).option("dbtable",'neighbourhoods') \
    .option('user','postgres').option('password',1234).option('driver','org.postgresql.Driver') \
    .save(mode='ignore')
neighbourhood.write.saveAsTable("neighbourhoods")

In [14]:
neighbourhood = spark.read.csv(airbnb_root+"neighbourhoods.csv",inferSchema=True, header=True)
neighbourhood.createOrReplaceTempView("neighbourhoods") #used for the spark sql activity


In [51]:
help(spark.read.json)

Help on method json in module pyspark.sql.readwriter:

json(path: Union[str, List[str], pyspark.rdd.RDD[str]], schema: Union[pyspark.sql.types.StructType, str, NoneType] = None, primitivesAsString: Union[bool, str, NoneType] = None, prefersDecimal: Union[bool, str, NoneType] = None, allowComments: Union[bool, str, NoneType] = None, allowUnquotedFieldNames: Union[bool, str, NoneType] = None, allowSingleQuotes: Union[bool, str, NoneType] = None, allowNumericLeadingZero: Union[bool, str, NoneType] = None, allowBackslashEscapingAnyCharacter: Union[bool, str, NoneType] = None, mode: Optional[str] = None, columnNameOfCorruptRecord: Optional[str] = None, dateFormat: Optional[str] = None, timestampFormat: Optional[str] = None, multiLine: Union[bool, str, NoneType] = None, allowUnquotedControlChars: Union[bool, str, NoneType] = None, lineSep: Optional[str] = None, samplingRatio: Union[str, float, NoneType] = None, dropFieldIfAllNull: Union[bool, str, NoneType] = None, encoding: Optional[str] = 

In [52]:
#Working with the GeoJSON file as spatial data requires an additional Apache library, Apache Sedona.
#Will explore and update it soon.
neighbourhoodjson = spark.read.json(airbnb_root+"neighbourhoods.geojson")

In [53]:
neighbourhoodjson.printSchema()

root
 |-- features: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- geometry: struct (nullable = true)
 |    |    |    |-- coordinates: array (nullable = true)
 |    |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |    |    |    |-- element: double (containsNull = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |-- properties: struct (nullable = true)
 |    |    |    |-- neighbourhood: string (nullable = true)
 |    |    |    |-- neighbourhood_group: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- type: string (nullable = true)
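The whole file arrives as one row with a nested `features` array, which is awkward to query. Until a spatial library is in place, one option is to flatten the features in plain Python. The sketch below assumes the file is small enough to load with the `json` module; the function name `flatten_features` and the sample document are made up for illustration and only mirror the field names in the schema above.

```python
import json


def flatten_features(geojson):
    """Yield one flat dict per GeoJSON feature."""
    for feature in geojson.get("features", []):
        props = feature.get("properties") or {}
        geom = feature.get("geometry") or {}
        yield {
            "neighbourhood": props.get("neighbourhood"),
            "neighbourhood_group": props.get("neighbourhood_group"),
            "geometry_type": geom.get("type"),
            # Keep the coordinates as a JSON string so the row stays flat.
            "coordinates": json.dumps(geom.get("coordinates")),
        }


# Tiny made-up document with the same shape as the schema above.
doc = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"neighbourhood": "ExampleHood",
                       "neighbourhood_group": None},
        "geometry": {"type": "MultiPolygon",
                     "coordinates": [[[[4.88, 52.37]]]]},
    }],
}
rows = list(flatten_features(doc))
print(rows[0]["neighbourhood"], rows[0]["geometry_type"])
```

The resulting list of dicts can be handed to `spark.createDataFrame` together with an explicit schema, which gives one row per neighbourhood instead of one row per file.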



In [15]:
#When bulk data spans many files, PySpark makes it easy to pull the data into the
#PySpark environment, process it, and move it to multiple output points.
ytCSV = "/home/solverbot/Desktop/ytDE/csvfiles"

In [16]:
#There are a bunch of files inside the above path. Checking what happens
#when the folder is passed to spark.read.csv
ytvidPy = spark.read.csv(ytCSV,inferSchema=True, header=True)

                                                                                

In [70]:
ytvidPy.show(2)

+-----------+-------------+--------------------+-------------+-----------+--------------------+--------------------+------+-----+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|channel_title|category_id|        publish_time|                tags| views|likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+-------------+-----------+--------------------+--------------------+------+-----+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|gDuslQ9avLc|     17.14.11|Захар и Полина уч...|    Т—Ж БОГАЧ|         22|2017-11-13T09:09:...|"захар и полина|"...| 62408|  334|     190|           50|https://i.ytimg.c...|            FALSE|           FALSE|                 FALSE|Знакомьтес

In [71]:
ytvidPy.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: string (nullable = true)
 |-- likes: string (nullable = true)
 |-- dislikes: string (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)



In [72]:
#All 10 files, each averaging 40 MB, are loaded into a single DataFrame
ytvidPy.count()

                                                                                

416869

In [None]:
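#Incomplete JDBC write, left commented out: option() needs a key and a value, and the
#url, dbtable, user, password and driver options must be supplied as in the earlier cells
#ytvidPy.write.format('jdbc').option('')
#Saving as a Parquet table below compressed the files to 270 MB, almost 50% compression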
ytvidPy.write.saveAsTable("youtubeVideos")

In [17]:
ytvidPy.createOrReplaceTempView("youtubeVideos")

In [19]:
#The same is not true for the JSON files, though. They need additional processing
ytJson = "/home/solverbot/Desktop/ytDE/jsonfiles/"

In [78]:
readingYtJson = spark.read.json(ytJson)

In [79]:
readingYtJson.take(2)

AnalysisException: 
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
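This error typically appears when Spark could not parse any line as a complete JSON record, which is what happens when a pretty-printed file is read with the default one-record-per-line setting. The sketch below checks which case applies and sets the reader's `multiLine` option accordingly. It assumes the files sit on the local file system and are small enough to read fully; the helper names `is_json_lines` and `read_json_auto` are hypothetical.

```python
import json


def is_json_lines(text):
    """Return True if every non-empty line is a complete JSON document."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    for line in lines:
        try:
            json.loads(line)
        except json.JSONDecodeError:
            return False
    return True


def read_json_auto(spark, path):
    """Read one JSON file, enabling multiLine when it is not JSON Lines."""
    with open(path, encoding="utf-8") as handle:
        multiline = not is_json_lines(handle.read())
    return spark.read.json(path, multiLine=multiline)
```

For a single category file this would be called as `read_json_auto(spark, ytJson + "CA_category_id.json")`. Whether the category files are in fact pretty-printed is an assumption; the check simply reports what it finds.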
      

In [80]:
caJson = spark.read.json(ytJson+"CA_category_id.json")

In [20]:
jsonRawCA= sparkCon.textFile(ytJson+"CA_category_id.json")

In [25]:
import json

In [28]:
jsonFile = open(ytJson+"CA_category_id.json")