This notebook takes the YouTube dataset, which consists of CSV and JSON files, and first writes the data to a local database.

It then writes the data to an AWS database using both the AWS Wrangler connection and the PySpark connection.

In [1]:
### Imports required for this project

#The pumping equipment of the pipeline
import pyspark
from pyspark.sql.functions import *
from pyspark.sql import SparkSession

#Suppress warnings
import warnings
warnings.filterwarnings('ignore')

#Work with JSON files
import json

#Work with AWS
import awswrangler
import boto3
import configparser

There are two sets of files we are going to work with.

- CSV files that sit in separate folders, one per region. The files themselves do not include the region name.

- JSON files that sit in a single folder, with the region code present in each file name (for example CA_category_id.json).

In the real world, such files or sources need to be brought together in pipelines, joined correctly, and then loaded into the final database / sink.

In [2]:
csvsource = "/home/solverbot/Desktop/ytDE/csvfiles"
jsonsource= "/home/solverbot/Desktop/ytDE/jsonfiles"

# The libraries below are needed to write out the Parquet files
import pandas as pd
import urllib.parse
import os

parquetPath = "/home/solverbot/Desktop/ytDE/parquetSink"

In [3]:
# We will initiate the spark session with the parameters necessary to 
# make connection with the database.

spark = SparkSession.builder.appName("YT_Pipeline"). \
            config('spark.jars',"/usr/share/java/postgresql-42.2.26.jar"). \
            getOrCreate()

23/01/27 08:01:39 WARN Utils: Your hostname, codeStation resolves to a loopback address: 127.0.1.1; using 192.168.64.83 instead (on interface wlo1)
23/01/27 08:01:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
23/01/27 08:01:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [None]:
# Let's first see where the raw files are located

In [None]:
%%sh
cd /home/solverbot/Desktop/ytDE/jsonfiles
ls 

In [None]:
%%sh 
cd /home/solverbot/Desktop/ytDE/csvfiles
ls -R

In [5]:
### To reduce the typing
sparkC = spark.sparkContext #rarely used
sparksql = spark.sql
filereader = spark.read

In [6]:
# Use recursive file lookup to read the CSV files from every region folder

youtubeCSV_table = filereader.csv(path=csvsource,
                                 recursiveFileLookup=True,
                                 header=True,
                                 inferSchema=True) \
                    .withColumn("region",input_file_name().substr(46,48))

                                                                                

In [7]:
# check the data

youtubeCSV_table.show(2)

                                                                                

+-----------+-------------+--------------------+-------------+-----------+--------------------+--------------------+------+-----+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+---------------+
|   video_id|trending_date|               title|channel_title|category_id|        publish_time|                tags| views|likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|         region|
+-----------+-------------+--------------------+-------------+-----------+--------------------+--------------------+------+-----+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+---------------+
|gDuslQ9avLc|     17.14.11|Захар и Полина уч...|    Т—Ж БОГАЧ|         22|2017-11-13T09:09:...|"захар и полина|"...| 62408|  334|     190|           50|https://i.ytimg.c...|            FALSE|  

In [8]:
# Cleaning up the region column

youtubeCSV_cleaned = youtubeCSV_table.selectExpr("*", "split_part(region, '/',1) as location") \
                .drop("region")
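
The fixed offsets in `substr(46,48)` only hold for this exact source path. As a hedged illustration of the same idea in plain Python, the sketch below takes the first folder under the CSV root as the region. The helper name `region_from_uri` and the sample file URI (including the `RU/RUvideos.csv` part) are assumptions made for this example, not paths taken from the dataset.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def region_from_uri(file_uri: str, csv_root: str) -> str:
    """Return the name of the first folder below csv_root for a file URI."""
    # Strip the file:// scheme and decode any percent-escaped characters
    path = PurePosixPath(unquote(urlparse(file_uri).path))
    # The first component relative to the root is the region folder
    return path.relative_to(csv_root).parts[0]


# Hypothetical URI in the shape that input_file_name() returns for local files
sample_uri = "file:///home/solverbot/Desktop/ytDE/csvfiles/RU/RUvideos.csv"
print(region_from_uri(sample_uri, "/home/solverbot/Desktop/ytDE/csvfiles"))
```

In Spark, similar logic could be expressed with a function such as `regexp_extract` applied to `input_file_name()`, which would avoid counting characters by hand.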

In [9]:
youtubeCSV_cleaned.show(2)

+-----------+-------------+--------------------+-------------+-----------+--------------------+--------------------+------+-----+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+--------+
|   video_id|trending_date|               title|channel_title|category_id|        publish_time|                tags| views|likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|location|
+-----------+-------------+--------------------+-------------+-----------+--------------------+--------------------+------+-----+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+--------+
|gDuslQ9avLc|     17.14.11|Захар и Полина уч...|    Т—Ж БОГАЧ|         22|2017-11-13T09:09:...|"захар и полина|"...| 62408|  334|     190|           50|https://i.ytimg.c...|            FALSE|           FALSE|      

In [None]:
youtubeCSV_cleaned.write.format('jdbc') \
                .option("url", "jdbc:postgresql://pipeline-tank.coc5gkht2i7a.us-east-1.rds.amazonaws.com:5432/pipeline_exercise") \
                .option('dbtable','yt_csv') \
                .option('user','postgres') \
                .option('password', 'wrangler') \
                .option('driver','org.postgresql.Driver') \
                .save()
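
The cell above hard-codes the database password. A minimal sketch of an alternative, assuming a hypothetical `[POSTGRES]` section with `URL`, `USER`, and `PASSWORD` keys (none of these names come from the notebook), reads the JDBC options with `configparser`, as the notebook does for the AWS keys.

```python
import configparser

# Stand-in for a config file on disk; every value here is a placeholder
SAMPLE_CONFIG = """
[POSTGRES]
URL = jdbc:postgresql://example-host:5432/example_db
USER = example_user
PASSWORD = example_password
"""


def jdbc_options(config_text: str, table: str) -> dict:
    """Build the option dictionary for a Spark JDBC write from config text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    pg = parser["POSTGRES"]
    return {
        "url": pg["URL"],
        "dbtable": table,
        "user": pg["USER"],
        "password": pg["PASSWORD"],
        "driver": "org.postgresql.Driver",
    }


options = jdbc_options(SAMPLE_CONFIG, "yt_csv")
print(sorted(options))
```

The resulting dictionary could then be passed to the writer in one call, for example `youtubeCSV_cleaned.write.format('jdbc').options(**options).save()`.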



In [None]:
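reader = configparser.ConfigParser()
# Read the AWS settings from a local config file; the context manager closes the file handle
with open('calter.config') as config_file:
    reader.read_file(config_file)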

reg = reader["AWS"]["REGION"]
key = reader["AWS"]["KEY"]
sec = reader["AWS"]["SECRET"]

In [None]:
# boto3.Session is the class; its region argument is named region_name
boto_session = boto3.Session(region_name=reg,
                             aws_access_key_id=key,
                             aws_secret_access_key=sec)

In [None]:
# Let's concentrate on the JSON files

youtubejsonRaw = filereader.json(path=jsonsource)

In [None]:
# Spark reports the records as corrupt
youtubejsonRaw.head(2)

In [None]:
# Let's look at the reason for the corruption using the shell

In [None]:
%%sh
cd /home/solverbot/Desktop/ytDE/jsonfiles
head -n 15 CA_category_id.json

In [None]:
%%sh
cd /home/solverbot/Desktop/ytDE/jsonfiles
tail -n 15 CA_category_id.json
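
A likely cause, assuming the shell output shows one pretty-printed JSON object spread over many lines: Spark's JSON reader treats each line as a separate record by default, so no single line parses on its own. The stdlib sketch below reproduces that effect with a small made-up document; the sample content is illustrative and not copied from the dataset.

```python
import json

# Made-up, pretty-printed document standing in for one category file
pretty_document = """{
  "kind": "example#list",
  "items": [
    {
      "id": "1"
    }
  ]
}"""


def parse_line_by_line(text: str) -> list:
    """Mimic a line-delimited reader: each line must be a complete JSON value."""
    results = []
    for line in text.splitlines():
        try:
            results.append(json.loads(line))
        except json.JSONDecodeError:
            results.append(None)  # this line is not valid JSON on its own
    return results


per_line = parse_line_by_line(pretty_document)
whole = json.loads(pretty_document)
print(sum(r is None for r in per_line), "of", len(per_line), "lines fail to parse")
print(len(whole["items"]), "item(s) when parsed as one document")
```

Spark's JSON reader has a `multiLine` option for this situation, for example `filereader.json(path=jsonsource, multiLine=True)`; this notebook instead converts the files to Parquet with pandas.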

In [46]:
def parquetMaker(file_name: str):
    """Receive a JSON file name, convert the file to Parquet,
    and write it to the Parquet folder.

    Ensure the Parquet folder is present at parquetPath."""
    filepath = os.path.join(jsonsource, file_name)
    dest_file = os.path.join(parquetPath, file_name.split('.')[0] + '.parquet')
    print(dest_file)
    # Build a DataFrame from the file, then flatten the nested 'items' records
    df_raw = pd.read_json(filepath)
    df_step_1 = pd.json_normalize(df_raw['items'])
    # Rename positionally; this assumes the flattened columns arrive in this order
    df_step_1.columns = ['kind','etag','id','channelId','title','assignable']
    df_step_1.to_parquet(path=dest_file)
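
To see why the positional rename works, it helps to look at the column names that flattening produces. The stdlib sketch below mimics the dotted naming that `pd.json_normalize` uses for nested dictionaries. The `flatten` helper, the sample values, and the nesting under a `snippet` key are illustrative assumptions; only the field names are taken from the rename list above.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dictionaries into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into the nested dictionary, extending the dotted prefix
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical item; the values are placeholders
sample_item = {
    "kind": "example#videoCategory",
    "etag": "example-etag",
    "id": "1",
    "snippet": {
        "channelId": "example-channel",
        "title": "Example",
        "assignable": True,
    },
}

flat_item = flatten(sample_item)
print(list(flat_item))
```

A name-based rename, for example mapping `snippet.title` to `title`, would not depend on the column order the way a positional assignment does.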

In [47]:
parquetMaker("CA_category_id.json")

/home/solverbot/Desktop/ytDE/parquetSink/CA_category_id.parquet


In [48]:
newParquet = filereader.parquet(parquetPath)

In [49]:
newParquet.show(2)

+--------------------+--------------------+---+---------+-----+----------+
|                kind|                etag| id|channelId|title|assignable|
+--------------------+--------------------+---+---------+-----+----------+
|youtube#videoCate...|"m2yskBQFythfE4ir...|  1|     null| null|      null|
|youtube#videoCate...|"m2yskBQFythfE4ir...|  2|     null| null|      null|
+--------------------+--------------------+---+---------+-----+----------+
only showing top 2 rows



In [50]:
# Import the os and glob modules to work with multiple files
import os
import glob

# root_dir (Python 3.10+) makes glob return file names relative to jsonsource
jsonGlob = glob.glob(root_dir=jsonsource,pathname="*.json")

In [51]:
# This loop passes each JSON file through the function,
# which writes out the Parquet files
for jsonfile in jsonGlob:
    parquetMaker(jsonfile)

/home/solverbot/Desktop/ytDE/parquetSink/GB_category_id.parquet
/home/solverbot/Desktop/ytDE/parquetSink/JP_category_id.parquet
/home/solverbot/Desktop/ytDE/parquetSink/RU_category_id.parquet
/home/solverbot/Desktop/ytDE/parquetSink/FR_category_id.parquet
/home/solverbot/Desktop/ytDE/parquetSink/CA_category_id.parquet
/home/solverbot/Desktop/ytDE/parquetSink/US_category_id.parquet
/home/solverbot/Desktop/ytDE/parquetSink/DE_category_id.parquet
/home/solverbot/Desktop/ytDE/parquetSink/KR_category_id.parquet
/home/solverbot/Desktop/ytDE/parquetSink/MX_category_id.parquet
/home/solverbot/Desktop/ytDE/parquetSink/IN_category_id.parquet
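
The file names printed above begin with a two-letter region code. A minimal sketch, assuming every file follows the `<REGION>_category_id.json` pattern seen in that output, extracts the code so it could later be stored alongside the category rows. The helper name `region_from_filename` is hypothetical.

```python
import os


def region_from_filename(file_name: str) -> str:
    """Return the text before the first underscore in the file's base name."""
    base = os.path.basename(file_name)
    region, separator, _rest = base.partition("_")
    if not separator or not region:
        raise ValueError(f"no region prefix in {file_name!r}")
    return region


for name in ["CA_category_id.json", "GB_category_id.json"]:
    print(name, "->", region_from_filename(name))
```

Inside `parquetMaker`, a line such as `df_step_1["location"] = region_from_filename(file_name)` placed before the `to_parquet` call would be one way to carry the region into the Parquet output, matching the `location` column on the CSV side.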


In [52]:
jsonParquetDf = filereader.parquet(parquetPath)

In [54]:
jsonParquetDf.createOrReplaceTempView("jsondataframe_view")

In [56]:
# The cleaned column names make the columns easier to reference in SQL
sparksql("""SELECT channelId FROM jsondataframe_view LIMIT 2""").show()

+--------------------+
|           channelId|
+--------------------+
|UCBR8-60-B28hp2Bm...|
|UCBR8-60-B28hp2Bm...|
+--------------------+



Let's first write these two dataframes to the RDS instance in AWS.

There are two ways to do it:

- Using the SparkSession itself

- Using AWS Wrangler

In addition, we can write these tables to S3 buckets and, in parallel,
register them in the Glue Catalog for Athena to query.
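
The S3 and Glue route could look roughly like the sketch below. It assumes AWS Wrangler's `s3.to_parquet` function with its `dataset`, `database`, and `table` arguments, a pandas DataFrame as input, and a Glue database that already exists. The bucket, database, and table names are placeholders, and both helper names are hypothetical. The library is imported inside the function so the argument-building part runs without AWS access.

```python
def glue_parquet_kwargs(bucket: str, database: str, table: str) -> dict:
    """Build keyword arguments for a dataset write that also registers a Glue table."""
    return {
        "path": f"s3://{bucket}/{table}/",
        "dataset": True,       # dataset mode is needed for catalog registration
        "database": database,  # assumed to exist in the Glue Catalog already
        "table": table,
        "mode": "overwrite",
    }


def write_to_s3_and_glue(pandas_df, boto_session, **kwargs):
    """Write a pandas DataFrame to S3 as Parquet and register it in Glue."""
    import awswrangler as wr  # imported lazily; requires the awswrangler package

    return wr.s3.to_parquet(df=pandas_df, boto3_session=boto_session, **kwargs)


# Placeholder names; nothing is sent to AWS until write_to_s3_and_glue is called
kwargs = glue_parquet_kwargs("example-bucket", "example_db", "yt_categories")
print(kwargs["path"])
```

Because AWS Wrangler works with pandas DataFrames, a Spark DataFrame such as `jsonParquetDf` would first need converting, for example with `.toPandas()`, which is only practical when the data fits in driver memory.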