
# Integrating Databricks with AWS Kinesis

In this notebook, we use PySpark to stream data from Kinesis into Databricks. First, we read the three Kinesis streams into Databricks and transform them. We then write the transformed streams to Delta tables in Databricks.

## Reading streaming data from Kinesis

In [0]:
# Checking that the credentials file is in the directory:
dbutils.fs.ls("/FileStore/tables")

In [0]:
# The following libraries are required:
# pyspark functions and types:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# URL processing (urllib.parse must be imported explicitly to use urllib.parse.quote):
import urllib.parse

In [0]:
# To read the authentication_credentials.csv file:
# Specifying the file type to be csv:
file_type = "csv"
# Indicating the file's first row is the header:
first_row_is_header = 'true'
# Indicating the delimiter is a comma:
delimiter = ","
# Reading the csv file to a Spark dataframe:
aws_keys_df = spark.read.format(file_type)\
    .option("header", first_row_is_header)\
    .option("sep", delimiter)\
    .load("/FileStore/tables/authentication_credentials.csv")

In [0]:
# Extracting the AWS access key and secret access key from the Spark dataframe created above:
ACCESS_KEY = aws_keys_df.where(col("User name")=='databricks-user').select('Access key ID').collect()[0]['Access key ID']
SECRET_KEY = aws_keys_df.where(col('User name')=='databricks-user').select('Secret access key').collect()[0]['Secret access key']
# URL-encoding the secret key so it can be embedded safely in a URL (this is percent-encoding, not encryption):
ENCODED_SECRET_KEY = urllib.parse.quote(string=SECRET_KEY, safe="")
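
As a side note, `urllib.parse.quote` only percent-encodes characters that are unsafe in a URL; it does not hide or protect the key. A minimal sketch with a made-up value (`fake_secret` is hypothetical, not a real credential):

```python
from urllib.parse import quote, unquote

# Hypothetical secret, used for illustration only.
fake_secret = "abc/def+ghi=="

# safe="" means that even "/" is percent-encoded.
encoded = quote(fake_secret, safe="")
print(encoded)

# The encoding is trivially reversible, so it offers no secrecy.
decoded = unquote(encoded)
print(decoded == fake_secret)
```

The encoded form matters only when the key is placed inside a URL; the Kinesis reads in this notebook pass `SECRET_KEY` directly as an option.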

In [0]:
# Once data is continuously being sent to the 3 Kinesis streams, we can use the access key and secret key to read the streaming data into Spark dataframes

In [0]:
# For the Pinterest posts data:
pin_0e1f6d6285c1_df = spark \
.readStream \
.format('kinesis') \
.option('StreamName','streaming-0e1f6d6285c1-pin') \
.option('initialPosition','earliest') \
.option('region','us-east-1') \
.option('awsAccessKey', ACCESS_KEY) \
.option('awsSecretKey', SECRET_KEY) \
.load()

# Displaying the dataframe. The data arrives from Kinesis in the default Kinesis schema: partitionKey, data, stream, shardId, sequenceNumber, approximateArrivalTimestamp
display(pin_0e1f6d6285c1_df)

partitionKey,data,stream,shardId,sequenceNumber,approximateArrivalTimestamp
pin_partition,eyJpbmRleCI6MTA2OTcsInVuaXF1ZV9pZCI6IjI3MDJlZDQ0LTkxZjctNDQ1My05MjcwLTQ3MmI1NmYwYWFhNiIsInRpdGxlIjoiQ2xhc3NpYyBGb3JkIEJyb25jb3MgfCBVbml2ZXJpc3R5IFBhcmsgfCBDb3lvdGUgQnJvbmM= (truncated),streaming-0e1f6d6285c1-pin,shardId-000000000003,49645424600970709831841044738922353066898276738488336434,2023-10-18T14:17:22.931+0000
pin_partition,eyJpbmRleCI6MTEyNiwidW5pcXVlX2lkIjoiM2I3N2Y2YmEtMWU0Zi00YmNkLWIwOWItNDM3OGZmZDEzNzM5IiwidGl0bGUiOiIxMyBXYXlzIFRvIEdldCBSaWQgb2YgRGFyayBDaXJjbGVzIFVuZGVyIFRoZSBFeWVzIiwiZGU= (truncated),streaming-0e1f6d6285c1-pin,shardId-000000000003,49645424600970709831841044739831465283248478084025810994,2023-10-18T14:17:25.302+0000
pin_partition,eyJpbmRleCI6NTM3MiwidW5pcXVlX2lkIjoiMWI4Y2ZmZmItYTg4YS00NTc3LWIwM2ItMDU2NjAxYzJmODA2IiwidGl0bGUiOiJGaW5hbmNpYWwgUGVhY2UiLCJkZXNjcmlwdGlvbiI6IjkgUGVyc29uYWwgRmluYW5jZSBNaWw= (truncated),streaming-0e1f6d6285c1-pin,shardId-000000000003,49645424600970709831841044740694638318453323452204974130,2023-10-18T14:17:27.682+0000
pin_partition,eyJpbmRleCI6MjQ2MywidW5pcXVlX2lkIjoiMGNlYTVjMTItZTZkNy00MjhmLWEzZWEtOThlMDRmZTc5NWU1IiwidGl0bGUiOiJBIENvenkgQ291Y2ggZm9yIG91ciBCaWcgRmFtaWx5ISAtIENvdHRvbiBTdGVtIiwiZGVzY3I= (truncated),streaming-0e1f6d6285c1-pin,shardId-000000000003,49645424600970709831841044741746403781518050971638300722,2023-10-18T14:17:30.122+0000
pin_partition,eyJpbmRleCI6NDkxMywidW5pcXVlX2lkIjoiNGQyZDc5YzYtOWNhOC00NmM5LWEzOGUtOTMxYzVkOTY3ODA0IiwidGl0bGUiOiJIb3cgdG8gV29yayBGcm9tIEhvbWUgYXMgYW4gRXZlbnQgUGxhbm5lciIsImRlc2NyaXB0aW8= (truncated),streaming-0e1f6d6285c1-pin,shardId-000000000003,49645424600970709831841044742677276662621315642320486450,2023-10-18T14:17:32.448+0000
pin_partition,eyJpbmRleCI6ODk5NSwidW5pcXVlX2lkIjoiNDg1Y2VkYWMtMWIxNy00NDE0LWJmNjktNDMyMmU0MjlkZjU3IiwidGl0bGUiOiIyMCBDdXRlIEJlaGluZCB0aGUgRWFyIFRhdHRvb3MgZm9yIFdvbWVuIiwiZGVzY3JpcHRpb24= (truncated),streaming-0e1f6d6285c1-pin,shardId-000000000003,49645424600970709831841044743350648344146664161351303218,2023-10-18T14:17:33.885+0000
pin_partition,eyJpbmRleCI6OTk0OSwidW5pcXVlX2lkIjoiZTM0NzEyZmQtMWZiMi00YzJlLWEwOTMtMmZkYTNhNmE1OTE3IiwidGl0bGUiOiI4IEZhaXJ5IFRhbGUgVG93bnMgSW4gR2VybWFueSBZb3UgSGF2ZSBUbyBWaXNpdCAtIFRoZUY= (truncated),streaming-0e1f6d6285c1-pin,shardId-000000000003,49645424600970709831841044743910380998628237537959739442,2023-10-18T14:17:35.273+0000
pin_partition,eyJpbmRleCI6NjM2NSwidW5pcXVlX2lkIjoiNTZhOGVhOTQtZDAwMy00NzAzLTg1MDctNDhhMjU3OGU4Y2VhIiwidGl0bGUiOiJNYXN0ZXIgQmVkcm9vbSBXaXRoIEJyaWdodCBTcHJpbmcgRWFzdGVybiBSZWRidWQgQnJhbmM= (truncated),streaming-0e1f6d6285c1-pin,shardId-000000000003,49645424600970709831841044744820702140798053443952443442,2023-10-18T14:17:37.636+0000
pin_partition,eyJpbmRleCI6OTgxNSwidW5pcXVlX2lkIjoiNzA5YWViNWYtZGYzZi00ZmI0LWI1ZGEtN2Q1NmE0NzA3OGM3IiwidGl0bGUiOiJBIEV1cm9wZSBJdGluZXJhcnkgZm9yIDIgbW9udGhzIGluIEV1cm9wZTogWW91ciBwZXJmZWM= (truncated),streaming-0e1f6d6285c1-pin,shardId-000000000003,49645424600970709831841044745814439164521278831719350322,2023-10-18T14:17:40.033+0000
pin_partition,eyJpbmRleCI6NzA5MCwidW5pcXVlX2lkIjoiZjkzZTBiMjgtZTRlYi00MThkLWJjYzItYzQ3OTgzZjQwMGY4IiwidGl0bGUiOiJGYXNoaW9uIENhdXN1YWwgR2VudGxlbWFuIE91dGVyd2VhciBDb2F0IiwiZGVzY3JpcHRpb24= (truncated),streaming-0e1f6d6285c1-pin,shardId-000000000003,49645424600970709831841044746786415523491440825622069298,2023-10-18T14:17:42.502+0000


In [0]:
# To see the data contained in the pinterest posts stream, we need to deserialise the data column of the dataframe above:
pin_0e1f6d6285c1_df = pin_0e1f6d6285c1_df.selectExpr("CAST(data as STRING)")

# Displaying the dataframe:
display(pin_0e1f6d6285c1_df)

data
"{""index"":3069,""unique_id"":""d0af510a-d6e3-4f5a-aff1-941aed7fed0d"",""title"":""World Market Inspired Yarn Cone Trees - Hello Central Avenue"",""description"":""These little World Market inspired yarn cone trees are super EASY to make, which comes in handy since there’s only a week left until Christmas! They are perfect to add to your c… "",""poster_name"":""Hello Central Avenue-Rebecca | Home Decor, DIY, & Organization"",""follower_count"":""6k"",""tag_list"":""Christmas Tree Crafts,Christmas Projects,Christmas Home,Christmas Holidays,Christmas Decorations,Christmas Ornaments,Outdoor Christmas,Christmas Angels,Happy Holidays"",""is_image_or_video"":""image"",""image_src"":""https://i.pinimg.com/originals/4b/7c/89/4b7c890b13ddac3c4e3fe9c2cc7e94b3.jpg"",""downloaded"":1,""save_location"":""Local save in /data/diy-and-crafts"",""category"":""diy-and-crafts""}"
"{""index"":780,""unique_id"":""88ccaa72-c810-4847-a489-61dd26c486fd"",""title"":""Artist Illustrates How Doing Anything Is Much Better When There Are Animals Around (29 Pics)"",""description"":""Chilling with animals is the best. There's nothing better than vibing with your furry friends; you get the best of both worlds. You can feel like you have company without the un… "",""poster_name"":""Bored Panda"",""follower_count"":""2M"",""tag_list"":""Cartoon Girl Images,Cartoon Art Styles,Girl Cartoon,Cute Cartoon Girl Drawing,Image Princesse Disney,Arte Sketchbook,Digital Art Girl,Cute Drawings,Drawing Pics"",""is_image_or_video"":""image"",""image_src"":""https://i.pinimg.com/originals/62/22/03/6222032afb7a9c42971e5b644bd9781e.jpg"",""downloaded"":1,""save_location"":""Local save in /data/art"",""category"":""art""}"
"{""index"":7598,""unique_id"":""cfb949d8-2d9b-4170-8d93-04462917d489"",""title"":""She Was A Forgiver. Her Heart Was So Large. She Didn't Know How To Give Up On People"",""description"":""Relationship Rules is a modern-age lifestyle/love blog that discusses everything from breakups to being amazing parents."",""poster_name"":""Renee Knick"",""follower_count"":""31"",""tag_list"":""Feeling Broken Quotes,Quotes Deep Feelings,Mood Quotes,Positive Quotes,Life Quotes,Broken Promises Quotes,Broken Trust Quotes,Words Hurt Quotes,Qoutes"",""is_image_or_video"":""image"",""image_src"":""https://i.pinimg.com/originals/52/36/0e/52360e5a45b1f6374733183fd40e7f87.png"",""downloaded"":1,""save_location"":""Local save in /data/quotes"",""category"":""quotes""}"
"{""index"":3145,""unique_id"":""1022ba0b-eae7-4eba-9120-1fe44d093a32"",""title"":""How to Colour Chickpeas for Play - Inspire My Play"",""description"":""Learn how to colour chickpeas for sensory play and craft with this easy DIY"",""poster_name"":""Laura- Inspire My Play | Play & Learning For Little Kids"",""follower_count"":""4k"",""tag_list"":""Baby Sensory Play,Sensory Activities Toddlers,Infant Activities,Sensory Bins,Baby Play,Kindergarten Activities,Diy For Kids,Crafts For Kids,Preschool Crafts"",""is_image_or_video"":""image"",""image_src"":""https://i.pinimg.com/originals/35/e0/44/35e0447b94dd9076ed9e51a8d692e489.jpg"",""downloaded"":1,""save_location"":""Local save in /data/diy-and-crafts"",""category"":""diy-and-crafts""}"
"{""index"":9664,""unique_id"":""dff062ed-674c-4833-8118-2f35f8087b3b"",""title"":""51 Beautiful Places to Satisfy Your Wanderlust From Afar"",""description"":""Go globe-trotting (virtually)."",""poster_name"":""House Beautiful"",""follower_count"":""876k"",""tag_list"":""Places Around The World,Travel Around The World,Around The Worlds,Beautiful Places To Travel,Wonderful Places,Amazing Places,Beautiful Things,Dream Vacations,Vacation Spots"",""is_image_or_video"":""image"",""image_src"":""https://i.pinimg.com/originals/af/0d/29/af0d29e8b6343451c7f9464ae9398129.jpg"",""downloaded"":1,""save_location"":""Local save in /data/travel"",""category"":""travel""}"
"{""index"":8613,""unique_id"":""1ae686e8-7f41-433d-bbbd-c3428de9e095"",""title"":""Personalised Viking Rune Initial Talisman Ring By Talisman Kind"",""description"":""A beautiful personalised Viking Rune initial talisman ring.Turn your initial into a gorgeous Viking Rune talisman ring. Here we are using the Elder Futhark Viking language to tu… "",""poster_name"":""notonthehighstreet.com"",""follower_count"":""750k"",""tag_list"":""Finger Tattoo Frauen,Finger Tattoos,Viking Rune Meanings,Rune Symbols And Meanings,Symbols Of Love,Viking Meaning,Tattoo Meanings,Symbole Tattoo,Nordic Symbols"",""is_image_or_video"":""image"",""image_src"":""https://i.pinimg.com/originals/d6/aa/ac/d6aaace6a7ad37c000f8eed1d72f7968.jpg"",""downloaded"":1,""save_location"":""Local save in /data/tattoos"",""category"":""tattoos""}"
"{""index"":8673,""unique_id"":""0c57a8d3-4722-4fa4-a590-3385b69f08c0"",""title"":""▷ 1001 + super coole Arm Tattoos auf einen Blick"",""description"":""Mini Tattoos am Oberarm, Saturn und Halbmond, kleine Arm Tattoos zum Entlehnen"",""poster_name"":""Archzine.net"",""follower_count"":""353k"",""tag_list"":""Dreieckiges Tattoos,Mini Tattoos,Little Tattoos,Friend Tattoos,Sleeve Tattoos,Saturn Tattoo,Ankle Tattoo Small,Small Arm Tattoos,Tattoo Planeta"",""is_image_or_video"":""image"",""image_src"":""https://i.pinimg.com/originals/87/bd/a1/87bda173ec57a1aa5a817f2af24abc5c.jpg"",""downloaded"":1,""save_location"":""Local save in /data/tattoos"",""category"":""tattoos""}"
"{""index"":1997,""unique_id"":""c5b58967-5c09-42ea-a824-ec70bdbba9f9"",""title"":""Santa wreath, Christmas Santa Wreath, Santa Stop here wreath, Christmas wreath"",""description"":""This hand made Christmas wreath will make any holiday door perfect! Made with red, green, and white deco mesh, 4 different ribbons, Wooden Santa Stop Here sign, and ornament bal… "",""poster_name"":""WreathsbyJaimeLyn and WayMakerCreations"",""follower_count"":""1k"",""tag_list"":""Christmas Wreaths For Front Door,Holiday Wreaths,Christmas Decorations,Holiday Decor,Christmas Crafts,Christmas Porch,Christmas Gnome,Rustic Christmas,Merry Christmas Sign"",""is_image_or_video"":""image"",""image_src"":""https://i.pinimg.com/originals/16/7e/55/167e5513a3820d7a909359169d62eaf2.jpg"",""downloaded"":1,""save_location"":""Local save in /data/christmas"",""category"":""christmas""}"
"{""index"":2267,""unique_id"":""54de78c0-8ab3-4e96-bfc8-584a2046e5fc"",""title"":""20 DIY Christmas Gift Baskets for Your Loved Ones"",""description"":""These DIY baskets for Christmas. These DIY gift ideas for Christmas will help you get a better gift for your loved ones and transform the way you celebrate Christmas. #christmas… "",""poster_name"":""Craftsy Hacks"",""follower_count"":""76k"",""tag_list"":""Diy Christmas Presents,Inexpensive Christmas Gifts,Neighbor Christmas Gifts,Christmas Gifts For Couples,Decoration Christmas,Handmade Christmas Gifts,Christmas Christmas,Christmas Gift Ideas,Thoughtful Christmas Gifts"",""is_image_or_video"":""image"",""image_src"":""https://i.pinimg.com/originals/c8/70/af/c870af77dc7ef58ffa07f57ae55a7a43.png"",""downloaded"":1,""save_location"":""Local save in /data/christmas"",""category"":""christmas""}"
"{""index"":3733,""unique_id"":""14f4c78b-b0aa-440a-a6d8-5641cf534429"",""title"":""Find the best global talent."",""description"":""Stuck at home? Continuing learning math with this free library of K-12 math video lessons that are perfect for homeschool and remote learning. (Tags: homeschooling, 3rd grade, 4… "",""poster_name"":""Fiverr"",""follower_count"":""565k"",""tag_list"":""Educational Websites For Kids,Free Learning Websites,Math Websites,Learning Resources,Teacher Resources,People Reading,4th Grade Math,Grade 3,Math For 5th Graders"",""is_image_or_video"":""image"",""image_src"":""https://i.pinimg.com/originals/1c/6d/59/1c6d59543e44182dfe4881ff4ab67e75.jpg"",""downloaded"":1,""save_location"":""Local save in /data/education"",""category"":""education""}"


In [0]:
# Converting the above dataframe, which contains one JSON string per row, into its constituent columns and values:
# Defining the schema for the JSON data, the schema specifies the expected data types of each JSON field:
pin_json_schema = StructType([
    StructField("index", IntegerType(), True),
    StructField("unique_id", StringType(), True),
    StructField("title", StringType(), True),
    StructField("description", StringType(), True),
    StructField("poster_name", StringType(), True),
    StructField("follower_count", StringType(), True),
    StructField("tag_list", StringType(), True),  
    StructField("is_image_or_video", StringType(), True),
    StructField("image_src", StringType(), True),
    StructField("downloaded", IntegerType(), True),
    StructField("save_location", StringType(), True),
    StructField("category", StringType(), True),          
])

# Using the from_json function, parse the JSON strings in the "data" column above according to the pin data schema defined above:
pin_parsed_df = pin_0e1f6d6285c1_df.withColumn("pin_parsed_data", from_json(col("data"), pin_json_schema))

# Selecting the values of each column from the parsed pin dataframe:
pin_E_0e1f6d6285c1_df = pin_parsed_df.select(
    col("pin_parsed_data.index").alias("index"),
    col("pin_parsed_data.unique_id").alias("unique_id"),
    col("pin_parsed_data.title").alias("title"),
    col("pin_parsed_data.description").alias("description"),
    col("pin_parsed_data.poster_name").alias("poster_name"),
    col("pin_parsed_data.follower_count").alias("follower_count"),
    col("pin_parsed_data.tag_list").alias("tag_list"),
    col("pin_parsed_data.is_image_or_video").alias("is_image_or_video"),
    col("pin_parsed_data.image_src").alias("image_src"),
    col("pin_parsed_data.downloaded").alias("downloaded"),
    col("pin_parsed_data.save_location").alias("save_location"),
    col("pin_parsed_data.category").alias("category")
)

# Showing the resulting pin dataframe:
display(pin_E_0e1f6d6285c1_df)

index,unique_id,title,description,poster_name,follower_count,tag_list,is_image_or_video,image_src,downloaded,save_location,category
9613,4d11e997-27a6-4a95-913d-2fa769725ca4,Print - Tranquil Irish Path in County Clare,"Tranquil Irish Path, County Clare, #Ireland by James A. Truett.",James A. Truett - Irish Artist,393,"Clare Ireland,Irish Decor,County Clare,Stone Walkway,Home Goods Decor,Irish Art,Irish Blessing,Ireland Travel,Countryside",image,https://i.pinimg.com/originals/ed/a8/6f/eda86f7abf26e685afab997e776f5314.jpg,1,Local save in /data/travel,travel
994,3cf9d8d2-b581-4f90-846e-d47ee16b7e89,60 Second Beauty: Burt's Bees All-Natural Lip Shine and Lip Gloss,"I have been trying to use more natural products now that I'm expecting, so I was eager to review the Burt's Bees Lip Glosses and Lip Shines that recently landed on my desk. Thes…",The Budget Babe,40k,"Burts Bees Lip Gloss,Diy Lip Gloss,Burts Bees Makeup,Lipgloss Diy,Burt's Bees Lip Shine,Mac Mehr,Gloss Labial,Natural Lips,Natural Baby",image,https://i.pinimg.com/originals/00/d8/f3/00d8f34d5b6382f4b03f6df21474b83e.jpg,1,Local save in /data/beauty,beauty
5634,5c4df202-5367-4885-9025-e9b395758223,How to Pay No Tax on Your Dividend Income,Did you have to pay tax on your dividend income? Here is how to keep that money instead of sending it to Uncle Sam. Pay no tax on dividend income.,Retire by 40,5k,"Investing In Stocks,Investing Money,Saving Money,Stock Investing,Money Savers,Saving Tips,Wealth Management,Money Management,Tax Help",image,https://i.pinimg.com/originals/3f/a3/90/3fa39017ff6188d929cc7982ad369427.jpg,1,Local save in /data/finance,finance
6424,547f52ea-9ce1-4963-8982-62f6dbc690e7,Boho Bedroom Decor Ideas,Are you thinking of refreshing your bedroom? Not sure where to start? If you've got something fun and comfortable on,The Cards We Drew,64k,"Small Room Bedroom,Room Ideas Bedroom,Home Bedroom,Small Rooms,Romantic Master Bedroom Ideas,Living Room And Bedroom In One,Cozy Master Bedroom Ideas,Modern Boho Master Bedroom,Dark Master Bedroom",image,https://i.pinimg.com/originals/cb/36/99/cb3699c00451c4767b94a86843f1013e.jpg,1,Local save in /data/home-decor,home-decor
2944,848ab3c9-92ee-4160-a2c4-f8b3babdebd7,Clay Tutorials Pt.4,No description available Story format,Myasaurus,2k,"Diy Crafts To Do,Diy Crafts Hacks,Diy Crafts Jewelry,Diy Bracelets Patterns,Polymer Clay Tools,How To Make Clay,Things To Do When Bored,Indie Room,Cute Clay",multi-video(story page format),Image src error.,0,Local save in /data/diy-and-crafts,diy-and-crafts
5289,158800d5-44af-4074-928d-96cfea2af35c,How To Make Money With Dividends,Dividends and dividend stocks are the paths to financial freedom. Find out how much money you need to save and invest to make a million dollars. Then live the life you want whil…,Dividends Diversify: Money Matters So Build Wealth & Be Rich,28k,"Ways To Earn Money,Earn Money From Home,Money Tips,Way To Make Money,Make Money Online,Investing In Stocks,Investing Money,Saving Money,Dividend Investing",image,https://i.pinimg.com/originals/f7/f4/43/f7f4434d8cea806a8688df8d411d2fe7.jpg,1,Local save in /data/finance,finance
8111,157b5520-b073-47c6-8f63-ab4ee3cdd4af,5 Areas Of Focus For Future Brides,Quotes About Wedding : Whatever our souls are made his and mine are the same | Emily Bronte Quote | Literary Wedding | Love Quotes,Odyssey,235k,"Cute Love Quotes,Love Quotes For Wedding,Famous Love Quotes,Romantic Quotes,Classic Love Quotes,Beautiful Quotes About Love,Best Book Quotes,Famous Wedding Quotes,Vintage Love Quotes",image,https://i.pinimg.com/originals/13/32/37/133237cbeac821ebd1f90dc5c5dde196.jpg,1,Local save in /data/quotes,quotes
3888,492b807e-6d41-4a65-8c29-ec14d9cba76a,Spelling Rules with Anchor Charts (also bonus Boom Cards),In this pack you will find 8 different spelling rules / anchor charts. Each rule has its own activity sheet to go with it. The pack is suitable for Grades 2 - 5. The posters can…,Teachers Pay Teachers,1M,"Phonics Chart,Phonics Rules,Spelling Rules,Phonics Words,Grade Spelling,Grammar Rules,English Phonics,English Vocabulary,English Grammar",image,https://i.pinimg.com/originals/af/db/a6/afdba631f9c6de13d882508c3b12c58f.jpg,1,Local save in /data/education,education
1262,de5449f6-f43d-4675-a855-23e56445e6af,Premium Rose Quartz Jade Roller + Gua Sha,"SHIPS SAME DAY FROM OUR US BASED WAREHOUSE In time for the holidays, we are proud to launch our updated Rose Quartz Jade Roller. Made from high quality Rose Quartz stone with an…",No Cap Beauty,84,"Face Care Routine,Face Care Tips,Skin Care Routine Steps,Oily Skin Care,Face Skin Care,Beauty Care,Beauty Hacks,Beauty Tips For Skin,Online Beauty Store",image,https://i.pinimg.com/originals/79/ba/99/79ba99921c704279178bf8cce2915983.jpg,1,Local save in /data/beauty,beauty
6529,6dc2db60-8d8a-4e03-991b-86f3fd9bacd7,2019 Holiday Home Walk Through - Jessica Sara Morris,"2019 Holiday Home Walk Through. How we styled our home for Christmas with a little bit of modern, scandanavian, mid century and farmhouse decor.",JESSICA SARA MORRIS | HOME DECOR + DIY ON A BUDGET,36k,"Apartment Decoration,Decoration Bedroom,Diy Home Decor,Garland Decoration,Decoration Home,Home Decorating,Decorate Apartment,House Decorations,Home Decor Styles",image,https://i.pinimg.com/originals/76/be/4a/76be4a33087950c75838c10c2a47ad3e.jpg,1,Local save in /data/home-decor,home-decor


In [0]:
# For the geolocation data:
geo_0e1f6d6285c1_df = spark \
.readStream \
.format('kinesis') \
.option('StreamName','streaming-0e1f6d6285c1-geo') \
.option('initialPosition','earliest') \
.option('region','us-east-1') \
.option('awsAccessKey', ACCESS_KEY) \
.option('awsSecretKey', SECRET_KEY) \
.load()

# Displaying the dataframe. The data arrives from Kinesis in the default Kinesis schema: partitionKey, data, stream, shardId, sequenceNumber, approximateArrivalTimestamp
display(geo_0e1f6d6285c1_df)

partitionKey,data,stream,shardId,sequenceNumber,approximateArrivalTimestamp
geo_partition,eyJpbmQiOjYzNjUsInRpbWVzdGFtcCI6IjIwMjItMTAtMTFUMTc6MzQ6NDIiLCJsYXRpdHVkZSI6LTc4LjIxMDQsImxvbmdpdHVkZSI6LTEyMC4yOTIsImNvdW50cnkiOiJDaHJpc3RtYXMgSXNsYW5kIn0=,streaming-0e1f6d6285c1-geo,shardId-000000000000,49645428141191709353378781674444871826857202999690788866,2023-10-18T14:17:37.247+0000
geo_partition,eyJpbmQiOjk4MTUsInRpbWVzdGFtcCI6IjIwMTktMDMtMDJUMDY6NTQ6MTQiLCJsYXRpdHVkZSI6LTEwLjc2MjEsImxvbmdpdHVkZSI6LTE3NS45ODUsImNvdW50cnkiOiJDb25nbyJ9,streaming-0e1f6d6285c1-geo,shardId-000000000000,49645428141191709353378781676516970681676677611295604738,2023-10-18T14:17:39.652+0000
geo_partition,eyJpbmQiOjcwOTAsInRpbWVzdGFtcCI6IjIwMjItMDEtMDhUMTg6NTM6NTIiLCJsYXRpdHVkZSI6LTYyLjkyLCJsb25naXR1ZGUiOi02My43OTc0LCJjb3VudHJ5IjoiQW50YXJjdGljYSAodGhlIHRlcnJpdG9yeSBTb3V0aCA= (truncated),streaming-0e1f6d6285c1-geo,shardId-000000000000,49645428141191709353378781678421028847569718698896785410,2023-10-18T14:17:42.118+0000
geo_partition,eyJpbmQiOjQzNTMsInRpbWVzdGFtcCI6IjIwMTgtMDItMDNUMjM6MTg6MDMiLCJsYXRpdHVkZSI6MTUuNTU1MywibG9uZ2l0dWRlIjotNTUuNDgwNiwiY291bnRyeSI6IkNvbmdvIn0=,streaming-0e1f6d6285c1-geo,shardId-000000000000,49645428141191709353378781679585224411858606662858309634,2023-10-18T14:17:43.474+0000
geo_partition,eyJpbmQiOjMxMDgsInRpbWVzdGFtcCI6IjIwMjAtMTAtMTVUMTI6NTQ6MjciLCJsYXRpdHVkZSI6LTczLjU4OCwibG9uZ2l0dWRlIjotMTQ2LjUzNywiY291bnRyeSI6IkRqaWJvdXRpIn0=,streaming-0e1f6d6285c1-geo,shardId-000000000000,49645428141191709353378781680702271869182524157725769730,2023-10-18T14:17:44.837+0000
geo_partition,eyJpbmQiOjI5NjMsInRpbWVzdGFtcCI6IjIwMjAtMTEtMTRUMTg6MzY6MjUiLCJsYXRpdHVkZSI6MTUuMzU4NiwibG9uZ2l0dWRlIjotODQuMDMyMiwiY291bnRyeSI6Ik5vcmZvbGsgSXNsYW5kIn0=,streaming-0e1f6d6285c1-geo,shardId-000000000000,49645428141191709353378781682593031851059804324405182466,2023-10-18T14:17:47.187+0000
geo_partition,eyJpbmQiOjEwODEzLCJ0aW1lc3RhbXAiOiIyMDE4LTA1LTI1VDE3OjMzOjEwIiwibGF0aXR1ZGUiOi04NS44Njk5LCJsb25naXR1ZGUiOi0xOC4zNTc0LCJjb3VudHJ5IjoiRG9taW5pY2FuIFJlcHVibGljIn0=,streaming-0e1f6d6285c1-geo,shardId-000000000000,49645428141191709353378781683808002299772506782423842818,2023-10-18T14:17:48.577+0000
geo_partition,eyJpbmQiOjM0MTcsInRpbWVzdGFtcCI6IjIwMTgtMDItMDZUMTM6NDE6MTkiLCJsYXRpdHVkZSI6LTQ1LjgxODMsImxvbmdpdHVkZSI6NS40MDI4LCJjb3VudHJ5IjoiTHV4ZW1ib3VyZyJ9,streaming-0e1f6d6285c1-geo,shardId-000000000000,49645428141191709353378781685570616144770636325303877634,2023-10-18T14:17:50.926+0000
geo_partition,eyJpbmQiOjMwOTAsInRpbWVzdGFtcCI6IjIwMTktMDgtMDVUMDc6MzI6MjEiLCJsYXRpdHVkZSI6MjguNTM0NywibG9uZ2l0dWRlIjoxNjQuNzIzLCJjb3VudHJ5IjoiUG9ydHVnYWwifQ==,streaming-0e1f6d6285c1-geo,shardId-000000000000,49645428141191709353378781686751736670534129097711288322,2023-10-18T14:17:52.395+0000
geo_partition,eyJpbmQiOjU0NTEsInRpbWVzdGFtcCI6IjIwMTktMTEtMDdUMDY6MDQ6NDQiLCJsYXRpdHVkZSI6LTU4LjYyOCwibG9uZ2l0dWRlIjotMTI4Ljk5NCwiY291bnRyeSI6IkJhaGFtYXMifQ==,streaming-0e1f6d6285c1-geo,shardId-000000000000,49645428141191709353378781687805919985238085806774550530,2023-10-18T14:17:53.724+0000
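
The `data` column above looks opaque because the binary payload appears to be rendered as Base64 text in the displayed output. As a rough offline illustration using only the Python standard library, the first row's `data` value can be decoded by hand to reveal the JSON record inside (the variable names here are illustrative and not part of the notebook):

```python
import base64
import json

# `data` value copied from the first geolocation row in the output above.
raw_b64 = "eyJpbmQiOjYzNjUsInRpbWVzdGFtcCI6IjIwMjItMTAtMTFUMTc6MzQ6NDIiLCJsYXRpdHVkZSI6LTc4LjIxMDQsImxvbmdpdHVkZSI6LTEyMC4yOTIsImNvdW50cnkiOiJDaHJpc3RtYXMgSXNsYW5kIn0="

# Base64 text -> raw bytes -> UTF-8 string -> Python dict.
payload_bytes = base64.b64decode(raw_b64)
payload_text = payload_bytes.decode("utf-8")
record = json.loads(payload_text)

print(payload_text)
print(record["ind"], record["country"])
```

In Spark, casting the binary `data` column to a string plays the role of the `decode("utf-8")` step here, with `from_json` taking the place of `json.loads`.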


In [0]:
# To see the data contained in the geolocation stream, we need to deserialise the data column of the dataframe above:
geo_0e1f6d6285c1_df = geo_0e1f6d6285c1_df.selectExpr("CAST(data as STRING)")

# Displaying the dataframe:
display(geo_0e1f6d6285c1_df)

data
"{""ind"":2225,""timestamp"":""2019-08-20T23:00:52"",""latitude"":-85.7998,""longitude"":-109.58,""country"":""American Samoa""}"
"{""ind"":4616,""timestamp"":""2020-12-28T07:29:40"",""latitude"":-53.5421,""longitude"":-110.613,""country"":""Pakistan""}"
"{""ind"":649,""timestamp"":""2020-10-06T14:24:08"",""latitude"":-80.7952,""longitude"":-22.2518,""country"":""Sudan""}"
"{""ind"":350,""timestamp"":""2020-07-25T07:39:25"",""latitude"":-48.665,""longitude"":-77.0735,""country"":""El Salvador""}"
"{""ind"":3135,""timestamp"":""2022-02-28T05:15:42"",""latitude"":-42.578,""longitude"":-156.509,""country"":""Haiti""}"
"{""ind"":10353,""timestamp"":""2017-10-19T23:59:57"",""latitude"":-63.1226,""longitude"":-176.877,""country"":""Guam""}"
"{""ind"":1352,""timestamp"":""2021-02-27T01:41:15"",""latitude"":60.8783,""longitude"":63.8289,""country"":""Armenia""}"
"{""ind"":8378,""timestamp"":""2022-09-29T14:41:43"",""latitude"":-87.2,""longitude"":-177.109,""country"":""Albania""}"
"{""ind"":2215,""timestamp"":""2022-08-19T09:07:21"",""latitude"":66.9713,""longitude"":163.126,""country"":""Yemen""}"
"{""ind"":6487,""timestamp"":""2018-07-29T05:27:08"",""latitude"":-5.34445,""longitude"":-177.924,""country"":""Armenia""}"


In [0]:
# Converting the above dataframe, which contains one JSON string per row, into its constituent columns and values:
# Defining the schema for the JSON data, the schema specifies the expected data types of each JSON field:
geo_json_schema = StructType([
    StructField("ind", IntegerType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("latitude", FloatType(), True),
    StructField("longitude", FloatType(), True),
    StructField("country", StringType(), True),    
])

# Using the from_json function, parse the JSON strings in the "data" column above according to the geo data schema defined above:
geo_parsed_df = geo_0e1f6d6285c1_df.withColumn("geo_parsed_data", from_json(col("data"), geo_json_schema))

# Selecting the values of each column from the parsed geo dataframe:
geo_E_0e1f6d6285c1_df = geo_parsed_df.select(
    col("geo_parsed_data.ind").alias("ind"),
    col("geo_parsed_data.timestamp").alias("timestamp"),
    col("geo_parsed_data.latitude").alias("latitude"),
    col("geo_parsed_data.longitude").alias("longitude"),
    col("geo_parsed_data.country").alias("country")
)

# Showing the resulting geo dataframe:
display(geo_E_0e1f6d6285c1_df)

ind,timestamp,latitude,longitude,country
8388,2018-10-26T02:37:43.000+0000,-89.5173,-179.689,Algeria
4059,2022-05-09T08:09:29.000+0000,-85.4776,-130.258,British Indian Ocean Territory (Chagos Archipelago)
5095,2021-10-28T21:55:36.000+0000,-73.6474,39.8754,Ireland
8695,2021-01-14T01:06:27.000+0000,-84.3984,-144.933,Bouvet Island (Bouvetoya)
594,2020-10-01T21:21:05.000+0000,-58.4025,-168.097,Albania
7276,2018-08-03T05:14:15.000+0000,-58.4743,21.875,Sweden
6198,2022-04-25T07:46:15.000+0000,32.8838,-149.384,Morocco
7666,2021-02-06T19:47:52.000+0000,-89.5173,-179.689,Algeria
2582,2020-01-21T22:47:21.000+0000,-63.5778,-31.1543,Bangladesh
9611,2020-07-06T19:36:55.000+0000,64.733,-2.5288,Thailand
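
Before relying on a hand-written schema, it can help to check one sample payload against it offline. The sketch below is a plain-Python approximation, not Spark's own validation: `expected_types` is a hypothetical mapping that mirrors the field names in `geo_json_schema`, and the sample string is one of the decoded geolocation records shown earlier.

```python
import json
from datetime import datetime

# Sample payload copied from the decoded geolocation output earlier in this notebook.
sample = '{"ind":2225,"timestamp":"2019-08-20T23:00:52","latitude":-85.7998,"longitude":-109.58,"country":"American Samoa"}'

# Plain-Python stand-ins for the Spark types in geo_json_schema (assumed mapping).
expected_types = {
    "ind": int,
    "timestamp": str,  # checked separately below as an ISO-8601 timestamp
    "latitude": float,
    "longitude": float,
    "country": str,
}

record = json.loads(sample)

# Collect the names of fields that are missing or have an unexpected type.
mismatches = [
    name
    for name, expected in expected_types.items()
    if not isinstance(record.get(name), expected)
]

# Raises ValueError if the timestamp string is not ISO-8601.
parsed_timestamp = datetime.fromisoformat(record["timestamp"])

print(mismatches)
print(parsed_timestamp)
```

An empty `mismatches` list suggests the field names and basic types line up for this one record; it says nothing about records that were not sampled.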


In [0]:
# For the user data:
user_0e1f6d6285c1_df = spark \
.readStream \
.format('kinesis') \
.option('StreamName','streaming-0e1f6d6285c1-user') \
.option('initialPosition','earliest') \
.option('region','us-east-1') \
.option('awsAccessKey', ACCESS_KEY) \
.option('awsSecretKey', SECRET_KEY) \
.load()

# Displaying the dataframe. The data arrives from Kinesis in the default Kinesis schema: partitionKey, data, stream, shardId, sequenceNumber, approximateArrivalTimestamp
display(user_0e1f6d6285c1_df)

partitionKey,data,stream,shardId,sequenceNumber,approximateArrivalTimestamp
user_partition,eyJpbmQiOjEwODEzLCJmaXJzdF9uYW1lIjoiQW5pdGEiLCJsYXN0X25hbWUiOiJNb29yZSIsImFnZSI6MjAsImRhdGVfam9pbmVkIjoiMjAxNi0wMy0wNFQxMDoxMDo0NSJ9,streaming-0e1f6d6285c1-user,shardId-000000000003,49645429742987334728239511545544367101473160574120493106,2023-10-18T14:17:48.211+0000
user_partition,eyJpbmQiOjM0MTcsImZpcnN0X25hbWUiOiJTYXJhaCIsImxhc3RfbmFtZSI6IkZveCIsImFnZSI6NDksImRhdGVfam9pbmVkIjoiMjAxNy0xMC0xMVQyMzo1NTozOCJ9,streaming-0e1f6d6285c1-user,shardId-000000000003,49645429742987334728239511546839126654280428695108714546,2023-10-18T14:17:50.580+0000
user_partition,eyJpbmQiOjMwOTAsImZpcnN0X25hbWUiOiJKb3ljZSIsImxhc3RfbmFtZSI6IkdvcmRvbiIsImFnZSI6MzksImRhdGVfam9pbmVkIjoiMjAxNy0wOC0wMlQxOTowNjoxMSJ9,streaming-0e1f6d6285c1-user,shardId-000000000003,49645429742987334728239511547662405137437991231803097138,2023-10-18T14:17:52.042+0000
user_partition,eyJpbmQiOjU0NTEsImZpcnN0X25hbWUiOiJBbWJlciIsImxhc3RfbmFtZSI6IkVyaWNrc29uIiwiYWdlIjoyNiwiZGF0ZV9qb2luZWQiOiIyMDE1LTExLTA3VDA5OjU3OjU1In0=,streaming-0e1f6d6285c1-user,shardId-000000000003,49645429742987334728239511548375671371010622513599217714,2023-10-18T14:17:53.368+0000
user_partition,eyJpbmQiOjY4MTMsImZpcnN0X25hbWUiOiJBYmlnYWlsIiwibGFzdF9uYW1lIjoiQWxpIiwiYWdlIjoyMCwiZGF0ZV9qb2luZWQiOiIyMDE1LTEwLTI0VDExOjIzOjUxIn0=,streaming-0e1f6d6285c1-user,shardId-000000000003,49645429742987334728239511549044207349257512515931209778,2023-10-18T14:17:54.699+0000
user_partition,eyJpbmQiOjEwMDAyLCJmaXJzdF9uYW1lIjoiTWljaGVsbGUiLCJsYXN0X25hbWUiOiJHYXJjaWEiLCJhZ2UiOjMxLCJkYXRlX2pvaW5lZCI6IjIwMTYtMDItMjdUMjI6NDM6NTcifQ==,streaming-0e1f6d6285c1-user,shardId-000000000003,49645429742987334728239511549773189618485133976998510642,2023-10-18T14:17:56.036+0000
user_partition,eyJpbmQiOjI0NTMsImZpcnN0X25hbWUiOiJDYXRoZXJpbmUiLCJsYXN0X25hbWUiOiJCdXJrZSIsImFnZSI6MjcsImRhdGVfam9pbmVkIjoiMjAxNi0wNC0wOVQwMjowODoxMyJ9,streaming-0e1f6d6285c1-user,shardId-000000000003,49645429742987334728239511551167081088500801621593161778,2023-10-18T14:17:58.458+0000
user_partition,eyJpbmQiOjU2MzYsImZpcnN0X25hbWUiOiJNYWNrZW56aWUiLCJsYXN0X25hbWUiOiJFc3RlcyIsImFnZSI6MjksImRhdGVfam9pbmVkIjoiMjAxNi0wNS0wMlQwMDo1MDo0MCJ9,streaming-0e1f6d6285c1-user,shardId-000000000003,49645429742987334728239511552602076036383366589408346162,2023-10-18T14:18:01.033+0000
user_partition,eyJpbmQiOjEzMzgsImZpcnN0X25hbWUiOiJHcmVnb3J5IiwibGFzdF9uYW1lIjoiQmFybmV0dCIsImFnZSI6MjAsImRhdGVfam9pbmVkIjoiMjAxNi0wMy0yOVQxNzoxMDowMyJ9,streaming-0e1f6d6285c1-user,shardId-000000000003,49645429742987334728239511553386668893313261061231607858,2023-10-18T14:18:02.384+0000
user_partition,eyJpbmQiOjQyOTUsImZpcnN0X25hbWUiOiJBbGV4YW5kcmlhIiwibGFzdF9uYW1lIjoiQWx2YXJhZG8iLCJhZ2UiOjIwLCJkYXRlX2pvaW5lZCI6IjIwMTUtMTAtMjNUMDQ6MTM6MjMifQ==,streaming-0e1f6d6285c1-user,shardId-000000000003,49645429742987334728239511554735830108003187357642653746,2023-10-18T14:18:04.807+0000
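
The three read cells above differ only in the stream name, so the repetition could be folded into a small helper. The following is one possible sketch rather than part of the original notebook: `stream_name_for`, `kinesis_read_options` and `read_kinesis_stream` are hypothetical names, the option keys are copied verbatim from the cells above, and `spark` is assumed to be the session that Databricks provides.

```python
def stream_name_for(suffix, user_id="0e1f6d6285c1"):
    """Build a stream name following the naming pattern used in this notebook."""
    return f"streaming-{user_id}-{suffix}"


def kinesis_read_options(stream_name, access_key, secret_key,
                         region="us-east-1", initial_position="earliest"):
    """Return the reader options used by the read cells above as a dict."""
    return {
        "StreamName": stream_name,
        "initialPosition": initial_position,
        "region": region,
        "awsAccessKey": access_key,
        "awsSecretKey": secret_key,
    }


def read_kinesis_stream(spark, stream_name, access_key, secret_key):
    """Return a streaming dataframe for one Kinesis stream."""
    options = kinesis_read_options(stream_name, access_key, secret_key)
    return spark.readStream.format("kinesis").options(**options).load()


# Usage inside Databricks (commented out because it needs a live cluster):
# geo_df = read_kinesis_stream(spark, stream_name_for("geo"), ACCESS_KEY, SECRET_KEY)

print(stream_name_for("geo"))
```

Keeping the option dict in its own function makes it easy to inspect what will be sent to the reader without starting a stream.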


In [0]:
# To see the data contained in the user stream, we need to deserialise the data column of the dataframe above:
user_0e1f6d6285c1_df = user_0e1f6d6285c1_df.selectExpr("CAST(data as STRING)")

# Displaying the dataframe:
display(user_0e1f6d6285c1_df)

data
"{""ind"":2215,""first_name"":""Collin"",""last_name"":""Nolan"",""age"":21,""date_joined"":""2017-07-15T18:25:40""}"
"{""ind"":6487,""first_name"":""Dylan"",""last_name"":""Holmes"",""age"":32,""date_joined"":""2016-10-23T14:06:51""}"
"{""ind"":4661,""first_name"":""Brandy"",""last_name"":""Nolan"",""age"":39,""date_joined"":""2017-02-16T02:20:16""}"
"{""ind"":9376,""first_name"":""Amy"",""last_name"":""Adams"",""age"":20,""date_joined"":""2015-10-24T05:05:28""}"
"{""ind"":3912,""first_name"":""Antonio"",""last_name"":""Davis"",""age"":21,""date_joined"":""2015-12-05T06:09:18""}"
"{""ind"":4034,""first_name"":""Amanda"",""last_name"":""Bennett"",""age"":23,""date_joined"":""2016-07-26T16:57:11""}"
"{""ind"":2349,""first_name"":""Jaclyn"",""last_name"":""Browning"",""age"":30,""date_joined"":""2016-02-05T19:15:02""}"
"{""ind"":10773,""first_name"":""April"",""last_name"":""Davis"",""age"":26,""date_joined"":""2016-05-17T22:11:51""}"
"{""ind"":1407,""first_name"":""Jason"",""last_name"":""Morales"",""age"":58,""date_joined"":""2017-07-03T19:30:09""}"
"{""ind"":4977,""first_name"":""Amanda"",""last_name"":""Ball"",""age"":25,""date_joined"":""2016-01-13T17:36:30""}"

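The `data` column of a Kinesis source holds raw bytes, and the table output further up appears to render those bytes as Base64 text. As a minimal sketch of what the cast to string recovers, the plain-Python snippet below (no Spark involved) decodes one payload copied from that earlier output. The variable names are illustrative only.

```python
import base64
import json

# One Base64-rendered payload copied from the user stream output shown earlier.
encoded_payload = (
    "eyJpbmQiOjU0NTEsImZpcnN0X25hbWUiOiJBbWJlciIs"
    "Imxhc3RfbmFtZSI6IkVyaWNrc29uIiwiYWdlIjoyNiwi"
    "ZGF0ZV9qb2luZWQiOiIyMDE1LTExLTA3VDA5OjU3OjU1In0="
)

# Base64 -> bytes -> UTF-8 text: the text is the JSON document sent to the stream.
decoded_text = base64.b64decode(encoded_payload).decode("utf-8")

# The decoded text is ordinary JSON, so it can be loaded into a dict.
record = json.loads(decoded_text)

print(decoded_text)
```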

In [0]:
# Converting the above dataframe, which holds one JSON string per row, into its constituent columns and values:
# Defining the schema for the JSON data; the schema specifies the expected data type of each JSON field:
user_json_schema = StructType([
    StructField("ind", IntegerType(), True),
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("date_joined", TimestampType(), True),    
])

# Using the from_json function, parse the JSON strings in the "data" column above according to the user data schema defined above:
user_parsed_df = user_0e1f6d6285c1_df.withColumn("user_parsed_data", from_json(col("data"), user_json_schema))

# Selecting the values of each column from the parsed user dataframe:
user_E_0e1f6d6285c1_df = user_parsed_df.select(
    col("user_parsed_data.ind").alias("ind"),
    col("user_parsed_data.first_name").alias("first_name"),
    col("user_parsed_data.last_name").alias("last_name"),
    col("user_parsed_data.age").alias("age"),
    col("user_parsed_data.date_joined").alias("date_joined")
)

# Showing the resulting user dataframe:
display(user_E_0e1f6d6285c1_df)

ind,first_name,last_name,age,date_joined
4241,Alexandria,Alvarado,20,2015-10-23T04:13:23.000+0000
7567,Mark,Russell,56,2017-05-18T20:35:15.000+0000
6146,Adrian,Allen,21,2015-10-21T22:43:58.000+0000
4359,Jason,Barnes,27,2017-07-10T21:06:14.000+0000
5836,Brian,Davis,38,2015-12-09T10:50:57.000+0000
7195,Amber,Barnes,32,2017-08-14T00:57:09.000+0000
4535,Michael,Snyder,51,2017-07-22T13:03:59.000+0000
2667,Amber,Casey,23,2016-01-06T12:22:08.000+0000
8875,Aaron,Alexander,21,2015-10-25T07:36:08.000+0000
7415,April,Anderson,30,2016-06-27T08:35:51.000+0000
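
For intuition, the schema-driven parsing above can be mimicked for a single record in plain Python. This is only a sketch: `parse_user_record` is a hypothetical helper, not part of the notebook, and it approximates rather than reproduces Spark's `from_json` (here, malformed JSON simply yields `None` for every field).

```python
import json
from datetime import datetime

# Field names taken from the user schema defined above.
USER_FIELDS = ("ind", "first_name", "last_name", "age", "date_joined")


def parse_user_record(json_text):
    """Parse one user JSON string into a dict (hypothetical helper)."""
    try:
        raw = json.loads(json_text)
    except (TypeError, ValueError):
        # Malformed or missing input: return an all-None record instead of raising.
        return dict.fromkeys(USER_FIELDS)
    if not isinstance(raw, dict):
        return dict.fromkeys(USER_FIELDS)

    date_text = raw.get("date_joined")
    return {
        "ind": raw.get("ind"),
        "first_name": raw.get("first_name"),
        "last_name": raw.get("last_name"),
        "age": raw.get("age"),
        # ISO 8601 text becomes a datetime, mirroring the TimestampType field.
        "date_joined": datetime.fromisoformat(date_text) if date_text else None,
    }


sample = '{"ind":2215,"first_name":"Collin","last_name":"Nolan","age":21,"date_joined":"2017-07-15T18:25:40"}'
parsed = parse_user_record(sample)
print(parsed)
```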


## Transforming the incoming three streams to ensure the data is clean and ready for analytics

### Transforming the Pinterest posts data:

In [0]:
# Renaming the 'index' column to 'ind':
pin_T_0e1f6d6285c1_df = pin_E_0e1f6d6285c1_df.withColumnRenamed("index", "ind")

In [0]:
# Reordering the columns in the dataframe (whilst omitting the 'downloaded' column):
# new desired order of columns:
desired_pin_column_order = ["ind", "unique_id", "title", "description", "follower_count", "poster_name", "tag_list", "is_image_or_video", "image_src", "save_location", "category"]

pin_T_0e1f6d6285c1_df = pin_T_0e1f6d6285c1_df.select(desired_pin_column_order)

In [0]:
# A count of the distinct values in the 'follower_count' column returned several hits for 'User Info Error'. On further investigation, these rows contain no useful data, so we drop them from the dataframe:
pin_T_0e1f6d6285c1_df = pin_T_0e1f6d6285c1_df.filter(col("follower_count") != "User Info Error")

In [0]:
# A count of the distinct values in the 'ind' column shows that there are several identical rows of data, so we drop the duplicates:
pin_T_0e1f6d6285c1_df = pin_T_0e1f6d6285c1_df.dropDuplicates(["ind"])

In [0]:
# During EDA, the 'follower_count' column was found to hold values such as '10k', '500', '100k' and '2M', so we need to convert all of these to integers.
# A custom user-defined function (UDF) strips the 'k' or 'M' suffix from the 'follower_count' values and returns the integer equivalent:

# First, defining the function: values ending in 'M' are multiplied by 1,000,000, values ending in 'k' by 1,000, and all other values are converted directly:
def convert_follower_count(value):
    if 'M' in value:
        return int(value.strip('M')) * 1000000
    elif 'k' in value:
        return int(value.strip('k')) * 1000
    else:
        return int(value)

# Registering the function as a Spark UDF (no return type is given, so the UDF returns strings, hence the cast further down):
convert_follower_count_udf = udf(convert_follower_count)

# Applying the UDF to the "follower_count" column
pin_T_0e1f6d6285c1_df = pin_T_0e1f6d6285c1_df.withColumn("follower_count", convert_follower_count_udf(col("follower_count")))

# Casting the 'follower_count' column to integer data type:
pin_T_0e1f6d6285c1_df = pin_T_0e1f6d6285c1_df.withColumn("follower_count", col("follower_count").cast("integer"))
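
The conversion function above calls `int()` directly, which raises an error on a decimal value such as '1.5M' or on a missing value. Whether such values occur in this stream is not known, so the following is only a sketch of a more defensive variant in plain Python. The name `parse_follower_count` is hypothetical and not part of the notebook.

```python
# Multipliers for the suffixes seen during EDA ('k' and 'M').
SUFFIX_MULTIPLIERS = {"k": 1_000, "M": 1_000_000}


def parse_follower_count(value):
    """Convert strings such as '10k', '2M' or '500' to an int, or None if unparseable."""
    if value is None:
        return None
    text = value.strip()
    suffix = text[-1:]  # empty string when text is empty
    if suffix in SUFFIX_MULTIPLIERS:
        number_text, factor = text[:-1], SUFFIX_MULTIPLIERS[suffix]
    else:
        number_text, factor = text, 1
    try:
        # float() accepts both '10' and '1.5'; round() guards against float noise.
        return int(round(float(number_text) * factor))
    except ValueError:
        # Anything non-numeric (e.g. 'User Info Error') becomes None.
        return None


for raw in ["10k", "500", "100k", "2M", "1.5k", "User Info Error", None]:
    print(repr(raw), "->", parse_follower_count(raw))
```

If this variant were used in the notebook, it could be registered with an explicit return type, for example `udf(parse_follower_count, IntegerType())`, which would make the separate cast to integer unnecessary.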

In [0]:
# The "save_location" column should only contain the save path, so we need to remove the prefix from each value:
# Creating a custom UDF to remove the "Local save in " prefix from each row in the column:

def remove_prefix_save_location(value):
    return value.replace('Local save in ', '')

# Registering the function as a Spark UDF:
remove_prefix_save_location_udf = udf(remove_prefix_save_location)

# Applying the UDF to the "save_location" column
pin_T_0e1f6d6285c1_df = pin_T_0e1f6d6285c1_df.withColumn("save_location", remove_prefix_save_location_udf(col("save_location")))

In [0]:
# During EDA, several columns were found to contain values that are empty or carry no relevant data.
# We will now replace all these values with None:

# Replacing the "Image src error." with None in the image_src column:
pin_T_0e1f6d6285c1_df = pin_T_0e1f6d6285c1_df.withColumn("image_src", when(col("image_src") == "Image src error.", None).otherwise(col("image_src")))

# Replacing the "N,o, ,T,a,g,s, ,A,v,a,i,l,a,b,l,e" with None in the tag_list column:
pin_T_0e1f6d6285c1_df = pin_T_0e1f6d6285c1_df.withColumn("tag_list", when(col("tag_list") == "N,o, ,T,a,g,s, ,A,v,a,i,l,a,b,l,e", None).otherwise(col("tag_list")))

# Replacing the "No description available" & the "No description available Story format" with None from the description column:
pin_T_0e1f6d6285c1_df = pin_T_0e1f6d6285c1_df.withColumn("description", when((col("description") == "No description available") | (col("description") == "No description available Story format"), None).otherwise(col("description")))

# Replacing blanks and "Loading..." with None in the 'title' column:
pin_T_0e1f6d6285c1_df = pin_T_0e1f6d6285c1_df.withColumn("title", when((col("title") == "") | (col("title") == "Loading..."), None).otherwise(col("title")))
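
The four replacements above all follow the same rule: a column has a known set of placeholder strings, and any of them becomes `None`. The plain-Python sketch below states that rule in one place for a single row represented as a dict. `PLACEHOLDERS` and `null_out_placeholders` are hypothetical names used for illustration; the placeholder strings themselves are the ones used in the cell above.

```python
# Placeholder strings per column, copied from the cleaning steps above.
PLACEHOLDERS = {
    "image_src": ("Image src error.",),
    "tag_list": ("N,o, ,T,a,g,s, ,A,v,a,i,l,a,b,l,e",),
    "description": (
        "No description available",
        "No description available Story format",
    ),
    "title": ("", "Loading..."),
}


def null_out_placeholders(row):
    """Return a copy of the row with known placeholder values replaced by None."""
    return {
        column: (None if value in PLACEHOLDERS.get(column, ()) else value)
        for column, value in row.items()
    }


example_row = {
    "ind": 1,
    "title": "Loading...",
    "description": "No description available",
    "tag_list": "Beauty Care,Diy Beauty",
    "image_src": "Image src error.",
}
cleaned_row = null_out_placeholders(example_row)
print(cleaned_row)
```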

### Transforming the Geolocation data:

In [0]:
# Creating a new column called "coordinates" that contains an array built from the latitude and longitude columns:
geo_T_0e1f6d6285c1_df = geo_E_0e1f6d6285c1_df.withColumn("coordinates", array(col("latitude"), col("longitude")))

In [0]:
# Dropping the latitude and longitude columns and reordering the columns:
desired_geo_columns = ["ind", "country", "coordinates", "timestamp"]

geo_T_0e1f6d6285c1_df = geo_T_0e1f6d6285c1_df.select(desired_geo_columns)

In [0]:
# A count of the distinct values in the 'ind' column shows that there are several identical rows of data, so we drop the duplicates:
geo_T_0e1f6d6285c1_df = geo_T_0e1f6d6285c1_df.dropDuplicates(["ind"])
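
On a streaming dataframe, `dropDuplicates` has to remember the keys it has already seen, and without a watermark that state can keep growing. A pattern described in the Spark Structured Streaming guide is to set a watermark on an event-time column and include that column in the deduplication key. The sketch below assumes that the geolocation `timestamp` column is a suitable event time and that a delay of 10 minutes is acceptable; both are assumptions, and the function name is hypothetical. Note that this changes the semantics: rows that share an `ind` but differ in `timestamp` are no longer treated as duplicates.

```python
def drop_duplicates_with_watermark(streaming_df, key_col="ind",
                                   event_time_col="timestamp",
                                   delay="10 minutes"):
    """Deduplicate a streaming dataframe while letting Spark discard old state.

    Assumes `streaming_df` is a streaming Spark DataFrame that contains both
    `key_col` and `event_time_col`; the default values are illustrative only.
    """
    return (
        streaming_df
        # Records arriving later than `delay` behind the newest event time may be dropped.
        .withWatermark(event_time_col, delay)
        # The event-time column must be part of the key for state to be bounded.
        .dropDuplicates([key_col, event_time_col])
    )
```

In the notebook this could be called as `drop_duplicates_with_watermark(geo_T_0e1f6d6285c1_df)` in place of the plain `dropDuplicates(["ind"])` call.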

### Transforming the User data:

In [0]:
# Creating a new column 'user_name' which concatenates the information found in the 'first_name' and 'last_name' columns:
user_T_0e1f6d6285c1_df = user_E_0e1f6d6285c1_df.withColumn("user_name", concat(col("first_name"), lit(" "), col("last_name")))

In [0]:
# Dropping the first_name and last_name columns and reordering the columns:
desired_user_columns = ["ind", "user_name", "age", "date_joined"]

user_T_0e1f6d6285c1_df = user_T_0e1f6d6285c1_df.select(desired_user_columns)

In [0]:
# A count of the distinct values in the 'ind' column shows that there are several identical rows of data, so we drop the duplicates:
user_T_0e1f6d6285c1_df = user_T_0e1f6d6285c1_df.dropDuplicates(["ind"])

In [0]:
# Displaying the fully transformed Pinterest posts dataframe:
display(pin_T_0e1f6d6285c1_df)

ind,unique_id,title,description,follower_count,poster_name,tag_list,is_image_or_video,image_src,save_location,category
1270,a61c8535-5585-4683-84d1-3caa5c44007f,DIY All-Natural Pore-Perfecting Rose Facial Toner | Body Unburdened,"Many people associate roses with beauty and romance. But me? Roses make me think of my grandfather. Specifically his nose. You see, my grandfather’s honker was a bit bulbous, an…",62000,Body Unburdened | Natural Skincare | Clear Skin | Holistic Health,"Beauty Care,Diy Beauty,Beauty Skin,Beauty Hacks,Beauty Ideas,Face Beauty,Beauty Makeup,Natural Beauty Tips,Health And Beauty Tips",image,https://i.pinimg.com/originals/b3/e6/3b/b3e63b5f71349ffb8187aab278520d69.png,/data/beauty,beauty
2393,6c9ffdc6-d361-44af-8259-91a862ff3868,61 Stunning Christmas Tree Decorations - Chaylor & Mads,"The best Christmas tree decorations for every style. Some are so unique they will blow your mind! Plus, find where to get the best trees and decor this year.",91000,"Kristen | Lifestyle, Mom Tips & Teacher Stuff Blog","White Christmas Tree Decorations,Elegant Christmas Trees,Christmas Holidays,Christmas Wreaths,Christmas Ideas,Flocked Christmas Trees Decorated,Christmas 2019,Xmas Trees,Christmas Quotes",image,https://i.pinimg.com/originals/90/b5/89/90b589959a296be4c2b402bb265504b0.png,/data/christmas,christmas
2572,cd7502cb-8a34-4b59-b677-19ba7f4b79a4,Top 5 Moments of Hallmark’s ‘Christmas at the Plaza’ With Ryan Paevey,A Christmas movie and a holiday mystery in one. Ryan Paevey played Detective Nathan West on General Hospital until his demise in 2018 and Soaps.com has been following him for th…,787000,SheKnows,"Christmas Party Ideas For Teens,Christmas Party Games,Christmas Activities,Christmas Traditions,Holiday Fun,Holiday Ideas,Christmas Drinking Games,Movie Drinking Games,Merry Little Christmas",image,https://i.pinimg.com/originals/45/58/39/455839acee56740f449b87237947cc6f.jpg,/data/christmas,christmas
9454,6f5c89d1-b551-47da-88b7-83dbcee5de41,Australian Shepherd Dog Tattoos Combo Outfit Legging + Hollow Tank For Women Pl - M / L,"Save on shipping and order more than one combo. PLEASE READ THE SIZE CHART CAREFULLY BEFORE CHOOSING YOUR SIZE. - Material: Polyester fiber (polyester), Moderate softness, cotto…",2000,Amaze Style,"Dachshund Shirt,Dog Hoodie,Australian Shepherd Dogs,Dog Mom Shirt,Dog Tattoos,Beautiful Dogs,Fashion Company,High Definition,Size Chart",image,https://i.pinimg.com/originals/1e/09/b7/1e09b747aa44759ad8d7231f34ffba4b.jpg,/data/tattoos,tattoos
8924,b055bda3-ca57-492b-b42e-532327fbc921,5000+ Best Tattoo Ideas | Med Tech,#designtattoo #tattoo Unterarm Armbinde Tattoo-Designs Tribal-Bilder Tattoo-Samoa-Design P #Tattoo #TattooDesigns #tattoos,413,Grace,"Forearm Tattoos,Arm Band Tattoo,Body Art Tattoos,Tribal Tattoos,Sleeve Tattoos,Tatoos,Turtle Tattoos,Buddha Tattoos,Stomach Tattoos",image,https://i.pinimg.com/originals/40/01/2f/40012f47d580f121c861206d5ce5af3b.jpg,/data/tattoos,tattoos
5345,98b99fa5-f6b7-4a26-b95c-8e602d222a20,14 Money Tips Dave Ramsey Wish Everyone Knew Sooner,"In the realm of personal finance, Dave Ramsey seems to have a huge influence on people, and for good reasons. Dave Ramsey provides good advice on personal finance, financial pla…",30000,Prosmartrepreneur,"Best Money Saving Tips,Ways To Save Money,Money Tips,Saving Money,Money Budget,Money Hacks,Budgeting Finances,Budgeting Tips,Faire Son Budget",image,https://i.pinimg.com/originals/50/29/9a/50299a2b700e6071f7301a6bb2eba2ed.png,/data/finance,finance
6825,89cf5cc7-d755-4948-8563-60fa485e6953,This Fall's Best New Henleys,"There's a good reason why the casual, collarless button-up is a favorite of both Walt Whitman and Ryan Gosling.",49000,Men's Journal,"Mode Masculine,Masculine Style,Sharp Dressed Man,Well Dressed Men,Stylish Men,Men Casual,Casual Attire,Casual Fall,Casual Styles",image,https://i.pinimg.com/originals/71/f2/69/71f26954c698e73ee68ebe11d1620725.jpg,/data/mens-fashion,mens-fashion
1139,8c4606b6-b9fe-4f15-ac51-f0654c359a46,Make-Up-Quickie: Fenty Beauty endlich in der Schweiz!,"Rihanna's viel-gehypte Make-Up-Linie Fenty Beauty ist endlich auch in der Schweiz erhältlich, thanks to Sephora. Wir tauchen gleich in die Kollektion ein!",7000,Hey Pretty Beauty Blog,"Beauty Care,Diy Beauty,Beauty Skin,Health And Beauty,Homemade Beauty,Healthy Beauty,Beauty Ideas,Face Beauty,Sephora",image,https://i.pinimg.com/originals/01/62/bd/0162bd977635d9fe3bf54a6275f44890.jpg,/data/beauty,beauty
6266,9862005f-97e2-4c97-ada6-05d732844415,In This House Stairs Version 3 Decor Decal Sticker Wall Vinyl Art - saphire blue,"In This House The latest in home decorating. Beautiful wall vinyl decals, that are simple to apply, are a great accent piece for any room, come in an array of colors, and are a…",7000,Boop Decals,"Inexpensive Home Decor,Easy Home Decor,Cheap Home Decor,Painted Staircases,Painted Stairs,Staircase Painting,Escalier Design,Foyer Decorating,Budget Decorating",image,https://i.pinimg.com/originals/a4/57/56/a457562af7802a3ca441e6b24fe116cd.jpg,/data/home-decor,home-decor
7879,1512de5c-0ef6-46e4-bc8c-9a258c80e957,25 Best Quotes & Funny Memes About Writing To Celebrate National Author's Day,National Author's Day is November 1 - so what better day to celebrate your favorite author and the books they write? Look to these funny memes about writing and author quotes fr…,942000,YourTango,"Writer Memes,Writer Quotes,Book Memes,Quotes About Writers,Famous Author Quotes,Writing Advice,Writing A Book,Writing Ideas,Writing Corner",image,https://i.pinimg.com/originals/44/5e/53/445e53ea45bdc8696a6ae339c4ebf7fd.jpg,/data/quotes,quotes


In [0]:
# Displaying fully transformed geolocation dataframe:
display(geo_T_0e1f6d6285c1_df)

ind,country,coordinates,timestamp
1270,Western Sahara,"List(-19.774, -160.69)",2018-04-02T13:53:37.000+0000
2393,Anguilla,"List(-89.1797, -174.015)",2021-07-26T13:12:01.000+0000
2572,Algeria,"List(-53.9169, -104.473)",2018-04-18T03:19:13.000+0000
9454,Azerbaijan,"List(-68.2139, -44.4492)",2018-02-28T22:49:47.000+0000
8924,Benin,"List(-86.738, -169.385)",2018-07-03T22:33:10.000+0000
5345,Congo,"List(-60.4924, -95.676)",2021-04-23T05:40:29.000+0000
6825,United Arab Emirates,"List(-44.2199, 69.4942)",2020-02-24T14:55:16.000+0000
1139,France,"List(-68.4636, 112.577)",2018-05-06T18:02:56.000+0000
6266,Argentina,"List(-89.63, -179.022)",2022-02-28T18:16:08.000+0000
7879,Algeria,"List(-89.5173, -179.689)",2018-09-10T05:40:21.000+0000


In [0]:
# Displaying fully transformed user dataframe:
display(user_T_0e1f6d6285c1_df)

ind,user_name,age,date_joined
8258,Christopher Dixon,60,2017-05-23T01:47:05.000+0000
6790,Sean Franklin,52,2015-11-25T03:43:16.000+0000
9468,Christine Middleton,33,2016-05-31T12:01:31.000+0000
8396,Robin Miller,50,2016-04-13T01:27:55.000+0000
5016,Brooke Brown,35,2016-03-24T12:29:52.000+0000
3741,Audrey Carlson,43,2016-06-14T12:18:24.000+0000
2323,Andre Carey,28,2015-11-16T13:20:58.000+0000
6970,Joan Brewer,51,2016-10-15T10:00:11.000+0000
10257,Nicole Armstrong,57,2016-05-15T20:39:08.000+0000
8074,Kenneth Townsend,45,2016-10-12T02:46:02.000+0000


## Loading the streaming data to Delta tables in Databricks

The transformations for each of the 3 streams are now complete, so we need to store the results in Databricks. We will do this by writing each dataframe to a Databricks Delta table.

In [0]:
# Writing the Pinterest posts dataframe to a delta table.
# We set a checkpoint location so that the query can recover its previous state after a failure (the data will be safe).
pin_T_0e1f6d6285c1_df.writeStream \
  .format("delta") \
  .outputMode("append") \
  .option("checkpointLocation", "/tmp/kinesis/_checkpoints_pin/") \
  .table("0e1f6d6285c1_pin_table")

In [0]:
# Writing the geolocation dataframe to a delta table.
# We set a checkpoint location so that the query can recover its previous state after a failure (the data will be safe).
geo_T_0e1f6d6285c1_df.writeStream \
  .format("delta") \
  .outputMode("append") \
  .option("checkpointLocation", "/tmp/kinesis/_checkpoints_geo/") \
  .table("0e1f6d6285c1_geo_table")

In [0]:
# Writing the user dataframe to a delta table.
# We set a checkpoint location so that the query can recover its previous state after a failure (the data will be safe).
user_T_0e1f6d6285c1_df.writeStream \
  .format("delta") \
  .outputMode("append") \
  .option("checkpointLocation", "/tmp/kinesis/_checkpoints_user/") \
  .table("0e1f6d6285c1_user_table")

In [0]:
# To run 'writeStream' again from scratch, the checkpoint folder for each of the 3 dataframes must be deleted, unless the checkpoint location is changed. Changing the location instead lets us keep track of the earlier checkpoints if necessary:
# Removing Pinterest posts checkpoint folder:
dbutils.fs.rm("/tmp/kinesis/_checkpoints_pin/", True)

# Removing geolocation checkpoint folder:
dbutils.fs.rm("/tmp/kinesis/_checkpoints_geo/", True)

# Removing user checkpoint folder:
dbutils.fs.rm("/tmp/kinesis/_checkpoints_user/", True)
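
As an alternative to deleting the checkpoint folders, each run can be given its own checkpoint location. The following is a minimal sketch, assuming a hypothetical `checkpoint_path` helper and a run label chosen by the user; only the `/tmp/kinesis/` base path and the `_checkpoints_` naming come from the cells above.

```python
import posixpath

# Base folder used by the writeStream cells above.
CHECKPOINT_ROOT = "/tmp/kinesis"


def checkpoint_path(stream_name, run_label):
    """Build a per-run checkpoint folder, e.g. for stream 'pin' and run 'v2'."""
    folder = f"_checkpoints_{stream_name}_{run_label}"
    # posixpath keeps forward slashes regardless of the local operating system.
    return posixpath.join(CHECKPOINT_ROOT, folder) + "/"


for name in ("pin", "geo", "user"):
    print(checkpoint_path(name, "v2"))
```

The returned string would be passed to `.option("checkpointLocation", ...)`. Bear in mind that a fresh checkpoint makes the query start reading again from its configured initial position, so writing to the same table may append rows that are already there.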