# Mount S3 bucket to Databricks

The S3 bucket containing the data for the Pinterest posts, users and geolocation will first have to be mounted to Databricks.

Once mounted, the data within each topic will then be read into Spark dataframes so that we can clean and query this data.

In [0]:
dbutils.fs.ls("/FileStore/tables")

In [0]:
# The following libraries are required:
# pyspark functions
from pyspark.sql.functions import *
# URL processing
import urllib.parse

In [0]:
# To read the authentication_credentials.csv files:
# Specifying the file type to be csv:
file_type = "csv"
# Indicating the file's first row is the header:
first_row_is_header = 'true'
# Indicating the delimiter is a comma:
delimiter = ","
# Reading the csv file to a Spark dataframe:
aws_keys_df = spark.read.format(file_type)\
    .option("header", first_row_is_header)\
    .option("sep", delimiter)\
    .load("/FileStore/tables/authentication_credentials.csv")

In [0]:
# Extracting the AWS access key and secret access key from the Spark dataframe created above:
ACCESS_KEY = aws_keys_df.where(col("User name")=='databricks-user').select('Access key ID').collect()[0]['Access key ID']
SECRET_KEY = aws_keys_df.where(col('User name')=='databricks-user').select('Secret access key').collect()[0]['Secret access key']
# URL-encoding the secret key so that special characters (such as "/") are safe to embed in the source URL:
ENCODED_SECRET_KEY = urllib.parse.quote(string=SECRET_KEY, safe="")
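
The effect of `urllib.parse.quote` with `safe=""` can be checked in isolation. The sketch below uses a made-up secret key (an assumption for illustration, not a real credential) to show that characters such as `/` and `+`, which AWS secret keys can contain, are percent-encoded so they cannot be mistaken for URL delimiters.

```python
import urllib.parse

# Made-up value for illustration only; not a real AWS secret key.
example_secret = "abc/def+ghi"

# safe="" means no reserved character (including "/") is left unescaped.
encoded = urllib.parse.quote(example_secret, safe="")

# With the default safe="/", the slash would be left as-is.
default_encoded = urllib.parse.quote(example_secret)

print(encoded)          # abc%2Fdef%2Bghi
print(default_encoded)  # abc/def%2Bghi
```

An unescaped slash inside the secret would be read as a path separator in the source URL, which is why `safe=""` is passed in the cell above.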

In [0]:
# Mounting the AWS S3 to Databricks:
# AWS S3 bucket name
AWS_S3_BUCKET = "user-0e1f6d6285c1-bucket"
# Mount name for the bucket
MOUNT_NAME = "/mnt/pinterest_project1"
# Source url
SOURCE_URL = "s3n://{0}:{1}@{2}".format(ACCESS_KEY, ENCODED_SECRET_KEY, AWS_S3_BUCKET)
# Mounting the drive
dbutils.fs.mount(SOURCE_URL, MOUNT_NAME)
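
Re-running the mount cell when the mount point already exists typically raises an error. A minimal sketch of a guard is shown below, assuming `dbutils.fs.mounts()` returns entries with a `mountPoint` attribute, as it does on Databricks. The helper name `mount_if_absent` and the `FakeFs` stand-in are hypothetical; the stand-in exists only so the logic can be tried outside Databricks.

```python
from types import SimpleNamespace


def mount_if_absent(dbutils, source_url, mount_name):
    """Mount source_url at mount_name unless that mount point already exists.

    Returns True if a new mount was created, False if it was already there.
    """
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_name in existing:
        return False
    dbutils.fs.mount(source_url, mount_name)
    return True


class FakeFs:
    """Hypothetical in-memory stand-in for dbutils.fs, for local testing only."""

    def __init__(self):
        self._mounts = []

    def mounts(self):
        return list(self._mounts)

    def mount(self, source, mount_point):
        self._mounts.append(SimpleNamespace(mountPoint=mount_point, source=source))


fake_dbutils = SimpleNamespace(fs=FakeFs())
first = mount_if_absent(fake_dbutils, "s3n://example-bucket", "/mnt/pinterest_project1")
second = mount_if_absent(fake_dbutils, "s3n://example-bucket", "/mnt/pinterest_project1")
print(first, second)  # True False
```

On Databricks, the real `dbutils` object would be passed in place of `fake_dbutils`, together with `SOURCE_URL` and `MOUNT_NAME`.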

In [0]:
# Checking if the S3 bucket was mounted successfully:
display(dbutils.fs.ls("/mnt/pinterest_project1/"))

path,name,size,modificationTime
dbfs:/mnt/pinterest_project1/kafka-connect-s3/,kafka-connect-s3/,0,1700922615214
dbfs:/mnt/pinterest_project1/topics/,topics/,0,1700922615214


In [0]:
# The data in the S3 bucket is in JSON format and needs to be read into Databricks for each of the Kafka topics.
# To do this we create a function which takes a path as input and returns the data as a Spark DataFrame:
def read_json_to_df(file_location):
    # Specifying the file to have .json extension:
    file_type = "json"
    # Asking Spark to infer schema:
    infer_schema = "true"

    # Reading in JSONs from mounted S3 bucket:
    df = spark.read.format(file_type) \
        .option("inferschema", infer_schema) \
        .load(file_location)
    
    # Return the Spark DataFrame:
    return df

In [0]:
# The data in the S3 bucket is in JSON format, now to read this into Databricks for pinterest posts data:
# Defining a file location variable for pinterest posts data:
file_location_pin = "/mnt/pinterest_project1/topics/0e1f6d6285c1.pin/partition=0/*.json" # Asterisk (*) indicates reading all the files in the specified directory that have the .json extension

# Reading the Pinterest posts data and saving it to the below variable:
df_pin_0e1f6d6285c1 = read_json_to_df(file_location_pin)

# Displaying the first 10 rows of the pin dataframe:
display(df_pin_0e1f6d6285c1.limit(10))

category,description,downloaded,follower_count,image_src,index,is_image_or_video,poster_name,save_location,tag_list,title,unique_id
event-planning,Το όνομα που επέλεξε η μαμά Ανδριανή για τη γλυκιά Τιτίκα δεν είναι καθόλου τυχαίο. Και φυσικά δεν άφησε τίποτα στην τύχη ούτε την ημέρα της βάπτισης. Ανέθεσε την οργάνωση στην…,1,4,https://i.pinimg.com/originals/db/aa/d2/dbaad28fa85012a4ea6958540d98a8e5.jpg,4387,image,Manosbojana Katsareas,Local save in /data/event-planning,"Diy Flowers,Flower Diy,Baptism Decorations,Christening,Event Planning,Wedding Planner,Baptism Ideas,Birthday,Party",Βάπτιση: H παραμυθένια βάπτιση της Τιτίκας με θέμα το μονόκερο από την e.m. for you,ae5e7377-f1bd-4ac5-94de-bee317f51a43
home-decor,"Традиционные шведские коттеджи, обычно с красным фасадом — это настоящее воплощением идеального зимнего уюта. Они обычно оформлены очень просто и ✌PUFIK. Beautiful Interiors. On…",1,136k,https://i.pinimg.com/originals/32/eb/72/32eb72e4fd8654c115a64528bd1f34b4.png,6717,image,PUFIK Interiors & Inspirations,Local save in /data/home-decor,"Scandinavian Cottage,Swedish Cottage,Swedish Home Decor,Swedish Farmhouse,Swedish Style,Swedish Kitchen,Kitchen Black,Swedish House,Cozy Cottage",〚 Уютные шведские коттеджи от Carina Olander 〛 ◾ Фото ◾ Идеи ◾ Дизайн,bc5ab9ee-505e-44f6-92ba-677fe4fdf3e3
home-decor,"Традиционные шведские коттеджи, обычно с красным фасадом — это настоящее воплощением идеального зимнего уюта. Они обычно оформлены очень просто и ✌PUFIK. Beautiful Interiors. On…",1,136k,https://i.pinimg.com/originals/32/eb/72/32eb72e4fd8654c115a64528bd1f34b4.png,6717,image,PUFIK Interiors & Inspirations,Local save in /data/home-decor,"Scandinavian Cottage,Swedish Cottage,Swedish Home Decor,Swedish Farmhouse,Swedish Style,Swedish Kitchen,Kitchen Black,Swedish House,Cozy Cottage",〚 Уютные шведские коттеджи от Carina Olander 〛 ◾ Фото ◾ Идеи ◾ Дизайн,bc5ab9ee-505e-44f6-92ba-677fe4fdf3e3
home-decor,"Традиционные шведские коттеджи, обычно с красным фасадом — это настоящее воплощением идеального зимнего уюта. Они обычно оформлены очень просто и ✌PUFIK. Beautiful Interiors. On…",1,136k,https://i.pinimg.com/originals/32/eb/72/32eb72e4fd8654c115a64528bd1f34b4.png,6717,image,PUFIK Interiors & Inspirations,Local save in /data/home-decor,"Scandinavian Cottage,Swedish Cottage,Swedish Home Decor,Swedish Farmhouse,Swedish Style,Swedish Kitchen,Kitchen Black,Swedish House,Cozy Cottage",〚 Уютные шведские коттеджи от Carina Olander 〛 ◾ Фото ◾ Идеи ◾ Дизайн,bc5ab9ee-505e-44f6-92ba-677fe4fdf3e3
event-planning,"15.1k Likes, 83 Comments - THE EVENT COLLECTIVE ✖️ (@theeventcollectivex) on Instagram: “I’ve always loved emerald green 🌲 by @a.purnellproduction Beautiful balloons by…”",1,311,https://i.pinimg.com/originals/91/0b/5c/910b5c120f7d1570ffc840302d7b49f4.jpg,4858,image,Marie Bradford,Local save in /data/event-planning,"Diy Birthday Decorations,Balloon Decorations,Table Decorations,Emerald Green Decor,40th Birthday Parties,24th Birthday,Surprise Birthday,Brunch Decor,Quinceanera Themes",THE EVENT COLLECTIVE ✖️ on Instagram: “I’ve always loved emerald green 🌲 by @a.purnellproduction Beautiful balloons by @basicallycuteevents @inspiredengravings for the acrylic…”,58101415-9273-4311-a5bd-0015a56579b4
home-decor,"6,636 Likes, 141 Comments - The Cottage Journal (@thecottagejournal) on Instagram: “Can you say color?! 😍😍😍 We are loving the cheery vibes that these aqua blue cabinets are g…",1,394,https://i.pinimg.com/originals/8c/17/a2/8c17a257b70780480bb89c3699363144.jpg,6633,image,Sarah Martin,Local save in /data/home-decor,"Diy Kitchen Cabinets,Kitchen Redo,Home Decor Kitchen,New Kitchen,Home Kitchens,Kitchen Remodeling,Aqua Kitchen,Kitchen Counters,Kitchen Islands",The Cottage Journal on Instagram: “Can you say color?! 😍😍😍 We are loving the cheery vibes that these aqua blue cabinets are giving. If you could paint your cabinets any…”,d136f6bc-840d-44f8-bbad-115eb7e6c51e
home-decor,"6,636 Likes, 141 Comments - The Cottage Journal (@thecottagejournal) on Instagram: “Can you say color?! 😍😍😍 We are loving the cheery vibes that these aqua blue cabinets are g…",1,394,https://i.pinimg.com/originals/8c/17/a2/8c17a257b70780480bb89c3699363144.jpg,6633,image,Sarah Martin,Local save in /data/home-decor,"Diy Kitchen Cabinets,Kitchen Redo,Home Decor Kitchen,New Kitchen,Home Kitchens,Kitchen Remodeling,Aqua Kitchen,Kitchen Counters,Kitchen Islands",The Cottage Journal on Instagram: “Can you say color?! 😍😍😍 We are loving the cheery vibes that these aqua blue cabinets are giving. If you could paint your cabinets any…”,d136f6bc-840d-44f8-bbad-115eb7e6c51e
christmas,"Features: Material:Lint Size:48ｘ18cm Quantity:1 pc Shape:Santa Claus, snowman. Elk Occasion:Christmas Description: 1. Fashion design, high quality 2. Santa Claus, snowman. Elk C…",1,5k,https://i.pinimg.com/originals/b5/7f/21/b57f219fa89c1165b57525b8eae711da.jpg,1706,image,Wear24-7,Local save in /data/christmas,"Merry Christmas To You,Christmas Toys,Great Christmas Gifts,Christmas Snowman,Christmas Ornaments,Holiday,Christmas Party Decorations,Christmas Themes,Decoration Party",Standing Figurine Toys Xmas Santa Claus Snowman Reindeer Figure Plush Dolls Christmas Decorations Ornaments Home Indoor Table Ornaments Christmas Party Tree Hanging Decor Toys Gifts for Kids Friends…,b5c8a1b5-9e90-4522-9bec-2477b698d5b7
christmas,"Features: Material:Lint Size:48ｘ18cm Quantity:1 pc Shape:Santa Claus, snowman. Elk Occasion:Christmas Description: 1. Fashion design, high quality 2. Santa Claus, snowman. Elk C…",1,5k,https://i.pinimg.com/originals/b5/7f/21/b57f219fa89c1165b57525b8eae711da.jpg,1706,image,Wear24-7,Local save in /data/christmas,"Merry Christmas To You,Christmas Toys,Great Christmas Gifts,Christmas Snowman,Christmas Ornaments,Holiday,Christmas Party Decorations,Christmas Themes,Decoration Party",Standing Figurine Toys Xmas Santa Claus Snowman Reindeer Figure Plush Dolls Christmas Decorations Ornaments Home Indoor Table Ornaments Christmas Party Tree Hanging Decor Toys Gifts for Kids Friends…,b5c8a1b5-9e90-4522-9bec-2477b698d5b7
christmas,"Features: Material:Lint Size:48ｘ18cm Quantity:1 pc Shape:Santa Claus, snowman. Elk Occasion:Christmas Description: 1. Fashion design, high quality 2. Santa Claus, snowman. Elk C…",1,5k,https://i.pinimg.com/originals/b5/7f/21/b57f219fa89c1165b57525b8eae711da.jpg,1706,image,Wear24-7,Local save in /data/christmas,"Merry Christmas To You,Christmas Toys,Great Christmas Gifts,Christmas Snowman,Christmas Ornaments,Holiday,Christmas Party Decorations,Christmas Themes,Decoration Party",Standing Figurine Toys Xmas Santa Claus Snowman Reindeer Figure Plush Dolls Christmas Decorations Ornaments Home Indoor Table Ornaments Christmas Party Tree Hanging Decor Toys Gifts for Kids Friends…,b5c8a1b5-9e90-4522-9bec-2477b698d5b7


In [0]:
# The data in the S3 bucket is in JSON format, now to read this into Databricks for the geolocation data:
# Defining a file location variable for the geolocation data:
file_location_geo = "/mnt/pinterest_project1/topics/0e1f6d6285c1.geo/partition=0/*.json" # Asterisk (*) indicates reading all the files in the specified directory that have the .json extension

# Reading the geolocation data and saving it to the below variable:
df_geo_0e1f6d6285c1 = read_json_to_df(file_location_geo)

# Displaying the first 10 rows of the geo dataframe:
display(df_geo_0e1f6d6285c1.limit(10))

country,ind,latitude,longitude,timestamp
British Indian Ocean Territory (Chagos Archipelago),9455,-82.9272,-150.346,2022-03-15T01:46:32
British Indian Ocean Territory (Chagos Archipelago),6814,-86.5675,-149.565,2022-09-02T11:34:28
British Indian Ocean Territory (Chagos Archipelago),9455,-82.9272,-150.346,2022-03-15T01:46:32
British Indian Ocean Territory (Chagos Archipelago),6814,-86.5675,-149.565,2022-09-02T11:34:28
British Indian Ocean Territory (Chagos Archipelago),5111,-83.7472,8.65953,2021-04-01T00:56:57
British Indian Ocean Territory (Chagos Archipelago),5111,-83.7472,8.65953,2021-04-01T00:56:57
British Indian Ocean Territory (Chagos Archipelago),2989,-87.013,133.062,2020-01-09T19:18:54
Antarctica (the territory South of 60 deg S),10073,-32.8885,-170.295,2021-06-29T19:56:04
Antarctica (the territory South of 60 deg S),10073,-32.8885,-170.295,2021-06-29T19:56:04
Antarctica (the territory South of 60 deg S),10073,-32.8885,-170.295,2021-06-29T19:56:04


In [0]:
# The data in the S3 bucket is in JSON format, now to read this into Databricks for the user data:
# Defining a file location variable for the user data:
file_location_user = "/mnt/pinterest_project1/topics/0e1f6d6285c1.user/partition=0/*.json" # Asterisk (*) indicates reading all the files in the specified directory that have the .json extension

# Reading the user data and saving it to the below variable:
df_user_0e1f6d6285c1 = read_json_to_df(file_location_user)

# Displaying the first 10 rows of the user dataframe:
display(df_user_0e1f6d6285c1.limit(10))

age,date_joined,first_name,ind,last_name
42,2017-02-18T00:31:22,Christopher,6353,Hernandez
27,2016-03-08T13:38:37,Christopher,2015,Bradshaw
59,2017-05-12T21:22:17,Alexander,10673,Cervantes
48,2016-02-27T16:57:44,Christopher,1857,Hamilton
45,2016-09-15T06:02:53,Christopher,10020,Hawkins
35,2015-10-22T22:42:23,Christopher,2041,Campbell
48,2016-06-13T17:09:14,Christopher,7031,Anderson
27,2016-03-08T13:38:37,Christopher,2015,Bradshaw
59,2017-05-12T21:22:17,Alexander,10673,Cervantes
48,2016-02-27T16:57:44,Christopher,1857,Hamilton
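
The three read cells above differ only in the topic suffix (`pin`, `geo`, `user`). As a sketch, the paths could be built from one template instead of being typed out three times. The helper name `topic_path` is hypothetical; the mount name, user ID and folder layout are taken from the cells above.

```python
def topic_path(mount_name, user_id, topic, partition=0):
    """Build the glob path to the JSON files of one Kafka topic in the mounted bucket."""
    return f"{mount_name}/topics/{user_id}.{topic}/partition={partition}/*.json"


# One path per topic, keyed by topic name.
paths = {
    topic: topic_path("/mnt/pinterest_project1", "0e1f6d6285c1", topic)
    for topic in ("pin", "geo", "user")
}

for topic, path in paths.items():
    print(topic, path)
```

Each path in `paths` could then be passed to `read_json_to_df` in a loop.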


In [0]:
# Listing the topic folders to check that the data in the S3 bucket is accessible:
dbutils.fs.ls("/mnt/pinterest_project1/topics/")

# Batch data processing: Spark on Databricks

The S3 bucket has been successfully mounted to Databricks and the JSON data within the 3 topics has been read into 3 Spark dataframes.

We will now clean the data within each of these dataframes using Spark. Spark is a unified engine for large-scale distributed data processing on computer clusters. Its benefits include:
- Distributed processing
- Scalability
- In-memory processing
- Fault tolerance

## Importing the Transformations notebook so that the cleaning/transformation methods are available

In [0]:
%run /Users/m.maruthan@hotmail.co.uk/Transformations

## Cleaning the Pinterest posts data

In [0]:
# The cleaning/ transformation methods have been created in a separate notebook and imported in
# To clean the pinterest posts data, we call the transform_pin method and provide the dataframe as an input:
df_pin_0e1f6d6285c1 = transform_pin(df_pin_0e1f6d6285c1)

display(df_pin_0e1f6d6285c1.limit(10))

ind,unique_id,title,description,follower_count,poster_name,tag_list,is_image_or_video,image_src,save_location,category
4,55abcd28-bda1-4453-bbcd-1427fb3aa49b,"Mexican Artist Uses Unique Technique To Make His Drawings Glow, And The Result Is Mesmerizing",Mexican artist Enrique Bernal has found a magical way to illuminate his beautiful pencil drawings with life. Check his unique glowing art in this article!,2000000,Bored Panda,"Girl Drawing Sketches,Art Drawings Sketches Simple,Pencil Art Drawings,Realistic Drawings,Cool Drawings,Drawing Ideas,Panda Drawing,Disney Drawings,Pencil Drawing Inspiration",image,https://i.pinimg.com/originals/0c/37/fa/0c37fab39da6ded220c3f9ccac8d117c.jpg,/data/art,art
25,f19b91c7-2a58-41ae-a013-3806d248baec,How to use an Angled Paint Brush! Painting Techniques with The Social Easel Online Paint Studio,If I could only choose one paint brush it would be the angled brush! I am going to break down four separate Techniques I like to use an Angled Paint Brush with a video painting…,20000,The Social Easel Online Paint Studio | Video Painting Tutorials,"Fall Canvas Painting,Basic Painting,Acrylic Painting Flowers,Canvas Painting Tutorials,Autumn Painting,Painting Techniques,Diy Painting,Painting & Drawing,Canvas Art",image,https://i.pinimg.com/originals/cc/8e/81/cc8e8190f773d5e3bb7d86890b566da7.png,/data/art,art
27,1bc67f67-70f6-4c5b-ae03-d8201f4bb9b7,Bulgarian Artist Makes Incredible Illustrations That Glow From Within,"It doesn't matter if you use a pencil, a crayon or the tip of your nose to create art, it is no small feat to produce something that'll knock everyone's socks off. Some artists…",2000000,Bored Panda,"Outline Drawings,Pencil Art Drawings,Cool Art Drawings,Horse Drawings,Graphite Drawings,Drawings Of Angels,Hair Drawings,Girl Drawing Sketches,Drawing Artist",image,https://i.pinimg.com/originals/0d/e2/ba/0de2ba7b5eaa155211bb2f219fdedf3a.jpg,/data/art,art
33,58b0546a-bf3f-494a-89f3-c53d0f537f6e,The Astronomer,The Astronomer Fine Art Print by Charlie Bowater. Authentic giclee print artwork on paper or canvas. Wall Art purchases directly support the artist.,17000,Eyes On Walls,"Art And Illustration,Art Illustrations,The Old Astronomer,Arte Inspo,Look Wallpaper,Fairy Art,Anime Art Girl,Artwork Prints,3d Artwork",image,https://i.pinimg.com/originals/4e/a7/65/4ea7657855b643c0566103805b54e8f7.jpg,/data/art,art
46,19234073-8905-4885-b0d5-98e0b84cbf27,10 Watercolor Hacks For Beginners | Tips and Tricks to Making Watercolor Painting Easier,Mountain monologue watercolor,27000,"It's me, JD | DIY, Crafts, Home & Organization","Arte Inspo,Kunst Inspo,Watercolor Artists,Watercolor Ideas,Simple Watercolor,Tattoo Watercolor,Watercolor Techniques,Watercolor Animals,Watercolor Illustration",image,https://i.pinimg.com/originals/fd/54/89/fd548935dcb13545120a2115baaa41d9.jpg,/data/art,art
63,9bcb7142-50e1-4dc5-bc2a-466368d47aa1,"Japanese Artist Depicts The Typical Life Of His Pet Hamster, And The Result Is Adorable",Japanese artist and art university graduate Gotte have turned their creative skills towards a very creative subject. Their light-hearted watercolor pictures depict a typical day…,2000000,Bored Panda,"Art And Illustration,Cute Animal Drawings,Cute Drawings,Drawing Animals,Pretty Art,Cute Art,Art Inspo,Bel Art,Art Du Croquis",image,https://i.pinimg.com/originals/47/3b/ca/473bca41a36b536983a17e8f28598d7a.png,/data/art,art
74,f0a3a02d-5cc3-4cb6-8668-7f94a5f5d323,3rd Grade Fall Forrests,"3rd Grade Fall Forests Third grade has been working on these cute little creations for the past couple of art classes now, and have been doing a great job! Materials: Tru-Ray co…",13000,Elements of the Art Room,"Fall Art Projects,School Art Projects,Halloween Art Projects,Art Education Lessons,Art Lessons Elementary,Fall Crafts For Kids,Art For Kids,Kid Art,Art 2nd Grade",image,https://i.pinimg.com/originals/9c/e6/61/9ce661ab5c3bad61e30266496481a591.jpg,/data/art,art
82,f62e3bd2-cf77-44c8-ae64-81273f901592,🍂watercolor autumn leaves 🍂,,150000,Zezè,"Watercolor Art Lessons,Watercolor Painting Techniques,Painting Tips,Easy Watercolor Paintings,Watercolor Tips,Watercolor Tutorials,Painting Tutorials,Watercolor Flowers,Watercolors",multi-video(story page format),https://i.pinimg.com/videos/thumbnails/originals/c1/3f/d4/c13fd4de97106b98970b81402a4ca9f3.0000001.jpg,/data/art,art
91,ca91fe99-d9a9-4d59-9d0d-87e32d1c692f,Shower Curtain Art Tutorial | NEVER SKIP BRUNCH by Cara Newhart,"Top interior design blogger, Never Skip Brunch, shares her step by step tutorial to make your own Shower Curtain Art for cheap. Click here now for more!!",26000,CARA NEWHART [never skip brunch],"Art Diy,Diy Wall Art,Large Wall Art,Wall Art Decor,Wall Of Art,Art For Walls,Large Art Prints,Artwork Wall,Cool Wall Art",image,https://i.pinimg.com/originals/3c/b7/94/3cb7947e405a91e17e14b9e3a5466b2d.jpg,/data/art,art
93,ea8f328e-fba8-417c-867d-4151a012fc40,Easy Giraffe Art for Kids,"Colourful, fun and easy Giraffe Art for Kids! Using the watercolour resist technique to create an animal art project that kids will love. Get started by downloading the printabl…",252000,Arty Crafty Kids,"Kindergarten Art Lessons,Art Lessons For Kids,Art Lessons Elementary,Art Videos For Kids,School Art Projects,Projects For Kids,Crafts For Kids,Easy Kids Art Projects,Easy Art For Kids",video,https://i.pinimg.com/videos/thumbnails/originals/1f/5e/ed/1f5eed0cd761c3954721b6e82de353aa.0000001.jpg,/data/art,art
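
The `transform_pin` method lives in the Transformations notebook and is not shown here. Comparing the raw and cleaned outputs, `follower_count` goes from abbreviated strings such as `136k` and `5k` to plain integers such as `2000000`. One way such a conversion could work is sketched below in plain Python. The helper `parse_follower_count` is hypothetical, and the `M` suffix for millions is an assumption inferred from the cleaned values, not something visible in the raw sample.

```python
def parse_follower_count(raw):
    """Convert an abbreviated follower count such as '136k' or '2M' to an integer."""
    multipliers = {"k": 1_000, "M": 1_000_000}  # "M" is an assumed suffix
    text = raw.strip()
    if text and text[-1] in multipliers:
        # round() guards against floating-point noise from the multiplication.
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


for value in ("136k", "5k", "394", "2M"):
    print(value, "->", parse_follower_count(value))
```

The actual notebook presumably performs the equivalent step with Spark column expressions, so this sketch only illustrates the intended result.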



## Cleaning the geolocation data

In [0]:
# The cleaning/ transformation methods have been created in a separate notebook and imported in
# To clean the geolocation data, we call the transform_geo method and provide the dataframe as an input:
df_geo_0e1f6d6285c1 = transform_geo(df_geo_0e1f6d6285c1)

display(df_geo_0e1f6d6285c1.limit(10))

ind,country,coordinates,timestamp
4,Albania,"List(-88.8298, -170.188)",2022-07-07T00:18:41.000+0000
25,Ecuador,"List(-81.3019, 63.8961)",2021-12-02T12:40:33.000+0000
27,Albania,"List(-88.8298, -170.188)",2020-11-15T23:51:27.000+0000
33,Japan,"List(-47.1159, -118.396)",2018-02-15T21:25:06.000+0000
46,Afghanistan,"List(4.21689, -145.82)",2018-09-15T09:16:57.000+0000
63,Albania,"List(-88.8298, -170.188)",2021-10-27T10:14:52.000+0000
74,Antigua and Barbuda,"List(-81.0108, -165.206)",2020-01-29T14:03:35.000+0000
82,India,"List(29.9602, -101.96)",2022-05-19T07:17:10.000+0000
91,Canada,"List(-54.5927, -90.6345)",2021-10-11T12:44:03.000+0000
93,Barbados,"List(-83.8846, -179.612)",2019-03-05T05:51:17.000+0000



## Cleaning the user data

In [0]:
# The cleaning/transformation methods have been created in a separate notebook and imported in
# To clean the user data, we call the transform_user method and provide the dataframe as an input:
df_user_0e1f6d6285c1 = transform_user(df_user_0e1f6d6285c1)

display(df_user_0e1f6d6285c1.limit(10))

ind,user_name,age,date_joined
4,Adam Acosta,20,2015-10-21T21:26:45
25,Amber Gray,24,2017-07-01T07:56:15
27,Adam Acosta,20,2015-10-21T21:26:45
33,Angela Conner,36,2016-09-21T11:18:22
46,Erik Kelley,30,2016-01-06T09:58:56
63,Adam Acosta,20,2015-10-21T21:26:45
74,Amanda Benitez,21,2015-11-01T09:16:18
82,Andres Cortez,26,2015-11-20T21:50:39
91,Darryl Baker,29,2016-02-26T03:45:09
93,Angela Bates,23,2015-10-30T15:08:57



# Analysis on the created dataframes

In [0]:
# Importing the Window class, which is used to define window specifications:
from pyspark.sql.window import Window

In [0]:
# Finding the most popular category in each country:
# Creating a temporary dataframe which joins the geo and pin dataframes on ind:
temp_df1 = df_geo_0e1f6d6285c1.join(df_pin_0e1f6d6285c1, on=['ind'])

# Grouping by 'country' and 'category', and counting the number of pins in each category within each country
grouped_df1 = temp_df1.groupBy('country', 'category').agg(count('*').alias('category_count'))

# Using a window function to rank the categories within each country
window_spec1 = Window.partitionBy('country').orderBy(desc('category_count'))

# Creating a dataframe where there is a ranking column which ranks over the above specified window:
ranked_df1 = grouped_df1.withColumn('rank', rank().over(window_spec1))

# Filtering for rows where rank is 1 (top category)
max_categories_df = ranked_df1.filter(col('rank') == 1)

# Selecting the relevant columns for the final result
top_categories_per_country_df = max_categories_df.select('country', 'category', 'category_count')

# Showing the result
top_categories_per_country_df.display()

country,category,category_count
Afghanistan,education,12
Albania,art,19
Algeria,quotes,27
American Samoa,tattoos,8
Andorra,tattoos,9
Angola,diy-and-crafts,4
Angola,education,4
Anguilla,diy-and-crafts,5
Antarctica (the territory South of 60 deg S),tattoos,4
Antigua and Barbuda,art,4
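
Angola appears twice in the output above because `rank()` gives tied rows the same rank, so both categories with a count of 4 pass the `rank == 1` filter. The plain-Python sketch below mimics that ranking rule on a small dictionary. The helper `rank_desc` is hypothetical, and the `art` count of 2 is made up to show how the rank after a tie is skipped.

```python
from itertools import groupby


def rank_desc(counts):
    """Rank keys by value, descending, SQL rank() style: ties share a rank and the next rank is skipped."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    position = 1
    for _, group in groupby(ordered, key=lambda kv: kv[1]):
        members = list(group)
        for key, _count in members:
            ranks[key] = position
        position += len(members)
    return ranks


# The first two counts come from the output above; "art": 2 is made up.
angola = {"diy-and-crafts": 4, "education": 4, "art": 2}
ranks = rank_desc(angola)
top_categories = [category for category, r in ranks.items() if r == 1]

print(ranks)           # {'diy-and-crafts': 1, 'education': 1, 'art': 3}
print(top_categories)  # ['diy-and-crafts', 'education']
```

Replacing `rank()` with `row_number()` would keep exactly one row per country, at the cost of breaking ties arbitrarily.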


In [0]:
# Finding the most popular category in each year:

# Creating a temporary geo dataframe which also has a 'post_year' column:
temp_geo_df = df_geo_0e1f6d6285c1.withColumn("post_year", year("timestamp"))

# Filtering for years between 2018 and 2022
temp_geo_df = temp_geo_df.filter((col("post_year") >= 2018) & (col("post_year") <= 2022))

# Creating a temporary dataframe which joins the geo and pin dataframes on ind:
temp_df2 = temp_geo_df.join(df_pin_0e1f6d6285c1, on=['ind'])

# Grouping by 'post_year' and 'category', and counting the number of pins in each category within each year
grouped_df2 = temp_df2.groupBy('post_year', 'category').agg(count('*').alias('category_count'))

# Using a window function to rank the categories within each year:
window_spec2 = Window.partitionBy('post_year').orderBy(desc('category_count'))

# Creating a dataframe where there is a ranking column which ranks over the above specified window:
ranked_df2 = grouped_df2.withColumn('rank', rank().over(window_spec2))

# Filtering for rows where rank is 1 (top category)
max_year_categories_df = ranked_df2.filter(col('rank') == 1)

# Selecting the relevant columns for the final result
top_categories_per_year_df = max_year_categories_df.select('post_year', 'category', 'category_count')

# Showing the result
top_categories_per_year_df.display()

post_year,category,category_count
2018,christmas,34
2019,art,26
2020,finance,26
2021,education,28
2022,christmas,32


In [0]:
# Finding the user with the most followers in each country:

# Creating a temporary dataframe which joins the geo and pin dataframes on ind:
temp_df3 = df_geo_0e1f6d6285c1.join(df_pin_0e1f6d6285c1, on=['ind'])

# Using a window function to rank the follower_count within each country:
window_spec4 = Window.partitionBy('country').orderBy(desc('follower_count'))

# Creating a dataframe where there is a ranking column which ranks over the above specified window:
ranked_df3 = temp_df3.withColumn('rank', rank().over(window_spec4))

# Filtering for rows where rank is 1 (top follower_count)
top_followers_df = ranked_df3.filter(col('rank') == 1)

# Selecting the relevant columns for the final result
top_followers_per_country_df = top_followers_df.select('country', 'poster_name', 'follower_count')

# Removing the duplicates on the country:
top_followers_per_country_df = top_followers_per_country_df.dropDuplicates(["country"])

# Showing the result
top_followers_per_country_df.display()

country,poster_name,follower_count
Afghanistan,9GAG,3000000
Albania,The Minds Journal,5000000
Algeria,Apartment Therapy,5000000
American Samoa,Mamas Uncut,8000000
Andorra,Teachers Pay Teachers,1000000
Angola,Tastemade,8000000
Anguilla,"Kristen | Lifestyle, Mom Tips & Teacher Stuff Blog",92000
Antarctica (the territory South of 60 deg S),Refinery29,1000000
Antigua and Barbuda,Country Living Magazine,1000000
Argentina,Cheezburger,2000000


In [0]:
# Finding the country with the user with the most followers:

# Ordering the above resulting dataframe by follower_count:
ordered_top_followers_per_country_df = top_followers_per_country_df.orderBy(col('follower_count').desc())

# Selecting only the country and follower_count columns, and keeping the top 2 rows because two countries are tied for the highest follower_count:
ordered_top_followers_per_country_df = ordered_top_followers_per_country_df.select('country', 'follower_count').limit(2)

# Showing result:
ordered_top_followers_per_country_df.display()


country,follower_count
Angola,8000000
American Samoa,8000000
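
The `limit(2)` above works because exactly two countries are tied at the top. A tie-safe alternative is to find the maximum first and then keep every row that equals it. The plain-Python sketch below shows that idea on a few rows copied from the earlier per-country output; the variable names are hypothetical.

```python
# (country, follower_count) pairs copied from the per-country result above.
rows = [
    ("Afghanistan", 3000000),
    ("Albania", 5000000),
    ("American Samoa", 8000000),
    ("Angola", 8000000),
    ("Anguilla", 92000),
]

# Find the highest follower count, then keep every country that reaches it.
max_followers = max(count for _, count in rows)
top_countries = [(country, count) for country, count in rows if count == max_followers]

print(top_countries)  # [('American Samoa', 8000000), ('Angola', 8000000)]
```

This returns one row when there is a single leader and all tied rows otherwise, without hard-coding how many ties to expect.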


In [0]:
# Finding the most popular category for different age groups:

# Defining the age groups that are required in a new column:
temp_user_df = df_user_0e1f6d6285c1.withColumn(
    "age_group",
    (when((col("age") >= 18) & (col("age") <= 24), "18-24")
     .when((col("age") >= 25) & (col("age") <= 35), "25-35")
     .when((col("age") >= 36) & (col("age") <= 50), "36-50")
     .when(col("age") > 50, "50+")
     .otherwise("unknown age"))
)

# Joining the users DF to the pins DF:
temp_df4 = temp_user_df.join(df_pin_0e1f6d6285c1, on=["ind"])

# Grouping the age_group and category and counting the number of posts per category per age group:
grouped_df4 = temp_df4.groupBy("age_group", "category").agg(count("*").alias("category_count"))

# Using a window function to rank the categories within each age group:
window_spec5 = Window.partitionBy("age_group").orderBy(desc('category_count'))

# Creating a dataframe where there is a ranking column which ranks over the above specified window:
ranked_df5  = grouped_df4.withColumn("rank", rank().over(window_spec5))

# Filtering for rows where the rank is 1:
top_categories_age_group_df = ranked_df5.filter(col("rank") == 1)

# Selecting the relevant columns for the final output:
top_categories_age_group_df1 = top_categories_age_group_df.select("age_group", "category", "category_count")

# Displaying the result:
top_categories_age_group_df1.display()

age_group,category,category_count
18-24,tattoos,66
25-35,christmas,41
36-50,finance,31
50+,vehicles,15
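
The chained `when(...)` conditions in the cell above are evaluated in order, and the first match wins. The same bucketing is sketched below as a plain-Python function so the boundaries can be checked directly. The function name `age_group` is hypothetical; the labels and ranges are copied from the cell.

```python
def age_group(age):
    """Return the age-group label used above; ages under 18 or missing fall through to 'unknown age'."""
    if age is None:
        return "unknown age"
    if 18 <= age <= 24:
        return "18-24"
    if 25 <= age <= 35:
        return "25-35"
    if 36 <= age <= 50:
        return "36-50"
    if age > 50:
        return "50+"
    return "unknown age"


for age in (17, 18, 24, 25, 35, 36, 50, 51, None):
    print(age, "->", age_group(age))
```

Note that the `50+` label covers ages strictly greater than 50, since an age of exactly 50 falls into `36-50`.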


In [0]:
# Finding the median follower count for different age groups:
# The temporary dataframe temp_df4 created above can be reused, as it already has the age_group column and joins the users and pins dataframes.

# Grouping by age_group and using the SQL 'percentile_approx' function (via 'expr') to find the approximate median follower count per age group.
# The median is the 50th percentile (0.5), and the result is given the alias median_follower_count:
median_followers_age_group_df = temp_df4.groupBy("age_group").agg(expr("percentile_approx(follower_count, 0.5)").alias("median_follower_count"))

# Showing resulting df:
median_followers_age_group_df.display()

age_group,median_follower_count
50+,1000
36-50,6000
18-24,108000
25-35,24000
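
`percentile_approx` returns an approximate percentile, and as far as the Spark documentation describes it, the result is a value taken from the column rather than an interpolated one. With an even number of rows this can differ from the textbook median, which averages the two middle values. The plain-Python sketch below uses a nearest-rank rule as a stand-in to show the difference. It is an assumption for illustration, not Spark's actual algorithm, and the sample values are made up.

```python
import math
import statistics


def nearest_rank_percentile(values, fraction):
    """Return the value at the nearest rank for the given fraction (0 < fraction <= 1)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


# Made-up follower counts with an even number of entries.
follower_counts = [100, 200, 300, 400]

nearest_rank_median = nearest_rank_percentile(follower_counts, 0.5)
textbook_median = statistics.median(follower_counts)

print(nearest_rank_median)  # 200
print(textbook_median)      # 250.0
```

For groups with many rows the two values are usually close, so the distinction mostly matters for small groups.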


In [0]:
# Finding how many users have joined each year:

# Creating a temporary user dataframe which also has a 'joined_year' column:
temp_user_df1 = df_user_0e1f6d6285c1.withColumn("joined_year", year("date_joined"))

# Grouping by the joined year and doing a count:
num_users_year_df = temp_user_df1.groupBy("joined_year").agg(count("*").alias("number_users_joined"))

# Showing result:
num_users_year_df.display()

joined_year,number_users_joined
2015,532
2016,604
2017,220


In [0]:
# Finding the median follower count of users based on their joining year:

# Joining the temp_user_df1 dataframe created above to the pins dataframe:
temp_df5 = temp_user_df1.join(df_pin_0e1f6d6285c1, on =["ind"])

# Grouping by joined_year and using the SQL 'percentile_approx' function (via 'expr') to find the approximate median follower count per joining year.
# The median is the 50th percentile (0.5), and the result is given the alias median_follower_count:
median_followers_joined_date_df = temp_df5.groupBy("joined_year").agg(expr("percentile_approx(follower_count, 0.5)").alias("median_follower_count"))

# Showing resulting df:
median_followers_joined_date_df.display()

joined_year,median_follower_count
2015,128000
2016,19000
2017,2000


In [0]:
# Finding the median follower count of users based on their age group and joining year:

# Reusing temp_user_df, which was created in an earlier cell and already has the age_group column, and adding a column for the joined year:
temp_user_df2 = temp_user_df.withColumn("joined_year", year("date_joined"))

# Joining the temp user df to the pins dataframe:
temp_df6 = temp_user_df2.join(df_pin_0e1f6d6285c1, on = ["ind"])

# Grouping by age_group and joined_year and using the SQL 'percentile_approx' function (via 'expr') to find the approximate median follower count for each group.
# The median is the 50th percentile (0.5), and the result is given the alias median_follower_count:
median_followers_age_group_joined_date_df = temp_df6.groupBy("age_group", "joined_year").agg(expr("percentile_approx(follower_count, 0.5)").alias("median_follower_count"))

# Showing result:
median_followers_age_group_joined_date_df.orderBy("joined_year").display()

age_group,joined_year,median_follower_count
25-35,2015,44000
18-24,2015,228000
50+,2015,14000
36-50,2015,13000
50+,2016,908
25-35,2016,22000
36-50,2016,8000
18-24,2016,46000
25-35,2017,2000
50+,2017,1000


# Unmount S3 bucket

Finally, we can unmount the S3 bucket from Databricks. This step is optional: while the bucket stays mounted, new batches of data landing in the S3 bucket remain accessible from Databricks, and re-running the read cells above will bring them into the Spark dataframes.

In [0]:
# To unmount the S3 bucket from Databricks:
dbutils.fs.unmount("/mnt/pinterest_project1")