
Data Engineering project Amazon Reviews

Data Engineering project for ZoomCamp`24: JSONL -> PostgreSQL + Metabase + Mage.AI

ETL for the Amazon Reviews'23 dataset.

Data Engineering project Amazon Reviews

‼️ New branch with export to BigQuery (in progress - export works only via scripts, no Mage yet, stay tuned!)

Cloud environment: GitHub CodeSpaces - free and enough to test this project (120 core-hours of Codespaces compute and 15GB of Codespaces storage for free monthly).

Reproducing and reviewing this project should (hopefully) take less than an hour and ~8GB of storage in total for the described datasets. You don't need anything extra like VS Code or trial accounts - just a web browser + a GitHub account is totally enough.

This level of cloud resources allowed me to process, analyze and visualize datasets with at least 6 million reviews without issues.

🎯 Goals

This is my Data Engineering project started during DE ZoomCamp'24. The main goal is straightforward: Extract - Transform - Load data, then visualize some insights.

I chose to analyze the Amazon Reviews dataset. The full dataset is huge, but it is available for download in parts - by product categories. There are smaller and larger sub-datasets, and we will explore some of them here (Digital_Music, Health_and_Personal_Care, All_Beauty, Software). Each subset includes 2 files: products (meta) and ratings with user comments (reviews).
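To give a feel for the data, here is a minimal Python sketch for peeking at one record in a downloaded reviews file. The field names (rating, verified_purchase, ...) follow the published Amazon Reviews'23 schema and are an assumption here - check an actual record:

import gzip
import json

# Peek at the first review record in a downloaded file.
# Field names follow the published Amazon Reviews'23 schema (an assumption).
with gzip.open("data/Digital_Music.jsonl.gz", "rt", encoding="utf-8") as f:
    record = json.loads(next(f))  # each line is one JSON document
print(record.get("rating"), record.get("verified_purchase"))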

Some time ago I became very interested in how much we can trust all those ratings on Amazon, including 'bestseller' statuses. But without proper data it was hard to investigate. Now, years later, I have more skills in data analytics and this huge dataset, thanks to McAuley Lab. I chose to use only part of the available information:

  • Products: categories and average ratings
  • Reviews: users' ratings, review dates, whether it was a verified purchase or not (I will analyze review texts later - an awesome source for many insights, I think)

Thanks to ZoomCamp for the motivation to learn many new tools and get back to my 'reviews & trust' questions!

πŸ•΅οΈ Questions that I chose to investigate during this project:

  • What are the trends in review ratings for verified/unverified purchases? Are they more negative or more positive than average?
  • Since unverified-purchase reviews can be manipulated, what is their share of the total number?
  • Since automated review submission has become technically easier in recent years, what are the unverified-purchase rating trends over time?
  • Are there any significant differences in trends for product categories?

Let's explore together! These are the categories (datasets) I have played with so far.

Reviews by Main category

ℹ️ Assets

  • dataset URLs are defined in .csv files stored in /data - dataset_urls0.csv, dataset_urls1.csv, etc. (see the Python sketch after this list)
  • my scripts (bash, python, sql)
  • pre-configured Metabase dashboard
  • pre-configured Mage AI pipelines
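For illustration, a minimal Python sketch of what such a CSV-driven download could look like. It assumes one URL per row - check the actual dataset_urls*.csv files for the real format:

import csv
import os
import urllib.request

# Download every URL listed in a dataset_urls CSV into data/
# (assumes one URL per row - check the real files for the exact format).
with open("data/dataset_urls0.csv", newline="") as f:
    for row in csv.reader(f):
        url = row[0].strip()
        if not url.startswith("http"):
            continue  # skip header or blank rows
        target = os.path.join("data", os.path.basename(url))
        urllib.request.urlretrieve(url, target)
        print("downloaded", target)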

🧰 Tech stack

  • Docker and docker-compose. All dockerized apps use the default bridge network (172.17.0.x).
  • PostgreSQL as a database/warehouse - open source and free
  • Metabase as an analytics platform - open source and free self-hosted
  • Mage AI as an orchestration tool - open source and free self-hosted

💡 The combination of CodeSpaces + PostgreSQL + Metabase + Mage AI would be a good choice for those who prefer/experiment with a simple and open source approach, or those who hesitate to deal with the BIG 3 cloud providers (AWS, Azure, GCP) and their serious payments. It looks like a good fit for early stage startups, developers, and those who are learning Data Engineering like me (you?). ❕Find one more open source and free opportunity at the end of this instruction.

🚀 Instructions to deploy

🙈 Oh, so many steps?! Please take a deep breath and stay calm - they are all simple enough, that's why there are so many. One step at a time. I did it successfully many times to test it, you can do it too! Let's go!

NB: if you 'lose track', you can always start from scratch - create a new CodeSpace and go step by step. Good luck!

πŸ› οΈ Setup environment

  1. Fork this repo on GitHub.
  2. Create GitHub CodeSpace from the repo.
  3. Start CodeSpace, wait for requirements.txt installation to finish (it starts automatically, just watch & wait).
  4. Copy dev.env to .env - run cp dev.env .env in your CodeSpace terminal. You don't need to change anything to test the process. It contains all key settings (PostgreSQL etc).
cp dev.env .env
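If you are curious how the scripts can pick these settings up, here is a minimal Python sketch using python-dotenv. The variable names below are hypothetical - see dev.env for the real keys:

import os
from dotenv import load_dotenv  # pip install python-dotenv

# Load settings from .env; the variable names here are hypothetical,
# see dev.env for the real keys.
load_dotenv(".env")
pg_host = os.getenv("POSTGRES_HOST", "172.17.0.2")
pg_db = os.getenv("POSTGRES_DB", "amzn_reviews")
print(pg_host, pg_db)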

⬇️ Download dataset and start PostgreSQL server

  1. Run bash download.sh to download the 1st dataset (the smallest) - from dataset_urls0.csv. As a result you will have 2 files in your data directory: meta_Digital_Music.jsonl.gz and Digital_Music.jsonl.gz. Cool! ✅
bash download.sh
  2. Start the PostgreSQL Docker container: bash start_postgres.sh. It will automatically download the postgres image and start it.
bash start_postgres.sh
  3. Wait until you see the message database system is ready to accept connections - PostgreSQL is ready to serve you.

PostgreSQL is ready

Let it run in this terminal so you can see logs and easily stop it later (with just Ctrl-C). CodeSpace will pop up a notification that Your application running on port 5432 is available - just ignore it (close that pop-up).

  4. Start a 2nd terminal (click the + button on the panel in the right corner above the terminal) and switch to it.
  5. You can check the database connection by running pgcli -h 172.17.0.2 -p 5432 -u postg -d amzn_reviews, password postg. All dockerized apps in this setup are supposed to run in the default bridge network, 172.17.0.x, with 172.17.0.2 for PostgreSQL. If this step fails you're in trouble 😅 But when you run it in a CodeSpace and follow the instructions it should work fine - I tested it multiple times. Type quit to return to the terminal CLI. Ok, PostgreSQL is running. ✅
pgcli -h 172.17.0.2 -p 5432 -u postg -d amzn_reviews
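If you prefer checking the connection from Python instead of pgcli, here is a minimal sketch with psycopg2 using the same credentials:

import psycopg2

# Same connection check as pgcli above, just from Python.
conn = psycopg2.connect(
    host="172.17.0.2", port=5432,
    user="postg", password="postg", dbname="amzn_reviews",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()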

⌨️ Ingest data using CLI (bash & python)

Dataset files have been downloaded, PostgreSQL is running - time to ingest your data!

  1. Run bash process.sh to ingest the dataset files into the PostgreSQL database.
bash process.sh

It executes the python script process_jsonl.py with default parameters from your .env settings. By default (without a parameter) it processes the 1st dataset files. As a result of successful loading you will see some progress messages and finally Finished ingesting data/... into the PostgreSQL database! Total time ... second(s) +++ PROCESSING finished: OK!.

Processing successful

Congratulations, the first approach to loading data is accomplished, and you now have records in the 2 tables of your database: meta with products and reviews with ratings. ✅
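For the curious: the general pattern behind such an ingestion script looks roughly like the sketch below - stream the gzipped JSONL in chunks and append each chunk to a table. This is a simplified illustration, not the actual process_jsonl.py (which also keeps only the selected columns):

import pandas as pd
from sqlalchemy import create_engine

# Stream the gzipped JSONL in chunks and append each chunk to PostgreSQL.
# A simplified illustration, not the actual process_jsonl.py.
engine = create_engine("postgresql://postg:postg@172.17.0.2:5432/amzn_reviews")
chunks = pd.read_json("data/Digital_Music.jsonl.gz", lines=True,
                      compression="gzip", chunksize=100_000)
for i, chunk in enumerate(chunks):
    chunk.to_sql("reviews", engine, if_exists="append", index=False)
    print(f"chunk {i} ingested ({len(chunk)} rows)")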

  2. Now you can ingest the next dataset with the same approach:
  • download 2nd dataset: run bash download.sh dataset_urls1.csv (Health_and_Personal_Care files)
bash download.sh dataset_urls1.csv
  • process 2nd dataset: run bash process.sh dataset_urls1.csv
bash process.sh dataset_urls1.csv
  • download 3rd dataset: run bash download.sh dataset_urls2.csv (All_Beauty files)
bash download.sh dataset_urls2.csv

If you want to try data workflow orchestration, have some patience and proceed to the Visualize data step. If you don't want (or have no time) to play with Mage.AI, just process dataset_urls2.csv (All_Beauty) the same way, then download and process dataset_urls3.csv (Software).

📊 Visualize data

Finally, probably the most interesting part: let's see our data using Metabase - an open source and free self-hosted Business Intelligence, Dashboards and Data Visualization tool.

Reviews by Verified purchase

  1. Start it in Docker - run
bash start_metabase.sh

It will automatically download Metabase container and start it.

  2. CodeSpace will pop up a notification that Your application running on port 3000 is available - click Open in Browser.

Metabase app pop-up

The new page will probably open blank. Please wait a couple of seconds to let Metabase start, then refresh the page. Now you will see the login screen. Just log in with john@mailinator.com, pass: Jj123456 (no worries, it's self-hosted, all safe). A pre-configured dashboard with reports is already there for you! ✅🎉

💡 In case you accidentally close that pop-up or the Metabase page and need it later (after ingesting new datasets), you can always open that page from the Ports tab:

Metabase app Ports

  3. Explore the dashboard on the main screen. There are 2 tabs: Products and Reviews.
  4. Products: you can see 2 reports:
  • pie chart with the number of Products by Main category
  • Products number distribution by Average rating
  5. Reviews: you can see 4 reports:
  • pie chart with Reviews number by Verified/not purchase
  • Reviews rating number distribution by Verified/not purchase over time (by months)
  • pie chart with Reviews number by Main category
  • Reviews rating number distribution by Main category over time (by months)

You can see some screenshots below.

  6. The more datasets you load, the more categories you can see. That's why I suggest downloading and processing at least 2 datasets. You can process the 3rd dataset by following the familiar steps in Ingest data using CLI (bash & python), or go ahead and discover Mage AI.
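Under the hood these reports boil down to simple aggregations. Here is a sketch of the kind of query behind the Verified/not purchase pie chart - the table and column names are assumptions based on the dataset fields; the actual schema is defined by process_jsonl.py:

import psycopg2

# The kind of aggregation behind the "Reviews by Verified/not purchase" report.
# Table/column names are assumptions - the real schema comes from process_jsonl.py.
conn = psycopg2.connect(host="172.17.0.2", port=5432, user="postg",
                        password="postg", dbname="amzn_reviews")
with conn.cursor() as cur:
    cur.execute("""
        SELECT verified_purchase, COUNT(*) AS reviews_count, AVG(rating) AS avg_rating
        FROM reviews
        GROUP BY verified_purchase;
    """)
    for row in cur.fetchall():
        print(row)
conn.close()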

🧙 Orchestrate data ingestion (Mage AI)

I wouldn't say it was as simple and easy as Matt showed us in the videos, but with some time and effort I managed to find a way to convert the logic of my process_jsonl.py script into Mage pipelines and blocks. Why is it worth my/your time? Because larger datasets (like Kindle_Store) will probably demand more serious cloud storage, database and compute than the free CodeSpace playground. And I (you?) need to learn how to deal with them in a more scalable system, providing long job execution, partitioning, monitoring and logs. So let's see the 2 pipelines I managed to set up with Mage.
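If you haven't seen Mage before: a pipeline is a chain of small Python blocks. The sketch below follows Mage's standard data loader template - it is not the exact code from mage.zip, and the dataset_urls variable name is hypothetical:

import pandas as pd

# Shape of a Mage data loader block (standard Mage template, not the exact
# code from mage.zip). Runtime variables from the "Run pipeline now"
# dialogue arrive in kwargs; 'dataset_urls' is a hypothetical name here.
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader

@data_loader
def load_data(*args, **kwargs):
    urls_file = kwargs.get('dataset_urls', 'dataset_urls0.csv')
    return pd.read_csv(f'data/{urls_file}', header=None)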

💡 In case something goes wrong or gets complicated, you can still ingest more datasets following the steps in Ingest data using CLI (bash & python) and then reload the Metabase dashboard page.

  1. Run unzip mage.zip to extract the pre-configured Mage AI workflow.
unzip mage.zip
  2. Run docker-compose build to automatically download and prepare the Docker container with Mage AI.
docker-compose build
  3. Run docker-compose up to start the Docker container with Mage AI.
docker-compose up
  4. CodeSpace will pop up a notification that Your application running on port 6789 is available - click Open in Browser.

Open Mage AI page

That new page will probably open blank. Please wait a couple of seconds to let Mage AI start, then refresh the page. Now you will see the Mage dashboard. Your orchestration center is ready to serve you! ✅

  5. Go to Pipelines - move the mouse to the left edge of the window to open the menu slider, then click Pipelines.

Open Mage AI pipelines

You will see the 2 pipelines I configured:

Mage AI pipelines list

  6. Click load_run_dataset. It will open the current pipeline triggers. Then click the Run@once button.

Mage AI pipeline trigger

  7. In the Run pipeline now dialogue you can define a variable - in this case it will be the dataset file name.

Mage AI pipeline run variable

Enter dataset_urls2.csv (which you downloaded but haven't processed yet) and click the Run now button. The ingestion process starts. You can see the progress on the Pipeline runs page. It appears with a little delay, no worries.

Mage AI pipeline run progress

  8. After ingesting the new dataset you can switch to the Metabase page, refresh it and see the new data in the dashboard reports. Congratulations! ✅🎉

Metabase dashboard products

Metabase dashboard reviews by verified monthly

Metabase dashboard reviews by category monthly

  9. If you got to this point, please ⭐️star⭐️ my repo! 🙌

🎁 One more open source opportunity

Bonus for those who survived reading the instructions!

In addition to using PostgreSQL in Docker (locally on your computer or in GitHub CodeSpace as described), you can use/experiment with Supabase - an open source Firebase alternative.

It provides a quite generous free plan (not a trial, no credit card required): 500 MB database space, 2-core shared CPU, 1 GB RAM, 5 GB bandwidth, 1 GB file storage. Free projects are paused after 1 week of inactivity, with a limit of 2 active projects.

I tested it, it works with my scripts, and with Metabase as well. You just need to:

  • register a free account
  • get access credentials (host, port, user, password, database)
  • put the new values into your .env file
  • create 2 tables in Supabase - meta and reviews - and define their structure (according to my python script; see the sketch after this list)
  • and finally use my process.sh (or Mage AI pipelines) to ingest data there
  • Metabase works with Supabase as with any PostgreSQL database - create a dashboard for your new database.
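A minimal sketch of the table creation step via psycopg2 - the host/credential placeholders and the column lists are assumptions (check your Supabase project settings and process_jsonl.py for the actual structure):

import psycopg2

# Create the two tables in Supabase. Host/credentials are placeholders and
# the column lists are assumptions - check process_jsonl.py for the real schema.
conn = psycopg2.connect(host="db.YOUR_PROJECT.supabase.co", port=5432,
                        user="postgres", password="YOUR_PASSWORD",
                        dbname="postgres")
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS meta (
            parent_asin TEXT,
            main_category TEXT,
            average_rating NUMERIC
        );
        CREATE TABLE IF NOT EXISTS reviews (
            parent_asin TEXT,
            rating NUMERIC,
            verified_purchase BOOLEAN,
            review_date TIMESTAMP
        );
    """)
conn.commit()
conn.close()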

Let me know how it worked for you!

⏹️ Instructions to stop the apps

  • Simple way - stop everything at once by stopping your CodeSpace. Remember, this will leave all downloaded data in your CodeSpace - you can start it later and continue playing with the tools and data. You can also delete the CodeSpace with all its data.
  • Stop all active Docker containers - run this command in terminal
docker stop $(docker ps -a -q)
  • Stop Mage AI
    • switch to 2nd terminal and press Ctrl-C
    • then run docker-compose down in terminal
  • Stop PostgreSQL - switch to 1st terminal and press Ctrl-C
  • You can delete downloaded dataset .jsonl.gz files from /data folder.

Roadmap

  • Refactor python code
  • Export to .parquet
  • Export to BigQuery
  • BigQuery: update Mage orchestration, visualization and instructions (in progress)
  • Load more data to analyze the Kindle_Store dataset (in progress)
  • Visualize using Apache Superset (as an alternative to Metabase)
  • Visualize using Looker Studio
  • Orchestrate with Prefect (as an alternative to Mage)
  • Orchestrate with Apache Airflow (as an alternative to Mage)

Stay tuned!

Support

πŸ™ Thank you for your attention and time!

  • If you experience any issue while following these instructions (or something is left unclear), please add it to Issues - I'll be glad to help/fix it. Your feedback, questions & suggestions are welcome as well!
  • Feel free to fork and submit pull requests.

If you find this project helpful, please ⭐️star⭐️ my repo https://github.com/dmytrovoytko/data-engineering-amazon-reviews to help other people discover it 🙏

Made with ❤️ in Ukraine 🇺🇦 Dmytro Voytko
