This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

Sentiment Analysis on Twitter data using Bernoulli Naïve Bayes

SarahHannes/tweet-sentiment

Twitter Sentiment Analysis [archived] Open in Streamlit

A Python-based project for performing sentiment analysis on Twitter data. Get full project paper here.

  • Twitter data is scraped hourly using the {twint} package.
  • Scheduled model training runs monthly against an MLflow server hosted on a g1-small GCP Compute Engine instance.
  • Model training artifacts are stored in GCP Cloud Storage.
  • An instance schedule is applied to the Compute Engine instance to reduce cost.
  • Total GCP cost is less than MYR 3.00/month (approx. USD 0.72/month).
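
The repository description names Bernoulli Naïve Bayes as the model. The following is a minimal pure-Python sketch of that technique on binary word-presence features, not the project's actual training code. The class name, the whitespace tokenizer, and the toy tweets are all assumptions made for illustration.

```python
import math


def tokenize(text):
    """Lowercase and split on whitespace; return the set of distinct tokens."""
    return set(text.lower().split())


class BernoulliNB:
    """Illustrative Bernoulli Naive Bayes over word-presence features."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, texts, labels):
        docs = [tokenize(t) for t in texts]
        self.vocab = sorted(set().union(*docs))
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.p_word = {}  # p_word[c][w] = P(word w is present | class c)
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            n_c = len(class_docs)
            self.log_prior[c] = math.log(n_c / len(docs))
            self.p_word[c] = {
                w: (sum(w in d for d in class_docs) + self.alpha)
                / (n_c + 2 * self.alpha)
                for w in self.vocab
            }
        return self

    def predict(self, text):
        present = tokenize(text)
        scores = {}
        for c in self.classes:
            score = self.log_prior[c]
            for w in self.vocab:
                p = self.p_word[c][w]
                # Unlike the multinomial variant, absent vocabulary words
                # also contribute, through log(1 - p).
                score += math.log(p) if w in present else math.log(1.0 - p)
            scores[c] = score
        return max(scores, key=scores.get)


# Hypothetical toy data, for illustration only.
train_texts = [
    "great service love it",
    "love this phone",
    "terrible service hate it",
    "hate this awful phone",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = BernoulliNB().fit(train_texts, train_labels)
print(model.predict("love it"))  # positive
```

Words that never appeared in training are ignored at prediction time, which is one common convention. A library implementation would normally replace this sketch in practice.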

Dashboard

Click the Open in Streamlit badge to view the dashboard.

[Image: animated preview of the dashboard]

Simplified View of Pipelines

[Image: DAG flowchart of the pipelines]

Architecture Overview

[Image: pipelines flowchart]

GCP Config

startup-script for GCP Compute Engine instance:

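#! /bin/bash
# Refresh package lists, then install tmux and pip
sudo apt update
sudo apt-get -y install tmux
echo Installing python3-pip
sudo apt install -y python3-pip
# Make user-level pip executables discoverable
export PATH="$HOME/.local/bin:$PATH"
echo Installing mlflow and google_cloud_storage
pip3 install mlflow google-cloud-storage
echo Starting new tmux session
# Create a detached tmux session owned by <USERNAME>
sudo -H -u <USERNAME> tmux new-session -d -s mysession
# Start the MLflow tracking server: SQLite backend store, GCS artifact root.
# As written, this runs in the startup script's own shell, not inside the tmux session.
mlflow server \
--backend-store-uri sqlite:///mlflow.db \
--default-artifact-root <gsutil URI> \
--host localhost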

Working Example:

#! /bin/bash
sudo apt update
sudo apt-get -y install tmux
echo Installing python3-pip
sudo apt install -y python3-pip
export PATH="$HOME/.local/bin:$PATH"
echo Installing mlflow and google_cloud_storage
pip3 install mlflow google-cloud-storage
echo Starting new tmux session
sudo -H -u tweet_sentiment_py tmux new-session -d -s mysession
mlflow server \
--backend-store-uri sqlite:///mlflow.db \
--default-artifact-root gs://mlflow_bucket_001 \
--host localhost
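
To show how the template placeholder maps onto the working example, here is a small Python sketch that assembles the same `mlflow server` command from parameters. The helper name `build_mlflow_server_cmd` is hypothetical and not part of the project. The flags and default values are the ones used in the scripts above.

```python
import shlex


def build_mlflow_server_cmd(
    artifact_root,
    backend_store_uri="sqlite:///mlflow.db",
    host="localhost",
):
    """Return the `mlflow server` invocation as an argument list.

    Hypothetical helper: it only mirrors the flags used in the startup script.
    """
    return [
        "mlflow", "server",
        "--backend-store-uri", backend_store_uri,
        "--default-artifact-root", artifact_root,
        "--host", host,
    ]


# The bucket name is taken from the working example above.
cmd = build_mlflow_server_cmd("gs://mlflow_bucket_001")
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd)` on a machine where MLflow is installed. The sketch itself only builds and prints the command.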

Limitations & Roadblocks

  • The project is hard to reproduce because pipeline integration and authentication were set up manually.
  • The dashboard is not scalable. The Twitter handles of the accounts of interest are currently hardcoded in the Python file served on {Streamlit} for data analysis and visualization.
  • There are no fallbacks for failed scheduled Actions.
  • Roadblock: As of Jan 2022, the GitHub Actions build may fail due to a dependency installation error. This affects both the scheduled pipelines and the dashboard. (See Ref)
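
One possible way to ease the hardcoded-handles limitation is to read the handles from configuration instead of the dashboard source. The project does not do this. The sketch below assumes a hypothetical `TWITTER_HANDLES` environment variable holding a comma-separated list, and the handle names are placeholders.

```python
import os


def load_handles(env_var="TWITTER_HANDLES", default=()):
    """Parse a comma-separated list of handles from an environment variable.

    The variable name is an assumption for illustration. A leading "@"
    and surrounding whitespace are stripped, and empty entries are dropped.
    """
    raw = os.environ.get(env_var, "")
    handles = [part.strip().lstrip("@") for part in raw.split(",")]
    handles = [h for h in handles if h]
    return handles if handles else list(default)


# Placeholder handles, set here only so the example is self-contained.
os.environ["TWITTER_HANDLES"] = "@example_one, example_two"
print(load_handles())  # ['example_one', 'example_two']
```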

Credits

Project based on the cookiecutter data science project template. #cookiecutterdatascience


Thank you to the developers of twint and all other packages for making this project possible!
