
📊 Reddit Sentiment Analysis Data Pipeline 🚀

🔹 Project Overview

This project automates data ingestion, processing, and visualization using Apache Airflow, AWS S3, and Streamlit. It extracts Reddit posts, performs sentiment analysis, and visualizes insights on a real-time dashboard.
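
Sentiment scoring is handled by TextBlob (listed under the technologies below), which assigns each post a polarity between -1.0 and 1.0. A minimal sketch of that scoring step (the threshold values and function name here are illustrative, not the repository's exact logic):

from textblob import TextBlob

def score_sentiment(text: str) -> dict:
    """Return polarity, subjectivity, and a coarse label for one Reddit post."""
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity          # -1.0 (negative) .. 1.0 (positive)
    subjectivity = blob.sentiment.subjectivity  # 0.0 (objective) .. 1.0 (subjective)
    if polarity > 0.05:
        label = "positive"
    elif polarity < -0.05:
        label = "negative"
    else:
        label = "neutral"
    return {"polarity": polarity, "subjectivity": subjectivity, "label": label}

print(score_sentiment("This new release is absolutely fantastic"))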

👉 Technologies Used:

  • ETL & Orchestration: Apache Airflow
  • Data Processing: Pandas, NumPy, PyArrow, TextBlob
  • Storage: AWS S3 (Parquet format)
  • Dashboard: Streamlit & Plotly
  • Containerization: Docker & Kubernetes
  • CI/CD: GitHub Actions & Docker Hub
  • Cloud Deployment: AWS EC2

🔹 Architecture

🏠 Apache Airflow  ➔  📦 AWS S3 (Data Storage)  ➔  📈 Streamlit Dashboard
  • Airflow DAGs fetch Reddit data, analyze sentiment, and store results in AWS S3.
  • Streamlit Dashboard dynamically pulls data from S3 and visualizes trends.
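
As a rough illustration of that flow (not the repository's actual DAG: the subreddit, task names, and S3 key are assumptions, and writing Parquet straight to an s3:// path relies on pandas with pyarrow and s3fs plus AWS credentials in the environment), a daily DAG could look like this:

import os
from datetime import datetime

import pandas as pd
import praw
from airflow.decorators import dag, task
from textblob import TextBlob


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def reddit_sentiment_pipeline():

    @task
    def extract() -> list[dict]:
        # Pull the newest posts from a subreddit via the Reddit API (PRAW).
        reddit = praw.Reddit(
            client_id=os.environ["REDDIT_CLIENT_ID"],
            client_secret=os.environ["REDDIT_CLIENT_SECRET"],
            user_agent=os.environ["REDDIT_USER_AGENT"],
        )
        posts = reddit.subreddit("technology").new(limit=200)
        return [{"id": p.id, "title": p.title, "created_utc": p.created_utc} for p in posts]

    @task
    def transform_and_load(records: list[dict]) -> str:
        # Score each title with TextBlob and store the result in S3 as Parquet.
        df = pd.DataFrame(records)
        df["polarity"] = df["title"].map(lambda t: TextBlob(t).sentiment.polarity)
        key = f"s3://{os.environ['S3_BUCKET_NAME']}/reddit/{datetime.utcnow():%Y-%m-%d}.parquet"
        df.to_parquet(key, index=False)  # pandas + pyarrow + s3fs handle the upload
        return key

    transform_and_load(extract())


reddit_sentiment_pipeline()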

🚀 Setup Instructions

1️⃣ Clone the Repository

git clone https://github.com/yourusername/Reddit-Sentiment-Analysis-Data-Pipeline.git
cd Reddit-Sentiment-Analysis-Data-Pipeline

2️⃣ Set Up Environment Variables

Create a .env file in the project root and add:

AWS_ACCESS_KEY=your_aws_access_key
AWS_SECRET_KEY=your_aws_secret_key
S3_BUCKET_NAME=your_s3_bucket_name
REDDIT_CLIENT_ID=your_reddit_client_id
REDDIT_CLIENT_SECRET=your_reddit_client_secret
REDDIT_USER_AGENT=your_reddit_user_agent
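
These values are read at runtime. A minimal sketch of loading them with python-dotenv and building the Reddit and S3 clients (praw, boto3, and python-dotenv are assumed here; the repository may wire this up differently):

import os

import boto3
import praw
from dotenv import load_dotenv

load_dotenv()  # read the .env file in the project root into os.environ

reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent=os.environ["REDDIT_USER_AGENT"],
)

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY"],
    aws_secret_access_key=os.environ["AWS_SECRET_KEY"],
)
bucket = os.environ["S3_BUCKET_NAME"]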

3️⃣ Run the Project Locally (Docker)

docker-compose up --build -d

🚀 Deployment via CI/CD (GitHub Actions)

🔹 GitHub Secrets

Before pushing code, add these GitHub Secrets under Settings → Secrets and variables → Actions:

Secret Name               Description
AWS_ACCESS_KEY            AWS S3 Access Key
AWS_SECRET_KEY            AWS S3 Secret Key
S3_BUCKET_NAME            S3 Bucket Name
REDDIT_CLIENT_ID          Reddit API Client ID
REDDIT_CLIENT_SECRET      Reddit API Secret
REDDIT_USER_AGENT         Reddit API User Agent
DOCKER_HUB_USERNAME       Docker Hub Username
DOCKER_HUB_ACCESS_TOKEN   Docker Hub Token
SERVER_IP                 AWS EC2 Public IP
SERVER_USER               SSH Username (ubuntu for AWS)
SSH_PRIVATE_KEY           Your .pem SSH Key

🔹 How CI/CD Works

  1. Push to main branch → GitHub Actions builds & pushes Docker images.
  2. Deploys to AWS EC2 via SSH.
  3. Pulls latest images & restarts containers automatically.

🚀 Manual Deployment (AWS EC2)

1️⃣ SSH into Your EC2 Server

ssh -i your-key.pem ubuntu@your-server-ip

2️⃣ Pull Latest Docker Images

docker pull your-dockerhub-username/airflow:latest
docker pull your-dockerhub-username/streamlit-dashboard:latest

3️⃣ Restart Services

docker-compose down
docker-compose up -d

📊 Dashboard Preview

  • Streamlit Dashboard - EC2-deployed dashboard
  • Airflow UI - EC2-deployed Airflow UI

Note: I had to stop the EC2 instance because the pipeline needed a t2.medium instance, which isn't covered by the free tier, and I couldn't afford to keep it running.
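
Since the hosted dashboard is offline, here is a minimal sketch of how a Streamlit app can pull the pipeline's Parquet output from S3 and chart it with Plotly (the object key, column names, and caching interval are illustrative assumptions, not the repository's actual code):

import os

import pandas as pd
import plotly.express as px
import streamlit as st

st.title("Reddit Sentiment Dashboard")

@st.cache_data(ttl=600)  # re-read from S3 at most every 10 minutes
def load_data() -> pd.DataFrame:
    # pandas + pyarrow + s3fs read the Parquet file straight from the bucket
    path = f"s3://{os.environ['S3_BUCKET_NAME']}/reddit/latest.parquet"
    return pd.read_parquet(path)

df = load_data()
st.metric("Posts analyzed", len(df))
fig = px.histogram(df, x="polarity", nbins=40, title="Sentiment polarity distribution")
st.plotly_chart(fig, use_container_width=True)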

📸 Screenshots

Since I had to stop the EC2 instance, here are some screenshots showing the application in action.

  • Airflow UI (DAGs running daily)
  • Calendar view
  • Streamlit Dashboard
  • Streamlit Chart 1
  • Streamlit Chart 2
  • Streamlit Chart 3

🔹 Key Features

  • Automated Data Pipeline – Extracts, transforms, stores, and visualizes Reddit data.
  • Parallel Processing – Uses the Airflow CeleryExecutor with Redis for scalability (example settings below).
  • Secure Deployment – Manages secrets via GitHub Secrets & AWS Secrets Manager.
  • CI/CD Pipeline – Automates Docker builds & deployment via GitHub Actions.
  • Scalable Infrastructure – Runs as Docker containers, deployable to AWS, GCP, or Kubernetes.
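
For reference, the stock Airflow docker-compose setup wires the CeleryExecutor, Redis broker, and Postgres metadata database together with environment variables along these lines (shown only as a reference point; the redis and postgres hostnames depend on the service names in the compose file):

AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres/airflow
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow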


📞 Contact & Contributions

👨‍💻 Author: Madhur Dixit
🤝 Contributions: PRs are welcome! Open an issue to discuss improvements.
🌟 Star this repo if you found it useful!

🔹 License

This project is licensed under the MIT License. Feel free to modify and use it.

