This project leverages advanced machine learning and data engineering techniques to predict the popularity of Reddit posts based on various features. It showcases the integration of multiple cutting-edge technologies to create a robust, automated pipeline for data processing and model training.
- Apache Airflow: Orchestrates the entire data pipeline, from scraping to model training, ensuring daily updates and seamless workflow management.
- Image Caption Generation: Automatically generates detailed descriptive captions for images using multi-modal large language models (LLMs).
- TensorFlow: Powers the multimodal deep learning model for popularity prediction.
- Text Embeddings: Utilizes advanced NLP techniques to create meaningful representations of post titles. The model used: FlagEmbedding's bge-m3.
- Image Embeddings: Generates rich visual features from post images. The model used: Vision Transformer Image Classification.
- Visual Embeddings: Combines image and text data for a comprehensive multimodal representation. The model used: FlagEmbedding's VisualBGE.
- Reddit API (PRAW): Facilitates efficient data scraping from Reddit.
- FlagEmbedding: Employed to create sophisticated visual and combined embeddings.
- Title Embeddings
- Image Embeddings
- Caption Embeddings (generated from images)
- Visual Embeddings (combined image and text)
- Post Metadata
- Author's Metadata
The Airflow-managed pipeline includes:
- Data Scraping
- Data Filtering
- Image Fetching
- Image Caption Generation
- Embedding Creation (Text, Visual, Combined)
- Feature Merging
- Model Training and Evaluation
- Model Persistence
The pipeline runs daily, continuously improving the model with new data. Each trained model is saved along with its evaluation metrics for tracking performance over time.
export AIRFLOW_HOME=~/airflow
AIRFLOW_VERSION=2.9.1
PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
pip install torchvision timm einops ftfy
Note: Download the Visual Embedding model from BAAI/bge-visualized and specify the path in src/t06.4_create_embeddings_combined.py:50
.
pip install -r requirements.txt
-
Clone the repository:
git clone https://github.com/DistilledCode/mmrl.git cd mmrl
-
Configure Reddit credentials in
praw.ini
:[bot1] client_id=secret client_secret=secret username=secret password=secret [bot2] client_id=secret client_secret=secret password=secret username=secret
-
Start the scraper:
./monitor_scrapper.sh
-
Set up the Airflow environment:
export PROJ_DIR=$PWD cp praw.ini smpp_pipeline.py ~/airflow/dags
-
Launch Airflow:
airflow scheduler & airflow webserver -p 8080 &
Access the Airflow web interface at
http://localhost:8080
to enable and monitor the pipeline.