Multi-Modal Representation Learning for Social Media Popularity Prediction

This project predicts the popularity of Reddit posts from multi-modal features: post text, images, and metadata. It combines an Airflow-orchestrated ETL pipeline with embedding models and a TensorFlow model into an automated workflow for data processing and model training.

Key Technologies and Features

ETL Pipeline

Apache Airflow: Orchestrates the entire data pipeline, from scraping to model training, on a daily schedule.

Deep Learning and Embeddings

Image Caption Generation: Automatically generates detailed descriptive captions for images using multi-modal large language models (LLMs).
TensorFlow: Powers the multimodal deep learning model for popularity prediction.
Text Embeddings: Represents post titles as embedding vectors. Model used: FlagEmbedding's bge-m3.
Image Embeddings: Extracts visual features from post images. Model used: a Vision Transformer (ViT) image classification model.
Visual Embeddings: Embeds image and text together into a single multimodal representation. Model used: FlagEmbedding's VisualBGE.

Data Processing

Reddit API (PRAW): Scrapes post data from Reddit.
FlagEmbedding: Creates the visual and combined embeddings.

Features Used for Prediction

Title Embeddings
Image Embeddings
Caption Embeddings (generated from images)
Visual Embeddings (combined image and text)
Post Metadata
Author's Metadata
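
As a rough illustration of how these feature groups could be combined, the sketch below concatenates per-post embedding vectors and numeric metadata into one flat feature vector. The group names, the vector sizes, and the `merge_features` helper are hypothetical and are not taken from the project's code, which may merge features differently.

```python
from typing import Dict, List, Sequence

# Hypothetical group names and order; the project's actual layout may differ.
FEATURE_ORDER = [
    "title_emb",
    "image_emb",
    "caption_emb",
    "visual_emb",
    "post_meta",
    "author_meta",
]


def merge_features(post: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate every feature group of one post into a single flat vector."""
    merged: List[float] = []
    for name in FEATURE_ORDER:
        if name not in post:
            raise KeyError(f"missing feature group: {name}")
        merged.extend(float(x) for x in post[name])
    return merged


# Toy input with tiny placeholder vectors (real embeddings are much longer).
example_post = {
    "title_emb": [0.1, 0.2],
    "image_emb": [0.3, 0.4],
    "caption_emb": [0.5],
    "visual_emb": [0.6],
    "post_meta": [1.0],     # assumed: some numeric post attribute
    "author_meta": [42.0],  # assumed: some numeric author attribute
}
feature_vector = merge_features(example_post)
```

Keeping the group order fixed in one place means every post yields a vector with the same column layout, which a downstream model would need.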

Airflow Pipeline Overview

The Airflow-managed pipeline includes:

Data Scraping
Data Filtering
Image Fetching
Image Caption Generation
Embedding Creation (Text, Visual, Combined)
Feature Merging
Model Training and Evaluation
Model Persistence
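
The stage order above can be sketched as a chain of upstream/downstream pairs. This assumes the stages run strictly one after another in the listed order; the real DAG in smpp_pipeline.py may branch (for example, by creating the different embeddings in parallel). The task identifiers below are made up for illustration. In Airflow, each pair would correspond to a dependency declared with the `>>` operator between two tasks.

```python
from typing import List, Tuple

# Identifiers are hypothetical; they only mirror the stage list above.
STAGES = [
    "scrape_data",
    "filter_data",
    "fetch_images",
    "generate_captions",
    "create_embeddings",
    "merge_features",
    "train_and_evaluate",
    "persist_model",
]


def linear_dependencies(stages: List[str]) -> List[Tuple[str, str]]:
    """Return (upstream, downstream) pairs for a strictly sequential pipeline."""
    return list(zip(stages, stages[1:]))


edges = linear_dependencies(STAGES)
for upstream, downstream in edges:
    # Printed in the same notation Airflow uses for task dependencies.
    print(f"{upstream} >> {downstream}")
```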

The pipeline runs daily, retraining the model as new data is scraped. Each trained model is saved along with its evaluation metrics so that performance can be tracked over time.
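
One way to keep per-run metrics for later comparison is to write a timestamped JSON file for every training run. The sketch below shows only that idea: the file naming scheme, the `save_run_metrics` and `load_history` helpers, and the metric names are assumptions, not the project's actual persistence code, and the model file itself is left to the training step.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_run_metrics(metrics: dict, out_dir: Path) -> Path:
    """Write one run's evaluation metrics to a timestamped JSON file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out_dir / f"metrics_{stamp}.json"
    path.write_text(json.dumps({"trained_at": stamp, **metrics}, indent=2))
    return path


def load_history(out_dir: Path) -> list:
    """Load all saved metric files, oldest first (file names sort by timestamp)."""
    return [json.loads(p.read_text()) for p in sorted(out_dir.glob("metrics_*.json"))]


# Demo in a temporary directory; metric names and values are placeholders.
demo_dir = Path(tempfile.mkdtemp())
saved_path = save_run_metrics({"mae": 0.0, "rmse": 0.0}, demo_dir)
history = load_history(demo_dir)
```

Because the timestamp is part of the file name, sorting the file names is enough to recover the runs in chronological order.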

Installation and Setup

Dependencies

Apache Airflow

```
export AIRFLOW_HOME=~/airflow
AIRFLOW_VERSION=2.9.1
PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
```

FlagEmbedding

```
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
pip install torchvision timm einops ftfy
```

Note: Download the Visual Embedding model from BAAI/bge-visualized and set its local path at line 50 of src/t06.4_create_embeddings_combined.py.

Other Dependencies

```
pip install -r requirements.txt
```

Project Setup

Clone the repository:

```
git clone https://github.com/DistilledCode/mmrl.git
cd mmrl
```

Configure Reddit credentials in praw.ini:

```
[bot1]
client_id=secret
client_secret=secret
username=secret
password=secret

[bot2]
client_id=secret
client_secret=secret
password=secret
username=secret
```
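
Before starting the scraper, it can help to check that every section of praw.ini carries all four credentials. The following is a minimal sketch using Python's standard `configparser`; the `missing_credentials` helper is hypothetical and not part of this repository, and the sample text deliberately omits one key to show what a failure looks like.

```python
import configparser

REQUIRED_KEYS = {"client_id", "client_secret", "username", "password"}


def missing_credentials(ini_text: str) -> dict:
    """Map each section name to the set of required keys it lacks."""
    # Interpolation is disabled so that '%' characters in secrets are harmless.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    problems = {}
    for section in parser.sections():
        missing = REQUIRED_KEYS - set(parser[section].keys())
        if missing:
            problems[section] = missing
    return problems


# Sample config in which bot2 is missing its username on purpose.
sample = """
[bot1]
client_id=secret
client_secret=secret
username=secret
password=secret

[bot2]
client_id=secret
client_secret=secret
password=secret
"""
result = missing_credentials(sample)
print(result)  # {'bot2': {'username'}}
```

To check the real file, pass its contents, for example `missing_credentials(open("praw.ini").read())`; an empty dict means no section is missing a key.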

Start the scraper:
```
./monitor_scrapper.sh
```

Set up the Airflow environment:

```
export PROJ_DIR=$PWD
cp praw.ini smpp_pipeline.py ~/airflow/dags
```

Launch Airflow:
```
airflow scheduler &
airflow webserver -p 8080 &
```
Access the Airflow web interface at http://localhost:8080 to enable and monitor the pipeline.
