Skip to content

Rameshei87/Reddit-Streaming-Data

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Reddit-Streaming-Data

An attempt to stream data from the reddit api using kafka, process with spark, and store in s3 data lake.

Build dockerfiles Go to docker directory and run build script.

./build.sh

docker-compose up -d --no-recreate

Start streaming data

cd redditStreaming/src/main/python/reddit

python3 -m reddit_streaming.py

Start kafka producer python3 -m reddit_producer.py

Etc. Remove untagged docker images

docker rmi $(docker images | grep "^" | awk "{print $3}")

Note on versions

When changing version of spark, hadoop, jupyterlab, etc, versions must be updated in build.sh, respective *.Dockerfile, pom.xml, requirements.txt and reddit_streaming.py.

pyspark write stream to s3 error

Likely caused by guava jar mismatch, follow steps here: https://kontext.tech/article/689/pyspark-read-file-in-google-cloud-storage

Kafka/zookeeper broker id mismatch

wurstmeister/kafka-docker#409

If there are kafka errors, run docker-compose down, delete cluster_config/kafka/logs and cluster_config/zookeeper/data/version-2 directories, run docker-compose up -d.

S3 s3://aws-glue-assets-965504608278-us-east-2/scripts/

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published