This project builds an easy-to-use data pipeline that ingests the 1% sample of English tweets and computes a cyberbullying sentiment for each tweet, based on its text. The predicted sentiment can be:
- neutral
- sexism
- racism
Kibana can then be used to build custom visualizations of the data.
The data pipeline is composed of the following architecture:
- Logstash: data ingestion tool used to consume the 1% Twitter firehose sample. Official Twitter doc
- Kafka: persistence and message queue
- Spark | Spark NLP: distributed big data processing engine and NLP library. Credits - Spark NLP John Snow Labs
- Elasticsearch: distributed search and analytics engine for indexing the enriched tweets
- Kibana: great visualization tool
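Wired together with Docker Compose, the architecture above might be sketched as follows. This is an illustrative fragment only: the images, versions, and service names are assumptions, not the project's actual docker-compose.yaml.

```yaml
# Illustrative sketch of the pipeline's services; images and
# versions are assumptions, not the project's real compose file.
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.0.1
  kafka:
    image: confluentinc/cp-kafka:7.0.1
    depends_on: [zookeeper]
  logstash:
    image: docker.elastic.co/logstash/logstash:8.2.0
    env_file: .env            # Twitter credentials from the root-level .env
    depends_on: [kafka]
  spark-master:
    image: bitnami/spark:3.1.1
  spark-worker:
    image: bitnami/spark:3.1.1
    depends_on: [spark-master]
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.2.0
  kibana:
    image: docker.elastic.co/kibana/kibana:8.2.0
    depends_on: [elasticsearch]
```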
A Twitter developer account is needed in order to authenticate for tweet ingestion.
A .env file must be placed in the root-level directory, containing the following variables:
LOGSTASH_CONS_KEY=""
LOGSTASH_CONS_SECRET=""
LOGSTASH_OAUTH_KEY=""
LOGSTASH_OAUTH_SECRET=""
Please note: each variable value must be enclosed in double quotes ("").
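As a quick sanity check before booting the pipeline, the credentials file can be validated with a short Python sketch. The parser below is a hypothetical helper written for this README, not part of the project:

```python
# Hypothetical .env validator for the four Logstash/Twitter credentials.
REQUIRED_KEYS = [
    "LOGSTASH_CONS_KEY",
    "LOGSTASH_CONS_SECRET",
    "LOGSTASH_OAUTH_KEY",
    "LOGSTASH_OAUTH_SECRET",
]

def parse_env(path: str) -> dict:
    """Parse KEY="value" lines from a .env file into a dict."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env

def missing_keys(env: dict) -> list:
    """Return the required credential keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

if __name__ == "__main__":
    env = parse_env(".env")
    if missing_keys(env):
        raise SystemExit(f"Missing credentials: {missing_keys(env)}")
```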
The following applications are available:
- Kafka UI: port 8080
- Spark master UI: port 8081
- Spark worker UI: port 8082
- ElasticSearch: RESTful interface at port 9200
- Kibana: port 5601
Pretty much standard ports.
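Once the stack is up, the endpoints above can be probed with a small Python sketch; `service_url` and `is_up` are hypothetical helpers using only the ports listed, not part of the project:

```python
import urllib.request

# Ports exposed by the pipeline, as listed above.
SERVICE_PORTS = {
    "kafka-ui": 8080,
    "spark-master-ui": 8081,
    "spark-worker-ui": 8082,
    "elasticsearch": 9200,
    "kibana": 5601,
}

def service_url(name: str, host: str = "localhost") -> str:
    """Return the HTTP endpoint for one of the pipeline's web interfaces."""
    return f"http://{host}:{SERVICE_PORTS[name]}"

def is_up(name: str) -> bool:
    """Best-effort reachability check against the service's endpoint."""
    try:
        urllib.request.urlopen(service_url(name), timeout=2)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    for name in SERVICE_PORTS:
        print(name, "->", service_url(name), "up" if is_up(name) else "down")
```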
Boot the data pipeline simply by running, from the root-level directory:
sudo docker compose -f docker-compose.yaml up
Then, in order to launch the cyberbullying sentiment analysis:
- Attach a shell to the spark-driver container
- Run:
spark-submit --master spark://spark-master:7077 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1,com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0,org.elasticsearch:elasticsearch-spark-30_2.12:8.2.0 --conf="spark.driver.memory=3G" --conf="spark.executor.memory=4G" /opt/tap-project/code/twitter_stream_es_service.py
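The submitted job, twitter_stream_es_service.py, roughly follows the structure sketched below: read from Kafka, classify with Spark NLP, write to Elasticsearch. This is a sketch only, to be run inside the cluster; the broker address, topic name, model path, and index name are illustrative assumptions not confirmed by this README.

```python
# Structural sketch of the streaming job; topic, model path, and
# index names are assumptions. Requires the running Spark cluster.
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("cyberbullying-sentiment").getOrCreate()

# 1. Read the raw tweets that Logstash pushed into Kafka.
tweets = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker
          .option("subscribe", "tweets")                    # assumed topic
          .load()
          .selectExpr("CAST(value AS STRING) AS text"))

# 2. Classify each tweet with a pre-fitted Spark NLP pipeline,
#    producing one of: neutral, sexism, racism.
model = PipelineModel.load("/opt/tap-project/model")        # assumed path
classified = model.transform(tweets)

# 3. Index the enriched tweets into Elasticsearch.
(classified.writeStream
 .format("es")
 .option("es.nodes", "elasticsearch")
 .option("checkpointLocation", "/tmp/es-checkpoint")
 .start("tweets-sentiment")                                 # assumed index
 .awaitTermination())
```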
If you ever have the chance to follow Technologies for Advanced Programming at the University of Catania, don't miss this great experience. Syllabus | TECHNOLOGIES FOR ADVANCED PROGRAMMING (unict.it)