This project builds an easy-to-use data pipeline that ingests the 1% sample of English tweets and computes a cyberbullying sentiment for each tweet, based on its text. The predicted sentiment can be:
- neutral
- sexism
- racism
Kibana can then be used to build custom visualizations of the data.
The data pipeline is composed of the following architecture:
- Logstash: data ingestion tool used to consume the 1% Twitter firehose sample. Official Twitter doc
- Kafka: persistence and message queue
- Spark | Spark NLP: distributed big data processing engine and NLP library. Credits - Spark NLP John Snow Labs
- Elasticsearch: distributed search and analytics engine for indexing the enriched tweets
- Kibana: great visualization tool
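Wired together with Docker Compose, the architecture above might be sketched as follows. This is an illustrative fragment only: the images, versions, and service names are assumptions, not the project's actual docker-compose.yaml.

```yaml
# Illustrative sketch of the pipeline's services; images and
# versions are assumptions, not the project's real compose file.
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.0.1
  kafka:
    image: confluentinc/cp-kafka:7.0.1
    depends_on: [zookeeper]
  logstash:
    image: docker.elastic.co/logstash/logstash:8.2.0
    env_file: .env            # Twitter credentials from the root-level .env
    depends_on: [kafka]
  spark-master:
    image: bitnami/spark:3.1.1
  spark-worker:
    image: bitnami/spark:3.1.1
    depends_on: [spark-master]
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.2.0
  kibana:
    image: docker.elastic.co/kibana/kibana:8.2.0
    depends_on: [elasticsearch]
```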
A Twitter developer account is needed in order to authenticate for tweet ingestion.
A .env file must be placed in the root-level directory, containing the following variables:
LOGSTASH_CONS_KEY=""
LOGSTASH_CONS_SECRET=""
LOGSTASH_OAUTH_KEY=""
LOGSTASH_OAUTH_SECRET=""
Please note: each variable value must be enclosed in double quotes ("").
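As a quick sanity check before booting the pipeline, the credentials file can be validated with a short Python sketch. The parser below is a hypothetical helper written for this README, not part of the project:

```python
# Hypothetical .env validator for the four Logstash/Twitter credentials.
REQUIRED_KEYS = [
    "LOGSTASH_CONS_KEY",
    "LOGSTASH_CONS_SECRET",
    "LOGSTASH_OAUTH_KEY",
    "LOGSTASH_OAUTH_SECRET",
]

def parse_env(path: str) -> dict:
    """Parse KEY="value" lines from a .env file into a dict."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env

def missing_keys(env: dict) -> list:
    """Return the required credential keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

if __name__ == "__main__":
    env = parse_env(".env")
    if missing_keys(env):
        raise SystemExit(f"Missing credentials: {missing_keys(env)}")
```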
The following applications are available:
- Kafka UI: port 8080
- Spark master UI: port 8081
- Spark worker UI: port 8082
- ElasticSearch: RESTful interface at port 9200
- Kibana: port 5601
Pretty much standard ports.
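Once the stack is up, the endpoints above can be probed with a small Python sketch; `service_url` and `is_up` are hypothetical helpers using only the ports listed, not part of the project:

```python
import urllib.request

# Ports exposed by the pipeline, as listed above.
SERVICE_PORTS = {
    "kafka-ui": 8080,
    "spark-master-ui": 8081,
    "spark-worker-ui": 8082,
    "elasticsearch": 9200,
    "kibana": 5601,
}

def service_url(name: str, host: str = "localhost") -> str:
    """Return the HTTP endpoint for one of the pipeline's web interfaces."""
    return f"http://{host}:{SERVICE_PORTS[name]}"

def is_up(name: str) -> bool:
    """Best-effort reachability check against the service's endpoint."""
    try:
        urllib.request.urlopen(service_url(name), timeout=2)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    for name in SERVICE_PORTS:
        print(name, "->", service_url(name), "up" if is_up(name) else "down")
```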
Boot the data pipeline simply by running, from the root-level directory:
sudo docker compose -f docker-compose.yaml up
Then, in order to launch the cyberbullying sentiment analysis:
- Attach a shell to the spark-driver container
- Run:
spark-submit --master spark://spark-master:7077 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1,com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0,org.elasticsearch:elasticsearch-spark-30_2.12:8.2.0 --conf="spark.driver.memory=3G" --conf="spark.executor.memory=4G" /opt/tap-project/code/twitter_stream_es_service.py
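The submitted job, twitter_stream_es_service.py, roughly follows the structure sketched below: read from Kafka, classify with Spark NLP, write to Elasticsearch. This is a sketch only, to be run inside the cluster; the broker address, topic name, model path, and index name are illustrative assumptions not confirmed by this README.

```python
# Structural sketch of the streaming job; topic, model path, and
# index names are assumptions. Requires the running Spark cluster.
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("cyberbullying-sentiment").getOrCreate()

# 1. Read the raw tweets that Logstash pushed into Kafka.
tweets = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker
          .option("subscribe", "tweets")                    # assumed topic
          .load()
          .selectExpr("CAST(value AS STRING) AS text"))

# 2. Classify each tweet with a pre-fitted Spark NLP pipeline,
#    producing one of: neutral, sexism, racism.
model = PipelineModel.load("/opt/tap-project/model")        # assumed path
classified = model.transform(tweets)

# 3. Index the enriched tweets into Elasticsearch.
(classified.writeStream
 .format("es")
 .option("es.nodes", "elasticsearch")
 .option("checkpointLocation", "/tmp/es-checkpoint")
 .start("tweets-sentiment")                                 # assumed index
 .awaitTermination())
```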
If you ever have the chance to follow Technologies for Advanced Programming at the University of Catania, don't miss this great experience. Syllabus | TECHNOLOGIES FOR ADVANCED PROGRAMMING (unict.it)