- 👩‍💻 Janice Tan Yi Xuan
- 👩‍💻 Li Yishan
- 👩‍💻 Ma Xiaojing
- 👨‍💻 Ranjit Khare
- 👨‍💻 Wang Xiaoyuan
Tweepy: Connects to Twitter and retrieves both historical and streaming tweets. The data is then saved to HDFS or sent to a TCP port.
Spark SQL/MLlib: Loads the messages from HDFS, processes them, and saves the results for visualisation in matplotlib or Tableau.
Spark Streaming: Consumes the messages from the TCP port, processes them, and sends the results to the Flask server.
Flask: Python web framework that receives the data from Spark and displays the dashboards.
Tweepy:
- Run "Full archive tweet tweepy-v4.ipynb" to extract historical data
- Run "Stream tweet tweepy.ipynb" to extract real-time data and send it to the TCP port
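
The hand-off to the TCP port can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function names `open_listener` and `send_tweets`, the JSON-lines wire format, and the sample tweet fields are all assumptions. It relies on the fact that a Spark Streaming socket source connects as a client and reads newline-delimited text, so the tweet producer has to act as the TCP server.

```python
import json
import socket


def open_listener(host="127.0.0.1", port=0):
    """Bind a TCP listener (hypothetical helper). port=0 lets the OS pick a free port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_tweets(server, tweets):
    """Accept one client (e.g. Spark Streaming) and send each tweet as one JSON line.

    json.dumps escapes newlines inside the tweet text, so every tweet
    occupies exactly one line on the wire.
    """
    conn, _addr = server.accept()
    sent = 0
    with conn:
        for tweet in tweets:
            line = json.dumps(tweet, ensure_ascii=False) + "\n"
            conn.sendall(line.encode("utf-8"))
            sent += 1
    return sent
```

In a real run, `tweets` would presumably be fed from a Tweepy stream callback rather than a fixed list, and the listener would be bound to the fixed port that the Spark Streaming notebook connects to.
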
Spark SQL/MLlib:
- Install pyspark
- Run "Tweet LDA.ipynb" for data processing and modelling
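
Topic models such as LDA work on token lists rather than raw text, so some cleaning step usually comes first. The sketch below shows one plausible way to do it in plain Python. The function name `tokenize`, the small stop-word set, and the minimum token length are assumptions for illustration; the notebook's own preprocessing may differ.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # links carry no topical words
MENTION_RE = re.compile(r"@\w+")       # user handles
TOKEN_RE = re.compile(r"[a-z]+")       # runs of ASCII letters only
# Tiny illustrative stop-word set, not a complete list.
STOPWORDS = frozenset({"the", "and", "for", "with", "this", "that", "are", "was"})


def tokenize(text, min_len=3):
    """Lowercase a tweet, strip URLs and mentions, and return its word tokens."""
    text = URL_RE.sub(" ", text.lower())
    text = MENTION_RE.sub(" ", text)
    return [
        token
        for token in TOKEN_RE.findall(text)
        if len(token) >= min_len and token not in STOPWORDS
    ]
```

Token lists of this shape are what `CountVectorizer` and `LDA` in `pyspark.ml` take as input, so a function like this could be applied to each tweet before vectorising.
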
Visualization:
- Install bar_chart_race
- Run "AprTweetsBarChart.ipynb" to create the bar chart race video
Spark Streaming:
- Run "Tweet stream process.ipynb" to process real-time data and send the results to the Flask server
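
Sending processed results to the Flask server could look roughly like the sketch below, which uses only the standard library. The `/update` endpoint, the `labels`/`values` payload shape, and the function names are hypothetical; only the `localhost:5000` default comes from this project's Flask setup.

```python
import json
import urllib.request


def build_update_request(counts, url="http://localhost:5000/update"):
    """Build a JSON POST request carrying counts sorted from highest to lowest.

    The "/update" path and the payload keys are assumptions, not the project's API.
    """
    labels = sorted(counts, key=counts.get, reverse=True)
    payload = {"labels": labels, "values": [counts[label] for label in labels]}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_update(counts, url="http://localhost:5000/update", timeout=5):
    """Send the request and return the HTTP status code of the response."""
    request = build_update_request(counts, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A function like `send_update` could be called once per micro-batch, for example from a `foreachRDD` callback after collecting the batch's counts on the driver. It needs the Flask application to be running first.
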
Flask:
- Source the flask.rc file in this project ("source flask.rc")
- Run "flask run"; by default, this starts the application at localhost:5000
