Shawvin/Big-data-project

Topic modelling and sentiment analysis on Twitter data with Spark

Authors

  • 👩‍💻 Janice Tan Yi Xuan
  • 👩‍💻 Li Yishan
  • 👩‍💻 Ma Xiaojing
  • 👨‍💻 Ranjit Khare
  • 👨‍💻 Wang Xiaoyuan

Architecture

Tweepy: connects to Twitter and retrieves both historical and streaming tweets. The data is then saved to HDFS or sent to a TCP port.

Spark SQL/MLlib: loads the messages from HDFS, processes them, and saves the results for visualisation in matplotlib or Tableau.

Spark Streaming: consumes the messages from the TCP port, processes them, and sends the results to the Flask server.

Flask: a Python web framework that receives the data from Spark and displays the dashboards.

Dashboard

[Image: dashboard]

Instructions

Tweepy:

  1. Run "Full archive tweet tweepy-v4.ipynb" to extract historical data.
  2. Run "Stream tweet tweepy.ipynb" to extract real-time data and send it to a TCP port.
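
The hand-off to the TCP port in step 2 can be pictured with the sketch below. It is not taken from the notebook: the helper name send_tweets is hypothetical, and it assumes tweets are forwarded as newline-terminated UTF-8 text, which is the line-oriented format a Spark socket text source reads.

```python
import socket


def send_tweets(conn: socket.socket, texts) -> int:
    """Send each tweet as one newline-terminated UTF-8 line; return the count sent.

    Hypothetical helper for illustration; the notebook may frame messages differently.
    """
    sent = 0
    for text in texts:
        # Collapse all whitespace runs (including line breaks) into single
        # spaces so that one tweet always occupies exactly one line.
        line = " ".join(text.split())
        if not line:
            continue  # skip tweets that are empty after normalisation
        conn.sendall(line.encode("utf-8") + b"\n")
        sent += 1
    return sent
```

In a setup like the one described here, conn would typically be the socket returned by accept() on a listening server socket once the Spark job connects; the port number is a free choice and is not specified in this README.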

Spark SQL/MLlib:

  1. Install pyspark.
  2. Run "Tweet LDA.ipynb" for data processing and modelling.
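
This README does not describe the notebook's preprocessing, so the following is only a sketch of the kind of tokenisation that commonly precedes topic modelling on tweets. The function name tokenize_tweet, the regular expressions, and the three-letter minimum are all assumptions made for illustration.

```python
import re

# Assumed cleaning rules, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize_tweet(text: str) -> list:
    """Lowercase a tweet, drop URLs and @mentions, and keep alphabetic tokens."""
    text = URL_RE.sub(" ", text.lower())
    text = MENTION_RE.sub(" ", text)
    # Keep only alphabetic tokens of at least three letters.
    return [t for t in TOKEN_RE.findall(text) if len(t) >= 3]


tokens = tokenize_tweet("Loving #Spark with @bob! https://t.co/abc123 it is great")
```

A function of this shape could be applied to each tweet before the text is vectorised for a topic model; the choice of rules strongly affects the resulting topics and should be tuned to the data.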

Visualization:

  1. Install bar_chart_race.
  2. Run "AprTweetsBarChart.ipynb" to create the bar chart race video.

Spark Streaming:

  1. Run "Tweet stream process.ipynb" to process the real-time data and send it to the Flask server.
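
As a rough illustration of the last hop, the sketch below builds an HTTP POST carrying processed results as JSON. The endpoint path /update and the payload fields are hypothetical; only the localhost:5000 address comes from the Flask instructions in this README, and the notebook may transmit data differently.

```python
import json
import urllib.request


def build_update_request(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request; hypothetical helper, not the notebook's code."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The path "/update" and the payload keys are assumptions for illustration.
request = build_update_request(
    "http://localhost:5000/update",
    {"labels": ["positive", "negative"], "counts": [12, 7]},
)
# urllib.request.urlopen(request) would send it once the Flask server is running.
```

Building the request separately from sending it keeps the serialisation easy to check without a running server.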

Flask:

  1. Source the flask.rc file in this project (source flask.rc).
  2. Run "flask run"; by default this starts the application at localhost:5000.

About

Project for big data
