Description
We ingest a Twitter feed using Flume. The tweets are stored as Avro files. To use this data, we create tables in Hive with the AvroSerDe. Once created, the data can be cleansed using tools like OpenRefine or Pig, and the cleansed data can then be used for visualization.
Components
- twitter.conf stores all the configuration required for ingesting tweets.
- TwitterDataAvroSchema.avsc contains the Avro schema.
- avrodataread.q creates a staging table using the AvroSerDe.
- create_tweets_avro_table.q creates a processing table with a well-defined DDL.
Prerequisites
To run this software, you need the following:
- Linux
- Hadoop 2.0
- Hive 2.0
- Flume
- Twitter Developer App Credentials
Steps
- Get credentials for developing Twitter apps.
- Write a twitter.conf file and replace the variables with the secret keys given by Twitter (a sketch of this file follows the steps below).
- Run the Flume agent with twitter.conf from the terminal:

    flume-ng agent -n TwitterAgent -f $FLUME_CONF_DIR/twitter.conf

- Get the schema from the Avro log file:

    hdfs dfs -cat /user/flume/tweets/FlumeData.* | head

- Copy the schema, save it in a file called TwitterDataAvroSchema.avsc, and edit the file for readability (an illustrative fragment also follows the steps).
- Write an HQL file called avrodataread.q that creates the tweets table using the AvroSerDe, naming the Avro schema file in TBLPROPERTIES (see the sketch after the steps).
- Execute the file in the terminal (the path contains a space, so quote it):

    hive -f "FlumeHiveTwitterApp/Hive scripts/avrodataread.q"

- To create a table for processing or for visualization, execute the file named create_tweets_avro_table.q (a sketch also follows the steps):

    hive -f "FlumeHiveTwitterApp/Hive scripts/create_tweets_avro_table.q"

- Cleanse the data using tools like Pig or OpenRefine.
- Visualize the data in a dashboard using tools like Tableau or D3.js.
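
A minimal sketch of what twitter.conf might look like, assuming Flume's bundled TwitterSource (org.apache.flume.source.twitter.TwitterSource, which emits Avro events), an HDFS sink, and a memory channel; the credential values are placeholders for your own keys, and the sink and channel settings are illustrative defaults to tune for your cluster:

```properties
# Name the agent's components (the agent name must match -n TwitterAgent above)
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source: fill in the credentials from your Twitter developer app
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <YOUR_CONSUMER_KEY>
TwitterAgent.sources.Twitter.consumerSecret = <YOUR_CONSUMER_SECRET>
TwitterAgent.sources.Twitter.accessToken = <YOUR_ACCESS_TOKEN>
TwitterAgent.sources.Twitter.accessTokenSecret = <YOUR_ACCESS_TOKEN_SECRET>

# HDFS sink: writes events under the path read in the schema step
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# Memory channel buffering events between source and sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
```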
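
The schema printed by the hdfs dfs -cat command above is embedded in the header of the Avro container file. After reformatting, TwitterDataAvroSchema.avsc looks roughly like the trimmed, illustrative fragment below; the exact record name and field list depend on the Flume source version, so keep whatever your files actually contain:

```json
{
  "type": "record",
  "name": "Doc",
  "fields": [
    {"name": "id",               "type": "string"},
    {"name": "created_at",       "type": ["null", "string"], "default": null},
    {"name": "user_screen_name", "type": ["null", "string"], "default": null},
    {"name": "text",             "type": ["null", "string"], "default": null},
    {"name": "retweet_count",    "type": ["null", "long"],   "default": null}
  ]
}
```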
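
avrodataread.q could look like the following sketch, assuming TwitterDataAvroSchema.avsc has been uploaded to HDFS; the schema URL and table location are assumptions to adapt to your cluster:

```sql
-- Staging table over the raw Flume output; columns are derived
-- automatically from the Avro schema named in TBLPROPERTIES
CREATE EXTERNAL TABLE tweets
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/flume/tweets'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///user/flume/TwitterDataAvroSchema.avsc');
```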
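
create_tweets_avro_table.q would then define the processing table with an explicit DDL and populate it from the staging table. A plausible sketch, using the illustrative column names from the schema fragment above (adjust names and types to your actual schema):

```sql
-- Processing table with a well-defined DDL, stored in a columnar format
CREATE TABLE tweets_processed (
  id            STRING,
  created_at    STRING,
  screen_name   STRING,
  text          STRING,
  retweet_count BIGINT
)
STORED AS ORC;

-- Populate it from the Avro staging table created by avrodataread.q
INSERT OVERWRITE TABLE tweets_processed
SELECT id, created_at, user_screen_name, text, retweet_count
FROM tweets;
```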