This repository contains the methods used in a Twitter data research project. The research investigates whether the number of tweets about a certain video game is linked to the number of copies sold in Europe. The data is provided by Twiqs.nl and contains most of the tweets from the Netherlands from late 2010 until late 2015. The tweets are mostly in Dutch, because that is the language Twiqs targets when collecting tweets.
View the graphs made during this research on this page.
How to run it yourself
This repository contains a project folder named MapReduce. This folder contains the Maven project that is required to perform the MapReduce job on the data set of tweets. In order to perform this job yourself, please follow these steps:
- Copy the project folder to the cluster (for example, zip the project, run scp to move it to the cluster and then unzip it).
- Navigate to the project folder (the folder containing a file called pom.xml) and run mvn package.
- After this, you can run the MapReduce job by navigating into the target folder and executing the following command:
hadoop jar bigdata-0.2.jar gametweets.GameTweets <input path names...> <name of output folder>
(so, if you wish to run the job on the months September and October of 2013, use /data/twitterNL/201309/* /data/twitterNL/201310/* as input paths).
- The result of the job will end up in the specified output folder on the HDFS.
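When the job has to cover many months, typing one input path per month gets tedious. The following is a minimal sketch of a helper that builds the command from a month range, assuming every month follows the /data/twitterNL/<yyyyMM>/* layout shown in the example above. The class JobCommandBuilder and its methods are hypothetical and not part of this repository.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the repository.
class JobCommandBuilder {
    // Assumption: month folders are named <year><two-digit month>, e.g. 201309.
    private static final DateTimeFormatter MONTH_FOLDER =
            DateTimeFormatter.ofPattern("uuuuMM");

    /** Returns one input glob per month from first to last, both inclusive. */
    static List<String> inputPaths(YearMonth first, YearMonth last) {
        List<String> paths = new ArrayList<>();
        for (YearMonth m = first; !m.isAfter(last); m = m.plusMonths(1)) {
            paths.add("/data/twitterNL/" + m.format(MONTH_FOLDER) + "/*");
        }
        return paths;
    }

    /** Builds the full hadoop command for the given month range and output folder. */
    static String command(YearMonth first, YearMonth last, String outputFolder) {
        return "hadoop jar bigdata-0.2.jar gametweets.GameTweets "
                + String.join(" ", inputPaths(first, last))
                + " " + outputFolder;
    }

    public static void main(String[] args) {
        // The output folder name here is only an example.
        System.out.println(command(YearMonth.of(2013, 9), YearMonth.of(2013, 10), "gametweets-output"));
    }
}
```

The printed command can be pasted into a shell on the cluster from within the target folder.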
Tweet count processing
After the MapReduce job has finished, the Java program in /Website/GraphGenerator.java can be used to process the output files.
- The input folder should be the folder that you specified as the output folder for the MapReduce job (the folder that contains _SUCCESS and the part files). Put its path in the inputFolder variable located on line 23.
- The desired output file can be set in the variable called output on line 119.
- Run the program through your IDE or by compiling and running the file yourself.
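To illustrate what processing the part files can involve, here is a minimal sketch that sums the counts per key across all part files in a folder. It assumes the job output has been copied from the HDFS to the local file system (for example with hdfs dfs -get) and that each line is a key and an integer count separated by a tab, which is the default text output layout of Hadoop. The class PartFileReader is hypothetical. The actual logic in GraphGenerator.java may differ.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical reader, not the repository's GraphGenerator.
class PartFileReader {
    /** Sums the counts per key over every part file in the given folder. */
    static Map<String, Long> readCounts(Path outputFolder) throws IOException {
        Map<String, Long> counts = new TreeMap<>();
        // The "part-*" glob skips _SUCCESS and any other non-part files.
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(outputFolder, "part-*")) {
            for (Path part : parts) {
                for (String line : Files.readAllLines(part, StandardCharsets.UTF_8)) {
                    // Assumption: "<key>\t<count>" per line; the last tab separates the count.
                    int tab = line.lastIndexOf('\t');
                    if (tab < 0) {
                        continue; // not a key/value line
                    }
                    String key = line.substring(0, tab);
                    long value = Long.parseLong(line.substring(tab + 1).trim());
                    counts.merge(key, value, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PartFileReader <local copy of the job output folder>");
            return;
        }
        readCounts(Paths.get(args[0])).forEach((key, count) -> System.out.println(key + "\t" + count));
    }
}
```

Using a TreeMap keeps the keys sorted, which is convenient when the result is written out for plotting.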
Getting the VGChartz data
To get the sales numbers for plotting, run VGChartzScraper.java. This program visits the VGChartz website and collects the required information (sales numbers in the 10 weeks after release). It generates the file sold.php, which can be used on the website.
Deploying the website
As a last step, the processed data can be put on a web server to make it accessible on the web. Place /Website/index.php on the web server, along with the data file generated by the Java processing step and the /Website/images/ folder (which contains the game covers). The index.php file expects to find the data file as data.php. The website should now be accessible on your web server and display the graphs for all the games.