Assignment for the Managing Big Data course in which a huge Twitter dataset is analyzed


This repository contains the methods used in a Twitter data research project. The research investigates whether the number of tweets about a certain video game is linked to the number of copies sold in Europe. The data is provided by Twiqs and contains most of the tweets from the Netherlands from late 2010 until late 2015. The tweets are mostly in Dutch, because that is the language Twiqs targets when collecting.


View the graphs made during this research on this page.

How to run it yourself


This repository contains a project folder named MapReduce. This folder contains the Maven project that is required to perform the MapReduce job on the data set of tweets. In order to perform this job yourself, please follow these steps:

  1. Copy the project folder to the cluster (for example, zip the project, run scp to move it to the cluster and then unzip it).
  2. Navigate to the project folder (the folder containing a file called pom.xml) and run mvn package.
  3. After this, run the MapReduce job by navigating into the target folder and executing the following command: hadoop jar bigdata-0.2.jar gametweets.GameTweets <input path names...> <name of output folder>. For example, to run the job on September and October 2013, use /data/twitterNL/201309/* /data/twitterNL/201310/* as input paths.
  4. The result of the job will end up in the specified output folder on the HDFS.
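
The steps above can be sketched as a small shell helper. This is a minimal sketch, not part of the repository: the function names input_paths and job_command, the output folder name gametweets-out and the host user@cluster are hypothetical. Only the jar name, the class name and the /data/twitterNL/<YYYYMM>/* layout come from the steps above. The helper only prints the hadoop command, so you can inspect it before running it.

```shell
#!/usr/bin/env bash
# Sketch of steps 1-3. Host, user and folder names are hypothetical placeholders.
#
# Step 1 (on your own machine):
#   zip -r MapReduce.zip MapReduce
#   scp MapReduce.zip user@cluster:~
# Then on the cluster:
#   unzip MapReduce.zip
#
# Step 2 (in the folder containing pom.xml):
#   mvn package

# Print the HDFS input globs for the given months (format YYYYMM),
# following the /data/twitterNL/<YYYYMM>/* layout mentioned in step 3.
input_paths() {
  local month
  for month in "$@"; do
    printf '/data/twitterNL/%s/* ' "$month"
  done
}

# Print (do not run) the hadoop command for an output folder and a list of months.
job_command() {
  local output_folder="$1"
  shift
  printf 'hadoop jar bigdata-0.2.jar gametweets.GameTweets %s%s\n' \
    "$(input_paths "$@")" "$output_folder"
}

# Step 3 (in the target folder): print the command, then run it.
#   job_command gametweets-out 201309 201310
```

Printing the command instead of executing it keeps the helper safe to try on a machine without Hadoop installed.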

Tweet count processing

After the MapReduce job has finished, the Java program in /Website/ can be used to process the output files.

  1. The input folder should be the folder that you specified as output folder for the MapReduce job (the folder that contains _SUCCESS and the part files). Set the inputFolder variable on line 23 to this folder.
  2. Set the variable called output on line 119 to the desired output file.
  3. Run the program through your IDE or by compiling and running the file yourself.
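
Before pointing inputFolder at a folder, it can help to check that the folder really is a finished job output. The sketch below assumes the job output has first been copied from the HDFS to a local folder, which the steps above do not state explicitly. The function name is_job_output and the local folder name job-output are hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: the job output was copied from the HDFS to a local folder first, e.g.
#   hadoop fs -get <name of output folder> ./job-output

# Return 0 if the directory looks like a finished MapReduce output folder:
# it holds a _SUCCESS marker and at least one part file.
is_job_output() {
  local dir="$1"
  [ -f "$dir/_SUCCESS" ] || return 1
  local part
  for part in "$dir"/part-*; do
    # If the glob matched nothing it stays literal and this test fails.
    [ -f "$part" ] && return 0
  done
  return 1
}

# Usage:
#   is_job_output ./job-output && echo "ready for processing"
```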

Getting the VGChartz data

To get the data for plotting the sales numbers, the accompanying program can be executed. This program visits the VGChartz website and retrieves the required information (sales numbers in the 10 weeks after release). The program generates the file sold.php, which can be used on the website.

Deploying the website

As a last step, the processed data can be put on a web server to make it accessible on the web. Place /Website/index.php on the web server, along with the data file generated by the Java processing and the /Website/images/ folder (which contains the game covers). The index.php file expects to find the data file as data.php. The website should now be accessible on your web server and display the graphs for all the games.
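
The deployment step can be sketched as a copy script. This is a minimal sketch under assumptions: the function name deploy_site is hypothetical, and the web root path depends entirely on your server setup. The file names index.php, data.php and the images folder come from the description above.

```shell
#!/usr/bin/env bash
# Copy the site into a web root. Arguments:
#   $1 = path to the repository's Website folder
#   $2 = data file produced by the Java processing step
#   $3 = web server document root (hypothetical, depends on your server)
deploy_site() {
  local website_dir="$1" data_file="$2" web_root="$3"
  mkdir -p "$web_root" || return 1
  cp "$website_dir/index.php" "$web_root/index.php" || return 1
  # index.php expects to find the data file under the name data.php.
  cp "$data_file" "$web_root/data.php" || return 1
  # The images folder contains the game covers.
  cp -R "$website_dir/images" "$web_root/" || return 1
}

# Usage (paths are placeholders):
#   deploy_site ./Website ./processed-output.php /path/to/webroot
```

Renaming the data file during the copy means the output variable in the Java program can point at any file name.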
