# Exercise


Your task is to implement a small database application that imports a dataset of Twitter tweets from a CSV file into a database.

Your application has to be able to answer queries corresponding to the following questions:

  * How many Twitter users are in our database?
  * Which Twitter users link the most to other Twitter users? (Provide the top ten.)
  * Who are the most mentioned Twitter users? (Provide the top five.)
  * Who are the most active Twitter users (top ten)?
  * Who are the five grumpiest users (most negative tweets) and the five happiest users (most positive tweets)? (Provide five users for each group.)
  
Your application can be written in the language of your choice. It must have some form of UI; whether that is a CLI, a GUI, or a web-based UI does not matter.


Present your system's answers to the questions above in a Markdown file on your GitHub account. That is, you hand in this assignment via GitHub, with one hand-in per group.
Push your solution, i.e., the source code and the presentation of the results, to one GitHub repository per group, and post a link to your solution in the Moodle hand-in area.
The hand-in deadline is 1 May 2017 at 24:00.


## Hints

You can download and uncompress a dataset of Twitter tweets from http://help.sentiment140.com/for-students/.

```bash
wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
```

In your VM the `unzip` package is not installed by default. Install it via:

```bash
sudo apt-get install unzip
```

Now you can uncompress the Twitter dataset to your current directory with:

```bash
unzip trainingandtestdata.zip
```

After uncompressing, you will have a folder with two files. For this exercise, use the bigger one, `training.1600000.processed.noemoticon.csv`. Both are CSV files without a header row. According to the documentation at http://help.sentiment140.com/for-students/, the columns contain the following:

> The data is a CSV with emoticons removed. Data file format has 6 fields:
> 
> 0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
> 
> 1 - the id of the tweet (2087)
> 
> 2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
> 
> 3 - the query (lyx). If there is no query, then this value is NO_QUERY.
> 
> 4 - the user that tweeted (robotickilldozr)
> 
> 5 - the text of the tweet (Lyx is cool)


To make use of the `--headerline` switch when importing the data with `mongoimport`, we add a matching header line to the file:

```bash
sed -i '1s;^;polarity,id,date,query,user,text\n;' training.1600000.processed.noemoticon.csv
```

As a result, the fields of the imported documents are named after the given header fields.
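
To illustrate the resulting document shape, here is a minimal sketch in plain JavaScript that maps one CSV row onto the six header fields. It assumes that every field is double-quoted and that quotes inside a field are escaped by doubling them. The helper names `parseCsvLine` and `rowToDocument` are hypothetical, and the sample row is assembled from the example values in the documentation quoted above (the polarity value is made up).

```javascript
// Field names, in the same order as the header line added with sed.
const FIELDS = ["polarity", "id", "date", "query", "user", "text"];

// Hypothetical helper: split one CSV line into its field values.
// Assumption: fields are double-quoted, inner quotes are doubled ("").
function parseCsvLine(line) {
    const values = [];
    let current = "";
    let inQuotes = false;
    for (let i = 0; i < line.length; i++) {
        const ch = line[i];
        if (inQuotes) {
            if (ch === '"' && line[i + 1] === '"') {
                current += '"'; // escaped quote inside a quoted field
                i++;
            } else if (ch === '"') {
                inQuotes = false; // closing quote
            } else {
                current += ch;
            }
        } else if (ch === '"') {
            inQuotes = true; // opening quote
        } else if (ch === ",") {
            values.push(current); // field separator outside quotes
            current = "";
        } else {
            current += ch;
        }
    }
    values.push(current);
    return values;
}

// Hypothetical helper: name each value after the corresponding header field.
function rowToDocument(line) {
    const values = parseCsvLine(line);
    const doc = {};
    FIELDS.forEach((name, index) => { doc[name] = values[index]; });
    return doc;
}

const sample = '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"';
const doc = rowToDocument(sample);
console.log(doc);
```

This sketch keeps every value as a string. The documents produced by `mongoimport` may differ in their value types, so inspect a few imported documents before relying on a particular type.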

After importing the dataset, the `date` values are represented as strings instead of proper date objects. You might want to convert them with the following code:

```javascript
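// Convert every `date` that is still stored as a string into a Date object.
db.tweets.find().forEach(function (doc) {
    if (!(doc.date instanceof Date)) {
        // Update only the `date` field of this document.
        db.tweets.updateOne({_id: doc._id}, {$set: {date: new Date(doc.date)}});
    }
});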
```
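
The conversion above relies on `new Date(doc.date)`, and how JavaScript engines parse non-ISO date strings is implementation-dependent. If the result looks wrong, a more explicit parser could be used instead. The following is only a sketch: `parseTweetDate` is a hypothetical helper, and the list of time zone abbreviations is an assumption that you would need to adapt to what the data actually contains.

```javascript
// Month abbreviations as they appear in strings like "Sat May 16 23:58:44 UTC 2009".
const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Offsets from UTC in hours. Assumption: only these zones occur.
const ZONE_OFFSET_HOURS = new Map([["UTC", 0], ["GMT", 0], ["PST", -8], ["PDT", -7]]);

// Hypothetical helper: returns a Date, or null if the string has an unexpected shape.
function parseTweetDate(text) {
    const match = /^\w{3} (\w{3}) (\d{1,2}) (\d{2}):(\d{2}):(\d{2}) (\w+) (\d{4})$/.exec(text);
    if (match === null) {
        return null;
    }
    const [, monthName, day, hours, minutes, seconds, zone, year] = match;
    const month = MONTHS.indexOf(monthName);
    if (month === -1 || !ZONE_OFFSET_HOURS.has(zone)) {
        return null;
    }
    // Interpret the wall-clock fields as if they were UTC ...
    const asUtc = Date.UTC(Number(year), month, Number(day),
                           Number(hours), Number(minutes), Number(seconds));
    // ... then remove the zone offset: UTC = local time - offset.
    return new Date(asUtc - ZONE_OFFSET_HOURS.get(zone) * 60 * 60 * 1000);
}

console.log(parseTweetDate("Sat May 16 23:58:44 UTC 2009").toISOString());
// 2009-05-16T23:58:44.000Z
```

Returning `null` for unrecognized strings makes it easy to spot rows that need special handling instead of silently storing an invalid date.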

For some of the questions, you might want to have a look at MongoDB's aggregation framework and at queries using regular expressions. For example, the following query finds all tweets mentioning another Twitter user and collects their texts into a single array.

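```javascript
db.tweets.aggregate([
    // Keep only tweets whose text contains an @mention.
    {$match: {text: /@\w+/}},
    // Collect the texts of all matching tweets into a single array.
    {$group: {_id: null, text: {$push: "$text"}}}
])
```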
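
To see what the mention pattern `/@\w+/` matches before running it against the full collection, you can try it on a few strings in plain JavaScript. The sketch below uses made-up sample documents and two hypothetical helpers, `countMentions` and `topN`. It is not the database query you are asked to write, only an illustration of the regular expression and of top-N counting.

```javascript
// Made-up sample documents, shaped like the imported tweets.
const tweets = [
    { user: "alice", text: "@bob have you seen this? @carol" },
    { user: "bob", text: "no mentions here" },
    { user: "carol", text: "thanks @bob!" },
];

// Hypothetical helper: count how often each @mention occurs across all texts.
function countMentions(docs) {
    const counts = new Map();
    for (const doc of docs) {
        // The g flag returns every match in the text; match() yields null if there is none.
        const mentions = doc.text.match(/@\w+/g) || [];
        for (const mention of mentions) {
            counts.set(mention, (counts.get(mention) || 0) + 1);
        }
    }
    return counts;
}

// Hypothetical helper: the n entries with the highest counts, highest first.
function topN(counts, n) {
    return [...counts.entries()]
        .sort((a, b) => b[1] - a[1])
        .slice(0, n);
}

console.log(topN(countMentions(tweets), 2));
```

With the sample data, `@bob` comes first with two mentions, followed by `@carol` with one. Note that this simple pattern also matches the part after `@` in an e-mail address, so you may want to refine it.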

