# Data 201/422 Group Assignment

### Calendar

| Step | When? | What?|
|-------|------|-----|
| 1 | 20 / 21 September | Launch |
| 2 | 4 / 5 October | Feedback Session |
| 3 | 12 / 13 October | Group Work and QA |
| 4 | 18 / 19 October | Group Presentations |
| 5 | 21 October, 12:00 pm | Deliverable Due Date |

### Goals

Your goals as a team are to:  
1. identify interesting data sources to work on;
2. wrangle the data sources into a suitable target data model;
3. present the data to a larger audience.  

Throughout the project, you should aim to show that you can apply what we learnt in class (so some R coding for Data 201, and some R and Julia coding for Data 422).

### Deliverable

The deliverable that we ask you to submit by the end of the group work (21 October, midday) is a compressed folder (a zipped folder) containing:

1. A project report. The report should clearly describe:
   1. what data sources you used
   2. why you chose those data sources
   3. what target you chose (i.e., what is the intended use of the data, ...)
   4. what difficulties you had to overcome to wrangle the data sources into the target data model
   5. what techniques you used
   6. what you managed to achieve and what you failed to do
2. A project diary: every day you work on the project, take a detailed note of who did what in a text file. Even when you work by yourself, take note of what you did and then record it in the group diary.
3. Slides or any other material for the data presentation
4. Jupyter notebooks with all the code you wrote to do the various wrangling tasks, well documented and commented. If you use more than one Jupyter notebook, you should also include a document explaining how the notebooks are linked (i.e., in which sequence we should run the notebooks to reproduce your analysis, what they do, ...)
5. The data you produced, or a link to access the data you produced, and all the relevant documentation for us to access and understand the data. If we cannot access the data with the information you give us, we consider that a non-submission.

All documents must be PDF, Markdown or Jupyter notebooks; absolutely no Word documents or similar proprietary formats.

### Steps:

Working in a group on a data science project is a skill in itself.  
In the labs from here to the end of the course we suggest one way to structure your work.
You are highly encouraged to find some time to work together as a group during the next weeks: in my experience, new teams need to sit in the same room to keep their efforts coordinated.

Also, this may be the right time to get used to some version control and team management workflows and software. My suggestion is to use GitHub (a very short introduction [here](https://guides.github.com/activities/hello-world/) ) or GitLab (a more comprehensive introduction [here](https://docs.gitlab.com/ee/gitlab-basics/README.html) ). These are industry standards, and learning them is surely worth the effort. If you decide to go down this road, we can help you :-)

**More detailed information on the steps after step 1 will be provided closer to each step**

#### 1. Launch

This is the phase of exploration and prototyping.

###### Exploration

As a team, you need to identify a _good_ set of data sources. A _good_ data source for this project is a source of data that:  
1. you can read and handle, or learn to read and handle in a very short time. Formats that are too exotic may be very interesting, but they may also take too long to import into anything familiar to you. To be fair, most of the datasets you can find in the sources we suggest are in a good format. _pro tip_: avoid data presented in pdf tables, unless you are ready to do a lot of hand work (if you do, it's great).
2. is not already in the most suitable data model. If the source dataset is already perfect, there's little you can do and you won't have the opportunity to exhibit your skills. Government or NGO data is usually quite dirty, so it is a good choice.
3. is _interesting_ for you. I don't have a strong interest in cars, and for me working for a month on a car data set would be a boring hell. But maybe you LOVE cars, and then that dataset would be perfect for you. Try to consider your passions, or look at the news to see what may be a hot topic. Avoid overused datasets (e.g., _iris_ or _mtcars_ or _titanic_ or ...): we have already seen all possible projects about that stuff.
4. is _joined_ or _joinable_ to other data sources. Consider the presence of common identifiers (such as region names, product ids, years or dates, ...) that you may use to connect two or more data sources and tell a story (great data stories often emerge from connections).
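To make the idea of _joinable_ sources concrete, here is a minimal sketch of a quick check on a candidate identifier. It is written in Python purely for illustration (the same check is easy to write in R or Julia), and the function names, the region lists and the normalisation rule are all made up for this example rather than taken from any real dataset.

```python
def normalise_key(value: str) -> str:
    """Lower-case and collapse whitespace so 'Otago ' and 'otago' compare equal."""
    return " ".join(value.strip().lower().split())


def key_overlap(left_keys, right_keys):
    """Return the shared keys and the fraction of left keys also found on the right."""
    left = {normalise_key(k) for k in left_keys}
    right = {normalise_key(k) for k in right_keys}
    shared = left & right
    coverage = len(shared) / len(left) if left else 0.0
    return shared, coverage


# Hypothetical region columns from two made-up sources.
health_regions = ["Canterbury", "Otago ", "WELLINGTON", "Auckland"]
census_regions = ["canterbury", "otago", "Wellington", "Waikato"]

shared, coverage = key_overlap(health_regions, census_regions)
print(sorted(shared))
print(coverage)
```

A low coverage value on a check like this usually means the identifier needs cleaning before the join, or that the two sources do not really describe the same units.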

###### Prototyping

You won't find the best data sources at the first go.

Once you have identified a candidate set of data sources, write down what use you want to make of it. A good way is to write down some questions you would like to answer with the data sources (good questions are questions that require some wrangling and cannot be answered using only one of the data sources).

After that, **prototype** your data wrangling: you may decide to work on a limited subset of data to see whether you can quickly write some code to get closer to answering your questions. The code does not need to be perfect, nor do the questions need to be answered at this stage. Yet, if the questions look too easy or too hard, consider (1) changing the questions, or (2) changing both data sources and questions.
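One way to prototype on a limited subset is to read only the first few rows of a file while you are still sketching the wrangling code. The snippet below is a rough Python illustration of that idea (in R or Julia you would use the equivalent row-limit option of your CSV reader); the helper name `read_sample` and the tiny in-memory table are invented for the example.

```python
import csv
import io
from itertools import islice


def read_sample(handle, n_rows):
    """Read only the first n_rows data rows of a CSV as dictionaries."""
    reader = csv.DictReader(handle)
    return list(islice(reader, n_rows))


# Made-up stand-in for a large file; with a real source you would pass
# an open file handle instead of this in-memory text.
raw = io.StringIO(
    "region,year,value\n"
    "Otago,2015,10\n"
    "Otago,2016,12\n"
    "Canterbury,2015,7\n"
    "Canterbury,2016,9\n"
)

sample = read_sample(raw, 2)
print(sample)
```

Keep in mind that the first rows of a file are rarely representative, so treat whatever you learn from such a sample as a hint, not as a result.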

Fail fast, fail often. By trying things and hitting a wall as soon as possible, you can avoid finding yourself stuck in the middle of your project.

#### 2. Feedback Session

A feedback session is a collaborative effort between 2 groups.

In the first hour, group A will present its sources, targets, work done and future plans to group B. Group B gives feedback about all the phases: they can suggest slightly improved questions, point out possible weaknesses in the methodology (for example, data contamination risks, ...), raise ethical questions, and suggest solutions to problems group A is facing. After the first hour, groups swap roles.

#### 3. Q and A

Two hours of allocated time for group work with Giulio, Thomas and Phil present to help you out if you have questions or doubts.

#### 4. Presentations

Impress us! Tell us what you have done, and what is interesting in the data you wrangled. Tell us what wonderful, very different data sets you managed to bring together, clean and make sense of. Tell us how the data model you shaped your data into allows other data scientists to work on that data. Use animation, visualization and your best communication skills.

Presentation time: 7 minutes, plus 1 minute for Q&A from the audience.

Presentations are marked separately, and are worth 10% of the course final grade.

### Suggested data sources

There's no hard boundary: if you do identify a data source we did not think about, even better.

Our suggested data repositories are:

1. https://data.govt.nz/ : a large repository of New Zealand government datasets, in various formats, mostly of local interest.
2. http://www.who.int/gho/en/ : a database of information from the World Health Organization, of global interest.
3. http://data.un.org/ : a collection of databases from the United Nations, of global interest.
4. https://www.google.com/publicdata/directory : a directory of public data from Google.


Google offers (a beta version of) a search engine for finding datasets: https://toolbox.google.com/datasetsearch

## Marks and policy

The group project is worth 40% of the final course grade.

We will evaluate the deliverable you submit. Out of 40 marks:

1. Project report: 15 marks (1 for writing, 1 for visualization, 1 for clarity of the report, 5 for suitability of the target data model, 7 for depth and extension of the project)
2. Project diary: 5 marks (2 for detail, 2 for presentation, and 1 for rigorous keeping)
3. Presentation material: no marks (presentations are evaluated separately)
4. Code Notebooks: 15 marks (10 for code quality and skills, 5 for documentation quality)
5. Final Data and Documentation: 5 marks (2 for the data availability, 3 for the quality of documentation)

Bonus points may be given for use of GitHub or other special efforts.

**Data 422 students**: note that some Julia code is required if you want to get full marks for the code notebooks and for the depth and extension of the project. Try to identify a task that you can do with Julia.

Late submissions are highly discouraged. You have a month to get your project together; if you find yourself short of time because of very serious problems, let me know and we'll find a solution. Otherwise, unless agreed differently between me (Giulio) and you, the standard departmental late submission policy will be applied.

## Final note

**This document may be updated during the month of the group project in order to provide more detailed information, if and where needed**