Welcome to your DataCamp project audition! This notebook must be filled out and vetted before a contract can be signed and you can start creating your project.

The first step is forking the repository in which this notebook lives. After that, there are two parts to be completed in this notebook:

- **Project information**:  The title of the project, a project description, etc.

- **Project introduction**: The first three text and code cells that will form the introduction of your project.

When complete, please email the link to your forked repo to projects@datacamp.com with the email subject line _DataCamp project audition_. If you have any questions, please reach out to projects@datacamp.com.

# Project information

**Project title**: The title of the project. Maximum 41 characters.

**Name:** Jan Laurens Geffert

**Email address associated with your DataCamp account:** laurensgeffert@gmail.com

**GitHub username:** JanLauGe

**Project description**: This will be read by the students on the DataCamp platform **before** deciding to start the project. The description should be three paragraphs, written in Markdown.

- Paragraph 1 should be an exciting introduction to analysis/model/etc. students will complete.
- Paragraph 2 should list the background knowledge you assume the student doing this project will have, the more specific the better. Please list things like modules, tools, functions, methods, statistical concepts, etc.
- Paragraph 3 should describe and link to (if possible) the dataset used in the project.

# Project introduction

***Note: nothing needs to be filled out in this cell. It is simply setting up the template cells below.***

The final output of a DataCamp project looks like a blog post: pairs of text and code cells that tell a story about data. The text is written from the perspective of the data analyst and *not* from the perspective of an instructor on DataCamp. So, for this blog post intro, all you need to do is pretend like you're writing a blog post -- forget the part about instructors and students.

Below you'll see the structure of a DataCamp project: a series of "tasks" where each task consists of a title, a **single** text cell, and a **single** code cell. There are 8-12 tasks in a project and each task can have up to 10 lines of code. What you need to do:
1. Read through the template structure.
2. As best you can, divide your project as it is currently visualized in your mind into tasks.
3. Fill out the template structure for the first three tasks of your project.

As you are completing each task, you may wish to consult the project notebook format in our [documentation](https://instructor-support.datacamp.com/projects/datacamp-projects-jupyter-notebook). Only the `@context` and `@solution` cells are relevant to this audition.

## 1. Title of the first task  (<= 55 chars) (sentence case)

An exciting intro to the analysis. Provide context on the problem you're going to solve, the dataset(s) you're going to use, the relevant industry, etc. You may wish to briefly introduce the techniques you're going to use. Tell a story to get students excited! It should at most have 1200 characters.

The most common error instructors make in **context cells** is referring to the student or the project. We want project notebooks to appear as a blog post or a data analysis. Bad: *"In this project, you will..."* Good: *"In this notebook, we will..."*

The first task in projects often involves loading data. Please store any data files you use in the `datasets/` folder in this repository.

Images are welcome additions to every Markdown cell, but especially this first one. Make sure the images you use have a [permissive license](https://support.google.com/websearch/answer/29508?hl=en) and display them using [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#images). Store your images in the `img/` folder in this repository.

The Scottish Crossbill (*Loxia scotica*) is a small passerine bird that inhabits the Caledonian Forests of Scotland, and is the only terrestrial vertebrate species unique to the United Kingdom. Only about 20,000 individuals of this species are alive today.

The first step is to get occurrence data for the species we are interested in. This used to be the main challenge in biogeography. Natural historians such as Charles Darwin and Alexander von Humboldt would travel for years around the globe on rustic sailing ships, collecting specimens. Today, we are standing on the shoulders of giants. Getting data is fast and easy thanks to the work of two organisations:

- [the Global Biodiversity Information Facility (GBIF)](https://www.gbif.org/), an international network and research infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth. We will use their data in this project.

- [rOpenSci](https://ropensci.org/), a non-profit initiative that has developed an ecosystem of open source tools, runs annual unconferences, and reviews community-developed software. Their package `rgbif` will help us access the species data.


In [4]:
install.packages("rgbif")


The downloaded binary packages are in
	/var/folders/mv/gv3qxyn90vn8j0vjgg4gk2fr0000gp/T//RtmpqnLpP1/downloaded_packages


In [6]:
# Code and comments for the first task
# It should consist of up to 10 lines of code (not including comments)
# and take at most 10 seconds to execute on an average laptop.
library(tidyverse)
library(rgbif)

# get the database id ("key") for the Scottish Crossbill
speciesKey <- name_backbone('Loxia scotica')$speciesKey

# get the occurrence records of this species
gbif_response <- occ_search(
  scientificName = "Loxia scotica", country = "GB",
  hasCoordinate = TRUE, hasGeospatialIssue = FALSE,
  limit = 9999)

# backup to reduce API load; save into the datasets/ folder of this repository
dir.create(here::here('datasets'), showWarnings = FALSE)
write_rds(gbif_response, here::here('datasets', 'gbif_occs_loxsco.rds'))

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.8
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
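
The backup step only pays off if later runs read the saved file instead of calling the API again. Here is a minimal sketch of that cache-or-fetch pattern. `read_or_fetch()` and `fake_fetch()` are hypothetical names introduced for illustration, and the stand-in fetch function replaces the real `occ_search()` call so that the example runs offline.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise run fetch_fn(), save its result to cache_file, and return it.
read_or_fetch <- function(cache_file, fetch_fn) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  result <- fetch_fn()
  # make sure the target folder exists before writing
  dir.create(dirname(cache_file), recursive = TRUE, showWarnings = FALSE)
  saveRDS(result, cache_file)
  result
}

# Stand-in for the real API call (made-up coordinates), so the sketch runs offline
fake_fetch <- function() {
  message("fetching ...")
  list(data = data.frame(decimalLongitude = -3.7, decimalLatitude = 57.1))
}

# Use a temporary location for the demo and start from a clean state
cache_file <- file.path(tempdir(), "datasets", "gbif_occs_demo.rds")
unlink(cache_file)

first  <- read_or_fetch(cache_file, fake_fetch)  # runs fake_fetch() and writes the file
second <- read_or_fetch(cache_file, fake_fetch)  # reads the file, no fetch
```

In the notebook itself, the second argument could be a function that wraps the `occ_search()` call from this task, and the first a path inside the `datasets/` folder.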


## 2. Title of the second task (<= 55 chars)  (sentence case)

Context / background / story / etc. This cell should at most have 800 characters.

The most common error instructors make in **context cells** is referring to the student or the project. We want project notebooks to appear as a blog post or a data analysis. Bad: *"In this task, you will..."* Good: *"Next, we will..."*

GBIF and rOpenSci just saved us years of roaming around the Highlands with a
pair of binoculars, camping in mud, rain, and snow, and chasing crossbills
through the forest. Nevertheless, it is still up to us to make sense of the
data we got back, in particular to clean it, as data collected on this large
scale can have its own issues. Luckily, GBIF provides some useful metadata
on each record. Here, we will keep only records that

* are tagged as "present" (others may be artifacts from collections)
* don't have any flagged issues (nobody has noticed anything abnormal with them)
* are under a Creative Commons license (so we can use them here)
* were recorded between 1965 and 2015 (there is not much data yet for later years)

In [9]:
library(lubridate)
birds_clean <- gbif_response$data %>%
  # get decade of record from eventDate
  mutate(decade = eventDate %>% ymd_hms() %>% round_date("10y") %>% year() %>% as.numeric()) %>%
  # clean data using metadata filters
  filter(
    # only creative commons license records
    str_detect(license, "http://creativecommons.org/") &
    # only records with no issues
    issues == "" &
    # no records before 1965
    decade >= 1970 &
    # no records after 2015 (there is not a lot of data yet)
    decade < 2020) %>%
  # retain only relevant variables
  select(decimalLongitude, decimalLatitude, decade) %>% arrange(decade)


Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date

“ 143 failed to parse.”
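
To check what these filters keep and drop, here is a small sketch on a hand-made table. The rows, the `id` column, and the `example_issue` value are made up for illustration; only the filter conditions mirror the cell above. A row whose decade is `NA` (for example because its date failed to parse) is dropped as well, since `filter()` only keeps rows where the condition is `TRUE`.

```r
library(dplyr)
library(stringr)

cc <- "http://creativecommons.org/licenses/by/4.0/legalcode"

# Hand-made records; every value here is illustrative, not GBIF data
toy_records <- tibble(
  id      = 1:7,
  license = c(cc, cc, "all rights reserved", cc, cc, cc, cc),
  issues  = c("", "example_issue", "", "", "", "", ""),
  decade  = c(1970, 1980, 1990, 1960, NA, 2010, 2020)
)

kept <- toy_records %>%
  filter(
    # only creative commons license records
    str_detect(license, "http://creativecommons.org/") &
    # only records with no issues
    issues == "" &
    # decade must fall in 1970..2010; NA decades do not pass either
    decade >= 1970 &
    decade < 2020)

# ids 1 and 6 remain: 2 has an issue, 3 has another license,
# 4 is too old, 5 has no decade, 7 is too recent
kept$id
```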

## 3. Title of the third task (<= 55 chars)  (sentence case)

Okay, so we've got clean data now. Here comes the nifty trick. We want to look
at the data in subsets by decade in order to see if and how the spatial
distribution of the species has changed over time. To do so, we can "nest"
data in a list column using the `tidyr` package:

In [12]:
birds_nested <- birds_clean %>%
  # define the nesting index
  group_by(decade) %>% 
  # aggregate data in each group
  nest()

# let's have a look
glimpse(birds_nested)

Observations: 5
Variables: 2
$ decade <dbl> 1970, 1980, 1990, 2000, 2010
$ data   <list> [<# A tibble: 5 x 2,   decimalLongitude decimalLatitude,   ...
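
One thing the nested layout makes easy is computing a summary per decade. As a rough sketch, assuming a toy table in place of `birds_clean` (the coordinates below are invented and `n_records` is a name chosen here), `purrr::map_int()` can count the records in each nested tibble:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy stand-in for birds_clean; coordinates are made up
toy_birds <- tibble(
  decimalLongitude = c(-3.5, -3.6, -4.1, -4.2, -4.3),
  decimalLatitude  = c(57.1, 57.2, 57.4, 57.5, 57.6),
  decade           = c(1970, 1970, 1980, 1980, 1980)
)

toy_nested <- toy_birds %>%
  # define the nesting index
  group_by(decade) %>%
  # collect each group's rows into one tibble in the list column `data`
  nest() %>%
  ungroup() %>%
  # apply nrow() to every nested tibble, returning one integer per decade
  mutate(n_records = map_int(data, nrow))

toy_nested %>% select(decade, n_records)
```

The same `map_*()` approach works for any function that takes one decade's tibble as input, which is what makes the list column convenient for per-decade analysis.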


*Stop here! Only the first three tasks. :)*