Welcome to your DataCamp project audition! This notebook must be filled out and vetted before a contract can be signed and you can start creating your project.

The first step is forking the repository in which this notebook lives. After that, there are two parts to be completed in this notebook:

- **Project information**:  The title of the project, a project description, etc.

- **Project introduction**: The first three text and code cells that will form the introduction of your project.

When complete, please email the link to your forked repo to projects@datacamp.com with the email subject line _DataCamp project audition_. If you have any questions, please reach out to projects@datacamp.com.

# Project information

**Project title**: From one nest to another: birds and data

**Name:** Jan Laurens Geffert

**Email address associated with your DataCamp account:** laurensgeffert@gmail.com

**GitHub username:** JanLauGe

**Project description**: Jump into the nest! Assess the impact of climate change on bird distributions using a rapid experimentation framework with nested dataframes, list columns, and machine learning models embedded in dataframes.

In this project you will work with bird observation and climate data. This project assumes you can manipulate data frames using `dplyr` as taught in the course [Introduction to the Tidyverse](https://www.datacamp.com/courses/introduction-to-the-tidyverse). We will also use a number of other `tidyverse` packages such as `tidyr` and `purrr`. Some familiarity with the fundamentals of machine learning is advisable; you can build it with [Introduction to Machine Learning](https://www.datacamp.com/courses/introduction-to-machine-learning-with-r).

The project uses data from two sources: a collection of bird observations from the [Global Biodiversity Information Facility (GBIF)](https://www.gbif.org/) and climate data provided by the UK Met Office's [UKCP09 gridded observation datasets](https://www.metoffice.gov.uk/climate/uk/data/ukcp09).

# Project introduction

***Note: nothing needs to be filled out in this cell. It is simply setting up the template cells below.***

The final output of a DataCamp project looks like a blog post: pairs of text and code cells that tell a story about data. The text is written from the perspective of the data analyst and *not* from the perspective of an instructor on DataCamp. So, for this blog post intro, all you need to do is pretend like you're writing a blog post -- forget the part about instructors and students.

Below you'll see the structure of a DataCamp project: a series of "tasks" where each task consists of a title, a **single** text cell, and a **single** code cell. There are 8-12 tasks in a project and each task can have up to 10 lines of code. What you need to do:
1. Read through the template structure.
2. As best you can, divide your project as it is currently visualized in your mind into tasks.
3. Fill out the template structure for the first three tasks of your project.

As you are completing each task, you may wish to consult the project notebook format in our [documentation](https://instructor-support.datacamp.com/projects/datacamp-projects-jupyter-notebook). Only the `@context` and `@solution` cells are relevant to this audition.

## 1. Fieldwork in the digital age – download the data

The Scottish Crossbill (*Loxia scotica*) is a small bird inhabiting the Scottish forests. Only about 20,000 individuals of this species are alive today.

Our first step is to get occurrence data for this species. This used to be the main challenge in biogeography. Natural historians like Charles Darwin and Alexander von Humboldt travelled for years on rustic sailing ships around the globe collecting specimens. Today, we stand on the shoulders of giants. Getting data is fast and easy thanks to two organisations:

- [the Global Biodiversity Information Facility (GBIF)](https://www.gbif.org/), an international network and research infrastructure aimed at providing anyone, anywhere, open access to data about life on Earth. We will use their data in this project.

- [rOpenSci](https://ropensci.org/), a non-profit initiative that develops open source tools for academic data sets. Their package `rgbif` will help us access the species data.


![Crossbill image](img/Loxia.jpg)

In [1]:
# Code and comments for the first task
# It should consist of up to 10 lines of code (not including comments)
# and take at most 10 seconds to execute on an average laptop.
library(tidyverse)
# install.packages("rgbif")
library(rgbif)

# get the database id ("key") for the Scottish Crossbill
speciesKey <- name_backbone('Loxia scotica')$speciesKey

# get the occurrence records of this species
gbif_response <- occ_search(
  taxonKey = speciesKey, country = "GB",
  hasCoordinate = TRUE, hasGeospatialIssue = FALSE,
  limit = 9999)

# back up the response locally to reduce API load
# (file passed by position: newer readr versions deprecate the `path` argument)
write_rds(gbif_response, here::here('gbif_occs_loxsco.rds'))

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.8
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
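The backup only reduces API load if later runs actually read it. A minimal sketch of that idea follows, assuming a small helper named `load_or_fetch()`; this helper is hypothetical and not part of `rgbif` or `readr`, and a stand-in function replaces the real `occ_search()` call so the example runs offline.

```r
library(readr)

# Hypothetical helper (not part of rgbif or readr): reuse a cached result if
# the file exists, otherwise call `fetch` once and store what it returns.
load_or_fetch <- function(cache_file, fetch) {
  if (file.exists(cache_file)) {
    return(read_rds(cache_file))
  }
  result <- fetch()
  write_rds(result, cache_file)
  result
}

# Stand-in for the real API call, returning made-up data
fake_fetch <- function() list(data = data.frame(decimalLatitude = 57.2))

cache_file <- tempfile(fileext = ".rds")

# First call runs the fetch function and writes the cache file
first <- load_or_fetch(cache_file, fake_fetch)

# Second call reads the cache, so the fetch function is never invoked
second <- load_or_fetch(cache_file, function() stop("should not be called"))
```

In the notebook itself, `fetch` could wrap the `occ_search()` call from the cell above and `cache_file` could point at `gbif_occs_loxsco.rds`.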


## 2. Sorting out the bad eggs – data cleaning

GBIF and rOpenSci just saved us years of roaming around the highlands with a
pair of binoculars, camping in mud, rain, and snow, and chasing crossbills
through the forest. Nevertheless, it is still up to us to make sense of the
data we got back and, in particular, to clean it, as data collected on this
scale can have its own issues. Luckily, GBIF provides some useful metadata
on each record. Here, I will keep only records that

* are tagged as "present" (others may be artifacts from collections)
* don't have any flagged issues (nobody has noticed anything abnormal with them)
* are under a Creative Commons license (we can use them here)
* were recorded between 1965 and 2015 (there is not a lot of data for later years yet)

In [3]:
library(lubridate)
birds_clean <- gbif_response$data %>%
  # get decade of record by rounding eventDate to the nearest 10 years
  mutate(decade = eventDate %>% ymd_hms(quiet = TRUE) %>% round_date("10y") %>% year() %>% as.numeric()) %>%
  # clean data using metadata filters
  filter(
    # only creative commons license records
    str_detect(license, "http://creativecommons.org/") &
    # only records with no issues
    issues == "" &
    # no records before 1965 (those round to the 1960 decade or earlier)
    decade >= 1970 &
    # no records after 2015 (there is not a lot of data yet)
    decade < 2020) %>%
  # retain only relevant variables
  select(decimalLongitude, decimalLatitude, decade) %>% arrange(decade)
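The `decade` column is what ties the "1965" in the text to the `1970` in the filter: dates are rounded to the nearest 10 years, not truncated. A minimal sketch on made-up dates, assuming `round_date()` accepts the spelled-out unit `"10 years"` in the same way as the abbreviated `"10y"` used above:

```r
library(lubridate)

# Made-up observation dates, chosen well away from the rounding boundaries
example_dates <- ymd(c("1963-06-15", "1966-03-01", "2012-08-20"))

# Same idea as the mutate() step: round to the nearest 10 years, keep the year
example_decades <- year(round_date(example_dates, "10 years"))

# A date from 1963 falls into the 1960 decade and would be filtered out,
# while dates from 1966 and 2012 land in 1970 and 2010 and would be kept
example_kept <- example_decades >= 1970 & example_decades < 2020
```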

## 3. Nesting the data

So we've got clean data now. Here comes the nifty trick. We want to look
at the data in subsets by decade in order to see if and how the spatial 
distribution of the species has changed over time. To do so, we can "nest"
data in a list column using the `tidyr` package. The result is a dataframe
with the grouping columns left as usual and a `list` column containing the 
aggregated data of each group. This type of column works particularly
well with the `purrr` package, as we will see in the next exercise.

In [4]:
birds_nested <- birds_clean %>%
  # define the nesting index
  group_by(decade) %>% 
  # aggregate data in each group
  nest()

# let's have a look
glimpse(birds_nested)

Observations: 5
Variables: 2
$ decade <dbl> 1970, 1980, 1990, 2000, 2010
$ data   <list> [<# A tibble: 5 x 2,   decimalLongitude decimalLatitude,   ...
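To get a feel for what sits inside the list column, the same nesting step can be tried on a tiny made-up table. The sketch below uses invented coordinates and a hypothetical column name `n_records`; it counts the rows of each nested tibble with `purrr::map_int()`.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Made-up records: two from the 1970 decade, one from the 1980 decade
toy <- tibble(
  decade = c(1970, 1970, 1980),
  decimalLongitude = c(-3.5, -3.7, -4.1),
  decimalLatitude = c(57.2, 57.3, 57.5))

# One row per decade, with the remaining columns tucked into `data`
toy_nested <- toy %>%
  group_by(decade) %>%
  nest()

# Each element of `data` is itself a tibble, so nrow() works on it
toy_counts <- toy_nested %>%
  ungroup() %>%
  mutate(n_records = map_int(data, nrow))
```

Calling `ungroup()` first keeps the sketch independent of whether the installed `tidyr` version returns a grouped or an ungrouped result from `nest()`.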


*A complete analysis using this dataset: https://janlauge.github.io/2018/nesting-models-in-R-data-frames/*