## 📖 Background
You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data. 

## 💾 The data

### You have access to a real dataset containing dinosaur records from the Paleobiology Database ([source](https://paleobiodb.org/#/)):


| Column name | Description |
|---|---|
| occurrence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age at which the first fossil records of the dinosaur were found, in millions of years. |
| min_ma | The age at which the last fossil records of the dinosaur were found, in millions of years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |

The data was enriched with data from Wikipedia.
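
Since part of the task is to advise on data quality, the column descriptions above suggest a few sanity checks: `max_ma` should not be smaller than `min_ma`, and `lng` and `lat` should fall within valid coordinate ranges. The sketch below runs these checks on a made-up toy tibble (the values are illustrative and not taken from the dataset); the same `summarise()` call could be applied to the real data once it is loaded.

```r
library(dplyr)

# Toy rows standing in for the real file (values are made up for illustration)
toy <- tibble(
  max_ma = c(150, 70, 66),
  min_ma = c(145, 72, 66),
  lng    = c(10, -200, 30),
  lat    = c(45, 10, 95)
)

# Count the rows that break each rule
checks <- toy %>%
  summarise(
    age_order_violations = sum(max_ma < min_ma, na.rm = TRUE),
    lng_out_of_range     = sum(abs(lng) > 180, na.rm = TRUE),
    lat_out_of_range     = sum(abs(lat) > 90, na.rm = TRUE)
  )

print(checks)
```

A non-zero count in any column points to rows worth inspecting by hand before they are used in a plot.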

In [None]:
# Libraries I need
library(tidyverse)
library(plotly)
library(skimr)

In [None]:

# Load the data
dinosaurs <- read_csv('data/dinosaurs.csv', show_col_types = FALSE)


In [None]:
# Preview the dataframe
dinosaurs

In [None]:
# Discover the data
skim(dinosaurs)

In [None]:
# What about missing data in the dataset?
# Drop every row that has a missing value in any column
dinosaurs <- dinosaurs %>% drop_na()
# Confirm that no missing values remain in any column
map(dinosaurs, ~ sum(is.na(.)))
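
Calling `drop_na()` without arguments removes every row that has a missing value in any column, so it can be worth checking how many rows survive before committing to it. A minimal sketch on a made-up toy tibble (the values are illustrative, not from the dataset), comparing a full drop with a drop restricted to the column under analysis:

```r
library(dplyr)
library(tidyr)

# Toy data standing in for the real file (values are made up for illustration)
toy <- tibble(
  name     = c("A", "B", "C", "D"),
  diet     = c("herbivorous", NA, "carnivorous", "herbivorous"),
  length_m = c(12, 3, NA, 7),
  family   = c("X", "Y", "Z", NA)
)

# Missing values per column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Rows kept when dropping on any column vs. only on length_m
rows_all    <- toy %>% drop_na() %>% nrow()
rows_length <- toy %>% drop_na(length_m) %>% nrow()

print(na_counts)
print(c(all_columns = rows_all, length_only = rows_length))
```

If the two row counts differ a lot on the real data, dropping only on the columns each question needs may keep more records.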

In [None]:
# How many different dinosaur names are present in the data?
length(unique(dinosaurs$name))

In [None]:
# Which was the largest dinosaur?
# which.max() returns the row index of the largest length
dinosaurs[which.max(dinosaurs$length_m), ]
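
Because each row is one fossil occurrence, the longest dinosaur may appear on several rows. A possible alternative, sketched on made-up toy data, is to reduce the table to one row per name first and then use `slice_max()` from dplyr, which keeps ties by default:

```r
library(dplyr)

# Toy data: dinosaur "A" occurs twice (values are made up for illustration)
toy <- tibble(
  name     = c("A", "A", "B", "C"),
  length_m = c(30, 30, 12, 5)
)

largest <- toy %>%
  distinct(name, length_m) %>%   # one row per dinosaur
  slice_max(length_m, n = 1)     # with_ties = TRUE by default

print(largest)
```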

In [None]:
# What dinosaur type has the most occurrences in this dataset? 
# Create a visualization (table, bar chart, or equivalent) 
# to display the number of dinosaurs per type. 
# Use the AI assistant to tweak your visualization (colors, labels, title...).

# Count the records per type and plot them from most to least frequent
p1 <- dinosaurs %>%
  count(type, sort = TRUE) %>%
  ggplot(aes(reorder(type, -n), n, fill = type)) +
  geom_col() +
  labs(x = 'type', y = 'occurrences', fill = 'type')

ggplotly(p1, dynamicTicks = TRUE)

In [None]:
# Did dinosaurs get bigger over time? 
# Show the relation between the dinosaur length and their age to illustrate this
# years = midpoint of the age range, in millions of years ago
# (larger values are older)
dinosaurs <- dinosaurs %>%
  mutate(years = (max_ma + min_ma) / 2)
p <- dinosaurs %>%
  ggplot(aes(years, length_m)) +
  geom_point() +
  geom_smooth(method = 'glm')

ggplotly(p, dynamicTicks = TRUE)
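
To put a number on the trend in the scatter plot, one option is to fit a linear model and compute a rank correlation. The sketch below uses made-up toy values, not the dataset. Since age is measured in millions of years before present, a negative slope means that younger records tend to be longer.

```r
# Toy data: age midpoint (Ma) and length (values are made up for illustration)
toy <- data.frame(
  years    = c(70, 90, 110, 130, 150),
  length_m = c(10, 9, 7, 6, 4)
)

# Linear fit of length on age
fit   <- lm(length_m ~ years, data = toy)
slope <- coef(fit)[["years"]]

# Rank correlation, less sensitive to a few very long dinosaurs
rho <- cor(toy$years, toy$length_m, method = "spearman")

print(slope)
print(rho)
```

On the real data, `summary(fit)` would also report how much of the variation in length the age explains.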

In [None]:
# Count how many dinosaurs each type has
p <- dinosaurs %>%
  ggplot(aes(type, fill = type)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 30, vjust = 3))
ggplotly(p, dynamicTicks = TRUE)

In [None]:
# Compare dinosaur length across regions, split by diet
p <- dinosaurs %>%
  ggplot(aes(region, length_m, color = diet)) +
  geom_point() +
  theme(axis.text.x = element_blank()) +
  facet_wrap(~diet)

ggplotly(p)

# I found that herbivorous dinosaurs reach
# the greatest lengths in all regions
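
The observation about herbivores can be checked with a grouped summary instead of reading it off the plot. A minimal sketch on made-up toy data (the regions `R1` and `R2` and all values are hypothetical) that finds the diet with the greatest maximum length in each region:

```r
library(dplyr)

# Toy data (regions and values are made up for illustration)
toy <- tibble(
  region   = c("R1", "R1", "R1", "R2", "R2", "R2"),
  diet     = c("herbivorous", "carnivorous", "omnivorous",
               "herbivorous", "carnivorous", "omnivorous"),
  length_m = c(20, 9, 2, 10, 12, 3)
)

# For each region, keep the diet with the greatest maximum length
longest_by_region <- toy %>%
  group_by(region, diet) %>%
  summarise(max_length = max(length_m), .groups = "drop") %>%
  group_by(region) %>%
  slice_max(max_length, n = 1) %>%
  ungroup()

# Share of regions where herbivores come out on top
herb_share <- mean(longest_by_region$diet == "herbivorous")

print(longest_by_region)
print(herb_share)
```

A share of 1 on the real data would support the claim for every region; anything lower identifies regions that are exceptions.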

In [None]:
# Count the number of records in each family

p <- dinosaurs %>%
  ggplot(aes(y = family, color = family)) +
  geom_point(stat = 'count')

ggplotly(p,dynamicTicks = TRUE)
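
With many families on one axis, a sorted table of the most frequent ones can be easier to read than the dot plot. A short sketch on made-up toy data (the family labels are hypothetical), using `count()` with `sort = TRUE`:

```r
library(dplyr)

# Toy data (family labels are made up for illustration)
toy <- tibble(family = c("F1", "F2", "F1", "F3", "F1", "F2"))

# Most frequent families first, keeping the top two
top_families <- toy %>%
  count(family, sort = TRUE, name = "records") %>%
  slice_head(n = 2)

print(top_families)
```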

In [None]:
# Save the cleaned data (with the added years column) to the working directory
write_csv(dinosaurs, 'dinosaurs.csv')