# Introduction

The goal of this project is to analyse biodiversity data from the National Parks Service, particularly around various species observed in national parks.

This project will scope, prepare, analyze, and plot the data, and then seek to explain the findings from the analysis.

Here are some of the questions that this project has sought to answer:

- What is the distribution of conservation status for species?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which animal is most prevalent and what is their distribution amongst parks?

**Data sources:**

Both `observations.csv` and `species_info.csv` were provided by [Codecademy.com](https://www.codecademy.com).

Note: The data for this project is *inspired* by real data, but is mostly fictional.

## Scoping

It is good practice to start any new project by defining its scope. The sections below outline that scope and help guide the project's process and progress.

### Project Goals

The best place to start a project's scope is with its goals. This project takes the perspective of a biodiversity analyst for the National Parks Service (abbreviated to NPS going forward). The NPS wants to ensure the survival of endangered and at-risk species in order to maintain the level of biodiversity within its parks. Therefore, the analyst's main objectives are to understand the characteristics of the species within the parks, their conservation status, and the relationship of those species to the national parks. The questions to answer are:

- What is the distribution of conservation status for species?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which animal is most prevalent and what is their distribution amongst parks? 

### Data

This project has two data sets that came with the package. The first `csv` file has information about each species, and the second has observations of species with park locations. This data will be used to address the goals of the project.

### Analysis

This section is where descriptive statistics and data visualisation techniques are employed to better understand the data. Statistical inference will also be used to test whether the observed differences are statistically significant. Some of the key metrics to be computed include:

1. Distributions
1. Counts
1. Relationships between species
1. Conservation status of species
1. Observations of species in parks
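
As a rough illustration of the kind of significance test this section has in mind, the sketch below computes a Pearson chi-square statistic for a 2x2 contingency table using only NumPy. The counts and the meaning of the rows and columns are hypothetical placeholders, not values from the project data. In practice a function such as `scipy.stats.chi2_contingency` would typically be used to obtain the p-value as well; note that SciPy applies a continuity correction to 2x2 tables by default, so its statistic would differ slightly from the one printed here.

```python
import numpy as np

# Hypothetical 2x2 contingency table (NOT taken from the project data):
# rows = two species categories, columns = [protected, not protected]
observed = np.array([[10, 90],
                     [20, 80]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
total = observed.sum()

# Expected counts if category and protection status were independent
expected = row_totals @ col_totals / total

# Pearson chi-square statistic and its degrees of freedom
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f'chi-square statistic: {chi2:.4f}, degrees of freedom: {dof}')
```

A larger statistic means the observed counts sit further from what independence would predict; whether it is large enough to be called significant depends on the p-value for the given degrees of freedom.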

### Evaluation

Lastly, it's a good idea to revisit the goals and check whether the output of the analysis answers the questions set out in the goals section. This section will also reflect on what has been learned through the process and on any questions that could not be answered. It could also cover limitations and whether any of the analysis could have been done using different methodologies.


## Import Python Modules

Start by importing the primary modules to be used throughout the project.

In [1]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as pyplot
import seaborn as sns

## Loading The Data

Before the data in the csv files can be analysed, it must first be loaded into a format that is accessible and can be explored and visualised within Python.

The next few steps load `observations.csv` and `species_info.csv` into `DataFrames` called `observations` and `species` respectively. The newly created `DataFrames` are then previewed with the `.head()` method for a quick check of their contents.

#### Observations

The `observations.csv` file contains information from recorded sightings of different species throughout the national parks in the past 7 days. The columns included are:

- **scientific_name** - The scientific name of each species
- **park_name** - The name of the national park
- **observations** - The number of observations in the past 7 days

In [2]:
observations = pd.read_csv('../raw_data/observations.csv')
observations.head()

| | scientific_name | park_name | observations |
|---|---|---|---|
| 0 | Vicia benghalensis | Great Smoky Mountains National Park | 68 |
| 1 | Neovison vison | Great Smoky Mountains National Park | 77 |
| 2 | Prunus subcordata | Yosemite National Park | 138 |
| 3 | Abutilon theophrasti | Bryce National Park | 84 |
| 4 | Githopsis specularioides | Great Smoky Mountains National Park | 85 |
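
A quick sanity check after loading is to confirm that the columns described above are actually present. The sketch below does this on the five preview rows only, embedded as CSV text so that it runs without the original file; the names `sample_csv`, `sample` and `expected_columns` are introduced here purely for illustration.

```python
import io
import pandas as pd

# The five rows shown in the preview above, embedded as CSV text so the
# example runs without the original file.
sample_csv = io.StringIO(
    "scientific_name,park_name,observations\n"
    "Vicia benghalensis,Great Smoky Mountains National Park,68\n"
    "Neovison vison,Great Smoky Mountains National Park,77\n"
    "Prunus subcordata,Yosemite National Park,138\n"
    "Abutilon theophrasti,Bryce National Park,84\n"
    "Githopsis specularioides,Great Smoky Mountains National Park,85\n"
)
sample = pd.read_csv(sample_csv)

# Columns described in the text above
expected_columns = ['scientific_name', 'park_name', 'observations']
missing = [col for col in expected_columns if col not in sample.columns]

print(f'Missing columns: {missing}')
print(f'Shape: {sample.shape}')
print(sample.dtypes)
```

The same check could be run against the full `observations` DataFrame by swapping `sample` for `observations`.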


#### Species

The `species_info.csv` file contains information on the different species in the national parks. The columns in the data set include:

- **category** - The category of taxonomy for each species
- **scientific_name** - The scientific name of each species
- **common_names** - The common names of each species
- **conservation_status** - The species' conservation status


In [3]:
species = pd.read_csv('../raw_data/species_info.csv')
species.head()

| | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | NaN |
| 1 | Mammal | Bos bison | American Bison, Bison | NaN |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | NaN |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | NaN |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | NaN |


#### Data Characteristics

Next, check the dimensions of both datasets. `observations` has 23296 rows and 3 columns, while `species` has 5824 rows and 4 columns.

In [4]:
print(f'Observations Shape: {observations.shape}')
print(f'Species Shape: {species.shape}')

Observations Shape: (23296, 3)
Species Shape: (5824, 4)


Then we will look at the number of distinct values in `category` and what they are. In total there are 7 distinct categories, 5 of which are animals and 2 of which are plants.

In [5]:
print(f'Number of Categories: {species.category.nunique()}')
print(f'Categories: {species.category.unique()}')

Number of Categories: 7
Categories: ['Mammal' 'Bird' 'Reptile' 'Amphibian' 'Fish' 'Vascular Plant'
 'Nonvascular Plant']
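
The animal/plant split can be checked with a few lines of plain Python. This is a minimal sketch that assumes a category is a plant group whenever its name ends with `Plant`, which holds for the seven names printed above but is an assumption rather than something the dataset states.

```python
# Category names as printed in the output above
categories = ['Mammal', 'Bird', 'Reptile', 'Amphibian', 'Fish',
              'Vascular Plant', 'Nonvascular Plant']

# Assumption: a category is a plant group if its name ends with 'Plant'
plants = [c for c in categories if c.endswith('Plant')]
animals = [c for c in categories if c not in plants]

print(f'Animal categories ({len(animals)}): {animals}')
print(f'Plant categories ({len(plants)}): {plants}')
```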


Let's go one step deeper and look at the count for each of these categories to get a feel for the spread of species that have been recorded. `Vascular Plant` makes up the majority of species with 4470 rows, while `Reptile` has the fewest at 79.

In [6]:
species.groupby('category').size()

category
Amphibian              80
Bird                  521
Fish                  127
Mammal                214
Nonvascular Plant     333
Reptile                79
Vascular Plant       4470
dtype: int64

Another column worth exploring is `conservation_status`. It contains four categories, `Species of Concern`, `Endangered`, `Threatened` and `In Recovery`, along with `NaN` values.

In [7]:
print(f'Number of conservation statuses: {species.conservation_status.nunique()}')
print(f'Conservation Statuses: {species.conservation_status.unique()}')

Number of conservation statuses: 4
Conservation Statuses: [nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery']
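
The output above reports 4 statuses while listing 5 values because `nunique()` ignores `NaN` by default, whereas `unique()` keeps it. The sketch below shows the difference on a small stand-in Series (`status` is an illustrative name, not the project column) holding the same distinct values.

```python
import numpy as np
import pandas as pd

# Small stand-in Series with the same distinct values as the output above
status = pd.Series([np.nan, 'Species of Concern', 'Endangered',
                    'Threatened', 'In Recovery', np.nan])

print(status.nunique())              # NaN excluded by default
print(status.nunique(dropna=False))  # NaN counted as a value
print(len(status.unique()))          # unique() keeps NaN
```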


Next is a count of the conservation statuses. There are 5633 NaN values, meaning that many species have no conservation concern recorded. There are also 161 species of concern, 16 endangered, 10 threatened and 4 in recovery.

#### Note

NaN values usually require careful treatment, but in this particular dataset a NaN simply means that the species is not under any conservation status.

In [10]:
print(f'Is NaN: {species.conservation_status.isna().sum()}')
print(species.groupby('conservation_status').size())

Is NaN: 5633
conservation_status
Endangered             16
In Recovery             4
Species of Concern    161
Threatened             10
dtype: int64
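
Since NaN carries meaning here, one option is to replace it with an explicit label so that it shows up in grouped counts. The sketch below rebuilds a column with the same counts as the output above; the label `No Intervention` and the variable names are hypothetical choices for this example, not something taken from the dataset.

```python
import numpy as np
import pandas as pd

# Rebuild a status column with the same counts as the output above
counts = {np.nan: 5633, 'Endangered': 16, 'In Recovery': 4,
          'Species of Concern': 161, 'Threatened': 10}
status = pd.Series(
    [label for label, n in counts.items() for _ in range(n)],
    dtype='object',
)

# 'No Intervention' is a hypothetical label chosen for this sketch
labelled = status.fillna('No Intervention')

print(labelled.value_counts())
```

With the label in place, every one of the 5824 rows belongs to a named group, so the unprotected species no longer drop out of `groupby` or `value_counts` results.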


#### Observations

The next section looks at `observations`. Let's first look at the total number of parks.

In [None]:
print(f'Number of parks: {observations.park_name.nunique()}')
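
Counting parks and totalling sightings per park can be sketched on the five preview rows from the earlier `observations.head()` output. The results below therefore describe only those five rows, not the full dataset, and the names `sample`, `n_parks` and `per_park` are illustrative.

```python
import pandas as pd

# The five preview rows from the earlier observations.head() output
sample = pd.DataFrame({
    'scientific_name': ['Vicia benghalensis', 'Neovison vison',
                        'Prunus subcordata', 'Abutilon theophrasti',
                        'Githopsis specularioides'],
    'park_name': ['Great Smoky Mountains National Park',
                  'Great Smoky Mountains National Park',
                  'Yosemite National Park',
                  'Bryce National Park',
                  'Great Smoky Mountains National Park'],
    'observations': [68, 77, 138, 84, 85],
})

# Number of distinct parks and total sightings per park in the sample
n_parks = sample['park_name'].nunique()
per_park = sample.groupby('park_name')['observations'].sum()

print(f'Number of parks in sample: {n_parks}')
print(per_park)
```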