# Analysis of Endangered Species in National Parks
## Introduction
The conservation of biodiversity within the National Park Service (NPS) is a critical objective, yet numerous species within these protected areas are classified as endangered. This project analyzes the conservation statuses of endangered species across various national parks to uncover patterns or themes in the types of species affected.
## Project Scoping
### Project Goals:
This project is written from the perspective of a biodiversity analyst for the National Park Service. The NPS wants to ensure the survival of at-risk species in order to maintain the level of biodiversity within its parks. The project poses the following questions:

- What is the distribution of conservation status for species?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation statuses significant?
- Which species is most prevalent, and what is its distribution amongst parks?

### Data
- Data Source: The data for this project was provided by [Codecademy](https://www.codecademy.com/learn) for the Biodiversity in National Parks project. The dataset consists of two files, `observations.csv` and `species_info.csv`.
- Data Content: The first `csv` file holds recorded sightings of different species at several national parks for the past 7 days. The other file contains information about the different species and their conservation statuses.
- Data Preparation: Data cleaning and handling of missing values.

This dataset will be used to address the goals of this project.
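
As a rough illustration of the data preparation step, the sketch below counts missing values per column and fills the gaps in `conservation_status`. It rebuilds two rows from the `species_info.csv` preview in this notebook by hand so that it runs without the file. The fill label `'No Status'` is a placeholder I chose for illustration, not a value from the dataset.

```python
import pandas as pd

# Two rows copied by hand from the species_info.csv preview in this notebook;
# the real file would be loaded with pd.read_csv('species_info.csv').
sample = pd.DataFrame({
    'category': ['Mammal', 'Mammal'],
    'scientific_name': ['Bos bison', 'Cervus elaphus'],
    'common_names': ['American Bison, Bison', 'Wapiti Or Elk'],
    # Both statuses are missing in the preview.
    'conservation_status': pd.Series([None, None], dtype='object'),
})

# Count missing values in every column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Replace missing statuses with an explicit label.
# 'No Status' is a hypothetical placeholder, not a value from the dataset.
cleaned = sample.fillna({'conservation_status': 'No Status'})
print(cleaned['conservation_status'].tolist())
```

Filling with an explicit label rather than dropping the rows keeps species without a recorded status in later counts.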

### Analysis
In this section, summary statistics and visualization techniques will be employed to understand the data better. Statistical inference will be used to test whether the observed values are statistically significant. Key metrics that will be computed:
- Counts
- Distributions
- Relationships between species
- Conservation status of species
- Observations of species in parks
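
To make the statistical inference step concrete, here is a minimal sketch of a chi-square test of independence on a 2x2 contingency table, computed by hand with NumPy. The counts are made-up numbers for two hypothetical groups, not results from this dataset, and the choice of a chi-square test is my assumption about how the significance question could be approached.

```python
import numpy as np

# Hypothetical counts (NOT from the dataset): rows are two species groups,
# columns are [protected, not protected].
observed = np.array([[30, 150],
                     [10, 190]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts if group and protection status were independent.
expected = row_totals @ col_totals / total

# Pearson chi-square statistic, without continuity correction.
chi2 = ((observed - expected) ** 2 / expected).sum()

# Critical value of the chi-square distribution with 1 degree of freedom
# at the 5% significance level.
CRITICAL_VALUE_DF1 = 3.841
significant = chi2 > CRITICAL_VALUE_DF1
print(f'chi2 = {chi2:.2f}, significant at 5%: {significant}')
```

In practice `scipy.stats.chi2_contingency` performs the same calculation and also returns a p-value. It applies a continuity correction to 2x2 tables by default, so its statistic would differ slightly from the one above.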

### Evaluation
I will revisit the analysis to check whether its output answers the questions set out in the goals section. This section will reflect on what has been learned through the process and whether any of the questions could not be answered. It may also cover limitations of the data and whether any part of the analysis could have been done using different methodologies.


## Import Python Modules
Import the primary modules that will be used in this project.

In [1]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline


### Loading the Data

### Species
The columns in the `species_info.csv` file include:
- **category:** The taxonomic category of each species
- **scientific_name:** The scientific name of each species
- **common_names:** The common names of each species
- **conservation_status:** The species' conservation status

In [2]:
species = pd.read_csv('species_info.csv')
species.head()

| | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | NaN |
| 1 | Mammal | Bos bison | American Bison, Bison | NaN |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | NaN |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | NaN |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | NaN |


### Observations
The columns in the `observations.csv` file include:
- **scientific_name:** The scientific name of each species
- **park_name:** The name of the national park
- **observations:** The number of observations in the past 7 days

In [4]:
observations = pd.read_csv('observations.csv')
observations.head()

| | scientific_name | park_name | observations |
|---|---|---|---|
| 0 | Vicia benghalensis | Great Smoky Mountains National Park | 68 |
| 1 | Neovison vison | Great Smoky Mountains National Park | 77 |
| 2 | Prunus subcordata | Yosemite National Park | 138 |
| 3 | Abutilon theophrasti | Bryce National Park | 84 |
| 4 | Githopsis specularioides | Great Smoky Mountains National Park | 85 |
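
One of the planned metrics is the number of observations per park. As a small sketch of that aggregation, the block below rebuilds the five preview rows above by hand and sums `observations` for each `park_name`. On the full data the same `groupby` would presumably run on the `observations` DataFrame itself; the variable names here are my own.

```python
import pandas as pd

# The five rows shown in the observations.head() output, typed in by hand.
preview = pd.DataFrame({
    'scientific_name': ['Vicia benghalensis', 'Neovison vison', 'Prunus subcordata',
                        'Abutilon theophrasti', 'Githopsis specularioides'],
    'park_name': ['Great Smoky Mountains National Park',
                  'Great Smoky Mountains National Park',
                  'Yosemite National Park',
                  'Bryce National Park',
                  'Great Smoky Mountains National Park'],
    'observations': [68, 77, 138, 84, 85],
})

# Total sightings per park, largest first.
per_park = (
    preview.groupby('park_name')['observations']
    .sum()
    .sort_values(ascending=False)
)
print(per_park)
```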


### Data Characteristics
This section checks the dimensions of each dataset: `species` has 5,824 rows and 4 columns, while `observations` has 23,296 rows and 3 columns.

In [5]:
print(f'species shape: {species.shape}')
print(f'observations shape: {observations.shape}')

species shape: (5824, 4)
observations shape: (23296, 3)


## Explore the Data
This section explores the `species` data in more depth. First, I count the number of distinct species using the `scientific_name` column, which gives 5,541 species. That's quite a lot of species! It is also fewer than the 5,824 rows in `species`, so some scientific names appear more than once.

In [8]:
print(f'Number of species in the dataset: {species.scientific_name.nunique()}')

Number of species in the dataset: 5541
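
The gap between 5,824 rows and 5,541 distinct names can be inspected by listing the names that occur more than once. A minimal sketch of how such repeats could be located with `duplicated` follows. The toy frame reuses names from the preview, and the second `Cervus elaphus` row with the common name `Rocky Mountain Elk` is invented purely for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'scientific_name': ['Bos bison', 'Ovis aries', 'Cervus elaphus', 'Cervus elaphus'],
    'common_names': ['American Bison, Bison',
                     'Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)',
                     'Wapiti Or Elk',
                     'Rocky Mountain Elk'],  # hypothetical duplicate row
})

n_rows = len(toy)
n_unique = toy['scientific_name'].nunique()

# keep=False flags every row whose scientific_name occurs more than once.
repeats = toy[toy['scientific_name'].duplicated(keep=False)]
print(f'{n_rows} rows, {n_unique} distinct names')
print(repeats)
```

Looking at the repeated rows side by side shows whether they differ only in `common_names` or also in other columns, which matters when deciding how to deduplicate.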


Next, I find the number of distinct `category` values represented in the dataset. There are 7 categories, covering both animals and plants.

In [10]:
print(f'Number of categories: {species.category.nunique()}')
print(f'Categories: {species.category.unique()}')

Number of categories: 7
Categories: ['Mammal' 'Bird' 'Reptile' 'Amphibian' 'Fish' 'Vascular Plant'
 'Nonvascular Plant']


Drilling down on the categories, I can get the number of rows per `category` in the data. As we can see, vascular plants make up by far the largest share with 4,470 entries, while reptiles are the smallest group with 79.

In [14]:
species.groupby('category').size()

category
Amphibian              80
Bird                  521
Fish                  127
Mammal                214
Nonvascular Plant     333
Reptile                79
Vascular Plant       4470
dtype: int64
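
To express these counts as shares of the whole, the sketch below re-enters the `groupby` output above as a Series and divides by the total. Only the variable names are my own; the numbers are copied from the output.

```python
import pandas as pd

# Counts copied from the groupby('category').size() output above.
counts = pd.Series({
    'Amphibian': 80, 'Bird': 521, 'Fish': 127, 'Mammal': 214,
    'Nonvascular Plant': 333, 'Reptile': 79, 'Vascular Plant': 4470,
})

# Share of each category in percent, largest first.
shares = (counts / counts.sum() * 100).round(1).sort_values(ascending=False)
print(shares)
```

On the full DataFrame, `species['category'].value_counts(normalize=True)` would give the same proportions directly, as fractions rather than percentages.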