# Biodiversity in National Parks

## Introduction:

This is my first Python project shared on GitHub. In it, I explore and analyze the conservation status of species recorded in two datasets sourced from the National Park Service: 5,824 species records and 23,296 park observation records.

### Goals:
Here are a few questions to guide my analysis; ultimately, I expect to answer more as I explore the data in greater depth.

- What is the distribution of `conservation_status` for animals?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which species were spotted the most at each park?
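As a rough illustration of how the last question might be approached, here is a minimal sketch on a toy table. The `tracking_toy` frame and its observation counts are hypothetical stand-ins, not values from the real data; only the column names follow the dataset description.

```python
import pandas as pd

# Hypothetical toy data using the same column names as observations.csv.
tracking_toy = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis latrans", "Canis lupus", "Canis latrans"],
    "park_name": ["Yosemite National Park", "Yosemite National Park",
                  "Bryce National Park", "Bryce National Park"],
    "observations": [30, 85, 27, 20],
})

# Sum sightings per (park, species), in case a species appears in several rows.
totals = tracking_toy.groupby(
    ["park_name", "scientific_name"], as_index=False
)["observations"].sum()

# For each park, pick the row holding the largest total.
top_rows = totals.loc[totals.groupby("park_name")["observations"].idxmax()]
top_per_park = dict(zip(top_rows["park_name"], top_rows["scientific_name"]))
```

Note that `idxmax` returns only the first maximum, so ties within a park would need separate handling.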

**Let's begin by importing the required libraries:**

In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load CSV Files: inspect first 10 rows.

**Below is documentation relating to the data, providing a brief description of each column; I find this to be very useful as a reference when working with new datasets that are outside of my domain.**

species_info.csv - contains data about different species and their conservation status.  
`category` - class of animal  
`scientific_name` - the scientific name of each species  
`common_names` - the common names of each species  
`conservation_status` - each species' current conservation status  

observations.csv - holds recorded sightings of different species at several national parks for the past 7 days.  
`scientific_name` - the scientific name of each species  
`park_name` - park where species were found  
`observations` - the number of times each species was observed at the park

**After reviewing the columns, I have decided not to name the DataFrame 'observations', because it would share its name with a column. To avoid confusion, I will use 'tracking' as a synonym instead.**

In [41]:
species_df = pd.read_csv("~/Desktop/GitHub/BioDiversity-Project/Analysis/species_info.csv")
tracking_df = pd.read_csv("~/Desktop/GitHub/BioDiversity-Project/Analysis/observations.csv")

In [39]:
species_df.head(10)

| | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | NaN |
| 1 | Mammal | Bos bison | American Bison, Bison | NaN |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | NaN |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | NaN |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | NaN |
| 5 | Mammal | Odocoileus virginianus | White-Tailed Deer | NaN |
| 6 | Mammal | Sus scrofa | Feral Hog, Wild Pig | NaN |
| 7 | Mammal | Canis latrans | Coyote | Species of Concern |
| 8 | Mammal | Canis lupus | Gray Wolf | Endangered |
| 9 | Mammal | Canis rufus | Red Wolf | Endangered |


In [42]:
tracking_df.head(10)

| | scientific_name | park_name | observations |
|---|---|---|---|
| 0 | Vicia benghalensis | Great Smoky Mountains National Park | 68 |
| 1 | Neovison vison | Great Smoky Mountains National Park | 77 |
| 2 | Prunus subcordata | Yosemite National Park | 138 |
| 3 | Abutilon theophrasti | Bryce National Park | 84 |
| 4 | Githopsis specularioides | Great Smoky Mountains National Park | 85 |
| 5 | Elymus virginicus var. virginicus | Yosemite National Park | 112 |
| 6 | Spizella pusilla | Yellowstone National Park | 228 |
| 7 | Elymus multisetus | Great Smoky Mountains National Park | 39 |
| 8 | Lysimachia quadrifolia | Yosemite National Park | 168 |
| 9 | Diphyscium cumberlandianum | Yellowstone National Park | 250 |


#### Note: `species_df` immediately shows a high proportion of NaN values in `conservation_status`, and I need to understand why. Both DataFrames also contain a `scientific_name` column, which may be useful as a shared key for linking the two tables.
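The idea of linking the two tables on `scientific_name` could look roughly like the sketch below. It runs on toy stand-ins rather than the real files: `species_toy`, `tracking_toy`, and the observation counts are hypothetical, and only the two conservation statuses are taken from the `head()` output above. A column can only act as a primary key if its values are unique in its own table, so the sketch checks that first.

```python
import pandas as pd

# Toy stand-ins for species_df and tracking_df (observation counts are made up).
species_toy = pd.DataFrame({
    "category": ["Mammal", "Mammal"],
    "scientific_name": ["Canis lupus", "Canis latrans"],
    "conservation_status": ["Endangered", "Species of Concern"],
})
tracking_toy = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Canis latrans"],
    "park_name": ["Yosemite National Park", "Bryce National Park",
                  "Yosemite National Park"],
    "observations": [30, 27, 85],
})

# A primary key must be unique within its own table.
key_is_unique = species_toy["scientific_name"].is_unique

# Left join keeps every sighting; indicator=True adds a "_merge" column
# that flags sightings with no matching species row as "left_only".
merged_toy = tracking_toy.merge(
    species_toy, on="scientific_name", how="left", indicator=True
)
unmatched = int((merged_toy["_merge"] == "left_only").sum())
```

If `is_unique` came back `False` on the real `species_df`, the join would multiply rows, so that check is worth running before any merge.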

## Getting to Know the Data: what data is here, what's missing, what isn't?

In [28]:
species_df.columns, species_df.shape

(Index(['category', 'scientific_name', 'common_names', 'conservation_status'], dtype='object'),
 (5824, 4))

In [44]:
tracking_df.columns, tracking_df.shape

(Index(['scientific_name', 'park_name', 'observations'], dtype='object'),
 (23296, 3))

In [27]:
species_df.isnull().sum().sort_values(ascending=False)

conservation_status    5633
category                  0
scientific_name           0
common_names              0
dtype: int64

In [45]:
tracking_df.isnull().sum().sort_values(ascending=False)

scientific_name    0
park_name          0
observations       0
dtype: int64

In [46]:
species_df.duplicated().sum(), tracking_df.duplicated().sum()

(0, 15)
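To look at the duplicated records themselves rather than just their count, one option is `duplicated(keep=False)`, which marks every copy of a repeated row. The sketch below uses a hypothetical `tracking_toy` frame with one repeated row; it is an illustration, not the real data.

```python
import pandas as pd

# Hypothetical toy data with one fully repeated row.
tracking_toy = pd.DataFrame({
    "scientific_name": ["Spizella pusilla", "Spizella pusilla", "Elymus multisetus"],
    "park_name": ["Yellowstone National Park", "Yellowstone National Park",
                  "Great Smoky Mountains National Park"],
    "observations": [228, 228, 39],
})

# Default keep="first" counts only the second and later copies.
n_dupes = int(tracking_toy.duplicated().sum())

# keep=False marks every copy, which makes the repeated rows easy to inspect.
all_copies = tracking_toy[tracking_toy.duplicated(keep=False)]

# drop_duplicates keeps the first copy of each repeated row.
deduped = tracking_toy.drop_duplicates()
```

Whether dropping them is appropriate is a judgment call: two separate sightings of the same species at the same park could legitimately report the same count.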

### Before exploring further, let's dig a bit deeper to see why species_df contains such a large share of NaN values and why tracking_df contains 15 duplicate records.

#### Are the missing values systematic, missing at random (MAR), missing completely at random (MCAR), or missing not at random (MNAR)? Without much domain knowledge, I will need to look closer.
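One way to start probing this is to check whether the share of missing `conservation_status` values differs across an observed column such as `category`; a strong dependence would argue against MCAR. The sketch below is a minimal illustration on a hypothetical `species_toy` frame (the "Bird" category and all values are assumptions for the example).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the real species_df has many more rows and categories.
species_toy = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Mammal", "Bird", "Bird"],
    "conservation_status": [np.nan, "Endangered", "Endangered", np.nan, np.nan],
})

# Share of rows with a missing status, computed separately for each category.
missing_by_category = (
    species_toy["conservation_status"].isna()
    .groupby(species_toy["category"])
    .mean()
)
```

A comparison like this can only suggest a pattern; it cannot by itself distinguish MAR from MNAR, which depends on how the data were recorded.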

In [36]:
species_df['conservation_status'].value_counts()

Species of Concern    161
Endangered             16
Threatened             10
In Recovery             4
Name: conservation_status, dtype: int64
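As a quick consistency check, the counts above can be reconciled with the earlier `shape` and `isnull()` outputs. The sketch below uses only figures copied from those outputs; the variable names are my own.

```python
# Figures copied from the outputs above.
total_rows = 5824        # rows in species_df
missing_status = 5633    # NaN count in conservation_status
status_counts = {
    "Species of Concern": 161,
    "Endangered": 16,
    "Threatened": 10,
    "In Recovery": 4,
}

# Labeled rows plus missing rows should account for every row.
labeled = sum(status_counts.values())
assert labeled + missing_status == total_rows

missing_share = missing_status / total_rows
print(f"{labeled} labeled, {missing_share:.1%} missing")
# prints "191 labeled, 96.7% missing"
```

So only 191 of the 5,824 rows carry a status, and `value_counts()` omits the NaN rows by default; passing `dropna=False` would include them in the tally.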