# Biodiversity in National Parks

## Introduction:

This is my second Python project shared on GitHub. In it, I explore and analyze the conservation status of roughly 5,800 species, together with about 23,000 park-level observation records, from 2 datasets sourced from the National Park Service.

### Goals:
Here are a few questions to guide my analysis. I expect to answer more as I explore the data in greater depth.

- What is the distribution of `conservation_status` for animals?  
- Are certain types of species more likely to be endangered?  
- Are the differences between species and their conservation status significant?  
- Which species were spotted the most at each park?

**Below is documentation relating to both datasets, providing a brief description of each column; I find this to be very useful as a reference, and it is good to read before diving into a new dataset.**

species_info.csv - contains data about different species and their conservation status.  
`category` - class of animal  
`scientific_name` - the scientific name of each species  
`common_names` - the common names of each species  
`conservation_status` - each species' current conservation status  

observations.csv - holds recorded sightings of different species at several national parks for the past 7 days.  
`scientific_name` - the scientific name of each species  
`park_name` - park where species were found  
`observations` - the number of times each species was observed at the park

**Let's begin by importing a few libraries:**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load CSV Files: inspect the first 5 rows.

**While reviewing the column names, I decided not to name the DataFrame 'observations', because it would share its name with the `observations` column. To avoid confusion, I use 'tracking' instead.**

In [2]:
species_df = pd.read_csv("~/Desktop/GitHub/BioDiversity-Project/Analysis/species_info.csv")
tracking_df = pd.read_csv("~/Desktop/GitHub/BioDiversity-Project/Analysis/observations.csv")

In [52]:
species_df.head()

| | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | NaN |
| 1 | Mammal | Bos bison | American Bison, Bison | NaN |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | NaN |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | NaN |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | NaN |


In [51]:
tracking_df.head()

| | scientific_name | park_name | observations |
|---|---|---|---|
| 0 | Vicia benghalensis | Great Smoky Mountains National Park | 68 |
| 1 | Neovison vison | Great Smoky Mountains National Park | 77 |
| 2 | Prunus subcordata | Yosemite National Park | 138 |
| 3 | Abutilon theophrasti | Bryce National Park | 84 |
| 4 | Githopsis specularioides | Great Smoky Mountains National Park | 85 |


## Digging into the missing data.

Above, I immediately noticed a high proportion of NaN values in species_df, and I need to understand why they occur. *I also see that both DataFrames contain the column `scientific_name`, which may be useful as a shared key for linking the two tables.*
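As a rough illustration of how `scientific_name` could link the two tables, here is a minimal sketch on toy stand-ins for the two DataFrames. The rows and observation counts below are made up for the example, and the choice of a left join is an assumption, not a step taken in this notebook.

```python
import pandas as pd

# Toy stand-ins for species_df and tracking_df (hypothetical rows, not real data)
species_demo = pd.DataFrame({
    "category": ["Mammal", "Mammal"],
    "scientific_name": ["Bos bison", "Ovis aries"],
    "conservation_status": [None, None],
})
tracking_demo = pd.DataFrame({
    "scientific_name": ["Bos bison", "Bos bison", "Ovis aries"],
    "park_name": [
        "Yosemite National Park",
        "Bryce National Park",
        "Yosemite National Park",
    ],
    "observations": [10, 20, 30],  # invented counts
})

# A column only works as a primary key if its values are unique in the lookup table
key_is_unique = species_demo["scientific_name"].is_unique

# Left join: keep every observation row and attach the species attributes
merged_demo = tracking_demo.merge(species_demo, on="scientific_name", how="left")
print(key_is_unique)
print(merged_demo)
```

On the real data, checking `species_df["scientific_name"].is_unique` first would show whether the column can serve as a true primary key or only as a join column with repeats.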

**Next, I will keep exploring the data, checking each column for NaN values and for any other issues that data cleaning could address.**

In [11]:
species_df_shape = species_df.columns, species_df.shape
species_df_shape

(Index(['category', 'scientific_name', 'common_names', 'conservation_status'], dtype='object'),
 (5824, 4))

In [12]:
tracking_df_shape = tracking_df.columns, tracking_df.shape
tracking_df_shape 

(Index(['scientific_name', 'park_name', 'observations'], dtype='object'),
 (23296, 3))

In [13]:
species_null = species_df.isnull().sum().sort_values(ascending=False)
species_null

conservation_status    5633
category                  0
scientific_name           0
common_names              0
dtype: int64

In [14]:
tracking_null = tracking_df.isnull().sum().sort_values(ascending=False)
tracking_null

scientific_name    0
park_name          0
observations       0
dtype: int64
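To put the null counts in perspective, it can help to view them as a share of rows: 5,633 missing values out of 5,824 rows means about 97% of `conservation_status` is empty. A minimal sketch of that calculation, using a small made-up DataFrame in place of species_df:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for species_df
demo = pd.DataFrame({
    "category": ["Mammal", "Bird", "Fish", "Mammal"],
    "conservation_status": [np.nan, "Endangered", np.nan, np.nan],
})

# The mean of a boolean mask is the fraction of True values, i.e. the share of nulls
missing_share = demo.isnull().mean().sort_values(ascending=False)
print(missing_share)
```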

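**Looking closer at the only column with null values, I want to know whether the values are systematically missing, missing at random (MAR), missing completely at random (MCAR), or missing not at random (MNAR). Without much domain knowledge, I will need to inspect the values themselves.**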

In [18]:
conservation_types = species_df['conservation_status'].unique()
print("conservation_types:", conservation_types)
species_df['conservation_status'].value_counts()

conservation_types: [nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery']


Species of Concern    161
Endangered             16
Threatened             10
In Recovery             4
Name: conservation_status, dtype: int64

**It seems the missing values arise because no label is assigned unless a species is a species of concern, endangered, threatened, or in recovery.**
**This tells me the NaN values are systematically missing. When values are systematically missing, there is no need to fill the null values or delete the rows. However, I would like to relabel them so I can perform better analysis, such as computing summary statistics: I will change the NaN values to 'Healthy', indicating that the species at large is not considered at risk.**
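The relabeling described above could be sketched as follows. This is a minimal example on a made-up four-row column, assuming `fillna` is the approach used; on the real data the same call would be applied to `species_df['conservation_status']`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the conservation_status column of species_df
demo = pd.DataFrame({
    "conservation_status": [np.nan, "Species of Concern", "Endangered", np.nan],
})

# Replace the systematically missing labels with an explicit category
demo["conservation_status"] = demo["conservation_status"].fillna("Healthy")

# 'Healthy' now shows up in the counts instead of being silently excluded
status_counts = demo["conservation_status"].value_counts()
print(status_counts)
```

By default `value_counts()` skips NaN, so giving those rows an explicit label is what makes them visible in summaries and group-bys.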

### Now we can look into any duplicate rows.

In [9]:
species_df.duplicated().sum(), tracking_df.duplicated().sum()

(0, 15)
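Before deciding what to do with the 15 duplicated rows in tracking_df, it may help to look at them side by side. The sketch below uses a small made-up DataFrame. Whether dropping is the right choice is an assumption here, since identical rows could also be separate sightings with the same count.

```python
import pandas as pd

# Hypothetical stand-in for tracking_df with one fully repeated row
demo = pd.DataFrame({
    "scientific_name": ["Bos bison", "Bos bison", "Ovis aries"],
    "park_name": ["Bryce National Park"] * 3,
    "observations": [84, 84, 85],  # invented counts
})

# keep=False marks every member of a duplicate group, not only the later repeats
dupes = demo[demo.duplicated(keep=False)]

# Drop the repeats, keeping the first occurrence of each row
deduped = demo.drop_duplicates().reset_index(drop=True)
print(dupes)
print(deduped)
```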