# Analyzing Biodiversity and Conservation Status in National Parks.

## Introduction:

This is my second Python project shared on GitHub. I will explore and analyze two datasets sourced from the National Park Service.

### Goals:
I expect to raise more questions as I explore the data, but these will guide my initial exploration:

- What is the distribution of `conservation_status` for animals?  
- Are certain types of species more likely to be endangered?  
- Are the differences between species and their conservation status significant?  
- Which species were spotted the most at each park?

### Raw Data Files:

**species_info.csv** - contains data about different species and their conservation status.  
`category` - class of animal  
`scientific_name` - the scientific name of each species  
`common_name` - the common names of each species  
`conservation_status` - each species' current conservation status  

**observations.csv** - holds recorded sightings of different species at several national parks for the past 7 days.  
`scientific_name` - the scientific name of each species  
`park_name` - park where species were found  
`observations` - the number of times each species was observed at the park

## Load CSV Files: inspect first 5 rows.

While reviewing the column names, I decided not to name the DataFrame 'Observations', because one of its columns has the same name. To avoid confusion, I will use 'tracking' as a synonym for observations.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
species_df = pd.read_csv("~/Desktop/GitHub/BioDiversity-Project/Analysis/species_info.csv")
tracking_df = pd.read_csv("~/Desktop/GitHub/BioDiversity-Project/Analysis/observations.csv")

In [3]:
species_df.head()

In [4]:
species_df.dtypes

In [5]:
tracking_df.head()

In [6]:
tracking_df.dtypes

## Digging into the missing data.

So far I only see `NaN` values in **species_df**, in the `conservation_status` column. I will look into this more closely as I begin to clean the data.  
*I also see that both DataFrames contain the column `scientific_name`. It can serve as a join key linking the two tables if I choose to merge them.*

**For now, though, I want to continue exploring the data while looking more into each column to see if any others contain NaN, or other possible issues that could be addressed with data cleaning.**
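If the two tables are merged later on `scientific_name`, it helps to check first whether that key is unique in the species table, because a repeated key multiplies rows in the result. A minimal sketch on invented rows (the toy frames and park names are assumptions, not the real data):

```python
import pandas as pd

# Toy stand-ins for the two tables; the rows are invented for illustration.
species = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Ursus americanus"],
    "category": ["Mammal", "Mammal", "Mammal"],
    "conservation_status": ["Endangered", "In Recovery", None],
})
observations = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Ursus americanus"],
    "park_name": ["Park A", "Park A"],
    "observations": [30, 120],
})

# True only if every scientific_name appears exactly once in the species table.
key_is_unique = species["scientific_name"].is_unique

# A left join keeps every sighting; a repeated key on the right multiplies rows.
merged = observations.merge(species, on="scientific_name", how="left")

print(key_is_unique)  # False
print(len(merged))    # 3
```

Here the single "Canis lupus" sighting becomes two rows after the merge, which would inflate any later sum of `observations`.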

In [7]:
species_df_shape = species_df.columns, species_df.shape

species_df_shape

In [8]:
tracking_df_shape = tracking_df.columns, tracking_df.shape

tracking_df_shape 

In [9]:
species_null = species_df.isnull().sum().sort_values(ascending=False)

species_null

In [10]:
total_rows, total_col = species_df.shape
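# Look up the count by label; positional [0] on a labeled Series is deprecated in recent pandas.
null_rows = species_null['conservation_status']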
null_per = (null_rows/total_rows) *100

print(f"{null_per.round(3)}% of the rows in the conservation_status column are null.")
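The same percentage can be computed in one step, since the mean of a boolean mask is the fraction of `True` values. A small sketch on an invented column (the toy values are assumptions):

```python
import pandas as pd

# Invented column with 3 missing values out of 4.
toy = pd.DataFrame({"conservation_status": ["Endangered", None, None, None]})

# isnull() yields booleans; their mean is the fraction of missing values.
null_pct = toy["conservation_status"].isnull().mean() * 100

print(f"{null_pct:.2f}% of the rows are null.")  # 75.00% of the rows are null.
```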

In [11]:
tracking_null = tracking_df.isnull().sum().sort_values(ascending=False)

tracking_null

**Diving deeper, I see that a majority of the rows in the `conservation_status` column are null. I want to know whether the values are systematically missing, missing at random (MAR), missing completely at random (MCAR), or missing not at random (MNAR). Without much domain knowledge, I will look closer at the non-null values.**
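One way to probe this is to compare the share of missing values across another column. If the rate differs sharply between groups, the values are unlikely to be missing completely at random. A rough sketch on invented rows (the toy frame and its category names are assumptions):

```python
import pandas as pd

# Invented rows, for illustration only.
toy = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Bird", "Bird",
                 "Vascular Plant", "Vascular Plant"],
    "conservation_status": ["Endangered", None, "Threatened", None, None, None],
})

# Fraction of missing conservation_status values within each category.
missing_rate = (
    toy["conservation_status"].isnull()
    .groupby(toy["category"])
    .mean()
)

print(missing_rate)
```

A check like this only suggests a pattern; it does not by itself distinguish MAR from MNAR.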

In [12]:
conservation_types = species_df['conservation_status'].unique()

print("conservation_types:", conservation_types)
species_df['conservation_status'].value_counts()

**`Conservation Status`** is an ordinal categorical variable with 4 categories:
- Species of Concern
- Endangered
- Threatened
- In Recovery
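Because the variable is ordinal, pandas can store it with an explicit order so that sorting and comparisons respect it. The sketch below assumes a severity order (least to most severe) that is my own guess, not something stated in the dataset, and the category strings must match the values printed by `unique()` exactly:

```python
import pandas as pd

# Assumed severity order, least to most severe; not taken from the dataset.
status_order = ["In Recovery", "Species of Concern", "Threatened", "Endangered"]
status_dtype = pd.CategoricalDtype(categories=status_order, ordered=True)

# Invented values, for illustration only.
status = pd.Series(
    ["Endangered", "Species of Concern", "Threatened", "In Recovery"],
    dtype=status_dtype,
)

# With an ordered dtype, comparisons and sorting follow the declared order.
at_least_threatened = status >= "Threatened"

print(status.sort_values().tolist())
print(at_least_threatened.tolist())
```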
 
**The large share of null values (96.72%) now seems less surprising. It is common for a species not to fit any of the 4 categories, which suggests that a null value represents *no* conservation status, i.e. the species is not at risk and is healthy.**

Based on this, I will assume that the NaN values are systematically missing because the dataset only records a conservation status for species known to be at risk. This is still an assumption, and it should be verified to make sure the analysis is accurate.

To perform a more comprehensive analysis, I will relabel the null values as 'Healthy', indicating that the species is not considered at risk. This relabeling lets me keep these records in the analysis without imputing a guessed status or deleting the rows.

In [13]:
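# Relabel missing statuses as 'Healthy' and store the column as a categorical.
species_df['conservation_status'] = species_df['conservation_status'].fillna('Healthy').astype('category')
con_status = species_df['conservation_status']

# Take labels from the value_counts index so each label lines up with its count.
con_status_counts = con_status.value_counts()
labels = con_status_counts.index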

con_status_counts

In [14]:
prop = (con_status_counts/total_rows)*100

prop

In [15]:
# Use a distinct loop name so the comprehension does not shadow the `prop` Series.
legend_labels = [f'{label} - {pct:.1f}%' for label, pct in zip(labels, prop)]

In [20]:
plt.pie(con_status_counts)
plt.title('Proportion of Conservation Status')
plt.legend(legend_labels,bbox_to_anchor=(1, 0.5), loc='center left')

plt.show()
plt.clf()

### Before performing further EDA, I want to back up and finish the data cleaning. I need to look into any duplicate rows.

In [17]:
s_dups = species_df.duplicated()
t_dups = tracking_df.duplicated()

s_dups.sum(), t_dups.sum()

In [18]:
duplicates = tracking_df[tracking_df.duplicated(keep=False)]

duplicates.sort_values(by=['scientific_name'])

The rows above are all duplicates. There seems to be no reason to keep both copies of each record, so I will now drop the duplicates.
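For reference, here is how the duplicate helpers used in this section behave on a small invented frame (the rows and park names are assumptions). `duplicated()` flags only the repeats, `keep=False` flags every copy, and `drop_duplicates()` keeps the first copy of each row:

```python
import pandas as pd

# Invented sightings; rows 0 and 2 are identical in every column.
toy = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Ursus americanus",
                        "Canis lupus", "Canis lupus"],
    "park_name": ["Park A", "Park A", "Park A", "Park B"],
    "observations": [30, 120, 30, 45],
})

# Default keep='first': only the later copy of a repeated row is flagged.
flagged_default = toy.duplicated().sum()

# keep=False: every copy of a repeated row is flagged, including the first.
flagged_all = toy.duplicated(keep=False).sum()

# Keeps the first copy of each fully identical row.
deduped = toy.drop_duplicates()

print(flagged_default, flagged_all, len(deduped))  # 1 2 3
```

Note that the "Park B" row survives: rows count as duplicates only when every column matches.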

In [19]:
tracking_df = tracking_df.drop_duplicates()

tracking_df.duplicated().sum()