# Biodiversity in National Parks

## Overview
The goal of this analysis is to identify patterns in the conservation status of species across national parks, determine which species and parks are most at risk, and provide actionable insights that help the National Park Service focus its conservation efforts.

## Project Goals
- Investigate the distribution of species across parks.
- Analyze conservation statuses and identify trends.
- Visualize data to highlight patterns and insights.
- Provide actionable recommendations for conservation efforts.

In [3]:
# Code to import the required libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns

# Configure visualizations for better readability in plots generated by seaborn and matplotlib
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10,6)




In [4]:
# Use pandas to load the CSV files.

species = pd.read_csv('species_info.csv')
observations = pd.read_csv('observations.csv')

# Preview each dataset with head() to confirm that it loaded properly.
print("Species dataset preview: ")
display(species.head())

print("\nObservations Dataset Preview:")
display(observations.head())

Species dataset preview: 


Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,



Observations Dataset Preview:


Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85


In [None]:
# Check the structure of the datasets: number of rows and columns, column names, and data types.
print("Species Dataset Info:")
species.info()

print("\nObservations Dataset Info:")
observations.info()

Species Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name      5824 non-null   object
 2   common_names         5824 non-null   object
 3   conservation_status  191 non-null    object
dtypes: object(4)
memory usage: 182.1+ KB

Observations Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB


In [6]:
# Let's now look at the unique values in key columns.
# The .unique() method returns an array of the unique values in a column.

print("Unique Species Categories:")
print(species['category'].unique(),"\n")

print("Conservation Statuses:")
print(species['conservation_status'].unique(),"\n")

print("Park Names:")
print(observations["park_name"].unique(),"\n")

Unique Species Categories:
['Mammal' 'Bird' 'Reptile' 'Amphibian' 'Fish' 'Vascular Plant'
 'Nonvascular Plant'] 

Conservation Statuses:
[nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery'] 

Park Names:
['Great Smoky Mountains National Park' 'Yosemite National Park'
 'Bryce National Park' 'Yellowstone National Park'] 
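
The `.unique()` output lists which statuses exist but not how many species fall under each. A minimal sketch of how the counts could be obtained with `value_counts(dropna=False)` follows; `toy_species` is a hypothetical stand-in built by hand, not the real `species` table.

```python
import pandas as pd

# Hypothetical miniature stand-in for the species table (made-up rows).
toy_species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Bird", "Bird", "Fish"],
    "conservation_status": [None, "Endangered", None, "Species of Concern", None],
})

# dropna=False keeps the missing statuses as their own row in the tally.
status_counts = toy_species["conservation_status"].value_counts(dropna=False)
print(status_counts)
```

On the real data, the same call on `species['conservation_status']` should show how heavily the missing values dominate before any cleaning is applied.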



In [9]:
#checking for missing values in the observations dataset
print("Missing values in Observations dataset:")
print(observations.isnull().sum())

Missing values in Observations dataset:
scientific_name    0
park_name          0
observations       0
dtype: int64


In [7]:
#checking for missing values in the species dataset
print("Missing values in Species dataset:")
print(species.isnull().sum())

Missing values in Species dataset:
category                  0
scientific_name           0
common_names              0
conservation_status    5633
dtype: int64
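
To put the missing count in perspective, it can be expressed as a share of all rows. The sketch below uses the two counts printed above (5824 rows, 5633 missing statuses) and then shows the equivalent per-column computation with `isnull().mean()` on a hypothetical toy frame, since the CSV files are not loaded here.

```python
import pandas as pd

# Counts taken from the species.info() and isnull().sum() outputs above.
total_rows = 5824
missing_status = 5633

missing_share = missing_status / total_rows
print(f"Share of species without a conservation status: {missing_share:.1%}")

# Equivalent per-column computation on a hypothetical toy frame (made-up rows).
toy = pd.DataFrame({
    "category": ["Mammal", "Bird", "Fish", "Reptile"],
    "conservation_status": [None, "Endangered", None, None],
})
toy_missing_share = toy.isnull().mean()  # fraction of missing values per column
print(toy_missing_share)
```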


In [8]:
# Use "No Intervention" as a placeholder in the conservation_status column:
# a missing value is taken to mean that no special conservation measures are in place for the species.

# Use .fillna() to replace the NaN values. Assigning the result back avoids the chained
# in-place call, which newer pandas versions warn about and which may not modify the DataFrame.
species["conservation_status"] = species["conservation_status"].fillna("No Intervention")

# Verify the change
print("\nMissing values after filling:")
print(species.isnull().sum())


Missing values after filling:
category               0
scientific_name        0
common_names           0
conservation_status    0
dtype: int64
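
Once the placeholder is filled in, "has any conservation status" can only be answered by comparing strings. One possible variant, sketched below on made-up rows, records that information in a boolean column before filling. The column name `is_protected` and the `toy` frame are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame (made-up rows), standing in for the species table.
toy = pd.DataFrame({
    "scientific_name": ["Species A", "Species B", "Species C"],
    "conservation_status": [None, "Endangered", None],
})

# Record which rows had a real status BEFORE the placeholder overwrites the NaNs.
toy["is_protected"] = toy["conservation_status"].notna()

# Same fill as in the cell above, written as a plain assignment.
toy["conservation_status"] = toy["conservation_status"].fillna("No Intervention")
print(toy)
```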


In [10]:
# To improve data quality we will also check for duplicates in our data

print("Duplicate rows in Species dataset:", species.duplicated().sum())

print("Duplicate rows in observation dataset:", observations.duplicated().sum())

Duplicate rows in Species dataset: 0
Duplicate rows in observation dataset: 15
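
Before dropping the 15 flagged rows, it may be worth looking at them, since two identical rows could also be two separate sightings that happen to share a count. The sketch below shows how `duplicated(keep=False)` exposes every copy of a repeated row; `toy_obs` is a hypothetical frame with made-up values, not the real observations data.

```python
import pandas as pd

# Hypothetical toy observations (made-up rows); the first two rows are identical.
toy_obs = pd.DataFrame({
    "scientific_name": ["Species A", "Species A", "Species B", "Species C"],
    "park_name": ["Bryce National Park", "Bryce National Park",
                  "Yosemite National Park", "Yosemite National Park"],
    "observations": [68, 68, 77, 138],
})

# keep=False marks every member of a duplicate group, not just the later copies.
dupes = toy_obs[toy_obs.duplicated(keep=False)].sort_values(["scientific_name", "park_name"])
print(dupes)

# Dropping duplicates keeps the first copy of each repeated row.
deduped = toy_obs.drop_duplicates().reset_index(drop=True)
print(len(toy_obs), "->", len(deduped))
```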


In [12]:
# Remove the duplicate rows from the observations dataset

observations.drop_duplicates(inplace=True)
print("Duplicates removed. Remaining duplicates in observation dataset:",observations.duplicated().sum())

Duplicates removed. Remaining duplicates in observation dataset: 0
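
Note that `duplicated()` without arguments only flags rows that match in every column. A repeated key, such as the same `scientific_name` listed with different common names, would pass that check. The sketch below illustrates the difference using the `subset` parameter on a hypothetical toy frame; whether the real species table contains such repeats is not shown in this notebook.

```python
import pandas as pd

# Hypothetical toy species table (made-up rows): "Species A" appears twice
# with different common names, so the rows are not identical.
toy_species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Bird"],
    "scientific_name": ["Species A", "Species A", "Species B"],
    "common_names": ["Name one", "Name two", "Name three"],
})

full_row_dupes = int(toy_species.duplicated().sum())
name_dupes = int(toy_species.duplicated(subset=["scientific_name"]).sum())

print("Full-row duplicates:", full_row_dupes)
print("Repeated scientific names:", name_dupes)
```

If the real data shows repeated names, they would need a deliberate rule (for example, which row's status to keep) before the two tables are combined.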
