# Introduction

This project aims to analyze biodiversity within four prominent national parks in the United States: Bryce Canyon, Great Smoky Mountains, Yellowstone, and Yosemite. Using Python, we will explore species observations, conservation statuses, and their ecological implications.

The analysis will address the following key questions:

**1. Species Observations**
- *Most Observed Species*: Identify the species that are most frequently observed in each park to understand which are better adapted to these environments.
- *Variability Across Parks*: Compare species observations across the four parks to reveal significant patterns and differences in biodiversity.
- *Temporal Trends*: Analyze historical observation data to determine trends over time, including whether certain species sightings are increasing or decreasing.

**2. Species Information**
- *Species Distribution by Category*: Examine the distribution of species across categories (Amphibian, Bird, etc.) to understand which groups are most represented in each park.
- *Conservation Status*: Identify species with endangered statuses and assess how these species are distributed across the parks, providing insight into conservation needs.
- *Endemic and Rare Species*: Determine which species are endemic or rare in each park to identify priority candidates for conservation efforts.

## Data sources:

Both `observations.csv` and `species_info.csv` were provided by [Codecademy.com](https://www.codecademy.com).

Note: The data for this project is *inspired* by real data, but is mostly fictional.

# Import Python Modules

In [36]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline

# Data Acquisition & Preparation
## Data Import
### *observations.csv*
This dataset contains records of species observations across four national parks. The fields are as follows:

- **scientific_name**: The scientific name of the species observed, following the binomial nomenclature system.
- **park_name**: The name of the national park where the observation was made. The parks included are Bryce Canyon, Great Smoky Mountains, Yellowstone, and Yosemite.
- **observations**: The number of times each species has been observed in the respective park, providing insights into species abundance and distribution.

In [43]:
observations_df = pd.read_csv("observations.csv")
observations_df.head()

|   | scientific_name | park_name | observations |
|---|---|---|---|
| 0 | Vicia benghalensis | Great Smoky Mountains National Park | 68 |
| 1 | Neovison vison | Great Smoky Mountains National Park | 77 |
| 2 | Prunus subcordata | Yosemite National Park | 138 |
| 3 | Abutilon theophrasti | Bryce National Park | 84 |
| 4 | Githopsis specularioides | Great Smoky Mountains National Park | 85 |


### *species_info.csv*
This dataset provides detailed information about various species, including their classification and conservation status. The fields are as follows:

- **category**: The taxonomic category of the species, which can include Amphibian, Bird, Fish, Mammal, Nonvascular Plant, Reptile, or Vascular Plant.
- **scientific_name**: The scientific name of the species, which matches the names in the observations dataset for easy cross-referencing.
- **common_names**: Common names or vernacular names associated with the species, providing a more relatable reference for non-scientific audiences.
- **conservation_status**: The conservation status of the species, which may be empty or indicate statuses such as Endangered, In Recovery, Species of Concern, or Threatened, highlighting the species' risk levels.

In [46]:
species_info_df = pd.read_csv("species_info.csv")
species_info_df.head()

|   | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | |
| 1 | Mammal | Bos bison | American Bison, Bison | |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | |


## Data Exploration & Cleaning
**observations.csv**

In [84]:
# observations_df.info()
# There are no null values, but scientific_name and park_name are mapped as objects

In [86]:
# Start by casting the first two columns to the pandas string dtype
observations_df['scientific_name'] = observations_df['scientific_name'].astype("string")
observations_df['park_name'] = observations_df['park_name'].astype("string")
observations_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  string
 1   park_name        23296 non-null  string
 2   observations     23296 non-null  int64 
dtypes: int64(1), string(2)
memory usage: 546.1 KB


In [101]:
# Let's check the different values for each field

for column in ["scientific_name", "park_name"]:
    nunique_values = observations_df[column].nunique()
    print(f"# of Unique values for {column}: {nunique_values}\n")

# For observations, a numeric column, report the range of values instead of the unique count

print(f"Range of values for observations: [{observations_df['observations'].min()}, {observations_df['observations'].max()}]\n")

# of Unique values for scientific_name: 5541

# of Unique values for park_name: 4

Range of values for observations: [9, 321]
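
The counts above suggest a follow-up check. With 5,541 species and 4 parks there can be at most 22,164 distinct (species, park) pairs, yet the table holds 23,296 rows, so some pairs probably appear more than once. The sketch below shows one way such repeats could be located and combined. It runs on a small invented frame (`toy_obs`, hypothetical) rather than the real file, and treating repeated rows as separate tallies that can be summed is an assumption, not something the data source states.

```python
import pandas as pd

# Hypothetical toy data standing in for observations_df (all values invented).
toy_obs = pd.DataFrame({
    "scientific_name": ["Species a", "Species a", "Species b", "Species b"],
    "park_name": ["Park X", "Park X", "Park X", "Park Y"],
    "observations": [10, 5, 7, 3],
})

key = ["scientific_name", "park_name"]

# Every row whose (species, park) pair occurs more than once.
repeated = toy_obs[toy_obs.duplicated(subset=key, keep=False)]

# Assumption: repeated rows are independent tallies, so they are summed.
combined = toy_obs.groupby(key, as_index=False)["observations"].sum()

print(repeated)
print(combined)
```

If repeated rows turn out to be exact duplicates instead, `drop_duplicates` would be the more appropriate choice than summing.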




**species_info.csv**

In [97]:
#species_info_df.info()
# No null values except in the "conservation_status" column, where most values are null. All fields are stored as objects.
# Let's convert each column to the most suitable data type:
species_info_df['category'] = species_info_df['category'].astype('category')
species_info_df['scientific_name'] = species_info_df['scientific_name'].astype('string')
species_info_df['common_names'] = species_info_df['common_names'].astype('string')
species_info_df['conservation_status'] = species_info_df['conservation_status'].astype('string')
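# Assign the result back: fillna(..., inplace=True) on a selected column may not update the DataFrame in recent pandas versions
species_info_df['conservation_status'] = species_info_df['conservation_status'].fillna('No Intervention')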
species_info_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   category             5824 non-null   category
 1   scientific_name      5824 non-null   string  
 2   common_names         5824 non-null   string  
 3   conservation_status  5824 non-null   string  
dtypes: category(1), string(3)
memory usage: 142.7 KB


In [111]:
# Let's check the different values for each field

for column in species_info_df.columns:
    nunique_values = species_info_df[column].nunique()
    print(f"# of Unique values for {column}: {nunique_values}\n")

# of Unique values for category: 7

# of Unique values for scientific_name: 5541

# of Unique values for common_names: 5504

# of Unique values for conservation_status: 5
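
The table has 5,824 rows but only 5,541 unique scientific names, so 283 rows repeat a name that already appears. Before joining this table to the observations, it may help to see whether the repeated rows agree with each other. The sketch below is a minimal illustration on invented data (`toy_species`, hypothetical); it flags names whose repeated rows carry different conservation statuses.

```python
import pandas as pd

# Hypothetical toy data standing in for species_info_df (all values invented).
toy_species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Bird", "Bird"],
    "scientific_name": ["Species a", "Species a", "Species b", "Species c"],
    "common_names": ["Name one", "Name one, Name two", "Name three", "Name four"],
    "conservation_status": ["No Intervention", "Endangered",
                            "No Intervention", "No Intervention"],
})

# All rows that share a scientific name with at least one other row.
dupes = toy_species[toy_species.duplicated("scientific_name", keep=False)]

# Count distinct statuses per repeated name; more than one means a conflict.
status_counts = dupes.groupby("scientific_name")["conservation_status"].nunique()
conflicting = status_counts[status_counts > 1].index.tolist()

print(dupes)
print(conflicting)
```

How to resolve a conflict (for example, keeping the most severe status) is a judgment call that the data source does not settle.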



# Analysis & Interpretation: Data Visualization and Hypothesis Testing
## Species Observations
**Most Observed species in each park**
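
One possible way to approach this question is to total the observations per park and species, then pick the largest total within each park. The sketch below illustrates the idea on invented data (`toy_obs`, hypothetical); it is not the project's final analysis.

```python
import pandas as pd

# Hypothetical toy data standing in for observations_df (all values invented).
toy_obs = pd.DataFrame({
    "scientific_name": ["Species a", "Species b", "Species a", "Species b"],
    "park_name": ["Park X", "Park X", "Park Y", "Park Y"],
    "observations": [120, 80, 40, 95],
})

# Total per (park, species), in case a pair is spread over several rows.
totals = toy_obs.groupby(
    ["park_name", "scientific_name"], as_index=False
)["observations"].sum()

# idxmax gives the row label of the largest total within each park.
top_rows = totals.loc[totals.groupby("park_name")["observations"].idxmax()]
top_by_park = dict(zip(top_rows["park_name"], top_rows["scientific_name"]))

print(top_by_park)
```

Note that `idxmax` keeps only the first row when several species tie for the maximum.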

**Comparison of species observations across parks: patterns and differences in biodiversity**
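
Because the parks may differ in overall observation volume, comparing raw counts can be misleading. A hedged sketch of one alternative: pivot the data to a species-by-park table and convert each column to shares of that park's total. The frame `toy_obs` is invented for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for observations_df (all values invented).
toy_obs = pd.DataFrame({
    "scientific_name": ["Species a", "Species b", "Species a", "Species b"],
    "park_name": ["Park X", "Park X", "Park Y", "Park Y"],
    "observations": [120, 80, 40, 95],
})

# Overall observation volume per park.
park_totals = toy_obs.groupby("park_name")["observations"].sum()

# Species as rows, parks as columns; missing pairs become 0.
wide = toy_obs.pivot_table(
    index="scientific_name",
    columns="park_name",
    values="observations",
    aggfunc="sum",
    fill_value=0,
)

# Share of each park's observations attributed to each species.
share = wide / wide.sum()

print(park_totals)
print(share)
```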

**Temporal trends**

## Species Information
**Species distribution by category: which groups are most represented in each park?**
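
Answering this requires joining the two tables on `scientific_name`. Since the species table repeats some names, a plain merge could multiply observation rows, so the sketch below first keeps one row per species and asks `merge` to validate the many-to-one relationship. Both frames are invented (`toy_obs`, `toy_species`), and keeping the first row per name is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy data (all values invented).
toy_obs = pd.DataFrame({
    "scientific_name": ["Species a", "Species b", "Species c", "Species a"],
    "park_name": ["Park X", "Park X", "Park X", "Park Y"],
    "observations": [30, 20, 10, 25],
})
toy_species = pd.DataFrame({
    "scientific_name": ["Species a", "Species a", "Species b", "Species c"],
    "category": ["Mammal", "Mammal", "Bird", "Bird"],
})

# Assumption: the first row per scientific name is representative.
species_unique = toy_species.drop_duplicates("scientific_name")

# validate raises an error if the right-hand side still has repeated keys.
merged = toy_obs.merge(
    species_unique, on="scientific_name", how="left", validate="many_to_one"
)

# Number of distinct species per park and category.
species_per_category = (
    merged.groupby(["park_name", "category"])["scientific_name"]
    .nunique()
    .unstack(fill_value=0)
)

print(species_per_category)
```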

**Conservation Status: endangered species and their distribution among parks**
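
A simple starting point is the share of species in each category that carry any conservation status other than `No Intervention` (the fill value used during cleaning). The sketch below computes that share on invented data (`toy_species`, hypothetical); the helper column `is_protected` is introduced here only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for species_info_df (all values invented).
toy_species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Bird", "Bird", "Bird"],
    "conservation_status": ["Endangered", "No Intervention", "No Intervention",
                            "Threatened", "No Intervention"],
})

# Illustrative flag: True for any status other than the fill value.
toy_species["is_protected"] = toy_species["conservation_status"] != "No Intervention"

# The mean of a boolean column is the fraction of True values.
pct_protected = toy_species.groupby("category")["is_protected"].mean() * 100

print(pct_protected.round(1))
```

Whether differences between categories are statistically meaningful would need a separate test on the real counts.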

**Endemic & Rare species in each park**
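
The datasets do not label species as endemic or rare, so any definition here is a proxy. The sketch below uses two assumed proxies on invented data (`toy_obs`, hypothetical): a species recorded in only one park, and a species whose total observations fall at or below a low quantile. The cutoff `RARE_QUANTILE` is an arbitrary illustrative value.

```python
import pandas as pd

# Hypothetical toy data standing in for observations_df (all values invented).
toy_obs = pd.DataFrame({
    "scientific_name": ["Species a", "Species a", "Species b",
                        "Species c", "Species c"],
    "park_name": ["Park X", "Park Y", "Park X", "Park X", "Park Y"],
    "observations": [100, 90, 12, 60, 70],
})

# Proxy 1: species recorded in exactly one park.
parks_per_species = toy_obs.groupby("scientific_name")["park_name"].nunique()
single_park = parks_per_species[parks_per_species == 1].index.tolist()

# Proxy 2: species whose total observations sit in the lowest quantile.
RARE_QUANTILE = 0.25  # assumed cutoff, chosen only for illustration
totals = toy_obs.groupby("scientific_name")["observations"].sum()
rare = totals[totals <= totals.quantile(RARE_QUANTILE)].index.tolist()

print(single_park)
print(rare)
```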

# Conclusions