# Project: Analyzing Endangered Species in National Parks 🏞️🦋

## Introduction

This project analyzes National Park Service data on species observed in several national parks, with a focus on endangered species. The goal is to uncover patterns and trends in the conservation statuses of these species and to turn them into insights and recommendations for biodiversity conservation. The project uses two datasets: `observations.csv` and `species_info.csv`.

## Objectives

1. **Data Cleaning and Preparation**:
   - Handle missing values and ensure data consistency.
   - Standardize common names and scientific names for accuracy.

2. **Descriptive Analysis**:
   - Summarize the data to understand the distribution of species across different categories.
   - Highlight the proportion of species with various conservation statuses (e.g., Endangered, Threatened).

3. **Pattern Detection**:
   - Investigate whether certain categories (e.g., Mammals, Birds) are more prone to endangerment.
   - Identify any geographical patterns or park-specific trends.

4. **Visualization**:
   - Create clear and informative plots to illustrate key findings.
   - Use bar charts, pie charts, and geographical maps to represent data visually.

5. **Pose and Answer Questions**:
   - What percentage of species in the dataset are classified as endangered?
   - Are there specific parks with a higher concentration of endangered species?
   - Do certain species categories show higher endangerment rates?

6. **Interpretation and Reporting**:
   - Summarize the insights gained from the analysis.
   - Provide recommendations for conservation efforts based on the findings.
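
The three questions under objective 5 can be sketched on a small in-memory example. This is a minimal sketch, not the project's actual analysis: the species names, park names, and counts below are hypothetical, and treating "has any conservation status" as a proxy for being protected is an assumption.

```python
import pandas as pd

# Toy data with the same column names as the two CSV files.
# All values here are hypothetical and only illustrate the calculations.
species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Bird", "Bird",
                 "Vascular Plant", "Vascular Plant"],
    "scientific_name": ["A a", "B b", "C c", "D d", "E e", "F f"],
    "conservation_status": ["Endangered", None, "Threatened", None, None, None],
})
observations = pd.DataFrame({
    "scientific_name": ["A a", "A a", "C c", "C c", "E e"],
    "park_name": ["Park X", "Park Y", "Park X", "Park Y", "Park X"],
    "observations": [10, 30, 20, 5, 100],
})

# Q1: percentage of species classified as Endangered.
pct_endangered = (species["conservation_status"] == "Endangered").mean() * 100

# Q3: share of species per category that carry any conservation status.
species["is_protected"] = species["conservation_status"].notna()
rate_by_category = species.groupby("category")["is_protected"].mean()

# Q2: total observations of protected species in each park.
merged = observations.merge(species, on="scientific_name", how="left")
protected_obs_by_park = (
    merged[merged["is_protected"]]
    .groupby("park_name")["observations"]
    .sum()
)

print(round(pct_endangered, 1))
print(rate_by_category)
print(protected_obs_by_park)
```

On the real data, the merge step assumes that `scientific_name` is unique in `species_info.csv`; otherwise duplicated species rows would inflate the per-park totals.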

## Dataset Overview: observations.csv

* **scientific_name:** The formal scientific name of the species observed, typically in Latin, following the binomial nomenclature system.
* **park_name:** The name of the national park where the observation was made.
* **observations:** The number of observations recorded for the species in the specified park.

In [31]:
import pandas as pd

# Load the observations.csv dataset
observations_data = pd.read_csv('observations.csv')

# Display the first few rows of the dataset
observations_data.head()

| | scientific_name | park_name | observations |
|---|---|---|---|
| 0 | Vicia benghalensis | Great Smoky Mountains National Park | 68 |
| 1 | Neovison vison | Great Smoky Mountains National Park | 77 |
| 2 | Prunus subcordata | Yosemite National Park | 138 |
| 3 | Abutilon theophrasti | Bryce National Park | 84 |
| 4 | Githopsis specularioides | Great Smoky Mountains National Park | 85 |


## Dataset Overview: species_info.csv

* **category:** Indicates the biological classification of the species, such as Mammal, Bird, Reptile, etc.
* **scientific_name:** The formal scientific name of the species, typically in Latin, following the binomial nomenclature system.
* **common_names:** Lists the commonly used names for the species in everyday language. There can be multiple common names for a single species, separated by commas.
* **conservation_status:** Indicates the level of threat faced by the species, such as Endangered, Threatened, etc. This column may contain missing values (NaN) for species without a defined status.

In [32]:
import pandas as pd

# Load the species_info.csv dataset
species_info = pd.read_csv('species_info.csv')

# Display the first few rows of the dataset
species_info.head()

| | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | NaN |
| 1 | Mammal | Bos bison | American Bison, Bison | NaN |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | NaN |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | NaN |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | NaN |
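
Since `conservation_status` is missing for most rows and `common_names` packs several names into one comma-separated string, a cleaning step along the following lines may be useful. This is only a sketch: the `"No Status"` label, the `common_name_list` column, and the third example row are hypothetical choices, not part of the dataset.

```python
import pandas as pd

# Small stand-in for species_info; the last row is a made-up example
# so that the frame contains at least one non-missing status.
species = pd.DataFrame({
    "scientific_name": ["Bos bison", "Cervus elaphus", "Hypothetical species"],
    "common_names": ["American Bison, Bison", "Wapiti Or Elk", "Example Name"],
    "conservation_status": [None, None, "Endangered"],
})

# Replace missing statuses with an explicit label (assumed name).
NO_STATUS = "No Status"
species["conservation_status"] = species["conservation_status"].fillna(NO_STATUS)

# Split the comma-separated common names into a list and trim whitespace.
species["common_name_list"] = species["common_names"].str.split(",").apply(
    lambda names: [name.strip() for name in names]
)

print(species[["scientific_name", "conservation_status", "common_name_list"]])
```

Using an explicit label instead of `NaN` keeps these species visible in `value_counts()` and `groupby` results, which drop missing keys by default.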


## Descriptive Statistics: Observations

In [33]:
# Perform descriptive statistics on the observations_data dataset
# (include='all' also summarizes the non-numeric columns)
observations_data_stats = observations_data.describe(include='all').transpose()

# Display the descriptive statistics
print("\nDescriptive Statistics for observations.csv")
print(observations_data_stats)


Descriptive Statistics for observations.csv
                   count unique                                  top  freq  \
scientific_name    23296   5541                     Myotis lucifugus    12   
park_name          23296      4  Great Smoky Mountains National Park  5824   
observations     23296.0    NaN                                  NaN   NaN   

                       mean        std  min   25%    50%    75%    max  
scientific_name         NaN        NaN  NaN   NaN    NaN    NaN    NaN  
park_name               NaN        NaN  NaN   NaN    NaN    NaN    NaN  
observations     142.287904  69.890532  9.0  86.0  124.0  195.0  321.0  
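
The most frequent `scientific_name` appears 12 times even though there are only 4 parks, which suggests that some species and park pairs occur in more than one row. One possible way to check for and consolidate such rows is sketched below; the park names and counts are hypothetical, and summing the repeated rows is an assumption about how they should be combined.

```python
import pandas as pd

# Toy observations with a repeated (species, park) pair; values are made up.
obs = pd.DataFrame({
    "scientific_name": ["Myotis lucifugus"] * 3 + ["Vicia benghalensis"],
    "park_name": ["Park X", "Park X", "Park Y", "Park X"],
    "observations": [10, 20, 5, 68],
})

# Flag every row whose (species, park) pair occurs more than once.
dup_mask = obs.duplicated(subset=["scientific_name", "park_name"], keep=False)
print("Rows in repeated pairs:", dup_mask.sum())

# Collapse repeated pairs into a single row by summing their counts.
totals = obs.groupby(
    ["scientific_name", "park_name"], as_index=False
)["observations"].sum()
print(totals)
```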


## Descriptive Statistics: Species Info

In [34]:
# Perform descriptive statistics on species_info dataset
species_info_stats = species_info.describe(include='all').transpose()

# Display the descriptive statistics
print("Descriptive Statistics for species_info.csv")
print(species_info_stats)

Descriptive Statistics for species_info.csv
                    count unique                 top  freq
category             5824      7      Vascular Plant  4470
scientific_name      5824   5541   Castor canadensis     3
common_names         5824   5504  Brachythecium Moss     7
conservation_status   191      4  Species of Concern   161
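
A few proportions follow directly from the counts printed above. The short calculation below only restates those figures as percentages; note that the counts are per row, and because `scientific_name` has fewer unique values than rows, they are not necessarily per distinct species.

```python
# Counts copied from the describe() output for species_info.csv.
total_rows = 5824
unique_scientific_names = 5541
rows_with_status = 191
species_of_concern_rows = 161
vascular_plant_rows = 4470

# Share of rows that have any conservation status.
pct_with_status = rows_with_status / total_rows * 100

# Share of status-bearing rows labeled "Species of Concern".
pct_species_of_concern = species_of_concern_rows / rows_with_status * 100

# Share of rows in the most common category.
pct_vascular_plant = vascular_plant_rows / total_rows * 100

# Rows beyond the first occurrence of each scientific name.
repeated_name_rows = total_rows - unique_scientific_names

print(f"Rows with a status: {pct_with_status:.2f}%")
print(f"'Species of Concern' among those: {pct_species_of_concern:.1f}%")
print(f"Vascular Plant rows: {pct_vascular_plant:.1f}%")
print(f"Rows with a repeated scientific name: {repeated_name_rows}")
```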
