# Project: Analyzing Endangered Species in National Parks 🏞️🦋

## Introduction

This project involves analyzing data from the National Parks Service to investigate endangered species in various parks. The goal is to uncover patterns and trends in the conservation statuses of these species, providing insights and recommendations for biodiversity conservation. We have two datasets to work with in this project: "observations.csv" and "species_info.csv".

## Objectives

1. **Data Cleaning and Preparation**:
   - Handle missing values and ensure data consistency.
   - Standardize common names and scientific names for accuracy.

2. **Descriptive Analysis**:
   - Summarize the data to understand the distribution of species across different categories.
   - Highlight the proportion of species with various conservation statuses (e.g., Endangered, Threatened).

3. **Pattern Detection**:
   - Investigate if certain categories (e.g., Mammals, Birds) are more prone to endangerment.
   - Identify any geographical patterns or park-specific trends.

4. **Visualization**:
   - Create clear and informative plots to illustrate key findings.
   - Use bar charts, pie charts, and geographical maps to represent data visually.

5. **Pose and Answer Questions**:
   - What percentage of species in the dataset are classified as endangered?
   - Are there specific parks with a higher concentration of endangered species?
   - Do certain species categories show higher endangerment rates?

6. **Interpretation and Reporting**:
   - Summarize the insights gained from the analysis.
   - Provide recommendations for conservation efforts based on the findings.

## Importing Modules
These are the modules that we'll be using thoughout this project:
* **Pandas:** Data analysis and manipulation.
* **Seaborn:** Data visualization.

In [71]:
import pandas as pd

## Dataset Overview: observations.csv

* **scientific_name:** The formal scientific name of the species observed, typically in Latin, following the binomial nomenclature system.
* **park_name:** The name of the national park where the observation was made.
* **observations:** The number of observations recorded for the species in the specified park.

In [72]:
# Load the obervations.csv dataset
observations_data = pd.read_csv('observations.csv')

# Display the first few rows of the dataset
observations_data.head()

Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85


## Dataset Overview: species_info.csv

* **category:** Indicates the biological classification of the species, such as Mammal, Bird, Reptile, etc.
* **scientific_name:** The formal scientific name of the species, typically in Latin, following the binomial nomenclature system.
* **common_names:** Lists the commonly used names for the species in everyday language. There can be multiple common names for a single species, separated by commas.
* **conservation_status:** Indicates the level of threat faced by the species, such as Endangered, Threatened, etc. This column may contain missing values (NaN) for species without a defined status.

In [73]:
# Load the species_info.csv dataset
species_info = pd.read_csv('species_info.csv')

# Display the first few rows of the dataset
species_info.head()

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,


## Descrpitvive Statistics: observations

In [74]:
# Perform descriptive statistics on species_info dataset
species_info_stats = species_info.describe(include='all').transpose()

# Perform descriptive statistics on observations_data dataset
observations_data_stats = observations_data.describe(include='all').transpose()

# Display the descriptive statistics
print("\nDescriptive Statistics for observations.csv")
print(observations_data_stats)


Descriptive Statistics for observations.csv
                   count unique                                  top  freq  \
scientific_name    23296   5541                     Myotis lucifugus    12   
park_name          23296      4  Great Smoky Mountains National Park  5824   
observations     23296.0    NaN                                  NaN   NaN   

                       mean        std  min   25%    50%    75%    max  
scientific_name         NaN        NaN  NaN   NaN    NaN    NaN    NaN  
park_name               NaN        NaN  NaN   NaN    NaN    NaN    NaN  
observations     142.287904  69.890532  9.0  86.0  124.0  195.0  321.0  


## Descrpitvive Statistics: Species Info

In [75]:
# Perform descriptive statistics on species_info dataset
species_info_stats = species_info.describe(include='all').transpose()

# Display the descriptive statistics
print("Descriptive Statistics for species_info.csv")
print(species_info_stats)

Descriptive Statistics for species_info.csv
                    count unique                 top  freq
category             5824      7      Vascular Plant  4470
scientific_name      5824   5541   Castor canadensis     3
common_names         5824   5504  Brachythecium Moss     7
conservation_status   191      4  Species of Concern   161


## 1. Data Cleaning and Preparation
Looking through our the descriptive statistics, it seems that we may have some data cleaning to do.

I noticed we have 5824 scientific names, but only 5541 of them were unique. Wich is strange, because two species should not have the same cientific name. So, we have a problem: there are duplicated species. Let's hunt for those duplicates, should we?

In [86]:
# Creating a view of the duplicated rows
duplicate_species = species_info[species_info.duplicated('scientific_name', keep=False)].sort_values('scientific_name')
duplicate_species

Unnamed: 0,category,scientific_name,common_names,conservation_status
5553,Vascular Plant,Agrostis capillaris,"Rhode Island Bent, Colonial Bent, Colonial Ben...",
2132,Vascular Plant,Agrostis capillaris,"Rhode Island Bent, Colonial Bent, Colonial Ben...",
2134,Vascular Plant,Agrostis gigantea,"Redtop, Black Bent, Redtop, Water Bentgrass",
5554,Vascular Plant,Agrostis gigantea,"Redtop, Black Bent, Redtop, Water Bentgrass",
4178,Vascular Plant,Agrostis mertensii,"Northern Agrostis, Arctic Bentgrass, Northern ...",
...,...,...,...,...
5643,Vascular Plant,Vulpia myuros,"Rattail Fescue, Foxtail Fescue, Rattail Fescue...",
2331,Vascular Plant,Vulpia octoflora,"Annual Fescue, Eight-Flower Six-Weeks Grass, P...",
4290,Vascular Plant,Vulpia octoflora,"Annual Fescue, Eight-Flower Six-Weeks Grass, P...",
3347,Vascular Plant,Zizia aptera,"Golden Alexanders, Heartleaf Alexanders, Heart...",


Now, we'll organize them to get a better view of wha seems to be the problem.

In [77]:
# Organizing 'scientific_name' column alphabetically
species_info[species_info.duplicated('scientific_name', keep=False)].sort_values('scientific_name')

Unnamed: 0,category,scientific_name,common_names,conservation_status
5553,Vascular Plant,Agrostis capillaris,"Colonial Bent, Colonial Bentgrass",
2132,Vascular Plant,Agrostis capillaris,Rhode Island Bent,
2134,Vascular Plant,Agrostis gigantea,Redtop,
5554,Vascular Plant,Agrostis gigantea,"Black Bent, Redtop, Water Bentgrass",
4178,Vascular Plant,Agrostis mertensii,"Arctic Bentgrass, Northern Bentgrass",
...,...,...,...,...
5643,Vascular Plant,Vulpia myuros,"Foxtail Fescue, Rattail Fescue, Rat-Tail Fescu...",
2331,Vascular Plant,Vulpia octoflora,Annual Fescue,
4290,Vascular Plant,Vulpia octoflora,"Eight-Flower Six-Weeks Grass, Pullout Grass, S...",
3347,Vascular Plant,Zizia aptera,"Heartleaf Alexanders, Heart-Leaf Alexanders, M...",


In [78]:
# Function to merge common names of duplicate rows
def merge_common_names(df):
    df['common_names'] = df.groupby('scientific_name')['common_names'].transform(lambda x: ', '.join(x.unique()))
    return df.drop_duplicates(subset='scientific_name', keep='first')

# Apply the merge function
species_info_merged = merge_common_names(species_info)

# Verify the merge by displaying the first few rows of the merged dataset
species_info_merged.describe(include='all').transpose()

Unnamed: 0,count,unique,top,freq
category,5541,7,Vascular Plant,4262
scientific_name,5541,5541,Clethrionomys gapperi gapperi,1
common_names,5541,5237,Dicranum Moss,7
conservation_status,178,4,Species of Concern,151


Now we have to fill the NaN values in the conservation_status column. I decided to name the NaN values as "Conserved".

In [79]:
# Fill NaN values in the conservation_status column with a placeholder
species_info_filled = species_info_merged.fillna({'conservation_status': 'Conserved'})
species_info_filled

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,Conserved
1,Mammal,Bos bison,"American Bison, Bison",Conserved
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",Conserved
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",Conserved
4,Mammal,Cervus elaphus,"Wapiti Or Elk, Rocky Mountain Elk",Conserved
...,...,...,...,...
5819,Vascular Plant,Solanum parishii,Parish's Nightshade,Conserved
5820,Vascular Plant,Solanum xanti,"Chaparral Nightshade, Purple Nightshade",Conserved
5821,Vascular Plant,Parthenocissus vitacea,"Thicket Creeper, Virginia Creeper, Woodbine",Conserved
5822,Vascular Plant,Vitis californica,"California Grape, California Wild Grape",Conserved


Now that the species_info is prepared, I will put the data in a new value with new index called "species_info_prepared"

In [85]:
# Reset the index
species_info_filled.reset_index(drop=True, inplace=True)

# Save the filled dataset with the reorganized index to a new CSV file
species_info_filled.to_csv('species_info_prepared.csv', index=False)

species_info_prepared = pd.read_csv('species_info_prepared.csv')
species_info_prepared

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,Conserved
1,Mammal,Bos bison,"American Bison, Bison",Conserved
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",Conserved
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",Conserved
4,Mammal,Cervus elaphus,"Wapiti Or Elk, Rocky Mountain Elk",Conserved
...,...,...,...,...
5536,Vascular Plant,Solanum parishii,Parish's Nightshade,Conserved
5537,Vascular Plant,Solanum xanti,"Chaparral Nightshade, Purple Nightshade",Conserved
5538,Vascular Plant,Parthenocissus vitacea,"Thicket Creeper, Virginia Creeper, Woodbine",Conserved
5539,Vascular Plant,Vitis californica,"California Grape, California Wild Grape",Conserved
