# Assessing Biodiversity in Natural Areas: Insights into Conservation and Species Distribution
### A Data-Driven Exploration Using U.S. National Park Observations

___

### 📚 Table of Contents
- 🆎 [Abstract](#--abstract)
- 🗝️ [Key Words](#---key-words)
- 🌳 [1 - Introduction](#--1---introduction)
- ℹ️ [1.1 - Dataset Overview](#--11---dataset-overview)
- 🧪 [2 - Methodology](#--2---methodology)
- 🛜 [2.1 - Data Sources](#--21---data-sources)
- 💻 [2.2 - Analytical Approach](#--22---analytical-approach)
- 🛠️ [2.3 - Tools and Environment](#--23---tools-and-environment)
- [3 - Understanding the Biodiversity Data](#3---understanding-the-biodiversity-data)
- [3.1 - Loading necessary Python libraries](#31---loading-necessary-python-libraries)
- [3.2 - Loading data files](#32---loading-data-files)
- [3.3 - Initial data exploration](#33---initial-data-exploration)
- [7 - References](#7---references)


___



### 🆎Abstract

Biodiversity in natural areas is a real treasure that must be preserved for the well-being of current and future generations. One way to contribute to this global goal is by monitoring species under various conservation statuses within protected ecosystems.

In this project, I apply data science, data analysis, and data visualization techniques to explore biodiversity patterns using two datasets, `species_info.csv` and `observations.csv`, provided by Codecademy. Python libraries such as Pandas, Matplotlib, and Seaborn are used to process, analyze, and visualize the data. The goal is to uncover insights into species richness, conservation status distribution, and ecological representation across U.S. national parks.


___

### 🗝️ Key Words

Biodiversity, Natural Areas, Conservation, Data Analysis, Data Visualization, Python, Jupyter Notebooks

___

### 🌳1 - Introduction

According to the United Nations (UN), biodiversity, or biological diversity, is the "variety of life on Earth, in all its forms, from genes and bacteria to entire ecosystems such as forests or coral reefs" **[1]**. Biodiversity plays a vital role in sustaining life on Earth: it supports food production, clean water, medicine, climate stability, and economic growth through ecosystem services such as pollination, water purification, and climate regulation. Over half of the global GDP depends on nature, and forests and oceans could act as carbon sinks, absorbing more than half of all carbon emissions **[1]**.

However, this delicate and invaluable richness is increasingly threatened by human activities. Land use — particularly agriculture — is the leading driver of biodiversity loss, while climate change continues to accelerate species extinction and ecosystem collapse **[1]**.


This project aims to develop practical skills in data analysis and interpretation using Python libraries such as Pandas, Matplotlib, and Seaborn. The datasets, provided by Codecademy, focus on biodiversity — a critical component of ecological resilience and sustainability. Two CSV files are used: `species_info.csv`, which contains metadata about various species, and `observations.csv`, which records species sightings across national parks. Through exploratory analysis and visualization, the project seeks to uncover patterns in species distribution and conservation status.

Additionally, the analysis aims to explore ecological representation across protected areas, identifying trends in species richness and conservation priorities.


#### ℹ️1.1 - Dataset Overview

**`species_info.csv` contains:**
- `category`: Broad classification of the species (e.g., Mammal, Bird, Reptile);
- `scientific_name`: Latin name of the species;
- `common_names`: Common names used to refer to the species;
- `conservation_status`: Conservation status of the species, such as Endangered, Threatened, Species of Concern, or In Recovery (empty when no status is recorded).

**`observations.csv` contains:**
- `scientific_name`: Latin name of the species observed;
- `park_name`: Name of the national park where the observation was recorded;
- `observations`: Number of times the species was observed in that park.
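
The column lists above can double as a quick schema check before any analysis starts. The sketch below is only an illustration: the helper `missing_columns`, the `EXPECTED_COLUMNS` mapping, and the toy row are hypothetical names and made-up data, not part of the Codecademy files.

```python
import pandas as pd

# Expected columns, copied from the dataset overview.
EXPECTED_COLUMNS = {
    "species_info": ["category", "scientific_name", "common_names", "conservation_status"],
    "observations": ["scientific_name", "park_name", "observations"],
}


def missing_columns(df: pd.DataFrame, expected: list[str]) -> list[str]:
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]


# Toy stand-in for observations.csv (made-up row, not real data).
toy_observations = pd.DataFrame(
    {"scientific_name": ["Genus alpha"], "park_name": ["Park A"], "observations": [10]}
)

# An empty list means every expected column is present.
print(missing_columns(toy_observations, EXPECTED_COLUMNS["observations"]))
# Checking against the wrong schema reports what is missing.
print(missing_columns(toy_observations, EXPECTED_COLUMNS["species_info"]))
```

With the real files, the same helper could be called on the DataFrames returned by `pd.read_csv`.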


___


### 🧪 2 - Methodology

This project employs a structured data analysis workflow to explore biodiversity patterns across U.S. national parks. The methodology is designed to address the core themes outlined in the title and introduction: conservation status and species distribution.
It also serves as practice for the data science, data analysis, and data visualization techniques learned in the "Data Scientist: Machine Learning Specialist" career path course.

#### 🛜2.1 - Data Sources

Two datasets provided by Codecademy form the basis of this analysis:

- `species_info.csv`: Contains metadata on species, including category, scientific and common names, and conservation status.

- `observations.csv`: Records the number of times each species was observed in specific national parks.

#### 💻2.2 - Analytical Approach

To assess biodiversity and conservation insights, the following steps are undertaken:

- **Data Cleaning and Integration**: Ensure consistency in species naming and merge datasets where necessary.

- **Exploratory Data Analysis (EDA)**: Use Pandas to summarize species categories, conservation status, and observation counts.

- **Visualization**: Apply Matplotlib and Seaborn to uncover patterns in species richness and conservation distribution across parks.

- **Comparative Analysis**: Identify which parks host the highest number of endangered or threatened species.

- **Ecological Representation**: Evaluate how well different species categories are represented across protected areas.
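
The steps above can be sketched on a few toy rows. This is a minimal illustration under stated assumptions, not the analysis itself: the species names, parks, and counts are made up, and treating "Endangered" and "Threatened" as the at-risk group is one possible reading of the comparative step.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (made-up rows, not real data).
species = pd.DataFrame(
    {
        "category": ["Mammal", "Mammal", "Bird", "Bird"],
        "scientific_name": ["Genus alpha", "Genus alpha", "Genus beta", "Genus gamma"],
        "common_names": ["Alpha", "Alpha, Common Alpha", "Beta", "Gamma"],
        "conservation_status": ["Endangered", "Endangered", None, "Threatened"],
    }
)
observations = pd.DataFrame(
    {
        "scientific_name": ["Genus alpha", "Genus beta", "Genus gamma", "Genus alpha", "Genus beta"],
        "park_name": ["Park A", "Park A", "Park A", "Park B", "Park B"],
        "observations": [5, 40, 12, 7, 9],
    }
)

# Data cleaning: keep one row per scientific name so the merge cannot duplicate observations.
species_unique = species.drop_duplicates(subset="scientific_name")

# Integration: attach species metadata to every observation row.
merged = observations.merge(
    species_unique, on="scientific_name", how="left", validate="many_to_one"
)

# Comparative analysis: number of distinct at-risk species per park.
at_risk = merged[merged["conservation_status"].isin(["Endangered", "Threatened"])]
at_risk_per_park = (
    at_risk.groupby("park_name")["scientific_name"].nunique().sort_values(ascending=False)
)
print(at_risk_per_park)

# Ecological representation: distinct species per category in each park.
representation = merged.groupby(["park_name", "category"])["scientific_name"].nunique()
print(representation)
```

Passing `validate="many_to_one"` makes pandas raise an error if the species table still contains duplicate scientific names, which guards against silently inflated observation counts.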


#### 🛠️2.3 - Tools and Environment

- **Programming Language**: Python

- **Libraries**: Pandas, Matplotlib, Seaborn

- **Platform**: Jupyter Notebook


___
### 3 - Understanding the Biodiversity Data

#### 3.1 - Loading necessary Python libraries

In [23]:
import pandas as pd

#### 3.2 - Loading data files

In [26]:
# Loading datasets

observ = pd.read_csv("observations.csv")
spec_info = pd.read_csv("species_info.csv")

#### 3.3 - Initial data exploration

In [29]:
# Preview the data

print(observ.head(), "\n\n\n")
print(spec_info.head())

# Notes:

answer = "\n\nAccording to the output, the columns shown above match those described in the Dataset Overview section."
print(answer)

            scientific_name                            park_name  observations
0        Vicia benghalensis  Great Smoky Mountains National Park            68
1            Neovison vison  Great Smoky Mountains National Park            77
2         Prunus subcordata               Yosemite National Park           138
3      Abutilon theophrasti                  Bryce National Park            84
4  Githopsis specularioides  Great Smoky Mountains National Park            85 



  category                scientific_name  \
0   Mammal  Clethrionomys gapperi gapperi   
1   Mammal                      Bos bison   
2   Mammal                     Bos taurus   
3   Mammal                     Ovis aries   
4   Mammal                 Cervus elaphus   

                                        common_names conservation_status  
0                           Gapper's Red-Backed Vole                 NaN  
1                              American Bison, Bison                 NaN  
2  Aurochs, Aurochs, Domes

In [31]:
# Checking dimensions and data type

print("Species:")
print(spec_info.info())
print("\nObservations:")
print(observ.info())

answer1 = "\n\nAccording to output, it seems that every column has the right data type. However, all columns for species have 5824 records, except 'conservation_status', which has only 191 non-null values."
print(answer1)

Species:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name      5824 non-null   object
 2   common_names         5824 non-null   object
 3   conservation_status  191 non-null    object
dtypes: object(4)
memory usage: 182.1+ KB
None

Observations:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB
None


According to output, it seems that every column has the right data type. However, all columns for species have 5824 records, except 'cons
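
Only 191 of the 5824 species rows (about 3.3%) carry a conservation status, so the missing values need an explicit decision before any counting. One possible approach, sketched below on made-up rows, is to measure the missing share and then replace `NaN` with a readable label. The assumption that `NaN` means "no status assigned" and the label `No Status Recorded` are mine, not something stated by the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for spec_info (made-up rows, not real data).
toy_species = pd.DataFrame(
    {
        "scientific_name": ["Genus alpha", "Genus beta", "Genus gamma", "Genus delta"],
        "conservation_status": ["Endangered", np.nan, np.nan, "Species of Concern"],
    }
)

# Share of rows without a conservation status.
missing_share = toy_species["conservation_status"].isna().mean()
print(f"Missing conservation_status: {missing_share:.0%}")

# Assumption: NaN means no status was assigned; the label itself is hypothetical.
NO_STATUS_LABEL = "No Status Recorded"
toy_species["conservation_status"] = toy_species["conservation_status"].fillna(NO_STATUS_LABEL)

print(toy_species)
```

Keeping the label in a named constant makes it easy to filter these rows out again when only species with a recorded status are of interest.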

In [33]:
# Unique values per column

print("Species unique categories: ", spec_info['category'].unique())
print("\n\nUnique conservation status: ", spec_info['conservation_status'].unique())

Species unique categories:  ['Mammal' 'Bird' 'Reptile' 'Amphibian' 'Fish' 'Vascular Plant'
 'Nonvascular Plant']


Unique conservation status:  [nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery']
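
Listing the unique values shows which labels exist, but not how often each one occurs. A possible next step, sketched here on made-up rows rather than the real file, is to count each status (including missing ones) and to cross-tabulate category against status for the species that have one.

```python
import numpy as np
import pandas as pd

# Toy stand-in for spec_info (made-up rows, not real data).
toy_species = pd.DataFrame(
    {
        "category": ["Mammal", "Mammal", "Bird", "Bird", "Fish"],
        "conservation_status": ["Endangered", np.nan, "Species of Concern", np.nan, "Threatened"],
    }
)

# Frequency of each status; dropna=False keeps the missing values in the tally.
status_counts = toy_species["conservation_status"].value_counts(dropna=False)
print(status_counts)

# Cross-tabulation of category against status, restricted to rows with a recorded status.
known = toy_species.dropna(subset=["conservation_status"])
category_by_status = pd.crosstab(known["category"], known["conservation_status"])
print(category_by_status)
```

On the real data, the same two calls could be applied to `spec_info` directly.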


___

### 7 - References

**[1]** United Nations (n.d.); Climate Action; *Biodiversity - our strongest natural defense against climate change*. Retrieved October 24, 2025, from https://www.un.org/en/climatechange/science/climate-issues/biodiversity.
