# Introduction

The goal of this project is to analyze biodiversity data from the U.S. National Park Service, focusing on the species observed across different national park locations.

This project involves defining the scope, then cleaning, analyzing, visualizing, and interpreting the data to uncover insights about species conservation and distribution.

Through this analysis, the project aims to answer key questions such as:

- What is the distribution of conservation statuses among species?
- Are certain types of species (e.g., mammals, birds, plants) more likely to be endangered?
- Are the differences between species categories and their conservation statuses statistically significant?
- Which species are most prevalent, and how are they distributed across the parks?

**Data Sources:**

The datasets `Observations.csv` and `species_info.csv` were provided by [Codecademy.com](https://www.codecademy.com).

> *Note: The data used in this project is inspired by real biodiversity datasets, but is primarily fictional and intended for educational purposes.*


## Scoping

It's beneficial to create a project scope whenever a new project is started. The four sections below guide the project's process and progress. The first section covers the project goals: it defines the high-level objectives and sets the intentions for this project. The second covers the data; in this project the data is already provided, but it still needs to be checked to confirm that the project goals can be met with it. The third covers the analysis, including the methods and questions that align with the project goals. Lastly, the evaluation section draws conclusions and findings from the analysis.

### Project Goals

In this project, the perspective is that of a biodiversity analyst for the National Park Service. The National Park Service wants to ensure the survival of at-risk species and to maintain the level of biodiversity within its parks. Therefore, the main objectives as an analyst are to understand the characteristics of the species and their conservation status, and the relationship between those species and the national parks. Some questions that are posed:

- What is the distribution of conservation status for species?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which animal is most prevalent and what is its distribution amongst parks?

### Data

This project has two data sets that came with the package. The first `csv` file has information about each species, and the other has observations of species with park locations. This data will be used to address the goals of the project.
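As a rough illustration of how the two files could relate to each other, the sketch below joins a toy species table to a toy observations table and totals the observations per park. The column names `park_name` and `observations`, the park labels, and all counts are assumptions made up for this example; only `scientific_name` and `category` appear in the species data described later.

```python
import pandas as pd

# Toy species table (column names taken from the species data description).
species_toy = pd.DataFrame({
    "scientific_name": ["Bos bison", "Cervus elaphus"],
    "category": ["Mammal", "Mammal"],
})

# Toy observations table. `park_name`, `observations`, and every value
# here are hypothetical and exist only to make the example runnable.
observations_toy = pd.DataFrame({
    "scientific_name": ["Bos bison", "Bos bison", "Cervus elaphus"],
    "park_name": ["Park A", "Park B", "Park A"],
    "observations": [10, 20, 5],
})

# Attach species information to each observation row via the shared key.
combined = observations_toy.merge(species_toy, on="scientific_name", how="left")

# Total observations per park.
per_park = combined.groupby("park_name")["observations"].sum()
print(per_park)
```

A left join keeps every observation row even if a species were missing from the species table, which makes unmatched names easy to spot afterwards.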

### Analysis 

In this section, descriptive statistics and data visualization techniques will be employed to understand the data better. Statistical inference will also be used to test whether the observed differences are statistically significant. Some of the key metrics that will be computed include:

1. Distributions
2. Counts
3. Relationships between species
4. Conservation status of species
5. Observations of species in parks
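To make the plan concrete, here is a minimal sketch of how counts, a contingency table, and a significance test could be computed with pandas. The data is invented for illustration, and the `is_protected` column is a hypothetical name, not a column from the project files. The test shown is a plain Pearson chi-squared test of independence computed by hand; it is one possible choice, not necessarily the method used later in the project.

```python
import math

import numpy as np
import pandas as pd

# Toy data, invented purely for illustration.
# `is_protected` is a hypothetical column name.
toy = pd.DataFrame({
    "category": ["Mammal"] * 10 + ["Bird"] * 10,
    "is_protected": [True] * 4 + [False] * 6 + [True] * 1 + [False] * 9,
})

# Counts: how many rows fall into each category.
category_counts = toy["category"].value_counts()

# Contingency table of category versus protection status.
observed = pd.crosstab(toy["category"], toy["is_protected"])

# Expected counts if category and protection status were independent.
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
total = observed.to_numpy().sum()
expected = pd.DataFrame(
    np.outer(row_totals, col_totals) / total,
    index=observed.index,
    columns=observed.columns,
)

# Pearson chi-squared statistic (no continuity correction).
chi2 = ((observed - expected) ** 2 / expected).to_numpy().sum()

# A 2x2 table has 1 degree of freedom, where the chi-squared
# survival function reduces to erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")  # chi2 = 2.40, p is roughly 0.12
```

If SciPy is available, `scipy.stats.chi2_contingency` handles tables of any size; note that it applies Yates' continuity correction to 2x2 tables by default, so its result would differ from the uncorrected value above.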

### Evaluation 

Lastly, it's a good idea to revisit the goals and check whether the output of the analysis answers the questions set out in the goals section. This section will also reflect on what has been learned through the process and on whether any of the questions could not be answered. It could also cover limitations, or parts of the analysis that could have been done using different methodologies.


## Import Python Modules

First, import the primary modules that will be used in this project:

In [1]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

## Loading the Data

To analyze the conservation status of species and their observations in national parks, load the datasets into `DataFrames`. Once loaded as `DataFrames`, the data can be explored and visualized with Python.

In the next steps, `Observations.csv` and `species_info.csv` are read in as `DataFrames` and glimpsed with `.head()` to check their contents.

#### species

The `species_info.csv` file contains information on the different species in the national parks. The columns in the data set include:
- **category** - The category of taxonomy for each species
- **scientific_name** - The scientific name of each species
- **common_names** - The common names of each species
- **conservation_status** - The conservation status of each species

In [2]:
species = pd.read_csv('species_info.csv', encoding='utf-8')
species.head()

|   | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | NaN |
| 1 | Mammal | Bos bison | American Bison, Bison | NaN |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | NaN |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | NaN |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | NaN |
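The rows above all have an empty `conservation_status`, so a natural follow-up is to count missing values per column. The sketch below does this on three of the rows shown above, re-typed as inline CSV text so that it runs without the data file; on the real data, the same calls would be made on `species`.

```python
import io

import pandas as pd

# Three of the rows shown above, re-typed as CSV text so the
# example is self-contained. The real file is not needed here.
csv_text = """category,scientific_name,common_names,conservation_status
Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
Mammal,Bos bison,"American Bison, Bison",
Mammal,Cervus elaphus,Wapiti Or Elk,
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Number of rows and columns.
print(sample.shape)

# Count of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Empty CSV fields are read by pandas as `NaN`, which is why `isna()` flags every `conservation_status` entry in this small sample.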
