# Introduction

This project analyzes National Park Service data on species observed across different park locations and on their conservation status. It is organized into sections covering objectives, data preparation, analysis, data visualization, and conclusions drawn from the findings.

The project seeks to answer the following questions:
- What is the distribution of conservation status among species?
- Are certain species types more prone to endangerment?
- Are the differences between species and their conservation status statistically significant?
- Which animals are most common, and how are they distributed across parks?

**Data sources:**

The datasets `observations.csv` and `species_info.csv` were provided by [Codecademy.com](https://www.codecademy.com).

Note: The data used in this project is inspired by real-world data but is primarily fictional.

# Project objective
## Project goal
In this project, the analysis will be conducted from the viewpoint of a biodiversity analyst for the National Park Service. The primary goal of the National Park Service is to protect at-risk species and sustain biodiversity within their parks. As an analyst, the main focus will be to understand the characteristics of the species, their conservation status, and how they relate to the national parks. Key questions include:

- What is the distribution of conservation status among species?
- Are certain types of species more prone to being endangered?
- Are there significant differences between species and their conservation status?
- Which animals are most common, and how are they distributed across the parks?
## Data
This project uses two datasets that came with the package. The first `csv` file has information about each species, and the second has observations of species by park location. These data will be used to address the project's questions.

# Data preparation
First, import the libraries that will be used for the analysis in this project.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display  # Renders DataFrames as formatted tables for better readability

To examine the conservation status of species and their observations in national parks, the datasets are first loaded into DataFrames. Once in DataFrame format, the data can be explored and visualized using Python.

In the following steps, `species_info.csv` and `observations.csv` are read into DataFrames named `species` and `observations`, respectively. The first five rows of each DataFrame are then previewed with `.head()` to get an initial look at the data.

In [2]:
species = pd.read_csv('species_info.csv')
display(species.head(5))

| | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | NaN |
| 1 | Mammal | Bos bison | American Bison, Bison | NaN |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | NaN |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | NaN |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | NaN |


The `species_info.csv` file provides details about various species found in the National Parks. The dataset includes the following columns:

- category: The taxonomic category for each species
- scientific_name: The scientific name of each species
- common_names: The common names for each species
- conservation_status: The conservation status of each species

In [3]:
observations = pd.read_csv('observations.csv')
display(observations.head(5))

| | scientific_name | park_name | observations |
|---|---|---|---|
| 0 | Vicia benghalensis | Great Smoky Mountains National Park | 68 |
| 1 | Neovison vison | Great Smoky Mountains National Park | 77 |
| 2 | Prunus subcordata | Yosemite National Park | 138 |
| 3 | Abutilon theophrasti | Bryce National Park | 84 |
| 4 | Githopsis specularioides | Great Smoky Mountains National Park | 85 |


The `observations.csv` file provides data on species sightings recorded in various national parks over the past week. The columns in this dataset are:

- scientific_name: The scientific name of each species
- park_name: The name of the national park where the sightings occurred
- observations: The count of observations made in the past 7 days
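Because the rest of the analysis relies on these column names, it can help to confirm they are present right after loading. The snippet below is a minimal sketch of such a check, not part of the original notebook: `load_and_check` is a hypothetical helper, and the in-memory CSV simply reuses the five preview rows shown above so the example runs without the real file.

```python
import io

import pandas as pd


def load_and_check(source, expected_columns):
    """Read a CSV into a DataFrame and verify that the expected columns exist.

    `load_and_check` is a hypothetical helper introduced for illustration.
    """
    df = pd.read_csv(source)
    missing = set(expected_columns) - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return df


# In-memory stand-in for observations.csv, built from the preview rows above.
sample_csv = io.StringIO(
    "scientific_name,park_name,observations\n"
    "Vicia benghalensis,Great Smoky Mountains National Park,68\n"
    "Neovison vison,Great Smoky Mountains National Park,77\n"
    "Prunus subcordata,Yosemite National Park,138\n"
    "Abutilon theophrasti,Bryce National Park,84\n"
    "Githopsis specularioides,Great Smoky Mountains National Park,85\n"
)

sample = load_and_check(
    sample_csv, ["scientific_name", "park_name", "observations"]
)
print(sample.shape)
```

With the real data, the same helper could be called with the file path instead of the `StringIO` object.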

# Analytics
This section explores the data and uses the resulting insights to guide further analysis.

## Data Exploration

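To begin, we examine both datasets with the `describe` method. Called with `include='all'`, it reports the non-null count for every column, the number of unique values and the most frequent value (with its frequency) for text columns, and summary statistics for numeric columns.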
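The `describe` output does not list dimensions or data types directly; those can be read from the `shape` and `dtypes` attributes. The following is a small sketch under that assumption. `summarize_structure` is a hypothetical helper, and the toy frame reuses three preview rows from `observations` so the example runs on its own.

```python
import pandas as pd


def summarize_structure(df):
    """Return row count, column count, and per-column dtypes as plain Python objects.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    n_rows, n_cols = df.shape
    return {
        "rows": n_rows,
        "columns": n_cols,
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
    }


# Toy stand-in for the observations DataFrame (three of the preview rows).
toy_observations = pd.DataFrame(
    {
        "scientific_name": [
            "Vicia benghalensis",
            "Neovison vison",
            "Prunus subcordata",
        ],
        "park_name": [
            "Great Smoky Mountains National Park",
            "Great Smoky Mountains National Park",
            "Yosemite National Park",
        ],
        "observations": [68, 77, 138],
    }
)

structure = summarize_structure(toy_observations)
print(structure)
```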

For the `observations` dataset:
- It contains a total of 23,296 rows.
- There are 5,541 unique scientific names.
- It covers only 4 parks. "Myotis lucifugus" is the most common scientific name, with 12 rows. "Great Smoky Mountains National Park" is reported as the top park, but its 5,824 rows are exactly one quarter of the total, so all four parks have the same number of rows.

For the `species` dataset:
- It has 5,824 rows spanning 7 categories and 5,541 unique scientific names, so some scientific names appear more than once.
- There are 4 distinct conservation status types.
- Only 191 rows have a conservation status; the remaining 5,633 rows are missing values.

In [6]:
display(observations.describe(include='all'))
display(species.describe(include='all'))

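| | scientific_name | park_name | observations |
|---|---|---|---|
| count | 23296 | 23296 | 23296.0 |
| unique | 5541 | 4 | NaN |
| top | Myotis lucifugus | Great Smoky Mountains National Park | NaN |
| freq | 12 | 5824 | NaN |
| mean | NaN | NaN | 142.287904 |
| std | NaN | NaN | 69.890532 |
| min | NaN | NaN | 9.0 |
| 25% | NaN | NaN | 86.0 |
| 50% | NaN | NaN | 124.0 |
| 75% | NaN | NaN | 195.0 |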


| | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| count | 5824 | 5824 | 5824 | 191 |
| unique | 7 | 5541 | 5504 | 4 |
| top | Vascular Plant | Castor canadensis | Brachythecium Moss | Species of Concern |
| freq | 4470 | 3 | 7 | 161 |
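The `top` and `freq` rows only describe the single most frequent value, so individual findings such as rows per park, missing conservation statuses, or repeated scientific names are easier to confirm with targeted calls. The sketch below shows one possible way to do this. The helper names and the toy data (`Species A`, `Park X`, and so on) are made up for illustration and do not come from the datasets.

```python
import pandas as pd


def rows_per_park(observations):
    """Count rows for each park."""
    return observations["park_name"].value_counts().to_dict()


def missing_status_count(species):
    """Count rows whose conservation_status is missing."""
    return int(species["conservation_status"].isna().sum())


def repeated_scientific_names(species):
    """Return scientific names that appear on more than one row, with their counts."""
    counts = species["scientific_name"].value_counts()
    return counts[counts > 1].to_dict()


# Toy data with placeholder names; values are illustrative only.
toy_observations = pd.DataFrame(
    {
        "scientific_name": ["Species A", "Species B", "Species A", "Species B"],
        "park_name": ["Park X", "Park X", "Park Y", "Park Y"],
        "observations": [10, 20, 30, 40],
    }
)
toy_species = pd.DataFrame(
    {
        "category": ["Mammal", "Mammal", "Bird"],
        "scientific_name": ["Species A", "Species A", "Species B"],
        "common_names": ["Name 1", "Name 2", "Name 3"],
        "conservation_status": [None, "Species of Concern", None],
    }
)

print(rows_per_park(toy_observations))
print(missing_status_count(toy_species))
print(repeated_scientific_names(toy_species))
```

Applied to the real `observations` and `species` DataFrames, these calls would be expected to agree with the counts read off the `describe` tables.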
