# Biodiversity

This is the third project in the Codecademy Data Scientist: Machine Learning Career Path.

Codecademy.com has provided `observations.csv` and `species_info.csv`. The data is fictional but inspired by real data.

The aim of this project is to study biodiversity data provided by the National Parks Service, with a specific focus on the species recorded in various national parks. This project will involve scoping, analyzing, preparing, and plotting data, with the goal of interpreting and explaining the results of the analysis.

# Scoping

Project scoping is a helpful first step. It clarifies the structure of the project and requires you to think through the entire project before you begin.
Following the [Data Science Project Scoping Guide](https://www.datasciencepublicpolicy.org/our-work/tools-guides/data-science-project-scoping-guide/), it is good to start by setting the high-level goals of the project, then determine which actions could be taken or improved, identify the data we need and where to gather it, and finally describe the analysis to follow and the techniques that could be applied.

#### Goals

Assuming the National Park Service aims to preserve at-risk species and maintain biodiversity within their parks, the primary objectives as an analyst will be to understand the characteristics of these species, their conservation status, and their relationship to the national parks. Some of the key questions to explore include:

- What is the distribution of conservation status for species?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which animal is most prevalent, and what is its distribution among parks?

#### Actions

Because this project is for a fictional customer, there is no need to scope actions.

#### Data

Codecademy has provided two `csv` files filled with fictional data inspired by real information.
The first file, `species_info.csv`, has information about each species; the second, `observations.csv`, has observations of species with park locations. This data will be used to address the goals of the project.

#### Analysis

This section will utilize descriptive statistics and data visualization techniques to gain a deeper understanding of the data. Statistical inference will also be applied to determine whether the observed values are statistically significant. Key metrics to be calculated include: 

1. Distributions
1. Counts
1. Relationship between species
1. Conservation status of species
1. Observations of species in parks

# Load data

In [38]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

First we import the necessary Python libraries.
To work with the CSV data set we use Pandas. To evaluate and visualize our findings we may use Matplotlib and Seaborn.

## Species info

In [8]:
species = pd.read_csv('species_info.csv', encoding='utf-8')
species.head()

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,


The `species_info.csv` file contains information about the different species in the national parks. It has 4 columns:
- category - The class for each species
- scientific_name - The scientific name of each species
- common_names - The common names of each species
- conservation_status - The species conservation status
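The column list above could be checked programmatically. The following is a minimal sketch, assuming a small in-memory sample built from the five rows printed by `species.head()`; the names `sample_csv`, `sample_species`, `column_names`, and `missing_per_column` are hypothetical and not part of the original notebook. On the real data the same calls would be made on `species` directly.

```python
import io
import pandas as pd

# Hypothetical sample: the five rows printed by species.head() above,
# with the truncated common_names value kept exactly as displayed.
sample_csv = io.StringIO(
    "category,scientific_name,common_names,conservation_status\n"
    "Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,\n"
    'Mammal,Bos bison,"American Bison, Bison",\n'
    'Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",\n'
    'Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",\n'
    "Mammal,Cervus elaphus,Wapiti Or Elk,\n"
)
sample_species = pd.read_csv(sample_csv)

# List the column names and count missing values in each column.
column_names = sample_species.columns.tolist()
missing_per_column = sample_species.isna().sum()

print(column_names)
print(missing_per_column)
```

In this sample every `conservation_status` value is missing, which matches the empty last field in the rows shown by `species.head()`. Whether that pattern holds across the full file would need to be checked on `species` itself.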

In [11]:
species.shape

(5824, 4)

`species` has 5,824 rows and 4 columns.

## Observations

In [15]:
observations = pd.read_csv('observations.csv', encoding='utf-8')
observations.head()

Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85


The `observations.csv` file contains information about observations of different species in the national parks in the past 7 days. It has 3 columns:

- scientific_name - The scientific name of each species
- park_name - The name of the national park
- observations - The number of observations in the past 7 days
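To illustrate how these three columns fit together, the sketch below totals observations per park. It assumes a small in-memory sample built from the five rows printed by `observations.head()`; the names `sample_csv`, `sample_observations`, and `park_totals` are hypothetical, and the totals cover only this sample, not the full data set.

```python
import io
import pandas as pd

# Hypothetical sample: the five rows printed by observations.head() above.
sample_csv = io.StringIO(
    "scientific_name,park_name,observations\n"
    "Vicia benghalensis,Great Smoky Mountains National Park,68\n"
    "Neovison vison,Great Smoky Mountains National Park,77\n"
    "Prunus subcordata,Yosemite National Park,138\n"
    "Abutilon theophrasti,Bryce National Park,84\n"
    "Githopsis specularioides,Great Smoky Mountains National Park,85\n"
)
sample_observations = pd.read_csv(sample_csv)

# Sum the observation counts for each park, largest total first.
park_totals = (
    sample_observations
    .groupby("park_name")["observations"]
    .sum()
    .sort_values(ascending=False)
)

print(park_totals)
```

The same `groupby` call could be applied to the full `observations` DataFrame once it is loaded.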

In [18]:
observations.shape

(23296, 3)

`observations` has 23,296 rows and 3 columns.