# Portfolio Project: Biodiversity
## Initial loading of data

In [13]:
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels
import matplotlib.pyplot as plt
import math

# read data
species_info = pd.read_csv('species_info.csv')
observations = pd.read_csv('observations.csv')

# preview the first rows, summary statistics, and shape of each table
print(species_info.head())
print(species_info.describe())
print(f'rows and columns: {species_info.shape}')
print('\n')
print(observations.head())
print(observations.describe())
print(f'rows and columns: {observations.shape}')

  category                scientific_name  \
0   Mammal  Clethrionomys gapperi gapperi   
1   Mammal                      Bos bison   
2   Mammal                     Bos taurus   
3   Mammal                     Ovis aries   
4   Mammal                 Cervus elaphus   

                                        common_names conservation_status  
0                           Gapper's Red-Backed Vole                 NaN  
1                              American Bison, Bison                 NaN  
2  Aurochs, Aurochs, Domestic Cattle (Feral), Dom...                 NaN  
3  Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)                 NaN  
4                                      Wapiti Or Elk                 NaN  
              category scientific_name        common_names conservation_status
count             5824            5824                5824                 191
unique               7            5541                5504                   4
top     Vascular Plant   Columba livia  Br

### Species:
Contains information on the observed species, including:
- category: the taxonomic group the species belongs to (animals and plants)
- scientific_name: the species' scientific name
- common_names: the species' common names
- conservation_status: the species' conservation status, if one is recorded

5824 entries, 4 columns

### Observations:
Contains sighting counts for each species by park:
- scientific_name: the species' scientific name
- park_name: the park where the species was found
- observations: the number of times the species was found

23296 entries, 3 columns

It seems that conservation_status in species_info is NaN (empty) unless stated otherwise, which we take to mean that these species are not of conservation concern.
Only 191 of the 5824 rows have a recorded conservation status, spread across 4 distinct values, meaning those species may be endangered or at risk.
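Since NaN is being read as "no conservation concern", one option is to make that reading explicit before grouping, because `groupby` drops NaN keys by default. The sketch below uses a small made-up table in place of `species_info`; the species names and the `'No Intervention'` label are assumptions chosen for illustration, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for species_info; names and statuses are made up for illustration.
toy_species = pd.DataFrame({
    'scientific_name': ['species_a', 'species_b', 'species_c'],
    'conservation_status': [np.nan, 'Endangered', np.nan],
})

# 'No Intervention' is a hypothetical label, not a value found in the data.
toy_species['conservation_status'] = (
    toy_species['conservation_status'].fillna('No Intervention')
)

# Rows that were NaN now form their own group instead of being dropped.
status_counts = toy_species.groupby('conservation_status').size()
print(status_counts)
```

With the label in place, the unlabelled majority shows up in the same table as the recorded statuses rather than only in a separate `isna().sum()` count.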

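Meanwhile, the observations table contains the scientific name, park name, and number of times that particular species was found. Both tables have scientific_name in common, so we can use that column to join them for analysis. Note that species_info has 5824 rows but only 5541 unique scientific names, so some names appear more than once.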
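A join on `scientific_name` could look like the sketch below. The `describe()` output shows 5824 rows but only 5541 unique scientific names in `species_info`, so the example drops repeated names first; otherwise each repeat would multiply the matching observation rows. The toy tables, species names, and park names are made up for illustration, and keeping the first duplicate is an assumption rather than the notebook's chosen method.

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are hypothetical.
toy_species = pd.DataFrame({
    'category': ['Mammal', 'Mammal', 'Bird'],
    'scientific_name': ['species_a', 'species_a', 'species_b'],
    'common_names': ['name one', 'name one, alt', 'name two'],
})
toy_observations = pd.DataFrame({
    'scientific_name': ['species_a', 'species_a', 'species_b'],
    'park_name': ['Park X', 'Park Y', 'Park X'],
    'observations': [10, 20, 5],
})

# Keep one row per scientific_name so the join cannot duplicate observations.
unique_species = toy_species.drop_duplicates(subset='scientific_name')

# validate raises MergeError if the species side still has repeated keys.
merged = toy_observations.merge(
    unique_species,
    on='scientific_name',
    how='left',
    validate='many_to_one',
)
print(merged)
```

A left join keeps every observation row even if a species is missing from the species table, and comparing `len(merged)` with `len(toy_observations)` is a quick check that no rows were added.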

## Initial exploration of data

### Species Info data:

In [14]:
print(f'number of unique species: {species_info.scientific_name.nunique()}')
print(f'categories represented: {species_info.category.unique()}')
print(species_info.groupby('category').size())
print(f'unique conservation statuses: {species_info.conservation_status.unique()}')
print(f'na values: {species_info.conservation_status.isna().sum()}')  # assuming NaN means no conservation concern
print(species_info.groupby('conservation_status').size())

number of unique species: 5541
categories represented: ['Mammal' 'Bird' 'Reptile' 'Amphibian' 'Fish' 'Vascular Plant'
 'Nonvascular Plant']
category
Amphibian              80
Bird                  521
Fish                  127
Mammal                214
Nonvascular Plant     333
Reptile                79
Vascular Plant       4470
dtype: int64
unique conservation statuses: [nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery']
na values: 5633
conservation_status
Endangered             16
In Recovery             4
Species of Concern    161
Threatened             10
dtype: int64


There are 7 categories of plants and animals to look at, and 4 named conservation statuses. The remaining 5633 rows are NaN, so treating NaN as a "no concern" group gives 5 status groups in total.
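One way to relate the categories to the conservation statuses is to compute, per category, the share of rows that have any recorded status. This is a minimal sketch on made-up rows; `is_protected` is a hypothetical helper column, and counting every non-NaN status as "protected" is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for species_info; the rows are made up for illustration.
toy_species = pd.DataFrame({
    'category': ['Mammal', 'Mammal', 'Bird', 'Bird', 'Fish'],
    'conservation_status': [
        'Endangered', np.nan, 'Species of Concern', np.nan, np.nan,
    ],
})

# Hypothetical helper column: True when any status is recorded.
toy_species['is_protected'] = toy_species['conservation_status'].notna()

# Mean of a boolean column is the fraction of True values in each group.
protected_share = toy_species.groupby('category')['is_protected'].mean()
print(protected_share)
```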

### Observation list:

In [19]:
print(f'different parks: {observations.park_name.unique()}')
print(f'# of observations: {observations.observations.sum()}')

different parks: ['Great Smoky Mountains National Park' 'Yosemite National Park'
 'Bryce National Park' 'Yellowstone National Park']
# of observations: 3314739


We have 3,314,739 observations in total across 4 national parks, which is a lot of data to work with.
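The overall total can be broken down by park with a grouped sum. The sketch below uses a small made-up table in place of `observations`; the species names, park names, and counts are hypothetical.

```python
import pandas as pd

# Toy stand-in for observations; all values are made up for illustration.
toy_observations = pd.DataFrame({
    'scientific_name': ['species_a', 'species_a', 'species_b', 'species_b'],
    'park_name': ['Park X', 'Park Y', 'Park X', 'Park Y'],
    'observations': [10, 20, 5, 15],
})

# Total sightings per park, largest first.
park_totals = (
    toy_observations.groupby('park_name')['observations']
    .sum()
    .sort_values(ascending=False)
)
print(park_totals)
```

The per-park totals should add up to the overall sum, which is a cheap consistency check on the grouping.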