# Biodiversity in National Parks
Jonathan Bitner | Started 3/11/2024\
Codecademy portfolio project\
My goal is to showcase my thought process when looking at the data and explain each decision I make.\
For questions or comments, email me at jsbitner94@gmail.com

## I. Outline:
* Review data in `observations.csv` and `species_info.csv`
* Determine project goals
* Consider analytical steps required
* Explore and explain data
* Format for presentation

## II. Review the data

### Import files

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, chi2_contingency

observations_csv = pd.read_csv(r'C:\Users\jsbit\OneDrive\Documents\Coding 2023\Git\national-parks-biodiversity\observations.csv', encoding_errors='replace')
species_info_csv = pd.read_csv(r'C:\Users\jsbit\OneDrive\Documents\Coding 2023\Git\national-parks-biodiversity\species_info.csv', encoding_errors='replace')

### Descriptive statistics
#### Descriptives for `observations_csv`

In [29]:
print('First five rows:\n', observations_csv.head())
print('\nColumn names:\n', observations_csv.columns, '\n\nInfo:')
print(observations_csv.info())
print('\nDescription:\n', observations_csv.describe(include='all'))

First five rows:
             scientific_name                            park_name  observations
0        Vicia benghalensis  Great Smoky Mountains National Park            68
1            Neovison vison  Great Smoky Mountains National Park            77
2         Prunus subcordata               Yosemite National Park           138
3      Abutilon theophrasti                  Bryce National Park            84
4  Githopsis specularioides  Great Smoky Mountains National Park            85

Column names:
 Index(['scientific_name', 'park_name', 'observations'], dtype='object') 

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB
None

Description:
        

#### Initial observations
* Curious if all `park_name` values end in `National Park`
* Column names are appropriately named and formatted (lowercase, underscore_for_space, no whitespace)
* No missing data on initial inspection
* Data types are appropriate
* Surprised to see so many rows (23,296) for only four national parks
    * Potential to change `park_name` to Categorical
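
Several of the points above can be checked directly in pandas. The following is a minimal sketch, not part of the original analysis: `sample_obs` is a hypothetical stand-in built from the five rows printed above, so the results only illustrate the checks and say nothing about the full file.

```python
import pandas as pd

# Hypothetical stand-in for observations_csv, built from the five rows shown above.
sample_obs = pd.DataFrame({
    "scientific_name": [
        "Vicia benghalensis", "Neovison vison", "Prunus subcordata",
        "Abutilon theophrasti", "Githopsis specularioides",
    ],
    "park_name": [
        "Great Smoky Mountains National Park",
        "Great Smoky Mountains National Park",
        "Yosemite National Park",
        "Bryce National Park",
        "Great Smoky Mountains National Park",
    ],
    "observations": [68, 77, 138, 84, 85],
})

# Does every park name end in "National Park"?
all_end_in_np = sample_obs["park_name"].str.endswith("National Park").all()

# Number of missing values in each column.
missing_per_column = sample_obs.isna().sum()

# Convert the low-cardinality text column to a categorical dtype.
sample_obs["park_name"] = sample_obs["park_name"].astype("category")
park_categories = list(sample_obs["park_name"].cat.categories)

print(all_end_in_np)
print(missing_per_column)
print(park_categories)
```

Running the same three checks against `observations_csv` would answer the `park_name` question for the full dataset. The categorical conversion is optional and mainly reduces memory use when a column has few distinct values.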

#### Descriptives for `species_info_csv`

In [30]:
print('First five rows:\n', species_info_csv.head())
print('\nColumn names:\n', species_info_csv.columns, '\n\nInfo:')
print(species_info_csv.info())
print('\nDescription:\n', species_info_csv.describe(include='all'))

First five rows:
   category                scientific_name  \
0   Mammal  Clethrionomys gapperi gapperi   
1   Mammal                      Bos bison   
2   Mammal                     Bos taurus   
3   Mammal                     Ovis aries   
4   Mammal                 Cervus elaphus   

                                        common_names conservation_status  
0                           Gapper's Red-Backed Vole                 NaN  
1                              American Bison, Bison                 NaN  
2  Aurochs, Aurochs, Domestic Cattle (Feral), Dom...                 NaN  
3  Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)                 NaN  
4                                      Wapiti Or Elk                 NaN  

Column names:
 Index(['category', 'scientific_name', 'common_names', 'conservation_status'], dtype='object') 

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Co

#### Initial observations
* I had the impression I would be working only with trees; the `category` column suggests this dataset covers all kinds of life in the national parks.
* Could encounter difficulties with the length of `common_names`, since a single row can hold multiple names
* Columns are appropriately named and formatted
* Data types are appropriate
    * Potential to change `category` to Categorical, since there are only seven unique values
    * Same for `conservation_status`; four unique values
* Only `conservation_status` is missing values
    * Are values only included for endangered species?
    * It seems worth exploring endangered species further
* I am surprised to see that the `count` and `unique` values are not the same for `scientific_name` and `common_names`
    * Could a species be listed as `endangered` in one region, but not another, resulting in separate rows? Unlikely, since this dataset is not connected to `observations_csv` and has no park or region-related information.
    * Could a species appear in several rows because it has more than one common name? Unlikely, since a single `common_names` entry can already hold multiple names
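
One way to investigate the `count` versus `unique` mismatch and the missing `conservation_status` values is sketched below. It is an illustration under stated assumptions: `sample_species` is a hypothetical stand-in for `species_info_csv`, and the duplicate *Cervus elaphus* row, the species `Examplus hypotheticus`, and its `Endangered` status are invented for the example rather than taken from the data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for species_info_csv. The first three rows echo the
# printed head(); the last two rows are invented to illustrate the checks.
sample_species = pd.DataFrame({
    "category": ["Mammal"] * 5,
    "scientific_name": [
        "Bos bison", "Ovis aries", "Cervus elaphus",
        "Cervus elaphus",          # invented duplicate row
        "Examplus hypotheticus",   # invented species
    ],
    "common_names": [
        "American Bison, Bison",
        "Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
        "Wapiti Or Elk",
        "Elk (hypothetical duplicate)",
        "Hypothetical Mammal",
    ],
    "conservation_status": [np.nan, np.nan, np.nan, np.nan, "Endangered"],
})

# Every row whose scientific_name occurs more than once (keep=False marks all copies).
dup_mask = sample_species["scientific_name"].duplicated(keep=False)
repeated_rows = sample_species[dup_mask]

# The same count/unique comparison that describe(include='all') reports.
name_count = sample_species["scientific_name"].count()
name_unique = sample_species["scientific_name"].nunique()

# Distribution of conservation_status with missing values kept visible.
status_counts = sample_species["conservation_status"].value_counts(dropna=False)

print(repeated_rows)
print(name_count, name_unique)
print(status_counts)
```

Applied to `species_info_csv`, inspecting `repeated_rows` side by side would show which columns differ between the repeated entries, which is a more direct test of the two hypotheses above than reasoning from `describe()` alone.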