# Biodiversity in National Parks
Jonathan Bitner | Started 3/11/2024\
Codecademy portfolio project\
My goal is to showcase my thought process when looking at the data and explain each decision I make.\
For questions or comments, email me at jsbitner94@gmail.com

### Project description from Codecademy:
For this project, you will interpret data from the National Parks Service about endangered species in different parks.\
You will perform some data analysis on the conservation statuses of these species and investigate if there are any patterns or themes to the types of species that become endangered.\
During this project, you will analyze, clean up, and plot data as well as pose questions and seek to answer them in a meaningful way.\
After you perform your analysis, you will share your findings about the National Park Service.

## I. Outline:
* Review data in `observations.csv` and `species_info.csv`
* Determine project goals
* Explore and explain data; consider analytical steps required
* Format for presentation

## II. Review the data

### Import files

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, chi2_contingency

observations_csv = pd.read_csv(r'C:\Users\jsbit\OneDrive\Documents\Coding 2023\Git\national-parks-biodiversity\observations.csv', encoding_errors='replace')
species_info_csv = pd.read_csv(r'C:\Users\jsbit\OneDrive\Documents\Coding 2023\Git\national-parks-biodiversity\species_info.csv', encoding_errors='replace')

### Descriptive statistics
#### Descriptives for `observations_csv`

In [2]:
print('First five rows:\n', observations_csv.head())
print('\nColumn names:\n', observations_csv.columns, '\n\nInfo:')
print(observations_csv.info())
print('\nDescription:\n', observations_csv.describe(include='all'))

First five rows:
             scientific_name                            park_name  observations
0        Vicia benghalensis  Great Smoky Mountains National Park            68
1            Neovison vison  Great Smoky Mountains National Park            77
2         Prunus subcordata               Yosemite National Park           138
3      Abutilon theophrasti                  Bryce National Park            84
4  Githopsis specularioides  Great Smoky Mountains National Park            85

Column names:
 Index(['scientific_name', 'park_name', 'observations'], dtype='object') 

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB
None

Description:
        

#### Initial observations
* Curious if all `park_name` values end in `National Park`
* Column names are appropriately named and formatted (lowercase, underscore_for_space, no whitespace)
* No missing data on initial inspection
* Data types are appropriate
* Surprised to see the large amount of data for only four national parks
    * Potential to change `park_name` to Categorical
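
The first and last points above can be checked directly. Here is a minimal sketch, assuming a small stand-in frame (`toy_observations`, a hypothetical name) built from the first four rows printed above rather than the full `observations_csv`:

```python
import pandas as pd

# Stand-in for observations_csv: only the first four rows shown in the output above.
toy_observations = pd.DataFrame({
    'scientific_name': ['Vicia benghalensis', 'Neovison vison',
                        'Prunus subcordata', 'Abutilon theophrasti'],
    'park_name': ['Great Smoky Mountains National Park',
                  'Great Smoky Mountains National Park',
                  'Yosemite National Park',
                  'Bryce National Park'],
    'observations': [68, 77, 138, 84],
})

# Does every park name end in 'National Park'?
all_end_in_national_park = toy_observations['park_name'].str.endswith('National Park').all()
print(all_end_in_national_park)

# Convert the low-cardinality column to a categorical dtype.
toy_observations['park_name'] = toy_observations['park_name'].astype('category')
park_categories = toy_observations['park_name'].cat.categories.tolist()
print(park_categories)
```

The same two calls should work unchanged on the full dataset; whether the conversion is worthwhile depends on how much the column is used later.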

#### Descriptives for `species_info_csv`

In [3]:
print('First five rows:\n', species_info_csv.head())
print('\nColumn names:\n', species_info_csv.columns, '\n\nInfo:')
print(species_info_csv.info())
print('\nDescription:\n', species_info_csv.describe(include='all'))

First five rows:
   category                scientific_name  \
0   Mammal  Clethrionomys gapperi gapperi   
1   Mammal                      Bos bison   
2   Mammal                     Bos taurus   
3   Mammal                     Ovis aries   
4   Mammal                 Cervus elaphus   

                                        common_names conservation_status  
0                           Gapper's Red-Backed Vole                 NaN  
1                              American Bison, Bison                 NaN  
2  Aurochs, Aurochs, Domestic Cattle (Feral), Dom...                 NaN  
3  Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)                 NaN  
4                                      Wapiti Or Elk                 NaN  

Column names:
 Index(['category', 'scientific_name', 'common_names', 'conservation_status'], dtype='object') 

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Co

#### Initial observations
* I had the impression I would be working with trees; good to know that this must contain all life in the national parks.
* Could encounter difficulties with length of `common_names` due to multiple entries in a single observation
* Columns are appropriately named and formatted
* Data types are appropriate
    * Potential to change `category` to Categorical, since there are only seven unique values
    * Same for `conservation_status`; four unique values
* Only `conservation_status` is missing values
    * Are values only included for endangered species?
    * It seems worth exploring endangered species further
* I am surprised to see that the `count` and `unique` values are not the same for `scientific_name` and `common_names`
    * Indicates that there are duplicates of some kind here
    * Could a species be listed as `endangered` in one region, but not another, resulting in separate rows? Unlikely, since this dataset is not connected to `observations_csv` and has no park or region-related information.
    * Could the same species be listed under different common names in separate rows? Unlikely, since a single `common_names` entry can already hold multiple names
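
One way to test these guesses is to pull out every row whose `scientific_name` appears more than once and see which columns differ. The sketch below uses made-up rows (`toy_species` and its values are hypothetical, not taken from `species_info_csv`):

```python
import pandas as pd

# Hypothetical rows; 'Species a' is deliberately listed twice with different common names.
toy_species = pd.DataFrame({
    'category': ['Mammal', 'Mammal', 'Mammal', 'Bird'],
    'scientific_name': ['Species a', 'Species a', 'Species b', 'Species c'],
    'common_names': ['Name One', 'Name One, Name Two', 'Name Three', 'Name Four'],
    'conservation_status': [None, None, 'Endangered', None],
})

# Rows that are exact copies of another row (all columns equal).
exact_duplicate_count = int(toy_species.duplicated().sum())

# Rows that share a scientific_name with at least one other row.
repeated_mask = toy_species.duplicated(subset='scientific_name', keep=False)
repeated_rows = toy_species[repeated_mask].sort_values('scientific_name')
print(repeated_rows)

# For each repeated name, count distinct values per column to see what differs.
differences = repeated_rows.groupby('scientific_name').nunique()
print(differences)
```

A count above one in `differences` points to the column that distinguishes otherwise repeated rows.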

## III. Project goals
### Personal goals
* My main goal in this project is to take a deep dive into everything I learned:
    * Data tidying/wrangling
    * Determine what visualizations fit the data best
    * Ask questions about the data and provide answers
    * Format results into a presentable product
* _Note:_ The purpose of this document is to showcase my thought process.
    * It may look out of order, because I may think of things later
    * It will be blocky as I try to process small chunks at a time
    * I plan to create a separate document that will organize everything into a more readable format
    
### Project directions
* Initial questions (brainstorming - will select questions later):
    * What are the four parks in this dataset?
        * Where are they located?
        * How much area do they cover?
        * When were they founded?
        * What was the level of human interaction before founding?
        * What is the current level of human impact? (example: direct impact, such as visitors, and indirect impact, such as climate change) 
        * Can I extrapolate data from these parks to other national parks? (Are these good representations of the other ~60 parks?)
    * What is the distribution of species in the national parks?
        * Which park has more endangered species?
        * What is the proportion of endangered species versus other?
        * Which parks stand out as having significantly more of one species/category?
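
As a rough illustration of the proportion question, the share of endangered species can be computed as the mean of a boolean mask. This is a sketch on made-up rows, and it assumes that a missing `conservation_status` means no status was recorded:

```python
import pandas as pd

# Hypothetical rows; None stands in for the missing values seen in conservation_status.
toy_species = pd.DataFrame({
    'scientific_name': ['Species a', 'Species b', 'Species c', 'Species d'],
    'conservation_status': [None, 'Endangered', None, None],
})

# True where the status is exactly 'Endangered'; missing values compare as False.
is_endangered = toy_species['conservation_status'].eq('Endangered')

# The mean of a boolean Series is the proportion of True values.
endangered_share = is_endangered.mean()
print(f'{endangered_share:.0%} of the toy species are endangered')
```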

## IV. Data wrangling and tidying
Since the `observations_csv` seems to contain the meat of the data, I will start there.

### Preliminary data cleaning

* Preliminary inspection already completed with `.info()` and `.describe(include='all')`

### Checking for duplicates
#### Duplicates in `scientific_name`
* There are 23,296 rows. If each species appeared at most once per park, splitting the rows between four parks would require no fewer than 5,824 unique values for `scientific_name`. There are only 5,541, which indicates duplicates.
* Maybe I don't quite understand what `observations` means
    * I assumed it was all of the instances a given species was found during a certain time-period
    * Maybe it could represent (for example) six different researchers covering unique areas of the park submitting their own reports, resulting in overlap on the same species within that park?
    * Either way, warrants further investigation
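
The arithmetic behind the first point, and one way to look for repeats, can be sketched as follows. The totals come from the output above; the toy rows and the names `toy_observations` and `repeated_pairs` are hypothetical. The expected figure of 5,824 also matches the row count reported for `species_info_csv`, which may not be a coincidence.

```python
import pandas as pd

# Figures taken from the .info() and .describe() results discussed above.
total_rows = 23296
n_parks = 4
n_unique_names = 5541

expected_unique = total_rows // n_parks      # rows per park if each species appears once per park
surplus = expected_unique - n_unique_names   # extra rows per park beyond one per species
print(expected_unique, surplus)

# Hypothetical rows: 'Species a' is reported twice in 'Park X'.
toy_observations = pd.DataFrame({
    'scientific_name': ['Species a', 'Species a', 'Species a', 'Species b', 'Species b'],
    'park_name': ['Park X', 'Park X', 'Park Y', 'Park X', 'Park Y'],
    'observations': [10, 12, 7, 3, 5],
})

# Count rows per (species, park) pair and keep the pairs that occur more than once.
rows_per_pair = toy_observations.groupby(['scientific_name', 'park_name']).size()
repeated_pairs = rows_per_pair[rows_per_pair > 1]
print(repeated_pairs)
```

Grouping on the pair rather than on `scientific_name` alone shows directly whether a species has more than one row within the same park.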

In [34]:
# First, I will sort the data by 'scientific_name', and look at the first few instances of duplicates.
sorted_observations_csv = observations_csv.sort_values(by=['scientific_name', 'park_name'])
# print(sorted_observations_csv.head(20))

# Ok that didn't work like I had hoped. 
# Instead, I will:
    # Group by scientific name
    # Filter for results greater than four
grouped_observations_csv = observations_csv.groupby('scientific_name').count().reset_index()
print(grouped_observations_csv.head())
print(len(grouped_observations_csv)) # Checking that there are still 5,541 unique observations (there are)
# Seeing how many duplicates there are for 'park_name' and 'observations'; I hope they match
print(grouped_observations_csv.park_name.value_counts())
print(grouped_observations_csv.observations.value_counts())
# I find these results interesting: 
    # 265 species have eight rows, and nine species have twelve rows
    # These are in multiples of four, so they are likely duplicated evenly across the parks
    
# Gathering a list of 'scientific_names' to filter 'observations_csv' using '.isin()'
duplicated_observations_list = grouped_observations_csv.scientific_name[grouped_observations_csv['observations'] > 4]
print(len(duplicated_observations_list)) # Expecting 274 (265 + 9 from 'value_counts' above that were greater than four) # Output: 274
duplicated_observations_df = observations_csv[observations_csv.scientific_name.isin(duplicated_observations_list)].sort_values(
    by=['scientific_name', 'park_name']).reset_index(drop=True)

# Investigating duplicates
# print(duplicated_observations_df.head(20)) # Still no answers, try unique 'scientific_name' for trends or patterns?
# print(duplicated_observations_df.scientific_name.unique()) # No obvious trends, maybe also check category?
duplicated_observations_species_info = species_info_csv[species_info_csv.scientific_name.isin(duplicated_observations_list)].reset_index(drop=True)
print(duplicated_observations_species_info.head())
print(duplicated_observations_species_info.scientific_name.value_counts().head(20))

# Initial answer: There are nine triple duplicates of 'scientific_name' in 'species_info_csv', and 265 double duplicates.
# Why is that?

        scientific_name  park_name  observations
0         Abies bifolia          4             4
1        Abies concolor          4             4
2         Abies fraseri          4             4
3  Abietinella abietina          4             4
4     Abronia ammophila          4             4
5541
park_name
4     5267
8      265
12       9
Name: count, dtype: int64
observations
4     5267
8      265
12       9
Name: count, dtype: int64
274
  category           scientific_name               common_names  \
0   Mammal            Cervus elaphus              Wapiti Or Elk   
1   Mammal    Odocoileus virginianus          White-Tailed Deer   
2   Mammal                Sus scrofa        Feral Hog, Wild Pig   
3   Mammal               Canis lupus                  Gray Wolf   
4   Mammal  Urocyon cinereoargenteus  Common Gray Fox, Gray Fox   

  conservation_status  
0                 NaN  
1                 NaN  
2                 NaN  
3          Endangered  
4                 NaN  
scientifi