# Project: Biodiversity Data Analysis from the National Park Service

## Project Goals
The goal of this project is to analyze biodiversity data from the National Park Service, focusing on the conservation status of the species observed in different national parks.

**The Project Scope consists of:**
1. Explore and Clean the Data
2. Perform Statistical Analysis on the Data
3. Visualize the Data
4. Explore the relationships discovered in this analysis and assess their statistical significance.

## Questions to Answer
1. How are categories of species spread over the four parks?
2. How are individual species with unhealthy populations spread over the four parks?
3. What categories of species have the highest rates of unhealthy populations?
4. How do the observation counts of individual species with unhealthy populations vary from their category mean?

## Definitions
Biodiversity refers to the variety of living species on Earth, including plants, animals, bacteria, and fungi.

## Data Sources
Both 'observations.csv' and 'species_info.csv' were provided by [Codecademy](https://www.codecademy.com).
**This data is fictional. It is modeled after real-world data for the purposes of practicing data analytics.**

## References
1. Biodiversity. Education. (n.d.). Retrieved April 17, 2023, from https://education.nationalgeographic.org/resource/biodiversity/ 

# Preparing the Data

## Import Statements & Dataframe Creation

In [37]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from IPython.display import display_html

%matplotlib inline

species = pd.read_csv('species_info.csv', encoding='utf-8')
observations = pd.read_csv('observations.csv', encoding='utf-8')

## Exploring Species Dataframe

In exploring the species dataframe, we observe the following:
1. The counts for category, scientific_name, and common_names are all 5824.
    - This implies there are no nulls in these columns, meaning every scientific name has a corresponding common name and category.
2. The count for conservation_status is only 191, and there are only 4 defined statuses. This indicates that NaN is being used to mark species that have no conservation status.
    - We will create a new conservation status called "healthy_population" and populate the NaN values with this new status.
3. The scientific_name count (5824) does not equal its unique count (5541). This means scientific names, which should be unique, are being repeated.
    - Repetition may be acceptable for common_names, as similar but distinct species are often given the same common name in everyday usage.
    - We group by scientific name and filter with a lambda function to display how many times each repeated scientific name appears.
    - After reviewing a sample of duplicate entries in scientific_name, we see no value in keeping them. We will drop the duplicates during data cleaning.
4. We perform a value count of category and conservation status to see the distribution of these columns.
    - For conservation status, we see that the total count of all non-healthy populations is 191, or about 3.3% of the total species count (including healthy populations).
        - Because healthy populations are represented by NaN, we count them separately from the non-null values.
        - Once we clean the data, we will see the true distribution count and percentage of conservation status.
    - For category, we observe that plants (vascular and nonvascular) make up about 82% of all species entries, with animals making up the remaining 18%.
        - Further research would be needed to confirm whether this ratio of roughly 4.7:1 is to be expected in a natural habitat or implies bias in the data collection process.
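The percentages quoted in this list can be re-derived from the value counts printed in this section. The following is a minimal sketch that hard-codes those counts instead of reading the CSV files; the variable names are illustrative and not part of the original notebook.

```python
# Counts copied from the value_counts() outputs shown in this section.
category_counts = {
    "Vascular Plant": 4470,
    "Bird": 521,
    "Nonvascular Plant": 333,
    "Mammal": 214,
    "Fish": 127,
    "Amphibian": 80,
    "Reptile": 79,
}
status_counts = {
    "Species of Concern": 161,
    "Endangered": 16,
    "Threatened": 10,
    "In Recovery": 4,
}

total_rows = sum(category_counts.values())
plant_rows = category_counts["Vascular Plant"] + category_counts["Nonvascular Plant"]
animal_rows = total_rows - plant_rows
flagged_rows = sum(status_counts.values())  # rows with a non-null conservation status

# Shares are expressed as percentages of all rows in the species dataframe.
plant_share = 100 * plant_rows / total_rows
flagged_share = 100 * flagged_rows / total_rows
plant_to_animal = plant_rows / animal_rows

print(f"Total rows: {total_rows}")
print(f"Plant share: {plant_share:.1f}%")
print(f"Rows with a conservation status: {flagged_share:.1f}%")
print(f"Plant-to-animal ratio: {plant_to_animal:.1f}:1")
```

These figures are computed on the raw rows, duplicates included, so they may shift slightly once the duplicate scientific names are removed.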

In [3]:
species.head()

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,


In [25]:
species.describe(include='all')

Unnamed: 0,category,scientific_name,common_names,conservation_status
count,5824,5824,5824,191
unique,7,5541,5504,4
top,Vascular Plant,Castor canadensis,Brachythecium Moss,Species of Concern
freq,4470,3,7,161


In [21]:
print(f"Unique Categories: {species.category.unique()}")
print()
print(f"Unique Conservation Status: {species.conservation_status.unique()}")

Unique Categories: ['Mammal' 'Bird' 'Reptile' 'Amphibian' 'Fish' 'Vascular Plant'
 'Nonvascular Plant']

Unique Conservation Status: [nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery']


In [5]:
species.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name      5824 non-null   object
 2   common_names         5824 non-null   object
 3   conservation_status  191 non-null    object
dtypes: object(4)
memory usage: 182.1+ KB


In [28]:
df1 = species.category.value_counts().to_frame(name = '')
df2 = species.category.value_counts(normalize = True).to_frame(name = '') * 100

df1_styler = df1.style.set_table_attributes("style='display:inline'").set_caption('Species Category Value Count')
df2_styler = df2.style.set_table_attributes("style='display:inline'").set_caption('Species Category Value Count Percentage')

display_html(df1_styler._repr_html_()+df2_styler._repr_html_(), raw=True)

category,count
Vascular Plant,4470
Bird,521
Nonvascular Plant,333
Mammal,214
Fish,127
Amphibian,80
Reptile,79

category,percentage
Vascular Plant,76.751374
Bird,8.945742
Nonvascular Plant,5.71772
Mammal,3.674451
Fish,2.180632
Amphibian,1.373626
Reptile,1.356456


In [13]:
print(f"Nan Value count for conservation status: {species.conservation_status.isna().sum()}")

Nan Value count for conservation status: 5633


In [12]:
species.conservation_status.value_counts()

conservation_status
Species of Concern    161
Endangered             16
Threatened             10
In Recovery             4
Name: count, dtype: int64

In [37]:
species.groupby("scientific_name").size().loc[lambda x: x > 1].sort_values()

scientific_name
Agrostis capillaris                 2
Panicum capillare                   2
Panicum miliaceum                   2
Panicum rigidulum var. rigidulum    2
Parietaria pensylvanica             2
                                   ..
Myotis lucifugus                    3
Columba livia                       3
Holcus lanatus                      3
Streptopelia decaocto               3
Canis lupus                         3
Length: 274, dtype: int64

In [31]:
df3 = species[species["scientific_name"] == "Myotis lucifugus"]
df4 = species[species["scientific_name"] == "Streptopelia decaocto"]

df3_styler = df3.style.set_table_attributes("style='display:inline'").set_caption("Myotis lucifugus Duplicates")
df4_styler = df4.style.set_table_attributes("style='display:inline'").set_caption("Streptopelia decaocto Duplicates")

display_html(df3_styler._repr_html_()+df4_styler._repr_html_(), raw=True)

Unnamed: 0,category,scientific_name,common_names,conservation_status
37,Mammal,Myotis lucifugus,"Little Brown Bat, Little Brown Myotis",Species of Concern
3042,Mammal,Myotis lucifugus,"Little Brown Bat, Little Brown Myotis, Little Brown Myotis",Species of Concern
4467,Mammal,Myotis lucifugus,Little Brown Myotis,Species of Concern

Unnamed: 0,category,scientific_name,common_names,conservation_status
3077,Bird,Streptopelia decaocto,Eurasian Collared-Dove,
3140,Bird,Streptopelia decaocto,"Eurasian Collared Dove, Eurasian Collared-Dove",
4514,Bird,Streptopelia decaocto,Eurasian Collared Dove,


## Exploring Observations Dataframe

In exploring the observations dataframe, we observe the following:
1. The number of unique scientific names (5541) matches our species dataframe.
    - This suggests we can join both dataframes on the scientific_name column.
2. Every column has the same non-null count of 23,296.
    - This implies every observation is a complete entry (no null values), consisting of a scientific name, a park name, and an observation count.
3. The entry count per park is exactly 5824 for all four parks.
    - This implies a limit on data collection for the number of entries per park.
        - It is possible that relevant data was not included due to this limit.
        - 5824 is also the total row count of the species dataframe, duplicates included.
    - Each park has more entries (5824) than there are unique scientific names (5541).
        - If every park had exactly one entry for every unique species, that would total 22,164 entries (5541 × 4) at the maximum, yet we have 23,296.
        - When we group by scientific name and keep the names with more than 4 entries, we see there are 274 species with duplicate entries to account for.
4. While the mean observation count per entry is about 142, the standard deviation is quite high at about 70, roughly half the mean.
    - Additionally, the min is 9 and the max is 321. This is quite a wide range.
        - Perhaps some species were simply more difficult to observe during the survey window.
5. We observe duplicate/problematic data when we review the count of entries per scientific name.
    - We see some species have up to 12 entries. The maximum should be 4, as each species can appear once in each of the 4 parks.
        - Our best guess is that multiple volunteers/workers recorded their observations separately, and these were entered as separate rows rather than being summed into one entry.
    - We will sum the observations for each species in each park when there are multiple entries.
        - We cannot simply drop duplicate entries, because that would discard useful observation counts. Instead, we need to sum the observations over every duplicate scientific_name and park_name pair to find a total count.
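Matching unique counts (5541 in both dataframes) do not by themselves guarantee that both dataframes contain the same names, and the duplicates are easier to reason about per (scientific_name, park_name) pair than per name alone. The following is a minimal sketch of both checks on a small toy sample. `species_sample`, `observations_sample`, `unmatched_names`, and `repeated_pairs` are hypothetical names, and the toy rows are not the real file contents.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (illustrative values, not the real CSV contents).
species_sample = pd.DataFrame(
    {"scientific_name": ["Columba livia", "Bos bison", "Ovis aries"]}
)
observations_sample = pd.DataFrame(
    {
        "scientific_name": ["Columba livia", "Columba livia", "Columba livia", "Bos bison"],
        "park_name": [
            "Bryce National Park",
            "Bryce National Park",
            "Yosemite National Park",
            "Bryce National Park",
        ],
        "observations": [135, 96, 142, 68],
    }
)


def unmatched_names(left, right):
    """Return the sorted names present in `left` but missing from `right`."""
    return sorted(set(left) - set(right))


def repeated_pairs(obs):
    """Count rows per (scientific_name, park_name) and keep pairs seen more than once."""
    pair_counts = obs.groupby(["scientific_name", "park_name"]).size()
    return pair_counts[pair_counts > 1]


missing_from_observations = unmatched_names(
    species_sample["scientific_name"], observations_sample["scientific_name"]
)
missing_from_species = unmatched_names(
    observations_sample["scientific_name"], species_sample["scientific_name"]
)
duplicates = repeated_pairs(observations_sample)

print(missing_from_observations)
print(missing_from_species)
print(duplicates)
```

On the real data, two empty lists would support joining on scientific_name, and the length of `duplicates` would show how many species-park pairs need their counts combined.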
    

In [40]:
observations.head()

Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85


In [41]:
observations.describe(include = 'all')

Unnamed: 0,scientific_name,park_name,observations
count,23296,23296,23296.0
unique,5541,4,
top,Myotis lucifugus,Great Smoky Mountains National Park,
freq,12,5824,
mean,,,142.287904
std,,,69.890532
min,,,9.0
25%,,,86.0
50%,,,124.0
75%,,,195.0
max,,,321.0


In [42]:
observations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB


In [43]:
print(f"Unique Park Names: {observations.park_name.unique()}")

Unique Park Names: ['Great Smoky Mountains National Park' 'Yosemite National Park'
 'Bryce National Park' 'Yellowstone National Park']


In [27]:
observations.park_name.value_counts()

park_name
Great Smoky Mountains National Park    5824
Yosemite National Park                 5824
Bryce National Park                    5824
Yellowstone National Park              5824
Name: count, dtype: int64

In [34]:
observations.groupby("scientific_name").size().loc[lambda x: x > 4].sort_values()

scientific_name
Agrostis capillaris                  8
Panicum capillare                    8
Panicum miliaceum                    8
Panicum rigidulum var. rigidulum     8
Parietaria pensylvanica              8
                                    ..
Myotis lucifugus                    12
Columba livia                       12
Holcus lanatus                      12
Streptopelia decaocto               12
Canis lupus                         12
Length: 274, dtype: int64

In [33]:
df5 = observations[observations["scientific_name"] == "Columba livia"]
df6 = observations[observations["scientific_name"] == "Streptopelia decaocto"]

df5_styler = df5.style.set_table_attributes("style='display:inline'").set_caption('Columba livia Duplicates')
df6_styler = df6.style.set_table_attributes("style='display:inline'").set_caption('Streptopelia decaocto Duplicates')

display_html(df5_styler._repr_html_()+df6_styler._repr_html_(), raw=True)

Unnamed: 0,scientific_name,park_name,observations
1865,Columba livia,Bryce National Park,135
2191,Columba livia,Yellowstone National Park,251
3255,Columba livia,Yosemite National Park,142
3441,Columba livia,Bryce National Park,96
6968,Columba livia,Bryce National Park,108
10468,Columba livia,Yosemite National Park,144
10688,Columba livia,Yellowstone National Park,232
11193,Columba livia,Yellowstone National Park,239
11859,Columba livia,Great Smoky Mountains National Park,44
12700,Columba livia,Great Smoky Mountains National Park,34

Unnamed: 0,scientific_name,park_name,observations
1635,Streptopelia decaocto,Yellowstone National Park,255
3200,Streptopelia decaocto,Bryce National Park,92
3376,Streptopelia decaocto,Yosemite National Park,124
4515,Streptopelia decaocto,Bryce National Park,88
7057,Streptopelia decaocto,Great Smoky Mountains National Park,74
8072,Streptopelia decaocto,Bryce National Park,121
8710,Streptopelia decaocto,Yellowstone National Park,255
10107,Streptopelia decaocto,Great Smoky Mountains National Park,72
10643,Streptopelia decaocto,Yellowstone National Park,261
14699,Streptopelia decaocto,Great Smoky Mountains National Park,110


## Cleaning the Data

Implementing the items below will improve both the readability and the usability of the data for analysis.

For the species dataframe:
1. Create a 'healthy_population' conservation status to replace NaN values.
2. Remove duplicate entries in scientific_name.
    - We have 5824 total entries, of which 5541 are unique.
    - This consolidation will remove 283 duplicate rows.
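The two species cleaning steps could be sketched roughly as follows. This is a minimal illustration on toy rows modeled on the duplicate examples displayed earlier, not the notebook's actual cleaning code. `clean_species` and `conflicting_status` are hypothetical helpers, and keeping the first row of each duplicate group is an assumption that only holds if the duplicates agree on conservation_status.

```python
import numpy as np
import pandas as pd

# Toy rows modeled on the duplicate examples displayed earlier.
species_sample = pd.DataFrame(
    {
        "category": ["Mammal", "Mammal", "Bird", "Bird"],
        "scientific_name": [
            "Myotis lucifugus",
            "Myotis lucifugus",
            "Streptopelia decaocto",
            "Streptopelia decaocto",
        ],
        "common_names": [
            "Little Brown Bat, Little Brown Myotis",
            "Little Brown Myotis",
            "Eurasian Collared-Dove",
            "Eurasian Collared Dove",
        ],
        "conservation_status": ["Species of Concern", "Species of Concern", np.nan, np.nan],
    }
)


def conflicting_status(df):
    """List scientific names whose duplicate rows disagree on conservation_status."""
    n_status = df.groupby("scientific_name")["conservation_status"].nunique(dropna=False)
    return n_status[n_status > 1].index.tolist()


def clean_species(df):
    """Fill missing statuses, then keep one row per scientific name."""
    cleaned = df.copy()
    cleaned["conservation_status"] = cleaned["conservation_status"].fillna("healthy_population")
    # Assumption: the first row of each duplicate group is representative.
    return cleaned.drop_duplicates(subset="scientific_name", keep="first").reset_index(drop=True)


conflicts = conflicting_status(species_sample)
species_clean = clean_species(species_sample)

print(conflicts)
print(species_clean)
```

Running the conflict check before dropping rows is a precaution: if any name appears in `conflicts`, choosing which row to keep becomes a judgment call rather than a mechanical step.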
    
For the observations dataframe:
1. Combine rows that share the same scientific_name and park_name, summing their observation counts.
    1. Identify instances where scientific_name and park_name are the same.
    2. Sum the observation counts in each group of duplicate rows to obtain a total count of observations.
    3. Write a unique entry row consisting of scientific_name, park_name, and observations (total).
    4. Delete the previous duplicate rows, so only the unique row remains.
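One way to carry out these four steps in a single pass is a group-by on the two key columns followed by a sum. The following is a minimal sketch under that assumption, using a subset of the Columba livia rows displayed earlier; `observations_sample` and `sum_duplicate_observations` are hypothetical names.

```python
import pandas as pd

# A subset of the Columba livia rows displayed earlier in this section.
observations_sample = pd.DataFrame(
    {
        "scientific_name": ["Columba livia"] * 6,
        "park_name": [
            "Bryce National Park",
            "Yellowstone National Park",
            "Yosemite National Park",
            "Bryce National Park",
            "Bryce National Park",
            "Yosemite National Park",
        ],
        "observations": [135, 251, 142, 96, 108, 144],
    }
)


def sum_duplicate_observations(df):
    """Return one row per (scientific_name, park_name) with observations summed."""
    return df.groupby(["scientific_name", "park_name"], as_index=False)["observations"].sum()


observations_clean = sum_duplicate_observations(observations_sample)
print(observations_clean)
```

Because the grouped result is a new dataframe, steps 3 and 4 happen implicitly: the summed rows replace the duplicates instead of being written and deleted one at a time.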