# Biodiversity in National Parks

## Introduction
This data analysis is part of the Codecademy Data Scientist career path. This project focuses on cleaning the data so that it can be analysed. The data comes from the National Park Service and describes endangered species in different parks.


The data analysis is composed as follows:

1. Descriptive analysis of the datasets
2. Exploratory data analysis
3. Main Questions
4. Conclusion
5. Appendix

The Main Questions part consists of the following research questions:
1. Which park is most biodiverse?
2. Which species are endangered?
3. Which park has the most endangered species?
4. What category of animals is the most endangered?
5. Is there a difference in the endangerment of species among different parks?
6.... 

Each question will be accompanied by two graphs.

### 1 Initial Data Inspection

#### 1.1 Importing libraries and data


In [317]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [318]:
observations = pd.read_csv('observations.csv')
species = pd.read_csv('species_info.csv')

#### 1.2 Data inspection


In [319]:
observations.head()

Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85


In [320]:
observations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB


In [321]:
species.head()

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,


In [322]:
species.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name      5824 non-null   object
 2   common_names         5824 non-null   object
 3   conservation_status  191 non-null    object
dtypes: object(4)
memory usage: 182.1+ KB


The observations dataset is exactly 4 times the length of the species dataset (23296 = 4 × 5824), which suggests that it contains 4 observations of each species. Given the number of national parks listed below, we can conclude that there is one observation per species in each national park.

In [323]:
for num, park in enumerate(observations.park_name.unique()):
    print(num+1, park)

1 Great Smoky Mountains National Park
2 Yosemite National Park
3 Bryce National Park
4 Yellowstone National Park
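The row-count reasoning above can be sketched as a quick check. This is a minimal sketch on toy data: `species_toy`, `observations_toy` and the observation values are hypothetical stand-ins, not the real CSV contents. Matching row counts are only a hint; counting rows per (species, park) pair is the stricter test, and the cleaning steps later in this notebook show that the real data does contain repeated pairs.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical values, not the real data).
species_toy = pd.DataFrame({
    "scientific_name": ["Vicia benghalensis", "Neovison vison"],
    "category": ["Vascular Plant", "Mammal"],
})
parks = ["Bryce National Park", "Yosemite National Park"]
observations_toy = pd.DataFrame(
    [(name, park, 100) for name in species_toy.scientific_name for park in parks],
    columns=["scientific_name", "park_name", "observations"],
)

# Hint: the observation table is n_parks times as long as the species table.
n_parks = observations_toy.park_name.nunique()
rows_match = len(observations_toy) == len(species_toy) * n_parks

# Stricter test: every (species, park) pair occurs exactly once.
pair_counts = observations_toy.groupby(["scientific_name", "park_name"]).size()
one_row_per_pair = bool((pair_counts == 1).all())

print(rows_match, one_row_per_pair)
```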


#### 1.3 Merging the Data

We therefore want to merge the data on the scientific_name column to have a complete dataset.

In [324]:
biodiversity_data = pd.merge(observations, species)

In [325]:
biodiversity_data.head()

Unnamed: 0,scientific_name,park_name,observations,category,common_names,conservation_status
0,Vicia benghalensis,Great Smoky Mountains National Park,68,Vascular Plant,"Purple Vetch, Reddish Tufted Vetch",
1,Vicia benghalensis,Yosemite National Park,148,Vascular Plant,"Purple Vetch, Reddish Tufted Vetch",
2,Vicia benghalensis,Yellowstone National Park,247,Vascular Plant,"Purple Vetch, Reddish Tufted Vetch",
3,Vicia benghalensis,Bryce National Park,104,Vascular Plant,"Purple Vetch, Reddish Tufted Vetch",
4,Neovison vison,Great Smoky Mountains National Park,77,Mammal,American Mink,
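One thing to watch with this merge: when the species table lists a scientific name more than once, every matching observation row is repeated for each listing. The `info()` output further down has an index running up to 25631, more than the 23296 observation rows, which is consistent with this. The sketch below reproduces the effect on toy frames (`observations_toy` and `species_toy` are hypothetical). It also spells out the join key and uses the `validate` argument of `pd.merge` to detect non-unique keys.

```python
import pandas as pd

observations_toy = pd.DataFrame({
    "scientific_name": ["Vireo solitarius", "Neovison vison"],
    "park_name": ["Bryce National Park", "Bryce National Park"],
    "observations": [112, 77],
})
# One scientific name is listed twice with different common names.
species_toy = pd.DataFrame({
    "scientific_name": ["Vireo solitarius", "Vireo solitarius", "Neovison vison"],
    "common_names": ["Blue-Headed Vireo, Solitary Vireo", "Blue-Headed Vireo", "American Mink"],
})

# Explicit key; how="inner" is the default that pd.merge(observations, species) also uses.
merged_toy = pd.merge(observations_toy, species_toy, on="scientific_name", how="inner")
print(len(observations_toy), len(merged_toy))

# validate="many_to_one" raises MergeError if the right-hand keys are not unique.
try:
    pd.merge(observations_toy, species_toy, on="scientific_name", validate="many_to_one")
    species_keys_unique = True
except pd.errors.MergeError:
    species_keys_unique = False
print(species_keys_unique)
```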


#### 1.4 Cleaning the data

In [326]:
# Rows that are exact copies of an earlier row
duplicate_rows = biodiversity_data[biodiversity_data.duplicated()]
print(len(duplicate_rows))
duplicate_rows.head()

31


Unnamed: 0,scientific_name,park_name,observations,category,common_names,conservation_status
1070,Monotropa hypopithys,Great Smoky Mountains National Park,73,Vascular Plant,"American Pinesap, Pine-Sap",
1071,Monotropa hypopithys,Great Smoky Mountains National Park,73,Vascular Plant,Pinesap,
1850,Plantago major,Great Smoky Mountains National Park,90,Vascular Plant,"Nipple-Seed Plantain, Plantain",
1851,Plantago major,Great Smoky Mountains National Park,90,Vascular Plant,"Broadleaf Plantain, Buckhorn Plantain, Common ...",
2126,Eleocharis palustris,Great Smoky Mountains National Park,62,Vascular Plant,Spike-Rush,


As we can see, there are 31 duplicate rows. This appears to be mainly because of slightly different common_names entries for the same species. We will get rid of the duplicates.

In [327]:
biodiversity_data = biodiversity_data.drop_duplicates()
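A note on what `duplicated()` counts: by default it flags only the second and later copies of a fully identical row, so the first occurrence is not counted. Passing `keep=False` flags every copy instead. The toy frame below is hypothetical and only illustrates the difference.

```python
import pandas as pd

toy = pd.DataFrame({
    "scientific_name": ["Plantago major", "Plantago major", "Plantago major"],
    "park_name": ["Bryce National Park"] * 3,
    "observations": [90, 90, 85],
})

# Default keep="first": only the later copy of an identical row is flagged.
n_later_copies = int(toy.duplicated().sum())
# keep=False: every row that has an identical twin is flagged.
n_all_copies = int(toy.duplicated(keep=False).sum())

# drop_duplicates() keeps the first copy and removes the later ones.
deduped = toy.drop_duplicates()
print(n_later_copies, n_all_copies, len(deduped))
```

The row with 85 observations survives because it differs in one column; full-row deduplication does not touch rows that differ only in `observations` or `common_names`.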

In [328]:
biodiversity_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25601 entries, 0 to 25631
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   scientific_name      25601 non-null  object
 1   park_name            25601 non-null  object
 2   observations         25601 non-null  int64 
 3   category             25601 non-null  object
 4   common_names         25601 non-null  object
 5   conservation_status  880 non-null    object
dtypes: int64(1), object(5)
memory usage: 1.4+ MB


In [329]:
biodiversity_data.conservation_status.unique()

array([nan, 'Species of Concern', 'Threatened', 'Endangered',
       'In Recovery'], dtype=object)

The conservation_status column seems to have a lot of missing data. Looking at the unique values in this column, this is because species with no conservation concern have been given a NaN value. We will fill these with a corresponding label.

In [330]:
biodiversity_data['conservation_status'] = biodiversity_data['conservation_status'].fillna('Not of Concern')

In [331]:
biodiversity_data.conservation_status.value_counts()

Not of Concern        24721
Species of Concern      732
Endangered               80
Threatened               44
In Recovery              24
Name: conservation_status, dtype: int64
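The fill step can be checked on a small example by comparing the number of missing values before filling with the count of the new label afterwards. The series below is a hypothetical stand-in for the real column.

```python
import numpy as np
import pandas as pd

status = pd.Series(
    [np.nan, "Endangered", np.nan, "Threatened"],
    name="conservation_status",
)

# Number of missing entries before filling.
n_missing = int(status.isna().sum())

# Replace NaN with an explicit label, as done for the real column above.
filled = status.fillna("Not of Concern")
n_filled_label = int(filled.value_counts()["Not of Concern"])

print(n_missing, n_filled_label, int(filled.isna().sum()))
```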

In [332]:
biodiversity_data.describe(include='all')

Unnamed: 0,scientific_name,park_name,observations,category,common_names,conservation_status
count,25601,25601,25601.0,25601,25601,25601
unique,5541,4,,7,5504,5
top,Castor canadensis,Bryce National Park,,Vascular Plant,Brachythecium Moss,Not of Concern
freq,36,6406,,19534,28,24721
mean,,,142.196477,,,
std,,,69.901035,,,
min,,,9.0,,,
25%,,,86.0,,,
50%,,,123.0,,,
75%,,,195.0,,,


Looking at the number of unique values in the scientific_name and common_names columns, we see a difference (5541 - 5504 = 37). Every scientific name should correspond to one common_names entry, so let's find out what is wrong here.

In [333]:
name_s = biodiversity_data.groupby('scientific_name').common_names.count().reset_index()
doubles = name_s[name_s.common_names > 4]
doubles

Unnamed: 0,scientific_name,common_names
104,Agrostis capillaris,16
107,Agrostis gigantea,16
111,Agrostis mertensii,16
116,Agrostis scabra,16
118,Agrostis stolonifera,16
...,...,...
5468,Vireo solitarius,16
5481,Vulpia bromoides,16
5484,Vulpia myuros,16
5485,Vulpia octoflora,16
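The grouping used above can be sketched on a toy frame. As a small variation, this sketch derives the threshold from the number of parks rather than hard-coding 4 and uses `size()` to count rows per group. All names and values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "scientific_name": ["A a", "A a", "A a", "B b", "B b"],
    "park_name": ["P1", "P2", "P2", "P1", "P2"],
    "common_names": ["x", "x", "x, y", "z", "z"],
})

# A clean dataset has at most one row per park for each scientific name.
n_parks = toy.park_name.nunique()
rows_per_name = toy.groupby("scientific_name").size()

# Names with more rows than parks must contain repeated (name, park) pairs.
suspicious = rows_per_name[rows_per_name > n_parks]
print(suspicious)
```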


When grouping by scientific name and counting the common_names entries, every group should return 4, one for each national park. However, quite a few return more than 4. If we pick one out we can see what is going on:

In [334]:
biodiversity_data[biodiversity_data.scientific_name == 'Vireo solitarius']

Unnamed: 0,scientific_name,park_name,observations,category,common_names,conservation_status
17996,Vireo solitarius,Bryce National Park,112,Bird,"Blue-Headed Vireo, Solitary Vireo",Not of Concern
17997,Vireo solitarius,Bryce National Park,112,Bird,Blue-Headed Vireo,Not of Concern
17998,Vireo solitarius,Yosemite National Park,153,Bird,"Blue-Headed Vireo, Solitary Vireo",Not of Concern
17999,Vireo solitarius,Yosemite National Park,153,Bird,Blue-Headed Vireo,Not of Concern
18000,Vireo solitarius,Yosemite National Park,140,Bird,"Blue-Headed Vireo, Solitary Vireo",Not of Concern
18001,Vireo solitarius,Yosemite National Park,140,Bird,Blue-Headed Vireo,Not of Concern
18002,Vireo solitarius,Yellowstone National Park,240,Bird,"Blue-Headed Vireo, Solitary Vireo",Not of Concern
18003,Vireo solitarius,Yellowstone National Park,240,Bird,Blue-Headed Vireo,Not of Concern
18004,Vireo solitarius,Great Smoky Mountains National Park,81,Bird,"Blue-Headed Vireo, Solitary Vireo",Not of Concern
18005,Vireo solitarius,Great Smoky Mountains National Park,81,Bird,Blue-Headed Vireo,Not of Concern


We see that this species has multiple observation counts per national park and multiple common_names entries. Some entries list additional common names, but for the sake of a tidy dataset we only want one row per species and park. Otherwise the observation count for a species is doubled, since it has two rows.

In [338]:
biodiversity_data = biodiversity_data.drop_duplicates(subset=['scientific_name', 'park_name'])
# After this we have only one row per scientific name and park

In [339]:
biodiversity_data[biodiversity_data.scientific_name == 'Vireo solitarius']

Unnamed: 0,scientific_name,park_name,observations,category,common_names,conservation_status
17996,Vireo solitarius,Bryce National Park,112,Bird,"Blue-Headed Vireo, Solitary Vireo",Not of Concern
17998,Vireo solitarius,Yosemite National Park,153,Bird,"Blue-Headed Vireo, Solitary Vireo",Not of Concern
18002,Vireo solitarius,Yellowstone National Park,240,Bird,"Blue-Headed Vireo, Solitary Vireo",Not of Concern
18004,Vireo solitarius,Great Smoky Mountains National Park,81,Bird,"Blue-Headed Vireo, Solitary Vireo",Not of Concern
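One consequence of deduplicating on `['scientific_name', 'park_name']` is that only the first row per pair is kept, so a second, different observation count for the same pair is discarded (for Yosemite above, 153 is kept and 140 is dropped). The sketch below shows this on toy rows modelled on that case. The summing alternative at the end is not what this notebook does; it is shown only as an option, under the assumption that the two counts are separate records rather than data-entry copies.

```python
import pandas as pd

toy = pd.DataFrame({
    "scientific_name": ["Vireo solitarius"] * 4,
    "park_name": ["Yosemite National Park"] * 4,
    "observations": [153, 153, 140, 140],
    "common_names": ["Blue-Headed Vireo, Solitary Vireo", "Blue-Headed Vireo"] * 2,
})

# Approach used above: keep the first row per (species, park) pair.
kept = toy.drop_duplicates(subset=["scientific_name", "park_name"])
print(kept.observations.tolist())

# Hypothetical alternative: remove only the common-name copies, then sum the counts.
distinct_counts = toy.drop_duplicates(
    subset=["scientific_name", "park_name", "observations"]
)
summed = distinct_counts.groupby(["scientific_name", "park_name"]).observations.sum()
print(int(summed.iloc[0]))
```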


We now have one observation for every species in every national park.

In [340]:
biodiversity_data.describe(include='all')

Unnamed: 0,scientific_name,park_name,observations,category,common_names,conservation_status
count,22164,22164,22164.0,22164,22164,22164
unique,5541,4,,7,5229,5
top,Vicia benghalensis,Great Smoky Mountains National Park,,Vascular Plant,Brachythecium Moss,Not of Concern
freq,4,5541,,17048,28,21452
mean,,,142.314835,,,
std,,,69.885082,,,
min,,,9.0,,,
25%,,,86.0,,,
50%,,,124.0,,,
75%,,,195.0,,,


However, if we perform .describe() again on the dataset, we see that the difference has only become larger (5541 - 5229 = 312). This is because some common names may also correspond to multiple scientific names.

In [342]:
name_c = biodiversity_data.groupby('common_names').scientific_name.count().reset_index()
doubles_c = name_c[name_c.scientific_name > 4]
doubles_c

Unnamed: 0,common_names,scientific_name
11,A Moss,8
16,"A Sedge, Sedge",16
79,Alpine Fescue,8
105,Alpine Springbeauty,8
120,Amblystegium Moss,8
...,...,...
5173,Yellow Pincushion,12
5183,Yellow Warbler,8
5197,Yellow-Rumped Warbler,8
5199,Yellow-Throated Warbler,8


In [343]:
biodiversity_data[biodiversity_data.common_names == 'Alpine Fescue']

Unnamed: 0,scientific_name,park_name,observations,category,common_names,conservation_status
16912,Festuca brachyphylla,Bryce National Park,121,Vascular Plant,Alpine Fescue,Not of Concern
16913,Festuca brachyphylla,Yellowstone National Park,256,Vascular Plant,Alpine Fescue,Not of Concern
16914,Festuca brachyphylla,Great Smoky Mountains National Park,66,Vascular Plant,Alpine Fescue,Not of Concern
16915,Festuca brachyphylla,Yosemite National Park,133,Vascular Plant,Alpine Fescue,Not of Concern
19280,Festuca brachyphylla ssp. breviculmis,Yellowstone National Park,275,Vascular Plant,Alpine Fescue,Not of Concern
19281,Festuca brachyphylla ssp. breviculmis,Yosemite National Park,167,Vascular Plant,Alpine Fescue,Not of Concern
19282,Festuca brachyphylla ssp. breviculmis,Great Smoky Mountains National Park,65,Vascular Plant,Alpine Fescue,Not of Concern
19283,Festuca brachyphylla ssp. breviculmis,Bryce National Park,106,Vascular Plant,Alpine Fescue,Not of Concern


So let's also drop duplicates this way around.

In [344]:
biodiversity_data = biodiversity_data.drop_duplicates(subset=['common_names', 'park_name'])

In [345]:
biodiversity_data[biodiversity_data.common_names == 'Alpine Fescue']

Unnamed: 0,scientific_name,park_name,observations,category,common_names,conservation_status
16912,Festuca brachyphylla,Bryce National Park,121,Vascular Plant,Alpine Fescue,Not of Concern
16913,Festuca brachyphylla,Yellowstone National Park,256,Vascular Plant,Alpine Fescue,Not of Concern
16914,Festuca brachyphylla,Great Smoky Mountains National Park,66,Vascular Plant,Alpine Fescue,Not of Concern
16915,Festuca brachyphylla,Yosemite National Park,133,Vascular Plant,Alpine Fescue,Not of Concern
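The one-to-many relationship handled above can also be checked directly by counting distinct scientific names per common_names entry, which avoids relying on the "more than 4 rows" threshold. The toy frame below is a hypothetical stand-in built from the names shown in the output.

```python
import pandas as pd

toy = pd.DataFrame({
    "scientific_name": [
        "Festuca brachyphylla",
        "Festuca brachyphylla ssp. breviculmis",
        "Neovison vison",
    ],
    "common_names": ["Alpine Fescue", "Alpine Fescue", "American Mink"],
})

# Number of distinct scientific names behind each common_names entry.
names_per_common = toy.groupby("common_names").scientific_name.nunique()
shared = names_per_common[names_per_common > 1]
print(shared)
```

Note that deduplicating on `common_names` removes rows for taxa that are distinct by scientific name (here a subspecies). Whether that is acceptable depends on the question being asked.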


Now if we perform .describe() on biodiversity_data, we can see that the number of unique scientific names matches the number of unique common names (5229 each).

In [346]:
biodiversity_data.describe(include='all')

Unnamed: 0,scientific_name,park_name,observations,category,common_names,conservation_status
count,20916,20916,20916.0,20916,20916,20916
unique,5229,4,,7,5229,5
top,Vicia benghalensis,Great Smoky Mountains National Park,,Vascular Plant,"Purple Vetch, Reddish Tufted Vetch",Not of Concern
freq,4,5229,,16344,4,20216
mean,,,142.258701,,,
std,,,69.89274,,,
min,,,9.0,,,
25%,,,86.0,,,
50%,,,124.0,,,
75%,,,195.0,,,


### 2 Exploratory data analysis

### 3 Main Questions
#### 3.1 Which park is most biodiverse?
What makes a park biodiverse? 
- the number of species in the park



In [348]:
biodiversity_data.head()

Unnamed: 0,scientific_name,park_name,observations,category,common_names,conservation_status
0,Vicia benghalensis,Great Smoky Mountains National Park,68,Vascular Plant,"Purple Vetch, Reddish Tufted Vetch",Not of Concern
1,Vicia benghalensis,Yosemite National Park,148,Vascular Plant,"Purple Vetch, Reddish Tufted Vetch",Not of Concern
2,Vicia benghalensis,Yellowstone National Park,247,Vascular Plant,"Purple Vetch, Reddish Tufted Vetch",Not of Concern
3,Vicia benghalensis,Bryce National Park,104,Vascular Plant,"Purple Vetch, Reddish Tufted Vetch",Not of Concern
4,Neovison vison,Great Smoky Mountains National Park,77,Mammal,American Mink,Not of Concern


In [355]:
num_species_park = biodiversity_data.groupby('park_name').observations.sum()
num_species_park

park_name
Bryce National Park                     517568
Great Smoky Mountains National Park     387581
Yellowstone National Park              1295803
Yosemite National Park                  774531
Name: observations, dtype: int64
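The cell above sums the `observations` column, so it measures the total number of sightings per park rather than the number of species. If biodiversity is defined as the number of species in a park, as stated at the start of this section, a distinct count per park is the closer match. The sketch below contrasts the two on hypothetical toy data.

```python
import pandas as pd

toy = pd.DataFrame({
    "scientific_name": ["A a", "B b", "A a"],
    "park_name": ["P1", "P1", "P2"],
    "observations": [10, 20, 50],
})

# Number of distinct species recorded in each park.
species_per_park = toy.groupby("park_name").scientific_name.nunique()

# Total sightings in each park (what the cell above computes).
observations_per_park = toy.groupby("park_name").observations.sum()

print(species_per_park)
print(observations_per_park)
```

In the toy data, P2 has more sightings but fewer species, so the two measures can rank parks differently. In the cleaned dataset, the `describe()` output (20916 rows, 4 parks, 5229 unique names) suggests that every park ends up with the same species count, so this measure alone may not separate the parks.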

#### 3.2 Which species are endangered?


#### 3.3 What park has the most endangered species?

#### 3.4 What category of animals is the most endangered?

#### 3.5 Is there a difference in endangerment of species among different parks?