# Codecademy Portfolio Project: Biodiversity in National Parks

Goal:

 - Interpret data from the National Parks Service about endangered species in different parks
 - Investigate if there are any patterns or themes to the types of species that become endangered.
 
Provided Resources:
 
 - observations.csv
 - species_info.csv

In [1]:
# import section ------------------------------------------------------------------------------------------------------------- #
# updated with progress, sorted by purpose of the library -------------------------------------------------------------------- #

import pandas as pd
import numpy as np

Load the provided CSV files and inspect them to get an overview of the data.

In [2]:
# read provided "observations.csv" as pandas dataframe and assign it to a variable ------------------------------------------- #
observations_df = pd.read_csv("./CC_provided_resources/observations.csv")

observations_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB


In [3]:
# print first 10 rows of "observations.csv" ---------------------------------------------------------------------------------- #
observations_df.head(10)

Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85
5,Elymus virginicus var. virginicus,Yosemite National Park,112
6,Spizella pusilla,Yellowstone National Park,228
7,Elymus multisetus,Great Smoky Mountains National Park,39
8,Lysimachia quadrifolia,Yosemite National Park,168
9,Diphyscium cumberlandianum,Yellowstone National Park,250


In [4]:
# read provided "species_info.csv" as pandas dataframe and assign it to a variable ------------------------------------------- #
species_info_df = pd.read_csv("./CC_provided_resources/species_info.csv")

species_info_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name      5824 non-null   object
 2   common_names         5824 non-null   object
 3   conservation_status  191 non-null    object
dtypes: object(4)
memory usage: 182.1+ KB


In [5]:
# print first 10 rows of "species_info.csv" ---------------------------------------------------------------------------------- #
species_info_df.head(10)

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,
5,Mammal,Odocoileus virginianus,White-Tailed Deer,
6,Mammal,Sus scrofa,"Feral Hog, Wild Pig",
7,Mammal,Canis latrans,Coyote,Species of Concern
8,Mammal,Canis lupus,Gray Wolf,Endangered
9,Mammal,Canis rufus,Red Wolf,Endangered


Initial insights:
 - The dataframes need to be joined on the "scientific_name" column with an inner join.
 - Before joining, compare the unique entries of both "scientific_name" columns to check whether there is a mismatch.
 - Check how the missing ("NaN") values are stored; replace them with a proper null type if needed.
 - Check the dtype of the text columns; convert them to string if needed.
 - Causal questions will be raised later in the analysis, because I am unfamiliar with the subject.
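
The missing-value and dtype checks from the list above could look roughly like the following. This is a minimal sketch on a made-up toy frame (`toy_df` and its rows are assumptions for illustration, not the real data); the column names follow species_info.csv.

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror species_info.csv, the rows are illustrative only
toy_df = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Bird"],
    "scientific_name": ["Canis lupus", "Bos bison", "Spizella pusilla"],
    "conservation_status": ["Endangered", None, None],
})

# Count missing values per column; pandas treats None and NaN alike here
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Convert the text columns from the generic "object" dtype to pandas' string dtype
text_cols = ["category", "scientific_name", "conservation_status"]
toy_df = toy_df.astype({col: "string" for col in text_cols})
print(toy_df.dtypes)
```

Missing entries stay missing after the conversion, so `isna()` gives the same counts before and after.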

In [6]:
# print unique entries (and count thereof) for "scientific_name" to check if mismatches occur with inner_join ---------------- #
obs_unique = observations_df["scientific_name"].unique()
obs_nunique = observations_df["scientific_name"].nunique()
obs_sorted = np.sort(obs_unique)
print(obs_sorted)
print(obs_nunique)

['Abies bifolia' 'Abies concolor' 'Abies fraseri' ...
 'Zonotrichia querula' 'Zygodon viridissimus'
 'Zygodon viridissimus var. rupestris']
5541


In [7]:
# print unique entries (and count thereof) for "scientific_name" to check if mismatches occur with inner_join ---------------- #
info_unique = species_info_df["scientific_name"].unique()
info_nunique = species_info_df["scientific_name"].nunique()
info_sorted = np.sort(info_unique)
print(info_sorted)
print(info_nunique)

['Abies bifolia' 'Abies concolor' 'Abies fraseri' ...
 'Zonotrichia querula' 'Zygodon viridissimus'
 'Zygodon viridissimus var. rupestris']
5541


In [8]:
# the output above suggests there will be no mismatch; a full comparison is still performed to confirm ---------------------- #
if np.array_equal(obs_sorted, info_sorted):
    print("They are identical!")
else:
    print("They are NOT identical!")

They are identical!


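Since both sorted arrays are identical, an inner join can be performed without dropping any species. (With an outer join, likewise, no row would be left without a match in the other table.) Note that this check compares unique names only; it says nothing about names that occur more than once in either file.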
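
Had the two arrays differed, a set difference would show which names are affected, which `np.array_equal` alone does not reveal. The following is a minimal sketch on made-up name lists (`obs_names` and `info_names` are hypothetical stand-ins for the two "scientific_name" columns).

```python
import pandas as pd

# Hypothetical stand-ins for the two "scientific_name" columns
obs_names = pd.Series(["Abies bifolia", "Canis lupus", "Canis lupus", "Bos bison"])
info_names = pd.Series(["Abies bifolia", "Canis lupus", "Ovis aries"])

# Names present on one side only; an inner join would silently drop these rows
only_in_obs = set(obs_names) - set(info_names)
only_in_info = set(info_names) - set(obs_names)

print("only in observations:", sorted(only_in_obs))
print("only in species info:", sorted(only_in_info))
```

Two empty sets correspond to the "identical" result above.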

In [9]:
# joining observations.csv and species_info.csv on "scientific_name" columns ------------------------------------------------- #
all_data_df = observations_df.merge(species_info_df, on="scientific_name", how="inner")
all_data_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 25632 entries, 0 to 25631
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   scientific_name      25632 non-null  object
 1   park_name            25632 non-null  object
 2   observations         25632 non-null  int64 
 3   category             25632 non-null  object
 4   common_names         25632 non-null  object
 5   conservation_status  880 non-null    object
dtypes: int64(1), object(5)
memory usage: 1.4+ MB
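
The merged dataframe has 25632 rows, more than the 23296 rows of observations.csv. A likely explanation is that species_info.csv holds 5824 rows but only 5541 unique scientific names, so some names appear more than once, and every repeated name multiplies its matching observation rows in the join. The sketch below reproduces this effect on made-up toy frames (`toy_obs`, `toy_info` and all of their values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy frames; the values are illustrative, not taken from the real files
toy_obs = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Bos bison"],
    "park_name": ["Yosemite National Park", "Bryce National Park", "Yosemite National Park"],
    "observations": [10, 20, 30],
})
toy_info = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Bos bison"],
    "category": ["Mammal", "Mammal", "Mammal"],
    "conservation_status": ["Endangered", "Species of Concern", None],
})

# List every row whose scientific_name occurs more than once in the info table
dup_rows = toy_info[toy_info.duplicated(subset="scientific_name", keep=False)]
print(dup_rows)

# Each duplicated name on the right side multiplies the matching rows on the left side
merged = toy_obs.merge(toy_info, on="scientific_name", how="inner")
print(len(toy_obs), "rows before,", len(merged), "rows after the merge")
```

Running the same `duplicated` check on `species_info_df` would show which species are affected. Passing `validate="many_to_one"` to `merge` is another option: pandas then raises an error as soon as the right-hand keys are not unique.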


Being unfamiliar with the subject, I will perform an exploratory analysis to get an idea of what kinds of relationships to look for.
