# Project Scope

## Goals
- What is the distribution of conservation status for animals?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which species were spotted the most at each park?

## Data
We will use the observations.csv and species_info.csv files supplied by Codecademy.

## Analysis
We will use pandas, matplotlib, and seaborn to explore, analyse, and visualise the data.
Once completed, we will be able to draw conclusions based on our goals for the project.

# Import Appropriate Modules

In [205]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Import the Data

In [206]:
obs_df = pd.read_csv('/Users/jordangreen/Desktop/biodiversity_starter/observations.csv')
spec_df = pd.read_csv('/Users/jordangreen/Desktop/biodiversity_starter/species_info.csv')

- obs_df: observation data for each species
- spec_df: information about each of the observed species


# Inspect and Clean the Data

In [207]:
display(spec_df.head())

| | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | NaN |
| 1 | Mammal | Bos bison | American Bison, Bison | NaN |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | NaN |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | NaN |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | NaN |


In [208]:
print(spec_df.columns)

Index(['category', 'scientific_name', 'common_names', 'conservation_status'], dtype='object')


In [209]:
print(spec_df.category.unique())

['Mammal' 'Bird' 'Reptile' 'Amphibian' 'Fish' 'Vascular Plant'
 'Nonvascular Plant']


- 'category': how the species is classified in the dataset (possible values shown above)
- 'scientific_name': the unique name given to a species in taxonomy
- 'common_names': common aliases for the species
- 'conservation_status': whether a species needs conservation, and to what degree

The species dataset gives information about each species that appears in the observations dataset.  The 'conservation_status' column immediately stands out: it contains NaN values, which could either be missing data or mean that the species is not endangered at all.  Before making any changes to the data, I want to check whether any other columns contain null values.

In [210]:
print(spec_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name      5824 non-null   object
 2   common_names         5824 non-null   object
 3   conservation_status  191 non-null    object
dtypes: object(4)
memory usage: 182.1+ KB
None


Using the .info() method, it is clear that only the 'conservation_status' column contains NaN values, and quite a lot of them: only 191 of its 5824 entries are non-null.  Given the context of the dataset, we will assume that all NaN values represent species that are not endangered.  To make the data reflect this, we can change every null value in the 'conservation_status' column to the string 'Not Endangered'.  After doing this, we can convert the column into ordered categorical data.

In [211]:
spec_df['conservation_status'] = spec_df['conservation_status'].fillna('Not Endangered')
spec_df.conservation_status = pd.Categorical(spec_df.conservation_status, 
                                               categories=['Not Endangered', 'Species of Concern', 
                                               'In Recovery', 'Threatened', 'Endangered'], 
                                               ordered=True)
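
As a small illustration of why the ordering is useful, the sketch below repeats the same two steps on a toy frame and then uses comparisons and max(), which only work because the categories are ordered.  The species names and rows are hypothetical placeholders, not values from the dataset.

```python
import pandas as pd

# Same category order as in the cell above, from least to most serious.
status_order = ['Not Endangered', 'Species of Concern',
                'In Recovery', 'Threatened', 'Endangered']

# Hypothetical toy rows standing in for spec_df.
toy = pd.DataFrame({
    'scientific_name': ['Species A', 'Species B', 'Species C'],
    'conservation_status': [None, 'Threatened', 'Species of Concern'],
})

# Fill missing statuses, then convert to an ordered categorical.
toy['conservation_status'] = toy['conservation_status'].fillna('Not Endangered')
toy['conservation_status'] = pd.Categorical(
    toy['conservation_status'], categories=status_order, ordered=True)

# Ordered categories support min/max and comparison operators.
most_severe = toy['conservation_status'].max()
at_risk = toy[toy['conservation_status'] > 'Not Endangered']

print(most_severe)
print(at_risk)
```

Without ordered=True, both the max() call and the > comparison would raise a TypeError.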

In [212]:
print(spec_df.nunique())

category                  7
scientific_name        5541
common_names           5504
conservation_status       5
dtype: int64


Before we move on, I want to check whether there are any duplicates in the 'scientific_name' column.  There are 5541 unique scientific names but 5824 rows in the dataset, which indicates that some names appear more than once.


In [213]:
dupes = spec_df[spec_df.scientific_name.duplicated(keep=False)]
print('Duplicates: ' + str(len(dupes)))

Duplicates: 557
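
Note that keep=False flags every row whose name repeats, including the first occurrence, so 557 is larger than the number of surplus rows (5824 - 5541 = 283).  The toy example below, using hypothetical names, shows the difference between the two ways of counting.

```python
import pandas as pd

# Hypothetical names: 'Species A' appears twice, 'Species C' three times.
toy = pd.DataFrame({'scientific_name': [
    'Species A', 'Species A', 'Species B',
    'Species C', 'Species C', 'Species C']})

# keep=False marks every row that shares its name with another row.
rows_involved = int(toy['scientific_name'].duplicated(keep=False).sum())

# keep='first' marks only the repeats after the first occurrence.
surplus_rows = int(toy['scientific_name'].duplicated(keep='first').sum())

print(rows_involved, surplus_rows)
```

The second count always equals the number of rows minus the number of unique names.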


There are 557 rows whose scientific name appears more than once in the data set.  A species should have only one entry, so this represents an error in the dataset.  Some of these rows might have a different conservation_status from their duplicates.  If this is the case, it would be best to keep the most serious conservation_status.  Using the .drop_duplicates method, we can remove every row with a repeated name except the first instance of that name, and then record the most serious status found among each set of duplicates.

In [214]:
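# Keep only the first row for each scientific name.
spec_df.drop_duplicates(inplace=True, subset='scientific_name', keep='first')
spec_df.reset_index(drop=True, inplace=True)
# dupes was captured before the drop, so it still holds every duplicated row.
# Because the column is an ordered categorical, max() returns the most serious status.
worst_case = dupes.groupby('scientific_name').conservation_status.max()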

Now that we have the worst cases for all of the duplicates we will replace the conservation status in the dataframe with the one in the worst case for each species.

In [215]:
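# Series.iteritems() was removed in pandas 2.0; items() is the equivalent.
for species, worst in worst_case.items():
    # Use .loc to avoid chained assignment, which may not update spec_df.
    spec_df.loc[spec_df.scientific_name == species, 'conservation_status'] = worst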

In the code above, we iterated over each (species, worst status) pair in worst_case and, for the row in the dataframe with the matching scientific name, set its conservation status to the worst-case value we found in the grouping generated earlier.
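
For comparison, the same outcome can be sketched without a loop: sort by the ordered status from most to least serious, then keep the first row per name.  This is an alternative, not the approach used above, and the rows are hypothetical.  One difference is that it keeps the whole row with the worst status (including its common_names), rather than the first row with an updated status.

```python
import pandas as pd

status_order = ['Not Endangered', 'Species of Concern',
                'In Recovery', 'Threatened', 'Endangered']

# Hypothetical rows: 'Species A' is listed twice with different statuses.
toy = pd.DataFrame({
    'scientific_name': ['Species A', 'Species A', 'Species B'],
    'conservation_status': ['In Recovery', 'Endangered', 'Not Endangered'],
})
toy['conservation_status'] = pd.Categorical(
    toy['conservation_status'], categories=status_order, ordered=True)

deduped = (
    toy.sort_values('conservation_status', ascending=False)  # worst status first
       .drop_duplicates(subset='scientific_name', keep='first')
       .sort_index()                                          # restore original row order
       .reset_index(drop=True)
)
print(deduped)
```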

In [216]:
dupes = spec_df[spec_df.scientific_name.duplicated(keep=False)]
print('Duplicates: ' + str(len(dupes)))

Duplicates: 0


The duplicates are now removed and the dataset looks clean.  Since the observation set also contains a column for scientific names, and each name now appears only once in the species set, we can join the two tables together.

In [227]:
merged_df = spec_df.merge(obs_df, on='scientific_name', how='outer')
display(merged_df)

| | category | scientific_name | common_names | conservation_status | park_name | observations |
|---|---|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | Not Endangered | Bryce National Park | 130 |
| 1 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | Not Endangered | Yellowstone National Park | 270 |
| 2 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | Not Endangered | Great Smoky Mountains National Park | 98 |
| 3 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | Not Endangered | Yosemite National Park | 117 |
| 4 | Mammal | Bos bison | American Bison, Bison | Not Endangered | Yosemite National Park | 128 |
| ... | ... | ... | ... | ... | ... | ... |
| 23291 | Vascular Plant | Vitis californica | California Grape, California Wild Grape | Not Endangered | Yellowstone National Park | 237 |
| 23292 | Vascular Plant | Tribulus terrestris | Bullhead, Caltrop, Goathead, Mexican Sandbur, ... | Not Endangered | Great Smoky Mountains National Park | 50 |
| 23293 | Vascular Plant | Tribulus terrestris | Bullhead, Caltrop, Goathead, Mexican Sandbur, ... | Not Endangered | Yellowstone National Park | 239 |
| 23294 | Vascular Plant | Tribulus terrestris | Bullhead, Caltrop, Goathead, Mexican Sandbur, ... | Not Endangered | Bryce National Park | 111 |
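
An outer join keeps rows from either table that have no match in the other, so it can be worth checking that none slipped in.  The sketch below uses two real options of DataFrame.merge: validate='one_to_many', which raises an error if the left keys are not unique, and indicator=True, which adds a '_merge' column showing where each row came from.  The frames and names are hypothetical stand-ins for spec_df and obs_df.

```python
import pandas as pd

# Hypothetical stand-ins for spec_df and obs_df.
species = pd.DataFrame({
    'scientific_name': ['Species A', 'Species B'],
    'category': ['Mammal', 'Bird'],
})
observations = pd.DataFrame({
    'scientific_name': ['Species A', 'Species A', 'Species C'],
    'park_name': ['Park 1', 'Park 2', 'Park 1'],
    'observations': [10, 20, 5],
})

merged = species.merge(
    observations, on='scientific_name', how='outer',
    validate='one_to_many',  # fails loudly if a species name repeats on the left
    indicator=True,          # adds '_merge': left_only / right_only / both
)

# Rows that exist in only one of the two tables.
unmatched = merged[merged['_merge'] != 'both']
print(unmatched)
```

If unmatched is empty on the real data, an inner join would have produced the same result.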


It appears that there are repeated scientific names in this new frame as well.  However, we know that there are 4 different parks, so it makes sense that there can be up to 4 entries for each species.  To confirm that there aren't any actual duplicate observations, we can check for scientific names that occur more than 4 times.

In [245]:
print(merged_df.scientific_name.value_counts())
print(merged_df.scientific_name.value_counts().unique())

Columba livia             12
Castor canadensis         12
Holcus lanatus            12
Procyon lotor             12
Canis lupus               12
                          ..
Glyceria borealis          4
Cardamine dissecta         4
Sematophyllum demissum     4
Penstemon canescens        4
Tephrosia virginiana       4
Name: scientific_name, Length: 5541, dtype: int64
[12  8  4]
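
To list only the names that exceed 4 rows, the value counts can be filtered directly.  A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical merged frame: 'Species A' has 4 rows, 'Species B' has 8.
parks = ['Park 1', 'Park 2', 'Park 3', 'Park 4']
toy = pd.DataFrame({
    'scientific_name': ['Species A'] * 4 + ['Species B'] * 8,
    'park_name': parks * 3,
})

counts = toy['scientific_name'].value_counts()

# Boolean filtering on the counts keeps only names with more than 4 rows.
over_four = counts[counts > 4]
print(over_four)
```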


We can see that there are 5541 unique names, which is a great sign because this matches the number of rows in spec_df.  However, up to 12 rows share the same scientific name.  Since spec_df now has one row per name, the extra rows must come from obs_df, so we should look at an example of a set of these rows.

In [358]:
print(merged_df[merged_df.scientific_name == 'Columba livia'])

    category scientific_name common_names conservation_status  \
748     Bird   Columba livia    Rock Dove      Not Endangered   
749     Bird   Columba livia    Rock Dove      Not Endangered   
750     Bird   Columba livia    Rock Dove      Not Endangered   
751     Bird   Columba livia    Rock Dove      Not Endangered   
752     Bird   Columba livia    Rock Dove      Not Endangered   
753     Bird   Columba livia    Rock Dove      Not Endangered   
754     Bird   Columba livia    Rock Dove      Not Endangered   
755     Bird   Columba livia    Rock Dove      Not Endangered   
756     Bird   Columba livia    Rock Dove      Not Endangered   
757     Bird   Columba livia    Rock Dove      Not Endangered   
758     Bird   Columba livia    Rock Dove      Not Endangered   
759     Bird   Columba livia    Rock Dove      Not Endangered   

                               park_name  observations      protected  
748                  Bryce National Park           135  Not Protected  
749       

While there are 12 rows with the same scientific name, each has a different number of observations.  From the .value_counts() output we can see that all of the counts are multiples of 4.  Given that there are 4 different parks, it is likely that the sets of 4 represent different years or occasions on which each of the 4 parks was surveyed.  As such, duplicates do not seem to be an issue here, our merged data appears to be clean, and we can move on to our analysis.
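
One way to test the "sets of 4" reading is to count rows per species per park: if it holds, every species with 8 rows should have exactly 2 rows in each park, and every species with 12 rows exactly 3.  The sketch below shows the check on hypothetical data, along with one possible way (an assumption, not something done in this notebook) to collapse repeated rows into a single total per species and park.

```python
import pandas as pd

# Hypothetical species observed twice in each of four parks.
toy = pd.DataFrame({
    'scientific_name': ['Species A'] * 8,
    'park_name': ['Park 1', 'Park 2', 'Park 3', 'Park 4'] * 2,
    'observations': [10, 20, 30, 40, 1, 2, 3, 4],
})

# Number of rows for each (species, park) pair.
per_park = toy.groupby(['scientific_name', 'park_name']).size()

# Optional: sum the repeated rows into one total per species and park.
totals = (toy.groupby(['scientific_name', 'park_name'], as_index=False)
             ['observations'].sum())

print(per_park)
print(totals)
```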

## Analysis

Now that our data is clean, we can analyze it to address the questions that we raised in our scope.  These are:

- What is the distribution of conservation status for animals?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which species were spotted the most at each park?

### Distribution of Conservation Status for Animals

To answer the first question, we can use .value_counts with normalize=True to find the proportion that each conservation status represents in the data set.

In [356]:
status_prop = merged_df.conservation_status.value_counts(normalize=True)
print(status_prop)

Not Endangered        0.967033
Species of Concern    0.027644
Endangered            0.002919
Threatened            0.001889
In Recovery           0.000515
Name: conservation_status, dtype: float64


The output above shows the proportion of each conservation status across all rows of the merged data.  While this shows us that the vast majority of observations were of 'Not Endangered' species, it doesn't tell us much else.  We should find the distribution of conservation status based on category, which could give us more insight.
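
Because the merged frame holds 4, 8, or 12 rows per species, these proportions are weighted by row count rather than by species.  The sketch below, on hypothetical data, contrasts the row-level figure with a species-level one obtained by keeping one row per scientific name first.

```python
import pandas as pd

# Hypothetical frame: an endangered species with 4 rows and a
# non-endangered species with 12 rows.
toy = pd.DataFrame({
    'scientific_name': ['Species A'] * 4 + ['Species B'] * 12,
    'conservation_status': ['Endangered'] * 4 + ['Not Endangered'] * 12,
})

# Proportion of rows with each status.
row_level = toy['conservation_status'].value_counts(normalize=True)

# Proportion of species with each status: one row per name first.
species_level = (toy.drop_duplicates('scientific_name')
                    ['conservation_status']
                    .value_counts(normalize=True))

print(row_level)
print(species_level)
```

On the real data the two figures may be close, since most species have exactly 4 rows, but the species-level version answers "what share of species" more directly.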

In [353]:
merged_df['protected'] = merged_df.conservation_status.apply(
    lambda x: 'Protected' if x != 'Not Endangered' else 'Not Protected')
merged_df.groupby('category').conservation_status.value_counts(normalize=True).unstack()

| category | Not Endangered | Species of Concern | In Recovery | Threatened | Endangered |
|---|---|---|---|---|---|
| Amphibian | 0.9125 | 0.05 | | 0.025 | 0.0125 |
| Bird | 0.848369 | 0.138196 | 0.005758 | | 0.007678 |
| Fish | 0.905512 | 0.031496 | | 0.03937 | 0.023622 |
| Mammal | 0.82243 | 0.130841 | | 0.009346 | 0.037383 |
| Nonvascular Plant | 0.984985 | 0.015015 | | | |
| Reptile | 0.936709 | 0.063291 | | | |
| Vascular Plant | 0.989709 | 0.00962 | | 0.000447 | 0.000224 |


The table above gives us some interesting insights about the dataset.  First, mammals have the highest proportion of endangered species, followed by fish.  This could be for a variety of reasons, but my intuition is that mammals and fish are the most hunted categories, and that fish reproduce more.  So it makes sense that mammals are first and fish are second.  Interestingly, birds are the only category with any species "In Recovery", though I'm not sure why.  The two plant categories have the highest proportions of "Not Endangered" species.  This also makes sense, as plants reproduce easily and have fewer predators.
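
The 'protected' column created above collapses the four at-risk statuses into a single flag, which gives a simpler view of the same comparison.  A minimal sketch using pd.crosstab on hypothetical rows (the counts are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical rows standing in for merged_df.
toy = pd.DataFrame({
    'category': ['Mammal', 'Mammal', 'Bird', 'Bird', 'Bird', 'Fish'],
    'protected': ['Protected', 'Not Protected', 'Protected',
                  'Not Protected', 'Not Protected', 'Not Protected'],
})

# normalize='index' makes each row sum to 1, giving the share of
# protected and not-protected entries within each category.
share = pd.crosstab(toy['category'], toy['protected'], normalize='index')
print(share)
```

Unlike the unstacked value counts, crosstab fills combinations that never occur with 0 rather than NaN.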
