## Data Cleaning for Biodiversity in National Parks

In [267]:
#libraries
import pandas as pd
pd.set_option('display.max_rows', None)

In [268]:
#read the csv files and look at the head and shape to get a feel for the data
observe = pd.read_csv('observations.csv')
print(observe.head())
print(observe.shape)
species = pd.read_csv('species_info.csv')
print(species.head())
print(species.shape)

            scientific_name                            park_name  observations
0        Vicia benghalensis  Great Smoky Mountains National Park            68
1            Neovison vison  Great Smoky Mountains National Park            77
2         Prunus subcordata               Yosemite National Park           138
3      Abutilon theophrasti                  Bryce National Park            84
4  Githopsis specularioides  Great Smoky Mountains National Park            85
(23296, 3)
  category                scientific_name  \
0   Mammal  Clethrionomys gapperi gapperi   
1   Mammal                      Bos bison   
2   Mammal                     Bos taurus   
3   Mammal                     Ovis aries   
4   Mammal                 Cervus elaphus   

                                        common_names conservation_status  
0                           Gapper's Red-Backed Vole                 NaN  
1                              American Bison, Bison                 NaN  
2  Aurochs, Aurochs

The data comes in two separate CSV files. observations.csv holds the scientific name of each species, the park where it was observed, and a count of observations. species_info.csv holds the category, scientific name, common names, and conservation status. The two should be joined on scientific_name.
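Before joining, it can help to check whether the join key repeats in the lookup table, since each repeated key multiplies the matching rows in the merge. The following is a minimal sketch on toy data: `observe_demo`, `species_demo`, and all values in them are made up for illustration and are not taken from the real files.

```python
import pandas as pd

# Toy stand-ins for the two files (hypothetical values, not the real data)
observe_demo = pd.DataFrame({
    'scientific_name': ['Bos bison', 'Bos bison', 'Ovis aries'],
    'park_name': ['Yosemite National Park', 'Bryce National Park',
                  'Bryce National Park'],
    'observations': [120, 85, 90],
})
species_demo = pd.DataFrame({
    'category': ['Mammal', 'Mammal', 'Mammal'],
    'scientific_name': ['Bos bison', 'Bos bison', 'Ovis aries'],
    'common_names': ['American Bison', 'Bison', 'Domestic Sheep'],
})

# Count keys that appear more than once on the lookup side
repeated_keys = int(species_demo['scientific_name'].duplicated().sum())
print(repeated_keys)

# indicator=True adds a '_merge' column saying whether each row was found
# in the left frame only, the right frame only, or both
merged_demo = pd.merge(observe_demo, species_demo, how='outer',
                       on='scientific_name', indicator=True)
print(merged_demo.shape)
print(merged_demo['_merge'].value_counts())
```

In this toy case the two 'Bos bison' observations each match two species rows, so the merged frame has more rows than `observe_demo`.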

In [269]:
#outer merge of the two dataframes on scientific_name, keeping all rows
biodiverse = pd.merge(observe, species, how='outer', on='scientific_name')
print(biodiverse.head())

      scientific_name                            park_name  observations  \
0  Vicia benghalensis  Great Smoky Mountains National Park            68   
1  Vicia benghalensis               Yosemite National Park           148   
2  Vicia benghalensis            Yellowstone National Park           247   
3  Vicia benghalensis                  Bryce National Park           104   
4      Neovison vison  Great Smoky Mountains National Park            77   

         category                        common_names conservation_status  
0  Vascular Plant  Purple Vetch, Reddish Tufted Vetch                 NaN  
1  Vascular Plant  Purple Vetch, Reddish Tufted Vetch                 NaN  
2  Vascular Plant  Purple Vetch, Reddish Tufted Vetch                 NaN  
3  Vascular Plant  Purple Vetch, Reddish Tufted Vetch                 NaN  
4          Mammal                       American Mink                 NaN  


In [270]:
#drop duplicates
biodiverse.drop_duplicates(inplace=True)
#check for missing values
print(biodiverse.isna().sum())


scientific_name            0
park_name                  0
observations               0
category                   0
common_names               0
conservation_status    24721
dtype: int64


There are a large number of missing values in the conservation_status column. Why is that?

In [271]:
#check what values exist in conservation status
print(biodiverse.conservation_status.unique())
#check whether the missing values are concentrated in a particular category
print(biodiverse['category'][biodiverse['conservation_status'].isna()].value_counts())

#check whether any rows are duplicated in every column apart from common_names
print('biodiverse dataframe shape')
print(biodiverse.shape)
print('biodiverse duplicates dataframe shape')
print(biodiverse[biodiverse.duplicated(subset=['scientific_name', 'park_name', 'observations', 'category', 'conservation_status'], keep='first')].shape)
#a high number of rows are duplicated apart from the common name

[nan 'Species of Concern' 'Threatened' 'Endangered' 'In Recovery']
Vascular Plant       19350
Bird                  2013
Nonvascular Plant     1312
Mammal                 966
Fish                   476
Reptile                304
Amphibian              300
Name: category, dtype: int64
biodiverse dataframe shape
(25601, 6)
biodiverse duplicates dataframe shape
(2300, 6)


Since the conservation status values do not seem to include a 'least concern' category, it's likely that at least some of the NaN values belong to species that do not have a conservation status at this time. For that reason, for now we'll replace them with the label 'No Status'. Most of the rows without a status are vascular plants (19350 of 24721), which are also less likely to be tracked on a conservation status list.
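As a small illustration of the replacement step, the sketch below fills missing values in a single column by passing a dictionary to `fillna`, so other columns are left alone. The frame `status_demo` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only conservation_status contains missing values
status_demo = pd.DataFrame({
    'scientific_name': ['Vicia benghalensis', 'Canis lupus', 'Neovison vison'],
    'conservation_status': [np.nan, 'Endangered', np.nan],
})

# Count the missing values before filling them
n_missing = int(status_demo['conservation_status'].isna().sum())

# The dict maps column name -> fill value, so only that column is touched
status_demo = status_demo.fillna(value={'conservation_status': 'No Status'})
counts = status_demo['conservation_status'].value_counts()
print(n_missing)
print(counts)
```

Using an explicit label rather than leaving NaN keeps these rows visible in `value_counts()` and `groupby` results, which skip missing values by default.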

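There are also 2300 rows that duplicate another row in every column apart from common_names. This suggests that some scientific names appear more than once in species_info.csv with different common names, so the merge repeated their observation rows. These are likely not unique observations, so we're going to drop them from our dataset, keeping the first row of each group.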
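The effect of deduplicating on a subset of columns can be seen on a toy frame. This is only a sketch: `demo`, `key_cols`, and the values are assumptions made for illustration.

```python
import pandas as pd

# Two rows describe the same observation under different common names
demo = pd.DataFrame({
    'scientific_name': ['Bos bison', 'Bos bison', 'Ovis aries'],
    'park_name': ['Bryce', 'Bryce', 'Bryce'],
    'observations': [85, 85, 90],
    'common_names': ['American Bison', 'Bison', 'Domestic Sheep'],
})

# Compare rows on every column except common_names
key_cols = ['scientific_name', 'park_name', 'observations']

# duplicated() flags the second and later rows of each matching group
n_dupes = int(demo.duplicated(subset=key_cols, keep='first').sum())

# drop_duplicates() keeps the first row of each group by default
deduped = demo.drop_duplicates(subset=key_cols)
print(n_dupes)
print(deduped)
```

One consequence of this choice is that the common name kept for each species is simply whichever appeared first; the alternative names are discarded.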

In [272]:
#fill nan values for biodiverse.conservation_status with 'No Status'
biodiverse.fillna(value={'conservation_status': 'No Status'}, inplace=True)
print(biodiverse.conservation_status.value_counts())

No Status             24721
Species of Concern      732
Endangered               80
Threatened               44
In Recovery              24
Name: conservation_status, dtype: int64


In [273]:
#remove the rows that are duplicates apart from common name
biodiverse.drop_duplicates(subset=['scientific_name', 'park_name', 'observations', 'category', 'conservation_status'], inplace = True)
print(biodiverse.shape)

(23301, 6)


In [274]:
#check data types and possible issues
print(biodiverse.dtypes)
print(biodiverse.nunique())
print(biodiverse.shape)
print(biodiverse.groupby(['park_name']).nunique())

scientific_name        object
park_name              object
observations            int64
category               object
common_names           object
conservation_status    object
dtype: object
scientific_name        5541
park_name                 4
observations            304
category                  7
common_names           5230
conservation_status       5
dtype: int64
(23301, 6)
                                     scientific_name  observations  category  \
park_name                                                                      
Bryce National Park                             5541           142         7   
Great Smoky Mountains National Park             5541           129         7   
Yellowstone National Park                       5541           149         7   
Yosemite National Park                          5541           151         7   

                                     common_names  conservation_status  
park_name                                                   

In [275]:
#remove 'National Park' from park_name column
biodiverse['park_name']=biodiverse['park_name'].replace(' National Park', '', regex=True)
print(biodiverse.groupby(['park_name']).nunique())

                       scientific_name  observations  category  common_names  \
park_name                                                                      
Bryce                             5541           142         7          5230   
Great Smoky Mountains             5541           129         7          5230   
Yellowstone                       5541           149         7          5230   
Yosemite                          5541           151         7          5230   

                       conservation_status  
park_name                                   
Bryce                                    5  
Great Smoky Mountains                    5  
Yellowstone                              5  
Yosemite                                 5  


We can now see that the data covers four national parks, with 5541 unique scientific names recorded in each. The data has been cleaned and tidied, so we can now save it to a new CSV file, 'biodiversity_data.csv'.

In [276]:
#save the cleaned data; by default the dataframe index is written as the first, unnamed column
biodiverse.to_csv('biodiversity_data.csv')
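After saving, a quick round trip can confirm that the file reads back as expected. The sketch below uses an in-memory buffer and a hypothetical two-row frame (`demo`) instead of the real file. Because `to_csv` writes the index by default, the reader passes `index_col=0` to restore it.

```python
import io

import pandas as pd

# Hypothetical frame with a non-contiguous index, as left behind by dropping rows
demo = pd.DataFrame(
    {'park_name': ['Bryce', 'Yosemite'], 'observations': [85, 120]},
    index=[0, 4],
)

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# index_col=0 tells read_csv that the first column holds the saved index
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

If the saved index is not needed downstream, passing `index=False` to `to_csv` would omit it from the file.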