# Biodiversity Portfolio Project (Codecademy Data Science)

As part of the Codecademy Data Science course, I was tasked with analysing (lightly fictionalised) data from the US National Park Service about the species within its parks. The project is intended as an opportunity to demonstrate the data wrangling and data analysis competencies developed over the course.

The goal of my project is to answer whether certain categories of species are more likely to be granted conservation status and whether species with a conservation status are observed more or less often than species without a conservation status.

**Project Goals**

The goals of the project are to answer these two overarching questions:
- **Question 1**: Are certain categories of species more likely to have been granted a conservation status? If so, does this hold true for all of the conservation statuses?
- **Question 2**: Are species with conservation statuses observed more or less than species without conservation statuses? Does this change between national parks?

**Data**

The data is provided in two CSV files: `species_info.csv` and `observations.csv`. The first contains information about each of the species observed, and the second lists observations of each of those species by national park. Through data wrangling, I remove duplicate rows from the species data and sum the multiple observations in the observations data to create a usable dataframe, which is saved as `new_df.csv`.

**Analysis**

In analysing the data, a mixture of descriptive statistics, data visualisation, and inferential statistics is used in an attempt to answer the two questions asked at the outset.

**Evaluation**

The project concludes by evaluating whether the questions outlined above have been answered satisfactorily.

**Please note** that the **data wrangling** takes place in the **first** of these two notebooks (**this one**), and the **data analysis** takes place in the **second** of these two notebooks. 

# Data Wrangling

### Initial naive merge, sense checks

#### Importing modules

In [2]:
import numpy as np
import pandas as pd

#### Importing data and merging into a single dataframe

In [3]:
observations = pd.read_csv('observations.csv')
species = pd.read_csv('species_info.csv')
df = observations.merge(species, on='scientific_name')

df.head()

Unnamed: 0,scientific_name,park_name,observations,category,common_names,conservation_status
0,Vicia benghalensis,Great Smoky Mountains National Park,68,Vascular Plant,"Purple Vetch, Reddish Tufted Vetch",
1,Vicia benghalensis,Yosemite National Park,148,Vascular Plant,"Purple Vetch, Reddish Tufted Vetch",
2,Vicia benghalensis,Yellowstone National Park,247,Vascular Plant,"Purple Vetch, Reddish Tufted Vetch",
3,Vicia benghalensis,Bryce National Park,104,Vascular Plant,"Purple Vetch, Reddish Tufted Vetch",
4,Neovison vison,Great Smoky Mountains National Park,77,Mammal,American Mink,


#### Initial sense checks

In [28]:
print('There are {} species.'.format(df.scientific_name.nunique()))
print('\nThere are {} national parks, named {}.'.format(df.park_name.nunique(), df.park_name.unique()))
print('\nThere are {} categories, named {}.'.format(df.category.nunique(), df.category.unique()))
print('\nThere are {} conservation status categories, named {}.'.format(df.conservation_status.nunique(), df.conservation_status.unique()))
print('\nThere are {} rows in the species dataframe, of which {} are NaNs in the conservation status column.'.format(len(species.conservation_status), species.conservation_status.isna().sum()))
print('There are {} rows in the master dataframe, of which {} are NaNs in the conservation status column.'.format(len(df.conservation_status), df.conservation_status.isna().sum()))

There are 5541 species.

There are 4 national parks, named ['Great Smoky Mountains National Park' 'Yosemite National Park'
 'Yellowstone National Park' 'Bryce National Park'].

There are 7 categories, named ['Vascular Plant' 'Mammal' 'Bird' 'Nonvascular Plant' 'Amphibian'
 'Reptile' 'Fish'].

There are 4 conservation status categories, named [nan 'Species of Concern' 'Threatened' 'Endangered' 'In Recovery'].

There are 5824 rows in the species dataframe, of which 5633 are NaNs in the conservation status column.
There are 25632 rows in the master dataframe, of which 24752 are NaNs in the conservation status column.


#### Exploring the problems

These last two lines should give us pause: there are 5824 rows in the species dataframe and 25632 rows in the master dataframe.

This reveals two problems. First, the species dataframe has 5824 rows but only 5541 distinct species, so it contains many *duplicate species rows*. Second, with only 4 national parks we would expect at most 4 merged rows for each species row, yet the master dataframe has more than 4 times as many rows as the species dataframe, which shows that there are also many *duplicate observations rows*.
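Both kinds of duplicate can also be counted directly. The snippet below is a minimal sketch on hypothetical miniature tables (`toy_species` and `toy_observations` are invented for illustration and are not the real data). On the real `species` dataframe, the same count should equal 5824 - 5541 = 283 according to the figures printed above.

```python
import pandas as pd

# Hypothetical miniature versions of the two source tables (not the real data).
toy_species = pd.DataFrame({
    'category': ['Mammal', 'Mammal', 'Bird'],
    'scientific_name': ['Canis lupus', 'Canis lupus', 'Philomachus pugnax'],
    'common_names': ['Gray Wolf', 'Gray Wolf, Wolf', 'Ruff'],
})
toy_observations = pd.DataFrame({
    'scientific_name': ['Canis lupus', 'Canis lupus', 'Philomachus pugnax'],
    'park_name': ['Bryce National Park', 'Bryce National Park', 'Yosemite National Park'],
    'observations': [27, 29, 100],
})

# Species rows beyond the first row for each scientific name.
extra_species_rows = toy_species.duplicated(subset='scientific_name').sum()

# Observation rows beyond the first row for each species/park pair.
extra_observation_rows = toy_observations.duplicated(
    subset=['scientific_name', 'park_name']
).sum()

print(extra_species_rows)
print(extra_observation_rows)
```

The species count is the same quantity as `len(toy_species) - toy_species.scientific_name.nunique()`, which is the subtraction used in the text above.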

To explore the issue, let's narrow our search to only a small subset of species - only those listed as 'In Recovery' in the conservation status column.

In [5]:
species[species.conservation_status == 'In Recovery']

Unnamed: 0,category,scientific_name,common_names,conservation_status
100,Bird,Haliaeetus leucocephalus,Bald Eagle,In Recovery
3020,Mammal,Canis lupus,"Gray Wolf, Wolf",In Recovery
3143,Bird,Falco peregrinus anatum,American Peregrine Falcon,In Recovery
4565,Bird,Pelecanus occidentalis,Brown Pelican,In Recovery


In [6]:
df[df.conservation_status == 'In Recovery']

Unnamed: 0,scientific_name,park_name,observations,category,common_names,conservation_status
6009,Canis lupus,Yosemite National Park,35,Mammal,"Gray Wolf, Wolf",In Recovery
6012,Canis lupus,Bryce National Park,27,Mammal,"Gray Wolf, Wolf",In Recovery
6015,Canis lupus,Bryce National Park,29,Mammal,"Gray Wolf, Wolf",In Recovery
6018,Canis lupus,Bryce National Park,74,Mammal,"Gray Wolf, Wolf",In Recovery
6021,Canis lupus,Great Smoky Mountains National Park,15,Mammal,"Gray Wolf, Wolf",In Recovery
6024,Canis lupus,Yellowstone National Park,60,Mammal,"Gray Wolf, Wolf",In Recovery
6027,Canis lupus,Yellowstone National Park,67,Mammal,"Gray Wolf, Wolf",In Recovery
6030,Canis lupus,Yellowstone National Park,203,Mammal,"Gray Wolf, Wolf",In Recovery
6033,Canis lupus,Great Smoky Mountains National Park,14,Mammal,"Gray Wolf, Wolf",In Recovery
6036,Canis lupus,Yosemite National Park,117,Mammal,"Gray Wolf, Wolf",In Recovery


This is a very revealing initial look - demonstrating clearly that we have *multiple different observations for each national park*. For Canis lupus, there are 3 sets of data for each of the national parks! Let's see if we can find more issues with the data by focusing on one species - Canis lupus.

In [7]:
species[species.scientific_name == 'Canis lupus']

Unnamed: 0,category,scientific_name,common_names,conservation_status
8,Mammal,Canis lupus,Gray Wolf,Endangered
3020,Mammal,Canis lupus,"Gray Wolf, Wolf",In Recovery
4448,Mammal,Canis lupus,"Gray Wolf, Wolf",Endangered


In [8]:
observations[observations.scientific_name == 'Canis lupus']

Unnamed: 0,scientific_name,park_name,observations
1294,Canis lupus,Yosemite National Park,35
1766,Canis lupus,Bryce National Park,27
7346,Canis lupus,Bryce National Park,29
9884,Canis lupus,Bryce National Park,74
10190,Canis lupus,Great Smoky Mountains National Park,15
10268,Canis lupus,Yellowstone National Park,60
10907,Canis lupus,Yellowstone National Park,67
13427,Canis lupus,Yellowstone National Park,203
17756,Canis lupus,Great Smoky Mountains National Park,14
19330,Canis lupus,Yosemite National Park,117


In [9]:
df[df.scientific_name == 'Canis lupus']

Unnamed: 0,scientific_name,park_name,observations,category,common_names,conservation_status
6008,Canis lupus,Yosemite National Park,35,Mammal,Gray Wolf,Endangered
6009,Canis lupus,Yosemite National Park,35,Mammal,"Gray Wolf, Wolf",In Recovery
6010,Canis lupus,Yosemite National Park,35,Mammal,"Gray Wolf, Wolf",Endangered
6011,Canis lupus,Bryce National Park,27,Mammal,Gray Wolf,Endangered
6012,Canis lupus,Bryce National Park,27,Mammal,"Gray Wolf, Wolf",In Recovery
6013,Canis lupus,Bryce National Park,27,Mammal,"Gray Wolf, Wolf",Endangered
6014,Canis lupus,Bryce National Park,29,Mammal,Gray Wolf,Endangered
6015,Canis lupus,Bryce National Park,29,Mammal,"Gray Wolf, Wolf",In Recovery
6016,Canis lupus,Bryce National Park,29,Mammal,"Gray Wolf, Wolf",Endangered
6017,Canis lupus,Bryce National Park,74,Mammal,Gray Wolf,Endangered


From the above, we can clearly see that merging our datasets without first dealing with the issues involved was a mistake; our duplicates have compounded to create an unwieldy dataset.
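To make the compounding concrete, here is a small sketch on hypothetical toy tables (the `toy_` names are invented for illustration): 3 observation rows merged with 3 species rows for the same scientific name produce 3 x 3 = 9 rows. It also shows one possible safeguard, the `validate` argument of `merge`, which raises a `MergeError` when the right-hand keys are not unique. This is offered as an optional check, not as something the original workflow used.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical toy data mirroring the Canis lupus rows at Bryce National Park.
toy_observations = pd.DataFrame({
    'scientific_name': ['Canis lupus'] * 3,
    'park_name': ['Bryce National Park'] * 3,
    'observations': [27, 29, 74],
})
toy_species = pd.DataFrame({
    'category': ['Mammal'] * 3,
    'scientific_name': ['Canis lupus'] * 3,
    'common_names': ['Gray Wolf', 'Gray Wolf, Wolf', 'Gray Wolf, Wolf'],
    'conservation_status': ['Endangered', 'In Recovery', 'Endangered'],
})

# A naive merge pairs every observation row with every matching species row.
naive = toy_observations.merge(toy_species, on='scientific_name')
print(len(naive))  # 3 observation rows x 3 species rows = 9

# Asking pandas to verify that each species appears once on the right-hand side.
try:
    toy_observations.merge(toy_species, on='scientific_name', validate='many_to_one')
    merge_is_safe = True
except MergeError:
    merge_is_safe = False

print(merge_is_safe)  # False: the species table has repeated scientific names
```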

We have two distinct problems - one from each of the initial datasets - that have combined in the 'df' dataframe above.

First, in species, we have duplicates of our species because of different common names and conservation statuses put into the system.

Second, in observations, we often have multiple observations for the same park. I don't believe this is a 'mistake' per se - it might be the observations for a given year, for example - but because we don't know how the data was gathered, it seems to me that the correct thing to do is to combine all the observations into a 'total observations' column. Because we don't know the significance of the disaggregation of the data, it seems safer to sum it into a single column for easier analysis.

### Fixing 'observations' (summing multiple observations)

In [10]:
# Summing the observations for each species/park
total_observations = observations.groupby(['scientific_name', 'park_name'], as_index=False).sum()
# Dropping duplicates (a safeguard only: groupby already returns one row per species/park pair)
new_observations = total_observations.drop_duplicates(subset=['scientific_name', 'park_name'])
# Renaming column
new_observations = new_observations.rename(columns={'observations': 'total_observations'})

new_observations.head(10)

Unnamed: 0,scientific_name,park_name,total_observations
0,Abies bifolia,Bryce National Park,109
1,Abies bifolia,Great Smoky Mountains National Park,72
2,Abies bifolia,Yellowstone National Park,215
3,Abies bifolia,Yosemite National Park,136
4,Abies concolor,Bryce National Park,83
5,Abies concolor,Great Smoky Mountains National Park,101
6,Abies concolor,Yellowstone National Park,241
7,Abies concolor,Yosemite National Park,205
8,Abies fraseri,Bryce National Park,109
9,Abies fraseri,Great Smoky Mountains National Park,81


This seems to have been successful. Let's just briefly check that we've successfully reduced the number of rows without reducing the number of species.

In [11]:
print(len(observations.scientific_name))
print(len(new_observations))

23296
22164


In [12]:
print(observations.scientific_name.nunique())
print(new_observations.scientific_name.nunique())

5541
5541


Now we can look again at Canis lupus - let's compare 'observations' with 'new_observations'.

In [13]:
observations[observations.scientific_name == 'Canis lupus']

Unnamed: 0,scientific_name,park_name,observations
1294,Canis lupus,Yosemite National Park,35
1766,Canis lupus,Bryce National Park,27
7346,Canis lupus,Bryce National Park,29
9884,Canis lupus,Bryce National Park,74
10190,Canis lupus,Great Smoky Mountains National Park,15
10268,Canis lupus,Yellowstone National Park,60
10907,Canis lupus,Yellowstone National Park,67
13427,Canis lupus,Yellowstone National Park,203
17756,Canis lupus,Great Smoky Mountains National Park,14
19330,Canis lupus,Yosemite National Park,117


In [14]:
new_observations[new_observations.scientific_name == 'Canis lupus']

Unnamed: 0,scientific_name,park_name,total_observations
3216,Canis lupus,Bryce National Park,130
3217,Canis lupus,Great Smoky Mountains National Park,59
3218,Canis lupus,Yellowstone National Park,330
3219,Canis lupus,Yosemite National Park,196


We have successfully summed our observations together to give 'total_observations'. Let's move on to fixing the 'species' dataset.
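One further check that could be run at this point is that summing has not lost or double-counted anything: the grand total of observations should be identical before and after grouping. The sketch below uses hypothetical toy data and also records how many raw records went into each total (the `n_records` column is my own addition for illustration and is not part of the original workflow).

```python
import pandas as pd

# Hypothetical toy observations (not the real data).
toy_observations = pd.DataFrame({
    'scientific_name': ['Canis lupus'] * 4 + ['Abies bifolia'],
    'park_name': ['Bryce National Park'] * 3
                 + ['Yosemite National Park', 'Bryce National Park'],
    'observations': [27, 29, 74, 35, 109],
})

# Sum per species/park pair, and keep a count of the rows that were combined.
toy_totals = (
    toy_observations
    .groupby(['scientific_name', 'park_name'], as_index=False)
    .agg(total_observations=('observations', 'sum'),
         n_records=('observations', 'count'))
)

# The grand total must be unchanged by the aggregation.
assert toy_totals['total_observations'].sum() == toy_observations['observations'].sum()
# Each species/park pair must now appear exactly once.
assert not toy_totals.duplicated(subset=['scientific_name', 'park_name']).any()

print(toy_totals)
```

Keeping a record count alongside the sum is one way of not entirely discarding the information that the data was originally disaggregated.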

### Fixing 'species' (removing duplicate species rows)

Let's look again at the species dataset, specifically the problems that we saw when we looked at Canis lupus.

In [15]:
species[species.scientific_name == 'Canis lupus']

Unnamed: 0,category,scientific_name,common_names,conservation_status
8,Mammal,Canis lupus,Gray Wolf,Endangered
3020,Mammal,Canis lupus,"Gray Wolf, Wolf",In Recovery
4448,Mammal,Canis lupus,"Gray Wolf, Wolf",Endangered


We have two separate 'problems' here: duplicates because of different common names, and duplicates because of different conservation statuses.

#### Removing duplicates with different common names (keeping the longest common name in each instance)

In [16]:
# Sorting species by length of common names
species.sort_values(by=['common_names'], key=lambda x: x.str.len(), inplace=True)
species.head()

Unnamed: 0,category,scientific_name,common_names,conservation_status
861,Vascular Plant,Iva annua,Iva,
1434,Vascular Plant,Quercus,Oak,
151,Bird,Philomachus pugnax,Ruff,
2113,Vascular Plant,Juncus dichotomus,Rush,
2527,Vascular Plant,Pyrus,Pear,


In [17]:
# Dropping duplicates, keeping the longest common name in each instance
new_species = species.drop_duplicates(subset=['category', 'scientific_name', 'conservation_status'], keep='last')

Let's check that we've successfully removed duplicates with different common names (keeping the longest common name in each instance) - we can check both Canis lupus and the length of the new_species dataframe compared to the species dataframe.

In [18]:
new_species[new_species.scientific_name == 'Canis lupus']

Unnamed: 0,category,scientific_name,common_names,conservation_status
3020,Mammal,Canis lupus,"Gray Wolf, Wolf",In Recovery
4448,Mammal,Canis lupus,"Gray Wolf, Wolf",Endangered


In [19]:
print(len(species))
print(len(new_species))
print(len(species) - len(new_species))

5824
5543
281


We have successfully removed 281 duplicate species rows.
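The sort-then-drop approach can be illustrated on a handful of rows. This is a sketch on hypothetical toy data, written without `inplace=True` so that the source table keeps its original order. The `kind='mergesort'` argument is my own addition to make ties between equally long names resolve predictably.

```python
import pandas as pd

# Hypothetical toy species table (not the real data).
toy_species = pd.DataFrame({
    'category': ['Mammal', 'Mammal', 'Mammal', 'Bird'],
    'scientific_name': ['Canis lupus', 'Canis lupus', 'Canis lupus',
                        'Philomachus pugnax'],
    'common_names': ['Gray Wolf', 'Gray Wolf, Wolf', 'Gray Wolf, Wolf', 'Ruff'],
    'conservation_status': ['Endangered', 'In Recovery', 'Endangered', None],
})

deduped = (
    toy_species
    # Shortest common names first, so the longest ends up last in each group.
    .sort_values(by='common_names', key=lambda s: s.str.len(), kind='mergesort')
    # Within each category/species/status group, keep only the last (longest) row.
    .drop_duplicates(subset=['category', 'scientific_name', 'conservation_status'],
                     keep='last')
)

print(deduped)
print(len(toy_species) - len(deduped))  # number of rows removed
```

Here the 'Gray Wolf' / Endangered row is dropped in favour of the 'Gray Wolf, Wolf' / Endangered row, while the 'In Recovery' row survives because its conservation status differs.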

#### Thinking about duplicates with different conservation statuses 

There is a second kind of duplicate - species with different conservation statuses. Let's take a look at how many species this affects.

In [20]:
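# Flag rows that repeat an earlier category/species/common-name combination
duplicates = new_species.duplicated(subset=['category', 'scientific_name', 'common_names'])
# Keep only the flagged rows (the boolean Series can be used as its own mask)
conservation_duplicates = duplicates[duplicates]
conservation_duplicates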

560     True
4448    True
dtype: bool

As we can see from the above, there are only two species in the dataset that are duplicated with different conservation statuses. Because there is no way of telling which is the 'right' conservation status, it seems to me that we should leave both of these as they are. The consequence is that these two species will each contribute two rows per park to the merged dataframe.
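To see the conflicting statuses side by side rather than just the index of the second occurrence, `duplicated` can be called with `keep=False`, which flags every member of a duplicate group. A minimal sketch on hypothetical toy data (the `toy_new_species` table is invented for illustration):

```python
import pandas as pd

# Hypothetical toy table standing in for new_species (not the real data).
toy_new_species = pd.DataFrame({
    'category': ['Mammal', 'Mammal', 'Bird'],
    'scientific_name': ['Canis lupus', 'Canis lupus', 'Haliaeetus leucocephalus'],
    'common_names': ['Gray Wolf, Wolf', 'Gray Wolf, Wolf', 'Bald Eagle'],
    'conservation_status': ['In Recovery', 'Endangered', 'In Recovery'],
})

key = ['category', 'scientific_name', 'common_names']

# keep=False marks all rows in each duplicate group, not just the repeats.
conflicts = toy_new_species[toy_new_species.duplicated(subset=key, keep=False)]

# Collect the competing statuses for each affected species.
statuses_per_species = conflicts.groupby('scientific_name')['conservation_status'].agg(list)

print(conflicts)
print(statuses_per_species)
```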

### Merging the dataset again

In [21]:
new_df = new_observations.merge(new_species, on='scientific_name')
new_df.head(10)

Unnamed: 0,scientific_name,park_name,total_observations,category,common_names,conservation_status
0,Abies bifolia,Bryce National Park,109,Vascular Plant,Rocky Mountain Alpine Fir,
1,Abies bifolia,Great Smoky Mountains National Park,72,Vascular Plant,Rocky Mountain Alpine Fir,
2,Abies bifolia,Yellowstone National Park,215,Vascular Plant,Rocky Mountain Alpine Fir,
3,Abies bifolia,Yosemite National Park,136,Vascular Plant,Rocky Mountain Alpine Fir,
4,Abies concolor,Bryce National Park,83,Vascular Plant,"Balsam Fir, Colorado Fir, Concolor Fir, Silver...",
5,Abies concolor,Great Smoky Mountains National Park,101,Vascular Plant,"Balsam Fir, Colorado Fir, Concolor Fir, Silver...",
6,Abies concolor,Yellowstone National Park,241,Vascular Plant,"Balsam Fir, Colorado Fir, Concolor Fir, Silver...",
7,Abies concolor,Yosemite National Park,205,Vascular Plant,"Balsam Fir, Colorado Fir, Concolor Fir, Silver...",
8,Abies fraseri,Bryce National Park,109,Vascular Plant,Fraser Fir,Species of Concern
9,Abies fraseri,Great Smoky Mountains National Park,81,Vascular Plant,Fraser Fir,Species of Concern


One last sense check: let's check that the length of our new dataframe is no more than 4 times the length of our new species dataframe.

In [22]:
print(len(new_df))
print(len(new_species))
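# True division on the actual lengths: integer division of hard-coded numbers would hide a ratio such as 4.4
print(len(new_df) / len(new_species))

22172
5543
4.0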


Success!
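A slightly stricter version of this check predicts the merged length exactly: each observations row should produce as many merged rows as there are species rows for its scientific name. The sketch below uses hypothetical toy tables. On the real data, the merged dataframe has 22172 - 22164 = 8 more rows than `new_observations`, which is consistent with the two species that kept two conservation statuses each appearing in all 4 parks.

```python
import pandas as pd

# Hypothetical toy tables standing in for new_observations and new_species.
toy_new_observations = pd.DataFrame({
    'scientific_name': ['Canis lupus', 'Canis lupus',
                        'Abies bifolia', 'Abies bifolia'],
    'park_name': ['Bryce National Park', 'Yosemite National Park'] * 2,
    'total_observations': [130, 196, 109, 136],
})
toy_new_species = pd.DataFrame({
    'category': ['Mammal', 'Mammal', 'Vascular Plant'],
    'scientific_name': ['Canis lupus', 'Canis lupus', 'Abies bifolia'],
    'conservation_status': ['In Recovery', 'Endangered', None],
})

# How many species rows each scientific name has (2 for the two-status species).
rows_per_species = toy_new_species['scientific_name'].value_counts()

# Each observations row is repeated once per matching species row.
expected_rows = int(toy_new_observations['scientific_name'].map(rows_per_species).sum())

toy_new_df = toy_new_observations.merge(toy_new_species, on='scientific_name')
extra_rows = len(toy_new_df) - len(toy_new_observations)

assert len(toy_new_df) == expected_rows
print(expected_rows)
print(extra_rows)
```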

#### Exporting to csv

In [30]:
new_df.to_csv('new_df.csv')
new_species.to_csv('new_species.csv')