# Biodiversity Project

This project is part of the Codecademy course "Data Scientist: Natural Language Processing Career Path".

This project interprets data from the National Park Service about endangered species in different parks in order to look for patterns among those species.

### Goals

The aim of the project is to investigate patterns or themes in the types of species that become endangered, and then to assess how likely they are to become extinct.

### Resources

Data from the National Park Service:

- observations.csv

- species_info.csv

For classification status:

- nps.gov

- fisheries.noaa.gov

- wikipedia.org

- maine.gov

Other:

- wikipedia.org

### Analysis


The analysis aims to answer the following questions:

- What is the distribution of `conservation_status` for animals?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which species were spotted the most at each park?



## 1. Importing Python Libraries

This project uses the Seaborn and Matplotlib libraries to plot data, and Pandas and NumPy for data manipulation.


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## 2. Loading the data


In [None]:
species = pd.read_csv("species_info.csv")
observations = pd.read_csv("observations.csv")

## 3. Distribution of the conservation status 

Conservation status values are in the `species` dataset.

### 3.1 Inspecting the `species` dataset

In the `species` dataset there are four columns: `category`, `scientific_name`, `common_names`, and `conservation_status`.

The first five entries of the `conservation_status` column have no data.

In [None]:
print(species.columns)
print(species.head(10))


As seen below, most entries in the `conservation_status` column (about 97%) are missing.

In [None]:
print(species.info())

In [None]:
species.describe(include='all')

### 3.1.1 Handling missing data

Only 191 of the 5824 rows have a non-null value in the `conservation_status` column, so it is crucial to decide how to handle the missing data.

In [None]:
species.conservation_status.isna().value_counts()

In [None]:
species.conservation_status.value_counts()
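The share of missing values can also be computed directly rather than read off the counts. The sketch below is a minimal illustration on a toy DataFrame; `toy_species` and its values are hypothetical stand-ins for `species_info.csv`, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for species_info.csv (hypothetical values, not the real data)
toy_species = pd.DataFrame({
    "category": ["Mammal", "Bird", "Bird", "Fish"],
    "scientific_name": ["Canis lupus", "Pandion haliaetus",
                        "Columba livia", "Salmo trutta"],
    "conservation_status": ["Endangered", np.nan, np.nan, np.nan],
})

# isna() yields booleans, so their mean is the fraction of missing entries
missing_share = toy_species["conservation_status"].isna().mean()
print(f"Missing conservation_status: {missing_share:.1%}")
```

With the figures reported in this section (191 non-null values among 5824 rows), the same calculation gives about 96.7%, consistent with the "about 97%" quoted above.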

In [None]:
# getting to know species that have a value for the conservation_status column
species_conservation = species[species.conservation_status.notna()]
species_conservation = species_conservation.rename(columns={'scientific_name': 'scientific', 'common_names': 'nicknames',\
                                     'conservation_status': 'status'})
print(species_conservation.scientific.nunique())
species_conservation.scientific.unique()


In [None]:
# getting to know species that have no value for the conservation_status column
species_nan = species[species.conservation_status.isna()]
species_nan = species_nan.rename(columns={'scientific_name': 'scientific', 'common_names': 'nicknames',
                                          'conservation_status': 'status'})
# only the last expression of a cell is displayed, so the preview is printed explicitly
print(species_nan.head())
species_nan.duplicated().value_counts()

#### Some analysis

Because so many values are missing, listwise deletion is not a general option: it would reduce the sample size too much. It will still be required, however, when assessing the distribution of conservation status for animals.

On the other hand, even when the `conservation_status` value is missing, the information in the other columns (`category`, `scientific_name`, and `common_names`) might be important when analysing the `observations` dataset.

### 3.1.2 Handling duplicated data

Since each species has its own unique scientific name, the number of unique values should equal the number of rows (5824). However, there are only 5541 unique values in the `scientific_name` column, which reveals the existence of duplicated entries.



### 3.1.2.1 Inspecting `scientific_name` duplicated data

In [None]:
species1 = species.rename(columns={'scientific_name': 'scientific',  
                        'common_names': 'nicknames', 'conservation_status':'status'})

# checking for fully duplicated rows; any() is True if at least one row repeats an earlier one
print("Are there duplicated rows?", species1.duplicated().any())

# checking duplicated values in the scientific column

species_dup_scientific = species1[species1.duplicated(subset=['scientific'], keep=False)].\
                     sort_values(by='scientific')

species_dup_scientific.describe(include='all')

In [None]:
# duplicated values in the scientific_name column
species_dup_scientific[species_dup_scientific.duplicated(subset=['scientific'], keep=False)].\
   sort_values(by='scientific')

In [None]:
# duplicated values in the scientific_name AND common_names columns
species_dup_scientific[species_dup_scientific.duplicated(subset=['scientific', 'nicknames'], keep=False)]

In [None]:
species_dup_scientific[species_dup_scientific.duplicated(subset=['nicknames'], keep=False)]

In [None]:
species_dup_scientific[(species_dup_scientific.scientific == 'Silene vulgaris') |  \
                       (species_dup_scientific.nicknames == 'Bladder Campion') | \
                      (species_dup_scientific.scientific == 'Silene latifolia ssp. alba')]

#### Some findings
Unexpectedly, the 557 rows with duplicated values in the `scientific_name` column contain only 274 unique values.
It turns out that most duplicated rows for the same `scientific_name` are due to the `common_names` values recorded: for these scientific names there are at least two different `common_names` records. For example, there are two entries for the `scientific_name` value 'Agrostis capillaris':
- the first entry has 'Colonial Bent, Colonial Bentgrass' in the `common_names` column;
- the second entry has 'Rhode Island Bent' in the `common_names` column.

On the other hand, where both the `scientific_name` and the `common_names` values are duplicated, the entries differ in the `conservation_status` column. For 'Canis lupus', there are two different `conservation_status` values ('In Recovery' and 'Endangered').

There is a single case of a `common_names` value, 'Bladder Campion', that is attached to two different `scientific_name` values: 'Silene latifolia ssp. alba' and 'Silene vulgaris'. Each of them has two different entries in the `common_names` column.
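The two causes of duplication described above (different common names versus conflicting statuses) can be told apart in one pass by counting distinct values per scientific name. This is a minimal sketch on hypothetical toy rows; the frame `toy` and the summary column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy rows using the renamed columns (scientific, nicknames, status)
toy = pd.DataFrame({
    "scientific": ["Agrostis capillaris", "Agrostis capillaris",
                   "Canis lupus", "Canis lupus", "Salmo trutta"],
    "nicknames": ["Colonial Bent, Colonial Bentgrass", "Rhode Island Bent",
                  "Gray Wolf", "Gray Wolf", "Brown Trout"],
    "status": [None, None, "Endangered", "In Recovery", None],
})

# One summary row per scientific name
per_species = toy.groupby("scientific").agg(
    rows=("nicknames", "size"),               # how many records share the name
    distinct_names=("nicknames", "nunique"),  # different common_names strings
    distinct_status=("status", "nunique"),    # different non-null statuses
)

# Keep only scientific names that appear more than once
dups = per_species[per_species["rows"] > 1]
print(dups)
```

A name with `distinct_names > 1` is duplicated because of its common names, while `distinct_status > 1` flags a genuine conflict that needs a manual decision.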

### 3.1.2.2 Inspecting `common_names` duplicated data

Common names refer to the usual name of a species. Unlike a scientific name, a common name often applies to multiple species.

In [None]:
#checking duplicated values in the common_names column

species_dup_nicknames = species1[species1.duplicated(subset=['nicknames'], keep=False)].\
                     sort_values(by='nicknames')

species_dup_nicknames.describe(include='all')


In [None]:
# Getting to know duplicate data in the scientific_name column AND nicknames column

species_dup_nicknames[species_dup_nicknames.duplicated(subset=['scientific'], keep=False)]

In [None]:
# Getting to know duplicated data in the common_names column
species_dup_nicknames

In [None]:
species_dup_nicknames[species_dup_nicknames['nicknames'] == 'Brachythecium Moss']

In [None]:
# Getting to know common names that appear in more than two entries
spe_dup_nick_over2 = species_dup_nicknames[species_dup_nicknames.duplicated(subset=['nicknames'], keep = False)]
spe_dup_nick_over2 = spe_dup_nick_over2.groupby('nicknames').count().reset_index().sort_values(by='category')
spe_dup_nick_over2.rename(columns={'category': 'counts'}, inplace=True)
spe_dup_nick_over2 = spe_dup_nick_over2[spe_dup_nick_over2.counts > 2].sort_values(by='counts', ascending=False)
spe_dup_nick_over2.describe(include='all')


##### Some findings

There are 568 duplicated values in the `common_names` column, most of them because the same common name designates two different species.

Among these, there are 41 `common_names` values given to more than two different species, e.g. the common name *'Brachythecium Moss'* was given to seven different species: 'Brachythecium rutabulum', 'Brachythecium oxycladon', 'Brachythecium oedipodium', 'Brachythecium digastrum', 'Brachythecium rivulare', 'Brachythecium salebrosum', and 'Brachythecium plumosum'.

There are two cases of duplicated data in both the `scientific_name` column and the `common_names` column, with different `conservation_status` values.
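Counting how many species share a common name can be expressed with a single `groupby` and `nunique`. The sketch below uses a small hypothetical sample (only three of the *Brachythecium* species are included), so the counts are illustrative and not those of the dataset.

```python
import pandas as pd

# Hypothetical sample; the real dataset lists more species per common name
toy = pd.DataFrame({
    "scientific": ["Brachythecium rutabulum", "Brachythecium oxycladon",
                   "Brachythecium rivulare", "Silene vulgaris",
                   "Silene latifolia ssp. alba", "Pandion haliaetus"],
    "nicknames": ["Brachythecium Moss"] * 3 + ["Bladder Campion"] * 2 + ["Osprey"],
})

# Number of distinct scientific names attached to each common_names string
species_per_name = (
    toy.groupby("nicknames")["scientific"].nunique().sort_values(ascending=False)
)

shared = species_per_name[species_per_name > 1]         # used by 2+ species
more_than_two = species_per_name[species_per_name > 2]  # used by 3+ species
print(shared)
```

The comparison is made on the whole `common_names` string, so an entry such as 'Osprey, Western Osprey' would be treated as a different name from 'Osprey'.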

### 3.1.2.3 Removing duplicated data

Since duplicated data were detected in the `scientific_name` column of the `species` dataset, they must be removed in order to assess the conservation status distribution correctly.

In [None]:
# removing rows with missing values in the 'conservation_status' column (renamed to 'status')

species1_nan = species1.dropna(subset=['status'])
species1_nan.describe(include='all')

In [None]:
# checking duplicated values

species1_nan[species1_nan.duplicated(subset=['scientific'], keep=False)].sort_values(by='scientific')

##### Some findings

Duplicated values in the `scientific_name` column are mostly due to different entries in the `common_names` column.

In most cases, the `common_names` value of one entry is included in the other entry for the same `scientific_name`. For example, for *Pandion haliaetus* there are two entries in the `common_names` column: *Osprey* and *Osprey, Western Osprey*.

Duplicated values for *Canis lupus* and *Myotis lucifugus* will be addressed individually because there are three entries each.

In [None]:
# flag entries listing several comma-separated common names; assign() returns a new DataFrame
species1_nan = species1_nan.assign(more_than_one_name=species1_nan.nicknames.str.contains(','))
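One possible way to collapse such duplicates is to keep a single row per scientific name and merge the common names into one list. This is only a sketch of that idea, not the method used in this notebook; the helper `merge_names` and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows with the renamed columns
toy = pd.DataFrame({
    "category": ["Bird", "Bird", "Mammal"],
    "scientific": ["Pandion haliaetus", "Pandion haliaetus", "Puma concolor"],
    "nicknames": ["Osprey", "Osprey, Western Osprey", "Mountain Lion"],
    "status": ["Species of Concern", "Species of Concern", None],
})


def merge_names(names):
    """Union of comma-separated common names, keeping first-seen order."""
    seen = []
    for entry in names:
        for name in entry.split(","):
            name = name.strip()
            if name and name not in seen:
                seen.append(name)
    return ", ".join(seen)


# One row per species: merged names, first non-null status of the group
merged = (
    toy.groupby(["category", "scientific"], as_index=False)
       .agg(nicknames=("nicknames", merge_names), status=("status", "first"))
)
print(merged)
```

Taking the first status silently hides conflicts, so species with more than one distinct status (such as *Canis lupus*) would still need to be resolved by hand before applying something like this.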


## 3.2 Inspecting the `observations` dataset

In the `observations` dataset there are three columns: `scientific_name`, `park_name`, and `observations`.


In [None]:
observations.head(10)

In [None]:
observations1 = observations.rename(columns={'scientific_name': 'scientific', 'park_name': 'park'})
print(observations1.info())

In [None]:
print(observations1.describe(include='all'))

In [None]:
print(observations1.scientific.duplicated().value_counts())

In [None]:
print(observations1.scientific.value_counts())

In [None]:
print(observations1.park.value_counts())

#### Some findings

There are 5541 different species spotted and 5824 entries for each park (Great Smoky Mountains National Park,
Yosemite National Park, Bryce National Park, and Yellowstone National Park).

### 3.2.1 Checking for duplicated scientific name values per park

First, data will be split per park.

In [None]:
smoky = observations1[observations1['park'] == 'Great Smoky Mountains National Park']
smoky.sort_values(by='observations', ascending=False).head()

In [None]:
print(smoky.scientific.duplicated().value_counts())
smoky_duplicated = smoky[smoky.duplicated(subset=['scientific'])]
smoky_duplicated[smoky_duplicated.duplicated(subset=['scientific'], keep=False)].sort_values(by='scientific')

In [None]:
yosemite = observations1[observations1['park'] == 'Yosemite National Park']
yosemite.sort_values(by='observations', ascending=False).head()

In [None]:
print(yosemite.scientific.duplicated().value_counts())
yosemite_duplicated = yosemite[yosemite.duplicated(subset=['scientific'])]
yosemite_duplicated[yosemite_duplicated.duplicated(subset=['scientific'], keep=False)].sort_values(by='scientific')

In [None]:
bryce = observations1[observations1['park'] == 'Bryce National Park']
bryce.sort_values(by='observations', ascending=False).head()


In [None]:
print(bryce.scientific.duplicated().value_counts())
bryce_duplicated = bryce[bryce.duplicated(subset=['scientific'])]
bryce_duplicated[bryce_duplicated.duplicated(subset=['scientific'], keep=False)].sort_values(by='scientific')

In [None]:
yellowstone = observations1[observations1['park'] == 'Yellowstone National Park']
yellowstone.sort_values(by='observations', ascending=False).head()

In [None]:
print(yellowstone.scientific.duplicated().value_counts())
yellowstone_duplicated = yellowstone[yellowstone.duplicated(subset=['scientific'])]
yellowstone_duplicated[yellowstone_duplicated.duplicated(subset=['scientific'], keep=False)].sort_values(by='scientific')

In [None]:
# Duplicated species observations per park
smoky_dup_list = smoky_duplicated[smoky_duplicated.duplicated(subset=['scientific'], keep=False)]\
.sort_values(by='scientific', ascending=True)
smoky_dup_list.scientific.unique()

In [None]:
yosemite_dup_list = yosemite_duplicated[yosemite_duplicated.duplicated(subset=['scientific'], keep=False)]\
.sort_values(by='scientific', ascending=True)
yosemite_dup_list.scientific.unique()

In [None]:
bryce_dup_list = bryce_duplicated[bryce_duplicated.duplicated(subset=['scientific'], keep=False)]\
.sort_values(by='scientific', ascending=True)
bryce_dup_list.scientific.unique()

In [None]:
yellowstone_dup_list = yellowstone_duplicated[yellowstone_duplicated.duplicated(subset=['scientific'], keep=False)]\
.sort_values(by='scientific', ascending=True)
yellowstone_dup_list.scientific.unique()

In [None]:
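# same check on the species dataset: rows left after dropping the first occurrence of each name
species_duplicated = species1[species1.duplicated(subset=['scientific'])]
species_dup_list = species_duplicated[species_duplicated.duplicated(subset=['scientific'], keep=False)]\
.sort_values(by='scientific', ascending=True)
species_dup_list.scientific.unique()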

#### Some findings
There are duplicated records for the species *'Canis lupus', 'Castor canadensis', 'Columba livia', 'Holcus lanatus', 'Hypochaeris radicata', 'Myotis lucifugus', 'Procyon lotor', 'Puma concolor', 'Streptopelia decaocto'* in all four parks.

Since the same species are duplicated in every park, it is unlikely that this happened by chance.

Those species are the same ones that have duplicated values in the *species* dataset. As seen above, duplicated values in the `scientific_name` column are due to different `common_names` entries for the same species.
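The per-park checks above repeat the same steps four times. As an alternative sketch, the number of records per park and species can be counted in one `groupby`; the toy observation values below are hypothetical and only illustrate the shape of the result.

```python
import pandas as pd

# Hypothetical observation rows with the renamed columns
toy_obs = pd.DataFrame({
    "scientific": ["Canis lupus", "Canis lupus", "Canis lupus",
                   "Puma concolor", "Canis lupus", "Puma concolor"],
    "park": ["Bryce National Park"] * 4 + ["Yosemite National Park"] * 2,
    "observations": [27, 29, 74, 50, 35, 40],
})

# Count the records for every (park, species) pair
records = toy_obs.groupby(["park", "scientific"]).size().reset_index(name="records")

# Pairs recorded more than once within the same park
repeated = records[records["records"] > 1]
print(repeated)
```

Filtering on `records > 2` instead would reproduce the stricter check used in the cells above, which only keeps species with three or more records in a park.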

## 4. Distribution of conservation status for animals

Under the Endangered Species Act (ESA), at-risk species are classified as endangered species or threatened species.

An **endangered species** is any species in danger of extinction.

A **threatened species** is any species which is likely to become endangered within the foreseeable future. 

Species designated as threatened or endangered are called **listed species** in that they are added to the federal lists of endangered and threatened wildlife and plants. Species must meet the definition of endangered and threatened under the Act.

The National Park Service dataset lists at-risk species in parks and includes those under the Endangered Species Act, and also state, local, and tribal listed species.

A **species of special concern** is any species that is particularly vulnerable, and could easily become endangered.

An **in recovery species** is any species that is subject to a recovery program, with the goal that it will no longer require special protection.

##### Some findings

There are 1021 records in the animal categories ('Mammal', 'Bird', 'Reptile', 'Amphibian', and 'Fish') in the dataset.

As seen below, `Bird` is the largest group, with 521 records.

In [None]:
species_animals = species[~species.category.str.contains('Plant')]
print('Number of entries:', species_animals.category.count())
species_animals.category.unique()

In [None]:
ax = sns.catplot(data=species_animals, x='category', kind='count')
ax = ax.facet_axis(0,0)
for i in ax.containers:
    ax.bar_label(i,)

#### 4.1 Dealing with duplicated data

Most duplicated rows for the same species are due to different common names recorded for the same `scientific_name` value. The duplicated species are 'Canis lupus', 'Castor canadensis', 'Columba livia', 'Holcus lanatus', 'Hypochaeris radicata', 'Myotis lucifugus', 'Procyon lotor', 'Puma concolor', 'Streptopelia decaocto'.
However, in the 'Canis lupus' case, there are two different entries in the `conservation_status` column ('In Recovery' and 'Endangered').


In [None]:
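# rows that remain after dropping the first occurrence of each scientific name
species_duplicated = species1[species1.duplicated(subset=['scientific'])]
species_duplicated[species_duplicated.duplicated(subset=['scientific'], keep=False)].\
sort_values(by='scientific')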


In [None]:
print(species[species['scientific_name'] == 'Canis lupus'])

# species2 = species
# species2.rename(columns={'scientific_name': 'scientific', 'common_names': 'nicknames'}, inplace=True)
# print(species2[species2.scientific == 'Canis lupus'])
# species2['More than one common name'] = species2.apply(lambda row: True if ', ' in row['nicknames']\
#                                              else False, axis=1)
# species2.sort_values(by='More than one common name', inplace=True)
# species2[species2.scientific == 'Canis lupus']


#### 4.2 Handling missing data

As seen before, most entries in the `conservation_status` column (about 97%) are missing: of the 5824 records, 5633 have no value and only 191 carry information about conservation status.

For the purpose of this topic, it is important to assess how the missing data are distributed among the animal categories.
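Before plotting, the share of missing values per category can be tabulated. The following is a minimal sketch on hypothetical toy rows; the frame `toy` and the output column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not the real dataset
toy = pd.DataFrame({
    "category": ["Bird", "Bird", "Bird", "Bird", "Mammal", "Mammal"],
    "conservation_status": ["Species of Concern", np.nan, np.nan, np.nan,
                            "Endangered", np.nan],
})

# Flag missing statuses, then summarise the flag within each category
missing_by_category = (
    toy.assign(is_missing=toy["conservation_status"].isna())
       .groupby("category")["is_missing"]
       .agg(["sum", "count", "mean"])
       .rename(columns={"sum": "missing", "count": "total", "mean": "share"})
)
print(missing_by_category)
```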


In [None]:
# Visualizing the distribution of missing conservation status data
species_animal_null = species_animals.fillna(value='No data')

order = ['Species of Concern', 'Threatened', 'Endangered', 'In Recovery', 'No data']

# catplot creates its own figure, so the size is set through height/aspect
g = sns.catplot(data=species_animal_null, x='conservation_status', kind='count',
                order=order, height=6, aspect=8/6)
ax = g.facet_axis(0, 0)
# annotate each bar with its count
for container in ax.containers:
    ax.bar_label(container)
g.set_xticklabels(rotation=45)
plt.title("Distribution of conservation status for animals, including null values")
plt.show()
plt.clf()


##### Some findings

Missing data outnumber the other `conservation_status` values for all groups of animals, as seen in the graphs below.

Possible causes are:
- most species have not been included on the list;
- the conservation status of those species was not reported.

In [None]:
g = sns.catplot(data=species_animal_null, x='conservation_status', col='category', col_wrap = 5,\
            kind='count', height=3, aspect=8/5, \
            order=['Species of Concern','Threatened', 'Endangered', 'In Recovery', "No data"])

g.fig.tight_layout()
g.set_xticklabels(labels=order, rotation=45)

plt.show()
plt.clf()

In [None]:
g = sns.catplot(data=species_animal_null, x='category', hue='conservation_status',\
            kind='count', height=4, aspect=13/5, legend_out=False)

sns.move_legend(g, "upper right")
plt.title("Distribution of missing values among animals")
plt.show()
plt.clf()

In [None]:
# Visualizing the distribution of conservation status

# catplot creates its own figure, so the size is set through height/aspect
g = sns.catplot(data=species_animals, x='conservation_status', kind='count',
                order=['Species of Concern', 'Threatened', 'Endangered', 'In Recovery'],
                height=7, aspect=10/7)
g.set_xticklabels(rotation=45)
plt.title("Distribution of conservation status for animals")
plt.show()
plt.clf()

In [None]:
animals = sns.catplot(data=species_animals, x='conservation_status', col='category', col_wrap = 2,\
                      kind='count', order=['Species of Concern','Threatened', 'Endangered', 'In Recovery'],\
                      height=3, aspect=13/6)
plt.show()
plt.clf()


#### Some findings

Among the animals with a recorded conservation status, most are listed as *Species of Concern*.
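The same distribution can be read as a table of counts, which is easier to quote than a bar chart. This sketch uses `pd.crosstab` on hypothetical toy rows; the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows, restricted to species with a recorded status
toy = pd.DataFrame({
    "category": ["Bird", "Bird", "Bird", "Mammal", "Mammal", "Fish"],
    "conservation_status": ["Species of Concern", "Species of Concern",
                            "Endangered", "Endangered",
                            "Species of Concern", "Threatened"],
})

status_order = ["Species of Concern", "Threatened", "Endangered", "In Recovery"]

# Rows: category, columns: status; statuses absent from the data become 0
table = (
    pd.crosstab(toy["category"], toy["conservation_status"])
      .reindex(columns=status_order, fill_value=0)
)

# Column totals give the overall distribution of conservation status
totals = table.sum().sort_values(ascending=False)
print(table)
print(totals)
```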



## Which species were spotted the most at each park?

## Selecting species with conservation status data

In [None]:
# merging species and observations on the shared scientific_name column

conservation = species.merge(observations, on='scientific_name')
print(conservation.head())
conservation.info()
print(conservation.duplicated(subset=['scientific_name']).value_counts())
conservation['conservation_status'].isna().value_counts()

In [None]:
conservation1 = conservation[conservation['conservation_status'].notnull()]

In [None]:
conservation1.info()

In [None]:
conservation1.duplicated(subset=['scientific_name']).value_counts()

In [None]:
conservation1.head(10)

In [None]:
conservation1.groupby('scientific_name')['observations'].sum()
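To find the species spotted the most at each park, one option is to total the observations per park and species and then pick the row with the largest total in each park. The sketch below assumes that duplicated records of a species within a park should be summed, which is an assumption and not something the data confirms; the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical observation rows using the original column names
toy = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Puma concolor",
                        "Canis lupus", "Puma concolor"],
    "park_name": ["Bryce National Park", "Bryce National Park",
                  "Bryce National Park", "Yosemite National Park",
                  "Yosemite National Park"],
    "observations": [30, 40, 60, 50, 90],
})

# Total observations for every (park, species) pair
totals = toy.groupby(["park_name", "scientific_name"], as_index=False)["observations"].sum()

# idxmax returns, for each park, the row label of the largest total
top = totals.loc[totals.groupby("park_name")["observations"].idxmax()]
print(top)
```

In the toy data, *Puma concolor* has the largest single record at Bryce, but *Canis lupus* has the larger total once its two records are summed, so how duplicates are treated changes the answer. Merging `species` and `observations` while `species` still contains duplicated `scientific_name` rows also repeats observation rows, which would inflate such totals.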

- What is the distribution of `conservation_status` for animals?

- Are the differences between species and their conservation status significant?

- Are certain types of species more likely to be endangered?
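For the question about significance, a common approach is a chi-squared test of independence on a contingency table of category versus protection status. The sketch below computes only the test statistic with NumPy; the counts are hypothetical and not taken from the dataset, and the two-column split into "protected" and "not protected" is an assumption made for illustration.

```python
import numpy as np

# Hypothetical counts, NOT taken from the dataset.
# Rows: two categories (e.g. Mammal, Bird); columns: protected, not protected
observed = np.array([[30, 150],
                     [75, 400]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Counts expected if category and protection status were independent
expected = row_totals * col_totals / observed.sum()

# Pearson chi-squared statistic and its degrees of freedom
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(f"chi2 = {chi2:.3f}, dof = {dof}")
```

In practice, `scipy.stats.chi2_contingency` returns the statistic together with a p-value; note that it applies a continuity correction by default for 2x2 tables, so its statistic can differ slightly from the uncorrected value computed here.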