# Analyzing Biodiversity and Conservation Status in National Parks

## Introduction:

This is my first Python project shared on GitHub. I will explore and analyze two datasets sourced from the National Park Service.

### Goals:
I expect to answer more questions as I explore the data, but these questions will guide my initial exploration:

- What is the distribution of `conservation_status` for animals?  
- Are certain types of species more likely to be endangered?  
- Are the differences between species and their conservation status significant?  
- Which species were spotted the most at each park?

### Raw Data Files:

**species_info.csv** - contains data about different species and their conservation status.  
`category` - class of animal  
`scientific_name` - the scientific name of each species  
`common_names` - the common names of each species  
`conservation_status` - each species' current conservation status  

**observations.csv** - holds recorded sightings of different species at several national parks for the past 7 days.  
`scientific_name` - the scientific name of each species  
`park_name` - park where species were found  
`observations` - the number of times each species was observed at the park

## Load CSV Files: inspect first 5 rows.

While reviewing the column names, I decided not to name the DataFrame 'observations', because that is also a column name. To avoid confusion, I will use 'tracking' instead.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
species_df = pd.read_csv("~/Desktop/GitHub/BioDiversity-Project/Analysis/species_info.csv")
tracking_df = pd.read_csv("~/Desktop/GitHub/BioDiversity-Project/Analysis/observations.csv")

In [3]:
species_df.head()

In [4]:
species_df.dtypes

In [5]:
tracking_df.head()

In [6]:
tracking_df.dtypes

## Digging into the missing data.

I only see `NaN` values in **species_df**, in the `conservation_status` column. This is something to look into more deeply as I begin to clean the data.  
*I also see that both DataFrames contain the column `scientific_name`. This can act as a join key, linking both tables if I choose to merge them.*

**For now, though, I want to continue exploring the data while looking more into each column to see if any others contain NaN, or other possible issues that could be addressed with data cleaning.**

In [7]:
species_df_shape = species_df.columns, species_df.shape

species_df_shape

In [8]:
tracking_df_shape = tracking_df.columns, tracking_df.shape

tracking_df_shape 

In [9]:
species_null = species_df.isnull().sum().sort_values(ascending=False)

species_null

In [10]:
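total_rows, total_col = species_df.shape
# Look up the count by column label; positional access like species_null[0] is deprecated on a labeled Series
null_rows = species_null['conservation_status']
null_per = (null_rows / total_rows) * 100

print(f"{round(null_per, 3)}% of the rows in the conservation_status column are null.")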

In [11]:
tracking_null = tracking_df.isnull().sum().sort_values(ascending=False)

tracking_null

**Diving deeper, I see that the vast majority of the rows in the `conservation_status` column have null values. I want to know whether the missing values are systematic, or whether they are missing at random (MAR), missing completely at random (MCAR), or missing not at random (MNAR). Without much domain knowledge, I will look closer at the non-null values.**
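
One way to probe whether the missingness is systematic is to compare the share of missing values across groups. The sketch below is only illustrative: `toy_species` is a made-up stand-in for `species_df`, and its values are assumptions, not rows from the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for species_df (values are invented for illustration).
toy_species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Bird", "Bird", "Fish", "Fish"],
    "scientific_name": ["a", "b", "c", "d", "e", "f"],
    "conservation_status": ["Endangered", np.nan, np.nan, "Threatened", np.nan, np.nan],
})

# Percentage of missing values in every column at once.
null_pct = toy_species.isna().mean().mul(100).round(2)

# Share of missing conservation_status within each category.
null_by_category = (
    toy_species["conservation_status"].isna()
    .groupby(toy_species["category"])
    .mean()
    .mul(100)
)

print(null_pct)
print(null_by_category)
```

If the missing share differs sharply between categories, that hints the gaps are tied to the kind of species rather than to random data loss.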

In [12]:
conservation_types = species_df['conservation_status'].unique()

print("conservation_types:", conservation_types)
species_df['conservation_status'].value_counts()

**`Conservation Status`** is an ordinal categorical variable with 4 categories: 
- Species of Concern, 
- Endangered, 
- Threatened, 
- In recovery. 
 
**It now seems less surprising that such a large portion of the values is null (96.72%). It is common for a species not to fit any of the 4 categories, which suggests that `NaN` values represent *no* conservation status, i.e. the species is not at risk.**

This insight allows me to make an important assumption: that `NaN` values are systematically missing due to the dataset only recording conservation status for species known to be at risk. However, it is crucial to remember that this is an assumption.

To perform a more comprehensive analysis, I will modify the null values to represent a 'Healthy' conservation status, indicating that the species is not considered at risk. This relabeling allows me to include these records in my analysis without the need to delete the rows entirely.
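
Because the status is described above as ordinal, the relabeled column could also carry an explicit order. The sketch below is a minimal illustration on a toy Series; the ranking in `status_order` (least to most at risk) and the exact label spellings are my assumptions, not something the dataset defines.

```python
import numpy as np
import pandas as pd

# Assumed ordering from least to most at risk (not defined by the dataset).
status_order = ["Healthy", "In Recovery", "Species of Concern", "Threatened", "Endangered"]

toy_status = pd.Series(["Endangered", np.nan, "Species of Concern", np.nan, "In Recovery"])

# Fill the gaps, then store the result as an ordered categorical.
ordered_status = pd.Series(
    pd.Categorical(toy_status.fillna("Healthy"), categories=status_order, ordered=True)
)

# Ordered categoricals support comparisons against a category label.
at_least_concern = ordered_status >= "Species of Concern"

# sort=False keeps the counts in category order instead of frequency order.
print(ordered_status.value_counts(sort=False))
print(at_least_concern.tolist())
```

An ordered dtype keeps plots and tables in a consistent severity order, at the cost of committing to a ranking that is a judgment call.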

In [13]:
# Relabel missing statuses as 'Healthy' (an assumption, see above) and store the column as a categorical
species_df['conservation_status'] = species_df['conservation_status'].fillna('Healthy').astype('category')
con_status = species_df['conservation_status']
labels = con_status.unique()
con_status_counts = con_status.value_counts()

con_status_counts

In [14]:
prop = (con_status_counts/total_rows)*100

prop

## Some insight into Q1. What is the distribution of `conservation_status` for animals?  

Before fully diving into Q1, I want to take a quick look at the question using what the analysis has established so far. **Since I am not familiar with this dataset or its domain, I find it helpful to take extra time learning about the basic features and how they relate.**

For example, the proportions above make the distribution of species in this dataset much easier to understand. This prepares me for further EDA, where I'd be interested in learning more about the distribution of `conservation_status` by national park - maybe some parks are home to more at-risk species than other parks!

### A visual of the `conservation_status` proportions:

In [15]:
# Take labels and percentages from the same Series so each label is paired with its own value
legend_labels = [f'{label} - {pct:.1f}%' for label, pct in prop.items()]

In [16]:
plt.pie(con_status_counts)
plt.title('Proportion of Conservation Status')
plt.legend(legend_labels,bbox_to_anchor=(1, 0.5), loc='center left')

plt.show()
plt.clf()

It is hard to see any slices other than the **blue** and **orange** ones, which shows that the remaining statuses are far less frequent.

With this additional understanding, I can think of 2 more questions I am interested in analyzing during EDA: what is the proportion of `conservation_status` by:
- `park_name`? To investigate whether some parks are more difficult for a species to survive in.
- `category`? To investigate whether some categories of species have more difficulty surviving than others.
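
Proportions like these can be sketched with a row-normalised cross-tabulation. The example below uses a toy frame with invented values; the same call would work with `park_name` in place of `category` once the two tables are merged.

```python
import pandas as pd

# Toy data (invented) with one row per species.
toy = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Mammal", "Bird", "Bird", "Bird", "Bird"],
    "conservation_status": ["Healthy", "Endangered", "Healthy",
                            "Healthy", "Healthy", "Healthy", "Threatened"],
})

# normalize="index" divides each row by its total, so every row sums to 100.
status_by_category = (
    pd.crosstab(toy["category"], toy["conservation_status"], normalize="index")
    .mul(100)
    .round(1)
)

print(status_by_category)
```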

### Before performing in-depth EDA, I want to back up and finish the data cleaning. I need to identify all duplicate rows.

In [17]:
s_dups = species_df.duplicated()
t_dups = tracking_df.duplicated()

s_dups.sum(), t_dups.sum()

In [18]:
duplicates = tracking_df[tracking_df.duplicated(keep=False)]

duplicates.sort_values(by=['scientific_name'])

The rows above are all duplicates. There seems to be no reason to keep both copies of each record, so I will now drop all duplicates and check that none remain.

In [19]:
tracking_df = tracking_df.drop_duplicates()

tracking_df.duplicated().sum()
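
Exact duplicates are only one kind of repetition. A table like this can also hold several rows for the same species and park with different counts. The sketch below, on invented data, shows how the two cases can be told apart with the `subset` argument.

```python
import pandas as pd

# Toy stand-in for tracking_df (values are invented).
toy_tracking = pd.DataFrame({
    "scientific_name": ["Bos bison", "Bos bison", "Bos bison", "Canis lupus", "Canis lupus"],
    "park_name": ["Bryce National Park", "Bryce National Park", "Yellowstone National Park",
                  "Bryce National Park", "Bryce National Park"],
    "observations": [68, 68, 269, 27, 30],
})

# Exact duplicates: every column matches an earlier row.
exact_dups = int(toy_tracking.duplicated().sum())

# Key duplicates: same species and park, whatever the count.
key_dups = int(toy_tracking.duplicated(subset=["scientific_name", "park_name"]).sum())

# Dropping exact duplicates leaves the differing-count rows in place.
deduped = toy_tracking.drop_duplicates().reset_index(drop=True)

print(exact_dups, key_dups, len(deduped))
```

Rows that share a key but differ in count may be separate sightings rather than errors, so whether to sum or drop them is a judgment call.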

### At this point all null values have been handled, and all duplicates removed. The data is much cleaner!

Now it would be beneficial to merge the two DataFrames, enabling more comprehensive analysis, such as relating each species' `conservation_status` and `common_names` to the observations recorded in each park.

In [20]:
data = pd.merge(species_df, tracking_df, on='scientific_name')
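
A merge can silently multiply or drop rows when the join key is repeated or missing on one side. The sketch below shows two checks that may help, on invented data: `Series.is_unique` on the key, and `indicator=True` to flag unmatched rows.

```python
import pandas as pd

# Toy tables (invented); "Canis lupus" appears twice on the species side.
toy_species = pd.DataFrame({
    "scientific_name": ["Bos bison", "Canis lupus", "Canis lupus"],
    "common_names": ["American Bison, Bison", "Gray Wolf", "Gray Wolf, Wolf"],
})
toy_tracking = pd.DataFrame({
    "scientific_name": ["Bos bison", "Canis lupus", "Ursus americanus"],
    "park_name": ["Bryce National Park"] * 3,
    "observations": [68, 27, 85],
})

# Is the join key unique on the species side?
key_is_unique = toy_species["scientific_name"].is_unique

# An outer merge with indicator=True adds a _merge column
# ("both", "left_only", "right_only") for every row.
merged = pd.merge(toy_species, toy_tracking, on="scientific_name",
                  how="outer", indicator=True)
unmatched = merged[merged["_merge"] != "both"]

print(key_is_unique)
print(unmatched)
```

Here the repeated key turns one observation row into two merged rows, which would inflate any later sum of `observations`.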

In [21]:
check = data[data['scientific_name'] == 'Bos bison']
check

In [22]:
check2 = data[(data.park_name == 'Bryce National Park') & (data.conservation_status == 'Endangered')]

check2

### We can now see how helpful it is to combine both dataframes into 1 dataframe. With this quick example, 'check'/'check2' show that: a) The American Bison is 'Healthy' in all parks, and b) 

Perhaps some parks have higher concentrations of non-healthy species? And if so, do some category types (e.g. mammal, fish, etc.) have a higher likelihood of becoming non-healthy?

In [23]:
plt.bar(x=check.park_name, height=check.observations, color=['red', 'blue', 'green', 'black'])
# Rotate and right-align the existing tick labels instead of shifting the tick positions by hand
plt.xticks(rotation=45, ha='right')
plt.title("Bos bison Frequency by National Park")
plt.show()
plt.clf()

The bar plot clearly highlights the national park in which the American Bison is most commonly sighted. 
**This sample of data visually helps me understand the dataset better, so when it comes to conducting a comprehensive analysis, I am more prepared.**

## Summary Statistics
Before performing **Exploratory Data Analysis**, I want to learn about the summary statistics. Now that my data is clean and both CSV files are merged into one DataFrame (joined on their shared column, `scientific_name`), it is a good time to look at them:

In [24]:
sum_stats = data.describe(include='all')
sum_stats

In [25]:
# There are 7 different values for `category`, still I will look closer...
category_totals = data.groupby('category')['observations'].sum().sort_values()

category_totals

In [26]:
# The mean of `observations` is about 142 per row (one species in one park), not per park, so let's look closer here...
park_totals = data.groupby('park_name')['observations'].sum().sort_values()

park_totals
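
To keep totals and averages side by side, several aggregations can be computed in one pass. This is a minimal sketch on invented data; the output column names `total`, `mean_per_record` and `records` are my own labels.

```python
import pandas as pd

# Toy stand-in for the merged table (values are invented).
toy_data = pd.DataFrame({
    "park_name": ["Bryce National Park", "Bryce National Park",
                  "Yellowstone National Park", "Yellowstone National Park"],
    "scientific_name": ["Bos bison", "Canis lupus", "Bos bison", "Canis lupus"],
    "observations": [68, 27, 269, 60],
})

# Named aggregation: each keyword becomes an output column.
park_summary = (
    toy_data.groupby("park_name")["observations"]
    .agg(total="sum", mean_per_record="mean", records="count")
    .sort_values("total")
)

print(park_summary)
```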

## Exploring Distribution by National Park for All Species
At this point the data is clean, summary stats are known, and I have started learning more about the relationships within the dataset.

To begin my EDA, I want to compare total observations across the national parks. This can tell me whether some parks record more sightings overall.

In [27]:
import matplotlib.cm as cm

# tab10 holds 10 distinct colors; indices 0-9 select one each (more than enough for the bars below)
colors = cm.tab10(np.arange(10))

In [28]:
x = park_totals.index
height = park_totals

ax = plt.subplot(1, 1, 1)
ax.bar(x, height, color=colors)
# Rotate and right-align the park names rather than offsetting the tick positions manually
plt.setp(ax.get_xticklabels(), rotation=40, ha='right')
plt.title("Total Observations by National Park")
plt.xlabel("National Park")
plt.ylabel("Number of Observations")
plt.show()
plt.clf()

### The most frequent observations occur in Yellowstone National Park. 
Now let's compare the distribution of conservation status across the types of species, to help answer Q1.

In [29]:
# observed=True keeps the now-empty 'Healthy' category out of the categorical index
conservation_category = pd.pivot_table(species_df[species_df['conservation_status'] != 'Healthy'],
                                      values = 'common_names',
                                      index = 'conservation_status',
                                      columns = 'category',
                                      aggfunc = 'count',
                                      observed = True)

conservation_category

In [30]:
conservation_category.plot(kind = 'barh',
                          subplots=True,
                          xlabel = "Species Count",
                          title = 'Conservation Status Comparison',
                          ylabel = " ",
                          figsize=(5,10),
                          legend=False)

plt.tight_layout()

### I will now combine the non-healthy statuses into a single 'protected' group to compare ALL non-healthy species by category.

In [31]:
species_df['protected'] = species_df['conservation_status'] != 'Healthy'

In [32]:
# Count species rows per (protected, category) pair, then spread the categories into columns
protected = (species_df.groupby(['protected', 'category'])['scientific_name']
             .count()
             .unstack('category'))

protected_f = protected.loc[False]
protected_t = protected.loc[True]
protected.index

In [33]:
protected_t.plot(kind='bar',
                 subplots=False,
                 color=colors,
                 figsize=(10, 10),
                 title="All Protected Species: Total Count")
              
plt.ylabel("")
plt.show()
plt.clf()

**Because this compares absolute values, it will be helpful to also compute the percentage of protected species in each category.** To compute this, I will use the 'protected' column added to species_df above, which flags whether a species has a non-healthy `conservation_status`; this lets me group the healthy and non-healthy species and compare proportions.

In [34]:
protected_T = protected.T
protected_T['ratio'] = (protected_t / (protected_t + protected_f))*100


protected_T
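
The same percentage can be computed straight from the boolean flag, without pivoting first. This is a sketch under the assumption of one row per species; the toy values and the output column names are invented.

```python
import pandas as pd

# Toy stand-in for species_df (values are invented).
toy_species = pd.DataFrame({
    "category": ["Mammal"] * 4 + ["Bird"] * 5,
    "conservation_status": ["Healthy", "Endangered", "Healthy", "Healthy",
                            "Healthy", "Healthy", "Threatened", "Healthy",
                            "Species of Concern"],
})
toy_species["protected"] = toy_species["conservation_status"] != "Healthy"

# Summing a boolean column counts the True values.
protection = (
    toy_species.groupby("category")["protected"]
    .agg(protected="sum", total="count")
)
protection["percent_protected"] = (
    protection["protected"] / protection["total"] * 100
).round(1)

print(protection)
```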

## Although Birds have the most protected species in absolute terms, they are not the most likely to be protected!
### In fact, on a proportional basis, Mammals are the most frequently protected at 17%, while Birds are protected 15% of the time.

#### Could this point to a human bias - choosing to protect fellow mammals more than anything else?! Further analysis would need to look at the factors behind decisions to protect a species to answer this puzzle!

# Significant Differences across Species?

Now I will test whether there is a statistically significant association between the species group (mammals, birds, etc.) and conservation status (protected or not protected).

**To determine statistical significance, I will use the Chi-squared test.** First, I will build the contingency table of observed counts, and then compute the expected counts using scipy.stats.

In [35]:
from scipy.stats import chi2_contingency

First I will find the p-value for a chi-squared test between mammals and vascular plants.

In [36]:
contingency1 = [[38,176],[46,4424]]

contingency1 = pd.DataFrame(contingency1,columns=['O_Protected','O_Not-Protected'],index=['Mammals','Vascular Plants'])

contingency1

### The table above shows the observed counts, while the table below shows the expected counts:

In [37]:
chi2, pval, dof, ex = chi2_contingency(contingency1)

ex = pd.DataFrame(ex.round(),columns=['e_Protected','e_Not-Protected'],index=['Mammals','Vascular Plants'])

ex
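
The expected counts are what each cell would hold if protection were independent of category: row total times column total, divided by the grand total. The sketch below recomputes them by hand for the same 2x2 table and compares the result with what `chi2_contingency` returns.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the table above:
# rows = Mammals, Vascular Plants; columns = protected, not protected.
observed = np.array([[38, 176],
                     [46, 4424]])

# Expected count per cell = row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_by_hand = row_totals * col_totals / observed.sum()

chi2, pval, dof, expected = chi2_contingency(observed)

print(expected_by_hand.round(2))
print(np.allclose(expected_by_hand, expected))
```

For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default, which changes the statistic slightly but not the expected counts.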

In [38]:
pval

The p-value is extremely low! This gives me reason to reject the **null hypothesis**, which states that there is no association between a species' `category` and whether it is protected - that is, that any difference in protection rates is due to random chance.

**The Chi-squared test has helped me determine that, at least for Mammals and Vascular Plants, there is a statistically significant difference between the observed and expected protection counts: more mammals were protected than expected!**  
*Might this support my initial theory, that humans protect mammals over all else?!*
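
The same test can be extended beyond one pair of categories by building the full contingency table from the data rather than typing counts by hand. This is a sketch on invented data; the counts are not from the real dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy stand-in for species_df with a boolean `protected` flag (values invented).
toy_species = pd.DataFrame({
    "category": ["Mammal"] * 20 + ["Bird"] * 20 + ["Vascular Plant"] * 60,
    "protected": [True] * 8 + [False] * 12      # Mammal
               + [True] * 6 + [False] * 14      # Bird
               + [True] * 2 + [False] * 58,     # Vascular Plant
})

# Rows are categories, columns are protected False / True.
observed = pd.crosstab(toy_species["category"], toy_species["protected"])

chi2, pval, dof, expected = chi2_contingency(observed)

print(observed)
print(f"chi2={chi2:.2f}, dof={dof}, p={pval:.5f}")
```

A significant result on the full table only says that protection rates differ somewhere among the categories. It is also worth checking `expected` for very small counts, since the chi-squared approximation is less reliable there.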

# Which Species are Most Common?

The last question I will seek to answer is which species are observed most commonly.

In [39]:
mammals = species_df[species_df.category == "Mammal"].common_names

In [40]:
# Lowercase, turn punctuation into spaces, drop apostrophes, then split into words
temp1 = (mammals.str.lower()
         .str.replace(r"[(),-]", " ", regex=True)
         .str.replace("'", "", regex=False)
         .str.split())

temp1

In [41]:
temp2 = temp1.apply(lambda x: [*set(x)])
                    
temp2

In [42]:
temp3 = temp2.explode()

temp3

In [43]:
name_count = pd.DataFrame(temp3.value_counts().reset_index())

name_count.head(15)
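
The cleaning, de-duplicating and counting steps above can also be sketched as one chain. The example uses invented names and a regular expression that keeps apostrophes inside words, so it is a slight variation on the steps above rather than an exact copy.

```python
import pandas as pd

# Toy common names (invented), one string per species.
toy_names = pd.Series([
    "Big Brown Bat, Big Brown Bat",
    "Little Brown Bat, Little Brown Myotis",
    "Masked Shrew",
])

word_counts = (
    toy_names.str.lower()
    .str.findall(r"[a-z']+")                     # words only; punctuation is dropped
    .apply(lambda words: sorted(set(words)))     # count each word once per species
    .explode()
    .value_counts()
)

print(word_counts)
```

Counting each word once per species means the result says how many species names contain a word, not how often that species was sighted.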

### 'Bat' is the most common word in the mammal species names, followed by 'Shrew'.
Because there are so many bats, I would like to know the variety: 

In [44]:
# Keep only the rows whose common names contain the whole word "Bat"
bat_true = species_df.loc[species_df.common_names.str.contains(r"\bBat\b", regex=True), ['common_names']]

bat_true

In [45]:
print(f"Check the count to ensure all Bat types were included: {bat_true.shape[0]}")
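
The guiding question about which species was spotted most at each park can be approached with a grouped `idxmax`. The sketch below runs on invented data and assumes that repeated records for the same species and park should be summed first.

```python
import pandas as pd

# Toy stand-in for the merged table (values are invented).
toy_data = pd.DataFrame({
    "park_name": ["Bryce National Park"] * 3 + ["Yellowstone National Park"] * 3,
    "scientific_name": ["Bos bison", "Canis lupus", "Bos bison",
                        "Bos bison", "Canis lupus", "Ursus americanus"],
    "observations": [68, 150, 40, 269, 60, 300],
})

# Sum repeated records so each species/park pair has a single total.
totals = (
    toy_data.groupby(["park_name", "scientific_name"], as_index=False)["observations"].sum()
)

# idxmax gives the row label of the largest total within each park.
top_rows = totals.loc[totals.groupby("park_name")["observations"].idxmax()]

print(top_rows)
```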

# Conclusion

This project was a great reminder of just how important planning is at the onset of analysis. Specifically, when the data is outside my domain knowledge, I believe it is valuable to spend time thinking critically from the perspective of the researcher or data collector. There is a reason someone spent time collecting this data, which means they have expectations or hope to use it in some meaningful way. At the onset of the project, I did not spend enough time doing this.

As a result, halfway through the project I found myself discovering facts about the dataset that I had originally overlooked. For instance, I did not realize that `conservation_status` depends on the species and not on `park_name`. This makes sense: if an animal is endangered, it is typically endangered everywhere, not just in one location. **But my initial assumption was that an animal could be endangered in one park but not in another.** 

This erroneous assumption added confusion throughout my initial analysis, and it was not until I circled back to the initial dataset descriptions and research questions that I realized my understanding was not fully accurate.

In the end, this project - while challenging my abilities as a data analyst - makes me excited! Excited to apply the lessons I gained, so that in the next project I can produce even more insightful and meaningful analysis. In future projects I will spend additional time planning, and really try to tap into the mindset of the original researcher who collected the data to begin with. 