## Project Disclaimer

This project has been developed as part of a Codecademy course, utilizing data provided specifically for the purpose of the project. The dataset, which has been supplied by Codecademy, is entirely fictional and has been created to simulate real-world scenarios, allowing learners to practice data analysis techniques in a controlled environment.

## Project Introduction

This project analyzes data from the National Park Service on endangered species across different parks. The goal is to identify patterns in the conservation statuses of these species, explore which types of species are more likely to become endangered, and investigate broader conservation trends.



## Project Goals
- What is the distribution of conservation status for species?
- Are certain types of species more likely to be endangered?
- Are the differences in conservation status between types of species statistically significant?
- Which animal is most prevalent, and how is it distributed among the parks?

In [6]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

### species_info.csv

category - class of animal

scientific_name - the scientific name of each species

common_names - the common names of each species

conservation_status - each species’ current conservation status

In [27]:
species = pd.read_csv('species_info.csv')
species.head()

|   | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | NaN |
| 1 | Mammal | Bos bison | American Bison, Bison | NaN |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | NaN |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | NaN |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | NaN |


### observations.csv

scientific_name - the scientific name of each species

park_name - the park where each species was found

observations - the number of times each species was observed at that park


In [29]:
observations = pd.read_csv('observations.csv')
observations.head()

|   | scientific_name | park_name | observations |
|---|---|---|---|
| 0 | Vicia benghalensis | Great Smoky Mountains National Park | 68 |
| 1 | Neovison vison | Great Smoky Mountains National Park | 77 |
| 2 | Prunus subcordata | Yosemite National Park | 138 |
| 3 | Abutilon theophrasti | Bryce National Park | 84 |
| 4 | Githopsis specularioides | Great Smoky Mountains National Park | 85 |


## Data Characteristics
Next, we check the dimensions of the two data sets. `species` has 5,824 rows and 4 columns, while `observations` has 23,296 rows and 3 columns.

In [34]:
print(f"species shape: {species.shape}")
print(f"observations shape: {observations.shape}")

species shape: (5824, 4)
observations shape: (23296, 3)


## Checking data types and missing values

In [42]:
species.info()
observations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name      5824 non-null   object
 2   common_names         5824 non-null   object
 3   conservation_status  191 non-null    object
dtypes: object(4)
memory usage: 182.1+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB


In [54]:
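# A notebook cell displays only its last expression, so the output below is for species.
# observations has no missing values (see the non-null counts from info() above).
observations.isnull().sum()
species.isnull().sum()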

category                  0
scientific_name           0
common_names              0
conservation_status    5633
dtype: int64

## Handling Missing Data

The conservation_status column is missing 5,633 of its 5,824 values (about 97%), so most species in the dataset have no conservation status specified.

In [68]:
# Checking the categories in this column gives a much clearer idea of what kind of values are missing.
species['conservation_status'].value_counts()

conservation_status
Species of Concern    161
Endangered             16
Threatened             10
In Recovery             4
Name: count, dtype: int64
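
`value_counts()` skips missing values by default, so the counts above cover only the 191 species that have a status. Below is a minimal sketch of how `dropna=False` puts the missing entries in the same tally. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the species table; values are illustrative only.
toy = pd.DataFrame({
    "scientific_name": ["Species a", "Species b", "Species c", "Species d", "Species e"],
    "conservation_status": ["Endangered", np.nan, np.nan, "Threatened", np.nan],
})

# dropna=False keeps NaN as its own row in the tally.
status_counts = toy["conservation_status"].value_counts(dropna=False)
print(status_counts)

# Fraction of rows with no status recorded.
missing_share = toy["conservation_status"].isna().mean()
print(f"missing share: {missing_share:.0%}")
```

Seeing the missing count next to the labelled categories makes it easier to judge how much of the column a fill value would affect.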

In [88]:
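# A missing conservation status implies that the species is not categorized as endangered or threatened,
# so filling the missing values with 'Not classified' makes more sense than dropping them.
# This way we retain the data and still provide context about the missing information.

# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may act on a temporary copy and triggers a warning in recent pandas versions.
species['conservation_status'] = species['conservation_status'].fillna('Not classified')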
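
As a quick check of the fill step, here is a minimal sketch on made-up data (the `toy` frame is hypothetical). It assigns the result of `fillna` back to the column and then confirms that no missing values remain.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the species table; values are illustrative only.
toy = pd.DataFrame({
    "scientific_name": ["Species a", "Species b", "Species c"],
    "conservation_status": ["Endangered", np.nan, np.nan],
})

# Replace missing statuses with an explicit label and assign the result back.
toy["conservation_status"] = toy["conservation_status"].fillna("Not classified")

# Verify the fill: no NaN left, and the two gaps now carry the label.
remaining_missing = int(toy["conservation_status"].isna().sum())
filled_count = int((toy["conservation_status"] == "Not classified").sum())
print(remaining_missing, filled_count)
```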

## Merging DataFrames

Merging the two DataFrames on their common column (scientific_name) puts each observation next to the category and conservation status of its species, which makes the questions above much easier to answer.
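
A minimal sketch of such a merge on made-up rows follows; both toy frames are hypothetical. The left join from the observations side and the `validate` check are assumptions, not necessarily the approach taken in this project. With `validate="many_to_one"`, pandas raises an error if `scientific_name` is repeated in the species table, which would otherwise silently duplicate observation rows.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; all values are illustrative only.
toy_species = pd.DataFrame({
    "category": ["Mammal", "Bird"],
    "scientific_name": ["Species a", "Species b"],
    "conservation_status": ["Not classified", "Endangered"],
})
toy_observations = pd.DataFrame({
    "scientific_name": ["Species a", "Species a", "Species b"],
    "park_name": ["Park one", "Park two", "Park one"],
    "observations": [10, 20, 30],
})

# Left join keeps every observation row; validate checks that each
# scientific_name appears at most once on the species side.
merged = toy_observations.merge(
    toy_species,
    on="scientific_name",
    how="left",
    validate="many_to_one",
)
print(merged.shape)
print(merged.head())
```

If the check fails on the real data, the species table would need deduplicating on `scientific_name` before merging.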
