# DEEP SEA CORALS PROJECT
***

# Goal
***

The goal of this project is to gather insights about corals and their related data from the data set provided.

# Acquire
Acquiring the data from a local CSV file
***

In [1]:
# establishing environment
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

In [2]:
# importing data
df = pd.read_csv('deep_sea_corals.csv')

In [3]:
# previewing data
df.head()

Unnamed: 0,CatalogNumber,DataProvider,ScientificName,VernacularNameCategory,TaxonRank,Station,ObservationDate,latitude,longitude,DepthInMeters,DepthMethod,Locality,LocationAccuracy,SurveyID,Repository,IdentificationQualifier,EventID,SamplingEquipment,RecordType,SampleID
0,,,,,,,,degrees_north,degrees_east,,,,,,,,,,,
1,625366.0,"NOAA, Deep Sea Coral Research & Technology Pro...",Madrepora oculata,stony coral (branching),species,D2-EX1504L3-05,2015-09-02,18.30817,-158.45392,959.0,reported,"Hawaiian Archipelago, Swordfish Seamount",50m,Hohonu Moana: Exploring Deep Waters off Hawai'i,University of Hawaii,ID by expert from video,D2-EX1504L3-05,ROV,video observation,EX1504L3_05_20150901T181522Z.mp4_05:45:26:28
2,625373.0,"NOAA, Deep Sea Coral Research & Technology Pro...",Madrepora oculata,stony coral (branching),species,D2-EX1504L3-05,2015-09-01,18.30864,-158.45393,953.0,reported,"Hawaiian Archipelago, Swordfish Seamount",50m,Hohonu Moana: Exploring Deep Waters off Hawai'i,University of Hawaii,ID by expert from video,D2-EX1504L3-05,ROV,video observation,EX1504L3_05_20150901T181522Z.mp4_05:24:35:53
3,625386.0,"NOAA, Deep Sea Coral Research & Technology Pro...",Madrepora oculata,stony coral (branching),species,D2-EX1504L3-05,2015-09-01,18.30877,-158.45384,955.0,reported,"Hawaiian Archipelago, Swordfish Seamount",50m,Hohonu Moana: Exploring Deep Waters off Hawai'i,University of Hawaii,ID by expert from video,D2-EX1504L3-05,ROV,video observation,EX1504L3_05_20150901T181522Z.mp4_05:15:22:09
4,625382.0,"NOAA, Deep Sea Coral Research & Technology Pro...",Madrepora oculata,stony coral (branching),species,D2-EX1504L3-05,2015-09-01,18.30875,-158.45384,955.0,reported,"Hawaiian Archipelago, Swordfish Seamount",50m,Hohonu Moana: Exploring Deep Waters off Hawai'i,University of Hawaii,ID by expert from video,D2-EX1504L3-05,ROV,video observation,EX1504L3_05_20150901T181522Z.mp4_05:13:29:50
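The preview shows that row 0 holds units (`degrees_north`, `degrees_east`) rather than an observation, which is probably why `latitude` and `longitude` load as `object` columns. Below is a minimal sketch of skipping that row at read time. It uses a tiny made-up inline CSV in place of the real file, so `sample_csv` and `sample` are hypothetical names and the values are only illustrative.

```python
import io
import pandas as pd

# Tiny made-up stand-in for deep_sea_corals.csv: a header, a units row, then data.
sample_csv = io.StringIO(
    "CatalogNumber,latitude,longitude,DepthInMeters\n"
    ",degrees_north,degrees_east,\n"
    "625366,18.30817,-158.45392,959\n"
    "625373,18.30864,-158.45393,953\n"
)

# skiprows=[1] skips the second line of the file (the units row) before parsing,
# so the coordinate columns can be inferred as floats instead of strings.
sample = pd.read_csv(sample_csv, skiprows=[1])

print(sample.dtypes)
```

I haven't applied this to the real file here; the rest of the notebook keeps the original `read_csv` call and removes the units row later through `dropna()`.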


# Prepare
Preparing the data for exploration and modeling
***

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 513373 entries, 0 to 513372
Data columns (total 20 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   CatalogNumber            513372 non-null  float64
 1   DataProvider             513372 non-null  object 
 2   ScientificName           513372 non-null  object 
 3   VernacularNameCategory   513197 non-null  object 
 4   TaxonRank                513364 non-null  object 
 5   Station                  253590 non-null  object 
 6   ObservationDate          513367 non-null  object 
 7   latitude                 513373 non-null  object 
 8   longitude                513373 non-null  object 
 9   DepthInMeters            513372 non-null  float64
 10  DepthMethod              496845 non-null  object 
 11  Locality                 389645 non-null  object 
 12  LocationAccuracy         484662 non-null  object 
 13  SurveyID                 306228 non-null  object 
 14  Repo

- Drop columns that will not be used in this iteration of the project
    - CatalogNumber, SampleID, SurveyID, EventID, and Station
        - Categorical columns with a very large number of unique values that offer no insight within the scope of this project
    - Locality
        - Column holds a very large number of distinct categorical values
        - It would be easier to work with if I binned the values, since many appear to be near each other, but I'll save this for a later iteration of the project because it may take a significant amount of time

     
     
- Many null values
    - I'll drop them after dropping columns I don't plan to use for this first iteration of this project
        - If too many rows are lost I'll impute values to preserve more rows 
        
        
- Data types look okay for now, but I'll update them if needed to facilitate operations


- Rename columns 
    - all lowercase
    - "_" between words in names


- Make all values lowercase where applicable


- Edit: DepthInMeters has a negative value in 23 rows
    - All of them have a value of -999
    - I'm not certain, but I suspect this value was used as a placeholder for null
    - I'm going to drop these rows as well
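One way to handle the suspected -999 placeholder is to count it first and then turn it into a real null, so a single `dropna()` removes it together with the other missing values. This is only a sketch on made-up depths, not the approach used in the cells below (which filter on `depth_meters >= 0`); `sample`, `n_sentinel`, and `cleaned` are hypothetical names.

```python
import pandas as pd

# Made-up depths; -999 stands in for "missing", as suspected above.
sample = pd.DataFrame({"DepthInMeters": [959.0, -999.0, 953.0, -999.0, 0.0]})

# Count the rows that carry the suspected sentinel value.
is_sentinel = sample["DepthInMeters"] == -999
n_sentinel = int(is_sentinel.sum())

# mask() replaces values with NaN wherever the condition is True,
# after which dropna() removes those rows like any other null.
cleaned = sample.assign(
    DepthInMeters=sample["DepthInMeters"].mask(is_sentinel)
).dropna()

print(n_sentinel, len(cleaned))
```

Counting before dropping makes it easy to confirm that every negative depth really is the placeholder and not a genuine measurement.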

### Dropping Columns

In [5]:
# dropping specified columns
df = df.drop(columns = ['CatalogNumber', 'SampleID', 'SurveyID', 'EventID', 'Station', 'Locality'])

### Dropping Nulls

In [6]:
# dropping all rows that contain at least one null value
df = df.dropna()

### Renaming Columns

In [7]:
# adding underscores to various column names
df.columns = ['Data_Provider', 'Scientific_Name', 'Vernacular_Name_Category', 'Taxon_Rank',
              'Observation_Date', 'latitude', 'longitude', 'Depth_Meters','Depth_Method', 
              'Location_Accuracy', 'Repository', 'Identification_Qualifier', 'Sampling_Equipment',
              'Record_Type']

# lower casing all column names
df.columns = df.columns.str.lower()
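Assigning a full list to `df.columns` depends on the columns being in exactly the expected order. A possible alternative is to derive the names with a small helper. The sketch below assumes a hypothetical function `to_snake_case` and a made-up empty frame; note that it would produce `depth_in_meters` rather than the `depth_meters` name used in the rest of this notebook.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Insert '_' at each lowercase-to-uppercase boundary, then lowercase."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "_", name).lower()

# Made-up frame that only carries a few of the original column names.
sample = pd.DataFrame(
    columns=["DataProvider", "VernacularNameCategory", "DepthInMeters", "latitude"]
)

# rename() accepts a function and applies it to every column label.
sample = sample.rename(columns=to_snake_case)

print(list(sample.columns))
```

Because the helper works on each label independently, it keeps working if columns are added, dropped, or reordered.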

### Converting all values to lower case

In [8]:
# lowercasing every string value; non-string values are left unchanged
df = df.applymap(lambda value: value.lower() if isinstance(value, str) else value)
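The element-wise call above touches every cell, including the numeric ones. A column-wise variant that only processes text columns might look like the sketch below. It runs on a small made-up frame (`sample` is a hypothetical name) and assumes each text column holds only strings or nulls.

```python
import pandas as pd

# Made-up frame with one text column and one numeric column.
sample = pd.DataFrame({
    "repository": ["University of Hawaii", "MBARI"],
    "depth_meters": [959.0, 953.0],
})

# Lowercase only the columns pandas recognizes as holding strings,
# using the vectorized .str accessor instead of a per-cell lambda.
for col in sample.columns:
    if pd.api.types.is_string_dtype(sample[col]):
        sample[col] = sample[col].str.lower()

print(sample)
```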

### Dropping rows with negative depth_meters value

In [82]:
# keeping only rows with a non-negative depth (removes the -999 placeholder rows)
df = df[df.depth_meters >= 0]

# Explore
Exploring the data to draw insights about the corals and data related to them.
***

In [18]:
df.columns

Index(['data_provider', 'scientific_name', 'vernacular_name_category',
       'taxon_rank', 'observation_date', 'latitude', 'longitude',
       'depth_meters', 'depth_method', 'location_accuracy', 'repository',
       'identification_qualifier', 'sampling_equipment', 'record_type'],
      dtype='object')

## What are the major sources of this data?

In [47]:
round(df.data_provider.value_counts() / len(df),2).head(10)

monterey bay aquarium research institute                0.42
noaa, alaska fisheries science center                   0.16
noaa, southwest fisheries science center, santa cruz    0.09
noaa, olympic coast national marine sanctuary           0.08
hawaii undersea research laboratory                     0.07
noaa, office of ocean exploration and research          0.03
temple university                                       0.02
harbor branch oceanographic institute                   0.02
noaa, southwest fisheries science center, la jolla      0.02
noaa, northwest fisheries science center                0.02
Name: data_provider, dtype: float64

- After the Monterey Bay Aquarium Research Institute, NOAA appears to be the next largest provider
- I'm going to find out what share of the observations comes from NOAA facilities

In [46]:
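# share of all observations contributed by each NOAA provider, rounded to 2 decimals
noaa_shares = round(df[df.data_provider.str.contains('noaa')].data_provider.value_counts() / len(df), 2)

print(noaa_shares,'\n')
print(f'Total % of observations from NOAA {noaa_shares.sum()}')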

noaa, alaska fisheries science center                                                              0.16
noaa, southwest fisheries science center, santa cruz                                               0.09
noaa, olympic coast national marine sanctuary                                                      0.08
noaa, office of ocean exploration and research                                                     0.03
noaa, southwest fisheries science center, la jolla                                                 0.02
noaa, northwest fisheries science center                                                           0.02
noaa, deep sea coral research & technology program and office of ocean exploration and research    0.01
noaa, channel islands national marine sanctuary                                                    0.01
noaa, flower garden banks national marine sanctuary                                                0.01
noaa, office of response and restoration                        

- Approximately 42% of the data comes from the Monterey Bay Aquarium Research Institute
- Approximately 43% of the data comes from NOAA facilities
- The remaining 15% of observations come from other providers
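The NOAA total is a sum of shares that were each rounded to two decimals first, so it can drift slightly from the true figure, and providers that round to 0.00 contribute nothing. A sketch of computing the overall share directly from a boolean mask is below; the provider names are a made-up subset and `providers`, `is_noaa`, and `noaa_share` are hypothetical names.

```python
import pandas as pd

# Made-up provider column; on the real data this would be df.data_provider.
providers = pd.Series([
    "monterey bay aquarium research institute",
    "noaa, alaska fisheries science center",
    "noaa, office of ocean exploration and research",
    "temple university",
])

# True for every row whose provider name mentions NOAA (plain substring match).
is_noaa = providers.str.contains("noaa", regex=False)

# The mean of a boolean Series is the fraction of True values.
noaa_share = is_noaa.mean()

print(f"Share of observations from NOAA: {noaa_share:.2%}")
```

Rounding only when formatting the result keeps the intermediate arithmetic exact.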

## What is the range of dates that the observations cover?

In [88]:
first_date = df.observation_date.min()

last_date = df.observation_date.max()

print(f'The dates of the observations range from {first_date} to {last_date}.')

The dates of the observations range from 1868-05-04 to 2016-03-27.
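`observation_date` is still an `object` column at this point, so `min()` and `max()` compare the values as text. That gives the right answer only while every value is a well-formed `YYYY-MM-DD` string. A sketch of parsing the column first, on made-up values that include one deliberately bad entry, could look like this (`dates` and `parsed` are hypothetical names):

```python
import pandas as pd

# Made-up date strings, including one malformed entry.
dates = pd.Series(["2015-09-02", "1868-05-04", "2016-03-27", "not a date"])

# errors="coerce" turns anything that does not match the format into NaT
# instead of raising, so bad values can be counted and inspected.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# min() and max() skip NaT by default.
print(parsed.min().date(), parsed.max().date())
print(f"Unparseable dates: {parsed.isna().sum()}")
```

I have not checked whether the real column contains malformed dates; this only shows how such a check could be done.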


## What are the shallowest and deepest depths within the data?

In [87]:
least_deep = df.depth_meters.min()
most_deep = df.depth_meters.max()

print(f'The most shallow observation(s) had a depth of {least_deep}, and the deepest observation(s) had a depth of {most_deep}.')

The most shallow observation(s) had a depth of 0.0, and the deepest observation(s) had a depth of 6292.0.
