# Oceanographic Analysis - CalCOFI Capstone

In [None]:
from IPython.display import IFrame
embed = "https://docs.google.com/presentation/d/e/2PACX-1vTFUKkshFjS3SpFm392ru6L5CVemXbfU2Op1NaaEFiW16x4Je70wKbiR0u_TcR0UyqOiINeGCNVUquK/embed?"
IFrame(embed,frameborder="0", width="900", align="center", height="569", allowfullscreen="true", mozallowfullscreen="true", webkitallowfullscreen="true")

# 1. Problem Identification

###  <font color='blue'>Problem Statement</font>
<strong>Help policy makers understand oceanographic trends to make intelligent choices that will save the marine ecosystem.</strong>

### Summary
The California Cooperative Oceanic Fisheries Investigations (CalCOFI) is a unique partnership of the California Department of Fish & Wildlife, NOAA Fisheries Service, and Scripps Institution of Oceanography. The organization was formed in 1949 to study the ecological aspects of the sardine population collapse off California. Today its focus has shifted to the study of the marine environment off the coast of California, the management of its living resources, and monitoring the indicators of El Niño and climate change. CalCOFI conducts quarterly cruises off southern and central California, collecting a suite of hydrographic and biological data on station and underway. Data collected at depths down to 500 meters include: temperature, salinity, oxygen, phosphate, silicate, nitrate and nitrite, chlorophyll, transmissometer, PAR, C14 primary productivity, phytoplankton biodiversity, zooplankton biomass, and zooplankton biodiversity.

### Overview
<p>Oceanography, the study of the physical, chemical, and biological features of the ocean, is important for identifying the factors that threaten the ocean and its marine life. Studying the ocean also matters because it covers more than 70 percent of the surface of our planet.
According to NASA, big shifts in salinity could be a warning that more severe droughts and floods are on their way, or even that global warming is speeding up.
</p>

### Context
Because climate change is altering the ocean's chemistry, it is important to understand the trends that threaten the ocean and its marine life. Changes in temperature affect the melting of ice, sea levels, and ocean currents.  
The migration patterns of marine species are disrupted, and some marine species are on the verge of extinction.

### Criteria for Success
Build a model that can accurately identify and predict the factors affecting the ocean.

### Scope of Solution Space
Determine the top 5 important features that are useful to predict the factors threatening the ocean.
Build additional features out of existing data (Feature Engineering) and perform Exploratory Data Analysis.

### Constraints within Solution Space
There are 61 features, but most of them have missing data.

### Data Acquisition and Key Data Sources
The data is provided by CalCOFI:

https://calcofi.org/ccdata/database.html

The following table includes the most important features of the dataset and their description.

| | <strong>Features</strong> | <strong>Description</strong> |
|------|------|------|
| 1 | Depthm | Depth of the sample in meters |
| 2 | TempDegF | Water temperature in degrees Fahrenheit (converted from T_degC) |
| 3 | Salinity | Salinity of the water (raw column Salnty) |
| 4 | STheta | Potential density of the water (sigma-theta) |


# 2. Data Wrangling
- Collect, organize, define, and clean relevant datasets.

## Data Collection

In [None]:
#Import necessary libraries
import pandas as pd
import numpy as np

# Plotting modules
import matplotlib.pyplot as plt
import seaborn as sns

# Analysing datetime
from datetime import datetime as dt
from datetime import timedelta

# File system management
import os,sys

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

#Interactive Shell
from IPython.core.interactiveshell import InteractiveShell  
InteractiveShell.ast_node_interactivity = "all"

#Pandas profiling
from pandas_profiling import ProfileReport

import missingno as msno
import re 

%matplotlib inline

## Workspace

In [None]:
cwd = os.getcwd()
print(cwd)

In [None]:
os.listdir(cwd)
#os.listdir( os.getcwd() )

## Load the Data from CSV File

In [None]:
#KAGGLE.com
path = '/kaggle/input/calcofi/bottle.csv'
bottle = pd.read_csv(path) 

In [None]:
# Import CSV file and read the dataset
#path = '../data/calcofi/bottle.csv'
#bottle = pd.read_csv(path, encoding='latin-1') 

In [None]:
# Show all columns
pd.set_option('display.max_columns', None)

#### First 5 rows 

In [None]:
bottle.head()

#### Last 3 rows

In [None]:
bottle.tail(3)

### Functions

In [None]:
# Convert Celsius to Fahrenheit
def cel_to_fahr(x):
    x = x * 1.8 + 32
    return float(x)
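
As a quick sanity check of the conversion F = 1.8 × C + 32, the sketch below redefines the helper on its own and evaluates it at the freezing and boiling points of water. The variable names `freezing` and `boiling` are introduced here only for illustration.

```python
def cel_to_fahr(x):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return float(x * 1.8 + 32)

# Two well-known reference points: water freezes at 0 °C and boils at 100 °C.
freezing = cel_to_fahr(0)
boiling = cel_to_fahr(100)
print(freezing, boiling)
```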

In [None]:
# Dimensions of the dataset. #(samples,features)
print("There are", bottle.shape[0], "Rows(Observations).")
print("There are", bottle.shape[1], "Columns(Features).")
bottle.shape

___

## Data Organization

#### Create SubFolders

In [None]:
newfolder = "../OceanographicAnalysisCalCOFI/data"

try:
    os.mkdir(newfolder)
except OSError:
    print ("Creation of the directory %s failed" % newfolder)
else:
    print ("Successfully created the directory %s " % newfolder)

In [None]:
newfolder = "../OceanographicAnalysisCalCOFI/figures"

try:
    os.mkdir(newfolder)
except OSError:
    print ("Creation of the directory %s failed" % newfolder)
else:
    print ("Successfully created the directory %s " % newfolder)

In [None]:
newfolder = "../OceanographicAnalysisCalCOFI/models"

try:
    os.mkdir(newfolder)
except OSError:
    print ("Creation of the directory %s failed" % newfolder)
else:
    print ("Successfully created the directory %s " % newfolder)
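
The three cells above repeat the same logic, and `os.mkdir` fails when the parent folder is missing or the folder already exists. A possible alternative, sketched here against a temporary directory (the `base` path is a stand-in, not the project path), is a single loop with `os.makedirs(..., exist_ok=True)`:

```python
import os
import tempfile

# Hypothetical base directory; a temporary folder keeps the example free of side effects.
base = os.path.join(tempfile.mkdtemp(), "OceanographicAnalysisCalCOFI")

created = []
for sub in ("data", "figures", "models"):
    folder = os.path.join(base, sub)
    # exist_ok=True makes the call safe to re-run; missing parents are created as needed.
    os.makedirs(folder, exist_ok=True)
    created.append(folder)

print(created)
```

Re-running the loop does not raise, which is convenient when a notebook is executed more than once.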

---

## Data Definition

### Explore the Data

In [None]:
# Get DataFrame Information
bottle.info()

The output of `info()` lists the dtype and non-null count of every column. Several columns contain missing values, which are handled in the next section. The object columns include the identifiers Depth_ID and Sta_ID.

#### Variable Types

In [None]:
bottle.dtypes.value_counts()

In [None]:
print(bottle.columns)

### Handling Missing Values

In [None]:
#Counts and percentage of null values 
dictionary = {
    "NullCount":bottle.isnull().sum().sort_values(ascending=False),
    "NullPercent":bottle.isnull().sum().sort_values(ascending=False)/len(bottle)*100
}

na_df = pd.DataFrame(dictionary)
na_df.columns = ['NullCount','NullPercent']
na_df[(na_df['NullCount'] > 0)].reset_index()


In row 40 of the previous DataFrame we see that O2ml_L is missing 19.10% of its data. We will therefore delete all columns (features) that have more than 19% missing data.

In [None]:
pct_null = bottle.isnull().sum() / len(bottle)
missing_features = pct_null[pct_null > 0.19].index
bottle.drop(missing_features, axis=1, inplace=True)
df = bottle
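
To make the column filter concrete, here is a minimal sketch on a made-up DataFrame. The column names, values, and the 30% cut-off are assumptions chosen for illustration, not CalCOFI data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): column "b" is 50% missing, "c" is 25% missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [np.nan, 1.0, 2.0, 3.0],
})

threshold = 0.30                       # hypothetical cut-off for this example
pct_null = toy.isnull().mean()         # fraction of missing values per column
to_drop = pct_null[pct_null > threshold].index
reduced = toy.drop(columns=to_drop)    # returns a new frame; `toy` is unchanged

print(list(reduced.columns))
```

Using `drop(columns=...)` without `inplace=True` leaves the original frame intact, which can help when the raw data is needed again later.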

In [None]:
# Visualize Missingness
msno.matrix(df)
plt.show()

In [None]:
print ( df.nunique() / df.shape[0] * 100 )

In [None]:
df.head()

### Rename Columns

In [None]:
df.columns

In [None]:
df = df.rename(columns = { 
    "Cst_Cnt": "CastCount",
    "Btl_Cnt": "BottleCount",
    "Depthm": "DepthMeters",
    "T_degC": "TempDegC",
    "Salnty": "Salinity",    
    "STheta": "PDensity"
    
})

### Set BottleCount to be Index

In [None]:
df = df.set_index('BottleCount')

### Extract Year and Month from Depth_ID 

In [None]:
# Extract Year
search = []    

for values in df['Depth_ID']:
    search.append(re.search(r'\d{2}-\d{2}', values).group())
    
df['Year'] = search
df['Year'] = df['Year'].replace(to_replace='-',value='', regex = True) 

df['Year'] = pd.to_datetime(df['Year']).values.astype('datetime64[Y]')
df['Year'] =  pd.DatetimeIndex(df['Year']).year

In [None]:
# Extract Month 
search = []    

for values in df['Depth_ID']:
    search.append(re.search(r'-\d+', values).group())
    
df['Month'] = search
df['Month'] = df['Month'].str[-2:]

df['Month'] = df['Month'].astype('int64')
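
The two loops above can also be expressed as one vectorized step. The sketch below assumes that every `Depth_ID` starts with a two-digit century, a hyphen, then a two-digit year and a two-digit month; the sample IDs are illustrative strings in that layout, not records taken from the dataset.

```python
import pandas as pd

# Illustrative IDs in the assumed layout "<century>-<YY><MM>..."; not real records.
ids = pd.Series([
    "19-4903CR-HY-060-0930-05400560-0000A-3",
    "20-1611SR-MX-310-2239-09340264-0015A-3",
])

# Named groups become the column names of the resulting DataFrame.
parts = ids.str.extract(r"^(?P<century>\d{2})-(?P<yy>\d{2})(?P<mm>\d{2})")
year = (parts["century"] + parts["yy"]).astype("int64")
month = parts["mm"].astype("int64")

print(year.tolist(), month.tolist())
```

`Series.str.extract` avoids the Python-level loop, which may matter on a table with many rows.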

### Convert from Celsius to Fahrenheit

In [None]:
df['TempDegF'] = df['TempDegC'].apply(cel_to_fahr)
df = df.drop("TempDegC", axis = 1)

In [None]:
print('Salinity:', df.Salinity.unique() ) 
print('TempDegF:', df.TempDegF.unique() ) 

## Dataset Statistics

In [None]:
df.describe(include="all").T

___

## Detect Anomalies & Outliers

#### Range of values per column

In [None]:
# Use a name that does not shadow the built-in range()
value_range = df.aggregate(['min', 'max'])
print(value_range)

### Year - Month

In [None]:
sns.boxplot(x='Year',data=df)

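The current year is 2020, so any later year values, such as 2093, are invalid and will be removed.

In [None]:
# Keep valid years; out-of-range values become NaN (the rows themselves are kept)
df['Year'] = df['Year'].where(df['Year'] <= 2020)

Some records have a month value greater than 12, so those values are removed as well.

In [None]:
# Keep valid months; out-of-range values become NaN (the rows themselves are kept)
df['Month'] = df['Month'].where(df['Month'] <= 12)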
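
The cells above keep every row and leave the out-of-range values as NaN. A small sketch on made-up values contrasts that with dropping the affected rows entirely; which option is preferable depends on whether the other measurements in those rows are trusted.

```python
import pandas as pd

# Made-up values: the last row carries an implausible year.
toy = pd.DataFrame({"Year": [1949, 2019, 2093], "Month": [3, 11, 7]})

# Option 1: mask the implausible value; the row is kept with Year = NaN.
masked = toy.copy()
masked["Year"] = masked["Year"].where(masked["Year"] <= 2020)

# Option 2: drop the whole row.
filtered = toy[toy["Year"] <= 2020]

print(len(masked), masked["Year"].isna().sum(), len(filtered))
```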

In [None]:
df["Salinity"].describe(include="all").T

In [None]:
df["TempDegF"].describe(include="all").T

#### Check for Duplicated Rows

In [None]:
duplicateRowsDF = df.duplicated() 
df[duplicateRowsDF]

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.shape

---

#  3. EXPLORATORY DATA ANALYSIS

#### Categorical columns and their associated levels.

In [None]:
dfo = df.select_dtypes(include=['object'], exclude=['datetime'])
dfo.shape
#get levels for all variables
vn = pd.DataFrame(dfo.nunique()).reset_index()
vn.columns = ['VarName', 'LevelsCount']
vn.sort_values(by='LevelsCount', ascending = False)

In [None]:
df = df.drop(['Depth_ID','Sta_ID'],axis=1)

In [None]:
corr = df.corr()

plt.figure(figsize=(20,10))
sns.heatmap(corr,
            linecolor='blue',linewidths=.1, 
            cmap="YlGnBu", annot=True)
plt.yticks(rotation=0);

C14A1q, C14A2q, DarkAq, and MeanAq are highly correlated with one another, so we keep only MeanAq. The reported depth, temperature, and salinity (R_Depth, R_TEMP, R_SALINITY) are strongly positively correlated with their measured counterparts, so we drop them and keep only the reported dynamic height (R_DYNHT).

In [None]:
df = df.drop(['CastCount','R_Depth','R_TEMP', 'R_SALINITY', 'C14A1q', 'C14A2q', 'DarkAq' ], axis=1)

# 4. PreProcessing

In [None]:
from sklearn.preprocessing import StandardScaler

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

## SimpleImputer

In [None]:
numeric_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
])

numeric_features = df.select_dtypes(include=['int64', 'float64']).columns

preprocessor = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('num', numeric_transformer, numeric_features),
    ]
)

steps = [('preprocessor', preprocessor)]

pipeline = Pipeline(steps)

pipeline.fit(df[:])
df_pipe = pipeline.transform(df[:])

In [None]:
df_pipe = pd.DataFrame(df_pipe)
df_pipe

In [None]:
#Change the name of columns back to original names.
df_pipe.columns = df.columns

df_pipe.sample()
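
The pipeline above is fitted on the full DataFrame, so the imputation means and scaling statistics also reflect rows that later land in the test set. One way to avoid that, sketched here on synthetic data (the column names `f1` to `f3` and all values are invented), is to split first and fit the preprocessing on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not CalCOFI values).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
X.iloc[::10, 0] = np.nan  # inject some missing values into the first column

X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

prep = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Means and standard deviations are learned from the training rows only,
# then re-used unchanged on the test rows.
X_train_t = prep.fit_transform(X_train)
X_test_t = prep.transform(X_test)

print(X_train_t.mean(axis=0).round(6), np.isnan(X_test_t).any())
```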

# 5. MODELING

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error


from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [None]:
column_names = ["Salinity","TempDegF"]
sal_temp = df.reindex(columns=column_names)

In [None]:
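# Fill missing values with the column mean (assignment avoids chained in-place modification)
sal_temp['Salinity'] = sal_temp['Salinity'].fillna(sal_temp['Salinity'].mean())
sal_temp['TempDegF'] = sal_temp['TempDegF'].fillna(sal_temp['TempDegF'].mean())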

## Simple Linear Regression:

In [None]:
sal_temp.head()

In [None]:
X = sal_temp.Salinity.values
y = sal_temp.TempDegF.values

# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))

# Reshape X and y
y = y.reshape(-1,1)
X = X.reshape(-1,1)

# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 12)

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)

#Predict the Test set results
y_pred = model.predict(X_test)

#### Visualize the Training set results

In [None]:
_= plt.scatter(X_train, y_train, color = 'red')
_= plt.plot(X_train, model.predict(X_train), color = 'blue')
_= plt.title('Temperature vs Salinity (Training set)')
_= plt.xlabel('Salinity')
_= plt.ylabel('Temperature')
_= plt.show()

#### Visualize the Test set results

In [None]:
_= plt.scatter(X_test, y_test, color = 'red')
_= plt.plot(X_train, model.predict(X_train), color = 'blue')
_= plt.title('Temperature vs Salinity (Test set)')
_= plt.xlabel('Salinity')
_= plt.ylabel('Temperature')
_= plt.show()

### Split DataSet to Training set and Test set - For entire Pipeline

In [None]:
X = df_pipe.drop(['TempDegF'],axis=1).values
y = df_pipe.TempDegF.values

SEED = 42
TS = 0.30

# Create training and test sets
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size = TS, random_state=SEED)

#Feature Scaling to prevent information leakage
#sc = StandardScaler()
#X_train = sc.fit_transform(X_train)
#X_test = sc.transform(X_test)

print (X_train.shape)
print (y_train.shape)

print (X_test.shape)
print (y_test.shape)

## Multiple Linear Regression:

In [None]:
# Create linear regression model
linreg = LinearRegression()

# Train the model using the training sets
linreg.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = linreg.predict(X_test)

linreg.score(X_test,y_test)

linreg_training_score = round(linreg.score(X_train, y_train) * 100, 2)
linreg_test_score = round(linreg.score(X_test, y_test) * 100, 2)

print('Linear Regression Training Score: \n', linreg_training_score)
print('Linear Regression Test Score: \n', linreg_test_score)

# Compute and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))

In [None]:
# 5-fold cross-validation:
cv_scores_5 = cross_val_score(linreg, X, y, cv=5)
print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores_5)))


# 15-fold cross-validation:
cv_scores_15 = cross_val_score(linreg, X, y, cv=15)
print("Average 15-Fold CV Score: {}".format(np.mean(cv_scores_15)))

# 25-fold cross-validation:
cv_scores_25 = cross_val_score(linreg, X, y, cv=25)
print("Average 25-Fold CV Score: {}".format(np.mean(cv_scores_25)))
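
When `cv` is an integer and the estimator is a regressor, scikit-learn splits the rows with `KFold` in their stored order, without shuffling. If the rows are ordered (for example by collection date), shuffled folds can give a different picture. The following is a sketch on synthetic data; the coefficients and noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (not CalCOFI data).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Shuffle the rows before splitting; random_state makes the folds reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)  # default scoring for regressors: R^2

print(scores.mean(), scores.std())
```

Reporting the standard deviation alongside the mean gives a sense of how much the score varies between folds.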

## DECISION TREE:

In [None]:
dtr = DecisionTreeRegressor(random_state=42)
model = dtr.fit(X_train, y_train)

y_pred = model.predict(X_test)


dtr_training_score = round(model.score(X_train, y_train) * 100, 2)
dtr_test_score = round(model.score(X_test, y_test) * 100, 2)

print('Decision Tree Training Score: \n', dtr_training_score)
print('Decision Tree Test Score: \n', dtr_test_score)

## RANDOM FOREST:

In [None]:
rfr = RandomForestRegressor(random_state=0, n_jobs=-1)
model = rfr.fit(X_train, y_train)
y_pred = model.predict(X_test)

rfr_training_score = round(model.score(X_train, y_train) * 100, 2)
rfr_test_score = round(model.score(X_test, y_test) * 100, 2)

print('Random Forest Training Score: \n', rfr_training_score)
print('Random Forest Test Score: \n', rfr_test_score)
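
A fitted random forest exposes `feature_importances_`, which is one possible way to rank features such as the "top 5" mentioned in the scope. The sketch below uses synthetic data in which only the invented column `f1` drives the target; it is not the ranking obtained in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on f1 only (illustrative assumption).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f1", "f2", "f3", "f4"])
y = 5.0 * X["f1"] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort them to get a ranking.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print(importances.head(2))
```

Impurity-based importances can favor high-cardinality features, so they are best read as a rough guide.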

We will look at the predicted temperatures to ensure we have something sensible.


In [None]:
print(y_pred)

In [None]:
models = pd.DataFrame({
    
    'Model': [ 
        'Linear Regression',
        'Decision Tree',
        'Random Forest',   
    ],
             
    
    'Training Score': [ 
        linreg_training_score,
        dtr_training_score, 
        rfr_training_score,
    ],
    
    'Test Score': [ 
        linreg_test_score,
        dtr_test_score,
        rfr_test_score,
    ]})


models.sort_values(by='Test Score', ascending=False)

In [None]:
df.aggregate([min, max])

# 6. DOCUMENTATION

### Summary
This is a supervised regression project to analyze oceanographic trends.

Temperature (TempDegF) is the target variable. It was converted from Celsius to Fahrenheit and ranges from 34 to 88 degrees F.  

The bottle samples are the observations; they were collected from 1949 to 2019.
The salinity of the water was between 28 and 37.
The depth was between 0 and 5351 meters. 

### Data Preprocessing:
Dropped 15 duplicated rows. 
A Pipeline was used to first impute missing data with the mean using 'SimpleImputer' and then scale the dataframe with 'StandardScaler'. 
For simple linear regression, only Salinity and Temperature were used. Missing values were filled with the column mean using the pandas fillna method. 


### Model Performance:
The R^2 score (coefficient of determination) was used to determine the best model.
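
For reference, these are the standard definitions of the two metrics computed in the modeling section, where $y_i$ is an observed value, $\hat{y}_i$ the corresponding prediction, $\bar{y}$ the mean of the observed values, and $n$ the number of test samples (the notation is introduced here and does not appear elsewhere in the notebook):

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$

An $R^2$ of 1 means the predictions match the observations exactly, while a value of 0 corresponds to always predicting $\bar{y}$.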

### Model Findings:
The most important features were: 
1. Salinity
2. Temperature

The reported dynamic height (R_DYNHT) was initially removed and then added back to the model because it improved the R^2 score significantly. 

## EXPORT DATA...

In [None]:
#df.to_csv('../data/calcofi/wrangle_csv.csv', index=True)

In [None]:
# y_pred holds predictions for the test split only, so Id is the row position within that split
my_submission = pd.DataFrame({'Id': np.arange(len(y_pred)), 'Temperature': y_pred})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)