# Oceanographic Data Analysis using Regression

## Problem Statement

We aim to explore and analyze oceanographic data collected from various depths in the Pacific Ocean. The focus is to investigate how physical parameters such as **temperature**, **depth**, **salinity**, **oxygen concentration**, and **density** are related. Understanding these relationships is important for modeling ocean behavior and its impact on marine ecosystems and climate.

## Solution Approach

- Load a subset of the original `bottle.csv` dataset (trimmed to the first 1215 rows for performance, with some rows removed during data cleaning).
- Extract and convert relevant oceanographic parameters to NumPy arrays.
- Perform initial data exploration by printing samples of each parameter.
- Further analysis (e.g., regression, correlation, visualization) will be built on this structured data.

## Dataset Source

This notebook uses a trimmed version of the original dataset from Kaggle:  
[CalCOFI Bottle Data](https://www.kaggle.com/datasets/sohier/calcofi)

File used: `bottle.csv` (the first 1215 rows of the original file)


In [None]:
# Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Load dataset from CSV file
df = pd.read_csv('bottle.csv', low_memory=False)

# Extract relevant columns as numpy arrays
Temperature = df['T_degC'].to_numpy()[:1215] 
Depth = df['Depthm'].to_numpy()[:1215] 
Oxygen = df['O2ml_L'].to_numpy()[:1215] 
Density = df['STheta'].to_numpy()[:1215] 
Salinity = df['Salnty'].to_numpy()[:1215]

print("Temperature:", Temperature[:10])
print("Depth:", Depth[:10])
print("Oxygen:", Oxygen[:10])
print("Density:", Density[:10])
print("Salinity:", Salinity[:10])

In [None]:
# Combine selected features into a single matrix
# Oxygen is excluded because it has missing values, and to keep the model simple
X = np.column_stack([Temperature, Depth, Density])  # shape: (1215, 3)

# Feature scaling (important): standardize the inputs. Without it the cost grows to
# the order of 1e33, because Depth is far larger in magnitude than the other features.
means = X.mean(axis=0)
stds = X.std(axis=0)
X = (X - means) / stds  # Normalize each feature to mean 0, std 1

# Target variable
y = Salinity
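
The standardization step above can be wrapped in a small helper that also returns the column statistics, so that new samples are scaled with the same means and standard deviations as the training data. This is a minimal sketch on synthetic data: `standardize`, `X_demo`, and `x_new` are hypothetical names, and the numbers are not taken from the CalCOFI file.

```python
import numpy as np

def standardize(X):
    """Return z-scored features plus the column means and stds that were used."""
    means = X.mean(axis=0)
    stds = X.std(axis=0)
    return (X - means) / stds, means, stds

# Synthetic stand-in for (temperature, depth, density); NOT real CalCOFI values
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(10.0, 3.0, 200),      # temperature-like column
    rng.uniform(0.0, 1000.0, 200),   # depth-like column, much larger magnitude
    rng.normal(26.0, 0.5, 200),      # density-like column
])

X_scaled, means_demo, stds_demo = standardize(X_demo)
print(X_scaled.mean(axis=0).round(6))  # each column is close to 0
print(X_scaled.std(axis=0).round(6))   # each column is close to 1

# A new sample must be scaled with the stored training statistics,
# not with statistics computed from the new sample itself
x_new = np.array([12.0, 250.0, 25.8])
x_new_scaled = (x_new - means_demo) / stds_demo
```

Keeping `means` and `stds` around also makes it possible to map the learned weights back to the original units if needed.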

In [None]:
# Computes the mean squared error cost (halved, which simplifies the gradient)
def compute_cost(X, y, w, b, m):
    cost = 0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b  # predicted value
        cost += (f_wb_i - y[i]) ** 2  # squared error
    return cost / (2 * m)  # mean cost
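
The loop-based cost above can also be written with NumPy array operations. The sketch below assumes the same half-MSE definition and checks it against an explicit loop on synthetic data; `compute_cost_vec` and the `*_demo` arrays are hypothetical names introduced for illustration.

```python
import numpy as np

def compute_cost_vec(X, y, w, b):
    """Half mean squared error, computed without a Python loop."""
    residuals = X @ w + b - y        # prediction error for every sample
    return np.mean(residuals ** 2) / 2

# Synthetic data, only used to compare the two formulations
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.normal(size=50)
w_demo = np.array([0.5, -1.0, 2.0])
b_demo = 0.3

# Reference value from an explicit per-sample loop
m = X_demo.shape[0]
loop_cost = 0.0
for i in range(m):
    loop_cost += (np.dot(X_demo[i], w_demo) + b_demo - y_demo[i]) ** 2
loop_cost /= 2 * m

vec_cost = compute_cost_vec(X_demo, y_demo, w_demo, b_demo)
print(np.isclose(vec_cost, loop_cost))
```

For 1215 rows the speed difference is small, but the vectorized form is shorter and does not need `m` passed in separately.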

In [None]:
# Computes gradient of the cost function with respect to weights and bias
def compute_grad(X, y, w, b, m, n):
    dw = np.zeros((n,))
    db = 0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b
        err = f_wb_i - y[i]
        for j in range(n):
            dw[j] += err * X[i][j]
        db += err
    return dw / m, db / m  # average gradients
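
The same gradient can be computed without nested loops, and a finite-difference check is a cheap way to confirm the formula. This is a sketch on synthetic data; `compute_grad_vec`, `half_mse`, and the `*_demo` values are hypothetical names, not part of the notebook.

```python
import numpy as np

def compute_grad_vec(X, y, w, b):
    """Gradient of the half-MSE cost with respect to w and b, without loops."""
    m = X.shape[0]
    err = X @ w + b - y      # residual for every sample, shape (m,)
    dw = X.T @ err / m       # shape (n,)
    db = err.mean()
    return dw, db

def half_mse(X, y, w, b):
    return np.mean((X @ w + b - y) ** 2) / 2

# Synthetic data, only used to verify the gradient
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(40, 3))
y_demo = rng.normal(size=40)
w_demo = np.array([0.1, -0.2, 0.3])
b_demo = 0.5

dw, db = compute_grad_vec(X_demo, y_demo, w_demo, b_demo)

# Central finite differences as an independent check of dw and db
eps = 1e-6
dw_num = np.zeros_like(w_demo)
for j in range(w_demo.size):
    step = np.zeros_like(w_demo)
    step[j] = eps
    dw_num[j] = (half_mse(X_demo, y_demo, w_demo + step, b_demo)
                 - half_mse(X_demo, y_demo, w_demo - step, b_demo)) / (2 * eps)
db_num = (half_mse(X_demo, y_demo, w_demo, b_demo + eps)
          - half_mse(X_demo, y_demo, w_demo, b_demo - eps)) / (2 * eps)

print(np.allclose(dw, dw_num, atol=1e-6), np.isclose(db, db_num, atol=1e-6))
```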


In [None]:
# Performs batch gradient descent to learn weights and bias
def grad_des(X, y, w, b, num_iters=100, alpha=0.0001):
    m = X.shape[0]  # number of samples
    n = X.shape[1]  # number of features
    w = np.array(w, dtype=float)  # work on a float copy so the caller's array is not modified
    J_history = []  # cost after each update
    p_history = []  # (weights, bias) after each update

    for i in range(num_iters):
        dw, db = compute_grad(X, y, w, b, m, n)
        w -= alpha * dw
        b -= alpha * db
        J_history.append(compute_cost(X, y, w, b, m))
        p_history.append((w.copy(), b))

        # Print progress every 10 iterations and on the final iteration
        if i % 10 == 0 or i == num_iters - 1:
            print(f"Iteration {i:4}: Cost {J_history[-1]:.4f}, "
                  f"dj_dw: {dw}, dj_db: {db:.4f}, "
                  f"w: {w}, b: {b:.4f}")
    return w, b, J_history, p_history

In [None]:
# Initialize weights and bias
w = np.zeros(X.shape[1])
b = 0

# Run gradient descent
w, b, J_history, p_history = grad_des(X, y, w, b)
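
A closed-form least-squares fit gives a reference for the cost that gradient descent should eventually approach. The sketch below uses `np.linalg.lstsq` on synthetic data; the target offset of 33.5 and the coefficients are arbitrary illustration values, not results from the dataset, and all names are hypothetical.

```python
import numpy as np

# Synthetic standardized features and a target with a large constant offset
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(100, 3))
y_demo = (33.5 + X_demo @ np.array([0.2, -0.1, 0.4])
          + rng.normal(scale=0.05, size=100))

# Append a column of ones so the bias is fitted together with the weights
A = np.column_stack([X_demo, np.ones(len(X_demo))])
theta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
w_ls, b_ls = theta[:-1], theta[-1]

cost_ls = np.mean((A @ theta - y_demo) ** 2) / 2   # cost at the least-squares solution
cost_zero = np.mean(y_demo ** 2) / 2               # cost at w = 0, b = 0
print(f"cost at zeros: {cost_zero:.4f}, cost at least-squares fit: {cost_ls:.6f}")
```

If the gradient descent cost stays close to the cost at zeros, the run has most likely not converged yet, and more iterations or a larger `alpha` are worth trying.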

In [None]:
# Extract weights and bias history
w_history = np.array([p[0] for p in p_history])
b_history = [p[1] for p in p_history]

# Plot evolution of each weight and bias
plt.figure()
for i in range(w_history.shape[1]):
    plt.plot(w_history[:, i], label=f'w[{i}]')
plt.plot(b_history, label='b')
plt.xlabel('Iteration')
plt.ylabel('Value')
plt.title('Weights and Bias over Iterations')
plt.legend()
plt.show()

In [None]:
# 3D plot of cost vs first two weights (w[0], w[1])
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.plot(w_history[:, 0], w_history[:, 1], J_history, marker='o', color='blue', label='Gradient Descent Path')
ax.set_xlabel('w[0]')
ax.set_ylabel('w[1]')
ax.set_zlabel('Cost (J)')
ax.set_title('3D Plot of Cost vs w[0] and w[1]')
ax.legend()
plt.show()

### Key Observations

- **Missing data** in the CSV loads as `NaN` and propagates into the cost and gradients; it has to be handled explicitly, and simply replacing missing values with 0 is incorrect.
- **Decreasing the learning rate (`alpha`)** helps the cost converge more steadily.
- **Feature scaling** is critical: without it, large-magnitude features (e.g., Depth) dominate and cause cost divergence.
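- The final cost remained around 500, so there is clear room for improvement. With `alpha=0.0001` and only 100 iterations, the weights and bias move very little from their zero initialization, which likely accounts for much of the remaining cost.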
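
One way to handle the missing values mentioned above is to drop every row where a feature or the target is not finite, before scaling. This is a minimal sketch with made-up toy values; `X_raw`, `y_raw`, and `keep` are hypothetical names.

```python
import numpy as np

# Toy values with gaps; NOT taken from bottle.csv
X_raw = np.array([[10.5,  0.0, 25.6],
                  [10.4,  8.0, np.nan],   # missing feature value
                  [10.3, 10.0, 25.7],
                  [ 9.8, 20.0, 25.9]])
y_raw = np.array([33.44, 33.44, np.nan, 33.42])  # missing target value

# A single NaN is enough to turn an aggregate such as the mean into NaN
print(np.isnan(np.mean(y_raw)))

# Keep only rows where every feature and the target are finite
keep = np.isfinite(X_raw).all(axis=1) & np.isfinite(y_raw)
X_clean, y_clean = X_raw[keep], y_raw[keep]
print(X_clean.shape, y_clean.shape)
```

Dropping rows is the simplest option; imputing values is an alternative, but it changes the data and should be a deliberate choice.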

### Potential Improvements

- Try polynomial features or nonlinear transformations.
- Apply regularization to prevent overfitting.
- Include Oxygen as a feature and check how it correlates with the other parameters; more feature engineering is required.
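
The first two ideas can be sketched together: append squared terms to the features and add an L2 (ridge) penalty on the weights to the cost and gradient. Everything here is an assumption for illustration: the helper names, the synthetic data, and the values of `alpha` and `lam` are hypothetical and untuned.

```python
import numpy as np

def add_squared_terms(X):
    """Append the square of each column (degree-2 terms, no cross products)."""
    return np.column_stack([X, X ** 2])

def ridge_cost_grad(X, y, w, b, lam):
    """Half-MSE cost with an L2 penalty on w (not on b), plus its gradients."""
    m = X.shape[0]
    err = X @ w + b - y
    cost = (err @ err + lam * (w @ w)) / (2 * m)
    dw = (X.T @ err + lam * w) / m
    db = err.mean()
    return cost, dw, db

# Synthetic data with a mild nonlinear dependence on the second feature
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(200, 3))
y_demo = (1.0 + 0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 1] ** 2
          + rng.normal(scale=0.1, size=200))

X_poly = add_squared_terms(X_demo)   # shape (200, 6)
w = np.zeros(X_poly.shape[1])
b = 0.0
alpha, lam = 0.05, 1.0               # hypothetical, untuned values

cost_start, _, _ = ridge_cost_grad(X_poly, y_demo, w, b, lam)
for _ in range(500):
    _, dw, db = ridge_cost_grad(X_poly, y_demo, w, b, lam)
    w -= alpha * dw
    b -= alpha * db
cost_end, _, _ = ridge_cost_grad(X_poly, y_demo, w, b, lam)
print(f"cost before: {cost_start:.4f}, after: {cost_end:.4f}")
```

On real data the squared columns would usually be standardized again after they are created, since squaring changes their mean and spread.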
