# Chain Rule and Backpropagation
## The Foundation of Neural Network Training

---

## Table of Contents
1. [Chain Rule Fundamentals](#chain-rule)
2. [Multivariate Chain Rule](#multivariate)
3. [Computational Graphs](#comp-graphs)
4. [Backpropagation Algorithm](#backprop)
5. [Neural Network from Scratch](#neural-net)
6. [Visualizations](#visualizations)
7. [Practice Problems](#practice)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

---
# 1. CHAIN RULE FUNDAMENTALS

## 1.1 What is the Chain Rule?

The chain rule lets us differentiate **composite functions**, that is, functions applied to the output of other functions.

If $y = f(g(x))$, then:

$$\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

Or using Leibniz notation with $u = g(x)$:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

### Intuition
Think of it as: "How does y change with respect to x?" = "How does y change with respect to u?" × "How does u change with respect to x?"
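
One way to see this numerically, as a minimal sketch: measure the two rates separately with small finite steps, multiply them, and compare against the slope of the composite measured directly. The functions `g` and `f_outer`, the point `x0`, and the step `h` are illustrative choices.

```python
import math

def g(x):
    """Inner function u = g(x) = 3x + 2."""
    return 3 * x + 2

def f_outer(u):
    """Outer function y = f(u) = sin(u)."""
    return math.sin(u)

x0 = 0.5
h = 1e-6  # small step; assumed small enough for a good approximation

u0 = g(x0)
du_dx = (g(x0 + h) - g(x0 - h)) / (2 * h)              # rate of u w.r.t. x
dy_du = (f_outer(u0 + h) - f_outer(u0 - h)) / (2 * h)  # rate of y w.r.t. u

# Slope of the composite y = f(g(x)), measured directly
dy_dx_direct = (f_outer(g(x0 + h)) - f_outer(g(x0 - h))) / (2 * h)

print(f"dy/du * du/dx = {dy_du * du_dx:.6f}")
print(f"dy/dx direct  = {dy_dx_direct:.6f}")
```

The two printed values should agree to several decimal places, since both approximate $3\cos(3x_0 + 2)$.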

In [None]:
# Example 1: Simple Chain Rule
# y = (3x + 2)^2
# Let u = 3x + 2, then y = u^2

def f(x):
    return (3*x + 2)**2

# Manual derivative using chain rule:
# dy/du = 2u = 2(3x + 2)
# du/dx = 3
# dy/dx = 2(3x + 2) * 3 = 6(3x + 2)

def f_derivative(x):
    return 6 * (3*x + 2)

# Numerical verification
def numerical_derivative(func, x, h=1e-7):
    return (func(x + h) - func(x - h)) / (2 * h)

x_test = 2.0
analytical = f_derivative(x_test)
numerical = numerical_derivative(f, x_test)

print(f"f(x) = (3x + 2)^2")
print(f"\nAt x = {x_test}:")
print(f"Analytical derivative: {analytical}")
print(f"Numerical derivative: {numerical:.6f}")
print(f"Difference: {abs(analytical - numerical):.10f}")

In [None]:
# Visualize the function and its derivative
x = np.linspace(-2, 2, 100)
y = f(x)
dy = f_derivative(x)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Function
axes[0].plot(x, y, 'b-', linewidth=2, label='$f(x) = (3x+2)^2$')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('Function')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Derivative
axes[1].plot(x, dy, 'r-', linewidth=2, label="$f'(x) = 6(3x+2)$")
axes[1].axhline(y=0, color='k', linestyle='-', linewidth=0.5)
axes[1].set_xlabel('x')
axes[1].set_ylabel('dy/dx')
axes[1].set_title('Derivative (using Chain Rule)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 1.2 Common Chain Rule Examples

In [None]:
# Example 2: Exponential function
# y = e^(x^2)
# dy/dx = e^(x^2) * 2x

def exp_squared(x):
    return np.exp(x**2)

def exp_squared_derivative(x):
    return np.exp(x**2) * 2 * x

print("Example 2: y = e^(x²)")
print("dy/dx = e^(x²) · 2x")
print(f"\nAt x = 1: dy/dx = {exp_squared_derivative(1):.4f}")
print(f"Numerical: {numerical_derivative(exp_squared, 1):.4f}")

In [None]:
# Example 3: Trigonometric composition
# y = sin(x^3)
# dy/dx = cos(x^3) * 3x^2

def sin_of_cube(x):
    return np.sin(x**3)

def sin_of_cube_derivative(x):
    return np.cos(x**3) * 3 * x**2

print("Example 3: y = sin(x³)")
print("dy/dx = cos(x³) · 3x²")
print(f"\nAt x = 1: dy/dx = {sin_of_cube_derivative(1):.4f}")
print(f"Numerical: {numerical_derivative(sin_of_cube, 1):.4f}")

In [None]:
# Example 4: Composition of two trigonometric functions
# y = sin(cos(x))
# dy/dx = cos(cos(x)) * (-sin(x)) = -cos(cos(x)) * sin(x)

def sin_cos(x):
    return np.sin(np.cos(x))

def sin_cos_derivative(x):
    return -np.cos(np.cos(x)) * np.sin(x)

print("Example 4: y = sin(cos(x))")
print("dy/dx = -cos(cos(x)) · sin(x)")
print(f"\nAt x = π/4: dy/dx = {sin_cos_derivative(np.pi/4):.4f}")
print(f"Numerical: {numerical_derivative(sin_cos, np.pi/4):.4f}")

---
# 2. MULTIVARIATE CHAIN RULE

## 2.1 Chain Rule with Multiple Variables

For $z = f(x, y)$ where $x = g(t)$ and $y = h(t)$:

$$\frac{dz}{dt} = \frac{\partial f}{\partial x} \cdot \frac{dx}{dt} + \frac{\partial f}{\partial y} \cdot \frac{dy}{dt}$$

This is the **total derivative**: each path from $t$ to $z$ contributes one term, and the terms are summed. Backpropagation relies on exactly this rule.
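
The same summation applies when a single variable feeds several intermediate nodes. A minimal sketch, with the functions chosen purely for illustration (`forward` and `grad_via_paths` are hypothetical helper names):

```python
def forward(x):
    """x feeds two intermediate nodes, a and b, which are then multiplied."""
    a = x ** 2      # path 1
    b = 3 * x       # path 2
    z = a * b       # overall z = 3x^3
    return a, b, z

def grad_via_paths(x):
    """dz/dx as the sum of one contribution per path."""
    a, b, _ = forward(x)
    dz_da, dz_db = b, a          # local gradients of z = a * b
    da_dx, db_dx = 2 * x, 3      # local gradients along each path
    return dz_da * da_dx + dz_db * db_dx

x0 = 2.0
print(grad_via_paths(x0))   # should match the analytic derivative 9 * x0**2
```

Dropping either term gives the wrong answer, which is why backward passes accumulate gradients rather than overwrite them.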

In [None]:
# Example: z = x^2 * y where x = t^2 and y = t^3
# dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)
#       = (2xy)(2t) + (x^2)(3t^2)
#       = 2(t^2)(t^3)(2t) + (t^4)(3t^2)
#       = 4t^6 + 3t^6 = 7t^6

def z_of_t(t):
    x = t**2
    y = t**3
    return x**2 * y

def dz_dt_analytical(t):
    return 7 * t**6

t_test = 2.0
print("z = x² · y, where x = t², y = t³")
print("\nUsing multivariate chain rule:")
print("dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)")
print("      = (2xy)(2t) + (x²)(3t²)")
print("      = 7t⁶")
print(f"\nAt t = {t_test}:")
print(f"Analytical dz/dt = {dz_dt_analytical(t_test)}")
print(f"Numerical dz/dt = {numerical_derivative(z_of_t, t_test):.4f}")

## 2.2 Jacobian Matrix

For vector-valued functions, the chain rule becomes a product of **Jacobian matrices**, where entry $(i, j)$ of a Jacobian is $\partial y_i / \partial x_j$.

If $\mathbf{y} = f(\mathbf{x})$ and $\mathbf{x} = g(\mathbf{t})$, then:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{t}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \cdot \frac{\partial \mathbf{x}}{\partial \mathbf{t}}$$
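
As a hedged illustration of this identity, the sketch below multiplies two hand-derived Jacobians and compares the result with a finite-difference Jacobian of the composition. The maps `g_map` and `f_map` and the helper `numerical_jacobian` are assumptions made for this example.

```python
import numpy as np

def g_map(t):
    """Inner map x = g(t), from R^2 to R^2."""
    return np.array([t[0] * t[1], t[0] + t[1]])

def f_map(x):
    """Outer map y = f(x), from R^2 to R^2."""
    return np.array([x[0] ** 2 + x[1], x[0] * x[1]])

def jac_g(t):
    return np.array([[t[1], t[0]],
                     [1.0, 1.0]])

def jac_f(x):
    return np.array([[2 * x[0], 1.0],
                     [x[1], x[0]]])

def numerical_jacobian(func, v, h=1e-6):
    """Central-difference Jacobian; column j holds d func / d v[j]."""
    v = np.asarray(v, dtype=float)
    cols = []
    for j in range(v.size):
        step = np.zeros_like(v)
        step[j] = h
        cols.append((func(v + step) - func(v - step)) / (2 * h))
    return np.stack(cols, axis=1)

t0 = np.array([2.0, 3.0])
J_chain = jac_f(g_map(t0)) @ jac_g(t0)   # J_f evaluated at x = g(t), times J_g
J_numeric = numerical_jacobian(lambda t: f_map(g_map(t)), t0)

print(J_chain)
print(np.allclose(J_chain, J_numeric, atol=1e-4))
```

Note that the outer Jacobian is evaluated at $\mathbf{x} = g(\mathbf{t})$, not at $\mathbf{t}$, and that the order of the matrix product matters.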

In [None]:
# Example: Computing Jacobian
# f(x, y) = [x^2 + y, x*y]
# Jacobian = [[∂f1/∂x, ∂f1/∂y],
#             [∂f2/∂x, ∂f2/∂y]]
#          = [[2x, 1],
#             [y, x]]

def f_vector(x, y):
    return np.array([x**2 + y, x * y])

def jacobian(x, y):
    return np.array([[2*x, 1],
                     [y, x]])

x, y = 2.0, 3.0
J = jacobian(x, y)

print("f(x, y) = [x² + y, x·y]")
print(f"\nAt (x, y) = ({x}, {y}):")
print(f"f = {f_vector(x, y)}")
print(f"\nJacobian matrix:")
print(J)

---
# 3. COMPUTATIONAL GRAPHS

## 3.1 What are Computational Graphs?

A computational graph represents a function as a **directed acyclic graph (DAG)** where:
- **Nodes** represent operations or variables
- **Edges** represent data flow

This is how deep learning frameworks (PyTorch, TensorFlow) track computations for automatic differentiation.
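
To make the idea concrete, here is a minimal sketch of a scalar node type that records its parents and local derivatives as it is computed. It is a teaching toy under simplifying assumptions (scalars only, just `+` and `*`), not how PyTorch or TensorFlow are implemented; the class name `Value` and its attributes are hypothetical.

```python
class Value:
    """A scalar node that remembers how it was computed."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Build an ordering in which every node appears after its parents,
        # then walk it in reverse so each node is handled before its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # d(output)/d(output)
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad  # chain rule, summed over paths


x, y, z = Value(2.0), Value(3.0), Value(4.0)
f = (x + y) * z
f.backward()
print(f.data, x.grad, y.grad, z.grad)
```

Because gradients are accumulated with `+=`, a node used more than once (for example `w * w`) receives the sum of the contributions from every use.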

In [None]:
# Example: f(x, y, z) = (x + y) * z
# 
# Computational Graph:
#   x ----\
#          (+) = a ---\
#   y ----/            (*) = f
#   z ----------------/

class ComputationalGraph:
    """Simple computational graph for f = (x + y) * z"""
    
    def forward(self, x, y, z):
        """Forward pass: compute the output."""
        self.x = x
        self.y = y
        self.z = z
        
        # Intermediate computation
        self.a = x + y  # a = x + y
        
        # Final output
        self.f = self.a * z  # f = a * z
        
        return self.f
    
    def backward(self):
        """Backward pass: compute gradients using chain rule."""
        # Start with df/df = 1
        df_df = 1
        
        # f = a * z
        # df/da = z
        # df/dz = a
        df_da = self.z * df_df
        df_dz = self.a * df_df
        
        # a = x + y
        # da/dx = 1
        # da/dy = 1
        # Using chain rule:
        # df/dx = df/da * da/dx
        # df/dy = df/da * da/dy
        df_dx = df_da * 1
        df_dy = df_da * 1
        
        return df_dx, df_dy, df_dz

# Test the computational graph
graph = ComputationalGraph()
x, y, z = 2.0, 3.0, 4.0

# Forward pass
f = graph.forward(x, y, z)
print(f"f(x, y, z) = (x + y) * z")
print(f"\nFor x={x}, y={y}, z={z}:")
print(f"Forward pass: f = {f}")

# Backward pass
df_dx, df_dy, df_dz = graph.backward()
print(f"\nBackward pass (gradients):")
print(f"∂f/∂x = {df_dx}")
print(f"∂f/∂y = {df_dy}")
print(f"∂f/∂z = {df_dz}")

# Verify: f = (x+y)*z, so df/dx = z, df/dy = z, df/dz = x+y
print(f"\nVerification:")
print(f"Expected ∂f/∂x = z = {z}")
print(f"Expected ∂f/∂y = z = {z}")
print(f"Expected ∂f/∂z = x+y = {x+y}")

In [None]:
# Visualize the computational graph
fig, ax = plt.subplots(figsize=(10, 6))

# Draw nodes
nodes = {
    'x': (0, 2), 'y': (0, 1), 'z': (0, 0),
    '+': (2, 1.5), '*': (4, 0.75), 'f': (6, 0.75)
}

for name, (px, py) in nodes.items():
    circle = plt.Circle((px, py), 0.3, color='lightblue', ec='black')
    ax.add_patch(circle)
    ax.text(px, py, name, ha='center', va='center', fontsize=12, fontweight='bold')

# Draw edges
edges = [
    ('x', '+'), ('y', '+'), ('+', '*'), ('z', '*'), ('*', 'f')
]

for start, end in edges:
    x1, y1 = nodes[start]
    x2, y2 = nodes[end]
    ax.annotate('', xy=(x2-0.3, y2), xytext=(x1+0.3, y1),
                arrowprops=dict(arrowstyle='->', color='black'))

# Add labels for gradients
ax.text(1, 2.3, 'df/dx = z', fontsize=10, color='red')
ax.text(1, 0.7, 'df/dy = z', fontsize=10, color='red')
ax.text(1, -0.3, 'df/dz = x+y', fontsize=10, color='red')

ax.set_xlim(-1, 7)
ax.set_ylim(-1, 3)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title('Computational Graph: f = (x + y) * z\nwith gradient flow', fontsize=14)

plt.show()

---
# 4. BACKPROPAGATION ALGORITHM

## 4.1 The Core Idea

Backpropagation efficiently computes gradients using the chain rule, working **backwards** from the output.

For a neural network with loss $L$:
1. **Forward pass**: Compute all intermediate values and the final loss
2. **Backward pass**: Compute gradients from output to input using the chain rule
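
A common sanity check for any backward pass is to compare its gradients with finite differences of the forward pass. The sketch below does this for a single sigmoid neuron with squared-error loss; the function names and the values of `w`, `b`, `x`, and `target` are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w, b, x, target):
    """Forward pass: squared error of a single sigmoid neuron."""
    y = sigmoid(w * x + b)
    return (y - target) ** 2

def backprop_grads(w, b, x, target):
    """Backward pass: chain rule from the loss back to w and b."""
    y = sigmoid(w * x + b)
    dL_dy = 2 * (y - target)   # derivative of the loss w.r.t. the output
    dy_dz = y * (1 - y)        # derivative of the sigmoid
    return dL_dy * dy_dz * x, dL_dy * dy_dz * 1.0

w, b, x, target = 0.3, -0.1, 1.5, 0.8   # arbitrary illustrative values
h = 1e-6

dw_bp, db_bp = backprop_grads(w, b, x, target)
dw_num = (loss_fn(w + h, b, x, target) - loss_fn(w - h, b, x, target)) / (2 * h)
db_num = (loss_fn(w, b + h, x, target) - loss_fn(w, b - h, x, target)) / (2 * h)

print(f"dL/dw: backprop {dw_bp:.6f}, numerical {dw_num:.6f}")
print(f"dL/db: backprop {db_bp:.6f}, numerical {db_num:.6f}")
```

The finite-difference version needs two extra forward passes per parameter, whereas the backward pass yields every gradient at once; that difference is the efficiency argument for backpropagation.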

In [None]:
# Complete example: Simple neuron with sigmoid activation
# y = sigmoid(w*x + b)
# Loss = (y - target)^2

class SimpleNeuron:
    """A single neuron with sigmoid activation."""
    
    def __init__(self):
        self.w = np.random.randn()
        self.b = np.random.randn()
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))
    
    def sigmoid_derivative(self, z):
        s = self.sigmoid(z)
        return s * (1 - s)
    
    def forward(self, x):
        """Forward pass."""
        self.x = x
        self.z = self.w * x + self.b  # Linear combination
        self.y = self.sigmoid(self.z)  # Activation
        return self.y
    
    def compute_loss(self, target):
        """Compute the squared-error loss for a single example."""
        self.target = target
        self.loss = (self.y - target) ** 2
        return self.loss
    
    def backward(self):
        """Backward pass: compute gradients."""
        # Chain rule:
        # dL/dw = dL/dy * dy/dz * dz/dw
        # dL/db = dL/dy * dy/dz * dz/db
        
        # dL/dy = 2(y - target)
        dL_dy = 2 * (self.y - self.target)
        
        # dy/dz = sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
        dy_dz = self.sigmoid_derivative(self.z)
        
        # dz/dw = x
        dz_dw = self.x
        
        # dz/db = 1
        dz_db = 1
        
        # Apply chain rule
        self.dL_dw = dL_dy * dy_dz * dz_dw
        self.dL_db = dL_dy * dy_dz * dz_db
        
        return self.dL_dw, self.dL_db
    
    def update(self, learning_rate):
        """Update parameters using gradients."""
        self.w -= learning_rate * self.dL_dw
        self.b -= learning_rate * self.dL_db

# Test the neuron
neuron = SimpleNeuron()
x = 1.0
target = 0.8

print("Simple Neuron: y = sigmoid(w*x + b)")
print(f"Initial w = {neuron.w:.4f}, b = {neuron.b:.4f}")

# Forward pass
y = neuron.forward(x)
loss = neuron.compute_loss(target)
print(f"\nForward pass:")
print(f"  x = {x}, target = {target}")
print(f"  y = {y:.4f}, loss = {loss:.4f}")

# Backward pass
dL_dw, dL_db = neuron.backward()
print(f"\nBackward pass:")
print(f"  dL/dw = {dL_dw:.4f}")
print(f"  dL/db = {dL_db:.4f}")

In [None]:
# Train the neuron
neuron = SimpleNeuron()
x = 1.0
target = 0.8
learning_rate = 0.5
epochs = 100

losses = []
predictions = []

for epoch in range(epochs):
    # Forward
    y = neuron.forward(x)
    loss = neuron.compute_loss(target)
    
    losses.append(loss)
    predictions.append(y)
    
    # Backward
    neuron.backward()
    
    # Update
    neuron.update(learning_rate)

# Plot training progress
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(losses, 'b-', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss')
axes[0].grid(True, alpha=0.3)

axes[1].plot(predictions, 'g-', linewidth=2, label='Prediction')
axes[1].axhline(y=target, color='r', linestyle='--', linewidth=2, label='Target')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Output')
axes[1].set_title('Prediction vs Target')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Final prediction: {predictions[-1]:.4f}")
print(f"Target: {target}")
print(f"Final loss: {losses[-1]:.6f}")

---
# 5. NEURAL NETWORK FROM SCRATCH

## 5.1 Two-Layer Neural Network
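
The backward pass of this network relies on a standard simplification for a sigmoid output paired with binary cross-entropy. As a sketch for a single example, write $a = \sigma(z)$ for the output activation and $y$ for the label (symbols chosen here for brevity):

$$L = -\left[y \log a + (1 - y)\log(1 - a)\right]$$

$$\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a(1 - a)}, \qquad \frac{\partial a}{\partial z} = a(1 - a)$$

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} = a - y$$

This is why the output-layer gradient in the code reduces to `A2 - y_true`. The factor $1/m$ from averaging over the batch is applied when the weight and bias gradients are formed.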

In [None]:
class TwoLayerNN:
    """Simple 2-layer neural network."""
    
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights with small random values
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))
    
    def relu(self, z):
        return np.maximum(0, z)
    
    def relu_derivative(self, z):
        return (z > 0).astype(float)
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def forward(self, X):
        """Forward pass through the network."""
        # Layer 1
        self.Z1 = X @ self.W1 + self.b1
        self.A1 = self.relu(self.Z1)
        
        # Layer 2
        self.Z2 = self.A1 @ self.W2 + self.b2
        self.A2 = self.sigmoid(self.Z2)
        
        return self.A2
    
    def compute_loss(self, y_pred, y_true):
        """Binary cross-entropy loss, averaged over the batch."""
        epsilon = 1e-15  # keep predictions away from 0 and 1 so log() stays finite
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return loss
    
    def backward(self, X, y_true):
        """Backward pass: compute gradients."""
        m = X.shape[0]
        
        # Output layer gradients
        dZ2 = self.A2 - y_true  # dL/dZ2 for sigmoid output with BCE loss; the 1/m factor is applied below
        self.dW2 = (1/m) * self.A1.T @ dZ2
        self.db2 = (1/m) * np.sum(dZ2, axis=0, keepdims=True)
        
        # Hidden layer gradients (chain rule!)
        dA1 = dZ2 @ self.W2.T
        dZ1 = dA1 * self.relu_derivative(self.Z1)
        self.dW1 = (1/m) * X.T @ dZ1
        self.db1 = (1/m) * np.sum(dZ1, axis=0, keepdims=True)
    
    def update(self, learning_rate):
        """Update parameters."""
        self.W1 -= learning_rate * self.dW1
        self.b1 -= learning_rate * self.db1
        self.W2 -= learning_rate * self.dW2
        self.b2 -= learning_rate * self.db2
    
    def train(self, X, y, epochs, learning_rate):
        """Train the network."""
        losses = []
        
        for epoch in range(epochs):
            # Forward
            y_pred = self.forward(X)
            loss = self.compute_loss(y_pred, y)
            losses.append(loss)
            
            # Backward
            self.backward(X, y)
            
            # Update
            self.update(learning_rate)
            
            if (epoch + 1) % 100 == 0:
                print(f"Epoch {epoch+1}/{epochs}, Loss: {loss:.4f}")
        
        return losses
    
    def predict(self, X):
        """Make predictions."""
        return (self.forward(X) > 0.5).astype(int)

In [None]:
# Create XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])  # XOR

print("XOR Problem:")
print("Input -> Output")
for i in range(len(X)):
    print(f"{X[i]} -> {y[i][0]}")

# Train the network
nn = TwoLayerNN(input_size=2, hidden_size=4, output_size=1)
print("\nTraining...")
losses = nn.train(X, y, epochs=1000, learning_rate=1.0)

# Test
predictions = nn.predict(X)
print("\nPredictions:")
for i in range(len(X)):
    print(f"{X[i]} -> {predictions[i][0]} (expected: {y[i][0]})")

In [None]:
# Plot training loss
plt.figure(figsize=(10, 5))
plt.plot(losses, 'b-', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Neural Network Training Loss (XOR Problem)')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Visualize decision boundary
xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 100),
                     np.linspace(-0.5, 1.5, 100))
grid = np.c_[xx.ravel(), yy.ravel()]
Z = nn.forward(grid).reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, levels=50, cmap='RdBu', alpha=0.7)
plt.colorbar(label='Output')

# Plot data points
plt.scatter(X[y.flatten()==0, 0], X[y.flatten()==0, 1], 
            c='blue', s=200, edgecolors='black', label='Class 0')
plt.scatter(X[y.flatten()==1, 0], X[y.flatten()==1, 1], 
            c='red', s=200, edgecolors='black', label='Class 1')

plt.xlabel('x₁')
plt.ylabel('x₂')
plt.title('Neural Network Decision Boundary for XOR')
plt.legend()
plt.show()

---
# 6. VISUALIZATIONS

## 6.1 Gradient Flow Visualization

In [None]:
# Visualize how gradients flow through activation functions
x = np.linspace(-5, 5, 100)

# Sigmoid
sigmoid = 1 / (1 + np.exp(-x))
sigmoid_grad = sigmoid * (1 - sigmoid)

# Tanh
tanh = np.tanh(x)
tanh_grad = 1 - tanh**2

# ReLU
relu = np.maximum(0, x)
relu_grad = (x > 0).astype(float)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Sigmoid
axes[0, 0].plot(x, sigmoid, 'b-', linewidth=2)
axes[0, 0].set_title('Sigmoid')
axes[0, 0].grid(True, alpha=0.3)

axes[1, 0].plot(x, sigmoid_grad, 'r-', linewidth=2)
axes[1, 0].set_title('Sigmoid Gradient')
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_ylim([0, 0.3])

# Tanh
axes[0, 1].plot(x, tanh, 'b-', linewidth=2)
axes[0, 1].set_title('Tanh')
axes[0, 1].grid(True, alpha=0.3)

axes[1, 1].plot(x, tanh_grad, 'r-', linewidth=2)
axes[1, 1].set_title('Tanh Gradient')
axes[1, 1].grid(True, alpha=0.3)

# ReLU
axes[0, 2].plot(x, relu, 'b-', linewidth=2)
axes[0, 2].set_title('ReLU')
axes[0, 2].grid(True, alpha=0.3)

axes[1, 2].plot(x, relu_grad, 'r-', linewidth=2)
axes[1, 2].set_title('ReLU Gradient')
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.suptitle('Activation Functions and Their Gradients', y=1.02, fontsize=14)
plt.show()

print("Key observations:")
print("- Sigmoid gradient peaks at 0.25 and vanishes for large |x| (vanishing gradient problem)")
print("- Tanh also saturates for large |x|, but its gradient peaks at 1, so it is stronger near zero")
print("- ReLU has a constant gradient of 1 for positive inputs and 0 for negative inputs")
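
The effect of stacking layers can be sketched with simple arithmetic. During backpropagation the gradient is multiplied by one activation derivative per layer; the example below looks only at those factors and ignores the weight matrices, which is a deliberate simplification. The depths are arbitrary.

```python
depths = [1, 5, 10, 20]

# The sigmoid derivative never exceeds 0.25 (its value at z = 0), so the
# product of `depth` such factors is bounded above by 0.25 ** depth.
sigmoid_bound = [0.25 ** d for d in depths]

# For units that stay in the positive region, each ReLU factor is exactly 1.
relu_active = [1.0 ** d for d in depths]

for d, s, r in zip(depths, sigmoid_bound, relu_active):
    print(f"depth {d:2d}: sigmoid factor <= {s:.2e}, active ReLU factor = {r:.0f}")
```

Even in the best case the sigmoid factor shrinks geometrically with depth, while active ReLU units pass the gradient through unchanged. ReLU units with negative input contribute a factor of 0 instead.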

---
# 7. PRACTICE PROBLEMS

## Problem 1: Compute Chain Rule Derivatives

In [None]:
# Problem: Find dy/dx for y = ln(sin(x²))
# Solution:
# Let u = x², v = sin(u), y = ln(v)
# dy/dx = dy/dv * dv/du * du/dx
#       = (1/v) * cos(u) * 2x
#       = (1/sin(x²)) * cos(x²) * 2x
#       = 2x * cot(x²)

def y_func(x):
    return np.log(np.sin(x**2))

def dy_dx(x):
    return 2 * x * (np.cos(x**2) / np.sin(x**2))

# Verify
x_test = 1.5
print("y = ln(sin(x²))")
print("dy/dx = 2x·cot(x²)")
print(f"\nAt x = {x_test}:")
print(f"Analytical: {dy_dx(x_test):.6f}")
print(f"Numerical: {numerical_derivative(y_func, x_test):.6f}")

## Problem 2: Implement Backprop for Custom Function

In [None]:
# Implement backprop for: f(x, y) = x² + xy + y²

class QuadraticFunction:
    def forward(self, x, y):
        self.x = x
        self.y = y
        return x**2 + x*y + y**2
    
    def backward(self):
        # df/dx = 2x + y
        # df/dy = x + 2y
        df_dx = 2 * self.x + self.y
        df_dy = self.x + 2 * self.y
        return df_dx, df_dy

# Test
func = QuadraticFunction()
x, y = 3.0, 2.0

f = func.forward(x, y)
df_dx, df_dy = func.backward()

print(f"f(x, y) = x² + xy + y²")
print(f"\nAt (x, y) = ({x}, {y}):")
print(f"f = {f}")
print(f"∂f/∂x = {df_dx}")
print(f"∂f/∂y = {df_dy}")
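
As with Problem 1, the hand-derived gradients can be checked against finite differences. A minimal self-contained sketch, where `f_quad`, `grad_quad`, and `numerical_grad` are names introduced here for illustration:

```python
def f_quad(x, y):
    return x**2 + x*y + y**2

def grad_quad(x, y):
    """Hand-derived gradient: (2x + y, x + 2y)."""
    return 2 * x + y, x + 2 * y

def numerical_grad(func, x, y, h=1e-6):
    """Central differences in each argument, holding the other fixed."""
    d_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    d_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return d_dx, d_dy

x0, y0 = 3.0, 2.0
print("analytical:", grad_quad(x0, y0))
print("numerical: ", numerical_grad(f_quad, x0, y0))
```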

---
## Summary

### Key Concepts:

1. **Chain Rule**: Derivative of composite functions
   - $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$

2. **Multivariate Chain Rule**: For functions of multiple variables
   - Total derivative includes all paths

3. **Computational Graphs**: Graph representation of a computation
   - Nodes = operations or variables, Edges = data flow

4. **Backpropagation**: Efficient gradient computation
   - Forward pass: compute values
   - Backward pass: compute gradients

### Why It Matters for ML:

- Neural networks are compositions of simple functions
- Training requires gradients of loss w.r.t. all parameters
- Backprop computes all gradients in one backward pass
- The chain rule is the mechanism behind every gradient used to train deep networks

---

**Next Steps:**
1. Implement backprop for more complex architectures
2. Study automatic differentiation frameworks
3. Learn about gradient flow and vanishing/exploding gradients