> # `Differentiation`

<html lang="en">
<head>
  <meta charset="utf-8"/>
  <meta name="viewport" content="width=device-width, initial-scale=1"/>
  <title>Differentiation Notes</title>

  <!-- MathJax for rendering LaTeX -->
  <script>
    window.MathJax = {
      tex: {
        inlineMath: [['$', '$'], ['\\(', '\\)']],
        displayMath: [['$$','$$'], ['\\[','\\]']]
      }
    };
  </script>
  <script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" async></script>

  <style>
    body {
      font-family: Inter, system-ui, sans-serif;
      margin: 24px;
      background: #f9fcff;
      color: #1a2a33;
      line-height: 1.55;
    }
    .wrap {
      max-width: 900px;
      margin: auto;
      background: #fff;
      border-radius: 12px;
      padding: 20px 28px;
      box-shadow: 0 8px 24px rgba(0,0,0,0.08);
    }
    details { margin: 12px 0; }
    summary { cursor: pointer; font-weight: 600; font-size: 16px; }
    h2 { font-size: 18px; margin: 12px 0 6px; }
    p { margin: 6px 0; }
    .example {
      background: #f2f9ff;
      border: 1px solid #ddeaff;
      border-radius: 8px;
      padding: 10px 14px;
      margin: 8px 0;
    }
    .answer {
      background: #fff9eb;
      border-left: 4px solid #ffb347;
      padding: 10px 14px;
      border-radius: 6px;
      margin: 10px 0;
    }
  </style>
</head>
<body>
<div class="wrap">

<details open>
<summary>Click to expand</summary>

<section>
  <h2>1. What is Differentiation?</h2>
  <p><strong>Idea:</strong> Differentiation measures the <em>rate of change</em> of a function.</p>
  <p>If $y=f(x)$, the derivative $f'(x)$ tells us how fast $y$ changes when $x$ changes by a tiny amount.</p>
  <p><strong>Geometric view:</strong> slope of the tangent line to the curve.</p>

  <div class="answer">
    <strong>Quick check:</strong> If $f(x)=x^2$, what is the slope at $x=2$?<br/>
    $f'(x)=2x \;\Rightarrow\; f'(2)=4$.
  </div>
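  <p>As a sketch of where $f'(x)=2x$ comes from, one can apply the limit definition of the derivative (a standard definition, though not spelled out in these notes):</p>
  <div class="example">
    $$f'(x)=\lim_{h\to 0}\frac{f(x+h)-f(x)}{h}
      =\lim_{h\to 0}\frac{(x+h)^2-x^2}{h}
      =\lim_{h\to 0}(2x+h)=2x$$
  </div>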
</section>

<hr/>

<section>
  <h2>2. Differentiation of a Constant</h2>
  <p>Rule:</p>
  <div class="example">$$\frac{d}{dx}(c) = 0$$</div>
  <p>A constant function never changes, so its slope is 0 everywhere.</p>
</section>

<hr/>

<section>
  <h2>3. Power Rule</h2>
  <p>If $f(x)=x^n$, then</p>
  <div class="example">$$\frac{d}{dx}(x^n)=n x^{n-1}$$</div>
  <p>Example: $\tfrac{d}{dx}(x^3)=3x^2$.</p>
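  <p>The same rule is commonly applied to negative and fractional exponents (for $x \gt 0$). Two illustrative cases, chosen here only as examples:</p>
  <div class="example">
    $$\frac{d}{dx}\left(x^{-1}\right)=-x^{-2}, \qquad
      \frac{d}{dx}\left(x^{1/2}\right)=\tfrac{1}{2}\,x^{-1/2}$$
  </div>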
</section>

<hr/>

<section>
  <h2>4. Sum Rule</h2>
  <div class="example">
    $$\frac{d}{dx}[f(x)+g(x)] = f'(x)+g'(x)$$
  </div>
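  <p>A small worked case, with $f(x)=x^3$ and $g(x)=x^2$ picked purely for illustration and each term differentiated by the power rule:</p>
  <div class="example">
    $$\frac{d}{dx}\left[x^3+x^2\right]=3x^2+2x$$
  </div>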
</section>

<hr/>

<section>
  <h2>5. Product Rule</h2>
  <div class="example">
    $$\frac{d}{dx}[f(x)g(x)] = f'(x)g(x) + f(x)g'(x)$$
  </div>
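  <p>One way to sanity-check the rule is on a product whose derivative is already known. Taking $f(x)=x^2$ and $g(x)=x^3$ (an illustrative choice), the product is $x^5$, and both routes should agree:</p>
  <div class="example">
    $$\frac{d}{dx}\left[x^2\cdot x^3\right]=2x\cdot x^3+x^2\cdot 3x^2=2x^4+3x^4=5x^4=\frac{d}{dx}\left(x^5\right)$$
  </div>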
</section>

<hr/>

<section>
  <h2>6. Quotient Rule</h2>
  <div class="example">
    $$\frac{d}{dx}\!\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)g(x)-f(x)g'(x)}{(g(x))^2}$$
  </div>
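  <p>The rule only applies where $g(x)\neq 0$. A worked sketch with the illustrative choice $f(x)=x^2$, $g(x)=x+1$ (so $x\neq -1$):</p>
  <div class="example">
    $$\frac{d}{dx}\!\left[\frac{x^2}{x+1}\right]=\frac{2x\,(x+1)-x^2\cdot 1}{(x+1)^2}=\frac{x^2+2x}{(x+1)^2}$$
  </div>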
</section>

<hr/>

<section>
  <h2>7. Chain Rule</h2>
  <p>For composition $f(g(x))$:</p>
  <div class="example">
    $$\frac{d}{dx}[f(g(x))] = f'(g(x)) \cdot g'(x)$$
  </div>
  <p><strong>ML connection:</strong> Basis of <em>backpropagation</em> in neural networks.</p>
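  <p>To make the backpropagation link concrete, here is a minimal sketch for a hypothetical one-weight model with squared error $L(w)=(wx-y)^2$, where $x$ and $y$ are fixed data values (this model is an assumption for illustration). The outer function is $u^2$ and the inner function is $u=wx-y$:</p>
  <div class="example">
    $$\frac{dL}{dw}=\underbrace{2\,(wx-y)}_{\text{outer derivative at } u}\cdot\underbrace{x}_{\text{inner derivative}}$$
  </div>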
</section>

<hr/>

<section>
  <h2>8. Partial Differentiation</h2>
  <p>If $f(x,y)$ depends on multiple variables:</p>
  <ul>
    <li>$\tfrac{\partial f}{\partial x}$: treat $y$ as a constant and differentiate with respect to $x$.</li>
    <li>$\tfrac{\partial f}{\partial y}$: treat $x$ as a constant and differentiate with respect to $y$.</li>
  </ul>

  <div class="example">
    For $f(x,y)=x^2 y + y^3$:
    $$\frac{\partial f}{\partial x} = 2xy, \quad 
      \frac{\partial f}{\partial y} = x^2 + 3y^2$$
  </div>
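  <p>Stacking the partial derivatives gives the gradient vector. As a quick numeric sketch, evaluating the example above at the arbitrarily chosen point $(x,y)=(1,2)$:</p>
  <div class="example">
    $$\nabla f(1,2)=\left(2\cdot 1\cdot 2,\; 1^2+3\cdot 2^2\right)=(4,\;13)$$
  </div>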
</section>

<hr/>

<section>
  <h2>9. Higher-Order Derivatives</h2>
  <p>Differentiate multiple times:</p>
  <ul>
    <li>Second derivative: $f''(x)$ is the derivative of $f'(x)$.</li>
    <li>It describes <em>curvature</em>: $f''(x) \gt 0$ means the curve bends upward, $f''(x) \lt 0$ means it bends downward.</li>
  </ul>
  <p><strong>ML connection:</strong> Hessians (matrix of second partial derivatives) appear in optimization.</p>
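  <p>Two short illustrations, assuming the sample functions $f(x)=x^3$ and $f(x,y)=x^2y+y^3$: the first shows repeated differentiation in one variable, the second collects all second partial derivatives into a Hessian.</p>
  <div class="example">
    $$f(x)=x^3:\quad f'(x)=3x^2,\quad f''(x)=6x$$
    $$f(x,y)=x^2y+y^3:\quad
      H=\begin{pmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x\,\partial y} \\ \dfrac{\partial^2 f}{\partial y\,\partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{pmatrix}
      =\begin{pmatrix} 2y & 2x \\ 2x & 6y \end{pmatrix}$$
  </div>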
</section>

<hr/>

<section>
  <h2>10. Matrix Differentiation</h2>
  <p>Extends to vector/matrix functions:</p>
  <ul>
    <li>If $f(x)=a^\top x$, then $\nabla_x f = a$.</li>
    <li>If $f(x)=x^\top A x$, then $\nabla_x f = (A+A^\top)x$.</li>
  </ul>
  <p><strong>ML connection:</strong> These gradients appear whenever a cost function is minimized, for example in least-squares regression and in training neural networks.</p>
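  <p>The quadratic-form rule can be checked by hand on a small case. The matrix below is an arbitrary, deliberately non-symmetric choice made for illustration:</p>
  <div class="example">
    $$A=\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix},\qquad
      x^\top A x = x_1^2+5x_1x_2+4x_2^2$$
    $$\nabla_x f=\begin{pmatrix} 2x_1+5x_2 \\ 5x_1+8x_2 \end{pmatrix}
      =\begin{pmatrix} 2 & 5 \\ 5 & 8 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
      =(A+A^\top)\,x$$
  </div>
  <p>When $A$ is symmetric, this reduces to $\nabla_x f = 2Ax$.</p>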
</section>

</details>
</div>
</body>
</html>
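
The rules in the notes above can be spot-checked numerically. The following is a minimal sketch, not part of the original notes: the helper names `central_diff` and `partial` are hypothetical, and the test function $\sin(x^2)$ is an arbitrary choice. It compares a symmetric difference quotient against the analytic derivatives given by the power rule, the chain rule, and the partial-derivative example.

```python
import math

def central_diff(f, x, h=1e-5):
    """Approximate f'(x) with the symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

def partial(f, point, i, h=1e-5):
    """Approximate the i-th partial derivative of f at `point` (a list of floats)."""
    up = list(point)
    down = list(point)
    up[i] += h
    down[i] -= h
    return (f(up) - f(down)) / (2 * h)

# Power rule: d/dx x^3 = 3x^2, checked at x = 2
power_err = abs(central_diff(lambda x: x**3, 2.0) - 3 * 2.0**2)

# Chain rule: d/dx sin(x^2) = cos(x^2) * 2x, checked at x = 1.5
chain_err = abs(
    central_diff(lambda x: math.sin(x**2), 1.5) - math.cos(1.5**2) * 2 * 1.5
)

# Partial derivatives of f(x, y) = x^2 y + y^3 at (1, 2);
# the analytic formulas 2xy and x^2 + 3y^2 give 4 and 13 there
f = lambda p: p[0]**2 * p[1] + p[1]**3
grad = [partial(f, [1.0, 2.0], i) for i in range(2)]

print(power_err < 1e-6, chain_err < 1e-6, grad)
```

A finite-difference check like this only gives approximate agreement, so the comparison uses a tolerance rather than exact equality.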


> # `Optimization Theory`

<html lang="en">
<head>
  <meta charset="utf-8"/>
  <meta name="viewport" content="width=device-width, initial-scale=1"/>
  <title>Functions & Optimization — Notes</title>

  <!-- MathJax config (supports $...$ and $$...$$) -->
  <script>
    window.MathJax = {
      tex: {
        inlineMath: [['$', '$'], ['\\(', '\\)']],
        displayMath: [['$$','$$'], ['\\[','\\]']]
      },
      options: {
        skipHtmlTags: ['script','noscript','style','textarea','pre','code']
      }
    };
  </script>
  <script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" async></script>

  <style>
    :root{
      --bg:#f8fcff;
      --card:#ffffff;
      --muted:#64748b;
      --accent:#0b63d6;
      --mono: ui-monospace, SFMono-Regular, Menlo, Monaco, "Roboto Mono", "Courier New", monospace;
    }
    body{font-family:Inter, system-ui, -apple-system, "Segoe UI", Roboto, "Helvetica Neue", Arial; background:var(--bg); color:#0b2030; margin:22px;}
    .card{max-width:980px; margin:0 auto; background:var(--card); padding:22px; border-radius:12px; box-shadow:0 10px 30px rgba(11,30,45,0.06); border:1px solid rgba(11,99,214,0.04);}
    header{display:flex; align-items:baseline; justify-content:space-between; gap:12px; margin-bottom:12px;}
    header h1{margin:0; font-size:20px;}
    header p{margin:0; color:var(--muted); font-size:13px;}
    details{margin:8px 0;}
    summary{cursor:pointer; font-weight:700; font-size:15px;}
    details>summary::-webkit-details-marker{display:none;}
    details[open] > summary::after { content: "▾"; padding-left:8px; color:var(--muted); }
    details>summary::after { content: "▸"; padding-left:8px; color:var(--muted); }
    section{margin:12px 0;}
    h2{font-size:16px; margin:8px 0 6px;}
    p{margin:6px 0;}
    .example{background:#f1f8ff; padding:10px; border-radius:8px; border:1px solid rgba(11,99,214,0.06);}
    pre{background:#f7fbff; padding:10px; border-radius:8px; overflow:auto; border:1px solid rgba(11,30,45,0.03);}
    code{font-family:var(--mono); background:#eef6ff; padding:2px 6px; border-radius:6px;}
    .answer{background:#fffaf0; border-left:4px solid #ffb347; padding:10px 12px; border-radius:6px; margin-top:8px;}
    ul, ol{margin:8px 0 8px 20px;}
    .hint{font-size:13px; color:var(--muted);}
    hr{border:0; border-top:1px solid rgba(11,30,45,0.06); margin:16px 0;}
  </style>
</head>
<body>
  <div class="card">
    <details open>
      <summary>Click to expand</summary>
      <section>
        <h2>1. Function</h2>
        <p>A <strong>function</strong> maps inputs to outputs:</p>
        <div class="example">
          $$ y = f(x) $$
        </div>
        <p>In ML, $f$ often denotes a model (linear model, neural network, etc.).</p>
        <div class="answer">
          <strong>Example (linear regression):</strong><br>
          For a single input vector $x$, or for a whole data matrix $X$ with one sample per row, the usual model is
          $$
          \hat{y} = w^\top x + b \quad\text{or}\quad \hat{y} = Xw + b,
          $$
          where $w$ holds the weights and $b$ is a bias (offset).
        </div>
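        <p>A quick numeric sketch of the single-sample form, using made-up values $w=(2,-1)$, $b=0.5$ and $x=(3,4)$:</p>
        <div class="example">
          $$ \hat{y} = w^\top x + b = 2\cdot 3 + (-1)\cdot 4 + 0.5 = 2.5 $$
        </div>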
      </section>
      <hr/>
      <section>
        <h2>2. Multivariate Functions</h2>
        <p>Functions of many variables:</p>
        <div class="example">
          $$ f(x_1,x_2,\dots,x_n) $$
        </div>
        <p>In ML, inputs are usually high-dimensional vectors, and cost functions depend on many parameters (weights).</p>
      </section>
      <hr/>
      <section>
        <h2>3. Parameters of a Function</h2>
        <p>Parameters are the adjustable knobs (weights, biases) we optimize. Example:</p>
        <div class="example">
          $$ f(x) = w_1 x_1 + w_2 x_2 + b $$
        </div>
      </section>
      <hr/>
      <section>
        <h2>4. Maxima & Minima</h2>
        <p>Optimization seeks a minimum (e.g. of a loss) or a maximum (e.g. of a reward). For a differentiable $f$ without constraints, a necessary first-order condition is:</p>
        <div class="example">
          $$ \nabla f(x) = 0 $$
        </div>
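        <p>The Hessian (matrix of second derivatives) indicates the curvature at such a stationary point: positive definite → local minimum, negative definite → local maximum, indefinite → saddle point.</p>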
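        <p>A worked sketch of both conditions, assuming the illustrative function $f(x,y)=x^2+y^2-2x$:</p>
        <div class="example">
          $$ \nabla f = \begin{pmatrix} 2x-2 \\ 2y \end{pmatrix} = 0 \;\Rightarrow\; (x,y)=(1,0), \qquad
             H=\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} $$
        </div>
        <p>Both eigenvalues of $H$ are positive, so $(1,0)$ is a local minimum (here also the global one, with $f(1,0)=-1$).</p>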
      </section>
      <hr/>
      <section>
        <h2>5. Loss Functions</h2>
        <p>Loss measures error between predictions and targets. Common examples:</p>
        <ul>
          <li>MSE (Mean Squared Error): $$ L_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$ (sum over the $n$ samples)</li>
          <li>Cross-entropy (classification): $$ L_{\text{CE}} = -\sum_i y_i \log \hat{p}_i $$ (for a single example, with the sum over classes $i$ and one-hot $y$)</li>
        </ul>
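        <p>To make the two formulas concrete, here is a small numeric sketch with made-up values: targets $(1,2,3)$ with predictions $(1.5,2,2)$ for MSE, and a one-hot target $(0,1,0)$ with predicted probabilities $(0.2,0.7,0.1)$ for cross-entropy (natural logarithm assumed):</p>
        <div class="example">
          $$ L_{\text{MSE}} = \tfrac{1}{3}\left(0.5^2+0^2+1^2\right)=\tfrac{1.25}{3}\approx 0.417 $$
          $$ L_{\text{CE}} = -\log 0.7 \approx 0.357 $$
        </div>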
      </section>
      <hr/>
      <section>
        <h2>6. How to Select a Good Loss Function</h2>
        <p>Choice depends on the task:</p>
        <ul>
          <li>Regression → MSE, MAE.</li>
          <li>Classification → Cross-entropy (softmax + log-loss).</li>
          <li>Ranking → Hinge, pairwise losses.</li>
        </ul>
        <div class="answer">
          <strong>Why cross-entropy often beats MSE for classification:</strong>
          <ul>
            <li>Cross-entropy corresponds to the negative log-likelihood under a probabilistic model (softmax + categorical distribution), so optimizing it performs maximum likelihood estimation.</li>
            <li>It produces larger gradients when predictions are confidently wrong, giving stronger corrective updates. With a sigmoid or softmax output, the MSE gradient carries an extra factor that shrinks toward zero as the predicted probability saturates near 0 or 1, so learning can stall even when the prediction is wrong.</li>
            <li>Combined with a softmax output, the cross-entropy gradient with respect to the logits simplifies to $\hat{p} - y$, which tends to give stable, fast convergence on classification tasks.</li>
          </ul>
        </div>
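        <p>The gradient argument can be sketched for the simplest case. Assume a binary problem with a single sigmoid output $p=\sigma(z)$ and target $y$ (this setup is an assumption chosen for illustration), and compare the gradients with respect to the logit $z$:</p>
        <div class="example">
          $$ L_{\text{CE}}=-\left[y\log p+(1-y)\log(1-p)\right] \;\Rightarrow\; \frac{\partial L_{\text{CE}}}{\partial z}=p-y $$
          $$ L_{\text{MSE}}=(p-y)^2 \;\Rightarrow\; \frac{\partial L_{\text{MSE}}}{\partial z}=2\,(p-y)\,p\,(1-p) $$
        </div>
        <p>For a confidently wrong prediction, say $y=1$ and $p=0.01$, the cross-entropy gradient is $-0.99$, while the MSE gradient is $2\cdot(-0.99)\cdot 0.01\cdot 0.99\approx -0.0196$, roughly fifty times smaller.</p>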
      </section>
      <hr/>
      <section>
        <h2>7. Calculating Parameters of a Loss Function</h2>
        <p>Training means finding the parameters $w$ that minimize $L(w)$. This is usually done with gradient-based methods, because closed-form solutions are rare for complex models.</p>
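        <p>Linear regression with MSE is one of the few cases where a closed form does exist. As a sketch, assume the bias is absorbed into $w$ (by appending a column of ones to $X$) and that $X^\top X$ is invertible:</p>
        <div class="example">
          $$ L(w)=\frac{1}{n}\,\|Xw-y\|^2, \qquad \nabla_w L=\frac{2}{n}\,X^\top(Xw-y)=0 $$
          $$ \Rightarrow\; X^\top X\,w = X^\top y \;\Rightarrow\; w=(X^\top X)^{-1}X^\top y $$
        </div>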
      </section>
      <hr/>
      <section>
        <h2>8. Convex & Concave Loss Functions</h2>
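        <p><strong>Convex:</strong> any line segment between two points on the graph lies on or above the graph, so every local minimum is also a global minimum (and a strictly convex function has at most one minimizer). Convex losses are easier to optimize reliably.</p>
        <p><strong>Concave:</strong> the mirror image: $f$ is concave when $-f$ is convex, so segments lie on or below the graph and every local maximum is global.</p>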
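        <p>The line-segment picture corresponds to the standard inequality below, stated for any two points $a$, $b$ and any $\lambda\in[0,1]$. The check with $f(x)=x^2$, $a=0$, $b=2$, $\lambda=\tfrac12$ is an illustrative choice:</p>
        <div class="example">
          $$ f\big(\lambda a+(1-\lambda)b\big)\;\le\;\lambda f(a)+(1-\lambda)f(b) $$
          $$ f(1)=1\;\le\;\tfrac12 f(0)+\tfrac12 f(2)=2 $$
        </div>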
      </section>
      <hr/>
      <section>
        <h2>9. Gradient Descent</h2>
        <p>Iterative update rule:</p>
        <div class="example">
          $$ w \leftarrow w - \eta \,\nabla_w L(w) $$
        </div>
        <p>Here $\eta \gt 0$ is the learning rate (step size). Common variants include SGD, mini-batch gradient descent, momentum, Adam, and RMSProp.</p>
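        <p>A hand-computed sketch of the update rule, assuming the toy loss $L(w)=w^2$ (so $\nabla_w L=2w$), a starting point $w_0=1$ and $\eta=0.1$; all three are illustrative choices:</p>
        <div class="example">
          $$ w_{k+1}=w_k-\eta\cdot 2w_k=(1-2\eta)\,w_k=0.8\,w_k $$
          $$ w_0=1,\quad w_1=0.8,\quad w_2=0.64,\quad w_3=0.512,\;\dots\;\to 0 $$
        </div>
        <p>With a step that is too large, for example $\eta=1.1$, the factor becomes $1-2\eta=-1.2$ and the iterates $1,\,-1.2,\,1.44,\dots$ grow in magnitude instead of converging.</p>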
      </section>
      <hr/>
      <section>
        <h2>10. Hessians</h2>
        <p>Matrix of second partial derivatives:</p>
        <div class="example">
          $$ H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} $$
        </div>
        <p>Hessians describe curvature and are central to Newton's method and other second-order optimizers. A positive definite Hessian at a point means the function is locally strictly convex there.</p>
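        <p>As a sketch of how the Hessian is used, Newton's method rescales the gradient by the inverse Hessian. For the illustrative quadratic $f(x,y)=x^2+3y^2$, a single step from any starting point $(x_0,y_0)$ lands on the minimizer:</p>
        <div class="example">
          $$ x \leftarrow x - H^{-1}\nabla f(x) $$
          $$ \nabla f=\begin{pmatrix} 2x_0 \\ 6y_0 \end{pmatrix},\quad
             H=\begin{pmatrix} 2 & 0 \\ 0 & 6 \end{pmatrix}
             \;\Rightarrow\;
             \begin{pmatrix} x_0 \\ y_0 \end{pmatrix}-\begin{pmatrix} x_0 \\ y_0 \end{pmatrix}=\begin{pmatrix} 0 \\ 0 \end{pmatrix} $$
        </div>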
      </section>
      <hr/>
      <section>
        <h2>11. Problems Faced in Optimization</h2>
        <ul>
          <li>Local minima (less of a problem in very-high-dimensional neural nets).</li>
          <li>Saddle points (common and can slow training).</li>
          <li>Vanishing / exploding gradients (especially in deep networks).</li>
          <li>Poor learning rate choice (too small → slow; too large → divergence).</li>
        </ul>
      </section>
      <hr/>
      <section>
        <h2>12. Constrained Optimization</h2>
        <p>Sometimes we optimize subject to constraints. For example, Lasso can be written with the constraint $\|w\|_1 \le \alpha$ or, equivalently, with an added penalty $\lambda\|w\|_1$. Standard tools include Lagrange multipliers and the KKT conditions.</p>
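        <p>A minimal worked sketch of the Lagrange-multiplier method, on a toy problem chosen for illustration: minimize $x^2+y^2$ subject to $x+y=1$.</p>
        <div class="example">
          $$ \mathcal{L}(x,y,\lambda)=x^2+y^2+\lambda\,(x+y-1) $$
          $$ \frac{\partial\mathcal{L}}{\partial x}=2x+\lambda=0,\quad
             \frac{\partial\mathcal{L}}{\partial y}=2y+\lambda=0,\quad
             x+y=1
             \;\Rightarrow\; x=y=\tfrac12,\;\lambda=-1 $$
        </div>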
      </section>
    </details>
    <hr/>
  </div>
</body>
</html>
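
The gradient descent update from the notes above can be sketched in a few lines of code. This is a minimal illustration, not a production implementation: the function name `fit_line`, the learning rate, the step count, and the toy data are all assumptions chosen for the example. It fits a one-dimensional linear model $\hat{y} = wx + b$ by minimizing the MSE loss.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by plain gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current predictions
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of (1/n) * sum(err^2) with respect to w and b
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errs, xs))
        grad_b = 2.0 / n * sum(errs)
        # Update rule: parameter <- parameter - lr * gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (noise-free, for illustration only)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
print(w, b)  # should end up close to 2 and 1
```

Because this loss is convex, the iterates approach the unique minimizer as long as the learning rate is small enough. Raising `lr` too far makes the updates diverge, which matches the learning-rate issue listed under optimization problems.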
