This introduction to logistic regression follows Chapter 5 of Stanford's Speech and Language Processing course. 

## Binary Logistic Regression

Recall that the Naive Bayes classifier assigns class $c$ to a document $d$ not by directly computing $P(c|d)$, but by computing a likelihood and a prior:

$$
\hat{c} = \underset{c \in C}{\text{argmax}} \:\: \overbrace{P(d|c)}^{\text{likelihood}}\:\: \overbrace{P(c)}^{\text{prior}}
$$
Generative models (like Naive Bayes) use a likelihood term to express the probability of generating the document's features given that it was of class $c$. Discriminative models (like logistic regression) instead compute $P(c|d)$ directly to discriminate between potential classes, even though they could not generate an example of a given class.


#### Components of a Probabilistic Machine Learning Classifier
Naive Bayes and logistic regression both require a training set of $m$ input/output pairs $\left( x^{(i)},y^{(i)} \right)$, where the superscript notation refers to an individual instance in the training set. The ML classification system has four parts: 
* Feature representation of input. Input $x^{(i)}$ is represented as a vector of features $[x_1,x_2,...,x_n]$. $x^{(j)}_i$ represents feature $i$ for input $x^{(j)}$; it is also written $x_i$, $f_i$, or $f_i(x)$. 
* Classification function to calculate the estimated class $\hat{y}$ through $p(y|x)$, using tools such as the sigmoid and softmax functions. 
* Objective function for learning, which minimizes error on training examples. In this case, the cross-entropy loss function.
* An algorithm to optimize the objective function. In this case, stochastic gradient descent (SGD). 

#### Sigmoid Function
The sigmoid function helps classifiers make a binary decision about the class of a new input. For the input observation $x$, represented by feature vector $[x_1, x_2,...,x_n]$, the classifier output $y$ can be 1 (the observation is a member of the class) or 0 (it is not). To calculate $P(y=1|x)$, logistic regression learns (from a training set) a vector of weights and a bias term. Each weight $w_i$ is associated with one input feature $x_i$ and reflects how important that feature is to the classification decision; a weight can be positive or negative. The bias term (intercept) is added to the weighted inputs. 

After learning these values in training, the classifier decides on a test instance by first multiplying each $x_i$ by its weight $w_i$, summing these values, and adding the bias term. $z$, the weighted sum of evidence for this class, can be expressed using dot product notation, using boldface notation to represent vectors: 
$$
z = \left( \sum_{i=1}^n w_i x_i \right) + b = \textbf{w} \cdot \textbf{x} + b
$$
To convert $z$ to a legal probability between 0 and 1, we pass it through the sigmoid function $\sigma(z)$ (named for its S-shaped curve, and also called the logistic function). 
$$
\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + \exp(-z)}
$$
We can then calculate probabilities of the two cases as follows (which sum to 1):
$$
P(y=1) = \sigma(\textbf{w} \cdot \textbf{x} + b) = \frac{1}{1 + \exp(-(\textbf{w} \cdot \textbf{x} + b))}
$$
$$
P(y=0) = 1 - \sigma(\textbf{w} \cdot \textbf{x} + b) = \frac{\exp(-(\textbf{w} \cdot \textbf{x} + b))}{1 + \exp(-(\textbf{w} \cdot \textbf{x} + b))}
$$
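The closed form for $P(y=0)$ comes from putting $1 - \sigma(z)$ over a common denominator. Writing $z = \textbf{w} \cdot \textbf{x} + b$ for brevity (a shorthand used only here), a short derivation is:
$$
1 - \sigma(z) = \frac{1 + \exp(-z)}{1 + \exp(-z)} - \frac{1}{1 + \exp(-z)} = \frac{\exp(-z)}{1 + \exp(-z)} = \frac{1}{1 + \exp(z)} = \sigma(-z)
$$
The last two steps multiply the numerator and denominator by $\exp(z)$, which shows that $P(y=0)$ is simply the sigmoid of the negated score.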
Setting 0.5 as the decision boundary, for a test instance $x$, we say "yes" if $P(y=1|x) > 0.5$, and "no" otherwise. 
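The steps above (weighted sum, sigmoid, threshold) can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `predict_proba` and `classify`, and the weights, bias, and input values, are all assumptions made up for the example.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_proba(x, w, b):
    """Return P(y=1|x) for a single feature vector x."""
    z = np.dot(w, x) + b  # weighted sum of evidence for the class
    return sigmoid(z)


def classify(x, w, b, threshold=0.5):
    """Return 1 ("yes") if P(y=1|x) exceeds the threshold, else 0 ("no")."""
    return int(predict_proba(x, w, b) > threshold)


# Hypothetical weights, bias, and input, chosen only for illustration.
w = np.array([2.0, -1.0, 0.5])
b = 0.1
x = np.array([1.0, 3.0, 2.0])

p_yes = predict_proba(x, w, b)  # here z = 2.0 - 3.0 + 1.0 + 0.1 = 0.1
p_no = 1.0 - p_yes              # the two probabilities sum to 1
label = classify(x, w, b)
```

Because $\sigma(0) = 0.5$ and the sigmoid is increasing, the rule $P(y=1|x) > 0.5$ is equivalent to checking whether $z > 0$.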

#### Features
Features are traditionally designed by examining the training dataset. When doing this, it's important to consider feature interactions: cases where the effect of one feature depends on the value of another, which a prediction expressed as a plain sum of individual feature effects cannot capture. To avoid the extensive human effort of feature design, more recent NLP work has incorporated representation learning: ways to learn features automatically, in an unsupervised manner, from the input. 
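Since a linear score adds each feature's weighted contribution independently, one common way to represent an interaction by hand is to add a combination feature, such as the product of two existing features. The sketch below assumes two hypothetical binary features (whether a document contains "not" and whether it contains "good"); the helper name `add_interaction` is also an assumption for illustration.

```python
import numpy as np


def add_interaction(X, i, j):
    """Append the product of columns i and j of X as a new feature column."""
    product = (X[:, i] * X[:, j]).reshape(-1, 1)
    return np.hstack([X, product])


# Hypothetical binary features:
# column 0 = document contains "not", column 1 = document contains "good".
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
])

# The new third column is 1 only when both words are present,
# so the classifier can learn a separate weight for that combination.
X_aug = add_interaction(X, 0, 1)
```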

When different features have different ranges, it's common to scale them by standardizing the input values so that each feature has zero mean and a standard deviation of one. That is, if $\mu_i$ is the mean of the values of feature $\textbf{x}_i$ across $m$ observations in the input dataset, and $\sigma_i$ is the standard deviation of these values, we replace each feature $\textbf{x}_i$ with $\textbf{x}^\prime_i$:
$$
\mu_i = \frac{1}{m} \sum_{j=1}^m x_i^{(j)} \quad \text{and} \quad \sigma_i = \sqrt{\frac{1}{m} \sum_{j=1}^m \left(x_i^{(j)} - \mu_i \right)^2}
$$
$$
\textbf{x}^\prime_i = \frac{\textbf{x}_i - \mu_i}{\sigma_i}
$$
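As a minimal sketch of the standardization formulas, assuming the data is stored as a NumPy array with one row per observation and one column per feature (the function name `standardize` and the sample values are made up for illustration):

```python
import numpy as np


def standardize(X):
    """Rescale each column of X to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)    # per-feature mean over the m rows
    sigma = X.std(axis=0)  # population std (divides by m), as in the formula
    return (X - mu) / sigma, mu, sigma


# Hypothetical data: 4 observations, 2 features on very different scales.
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
    [4.0, 700.0],
])

X_std, mu, sigma = standardize(X)
```

This sketch assumes no feature is constant; a column whose values are all equal has $\sigma_i = 0$ and would need separate handling to avoid division by zero.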
Alternatively, we can normalize the input features, creating a range between 0 and 1: 
$$
\textbf{x}^\prime_i  = \frac{\textbf{x}_i - \min(\textbf{x}_i)}{\max(\textbf{x}_i) - \min(\textbf{x}_i)}
$$
Both data scaling options help compare values across features. 
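A minimal NumPy sketch of the min-max normalization formula, again assuming one row per observation and one column per feature; the function name `min_max_normalize` and the sample values are illustrative assumptions:

```python
import numpy as np


def min_max_normalize(X):
    """Rescale each column of X to the range [0, 1]."""
    lo = X.min(axis=0)  # per-feature minimum
    hi = X.max(axis=0)  # per-feature maximum
    return (X - lo) / (hi - lo)


# Hypothetical data: 4 observations, 2 features on very different scales.
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
    [4.0, 700.0],
])

# After scaling, both columns cover the same [0, 1] range.
X_norm = min_max_normalize(X)
```

As written, the sketch assumes every feature takes at least two distinct values, since a constant column would make the denominator zero.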

#### Vectorization
We need to process an entire test set with $m$ test examples needing classification. Each test example $x^{(i)}$ has a feature vector $\textbf{x}^{(i)}$, for $1 \leq i \leq m$. We could calculate $\hat{y}^{(i)}$ with a for-loop:
$$
\begin{align*}
\textbf{foreach} \:\: & x^{(i)} \text{ in input } [x^{(1)}, x^{(2)},...,x^{(m)}] \\
&\hat{y}^{(i)} = \sigma(\textbf{w}\cdot\textbf{x}^{(i)}+b)
\end{align*}
$$
For the first 3 test examples,
$$
P(y^{(1)} = 1|x^{(1)}) = \sigma(\textbf{w}\cdot\textbf{x}^{(1)}+b) \\
P(y^{(2)} = 1|x^{(2)}) = \sigma(\textbf{w}\cdot\textbf{x}^{(2)}+b) \\
P(y^{(3)} = 1|x^{(3)}) = \sigma(\textbf{w}\cdot\textbf{x}^{(3)}+b)
$$
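The for-loop version can be sketched directly in Python. The function name `predict_loop`, the weights, the bias, and the three test examples are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_loop(examples, w, b):
    """Compute sigma(w . x + b) separately for each feature vector."""
    y_hat = []
    for x in examples:
        y_hat.append(sigmoid(np.dot(w, x) + b))
    return y_hat


# Hypothetical weights and three test examples with two features each.
w = np.array([1.0, -2.0])
b = 0.5
examples = [
    np.array([1.5, 1.0]),  # z = 1.5 - 2.0 + 0.5 = 0.0
    np.array([3.0, 0.0]),  # z = 3.0 + 0.5 = 3.5
    np.array([0.0, 2.0]),  # z = -4.0 + 0.5 = -3.5
]

y_hat = predict_loop(examples, w, b)
```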
Seeing a pattern, we can use matrix arithmetic to make the process more efficient. We pack the feature vectors of all $m$ test examples into a single input matrix $\textbf{X}$. Each row $i$ corresponds to the feature vector for input example $x^{(i)}$. If each example has $f$ features, we construct an $m \times f$ matrix: 
$$
\textbf{X} = \begin{bmatrix}
x^{(1)}_1 & x^{(1)}_2 & \dots & x^{(1)}_f \\
x^{(2)}_1 & x^{(2)}_2 & \dots & x^{(2)}_f \\
\vdots    & \vdots    & \ddots& \vdots    \\
x^{(m)}_1 & x^{(m)}_2 & \dots & x^{(m)}_f \\
\end{bmatrix}_{m \times f}
$$

Next, we introduce $\textbf{b}$ as a vector of length $m$ with the scalar bias term repeated $m$ times, and $\textbf{y}$ as the vector of outputs, with one estimate $\hat{y}^{(i)}$ per test example: 
$$
\textbf{b} = \begin{bmatrix}
b \\ b \\ \vdots \\ b
\end{bmatrix} \quad \text{and} \quad {\textbf{y}} = \begin{bmatrix}
\hat{y}^{(1)} \\ \hat{y}^{(2)} \\ \vdots \\ \hat{y}^{(m)}
\end{bmatrix}
$$

Now, all outputs can be computed with a single matrix multiplication followed by an addition, with the sigmoid applied elementwise to the result: 
$$
\textbf{y} = \sigma(\textbf{Xw} + \textbf{b})
$$
With the sigmoid applied to each entry, this computes the same values as the for-loop. In terms of shapes, $\textbf{y}$ is $(m \times 1)$, $\textbf{X}$ is $(m \times f)$, $\textbf{w}$ is $(f \times 1)$, and $\textbf{b}$ is $(m \times 1)$. More broadly, logistic regression generally works better on larger datasets or documents. In settings with many correlated features, logistic regression tends to outperform Naive Bayes, because Naive Bayes's conditional independence assumption treats correlated features as separate pieces of evidence and so overcounts them. 
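A minimal NumPy sketch of the batched prediction, applying the sigmoid elementwise to $\textbf{Xw} + \textbf{b}$ and checking the result against a per-example loop. The name `predict_batch` and all numeric values are assumptions for illustration. Two details differ from the notation above: the vectors are 1-D arrays of shape `(m,)` and `(f,)` rather than explicit column matrices, and NumPy broadcasting adds the scalar `b` to every row, so the repeated vector $\textbf{b}$ never has to be built.

```python
import numpy as np


def sigmoid(z):
    """Elementwise sigmoid; works on scalars and arrays alike."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_batch(X, w, b):
    """X: (m, f) matrix, w: (f,) weights, b: scalar bias -> (m,) probabilities."""
    return sigmoid(X @ w + b)  # the scalar b is broadcast across all m rows


# Hypothetical weights and a test set of m = 3 examples with f = 2 features.
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([
    [1.5, 1.0],
    [3.0, 0.0],
    [0.0, 2.0],
])

y_batch = predict_batch(X, w, b)

# Same computation, one example at a time, for comparison.
y_loop = np.array([sigmoid(np.dot(w, x) + b) for x in X])
```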