# Binary Classification Logic

In binary classification, the goal is to assign an input $x$ to one of two classes, $y \in \{0, 1\}$. Here, $y$ is called the label.

We first choose a probability distribution over the output space $y \in \{0, 1\}$. The Bernoulli distribution is the natural choice: it is defined on the discrete set $\{0, 1\}$ and has a single parameter $\lambda \in [0, 1]$, the probability that $y$ takes the value one. Its probability mass function is:

$$\begin{equation}
Pr(y|\lambda) = (1-\lambda)^{1-y} \cdot \lambda^y
\end{equation}$$

Equivalently, in piecewise form:

$$\begin{equation}
Pr(y|\lambda) =
\begin{cases}
1-\lambda & \text{if } y=0 \\
\lambda & \text{if } y=1 \\
\end{cases}
\end{equation}$$
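
As a quick worked example (the value $\lambda = 0.8$ is chosen purely for illustration), substituting each label into the exponent form recovers the two cases of the piecewise form:

$$\begin{equation}
Pr(y=1|\lambda=0.8) = (0.2)^{0} \cdot (0.8)^{1} = 0.8, \qquad Pr(y=0|\lambda=0.8) = (0.2)^{1} \cdot (0.8)^{0} = 0.2
\end{equation}$$

The two probabilities sum to one, as they must for a distribution over $\{0, 1\}$.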

Next, we set up a machine learning model $f(x, \phi)$ to predict the single distribution parameter $\lambda$. Because $\lambda$ must lie in $[0, 1]$ while the model output can be any real number, we pass the output through a function that maps $\mathbb{R}$ to $(0, 1)$. A suitable choice is the logistic sigmoid:

$$\begin{equation}
sig(z) = \frac{1}{1+\exp(-z)}
\end{equation}$$
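
Evaluating this formula directly can overflow in floating point when $z$ is a large negative number, because $\exp(-z)$ becomes very large. The sketch below shows one common workaround and is not part of the original notebook: the helper name `StableSigmoid` is hypothetical, and it switches to the algebraically equivalent form $\exp(z)/(1+\exp(z))$ for negative inputs.

```python
import numpy as np

def StableSigmoid(value):
  """Logistic sigmoid that avoids overflow in exp (hypothetical helper, returns a 1-D or N-D array)."""
  value = np.atleast_1d(np.asarray(value, dtype=float))
  result = np.empty_like(value)
  positive = value >= 0
  # For z >= 0, exp(-z) <= 1, so the textbook form is safe.
  result[positive] = 1.0/(1.0+np.exp(-value[positive]))
  # For z < 0, use exp(z)/(1+exp(z)); exp(z) < 1, so nothing overflows.
  expValue = np.exp(value[~positive])
  result[~positive] = expValue/(1.0+expValue)
  return result

print(StableSigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

For moderate inputs this should agree with the direct formula; the difference only matters at the extremes.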

The predicted distribution parameter is therefore $\lambda = sig(f(x,\phi))$, and the likelihood of a label $y$ given an input $x$ becomes:

$$\begin{equation}
Pr(y|x) = (1 - sig(f(x,\phi)))^{1-y} \cdot sig(f(x,\phi))^y
\end{equation}$$

The loss function is the negative log-likelihood of the training set, which we minimize with respect to the model parameters $\phi$. For $I$ training pairs $\{x_i, y_i\}$ it is:

$$\begin{equation}
L(\phi) = \sum_{i=1}^{I} -[(1-y_i) \cdot \log(1-sig(f(x_i,\phi))) + y_i \cdot \log(sig(f(x_i,\phi)))]
\end{equation}$$
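
In floating point, a sigmoid output can round to exactly 0 or 1, and taking its logarithm then produces infinities. One way around this, sketched here under the assumption that the raw model outputs $f(x_i,\phi)$ are available as an array (called `logits` below, a name the original does not use), relies on the identities $-\log(sig(z)) = \log(1+\exp(-z))$ and $-\log(1-sig(z)) = \log(1+\exp(z))$, which `np.logaddexp` evaluates stably. The function name `StableNegativeLogLikelihood` is hypothetical.

```python
import numpy as np

def StableNegativeLogLikelihood(labels, logits):
  """Summed binary negative log-likelihood computed from raw model outputs (hypothetical helper)."""
  labels = np.asarray(labels, dtype=float)
  logits = np.asarray(logits, dtype=float)
  # -log(sig(z)) = log(1 + exp(-z)) and -log(1 - sig(z)) = log(1 + exp(z))
  lossPerSample = labels*np.logaddexp(0.0, -logits)+(1-labels)*np.logaddexp(0.0, logits)
  return np.sum(lossPerSample)

# Compare against the direct formula on moderate, illustrative inputs
labels = np.array([1, 0, 1, 0])
logits = np.array([2.0, -1.0, 0.5, 3.0])
lambdas = 1/(1+np.exp(-logits))
directLoss = -np.sum((1-labels)*np.log(1-lambdas)+labels*np.log(lambdas))
print(np.isclose(StableNegativeLogLikelihood(labels, logits), directLoss))
```

The two computations should match on inputs like these; the stable form additionally stays finite when the model is confidently wrong.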

At inference time we often need a point estimate of $y$ rather than a probability. We set $y = 1$ if $\lambda > 0.5$ and $y = 0$ otherwise, which turns the predicted probability into a binary decision.
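
Because the sigmoid is monotonic and $sig(0) = 0.5$, thresholding $\lambda$ at 0.5 is equivalent (up to floating-point rounding very close to zero) to checking whether the raw model output $f(x,\phi)$ is positive. A small sketch with made-up model outputs; the array `modelOutput` is purely illustrative:

```python
import numpy as np

modelOutput = np.array([-2.0, -0.1, 0.0, 0.3, 4.0]) # illustrative values of f(x, phi)
lambdas = 1/(1+np.exp(-modelOutput))

fromLambda = (lambdas > 0.5).astype(int)   # threshold the predicted probability
fromOutput = (modelOutput > 0).astype(int) # threshold the raw output instead
print(np.array_equal(fromLambda, fromOutput))
```

An output of exactly zero gives $\lambda = 0.5$, which the strict inequality assigns to class 0 under both rules.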

In [2]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [3]:
def BernoulliProbability(label:int,lambdaValue:int|float)->int|float:
  """
    Calculates the probability of y given lambda using the Bernoulli distribution.

    Parameters:
    - label: The binary class label (0 or 1).
    - lambdaValue: The probability of y being 1.

    Returns:
    - Probability of label given lambda.
  """
  return (1-lambdaValue)**(1-label)*lambdaValue**label

In [4]:
def LogisticSigmoid(value:np.ndarray)->np.ndarray:
  """Logistic sigmoid: maps real-valued inputs to the (0, 1) interval."""
  return 1/(1+np.exp(-value))

In [5]:
def LinearModel(sample:np.ndarray,phi:np.ndarray)->np.ndarray:
  """
    A simplistic linear model for demonstration. In practice, this would be replaced
    with a more complex model, like a neural network.

    Parameters:
    - sample: Input features.
    - phi: Model parameters.

    Returns:
    - Predicted value before applying sigmoid.
  """
  return np.dot(sample,phi)
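
Note that `np.dot(sample,phi)` has no intercept term, so the decision boundary $f(x,\phi) = 0$ is forced to pass through the origin. One common remedy, shown here as a sketch rather than something the original notebook does, is to append a constant feature of ones so that the last entry of the parameter vector acts as a bias. The helper name `AddBiasColumn` and all numeric values are hypothetical.

```python
import numpy as np

def AddBiasColumn(sample):
  """Appends a column of ones so the last parameter acts as an intercept (hypothetical helper)."""
  sample = np.atleast_2d(sample)
  ones = np.ones((sample.shape[0], 1))
  return np.hstack([sample, ones])

features = np.array([[1.4, 0.2], [4.7, 1.4]]) # two illustrative rows of petal length and width
phiWithBias = np.array([0.5, -1.0, 2.0])      # illustrative weights; the last entry is the bias
print(np.dot(AddBiasColumn(features), phiWithBias))
```

With this change the parameter vector needs one more entry than the number of original features.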

In [6]:
def PredictLambda(sample:np.ndarray,phi:np.ndarray)->np.ndarray:
  """
    Predicts lambda by applying the sigmoid function on the model's output.

    Parameters:
    - sample: Input features.
    - phi: Model parameters.

    Returns:
    - Predicted lambda.
  """
  value = LinearModel(sample,phi)
  return LogisticSigmoid(value)

In [7]:
def BinaryClassificationLoss(labels:np.ndarray,sample:np.ndarray,phi:np.ndarray)->float:
  """
    Computes the negative log-likelihood loss for binary classification.

    Parameters:
    - labels: Array of binary class labels.
    - sample: Input features.
    - phi: Model parameters.

    Returns:
    - The negative log-likelihood loss.
  """
  lambdaPrediction = PredictLambda(sample,phi)
  loss = -(np.sum((1-labels)*np.log(1-lambdaPrediction)+labels*np.log(lambdaPrediction)))
  return loss

In [8]:
def MakePrediction(sample:np.ndarray,phi:np.ndarray)->np.ndarray:
  """
    Makes a binary prediction based on the predicted lambda.

    Parameters:
    - sample: Input features.
    - phi: Model parameters.

    Returns:
    - Binary class prediction (0 or 1).
  """
  lambdaPrediction = PredictLambda(sample,phi)
  return (lambdaPrediction>0.5).astype(int)

In [9]:
irisData = datasets.load_iris()
sample = irisData.data[:,2:]  # We only use petal length and petal width for simplicity
groundTruth = (irisData.target==0).astype(int) # 1 if Setosa, 0 otherwise

In [10]:
print(f"Labels:\n{set(groundTruth)}")

Labels:
{0, 1}


In [11]:
print(f"Data shape:\n{sample.shape}")

Data shape:
(150, 2)


In [12]:
# Split dataset into training and testing sets
xTrain,xTest,yTrain,yTest = train_test_split(sample,groundTruth,test_size=0.2,random_state=45)

In [13]:
def UpdatePhiValues(sample:np.ndarray,groundTruth:np.ndarray,phi:np.ndarray,learningRate:float=0.01)->np.ndarray:
  """
    Updates the model parameters using gradient descent.

    Parameters:
    - sample: Input features.
    - groundTruth: True labels.
    - phi: Current model parameters.
    - learningRate: Step size for gradient descent.

    Returns:
    - Updated model parameters.
  """
  count = len(groundTruth)
  lambdaPrediction = PredictLambda(sample,phi)
  # Gradient of the mean loss: the summed negative log-likelihood gradient divided by count
  gradient = np.dot(sample.T,(lambdaPrediction-groundTruth))/count
  phiUpdated = phi-learningRate*gradient
  return phiUpdated
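
For reference, the gradient used in `UpdatePhiValues` follows from differentiating the loss, assuming the linear model $f(x,\phi) = x^{T}\phi$ used in this notebook. Using $\frac{d}{dz} sig(z) = sig(z) \cdot (1 - sig(z))$, the logarithm terms simplify and the gradient of the summed loss is:

$$\begin{equation}
\frac{\partial L}{\partial \phi} = \sum_{i=1}^{I} (sig(f(x_i,\phi)) - y_i) \cdot x_i
\end{equation}$$

The code divides this sum by the number of samples, which corresponds to minimizing the mean rather than the summed loss; this only rescales the effective learning rate.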

In [14]:
phiValues = np.random.randn(2) # randomly initialize the parameters (one per feature)

In [15]:
print(f"Phi Values:\n{phiValues}")

Phi Values:
[-0.78721354 -0.5843397 ]


In [16]:
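phi = phiValues.copy() # start from the initial parameters
for epoch in range(1000):
  phi = UpdatePhiValues(xTrain,yTrain,phi) # pass phi, not phiValues, so each step builds on the last
# The outputs recorded below came from a loop that passed phiValues on every iteration (one effective step)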

In [17]:
predictions = MakePrediction(xTest,phi)
print(f"Prediction Labels:\n{set(predictions)}")

Prediction Labels:
{0}


In [18]:
accuracy = accuracy_score(yTest,predictions)
print(f"Model Accuracy:\n{accuracy}")

Model Accuracy:
0.6333333333333333


- Another example, using a small hand-made dataset and the mean (rather than summed) loss:

In [31]:
def BernoulliProbability(labels:np.ndarray,lambdaParameters:np.ndarray)->np.ndarray:
  """
    Bernoulli probability mass function.

    Parameters:
    - labels: The binary outcome (0 or 1).
    - lambdaParameters: The probability of the outcome being 1.

    Returns:
    - The probability of labels given lambda.
  """
  return (1-lambdaParameters)**(1-labels)*lambdaParameters**labels

In [32]:
def LogisticSigmoid(value:np.ndarray)->np.ndarray:
  """
    Logistic sigmoid function, mapping any real number to the (0, 1) interval.

    Parameters:
    - value: The input value(s).

    Returns:
    - The sigmoid of value.
  """
  return 1/(1+np.exp(-value))

In [33]:
def ModelPrediction(sample:np.ndarray,phiValues:np.ndarray)->np.ndarray:
  """
    Predicts the probability of y=1 using the logistic sigmoid on the linear model output.

    Parameters:
    - sample: The input features.
    - phiValues: The model parameters.

    Returns:
    - The predicted probability of y being 1.
  """
  value = np.dot(sample,phiValues)
  return LogisticSigmoid(value)

In [34]:
def ComputeLoss(groundTruth:np.ndarray,sample:np.ndarray,phiValues:np.ndarray)->float:
  """
    Computes the negative log-likelihood loss for binary classification.

    Parameters:
    - groundTruth: The true labels.
    - sample: The input features.
    - phiValues: The model parameters.

    Returns:
    - The computed loss.
  """
  predictions = ModelPrediction(sample,phiValues)
  loss = -np.mean(groundTruth*np.log(predictions)+(1-groundTruth)*np.log(1-predictions))
  return loss

In [35]:
def UpdateParameters(sample:np.ndarray,groundTruth:np.ndarray,phiValues:np.ndarray,learningRate:float=0.01)->np.ndarray:
  """
    Updates the model parameters using gradient descent.

    Parameters:
    - sample: The input features.
    - groundTruth: The true labels.
    - phiValues: The model parameters.
    - learningRate: The learning rate for gradient descent.

    Returns:
    - Updated model parameters.
  """
  predictions = ModelPrediction(sample,phiValues)
  gradient = np.dot(sample.T,(predictions-groundTruth))/len(groundTruth)
  phiUpdated = phiValues-learningRate*gradient
  return phiUpdated
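
A quick way to gain confidence in an analytic gradient like the one in `UpdateParameters` is to compare it with a central finite-difference estimate of the loss. The sketch below re-implements the mean loss and its gradient in self-contained form; the names `MeanLoss`, `AnalyticGradient`, and `NumericGradient`, the step size `epsilon`, and the parameter values are assumptions made for illustration.

```python
import numpy as np

def MeanLoss(sample, groundTruth, phiValues):
  """Mean negative log-likelihood for a linear model followed by a sigmoid."""
  predictions = 1/(1+np.exp(-np.dot(sample, phiValues)))
  return -np.mean(groundTruth*np.log(predictions)+(1-groundTruth)*np.log(1-predictions))

def AnalyticGradient(sample, groundTruth, phiValues):
  """Closed-form gradient of MeanLoss with respect to phiValues."""
  predictions = 1/(1+np.exp(-np.dot(sample, phiValues)))
  return np.dot(sample.T, predictions-groundTruth)/len(groundTruth)

def NumericGradient(sample, groundTruth, phiValues, epsilon=1e-6):
  """Central finite-difference estimate of the same gradient."""
  gradient = np.zeros_like(phiValues, dtype=float)
  for idx in range(len(phiValues)):
    step = np.zeros_like(phiValues, dtype=float)
    step[idx] = epsilon
    lossPlus = MeanLoss(sample, groundTruth, phiValues+step)
    lossMinus = MeanLoss(sample, groundTruth, phiValues-step)
    gradient[idx] = (lossPlus-lossMinus)/(2*epsilon)
  return gradient

sample = np.array([[5, 2], [3, 5], [6, 1.5], [2, 3]])
groundTruth = np.array([1, 0, 1, 0])
phiValues = np.array([0.1, -0.2]) # illustrative parameter values
print(np.allclose(AnalyticGradient(sample, groundTruth, phiValues),
                  NumericGradient(sample, groundTruth, phiValues), atol=1e-6))
```

If the two gradients disagree by much more than the tolerance, the analytic expression (or the loss it is meant to differentiate) likely contains a mistake.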

In [36]:
def BinaryClassificationInference(sample:np.ndarray,phiValues:np.ndarray)->np.ndarray:
  """Returns class 1 where the predicted probability exceeds 0.5, and class 0 otherwise."""
  probabilities = ModelPrediction(sample,phiValues)
  return (probabilities>0.5).astype(int)

In [37]:
sample = np.array([[5,2],[3,5],[6,1.5],[2,3]]) # Feature matrix
groundTruth = np.array([1,0,1,0]) # True labels

In [38]:
phiValues = np.random.rand(sample.shape[1])
print(f"Phi Values:\n{phiValues}")

Phi Values:
[0.13385617 0.22222434]


In [39]:
iterations = 1000
learningRate = 0.01
for idx in range(iterations):
  phiValues = UpdateParameters(sample,groundTruth,phiValues,learningRate)
  if idx % 100 == 0:
    print(f"Loss at iteration: {idx} --> ::[{ComputeLoss(groundTruth,sample,phiValues)}]::")
predictions = BinaryClassificationInference(sample,phiValues)
print(f"Predictions:\n{predictions}")

Loss at iteration: 0 --> ::[0.8661142104923024]::
Loss at iteration: 100 --> ::[0.3142663868497561]::
Loss at iteration: 200 --> ::[0.18990551803602562]::
Loss at iteration: 300 --> ::[0.13646302662037677]::
Loss at iteration: 400 --> ::[0.10694560435903737]::
Loss at iteration: 500 --> ::[0.08820184442972846]::
Loss at iteration: 600 --> ::[0.07521487839898558]::
Loss at iteration: 700 --> ::[0.06566656108823572]::
Loss at iteration: 800 --> ::[0.05833884639559832]::
Loss at iteration: 900 --> ::[0.05253011549673224]::
Predictions:
[1 0 1 0]
