
#Introduction
Binary classification tasks involve predicting one of two classes for a given instance. To evaluate and optimize models for such tasks, the cross-entropy loss function is widely used. This function measures the dissimilarity between predicted probabilities and true binary labels. By minimizing the cross-entropy loss, we aim to train a model that can accurately classify instances into one of the two classes.

**Cross-Entropy Loss Function**

The cross-entropy loss function, derived from information theory, is a popular choice because it is differentiable and penalizes confident but incorrect predictions especially heavily: the loss grows without bound as the predicted probability of the true class approaches 0. As a result, the instances the model gets most wrong contribute the most to each parameter update. The loss is defined as the average negative logarithm of the predicted probability for the correct class.
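
To make the penalty concrete, the short sketch below evaluates the per-instance loss for a few probabilities assigned to the correct class. The probability values are illustrative assumptions, not taken from any dataset.

```python
import math

# Probability the model assigns to the correct class,
# ordered from confident-and-right to confident-and-wrong (illustrative values)
probs_for_true_class = [0.99, 0.9, 0.5, 0.1, 0.01]

# The per-instance cross-entropy loss is the negative log of that probability
losses = [-math.log(p) for p in probs_for_true_class]

for p, loss in zip(probs_for_true_class, losses):
    print(f"p(correct class) = {p:<5} loss = {loss:.4f}")
```

The loss rises from roughly 0.01 at p = 0.99 to roughly 4.6 at p = 0.01, which is the sense in which confident mistakes are penalized most.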

**Dataset and Binary Classification**

In a binary classification problem, we have a dataset comprising input features and corresponding binary labels, where each label represents one of the two classes (e.g., 0 or 1, negative or positive). Our goal is to build a model that can learn from the data and make accurate predictions on unseen instances.

**Calculating Cross-Entropy Loss**

To calculate the cross-entropy loss, we pass the input features through the model to obtain predicted probabilities. These probabilities are then compared to the true binary labels. If the true label is 1, the loss is the negative logarithm of the predicted probability for class 1. Conversely, if the true label is 0, the loss is the negative logarithm of the predicted probability for class 0. The average loss across all instances in the dataset gives us the overall cross-entropy loss.
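
The two cases above can be written as a single formula. The notation here is an assumption chosen to match the code later in this notebook: $m$ is the number of instances, $y^{(i)}$ is the true label of instance $i$, and $h_\theta(x^{(i)})$ is the predicted probability of class 1 for that instance.

$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]
$$

When $y^{(i)} = 1$ only the first term inside the brackets is nonzero, and when $y^{(i)} = 0$ only the second term is, which reproduces the case-by-case description.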

**Minimizing the Cross-Entropy Loss**

Minimizing the cross-entropy loss is typically achieved using optimization algorithms like gradient descent. The model's parameters are iteratively adjusted in the direction that reduces the loss, improving the model's ability to accurately classify instances.
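
For a sigmoid-based model such as the one built later in this notebook, the update can be sketched as follows. The symbols are assumptions for illustration: $X$ is the feature matrix, $h$ the vector of predicted probabilities, $y$ the label vector, $m$ the number of instances, and $\alpha$ the learning rate.

$$
\nabla_\theta J(\theta) = \frac{1}{m} X^\top (h - y), \qquad \theta \leftarrow \theta - \alpha \, \nabla_\theta J(\theta)
$$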

**Application and Conclusion**

By minimizing the cross-entropy loss, we aim to train a binary classification model that can make informed and accurate predictions on new and unseen instances. This has applications in various domains such as disease diagnosis, fraud detection, sentiment analysis, and more.

Now that we have an understanding of the cross-entropy loss function and its role in binary classification, let's proceed to code examples that demonstrate its calculation and optimization.

#Build Example Model

##Importing Dependencies
We begin by importing the necessary libraries for our code:

- numpy (as np) for numerical computations.
- matplotlib.pyplot (as plt) for plotting.

We also import the load_breast_cancer function from sklearn.datasets and the train_test_split function from sklearn.model_selection.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

##Loading the Dataset
We load the Breast Cancer Wisconsin dataset with the load_breast_cancer function. The returned object exposes the feature matrix as data.data and the binary target labels as data.target.

In [2]:
# Load the Breast Cancer Wisconsin dataset
data = load_breast_cancer()

##Splitting the Dataset
Next, we extract the feature matrix X and the target vector y from the loaded dataset and split them into training and testing sets using the train_test_split function. We hold out 20% of the instances for testing (test_size=0.2) and fix random_state=42 so the split is reproducible. The resulting arrays are X_train, X_test, y_train, and y_test.

In [3]:
# Extract the features and target variable
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

##Defining the Sigmoid Function
We define the sigmoid function, which maps any real-valued number to a value between 0 and 1. It is used to convert the linear function output into a probability.

In [4]:
# Define the sigmoid function
def sigmoid(z):
    # Clip z so np.exp does not overflow for large negative inputs
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))
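
As a quick check of the behavior described above, the sketch below evaluates a sigmoid on a handful of inputs. The input values are arbitrary and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Arbitrary illustrative inputs, symmetric around zero
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = sigmoid(z)
print(p)

# Symmetry property: sigmoid(z) + sigmoid(-z) equals 1
print(p + sigmoid(-z))
```

An input of 0 maps to exactly 0.5, large negative inputs approach 0, and large positive inputs approach 1.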

##Defining the Hypothesis Function
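The hypothesis function computes the predicted probabilities. It takes the parameter vector theta and the feature matrix X as inputs, forms the linear combination np.dot(X, theta) for every instance, and passes the result through the sigmoid function to obtain one probability of class 1 per row of X.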

In [5]:
# Define the hypothesis function
def hypothesis(theta, X):
    return sigmoid(np.dot(X, theta))
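
The shapes involved are easy to get wrong, so here is a minimal sketch on made-up inputs. X_toy and theta_toy are hypothetical values, not part of the notebook's data.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def hypothesis(theta, X):
    # One probability per row of X
    return sigmoid(np.dot(X, theta))

# Hypothetical toy inputs: 3 instances, a bias column plus 2 features
X_toy = np.array([[1.0,  2.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [1.0, -2.0, 0.0]])
theta_toy = np.array([0.0, 1.0, 0.0])

probs = hypothesis(theta_toy, X_toy)
print(X_toy.shape, theta_toy.shape, probs.shape)  # (3, 3) (3,) (3,)
print(probs)
```

A matrix with m rows and n columns combined with a parameter vector of length n yields m probabilities.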

##Defining the Cost Function
The cost function calculates the cross-entropy loss, which measures the dissimilarity between the predicted probabilities and the true binary labels. It takes the parameter vector theta, the feature matrix X, and the target vector y as inputs. The cost is computed by comparing the predicted probabilities to the true labels and applying the cross-entropy formula.

In [6]:
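# Define the cost function (cross-entropy loss)
def cost_function(theta, X, y):
    m = len(X)
    # Clip probabilities away from 0 and 1 so np.log stays finite
    eps = 1e-15
    h = np.clip(hypothesis(theta, X), eps, 1 - eps)
    # Per-instance log-likelihood of the true label
    error = y * np.log(h) + (1 - y) * np.log(1 - h)
    cost = -np.sum(error) / m
    return cost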
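
One useful sanity check: with theta set to all zeros, every predicted probability is 0.5, so the cost should equal log(2) regardless of the data. The sketch below verifies this on a small made-up dataset. X_toy and y_toy are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_function(theta, X, y):
    # Average cross-entropy between predicted probabilities and labels
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Hypothetical toy data: 4 instances, a bias column plus 2 features
X_toy = np.array([[1.0,  0.5,  1.2],
                  [1.0, -1.0,  0.3],
                  [1.0,  2.0, -0.7],
                  [1.0,  0.1,  0.1]])
y_toy = np.array([1, 0, 1, 0])

cost_at_zero = cost_function(np.zeros(3), X_toy, y_toy)
print(cost_at_zero, np.log(2))
```

If the first recorded cost of a run that starts from zeros is far from about 0.693, that usually points to a bug in the cost or hypothesis function.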

##Defining the Gradient Descent Function
The gradient descent function performs the optimization process by iteratively updating the parameter vector theta to minimize the cost function. It takes the initial parameter vector theta, the feature matrix X, the target vector y, the learning rate, and the number of iterations as inputs. In each iteration, it computes the gradients, updates the parameters, and calculates the cost. The updated parameters and the cost history are returned.

In [7]:
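# Define the gradient descent function
def gradient_descent(theta, X, y, learning_rate, num_iterations):
    m = len(X)
    history = []
    for _ in range(num_iterations):
        # Predicted probabilities and their difference from the labels
        h = hypothesis(theta, X)
        error = h - y
        # Gradient of the cross-entropy loss with respect to theta
        gradient = np.dot(X.T, error) / m
        # Rebind theta rather than using -= so the caller's array is not modified in place
        theta = theta - learning_rate * gradient
        # Record the cost after this update
        cost = cost_function(theta, X, y)
        history.append(cost)
    return theta, history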
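
A common way to gain confidence in the gradient expression used above is to compare it against a finite-difference approximation of the cost. The sketch below does this on random toy data. The helper names analytic_gradient and numerical_gradient, and the toy arrays, are assumptions for illustration and do not appear in the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_function(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(theta, X, y):
    # Same expression as inside gradient_descent: X^T (h - y) / m
    return X.T @ (sigmoid(X @ theta) - y) / len(X)

def numerical_gradient(theta, X, y, eps=1e-6):
    # Central finite differences, one coordinate at a time
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost_function(theta + step, X, y)
                   - cost_function(theta - step, X, y)) / (2 * eps)
    return grad

# Hypothetical toy problem: 20 instances, a bias column plus 3 random features
rng = np.random.default_rng(0)
X_toy = np.c_[np.ones((20, 1)), rng.normal(size=(20, 3))]
y_toy = (rng.random(20) > 0.5).astype(float)
theta_toy = rng.normal(scale=0.1, size=4)

max_diff = np.max(np.abs(analytic_gradient(theta_toy, X_toy, y_toy)
                         - numerical_gradient(theta_toy, X_toy, y_toy)))
print(max_diff)
```

A very small maximum difference suggests the analytic gradient matches the cost function it is supposed to differentiate.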

##Adding a Bias Term
We add a column of ones to the feature matrix X_train to account for the bias term in the linear equation. This allows the model to learn an intercept term.

In [8]:
# Add a column of ones to the feature matrix for the bias term
X_train_bias = np.c_[np.ones((len(X_train), 1)), X_train]

##Setting Initial Values and Hyperparameters
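We initialize the parameter vector theta to zeros, with one entry per column of X_train_bias (the bias plus one weight per feature). We also set the learning rate to 0.01 and the number of gradient descent iterations to 1000.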

In [9]:
# Set the initial values and hyperparameters
initial_theta = np.zeros(X_train_bias.shape[1])
learning_rate = 0.01
num_iterations = 1000
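
The features in this dataset sit on very different scales (mean area is in the hundreds while mean smoothness is around 0.1), so np.dot(X, theta) can become very large and saturate the sigmoid. One common remedy, which is not part of the original notebook, is to standardize the features before adding the bias column. The sketch below shows the idea on made-up data. X_raw is a hypothetical stand-in for X_train.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features on very different scales (stand-ins for columns
# such as mean area and mean smoothness)
X_raw = np.c_[rng.normal(650, 350, size=(100, 1)),
              rng.normal(0.1, 0.01, size=(100, 1))]

# Standardize each column using statistics from the training data only
mean = X_raw.mean(axis=0)
std = X_raw.std(axis=0)
X_scaled = (X_raw - mean) / std

# Add the bias column after scaling so it stays a column of ones
X_scaled_bias = np.c_[np.ones((len(X_scaled), 1)), X_scaled]
print(X_scaled_bias[:3])
```

If this approach is used, the same mean and std would also have to be applied to the test set, and the learned theta values would no longer match the ones printed in this notebook.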

##Running Gradient Descent
We call the gradient descent function with the initial values and hyperparameters to minimize the cost function. The optimized parameter vector theta and the cost history are returned.

In [10]:
# Run gradient descent to minimize the cost function
theta, cost_history = gradient_descent(initial_theta, X_train_bias, y_train, learning_rate, num_iterations)

# Print the optimized values of theta
print('Optimized theta:', theta)

Optimized theta: [ 4.24516790e-01  3.24467767e+00  4.24208364e+00  1.88482765e+01
  7.75655856e+00  2.92499148e-02 -1.48711100e-02 -5.91801434e-02
 -2.52974302e-02  5.50134051e-02  2.35598788e-02  1.27194627e-02
  3.11289644e-01 -8.00874115e-02 -8.12217831e+00  1.56530348e-03
 -3.37237831e-03 -6.60996497e-03 -8.22805700e-04  4.97242660e-03
  5.15483696e-04  3.40847459e+00  5.28081593e+00  1.90129635e+01
 -1.11976889e+01  3.57708025e-02 -6.52173044e-02 -1.27355091e-01
 -2.97116594e-02  7.05396872e-02  2.15588138e-02]
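
To use learned parameters on held-out data, the test features need the same bias column, and the predicted probabilities must be turned into class labels. The sketch below shows one way to do that. The predict helper, the 0.5 threshold, and the toy arrays are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    # Clipping keeps np.exp from overflowing on extreme inputs
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def predict(theta, X, threshold=0.5):
    # Hypothetical helper: add the bias column, then threshold the probabilities
    X_bias = np.c_[np.ones((len(X), 1)), X]
    return (sigmoid(X_bias @ theta) >= threshold).astype(int)

# Toy stand-ins for theta, X_test and y_test
theta_toy = np.array([-1.0, 2.0])  # bias of -1 and one feature weight of 2
X_test_toy = np.array([[0.0], [0.4], [0.6], [3.0]])
y_test_toy = np.array([0, 0, 1, 1])

y_pred = predict(theta_toy, X_test_toy)
accuracy = np.mean(y_pred == y_test_toy)
print(y_pred, accuracy)
```

With the notebook's own variables, the analogous call would be predict(theta, X_test), compared against y_test.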


