In [1]:
# In this notebook, you learn:
#
# 1) What is likelihood?
# 2) What is Maximum Likelihood Estimation (MLE)? 
# 3) How to use likelihood in the context of our name generation problem?
# 4) How to evaluate the quality of the model using the concept of likelihood?
#
# Resources:
#
# 1) https://www.youtube.com/watch?v=ScduwntrMzc&t=41s
#       -- This video explains the concept of likelihood and several other related concepts.
# 2) https://www.youtube.com/watch?v=7kLHJ-F33GI&t=40s
#       -- This video explains the concept of Maximum Likelihood Estimation (MLE) and several other related concepts.
#       -- The first 20 minutes of this video are enough to understand the concept of MLE.
#
# Please watch these videos before you proceed.

In [2]:
# Let's start with understanding Likelihood first.
#
# Let's say we have a coin that was flipped 'n' times and we observed heads 'h' times. Now, we want to estimate the 
# probability 'p' of getting heads when we flip the coin. We can use the likelihood function to estimate the probability.
#
# Likelihood shows how well a particular probability 'p' (0.5 for example) is supported by the data we have. The 
# higher the likelihood, the better that value of 'p' explains the data. However, a single likelihood value is not 
# very informative on its own. It is the relative likelihood that is generally more useful. Relative likelihood is 
# the ratio of the likelihoods of two candidate values of 'p'. For example, if the likelihood of p = 0.5 is 0.1 and 
# the likelihood of p = 0.6 is 0.2, then the relative likelihood of p = 0.6 with respect to p = 0.5 is 
# (0.2/0.1) = 2.0, i.e., the data supports p = 0.6 twice as strongly as p = 0.5.
#
# For the case of coin flip, likelihood is defined as:
# Likelihood(p|data) = Probability of observing the data given that the heads probability is 'p'
#
# In general, let 'theta' be the population parameter (equivalent to 'p' in the coin flip example) and 'Y' be the data. 
# Then, the likelihood function is defined as:
# 
# Likelihood(theta|Y) = Probability of observing the data 'Y' given that the population parameter is 'theta'.
#
# Likelihood describes the extent to which the data supports any particular value of the population parameter. 
# Higher support corresponds to a higher value of likelihood.
#
# Extending this concept to the case of multiple data points, the likelihood function is a function that relates the 
# population parameter to the data. Basically, the likelihood function is just a mathematical function that takes the 
# data and a candidate value of the population parameter as input and returns the likelihood of that value.
#
# NOTE: In most places (textbooks, blogs, ... etc), data is referred to as sample.
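
The coin flip example above can be made concrete with a minimal sketch. The data (7 heads in 10 flips) and the helper name `binomial_likelihood` are assumptions chosen for illustration; they are not part of the name generator model.

```python
from math import comb


def binomial_likelihood(p: float, n: int, h: int) -> float:
    """Probability of observing h heads in n flips when P(heads) = p."""
    return comb(n, h) * p**h * (1 - p) ** (n - h)


# Hypothetical data: 10 flips, 7 heads.
n, h = 10, 7

likelihood_05 = binomial_likelihood(0.5, n, h)
likelihood_06 = binomial_likelihood(0.6, n, h)

# Relative likelihood: how much more strongly the data supports p = 0.6 than p = 0.5.
relative_likelihood = likelihood_06 / likelihood_05

print(f"L(p=0.5 | data) = {likelihood_05:.4f}")
print(f"L(p=0.6 | data) = {likelihood_06:.4f}")
print(f"Relative likelihood (0.6 vs 0.5) = {relative_likelihood:.2f}")
```

Neither likelihood value says much on its own; the ratio is what shows that this particular sample favors p = 0.6 over p = 0.5.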

In [3]:
# In practice, there are usually multiple data points for the same population parameter. In such cases, we need to use 
# all the data points to calculate the likelihood. Assuming the data points are independent, the likelihood is the 
# product of the likelihoods of the individual data points, i.e., the product of the probabilities of observing each 
# data point given the population parameter.

Formally, likelihood is defined as:

$$L(\theta_{0};y) = Prob(Y=y;\theta=\theta_{0}) = f_{Y}(y;\theta_{0})$$

If it is a discrete probability distribution, then we use the probability mass function. <br>
If it is a continuous probability distribution, then we use the probability density function. <br>

Generalizing this, likelihood function is defined as:

$$L(\theta) = L(\theta;y) = f_{Y}(y;\theta)$$

When there are multiple independent data points, the likelihood function is calculated as follows:

$$L(\theta;y) = \prod_{i=1}^{n} f_{Y}(y_{i};\theta)$$
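
As a rough illustration of the product formula, the sketch below treats each coin flip as one data point and multiplies the per-flip probabilities. The sample `flips` and the function names are hypothetical, and the flips are assumed to be independent.

```python
import math


def bernoulli_pmf(y: int, p: float) -> float:
    """P(Y = y) for a single flip: y = 1 is heads, y = 0 is tails."""
    return p if y == 1 else 1.0 - p


def likelihood(p: float, data: list[int]) -> float:
    """Product of the per-observation probabilities (assumes independent flips)."""
    return math.prod(bernoulli_pmf(y, p) for y in data)


# Hypothetical sample of independent flips (1 = heads, 0 = tails): 7 heads, 3 tails.
flips = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

print(f"L(p=0.5) = {likelihood(0.5, flips):.6f}")
print(f"L(p=0.7) = {likelihood(0.7, flips):.6f}")
```

Here $f_{Y}(y_{i};\theta)$ is the Bernoulli probability mass function and $\theta$ is the heads probability `p`.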

In [4]:
# Now, let's extend the concept of likelihood to Maximum Likelihood Estimation (MLE).
# MLE provides the parameter values under which the observed sample is more probable than under any other parameter 
# values. Basically, it says that when the parameters are set to the MLE values, the model assigns the highest 
# possible probability to the data we actually observed. 
#
# MLE can refer to any of the following:
# 1) Maximum Likelihood Estimation
# 2) Maximum Likelihood Estimate
# 3) Maximum Likelihood Estimator 
#
# The MLE is calculated by maximizing the likelihood function with respect to the parameter. In other words, we find
# the parameter value that maximizes the likelihood function. For simple models, this is often done by taking the 
# derivative of the likelihood (or, more conveniently, of its log) with respect to the parameter, setting it to zero 
# and solving for the parameter.

In general, $\hat{\theta}$ is the maximum likelihood estimate if
$$L(\hat{\theta}) \geq L(\theta_{0})$$
for all values of $\theta_{0}$ in the parameter space.
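
To see this definition at work, here is a minimal sketch that finds the MLE of a coin's heads probability by brute force over a grid of candidate values. The data (7 heads in 10 flips), the grid spacing, and the names `likelihood`, `candidates` and `p_hat` are assumptions made for illustration.

```python
def likelihood(p: float, n: int, h: int) -> float:
    """Likelihood of p for h heads in n flips, up to a factor that does not depend on p."""
    return p**h * (1 - p) ** (n - h)


n, h = 10, 7  # hypothetical data: 7 heads in 10 flips

# Candidate values 0.01, 0.02, ..., 0.99.
candidates = [i / 100 for i in range(1, 100)]

# The maximum likelihood estimate is the candidate with the largest likelihood.
p_hat = max(candidates, key=lambda p: likelihood(p, n, h))

print(f"Grid-search MLE:   {p_hat:.2f}")
print(f"Closed form h / n: {h / n:.2f}")
```

The grid search agrees with the calculus route: setting the derivative of $h\ln(p) + (n-h)\ln(1-p)$ to zero gives $\hat{p} = h/n$.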

In [5]:
# Now that we have understood the concept of likelihood and MLE, let's see how we can use it in the context of the name
# generator model we built in the previous notebook (building_makemore_step_by_step/step_2_rule_based_name_generator.ipynb).
#
# First, let's try to understand what likelihood even means in the context of the name generator model. Please note
# that we are dealing with individual character predictions in the name generator model. So, likelihood also relates 
# to individual character predictions.
#
# Let's say we have the name "Virat" in the training data. That means 
# 'V' occurred at the first position with 100% certainty.
# 'i' occurred at the second position with 100% certainty.
# 'r' occurred at the third position with 100% certainty.
# 'a' occurred at the fourth position with 100% certainty.
# 't' occurred at the fifth (last) position with 100% certainty.
#
# However, the rule based name generation model assigns some probability to each character occurring at a specific 
# position depending on the previous character. Ideally, all of these probabilities should be 1.0. But, in practice, 
# they are not. So, the question we need to ask is "How well do the model parameters explain the name 'Virat'?" This 
# is where likelihood comes into the picture. Likelihood is the probability of observing the character-pairs in the 
# name 'Virat' given the model parameters. The higher the likelihood, the better the model parameters explain the 
# name.
#
# --------------------------------------------------------------------------------------------
# NOTE: The model parameters are just the pre-computed probabilities in the rule based name generator model.
# --------------------------------------------------------------------------------------------
# 
# character-pair -- Some specific consecutive character pair from the name.
# model_parameters -- The pre-computed probabilities in the model.
#
# Likelihood(model_parameters|character-pair) = Probability of observing the character-pair given the model parameters.
#
# Now, we have multiple character-pairs in the name 'Virat' and there are 60k such names in the dataset. So, we need to
# use all the character-pairs to calculate the likelihood.
#
# data -- All the character-pairs in the entire dataset.
#
# Likelihood(model_parameters|data) = Probability of observing the data given the model parameters.
#
# The higher the likelihood, the better the model parameters explain the data.
#
# --------------------------------------------------------------------------------------------
#
# Now, let's think about how high the likelihood can be for our model.
# Likelihood is the product of probabilities where each probability is between 0 and 1. So, the maximum value 
# likelihood can take is 1.0. The closer the likelihood is to 1.0, the better the model parameters are i.e., the 
# better the model is at predicting the character-pairs in the training dataset.
#
# Thinking about this a bit, we can see that the likelihood is a good measure of the quality of the model. If the
# likelihood is high, then the model is good at predicting the character-pairs in the dataset. If the likelihood is 
# low, then the model is not good at predicting the character-pairs in the dataset.
#
# --------------------------------------------------------------------------------------------
#
# Since we have thousands of data points, the likelihood (a product of thousands of numbers below 1.0) can be 
# extremely small. To avoid floating-point underflow with such small values, we generally take the log of the 
# likelihood. This is called log-likelihood. Because the log turns the product into a sum, the log-likelihood is also 
# easier to compute and to interpret than the raw likelihood.
#
# Log-likelihood(model_parameters|data) = log(Probability of observing the data given the model parameters)
#
# --------------------------------------------------------------------------------------------
#
# In general, loss functions are used to train the model. By convention, the lower the loss, the better the model. 
# But, likelihood has the opposite behavior: the higher the likelihood, the better the model. So, we generally 
# convert the likelihood to a loss by taking the negative of the log-likelihood. This is called negative 
# log-likelihood.
#
# Negative Log-likelihood(model_parameters|data) = -log(Probability of observing the data given the model parameters)
#
# ---------------------------------------------------------------------------------------------

As explained above, Likelihood is given by

$$L(\theta;y) = \prod_{i=1}^{n} f_{Y}(y_{i};\theta)$$

Let's calculate log-likelihood from likelihood.

$$\ln(L(\theta;y)) = \ln(\prod_{i=1}^{n} f_{Y}(y_{i};\theta))$$
$$\implies \ln(L(\theta;y)) = \ln(f_{Y}(y_{1};\theta) * f_{Y}(y_{2};\theta) * ... * f_{Y}(y_{n};\theta))$$

In general, $$\ln(a * b * c) = \ln(a) + \ln(b) + \ln(c)$$

$$\implies \ln(L(\theta;y)) = \ln(f_{Y}(y_{1};\theta)) + \ln(f_{Y}(y_{2};\theta)) + ... + \ln(f_{Y}(y_{n};\theta))$$
$$\implies LogLikelihood = \sum_{i=1}^{n} \ln(f_{Y}(y_i;\theta))$$
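
The practical benefit of summing logs instead of multiplying probabilities can be seen in a small sketch. The numbers are made up for illustration: 1,000 character-pairs, each assigned a probability of 0.05 by some hypothetical model.

```python
import math

# Hypothetical per-pair probabilities: 1,000 character-pairs, each with probability 0.05.
pair_probs = [0.05] * 1000

# Direct product: the true value is far below the smallest positive float, so it underflows.
likelihood = math.prod(pair_probs)

# Sum of logs: mathematically the log of the same product, but it stays finite.
log_likelihood = sum(math.log(p) for p in pair_probs)

print(f"Likelihood:              {likelihood}")
print(f"Log-likelihood:          {log_likelihood:.2f}")
print(f"Negative log-likelihood: {-log_likelihood:.2f}")
```

The product collapses to `0.0`, which carries no information about the model, while the sum of logs remains a usable number that can be negated to serve as a loss.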

In [6]:
import string
import torch

In [7]:
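# Let me create a random probabilities tensor similar to the one we have in the rule based name generator model.
# We will use this tensor to calculate the log-likelihood.
# NOTE: torch.rand draws every entry independently from [0, 1), so the rows do not sum to 1.0 and are not true
# probability distributions. That is fine for illustrating the computation, but not for judging a real model.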
probs = torch.rand(size=(27, 27), dtype=torch.float32)
print(probs.shape)
print(probs)

torch.Size([27, 27])
tensor([[0.0606, 0.3155, 0.3703, 0.0925, 0.3342, 0.9187, 0.0161, 0.2763, 0.4431,
         0.3939, 0.5886, 0.9199, 0.4528, 0.8423, 0.6368, 0.4088, 0.8076, 0.0119,
         0.1013, 0.1469, 0.8517, 0.3040, 0.2748, 0.5576, 0.6734, 0.6187, 0.5335],
        [0.8193, 0.1403, 0.8351, 0.8965, 0.8788, 0.7559, 0.1769, 0.4117, 0.4928,
         0.5601, 0.9379, 0.1382, 0.7537, 0.1907, 0.4578, 0.9954, 0.6802, 0.0898,
         0.8701, 0.6965, 0.2352, 0.5502, 0.3446, 0.2392, 0.9965, 0.0893, 0.5960],
        [0.5915, 0.9247, 0.7120, 0.5068, 0.9814, 0.7232, 0.9259, 0.1437, 0.1048,
         0.5992, 0.3271, 0.0415, 0.5631, 0.9979, 0.9045, 0.1884, 0.5286, 0.7172,
         0.4548, 0.8255, 0.8548, 0.3602, 0.6327, 0.7611, 0.5000, 0.2107, 0.6601],
        [0.5832, 0.6810, 0.9763, 0.7383, 0.9999, 0.5205, 0.9395, 0.6658, 0.4594,
         0.6697, 0.7130, 0.3808, 0.5926, 0.4847, 0.3609, 0.0166, 0.3119, 0.4930,
         0.5470, 0.8848, 0.4311, 0.4799, 0.6165, 0.9476, 0.5108, 0.3714, 0.7660],
   

In [8]:
char_to_int = {char: idx + 1 for idx, char in enumerate(string.ascii_lowercase)}
char_to_int['.'] = 0
print(char_to_int)

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26, '.': 0}


In [9]:
# Consider the following two names in the dataset.
# 1) 'virat'
# 2) 'prabhas'
#
# Let's calculate the log-likelihood for these two names using the model parameters (random probs created above) we have.
# The log-likelihood is the sum of the log of the probabilities of the character-pairs in the name as explained above.
# Running log-likelihood; it is negated after the loop to become the log loss.
log_loss = 0.0
num_char_pairs = 0
for name in ['virat', 'prabhas']:
    num_char_pairs += len(name) + 1
    name = '.' + name + '.'
    for first_char, second_char in zip(name, name[1:]):
        first_char_idx = char_to_int[first_char]
        second_char_idx = char_to_int[second_char]
        # Add the log-probability of this character-pair to the running log-likelihood.
        log_loss = log_loss + torch.log(probs[first_char_idx, second_char_idx])

# Take the negative of the log-likelihood to get the negative log-likelihood.
log_loss = -log_loss
print(f"Negative Log-likelihood or log_loss: {log_loss}")
# Dividing by the number of character-pairs gives the average loss per pair. This makes the value comparable across
# datasets of different sizes.
normalized_log_loss = log_loss / num_char_pairs
print(f"Normalized Negative Log-likelihood or Normalized log_loss: {normalized_log_loss}")

Negative Log-likelihood or log_loss: 11.622832298278809
Normalized Negative Log-likelihood or Normalized log_loss: 0.8302022814750671
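
To put the normalized value in context, it helps to compare against tables whose rows are valid probability distributions. The sketch below is one possible way to do that, using plain Python lists instead of the `probs` tensor so that it runs on its own. The helper names `normalize_rows` and `average_nll`, the uniform baseline and the seeded random table are all assumptions made for illustration.

```python
import math
import random
import string

# Same character-to-index mapping as above: '.' is 0, 'a'..'z' are 1..26.
char_to_int = {char: idx + 1 for idx, char in enumerate(string.ascii_lowercase)}
char_to_int['.'] = 0
vocab_size = len(char_to_int)  # 27


def normalize_rows(table):
    """Scale each row so that its entries sum to 1.0, making it a valid distribution."""
    normalized = []
    for row in table:
        row_sum = sum(row)
        normalized.append([value / row_sum for value in row])
    return normalized


def average_nll(names, probs):
    """Average negative log-likelihood per character-pair."""
    total, num_pairs = 0.0, 0
    for name in names:
        padded = '.' + name + '.'
        for first_char, second_char in zip(padded, padded[1:]):
            p = probs[char_to_int[first_char]][char_to_int[second_char]]
            total -= math.log(p)
            num_pairs += 1
    return total / num_pairs


names = ['virat', 'prabhas']

# Baseline: every next character is equally likely.
uniform = [[1.0 / vocab_size] * vocab_size for _ in range(vocab_size)]

# Hypothetical random model: random positive scores, normalized row by row.
random.seed(0)  # fixed seed so the run is repeatable
scores = [[random.random() + 1e-6 for _ in range(vocab_size)] for _ in range(vocab_size)]
random_model = normalize_rows(scores)

print(f"Uniform model:           {average_nll(names, uniform):.4f}")
print(f"ln(27):                  {math.log(vocab_size):.4f}")
print(f"Normalized random model: {average_nll(names, random_model):.4f}")
```

With valid distributions, a model that knows nothing about names scores ln(27), roughly 3.2958, per character-pair. The value of about 0.83 printed above is lower only because the rows of the raw `torch.rand` tensor do not sum to 1.0, so it should not be read as a sign of a good model. For a tensor, the same row normalization can be written as `probs / probs.sum(dim=1, keepdim=True)`.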
