<h1 align="center">Statistical Inference 2, Demo 1</h1>

<br>

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.optimize import minimize
from scipy.stats import norm
from scipy.stats import laplace

<h3 align="left">Task 3</h3>

The concentrations of a certain compound are measured with a device that involves independent normally distributed measurement errors. The variance of the measurement error $\, \sigma^2 \,$ is unknown. If the device's reading (concentration + measurement error) exceeds a known limit 𝑐 = 80, the device only reports that the result is greater than 80. There are 10 samples available, which are known to have the same unknown concentration $\, \mu \,$. The device reports the result $\, y_i \,$ for six samples and indicates that the reading is greater than 80 for the remaining four samples. The available measurement results are 71.2, 67.5, 77.6, 67.3, 77.7, 67.1. Determine numerically the maximum likelihood estimates for the parameters $\, \mu \,$ and $\, \sigma^2. \,$

**Solution:**

In [109]:
data = np.array([71.2, 67.5, 77.6, 67.3, 77.7, 67.1])
n = 10
m = 6

In [110]:
def log_likelihood(params: list[float], c = 80) -> float:
    """
    Calculate the negative logarithmic likelihood for a set of observations 
    following a normal distribution, considering both known values and censored data.
    
     This function assumes that the data consists of two parts:
        1. Known measurements: These are directly used to calculate the 
           log-likelihood using the normal distribution's probability density function.
        2. Censored measurements: These are observations that exceed a certain threshold.
           For these, the log-likelihood is computed using the complementary cumulative
           distribution function (survival function) of the normal distribution.
       
    Args:
        params: A list containing two elements:
            1. mu: The mean of the normal distribution.
            2. sigma: The standard deviation of the normal distribution (>=0).
        
    Returns:
        The negative of the total logarithmic likelihood for both known 
        and censored observations.
        
    Notes:
        - Note that mathematically speaking, maximizing the function f(x)
        is the same as minimizing the function -f(x).
    """
    mu, sigma = params
    
    # Calculate the logarithm of the probability density function 
    # of a normal distribution for each value in the given array and sum them up
    # (i.e.compute the log-likelihood of observing the data
    # under the assumption that these data points are drawn 
    # from a normal distribution with a mean of mu and a standard deviation of sigma).
    log_likelihood_known = np.sum(norm.logpdf(x=data, loc=mu, scale=sigma))
    
    # Censored log-likelihood can be calculated using the survival function.
    log_likelihood_censored = (n-m) * norm.logsf(x=c, loc=mu, scale=sigma)
    
    # (negative) total log-likelihood
    return -(log_likelihood_known + log_likelihood_censored)

In [111]:
init_vals = [np.mean(data), np.std(data)]

In [112]:
# Maximize the log-likelihood
result = minimize(log_likelihood, init_vals, method='Nelder-Mead')

In [113]:
mu_hat, sigma_hat = result.x[0], result.x[1]**2

In [114]:
print(f'MLE for mu: {mu_hat}')
print(f'MLE for sigma^2: {sigma_hat}')

MLE for mu: 77.15357507694819
MLE for sigma^2: 70.92748610686299


<br>

<h3 align="left">Task 4</h3>

Concentrations of a certain compound are measured with a device that involves independent Laplace-distributed measurement errors with a mean of 0. The scale parameter b of the Laplace distribution is not known. If the device's reading (concentration + measurement error) exceeds a known limit c=80, the device only reports that the result is greater than 80. There are 10 samples available, which are known to have the same unknown concentration $\, \mu. \,$ The device reports the result $\, y_i \,$ for seven samples and indicates that the reading is greater than 80 for the remaining three samples. The available measurement results are 70.3, 73.1, 75.1, 76.8, 77.8, 78.2, 78.6. Determine the maximum likelihood estimates for the parameters  $\, \mu \,$ and b.

**Solution:**

In [101]:
data = np.array([70.3, 73.1, 75.1, 76.8, 77.8, 78.2, 78.6])
a = 80    # Threshold for censoring
n = 10    # Total number of samples
m = 7     # Number of observed samples

In [102]:
def ll(params: list[int]) -> float:
    """
    Calculate the negative logarithmic likelihood for a set of observations 
    following a normal distribution, considering both known values and censored data.
    
    This function assumes that the data consists of two parts:
        1. Known measurements: These are directly used to calculate the 
           log-likelihood using the laplace distribution's probability density function.
        2. Censored measurements: These are observations that exceed a certain threshold.
           For these, the log-likelihood is computed using the complementary cumulative
           distribution function (survival function) of the laplace distribution.
           
    Args:
        params: A list containing two elements:
            1. mu: The location parameter of the laplace distribution.
            2. b: The scale parameter of the laplace distribution  (>=0).
    
    Returns:
        The negative of the total logarithmic likelihood for both known 
        and censored observations.
        
    Notes:
        - Note that mathematically speaking, maximizing the function f(x)
            is the same as minimizing the function -f(x).
        - The function name 'll' stands for logarithmic likelihood.
    """
    mu, b = params
    ll_uncensored = np.sum(laplace.logpdf(x=data, loc=mu, scale=b))
    ll_censored = (n-m) * laplace.logsf(x=a, loc=mu, scale=b)
    return -(ll_uncensored + ll_censored)

In [103]:
init_vals = [np.mean(data), np.std(data)]
result = minimize(ll, init_vals, method='Nelder-Mead')

In [94]:
mu_hat, b_hat = result.x
print(f'MLE for mu: {mu_hat}')
print(f'MLE for b: {b_hat}')

MLE for mu: 78.15047140900046
MLE for b: 3.385714301931162


<br>

<h3 align="left">Task 5</h3>

The concentrations of a compound are measured with a measuring device, which involves both bias and independent normally distributed measurement errors. It is assumed that the relationship between the concentration and the measurement result is described by a linear regression model

$$y_i = a + bx_i + \epsilon,$$

where a and b are unknown parameters and $\, \epsilon \sim N(0, \sigma^2). \,$
The variance of the measurement error $\, \sigma^2 \,$ is not known. If the reading of the measuring device exceeds a known limit 𝑐 = 80, the device only indicates that the result is greater than 80. There are a total of 25 samples available, such that from each of five different concentrations $\, x_1,...,x_5 \,$ there are five samples. Compute the maximum likelihood estimates for the parameters a, b, and $\, \sigma^2 \,$ numerically.

**Solution:**

In [2]:
data = pd.read_csv("/Users/herrakaava/Documents/SI2/data-sets/mittalaite.csv")

In [3]:
data.rename(columns={'ylirajan': 'censored'}, inplace=True)

In [4]:
data.head()

Unnamed: 0,x,y,censored
0,60,64.6,0
1,65,50.8,0
2,70,66.2,0
3,75,71.7,0
4,80,80.0,1


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   x         25 non-null     int64  
 1   y         25 non-null     float64
 2   censored  25 non-null     int64  
dtypes: float64(1), int64(2)
memory usage: 732.0 bytes


In [7]:
data['x'].value_counts()

x
60    5
65    5
70    5
75    5
80    5
Name: count, dtype: int64

In [62]:
x_uncens = data[data['censored'] == 0]['x'].values
y_uncens = data[data['censored'] == 0]['y'].values

n = len(data)    # Total
m = len(data[data['censored'] == 0])    # Observed

In [63]:
def ll_reg(params: list[int], c=80) -> float:  
    a, b, sigma = params
    mu = a + b*x
    ll_uncensored = np.sum(norm.logpdf(y_uncens, mu, sigma))
    ll_censored = (n-m) * norm.logsf(c, a + b*c, sigma)
    return -(ll_uncensored + ll_censored)

In [64]:
init_vals = [1, 1, 1]
result = minimize(ll_reg, init_vals, method='Nelder-Mead')

In [65]:
result.x

array([-11.37681763,   1.22875447,   6.4576462 ])