Likelihood

In [1]:
import numpy as np
import pandas as pd 


import matplotlib.pyplot as plt
import scipy.optimize as optimize
import scipy
import seaborn as sns
import json
from zipfile import ZipFile

1. Weibull distribution
$$
f(x) = 
\begin{cases}
\frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}\text{e}^{-\left(\frac{x}{\lambda}\right)^k} & \text{ if } x \geq 0\\
0 & \text{ if } x < 0
\end{cases}
$$

$$ 
k \in (0, \infty) 
$$ 
k is the shape parameter.
$$ 
\lambda \in (0, \infty) 
$$ 
λ is the scale parameter.
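As a quick sanity check of the density, the sketch below evaluates the standard two-parameter Weibull pdf (supported on x ≥ 0) and compares it with `scipy.stats.weibull_min`. The helper name `weibull_pdf`, the grid, and the parameter values `k_demo` and `lam_demo` are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from scipy import stats

def weibull_pdf(x, k, lam):
    """Weibull density with shape k and scale lam; zero for x < 0."""
    x = np.asarray(x, dtype=float)
    pdf = np.zeros_like(x)
    pos = x >= 0
    z = x[pos] / lam
    pdf[pos] = (k / lam) * z ** (k - 1) * np.exp(-z ** k)
    return pdf

grid = np.linspace(0.1, 15.0, 50)   # illustrative evaluation points
k_demo, lam_demo = 2.5, 7.0         # illustrative parameter values
reference = stats.weibull_min.pdf(grid, c=k_demo, scale=lam_demo)
print(np.allclose(weibull_pdf(grid, k_demo, lam_demo), reference))
```

In `scipy.stats.weibull_min` the shape parameter is called `c` and λ is passed as `scale`.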

#### Log-likelihood function
For an uncensored sample $x_1, \dots, x_n$:
$$
l(k, \lambda) = n\ln{k} - n k\ln{\lambda} + (k-1)\sum_{i=1}^{n} \ln{x_{i}} - \sum_{i=1}^{n} \left(\frac{x_{i}}{\lambda}\right)^k
$$

#### Partial derivative with respect to the shape parameter
$$
\frac{\partial l}{\partial k} = \frac{n}{k} - n\ln{\lambda} + \sum_{i=1}^{n} \ln{x_{i}} - \sum_{i=1}^{n} (\ln{x_{i}} - \ln{\lambda}) \left(\frac{x_{i}}{\lambda}\right)^k
$$
#### Partial derivative with respect to the scale parameter
$$
\frac{\partial l}{\partial \lambda} = - \frac{n k}{\lambda} + \frac{k}{\lambda^{k+1}} \sum_{i=1}^{n} x_{i}^{k}
$$
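The two partial derivatives can be checked numerically. The sketch below is a minimal check under stated assumptions: it uses a synthetic, uncensored sample (`x_demo`, not the assignment data), the uncensored log-likelihood with the $(k-1)\sum \ln x_i$ term, and an arbitrary evaluation point, and it compares the analytic score with a finite-difference gradient from `scipy.optimize.approx_fprime`.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)
x_demo = 7.0 * rng.weibull(3.0, size=200)   # synthetic, uncensored sample

def loglik(params, x):
    """Uncensored Weibull log-likelihood l(k, lambda)."""
    k, lam = params
    n = x.size
    return (n * np.log(k) - n * k * np.log(lam)
            + (k - 1) * np.sum(np.log(x)) - np.sum((x / lam) ** k))

def score(params, x):
    """Analytic partial derivatives of l with respect to k and lambda."""
    k, lam = params
    n = x.size
    z = (x / lam) ** k
    d_k = n / k - n * np.log(lam) + np.sum(np.log(x)) - np.sum(np.log(x / lam) * z)
    d_lam = -n * k / lam + k / lam ** (k + 1) * np.sum(x ** k)
    return np.array([d_k, d_lam])

point = np.array([2.0, 5.0])                 # arbitrary evaluation point
numeric = optimize.approx_fprime(point, loglik, 1e-6, x_demo)
print(np.allclose(score(point, x_demo), numeric, rtol=1e-4, atol=1e-3))
```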

In [2]:
# load data
fileName = "Data_2024.xlsx"
df = pd.read_excel(fileName, sheet_name="Data_věrohodnost")  # reading .xlsx requires openpyxl (pip install openpyxl)
df = df.iloc[:, :-2]  # drop the last two columns, which carry no information
# enforce column types; 'doba práce v oboru [roky]' = time worked in the field [years]
df['censored'] = df['censored'].astype(int)
df['doba práce v oboru [roky]'] = df['doba práce v oboru [roky]'].astype(float)
print(df)

     censored  doba práce v oboru [roky]
0           0                      6.528
1           0                      6.013
2           1                      6.055
3           0                      7.243
4           1                      5.629
..        ...                        ...
316         1                      5.562
317         0                      5.491
318         0                      6.761
319         0                      7.062
320         0                      5.784

[321 rows x 2 columns]


In [78]:
x = df['doba práce v oboru [roky]'].values  # time worked in the field [years]
censored = df['censored'].values  # 1 = right-censored observation, 0 = fully observed

# negative log-likelihood (despite its name, the function returns -l so that it can be minimized)
def log_likelihood(params):
    k, lamb = params
    # log density (log of the pdf), used for uncensored observations
    log_pdf = np.log(k) - np.log(lamb) * k + (k-1) * np.log(x) - ((x/lamb) ** k)
    sf = np.exp(-(x / lamb)**k)  # survival function S(x) = 1 - CDF, used for censored observations
    # per-observation log-likelihood
    log_l = (1 - censored) * log_pdf + censored * np.log(sf)
    return -np.sum(log_l)  # negated for minimization

# initial guesses
initial_params = [1.0, 1.0]

# optimization
result = optimize.minimize(log_likelihood, initial_params, method='L-BFGS-B')

# results
if result.success:
    k_mle, lam_mle = result.x
    print(f"Estimated parameters: k = {k_mle:.4f}, λ = {lam_mle:.4f}")
else:
    print("Optimization failed:", result.message)

Estimated parameters: k = 6.1728, λ = 7.4295


$$\hat k = 6.1728 $$

$$\hat \lambda = 7.4295$$
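The objective minimized in the code above is the right-censored version of the log-likelihood. As a sketch, assuming $c_i = 1$ marks a right-censored observation (which is how the `censored` column is used in the code), $S$ denotes the survival function, and $d = \sum_{i=1}^{n} (1 - c_i)$ is the number of uncensored observations:

$$
l(k, \lambda) = \sum_{i=1}^{n} \left[ (1 - c_i) \ln f(x_i) + c_i \ln S(x_i) \right], \qquad S(x) = \text{e}^{-\left(\frac{x}{\lambda}\right)^k}
$$

$$
l(k, \lambda) = d \ln{k} - d\, k \ln{\lambda} + (k-1) \sum_{i=1}^{n} (1 - c_i) \ln{x_i} - \sum_{i=1}^{n} \left(\frac{x_i}{\lambda}\right)^k
$$

With no censoring ($c_i = 0$ for all $i$, so $d = n$) this reduces to the uncensored log-likelihood.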

3) Likelihood ratio test

In [None]:
# fix k to 1 (exponential distribution) and estimate the MLE of λ again
x = df['doba práce v oboru [roky]'].values  # time worked in the field [years]
censored = df['censored'].values  # 1 = right-censored observation, 0 = fully observed

# negative log-likelihood with the shape parameter fixed (redefines the function from the previous cell)
def log_likelihood(k_fix, lamb):
    k = k_fix
    # log density (log of the pdf), used for uncensored observations
    log_pdf = np.log(k) - np.log(lamb) * k + (k-1) * np.log(x) - ((x/lamb) ** k)
    sf = np.exp(-(x / lamb)**k)  # survival function S(x) = 1 - CDF, used for censored observations
    # per-observation log-likelihood
    log_l = (1 - censored) * log_pdf + censored * np.log(sf)
    return -np.sum(log_l)  # negated for minimization

# initial guess
initial_lamb = [1.0]
k_fix = 1.0

# optimization (λ is bounded away from zero)
result_exp = optimize.minimize(lambda lamb: log_likelihood(k_fix, lamb), initial_lamb, method='L-BFGS-B', bounds=[(1e-5, None)])

# results
if result_exp.success:
    lam_exp_mle = result_exp.x[0]
    print(f"Estimated parameter: λ = {lam_exp_mle:.4f}")
else:
    print("Optimization failed:", result_exp.message)

print(result_exp.fun)  # minimized negative log-likelihood

Estimated parameter: λ = 9.0533
746.3288140610031
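For k = 1 the right-censored likelihood has a closed-form maximizer: the total observed time divided by the number of uncensored observations. The sketch below illustrates this on synthetic data (all `*_demo` names and the simulation settings are assumptions for illustration, not the assignment data) and compares the closed form with a numerical minimizer.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(1)
t_event = rng.exponential(scale=9.0, size=300)    # synthetic event times
t_cens = rng.exponential(scale=20.0, size=300)    # synthetic censoring times
x_demo = np.minimum(t_event, t_cens)              # observed times
cens_demo = (t_cens < t_event).astype(int)        # 1 = right-censored

def neg_loglik_exp(lam):
    """Negative censored log-likelihood of the exponential model (k = 1)."""
    d = np.sum(1 - cens_demo)                     # number of uncensored observations
    return d * np.log(lam) + np.sum(x_demo) / lam

# closed form: sum of all observed times / number of uncensored observations
lam_closed = np.sum(x_demo) / np.sum(1 - cens_demo)
res = optimize.minimize_scalar(neg_loglik_exp, bounds=(1e-5, 100.0), method="bounded")
print(lam_closed, res.x)
```

The two printed values should agree up to the optimizer's tolerance, which gives an independent check of a numerically obtained λ.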


In [87]:
# likelihood ratio statistic 2 * (l_Weibull - l_exponential); .fun holds the negative log-likelihoods
lr = 2 * (result_exp.fun - result.fun)
print(lr)

592.3898153434229


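Test
$$
H_0: k = 1 \quad \text{(the exponential distribution is sufficient)}
$$
$$
H_A: k \neq 1 \quad \text{(the exponential distribution is not sufficient)}
$$
Degrees of freedom: 2 - 1 = 1.
$$
W'_{0.05} = \langle 0, \chi^2_{0.95}(1) \rangle = \langle 0, 3.841 \rangle
$$
The test statistic 592.39 lies outside the interval $W'_{0.05}$ in which $H_0$ is not rejected, that is, it falls in the critical region, so I reject $H_0$ at the 5% significance level.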
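The critical value and a p-value can be reproduced with `scipy.stats.chi2`. This is a minimal sketch that hard-codes the rounded test statistic printed above; the variable names are illustrative.

```python
from scipy import stats

alpha = 0.05
dof = 1                 # 2 free parameters (Weibull) - 1 free parameter (exponential)
lr_stat = 592.39        # rounded likelihood ratio statistic from the notebook output

critical_value = stats.chi2.ppf(1 - alpha, dof)   # 0.95 quantile of chi-square(1)
p_value = stats.chi2.sf(lr_stat, dof)             # P(chi-square(1) > lr_stat)
print(f"critical value = {critical_value:.3f}")
print(f"reject H0: {lr_stat > critical_value}")
```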

4. Point estimates of the mean time employed in the field and of the 10th percentile of the time employed in the field

In [102]:
from scipy import special
# the mean of a Weibull random variable, λ·Γ(1 + 1/k), is the mean time in the field
Ex = lam_mle * special.gamma(1 + 1/k_mle)
print(f"The mean time employed in the field is {Ex:.4f} years.")

# 10th percentile of the time employed in the field
p = 0.1
percentile = lam_mle * (-np.log(1-p))**(1/k_mle)
print(f"The 10th percentile of the time employed in the field is {percentile:.4f} years.")

The mean time employed in the field is 6.9032 years.
The 10th percentile of the time employed in the field is 5.1598 years.


Answer: 
The mean time employed in the field is 6.9032 years.
The 10th percentile of the time employed in the field is 5.1598 years.

On average, people work in the field for about 7 years, and ten percent of people leave the field within roughly five years.
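Both point estimates can be cross-checked against `scipy.stats.weibull_min`. The sketch below hard-codes the rounded estimates printed earlier (`k_hat`, `lam_hat` are illustrative names), so the results agree with the notebook output only up to rounding.

```python
import numpy as np
from scipy import special, stats

k_hat, lam_hat = 6.1728, 7.4295   # rounded MLEs printed earlier in the notebook

dist = stats.weibull_min(c=k_hat, scale=lam_hat)

# closed-form mean and 10% quantile of the Weibull distribution
mean_closed = lam_hat * special.gamma(1 + 1 / k_hat)
q10_closed = lam_hat * (-np.log(1 - 0.1)) ** (1 / k_hat)

print(np.isclose(dist.mean(), mean_closed), np.isclose(dist.ppf(0.1), q10_closed))
```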

Regression