# Example code for fitting Normal and Log-Normal Distributions using MLE and MOM

First, import the following libraries:
- pandas: to read in data  
- numpy: for basic mathematical functions over arrays  
- scipy.stats: for distribution-fitting functions
- matplotlib.pyplot: for plotting distribution fits
- google.colab.drive: for accessing data on Google Drive

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
from google.colab import drive

Load data of annual maxima at Azibe Soltane on the Sebou River in Morocco

In [None]:
# allow access to google drive
drive.mount('/content/drive')

maxQ = pd.read_csv("drive/MyDrive/Colab Notebooks/CE6280/Data/Problem6.17.csv")
maxQ.head()

Visualize the data in a histogram

In [None]:
plt.hist(maxQ["Flow"], density=True)
plt.xlabel("Flow (m^3/s)")
plt.ylabel("Density")
plt.title("PDF of Annual Maxima of Sebou River at Azibe Soltane")

Calculate the mean, variance/standard deviation, and skewness using default numpy functions

In [None]:
mu = np.mean(maxQ["Flow"])
sigmaSq = np.var(maxQ["Flow"])
sigma = np.std(maxQ["Flow"])
gamma = ss.skew(maxQ["Flow"])

# what are they?
print("mean: %0.2f" % mu)
print("variance: %0.2f" % sigmaSq)
print("std. dev.: %0.2f" % sigma)
print("skewness: %0.2f" % gamma)

With the exception of the mean, these estimates are actually biased. How do they compare with the unbiased estimates?

In [None]:
# using unbiased estimators (formulas given in lecture)
sigmaSq_unbiased = np.var(maxQ["Flow"], ddof=1)
sigma_unbiased = np.std(maxQ["Flow"], ddof=1)
gamma_unbiased = ss.skew(maxQ["Flow"], bias=False)

# compare the estimators
print("unbiased variance: %0.2f" % sigmaSq_unbiased)
print("unbiased std. dev.: %0.2f" % sigma_unbiased)
print("unbiased skewness: %0.2f" % gamma_unbiased)

# how different are they?
print("percent difference in biased vs. unbiased variance: %0.1f" % np.abs((sigmaSq - sigmaSq_unbiased)*100 / (0.5*(sigmaSq + sigmaSq_unbiased))))
print("percent difference in biased vs. unbiased std. dev.: %0.1f" % np.abs((sigma - sigma_unbiased)*100 / (0.5*(sigma + sigma_unbiased))))
print("percent difference in biased vs. unbiased skewness: %0.1f" % np.abs((gamma - gamma_unbiased)*100 / (0.5*(gamma + gamma_unbiased))))
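
The biased and unbiased variance estimators differ only by the factor n/(n-1), so the gap shrinks as the record gets longer. A minimal sketch of that relationship, using a synthetic sample (made up for illustration, not the Sebou River record):

```python
import numpy as np

# synthetic sample for illustration only (not the Sebou River record)
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=6.5, sigma=0.8, size=30)
n = len(sample)

var_biased = np.var(sample)            # divides the sum of squares by n
var_unbiased = np.var(sample, ddof=1)  # divides the sum of squares by n - 1

# the ratio of the two estimators should equal n / (n - 1)
ratio = var_unbiased / var_biased
print("ratio: %0.4f, n/(n-1): %0.4f" % (ratio, n / (n - 1)))
```

With only a few dozen years of data, this factor is a few percent for the variance; the skewness correction is a different formula and is not covered by this sketch.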


Fit a normal distribution to the data using MLE. This is the default approach of the fit method of scipy.stats distributions (here, ss.norm.fit), which returns the location and scale parameters of the distribution.

In [None]:
loc, scale = ss.norm.fit(maxQ["Flow"])

print("loc: %0.2f" % loc)
print("scale: %0.2f" % scale)

This prints the location and scale parameters, which in this case are $\mu$ and $\sigma$. Comparing with the output above, these are the same as the sample mean and the BIASED standard deviation. Let's rename them mu_fit and sigma_fit, the fitted values of $\mu$ and $\sigma$.

In [None]:
mu_fit = loc
sigma_fit = scale
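
The claim that the normal MLE equals the sample mean and the biased standard deviation can be checked numerically. The sketch below uses a synthetic normal sample (an assumption for illustration, not the flow data):

```python
import numpy as np
import scipy.stats as ss

# synthetic sample for illustration only
rng = np.random.default_rng(1)
sample = rng.normal(loc=1000.0, scale=250.0, size=50)

loc_hat, scale_hat = ss.norm.fit(sample)

# MLE location is the sample mean
matches_mean = np.isclose(loc_hat, np.mean(sample))
# MLE scale is the biased (ddof=0) standard deviation ...
matches_biased = np.isclose(scale_hat, np.std(sample))
# ... and not the ddof=1 version
matches_unbiased = np.isclose(scale_hat, np.std(sample, ddof=1))

print(matches_mean, matches_biased, matches_unbiased)
```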

How does the fit look? Let's compare the fitted PDF with the histogram of the data.  
ss.norm.pdf(x, loc, scale) calculates the value of a normal PDF, f(x), with parameters loc and scale at input values x

In [None]:
x = np.arange(0,4500,10)
f_x = ss.norm.pdf(x, mu_fit, sigma_fit)

plt.hist(maxQ["Flow"], density=True)
plt.plot(x,f_x)
plt.ylim([0,0.0015])
plt.xlim([0,4500])
plt.title('Normal MLE fit')
plt.xlabel('Flow (m^3/s)')
plt.ylabel('Probability Density')

Clearly this is not a good fit! We need a skewed distribution. Let's use a log-normal.  
ss.lognorm.fit returns the shape, location and scale parameters of a 3-parameter log-normal distribution. You can fit a 2-parameter log-normal distribution by fixing the lower bound (location) parameter at 0 with floc=0.  
The parameter $\mu$ of a LN distribution is the log of the scale parameter reported by scipy.stats. The parameter $\sigma$ is the shape parameter.

In [None]:
# fit a log-normal distribution to the data
shape, loc, scale = ss.lognorm.fit(maxQ['Flow'], floc=0)

# convert shape and scale to estimates of mu and sigma
mu_LN_MLE = np.log(scale)
sigma_LN_MLE = shape

print('mu: %0.2f' % mu_LN_MLE)
print('sigma: %0.2f' % sigma_LN_MLE)
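
To see why $\mu$ is the log of the scale and $\sigma$ is the shape, we can compare the scipy.stats density with the log-normal density written out directly. This is a sketch with hypothetical parameter values, not the fitted ones:

```python
import numpy as np
import scipy.stats as ss

mu, sigma = 6.5, 0.8  # hypothetical LN parameters, chosen for illustration
x = np.array([200.0, 500.0, 1000.0, 3000.0])

# scipy parameterization: shape = sigma, loc = 0, scale = exp(mu)
pdf_scipy = ss.lognorm.pdf(x, sigma, loc=0, scale=np.exp(mu))

# LN density written directly: normal density of ln(x), divided by x
pdf_direct = ss.norm.pdf(np.log(x), mu, sigma) / x

same = np.allclose(pdf_scipy, pdf_direct)
print(same)
```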

Compare these estimates with what we get from the formulas computed in class

In [None]:
# compute using equations from class for MLE
mu_LN_MLE_check = np.mean(np.log(maxQ["Flow"]))
sigma_LN_MLE_check = np.sqrt( np.mean( (np.log(maxQ["Flow"]) - mu_LN_MLE_check)**2 ) )

# compare estimates
print("Python mu: %0.2f" % mu_LN_MLE)
print("Class mu: %0.2f" % mu_LN_MLE_check)
print("Python sigma: %0.2f" % sigma_LN_MLE)
print("Class sigma: %0.2f" % sigma_LN_MLE_check)

How does this fit look?

In [None]:
x = np.arange(0,4500,10)
f_x = ss.lognorm.pdf(x, shape, loc, scale)

plt.hist(maxQ["Flow"], density=True)
plt.plot(x,f_x)
plt.ylim([0,0.0015])
plt.xlim([0,4500])
plt.title('2-parameter log-normal MLE fit')
plt.xlabel('Flow (m^3/s)')
plt.ylabel('Probability Density')

Much better! What is our estimate of the 100-year flood from this fit? The 100-year flood is exceeded on average once every 100 years. 1/100 = 0.01, so there is a 1\% chance of it being exceeded each year. That means it is the 0.99 quantile of the distribution.  
We can estimate quantiles of distributions in scipy with ppf, the "percent point function" (the inverse of the CDF).

In [None]:
q0_99_LN = ss.lognorm.ppf(0.99, shape, loc, scale)
print("100-year flood estimate: %0.0f m^3/s" % q0_99_LN)
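
The same reasoning works for any return period T: the T-year flood is the 1 - 1/T quantile. A minimal sketch, where return_level is a hypothetical helper name and the parameter values are made up for illustration:

```python
import numpy as np
import scipy.stats as ss

def return_level(T, mu, sigma):
    """T-year flood under a 2-parameter log-normal (hypothetical helper)."""
    p_nonexceed = 1.0 - 1.0 / T  # annual non-exceedance probability
    return ss.lognorm.ppf(p_nonexceed, sigma, loc=0, scale=np.exp(mu))

mu, sigma = 6.5, 0.8  # hypothetical LN parameters, not the fitted values

q100 = return_level(100, mu, sigma)

# equivalent form using the standard normal quantile
q100_check = np.exp(mu + sigma * ss.norm.ppf(0.99))

# ppf is the inverse of the CDF, so this recovers 0.99
p_back = ss.lognorm.cdf(q100, sigma, loc=0, scale=np.exp(mu))

print("100-year: %0.0f, check: %0.0f, cdf: %0.2f" % (q100, q100_check, p_back))
```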

What if we wanted to fit the log-normal distribution using MOM? The fit method in scipy.stats uses MLE by default, so we will write our own code for MOM using the formulas we derived in class.

In [None]:
# compute using equations from class for MOM
sigma_LN_MOM = np.sqrt( np.log( 1 + np.var(maxQ["Flow"], ddof=1) / (np.mean(maxQ["Flow"])**2) ) )
mu_LN_MOM = np.log(np.mean(maxQ["Flow"])) - 0.5*sigma_LN_MOM**2

print("MOM mu: %0.2f" % mu_LN_MOM)
print("MOM sigma: %0.2f" % sigma_LN_MOM)
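
The class derivation is not reproduced in this notebook, but the code above is consistent with the standard moments of a log-normal distribution. As a sketch, writing $\bar{x}$ for the sample mean and $s^2$ for the unbiased sample variance (notation assumed here), if $X \sim LN(\mu, \sigma)$ then

$$E[X] = \exp\left(\mu + \tfrac{1}{2}\sigma^2\right), \qquad Var[X] = \left(e^{\sigma^2} - 1\right) E[X]^2$$

Setting $E[X] = \bar{x}$ and $Var[X] = s^2$ gives $s^2 / \bar{x}^2 = e^{\sigma^2} - 1$, so

$$\hat{\sigma} = \sqrt{\ln\left(1 + \frac{s^2}{\bar{x}^2}\right)}, \qquad \hat{\mu} = \ln(\bar{x}) - \tfrac{1}{2}\hat{\sigma}^2$$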

We can write a function to fit LN2 using MOM or MLE depending on the input arguments.

In [None]:
def findMoments(data):
    xbar = np.mean(data)
    var = np.var(data, ddof=1)
    skew = ss.skew(data,bias=False)

    return xbar, var, skew

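def fitLN(data, method):
    """Fit a 2-parameter log-normal by 'MLE' or 'MOM' and return (mu, sigma)."""
    if method not in ('MLE', 'MOM'):
        raise ValueError("method must be 'MLE' or 'MOM'")

    xbar, var, skew = findMoments(data)

    if method == 'MLE':
        # fix the location at 0 to get a 2-parameter fit
        shape, loc, scale = ss.lognorm.fit(data, floc=0)
        mu = np.log(scale)
        sigma = shape
    elif method == 'MOM':
        # match the sample mean and unbiased sample variance
        sigma = np.sqrt(np.log(1+var/xbar**2))
        mu = np.log(xbar) - 0.5*sigma**2

    return mu, sigma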

Now run it and report the parameters

In [None]:
mu_LN_MOM, sigma_LN_MOM = fitLN(maxQ["Flow"], "MOM")
mu_LN_MLE, sigma_LN_MLE = fitLN(maxQ["Flow"], "MLE")

print("MOM mu: %0.2f" % mu_LN_MOM)
print("MOM sigma: %0.2f" % sigma_LN_MOM)

print("MLE mu: %0.2f" % mu_LN_MLE)
print("MLE sigma: %0.2f" % sigma_LN_MLE)

If we can't calculate closed-form equations for these parameters, we can use optimization to find the values of the parameters that make the sample moments equal to the theoretical moments. We'll show that next class.
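
As a preview, here is one possible sketch of that idea, not necessarily the approach used in class. It assumes scipy.optimize.least_squares as the solver and a synthetic sample, and it uses the log-normal only because the closed-form MOM answer is available to check against:

```python
import numpy as np
import scipy.stats as ss
from scipy.optimize import least_squares

# synthetic sample for illustration only (not the Sebou River record)
rng = np.random.default_rng(2)
sample = rng.lognormal(mean=6.5, sigma=0.8, size=200)
xbar = np.mean(sample)
s2 = np.var(sample, ddof=1)

def moment_residuals(params):
    """Relative mismatch between theoretical and sample mean and variance."""
    mu, sigma = params
    mean_th, var_th = ss.lognorm.stats(sigma, loc=0, scale=np.exp(mu), moments="mv")
    return [mean_th / xbar - 1.0, var_th / s2 - 1.0]

# start from the MLE estimates and keep sigma positive
start = [np.mean(np.log(sample)), np.std(np.log(sample))]
result = least_squares(moment_residuals, start,
                       bounds=([-np.inf, 1e-6], [np.inf, np.inf]))
mu_opt, sigma_opt = result.x

# closed-form MOM estimates for comparison
sigma_cf = np.sqrt(np.log(1 + s2 / xbar**2))
mu_cf = np.log(xbar) - 0.5 * sigma_cf**2

print("optimized:   mu = %0.4f, sigma = %0.4f" % (mu_opt, sigma_opt))
print("closed form: mu = %0.4f, sigma = %0.4f" % (mu_cf, sigma_cf))
```

Scaling the residuals by the sample moments keeps the mean and variance terms on a similar footing, since the variance is numerically much larger than the mean for flow data.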