# Fitting a Multivariate Normal Distribution and Developing a Drought Index from it

In [None]:
!pip install pyncei
!pip install lmoments3

This notebook shows how to develop a summer drought index for Charlottesville based on its mean monthly summer precipitation and temperature. We'll use the pyncei package to get the climate data from NOAA (https://github.com/adamancer/pyncei). To illustrate spatial correlation, we'll download the data for Richmond as well.

## Data Collection
To download the NOAA climate data, you'll need to get your own token here: https://www.ncdc.noaa.gov/cdo-web/token. Replace my token below with your own.

In [None]:
from pyncei import NCEIBot, NCEIResponse
#ncei = NCEIBot("ExampleNCEIAPIToken")
ncei = NCEIBot("klhIxphHLKDrrnjPtJGNpfEnfbERfUvZ")

Download monthly precipitation and temperature at Charlottesville and Richmond over their overlapping period of record.

Charlottesville: https://www.ncdc.noaa.gov/cdo-web/datasets/GSOM/stations/GHCND:USW00093736/detail  
Richmond: https://www.ncdc.noaa.gov/cdo-web/datasets/GSOM/stations/GHCND:USW00013740/detail

In [None]:
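# download the data in batches and append each batch to the response (the API limits how much can be requested at once)
# each request runs from May of `year` through December of `year+1`, so consecutive requests overlap;
# the duplicated months do not change the summer means computed below
cville_df = NCEIResponse()
for year in range(1961,2024):
  cville_df.extend(
      ncei.get_data(
          datasetid="GSOM",
          stationid=["GHCND:USW00093736"],
          datatypeid=["TAVG", "PRCP"], # degrees Celsius, mm
          startdate=str(year) + "-05-01",
          enddate=str(year+1) + "-12-31",
      )
  )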

cville_df = cville_df.to_dataframe()
cville_df.head()

Get average monthly summer precipitation and temperature (June, July, August). First, determine the month and year of each data point.

In [None]:
from datetime import datetime
import pandas as pd

# parse the dates once, then extract month and year with the vectorized .dt accessor
cville_df['date'] = pd.to_datetime(cville_df['date'])
cville_df['Month'] = cville_df['date'].dt.month
cville_df['Year'] = cville_df['date'].dt.year
cville_df

Next, find the monthly averages over the summer months.

In [None]:
cville_df = cville_df.loc[cville_df['Month'].isin([6,7,8])]
cville_summer = cville_df.groupby(['Year','datatype'])['value'].mean().reset_index()
cville_summer

Now repeat this process for the Richmond data.

In [None]:
import pandas as pd

# download the data in batches and append each batch to the response (the API limits how much can be requested at once)
# each request runs from May of `year` through December of `year+1`, so consecutive requests overlap;
# the duplicated months do not change the summer means computed below
richmond_df = NCEIResponse()
for year in range(1961,2024):
  richmond_df.extend(
      ncei.get_data(
          datasetid="GSOM",
          stationid=["GHCND:USW00013740"],
          datatypeid=["TAVG","PRCP"], # degrees Celsius, mm
          startdate=str(year) + "-05-01",
          enddate=str(year+1) + "-12-31",
      )
  )

# convert data to dataframe and find month and year of each data point
richmond_df = richmond_df.to_dataframe()
richmond_df['date'] = pd.to_datetime(richmond_df['date'])
richmond_df['Month'] = richmond_df['date'].dt.month
richmond_df['Year'] = richmond_df['date'].dt.year

# compute average monthly precipitation and temperature over summer months
richmond_df = richmond_df.loc[richmond_df['Month'].isin([6,7,8])]
richmond_summer = richmond_df.groupby(['Year','datatype'])['value'].mean().reset_index()
richmond_summer

Combine data from Charlottesville and Richmond into a single data frame.

In [None]:
import pandas as pd

cville_summer = cville_summer.pivot(index='Year', columns='datatype', values='value')
richmond_summer = richmond_summer.pivot(index='Year', columns='datatype', values='value')
cville_summer.columns = ['PRCP_cville', 'TAVG_cville']
richmond_summer.columns = ['PRCP_richmond', 'TAVG_richmond']
climate_df = pd.concat([cville_summer, richmond_summer], axis=1)
climate_df

Remove NaN data points.

In [None]:
climate_df = climate_df.dropna()
climate_df

## Data Visualization

See how the data variables are correlated.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

corr_matrix = climate_df.corr()

# Plot the correlation matrix using seaborn
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

Temperature in Richmond is positively correlated with temperature in Charlottesville ($\rho=0.69$). The same is true for precipitation at the two locations ($\rho=0.6$). Within each location, however, temperature and precipitation are negatively correlated ($\rho=-0.22$ in Charlottesville and $\rho=-0.26$ in Richmond), so drier summers tend to be hotter, which exacerbates drought conditions.

## Fitting Multivariate Normal Distributions

### Ensuring Marginal Distributions are Normal

Let's create a multivariate drought index for Charlottesville incorporating both of these variables. We can do this by fitting a multivariate normal (MVN) distribution to the two variables and mapping their joint cumulative probability to an index through the inverse CDF of the standard normal distribution. For an MVN fit to be appropriate, the marginal distribution of each variable must be normal (a necessary but not sufficient condition). Let's investigate that.

In [None]:
sns.pairplot(climate_df[["PRCP_cville","TAVG_cville"]], kind='reg', diag_kind='kde')
plt.show()

Load code for fitting a normal distribution and testing the fit with a PPCC test.

In [None]:
from google.colab import drive

# allow access to google drive
drive.mount('/content/drive')

!cp "drive/MyDrive/Colab Notebooks/CE6280/CodingExamples/utils.py" .
from utils import *

In [None]:
import scipy.stats as ss
import matplotlib.pyplot as plt
from lmoments3 import distr
import numpy as np

class Normal(Distribution):
  def __init__(self):
    super().__init__()
    self.mu = None
    self.sigma = None

  def fit(self, data, method):
    assert method == 'MLE' or method == 'MOM' or method == 'Lmom',"method must = 'MLE', 'MOM' or 'Lmom'"

    self.findMoments(data)
    if method == 'MLE':
      self.mu, self.sigma = ss.norm.fit(data)
    elif method == 'MOM':
      self.mu = self.xbar
      self.sigma = np.sqrt(self.var)
    elif method == 'Lmom':
      norm_params = distr.nor.lmom_fit(data)
      self.mu = norm_params["loc"]
      self.sigma = norm_params["scale"]

  def findReturnPd(self, T):
    q_T = ss.norm.ppf(1-1/T, self.mu, self.sigma)
    return q_T

  def plotHistPDF(self, data, min, max, title):
    x = np.arange(min, max,(max-min)/100)
    f_x = ss.norm.pdf(x, self.mu, self.sigma) # normal (not lognormal) PDF of the fitted distribution
    self.plotDistFit(data, x, f_x, min, max, title)

  def ppccTest(self, data, title, m=10000):
    # calculate test statistic, rho
    x_sorted = np.sort(data)
    p_observed = ss.mstats.plotting_positions(x_sorted)
    x_fitted = ss.norm.ppf(p_observed, self.mu, self.sigma)
    self.ppcc_rho = np.corrcoef(x_sorted, x_fitted)[0,1]

    # generate m synthetic samples of n observations to estimate null distribution of rho
    rhoVector = np.zeros(m)
    for i in range(m):
      np.random.seed(i)
      x = ss.norm.rvs(self.mu, self.sigma, size=len(data))
      rhoVector[i] = np.corrcoef(np.sort(x), x_fitted)[0,1]

    # calculate p-value of test: share of synthetic rho values that do not exceed the observed rho
    count = int(np.sum(self.ppcc_rho < rhoVector))

    self.p_value_PPCC = 1 - count/(len(rhoVector) + 1)

    # make Q-Q plot
    plt.scatter(x_sorted,x_fitted,color='b')
    plt.plot(x_sorted,x_sorted,color='r')
    plt.xlabel('Observations')
    plt.ylabel('Fitted Values')
    plt.title(title)
    plt.show()

  def calcCI(self, data, p, CI, method, npars, seed):
    n = len(data)
    alpha = (100.0-CI)/100.0
    # calculate theoretical confidence interval using formula from slides
    z_p = ss.norm.ppf(p)
    z_crit = ss.norm.ppf(1-alpha/2)
    x_p = self.mu + z_p*self.sigma
    LB = x_p - z_crit * np.sqrt(self.sigma**2 * (1+0.5*z_p**2)/n)
    UB = x_p + z_crit * np.sqrt(self.sigma**2 * (1+0.5*z_p**2)/n)
    return LB, UB

Use a PPCC test to test for normality of precipitation.
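
As a standalone illustration of what the PPCC statistic measures, the sketch below correlates a sorted sample with the quantiles a fitted normal distribution assigns to the same plotting positions. The helper name `ppcc_normal` and the two synthetic samples are assumptions made for this example; they are not part of `utils.py`.

```python
import numpy as np
import scipy.stats as ss

def ppcc_normal(data):
    """Correlation between sorted data and fitted normal quantiles (hypothetical helper)."""
    x_sorted = np.sort(np.asarray(data, dtype=float))
    mu, sigma = ss.norm.fit(x_sorted)                        # MLE fit of the normal
    p = np.asarray(ss.mstats.plotting_positions(x_sorted))   # empirical non-exceedance probabilities
    x_fitted = ss.norm.ppf(p, mu, sigma)                     # quantiles implied by the fit
    return np.corrcoef(x_sorted, x_fitted)[0, 1]

rng = np.random.default_rng(0)
normal_sample = rng.normal(100, 30, size=60)      # synthetic sample drawn from a normal
skewed_sample = rng.lognormal(0, 1, size=60)      # synthetic, strongly right-skewed sample

rho_normal = ppcc_normal(normal_sample)
rho_skewed = ppcc_normal(skewed_sample)
print(rho_normal, rho_skewed)
```

A sample that really is normal should give a correlation close to 1, while a skewed sample gives a visibly lower value. The test in the `Normal` class turns this into a p-value by simulating the distribution of the statistic under the fitted normal.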

In [None]:
dist_precip = Normal()
dist_precip.fit(climate_df["PRCP_cville"], "MLE")
dist_precip.ppccTest(climate_df["PRCP_cville"], "Normal Fit")
dist_precip.p_value_PPCC

We fail to reject the null hypothesis that the precipitation data are normally distributed, so we do not need to transform them. Let's repeat the test with the temperature data.

In [None]:
dist_temp = Normal()
dist_temp.fit(climate_df["TAVG_cville"], "MLE")
dist_temp.ppccTest(climate_df["TAVG_cville"], "Normal Fit")
dist_temp.p_value_PPCC

Once again, we fail to reject the null hypothesis that the temperature data are normally distributed, so we do not need to transform them.

### Ensuring the Mahalanobis distance has a $\chi^2$ distribution

Below we write a class for the $\chi^2$ distribution that initializes it with its single parameter (the number of degrees of freedom), and tests the goodness of fit with a PPCC test.

In [None]:
class ChiSq(Distribution):
  def __init__(self):
    super().__init__()
    self.df = None

  def ppccTest(self, data, title, m=10000):
    # calculate test statistic, rho
    x_sorted = np.sort(data)
    p_observed = ss.mstats.plotting_positions(x_sorted)
    x_fitted = ss.chi2.ppf(p_observed, self.df)
    self.ppcc_rho = np.corrcoef(x_sorted, x_fitted)[0,1]

    # generate m synthetic samples of n observations to estimate null distribution of rho
    rhoVector = np.zeros(m)
    for i in range(m):
      np.random.seed(i)
      x = ss.chi2.rvs(self.df, size=len(data))
      rhoVector[i] = np.corrcoef(np.sort(x), x_fitted)[0,1]

    # calculate p-value of test: share of synthetic rho values that do not exceed the observed rho
    count = int(np.sum(self.ppcc_rho < rhoVector))

    self.p_value_PPCC = 1 - count/(len(rhoVector) + 1)

    # make Q-Q plot
    plt.scatter(x_sorted,x_fitted,color='b')
    plt.plot(x_sorted,x_sorted,color='r')
    plt.xlabel('Observations')
    plt.ylabel('Fitted Values')
    plt.title(title)
    plt.show()

We can compute the Mahalanobis distance of each observation from the mean using the function [scipy.spatial.distance.mahalanobis](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.mahalanobis.html), which takes three arguments: the two points whose separation is being measured (here, an observation and the mean vector) and the inverse of the covariance matrix of the variables. The function returns the distance itself, so we square it before comparing against the $\chi^2$ distribution.
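
The squared distance is the quadratic form $d^2 = (x-\mu)^T \Sigma^{-1} (x-\mu)$, so it can also be computed for all observations at once. The following is a minimal sketch on synthetic data (the mean vector and covariance matrix are made up for illustration) that checks the vectorized form against the SciPy function.

```python
import numpy as np
from scipy.spatial import distance

# synthetic precipitation/temperature pairs; the parameters are illustrative assumptions
rng = np.random.default_rng(1)
X = rng.multivariate_normal([100.0, 24.0], [[900.0, -6.0], [-6.0, 0.5]], size=50)

mu_demo = X.mean(axis=0)
cov_inv_demo = np.linalg.inv(np.cov(X.T))

# squared Mahalanobis distance of every row in one call: (x - mu)^T S^-1 (x - mu)
diff = X - mu_demo
d2_vectorized = np.einsum('ij,jk,ik->i', diff, cov_inv_demo, diff)

# same quantity from the SciPy function, which returns the unsquared distance
d2_loop = np.array([distance.mahalanobis(row, mu_demo, cov_inv_demo)**2 for row in X])
print(np.allclose(d2_vectorized, d2_loop))
```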

In [None]:
from scipy.spatial import distance

xy = climate_df[["PRCP_cville","TAVG_cville"]]
mu = xy.mean().to_numpy() # NumPy array so that mu[0] and mu[1] index by position
print("Mean vector: \n", mu)
cov = np.cov(xy.T)
print("Covariance matrix: \n", cov)

cov_inv = np.linalg.inv(cov) # invert once rather than on every loop iteration
dists = np.zeros(len(climate_df.index))

for i in range(len(dists)):
  dists[i] = distance.mahalanobis(xy.iloc[i], mu, cov_inv)

dists_chi2fit = ChiSq()
dists_chi2fit.df = 2 # number of variables
dists_chi2fit.ppccTest(dists**2, "Chi-Squared Fit")
dists_chi2fit.p_value_PPCC

We fail to reject the null hypothesis that the squared Mahalanobis distances come from a $\chi^2$ distribution with 2 degrees of freedom (one for each of the 2 variables: average monthly summer precipitation and temperature in Charlottesville). This suggests the MVN distribution is appropriate.

## Visualizing the MVN Distribution

We can create an object of the [scipy.stats.multivariate_normal](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multivariate_normal.html) class and compute its pdf to visualize the joint probability over a grid of possible precipitation and temperature values. We plot this on a contour map below using [matplotlib.pyplot.contourf](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.contourf.html).

In [None]:
x, y = np.mgrid[0:200:2, 21:27:0.06]
pos = np.empty(x.shape + (2,))
pos[:, :, 0] = x
pos[:, :, 1] = y
rv = ss.multivariate_normal(mu, cov)

colors = plt.contourf(x, y, rv.pdf(pos))
plt.scatter(climate_df['PRCP_cville'], climate_df['TAVG_cville'],c='k')
cbar = plt.colorbar(colors)
cbar.set_label('Probability Density')
plt.xlabel('Avg Monthly Summer Precip (mm)')
plt.ylabel('Avg Monthly Summer Temp (C)')
plt.title('Multivariate Normal Distribution')
plt.show()

We can also visualize this in three dimensions using [mpl_toolkits.mplot3d.Axes3D](https://matplotlib.org/3.5.1/api/_as_gen/mpl_toolkits.mplot3d.axes3d.Axes3D.html).

In [None]:
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
surf = ax.plot_surface(x,y,rv.pdf(pos),cmap=cm.coolwarm,linewidth=0,antialiased=False)
ax.set_xlabel('Avg Monthly Summer Precip (mm)')
ax.set_ylabel('Avg Monthly Summer Temp (C)')
ax.set_zlabel('Probability Density')
fig.tight_layout()
fig.show()

## Conditional Distributions

Given the correlation between these variables, we can use the joint distribution to compute the conditional distribution of one variable given a particular value of another. Let's compute the conditional distribution of average summer monthly temperature given average summer monthly precipitation of 40 mm.
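
For a bivariate normal, the conditional distribution of temperature $T$ given precipitation $P=p$ is itself normal, with mean $\mu_T + \frac{\sigma_{TP}}{\sigma_P^2}(p-\mu_P)$ and variance $\sigma_T^2 - \frac{\sigma_{TP}^2}{\sigma_P^2}$. A minimal sketch of these two formulas follows; the helper name `conditional_normal` and the numbers passed to it are made up for illustration and are not the fitted Charlottesville values.

```python
import numpy as np

def conditional_normal(mu, cov, given_index, given_value):
    """Mean and variance of the other variable of a bivariate normal (hypothetical helper)."""
    j = given_index      # index of the variable we condition on
    i = 1 - j            # index of the variable whose distribution we want
    mean = mu[i] + cov[i, j] / cov[j, j] * (given_value - mu[j])
    var = cov[i, i] - cov[i, j]**2 / cov[j, j]
    return mean, var

# illustrative parameters: precipitation (mm) first, temperature (C) second
mu_demo = np.array([100.0, 24.0])
cov_demo = np.array([[900.0, -6.0], [-6.0, 0.5]])

m, v = conditional_normal(mu_demo, cov_demo, given_index=0, given_value=40.0)
print(m, v)
```

With these made-up numbers, conditioning on a dry summer (40 mm against a mean of 100 mm) raises the expected temperature from 24 to 24.4 and shrinks the variance from 0.5 to 0.46, because the negative covariance carries information from precipitation to temperature.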

In [None]:
# find conditional distribution of temperature given precipitation is 40
mu_cond = mu[1] + cov[1,0]*(1/cov[0,0])*(40-mu[0])
cov_cond = cov[1,1] - cov[1,0]*(1/cov[0,0])*cov[0,1]

# plot marginal vs. conditional temperature distribution
x = np.arange(np.min(climate_df['TAVG_cville']),np.max(climate_df['TAVG_cville']),\
              (np.max(climate_df['TAVG_cville'])-np.min(climate_df['TAVG_cville']))/100)
plt.plot(x,ss.norm.pdf(x,mu[1],np.sqrt(cov[1,1])),c='b',label="Unconditional Temp Dist")
plt.plot(x,ss.norm.pdf(x,mu_cond,np.sqrt(cov_cond)),c='g',label="Temp Dist Given P=40 mm")
plt.legend()
plt.xlabel('Avg Monthly Summer Temp (C)')
plt.ylabel('Probability Density')
plt.show()

## Developing Drought Indices

It is common to use the CDF of a distribution to create a drought index by mapping the cumulative probability to the corresponding z-score of a standard normal distribution. We can do this using the CDF of a univariate or multivariate distribution.
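
The mapping can be sketched in the univariate case as follows, using made-up precipitation values and assumed normal parameters rather than the fitted ones.

```python
import numpy as np
import scipy.stats as ss

precip_demo = np.array([40.0, 70.0, 100.0, 130.0, 160.0])  # made-up seasonal values (mm)
mu_p, sigma_p = 100.0, 30.0                                # assumed normal parameters

cum_prob = ss.norm.cdf(precip_demo, mu_p, sigma_p)  # cumulative probability of each value
index_demo = ss.norm.ppf(cum_prob)                  # z-score with the same cumulative probability
print(index_demo)
```

When the fitted distribution is itself normal, the index reduces to the standardized anomaly $(x-\mu)/\sigma$. The two-step mapping matters when the CDF comes from a non-normal or multivariate distribution, where no such shortcut exists.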

Below we plot the joint CDF of temperature and precipitation from our MVN fit. Note that we negate the temperature so that low values of both variables correspond to drought conditions. The CDF gives the probability that both variables fall at or below particular values, so a drought index built on it needs every variable oriented such that low values indicate more severe drought.
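
One consequence of using the joint CDF is worth seeing on a small example: the probability that both variables are low can never exceed the probability that either one alone is low, so the multivariate index never sits above either univariate index. The sketch below uses made-up parameters (with temperature already negated), not the fitted values.

```python
import numpy as np
import scipy.stats as ss

# illustrative parameters: precipitation (mm) and negated temperature (C)
mu_demo = np.array([100.0, -24.0])
cov_demo = np.array([[900.0, 6.0], [6.0, 0.5]])
rv_demo = ss.multivariate_normal(mu_demo, cov_demo)

# three made-up summers: dry and hot, average, wet and cool
obs = np.array([[60.0, -25.0], [100.0, -24.0], [140.0, -23.2]])

joint_p = rv_demo.cdf(obs)
precip_p = ss.norm.cdf(obs[:, 0], mu_demo[0], np.sqrt(cov_demo[0, 0]))
temp_p = ss.norm.cdf(obs[:, 1], mu_demo[1], np.sqrt(cov_demo[1, 1]))

# map each cumulative probability to a standard normal z-score
mv_index = ss.norm.ppf(joint_p)
precip_index = ss.norm.ppf(precip_p)
temp_index = ss.norm.ppf(temp_p)
print(np.column_stack([mv_index, precip_index, temp_index]))
```

Even the perfectly average summer in the second row gets a negative multivariate index, so the zero line of the multivariate index should not be read as "normal conditions" in the same way as for the univariate indices.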

In [None]:
# plot cumulative probability contours
# negate temperature so that low values of both variables indicate drought
# and high values of both indicate wet, cool conditions; the resulting index
# is then low in drought years and high in wet years
# NOTE: this overwrites the column in place, so run this cell only once
climate_df['TAVG_cville'] = -climate_df['TAVG_cville']
mu = climate_df[["PRCP_cville","TAVG_cville"]].mean().to_numpy()
cov = np.cov(climate_df[["PRCP_cville","TAVG_cville"]].T)
rv = ss.multivariate_normal(mu, cov)

x, y = np.mgrid[0:200:2, 21:27:0.06]
pos = np.empty(x.shape + (2,))
pos[:, :, 0] = x
pos[:, :, 1] = -y
colors = plt.contourf(x,-y,rv.cdf(pos))
plt.scatter(climate_df['PRCP_cville'],climate_df['TAVG_cville'],c='k')
cbar = plt.colorbar(colors)
cbar.set_label('Cumulative Probability')
plt.xlabel('Avg Monthly Summer Precip (mm)')
plt.ylabel('Negated Avg Monthly Summer Temp (C)')
plt.show()

Now we use the above joint CDF, as well as the marginal CDFs for comparison, to develop drought indices for Charlottesville. For each year, we compute the percentile of the observed precipitation and temperature, alone and together, using the corresponding CDF. We then map that percentile to the corresponding z-score of a standard normal distribution using its inverse CDF (the percent point function, `ppf`, in scipy.stats).

In [None]:
# find cumulative probability for each observation
# convert that to a drought index by finding its corresponding z-score of the standard normal
cumProbs = np.zeros(len(climate_df['TAVG_cville']))
MVdroughtIndices = np.zeros(len(climate_df['TAVG_cville']))
for i in range(len(cumProbs)):
    cumProbs[i] = rv.cdf([climate_df['PRCP_cville'].iloc[i],climate_df['TAVG_cville'].iloc[i]])
    MVdroughtIndices[i] = ss.norm.ppf(cumProbs[i])

# find drought indices with just temperature or precipitation
rv_x = ss.norm(mu[0],np.sqrt(cov[0,0]))
rv_y = ss.norm(mu[1],np.sqrt(cov[1,1]))
tempDroughtIndices = np.zeros(len(MVdroughtIndices))
precipDroughtIndices = np.zeros(len(MVdroughtIndices))
for i in range(len(tempDroughtIndices)):
    precipDroughtIndices[i] = ss.norm.ppf(rv_x.cdf(climate_df['PRCP_cville'].iloc[i]))
    tempDroughtIndices[i] = ss.norm.ppf(rv_y.cdf(climate_df['TAVG_cville'].iloc[i]))

# compare them
plt.plot(climate_df.index,MVdroughtIndices,c='g',label="Multivariate Drought Index")
plt.plot(climate_df.index,tempDroughtIndices,c='r',label="Temperature Drought Index")
plt.plot(climate_df.index,precipDroughtIndices,c='b',label="Precipitation Drought Index")
plt.plot(climate_df.index,np.zeros(len(climate_df.index)),c='k',linestyle='--')
plt.legend()
plt.xlabel('Year')
plt.ylabel('Drought Index')
plt.show()