### Linear Regression with Maps

Often, in climate analysis:

* We perform linear regression between an index and the time series of anomalies at each gridpoint.  

* Because both the index and the gridpoint series are anomalies (zero mean), the intercept *b* is zero (or very nearly so), so the slope is all we need.

* We make maps of the slope (also called regression coefficient) as a measure of the direction and strength of the linear relationship.  

* We then stipple or mask where this relationship is significant.   
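
As a quick check of the claim about the intercept, here is a minimal sketch using synthetic series (the numbers are made up for illustration and are not the Nino3.4 or precipitation data). It removes the time mean from both series and compares the fits before and after.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)

# Synthetic "index" and "gridpoint" series with nonzero means (illustrative values)
x_raw = 27.0 + rng.normal(size=240)
y_raw = 3.0 + 0.8 * (x_raw - 27.0) + rng.normal(scale=0.5, size=240)

# Anomalies: remove the time mean from each series
x_anom = x_raw - x_raw.mean()
y_anom = y_raw - y_raw.mean()

fit_raw = linregress(x_raw, y_raw)
fit_anom = linregress(x_anom, y_anom)

# The slope is unchanged; only the intercept moves to (numerically) zero
print(f"raw:       slope={fit_raw.slope:.3f}, intercept={fit_raw.intercept:.3f}")
print(f"anomalies: slope={fit_anom.slope:.3f}, intercept={fit_anom.intercept:.3e}")
```

Removing the means shifts the data cloud so that the best-fit line passes through the origin, which is why the maps only need the slope.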

We will continue with our previous example of looking at the relationship between the Nino3.4 index and precipitation anomalies. 

In [None]:
import xarray as xr
import matplotlib.pyplot as plt
import numpy as np

import cartopy.crs as ccrs
import cartopy.mpl.ticker as cticker
from cartopy.util import add_cyclic_point

In [None]:
file_nino34='/home/pdirmeye/classes/clim680_2022/nino34_1982-2019.oisstv2_anoms.nc'
ds_nino34=xr.open_dataset(file_nino34)
ds_nino34

In [None]:
file='/home/pdirmeye/classes/clim680_2022/GPCP_precip.mon.mean.nc'
ds_precip=xr.open_dataset(file)
ds_precip

In [None]:
da_precip = ds_precip.precip.sel(time=slice(ds_nino34['time'][0],ds_nino34['time'][-1]))

da_climo = da_precip.groupby('time.month').mean()
da_anoms = da_precip.groupby('time.month')-da_climo
da_anoms
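
To see what the two `groupby` lines do, here is a small sketch on a synthetic monthly series (three made-up years with a seasonal cycle, not the GPCP data). The names `da`, `climo`, `anoms`, and `check` are chosen only for this example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three years of synthetic monthly data with a seasonal cycle (illustrative values)
time = pd.date_range("2000-01-01", periods=36, freq="MS")
month = time.month.to_numpy()
rng = np.random.default_rng(1)
values = 5.0 + 2.0 * np.sin(2 * np.pi * (month - 1) / 12) + rng.normal(scale=0.3, size=36)
da = xr.DataArray(values, coords={"time": time}, dims=["time"], name="precip")

# Climatology: mean of each calendar month over all years
climo = da.groupby("time.month").mean()

# Anomalies: subtract the matching calendar-month mean from every time step
anoms = da.groupby("time.month") - climo

# By construction, the anomalies of each calendar month average to zero
check = anoms.groupby("time.month").mean()
print(float(abs(check).max()))
```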

### Linear Regression
Again, we use the `linregress` function from the `stats` package in `scipy`...

In [None]:
from scipy.stats import linregress

We will loop through all the grid cells across the globe and calculate the regression parameters and statistics at each grid cell.
Doing this, we populate new global grids with the results.

In [None]:
linregress?

In [None]:
# Find the size of the global grid array
nx = len(da_anoms['lon'])
ny = len(da_anoms['lat'])

# Create new empty arrays to contain the results of our calculations
p_array = np.zeros((ny,nx))   # The p-value (significance) of the fit of the regression
r_array = np.zeros((ny,nx))   # The correlation between the independent and dependent variables
m_array = np.zeros((ny,nx))   # The slope of the best-fit linear regression line

x =       ds_nino34['sst']    # The independent variable (predictor)

#### Loop through all the grid cells in the global grid
for j in range(ny):
    print(f"{j},",end=" ")
    for i in range(nx):
        
        y = da_anoms[:,j,i] # The dependent variable (predictand)
        
        m,b,r,p,e = linregress(x,y)   # b is intercept, e is standard error
        
        # Populate our new arrays with the results
        m_array[j,i] = m
        r_array[j,i] = r
        p_array[j,i] = p
print("*** DONE ***")
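
The double loop is easy to read but can be slow on a global grid. As an alternative sketch (not the method used in this notebook), the slope, correlation, and p-value can be computed for all grid cells at once from the covariance formulas. The example below uses a small synthetic grid and spot-checks one cell against `linregress`; the array names and sizes are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
nt, ny, nx = 120, 4, 5                       # small synthetic grid (illustrative sizes)
x = rng.normal(size=nt)                      # stand-in for the index
y = 0.5 * x[:, None, None] + rng.normal(size=(nt, ny, nx))   # stand-in for the anomalies

# Remove the time means so the sums below are (unnormalized) covariances
xa = x - x.mean()
ya = y - y.mean(axis=0)

# slope = cov(x, y) / var(x), correlation = cov(x, y) / (std(x) * std(y))
sxy = np.tensordot(xa, ya, axes=(0, 0))      # shape (ny, nx)
sxx = np.sum(xa**2)
syy = np.sum(ya**2, axis=0)
m_vec = sxy / sxx
r_vec = sxy / np.sqrt(sxx * syy)

# Two-sided p-value from a t-test with nt - 2 degrees of freedom
t_stat = r_vec * np.sqrt((nt - 2) / (1.0 - r_vec**2))
p_vec = 2.0 * stats.t.sf(np.abs(t_stat), df=nt - 2)

# Spot-check one grid cell against linregress
ref = stats.linregress(x, y[:, 1, 2])
print(np.isclose(m_vec[1, 2], ref.slope), np.isclose(p_vec[1, 2], ref.pvalue))
```

Like `linregress`, this test assumes the samples are independent in time.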

### Make a mask 

Only include points where our regression coefficient is significantly different from zero.

In [None]:
mask_sig = np.where(p_array<0.05,m_array,np.nan) # NaNs where the significance test fails
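
A tiny sketch of what `np.where` does here, using made-up 2x2 arrays in place of `m_array` and `p_array`:

```python
import numpy as np

# Made-up slopes and p-values on a 2x2 "grid"
m_small = np.array([[0.5, -1.2],
                    [0.1,  2.0]])
p_small = np.array([[0.01,  0.20],
                    [0.049, 0.05]])

# Keep the slope where p < 0.05, otherwise put NaN
mask_small = np.where(p_small < 0.05, m_small, np.nan)
print(mask_small)
```

Note that the comparison is strict, so a cell with a p-value of exactly 0.05 is masked out.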

### Plot the regression coefficient 

In [None]:
clevs=np.arange(-3,3.25,0.25)

fig = plt.figure(figsize=(11,8.5))

# Set the axes using the specified map projection
ax = plt.axes(projection=ccrs.PlateCarree(central_longitude=200))

# Add cyclic point
data = m_array
data,lon = add_cyclic_point(data,coord=da_anoms['lon'])
mask_data,lons = add_cyclic_point(mask_sig,coord=da_anoms['lon'])

# Make a filled contour plot
cs = ax.contourf(lon,da_anoms['lat'],
            data,clevs,
            transform=ccrs.PlateCarree(),
            cmap='coolwarm',extend='both')

ax.contourf(lon,da_anoms['lat'],mask_data,[0,1],
            transform = ccrs.PlateCarree(),colors='None',
            hatches=['..','..'],extend='both',alpha=0)

# Add coastlines
ax.coastlines()

# Add gridlines
ax.gridlines()

# Define the xticks for longitude
ax.set_xticks(np.arange(-180,181,60),crs=ccrs.PlateCarree())
lon_formatter = cticker.LongitudeFormatter()
ax.xaxis.set_major_formatter(lon_formatter)

# Define the yticks for latitude
ax.set_yticks(np.arange(-90,91,30),crs=ccrs.PlateCarree())
lat_formatter = cticker.LatitudeFormatter()
ax.yaxis.set_major_formatter(lat_formatter)

# Call colorbar (slope units are precipitation units per unit of the index, assumed here to be degC)
cbar = plt.colorbar(cs,orientation='horizontal',shrink=0.7,
                 label='Regression Coefficient (mm/day per degC)')

# Add title
plt.title('Regression between Nino3.4 and Precipitation Anomalies') ;

## Final Note on Regression

The correlation coefficient and regression coefficient tell us similar information about the linear relationship between our two datasets.  

The benefit of a linear regression is that we get a measure of that relationship in the form of the regression coefficient. The regression coefficient is the slope of the line fit to the data; it is a measure of the _sensitivity_ of $y$ to variations in $x$, expressed in units of $y$ per unit of $x$.

The slope (and intercept) of the line also provide us with a linear model of this relationship.  If we have a _good_ model, then we can predict the value of $y$ based on new values of $x$. What constitutes a _good_ model, when and how to use a linear regression as a prediction model, and many other details are left to a statistics class. 
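
As a minimal sketch of using the fitted line as a prediction model, the example below fits synthetic data with a known slope and then applies the slope and intercept to a few new values of $x$. The data and the values in `x_new` are hypothetical, and the sketch says nothing about whether the model is a _good_ one.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.1, size=200)   # synthetic data, true slope 2

fit = linregress(x, y)

# Linear model: y_hat = slope * x + intercept
x_new = np.array([-1.0, 0.0, 1.5])              # hypothetical new index values
y_hat = fit.slope * x_new + fit.intercept
print(y_hat)
```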

In [None]:
np.datetime_as_string(da_anoms['time'][0]).split('T')[0]
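
The cell above turns the first time stamp into a date string, which we use later as a file attribute. Here is a sketch of the same idea on a stand-alone `numpy` time stamp (the date is an example value), together with the `unit` argument of `np.datetime_as_string` as an alternative to splitting the string.

```python
import numpy as np

t0 = np.datetime64("1982-01-01T00:00:00.000000000")   # example nanosecond time stamp

full = np.datetime_as_string(t0)                 # full ISO 8601 string at nanosecond precision
date_only = full.split("T")[0]                   # keep the part before the 'T'
date_unit = np.datetime_as_string(t0, unit="D")  # ask numpy for day precision directly

print(full, date_only, date_unit)
```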

----------------------------

## Writing data to a .nc file - Understanding `xarray` DataArrays and Datasets

When you work with datasets and perform analyses, in this course and beyond, 
you will want to preserve your results by writing the data you produce to new files on the computers you are using. 
The most common format for climate data files is NetCDF, which we will examine here, 
but these principles apply to any data format, particularly self-describing formats (including HDF and GRIB).

### Creating `xarray` DataArrays and Datasets

Suppose we want to write out our regression calculation to a file (or apply some xarray function to our data).  If we have a `xarray.Dataset` called `ds`, we could do the following:

`ds.to_netcdf('regression.nc')`

But `m_array` is not an `xarray` Dataset; it is just a `numpy` array.  How do we create an `xarray` Dataset?
1. Create an `xarray.DataArray` (note the capitalization) with named dimensions and assigned coordinates (plus any other attributes we want to add)
2. Convert it to an `xarray.Dataset`:
    * Use the `to_dataset` method to convert a single DataArray into a Dataset.
    * Call the `xarray.Dataset` constructor to combine multiple DataArrays into a single Dataset.
    * Attributes can be assigned to the Dataset that are different from the DataArray attributes.
    
Attributes are to data in self-describing formats (like NetCDF) as comments are to your code. 

It is good practice and a service to others (and to your future self) to include descriptive and complete documentation in the attributes of your data files.

In [None]:
# Create a DataArray with all the details
da_m = xr.DataArray(m_array,
                    coords={'lat':da_anoms['lat'],
                            'lon': da_anoms['lon']},
                    dims=['lat','lon'],
                    name='slope',
                    attrs={'name':'regression coefficient',
                           'units':'mm/day per degC',   # y units per x unit (index assumed to be in degC)
                           'description':'Linear regression of monthly precipitation against Niño3.4 index'})
# Convert the DataArray into a Dataset
ds_m = da_m.to_dataset()
ds_m = ds_m.assign_attrs({'start date':np.datetime_as_string(da_anoms['time'][0]).split('T')[0],
                          'end date':np.datetime_as_string(da_anoms['time'][-1]).split('T')[0]})
ds_m

Notice that the attributes of our coordinate variables have been inherited from the precipitation data they came from.

Since our new variable, which we named `slope` in the `xarray` Dataset, was created from a `numpy` array,
it had no attributes to inherit. We had to specify them.

For the Dataset itself, we added information about the range of time across which the regression was calculated. 
We took that information from the time dimension of the precipitation data.

-----------------------

### Merging multiple variables into a Dataset

Suppose we want to put m, r, and p together as separate variables in one `xarray.Dataset`.  We already have a DataArray for m, so we repeat the same steps for r and p:

In [None]:
# Create more DataArrays
da_r = xr.DataArray(r_array,
                    coords={'lat':da_anoms['lat'],
                            'lon': da_anoms['lon']},
                    dims=['lat','lon'],
                    name='corr',
                    attrs={'name':'correlation coefficient',
                           'units':'none',
                           'description':'Correlation between monthly precipitation and Niño3.4 index'})
da_p = xr.DataArray(p_array,
                    coords={'lat':da_anoms['lat'],
                            'lon': da_anoms['lon']},
                    dims=['lat','lon'],
                    name='p',
                    attrs={'name':'p-value',
                           'units':'none',
                           'description':'Significance of regression of monthly precipitation against Niño3.4 index'})

# Convert the DataArrays into Datasets
ds_r = da_r.to_dataset()
ds_p = da_p.to_dataset()

We can then merge the Datasets together. 
Note that each should use a different variable name: if two Datasets contain a variable with the same name but different values, `xarray` cannot tell which one to keep.

#### Method 1: Merge Datasets

In [None]:
ds_regr = xr.merge([ds_m,ds_p,ds_r])
ds_regr
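
To illustrate the note about variable names, here is a sketch with two tiny made-up Datasets that both contain a variable called `slope` with different values. It asks `xr.merge` explicitly for `compat="no_conflicts"` so the behavior does not depend on the default of the installed `xarray` version; the names `ds_a`, `ds_b`, and `slope_b` are hypothetical.

```python
import numpy as np
import xarray as xr

lat = [0.0, 1.0]
ds_a = xr.Dataset({"slope": ("lat", np.array([1.0, 2.0]))}, coords={"lat": lat})
ds_b = xr.Dataset({"slope": ("lat", np.array([9.0, 9.0]))}, coords={"lat": lat})

# Same variable name, different values: the merge is refused
try:
    xr.merge([ds_a, ds_b], compat="no_conflicts")
    merged_ok = True
except xr.MergeError:
    merged_ok = False

# Renaming one variable removes the clash
ds_ok = xr.merge([ds_a, ds_b.rename({"slope": "slope_b"})])
print(merged_ok, list(ds_ok.data_vars))
```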

#### Method 2: Add DataArrays to an existing Dataset

We had already made a Dataset containing the DataArray for the **slope**. 
We could just add the two other DataArrays to that Dataset. 
This is especially handy if you want to add a small number of variables to an existing Dataset that has many variables.

In [None]:
ds_regr = ds_m.copy()   # Copy, so that adding variables does not also modify ds_m
ds_regr['p'] = da_p
ds_regr['corr'] = da_r
ds_regr

#### Method 3: Create a Dataset from a dictionary of DataArrays

Actually, we did not need to make the intermediate Datasets at all: we could have made the Dataset directly from the set of DataArrays.

Notice that we did not need to _name_ the variables in the DataArrays either; the dictionary keys supply the names at this step.

All three methods created the same final result. 
For a small Dataset like this, there is little difference between these methods.
However, depending on your situation, one of these methods may be clearly better (or worse) than the others.

In [None]:
ds_regr = xr.Dataset(data_vars={'slope':da_m,'p':da_p,'corr':da_r})
ds_regr = ds_regr.assign_attrs({'start date':np.datetime_as_string(da_anoms['time'][0]).split('T')[0],
                                'end date':np.datetime_as_string(da_anoms['time'][-1]).split('T')[0]})
ds_regr

### Writing a Dataset to a file on disk

Now we can write our Dataset containing multiple variables to a single NetCDF file.

It is `xarray`'s seamless handling of NetCDF and other self-describing data formats that makes it such a handy Python package for Climate Science.

In [None]:
ds_regr.to_netcdf('regression.nc')

In [None]:
ds_check=xr.open_dataset('regression.nc')
ds_check
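
Beyond looking at the printed Dataset, we can also check that the values and attributes survive the round trip. This is a sketch on a small synthetic Dataset written to a temporary directory (the file name and contents are made up, and it relies on whichever NetCDF backend `xarray` finds installed).

```python
import os
import tempfile

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"slope": (("lat", "lon"), np.arange(6.0).reshape(2, 3))},
    coords={"lat": [-10.0, 10.0], "lon": [0.0, 120.0, 240.0]},
    attrs={"description": "synthetic example"},
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.nc")   # hypothetical file name
    ds.to_netcdf(path)
    with xr.open_dataset(path) as ds_back:
        ds_back = ds_back.load()                # read into memory before the file is removed

# equals() compares variables and coordinates, not attributes
same_values = ds.equals(ds_back)
print(same_values, ds_back.attrs["description"])
```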

### Reducing file size: NetCDF _deflation_

NetCDF is a very useful file format because it is _self-describing_ (i.e., metadata is stored with the data in the same files).

NetCDF version 4 and later offer an additional option that you should use when storing large files: 
_deflation_, which is a lossless file compression feature.
It works in the same way as `zip` and `gzip` compression (it is built on the same `zlib` library), not like the lossy compression used for images (`JPEG`) and video (`MPEG`).

In [None]:
deflate = dict(zlib=True, complevel=1)          # For deflated NetCDF4 output without data loss
encoding = {var: deflate for var in ds_regr}    # Apply to every variable in the Dataset
ds_regr.to_netcdf(path='regression.nc4',engine="netcdf4",format="netCDF4",encoding=encoding)

In [None]:
ls -l reg*

That isn't much of a difference in this case (less than 8% compression), 
but for very large files, files with many dimensions,
or DataArrays that contain a lot of grid cells with the same value 
(e.g., zeros or NaNs), the compression rate can be very large. 
This can save a large amount of disk space, yet the extra time to read
deflated files is usually small.
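
NetCDF deflation is built on the `zlib` library. As a sketch of the underlying behavior, the example below applies Python's standard `zlib` module directly to the bytes of a synthetic field that is mostly zeros (a rough stand-in for precipitation), rather than writing a NetCDF file. It reports the compressed size relative to the raw size and checks whether the restored array is identical to the original at two compression levels.

```python
import zlib

import numpy as np

rng = np.random.default_rng(4)

# Synthetic non-negative field in which about 80% of the cells are exactly zero
field = rng.gamma(shape=2.0, scale=1.5, size=(180, 360))
field[rng.random(field.shape) < 0.8] = 0.0

raw = field.tobytes()
results = {}
for level in (1, 9):
    packed = zlib.compress(raw, level)
    restored = np.frombuffer(zlib.decompress(packed), dtype=field.dtype).reshape(field.shape)
    # Store compressed size and whether the round trip is bit-for-bit identical
    results[level] = (len(packed), np.array_equal(restored, field))
    print(level, round(len(packed) / len(raw), 3), results[level][1])
```

The actual ratio for a NetCDF file depends on the data, the chunking, and other encoding options, so the numbers here are only indicative.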

#### Some other points to note above:
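* We specified the `engine` and `format` explicitly to make sure the file is written as NetCDF version 4, which supports deflation.
When the `netCDF4` Python package is installed, `xarray` writes version 4 by default; without it, `xarray` falls back to `scipy` and writes version 3, which does not support deflation.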
* We used a different suffix, `.nc4`, mainly to avoid overwriting the original file 
so we could compare their sizes. It is not required for a deflated file; 
we could have used `.nc`. However, the `.nc4` suffix is sometimes used to signal that
a file is in NetCDF version 4 format (and may be compressed).
* `complevel` is the compression level, from 1 (fastest) to 9 (smallest files). Deflation is lossless at every level: a larger number costs more time when writing but never changes the data. 
<u>Lossy</u> compression only occurs if you also ask for it in `encoding`, for example by packing values into integers with `scale_factor` and `add_offset`, or by setting `least_significant_digit`. 
This can be a problem with non-negative variables that have a lot of zeros,
like precipitation or shortwave radiation. Lossy packing can turn 
zeros into small negative or positive numbers that may cause problems with your 
calculations.
* It is possible to apply `encoding` only to a subset of the variables in the Dataset. This is one way to 
selectively avoid problems like those above.
* Note the `dict` function. This is an alternative way to write a Python dictionary. 
`dict(zlib=True, complevel=1)` is the same as `{'zlib':True, 'complevel':1}`
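
The last two points can be sketched with plain dictionaries. The variable names below are the ones we gave the variables in `ds_regr`; leaving out `'p'` is an arbitrary choice made only to show the pattern.

```python
# Two equivalent ways to write the same dictionary
deflate = dict(zlib=True, complevel=1)
deflate_literal = {'zlib': True, 'complevel': 1}

# Apply the deflation settings to a subset of the variables only
variables = ['slope', 'p', 'corr']
encoding = {var: deflate for var in variables if var != 'p'}
print(encoding)

# This dictionary would then be passed along as in the cell above, e.g.
# ds_regr.to_netcdf(path='regression.nc4', engine="netcdf4", format="NETCDF4", encoding=encoding)
```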
