# Rayleigh distributions

General idea: since the original images had real and imaginary components (which we assume are independent, zero-mean Gaussians with equal variance), we expect the amplitude ($\sqrt{\text{real}^2 + \text{im}^2}$) to be Rayleigh distributed and the intensity ($\text{amplitude}^2$) to be exponentially distributed.

The goal here is to fit a Rayleigh distribution to the set of amplitude values for each road, and to compare the location and scale parameters to see whether different IRI classes have differently shaped distributions.
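As a quick sanity check of the general idea, the sketch below simulates a complex signal and fits both distributions. It uses synthetic data only: the value of `sigma`, the sample size, and the variable names are illustrative assumptions, not values from the SAR images. Fixing `loc` at zero with `floc=0` is also an assumption made for the simulation.

```python
import numpy as np
from scipy.stats import rayleigh, expon

rng = np.random.default_rng(0)
sigma = 1.5          # assumed standard deviation of each component (illustrative)
n = 100_000

# independent zero-mean Gaussian real and imaginary parts
real = rng.normal(0.0, sigma, n)
imag = rng.normal(0.0, sigma, n)

amplitude = np.hypot(real, imag)   # sqrt(real^2 + imag^2)
intensity = amplitude ** 2

# fit with the location fixed at 0, so only the scale is estimated
loc_a, scale_a = rayleigh.fit(amplitude, floc=0)
loc_i, scale_i = expon.fit(intensity, floc=0)

print('Rayleigh scale:', scale_a, '(theory: sigma =', sigma, ')')
print('Exponential scale:', scale_i, '(theory: 2 * sigma^2 =', 2 * sigma ** 2, ')')
```

If the assumptions hold, the fitted Rayleigh scale should be close to `sigma` and the fitted exponential scale close to `2 * sigma**2`.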

In [None]:
import math 
import numpy as np 
import pandas as pd
import geopandas

import seaborn as sns 
import matplotlib.pyplot as plt

from scipy.stats import rayleigh

from glob import glob 

In [None]:
# Mount Google Drive for file access - skip this cell if not using Colab
from google.colab import drive 
drive.mount('/content/drive')

In [4]:
# function to load datasets into a dictionary
import os

def load_data_dict(data_dir):
    """
    in: path to directory with merged/cleaned .pkl datasets (with trailing slash)
    out: dictionary with all the merged SAR and IRI datasets as dataframes,
         keyed by file name without its extension
    """
    datasets = {}

    for path in glob(data_dir + '*'):
        # file name without directory or extension, e.g. 'foo.pkl' -> 'foo'
        key = os.path.splitext(os.path.basename(path))[0]
        datasets[key] = pd.read_pickle(path)

    return datasets

In [6]:
# define data directory 
DATA_DIR = '/content/drive/Shared drives/Remote sensing/Summer 2020/DATA/'

## Estimate distribution on each road segment

The goal here is to estimate the distribution of the pixel data for each OID. Running something like `df.groupby('oid').agg(lambda x: rayleigh.fit(x))` across every image takes a _very_ long time, so it is not a practical way to do it.
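One possible way to avoid the slow numerical fit is sketched below. It is not what this notebook does: it assumes the location parameter can be fixed at zero, in which case the maximum-likelihood Rayleigh scale has the closed form $\hat\sigma = \sqrt{\tfrac{1}{2n}\sum_i x_i^2}$ and can be computed with a vectorized groupby. The helper name `rayleigh_scale_by_group` and the toy DataFrame are hypothetical.

```python
import numpy as np
import pandas as pd

def rayleigh_scale_by_group(df, group_col, value_col):
    """Closed-form Rayleigh scale per group, ASSUMING loc is fixed at 0.

    This differs from rayleigh.fit(x), which estimates loc and scale jointly.
    """
    squared = df[value_col].pow(2)
    # mean of x^2 within each group (NaNs are skipped by groupby.mean)
    mean_sq = squared.groupby(df[group_col]).mean()
    return np.sqrt(mean_sq / 2).rename(value_col + '_scale')

# toy example with made-up values
toy = pd.DataFrame({'oid': [1, 1, 2, 2], 'amp': [1.0, 1.0, 2.0, 2.0]})
print(rayleigh_scale_by_group(toy, 'oid', 'amp'))
```

Because this fixes `loc` at zero, it gives only a scale per road and no location parameter to compare across IRI classes.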


In [8]:
# road-level data
roadlevel_m = load_data_dict(DATA_DIR +'road_level_merged/')

In [9]:
# a pixel-level dataframe 
pixels = pd.read_pickle(DATA_DIR + 'pixel_level/raw_pixels.pkl')

In [12]:
# Aggregate over a small subset of images (two acquisition dates here),
# since fitting across all images is far too slow
data = pixels.loc[:, ['oid_2012_buffered_masked', '20110829', '20120416']]
data.columns = ['oid', '20110829', '20120416']

def rayfit(g):
    """Fit a Rayleigh distribution to one group of pixel values; return (loc, scale)."""
    loc, scale = rayleigh.fit(g)
    return loc, scale

ray = data.groupby('oid', as_index=True).agg(rayfit)

# extract location and scale - this works for several image rows
loc = ray.apply(lambda x: x.str[0], axis=1).add_suffix('_loc')
scale = ray.apply(lambda x: x.str[1], axis=1).add_suffix('_scale')

merged = pd.concat([loc, scale], axis=1, join='outer')
# merged.columns = ['loc', 'scale']


In [15]:
# merge above information with existing road-level data 
df = roadlevel_m['despeck_buffered_masked']
merged = pd.concat([df, merged], axis=1, join='outer')

Here we just plot the location and scale parameters from a single image. If we had a DataFrame with a `_loc` and a `_scale` column for every image, we could instead plot the parameters from the SAR acquisition closest to the IRI test date (same logic as `closest_mean` and `closest_std`).
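The closest-acquisition idea could be sketched as follows. This is only an illustration and not necessarily how `closest_mean` and `closest_std` were computed: the helper `closest_param`, the column name `iri_date`, and the toy values are all hypothetical. It assumes parameter columns are named `<YYYYMMDD>_loc` or `<YYYYMMDD>_scale`, as in the cell above.

```python
import numpy as np
import pandas as pd

def closest_param(df, date_col, suffix):
    """For each row, pick the value from the '<YYYYMMDD><suffix>' column
    whose acquisition date is nearest to the date in `date_col`."""
    cols = [c for c in df.columns if c.endswith(suffix)]
    acq_dates = pd.to_datetime([c[:-len(suffix)] for c in cols], format='%Y%m%d')
    test_dates = pd.to_datetime(df[date_col])

    # absolute time difference, shape (n_rows, n_images)
    diffs = np.abs(test_dates.values[:, None] - acq_dates.values[None, :])
    nearest = diffs.argmin(axis=1)

    values = df[cols].to_numpy()
    picked = values[np.arange(len(df)), nearest]
    return pd.Series(picked, index=df.index, name='closest' + suffix)

# toy example with made-up parameter values and a hypothetical 'iri_date' column
toy = pd.DataFrame({
    '20110829_loc': [0.1, 0.2],
    '20120416_loc': [0.3, 0.4],
    'iri_date': ['2011-09-01', '2012-05-01'],
})
toy['closest_loc'] = closest_param(toy, 'iri_date', '_loc')
print(toy)
```

The same call with `suffix='_scale'` would give a matching `closest_scale` column, and the two could then be passed to the scatterplots in place of the single-image columns.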

In [None]:
# plot location and scale for one image 
sns.scatterplot(data=merged, x='20120416_loc', y='20120416_scale', hue='quality')

In [None]:
# separating by quality
g = sns.FacetGrid(data=merged, col='quality', hue='quality')
g.map(sns.scatterplot, '20120416_loc', '20120416_scale')
plt.xlim(-0.5, 0.5)
plt.ylim(0, 2.5)