#### Digital Signal Processing Courseware: An Introduction (copyright © 2024)
## Authors: J. Christopher Edgar and Gregory A. Miller

Conversion from Mathematica to Jupyter Notebook by Song Liu.

The authors of this book are indebted to Prof. Bruce Carpenter (University of Illinois Urbana-Champaign). Bruce inspired the creation of this courseware, he consulted with the authors as this courseware was being developed, and he provided the original version of the code and text for several sections of this courseware (e.g. the section on complex numbers and the section on normal distributions). 

# <font color=red>DSP.04 Convolution and Filtering - Spatial Domain</font>

# <font color=red>Give it a Try!</font>
# <font color=red>Part 3</font>

### Setup

In [None]:
# general imports
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import image as img
from matplotlib import cm
from mpl_toolkits import mplot3d
from scipy.fft import fft, fftfreq
import matplotlib.patches as patches
import math
import cmath
import pandas as pd
from sympy import Symbol, sin, series
from sympy import roots, solve_poly_system
import scipy.special

import warnings
warnings.filterwarnings('ignore')

# Figure size 
plt.rc("figure", figsize=(8, 6))

#function to create time course figure
#one waveform
def make_plot_1(x1,y1,type="b",linewidth = 1): 
    plt.plot(x1, y1, type, linewidth=linewidth)  # pass linewidth through; it was previously ignored
    plt.margins(x=0, y=0)
    plt.axhline(y=0, color='k')
    plt.tick_params(labelbottom = False, bottom = False)
    
#two overlaid waveforms with red and blue   
def make_plot_2(x1,y1,type1,x2,y2,type2): 
    plt.plot(x1, y1, type1)
    plt.plot(x2, y2, type2)
    plt.margins(x=0, y=0)
    plt.axhline(y=0, color='k')
    plt.tick_params(labelbottom = False, bottom = False)
    
#three overlaid waveforms with red, blue and green   
def make_plot_3(x1,y1,type1,x2,y2,type2,x3,y3,type3): 
    plt.plot(x1, y1, type1)
    plt.plot(x2, y2, type2)
    plt.plot(x3, y3, type3)
    plt.margins(x=0, y=0)
    plt.axhline(y=0, color='k')
    plt.tick_params(labelbottom = False, bottom = False)
    
def make_plot_3d(ax,x,y,z):    
    ax.contour3D(x, y, z, 50, cmap=cm.coolwarm)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('z')
    
def make_plot_freq_1(x1,sample_rate, duration=1): 
    N = sample_rate * duration
    Nhalf = math.ceil(N/2)
    yf = fft(x1)
    xf = fftfreq(N, 1 / sample_rate)
    yf = yf[0:Nhalf]
    xf = xf[0:Nhalf]
    plt.plot(xf, np.abs(yf))
    
#two spectra
def make_plot_freq_2(x1,x2,sample_rate, duration=1): 
    N = sample_rate * duration
    Nhalf = math.ceil(N/2)
    yf1 = fft(x1)
    yf2 = fft(x2)
    xf = fftfreq(N, 1 / sample_rate)

    yf1 = yf1[0:Nhalf]
    yf2 = yf2[0:Nhalf]
    xf = xf[0:Nhalf]

    plt.plot(xf, np.abs(yf1))
    plt.plot(xf, np.abs(yf2), color = 'r')
    
def make_imshow(x):
    plt.imshow(x,cmap='Greys_r')
    plt.tick_params(labelbottom = False, bottom = False)
    plt.tick_params(labelleft = False, left = False)
    
def make_imshow_color(x):
    plt.imshow(x)
    plt.tick_params(labelbottom = False, bottom = False)
    plt.tick_params(labelleft = False, left = False)
    
def round_complex(x):
    return complex(np.round(x.real,4),np.round(x.imag,4))

## <font color=red>DSP.04.G3) Using standard deviation values to identify outliers</font>

### <font color=red>DSP.04.G3.a) Calculating Z-scores and handling outliers</font>

Load some data.

In [None]:
rawdata = np.array([4.161803, -0.100638, 2.138029, 1.372499, -0.163366, 0.549151, 2.955684,
-3.951537, 5.354644, -3.736839, -6.748785, -7.913515, 1.414414, -1.160518,
-3.943281, -2.920922, -12.913827, -1.65564, 3.745551, -0.509798, 6.938785,
0.181353, 4.817792, 8.724311, 6.437805, -2.693438, 1.133871, 5.189491, 7.243207,
0.888545, -1.923403, -3.686132, -0.132783, -2.718729, 2.678307, 2.530582,
-0.204171, -2.087249, 217.431945, 0.636894, -1.49119, -6.703748, -2.99446,
0.804615, 4.96383, -3.833015, 6.271903, 0.937844, 3.010456, -0.389282, 8.077521,
-1.422039, -3.611092, 3.295595, 5.712019, 10.466032, -1.95425, 4.452664,
4.337543, 0.109857, 3.280643, -0.388772, 0.244246, 0.978405, -0.012601, 0.523191,
1.891633, 2.447331, 2.876316, 1.933092, -0.773207, 7.890352, 2.865899, 1.148594,
5.325821, 11.691833, 8.23212, 9.905125, 15.775607, 15.046681, 16.909649,
10.737247, 5.230532, 4.704697, -5.161913, -17.312115, -28.057915, -29.840624,
-30.954744, -37.426754, -45.539013, -41.198257, -46.638786, -51.361076,
-50.202209, -54.358242, -54.827358, -65.26075, -64.494339, -62.19976, -67.891083,
-64.709808, -62.695301, -63.910999, -65.830643, -63.877754, -59.0495, -53.860088,
-50.702389, -48.609245, -46.051105, -47.389957, -44.839188, -40.158932,
-39.228531, -35.659504, -28.532249, -30.350662, -38.398552, -28.498417,
-27.416334, -23.32937, -24.34053, -28.01129, -32.17485, -29.345148, -32.787132,
-31.421492, -34.017815, -38.747742, -34.092876, -36.653595, -32.668087,
-33.902489, -32.462177, -28.102474, -33.042095, -29.791214, -23.973711,
-17.174719, -16.891647, -20.68807, -20.423393, -16.694559, -8.641809,
-13.102222, -10.848848, -11.987523, -12.343177, -5.389377, -9.791459, -16.60183,
-15.996319, -10.052487, -10.610134, -10.640911, 1.584374, -214.014744, -2.376763,
-6.416342, -2.395929, -8.566953, -4.311759, 1.170809, 0.131636, -6.57483,
1.076944, -2.88642, -5.374261, -3.839308, -1.936249, -2.084193, -1.268859,
5.33434, 1.509975, 4.161803, -0.100638, 2.138029, 1.372499, -0.163366, 0.549151,
2.955684, -3.951537, 5.354644, -3.736839, -6.748785, -7.913515, 1.414414,
-1.160518, -3.943281, -2.920922, -12.913827, -1.65564, 3.745551, -0.509798,
6.938785, 0.181353, 4.817792, 8.724311, 6.437805, -2.693438, 1.133871, 5.189491,
7.243207, 0.888545, -1.923403, -3.686132, -0.132783, -2.718729, -2.718729])

rawdata

In [None]:
len(rawdata)

There are 210 datapoints.

In this timeseries dataset, assume that information was collected starting at -200 ms. That means that we chose some point in time to treat as 0 ms and that we began data collection 200 ms before that. (The reason for doing that isn't important here.) Also, assume that a sample was collected once every 4 ms. In other words, the sampling rate was 250 Hz.

Plot the data.

In [None]:
time = np.arange(-200,640, 4)

# Plotting time vs amplitude using plot function from pyplot
plt.plot(time, rawdata)
plt.margins(x=0, y=0)
plt.axhline(y=0, color='k')

plt.show() 

The data start at -200 ms, a datapoint is collected once every 4 ms, and there are a total of 210 datapoints.
Modify the x axis to correctly show the time information.

210 datapoints * 4 ms = 840 ms

Because the start time is -200 ms, in the above figure the x axis ranges from -200 to 640 ms. 
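As a quick check of the arithmetic above, here is a minimal sketch that builds the time axis from the sampling rate instead of typing the endpoints by hand. The names `sample_rate_hz`, `n_samples`, `start_ms`, `step_ms`, and `time_ms` are chosen here for illustration and do not appear elsewhere in this courseware.

```python
import numpy as np

# Illustrative names (assumptions for this sketch)
sample_rate_hz = 250      # samples per second
n_samples = 210           # number of datapoints in the timeseries
start_ms = -200           # time of the first sample

step_ms = 1000 / sample_rate_hz                      # time between samples, in ms
time_ms = start_ms + step_ms * np.arange(n_samples)  # one timestamp per sample

print(step_ms)                   # spacing between samples
print(time_ms[0], time_ms[-1])   # first and last timestamps
print(len(time_ms))              # should match the number of datapoints
```

Note that the last sample falls at 636 ms, not 640 ms: the 840 ms span covers 210 intervals of 4 ms, and `np.arange(-200, 640, 4)` excludes its stop value, so both approaches give the same 210 timestamps.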

Several deflections from the x-axis are observed: a sharp spike at about -50 ms, a smaller, slower deflection at
about 130 ms, another similarly slow but larger and more sustained deflection at about 210 ms, and another sharp spike at about 430 ms. The sharp
spikes at about -50 ms and 430 ms look dissimilar from the other datapoints. 

Let's assume that we conclude that these spikes don't represent real activity. These datapoints are errors. For example, if this is physiological data, we might be confident that there's no way the physiological system could produce such large spikes, but we're aware that electrical noise in the nearby environment could do so.

One way to quantify how unusual those two datapoints are relative to the rest of the values is to calculate how many standard deviations those points are from the timeseries mean. A Z-score provides this
measure. In particular, a Z-score tells us how far a particular point is from the mean in units of standard
deviations.

The Z-score formula is:
    
z = (datapoint - mean) / standard deviation
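To see the formula in action before applying it to the dataset, here is a minimal sketch on a small made-up array. The helper name `z_scores` and the array `toy` are assumptions for illustration only. The sketch uses NumPy's default population standard deviation (`ddof=0`); if you prefer the sample standard deviation, pass `ddof=1` to `std`, which gives slightly different values.

```python
import numpy as np

def z_scores(values):
    """Return (value - mean) / standard deviation for every element."""
    values = np.asarray(values, dtype=float)
    # np.std defaults to the population standard deviation (ddof=0)
    return (values - values.mean()) / values.std()

# Made-up toy data: mean is 5 and population standard deviation is 2
toy = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = z_scores(toy)
print(z)
```

In this toy array the last value, 9, sits two standard deviations above the mean, so its Z-score is 2.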

Here are the two outliers (the very first value and the very last value in the sorted timeseries).

In [None]:
unfiltered = np.sort(rawdata)
unfiltered

Use the z-score formula to calculate how many standard deviations each outlier value is from the
mean.

Given what you know about normal distributions and standard deviations, write a sentence about how
likely it is that these two points are outliers.
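If it helps with the sentence you write, the sketch below converts a Z-score into the probability of seeing a value at least that far from the mean, assuming the values follow a normal distribution. That assumption is only approximate for this timeseries, which contains slow deflections, so treat the result as a rough guide. The function name `two_sided_tail_probability` is hypothetical.

```python
import math

def two_sided_tail_probability(z):
    """Probability that a standard normal value is at least |z| from the mean."""
    # For a standard normal variable Z, P(|Z| >= z) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

for z in (1, 2, 3):
    print(z, two_sided_tail_probability(z))
```

For example, about 0.27% of normally distributed values lie 3 or more standard deviations from the mean.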


### <font color=red>DSP.04.G3.b) Correcting the data</font>

A variety of techniques are used to remove outliers. One option is to simply replace the outlier values
with the value of the nearest point (or average of the nearest points).

Give this a try. Create a corrected plot (as a friendly gesture, the code below will get you started).

First identify the location of the points nearest to one of the outliers.

Here is the position of the first outlier in the (original, unsorted) timeseries.

In [None]:
np.where(rawdata == 217.431945)[0][0]

And here are the neighboring points.

In [None]:
rawdata[37]

In [None]:
rawdata[38]

In [None]:
rawdata[39]

This is the average value of the two nearby points.

In [None]:
(rawdata[37]+rawdata[39])/2

And this replaces the outlier value with the new point (-0.725178 replaces the element at index 38 of the array, 
which is the 39th value because Python counts from 0). 

In [None]:
rawdata[38] = -0.725178

Check.

In [None]:
rawdata[38]

What we just did was apply a digital filter, similar to but not the same as the moving average convolution technique that we discussed earlier. What's the difference? Previously, when we used a 3-item kernel, we set the weights to 1/3, 1/3, 1/3. Here, we also used a 3-item kernel, but we set the weights to 1/2, 0, 1/2. That is, we put a weight of 0 on the point we wanted to replace with the average of its 2 closest neighbors, and we put a weight of 1/2 on each of those 2 neighbors.
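The kernel view can be checked on a small made-up array. The sketch below assumes a toy signal `toy` with one obvious spike. It convolves the signal with the weights 1/2, 0, 1/2 and then copies the filtered value back only at the spike's position. That is one further difference from the moving average: there the kernel output replaced every sample, while here it replaces a single sample.

```python
import numpy as np

# Made-up toy signal with a spike at index 2
toy = np.array([1.0, 3.0, 100.0, 5.0, 2.0])

# Weight of 0 on the center point, 1/2 on each neighbor
kernel = np.array([0.5, 0.0, 0.5])

# mode="same" keeps the output the same length as the input
filtered = np.convolve(toy, kernel, mode="same")

# Replace only the spike with the average of its two neighbors
corrected = toy.copy()
corrected[2] = filtered[2]
print(corrected)
```

At index 2 the filter output is (3 + 5) / 2 = 4, which matches averaging the two neighbors by hand.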

Now, you fix the second outlier and then replot the data.

For extra credit, modify the x axis so ticks are placed every 100 ms. It may not be as easy as it looks (and
maybe it doesn't even look easy).
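One possible approach to the tick spacing, sketched on a placeholder waveform rather than the real data: build the tick positions with `np.arange` and hand them to the axes with `set_xticks`. The names `time_ms`, `toy_signal`, and `tick_positions` are assumptions for illustration, and other approaches (for example, a tick locator) would also work.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same time axis as the dataset; the waveform is a placeholder, not the real data
time_ms = np.arange(-200, 640, 4)
toy_signal = np.sin(2 * np.pi * time_ms / 400)

fig, ax = plt.subplots()
ax.plot(time_ms, toy_signal)
ax.margins(x=0)
ax.axhline(y=0, color='k')

# One tick every 100 ms, from -200 ms through 600 ms
tick_positions = np.arange(-200, 601, 100)
ax.set_xticks(tick_positions)
```

The stop value 601 is used because `np.arange` excludes its endpoint, so 600 would otherwise be left off.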

### <font color=red>DSP.04.G3.c) Smoothing the Corrected Data</font>

Finally, let's assume that you wanted to smooth the original timeseries, even before you noticed those 2 spiky error values. Let's check the impact of choosing to correct, or not correct, those 2 error values before we do our smoothing. You're going to use the moving average convolution technique to remove high-frequency activity.
Apply the procedure to the corrected and uncorrected datasets (i.e., with and without outliers).
Plot both filtered datasets and comment on whether it is better to remove outliers before applying the
filter.
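Before running this on the real data, it may help to see what a moving average does to a single spike. The sketch below uses a made-up flat signal with one large value. The helper `moving_average` and the 3-point width are assumptions for illustration, and you may prefer a different kernel width for the exercise.

```python
import numpy as np

def moving_average(signal, width=3):
    """Smooth a signal with equal weights of 1/width (zero-padded at the edges)."""
    kernel = np.ones(width) / width
    return np.convolve(signal, kernel, mode="same")

# Made-up data: a flat signal, with and without a single spike
clean = np.zeros(9)
spiky = clean.copy()
spiky[4] = 9.0

smoothed_clean = moving_average(clean)
smoothed_spiky = moving_average(spiky)
print(smoothed_spiky)
```

In this toy case the spike is not removed by smoothing. Its height drops to one third, but it spreads into the two neighboring samples, which is worth keeping in mind when you compare your two filtered plots.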