# Todo before class

Install plotly, scipy

Start jupyter notebook in chrome?

https://towardsdatascience.com/fourier-transformation-and-its-mathematics-fff54a6f6659

Download data

# Beginning of class

HW review?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
import plotly.express as px
from scipy import signal

# Basic Data Exploration

In [None]:
df = pd.read_csv('temperatures/temperature.csv')

In [None]:
df.head()

In [None]:
df['datetime'] = pd.to_datetime(df['datetime'])

In [None]:
plt.plot(df['datetime'], df['Saint Louis'])

In [None]:
plt.plot(df['datetime'], df['Boston'])

#### Discussion

Why are the temperature values between 250 and 310? What might you change them to?

Why does the data look so noisy?

What kinds of things would you like to be able to do to clean up this data?

What problems would you like to be able to check for?
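One possible answer to the first question, as a minimal sketch: values in the 250 to 310 range look like kelvin, so subtracting 273.15 would give degrees Celsius. The three-row `sample` frame below uses made-up values and is only a stand-in for the real table; the kelvin assumption should be checked against the dataset's documentation.

```python
import pandas as pd

# Hypothetical sample with made-up readings, standing in for the real data
sample = pd.DataFrame({
    'datetime': pd.to_datetime(['2012-10-01 13:00', '2012-10-01 14:00', '2012-10-01 15:00']),
    'Boston': [287.17, 287.19, 287.23],
    'Saint Louis': [286.18, 286.19, 286.20],
})

# Every column except the timestamp is assumed to hold a temperature in kelvin
city_columns = [c for c in sample.columns if c != 'datetime']

celsius = sample.copy()
celsius[city_columns] = sample[city_columns] - 273.15  # kelvin -> degrees Celsius
print(celsius)
```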

# Dealing with missing values in time series data

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
px.line(df, x='datetime', y='Vancouver')

In [None]:
df.loc[df['Vancouver'].isnull()]

In [None]:
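# np.interp(x, xp, fp): x = points to evaluate at, xp = x-coordinates of the known records, fp = their y-values
# np.interp works on numbers, so convert the timestamps to integer nanoseconds first
x = df['datetime'].astype('int64')
knowns = df.loc[df['Vancouver'].notnull(), ['datetime','Vancouver']]
xp = knowns['datetime'].astype('int64')
fp = knowns['Vancouver'].to_numpy()

test = pd.DataFrame({'datetime': df['datetime'], 'Vancouver': np.interp(x, xp, fp)})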

In [None]:
px.line(test, x='datetime', y='Vancouver')

#### Discussion

What else could you do to address missing values, especially if they were frequent and evenly distributed?

When might it be a bad idea to use the np.interp function, and what other options might you have? (Hint: check out the scipy interpolation functions)
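As a starting point for the hint, here is a sketch of two alternatives to `np.interp`: pandas' own `interpolate(method='time')` and SciPy's `PchipInterpolator`, a shape-preserving cubic. The short series below is synthetic (made-up values), and PCHIP is just one of several SciPy interpolators you might try.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import PchipInterpolator

# Synthetic hourly series with gaps (made-up values, not from the dataset)
index = pd.Timestamp('2020-01-01') + pd.to_timedelta(np.arange(8), unit='h')
temps = pd.Series([280.0, 281.5, np.nan, np.nan, 284.0, 283.0, np.nan, 281.0], index=index)

# Option 1: linear interpolation weighted by the actual time gaps
linear_fill = temps.interpolate(method='time')

# Option 2: PCHIP, a cubic that does not overshoot the known points
hours = np.asarray((temps.index - temps.index[0]) / pd.Timedelta(hours=1), dtype=float)
known = temps.notna().to_numpy()
pchip = PchipInterpolator(hours[known], temps.to_numpy()[known])
pchip_fill = pd.Series(pchip(hours), index=temps.index)

print(pd.DataFrame({'original': temps, 'linear': linear_fill, 'pchip': pchip_fill}))
```

Both methods only draw a curve through nearby known points, so neither can recover a daily cycle that falls entirely inside a long gap.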

# Intro to Fourier Analysis

In [None]:
x = np.arange(1,100,1)
y = np.sin(x/3)
plt.plot(x,y)
plt.show()

In [None]:
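# Magnitude of the FFT (an amplitude spectrum; squaring it would give power)
psd = np.abs(np.fft.rfft(y))
# Frequencies in cycles per sample (default sample spacing d=1)
freqs = np.fft.rfftfreq(len(y))
plt.plot(freqs, psd)
# sin(x/3) has frequency 1/(2*pi*3) = 1/(6*pi) cycles per sample
plt.axvline(1/(6*np.pi), c='r')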

#### Discussion/exploration

Try changing the period of the sine function.

Try adding multiple sine functions of different periods together - what happens?
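One way to explore the second prompt, as a sketch: add two sines with different periods and check where the spectrum peaks. The periods (20 and 50 samples), the amplitudes, and the length `n` are arbitrary choices for illustration; `n` is chosen so that both periods fit a whole number of times, which puts each peak in a single frequency bin.

```python
import numpy as np

n = 1200                      # 1200 is a multiple of both 20 and 50
x = np.arange(n)

# Sum of two sines: period 20 (amplitude 1) and period 50 (amplitude 0.5)
y = np.sin(2 * np.pi * x / 20) + 0.5 * np.sin(2 * np.pi * x / 50)

amplitude = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(n)    # cycles per sample

# The two largest spectral values sit at the two input frequencies
top_two = np.sort(freqs[np.argsort(amplitude)[-2:]])
print(top_two)                # frequencies 1/50 and 1/20
```

Because the Fourier transform is linear, the spectrum of the sum is the sum of the spectra, so each component shows up as its own peak.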

# Fourier analysis applied to real data; Welch's method

In [None]:
psd = np.abs(np.fft.rfft(test['Vancouver']))
fs = np.fft.rfftfreq(len(test['Vancouver']))

In [None]:
plt.plot(fs, psd)
# Reference lines, assuming hourly samples: one cycle per day and one cycle per year
plt.axvline(x=1/24, c='red')
plt.axvline(x=1/(24*365), c='red')
plt.xscale('log')
plt.yscale('log')

In [None]:
fs, psd = signal.welch(test['Vancouver'], nperseg=10000, window='hann')

In [None]:
plt.plot(fs, psd)
# Reference lines, assuming hourly samples: one cycle per day and one cycle per year
plt.axvline(x=1/24, c='red')
plt.axvline(x=1/(24*365), c='red')
plt.xscale('log')
plt.yscale('log')

#### Discussion/exploration

What are the advantages and disadvantages of making "nperseg" in Welch's method smaller or larger?
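To make the trade-off concrete, here is a sketch that runs `signal.welch` with two segment lengths on a synthetic signal (a 24-sample sine plus noise; all values are made up). With the default sampling frequency of 1, the spacing between frequency bins is `1/nperseg`, while the number of segments averaged together shrinks as `nperseg` grows.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
n = 24 * 365 * 2                       # two "years" of hourly samples
t = np.arange(n)
series = np.sin(2 * np.pi * t / 24) + rng.normal(scale=0.5, size=n)

results = {}
for nperseg in (256, 4096):
    freqs, power = signal.welch(series, nperseg=nperseg, window='hann')
    # Rough segment count for the default 50% overlap
    step = nperseg - nperseg // 2
    results[nperseg] = {
        'resolution': freqs[1] - freqs[0],          # spacing between frequency bins
        'peak_frequency': freqs[np.argmax(power)],  # should land near 1/24
        'segments': (n - nperseg // 2) // step,
    }
    print(nperseg, results[nperseg])
```

Short segments give many averages (a smoother estimate) but coarse frequency bins; long segments resolve nearby frequencies better but average fewer segments, so the estimate is noisier.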

# Rolling Windows and Aggregation

In [None]:
df['date']=df['datetime'].dt.date

In [None]:
px.line(df.groupby('date').agg({'Vancouver':'mean'}).reset_index(), x='date', y='Vancouver')

In [None]:
plt.plot(df['Vancouver'].rolling(1000).mean())

#### Discussion

Use the rolling() method to find a measure of variance in the data instead of the mean. Check out the pandas rolling documentation to get a sense of what other functions you could use if you wanted to.

Pick a coastal city and a midwestern city, plot the rolling variance of each as a time series, and compare them.
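A possible shape for this exercise, sketched on synthetic data: the `coastal` and `inland` columns below are made-up series (the inland one has a larger daily swing by construction), and the one-week window is an arbitrary choice. With the real data you would swap in two city columns from `df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
hours = np.arange(24 * 60)                     # 60 days of hourly samples
daily = np.sin(2 * np.pi * hours / 24)

# Made-up stand-ins for a coastal and an inland city
demo = pd.DataFrame({
    'coastal': 285 + 3 * daily + rng.normal(scale=1.0, size=hours.size),
    'inland': 285 + 8 * daily + rng.normal(scale=1.0, size=hours.size),
})

window = 24 * 7                                # one week of hourly readings
rolling_std = demo.rolling(window=window).std()

# The first window-1 rows are NaN because the window is not yet full
print(rolling_std.dropna().describe())
```

Replacing `.std()` with `.var()`, `.min()`, `.max()`, or `.quantile(0.9)` gives other rolling summaries with the same call pattern.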

# Exponential smoothing/exponential moving average

In [None]:
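# Pull the column out once; indexing df.iloc[k]['Boston'] inside the loop is very slow
boston = df['Boston'].to_numpy()
smoothed_boston = np.zeros(len(boston))

factor = 0.1  # weight given to each new observation

# Seed with the first non-missing value so a NaN in the first row cannot propagate
smoothed_boston[0] = boston[~np.isnan(boston)][0]

for k in range(1, len(boston)):
    if np.isnan(boston[k]):
        # Missing reading: carry the previous smoothed value forward
        smoothed_boston[k] = smoothed_boston[k-1]
    else:
        smoothed_boston[k] = (1-factor)*smoothed_boston[k-1] + factor*boston[k]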

In [None]:
plt.plot(df['Boston'], label='orig')
plt.plot(smoothed_boston, label='smoothed')
plt.legend()

#### Discussion

Why is this called "exponential"? (Try writing out what the first few terms work out to be)

What happens if you make the "factor" in the code much smaller? Try 0.01 and 0.001.

Do you expect this to be faster or slower than implementing a rolling average?

Try running the two methods and timing them (the time module is enough for a rough measurement). Which method is faster?

In applications where you want to smooth data as it arrives in real time, it's very common to use exponential smoothing instead of a rolling average - why do you think this might be?
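For comparison, here is a sketch showing that the same recursion is available in pandas as `ewm(alpha=..., adjust=False).mean()`, along with the weights that give the method its name. The input is a synthetic random walk with no missing values; with NaNs present, pandas' handling may differ from the carry-forward rule used in the loop above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
data = pd.Series(280 + rng.normal(size=500).cumsum())   # made-up series
factor = 0.1

# Hand-written recursion: s[k] = (1 - factor) * s[k-1] + factor * x[k]
values = data.to_numpy()
loop_smoothed = np.empty(len(values))
loop_smoothed[0] = values[0]
for k in range(1, len(values)):
    loop_smoothed[k] = (1 - factor) * loop_smoothed[k - 1] + factor * values[k]

# Same recursion via pandas (adjust=False selects the recursive form)
ewm_smoothed = data.ewm(alpha=factor, adjust=False).mean()
print(np.allclose(loop_smoothed, ewm_smoothed))

# Unrolling the recursion: the observation j steps back gets weight
# factor * (1 - factor)**j, which decays exponentially with j
weights = factor * (1 - factor) ** np.arange(5)
print(weights)
```

Each update needs only the previous smoothed value and the new observation, so nothing has to be stored beyond a single number; a rolling average has to keep the whole window.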
