In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit

Let us define some data as a simple dictionary and then convert it to a pandas dataframe

In [None]:
data_dict = { 
	'run1': [360, 0.91, 20.1],
	'run2': [305, 0.98, 22.0]
}

Converting this to a pandas dataframe is straightforward. We can also pass labels via "index" to say what each of the entries means, which makes the data easier to search.

In [None]:
data_frame = pd.DataFrame(data_dict, index=['T', 'S2', 'angle'])

We can easily output this as a nice table

In [None]:
data_frame

If we use the print function it is basically the same, just not quite as nice looking. 

In [None]:
print(data_frame)

Let's look at how to extract info. To get the data for a single entry in the dict, we can just use its name as an index.

In [None]:
data_frame['run1']

To get the temperature info from each of the runs, we access the data a little bit differently, using the "loc" indexer, which selects rows by label.

In [None]:
data_frame.loc['T']

These two can be combined to access, e.g., the temperature of run1.

In [None]:
data_frame['run1'].loc['T']

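Alternatively, since T is the first entry, we can access it by its integer position 0 using "iloc" (a bare integer in square brackets is deprecated on a label-indexed Series in recent pandas versions), or pass the index label (since we defined one):

In [None]:
data_frame['run1'].iloc[0]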

In [None]:
data_frame['run1']['T']
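
The lookups above chain two indexing steps (column first, then row). As a minimal sketch, the same value can also be fetched in a single step; the frame is rebuilt here only so the snippet runs on its own, and the variable names `t_run1`, `t_run1_at`, and `t_run1_pos` are made up for illustration.

```python
import pandas as pd

data_frame = pd.DataFrame(
    {'run1': [360, 0.91, 20.1], 'run2': [305, 0.98, 22.0]},
    index=['T', 'S2', 'angle'],
)

# Row label first, then column label, in one indexing step
t_run1 = data_frame.loc['T', 'run1']

# .at is the scalar-only counterpart of .loc
t_run1_at = data_frame.at['T', 'run1']

# Purely positional: row 0, column 0
t_run1_pos = data_frame.iloc[0, 0]

print(t_run1, t_run1_at, t_run1_pos)
```

Single-step indexing is also the safer habit when assigning values, since chained indexing may end up writing to a temporary copy.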

It is easy to write dataframes to CSV files (and read them).  

In [None]:
data_frame.to_csv(r'test_data.csv', index=True)

In [None]:
data_frame_from_csv = pd.read_csv('test_data.csv',index_col=0)

In [None]:
data_frame_from_csv
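
To check that nothing is lost in the write/read round trip without touching the disk, one option is to write to an in-memory text buffer. This is only a sketch: `buffer` and `round_trip` are illustrative names, and the frame is rebuilt so the snippet is self-contained.

```python
import io

import numpy as np
import pandas as pd

data_frame = pd.DataFrame(
    {'run1': [360, 0.91, 20.1], 'run2': [305, 0.98, 22.0]},
    index=['T', 'S2', 'angle'],
)

# Write the CSV text into memory instead of a file
buffer = io.StringIO()
data_frame.to_csv(buffer, index=True)
buffer.seek(0)

# index_col=0 tells pandas that the first column holds the row labels
round_trip = pd.read_csv(buffer, index_col=0)

print(list(round_trip.index))
print(np.allclose(round_trip.to_numpy(), data_frame.to_numpy()))
```

Without `index_col=0`, the row labels would come back as an ordinary data column rather than as the index.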

Let's now read in some real data as a CSV file.  This is the number of  COVID-19 cases in the state of Tennessee per day from tn.gov [Daily Case Information](https://www.tn.gov/health/cedep/ncov/data/downloadable-datasets.html)

In [None]:
tenn_data = pd.read_csv('datasets/Tenn_Pandemic_Data.csv')

In [None]:
tenn_data

From the way this has been imported, you can see that the column names are 'DATE', 'TOTAL_CASES', 'NEW_CASES', etc., so we can use these to extract specific information. For example, let's plot 'TOTAL_CASES' against 'DATE'.

In [None]:
plt.plot(tenn_data['DATE'], tenn_data['TOTAL_CASES'])

The x-axis is rather difficult to read, so let's do some quick formatting.

In [None]:
#first let's make a copy of the data that gives the date/time of update
formatted_date = tenn_data['DATE'].copy()
#next, let's only keep the date, discarding the time; this is just the first
#10 characters of the string
formatted_date = [date[:10] for date in formatted_date]

fig, ax = plt.subplots()

#to be able to read the labels, we'll use a built-in function to tilt them
fig.autofmt_xdate()

# and then define the number of tick markers to show a more manageable set
ax.xaxis.set_major_locator(plt.MaxNLocator(10))

ax.plot(formatted_date, tenn_data['TOTAL_CASES'])
ax.invert_xaxis()

plt.ylabel('number of positive cases')
plt.xlabel('date')
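
The cell above treats the dates as plain strings and flips the axis by hand. An alternative, sketched here on a tiny made-up table, is to parse the strings into real timestamps and sort them. The timestamp format in `toy` is an assumption; the real file may use a different layout.

```python
import pandas as pd

# Hypothetical stand-in for the real file: newest row first, timestamp strings
toy = pd.DataFrame({
    'DATE': ['2021-05-25 00:00:00', '2021-05-24 00:00:00', '2021-05-23 00:00:00'],
    'TOTAL_CASES': [300, 200, 100],
})

# Parse the strings into timestamps, then sort from oldest to newest
toy['DATE'] = pd.to_datetime(toy['DATE'])
toy = toy.sort_values('DATE').reset_index(drop=True)

print(toy)
```

With a datetime column, matplotlib places the points on a true time axis, so the string slicing and `invert_xaxis()` steps would no longer be needed.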

It's rather easy to just restrict the data range we plot, e.g., let's just pick out data from the "beginning" where the number of cases was most rapidly growing (the end of this window roughly corresponds  with the end of the first wave, around 07-31-20).

In [None]:
fig, ax = plt.subplots()

#to be able to read the labels, we'll use a built in function to tilt them
fig.autofmt_xdate()
# and then define the number of tick markers to show
ax.xaxis.set_major_locator(plt.MaxNLocator(10))

ax.plot(formatted_date[300:-1], tenn_data['TOTAL_CASES'][300:-1])
plt.ylabel('number of positive cases')
ax.invert_xaxis()

Let us use the scipy optimize routine to fit this region, so we can model the initial exponential growth. For more information on scipy, follow this [link](https://docs.scipy.org/doc/scipy/reference/index.html).

In [None]:
def exp_func(x, a, b, c):
    return a * np.exp(-b * x) + c

In [None]:
# Create an array of 'N' points from 1 to 0 corresponding to the number of entries
# This is necessary since the x-axis is dates rather than floats, and is in a reversed order

xdata = np.linspace(1, 0, len(tenn_data['TOTAL_CASES']))

#fit the curve, but limit it to the early-growth region selected above
popt, pcov = curve_fit(exp_func, xdata[300:-1], tenn_data['TOTAL_CASES'][300:-1])
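
Before trusting a fit on real data, it can help to confirm that the same call recovers parameters we already know. The sketch below does this on synthetic, noise-free data; `true_params` and the starting guess `p0` are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_func(x, a, b, c):
    return a * np.exp(-b * x) + c

# Synthetic, noise-free data generated from known parameters
true_params = (2.0, 3.0, 0.5)
x = np.linspace(0, 1, 50)
y = exp_func(x, *true_params)

# p0 is the initial guess; curve_fit starts from all ones when it is omitted
popt, pcov = curve_fit(exp_func, x, y, p0=[1.0, 2.0, 0.0])

print(popt)
```

Keeping x on a 0 to 1 scale, as the cell above does, keeps the argument of the exponential moderate, which tends to make the optimizer better behaved than feeding it raw day counts.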

Curve fit doesn't give you an R-squared value by default, so we just need to do a few quick calculations.

In [None]:
residuals = tenn_data['TOTAL_CASES'][300:-1]- exp_func(xdata[300:-1], *popt)
#calculate residual sum of squares
ss_res = np.sum(residuals**2)
#get the total sum of squares
ss_tot = np.sum((tenn_data['TOTAL_CASES'][300:-1]-np.mean(tenn_data['TOTAL_CASES'][300:-1]))**2)
#get the r-squared value
r_squared = 1 - (ss_res / ss_tot)
print(r_squared)
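
Since the same three steps come up after every fit, they can be wrapped in a small helper. This is a sketch only; the name `r_squared_score` is made up here and is not part of scipy.

```python
import numpy as np

def r_squared_score(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared_score(y, y)                          # prediction equals data
baseline = r_squared_score(y, np.full_like(y, y.mean()))  # always predict the mean

print(perfect, baseline)
```

The two test cases bracket the usual range: a perfect prediction scores 1, and a model that only ever predicts the mean scores 0.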

Now we can plot our fit to see how well it describes the total case counts outside of that region. If you run the cell below, you should see that the initial exponential growth overpredicts the total cases further into 2020 and 2021.

In [None]:
fig, ax = plt.subplots()

#to be able to read the labels, we'll use a built-in function to tilt them
fig.autofmt_xdate()
# and then define the number of tick markers to show
ax.xaxis.set_major_locator(plt.MaxNLocator(10))

#plot all of the data
ax.plot(formatted_date, tenn_data['TOTAL_CASES'], marker='o', ls='')
ax.invert_xaxis()

#plot the fit of the selected region
#to plot the fit, we'll just pass our xdata points and the fitted a,b,c values
# (stored in popt) to the function
ax.plot(formatted_date,  exp_func(xdata, *popt), 'r--')

plt.ylabel('number of positive cases')
plt.xlabel('date')

#change to a log scale
plt.yscale("log")

The plot below gives an alternative way to evaluate our exponential fit. A value of 1 means the prediction matched the actual data perfectly. A value of less than 1 means we overpredicted the actual data. Plots like this can show you whether your fit is good: the points should scatter randomly both above and below 1. In this case, our fit stops performing even remotely acceptably outside of our fit domain (April 2020 to August 2020).

In [None]:
fig, ax = plt.subplots()
fig.autofmt_xdate()
# and then define the number of tick markers to show
ax.xaxis.set_major_locator(plt.MaxNLocator(10))

#plot all of the data
ax.plot(formatted_date[0:-1],  tenn_data['TOTAL_CASES'][0:-1]/exp_func(xdata[0:-1], *popt), 'o', ls='--')
ax.invert_xaxis()

plt.ylabel('actual/predicted positive cases')
plt.xlabel('date')

It's been over a year that we've been dealing with the pandemic. How do our infection rates compare to last year? We can do this by using the Get_Rate function to evaluate the number of new infections each day. The Reverse function will also help get our data in the forward orientation through time.

In [None]:
def Reverse(it_object):
    lst = list(it_object)
    lst.reverse()
    return lst

def Get_Rate(series):
    lst = list(series)
    delta_cases = []
    for i in np.arange(0,len(lst[1:])):
        delta_cases.append(int(lst[i+1]-lst[i]))
    return delta_cases

fig, ax = plt.subplots()
fig.autofmt_xdate()
# and then define the number of tick markers to show
ax.xaxis.set_major_locator(plt.MaxNLocator(10))

#plot all of the data
ax.plot(Reverse(list(formatted_date[1:31])),  
        Get_Rate(Reverse(tenn_data['TOTAL_CASES'][0:31])), 'o', ls='--', 
        color='purple', label='This Year')
ax.plot(Reverse(list(formatted_date[1:31])),  
        Get_Rate(Reverse(tenn_data['TOTAL_CASES'][365:396])), 'o', ls='--',
        color='blue', label='Last Year' )


plt.ylabel('Covid cases for this year compared to last year')
plt.legend()
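
The `Reverse` and `Get_Rate` helpers spell out the logic step by step. As a cross-check, the same daily differences can be sketched with NumPy slicing and `np.diff`; the totals below are made-up numbers, stored newest-first like the CSV.

```python
import numpy as np

# Cumulative totals stored newest-first, as in the CSV (made-up numbers)
total_newest_first = np.array([130, 120, 105, 100])

# Flip to oldest-first, then difference consecutive days
daily_new = np.diff(total_newest_first[::-1])

print(daily_new)   # [ 5 15 10]
```

For a pandas column, calling `.to_numpy()` first gives an array that can be sliced and differenced the same way.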

One of the most obvious differences between this year and last year is the introduction of the free vaccine in the state. However, since this year we are still getting more positive cases than we had last year, we may be led to conclude that the vaccine is not having any effect or even making the number of cases worse! Let's take a closer statistical look at this correlation.

Let's do a quick statistical analysis concerning the effect of vaccines on our total_cases. Here's some vaccine data from https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations.

In [None]:
# Read in the data
vac_data = pd.read_csv('datasets/us_state_vaccinations.csv')
# Parse out data just for Tennessee
tenn_vac_data = vac_data[vac_data['location']=='Tennessee']
# We will look at this dataset for people considered fully vaccinated
print(len(tenn_vac_data['people_fully_vaccinated']))

#first let's make a copy of the data that gives the date/time of update
formatted_vac_date = tenn_vac_data['date'][:].copy()

#plot the vaccinations
fig, ax = plt.subplots()

#to be able to read the labels, we'll use a built-in function to tilt them
fig.autofmt_xdate()

# and then define the number of tick markers to show a more manageable set
ax.xaxis.set_major_locator(plt.MaxNLocator(10))

ax.plot(formatted_vac_date, tenn_vac_data['people_fully_vaccinated'])

plt.ylabel('number of fully vaccinated people')

We can see that we now have vaccination data for the last 140 or so days. If we align this with our total_cases_data, we can get an idea of the correlation between the two. Let's take a look at this below.

In [None]:
# Because the dates are off, we need to cut out a few dates so the data matches for each date.
# We'll take our data from 02-01-21 to 05-25-21
# We also need to do some work to remove a missing value from our dataset. We don't have vaccine 
# information for the 34th day, so we'll remove that day from the positive_cases_data. 

# Let's grab that Reverse function again
def Reverse(it_object):
    lst = list(it_object)
    lst.reverse()
    return lst

# Let's make sure we're grabbing data from the right dates. 
#This will be printed out at the end to compare with the total_cases_dates
formatted_vac_date = tenn_vac_data['date'][:].copy()
formatted_vac_date = list(formatted_vac_date[20:-3])
# this will turn our values into integers to make them easier to work with
vac_data = list((tenn_vac_data['people_fully_vaccinated'][20:-3]))
for i,value in enumerate(vac_data.copy()):
    # only convert entries that are not NaN
    if not pd.isna(vac_data[i]):
        vac_data[i] = int(value)
# The 14th value (34th in the original dataset) needs to get removed since it doesn't have a value
vac_data.pop(14)
vac_data = np.array(vac_data)

# Similar data workup for the total_cases information from before.
formatted_cases_date = Reverse(list(formatted_date[0:116]))
tenn_cases_data = list(Reverse(list(tenn_data['TOTAL_CASES'][0:116])))
tenn_cases_data.pop(14)
tenn_cases_data = np.array(tenn_cases_data)

formatted_vac_date[0],formatted_cases_date[0],formatted_vac_date[-1],formatted_cases_date[-1]
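
The cell above lines up the two datasets by slicing at hand-picked positions and popping the one missing day. A less position-dependent alternative, sketched here on hypothetical miniature tables, is to join on the date and drop incomplete rows. It assumes both tables store the date in the same string format, which would need checking for the real files.

```python
import pandas as pd

# Hypothetical miniature tables; the value columns mirror the two CSV files
cases = pd.DataFrame({
    'date': ['2021-02-01', '2021-02-02', '2021-02-03', '2021-02-04'],
    'TOTAL_CASES': [1000, 1010, 1025, 1030],
})
vaccinations = pd.DataFrame({
    'date': ['2021-02-01', '2021-02-02', '2021-02-03', '2021-02-04'],
    'people_fully_vaccinated': [50.0, 60.0, None, 80.0],
})

# Inner join on the date, then drop rows where either value is missing
aligned = cases.merge(vaccinations, on='date', how='inner').dropna()

print(aligned)
```

Because rows are matched by date rather than by position, a missing day in either table is removed from both series automatically.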

In [None]:
fig, ax = plt.subplots()
ax.plot(vac_data[1:], Get_Rate(tenn_cases_data),':')

plt.xlabel('number of fully vaccinated people')
plt.ylabel('number of new positive covid cases')

In [None]:
# Now let's fit a straight line to this correlation.

In [None]:
def Fitted_Line(x,a,b):
    return a * x + b

In [None]:
# Fit the curve to the Get_Rate information from tenn_cases_data
# The extra information is to give the solver method some better places to guess.
popt, pcov = curve_fit(Fitted_Line, vac_data[1:], Get_Rate(tenn_cases_data),
                       p0=[-0.001,2000], check_finite=True, 
                       method='trf',bounds = ([-0.002,0],[0,4000]))
fig, ax = plt.subplots()

#plot all of the data
ax.plot(vac_data[1:], Get_Rate(tenn_cases_data),':')

#to plot the fit, we'll just pass our vaccine data points and the fitted a,b values
# (stored in popt) to the function
ax.plot(vac_data[1:],  Fitted_Line(vac_data[1:], *popt), 'r--')

plt.ylabel('number of positive cases')
print(popt)

Let's do some statistics on our function: daily_cases = -5.9e-4 * vaccinated + 1.8e3 <br>
H0: slope = 0 <br>
H1: slope < 0 <br>

For this we need to apply linear regression statistics to get the uncertainty in our parameter a (the slope). <br><br>

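uncertainty = t(alpha, df=n-2) * standard deviation of slope <br>
standard deviation of slope = sqrt( s_yx^2 / SSxx ) <br>
where s_yx^2 = sum((y - y_fit)^2) / (n-2) is the residual variance and SSxx = sum((x - mean(x))^2)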


In [None]:
from scipy import stats
#degrees of freedom
df = len(vac_data[1:]) - 2
# solve for varyx, the residual variance of y about the fitted line
varyx = np.sum((Get_Rate(tenn_cases_data) - Fitted_Line(vac_data[1:], *popt))**2) * df**-1

# solve for SSxx
SSxx = np.sum((vac_data[1:] - vac_data[1:].mean())**2)
    
t = stats.t.ppf(1-0.025, df)
print('The 95% confidence interval of the slope is {0:.1e} +- {1:.1e} cases/vaccination'.format(
    popt[0], t * np.sqrt(varyx / SSxx)))
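
As a sanity check on the hand-written formulas, the standard error of the slope can be compared with the one reported by `scipy.stats.linregress`. The sketch below uses made-up data. Note that `linregress` performs an unconstrained least-squares fit, so on the real data its slope could differ slightly from the bounded `curve_fit` result above.

```python
import numpy as np
from scipy import stats

# Made-up data, only to exercise the formulas
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([9.8, 8.1, 6.2, 3.9, 2.1, 0.2])

dof = len(x) - 2  # degrees of freedom for a two-parameter line

# Ordinary least-squares fit
result = stats.linregress(x, y)

# Same standard error by hand: sqrt(s_yx^2 / SSxx)
residuals = y - (result.slope * x + result.intercept)
var_yx = np.sum(residuals**2) / dof
ss_xx = np.sum((x - x.mean())**2)
se_slope = np.sqrt(var_yx / ss_xx)

# Two-sided 95% interval half-width
t_crit = stats.t.ppf(1 - 0.025, dof)
half_width = t_crit * se_slope

print(result.slope, '+-', half_width)
print(np.isclose(se_slope, result.stderr))
```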

Since the confidence interval does not include 0, the slope is significantly different from 0, so we can reject our null hypothesis and support the alternative hypothesis.

Stepping back to where we started, the dictionaries allow us to easily manage our data space and keep track of lots of different pieces of information that can be easily iterated over.

Let's create some totally fictitious data for T and PE for two different runs (data that would most likely be read in from a simulation energy file or produced by an analysis code, rather than defined by hand). We will then make a dataframe for each run and put these in a dictionary.

In [None]:
run1_data = { 
	'T': [300, 305, 310, 315, 310, 315, 320, 325, 320, 315, 310, 315],
	'PE': [1489, 1523, 1649, 1554, 1634, 1780, 1900, 1843, 1724, 1652, 1400, 1323]
}

run2_data = { 
	'T': [300, 305, 305, 310, 315, 320, 325, 320, 325, 320, 315, 320],
	'PE': [1482, 1512, 1432, 1623, 1723, 1849, 1948, 2200, 2129, 2003, 1802, 1938]
}

r1_pd = pd.DataFrame(run1_data)
r2_pd = pd.DataFrame(run2_data)


sim_data_dict = {'run1': r1_pd,  'run2': r2_pd}


By using a dictionary, we can again home in on specific pieces of information, e.g., only run2.

In [None]:
sim_data_dict['run2']

In [None]:
for sim in sim_data_dict:
    plt.plot(sim_data_dict[sim]['T'], label=sim)
plt.legend()
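
The same loop-over-the-dictionary pattern works for calculations as well as plots. As a sketch, the snippet below collects the per-run mean of every column into one small table; shortened made-up data is used so it runs on its own, and `summary` is an illustrative name.

```python
import pandas as pd

r1 = pd.DataFrame({'T': [300, 305, 310], 'PE': [1489, 1523, 1649]})
r2 = pd.DataFrame({'T': [300, 305, 305], 'PE': [1482, 1512, 1432]})
sim_data_dict = {'run1': r1, 'run2': r2}

# One column per run, one row per quantity
summary = pd.DataFrame({name: df.mean() for name, df in sim_data_dict.items()})

print(summary)
```

Adding a third run to the dictionary would add a third column to the table without changing any other line.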

This is of course not the only way to define a dataspace.  This just happens to be a way I personally like.  

E.g., instead of having a dictionary be the top level container, we could put the run1_data and run2_data dictionaries into a dictionary, then convert that to a pandas dataframe. 

In [None]:
sim_data_dict2 = {'run1': run1_data,  'run2': run2_data}

sim_df = pd.DataFrame(sim_data_dict2)

In [None]:
for sim in sim_df:
    plt.plot(sim_df[sim]['T'], label=sim)
plt.legend()

There are some built-in functions that make it easy to get information out quickly.

In [None]:
print(sim_data_dict['run1']['T'].mean(), '+/-', sim_data_dict['run1']['T'].std())

We can easily export to a numpy array as well. A quick note: pandas uses Bessel's correction in the standard deviation formula, that is, it divides by N-1 rather than N. So the pandas value above will be slightly different from the one given by numpy.std().

In [None]:
T_array = sim_data_dict['run1']['T'].to_numpy()
print(T_array)

In [None]:
print(T_array.mean(), '+/-', T_array.std())
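
The difference between the two libraries comes down to the `ddof` ("delta degrees of freedom") argument, which both accept. A short sketch using the run1 temperatures shows that either convention can be reproduced in either library:

```python
import numpy as np
import pandas as pd

T = pd.Series([300, 305, 310, 315, 310, 315, 320, 325, 320, 315, 310, 315])
T_array = T.to_numpy()

sample_std = T.std()            # pandas default: ddof=1, divide by N-1
population_std = T_array.std()  # NumPy default: ddof=0, divide by N

print(sample_std, population_std)

# Passing ddof explicitly makes the two libraries agree
print(np.isclose(sample_std, T_array.std(ddof=1)))
print(np.isclose(population_std, T.std(ddof=0)))
```

Stating `ddof` explicitly in analysis code avoids any ambiguity about which convention a reported uncertainty uses.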