Vector averaging upgrade #94

Merged: 39 commits merged into master on Mar 24, 2021
Conversation

@hivanov-nrel (Contributor)

Hi everyone, I am not sure why this is saying that I am 33 commits ahead of the master branch. I am pretty sure most of those commits were already implemented in my loads module upgrade PR. Let me know if I am missing something.

In any case, this PR should only involve the vector averaging upgrade commit. It adds functionality to the stats function so that statistics are calculated more accurately for directional channels that need vector averaging.

Ivanov and others added 30 commits July 13, 2020 11:07
@ssolson (Contributor) commented Feb 12, 2021

Thank you @hivanov-nrel. The "33 commits ahead" is essentially just how many commits you are adding to the repository; if you were 0 ahead, you would have no changes.

Can you look into why your commits are not passing the build before we review this PR?

@rpauly18 (Contributor) left a comment

Looks good. Thanks for adding this update!

@ssolson (Contributor) left a comment

Hey Chris, I have a couple of things we need to change and a couple of questions. An overarching question I have is why one would use the get_statistics function instead of just calling .describe(). I get the same answer using .describe(), and it is all in one DataFrame.

I would say if we keep the get_statistics function, we need to adjust the results so that the returned index is meaningful. Currently the function returns the first time index for min, max, etc. I believe we should return one DataFrame indexed by max, min, etc., like describe() does.

[screenshot of example output omitted]
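To make the describe() comparison concrete, here is a hedged sketch of the two output layouts (the channel name and values are illustrative, not from mhkit):

```python
import pandas as pd
import numpy as np

# Hypothetical 10-minute record sampled at 1 Hz.
idx = pd.date_range("2021-01-01 12:03:47", periods=600, freq="s")
df = pd.DataFrame({"power": np.linspace(90.0, 110.0, 600)}, index=idx)

# describe() gives one DataFrame indexed by statistic name:
summary = df.describe()
print(summary.loc[["mean", "max", "min", "std"]])

# A get_statistics-style layout instead keeps one row per window,
# indexed by the window's first timestamp, which concatenates cleanly
# across many files.
means = pd.DataFrame({"power": [df["power"].mean()]}, index=[df.index[0]])
print(means)
```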

@@ -104,7 +104,7 @@ def tip_speed_ratio(rotor_speed,rotor_diameter,inflow_speed):
'''
Function used to calculate the tip speed ratio (TSR) of a MEC device with rotor

Parameters
Parameters:
Contributor:

The colons at the end of Parameters and Returns will not build correctly. Please remove the colons from all Parameters and Returns headings in the docstrings.

Contributor Author:

This is a weird one related to my initial concerns when I posted this PR. I know this change had already been implemented in a previous PR months ago (as with a lot of the commits that are labeled as "ahead" of the master). Rebecca mentioned it might be a case where I worked on something before doing a git pull, which caused the commit history to get screwy. Anyway, I will update this particular change. My main concern is that when this gets merged into master, it doesn't mess up the commit history somehow. I am not much of an expert on GitHub, so maybe it's harmless, but I just wanted to throw that out there.

mhkit/utils.py Outdated
for v in vector_channels:
Ux = sum(np.sin(datachunk[v]*np.pi/180))/len(datachunk)
Uy = sum(np.cos(datachunk[v]*np.pi/180))/len(datachunk)
vector_avg = (90 - np.arctan2(Uy,Ux)*180/np.pi) # number doesnt seem right
Contributor:

Concerning comment. Let us resolve this prior to merging. What is the issue here?

Contributor Author:

Sorry, this is just a leftover comment from when I was developing. I will erase it.
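For reference, the unit-vector averaging idea in the quoted snippet, averaging sines and cosines and converting back to a compass bearing, can be sketched in a self-contained form (this is an illustration, not the mhkit implementation; the function name is made up):

```python
import numpy as np

def vector_average_deg(angles_deg):
    """Circular mean of compass bearings in degrees (0 = north, clockwise)."""
    rad = np.deg2rad(np.asarray(angles_deg, dtype=float))
    Ux = np.mean(np.sin(rad))  # mean east component
    Uy = np.mean(np.cos(rad))  # mean north component
    avg = 90.0 - np.rad2deg(np.arctan2(Uy, Ux))
    return avg % 360.0         # wrap into [0, 360)

# Bearings of 350 and 10 degrees average to north, not to the
# scalar mean of 180.
print(vector_average_deg([350, 10]))
```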

mhkit/utils.py Outdated
Comment on lines 80 to 81
if vector_avg<0: vector_avg = vector_avg+360
elif vector_avg>360: vector_avg = vector_avg-360
Contributor:

Is it possible that the vector average could be off by more than 1 rotation? E.g. 720 degrees?

Contributor Author:

No, this is related to the previous line where vector_avg is calculated: the arctan2 output can exceed 90 degrees, which results in a negative angle. The if statement converts that to a positive angle. The elif branch is unnecessary, so I'll remove it.
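This can be checked numerically: np.arctan2 returns angles in (-180, 180], so 90 minus that output falls in [-90, 270). It can be negative but can never exceed 360, which is why only the first branch is needed. A quick sketch over the four cardinal directions:

```python
import numpy as np

# np.arctan2 returns angles in (-180, 180], so 90 - arctan2(...) lies in
# [-90, 270): it can be negative but can never exceed 360.
raws = []
for y, x in [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]:
    raw = 90.0 - np.degrees(np.arctan2(y, x))
    raws.append(raw)
    print(raw, "->", raw + 360.0 if raw < 0 else raw)
```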

mhkit/utils.py Outdated
if vector_avg<0: vector_avg = vector_avg+360
elif vector_avg>360: vector_avg = vector_avg-360
means[i][v] = vector_avg # overwrite scalar average for channel
magsum = round((Ux**2 + Uy**2)*1e8)/1e8 # round to 8th decimal place
Contributor:

Why are we rounding this number to single precision?

Contributor Author:

This is to prevent very rare cases where round-off error in some data acquisition systems can cause the following calculation to hiccup. It could be unnecessary.
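For context on the hiccup being guarded against: Ux² + Uy² should never exceed 1 for averaged unit vectors, but accumulated round-off can push it marginally past 1, making a subsequent sqrt(1 - magsum) step produce NaN. A branch-free alternative to decimal rounding is clipping (a sketch; that epsilon = sqrt(1 - magsum) is the next step is an assumption based on the discussion below):

```python
import numpy as np

magsum = 1.0 + 1e-15                     # simulated round-off overshoot
# np.sqrt(1.0 - magsum) here would yield NaN (sqrt of a negative float).
magsum_safe = np.clip(magsum, 0.0, 1.0)  # clamp back into the valid domain
epsilon = np.sqrt(1.0 - magsum_safe)
print(epsilon)  # 0.0
```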

@hivanov-nrel (Contributor Author)

@ssolson Thanks for reviewing. Sorry for the late reply. I am actively working to answer each of your comments.

As for your overarching concern, I agree that describe() gets you the same type of results in most cases. But directly comparing describe() to get_statistics(), I do think get_statistics() is more robust in handling a wider range of cases. For instance, it does a QC check on the raw timestamps, it allows the user to resample the data regardless of file/DataFrame length, and the vector averaging upgrade is important for directional channels (which I assume users will definitely have).

So I think it comes down to whether we should use describe() within the get_statistics function, the main difference being how we want the output organized. In my experience, it is most useful to organize the stats indexed by timestamp, especially when you are looping through thousands of 10-minute files and concatenating all of that data together. I am not entirely happy that the result ends up as four separate DataFrames. I personally would use a multi-level DataFrame indexed by time and stat, but I don't think that's possible within MHKiT. Alternatively, if describe() were applied, I think you would run into issues with how the data is organized, especially when concatenating all the files together.

Let me know if this doesn't make sense.

@ssolson (Contributor) commented Mar 22, 2021

Hey Chris, I finally got a chance to review this again with your updated input. I believe what you have written here is the pandas resample function with specific methods (mean, max, min, std): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
This method has a pad option for handling NaNs.
I am not aware of any limits on DataFrame size with respect to describe or resample, which you mention above for calling describe.

The quick timestep check seems useful.

I think the vector averaging you have done here seems very useful, and I almost think it should be its own function; then we could use vector averaging outside of the statistics. What do you think about breaking out the vector averaging into its own function (is this useful?) and maybe utilizing the built-in resample function in get_stats to make the code more readable and take advantage of the optimizations made by the pandas team for this type of data analysis?

@hivanov-nrel (Contributor Author)

@ssolson resample is a great function which I use from time to time. The reason I created get_statistics was to circumvent one caveat I have found, mostly centered around user workflow and IEC standards rather than pure functionality. The IEC standard usually specifies that data is stored as 10-minute files. When recording data, the start and end time of a file could be 2021-01-01 12:03:47 to 2021-01-01 12:13:47. When post-processing hundreds of these files, you usually run a loop that parses each file, formats it, and calculates and stores statistics. If you were to use resample on such a file to calculate statistics, you would get two values corresponding to times starting at 12:00:00 and 12:10:00 (and they would be incorrect). The function I created avoids this by specifically grabbing chunks of continuous 10-minute data regardless of start time and storing the statistics under that first timestamp value. This is pretty standard timestamp indexing: later in the data analysis process, when you make plots and find outliers, you can easily map those outliers back to the specific 10-minute file that contains them. From there you can troubleshoot the raw data and figure out what happened. Maybe there is a way to force resample to do this, but I am not aware of one.

I would be in favor of breaking out vector averaging into a separate function, but I think it doesn't hurt to also keep it in get_statistics, as I imagine it would be convenient for the user to let the function do it all in one step. But I am sure there are cases where vector averaging can be used independently.

Thoughts?
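The resample caveat described above can be reproduced in a few lines: a 10-minute file starting at 12:03:47, resampled with a 10-minute rule, is split across two clock-aligned bins rather than treated as one window (a minimal sketch; the channel name is illustrative):

```python
import pandas as pd
import numpy as np

# A 10-minute file starting at 12:03:47, sampled at 1 Hz.
idx = pd.date_range("2021-01-01 12:03:47", periods=600, freq="s")
df = pd.DataFrame({"speed": np.ones(600)}, index=idx)

# resample() uses clock-aligned bins: the single file is split across
# two partial windows labeled 12:00:00 and 12:10:00.
binned = df.resample("10min").mean()
print(binned.index.tolist())

# Chunk-based statistics: one window labeled by the file's first timestamp.
stats = pd.DataFrame({"speed_mean": [df["speed"].mean()]}, index=[df.index[0]])
print(stats.index.tolist())
```

For what it's worth, newer pandas versions do accept an `origin="start"` argument to `resample`, which anchors the bins at the first timestamp of the frame being resampled, though per-file labeling across a concatenated record would still need the chunk-based approach.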

@ssolson (Contributor) commented Mar 22, 2021

Thanks for explaining, Chris. I believe I see the problem you are solving here. Okay, so if you think vector averaging would be useful generally, then I believe we are on the same page. I was not suggesting removing vector averaging from the current get_stats function, but rather calling the proposed new broken-out vector averaging function from within get_stats. I outline the idea below, where we call vec_avg from get_stats. I think we are in agreement here. Thank you again for working with me to explain this.

def vec_avg():
    # insert math here
    return vector_avg

def get_stats():
    # calc stats
    if vector_averaging_requested:
        vector_avg = vec_avg()
    return stats

@hivanov-nrel (Contributor Author)

No problem. Glad we are on the same page. I made the update and pushed the commit.

@ssolson (Contributor) left a comment

Chris, thank you for the quick turnaround. I have a couple of minor comments and questions.

mhkit/utils.py Outdated
Parameters
----------
data : pandas Series, numpy array, list
Vector channel to calculate statistics on
Contributor:

So the data in this function is an array-like structure containing data in degrees e.g. (63, 93, 65, 54, 101,...)?

Probably worth noting that the data must be in degrees?

Contributor Author:

Agreed. We definitely should be explicit about that.

mhkit/utils.py Outdated
Comment on lines 118 to 120
if not np.isreal(epsilon): # check if epsilon is imaginary (error)
vector_std = 0
print('WARNING: vector averaging error in calculating epsilon')
Contributor:

The comment says error. The warning message calls it an error. I think we need a bit more clarity around this.

I think the warning should contain why it was flagged e.g. 'WARNING: epsilon contains imaginary values'. Also what is epsilon and what would cause this to have imaginary values? Can we provide better feedback to the user?

Contributor Author:

Epsilon being imaginary comes from the infinitesimal chance that the magsum variable is greater than 1. I basically eliminate this possibility with the round-off error correction I have in place in the step before it, but I left this if statement as a catch-all. I don't expect this warning to ever be thrown (hopefully).

mhkit/utils.py Outdated
@@ -76,6 +86,43 @@ def get_statistics(data,freq,period=600):

return means,maxs,mins,stdevs

def vector_averaging(data):
Contributor:

Can you add a dedicated test for this function? It will help make it clear that it is being tested and serve as an example for using it by itself.

Contributor Author:

will do

else:
vector_std = np.arcsin(epsilon)*(1+0.1547*epsilon**3)*180/np.pi

return vector_avg, vector_std
Contributor:

So it looks like this function might be called directional_data_stats because it returns more than just the average.

Contributor Author:

I had the same thought. I initially left it as vector_averaging since that is what everyone calls it, although I agree it's technically not appropriate. I have updated it to be called vector_statistics instead. I think that's close enough that people should recognize what it's for.
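Putting the pieces from this thread together, a vector_statistics-style function returning both the circular mean and a Yamartino-style standard deviation (the arcsin expression visible in the diff) might look like the following. This is a hedged sketch assembled from the quoted snippets, not the merged mhkit code; epsilon = sqrt(1 - magsum) and the clamping of magsum are assumptions:

```python
import numpy as np

def vector_statistics(data_deg):
    """Circular mean and Yamartino-style std for compass bearings in degrees.

    Illustrative sketch, not the merged mhkit implementation.
    """
    rad = np.deg2rad(np.asarray(data_deg, dtype=float))
    Ux = np.mean(np.sin(rad))                 # mean east component
    Uy = np.mean(np.cos(rad))                 # mean north component
    vector_avg = (90.0 - np.rad2deg(np.arctan2(Uy, Ux))) % 360.0
    magsum = min(Ux**2 + Uy**2, 1.0)          # guard float round-off past 1
    epsilon = np.sqrt(1.0 - magsum)
    vector_std = np.rad2deg(np.arcsin(epsilon) * (1 + 0.1547 * epsilon**3))
    return vector_avg, vector_std

# Bearings clustered around north: the circular mean is near 0 degrees
# and the spread is a few degrees, where a naive scalar mean would fail.
avg, std = vector_statistics([10, 350, 5, 355])
print(avg, std)
```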

mhkit/utils.py Outdated
@@ -22,7 +22,9 @@ def get_statistics(data,freq,period=600):
Sample rate of data [Hz]
period : float/int
Statistical window of interest [sec], default = 600

vector_channels : string or list (optional)
List of channel names that are to be vector averaged
Contributor:

Are these channels required to be in units of degrees?

Contributor Author:

Yes, I'll make that explicit as well.

@ssolson (Contributor) commented Mar 23, 2021

Chris, we also just got the testing fixed. Could you merge the main repo changes into your branch so the builds will start to pass?

@ssolson (Contributor) commented Mar 24, 2021

Fantastic work Chris! I will merge these changes now.

@ssolson ssolson merged commit 238d85f into MHKiT-Software:master Mar 24, 2021