Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pydmap Read #2

Merged
merged 16 commits into from
Jan 7, 2019
Merged

Pydmap Read #2

merged 16 commits into from
Jan 7, 2019

Conversation

mts299
Copy link
Contributor

@mts299 mts299 commented Jan 4, 2019

First pydarn functionality!
DmapRead allows for users to read in dmap files and obtain records in a list of dictionaries.
The original code of dmap was taken from https://github.com/SuperDARNCanada/backscatter
and updated to python 3.6 and improved upon for speed and readability.

README is updated for installation notes and example code.

To test:

import pydarn
dmap = pydarn.DmapRead('20170114.map')
dmap_data = dmap.read_records()

This PR is to focus on the DmapRead functionality and the README.

Testing suite setup, performance test suite, DEVELOPERS.md, and DmapWrite are for other PRs

Marina Schmidt added 15 commits November 28, 2018 17:01
Fixed the file reading error with few other error in testing
and the dmap file. Added some TODO's to look into for the future
redesign.
Made some minor modifications to get pydmap to work with python 3.0.
This included using memoryview instead of buffer for unpacking the C
structs. Using encoding on chars and strings since parsing needs a byte
object. Decoding was also needed for string in unpacking data.

Stream test did not work and further cleanup of the code needs to happen
including a better framework for testing. Performance testing also is
needed to try and speed up the reading of files.
Moved the dmap data structures:
- scalars
- arrays
- record
to its own file for modularity. Cleaned up the code:
- getter/setter property (pythonic)
- use ChainMap for data structure on records for performance
- got rid of some unused methods
Moved read and write for damp to io for better package readbility.
Previous commit also moved various packages around for better
organization and readibility.
Started cleaning up code with better documentation, variable names,
comments and functionality.
Cleaned up dmap.py some more. The following things have changed:
    - added several new expections for more specific errors
    - cleaned up error messages to be more descriptive
    - renamed variable for better clarity
    - re-wrote log messages for better readability
    - some code improvements for modularity and encapsulation
        - byte checking
        - format converting

More work on the reading procedure then writing procedure needs to be
cleaned up still.
Did the final clean up with dmap read. Time to test it!
Restructed the testing suite file directory system and file names
for better readability. Also, applied pep8 style to the io.py file and
restructured how to store records from a dmap file.
Modified the io.py to change the syntax for implementing
the new data structures for dmap record, scalar, and array.
Dmap records in DmapRead has also changed to deque for memory
efficiency and performance reasons.

made some minor modifications to init py for proper importing
of the classes.
All test run and pass. Added an extra sanity check for
empty files. All that is left is documentation.
Documented DmapRead following numpy style guide
for compilation to read the docs.
Documented the dmap data structure file and
did some spell checking on io.py as well.
Keith point out some speed difference in backscatter vs. pydarn.
Main difference was logging being a large factor to time.
Possible solution is logging on a separate thread or
let it be an option.
Minor improvements were also made to pydarn to improve the overall
speed of the code.
WARNING: logging is commented out for now.
This commit focused on decreasing the time of the DmapRead functionality
as much as possible. One problem, to reach minimum amount of time to
read dmap files the logging functionality had to be removed. Further
investigation is needed on making this optional.

Huge improvements were, removing the amount of calls to other
library calls. Some fiddling with unpack and numpy.frombuffer.

Comments are left in the code indicating non-pythonic coding
that allowed for a speed performance.

This current version has a 42% increase in performance compared to
backscatters dmap in python 2.7 and a 55% increase when compared to
an earlier version of pydarn's dmap reader when only what was needed
to convert backscatters dmap to python 3.0.
Updated the README for pydarn installation.
Added unit tests for pydarn data structures.
Update documentation in unit tests.
Set logging to be disabled and reduced logging
information that is not needed given a enough information
in the error messages stored in logging.
Updating README to be more clear on package requirements. Also, added
numpy in the setup.py install for virtual environments.
@ksterne
Copy link
Contributor

ksterne commented Jan 4, 2019

I know I'm not listed as a reviewer, but I couldn't help myself. These are mostly nit-picky comments, but hopefully helpful.

Do we need to add numpy to README.md as a prerequisite since it's been added with this PR? Should "Example" be changed to "Examples"? And then under the first example, the "DMAP files are a SuperDARN" doesn't seem like a complete thought. Is there more to that line that got misplaced?

I'll see if I can find time tomorrow to give it a whirl.

@mts299
Copy link
Contributor Author

mts299 commented Jan 4, 2019

@ksterne No worries, more the marrier for reviewing. Plus others might be busy with the Borealis project, and I would like to push this PR ASAP so I can start building off of it.

Sorry, where do you see that numpy has been added to this PR?
I added numpy into the setup.py just in case it was not installed. So it should be installed on setup if the user does not have it.
I did the same for the python yaml library, but in some cases, you also need to install the development code thus it being a prerequisite.

Yes, I must have written that sentence half awake or near the end of the day. I will re-look over it tomorrow.

Sounds awesome! Please let me know what you think. Currently, I am working on the DEVELOPERS guide, some extra documentation and how to set up a nice test suite without having large files.

@egthomas
Copy link
Member

egthomas commented Jan 6, 2019

Will support for grid files be included at some point? Also, it looks like the xtd fields for power and spectral width are missing from the mapfile_types in dicts_to_file.

@mts299
Copy link
Contributor Author

mts299 commented Jan 6, 2019

Sorry, there is two branches with the same name. Please make sure you are using the one that has the complete README file that runs with the following test code I put in the description.

@egthomas you are using/looking at the dmapwrite code which is not the scope of the this PR. I left a note above dmapwrite that I need to clean it up and review it. That is legacy code from backscatter. Sorry for the confusion.

@mts299
Copy link
Contributor Author

mts299 commented Jan 6, 2019

However, I forgot to add grid files in the unit tests so thank you for pointing that out @egthomas

@egthomas
Copy link
Member

egthomas commented Jan 6, 2019

Ah my mistake, thanks for the clarification!

@asreimer asreimer added the enhancement New feature or request label Jan 7, 2019
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
pydarn/logging_config.yaml Show resolved Hide resolved
@asreimer
Copy link
Contributor

asreimer commented Jan 7, 2019

Hey @mts299! This looks great! Most of my comments are minor things like typos. I was able to get the example code to run.

One major comment I didn't include in my review: I couldn't get the testing to work, except for test/pydmap_tests/test_datastructures.py

For both test_dmapread.py and test_dmapwrite.py:

Traceback (most recent call last):
  File "test_dmapwrite.py", line 17, in <module>
    import test_data
ModuleNotFoundError: No module named 'test_data'

And finally: python tests.py yields:

ImportError: Failed to import test module: tests
Traceback (most recent call last):
  File "tests.py", line 18, in suite
    mod = __import__(t, globals(), locals(), ['suite'])
ModuleNotFoundError: No module named 'pydarn.tests'

@mts299
Copy link
Contributor Author

mts299 commented Jan 7, 2019

@asreimer as the description states:

Testing suite setup, performance test suite, DEVELOPERS.md, and DmapWrite are for other PRs
As in python tests.py is not set up as that will be part of the Testing suite setup PR since we still need to figure out testing data files.

Also, DmapWrite is also not part of this PR so the test_dmapwrite.py would be expected to fail as I have not focused on cleaning or getting the code to work.

Will fix test_dmapread.py module problem right away, however, you need to change the data filenames in that test or download the specific files.

@asreimer
Copy link
Contributor

asreimer commented Jan 7, 2019

Hmmm, the edit to my comment didn't save. I noticed your description after I posted that comment about the testing not working and editted my comment accordingly. 2am might have had something to do with it.

@mts299
Copy link
Contributor Author

mts299 commented Jan 7, 2019

@asreimer Okay, I mean test_dmapread.py and test_datastructures.py should work. However, there is that change you need to do with test_dmapread.py for data files. Which I will bring up sometime soon so I can figure out the test suite :)

Mainly fixed some README typos, test_dmapread.py error and
added some setup features for hopefully better installation.
@asreimer
Copy link
Contributor

asreimer commented Jan 7, 2019

I tested everything using the examples in the README.md and they work fine. Good to go.

That setup.cfg and MANIFEST.in are slick. Really nice work @mts299!

@asreimer asreimer merged commit 7696edb into develop Jan 7, 2019
@asreimer asreimer deleted the pydmap_fix branch January 7, 2019 23:34
pydarn_logger.error(self.message)


class ZeroByteError(Exception):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be renamed to 'InvalidByteError' since its raised if the bytes are 0 or less than zero.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I like Zero because it is concise and something may have set it to zero which might not have made the bytes invalid. However, I get what you mean with the less than zero cases, so I added NegativeByteError which could indicate something really bad happened :/

pydarn/exceptions/pydmap_exceptions.py Show resolved Hide resolved
pydarn/logging_config.yaml Show resolved Hide resolved
"""
class DmapArray(NamedTuple):
name: str
value: np.ndarray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never seen a file that has this, but as part of the DMAP spec, its technically possible to store strings or even whole dmap records in a dmap value. I'm not sure if thats an issue if the type here is ndarray, but something to think about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First off I really would like to see a DMAP example where DMAP records are stored in records. The Spec may be outdated or have legacy code that I would not like to introduce into newer code. Especially if the legacy introduces potential bugs.

I am more than happy to go back if a given example is present in the current DMAP format.

With that said, ndarrays can hold any type of object including strings and DmapRecords so there is no actual issue here (The bittersweet of python :p ). It just informs developers that the object field should be a data type of ndarray. Technically if they store a list or something else nothing will really happen because it is python. Just later on in the downstream something might throw a TypeError but that is the pythonic way.

I think in later python versions type hinting might actually enforce type checking.

https://docs.python.org/3/library/typing.html

pydarn/pydmap/datastructures.py Show resolved Hide resolved
test/pydmap_tests/test_dmapread.py Show resolved Hide resolved
test/pydmap_tests/test_dmapread.py Show resolved Hide resolved
test/pydmap_tests/test_dmapread.py Show resolved Hide resolved
test/pydmap_tests/test_dmapwrite.py Show resolved Hide resolved
dmap_data = dmap_reader.read_records()

# dmap_data[record number][paramter name].value
print(dmap_data[0]['bmnum'].value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to create a getter or some sort of decorator so that the user doesn't have to explicitly call .value on every data value they want? ex. if i call dmap_data[0]['bm_num'] then under the hood it returns me dmap_data[0]['bm_num'].value by default? I don't think this affects too much, but if you are writing a large script or something you are going to have a lot of extraneous looking code for .value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I like the .value because it is explicit and shows self-documenting code (best practices). It also keeps things consistent. Especially with the complaints that RST and DaVitpy have gotten with unclear variable names and code.

For example:

for i in range(record[0]['beam'].shape[1] ):
      amplify *= record[0]['beam'].value

vs.

for i in range(record[0]['beam'].shape[1] ):
      amplify *= record[0]['beam']    

The former is clearer and easier to understand at first glance (which is me in debugging mode) and the latter is more confusing or black magic.

Or to avoid extraneous looking code you can just store it in a variable ahead of time and then use it later to make the lines smaller.

For Example:

Plot (record[0][velocity'].value,  record[0]['time'].value, default .... plot ... settings)

vs.

velocity_data = record[0][velocity'].value
time_data = record[0]['time'].value
Plot (velocity_data,  time_data, default .... plot ... settings)

The latter is also considered cleaner code too. I mean at the end of the day you are going to have to access the record[0][velocity'] to get most of your data (assuming no loops are involved to cut some stuff down) so an extra .value especially with autocomplete seems okay to me :)

Also, I don't know if you can make a getter function/decorator with a collection containing a collection. You are potentially looking at some overhead with a getter method and that means a decrease in performance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I think this is partly a preference thing and I don't think there are necessarily any right or wrong answers here. I see a couple of thumbs ups on my suggestion, so maybe some others can comment on what they think since this is ultimately a decision on top level functionality. Personally, I still stick by my suggestion as a preference provided there is no performance hits.

@egthomas
Copy link
Member

egthomas commented Jan 9, 2019

I don't know if this is the appropriate place to post these links, but in one of the review or code comments I thought I saw the general question "What is dmap format?". Here are a few short guides/tutorials/documentation written by the original developer (Rob Barnes) which may be of some general help:

https://superdarn.github.io/rst/superdarn/src.doc/rfc/0006.html
https://github.com/SuperDARN/rst/wiki/Using-DataMap-Files-In-IDL
https://superdarn.github.io/rst/general/src.lib/dmap/index.html

@kkotyk
Copy link
Contributor

kkotyk commented Jan 9, 2019

@egthomas I'm not sure, but with Borealis, we usually use the issue tracker and use the description tags when we ask our questions or comments. "What is dmap format" would be a good issue with the question tag.

@kkotyk
Copy link
Contributor

kkotyk commented Jan 9, 2019

Other than maybe moving that data question to an issue or something, I think everything else is resolved. I can't submit an approval, so I'm just writing it here.

@egthomas
Copy link
Member

egthomas commented Jan 9, 2019

I don't know where it was in the 100+ comments, but somewhere I'd like to echo Keith's sentiment that ideally the dmap I/O routines should be as general as possible and therefore able to support real-time data streams and files with different formats from the standard SuperDARN rawacf, fitacf, etc.

@mts299
Copy link
Contributor Author

mts299 commented Jan 9, 2019

@egthomas I was not disagreeing, I have not tested streaming and focused more on file reading. I will be revisiting streaming in the test suite. There is also talk about supporting other file formats.

@asreimer
Copy link
Contributor

asreimer commented Jan 9, 2019

@egthomas we are all on the same page with this. And AFAIK, pydmap does still support data streams. @mts299 and I have been hashing out how to support multiple file types today.

Finally, can we please stop commenting on a closed PR and move this discussion into the chat?

@egthomas
Copy link
Member

egthomas commented Jan 9, 2019

@mts299 sorry I wasn't trying to imply anyone was disagreeing, just having trouble keeping up with all of the messages coming through my inbox. This is my last comment on this PR request.

@mts299 mts299 restored the pydmap_fix branch February 11, 2019 23:10
mts299 pushed a commit that referenced this pull request May 16, 2019
From Issue #2 suggestion, added `__str__` and `__repr__`
for better printing of the object that is being used.
Only dmap classes were changed because inherited classes
will obtain the same attributes for printing of the object
but the class name will be adjusted to be same as the actual class
and not the dmap classes.
@mts299 mts299 deleted the pydmap_fix branch July 5, 2019 20:23
@mts299 mts299 added this to Done in IO - Deprecation Jan 3, 2020
@mts299 mts299 added this to the v1.0.0 milestone Jan 3, 2020
mts299 pushed a commit that referenced this pull request Feb 5, 2020
mts299 pushed a commit that referenced this pull request Feb 5, 2020
From Issue #2 suggestion, added `__str__` and `__repr__`
for better printing of the object that is being used.
Only dmap classes were changed because inherited classes
will obtain the same attributes for printing of the object
but the class name will be adjusted to be same as the actual class
and not the dmap classes.
mts299 pushed a commit that referenced this pull request Feb 26, 2020
mts299 pushed a commit that referenced this pull request Feb 26, 2020
From Issue #2 suggestion, added `__str__` and `__repr__`
for better printing of the object that is being used.
Only dmap classes were changed because inherited classes
will obtain the same attributes for printing of the object
but the class name will be adjusted to be same as the actual class
and not the dmap classes.
mts299 pushed a commit that referenced this pull request Feb 26, 2020
mts299 pushed a commit that referenced this pull request Feb 26, 2020
From Issue #2 suggestion, added `__str__` and `__repr__`
for better printing of the object that is being used.
Only dmap classes were changed because inherited classes
will obtain the same attributes for printing of the object
but the class name will be adjusted to be same as the actual class
and not the dmap classes.
mts299 pushed a commit that referenced this pull request Feb 26, 2020
mts299 pushed a commit that referenced this pull request Feb 26, 2020
From Issue #2 suggestion, added `__str__` and `__repr__`
for better printing of the object that is being used.
Only dmap classes were changed because inherited classes
will obtain the same attributes for printing of the object
but the class name will be adjusted to be same as the actual class
and not the dmap classes.
mts299 pushed a commit that referenced this pull request Feb 26, 2020
From Issue #2 suggestion, added `__str__` and `__repr__`
for better printing of the object that is being used.
Only dmap classes were changed because inherited classes
will obtain the same attributes for printing of the object
but the class name will be adjusted to be same as the actual class
and not the dmap classes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

5 participants