Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dateutil kwargs to csv2rec #1210

Merged
merged 2 commits into from Apr 2, 2013
Merged

Conversation

dmcdougall
Copy link
Member

Fixes ambiguous date problems between US/UK date conventions.

My attempt at resolving #208.

I'm not sure this is the best way to resolve this issue. Another option is to keep the dateutil stuff separate from the specific csv2rec kwargs by having only one dateutil_properties kwarg, say. This kwarg would take a dictionary of keyword arguments that are passed directly to the dateutil parser. The functionality would be the same, with the only difference being an aesthetic one.

Another method is suggested by @efiring in the comments in #208.

Edit: @efiring's suggestion is to have one kwarg specifying the exact date in strptime format, rather than having dateutil do it.

Fixes ambiguous date problems between US/UK date conventions.
@@ -2090,7 +2090,7 @@ def extract(r):

def csv2rec(fname, comments='#', skiprows=0, checkrows=0, delimiter=',',
converterd=None, names=None, missing='', missingd=None,
use_mrecords=False):
use_mrecords=False, dayfirst=False, yearfirst=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just use the builtin datetime for this? (showing my naivety here). If a user specifies a string format as a kwarg, use that with strftime. If not, let dateparser do its (frightening) magic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pelson That's a nice extension of @efiring's suggestion. I'm happy with that method, too. I'll leave this for a while and let others weigh in with thoughts and opinions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, thinking about it, that wouldn't fix current scripts...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hopeless; there is nothing reasonably simple you can do to fix existing usage of this function when applied to ambiguous dates. All we can do is make it easier for users to write new code, or fix old code, so it doesn't blow up in their faces. Dateparser is badly designed for our purposes, but for the time being we are stuck with it. Some method of feeding it the kwargs seems like an essential improvement.

@dmcdougall
Copy link
Member Author

This may or may not be helpful.

@pelson
Copy link
Member

pelson commented Mar 29, 2013

@dmcdougall - where do you want to go with this PR? For this to get merged a bit of documentation would be needed, but not much else.

A much more extreme option would be to start down the road of deprecating these routines and reducing the amount of non visualisation code that sits in the mpl codebase - if I ever need to read a CSV file matplotlib would not be the first place I looked. I suspect this option wouldn't be very popular though... 😉

@dmcdougall
Copy link
Member Author

@pelson I think the kwargs option, as @efiring pointed out some time ago, is the way to go. As far as removing CSV support is concerned, I think that is a matter needing discussion and consensus on the mailing list.

I'll push a rebase and doc update for this.

P.S. Thanks for peeling the scabs off these old wounds. They have been sitting for too long and have become stale.

@dmcdougall
Copy link
Member Author

@pelson Done.

pelson added a commit that referenced this pull request Apr 2, 2013
@pelson pelson merged commit ff6d7eb into matplotlib:master Apr 2, 2013
@ultra-andy
Copy link
Contributor

Hello, I've been looking at csv2rec for a bit, trying to understand what looks like a bug in the code.

If a test.csv file contains datetime type entries (ie, file contains "11/01/14 12:00:01 AM" only), and you attempt to run the following script:

import matplotlib.mlab as mlab
import datetime as datetime

dataset = mlab.csv2rec('test.csv', delimiter=',', names=['datetime'], dayfirst = True)
#dataset = mlab.csv2rec('test.csv', delimiter=',',converterd={0: lambda x: datetime.datetime.strptime(x, '%d/%m/%y %H:%M:%S %p'),}, names= ['datetime'], dayfirst = True)

for elem in dataset:
    print(elem, end = "\n")

...then the datetime object echoed misinterprets the datetime by swapping the day and the month.

The problem is that there are inconsistent handlings of formats in the converter candidate dictionary in mlab.csv2rec. If the time entry is a simple date ("11/01/14"), then the dates will be echoed by the print statement correctly. If the column in the .csv file contains a datetime object (ie date and time, for instance "11/01/14 12:00:01 AM"), then the datetime data will be interpreted as though dayfirst and yearfirst have NOT been specified, and so will NOT be echoed correctly.

This happens because the mydate converter (which uses dayfirst and yearfirst) returns an error if d.hour or d.minute or d.second is greater than zero, but the alternative mydateparser converter, which does not return an error, does not interpret dayfirst or yearfirst, so assumes monthfirst. So then datetime strings in the .csv file are not interpreted correctly if they are other than month first format, and are a datetime string with hour or minute or second other than zero.

In other words, mydateparser is NOT consistent with mydate. Mydateparser should reflect the approach used for mydate and also refer to the dayfirst and yearfirst arguments so that datetime elements of the .csv file column are interpreted in a way that matches dayfirst and/or yearfirst arguments in the call, if specified as True.

A fix that works in my own C:\Python33\Lib\site-packages\matplotlib\mlab.py file is to replace this line:

mydateparser = with_default_value(dateparser, datetime.date(1,1,1))

with this block:

def mydateparser(x):
    # try and return a datetime object
    d = dateparser(x, dayfirst=dayfirst, yearfirst=yearfirst)
    return d
mydateparser = with_default_value(mydateparser, datetime.datetime(1,1,1,0,0,0))

This problem is also present in Python 3.4, and probably exists in other releases as well. This appears to be a design bug. I think it is a dangerous bug because it fails in a way that is inconsistent with other aspects of function behaviour, and can silently leave the user with invalid data despite the correct format having been specified.

The proposed fix shouldn't break any code that uses the workaround of explicitly specifying a converterd in the csv2rec arguments (which also works - included as commented out line in script code above).

It would be good to get this bug fixed properly, so let me know if I need to do anything further to achieve this. I'm new to this bug reporting system, so apologies if I've done anything unconventional or improper - please just point me in the right direction!

@jenshnielsen
Copy link
Member

@ultra-andy Can you please create a new issue or pr. Comments on a almost 3 year old pull request that have already been merged are almost sure to be lost.

@ultra-andy
Copy link
Contributor

Thanks Jen, have created new issue #6184 - I went with creating a new issue as I'm new on this and would prefer not to do a pull request myself until I understand how the system works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants