Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a helper method to create FeatureSet from pandas data frames. #292

Merged
merged 17 commits into from
May 19, 2016

Conversation

dmnapolitano
Copy link
Collaborator

Hello! The changes in this pull request should resolve issue #261. Thanks! 😄

'''

if not _HAVE_PANDAS:
logger.warning(('pandas not installed. Please install pandas or '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just raise a ValueError if they don't have pandas, because that's the only time anyone should use from_data_frame.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I take that back. You're not actually using pandas anywhere in here, so why even bother checking if you can import it? Duck typing should prevail here. (How is there no duck emoji?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok word, we had the same thoughts here. To be revised 😄 (Also yeah the lack of duck emoji makes no sense :shipit: )

@dmnapolitano dmnapolitano changed the title #261 pandas.DataFrame helper [WIP] #261 pandas.DataFrame helper Feb 12, 2016
…t there to be an error. Modify travis to run optional pandas tests depending on an environment variable.
return (expected, current)


@attr('have_pandas')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, didn't know about that feature.

@desilinguist
Copy link
Member

I really like this solution but do we really need to run all the tests in the PANDAS=true environments? Seems like its presence won't affect anything except the specific test that we want to run. Am I wrong?

@dmnapolitano dmnapolitano changed the title [WIP] #261 pandas.DataFrame helper #261 pandas.DataFrame helper Feb 15, 2016
@dmnapolitano
Copy link
Collaborator Author

@desilinguist Hmmmmmmmm...so in Travis, when WITH_PANDAS is True, only run the pandas tests and skip the others? I can make that happen 😄 🐼

…kip the others (since they'll be run when WITH_PANDAS = False).
features = [dict(row) for (i, row) in df.iterrows()]

return FeatureSet(name, list(df.index), labels=labels,
features=features, vectorizer=vectorizer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there's a more efficient way to do this (like in RSMTool):

if labels_column:
    feature_columns = [column for column in df.columns if column != labels_column]
    labels = df[labels_column].tolist()
else:
    feature_columns = df.columns
    labels = None

features = df[feature_columns].to_dict(orient='records')
return FeatureSet('train', ids=df.index.tolist(), labels=labels, features=features, vectorizer=vectorizer)

@desilinguist
Copy link
Member

Since this is a major new feature, it would be nice to:

  1. Update the documentation to explicitly mention it in an appropriate place.
  2. Provide an example showing the feature in action.

@desilinguist desilinguist changed the title #261 pandas.DataFrame helper Add a helper method to create FeatureSet from pandas data frames. Feb 15, 2016
@dmnapolitano
Copy link
Collaborator Author

Ok, I will try to finish this up today and Sunday. 😄

@dmnapolitano dmnapolitano changed the title Add a helper method to create FeatureSet from pandas data frames. [WIP] Add a helper method to create FeatureSet from pandas data frames. Feb 19, 2016
@desilinguist
Copy link
Member

Thanks! I think the code part is good (assuming Travis agrees :) ) . Now we need a nice example (perhaps a panda-ized version of the Titanic example?) and updated documentation.

@desilinguist
Copy link
Member

All this needs is an example and updated documentation. We will include this in the next release.

@desilinguist
Copy link
Member

@dmnapolitano do you have time to work on this for the 1.2.1 release which we want to do in a week or so? I believe all it needs is an example and updated documentation and making sure that the tests still pass.

@dmnapolitano
Copy link
Collaborator Author

Hmmm yeah, I believe you're right. 👍 Ok, let me see what I can do by Monday. If the answer is "nothing" I'll be sure to let you know. 😬 Thanks!

@coveralls
Copy link

coveralls commented May 16, 2016

Coverage Status

Coverage increased (+0.03%) to 90.712% when pulling 99012a9 on feature/skll-261-pandas-dataframe-helper into 5e40b4c on master.

@dmnapolitano dmnapolitano changed the title [WIP] Add a helper method to create FeatureSet from pandas data frames. Add a helper method to create FeatureSet from pandas data frames. May 16, 2016
@coveralls
Copy link

coveralls commented May 16, 2016

Coverage Status

Coverage increased (+0.03%) to 90.712% when pulling ca7b034 on feature/skll-261-pandas-dataframe-helper into 5e40b4c on master.

@dmnapolitano
Copy link
Collaborator Author

Ok, I think this is all set, but let me know what you think of course 😄 👍

@desilinguist
Copy link
Member

@dan-blanchard can you take a look as well please? I will too.

def featureset_creation_from_dataframe_helper(with_labels):
"""
Helper function for the two unit tests for FeatureSet.from_data_frame().
Since labels are optional, run two tests, one with, one without.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vectorizer is also optional. It might be nice to add tests for with/without vectorizer as well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok 👍🏼 I don't know much about vectorizers, honestly. Aren't there quite a few of them one could use?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at how it is created in the skll utils.py method make_classification_data and see whether you could apply that here too on your data?

@coveralls
Copy link

coveralls commented May 16, 2016

Coverage Status

Coverage increased (+0.03%) to 90.712% when pulling 2b7be60 on feature/skll-261-pandas-dataframe-helper into 5e40b4c on master.

@aoifecahill
Copy link
Collaborator

This looks ok to me. Any other comments?

@desilinguist
Copy link
Member

Looks good to me too. @dan-blanchard do you have anything?

Train a linear svm (assuming we have `train_examples`)::
from skll import FeatureSet

train_examples = FeatureSet.from_data_frame(my_data_frame, 'A Name for My Data', data_labels)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example is a bit misleading. When I read it, I assumed data_labels was a list of labels, not the name of a column in the my_data_frame df.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, sorry! I forgot how my own code works 😞 I'll update this now, thanks.

…t is for the name of the column, not the labels themselves...
:type name: str
:param labels_column: The name of the column containing the labels (data to predict).
:type labels_column: str or None
:param vectorizer: Vectorizer that created feature matrix.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description of vectorizer seems weird? Shouldn't this be the vectorizer you want to use when creating the featureset?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I'm not really sure. I copied it from line 44. 😬

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aoifecahill this will be the vectorizer used to create the featureset and NOT one that has already been used for something, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, it's the one that will be used to create this new featureset. It could have been used for something else though I think (e.g. if you want to apply a pre-existing vectorizer to a new test set)

Copy link
Member

@desilinguist desilinguist May 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, right. In any case, @dmnapolitano can you fix the docstring to say that this is a vectorizer that will be used to create the featureset?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, should I change line 44 too?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, line 44 is correct.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, then can you help me understand why they're different? In both cases we're creating a FeatureSet, so why would the vectorizer do something different in this case?

Copy link
Member

@desilinguist desilinguist May 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both can docstrings can just say “Vectorizer to use for generating feature matrix”? It's not a big deal.

@coveralls
Copy link

coveralls commented May 18, 2016

Coverage Status

Coverage increased (+0.03%) to 90.712% when pulling 7b874ed on feature/skll-261-pandas-dataframe-helper into 5e40b4c on master.

@coveralls
Copy link

coveralls commented May 19, 2016

Coverage Status

Coverage increased (+0.03%) to 90.735% when pulling f941db2 on feature/skll-261-pandas-dataframe-helper into 23b8d2b on master.

@coveralls
Copy link

coveralls commented May 19, 2016

Coverage Status

Coverage increased (+0.03%) to 90.735% when pulling 9600450 on feature/skll-261-pandas-dataframe-helper into 23b8d2b on master.

@dan-blanchard
Copy link
Contributor

Looks good to me.

@desilinguist desilinguist merged commit 911fecf into master May 19, 2016
@desilinguist desilinguist deleted the feature/skll-261-pandas-dataframe-helper branch May 19, 2016 16:24
@dmnapolitano
Copy link
Collaborator Author

Thanks, everyone! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants