-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a helper method to create FeatureSet from pandas data frames. #292
Add a helper method to create FeatureSet from pandas data frames. #292
Conversation
…a pandas DataFrame (if pandas is installed).
…/skll into feature/skll-261-pandas-dataframe-helper
…test creation of a FeatureSet from a pandas DataFrame.
''' | ||
|
||
if not _HAVE_PANDAS: | ||
logger.warning(('pandas not installed. Please install pandas or ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should just raise a ValueError
if they don't have pandas, because that's the only time anyone should use from_data_frame
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I take that back. You're not actually using pandas
anywhere in here, so why even bother checking if you can import it? Duck typing should prevail here. (How is there no duck emoji?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok word, we had the same thoughts here. To be revised 😄 (Also yeah the lack of duck emoji makes no sense )
…t there to be an error. Modify travis to run optional pandas tests depending on an environment variable.
return (expected, current) | ||
|
||
|
||
@attr('have_pandas') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, didn't know about that feature.
I really like this solution but do we really need to run all the tests in the |
…/skll into feature/skll-261-pandas-dataframe-helper
@desilinguist Hmmmmmmmm...so in Travis, when |
…kip the others (since they'll be run when WITH_PANDAS = False).
features = [dict(row) for (i, row) in df.iterrows()] | ||
|
||
return FeatureSet(name, list(df.index), labels=labels, | ||
features=features, vectorizer=vectorizer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there's a more efficient way to do this (like in RSMTool):
if labels_column:
feature_columns = [column for column in df.columns if column != labels_column]
labels = df[labels_column].tolist()
else:
feature_columns = df.columns
labels = None
features = df[feature_columns].to_dict(orient='records')
return FeatureSet('train', ids=df.index.tolist(), labels=labels, features=features, vectorizer=vectorizer)
Since this is a major new feature, it would be nice to:
|
Ok, I will try to finish this up today and Sunday. 😄 |
Thanks! I think the code part is good (assuming Travis agrees :) ) . Now we need a nice example (perhaps a panda-ized version of the Titanic example?) and updated documentation. |
All this needs is an example and updated documentation. We will include this in the next release. |
@dmnapolitano do you have time to work on this for the 1.2.1 release which we want to do in a week or so? I believe all it needs is an example and updated documentation and making sure that the tests still pass. |
Hmmm yeah, I believe you're right. 👍 Ok, let me see what I can do by Monday. If the answer is "nothing" I'll be sure to let you know. 😬 Thanks! |
…/skll into feature/skll-261-pandas-dataframe-helper
…kstart documentation.
Ok, I think this is all set, but let me know what you think of course 😄 👍 |
@dan-blanchard can you take a look as well please? I will too. |
def featureset_creation_from_dataframe_helper(with_labels): | ||
""" | ||
Helper function for the two unit tests for FeatureSet.from_data_frame(). | ||
Since labels are optional, run two tests, one with, one without. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vectorizer
is also optional. It might be nice to add tests for with/without vectorizer as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok 👍🏼 I don't know much about vectorizers, honestly. Aren't there quite a few of them one could use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a look at how it is created in the skll utils.py
method make_classification_data
and see whether you could apply that here too on your data?
This looks ok to me. Any other comments? |
Looks good to me too. @dan-blanchard do you have anything? |
Train a linear svm (assuming we have `train_examples`):: | ||
from skll import FeatureSet | ||
|
||
train_examples = FeatureSet.from_data_frame(my_data_frame, 'A Name for My Data', data_labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example is a bit misleading. When I read it, I assumed data_labels
was a list of labels, not the name of a column in the my_data_frame
df.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, sorry! I forgot how my own code works 😞 I'll update this now, thanks.
…t is for the name of the column, not the labels themselves...
:type name: str | ||
:param labels_column: The name of the column containing the labels (data to predict). | ||
:type labels_column: str or None | ||
:param vectorizer: Vectorizer that created feature matrix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This description of vectorizer
seems weird? Shouldn't this be the vectorizer you want to use when creating the featureset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I'm not really sure. I copied it from line 44. 😬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aoifecahill this will be the vectorizer used to create the featureset and NOT one that has already been used for something, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, it's the one that will be used to create this new featureset. It could have been used for something else though I think (e.g. if you want to apply a pre-existing vectorizer to a new test set)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, right. In any case, @dmnapolitano can you fix the docstring to say that this is a vectorizer that will be used to create the featureset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, should I change line 44 too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, line 44 is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, then can you help me understand why they're different? In both cases we're creating a FeatureSet
, so why would the vectorizer
do something different in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think both can docstrings can just say “Vectorizer to use for generating feature matrix”? It's not a big deal.
…/skll into feature/skll-261-pandas-dataframe-helper
Looks good to me. |
Thanks, everyone! 🎉 |
Hello! The changes in this pull request should resolve issue #261. Thanks! 😄