Add a helper method to create FeatureSet from pandas data frames. #292

dmnapolitano · 2016-02-12T20:11:25Z

Hello! The changes in this pull request should resolve issue #261. Thanks! 😄

dan-blanchard · 2016-02-12T20:20:58Z

skll/data/featureset.py

+        '''
+
+        if not _HAVE_PANDAS:
+            logger.warning(('pandas not installed.  Please install pandas or '


This should just raise a ValueError if they don't have pandas, because that's the only time anyone should use from_data_frame.

Actually, I take that back. You're not actually using pandas anywhere in here, so why even bother checking if you can import it? Duck typing should prevail here. (How is there no duck emoji?)

Ok word, we had the same thoughts here. To be revised 😄 (Also yeah, the lack of a duck emoji makes no sense.)
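
A minimal sketch of the duck-typing point above, assuming the helper only needs an `index` attribute and an `iterrows()` method from its argument. `TinyFrame` and `ids_and_features` are hypothetical names used for illustration; they are not part of SKLL or pandas. A real `pandas.DataFrame` exposes both members, so the same function should accept one without the module ever importing pandas.

```python
class TinyFrame:
    """Hypothetical stand-in that exposes only `index` and `iterrows()`."""

    def __init__(self, rows, index):
        self._rows = rows
        self.index = index

    def iterrows(self):
        # Yield (row id, row mapping) pairs, mirroring the shape of
        # what DataFrame.iterrows() yields.
        return zip(self.index, self._rows)


def ids_and_features(df):
    """Return (ids, feature dicts) from any DataFrame-like object.

    No `import pandas` and no isinstance check: the function relies
    only on the two members it actually uses.
    """
    features = [dict(row) for (_, row) in df.iterrows()]
    return list(df.index), features


frame = TinyFrame([{'f1': 1.0}, {'f1': 2.5}], index=['ex1', 'ex2'])
ids, features = ids_and_features(frame)
```

If the argument lacks one of those members, Python raises an `AttributeError` at the point of use, which is the behavior duck typing relies on instead of an up-front import check.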

dan-blanchard · 2016-02-13T20:27:21Z

tests/test_featureset.py

+    return (expected, current)
+
+
+@attr('have_pandas')


Huh, didn't know about that feature.

desilinguist · 2016-02-15T02:38:31Z

I really like this solution but do we really need to run all the tests in the PANDAS=true environments? Seems like its presence won't affect anything except the specific test that we want to run. Am I wrong?

dmnapolitano · 2016-02-15T16:27:02Z

@desilinguist Hmmmmmmmm...so in Travis, when WITH_PANDAS is True, only run the pandas tests and skip the others? I can make that happen 😄 🐼
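
The PR gates its pandas tests with nose's `attr` decorator and a Travis environment variable. As a rough standard-library analogue (not what the PR does), the sketch below skips a test unless `WITH_PANDAS` is set. The exact value `true` and the class and test names are assumptions made for illustration.

```python
import os
import unittest

# Assumption: CI jobs that install pandas export WITH_PANDAS=true.
WITH_PANDAS = os.environ.get('WITH_PANDAS', '').lower() == 'true'


class DataFrameHelperTests(unittest.TestCase):

    @unittest.skipUnless(WITH_PANDAS, 'WITH_PANDAS is not set to true')
    def test_frame_index(self):
        # Import inside the test so this module still loads
        # in environments where pandas is not installed.
        import pandas as pd
        df = pd.DataFrame({'f1': [1, 2]})
        self.assertEqual(df.index.tolist(), [0, 1])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DataFrameHelperTests)
result = unittest.TextTestRunner().run(suite)
```

In a job without the variable the test is reported as skipped rather than failed, so the remaining suite is unaffected.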

desilinguist · 2016-02-15T20:22:12Z

skll/data/featureset.py

+            features = [dict(row) for (i, row) in df.iterrows()]
+
+        return FeatureSet(name, list(df.index), labels=labels,
+                          features=features, vectorizer=vectorizer)


I believe there's a more efficient way to do this (like in RSMTool):

    if labels_column:
        feature_columns = [column for column in df.columns if column != labels_column]
        labels = df[labels_column].tolist()
    else:
        feature_columns = df.columns
        labels = None
    features = df[feature_columns].to_dict(orient='records')
    return FeatureSet('train', ids=df.index.tolist(), labels=labels, features=features, vectorizer=vectorizer)
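
To see what the suggestion above produces on its own, here is a small sketch that runs the same steps on a toy frame, without constructing a `FeatureSet`. The function name `split_frame`, the column names, and the data are made up for illustration. Note that `labels_column` is the name of a column in the frame, not a list of labels.

```python
import pandas as pd


def split_frame(df, labels_column=None):
    """Return (ids, labels, features) extracted from a DataFrame."""
    if labels_column:
        # Every column except the label column is treated as a feature.
        feature_columns = [c for c in df.columns if c != labels_column]
        labels = df[labels_column].tolist()
    else:
        feature_columns = list(df.columns)
        labels = None
    # One {column: value} dict per row, built by pandas itself
    # rather than by a Python-level loop over iterrows().
    features = df[feature_columns].to_dict(orient='records')
    return df.index.tolist(), labels, features


# Toy data; values are invented for the example.
df = pd.DataFrame({'age': [22, 38],
                   'fare': [7.25, 71.28],
                   'survived': [0, 1]},
                  index=['p1', 'p2'])

ids, labels, features = split_frame(df, labels_column='survived')
unlabeled = split_frame(df)
```

With `labels_column=None`, every column (here including `survived`) is kept as a feature and the returned labels are `None`.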

desilinguist · 2016-02-15T20:38:56Z

Since this is a major new feature, it would be nice to:

Update the documentation to explicitly mention it in an appropriate place.
Provide an example showing the feature in action.

dmnapolitano · 2016-02-19T15:14:47Z

Ok, I will try to finish this up today and Sunday. 😄

desilinguist · 2016-02-19T15:21:33Z

Thanks! I think the code part is good (assuming Travis agrees :) ). Now we need a nice example (perhaps a panda-ized version of the Titanic example?) and updated documentation.

desilinguist · 2016-02-22T19:16:50Z

All this needs is an example and updated documentation. We will include this in the next release.

desilinguist · 2016-05-13T15:37:24Z

@dmnapolitano do you have time to work on this for the 1.2.1 release which we want to do in a week or so? I believe all it needs is an example and updated documentation and making sure that the tests still pass.

dmnapolitano · 2016-05-13T16:01:10Z

Hmmm yeah, I believe you're right. 👍 Ok, let me see what I can do by Monday. If the answer is "nothing" I'll be sure to let you know. 😬 Thanks!

coveralls · 2016-05-16T14:59:22Z

Coverage increased (+0.03%) to 90.712% when pulling 99012a9 on feature/skll-261-pandas-dataframe-helper into 5e40b4c on master.

coveralls · 2016-05-16T15:41:40Z

Coverage increased (+0.03%) to 90.712% when pulling ca7b034 on feature/skll-261-pandas-dataframe-helper into 5e40b4c on master.

dmnapolitano · 2016-05-16T16:00:51Z

Ok, I think this is all set, but let me know what you think of course 😄 👍

desilinguist · 2016-05-16T16:04:10Z

@dan-blanchard can you take a look as well please? I will too.

aoifecahill · 2016-05-16T18:15:41Z

tests/test_featureset.py

+def featureset_creation_from_dataframe_helper(with_labels):
+    """
+    Helper function for the two unit tests for FeatureSet.from_data_frame().
+    Since labels are optional, run two tests, one with, one without.


vectorizer is also optional. It might be nice to add tests for with/without vectorizer as well?

Ok 👍🏼 I don't know much about vectorizers, honestly. Aren't there quite a few of them one could use?

Take a look at how it is created in the skll utils.py method make_classification_data and see whether you could apply that here too on your data?

coveralls · 2016-05-16T19:53:06Z

Coverage increased (+0.03%) to 90.712% when pulling 2b7be60 on feature/skll-261-pandas-dataframe-helper into 5e40b4c on master.

aoifecahill · 2016-05-17T13:30:18Z

This looks ok to me. Any other comments?

desilinguist · 2016-05-17T14:11:15Z

Looks good to me too. @dan-blanchard do you have anything?

aoifecahill · 2016-05-18T19:10:40Z

doc/api/quickstart.rst

-Train a linear svm (assuming we have `train_examples`)::
+    from skll import FeatureSet
+
+    train_examples = FeatureSet.from_data_frame(my_data_frame, 'A Name for My Data', data_labels)


This example is a bit misleading. When I read it, I assumed data_labels was a list of labels, not the name of a column in the my_data_frame df.

Right, sorry! I forgot how my own code works 😞 I'll update this now, thanks.

desilinguist · 2016-05-18T19:21:03Z

skll/data/featureset.py

+        :type name: str
+        :param labels_column: The name of the column containing the labels (data to predict).
+        :type labels_column: str or None
+        :param vectorizer: Vectorizer that created feature matrix.


This description of vectorizer seems weird? Shouldn't this be the vectorizer you want to use when creating the featureset?

Honestly I'm not really sure. I copied it from line 44. 😬

@aoifecahill this will be the vectorizer used to create the featureset and NOT one that has already been used for something, right?

right, it's the one that will be used to create this new featureset. It could have been used for something else though I think (e.g. if you want to apply a pre-existing vectorizer to a new test set)

ah, right. In any case, @dmnapolitano can you fix the docstring to say that this is a vectorizer that will be used to create the featureset?

Ok, should I change line 44 too?

No, line 44 is correct.

Ok, then can you help me understand why they're different? In both cases we're creating a FeatureSet, so why would the vectorizer do something different in this case?

I think both docstrings can just say “Vectorizer to use for generating feature matrix”? It's not a big deal.
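
The case mentioned in this thread, applying a pre-existing vectorizer to a new test set, can be sketched with scikit-learn's `DictVectorizer` directly. This is only an illustration of the idea under that assumption, not SKLL's internal code, and the feature names and values are invented.

```python
from sklearn.feature_extraction import DictVectorizer

# Invented feature dicts, in the same shape as FeatureSet features.
train_features = [{'age': 22, 'fare': 7.25}, {'age': 38, 'fare': 71.28}]
test_features = [{'age': 26, 'fare': 7.92, 'cabin_known': 1}]

vectorizer = DictVectorizer(sparse=False)

# Fitting learns the feature-name-to-column mapping from the training data.
X_train = vectorizer.fit_transform(train_features)

# Reusing the fitted vectorizer keeps the test columns aligned with
# the training columns; features unseen during fitting are dropped.
X_test = vectorizer.transform(test_features)
```

Here `cabin_known` does not appear in `X_test` because the vectorizer was fitted before that feature was seen, which is why the same vectorizer object has to be shared between the two sets.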

coveralls · 2016-05-18T19:40:50Z

Coverage increased (+0.03%) to 90.712% when pulling 7b874ed on feature/skll-261-pandas-dataframe-helper into 5e40b4c on master.

coveralls · 2016-05-19T15:36:16Z

Coverage increased (+0.03%) to 90.735% when pulling f941db2 on feature/skll-261-pandas-dataframe-helper into 23b8d2b on master.

coveralls · 2016-05-19T16:22:18Z

Coverage increased (+0.03%) to 90.735% when pulling 9600450 on feature/skll-261-pandas-dataframe-helper into 23b8d2b on master.

dan-blanchard · 2016-05-19T16:23:54Z

Looks good to me.

dmnapolitano · 2016-05-19T16:34:22Z

Thanks, everyone! 🎉

Diane Napolitano added 4 commits February 12, 2016 13:08

Added static method to FeatureSet to create a FeatureSet object from a pandas DataFrame (if pandas is installed).

933856b

Merge branch 'master' of https://github.com/EducationalTestingService/skll into feature/skll-261-pandas-dataframe-helper

9cfd08a

Add pandas to .travis.yml list of packages to be installed so we can test creation of a FeatureSet from a pandas DataFrame.

7b576e5

Unit tests for creating a FeatureSet from a DataFrame both with and without labels

6067f9b

dan-blanchard reviewed Feb 12, 2016

dmnapolitano changed the title ~~#261 pandas.DataFrame helper~~ [WIP] #261 pandas.DataFrame helper Feb 12, 2016

When someone tries to use a DataFrame but doesn't have pandas, we want there to be an error. Modify travis to run optional pandas tests depending on an environment variable.

56c1d9f

dan-blanchard reviewed Feb 13, 2016
Diane Napolitano added 2 commits February 15, 2016 11:21

Removing now unused logging from FeatureSet.

5874b7b

Merge branch 'master' of https://github.com/EducationalTestingService/skll into feature/skll-261-pandas-dataframe-helper

b073f0c

dmnapolitano changed the title ~~[WIP] #261 pandas.DataFrame helper~~ #261 pandas.DataFrame helper Feb 15, 2016

With Travis, only run the pandas tests when WITH_PANDAS is True and skip the others (since they'll be run when WITH_PANDAS = False).

6992899

desilinguist reviewed Feb 15, 2016

desilinguist changed the title ~~#261 pandas.DataFrame helper~~ Add a helper method to create FeatureSet from pandas data frames. Feb 15, 2016

dmnapolitano changed the title ~~Add a helper method to create FeatureSet from pandas data frames.~~ [WIP] Add a helper method to create FeatureSet from pandas data frames. Feb 19, 2016

desilinguist added 2 commits February 19, 2016 10:18

Use a built-in pandas feature for creating feature dictionaries.

08ad8d1

Merge branch 'master' into feature/skll-261-pandas-dataframe-helper

37bceac

Fix stupid typo.

89dc0ad

Merge branch 'master' of https://github.com/EducationalTestingService/skll into feature/skll-261-pandas-dataframe-helper

99012a9

Adding an example of FeatureSet.from_data_frame() use to the API Quickstart documentation.

ca7b034

dmnapolitano changed the title ~~[WIP] Add a helper method to create FeatureSet from pandas data frames.~~ Add a helper method to create FeatureSet from pandas data frames. May 16, 2016

dmnapolitano self-assigned this May 16, 2016

desilinguist assigned dan-blanchard and aoifecahill and unassigned dmnapolitano and dan-blanchard May 16, 2016

aoifecahill reviewed May 16, 2016

Adding pandas unit tests using optional vectorizer.

2b7be60

aoifecahill reviewed May 18, 2016

Made a mistake in the quickstart example...the labels-related argument is for the name of the column, not the labels themselves...

7b874ed

desilinguist reviewed May 18, 2016

Diane Napolitano added 2 commits May 19, 2016 11:11

Merge branch 'master' of https://github.com/EducationalTestingService/skll into feature/skll-261-pandas-dataframe-helper

f941db2

Clarified doc strings regarding the vectorizer arguments to FeatureSet creation.

9600450

desilinguist merged commit 911fecf into master May 19, 2016

desilinguist deleted the feature/skll-261-pandas-dataframe-helper branch May 19, 2016 16:24

desilinguist mentioned this pull request May 20, 2016

Add pandas data frame helper #261

Closed
