repeat dataset: promoters #19

Closed
lacava opened this issue Jun 6, 2019 · 2 comments


lacava commented Jun 6, 2019

Pretty sure the promoters and molecular-biology_promoters datasets are clones: hashing the rows of both gives the same digest.

In [1]: import pandas as pd

In [2]: from pmlb import fetch_data

In [3]: df1 = fetch_data('promoters')

In [4]: df2 = fetch_data('molecular-biology_promoters')

In [6]: from pandas.util import hash_pandas_object

In [7]: import hashlib

In [8]: rowHashes1 = hash_pandas_object(df1).values

In [9]: hash1 = hashlib.sha256(rowHashes1).hexdigest()

In [10]: rowHashes2 = hash_pandas_object(df2).values

In [11]: hash2 = hashlib.sha256(rowHashes2).hexdigest()

In [12]: hash1
Out[12]: '37c2d79bd3ecaff76ab53f3f20742245e56a3ccb1354b5c45ea4a4429afff261'

In [13]: hash2
Out[13]: '37c2d79bd3ecaff76ab53f3f20742245e56a3ccb1354b5c45ea4a4429afff261'

In [14]: hash1==hash2
Out[14]: True
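
For anyone who wants to repeat this check on other dataset pairs, the session above can be wrapped into a small helper. This is only a sketch: `frame_fingerprint` and `are_clones` are hypothetical names, not part of pmlb, and the toy frames stand in for the real datasets so the example runs without downloading anything.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def frame_fingerprint(df: pd.DataFrame) -> str:
    """Return a SHA-256 hex digest computed over the per-row hashes of df."""
    # One uint64 hash per row; the row index is part of the hash by default.
    row_hashes = hash_pandas_object(df, index=True).values
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()


def are_clones(df_a: pd.DataFrame, df_b: pd.DataFrame) -> bool:
    """True when both frames produce the same fingerprint."""
    return frame_fingerprint(df_a) == frame_fingerprint(df_b)


# Toy data (made up for illustration, not taken from pmlb).
a = pd.DataFrame({"x": [1, 2, 3], "target": [0, 1, 0]})
b = a.copy()
c = a.assign(target=[1, 1, 0])

print(are_clones(a, b))
print(are_clones(a, c))
```

As far as I know, the row hashes depend on the values and the row index but not on the column labels, so matching digests are best followed up with `df1.equals(df2)` when the column names matter too.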

trangdata commented

Also, looks like column name is the row id and should be removed. I vote that we keep promoters and remove the column name from it.
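
A quick way to spot a row-id column like this is to look for non-target columns where every row has a distinct value. The sketch below is just one heuristic, not what was done in the repository: `id_like_columns` is a hypothetical helper, and the column names and values in the toy frame are invented for illustration.

```python
import pandas as pd


def id_like_columns(df: pd.DataFrame, target: str = "target") -> list:
    """Names of non-target columns whose values are all distinct."""
    if len(df) == 0:
        return []
    return [
        col
        for col in df.columns
        if col != target and df[col].nunique(dropna=False) == len(df)
    ]


# Toy frame: "instance" plays the role of a row id (values are made up).
toy = pd.DataFrame(
    {
        "instance": ["row_0", "row_1", "row_2", "row_3"],
        "p-50": ["t", "t", "g", "a"],
        "target": [1, 1, 0, 0],
    }
)

cols = id_like_columns(toy)
cleaned = toy.drop(columns=cols)
print(cols)
print(list(cleaned.columns))
```

A continuous feature can also be unique in every row, so anything this flags still needs a manual look before it is dropped.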

lacava added a commit that referenced this issue Jul 31, 2020

trangdata added a commit that referenced this issue Aug 18, 2020
* removes promoters, fixing issue #19
* updates to ml.bio promoters metadata
* remove the column instance
* remove promoters from dataset list
* remove instance from feature list

Co-authored-by: Trang Le <grixor@gmail.com>

trangdata commented

I believe this has been addressed here.
