Coin name is mislabeled in dataset for duplicate-symbols #2

Closed
mikelambert opened this issue Jan 3, 2018 · 4 comments

@mikelambert

Thanks for doing this work, greatly appreciated.

Was going to do some analysis of it on my own, and got confused by the presence of duplicates on some coins.

For example, up until 12-12 we have one datapoint per date, whereas after that there are two datapoints per date:

...
"PRO",2017-12-08,0.350132,0.35717,0.323824,0.353445,70668,4921880,"Propy",385
"PRO",2017-12-09,0.357165,0.391303,0.341962,0.368005,129489,5020740,"Propy",385
"PRO",2017-12-10,0.368921,0.368921,0.336974,0.34417,61321,5186000,"Propy",385
"PRO",2017-12-11,0.343705,0.398727,0.337944,0.398727,65037,4831540,"Propy",385
"PRO",2017-12-12,0.397612,0.529153,0.385998,0.488748,206352,5589310,"Propy",385
"PRO",2017-12-13,0.386384,0.483263,0.382225,0.428319,2150570,0,"Propy",385
"PRO",2017-12-13,0.490256,0.606039,0.489343,0.569292,190018,6891630,"Propy",385
"PRO",2017-12-14,0.428882,0.428882,0.363489,0.411705,2015520,0,"Propy",385
"PRO",2017-12-14,0.568401,0.60429,0.533319,0.575479,110802,7990140,"Propy",385
"PRO",2017-12-15,0.41248,0.427116,0.375678,0.413479,1403610,0,"Propy",385
"PRO",2017-12-15,0.576018,0.582851,0.554633,0.572736,108230,8097210,"Propy",385
"PRO",2017-12-16,0.414505,0.489333,0.401455,0.456695,3102170,0,"Propy",385
"PRO",2017-12-16,0.573556,0.805846,0.573556,0.671252,184437,8062590,"Propy",385
"PRO",2017-12-17,0.462755,0.520551,0.440206,0.454255,1461940,0,"Propy",385
"PRO",2017-12-17,0.672319,0.697868,0.607462,0.689788,145157,9450930,"Propy",385
"PRO",2017-12-18,0.453932,0.477431,0.423593,0.473935,1583290,0,"Propy",385
"PRO",2017-12-18,0.701731,0.701731,0.575447,0.644326,177428,9864380,"Propy",385

Only one row in each pair has a non-zero market value, so I'm going to go with that. (I assume market refers to market cap? I thought at first it might be showing data from two different exchanges or something.)
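For anyone wanting to check their own copy of the data for this, here is a minimal sketch that groups rows by (symbol, date) and reports the keys that occur more than once. The column names and their order are an assumption inferred from the sample rows above, and `find_duplicate_dates` is a hypothetical helper, not part of the project.

```python
import csv
import io
from collections import defaultdict

# Assumed column order, inferred from the sample rows; the real file may differ.
COLUMNS = ["symbol", "date", "open", "high", "low", "close",
           "volume", "market", "name", "ranknow"]

# Three rows copied from the PRO excerpt above.
SAMPLE = '''"PRO",2017-12-12,0.397612,0.529153,0.385998,0.488748,206352,5589310,"Propy",385
"PRO",2017-12-13,0.386384,0.483263,0.382225,0.428319,2150570,0,"Propy",385
"PRO",2017-12-13,0.490256,0.606039,0.489343,0.569292,190018,6891630,"Propy",385
'''

def find_duplicate_dates(text):
    """Return {(symbol, date): [rows]} for keys that occur more than once."""
    groups = defaultdict(list)
    for values in csv.reader(io.StringIO(text)):
        if not values:  # skip blank lines
            continue
        row = dict(zip(COLUMNS, values))
        groups[(row["symbol"], row["date"])].append(row)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}

duplicates = find_duplicate_dates(SAMPLE)
for (symbol, date), rows in duplicates.items():
    # Print the market value of every row that shares this symbol and date.
    print(symbol, date, [row["market"] for row in rows])
```

On the sample, only 2017-12-13 is reported, since 2017-12-12 has a single row.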

@mikelambert
Author

Actually, my prioritization logic appears flawed:

...
"BTG",2017-10-20,0.819804,1.2,0.80772,1.19,80,0,"Bitcoin Gold",11
"BTG",2017-10-21,0.873455,1.25,0.862919,0.991196,59,0,"Bitcoin Gold",11
"BTG",2017-10-22,1.01,2.09,0.844422,1.7,1756,0,"Bitcoin Gold",11
"BTG",2017-10-23,479.82,539.72,479.82,500.13,7652060,0,"Bitcoin Gold",11
"BTG",2017-10-23,1.7,13.43,1.11,7.04,41557,0,"Bitcoin Gold",11
...
"BTG",2017-11-23,241.97,299.89,241.97,293.61,154038000,0,"Bitcoin Gold",11
"BTG",2017-11-23,5.84,6.45,5.72,5.9,7800,345650,"Bitcoin Gold",11
"BTG",2017-11-24,295.75,413.74,284.26,394.22,537472000,0,"Bitcoin Gold",11
"BTG",2017-11-24,5.89,7.96,4.1,5.42,10850,348941,"Bitcoin Gold",11
"BTG",2017-11-25,394.04,394.04,339.1,356.04,208662000,0,"Bitcoin Gold",11
"BTG",2017-11-25,5.4,6.64,4.68,5.68,7480,320152,"Bitcoin Gold",11
"BTG",2017-11-26,355.72,366.79,334.74,366.79,141228000,5930460000,"Bitcoin Gold",11
"BTG",2017-11-26,5.68,6.39,4.36,5.31,3204,336402,"Bitcoin Gold",11
"BTG",2017-11-27,370.18,387.88,353.67,359.25,129160000,6172140000,"Bitcoin Gold",11
"BTG",2017-11-27,5.31,5.5,4.39,5.43,8423,314816,"Bitcoin Gold",11
...

So:

  • On coinmarketcap, BTG does not exist prior to 10/23. So I am unclear what these 0.87 and 1.01 datapoints refer to.
  • After 10/23, it begins listing on coinmarketcap, with a value of 479. So I am unclear what this 1.7 datapoint refers to.
  • Starting 11/26, one of the two lines per date gets a non-zero market value, corresponding to coinmarketcap showing a non-zero market cap. That's great, since I assume that's when the coins came into existence and the real market started. Unfortunately, there are still two datapoints per date, both with non-zero market values, which makes it hard to tell which one I should be using. I assume the "largest" one makes the most sense.
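The ambiguity described in the bullets can be checked mechanically. The sketch below counts, per date, how many rows have a non-zero market value; the "pick the row with a non-zero market" rule only works when that count is exactly one. The rows are (date, close, market) triples copied from the BTG excerpt above, and `classify_by_market` is a hypothetical helper name.

```python
from collections import defaultdict

# (date, close, market) triples copied from the BTG rows above.
BTG_ROWS = [
    ("2017-10-23", 500.13, 0),
    ("2017-10-23", 7.04, 0),
    ("2017-11-25", 356.04, 0),
    ("2017-11-25", 5.68, 320152),
    ("2017-11-26", 366.79, 5930460000),
    ("2017-11-26", 5.31, 336402),
]

def classify_by_market(rows):
    """Label each date by how many of its rows have a non-zero market value."""
    nonzero_counts = defaultdict(int)
    for date, _close, market in rows:
        nonzero_counts[date] += 1 if market > 0 else 0
    labels = {}
    for date, count in nonzero_counts.items():
        if count == 1:
            labels[date] = "one candidate"
        elif count == 0:
            labels[date] = "no candidate"
        else:
            labels[date] = "ambiguous"
    return labels

labels = classify_by_market(BTG_ROWS)
for date, label in sorted(labels.items()):
    print(date, label)
```

Note that even the "one candidate" case on 2017-11-25 selects the row priced around 5.68 rather than the one around 356, so the heuristic can pick a consistent row and still not the intended coin.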

Am I parsing this data wrong, and is there a better way to deal with these duplicate time series? Or is there some extraneous data creeping in here? Thanks!

@mikelambert
Author

Ooooh, sorry, I figured out that this is due to coins on coinmarketcap that share a ticker. PRO, BTG, ACC, etc.

Not sure of a correct way to distinguish them in the dataset, especially since the name column appears to pick one coin arbitrarily instead of naming both. For example, the only names that appear are Bitcoin Gold (never Bitgem), Propy (never ProChain), etc.

@mikelambert changed the title from "Duplicate coin-dates" to "Coin name is mislabeled in dataset for duplicate-symbols" on Jan 3, 2018
@JesseVent self-assigned this on Jan 3, 2018
@JesseVent
Owner

Hi Mike, you're spot on: the issue is due to several tokens sharing the same symbol. I didn't know how to go about resolving it at first, but then figured I'd take the slug I'm already using to generate the URLs for scraping and use that as a unique identifier instead.

The change I just committed should resolve the duplication issues, and there are also a couple of extra features included.

Let me know how you go, thanks

> head(pro)
   slug symbol  name       date ranknow     open     high      low    close volume   market close_ratio spread
1 propy    PRO Propy 2017-09-19     295 0.823919 0.858425 0.628423 0.745318  26854 11582000      0.5082   0.23
2 propy    PRO Propy 2017-09-20     295 0.744813 0.933790 0.644857 0.862584 102433 10470000      0.7536   0.29
3 propy    PRO Propy 2017-09-21     295 0.859565 0.982731 0.743939 0.809898  74579 12083100      0.2762   0.24
4 propy    PRO Propy 2017-09-22     295 0.773040 0.792471 0.588509 0.658002 136747 10866800      0.3407   0.20
5 propy    PRO Propy 2017-09-23     295 0.657034 1.470000 0.559158 0.724104 298708  9236070      0.1811   0.91
6 propy    PRO Propy 2017-09-24     295 0.731472 0.734890 0.571775 0.615710 204870 10282500      0.2693   0.16
> tail(pro)
        slug symbol     name       date ranknow     open     high      low    close  volume market close_ratio spread
122 prochain    PRO ProChain 2017-12-28    1096 0.360183 0.365566 0.331167 0.354053  626739      0      0.6653   0.03
123 prochain    PRO ProChain 2017-12-29    1096 0.352030 0.401564 0.345843 0.357103  523329      0      0.2021   0.06
124 prochain    PRO ProChain 2017-12-30    1096 0.355327 0.358116 0.307769 0.326998  661599      0      0.3819   0.05
125 prochain    PRO ProChain 2017-12-31    1096 0.328685 0.378672 0.324813 0.358117  888244      0      0.6184   0.05
126 prochain    PRO ProChain 2018-01-01    1096 0.358136 0.358136 0.331516 0.345254 1424280      0      0.5161   0.03
127 prochain    PRO ProChain 2018-01-02    1096 0.345629 0.446480 0.345629 0.417606 4645990      0      0.7137   0.10
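For readers consuming the updated dataset, a minimal sketch of keying on slug rather than symbol might look like the following. The rows are (slug, symbol, name, date, close) values copied from the head/tail output above; `series_by_slug` is a hypothetical helper, and the check that each (slug, date) pair is unique is an assumption about the fixed data, not something the package guarantees.

```python
from collections import defaultdict

# (slug, symbol, name, date, close) taken from the head/tail output above.
ROWS = [
    ("propy", "PRO", "Propy", "2017-09-19", 0.745318),
    ("propy", "PRO", "Propy", "2017-09-20", 0.862584),
    ("prochain", "PRO", "ProChain", "2018-01-01", 0.345254),
    ("prochain", "PRO", "ProChain", "2018-01-02", 0.417606),
]

def series_by_slug(rows):
    """Build {slug: {date: close}}, failing loudly on a repeated (slug, date)."""
    series = defaultdict(dict)
    for slug, _symbol, _name, date, close in rows:
        if date in series[slug]:
            raise ValueError(f"duplicate date {date} for slug {slug}")
        series[slug][date] = close
    return dict(series)

result = series_by_slug(ROWS)
for slug, closes in sorted(result.items()):
    print(slug, len(closes), "rows")
```

Both series share the symbol PRO, but grouping on slug keeps Propy and ProChain apart.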

@mikelambert
Author

Awesome, the slug works great, thank you!
