-
Notifications
You must be signed in to change notification settings - Fork 28
Added describe method for numeric dataframes #226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Based on my comments, I would rewrite this method in the following way: DataFrame >> describe
"Answer another data frame with statistics describing the columns of this data frame"
| numericalColumns numericalColumnNames contents column |
numericalColumns := self columns
"TODO: Remove this line when columns returns a collection of series"
collect: [ :column | column asDataSeries ]
thenSelect: [ :column | column isNumerical ].
"TODO: Rewrite this when columns returns a collection of series"
numericalColumnNames := self columnNames select: [ :name |
(data column: name) isNumerical ].
contents := numericalColumns collect: [ :each |
"TODO: Remove this line when specific statistical methods (average, stdev, etc.) can handle nils)"
column := each removeNils.
{
column countNonNils .
column average .
column stdev .
column min .
column firstQuartile .
column secondQuartile .
column thirdQuartile .
column max .
column calculateDataType
} ].
^ DataFrame
withRows: contents
rowNames: numericalColumnNames
columnNames: #(count mean std min '25%' '50%' '75%' max dtype).However, in my image, I do not have the |
|
It also suggest three more issues:
If that is done, we can rewrite the DataFrame >> describe
"Answer another data frame with statistics describing the columns of this data frame"
| contents |
contents := self numericalColumns collect: [ :column |
{
column countNonNils .
column average .
column stdev .
column min .
column firstQuartile .
column secondQuartile .
column thirdQuartile .
column max .
column calculateDataType
} ].
^ DataFrame
withRows: contents
rowNames: self numericalColumnNames
columnNames: #(count mean std min '25%' '50%' '75%' max dtype). |
jecisc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave some comments on the general way to code in smalltalk but the change proposed by Oleks seems good.
I think we can optimize it further but it is better to treat that later.
src/DataFrame/DataFrame.class.st
Outdated
| "method to statistically describe a numerical dataframe" | ||
|
|
||
| | nCol nRow describeDF col count dtype | | ||
| nCol := self numberOfColumns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In smalltalk we have the convention to not use abbreviation. It makes the code harder to understand. I prefer to read "numberOfColumns" instead of "nCol" for example.
The philosophy of smalltalk is to have the code that is the closest possible to english so that reading code is the easiest possible to the developper and abbreviations are not good for that.
You might gain a few seconds while writing the code (and it's not even sure with the auto completion), but in the future we might lose some minutes reading the code.
src/DataFrame/DataFrame.class.st
Outdated
| describeDF at: i at: 1 put: count. | ||
|
|
||
| describeDF at: i at: 2 put: mean. | ||
|
|
||
| describeDF at: i at: 3 put: std. | ||
|
|
||
| describeDF at: i at: 4 put: mini. | ||
|
|
||
| describeDF at: i at: 5 put: fQ. | ||
|
|
||
| describeDF at: i at: 6 put: sQ. | ||
|
|
||
| describeDF at: i at: 7 put: tQ. | ||
|
|
||
| describeDF at: i at: 8 put: maxi. | ||
|
|
||
| describeDF at: i at: 9 put: dtype ]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oleks proposed a better implementation but just for your knowledge in general we would write this like that:
{ count . mean . std . mini . fQ . sQ . tQ . maxi . dtype } collectWithIndex: [ :stat :index | describeDF at: index put: stat ]
Yes! but the implementation of #columns should be kept in #arrayOfColumns and most users should use this one. The reason is that I optimized #columns a while ago because it's a method that can be long on big dataframe and a lot of things were slow because of that. So, when it's not necessary to have a DataSeries instead of an Array, we should use an Array because creating a DataSeries takes more time.
I was gonna propose that but I think there is a way faster way using:
I think I already opened an issue about that? If I didn't it's just that I forgot :(
|
Should I change the implementation of #asArrayOfColumns in DataFrame from Or should I change the implementation of #columns from |
|
#columns should return a collection of DataSeries I though we have a method named #rows but apparently we do not, sorry for the confusion. For #columns we could have a simple implementation for a start that is: self ifEmpty: [ ^ #() ].
^ self columnsFrom: 1 to: self numberOfColumnsI don't know if in term of perf this would be the best, but we can do a first implementation with tests and iterate later if the perfs are not good on big data. If you want to do those changes I can propose you two ways:
I would prefer that we do not tackle 4 issues in 1 PR. |
|
Okay thanks, I think I will do one PR for each improvement/missing feature and then I will update this PR. |
Yes I had made a PR for this a few days ago. |
For some reason when I try to use When I tried to implement it like this: It gives an out of bounds error, even if the number of columns is greater than 1. And is there any difference between just |
|
Do the image freeze when you save the method or when you run the tests? If it's while running the tests: I think that what must happens is that doing this change impact the perfs a lot and some tests are taking a lot of time (because the methods using #columns currently should use #asArrayOfColumns that is much faster). And the image seems frozen because the tests are long (we have a time limit for the tests but it is 10sec by test by default) |
|
It freezes when I run the tests and even if I try to use it in the playground image. There is no issue while saving the method. |
|
@jecisc could you let me know your thoughts regarding this implementation of We cannot use this directly for
because |
|
Those two methods seems good to me! |
There are a few issues with this kind of implementation:
We get an error because |
|
Since #numericalColumns should return an array, then there is no question of "row" or "columns" wise. It should be applied to all data series in the array. And in this case the data series should represent the columns. |
This method is similar to pandas.describe() and it will be useful in providing a statistic description of the dataframe.
Tests are also written for this method.
This fixes issue #167