Added describe method for numeric dataframes #226

Joshua-Dias-Barreto · 2023-05-15T06:11:10Z

This method is similar to pandas' DataFrame.describe() and provides a statistical description of the data frame.
Tests for this method are included.
This fixes issue #167.

olekscode · 2023-05-15T08:10:35Z

Based on my comments, I would rewrite this method in the following way:

DataFrame >> describe
    "Answer another data frame with statistics describing the columns of this data frame"
    
    | numericalColumns numericalColumnNames contents column |

    numericalColumns := self columns
        "TODO: Remove this line when columns returns a collection of series"
        collect: [ :column | column asDataSeries ]
        thenSelect: [ :column | column isNumerical ].
        
    "TODO: Rewrite this when columns returns a collection of series"
    numericalColumnNames := self columnNames select: [ :name |
        (self column: name) isNumerical ].

    contents := numericalColumns collect: [ :each |
        "TODO: Remove this line when the statistical methods (average, stdev, etc.) can handle nils"
        column := each removeNils.
    
        {     
            column countNonNils .
            column average .
            column stdev .
            column min .
            column firstQuartile .
            column secondQuartile .
            column thirdQuartile .
            column max .
            column calculateDataType
        } ].

    ^ DataFrame 
        withRows: contents
        rowNames: numericalColumnNames
        columnNames: #(count mean std min '25%' '50%' '75%' max dtype).

However, in my image, I do not have the countNonNils method.
Maybe it was added recently to the dev version.

olekscode · 2023-05-15T08:23:58Z

This also suggests three more issues:

1. `DataFrame >> columns` should return a collection of series and not a collection of arrays. Each series should remember the column name.
2. Maybe also add a method `DataFrame >> numericalColumns` that simply returns `self columns select: [ :column | column isNumerical ]`, and a method `DataFrame >> numericalColumnNames` that returns `self numericalColumns collect: [ :column | column name ]`.
3. Make sure that the statistical methods (`average`, `stdev`, etc.) can handle `nil` values. For example, `#(2 nil 3 1 4) asDataSeries average` should return `2.5`.

If that is done, we can rewrite the describe method like this:

DataFrame >> describe
    "Answer another data frame with statistics describing the columns of this data frame"
    
    | contents |

    contents := self numericalColumns collect: [ :column |    
        {     
            column countNonNils .
            column average .
            column stdev .
            column min .
            column firstQuartile .
            column secondQuartile .
            column thirdQuartile .
            column max .
            column calculateDataType
        } ].

    ^ DataFrame 
        withRows: contents
        rowNames: self numericalColumnNames
        columnNames: #(count mean std min '25%' '50%' '75%' max dtype).

jecisc

I gave some comments on the general way to code in Smalltalk, but the change proposed by Oleks seems good.

I think we can optimize it further, but it is better to deal with that later.

jecisc · 2023-05-15T09:42:34Z

src/DataFrame/DataFrame.class.st

+	"method to statistically describe a numerical dataframe"
+
+	| nCol nRow describeDF col count dtype |
+	nCol := self numberOfColumns.


In Smalltalk we have the convention of not using abbreviations. They make the code harder to understand. I prefer to read "numberOfColumns" instead of "nCol", for example.

The philosophy of Smalltalk is to keep the code as close as possible to English so that it is as easy as possible for the developer to read, and abbreviations work against that.

You might gain a few seconds while writing the code (and even that is not certain with auto-completion), but in the future we might lose minutes reading it.

jecisc · 2023-05-15T09:46:00Z

src/DataFrame/DataFrame.class.st

+		describeDF at: i at: 1 put: count.
+
+		describeDF at: i at: 2 put: mean.
+
+		describeDF at: i at: 3 put: std.
+
+		describeDF at: i at: 4 put: mini.
+
+		describeDF at: i at: 5 put: fQ.
+
+		describeDF at: i at: 6 put: sQ.
+
+		describeDF at: i at: 7 put: tQ.
+
+		describeDF at: i at: 8 put: maxi.
+
+		describeDF at: i at: 9 put: dtype ].


Oleks proposed a better implementation, but just for your knowledge, in general we would write this as:

{ count . mean . std . mini . fQ . sQ . tQ . maxi . dtype } doWithIndex: [ :stat :index | describeDF at: i at: index put: stat ]
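As a rough illustration of that idiom on plain collections, here is a minimal Playground sketch. The names `statistics` and `row` are made up for the example and are not part of the PR; the snippet only assumes the standard `doWithIndex:` iterator on sequenceable collections.

```smalltalk
"Illustrative only: 'statistics' and 'row' are hypothetical names."
| statistics row |
statistics := { 5 . 2.5 . 1.2 }.
row := Array new: statistics size.

"doWithIndex: passes each element together with its 1-based position."
statistics doWithIndex: [ :statistic :index |
	row at: index put: statistic ].

row
```

The brace array is built once and the block replaces the nine near-identical `at:at:put:` statements with a single loop.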

jecisc · 2023-05-15T09:55:33Z

> This also suggests three more issues:
>
> 1. `DataFrame >> columns` should return a collection of series and not a collection of arrays. Each series should remember the column name

Yes! But the implementation of #columns should be kept in #arrayOfColumns, and most users should use that one. The reason is that I optimized #columns a while ago because it is a method that can be slow on big data frames, and a lot of things were slow because of that. So when a DataSeries is not needed instead of an Array, we should use an Array, because creating a DataSeries takes more time.

> 2. Maybe also add a method `DataFrame >> numericalColumns` that will simply return `self columns select: [ :column | column isNumerical ]` and also `DataFrame >> numericalColumnNames` that would return `self numericalColumns collect: [ :column | column name ]`

I was going to propose that, but I think there is a much faster way using `((self dataTypes at: column) includesBehavior: Number)`, because `DataSeries>>isNumerical` will scan the full DataSeries.

> 3. Make sure that statistical methods (`average`, `stdev`, etc.) can handle `nil` values. For example, `#(2 nil 3 1 4) asDataSeries average` should return `2.5`

I think I already opened an issue about that? If I didn't, it's just that I forgot :(
I already handled the nils in #average, IIRC.
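To make the nil-handling expectation concrete, here is a minimal sketch on a plain Array rather than a DataSeries. It is not the library's implementation; it only assumes the standard `reject:` and `average` messages on Pharo collections and shows what "ignoring nils" means for the example above.

```smalltalk
"Sketch on a plain Array; the DataSeries behaviour discussed above is the goal, not what is shown here."
| values nonNilValues mean |
values := #(2 nil 3 1 4).

"Drop the nils first, then average the four remaining numbers."
nonNilValues := values reject: [ :each | each isNil ].
mean := nonNilValues average.

mean
```

With integer elements the result is an exact fraction, which compares equal to 2.5.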


Joshua-Dias-Barreto · 2023-05-15T14:27:05Z

> Yes! But the implementation of #columns should be kept in #arrayOfColumns, and most users should use that one. The reason is that I optimized #columns a while ago because it is a method that can be slow on big data frames, and a lot of things were slow because of that. So when a DataSeries is not needed instead of an Array, we should use an Array, because creating a DataSeries takes more time.

Should I change the implementation of #asArrayOfColumns in DataFrame from
`^ contents asArrayOfColumns` to `^ contents asDataSeries`?

Or should I change the implementation of #columns, which is currently
`^ self asArrayOfColumns`, so that it returns a collection of series?

jecisc · 2023-05-15T15:10:38Z

#columns should return a collection of DataSeries.
#arrayOfColumns should return an array of arrays.

I thought we had a method named #rows, but apparently we do not. Sorry for the confusion.

For #columns we could have a simple implementation for a start that is:

    self ifEmpty: [ ^ #() ].

    ^ self columnsFrom: 1 to: self numberOfColumns

I don't know if this would be the best in terms of performance, but we can do a first implementation with tests and iterate later if the performance is not good on big data.

If you want to make those changes, I can propose two ways:

1. Either we integrate this PR with the version of the method given by Oleks here: Added describe method for numeric dataframes #226 (comment), and after that we add the three missing features and update this method afterward.
2. Or you do one PR for each improvement/missing feature, and once that is done you merge master into your branch and update your PR.

I would prefer that we do not tackle four issues in one PR.

Joshua-Dias-Barreto · 2023-05-15T17:06:09Z

Okay thanks, I think I will do one PR for each improvement/missing feature and then I will update this PR.

Joshua-Dias-Barreto · 2023-05-15T17:07:57Z

> However, in my image, I do not have the countNonNils method.
> Maybe it was added recently to the dev version.

Yes, I made a PR for this a few days ago.

Joshua-Dias-Barreto · 2023-05-16T05:38:36Z

> For #columns we could have a simple implementation for a start that is:
>
>     self ifEmpty: [ ^ #() ].
>
>     ^ self columnsFrom: 1 to: self numberOfColumns

For some reason, when I try to use columns after implementing it like this, my image crashes.

When I tried to implement it like this:

    ^ self columnsFrom: 1 to: self numberOfColumns

it gives an out-of-bounds error, even if the number of columns is greater than 1.

Also, is there any difference between just `^ self` and `^ self columnsFrom: 1 to: self numberOfColumns`?

jecisc · 2023-05-16T12:23:43Z

`^ self` returns the current instance (self). If a method has no explicit return, this is what it returns by default.

`^ self columnsFrom: 1 to: self numberOfColumns` returns the result of #columnsFrom:to:.

Does the image freeze when you save the method or when you run the tests?

If it is while running the tests: I think what happens is that this change hurts performance a lot and some tests take a long time (because the methods that currently use #columns should use #asArrayOfColumns, which is much faster). The image then seems frozen because the tests are slow (we have a time limit for the tests, but it is 10 seconds per test by default).

Joshua-Dias-Barreto · 2023-05-16T15:42:13Z

It freezes when I run the tests, and also when I try to use it in the Playground.

There is no issue while saving the method.

Joshua-Dias-Barreto · 2023-05-19T05:40:50Z

@jecisc, could you let me know your thoughts on this implementation of numericalColumns and numericalColumnNames?

numericalColumnNames

	^ self columnNames select: [ :columnName |
		  (self dataTypes at: columnName) includesBehavior: Number ]

numericalColumns

	^ self columns: self numericalColumnNames

We cannot use this directly for numericalColumns:

> Maybe also add a method `DataFrame >> numericalColumns` that will simply return `self columns select: [ :column | column isNumerical ]` and also `DataFrame >> numericalColumnNames` that would return `self numericalColumns collect: [ :column | column name ]`
>
> I was going to propose that, but I think there is a much faster way using `((self dataTypes at: column) includesBehavior: Number)`, because `DataSeries>>isNumerical` will scan the full DataSeries.

because the data type lookup (`self dataTypes at:`) requires the column name.

jecisc · 2023-05-20T21:40:06Z

Those two methods seem good to me!
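For readers unfamiliar with `includesBehavior:`, here is a small sketch of the class-based filter on a plain Dictionary. The `columnTypes` dictionary and the column names are hypothetical stand-ins for whatever `dataTypes` answers in the library; only `includesBehavior:`, `select:` and `Dictionary` are standard Pharo.

```smalltalk
"Hypothetical data: 'columnTypes' stands in for the data frame's dataTypes."
| columnTypes numericalNames |
columnTypes := Dictionary new.
columnTypes at: #name put: ByteString.
columnTypes at: #age put: SmallInteger.
columnTypes at: #height put: Float.

"One class check per column; no column values are scanned."
numericalNames := #(name age height) select: [ :columnName |
	(columnTypes at: columnName) includesBehavior: Number ].

numericalNames
```

Since the test looks only at the recorded class of each column, its cost depends on the number of columns and not on the number of rows.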

Joshua-Dias-Barreto · 2023-05-22T13:19:15Z

> If that is done, we can rewrite the describe method like this:
>
>     DataFrame >> describe
>         "Answer another data frame with statistics describing the columns of this data frame"
>
>         | contents |
>
>         contents := self numericalColumns collect: [ :column |
>             {
>                 column countNonNils .
>                 column average .
>                 column stdev .
>                 column min .
>                 column firstQuartile .
>                 column secondQuartile .
>                 column thirdQuartile .
>                 column max .
>                 column calculateDataType
>             } ].
>
>         ^ DataFrame
>             withRows: contents
>             rowNames: self numericalColumnNames
>             columnNames: #(count mean std min '25%' '50%' '75%' max dtype).

There are a few issues with this kind of implementation:

1. collect: applies the block row-wise, but we need to apply it column-wise.
2. We get the error "Instance of Array did not understand #keys".

Maybe we can implement a collectColumnWise: method, unless there is already a way to do this.
This is the implementation of the collect: method:

	| firstRow newDataFrame |

	firstRow := aBlock value: (self rowAt: 1) copy.
	newDataFrame := self class new: 0@firstRow size.
	newDataFrame columnNames: firstRow keys.

	self do: [:each | newDataFrame add: (aBlock value: each copy)].
	^ newDataFrame

We get the error because firstRow is an array when the data frame (self) contains only numerical values.

jecisc · 2023-05-22T14:26:36Z

Since #numericalColumns should return an array, there is no question of row-wise or column-wise. The block is applied to every data series in the array, and in this case each data series represents a column.
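A minimal sketch of that point, using plain Arrays in place of data series: `collect:` sent to an Array of columns visits one column per iteration. The sample data and the names `columns` and `summaries` are illustrative assumptions; `size`, `average`, `min` and `max` are the standard collection messages, not the DataSeries ones.

```smalltalk
"Plain Arrays stand in for the DataSeries that numericalColumns would answer."
| columns summaries |
columns := { #(1 2 3 4) . #(10 20 30 40) }.

"The receiver is an Array of columns, so the block sees one whole column at a time."
summaries := columns collect: [ :column |
	{ column size . column average . column min . column max } ].

summaries
```

Each element of `summaries` is then one row of statistics, which is the shape that `withRows:rowNames:columnNames:` expects in the describe method above.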

Added describe method for numeric dataframes

30c616a

jecisc requested changes May 15, 2023

Joshua-Dias-Barreto and others added 2 commits May 25, 2023 02:07

Merge branch 'PolyMathOrg:master' into describe

0a4a4a7

Implemented a cleaner and more efficient describe method.

e13f4d5

jecisc approved these changes May 24, 2023

jecisc merged commit 43fde11 into PolyMathOrg:master May 24, 2023

Joshua-Dias-Barreto deleted the describe branch May 27, 2023 10:31
