@Joshua-Dias-Barreto
Collaborator

This method is similar to pandas' DataFrame.describe() and is useful for getting a statistical description of the data frame.
Tests for this method are included.
This fixes issue #167.

@olekscode
Member

olekscode commented May 15, 2023

Based on my comments, I would rewrite this method in the following way:

DataFrame >> describe
    "Answer another data frame with statistics describing the columns of this data frame"
    
    | numericalColumns numericalColumnNames contents column |

    numericalColumns := self columns
        "TODO: Remove this line when columns returns a collection of series"
        collect: [ :column | column asDataSeries ]
        thenSelect: [ :column | column isNumerical ].
        
    "TODO: Rewrite this when columns returns a collection of series"
    numericalColumnNames := self columnNames select: [ :name |
        (self column: name) isNumerical ].

    contents := numericalColumns collect: [ :each |
        "TODO: Remove this line when specific statistical methods (average, stdev, etc.) can handle nils"
        column := each removeNils.
    
        {     
            column countNonNils .
            column average .
            column stdev .
            column min .
            column firstQuartile .
            column secondQuartile .
            column thirdQuartile .
            column max .
            column calculateDataType
        } ].

    ^ DataFrame 
        withRows: contents
        rowNames: numericalColumnNames
        columnNames: #(count mean std min '25%' '50%' '75%' max dtype).

However, my image does not have the countNonNils method.
Maybe it was added recently to the dev version.

@olekscode
Member

It also suggests three more issues:

  1. DataFrame >> columns should return a collection of series and not a collection of arrays. Each series should remember the column name

  2. Maybe also add a method DataFrame >> numericalColumns that will simply return self columns select: [ :column | column isNumerical ] and also DataFrame >> numericalColumnNames that would return self numericalColumns collect: [ :column | column name ]

  3. Make sure that statistical methods (average, stdev, etc.) can handle nil values. For example, #(2 nil 3 1 4) asDataSeries average should return 2.5
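As a rough illustration of point 3, the sketch below computes a nil-skipping average on a plain Array rather than a DataSeries. It relies only on standard collection messages (reject:, sum, size). The temporary names are made up for this example, and this is not meant to show how DataSeries >> average is actually implemented.

```smalltalk
| values nonNils |
values := #(2 nil 3 1 4).

"Drop the nils first, then average what is left."
nonNils := values reject: [ :each | each isNil ].

"10 / 4 answers the Fraction 5/2, which is equal to 2.5."
nonNils sum / nonNils size
```

A column that contains only nils leaves nothing to average here, so a real implementation would need to decide what to answer in that case.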

If that is done, we can rewrite the describe method like this:

DataFrame >> describe
    "Answer another data frame with statistics describing the columns of this data frame"
    
    | contents |

    contents := self numericalColumns collect: [ :column |    
        {     
            column countNonNils .
            column average .
            column stdev .
            column min .
            column firstQuartile .
            column secondQuartile .
            column thirdQuartile .
            column max .
            column calculateDataType
        } ].

    ^ DataFrame 
        withRows: contents
        rowNames: self numericalColumnNames
        columnNames: #(count mean std min '25%' '50%' '75%' max dtype).

Member

@jecisc jecisc left a comment

I left some comments on the general way to code in Smalltalk, but the change proposed by Oleks seems good.

I think we can optimize it further, but it is better to deal with that later.

"method to statistically describe a numerical dataframe"

| nCol nRow describeDF col count dtype |
nCol := self numberOfColumns.
Member

In Smalltalk, the convention is to avoid abbreviations because they make the code harder to understand. I prefer to read "numberOfColumns" instead of "nCol", for example.

The philosophy of Smalltalk is to keep the code as close to English as possible so that it is as easy as possible for the developer to read, and abbreviations work against that.

You might gain a few seconds while writing the code (and even that is not certain with auto-completion), but in the future we might lose minutes reading it.

Comment on lines 970 to 986
describeDF at: i at: 1 put: count.

describeDF at: i at: 2 put: mean.

describeDF at: i at: 3 put: std.

describeDF at: i at: 4 put: mini.

describeDF at: i at: 5 put: fQ.

describeDF at: i at: 6 put: sQ.

describeDF at: i at: 7 put: tQ.

describeDF at: i at: 8 put: maxi.

describeDF at: i at: 9 put: dtype ].
Member

Oleks proposed a better implementation, but for your information, in general we would write it like this:

{ count . mean . std . mini . fQ . sQ . tQ . maxi . dtype } collectWithIndex: [ :stat :index | describeDF at: i at: index put: stat ]
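The same index-driven pattern can be tried in a playground with a plain Array standing in for one row of the result. This is only a sketch: the values are made up, row is a hypothetical stand-in for the data frame row, and withIndexDo: is used instead of collectWithIndex: because the answer of the block is not needed.

```smalltalk
| stats row |
"Made-up statistics for a single column."
stats := { 5 . 2.8 . 1.3 }.
row := Array new: stats size.

"Copy each statistic into the slot that matches its position."
stats withIndexDo: [ :stat :index | row at: index put: stat ].
row
```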

@jecisc
Member

jecisc commented May 15, 2023

> It also suggests three more issues:
>
> 1. `DataFrame >> columns` should return a collection of series and not a collection of arrays. Each series should remember the column name

Yes! But the current implementation of #columns should be kept as #arrayOfColumns, and most callers should use that one. I optimized #columns a while ago because it can take a long time on big data frames, and a lot of things were slow because of it. So when a DataSeries is not needed, we should use an Array, because creating a DataSeries takes more time.

> 2. Maybe also add a method `DataFrame >> numericalColumns` that will simply return `self columns select: [ :column | column isNumerical ]` and also `DataFrame >> numericalColumnNames` that would return `self numericalColumns collect: [ :column | column name ]`

I was going to propose that, but I think there is a much faster way: ((self dataTypes at: column) includesBehavior: Number), because DataSeries>>isNumerical scans the full DataSeries.
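To make the difference concrete, here is a small playground sketch. The class-level check asks a single question about the stored type, while the element-level check has to visit every value. The literal array is made up, and allSatisfy: with isNumber is only a stand-in for what DataSeries>>isNumerical is described as doing, not its actual implementation.

```smalltalk
"Class-level checks: no element of the column is visited."
SmallInteger includesBehavior: Number.   "true"
Float includesBehavior: Number.          "true"
String includesBehavior: Number.         "false"

"Element-level check: every value is tested, so the cost grows with the column size."
#(1 2.5 3) allSatisfy: [ :each | each isNumber ]   "true"
```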

> 3. Make sure that statistical methods (`average`, `stdev`, etc.) can handle `nil` values. For example, `#(2 nil 3 1 4) asDataSeries average` should return `2.5`

I think I already opened an issue about that. If I didn't, I simply forgot :(
IIRC, I already handled nils in #average.

@Joshua-Dias-Barreto
Collaborator Author

> Yes! But the current implementation of #columns should be kept as #arrayOfColumns, and most callers should use that one. I optimized #columns a while ago because it can take a long time on big data frames, and a lot of things were slow because of it. So when a DataSeries is not needed, we should use an Array, because creating a DataSeries takes more time.

Should I change the implementation of #asArrayOfColumns in DataFrame from
^ contents asArrayOfColumns to ^ contents asDataSeries?

Or should I change the implementation of #columns, which is currently
^ self asArrayOfColumns, so that it returns a collection of series?

@jecisc
Member

jecisc commented May 15, 2023

#columns should return a collection of DataSeries
#arrayOfColumns should return an array of arrays

I thought we had a method named #rows, but apparently we do not. Sorry for the confusion.

For #columns, we could start with a simple implementation:

    self ifEmpty: [ ^ #() ].

    ^ self columnsFrom: 1 to: self numberOfColumns

I don't know whether this would be the best option in terms of performance, but we can do a first implementation with tests and iterate later if performance is poor on big data.

If you want to make those changes, I can suggest two approaches:

  • Either we integrate this PR with the version of the method given by Oleks here: Added describe method for numeric dataframes #226 (comment). After that, we add the 3 missing features and update this method.
  • Or you do one PR for each improvement/missing feature, and once that is done you merge master into your branch and update your PR.

I would prefer that we do not tackle 4 issues in 1 PR.

@Joshua-Dias-Barreto
Collaborator Author

Okay thanks, I think I will do one PR for each improvement/missing feature and then I will update this PR.

@Joshua-Dias-Barreto
Collaborator Author

> However, my image does not have the countNonNils method.
> Maybe it was added recently to the dev version.

Yes, I made a PR for this a few days ago.

@Joshua-Dias-Barreto
Collaborator Author

> For #columns, we could start with a simple implementation:
>
>     self ifEmpty: [ ^ #() ].
>
>     ^ self columnsFrom: 1 to: self numberOfColumns

For some reason, when I try to use columns after implementing it like this, my image crashes.

When I tried to implement it like this:

^ self columnsFrom: 1 to: self numberOfColumns

it gives an out-of-bounds error, even if the number of columns is greater than 1.

Also, is there any difference between just ^ self and ^ self columnsFrom: 1 to: self numberOfColumns?

@jecisc
Member

jecisc commented May 16, 2023

^ self will return the current instance (self). If there is no return in the method, this is what is returned by default.

^ self columnsFrom: 1 to: self numberOfColumns will return the result of #columnsFrom:to:.

Does the image freeze when you save the method or when you run the tests?

If it happens while running the tests, I think this change hurts performance a lot and some tests take a very long time (because the methods that currently use #columns should use #asArrayOfColumns, which is much faster). The image then seems frozen because the tests are long (we have a time limit for the tests, but it is 10 seconds per test by default).

@Joshua-Dias-Barreto
Collaborator Author

It freezes when I run the tests, and also when I try to use it in the playground.

There is no issue while saving the method.

@Joshua-Dias-Barreto
Collaborator Author

@jecisc could you let me know your thoughts on this implementation of numericalColumns and numericalColumnNames?

numericalColumnNames

	^ self columnNames select: [ :columnName |
		  (self dataTypes at: columnName) includesBehavior: Number ]

numericalColumns

	^ self columns: self numericalColumnNames

We cannot use this directly for numericalColumns:

> > Maybe also add a method DataFrame >> numericalColumns that will simply return self columns select: [ :column | column isNumerical ] and also DataFrame >> numericalColumnNames that would return self numericalColumns collect: [ :column | column name ]
>
> I was going to propose that, but I think there is a much faster way: ((self dataTypes at: column) includesBehavior: Number), because DataSeries>>isNumerical scans the full DataSeries.

because dataTypes at: requires the column name.

@jecisc
Member

jecisc commented May 20, 2023

Those two methods seem good to me!

@Joshua-Dias-Barreto
Collaborator Author

> If that is done, we can rewrite the describe method like this:
>
>     DataFrame >> describe
>         "Answer another data frame with statistics describing the columns of this data frame"
>
>         | contents |
>
>         contents := self numericalColumns collect: [ :column |
>             {
>                 column countNonNils .
>                 column average .
>                 column stdev .
>                 column min .
>                 column firstQuartile .
>                 column secondQuartile .
>                 column thirdQuartile .
>                 column max .
>                 column calculateDataType
>             } ].
>
>         ^ DataFrame
>             withRows: contents
>             rowNames: self numericalColumnNames
>             columnNames: #(count mean std min '25%' '50%' '75%' max dtype).

There are a few issues with this kind of implementation:

  1. collect: applies the block row-wise, but we need to apply it column-wise. Maybe we can implement a collectColumnWise: method, unless there is already a way to do this.
  2. We get the error "Instance of Array did not understand #keys". This is the implementation of the collect: method:

	| firstRow newDataFrame |

	firstRow := aBlock value: (self rowAt: 1) copy.
	newDataFrame := self class new: 0@firstRow size.
	newDataFrame columnNames: firstRow keys.

	self do: [:each | newDataFrame add: (aBlock value: each copy)].
	^ newDataFrame

We get an error because firstRow is an array when the data frame (self) contains only numerical values.

@jecisc
Member

jecisc commented May 22, 2023

Since #numericalColumns should return an array, there is no question of row-wise or column-wise: collect: is applied to each data series in the array, and in this case each data series represents a column.
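A minimal playground sketch of that idea, assuming each column is available as a plain collection: the nested literal arrays below are made-up stand-ins for the data series that #numericalColumns would answer, and only a few of the statistics are shown.

```smalltalk
| columns |
"Each inner array stands in for one numerical column."
columns := #( #(1 2 3) #(10 20 60) ).

"Array >> collect: visits each column once and answers one row of statistics per column."
columns collect: [ :column |
    { column size . column average . column min . column max } ]
```

Because the receiver is an Array and not a DataFrame, the DataFrame >> collect: implementation quoted earlier (the one that sends #keys to the first row) is never involved.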

@jecisc jecisc merged commit 43fde11 into PolyMathOrg:master May 24, 2023
@Joshua-Dias-Barreto Joshua-Dias-Barreto deleted the describe branch May 27, 2023 10:31