dataframe `count` and `size` groupby aggregations #2831

stress-tess · 2023-10-27T16:45:09Z

While working with @tgstevensonRedRocket, we found a discrepancy between the return types of `pd_df.groupby(...).count()` and `ak_df.groupby(...).count()`. I also don't think we have a `size` aggregation at all.

ajpotts · 2023-12-21T19:38:54Z

#############################################################################################################################
#############################################################################################################################
#############################################################################################################################

Pandas Example

#############################################################################################################################
#############################################################################################################################
#############################################################################################################################

import arkouda as ak
ak.connect()

import numpy as np
import pandas as pd

ivalues = ak.array([4, 1, 3, 2, 2, 2, 5, 5, 2, 3])

ak_df = ak.DataFrame({"nums":ivalues})
display(ak_df)

pd_df = ak_df.to_pandas()
print(pd_df)

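# arkouda: count() here returns an arkouda Series of rows per group
ak_count = ak_df.groupby("nums").count()
display(ak_count)
type(ak_count)

# pandas: count() returns a DataFrame (empty here, because "nums" is the only column)
pd_count = pd_df.groupby(["nums"]).count()
display(pd_count)
type(pd_count)

# pandas: size() returns a Series of rows per group
pd_size = pd_df.groupby(["nums"]).size()
display(pd_size)
type(pd_size)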

#############################################################################################################################

Output

#############################################################################################################################

In [2]: ivalues = ak.array([4, 1, 3, 2, 2, 2, 5, 5, 2, 3])
...:
...: ak_df = ak.DataFrame({"nums":ivalues})
...: display(ak_df)
nums
0 4
1 1
2 3
3 2
4 2
5 2
6 5
7 5
8 2
9 3 (10 rows x 1 columns)

In [3]:
...: pd_df = ak_df.to_pandas()
...: print(pd_df)
...:
nums
0 4
1 1
2 3
3 2
4 2
5 2
6 5
7 5
8 2
9 3

In [4]:
...: ak_count = ak_df.groupby("nums").count()
...: display(ak_count)
1 1
2 4
3 2
4 1
5 2
dtype: int64

In [5]: type(ak_count)
Out[5]: arkouda.series.Series

In [6]: pd_count = pd_df.groupby(["nums"]).count()
...: display(pd_count)
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5]

In [7]: type(pd_count)
Out[7]: pandas.core.frame.DataFrame

In [8]:
...: pd_size = pd_df.groupby(["nums"]).size()
...: display(pd_size)
nums
1 1
2 4
3 2
4 1
5 2
dtype: int64

In [9]: type(pd_size)
Out[9]: pandas.core.series.Series
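
The empty DataFrame from pandas above arises because `nums` is both the grouping key and the only column, so `count()` has no remaining columns to report on. A minimal pandas-only sketch, assuming a hypothetical second column `vals` (not part of the original example) with one missing value, may make the difference between `count()` and `size()` clearer:

```python
import numpy as np
import pandas as pd

# "vals" is a hypothetical extra column added for illustration only.
pd_df = pd.DataFrame(
    {
        "nums": [4, 1, 3, 2, 2, 2, 5, 5, 2, 3],
        "vals": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    }
)

grouped = pd_df.groupby("nums")

# count() returns a DataFrame of per-column non-null counts.
pd_count = grouped.count()

# size() returns a Series of rows per group, missing values included.
pd_size = grouped.size()

print(type(pd_count).__name__, type(pd_size).__name__)
print(pd_count["vals"].to_dict())
print(pd_size.to_dict())
```

With a missing value in group 3, the two aggregations should disagree for that group only, which is the behavior an arkouda `size` aggregation would presumably need to mirror.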

ajpotts · 2023-12-21T19:42:56Z

`
#############################################################################################################################
#############################################################################################################################
#############################################################################################################################

Pyspark Example

#############################################################################################################################
#############################################################################################################################
#############################################################################################################################

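import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

# Reuse the active session (the pyspark shell predefines spark) or start one
spark = SparkSession.builder.getOrCreate()

# Define the schema
schema = StructType([
    StructField("nums", IntegerType(), True)
])

# Column array of shape (10, 1), so tolist() yields one single-value row per number
ivalues = np.array([[4, 1, 3, 2, 2, 2, 5, 5, 2, 3]]).T

# Create the PySpark DataFrame
pyspark_df = spark.createDataFrame(ivalues.tolist(), schema=schema)

pyspark_count = pyspark_df.groupby("nums").count()

pyspark_count.show()

type(pyspark_count)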

#############################################################################################################################

Output

#############################################################################################################################

import numpy as np
import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType

# Define the schema

schema = StructType([
... StructField("nums", IntegerType(), True)
... ])

ivalues = np.array([[4, 1, 3, 2, 2, 2, 5, 5, 2, 3]]).T

# Create the PySpark DataFrame

pyspark_df = spark.createDataFrame(ivalues.tolist(), schema=schema)

pyspark_count = pyspark_df.groupby("nums").count()

pyspark_count.show()
+----+-----+
|nums|count|
+----+-----+
|   4|    1|
|   1|    1|
|   3|    2|
|   2|    4|
|   5|    2|
+----+-----+

type(pyspark_count)
<class 'pyspark.sql.dataframe.DataFrame'>

`
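
PySpark's grouped `count()` counts rows per group and returns a DataFrame with the key as a regular column. For comparison, a hedged pandas sketch of what looks like the closest equivalent: `size()` with `as_index=False`, which also yields a DataFrame rather than a Series. The variable name `pd_size_df` is illustrative only.

```python
import pandas as pd

pd_df = pd.DataFrame({"nums": [4, 1, 3, 2, 2, 2, 5, 5, 2, 3]})

# as_index=True (the pandas default): keys become the index, result is a Series.
pd_size = pd_df.groupby("nums").size()

# as_index=False: keys stay as a column and the counts land in a "size" column.
pd_size_df = pd_df.groupby("nums", as_index=False).size()

print(type(pd_size).__name__)
print(list(pd_size_df.columns))
```

Unlike the PySpark output, pandas sorts the group keys by default.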

ajpotts · 2023-12-21T19:46:52Z

Attaching the examples as a file as well:

2831_example.txt

…oupby().sum() to pandas (#2892)

* Closes ticket #2831 to align dataframe.groupby().size() to pandas
* clean up formatting
* remove usage of | union for dictionaries from dataframe.py because it is unsupported in Python 3.8
* fix formatting in dataframe.py
* update dataframe.GroupBy.size() and .count() to default as_series=None, and return a Series when as_index=True and as_series=None
* change default value to as_index=True in dataframe.GroupBy to match pandas
* fix a type in PROTO_tests/tests/series_test.py and other minor code efficiencies

Co-authored-by: Amanda Potts <ajpotts@users.noreply.github.com>
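
One item in the commit message above removes the `|` dict union operator, which requires Python 3.9 or later (PEP 584). A small sketch of 3.8-compatible alternatives follows; the dictionaries `left` and `right` are made-up examples, not taken from `dataframe.py`:

```python
left = {"nums": 1, "count": 2}
right = {"count": 3, "size": 4}

# Python 3.9+ only: merged = left | right
# 3.8-compatible: unpack both dicts; keys from the later dict win on conflict.
merged = {**left, **right}

# Equivalent alternative: copy first, then update, so neither input is mutated.
merged_via_update = dict(left)
merged_via_update.update(right)

print(merged)
```

Both forms leave `left` and `right` unchanged, matching the semantics of `left | right`.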

stress-tess added the bug, enhancement, and User Reported labels Oct 27, 2023

stress-tess self-assigned this Oct 27, 2023

stress-tess assigned ajpotts and jaketrookman and unassigned stress-tess Dec 19, 2023

ajpotts added a commit to ajpotts/arkouda that referenced this issue Jan 4, 2024

Closes ticket Bears-R-Us#2831 to align dataframe.groupby().size() to …

54d35b0

…pandas

ajpotts linked a pull request Jan 4, 2024 that will close this issue

Closes ticket #2831 to align dataframe.groupby().size(), dataframe.groupby().sum() to pandas #2892

Merged

stress-tess closed this as completed in #2892 Jan 8, 2024
