dataframe count and size groupby aggregations #2831

Closed
stress-tess opened this issue Oct 27, 2023 · 3 comments · Fixed by #2892

@stress-tess
Member

While working with @tgstevensonRedRocket, we found a discrepancy between the return types of pd_df.groupby.count() and ak_df.groupby.count(). Also, I don't think we have a size aggregation at all.

stress-tess added the bug, enhancement, and User Reported labels Oct 27, 2023
stress-tess self-assigned this Oct 27, 2023
@ajpotts
Contributor

ajpotts commented Dec 21, 2023

#############################################################################################################################
#############################################################################################################################
#############################################################################################################################

# Pandas Example

#############################################################################################################################
#############################################################################################################################
#############################################################################################################################

import arkouda as ak
ak.connect()

import numpy as np
import pandas as pd

ivalues = ak.array([4, 1, 3, 2, 2, 2, 5, 5, 2, 3])

ak_df = ak.DataFrame({"nums":ivalues})
display(ak_df)

pd_df = ak_df.to_pandas()
print(pd_df)

ak_count = ak_df.groupby("nums").count()
display(ak_count)
type(ak_count)

pd_count = pd_df.groupby(["nums"]).count()
display(pd_count)
type(pd_count)

pd_size = pd_df.groupby(["nums"]).size()
display(pd_size)
type(pd_size)

#############################################################################################################################

Output

#############################################################################################################################

In [2]: ivalues = ak.array([4, 1, 3, 2, 2, 2, 5, 5, 2, 3])
...:
...: ak_df = ak.DataFrame({"nums":ivalues})
...: display(ak_df)
nums
0 4
1 1
2 3
3 2
4 2
5 2
6 5
7 5
8 2
9 3
(10 rows x 1 columns)

In [3]:
...: pd_df = ak_df.to_pandas()
...: print(pd_df)
...:
nums
0 4
1 1
2 3
3 2
4 2
5 2
6 5
7 5
8 2
9 3

In [4]:
...: ak_count = ak_df.groupby("nums").count()
...: display(ak_count)
1 1
2 4
3 2
4 1
5 2
dtype: int64

In [5]: type(ak_count)
Out[5]: arkouda.series.Series

In [6]: pd_count = pd_df.groupby(["nums"]).count()
...: display(pd_count)
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5]

In [7]: type(pd_count)
Out[7]: pandas.core.frame.DataFrame

In [8]:
...: pd_size = pd_df.groupby(["nums"]).size()
...: display(pd_size)
nums
1 1
2 4
3 2
4 1
5 2
dtype: int64

In [9]: type(pd_size)
Out[9]: pandas.core.series.Series
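The difference between the two pandas results above comes down to what each aggregation counts: size counts rows per group, while count counts non-null values in each non-key column per group. With `nums` as the only column and also the grouping key, count has no columns left to report, which is why the pandas frame is empty. The following is a plain-Python sketch of those semantics only; `group_size` and `group_count` are hypothetical helper names, not part of pandas or arkouda.

```python
from collections import Counter, defaultdict


def group_size(rows, key):
    """Rows per group (models a size aggregation)."""
    return dict(sorted(Counter(row[key] for row in rows).items()))


def group_count(rows, key):
    """Non-null values per non-key column per group (models a count aggregation)."""
    columns = [c for c in rows[0] if c != key] if rows else []
    result = {c: defaultdict(int) for c in columns}
    for row in rows:
        for c in columns:
            # Adding a bool increments the tally only for non-null values,
            # while still registering the group with a count of 0.
            result[c][row[key]] += row[c] is not None
    return {c: dict(sorted(v.items())) for c, v in result.items()}


# Same data as the issue: the key is the only column.
rows = [{"nums": n} for n in [4, 1, 3, 2, 2, 2, 5, 5, 2, 3]]
sizes = group_size(rows, "nums")
counts = group_count(rows, "nums")
print(sizes)   # {1: 1, 2: 4, 3: 2, 4: 1, 5: 2}
print(counts)  # {}

# With an extra column that contains nulls, the two aggregations diverge.
rows_with_vals = [
    {"nums": 1, "val": None},
    {"nums": 2, "val": 7},
    {"nums": 2, "val": None},
]
sizes_with_vals = group_size(rows_with_vals, "nums")
counts_with_vals = group_count(rows_with_vals, "nums")
print(sizes_with_vals)   # {1: 1, 2: 2}
print(counts_with_vals)  # {'val': {1: 0, 2: 1}}
```

Under this model, the arkouda Series shown in `Out[5]` matches what pandas returns from size() rather than from count().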

@ajpotts
Contributor

ajpotts commented Dec 21, 2023

`
#############################################################################################################################
#############################################################################################################################
#############################################################################################################################

# PySpark Example

#############################################################################################################################
#############################################################################################################################
#############################################################################################################################

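import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

# The pyspark shell predefines `spark`; in a standalone script, create it.
spark = SparkSession.builder.getOrCreate()

# Define the schema
schema = StructType([
    StructField("nums", IntegerType(), True)
])

# Transpose the single row into a column of one-element rows
ivalues = np.array([[4, 1, 3, 2, 2, 2, 5, 5, 2, 3]]).T

# Create the PySpark DataFrame
pyspark_df = spark.createDataFrame(ivalues.tolist(), schema=schema)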

pyspark_count = pyspark_df.groupby("nums").count()

pyspark_count.show()

type(pyspark_count)

#############################################################################################################################

Output

#############################################################################################################################

import numpy as np
import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType

# Define the schema

schema = StructType([
... StructField("nums", IntegerType(), True)
... ])

ivalues = np.array([[4, 1, 3, 2, 2, 2, 5, 5, 2, 3]]).T

# Create the PySpark DataFrame

pyspark_df = spark.createDataFrame(ivalues.tolist(), schema=schema)

pyspark_count = pyspark_df.groupby("nums").count()

pyspark_count.show()
+----+-----+
|nums|count|
+----+-----+
|   4|    1|
|   1|    1|
|   3|    2|
|   2|    4|
|   5|    2|
+----+-----+

type(pyspark_count)
<class 'pyspark.sql.dataframe.DataFrame'>

`

@ajpotts
Contributor

ajpotts commented Dec 21, 2023

Attaching the examples as a file as well:

2831_example.txt

ajpotts added a commit to ajpotts/arkouda that referenced this issue Jan 4, 2024
github-merge-queue bot pushed a commit that referenced this issue Jan 8, 2024
…oupby().sum() to pandas (#2892)

* Closes ticket #2831 to align dataframe.groupby().size() to pandas

* clean up formatting

* remove usage of | union for dictionaries from dataframe.py because it is unsupported in python 3.8

* fix formatting in dataframe.py

* update dataframe.GroupBy.size() and .count() to default as_series = None, and return series when as_index=True and as_series=None

* change default value to as_index=True in dataframe.GroupBy to match pandas

* fix a type in PROTO_tests/tests/series_test.py and other minor code efficiencies

---------

Co-authored-by: Amanda Potts <ajpotts@users.noreply.github.com>
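
For reference, the pandas behavior that the commit notes above align to can be checked with a short pandas-only sketch. It assumes pandas 1.1 or later, where size() with `as_index=False` returns a DataFrame, and it does not show the arkouda implementation or its `as_series` parameter.

```python
import pandas as pd

df = pd.DataFrame({"nums": [4, 1, 3, 2, 2, 2, 5, 5, 2, 3]})

# Default as_index=True: group keys become the index and size() returns a Series.
size_series = df.groupby("nums").size()

# as_index=False: group keys stay as a regular column and size() returns a DataFrame.
size_frame = df.groupby("nums", as_index=False).size()

print(type(size_series).__name__)  # Series
print(type(size_frame).__name__)   # DataFrame
print(list(size_frame.columns))    # ['nums', 'size']
```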