# Pandas DataFrame.groupby() Method:
<p>The Pandas <b>groupby()</b> method splits a DataFrame into groups based on one or more columns so that each group can be analyzed or aggregated separately. It follows a "split-apply-combine" strategy: the data is divided into groups, a function is applied to each group, and the results are combined into a new DataFrame or Series.</p>

In [1]:
# example 
import pandas as pd
data = pd.DataFrame({"Product_category":["Electronics","Furniture","Electronics","Furniture"],
                    "Sales":[200,150,300,100]})
# Group by Product_category and sum Sales within each group
result = data.groupby("Product_category")["Sales"].sum()
print(result)

Product_category
Electronics    500
Furniture      250
Name: Sales, dtype: int64


<h1>Use of groupby() method:</h1>
<p>The groupby() method in Pandas involves three main steps: <b>Splitting, Applying, and Combining</b>.</p><br>
<p><b>Splitting:</b> The DataFrame is divided into groups based on some criterion. The groups are defined by the unique values in one or more columns.<br>
<b>Applying:</b> A function is applied to each group independently. Common kinds of functions are:<br>
1. Aggregation: Calculate a summary statistic (e.g., sum, mean, count) for each group.<br>
2. Transformation: Modify the values within each group while keeping the original shape.<br>
3. Filtering: Keep or discard entire groups based on a condition.<br>
<b>Combining:</b> Finally, the results of the applied function are combined into a new DataFrame or Series.</p>
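The three steps can also be written out by hand, which may help show what groupby() does internally. The following is only a sketch on the toy data from the first example; the names pieces, totals, manual, and direct are illustrative and not part of Pandas.

```python
import pandas as pd

# Toy data mirroring the first example; values are illustrative only.
data = pd.DataFrame({
    "Product_category": ["Electronics", "Furniture", "Electronics", "Furniture"],
    "Sales": [200, 150, 300, 100],
})

# Split: one sub-DataFrame per unique category.
pieces = {key: group for key, group in data.groupby("Product_category")}

# Apply: compute a summary for each piece independently.
totals = {key: group["Sales"].sum() for key, group in pieces.items()}

# Combine: assemble the per-group results into a single Series.
manual = pd.Series(totals, name="Sales")
manual.index.name = "Product_category"

# The one-line groupby call performs the same three steps.
direct = data.groupby("Product_category")["Sales"].sum()

print(manual)
print(direct)
```

Both printed Series should hold the same totals per category. In practice the one-line form is preferable; the manual version is shown only to make the steps visible.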

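<b>Parameters of the groupby() method:</b><br>
<b>DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)</b><br>
<p>Parameters:<br>
<b>by:</b> The column label(s), mapping, or function used to determine the groups. It can be omitted only when <b>level</b> is given.<br>
<b>axis:</b> Optional, the axis to split along (default is 0, which groups rows). This parameter is deprecated since pandas 2.1.<br>
<b>level:</b> Optional, groups by one or more levels of a MultiIndex.<br>
<b>as_index:</b> Optional, whether to use the group labels as the index of the result (default is True).<br>
<b>sort:</b> Optional, whether to sort the group keys (default is True).<br>
<b>group_keys:</b> Optional, whether to add the group keys to the index of the result when calling apply() (default is True).<br>
<b>observed:</b> Optional, applies only when grouping by categorical columns: if True, only category values that actually occur in the data are shown (default is False in the signature above).<br>
<b>dropna:</b> Optional, whether to drop rows whose group key is missing (NA); with False, missing keys form their own group (default is True).</p>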
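The effect of a few of these parameters is easier to see on a small example. The sketch below uses a hypothetical four-row DataFrame (not nba.csv) with one missing team name, and compares the defaults with as_index=False, sort=False, and dropna=False.

```python
import pandas as pd

# Hypothetical toy data; None becomes a missing group key.
df = pd.DataFrame({
    "Team": ["B", "A", None, "A"],
    "Salary": [10, 20, 30, 40],
})

# Defaults: keys sorted, used as the index, rows with a missing key dropped.
default = df.groupby("Team")["Salary"].sum()

# as_index=False keeps "Team" as a regular column of the result.
flat = df.groupby("Team", as_index=False)["Salary"].sum()

# sort=False keeps groups in order of first appearance ("B" before "A").
unsorted = df.groupby("Team", sort=False)["Salary"].sum()

# dropna=False keeps the missing key as its own group.
with_na = df.groupby("Team", dropna=False)["Salary"].sum()

print(default, flat, unsorted, with_na, sep="\n\n")
```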

<b>Example 1: Grouping by a Single Column:</b><br>

In [7]:
import pandas as pd
url = r"C:\Users\ASHRAF\OneDrive\Desktop\all_of_pandas\nba.csv"
df = pd.read_csv(url)
team = df.groupby("Team")
# Print the first non-null entry of each column in every group.
print(team.first())

                                         Name  Number Position  ...  Weight                College      Salary
Team                                                            ...                                           
Atlanta Hawks                   Kent Bazemore    24.0       SF  ...   201.0           Old Dominion   2000000.0
Boston Celtics                  Avery Bradley     0.0       PG  ...   180.0                  Texas   7730337.0
Brooklyn Nets                Bojan Bogdanovic    44.0       SG  ...   216.0         Oklahoma State   3425510.0
Charlotte Hornets               Nicolas Batum     5.0       SG  ...   200.0  Virginia Commonwealth  13125306.0
Chicago Bulls                Cameron Bairstow    41.0       PF  ...   250.0             New Mexico    845059.0
Cleveland Cavaliers       Matthew Dellavedova     8.0       PG  ...   198.0           Saint Mary's   1147276.0
Dallas Mavericks              Justin Anderson     1.0       SG  ...   228.0               Virginia   1449000.0

You can also group by multiple columns by passing a list of column names:
<b>team = df.groupby(["Team", "Position"])</b>
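As a rough illustration of grouping by two columns, the sketch below builds a hypothetical four-row DataFrame that reuses the Team, Position, and Salary column names; the rows are made up and are not taken from nba.csv.

```python
import pandas as pd

# Hypothetical rows that reuse the nba.csv column names.
players = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Hawks", "Celtics"],
    "Position": ["PG", "PG", "C", "PG"],
    "Salary": [1.0, 3.0, 5.0, 7.0],
})

# Each unique (Team, Position) pair forms one group.
by_team_position = players.groupby(["Team", "Position"])

# size() counts the rows in each group; the result has a MultiIndex.
print(by_team_position.size())

# A single group can be retrieved with a tuple key.
print(by_team_position.get_group(("Hawks", "PG")))
```

Column names are case-sensitive, so the list must match the labels in the DataFrame exactly.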

<b>Example 2: Applying Aggregation with GroupBy:</b><br>
After grouping, apply aggregation functions such as sum(), mean(), min(), and max() to each group.

In [8]:
import pandas as pd
df2 = pd.read_csv(url)

# Named aggregation: each keyword becomes an output column
aggregated_data = df2.groupby(["Team", "Position"]).agg(
    total_salary=("Salary", "sum"),
    avg_salary=("Salary", "mean"),
    player_count=("Name", "count"),
)
print(aggregated_data)

                             total_salary    avg_salary  player_count
Team               Position                                          
Atlanta Hawks      C           22756250.0  7.585417e+06             3
                   PF          23952268.0  5.988067e+06             4
                   PG           9763400.0  4.881700e+06             2
                   SF           6000000.0  3.000000e+06             2
                   SG          10431032.0  2.607758e+06             4
...                                   ...           ...           ...
Washington Wizards C           24490429.0  8.163476e+06             3
                   PF          11300000.0  5.650000e+06             2
                   PG          18022415.0  9.011208e+06             2
                   SF          11158800.0  2.789700e+06             4
                   SG          11356992.0  2.839248e+06             4

[149 rows x 3 columns]


<b>Example 4: Applying Transformation Methods:</b><br>
A transformation function returns an object that is indexed the same as the original data.<br>
<p><b>Purpose: Apply group-specific operations while maintaining the original shape of the dataset.</b><br>
Unlike aggregation, which reduces each group to a single value, a transformation returns one result per original row.</p>
<b>For example, rank players within their teams based on their salaries:</b>

In [9]:
import pandas as pd
df3 = pd.read_csv(url)
# Rank players within each team by Salary
df3['Rank within Team'] = df3.groupby('Team')['Salary'].transform(lambda x: x.rank(ascending=False))
print(df3)

              Name            Team  Number Position  ...  Weight            College     Salary Rank within Team
0    Avery Bradley  Boston Celtics     0.0       PG  ...   180.0              Texas  7730337.0              2.0
1      Jae Crowder  Boston Celtics    99.0       SF  ...   235.0          Marquette  6796117.0              4.0
2     John Holland  Boston Celtics    30.0       SG  ...   205.0  Boston University        NaN              NaN
3      R.J. Hunter  Boston Celtics    28.0       SG  ...   185.0      Georgia State  1148640.0             14.0
4    Jonas Jerebko  Boston Celtics     8.0       PF  ...   231.0                NaN  5000000.0              5.0
..             ...             ...     ...      ...  ...     ...                ...        ...              ...
453   Shelvin Mack       Utah Jazz     8.0       PG  ...   203.0             Butler  2433333.0              8.0
454      Raul Neto       Utah Jazz    25.0       PG  ...   179.0                NaN   900000.0          
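The difference in shape between aggregation and transformation can be checked on a small example. This sketch uses hypothetical salaries (not nba.csv), and the column names "Team mean" and "Diff from team mean" are made up for illustration.

```python
import pandas as pd

# Hypothetical salaries, not taken from nba.csv.
df = pd.DataFrame({
    "Team": ["A", "A", "B", "B", "B"],
    "Salary": [10.0, 30.0, 3.0, 6.0, 9.0],
})

grouped = df.groupby("Team")["Salary"]

# Aggregation collapses each group to one value: 2 rows, one per team.
team_mean = grouped.mean()

# Transformation broadcasts the group result back to every row: 5 rows.
df["Team mean"] = grouped.transform("mean")
df["Diff from team mean"] = df["Salary"] - df["Team mean"]

print(team_mean)
print(df)
```

Because the transformed result lines up with the original index, it can be assigned directly as a new column, as in the ranking example above.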

<b>Example 5: Filtering Groups Using Filtration Methods:</b><br>
<p><b>Filtration drops entire groups based on a condition evaluated on each group.</b><br>
This helps clean data by removing groups that do not meet specific criteria, so the analysis focuses on the relevant subsets.</p>
<b>Example: keep only the teams whose average player salary is at or above a threshold.</b>

In [12]:
import pandas as pd 
df4 = pd.read_csv(url)

# Filter groups where the average Salary is >= 5 million
filtered_df = df4.groupby('Team').filter(lambda x: x['Salary'].mean() >= 5000000)
print(filtered_df)

                  Name                   Team  Number Position  ...  Height Weight         College      Salary
76     Leandro Barbosa  Golden State Warriors    19.0       SG  ...     6-3  194.0             NaN   2500000.0
77     Harrison Barnes  Golden State Warriors    40.0       SF  ...     6-8  225.0  North Carolina   3873398.0
78        Andrew Bogut  Golden State Warriors    12.0        C  ...     7-0  260.0            Utah  13800000.0
79           Ian Clark  Golden State Warriors    21.0       SG  ...     6-3  175.0         Belmont    947276.0
80       Stephen Curry  Golden State Warriors    30.0       PG  ...     6-3  190.0        Davidson  11370786.0
..                 ...                    ...     ...      ...  ...     ...    ...             ...         ...
422      Cameron Payne  Oklahoma City Thunder    22.0       PG  ...     6-3  185.0    Murray State   2021520.0
423     Andre Roberson  Oklahoma City Thunder    21.0       SG  ...     6-7  210.0        Colorado   1210800.0
4
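To see which rows filter() keeps, it can help to run it on data small enough to check by eye. The sketch below uses hypothetical salaries for two made-up teams and also shows one possible alternative, a boolean mask built with transform(), which should select the same rows here.

```python
import pandas as pd

# Hypothetical salaries, not taken from nba.csv.
df = pd.DataFrame({
    "Team": ["A", "A", "B", "B"],
    "Salary": [6_000_000, 8_000_000, 1_000_000, 2_000_000],
})

# The function receives each group as a DataFrame and returns one boolean.
# All rows of a group are kept or dropped together.
kept = df.groupby("Team").filter(lambda g: g["Salary"].mean() >= 5_000_000)

# Alternative: broadcast the group mean to every row and use it as a mask.
mask = df.groupby("Team")["Salary"].transform("mean") >= 5_000_000
same = df[mask]

print(kept)
print(same)
```

Team A has an average of 7 million and is kept; team B averages 1.5 million and is dropped. The mask form avoids calling a Python function once per group, which may matter on data with many groups.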