# pandas portfolio part 4
In this session, we discuss *group by*.
First, let's ask a question: what does **grouping** some values together mean? The answer is not hard to guess:

It means that the function collects all the rows in your DataFrame that have the same value in a particular column (or set of columns) and treats them as a single group. It does this for each unique value in that column. This is like sorting your data into different "bins" based on the values in that column.
Let's set up an example:

In [2]:
import pandas as pd

In [2]:
data = {
    'Product': ['Apple', 'Apple', 'Banana', 'Banana', 'Banana', 'Cherry', 'Cherry', 'Cherry', 'Cherry', 'Banana', 'Apple', 'Cherry'],
    'Region': ['North', 'South', 'North', 'South', 'East', 'North', 'South', 'East', 'West', 'North', 'West', 'East'],
    'Sales': [200, 150, 100, 120, 130, 180, 170, 160, 190, 110, 160, 150],
    'Discount': [10, 15, 5, 10, 5, 20, 15, 10, 25, 0, 10, 20]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Product,Region,Sales,Discount
0,Apple,North,200,10
1,Apple,South,150,15
2,Banana,North,100,5
3,Banana,South,120,10
4,Banana,East,130,5
5,Cherry,North,180,20
6,Cherry,South,170,15
7,Cherry,East,160,10
8,Cherry,West,190,25
9,Banana,North,110,0
10,Apple,West,160,10
11,Cherry,East,150,20


Now, in that DataFrame, the 'Product' column contains many repeated values. Let's see what happens when *groupby* is applied to the *Product* column:

In [3]:
df.groupby('Product')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000266F28EFBD0>

#### oh, why didn't that show a DataFrame with only non-repeated data?
When you use df.groupby('Product'), it doesn’t immediately show you the manipulated or grouped data because groupby only groups the data internally without performing any calculations or manipulations yet. It essentially prepares the data to be grouped but doesn't do anything with those groups until you tell it what to do.
#### so the function at the end of the statement matters!
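
To make the "nothing is computed yet" point concrete, here is a minimal sketch of how the grouped object can be inspected before any aggregation. The `sales` DataFrame and the variable names are made up for illustration and are separate from the notebook's `df`.

```python
import pandas as pd

# Small hypothetical dataset, separate from the notebook's df
sales = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Apple', 'Cherry', 'Banana'],
    'Sales': [200, 100, 150, 180, 120],
})

# Only builds the grouping; no sums or counts are calculated here
grouped = sales.groupby('Product')

# Number of groups: one per unique Product value
n_groups = grouped.ngroups

# Which row labels ended up in which group
group_rows = {name: list(rows) for name, rows in grouped.groups.items()}

# Iterating yields (group label, sub-DataFrame) pairs
for name, part in grouped:
    print(name, part['Sales'].tolist())
```

Only when a function such as sum() is called on `grouped` does pandas actually compute something per group.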

### but wait a second...
First, you grouped the values by a column, as with this syntax:
```python
df.groupby('sth')
```
But there is more. Right after that comes the name of the column you want to select, like this:
```python
df.groupby('sth').my_column
# or
df.groupby('sth')['my_column']
```
OK, now you have chosen your column from the **grouped data**. Next you can apply any function you need, like sum(), count() and so on...
```python
df.groupby('sth').my_column.sum()
# or
df.groupby('sth')['my_column'].sum()
```
That's all. Easy as that!
There are examples below:

The groupby operation is just the first step. It’s like telling pandas, "Hey, I want to organize this data by the 'Product' column," but without telling it what to do with the organized data. Pandas is waiting for you to specify what operation you want to perform on these groups.

In [4]:
df.groupby('Product')['Sales'].sum()

Product
Apple     510
Banana    460
Cherry    850
Name: Sales, dtype: int64
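
The same group, select, aggregate pattern works with other functions too. The sketch below assumes the same Product and Sales values as the data above; `sales_df` and the result names are illustrative, and `agg` with a list of function names is used to compute several statistics in one call.

```python
import pandas as pd

# Same Product/Sales values as the notebook's data, under a separate name
sales_df = pd.DataFrame({
    'Product': ['Apple', 'Apple', 'Banana', 'Banana', 'Banana', 'Cherry',
                'Cherry', 'Cherry', 'Cherry', 'Banana', 'Apple', 'Cherry'],
    'Sales': [200, 150, 100, 120, 130, 180, 170, 160, 190, 110, 160, 150],
})

# Average sale per product
mean_sales = sales_df.groupby('Product')['Sales'].mean()

# Number of rows per product
row_counts = sales_df.groupby('Product')['Sales'].count()

# Several aggregations at once: the result is a DataFrame with one column per function
summary = sales_df.groupby('Product')['Sales'].agg(['sum', 'mean', 'max'])
print(summary)
```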

In [6]:
df.groupby(['Product', 'Region']).sum()

Product,Region,Sales,Discount
Apple,North,200,10
Apple,South,150,15
Apple,West,160,10
Banana,East,130,5
Banana,North,210,5
Banana,South,120,10
Cherry,East,310,30
Cherry,North,180,20
Cherry,South,170,15
Cherry,West,190,25


In [7]:
df.groupby(['Region', 'Product']).sum()

Region,Product,Sales,Discount
East,Banana,130,5
East,Cherry,310,30
North,Apple,200,10
North,Banana,210,5
North,Cherry,180,20
South,Apple,150,15
South,Banana,120,10
South,Cherry,170,15
West,Apple,160,10
West,Cherry,190,25
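
As the two outputs suggest, swapping the order of the grouping keys changes how the index is nested, not the totals. A small sketch, using a made-up four-row `sales_df`, that checks this and also flattens the two-level index back into ordinary columns with reset_index():

```python
import pandas as pd

# Hypothetical mini dataset with one repeated (Product, Region) pair
sales_df = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Banana', 'Cherry'],
    'Region': ['North', 'North', 'North', 'East'],
    'Sales': [200, 100, 110, 160],
})

by_product = sales_df.groupby(['Product', 'Region'])['Sales'].sum()
by_region = sales_df.groupby(['Region', 'Product'])['Sales'].sum()

# Same total either way; only the order of the index levels differs
banana_north = by_product.loc[('Banana', 'North')]
same_value = banana_north == by_region.loc[('North', 'Banana')]

# reset_index() turns the index levels back into regular columns
flat = by_product.reset_index()
print(flat)
```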


In [8]:
df

Unnamed: 0,Product,Region,Sales,Discount
0,Apple,North,200,10
1,Apple,South,150,15
2,Banana,North,100,5
3,Banana,South,120,10
4,Banana,East,130,5
5,Cherry,North,180,20
6,Cherry,South,170,15
7,Cherry,East,160,10
8,Cherry,West,190,25
9,Banana,North,110,0
10,Apple,West,160,10
11,Cherry,East,150,20


In [14]:
df.groupby('Product').Sales.sum()

Product
Apple     510
Banana    460
Cherry    850
Name: Sales, dtype: int64

In [15]:
df.groupby('Product')['Sales'].sum()

Product
Apple     510
Banana    460
Cherry    850
Name: Sales, dtype: int64

In [16]:
# the aggregated result (a Series with one total per product) can be stored in a variable
my_new_df = df.groupby('Product')['Sales'].sum()
my_new_df

Product
Apple     510
Banana    460
Cherry    850
Name: Sales, dtype: int64

### Warning!
groupby: The groupby method in pandas groups the DataFrame by the specified column(s) and returns a **DataFrameGroupBy** object, not a regular DataFrame, so you cannot call methods like sort_values directly on it. Apply an aggregation first, then sort the result.
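
A minimal sketch of the usual workaround: aggregate first, then sort the resulting Series. It assumes the same Product and Sales values as the notebook's data; the variable names are illustrative.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Product': ['Apple', 'Apple', 'Banana', 'Banana', 'Banana', 'Cherry',
                'Cherry', 'Cherry', 'Cherry', 'Banana', 'Apple', 'Cherry'],
    'Sales': [200, 150, 100, 120, 130, 180, 170, 160, 190, 110, 160, 150],
})

# Step 1: grouping only (a DataFrameGroupBy object, nothing to sort yet)
grouped = sales_df.groupby('Product')

# Step 2: aggregating produces a regular Series indexed by Product
totals = grouped['Sales'].sum()

# Step 3: now sort_values is available
ranked = totals.sort_values(ascending=False)
print(ranked)
```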

---
### sorting in pandas
Typically, there are several ways to sort data in pandas, for both DataFrames and Series. The two methods below are the most common ones:
1. **sort_values**: Sorts a DataFrame by the values in one or more columns, or a Series by its values.
2. **sort_index**: Sorts a DataFrame or Series by its index labels.

Let's review each one with some examples:

In [28]:
data = {
    'Name': ['Bob', 'Alice', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Score': [85, 90, 95, 80]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Score
0,Bob,25,85
1,Alice,30,90
2,Charlie,35,95
3,David,40,80


In [29]:
df = df.sort_values(by='Score')
# sort_values returns a new sorted DataFrame, so the result is assigned back to df;
# passing inplace=True would modify df directly instead.
df

Unnamed: 0,Name,Age,Score
3,David,40,80
0,Bob,25,85
1,Alice,30,90
2,Charlie,35,95
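
The difference between assigning the result and using the 'inplace' parameter can be sketched as follows. The `people` DataFrame and the variable names are illustrative; the point is that with inplace=True the method returns None and changes the original object.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Bob', 'Alice', 'Charlie', 'David'],
    'Score': [85, 90, 95, 80],
})

# Without inplace: a new sorted DataFrame is returned, people is left untouched
sorted_copy = people.sort_values(by='Score')
original_order = people['Name'].tolist()

# With inplace=True: people itself is reordered and the call returns None
returned = people.sort_values(by='Score', inplace=True)
new_order = people['Name'].tolist()

print(original_order, new_order, returned)
```

A common mistake is writing `df = df.sort_values(..., inplace=True)`, which replaces df with None.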


In [30]:
df.sort_values(by='Name', inplace=True)
df

Unnamed: 0,Name,Age,Score
1,Alice,30,90
0,Bob,25,85
2,Charlie,35,95
3,David,40,80


##### what about the *sort_index* function?

In [31]:
df.sort_index(inplace=True)
df

Unnamed: 0,Name,Age,Score
0,Bob,25,85
1,Alice,30,90
2,Charlie,35,95
3,David,40,80


A good point to mention is using the set_index() and sort_index() functions together.

In [32]:
df.set_index('Age', inplace=True)
df

Age,Name,Score
25,Bob,85
30,Alice,90
35,Charlie,95
40,David,80


In [33]:
df.sort_index(inplace=True)
df

Age,Name,Score
25,Bob,85
30,Alice,90
35,Charlie,95
40,David,80
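
In the run above the ages were already in ascending order, so sort_index() left the rows where they were. A small sketch where the effect is visible, using 'Name' as the index instead (the `people` and `by_name` names are illustrative) and chaining the two calls without inplace:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Bob', 'Alice', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Score': [85, 90, 95, 80],
})

# set_index makes Name the row labels; sort_index then orders rows by those labels
by_name = people.set_index('Name').sort_index()
print(by_name)
```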


As a fun fact, the sort_values() function can sort by more than one column. The column names just have to be passed as a list; later columns are used to break ties in the earlier ones:

In [35]:
df.sort_values(by=['Name', 'Score'], inplace=True)
df

Unnamed: 0_level_0,Name,Score
Age,Unnamed: 1_level_1,Unnamed: 2_level_1
30,Alice,90
25,Bob,85
35,Charlie,95
40,David,80
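
Since every name in the example is unique, the second column never gets a chance to break a tie. A sketch with a hypothetical 'Team' column that does contain repeats, also using the `ascending` parameter with one flag per column:

```python
import pandas as pd

# 'Team' is a made-up column added so that the first sort key has ties
results = pd.DataFrame({
    'Team': ['B', 'A', 'B', 'A'],
    'Name': ['Bob', 'Alice', 'Charlie', 'David'],
    'Score': [85, 90, 95, 80],
})

# Sort by Team (A to Z), and within each team by Score from highest to lowest
ordered = results.sort_values(by=['Team', 'Score'], ascending=[True, False])
print(ordered)
```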


---