# Groupby

_Dr. Junaid Qazi, PhD_

Groupby is one of the most important features in pandas. It lets us group data together, call aggregate functions on each group, and combine the results, in three steps known as *split-apply-combine*. <br>
Before we move on to the hands-on part, let's try to understand how split-apply-combine works, using data shown in different colours!

* **Split:** In this step, the data contained in a pandas object (e.g. Series, DataFrame) is split into groups based on one or more keys that we provide. The splitting is performed along a particular axis of the object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1); note that grouping with axis=1 is deprecated in recent pandas versions. <br>
* **Apply:** Once the splitting is done, a function is applied to each group independently, producing a new value.
* **Combine:** Finally, the results of all those function applications are combined into a result object. The form of the resulting object usually depends on what is being done to the data.<br>
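
To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the result with `groupby`. The store names and sales figures are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "Store": ["Walmart", "Walmart", "Costco", "Costco", "Target", "Target"],
    "Sales": [150, 200, 550, 90, 600, 450],
})

# Split: one sub-DataFrame per store
groups = {store: df[df["Store"] == store] for store in df["Store"].unique()}

# Apply: compute the total sales of each group independently
totals = {store: int(group["Sales"].sum()) for store, group in groups.items()}

# Combine: collect the per-group results into a single Series
manual = pd.Series(totals, name="Sales").sort_index()
manual.index.name = "Store"

# groupby does all three steps in one call
shortcut = df.groupby("Store")["Sales"].sum()

print(manual)
print(shortcut)
```

Both printed Series should hold the same totals per store, which is the point: `groupby` is a shortcut for the manual split, apply, and combine.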

Let's explore with some examples:

```Python
import pandas as pd
```

Let's create a dictionary and convert it into a pandas DataFrame.
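
The values of `data` are not shown here, so the dictionary below is only a hypothetical stand-in. It assumes the three columns described in this section (Customer, Sales, Store) with two customers per store; the customer names, the sales figures, and the "Walmart" store are made up for illustration.

```python
# Hypothetical stand-in for the original dictionary (all values are assumptions)
data = {
    "Customer": ["Tim", "Jim", "Kim", "Sam", "Pam", "Cam"],
    "Sales": [150, 200, 550, 90, 600, 450],
    "Store": ["Walmart", "Walmart", "Costco", "Costco", "Target", "Target"],
}
```

Any dictionary with the same three keys and equal-length lists would work just as well for the cells that follow.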


```Python
df = pd.DataFrame(data)
df
```

In `df`, we have a unique customer name ("Customer"), the sales as numbers ("Sales"), and the store name ("Store"). <br>
Let's group the data in `df` by the "Store" column using the `groupby` method. This will create a `DataFrameGroupBy` object.

Grab `df`, access the `groupby` method using "." and pass the column we want to group the data on. <br>
Notice that we get a groupby object back; the output shows only its type and its memory address (0x....), not the grouped data.

```Python
df.groupby("Store")
```

Let's save the created object as a new variable. 

```Python
by_store = df.groupby("Store")
```
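
Since the groupby object does not display its contents, it can help to look inside it. The sketch below uses hypothetical data with the same three columns and shows three ways to inspect the groups: `ngroups`, iteration, and `get_group`.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "Customer": ["Tim", "Jim", "Kim", "Sam", "Pam", "Cam"],
    "Sales": [150, 200, 550, 90, 600, 450],
    "Store": ["Walmart", "Walmart", "Costco", "Costco", "Target", "Target"],
})
by_store = df.groupby("Store")

# Number of groups (one per distinct store)
print(by_store.ngroups)

# Iterating yields (key, sub-DataFrame) pairs
for store, group in by_store:
    print(store, group["Sales"].tolist())

# Pull out the rows of a single group as a DataFrame
costco = by_store.get_group("Costco")
print(costco)
```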

Now that we have the grouped data in the `by_store` object, we can call aggregate methods on it.

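```Python
# numeric_only=True restricts the mean to numeric columns such as "Sales"
by_store.mean(numeric_only=True)
```

`mean()` only makes sense for numeric columns such as "Sales". pandas versions before 2.0 dropped non-numeric columns automatically; from 2.0 onwards, calling `mean()` on a group that contains string columns raises an error unless we pass `numeric_only=True`. The same option is available on other aggregations such as `sum()`, `std()`, and `max()`.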

```Python
# The steps above in a single line of code
df.groupby('Store').mean(numeric_only=True)
```
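
Another way to avoid non-numeric columns is to select the column of interest before aggregating. The sketch below assumes hypothetical data with the same columns; selecting "Sales" first returns a Series indexed by store.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "Customer": ["Tim", "Jim", "Kim", "Sam", "Pam", "Cam"],
    "Sales": [150, 200, 550, 90, 600, 450],
    "Store": ["Walmart", "Walmart", "Costco", "Costco", "Target", "Target"],
})

# Select the numeric column first, then aggregate
mean_sales = df.groupby("Store")["Sales"].mean()
print(mean_sales)
```

Selecting the column first also avoids computing aggregates for columns we do not need.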

Notice that the result is a DataFrame with "Store" as the index and "Sales" as the column. We can use the `loc` indexer to pick out the row for a particular store after the aggregation. This gives us the value (e.g. total sales) for a single store.

```Python
# In one line of code (numeric_only=True keeps only the numeric "Sales" column)
df.groupby('Store').sum(numeric_only=True).loc["Target"]
```

We can perform a whole lot of aggregation operations on the `by_store` object.

```Python
by_store.min()
```

```Python
by_store.max()
```

```Python
# std() is undefined for strings, so restrict it to numeric columns
by_store.std(numeric_only=True)
```
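
Instead of calling each aggregation separately, several can be computed in one pass with `agg`. This is a minimal sketch on hypothetical data; the list of function names is just one possible choice.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "Customer": ["Tim", "Jim", "Kim", "Sam", "Pam", "Cam"],
    "Sales": [150, 200, 550, 90, 600, 450],
    "Store": ["Walmart", "Walmart", "Costco", "Costco", "Target", "Target"],
})

# One row per store, one column per aggregation
summary = df.groupby("Store")["Sales"].agg(["min", "max", "mean", "std"])
print(summary)
```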

```Python
# count the number of non-null values in each column; works with strings as well
# we have 2 customers and 2 sales in each store
by_store.count()
```
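
Note that `count()` skips missing values, while `size()` counts every row in the group. The sketch below uses hypothetical data with one missing sale to show the difference.

```python
import pandas as pd

# Hypothetical data with one missing sale (None becomes NaN)
df = pd.DataFrame({
    "Store": ["Walmart", "Walmart", "Costco", "Costco"],
    "Sales": [150, None, 550, 90],
})
by_store = df.groupby("Store")

# count(): non-null values per group
non_null = by_store["Sales"].count()

# size(): number of rows per group, missing values included
rows = by_store.size()

print(non_null)
print(rows)
```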

`describe()` is a useful method that gives a summary of each store's numeric columns, such as the count, mean, standard deviation, min, max, and quartile values.

```Python
by_store.describe()
```

Let's use `transpose()` after describe so that the output looks good!

```Python
by_store.describe().transpose()
```

After `transpose()`, the stores become the columns, so we can select the statistics of a single store by its name!

```Python
by_store.describe().transpose()['Costco']
```
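
The same per-store statistics can also be reached without transposing, by describing a single column and selecting the store's row with `loc`. This is a sketch on hypothetical data, not the only way to do it.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "Customer": ["Tim", "Jim", "Kim", "Sam", "Pam", "Cam"],
    "Sales": [150, 200, 550, 90, 600, 450],
    "Store": ["Walmart", "Walmart", "Costco", "Costco", "Target", "Target"],
})

# describe() on one column gives one row of statistics per store
stats = df.groupby("Store")["Sales"].describe()

# Select the row for a single store
costco = stats.loc["Costco"]
print(costco)
```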