# Groupby

The `groupby` method lets you group rows of data together and call aggregate functions on each group.

### Import `numpy` with the alias `np` and `pandas` with the alias `pd`
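
A minimal sketch of the import cell this heading asks for, using the conventional aliases. It assumes both libraries are already installed in the environment, which is the case on Google Colab.

```python
# Conventional aliases used throughout the rest of this notebook
import numpy as np
import pandas as pd
```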

### Load Data

<hr>

##### Mount Drive - **Google Colab Only Step**

When using Google Colab, we need to mount Google Drive before we can access the files stored there. Run the Python cell below, click the link it generates, and paste the code it gives you back into the cell.



```python
from google.colab import drive
drive.mount('/content/drive')
```

##### Change Directory To Access The Dependent Files - **Google Colab Only Step**

```python
directory = "student"
if directory == "student":
  %cd drive/Colab\ Notebooks/intro-to-python/
else:
  %cd drive/Shared\ drives/Rubrik/Data\ Science\ Track/intro-to-python
```

#### `.read_csv()`: Load `csv` data into a `pandas` DataFrame called `ri`.

Directory system keys:

`./` - current directory 

`../` - parent directory (one level up); **Note** that the current folder lives inside the parent folder.


The data is in a folder called `data`, which lives inside the current directory. The filename is `rhode-island-police-stops.csv`.

```python
# Fix Me!
path_to_file = "path/to/rhode-island-police-stops.csv"

ri = pd.read_csv(path_to_file)
```

<br>

### Explore Columns.
> Before we can group `rows`, we should have an idea of which `column` we want to `group` by and which summary statistics we're interested in exploring.

```python
ri.dtypes
```

We will group by `driver_race`, so it is important to know how many unique groups we will be working with.
<br>
Use the `.unique()` method on the `driver_race` column to see those groups!

```python 
ri['driver_race'].unique()
```
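
To make the idea concrete without the real file, here is a small sketch on a hypothetical toy DataFrame (the values are invented for illustration, not taken from `ri`). It shows that `.unique()` lists missing values as their own entry, while `.nunique()` skips them by default.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({"driver_race": ["White", "Black", "Asian", "White", None]})

groups = toy["driver_race"].unique()      # distinct values, including the missing one
n_groups = toy["driver_race"].nunique()   # count of distinct values, missing excluded

print(groups)
print(n_groups)
```

Rows with a missing `driver_race` are also left out of the groups by default when calling `.groupby()`, so the `.nunique()` count is the one that matches the number of groups.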

<br>

### `.groupby()`

**Let's use the `.groupby()` method to group `rows` together based on `driver_race`.**
> Grouping `rows` by `driver_race` creates a `DataFrameGroupBy` object that we can then call aggregate statistics on!

```python
ri.groupby('driver_race')
```

**You can save this object as a new variable:**

```python
by_race = ri.groupby("driver_race")
print(by_race)
```
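
Printing the object only shows its type and memory address, because nothing is computed until an aggregate is requested. As a sketch on a hypothetical toy DataFrame (invented values), `.ngroups` and `.size()` give a quick look at what the grouping contains.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "driver_race": ["White", "Black", "Asian", "White", "Asian"],
    "driver_age": [30, 40, 25, 50, 35],
})

by_race_toy = toy.groupby("driver_race")

n_groups = by_race_toy.ngroups   # number of distinct groups
sizes = by_race_toy.size()       # number of rows in each group

print(n_groups)
print(sizes)
```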

**And then call aggregate methods off the object:**

Now that the grouped rows are stored in a `groupby` object, we can ask for summary statistics.

Calling `.mean()` on `by_race` asks for the `.mean()` of every numerical feature in `ri`, computed separately for each group in `driver_race`.
<br>
This allows us to answer questions like:
* What was the average `driver_age` for all `White` drivers?
* What is the average rate of `contraband_found` for each race?
* etc...

```python
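# numeric_only=True skips text columns, which pandas 2.0+ no longer drops automatically
by_race.mean(numeric_only=True)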
```
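
The two questions above can be sketched on a hypothetical toy DataFrame (all values and the column contents are invented for illustration). The mean of a boolean column is the share of `True` values, which is how a rate falls out of `.mean()`.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "driver_race": ["White", "Black", "Asian", "White", "Asian"],
    "driver_age": [30, 40, 25, 50, 35],
    "contraband_found": [True, False, False, False, False],
    "stop_outcome": ["Citation", "Warning", "Citation", "Citation", "Warning"],
})

# numeric_only=True keeps numeric and boolean columns and skips the text column
means = toy.groupby("driver_race").mean(numeric_only=True)

avg_age_white = means.loc["White", "driver_age"]        # average age of White drivers
rate_white = means.loc["White", "contraband_found"]     # share of True values

print(means)
```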

<br>

#### All together now!

```python
ri.groupby('driver_race').mean(numeric_only=True)
```

<hr>
<br>
<br>

**More examples of aggregate methods:**

`by_race` holds all rows of the dataframe, grouped by race. So when we call the aggregate method `.std()`, it returns the `standard deviation` of each numerical column for every race.

```python
by_race.std(numeric_only=True)
```

<br>

`by_race` grouped all rows in the `ri` dataframe by race. So when we call the aggregate method `.min()`, it returns the `minimum value` in each column for every race. For the numerical columns, this tells us the age of the youngest person of each race in the dataframe, the shortest stop duration, etc...

```python
by_race.min()
```

<br>

`by_race` grouped all rows in the `ri` dataframe by race. So when we call the aggregate method `.max()`, it returns the `maximum value` in each column for every race. For the numerical columns, this tells us the age of the oldest person of each race in the dataframe, the longest stop duration, etc...

```python
by_race.max()
```
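
As a sketch on a hypothetical toy DataFrame (invented values), selecting a single column from the `groupby` object before aggregating keeps the output focused on the statistic of interest, here the youngest and oldest driver per group.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "driver_race": ["White", "Black", "Asian", "White", "Asian"],
    "driver_age": [30, 40, 25, 50, 35],
})

by_race_toy = toy.groupby("driver_race")

youngest = by_race_toy["driver_age"].min()   # one minimum age per race
oldest = by_race_toy["driver_age"].max()     # one maximum age per race

print(youngest)
print(oldest)
```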

<br>

`by_race` grouped all rows in the `ri` dataframe by race. So when we call the aggregate method `.describe()`, it returns the `descriptive statistics` for all numerical columns in the dataframe, grouped by race. Note: `.describe()` combines things like `.count()`, `.mean()`, `.std()`, `.min()`, `.max()`, and the median (shown as `50%`) into a single method call, where each row in the resulting dataframe represents a race from the original dataframe.

```python
by_race.describe()
```
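
A short sketch on a hypothetical toy DataFrame (invented values) shows the shape of the result: one row per group, and a two-level column index pairing each numerical column with each statistic.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "driver_race": ["White", "Black", "Asian", "White", "Asian"],
    "driver_age": [30, 40, 25, 50, 35],
})

desc = toy.groupby("driver_race").describe()

# Statistics reported for the driver_age column
stat_names = list(desc["driver_age"].columns)

# A single cell is addressed by (row label, (column, statistic))
white_mean_age = desc.loc["White", ("driver_age", "mean")]

print(desc)
```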

<br>

#### We can also get descriptive statistics on categorical columns

Do this by invoking the `describe` method and passing the following argument:
```python 
by_race.describe(include=object)
```



Calling `.transpose()` on a `.describe()` dataframe swaps the `row` and `column` `indices`, so instead of each `row` being a race, each `column` is now a race.
<br>
This is particularly useful when the result has a `large number of columns` and a `small number of rows`: after transposing, the many statistics run down the page instead of across it, which makes them easier to read.

```python
by_race.describe().transpose()
```
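
The swap is easy to confirm on a hypothetical toy DataFrame (invented values) by comparing shapes before and after transposing.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "driver_race": ["White", "Black", "Asian", "White", "Asian"],
    "driver_age": [30, 40, 25, 50, 35],
})

desc = toy.groupby("driver_race").describe()
desc_t = desc.transpose()

print(desc.shape)             # (groups, statistics)
print(desc_t.shape)           # (statistics, groups)
print(list(desc_t.columns))   # the races are now the column labels
```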

<br>

Since we `transposed` the descriptive statistics dataframe, each race is now a column, so we can grab the descriptive statistics for a single race with the following code. Note that it was the `.transpose()` that made the column indexing of `['Asian']` possible.

```python
by_race.describe().transpose()['Asian']
```


#### If we hadn't transposed the returned DataFrame, we would need the `.loc` bracket notation to select the desired race's numerical summary statistics.

```python 
by_race.describe().loc['Other']
```
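
The two routes lead to the same values. A sketch on a hypothetical toy DataFrame (invented values) compares row selection with `.loc` against column selection after `.transpose()`.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "driver_race": ["White", "Black", "Asian", "White", "Asian"],
    "driver_age": [30, 40, 25, 50, 35],
})

desc = toy.groupby("driver_race").describe()

via_loc = desc.loc["Asian"]                  # select the row labelled "Asian"
via_transpose = desc.transpose()["Asian"]    # select the column labelled "Asian"

same = via_loc.equals(via_transpose)
asian_mean_age = via_loc[("driver_age", "mean")]

print(same)
```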


### We can also apply a chosen set of aggregate functions with the `agg` method by passing in a list of the functions we want applied to each group

Remember that `pandas` is built on top of `numpy`, so once we have used pandas' `groupby` method to create a `groupby` object, we can pass numpy's prebuilt functions to `agg` to understand our dataset better.

[List of numpy calculation methods to consider](https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html#calculation)

```python
by_race.agg([np.mean, np.max, ...]) # Pass the functions themselves; do not call them with ()
```
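
Here is a minimal runnable sketch of `agg` on a hypothetical toy DataFrame (invented values). It passes the aggregations as string names rather than numpy functions, an assumption made because recent pandas releases emit a warning when `np.mean` and similar callables are passed to `agg`. The string names select the equivalent built-in aggregations.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "driver_race": ["White", "Black", "Asian", "White", "Asian"],
    "driver_age": [30, 40, 25, 50, 35],
})

# One row per race, one column per requested aggregation
summary = toy.groupby("driver_race")["driver_age"].agg(["mean", "max"])

print(summary)
```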