# Transforming, Grouping & Sorting Data

In this notebook we look at how to transform data held in a Series or a DataFrame using `map()` and `apply()`, respectively. Both methods return transformed data without modifying the original data. We will also look further at `groupby()` and `sort_values()` to pull out the data points we are interested in.

Notebook adapted from Wendy Lee 2022

## Data transformation
A **map** is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often: `map()` and `apply()`


In [2]:
## Import libraries
import pandas as pd

In [17]:
## Data set open with pandas
wine_filepath="https://gist.githubusercontent.com/clairehq/79acab35be50eaf1c383948ed3fd1129/raw/407a02139ae1e134992b90b4b2b8c329b3d73a6a/winemag-data-130k-v2.csv"

wine = pd.read_csv(wine_filepath)
wine.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


### Clean up data
If you notice, the first column does not provide any information and appears to be redundant with the index. Let's drop the first column using the pandas DataFrame method [`drop()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html).  

The `drop()` method returns a new DataFrame by default. If you want to modify the current DataFrame instead, set the argument `inplace=True`. The `drop()` method can drop either rows or columns; the default is rows (`axis=0`). In this case, we need to set `axis=1` to drop a column.
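As a small illustration of the `drop()` defaults described above, here is a sketch on a toy DataFrame. The `toy` frame and its `extra_index` column are made up for this example and are not part of the wine data.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine DataFrame
toy = pd.DataFrame({
    "extra_index": [0, 1, 2],
    "country": ["Italy", "Portugal", "US"],
    "points": [87, 87, 90],
})

# Default behaviour: drop() returns a new DataFrame and leaves `toy` untouched
trimmed = toy.drop(labels=["extra_index"], axis=1)

# Equivalent spelling using the `columns` keyword instead of labels + axis
trimmed_kw = toy.drop(columns=["extra_index"])

# With the default axis=0, the labels refer to rows (index values)
without_first_row = toy.drop(labels=[0])

print(list(toy.columns))              # ['extra_index', 'country', 'points']
print(list(trimmed.columns))          # ['country', 'points']
print(list(without_first_row.index))  # [1, 2]
```

Assigning the result to a new name, as done here, keeps the original frame available for comparison; `inplace=True` trades that away for brevity.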

In [18]:
## Remove first column directly in wine dataframe
wine.drop(labels=wine.columns[[0]], axis=1, inplace=True)
wine.info()

## Limit how many rows pandas displays with 'display.max_rows'. A value of None shows all rows
pd.set_option("display.max_rows", 5)
wine

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65499 entries, 0 to 65498
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country                65467 non-null  object 
 1   description            65499 non-null  object 
 2   designation            46588 non-null  object 
 3   points                 65499 non-null  int64  
 4   price                  60829 non-null  float64
 5   province               65467 non-null  object 
 6   region_1               54744 non-null  object 
 7   region_2               25170 non-null  object 
 8   taster_name            51856 non-null  object 
 9   taster_twitter_handle  49467 non-null  object 
 10  title                  65499 non-null  object 
 11  variety                65499 non-null  object 
 12  winery                 65499 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 6.5+ MB


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...,...
65497,US,This wine wears its 15.8% alcohol better than ...,Block 24,90,31.0,California,Napa Valley,Napa,,,Hendry 2004 Block 24 Primitivo (Napa Valley),Primitivo,Hendry
65498,Spain,"A unique take on Manzanilla Sherry, which is o...",Manzanilla,90,10.0,Andalucia,Jerez,,Michael Schachner,@wineschach,Bodegas Dios Baco S.L. NV Manzanilla Sherry (J...,Sherry,Bodegas Dios Baco S.L.



## `map()` ##
The Series method [`map()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html) maps the values of a Series according to an input function and returns a new Series.   

> `new_series = a_series.map(some_function)`

**Scenario**: Suppose that we want to re-mean the wine points to zero by subtracting the mean score from the score of each wine. Once the mean has been subtracted, we can see how the points vary about the mean.

We will be using a <i>**lambda function**</i> in this example. You can review Python lambda functions [here](https://www.w3schools.com/python/python_lambda.asp).

In [15]:
## Example function that adds 5 to argument passed in
def fxn_add5(x):
    return x+5
print(fxn_add5(10))

## Rewrite the function above as a lambda function
## lambda arg: expression
arg_add5 = lambda x: x+5
print(arg_add5(10))

15
15


In [19]:
## Re-mean the wine points by subtracting the mean from each point value.

# First, find the mean
wine_points_mean = wine.points.mean()
wine_points_mean

88.43403716087269

In [20]:
## Define the re-meaning function
def remean(point):
    return point - wine_points_mean

In [21]:
## Re-mean the points column using map()
remeaned_pts = wine.points.map(remean)
remeaned_pts

Unnamed: 0,points
0,-1.434037
1,-1.434037
2,-1.434037
3,-1.434037
...,...
65495,1.565963
65496,1.565963
65497,1.565963
65498,1.565963


In [None]:
# Alternatively, we can use a lambda function instead of defining
# a remean function

remeaned_pts = wine.points.map(lambda p: p - wine_points_mean)
remeaned_pts

Unnamed: 0,points
0,-1.434037
1,-1.434037
2,-1.434037
3,-1.434037
...,...
65495,1.565963
65496,1.565963
65497,1.565963
65498,1.565963


In [None]:
## Alternative without map(): subtract wine_points_mean from the Series directly
remeaned_points = wine.points - wine_points_mean  # faster than map or apply
remeaned_points

Unnamed: 0,points
0,-1.434037
1,-1.434037
2,-1.434037
3,-1.434037
...,...
65495,1.565963
65496,1.565963
65497,1.565963
65498,1.565963


In [22]:
## New column in our df for the values we calculated
wine['remeaned_pts'] = remeaned_pts

In [23]:
wine.head(2)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,remeaned_pts
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,-1.434037
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,-1.434037


The function you pass to `map()` should <u>expect a single value from the Series</u> (a point value, in the above example), and return a transformed version of that value. `map()` returns a new Series where all the values have been transformed by your function.
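To make the "one value in, one value out" contract concrete, here is a minimal sketch on a toy Series (the values and the `labels` mapping are invented for illustration). It also shows that `Series.map()` accepts a dict in place of a function, in which case values without a matching key become `NaN`.

```python
import pandas as pd

# Hypothetical toy Series standing in for wine.points
points = pd.Series([87, 90, 82], name="points")

# map() calls the function once per value and collects the results
doubled = points.map(lambda p: p * 2)

# A dict works as a lookup table; 82 has no entry, so it maps to NaN
labels = points.map({87: "good", 90: "great"})

print(doubled.tolist())  # [174, 180, 164]
print(labels.tolist())   # ['good', 'great', nan]
print(points.tolist())   # [87, 90, 82]  (the original Series is unchanged)
```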

#### Create a new column called 'stars', which stores the number of stars each wine earns based on the points it received.
-  \>85 - 5 stars
-  \>80 and \<=85 - 4 stars
-  \>75 and \<=80 - 3 stars
-  75 or below - 2 stars

In [24]:
# create function to do transformation for map()
# since we are only looking at points, we don't need to use apply
def get_stars(x): # x is going to be a value from the Series
    if x > 85:
        return 5
    elif 80 < x <= 85:
        return 4
    elif 75 < x <= 80:
        return 3
    else:
        return 2

stars = wine.points.map(get_stars) # should return a Series that contains the star values

wine['stars'] = stars # Add a new column stars to the DataFrame wine
wine.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,remeaned_pts,stars
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,-1.434037,5
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,-1.434037,5
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,-1.434037,5
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,-1.434037,5
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,-1.434037,5


In [25]:
## To see the distribution of stars, count how many wines received each star value
wine['stars'].value_counts()

Unnamed: 0_level_0,count
stars,Unnamed: 1_level_1
5,54102
4,11242
3,155
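The same thresholds can also be expressed with pandas' binning helper `pd.cut()`. This is an alternative sketch rather than the notebook's method; the toy `points` values are chosen to sit on the bin edges so the two approaches can be compared directly.

```python
import numpy as np
import pandas as pd

# Hypothetical point values, including the boundary cases 75, 80 and 85
points = pd.Series([74, 75, 80, 81, 85, 86, 100])

def get_stars(x):
    # Same rules as the map() example above
    if x > 85:
        return 5
    elif 80 < x <= 85:
        return 4
    elif 75 < x <= 80:
        return 3
    else:
        return 2

stars_map = points.map(get_stars)

# pd.cut bins are right-inclusive by default:
# (-inf, 75] -> 2, (75, 80] -> 3, (80, 85] -> 4, (85, inf] -> 5
stars_cut = pd.cut(
    points,
    bins=[-np.inf, 75, 80, 85, np.inf],
    labels=[2, 3, 4, 5],
).astype(int)

print(stars_map.tolist())  # [2, 2, 3, 4, 4, 5, 5]
print(stars_cut.tolist())  # [2, 2, 3, 4, 4, 5, 5]
```

`pd.cut()` returns a categorical Series, so the `.astype(int)` call is there only to get plain integers like the `map()` version produces.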





## `apply()` ##
The DataFrame method [`apply()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) should be used if we want to <u>transform the whole DataFrame</u> by calling a custom function on each row. `apply()` passes each row (or each column) to the function as a pandas Series, so the function can read several columns at once.  

Notice below that we call `wine.apply()` with `axis='columns'`, which passes one row at a time to the function. If we used `axis='index'` instead, the function would receive one column at a time.
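The difference between the two `axis` settings is easier to see on a tiny frame. The sketch below uses a made-up two-row DataFrame; `per_row` and `per_col` are illustrative names.

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns
toy = pd.DataFrame({"points": [87, 90], "price": [15.0, 31.0]})

# axis='columns': the function is called once per row.
# Each row arrives as a Series indexed by column name.
per_row = toy.apply(lambda row: row["points"] / row["price"], axis="columns")

# axis='index': the function is called once per column.
# Each column arrives as a Series indexed by row label.
per_col = toy.apply(lambda col: col.max(), axis="index")

print(list(per_row.index))  # [0, 1]  (one result per row)
print(list(per_col.index))  # ['points', 'price']  (one result per column)
```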

In [27]:
## Example of using apply() to re-mean the points value in each row
def remean_points(row):
    row.points = row.points - wine_points_mean
    return row

## Call remean_points on each row (axis='columns' passes one row at a time)
wine.apply(remean_points, axis='columns')

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,remeaned_pts,stars
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,-1.434037,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,-1.434037,5
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,-1.434037,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,-1.434037,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65497,US,This wine wears its 15.8% alcohol better than ...,Block 24,1.565963,31.0,California,Napa Valley,Napa,,,Hendry 2004 Block 24 Primitivo (Napa Valley),Primitivo,Hendry,1.565963,5
65498,Spain,"A unique take on Manzanilla Sherry, which is o...",Manzanilla,1.565963,10.0,Andalucia,Jerez,,Michael Schachner,@wineschach,Bodegas Dios Baco S.L. NV Manzanilla Sherry (J...,Sherry,Bodegas Dios Baco S.L.,1.565963,5


**Note:**
`map()` and `apply()` **return new, transformed Series and DataFrames, respectively**. They <i>**don't modify**</i> the original data they're called on. If we look at the first row of `wine`, we can see that it still has its original points value.

#### Let's try to use `apply()` to assign the star values based on the points

In [None]:
def get_star2(x): # x is row from the DataFrame
    if x.points > 85:
        return 5
    elif 80 < x.points <= 85:
        return 4
    elif 75 < x.points <= 80:
        return 3
    else:
        return 2

wine['Star2'] = wine.apply(get_star2, axis='columns') # axis='columns' is for transforming each row


In [None]:
wine.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,remeaned_pts,Star,Star2
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,-1.434037,5,5
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,-1.434037,5,5
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,-1.434037,5,5
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,-1.434037,5,5
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,-1.434037,5,5


### Smart pandas: ###
Pandas provides many common mapping operations as built-ins. For example, here's a faster way of remeaning our points column:

In [None]:
wine_points_mean = wine.points.mean()
print(wine_points_mean)
print("This is a Series")
print(wine.points)
wine.points - wine_points_mean

88.43403716087269
This is a Series
0        87
1        87
2        87
3        87
4        87
         ..
65494    90
65495    90
65496    90
65497    90
65498    90
Name: points, Length: 65499, dtype: int64


Unnamed: 0,points
0,-1.434037
1,-1.434037
2,-1.434037
3,-1.434037
4,-1.434037
...,...
65494,1.565963
65495,1.565963
65496,1.565963
65497,1.565963


In this code we are performing an operation between a lot of values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean value). Pandas looks at this expression and figures out that we must mean to subtract that mean value from every value in the dataset.

Pandas will also understand what to do if we perform these operations between Series of equal length. For example, an easy way of combining country and region information in the dataset would be to do the following:

In [None]:
print("a" + " - " + "b")
wine.country + " - " + wine.region_1

a - b


Unnamed: 0,0
0,Italy - Etna
1,
2,US - Willamette Valley
3,US - Lake Michigan Shore
4,US - Willamette Valley
...,...
65494,France - Chablis
65495,Australia - McLaren Vale
65496,US - Dry Creek Valley
65497,US - Napa Valley
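Both behaviours described above, a single value applied to every element and an element-by-element operation between two Series, can be checked on a toy frame. The data below is invented for illustration; note how a missing `region_1` value makes the combined string missing as well, which is why row 1 in the output above is empty.

```python
import pandas as pd

# Hypothetical toy data; the second row has no region
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "region_1": ["Etna", None, "Napa Valley"],
    "points": [87, 87, 90],
})

# Series minus a single number: the number is applied to every element
centered = toy.points - toy.points.mean()

# Series plus Series: values are combined row by row
combined = toy.country + " - " + toy.region_1

print(centered.tolist())  # [-1.0, -1.0, 2.0]
print(combined.iloc[0])   # Italy - Etna
print(pd.isna(combined.iloc[1]))  # True
```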


### Smart pandas are not as flexible
Standard Python operators are faster than `map()` or `apply()` because they use speed-ups built into pandas. All of the standard Python operators (`>`, `<`, `==`, and so on) work in this manner.

However, they are not as flexible as `map()` or `apply()`, which can run more advanced logic, such as multi-branch conditionals, that cannot be expressed with addition and subtraction alone.

# Groupwise analysis
Recall from the Data Exploration lecture that any time we see a question involving the words "how many ... for each ..." the answer is `value_counts()`. We can replicate what `value_counts()` does by doing the following:

In [None]:
wine.groupby('points').points.count()

Unnamed: 0_level_0,points
points,Unnamed: 1_level_1
80,155
81,305
82,923
83,1442
...,...
97,99
98,39
99,15
100,8


In [None]:
wine.points.value_counts()


Unnamed: 0_level_0,count
points,Unnamed: 1_level_1
87,8872
88,8423
90,7697
86,6179
...,...
97,99
98,39
99,15
100,8


In [None]:
wine.points.value_counts().sort_index()


Unnamed: 0_level_0,count
points,Unnamed: 1_level_1
80,155
81,305
82,923
83,1442
...,...
97,99
98,39
99,15
100,8


In [None]:
wine[(wine.points.between(87,90))].points.value_counts()

Unnamed: 0_level_0,count
points,Unnamed: 1_level_1
87,8872
88,8423
90,7697
89,5724


In [None]:
wine[(wine.points.between(87,90))].points.value_counts().sort_index()


Unnamed: 0_level_0,count
points,Unnamed: 1_level_1
87,8872
88,8423
89,5724
90,7697


In [None]:
wine[(wine.points.between(87,90))].groupby('points').points.count()

Unnamed: 0_level_0,points
points,Unnamed: 1_level_1
87,8872
88,8423
89,5724
90,7697


[`groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) created one group per point value, each containing the reviews that gave that score. Then, for each of these groups, we grabbed the `points` column and counted how many times it appeared. `value_counts()` is just a shortcut for this `groupby()` operation.
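The equivalence can be verified on a handful of made-up scores. The comparison goes through `to_dict()` because the two results may carry different Series names depending on the pandas version.

```python
import pandas as pd

# Hypothetical toy scores
toy = pd.DataFrame({"points": [87, 90, 87, 88, 87, 90]})

# Group by the score, then count the rows in each group
by_group = toy.groupby("points").points.count()

# value_counts() orders by frequency, so sort by index to match groupby
by_counts = toy.points.value_counts().sort_index()

print(by_group.to_dict())   # {87: 3, 88: 1, 90: 2}
print(by_counts.to_dict())  # {87: 3, 88: 1, 90: 2}
```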

We can use any of the summary functions with this data. For example, to get the cheapest wine in each point value category, we can do the following:

In [None]:
wine.groupby('points').price.min()

Unnamed: 0_level_0,price
points,Unnamed: 1_level_1
80,5.0
81,5.0
82,5.0
83,4.0
84,4.0
...,...
96,27.0
97,40.0
98,50.0
99,75.0


You can think of each group we generate as being a slice of our DataFrame containing only data with values that match. This DataFrame is accessible to us directly using the `apply()` method, and we can then manipulate the data in any way we see fit. For example, here's one way of selecting the name of the first wine reviewed from each winery in the dataset:

In [None]:
# Each group is passed to the lambda as a DataFrame (not a single row)
wine.groupby('winery').apply(lambda df: df.title.iloc[0])

Unnamed: 0_level_0,0
winery,Unnamed: 1_level_1
1+1=3,1+1=3 NV Rosé Sparkling (Cava)
10 Knots,10 Knots 2010 Viognier (Paso Robles)
100 Percent Wine,100 Percent Wine 2015 Moscato (California)
1000 Stories,1000 Stories 2013 Bourbon Barrel Aged Zinfande...
...,...
Öko,Öko 2013 Made With Organically Grown Grapes Ma...
Ökonomierat Rebholz,Ökonomierat Rebholz 2007 Von Rotliegenden Spät...
àMaurice,àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka,Štoka 2009 Izbrani Teran (Kras)


For even more fine-grained control, you can also group by more than one column. For example, here's how we would pick out the best wine (highest points) by country and province:

In [None]:
# Each group is a DataFrame; idxmax() gives the index label of its highest-scoring row
wine.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])

Unnamed: 0_level_0,Unnamed: 1_level_0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,remeaned_pts,Star,Star2
country,province,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
Argentina,Mendoza Province,Argentina,If you love massive Argentine reds with purity...,Finca Pedregal Single Vineyard Barrancas Maipú...,95,74.0,Mendoza Province,Mendoza,,Michael Schachner,@wineschach,Pascual Toso 2014 Finca Pedregal Single Vineya...,Cabernet Sauvignon-Malbec,Pascual Toso,6.565963,5,5
Argentina,Other,Argentina,This single-vineyard Malbec blend from vineyar...,Chañar Punco,94,68.0,Other,Calchaquí Valley,,Michael Schachner,@wineschach,El Esteco 2013 Chañar Punco Red (Calchaquí Val...,Red Blend,El Esteco,5.565963,5,5
Armenia,Armenia,Armenia,"Medium straw in the glass, this wine has a nos...",Estate Bottled,87,14.0,Armenia,,,Mike DeSimone,@worldwineguys,Van Ardi 2015 Estate Bottled Kangoun (Armenia),Kangoun,Van Ardi,-1.434037,5,5
Australia,Australia Other,Australia,Writes the book on how to make a wine filled w...,Sarah's Blend,93,15.0,Australia Other,South Eastern Australia,,,,Marquis Philips 2000 Sarah's Blend Red (South ...,Red Blend,Marquis Philips,4.565963,5,5
Australia,New South Wales,Australia,This is full and rich but not overly heavy or ...,Botrytis,91,19.0,New South Wales,Riverina,,Joe Czerwinski,@JoeCz,Three Bridges 2013 Botrytis Semillon (Riverina),Sémillon,Three Bridges,2.565963,5,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Uruguay,Juanico,Uruguay,This mature Bordeaux-style blend is earthy on ...,Preludio Barrel Select Lote N 77,90,45.0,Juanico,,,Michael Schachner,@wineschach,Familia Deicas 2004 Preludio Barrel Select Lot...,Red Blend,Familia Deicas,1.565963,5,5
Uruguay,Montevideo,Uruguay,"A rich, heady bouquet offers aromas of blackbe...",Monte Vide Eu Tannat-Merlot-Tempranillo,91,60.0,Montevideo,,,Michael Schachner,@wineschach,Bouza 2015 Monte Vide Eu Tannat-Merlot-Tempran...,Red Blend,Bouza,2.565963,5,5
Uruguay,Progreso,Uruguay,RPF is regularly one of Uruguay's better Tanna...,RPF,88,20.0,Progreso,,,Michael Schachner,@wineschach,Pisano 2013 RPF Tannat (Progreso),Tannat,Pisano,-0.434037,5,5
Uruguay,San Jose,Uruguay,"Baked, sweet, heavy aromas turn earthy with ti...",El Preciado Gran Reserva,87,50.0,San Jose,,,Michael Schachner,@wineschach,Castillo Viejo 2005 El Preciado Gran Reserva R...,Red Blend,Castillo Viejo,-1.434037,5,5
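The "best wine per group" selection above can also be sketched without `apply()`, by asking each group for the index label of its top score and then looking those rows up in the original frame. This is an alternative idiom, not the notebook's approach, and the toy data is invented. Unlike the `apply()` version, the result keeps the original row index rather than a country/province multi-index.

```python
import pandas as pd

# Hypothetical toy reviews
toy = pd.DataFrame({
    "country": ["US", "US", "US", "Italy", "Italy"],
    "province": ["Oregon", "Oregon", "California", "Tuscany", "Tuscany"],
    "title": ["A", "B", "C", "D", "E"],
    "points": [87, 91, 90, 88, 93],
})

# Index label of the highest-scoring row within each (country, province) group
best_idx = toy.groupby(["country", "province"]).points.idxmax()

# Look those rows up in the original DataFrame
best = toy.loc[best_idx]

print(best_idx.tolist())    # [4, 2, 1]
print(best.title.tolist())  # ['E', 'C', 'B']
```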


Another `groupby()` method worth mentioning is [`agg()`](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html), which lets you run several different functions on your DataFrame at once. For example, we can generate a simple statistical summary of the dataset as follows:

In [None]:
wine.groupby(['country']).winery.nunique()

Unnamed: 0_level_0,winery
country,Unnamed: 1_level_1
Argentina,416
Armenia,1
Australia,350
Austria,209
Bosnia and Herzegovina,1
...,...
Switzerland,3
Turkey,12
US,4491
Ukraine,2


In [None]:
# count doesn't include missing values
wine.groupby(['country']).price.agg(["count", "min", "max", "mean"])

Unnamed: 0_level_0,count,min,max,mean
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Argentina,1887,4.0,230.0,23.604663
Armenia,1,14.0,14.0,14.000000
Australia,1158,6.0,850.0,35.786701
Austria,1364,7.0,150.0,30.846774
...,...,...,...,...
Turkey,43,15.0,120.0,25.767442
US,27058,4.0,750.0,36.344889
Ukraine,5,6.0,10.0,9.200000
Uruguay,61,10.0,120.0,26.737705


In [None]:
# size includes missing values
wine.groupby(['country']).price.agg(["size", "min", "max", "mean"])

Unnamed: 0_level_0,size,min,max,mean
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Argentina,1907,4.0,230.0,23.604663
Armenia,1,14.0,14.0,14.000000
Australia,1177,6.0,850.0,35.786701
Austria,1635,7.0,150.0,30.846774
...,...,...,...,...
Turkey,43,15.0,120.0,25.767442
US,27177,4.0,750.0,36.344889
Ukraine,5,6.0,10.0,9.200000
Uruguay,61,10.0,120.0,26.737705
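The difference between `count` and `size` in the two cells above comes down to missing prices. A toy frame with one missing value (made up for this sketch) shows both side by side:

```python
import pandas as pd

# Hypothetical toy data; one US wine has no price
toy = pd.DataFrame({
    "country": ["US", "US", "US", "Italy"],
    "price": [10.0, None, 30.0, 20.0],
})

summary = toy.groupby("country").price.agg(["count", "size", "mean"])

print(summary.loc["US", "count"])  # 2  (non-missing prices only)
print(summary.loc["US", "size"])   # 3  (all rows in the group)
print(summary.loc["US", "mean"])   # 20.0  (missing values are skipped)
```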


Effective use of `groupby()` will allow you to do lots of really powerful things with your dataset.

## Multi-indexes
In all of the examples we've seen thus far we've been working with DataFrame or Series objects with a single-label index. `groupby()` is slightly different in that, depending on the operation we run, it will sometimes result in what is called a multi-index.

A multi-index differs from a regular index in that it has multiple levels. For example:

In [None]:
wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65499 entries, 0 to 65498
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country                65467 non-null  object 
 1   description            65499 non-null  object 
 2   designation            46588 non-null  object 
 3   points                 65499 non-null  int64  
 4   price                  60829 non-null  float64
 5   province               65467 non-null  object 
 6   region_1               54744 non-null  object 
 7   region_2               25170 non-null  object 
 8   taster_name            51856 non-null  object 
 9   taster_twitter_handle  49467 non-null  object 
 10  title                  65499 non-null  object 
 11  variety                65499 non-null  object 
 12  winery                 65499 non-null  object 
 13  remeaned_pts           65499 non-null  float64
 14  Star                   65499 non-null  int64  
 15  St

In [None]:
countries_wine = wine.groupby(['country', 'province']).description.agg(["count"])
countries_wine

Unnamed: 0_level_0,Unnamed: 1_level_0,count
country,province,Unnamed: 2_level_1
Argentina,Mendoza Province,1635
Argentina,Other,272
Armenia,Armenia,1
Australia,Australia Other,131
...,...,...
Uruguay,Montevideo,10
Uruguay,Progreso,5
Uruguay,San Jose,3
Uruguay,Uruguay,7


In [None]:
mi = countries_wine.index
type(mi)

pandas.core.indexes.multi.MultiIndex

Multi-indices have several methods for dealing with their tiered structure which are absent for single-level indices. They also require two levels of labels to retrieve a value. Dealing with multi-index output is a common "gotcha" for users new to pandas.

The use cases for a multi-index are detailed alongside instructions on using them in the [MultiIndex / Advanced Selection](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html) section of the pandas documentation.
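As a rough sketch of what "two levels of labels" means in practice, the toy example below builds a small multi-indexed count table the same way as `countries_wine` and retrieves values from it. The data and the names `counts`, `oregon`, `us_only` and `flat` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy reviews
toy = pd.DataFrame({
    "country": ["US", "US", "US", "Italy"],
    "province": ["Oregon", "Oregon", "California", "Tuscany"],
    "description": ["a", "b", "c", "d"],
})

counts = toy.groupby(["country", "province"]).description.agg(["count"])

# Supplying both levels as a tuple picks out a single value
oregon = counts.loc[("US", "Oregon"), "count"]

# Supplying only the outer level returns the matching inner level
us_only = counts.loc["US"]

# reset_index() turns both levels back into ordinary columns
flat = counts.reset_index()

print(counts.index.nlevels)  # 2
print(oregon)                # 2
print(list(us_only.index))   # ['California', 'Oregon']
print(list(flat.columns))    # ['country', 'province', 'count']
```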

However, in general the multi-index method you will use most often is the one for converting back to a regular index, the `reset_index()` method:

In [None]:
countries_wine.reset_index()

Unnamed: 0,country,province,count
0,Argentina,Mendoza Province,1635
1,Argentina,Other,272
2,Armenia,Armenia,1
3,Australia,Australia Other,131
...,...,...,...
381,Uruguay,Montevideo,10
382,Uruguay,Progreso,5
383,Uruguay,San Jose,3
384,Uruguay,Uruguay,7


In [None]:
countries_wine

Unnamed: 0_level_0,Unnamed: 1_level_0,count
country,province,Unnamed: 2_level_1
Argentina,Mendoza Province,1635
Argentina,Other,272
Armenia,Armenia,1
Australia,Australia Other,131
...,...,...
Uruguay,Montevideo,10
Uruguay,Progreso,5
Uruguay,San Jose,3
Uruguay,Uruguay,7


# Sorting
Looking again at `countries_wine` we can see that grouping returns data in index order, not in value order. That is to say, when outputting the result of a `groupby`, the order of the rows is dependent on the values in the index, not in the data.

To get data in the order we want, we can sort it ourselves. The `sort_values()` method is handy for this. [`sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) defaults to an ascending sort, where the lowest values go first. However, most of the time we want a descending sort, where the higher numbers go first.

In [None]:
countries_wine.sort_values(by='count', ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,count
country,province,Unnamed: 2_level_1
US,California,18122
US,Washington,4308
France,Bordeaux,3014
Italy,Tuscany,2985
...,...,...
Slovenia,Kras,1
Slovenia,Slovenia,1
South Africa,Vlootenburg,1
Greece,Beotia,1


In [None]:
countries_wine.loc['US'] # sort_values doesn't modify the current DataFrame.

Unnamed: 0_level_0,count
province,Unnamed: 1_level_1
America,53
Arizona,21
California,18122
Colorado,34
...,...
Vermont,1
Virginia,354
Washington,4308
Washington-Oregon,2


In [None]:
countries_wine.loc['US'].loc['Colorado']

Unnamed: 0,Colorado
count,34


We can also sort the DataFrame by its index using `sort_index()`.

In [None]:
countries_wine.sort_index()

Unnamed: 0_level_0,Unnamed: 1_level_0,count
country,province,Unnamed: 2_level_1
Argentina,Mendoza Province,1635
Argentina,Other,272
Armenia,Armenia,1
Australia,Australia Other,131
...,...,...
Uruguay,Montevideo,10
Uruguay,Progreso,5
Uruguay,San Jose,3
Uruguay,Uruguay,7


In [None]:
wine.head(2)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,remeaned_pts,Star,Star2
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,-1.434037,5,5
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,-1.434037,5,5


In [None]:
wine.groupby('country').nunique()

Unnamed: 0_level_0,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,remeaned_pts,Star,Star2
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Argentina,1825,664,16,83,2,27,0,1,1,1825,54,416,16,3,3
Armenia,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1
Australia,1141,569,21,94,6,55,0,2,2,1137,57,350,21,3,3
Austria,1547,761,16,85,24,0,0,2,2,1546,48,209,16,2,2
Bosnia and Herzegovina,1,1,1,1,1,0,0,1,0,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Switzerland,4,2,3,3,3,0,0,2,2,4,3,3,3,2,2
Turkey,40,24,9,16,7,0,0,2,1,40,19,12,9,2,2
US,26111,9038,21,144,26,240,17,15,12,26052,210,4491,21,3,3
Ukraine,5,4,3,2,1,0,0,1,1,5,2,2,3,1,1


In [None]:
wine.groupby('country').nunique().sort_values(by=['winery', 'taster_name'], ascending=False)

Unnamed: 0_level_0,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,remeaned_pts,Star,Star2
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
US,26111,9038,21,144,26,240,17,15,12,26052,210,4491,21,3,3
France,10744,3934,21,260,11,357,0,6,6,10543,130,2962,21,3,3
Italy,9554,4292,21,178,10,344,0,5,5,9512,156,2449,21,3,3
Spain,3257,1550,19,130,8,77,0,3,3,3220,104,1096,19,3,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Armenia,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1
Bosnia and Herzegovina,1,1,1,1,1,0,0,1,0,1,1,1,1,1,1
India,4,2,4,3,1,0,0,1,1,4,3,1,4,1,1
Slovakia,1,0,1,1,1,0,0,1,0,1,1,1,1,1,1
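When sorting by more than one column, as in the last cell, the second column only matters for breaking ties in the first. `sort_values()` also accepts a list for `ascending`, giving each column its own direction. The sketch below uses invented counts to show the tie-breaking.

```python
import pandas as pd

# Hypothetical per-country counts; Spain and France tie on 'winery'
toy = pd.DataFrame(
    {"winery": [3, 1, 3, 2], "taster_name": [2, 1, 5, 1]},
    index=["Spain", "India", "France", "Chile"],
)

# Both columns descending: the tie is broken by the larger taster_name
both_desc = toy.sort_values(by=["winery", "taster_name"], ascending=False)

# winery descending, taster_name ascending: the tie is broken the other way
mixed = toy.sort_values(by=["winery", "taster_name"], ascending=[False, True])

print(list(both_desc.index))  # ['France', 'Spain', 'Chile', 'India']
print(list(mixed.index))      # ['Spain', 'France', 'Chile', 'India']
print(list(toy.index))        # ['Spain', 'India', 'France', 'Chile']  (unchanged)
```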
