**GROUPING AND SORTING**
Maps allow us to transform data in a DataFrame or Series one value at a time for an entire column. However, **often we want to group our data, and then do something specific to the group the data is in.**
As you'll learn, we do this with the **groupby()** operation. We'll also cover some additional topics, such as more complex ways to index your DataFrames, along with how to sort your data.

**Groupwise analysis**

One function we've been using heavily thus far is the value_counts() function. We can replicate what value_counts() does by doing the following:

In [None]:
reviews.groupby('points').points.count()

groupby() created a group of reviews which allotted the same point values to the given wines. Then, for each of these groups, we grabbed the points column and counted how many times it appeared. value_counts() is just a shortcut to this groupby() operation.

groupby('points'): groups the DataFrame by the value of the 'points' column.
.points: selects the 'points' column within each group.
.count(): counts the non-null values in each group. NaN values are not counted!
--> Why use .points.count() rather than any other column?
Because when you group by 'points' and then count 'points', you count only the rows where 'points' was actually present (not NaN).
--> It is safer and more logical, because you count the same column you grouped by.

Conclusion
Whenever you group by a column, the most consistent choice is to count that same column. Counting a different column gives a lower number for any group in which that column has null values.
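
A small check of the equivalence described above, on a made-up four-row frame (the data and the name `toy` are illustrative, not taken from the wine dataset). value_counts() orders its result by frequency while groupby() orders by key, so this sketch sorts the index before comparing.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews.
toy = pd.DataFrame({'points': [87, 87, 90, 92]})

# Count rows per distinct 'points' value, two ways.
by_group = toy.groupby('points').points.count()
by_counts = toy.points.value_counts().sort_index()

print(by_group)
print(by_counts)
```

Both Series should hold the same counts per point value; only the default ordering differs.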

We can use any of the summary functions we've used before with this data. For example, to get the cheapest wine in each point value category, we can do the following:

In [None]:
reviews.groupby('points').price.min()
#groups by points, and then from each group grabs the min price

You can think of each group we generate as being a slice of our DataFrame containing only data with values that match. This DataFrame is accessible to us directly using the apply() method, and we can then manipulate the data in any way we see fit. For example, here's one way of selecting the name of the first wine reviewed from each winery in the dataset:


In [None]:
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
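
As a rough illustration of the same idea on a tiny frame, the sketch below picks the first title per winery. It is a variant, not the cell above: it selects the `title` column before calling apply(), so the lambda receives a Series rather than a DataFrame. The frame `toy` and the third title are invented for the example.

```python
import pandas as pd

# Hypothetical toy data; the second Spier title is made up.
toy = pd.DataFrame({
    'winery': ['Spier', 'Nicosia', 'Spier'],
    'title': [
        'Spier 2016 Chenin Blanc',
        'Nicosia 2013 Vulkà Bianco',
        'Spier 2015 Pinotage',
    ],
})

# Each group's 'title' column is passed to the lambda as a Series.
first_titles = toy.groupby('winery')['title'].apply(lambda s: s.iloc[0])

# first() gives the same answer here because no title is missing.
same_result = toy.groupby('winery')['title'].first()

print(first_titles)
```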

For even more fine-grained control, you can also group by more than one column. For example, here's how we would pick out the best wine by country and province:

In [None]:
reviews.groupby(['country','province']).apply(lambda df: df.loc[df.points.idxmax()])
# groups by country and province, then selects from each group the row with the highest points (the full row, with all of its columns)
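
One alternative way to get the best row per group, sketched on invented data: ask the grouped `points` column for the row label of each maximum with idxmax(), then look those labels up with loc. Unlike the apply() version above, the result keeps the original row labels as its index instead of a (country, province) MultiIndex. The frame `toy` and its letter labels are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data with letter row labels to make the lookup visible.
toy = pd.DataFrame(
    {
        'country': ['France', 'France', 'France', 'Spain'],
        'province': ['Alsace', 'Alsace', 'Bordeaux', 'Rioja'],
        'points': [90, 95, 88, 89],
    },
    index=['a', 'b', 'c', 'd'],
)

# Row label of the highest-scoring wine in each (country, province) group.
best_labels = toy.groupby(['country', 'province'])['points'].idxmax()

# Fetch the complete rows for those labels.
best_rows = toy.loc[best_labels]

print(best_rows)
```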

Another groupby() method worth mentioning is agg(), which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:

In [None]:
reviews.groupby(['country']).price.agg([len, min, max])
#groups by country, and then applies several functions (len, min and max) to the price of each group

You can even pass a lambda function to agg().
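
A minimal sketch of a lambda inside agg(), on invented data. It uses keyword arguments (pandas calls this named aggregation) so that each output column gets an explicit name; the names `cheapest`, `priciest` and `price_range` are chosen for the example only.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'country': ['France', 'France', 'Spain'],
    'price': [32.0, 21.0, 18.0],
})

# Each keyword becomes a column; the value is the function to apply.
summary = toy.groupby('country').price.agg(
    cheapest='min',
    priciest='max',
    price_range=lambda s: s.max() - s.min(),  # custom statistic
)

print(summary)
```

Naming the columns avoids relying on the automatic label pandas assigns to an anonymous function.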

Effective use of groupby() will allow you to do lots of really powerful things with your dataset.

**Multi-indexes**

In all of the examples we've seen thus far we've been working with DataFrame or Series objects with a single-label index. groupby() is slightly different in the fact that, depending on the operation we run, it will sometimes result in what is called a multi-index.

A multi-index differs from a regular index in that it has multiple levels. For example:


In [None]:
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed
#groups by country and province, then takes the description column of each group and returns how many elements it holds

In [None]:
mi = countries_reviewed.index
type(mi)
#--> pandas.core.indexes.multi.MultiIndex

Multi-indices have several methods for dealing with their tiered structure which are absent for single-level indices. They also require two levels of labels to retrieve a value. Dealing with multi-index output is a common "gotcha" for users new to pandas.
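
To make the "two levels of labels" point concrete, here is a small sketch on invented data. Passing a tuple with one label per level returns a single value, while passing only the outer label returns everything under it. The frame `toy` is an assumption made for the example.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'country': ['France', 'France', 'France', 'Spain'],
    'province': ['Alsace', 'Alsace', 'Bordeaux', 'Rioja'],
})

# Grouping by two columns yields a Series with a two-level index.
counts = toy.groupby(['country', 'province']).size()

alsace = counts.loc[('France', 'Alsace')]  # one label per level: a single count
france = counts.loc['France']              # outer label only: Series indexed by province

print(counts.index.nlevels)
print(alsace)
print(france)
```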

The use cases for a multi-index are detailed alongside instructions on using them in the MultiIndex / Advanced Selection section of the pandas documentation.

However, in general the multi-index method you will use most often is the one for converting back to a regular index, the reset_index() method:

In [None]:
countries_reviewed.reset_index()
#this returns a DataFrame with a single-level index, in which every row carries its own country and province values
#before, rows from the same country shared one 'country' label in the index; now 'country' and 'province' are ordinary columns repeated on each row
#reset_index() does not reorder the rows; it only turns the index levels into columns and adds a new numeric index based on the current order (important to keep in mind for sort_values later)
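
A short sketch of what reset_index() does to a two-level result, on invented data. The `name` argument is only there to label the count column; `n_reviews` is a name picked for this example.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'country': ['France', 'France', 'France', 'Spain'],
    'province': ['Alsace', 'Alsace', 'Bordeaux', 'Rioja'],
})

grouped = toy.groupby(['country', 'province']).size()

# Index levels become ordinary columns; rows keep their current order
# and receive a fresh 0..n-1 index.
flat = grouped.reset_index(name='n_reviews')

print(flat)
```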

**Sorting**

Looking again at countries_reviewed we can see that grouping returns data in index order, not in value order. That is to say, when outputting the result of a groupby, the order of the rows is dependent on the values in the index, not in the data.

To get data in the order we want it in, we can sort it ourselves. The sort_values() method is handy for this.


In [None]:
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')
#retrieves the list sorted by len (number of elements in the group)
#from lowest to highest

sort_values() defaults to an ascending sort, where the lowest values go first. However, most of the time we want a descending sort, where the higher numbers go first. That goes thusly:

In [None]:
countries_reviewed.sort_values(by='len', ascending=False)
#now from highest to lowest

To sort by index values, use the companion method sort_index(). This method has the same arguments and default order:


In [None]:
countries_reviewed.sort_index() #sorts by the index (the numeric one created when reset_index() was called, so the rows come back in their original group order)

Finally, know that you can sort by more than one column at a time:


In [None]:
countries_reviewed.sort_values(by=['country', 'len'])
#use sort_values() here, not sort_index()
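
When sorting by several columns, `ascending` may also be a list with one flag per column. The sketch below, on invented data, sorts countries alphabetically and, within each country, puts the largest count first. The frame `toy` is an assumption made for the example.

```python
import pandas as pd

# Hypothetical toy data shaped like countries_reviewed after reset_index().
toy = pd.DataFrame({
    'country': ['Spain', 'France', 'France'],
    'province': ['Rioja', 'Bordeaux', 'Alsace'],
    'len': [1, 1, 2],
})

# country ascending, then len descending inside each country.
ordered = toy.sort_values(by=['country', 'len'], ascending=[True, False])

print(ordered)
```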

**EXERCISES**

In [3]:
import pandas as pd

In [145]:
reviews = pd.DataFrame([
    {
        'country': 'Italy',
        'description': 'Aromas include tropical fruit, broom, and minerals.',
        'designation': 'Vulkà Bianco',
        'points': 87,
        'price': 20.0,
        'province': 'Sicily & Sardinia',
        'region_1': 'Etna',
        'region_2': 'Eastern Sicily',
        'taster_name': 'Kerin O’Keefe',
        'taster_twitter_handle': '@kerinokeefe',
        'title': 'Nicosia 2013 Vulkà Bianco (Etna)',
        'variety': 'White Blend',
        'winery': 'Nicosia'
    },
    {
        'country': 'Portugal',
        'description': 'This is ripe and fruity, a wine that is smooth and balanced.',
        'designation': 'Avidagos',
        'points': 87,
        'price': 15.0,
        'province': 'Douro',
        'region_1': 'Douro',
        'region_2': 'Northern Portugal',
        'taster_name': 'Roger Voss',
        'taster_twitter_handle': '@vossroger',
        'title': 'Quinta dos Avidagos 2011 Avidagos Red (Douro)',
        'variety': 'Portuguese Red',
        'winery': 'Quinta dos Avidagos'
    },
    {
        'country': 'France',
        'description': 'A dry style of Pinot Gris, crisp with acidity and minerality.',
        'designation': 'Classic',
        'points': 100,
        'price': 32.0,
        'province': 'Alsace',
        'region_1': 'Alsace',
        'region_2': 'Northeast France',
        'taster_name': 'Roger Voss',
        'taster_twitter_handle': '@vossroger',
        'title': 'Domaine Marcel Deiss 2012 Pinot Gris (Alsace)',
        'variety': 'Pinot Gris',
        'winery': 'Domaine Marcel Deiss'
    },
    {
        'country': 'France',
        'description': 'Big, rich and off-dry, with intensity and floral notes.',
        'designation': 'Lieu-dit Harth Cuvée Caroline',
        'points': 90,
        'price': 21.0,
        'province': 'Berona',
        'region_1': 'Alsace',
        'region_2': 'Northeast France',
        'taster_name': 'Roger Voss',
        'taster_twitter_handle': '@vossroger',
        'title': 'Domaine Schoffit 2012 Lieu-dit Harth Cuvée Caroline (Alsace)',
        'variety': 'Gewürztraminer',
        'winery': 'Domaine Schoffit'
    },
    {
        'country': 'Spain',
        'description': 'Dark cherry, spice and leather aromas dominate this classic Rioja.',
        'designation': 'Reserva',
        'points': 89,
        'price': 18.0,
        'province': 'Rioja',
        'region_1': 'Rioja Alta',
        'region_2': 'Northern Spain',
        'taster_name': 'Michael Schachner',
        'taster_twitter_handle': '@wineschach',
        'title': 'Marqués de Cáceres 2011 Reserva (Rioja)',
        'variety': 'Tempranillo',
        'winery': 'Marqués de Cáceres'
    },
    {
        'country': 'US',
        'description': 'Fruity and soft, with hints of raspberry and vanilla.',
        'designation': 'Estate',
        'points': 88,
        'price': 25.0,
        'province': 'California',
        'region_1': 'Napa Valley',
        'region_2': 'North Coast',
        'taster_name': 'Jim Gordon',
        'taster_twitter_handle': '@jimgordonwine',
        'title': 'Robert Mondavi 2014 Cabernet Sauvignon (Napa Valley)',
        'variety': 'Cabernet Sauvignon',
        'winery': 'Robert Mondavi'
    },
    {
        'country': 'Argentina',
        'description': 'Bold and structured, offering black fruit and mocha.',
        'designation': 'Gran Reserva',
        'points': 92,
        'price': 30.0,
        'province': 'Mendoza Province',
        'region_1': 'Uco Valley',
        'region_2': 'Mendoza',
        'taster_name': 'Alejandro Iglesias',
        'taster_twitter_handle': '@aliglesiaswine',
        'title': 'Trapiche 2015 Gran Reserva Malbec (Uco Valley)',
        'variety': 'Malbec',
        'winery': 'Trapiche'
    },
    {
        'country': 'Chile',
        'description': 'Smooth, with red berries and a touch of herbs.',
        'designation': 'Reserva Especial',
        'points': 86,
        'price': 12.0,
        'province': 'Maipo Valley',
        'region_1': 'Maipo Valley',
        'region_2': 'Central Valley',
        'taster_name': 'Patricio Tapia',
        'taster_twitter_handle': '@ptapiawine',
        'title': 'Concha y Toro 2016 Carmenere (Maipo Valley)',
        'variety': 'Carmenere',
        'winery': 'Concha y Toro'
    },
    {
        'country': 'Germany',
        'description': 'Lively and fresh, with notes of green apple and lime.',
        'designation': 'Kabinett',
        'points': 91,
        'price': 22.0,
        'province': 'Mosel',
        'region_1': 'Mosel',
        'region_2': 'Western Germany',
        'taster_name': 'Anne Krebiehl',
        'taster_twitter_handle': '@annewine',
        'title': 'Dr. Loosen 2015 Riesling Kabinett (Mosel)',
        'variety': 'Riesling',
        'winery': 'Dr. Loosen'
    },
    {
        'country': 'South Africa',
        'description': 'Aromas of citrus and melon, fresh and vibrant.',
        'designation': 'Signature',
        'points': 85,
        'price': 10.0,
        'province': 'Western Cape',
        'region_1': 'Stellenbosch',
        'region_2': 'Coastal Region',
        'taster_name': 'Lauren Buzzeo',
        'taster_twitter_handle': '@laurenbuzzeo',
        'title': 'Spier 2016 Chenin Blanc (Western Cape)',
        'variety': 'Chenin Blanc',
        'winery': 'Spier'
    },
    {
        'country': 'South Africa',
        'description': 'Aromas of citrus and melon, fresh and vibrant.',
        'designation': 'Signature',
        'points': 87,
        'price': 10.0,
        'province': 'Western Cape',
        'region_1': 'Stellenbosch',
        'region_2': 'Coastal Region',
        'taster_name': 'Lauren Buzzeo',
        'taster_twitter_handle': '@laurenbuzzeo',
        'title': 'Spier 2016 Chenin Blanc (Western Cape)',
        'variety': 'Chenin Blanc',
        'winery': 'Spier'
    }
])

print(reviews.head())

    country                                        description  \
0     Italy  Aromas include tropical fruit, broom, and mine...   
1  Portugal  This is ripe and fruity, a wine that is smooth...   
2    France  A dry style of Pinot Gris, crisp with acidity ...   
3    France  Big, rich and off-dry, with intensity and flor...   
4     Spain  Dark cherry, spice and leather aromas dominate...   

                     designation  points  price           province  \
0                   Vulkà Bianco      87   20.0  Sicily & Sardinia   
1                       Avidagos      87   15.0              Douro   
2                        Classic     100   32.0             Alsace   
3  Lieu-dit Harth Cuvée Caroline      90   21.0             Berona   
4                        Reserva      89   18.0              Rioja   

     region_1           region_2        taster_name taster_twitter_handle  \
0        Etna     Eastern Sicily      Kerin O’Keefe          @kerinokeefe   
1       Douro  Northern Port

**Exercise 1:**
Who are the most common wine reviewers in the dataset? Create a Series whose index is the taster_twitter_handle category from the dataset, and whose values count how many reviews each person wrote.

In [11]:
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.agg(len)
reviews_written

taster_twitter_handle
@aliglesiaswine    1
@annewine          1
@jimgordonwine     1
@kerinokeefe       1
@laurenbuzzeo      2
@ptapiawine        1
@vossroger         3
@wineschach        1
Name: taster_twitter_handle, dtype: int64

Other possible solutions:
reviews_written = reviews.groupby('taster_twitter_handle').size()

or

reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()

Differences between these three:

*.groupby(...).size()*

Counts the total number of rows in each group.

**Includes rows with NaN** in any other column (it only cares about the number of rows, not their values). Rows whose grouping key is NaN are still left out, because groupby() drops them by default.

It is the fastest and most direct way to count elements per group.

*.groupby(...).taster_twitter_handle.count()*

Groups by taster_twitter_handle, and then counts that same column.

**Does not include NaN** in that column.

If a row has taster_twitter_handle = NaN, it is not counted.

*.groupby(...).taster_twitter_handle.agg(len)*

Gives the same result as .count() here.

len returns the length of each group, so by itself it **does not skip NaN**. The results match in this context only because the column being measured is also the grouping key, and rows with a NaN key never make it into a group.
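
The differences show up once some values are missing. This sketch uses an invented frame with one missing handle and one missing price; the column names and the `dropna=False` variant at the end are there purely for illustration.

```python
import pandas as pd

# Hypothetical toy data: one row has no taster, one row has no price.
toy = pd.DataFrame({
    'taster': ['@a', '@a', '@b', None],
    'price': [10.0, float('nan'), 5.0, 7.0],
})
grouped = toy.groupby('taster')  # the row with a missing key is dropped

n_rows = grouped.size()          # rows per group, NaN prices included
n_prices = grouped.price.count() # non-null prices per group
n_len = grouped.price.agg(len)   # length of each group, NaN prices included

# Keeping the missing key as its own group requires dropna=False.
with_missing = toy.groupby('taster', dropna=False).size()

print(n_rows.tolist(), n_prices.tolist(), n_len.tolist())
print(with_missing)
```

Here size() and agg(len) agree with each other, and count() comes out lower for the group that contains a missing price.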


**Exercise 2:**

What is the best wine I can buy for a given amount of money? Create a Series whose index is wine prices and whose values is the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that 4.0 dollars is at the top and 3300.0 dollars is at the bottom).

In [29]:
best_rating_per_price = reviews.points.loc[reviews.groupby('price').points.idxmax()]
best_rating_per_price
#careful: idxmax() returns the index label of the row that holds the maximum in each group, not the maximum value itself
#so the labels are passed to reviews.points.loc to fetch the points (selecting points first avoids retrieving every column)
#the result is indexed by row label rather than by price, so it is not yet the Series the exercise asks for

10     87
7      86
1      87
4      89
0      87
3      90
8      91
5      88
6      92
2     100
Name: points, dtype: int64

In [33]:
#easier:
reviews.groupby('price').points.max()
#max() returns the maximum value directly (not the index label of the row where it occurs), and the result is indexed by price in ascending order

price
10.0     87
12.0     86
15.0     87
18.0     89
20.0     87
21.0     90
22.0     91
25.0     88
30.0     92
32.0    100
Name: points, dtype: int64
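
The contrast between idxmax() and max() can be seen side by side on a tiny invented frame. Both results are indexed by price; they differ in what the values are.

```python
import pandas as pd

# Hypothetical toy data: two wines at 10.0, one at 15.0.
toy = pd.DataFrame({
    'price': [10.0, 10.0, 15.0],
    'points': [85, 87, 90],
})

row_labels = toy.groupby('price').points.idxmax()  # values are row labels
best_points = toy.groupby('price').points.max()    # values are the scores

# The row labels are useful when the whole winning row is wanted.
best_rows = toy.loc[row_labels]

print(row_labels)
print(best_points)
print(best_rows)
```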

**Exercise 3:**

What are the minimum and maximum prices for each variety of wine? Create a DataFrame whose index is the variety category from the dataset and whose values are the min and max values thereof.

In [59]:
price_extremes = reviews.groupby('variety').price.agg(['max', 'min'])  # string names avoid the FutureWarning raised for the built-in max and min
price_extremes



                     max   min
variety
Cabernet Sauvignon  25.0  25.0
Carmenere           12.0  12.0
Chenin Blanc        10.0  10.0
Gewürztraminer      21.0  21.0
Malbec              30.0  30.0
Pinot Gris          32.0  32.0
Portuguese Red      15.0  15.0
Riesling            22.0  22.0
Tempranillo         18.0  18.0
White Blend         20.0  20.0


**Exercise 4:**

What are the most expensive wine varieties? Create a variable sorted_varieties containing a copy of the dataframe from the previous question where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties).

In [79]:
price_extremes = reviews.groupby('variety').price.agg(['max', 'min'])  # string names avoid the FutureWarning raised for the built-in max and min
sorted_varieties = price_extremes.sort_values(by=['min','max'], ascending= False)
sorted_varieties



                     max   min
variety
Pinot Gris          32.0  32.0
Malbec              30.0  30.0
Cabernet Sauvignon  25.0  25.0
Riesling            22.0  22.0
Gewürztraminer      21.0  21.0
White Blend         20.0  20.0
Tempranillo         18.0  18.0
Portuguese Red      15.0  15.0
Carmenere           12.0  12.0
Chenin Blanc        10.0  10.0


**Exercise 5:**

Create a `Series` whose index is reviewers and whose values is the average review score given out by that reviewer. Hint: you will need the `taster_name` and `points` columns.

In [85]:
reviewer_mean_ratings = reviews.groupby('taster_name').points.mean()
reviewer_mean_ratings

taster_name
Alejandro Iglesias    92.000000
Anne Krebiehl         91.000000
Jim Gordon            88.000000
Kerin O’Keefe         87.000000
Lauren Buzzeo         86.000000
Michael Schachner     89.000000
Patricio Tapia        86.000000
Roger Voss            92.333333
Name: points, dtype: float64

Are there significant differences in the average scores assigned by the various reviewers? 

Use the describe() method to see a summary of the range of values.

In [87]:
reviewer_mean_ratings.describe()

count     8.000000
mean     88.916667
std       2.592725
min      86.000000
25%      86.750000
50%      88.500000
75%      91.250000
max      92.333333
Name: points, dtype: float64

**Exercise 6:**

What combination of countries and varieties are most common? Create a Series whose index is a MultiIndex of {country, variety} pairs. For example, a pinot noir produced in the US should map to {"US", "Pinot Noir"}. Sort the values in the Series in descending order based on wine count.

In [135]:
series = reviews.groupby(['country','province']).title.agg([len]).sort_values(by='len', ascending=False)
series
#agg([len]) returns a DataFrame, but the exercise asks for a Series
#why? the brackets tell pandas that several functions may be applied, so it returns one column per function
# --> agg(len) yes, agg([len]) no!

                                len
country      province
South Africa Western Cape         2
Argentina    Mendoza Province     1
Chile        Maipo Valley         1
France       Alsace               1
             Berona               1
Germany      Mosel                1
Italy        Sicily & Sardinia    1
Portugal     Douro                1
Spain        Rioja                1
US           California           1


In [153]:
series = reviews.groupby(['country','province']).size().sort_values(ascending= False)
series

country       province         
South Africa  Western Cape         2
Argentina     Mendoza Province     1
Chile         Maipo Valley         1
France        Alsace               1
              Berona               1
Germany       Mosel                1
Italy         Sicily & Sardinia    1
Portugal      Douro                1
Spain         Rioja                1
US            California           1
dtype: int64

"Why is by='len' unnecessary with .size() or .agg(len)?"

Because:

If the result is a Series, there is only one thing to sort → pandas knows it must sort by the values of the Series.

If it is a DataFrame, however, it has multiple columns and you need to say which column to sort by (by='len' in this case).
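
A compact sketch of the Series versus DataFrame distinction, on invented data. The only difference between the two aggregations is the pair of brackets around `len`; the frame `toy` is an assumption made for the example.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'country': ['Spain', 'France', 'France'],
    'title': ['x', 'y', 'z'],
})

as_series = toy.groupby('country').title.agg(len)   # bare function: Series
as_frame = toy.groupby('country').title.agg([len])  # list of functions: DataFrame

# A Series has a single set of values, so no column needs to be named.
sorted_series = as_series.sort_values(ascending=False)

# A DataFrame must be told which column to sort on.
sorted_frame = as_frame.sort_values(by='len', ascending=False)

print(type(as_series).__name__, type(as_frame).__name__)
print(sorted_series)
print(sorted_frame)
```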

In [177]:
country_variety_counts = (reviews.groupby(['country','variety']).title.agg(len)).sort_values(ascending=False)
type(country_variety_counts)

pandas.core.series.Series

In [175]:
country_variety_counts = reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)
country_variety_counts
#the exercise asks for country and variety, not country and province

country       variety           
South Africa  Chenin Blanc          2
Argentina     Malbec                1
Chile         Carmenere             1
France        Gewürztraminer        1
              Pinot Gris            1
Germany       Riesling              1
Italy         White Blend           1
Portugal      Portuguese Red        1
Spain         Tempranillo           1
US            Cabernet Sauvignon    1
dtype: int64