# Summary Functions and Maps
Extract insights from your data.

### Introduction

In the last tutorial, we learned how to select relevant data out of a DataFrame or Series. Plucking the right data out of our data representation is critical to getting work done, as we demonstrated in the exercises.

However, the data does not always come out of memory in the format we want right off the bat. Sometimes we have to do some more work ourselves to reformat it for the task at hand. This tutorial will cover different operations we can apply to our data to get the input "just right".

In [32]:
import pandas as pd
wine_data = 'https://raw.githubusercontent.com/lju-lazarevic/wine/refs/heads/master/data/winemag-data-130k-v2.csv'
reviews = pd.read_csv(wine_data)

In [33]:
reviews.head()

Unnamed: 0,id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,dupe?
0,94355,Austria,"""Chremisa,"" the ancient name of Krems, is comm...",Edition Chremisa Sandgrube 13,85,24.0,Niederösterreich,,,Roger Voss,@vossroger,Winzer Krems 2011 Edition Chremisa Sandgrube 1...,Grüner Veltliner,Winzer Krems,
1,126883,US,$10 for this very drinkable Cab? That's crazy....,,87,10.0,California,North Coast,North Coast,Virginie Boone,@vboone,Line 39 2009 Cabernet Sauvignon (North Coast),Cabernet Sauvignon,Line 39,
2,119493,US,$14 is a pretty good price for a Chardonnay th...,Whiplash,86,14.0,California,California,California Other,,,Jamieson Ranch 2011 Whiplash Chardonnay (Calif...,Chardonnay,Jamieson Ranch,
3,126909,Spain,"). Earth, cola and leather aromas are good, ho...",Finca Resalso,86,15.0,Northern Spain,Ribera del Duero,,Michael Schachner,@wineschach,Emilio Moro 2009 Finca Resalso (Ribera del Du...,Tinto Fino,Emilio Moro,
4,119752,Spain,). Light and lemony on the nose. The palate ha...,,87,17.0,Galicia,Rías Baixas,,Michael Schachner,@wineschach,La Caña 2010 Albariño (Rías Baixas),Albariño,La Caña,


### Summary Functions
Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way. For example, consider the `describe()` method:

In [34]:
reviews.points.describe()

count    119988.000000
mean         88.442236
std           3.092915
min          80.000000
25%          86.000000
50%          88.000000
75%          91.000000
max         100.000000
Name: points, dtype: float64

This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for numerical data; for string data here's what we get:

In [35]:
reviews.taster_name.describe()

count          95071
unique            19
top       Roger Voss
freq           23560
Name: taster_name, dtype: object
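
To make the "type-aware" behaviour easier to see side by side, here is a minimal sketch on a tiny made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not part of the wine dataset). The numeric column yields statistics such as the mean and quartiles, while the text column yields `count`, `unique`, `top` and `freq`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "points": [85, 87, 86, 90],
    "taster": ["Ann", "Bob", "Ann", None],
})

# Numeric column: count, mean, std, min, quartiles, max
numeric_summary = toy.points.describe()

# Text column: count (non-missing), unique, top, freq
text_summary = toy.taster.describe()

print(numeric_summary)
print(text_summary)
```

Note that `count` ignores missing values in both cases, which is why the text summary reports fewer entries than there are rows.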

If you want to get some particular simple summary statistic about a column in a DataFrame or a Series, there is usually a helpful pandas function that makes it happen.

For example, to see the mean of the points allotted (e.g. how well an averagely rated wine does), we can use the `mean()` function:

In [36]:
reviews.points.mean().item()  # .item() converts the NumPy scalar result to a plain Python number

88.44223589025569

To see a list of unique values we can use the `unique()` function:

In [37]:
reviews.taster_name.unique()

array(['Roger Voss', 'Virginie Boone', nan, 'Michael Schachner',
       'Anna Lee C. Iijima', 'Paul Gregutt', 'Sean P. Sullivan',
       'Kerin O’Keefe', 'Anne Krebiehl\xa0MW', 'Lauren Buzzeo',
       'Joe Czerwinski', 'Alexander Peartree', 'Matt Kettmann',
       'Jim Gordon', 'Susan Kostrzewa', 'Mike DeSimone', 'Jeff Jenssen',
       'Christina Pickard', 'Carrie Dykes', 'Fiona Adams'], dtype=object)

To see a list of unique values *and* how often they occur in the dataset, we can use the `value_counts()` method:

In [38]:
reviews.taster_name.value_counts()

taster_name
Roger Voss            23560
Michael Schachner     14046
Kerin O’Keefe          9697
Paul Gregutt           8868
Virginie Boone         8708
Matt Kettmann          5730
Joe Czerwinski         4766
Sean P. Sullivan       4461
Anna Lee C. Iijima     4017
Jim Gordon             3766
Anne Krebiehl MW       3290
Lauren Buzzeo          1700
Susan Kostrzewa        1023
Mike DeSimone           461
Jeff Jenssen            436
Alexander Peartree      383
Carrie Dykes            129
Fiona Adams              24
Christina Pickard         6
Name: count, dtype: int64
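
Notice that `unique()` listed `nan` among the tasters, while `value_counts()` left it out. The sketch below, on a small made-up Series (an assumption for illustration), shows that missing values are dropped by default and how the `dropna` and `normalize` parameters change the result.

```python
import pandas as pd

# Toy data, invented for illustration only
tasters = pd.Series(["Ann", "Bob", "Ann", None, "Ann"])

# Default: missing values are excluded from the counts
counts = tasters.value_counts()

# dropna=False keeps a separate entry for missing values
counts_with_nan = tasters.value_counts(dropna=False)

# normalize=True returns shares of the non-missing values instead of raw counts
shares = tasters.value_counts(normalize=True)

print(counts)
print(counts_with_nan)
print(shares)
```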

### Maps

A `map` is a term, borrowed from mathematics, for a function that takes one set of values and *"maps"* them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often.

`map()` is the first, and slightly simpler one. For example, suppose that we wanted to remean the scores the wines received to 0. We can do this as follows:


In [39]:
reviews_points_mean = reviews.points.mean()
reviews.points.map(lambda x: x - reviews_points_mean)

0        -3.442236
1        -1.442236
2        -2.442236
3        -2.442236
4        -1.442236
            ...   
119983    1.557764
119984   -0.442236
119985    1.557764
119986   -1.442236
119987    1.557764
Name: points, Length: 119988, dtype: float64

The function you pass to `map()` should expect a single value from the Series (a point value, in the above example), and return a transformed version of that value. `map()` returns a new Series where all the values have been transformed by your function.
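
As a small sketch of that contract, the example below passes a named function instead of a lambda. The `to_fraction` helper and the toy scores are assumptions made up for illustration; the point is only that the function sees one value at a time and that the original Series is left unchanged.

```python
import pandas as pd

# Toy scores, invented for illustration only
points = pd.Series([85, 87, 90], name="points")

def to_fraction(value):
    """Rescale a single score from the 80-100 range to the 0.0-1.0 range."""
    return (value - 80) / 20

# map() calls to_fraction once per element and collects the results
scaled = points.map(to_fraction)

print(scaled)
print(points)  # the original Series still holds the raw scores
```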



### Apply 
`apply()` is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.

In [40]:
def remean_points(row):
    row.points = row.points - reviews_points_mean
    return row

reviews.apply(remean_points, axis='columns')

Unnamed: 0,id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,dupe?
0,94355,Austria,"""Chremisa,"" the ancient name of Krems, is comm...",Edition Chremisa Sandgrube 13,-3.442236,24.0,Niederösterreich,,,Roger Voss,@vossroger,Winzer Krems 2011 Edition Chremisa Sandgrube 1...,Grüner Veltliner,Winzer Krems,
1,126883,US,$10 for this very drinkable Cab? That's crazy....,,-1.442236,10.0,California,North Coast,North Coast,Virginie Boone,@vboone,Line 39 2009 Cabernet Sauvignon (North Coast),Cabernet Sauvignon,Line 39,
2,119493,US,$14 is a pretty good price for a Chardonnay th...,Whiplash,-2.442236,14.0,California,California,California Other,,,Jamieson Ranch 2011 Whiplash Chardonnay (Calif...,Chardonnay,Jamieson Ranch,
3,126909,Spain,"). Earth, cola and leather aromas are good, ho...",Finca Resalso,-2.442236,15.0,Northern Spain,Ribera del Duero,,Michael Schachner,@wineschach,Emilio Moro 2009 Finca Resalso (Ribera del Du...,Tinto Fino,Emilio Moro,
4,119752,Spain,). Light and lemony on the nose. The palate ha...,,-1.442236,17.0,Galicia,Rías Baixas,,Michael Schachner,@wineschach,La Caña 2010 Albariño (Rías Baixas),Albariño,La Caña,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119983,80210,Italy,Zonchera is Ceretto's more affordable base Bar...,Zonchera,1.557764,48.0,Piedmont,Barolo,,,,Ceretto 2004 Zonchera (Barolo),Nebbiolo,Ceretto,
119984,76487,Italy,Zonin's 2006 Amarone opens with very ripe arom...,,-0.442236,70.0,Veneto,Amarone della Valpolicella,,,,Zonin 2006 Amarone della Valpolicella,"Corvina, Rondinella, Molinara",Zonin,
119985,86953,Italy,Zorzettig's precious Picolit dessert wine deli...,,1.557764,,Northeastern Italy,Colli Orientali del Friuli,,,,Zorzettig 2006 Picolit (Colli Orientali del Fr...,Picolit,Zorzettig,
119986,18824,US,Zucca has made a fragrant and floral Sangioves...,Sangiovese Rosato,-1.442236,18.0,California,Amador County,Sierra Foothills,Virginie Boone,@vboone,Zucca 2010 Sangiovese Rosato Rosé (Amador County),Rosé,Zucca,


If we had called `reviews.apply()` with `axis='index'`, then instead of passing a function to transform each row, we would need to give a function to transform each column.
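
A minimal sketch of the column-wise case, using a made-up two-column DataFrame (the `toy` frame and the `center` helper are assumptions for illustration). With `axis='index'` the function receives one whole column at a time as a Series.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"points": [85, 87, 90], "price": [10.0, 20.0, 30.0]})

def center(column):
    """Receive one whole column (a Series) and subtract that column's mean."""
    return column - column.mean()

# axis="index" applies the function to each column in turn
centered = toy.apply(center, axis="index")

print(centered)
```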

Note that `map()` and `apply()` return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of `reviews`, we can see that it still has its original `points` value.

In [41]:
reviews.head()

Unnamed: 0,id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,dupe?
0,94355,Austria,"""Chremisa,"" the ancient name of Krems, is comm...",Edition Chremisa Sandgrube 13,85,24.0,Niederösterreich,,,Roger Voss,@vossroger,Winzer Krems 2011 Edition Chremisa Sandgrube 1...,Grüner Veltliner,Winzer Krems,
1,126883,US,$10 for this very drinkable Cab? That's crazy....,,87,10.0,California,North Coast,North Coast,Virginie Boone,@vboone,Line 39 2009 Cabernet Sauvignon (North Coast),Cabernet Sauvignon,Line 39,
2,119493,US,$14 is a pretty good price for a Chardonnay th...,Whiplash,86,14.0,California,California,California Other,,,Jamieson Ranch 2011 Whiplash Chardonnay (Calif...,Chardonnay,Jamieson Ranch,
3,126909,Spain,"). Earth, cola and leather aromas are good, ho...",Finca Resalso,86,15.0,Northern Spain,Ribera del Duero,,Michael Schachner,@wineschach,Emilio Moro 2009 Finca Resalso (Ribera del Du...,Tinto Fino,Emilio Moro,
4,119752,Spain,). Light and lemony on the nose. The palate ha...,,87,17.0,Galicia,Rías Baixas,,Michael Schachner,@wineschach,La Caña 2010 Albariño (Rías Baixas),Albariño,La Caña,


Pandas provides many common mapping operations as built-ins. For example, here's a faster way of remeaning our points column:

In [42]:
review_points_mean = reviews.points.mean()
reviews.points - review_points_mean

0        -3.442236
1        -1.442236
2        -2.442236
3        -2.442236
4        -1.442236
            ...   
119983    1.557764
119984   -0.442236
119985    1.557764
119986   -1.442236
119987    1.557764
Name: points, Length: 119988, dtype: float64

In this code we are performing an operation between a lot of values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean value). Pandas looks at this expression and figures out that we must mean to subtract that mean value from every value in the dataset.

Pandas will also understand what to do if we perform these operations between Series of equal length. For example, an easy way of combining country and region information in the dataset would be to do the following:

In [43]:
reviews.country + " - " + reviews.region_1

0                                        NaN
1                           US - North Coast
2                            US - California
3                   Spain - Ribera del Duero
4                        Spain - Rías Baixas
                         ...                
119983                        Italy - Barolo
119984    Italy - Amarone della Valpolicella
119985    Italy - Colli Orientali del Friuli
119986                    US - Amador County
119987                                   NaN
Length: 119988, dtype: object
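
The `NaN` entries in that output appear wherever one of the inputs is missing. The following sketch reproduces this on two short made-up Series (assumed values, chosen to mirror the first rows above): the operation is applied element by element, and a missing value on either side yields a missing result.

```python
import pandas as pd

# Toy data mirroring the first rows of the dataset; the first region is missing
country = pd.Series(["Austria", "US", "Spain"])
region = pd.Series([None, "North Coast", "Rías Baixas"])

# Element-wise concatenation between two Series of equal length
combined = country + " - " + region

print(combined)
```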

These operators are faster than `map()` or `apply()` because they use speedups built into pandas. All of the standard Python operators (`>`, `<`, `==`, and so on) work in this manner.

However, they are not as flexible as `map()` or `apply()`, which can do more advanced things, like applying conditional logic, which cannot be done with addition and subtraction alone.
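
To illustrate the trade-off, here is a small sketch on made-up scores (the values and the label names are assumptions for illustration). A comparison operator produces one boolean per element, while `map()` lets us express branching logic with several outcomes.

```python
import pandas as pd

# Toy scores, invented for illustration only
points = pd.Series([82, 88, 96])

# Built-in operator: compares every element against 95 in one step
is_high = points >= 95

# map(): arbitrary Python logic per element, here with three branches
labels = points.map(lambda p: "great" if p >= 95 else "good" if p >= 85 else "ok")

print(is_high)
print(labels)
```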

# Practice exercises

In [44]:
reviews

Unnamed: 0,id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,dupe?
0,94355,Austria,"""Chremisa,"" the ancient name of Krems, is comm...",Edition Chremisa Sandgrube 13,85,24.0,Niederösterreich,,,Roger Voss,@vossroger,Winzer Krems 2011 Edition Chremisa Sandgrube 1...,Grüner Veltliner,Winzer Krems,
1,126883,US,$10 for this very drinkable Cab? That's crazy....,,87,10.0,California,North Coast,North Coast,Virginie Boone,@vboone,Line 39 2009 Cabernet Sauvignon (North Coast),Cabernet Sauvignon,Line 39,
2,119493,US,$14 is a pretty good price for a Chardonnay th...,Whiplash,86,14.0,California,California,California Other,,,Jamieson Ranch 2011 Whiplash Chardonnay (Calif...,Chardonnay,Jamieson Ranch,
3,126909,Spain,"). Earth, cola and leather aromas are good, ho...",Finca Resalso,86,15.0,Northern Spain,Ribera del Duero,,Michael Schachner,@wineschach,Emilio Moro 2009 Finca Resalso (Ribera del Du...,Tinto Fino,Emilio Moro,
4,119752,Spain,). Light and lemony on the nose. The palate ha...,,87,17.0,Galicia,Rías Baixas,,Michael Schachner,@wineschach,La Caña 2010 Albariño (Rías Baixas),Albariño,La Caña,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119983,80210,Italy,Zonchera is Ceretto's more affordable base Bar...,Zonchera,90,48.0,Piedmont,Barolo,,,,Ceretto 2004 Zonchera (Barolo),Nebbiolo,Ceretto,
119984,76487,Italy,Zonin's 2006 Amarone opens with very ripe arom...,,88,70.0,Veneto,Amarone della Valpolicella,,,,Zonin 2006 Amarone della Valpolicella,"Corvina, Rondinella, Molinara",Zonin,
119985,86953,Italy,Zorzettig's precious Picolit dessert wine deli...,,90,,Northeastern Italy,Colli Orientali del Friuli,,,,Zorzettig 2006 Picolit (Colli Orientali del Fr...,Picolit,Zorzettig,
119986,18824,US,Zucca has made a fragrant and floral Sangioves...,Sangiovese Rosato,87,18.0,California,Amador County,Sierra Foothills,Virginie Boone,@vboone,Zucca 2010 Sangiovese Rosato Rosé (Amador County),Rosé,Zucca,


#### 1. What is the median of the `points` column in the `reviews` DataFrame?

In [48]:
print(f'The median of the points column is: {reviews.points.median():.2f}')

The median of the points column is: 88.00


#### 2. What countries are represented in the dataset? (Your answer should not include any duplicates.)

In [50]:
reviews.country.unique()

array(['Austria', 'US', 'Spain', 'Italy', 'France', 'Portugal',
       'South Africa', 'Australia', 'Chile', 'Argentina', 'New Zealand',
       'Israel', 'Greece', 'Hungary', 'Germany', 'Canada', 'Morocco',
       'Turkey', 'Romania', 'Bulgaria', 'Serbia', nan, 'Lebanon',
       'Moldova', 'Georgia', 'Slovenia', 'Uruguay', 'England', 'Brazil',
       'Croatia', 'Bosnia and Herzegovina', 'India', 'Switzerland',
       'Mexico', 'Cyprus', 'Peru', 'Czech Republic', 'Armenia',
       'Slovakia', 'Ukraine', 'Macedonia', 'Luxembourg', 'Egypt', 'China'],
      dtype=object)

#### 3. How often does each country appear in the dataset? Create a Series `reviews_per_country` mapping countries to the count of reviews of wines from that country.

In [52]:
reviews_per_country = reviews.country.value_counts()
reviews_per_country

country
US                        50457
France                    20353
Italy                     17940
Spain                      6116
Portugal                   5256
Chile                      4184
Argentina                  3544
Austria                    3034
Australia                  2197
Germany                    1992
South Africa               1301
New Zealand                1278
Israel                      466
Greece                      432
Canada                      226
Bulgaria                    132
Hungary                     129
Romania                     102
Uruguay                      98
Turkey                       81
Slovenia                     77
Georgia                      76
Croatia                      70
Mexico                       68
England                      63
Moldova                      56
Brazil                       49
Lebanon                      32
Morocco                      24
Peru                         16
Ukraine                      14


#### 4. Create a variable `centered_price` containing a version of the `price` column with the mean price subtracted.

(Note: this 'centering' transformation is a common preprocessing step before applying various machine learning algorithms.)

In [53]:
# Vectorized subtraction: every price minus the mean price

centered_price = reviews.price - reviews.price.mean()
centered_price

0        -11.620747
1        -25.620747
2        -21.620747
3        -20.620747
4        -18.620747
            ...    
119983    12.379253
119984    34.379253
119985          NaN
119986   -17.620747
119987    -9.620747
Name: price, Length: 119988, dtype: float64

In [55]:
reviews.price.map(lambda x: x - reviews.price.mean())  # Same result, but much slower: the mean is recomputed for every element

0        -11.620747
1        -25.620747
2        -21.620747
3        -20.620747
4        -18.620747
            ...    
119983    12.379253
119984    34.379253
119985          NaN
119986   -17.620747
119987    -9.620747
Name: price, Length: 119988, dtype: float64
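
If `map()` is used anyway, one way to avoid most of the slowdown is to compute the mean once before mapping. The sketch below checks on a tiny made-up price column (assumed values, including one missing price) that this gives the same result as the vectorized subtraction.

```python
import pandas as pd

# Toy prices, invented for illustration only; one value is missing
price = pd.Series([10.0, 20.0, None, 30.0])

# mean() skips missing values, so this is the mean of 10, 20 and 30
mean_price = price.mean()

# Vectorized version
vectorized = price - mean_price

# map() version with the mean computed once, outside the lambda
mapped = price.map(lambda p: p - mean_price)

print(vectorized)
print(vectorized.equals(mapped))
```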

#### 5. I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable `bargain_wine` with the title of the wine with the highest points-to-price ratio in the dataset.

In [63]:
# A bargain combines a low price with high points,
# so we look for the largest points / price ratio.

bargain_wine = reviews.loc[(reviews.points / reviews.price).idxmax(), 'title']
bargain_wine 

'Cramele Recas 2011 UnWineD Pinot Grigio (Viile Timisului)'
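
The solution relies on `idxmax()` returning the index label of the largest ratio, which `loc` then uses to look up the title. Here is a minimal sketch of those steps on a made-up three-row DataFrame (titles and numbers are assumptions for illustration). Rows with a missing price produce a missing ratio, which `idxmax()` skips by default.

```python
import pandas as pd

# Toy data, invented for illustration only; the last price is missing
toy = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "points": [90, 86, 88],
    "price": [45.0, 4.0, None],
})

# Element-wise ratio: 2.0, 21.5 and NaN
ratio = toy.points / toy.price

# Index label of the largest non-missing ratio
best_label = ratio.idxmax()

# Look up the title in that row
best_title = toy.loc[best_label, "title"]

print(best_label, best_title)
```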

#### 6. There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series `descriptor_counts` counting how many times each of these two words appears in the `description` column in the dataset. (For simplicity, let's ignore the capitalized versions of these words.)

In [82]:
key_words = ['tropical', 'fruity']

# Sum the case-sensitive occurrences of each word across all descriptions
descriptor_counts = pd.Series({
    word: reviews['description'].str.count(word).sum() for word in key_words
})
descriptor_counts

tropical    3394
fruity      8498
dtype: int64

In [83]:
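# Alternative: count the descriptions that contain each word at least once.
# Note that .lower() also matches capitalized versions, which the exercise asks us to ignore.
count = [reviews['description'].map(lambda description: x in description.lower()).sum() for x in key_words]
descriptor_counts2 = pd.Series(count, index=key_words)
descriptor_counts2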

tropical    3483
fruity      8679
dtype: int64
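
The two answers above give different totals because they measure different things: `str.count()` adds up every occurrence of the exact lowercase word, while the `in` test counts descriptions that contain the word at least once, and `.lower()` additionally matches capitalized versions. How much each effect contributes in the real dataset has not been checked here; the sketch below only demonstrates the difference on three made-up descriptions.

```python
import pandas as pd

# Toy descriptions, invented for illustration only
descriptions = pd.Series([
    "tropical fruit with more tropical notes",  # lowercase word appears twice
    "Tropical and bright",                      # capitalized version only
    "dry and fruity",                           # word does not appear
])

# Total number of case-sensitive occurrences
occurrences = descriptions.str.count("tropical").sum()

# Number of descriptions containing the lowercase word at least once
rows_case_sensitive = descriptions.map(lambda d: "tropical" in d).sum()

# Number of descriptions containing the word in any capitalization
rows_any_case = descriptions.map(lambda d: "tropical" in d.lower()).sum()

print(occurrences, rows_case_sensitive, rows_any_case)
```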

#### 7. We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand. We'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

#### Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

#### Create a Series `star_ratings` with the number of stars corresponding to each review in the dataset.

In [84]:
def points_to_star(value):
    # Points-only rule; the Canada exception is handled in the row-wise version further down
    if value >= 95:
        return 3
    elif value >= 85:
        return 2
    else:
        return 1

In [86]:
star_ratings = reviews.points.map(points_to_star)

In [88]:
star_ratings.value_counts()

points
2    105508
1     12088
3      2392
Name: count, dtype: int64

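In [105]:
def stars_with_canada(row):
    # Wines from Canada always get 3 stars, regardless of points
    if row.points >= 95 or row.country == 'Canada':
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

In [106]:
# Assign only to star_ratings, so that reviews['points'] is not overwritten
star_ratings = reviews.apply(stars_with_canada, axis='columns')

In [108]:
star_ratings.value_counts()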

2    105291
1     12079
3      2618
Name: count, dtype: int64
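
Row-wise `apply()` is flexible but slow on large frames. As an alternative sketch, not the approach used above, the same rules can be written with boolean Series and NumPy's `select`, which picks the first matching condition for each row. The `toy` frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["US", "Canada", "Italy", "Spain"],
    "points": [96, 82, 88, 80],
})

# Conditions are checked in order; the first one that is True wins
conditions = [
    (toy.points >= 95) | (toy.country == "Canada"),  # 3 stars
    toy.points >= 85,                                # 2 stars
]

# Rows matching no condition fall back to the default of 1 star
stars = pd.Series(np.select(conditions, [3, 2], default=1), index=toy.index)

print(stars)
```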


