# Summary functions and maps workbook

## Introduction

This is the workbook component to the "Summary functions and maps" section of the Advanced Pandas tutorial. For the reference section, [**click here**](https://www.kaggle.com/residentmario/summary-functions-and-maps-reference).

In the last section we learned how to select relevant data out of our `pandas` `DataFrame` and `Series` objects. Plucking the right data out of our data representation is critical to getting work done, as we demonstrated in the visualization exercises attached to the workbook.

However, the data does not always come out of memory in the format we want right off the bat. Sometimes we have to do some more work ourselves to reformat it for the task at hand.

The remainder of this tutorial will cover different operations we can apply to our data to get the input "just right". We'll start off in this section by looking at the most commonly used built-in reshaping operations. Along the way we'll cover data `dtypes`, a concept essential to working with `pandas` effectively.

In [None]:
import pandas as pd

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.summary_functions_and_maps import *
print("Setup complete.")

reviews.head()

## Exercises

**Exercise 1**: What is the median of the `points` column?

In [None]:
median_points = ____

q1.check()

In [None]:
#%%RM_IF(PROD)%%
median_points = reviews.points.median

q1.check()

In [None]:
#%%RM_IF(PROD)%%
median_points = reviews.points.mean()

q1.check()

In [None]:
#%%RM_IF(PROD)%%
median_points = reviews.points.median()

q1.check()

In [None]:
# Uncomment the line below to see a solution
#_COMMENT_IF(PROD)_
q1.solution()

**Exercise 2**: What countries are represented in the dataset?

In [None]:
countries = ____

q2.check()

In [None]:
#%%RM_IF(PROD)%%
countries = reviews.country.unique()

q2.check()

In [None]:
#%%RM_IF(PROD)%%
countries = reviews.country

q2.check()

In [None]:
#_COMMENT_IF(PROD)_
q2.solution()

**Exercise 3**: How often does each country appear in the dataset? Create a Series `reviews_per_country` mapping countries to the count of reviews of wines from that country.

In [None]:
reviews_per_country = ____

q3.check()

In [None]:
#%%RM_IF(PROD)%%
reviews_per_country = reviews.country.value_counts()

q3.check()

In [None]:
#_COMMENT_IF(PROD)_
q3.solution()

**Exercise 4**: Create variable `centered_price` containing a version of the `price` column with the mean price subtracted.

(Note: this 'centering' transformation is a common preprocessing step before applying various machine learning algorithms.) 
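
As a rough illustration of what centering does, here is a minimal sketch on a toy `Series`. The values are made up and are not from the wine dataset, and the names `toy_prices` and `toy_centered` are hypothetical. Subtracting the mean shifts the data so that its new mean is zero, while missing values stay missing.

```python
import pandas as pd

# Hypothetical toy prices, not taken from the wine reviews dataset
toy_prices = pd.Series([10.0, 20.0, 30.0, None])

# mean() skips missing values by default, so the mean here is 20.0
toy_centered = toy_prices - toy_prices.mean()

print(toy_centered)
```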

In [None]:
centered_price = ____

q4.check()

In [None]:
#%%RM_IF(PROD)%%
centered_price = reviews.price - reviews.price.mean()

q4.check()

In [None]:
#_COMMENT_IF(PROD)_
q4.solution()

**Exercise 5**: I'm an economical wine buyer. Which wine is the "best bargain", i.e., which wine has the highest points-to-price ratio in the dataset?

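Hint: use a map and the [`argmax` function](http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.Series.argmax.html). In recent versions of `pandas`, `Series.argmax` returns an integer position rather than an index label; use `Series.idxmax` if you want the label.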
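
The pattern in the hint can be sketched on a toy `DataFrame`. The data and the names `toy`, `toy_ratio`, and `best_label` are made up for illustration. This sketch uses a row-wise `apply` as the map step and `idxmax` to get the index label of the largest value; it is one possible approach, not necessarily the one in the official solution.

```python
import pandas as pd

# Hypothetical toy data, not taken from the wine reviews dataset
toy = pd.DataFrame(
    {"points": [90, 85, 88], "price": [30.0, 10.0, 44.0]},
    index=["wine_a", "wine_b", "wine_c"],
)

# Map each row to its points-to-price ratio
toy_ratio = toy.apply(lambda row: row.points / row.price, axis=1)

# idxmax returns the index label of the largest ratio
best_label = toy_ratio.idxmax()

# Look up the full row for that label
print(toy.loc[best_label])
```

Plain column arithmetic (`toy.points / toy.price`) gives the same ratios without an explicit map.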

In [None]:
# Your code here

In [None]:
#_COMMENT_IF(PROD)_
q5.solution()

Now it's time for some visual exercises.

**Exercise 6**: There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a `Series` counting how many times each of these two words appears in the `description` column in the dataset.

Hint: use a map to check each description for the string `tropical`, then count up the number of times this is `True`. Repeat this for `fruity`. Create a `Series` combining the two values at the end.
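
Here is a minimal sketch of that approach on made-up descriptions, using different words ("dry" and "citrus") so the exercise itself is left to you. All names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical toy descriptions, not taken from the wine reviews dataset
toy_descriptions = pd.Series([
    "crisp and dry with citrus notes",
    "soft tannins and a long finish",
    "bright citrus on the nose",
])

# map applies the function to every element, yielding a boolean Series;
# summing booleans counts the True values
n_dry = toy_descriptions.map(lambda d: "dry" in d).sum()
n_citrus = toy_descriptions.map(lambda d: "citrus" in d).sum()

# Combine the two counts into one labeled Series
toy_counts = pd.Series([n_dry, n_citrus], index=["dry", "citrus"])
print(toy_counts)
```

Note that this counts the descriptions containing each word as a substring, not the total number of occurrences of the word.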

In [None]:
# Your code here

In [None]:
#_COMMENT_IF(PROD)_
q6.solution()

**Exercise 7**: Which combinations of country and variety are most common?

Create a `Series` whose index consists of strings of the form `"<Country> - <Wine Variety>"`. For example, a pinot noir produced in the US should map to `"US - Pinot Noir"`. The values should be counts of how many times the given wine appears in the dataset. Drop any reviews with incomplete `country` or `variety` data.

Note that some of the `country` and `variety` values are missing. We will learn more about missing data in a future section of the tutorial. For now, you may use the included code snippet to filter out reviews with missing values in these columns.

Hint: Use a map to create a `Series` whose entries are a `str` concatenation of those two columns. Then, generate a `Series` counting how many times each label appears in the dataset.
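
The two steps in the hint might look like the following on a tiny made-up table. The data and the names `toy`, `complete`, `labels`, and `label_counts` are hypothetical, and this is only a sketch of one way to do it.

```python
import pandas as pd

# Hypothetical toy data, not taken from the wine reviews dataset
toy = pd.DataFrame({
    "country": ["US", "US", "France", None],
    "variety": ["Pinot Noir", "Pinot Noir", "Merlot", "Merlot"],
})

# Keep only rows where both columns are present
complete = toy.loc[toy.country.notnull() & toy.variety.notnull()]

# Build "<Country> - <Wine Variety>" row by row, then count each label
labels = complete.apply(lambda row: row.country + " - " + row.variety, axis=1)
label_counts = labels.value_counts()
print(label_counts)
```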

In [None]:
answer_q7()

In [None]:
ans = reviews.loc[(reviews.country.notnull()) & (reviews.variety.notnull())]
# Your code here

In [None]:
#_COMMENT_IF(PROD)_
q7.solution()

# Keep going
**[Continue to grouping and sorting](https://www.kaggle.com/residentmario/grouping-and-sorting-workbook).**