# Introduction

Now you are ready to get a deeper understanding of your data.

Run the following cell to load your data and some utility functions (including code to check your answers).

In [None]:
import pandas as pd
pd.set_option("display.max_rows", 5)
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.summary_functions_and_maps import *
print("Setup complete.")

reviews.head()

# Exercises

## 1.

What is the median of the `points` column in the `reviews` DataFrame?
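
As a warm-up, here is a minimal sketch of how summary methods behave on a Series. It uses a made-up `toy` DataFrame with a hypothetical `score` column (not the wine data), and it shows why the call parentheses matter: without them you get the method object rather than a number.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the wine dataset).
toy = pd.DataFrame({"score": [80, 85, 90, 95, 100, 100]})

# Without parentheses this is a bound method, not a number.
median_method = toy.score.median

# With parentheses the method runs and returns a single value.
median_value = toy.score.median()  # middle of the sorted values: 92.5
mean_value = toy.score.mean()      # arithmetic average, a different statistic

print(callable(median_method), median_value, round(mean_value, 2))
```

The median and the mean generally differ, so it is worth checking which one a question asks for.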

In [None]:
median_points = ____

# Check your answer
q1.check()

In [None]:
#%%RM_IF(PROD)%%
# Deliberately wrong: without parentheses this is the method itself, not its result.
median_points = reviews.points.median

q1.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
# Deliberately wrong: the mean is not the median.
median_points = reviews.points.mean()

q1.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
median_points = reviews.points.median()

q1.assert_check_passed()

In [None]:
#_COMMENT_IF(PROD)_
q1.hint()
#_COMMENT_IF(PROD)_
q1.solution()

## 2.
What countries are represented in the dataset? (Your answer should not include any duplicates.)
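
The sketch below illustrates de-duplicating a column on a made-up `toy` DataFrame with a hypothetical `region` column. It assumes only standard pandas behavior: `unique()` keeps values in order of first appearance and includes missing values, while `nunique()` skips them by default.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the wine dataset).
toy = pd.DataFrame({"region": ["North", "South", "North", None, "South"]})

# Distinct values in order of first appearance; the missing value is kept.
distinct = toy.region.unique()

# Number of distinct non-missing values.
n_distinct = toy.region.nunique()

print(distinct, n_distinct)
```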

In [None]:
countries = ____

# Check your answer
q2.check()

In [None]:
#%%RM_IF(PROD)%%
countries = reviews.country.unique()

q2.assert_check_passed()

In [None]:
#%%RM_IF(PROD)%%
# Deliberately wrong: the raw column still contains duplicates.
countries = reviews.country

q2.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
countries = set(reviews.country)

q2.assert_check_passed()

In [None]:
#%%RM_IF(PROD)%%
# Deliberately wrong: upper-casing changes the country names.
countries = [str(c).upper() for c in set(reviews.country)]

q2.assert_check_failed()

In [None]:
#_COMMENT_IF(PROD)_
q2.hint()
#_COMMENT_IF(PROD)_
q2.solution()

## 3.
How often does each country appear in the dataset? Create a Series `reviews_per_country` mapping countries to the count of reviews of wines from that country.
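
There is more than one way to count occurrences, and the results can differ in ordering. The sketch below uses a made-up `toy` DataFrame with a hypothetical `region` column: `value_counts()` sorts by descending count, while `groupby(...).size()` sorts by the group label.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the wine dataset).
toy = pd.DataFrame(
    {"region": ["South", "North", "South", "East", "South", "North"]}
)

# Sorted by descending count: South, North, East.
by_frequency = toy.region.value_counts()

# Sorted by group label: East, North, South.
by_label = toy.groupby("region").size()

# Same label-to-count mapping, different row order.
same_mapping = by_frequency.to_dict() == by_label.to_dict()

print(list(by_frequency.index), list(by_label.index), same_mapping)
```

An order-sensitive comparison would treat these two Series as different even though they map the same labels to the same counts.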

In [None]:
reviews_per_country = ____

# Check your answer
q3.check()

In [None]:
#%%RM_IF(PROD)%%
reviews_per_country = reviews.country.value_counts()

q3.assert_check_passed()

In [None]:
#%%RM_IF(PROD)%%
reviews_per_country = reviews.groupby('country').size()

# TODO: This check probably should pass, since this series meets the requirements of the question.
# To make this happen, we'd need to implement a comparison that weakens `Series.equals` by ignoring order.

q3.check()

In [None]:
#_COMMENT_IF(PROD)_
q3.hint()
#_COMMENT_IF(PROD)_
q3.solution()

## 4.
Create a variable `centered_price` containing a version of the `price` column with the mean price subtracted.

(Note: this 'centering' transformation is a common preprocessing step before applying various machine learning algorithms.) 
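
A minimal sketch of centering, using a made-up `toy` DataFrame with a hypothetical `cost` column. Subtracting a scalar from a Series is applied element-wise, and `mean()` skips missing values by default, so missing entries stay missing in the result.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the wine dataset).
toy = pd.DataFrame({"cost": [10.0, 20.0, None, 30.0]})

# mean() ignores the missing value, so the mean here is 20.0.
average = toy.cost.mean()

# The scalar is subtracted from every element; NaN stays NaN.
centered = toy.cost - average

print(centered.tolist())
```

After centering, the non-missing values average to zero.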

In [None]:
centered_price = ____

# Check your answer
q4.check()

In [None]:
#%%RM_IF(PROD)%%
centered_price = reviews.price - reviews.price.mean()

q4.assert_check_passed()

In [None]:
#_COMMENT_IF(PROD)_
q4.hint()
#_COMMENT_IF(PROD)_
q4.solution()

## 5.
I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable `bargain_wine` with the title of the wine with the highest points-to-price ratio in the dataset.
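
The general pattern of finding the row that maximizes a computed ratio can be sketched as follows. The `toy` DataFrame, its column names, and its index labels are all made up for illustration. Note that `idxmax()` returns an index label (the first one if the maximum is tied), which then has to be looked up with `.loc`.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the wine dataset).
toy = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "score": [90, 80, 88, 80],
        "cost": [30.0, 10.0, None, 10.0],
    },
    index=[101, 102, 103, 104],
)

# Element-wise ratio: 3.0, 8.0, NaN, 8.0 (the missing cost gives NaN).
ratio = toy.score / toy.cost

# idxmax skips NaN and returns the label of the first maximum.
best_label = ratio.idxmax()
best_name = toy.loc[best_label, "name"]

# Reversing the rows returns the label of the last maximum instead.
last_label = ratio.iloc[::-1].idxmax()

print(best_label, best_name, last_label)
```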

In [None]:
bargain_wine = ____

# Check your answer
q5.check()

In [None]:
#%%RM_IF(PROD)%%
bargain_wine = reviews.loc[(reviews.points / reviews.price).idxmax(), 'title']

q5.assert_check_passed()

In [None]:
#%%RM_IF(PROD)%%
def bargainness(row):
    return row.points / row.price

bargain_scores = reviews.apply(bargainness, axis='columns')
bargain_index = bargain_scores.idxmax()
bargain_wine = reviews.loc[bargain_index, "title"]

q5.assert_check_passed()

In [None]:
#%%RM_IF(PROD)%%
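# Alternative correct answer: if the maximum ratio is tied, reversing the rows
# makes idxmax return the last tied row instead of the first.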
bargain_index = bargain_scores.iloc[::-1].idxmax()
bargain_wine = reviews.loc[bargain_index, "title"]

q5.assert_check_passed()

In [None]:
#%%RM_IF(PROD)%%
bargain_wine = 'Chateau Plonk'
q5.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
# Deliberately wrong: this is the row label, not the wine's title.
bargain_wine = bargain_index
q5.assert_check_failed()

In [None]:
#_COMMENT_IF(PROD)_
q5.hint()
#_COMMENT_IF(PROD)_
q5.solution()

## 6.
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series `descriptor_counts` counting how many times each of these two words appears in the `description` column in the dataset. The Series should be properly indexed by these two words. (For simplicity, let's ignore the capitalized versions of these words.)
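
Two common ways to test each string in a column for a substring are `map` with a function and the vectorized `.str.contains`. The sketch below uses a made-up `toy` DataFrame with a hypothetical `note` column. Both approaches yield one boolean per row, so the sum counts rows that contain the word at least once, and the match is case-sensitive.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the wine dataset).
toy = pd.DataFrame(
    {
        "note": [
            "crisp and dry",
            "dry, very dry finish",   # two occurrences, but one row
            "Dry start, sweet end",   # capitalized "Dry" is not matched
            "bone dry",
            "sweet",
        ]
    }
)

# Apply a Python function to each value, then sum the booleans.
n_dry = toy.note.map(lambda text: "dry" in text).sum()

# Vectorized string method giving the same kind of boolean Series.
n_sweet = toy.note.str.contains("sweet").sum()

# Build a Series indexed by the two words.
counts = pd.Series([n_dry, n_sweet], index=["dry", "sweet"])

print(counts)
```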

In [None]:
descriptor_counts = ____

# Check your answer
q6.check()

In [None]:
#%%RM_IF(PROD)%%
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

q6.assert_check_passed()

In [None]:
#%%RM_IF(PROD)%%
n_trop = reviews.description.str.contains("tropical").sum()
n_fruity = reviews.description.str.contains("fruity").sum()

descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

q6.assert_check_passed()

In [None]:
#_COMMENT_IF(PROD)_
q6.hint()
#_COMMENT_IF(PROD)_
q6.solution()

## 7.
We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand, so we'd like to translate the scores into simple star ratings. A score of 95 or higher counts as 3 stars; a score of at least 85 but less than 95 counts as 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a Series `star_ratings` with the number of stars corresponding to each review in the dataset.
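
Rules that depend on more than one column can be expressed as a function of a row and applied with `axis="columns"`. The sketch below uses a made-up `toy` DataFrame with hypothetical `origin` and `score` columns, and its thresholds are invented for illustration. The override is checked first so that it takes precedence over the score-based rules.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the wine dataset).
toy = pd.DataFrame(
    {"origin": ["X", "Y", "X", "Z"], "score": [97, 70, 80, 60]}
)


def tier(row):
    """Map one row to a tier; the origin override is checked first."""
    if row.origin == "Z":
        return 3
    elif row.score >= 90:
        return 3
    elif row.score >= 75:
        return 2
    else:
        return 1


# axis="columns" passes each row to the function as a Series.
tiers = toy.apply(tier, axis="columns")

print(tiers.tolist())
```

Row-wise `apply` is easy to read but calls the Python function once per row, so it can be slow on large DataFrames.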

In [None]:
star_ratings = ____

# Check your answer
q7.check()

In [None]:
#%%RM_IF(PROD)%%
def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1
    
star_ratings = reviews.apply(stars, axis='columns')

q7.assert_check_passed()

In [None]:
#_COMMENT_IF(PROD)_
q7.hint()
#_COMMENT_IF(PROD)_
q7.solution()

# Keep going
Continue to **[grouping and sorting](#$NEXT_NOTEBOOK_URL$)**.