**[Pandas Home Page](https://www.kaggle.com/learn/pandas)**

---


# Introduction

Now you are ready to get a deeper understanding of your data.

Run the following cell to load your data and some utility functions (including code to check your answers).

In [1]:
import pandas as pd
pd.set_option("display.max_rows", 5)
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.summary_functions_and_maps import *
print("Setup complete.")

reviews.head()

Setup complete.


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [2]:
reviews.columns

Index(['country', 'description', 'designation', 'points', 'price', 'province',
       'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
       'variety', 'winery'],
      dtype='object')

# Tutorial 
## Summary Functions
Pandas provides many simple "summary functions" that restructure the data in some useful way. One example is the <code>describe()</code> method:

<code>dataframe.column_name.describe()</code>

In [3]:
reviews.points.describe()

count    129971.000000
mean         88.447138
             ...      
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64

This method generates a high-level summary of the attributes of the given column.<br>
It is type-aware, meaning that its output changes based on the data type of the input.

In [4]:
reviews.taster_name.describe()

count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object
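
The type-aware behaviour of <code>describe()</code> can be checked on a tiny, made-up DataFrame. This is only a sketch: the `toy` frame and its values are assumptions for illustration and are not part of the wine dataset.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one text column with a missing value
toy = pd.DataFrame({
    "points": [87, 90, 93, 90],
    "taster": ["Ann", "Bob", "Ann", None],
})

numeric_summary = toy.points.describe()  # count, mean, std, min, quartiles, max
text_summary = toy.taster.describe()     # count, unique, top, freq

print(numeric_summary.index.tolist())
print(text_summary.index.tolist())
```

Note that `count` ignores missing values, which is likely why the `taster_name` count above is lower than the `points` count.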

In [5]:
reviews.points.mean()

88.44713820775404

In [6]:
# to see a list of unique values, use unique()

reviews.taster_name.unique()

array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)

To see a list of unique values and how often each occurs in the dataset,<br>
use <code>value_counts()</code>.

In [9]:
# reviews.taster_name.unique()
reviews.taster_name.value_counts()

Roger Voss           25514
Michael Schachner    15134
                     ...  
Fiona Adams             27
Christina Pickard        6
Name: taster_name, Length: 19, dtype: int64
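
One difference between the two methods is easy to miss: <code>unique()</code> keeps the missing value as an entry, while <code>value_counts()</code> drops it by default. That would explain why `nan` appears in the array above while the counts list only 19 names. A minimal sketch on a made-up Series (the `tasters` data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy Series with one missing taster
tasters = pd.Series(["Roger Voss", "Paul Gregutt", None, "Roger Voss"])

distinct = tasters.unique()                           # missing value is included
counts = tasters.value_counts()                       # missing value is excluded by default
counts_with_nan = tasters.value_counts(dropna=False)  # opt in to counting it

print(len(distinct), len(counts), len(counts_with_nan))
```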

## Maps

> A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values.

In data science we often need to create new representations from existing data, or to transform data from its current format into the format we want later.

Maps handle this work, which makes them extremely important for getting your work done!

In [10]:
review_points_mean = reviews.points.mean()  # mean score across all reviews
reviews.points.map(lambda p: p - review_points_mean)  # subtract the mean from each score

0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

Let's memorize this!

In [50]:
reviews.points.map(lambda x: x - review_points_mean)

0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

In [51]:
reviews.points.map(lambda x: x - review_points_mean).mean()

-1.2830454158312965e-14

In [53]:
reviews.points.mean()

88.44713820775404
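
The two results above work as a sanity check: after subtracting the mean, the column should average to roughly zero. The tiny value `-1.28e-14` is floating-point rounding error rather than a real offset. A small sketch of the same check, using made-up scores (the `points` values are an assumption):

```python
import pandas as pd

# Hypothetical toy scores
points = pd.Series([87, 87, 90, 91])
points_mean = points.mean()

# Centre each score on the mean, as in the cells above
centered = points.map(lambda p: p - points_mean)

# Compare against zero with a tolerance instead of testing for exact equality
is_centered = abs(centered.mean()) < 1e-9
print(is_centered)
```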

<code>apply()</code> is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.

In [60]:
def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,-1.447138,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,-1.447138,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,1.552862,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,1.552862,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit


Note that <code>map()</code> and <code>apply()</code> return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of reviews, we can see that it still has its original points value.
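
On the real data, something like `reviews.head(1)` would confirm this. The sketch below shows the same thing on a made-up DataFrame. The `toy` frame and the `remean` helper are assumptions for illustration, and the helper copies the row explicitly so that the example does not rely on how pandas passes rows to the function.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "country": ["Italy", "US", "France"],
    "points": [87.0, 90.0, 93.0],
})
mean_points = toy.points.mean()  # 90.0

def remean(row):
    # Work on a copy so the input row is never mutated
    row = row.copy()
    row["points"] = row["points"] - mean_points
    return row

centered = toy.apply(remean, axis="columns")

print(toy.points.tolist())       # the original values are untouched
print(centered.points.tolist())  # the transformed copy
```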



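Pandas also provides many common mapping operations as built-in operators. For example, subtracting a single value from a Series, or joining two string columns:

In [None]: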
review_points_mean = reviews.points.mean()
reviews.points - review_points_mean

In [None]:
reviews.country + " - " + reviews.region_1
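
The two cells above were not executed in this notebook, so here is a small sketch of what such operator expressions produce. The `toy` frame is an assumption for illustration. Operators work on whole columns at once, which is generally faster than calling a Python function per element, and a missing value on either side yields a missing result.

```python
import pandas as pd

# Hypothetical toy data with one missing region
toy = pd.DataFrame({
    "country": ["Italy", "US"],
    "region_1": ["Etna", None],
    "points": [87, 91],
})
mean_points = toy.points.mean()  # 89.0

via_map = toy.points.map(lambda p: p - mean_points)  # element by element
via_operator = toy.points - mean_points              # whole column at once

# String columns can be combined with + as well
labels = toy.country + " - " + toy.region_1

print(via_operator.tolist())
print(labels.tolist())
```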

# Exercises

## 1.

What is the median of the `points` column in the `reviews` DataFrame?

In [None]:
median_points = reviews.points.median()

# Check your answer
q1.check()

In [None]:
#q1.hint()
# q1.solution()

## 2. 
What countries are represented in the dataset? (Your answer should not include any duplicates.)

In [None]:
# Alternative: countries = set(reviews.country.values)
# unique() returns each distinct country once
countries = reviews.country.unique()


# Check your answer
q2.check()

In [None]:
# q2.hint()
# q2.solution()

## 3.
How often does each country appear in the dataset? Create a Series `reviews_per_country` mapping countries to the count of reviews of wines from that country.

In [None]:
# value_counts() returns a Series indexed by country, with the review count as the value
reviews_per_country = reviews.country.value_counts()

# Check your answer
q3.check()

In [None]:
# q3.hint()
#q3.solution()

## 4.
Create variable `centered_price` containing a version of the `price` column with the mean price subtracted.

(Note: this 'centering' transformation is a common preprocessing step before applying various machine learning algorithms.) 

In [None]:
# Subtract the mean price from every price in a single operation
centered_price = reviews.price - reviews.price.mean()

# Check your answer
q4.check()

In [None]:
# q4.hint()
# q4.solution()

## 5.
I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable `bargain_wine` with the title of the wine with the highest points-to-price ratio in the dataset.

How can we find a wine with high points but a low price?

In [27]:
# idxmax() returns the row label of the highest points-to-price ratio; look up that row's title
bargain_wine = reviews.title.loc[(reviews.points / reviews.price).idxmax()]

q5.check()

<span style="color:#33cc33">Correct</span>

In [25]:
q5.hint()
# q5.solution()

<span style="color:#3366cc">Hint:</span> The `idxmax` method may be useful here.

Let's remember the <code>idxmax</code> function!

In [26]:
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
bargain_wine

q5.check()

<span style="color:#33cc33">Correct</span>
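
To see why <code>idxmax</code> fits this problem, it may help to trace it on a few made-up rows. The `toy` frame below is an assumption for illustration. The method returns the index label of the largest value and skips missing values by default, so wines with no price do not break the lookup.

```python
import pandas as pd

# Hypothetical toy data; wine "C" has no price
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "points": [90, 85, 88],
    "price": [30.0, 5.0, None],
})

ratio = toy.points / toy.price  # 3.0, 17.0, NaN
best_idx = ratio.idxmax()       # label of the largest ratio, ignoring NaN
best_title = toy.loc[best_idx, "title"]

print(best_idx, best_title)
```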

## 6.
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series `descriptor_counts` counting how many times each of these two words appears in the `description` column in the dataset.

In [28]:
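# First attempt: isin() tests whether the whole description equals one of the words,
# not whether it contains them, so every row comes back False
reviews.description.isin(['tropical','fruity']).value_counts()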

False    129971
Name: description, dtype: int64

In [31]:
# description =

# Check your answer
q6.check()

<span style="color:#cc3333">Incorrect:</span> Expected `descriptor_counts` to have length 2 but was actually 1

In [None]:
q6.hint()
q6.solution()

In [44]:
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

descriptor_counts

tropical    3607
fruity      9090
dtype: int64

Hmm... I don't fully understand this yet.
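
One way to unpack the solution is to run it step by step on a few made-up descriptions. The `descriptions` Series below is an assumption for illustration. The expression `"tropical" in desc` is a plain Python substring test, `map` turns it into a Series of True/False values, and `sum()` counts the True entries because True is treated as 1.

```python
import pandas as pd

# Hypothetical toy descriptions
descriptions = pd.Series([
    "Aromas include tropical fruit",
    "This is ripe and fruity",
    "Tart and snappy",
    "A fruity, smooth red",
])

# Step 1: substring test per description gives a boolean Series
is_tropical = descriptions.map(lambda desc: "tropical" in desc)
is_fruity = descriptions.map(lambda desc: "fruity" in desc)

# Step 2: summing booleans counts the True values
n_trop = is_tropical.sum()
n_fruity = is_fruity.sum()

# Step 3: assemble the two counts into a labelled Series
descriptor_counts = pd.Series([n_trop, n_fruity], index=["tropical", "fruity"])

# For contrast, isin() looks for exact matches of the whole string and finds none
n_exact = descriptions.isin(["tropical", "fruity"]).sum()

print(descriptor_counts)
print(n_exact)
```

Note that "fruit" in the first description does not count as "fruity", because the test looks for the full substring.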

## 7.
We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars; a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a series `star_ratings` with the number of stars corresponding to each review in the dataset.

In [62]:
# reviews.points.filter(lambda x: x >= 95)

In [67]:
def rating(row):
    # Ad override: Canadian wines always get 3 stars, regardless of points
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

# The checker looks for `star_ratings`; assigning to `stars_ratings` leaves that name undefined
star_ratings = reviews.apply(rating, axis='columns')

# Check your answer
q7.check()

In [64]:
q7.hint()
q7.solution()

<span style="color:#3366cc">Hint:</span> Begin by writing a custom function that accepts a row from the DataFrame as input and returns the star rating corresponding to the row.  Then, use `DataFrame.apply` to apply the custom function to every row in the dataset.

<span style="color:#33cc99">Solution:</span> 
```python
def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1
    
star_ratings = reviews.apply(stars, axis='columns')
```
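
To check the branching logic without running it on all 129,971 reviews, the same function can be applied to a handful of made-up rows that hit each case. The `toy` frame is an assumption for illustration, and the `stars` function is copied from the solution above.

```python
import pandas as pd

# Hypothetical toy rows: a low-scoring Canadian wine, then one row per score band
toy = pd.DataFrame({
    "country": ["Canada", "US", "France", "Italy"],
    "points": [82, 96, 90, 84],
})

def stars(row):
    # The country check comes first so that it overrides the score
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = toy.apply(stars, axis='columns')
print(star_ratings.tolist())
```

The order of the branches matters: if the country test came after the score tests, the Canadian wine with 82 points would fall through to 1 star.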

# Keep going
Continue to **[grouping and sorting](https://www.kaggle.com/residentmario/grouping-and-sorting)**.

---

*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*