**[Pandas Home Page](https://www.kaggle.com/learn/pandas)**

---


# Introduction

Now you are ready to get a deeper understanding of your data.

Run the following cell to load your data and some utility functions (including code to check your answers).

In [1]:
import pandas as pd
pd.set_option("display.max_rows", 5)
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.summary_functions_and_maps import *
print("Setup complete.")

reviews.head()

Setup complete.


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [2]:
reviews.columns

Index(['country', 'description', 'designation', 'points', 'price', 'province',
       'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
       'variety', 'winery'],
      dtype='object')

# Tutorial 
## Summary Functions
Pandas provides many simple "summary functions" which restructure the data in some useful way. For example:

<code>dataframe.column_name.describe()</code>

In [6]:
reviews.describe()

Unnamed: 0,points,price
count,129971.000000,120975.000000
mean,88.447138,35.363389
...,...,...
75%,91.000000,42.000000
max,100.000000,3300.000000


In [7]:
# dataframe.column_name.describe()

reviews.points.describe()

count    129971.000000
mean         88.447138
             ...      
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64

In [8]:
reviews.points.median()

88.0

The mean and median values are almost the same.

The <code>describe()</code> method generates a high-level summary of the attributes of the given column.<br>
It is type-aware, meaning that its output changes based on the data type of the input.

In [9]:
reviews.taster_name.describe()

count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object
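
To make the "type-aware" behaviour concrete, here is a minimal sketch on a hypothetical four-row DataFrame (`toy` is a stand-in, not the real `reviews` data). It compares the summary labels returned for a numeric column and a text column.

```python
import pandas as pd

# Hypothetical miniature stand-in for the reviews DataFrame.
toy = pd.DataFrame({
    "points": [87, 90, 95, 90],
    "taster_name": ["Roger Voss", "Paul Gregutt", "Roger Voss", None],
})

# Numeric column: count, mean, std, min, quartiles, max.
numeric_summary = toy.points.describe()

# Text column: count, unique, top, freq (missing values are not counted).
text_summary = toy.taster_name.describe()

print(list(numeric_summary.index))
print(list(text_summary.index))
```

The same method call produces two different sets of statistics, which is why `reviews.points.describe()` and `reviews.taster_name.describe()` look so different.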

In [10]:
# to see a list of unique values, use unique()

reviews.taster_twitter_handle.unique()

array(['@kerinokeefe', '@vossroger', '@paulgwine\xa0', nan, '@wineschach',
       '@vboone', '@mattkettmann', '@wawinereport', '@gordone_cellars',
       '@JoeCz', '@AnneInVino', '@laurbuzz', '@worldwineguys',
       '@suskostrzewa', '@bkfiona', '@winewchristina'], dtype=object)

To see a list of unique values together with how often each one occurs in the dataset,<br>
use <code>value_counts()</code>.

In [12]:
reviews.taster_name.unique()

array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)

In [15]:
reviews.taster_name.value_counts()

Roger Voss           25514
Michael Schachner    15134
                     ...  
Fiona Adams             27
Christina Pickard        6
Name: taster_name, Length: 19, dtype: int64
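
As a small sketch of two `value_counts()` options that may be useful here: by default missing values are skipped, `dropna=False` includes them, and `normalize=True` returns fractions instead of counts. The Series below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample of taster names, including one missing value.
tasters = pd.Series(["Roger Voss", "Paul Gregutt", "Roger Voss", None, "Roger Voss"])

counts = tasters.value_counts()                        # missing values are skipped
counts_with_nan = tasters.value_counts(dropna=False)   # missing values get their own row
shares = tasters.value_counts(normalize=True)          # fractions of the non-missing values

print(counts)
print(counts_with_nan)
print(shares)
```

This would explain why the `count` reported by `describe()` for `taster_name` can be smaller than the number of rows: reviews without a taster are left out.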

## Maps

Last time, this <code>map</code> function gave me a really hard time...!! - 20.05.30.sat.pm12:50 -<br>
Will it give me a hard time again today?! - 20.06.06.sat am12:30 -

> A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values.

In data science we often need to create new representations from existing data, or to transform data from its current format into the format we want it to be in later.

Maps are what handle this work, making them extremely important for getting your work done!
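
A minimal sketch of the idea, using a hypothetical three-value Series rather than the real data: `Series.map()` calls a function once per value and collects the results in a new Series. The labels `"high"` and `"regular"` and the threshold of 90 are invented for the example.

```python
import pandas as pd

points = pd.Series([87, 90, 95], name="points")

# map() applies the function to each value in turn and returns
# a new Series with the same index as the input.
labels = points.map(lambda p: "high" if p >= 90 else "regular")

print(labels)
```

Each input value is "mapped" to exactly one output value, which matches the mathematical sense of the term quoted above.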

The best way to learn is, of course, to try it out hands-on...!!

Job assessment question 1. Store the values of the `points` column of the reviews DataFrame with the mean subtracted.<br>
YJ, how would you do it?

In [19]:
test = reviews.points
test

0         87
1         87
          ..
129969    90
129970    90
Name: points, Length: 129971, dtype: int64

In [22]:
test = test.to_frame()

In [24]:
test.points = test.points - test.points.mean()
test.points

0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

Solving it with the <code>map</code> function

In [25]:
reviews_points_mean = reviews.points.mean()
reviews.points.map(lambda x: x - reviews_points_mean)  # the result is not assigned, so it is discarded
reviews.points  # the original column is unchanged

0         87
1         87
          ..
129969    90
129970    90
Name: points, Length: 129971, dtype: int64

In [28]:
# Store the values of the points column with the mean subtracted!

reviews_points_mean = reviews.points.mean()
reviews.points.map(lambda x: x - reviews_points_mean)

0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

In [27]:
review_points_mean = reviews.points.mean()
reviews.points.map(lambda x: x - review_points_mean)

0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

In [None]:
reviews.points.map(lambda x: x - review_points_mean)

Memorize this!

Now I can understand it. The <code>map</code> function is really useful! It can be used like this to transform the values of a column!

In [29]:
# Again,

# reviews.points.map(lambda x: x - reviews.points.mean())  # not a good approach?! (the mean is recomputed for every element)

reviews_points_mean = reviews.points.mean()
transformed_points = reviews.points.map(lambda x: x - reviews_points_mean)
print(reviews.points)

0         87
1         87
          ..
129969    90
129970    90
Name: points, Length: 129971, dtype: int64


In [30]:
print(transformed_points)

0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64


It would be nice to be able to connect this calculation to deviation, variance, and standard deviation!<br>
And to the normal distribution, even better!!
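
The mean-subtracted values are exactly the deviations that variance and standard deviation are built from. The sketch below, on a hypothetical five-value Series, computes the sample variance by hand (dividing by n - 1, which is also the default for pandas `var()` and `std()`) and then the z-scores often used when working with the normal distribution.

```python
import pandas as pd

points = pd.Series([87, 87, 90, 92, 94], name="points")

deviation = points - points.mean()                      # the centered values
variance = (deviation ** 2).sum() / (len(points) - 1)   # sample variance (ddof=1)
std_dev = variance ** 0.5                               # sample standard deviation
z_scores = deviation / std_dev                          # standardized values

print(variance, points.var())
print(std_dev, points.std())
print(z_scores)
```

The hand-computed numbers should agree with `points.var()` and `points.std()`, so in practice the built-in methods are the simpler choice.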

<code>apply()</code> is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.

In [31]:
def remean_points(row):
    row.points = row.points - reviews_points_mean
    return row

reviews.apply(remean_points, axis='columns')

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,-1.447138,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,-1.447138,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,1.552862,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,1.552862,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit



Note that <code>map()</code> and <code>apply()</code> return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of reviews, we can see that it still has its original points value.
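
A short sketch of this point on a hypothetical three-row DataFrame: the original column keeps its values after `map()`, and the result only persists if it is assigned somewhere. The column name `points_centered` is made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({"points": [87, 90, 95]})
mean_points = toy.points.mean()

centered = toy.points.map(lambda p: p - mean_points)  # a new Series
original_first = toy.points.iloc[0]                   # the source column is untouched

# To keep the result, assign it explicitly (here, to a new column).
toy["points_centered"] = centered

print(original_first)
print(toy)
```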



In [None]:
review_points_mean = reviews.points.mean()
reviews.points - review_points_mean

In [None]:
reviews.country + " - " + reviews.region_1
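
The two cells above use operators on whole columns instead of `map()`. As a rough sketch on hypothetical data, the operator form gives the same centered values as the `map()` version, and `+` on text columns concatenates them element by element.

```python
import pandas as pd

toy = pd.DataFrame({
    "points": [87, 90, 95],
    "country": ["Italy", "US", "France"],
    "region_1": ["Etna", "Willamette Valley", "Alsace"],
})
mean_points = toy.points.mean()

via_map = toy.points.map(lambda p: p - mean_points)   # one function call per value
via_operator = toy.points - mean_points               # arithmetic on the whole column

labels = toy.country + " - " + toy.region_1           # element-wise string concatenation

print(via_operator)
print(labels)
```

Operators are usually shorter to write and faster than `map()`, while `map()` and `apply()` remain useful for logic that cannot be expressed as simple arithmetic, such as conditionals.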

# Exercises

## 1.

What is the median of the `points` column in the `reviews` DataFrame?

In [33]:
median_points = reviews.points.median()
q1.check()

<span style="color:#33cc33">Correct</span>

In [None]:
#q1.hint()
# q1.solution()

## 2. 
What countries are represented in the dataset? (Your answer should not include any duplicates.)

In [35]:
reviews.country.describe()

count     129908
unique        43
top           US
freq       54504
Name: country, dtype: object

In [40]:
reviews.country.unique()

array(['Italy', 'Portugal', 'US', 'Spain', 'France', 'Germany',
       'Argentina', 'Chile', 'Australia', 'Austria', 'South Africa',
       'New Zealand', 'Israel', 'Hungary', 'Greece', 'Romania', 'Mexico',
       'Canada', nan, 'Turkey', 'Czech Republic', 'Slovenia',
       'Luxembourg', 'Croatia', 'Georgia', 'Uruguay', 'England',
       'Lebanon', 'Serbia', 'Brazil', 'Moldova', 'Morocco', 'Peru',
       'India', 'Bulgaria', 'Cyprus', 'Armenia', 'Switzerland',
       'Bosnia and Herzegovina', 'Ukraine', 'Slovakia', 'Macedonia',
       'China', 'Egypt'], dtype=object)

In [41]:
# countries = set(reviews.country.values)
# countries = reviews.country.unique()
countries = reviews.country.unique()

# Check your answer
q2.check()

<span style="color:#33cc33">Correct</span>

In [None]:
# q2.hint()
# q2.solution()

## 3.
How often does each country appear in the dataset? Create a Series `reviews_per_country` mapping countries to the count of reviews of wines from that country.

In [44]:
reviews_per_country = reviews.country.value_counts()
reviews_per_country
q3.check()

<span style="color:#33cc33">Correct</span>

In [None]:
# q3.hint()
#q3.solution()

## 4.
Create variable `centered_price` containing a version of the `price` column with the mean price subtracted.

(Note: this 'centering' transformation is a common preprocessing step before applying various machine learning algorithms.) 

Take note!!<br>
The 'centering' transformation is a common preprocessing step before applying various machine learning algorithms!

In [47]:
reviews_price_mean = reviews.price.mean()

centered_price = reviews.price.map(lambda x: x - reviews_price_mean)
centered_price

0               NaN
1        -20.363389
            ...    
129969    -3.363389
129970   -14.363389
Name: price, Length: 129971, dtype: float64

In [48]:
q4.check()

<span style="color:#33cc33">Correct</span>

In [51]:
# q4.hint()
# q4.solution()

## 5.
I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable `bargain_wine` with the title of the wine with the highest points-to-price ratio in the dataset.

How can I find a wine that has high points but a low price?

In [52]:
reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [66]:
# bargain_wine = 

points_price = reviews.points / reviews.price
# points_price.sort_values(ascending=False)
max_ratio = points_price.max()  # 21.5 for this dataset

# Index labels of every wine that reaches the maximum ratio
points_price[points_price == max_ratio].index

Int64Index([64590, 126096], dtype='int64')

In [70]:
bargain_wine = reviews.title[64590]
q5.check()

<span style="color:#33cc33">Correct</span>

*Is there a better way?*

In [71]:
reviews['points to price'] = reviews.points / reviews.price
reviews['points to price']

0              NaN
1         5.800000
            ...   
129969    2.812500
129970    4.285714
Name: points to price, Length: 129971, dtype: float64

In [72]:
reviews['points to price'].sort_values(ascending=False)

64590     21.5
126096    21.5
          ... 
129893     NaN
129964     NaN
Name: points to price, Length: 129971, dtype: float64

In [74]:
bargain_wine = reviews.title.loc[64590]
q5.check()

<span style="color:#33cc33">Correct</span>

In [77]:
bargain_wine = reviews.title.loc[reviews['points to price'].idxmax()]
bargain_wine
q5.check()

<span style="color:#33cc33">Correct</span>

In [78]:
bargain_wine = reviews.title.loc[(reviews.points / reviews.price).idxmax()]  # idxmax(): return index of first occurrence of maximum over requested axis
bargain_wine
q5.check()

<span style="color:#33cc33">Correct</span>

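Let's remember the <code>idxmax</code> function...!!

<code>pandas.DataFrame.idxmax()</code><br>
Return index of first occurrence of maximum over requested axis.<br>
(The cells above call <code>pandas.Series.idxmax()</code>, which returns the index label of the first maximum value in a single column.)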

In [None]:
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
bargain_wine

q5.check()
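
Because two wines tie at the maximum ratio, it may help to see how `idxmax()` treats ties and missing prices. This is a small sketch on hypothetical data (the titles, prices, and index labels are invented): `idxmax()` returns only the first label, while a boolean comparison against `max()` returns every tied row.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "points": [86, 86, 90, 87],
        "price": [4.0, 4.0, 30.0, None],   # the last wine has no price
    },
    index=[10, 20, 30, 40],
)
ratio = toy.points / toy.price             # NaN where the price is missing

first_best = ratio.idxmax()                          # label of the first maximum; NaN is skipped
all_best = toy.loc[ratio == ratio.max(), "title"]    # every wine tied for the maximum

print(first_best)
print(all_best)
```

Under these assumptions, `idxmax()` picks the wine at label 10 even though the wine at label 20 is an equally good bargain.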

## 6.
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series `descriptor_counts` counting how many times each of these two words appears in the `description` column in the dataset.

In [80]:
# descriptor_counts = 

In [84]:
n_trop = reviews.description.map(lambda x: "tropical" in x).sum()

In [85]:
n_fruit = reviews.description.map(lambda x: "fruity" in x).sum()

In [92]:
descriptor_counts = pd.Series([n_trop, n_fruit],
                             index = ['n_trop', 'n_fruit'])

descriptor_counts

n_trop     3607
n_fruit    9090
dtype: int64

In [95]:
q6.check()

<span style="color:#33cc33">Correct</span>

In [79]:
q6.hint()

<span style="color:#3366cc">Hint:</span> Use a map to check each description for the string `tropical`, then count up the number of times this is `True`. Repeat this for `fruity`. Finally, create a `Series` combining the two values.

In [None]:
n_trop = reviews.description.map(lambda desc: "tropical" in desc)
n_trop

In [None]:
n_fruity = reviews.description.map(lambda desc: "fruity" in desc)
n_fruity

In [None]:
n_trop.sum()

In [None]:
n_fruity.sum()

In [None]:
descriptor_counts = pd.DataFrame([3607, 9090],
                                index=['n_trop', 'n_fruity'])
descriptor_counts.squeeze()

In [None]:
type(descriptor_counts.squeeze())

In [None]:
descriptor_counts.iloc[:, 0]  # a DataFrame has no to_Series attribute; select its single column instead

In [None]:
# Note: isin() tests whether an entire description equals 'tropical' or 'fruity';
# it does not search inside the text, so it cannot count word occurrences.
reviews.description.isin(['tropical', 'fruity']).value_counts()

In [None]:
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity],
                             index = ['tropical', 'fruity'])
descriptor_counts

q6.check()

In [94]:
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], 
                             index = ['tropical', 'fruity'])

# Check your answer
q6.check()

<span style="color:#33cc33">Correct</span>

In [None]:
q6.hint()
q6.solution()

In [None]:
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

descriptor_counts

Hmm... I don't fully understand this yet.
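
One way to unpack the solution is to run it step by step on a few hypothetical descriptions. The `map()` call produces a Series of `True`/`False` values, and `sum()` counts the `True` entries because `True` counts as 1. The `str.contains()` line is a built-in alternative, shown here with `regex=False` so the word is matched as a literal substring.

```python
import pandas as pd

# Hypothetical descriptions standing in for reviews.description.
descriptions = pd.Series([
    "Aromas include tropical fruit and broom.",
    "This is ripe and fruity, a wine that is smooth.",
    "Tart and snappy, with tropical and fruity notes.",
])

has_tropical = descriptions.map(lambda desc: "tropical" in desc)  # Series of True/False
n_tropical = has_tropical.sum()                                    # each True adds 1

# Built-in alternative to the lambda above.
n_fruity = descriptions.str.contains("fruity", regex=False).sum()

descriptor_counts = pd.Series([n_tropical, n_fruity], index=["tropical", "fruity"])

print(has_tropical)
print(descriptor_counts)
```

Note that both approaches count the number of descriptions containing the word, not the total number of times the word appears.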

## 7.
We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a series `star_ratings` with the number of stars corresponding to each review in the dataset.

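A problem exactly like this came up in yesterday's test.<br>
It can be solved with some tedious manual work, but that is not a good approach, so let's take this opportunity to work through it properly!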

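* Create a series 'star_ratings'
    * Wines with 95 points or more get 3 stars
    * 85 points or more but less than 95: 2 stars
    * Less than 85 points: 1 star
    * However, all wines from Canada get 3 stars!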

In [96]:
reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,points to price
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,5.8
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,6.214286
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,6.692308
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,1.338462


In [98]:
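def star(row):
    # Check the country first: the points branches below cover every score,
    # so a country check placed after them would never be reached.
    if row.country == 'Canada':  # the dataset spells it 'Canada', not 'canada'
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1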
    
star_ratings = reviews.apply(star, axis = 'columns')

In [100]:
star_ratings
q7.check()

<span style="color:#33cc33">Correct</span>

In [101]:
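def rate_stars(row):
    if row.country == 'Canada':  # must match the capitalization used in the dataset
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

# A separate function name keeps star_ratings from overwriting the function itself.
star_ratings = reviews.apply(rate_stars, axis='columns')
star_ratings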

0         2
1         2
         ..
129969    2
129970    2
Length: 129971, dtype: int64

In [102]:
q7.check()

<span style="color:#33cc33">Correct</span>

This is seriously amazing...!!
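
As a possible alternative to a row-by-row `apply()`, the same rules can be written with whole-column conditions and `numpy.select`, which picks the value for the first condition that is true in each row. This is only a sketch on a hypothetical four-row DataFrame, not the exercise's official solution.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame.
toy = pd.DataFrame({
    "country": ["Canada", "US", "Italy", "France"],
    "points": [84, 96, 87, 80],
})

conditions = [
    (toy.country == "Canada") | (toy.points >= 95),  # 3 stars
    toy.points >= 85,                                # 2 stars
]
# Rows matching neither condition fall back to the default of 1 star.
star_ratings = pd.Series(np.select(conditions, [3, 2], default=1), index=toy.index)

print(star_ratings)
```

The conditions are checked in order, so the Canada rule takes priority in the same way as putting it first in the `if`/`elif` chain.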

# Keep going
Continue to **[grouping and sorting](https://www.kaggle.com/residentmario/grouping-and-sorting)**.

---

*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*