### Introduction
Plucking the right data out of our data representation is critical to getting work done, as we demonstrated in the exercises.

This tutorial will cover different operations we can apply to our data to get the input "just right".

In [1]:
import pandas as pd

In [30]:
# Get data
wine_catalog = pd.read_csv('../data/wine_catalog/winemag-data_first150k.csv')
wine_catalog.shape

(144035, 10)

In [31]:
# Find rows whose `points` value is not a valid number.
# Valid scores are at most three characters long, so collect every row with a longer value.
x = []
for i, j in enumerate(wine_catalog.points):
    if len(str(j)) > 3:
        x.append(i)
        # print(j)  # uncomment to inspect the offending values
print(x)

[14178, 14997, 18800, 19076, 21217, 25151, 30794, 32346, 37674, 38366, 38534, 39099, 39165, 39218, 41212, 44310, 45544, 48624, 48756, 49404, 49747, 51487, 52109, 52294, 52295, 56726, 60731, 61027, 69136, 70922, 70983, 71272, 72141, 72658, 74487, 74535, 75755, 76667, 76938, 77089, 77389, 77746, 77959, 78578, 79433, 79985, 80024, 81560, 81829, 83671, 89674, 93332, 93367, 93943, 102634, 115316, 124294, 128757, 133431, 137107, 137962, 141477]


In [32]:
# Drop the rows with invalid values in a single call
wine_catalog = wine_catalog.drop(index=x)

In [33]:
# Drop rows with missing (NaN) values in the 'points' column of the Wine Magazine dataset.
wine_catalog = wine_catalog.dropna(subset=['points'])

In [34]:
# Change the datatype of one or more columns to numeric
wine_catalog[['points', 'price']] = wine_catalog[['points', 'price']].astype(float)

In [35]:
# Drop rows with more than 100 points
wine_catalog = wine_catalog.drop(index=wine_catalog.index[wine_catalog.points > 100])

In [36]:
wine_catalog.to_csv('../data/wine_catalog/wine_store_dataset.csv')  # save the cleaned catalog to a new file
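
The cleaning steps above can also be expressed without explicit loops. The following is a minimal sketch on invented toy data (the `raw` frame and its values are assumptions, not rows from the catalog); it swaps the length check for `pd.to_numeric(..., errors="coerce")`, which turns unparseable values into NaN.

```python
import pandas as pd

# Toy stand-in for the raw catalog; the values are invented for illustration.
raw = pd.DataFrame({
    "points": ["96", "88", "Napa Valley", None, "120"],
    "price": ["235", "20", "15", "30", "12"],
})

cleaned = raw.copy()
# errors="coerce" replaces anything that cannot be parsed as a number with NaN.
cleaned["points"] = pd.to_numeric(cleaned["points"], errors="coerce")
cleaned["price"] = pd.to_numeric(cleaned["price"], errors="coerce")

# Drop rows without a valid score, then keep only scores of at most 100.
cleaned = cleaned.dropna(subset=["points"])
cleaned = cleaned[cleaned["points"] <= 100]

print(cleaned)
```

Whether this is preferable depends on the data: the length check in the notebook also catches numeric-looking values that are too long, while the coercion approach only catches values that are not numbers at all.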

### Summary functions

Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way.  
For example, consider the `describe()` method:

In [9]:
wine_catalog.points.describe()

count    143965.000000
mean         87.873462
std           3.216045
min          80.000000
25%          86.000000
50%          88.000000
75%          90.000000
max         100.000000
Name: points, dtype: float64

In [10]:
print(list(wine_catalog.columns))  # list the column names
wine_catalog.province.describe()  # `describe()` gives a different summary for a text column

['country', 'designation', 'points', 'price', 'province', 'region_1', 'region_2', 'variety', 'winery', 'last_year_points']


count         143963
unique           451
top       California
freq           43765
Name: province, dtype: object

To see the mean of the points awarded (that is, how an average wine scores), we can use the `mean()` function.

In [11]:
wine_catalog.points.mean()

87.87346229986456

To see a list of unique values, we can use the `unique()` function.

In [12]:
len(wine_catalog.province.unique())

452

In [13]:
len(sorted(wine_catalog.winery.unique()))

14582
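
Note that `describe()` reported 451 unique provinces, while `len(unique())` gives 452. One plausible explanation, not confirmed by the notebook itself, is missing values: `unique()` includes NaN in its result, whereas `nunique()` excludes it by default. A small sketch on invented data:

```python
import numpy as np
import pandas as pd

# Toy Series with one missing value; the entries are invented for illustration.
province = pd.Series(["California", "Oregon", np.nan, "California"])

n_with_nan = len(province.unique())               # NaN counts as a value here
n_without_nan = province.nunique()                # NaN is excluded by default
n_explicit = province.nunique(dropna=False)       # include NaN explicitly

print(n_with_nan, n_without_nan, n_explicit)
```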

`value_counts()` builds a Series of the unique values and shows how often each one occurs in the dataset.

In [14]:
wine_catalog.price.value_counts()

20.0     7484
15.0     6760
18.0     5700
25.0     5694
30.0     5166
         ... 
266.0       1
172.0       1
151.0       1
580.0       1
243.0       1
Name: price, Length: 351, dtype: int64

In [15]:
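# For every price that occurs fewer than 10 times, look up a winery.
# Caution: `i` is a count here, not a row label, so `winery[i]` returns the row
# labelled with that count rather than the wines sold at the rare price.
winery = []
for i in list(wine_catalog.price.value_counts()):
    if i < 10:
        winery.append(wine_catalog.winery[i])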

In [16]:
list(set(winery))  # remove duplicates (a set is unordered, so this does not sort)

['Blue Farm',
 'Macauley',
 'Bergström',
 'Maurodos',
 'Numanthia',
 'Ponzi',
 'Domaine de la Bégude',
 'Bodega Carmen Rodríguez']
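
If the goal is to find the wineries that sell wine at rarely occurring prices, one possible approach is to filter the `value_counts()` result and then select rows with `isin()`. This is a sketch under that assumption about the intent; the `wines` frame and the threshold of 3 are invented to keep the example small.

```python
import pandas as pd

# Toy data; winery names and prices are invented for illustration.
wines = pd.DataFrame({
    "winery": ["A", "B", "C", "D", "E"],
    "price": [20.0, 20.0, 20.0, 580.0, 243.0],
})

counts = wines["price"].value_counts()

# The index of value_counts() holds the prices; the values hold the counts.
rare_prices = counts[counts < 3].index

# Select the wineries whose price is one of the rare prices.
rare_wineries = wines.loc[wines["price"].isin(rare_prices), "winery"].unique()

print(sorted(rare_wineries))
```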

In [17]:
list(wine_catalog['winery'].loc[wine_catalog.price > 1000])  # wineries with a bottle priced over $1000

['Blair',
 'Krug',
 'Château Latour',
 'Château Margaux',
 'Château La Mission Haut-Brion',
 'Château Mouton Rothschild',
 'Château Haut-Brion',
 'Krug',
 'Krug']

In [18]:
# Sort the wines priced at $1000 or more by price
x = wine_catalog.loc[wine_catalog.price >= 1000]
x.sort_values(by=['price'])

Unnamed: 0,country,designation,points,price,province,region_1,region_2,variety,winery,last_year_points
34603,France,,94.0,1000.0,Bordeaux,Pessac-Léognan,,Bordeaux-style White Blend,Château La Mission Haut-Brion,87
34012,France,,97.0,1100.0,Bordeaux,Pessac-Léognan,,Bordeaux-style Red Blend,Château La Mission Haut-Brion,83
34027,France,,96.0,1200.0,Bordeaux,Pessac-Léognan,,Bordeaux-style Red Blend,Château Haut-Brion,80
34024,France,,96.0,1300.0,Bordeaux,Pauillac,,Bordeaux-style Red Blend,Château Mouton Rothschild,92
25520,France,Clos du Mesnil,100.0,1400.0,Champagne,Champagne,,Chardonnay,Krug,97
50406,France,Clos du Mesnil,100.0,1400.0,Champagne,Champagne,,Chardonnay,Krug,88
80484,France,Clos du Mesnil,100.0,1400.0,Champagne,Champagne,,Chardonnay,Krug,81
34007,France,,98.0,1900.0,Bordeaux,Margaux,,Bordeaux-style Red Blend,Château Margaux,88
13164,US,Roger Rose Vineyard,91.0,2013.0,California,Arroyo Seco,Central Coast,Chardonnay,Blair,81
34006,France,,99.0,2300.0,Bordeaux,Pauillac,,Bordeaux-style Red Blend,Château Latour,81


### `Series.idxmax`
Returns the index label of the maximum element.

In [19]:
# idxmax() also works on a Series computed on the fly, such as the ratio of two columns.
bargain_idx =  (wine_catalog.points / wine_catalog.price).idxmax()
print(bargain_idx)
bargain_wine = wine_catalog.loc[bargain_idx, 'winery']
print(bargain_wine)

wine_catalog.loc[bargain_idx]

24884
Bandit


country                           US
designation                      NaN
points                            86
price                              4
province                  California
region_1                  California
region_2            California Other
variety                       Merlot
winery                        Bandit
last_year_points                  97
Name: 24884, dtype: object
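
A point worth illustrating is that `idxmax()` returns an index label, not a position, which is why the result can be passed straight to `.loc`. The sketch below uses a three-row toy frame with a non-contiguous index; the frame is an assumption made for the example.

```python
import pandas as pd

# Small toy frame with a non-contiguous index, mimicking a catalog after rows were dropped.
wines = pd.DataFrame(
    {
        "winery": ["Heitz", "Bandit", "Krug"],
        "points": [96.0, 86.0, 100.0],
        "price": [235.0, 4.0, 1400.0],
    },
    index=[0, 24884, 25520],
)

# Points per dollar, computed on the fly.
ratio = wines["points"] / wines["price"]

# idxmax() gives the label of the largest ratio, so .loc is the matching accessor.
best_label = ratio.idxmax()
best_winery = wines.loc[best_label, "winery"]

print(best_label, best_winery)
```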

### Maps

### `map()`
"Map" is a term, borrowed from mathematics, for a function that takes one set of values and maps them to another set of values.

In [20]:
# mean() skips NaN values by default (skipna=True)
review_points_mean = wine_catalog.points.mean()
review_points_mean

87.87346229986456

In [21]:
# Switch to the median; the variable keeps its old name but now holds the median
review_points_mean = wine_catalog.points.median()
review_points_mean

88.0

In [22]:
wine_catalog.points.map(lambda i: i - review_points_mean)  # `i` is each element of the `points` Series in turn

0         8.0
1         8.0
2         8.0
3         8.0
4         7.0
         ... 
144030    3.0
144031    3.0
144032    3.0
144033    2.0
144034    2.0
Name: points, Length: 143965, dtype: float64

The function you pass to `map()` should expect a single value from the Series (a point value, in the above example), and return a transformed version of that value.  
`map()` returns a new Series where all the values have been transformed by your function.
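
For simple arithmetic like this, pandas can also subtract a scalar from a whole Series directly, which gives the same result as `map()`. A minimal sketch on an invented four-element Series (the `points` values and the name `center` are assumptions for the example):

```python
import pandas as pd

# Toy scores, invented for illustration.
points = pd.Series([96.0, 95.0, 90.0, 86.0])
center = points.median()

# Element by element, through a Python function.
via_map = points.map(lambda p: p - center)

# The same computation as a single vectorized expression.
via_arithmetic = points - center

print(via_map.equals(via_arithmetic))
```

`map()` remains the more flexible choice when the transformation cannot be written as plain arithmetic.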

### `apply()`
`apply()` is the equivalent method if we want to transform a whole DataFrame by calling a custom function on each row.

In [23]:
def remean_points(row):
    row.points -= review_points_mean  # holds the median here (88.0), e.g. 96.0 - 88.0 = 8.0
    return row

wine_catalog.apply(remean_points, axis='columns')

Unnamed: 0,country,designation,points,price,province,region_1,region_2,variety,winery,last_year_points
0,US,Martha's Vineyard,8.0,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,94
1,Spain,Carodorum Selección Especial Reserva,8.0,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,92
2,US,Special Selected Late Harvest,8.0,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,100
3,US,Reserve,8.0,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,94
4,France,La Brûlade,7.0,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude,94
...,...,...,...,...,...,...,...,...,...,...
144030,Italy,,3.0,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio,84
144031,France,Cuvée Prestige,3.0,27.0,Champagne,Champagne,,Champagne Blend,H.Germain,83
144032,Italy,Terre di Dora,3.0,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora,97
144033,France,Grand Brut Rosé,2.0,52.0,Champagne,Champagne,,Champagne Blend,Gosset,89
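
`apply()` returns a new DataFrame rather than changing the one it is called on. The sketch below shows this on a two-row toy frame; the frame, the name `recenter`, and the explicit `row.copy()` are assumptions added for the example.

```python
import pandas as pd

# Toy frame, invented for illustration.
wines = pd.DataFrame({"country": ["US", "Spain"], "points": [96.0, 90.0]})
center = wines["points"].mean()

def recenter(row):
    # Work on a copy so the function has no side effects on its input.
    row = row.copy()
    row["points"] = row["points"] - center
    return row

# axis="columns" passes one row at a time to the function.
recentered = wines.apply(recenter, axis="columns")

print(recentered)
print(wines)  # the original frame still holds the raw scores
```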


### Finding the same word
Create a `Series` that counts how many times a word appears in the `variety` column of the dataset.

In [24]:
n_word = wine_catalog.variety.map(lambda i: 'Pinot' in i).sum()
descriptor_counts = pd.Series([n_word], index=['Pinot'])
descriptor_counts

Pinot    17232
dtype: int64
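
The same count can be sketched with the string accessor `str.contains()`, which also handles missing values through its `na` parameter. The toy `variety` Series and the second search word are assumptions for the example.

```python
import pandas as pd

# Toy column with one missing value; entries are invented for illustration.
variety = pd.Series(["Pinot Noir", "Chardonnay", "Pinot Gris", None, "Merlot"])

words = ["Pinot", "Chardonnay"]

# regex=False treats each word as a plain substring; na=False counts missing values as no match.
descriptor_counts = pd.Series(
    [variety.str.contains(word, regex=False, na=False).sum() for word in words],
    index=words,
)

print(descriptor_counts)
```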

In [25]:
wine_catalog.country + '-' + wine_catalog.region_1

0                  US-Napa Valley
1                      Spain-Toro
2               US-Knights Valley
3            US-Willamette Valley
4                   France-Bandol
                   ...           
144030    Italy-Fiano di Avellino
144031           France-Champagne
144032    Italy-Fiano di Avellino
144033           France-Champagne
144034           Italy-Alto Adige
Length: 143965, dtype: object

### DataFrame or Series (if, else and loc)

In [26]:
# We'd like to host these wine reviews on our website,
# but a rating system ranging from 80 to 100 points is too hard to understand, so we'd like to translate them into simple star ratings.
# A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

# Also, the Canadian Vintners Association bought a lot of ads on the site,
# so any wines from Canada should automatically get 3 stars, regardless of points.

# Create a series star_ratings with the number of stars corresponding to each review in the dataset.

def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = wine_catalog.apply(stars, axis='columns')
star_ratings

0         3
1         3
2         3
3         3
4         3
         ..
144030    2
144031    2
144032    2
144033    2
144034    2
Length: 143965, dtype: int64

In [27]:
# Second attempt at the star-rating task described above, this time with boolean masks and .loc.
# Caution: this version has bugs, and the output below shows their effect.
def star_rating():
    x = wine_catalog.points
    y = wine_catalog  # not a copy: the assignments below overwrite wine_catalog.points

    y.loc[x >= 95, 'points'] = 3
    y.loc[(x >= 85) & (x < 95), 'points'] = 2
    # Bug: scores start at 80, so this mask matches nothing and scores 80-84 stay unchanged.
    # The Canada rule is also missing.
    y.loc[(x < 80), 'points'] = 1

    return y

star_ratings = star_rating()
star_ratings.loc[star_ratings.points > 4]  # rows that were never converted to stars

Unnamed: 0,country,designation,points,price,province,region_1,region_2,variety,winery,last_year_points
1522,Argentina,Clásico,84.0,12.0,Mendoza Province,Mendoza,,Malbec,Altos Las Hormigas,94
1523,Portugal,,84.0,18.0,Alentejano,,,Rosé,Dona Maria-Júlio Bastos,99
1524,Portugal,,84.0,,Tejo,,,Sauvignon Blanc,Quinta da Alorna,87
1525,US,Bianca's White,84.0,15.0,California,Napa Valley,Napa,Sauvignon Blanc-Semillon,Napa by N.A.P.A.,82
1526,Portugal,Senses,84.0,13.0,Alentejano,,,Alvarinho,Adega Cooperativa de Borba,82
...,...,...,...,...,...,...,...,...,...,...
144009,Chile,Prima Reserva,81.0,13.0,Maipo Valley,,,Cabernet Sauvignon,De Martino,94
144010,Chile,Reserva,81.0,12.0,Maipo Valley,,,Merlot,Undurraga,84
144011,Chile,Estate Bottled,81.0,10.0,Maipo Valley,,,Chardonnay,De Martino,89
144012,Chile,120,81.0,7.0,Rapel Valley,,,Cabernet Sauvignon,Santa Rita,83
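
One way to get the intended result with `.loc` is to write the stars into a separate Series, so the original `points` column is never overwritten and every mask is evaluated against the raw scores. This is a sketch on invented toy data that follows the rules stated in the task description; the names `wines` and `stars` are assumptions.

```python
import pandas as pd

# Toy data, invented for illustration.
wines = pd.DataFrame({
    "country": ["US", "Canada", "Chile", "France"],
    "points": [96.0, 82.0, 81.0, 90.0],
})

# Start everyone at 1 star, then upgrade in increasing order of priority.
stars = pd.Series(1, index=wines.index)
stars.loc[wines["points"] >= 85] = 2
stars.loc[wines["points"] >= 95] = 3
stars.loc[wines["country"] == "Canada"] = 3  # Canada rule applies regardless of points

print(stars)
```

Because each later assignment overrides the earlier ones, no upper bounds are needed on the masks.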


In [28]:
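# The first error I ran into: using a DataFrame in a boolean context raises this ValueError.
# The function below only reproduces the message.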
def __nonzero__(self):
    raise ValueError(f"The truth value of a {type(self).__name__} is ambiguous. "
                     "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
                    )
    
__nonzero__(x)

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
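
This error typically appears when a Series or DataFrame comparison is used directly in an `if` statement. The sketch below, on an invented three-element Series, triggers the error and then shows the usual remedy of reducing the comparison with `.any()` or `.all()`.

```python
import pandas as pd

# Toy scores, invented for illustration.
points = pd.Series([96.0, 88.0, 81.0])

message = ""
try:
    # The comparison yields a Series of booleans, which has no single truth value.
    if points > 90:
        print("unreachable")
except ValueError as err:
    message = str(err)
    print(message)

# Reduce the boolean Series to a single value before branching.
any_high = (points > 90).any()   # is at least one score above 90?
all_high = (points > 90).all()   # are all scores above 90?

print(any_high, all_high)
```

For row-by-row decisions, as in the star-rating task, boolean masks with `.loc` or a row-wise `apply()` avoid the problem entirely.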