### Introduction
Plucking the right data out of our data representation is critical to getting work done, as we demonstrated in the exercises.

This tutorial will cover different operations we can apply to our data to get the input "just right".

In [1]:
import pandas as pd

In [2]:
# Get data
wine_magazine = pd.read_csv('../data/wine_magazine/winemag-data_first150k.csv')
wine_magazine.shape

(144035, 10)

In [3]:
# Find rows whose 'points' value is not a valid number.
# Valid scores are at most three characters long, so flag every longer value.
# x collects row positions, which equal the labels of the default RangeIndex.
x = []
for i, j in enumerate(wine_magazine.points):
    if len(str(j)) > 3:
        x.append(i)
#         print(j)  # uncomment to print each malformed value
print(x)

[14178, 14997, 18800, 19076, 21217, 25151, 30794, 32346, 37674, 38366, 38534, 39099, 39165, 39218, 41212, 44310, 45544, 48624, 48756, 49404, 49747, 51487, 52109, 52294, 52295, 56726, 60731, 61027, 69136, 70922, 70983, 71272, 72141, 72658, 74487, 74535, 75755, 76667, 76938, 77089, 77389, 77746, 77959, 78578, 79433, 79985, 80024, 81560, 81829, 83671, 89674, 93332, 93367, 93943, 102634, 115316, 124294, 128757, 133431, 137107, 137962, 141477]
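The same kind of check can be written without a Python loop. The following is a minimal sketch, assuming a small made-up frame named `toy` in place of the real CSV. It relies on `pd.to_numeric(..., errors="coerce")`, which turns unparseable entries into NaN. Note that it tests whether a value parses as a number rather than how long it is, so it may not flag exactly the same rows as the loop above.

```python
import pandas as pd

# Toy stand-in for the wine data; the values are invented for illustration.
toy = pd.DataFrame({"points": ["87", "90", "California", "101", None]})

# Coerce to numbers; anything that cannot be parsed becomes NaN.
numeric_points = pd.to_numeric(toy["points"], errors="coerce")

# Rows that held a value but failed to parse are the malformed ones.
# Rows that were already missing are excluded by the notna() check.
bad_rows = toy.index[numeric_points.isna() & toy["points"].notna()].tolist()
print(bad_rows)
```

Because `bad_rows` holds index labels rather than positions, it can be passed straight to `drop(index=...)`.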


In [4]:
# Drop the malformed rows in a single call
wine_magazine = wine_magazine.drop(index=x)

In [5]:
# Drop rows with a missing (NaN) value in the 'points' column of the dataset.
wine_magazine = wine_magazine.dropna(subset=['points'])

In [6]:
# Change the datatype of one or more columns to numeric
wine_magazine[['points', 'price']] = wine_magazine[['points', 'price']].astype(float)

In [7]:
# Drop rows with more than 100 points
x = list(wine_magazine.loc[wine_magazine.points > 100].index)
wine_magazine = wine_magazine.drop(index=x)
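An alternative to dropping rows by label is to keep only the rows that pass a boolean mask. This is a minimal sketch on a made-up frame named `toy`, not the tutorial's data. One caveat: a comparison with NaN is always False, so this form would also discard rows with missing points.

```python
import pandas as pd

# Toy frame with a non-default index to show that labels are preserved.
toy = pd.DataFrame({"points": [87.0, 101.0, 95.0, 250.0]}, index=[10, 11, 12, 13])

# Keep only rows whose score is at most 100.
in_range = toy[toy["points"] <= 100]
print(in_range.index.tolist())
```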

### Summary functions

Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way.  
For example, consider the `describe()` method:

In [8]:
wine_magazine.points.describe()

count    143965.000000
mean         87.873462
std           3.216045
min          80.000000
25%          86.000000
50%          88.000000
75%          90.000000
max         100.000000
Name: points, dtype: float64

In [9]:
print(list(wine_magazine.columns))  # list the column names
wine_magazine.province.describe()  # describe() output for a text column

['country', 'designation', 'points', 'price', 'province', 'region_1', 'region_2', 'variety', 'winery', 'last_year_points']


count         143963
unique           451
top       California
freq           43765
Name: province, dtype: object
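The two outputs above differ because `describe()` adapts to the column type: numeric columns get count, mean, std and quartiles, while text columns get count, unique, top and freq. A minimal sketch on a made-up frame named `toy` shows both shapes side by side.

```python
import pandas as pd

# Invented values, used only to show the two shapes of describe() output.
toy = pd.DataFrame({
    "points": [80.0, 90.0, 100.0],
    "province": ["California", "Bordeaux", "California"],
})

numeric_summary = toy["points"].describe()   # count, mean, std, min, quartiles, max
text_summary = toy["province"].describe()    # count, unique, top, freq

print(list(numeric_summary.index))
print(list(text_summary.index))
```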

To get one particular summary statistic, such as the mean of the points allotted (that is, how well an averagely rated wine scores), we can use the `mean()` function.

In [10]:
wine_magazine.points.mean()

87.87346229986456

To see a list of unique values, we can use the `unique()` function.

In [11]:
len(wine_magazine.province.unique())

452

In [12]:
len(sorted(wine_magazine.winery.unique()))

14582
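Note that `len(wine_magazine.province.unique())` gives 452 while `describe()` reported 451 unique provinces. A likely explanation is that `unique()` includes NaN as a value, whereas `describe()` and `nunique()` ignore missing entries by default. The sketch below illustrates the difference on a made-up Series named `toy`.

```python
import numpy as np
import pandas as pd

# Invented values with one missing entry.
toy = pd.Series(["California", "Bordeaux", np.nan, "California"])

with_nan = len(toy.unique())                 # NaN counts as a distinct value
without_nan = toy.nunique()                  # missing values are ignored by default
with_nan_again = toy.nunique(dropna=False)   # opt in to counting NaN

print(with_nan, without_nan, with_nan_again)
```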

`value_counts()` returns a Series of the unique values and shows how often each one occurs in the dataset.

In [48]:
wine_magazine.price.value_counts()

20.0     7484
15.0     6760
18.0     5700
25.0     5694
30.0     5166
         ... 
266.0       1
172.0       1
151.0       1
580.0       1
243.0       1
Name: price, Length: 351, dtype: int64
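`value_counts()` also accepts a few useful parameters. As a minimal sketch on a made-up Series named `toy`, `normalize=True` returns shares instead of raw counts, and `dropna=False` includes missing values in the tally.

```python
import pandas as pd

# Invented prices with one missing entry.
toy = pd.Series([20.0, 20.0, 15.0, None])

counts = toy.value_counts()                    # raw counts, most frequent first
shares = toy.value_counts(normalize=True)      # fraction of non-missing rows
with_missing = toy.value_counts(dropna=False)  # NaN gets its own row

print(counts)
print(shares)
print(with_missing)
```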

In [49]:
# Iterate over the count of each price; for every count below 10,
# append the winery stored at the row whose label equals that count
winery = []
for i in list(wine_magazine.price.value_counts()):
    if i < 10:
        winery.append(wine_magazine.winery[i])

In [50]:
list(set(winery)) # remove duplicates (a set does not sort the values)

['Bergström',
 'Bodega Carmen Rodríguez',
 'Ponzi',
 'Numanthia',
 'Macauley',
 'Maurodos',
 'Domaine de la Bégude',
 'Blue Farm']
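The loop above indexes `wine_magazine.winery[i]` with a count `i`, so it returns the wineries stored at the rows labeled with those counts rather than the wineries whose price is rare. If the goal is the latter, one possible approach is sketched below on a made-up frame named `toy`. The names `threshold`, `rare_prices` and `rare_wineries` are hypothetical, and the threshold is set to 2 only to suit the toy data.

```python
import pandas as pd

# Invented wineries and prices, used only for illustration.
toy = pd.DataFrame({
    "winery": ["A", "B", "C", "D"],
    "price": [20.0, 20.0, 20.0, 500.0],
})

threshold = 2  # the cell above uses 10 on the full dataset

# Prices that occur fewer than `threshold` times.
price_counts = toy["price"].value_counts()
rare_prices = price_counts[price_counts < threshold].index

# Wineries that sell at least one bottle at a rare price.
rare_wineries = sorted(toy.loc[toy["price"].isin(rare_prices), "winery"].unique())
print(rare_wineries)
```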

In [52]:
list(wine_magazine['winery'].loc[wine_magazine.price > 1000]) # wineries with a bottle priced over $1000

['Blair',
 'Krug',
 'Château Latour',
 'Château Margaux',
 'Château La Mission Haut-Brion',
 'Château Mouton Rothschild',
 'Château Haut-Brion',
 'Krug',
 'Krug']

In [47]:
# Sort the rows priced at $1000 or more by price
x = wine_magazine.loc[wine_magazine.price >= 1000]
x.sort_values(by=['price'])

| | country | designation | points | price | province | region_1 | region_2 | variety | winery | last_year_points |
|---|---|---|---|---|---|---|---|---|---|---|
| 34603 | France | | 94.0 | 1000.0 | Bordeaux | Pessac-Léognan | | Bordeaux-style White Blend | Château La Mission Haut-Brion | 87 |
| 34012 | France | | 97.0 | 1100.0 | Bordeaux | Pessac-Léognan | | Bordeaux-style Red Blend | Château La Mission Haut-Brion | 83 |
| 34027 | France | | 96.0 | 1200.0 | Bordeaux | Pessac-Léognan | | Bordeaux-style Red Blend | Château Haut-Brion | 80 |
| 34024 | France | | 96.0 | 1300.0 | Bordeaux | Pauillac | | Bordeaux-style Red Blend | Château Mouton Rothschild | 92 |
| 25520 | France | Clos du Mesnil | 100.0 | 1400.0 | Champagne | Champagne | | Chardonnay | Krug | 97 |
| 50406 | France | Clos du Mesnil | 100.0 | 1400.0 | Champagne | Champagne | | Chardonnay | Krug | 88 |
| 80484 | France | Clos du Mesnil | 100.0 | 1400.0 | Champagne | Champagne | | Chardonnay | Krug | 81 |
| 34007 | France | | 98.0 | 1900.0 | Bordeaux | Margaux | | Bordeaux-style Red Blend | Château Margaux | 88 |
| 13164 | US | Roger Rose Vineyard | 91.0 | 2013.0 | California | Arroyo Seco | Central Coast | Chardonnay | Blair | 81 |
| 34006 | France | | 99.0 | 2300.0 | Bordeaux | Pauillac | | Bordeaux-style Red Blend | Château Latour | 81 |
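`sort_values()` sorts in ascending order by default. To list the most expensive bottles first, we can pass `ascending=False`, or use `nlargest()` when only the top few rows are needed. The following is a minimal sketch on a made-up frame named `toy`.

```python
import pandas as pd

# Invented wineries and prices, used only for illustration.
toy = pd.DataFrame({
    "winery": ["A", "B", "C", "D"],
    "price": [1000.0, 2300.0, 15.0, 1400.0],
})

# Filter first, then sort from most to least expensive.
expensive = toy[toy["price"] >= 1000].sort_values(by="price", ascending=False)

# Shortcut for the two highest-priced rows.
top2 = toy.nlargest(2, "price")

print(expensive)
print(top2)
```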
