### Summary Functions and Maps

    The data does not always come out of memory in the format we want right off the bat. Sometimes we have to do some more work ourselves to reformat it for the task at hand.
    

### Summary Functions
    
    Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way. 
    For example:
      reviews.describe()
      
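     This method generates a high-level summary of the attributes of the given DataFrame or column. It is type-aware, meaning that its output changes based on the data type of the input: numeric data gets statistics such as the mean and quartiles, while string data gets the count, the number of unique values, the most frequent value and its frequency.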
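To make the type-aware behaviour concrete, here is a minimal sketch. It uses a small, made-up DataFrame named `toy` (an assumption for illustration, not the wine reviews dataset) with one numeric column and one string column.

```python
import pandas as pd

# Hypothetical stand-in data; the real `reviews` DataFrame is much larger.
toy = pd.DataFrame({
    "points": [87, 90, 92, 87],
    "taster_name": ["Roger Voss", "Paul Gregutt", "Roger Voss", "Roger Voss"],
})

# Numeric column: count, mean, std, min, quartiles, max
numeric_summary = toy.points.describe()

# String column: count, unique, top (most frequent value), freq
text_summary = toy.taster_name.describe()

print(numeric_summary)
print(text_summary)
```

The same method name produces two differently shaped summaries, so it is worth checking a column's dtype before relying on a particular field such as `mean` or `top`.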
     
     If you want to get some particular simple summary statistic about a column in a DataFrame or a Series, there is usually a helpful pandas function that makes it happen.

    For example, to see the mean of the points allotted (i.e. how well an averagely rated wine does), we can use the mean() method:
     
         reviews.points.mean()
     
    To see a list of unique values we can use the unique() function:
      
         reviews.taster_name.unique()
    
    To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method:
        
        reviews.taster_name.value_counts()
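One detail worth knowing when comparing unique() and value_counts() is how they treat missing values. The sketch below uses a small hypothetical Series named `countries` (not the real dataset) to show that unique() keeps NaN as an entry, while value_counts() drops it unless `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value
countries = pd.Series(["US", "Italy", "US", np.nan, "France", "US"])

unique_values = countries.unique()                       # NaN appears as its own entry
counts = countries.value_counts()                        # NaN is excluded by default
counts_with_nan = countries.value_counts(dropna=False)   # NaN is counted too

print(unique_values)
print(counts)
print(counts_with_nan)
```

This is why an array returned by unique() can contain `nan` even though no such row shows up in the default value_counts() output.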

### Exercise 
    

In [1]:
import pandas as pd
reviews = pd.read_csv("Dataset/winemag-data-130k-v2.csv", index_col=0)

In [2]:
reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [4]:
reviews.describe()  # By default, describe() on a DataFrame summarizes only the numeric columns.

Unnamed: 0,points,price
count,129971.0,120975.0
mean,88.447138,35.363389
std,3.03973,41.022218
min,80.0,4.0
25%,86.0,17.0
50%,88.0,25.0
75%,91.0,42.0
max,100.0,3300.0


In [5]:
# Find the median of the points column in the reviews DataFrame.

median_points = reviews.points.median()

median_points

88.0

In [6]:
# Find which countries are represented in the dataset. (The result doesn't include any duplicates.)

countries = reviews.country.unique()

countries

array(['Italy', 'Portugal', 'US', 'Spain', 'France', 'Germany',
       'Argentina', 'Chile', 'Australia', 'Austria', 'South Africa',
       'New Zealand', 'Israel', 'Hungary', 'Greece', 'Romania', 'Mexico',
       'Canada', nan, 'Turkey', 'Czech Republic', 'Slovenia',
       'Luxembourg', 'Croatia', 'Georgia', 'Uruguay', 'England',
       'Lebanon', 'Serbia', 'Brazil', 'Moldova', 'Morocco', 'Peru',
       'India', 'Bulgaria', 'Cyprus', 'Armenia', 'Switzerland',
       'Bosnia and Herzegovina', 'Ukraine', 'Slovakia', 'Macedonia',
       'China', 'Egypt'], dtype=object)

In [7]:
# Find how many times each country appears in the DataFrame.
# Create a Series reviews_per_country mapping countries to the count of reviews of wines from that country.

reviews_per_country = reviews.country.value_counts()

reviews_per_country

US                        54504
France                    22093
Italy                     19540
Spain                      6645
Portugal                   5691
Chile                      4472
Argentina                  3800
Austria                    3345
Australia                  2329
Germany                    2165
New Zealand                1419
South Africa               1401
Israel                      505
Greece                      466
Canada                      257
Hungary                     146
Bulgaria                    141
Romania                     120
Uruguay                     109
Turkey                       90
Slovenia                     87
Georgia                      86
England                      74
Croatia                      73
Mexico                       70
Moldova                      59
Brazil                       52
Lebanon                      35
Morocco                      28
Peru                         16
Ukraine                      14
Serbia  

In [8]:
# Create a variable centered_price containing a version of the price column with the mean price subtracted.

centered_price = reviews.price - reviews.price.mean()

centered_price

0               NaN
1        -20.363389
2        -21.363389
3        -22.363389
4         29.636611
            ...    
129966    -7.363389
129967    39.636611
129968    -5.363389
129969    -3.363389
129970   -14.363389
Name: price, Length: 129971, dtype: float64

In [9]:
# Which wine is the "best bargain"? 
# Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.

bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']

bargain_wine

'Bandit NV Merlot (California)'

### Maps
    Map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often need to create new representations from existing data, or to transform data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

    There are two mapping methods that you will use often.
    
    1) map() 
    2) apply()

    map() is the first, and slightly simpler one. For example, suppose that we wanted to remean the scores the wines received to 0. We can do this as follows:
    
        review_points_mean = reviews.points.mean()
        reviews.points.map(lambda p: p - review_points_mean)
    
    The function you pass to map() should expect a single value from the Series (a point value, in the above example), and return a transformed version of that value. map() returns a new Series where all the values have been transformed by your function.
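As a minimal sketch of the behaviour described above, the following uses a three-element hypothetical Series named `points` instead of the full reviews data, so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical scores; their mean is 90.0
points = pd.Series([87, 90, 93], name="points")
points_mean = points.mean()

# The lambda receives one value at a time and returns its transformed version
centered = points.map(lambda p: p - points_mean)

print(centered.tolist())  # [-3.0, 0.0, 3.0]
```

The result has the same index and length as the input; only the values differ.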

    apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom function on each row.
    
        def remean_points(row):
            row.points = row.points - review_points_mean
            return row

        reviews.apply(remean_points, axis='columns')
    
    If we had called reviews.apply() with axis='index', then instead of passing a function to transform each row, we would need to give a function to transform each column.
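The difference between the two axis values can be sketched on a small, hypothetical DataFrame named `toy` (an assumption for illustration). With axis='columns' the function receives one row at a time; with axis='index' it receives one column at a time.

```python
import pandas as pd

# Hypothetical two-column frame standing in for reviews
toy = pd.DataFrame({"points": [87, 90, 93], "price": [15.0, 14.0, 65.0]})

# axis='columns': the function sees each row as a Series indexed by column name
ratio_per_row = toy.apply(lambda row: row.points / row.price, axis="columns")

# axis='index': the function sees each column as a Series indexed by row label
centered_columns = toy.apply(lambda col: col - col.mean(), axis="index")

print(ratio_per_row)
print(centered_columns)
```

Here the row-wise call collapses each row to a single number and so returns a Series, while the column-wise call returns a DataFrame of the same shape as the input.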

    Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of reviews, we can see that it still has its original points value.
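A quick way to confirm this is to inspect the original after the call. The sketch below uses a hypothetical two-row DataFrame named `toy`, and shows one possible way (assign(), chosen here only for illustration) to keep the transformed values if you do want them.

```python
import pandas as pd

# Hypothetical stand-in for reviews
toy = pd.DataFrame({"country": ["Italy", "Portugal"], "points": [87, 91]})
points_mean = toy.points.mean()  # 89.0

# map() returns a new Series; toy itself is left untouched
remeaned = toy.points.map(lambda p: p - points_mean)
print(toy.head(1))  # the first row still shows points == 87

# To keep the result, store it explicitly, for example in a new DataFrame
updated = toy.assign(points=remeaned)
print(updated.head(1))
```

Note that assign() also returns a new DataFrame, so `toy` still holds the original scores after the last line.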
    
    Pandas provides many common mapping operations as built-ins. For example, here's a faster way of remeaning our points column:
    
        review_points_mean = reviews.points.mean()
        reviews.points - review_points_mean
    
    In this code we are performing an operation between a lot of values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean value). Pandas looks at this expression and figures out that we must mean to subtract that mean value from every value in the dataset.

    Pandas will also understand what to do if we perform these operations between Series of equal length. For example, an easy way of combining country and region information in the dataset would be to do the following:
    
        reviews.country + " - " + reviews.region_1
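One thing to watch for with this kind of element-wise concatenation is missing data: if either side is missing for a row, the combined value for that row is missing too. The sketch below uses a hypothetical three-row frame named `toy`, and the "Unknown" placeholder is an assumption chosen only for illustration.

```python
import pandas as pd

# Hypothetical rows; the second one has no region
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "region_1": ["Etna", None, "Willamette Valley"],
})

# Missing region -> missing combined value for that row
combined = toy.country + " - " + toy.region_1

# Filling the gap first keeps every row as a string
filled = toy.country + " - " + toy.region_1.fillna("Unknown")

print(combined)
print(filled)
```

Whether to fill, drop, or keep the missing rows depends on what the combined column is used for later.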
    
    These operators are faster than map() or apply() because they use speedups built into pandas. All of the standard Python operators (>, <, ==, and so on) work in this manner.

    However, they are not as flexible as map() or apply(), which can do more advanced things, like applying conditional logic, which cannot be done with addition and subtraction alone.
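As a small sketch of the kind of conditional logic meant here, the function below sorts scores into three labels. The Series `points`, the function name `tier`, the thresholds and the labels are all hypothetical, chosen only to illustrate branching inside a mapped function.

```python
import pandas as pd

# Hypothetical scores
points = pd.Series([82, 88, 96])

def tier(p):
    """Return a label for a single score using ordinary if/elif logic."""
    if p >= 95:
        return "high"
    elif p >= 85:
        return "mid"
    return "low"

tiers = points.map(tier)
print(tiers.tolist())  # ['low', 'mid', 'high']
```

Because the branches are checked in order, the second condition only needs the lower bound: any score reaching it is already known to be below 95.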
    
### Exercise:

In [10]:
# There are only so many words you can use when describing a bottle of wine. 
# Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many 
# times each of these two words appears in the description column in the dataset. (For simplicity, let's ignore the 
# capitalized versions of these words.)

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

descriptor_counts

tropical    3607
fruity      9090
dtype: int64

In [12]:
# We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to 
# understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of
# at least 85 but less than 95 is 2 stars. Any other score is 1 star.

# Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically 
# get 3 stars, regardless of points.

# Create a series star_ratings with the number of stars corresponding to each review in the dataset.

def star_rating(row):
    if row.points >= 95 or row.country == 'Canada':
        return 3
    elif row.points >= 85:  # scores of 95+ were already handled above
        return 2
    else:
        return 1

star_ratings = reviews.apply(star_rating, axis='columns')

star_ratings

0         2
1         2
2         2
3         2
4         2
         ..
129966    2
129967    2
129968    2
129969    2
129970    2
Length: 129971, dtype: int64