In [2]:
import pandas as pd
pd.set_option('display.max_rows', 5)
import numpy as np
reviews = pd.read_csv("C:/Github/Pandas/pandas/winemag-data_first150k.csv", index_col = 0)

In [2]:
reviews

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
...,...,...,...,...,...,...,...,...,...,...
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder


# Summary Functions

In [3]:
reviews.points.describe()

count    150930.000000
mean         87.888418
             ...      
75%          90.000000
max         100.000000
Name: points, Length: 8, dtype: float64

The output above only makes sense for numerical data. For string data, describe() instead reports the count, the number of unique values, the most common value, and how often it occurs:

In [5]:
reviews.province.describe()

count         150925
unique           455
top       California
freq           44508
Name: province, dtype: object
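As a small self-contained illustration of how describe() adapts to the column type, the sketch below uses a tiny made-up DataFrame. The name `toy` and its values are assumptions for demonstration only, not rows from the wine dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews
toy = pd.DataFrame({
    "points": [90, 85, 88, 90],
    "province": ["California", "Toro", "California", "Champagne"],
})

# Numeric column: count, mean, std, min, quartiles, max
numeric_summary = toy.points.describe()

# String column: count, unique, top (most common value), freq
text_summary = toy.province.describe()

print(numeric_summary)
print(text_summary)
```

Individual statistics can be read from the result by label, for example `numeric_summary["mean"]` or `text_summary["top"]`.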

To see the mean of the points allotted (e.g. how well an averagely rated wine does), we can use the mean() function:

In [6]:
reviews.points.mean()

87.8884184721394

To see a list of unique values we can use the unique() function:

In [7]:
reviews.variety.unique()

array(['Cabernet Sauvignon', 'Tinta de Toro', 'Sauvignon Blanc',
       'Pinot Noir', 'Provence red blend', 'Friulano', 'Tannat',
       'Chardonnay', 'Tempranillo', 'Malbec', 'Rosé', 'Tempranillo Blend',
       'Syrah', 'Mavrud', 'Sangiovese', 'Sparkling Blend',
       'Rhône-style White Blend', 'Red Blend', 'Mencía', 'Palomino',
       'Petite Sirah', 'Riesling', 'Cabernet Sauvignon-Syrah',
       'Portuguese Red', 'Nebbiolo', 'Pinot Gris', 'Meritage', 'Baga',
       'Glera', 'Malbec-Merlot', 'Merlot-Malbec', 'Ugni Blanc-Colombard',
       'Viognier', 'Cabernet Sauvignon-Cabernet Franc', 'Moscato',
       'Pinot Grigio', 'Cabernet Franc', 'White Blend', 'Monastrell',
       'Gamay', 'Zinfandel', 'Greco', 'Barbera', 'Grenache',
       'Rhône-style Red Blend', 'Albariño', 'Malvasia Bianca',
       'Assyrtiko', 'Malagouzia', 'Carmenère', 'Bordeaux-style Red Blend',
       'Touriga Nacional', 'Agiorgitiko', 'Picpoul', 'Godello',
       'Gewürztraminer', 'Merlot', 'Syrah-Grenache', 'G-S-M

To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method:

In [8]:
reviews.variety.value_counts()

variety
Chardonnay        14482
Pinot Noir        14291
                  ...  
Syrah-Carignan        1
Carnelian             1
Name: count, Length: 632, dtype: int64
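To make the difference between unique() and value_counts() concrete, here is a minimal sketch on a made-up Series (the `varieties` values are assumptions, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy Series of grape varieties
varieties = pd.Series(
    ["Chardonnay", "Pinot Noir", "Chardonnay", "Merlot", "Chardonnay"]
)

# Distinct values, in order of first appearance
distinct = varieties.unique()

# Occurrences of each value, most frequent first
counts = varieties.value_counts()

print(distinct)
print(counts)
```

unique() answers "which values exist?", while value_counts() also answers "how often?".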

# Maps

A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later.

There are two mapping methods that you will use often.

map() is the first, and slightly simpler one. For example, suppose that we wanted to remean the scores the wines received to 0, a common statistical technique called mean centering. The next two cells do this step by step.

(Note: this centering transformation is a common preprocessing step before applying various machine learning algorithms.)

In [9]:
review_points_mean = reviews.points.mean()

This calculates the average (mean) of all wine review scores in the points column and stores the result in review_points_mean.

In [10]:
reviews.points.map(lambda p: p - review_points_mean)

0         8.111582
1         8.111582
            ...   
150928    2.111582
150929    2.111582
Name: points, Length: 150930, dtype: float64

map() applies a function to every element in a Series.
Here, it uses a lambda function (a small anonymous function) that takes each score p and subtracts the mean from it.
The result is a Series of scores centered around 0: positive numbers for above-average scores and negative numbers for below-average scores.

For example, if:

The mean score was 88
Original scores were [90, 85, 88]
After mapping: [2, -3, 0]
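The small example above can be reproduced directly. This is a sketch only: the mean of 88 is taken from the example as given rather than computed from the three scores, and the names `scores` and `assumed_mean` are illustrative.

```python
import pandas as pd

scores = pd.Series([90, 85, 88])
assumed_mean = 88  # value stated in the example, not computed from `scores`

# Subtract the assumed mean from every element
centered = scores.map(lambda p: p - assumed_mean)

print(centered.tolist())  # → [2, -3, 0]
```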

##### apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.

The cell below achieves the same result as the map() example above, but apply() is more flexible because:

It can access multiple columns at once
It can perform more complex operations
It can modify multiple values in the row
It can include more sophisticated logic

In [11]:
def remean_points(row):
  row.points = row.points - review_points_mean
  return row

reviews.apply(remean_points, axis='columns')

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,8.111582,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,8.111582,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
...,...,...,...,...,...,...,...,...,...,...
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,2.111582,52.0,Champagne,Champagne,,Champagne Blend,Gosset
150929,Italy,More Pinot Grigios should taste like this. A r...,,2.111582,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder


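Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of reviews, we can see that it still has its original points value.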

In [12]:
reviews.head(1)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
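Two of the points above, that a row-wise function can read several columns at once and that apply() leaves the original DataFrame untouched, can be sketched on made-up data. The `toy` frame and the `points_per_dollar` helper are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "country": ["US", "Spain"],
    "points": [96, 90],
    "price": [235.0, 110.0],
})

def points_per_dollar(row):
    # A row-wise function can combine several columns of the same row
    return row.points / row.price

# axis="columns" passes each row to the function
ratio = toy.apply(points_per_dollar, axis="columns")

print(ratio)
print(toy.columns.tolist())  # the original DataFrame gained no new column
```

To keep the result, it has to be assigned somewhere explicitly, for example to a new variable as done here.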


Here's a faster way of remeaning our points column:

In [13]:
review_points_mean = reviews.points.mean()
reviews.points - review_points_mean

0         8.111582
1         8.111582
            ...   
150928    2.111582
150929    2.111582
Name: points, Length: 150930, dtype: float64

An easy way of combining country and region information in the dataset would be to do the following:

In [14]:
reviews.country + " - " + reviews.region_1

0           US - Napa Valley
1               Spain - Toro
                 ...        
150928    France - Champagne
150929    Italy - Alto Adige
Length: 150930, dtype: object

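These operators are faster than map() or apply() because they use vectorized operations built into pandas, which work on the whole column at once instead of calling a Python function for every element. All of the standard Python operators (>, <, ==, and so on) work in this manner.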
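A minimal sketch of these operators on made-up data follows. The `toy` frame is an assumption for illustration; it includes one missing region to show that string concatenation with a missing value yields a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing region
toy = pd.DataFrame({
    "country": ["US", "Spain", "France"],
    "region_1": ["Napa Valley", "Toro", np.nan],
    "points": [96, 90, 85],
})

mean_points = toy.points.mean()

# Arithmetic operator and map() give the same centered values
centered_op = toy.points - mean_points
centered_map = toy.points.map(lambda p: p - mean_points)

# Comparison operators also work element-wise and return booleans
is_high = toy.points >= 90

# String concatenation; a missing region makes that row's result missing
label = toy.country + " - " + toy.region_1

print(is_high.tolist())
print(label)
```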

**I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.**

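idxmax() returns the index label (here, the row number) of the row where this ratio is highest. This dataset has no title column, so the winery name is looked up instead.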

In [18]:
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'winery']

In [19]:
bargain_wine

'Bandit'
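The idxmax() and loc combination can be sketched on made-up data. The wineries, prices, and the non-default index labels below are assumptions chosen to show that idxmax() returns a label rather than a position.

```python
import pandas as pd

# Hypothetical toy data with index labels 10, 20, 30
toy = pd.DataFrame(
    {
        "winery": ["A", "B", "C"],
        "points": [96, 86, 90],
        "price": [235.0, 4.0, 15.0],
    },
    index=[10, 20, 30],
)

ratio = toy.points / toy.price

# Index label of the highest ratio (a label, not a position)
best_label = ratio.idxmax()

# Look up the winery at that label
best_winery = toy.loc[best_label, "winery"]

print(best_label, best_winery)  # → 20 B
```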

**Number of descriptions that mention "fig" and "blackberry"**

In [23]:
n_fig = reviews.description.map(lambda desc: "fig" in desc).sum()
n_blackb = reviews.description.map(lambda desc: "blackberry" in desc).sum()
description_counts = pd.Series([n_fig, n_blackb], index=['fig', 'blackberry'])
description_counts

fig            1951
blackberry    14518
dtype: int64
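As a possible alternative to map() with a lambda, the same kind of count can be sketched with the vectorized string method str.contains(). The `descriptions` Series below is made up for illustration. Note that both approaches count descriptions containing the substring, so each description counts at most once and "figs" would also match "fig".

```python
import pandas as pd

# Hypothetical toy descriptions
descriptions = pd.Series([
    "Ripe aromas of fig, blackberry and cassis",
    "Scents of peaches and fig",
    "Bright blackberry fruit",
    "Crisp and clean",
])

# map() with a lambda, as in the cell above
n_fig_map = descriptions.map(lambda desc: "fig" in desc).sum()

# Vectorized equivalent; regex=False treats the pattern as a plain substring
n_fig_str = descriptions.str.contains("fig", regex=False).sum()
n_blackberry = descriptions.str.contains("blackberry", regex=False).sum()

print(n_fig_map, n_fig_str, n_blackberry)  # → 2 2 2
```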

**95 and above is 5 stars, 85 to 94 is 4 stars, 65 to 84 is 3 stars, below 65 is 1 star, and any wine from France gets 5 stars.**

In [25]:
def stars(row):
  if row.country == 'France':
    return 5
  elif row.points >= 95:
    return 5
  elif row.points >= 85:
    return 4
  elif row.points >= 65:
    return 3
  else:
    return 1

# Build a Series of star ratings by applying stars() to every row
star_ratings = reviews.apply(stars, axis='columns')
star_ratings

0         5
1         5
         ..
150928    5
150929    4
Length: 150930, dtype: int64
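Row-wise apply() is easy to read, but conditional logic like this can also be expressed in vectorized form. The sketch below is one possible alternative using numpy.select(), which picks the choice for the first condition that is true in each row. The `toy` frame is made up, and this is not necessarily how the exercise is meant to be solved.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering every branch of the rule
toy = pd.DataFrame({
    "country": ["France", "US", "Italy", "Spain", "US"],
    "points": [80, 96, 90, 70, 60],
})

def stars(row):
    # Same rule as the cell above
    if row.country == "France":
        return 5
    elif row.points >= 95:
        return 5
    elif row.points >= 85:
        return 4
    elif row.points >= 65:
        return 3
    else:
        return 1

# Conditions are checked in order; the first true one wins
conditions = [
    toy.country == "France",
    toy.points >= 95,
    toy.points >= 85,
    toy.points >= 65,
]
choices = [5, 5, 4, 3]

star_ratings_vec = pd.Series(
    np.select(conditions, choices, default=1), index=toy.index
)
star_ratings_apply = toy.apply(stars, axis="columns")

print(star_ratings_vec.tolist())  # → [5, 5, 4, 3, 1]
```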