In [1]:
import pandas as pd
reviews = pd.read_csv("./winemag-data.csv", index_col=0)
print(reviews.head())

    country                                        description  \
0     Italy  Aromas include tropical fruit, broom, brimston...   
1  Portugal  This is ripe and fruity, a wine that is smooth...   
2        US  Tart and snappy, the flavors of lime flesh and...   
3        US  Pineapple rind, lemon pith and orange blossom ...   
4        US  Much like the regular bottling from 2012, this...   

                          designation  points  price           province  \
0                        Vulkà Bianco      87    NaN  Sicily & Sardinia   
1                            Avidagos      87   15.0              Douro   
2                                 NaN      87   14.0             Oregon   
3                Reserve Late Harvest      87   13.0           Michigan   
4  Vintner's Reserve Wild Child Block      87   65.0             Oregon   

              region_1           region_2         taster_name  \
0                 Etna                NaN       Kerin O’Keefe   
1                  NaN

# Exercises

## 1.

What is the median of the `points` column in the `reviews` DataFrame?

In [2]:
median_points = reviews.points.median()
print(median_points)

88.0


## 2. 
What countries are represented in the dataset? (No duplicates)

In [3]:
countries = reviews.country.unique()
print(countries)

['Italy' 'Portugal' 'US' 'Spain' 'France' 'Germany' 'Argentina' 'Chile'
 'Australia' 'Austria' 'South Africa' 'New Zealand' 'Israel' 'Hungary'
 'Greece' 'Romania' 'Mexico' 'Canada' nan 'Turkey']


## 3.
How often does each country appear in the dataset? Create a Series `reviews_per_country` mapping countries to the count of reviews of wines from that country.

In [4]:
reviews_per_country = reviews.country.value_counts()
print(reviews_per_country)

country
US              774
France          347
Italy           339
Spain           101
Portugal         84
Chile            77
Argentina        70
Australia        54
Austria          46
Germany          34
South Africa     27
New Zealand      19
Israel            9
Greece            6
Hungary           4
Romania           3
Mexico            2
Canada            1
Turkey            1
Name: count, dtype: int64
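
Note that the `unique()` output in exercise 2 contains `nan`, while the counts above have no row for it: `value_counts()` drops missing values by default. A minimal sketch on a toy Series (the values are made up for illustration, not taken from the dataset) shows the difference when `dropna=False` is passed:

```python
import pandas as pd

# Toy country column with one missing entry (hypothetical data).
toy_countries = pd.Series(["US", "Italy", None, "US"])

# Default behaviour: the missing entry is excluded from the counts.
counts_default = toy_countries.value_counts()

# dropna=False keeps a separate row for the missing entry.
counts_with_missing = toy_countries.value_counts(dropna=False)

print(counts_default)
print(counts_with_missing)
```

Whether the missing country should be counted depends on what the counts are used for; the exercise itself only asks for the default.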


## 4.
Create a variable `centered_price` containing a version of the `price` column with the mean price subtracted.

This 'centering' transformation is a common preprocessing step before applying various machine learning algorithms.

In [5]:
centered_price = reviews.price - reviews.price.mean()
print(centered_price)

0             NaN
1      -24.290648
2      -25.290648
3      -26.290648
4       25.709352
          ...    
1994   -21.290648
1995   -19.290648
1996   -14.290648
1997   -28.290648
1998   -24.290648
Name: price, Length: 1999, dtype: float64
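
As a quick sanity check on the idea of centering, the sketch below uses a small toy Series (hypothetical prices, not the wine data) to show two things: the centered values average to zero, and a missing price stays missing because `mean()` skips `NaN` by default.

```python
import pandas as pd

# Toy prices with one missing value (hypothetical data).
toy_prices = pd.Series([15.0, 14.0, None, 13.0, 65.0])

# mean() ignores NaN, so the mean is taken over the four known prices.
toy_centered = toy_prices - toy_prices.mean()

# After centering, the mean of the known values is (numerically) zero.
toy_centered_mean = toy_centered.mean()

print(toy_centered)
print(toy_centered_mean)
```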


## 5.
For an economical wine buyer, which wine is the "best bargain"? Create a variable `bargain_wine` with the title of the wine with the highest points-to-price ratio in the dataset.

In [6]:
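# idxmax() returns an index label, so look the title up by label with .loc rather than by position with .iloc
bargain_wine = reviews.loc[(reviews.points / reviews.price).idxmax(), 'title']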
print(bargain_wine)

Felix Solis 2013 Flirty Bird Syrah (Vino de la Tierra de Castilla)
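
One detail worth keeping in mind here: `idxmax()` returns an index *label*, not a position, so label-based lookup with `.loc` is the matching accessor. With a default 0..n-1 index the two coincide, but they diverge as soon as the index is anything else. A minimal sketch on a toy DataFrame (titles, points, prices and index values are all made up):

```python
import pandas as pd

# Toy frame with a non-default index (hypothetical data).
toy_wines = pd.DataFrame(
    {
        "title": ["Wine A", "Wine B", "Wine C"],
        "points": [90, 86, 88],
        "price": [45.0, 4.0, 20.0],
    },
    index=[10, 20, 30],
)

ratio = toy_wines.points / toy_wines.price

# idxmax() gives the label of the row with the largest ratio.
best_label = ratio.idxmax()

# .loc looks the row up by that label; .iloc[best_label] would be out of bounds here.
toy_bargain = toy_wines.loc[best_label, "title"]

print(best_label, toy_bargain)
```

If several wines tie for the best ratio, `idxmax()` returns only the first of them.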


## 6.
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series `descriptor_counts` counting how many times each of these two words appears in the `description` column in the dataset. (For simplicity, let's ignore the capitalized versions of these words.)

In [7]:
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()

descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
print(descriptor_counts)

tropical     44
fruity      148
dtype: int64
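
The `in` test above counts how many descriptions contain each word, not how many times the word occurs in total. The sketch below, on three made-up descriptions, shows a vectorized equivalent using `Series.str.contains` and contrasts it with `Series.str.count`, which tallies every occurrence. Both are case-sensitive, in line with the exercise's instruction to ignore capitalized versions.

```python
import pandas as pd

# Toy descriptions (hypothetical text, not from the dataset).
toy_desc = pd.Series([
    "Aromas include tropical fruit and broom.",
    "This is ripe and fruity, a wine that is smooth.",
    "tropical notes with a fruity, tropical finish.",
])

words = ["tropical", "fruity"]

# Number of descriptions containing each word (same logic as the `in` test).
containing = pd.Series(
    {w: toy_desc.str.contains(w, regex=False).sum() for w in words}
)

# Total number of occurrences of each word across all descriptions.
occurrences = pd.Series({w: toy_desc.str.count(w).sum() for w in words})

print(containing)
print(occurrences)
```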


## 7.
We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand, so we'd like to translate the scores into simple star ratings. A score of 95 or higher counts as 3 stars; a score of at least 85 but less than 95 counts as 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a Series `star_ratings` with the number of stars corresponding to each review in the dataset.

In [8]:
def star(row):
    # Canadian wines get 3 stars regardless of points.
    if row.country == "Canada":
        return 3

    if row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = reviews.apply(star, axis="columns")
print(star_ratings)

0       2
1       2
2       2
3       2
4       2
       ..
1994    1
1995    1
1996    1
1997    1
1998    1
Length: 1999, dtype: int64
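
Row-wise `apply` is easy to read but calls a Python function once per row. As one possible alternative (a sketch, not the exercise's intended solution), the same rules can be written as vectorized conditions with `numpy.select`, which picks the first matching condition for each row. The toy DataFrame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy reviews covering each rule (hypothetical data).
toy_reviews = pd.DataFrame({
    "country": ["Canada", "US", "Italy", "France"],
    "points": [82, 96, 85, 84],
})

# Conditions are checked in order; the first one that is True wins.
conditions = [
    (toy_reviews.country == "Canada") | (toy_reviews.points >= 95),  # 3 stars
    toy_reviews.points >= 85,                                         # 2 stars
]
choices = [3, 2]

# Rows matching no condition fall back to 1 star.
toy_stars = pd.Series(
    np.select(conditions, choices, default=1), index=toy_reviews.index
)

print(toy_stars)
```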
