# Wine Points Prediction

In [9]:
%load_ext autoreload
%autoreload 2
import sys; sys.path.append('../')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Here we will try to predict the points a wine will get based on known characteristics (i.e. features, in ML terminology). The main point of this stage is to establish a simple, ideally very cost-effective, baseline to which models with increased complexity and resource demands will be compared.

In the real world there is a tradeoff between complexity and performance, and part of the data scientist's job is to present a tradeoff table showing what performance is achievable at each complexity level. Complexity should then be translated into cost. For example:
 * Compute cost 
 * Maintenance cost
 * Serving costs (i.e. is new platform needed?) 
 

## Loading the data

In [10]:
import pandas as pd
import cufflinks as cf; cf.go_offline()

In [11]:
wine_reviews = pd.read_csv("data/winemag-data-130k-v2.csv")
wine_reviews.shape

(129971, 14)

In [12]:
wine_reviews.sample(5)

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
29131,29131,Germany,"Vibrant yellow peach, apricot and blossom note...",1735,89,20.0,Mosel,,,Anna Lee C. Iijima,,Villa Huesgen 2014 1735 Riesling (Mosel),Riesling,Villa Huesgen
42463,42463,Portugal,"This wine comes from Moscatel Graúdo, the most...",Quinta da Arrancosa,88,13.0,Tejo,,,Roger Voss,@vossroger,Casal do Conde 2012 Quinta da Arrancosa Moscat...,Moscatel Graúdo,Casal do Conde
123015,123015,US,"This Pinot delivers aromas of rose petals, red...",,90,25.0,California,Santa Lucia Highlands,Central Coast,Matt Kettmann,@mattkettmann,Stephen Ross 2013 Pinot Noir (Santa Lucia High...,Pinot Noir,Stephen Ross
19213,19213,Italy,This is a nicely shaped Chianti Classico with ...,Tenuta Santedame,86,,Tuscany,Chianti Classico,,,,Ruffino 2006 Tenuta Santedame (Chianti Classico),Sangiovese,Ruffino
38779,38779,Italy,This opens with aromas that recall tilled soil...,Rio Sordo,89,38.0,Piedmont,Barbaresco,,Kerin O’Keefe,@kerinokeefe,Musso 2012 Rio Sordo (Barbaresco),Nebbiolo,Musso


## Points prediction

Points is a numeric target (integer scores with a natural order), so we treat this as a regression problem rather than a classification problem. Regression models can be evaluated with several metrics:

* MSE - [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
* R2 - [R squared (coefficient of determination)](https://en.wikipedia.org/wiki/Coefficient_of_determination)
* MAE - [Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error)

Read more [here](https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b)
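
To make the three metrics concrete, here is a minimal sketch that computes them by hand with numpy. The scores are made-up toy values, not taken from the wine data, and the variable names are only illustrative.

```python
import numpy as np

# Toy values, made up for illustration only (not taken from the wine data)
y_true = np.array([88.0, 90.0, 85.0, 92.0])
y_pred = np.array([87.0, 91.0, 86.0, 90.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)       # mean squared error: penalises large misses more
mae = np.mean(np.abs(errors))    # mean absolute error: average miss, in points

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2 = 1.0 - ss_res / ss_tot       # share of the target's variance that is explained

print(f"MSE={mse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

MSE and MAE are in (squared) points, so lower is better; R2 is unitless, with 1 meaning a perfect fit and 0 meaning no better than predicting the mean.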

### Train and test set split

To properly report results, let's split to train and test datasets:

In [13]:
train_data = wine_reviews.sample(frac = 0.8)
test_data = wine_reviews[~wine_reviews.index.isin(train_data.index)]
assert(len(train_data) + len(test_data) == len(wine_reviews))

In [14]:
len(test_data), len(train_data)

(25994, 103977)
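
Note that the split above draws a fresh random sample on every run, so the metrics reported below can shift slightly between runs. A minimal sketch of a repeatable variant, assuming a fixed seed is acceptable; `split_train_test` and the seed value are hypothetical, and the toy frame stands in for `wine_reviews`:

```python
import pandas as pd

def split_train_test(df, train_frac=0.8, seed=42):
    """Return (train, test) with a fixed seed so the split is repeatable."""
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)   # everything not sampled into train
    return train, test

# Tiny stand-in frame; in the notebook this would be `wine_reviews`
toy = pd.DataFrame({"points": range(80, 100)})
train, test = split_train_test(toy)
print(len(train), len(test))
```

This relies on the frame having a unique index, which holds for a freshly loaded CSV.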

### Baselines

In [15]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [16]:
def calc_prediction_quality(df, pred_score_col, true_score_col):
    return pd.Series({'MSE': mean_squared_error(df[true_score_col], df[pred_score_col]),
                      'MAE': mean_absolute_error(df[true_score_col], df[pred_score_col]),
                      'R2': r2_score(df[true_score_col], df[pred_score_col])})

#### Baseline 1

The most basic baseline predicts the training-set average points for every wine. The implementation is as simple as:

In [17]:
test_data['basiline_1_predicted_points'] = train_data.points.mean()
b1_stats = calc_prediction_quality(test_data, 'basiline_1_predicted_points', 'points')
b1_stats

MSE    9.388859e+00
MAE    2.503101e+00
R2    -3.355529e-07
dtype: float64
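
An R2 of essentially zero is what we should expect here: a constant prediction explains none of the variance in the target. It comes out very slightly negative (-3.36e-07) because the constant is the train mean, which is not exactly the test mean. A small sketch on made-up scores illustrates both cases; the helper `r2` is written out by hand for clarity:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up scores for illustration
y_test = np.array([86.0, 88.0, 90.0, 92.0])

# Predicting the test set's own mean gives exactly 0
r2_test_mean = r2(y_test, np.full_like(y_test, y_test.mean()))
# Predicting any other constant (e.g. a slightly different train mean) gives less than 0
r2_other_mean = r2(y_test, np.full_like(y_test, 88.5))

print(r2_test_mean, r2_other_mean)
```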

#### Baseline 2

We can probably improve by predicting the average score based on the origin country:

In [18]:
avg_points_by_country = train_data.groupby('country').points.mean()
avg_points_by_country.head()

country
Argentina                 86.752879
Armenia                   88.000000
Australia                 88.607427
Austria                   90.085908
Bosnia and Herzegovina    86.500000
Name: points, dtype: float64

In [19]:
test_data['basiline_2_predicted_points'] = test_data.country.map(avg_points_by_country).fillna(train_data.points.mean())
b2_stats = calc_prediction_quality(test_data, 'basiline_2_predicted_points', 'points')
b2_stats

MSE    8.883871
MAE    2.426056
R2     0.053786
dtype: float64

#### Baseline 3

Adding more breakdowns increases granularity but can result in overfitting, since averages over small groups are noisy. Still, let's try breaking down by country and province:

In [20]:
avg_points_by_country_and_region = train_data.groupby(['country','province']).points.mean().rename('basiline_3_predicted_points')
avg_points_by_country_and_region.head()

country    province        
Argentina  Mendoza Province    86.868752
           Other               86.026316
Armenia    Armenia             88.000000
Australia  Australia Other     85.615789
           New South Wales     87.743243
Name: basiline_3_predicted_points, dtype: float64

In [21]:
test_data_with_baseline_3 = test_data.merge(avg_points_by_country_and_region, on = ['country','province'], how='left')
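# Fall back to the country average, then the global average; both come from the merged frame so the indices align
test_data_with_baseline_3.basiline_3_predicted_points = test_data_with_baseline_3.basiline_3_predicted_points.fillna(test_data_with_baseline_3.basiline_2_predicted_points).fillna(test_data_with_baseline_3.basiline_1_predicted_points)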
test_data_with_baseline_3.shape, test_data.shape

((25994, 17), (25994, 16))

In [22]:
b3_stats = calc_prediction_quality(test_data_with_baseline_3, 'basiline_3_predicted_points', 'points')
b3_stats

MSE    8.392773
MAE    2.341455
R2     0.106092
dtype: float64
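
The three baselines share one pattern: take the train-set mean per group and fall back to a coarser mean when a group was never seen in training. A minimal sketch of that pattern as one helper, on invented toy data; `group_mean_baseline` and the `prediction` column are hypothetical names, not part of this notebook:

```python
import pandas as pd

def group_mean_baseline(train, test, keys, target="points"):
    """Predict the train-set mean of `target` per group; unseen groups get the global mean."""
    group_means = train.groupby(keys)[target].mean().rename("prediction").reset_index()
    merged = test.merge(group_means, on=keys, how="left")
    # Note: the result is indexed 0..n-1 (merge resets the index), in test row order
    return merged["prediction"].fillna(train[target].mean())

# Toy data, invented for illustration
train = pd.DataFrame({"country": ["US", "US", "Italy"], "points": [90, 92, 86]})
test = pd.DataFrame({"country": ["US", "Chile"], "points": [91, 88]})

preds = group_mean_baseline(train, test, keys=["country"])
print(preds.tolist())
```

Here `US` gets the mean of its two training rows, while `Chile`, which never appears in training, falls back to the global training mean.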

### Baselines summary

In [23]:
baseline_summary = pd.DataFrame([b1_stats, b2_stats, b3_stats], index=['baseline_1', 'baseline_2','baseline_3'])
baseline_summary

Unnamed: 0,MSE,MAE,R2
baseline_1,9.388859,2.503101,-3.355529e-07
baseline_2,8.883871,2.426056,0.05378556
baseline_3,8.392773,2.341455,0.106092


In [24]:
# Keep the index in the file: it holds the baseline names
baseline_summary.to_csv('data/baselines_summary.csv', index_label='model')

## Training a boosted trees regressor

In [25]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

#### Preparing data - Label encoding categorical features

In [26]:
categorical_features = ['country','province','region_1','region_2','taster_name','variety','winery']
numerical_features = ['price']
features = categorical_features + numerical_features

In [27]:
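# Encode each categorical column as integer codes; missing values become their own 'NA' category.
# `le` is refit for every column, so only the last column's mapping stays on the encoder object.
encoded_features = wine_reviews[categorical_features].apply(lambda col: le.fit_transform(col.fillna('NA')))
# Missing prices are marked with the sentinel value -1
encoded_features['price'] = wine_reviews.price.fillna(-1)
encoded_features['points'] = wine_reviews.points
encoded_features.head()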

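
Because the cell above reuses a single `LabelEncoder`, the code-to-label mapping of every column except the last is lost after fitting. If the mappings are needed later (for example to decode predictions or to encode new rows), one option is to keep one encoder per column. A minimal sketch on toy data; `encode_categoricals` and `missing_token` are hypothetical names:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categoricals(df, columns, missing_token="NA"):
    """Label-encode each column with its own encoder and return (encoded_df, encoders)."""
    encoders = {}
    encoded = pd.DataFrame(index=df.index)
    for col in columns:
        enc = LabelEncoder()
        encoded[col] = enc.fit_transform(df[col].fillna(missing_token))
        encoders[col] = enc   # kept so codes can be mapped back to labels
    return encoded, encoders

# Toy frame, invented for illustration
toy = pd.DataFrame({"country": ["US", None, "Italy"],
                    "variety": ["Riesling", "Nebbiolo", "Riesling"]})
encoded, encoders = encode_categoricals(toy, ["country", "variety"])
decoded_country = encoders["country"].inverse_transform(encoded["country"])
print(encoded)
print(decoded_country)
```

`LabelEncoder` assigns codes in sorted label order, so the integers carry no meaningful ordering. Tree models can usually work with that, but it is worth keeping in mind for models that treat inputs as magnitudes.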

#### Re-splitting to train and test

In [28]:
train_encoded_features = encoded_features[encoded_features.index.isin(train_data.index)]
test_encoded_features = encoded_features[encoded_features.index.isin(test_data.index)]
assert(len(train_encoded_features) + len(test_encoded_features) == len(wine_reviews))

#### Fitting a tree-regressor

In [29]:
from src.models import i_feel_lucky_xgboost_training

In [30]:
train_encoded_features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103977 entries, 0 to 129970
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   country      103977 non-null  int32  
 1   province     103977 non-null  int32  
 2   region_1     103977 non-null  int32  
 3   region_2     103977 non-null  int32  
 4   taster_name  103977 non-null  int32  
 5   variety      103977 non-null  int32  
 6   winery       103977 non-null  int32  
 7   price        103977 non-null  float64
 8   points       103977 non-null  int64  
dtypes: float64(1), int32(7), int64(1)
memory usage: 5.2 MB


In [32]:
#xgb_clf, clf_name = i_feel_lucky_xgboost_training(train_encoded_features, test_encoded_features, features, 'points', name='xgb_clf_points_prediction')

Let's look at the function output - specifically the **xgb_clf_points_prediction** column:

In [33]:
#test_encoded_features.head()

In [34]:
#xgb_stats = calc_prediction_quality(test_encoded_features, 'xgb_clf_points_prediction','points')
#xgb_stats

In [35]:
#all_compared = pd.DataFrame([b1_stats, b2_stats, b3_stats, xgb_stats], index=['baseline_1', 'baseline_2','baseline_3','regression_by_xgb'])
#all_compared

In [36]:
#all_compared.to_csv('data/all_models_compared.csv', index=False)
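
The XGBoost cells above are commented out, and `i_feel_lucky_xgboost_training` is a project helper whose internals are not shown in this notebook. As a rough stand-in (explicitly not the author's function and not XGBoost), the sketch below fits scikit-learn's `GradientBoostingRegressor` on synthetic data and compares it with a mean baseline, mirroring the comparison intended above. The data, the hyperparameters and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the wine data): one informative feature, one noise feature
X = rng.uniform(0, 10, size=(500, 2))
y = 80 + 2 * X[:, 0] + rng.normal(0, 0.5, size=500)
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# Small, fast model; hyperparameters are arbitrary illustrative choices
gbr = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
gbr.fit(X_train, y_train)

model_mse = mean_squared_error(y_test, gbr.predict(X_test))
# Same idea as baseline 1: predict the train mean for every test row
baseline_mse = mean_squared_error(y_test, np.full(len(y_test), y_train.mean()))
print(f"boosted trees MSE={model_mse:.2f}  mean-baseline MSE={baseline_mse:.2f}")
```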

## Neural network

In [37]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import accuracy_score

In [40]:
column_names = ['country','province','region_1','region_2','taster_name','variety','winery','price','points']

train_data = encoded_features.sample(frac=0.8)[column_names]
test_data = encoded_features[~encoded_features.index.isin(train_data.index)][column_names]
assert(len(test_data) + len(train_data) == len(encoded_features))

In [52]:
model = Sequential([
    Dense(5, activation='relu',name = 'layer1',input_dim=8),
    Dense(1, activation='linear',name = 'layer3')
])
# Dense(5, activation='relu',name = 'layer2'),

optimizer = Adam(learning_rate=0.03)
model.compile(loss='mean_squared_error', optimizer=optimizer)

In [53]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 layer1 (Dense)              (None, 5)                 45        
                                                                 
 layer3 (Dense)              (None, 1)                 6         
                                                                 
Total params: 51
Trainable params: 51
Non-trainable params: 0
_________________________________________________________________
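
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch of the arithmetic (`dense_param_count` is a hypothetical helper, not a Keras function):

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer1 = dense_param_count(8, 5)   # 8 input features -> 5 units
layer3 = dense_param_count(5, 1)   # 5 hidden units -> 1 output
total = layer1 + layer3
print(layer1, layer3, total)       # 45 6 51, matching the summary above
```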


In [54]:
early_stopping_monitor = EarlyStopping(
    monitor='val_loss',
    min_delta=0,
    patience=2,
    verbose=0,
    mode='auto',
    baseline=None,
    restore_best_weights=True
)


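# Note: the test set doubles as the validation set here, so early stopping is tuned on the data we later evaluate on
model.fit(train_data[column_names[:-1]].values, train_data['points'].values,
          validation_data=(test_data[column_names[:-1]].values, test_data['points'].values),
          batch_size=10, epochs=10, verbose=True, callbacks=[early_stopping_monitor])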

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10


<keras.callbacks.History at 0x1d942f9ee20>

In [61]:
pred_y = model.predict(test_data[column_names[:-1]].values)
pred_y.max()

88.41376

In [58]:
train_data['points']

127930    86
62509     91
55391     88
34876     90
123076    93
          ..
28300     86
20278     88
29157     92
121085    87
32778     84
Name: points, Length: 103977, dtype: int64
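
The largest prediction above is 88.41, while the training targets shown here reach at least 93, which suggests the network's predictions vary very little. One possible contributor (an assumption, not verified in this notebook) is that the label-encoded inputs are fed to the network unscaled, so columns with many categories can have a much wider numeric range than the others. A common remedy is to standardise features using statistics from the training set only. A minimal numpy sketch on made-up numbers; `standardize` is a hypothetical helper:

```python
import numpy as np

def standardize(train_X, test_X):
    """Scale both sets using the mean and std computed on the training set only."""
    mean = train_X.mean(axis=0)
    std = train_X.std(axis=0)
    std[std == 0] = 1.0   # avoid division by zero for constant columns
    return (train_X - mean) / std, (test_X - mean) / std

# Made-up feature matrix: a wide-range code column and a price column
train_X = np.array([[10.0, 20.0], [5000.0, 35.0], [12000.0, 15.0]])
test_X = np.array([[7000.0, 25.0]])

train_scaled, test_scaled = standardize(train_X, test_X)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

Scaling only changes the numeric range; it does not give label codes a meaningful order, so it is a first thing to try rather than a complete fix.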