# Predict __Price of a bottle of wine__

# Summary

The dataset contains **258210** wine reviews:

__175000__ form the training set, the data on which you train your models.

The remaining __83210__ observations constitute the validation set (or score set), the data for which you must produce the estimates for your submission. The validation set provided does not contain the variable price, the price of the bottle of wine, which is the target of your forecast.

 

## File descriptions

- train.csv - the training set
- test.csv - the test set
- Sample_Submission.csv - a sample submission file in the correct format
 
### Data fields

**country (String)** The country that the wine is from

**province (String)** The province or state that the wine is from

**region_1 (String)** The wine growing area in a province or state (e.g. Napa)

**region_2 (String)** Sometimes there are more specific regions within the wine growing area (e.g. Rutherford inside the Napa Valley), but this value can sometimes be blank

**winery (String)** The winery that made the wine

**variety (String)** The type of grapes used to make the wine (e.g. Pinot Noir)

**designation (String)** The vineyard within the winery where the grapes that made the wine are from

**taster_name (String)** The name of the taster

**taster_twitter_handle (String)** The taster's Twitter account name

**description (String)** A few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.

**points (Numeric)** The number of points WineEnthusiast gave the wine, on a scale of **1-100**

## __TARGET: price (Numeric) The cost for a bottle of wine__

The accuracy of your forecasts will be evaluated using the Root Mean Squared Error (RMSE).

The calculation, in pseudocode:

RMSE = sqrt (mean ((predicted-true) ^ 2))
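The formula above can be sketched in Python with NumPy. The function name `rmse` and the sample values are illustrative only and are not part of the competition material.

```python
import numpy as np

def rmse(predicted, true):
    """Root Mean Squared Error between two equally sized sequences."""
    predicted = np.asarray(predicted, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean((predicted - true) ** 2)))

# Illustrative values only
print(rmse([10.0, 20.0, 30.0], [12.0, 18.0, 30.0]))
```

Because the errors are squared before averaging, a few badly mispriced bottles weigh more heavily on the score than many small misses.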

The leaderboard is calculated with approximately 30% of the test data.

The final results will be based on the other 70%, so the final standings may be different.

# Import Libraries

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import tensorflow as tf

%matplotlib inline

In [33]:
tf.__version__

'2.0.0'

In [2]:
train = pd.read_csv("./Datasets/train.csv")
test = pd.read_csv("./Datasets/test.csv")

# Check our numerical features

In [3]:
train.select_dtypes('number').columns

Index(['points', 'price', 'id'], dtype='object')

In [4]:
test.select_dtypes('number').columns

Index(['index', 'points', 'price', 'id'], dtype='object')

In [8]:
test.drop(['index', 'price'], axis=1, inplace=True)

In [9]:
test.select_dtypes('number').columns

Index(['points', 'id'], dtype='object')

# Check our categorical features

In [29]:
categorical_features = train.select_dtypes('object').columns
categorical_features

Index(['country', 'description', 'designation', 'province', 'region_1',
       'region_2', 'taster_name', 'taster_twitter_handle', 'title', 'variety',
       'winery'],
      dtype='object')

In [20]:
test.select_dtypes('object').columns

Index(['country', 'description', 'designation', 'province', 'region_1',
       'region_2', 'taster_name', 'taster_twitter_handle', 'title', 'variety',
       'winery'],
      dtype='object')

# Cardinality of categorical data

In [22]:
train[train.select_dtypes('object').columns].nunique().reset_index(name='cardinality')

| | index | cardinality |
|---|---|---|
| 0 | country | 45 |
| 1 | description | 123811 |
| 2 | designation | 37931 |
| 3 | province | 468 |
| 4 | region_1 | 1278 |
| 5 | region_2 | 18 |
| 6 | taster_name | 19 |
| 7 | taster_twitter_handle | 15 |
| 8 | title | 77411 |
| 9 | variety | 706 |


In [23]:
test[test.select_dtypes('object').columns].nunique().reset_index(name='cardinality')

| | index | cardinality |
|---|---|---|
| 0 | country | 46 |
| 1 | description | 70235 |
| 2 | designation | 25487 |
| 3 | province | 423 |
| 4 | region_1 | 1145 |
| 5 | region_2 | 18 |
| 6 | taster_name | 19 |
| 7 | taster_twitter_handle | 15 |
| 8 | title | 37684 |
| 9 | variety | 622 |
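The cardinalities differ between the two sets (country, for example, has 45 distinct values in train and 46 in test), so the test set contains at least one category that never appears in training. A minimal sketch of how such unseen values could be counted per column, using small toy frames in place of the real `train` and `test`; the helper name `unseen_categories` is hypothetical:

```python
import pandas as pd

def unseen_categories(train_df, test_df, columns):
    """For each column, count distinct test values that never occur in train."""
    counts = {}
    for col in columns:
        train_values = set(train_df[col].dropna().unique())
        test_values = set(test_df[col].dropna().unique())
        counts[col] = len(test_values - train_values)
    return pd.Series(counts, name='unseen_in_train')

# Toy data standing in for the real train/test frames
toy_train = pd.DataFrame({'country': ['US', 'France', 'Italy'],
                          'variety': ['Pinot Noir', 'Merlot', 'Merlot']})
toy_test = pd.DataFrame({'country': ['US', 'Chile'],
                         'variety': ['Merlot', 'Pinot Noir']})

result = unseen_categories(toy_train, toy_test, ['country', 'variety'])
print(result)
```

Columns with many unseen values are worth noting before encoding, since an encoder or embedding fitted on train alone has no index for them.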


In [30]:
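from sklearn.preprocessing import LabelEncoder

def prepar_data_set(data_df):
    # The string columns load as 'object' dtype, so select those as well as 'category'
    category_features = data_df.select_dtypes(['object', 'category']).columns
    numeric_features = data_df.select_dtypes('number').columns
    for col in category_features:
        encoder = LabelEncoder()
        # Replace missing values with a placeholder label before encoding
        data_df[col] = encoder.fit_transform(data_df[col].fillna('missing').astype(str))
    return data_df, category_features, numeric_features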
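One thing to watch with the function above: it fits a fresh `LabelEncoder` on whatever frame it receives, so calling it separately on train and test can map the same string to different integers. One possible approach, sketched here on toy data, is to fit each encoder on the union of train and test values. The helper name `encode_together` and the `'missing'` placeholder are hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_together(train_df, test_df, columns):
    """Label-encode columns with one encoder per column, fitted on train + test."""
    train_df, test_df = train_df.copy(), test_df.copy()
    encoders = {}
    for col in columns:
        # Replace missing values with a placeholder label before encoding
        train_col = train_df[col].fillna('missing').astype(str)
        test_col = test_df[col].fillna('missing').astype(str)
        encoder = LabelEncoder()
        encoder.fit(pd.concat([train_col, test_col]))
        train_df[col] = encoder.transform(train_col)
        test_df[col] = encoder.transform(test_col)
        encoders[col] = encoder
    return train_df, test_df, encoders

# Toy data standing in for the real train/test frames
toy_train = pd.DataFrame({'country': ['US', 'France', None]})
toy_test = pd.DataFrame({'country': ['Chile', 'US']})
enc_train, enc_test, encoders = encode_together(toy_train, toy_test, ['country'])
print(enc_train['country'].tolist(), enc_test['country'].tolist())
```

With a shared encoder, the embedding vocabulary size for a column can be taken from `len(encoders[col].classes_)`, which covers every index that appears in either set.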

In [38]:
models = []
inputs = []

# One integer input and one 200-dimensional embedding per categorical feature
for cat in categorical_features:
    vocab_size = train[cat].nunique()
    inpt = tf.keras.layers.Input(shape=(1,),
                                 name='input_' + '_'.join(cat.split(' ')))
    embed = tf.keras.layers.Embedding(vocab_size, 200, trainable=True,
                                      embeddings_initializer=tf.random_normal_initializer())(inpt)
    embed_reshaped = tf.keras.layers.Reshape(target_shape=(200,))(embed)
    models.append(embed_reshaped)
    inputs.append(inpt)

In [39]:
# Assumption: 'points' is the only numeric input, since 'price' is the target and 'id' is an identifier
num_features = ['points']
num_input = tf.keras.layers.Input(shape=(len(num_features),),
                                  name='input_number_features')
# append this input to the list of models
models.append(num_input)
# keep track of the input; we are going to feed it later to the final model
inputs.append(num_input)
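The embedding branches and the numeric input are collected so that they can later be joined into a single model. A minimal sketch of that final step on toy data follows. The layer sizes, the optimizer, the toy vocabulary sizes and the choice of `points` as the only numeric feature are assumptions for illustration, not the notebook's actual configuration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical vocabulary sizes for two label-encoded categorical features
vocab_sizes = {'country': 5, 'variety': 8}
num_features = ['points']  # assumed numeric input

branches, inputs = [], []
for name, vocab_size in vocab_sizes.items():
    inpt = tf.keras.layers.Input(shape=(1,), name='input_' + name)
    embed = tf.keras.layers.Embedding(vocab_size, 4)(inpt)
    branches.append(tf.keras.layers.Reshape(target_shape=(4,))(embed))
    inputs.append(inpt)

num_input = tf.keras.layers.Input(shape=(len(num_features),),
                                  name='input_number_features')
branches.append(num_input)
inputs.append(num_input)

# Join all branches and regress a single value (the price)
merged = tf.keras.layers.Concatenate()(branches)
hidden = tf.keras.layers.Dense(16, activation='relu')(merged)
output = tf.keras.layers.Dense(1, name='price')(hidden)

model = tf.keras.Model(inputs=inputs, outputs=output)
model.compile(optimizer='adam', loss='mse',
              metrics=[tf.keras.metrics.RootMeanSquaredError()])

# Toy batch: one array per input, in the same order as `inputs`
rng = np.random.default_rng(0)
n = 32
features = [rng.integers(0, 5, size=(n, 1)),
            rng.integers(0, 8, size=(n, 1)),
            rng.uniform(80, 100, size=(n, 1)).astype('float32')]
prices = rng.uniform(10, 100, size=(n, 1)).astype('float32')

model.fit(features, prices, epochs=1, verbose=0)
predictions = model.predict(features, verbose=0)
```

Training on mean squared error while reporting `RootMeanSquaredError` keeps the monitored metric on the same scale as the competition score.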