<h3> Beer prediction model using neural networks

<h5> The aim is to develop a Machine Learning model into production that accurately predicts the beer type based on review inputs entered by the user within an API

In [1]:
#Import initial packages
import pandas as pd
import numpy as np

<h4> 1. Load and Explore Train Dataset

<h5> Firstly, we need to load the dataset and explore it

In [2]:
df = pd.read_csv('../data/raw/beer_reviews.csv')

In [3]:
df.shape

(1586614, 13)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586614 entries, 0 to 1586613
Data columns (total 13 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   brewery_id          1586614 non-null  int64  
 1   brewery_name        1586599 non-null  object 
 2   review_time         1586614 non-null  int64  
 3   review_overall      1586614 non-null  float64
 4   review_aroma        1586614 non-null  float64
 5   review_appearance   1586614 non-null  float64
 6   review_profilename  1586266 non-null  object 
 7   beer_style          1586614 non-null  object 
 8   review_palate       1586614 non-null  float64
 9   review_taste        1586614 non-null  float64
 10  beer_name           1586614 non-null  object 
 11  beer_abv            1518829 non-null  float64
 12  beer_beerid         1586614 non-null  int64  
dtypes: float64(6), int64(3), object(4)
memory usage: 157.4+ MB


In [5]:
df.describe()

Unnamed: 0,brewery_id,review_time,review_overall,review_aroma,review_appearance,review_palate,review_taste,beer_abv,beer_beerid
count,1586614.0,1586614.0,1586614.0,1586614.0,1586614.0,1586614.0,1586614.0,1518829.0,1586614.0
mean,3130.099,1224089000.0,3.815581,3.735636,3.841642,3.743701,3.79286,7.042387,21712.79
std,5578.104,76544270.0,0.7206219,0.6976167,0.6160928,0.6822184,0.7319696,2.322526,21818.34
min,1.0,840672000.0,0.0,1.0,0.0,1.0,1.0,0.01,3.0
25%,143.0,1173224000.0,3.5,3.5,3.5,3.5,3.5,5.2,1717.0
50%,429.0,1239203000.0,4.0,4.0,4.0,4.0,4.0,6.5,13906.0
75%,2372.0,1288568000.0,4.5,4.0,4.0,4.0,4.5,8.5,39441.0
max,28003.0,1326285000.0,5.0,5.0,5.0,5.0,5.0,57.7,77317.0


In [6]:
df.head()

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,stcules,Hefeweizen,1.5,1.5,Sausa Weizen,5.0,47986
1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,stcules,English Strong Ale,3.0,3.0,Red Moon,6.2,48213
2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,stcules,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,6.5,48215
3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,stcules,German Pilsener,2.5,3.0,Sausa Pils,5.0,47969
4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,johnmichaelsen,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,7.7,64883


In [7]:
item_counts = df["beer_style"].value_counts()
print(item_counts)

American IPA                        117586
American Double / Imperial IPA       85977
American Pale Ale (APA)              63469
Russian Imperial Stout               54129
American Double / Imperial Stout     50705
                                     ...  
Gose                                   686
Faro                                   609
Roggenbier                             466
Kvass                                  297
Happoshu                               241
Name: beer_style, Length: 104, dtype: int64


<h5> As seen on the tables above, the dataset contains more than 1,5 million observations, being beer_style 'American IPA' the beer type with more than 1 million reviews. <br> <br>
Therefore, the dataset will be reduced to the same number of observations for each beer style from the minority class (Happoshu - 241 observations) <br> <br>
There are 105 different types of beers that will be used in the model architecture 

<h4> 2. Clean Dataset

<h5> Unique identifiers will be removed from the dataset: review_time, beer_name , beer_beerid, review_profilename, brewery_id

In [8]:
#Removing columns
cols = ["review_time",
        "beer_name",
        "beer_beerid", 
        "review_profilename",
        "brewery_id",
        "beer_abv"
       ]

In [9]:
df_cleaned = df.copy()
df_cleaned.drop(cols, axis=1, inplace=True)

In [10]:
df_cleaned.head()

Unnamed: 0,brewery_name,review_overall,review_aroma,review_appearance,beer_style,review_palate,review_taste
0,Vecchio Birraio,1.5,2.0,2.5,Hefeweizen,1.5,1.5
1,Vecchio Birraio,3.0,2.5,3.0,English Strong Ale,3.0,3.0
2,Vecchio Birraio,3.0,2.5,3.0,Foreign / Export Stout,3.0,3.0
3,Vecchio Birraio,3.0,3.0,3.5,German Pilsener,2.5,3.0
4,Caldera Brewing Company,4.0,4.5,4.0,American Double / Imperial IPA,4.0,4.5


<h5> Reviewing if there are missing values

In [11]:
print(df_cleaned.isnull().sum())

brewery_name         15
review_overall        0
review_aroma          0
review_appearance     0
beer_style            0
review_palate         0
review_taste          0
dtype: int64


<h5> Removing missing values from dataset

In [12]:
df_cleaned.dropna(inplace=True)

<h5> Column names are saved into a variable

In [13]:
cols = df_cleaned.columns.values.tolist()

<h4> 3.0 Standarize dataset and Scale numerical features using pipeline

In [14]:
# Standarize Numerical Features
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from src.models.pytorch import MultiColumnOrdinalEncoder

<h5> Create a pipeline called num_transformer with one step that contains StandardScaler

In [15]:
num_cols = ["review_overall",
            "review_aroma",
            "review_appearance",
            "review_palate",
            "review_taste"
           ]
    
cat_cols = ["brewery_name"
            , "beer_style"
           ] 


In [16]:
num_transformer = Pipeline (
    steps = [
        ('scaler', StandardScaler())
    ]
)

<h5> Create a pipeline called cat_transformer with one step that contains OrdinalEncoder

In [17]:
cat_transformer = Pipeline (
    steps = [
        ('MultiColumnOrdinalEncoder', MultiColumnOrdinalEncoder(columns=cat_cols))
    ]
)

<h5> Create a ColumnTransformer called preprocessor with two steps that contains num_transformer and cat_transformer

In [18]:
from sklearn.compose import ColumnTransformer

In [19]:
preprocessor = ColumnTransformer(
    transformers = [
        #('num_cols', num_transformer, num_cols), 
        ('cat_cols', cat_transformer, cat_cols)
    ], remainder='passthrough'
)

In [20]:
df_transformed = preprocessor.fit_transform(df_cleaned)

In [21]:
df_transformed = pd.DataFrame(df_transformed, columns = num_cols + cat_cols)

In [22]:
df_transformed[cat_cols] = df_transformed[cat_cols].astype(int)

In [23]:
df_transformed.head()

Unnamed: 0,review_overall,review_aroma,review_appearance,review_palate,review_taste,brewery_name,beer_style
0,1.0,1.0,1.5,2.0,2.5,1,1
1,1.0,2.0,3.0,2.5,3.0,3,3
2,1.0,3.0,3.0,2.5,3.0,3,3
3,1.0,4.0,3.0,3.0,3.5,2,3
4,2.0,5.0,4.0,4.5,4.0,4,4


In [24]:
print(df_transformed.max())

review_overall       5742.0
review_aroma          104.0
review_appearance       5.0
review_palate           5.0
review_taste            5.0
brewery_name            5.0
beer_style              5.0
dtype: float64


<h4> 4.0 Undersample dataset

<h5> As dataset is too big with more than 1.5 million rows, a random resampling will be performed by keeping 241 observations from each beer style

In [25]:
#Select a random sample used for training

#1 shuffle df
result = df_transformed.sample(frac=1, random_state = 7)

#2 get the first 10 by beer_style
result = result.groupby("beer_style").head(3000)

df_sample = result.copy()

<h5> Due to dataset being undersampled, the dataframe index will be reset 

In [26]:
df_sample.reset_index(drop=True,inplace=True)

In [27]:
target = df_sample.pop("beer_style")

<h4> 5.0 Splitting datasets and saving them

In [28]:
#Spliting data into training and testing sets with 80/20 ratio 

In [29]:
#Import subset function for getting training and evaluate
from src.data.sets import split_set

In [30]:
X_train, X_val, X_test, y_train, y_val, y_test = split_set(df_sample, target)

In [31]:
#Saving sets into ..data/processed folder

In [32]:
#Import saving function for saving sets
from src.data.sets import save_sets

In [33]:
save_sets(X_train=X_train, X_val=X_val, y_train=y_train, y_val=y_val, X_test=X_test, y_test=y_test)

<h4> 6.0 Converting datasets into PytorchDataset

In [34]:
#Import class from src/models/pytorch and convert all sets to PytorchDatasets
from src.models.pytorch import PytorchDataset

In [35]:
train_dataset = PytorchDataset(X=X_train, y=y_train)
val_dataset = PytorchDataset(X=X_val, y=y_val)
test_dataset = PytorchDataset(X=X_test, y=y_test)

In [36]:
X_train.shape[1]

6

<h4> 7.0 Define neural network Architecture

<h5> A multi-class classification apporach will be used as there are multiple target classes

In [37]:
#Initiate PytorchMulticlass with the correct no of input feature
from src.models.pytorch import PytorchMultiClass

<h5> The model architecture consists of 7 input features, 80 neaurons and 104 output layers

In [38]:
model = PytorchMultiClass(num_features=X_train.shape[1])

<h5> Custom function get_device will be used to determin if CPU or GPU will be used depending on if GPU is available locally

In [39]:
# Import get_device() from src.models.pytorch and set model to use de device available
from src.models.pytorch import get_device
device = get_device()
model.to(device)

PytorchMultiClass(
  (layer_1): Linear(in_features=6, out_features=60, bias=True)
  (layer_out): Linear(in_features=60, out_features=105, bias=True)
)

<h4> 8.0 Train Model

In [40]:
#Import torch

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim.lr_scheduler as op

In [41]:
criterion = nn.CrossEntropyLoss()

<h5> Initiate a torch.optim.Adam() optimizer with the model's parameters and 0.001 as learning rate saving it into a variable called optimizer

In [42]:
optimizer = torch.optim.Adam(model.parameters(), lr= 0.0002)

In [43]:
# Create 2 variables N_EPOCHS and BATCH_SIZE 
N_EPOCHS = 682
BATCH_SIZE = 120 

<h5> Create a loop that ill iterate through the number of epochs and will train the model with the training set and asses the performance on the validation sets and print the results

In [44]:
#scheduler = op.StepLR(optimizer, 1, gamma=0.1)

In [45]:
from src.models.pytorch import train_classification, test_classification

In [46]:
for epoch in range(N_EPOCHS):
    train_loss, train_acc = train_classification(train_dataset, model=model, criterion= criterion, optimizer=optimizer, batch_size=BATCH_SIZE, device=device
                                                 #, scheduler=scheduler 
                                                )
    valid_loss, valid_acc = test_classification(val_dataset, model=model, criterion= criterion, batch_size=BATCH_SIZE, device=device )
    print(f'Epoch: {epoch}')
    print(f'\t(train)\t|\tloss: {train_loss: .5f}\t|\tAcc: {train_acc * 100:.3f}%')
    print(f'\t(valid)\t|\tloss: {valid_loss: .5f}\t|\tAcc: {valid_acc * 100:.3f}%')
    

Epoch: 0
	(train)	|	loss:  7.77286	|	Acc: 2.344%
	(valid)	|	loss:  3.58835	|	Acc: 19.833%
Epoch: 1
	(train)	|	loss:  4.32067	|	Acc: 7.635%
	(valid)	|	loss:  0.87879	|	Acc: 20.000%
Epoch: 2
	(train)	|	loss:  2.73418	|	Acc: 14.354%
	(valid)	|	loss:  0.22812	|	Acc: 16.708%
Epoch: 3
	(train)	|	loss:  2.16701	|	Acc: 16.688%
	(valid)	|	loss:  0.12777	|	Acc: 22.750%
Epoch: 4
	(train)	|	loss:  1.82943	|	Acc: 19.000%
	(valid)	|	loss:  0.17649	|	Acc: 20.458%
Epoch: 5
	(train)	|	loss:  1.59287	|	Acc: 18.917%
	(valid)	|	loss:  0.13309	|	Acc: 21.792%
Epoch: 6
	(train)	|	loss:  1.31280	|	Acc: 19.740%
	(valid)	|	loss:  0.14079	|	Acc: 19.667%
Epoch: 7
	(train)	|	loss:  1.06083	|	Acc: 19.865%
	(valid)	|	loss:  0.13388	|	Acc: 20.292%
Epoch: 8
	(train)	|	loss:  0.84765	|	Acc: 19.552%
	(valid)	|	loss:  0.08438	|	Acc: 28.292%
Epoch: 9
	(train)	|	loss:  0.63795	|	Acc: 19.604%
	(valid)	|	loss:  0.08036	|	Acc: 19.917%
Epoch: 10
	(train)	|	loss:  0.44918	|	Acc: 19.458%
	(valid)	|	loss:  0.06654	|	Acc: 21.167%


<h4> Save model into models folder

In [47]:
torch.save(model, "../models/pytorch_multi_class_beer")

In [48]:
#Asses model performance on testing set and print results

In [49]:
test_loss, test_acc = test_classification(test_dataset, model=model, criterion= criterion, batch_size=BATCH_SIZE, device=device )
print(f'\t(test)\t|\tloss: {test_loss: .4f}\t|\tAccuracy: {test_acc * 100:.3f}%')


	(test)	|	loss:  0.0065	|	Accuracy: 66.867%


<h5> Create Pipeline with 1 model initiation

In [50]:
nn_pipe = Pipeline(
    steps = [
        ('neural_net', model)
        ]    
)

<h5> save nn_pipe into models folder

In [51]:
from joblib import dump
dump(nn_pipe, '../models/nn_pipe.joblib')

['../models/nn_pipe.joblib']