## Beer Ratings Prediction
Source: https://www.kaggle.com/c/beer-ratings/

**Note:** The analysis belwo was performed as part of an NLP challenge. I did not participate in the Kaggle contest.

This is a fairly standard review rating prediction problem. There are some additional information and we shall see how it works out. First we shall import some necessary packages and read the data.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn 

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df.describe()

Unnamed: 0,index,beer/ABV,beer/beerId,beer/brewerId,review/appearance,review/aroma,review/overall,review/palate,review/taste,review/timeUnix,user/ageInSeconds,user/birthdayUnix
count,37500.0,37500.0,37500.0,37500.0,37500.0,37500.0,37500.0,37500.0,37500.0,37500.0,7856.0,7856.0
mean,24951.887573,7.403725,21861.152027,3036.59512,3.900053,3.87324,3.88944,3.854867,3.92244,1232794000.0,1176705000.0,241630300.0
std,14434.009669,2.318145,18923.130832,5123.084675,0.588778,0.680865,0.70045,0.668068,0.716504,71909550.0,337551400.0,337551400.0
min,0.0,0.1,175.0,1.0,0.0,1.0,0.0,1.0,1.0,926294400.0,703436600.0,-2208960000.0
25%,12422.5,5.4,5441.0,395.0,3.5,3.5,3.5,3.5,3.5,1189194000.0,979481000.0,143362800.0
50%,24942.5,6.9,17538.0,1199.0,4.0,4.0,4.0,4.0,4.0,1248150000.0,1100009000.0,318326400.0
75%,37416.75,9.4,34146.0,1315.0,4.5,4.5,4.5,4.5,4.5,1291330000.0,1274973000.0,438854400.0
max,49999.0,57.7,77207.0,27797.0,5.0,5.0,5.0,5.0,5.0,1326267000.0,3627295000.0,714898800.0


At a first glance, nothing seems to be out of ordinary. There aren't any outliers. Gender and user age might be useful later. Next we shall check for missing values.

In [2]:
nulls = (len(train_df) - train_df.count()).rename('# misssing')
nulls_pct = (nulls/len(train_df)*100.).rename('% missing')
pd.concat([nulls, nulls_pct], axis=1)

Unnamed: 0,# misssing,% missing
index,0,0.0
beer/ABV,0,0.0
beer/beerId,0,0.0
beer/brewerId,0,0.0
beer/name,0,0.0
beer/style,0,0.0
review/appearance,0,0.0
review/aroma,0,0.0
review/overall,0,0.0
review/palate,0,0.0


## Data Cleanup
Unfortunatelty, user birthday, age, and gender are not usable as features since so many data points are missing. Additionally, review time, beerId, profileName etc. are also not useful. We'll drop these columns. Rows with missing review text will also be dropped from taining data.

In [3]:
train_df.drop(['beer/beerId', 'user/ageInSeconds', 'user/birthdayRaw', 'user/birthdayUnix', 'user/gender', 'user/profileName', 'review/timeStruct', 'review/timeUnix'], axis=1, inplace=True)
test_df.drop(['beer/beerId', 'user/ageInSeconds', 'user/birthdayRaw', 'user/birthdayUnix', 'user/gender', 'user/profileName', 'review/timeStruct', 'review/timeUnix'], axis=1, inplace=True)

train_df.dropna(subset=['review/text'], inplace=True)
test_df['review/text'].fillna('', inplace=True)

nulls = (len(test_df) - test_df.count()).rename('# misssing')
nulls_pct = (nulls/len(test_df)*100.).rename('% missing')
pd.concat([nulls, nulls_pct], axis=1)

Unnamed: 0,# misssing,% missing
index,0,0.0
beer/ABV,0,0.0
beer/brewerId,0,0.0
beer/name,0,0.0
beer/style,0,0.0
review/appearance,12500,100.0
review/aroma,12500,100.0
review/overall,12500,100.0
review/palate,12500,100.0
review/taste,12500,100.0


Standardize ABV

In [4]:
train_df['standardized_abv'] = (train_df['beer/ABV']-train_df['beer/ABV'].mean())/train_df['beer/ABV'].std()
test_df['standardized_abv'] = (test_df['beer/ABV']-test_df['beer/ABV'].mean())/test_df['beer/ABV'].std()
train_df.head()

Unnamed: 0,index,beer/ABV,beer/brewerId,beer/name,beer/style,review/appearance,review/aroma,review/overall,review/palate,review/taste,review/text,standardized_abv
0,40163,5.0,14338,Chiostro,Herbed / Spiced Beer,4.0,4.0,4.0,4.0,4.0,Pours a clouded gold with a thin white head. N...,-1.036809
1,8135,11.0,395,Bearded Pat's Barleywine,American Barleywine,4.0,3.5,3.5,3.5,3.0,12oz bottle into 8oz snifter.\t\tDeep ruby red...,1.551615
2,10529,4.7,365,Naughty Nellie's Ale,American Pale Ale (APA),3.5,4.0,3.5,3.5,3.5,First enjoyed at the brewpub about 2 years ago...,-1.16623
3,44610,4.4,1,Pilsner Urquell,Czech Pilsener,3.0,3.0,2.5,3.0,3.0,First thing I noticed after pouring from green...,-1.295651
4,37062,4.4,1417,Black Sheep Ale (Special),English Pale Ale,4.0,3.0,3.0,3.5,2.5,A: pours an amber with a one finger head but o...,-1.295651


## Text Pre-Processing
Now we'll build a simple preprocessing pipeline. In particular, We shall perform the following operations:
- remove all punctuations
- fix most common typos (found with `typo_fix.py`)
- remove last `y` if the rest of the word is in wordnet. For instance, `piney -> pine`.
- remove all stop words except negators (not, no, never, nothing). This is because in this problem a negator can significantly alter the ratings. Ideally, we might also want to use next word negation, however, adding simple negattion prefix, such as `neg_` does not seem to work. It is also possible to replace most frequent next words with their antonyms, however I haven't tried it.
- separate numbers from units; for instance, `8oz -> 8 oz`
- replace numbers with their spellings: `100 -> one hundred`
- lemmatizing and stemming. I have used `nltk`'s `WordNetLemmatizer` and Snowball Stemmer.

In [5]:
import string
import re

from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk import corpus
from num2words import num2words
from gensim.parsing.preprocessing import STOPWORDS

import json
from autocorrect import Speller

spell_correct = Speller(lang='en')
with open('typo_dict.json') as json_file:
    typo_dict = json.load(json_file)
    json_file.close()

negators = {'not', 'no', 'never', 'nothing'}
stop = set(STOPWORDS).difference(negators)
stop = stop.union({'will', 'can', 'should', 'shall'})
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
english_words = set(corpus.words.words())


def text_process(review):
    def is_number(word):
        return word.replace('.','',1).isdigit()
    
    def convert_number(number):
        if is_number(number):
            return num2words(number)
        else:
            return number
    
    def reduce_lengthening(text):
        # replace aaa... by aa
        # spellllling -> spelling
        pattern = re.compile(r"(.)\1{2,}")
        return pattern.sub(r"\1\1", text)
    
    def spelling_correct(word):
        word = reduce_lengthening(word)
        if typo_dict.get(word) is None:
            return word
        else:
            return typo_dict[word]
    
    def remove_y(word):
        # remove small words
        if len(word) == 1:
            return ' '
        if word[-1] == 'y' and word[:-1] in english_words:
            return word[:-1]
        elif word[-1] == 'y' and word[-2] == word[-3] and word[:-2] in english_words:
            return word[:-2]
        elif word[-1] == 'y' and word[:-2]+'e' in english_words:
            return word[:-1]
        else:
            return word
    
    def next_word_negation(review):
        is_negated = False
        output = []
        for word in review:
            if word in negators:
                is_negated = True
                continue
            if is_negated and len(word) > 1:
                output.append("neg_" + word)
                is_neagted = False
            else:
                output.append(word)
        return output
    # remove all punctuations except apostrophe
    punctuations = ''.join([p for p in string.punctuation])
    table = str.maketrans('', '', punctuations)
    review = str(review).translate(table)
    
    # convert text to lower case
    review = str(review).lower()
    
    # spelling correction
    review = [spelling_correct(word) for word in review.split()]
    
    # remove y at the end
    review = [remove_y(word) for word in review]
    
    # remove stop words
    review = filter(lambda word : word not in stop, review)
    
    # next word negation
    # review = ' '.join(next_word_negation(review))
#     # remove apostrophe
#     table = str.maketrans('', '', '\'')
#     review = review.translate(table)
    
    # separate numbers from units: for example 8oz to 8 oz
    review = ' '.join(re.split('(\d+)', ' '.join(review)))
    
    # replace numbers to their spellings. for example 100 -> one hundred
    review = ' '.join([convert_number(word) for word in review.split()])
    
    # lemmatizing and stemming
    review = ' '.join([stemmer.stem(lemmatizer.lemmatize(word, pos='a')) for word in review.split()])
    return review


train_df['processed_beer_style'] = train_df['beer/style'].map(text_process)
test_df['processed_beer_style'] = test_df['beer/style'].map(text_process)

train_df['processed_beer_name'] = train_df['beer/name'].map(text_process)
test_df['processed_beer_name'] = test_df['beer/name'].map(text_process)

#train_df['processed_beer_style'].astype(str) + ' ' + \
#train_df['processed_beer_name'].astype(str) + ' ' + \
train_df['processed_text'] = train_df['review/text'].map(text_process).astype(str)

#test_df['processed_beer_style'].astype(str) + ' ' + \
# test_df['processed_beer_name'].astype(str) + ' ' + \
test_df['processed_text'] = test_df['review/text'].map(text_process).astype(str)
train_df.head(5)

Unnamed: 0,index,beer/ABV,beer/brewerId,beer/name,beer/style,review/appearance,review/aroma,review/overall,review/palate,review/taste,review/text,standardized_abv,processed_beer_style,processed_beer_name,processed_text
0,40163,5.0,14338,Chiostro,Herbed / Spiced Beer,4.0,4.0,4.0,4.0,4.0,Pours a clouded gold with a thin white head. N...,-1.036809,herb spice beer,chiostro,pour cloud gold white head nose floral larg sp...
1,8135,11.0,395,Bearded Pat's Barleywine,American Barleywine,4.0,3.5,3.5,3.5,3.0,12oz bottle into 8oz snifter.\t\tDeep ruby red...,1.551615,american barleywin,beard pat barleywin,twelv oz bottl eight oz snifter deep rub red h...
2,10529,4.7,365,Naughty Nellie's Ale,American Pale Ale (APA),3.5,4.0,3.5,3.5,3.5,First enjoyed at the brewpub about 2 years ago...,-1.16623,american pale ale apa,naught nelli ale,enjoy brewpub year ago final manag bottl sligh...
3,44610,4.4,1,Pilsner Urquell,Czech Pilsener,3.0,3.0,2.5,3.0,3.0,First thing I noticed after pouring from green...,-1.295651,czech pilsen,pilsner urquel,thing notic pour green bottl glass skunk smell...
4,37062,4.4,1417,Black Sheep Ale (Special),English Pale Ale,4.0,3.0,3.0,3.5,2.5,A: pours an amber with a one finger head but o...,-1.295651,english pale ale,black sheep ale special,pour amber finger head onl v strong pour head ...


In [6]:
# dump dataframes so that we don't have to do the preprocessing again
train_df.to_pickle("processed_train.pkl")
test_df.to_pickle("processed_test.pkl")

In [7]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

# load pickled dataframes
train_df = pd.read_pickle('processed_train.pkl')
test_df = pd.read_pickle('processed_test.pkl')

I did some simple testing to remove the most frequent and the least frequent words. I have also kept some of the words because they clearly have contextual meaning.

In [8]:
word_freq = train_df.processed_text.apply(lambda x : ' '.join(set(x.split()))).str.split(expand=True).stack().value_counts()
most_common_words = set(word_freq[ word_freq > 18000].keys()).difference({'not', 'sweet', 'flavor'})
least_common_words = set(word_freq[ word_freq < 4].keys())
# we shall remove both type of words and some others as necessary
undesirable_words = most_common_words.union(least_common_words).union({'drink', 'drinabl', 'and', 'like'})


In [9]:
def remove_words(review):
    return ' '.join(list(filter(lambda word : word not in undesirable_words, review.split())))

train_df['processed_beer_style'] = train_df['processed_beer_style'].map(remove_words)
test_df['processed_beer_style'] = test_df['processed_beer_style'].map(remove_words)

train_df['processed_beer_name'] = train_df['processed_beer_name'].map(remove_words)
test_df['processed_beer_name'] = test_df['processed_beer_name'].map(remove_words)

# train_df['processed_beer_style'].astype(str) + ' ' + \
train_df['processed_text'] = train_df['processed_beer_name'].astype(str) + ' ' + \
                       train_df['processed_text'].map(remove_words).astype(str)

# test_df['processed_beer_style'].astype(str) + ' ' + \
test_df['processed_text'] = test_df['processed_beer_name'].astype(str) + ' ' + \
                      test_df['processed_text'].map(remove_words).astype(str)
train_df.head(5)

Unnamed: 0,index,beer/ABV,beer/brewerId,beer/name,beer/style,review/appearance,review/aroma,review/overall,review/palate,review/taste,review/text,standardized_abv,processed_beer_style,processed_beer_name,processed_text
0,40163,5.0,14338,Chiostro,Herbed / Spiced Beer,4.0,4.0,4.0,4.0,4.0,Pours a clouded gold with a thin white head. N...,-1.036809,herb spice,,cloud gold white nose floral larg spice ad de...
1,8135,11.0,395,Bearded Pat's Barleywine,American Barleywine,4.0,3.5,3.5,3.5,3.0,12oz bottle into 8oz snifter.\t\tDeep ruby red...,1.551615,american barleywin,beard pat barleywin,beard pat barleywin twelv oz bottl eight oz sn...
2,10529,4.7,365,Naughty Nellie's Ale,American Pale Ale (APA),3.5,4.0,3.5,3.5,3.5,First enjoyed at the brewpub about 2 years ago...,-1.16623,american pale ale apa,naught nelli ale,naught nelli ale enjoy brewpub year ago final ...
3,44610,4.4,1,Pilsner Urquell,Czech Pilsener,3.0,3.0,2.5,3.0,3.0,First thing I noticed after pouring from green...,-1.295651,czech pilsen,pilsner urquel,pilsner urquel thing notic green bottl glass s...
4,37062,4.4,1417,Black Sheep Ale (Special),English Pale Ale,4.0,3.0,3.0,3.5,2.5,A: pours an amber with a one finger head but o...,-1.295651,english pale ale,black sheep ale special,black sheep ale special amber finger onl stron...


## The Baseline
We shall build a simple regression model with ABV (standardized), brewerId (one-hot encoded) and beer style (one-hot encoded). The scoring metrix will be RMSE as in the Kaggle competition.

In [10]:
print(train_df['processed_beer_name'].nunique())
print(train_df['processed_beer_style'].nunique())
print(train_df['beer/brewerId'].nunique())

1568
95
212


In [11]:
from sklearn.preprocessing import OneHotEncoder

def one_hot_encoder(col_name):
    enc = OneHotEncoder(handle_unknown = 'ignore', categories='auto')
    train = train_df[[col_name]]

    train_arr = enc.fit_transform(train).toarray()
    test_arr = enc.transform(train).toarray()
    feature_labels = np.array(enc.categories_).ravel()
    return pd.DataFrame(train_arr, columns=feature_labels), pd.DataFrame(test_arr, columns=feature_labels)

style_train_df, style_test_df = one_hot_encoder('processed_beer_style')
brewer_train_df, brewer_test_df = one_hot_encoder('beer/brewerId')

baseline_train_df = pd.merge(style_train_df, brewer_train_df, left_index=True, right_index=True)
baseline_test_df = pd.merge(style_test_df, brewer_test_df, left_index=True, right_index=True)

y_appear_train = train_df['review/appearance'].to_numpy()
y_appear_test = test_df['review/appearance'].to_numpy()

y_aroma_train = train_df['review/aroma'].to_numpy()
y_aroma_test = test_df['review/aroma'].to_numpy()

y_overall_train = train_df['review/overall'].to_numpy()
y_overall_test = test_df['review/overall'].to_numpy()

y_palate_train = train_df['review/palate'].to_numpy()
y_palate_test = test_df['review/palate'].to_numpy()

y_taste_train = train_df['review/taste'].to_numpy()
y_taste_test = test_df['review/taste'].to_numpy()

print(y_taste_train.shape)
print(train_df.shape)

(37490,)
(37490, 15)


### Cross Validation
We shall use ridge regression from `sklearn`. I ran a binary search for the hyperparameter `alpha` and found that `5.` works the best.

In [12]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

def cross_validate(X, y):
    model = Ridge(alpha=5.)
    scores = cross_val_score(model, X, y, cv=8, scoring='neg_root_mean_squared_error')
    print(f"mean: {-scores.mean()}, std: {scores.std()}")

In [13]:
print("metric: RMSE")
print("appear.", end="> ")
cross_validate(baseline_train_df.to_numpy(), y_appear_train)
print("aroma", end=">   ")
cross_validate(baseline_train_df.to_numpy(), y_aroma_train)
print("overall", end="> ")
cross_validate(baseline_train_df.to_numpy(), y_overall_train)
print("palate", end=">  ")
cross_validate(baseline_train_df.to_numpy(), y_palate_train)
print("taste", end=">   ")
cross_validate(baseline_train_df.to_numpy(), y_taste_train)

metric: RMSE
appear.> mean: 0.49788788009696816, std: 0.004899081538924065
aroma>   mean: 0.5400473722662857, std: 0.005250306730235193
overall> mean: 0.6169219003605013, std: 0.007052941642733783
palate>  mean: 0.5543871164782056, std: 0.0063881214176124326
taste>   mean: 0.5777728409804774, std: 0.0064877568956391


## Advanced Model
We shall compute the TF-IDF score for the top 2500 bigrams and unigrams from the processed review text. We shall also use the one hot encoded brewer ids that were computed before.

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
# settings that you use for count vectorizer will go here
tfidf_vectorizer=TfidfVectorizer(use_idf=True, max_features=2500, ngram_range=(1,2))
# just send in all your docs here
tfidf_vectorizer_vectors_train=tfidf_vectorizer.fit_transform(train_df['processed_text'])
tfidf_train_df = pd.DataFrame(tfidf_vectorizer_vectors_train.toarray(), columns=tfidf_vectorizer.get_feature_names())

tfidf_vectorizer_vectors_test=tfidf_vectorizer.transform(test_df['processed_text'])
tfidf_test_df = pd.DataFrame(tfidf_vectorizer_vectors_test.toarray(), columns=tfidf_vectorizer.get_feature_names())

m1_train_df = pd.merge(tfidf_train_df, brewer_train_df, left_index=True, right_index=True)
m1_test_df = pd.merge(tfidf_test_df, brewer_test_df, left_index=True, right_index=True)

m1_train_df["abv"] = train_df['standardized_abv'].to_numpy()
m1_test_df["abv"] = test_df['standardized_abv'].to_numpy()

print(f"X matrix shape: {m1_train_df.shape}")
print(tfidf_vectorizer.get_feature_names())

X matrix shape: (37490, 2712)
['abbey', 'abl', 'absolut', 'abund', 'abv', 'abv not', 'accent', 'accompani', 'acid', 'acquir', 'activ', 'actual', 'ad', 'adam', 'add', 'addit', 'adequ', 'adjunct', 'admit', 'aecht', 'aecht schlenkerla', 'aftertast', 'aftertast mouthfeel', 'age', 'age bori', 'age stout', 'aggress', 'ago', 'air', 'alcohol', 'alcohol bite', 'alcohol burn', 'alcohol content', 'alcohol finish', 'alcohol flavor', 'alcohol heat', 'alcohol hidden', 'alcohol not', 'alcohol note', 'alcohol notic', 'alcohol presenc', 'alcohol present', 'alcohol sweet', 'alcohol warm', 'alcohol warmth', 'ale', 'ale appear', 'ale bottl', 'ale clear', 'ale dark', 'ale hazi', 'ale nice', 'ale not', 'ale special', 'ale twelv', 'allow', 'alright', 'amaz', 'amber', 'amber ale', 'amber bod', 'amber color', 'american', 'american ipa', 'american pale', 'amount', 'ampl', 'amstel', 'amstel light', 'anis', 'anticip', 'apa', 'apart', 'apour', 'appar', 'appeal', 'appear', 'appear black', 'appear clear', 'appear da

Now we shall transform the processed beer styles the same way and pick top 400 words.

In [15]:
# settings that you use for count vectorizer will go here
tfidf_vectorizer=TfidfVectorizer(use_idf=True, max_features=400, ngram_range=(1,2))
# just send in all your docs here
tfidf_vectorizer_vectors_train=tfidf_vectorizer.fit_transform(train_df['processed_beer_style'])
tfidf_train_df = pd.DataFrame(tfidf_vectorizer_vectors_train.toarray(), columns=tfidf_vectorizer.get_feature_names())

tfidf_vectorizer_vectors_test=tfidf_vectorizer.transform(test_df['processed_beer_style'])
tfidf_test_df = pd.DataFrame(tfidf_vectorizer_vectors_test.toarray(), columns=tfidf_vectorizer.get_feature_names())

m1_train_df = pd.merge(tfidf_train_df, m1_train_df, left_index=True, right_index=True)
m1_test_df = pd.merge(tfidf_test_df, m1_test_df, left_index=True, right_index=True)

print(f"X matrix shape: {m1_train_df.shape}")
print(tfidf_vectorizer.get_feature_names())

X matrix shape: (37490, 2937)
['adjunct', 'adjunct lager', 'alcohol', 'ale', 'ale apa', 'ale ipa', 'ale wee', 'altbier', 'amber', 'amber red', 'american', 'american adjunct', 'american amber', 'american barleywin', 'american black', 'american blond', 'american brown', 'american dark', 'american doubl', 'american ipa', 'american liquor', 'american pale', 'american porter', 'american stout', 'american strong', 'american wild', 'ancient', 'ancient herb', 'apa', 'baltic', 'baltic porter', 'barleywin', 'belgian', 'belgian dark', 'belgian ipa', 'belgian pale', 'belgian strong', 'berlin', 'berlin weissbier', 'bier', 'bier bier', 'bitter', 'bitter esb', 'biã', 'biã gard', 'black', 'black ale', 'black tan', 'blond', 'blond ale', 'bock', 'braggot', 'brown', 'brown ale', 'bruin', 'california', 'california common', 'chile', 'common', 'common steam', 'cream', 'cream ale', 'czech', 'czech pilsen', 'dark', 'dark ale', 'dark lager', 'dark mild', 'dark wheat', 'doppelbock', 'dortmund', 'dortmund export

### Cross Validation
We again chose Ridge Regression from `sklearn` because of its speed and fewer hyperparameters. A deep neural network is likely to perform better. However, I didn't try it out because of resource and time constraints.

In [16]:
#from sklearn.svm import SVR
#from  sklearn.ensemble import RandomForestRegressor
# from sklearn.gaussian_process import GaussianProcessRegressor

def m1_cross_validate(X, y):
    model = Ridge(alpha=5.)# SVR(C=1., epsilon=0.2)
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error')
    print(f"mean: {-scores.mean()}, std: {scores.std()}")
print("metric: RMSE")
print("appear.", end="> ")
m1_cross_validate(m1_train_df.to_numpy(), y_appear_train)
print("aroma", end=">   ")
m1_cross_validate(m1_train_df.to_numpy(), y_aroma_train)
print("overall", end="> ")
m1_cross_validate(m1_train_df.to_numpy(), y_overall_train)
print("palate", end=">  ")
m1_cross_validate(m1_train_df.to_numpy(), y_palate_train)
print("taste", end=">   ")
m1_cross_validate(m1_train_df.to_numpy(), y_taste_train)

metric: RMSE
appear.> mean: 0.4525057971608934, std: 0.001969068109346739
aroma>   mean: 0.47942876422794967, std: 0.006095681525380359
overall> mean: 0.5247244831477001, std: 0.0034210276598466978
palate>  mean: 0.4841065209808324, std: 0.004906666930703412
taste>   mean: 0.47852516131541795, std: 0.0038021095282875753


This looks much better than the base line. Let us make predictions on test set.
## Prediction

In [17]:
def predict(X_train, X_test, y_train):
    model = Ridge(alpha=5.)
    model.fit(X_train, y_train)
    return model.predict(X_test)

y_appear_pred = predict(m1_train_df.to_numpy(), m1_test_df.to_numpy(), y_appear_train)
y_aroma_pred = predict(m1_train_df.to_numpy(), m1_test_df.to_numpy(), y_aroma_train)
y_overall_pred = predict(m1_train_df.to_numpy(), m1_test_df.to_numpy(), y_overall_train)
y_palate_pred = predict(m1_train_df.to_numpy(), m1_test_df.to_numpy(), y_palate_train)
y_taste_pred = predict(m1_train_df.to_numpy(), m1_test_df.to_numpy(), y_taste_train)

pred = np.vstack((y_appear_pred, y_aroma_pred, y_overall_pred, y_palate_pred, y_taste_pred)).T
# pred = (pred * 2).round(0) / 2
pred = pd.DataFrame(pred, index=test_df['index'], columns=["review/appearance", "review/aroma", "review/overall", "review/palate", "review/taste"])
# pred['index'] = 
pred.to_csv("prediction.csv", sep=',')

The final RMSE score from kaggle is **0.51427**, which is slightly higher than the fourth lowest public score (0.51281).

## Possible Improvements
Here are some things I would try out later:
- Analyse the reviews for which predicted rating was >= 1.0 point off. If there is any pattern, try to modify the model to accomodate for that.
- Try out a more complex model.
- Handle context negation. Right now, the model seems to pick up some of them such as `not sweet`, `not bitter` etc. However, careful analysis is necessary.
- Beer style can be used more efficiently instead of just using TF-IDF score.
- ABV is almost ignored by the model as it's only one of 3000 features. A better integration is necessary.
- Handle British and American spelling variations