# Beer Recommendation System

##### by Adam Larsen
###### Flatiron School, Data Science Capstone
###### January 2022

# Foreword

The goal of this system is to take a user's favorite beer as input and recommend similar beers based on five ratings (overall, taste, feel, look, smell) together with beer descriptions and text reviews.

This notebook explains the steps and thought process behind building this type of system. Feel free to reach out with any questions.

# Introduction

There is absolutely nothing worse than ordering a beer, whether it is the same style as one you've previously enjoyed or a random pick off the beer list to try something new, and ending up HATING it. You sit there staring at a full glass, hoping it will empty itself so you can go back to drinking something you know you'd enjoy. That's where my model comes into play. Its recommendations should help you stop making these miserable $8 mistakes and keep trying something new, with a much better chance that you'll enjoy it as well.

Let's make that happen!

In [308]:
# We'll use this cell for all of our imports

import pandas as pd
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import string
import nltk
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity

## Data Cleaning / Introducing our Data

Data was taken from two Kaggle datasets, pulled by data scientists at Stanford University as well as by independent data scientists. Together they include ratings, breweries, and text reviews, as well as an extensive list of beers. All of this data comes from BeerAdvocate and is a bit dated at the time of writing, because BeerAdvocate has since made all of its data private. The great news is that we have a ton of data to work with.

Kaggle sources:
(1) https://www.kaggle.com/rdoume/beerreviews
(2) https://www.kaggle.com/ehallmar/beers-breweries-and-beer-reviews

In [309]:
df_ba = pd.read_csv('data/beer_reviews.csv')
df_beers = pd.read_csv('data/beers.csv')
df_breweries = pd.read_csv('data/breweries.csv')
df_reviews = pd.read_csv('data/reviews.csv')

Let's start with our first data source, df_ba, and see what we're working with: a preview of the data, how much of it there is, and how many values are missing.

In [310]:
df_ba.head()

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,stcules,Hefeweizen,1.5,1.5,Sausa Weizen,5.0,47986
1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,stcules,English Strong Ale,3.0,3.0,Red Moon,6.2,48213
2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,stcules,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,6.5,48215
3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,stcules,German Pilsener,2.5,3.0,Sausa Pils,5.0,47969
4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,johnmichaelsen,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,7.7,64883


In [311]:
df_ba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586614 entries, 0 to 1586613
Data columns (total 13 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   brewery_id          1586614 non-null  int64  
 1   brewery_name        1586599 non-null  object 
 2   review_time         1586614 non-null  int64  
 3   review_overall      1586614 non-null  float64
 4   review_aroma        1586614 non-null  float64
 5   review_appearance   1586614 non-null  float64
 6   review_profilename  1586266 non-null  object 
 7   beer_style          1586614 non-null  object 
 8   review_palate       1586614 non-null  float64
 9   review_taste        1586614 non-null  float64
 10  beer_name           1586614 non-null  object 
 11  beer_abv            1518829 non-null  float64
 12  beer_beerid         1586614 non-null  int64  
dtypes: float64(6), int64(3), object(4)
memory usage: 157.4+ MB


In [312]:
df_ba.isna().sum()

brewery_id                0
brewery_name             15
review_time               0
review_overall            0
review_aroma              0
review_appearance         0
review_profilename      348
beer_style                0
review_palate             0
review_taste              0
beer_name                 0
beer_abv              67785
beer_beerid               0
dtype: int64

Immediate thoughts here:

    a) Looks like we have some missing values to deal with:

    1. 15 in brewery_name
    2. 348 in review_profilename
    3. ~67k in beer_abv
    
    b) There are a lot of columns that I ultimately will not need. Some examples
       of these columns would be brewery_id and review_time. 


### Dealing with Missing Values

In [313]:
df_ba[df_ba['brewery_name'].isna()]

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
651565,1193,,1301022066,2.0,2.5,2.5,Knapp85,Vienna Lager,1.5,1.5,Engel Tyrolian Bräu WRONG BREWERY SEE SCHWABIS...,5.0,67503
659293,1193,,1290107698,4.0,4.5,3.5,dqrull,Bock,4.0,3.5,Engel Bock Dunkel WRONG BREWERY SEE CRAILSHEIMER,7.2,63658
659299,1193,,1289077001,3.5,3.0,3.0,dqrull,Dortmunder / Export Lager,4.0,4.0,Engel Gold WRONG BREWERY SEE CRAILSHEIMER,5.4,63215
659300,1193,,1289851033,3.5,4.0,3.5,dqrull,Munich Helles Lager,3.5,3.0,Engel Landbier WRONG BREWERY SEE CRAILSHEIMER,4.8,63557
659301,1193,,1289158632,3.5,4.0,4.0,dqrull,Keller Bier / Zwickel Bier,4.0,3.5,Engel Keller Hell WRONG BREWERY SEE CRAILSHEIMER,5.4,63256
659302,1193,,1289590065,3.5,4.0,3.0,dqrull,Vienna Lager,3.0,3.5,Engel Aloisius - WRONG BREWERY SEE CRAILSHEIMER,5.9,63459
659303,1193,,1298078926,3.0,3.0,3.0,Ochsenblut,Keller Bier / Zwickel Bier,2.0,3.0,Engel Keller Dunkel WRONG BREWERY SEE CRAILSH...,5.3,63324
659304,1193,,1292423271,4.5,4.0,4.0,Dentist666,Keller Bier / Zwickel Bier,4.0,4.5,Engel Keller Dunkel WRONG BREWERY SEE CRAILSH...,5.3,63324
659305,1193,,1289329962,3.5,3.5,4.0,dqrull,Keller Bier / Zwickel Bier,4.0,4.0,Engel Keller Dunkel WRONG BREWERY SEE CRAILSH...,5.3,63324
1391043,27,,1020244856,3.5,3.5,3.5,Jason,American Adjunct Lager,4.0,3.5,Hard Hat American Beer,3.8,60


Looks like we're only missing brewery names for brewery IDs 1193 and 27. I'm going to check whether a name exists for these IDs anywhere else in the data; if not, I'll look up each brewery with a Google search and add the name manually.
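That check can also be done without reading through printouts. Here is a minimal sketch on a toy frame (the IDs, names, and the `known_names` variable are made up for illustration): it looks up the first non-null `brewery_name` for each `brewery_id`, fills gaps where a name exists elsewhere, and lists the IDs that still need a manual lookup.

```python
import pandas as pd

# Toy stand-in for df_ba; the IDs and names are illustrative only.
toy = pd.DataFrame({
    'brewery_id':   [1, 1, 2, 2, 3],
    'brewery_name': [None, None, None, 'Some Brewery', 'Other Brewery'],
})

# First non-null name seen for each brewery ID (missing if none exists).
known_names = toy.groupby('brewery_id')['brewery_name'].first()

# IDs with no name anywhere in the data need a manual lookup.
needs_lookup = known_names[known_names.isna()].index.tolist()

# Fill gaps from the per-ID lookup wherever a name is available.
toy['brewery_name'] = toy['brewery_name'].fillna(toy['brewery_id'].map(known_names))
```

In this toy data, brewery 2 is filled from its other row, while brewery 1 ends up in `needs_lookup`.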

In [314]:
print(df_ba[df_ba['brewery_id'] == 1193])

print(df_ba[df_ba['brewery_id'] == 27])

        brewery_id brewery_name  review_time  review_overall  review_aroma  \
651565        1193          NaN   1301022066             2.0           2.5   
659293        1193          NaN   1290107698             4.0           4.5   
659299        1193          NaN   1289077001             3.5           3.0   
659300        1193          NaN   1289851033             3.5           4.0   
659301        1193          NaN   1289158632             3.5           4.0   
659302        1193          NaN   1289590065             3.5           4.0   
659303        1193          NaN   1298078926             3.0           3.0   
659304        1193          NaN   1292423271             4.5           4.0   
659305        1193          NaN   1289329962             3.5           3.5   

        review_appearance review_profilename                  beer_style  \
651565                2.5            Knapp85                Vienna Lager   
659293                3.5             dqrull                       

In [315]:
df_ba.loc[df_ba['brewery_id'] == 1193, ['brewery_name']] = 'Engel' 
df_ba.loc[df_ba['brewery_id'] == 27, ['brewery_name']] = 'American Brewing Company' 
df_ba.isna().sum()


brewery_id                0
brewery_name              0
review_time               0
review_overall            0
review_aroma              0
review_appearance         0
review_profilename      348
beer_style                0
review_palate             0
review_taste              0
beer_name                 0
beer_abv              67785
beer_beerid               0
dtype: int64

Adding Engel and American Brewing Company solved our problems in the brewery_name column. Let's tackle the next two columns: review_profilename and beer_abv.

<br>

review_profilename is an interesting column. From the get-go, my intention has been to build a content-based recommendation system, not a collaborative one: I'm more interested in what one person enjoys and finding beers similar to that than in predicting what others like because they liked the same thing. I suspect a collaborative approach would end up recommending other styles of beer, which doesn't translate cleanly to beer preferences, i.e. people seem to stick to the few styles they really enjoy. Because of this, I will be dropping this column entirely.
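To make the content-based idea concrete, here is a minimal sketch using the `TfidfVectorizer` and `cosine_similarity` imports from the top of the notebook. The three beers, their descriptions, and the `similar_beers` helper are made up for illustration; this is not the final model, only the general shape of comparing beers by their own text rather than by who reviewed them.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up beers and descriptions, purely for illustration.
beers = pd.DataFrame({
    'name': ['Hazy One', 'Citrus Bomb', 'Midnight Roast'],
    'text': [
        'juicy hazy ipa with citrus and tropical hops',
        'bright ipa bursting with citrus hops',
        'dark roasted stout with coffee and chocolate',
    ],
})

# Turn each description into a TF-IDF vector, then compare all pairs.
tfidf = TfidfVectorizer(stop_words='english')
matrix = tfidf.fit_transform(beers['text'])
similarity = cosine_similarity(matrix)

def similar_beers(name, top_n=2):
    """Return the top_n beers most similar to `name`, best match first."""
    idx = beers.index[beers['name'] == name][0]
    scores = pd.Series(similarity[idx], index=beers['name']).drop(name)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()

recommendations = similar_beers('Hazy One')
```

No usernames are involved anywhere: the two IPAs rank as closest to each other simply because their descriptions share vocabulary.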

beer_abv is another interesting column; whether ABV affects the taste of a beer across styles is a hotly debated topic. I don't think it will be a massive problem if I estimate ABV for the beers that are missing it based on the style of beer.

The missing values cover a solid chunk of our data (about 67k rows). I think the best approach is to take the mean beer_abv for each beer_style and use it to fill the gaps for that style. It's not a foolproof approach, but it certainly gives us a good indication of the ABV so we can use that data going forward. There will be some outliers, but beer style is generally a good proxy for ABV, since different styles have different typical ABVs.
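The per-style mean fill described above can be sketched with `groupby(...).transform('mean')`, which returns each row's own style mean aligned with the original frame. The toy data below is made up for illustration; the column names match df_ba.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ba; the values are illustrative only.
toy = pd.DataFrame({
    'beer_style': ['Stout', 'Stout', 'Stout', 'Pilsner', 'Pilsner'],
    'beer_abv':   [8.0, 10.0, np.nan, 5.0, np.nan],
})

# Mean ABV of each row's own style, aligned row by row with the frame.
style_mean = toy.groupby('beer_style')['beer_abv'].transform('mean').round(1)

# Only the missing entries are replaced; observed ABVs are left alone.
toy['beer_abv'] = toy['beer_abv'].fillna(style_mean)
```

The missing Stout is filled with the Stout mean and the missing Pilsner with the Pilsner mean, so each gap gets a value from its own style.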


In [316]:
df_ba.drop(['review_profilename'], inplace=True, axis=1)

In [317]:
beer_styles = list(df_ba['beer_style'].unique())

beer_abv = {}

for style in beer_styles:
    # Mean ABV across every review of this style, rounded to one decimal
    func_df = df_ba[df_ba['beer_style'] == style]
    mean = round(func_df['beer_abv'].mean(), 1)
    beer_abv.update({style: mean})

In [318]:
# Let us check to make sure that worked, which it does!
beer_abv

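# Let's chuck this back into our original dataframe.
# Map each row's style to its mean ABV so every gap is filled with the mean of
# its own style (a plain fillna(value) would fill every gap with one value).

df_ba['beer_abv'] = df_ba['beer_abv'].fillna(df_ba['beer_style'].map(beer_abv))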
    
# Now let's look back at our isna/sum to see if we're missing any data still.

df_ba.isna().sum()

brewery_id           0
brewery_name         0
review_time          0
review_overall       0
review_aroma         0
review_appearance    0
beer_style           0
review_palate        0
review_taste         0
beer_name            0
beer_abv             0
beer_beerid          0
dtype: int64

That leaves df_ba with no missing values, which is extremely exciting. Let's see what our other data sources need in order to be cleaned up.

We'll take a look at the df_beers dataframe next.

In [319]:
df_beers.head()

Unnamed: 0,id,name,brewery_id,state,country,style,availability,abv,notes,retired
0,202522,Olde Cogitator,2199,CA,US,English Oatmeal Stout,Rotating,7.3,No notes at this time.,f
1,82352,Konrads Stout Russian Imperial Stout,18604,,NO,Russian Imperial Stout,Rotating,10.4,No notes at this time.,f
2,214879,Scottish Right,44306,IN,US,Scottish Ale,Year-round,4.0,No notes at this time.,t
3,320009,MegaMeow Imperial Stout,4378,WA,US,American Imperial Stout,Winter,8.7,Every time this year,f
4,246438,Peaches-N-Cream,44617,PA,US,American Cream Ale,Rotating,5.1,No notes at this time.,f


In [320]:
df_beers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358873 entries, 0 to 358872
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            358873 non-null  int64  
 1   name          358873 non-null  object 
 2   brewery_id    358873 non-null  int64  
 3   state         298147 non-null  object 
 4   country       358719 non-null  object 
 5   style         358872 non-null  object 
 6   availability  358873 non-null  object 
 7   abv           320076 non-null  float64
 8   notes         358827 non-null  object 
 9   retired       358873 non-null  object 
dtypes: float64(1), int64(2), object(7)
memory usage: 27.4+ MB


In [321]:
df_beers.isna().sum()

id                  0
name                0
brewery_id          0
state           60726
country           154
style               1
availability        0
abv             38797
notes              46
retired             0
dtype: int64

1. The sheer number of missing states seems a bit daunting, but the head printout above suggests those beers are international, from countries with no state concept like the one in the US. I haven't verified this, but I'd be willing to bet those beers are brewed outside the US.
2. I'll have to look into the 154 missing country values, but I might just drop them given how insignificant they are next to the ~360k data points in this set. I'll do the same for the single missing style.
3. I'm going to replace the NaN notes with "No notes at this time." so that they match the other placeholder notes.
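The bet in point 1 is cheap to check. A minimal sketch on a toy frame (the rows are made up; `state` and `country` are the df_beers column names): measure what share of the rows with a missing state are non-US, and count any US rows that still lack a state.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_beers; the rows are illustrative only.
toy = pd.DataFrame({
    'state':   ['CA', np.nan, 'IN', np.nan, np.nan],
    'country': ['US', 'NO', 'US', 'BE', 'US'],
})

missing_state = toy['state'].isna()

# Share of the missing-state rows that come from outside the US.
share_non_us = (toy.loc[missing_state, 'country'] != 'US').mean()

# US beers that still lack a state would contradict the assumption.
us_missing_state = int((missing_state & toy['country'].eq('US')).sum())
```

On the real data, a `share_non_us` close to 1 and a small `us_missing_state` would support the assumption; anything else means the US rows need a closer look before filling.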

In [322]:
df_beers.dropna(subset=['country', 'style'], inplace=True)
# Use the same placeholder (with the trailing period) as the existing notes
df_beers['notes'] = df_beers['notes'].fillna('No notes at this time.')
df_beers.isna().sum()

id                  0
name                0
brewery_id          0
state           60572
country             0
style               0
availability        0
abv             38787
notes               0
retired             0
dtype: int64

Filling in the missing states with "Not Applicable", since state doesn't appear to be populated when the country isn't the US. I'll be dropping abv from this dataframe because we already have a much larger and more exhaustive dataset that contains this information; keeping both would only overlap and could introduce inconsistencies in our data.

In [323]:
df_beers['state'] = df_beers['state'].fillna('Not Applicable')
df_beers.drop(columns='abv', inplace=True)
df_beers.isna().sum()

id              0
name            0
brewery_id      0
state           0
country         0
style           0
availability    0
notes           0
retired         0
dtype: int64

That's another clean dataframe.

df_breweries is our next dataset, so let's see how it looks.

In [324]:
df_breweries.head()

Unnamed: 0,id,name,city,state,country,notes,types
0,19730,Brouwerij Danny,Erpe-Mere,,BE,No notes at this time.,Brewery
1,32541,Coachella Valley Brewing Co,Thousand Palms,CA,US,No notes at this time.,"Brewery, Bar, Beer-to-go"
2,44736,Beef 'O' Brady's,Plant City,FL,US,No notes at this time.,"Bar, Eatery"
3,23372,Broadway Wine Merchant,Oklahoma City,OK,US,No notes at this time.,Store
4,35328,Brighton Beer Dispensary (DUPLICATE),Brighton,GB2,GB,Duplicate of https://www.beeradvocate.com/beer...,"Bar, Eatery"


In [325]:
df_breweries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50347 entries, 0 to 50346
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       50347 non-null  int64 
 1   name     50347 non-null  object
 2   city     50289 non-null  object
 3   state    39076 non-null  object
 4   country  50341 non-null  object
 5   notes    50262 non-null  object
 6   types    50347 non-null  object
dtypes: int64(1), object(6)
memory usage: 2.7+ MB


In [326]:
df_breweries.isna().sum()

id             0
name           0
city          58
state      11271
country        6
notes         85
types          0
dtype: int64

I had zero intention of using brewery data to help predict which beers you'd like, because breweries brew all types of beer. I also didn't fall in love with this dataset: it includes stores and eateries that aren't solely about beer, which could have hurt our modeling process. For those reasons, I decided to axe this dataset and not continue down that path.

<br>

Let's take a look at the reviews dataset now and see what we're working with.

In [327]:
df_reviews.head()

Unnamed: 0,beer_id,username,date,text,look,smell,taste,feel,overall,score
0,271781,bluejacket74,2017-03-17,"750 ml bottle, 2016 vintage, bottle #304 of...",4.0,4.0,4.0,4.25,4.0,4.03
1,125646,_dirty_,2017-12-21,,4.5,4.5,4.5,4.5,4.5,4.5
2,125646,CJDUBYA,2017-12-21,,4.75,4.75,4.75,4.75,4.75,4.75
3,125646,GratefulBeerGuy,2017-12-20,0% 16 oz can. Funny story: As I finally wal...,4.75,4.75,4.5,4.5,4.5,4.58
4,125646,LukeGude,2017-12-20,Classic TH NEIPA. Overflowing head and bouq...,4.25,4.5,4.25,4.25,4.25,4.31


In [328]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9073128 entries, 0 to 9073127
Data columns (total 10 columns):
 #   Column    Dtype  
---  ------    -----  
 0   beer_id   int64  
 1   username  object 
 2   date      object 
 3   text      object 
 4   look      float64
 5   smell     float64
 6   taste     float64
 7   feel      float64
 8   overall   float64
 9   score     float64
dtypes: float64(6), int64(1), object(3)
memory usage: 692.2+ MB


In [329]:
print(len(df_reviews))
df_reviews.isna().sum()

9073128


beer_id           0
username       3815
date              0
text              0
look        3790018
smell       3790018
taste       3790018
feel        3790018
overall     3790018
score             0
dtype: int64

This dataset is an absolute beast and has far too much data to work with comfortably on a local machine, so I decided the best approach was to drop every row with ANY missing values. I told myself I'd revisit this later if I ever needed more data to enhance the model.
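If memory becomes the bottleneck, one option (not what this notebook does, just a sketch) is to read the file in chunks and drop incomplete rows as they stream in. The example uses a small in-memory CSV in place of data/reviews.csv; the rows and the tiny `chunksize` are made up for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for data/reviews.csv; the rows are made up.
csv_text = """beer_id,username,text,look,overall
1,a,nice,4.0,4.5
1,b,,,
2,c,ok,3.5,3.0
3,d,great,,
"""

kept = []

# Read a few rows at a time and keep only complete rows, so the
# full file never has to sit in memory at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    kept.append(chunk.dropna())

reviews_small = pd.concat(kept, ignore_index=True)
```

With the real file, the `io.StringIO(...)` argument would be the CSV path and `chunksize` something far larger.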

In [330]:
df_reviews.dropna(inplace=True)
df_reviews.isna().sum()

beer_id     0
username    0
date        0
text        0
look        0
smell       0
taste       0
feel        0
overall     0
score       0
dtype: int64

With all four datasets reviewed and the ones we're keeping cleaned, it's time to combine them into a single, more focused dataframe that we can begin modeling on.

To keep computing times down, I decided to limit the beers dataframe to beers brewed in the US. Removing that restriction could certainly give better results, but I wanted to keep this app light and nimble in terms of speed. We'd still have over 250k rows of data, so I wasn't concerned about not having enough.

In [331]:
df_usa = df_beers[df_beers['state'] == 'NY']
df_usa.set_index(['id'])

id,name,brewery_id,state,country,style,availability,notes,retired
246535,Helles Bock,44691,NY,US,German Maibock,Limited (brewed once),No notes at this time.,t
317409,Chanukah Hanukkah Pass the Beer 2017,262,NY,US,Belgian Strong Pale Ale,Limited (brewed once),Golden Strong Ale brewed with Cocoa Nibs.,f
230668,Circle Of Trust,35586,NY,US,American IPA,Summer,No notes at this time.,f
216389,Neziah Bliss,45,NY,US,American Strong Ale,Limited (brewed once),No notes at this time.,t
143477,That's A Paddlin',3137,NY,US,American IPA,Limited (brewed once),No notes at this time.,t
...,...,...,...,...,...,...,...,...
205899,Brettstone With Cherries,10607,NY,US,American Brown Ale,Limited (brewed once),Brown Ale aged in brett barrels with cherries,t
276931,Exchange Student,48384,NY,US,Scotch Ale / Wee Heavy,Rotating,No notes at this time.,f
174682,Citrification,8768,NY,US,American Wild Ale,Limited (brewed once),No notes at this time.,t
215395,Asylum Porter,43184,NY,US,English Porter,Year-round,No notes at this time.,f


TODO: change this filter from New York (state == 'NY') to the whole US to make a better model.

In [332]:
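df_beers.set_index(['id'], inplace=True)
# NOTE: df_usa still has its original row labels as index (not 'id'), so this
# join aligns on row labels rather than beer IDs.
new_df = df_ba.join(df_usa, lsuffix="_2")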
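`DataFrame.join` aligns on the index of both frames, and in the cell above only df_beers is re-indexed by `id`, while df_usa keeps the row labels it had when it was filtered. A hedged alternative, assuming `beer_beerid` in df_ba and `id` in the beers data refer to the same BeerAdvocate beer IDs, is to merge on those key columns explicitly. The toy frames below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df_ba and df_usa; the values are illustrative only.
reviews = pd.DataFrame({
    'beer_beerid': [101, 102, 101],
    'review_overall': [4.0, 3.5, 4.5],
})
beers = pd.DataFrame({
    'id': [101, 103],
    'name': ['Beer A', 'Beer C'],
    'style': ['American IPA', 'English Porter'],
})

# Match each review to its beer by ID rather than by row position.
merged = reviews.merge(beers, left_on='beer_beerid', right_on='id', how='left')

# Reviews whose beer is absent from `beers` keep NaN in the beer columns.
matched = int(merged['name'].notna().sum())
```

A left merge keeps every review; switching to `how='inner'` would keep only the reviews whose beer appears in the beers table.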

In [333]:
new_df

Unnamed: 0,brewery_id_2,brewery_name,review_time,review_overall,review_aroma,review_appearance,beer_style,review_palate,review_taste,beer_name,...,beer_beerid,id,name,brewery_id,state,country,style,availability,notes,retired
0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,Hefeweizen,1.5,1.5,Sausa Weizen,...,47986,,,,,,,,,
1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,English Strong Ale,3.0,3.0,Red Moon,...,48213,,,,,,,,,
2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,...,48215,,,,,,,,,
3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,German Pilsener,2.5,3.0,Sausa Pils,...,47969,,,,,,,,,
4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,...,64883,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1586609,14359,The Defiant Brewing Company,1162684892,5.0,4.0,3.5,Pumpkin Ale,4.0,4.0,The Horseman's Ale,...,33061,,,,,,,,,
1586610,14359,The Defiant Brewing Company,1161048566,4.0,5.0,2.5,Pumpkin Ale,2.0,4.0,The Horseman's Ale,...,33061,,,,,,,,,
1586611,14359,The Defiant Brewing Company,1160702513,4.5,3.5,3.0,Pumpkin Ale,3.5,4.0,The Horseman's Ale,...,33061,,,,,,,,,
1586612,14359,The Defiant Brewing Company,1160023044,4.0,4.5,4.5,Pumpkin Ale,4.5,4.5,The Horseman's Ale,...,33061,,,,,,,,,


In [334]:
new_df = new_df[new_df['notes'] != 'No notes at this time.']

In [335]:
new_df

Unnamed: 0,brewery_id_2,brewery_name,review_time,review_overall,review_aroma,review_appearance,beer_style,review_palate,review_taste,beer_name,...,beer_beerid,id,name,brewery_id,state,country,style,availability,notes,retired
0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,Hefeweizen,1.5,1.5,Sausa Weizen,...,47986,,,,,,,,,
1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,English Strong Ale,3.0,3.0,Red Moon,...,48213,,,,,,,,,
2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,...,48215,,,,,,,,,
3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,German Pilsener,2.5,3.0,Sausa Pils,...,47969,,,,,,,,,
4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,...,64883,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1586609,14359,The Defiant Brewing Company,1162684892,5.0,4.0,3.5,Pumpkin Ale,4.0,4.0,The Horseman's Ale,...,33061,,,,,,,,,
1586610,14359,The Defiant Brewing Company,1161048566,4.0,5.0,2.5,Pumpkin Ale,2.0,4.0,The Horseman's Ale,...,33061,,,,,,,,,
1586611,14359,The Defiant Brewing Company,1160702513,4.5,3.5,3.0,Pumpkin Ale,3.5,4.0,The Horseman's Ale,...,33061,,,,,,,,,
1586612,14359,The Defiant Brewing Company,1160023044,4.0,4.5,4.5,Pumpkin Ale,4.5,4.5,The Horseman's Ale,...,33061,,,,,,,,,


In [336]:
df_reviews.set_index(['beer_id'], inplace=True)
new_df = new_df.join(df_reviews, how='inner', lsuffix="_3")
new_df = new_df.reset_index()
print(new_df.head())
new_df.rename(columns={'index' : 'beer_id'})

   index  brewery_id_2             brewery_name  review_time  review_overall  \
0      3         10325          Vecchio Birraio   1234725145             3.0   
1      3         10325          Vecchio Birraio   1234725145             3.0   
2      3         10325          Vecchio Birraio   1234725145             3.0   
3      4          1075  Caldera Brewing Company   1293735206             4.0   
4      4          1075  Caldera Brewing Company   1293735206             4.0   

   review_aroma  review_appearance                      beer_style  \
0           3.0                3.5                 German Pilsener   
1           3.0                3.5                 German Pilsener   
2           3.0                3.5                 German Pilsener   
3           4.5                4.0  American Double / Imperial IPA   
4           4.5                4.0  American Double / Imperial IPA   

   review_palate  review_taste  ... retired     username        date  \
0            2.5          

Unnamed: 0,beer_id,brewery_id_2,brewery_name,review_time,review_overall,review_aroma,review_appearance,beer_style,review_palate,review_taste,...,retired,username,date,text,look,smell,taste,feel,overall,score
0,3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,German Pilsener,2.5,3.0,...,,MAB,2003-03-17,"The label is very informative, except it di...",4.00,4.50,4.50,4.00,4.50,4.42
1,3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,German Pilsener,2.5,3.0,...,,Morris729,2002-11-20,"According to the label, this beer is dedica...",4.00,3.50,4.00,4.00,4.00,3.88
2,3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,German Pilsener,2.5,3.0,...,,Jason,2002-04-17,Presentation: 16oz brown with no freshness ...,3.50,4.00,4.00,4.50,4.00,4.02
3,4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,American Double / Imperial IPA,4.0,4.5,...,,shapudding,2005-03-21,"No dating. Had this one in the fridge, then...",4.50,4.50,2.50,2.50,3.00,3.20
4,4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,American Double / Imperial IPA,4.0,4.5,...,,allengarvin,2005-03-09,Picked up a bottle at Hall's in Colleyville...,3.50,3.50,3.50,4.00,3.50,3.55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5107143,372874,22,Unibroue,1026760336,4.5,5.0,4.5,Tripel,4.5,4.5,...,,smcolw,2018-09-29,"Cloudy, naturally. Gold to amber color. Mod...",4.00,4.50,4.50,4.00,4.25,4.37
5107144,372914,22,Unibroue,1299216465,4.5,4.0,3.5,Tripel,4.5,4.0,...,,brentk56,2018-09-29,Appearance: Arrives cloudy and the color of...,4.00,4.50,4.50,4.00,4.25,4.37
5107145,373052,22,Unibroue,1288793189,5.0,4.5,4.5,Tripel,5.0,5.0,...,,ruzzal,2018-09-30,,4.25,4.25,4.25,4.25,4.25,4.25
5107146,373052,22,Unibroue,1288793189,5.0,4.5,4.5,Tripel,5.0,5.0,...,,Dreynolds1808,2018-09-30,,4.25,4.00,4.00,4.25,4.25,4.09


In [337]:
new_df = new_df.rename(columns={'index' : 'beer_id'})

Creating a smaller dataframe now so that we can focus on only the columns we need.

In [338]:
new_df

Unnamed: 0,beer_id,brewery_id_2,brewery_name,review_time,review_overall,review_aroma,review_appearance,beer_style,review_palate,review_taste,...,retired,username,date,text,look,smell,taste,feel,overall,score
0,3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,German Pilsener,2.5,3.0,...,,MAB,2003-03-17,"The label is very informative, except it di...",4.00,4.50,4.50,4.00,4.50,4.42
1,3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,German Pilsener,2.5,3.0,...,,Morris729,2002-11-20,"According to the label, this beer is dedica...",4.00,3.50,4.00,4.00,4.00,3.88
2,3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,German Pilsener,2.5,3.0,...,,Jason,2002-04-17,Presentation: 16oz brown with no freshness ...,3.50,4.00,4.00,4.50,4.00,4.02
3,4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,American Double / Imperial IPA,4.0,4.5,...,,shapudding,2005-03-21,"No dating. Had this one in the fridge, then...",4.50,4.50,2.50,2.50,3.00,3.20
4,4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,American Double / Imperial IPA,4.0,4.5,...,,allengarvin,2005-03-09,Picked up a bottle at Hall's in Colleyville...,3.50,3.50,3.50,4.00,3.50,3.55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5107143,372874,22,Unibroue,1026760336,4.5,5.0,4.5,Tripel,4.5,4.5,...,,smcolw,2018-09-29,"Cloudy, naturally. Gold to amber color. Mod...",4.00,4.50,4.50,4.00,4.25,4.37
5107144,372914,22,Unibroue,1299216465,4.5,4.0,3.5,Tripel,4.5,4.0,...,,brentk56,2018-09-29,Appearance: Arrives cloudy and the color of...,4.00,4.50,4.50,4.00,4.25,4.37
5107145,373052,22,Unibroue,1288793189,5.0,4.5,4.5,Tripel,5.0,5.0,...,,ruzzal,2018-09-30,,4.25,4.25,4.25,4.25,4.25,4.25
5107146,373052,22,Unibroue,1288793189,5.0,4.5,4.5,Tripel,5.0,5.0,...,,Dreynolds1808,2018-09-30,,4.25,4.00,4.00,4.25,4.25,4.09


In [339]:
rcmd = new_df[['beer_id', 'style', 'text', 'notes',
               'name', 'look', 'smell', 'taste', 'feel', 'overall', 'score']]

In [340]:
rcmd

Unnamed: 0,beer_id,style,text,notes,name,look,smell,taste,feel,overall,score
0,3,,"The label is very informative, except it di...",,,4.00,4.50,4.50,4.00,4.50,4.42
1,3,,"According to the label, this beer is dedica...",,,4.00,3.50,4.00,4.00,4.00,3.88
2,3,,Presentation: 16oz brown with no freshness ...,,,3.50,4.00,4.00,4.50,4.00,4.02
3,4,,"No dating. Had this one in the fridge, then...",,,4.50,4.50,2.50,2.50,3.00,3.20
4,4,,Picked up a bottle at Hall's in Colleyville...,,,3.50,3.50,3.50,4.00,3.50,3.55
...,...,...,...,...,...,...,...,...,...,...,...
5107143,372874,,"Cloudy, naturally. Gold to amber color. Mod...",,,4.00,4.50,4.50,4.00,4.25,4.37
5107144,372914,,Appearance: Arrives cloudy and the color of...,,,4.00,4.50,4.50,4.00,4.25,4.37
5107145,373052,,,,,4.25,4.25,4.25,4.25,4.25,4.25
5107146,373052,,,,,4.25,4.00,4.00,4.25,4.25,4.09


In [341]:
rcmd.isna().sum()

beer_id          0
style      5061503
text             0
notes      5061503
name       5061503
look             0
smell            0
taste            0
feel             0
overall          0
score            0
dtype: int64

In [342]:
# Reassign instead of dropping in place on a slice of new_df,
# which avoids pandas' SettingWithCopyWarning.
rcmd = rcmd.dropna()


In [343]:
rcmd.isna().sum()

beer_id    0
style      0
text       0
notes      0
name       0
look       0
smell      0
taste      0
feel       0
overall    0
score      0
dtype: int64

In [344]:
# Average only the numeric columns; the text columns cannot be averaged
rec_group = rcmd.groupby(['name'], as_index=False).mean(numeric_only=True)
rec_group['beer_id'] = rec_group['beer_id'].astype(str).apply(lambda x: x.replace('.0',''))

rec_group['beer_id'] = rec_group['beer_id'].astype(float)
rec_group.set_index(['beer_id'], inplace=True)
rec_group.info()

<class 'pandas.core.frame.DataFrame'>
Float64Index: 1371 entries, 92847.0 to 263860.0
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   name     1371 non-null   object 
 1   look     1371 non-null   float64
 2   smell    1371 non-null   float64
 3   taste    1371 non-null   float64
 4   feel     1371 non-null   float64
 5   overall  1371 non-null   float64
 6   score    1371 non-null   float64
dtypes: float64(6), object(1)
memory usage: 85.7+ KB
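Grouping by `name` and averaging `beer_id`, as the cell above does, can blend IDs whenever two different beers share a name. A minimal sketch of one way around that, on made-up data: group on the ID and the name together so the ID is never averaged.

```python
import pandas as pd

# Toy stand-in for rcmd; two different beers share the name 'Pale Ale'.
toy = pd.DataFrame({
    'beer_id': [10, 10, 20, 30],
    'name':    ['Pale Ale', 'Pale Ale', 'Pale Ale', 'Porter'],
    'style':   ['APA', 'APA', 'APA', 'Porter'],
    'overall': [4.0, 3.0, 5.0, 4.5],
})

# Group on the ID (carrying the name along) and average only numeric columns.
per_beer = (toy.groupby(['beer_id', 'name'], as_index=False)
               .mean(numeric_only=True))
```

Here the two 'Pale Ale' beers stay separate rows with their own integer IDs, so no string or float round trip on `beer_id` is needed afterwards.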


In [345]:
print(len(rec_group))
rec_group

1371


beer_id,name,look,smell,taste,feel,overall,score
92847.0,1/2 Baked Peanut Butter Porter,3.740000,3.400000,3.380000,3.510000,3.390000,3.424000
260733.0,1609 Amber Ale,3.888889,3.833333,3.722222,3.722222,3.722222,3.760000
1246.0,19-33 Pilsner,3.692149,3.539256,3.697314,3.672521,3.818182,3.681157
277295.0,2XIBA,3.932927,3.871951,3.987805,3.951220,3.945122,3.946098
88298.0,2XONE - Mosaic (2014),4.373955,4.337793,4.342391,4.348035,4.336747,4.344482
...,...,...,...,...,...,...,...
172659.0,Zap,3.583333,3.250000,3.583333,3.250000,3.416667,3.436667
124262.0,Zizania,4.047297,3.918919,3.979730,3.972973,3.993243,3.973243
212274.0,Zuul,4.083333,3.750000,4.000000,4.000000,3.916667,3.930000
126891.0,dHop2,4.259887,4.177260,4.229520,4.182203,4.199153,4.209887


In [346]:
# Take an explicit copy so new columns can be added without a SettingWithCopyWarning
rec_words = rcmd[['text', 'notes', 'style', 'beer_id']].copy()
rec_words['combined'] = rec_words['text'] + ' ' + rec_words['notes'] + ' ' + rec_words['style']


In [347]:
# Reassign rather than dropping in place on a slice
rec_words = rec_words.drop(columns=['text', 'notes'])


In [348]:
rec_words

,style,beer_id,combined
77644,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...
77645,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...
77646,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...
77647,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...
77648,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...
...,...,...,...
5096518,American IPA,357966,"Amarillo, Cascade, Idaho 7 and East Kent Go..."
5096519,American IPA,357966,Poured into a tulip. The appearance was a b...
5096520,American IPA,357966,"Amarillo, Cascade, Idaho 7 and East Kent Go..."
5096521,American IPA,357966,I am loving all things citra this summer an...


In [349]:
# Join every review for a beer into one string and repeat it on each of that beer's rows
rec_words['combined_concat'] = rec_words.groupby(['beer_id'])['combined'].transform(lambda x : ' '.join(x))

rec_words


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  rec_words['combined_concat'] = rec_words.groupby(['beer_id'])['combined'].transform(lambda x : ' '.join(x))


,style,beer_id,combined,combined_concat
77644,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...,Golden Strong Ale brewed with Cocoa Nibs. ...
77645,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...,Golden Strong Ale brewed with Cocoa Nibs. ...
77646,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...,Golden Strong Ale brewed with Cocoa Nibs. ...
77647,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...,Golden Strong Ale brewed with Cocoa Nibs. ...
77648,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...,Golden Strong Ale brewed with Cocoa Nibs. ...
...,...,...,...,...
5096518,American IPA,357966,"Amarillo, Cascade, Idaho 7 and East Kent Go...",16 oz can into IPA glass. Dated 7/10/18. Ni...
5096519,American IPA,357966,Poured into a tulip. The appearance was a b...,16 oz can into IPA glass. Dated 7/10/18. Ni...
5096520,American IPA,357966,"Amarillo, Cascade, Idaho 7 and East Kent Go...",16 oz can into IPA glass. Dated 7/10/18. Ni...
5096521,American IPA,357966,I am loving all things citra this summer an...,16 oz can into IPA glass. Dated 7/10/18. Ni...


In [350]:
# Compare the length of a single review's text with the concatenated text for the same beer.
print(len(rec_words['combined'].iloc[0]))
print(len(rec_words['combined_concat'].iloc[0]))

print(len(rec_words['combined'].iloc[1]))
print(len(rec_words['combined_concat'].iloc[1]))

69
185302
69
185302


In [351]:
rec_words.drop(columns=['combined'], inplace=True)
rec_words

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


,style,beer_id,combined_concat
77644,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...
77645,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...
77646,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...
77647,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...
77648,Belgian Strong Pale Ale,120,Golden Strong Ale brewed with Cocoa Nibs. ...
...,...,...,...
5096518,American IPA,357966,16 oz can into IPA glass. Dated 7/10/18. Ni...
5096519,American IPA,357966,16 oz can into IPA glass. Dated 7/10/18. Ni...
5096520,American IPA,357966,16 oz can into IPA glass. Dated 7/10/18. Ni...
5096521,American IPA,357966,16 oz can into IPA glass. Dated 7/10/18. Ni...


In [352]:
rec_words = rec_words.drop_duplicates()

In [353]:
rec_words = rec_words.set_index(['beer_id'])

In [354]:
rec_words

beer_id,style,combined_concat
120,Belgian Strong Pale Ale,Golden Strong Ale brewed with Cocoa Nibs. ...
195,American Imperial IPA,Dry hopped with Lemondrop and Simcoe Americ...
666,American Imperial IPA,Always and Forever Imperial IPA (7.8%) is t...
751,American Black Ale,Cascadian Dark Ale American Black Ale Ca...
807,American IPA,Oops! I Mangoed My Pants! takes a twist on ...
...,...,...
355414,American IPA,"IPA w / amarillo, chinook + mosaic American..."
356923,American IPA,"0% As a few other reviewers, I was expectin..."
357086,New England IPA,0% Gravity of Smile is a hazy double IPA dr...
357699,American Wild Ale,A salivary gland surge! Subtly sweet and so...
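
The `transform`, `drop_duplicates`, and `set_index` steps above end with one row per beer. A minimal sketch of a shorter route to the same shape, using made-up review rows and assuming each `beer_id` has a single style, is a `groupby` with named aggregation:

```python
import pandas as pd

# Hypothetical review-level rows: several reviews per beer_id.
reviews = pd.DataFrame({
    "beer_id": [120, 120, 357966],
    "style": ["Belgian Strong Pale Ale", "Belgian Strong Pale Ale", "American IPA"],
    "combined": ["golden strong ale", "cocoa nibs", "citra hops"],
})

# One row per beer: keep the first style seen and join all review text.
per_beer = reviews.groupby("beer_id").agg(
    style=("style", "first"),
    combined_concat=("combined", " ".join),
)
print(per_beer)
```

The result is already indexed by `beer_id`, so no de-duplication pass is needed afterwards.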


In [355]:
# Join ratings and text on beer_id; the two frames share no column names, so no suffix is needed.
final_rec = rec_group.join(rec_words)

In [356]:
final_rec.dropna(inplace=True)
print(final_rec.isna().sum())
final_rec

name               0
look               0
smell              0
taste              0
feel               0
overall            0
score              0
style              0
combined_concat    0
dtype: int64


beer_id,name,look,smell,taste,feel,overall,score,style,combined_concat
92847.0,1/2 Baked Peanut Butter Porter,3.740000,3.400000,3.380000,3.510000,3.390000,3.424000,American Porter,The balance of hops and malt is perfect to ...
260733.0,1609 Amber Ale,3.888889,3.833333,3.722222,3.722222,3.722222,3.760000,American Amber / Red Ale,1609 isn’t a year. It’s a landmark in time....
1246.0,19-33 Pilsner,3.692149,3.539256,3.697314,3.672521,3.818182,3.681157,German Pilsner,AS THE FIRST real lager made in Queens sinc...
277295.0,2XIBA,3.932927,3.871951,3.987805,3.951220,3.945122,3.946098,American Black Ale,"Brewed with 3 varieties of Hops, and 3 type..."
88298.0,2XONE - Mosaic (2014),4.373955,4.337793,4.342391,4.348035,4.336747,4.344482,American Imperial IPA,Made the purchase at the brewry and waited ...
...,...,...,...,...,...,...,...,...,...
172659.0,Zap,3.583333,3.250000,3.583333,3.250000,3.416667,3.436667,American Imperial IPA,Pours a thin head on a clear straw body. Th...
124262.0,Zizania,4.047297,3.918919,3.979730,3.972973,3.993243,3.973243,Belgian Saison,Summer Saison brewed with wild rice and Ama...
212274.0,Zuul,4.083333,3.750000,4.000000,4.000000,3.916667,3.930000,American Imperial Stout,Received in a beer52 order. Pours hazy tan-...
126891.0,dHop2,4.259887,4.177260,4.229520,4.182203,4.199153,4.209887,American Imperial IPA,dHop 2 is a 8.5% DIPA that investigates the...
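
The join followed by `dropna` keeps only beers that have both ratings and text. A minimal sketch of the same result in one step, using made-up frames, is an inner join. The displayed outputs suggest the ratings index is float while the text index is integer, so the sketch casts to a shared dtype first; that cast is an assumption for illustration, not something the notebook does.

```python
import pandas as pd

# Hypothetical ratings frame with a float beer_id index, as in rec_group.
ratings = pd.DataFrame(
    {"name": ["A", "B", "C"], "overall": [3.4, 3.7, 4.3]},
    index=pd.Index([1.0, 2.0, 3.0], name="beer_id"),
)
# Hypothetical text frame with an integer beer_id index, as in rec_words.
words = pd.DataFrame(
    {"style": ["Porter", "IPA"]},
    index=pd.Index([1, 3], name="beer_id"),
)

# Align the index dtypes, then keep only beers present in both frames.
ratings.index = ratings.index.astype("int64")
joined = ratings.join(words, how="inner")
print(joined)
```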


In [357]:
final_rec.columns

Index(['name', 'look', 'smell', 'taste', 'feel', 'overall', 'score', 'style',
       'combined_concat'],
      dtype='object')

In [359]:
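# Work on the two text columns: review text and beer name.
rec_idens = final_rec[['combined_concat', 'name']]
# Lower-case everything, then strip ASCII punctuation. string.punctuation does not
# include typographic quotes, so curly apostrophes (as in "isn’t") survive.
rec_idens = rec_idens.applymap(lambda x: x.lower())
rec_idens = rec_idens.applymap(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
# Drop English stopwords; a set makes the membership test fast.
sw = set(stopwords.words('english'))
rec_idens['combined_concat'] = rec_idens['combined_concat'].apply(lambda x: ' '.join(word for word in x.split() if word not in sw))
# Strip digits last.
rec_idens = rec_idens.applymap(lambda x: x.translate(str.maketrans('', '', '0123456789')))

final_rec['combined_concat'] = rec_idens['combined_concat']
final_rec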



beer_id,name,look,smell,taste,feel,overall,score,style,combined_concat
92847.0,1/2 Baked Peanut Butter Porter,3.740000,3.400000,3.380000,3.510000,3.390000,3.424000,American Porter,balance hops malt perfect tastes may everyone ...
260733.0,1609 Amber Ale,3.888889,3.833333,3.722222,3.722222,3.722222,3.760000,American Amber / Red Ale,isn’t year it’s landmark time explorers first...
1246.0,19-33 Pilsner,3.692149,3.539256,3.697314,3.672521,3.818182,3.681157,German Pilsner,first real lager made queens since prohibition...
277295.0,2XIBA,3.932927,3.871951,3.987805,3.951220,3.945122,3.946098,American Black Ale,brewed varieties hops types malts oats mustc...
88298.0,2XONE - Mosaic (2014),4.373955,4.337793,4.342391,4.348035,4.336747,4.344482,American Imperial IPA,made purchase brewry waited line minutes love...
...,...,...,...,...,...,...,...,...,...
172659.0,Zap,3.583333,3.250000,3.583333,3.250000,3.416667,3.436667,American Imperial IPA,pours thin head clear straw body aroma grain p...
124262.0,Zizania,4.047297,3.918919,3.979730,3.972973,3.993243,3.973243,Belgian Saison,summer saison brewed wild rice amarillo hops f...
212274.0,Zuul,4.083333,3.750000,4.000000,4.000000,3.916667,3.930000,American Imperial Stout,received beer order pours hazy tanred tulip gl...
126891.0,dHop2,4.259887,4.177260,4.229520,4.182203,4.199153,4.209887,American Imperial IPA,dhop dipa investigates interplay ale strain ...
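
The cleaning steps above can be sketched as one reusable function. This version is an illustration under stated assumptions: `clean_text` and `STOPWORDS` are hypothetical names, the stopword list is a tiny hand-written stand-in for nltk's English list, and typographic quotes are stripped as well, which the notebook's cell does not do.

```python
import string

# Small illustrative stopword list (apostrophe-free forms); the notebook uses
# nltk's English stopwords instead.
STOPWORDS = {"the", "of", "and", "is", "to", "a", "in", "it", "isnt", "its"}

# Characters to delete: ASCII punctuation, typographic quotes, and digits.
_STRIP = str.maketrans("", "", string.punctuation + "’‘“”" + string.digits)

def clean_text(text):
    """Lower-case, strip punctuation and digits, then drop stopwords."""
    words = text.lower().translate(_STRIP).split()
    return " ".join(w for w in words if w not in STOPWORDS)

print(clean_text("1609 isn’t a year. It’s a landmark in time."))
```

Applied to a column, this would look like `final_rec['combined_concat'].apply(clean_text)`.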


In [360]:
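# TF-IDF over unigrams to trigrams, keeping the 500 most frequent terms in the corpus.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, max_features=500)
tfid_vec = tf.fit_transform(final_rec['combined_concat'])
# Cosine similarity of every beer's text vector to the first beer (row 0, the target).
closeness = cosine_similarity(tfid_vec, tfid_vec[0])
closeness = pd.DataFrame(closeness)
closeness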

,0
0,1.000000
1,0.055403
2,0.185208
3,0.177926
4,0.063905
...,...
1355,0.208439
1356,0.186295
1357,0.163070
1358,0.129456
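
To make the text-similarity step concrete, here is a small sketch on three made-up descriptions, using the same vectorizer settings. The documents are invented for illustration; the first one plays the role of the target beer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical cleaned descriptions; docs[0] is the target.
docs = [
    "roasty porter chocolate coffee malt",
    "porter chocolate roasted malt smooth",
    "hazy ipa citrus hops juicy",
]

tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), max_features=500)
vecs = tf.fit_transform(docs)

# One similarity per document, measured against the target.
sims = cosine_similarity(vecs, vecs[0]).ravel()
print(sims)
```

The second description shares several terms with the target and scores well above the third, which shares none.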


In [361]:
final_rec = final_rec.reset_index()

In [362]:
final_rec = final_rec.join(closeness)
final_rec = final_rec.rename(columns={0 : 'similarity'})

In [363]:
final_rec

,beer_id,name,look,smell,taste,feel,overall,score,style,combined_concat,similarity
0,92847.0,1/2 Baked Peanut Butter Porter,3.740000,3.400000,3.380000,3.510000,3.390000,3.424000,American Porter,balance hops malt perfect tastes may everyone ...,1.000000
1,260733.0,1609 Amber Ale,3.888889,3.833333,3.722222,3.722222,3.722222,3.760000,American Amber / Red Ale,isn’t year it’s landmark time explorers first...,0.055403
2,1246.0,19-33 Pilsner,3.692149,3.539256,3.697314,3.672521,3.818182,3.681157,German Pilsner,first real lager made queens since prohibition...,0.185208
3,277295.0,2XIBA,3.932927,3.871951,3.987805,3.951220,3.945122,3.946098,American Black Ale,brewed varieties hops types malts oats mustc...,0.177926
4,88298.0,2XONE - Mosaic (2014),4.373955,4.337793,4.342391,4.348035,4.336747,4.344482,American Imperial IPA,made purchase brewry waited line minutes love...,0.063905
...,...,...,...,...,...,...,...,...,...,...,...
1355,172659.0,Zap,3.583333,3.250000,3.583333,3.250000,3.416667,3.436667,American Imperial IPA,pours thin head clear straw body aroma grain p...,0.208439
1356,124262.0,Zizania,4.047297,3.918919,3.979730,3.972973,3.993243,3.973243,Belgian Saison,summer saison brewed wild rice amarillo hops f...,0.186295
1357,212274.0,Zuul,4.083333,3.750000,4.000000,4.000000,3.916667,3.930000,American Imperial Stout,received beer order pours hazy tanred tulip gl...,0.163070
1358,126891.0,dHop2,4.259887,4.177260,4.229520,4.182203,4.199153,4.209887,American Imperial IPA,dhop dipa investigates interplay ale strain ...,0.129456


In [364]:
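# Drop the aggregate score, then build the feature matrix:
# the five ratings plus the text similarity to the target beer.
final_rec.drop(columns=['score'], inplace=True)
rec_group_2 = final_rec[['look', 'smell', 'taste', 'feel', 'overall', 'similarity']]
rec_group_2 = rec_group_2.to_numpy()
print(len(rec_group_2))
rec_group_2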


1360


array([[3.74      , 3.4       , 3.38      , 3.51      , 3.39      ,
        1.        ],
       [3.88888889, 3.83333333, 3.72222222, 3.72222222, 3.72222222,
        0.05540257],
       [3.69214876, 3.5392562 , 3.69731405, 3.67252066, 3.81818182,
        0.18520788],
       ...,
       [4.08333333, 3.75      , 4.        , 4.        , 3.91666667,
        0.16307026],
       [4.25988701, 4.17725989, 4.22951977, 4.18220339, 4.19915254,
        0.12945619],
       [3.58333333, 3.83333333, 3.83333333, 3.58333333, 3.83333333,
        0.16465146]])

In [365]:

target = final_rec.iloc[0]
target

beer_id                                                        92847
name                                  1/2 Baked Peanut Butter Porter
look                                                            3.74
smell                                                            3.4
taste                                                           3.38
feel                                                            3.51
overall                                                         3.39
style                                                American Porter
combined_concat    balance hops malt perfect tastes may everyone ...
similarity                                                         1
Name: 0, dtype: object

In [366]:
# Cosine similarity between the target beer (row 0) and every beer,
# computed over the five ratings plus the text-similarity column.
target_vec = rec_group_2[0]
numerators = rec_group_2.dot(target_vec)
denominators = np.linalg.norm(target_vec) * np.linalg.norm(rec_group_2, axis=1)

results = pd.DataFrame(numerators / denominators, columns=['final_recommendation'])


In [367]:
results

,final_recommendation
0,1.000000
1,0.992259
2,0.993437
3,0.993420
4,0.992045
...,...
1355,0.993855
1356,0.993806
1357,0.993611
1358,0.992931
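
The scores above are cosine similarities, cos(a, b) = a·b / (‖a‖ ‖b‖), between the target row and each beer. A minimal numpy sketch, using the first three rows of the array printed earlier, reproduces the calculation; `cosine_to_first` is a hypothetical helper name.

```python
import numpy as np

# First three rows of rec_group_2: five ratings plus the text similarity.
X = np.array([
    [3.74, 3.40, 3.38, 3.51, 3.39, 1.0],
    [3.88888889, 3.83333333, 3.72222222, 3.72222222, 3.72222222, 0.05540257],
    [3.69214876, 3.5392562, 3.69731405, 3.67252066, 3.81818182, 0.18520788],
])

def cosine_to_first(rows):
    """Cosine similarity of every row to row 0."""
    target = rows[0]
    return rows @ target / (np.linalg.norm(rows, axis=1) * np.linalg.norm(target))

scores = cosine_to_first(X)
print(scores)
```

The printed values should be close to the first three rows of the results table. Because every rating is positive and of similar size, the vectors point in nearly the same direction, which is why all the scores sit just below 1.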


In [368]:
final_rec = final_rec.join(results)

In [369]:
final_rec

,beer_id,name,look,smell,taste,feel,overall,style,combined_concat,similarity,final_recommendation
0,92847.0,1/2 Baked Peanut Butter Porter,3.740000,3.400000,3.380000,3.510000,3.390000,American Porter,balance hops malt perfect tastes may everyone ...,1.000000,1.000000
1,260733.0,1609 Amber Ale,3.888889,3.833333,3.722222,3.722222,3.722222,American Amber / Red Ale,isn’t year it’s landmark time explorers first...,0.055403,0.992259
2,1246.0,19-33 Pilsner,3.692149,3.539256,3.697314,3.672521,3.818182,German Pilsner,first real lager made queens since prohibition...,0.185208,0.993437
3,277295.0,2XIBA,3.932927,3.871951,3.987805,3.951220,3.945122,American Black Ale,brewed varieties hops types malts oats mustc...,0.177926,0.993420
4,88298.0,2XONE - Mosaic (2014),4.373955,4.337793,4.342391,4.348035,4.336747,American Imperial IPA,made purchase brewry waited line minutes love...,0.063905,0.992045
...,...,...,...,...,...,...,...,...,...,...,...
1355,172659.0,Zap,3.583333,3.250000,3.583333,3.250000,3.416667,American Imperial IPA,pours thin head clear straw body aroma grain p...,0.208439,0.993855
1356,124262.0,Zizania,4.047297,3.918919,3.979730,3.972973,3.993243,Belgian Saison,summer saison brewed wild rice amarillo hops f...,0.186295,0.993806
1357,212274.0,Zuul,4.083333,3.750000,4.000000,4.000000,3.916667,American Imperial Stout,received beer order pours hazy tanred tulip gl...,0.163070,0.993611
1358,126891.0,dHop2,4.259887,4.177260,4.229520,4.182203,4.199153,American Imperial IPA,dhop dipa investigates interplay ale strain ...,0.129456,0.992931


In [371]:
final_rec.isna().sum()

beer_id                 0
name                    0
look                    0
smell                   0
taste                   0
feel                    0
overall                 0
style                   0
combined_concat         0
similarity              0
final_recommendation    0
dtype: int64

In [372]:
# Skip row 0 (the target itself) and take the position of the highest score.
results[1:].idxmax()

final_recommendation    1278
dtype: int64

In [373]:
final_rec.iloc[1278]

beer_id                                                             70515
name                                                               Unfurl
look                                                              3.78846
smell                                                             3.57692
taste                                                             3.69231
feel                                                              3.76923
overall                                                           3.69231
style                                                     American Porter
combined_concat         porter brewed seven varieties malts two variet...
similarity                                                       0.580745
final_recommendation                                             0.997942
Name: 1278, dtype: object
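
The lookup above returns the single best match. A minimal sketch of returning several recommendations at once, assuming a frame shaped like `final_rec` and using a hypothetical helper `top_recommendations`, could look like this. The four rows reuse names and scores shown in the outputs above.

```python
import pandas as pd

# A few rows in the shape of final_rec; row 0 is the target beer.
final_rec = pd.DataFrame({
    "name": ["1/2 Baked Peanut Butter Porter", "1609 Amber Ale", "19-33 Pilsner", "Unfurl"],
    "style": ["American Porter", "American Amber / Red Ale", "German Pilsner", "American Porter"],
    "final_recommendation": [1.0, 0.992259, 0.993437, 0.997942],
})

def top_recommendations(df, target_pos=0, n=3):
    """Return the n highest-scoring beers, excluding the target row."""
    others = df.drop(index=df.index[target_pos])
    cols = ["name", "style", "final_recommendation"]
    return others.nlargest(n, "final_recommendation")[cols]

top = top_recommendations(final_rec, n=2)
print(top)
```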