## About / Notes

### This code is an edited version of the movie recommender notebook by Kevin Liao found [here](https://github.com/KevinLiao159/MyDataSciencePortfolio/blob/master/movie_recommender/movie_recommendation_using_KNN.ipynb)

- Collaborative filtering doesn't rely on information about the items or the users themselves; instead, it uses user feedback (here, ratings) on the items to determine which items are most similar.
- It can be good at recommending items that are categorically different from the input (e.g., the input is an IPA and the recommendation is a Gose).
- Cold start problem: new items with fewer ratings are less likely to be recommended, particularly when the data is sparse.


KNN collaborative filtering approach to recommending beers based on a sparse dataset.

- num reviews: 4,837,392 (out of 5,487,730)
- num beers: 9,999 (out of 24,542)
- num users: 52,707 (out of 101,574)
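
To make "sparse" concrete, the counts above imply how much of the beer x user matrix actually holds a rating. This is a quick check, assuming each user rates a given beer at most once (the variable names are only for illustration):

```python
# Counts taken from the list above (after filtering).
num_reviews = 4_837_392
num_beers = 9_999
num_users = 52_707

# Total number of cells in the beer x user matrix.
num_cells = num_beers * num_users

# Fraction of cells that hold a rating (assumes one rating per beer/user pair).
density = num_reviews / num_cells

print(f"cells:   {num_cells:,}")
print(f"density: {density:.2%}")
```

Under that assumption, fewer than 1% of the cells are filled.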

An **item-based approach** allows the similarity computations to be done offline and then served, because the set of items changes less often than the set of users.
A user-based approach would need to be updated and retrained too frequently.
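
A minimal sketch of what "computed offline and then served" could look like. The toy ratings matrix, the `precompute_neighbors` helper, and the dictionary lookup table are illustrative assumptions and not part of this notebook, which queries the fitted model at request time.

```python
import numpy as np

def precompute_neighbors(item_user, k):
    """Return {item index: [k most cosine-similar item indices]} (hypothetical helper)."""
    # Scale every item row to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(item_user, axis=1, keepdims=True)
    unit = item_user / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit.T
    # An item should never be recommended for itself.
    np.fill_diagonal(sims, -np.inf)
    # Sort each row by descending similarity and keep the first k columns.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    return {item: neighbors.tolist() for item, neighbors in enumerate(top_k)}

# Toy item x user ratings (rows are items, columns are users, 0 = not rated).
ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Offline step: build the lookup table once.
neighbor_table = precompute_neighbors(ratings, k=2)

# Serving step: a recommendation is now a dictionary lookup.
print(neighbor_table[0])
```

The dense similarity matrix here only suits small inputs; for a matrix the size of the beer data, the same table could be filled from batched `kneighbors` calls instead.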

**euclidean distance vs cosine similarity:**
- **cosine:**
  - looks at the angle between two vectors without considering their magnitude
  - useful when the vectors being compared differ greatly in magnitude (for example, a beer with thousands of ratings versus one with a few dozen); it helps balance the gap and prevents favoring samples based on how many nonzero entries they have rather than on the similarity of their values.
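
The difference can be seen on three small vectors. The vectors and helper functions below are made up for illustration; `beer_b` points in the same direction as `beer_a` but is twice as long, while `beer_c` is close in magnitude but points elsewhere.

```python
import math

def cosine_distance(a, b):
    """1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical vectors over the same four users.
beer_a = [1.0, 2.0, 0.0, 1.0]
beer_b = [2.0, 4.0, 0.0, 2.0]   # same direction as beer_a, twice the magnitude
beer_c = [2.0, 1.0, 0.0, 1.0]   # similar magnitude, different direction

for name, other in [('b', beer_b), ('c', beer_c)]:
    print(name,
          'cosine:', round(cosine_distance(beer_a, other), 3),
          'euclidean:', round(euclidean_distance(beer_a, other), 3))
```

Euclidean distance ranks `beer_c` as the closer neighbor of `beer_a`, while cosine distance ranks `beer_b` first because it ignores the difference in magnitude.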

**choosing a nearest neighbor algorithm:**
- **brute:** can be very slow for large datasets with high dimensionality
- **ball tree:**
  - recursively divides the data into nodes, each defined by a centroid C and radius r, which reduces the number of candidates to compare against a new data point. The tree filters a query point down to the most similar node, and brute force is then done within that node.
  - works well with sparse data that has low intrinsic dimensionality, but a large portion of the time is spent building the tree relative to doing a single query, so it pays off when many queries are needed (true for a recommender!)
  - `leaf_size` sets the node size at which the tree stops splitting: a very high leaf size results in quick construction but query times closer to brute force, while a very low leaf size results in a lot of time spent traversing the tree.
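
As a rough illustration of the two options, the sketch below fits both on small dense random data (the sizes, seed, and variable names are arbitrary assumptions). It uses the Euclidean metric because, as far as I know, scikit-learn's ball tree accepts neither sparse input nor the cosine metric, which is consistent with the model later in this notebook using `algorithm='brute'`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Small dense toy data; the sizes and seed are arbitrary choices for illustration.
rng = np.random.default_rng(0)
points = rng.random((500, 8))
queries = rng.random((5, 8))

# Brute force compares every query against every stored point.
brute = NearestNeighbors(n_neighbors=3, algorithm='brute', metric='euclidean').fit(points)

# The ball tree is built once at fit time; leaf_size sets where it falls back to brute force.
tree = NearestNeighbors(n_neighbors=3, algorithm='ball_tree', leaf_size=30,
                        metric='euclidean').fit(points)

brute_dist, brute_idx = brute.kneighbors(queries)
tree_dist, tree_idx = tree.kneighbors(queries)

# Both algorithms are exact, so they should agree on the neighbor distances.
print(np.allclose(brute_dist, tree_dist))
```

Because both searches are exact, the choice between them only affects build and query time, not the recommendations.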

**choosing k:**
- **brute:** largely unaffected by the choice of k
- **ball tree:** can slow down with larger k, partially due to internal queueing and the increased difficulty of pruning branches in the query tree.


In [1]:
import os
import time


import math
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

import json
from fuzzywuzzy import fuzz
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# has beer names
df_orig = pd.read_csv('./Beer_Data/reduced_data.csv')
# totally numeric, no beer names
df = pd.read_csv('./Beer_Data/reduced_data_X2.csv')
# beer names to ID's
beer_ids = pd.read_csv('./Beer_Data/beer_ids.csv')

In [3]:
# index beer_ids by beer_id and build the ID -> name and name -> ID maps
beer_ids = beer_ids.set_index('beer_id')
id2beer = beer_ids.to_dict()['beer_full']
beer2id = {name:beer_id for beer_id, name in id2beer.items()}

In [4]:
df

Unnamed: 0,beer_id,user_score,user_id
0,18580,3.75,1
1,18570,4.25,1
2,18581,4.25,1
3,4200,4.25,1
4,1,4.50,1
...,...,...,...
4837387,3583,4.25,101906
4837388,14654,4.00,101906
4837389,1106,3.40,101906
4837390,11819,4.00,101906


In [5]:
# pivot ratings into beer features (one row per beer, one column per user; unrated pairs become 0)
df_beer_features = df.pivot(
    index='beer_id',
    columns='user_id',
    values='user_score'
).fillna(0)
# convert dataframe of beer features to scipy sparse matrix
mat_beer_features = csr_matrix(df_beer_features.values)
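
The dense pivot above allocates one float64 cell per beer/user pair, roughly 4.2 GB for 9,999 x 52,707, before converting to sparse. A possible alternative, sketched here on a made-up toy frame with the same column names as `df`, builds the sparse matrix directly from the long-format ratings. The names `toy`, `beer_pos`, `user_pos`, and `sparse_features` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long-format ratings with the same column names as df.
toy = pd.DataFrame({
    'beer_id':    [10, 10, 20, 30, 30],
    'user_id':    [1, 2, 2, 1, 3],
    'user_score': [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Map each ID to a consecutive row/column position (sorted, like the pivot's index).
beer_index, beer_pos = np.unique(toy['beer_id'].to_numpy(), return_inverse=True)
user_index, user_pos = np.unique(toy['user_id'].to_numpy(), return_inverse=True)

# Build the beer x user matrix directly from (value, (row, col)) triples.
sparse_features = csr_matrix(
    (toy['user_score'].to_numpy(), (beer_pos, user_pos)),
    shape=(len(beer_index), len(user_index)),
)

# Same result as the dense pivot, without allocating the zeros.
dense_features = toy.pivot(index='beer_id', columns='user_id', values='user_score').fillna(0)
print(np.array_equal(sparse_features.toarray(), dense_features.to_numpy()))
```

One difference to keep in mind: this constructor sums duplicate (beer, user) entries, whereas `pivot` raises an error on them. Here `beer_index` plays the role of `df_beer_features.index` when mapping rows back to beer IDs.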

In [6]:
# create mapper from beer name to index by building a list of beer names based on the beer IDs
# found in the rows of df_beer_features
beer_to_idx = {
    beer: i for i, beer in 
    enumerate(list(beer_ids.loc[df_beer_features.index].beer_full))
}

In [7]:
# save beer to index dict to file
with open('beer2idx.json', 'w') as fp:
    json.dump(beer_to_idx, fp)

In [7]:
mat_beer_features[9997]

<1x52707 sparse matrix of type '<class 'numpy.float64'>'
	with 141 stored elements in Compressed Sparse Row format>

In [8]:
# build predictor
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)

In [9]:
# fit predictor with sparse matrix
model_knn.fit(mat_beer_features)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=-1, n_neighbors=20, p=2,
                 radius=1.0)

In [10]:
def fuzzy_matching(mapper, fav_beer, verbose=True):
    """
    Return the index loc (int) of the beer name in the dataset that most closely
    matches fav_beer. If no match is found, return None.

    Parameters
    ----------
    mapper: dict, map beer name to beer index loc in data

    fav_beer: str, name of user input beer

    verbose: bool, print log if True

    Returns
    -------
    index loc (int) of the closest match, or None if no name scores at least 60
    """
    match_tuple = []
    # collect every name whose similarity score (0-100) clears the threshold
    for name, idx in mapper.items():
        ratio = fuzz.ratio(name.lower(), fav_beer.lower())
        if ratio >= 60:
            match_tuple.append((name, idx, ratio))
    # sort by score, best match first
    match_tuple = sorted(match_tuple, key=lambda x: x[2])[::-1]
    if not match_tuple:
        print('Oops! No match is found')
        return None
    if verbose:
        print('Found possible matches in our database: {0}\n'.format([x[0] for x in match_tuple]))
    return match_tuple[0][1]
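
To get a feel for the threshold of 60, the sketch below scores two queries against one full beer name. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, so the exact numbers may differ slightly from what fuzzywuzzy reports; `similarity_score` is a hypothetical helper.

```python
from difflib import SequenceMatcher

def similarity_score(a, b):
    """0-100 similarity score; a stdlib stand-in for fuzz.ratio (hypothetical helper)."""
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))

full_name = 'The Alchemist Heady Topper'

# A query that drops the brewery name still covers much of the full string.
print(similarity_score(full_name, 'heady topper'))

# A shorter query is penalized for every character it leaves out.
print(similarity_score(full_name, 'heady'))
```

Because the score is computed over the whole string, a query needs to cover a good share of the brewery-plus-beer name to clear the threshold.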

In [11]:
def make_recommendation(model_knn, data, mapper, fav_beer, n_recommendations):
    """
    Print the top n similar beer recommendations based on the user's input beer.

    Parameters
    ----------
    model_knn: sklearn model, knn model (it is refit on data at every call)

    data: [beer, user] sparse matrix

    mapper: dict, map beer name to beer index loc in data

    fav_beer: str, name of user input beer

    n_recommendations: int, top n recommendations

    Returns
    -------
    None; the recommendations are printed
    """
    # fit
    model_knn.fit(data)
    # get input beer index
    print('You have input beer:', fav_beer)
    idx = fuzzy_matching(mapper, fav_beer, verbose=True)
    # fuzzy_matching returns None when nothing matched; stop before querying
    if idx is None:
        return
    # inference
    print('Recommendation system: start to making inference')
    print('......\n')
    # ask for one extra neighbor because the closest hit is normally the input beer itself
    distances, indices = model_knn.kneighbors(data[idx], n_neighbors=n_recommendations+1)
    # pair each neighbor index with its distance, closest first
    raw_recommends = sorted(
        zip(indices.squeeze().tolist(), distances.squeeze().tolist()), key=lambda x: x[1])
    # get reverse mapper, idx to beer name
    reverse_mapper = {v: k for k, v in mapper.items()}
    # print recommendations, skipping the first entry (the input beer itself)
    print('Recommendations for {}:'.format(fav_beer))
    for i, (rec_idx, dist) in enumerate(raw_recommends[1:]):
        print('{0}: {1}, with distance of {2}'.format(i+1, reverse_mapper[rec_idx], round(dist, 3)))
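
The `n_recommendations+1` and the `[1:]` slice in the function above exist because querying with a row that is already in the fitted data normally returns that row first. A small sketch on a made-up 4 x 4 matrix (all names and values here are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy beer x user matrix (rows are beers); values are made up for illustration.
toy_features = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
]))

model = NearestNeighbors(metric='cosine', algorithm='brute').fit(toy_features)

query_idx = 0
n_recommendations = 2
distances, indices = model.kneighbors(toy_features[query_idx],
                                      n_neighbors=n_recommendations + 1)

# The first hit is the query row itself at (near) zero distance.
print(indices[0].tolist(), np.round(distances[0], 3).tolist())

# Dropping it leaves the actual recommendations.
recommended = indices[0][1:].tolist()
print(recommended)
```

If two beers had identical rating vectors, the input beer would not be guaranteed to come first, so filtering on the index rather than the position would be the safer choice in that case.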

In [12]:
my_favorite = 'Boston Beer Works - Canal Street Bohemian Pilsner'

make_recommendation(
    model_knn=model_knn,
    data=mat_beer_features,
    fav_beer=my_favorite,
    mapper=beer_to_idx,
    n_recommendations=10)

You have input beer: Boston Beer Works - Canal Street Bohemian Pilsner
Found possible matches in our database: ['Boston Beer Works - Canal Street Boston Red', 'Boston Beer Works - Canal Street Watermelon Ale', 'Boston Beer Works - Canal Street Fenway Pale Ale', 'Boston Beer Works - Canal Street Back Bay IPA', 'Boston Beer Works - Canal Street Bunker Hill Blueberry Ale', 'Dock Street Brewery & Restaurant Dock Street Bohemian Pilsner', 'von Trapp Brewing Bohemian Pilsner']

Recommendation system: start to making inference
......

Recommendations for Boston Beer Works - Canal Street Bohemian Pilsner:
1: Boston Beer Works - Canal Street Bunker Hill Blueberry Ale, with distance of 0.655
2: Boston Beer Works - Canal Street Fenway Pale Ale, with distance of 0.657
3: Boston Beer Works - Canal Street Back Bay IPA, with distance of 0.739
4: Boston Beer Works - Canal Street Watermelon Ale, with distance of 0.827
5: Cape Ann Brewing Company Fisherman's Ale, with distance of 0.831
6: Mayflower Brew

In [19]:
my_favorite = 'heady topper'

make_recommendation(
    model_knn=model_knn,
    data=mat_beer_features,
    fav_beer=my_favorite,
    mapper=beer_to_idx,
    n_recommendations=10)

You have input beer: heady topper
Found possible matches in our database: ['The Alchemist Heady Topper']

Recommendation system: start to making inference
......

Recommendations for heady topper:
1: Russian River Brewing Company Pliny The Elder, with distance of 0.397
2: Founders Brewing Company KBS (Kentucky Breakfast Stout), with distance of 0.405
3: Goose Island Beer Co. Bourbon County Brand Stout, with distance of 0.409
4: 3 Floyds Brewing Co. Zombie Dust, with distance of 0.422
5: Goose Island Beer Co. Bourbon County Brand Coffee Stout, with distance of 0.423
6: Lawson's Finest Liquids Sip Of Sunshine, with distance of 0.424
7: Stone Brewing Enjoy By IPA, with distance of 0.444
8: Maine Beer Company Lunch, with distance of 0.448
9: Firestone Walker Brewing Co. Parabola, with distance of 0.454
10: Ballast Point Brewing Company Sculpin, with distance of 0.456


In [20]:
my_favorite = "Lawsons Sip Of Sunshine"

make_recommendation(
    model_knn=model_knn,
    data=mat_beer_features,
    fav_beer=my_favorite,
    mapper=beer_to_idx,
    n_recommendations=10)

You have input beer: Lawsons Sip Of Sunshine
Found possible matches in our database: ["Lawson's Finest Liquids Sip Of Sunshine", "Lawson's Finest Liquids Double Sunshine", "Lawson's Finest Liquids Triple Sunshine"]

Recommendation system: start to making inference
......

Recommendations for Lawsons Sip Of Sunshine:
1: The Alchemist Focal Banger, with distance of 0.352
2: Tree House Brewing Company Julius, with distance of 0.365
3: Tree House Brewing Company Green, with distance of 0.407
4: The Alchemist Heady Topper, with distance of 0.424
5: Tree House Brewing Company Haze, with distance of 0.425
6: Tree House Brewing Company Alter Ego, with distance of 0.45
7: Fiddlehead Brewing Company Second Fiddle, with distance of 0.456
8: The Alchemist Crusher, with distance of 0.464
9: Maine Beer Company Lunch, with distance of 0.464
10: Trillium Brewing Company Congress Street IPA, with distance of 0.473


In [21]:
my_favorite = "troegs perpetual IPA"

make_recommendation(
    model_knn=model_knn,
    data=mat_beer_features,
    fav_beer=my_favorite,
    mapper=beer_to_idx,
    n_recommendations=10)

You have input beer: troegs perpetual IPA
Found possible matches in our database: ['Tröegs Brewing Company Perpetual IPA']

Recommendation system: start to making inference
......

Recommendations for troegs perpetual IPA:
1: Tröegs Brewing Company Nugget Nectar, with distance of 0.533
2: Tröegs Brewing Company Hopback Amber Ale, with distance of 0.541
3: Ithaca Beer Company Flower Power India Pale Ale, with distance of 0.548
4: Tröegs Brewing Company Java Head Stout, with distance of 0.57
5: Tröegs Brewing Company The Mad Elf, with distance of 0.574
6: Tröegs Brewing Company Troegenator, with distance of 0.582
7: Tröegs Brewing Company Hop Knife Harvest Ale, with distance of 0.584
8: Victory Brewing Company - Downingtown DirtWolf, with distance of 0.591
9: Victory Brewing Company - Downingtown HopDevil, with distance of 0.604
10: Victory Brewing Company - Downingtown Hop Ranch, with distance of 0.607


In [22]:
my_favorite = "Zero Gravity american flatbread Conehead IPA"

make_recommendation(
    model_knn=model_knn,
    data=mat_beer_features,
    fav_beer=my_favorite,
    mapper=beer_to_idx,
    n_recommendations=10)

You have input beer: Zero Gravity american flatbread Conehead IPA
Found possible matches in our database: ['Zero Gravity Craft Brewery / American Flatbread Conehead IPA', 'Zero Gravity Craft Brewery / American Flatbread T.L.A. IPA', 'Zero Gravity Craft Brewery / American Flatbread Narconaut Black IPA', 'Zero Gravity Craft Brewery / American Flatbread Madonna', 'Zero Gravity Craft Brewery / American Flatbread Green State', 'Zero Gravity Craft Brewery / American Flatbread Little Wolf']

Recommendation system: start to making inference
......

Recommendations for Zero Gravity american flatbread Conehead IPA:
1: Fiddlehead Brewing Company Fiddlehead IPA, with distance of 0.591
2: Lost Nation Brewing Gose, with distance of 0.608
3: Lost Nation Brewing Mosaic IPA, with distance of 0.613
4: Lawson's Finest Liquids Super Session #2, with distance of 0.624
5: 14th Star Brewing Co. Tribute Double India Pale Ale, with distance of 0.634
6: Lost Nation Brewing Lost Galaxy, with distance of 0.647
7: