#### About

KNN collaborative filtering approach to recommending beers based on a sparse dataset.

- num reviews: 4,837,392 (out of 5,487,730)
- num beers: 9,999 (out of 24,542)
- num users: 5,2707 (out of 101,574)

**item based approach** will allow the computations to be done offline and then served, as the items change less than users.
A user based approach would need to be updated and retrained too frequently.

**euclidean dist vs cosine similarity:**
- **cosine:**
- looks at the angle between two vectors without considering magnitude
- useful when comparing vectors of different length or dimensionality, helps balance the gap and prevent favoring samples based on number of dimensions rather than similarity between values.

**choosing a nearest neighbor algorithm:**
- **brute:** can be very slow for large datasets with high dimensionality
- **ball tree:** 
- recursively divide data into nodes defined by centroid C and radius r, which reduces the candidates to compare to a new data point. Builds a tree to filter new data points into the most similar node (brute force is then done within node).
- works well with sparse data that is highly intrinsic, but large portion of time is spent building the query tree relative to doing a single query. better when several queries are necessary (true for recommender!)
- leaf_size == node size, very high leaf size results in quick construction but closer query time to brute force. very low leaf size results in lots of time spent filtering through tree.

**choosing k:**
- **brute:** largely unnaffected by choice of k
- **ball tree:** can slow down with larger k partially due to internal queuing and increased difficulty pruning branches in query tree.

This code follows the blog by Kevin Liao about recommenders found [here](https://github.com/KevinLiao159/MyDataSciencePortfolio/blob/master/movie_recommender/movie_recommendation_using_KNN.ipynb)

In [45]:
import os
import time


import math
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

from fuzzywuzzy import fuzz
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [131]:
# has beer names
df_orig = pd.read_csv('./Beer_Data/reduced_data.csv')
# totally numeric, no beer names
df = pd.read_csv('./Beer_Data/reduced_data_X2.csv')
# beer names to ID's
beer_ids = pd.read_csv('./Beer_Data/beer_ids.csv')

In [132]:
# drop index in beer_ids and build maps 
beer_ids = beer_ids.set_index('beer_id')
id2beer = beer_ids.to_dict()['beer_full']
beer2id = {name:beer_id for beer_id, name in beer2id.items()}

In [3]:
df

Unnamed: 0,beer_id,user_score,user_id
0,18580,3.75,1
1,18570,4.25,1
2,18581,4.25,1
3,4200,4.25,1
4,1,4.50,1
...,...,...,...
4837387,3583,4.25,101906
4837388,14654,4.00,101906
4837389,1106,3.40,101906
4837390,11819,4.00,101906


In [6]:
# pivot ratings into movie features
df_beer_features = df.pivot(
    index='beer_id',
    columns='user_id',
    values='user_score'
).fillna(0)
# convert dataframe of movie features to scipy sparse matrix
mat_beer_features = csr_matrix(df_beer_features.values)

In [149]:
# create mapper from beer name to index by building a list of beer names based on the beer IDs
# found in the rows of df_beer_features
beer_to_idx = {
    beer: i for i, beer in 
    enumerate(list(beer_ids.loc[df_beer_features.index].beer_full))
}

In [112]:
mat_beer_features[9997]

<1x52707 sparse matrix of type '<class 'numpy.float64'>'
	with 141 stored elements in Compressed Sparse Row format>

In [41]:
# build predictor
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)

In [43]:
# fit predictor with sparse matrix
model_knn.fit(df_beer_features)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=-1, n_neighbors=20, p=2,
                 radius=1.0)

In [78]:
def fuzzy_matching(mapper, fav_beer, verbose=True):
    """
    return index loc (int) of the closest matching beer name in dataset compared to fav_beer.
    If no match found, return None.
    
    Parameters
    ----------    
    mapper: dict, map beer name to beer index loc in data

    fav_beer: str, name of user input beer
    
    verbose: bool, print log if True

    Return
    ------
    beer ID of the closest match
    """
    match_tuple = []
    # get match
    for name, idx in mapper.items():
        ratio = fuzz.ratio(name.lower(), fav_beer.lower())
        if ratio >= 60:
            match_tuple.append((name, idx, ratio))
    # sort
    match_tuple = sorted(match_tuple, key=lambda x: x[2])[::-1]
    if not match_tuple:
        print('Oops! No match is found')
        return
    if verbose:
        print('Found possible matches in our database: {0}\n'.format([x[0] for x in match_tuple]))
    return match_tuple[0][1]

In [146]:
def make_recommendation(model_knn, data, mapper, fav_beer, n_recommendations):
    """
    return top n similar beer recommendations based on user's input beer


    Parameters
    ----------
    model_knn: sklearn model, knn model (untrained)

    data: [beer,user] matrix

    mapper: dict, map beer name to beer index loc in data

    fav_beer: str, name of user input beer

    n_recommendations: int, top n recommendations

    Return
    ------
    list of top n similar beer recommendations
    """
    # fit
    model_knn.fit(data)
    # get input movie index
    print('You have input beer:', fav_beer)
    idx = fuzzy_matching(mapper, fav_beer, verbose=True)
    # inference
    print('Recommendation system: start to making inference')
    print('......\n')
    distances, indices = model_knn.kneighbors(data[idx], n_neighbors=n_recommendations+1)
    # get list of raw idx of recommendations
    raw_recommends = \
        sorted(list(zip(indices.squeeze().tolist(), distances.squeeze().tolist())), key=lambda x: x[1])[:0:-1]
    # get reverse mapper, idx to beer name
    reverse_mapper = {v: k for k, v in mapper.items()}
    # print recommendations
    print('Recommendations for {}:'.format(fav_beer))
    for i, (idx, dist) in enumerate(raw_recommends):
        print('{0}: {1}, with distance of {2}'.format(i+1, reverse_mapper[idx], dist))

In [70]:
def make_recommendation(model_knn, data, mapper, fav_beer, n_recommendations):
    """
    return top n similar beer recommendations based on user's input beer


    Parameters
    ----------
    model_knn: sklearn model, knn model (untrained)

    data: [beer,user] matrix

    mapper: dict, map beer name to beer ID of the beer in data

    fav_beer: str, name of user input beer

    n_recommendations: int, top n recommendations

    Return
    ------
    list of top n similar beer recommendations
    """
    # fit
    model_knn.fit(data)
    # get input movie index
    print('You have input beer:', fav_beer)
    beer_id = fuzzy_matching(mapper, fav_beer, verbose=True)
    idx = 
    # inference
    print('Recommendation system: start to making inference')
    print('......\n')
    distances, indices = model_knn.kneighbors(data[beer_id], n_neighbors=n_recommendations+1)
    # get list of raw beer_id of recommendations
    raw_recommends = \
        sorted(list(zip(indices.squeeze().tolist(), distances.squeeze().tolist())), key=lambda x: x[1])[:0:-1]
    # get reverse mapper, beer ID to beer name
    reverse_mapper = {v: k for k, v in mapper.items()}
    # print recommendations
    print('Recommendations for {}:'.format(fav_beer))
    for i, (ind, dist) in enumerate(raw_recommends):
        beer_id = df_beer_features.iloc[:,ind].name
        print('{0}: {1}, with distance of {2}'.format(i+1, reverse_mapper[beer_id], dist))

In [150]:
my_favorite = 'Boston Beer Works - Canal Street Bohemian Pilsner'

make_recommendation(
    model_knn=model_knn,
    data=mat_beer_features,
    fav_beer=my_favorite,
    mapper=beer_to_idx,
    n_recommendations=10)

You have input beer: Boston Beer Works - Canal Street Bohemian Pilsner
Found possible matches in our database: ['Boston Beer Works - Canal Street Boston Red', 'Boston Beer Works - Canal Street Watermelon Ale', 'Boston Beer Works - Canal Street Fenway Pale Ale', 'Boston Beer Works - Canal Street Back Bay IPA', 'Boston Beer Works - Canal Street Bunker Hill Blueberry Ale', 'Dock Street Brewery & Restaurant Dock Street Bohemian Pilsner', 'von Trapp Brewing Bohemian Pilsner']

Recommendation system: start to making inference
......

Recommendations for Boston Beer Works - Canal Street Bohemian Pilsner:
1: Mayflower Brewing Company Mayflower Golden Ale, with distance of 0.851501566969167
2: Blue Hills Brewery Blue Hills India Pale Ale, with distance of 0.8479799832362422
3: Mayflower Brewing Company Mayflower Spring Hop, with distance of 0.8450238777690956
4: Wachusett Brewing Company Wachusett Octoberfest Ale, with distance of 0.834590315801979
5: Mayflower Brewing Company Daily Ration, wit

In [77]:
my_favorite = 'heady topper'

make_recommendation(
    model_knn=model_knn,
    data=mat_beer_features,
    fav_beer=my_favorite,
    mapper=beer2id,
    n_recommendations=10)

You have input beer: heady topper
Found possible matches in our database: ['The Alchemist Heady Topper', 'Melvin Brewing / Thai Me Up Dready Copper', 'Tommyknocker Brewery Butthead Doppelbock', 'Chaos Mountain Brewing Mad Hopper', "Fat Head's Brewery & Saloon Head Hunter", 'Trillium Brewing Company Double Dry Hopped Fort Point Pale Ale', 'Trillium Brewing Company Double Dry Hopped Congress Street', 'Trillium Brewing Company Double Dry Hopped Sleeper Street', 'Trillium Brewing Company Double Dry Hopped Scaled', 'Boston Beer Company (Samuel Adams) Samuel Adams Wee Heavy (Imperial Series)', 'Indeed Brewing Company Day Tripper', 'Trillium Brewing Company Double Dry Hopped Summer Street IPA', 'Other Half Brewing Co. Double Dry Hopped Double Mosaic Dream', 'Lion Brewery, Inc. Lionshead Pilsner', 'Other Half Brewing Co. Double Dry Hopped All Citra Everything', 'Other Half Brewing Co. Double Dry Hopped Mylar Bags', 'The Olde Mecklenburg Brewery Copper', 'Other Half Brewing Co. Double Dry Hoppe

IndexError: row index (24537) out of range

In [88]:
id2beer[24537]

'The Alchemist Heady Topper'