## Models
A collaborative filtering model can be built once a user-item matrix with ratings is available.  

### Question:
* Build a Model that recommends to the user the "first item" they may want to place into their "basket"
  * Input: user - customer ID
  * Returns: ranked list of items (product IDs), that the user is most likely to want to put in his/her (empty) "basket"

In `recommend_1.csv`, we provide a list of customer IDs. If you select option 1, use this data to generate a csv file that indicates top 10 recommendations for each of the customers. Note the order of the recommended products should be ordered by user preference, with the most preferred item in the beginning.

Sample output:

`customerId, recommendedProducts
1,0|1|2|3|4|5|6|7|8|9
2,8|3|1|2|4|7|9|10|11|13
3,20|21|22|23|24|25|26|27|28|29
...
`


## Notes on the business use case for evaluation
* The goal of this modeling project is to recommend to the user a list of items that they are most likely to purchase (option 1) or add to their existing basket (option 2).  
* As you are selecting metrics, please keep in mind that 
 1. the primary goal is to successfully recommend as many items in your list that they may be inclined to purchase/add, and 
 2. the secondary goal is that the items are ordered by the user's inclination (the more inclined they are, the higher up in your list of 10 recommendations).

In [1]:
import pandas as pd
import numpy as np
from collections import namedtuple
from itertools import chain
from ordered_set import OrderedSet

from scipy.sparse.linalg import svds
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split

# None disables column-width truncation; passing -1 has been deprecated since pandas 1.0.
pd.set_option('display.max_colwidth', None)

#### I. Reading the transactions data to train the algorithm from and reading test data

In [2]:
trx_df = pd.read_csv('../data/trx_data.csv')
trx_df.head()

Unnamed: 0,customerId,products
0,0,20
1,1,2|2|23|68|68|111|29|86|107|152
2,2,111|107|29|11|11|11|33|23
3,3,164|227
4,5,2|2


In [3]:
test_df = pd.read_csv('../data/recommend_1.csv')
test_df.head()

Unnamed: 0,customerId
0,1553
1,20400
2,19750
3,6334
4,27773


Storing all the customer IDs from the test data to a list

In [4]:
customer_ID_test = test_df['customerId'].values.tolist()

Creating a named tuple to format our recommendations

In [5]:
RECOMMENDATIONS = namedtuple("RECOMMENDATIONS", ["customerId", "products"])

#### II. Preprocessing the data

1. The data here is transactional data containing the products purchased by each user in each transaction.
2. The product IDs within a transaction are separated by a delimiter. For this, I am creating a function which splits on the delimiter and returns the list of products purchased in that transaction.
3. - In order to implement collaborative filtering, our data needs to be in the form of a user-item interactions matrix containing each user's rating for the items they have purchased.
  - However, the transactional data that we have here does not have any such "explicit" feedback. Therefore, I am substituting the count of each item purchased by that user across all transactions as the rating. My intuition behind using purchase frequency is that if a customer purchases an item frequently, it is a "popular" item for that customer, and this should be factored in while creating our utility matrix of customer-to-product interactions.
  - One very important thing to note here is that this causes the *loss of temporal information in the data*, since I am aggregating all the sequential transactions into a single list of items purchased.
4. In order to process the data into the above-mentioned format, I am first creating a simple dataframe which shows, for each customerId, the productId and the frequency of purchase of that product by that customer. Since there are many products, pivoting this dataframe leaves no count (NaN) for the items that a customer has not purchased. I am therefore replacing the NaN values with 0 so that the frequency is shown as 0.
5. Another approach is to instead simply substitute the ratings with a binary value, which could be represented as:
    - 1: if customer has purchased an item based on the transactions data
    - 0: if customer has not purchased an item based on the transactions data
    
   However, we lose out on the popularity factor of the items purchased. Though we will not be using this approach, I have created a function below which returns the user-item matrix in terms of binary values for future use.


In [6]:
def split_trx_on_symbol(transactions_df, symbol):
    '''Removing the pipe symbol from transactions and storing as list.
    Args:
        transactions_df (pd.DataFrame): Customer purchase history.
        symbol (str): Delimiter for splitting product IDs.
    Returns:
        processed_df (pd.DataFrame): Processed dataframe consisting of 
            customer purchase history with products in a list.
    '''
    
    processed_df = transactions_df.copy()
    processed_df['products'] = processed_df['products'].apply(lambda x: [int(each) for each in x.split(symbol)])
    return processed_df


def get_product_counts(processed_df):
    '''Returns user-item interaction matrix by computing frequency of each product purchased by user.
    Args:
        processed_df (pd.DataFrame): Customer purchase history after splitting on delimiter.
    Returns:
        frequency_df (pd.DataFrame): Processed dataframe consisting of user-item-frequency.
    '''
    
    user_item_int = processed_df.set_index('customerId')['products'].apply(pd.Series).reset_index()
    user_items =  user_item_int.melt(id_vars=['customerId'], value_name="products").\
                    sort_values(by=['customerId']).dropna().drop("variable",axis=1).\
                    reset_index(drop=True)
    user_items = user_items.astype(int)
    frequency_df = user_items.groupby(['customerId','products']).size().reset_index(name='purchaseFrequency')
    
    return frequency_df


def get_product_purchased(processed_df):
    '''Returns user-item interaction matrix consisting of binary value based user's purchase history
    Args:
        processed_df (pd.DataFrame): Customer purchase history after splitting on delimiter.
    Returns:
        purchased_df (pd.DataFrame): Processed dataframe consisting of user-item-binary.
    '''
    
    user_item_int = processed_df.set_index('customerId')['products'].apply(pd.Series).reset_index()
    purchased_df =  user_item_int.melt(id_vars=['customerId'], value_name="products").\
                    sort_values(by=['customerId']).dropna().drop("variable",axis=1).\
                    reset_index(drop=True)
    purchased_df['purchasedOrNot'] = 1
    purchased_df.drop_duplicates(keep='first',inplace=True)
    purchased_df = purchased_df.astype(int)
    purchased_df.reset_index(drop=True,inplace=True)
    
    return purchased_df
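
As a side note, the counting step can also be sketched with `DataFrame.explode`, which avoids building one column per position in the product list. This is only an illustrative alternative under the assumption that `products` already holds Python lists; the function name `get_product_counts_explode` and the toy data are hypothetical and not part of the original analysis.

```python
import pandas as pd

def get_product_counts_explode(processed_df):
    """Hypothetical alternative to get_product_counts() based on DataFrame.explode."""
    # One row per (customerId, product) occurrence instead of one list per transaction.
    exploded = processed_df[['customerId', 'products']].explode('products')
    exploded['products'] = exploded['products'].astype(int)
    # Count how often each customer bought each product.
    return (exploded.groupby(['customerId', 'products'])
                    .size()
                    .reset_index(name='purchaseFrequency'))

# Toy purchase history (made-up values) in the same shape as processed_df.
toy_df = pd.DataFrame({
    'customerId': [1, 2],
    'products': [[2, 2, 23], [11, 11, 11, 33]],
})
toy_counts = get_product_counts_explode(toy_df)
```

On the toy data this yields one row per customer-product pair with its purchase count, which is the same layout that `get_product_counts()` returns.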

Preprocessing the data using above functions:

In [7]:
processed_df = split_trx_on_symbol(trx_df,'|')
processed_df.head()

Unnamed: 0,customerId,products
0,0,[20]
1,1,"[2, 2, 23, 68, 68, 111, 29, 86, 107, 152]"
2,2,"[111, 107, 29, 11, 11, 11, 33, 23]"
3,3,"[164, 227]"
4,5,"[2, 2]"


In [8]:
purchase_frequency_df = get_product_counts(processed_df)
purchase_frequency_df.head()

Unnamed: 0,customerId,products,purchaseFrequency
0,0,1,2
1,0,13,1
2,0,19,3
3,0,20,1
4,0,31,2


In the code below, I am widening the purchase_frequency_df from above by using pivot, where the columns are all product IDs and a record represents one customer in the dataframe.

Note: I am using the raw frequencies and not demeaning them across items or users since I want the interactions to be in their original representations.

In [9]:
interactions_df = purchase_frequency_df.pivot(index='customerId', columns='products', values='purchaseFrequency').fillna(0)
interactions_df.reset_index(drop=False, inplace=True)
interactions_df = interactions_df.rename_axis(None, axis=1)
interactions_df.head()

Unnamed: 0,customerId,0,1,2,3,4,5,6,7,8,...,290,291,292,293,294,295,296,297,298,299
0,0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
interactions_df.shape

(24429, 301)

#### III. Evaluating our models
In order to evaluate which model to choose, we divide the interactions data into a training and a validation set with the ratio of 90:10.

In [11]:
X_train, X_val = train_test_split(interactions_df, test_size=0.10, random_state=42)
X_train.shape, X_val.shape

((21986, 301), (2443, 301))

Storing all the customer IDs from the validation set to a list

In [12]:
customer_ID_val = X_val["customerId"].tolist()

#### Metrics for evaluation
*Recall@k*: Recall measures the proportion of a customer's relevant items that appear in the top-k recommendations. Here I am interpreting relevant items as the items which are in a customer's purchase history. A higher recall means that the recommendations capture more of the relevant products.

*RECALL@k* = (Number of recommended items @k that are relevant) / (total number of relevant items)

The recall@k for each model is then averaged across all users in the val/test set.
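
To make the definition concrete, here is a minimal sketch of recall@k for a single customer. The helper name `recall_at_k` and the example lists are assumptions made for illustration only.

```python
def recall_at_k(recommended, relevant, k):
    """Share of the relevant items that appear in the first k recommendations."""
    relevant_set = set(relevant)
    if not relevant_set:
        # No purchase history: recall is undefined, so report 0.0 by convention here.
        return 0.0
    hits = len(set(recommended[:k]) & relevant_set)
    return hits / len(relevant_set)

# Made-up example: 5 ranked recommendations, 4 purchased (relevant) items.
recommended = [8, 3, 1, 2, 4]
relevant = [3, 4, 99, 100]

recall_1 = recall_at_k(recommended, relevant, 1)   # item 8 is not relevant
recall_5 = recall_at_k(recommended, relevant, 5)   # items 3 and 4 are found
```

Here `recall_1` is 0.0 and `recall_5` is 0.5, since two of the four relevant items appear in the top 5.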

In [13]:
def compute_recall_at_k(recommendations_df,preprocessed_original_df,k):
    '''Computing average recall over all the customers recommendations for different k values.
    Args:
        recommendations_df (pd.DataFrame): List of recommended products for each customer in test data.
        preprocessed_original_df (pd.DataFrame): Customer purchase history dataframe
        k (int): Number of top-ranked recommendations to consider when computing recall.
    Returns:
        averageRecall (float): Average recall value
    '''
    
    recommendations_string_df = split_trx_on_symbol(recommendations_df,'|')
    sumOfRecall = 0
    for index, row in recommendations_string_df.iterrows():
        customerId = row['customerId']
        recommendedProducts = row['products']
        relevantProducts = preprocessed_original_df[preprocessed_original_df['customerId'] == customerId]\
            ['products'].tolist()
        numberOfRelevantProducts = len(relevantProducts)
        # Slice the ranked list to the top k so that any k works, not only 1, 5 and 10.
        numberOfRelevantRecommendedProducts = len(set(recommendedProducts[:k]) & set(relevantProducts))
        recall = numberOfRelevantRecommendedProducts/numberOfRelevantProducts
        sumOfRecall += recall
        
    averageRecall = sumOfRecall/recommendations_string_df.shape[0]
    
    return averageRecall

#### IV. Implementing model for predicting 10 product recommendations to each user.

1. I am building a model for the question: 
    - Option 1 Model: In recommend_1.csv, we provide a list of customer IDs. If you select option 1, use this data to generate a csv file that indicates top 10 recommendations for each of the customers. Note the order of the recommended products should be ordered by user preference, with the most preferred item in the beginning.
2. There are different approaches to building a collaborative filtering model. 
    - First approach: *Model-based CF* using matrix-factorization: *Matrix factorization* is an approach which works by factorizing the customer-product interactions matrix to find latent features in the data. With SVD we obtain three matrices: a user-features matrix, a diagonal matrix of singular values (the weights), and an item-features matrix. Taking the product of these matrices gives an approximation of the original interactions matrix, whose scores we use for our recommendations.
    - Second approach: *Memory-based CF*: I am looking at an approach which incorporates customer's purchase history to find most similar items. The similarity among items is calculated by finding the *nearest neighbors* of an item based on a *distance measure* which could be cosine, euclidean, etc. Thus the nearest 10 neighbors can be our recommendations. 

#### Approach 1: Matrix factorisation using Singular Value Decomposition and then making recommendations.
The function *generate_svd_results()* generates the approximated matrix after applying svd, which is then passed as an argument to the *make_recommendations_svd()* which returns the dataframe containing recommendations for all customers in our val/test set.

In this approach, for each customer in the validation/test data, I am returning 10 items which have the highest scores in the approximated matrix from SVD.
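
The idea of a truncated SVD followed by a top-N lookup can be sketched on a tiny matrix. This is a hedged illustration, not the notebook's pipeline: the `counts` matrix is made up, it contains only purchase counts (no ID column), and `top_n_items` is a hypothetical helper.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy customer-by-product purchase counts (4 customers, 5 products); values are made up.
counts = np.array([
    [2.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 2.0, 1.0],
])

# Keep 2 latent features; svds needs k smaller than both matrix dimensions here.
U, sigma, Vt = svds(counts, k=2)

# Multiply the three factors back together to get the rank-2 approximation.
approx = U @ np.diag(sigma) @ Vt

def top_n_items(scores_row, n):
    """Return the column positions of the n highest scores, best first."""
    return np.argsort(-scores_row)[:n].tolist()

# Highest-scoring product positions for the first customer.
top_items = top_n_items(approx[0], 3)
```

The approximation has the same shape as the input, so each row can be sorted to rank products for that customer.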

In [14]:
def generate_svd_results(interactions_df,numberOfLatentFeatures):
    '''Computing matrix factorization using SVD and returning the approximated matrix as a dataframe.
    Args:
        interactions_df (pd.DataFrame): Customer purchase history in the form of user-item interactions.
        numberOfLatentFeatures (int): Number of latent features we want the interactions to be broken down into.
    Returns:
        preds_df (pd.DataFrame): Approximated matrix after applying SVD.
    '''
    
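    # Note: interactions_df still carries the customerId column at this point, so the IDs are
    # factorized together with the purchase counts; the callers drop that column only afterwards.
    frequency_interactions = interactions_df.values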
    U, sigma, Vt = svds(frequency_interactions, k = numberOfLatentFeatures)
    sigma = np.diag(sigma)
    predicted_scores = np.dot(np.dot(U, sigma), Vt)
    preds_df = pd.DataFrame(predicted_scores, columns = interactions_df.columns)
    return preds_df


def make_recommendations_svd(interactions_df, predictions_df, test_customers):
    '''Generating the recommendations dataframe using the approximated matrix obtained from SVD.
    Args:
        interactions_df (pd.DataFrame): Customer purchase history in the form of user-item interactions.
        predictions_df (pd.DataFrame): Approximated matrix after applying SVD.
        test_customers (list): List of customers for whom we want to make the recommendations.
    Returns:
        recommendations_df (pd.DataFrame): Final recommendations for our validation/test data.
    '''
    
    final_recommendations = []
    for customerId in test_customers:
        user_index = interactions_df[interactions_df.customerId == customerId].index[0]
        sorted_user_predictions = pd.DataFrame(predictions_df.iloc[user_index].sort_values(ascending=False)).reset_index()
        recommended_items = sorted_user_predictions["index"][:10]
        recommended_items = map(str,recommended_items)
        recommended_string = "|".join(recommended_items)
        rec = RECOMMENDATIONS(customerId, recommended_string)
        final_recommendations.append(rec)
    recommendations_df = pd.DataFrame.from_records(
       final_recommendations,
       columns=RECOMMENDATIONS._fields
    )

    return recommendations_df

##### Running SVD with different numbers of latent features from [25, 50, 75, 100, 125, 150] and generating recall at the 1st, 5th and 10th positions to determine which model works best for the customers in the validation set.

i. numberOfLatentFeatures = 25

In [15]:
predictions_df = generate_svd_results(interactions_df,25)
predictions_df.drop('customerId', axis=1, inplace=True)
recommendations_df = make_recommendations_svd(interactions_df, predictions_df, customer_ID_val)

recall_at_1 = compute_recall_at_k(recommendations_df,purchase_frequency_df,1)
print("Recall@1: ",recall_at_1)
recall_at_5 = compute_recall_at_k(recommendations_df,purchase_frequency_df,5)
print("Recall@5: ",recall_at_5)
recall_at_10 = compute_recall_at_k(recommendations_df,purchase_frequency_df,10)
print("Recall@10: ",recall_at_10)


Recall@1:  0.1733239721476204
Recall@5:  0.336201845814706
Recall@10:  0.4033837278236912


ii. numberOfLatentFeatures = 50

In [16]:
predictions_df = generate_svd_results(interactions_df,50)
predictions_df.drop('customerId', axis=1, inplace=True)
recommendations_df = make_recommendations_svd(interactions_df, predictions_df, customer_ID_val)

recall_at_1 = compute_recall_at_k(recommendations_df,purchase_frequency_df,1)
print("Recall@1: ",recall_at_1)
recall_at_5 = compute_recall_at_k(recommendations_df,purchase_frequency_df,5)
print("Recall@5: ",recall_at_5)
recall_at_10 = compute_recall_at_k(recommendations_df,purchase_frequency_df,10)
print("Recall@10: ",recall_at_10)

Recall@1:  0.23348804000346934
Recall@5:  0.46374240397752176
Recall@10:  0.5557809712028159


iii. numberOfLatentFeatures = 75

In [17]:
predictions_df = generate_svd_results(interactions_df,75)
predictions_df.drop('customerId', axis=1, inplace=True)
recommendations_df = make_recommendations_svd(interactions_df, predictions_df, customer_ID_val)

recall_at_1 = compute_recall_at_k(recommendations_df,purchase_frequency_df,1)
print("Recall@1: ",recall_at_1)
recall_at_5 = compute_recall_at_k(recommendations_df,purchase_frequency_df,5)
print("Recall@5: ",recall_at_5)
recall_at_10 = compute_recall_at_k(recommendations_df,purchase_frequency_df,10)
print("Recall@10: ",recall_at_10)

Recall@1:  0.28024834523696385
Recall@5:  0.5703779834385959
Recall@10:  0.6621810123452504


iv. numberOfLatentFeatures = 100

In [18]:
predictions_df = generate_svd_results(interactions_df,100)
predictions_df.drop('customerId', axis=1, inplace=True)
recommendations_df = make_recommendations_svd(interactions_df, predictions_df, customer_ID_val)

recall_at_1 = compute_recall_at_k(recommendations_df,purchase_frequency_df,1)
print("Recall@1: ",recall_at_1)
recall_at_5 = compute_recall_at_k(recommendations_df,purchase_frequency_df,5)
print("Recall@5: ",recall_at_5)
recall_at_10 = compute_recall_at_k(recommendations_df,purchase_frequency_df,10)
print("Recall@10: ",recall_at_10)

Recall@1:  0.3038904381772921
Recall@5:  0.6436990277847527
Recall@10:  0.7412760732203166


v. numberOfLatentFeatures = 125

In [19]:
predictions_df = generate_svd_results(interactions_df,125)
predictions_df.drop('customerId', axis=1, inplace=True)
recommendations_df = make_recommendations_svd(interactions_df, predictions_df, customer_ID_val)

recall_at_1 = compute_recall_at_k(recommendations_df,purchase_frequency_df,1)
print("Recall@1: ",recall_at_1)
recall_at_5 = compute_recall_at_k(recommendations_df,purchase_frequency_df,5)
print("Recall@5: ",recall_at_5)
recall_at_10 = compute_recall_at_k(recommendations_df,purchase_frequency_df,10)
print("Recall@10: ",recall_at_10)

Recall@1:  0.31090026216419364
Recall@5:  0.6536384313418756
Recall@10:  0.7492118822058108


vi. numberOfLatentFeatures = 150

In [20]:
predictions_df = generate_svd_results(interactions_df,150)
predictions_df.drop('customerId', axis=1, inplace=True)
recommendations_df = make_recommendations_svd(interactions_df, predictions_df, customer_ID_val)

recall_at_1 = compute_recall_at_k(recommendations_df,purchase_frequency_df,1)
print("Recall@1: ",recall_at_1)
recall_at_5 = compute_recall_at_k(recommendations_df,purchase_frequency_df,5)
print("Recall@5: ",recall_at_5)
recall_at_10 = compute_recall_at_k(recommendations_df,purchase_frequency_df,10)
print("Recall@10: ",recall_at_10)

Recall@1:  0.31090026216419364
Recall@5:  0.6536384313418756
Recall@10:  0.7492118822058108


**Understanding:** Looking at the results of *SVD* for different numbers of *hidden features*, we see that beyond 125 features the recall does not improve. The highest recall@10 obtained by SVD is ~75%, i.e. on average about 75% of a customer's purchased items appear among the 10 recommendations. The recall@1, which considers only the first recommended item, is 31%.

#### Approach 2: Recommending nearest neighbors using similarity and frequency of items purchased
This approach is described using the following steps:

 1) For each customer in our test data, I first obtain the nearest neighbors of each item purchased by that customer. The neighbors are found using the Minkowski distance between item vectors (with scikit-learn's default `p=2` this is the Euclidean distance), so I have the neighbors as well as their distances. The code queries 11 neighbors per item, because the closest match is the item itself at distance 0.
    - For instance, if a customer's purchase history contains 3 items, we have 30 neighbors and their distances (10 for each purchased item), plus the 3 purchased items themselves.

 2) Next, I compute a weighted score for each of these neighbors by multiplying the inverse of its distance by the purchase frequency of the item it was found from. Thus the "nearer" items get higher scores and the "farther" items get lower scores.
 
 3) Then, I obtain the 10 unique items having the highest scores. These are our final recommendations.  
    - In the case explained above, we obtain the 10 unique recommendations from those 33 scored candidates.
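
The three steps can be sketched on a toy item-by-customer matrix. This is a minimal illustration under stated assumptions: the matrix values, the `purchases` dictionary and the variable names are made up, and only 3 neighbors are queried per item to keep the example small.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy item-by-customer purchase counts (5 items, 4 customers); values are made up.
item_vectors = np.array([
    [3.0, 0.0, 1.0, 0.0],
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
])

# Minkowski with the default p=2 is the Euclidean distance.
knn = NearestNeighbors(algorithm='ball_tree', metric='minkowski')
knn.fit(item_vectors)

# Hypothetical customer: bought item 0 three times and item 2 once.
purchases = {0: 3, 2: 1}

best_score = {}
for item, frequency in purchases.items():
    # Step 1: neighbors of the purchased item (the item itself comes back at distance 0).
    distances, neighbors = knn.kneighbors(item_vectors[item].reshape(1, -1), n_neighbors=3)
    for distance, neighbor in zip(distances[0], neighbors[0]):
        # Step 2: inverse distance weighted by purchase frequency.
        score = frequency / (distance + 1e-5)
        neighbor = int(neighbor)
        best_score[neighbor] = max(best_score.get(neighbor, 0.0), score)

# Step 3: unique items ordered by their highest score.
ranked_items = sorted(best_score, key=best_score.get, reverse=True)
```

In this sketch the two purchased items end up at the top of `ranked_items`, because each one is its own nearest neighbor and therefore receives a very large inverse-distance score.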

In [21]:
def make_recommendations_neighbors(interactions_df,purchase_frequency_df,test_customers,metric,algorithm):
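    '''Generating the recommendations dataframe from the nearest neighbors of each purchased item.
    Args:
        interactions_df (pd.DataFrame): Item-by-customer purchase frequency matrix (one row per item).
        purchase_frequency_df (pd.DataFrame): Customer purchase history in the preprocessed format
            of user-item-frequency
        test_customers (list): List of customers for whom we want to make the recommendations.
        metric (str): Distance metric passed to NearestNeighbors, e.g. 'minkowski'.
        algorithm (str): Neighbor search algorithm passed to NearestNeighbors, e.g. 'ball_tree'.
    Returns:
        recommendations_df (pd.DataFrame): Final recommendations for our validation/test data.
    '''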
    
    final_recommendations = []
    frequency_interactions = interactions_df.values
    model_knn = NearestNeighbors(algorithm=algorithm, metric=metric)
    model_knn.fit(frequency_interactions)
    
    for customerId in test_customers:
        all_neighbors = []
        all_distances = []
        relevantRecords = purchase_frequency_df[purchase_frequency_df['customerId'] == customerId]
        relevantProducts = relevantRecords['products'].tolist()
        
        for index, row in relevantRecords.iterrows():
            weight = row['purchaseFrequency']
            distances,neighbors = model_knn.kneighbors(interactions_df.iloc[row['products']].values.reshape(1,-1), n_neighbors=11)
            weightedDistances = weight * (1/(distances + 1e-5))
            all_neighbors.append(list(neighbors[0])[:])
            all_distances.append(list(weightedDistances[0][:]))
            
        all_neighbors_flat = list(chain.from_iterable(all_neighbors))
        all_distances_flat = list(chain.from_iterable(all_distances))
        
        products_scores_df = pd.DataFrame({'products':all_neighbors_flat,'scores':all_distances_flat})
        products_scores_df.sort_values('scores',ascending=False,inplace=True)
        
        recommended_items = list(OrderedSet(products_scores_df['products']))[:10]
        recommended_items = map(str,recommended_items)
        recommended_string = "|".join(recommended_items)
        rec = RECOMMENDATIONS(customerId,recommended_string)
        final_recommendations.append(rec)
        
    recommendations_df = pd.DataFrame.from_records(
        final_recommendations,
        columns=RECOMMENDATIONS._fields)
    return recommendations_df

Running nearest neighbors and generating recall at the 1st, 5th and 10th positions to determine which model works best for the customers in the validation set. 

Here, I am pivoting the purchase frequencies again so that the shape is number of items x number of customers.

In [22]:
interactions_df_nn = purchase_frequency_df.pivot(index='products', columns='customerId', values='purchaseFrequency').fillna(0)
interactions_df_nn.reset_index(drop=True, inplace=True)
interactions_df_nn = interactions_df_nn.rename_axis(None, axis=1)
interactions_df_nn.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,28581,28583,28585,28588,28590,28593,28596,28598,28604,28605
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,6.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
## Here, I am using "minkowski" distance to calculate similarity among items
recommendations_df_nn = make_recommendations_neighbors(interactions_df_nn,purchase_frequency_df,\
                                                                       customer_ID_val,'minkowski','ball_tree')

recall_at_1 = compute_recall_at_k(recommendations_df_nn,purchase_frequency_df,1)
print("Recall@1: ",recall_at_1)
recall_at_5 = compute_recall_at_k(recommendations_df_nn,purchase_frequency_df,5)
print("Recall@5: ",recall_at_5)
recall_at_10 = compute_recall_at_k(recommendations_df_nn,purchase_frequency_df,10)
print("Recall@10: ",recall_at_10)

Recall@1:  0.4008043613396813
Recall@5:  0.8483730142301252
Recall@10:  0.9619563150581717


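**Understanding:** Wow! By getting nearest neighbors using a *distance-based "similarity" measure*, we get a much higher recall@10 of 96%.
The recall@1, i.e. recall for the first position, is 40%, which is also higher than what we observed from the results of SVD.
One thing to keep in mind when reading these numbers: each purchased item is its own nearest neighbor at distance 0, so it receives a very large score and tends to appear in the list, and since relevant items are defined by purchase history this pushes the recall up.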

#### Determining the algorithm
As we have observed, nearest neighbors outperforms matrix factorisation by a large margin. So, whether we want all 10 recommended items to be as relevant as possible, or only the first item or the first 5 items to be highly relevant, I choose the nearest neighbor approach.

Now that we have chosen our model based on the recall scores, there is one more metric which we should consider:  
1. *Discounted Cumulative Gain (DCG@K)*:  DCG is a measure of the top-k recommendations' **ranking quality**. It rewards lists in which highly relevant items are recommended at the top. 
2. $\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)}$, where $i$ is the rank position and $\mathrm{rel}_i$ is the relevance score of the item at rank $i$; for our 10-item recommender, $i$ ranges from 1 to 10.
3. Based on our business problem, we may care most about a high DCG over the first 5 results, so DCG could be used as an evaluation metric for selecting the most appropriate model.
    - This could be further enhanced by using Normalized DCG (NDCG), when we have the information about the "true ranking" of each product across all products in the dataset. Alternatively, we could use the frequency of items purchased across all users as a true ranking.
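
As a rough illustration of the metric described above, the sketch below computes DCG@k and NDCG@k with binary relevance (1 if the item is in the customer's purchase history, 0 otherwise). The function names, the binary relevance choice and the example lists are assumptions; the notebook itself does not compute DCG.

```python
import math

def dcg_at_k(recommended, relevant, k):
    """DCG@k with binary relevance: each hit at rank i contributes 1 / log2(i + 1)."""
    relevant_set = set(relevant)
    return sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant_set
    )

def ndcg_at_k(recommended, relevant, k):
    """DCG@k divided by the DCG of an ideal list that puts all relevant items first."""
    ideal_hits = min(k, len(set(relevant)))
    ideal_dcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(recommended, relevant, k) / ideal_dcg

# Made-up example: the relevant items 3 and 1 sit at ranks 2 and 3.
dcg_late = dcg_at_k([8, 3, 1], [3, 1], 3)
# Same items, but ranked first and second: this is the ideal ordering.
ndcg_ideal = ndcg_at_k([3, 1, 8], [3, 1], 3)
```

Moving the relevant items from ranks 2 and 3 up to ranks 1 and 2 raises the DCG, which is exactly the ordering effect that recall@k cannot capture.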

In [24]:
test_recommendations_df = make_recommendations_neighbors(interactions_df_nn,purchase_frequency_df,customer_ID_test
                                                   ,'minkowski','ball_tree')
test_recommendations_df.rename(columns = {'products':'recommendedProducts'},inplace=True)
test_recommendations_df.to_csv('../output/test_output.csv',index=False)

#### V. Future Work

1. Firstly, I would include DCG@k metric for our evaluations, in order to select the best model.
2. Second, a modeling approach to try would be using Autoencoders for making recommendations.
3. Third, work on ways to include temporal information in our recommender systems.


Note: Throughout the notebook, I have tried to keep the code simple, readable and reusable by creating functions specific to my analysis. However, in order to productionize this code, we could store our modeling algorithms into packages and libraries, thus making them callable. We should also ensure that inputs to the commonly used methods are generalized, and outputs from our modeling results are in the same format. The outputs could be passed on to a simple dashboard or JSON formats, for monitoring or evaluation purposes.
