# Recommendation System
## Problem Statement

Your client is a fast-growing mobile platform for hosting coding challenges. They have a unique business model in which they crowdsource problems from various creators (authors). These authors create problems and release them on the client's platform. The users then select the challenges they want to solve. The authors make money based on the difficulty level of their problems and on how many users take up their challenges.
 
The client, on the other hand, makes money when users can find challenges that interest them and continue to stay on the platform. To date, the client has relied on its domain expertise, user interface and experience with user behaviour to suggest the problems a user might be interested in. You have now been appointed as the data scientist who needs to come up with an algorithm to keep users engaged on the platform.
The client has provided you with the history of the last 10 challenges each user has solved, and you need to predict the next 3 challenges the user might be interested in solving. Apply your data science skills to help the client make a big mark in their user engagement and revenue.

## DATASET DESCRIPTION
### We have three data files:
##### train.csv: Contains the sequence of 13 challenges attempted by each user

|Variable | Definition|
|------------- |-------------|
|user_sequence|Unique ID for the sequence|
|user_id|User ID|
|challenge_sequence|Challenge sequence number (1-13)|
|challenge|Challenge ID|

##### challenge_data.csv: Contains attributes related to each challenge
|Variable|Definition|
|------------- |-------------|
|challenge_ID|Challenge ID|
|programming_language|Programming language for the challenge|
|challenge_series_ID|Series for the given challenge|
|total_submissions|Total submissions by all users|
|publish_date|Publishing date for the challenge|
|author_ID|Author ID|
|author_gender|Author gender|
|author_org_ID|Organization ID for author|
|category_id|Type of challenge|

##### test.csv:  
Contains the first 10 challenges solved by a new set of users (not in train). We need to predict the next 3 challenges in the sequence for these users.

|Variable|Definition|
|------------- |-------------|
|user_sequence|Unique ID for the sequence|
|user_id|User ID|
|challenge_sequence|Challenge sequence number (1-10)|
|challenge|Challenge ID|

##### sample_submission.csv:  
It contains the format for submission. Only submissions in this format are acceptable. This should have the next 3 challenges for each user.

|Variable|Definition|
|------------- |-------------|
|user_sequence|Unique ID for the sequence (See Note)|
|challenge|Challenge ID|

###### Note: The format is given by "user_id_challenge_sequence". For example, for user ID 2 you must predict the next 3 challenges with sequence 11, 12 and 13 respectively. The corresponding user_sequence would be given by 2_11, 2_12 & 2_13.
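
As a rough illustration of the `user_sequence` format described in the note, the sketch below turns a per-user list of three predicted challenges into submission rows. The helper name `build_submission_rows` and the challenge IDs are made up for this example; only the `<user_id>_<challenge_sequence>` pattern comes from the note.

```python
def build_submission_rows(top_n, first_sequence=11):
    """Turn {user_id: [challenge_id, ...]} into (user_sequence, challenge) rows."""
    rows = []
    for user_id, challenges in top_n.items():
        for offset, challenge in enumerate(challenges):
            # user_sequence is "<user_id>_<challenge_sequence>", e.g. "2_11"
            rows.append((f"{user_id}_{first_sequence + offset}", challenge))
    return rows


# Hypothetical predictions for user 2 (the challenge IDs are placeholders).
rows = build_submission_rows({2: ["CI00001", "CI00002", "CI00003"]})
for row in rows:
    print(row)
```

The rows can then be written out with the two columns `user_sequence` and `challenge` that sample_submission.csv expects.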


In [40]:
%matplotlib inline
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD, SVDpp
from surprise.model_selection import cross_validate

import warnings; warnings.simplefilter('ignore')

In [41]:
# df_challenge = pd.read_csv('challenge_data-revised.csv')
# df_challenge.head()

In [42]:
# df_challenge.tail()

In [43]:
# df_train = pd.read_csv('train.csv')
# df_train.head()

In [44]:
# df_test = pd.read_csv('test.csv')
# df_test.head()

## Content Based Recommender

A content-based recommender computes the similarity between challenges from certain attributes and suggests the challenges most similar to one that a user liked. The challenge metadata (or content) is used to build this engine; the approach is also known as Content Based Filtering.

Two content-based recommenders are considered, based on:

1. programming_language, author_ID, category_id, total_submissions
2. programming_language, author_ID, author_gender, author_org_ID, category_id, total_submissions
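
One way to picture such a recommender is the minimal sketch below. It is a simplification, not the notebook's implementation: it uses only categorical attributes (the numeric `total_submissions` is left out), and the toy challenge records are invented. For one-hot encoded categorical fields, the fraction of matching fields equals the cosine similarity of the encoded vectors, which is why no vectoriser is needed here.

```python
def attribute_similarity(a, b, fields):
    """Fraction of the given categorical fields on which two challenges agree.

    With one-hot encoding, each challenge has exactly len(fields) ones, so this
    ratio equals the cosine similarity of the two encoded vectors.
    """
    matches = sum(1 for f in fields if a[f] == b[f])
    return matches / len(fields)


def most_similar(target_id, challenges, fields, n=3):
    """Return the IDs of the n challenges most similar to target_id."""
    target = challenges[target_id]
    scored = [
        (attribute_similarity(target, other, fields), cid)
        for cid, other in challenges.items()
        if cid != target_id
    ]
    # Highest similarity first; ties are broken by challenge ID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cid for _, cid in scored[:n]]


# Toy metadata (made-up values) using column names from challenge_data.csv.
FIELDS = ["programming_language", "author_ID", "category_id"]
challenges = {
    "C1": {"programming_language": 1, "author_ID": "A1", "category_id": 29},
    "C2": {"programming_language": 1, "author_ID": "A1", "category_id": 30},
    "C3": {"programming_language": 2, "author_ID": "A2", "category_id": 29},
    "C4": {"programming_language": 1, "author_ID": "A1", "category_id": 29},
}
similar_to_c1 = most_similar("C1", challenges, FIELDS, n=2)
print(similar_to_c1)
```

The second recommender would simply pass a longer `fields` list that also includes `author_gender` and `author_org_ID`.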

In [45]:
# df_challenge.shape

In [46]:
# df_challenge.isnull().any()

In [47]:
# df_challenge['publish_date'] = pd.to_datetime(df_challenge['publish_date'])
# df_challenge.tail()

In [48]:
# df_challenge.isnull().any()

## Data Cleaning
### total_submissions

In [49]:
# with sns.plotting_context("notebook",font_scale=1.5):
#     sns.set_style("whitegrid")
#     sns.distplot(df_challenge["total_submissions"].dropna(),
#                  bins=20,
#                  kde=False,
#                  color="green")
#     plt.title("total_submissions")
#     plt.ylabel("Count")
    

In [50]:
# print('total_submissions=> mean =%f, median=%f'%(df_challenge["total_submissions"].mean(), df_challenge["total_submissions"].median()))
# df_challenge["total_submissions"].describe()

In [51]:
# df_challenge["total_submissions"].isna().sum()

### Data cleaning for total submissions.
The latest challenges have no submission count yet, so the missing values are filled with the median.
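
A small sketch of this step, assuming a toy frame in place of the real challenge data (the values are invented). It computes the median from the column instead of typing it in as a constant, so the fill value stays correct if the data changes.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the challenge data (values are made up).
df = pd.DataFrame({"total_submissions": [10.0, 200.0, np.nan, 122.0, np.nan]})

# Series.median() skips NaN values by default.
median_submissions = df["total_submissions"].median()
df["total_submissions"] = df["total_submissions"].fillna(median_submissions)

print(median_submissions, df["total_submissions"].isna().sum())
```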

In [52]:
# # Assign missing total_submissions to median number
# df_challenge["total_submissions"] = df_challenge["total_submissions"].fillna(value=122)
# df_challenge.tail()

### Challenge series id

In [53]:
# with sns.plotting_context("notebook",font_scale=1.5):
#     sns.set_style("whitegrid")
#     sns.distplot(df_challenge["challenge_series_ID"].dropna(),
#                  bins=20,
#                  kde=False,
#                  color="green")
#     plt.title("challenge_series_ID")
#     plt.ylabel("Count")
    

In [54]:
# df_challenge["challenge_series_ID"].isna().sum()

### Use the mode of challenge_series_ID for missing data
The most frequent series ID is looked up first and then filled in manually.
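
The lookup and the fill can also be combined, as in this sketch on an invented toy column. The series IDs here are placeholders, not values from the real data.

```python
import pandas as pd

# Toy column standing in for challenge_series_ID (IDs are made up).
df = pd.DataFrame({"challenge_series_ID": ["SI1", "SI2", None, "SI2", None]})

# Series.mode() returns a Series because ties are possible; take the first entry.
most_common = df["challenge_series_ID"].mode().iloc[0]
df["challenge_series_ID"] = df["challenge_series_ID"].fillna(most_common)

print(most_common)
```

The same pattern would apply to the other categorical columns filled with their mode in the cells that follow.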

In [55]:
#  df_challenge["challenge_series_ID"].mode()

In [56]:
# df_challenge["challenge_series_ID"] = df_challenge["challenge_series_ID"].fillna(value='SI2652')
# df_challenge.tail()

In [57]:
# with sns.plotting_context("notebook",font_scale=1.5):
#     sns.set_style("whitegrid")
#     sns.distplot(df_challenge["author_ID"].dropna(),
#                  bins=100,
#                  kde=False,
#                  color="green")
#     plt.title("author_ID")
#     plt.ylabel("Count")

In [58]:
#  df_challenge["author_ID"].mode()

In [59]:
# df_challenge["author_ID"] = df_challenge["author_ID"].fillna(value='AI565468')

In [60]:
# with sns.plotting_context("notebook",font_scale=1.5):
#     sns.set_style("whitegrid")
#     sns.distplot(df_challenge["author_gender"].dropna(),
#                  bins=2,
#                  kde=False,
#                  color="green")
#     plt.title("author_gender")
#     plt.ylabel("Count")

In [61]:
# df_challenge["author_gender"] = df_challenge["author_gender"].fillna(value='M')

In [62]:
# with sns.plotting_context("notebook",font_scale=1.5):
#     sns.set_style("whitegrid")
#     sns.distplot(df_challenge["author_org_ID"].dropna(),
#                  bins=100,
#                  kde=False,
#                  color="green")
#     plt.title("author_org_ID")
#     plt.ylabel("Count")

In [63]:
# df_challenge["author_org_ID"].mode()

In [64]:
# df_challenge["author_org_ID"] = df_challenge["author_org_ID"].fillna(value='AOI100201')

In [65]:
# with sns.plotting_context("notebook",font_scale=1.5):
#     sns.set_style("whitegrid")
#     sns.distplot(df_challenge["category_id"].dropna(),
#                  bins=100,
#                  kde=False,
#                  color="green")
#     plt.title("category_id")
#     plt.ylabel("Count")

In [66]:
# df_challenge["category_id"].mode()

In [67]:
# df_challenge["category_id"] = df_challenge["category_id"].fillna(value=29)

In [68]:
# with sns.plotting_context("notebook",font_scale=1.5):
#     sns.set_style("whitegrid")
#     sns.distplot(df_challenge["programming_language"].dropna(),
#                  bins=20,
#                  kde=False,
#                  color="green")
#     plt.title("programming_language")
#     plt.ylabel("Count")

In [69]:
# df_challenge.isnull().any()

## Metadata Based Recommender

To build a standard metadata-based content recommender, the challenge attributes from challenge_data.csv are combined into one feature set per challenge.

Data preparation is the first step.

## Collaborative Filtering

Because the challenge data is limited in content and distribution, the content-based engine above suffers from some severe limitations.
It can only suggest challenges that are close to a given challenge.
That is, it cannot capture tastes or provide recommendations across categories.
The engine is also not really personal, in that it doesn't capture the personal tastes and biases of a user.
Anyone querying the engine for recommendations based on a challenge will receive the same recommendations for that challenge, regardless of who they are.

Therefore, in this section, a technique called Collaborative Filtering will be used to make recommendations to users. 
Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.
The Surprise library provides algorithms such as Singular Value Decomposition (SVD) that are fitted to minimise the prediction error on known ratings, commonly reported as RMSE (Root Mean Square Error).
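
To make the idea behind SVD-style collaborative filtering more concrete, here is a simplified pure-Python sketch of biased matrix factorisation trained by stochastic gradient descent. It is not Surprise's implementation; the function names, hyperparameters and toy ratings are assumptions chosen for illustration. Each rating is modelled as a global mean plus a user bias, an item bias and the dot product of a user and an item factor vector.

```python
import random


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i by SGD and return a predict function."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Gradient steps on the regularised squared error.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict


def rmse(predict, ratings):
    """Root mean square error of predict over (user, item, rating) triples."""
    return (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5


# Toy (user, challenge, rating) triples; the values are made up.
ratings = [
    ("u1", "c1", 13), ("u1", "c2", 12),
    ("u2", "c1", 13), ("u2", "c3", 11),
    ("u3", "c2", 12), ("u3", "c3", 11),
]
predict = train_mf(ratings)
print("training RMSE:", rmse(predict, ratings))
print("estimate for an unseen pair:", predict("u1", "c3"))
```

The last line shows the point of the technique: once the factors are learned, a rating can be estimated for a user-challenge pair that never appeared in the data.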


### Define our precision function 

The evaluation metric is Mean Average Precision (MAP) at K (K = 3). MAP is a well-known metric used to evaluate ranked retrieval results. The function defined next reports precision and recall at k for each user.
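
For reference, one common definition of MAP@K can be sketched as follows. The source does not state the exact formula used for scoring, so treat this as an assumption; the names `apk` and `mapk` and the toy lists are illustrative.

```python
def apk(actual, predicted, k=3):
    """Average precision at k for one user.

    actual: the challenges the user really attempted next.
    predicted: the ranked list of recommended challenges.
    """
    if not actual:
        return 0.0
    predicted = predicted[:k]
    hits = 0
    score = 0.0
    for rank, item in enumerate(predicted, start=1):
        # Count an item only once, even if it is predicted repeatedly.
        if item in actual and item not in predicted[:rank - 1]:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(actual), k)


def mapk(actual_lists, predicted_lists, k=3):
    """Mean of apk over all users."""
    scores = [apk(a, p, k) for a, p in zip(actual_lists, predicted_lists)]
    return sum(scores) / len(scores)


# Hits at ranks 1 and 3: (1/1 + 2/3) / 3
example_ap = apk(["a", "b", "c"], ["a", "x", "b"], k=3)
example_map = mapk([["a"], ["b"]], [["a", "x", "y"], ["x", "y", "z"]], k=3)
print(example_ap, example_map)
```

Unlike plain precision at k, this metric rewards placing correct challenges earlier in the ranked list.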

In [70]:
from collections import defaultdict

from surprise import Dataset
from surprise import SVD
from surprise.model_selection import KFold


def precision_recall_at_k(predictions, k=13, threshold=3.5):
    '''Return precision and recall at k metrics for each user.'''

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1

        # Recall@K: Proportion of relevant items that are recommended
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1

    return precisions, recalls


### Define our function to recommend top n challenge

In [71]:
def get_top_n(predictions, n=3):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 3.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

#### Convert challenge sequence to a rating using 14 - challenge_sequence, so earlier challenges get higher ratings
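
The mapping is simple enough to check by hand. The sketch below uses a hypothetical helper name, `sequence_to_rating`, to show that position 1 becomes the highest rating (13) and position 13 the lowest (1), which matches the `rating_scale=(1,13)` passed to the `Reader`.

```python
def sequence_to_rating(challenge_sequence, max_sequence=13):
    """Map position 1..max_sequence to an implicit rating max_sequence..1."""
    return (max_sequence + 1) - challenge_sequence


converted = [sequence_to_rating(s) for s in (1, 2, 13)]
print(converted)  # [13, 12, 1]
```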

In [72]:
# reader = Reader(rating_scale=(1,13))

# # Combine training and testing dataset
# frames = [df_train, df_test]
# df = pd.concat(frames)
# ratings = df.copy(deep=True)
# ratings['challenge_sequence'] = 14-ratings['challenge_sequence']
# ratings.head()


#### Use n-fold cross-validation with n = 5

In [73]:
# n_folds = 5
# data = Dataset.load_from_df(ratings[['user_id', 'challenge', 'challenge_sequence']], reader)

### Select algorithm for prediction

In [74]:
algo = SVDpp()
# Run 5-fold cross-validation and print results
#cross_validate(algo, data, measures=['FCP', 'RMSE', 'MAE'], cv=n_folds, verbose=True)

In [75]:
algo = SVD()
# Run 5-fold cross-validation and print results
#cross_validate(algo, data, measures=['FCP', 'RMSE', 'MAE'], cv=n_folds, verbose=True)

### Evaluate by AP@K=13


In [76]:
ap_k = 13

In [77]:
# kf = KFold(n_splits=n_folds)
# algo = SVD()

# for trainset, testset in kf.split(data):
#     algo.fit(trainset)
#     predictions = algo.test(testset)
#     precisions, recalls = precision_recall_at_k(predictions, k=ap_k, threshold=13)

#     # Precision and recall can then be averaged over all users
#     print('precisions=%f'%(sum(prec for prec in precisions.values()) / len(precisions)))
#     print('recalls=%f'%(sum(rec for rec in recalls.values()) / len(recalls)))
    

In [78]:
# kf = KFold(n_splits=n_folds)
# algo = SVDpp()

# for trainset, testset in kf.split(data):
#     algo.fit(trainset)
#     predictions = algo.test(testset)
#     precisions, recalls = precision_recall_at_k(predictions, k=ap_k, threshold=4)

#     # Precision and recall can then be averaged over all users
#     print('precisions=%f'%(sum(prec for prec in precisions.values()) / len(precisions)))
#     print('recalls=%f'%(sum(rec for rec in recalls.values()) / len(recalls)))

#### According to the tests above, SVDpp provides a better balance of precision and recall

Choose SVDpp as the main algorithm.

### Train the data

In [79]:
algo = SVDpp()

In [80]:
# trainset = data.build_full_trainset()
# algo.fit(trainset)

#### Let us pick the first user and check the ratings they have given.

In [81]:
# ratings[ratings['user_id'] == 4576]

In [82]:
# algo.predict(4576, 'CI23855', 13)

### Make predictions on the test set

In [83]:
# trainset = data.build_full_trainset()
# algo.fit(trainset)

In [85]:
# afile = open(r'./trainset.pkl', 'wb')
# pickle.dump(trainset, afile)
# afile.close()

# afile = open(r'./algo.pkl', 'wb')
# pickle.dump(algo, afile)
# afile.close()

# Reload the saved objects from file.
with open('./trainset.pkl', 'rb') as file2:
    trainset = pickle.load(file2)

# The fitted model is needed for algo.test() further down.
with open('./algo.pkl', 'rb') as file2:
    algo = pickle.load(file2)


In [None]:
# algo.predict(4577, 'CI23855', 13)

In [None]:
# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
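
What the anti-testset contains can be pictured with the pure-Python sketch below. It mimics the idea rather than reproducing Surprise's code: every user-item pair that has no known rating is listed, with a placeholder rating (Surprise uses the global mean when no `fill` value is given). The helper name and toy ratings are assumptions.

```python
def anti_testset(ratings, fill):
    """Return (user, item, fill) for every user-item pair absent from ratings."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    seen = {(u, i) for u, i, _ in ratings}
    return [(u, i, fill) for u in users for i in items if (u, i) not in seen]


# Toy (user, challenge, rating) triples; the values are made up.
ratings = [("u1", "c1", 13), ("u1", "c2", 12), ("u2", "c1", 13)]
global_mean = sum(r for _, _, r in ratings) / len(ratings)

pairs = anti_testset(ratings, global_mean)
print(pairs)
```

Because the result has roughly (number of users) x (number of challenges) entries, it can become very large on the full dataset, which is one reason to filter it down to the test users before predicting.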

In [None]:
import pickle

# Save the anti-testset so it does not have to be rebuilt.
with open('./testset.pkl', 'wb') as afile:
    pickle.dump(testset, afile)

# Reload the object from file.
# with open('./testset.pkl', 'rb') as file2:
#     testset = pickle.load(file2)

In [None]:
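# test.csv is loaded here because the earlier loading cell is commented out.
df_test = pd.read_csv('test.csv')

# set_index() returns a new frame and leaves df_test unchanged, so membership
# is checked against an explicit set of test user IDs instead of the index.
test_users = set(df_test['user_id'])

# Keep only the (user, item, rating) triples that belong to test users.
new_test_data = [item for item in testset if item[0] in test_users]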

In [None]:
predictions = algo.test(new_test_data)

top_n = get_top_n(predictions, n=3)

# Print the recommended items for the first 7 users only, as an example.
for i, (uid, user_ratings) in enumerate(top_n.items()):
    if i >= 7:
        break  # stop after the example users instead of looping over everyone
    print(uid, [iid for (iid, _) in user_ratings])