# Collaborative Filtering

Collaborative filtering makes predictions (filtering) about the interests of a user by collecting preference or taste information from many users (collaborating).

![alt text](https://upload.wikimedia.org/wikipedia/commons/thumb/5/52/Collaborative_filtering.gif/300px-Collaborative_filtering.gif)

In this lab, we'll implement __kNN__ with scikit-learn ("sklearn") to find each user's nearest neighbors and predict a rating for each project and user. The lab has 3 parts:
- Data Preparation
- Fitting Model
- Prediction to recommend next projects

In [None]:
# basic library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

In [None]:
path = 'coding/data/'
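# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
df = pd.read_csv(path + 'userLog_201801_201802_for_participants.csv', delimiter = ';', on_bad_lines = 'skip', low_memory = False)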
df.head(3)

In [None]:
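random_state = 100
# sample 10,000 log rows and keep their users (the number of unique users can be smaller)
sample_users = set(df['userCode'].sample(n=10000, random_state=random_state))
# .copy() so that new columns can be added later without a SettingWithCopyWarning
sample_data = df[df['userCode'].isin(sample_users)].copy()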
sample_data.head(3)

In [None]:
len(sample_data)

### - Data Cleaning and Transformation

Create the visit datetime.

In [None]:
sample_data['datetime'] = sample_data.apply(lambda row : datetime.datetime(row['year'], row['month'], row['day'], row['hour']), axis=1)
sample_data['date'] = sample_data['datetime'].map(lambda x : x.date())
sample_data['yearmonth'] = sample_data['date'].map(lambda x: str(x.year) +'-'+ str(x.month).zfill(2))
sample_data.head(3)

Clean the data, e.g. cut outliers and filter out users and projects that have too few transactions.

In [None]:
min_interacted = 30
project_count = sample_data.groupby(['project_id']).size()
# projects with fewer than min_interacted interactions are ignored
ignore_project = set(project_count[project_count < min_interacted].index)
print(len(ignore_project))

In [None]:
df_filter = sample_data[~sample_data['project_id'].isin(ignore_project)]
df_filter.head(3)

In [None]:
df_filter = <FILL IN>
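The same minimum-interaction rule can be applied to users. The sketch below runs on a toy log rather than the lab data; `toy_log`, `min_user_interactions`, and the threshold value are assumptions made for illustration only.

```python
import pandas as pd

# Toy interaction log (hypothetical data, same column names as the lab).
toy_log = pd.DataFrame({
    "userCode":   ["a", "a", "a", "b", "c", "c"],
    "project_id": [1, 2, 3, 1, 2, 3],
})

min_user_interactions = 2  # assumed threshold, tune it on the real data

# Count rows per user and keep only users at or above the threshold.
user_count = toy_log.groupby("userCode").size()
keep_users = set(user_count[user_count >= min_user_interactions].index)
toy_filtered = toy_log[toy_log["userCode"].isin(keep_users)]

print(sorted(toy_filtered["userCode"].unique()))
```

User `b` has a single interaction and is dropped; the other users keep all of their rows.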

### - Split training-testing dataset

In [None]:
def SplitTrainTest(df, date):
    # work on a copy so the caller's dataframe is not modified
    df = df.copy()
    df['interacted'] = 1

    # rows before the cutoff date are for training, the rest for testing
    df_train = df[df.date <  date].copy()
    df_test = df[df.date >=  date].sort_values(by = ['userCode', 'datetime'])
    
    # keep only test projects which are in the training dataset
    project_train = set(df_train['project_id'].values)
    df_test = df_test[df_test['project_id'].isin(project_train)]
    
    # keep only test users which are in the training dataset
    user_train = set(df_train['userCode'].values)
    df_test = df_test[df_test['userCode'].isin(user_train)]
    
    print('# of train dataset:', len(df_train))
    print('# of test dataset:', len(df_test))

    return df_train, df_test

In [None]:
date_ = datetime.date(2018, 2, 20)
df_train, df_test_full = SplitTrainTest(df_filter, date = date_)

In [None]:
df_test_indexed = df_test_full[['userCode', 'project_id', 'interacted', 'flag1Prj']].drop_duplicates().set_index('userCode')
df_test_indexed.head(3)

## Data Preparation
In this part, we'll create a user-item matrix for calculating the similarity between users. A user-item matrix can take many forms; in this lab, we'll create 3 matrices:
- 0/1 matrix
- rating matrix
- rating + user profile matrix

### - Case 1: 0/1 matrix
In this part we'll prepare the interaction data to identify which projects each user interacted with. We need the data in the format below.
```
|------------+---+---+---+----+---|
| project_id | 1 | 2 | 3 | .. | j |
| userCode   |   |   |   |    |   |
|------------+---+---+---+----+---|
| user A     | 1 | 0 | 0 | .. | 1 |
| user B     | 0 | 1 | 0 | .. | 1 |
| user C     | 0 | 0 | 1 | .. | 0 |
|  ..        | . | . | . | .. | . |
|------------+---+---+---+----+---|

```

- 1 means the user interacted with that project
- 0 means the user didn't interact with that project

In [None]:
df_train_indexed = df_train[['userCode', 'project_id', 'interacted']].drop_duplicates()
df_train_pivot = (df_train_indexed.pivot(index = 'userCode', columns = 'project_id', values = 'interacted')
                                  .fillna(0))
df_train_pivot.head(3)

Convert the user-item pivot table to a matrix for the model, and set the index of df_train_indexed for fast lookup.

In [None]:
df_train_matrix = df_train_pivot.values
df_train_matrix.shape

In [None]:
df_train_indexed = df_train_indexed.set_index('userCode')
df_train_indexed.head(3)

### - Case 2: Rating matrix
Define the rating by the number of interactions with each project and scale it by binning (look at the distribution of the data to define the boundaries). For example:

- (0, 1]   ==> rating = 1
- (1, 2]   ==> rating = 2
- (2, 4]   ==> rating = 3
- (4, 7]   ==> rating = 4
- (7, inf) ==> rating = 5

```
|------------+---+---+---+----+---|
| project_id | 1 | 2 | 3 | .. | j |
| userCode   |   |   |   |    |   |
|------------+---+---+---+----+---|
| user A     | 4 | 0 | 0 | .. | 2 |
| user B     | 0 | 3 | 0 | .. | 1 |
| user C     | 0 | 0 | 5 | .. | 0 |
|  ..        | . | . | . | .. | . |
|------------+---+---+---+----+---|


```
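The binning step can be sketched with `pd.cut`. This is a minimal illustration on toy data, not the lab's reference solution; `toy_log`, `toy_rating`, and the `n_interactions` column name are assumptions, and the bin edges simply copy the example boundaries given above.

```python
import numpy as np
import pandas as pd

# Toy log: user a hits project 1 five times, b hits project 2 three times, c hits project 3 once.
toy_log = pd.DataFrame({
    "userCode":   ["a"] * 5 + ["b"] * 3 + ["c"],
    "project_id": [1, 1, 1, 1, 1, 2, 2, 2, 3],
})

# Number of interactions per (user, project) pair.
toy_rating = (toy_log.groupby(["userCode", "project_id"])
                     .size()
                     .reset_index(name="n_interactions"))

# Right-closed bins: (0, 1], (1, 2], (2, 4], (4, 7], (7, inf).
bins = [0, 1, 2, 4, 7, np.inf]
toy_rating["rating"] = pd.cut(toy_rating["n_interactions"],
                              bins=bins, labels=[1, 2, 3, 4, 5]).astype(int)

# Users as rows, projects as columns, 0 where there was no interaction.
toy_rating_pivot = (toy_rating.pivot(index="userCode",
                                     columns="project_id",
                                     values="rating")
                              .fillna(0))
print(toy_rating_pivot)
```

On the real data, the boundaries are better chosen from the boxplot of interaction counts than copied from this example.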

------------------------------------------------------------------------------------------

#### ! TO DO: prepare a rating user-item matrix like the one above **
------------------------------------------------------------------------------------------

In [None]:
# count the number of interactions grouped by userCode and project_id
df_train_rating = <FILL IN>

In [None]:
# distribution of number of interactions
df_train_rating.groupby(['userCode', 'project_id']).size().reset_index()[[0]].boxplot()
plt.show()

In [None]:
df_train_rating['rating'] = <FILL IN>

In [None]:
df_train_rating_pivot = (df_train_rating.pivot(index = <FILL IN>
                                             ,columns = <FILL IN>
                                             ,values = <FILL IN>)
                                       .fillna(0))
df_train_rating_pivot.head(3)

In [None]:
df_train_rating_matrix = <FILL IN>
print(df_train_rating_matrix.shape)

In [None]:
df_train_rating_index = df_train_rating.<FILL IN>

### - Case 3: Rating + user profile matrix
Take the rating matrix above and concatenate it with user profile features, e.g. the share of each user's activity per weekday.

```
|------------+---+---+---+----+---+-----+-----+-----+-----+-----+-----+-----|
| project_id | 1 | 2 | 3 | .. | j | Mon | Tue | Wed | Thu | Fri | Sat | Sun |
| userCode   |   |   |   |    |   |     |     |     |     |     |     |     |
|------------+---+---+---+----+---+-----+-----+-----+-----+-----+-----+-----|
| user A     | 4 | 0 | 0 | .. | 2 | 0.1 | 0.3 | 0.2 | 0.0 | 0.0 | 0.1 | 0.3 |
| user B     | 0 | 3 | 0 | .. | 1 | 0.2 | 0.2 | 0.1 | 0.1 | 0.1 | 0.3 | 0.0 |
| user C     | 0 | 0 | 5 | .. | 0 | 0.0 | 0.1 | 0.1 | 0.0 | 0.0 | 0.5 | 0.3 |
|  ..        | . | . | . | .. | . | ... | ... | ... | ... | ... | ... | ... |
|------------+---+---+---+----+---+-----+-----+-----+-----+-----+-----+-----|
```


In [None]:
df_train['weekday'] = df_train['datetime'].dt.dayofweek

In [None]:
# proportion of each user's interactions that fall on each weekday
weekday = df_train.groupby(['userCode', 'weekday']).size()
# divide each count by the user's total; transform keeps the original index
weekday = (weekday / weekday.groupby(level = 0).transform('sum')).round(2).reset_index(name = 'proportion')
user_weekday = weekday.pivot(index = 'userCode', columns = 'weekday', values = 'proportion').fillna(0).reset_index()
user_weekday.columns =  list(user_weekday.columns.values[:1]) + ['day' + str(col) for col in user_weekday.columns.values[1:]]
user_weekday.head(3)

------------------------------------------------------------------------------------------

#### ! TO DO: **
- create other user profile features
- merge the user profile with the user ratings
- create the matrix

------------------------------------------------------------------------------------------

In [None]:
<FILL IN: CREATE USER PROFILE>

In [None]:
df_train_userprofile = pd.merge(df_train_pivot.reset_index()
                                , user_weekday
                                , how='left'
                                , on=['userCode'])
df_train_userprofile.head(3)

Set "userCode" as the index for fast lookup and create the user-item matrix.

In [None]:
df_train_userprofile_indexed = <FILL IN>
df_train_userprofile_matrix = <FILL IN>
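A possible shape for this step, shown on toy stand-ins rather than the lab's dataframes. All `toy_*` names and values are hypothetical; only the merge pattern mirrors the cell above.

```python
import pandas as pd

# Toy stand-ins for the rating pivot and a weekday profile (hypothetical values).
toy_rating_pivot = pd.DataFrame(
    {1: [4.0, 0.0], 2: [0.0, 3.0]},
    index=pd.Index(["user A", "user B"], name="userCode"),
)
toy_user_weekday = pd.DataFrame({
    "userCode": ["user A", "user B"],
    "day0": [0.7, 0.2],
    "day1": [0.3, 0.8],
})

# Left-join the profile onto the ratings, then index by userCode for fast lookup.
toy_profile = pd.merge(toy_rating_pivot.reset_index(), toy_user_weekday,
                       how="left", on="userCode")
toy_profile_indexed = toy_profile.set_index("userCode").fillna(0)

# Dense array for the model; row order matches toy_profile_indexed.index.
toy_profile_matrix = toy_profile_indexed.values
print(toy_profile_matrix.shape)
```

Ratings run from 0 to 5 while the weekday proportions run from 0 to 1, so the rating columns tend to dominate most distance metrics. Rescaling one of the two groups is a design choice worth testing.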

## Implementing KNN for recommender system
In this part, we'll use sklearn for the kNN algorithm. For more information, see http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html

### - Fitting model

In [None]:
# library for knn
from sklearn.neighbors import NearestNeighbors

------------------------------------------------------------------------------------------
#### ! TO DO: Use NearestNeighbors in sklearn to fit our data **

For more information, see http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html

------------------------------------------------------------------------------------------
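As a rough illustration of the fit-and-query cycle, the sketch below uses a 4 x 5 toy matrix. The cosine metric and `n_neighbors=2` are assumptions for the example, not the values the lab expects, and every `toy_*` name is hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 0/1 user-item matrix: 4 users x 5 projects (hypothetical data).
toy_matrix = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
])

# Cosine distance = 1 - cosine similarity; the 'brute' algorithm supports it.
toy_knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
toy_knn.fit(toy_matrix)

# Query with user 0's own row, reshaped to (1, n_projects).
toy_distances, toy_indices = toy_knn.kneighbors(toy_matrix[0].reshape(1, -1))
print(toy_distances, toy_indices)
```

Because the query row is itself part of the fitted data, the closest neighbor is the user itself at distance 0. Keep this in mind when weighting by inverse distance later.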

In [None]:
n_neighbors = <FILL IN: define number of nearest neighbors>
metric = <FILL IN: define metric for calculating similarity>

In [None]:
knn = NearestNeighbors(metric = metric
                        , algorithm = 'brute'
                        ,  n_neighbors = n_neighbors)

In [None]:
knn.fit(df_train_matrix)

### - Finding k nearest neighbors by index and get their rating of each project
- __Input:__ user_id
- __Output:__ distance and indices of k nearest neighbors

In [None]:
user_id = '0383072a-6827-1246-6490-39fc4d46bcd'

In [None]:
df_train_pivot[df_train_pivot.index == user_id].iloc[0].values.reshape(1, -1)

In [None]:
distances, indices = knn.kneighbors(df_train_pivot[df_train_pivot.index == user_id].iloc[0].values.reshape(1, -1)
                                     , n_neighbors = n_neighbors)
print('distance:', distances)
print('indices:', indices)

Get the interaction values of the k nearest neighbors by their indices and calculate a rating for each project.

In [None]:
k_rating_matrix = df_train_matrix[indices,]
k_rating_matrix.shape 

In [None]:
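# k_rating_matrix has shape (1, k, n_projects); take the single query row
k_rating = k_rating_matrix[0]
# unweighted average over the k neighbors gives one score per project
k_weight_rating = k_rating.sum(axis=0)/n_neighbors
print(k_weight_rating)
print(k_weight_rating.shape)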

------------------------------------------------------------------------------------------
__! TO DO: Calculate "k_weight_rating" using the distances to weight the ratings **__

k_weight_rating = sum(k_rating * (1/distance)) / sum(1/distance), where both sums run over the k nearest neighbors

------------------------------------------------------------------------------------------
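The inverse-distance weighting can be sketched on hand-made numbers. Two choices below are assumptions rather than part of the lab: the first neighbor (the query user itself, at distance 0) is dropped, and a small `eps` is added to guard against division by zero. All `toy_*` names are hypothetical.

```python
import numpy as np

# Hypothetical output of kneighbors for one user with k = 3.
toy_distances = np.array([[0.0, 0.2, 0.5]])   # shape (1, k)
toy_neighbor_ratings = np.array([             # shape (k, n_projects)
    [1, 0, 1],   # the query user itself
    [1, 1, 0],
    [0, 1, 0],
])

# Assumption: skip the first neighbor, which is the query user at distance 0.
d = toy_distances[0][1:]
r = toy_neighbor_ratings[1:]

eps = 1e-6                      # assumed smoothing term, avoids 1/0
weights = 1.0 / (d + eps)       # closer neighbors get larger weights

# Weighted average per project: sum_i w_i * r_ij / sum_i w_i
toy_weight_rating = weights @ r / weights.sum()
print(toy_weight_rating.round(3))
```

If the query user is kept, its near-infinite weight makes the result almost identical to its own row, which is rarely what a recommender needs.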

In [None]:
k_weight_rating = <FILL IN>

Transform the result to a dataframe with columns 'project_id' and 'k_weight_rating'.

__! TO DO: Create a dataframe "recommend_df" sorted by 'k_weight_rating' that shows only the top-n projects. **__

In [None]:
# topn = <FILL IN>
recommend_df = (pd.DataFrame({"project_id": df_train_pivot.columns
                             ,"k_weight_rating": <FILL IN>})
                             .<FILL IN>
                             .<FILL IN>
               )
recommend_df

### - Items to ignore
We want to recommend new items, so we need to exclude the items a user has already interacted with before recommending.

------------------------------------------------------------------------------------------
__! TO DO: create a "get_item_interacted" function which returns the set of items each user interacted with, and ignore them in recommend_df **__

------------------------------------------------------------------------------------------

In [None]:
def get_item_interacted(df, user_id):
    """
    Args:
    - df = dataframe indexed by userCode which holds the interacted projects of users
    - user_id = user id
    
    Return:
    - set of project ids the user interacted with
    """
    interacted_projects = <FILL IN>
    return set(interacted_projects['project_id'])

In [None]:
items_to_ignore = get_item_interacted(<FILL IN>)
items_to_ignore

In [None]:
topn = <FILL IN>
recommend_df = pd.DataFrame({"project_id": df_train_pivot.columns
                            ,"k_weight_rating": <FILL IN>})

recommend_df =  (recommend_df[~recommend_df['project_id'].isin(items_to_ignore)]
                .<FILL IN>
                .<FILL IN>)             
recommend_df
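The filter-then-rank pattern can be tried on toy data first. This sketch is not the lab's reference answer: `get_item_interacted_sketch`, the `toy_*` frames, and the top-n value of 2 are all made up for illustration.

```python
import pandas as pd

# Toy indexed interactions, shaped like df_train_indexed (hypothetical data).
toy_train_indexed = pd.DataFrame({
    "userCode":   ["a", "a", "b"],
    "project_id": [10, 20, 30],
    "interacted": [1, 1, 1],
}).set_index("userCode")

def get_item_interacted_sketch(df, user_id):
    """Return the set of project ids the user has already interacted with."""
    # Passing a list to .loc keeps a Series of rows even for a single match.
    return set(df.loc[[user_id], "project_id"])

# Hypothetical predicted scores for four projects.
toy_scores = pd.DataFrame({
    "project_id":      [10, 20, 30, 40],
    "k_weight_rating": [0.9, 0.8, 0.6, 0.3],
})

items_seen = get_item_interacted_sketch(toy_train_indexed, "a")
# Drop already-seen projects, rank by score, keep the top 2.
toy_topn = (toy_scores[~toy_scores["project_id"].isin(items_seen)]
            .sort_values("k_weight_rating", ascending=False)
            .head(2))
print(toy_topn)
```

Projects 10 and 20 score highest but are removed because user `a` has already seen them.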

### Evaluation
We'll use the __MAP@k__ metric to evaluate the result.

Example of calculating MAP@5

```
| Ranks                                  | Hits (1/0)        | Precision at hits      | Average precision        |
|----------------------------------------+-------------------+------------------------+--------------------------|
| Actual rank: [2, 4, 1, 5]              |  [1, 0, 0, 1, 1]  | [1/1, 0, 0, 2/4, 3/5]  | (1 + 2/4 + 3/5)/4 = 0.53 |
| Recommended rank: [5, 9, 3, 1, 2]      |                   |                        |                          |
|----------------------------------------+-------------------+------------------------+--------------------------|
| Actual rank: [9, 6, 1]                 |  [1, 0, 0, 0, 0]  | [1/1, 0, 0, 0, 0]      | (1/1)/3 = 0.33           |
| Recommended rank: [9, 2, 5, 0, 4]      |                   |                        |                          |
|----------------------------------------+-------------------+------------------------+--------------------------|
| Actual rank: [6, 0, 4]                 |  [0, 0, 0, 1, 1]  | [0, 0, 0, 1/4, 2/5]    | (1/4 + 2/5)/3 = 0.22     |
| Recommended rank: [1, 10, 11, 4, 6]    |                   |                        |                          |
|----------------------------------------+-------------------+------------------------+--------------------------|
```
Mean Average Precision @ 5 = (0.53 + 0.33 + 0.22)/3 = 0.36

Note: the function below computes AP for only 1 user. If you would like to evaluate all users, please submit a file on Kaggle :)
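The worked MAP@5 example can be checked with a few lines of code. `average_precision_at_k` is a hypothetical helper written for this check, separate from the lab's `ap_func`, and it keeps full precision instead of rounding each term.

```python
def average_precision_at_k(actual, recommended, k=5):
    """AP@k: sum of precision at each hit rank, divided by min(#actual, k)."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    hits, precision_sum = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in actual_set:
            hits += 1
            precision_sum += hits / i   # precision at rank i
    return precision_sum / min(len(actual_set), k)

# (actual, recommended) pairs taken from the example table.
cases = [
    ([2, 4, 1, 5], [5, 9, 3, 1, 2]),
    ([9, 6, 1],    [9, 2, 5, 0, 4]),
    ([6, 0, 4],    [1, 10, 11, 4, 6]),
]
ap_scores = [average_precision_at_k(a, r, k=5) for a, r in cases]
map_at_5 = sum(ap_scores) / len(ap_scores)
print(ap_scores, map_at_5)
```

The unrounded per-user values are 2.1/4, 1/3, and 0.65/3, which match the table once rounded to two decimals.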

In [None]:
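def ap_func(actual_list, recommend_list, k=7):
    """Average precision at k for a single user."""
    actual_set = set(actual_list)
    m = len(actual_set)
    if m == 0:
        return 0.0  # no relevant items, so AP is taken as 0

    hits = 0
    precision = 0.0
    # only the first k recommendations count
    for i, item_ in enumerate(recommend_list[:k]):
        if item_ in actual_set:
            hits += 1
            precision += hits/(i+1)  # precision at rank i+1

    ap = round(precision/min(m, k), 2)
    return ap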

In [None]:
actual_list = <FILL IN>
recommend_list = <FILL IN> 
ap = ap_func(actual_list, recommend_list, 7)
print(ap)

### Transform data for submit to kaggle

The input dataframe for the 'transform_to_kaggle' function consists of 2 columns:
- userCode
- project_id : ordered by recommendation rank (7 recommendations per user)

```
|------------+--------------|
|  userCode  |  project_id  |
|------------+--------------|
| user A     |      4       |
| user A     |     21       |
| user A     |     34       |
|  ..        |     ..       |
```

In [None]:
def transform_to_kaggle(recommend_df):
    
    """
    Input:
        - recommend_df: dataframe with columns 'userCode' and 'project_id',
          one row per recommendation, ordered by rank within each user
    
    Returns:
        - testing_df: one row per user, with 'project_id' holding the
          recommended project ids joined by spaces
    """
    testing_dataset = []
    recommend_df_indexed = recommend_df.set_index('userCode')
    
    for user_id in recommend_df_indexed.index.unique():
        
        # passing a list to .loc keeps all rows of the user, even a single one
        rank_actual = list(recommend_df_indexed.loc[[user_id], 'project_id'].values)

        if len(rank_actual) > 0:
            rank_actual_str = ' '.join(str(r) for r in rank_actual)
            testing_dataset.append({"userCode": user_id
                                   ,"project_id": rank_actual_str})
            
    testing_df = pd.DataFrame(testing_dataset, columns = ['userCode', 'project_id'])
    return testing_df

In [None]:
# recommend_df must hold 'userCode' and 'project_id' columns for every user to submit
submit_file = transform_to_kaggle(recommend_df)