<a href="https://colab.research.google.com/github/Pataweepr/applyML_vistec_2019/blob/master/hw2_HomeDotTech_part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
from scipy.stats import mode
from sklearn import preprocessing

# HW2: Thailand Real Estate Recommendation using k-NN

The data used here is part of the [Home Hackathon 2018](https://www.homedottech.com/homehackathon-2018/), courtesy of HomeDotTech.

[Home.co.th](https://www.home.co.th/) is a website that has the most comprehensive data about real estate in Thailand. It is one of the leading real estate websites in Thailand and has millions of page views per month.

The data we will be using are page view logs from the website, and the properties of houses and condominiums around Thailand.

We would like to use this data to create simple recommendation systems for real estate.

First, to get the data, go to this [link](https://drive.google.com/file/d/1X-cacRIF30acXyKotSmDZu5qc8Gs84Wm/view?usp=sharing) and click **add to Drive**.

We will link this data to our Google Colab notebook. Run the command below, click the link it prints, and follow the steps to let Google Colab access your Google Drive: copy the authorization code and paste it into the space provided.

In [0]:
from google.colab import drive
drive.mount('/content/gdrive/')

Now that we have mounted the drive, we can access it with simple Linux commands. To run Linux commands in Colab, put the prefix "!" in front of the command.

The command below unzips the *For_participants.zip* file and lists the current directory on the machine that hosts our Colab.

In [0]:
!unzip '/content/gdrive/My Drive/For_participants.zip'
!ls

You should notice three files:


1.   **project_main.csv** describes each property. Details such as location (latitude, longitude), price, housing type, build date, etc. are listed here.
2.   **project_facility.csv** lists the facilities in each property. Each line holds the id of the project followed by the type of facility (another kind of id). The facility ids are:
    *   1: swimming pool
    *   2: club house
    *   3: park
    *   4: fitness
    *   5: security
    *   6: playground
3.   **userLog_201801_201802_for_participants.csv** shows the user page views. Each view entry has a user id, the project visited, time, etc. The data is from January 2018 to February 22nd.

You can also read the data dictionary for more information [here](https://drive.google.com/file/d/1uN8lRjoJQ3f-Ui69V5wwWGw13elvvaXq/view?usp=sharing)

The files are in CSV format. **However, the delimiter used is a ';' instead of a ','**. A sample function for reading the provided files is shown below.

In [0]:
def readDataFromDrive(file_name):
  # The provided files use ';' rather than ',' as the field delimiter.
  raw_data = pd.read_csv(file_name, delimiter=';')
  return raw_data

# Data exploration and cleaning

## TODO#1: Explore project_main.csv

Read *project_main.csv* and explore the data using the pandas DataFrame methods *head()* and *describe()*.

What columns do you think are redundant?

**Ans:**

Drop those columns.

We will use only the projects in Bangkok (province_id = 10). Remove the projects outside of Bangkok.

Also, convert text-coded columns to numbers where possible, then use *.infer_objects()* so that object columns holding only numbers get a numeric dtype.

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
project_main = readDataFromDrive('project_main.csv')
project_main = project_main.loc[project_main['province_id'] == 10]
print(project_main.shape)
project_main = project_main.drop(columns=['project_land_size_ngan', 'project_land_size_wa'])
project_main.loc[project_main["project_status"] == "A", "project_status"] = 1
project_main.loc[project_main["project_status"] == "U", "project_status"] = 0
project_main = project_main.infer_objects()
print(project_main.head())
project_main.describe()
        </code>
      </pre>
</details>



## TODO#2: NaN removal

Note that there are many missing values in the data. Remove the projects with no location information (no latitude or longitude). Fill the missing starting prices with the mode of the starting price.

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
project_main = project_main.dropna(subset=['lat','lon'])
project_main["starting_price"] = project_main["starting_price"].fillna(project_main["starting_price"].mode().iloc[0])
        </code>
      </pre>
</details>

## TODO#3: Explore project_facility.csv

Check out the file *project_facility.csv*. If we want to describe each project using this information, is it easy to use the information as provided? Why?

**Ans:**


In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


## TODO#4: Change the data into a usable format
Change the format of project_facility into a table that describes, in binary format (yes/no), whether a project has a certain type of facility. Hint: you can use [pd.crosstab](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html) to do this.
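To make the target format concrete, here is a minimal sketch on made-up data. The project and facility ids in `toy_facility` are invented for illustration and are not taken from the real file.

```python
import pandas as pd

# Toy long-format table: one row per (project, facility) pair.
# All ids here are made up for illustration.
toy_facility = pd.DataFrame({
    "project_id": [101, 101, 102, 103, 103, 103],
    "facility_id": [1, 4, 5, 1, 3, 6],
})

# crosstab counts how often each pair occurs; pairs that never occur become 0.
toy_table = pd.crosstab(toy_facility["project_id"], toy_facility["facility_id"])

# Clip to 0/1 in case the same pair is listed more than once.
toy_table = (toy_table > 0).astype(int)
print(toy_table)
```

Each row is now one project, and each column answers "does this project have facility X?".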

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
project_facility_table = pd.crosstab(project_facility['project_id'], project_facility['facility_id'], rownames=['project_id'], colnames=['facility_id'])
print(project_facility_table.index)
project_facility_table.head()
        </code>
      </pre>
</details>

## TODO#5: Remove projects that do not exist in both files

We want to use only projects that have facility information. Remove the projects that do not exist in both files. You may find [pandas.DataFrame.isin](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html) useful.
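As a rough illustration of membership filtering with `isin`, the sketch below keeps only the ids shared by two toy tables. The frames `main` and `facility` and their contents are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables.
main = pd.DataFrame({"project_id": [1, 2, 3, 4],
                     "lat": [13.7, 13.8, 13.9, 14.0]})
facility = pd.DataFrame({"has_pool": [1, 0, 1]},
                        index=pd.Index([2, 3, 5], name="project_id"))

# Keep rows of `main` whose id appears in the facility index, and vice versa.
main_kept = main.loc[main["project_id"].isin(facility.index)]
facility_kept = facility.loc[facility.index.isin(main["project_id"])]

print(main_kept)
print(facility_kept)
```

Filtering in both directions leaves the two tables with exactly the same set of project ids.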

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
project_facility_table = project_facility_table.loc[project_facility_table.index.isin(project_main['project_id'])]
project_main = project_main.loc[project_main['project_id'].isin(project_facility_table.index.astype(int))]
project_main = project_main.sort_values('project_id')
print(project_facility_table.shape)
print(project_main.shape)
project_facility_table.head()
        </code>
      </pre>
</details>

In [0]:
# Double check whether this looks okay
project_main.head()

In [0]:
# Double check whether this looks okay
project_main.iloc[0]

## TODO#6: Explore userLog_201801_201802_for_participants.csv

Read the userLog file and look at the data.

How many entries are there?

**Ans:**

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


Remove the entries for the projects that we have already removed.

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


<details>
    <summary>SOLUTION HERE!</summary>
    <code>
  userLog = userLog.loc[userLog['project_id'].isin(project_facility_table.index.astype(int))]
  </code>
</details>



## TODO#7: Let's see how often the users visit Home.co.th

To explore the userLog file, we want to see how the users behave.

First, we will aggregate the number of times each user visits the site.

Use [pd.value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) to help count the number of times each user visits the site.
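For reference, this is what `value_counts` returns on a tiny made-up log. The user codes are invented, and only the `userCode` column name is borrowed from the real file.

```python
import pandas as pd

# Made-up page-view log: one row per view.
toy_log = pd.DataFrame({"userCode": ["u1", "u2", "u1", "u3", "u1", "u2"]})

# One entry per user, sorted from most to fewest views.
views_per_user = toy_log["userCode"].value_counts()
print(views_per_user)
```

The result is a Series indexed by user code, so the most active users come first.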

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


<details>
    <summary>SOLUTION HERE!</summary>
  <pre>
  <code>
  userLog_freq_user = userLog['userCode'].value_counts()
  userLog_freq_user_np  = np.array(userLog_freq_user.values)
  print(userLog_freq_user_np.shape)
  userLog_freq_user.head()
  </code>
  </pre>
</details>




Show a histogram of the view counts per user. Also use *pd.head()* to list the top viewers of this site.

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


## TODO#8: View statistics
Find the mean, median, and mode of the view counts. Also find the 95th percentile of the view counts. Hint: use [np.percentile](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.percentile.html)
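The sketch below computes the same four statistics on made-up view counts, to show how a few heavy users pull the mean away from the median. The numbers in `toy_counts` are invented, and the mode is computed with `np.unique` only to keep the example NumPy-only.

```python
import numpy as np

# Made-up view counts per user, with one extreme outlier.
toy_counts = np.array([1, 1, 1, 2, 2, 3, 5, 8, 40, 500])

mean_views = toy_counts.mean()
median_views = np.median(toy_counts)

# Mode: the value with the highest frequency.
values, freq = np.unique(toy_counts, return_counts=True)
mode_views = values[np.argmax(freq)]

# 95th percentile (NumPy interpolates between data points by default).
p95_views = np.percentile(toy_counts, 95)

print(mean_views, median_views, mode_views, p95_views)
```

In this toy case the mean is far above the median, which is the kind of skew worth looking for in the real log.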

**Ans:**

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


## TODO#9: userLog cleaning

From TODO#7-8, and the fact that this is a log from a 50-day period, are there any problems with the data contained in *userLog*? Explain your hypothesis.

**Ans:**

## TODO#10: userLog pruning

Let's prune out the suspicious users. To be safe, we will only **keep users that have more than 4 views and less than 41 views**. We have a minimum cutoff so that we can have better information about the users to train our recommendation system.

You may use [pd.Index](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Index.html) to help with the filtering.

Check your answer by plotting a histogram of the views after the pruning.

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


<details>
    <summary>SOLUTION HERE!</summary>
  <pre>
  <code>

userLog_freq_user = userLog_freq_user.loc[userLog_freq_user <= 40 ]
userLog_freq_user = userLog_freq_user.loc[userLog_freq_user >= 5 ]
userLog_freq_user_np = np.array(userLog_freq_user.values)

n, bins, patches = plt.hist(userLog_freq_user_np, 100, density=True, facecolor='g', alpha=0.75)
plt.show()

userLog = userLog.loc[userLog['userCode'].isin(userLog_freq_user.index)]
  </code>
  </pre>
</details>



## Optional

Can you filter userLog better by using daily views instead of total views? For example, a user shouldn't view more than 20 pages within a day, etc.
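One possible way to approach this, sketched on made-up data: count views per user per day, take each user's busiest day, and drop users whose peak exceeds a threshold. The column names follow the ones used elsewhere in this notebook (`userCode`, `year`, `month`, `day`); the threshold `max_daily_views` and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Made-up log: user "b" views three pages on the same day.
toy_log = pd.DataFrame({
    "userCode":   ["a", "a", "a", "b", "b", "b", "c"],
    "year":       [2018] * 7,
    "month":      [2] * 7,
    "day":        [15, 15, 16, 15, 15, 15, 15],
    "project_id": [1, 2, 3, 1, 2, 4, 2],
})

# Hypothetical threshold; the text above suggests something like 20.
max_daily_views = 2

# Views per user per calendar day.
daily = toy_log.groupby(["userCode", "year", "month", "day"]).size()

# Busiest single day for each user.
peak = daily.groupby(level="userCode").max()

# Keep only users whose busiest day stays within the threshold.
keep_users = peak[peak <= max_daily_views].index
toy_log_kept = toy_log.loc[toy_log["userCode"].isin(keep_users)]
print(toy_log_kept)
```

Unlike a cutoff on total views, this keeps a steady visitor who browses a little every day while still removing bursts that look automated.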

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################

------

# Recommendation system

## TODO#11: Train and test split

We will split the log data into a training set and a test set for our recommendation system.

For the training set, we will use the days from 15 to 18 February (inclusive).
For the test set, we will use the days from 19 to 22 February (inclusive).

Filter the userLog dataframe using these criteria, and create two dataframes:
*userLog_train* and *userLog_test*.

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


<details>
    <summary>SOLUTION HERE!</summary>
  <pre>
<code>
userLog_freq_day = pd.crosstab(userLog['userCode'], [userLog['year'], userLog['month'], userLog['day']], rownames =['userCode'], colnames=['year','month','day'])

userLog_last_month  = userLog.loc[userLog['month'] == 2]
userLog_train = userLog_last_month.loc[ userLog_last_month['day'] >=15]
userLog_train = userLog_train.loc[ userLog_train['day'] < 19]
userLog_test = userLog_last_month.loc[ userLog_last_month['day'] >= 19]

print(userLog_train.shape)
print(userLog_test.shape)

userLog_freq_day.head()
</code>
</pre>
</details>







## TODO#12: Format the train and test data
Create a dataframe as specified below. Hint: use pd.crosstab

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################
# Create freq_user_seen_proj_train: a dataframe that has
# User id as rows
# Project id as columns (the value is the number of visits for that user)
# using the training data


<details>
    <summary>SOLUTION HERE!</summary>
  <pre>
<code>
freq_user_seen_proj_train = pd.crosstab(userLog_train['userCode'], userLog_train['project_id'], rownames=['userCode'], colnames=['project_id'])
freq_user_seen_proj_train_np = np.array(freq_user_seen_proj_train.values)
print(freq_user_seen_proj_train_np.shape)

freq_user_seen_proj_train.head()
</code>
</pre>
</details>



In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################
# Create freq_user_seen_proj_test: a dataframe that has
# User id as rows
# Project id as columns (the value is the number of visits for that user)
# using the test data


<details>
    <summary>SOLUTION HERE!</summary>
  <pre>
<code>
freq_user_seen_proj_test = pd.crosstab(userLog_test['userCode'], userLog_test['project_id'], rownames=['userCode'], colnames=['project_id'])
freq_user_seen_proj_test_np = np.array(freq_user_seen_proj_test.values)
print(freq_user_seen_proj_test_np.shape)

freq_user_seen_proj_test.head()
</code>
</pre>
</details>




## TODO#13:  Normalize the data

In machine learning, we usually normalize the data before putting it into a machine learning model. If we do not normalize the data, distance-based models such as k-NN tend to be dominated by the feature with the largest numeric scale.

For example, consider the following projects:


```
           lat  lon   price
Project A: 13.0 100.2 4000000
Project B: 14.0 105.3 4500000
Project C: 13.1 100.1 5000000
```

If we compute the Euclidean distance between each pair of projects, project A will be closer to project B than to project C, even though projects A and C lie within the same region. The price differences (in the hundreds of thousands) swamp the latitude and longitude differences.

Thus, we usually scale (or in machine learning jargon, normalize) each input feature to be in the same range.

We can do either **min-max scaling** (scale the min to 0 and the max to 1) or **standardization** (scale the input to have a mean of 0 and a standard deviation of 1), among other options, depending on the type of data. We use min-max scaling for features we believe have a limited range, and standardization otherwise.
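The effect on the three example projects can be checked numerically. The sketch below is only an illustration on those three rows; the helper name `dist_ab_ac` is made up, and the scaling is written out by hand for this case where every column has a non-zero range.

```python
import numpy as np

# Rows: project A, B, C. Columns: lat, lon, price.
projects = np.array([
    [13.0, 100.2, 4_000_000.0],
    [14.0, 105.3, 4_500_000.0],
    [13.1, 100.1, 5_000_000.0],
])

def dist_ab_ac(x):
    """Return the Euclidean distances A-B and A-C."""
    return np.linalg.norm(x[0] - x[1]), np.linalg.norm(x[0] - x[2])

raw_ab, raw_ac = dist_ab_ac(projects)

# Min-max scale each column to [0, 1] (all columns here have max > min).
col_min = projects.min(axis=0)
col_max = projects.max(axis=0)
scaled = (projects - col_min) / (col_max - col_min)
scaled_ab, scaled_ac = dist_ab_ac(scaled)

print("raw:    A-B =", raw_ab, " A-C =", raw_ac)
print("scaled: A-B =", scaled_ab, " A-C =", scaled_ac)
```

On the raw values A looks closer to B because price dominates; after scaling, A is closer to C, which matches the geography.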

Create a function *normalize(x)* that takes in a NumPy array *x* and scales each feature (column) to the range [0, 1]. You may find [Preprocessing data](https://scikit-learn.org/stable/modules/preprocessing.html?fbclid=IwAR3jCj2F4T1qZqgqwM-rYSlLC_WymKPa4tF4DYk5AfcwlIL0cyiIdi4VMpA#preprocessing-scaler) helpful.

Note that when we have training and test data, we have to normalize by the statistics in the training data, and then apply those statistics to the test data as well. However, for this particular lab we do not have to scale the features from the test data.
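To illustrate what "apply the training statistics to the test data" means, here is a small sketch that uses standardization on made-up arrays. The arrays `train` and `test` are invented, and the same pattern would apply to min-max scaling with the training min and max.

```python
import numpy as np

# Made-up data: 3 training rows and 1 test row, 2 features each.
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
test = np.array([[2.0, 40.0]])

# Statistics are computed from the training set only.
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)

# The same statistics are reused for both sets.
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

print(train_scaled)
print(test_scaled)
```

The test row is never used to compute the mean or standard deviation, so no information from the test set leaks into the preprocessing.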

In [0]:
def normalize(x):
  ################################################################################
  #                            WRITE YOUR CODE BELOW                             #
  ################################################################################
  
  return


In [0]:
# Test the function here.
# where each row is one data entry
# each column represents each feature value
x = np.array([[1.0,2.0,3.0,5.0],[4.0,5.0,6.0,5.0],[3.0,1.0,9.0,5.0]])
print(normalize(x))
# You should get the following
#[[0.         0.25       0.         0.        ]
# [1.         1.         0.5        0.        ]
# [0.66666667 0.         1.         0.        ]]

## Recommendation system using k-Nearest Neighbors

We will use k-NN to create a recommendation system. Given a new user, we find the most similar users and recommend the projects those users visited.

To do this, we have to answer two questions: **how do we define a user, and how do we quantify similarity between users**.

We can answer the first question by defining features that represent each user, and we can answer the second question by defining distance metrics to be used with k-NN.

For the following part of this lab, we will define a user using the average of the properties of the projects they visited.





## Recommendation using user features

### TODO#14:  user feature creation

Write a function that takes

1) userLog_data_frame: the DataFrame containing the userLog from the training set

2) user_code: the user ids to create the features for (e.g. freq_user_seen_proj_train.index)

and returns

np_data: a NumPy array in which each row corresponds to one user_code, and the columns contain the averages of the features of the projects that user visited

The features include


*   Latitude
*   Longitude
*   Starting Price
*   Facilities

In [0]:
def feature_sel(userLog_data_frame,user_code):
  np_data = np.zeros((user_code.shape[0], 3 + project_facility_table.shape[1]))

  for i in np.arange(user_code.shape[0]):
    user_sel = user_code[i]
    userLog_dataF_sel = userLog_data_frame.loc[userLog_data_frame['userCode'] == user_sel]
    proj_id_sel = userLog_dataF_sel['project_id'].values
    input_np_data = np.zeros((proj_id_sel.shape[0],3 + project_facility_table.shape[1]))
    for j in np.arange(proj_id_sel.shape[0]):
      user_proj_far = project_facility_table.loc[project_facility_table.index == proj_id_sel[j]]
      user_proj_main = project_main.loc[project_main['project_id'] == proj_id_sel[j]]
      input_np_data[j,:] = np.hstack(( np.array(user_proj_main[["lat","lon","starting_price"]].values) ,  np.array(user_proj_far.values) ))
  ################################################################################
  #                            WRITE YOUR CODE BELOW                             #
  ################################################################################
    # Compute the mean in input_np_data and put it in np_data, be careful with the axis

  return np_data;

In [0]:
# Test the function on a small subset of user_code first
project_main_np_data_tmp = feature_sel(userLog_train,freq_user_seen_proj_train.index[:10])
print(project_main_np_data_tmp)

<details>
    <summary>SOLUTION HERE!</summary>
  <pre>
<code>
def feature_sel(userLog_data_frame,user_code):
  np_data = np.zeros((user_code.shape[0], 3 + project_facility_table.shape[1]))
  
  for i in np.arange(user_code.shape[0]):
    user_sel = user_code[i]
    #print(user_sel)
    userLog_dataF_sel = userLog_data_frame.loc[userLog_data_frame['userCode'] == user_sel]
    proj_id_sel = userLog_dataF_sel['project_id'].values
    input_np_data = np.zeros((proj_id_sel.shape[0],3 + project_facility_table.shape[1]))
    # print(proj_id_sel)
    for j in np.arange(proj_id_sel.shape[0]):
      user_proj_far = project_facility_table.loc[project_facility_table.index == proj_id_sel[j]]
      user_proj_main = project_main.loc[project_main['project_id'] == proj_id_sel[j]]
      input_np_data[j,:] = np.hstack(( np.array(user_proj_main[["lat","lon","starting_price"]].values) ,  np.array(user_proj_far.values) ))
    np_data[i,:] = np.mean(input_np_data,axis=0)
  return np_data;
</code>
</pre>
</details>



### TODO#15: create the training features

Use *feature_sel()* and *normalize()* from above to create training features. This does take a while to run.

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


<details>
    <summary>SOLUTION HERE!</summary>
  <pre>
<code>
project_main_np_data_nor = feature_sel(userLog_train,freq_user_seen_proj_train.index)
project_main_np_data_nor = normalize(project_main_np_data_nor)
print(project_main_np_data_nor.shape)
</code>
</pre>
</details>



### TODO#16: perform nearest neighbors
Create a k-NN model using scikit-learn. See [Nearest Neighbors example](https://scikit-learn.org/stable/modules/neighbors.html) for how to use it.

Use *n_neighbors = 5* and *algorithm = 'ball_tree'* for the k-NN setting. A ball tree is one of several index structures that make nearest-neighbor queries faster.

Calculate the nearest neighbors of the data in the training set.

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


<details>
    <summary>SOLUTION HERE!</summary>
  <pre>
<code>
nbrs_proj = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(project_main_np_data_nor)
distances, indices = nbrs_proj.kneighbors(project_main_np_data_nor)
</code>
</pre>
</details>

## Evaluation 

We will be using [Mean Average Precision](http://sdsawtelle.github.io/blog/output/mean-average-precision-MAP-for-recommender-systems.html?fbclid=IwAR2UOfz0a_2Ig60aQ2HszgZq63nch96Hbzod2q54kDZRZT_mPzdsxJXyPj0) to evaluate our performance. 

The task is to recommend **projects that the user has not seen before.** Thus, we need to prune out the projects that the users have already seen in the training set.

To do so, we need to create a recommendation list from our k-NN results. We will use the following functions:

1.   proj_seen: create a dictionary of projects viewed by user
    *   data_freq_seen (input): DataFrame of view frequencies by user, in our case freq_user_seen_proj_train or freq_user_seen_proj_test
    *   output: dictionary of projects viewed by user. {userid: list of project ids}
2.   index_to_usercode: create a dictionary of usercode from nearest neighbor indices
    *   nn_index (input): output from k-NN (indices)
    *   user_code_list (input): list of user id, i.e. freq_user_seen_proj_train.index
    *   output :  dict of user id to nearest user ids. {userid: list of userids}
3.   proj_recommend: create the recommendation list
    *   nn_dict (input): output of index_to_usercode
    *   dict_seen_train (input): output of proj_seen from freq_user_seen_proj_train (training set)
    *   output: dict of recommended list for each user (contains old entries)
4.   proj_repeat_out: prune out the projects that the users have already seen
    *   dict_train (input): dict from training (output from proj_seen) 
    *   dict_test (input): dict from testing (output from proj_seen or proj_recommend)
    *   output: dict_test that has the projects in dict_train removed
5.   mean_average_precision: find MAP@k
    *   dict_recommend (input): dict of recommendations (only new items)
    *   dict_test (input): dict of ground truths (only new items)
    *   k (input): parameter for MAP@k
    *   output: MAP@k value
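To get a feel for the numbers, the sketch below scores a single user by hand. It reflects our reading of the per-user part of the helper that follows, which averages precision@1 through precision@k over the recommendation list; this is slightly different from the textbook average precision, which only averages at positions where a hit occurs. The function name `prefix_precision_mean` and the project ids are made up for this example.

```python
import numpy as np

def prefix_precision_mean(recommended, relevant, k):
    """Mean of precision@1..k for one user.

    Assumes `relevant` is non-empty; users without ground truth are
    not scored at all in this notebook's evaluation.
    """
    recommended = list(recommended)[:k]
    if len(recommended) == 0:
        return 0.0  # nothing recommended counts as a miss
    relevant = set(relevant)
    hits = 0
    precisions = []
    for i, proj in enumerate(recommended, start=1):
        hits += int(proj in relevant)
        precisions.append(hits / i)  # precision of the first i items
    return float(np.mean(precisions))

# Made-up example: the second recommendation is a hit.
score = prefix_precision_mean([10, 20, 30], [20, 40], k=3)
print(score)  # (0/1 + 1/2 + 1/3) / 3
```

MAP@k is then the mean of this per-user score over all users that have ground truth, so a hit near the top of the list is worth more than a hit near the bottom.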


In [0]:
def proj_recommend(nn_dict,dict_seen_train):
  nn_proj = {}
  for user_code_nn in nn_dict:
    list_proj = []
    for user_in_list in nn_dict[user_code_nn]:
      list_proj = list_proj + list(dict_seen_train[user_in_list])
    nn_proj[user_code_nn] = list_proj
  return nn_proj

####################################################################

def index_to_usercode(nn_index,user_code_list):
  nn_dict = {}
  for i in np.arange(nn_index.shape[0]):
    nn_list = []
    nn_name = user_code_list[i]
    for j in np.arange(nn_index.shape[1]):
      ind = nn_index[i][j]
      nn_list.append(user_code_list[ind])
    nn_dict[nn_name] = nn_list
  return nn_dict

####################################################################

def proj_repeat_out(dict_train,dict_test):
  dict_out = {} 
  for user_code in dict_test:
    if user_code in dict_train:
      list_proj_train = np.array(dict_train[user_code])
      list_proj_test = np.array(dict_test[user_code])
      bool_list = np.isin(list_proj_test ,list_proj_train)
      dict_out[user_code] = list(list_proj_test[~bool_list])
  return dict_out

####################################################################

def proj_seen(data_freq_seen):
  all_colums = data_freq_seen.columns
  all_index = data_freq_seen.index
  output_list = {}
  for index in np.arange(len(all_index)):
    np_array = np.array(data_freq_seen.iloc[index])
    output_list[all_index[index]] = all_colums[np_array > 0].values
  return output_list

####################################################################

def mean_average_precision(dict_recomment,dict_test,k):
  list_map_at_k = []
  for user_code in dict_test:
    list_map_at_k_user_i = []
    list_proj_rec = np.array(dict_recomment[user_code])
    list_proj_test = np.array(dict_test[user_code])
    
    ##########
    
    for i in np.arange(k):
      if(i == len(list_proj_rec) or len(list_proj_test) == 0 or len(list_proj_rec) == 0):
        break
      bool_list = np.isin(list_proj_rec[0:i+1],list_proj_test)
      list_map_at_k_user_i.append(np.sum(bool_list)/(i+1))
    
    ##########
    # if fail   
    if len(list_map_at_k_user_i) == 0:
      if len(list_proj_test) != 0:
        list_map_at_k_user_i.append(0)
    
    
    if len(list_map_at_k_user_i) != 0:
      number = np.mean(np.array(list_map_at_k_user_i))
      list_map_at_k.append(number)

  return np.mean(np.array(list_map_at_k))

In [0]:
# Create the groundtruth and the recommendation list

dict_user_test = proj_seen(freq_user_seen_proj_test)
dict_user_train = proj_seen(freq_user_seen_proj_train)
dict_user_test = proj_repeat_out(dict_user_train,dict_user_test)

nn_dict_rec = index_to_usercode(indices,freq_user_seen_proj_train.index)

dict_user_rec = proj_recommend(nn_dict_rec,dict_user_train)
dict_user_rec = proj_repeat_out(dict_user_train,dict_user_rec)

#################################################################

k = 3
map_k = mean_average_precision(dict_user_rec,dict_user_test,k)
print('Your MAP@k result : ', map_k)

You should get 0.0177522349936143.

## Recommendation using views (collaborative filtering)

### TODO#14:  number of views matrix creation

Create a NumPy matrix that contains the number of views for each user-project pair, normalize it, and then create a k-NN recommender.

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


<details>
    <summary>SOLUTION HERE!</summary>
  <pre>
  <code>
  freq_user_seen_proj_train_nor = normalize(freq_user_seen_proj_train_np) 
  nbrs_user_numberical = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(freq_user_seen_proj_train_nor) 
  distances, indices = nbrs_user_numberical.kneighbors(freq_user_seen_proj_train_nor)
  </code>
  </pre>
</details>




Evaluate the recommendations in the same way as for the user-feature case.

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


### TODO#15:  binary view matrix creation

There are many ways to create a user-project view matrix. Previously, we used the number of views. However, we can also use a binary representation that signifies whether a user has viewed a project or not.

Repeat the previous steps to create a k-NN recommender. Be sure to use **Jaccard** distance which is better suited for binary features. For more information about Jaccard distance see [distance function](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html#sklearn.neighbors.DistanceMetric).
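As a quick reminder of what the Jaccard distance measures, it is one minus the size of the intersection divided by the size of the union of the two users' viewed-project sets. The sketch below computes it by hand on made-up vectors; the function `jaccard_distance` is written here only for illustration and is not the scikit-learn implementation.

```python
import numpy as np

def jaccard_distance(u, v):
    """1 - |intersection| / |union| for two binary vectors."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    union = np.logical_or(u, v).sum()
    if union == 0:
        return 0.0  # convention chosen here for two all-zero vectors
    intersection = np.logical_and(u, v).sum()
    return 1.0 - intersection / union

# Made-up users over 5 projects (1 = viewed at least once).
user_a = [1, 1, 0, 0, 1]
user_b = [1, 0, 0, 1, 1]
d_ab = jaccard_distance(user_a, user_b)  # 2 shared out of 4 seen by either

# View counts collapse to the same binary pattern as user_a.
user_a_counts = np.array([3, 1, 0, 0, 7])
d_same = jaccard_distance(user_a_counts >= 1, user_a)

print(d_ab, d_same)
```

Projects that neither user viewed do not affect the distance, which suits sparse view matrices where most entries are zero.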

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


<details>
    <summary>SOLUTION HERE!</summary>
  <pre>
<code>
nbrs_user = NearestNeighbors(n_neighbors= 5, algorithm='auto',metric='jaccard' ).fit(freq_user_seen_proj_train_np >= 1)
distances, indices = nbrs_user.kneighbors(freq_user_seen_proj_train_np >= 1)
</code>
</pre>
</details>




Evaluate the recommendations in the same way as for the user-feature case.

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################

You should get the following numbers for MAP@3:

**user-profile feature**: 0.0177522349936143

**number of views matrix feature**: 0.028544061302681997

**binary view matrix feature**: 0.027203065134099615

