# Neighborhood-Based Collaborative Filtering Recommendation System Example
## Here's an example implementation in Python 3 of a neighborhood-based collaborative filtering system.
## The dataset can be downloaded here: https://www.kaggle.com/CooperUnion/anime-recommendations-database 

## We are predicting the best recommendations from roughly 7.8 million user ratings. Get ready for some big data.

### If you have any questions, feel free to email me at trackait@oberlin.edu

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import scipy as sp
import operator

With the imports in place, we need to clean the dataset. Our first goal is to set up a user-item matrix so we can use the algorithm described in the article above. Here's the pseudocode for what we want to implement:
1. Set up a user-item matrix, where the rows are the users and the columns are the items. This is a sparse matrix with many missing entries.
2. Compute the mean rating of every user.
3. Mean-center each user's ratings by subtracting that user's mean.
4. Compute the similarity between the rows (users).
5. For each user u and item i, store the top-k most similar users to u who have rated item i. These users form the neighborhood.
6. Using the similarities and the neighborhood, estimate a missing rating as the user's mean rating plus a weighted average of the neighbors' mean-centered ratings for the item: the sum of each neighbor's similarity times its mean-centered rating, divided by the sum of the absolute similarity values over the neighborhood.
7. Predict the missing ratings.
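The steps above can be sketched on a tiny, made-up matrix before touching the real data. This is only an illustration under assumptions: the toy ratings, the `predict` helper, the choice of `k=2`, and the use of cosine similarity on mean-centered rows with missing entries filled with 0 are all choices made for this sketch, not necessarily what the rest of the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: toy user-item matrix (hypothetical data). NaN marks a missing rating.
ratings = pd.DataFrame(
    [[7, 6, 7, 4, 5, 4],
     [6, 7, np.nan, 4, 3, 4],
     [np.nan, 3, 3, 1, 1, np.nan],
     [1, 2, 2, 3, 3, 4],
     [1, np.nan, 1, 2, 3, 3]],
    index=["u1", "u2", "u3", "u4", "u5"],
    columns=["i1", "i2", "i3", "i4", "i5", "i6"],
    dtype=float,
)

# Steps 2-3: per-user mean over observed ratings, then mean-center each row.
user_means = ratings.mean(axis=1)
centered = ratings.sub(user_means, axis=0)

# Step 4: cosine similarity between users on the mean-centered rows.
# Assumption: a missing entry is treated as 0 (no deviation from the mean).
sim = pd.DataFrame(
    cosine_similarity(centered.fillna(0.0)),
    index=ratings.index,
    columns=ratings.index,
)

def predict(user, item, k=2):
    """Steps 5-7: predict one rating from the top-k similar users who rated `item`."""
    # Candidate neighbors: other users who actually rated this item.
    raters = ratings[item].dropna().index.drop(user, errors="ignore")
    # Keep the k most similar raters, and only those with positive similarity.
    top = sim.loc[user, raters].nlargest(k)
    top = top[top > 0]
    if top.empty:
        return user_means[user]  # fall back to the user's own mean
    weighted = (top * centered.loc[top.index, item]).sum()
    return user_means[user] + weighted / top.abs().sum()

prediction = predict("u3", "i1")
print(round(prediction, 2))
```

Mean-centering matters here: user `u3` rates everything low, so the prediction is expressed relative to that user's own average rather than copied from more generous neighbors.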

So let's load the dataset and get to work:

In [3]:
anime = pd.read_csv('./datasets/anime.csv')
rating = pd.read_csv('./datasets/rating.csv')


In [15]:
print('anime shape: ', anime.shape)
print('ratings shape: ', rating.shape)

print("That's a lot of items. Let's reduce them, since we also have roughly 7.8 million ratings.")


anime shape:  (12294, 7)
ratings shape:  (7813737, 3)
That's a lot of items. Let's reduce them, since we also have roughly 7.8 million ratings.


In [20]:
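# Keep only TV series to shrink the item set.
print(anime[anime['type'] == 'TV'].shape)
animeTV = anime[anime['type'] == 'TV']
# Boolean mask of TV titles that have an average rating recorded.
animeTV['rating'].notnull()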

(3787, 7)



In [7]:
# -1 marks a missing rating; store it as NaN instead.
rating['rating'] = rating['rating'].replace(-1, np.nan)

Missing ratings are now stored as NaN instead of -1, so pandas skips them when we compute each user's mean rating.
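To see why this replacement matters, here is a small sketch on a hypothetical miniature of `rating.csv` (the `toy` frame and its values are made up for illustration). With -1 left in place, the placeholder would drag the mean down; with NaN, it is ignored.

```python
import numpy as np
import pandas as pd

# Hypothetical mini version of rating.csv; -1 marks "no rating given".
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "anime_id": [20, 24, 20, 24],
    "rating": [-1, 8, 6, -1],
})

mean_with_placeholder = toy["rating"].mean()  # -1 values distort the mean

toy["rating"] = toy["rating"].replace(-1, np.nan)

missing = int(toy["rating"].isna().sum())
mean_rating = toy["rating"].mean()  # NaN entries are skipped
print(missing, mean_with_placeholder, mean_rating)
```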

In [9]:
anime_tv = anime[anime['type']=='TV']

In [10]:
# Join the two dataframes on the anime_id column.
# Both frames have a 'rating' column: the user's rating gets the suffix '_user',
# while the anime's average rating keeps its name.
merged = rating.merge(anime_tv, on='anime_id', suffixes=('_user', ''))
merged = merged.rename(columns={'rating_user': 'user_rating'})
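A quick way to check what the merge and rename do to the column names is to run them on two tiny frames. The frames `toy_rating` and `toy_anime` and their values below are hypothetical stand-ins, not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature frames; both carry a column named "rating".
toy_rating = pd.DataFrame({
    "user_id": [1, 3],
    "anime_id": [20, 20],
    "rating": [float("nan"), 8.0],   # the user's own rating
})
toy_anime = pd.DataFrame({
    "anime_id": [20],
    "name": ["Naruto"],
    "rating": [7.5],                 # made-up average rating of the title
})

toy_merged = toy_rating.merge(toy_anime, on="anime_id", suffixes=("_user", ""))
toy_merged = toy_merged.rename(columns={"rating_user": "user_rating"})
merged_columns = list(toy_merged.columns)
print(merged_columns)
```

The inner join also drops any rating whose `anime_id` is not in the TV subset, which is how the item set gets reduced.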

In [11]:
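# Keep only the columns needed for the user-item matrix.
merged = merged[['user_id', 'name', 'user_rating']]
# Restrict to user IDs up to 10,000 to keep the computation manageable.
merged_sub = merged[merged.user_id <= 10000]
merged_sub.head()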

   user_id    name  user_rating
0        1  Naruto          NaN
1        3  Naruto          8.0
2        5  Naruto          6.0
3        6  Naruto          NaN
4       10  Naruto          NaN
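From a long table like this, one possible way to build the user-item matrix from step 1 is `pivot_table`. The sketch below uses a hypothetical stand-in `toy_sub` with the same three columns; the titles and ratings are made up, and the real notebook may build the matrix differently.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for merged_sub (same three columns).
toy_sub = pd.DataFrame({
    "user_id": [1, 1, 3, 3, 5, 5],
    "name": ["Naruto", "Bleach", "Naruto", "Bleach", "Naruto", "Bleach"],
    "user_rating": [np.nan, 5.0, 8.0, 7.0, 6.0, 9.0],
})

# Rows are users, columns are titles; unrated pairs stay NaN.
user_item = toy_sub.pivot_table(index="user_id", columns="name", values="user_rating")
print(user_item)
```

On the full data most cells of this matrix are NaN, which is why the sparse structure mentioned in step 1 matters.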
