# Recommender System using Collaborative Filtering

1. Install surprise package
2. Read the jokes dataset, check the shape and sample of the dataset
3. Read the ratings dataset, check the shape and sample of the dataset
4. Convert ratings data into surprise data frame format
5. Define the similarity parameters and the algorithm for finding similar users
6. Check the results using cross-validation
7. Get the train and test data from surprise data frame and fit the model on train data
8. Get the predictions on test data
9. Write a function to get top 10 predictions for each user
10. Using the top predictions matrix map the jokes to understand the recommendations

# Attributes Description
#### Jokes Dataset
1. ItemID - ID of each item

2. Joke - Text of each joke

#### Ratings Dataset
1. UserID - ID of each user

2. ItemID - ID of each item

3. Rating - Rating for each joke, in the range [-10, 10]

### Surprise is a Python package used to build recommender systems
http://surpriselib.com/
<br>https://github.com/NicolasHug/Surprise
<br>http://surprise.readthedocs.io/en/stable/index.html

#### 1. Install surprise package

In [4]:
#### Install surprise package
#!pip install scikit-surprise

In [5]:
from surprise.model_selection import train_test_split
from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader, KNNWithMeans
from surprise.model_selection import cross_validate
import pandas as pd

#### 2. Read the jokes dataset, check the shape and sample of the dataset

##### Jokes Dataset
http://eigentaste.berkeley.edu/dataset/
<br>140 rated jokes (the items file lists 150)
<br>59132 Users
<br>Reading the jokes file
<br>Note - Data is tab separated

In [6]:
jokes = pd.read_csv("jester_items.tsv",sep="\t",names=["ItemID","Joke"])
jokes.shape

(150, 2)

In [7]:
jokes.head()

| | ItemID | Joke |
|---|---|---|
| 0 | 1: | A man visits the doctor. The doctor says, "I h... |
| 1 | 2: | This couple had an excellent relationship goin... |
| 2 | 3: | Q. What's 200 feet long and has 4 teeth? A. Th... |
| 3 | 4: | Q. What's the difference between a man and a t... |
| 4 | 5: | Q. What's O. J. Simpson's web address? A. Slas... |

#### 3. Read the ratings dataset, check the shape and sample of the dataset


In [8]:
# Read the ratings dataset
ratings = pd.read_csv("jester_ratings.csv")
ratings.shape

(1761439, 3)

In [9]:
# Same file as above; sep=',' is the pandas default, so this re-read is equivalent
ratings = pd.read_csv('jester_ratings.csv', sep=',')

In [10]:
ratings.head()

| | UserID | ItemID | Rating |
|---|---|---|---|
| 0 | 1 | 5 | 0.219 |
| 1 | 1 | 7 | -9.281 |
| 2 | 1 | 8 | -9.281 |
| 3 | 1 | 13 | -6.781 |
| 4 | 1 | 15 | 0.875 |


In [11]:
# ratings = ratings[ratings['ItemID']!=150]

In [12]:
ratings['Rating']

0          0.219
1         -9.281
2         -9.281
3         -6.781
4          0.875
           ...  
1761434   -8.531
1761435   -9.062
1761436   -9.031
1761437   -8.656
1761438   -8.438
Name: Rating, Length: 1761439, dtype: float64

In [13]:
print(ratings.UserID.nunique())
print(ratings.ItemID.nunique())

59132
140


#### 4. Convert ratings data into surprise data frame format

Define the reader (parser) used to load the ratings into a Surprise dataset.

The reader takes the rating scale through rating_scale and the column order through line_format; the DataFrame passed to load_from_df must have its columns in the order user, item, rating.

Keep only the users with a UserID below 1000 to avoid a memory error.

In [14]:
ratings.isnull().sum()

UserID    0
ItemID    0
Rating    0
dtype: int64

In [15]:
no_of_users = 1000
reader = Reader(line_format = 'user item rating', rating_scale=(-10, 10))
data = Dataset.load_from_df(ratings[ratings.UserID < no_of_users], reader)

In [16]:
reader

<surprise.reader.Reader at 0x1ea32536208>

#### 5. Define the similarity parameters and the algorithm for finding similar users

-  Algorithm Type
-  User-Based vs Item-Based
-  Similarity Metric
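
The three choices above can be made concrete with a small pure-Python sketch of user-based prediction with cosine similarity and mean-centering. This is a simplified illustration, not Surprise's implementation: the toy ratings are made up, and the names `cosine_sim`, `mean_rating` and `predict_with_means` are hypothetical helpers introduced here.

```python
from math import sqrt

# Toy ratings on the [-10, 10] scale: {user: {item: rating}} (made-up values).
toy_ratings = {
    "u1": {"j1": 8.0, "j2": -4.0, "j3": 6.0},
    "u2": {"j1": 6.0, "j2": -2.0, "j3": 4.0, "j4": 9.0},
    "u3": {"j1": -7.0, "j2": 5.0, "j4": -8.0},
}

def cosine_sim(a, b):
    """Cosine similarity computed over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mean_rating(user_ratings):
    """Average rating given by one user."""
    return sum(user_ratings.values()) / len(user_ratings)

def predict_with_means(ratings, user, item, k=40):
    """User mean plus the similarity-weighted, mean-centered neighbour ratings."""
    target = ratings[user]
    neighbours = []
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_sim(target, other_ratings)
        if sim > 0:  # assumption: only positively similar users contribute
            neighbours.append((sim, other))
    neighbours = sorted(neighbours, reverse=True)[:k]  # k most similar users
    if not neighbours:
        return mean_rating(target)
    weighted = sum(
        sim * (ratings[v][item] - mean_rating(ratings[v])) for sim, v in neighbours
    )
    return mean_rating(target) + weighted / sum(sim for sim, _ in neighbours)

estimate = predict_with_means(toy_ratings, "u1", "j4")
print(round(estimate, 3))
```

Here only `u2` is positively similar to `u1`, so the estimate is the mean rating of `u1` shifted by how far `u2` rated `j4` above their own mean. Setting `user_based` to False would apply the same idea to similarities between items instead of users.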

In [17]:
# Cosine similarity between users (user_based=True) rather than between items
sim_parameters = {'name': 'cosine', 'user_based': True}
algo = KNNWithMeans(sim_options=sim_parameters)

#### 6. Check the results using cross-validation

In [18]:
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...




Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    4.7444  4.7240  4.6653  4.6626  4.7534  4.7099  0.0387  
MAE (testset)     3.6675  3.6814  3.6106  3.6007  3.6651  3.6451  0.0328  
Fit time          3.27    3.85    3.66    3.20    3.51    3.50    0.24    
Test time         5.84    6.31    6.51    5.09    6.47    6.04    0.53    


{'test_rmse': array([4.74443227, 4.72395723, 4.66528202, 4.66261971, 4.75338973]),
 'test_mae': array([3.66747096, 3.68144292, 3.61063631, 3.60066338, 3.66509454]),
 'fit_time': (3.2729222774505615,
  3.8500406742095947,
  3.6551997661590576,
  3.204378604888916,
  3.510737419128418),
 'test_time': (5.837331533432007,
  6.3142335414886475,
  6.508704662322998,
  5.0894858837127686,
  6.473183870315552)}
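
To make the two reported measures concrete, the following minimal sketch computes RMSE and MAE by hand on a few made-up (true rating, estimate) pairs. The function names `rmse` and `mae` and the sample pairs are illustrative assumptions, not part of Surprise.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)

# Made-up pairs on the [-10, 10] rating scale.
sample_pairs = [(2.0, 4.0), (-6.0, -3.0), (5.0, 5.0), (-1.0, 3.0)]

sample_rmse = rmse(sample_pairs)
sample_mae = mae(sample_pairs)
print(sample_rmse, sample_mae)
```

Because RMSE squares each error before averaging, it is never smaller than MAE and grows faster when a few predictions are far off, which matches the cross-validation table above where RMSE is higher than MAE in every fold.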

#### 7. Get the train and test data from surprise data frame and fit the model on train data

In [19]:
trainset,testset=train_test_split(data, test_size=0.2, train_size=None, random_state=2, shuffle=True)

In [20]:
algo.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1ea33839bc8>

#### 8. Get the predictions on test data

In [21]:
# Making predictions
predictions = algo.test(testset)
predictions[0:5]

[Prediction(uid=622, iid=83, r_ui=-1.188, est=1.3413544760923704, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid=480, iid=36, r_ui=4.156000000000001, est=6.399464763098445, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid=906, iid=42, r_ui=4.625, est=5.610506671169506, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid=477, iid=145, r_ui=2.656, est=4.326816640538945, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid=733, iid=8, r_ui=-9.281, est=0.21443455848723286, details={'actual_k': 40, 'was_impossible': False})]

#### 9. Write a function to get top 10 predictions for each user

In [22]:
# Fetching top 10 predictions for each user
from collections import defaultdict
def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

top_n = get_top_n(predictions, n=10)
take(10, top_n.items())

[(622,
  [(89, 3.2192991462467355),
   (72, 2.744928360095321),
   (62, 2.6703384170331605),
   (17, 2.2109015185909042),
   (125, 2.1402138846147127),
   (76, 2.099499388468058),
   (56, 1.6955801784261726),
   (19, 1.510386634732813),
   (96, 1.5053301479332688),
   (48, 1.425263289288179)]),
 (480,
  [(127, 8.21836772948193),
   (69, 7.831504273524685),
   (27, 7.5636859057434735),
   (125, 6.647414819624372),
   (83, 6.459269221610307),
   (36, 6.399464763098445),
   (29, 6.389544153084129),
   (46, 5.405092629763846),
   (19, 5.235586829779129),
   (20, 3.082423927613045)]),
 (906,
  [(114, 7.057774466423564),
   (127, 6.839700163847821),
   (105, 6.810843963903291),
   (76, 6.480870860019048),
   (88, 6.3891084088111985),
   (62, 6.359450157036237),
   (108, 6.304242546709937),
   (128, 6.239523922390229),
   (104, 6.138594053723148),
   (137, 5.950909013569359)]),
 (477,
  [(62, 5.76124037570184),
   (114, 5.628313761277091),
   (27, 5.394703325683908),
   (69, 5.389834301770315

##### Top Predictions Matrix

In [23]:
# Printing top predictions
for uid, user_ratings in take(10,top_n.items()):
    print(uid, [iid for (iid, _) in user_ratings])

622 [89, 72, 62, 17, 125, 76, 56, 19, 96, 48]
480 [127, 69, 27, 125, 83, 36, 29, 46, 19, 20]
906 [114, 127, 105, 76, 88, 62, 108, 128, 104, 137]
477 [62, 114, 27, 69, 117, 53, 108, 91, 83, 72]
733 [138, 142, 8]
927 [93, 126, 87, 72, 66, 121, 83, 135, 95, 94]
421 [113, 49, 78, 19, 8, 103, 21, 116, 16, 25]
308 [35, 106, 47, 76, 92, 56, 70, 96, 46, 25]
54 [93, 89, 27, 46, 70, 48, 26, 49, 45, 68]
837 [89, 21, 102, 47, 147, 109, 55, 128, 136, 112]
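
Note that these lists are built from test-set predictions, so they rank jokes the users have already rated, and a user such as 733 gets fewer than 10 entries, most likely because only three of their ratings fell into the test split. To recommend jokes a user has not seen, the candidates would be the unrated items instead; Surprise offers `trainset.build_anti_testset()` for that purpose. The pure-Python sketch below only illustrates the idea: `unseen_items`, `top_n_unseen` and `toy_estimate` are hypothetical names, and the data is made up.

```python
def unseen_items(rated_by_user, all_items):
    """Map each user to the items they have not rated yet."""
    return {
        user: sorted(all_items - set(items))
        for user, items in rated_by_user.items()
    }

def top_n_unseen(rated_by_user, all_items, estimate, n=10):
    """Rank each user's unrated items by estimated rating, keeping the best n."""
    top = {}
    for user, candidates in unseen_items(rated_by_user, all_items).items():
        scored = [(item, estimate(user, item)) for item in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        top[user] = scored[:n]
    return top

# Made-up example: two users, five items, and a stand-in scoring function.
rated = {1: [5, 7, 8], 2: [5]}
items = {5, 7, 8, 13, 15}

def toy_estimate(user, item):
    """Placeholder for a fitted model's estimate (assumption)."""
    return 10 - abs(item - 13)

recommendations = top_n_unseen(rated, items, toy_estimate, n=2)
print(recommendations)
```

With a fitted Surprise model, `toy_estimate` would be replaced by something like `algo.predict(user, item).est`.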


#### 10. Using the top predictions matrix map the jokes to understand the recommendations

##### Top Jokes for each User

In [30]:
for uid, user_ratings in top_n.items():
    
    print('*'*35)
    print('For User: {} Top Recommendations:'.format(uid))
    print('*'*35)
    
    for (iid, _) in user_ratings:
        print('\nJokeID : {} , Joke is : {}\n'.format(iid, jokes.loc[iid, 'Joke']))

***********************************
For User: 622 Top Recommendations:
***********************************

JokeID : 89 , Joke is : Q: How many programmers does it take to change a lightbulb? A: NONE! That's a hardware problem...


JokeID : 72 , Joke is : Q: What is the difference between George Washington, Richard Nixon, and Bill Clinton? A: Washington couldn't tell a lie, Nixon couldn't tell the truth, and Clinton doesn't know the difference.


JokeID : 62 , Joke is : An engineer, a physicist and a mathematician are sleeping in a room. There is a fire in the room. The engineer wakes up, sees the fire, picks up the bucket of water and douses the fire and goes back to sleep. Again there is a fire in the room. This time, the physicist wakes up, notices the bucket, fills it with water, calculates the optimal trajectory and douses the fire in minimum amount of water and goes back to sleep. Again there is a fire. This time the mathematician wakes up. He looks at the fire, looks at the buck

KeyError: 150
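
The KeyError above appears to come from looking jokes up by row position: `jokes` has the default index 0 to 149, while the ItemID values run from 1 to 150 and are read as strings such as "1:". A lookup by `iid` therefore returns the joke one row further down and fails for ItemID 150. The sketch below reproduces the effect on made-up rows and shows a lookup keyed on the numeric ItemID; `toy_jokes`, `by_position`, `by_item_id` and `joke_for` are hypothetical names.

```python
# Toy stand-in for the jokes table: (ItemID as read from the file, joke text).
toy_jokes = [("1:", "first joke"), ("2:", "second joke"), ("3:", "third joke")]

# Positional lookup, as with the default 0-based index: off by one,
# and there is no entry for the last ItemID.
by_position = dict(enumerate(text for _, text in toy_jokes))

# Lookup keyed on the numeric ItemID instead.
by_item_id = {int(raw_id.rstrip(":")): text for raw_id, text in toy_jokes}

def joke_for(iid, lookup=by_item_id):
    """Return the joke text for an ItemID, or a placeholder if it is missing."""
    return lookup.get(iid, "<joke text not found>")

print(by_position[1])   # positional lookup returns the joke with ItemID 2
print(joke_for(1))
print(joke_for(3))
```

In the notebook, a comparable step would probably be converting `jokes['ItemID']` to integers (for instance with `.str.rstrip(':').astype(int)`) and calling `jokes.set_index('ItemID')` before the `jokes.loc[iid, 'Joke']` lookup; this has not been run against the actual file.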

In [26]:
pred = algo.predict(100, 123)
pred

Prediction(uid=100, iid=123, r_ui=None, est=-0.5499868321824839, details={'actual_k': 40, 'was_impossible': False})

## Tuning the Algorithm Parameters

Surprise provides a GridSearchCV class analogous to GridSearchCV from scikit-learn.

Given a dict of candidate parameter values, GridSearchCV evaluates every combination with cross-validation and reports the best parameters for each accuracy measure.

For example, you can check which similarity metric works best for your data in memory-based approaches:

In [27]:
from surprise.model_selection import GridSearchCV

sim_options = {
    "name": ["pearson", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}

param_grid = {"sim_options": sim_options}

jokes_gs = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse", "mae"], cv=3)

jokes_gs.fit(data)
print(jokes_gs.best_score["rmse"])
print(jokes_gs.best_params["rmse"])

Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...




Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...

In [28]:
jokes_gs.best_params

{'rmse': {'sim_options': {'name': 'cosine',
   'min_support': 4,
   'user_based': False}},
 'mae': {'sim_options': {'name': 'cosine',
   'min_support': 4,
   'user_based': False}}}
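
To see why the search prints so many similarity computations, the following sketch enumerates the grid used above with `itertools.product`. It is only an illustration of how a grid expands, not Surprise's internal code; `expand_grid` is a hypothetical helper.

```python
from itertools import product

# Same candidate values as in the grid search above.
sim_options = {
    "name": ["pearson", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}

def expand_grid(options):
    """Return one dict per combination of the candidate values."""
    keys = list(options)
    return [
        dict(zip(keys, values))
        for values in product(*(options[key] for key in keys))
    ]

combos = expand_grid(sim_options)
cv = 3  # number of folds passed to GridSearchCV above

print(len(combos), "combinations,", len(combos) * cv, "model fits")
```

Each combination is fitted once per fold, so the run time grows with the product of the list lengths and the number of folds.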