## Collaborative Filtering
Collaborative filtering predicts a user's interests automatically by pooling the preferences of many users.

Different types:
- Memory-based: uses users' rating data to compute the similarity between users or between items, then uses that similarity to make recommendations. Examples: user-user and item-item collaborative filtering.
- Model-based: fits a model to training data to learn patterns in user behaviour, then uses the model to make predictions for unseen data. Example: matrix factorisation.
- Hybrid: combines the memory-based and model-based CF algorithms.


## User-to-User Collaborative Filtering

The method identifies users who are similar to the queried user and estimates the desired rating as the weighted average of those similar users' ratings.
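As a minimal sketch of the weighted average described above: the function name `predict_rating` and the neighbour values are hypothetical, chosen only for illustration, and the weights are normalised by the sum of absolute similarities, which is one common convention rather than the only one.

```python
import numpy as np

def predict_rating(similarities, ratings):
    """Similarity-weighted average of the neighbours' ratings.

    similarities, ratings: 1-D sequences of equal length, one entry per neighbour.
    Returns NaN when the similarity weights sum to zero.
    """
    similarities = np.asarray(similarities, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    weight_sum = np.abs(similarities).sum()
    if weight_sum == 0:
        return float("nan")
    return float(similarities @ ratings / weight_sum)

# Hypothetical neighbours: similarity to the queried user and their hours on one game
sims = [0.5, 0.3, 0.2]
hours = [10.0, 20.0, 0.0]
prediction = predict_rating(sims, hours)
print(prediction)
```

The most similar neighbour contributes the most to the estimate, and a neighbour with zero similarity contributes nothing.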

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

%matplotlib inline

RANDOM_STATE = 42

import warnings
warnings.filterwarnings('ignore')

In [201]:
# Behaviours included are 'own' and 'play'; Value is the hours played if the behaviour is 'play', otherwise 1 for 'own'
names = [
    'User ID',
    'Game Title',
    'Behaviour',
    'Value'
]

data = pd.read_csv('data/steam-200k.csv', header=None, names=names)

In [4]:
data.head()

Unnamed: 0,User ID,Game Title,Behaviour,Value
0,151603712,The Elder Scrolls V: Skyrim,own,1
1,151603712,The Elder Scrolls V: Skyrim,play,273
2,151603712,Fallout 4,own,1
3,151603712,Fallout 4,play,87
4,151603712,Spore,own,1


### Preprocessing

In [202]:
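# Data preprocessing
# Keep only the two known behaviours; .copy() avoids modifying a view of the original frame
data = data[data['Behaviour'].isin(['own', 'play'])].copy()
data['User ID'] = data['User ID'].astype(str)
data['Value'] = data['Value'].astype(float)
# 'own' rows carry Value == 1, so any value <= 1 is treated as 0 hours
# (note that this also zeroes play records of one hour or less)
data['Hours Played'] = data['Value'].apply(lambda x: x if x > 1 else 0)
data.drop(columns=['Value', 'Behaviour'], inplace=True)
# Sort so the largest 'Hours Played' comes last, then keep one row per (user, game)
data = data.sort_values(['User ID', 'Game Title', 'Hours Played'])
data.drop_duplicates(['User ID', 'Game Title'], keep='last', inplace=True)
data.head()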

Unnamed: 0,User ID,Game Title,Hours Played
61275,100012061,Star Trek: D-A-C,0.0
47522,100053304,Dota 2,0.0
47524,100053304,Dream Of Mirror Online,0.0
47516,100053304,Dungeons & Dragons Online®,12.6
47520,100053304,PAYDAY: The Heist,1.1


In [203]:
# Adjusted hours played = hours played - mean hours played - mean hours played per game - mean hours played per user

data['Adjusted Hours Played'] = data['Hours Played']
data['Adjusted Hours Played'] = data['Adjusted Hours Played'] - data['Hours Played'].mean()

game_avg = data.groupby(['Game Title'])['Hours Played'].mean()
user_avg = data.groupby(['User ID'])['Hours Played'].mean()
game_avg = game_avg.rename('Game Avg Hours')
user_avg = user_avg.rename('User Avg Hours')
data = pd.merge(data, game_avg, on='Game Title')
data = pd.merge(data, user_avg, on='User ID')

data['Adjusted Hours Played'] = data['Adjusted Hours Played'] - data['Game Avg Hours']
data['Adjusted Hours Played'] = data['Adjusted Hours Played'] - data['User Avg Hours']

data.drop(columns=['Hours Played','Game Avg Hours','User Avg Hours'], inplace = True)
data.head()

Unnamed: 0,User ID,Game Title,Hours Played,Adjusted Hours Played
0,100012061,Star Trek: D-A-C,0.0,-27.124663
1,110879737,Star Trek: D-A-C,0.0,-27.124663
2,110879737,Sid Meier's Civilization V,0.0,-194.256327
3,134225377,Star Trek: D-A-C,0.0,-27.124663
4,14544587,Star Trek: D-A-C,2.3,-106.953956
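To make the adjustment formula concrete, here is a small sketch on hypothetical toy data (the users, games, and hours are invented, not taken from the Steam dataset). It uses `groupby(...).transform("mean")` instead of merges, which is an alternative way to align the per-game and per-user means with each row.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Steam dataset
toy = pd.DataFrame({
    "User ID": ["a", "a", "b", "b"],
    "Game Title": ["X", "Y", "X", "Y"],
    "Hours Played": [10.0, 0.0, 4.0, 2.0],
})

global_avg = toy["Hours Played"].mean()                                  # 4.0
game_avg = toy.groupby("Game Title")["Hours Played"].transform("mean")  # X: 7.0, Y: 1.0
user_avg = toy.groupby("User ID")["Hours Played"].transform("mean")     # a: 5.0, b: 3.0

# Same formula as the comment in the cell above: subtract all three means
toy["Adjusted Hours Played"] = toy["Hours Played"] - global_avg - game_avg - user_avg
print(toy)
```

In this toy case every adjusted value is negative, because three means are subtracted from each observation.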


In [7]:
games = len(set(data['Game Title']))
users = len(set(data['User ID']))
print(f'Unique Games: {games} Unique Users: {users}')

Unique Games: 5108 Unique Users: 12364


### User-Item Matrix

In [204]:
# User-item matrix of adjusted hours played; games a user has no record for are filled with 0
user_hours = data.pivot(index='User ID', columns='Game Title', values='Adjusted Hours Played').fillna(0)
# user_hours = user_hours.apply(lambda row: row.fillna(row.mean()), axis=1)  # alternative: fill with the user's mean

# User-item matrix of ownership: 1 if the user has any record for the game, else 0
# (notna() is used so that an adjusted value of exactly 0 still counts as owned)
user_owns = data.pivot(index='User ID', columns='Game Title', values='Adjusted Hours Played').notna().astype(int)

In [205]:
user_hours.iloc[:10,:10]

Game Title,007™ Legends,0RBITALIS,1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby),10 Second Ninja,100% Orange Juice,1000 Amps,12 Labours of Hercules,12 Labours of Hercules II: The Cretan Bull,12 Labours of Hercules III: Girl Power,140
100012061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100053304,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100057229,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100070732,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100096071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100168166,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100208126,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100267049,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100311267,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100322840,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
user_owns.iloc[:10,:10]

Game Title,007™ Legends,0RBITALIS,1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby),10 Second Ninja,100% Orange Juice,1000 Amps,12 Labours of Hercules,12 Labours of Hercules II: The Cretan Bull,12 Labours of Hercules III: Girl Power,140
100012061,0,0,0,0,0,0,0,0,0,0
100053304,0,0,0,0,0,0,0,0,0,0
100057229,0,0,0,0,0,0,0,0,0,0
100070732,0,0,0,0,0,0,0,0,0,0
100096071,0,0,0,0,0,0,0,0,0,0
100168166,0,0,0,0,0,0,0,0,0,0
100208126,0,0,0,0,0,0,0,0,0,0
100267049,0,0,0,0,0,0,0,0,0,0
100311267,0,0,0,0,0,0,0,0,0,0
100322840,0,0,0,0,0,0,0,0,0,0
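The two matrices can be illustrated on a tiny, hypothetical long-format table (the names `toy`, `wide`, `toy_hours`, and `toy_owns` are invented for this sketch). Once the matrix is zero-filled, a stored value of exactly 0 looks the same as a missing pair, so this sketch derives ownership from `notna()` before filling.

```python
import pandas as pd

# Hypothetical long-format interactions (one row per user/game pair)
toy = pd.DataFrame({
    "User ID": ["a", "a", "b"],
    "Game Title": ["X", "Y", "Y"],
    "Adjusted Hours Played": [1.5, 0.0, -2.0],
})

wide = toy.pivot(index="User ID", columns="Game Title", values="Adjusted Hours Played")
toy_hours = wide.fillna(0)            # missing (user, game) pairs become 0
toy_owns = wide.notna().astype(int)   # 1 wherever a row existed, even if its value is 0

# Deriving ownership from non-zero values instead would lose user a's game Y
owns_from_nonzero = toy_hours.astype(bool).astype(int)

print(toy_hours)
print(toy_owns)
print(owns_from_nonzero)
```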


### User-User Similarity Matrix

#### Similarity measure
- Use Pearson when the data is subject to user bias (users rate on different scales).
- Use cosine when the data is sparse (many ratings are undefined).
- Use Euclidean when the data is not sparse and the magnitude of the attribute values matters.
- Use adjusted cosine for the item-based approach, to correct for user bias.
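The measures listed above can be compared on a small example. This is a sketch on hypothetical hours for three users and four games, using `cosine_similarity` and `pairwise_distances` from scikit-learn; the distance-to-similarity conversions (`1 - d` and `1 / (1 + d)`) are conventions assumed here.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical hours for 3 users x 4 games (0 = not played)
hours = np.array([
    [10.0, 0.0, 2.0, 0.0],
    [20.0, 0.0, 4.0, 0.0],   # same tastes as user 0, twice the playtime
    [0.0, 5.0, 0.0, 1.0],    # no games in common with users 0 and 1
])
owns = hours > 0  # boolean ownership matrix for Jaccard

cosine = cosine_similarity(hours)
pearson = 1 - pairwise_distances(hours, metric="correlation")
euclidean = 1 / (1 + pairwise_distances(hours, metric="euclidean"))
jaccard = 1 - pairwise_distances(owns, metric="jaccard")

for name, sim in [("cosine", cosine), ("pearson", pearson),
                  ("euclidean", euclidean), ("jaccard", jaccard)]:
    print(name, np.round(sim[0], 3))  # similarity of user 0 to every user
```

Users 0 and 1 differ only in scale, so cosine, Pearson, and Jaccard all treat them as identical, while the Euclidean-based score penalises the difference in magnitude.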

In [26]:
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix

user_user_hours = cosine_similarity(csr_matrix(user_hours.values)) # cosine similarity
# user_user_hours = 1 / (1 + pairwise_distances(user_hours, metric='euclidean', n_jobs=-1)) # Euclidean similarity
# user_user_hours = (1 - pairwise_distances(user_hours, metric='correlation', n_jobs=-1)) # Pearson similarity
user_user_owns = (1 - pairwise_distances(user_owns.values.astype(bool), metric='jaccard', n_jobs=-1)) # Jaccard similarity

In [27]:
user_user_hours = pd.DataFrame(user_user_hours, index=user_hours.index, columns=user_hours.index)
user_user_owns = pd.DataFrame(user_user_owns, index=user_owns.index, columns=user_owns.index)

In [28]:
user_user_hours.iloc[:10,:10]

User ID,100012061,100053304,100057229,100070732,100096071,100168166,100208126,100267049,100311267,100322840
100012061,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100053304,0.0,1.0,0.0,0.0,0.598737,0.0,0.0,0.0,0.0215,0.0
100057229,0.0,0.0,1.0,0.0,-0.030883,0.0,0.0,0.0,0.0,0.0
100070732,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
100096071,0.0,0.598737,-0.030883,0.0,1.0,0.0,0.0,0.0,0.091611,0.134026
100168166,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.169747,0.0
100208126,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
100267049,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
100311267,0.0,0.0215,0.0,0.0,0.091611,0.169747,0.0,0.0,1.0,0.080833
100322840,0.0,0.0,0.0,0.0,0.134026,0.0,0.0,0.0,0.080833,1.0


In [29]:
user_user_owns.iloc[:10,:10]

User ID,100012061,100053304,100057229,100070732,100096071,100168166,100208126,100267049,100311267,100322840
100012061,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100053304,0.0,1.0,0.0,0.0,0.022727,0.0,0.0,0.0,0.02,0.0
100057229,0.0,0.0,1.0,0.0,0.02381,0.0,0.0,0.0,0.0,0.0
100070732,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
100096071,0.0,0.022727,0.02381,0.0,1.0,0.0,0.0,0.0,0.047244,0.025641
100168166,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.010526,0.0
100208126,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
100267049,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
100311267,0.0,0.02,0.0,0.0,0.047244,0.010526,0.0,0.0,1.0,0.010417
100322840,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.010417,1.0


### Recommend Items
The recommendations below use the user-user similarities based on games owned.

User-user similarities based on adjusted hours played don't capture similarity well.

In [265]:
user = '100311267'

In [266]:
from collections import Counter

def most_frequent(items, n):
    """Return the n most common items as (item, count) tuples."""
    return Counter(items).most_common(n)

In [267]:
def most_similar_users(user: str, n: int, verbose=True):
    """
        Return n most similar users based on Jaccard similarity of owned games
        
        user - user id
        n - number of users
        verbose - if True, print (user, similarity) pairs; otherwise return the list of user ids
    """
    if user not in user_user_owns.index:
        return print(f'No data available on user {user}')
    
    # Drop the user explicitly: with ties at similarity 1.0 the user is not guaranteed to sort first
    top = user_user_owns.loc[user].drop(user).sort_values(ascending=False).head(n)
    if verbose:
        return print(f'{n} most similar users based on owned games:', *zip(top.index, np.round(top.values, 2)), sep='\n')
    else:
        return list(top.index)

In [268]:
most_similar_users(user, 5)

5 most similar users based on owned games:
('152959594', 0.21)
('182399789', 0.2)
('119619506', 0.17)
('76420334', 0.17)
('89523414', 0.17)


In [269]:
def recommend_similar_users_games(user, sim_users, n=5):
    """
        Recommend n games owned by similar users that the user doesn't own
    """
    recommended = []
    user_owned = user_owns.loc[user,:][user_owns.loc[user,:] == 1].index # Games user owns
    
    for sim_user in sim_users:
        sim_user_owned = user_owns.loc[sim_user,:][user_owns.loc[sim_user,:] == 1].index
        games_not_owned = set(sim_user_owned) - set(user_owned) # games the user doesn't own that similar user does
        recommended.append(list(games_not_owned))
        
    recommended = [game for sublist in recommended for game in sublist]
    recommended = most_frequent(recommended, n)
    return print('Similar gamers also play:', '(game, amount who played)', *recommended, sep='\n')

In [270]:
sim_users = most_similar_users(user, 5, verbose=False)
recommend_similar_users_games(user, sim_users)

Similar gamers also play:
(game, amount who played)
('Warface', 4)
('War Thunder', 4)
('HAWKEN', 4)
('theHunter', 4)
('Dota 2', 3)


In [271]:
def most_played_game(user):
    """
        Return most played game for a user
        
        user - User ID
    """
    if user_hours.loc[user,:].max() > 0:
        return user_hours.loc[user,:].idxmax()
    else:
        return False

In [272]:
most_played_game(user)

"Garry's Mod"

In [273]:
def similar_most_played_game(user: str, other_users: list):
    """
        Find the first gamer whose most played game is the same as the user's
        
        user - user ID
        other_users - list of all other users
        
        Always returns a (matching users, None) pair so the result can be unpacked;
        the list is empty when there is no match
    """
    most_played_by_user = most_played_game(user)  # looked up once, not on every iteration
    if not most_played_by_user:
        print(f'No games played by user {user}')
        return [], None
    
    for other_user in other_users:
        if most_played_game(other_user) == most_played_by_user:  # both have the same most played game
            played_hrs = round(user_hours.loc[other_user, :].max(), 2)
            # Stop at the first match
            print(f'User(s) {[other_user]} most played game is also {most_played_by_user} who have played it for {[played_hrs]} hrs, respectively')
            return [other_user], None
    
    print(f'No users most played game is also {most_played_by_user}')
    return [], None

In [274]:
# {user} is a one-element set; set(user) would split the ID string into single characters
other_users = list(set(user_owns.index) - {user})

similar_users, _ = similar_most_played_game(user, other_users)

User(s) ['169653526'] most played game is also Garry's Mod who have played it for [16.0] hrs, respectively


In [275]:
def recommend_based_on_most_played(user: str, similar_users: list, n: int):
    """
        Based on users whose most played game is the same as the user's, what other n games do they also play a lot?
        
        user - user ID
        similar_users - users that have the same most played game
        n - number of games to return
    """
    target_game = most_played_game(user)  # looked up once and reused below
    game_hrs = []
    for user_ in similar_users:
        owned = user_owns.loc[user_, :]
        games = set(owned[owned == 1].index) - {target_game}
        for game in games:
            hrs_played = user_hours.loc[user_, game]
            if hrs_played > 0:
                game_hrs.append((game, hrs_played))
    
    # Highest playtime first
    game_hrs.sort(key=lambda tup: tup[1], reverse=True)
    
    return print(f'Gamers who liked {target_game} also liked: ', '(game, hours played)', *game_hrs[:n], sep='\n')

In [276]:
recommend_based_on_most_played(user, list(similar_users), 5)

Gamers who liked Garry's Mod also liked: 
(game, hours played)
('Counter-Strike: Global Offensive', 8.3)
('Rust', 8.1)
('Path of Exile', 8.0)
('SpeedRunners', 7.5)
('Magic 2015', 3.3)
