## **Recommender System** Using **Cosine Similarity**

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

import sys
sys.path.append('../')

Creating the Recommender.

### The idea:

`CosSimComputer` contains the recommender system. Making recommendations requires several attributes and methods, which would otherwise lead to long and complex blocks of code. Wrapping them in a class solves most of these problems, since I need distinct methods to compute similarities and build responses.

It simply borrows the idea of `Estimators` from scikit-learn. Admittedly, this "`Computer`" object is far simpler and less sophisticated than any class from a machine learning library, but it performs simple yet useful calculations tailored to the purpose of this project.

- `CosSimComputer` can be "trained" with a pandas DataFrame, which is an illusion because it just stores the DataFrame as an instance attribute.
- It has an equivalent of a "predict" method: `CosSimComputer.compute_similarities()` uses scikit-learn's `cosine_similarity` to compute the similarity between the basis vector and every item vector.
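
To make the similarity measure concrete, here is a minimal sketch that computes the cosine similarity of two feature vectors by hand and with scikit-learn. The vectors `item_a` and `item_b` are made-up examples, not rows from the games data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical one-hot feature vectors (1 = the item has the feature)
item_a = np.array([1, 0, 1, 1, 0])
item_b = np.array([1, 0, 1, 0, 1])

# By hand: dot product divided by the product of the Euclidean norms
manual = item_a @ item_b / (np.linalg.norm(item_a) * np.linalg.norm(item_b))

# scikit-learn expects 2-D inputs, hence the reshape to a single row each
sk = cosine_similarity(item_a.reshape(1, -1), item_b.reshape(1, -1))[0, 0]

print(manual, sk)
```

The two items share two features and each has three, so both values come out as 2/3. For binary vectors like these, the score grows with the number of shared features relative to how many features each item has.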

In [2]:
class CosSimComputer:

    def __init__(self, df_train:pd.DataFrame):
        self.df_train = df_train
        self.itemsMatrix = df_train.loc[:,'_Released after 2010':]
        self.items = df_train.loc[:,['item_id', 'app_name']]
        self.basisVector = None

    def set_basisVector(self, id):
        # Index label(s) of the row whose item_id matches
        vector_idx = self.df_train.loc[self.df_train['item_id'] == id].index

        # Feature values of that row (label-based lookup, since vector_idx holds labels)
        vector = self.itemsMatrix.loc[vector_idx].values
        
        # Store the basis vector as a single row of shape (1, n_features)
        self.basisVector = vector.reshape(1,-1)

    def _cos_sim(self, item:pd.Series):
        cos_sim = cosine_similarity(
            self.basisVector, item.values.reshape(1,-1)
            )[0,0]
        return cos_sim
    
    def compute_similarities(self):
        similarities = (
            self.itemsMatrix
            .apply(
                lambda row: self._cos_sim(item=row), 
                axis=1
                )
            )
        return similarities
    
    def n_most_similar(self, n:int, to_:int, indexes = False):
        # Reset the basis vector on every call
        self.set_basisVector(to_)

        # Compute similarities against every item
        similars = self.compute_similarities()

        # Index labels of the n largest, dropping the top hit (assumed to be the item itself)
        n_largest = similars.nlargest(n+1, keep='last').index[1:]

        # Optionally return only the index labels
        if indexes:
            return n_largest
        
        # Otherwise return item ids and names (label-based lookup)
        items = self.items.loc[n_largest]
        return items
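
The class computes one similarity per row through `apply`. As an alternative sketch (not part of the class above), `cosine_similarity` can compare the basis row against the whole matrix in a single call. The function name `similarities_vectorized` and the toy matrix are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def similarities_vectorized(items_matrix: pd.DataFrame, basis_label) -> pd.Series:
    """Cosine similarity of one row against every row, in a single call."""
    basis = items_matrix.loc[[basis_label]].to_numpy()           # shape (1, n_features)
    sims = cosine_similarity(basis, items_matrix.to_numpy())[0]  # shape (n_items,)
    return pd.Series(sims, index=items_matrix.index)

# Toy stand-in for the one-hot feature matrix (hypothetical data)
toy = pd.DataFrame(
    [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]],
    columns=["f1", "f2", "f3", "f4"],
)

sims = similarities_vectorized(toy, basis_label=0)

# Drop the query row by label before ranking, then take the top n
top2 = sims.drop(index=0).nlargest(2).index.tolist()
print(top2)  # [1, 3]
```

Dropping the query row by label, instead of skipping the first result, avoids relying on the item itself being ranked first when other items tie at a similarity of 1.0.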

## Preprocessing Pipeline:
Includes:

- Extracting prices from labels and filling nulls with the median.
- Binning and one-hot encoding years and prices.
- One-hot encoding distinct genres and specs.
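
The helpers in `functions.preprocessing` are not shown in this notebook, so the following is only a rough sketch of how binning and one-hot encoding might be done with plain pandas. The toy data, the bin edges, and the variable names are assumptions; only the year column labels are taken from the output shown later.

```python
import pandas as pd

# Toy frame with the same kind of columns as the games data (values are made up)
df = pd.DataFrame({
    "release_year": [1998, 2005, 2017],
    "genres": [["Action"], ["Action", "Indie"], ["Casual", "Indie"]],
})

# Bin years into three labelled ranges (bin edges are an assumption)
year_bins = pd.cut(
    df["release_year"],
    bins=[-float("inf"), 1999, 2010, float("inf")],
    labels=["_Released before 2000", "_Released in 2000-2010", "_Released after 2010"],
)
# One-hot encode the bin labels
year_dummies = pd.get_dummies(year_bins.astype(str)).astype(int)

# One-hot encode list-valued genres: one row per (item, genre), then collapse back
genre_dummies = (
    pd.get_dummies(df["genres"].explode())
    .groupby(level=0)
    .max()
    .astype(int)
)

encoded = pd.concat([df, year_dummies, genre_dummies], axis=1)
print(encoded.drop(columns=["genres"]))
```

Each item ends up with exactly one year flag set and one flag per genre it belongs to, which is the kind of ones-and-zeros layout the similarity computation works on.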

In [3]:
df_games = pd.read_json('../data/games.json.gz', compression='gzip', lines=True)

In [4]:
import functions.preprocessing as pp 

# Cleaning and filling prices
df_games['price'] = df_games['price'].apply(pp.float_prices)
df_games['price'] = df_games['price'].fillna(df_games['price'].median())

# year binning
df_games = pp.year_binning(df_games, dummies=True)

# Price binning
df_games = pp.price_binning(df_games, dummies=True)

# Genres dummies
df_games = pp.genres_dummies(df_games)

# Specs dummies
df_games = pp.specs_dummies(df_games)

At the end of the preprocessing procedure, the DataFrame consists mostly of ones and zeros.

Excluding the original columns, the remaining ones form what is called a sparse matrix (mostly zeros). That slice of the data is stored in the `.itemsMatrix` attribute. **Sparse matrices can be advantageous because they can make storage and computation more efficient when held in a dedicated sparse format.**
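
As a hedged illustration of that point, the sketch below converts a synthetic binary matrix to SciPy's CSR format and checks that `cosine_similarity` gives the same result on sparse and dense input. The matrix is random stand-in data with roughly 5% ones, not the actual `itemsMatrix`, and the class above keeps its data as a dense DataFrame.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
# Synthetic binary matrix, roughly 5% ones, standing in for the one-hot columns
dense = (rng.random((1000, 70)) < 0.05).astype(np.int8)

# Compressed Sparse Row format stores only the non-zero entries
csr = sparse.csr_matrix(dense)
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(dense.nbytes, sparse_bytes)

# cosine_similarity accepts sparse input; compare row 0 against all rows
sims_sparse = cosine_similarity(csr[0], csr)[0]
sims_dense = cosine_similarity(dense[:1], dense)[0]
print(np.allclose(sims_sparse, sims_dense))
```

The saving depends on how sparse the data is and on the dtype of the dense array, so it is worth measuring on the real matrix before switching formats.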

In [5]:
df_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32133 entries, 0 to 32132
Data columns (total 77 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   item_id                     32133 non-null  int64  
 1   developer                   32133 non-null  object 
 2   app_name                    32133 non-null  object 
 3   genres                      32133 non-null  object 
 4   tags                        32133 non-null  object 
 5   specs                       32133 non-null  object 
 6   release_year                32133 non-null  int64  
 7   price                       32133 non-null  float64
 8   _Released after 2010        32133 non-null  int64  
 9   _Released before 2000       32133 non-null  int64  
 10  _Released in 2000-2010      32133 non-null  int64  
 11  _Cheap                      32133 non-null  int64  
 12  _Expensive                  32133 non-null  int64  
 13  _Typical price              321

Checking class usability

In [10]:
df_games

Unnamed: 0,item_id,developer,app_name,genres,tags,specs,release_year,price,_Released after 2010,_Released before 2000,...,Includes level editor,Mods,Mods (require HL2),Game demo,Includes Source SDK,SteamVR Collectibles,Keyboard / Mouse,Gamepad,Windows Mixed Reality,Mods (require HL1)
0,761140,Kotoshiro,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]","[Strategy, Action, Indie, Casual, Simulation]",[Single-player],2018,4.99,1,0,...,0,0,0,0,0,0,0,0,0,0
1,643980,Secret Level SRL,Ironbound,"[Free to Play, Indie, RPG, Strategy]","[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",2018,0.00,1,0,...,0,0,0,0,0,0,0,0,0,0
2,670290,Poolians.com,Real Pool 3D - Poolians,"[Casual, Free to Play, Indie, Simulation, Sports]","[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",2017,0.00,1,0,...,0,0,0,0,0,0,0,0,0,0
3,767400,彼岸领域,弹炸人2222,"[Action, Adventure, Casual]","[Action, Adventure, Casual]",[Single-player],2017,0.99,1,0,...,0,0,0,0,0,0,0,0,0,0
4,773570,Unknown,Log Challenge,Empty,"[Action, Indie, Casual, Sports]","[Single-player, Full controller support, HTC V...",2016,2.99,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,629410,Unknown,VectorWave,Empty,"[Indie, Casual, Strategy, Action, VR]","[Single-player, Steam Achievements, Steam Trad...",2016,9.99,1,0,...,0,0,0,0,0,0,0,0,0,0
9996,577530,Unknown,Pixel Ripped 1989,Empty,"[Adventure, Indie, Action]","[Single-player, Steam Achievements, Full contr...",2016,4.99,1,0,...,0,0,0,0,0,0,0,1,0,0
9997,635762,Sacada,LOGistICAL - USA - Hawaii,"[Casual, Strategy]","[Strategy, Casual]","[Single-player, Downloadable Content, Steam Ac...",2017,1.99,1,0,...,0,0,0,0,0,0,0,0,0,0
9998,617690,DarkSun Studio,Endless Winter,"[Adventure, Casual, Indie, Racing]","[Adventure, Indie, Casual, Racing, Puzzle, Par...","[Single-player, Online Co-op, Local Co-op, Ste...",2017,7.99,1,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# Reducing size
df_games = df_games.iloc[:5000]

In [18]:
# Store the DataFrame by instantiating the class
# Compare how fast it executes depending on the size of the DataFrame
computer = CosSimComputer(df_games)

In [22]:
# Computing similarities and getting indexes
indexes = computer.n_most_similar(10, to_=761140, indexes=True)

The items could instead be returned directly as a DataFrame by setting `indexes=False`.

Using the indexes to get the 10 most similar items.

In [24]:
similar_to_761140 = df_games.loc[indexes, ['app_name', 'genres', 'specs', 'release_year', 'price']]
similar_to_761140

Unnamed: 0,app_name,genres,specs,release_year,price
3161,Evolution II: Fighting for Survival,"[Action, Indie, Simulation, Strategy]",[Single-player],2015,1.99
2418,Wildlife Park 2 - Farm World,"[Casual, Indie, Simulation, Strategy]",[Single-player],2010,3.99
2417,Wildlife Park 2 - Dino World,"[Casual, Indie, Simulation, Strategy]",[Single-player],2012,3.99
2416,Wildlife Park 2 - Fantasy,"[Casual, Indie, Simulation, Strategy]",[Single-player],2013,0.99
4498,Ant War: Domination,"[Casual, Indie, Simulation, Strategy]","[Single-player, Steam Trading Cards]",2015,2.99
4271,Farm Frenzy: Heave Ho,"[Casual, Indie, Simulation, Strategy]","[Single-player, Steam Trading Cards]",2015,4.99
4053,Robot vs Birds Zombies,"[Action, Casual, Indie, Simulation]","[Single-player, Steam Trading Cards]",2013,0.99
3840,Sierra Ops Demo,"[Casual, Indie, Simulation, Strategy]","[Single-player, Game demo]",2016,4.99
3680,Age of Castles: Warlords,"[Casual, Indie, Simulation, Strategy]","[Single-player, Steam Trading Cards]",2015,1.99
2423,Wildlife Park 2 - Domestic Animals,"[Casual, Indie, Simulation, Strategy]","[Single-player, Downloadable Content]",2012,0.99
