## **Recommender System** Using **Cosine Similarity**

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

Creating the Recommender.

### The idea:

`CosSimComputer` contains the recommender system. Making recommendations requires several attributes and methods, which tends to produce long and complex blocks of code. Wrapping them in a class solves most of that, since distinct methods handle computing similarities and building responses.

It simply borrows the idea of `Estimators` from scikit-learn. This "`Computer`" object is certainly far simpler and less sophisticated than any class from a Machine Learning library, but it performs simple yet useful calculations tailored to the purpose of this project.

- `CosSimComputer` can be "trained" with a pandas DataFrame, which is an illusion because it just stores the DataFrame as an instance attribute.
- It has an equivalent of a "predict" method: `CosSimComputer.compute_similarities()` uses scikit-learn's `cosine_similarity` to compute the similarity between the basis vector and every item vector.
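For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms. The short sketch below checks that definition against scikit-learn's `cosine_similarity` on two small binary vectors. The vectors `a` and `b` are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical one-hot style item vectors (illustrative values only)
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 1, 0, 0])

# Definition: dot(a, b) / (||a|| * ||b||)
manual = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# scikit-learn expects 2D inputs of shape (n_samples, n_features)
sk = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(round(manual, 4), round(sk, 4))
```

The `reshape(1, -1)` calls are the reason the class stores its basis vector as a single-row 2D array.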

In [25]:
class CosSimComputer:

    def __init__(self, df_train:pd.DataFrame):
        self.df_train = df_train
        self.itemsMatrix = df_train.loc[:,'_Released after 2010':]
        self.items = df_train.loc[:,['item_id', 'app_name']]
        self.basisVector = None

    def set_basisVector(self, id):
        # Locating the row label of the item with the given id
        vector_idx = self.df_train.loc[self.df_train['item_id'] == id].index

        # Getting the feature values of that row (label-based lookup)
        vector = self.itemsMatrix.loc[vector_idx].values
        
        # Storing the basis vector as a single row of shape (1, n_features)
        self.basisVector = vector.reshape(1,-1)

    def _cos_sim(self, item:pd.Series):
        cos_sim = cosine_similarity(
            self.basisVector, item.values.reshape(1,-1)
            )[0,0]
        return cos_sim
    
    def compute_similarities(self):
        similarities = (
            self.itemsMatrix
            .apply(
                lambda row: self._cos_sim(item=row), 
                axis=1
                )
            )
        return similarities
    
    def n_most_similar(self, n:int, to_:int, indexes = False):
        # Resetting the basis vector for each computation
        self.set_basisVector(to_)

        # Computing similarities against every item
        similars = self.compute_similarities()

        # Labels of the n largest, dropping the query item by label
        # rather than by position (a tie at 1.0 could displace it)
        query_idx = self.df_train.index[self.df_train['item_id'] == to_]
        n_largest = similars.drop(query_idx).nlargest(n, keep='last').index

        # Optionally returning only the index labels
        if indexes:
            return n_largest
        
        # Returning item ids and names
        items = self.items.loc[n_largest]
        return items
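
`compute_similarities` calls `cosine_similarity` once per row through `apply`. Since `cosine_similarity` accepts a whole matrix as its second argument, the same scores can likely be obtained in a single call, which should scale better with roughly 32,000 items. The sketch below compares both approaches on a small made-up frame. The names `features` and `basis` are illustrative and not part of the class.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical binary feature matrix standing in for itemsMatrix
features = pd.DataFrame(
    [[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 1]],
    columns=["f1", "f2", "f3", "f4"],
)
basis = features.iloc[[0]].values  # shape (1, n_features)

# Row-wise version, mirroring compute_similarities
row_wise = features.apply(
    lambda row: cosine_similarity(basis, row.values.reshape(1, -1))[0, 0],
    axis=1,
)

# Vectorized version: one call returns an array of shape (1, n_items)
vectorized = pd.Series(
    cosine_similarity(basis, features.values)[0], index=features.index
)

print(np.allclose(row_wise, vectorized))
```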

## Preprocessing Pipeline:
Includes:

- Extracting prices from labels and filling nulls with the median.
- Binning and one-hot encoding years and prices.
- One-hot encoding distinct genres and specs.
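
The helpers in `functions.preprocessing` are not shown in this notebook, so the sketch below is only a guess at what such steps might look like with plain pandas. The toy data, the `to_float` helper, the year bin edges, and the treatment of non-numeric price labels are all assumptions. The bin labels reuse column names that appear in the processed DataFrame.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative, not taken from games.json.gz)
df = pd.DataFrame({
    "price": ["4.99", "Free to Play", 19.99, None],
    "release_year": [1998, 2006, 2015, 2011],
    "genres": [["Action"], ["Action", "RPG"], ["Indie"], ["RPG"]],
})

def to_float(label):
    """Hypothetical price cleaner: numbers pass through, other labels become NaN."""
    try:
        return float(label)
    except (TypeError, ValueError):
        return np.nan

# Cleaning prices and filling nulls with the median
df["price"] = df["price"].apply(to_float)
df["price"] = df["price"].fillna(df["price"].median())

# Binning years (assumed edges), then one-hot encoding the bins
year_bins = pd.cut(
    df["release_year"],
    bins=[-np.inf, 1999, 2010, np.inf],
    labels=["_Released before 2000", "_Released in 2000-2010", "_Released after 2010"],
)
year_dummies = pd.get_dummies(year_bins, dtype=int)
year_dummies.columns = year_dummies.columns.astype(str)  # plain string column names
df = df.join(year_dummies)

# One-hot encoding list-valued genres: explode, encode, collapse back per row
genre_dummies = pd.get_dummies(df["genres"].explode(), dtype=int).groupby(level=0).max()
df = df.join(genre_dummies)

print(df.drop(columns=["genres"]))
```

Prices could be binned the same way with `pd.cut` once suitable edges are chosen.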

In [10]:
df_games = pd.read_json('data/games.json.gz', compression='gzip', lines=True)

In [11]:
import functions.preprocessing as pp 

# Cleaning and filling prices
df_games['price'] = df_games['price'].apply(pp.float_prices)
df_games['price'] = df_games['price'].fillna(df_games['price'].median())

# year binning
df_games = pp.year_binning(df_games, dummies=True)

# Price binning
df_games = pp.price_binning(df_games, dummies=True)

# Genres dummies
df_games = pp.genres_dummies(df_games)

# Specs dummies
df_games = pp.specs_dummies(df_games)

At the end of the preprocessing procedure, the DataFrame will be mostly ones and zeros.

Excluding the original columns, the remaining ones form what is called a sparse matrix (mostly zeros). That slice of data is stored in the `.itemsMatrix` attribute. **Sparse matrices can make storage and computation more efficient when they are held in a sparse format.**
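
A pandas DataFrame of integer columns is stored densely, so any such savings would only appear after converting the data to a sparse format. As a rough sketch, assuming SciPy is available, a binary matrix can be converted with `scipy.sparse.csr_matrix` and passed straight to `cosine_similarity`, which accepts sparse input. The random matrix below is made up for illustration, and its size and density are assumptions.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical binary item matrix: 1000 items x 70 features, about 5% ones
rng = np.random.default_rng(0)
dense = (rng.random((1000, 70)) < 0.05).astype(np.int64)

# Compressed Sparse Row format stores only the non-zero entries
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")

# Similarity of the first item to every item, computed from sparse input
sims = cosine_similarity(csr[0], csr)[0]
print(sims.shape)
```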

In [12]:
df_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32133 entries, 0 to 32132
Data columns (total 77 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   item_id                     32133 non-null  int64  
 1   developer                   32133 non-null  object 
 2   app_name                    32133 non-null  object 
 3   genres                      32133 non-null  object 
 4   tags                        32133 non-null  object 
 5   specs                       32133 non-null  object 
 6   release_year                32133 non-null  int64  
 7   price                       32133 non-null  float64
 8   _Released after 2010        32133 non-null  int64  
 9   _Released before 2000       32133 non-null  int64  
 10  _Released in 2000-2010      32133 non-null  int64  
 11  _Cheap                      32133 non-null  int64  
 12  _Expensive                  32133 non-null  int64  
 13  _Typical price              321

Checking class usability

In [26]:
# Storing the DataFrame
computer = CosSimComputer(df_games)

In [27]:
# Computing similarities and getting indexes
indexes = computer.n_most_similar(10, to_=10, indexes=True)

A DataFrame with the item ids and names is returned instead when `indexes=False`.

Using indexes to get the 10 most similar.

In [28]:
df_games.loc[indexes]

| | item_id | developer | app_name | genres | tags | specs | release_year | price | _Released after 2010 | _Released before 2000 | ... | Includes level editor | Mods | Mods (require HL2) | Game demo | Includes Source SDK | SteamVR Collectibles | Keyboard / Mouse | Gamepad | Windows Mixed Reality | Mods (require HL1) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32025 | 360 | Valve | Half-Life Deathmatch: Source | [Action] | [Action, FPS, Multiplayer, Sci-fi, Shooter, Fi... | [Multi-player, Valve Anti-Cheat enabled] | 2006 | 9.99 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32115 | 80 | Valve | Counter-Strike: Condition Zero | [Action] | [Action, FPS, Shooter, Multiplayer, Singleplay... | [Single-player, Multi-player, Valve Anti-Cheat... | 2004 | 9.99 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32021 | 1200 | Tripwire Interactive | Red Orchestra: Ostfront 41-45 | [Action] | [Action, World War II, FPS, Realistic, Multipl... | [Multi-player, Steam Achievements, Valve Anti-... | 2006 | 9.99 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32050 | 2100 | Arkane Studios | Dark Messiah of Might & Magic | [Action, RPG] | [RPG, Action, First-Person, Fantasy, Adventure... | [Single-player, Multi-player, Valve Anti-Cheat... | 2006 | 9.99 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32112 | 30 | Valve | Day of Defeat | [Action] | [FPS, World War II, Multiplayer, Action, Shoot... | [Multi-player, Valve Anti-Cheat enabled] | 2003 | 4.99 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32106 | 40 | Valve | Deathmatch Classic | [Action] | [Action, FPS, Multiplayer, Classic, Shooter, F... | [Multi-player, Valve Anti-Cheat enabled] | 2001 | 4.99 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32103 | 60 | Valve | Ricochet | [Action] | [Action, FPS, Multiplayer, First-Person, Cyber... | [Multi-player, Valve Anti-Cheat enabled] | 2000 | 4.99 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 31904 | 2640 | Gray Matter Studios | Call of Duty: United Offensive | [Action] | [Action, World War II, FPS, Shooter, Multiplay... | [Single-player, Multi-player] | 2004 | 19.99 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 31903 | 2630 | Infinity Ward,Aspyr (Mac) | Call of Duty® 2 | [Action] | [Action, FPS, World War II, Multiplayer, Singl... | [Single-player, Multi-player] | 2005 | 19.99 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 31900 | 2620 | Infinity Ward | Call of Duty® | [Action] | [FPS, Action, World War II, Classic, Shooter, ... | [Single-player, Multi-player] | 2003 | 19.99 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
