# Video games recommender system | part 2 
Based on Steam sales datasets:
- https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data gathered by [Insert teacher names]
- https://www.kaggle.com/nikdavis/steam-store-games

In this notebook we build a video game recommender system from a Steam sales dataset, trying several techniques to recommend games to a user as accurately as possible.

This notebook will focus on the modeling part of the project.

____

## Imports

In [118]:
#Data manipulation
import pandas as pd
import numpy as np
import string
import re
from collections import defaultdict

#Modeling 
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import jaccard_score

#File manipulation
import os
import gzip
import ast

#DataViz
import plotly.express as px
import matplotlib.pyplot as plt
import cufflinks as cf
%matplotlib inline

#Helper functions
%load_ext autoreload
%autoreload 2
import helper

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Load dataset

In [2]:
df = pd.read_pickle('data/final_df.pckl')

## Features

The goal of the recommender system is to recommend a new game to a user based on the games they have already played and on other relations, such as similar users.

This can be approximated by a binary classification task that assigns each (user, game) pair to one of two classes:
- $1$: the user will play the game.
- $0$: the user will not play the game.

So let's add a **play** column to the data frame.

In [5]:
# 1 if the player ever played the game (playtime > 0), 0 otherwise
df['play'] = df['playtime_forever'].apply(lambda x : 1 if (x > 0) else 0)
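
A vectorized comparison gives the same label without calling a Python lambda on every row, which can matter on a frame of this size. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the real data frame; values are illustrative only
toy = pd.DataFrame({"playtime_forever": [0.0, 0.9, 3.5, 0.0]})

# Row-by-row version, as in the cell above
play_apply = toy["playtime_forever"].apply(lambda x: 1 if (x > 0) else 0)

# Vectorized version: boolean comparison cast to 0/1 integers
play_vectorized = (toy["playtime_forever"] > 0).astype(int)

print(play_vectorized.tolist())
```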

As the shape below shows, the dataset is large (about 4 million rows) and high-dimensional (76 columns), which can make model training slow. We may need dimensionality reduction to address this.
Two options:
- PCA, to reduce the number of features.
- Clustering methods (K-means, GMM, ...), to fit a model within each cluster.
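
The PCA option could look roughly like the sketch below. It runs on synthetic data rather than on `df`, and the 95% variance threshold is an arbitrary assumption, not a value taken from this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 20 correlated features driven by 5 hidden factors
latent = rng.normal(size=(500, 5))
loadings = rng.normal(size=(5, 20))
X_demo = latent @ loadings + 0.1 * rng.normal(size=(500, 20))

# PCA is scale-sensitive, so standardize first
X_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(X_demo.shape, "->", X_reduced.shape)
print("explained variance kept:", pca.explained_variance_ratio_.sum())
```

On the real data the scaler and the PCA would be fitted on the training set only and then applied to the validation and test sets.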

In [8]:
df.shape

(4020731, 76)

### Sets

In [42]:
# Keep only the release year so the date becomes a numeric feature
df.loc[:,'release_date'] = df['release_date'].apply(lambda x : x.year)

In [156]:
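# Drop identifiers and raw text/list columns; 'play' (added last) is the target.
# Note: X still contains playtime_forever, the column 'play' was derived from.
features = df.drop(['item_id','user_id','item_name','categories','genres'],axis = 1)
X = features.iloc[:,:-1]
y = features.iloc[:,-1]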

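Time to create the train, validation and test sets, split as follows:
- Train = 60%
- Validation = 20%
- Test = 20%

The first split holds out 20% of the data for testing. The second takes 25% of the remaining 80% for validation (0.25 × 0.80 = 0.20), which leaves 60% for training.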

In [157]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [158]:
X_train, X_val, y_train, y_val = train_test_split(X_train,y_train,test_size = 0.25 ,random_state = 42)

In [159]:
X_train.shape,X_val.shape,X_test.shape

((2412438, 70), (804146, 70), (804147, 70))

In [177]:
X_train.head()

Unnamed: 0,playtime_forever,playtime_2weeks,release_date,price,Online Multi-Player,Captions available,MMO,Stats,Online Co-op,Windows Mixed Reality,...,Sexual Content,Sports,Animation & Modeling,Simulation,Software Training,Indie,Action,Video Production,Early Access,Web Publishing
909261,0.9,0.0,2012,14.99,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
2032823,0.816667,0.0,2006,24.99,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
319315,3.566667,0.0,1997,2.99,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
711759,0.583333,0.0,2009,12.99,0,1,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
897700,2.116667,0.0,2003,4.99,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Logistic classification

### Standardization

First, let's not forget to standardize the features. The scaler is fitted on the training set only and then applied to the validation and test sets.

In [160]:
from sklearn.preprocessing import StandardScaler

In [161]:
scaler = StandardScaler()

In [162]:
log_X_train = pd.DataFrame(scaler.fit_transform(X_train))

In [163]:
log_X_val = pd.DataFrame(scaler.transform(X_val))
log_X_test = pd.DataFrame(scaler.transform(X_test))
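
Wrapping the scaled array in a bare `pd.DataFrame` discards the column names and the row index, which is why the cells further down select features by position. A possible alternative, sketched on a toy frame with made-up values, keeps both:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/validation frames with a non-default index, like the split output
train = pd.DataFrame(
    {"playtime_forever": [0.9, 0.0, 3.5, 2.1], "price": [14.99, 24.99, 2.99, 4.99]},
    index=[909261, 2032823, 319315, 897700],
)
val = pd.DataFrame({"playtime_forever": [0.5], "price": [12.99]}, index=[711759])

scaler = StandardScaler()

# Fit on the training set only, then reuse the same statistics for validation
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=train.columns, index=train.index
)
val_scaled = pd.DataFrame(
    scaler.transform(val), columns=val.columns, index=val.index
)

# Features can now be selected by name instead of by position
print(train_scaled[["playtime_forever", "price"]].head())
```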

____

### Logistic regression with full features

In [37]:
from sklearn.linear_model import LogisticRegression

In [190]:
clf = LogisticRegression(solver = 'sag')

In [191]:
%time clf.fit(log_X_train,y_train)

CPU times: user 5min 41s, sys: 3.78 s, total: 5min 45s
Wall time: 5min 48s



The max_iter was reached which means the coef_ did not converge



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='sag', tol=0.0001, verbose=0,
                   warm_start=False)
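
The warning above means the solver stopped at the default `max_iter=100` before meeting its tolerance. One common response is to raise `max_iter`. The sketch below uses synthetic data, and the values 2 and 1000 are arbitrary assumptions chosen only to show how to check whether the limit was hit.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized training set
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

def fit_and_report(max_iter):
    """Fit a 'sag' logistic regression and report whether it hit max_iter."""
    model = LogisticRegression(solver="sag", max_iter=max_iter, random_state=42)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X_demo, y_demo)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, hit_limit

short_model, short_hit = fit_and_report(max_iter=2)
long_model, long_hit = fit_and_report(max_iter=1000)

# n_iter_ holds the number of epochs actually run
print("max_iter=2    -> epochs:", short_model.n_iter_[0], "hit limit:", short_hit)
print("max_iter=1000 -> epochs:", long_model.n_iter_[0], "hit limit:", long_hit)
```

On the full training set each extra epoch costs time, so a larger `max_iter` trades fit time for a better chance of convergence.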

In [192]:
%time log_y_pred = clf.predict(log_X_val)

CPU times: user 227 ms, sys: 496 ms, total: 724 ms
Wall time: 1.04 s


In [193]:
print(classification_report(y_val, log_y_pred))

              precision    recall  f1-score   support

           0       0.75      0.85      0.80    258884
           1       0.93      0.87      0.90    545262

    accuracy                           0.86    804146
   macro avg       0.84      0.86      0.85    804146
weighted avg       0.87      0.86      0.87    804146



In [179]:
clf.coef_

array([[ 5.17326840e+01,  4.58005007e+00,  4.81565898e-02,
         1.97156282e-02,  5.32657434e-02,  6.70556655e-02,
         1.30172876e-01,  1.51344043e-02, -3.34577651e-02,
        -1.06795848e-02,  1.01173358e-01, -2.68965648e-04,
         5.25429199e-02, -3.75249577e-02, -1.58750724e-02,
         1.08036239e-01,  2.55197337e-02,  1.11671317e-02,
         5.07441951e-02, -1.26309231e-02,  1.70050113e-02,
         1.33865039e-03, -1.09216588e-02,  2.13146527e-02,
         6.34486233e-03,  5.35108697e-02,  7.28101319e-02,
         7.54323617e-02,  1.21207072e-01, -2.38505012e-03,
         2.38942951e-02, -6.43479705e-02,  3.06670654e-02,
        -3.38526748e-02, -1.85098055e-02,  4.30527968e-02,
        -2.83162114e-02, -3.59189538e-02,  0.00000000e+00,
        -2.59630745e-02, -2.27462089e-02, -4.09523838e-03,
         1.06338682e-02, -6.24177033e-02,  3.51267624e-02,
        -4.19095718e-02,  0.00000000e+00, -3.58949984e-02,
        -1.42425240e-01,  1.36076011e-01,  1.52123355e-0
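
The first coefficient is orders of magnitude larger than most of the others. Given the column order shown by `X_train.head()`, it belongs to `playtime_forever`, the column the **play** label was derived from, which suggests the model can largely read the label off that feature. The sketch below reproduces the effect on synthetic data (all values are made up, and the noise features are hypothetical stand-ins for the other columns):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 5000

# Synthetic data: about 60% of pairs have a positive playtime
played = rng.random(n) < 0.6
playtime = np.where(played, rng.exponential(scale=5.0, size=n), 0.0)
other = rng.normal(size=(n, 3))   # features unrelated to the label
y = (playtime > 0).astype(int)    # label derived from playtime, like 'play'

def validation_accuracy(X):
    """Split, standardize on the training part, fit, and score on validation."""
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler().fit(X_tr)
    model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
    return model.score(scaler.transform(X_va), y_va)

acc_with = validation_accuracy(np.column_stack([playtime, other]))
acc_without = validation_accuracy(other)
print("with playtime:", acc_with, "without:", acc_without)
```

If the same pattern holds on the real data, the scores reported in this notebook mostly measure how well the model recovers `playtime_forever > 0`, and dropping the playtime columns from `X` would give a more honest estimate for recommendation.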

### Logistic regression with selected features

In [194]:
clf_ = LogisticRegression(solver = 'sag')

In [173]:
%time clf_.fit(log_X_train.loc[:,0:3],y_train)

CPU times: user 1min 53s, sys: 438 ms, total: 1min 54s
Wall time: 1min 54s



The max_iter was reached which means the coef_ did not converge



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)

In [175]:
# Use the same column slice as in fit (.loc slicing is inclusive, so 0:3 selects 4 columns)
log_y_pred_ = clf_.predict(log_X_val.loc[:,0:3])

In [176]:
print(classification_report(y_val, log_y_pred_))

              precision    recall  f1-score   support

           0       0.76      0.91      0.83    258884
           1       0.95      0.86      0.90    545262

    accuracy                           0.88    804146
   macro avg       0.85      0.88      0.86    804146
weighted avg       0.89      0.88      0.88    804146



In [180]:
clf_.coef_

array([[ 5.84835115e+01,  5.03278027e+00,  2.23423779e-01,
        -1.53118480e-02,  1.13019500e-01,  8.69659835e-02]])

In [None]:
clf_.fit(log_X_train.loc[:,0:3],y_train)

In [None]:
log_y_pred_ = clf_.predict(log_X_val.loc[:,0:3])

In [None]:
print(classification_report(y_val, log_y_pred_))

In [186]:
clf_.coef_

array([[ 5.88020667e+01,  5.09077482e+00,  2.40717329e-01,
        -2.69212074e-02]])

In [187]:
y_test_pred = clf_.predict(log_X_test.loc[:,0:3])

In [189]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.77      0.86      0.81    258642
           1       0.93      0.88      0.90    545505

    accuracy                           0.87    804147
   macro avg       0.85      0.87      0.86    804147
weighted avg       0.88      0.87      0.87    804147

