# Modeling
## Predict recommended games

In this part, we will predict if a user would recommend a given game.

I will assume that the goal is to predict the _recommend_ column of the _reviews_ dataset for a given user and a given game. This is therefore a supervised binary classification problem.

In [1]:
cd ..

/Users/nicolas.peruchot/workdir/scalable


In [2]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

from steam_challenge.preprocessing.preprocessing_items import items_preprocessing
from steam_challenge.preprocessing.preprocessing_reviews import reviews_preprocessing
from steam_challenge.preprocessing.preprocessing_users import users_preprocessing

from steam_challenge.features_eng import dataset_creation, features_creation
from sklearn.metrics import accuracy_score

pd.options.mode.chained_assignment = None

In [3]:
reviews=pd.read_json("data/reviews.json",orient='records')
users=pd.read_json("data/users.json",orient='records')
items=pd.read_json("data/items.json",orient='records')

reviews=reviews_preprocessing(reviews)
users=users_preprocessing(users)
items=items_preprocessing(items)

## Feature engineering

We will first create the dataset. The goal is to extract features for each game and each user.

- We will first merge the _reviews_ dataset with the _users_ dataset, so that each review is associated with a user.
- Then, we will create a feature for each game that represents the total playtime for that game among all users. In the same way, we will compute the total playtime of each user across all of their games. We add these two features to our dataset.
- Finally, we merge the _items_ dataset to the dataset that we are creating.
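
The three steps above can be sketched with pandas on toy data. This is only an illustration, not the actual `dataset_creation` implementation: the table layouts, the `price` column and the names `user_total_playtime` and `game_total_playtime` are assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the preprocessed tables (the column layout is assumed).
reviews = pd.DataFrame({
    "user_id": ["u1", "u2", "u1"],
    "app_name": ["Half-Life", "Half-Life", "Portal"],
    "recommend": [True, True, False],
})
users = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "app_name": ["Half-Life", "Portal", "Half-Life"],
    "playtime": [100, 30, 50],
})
items = pd.DataFrame({
    "app_name": ["Half-Life", "Portal"],
    "price": [9.99, 9.99],
})

# Step 1: attach to each review the reviewer's playtime for that game.
data = reviews.merge(users, on=["user_id", "app_name"], how="inner")

# Step 2: total playtime per user and per game, added as two new features.
user_total = users.groupby("user_id")["playtime"].sum()
game_total = users.groupby("app_name")["playtime"].sum()
data["user_total_playtime"] = data["user_id"].map(user_total)
data["game_total_playtime"] = data["app_name"].map(game_total)

# Step 3: add the game metadata from the items table.
data = data.merge(items, on="app_name", how="left")
print(data)
```

Using `map` on a grouped sum keeps one row per review, which is the shape needed for the classification target.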

In [4]:
dataset=dataset_creation(reviews=reviews,items=items,users=users)

In [5]:
Y = dataset.recommend
dataset = dataset.drop(columns=['app_name','recommend','user_id'])

In [6]:
dataset.head()

| ind | developer | early_access | genres | price | release_date | sentiment | specs | tags | funny | helpful | playtime | Total playtime on Steam for this user | Total playtime for this game among all users |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Game: Carmageddon Max Pack, User: InstigatorAU | Stainless Games Ltd | 0 | ['Action', 'Indie', 'Racing'] | 9.99 | 1997-06-30 | 1 | ['Single-player', 'Multi-player', 'Steam Tradi... | ['Racing', 'Action', 'Classic', 'Indie', 'Gore... | 0 | 0 | 466 | 5678 | 632 |
| Game: Half-Life, User: EizanAratoFujimaki | Valve | 0 | ['Action'] | 9.99 | 1998-11-08 | 1 | ['Single-player', 'Multi-player', 'Valve Anti-... | ['FPS', 'Classic', 'Action', 'Sci-fi', 'Single... | 1 | 74 | 1395 | 2278 | 81542 |
| Game: Half-Life, User: GamerFag | Valve | 0 | ['Action'] | 9.99 | 1998-11-08 | 1 | ['Single-player', 'Multi-player', 'Valve Anti-... | ['FPS', 'Classic', 'Action', 'Sci-fi', 'Single... | 0 | 0 | 590 | 41463 | 81542 |
| Game: Half-Life, User: 76561198020928326 | Valve | 0 | ['Action'] | 9.99 | 1998-11-08 | 1 | ['Single-player', 'Multi-player', 'Valve Anti-... | ['FPS', 'Classic', 'Action', 'Sci-fi', 'Single... | 0 | 100 | 5599 | 9324 | 81542 |
| Game: Half-Life, User: Bluegills | Valve | 0 | ['Action'] | 9.99 | 1998-11-08 | 1 | ['Single-player', 'Multi-player', 'Valve Anti-... | ['FPS', 'Classic', 'Action', 'Sci-fi', 'Single... | 0 | 0 | 64 | 13804 | 81542 |


In [7]:
Y.head()

ind
Game: Carmageddon Max Pack, User: InstigatorAU    True
Game: Half-Life, User: EizanAratoFujimaki         True
Game: Half-Life, User: GamerFag                   True
Game: Half-Life, User: 76561198020928326          True
Game: Half-Life, User: Bluegills                  True
Name: recommend, dtype: bool

## Baseline model

We will create a first simple baseline model: if the _sentiment_ feature is positive, we will assume that the player is going to recommend the game.

In [8]:
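# Vectorized baseline: predict "recommend" whenever sentiment is positive.
prediction = dataset.sentiment == 1
# accuracy_score returns the share of correct predictions (accuracy, not precision).
score = round(accuracy_score(Y, prediction) * 100, 2)
print(f"Accuracy of {score}% on the whole dataset.")

Accuracy of 84.8% on the whole dataset.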


With this baseline model, the accuracy is already quite good, which supports our intuition: when the majority of users like a game, a given user is more likely to recommend it too.

We will now create a more complex model. For this, we will continue the feature engineering.
- We will create one column for each value found in _specs_, _genres_ and _tags_.
- Then, we will encode the name of the developers.
- We will then normalize the data and reduce the number of features by performing a PCA, keeping 95% of the variance.
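
As a rough sketch of these steps on toy data, we could combine `MultiLabelBinarizer`, `StandardScaler` and `PCA` from scikit-learn. This is not the actual `features_creation` code: the toy rows, the `genre_` prefix, and the choice of integer category codes for the developer are assumptions made for the example.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler

# Toy rows; the real dataset also has specs, tags and more numeric columns.
toy = pd.DataFrame({
    "developer": ["Valve", "Valve", "Stainless Games Ltd", "Valve"],
    "genres": [["Action"], ["Action", "Indie"], ["Action", "Indie", "Racing"], ["Indie"]],
    "price": [9.99, 19.99, 9.99, 4.99],
    "playtime": [1395, 590, 466, 64],
})

# One binary column per genre (the same idea applies to specs and tags).
mlb = MultiLabelBinarizer()
genre_cols = pd.DataFrame(
    mlb.fit_transform(toy["genres"]),
    columns=[f"genre_{g}" for g in mlb.classes_],
    index=toy.index,
)

# Encode developer names as integer codes (one possible encoding among several).
dev_codes = toy["developer"].astype("category").cat.codes.rename("developer_code")

features = pd.concat([toy[["price", "playtime"]], dev_codes, genre_cols], axis=1)

# Standardize, then keep enough principal components to explain 95% of the variance.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(scaled)
print(reduced.shape, pca.explained_variance_ratio_.sum())
```

Passing a float between 0 and 1 as `n_components` lets PCA choose the number of components from the explained variance, so the output width depends on the data rather than being fixed in advance.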

In [9]:
dataset=features_creation(data=dataset)

In [10]:
dataset.head()

| ind | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Game: Carmageddon Max Pack, User: InstigatorAU | -1.077378 | 1.444128 | -0.177056 | 0.674488 | -0.264709 | 0.032672 | -0.145676 | 0.641126 | -2.324064 | 1.273667 | ... | -0.96007 | -0.21087 | -2.588835 | -0.626609 | -1.862201 | 0.952158 | 0.038417 | -0.84748 | 0.439314 | 1.030504 |
| Game: Half-Life, User: EizanAratoFujimaki | -0.395918 | 2.547259 | -1.46082 | -3.506735 | -1.944084 | -0.189999 | -0.798068 | 1.658398 | -1.868825 | -2.502362 | ... | 0.103976 | -0.666985 | -0.88392 | -0.165434 | -0.004997 | -0.462682 | -1.132751 | -0.096305 | 0.483026 | -0.118436 |
| Game: Half-Life, User: GamerFag | -0.329749 | 2.562931 | -1.451227 | -3.514582 | -2.007942 | -0.274361 | -0.788192 | 1.645983 | -1.892065 | -2.576678 | ... | 0.214936 | -0.650993 | -0.899773 | -0.258455 | -0.035552 | -0.462567 | -1.13032 | -0.099273 | 0.45331 | -0.088015 |
| Game: Half-Life, User: 76561198020928326 | -0.317628 | 2.572196 | -1.471656 | -3.502447 | -2.008656 | -0.229122 | -0.795183 | 1.661691 | -1.880534 | -2.5627 | ... | 0.104554 | -0.62362 | -0.883542 | -0.161348 | -0.015335 | -0.443559 | -1.151226 | -0.126525 | 0.487223 | -0.141941 |
| Game: Half-Life, User: Bluegills | -0.416972 | 2.540017 | -1.47967 | -3.498658 | -1.996057 | -0.260199 | -0.793537 | 1.660248 | -1.903245 | -2.573462 | ... | 0.165562 | -0.646142 | -0.894077 | -0.224047 | -0.024589 | -0.457658 | -1.129103 | -0.115188 | 0.475497 | -0.101463 |


We will now split our dataset and train a simple Logistic Regression model.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(dataset,Y,test_size=0.2, random_state=12)

In [12]:
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression(max_iter=10000)
lr.fit(X_train,y_train)
print(f"Score: {round(accuracy_score(y_test,lr.predict(X_test))*100,2)}")

Score: 89.37


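The logistic regression reaches 89.37% accuracy on the held-out test set, compared with 84.8% for the baseline. Note that the baseline was scored on the whole dataset, so the two figures are not strictly comparable.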
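
The baseline was scored on the whole dataset and the logistic regression on the 20% test split, so the two scores do not cover the same rows. One way to compare them like for like is to score the baseline on the test rows only. The sketch below does this on synthetic data: the features, the labels and the 15% noise rate are all made up for illustration, and in the notebook itself the raw _sentiment_ column would have to be kept aside before the PCA step.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
n = 500

# Synthetic stand-in data: a binary sentiment flag plus one numeric feature.
X = pd.DataFrame({
    "sentiment": rng.integers(0, 2, size=n),
    "playtime": rng.exponential(100.0, size=n),
})
# Made-up label: agrees with the sentiment flag about 85% of the time.
flip = rng.random(n) < 0.15
y = pd.Series((X["sentiment"] == 1) ^ flip, name="recommend")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12
)

# Baseline scored on the test rows only.
baseline_acc = accuracy_score(y_test, X_test["sentiment"] == 1)

# Model trained on the training rows and scored on the same test rows.
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
model_acc = accuracy_score(y_test, model.predict(X_test))

print(f"Baseline: {baseline_acc:.2%}, logistic regression: {model_acc:.2%}")
```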