# Descriptions and Implementations of algorithms

**Steam Games Predictor:** by: Kornel Zieliński, Krystian Rodzaj, Krystian Wojakiewicz

## Introduction

To ensure our algorithms can correctly distinguish our input parameters, we first have to encode them. This process was explained in the previous presentation. Now that we have our encoded datasets, we can start implementing the chosen algorithms. For the implementation we use the **sklearn** Python library, which provides a wide range of machine learning tools. In this presentation we cover **Decision Trees**, **Random Forests**, **K-Nearest Neighbors**, **Linear Regression** and **Support Vector Machines**. We look at the different parameters used for building the algorithms, how they were chosen, and what they represent. The results of some runs of the algorithms are also shown. In the future, neural networks will be implemented and compared to the results presented here.

## Data preparation

Below you can see the helper functions used for saving the generated figures ("save_fig") and for loading the csv files containing the encoded datasets ("load_data"). Some useful constants are also set here (paths, folder names).

In [19]:
from __future__ import division, print_function, unicode_literals

import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Location, in which the figures will be saved
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "machine_learning_part"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "pictures", CHAPTER_ID)
DATA_PATH = 'E:\\PROGRAMS\\GitHub\\SteamGamesPredictor\\src\\data_shuffle\\data\\'

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    # Build the target path, e.g. ./pictures/machine_learning_part/<fig_id>.png
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving image", fig_id)
    if tight_layout:
        plt.tight_layout()
    # dpi controls the resolution of the saved image
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [20]:
import os
import pandas as pd

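def load_data(steam_path, file):
    csv_path = os.path.join(steam_path, file)
    # Skip malformed rows instead of raising an error.
    # Note: error_bad_lines was removed in pandas 2.0; on newer versions use on_bad_lines="skip".
    return pd.read_csv(csv_path, error_bad_lines=False)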

In [21]:
steam = load_data(DATA_PATH, 'enc_steam.csv')

Using the **head** method of the **Pandas** DataFrame structure, we can look at the first few rows of the dataset.

In [22]:
steam.head()

Unnamed: 0,Accounting,Action,Adventure,Animation & Modeling,Audio Production,Casual,Design & Illustration,Documentary,Early Access,Education,...,Year,English,Developer,Publisher,Required_Age,Achievements,Average_Playtime,Median_Playtime,Rating,Owners
0,0,1,0,0,0,0,0,0,0,0,...,2000,1,0,0,0,0,17612,317,97.0,0
1,0,1,0,0,0,0,0,0,0,0,...,1999,1,0,0,0,0,277,62,84.0,1
2,0,1,0,0,0,0,0,0,0,0,...,2003,1,0,0,0,0,187,34,90.0,1
3,0,1,0,0,0,0,0,0,0,0,...,2001,1,0,0,0,0,258,184,83.0,1
4,0,1,0,0,0,0,0,0,0,0,...,1999,1,4,0,0,0,624,415,95.0,1


In [23]:
steam_cl = steam.copy()
steam_lin = steam.copy()

# DecisionTreeClassifier

The decision tree learning method is a predictive model commonly used in machine learning and data mining. The idea is based on decision trees, in which tests on the features of the examined subject are represented as branches, and the target values (output classes) are represented as leaves. The algorithm traverses the tree from the root, making a decision about the given feature at each branch and continuing depending on that decision. When a leaf is reached, the prediction is the target value stored in that particular leaf. If the target values form a finite set, the tree is called a classification tree.

In our program, we used the **sklearn.tree.DecisionTreeClassifier** class to build our algorithm.

![decision tree](..\data\figures\dec_tree.png "Sample decision tree")
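The traversal described above can be sketched in a few lines of plain Python. This is only an illustration, not how sklearn stores its trees internally: the tree below is built by hand, the feature names `Action` and `Year` are borrowed from the dataset columns, and the thresholds and leaf labels are invented.

```python
# Hypothetical hand-built tree; thresholds and leaf labels are invented.
TREE = {
    "feature": "Action",
    "threshold": 0.5,
    "left": {"leaf": "low"},
    "right": {
        "feature": "Year",
        "threshold": 2010,
        "left": {"leaf": "high"},
        "right": {"leaf": "medium"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf and return the class stored there."""
    while "leaf" not in node:
        # Go left when the feature value is at or below the threshold.
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]


print(predict(TREE, {"Action": 1, "Year": 2003}))
```

Each loop iteration corresponds to one decision at a branch; the loop ends as soon as a leaf is reached.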

First, we need to divide our dataset into a training set and a test set. The goal of this algorithm is to predict the number of potential buyers of a new game, based on features such as genre, developer and release date. The **Owners** feature is represented by ranges of owners, e.g. [2000 - 5000]. It is our target, so we separate the **Owners** column from the input features and use it as the label for training and for evaluating the predictions.

In [24]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

x,y = steam_cl.loc[:,steam_cl.columns != 'Owners'], steam_cl.loc[:,'Owners']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3,random_state = 42)

In [25]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree_clf = DecisionTreeClassifier()

Below, the process of finding the optimal hyperparameters is shown. We used the **sklearn.model_selection.RandomizedSearchCV** tool with ten iterations to find more suitable hyperparameters.

In [26]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

tree_clf = DecisionTreeClassifier()

param_dist = {"max_depth": sp_randint(1,22),
              "max_features": sp_randint(1, 22),
              "min_samples_split": sp_randint(2, 100),
              "random_state": sp_randint(2, 100),
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 10
random_search = RandomizedSearchCV(tree_clf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)
random_search.fit(x_train, y_train)



RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=DecisionTreeClassifier(class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features=None,
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    presort=False,
                                                    random_state=None,
                                                    splitter='best')

In [27]:
print(random_search.best_score_)
print(random_search.best_params_)

0.720873786407767
{'criterion': 'entropy', 'max_depth': 15, 'max_features': 21, 'min_samples_split': 95, 'random_state': 34}


These are the results of the **RandomizedSearchCV**:
-  **best_score_**: the mean cross-validated score (here accuracy) of the best parameter combination,
-  **best_params_**: the parameter combination that achieved this score.
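To make clearer what a randomized search does, here is a minimal sketch in plain Python. It is not the sklearn implementation: `random_search`, `param_space` and `toy_score` are hypothetical names, and the score function is invented so that the example runs without a dataset. In the real tool, the score of each candidate is the mean cross-validated score of the estimator.

```python
import random


def random_search(score_fn, param_space, n_iter=10, seed=0):
    """Sample n_iter parameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        # Draw one value per hyperparameter from its sampling function.
        params = {name: draw(rng) for name, draw in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


# Hypothetical search space; randint(1, 21) mirrors sp_randint(1, 22),
# whose upper bound is exclusive.
param_space = {
    "max_depth": lambda rng: rng.randint(1, 21),
    "criterion": lambda rng: rng.choice(["gini", "entropy"]),
}


def toy_score(params):
    # Invented score that peaks at max_depth == 12 (stands in for CV accuracy).
    return 1.0 - abs(params["max_depth"] - 12) / 21


best_score, best_params = random_search(toy_score, param_space, n_iter=10)
print(best_score, best_params)
```

Because only `n_iter` combinations are tried, a randomized search is cheap but may miss the optimum, which is why narrowing the ranges and repeating the search, as done in the following cells, can help.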

In [28]:
param_dist = {"max_depth": sp_randint(10,14),
              "max_features": sp_randint(8, 12),
              "min_samples_split": sp_randint(50, 70),
              "random_state": sp_randint(2, 20),
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 10
random_search = RandomizedSearchCV(tree_clf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)
random_search.fit(x_train, y_train)



RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=DecisionTreeClassifier(class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features=None,
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    presort=False,
                                                    random_state=None,
                                                    splitter='best')

In [29]:
print(random_search.best_score_)
print(random_search.best_params_)

0.7192908400168848
{'criterion': 'entropy', 'max_depth': 12, 'max_features': 9, 'min_samples_split': 62, 'random_state': 16}


In [30]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'max_depth': list(range(9, 11)), 'max_features': list(range(9,11)), 'min_samples_split': list(range(60, 67)), 'random_state': list(range(2, 20))}
]
grid = GridSearchCV(tree_clf, param_grid, cv=5, scoring='accuracy')

grid.fit(x_train, y_train)



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=None,
                                              splitter='best'),
             iid='warn', n_jobs=None,
             param_grid=[{'max_depth': [9, 10], 'max_features': [9, 10],
                          'min_samples_split': [60, 61, 62, 63,

In [31]:
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

0.7243035035880118
{'max_depth': 9, 'max_features': 10, 'min_samples_split': 61, 'random_state': 11}
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,
                       max_features=10, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=61,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=11, splitter='best')


## FINAL MODEL

In [37]:
tree_clf = DecisionTreeClassifier(class_weight=None, criterion='gini',
                                  max_depth=9, max_features=10, max_leaf_nodes=None,
                                  min_impurity_decrease=0.0, min_impurity_split=None,
                                  min_samples_leaf=1, min_samples_split=65,
                                  min_weight_fraction_leaf=0.0, presort='auto',
                                  random_state=13, splitter='best')


tree_clf.fit(x_train,y_train)
y_pred = tree_clf.predict(x_test)

# Cross-Validation

To guard against overfitting the model, we used a technique called cross-validation. This approach does not require a separate validation set. In k-fold CV (short for cross-validation) the training set is split into k smaller sets ("folds"). The model is trained on k-1 of the folds and the remaining fold is used as a validation set. These steps are repeated, each time with a different fold held out for validation, until every fold has been used as a validation set exactly once. We tried it with 3 and 5 folds. That is why, in our project, every call to the cross-validation method returns an array of three or five elements containing accuracy scores or mean squared errors.
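The fold rotation described above can be sketched as follows. This is a simplified illustration with a hypothetical helper `k_fold_indices`; it splits the indices in order without shuffling, whereas `cross_val_score` with an integer `cv` uses a stratified split for classifiers.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every sample index appears in exactly one validation fold, so each sample is used for validation once and for training k-1 times.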

In [39]:
cross_val_score(tree_clf, x_train, y_train, cv=3)



array([0.71896252, 0.72079772, 0.71969577])

In [41]:
print(accuracy_score(y_test, y_pred))

0.7074972300874062


In [42]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df.head(25)

Unnamed: 0,Actual,Predicted
10506,122,122
26313,122,122
2622,34,45
1489,45,45
19949,122,122
24157,122,122
14372,122,122
4424,122,122
8295,26,122
21560,122,122


# RandomForestClassifier

Random forests are an example of **ensemble learning**. Ensemble learning, used here in a supervised setting, involves training multiple models, usually from the same base learner, and combining them to improve the prediction rate. The disadvantage is that this method needs significantly more computation than a single **decision tree**. A random forest can intuitively be thought of as a collection of independently trained decision trees working together to produce a more accurate prediction.

In our program, the **sklearn.ensemble.RandomForestClassifier** is used.
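The two ingredients of this idea, training each tree on a random resample of the data and combining the trees' answers, can be sketched in plain Python. The helper names and the three example votes are hypothetical. Note also that this sketch uses hard majority voting for simplicity, while sklearn's forest averages the class probabilities predicted by the trees.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the resampling step)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the most common class label among the trees' predictions."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(42)
rows = list(range(8))                 # stand-in for training row indices
sample = bootstrap_sample(rows, rng)  # what one tree would be trained on
print(sample)

# Hypothetical predictions of three trees for one game.
tree_votes = [122, 45, 122]
print(majority_vote(tree_votes))
```

Because every tree sees a slightly different sample, their individual errors tend to differ, and combining them usually gives a more stable prediction than any single tree.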

In [84]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=10)

In [85]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint
param_dist = {"max_leaf_nodes": sp_randint(2,100),
              "min_samples_split": sp_randint(2, 100),
              "random_state": sp_randint(2, 100),
              
             }
# run randomized search
n_iter_search = 10
random_search = RandomizedSearchCV(rf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)
random_search.fit(x_train, y_train)



RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators=10,
                                                    n_jobs=None,
  

In [86]:
print(random_search.best_score_)
print(random_search.best_params_)

0.7272055719712959
{'max_leaf_nodes': 71, 'min_samples_split': 3, 'random_state': 93}


In [87]:
param_dist = {"max_leaf_nodes": sp_randint(70,200),
              "min_samples_split": sp_randint(20, 35),
              "random_state": sp_randint(50, 100)
             }
# run randomized search
n_iter_search = 10
random_search = RandomizedSearchCV(rf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5, iid=False)
random_search.fit(x_train, y_train)



RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators=10,
                                                    n_jobs=None,
  

In [88]:
print(random_search.best_score_)
print(random_search.best_params_)

0.7298470864939473
{'max_leaf_nodes': 133, 'min_samples_split': 32, 'random_state': 52}


In [89]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

param_dist = {"max_leaf_nodes": sp_randint(170,200),
              "min_samples_split": sp_randint(20, 36),
              "random_state": sp_randint(80, 98)
             }
# run randomized search
n_iter_search = 10
random_search = RandomizedSearchCV(rf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5, iid=False)
random_search.fit(x_train, y_train)



RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators=10,
                                                    n_jobs=None,
  

In [90]:
print(random_search.best_score_)
print(random_search.best_params_)

0.7302110836420442
{'max_leaf_nodes': 194, 'min_samples_split': 27, 'random_state': 90}


In [91]:
from sklearn.model_selection import GridSearchCV

param_grid = [{'max_leaf_nodes': list(range(184, 188)), 'min_samples_split': list(range(26, 32)), 
               'random_state': list(range(84, 88))
             }]
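# Note: tree_clf (the decision tree) is passed here, so this grid search tunes the
# decision tree rather than rf, as the output below confirms; pass rf to tune the forest.
grid = GridSearchCV(tree_clf, param_grid, cv=5, scoring='accuracy')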

grid.fit(x_train, y_train)



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=9,
                                              max_features=10,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=65,
                                              min_weight_fraction_leaf=0.0,
                                              presort='auto', random_state=13,
                                              splitter='best'),
             iid='warn', n_jobs=None,
             param_grid=[{'max_leaf_nodes': [184, 185, 186, 187],
                          'min_samples_split': [26, 27, 28, 29, 30, 31],
  

In [93]:
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

0.7217180244829042
{'max_leaf_nodes': 184, 'min_samples_split': 26, 'random_state': 84}
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,
                       max_features=10, max_leaf_nodes=184,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=26,
                       min_weight_fraction_leaf=0.0, presort='auto',
                       random_state=84, splitter='best')


## FINAL MODEL

In [97]:
rf = RandomForestClassifier(class_weight=None, criterion='gini',
                       max_depth=9, max_features=10, max_leaf_nodes=180,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=25,
                       min_weight_fraction_leaf=0.0,
                       random_state=72, n_estimators=10)

rf.fit(x_train, y_train)

y_pred = rf.predict(x_test)

In [98]:
cross_val_score(rf, x_train, y_train, cv=3, scoring="accuracy")



array([0.73414518, 0.72712884, 0.72650927])

In [100]:
print(accuracy_score(y_test, y_pred))

0.7137756986335098


In [101]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df.head(25)

Unnamed: 0,Actual,Predicted
10506,122,122
26313,122,122
2622,34,29
1489,45,45
19949,122,122
24157,122,122
14372,122,122
4424,122,122
8295,26,122
21560,122,122


# KNeighborsClassifier

The k-nearest neighbors (KNN) algorithm is a classifier that assigns a data point to a class based on the classes of the nearest data points in the training set: the point is assigned to the class that is most common among its k nearest neighbors. KNN has been used as a non-parametric technique in statistical estimation and pattern recognition since the beginning of the 1970s.

To use this classifier we used the **sklearn.neighbors.KNeighborsClassifier** class. We started by looking for the best parameters for our classifier.
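As a rough illustration of the idea, here is a minimal KNN classifier in plain Python. It is a brute-force sketch under simple assumptions (Euclidean distance, uniform weights), not the sklearn implementation; the function name `knn_predict` and the toy points are made up.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Compute the Euclidean distance to every training point and sort.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.5, 0.5), k=3))
print(knn_predict(points, labels, (5.5, 5.5), k=3))
```

There is no training step beyond storing the data; all the work happens at prediction time, which is why the neighbor search algorithm matters for larger datasets.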

In [130]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

k_range = list(range(15, 31))
param_grid = dict(n_neighbors=k_range)
grid = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')

grid.fit(x_train, y_train)



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=None,
             param_grid={'n_neighbors': [15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
                                         25, 26, 27, 28, 29, 30]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)

In [103]:
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

0.7112705783030815
{'n_neighbors': 29}
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=29, p=2,
                     weights='uniform')


## FINAL MODEL

In [128]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                           metric_params=None, n_neighbors=29, p=2, weights='uniform')
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)

After finding the best parameters we used them in the classifier.
Used parameters:
-  **algorithm**: algorithm used to compute the nearest neighbors.
-  **leaf_size**: leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree.
-  **metric**: the distance metric to use for the tree. The default metric is minkowski, which with p=2 is equivalent to the standard Euclidean metric.
-  **metric_params**: additional keyword arguments for the metric function.
-  **n_jobs**: the number of parallel jobs to run for the neighbors search (left at its default here).
-  **n_neighbors**: number of neighbors to use by default for kneighbors queries.
-  **p**: power parameter for the Minkowski metric.
-  **weights**: weight function used in prediction. We used 'uniform', which means that all points in each neighborhood are weighted equally.
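The relationship between **metric** and **p** can be checked with a short sketch. The function `minkowski` below is written by hand for illustration and is not the sklearn code.

```python
import math


def minkowski(a, b, p=2):
    """Minkowski distance between two points of equal dimension."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = (0, 0), (3, 4)
print(minkowski(a, b, p=2))  # same value as the Euclidean distance
print(math.dist(a, b))       # Euclidean distance from the standard library
print(minkowski(a, b, p=1))  # p=1 gives the Manhattan distance
```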


In [129]:
cross_val_score(knn, x_train, y_train, cv=3, scoring="accuracy")



array([0.09185773, 0.0873227 , 0.09370359])

In [121]:
print(accuracy_score(y_test, y_pred))

0.09026444083321022


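The grid search reported a cross-validated accuracy of about 71% for n_neighbors=29, which is comparable to the tree-based models. The cross-validation and test-set outputs recorded directly above, however, show an accuracy of only about 9%. The cell numbers indicate that these cells were executed out of order, so those two figures may have been computed on a different train/test split than the grid search; they should be re-run before drawing conclusions about the final model.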

In [None]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df.head(25)

# REGRESSION PART

Linear regression is a supervised machine learning algorithm. It predicts the value of a dependent variable (y) based on one or more independent variables (x) by finding a linear relationship between input (x) and output (y). The regression line is the best-fit line for the given data. In linear regression, the relationships are modeled using linear predictor functions whose parameters are estimated from the data.

In this case, we took the **Rating** column as our output; it holds the overall rating of the game given by players. To use this regressor we used the **sklearn.linear_model.LinearRegression** class.
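For a single input feature, the best-fit line can be computed directly with the least-squares formulas. The sketch below is a simplified illustration (one feature, hand-made data, hypothetical function name `fit_line`); the sklearn regressor solves the same kind of problem for many features at once.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```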


## BASIC LINEAR REGRESSOR

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

x,y = steam_lin.loc[:,steam_lin.columns != 'Rating'], steam_lin.loc[:,'Rating']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25,random_state = 42)

In [None]:
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(regr, x_train, y_train,
                         scoring="neg_mean_squared_error", cv=3)
regr_rmse_scores = np.sqrt(-scores)

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [None]:
display_scores(regr_rmse_scores)

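The most interesting value here is the standard deviation of the RMSE across the folds. In our run this value was very close to zero, which means the error is nearly the same on every fold, i.e. the model behaves consistently. The mean RMSE is the value that tells us how large the prediction error actually is.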
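The numbers printed by `display_scores` can be reproduced by hand. In the sketch below the three fold scores are made-up values, not results from our dataset. The scoring `"neg_mean_squared_error"` returns the negated MSE so that a higher score is always better, which is why the sign is flipped before taking the square root.

```python
import math
import statistics

# Made-up negated MSE values for three folds (not actual results).
neg_mse_scores = [-4.0, -9.0, -16.0]

# Flip the sign and take the square root to get the RMSE per fold.
rmse_scores = [math.sqrt(-s) for s in neg_mse_scores]

mean_rmse = statistics.mean(rmse_scores)
# Population standard deviation, matching the default of numpy's std().
std_rmse = statistics.pstdev(rmse_scores)

print("Scores:", rmse_scores)
print("Mean:", mean_rmse)
print("Standard deviation:", std_rmse)
```

A small standard deviation with a large mean would indicate a model that is consistently wrong, so both values should be read together.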

In [None]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df.head(25)

## SVM REGRESSOR

Support Vector Machines are very powerful and versatile machine learning algorithms. They can be used for linear or nonlinear classification, for regression, and for outlier detection. They are especially useful for classifying complex datasets that are not very large. An SVM operates on the principle of finding the widest gap (margin) between the separate categories. Adding further samples away from the margin does not affect it, because the margin is determined only by the samples at its edges, usually called support vectors. SVMs can also be used in regression tasks; this method is called support-vector regression (SVR). In this approach the model depends only on a subset of the training data, because the cost function ignores any training data close to the model prediction.


![svm](svm.png "SVM visualisation")
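The statement that the cost function ignores points close to the prediction refers to the epsilon-insensitive loss used in SVR. The sketch below shows only this loss for a single sample; the function name and the example ratings are hypothetical, and the full SVR objective also contains a regularization term controlled by `C`.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube around the prediction, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# A prediction within epsilon of the true rating costs nothing.
print(epsilon_insensitive_loss(80.0, 80.05, epsilon=0.1))
# A prediction further away is penalized by the distance beyond the tube.
print(epsilon_insensitive_loss(80.0, 82.0, epsilon=0.1))
```

Samples with zero loss do not influence the fitted model, which is why only a subset of the training data ends up as support vectors.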

In [None]:
from sklearn.svm import SVR

svr_rbf = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=.1)
svr_rbf.fit(x_train, y_train)
y_pred = svr_rbf.predict(x_test)

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(svr_rbf, x_train, y_train,
                         scoring="neg_mean_squared_error", cv=3)
svr_rbf_rmse_scores = np.sqrt(-scores)

In [None]:
display_scores(svr_rbf_rmse_scores)

In [None]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df.head(25)

## RandomForestRegressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_regr = RandomForestRegressor(ccp_alpha=0.0, max_depth=9, max_features=10, max_leaf_nodes=180,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=25,
                       min_weight_fraction_leaf=0.0,
                       random_state=72)

rf_regr.fit(x_train, y_train)

y_pred = rf_regr.predict(x_test)

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(rf_regr, x_train, y_train,
                         scoring="neg_mean_squared_error", cv=5)
rf_regr_rmse_scores = np.sqrt(-scores)

In [None]:
display_scores(rf_regr_rmse_scores)

In [None]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df.head(50)