# Xbox Game Recommendations 
![image](https://th.bing.com/th/id/R.bae4480d3b547571607f1e27cf3a418a?rik=LUhcku%2fV%2f4JcQA&riu=http%3a%2f%2fgetwallpapers.com%2fwallpaper%2ffull%2fa%2f8%2f8%2f426305.jpg&ehk=q8Z3BazxlpiQTI2iWIy5S55BlBuCFqYCv16egWz1tic%3d&risl=&pid=ImgRaw&r=0)



The following topics are covered in this tutorial:

- Importing a real-world dataset
- Preparing a dataset for training
- Training and interpreting random forests
- Overfitting & hyperparameter tuning
- Making predictions on single inputs

## Dataset Description
The main data for this competition is in the train.csv and test.csv files. These files contain information on what items users clicked on after making a search.

Each line of train.csv describes a user's click on a single item. It contains the following fields:

- user: A user ID
- sku: The stock-keeping unit (item) that the user clicked on
- category: The category the sku belongs to
- query: The search terms that the user entered
- click_time: Time the sku was clicked on
- query_time: Time the query was run
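
As a quick illustration of this schema, the sketch below builds a one-row frame from the first record shown in `train_data.head()` further down and parses the two timestamp fields. The `delay_s` column is a hypothetical addition, not part of the dataset; it only measures the seconds between running the query and clicking the item.

```python
import pandas as pd

# One record copied from train.csv (first row of train_data.head()).
row = pd.DataFrame({
    "user": ["0001cd0d10bbc585c9ba287c963e00873d4c0bfd"],
    "sku": [2032076],
    "category": ["abcat0701002"],
    "query": ["gears of war"],
    "click_time": ["2011-10-09 17:22:56.101"],
    "query_time": ["2011-10-09 17:21:42.917"],
})

# Parse the timestamp strings into datetime values.
for col in ("click_time", "query_time"):
    row[col] = pd.to_datetime(row[col])

# Hypothetical feature: seconds elapsed between the query and the click.
row["delay_s"] = (row["click_time"] - row["query_time"]).dt.total_seconds()
print(row[["query", "sku", "delay_s"]])
```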

## Import Libraries
Let's install and import some required libraries before we begin.

In [1]:
# !pip install nltk

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import re
import nltk
nltk.download('stopwords') 
nltk.download('wordnet')
nltk.download('words')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score
import warnings

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading wordnet: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading words: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading omw-1.4: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


In [3]:
train_data = pd.read_csv('Xbox/train.csv')
test_data = pd.read_csv('Xbox/test.csv')

In [4]:
train_data.head()

Unnamed: 0,user,sku,category,query,click_time,query_time
0,0001cd0d10bbc585c9ba287c963e00873d4c0bfd,2032076,abcat0701002,gears of war,2011-10-09 17:22:56.101,2011-10-09 17:21:42.917
1,00033dbced6acd3626c4b56ff5c55b8d69911681,9854804,abcat0701002,Gears of war,2011-09-25 13:35:42.198,2011-09-25 13:35:33.234
2,00033dbced6acd3626c4b56ff5c55b8d69911681,2670133,abcat0701002,Gears of war,2011-09-25 13:36:08.668,2011-09-25 13:35:33.234
3,00033dbced6acd3626c4b56ff5c55b8d69911681,9984142,abcat0701002,Assassin creed,2011-09-25 13:37:23.709,2011-09-25 13:37:00.049
4,0007756f015345450f7be1df33695421466b7ce4,2541184,abcat0701002,dead island,2011-09-11 15:15:34.336,2011-09-11 15:15:26.206


In [5]:
train_data['sku'].nunique()

413

## Prepare the Dataset for Training

Before we can train the model, we need to prepare the dataset.

In [6]:
train_data['query'] = [re.findall(r'\w+', i.lower()) for i in train_data['query'].fillna('NONE')]

In [7]:
train_data

Unnamed: 0,user,sku,category,query,click_time,query_time
0,0001cd0d10bbc585c9ba287c963e00873d4c0bfd,2032076,abcat0701002,"[gears, of, war]",2011-10-09 17:22:56.101,2011-10-09 17:21:42.917
1,00033dbced6acd3626c4b56ff5c55b8d69911681,9854804,abcat0701002,"[gears, of, war]",2011-09-25 13:35:42.198,2011-09-25 13:35:33.234
2,00033dbced6acd3626c4b56ff5c55b8d69911681,2670133,abcat0701002,"[gears, of, war]",2011-09-25 13:36:08.668,2011-09-25 13:35:33.234
3,00033dbced6acd3626c4b56ff5c55b8d69911681,9984142,abcat0701002,"[assassin, creed]",2011-09-25 13:37:23.709,2011-09-25 13:37:00.049
4,0007756f015345450f7be1df33695421466b7ce4,2541184,abcat0701002,"[dead, island]",2011-09-11 15:15:34.336,2011-09-11 15:15:26.206
...,...,...,...,...,...,...
42360,fff95d849a4d9c9946e081459471adf4a7192d79,2670133,abcat0701002,"[modern, warfare, 3]",2011-09-27 22:53:29.344,2011-09-27 22:53:04.644
42361,fffa393d127dec90b7eae4718535bd16be3b394d,2173065,abcat0701002,[batman],2011-10-14 12:44:14.669,2011-10-14 12:44:07.004
42362,fffa393d127dec90b7eae4718535bd16be3b394d,3046603,abcat0701002,[batman],2011-10-14 12:44:31.228,2011-10-14 12:44:07.004
42363,fffd288ec29a96dbac7356bcda0a1e9f88255a5b,2340293,abcat0701002,"[360, games]",2011-10-10 08:46:10.368,2011-10-10 08:43:56.768


In [8]:
stopwords_eng = stopwords.words('english')  
new_titles_sub = []
for title_sub in train_data['query']:
    new_title_sub = []
    for w_title in title_sub:
        if w_title not in stopwords_eng and not w_title.isdigit():
            new_title_sub.append(w_title)
    
    new_titles_sub.append(new_title_sub) 
    
train_data['new_query'] = new_titles_sub

In [9]:
train_data

Unnamed: 0,user,sku,category,query,click_time,query_time,new_query
0,0001cd0d10bbc585c9ba287c963e00873d4c0bfd,2032076,abcat0701002,"[gears, of, war]",2011-10-09 17:22:56.101,2011-10-09 17:21:42.917,"[gears, war]"
1,00033dbced6acd3626c4b56ff5c55b8d69911681,9854804,abcat0701002,"[gears, of, war]",2011-09-25 13:35:42.198,2011-09-25 13:35:33.234,"[gears, war]"
2,00033dbced6acd3626c4b56ff5c55b8d69911681,2670133,abcat0701002,"[gears, of, war]",2011-09-25 13:36:08.668,2011-09-25 13:35:33.234,"[gears, war]"
3,00033dbced6acd3626c4b56ff5c55b8d69911681,9984142,abcat0701002,"[assassin, creed]",2011-09-25 13:37:23.709,2011-09-25 13:37:00.049,"[assassin, creed]"
4,0007756f015345450f7be1df33695421466b7ce4,2541184,abcat0701002,"[dead, island]",2011-09-11 15:15:34.336,2011-09-11 15:15:26.206,"[dead, island]"
...,...,...,...,...,...,...,...
42360,fff95d849a4d9c9946e081459471adf4a7192d79,2670133,abcat0701002,"[modern, warfare, 3]",2011-09-27 22:53:29.344,2011-09-27 22:53:04.644,"[modern, warfare]"
42361,fffa393d127dec90b7eae4718535bd16be3b394d,2173065,abcat0701002,[batman],2011-10-14 12:44:14.669,2011-10-14 12:44:07.004,[batman]
42362,fffa393d127dec90b7eae4718535bd16be3b394d,3046603,abcat0701002,[batman],2011-10-14 12:44:31.228,2011-10-14 12:44:07.004,[batman]
42363,fffd288ec29a96dbac7356bcda0a1e9f88255a5b,2340293,abcat0701002,"[360, games]",2011-10-10 08:46:10.368,2011-10-10 08:43:56.768,[games]
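
The two cleaning steps so far (tokenizing with `\w+`, then dropping stopwords and digit-only tokens) can be tried on a single string. This is only a sketch: `tokenize_and_filter` is a hypothetical helper, and the small `STOPWORDS` set is a stand-in for NLTK's English list so that the snippet runs without downloading any corpora.

```python
import re

# Stand-in for stopwords.words('english'); an assumption to keep this offline.
STOPWORDS = {"of", "the", "a", "an", "and", "for", "in", "to"}

def tokenize_and_filter(query):
    """Lowercase, split on non-word characters, drop stopwords and digit-only tokens."""
    tokens = re.findall(r"\w+", query.lower())
    return [t for t in tokens if t not in STOPWORDS and not t.isdigit()]

print(tokenize_and_filter("Gears of War 3"))
print(tokenize_and_filter("360 games"))
```

Note that dropping digit-only tokens makes queries such as `modern warfare 3` and `modern warfare 2` indistinguishable, which may matter when sequels are sold as separate SKUs.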


In [10]:
wordnet_lemmatizer = WordNetLemmatizer()
new_titles_sub = []
for title_sub in train_data['new_query']:
    new_title_sub = []
    for w_title in title_sub:
        new_title_sub.append(wordnet_lemmatizer.lemmatize(w_title, pos="v"))
    new_titles_sub.append(new_title_sub) 
    
train_data['new_query'] = new_titles_sub
train_data['new_query'] = [' '.join(i) for i in train_data['new_query']]

In [11]:
train_data

Unnamed: 0,user,sku,category,query,click_time,query_time,new_query
0,0001cd0d10bbc585c9ba287c963e00873d4c0bfd,2032076,abcat0701002,"[gears, of, war]",2011-10-09 17:22:56.101,2011-10-09 17:21:42.917,gear war
1,00033dbced6acd3626c4b56ff5c55b8d69911681,9854804,abcat0701002,"[gears, of, war]",2011-09-25 13:35:42.198,2011-09-25 13:35:33.234,gear war
2,00033dbced6acd3626c4b56ff5c55b8d69911681,2670133,abcat0701002,"[gears, of, war]",2011-09-25 13:36:08.668,2011-09-25 13:35:33.234,gear war
3,00033dbced6acd3626c4b56ff5c55b8d69911681,9984142,abcat0701002,"[assassin, creed]",2011-09-25 13:37:23.709,2011-09-25 13:37:00.049,assassin creed
4,0007756f015345450f7be1df33695421466b7ce4,2541184,abcat0701002,"[dead, island]",2011-09-11 15:15:34.336,2011-09-11 15:15:26.206,dead island
...,...,...,...,...,...,...,...
42360,fff95d849a4d9c9946e081459471adf4a7192d79,2670133,abcat0701002,"[modern, warfare, 3]",2011-09-27 22:53:29.344,2011-09-27 22:53:04.644,modern warfare
42361,fffa393d127dec90b7eae4718535bd16be3b394d,2173065,abcat0701002,[batman],2011-10-14 12:44:14.669,2011-10-14 12:44:07.004,batman
42362,fffa393d127dec90b7eae4718535bd16be3b394d,3046603,abcat0701002,[batman],2011-10-14 12:44:31.228,2011-10-14 12:44:07.004,batman
42363,fffd288ec29a96dbac7356bcda0a1e9f88255a5b,2340293,abcat0701002,"[360, games]",2011-10-10 08:46:10.368,2011-10-10 08:43:56.768,game


In [14]:
train_data2 = train_data.drop(['category' , 'query_time' , 'click_time' , 'user','query'], axis=1)

In [15]:
train_data2.head()

Unnamed: 0,sku,new_query
0,2032076,gear war
1,9854804,gear war
2,2670133,gear war
3,9984142,assassin creed
4,2541184,dead island


In [16]:
from sklearn.model_selection import train_test_split

In [17]:
train, test = train_test_split(train_data2, test_size=0.2)

In [18]:
train_features = train['new_query']
test_features = test['new_query']

train_labels = train['sku']
test_labels = test['sku']

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
train_features = vectorizer.fit_transform(train_features)
test_features = vectorizer.transform(test_features)
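
To see what the vectorizer produces, here is a small sketch on a made-up corpus (the four queries are illustrative, not taken from the dataset). `sublinear_tf=True` replaces the raw term count `tf` with `1 + log(tf)`, and `max_df=0.5` drops terms that appear in more than half of the documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative queries; "xbox" occurs in 3 of the 4 documents (more than 50%).
corpus = ["xbox gear war", "xbox dead island", "xbox batman", "fifa"]

vec = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
X = vec.fit_transform(corpus)

vocab = sorted(vec.vocabulary_)
print(vocab)      # "xbox" is absent because of max_df
print(X.shape)    # (number of documents, vocabulary size)
```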

## Training a Random Forest

While tuning the hyperparameters of a single decision tree may lead to some improvements, a much more effective strategy is to combine the results of several decision trees, each trained on a different random sample of the rows and a random subset of the features. This is called a random forest model.

The key idea here is that each decision tree in the forest will make different kinds of errors, and upon averaging, many of their errors will cancel out. This idea is also commonly known as the "wisdom of the crowd".

A random forest works by averaging/combining the results of several decision trees:

<img src="https://1.bp.blogspot.com/-Ax59WK4DE8w/YK6o9bt_9jI/AAAAAAAAEQA/9KbBf9cdL6kOFkJnU39aUn4m8ydThPenwCLcBGAsYHQ/s0/Random%2BForest%2B03.gif" width="640">


We'll use the `RandomForestClassifier` class from `sklearn.ensemble`.

This general technique of combining the results of many models is called "ensembling". It works because most errors of individual models cancel out on averaging. Here's what it looks like visually:

<img src="https://i.imgur.com/qJo8D8b.png" width="640">
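
The error-cancelling argument can be checked with a small simulation. The sketch below assumes a binary problem and 101 classifiers that are each right 60% of the time and make independent errors (an idealization; real trees are correlated, so the gain is smaller), and compares a single classifier with the majority vote.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_cases, p_correct = 101, 10_000, 0.6

# correct[i, j] is True when model j classifies case i correctly.
correct = rng.random((n_cases, n_models)) < p_correct

single_acc = correct[:, 0].mean()
# The majority vote is right when more than half of the models are right.
vote_acc = (correct.sum(axis=1) > n_models / 2).mean()

print(f"single model:  {single_acc:.3f}")
print(f"majority vote: {vote_acc:.3f}")
```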


We can also look at the probabilities for the predictions. The probability of a class is essentially the fraction of trees that predicted the given class.
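
More precisely, scikit-learn averages the class probabilities predicted by each tree; when the trees are fully grown, each tree usually puts all of its weight on one class, so the average is the fraction of trees voting for that class. The sketch below checks this on synthetic data generated with `make_classification` (not the Xbox dataset) by comparing `predict_proba` with the mean over the trees in `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class problem, used only for illustration.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Average the class probabilities predicted by each individual tree.
per_tree = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

print(np.allclose(per_tree, forest.predict_proba(X)))
print(forest.classes_)   # column order of predict_proba
```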

In [21]:
from sklearn.ensemble import RandomForestClassifier

In [22]:
model = RandomForestClassifier(n_estimators=20)

In [23]:
model.fit(train_features, train_labels)
labels = model.predict(test_features)

In [24]:
accuracy_rf = accuracy_score(test_labels, labels)
scores = cross_val_score(model, train_features, train_labels, cv=5)

In [25]:
print("accuracy_RF " + str(accuracy_rf))
print('validation scores: ', scores)

accuracy_RF 0.6263424997049452
validation scores:  [0.62546098 0.62929636 0.63174978 0.6379463  0.63086456]


## Bootstrapping with Random Forests


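When bootstrapping is enabled, you can also control the number or fraction of rows drawn for each bootstrap sample using `max_samples`. Smaller samples make the trees less correlated, which can help the model generalize. The grid search below leaves `max_samples` at its default and instead tunes `max_depth`, `min_samples_leaf`, and `min_samples_split`.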

<img src="https://i.imgur.com/rsdrL1W.png" width="640">
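
A bootstrap sample draws rows with replacement, so some rows appear several times and others not at all. The sketch below illustrates this with NumPy and then shows how `max_samples` could be passed to `RandomForestClassifier`; the value `0.5` is an arbitrary choice for illustration, not a setting used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_rows = 10_000

# Draw n_rows indices with replacement, as a full-size bootstrap does.
sample = rng.integers(0, n_rows, size=n_rows)
unique_frac = np.unique(sample).size / n_rows
print(f"unique rows in one bootstrap: {unique_frac:.3f}")  # about 1 - 1/e

# With max_samples=0.5, each tree is fit on a bootstrap of half the rows.
forest = RandomForestClassifier(n_estimators=20, bootstrap=True, max_samples=0.5)
print(forest.get_params()["max_samples"])
```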

In [26]:
parameters = {'max_depth': range(9, 20, 2),
              'min_samples_leaf': range(30, 101, 20),
              'min_samples_split': range(10, 101, 10)}
grid_search_cv_clf = GridSearchCV(model, parameters, cv=5)
grid_search_cv_clf.fit(train_features, train_labels)
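
Before launching a search like this, it can help to count how many models it will fit. The sketch below uses `ParameterGrid` on the same parameter ranges; with `cv=5`, every combination is fitted five times, plus one final refit of the best estimator on the full training set.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'max_depth': range(9, 20, 2),
    'min_samples_leaf': range(30, 101, 20),
    'min_samples_split': range(10, 101, 10),
}

n_combinations = len(ParameterGrid(parameters))
n_fits = n_combinations * 5 + 1   # 5 folds per combination, plus the refit
print(n_combinations, n_fits)
```

That is 240 combinations and 1,201 fits. Many of them are likely redundant: a node can only be split if both children keep at least `min_samples_leaf` rows, so with `min_samples_leaf=30` any `min_samples_split` of 60 or less has no additional effect.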



In [27]:
best_clf = grid_search_cv_clf.best_estimator_

In [28]:
nlabels = best_clf.predict(test_features)

In [30]:
accuracy_gs = accuracy_score(test_labels, nlabels)
nscores = cross_val_score(best_clf, train_features, train_labels, cv=5)



In [31]:
best_clf.score(test_features, test_labels)

0.46205594240528736

The tuned model scores about 0.46 on the held-out split, noticeably lower than the 0.63 of the untuned forest. A likely reason is that `min_samples_leaf` starts at 30 and `max_depth` stops at 19, which is restrictive for a problem with several hundred classes, many of them rare. The remaining cells continue with the original `model`.

In [32]:
labels

array([2541184, 2945052, 2704058, ..., 2670133, 2107458, 2633103],
      dtype=int64)

In [33]:
prob = model.predict_proba(test_features)
prob

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [34]:
test_features.shape

(8473, 2162)

In [35]:
prob.shape

(8473, 405)
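
Each of the 405 columns of `prob` corresponds to one SKU seen during training, in the order given by `model.classes_` (405 rather than 413 because the training split contains only 405 of the SKUs). A sketch of how the most probable SKUs per row could be read off directly with NumPy, using a made-up probability matrix and hypothetical class labels:

```python
import numpy as np

def top_k_labels(prob, classes, k=5):
    """Return, for each row of prob, the k class labels with the highest probability."""
    # argsort is ascending, so take the last k columns and reverse them.
    order = np.argsort(prob, axis=1)[:, -k:][:, ::-1]
    return np.asarray(classes)[order]

# Made-up example: 2 rows, 4 hypothetical SKUs.
classes = np.array([101, 202, 303, 404])
prob = np.array([[0.1, 0.6, 0.0, 0.3],
                 [0.5, 0.2, 0.3, 0.0]])

print(top_k_labels(prob, classes, k=2))
```

With the fitted forest, the equivalent call would be `top_k_labels(prob, model.classes_)`.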

In [36]:
train_sku = train_labels.unique()

In [37]:
train_sku = list(train_sku)

In [38]:
train_sku.sort()

In [39]:
train_sku

[1004622,
 1010544,
 1011067,
 1011491,
 1011831,
 1012721,
 1012876,
 1013666,
 1014064,
 1032361,
 1052221,
 1066233,
 1066515,
 1066551,
 1067848,
 1067948,
 1078792,
 1092494,
 1094401,
 1121355,
 1142357,
 1154546,
 1161734,
 1170735,
 1179927,
 1179963,
 1180061,
 1180104,
 1182175,
 1184298,
 1199284,
 1207275,
 1208344,
 1208468,
 1208486,
 1208753,
 1228939,
 1228993,
 1251132,
 1309041,
 1315467,
 1315528,
 1318437,
 1324987,
 1331217,
 1338007,
 1338089,
 1404133,
 1404415,
 1431386,
 1440189,
 1443131,
 1450317,
 1450556,
 1470129,
 1470147,
 1475036,
 1493444,
 1493639,
 1508787,
 1511584,
 1515341,
 1535584,
 1535718,
 1563392,
 1563461,
 1579138,
 1580049,
 1617038,
 1685052,
 1775051,
 1776209,
 1778076,
 1778119,
 1802089,
 1807118,
 1808056,
 1814093,
 1816073,
 1930151,
 1953203,
 1972826,
 1973042,
 1974315,
 1981099,
 1981345,
 1989198,
 1991044,
 1997154,
 2010435,
 2011559,
 2032076,
 2035692,
 2051131,
 2075383,
 2078113,
 2095189,
 2106238,
 2107458,
 2138219,


In [40]:
# The columns of prob follow model.classes_, i.e. the sorted training SKUs.
pred_df = pd.DataFrame(prob, columns=train_sku)
pred_df.head()

Unnamed: 0,1004622,1010544,1011067,1011491,1011831,1012721,1012876,1013666,1014064,1032361,...,9955514,9956073,9959853,9963729,9967476,9976899,9977237,9980886,9984142,9999169100050027
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.074714,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
pred_df = pred_df.transpose()

In [42]:
test_labels = [str(i) for i in test_labels]

In [43]:
pred_df.columns = test_labels
pred_df

Unnamed: 0,2541184,9276473,2704058,7294661,9902347,2107458,9540428,1228939,2467129,3650081,...,2671044,1066551,9276473.1,1011491,9854804,2173065,2202037,2670133,2107458.1,2633103
1004622,0.0,0.0,0.000000,0.0,0.0,0.001618,0.0,0.00000,0.0,0.004785,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
1010544,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.00000,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
1011067,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.00000,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
1011491,0.0,0.0,0.074714,0.0,0.0,0.000818,0.0,0.00000,0.0,0.000000,...,0.0,0.0,0.0,0.074714,0.0,0.0,0.0,0.0,0.000000,0.0
1011831,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.00000,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9976899,0.0,0.0,0.000000,0.0,0.0,0.001961,0.0,0.00376,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000531,0.0
9977237,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.00000,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
9980886,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.00000,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
9984142,0.0,0.0,0.000000,0.0,0.0,0.003612,0.0,0.00000,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0


In [44]:
test = test.reset_index(drop=True)

## Metric: MAP@5 (Mean Average Precision) 

MAP treats recommendation as a ranking task: the items a user is most likely to want should be recommended first. The 5 means that only the top five recommended items are evaluated. MAP ranges from 0 to 1, and higher is better.
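
Because every row in this dataset has exactly one relevant item (the SKU that was clicked), the average precision for a row reduces to `1/rank` if the clicked SKU is among the top 5 predictions and 0 otherwise. A minimal sketch under that assumption; `map_at_k` is a hypothetical helper name:

```python
def map_at_k(actual, predicted, k=5):
    """Mean average precision at k when each query has a single relevant item."""
    total = 0.0
    for truth, ranked in zip(actual, predicted):
        top = list(ranked)[:k]
        if truth in top:
            total += 1.0 / (top.index(truth) + 1)   # reciprocal of the 1-based rank
    return total / len(actual)

# Three toy queries: hit at rank 1, hit at rank 2, and a miss.
actual = [10, 20, 30]
predicted = [[10, 11, 12], [21, 20, 22], [31, 32, 33]]
print(map_at_k(actual, predicted))   # (1 + 1/2 + 0) / 3 = 0.5
```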

In [58]:
import warnings
warnings.filterwarnings('ignore')

In [59]:
# For each test row, keep the 5 SKUs with the highest predicted probability.
top5_skus = []
for i in range(len(test)):
    # Column i of pred_df holds the class probabilities for test row i.
    top5 = pred_df.iloc[:, i].nlargest(5)
    top5_skus.append(list(top5.index))
test['top5_SKU'] = top5_skus

In [60]:
test.head()

Unnamed: 0,sku,new_query,top5_SKU
0,2541184,dead island,"[2541184, 3046603, 2670133, 2945052, 2095189]"
1,9276473,modern warfare,"[2945052, 2670133, 9276473, 2107458, 2833031]"
2,2704058,fifa,"[2704058, 1011491, 2670133, 9854786, 1338089]"
3,7294661,halo,"[2856517, 7294661, 9254111, 2856544, 9713872]"
4,9902347,black ops,"[9902347, 2833031, 9739989, 2202037, 9854804]"


In [61]:
test['top5_SKU'].replace('', np.nan, inplace=True)
test = test.dropna()

In [62]:
# Each row has one relevant SKU, so its average precision is 1/rank (0 if missed).
ap = 0
for i in range(len(test)):
    lst = list(test['top5_SKU'][i])
    if test['sku'][i] in lst:
        ap += 1 / (lst.index(test['sku'][i]) + 1)
print(ap/len(test))

0.7360386416401008
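
To get recommendations for a single new query, the same steps can be chained: vectorize the cleaned query, call `predict_proba`, and keep the most probable SKUs. The sketch below does this with a scikit-learn pipeline on a handful of made-up queries and SKU ids; `recommend` is a hypothetical helper, and the query is assumed to be cleaned the same way as `new_query`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up training data: cleaned queries and the SKU ids that were clicked.
queries = ["gear war", "gear war", "dead island", "batman", "batman", "fifa"]
skus = [1, 1, 2, 3, 3, 4]

pipe = make_pipeline(TfidfVectorizer(),
                     RandomForestClassifier(n_estimators=20, random_state=0))
pipe.fit(queries, skus)

def recommend(query, k=3):
    """Return the k most probable SKUs for one cleaned query string."""
    prob = pipe.predict_proba([query])[0]
    best = np.argsort(prob)[::-1][:k]
    # pipe[-1] is the fitted forest; its classes_ give the column order of prob.
    return [int(s) for s in pipe[-1].classes_[best]]

print(recommend("batman"))
```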


## Summary and References

The following topics were covered in this tutorial:

- Importing a real-world dataset
- Preparing a dataset for training
- Training and interpreting random forests
- Overfitting & hyperparameter tuning
- Making predictions on single inputs



We also introduced the following terms:

* Random forest
* Hyperparameter tuning
* Vectorization
* Ensembling
* Generalization
* Bootstrapping
* Mean Average Precision


Check out the following resources to learn more: 

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- https://www.kaggle.com/c/acm-sf-chapter-hackathon-small/data
- https://www.kaggle.com/c/home-credit-default-risk/discussion/64821
- https://towardsdatascience.com/recommender-systems-and-hyper-parameter-tuning-25567b10e298
