# Data Understanding and Data Visualization


#### Objectives


The objectives of this notebook are:

-   Explore the data to identify the characteristics (features) that are useful for building our Recommender System.
-   Prepare the data and convert it to a more suitable format.
-   Visualise the data to understand the relationships between its variables.


#### Table of contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#ref1">Data Extraction</a></li>
        <li><a href="#ref2">Load Data</a></li>
        <li><a href="#ref3">Data Cleaning</a></li>
        <li><a href="#ref4">Data Engineering</a></li>
        <li><a href="#ref5">Data Interpretation</a></li>
    </ol>
</div>
<br>


<a id="ref1"></a>

## 1. Data Extraction

This analysis uses two datasets: a games dataset, which contains board games and their main characteristics, and a users dataset, which records the games each user owns and the ratings that user gave them.

The games dataset was obtained from BoardGameGeek ([BGG](https://boardgamegeek.com/)). It was downloaded on 26/03/2021.


The users dataset was acquired by web scraping a page associated with BGG. The scraping produced one file per user, each containing the data for that user only.



<a id="ref2"></a>

## 2. Load Data

#### Import libraries

In [2]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns

pd.options.display.max_columns = None

### Board Games data

##### Read data

In [5]:
path = '/Users/postigo/Google Drive/BoardGamesData'  # Update after creating "Prepare the Env"
file = os.path.join(path, "bgg_GameItem.csv")

In [15]:
bg=pd.read_csv(file, low_memory=False)
bg.head()

Unnamed: 0,bgg_id,name,year,game_type,designer,artist,publisher,min_players,max_players,min_players_rec,max_players_rec,min_players_best,max_players_best,min_age,min_age_rec,min_time,max_time,category,mechanic,cooperative,compilation,compilation_of,family,implementation,integration,rank,num_votes,avg_rating,stddev_rating,bayes_rating,complexity,language_dependency,bga_id,dbpedia_id,luding_id,spielen_id,wikidata_id,wikipedia_id
0,1,Die Macher,1986.0,5497.0,1,125174959,1332272615108392491165253828147,3.0,5.0,4.0,5.0,5.0,5.0,14.0,14.03125,240.0,240.0,102110261001,291620802012207220402020,0,0,,106433411691,,,286.0,5224,7.62849,1.57747,7.13389,4.3245,1.166667,,,,,,
1,2,Dragonmaster,1981.0,5497.0,8384,12424,6420,3.0,4.0,3.0,4.0,3.0,4.0,12.0,,30.0,30.0,10021010,2009,0,0,,7005,2174.0,,3718.0,553,6.63055,1.44269,5.79353,1.963,,,,,,,
2,3,Samurai,1998.0,5497.0,2,11883,"17,133,267,29,7340,7335,41,2973,4617,1391,8291...",2.0,4.0,2.0,4.0,3.0,3.0,10.0,9.793103,30.0,60.0,10091035,208020402026284620042002,0,0,,10634601114228732,,,209.0,14736,7.45062,1.18523,7.24469,2.4885,1.0,,,,,,
3,4,Tal der Könige,1992.0,,8008,2277,37,2.0,4.0,2.0,4.0,2.0,4.0,12.0,,60.0,60.0,1050,2001208020122004,0,0,,64229647111505,,,4951.0,339,6.59888,1.23291,5.69032,2.6667,,,,,,,
4,5,Acquire,1964.0,5497.0,4,1265818317,925487130828582962539246683846227107,2.0,6.0,3.0,6.0,4.0,4.0,12.0,11.735294,90.0,90.0,10211086,20402910290029112940200520022874,0,0,,4891,,,276.0,18189,7.33994,1.33515,7.15158,2.5041,1.090278,,,,,,


In [16]:
print('Number of rows and columns:', bg.shape)

Number of rows and columns: (100052, 38)


In [17]:
bg.columns

Index(['bgg_id', 'name', 'year', 'game_type', 'designer', 'artist',
       'publisher', 'min_players', 'max_players', 'min_players_rec',
       'max_players_rec', 'min_players_best', 'max_players_best', 'min_age',
       'min_age_rec', 'min_time', 'max_time', 'category', 'mechanic',
       'cooperative', 'compilation', 'compilation_of', 'family',
       'implementation', 'integration', 'rank', 'num_votes', 'avg_rating',
       'stddev_rating', 'bayes_rating', 'complexity', 'language_dependency',
       'bga_id', 'dbpedia_id', 'luding_id', 'spielen_id', 'wikidata_id',
       'wikipedia_id'],
      dtype='object')

In [18]:
bg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100052 entries, 0 to 100051
Data columns (total 38 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   bgg_id               100052 non-null  int64  
 1   name                 100052 non-null  object 
 2   year                 90954 non-null   float64
 3   game_type            21698 non-null   object 
 4   designer             85454 non-null   object 
 5   artist               40876 non-null   object 
 6   publisher            100038 non-null  object 
 7   min_players          98213 non-null   float64
 8   max_players          94686 non-null   float64
 9   min_players_rec      98213 non-null   float64
 10  max_players_rec      94686 non-null   float64
 11  min_players_best     98213 non-null   float64
 12  max_players_best     94686 non-null   float64
 13  min_age              77999 non-null   float64
 14  min_age_rec          906 non-null     float64
 15  min_time         
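
The non-null counts reported by `bg.info()` show that several columns, such as `game_type`, `artist` and `min_age_rec`, are sparsely populated. As a complement, the following minimal sketch ranks columns by their share of missing values. The helper name `missing_summary` and the toy frame are assumptions made for illustration and are not part of the original notebook; in practice the function would be called on `bg`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, most incomplete first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(df)).round(2),
    })
    return summary.sort_values("pct_missing", ascending=False)


# Toy frame standing in for `bg` (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "bgg_id": [1, 2, 3, 4],
    "year": [1986.0, np.nan, 1998.0, 1992.0],
    "min_age_rec": [14.03, np.nan, np.nan, np.nan],
})
print(missing_summary(toy))
```

Calling `missing_summary(bg)` would give one row per column, which can help decide which features are complete enough to keep for the Recommender System.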

### Users data

##### Read data

In [5]:
#act_dir= os.getcwd()
%cd /users/postigo/Google Drive/BoardGamesData/users

/Users/postigo/Google Drive/BoardGamesData/users


In [7]:
# Count the files in the directory (one file per user)
!ls | wc -l

    2853


In [8]:
# Count the lines in each file and save the result to countfile.txt
!ls | xargs wc -l > countfile.txt

In [9]:
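# Read the line counts written by `wc -l`; the 'files' column holds the number of lines per file
# Note: `error_bad_lines` was removed in pandas 2.0; `on_bad_lines="skip"` replaces it from pandas 1.3 onwards
df_countfile = pd.read_csv("countfile.txt", sep=" ", header=None, usecols=[5, 6], names=['files', 'namefile'], on_bad_lines="skip")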
df_countfile.head()

Unnamed: 0,files,namefile
0,545.0,0.csv
1,489.0,1.csv
2,,47
3,552.0,100.csv
4,109.0,1000.csv
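
Row 2 of this output (`NaN`, `47`) suggests that parsing the space-padded output of `wc -l` with `sep=" "` is fragile, because the number of padding spaces changes with the width of each count. A possible alternative, sketched below, counts the lines of each file directly in Python. The function name `count_lines_per_file`, the column name `n_lines` and the temporary demo files are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path
import tempfile

import pandas as pd


def count_lines_per_file(directory, pattern="*.csv"):
    """Return one row per matching file with its name and number of lines."""
    rows = []
    for csv_path in sorted(Path(directory).glob(pattern)):
        # Binary mode avoids decoding errors; iterating a file yields one item per line
        with open(csv_path, "rb") as fh:
            n_lines = sum(1 for _ in fh)
        rows.append({"namefile": csv_path.name, "n_lines": n_lines})
    return pd.DataFrame(rows, columns=["namefile", "n_lines"])


# Demo on two small temporary files (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "0.csv").write_text("Game,Plays\nCodenames,68\n")
    Path(tmp, "1.csv").write_text("Game,Plays\n")
    demo_counts = count_lines_per_file(tmp)

print(demo_counts)
```

Unlike `wc -l`, which counts newline characters, this sketch also counts a final line that lacks a trailing newline, so the two results can differ by one for such files.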


In [17]:
# Open a user file and view the information it contains
path= "/Users/postigo/Documents/20200917_Repaso/users2"

user237= pd.read_csv("237.csv")
user237

Unnamed: 0.1,Unnamed: 0,Game,Plays,BGG Rank,BGG Rating,Your Rating,Users Rating,Utilisation
0,0,Magic: The Gathering,182,158,7.5,10.0,32365,100.0%
1,1,Codenames,68,93,7.6,10.0,67424,99.9%
2,2,Mottainai,47,937,7.0,10.0,3486,99.9%
3,3,Innovation,44,334,7.2,10.0,14811,99.9%
4,4,Hansa Teutonica,40,139,7.6,10.0,11425,99.9%
...,...,...,...,...,...,...,...,...
212,212,Age of Steam Expansion: Germany & France,0,-1,7.9,-1.0,89,0.0%
213,213,7 Wonders Duel: Pantheon,0,-1,8.0,-1.0,8984,0.0%
214,214,1859,0,-1,6.9,8.0,25,0.0%
215,215,1844/1854,0,-1,8.0,-1.0,359,0.0%


In [15]:
user237.columns

Index(['Unnamed: 0', 'Game', 'Plays', 'BGG Rank', 'BGG Rating', 'Your Rating',
       'Users Rating', 'Utilisation'],
      dtype='object')

In [16]:
user237.shape

(217, 8)
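
Because the scraping produced one file per user, a later step will probably need all of them in a single table. The sketch below shows one way this could be done, assuming every file shares the layout of `237.csv` and that the file name (without extension) can serve as a user identifier. The function name `load_all_users`, the `user_id` column and the temporary demo files are illustrative assumptions, not part of the original notebook.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_all_users(directory):
    """Concatenate every per-user CSV in `directory`, tagging rows with the file stem."""
    frames = []
    for csv_path in sorted(Path(directory).glob("*.csv")):
        # index_col=0 absorbs the saved index, which otherwise appears as 'Unnamed: 0'
        df = pd.read_csv(csv_path, index_col=0)
        df.insert(0, "user_id", csv_path.stem)  # assumed identifier: the file name
        frames.append(df)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny files written the same way (index saved as first column)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Game": ["Codenames", "Innovation"], "Plays": [68, 44]}).to_csv(Path(tmp, "237.csv"))
    pd.DataFrame({"Game": ["Mottainai"], "Plays": [47]}).to_csv(Path(tmp, "12.csv"))
    all_users = load_all_users(tmp)

print(all_users)
```

Keeping `user_id` as a string avoids assuming that every file name is numeric; it can be converted later if that turns out to be the case.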