# Data preprocessing

The purpose of this notebook is to convert the data into a format suitable for Spark. We start with a pseudo-JSON file from (add_source).
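
As an illustration of what "pseudo JSON" can mean here: assuming each line of the file is a Python-style dictionary literal with single-quoted strings (an assumption, since the file itself is not shown), the standard `json` module rejects it while a more lenient parser accepts it. The sketch below uses the standard-library `ast.literal_eval` as a stand-in for `demjson.decode`; the sample record and its values are made up.

```python
import ast
import json

# Hypothetical record in the assumed style of the dataset:
# single-quoted keys and strings, which strict JSON does not allow.
sample_line = (
    "{'steam_id': '123', 'items_count': 1, "
    "'items': [{'item_id': '10', 'item_name': 'Example Game', "
    "'playtime_forever': 6, 'playtime_2weeks': 0}]}"
)

# Strict JSON parsing fails on single quotes.
try:
    json.loads(sample_line)
    strict_json_ok = True
except json.JSONDecodeError:
    strict_json_ok = False

# A Python literal parser accepts the same text and returns a dict.
record = ast.literal_eval(sample_line)
print(strict_json_ok, record['items'][0]['item_name'])
```

`ast.literal_eval` only evaluates literals (strings, numbers, lists, dicts and similar), so it does not execute arbitrary code from the file.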

In [1]:
!pip install demjson


In [5]:
from google.colab import drive

import pandas as pd
from demjson import decode
import csv
import os

In [9]:
drive.mount('/content/drive')
file_location = '/content/drive/My Drive/datasets/recommenders' 
file_name = 'australian_users_items.json'

file_address = os.path.join(file_location,file_name)
with open(file_address, 'r',errors="ignore") as file_used:
    Lines = file_used.readlines()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Prepare 3 mappings:

*   from item_id to item_name
*   from item_name to item_id
*   from steam_id (the user) to the count of games they have purchased.

Read all the necessary columns (steam_id, item_id, playtime_2weeks and playtime_forever) from the pseudo-JSON (dictionary) records and store them in a dataframe.


In [10]:
item_id2_name = {}
name2_item_id = {}
steam_id2total_items = {}
rows = []
for line in Lines:
    # Each line is one user record; demjson tolerates the non-strict JSON.
    data_dict = decode(line)
    data_row = data_dict['items']
    name_row = data_dict['steam_id']
    steam_id2total_items[name_row] = data_dict['items_count']
    for row in data_row:
        # Register each game once, in both directions.
        if row['item_id'] not in item_id2_name:
            item_id2_name[row['item_id']] = row['item_name']
            name2_item_id[row['item_name']] = row['item_id']
        # One output row per (user, game) pair; 'Name' holds the steam_id.
        row_mod = {'Name': name_row, 'item': row['item_id'],
                   'recent_play_time': row['playtime_2weeks'], 'total_play_time': row['playtime_forever']}
        rows.append(row_mod)
        

In [11]:
user_games= pd.DataFrame(rows)
user_games.head()

Unnamed: 0,Name,item,recent_play_time,total_play_time
0,76561197970982479,10,0,6
1,76561197970982479,20,0,0
2,76561197970982479,30,0,7
3,76561197970982479,40,0,0
4,76561197970982479,50,0,0


Check whether there are any missing values or duplicates in the data:


In [16]:
user_games.isnull().sum()

Name                0
item                0
recent_play_time    0
total_play_time     0
dtype: int64

In [17]:
user_games.duplicated().sum()

59104

We will deal with the duplicates later.
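
The notebook does not show how the duplicates are handled. One possible approach, sketched here on a made-up toy frame with the same column names as `user_games`, is to count the fully identical rows and then drop them with `drop_duplicates`. Whether exact duplicates should simply be dropped for this dataset is an assumption, not something the source confirms.

```python
import pandas as pd

# Toy frame with the same columns as user_games; the values are made up.
toy = pd.DataFrame({
    "Name": ["u1", "u1", "u1", "u2"],
    "item": ["10", "10", "20", "10"],
    "recent_play_time": [0, 0, 5, 0],
    "total_play_time": [6, 6, 7, 0],
})

# duplicated() marks every row that is identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```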

In [14]:
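#TODO: move to utils python script
def write_in_csv(dict_to_write, file_name):
    """Write a dict to a two-column CSV (key, value), skipping None entries."""
    # newline='' lets the csv module control line endings (avoids blank rows on Windows).
    with open(file_name, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        for key, value in dict_to_write.items():
            if key is not None and value is not None:
                writer.writerow([key, value])

def read_from_csv(file_name):
    """Read a two-column CSV back into a dict; keys and values come back as strings."""
    with open(file_name, newline='') as csv_file:
        reader = csv.reader(csv_file)
        read_dict = dict(reader)
    return read_dict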
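
A short round trip may help show how these helpers behave. The sketch below repeats the two functions so that it runs on its own, and writes to a temporary directory. The dictionary `toy_map` and the file name `toy_map.csv` are made up for illustration. Note that entries with a `None` value are skipped on write, and that everything is read back as strings.

```python
import csv
import os
import tempfile

def write_in_csv(dict_to_write, file_name):
    # Same logic as the notebook helper: one (key, value) row per entry.
    with open(file_name, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        for key, value in dict_to_write.items():
            if key is not None and value is not None:
                writer.writerow([key, value])

def read_from_csv(file_name):
    # Each two-element row becomes one key/value pair.
    with open(file_name, newline='') as csv_file:
        return dict(csv.reader(csv_file))

# Hypothetical mapping: one None value and one integer key.
toy_map = {"10": "Example Game", "20": None, 30: "Another Game"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_map.csv")
    write_in_csv(toy_map, path)
    restored = read_from_csv(path)

print(restored)
```

Because the integer key `30` comes back as the string `"30"`, any code that reads these maps again would need to convert types itself if it relies on numeric IDs.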

Store the results as CSV files:

In [21]:
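# file_location has no trailing slash, so build each path with os.path.join.
user_games.to_csv(os.path.join(file_location, "games_played.csv"))
write_in_csv(item_id2_name, os.path.join(file_location, "item_id2item_map.csv"))
write_in_csv(name2_item_id, os.path.join(file_location, "name2_item_id_map.csv"))
write_in_csv(steam_id2total_items, os.path.join(file_location, "steam_id2total_items_map.csv"))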