## Task
You are given one item from a bundle and 10 other candidate items, which may or may not be from the same bundle.  
Given that item, recommend the candidates most likely to belong to its bundle.

Since bundled items are (for the most part) listed in an arbitrary order, accuracy is assessed as the percentage of true bundle items that appear in the top k positions of the ranking, where k is the number of items from the same bundle being compared.
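As a rough illustration, one reading of this metric can be sketched as below. The function name `hits_at_k` and the default of setting k to the number of true bundle items are assumptions made for this example, not part of the task definition, and the candidate list is assumed to contain no duplicates.

```python
def hits_at_k(ranked_candidates, bundle_items, k=None):
    """Fraction of the top-k ranked candidates that are true bundle items.

    Hypothetical helper: `k` defaults to the number of true bundle items,
    which is an assumption about how the task defines k.
    """
    truth = set(bundle_items)
    if k is None:
        k = len(truth)
    if k == 0:
        return 0.0
    top_k = ranked_candidates[:k]
    return sum(item in truth for item in top_k) / k


# Two true bundle items, so k = 2; only one of the top two is a hit.
ranked = [
    "Sword of Asumi - Soundtrack",
    "Mad Max",
    "Sword of Asumi - Graphic Novel",
    "Breached",
]
truth = ["Sword of Asumi - Soundtrack", "Sword of Asumi - Graphic Novel"]
score = hits_at_k(ranked, truth)  # 0.5
```

Because the order inside the top k does not affect the score, this measure matches the observation that bundle items have no meaningful order.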

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

In [2]:
task_fp = Path("data") / "bundle_task.csv"
df = pd.read_csv(task_fp)
print(df.shape)
df.head(3)

(3525, 4)


Unnamed: 0,bundle_id,item_id,item_name,genre
0,450,326950,Sword of Asumi,"Adventure, Indie, RPG"
1,450,331490,Sword of Asumi - Soundtrack,"Adventure, Indie, RPG"
2,450,331491,Sword of Asumi - Graphic Novel,"Adventure, Indie, RPG"


In [3]:
df.isna().sum()

bundle_id      0
item_id       10
item_name      0
genre        345
dtype: int64

In [4]:
df.query("item_id.isna()").groupby("bundle_id")[["item_id", "item_name"]].apply(lambda x: x)

bundle_id,,item_id,item_name
15,2457,,GameLoading: Extra Content
15,2458,,GameLoading: OST and eBook
19,2461,,GameLoading: Extra Content
19,2462,,GameLoading: OST and eBook
166,726,,"Great White Shark: GTA$1,250,000"
264,2790,,"Whale Shark: GTA$3,500,000"
265,2792,,"Megalodon Shark: GTA$8,000,000"
362,1436,,Lara Croft and the Temple of Osiris - Season P...
1141,1229,,Battle vs. Chess - Dark Desert DLC
1141,1230,,Battle vs Chess - Floating Island DLC


Since every row has an item name, we can ignore item_id and treat item_name as an equivalent identifier.  
    
The items with a missing item_id appear to be add-on store items (extra content, DLC, in-game currency packs) rather than standalone content.
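The assumption that item_name can stand in for item_id could be checked with something like the sketch below. The helper `name_id_one_to_one` and the toy frame are hypothetical and only reuse the column names from the data above. Rows with a missing item_id are excluded, since they have nothing to compare against.

```python
import pandas as pd


def name_id_one_to_one(frame):
    """Return True if each known item_id maps to one item_name and vice versa."""
    known = frame.dropna(subset=["item_id"])
    names_per_id = known.groupby("item_id")["item_name"].nunique()
    ids_per_name = known.groupby("item_name")["item_id"].nunique()
    return bool((names_per_id == 1).all() and (ids_per_name == 1).all())


# Toy frame for illustration only (not the real dataset).
toy = pd.DataFrame(
    {
        "item_id": [326950, 331490, None],
        "item_name": [
            "Sword of Asumi",
            "Sword of Asumi - Soundtrack",
            "GameLoading: Extra Content",
        ],
    }
)
toy_ok = name_id_one_to_one(toy)

# A name shared by two different ids breaks the one-to-one mapping.
clash = pd.DataFrame({"item_id": [1, 2], "item_name": ["Same", "Same"]})
clash_ok = name_id_one_to_one(clash)
```

Running `name_id_one_to_one(df)` on the full data would show whether the names really are unique identifiers; that result is not shown here.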

In [5]:
df["genre"] = df["genre"].fillna("none")

Not every item has a genre (soundtracks, for instance), so missing genres are filled with the placeholder "none".

In [6]:
bundle_attrs = ["item_name", "genre"] # ignore item_id
item_genre_pairs = df.groupby("bundle_id")[bundle_attrs].agg(list)
item_genre_pairs

bundle_id,item_name,genre
15,"[GameLoading: Rise of the Indies, GameLoading:...","[none, none, none]"
17,"[Angry Video Game Nerd Adventures, Angry Video...","[Action, Adventure, Indie, none]"
19,"[GameLoading: Extra Content, GameLoading: OST ...","[none, none]"
93,"[Mad Max: Fury Road, Mad Max]","[none, Action, Adventure]"
125,"[Mad Max: Fury Road, Mad Max 2: The Road Warri...","[none, none, none, none, Action, Adventure]"
...,...,...
1473,[Naruto Shippuden Uncut: The Man Who Died Twic...,"[none, none, none, none, none, none, none, non..."
1474,"[Naruto Shippuden Uncut: Sakura's Feelings, Na...","[none, none, none, none, none, none, none, non..."
1477,"[Halcyon 6: Starbase Commander, Halcyon 6: Sta...","[Indie, RPG, Simulation, Strategy, Indie, RPG,..."
1478,"[Paws and Claws: Pet Vet, Paws and Claws: Pet ...","[Casual, Simulation, Casual, Simulation, Casua..."


In [7]:
# Map each bundle_id to its list of (item_name, genre) pairs
bundle_selection = {
    bundle_id: list(zip(row["item_name"], row["genre"]))
    for bundle_id, row in item_genre_pairs.iterrows()
}

len(bundle_selection)

615

In [8]:
raw_train, raw_test = dict(), dict()

# Seed NumPy's RNG (the split below is positional and does not use it)
np.random.seed(25)
cutoff = len(bundle_selection) // 2
for i, (bundle_id, bundle_items) in enumerate(bundle_selection.items()):
    # Keep each bundle whole as an (n_items, 2) array of (item_name, genre) rows;
    # every bundle has at least 2 items, so any bundle can be used for train or test
    curr_data = np.array(bundle_items)

    # The first half of the bundles (in bundle_id order) goes to train, the rest to test
    if i < cutoff:
        raw_train[bundle_id] = curr_data
    else:
        raw_test[bundle_id] = curr_data

print(len(raw_train), len(raw_test))

307 308


I am only using full bundles for training rather than splitting each bundle between train and test. Splitting a 2-item bundle would leave a single item, and a single item is not associated with any other items, so no relationship information could be learned from it.
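A small sketch of that reasoning, assuming the relationship information comes from pairs of items that co-occur in a bundle. The helper `cooccurrence_pairs` is hypothetical and is not used elsewhere in this notebook.

```python
from itertools import combinations


def cooccurrence_pairs(items):
    """All unordered pairs of items that appear together in one bundle."""
    return list(combinations(items, 2))


whole_bundle = ["Mad Max: Fury Road", "Mad Max"]
half_bundle = whole_bundle[:1]  # what would remain after splitting a 2-item bundle

pairs_whole = cooccurrence_pairs(whole_bundle)  # one pair
pairs_half = cooccurrence_pairs(half_bundle)    # no pairs
```

A whole 2-item bundle yields one pair to learn from, while either half of it yields none.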

In [9]:
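def load_bundle_items(raw_data):
    """Flatten {bundle_id: array of (item_name, genre) rows} into one row per item."""
    raw_data = list(raw_data.items())
    # Column 0 holds the bundle_id, column 1 the per-bundle array;
    # explode column 1 so that each item gets its own row
    loaded = (
        pd.DataFrame(raw_data)
            .explode(1)
    )
    loaded_ix = loaded[0]
    # Unpack each (item_name, genre) pair into two columns, keyed by bundle_id
    loaded = (
        pd.DataFrame(loaded[1].to_list(),
                    columns=["item_name", "genre"],
                    index=loaded_ix)
        .reset_index()
        .rename(columns={0: "bundle_id"})
    )
    # Turn a genre string such as "Action, Indie" into ["Action", "Indie"]
    loaded["genre"] = loaded["genre"].str.split(", ")
    return loaded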

loaded_train = load_bundle_items(raw_train)
loaded_test = load_bundle_items(raw_test)

loaded_train.shape, loaded_test.shape
loaded_train

Unnamed: 0,bundle_id,item_name,genre
0,15,GameLoading: Rise of the Indies,[none]
1,15,GameLoading: Extra Content,[none]
2,15,GameLoading: OST and eBook,[none]
3,17,Angry Video Game Nerd Adventures,"[Action, Adventure, Indie]"
4,17,Angry Video Game Nerd: The Movie,[none]
...,...,...,...
1933,694,ENKI,"[Adventure, Indie]"
1934,694,N.E.R.O.: Nothing Ever Remains Obscure,"[Adventure, Indie]"
1935,695,Breached,"[Adventure, Indie]"
1936,695,Breached - Original Soundtrack,[none]


In [10]:
print(loaded_train.shape)
loaded_train.head(3)

(1938, 3)


Unnamed: 0,bundle_id,item_name,genre
0,15,GameLoading: Rise of the Indies,[none]
1,15,GameLoading: Extra Content,[none]
2,15,GameLoading: OST and eBook,[none]


In [11]:
print(loaded_test.shape)
loaded_test.head(3)

(1587, 3)


Unnamed: 0,bundle_id,item_name,genre
0,696,Techwars Online,"[Action, Indie, Massively Multiplayer, Strategy]"
1,696,Techwars online - Original Soundtrack,"[Action, Indie, Massively Multiplayer, Strategy]"
2,696,Techwars online - Art book,"[Action, Indie, Massively Multiplayer, Strategy]"


In [15]:
# loaded_test.to_csv("old_bundle.csv", index=False)