
# Amazon KDD Cup 2023 - Task 1 - Next Product Recommendation 

![](https://images.aicrowd.com/raw_images/challenges/banner_file/1116/6c8fecd6d7c225b4ed11.jpg)

This notebook contains instructions and an example submission that uses random predictions.



## Installations 🤖

1. `aicrowd-cli` for downloading challenge data and making submissions
2. `pyarrow` for saving to parquet for submissions

In [2]:
!pip install -r requirements.txt


In [3]:
!pip install aicrowd-cli pyarrow


## Login to AIcrowd and download the data 📚

In [4]:
!aicrowd login

Please login here: https://api.aicrowd.com/auth/roghsVupx9VgR6HfqYSjXKEYmTP0uToaGHcXOHQuMz0
API Key valid
Gitlab access token valid
Saved details successfully!


In [5]:
!aicrowd dataset download --challenge task-1-next-product-recommendation

sessions_test_task1.csv: 100%|█████████████| 19.4M/19.4M [00:02<00:00, 8.54MB/s]
sessions_test_task2.csv: 100%|█████████████| 1.92M/1.92M [00:01<00:00, 1.61MB/s]
sessions_test_task3.csv: 100%|█████████████| 2.67M/2.67M [00:01<00:00, 2.10MB/s]
products_train.csv: 100%|████████████████████| 589M/589M [00:38<00:00, 15.1MB/s]
sessions_train.csv: 100%|████████████████████| 259M/259M [00:17<00:00, 14.8MB/s]


## Setup data and task information

In [5]:
import os
import numpy as np
import pandas as pd
from functools import lru_cache

In [6]:
train_data_dir = '.'
test_data_dir = '.'
task = 'task1'
PREDS_PER_SESSION = 100

In [7]:
# Cache loading of data for multiple calls

@lru_cache(maxsize=1)
def read_product_data():
    return pd.read_csv(os.path.join(train_data_dir, 'products_train.csv'))

@lru_cache(maxsize=1)
def read_train_data():
    return pd.read_csv(os.path.join(train_data_dir, 'sessions_train.csv'))

@lru_cache(maxsize=3)
def read_test_data(task):
    return pd.read_csv(os.path.join(test_data_dir, f'sessions_test_{task}.csv'))
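
The loaders above rely on `functools.lru_cache` so that each CSV is read only once per argument. The following is a minimal sketch of that behaviour; `load_table` and `load_calls` are hypothetical stand-ins for the real `pd.read_csv` calls, used here only so the example runs without the dataset.

```python
from functools import lru_cache

load_calls = 0  # counts how often the "expensive" load actually runs


@lru_cache(maxsize=3)
def load_table(task):
    """Hypothetical stand-in for an expensive pd.read_csv call."""
    global load_calls
    load_calls += 1
    return [f"row for {task}"]


first = load_table("task1")
second = load_table("task1")  # same argument: served from the cache
other = load_table("task2")   # new argument: triggers a second load

print(load_calls)       # 2
print(first is second)  # True, the very same object is returned
```

Because the cache hands back the same object on every call, modifying a cached DataFrame in place would also change what later callers see. Taking a `.copy()` before mutating is one way to avoid that.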

## Data Description

The Multilingual Shopping Session Dataset is a collection of **anonymized customer sessions** containing products from six different locales, namely English, German, Japanese, French, Italian, and Spanish. It consists of two main components: **user sessions** and **product attributes**. User sessions are a list of products that a user has engaged with in chronological order, while product attributes include various details like product title, price in local currency, brand, color, and description.

---

### Each product has its associated information:


**locale**: the locale code of the product (e.g., DE)

**id**: a unique identifier for the product, also known as the Amazon Standard Identification Number (ASIN) (e.g., B07WSY3MG8)

**title**: title of the item (e.g., “Japanese Aesthetic Sakura Flowers Vaporwave Soft Grunge Gift T-Shirt”)

**price**: price of the item in local currency (e.g., 24.99)

**brand**: item brand name (e.g., “Japanese Aesthetic Flowers & Vaporwave Clothing”)

**color**: color of the item (e.g., “Black”)

**size**: size of the item (e.g., “xxl”)

**model**: model of the item (e.g., “iphone 13”)

**material**: material of the item (e.g., “cotton”)

**author**: author of the item (e.g., “J. K. Rowling”)

**desc**: description of an item’s key features and benefits, called out via bullet points (e.g., “Solid colors: 100% Cotton; Heather Grey: 90% Cotton, 10% Polyester; All Other Heathers …”)


## EDA 💽

In [8]:
def read_locale_data(locale, task):
    products = read_product_data().query(f'locale == "{locale}"')
    sess_train = read_train_data().query(f'locale == "{locale}"')
    sess_test = read_test_data(task).query(f'locale == "{locale}"')
    return products, sess_train, sess_test

def show_locale_info(locale, task):
    products, sess_train, sess_test = read_locale_data(locale, task)

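    # prev_items is stored as a string such as "['A' 'B' 'C']", so split on
    # whitespace to count items rather than characters
    train_l = sess_train['prev_items'].str.split().str.len()
    test_l = sess_test['prev_items'].str.split().str.len()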

    print(f"Locale: {locale} \n"
          f"Number of products: {products['id'].nunique()} \n"
          f"Number of train sessions: {len(sess_train)} \n"
          f"Train session lengths - "
          f"Mean: {train_l.mean():.2f} | Median {train_l.median():.2f} | "
          f"Min: {train_l.min():.2f} | Max {train_l.max():.2f} \n"
          f"Number of test sessions: {len(sess_test)}"
        )
    if len(sess_test) > 0:
        print(
            f"Test session lengths - "
            f"Mean: {test_l.mean():.2f} | Median {test_l.median():.2f} | "
            f"Min: {test_l.min():.2f} | Max {test_l.max():.2f} \n"
        )
    print("======================================================================== \n")

In [9]:
products = read_product_data()
locale_names = products['locale'].unique()
for locale in locale_names:
    show_locale_info(locale, task)

In [None]:
products.sample(5)

| | id | locale | title | price | brand | color | size | model | material | author | desc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1535876 | B08B3QTXJZ | IT | kwmobile Custodia Compatibile con Apple iPhone... | 8.49 | KW-Commerce | blu chiaro matt | | 49982.58_m000813 | Silicone | | ANTI URTO: i bordi rialzati della copertina pr... |
| 1198060 | B0B2KN1Q6M | UK | Me To You Bear Sister Just For You Birthday Card | 2.99 | Carte Blanche | | | | | | |
| 1024050 | B099W7JSMT | UK | Syhood 32.8 Feet Christmas Metallic Tinsel Twi... | 8.99 | Syhood | Blue | | | Metal | | Christmas style decor: the Christmas metallic ... |
| 895070 | B0BCG44MBT | JP | ラップタオル大人用 速乾大きいサイズ 風呂用サウナ 着るバスシャワー超吸水水泳 温泉湯浴み着... | 1589.0 | OTTCFRN | ピンク | ワンサイズ | | ポリエステル | | 3 Dトリミング設計、（非ベルクロデザイン）よりユーザーフレンドリーで、使用時に音が出ず、肌... |
| 1084330 | B007E9VUQS | UK | Smiffys Make-Up FX Face and Body Paint, 16 ml ... | 2.94 | Smiffy's | Brown (dark) | One Size | 39184 | | | Add colour to your dress-up costume! |


In [None]:
train_sessions = read_train_data()
train_sessions.sample(5)

| | prev_items | next_item | locale |
|---|---|---|---|
| 2994683 | ['B07T3GN2VH' 'B07T2FDFKZ' 'B07T3DJMT5' 'B098T... | B07ZJZNRMP | UK |
| 190907 | ['B07ZRN33PQ' 'B07ZRMCRG7' 'B09C24TXP4' 'B09C2... | B091G94JDR | DE |
| 3595388 | ['B09BJNQRNZ' 'B08XK8M5Z3' 'B09DPD5QJ8' 'B09KN... | B0B4RYT3ZS | IT |
| 465436 | ['B07GXQCFXK' 'B07GXQD5Y3' 'B00E6722OK'] | B00D3HZYGW | DE |
| 3518477 | ['B09ZY6WYJX' 'B0B85CSNXW' 'B09ZY6WYJX' 'B0B85... | B083DRSWKR | IT |
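
As the sample shows, `prev_items` arrives as a single string that looks like a printed array, not as a Python list. The sketch below shows one way to turn such a string into a list of product ids using a regular expression. `parse_prev_items` is a hypothetical helper, and it assumes ids never contain a single quote.

```python
import re

# Matches every run of characters enclosed in single quotes
ITEM_ID_PATTERN = re.compile(r"'([^']+)'")


def parse_prev_items(raw):
    """Hypothetical helper: extract product ids from a prev_items string."""
    return ITEM_ID_PATTERN.findall(raw)


raw = "['B07GXQCFXK' 'B07GXQD5Y3' 'B00E6722OK']"
items = parse_prev_items(raw)

print(items)       # ['B07GXQCFXK', 'B07GXQD5Y3', 'B00E6722OK']
print(len(items))  # 3
```

Applied to the DataFrame, this could look like `train_sessions['prev_items'].apply(parse_prev_items)`. Note that `len(raw)` on the unparsed string counts characters, not products.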


In [None]:
test_sessions = read_test_data(task)
test_sessions.sample(5)

| | prev_items | locale |
|---|---|---|
| 202046 | ['B08H95Y452' 'B0BG3GRMF9' 'B0BG3GRMF9' 'B0BF5... | UK |
| 98284 | ['B09RQ8T72D' 'B09998MBFM' 'B09RQ8T72D'] | DE |
| 191260 | ['B0871Z739B' 'B09N92NHGR' 'B0871Z739B'] | JP |
| 113547 | ['B0B56Q2VXW' 'B0B56NPJ4G' 'B0B56Q2VXW'] | JP |
| 102804 | ['B08G97TPH8' 'B08G91WFQR' 'B08G93D8LZ' 'B082P... | DE |


## Generate Submission 🏋️‍♀️



Submission format:
1. The submission should be a **parquet** file containing the sessions from all locales.
2. Each predicted product id must be a valid product id from the same locale as its session.
3. Predictions should be added in a new column named **"next_item_prediction"**.
4. Each prediction should be a list of string id values.
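
The format rules above can be sketched as a small per-row check. This is only an illustration under stated assumptions: `find_format_problems` is a hypothetical helper, rows are plain dicts rather than DataFrame rows, and `ids_by_locale` is a toy catalog, not the real product data.

```python
def find_format_problems(row, ids_by_locale):
    """Hypothetical helper: list the format problems found in one row."""
    preds = row.get("next_item_prediction")
    # Rules 3 and 4: the column must exist and hold a list of strings
    if not isinstance(preds, list) or not all(isinstance(p, str) for p in preds):
        return ["next_item_prediction must be a list of strings"]
    # Rule 2: every predicted id must belong to the session's locale
    allowed = ids_by_locale.get(row["locale"], set())
    foreign = [p for p in preds if p not in allowed]
    if foreign:
        return [f"ids not in locale {row['locale']}: {foreign}"]
    return []


# Toy catalog for illustration only
ids_by_locale = {"DE": {"B07ZRN33PQ", "B00D3HZYGW"}, "UK": {"B07ZJZNRMP"}}

good = {"locale": "DE", "next_item_prediction": ["B00D3HZYGW", "B07ZRN33PQ"]}
bad = {"locale": "DE", "next_item_prediction": ["B07ZJZNRMP"]}

print(find_format_problems(good, ids_by_locale))  # []
print(find_format_problems(bad, ids_by_locale))   # ["ids not in locale DE: ['B07ZJZNRMP']"]
```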

In [None]:
def random_predictions(locale, sess_test_locale):
    # Fixed seed so the random baseline is reproducible
    random_state = np.random.RandomState(42)
    products = read_product_data().query(f'locale == "{locale}"')
    predictions = []
    for _ in range(len(sess_test_locale)):
        # Sampling with replacement, so an id can repeat within one session's list
        predictions.append(
            list(products['id'].sample(PREDS_PER_SESSION, replace=True, random_state=random_state))
        )
    sess_test_locale['next_item_prediction'] = predictions
    sess_test_locale.drop('prev_items', inplace=True, axis=1)
    return sess_test_locale

In [None]:
test_sessions = read_test_data(task)
predictions = []
test_locale_names = test_sessions['locale'].unique()
for locale in test_locale_names:
    # Copy so the cached test DataFrame is not modified in place
    sess_test_locale = test_sessions.query(f'locale == "{locale}"').copy()
    predictions.append(
        random_predictions(locale, sess_test_locale)
    )
predictions = pd.concat(predictions).reset_index(drop=True)
predictions.sample(5)

| | locale | next_item_prediction |
|---|---|---|
| 197622 | JP | [B0B3JKGTBH, B07WFD1L1R, B0B1N2FMMG, B0BLJSMWJ... |
| 108611 | JP | [B07WV5GXPB, B0B7DS3HQL, B0866HDFTS, B009GQYDX... |
| 284074 | UK | [B09BB5SPR3, B0816CXMSZ, B08JV76967, B08MW68KC... |
| 34652 | DE | [B007H6POYW, B08M5GZGFT, B08JQZMFL7, B0BKP9BSL... |
| 268639 | UK | [B06XCGCKG7, B0B9NXKN54, B091YX63K7, B00Z65X1G... |
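
The random baseline samples with `replace=True`, so the same id can appear more than once in a single session's list. If unique predictions are preferred, one possible variation is to sample without replacement. The sketch below uses the standard library's `random` module instead of pandas; `sample_unique_ids` and the `catalog` ids are hypothetical and exist only for illustration.

```python
import random


def sample_unique_ids(product_ids, k, rng):
    """Hypothetical helper: draw up to k distinct ids from a list of ids."""
    k = min(k, len(product_ids))  # cannot draw more distinct ids than exist
    return rng.sample(product_ids, k)


rng = random.Random(42)                          # fixed seed for reproducibility
catalog = [f"ITEM{i:04d}" for i in range(500)]   # made-up ids, not real ASINs
preds = sample_unique_ids(catalog, 100, rng)

print(len(preds))       # 100
print(len(set(preds)))  # 100, so no id is repeated
```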


## Validate predictions ✅

In [None]:
def check_predictions(predictions, check_products=False):
    """
    These tests need to pass as they will also be applied on the evaluator
    """
    test_locale_names = test_sessions['locale'].unique()
    for locale in test_locale_names:
        sess_test = test_sessions.query(f'locale == "{locale}"')
        preds_locale = predictions[predictions['locale'] == sess_test['locale'].iloc[0]]
        assert sorted(preds_locale.index.values) == sorted(sess_test.index.values), f"Session ids of {locale} don't match"

        if check_products:
            # This check is not done on the evaluator
            # but you can run it to verify there is no mixing of products between locales
            # Since the ground truth next item will always belong to the same locale
            # Warning - This can be slow to run
            products = read_product_data().query(f'locale == "{locale}"')
            predicted_products = np.unique( np.array(list(preds_locale["next_item_prediction"].values)) )
            assert np.all( np.isin(predicted_products, products['id']) ), f"Invalid products in {locale} predictions"

In [None]:
check_predictions(predictions)

In [None]:
# It's important that the parquet file you submit is saved with the pyarrow backend
predictions.to_parquet(f'submission_{task}.parquet', engine='pyarrow')

## Submit to AIcrowd 🚀

In [None]:
# You can submit with aicrowd-cli, or upload manually on the challenge page.
!aicrowd submission create -c task-1-next-product-recommendation -f "submission_task1.parquet"