## This experiment applies multiple algorithms to attain the best possible MAP@12 (mean average precision at 12) for the problem at hand.

### Approach 1

**Algorithm 1:** Find the most purchased articles for each user, on the assumption that a user is likely to spend a similar amount on the same item or on an article similar to it.  
This notebook gets this algorithm up and running for each user. I take the top 12 articles that each user is assumed most likely to purchase. If a user has purchased fewer than 12 articles, the top-selling articles fill the list until it reaches 12.
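
The idea behind Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's implementation: the helper name `top_k_with_fallback` and the sample lists are hypothetical, and it assumes the fallback list is already ordered from most to least popular.

```python
from collections import Counter


def top_k_with_fallback(purchases, popular, k=12):
    """Return k article ids: the user's most purchased first, then popular ones.

    `purchases` is one user's purchased article ids (repeats allowed);
    `popular` is a best-seller list, most popular first.
    Both names are hypothetical and used only for this illustration.
    """
    # Rank the user's own articles by purchase count, most frequent first.
    ranked = [article for article, _ in Counter(purchases).most_common()]
    recommendations = ranked[:k]

    # Pad with popular articles that are not already recommended.
    for article in popular:
        if len(recommendations) >= k:
            break
        if article not in recommendations:
            recommendations.append(article)
    return recommendations


user_purchases = ['0706016001', '0751471001', '0706016001']
best_sellers = ['0751471001', '0915526001', '0918292001']
result = top_k_with_fallback(user_purchases, best_sellers, k=4)
print(result)
```

Skipping articles that are already in the list keeps the padded recommendation free of duplicates, which would otherwise waste slots.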

**Algorithm 2:** Use the LightFM algorithm to find relations between customers and articles that have had no interaction in the past, but which the customer may tend to buy in future visits. The model has already been trained, and its hyperparameters tuned, on a reduced range of transaction dates in separate notebooks. Links to these notebooks are below:  

Link to training the LightFM model: https://www.kaggle.com/rickykonwar/h-m-lightfm-nofeatures  
Link to hyperparameter tuning of the LightFM model: https://www.kaggle.com/rickykonwar/h-m-lightfm-nofeatures-hyperparamter-tuning/notebook

### Approach 2  



## Importing Libraries

In [59]:
import time
import tqdm
import datetime
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

## Utility Functions

In [60]:
def apk(actual, predicted, k=12):
    '''
    Function to get Average Precision at K
    '''
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0
    
    return score / min(len(actual), k)

def mapk(actual, predicted, k=12):
    '''
    Function to get Mean Average Precision at K
    '''
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])
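
A small worked example may help to check the metric by hand. The sketch below restates `apk` compactly so it runs on its own; the item labels are made up. With hits at ranks 1 and 3, the precisions 1/1 and 2/3 are summed and divided by `min(len(actual), k)`.

```python
def apk(actual, predicted, k=12):
    """Average precision at k, following the same definition as the notebook."""
    if not actual:
        return 0.0
    predicted = predicted[:k]
    score, num_hits = 0.0, 0.0
    for i, p in enumerate(predicted):
        # Count a hit only the first time a relevant item appears.
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    return score / min(len(actual), k)


actual = ['a', 'b', 'c']
predicted = ['a', 'x', 'b', 'y']

# Hits at rank 1 (precision 1/1) and rank 3 (precision 2/3);
# the sum 5/3 is divided by min(len(actual), k) = 3, giving 5/9.
result = apk(actual, predicted)
print(round(result, 4))  # 0.5556
```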

## Data Loading

In [61]:
data_path = r'../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv'
customer_data_path = r'../input/h-and-m-personalized-fashion-recommendations/customers.csv'
article_data_path = r'../input/h-and-m-personalized-fashion-recommendations/articles.csv'
submission_data_path = r'../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv'

In [62]:
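# Data Extraction
def create_data(datapath, data_type=None):
    # article_id is read as a string so that leading zeros are preserved
    if data_type is None:
        df = pd.read_csv(datapath)
    elif data_type == 'transaction':
        df = pd.read_csv(datapath, dtype={'article_id': str}, parse_dates=['t_dat'])
    elif data_type == 'article':
        df = pd.read_csv(datapath, dtype={'article_id': str})
    else:
        # Fail early instead of hitting an unbound `df` at the return
        raise ValueError(f"Unknown data_type: {data_type!r}")
    return df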

In [63]:
%%time

# Load all sales data (spanning the calendar years 2018 to 2020)
# Also, article_id is read as a string column; otherwise the
# leading zeros would be dropped while reading the column values
transactions_data=create_data(data_path, data_type='transaction')
print(transactions_data.shape)

# # Unique Attributes
print(str(len(transactions_data['t_dat'].drop_duplicates())) + "-total No of unique transactions dates in data sheet")
print(str(len(transactions_data['customer_id'].drop_duplicates())) + "-total No of unique customers ids in data sheet")
print(str(len(transactions_data['article_id'].drop_duplicates())) + "-total No of unique article ids in data sheet")
print(str(len(transactions_data['sales_channel_id'].drop_duplicates())) + "-total No of unique sales channels in data sheet")

(31788324, 5)
734-total No of unique transactions dates in data sheet
1362281-total No of unique customers ids in data sheet
104547-total No of unique article ids in data sheet
2-total No of unique sales channels in data sheet
CPU times: user 48.6 s, sys: 7.02 s, total: 55.6 s
Wall time: 1min 2s
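
The effect of reading `article_id` as a string can be seen on a tiny in-memory CSV. This is only an illustration: the two rows are made up and merely follow the column layout of the transactions file.

```python
import io

import pandas as pd

# Two made-up rows in the layout of the transactions file (values are illustrative).
csv_text = (
    "t_dat,customer_id,article_id\n"
    "2020-09-01,c1,0108775015\n"
    "2020-09-02,c2,0706016001\n"
)

# Default parsing infers an integer column and loses the leading zero.
as_default = pd.read_csv(io.StringIO(csv_text))
# Forcing a string dtype keeps the ids exactly as written.
as_string = pd.read_csv(io.StringIO(csv_text), dtype={'article_id': str})

print(as_default['article_id'].tolist())
print(as_string['article_id'].tolist())
```

Keeping the ids as strings matters because the submission format expects the zero-padded form.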


In [64]:
%%time

# Load all Customers
customer_data=create_data(customer_data_path)
print(customer_data.shape)

print(str(len(customer_data['customer_id'].drop_duplicates())) + "-total No of unique customers ids in customer data sheet")

(1371980, 7)
1371980-total No of unique customers ids in customer data sheet
CPU times: user 5.53 s, sys: 472 ms, total: 6 s
Wall time: 5.99 s


In [65]:
%%time

# Load all Articles
article_data=create_data(article_data_path, data_type='article')
print(article_data.shape)

print(str(len(article_data['article_id'].drop_duplicates())) + "-total No of unique article ids in article data sheet")

(105542, 25)
105542-total No of unique article ids in article data sheet
CPU times: user 651 ms, sys: 31.8 ms, total: 683 ms
Wall time: 682 ms


In [66]:
%%time

# Load all submission samples data
submission_data=create_data(submission_data_path)
print(submission_data.shape)

print(str(len(submission_data['customer_id'].drop_duplicates())) + "-total No of unique customer ids in submission data sheet")

(1371980, 2)
1371980-total No of unique customer ids in submission data sheet
CPU times: user 2.87 s, sys: 177 ms, total: 3.04 s
Wall time: 3.04 s


## Capturing Seasonal Effect by Limiting the transaction date
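Only transactions after 2020-08-21 are kept, so that the models see the most recent season's purchases. Based on the notebook at: https://www.kaggle.com/tomooinubushi/folk-of-time-is-our-best-friend/notebook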

In [67]:
transactions_data = transactions_data[transactions_data['t_dat'] > '2020-08-21']
transactions_data.shape

(1190911, 5)

## Splitting transaction data into training and validation sets

In [68]:
train_start_date = transactions_data.t_dat.min()
split_date = transactions_data.t_dat.max() - datetime.timedelta(days = 7)
train_transaction_data = transactions_data[(transactions_data.t_dat <= split_date) & (transactions_data.t_dat >= train_start_date)].copy()
val_transaction_data = transactions_data[transactions_data.t_dat > split_date].copy()

print(train_transaction_data.shape)
print(val_transaction_data.shape)

(950600, 5)
(240311, 5)
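
The time-based split can be checked on a toy frame. The sketch below assumes the same rule as the cell above (hold out everything after the latest date minus 7 days); the dates and customer ids are made up.

```python
import datetime

import pandas as pd

# Toy transaction log; dates and ids are made up for illustration.
toy = pd.DataFrame({
    't_dat': pd.to_datetime(
        ['2020-09-01', '2020-09-10', '2020-09-15', '2020-09-16', '2020-09-22']
    ),
    'customer_id': ['c1', 'c2', 'c1', 'c3', 'c2'],
})

# Everything after (latest date - 7 days) becomes the validation set.
split_date = toy['t_dat'].max() - datetime.timedelta(days=7)
train = toy[toy['t_dat'] <= split_date]
val = toy[toy['t_dat'] > split_date]

print(split_date.date(), len(train), len(val))
```

Splitting by date rather than at random keeps future purchases out of the training set, which mirrors how the predictions are used.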


In [69]:
train_transaction_data = train_transaction_data.groupby(['customer_id','article_id']).agg({'t_dat':'count'}).reset_index()
val_transaction_data = val_transaction_data.groupby(['customer_id','article_id']).agg({'t_dat':'count'}).reset_index()

train_transaction_data.rename({'t_dat': 't_count'}, axis=1, inplace=True)
val_transaction_data.rename({'t_dat': 't_count'}, axis=1, inplace=True)

## Approach 1

### Algorithm 1: Extracting the top articles for each user based on the number of purchases made

In [70]:
train_val_merge_transaction_data = train_transaction_data[train_transaction_data.article_id.isin(val_transaction_data.article_id.unique())] 
train_top_articles = train_val_merge_transaction_data.sort_values(['customer_id', 't_count'], ascending=False).groupby(['customer_id']).head(12)
val_top_articles = val_transaction_data.sort_values(['customer_id', 't_count'], ascending=False).groupby(['customer_id']).head(12)

# Overall highly sold articles
overall_top_articles = train_top_articles.groupby(['article_id'], as_index = False)['t_count'].sum().sort_values(['t_count'])['article_id'][-12:].values
overall_top_articles = overall_top_articles[::-1]

In [71]:
overall_top_articles

array(['0751471001', '0706016001', '0915526001', '0751471043',
       '0918292001', '0915529003', '0898694001', '0448509014',
       '0863595006', '0896152002', '0850917001', '0915526002'],
      dtype=object)

### Evaluating this algorithm

In [115]:
%%time

# Group once per customer; filtering the full frame with isin() inside
# the loop scans every row for each customer and is far slower.
train_groups = train_top_articles.groupby('customer_id')['article_id'].apply(list).to_dict()
val_groups = val_top_articles.groupby('customer_id')['article_id'].apply(list).to_dict()

preds = []
trues = []

for customer, articles in tqdm.tqdm(train_groups.items(), desc='Evaluating Simple Algorithm'):
    predict_n_articles = articles[:12]

    # Pad with overall top sellers, skipping articles already predicted
    if len(predict_n_articles) < 12:
        fillers = [a for a in overall_top_articles if a not in predict_n_articles]
        predict_n_articles = predict_n_articles + fillers[:12 - len(predict_n_articles)]

    preds.append(predict_n_articles)
    # Customers with no validation purchases get an empty list (apk returns 0.0)
    trues.append(val_groups.get(customer, [])[:12])

In [87]:
# score = np.round(mapk(trues, preds, k = 12), 5)
# print(f'MAP@{12} = {score}')

In [117]:
purchase_dict = {}

for cust_id, art_id in zip(transactions_data['customer_id'], transactions_data['article_id']):
    if cust_id not in purchase_dict:
        purchase_dict[cust_id] = {}
    
    if art_id not in purchase_dict[cust_id]:
        purchase_dict[cust_id][art_id] = 0
    
    purchase_dict[cust_id][art_id] += 1
    
print(len(purchase_dict))

256355
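
The nested counting dictionary built above can also be expressed with the standard library's `defaultdict` and `Counter`. This is an alternative sketch on made-up pairs, not the notebook's code; the name `purchase_counts` is hypothetical.

```python
from collections import Counter, defaultdict

# Made-up (customer_id, article_id) pairs standing in for the transaction rows.
rows = [
    ('c1', '0706016001'),
    ('c1', '0706016001'),
    ('c1', '0751471001'),
    ('c2', '0915526001'),
]

# One Counter per customer; missing keys are created on first access.
purchase_counts = defaultdict(Counter)
for cust_id, art_id in rows:
    purchase_counts[cust_id][art_id] += 1

# most_common() yields articles ordered by purchase count, highest first.
print(purchase_counts['c1'].most_common())
```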


In [76]:
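# Copy the slice so that adding the prediction column does not trigger SettingWithCopyWarning
not_so_fancy_but_fast_benchmark = submission_data[['customer_id']].copy()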
prediction_list = []
dummy_list = list((transactions_data['article_id'].value_counts()).index)[:12]
dummy_pred = ' '.join(dummy_list)

for i, cust_id in enumerate(submission_data['customer_id'].values.reshape((-1,))):
    if cust_id in purchase_dict:
        l = sorted((purchase_dict[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list[:(12-len(l))])
    else:
        s = dummy_pred
    prediction_list.append(s)

not_so_fancy_but_fast_benchmark['prediction'] = prediction_list
print(not_so_fancy_but_fast_benchmark.shape)
not_so_fancy_but_fast_benchmark.head()

(1371980, 2)


| | customer_id | prediction |
|---|---|---|
| 0 | 00000dbacae5abe5e23885899a1fa44253a17956c6d1c3... | 0568601043 0751471001 0915529003 0915526001 09... |
| 1 | 0000423b00ade91418cceaf3b26c6af3dd342b51fd051e... | 0751471001 0915529003 0915526001 0918292001 07... |
| 2 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 0794321007 0751471001 0915529003 0915526001 09... |
| 3 | 00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2... | 0751471001 0915529003 0915526001 0918292001 07... |
| 4 | 00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... | 0751471001 0915529003 0915526001 0918292001 07... |


In [77]:
dummy_list

['0751471001',
 '0915529003',
 '0915526001',
 '0918292001',
 '0706016001',
 '0751471043',
 '0448509014',
 '0898694001',
 '0863595006',
 '0909370001',
 '0850917001',
 '0714790020']

### Algorithm 2: Load predictions on submission data from the LightFM experiment

In [23]:
optimized_lightfm_submission_path = r'../input/hm-trained-models/lightfm_nofeatures/submission_optimized_01.csv'
lightfm_submission_data = create_data(optimized_lightfm_submission_path)
print(lightfm_submission_data.shape)

print(str(len(lightfm_submission_data['customer_id'].drop_duplicates())) + "-total No of unique customer ids in lightfm submission data sheet")

(1371980, 2)
1371980-total No of unique customer ids in lightfm submission data sheet


### Combine

In [None]:
for i, cust_id in enumerate(submission_data['customer_id'].values.reshape((-1,))):
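
The combination step is not spelled out here, so the following is only one possible sketch, under the assumption that both approaches produce space-separated prediction strings as in the submission format. The helper `blend_predictions` and the sample strings are hypothetical; it places the purchase-history predictions first and fills the remaining slots from the LightFM predictions, dropping repeats.

```python
def blend_predictions(primary, secondary, k=12):
    """Merge two space-separated prediction strings, keeping order and dropping repeats.

    Hypothetical helper for illustration; not the notebook's combination logic.
    """
    merged = []
    for article in primary.split() + secondary.split():
        if article not in merged:
            merged.append(article)
    return ' '.join(merged[:k])


history_based = '0568601043 0751471001'
lightfm_based = '0751471001 0915529003 0706016001'
blended = blend_predictions(history_based, lightfm_based, k=3)
print(blended)
```

Which source goes first is a design choice; the order decides whose predictions occupy the top ranks, where MAP@12 rewards hits the most.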