# **Thanks to Kouki for sharing [LSTM by Keras with Unified Wi-Fi Feats](http://www.kaggle.com/kokitanisaka/lstm-by-keras-with-unified-wi-fi-feats). I made only small changes: since I am more used to PyTorch, I ported the Keras code to PyTorch. Score: 7.53**

## Overview

This notebook demonstrates how to use [the unified Wi-Fi dataset](https://www.kaggle.com/kokitanisaka/indoorunifiedwifids).<br>
The neural network model is not optimized, so there is plenty of room to improve the score. 

This notebook draws on these two excellent notebooks.
* [wifi features with lightgbm/KFold](https://www.kaggle.com/hiro5299834/wifi-features-with-lightgbm-kfold) by [@hiro5299834](https://www.kaggle.com/hiro5299834/)<br>
 I took some code fragments from his notebook.
* [Simple 👌 99% Accurate Floor Model 💯](https://www.kaggle.com/nigelhenry/simple-99-accurate-floor-model) by [@nigelhenry](https://www.kaggle.com/nigelhenry/)<br>
 I use the "floor" predictions from his excellent work.

Training takes a very long time. <br>
Enabling the GPU does not help. <br>
If anybody knows how to speed it up, please leave a comment. <br>

Thank you!

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from pathlib import Path
import glob
import pickle
from tqdm import tqdm
import random
import os
import copy
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

### options
These options control how the model is trained. <br>
**NUM_FEATS** is one of the most important. <br>
It determines how many Wi-Fi features are used in training. <br>
The dataset has 100 Wi-Fi features, but the 100th Wi-Fi signal is probably not important. <br>
So we can keep only the top Wi-Fi signals if that is enough. 

In [2]:
# options

N_SPLITS = 10

SEED = 2021

NUM_FEATS = 20 # number of features that we use. there are 100 feats but we don't need to use all of them

base_path = '/kaggle'

In [3]:
def set_seed(seed=42):
    # seed Python, NumPy and PyTorch so that runs are repeatable
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def get_timestamp():
    # build a 'year-month-day-hour-minute' string from the local time
    import time
    timestamp = ''
    for i, d in enumerate(time.localtime()):
        if i == 3:
            d += 8  # add 8 to the hour field; the result is not wrapped at 24
        timestamp += str(d) + '-'
        if i == 4:
            break
    return timestamp[:-1]

def comp_metric(xhat, yhat, fhat, x, y, f):
    # mean position error: horizontal distance plus 15 per floor of floor error
    intermediate = np.sqrt((xhat-x)**2 + (yhat-y)**2) + 15 * np.abs(fhat-f)
#     intermediate = np.sqrt((xhat-x)**2 + (yhat-y)**2)
    return intermediate.sum()/xhat.shape[0]
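
A small worked example may make the metric concrete. The sketch below re-implements the same formula under a hypothetical name, `mean_position_error`, and uses made-up coordinates; only the floor penalty of 15 is taken from `comp_metric` above.

```python
import numpy as np

def mean_position_error(xhat, yhat, fhat, x, y, f, floor_penalty=15):
    # Hypothetical helper mirroring comp_metric: horizontal Euclidean error
    # plus a fixed penalty per floor of floor error, averaged over all rows.
    horizontal = np.sqrt((xhat - x) ** 2 + (yhat - y) ** 2)
    vertical = floor_penalty * np.abs(fhat - f)
    return (horizontal + vertical).mean()

# Made-up predictions and ground truth for two waypoints.
xhat, yhat, fhat = np.array([3.0, 0.0]), np.array([4.0, 0.0]), np.array([0, 1])
x, y, f = np.array([0.0, 0.0]), np.array([0.0, 0.0]), np.array([0, 0])

score = mean_position_error(xhat, yhat, fhat, x, y, f)
print(score)  # (5 + 15) / 2 = 10.0
```

The first waypoint is 5 units off on the correct floor, the second is at the right position but one floor off, so a single wrong floor costs as much as 15 units of horizontal error.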

In [4]:
feature_dir = f"{base_path}/input/indoorunifiedwifids"
train_files = sorted(glob.glob(os.path.join(feature_dir, '*_train.csv')))
test_files = sorted(glob.glob(os.path.join(feature_dir, '*_test.csv')))
subm = pd.read_csv(f'{base_path}/input/indoor-location-navigation/sample_submission.csv', index_col=0)

In [5]:
with open(f'{feature_dir}/train_all.pkl', 'rb') as f:
    data = pickle.load(f)

with open(f'{feature_dir}/test_all.pkl', 'rb') as f:
    test_data = pickle.load(f)

In [6]:
# training target features
BSSID_FEATS = [f'bssid_{i}' for i in range(NUM_FEATS)]
RSSI_FEATS  = [f'rssi_{i}' for i in range(NUM_FEATS)]

In [7]:
# get numbers of bssids to embed them in a layer

wifi_bssids = []
for i in range(100):
    wifi_bssids.extend(data.iloc[:,i].values.tolist())
wifi_bssids = list(set(wifi_bssids))

wifi_bssids_size = len(wifi_bssids)
print(f'BSSID TYPES: {wifi_bssids_size}')

wifi_bssids_test = []
for i in range(100):
    wifi_bssids_test.extend(test_data.iloc[:,i].values.tolist())
wifi_bssids_test = list(set(wifi_bssids_test))

wifi_bssids_size = len(wifi_bssids_test)
print(f'BSSID TYPES: {wifi_bssids_size}')

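# merge train and test BSSIDs and drop the ones that appear in both
wifi_bssids = list(set(wifi_bssids) | set(wifi_bssids_test))
# +1 because the encoded labels are shifted by one before they reach the embedding
wifi_bssids_size = len(wifi_bssids) + 1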

BSSID TYPES: 61206
BSSID TYPES: 33042


In [8]:
# preprocess

le = LabelEncoder()
le.fit(wifi_bssids)
le_site = LabelEncoder()
le_site.fit(data['site_id'])

ss = StandardScaler()
ss.fit(data.loc[:,RSSI_FEATS])

StandardScaler()
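
The label shift applied to the encoded BSSIDs in the next cells can be checked on a toy vocabulary. The sketch below is only an illustration: it uses NumPy in place of `LabelEncoder`, and the BSSID strings are made up. It shows that labels shifted by one need an embedding table with one extra row.

```python
import numpy as np

# Made-up BSSID strings standing in for the train and test columns.
train_bssids = np.array(["aa", "bb", "cc", "aa"])
test_bssids = np.array(["bb", "dd"])

# Sorted unique vocabulary over both sets, like fitting one encoder on all BSSIDs.
vocab = np.unique(np.concatenate([train_bssids, test_bssids]))

# Encode by position in the vocabulary, then shift by one so index 0 stays unused.
train_encoded = np.searchsorted(vocab, train_bssids) + 1
test_encoded = np.searchsorted(vocab, test_bssids) + 1

# The largest index equals len(vocab), so the embedding needs len(vocab) + 1 rows.
num_embeddings = len(vocab) + 1
print(train_encoded, test_encoded, num_embeddings)
```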

In [9]:
data.loc[:,RSSI_FEATS] = ss.transform(data.loc[:,RSSI_FEATS])
for i in BSSID_FEATS:
    data.loc[:,i] = le.transform(data.loc[:,i])
    data.loc[:,i] = data.loc[:,i] + 1
    
data.loc[:, 'site_id'] = le_site.transform(data.loc[:, 'site_id'])

In [10]:
test_data.loc[:,RSSI_FEATS] = ss.transform(test_data.loc[:,RSSI_FEATS])
for i in BSSID_FEATS:
    test_data.loc[:,i] = le.transform(test_data.loc[:,i])
    test_data.loc[:,i] = test_data.loc[:,i] + 1
    
test_data.loc[:, 'site_id'] = le_site.transform(test_data.loc[:, 'site_id'])

In [11]:
site_count = len(data['site_id'].unique())
data.reset_index(drop=True, inplace=True)

In [12]:
set_seed(SEED)

## The model
The first Embedding layer is very important. <br>
Thanks to this layer, the model can make sense of the BSSID features. <br>
<br>
We concatenate all the features and feed them into the LSTM. <br>
<br>
If something is theoretically wrong, please correct me. Thank you in advance. 
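
The tensor shapes behind this description can be traced with a toy example. This is a minimal sketch, not the model itself: the sizes are made up, the site embedding and the dense layers are left out, and the RSSI values are concatenated directly.

```python
import torch
from torch import nn

# Toy sizes chosen only for illustration (not the notebook's real values).
num_bssids, num_feats, emb_dim, batch = 50, 4, 8, 3

emb = nn.Embedding(num_bssids, emb_dim)
bssid_ids = torch.randint(0, num_bssids, (batch, num_feats))  # encoded BSSIDs
rssi = torch.randn(batch, num_feats)                          # scaled RSSI values

# Each BSSID id becomes a vector; flattening joins them into one row per scan.
bssid_vec = torch.flatten(emb(bssid_ids), start_dim=-2)       # (3, 4 * 8)
features = torch.cat([bssid_vec, rssi], dim=-1)               # (3, 4 * 8 + 4)

# nn.LSTM expects (seq_len, batch, input_size) by default, so every scan is
# fed as a sequence of length 1, as the unsqueeze/transpose calls do in the model.
lstm = nn.LSTM(input_size=features.shape[-1], hidden_size=16)
out, _ = lstm(features.unsqueeze(0))                          # (1, 3, 16)
xy = nn.Linear(16, 2)(out.squeeze(0))                         # (3, 2)
print(bssid_vec.shape, features.shape, out.shape, xy.shape)
```

Because the sequence length is 1, the LSTM here acts on each scan independently rather than on a series of scans over time.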

In [13]:
class IndoorDataset(Dataset):
    def __init__(self, data, flag='TRAIN'):
        self.data = data
        self.flag = flag
    def __len__(self):
        return self.data.shape[0]
    def __getitem__(self, index):
        tmp_data = self.data.iloc[index]
        if self.flag == 'TRAIN':
            ## loading the data may take a long time
            return {
                'BSSID_FEATS':tmp_data[BSSID_FEATS].values.astype(float),
                'RSSI_FEATS':tmp_data[RSSI_FEATS].values.astype(float),
                'site_id':tmp_data['site_id'].astype(int),
                'x':tmp_data['x'],
                'y':tmp_data['y'],
                'floor':tmp_data['floor'],
            }
        else:
            return {
                'BSSID_FEATS':tmp_data[BSSID_FEATS].values.astype(float),
                'RSSI_FEATS':tmp_data[RSSI_FEATS].values.astype(float),
                'site_id':tmp_data['site_id'].astype(int)
            }
class simpleLSTM(nn.Module):
    def __init__(self, embedding_dim = 64, seq_len=20):
        super(simpleLSTM, self).__init__()
        # width of the concatenated features: BSSID embeddings + site embedding (2)
        # + projected RSSI features; 2562 with NUM_FEATS = 20 and embedding_dim = 64
        concat_dim = NUM_FEATS * embedding_dim + 2 + NUM_FEATS * embedding_dim
        self.emb_BSSID_FEATS = nn.Embedding(wifi_bssids_size, embedding_dim)
        self.emb_site_id = nn.Embedding(site_count, 2)
        self.lstm1 = nn.LSTM(input_size=256,hidden_size=128, dropout=0.3, bidirectional=False)
        self.lstm2 = nn.LSTM(input_size=128,hidden_size=16, dropout=0.1, bidirectional=False)
        self.lr = nn.Linear(NUM_FEATS, NUM_FEATS * embedding_dim)
        self.lr1 = nn.Linear(concat_dim, 256)
        self.lr_xy = nn.Linear(16, 2)
        self.lr_floor = nn.Linear(16, 1)
        self.batch_norm1 = nn.BatchNorm1d(NUM_FEATS)
        self.batch_norm2 = nn.BatchNorm1d(concat_dim)
        self.batch_norm3 = nn.BatchNorm1d(1)
        self.dropout = nn.Dropout(0.3)
    def forward(self, x):
        
        x_bssid = self.emb_BSSID_FEATS(x['BSSID_FEATS'])
        x_bssid = torch.flatten(x_bssid, start_dim=-2)
        
        x_site_id = self.emb_site_id(x['site_id'])
        x_site_id = torch.flatten(x_site_id, start_dim=-1)
        x_rssi = self.batch_norm1(x['RSSI_FEATS'])
        x_rssi = self.lr(x_rssi)
        x_rssi = torch.relu(x_rssi)
        
        x = torch.cat([x_bssid, x_site_id, x_rssi], dim=-1)
        x = self.batch_norm2(x)
        x = self.dropout(x)
        x = torch.relu(self.lr1(x))

        x = x.unsqueeze(-2)
        x = self.batch_norm3(x)
        x = x.transpose(0, 1)
        x, _ = self.lstm1(x)
        x = x.transpose(0, 1)
        x = torch.relu(x)
        x = x.transpose(0, 1)
        x, _ = self.lstm2(x)
        x = x.transpose(0, 1)
        x = torch.relu(x)
        xy = self.lr_xy(x)
        floor = self.lr_floor(x)
        floor = torch.relu(floor)
        return xy.squeeze(-2), floor.squeeze(-2)
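
One likely reason for the slow training is that `IndoorDataset.__getitem__` indexes a DataFrame row by row, which is much slower than indexing an array. The sketch below is an assumption about how that could be avoided, not a measured fix: `ArrayIndoorDataset` is a hypothetical class that converts the columns to NumPy arrays once, and the toy frame is made up.

```python
import numpy as np
import pandas as pd

class ArrayIndoorDataset:
    """Hypothetical map-style dataset that converts the columns to arrays once."""

    def __init__(self, frame, bssid_cols, rssi_cols, with_targets=True):
        self.bssid = frame[bssid_cols].to_numpy(dtype=np.int64)
        self.rssi = frame[rssi_cols].to_numpy(dtype=np.float32)
        self.site = frame["site_id"].to_numpy(dtype=np.int64)
        self.targets = None
        if with_targets:
            self.targets = frame[["x", "y", "floor"]].to_numpy(dtype=np.float32)

    def __len__(self):
        return len(self.site)

    def __getitem__(self, index):
        # Plain array indexing; no pandas objects are created per sample.
        item = {
            "BSSID_FEATS": self.bssid[index],
            "RSSI_FEATS": self.rssi[index],
            "site_id": self.site[index],
        }
        if self.targets is not None:
            item["x"], item["y"], item["floor"] = self.targets[index]
        return item

# Made-up frame with two Wi-Fi features per scan.
frame = pd.DataFrame({
    "bssid_0": [1, 3], "bssid_1": [2, 4],
    "rssi_0": [-0.5, 0.1], "rssi_1": [0.2, 0.3],
    "site_id": [0, 1], "x": [10.0, 20.0], "y": [1.0, 2.0], "floor": [0, 1],
})
toy_dataset = ArrayIndoorDataset(frame, ["bssid_0", "bssid_1"], ["rssi_0", "rssi_1"])
sample = toy_dataset[1]
print(sample)
```

Since the class implements `__len__` and `__getitem__` and returns the same keys as `IndoorDataset`, it should be usable with `DataLoader` in the same way, with `BSSID_FEATS` and `RSSI_FEATS` passed as the column lists.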

In [14]:
def evaluate(model, data_loader,  device='cuda'):
    model.to(device)
    model.eval()
    x_list = []
    y_list = []
    floor_list = []
    prexs_list = []
    preys_list = []
    prefloors_list = []
    data_dict = {}  # model inputs for the current batch
    for d in tqdm(data_loader):
        data_dict['BSSID_FEATS'] = d['BSSID_FEATS'].to(device).long()
        data_dict['RSSI_FEATS'] = d['RSSI_FEATS'].to(device).float()
        data_dict['site_id'] = d['site_id'].to(device).long()
        x = d['x'].to(device).float()
        y = d['y'].to(device).float()
        floor = d['floor'].to(device).long()
        x_list.append(x.cpu().detach().numpy())
        y_list.append(y.cpu().detach().numpy())
        floor_list.append(floor.cpu().detach().numpy())
        xy, floor = model(data_dict)
        prexs_list.append(xy[:, 0].cpu().detach().numpy())
        preys_list.append(xy[:, 1].cpu().detach().numpy())
        prefloors_list.append(floor.squeeze(-1).cpu().detach().numpy())
    x = np.concatenate(x_list)
    y = np.concatenate(y_list)
    floor = np.concatenate(floor_list)
    prexs = np.concatenate(prexs_list)
    preys =np.concatenate(preys_list)
    prefloors = np.concatenate(prefloors_list)
    eval_score = comp_metric(x, y, floor, prexs, preys, prefloors)
    return eval_score
def get_result(model, data_loader, device='cuda'):
    model.eval()
    model.to(device)
    prexs_list = []
    preys_list = []
    prefloors_list = []
    data_dict = {}
    for d in tqdm(data_loader):
        data_dict['BSSID_FEATS'] = d['BSSID_FEATS'].to(device).long()
        data_dict['RSSI_FEATS'] = d['RSSI_FEATS'].to(device).float()
        data_dict['site_id'] = d['site_id'].to(device).long()
        xy, floor = model(data_dict)
        prexs_list.append(xy[:, 0].cpu().detach().numpy())
        preys_list.append(xy[:, 1].cpu().detach().numpy())
        prefloors_list.append(floor.squeeze(-1).cpu().detach().numpy())
    prexs = np.concatenate(prexs_list)
    preys =np.concatenate(preys_list)
    prefloors = np.concatenate(prefloors_list)
    return prexs, preys, prefloors

In [15]:
score_df = pd.DataFrame()
oof = list()
predictions = list()

oof_x, oof_y, oof_f = np.zeros(data.shape[0]), np.zeros(data.shape[0]), np.zeros(data.shape[0])
preds_x, preds_y = 0, 0
preds_f_arr = np.zeros((test_data.shape[0], N_SPLITS))

# for fold, (trn_idx, val_idx) in enumerate(StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED).split(data.loc[:, 'path'], data.loc[:, 'path'])):
    
#     train_data = data.loc[trn_idx]
#     valid_data = data.loc[val_idx]
#     train_dataset = IndoorDataset(train_data)
#     train_dataloader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=6)
#     valid_dataset = IndoorDataset(valid_data)
#     valid_dataloader = DataLoader(valid_dataset, batch_size=128, shuffle=True, num_workers=6)
#     test_dataset = IndoorDataset(test_data, 'TEST')
#     test_dataloader = DataLoader(test_dataset, batch_size=128, shuffle=False, num_workers=6)
#     device = 'cuda' if torch.cuda.is_available() else 'cpu'
#     model = simpleLSTM()
#     model = model.to(device)
    
#     mse = nn.MSELoss()
#     mse = mse.to(device)
#     optim = torch.optim.Adam(model.parameters(), lr=5e-3)
    
#     data_dict ={}
#     best_loss = 1000
#     num_epochs = 1
#     best_epoch = 0
#     for epoch in range(num_epochs):
#         model.train()
#         losses = []
#         pbar = tqdm(train_dataloader)
#         for d in pbar:
#             data_dict['BSSID_FEATS'] = d['BSSID_FEATS'].to(device).long()
#             data_dict['RSSI_FEATS'] = d['RSSI_FEATS'].to(device).float()
#             data_dict['site_id'] = d['site_id'].to(device).long()
#             x = d['x'].to(device).float().unsqueeze(-1)
#             y = d['y'].to(device).float().unsqueeze(-1)
#             floor = d['floor'].to(device).long()
#             xy, floor = model(data_dict)
#             label = torch.cat([x, y], dim=-1)
#             loss = mse(xy, label)
#             loss.backward()
#             optim.step()
#             optim.zero_grad()
#             losses.append(loss.cpu().detach().numpy())
#             pbar.set_description(f'loss:{np.mean(losses)}')
#         score = evaluate(model, valid_dataloader, device)
#         if score < best_loss:
#             best_loss = score
#             best_epoch = epoch
#             best_model = copy.deepcopy(model)
#         if best_epoch + 2<epoch:
#             break
#         print("*="*50)
#         print(f"fold {fold} EPOCH {epoch}: mean position error {score}")
#         print("*="*50)
#     test_x, test_y, test_floor = get_result(best_model, test_dataloader, device)
#     preds_f_arr[:,fold] = test_floor
#     preds_x += test_x
#     preds_y += test_y

In [16]:
# preds_x /= (fold + 1)
# preds_y /= (fold + 1)
    
# print("*+"*40)
# # as it breaks in the middle of cross-validation, the score is not accurate at all.
# score = comp_metric(oof_x, oof_y, oof_f, data.iloc[:, -5].to_numpy(), data.iloc[:, -4].to_numpy(), data.iloc[:, -3].to_numpy())
# oof.append(score)
# print(f"mean position error {score}")
# print("*+"*40)

# preds_f_mode = stats.mode(preds_f_arr, axis=1)
# preds_f = preds_f_mode[0].astype(int).reshape(-1)
# test_preds = pd.DataFrame(np.stack((preds_f, preds_x, preds_y))).T
# test_preds.columns = subm.columns
# test_preds.index = test_data["site_path_timestamp"]
# test_preds["floor"] = test_preds["floor"].astype(int)
# predictions.append(test_preds)

In [17]:
# all_preds = pd.concat(predictions)
# all_preds = all_preds.reindex(subm.index)

## Fix the floor prediction
So far, this model does not predict the "floor" successfully with this dataset. <br>
To make it right, we can incorporate [@nigelhenry](https://www.kaggle.com/nigelhenry/)'s [excellent work](https://www.kaggle.com/nigelhenry/simple-99-accurate-floor-model). <br>
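
As a minimal sketch of this step, the floor column can be overwritten by matching rows on the submission index instead of relying on row order. The frames and index values below are made up, and the assumption is that both frames are indexed by `site_path_timestamp`.

```python
import pandas as pd

# Made-up position predictions, indexed like the submission file.
xy_preds = pd.DataFrame(
    {"floor": [0, 0], "x": [1.0, 2.0], "y": [3.0, 4.0]},
    index=pd.Index(["a_1", "b_2"], name="site_path_timestamp"),
)

# Made-up floor predictions from another submission, in a different row order.
floor_preds = pd.DataFrame(
    {"floor": [2, -1], "x": [0.0, 0.0], "y": [0.0, 0.0]},
    index=pd.Index(["b_2", "a_1"], name="site_path_timestamp"),
)

# Align on the index, then overwrite only the floor column.
xy_preds["floor"] = floor_preds["floor"].reindex(xy_preds.index).astype(int)
print(xy_preds)
```

If a row were missing from the floor submission, `reindex` would produce `NaN` and the `astype(int)` call would fail, which makes a mismatch visible instead of silently assigning the wrong floor.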

In [18]:
# simple_accurate_99 = pd.read_csv('../input/simple-99-accurate-floor-model/submission.csv')

# all_preds['floor'] = simple_accurate_99['floor'].values

In [19]:
# all_preds.to_csv('submission.csv')

That's it. 

Thank you for reading all of it.

I hope it helps!

Please leave a comment if you find something to point out or have any insights or suggestions. 