<!-- <img src="https://i.imgur.com/oiG2jZl.png"> -->
<center><h1>🧭Indoor Location and Navigation🧭</h1></center>

# 1. Introduction
> 📌**Goal**: Predict the indoor position of smartphones📱 from *real-time* sensor data🎯.

> We'll also learn how to use the **GitHub Repository** available through this competition and call the custom functions **without** copy-pasting them into the notebook.

### Libraries📚

In [None]:
# CPU libraries
import os
import json
import glob
import cv2
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import plotly.graph_objs as go

from PIL import Image, ImageOps
from skimage import io
from skimage.color import rgba2rgb, rgb2xyz
from tqdm import tqdm
from dataclasses import dataclass
from math import floor, ceil

mycolors = ["#797D62", "#9B9B7A", "#D9AE94", "#FFCB69", "#D08C60", "#997B66"]

In [None]:
# Get path to all TRAIN & TEST files
train_paths = glob.glob('../input/indoor-location-navigation/train/*/*/*')
test_paths = glob.glob('../input/indoor-location-navigation/test/*')
sites = glob.glob('../input/indoor-location-navigation/metadata/*')

print("No. Files in Train: {:,}".format(len(train_paths)), "\n" +
      "No. Files in Test: {:,}".format(len(test_paths)), "\n" +
      "Total Sites (metadata): {:,}".format(len(sites)))

In [None]:
# How 1 path looks
base = '../input/indoor-location-navigation'
path = f'{base}/train/5cd56b5ae2acfd2d33b58549/5F/5d06134c4a19c000086c4324.txt'

with open(path) as p:
    lines = p.readlines()

print("No. Lines in 1 example: {:,}".format(len(lines)), "\n" +
      "Example (5 lines): ", lines[0:5])

# 2. How to use a GitHub repo on Kaggle?🔗

> 📌**Goal**: This competition has a [GitHub repo](https://github.com/location-competition/indoor-location-competition-20) available. We can use the `read_data_file` function from the `io_f` file to read in the information (no extra work on our side and no copy-pasted, cluttered code :) ).

#### *🙏🏻Special thanks to [Laura](https://www.kaggle.com/allunia) for teaching me this awesome trick.🙏🏻*

**Steps**:
* 🦶🏻 - download the repo from [this link](https://github.com/location-competition/indoor-location-competition-20)
* 🦶🏻 - copy the package to the Kaggle environment (`!cp -r path/* ./`)
* 🦶🏻 - import it and use it as you please
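
As an alternative to copying the files, the repo folder could be put on `sys.path` so the modules stay where they are. This is only a minimal sketch: `import_from_dir` and the throwaway `demo_io_f` module are hypothetical names made up for illustration, not part of the competition repo.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_dir(module_name, directory):
    """Import a module that lives in an arbitrary folder (hypothetical helper)."""
    directory = str(Path(directory).resolve())
    if directory not in sys.path:
        sys.path.insert(0, directory)
    # Make sure freshly created files are visible to the import system
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Demo with a throwaway module standing in for the real `io_f.py`
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "demo_io_f.py").write_text(
    "def read_data_file(path):\n    return f'parsed {path}'\n"
)

demo_module = import_from_dir("demo_io_f", demo_dir)
demo_result = demo_module.read_data_file("example.txt")
print(demo_result)
```

On Kaggle the second argument would be the folder that holds `io_f.py`; the `!cp` approach used in this notebook works just as well.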

In [None]:
!cp -r ../input/indoor-locationnavigation-2021/indoor-location-competition-20-master/indoor-location-competition-20-master/* ./

In [None]:
# Import custom function from the repository
from io_f import read_data_file

# Read in 1 random example
sample_file = read_data_file(path)

# You can access the information for each variable:
print("~~~ Example ~~~")
print("acce: {}".format(sample_file.acce.shape), "\n" +
      "acce_uncali: {}".format(sample_file.acce_uncali.shape), "\n" +
      "ahrs: {}".format(sample_file.ahrs.shape), "\n" +
      "gyro: {}".format(sample_file.gyro.shape), "\n" +
      "gyro_uncali: {}".format(sample_file.gyro_uncali.shape), "\n" +
      "ibeacon: {}".format(sample_file.ibeacon.shape), "\n" +
      "magn: {}".format(sample_file.magn.shape), "\n" +
      "magn_uncali: {}".format(sample_file.magn_uncali.shape), "\n" +
      "waypoint: {}".format(sample_file.waypoint.shape), "\n" +
      "wifi: {}".format(sample_file.wifi.shape))

In [None]:
# Pull out the buildings actually used in the test set; with the current method we don't need the others
ssubm = pd.read_csv('../input/indoor-location-navigation/sample_submission.csv')

# Only 24 of the total buildings are used in the test set,
# which lets us greatly reduce the initial size of the dataset
ssubm_df = ssubm["site_path_timestamp"].apply(lambda x: pd.Series(x.split("_")))
ssubm_df.head()
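
The split above relies on the `site_path_timestamp` column having three parts joined by underscores. A small sketch of the same idea in plain Python, assuming each ID contains exactly two underscores; the IDs below are made up for illustration and `split_submission_id` / `used_sites` are hypothetical helper names.

```python
def split_submission_id(site_path_timestamp):
    """Split one 'site_path_timestamp' string into (site, path, timestamp)."""
    site, path, timestamp = site_path_timestamp.split("_")
    return site, path, int(timestamp)


def used_sites(ids):
    """Sorted list of the distinct sites that appear in a list of IDs."""
    return sorted({split_submission_id(i)[0] for i in ids})


# Made-up IDs, NOT real rows from sample_submission.csv
example_ids = [
    "siteA_pathB_0000000000009",
    "siteA_pathC_0000000000120",
    "siteD_pathE_0000000000300",
]

parts = split_submission_id(example_ids[0])
sites_in_test = used_sites(example_ids)
print(parts, sites_in_test)
```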

In [None]:
# Collect the Wi-Fi rows of the chosen building(s) and attach the nearest waypoint to each
base = '../input/indoor-location-navigation/train/'
# used_buildings = sorted(ssubm_df[0].value_counts().index.tolist())
used_buildings = ["5a0546857ecc773753327266"]

wifi_cols = ["Time", "ssid", "bssid", "RSSI", "last seen timestamp"]
frames = []

for building in used_buildings:
    print(building)
    folders = sorted(glob.glob(os.path.join(base, building, '*')))
    # Each folder is one floor; each .txt file inside it is one path
    for folder in folders:
        paths = glob.glob(os.path.join(folder, "*.txt"))
        for path in paths:
            temp = read_data_file(path)
            # Skip files without Wi-Fi records or without waypoints
            if temp.wifi.size == 0 or len(temp.waypoint) == 0:
                continue
            df_wifi = pd.DataFrame(data=temp.wifi, columns=wifi_cols)
            waypoint_times = temp.waypoint[:, 0].astype(np.int64)
            for j in range(df_wifi.shape[0]):
                # Nearest waypoint in time, searched afresh for every Wi-Fi row
                time_diff = np.abs(waypoint_times - int(df_wifi.loc[j, 'Time']))
                min_index = int(time_diff.argmin())
                df_wifi.loc[j, 'pointTime'] = int(temp.waypoint[min_index][0])
                df_wifi.loc[j, 'x'] = int(temp.waypoint[min_index][1])
                df_wifi.loc[j, 'y'] = int(temp.waypoint[min_index][2])
            # Append each file once, after all of its rows are labelled
            frames.append(df_wifi)

# Stack the per-file DataFrames on top of each other
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=wifi_cols)

df.head()
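
The cell above labels every Wi-Fi record with the waypoint that is closest in time. Here is a sketch of that matching step on its own, using only the standard library and assuming the waypoint timestamps are non-empty and sorted in ascending order. The timestamps are made up, and `nearest_index` is a hypothetical helper name.

```python
from bisect import bisect_left


def nearest_index(sorted_times, t):
    """Index of the value in sorted_times closest to t (ties go to the earlier one)."""
    pos = bisect_left(sorted_times, t)
    if pos == 0:
        return 0
    if pos == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[pos - 1], sorted_times[pos]
    return pos - 1 if t - before <= after - t else pos


# Made-up timestamps for illustration only
waypoint_times = [0, 50, 120]
wifi_times = [10, 55, 100, 85]

matches = [nearest_index(waypoint_times, t) for t in wifi_times]
print(matches)
```

Binary search keeps each lookup at O(log n), which starts to matter once a path has many waypoints and thousands of Wi-Fi rows.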

# 4. Baseline Model

> 📌**Note**: Preprocessed data is from [this dataset](https://www.kaggle.com/devinanzelmo/indoor-navigation-and-location-wifi-features) by [Devin Anzelmo](https://www.kaggle.com/devinanzelmo).

### RAPIDS function

In [None]:
from cupy import sqrt as sqrt_g
from cupy import power as power_g
from cupy import abs as abs_g

In [None]:
def mean_position_error_gpu(x_pred, y_pred, f_pred, x_true, y_true, f_true, p=15):
    '''GPU version of mean_position_error (defined below): same formula, but faster ;)
    Uses CuPy arrays, for the XGBoost model with RAPIDS later.'''
    
    N = len(x_true)
    formula = sqrt_g( power_g(x_pred - x_true, 2) + power_g(y_pred - y_true, 2) )
    formula = formula + p * abs_g(f_pred - f_true)
    formula = formula.sum() / N
    return formula

In [None]:
def mean_position_error(x_pred, y_pred, f_pred, x_true, y_true, f_true, p=15):
    '''Custom function to evaluate Mean Position Error.
    x: x coordinates of the waypoint positions; array-like (NumPy array or pandas Series)
    y: y coordinates of the waypoint positions; array-like
    f: floor numbers of the waypoints; array-like
    p: floor penalty, set to 15 (always)'''
    
    N = len(x_true)
    # 1. Euclidean distance in the horizontal plane
    formula = np.sqrt( np.power(x_pred - x_true, 2) + np.power(y_pred - y_true, 2) )
    # 2. Add the floor penalty
    formula = formula + p * np.absolute(f_pred - f_true)
    # 3. Average over all waypoints
    formula = formula.sum() / N
    
    return formula
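
To make the metric concrete, here is a tiny worked example in plain Python: the horizontal distance plus `p` times the floor difference, averaged over all waypoints. The numbers are made up, and `mpe_plain` is a hypothetical name used only for this sketch.

```python
from math import hypot


def mpe_plain(x_pred, y_pred, f_pred, x_true, y_true, f_true, p=15):
    """Mean Position Error written as an explicit loop over the waypoints."""
    total = 0.0
    for xp, yp, fp, xt, yt, ft in zip(x_pred, y_pred, f_pred, x_true, y_true, f_true):
        horizontal = hypot(xp - xt, yp - yt)   # Euclidean distance in the plane
        floor_penalty = p * abs(fp - ft)       # p per floor of error
        total += horizontal + floor_penalty
    return total / len(x_true)


# Waypoint 1: off by (3, 4) on the right floor -> contributes 5
# Waypoint 2: exact position, one floor off    -> contributes 15
score = mpe_plain([3.0, 0.0], [4.0, 0.0], [0, 1],
                  [0.0, 0.0], [0.0, 0.0], [0, 0])
print(score)
```

A single wrong floor costs as much as a 15-unit horizontal miss, which is why the floor model deserves attention.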

In [None]:
def make_submission(predictions, sample_subm, name="base.csv"):
    '''Receives a list of predictions in dataframe format.'''
    final_submission = pd.concat(predictions).reset_index(drop=True)
    final_submission.index = sample_subm.index
    final_submission.to_csv(name)
    print("Submission ready.")

## I. LightGBM

In [None]:
# Import Libraries
import lightgbm as lgb

# ~~~~
# Data
# ~~~~
base_dir = "../input/indoor-navigation-and-location-wifi-features/wifi_features"
train_dir = "/train/*_train.csv"
test_dir = "/test/*_test.csv"


# Paths for train & test files
train_paths = sorted(glob.glob(base_dir + train_dir))
test_paths = sorted(glob.glob(base_dir + test_dir))
sample_subm = pd.read_csv('../input/indoor-location-navigation/sample_submission.csv',
                          index_col=0)

print("Len Train Files: {}".format(len(train_paths)), "\n" +
      "Len Test Files: {}".format(len(test_paths)))

In [None]:
# Initialize new experiment (LGBM)
import wandb

run = wandb.init(project="indoor-location-kaggle", name="lgbm_train")

wandb.log({'Len Train Files' : len(train_paths),
           'Len Test Files' : len(test_paths)})

> Schema of the "Training loop" (inspired by [Jiwei Liu's work](https://www.kaggle.com/jiweiliu)):
<img src="https://i.imgur.com/UQmdRcz.png" width=750>

#### Code Below

In [None]:
def train_lgbm(train_perc=0.75, version=1, n_estimators=150, num_leaves=127):
    '''
    Training loop
    '''

    open(f"lgbm_logs_{version}.txt", "w").close()  # start with an empty log file
    lgbm_predictions = []
    
    # Log in W&B
    wandb.log({'n_estimators': n_estimators, 'num_leaves': num_leaves})
    
    k = 1
    for train_path, test_path in zip(train_paths, test_paths):


        # --- Read in data ---
        train_df = pd.read_csv(train_path, index_col=0)
        train_df = train_df.sample(frac=1, random_state=10)

        # Erase last column (which is "site_path_timestamp")
        test_df = pd.read_csv(test_path, index_col=0).iloc[:, :-1]

        # Sample out training and validation data
        ### we need to be careful to choose same information for ALL 3 models
        ### 1 for x, 1 for y and 1 for floor

        train_size = int(len(train_df) * train_perc)


        # --- Data Validation ---
        # Train features + targets
        X_train = train_df.iloc[:train_size, :-4]
        y_train_x = train_df.iloc[:train_size, -4]
        y_train_y = train_df.iloc[:train_size, -3]
        y_train_f = train_df.iloc[:train_size, -2]

        # Valid features + targets
        X_valid = train_df.iloc[train_size:, :-4]
        y_valid_x = train_df.iloc[train_size:, -4]
        y_valid_y = train_df.iloc[train_size:, -3]
        y_valid_f = train_df.iloc[train_size:, -2]


        # --- Model Training ---
        lgbm_x = lgb.LGBMRegressor(n_estimators=n_estimators, num_leaves=num_leaves)
        lgbm_x.fit(X_train, y_train_x)

        lgbm_y = lgb.LGBMRegressor(n_estimators=n_estimators, num_leaves=num_leaves)
        lgbm_y.fit(X_train, y_train_y)

        lgbm_f = lgb.LGBMClassifier(n_estimators=n_estimators, num_leaves=num_leaves)
        lgbm_f.fit(X_train, y_train_f)


        # --- Model Validation Predictions ---
        preds_x = lgbm_x.predict(X_valid)
        preds_y = lgbm_y.predict(X_valid)
        preds_f = lgbm_f.predict(X_valid).astype(int)
        
        mpe = mean_position_error(preds_x, preds_y, preds_f,
                                  y_valid_x, y_valid_y, y_valid_f)
        print("{} | MPE: {}".format(k, mpe))
        # Save logs
        with open(f"lgbm_logs_{version}.txt", 'a+') as f:
            print("{} | MPE: {}".format(k, mpe), file=f)
        
        # Log MPE of this experiment
        wandb.log({'MPE' : mpe, 'step' : k})
        
        k+=1


        # --- Model Test Predictions ---
        test_preds_x = lgbm_x.predict(test_df)
        test_preds_y = lgbm_y.predict(test_df)
        test_preds_f = lgbm_f.predict(test_df).astype(int)

        all_test_preds = pd.DataFrame({'floor' : test_preds_f,
                                       'x' : test_preds_x, 
                                       'y' : test_preds_y})
        lgbm_predictions.append(all_test_preds)
    
    
    return lgbm_predictions
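
The loop above shuffles once and then slices by position, so the x, y and floor models all see the same training and validation rows. A stand-alone sketch of that idea on row indices follows. It uses the standard library's `random` module rather than `DataFrame.sample`, so the actual row order will differ from the notebook's; `split_indices` is a hypothetical helper name.

```python
import random


def split_indices(n_rows, train_perc=0.75, seed=10):
    """Shuffle row indices once, then cut them into train and validation parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    train_size = int(n_rows * train_perc)
    return indices[:train_size], indices[train_size:]


train_idx, valid_idx = split_indices(8)

# Reusing the SAME index lists for all three targets keeps the models aligned
print(len(train_idx), len(valid_idx))
```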

In [None]:
train_df = pd.read_csv(train_paths[0], index_col=0)
# Same split size as the default train_perc=0.75 used in train_lgbm
train_size = int(len(train_df) * 0.75)
train_df.iloc[:train_size, :]

### Training

In [None]:
# Uncomment line below to train your own model
# lgbm_predictions = train_lgbm(train_perc = 0.75, version=1)

# # Logs from my training:
print(open('../input/indoor-locationnavigation-2021/lgbm_logs_1.txt', "r").read())

> ❗**Attention**: The *5th* dataframe had a BIG error (jumped from ~4 on average to 18). This case HAS to be taken into consideration, as the models seem to be underfitting. The *10th, 13th and 21st* have a big MPE as well.
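
Spotting such sites by eye gets tedious, so here is a sketch that parses the `k | MPE: value` lines written by the training loop and flags the steps whose error is well above the median. The log text below is made up for illustration (it is NOT my real log), and the factor of 2 is an arbitrary assumption; `parse_mpe_log` and `flag_outliers` are hypothetical helper names.

```python
from statistics import median


def parse_mpe_log(text):
    """Turn lines like '3 | MPE: 18.0' into a {step: mpe} dictionary."""
    scores = {}
    for line in text.splitlines():
        if "| MPE:" not in line:
            continue
        step, value = line.split("| MPE:")
        scores[int(step)] = float(value)
    return scores


def flag_outliers(scores, factor=2.0):
    """Steps whose MPE is more than `factor` times the median MPE."""
    typical = median(scores.values())
    return [step for step, mpe in sorted(scores.items()) if mpe > factor * typical]


# Made-up log content in the same format as lgbm_logs_1.txt
sample_log = "1 | MPE: 4.0\n2 | MPE: 5.0\n3 | MPE: 18.0\n4 | MPE: 3.0\n"

flagged = flag_outliers(parse_mpe_log(sample_log))
print(flagged)
```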

### Submission LGBM

In [None]:
# Uncomment line below to make your own submission
# make_submission(lgbm_predictions, sample_subm, name="lgbm_base.csv")

# My submission:
lgbm_predictions = pd.read_csv("../input/indoor-locationnavigation-2021/lgbm_base.csv")
lgbm_predictions.to_csv("lgbm_base.csv", index=False)

In [None]:
# ~ END of EXPERIMENT ~
wandb.finish()
# ~~~~~~~~~~~~~~~~~~~~~

## II. XGBoost - Faster with RAPIDS

> 📌**Note**: I will use a combination of **RAPIDS** libraries on GPU and XGBoost as one of my base models. More information on this open source suite of libraries [here](https://rapids.ai/).

In [None]:
# Libraries
import cudf
import cupy
import cuml
import xgboost

# Adjust floor function
### As the Multiclass XGBoost takes only labels between [0, n)
### But we have negative floor values
def adjust_floor(df, col_name):
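    '''Adjusts the floor to be >= 0.
    Also returns the number of classes (needed for the multiclass objective)
    and the offset that was subtracted.'''
    smallest = df[col_name].min()
    largest = df[col_name].max()
    # Labels must lie in [0, num_classes), so count the whole range in case floors have gaps
    num_classes = int(largest - smallest) + 1
    df[col_name] = df[col_name] - smallest
    
    return df[col_name], num_classes, smallest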
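
An explicit mapping is another way to get labels in [0, n): it keeps the labels dense even if a building's floor values have gaps, and it can be inverted after prediction. This is a sketch in plain Python rather than cuDF, with made-up floor values; `encode_floors` and `decode_floors` are hypothetical helper names.

```python
def encode_floors(floors):
    """Map each distinct floor value to a label 0..n-1 (sorted order)."""
    classes = sorted(set(floors))
    to_label = {floor: label for label, floor in enumerate(classes)}
    labels = [to_label[floor] for floor in floors]
    return labels, classes


def decode_floors(labels, classes):
    """Map predicted labels back to the original floor values."""
    return [classes[label] for label in labels]


# Made-up floors with a gap (no floor 1)
floors = [-1, 0, 2, 0, -1]

labels, classes = encode_floors(floors)
restored = decode_floors(labels, classes)
print(labels, classes, restored)
```

With this variant `num_class` would be `len(classes)`, and the test predictions would go through `decode_floors` instead of adding `smallest` back.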

In [None]:
# Initialize new experiment (XGB)
run = wandb.init(project="indoor-location-kaggle", name="xgb_train")

wandb.log({'Len Train Files' : len(train_paths),
           'Len Test Files' : len(test_paths)})

In [None]:
def train_xgb(train_perc=0.75, version=1):
    '''
    Training loop
    '''

    open(f"xgb_logs_{version}.txt", "w").close()  # start with an empty log file
    xgb_predictions = []
    
    
    k = 1
    for train_path, test_path in zip(train_paths, test_paths):


        # --- Read in data ---
        train_df = cudf.read_csv(train_path, index_col=0)
        train_df = train_df.sample(frac=1, random_state=10)
        train_df["f"], num_classes, smallest = adjust_floor(train_df, 'f')

        # Erase last column (which is "site_path_timestamp")
        test_df = cudf.read_csv(test_path, index_col=0).iloc[:, :-1]

        # Sample out training and validation data
        ### we need to be careful to choose same information for ALL 3 models
        ### 1 for x, 1 for y and 1 for floor

        train_size = int(len(train_df) * train_perc)


        # --- Data Validation ---
        # Train features + targets
        X_train = train_df.iloc[:train_size, :-4]
        y_train_x = train_df.iloc[:train_size, -4]
        y_train_y = train_df.iloc[:train_size, -3]
        y_train_f = train_df.iloc[:train_size, -2]

        # Valid features + targets
        X_valid = train_df.iloc[train_size:, :-4]
        y_valid_x = cupy.asanyarray(train_df.iloc[train_size:, -4])
        y_valid_y = cupy.asanyarray(train_df.iloc[train_size:, -3])
        y_valid_f = cupy.asanyarray(train_df.iloc[train_size:, -2])
        
        
        # --- Parameters ---
        regr_params = {'max_depth' : 4, 'max_leaves' : 2**4, 
                       'tree_method' : 'gpu_hist', 'objective' : 'reg:squarederror',
                       'grow_policy' : 'lossguide', 'colsample_bynode': 0.8}
        classif_params = {'max_depth' : 4, 'max_leaves' : 2**4,
                          'tree_method' : 'gpu_hist', 'objective' : 'multi:softmax',
                          'num_class' : num_classes, 'grow_policy' : 'lossguide',
                          'colsample_bynode': 0.8, 'verbosity' : 0}
        
        # Log once to W&B
        if k == 1:
            wandb.log(regr_params)
            wandb.log(classif_params)


        # --- Model Training ---
        trainMatrix_x = xgboost.DMatrix(data=X_train, label=y_train_x)
        xgboost_x = xgboost.train(params=regr_params, dtrain=trainMatrix_x)

        trainMatrix_y = xgboost.DMatrix(data=X_train, label=y_train_y)
        xgboost_y = xgboost.train(params=regr_params, dtrain=trainMatrix_y)

        trainMatrix_f = xgboost.DMatrix(data=X_train, label=y_train_f)
        xgboost_f = xgboost.train(params=classif_params, dtrain=trainMatrix_f)


        # --- Model Validation Predictions ---
        preds_x = cupy.asanyarray(xgboost_x.predict(xgboost.DMatrix(X_valid)))
        preds_y = cupy.asanyarray(xgboost_y.predict(xgboost.DMatrix(X_valid)))
        preds_f = cupy.asanyarray(xgboost_f.predict(xgboost.DMatrix(X_valid)).astype(int))

        mpe = mean_position_error_gpu(preds_x, preds_y, preds_f,
                                      y_valid_x, y_valid_y, y_valid_f)
        print("{} | MPE: {}".format(k, mpe))
        # Save logs
        with open(f"xgb_logs_{version}.txt", 'a+') as f:
            print("{} | MPE: {}".format(k, mpe), file=f)
        
        # Log MPE of this experiment
        wandb.log({'MPE' : mpe, 'step' : k})
        
        k+=1


        # --- Model Test Predictions ---
        test_preds_x = xgboost_x.predict(xgboost.DMatrix(test_df))
        test_preds_y = xgboost_y.predict(xgboost.DMatrix(test_df))
        test_preds_f = xgboost_f.predict(xgboost.DMatrix(test_df)).astype(int) + smallest

        all_test_preds = pd.DataFrame({'floor' : test_preds_f,
                                       'x' : test_preds_x, 
                                       'y' : test_preds_y})
        xgb_predictions.append(all_test_preds)
    
    
    return xgb_predictions

### Training

In [None]:
# Uncomment line below to train your own model
# xgb_predictions = train_xgb(train_perc=0.75, version=1)

# Logs from my training:
print(open('../input/indoor-locationnavigation-2021/xgb_logs_1.txt', "r").read())

### Submission

In [None]:
# Uncomment line below to make your own submission
# make_submission(xgb_predictions, sample_subm, name="xgb_base.csv")

# My submission:
xgb_predictions = pd.read_csv("../input/indoor-locationnavigation-2021/xgb_base.csv")
xgb_predictions.to_csv("xgb_base.csv", index=False)

In [None]:
# ~ END of EXPERIMENT ~
wandb.finish()
# ~~~~~~~~~~~~~~~~~~~~~

# 5. Save W&B Submissions and Logs

> We can save the predictions and logs to W&B.

In [None]:
# Submissions
### we can save the predictions in W&B
run = wandb.init(project='indoor-location-kaggle', name='submissions')
artifact = wandb.Artifact(name='submissions', 
                          type='dataset')

artifact.add_file("../input/indoor-locationnavigation-2021/lgbm_base.csv")
artifact.add_file("../input/indoor-locationnavigation-2021/xgb_base.csv")

wandb.log_artifact(artifact)
wandb.finish()

In [None]:
# Logs
### we can save the training logs in W&B
run = wandb.init(project='indoor-location-kaggle', name='training_logs')
artifact = wandb.Artifact(name='training_logs', 
                          type='dataset')

artifact.add_file("../input/indoor-locationnavigation-2021/lgbm_logs_1.txt")
artifact.add_file("../input/indoor-locationnavigation-2021/xgb_logs_1.txt")

wandb.log_artifact(artifact)
wandb.finish()

> The Artifact section of the project:
<img src="https://i.imgur.com/KqzJuFL.png">

# Blending Stirring Cooking 🥙

> First, let's compare the two models' predictions. (needs more work)

In [None]:
# # Read in data
# lgb_preds = pd.read_csv("../input/indoor-locationnavigation-2021/lgbm_base.csv")
# xgb_preds = pd.read_csv("../input/indoor-locationnavigation-2021/xgb_base.csv")

# sample_submission = pd.read_csv("../input/indoor-location-navigation/sample_submission.csv")

# # Sample Submission
# sample_submission["x"] = lgb_preds["x"] * 0.9 + xgb_preds["x"] * 0.1
# sample_submission["y"] = lgb_preds["y"] * 0.9 + xgb_preds["y"] * 0.1

# sample_submission.to_csv("blend1.csv", index=False)
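
The commented-out blend above is a weighted average of the two models' coordinates (0.9 for LightGBM, 0.1 for XGBoost). A minimal sketch of that arithmetic on plain lists, with made-up numbers; `blend` is a hypothetical helper name and the weights are just the ones I tried, not tuned values.

```python
def blend(preds_a, preds_b, weight_a=0.9):
    """Weighted average of two prediction lists; weights sum to 1."""
    if len(preds_a) != len(preds_b):
        raise ValueError("Both prediction lists must have the same length.")
    weight_b = 1.0 - weight_a
    return [a * weight_a + b * weight_b for a, b in zip(preds_a, preds_b)]


# Made-up x coordinates from two models
lgb_x = [10.0, 20.0]
xgb_x = [20.0, 40.0]

blended_x = blend(lgb_x, xgb_x)
print(blended_x)
```

Only `x` and `y` are averaged this way; the floor is a class label, so averaging it would not make sense without rounding or voting.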

<img src="https://i.imgur.com/cUQXtS7.png">

# Specs on how I prepped & trained ⌨️🎨
### (on my local machine)
* Z8 G4 Workstation 🖥
* 2 CPUs & 96GB Memory 💾
* NVIDIA Quadro RTX 8000 🎮
* RAPIDS version 0.17 🏃🏾‍♀️