# Data Analytics & Machine Learning - Miniproject
**Author: Eduard Pascale**

**Date: 1/05/2023**

## Short Intro

In this miniproject I tackle a problem for the [DAL module](https://fhict.instructure.com/courses/13020/assignments/218989?module_item_id=915417). As agreed with the teacher, I had to choose between two datasets to explore in depth. I chose the UJIIndoorLoc dataset because it is relevant to me: in my industry project I work on a similar problem, so I hope this dataset gives me the practice I need.

In this report, I will explain how to organize and analyze data in a way that helps make better decisions. I'll go over the steps to prepare data for analysis, explore the data to find patterns, and create a data model that can predict future outcomes. To show how this works, I'll use a set of data provided by my teacher. I'll also explain each step of the process in detail, so I can learn how to improve my data modeling skills.

## UJIIndoorLoc

The [UJIIndoorLoc Data Set](https://archive.ics.uci.edu/ml/datasets/UJIIndoorLoc) is a collection of WiFi signals data collected from access points placed in different indoor locations, such as offices, classrooms, and corridors. The data has been preprocessed to extract a set of features related to signal strength, position, and orientation, and it includes measurements of signal strengths from 520 access points. The goal of the dataset is to predict the indoor location of a person based on the received WiFi signals. The dataset has 19,937 instances and 529 features, making it a useful resource for indoor localization and navigation tasks, including classification, clustering, and regression.

source: https://archive.ics.uci.edu/ml/datasets/UJIIndoorLoc

In the machine learning repository, this dataset provides relevant papers for further study. As my first initiative, I'm going to look at a conference paper titled ["UJIIndoorLoc: A new multi-building and multi-floor database for WLAN fingerprint-based indoor localization problems"](https://www.researchgate.net/publication/283894296_UJIIndoorLoc_A_new_multi-building_and_multi-floor_database_for_WLAN_fingerprint-based_indoor_localization_problems).

## Relevant Papers

#### UJIIndoorLoc: A new multi-building and multi-floor database for WLAN fingerprint-based indoor localization problems

Automatic user localization is something many applications want to use. It works by using an electronic device, usually a mobile phone, to compute the user's latitude, longitude and altitude.
Indoor settings are problematic, however: a GPS sensor cannot be used to locate a user because the GPS signal is lost in indoor environments.

Many people have tried to use WLANs instead. This is an attractive approach because it requires no extra equipment: WLAN hardware is built into most smartphones, and WLANs are now ubiquitous. WLAN fingerprint-based positioning systems rely on the [Received Signal Strength Indicator (RSSI)](https://en.wikipedia.org/wiki/Received_signal_strength_indicator) value. First, a radio map of the area where users should be located is constructed; later, the user's device measures the signal strength of all visible WLAN access points so that the measurement can be compared with that map.
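As a rough illustration of the fingerprinting idea described above, the sketch below matches a newly observed RSSI vector against a tiny, made-up radio map by picking the closest stored fingerprint (Euclidean distance). The radio map values, the location labels and the helper name `locate` are assumptions for illustration only; they are not taken from the paper, and nearest-fingerprint matching is just one simple way to do the comparison.

```python
import numpy as np

# Hypothetical radio map: one RSSI fingerprint (in dBm) per known location.
# Columns are three made-up access points; all values are illustrative only.
radio_map = np.array([
    [-40.0, -70.0, -90.0],   # fingerprint recorded in "office"
    [-75.0, -45.0, -80.0],   # fingerprint recorded in "lab"
    [-90.0, -72.0, -50.0],   # fingerprint recorded in "corridor"
])
locations = ["office", "lab", "corridor"]


def locate(observed, radio_map, locations):
    """Return the label of the stored fingerprint closest to `observed`."""
    distances = np.linalg.norm(radio_map - observed, axis=1)
    return locations[int(np.argmin(distances))]


# A new scan taken by a user at an unknown position.
scan = np.array([-78.0, -48.0, -83.0])
estimated = locate(scan, radio_map, locations)
print(estimated)
```

In a real system the radio map would hold many fingerprints per location, and a more robust matcher (for example k nearest neighbours) would usually replace the single closest match.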

One important drawback in this field is the lack of a common database for comparison purposes. **The UJIIndoorLoc database is presented to overcome this gap.**

The UJIIndoorLoc database has the following characteristics:
* Covers a surface of 108703 m² including 3 buildings with 4 or 5 floors
* The number of different places is 933
* 21049 sampled points have been captured: 19938 for training/learning and 1111 for validation/testing
* Validation (or testing) samples were taken 4 months after the training ones
* The number of different wireless access points (WAPs) appearing in the database is 520
* Data were collected by more than 20 users using 25 different models of mobile devices

This work is based on an infrastructure-less approach that mainly takes advantage of the powerful sensors in mobile phones.

#### UJIIndoorLoc DATABASE DESCRIPTION

The whole database contains 21049 records. Each record consists of 529 numeric elements:

* 001-520 RSSI levels
* 521-523 Real world coordinates of the sample points
* 524 BuildingID
* 525 SpaceID
* 526 Relative position with respect to SpaceID
* 527 UserID
* 528 PhoneID
* 529 Timestamp
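The record layout above can be mirrored as a simple column split. The sketch below is a minimal example that assumes the column naming visible in the CSV header shown in the Datasets section of this report (`WAP001` to `WAP520` followed by nine metadata columns). The frame `demo` is an all-zero stand-in so the snippet runs without the CSV files; it is not real data.

```python
import pandas as pd

# Column names following the dataset's naming scheme: WAP001 ... WAP520,
# followed by the nine metadata columns.
wap_cols = [f"WAP{i:03d}" for i in range(1, 521)]
meta_cols = ["LONGITUDE", "LATITUDE", "FLOOR", "BUILDINGID", "SPACEID",
             "RELATIVEPOSITION", "USERID", "PHONEID", "TIMESTAMP"]

# Stand-in frame (two all-zero records) used only to make the sketch runnable.
demo = pd.DataFrame(0, index=range(2), columns=wap_cols + meta_cols)

rssi = demo[wap_cols]    # elements 001-520: RSSI levels
meta = demo[meta_cols]   # elements 521-529: coordinates, IDs and timestamp

print(rssi.shape, meta.shape)
```

Keeping the two groups apart makes it harder to accidentally feed identifiers such as `USERID` or `TIMESTAMP` into a model as if they were signal strengths.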

A total of 520 WAPs appear in the database, and the 520-element vector in each record contains the raw intensity levels of the WAPs detected in a single WiFi scan. A single scan cannot capture every possible WAP, so each WAP that was not detected is assigned an artificial RSSI value of +100 dBm. The paper also mentions that the main factors affecting the number of WAPs reported by a WiFi scan are the location and the phone model.
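Because the placeholder value of 100 is numerically larger than any real RSSI reading, it usually needs special handling before analysis. The sketch below shows one possible way to count the WAPs detected per scan and to replace the placeholder with `NaN`. The four-column frame `scans` and its values are made up for illustration, and the constant name `NOT_DETECTED` is my own.

```python
import numpy as np
import pandas as pd

NOT_DETECTED = 100  # artificial RSSI value used for WAPs missing from a scan

# Made-up scans over four hypothetical access points (values in dBm).
scans = pd.DataFrame(
    {
        "WAP001": [100, -60, 100],
        "WAP002": [-85, -70, 100],
        "WAP003": [100, 100, 100],
        "WAP004": [-92, -55, -97],
    }
)

# Number of WAPs actually detected in each scan.
detected_per_scan = (scans != NOT_DETECTED).sum(axis=1)

# Same data with the placeholder replaced by NaN, so that it cannot be
# mistaken for a very strong signal.
scans_nan = scans.replace(NOT_DETECTED, np.nan)

print(detected_per_scan.tolist())  # → [2, 3, 1]
```

Whether `NaN`, a very low value such as -105 dBm, or something else is the better replacement depends on the model used later; this choice is left open here.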

![image.png](attachment:image.png)
Fig. 1. Frequency distribution of the number of WAPs that are detected in a single capture.

The real-world coordinates of the sample points consist of the longitude, the latitude and the floor of the building. BuildingID corresponds to the building in which the capture was taken. SpaceID contains a single integer value that identifies the particular space (office, lab, etc.) where the capture was taken. The other vector positions are more or less self-explanatory.

The database is split into two subsets: the **training subset** and the **validation subset**.

In the training subset, the reference points were captured by at least two users.
In the validation subset, the measurements were taken as they would be in a real localization system; because of this, SpaceID and Relative position are not stored in the records.

#### Baseline

The objective of this work was not to provide an accurate indoor positioning system but to provide an objective database that can be used for comparing positioning systems and other algorithms based on WLAN fingerprinting.

#### REFERENCES

[1] Torres-Sospedra, J., Montoliu, R., Martínez-Usó, A., Lázaro-Gredilla, M., Huerta, J., & Belmonte, Ò. (2014). UJIIndoorLoc: A new multi-building and multi-floor database for WLAN fingerprint-based indoor localization problems. In 2014 International Conference on Indoor Positioning and Indoor Navigation (IPIN) (pp. 261-270). IEEE.

## What is my goal?

I want to create a machine learning model that can accurately predict the location of devices inside buildings using WLAN fingerprinting. The objective is to estimate the building, floor and coordinates (latitude and longitude).
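The goal combines a classification task (building, floor) with a regression task (latitude, longitude). The sketch below illustrates that split on synthetic data, using k-nearest-neighbours models from scikit-learn purely as a placeholder; the algorithm, the three-feature fingerprints and the coordinate values are all assumptions and do not represent the final model of this project.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in: two "buildings", each with its own typical fingerprint
# over three made-up access points (values in dBm).
centers = np.array([[-50.0, -90.0, -90.0], [-90.0, -90.0, -50.0]])
building = np.repeat([0, 1], 20)
X = centers[building] + rng.normal(scale=2.0, size=(40, 3))

# Made-up coordinates that depend only on the building.
coords = np.column_stack([building * 100.0, building * 50.0])

# Classification target: which building. Regression target: coordinates.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, building)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, coords)

query = np.array([[-52.0, -88.0, -91.0]])  # scan close to building 0
print(clf.predict(query), reg.predict(query))
```

The floor could be handled the same way as the building, with its own classifier trained on the `FLOOR` column.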

## Research Questions

What is the best machine learning algorithm for indoor localization predictions?

How do the number and placement of access points in a building impact the accuracy of indoor localization predictions?

## Context Diagram

This diagram shows the steps involved in building a machine learning model for indoor localization using the UJIIndoorLoc dataset.

![image-2.png](attachment:image-2.png)

## Loading the libraries

I first load some libraries that provide the methods required throughout the entire process.

In [2]:
import numpy as np  # import auxiliary library, typical idiom
import pandas as pd  # import the Pandas library, typical idiom

Now, I selectively import the relevant classes and functions from `sklearn`
[(_SciKit Learn_, a Python library for machine learning)](http://scikit-learn.org/).

In [3]:
from sklearn.model_selection import train_test_split
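For reference, a minimal sketch of how `train_test_split` is typically called is given below. The toy arrays and the parameter values (`test_size=0.25`, `random_state=42`, `stratify=y`) are assumptions chosen for the example, not settings taken from this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (8 samples, 2 features) and matching class labels.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Hold out 25% of the samples; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible between notebook runs.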

## Datasets

This is the step where I get to know my data for the first time after the research.

In [7]:
df_train = pd.read_csv("Data/trainingData.csv")
df_val = pd.read_csv("Data/validationData.csv")

In [8]:
df_train.head(5)

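| Unnamed: 0 | WAP001 | WAP002 | WAP003 | WAP004 | WAP005 | WAP006 | WAP007 | WAP008 | WAP009 | WAP010 | ... | WAP520 | LONGITUDE | LATITUDE | FLOOR | BUILDINGID | SPACEID | RELATIVEPOSITION | USERID | PHONEID | TIMESTAMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ... | 100 | -7541.2643 | 4864921.0 | 2 | 1 | 106 | 2 | 2 | 23 | 1371713733 |
| 1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ... | 100 | -7536.6212 | 4864934.0 | 2 | 1 | 106 | 2 | 2 | 23 | 1371713691 |
| 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | -97 | 100 | 100 | ... | 100 | -7519.1524 | 4864950.0 | 2 | 1 | 103 | 2 | 2 | 23 | 1371714095 |
| 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ... | 100 | -7524.5704 | 4864934.0 | 2 | 1 | 102 | 2 | 2 | 23 | 1371713807 |
| 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ... | 100 | -7632.1436 | 4864982.0 | 0 | 0 | 122 | 2 | 11 | 13 | 1369909710 |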


In [9]:
df_val.head(5)

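| Unnamed: 0 | WAP001 | WAP002 | WAP003 | WAP004 | WAP005 | WAP006 | WAP007 | WAP008 | WAP009 | WAP010 | ... | WAP520 | LONGITUDE | LATITUDE | FLOOR | BUILDINGID | SPACEID | RELATIVEPOSITION | USERID | PHONEID | TIMESTAMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ... | 100 | -7515.916799 | 4864890.0 | 1 | 1 | 0 | 0 | 0 | 0 | 1380872703 |
| 1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ... | 100 | -7383.867221 | 4864840.0 | 4 | 2 | 0 | 0 | 0 | 13 | 1381155054 |
| 2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ... | 100 | -7374.30208 | 4864847.0 | 4 | 2 | 0 | 0 | 0 | 13 | 1381155095 |
| 3 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ... | 100 | -7365.824883 | 4864843.0 | 4 | 2 | 0 | 0 | 0 | 13 | 1381155138 |
| 4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ... | 100 | -7641.499303 | 4864922.0 | 2 | 0 | 0 | 0 | 0 | 2 | 1380877774 |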


In [10]:
df_train.shape

(19937, 529)