# TMDA (Transportation Mode Detector and Analyzer)

## Introduction

Transportation mode detection (TMD) is a well-known sub-task of Human Activity Recognition (HAR), which aims to understand what activities a user is performing from the data produced during those activities.

For this project, only the classes bus, car, still, train and walking are considered transportation modes. The data used to train the classifier come from accelerometer and gyroscope sensors, read at a frequency of 20 Hz.

To give more context about each detection, location data (latitude and longitude) are also included, but they are not used in the training phase.

## Workflow

<img src="./workflow/WorkFlow.drawio.png" width="800" />

## Data Acquisition

Data are collected via the mobile app **Phyphox**, which lets the user select the sensors to read from and the reading frequency (in this case 20 Hz).

Once a recording is started, sensor data are read; when it ends, they can be exported as a zip file containing one CSV for each sensor, plus metadata (such as the time of recording and the device used).

A Python script watches a folder containing these zip files and, when there is something to read, extracts the sensor CSVs and merges them into a single CSV called **sensors.csv**.

**update_sensors_csv()** also filters out sensor data recorded more than 5 seconds after the start of the recording, so in the end only the first 5 seconds of readings are kept.

In the future this could be an embedded function of the recording app.

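```python
import glob
from time import sleep

# get_earliest_zipfile() and update_sensors_csv() are project helpers (not shown here)
while True:
    # get a list of zip files in the watched folder
    zip_list = glob.glob('*.zip')
    # if it is not empty, merge the earliest archive into sensors.csv
    if len(zip_list) > 0:
        zf = get_earliest_zipfile(zip_list)
        update_sensors_csv(zf)
    # wait 5 seconds before polling again
    sleep(5)
```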
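
The 5-second filter performed in **update_sensors_csv()** is not shown in this section, so the following is only a minimal sketch of the idea under stated assumptions. The helper name `keep_first_seconds`, the column name `Time (s)` and the sample rows are hypothetical and chosen for illustration; the real function may work differently.

```python
import csv
import io

# Assumed name of the timestamp column; adjust it to the actual CSV header.
TIME_COLUMN = "Time (s)"
MAX_SECONDS = 5.0


def keep_first_seconds(rows, time_column=TIME_COLUMN, max_seconds=MAX_SECONDS):
    """Keep only the rows recorded within max_seconds of the earliest timestamp."""
    rows = list(rows)
    if not rows:
        return []
    start = min(float(row[time_column]) for row in rows)
    return [row for row in rows if float(row[time_column]) - start <= max_seconds]


# Hypothetical sample data standing in for one merged sensor CSV.
sample = io.StringIO(
    "Time (s),x,y,z\n"
    "0.00,0.1,0.2,9.8\n"
    "2.50,0.0,0.3,9.7\n"
    "5.00,0.2,0.1,9.9\n"
    "5.05,0.1,0.1,9.8\n"
)
kept = keep_first_seconds(csv.DictReader(sample))
print(len(kept))
```

Here the reading taken at 5.05 seconds is dropped, while the three readings up to and including the 5-second mark are kept.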

## Data Ingestion

Ingestion is provided by Logstash, which constantly reads the lines appended to sensors.csv and sends them to a Kafka topic called **sensors-raw**.

That's it, thank you Logstash <3.

<p align="center">
  <img src="./memes/logstash_heaven.jpg"/>
</p>

## Stream Processing

A Kafka cluster is responsible for handling the streams coming from Logstash and storing them efficiently in specific topics.

Data are also replicated across different Kafka brokers, so if one of them goes down (temporarily or permanently) we don't lose any data.

There are two main topics: **sensors-raw** (for data coming from Logstash) and **sensors** (for data cleaned by a Spark container).

## Data Cleaning

Before making any prediction, data must be cleaned in order to extract features like **mean**, **min**, **max** and **stddev** over a 5-second time window.

For this reason session windows were used.

A session window begins with a single data point and extends itself whenever the next element arrives within the gap period.
After the last accepted item, the window closes once no further items arrive within the gap period.

Note that we can't control the size of a window, which is determined by the events themselves.
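
To make the session-window behaviour more concrete, here is a small pure-Python sketch, not the project's Spark job. It groups timestamped readings into sessions by a gap period and then computes the four features for each session. The function names, the gap value of 5 seconds and the sample readings are assumptions made for illustration, and treating an event that falls exactly on the gap as the start of a new session is also an assumption.

```python
from statistics import mean, stdev


def session_windows(events, gap):
    """Group (timestamp, value) pairs into session windows.

    An event joins the current window if it arrives less than `gap`
    seconds after the previous event; otherwise it opens a new window.
    """
    windows = []
    for ts, value in sorted(events):
        if windows and ts - windows[-1][-1][0] < gap:
            windows[-1].append((ts, value))
        else:
            windows.append([(ts, value)])
    return windows


def window_features(window):
    """Compute mean, min, max and sample standard deviation for one window."""
    values = [value for _, value in window]
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        # stdev needs at least two points; use 0.0 for a single reading
        "stddev": stdev(values) if len(values) > 1 else 0.0,
    }


# Hypothetical readings: three close together, then a long pause, then two more.
events = [(0.00, 1.0), (0.05, 2.0), (0.10, 3.0), (7.00, 10.0), (7.05, 12.0)]
windows = session_windows(events, gap=5.0)
features = [window_features(w) for w in windows]
print(len(windows), features[0]["mean"], features[1]["mean"])
```

The pause of almost 7 seconds between the third and fourth readings exceeds the gap, so the readings fall into two windows of different sizes. This is the sense in which the events, not the user, determine the size of a window.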