# Data extraction from trace
This notebook takes the `.pcap` files generated from the trace capture and uses them to create a CSV file containing the features and labels that feed our classifier.

In [18]:
import scapy
import pandas as pd
import os
from typing import List, Tuple

In [19]:
"""
Represents a packet by its:
- direction: IN (to client) == 1, OUT (to server) == 0
- timestamp
- length
"""
Packet = Tuple[int, float, int]


## Client-Server interaction
We first recall how the client-server interaction is implemented; it is described in `handout_project_secretstroll.pdf`. Looking at the implementation of `server.py` and `client.py`, we can make a few key observations:
* When querying the server, the client sequentially:
  * retrieves the list of PoIs (**whose size can vary considerably between cells**)
  * iterates through each PoI in order: **no shuffling** is performed at any step, so we can assume that two queries for the same cell produce the **same sequence of PoI info queries** to the server.
  * queries the server for the info of each PoI. Note that the **number of ratings may vary a lot from one PoI to another**, which makes a PoI info response trace fairly unique.
* When receiving the client's queries, the server:
  * returns the list of nearby PoIs (probably in the same order across queries, as explained above). There is a catch here: the **given implementation does not seem to take the variety of possible queries into account**. Regardless of the PoI `types` queried (the user's subscriptions), the server always answers with **the full list of PoIs in the cell** (`records = PoI.query.filter_by(grid_id=cell_id).all()`). This does not match the functionality described in the handout, but it means our training set does not need to cover different types of queries: only the cell ID will vary. If we were attacking a fully functional system, we would need a much larger training set covering the different subscriptions and their combinations.
  * when queried for the info of a PoI, answers with all of its information. Note that the **number of ratings associated with it has the most significant impact on the trace**.
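
To see why the absence of shuffling matters, here is a toy model of the response sizes a cell produces. The function name, the size constants and the rating counts are all made up for illustration; none of them is taken from `server.py` or from real traces.

```python
from typing import List


def simulated_response_sizes(ratings_per_poi: List[int],
                             base_size: int = 200,
                             bytes_per_rating: int = 150,
                             bytes_per_poi_entry: int = 20) -> List[int]:
    """Toy model: one PoI-list response, then one info response per PoI, in order.

    All size constants are hypothetical; only the structure of the
    sequence mirrors the client behaviour described above.
    """
    # First response: the PoI list, which grows with the number of PoIs.
    sizes = [base_size + bytes_per_poi_entry * len(ratings_per_poi)]
    # Then one response per PoI, in list order (no shuffling),
    # each growing with the number of ratings of that PoI.
    sizes.extend(base_size + bytes_per_rating * n for n in ratings_per_poi)
    return sizes


# Hypothetical cells: each entry is the number of ratings of one PoI.
cell_a = [3, 0, 12]
cell_b = [1, 7]

# Same cell, same order: identical size sequence on every query.
assert simulated_response_sizes(cell_a) == simulated_response_sizes(cell_a)
# Different cells: the sequences differ in both length and values here.
assert simulated_response_sizes(cell_a) != simulated_response_sizes(cell_b)
```

In this model the sequence of sizes acts as a signature of the cell, which is the intuition the classifier is expected to exploit.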

With this in mind, we can now analyse the data we received.

## Received data
First, the data we receive comes as `.pcap` files. These were filtered (see `capture.sh`) to contain only:
* **TCP packets**: packets from other protocols (ARP requests, DNS traffic, etc.) are not considered relevant to the application.
* **packets that carry a TCP payload**: pure TCP ACKs and other control packets are filtered out, as they depend on network conditions and would add noise to the fingerprint generated by the application layer.
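
The second criterion boils down to a check on the header length fields. Below is a minimal sketch operating on a raw Ethernet frame, assuming IPv4 over Ethernet. `has_tcp_payload` is a hypothetical helper written for this explanation; it is not the filter actually used in `capture.sh`.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
IPPROTO_TCP = 6


def has_tcp_payload(frame: bytes) -> bool:
    """Return True if `frame` is an IPv4/TCP packet with at least one payload byte."""
    # Need at least an Ethernet header (14 bytes) and a minimal IP header (20 bytes).
    if len(frame) < 14 + 20:
        return False
    (ethertype,) = struct.unpack_from(">H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return False
    # IP header length is stored in 32-bit words in the low nibble of the first byte.
    ip_header_len = (frame[14] & 0x0F) * 4
    (ip_total_len,) = struct.unpack_from(">H", frame, 16)
    # The protocol field sits at offset 9 of the IP header.
    if frame[14 + 9] != IPPROTO_TCP:
        return False
    tcp_start = 14 + ip_header_len
    if len(frame) < tcp_start + 13:
        return False
    # TCP data offset is stored in 32-bit words in the high nibble of byte 12.
    tcp_header_len = (frame[tcp_start + 12] >> 4) * 4
    # Payload size = IP total length minus both header lengths.
    return ip_total_len - ip_header_len - tcp_header_len > 0
```

Because the check relies on the IP total length field rather than on the number of captured bytes, it still works when the capture truncates each packet.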

From these packets, we can retrieve:
* **Ethernet header** (14 bytes, bytes 0 to 13). It will probably not be relevant for us, as it depends on the internal Docker networking (or whichever LAN environment is used).
* **IP header** (20 bytes, bytes 14 to 33). These bytes mostly contain the routing information (source IP, destination IP, flags, etc.). The main information we want to retrieve from it is the direction of the packets we observe (query or response).
* **TCP header** (20 bytes, bytes 34 to 53). These bytes contain the stream information (sequence and ACK numbers, flags, window size, optional timestamps, etc.); the payload size itself follows from the IP total length minus the header lengths. All of this can be very useful to distinguish one trace from another through the client-server interaction implementation.
* **TLS record** (10 bytes captured, bytes 54 to 63). Bytes 54 to 58 form the record header, which carries the content type, the protocol version and the record length. Bytes 59 to 63 are the first bytes of the encrypted data; the rest is not recorded as it is not meaningful for us. Indeed, the headers already contain the length metadata, and we would not learn much more from a correctly encrypted payload (short of cryptographic attacks, which do not seem reasonable since TLS 1.2 is used here).
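
The byte ranges above can be read directly with `struct`. This is only a sketch under the assumptions stated in the list: a 14-byte Ethernet header, a 20-byte IP header and a 20-byte TCP header (no IP or TCP options), with a TLS record starting right at byte 54. `parse_fixed_offsets` and the dictionary keys are hypothetical names chosen for this example.

```python
import struct
from typing import Dict, Union


def parse_fixed_offsets(frame: bytes) -> Dict[str, Union[int, str]]:
    """Extract a few fields from a captured frame using the fixed offsets above."""
    if len(frame) < 59:
        raise ValueError("frame too short to contain a TLS record header")
    # IP header: source address at bytes 26-29, destination at bytes 30-33.
    src_ip = ".".join(str(b) for b in frame[26:30])
    dst_ip = ".".join(str(b) for b in frame[30:34])
    # TCP header: ports, then sequence and acknowledgement numbers.
    src_port, dst_port, seq, ack = struct.unpack_from(">HHII", frame, 34)
    # TLS record header: content type (1 byte), version (2), length (2).
    content_type, version, record_len = struct.unpack_from(">BHH", frame, 54)
    return {
        "src_ip": src_ip,
        "dst_ip": dst_ip,
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "tls_content_type": content_type,
        "tls_version": version,
        "tls_record_length": record_len,
    }
```

If a packet carries TCP options, or if a segment starts in the middle of a TLS record, the bytes at offset 54 are not a record header and the TLS fields returned here would be meaningless, so the IP and TCP length fields remain the safer source for sizes.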

From this, we are able to define the data we'll want to keep and output to our classifier.

## Desired data model
First, we must decide on the way we'll represent our data as features.

Label (cell_id queried) | Trace (packet list generated)
:---: | :---: 
1 | `[pkt1,pkt2,...]`
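
One possible way to store this two-column layout in a CSV file is to serialise each trace as a JSON string. This is a sketch using only the standard library; the column names `cell_id` and `trace` and the helper `write_rows` are assumptions made for this example, and the notebook itself may well rely on pandas instead.

```python
import csv
import io
import json
from typing import Iterable, List, TextIO, Tuple

Packet = Tuple[int, float, int]  # (direction, timestamp, length)


def write_rows(rows: Iterable[Tuple[int, List[Packet]]], out: TextIO) -> None:
    """Write one CSV line per trace: the label, then the trace as a JSON list."""
    writer = csv.writer(out)
    writer.writerow(["cell_id", "trace"])
    for cell_id, trace in rows:
        # JSON turns each packet tuple into a 3-element list.
        writer.writerow([cell_id, json.dumps(trace)])


# Usage with made-up values: one trace of two packets for cell 1.
buffer = io.StringIO()
write_rows([(1, [(0, 0.0, 120), (1, 0.05, 1500)])], buffer)
```

Reading the file back requires a `json.loads` on the `trace` column, which returns each packet as a list rather than a tuple.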

We must now decide on a way to represent our packets. To reduce the amount of input and redundant information, we can drop the IP addresses: there are only two of them, so all they tell us is whether a packet goes in or out, which a single direction flag captures.
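
As a minimal sketch of that reduction, the destination address can be collapsed into the direction flag of the `Packet` tuple defined earlier. The helper `to_packet`, its `client_ip` parameter and the addresses in the usage lines are hypothetical.

```python
from typing import Tuple

Packet = Tuple[int, float, int]  # (direction, timestamp, length), as defined earlier


def to_packet(dst_ip: str, timestamp: float, length: int, client_ip: str) -> Packet:
    """Replace the IP addresses by a direction flag: 1 == to client, 0 == to server."""
    direction = 1 if dst_ip == client_ip else 0
    return (direction, timestamp, length)


# Usage with made-up addresses: the client is assumed to be 172.18.0.3.
query = to_packet("172.18.0.2", 0.000, 180, client_ip="172.18.0.3")
response = to_packet("172.18.0.3", 0.042, 1500, client_ip="172.18.0.3")
```

With only two endpoints, checking the destination alone is enough; the source address carries no extra information.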

## Extraction from pcap file to dict

In [21]:
def from_file_to_packets(path: str) -> dict[int, List[Packet]]:
    # TODO: parse the pcap at `path` into a {cell_id: [Packet, ...]} mapping
    return
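
The stub above is left unimplemented. Purely as an illustration, here is one way such an extraction could look using only the standard library. It assumes classic pcap files (not pcapng) with microsecond timestamps, Ethernet/IPv4 framing, and file names ending in `cell_<id>.pcap`. The function names, the extra `client_ip` parameter and the choice of the original on-wire length as the packet length are all assumptions, not the notebook's actual implementation.

```python
import re
import struct
from typing import Dict, List, Tuple

Packet = Tuple[int, float, int]  # (direction, timestamp, length), as defined earlier


def cell_id_from_path(path: str) -> int:
    """Read the label from a file name such as 'trace_..._cell_12.pcap'."""
    match = re.search(r"cell_(\d+)\.pcap$", path)
    if match is None:
        raise ValueError(f"no cell id found in {path!r}")
    return int(match.group(1))


def read_pcap_records(data: bytes) -> List[Tuple[float, int, bytes]]:
    """Return (timestamp, original length, captured bytes) for each record."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written on a big-endian machine
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    records = []
    offset = 24  # skip the global header
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack_from(
            endian + "IIII", data, offset)
        offset += 16
        records.append((ts_sec + ts_usec / 1e6, orig_len,
                        data[offset:offset + incl_len]))
        offset += incl_len
    return records


def records_to_packets(records: List[Tuple[float, int, bytes]],
                       client_ip: bytes) -> List[Packet]:
    """Turn pcap records into Packet tuples; `client_ip` is 4 raw address bytes."""
    packets = []
    for timestamp, orig_len, frame in records:
        if len(frame) < 34:
            continue  # too short to hold the Ethernet and IP headers
        # Destination IP sits at bytes 30-33 of the frame.
        direction = 1 if frame[30:34] == client_ip else 0
        packets.append((direction, timestamp, orig_len))
    return packets


def from_file_to_packets_sketch(path: str,
                                client_ip: bytes) -> Dict[int, List[Packet]]:
    """Map the cell id encoded in the file name to the packets of that trace."""
    with open(path, "rb") as handle:
        records = read_pcap_records(handle.read())
    return {cell_id_from_path(path): records_to_packets(records, client_ip)}
```

Using the original length rather than the captured length matters here, since the capture only keeps the first bytes of each packet. A library such as scapy would replace `read_pcap_records` in practice; the manual parser is shown only to keep the example self-contained.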

In [22]:
for root, dirs, files in os.walk('traces'):
    print([file for file in files if file.endswith('.pcap')])

['trace_01_04_08h_cell_1.pcap', 'trace_01_04_08h_cell_10.pcap', 'trace_01_04_08h_cell_100.pcap', 'trace_01_04_08h_cell_11.pcap', 'trace_01_04_08h_cell_12.pcap', 'trace_01_04_08h_cell_13.pcap', 'trace_01_04_08h_cell_14.pcap', 'trace_01_04_08h_cell_15.pcap', 'trace_01_04_08h_cell_16.pcap', 'trace_01_04_08h_cell_17.pcap', 'trace_01_04_08h_cell_18.pcap', 'trace_01_04_08h_cell_19.pcap', 'trace_01_04_08h_cell_2.pcap', 'trace_01_04_08h_cell_20.pcap', 'trace_01_04_08h_cell_21.pcap', 'trace_01_04_08h_cell_22.pcap', 'trace_01_04_08h_cell_23.pcap', 'trace_01_04_08h_cell_24.pcap', 'trace_01_04_08h_cell_25.pcap', 'trace_01_04_08h_cell_26.pcap', 'trace_01_04_08h_cell_27.pcap', 'trace_01_04_08h_cell_28.pcap', 'trace_01_04_08h_cell_29.pcap', 'trace_01_04_08h_cell_3.pcap', 'trace_01_04_08h_cell_30.pcap', 'trace_01_04_08h_cell_31.pcap', 'trace_01_04_08h_cell_32.pcap', 'trace_01_04_08h_cell_33.pcap', 'trace_01_04_08h_cell_34.pcap', 'trace_01_04_08h_cell_35.pcap', 'trace_01_04_08h_cell_36.pcap', 'trace_01