# Dataset Preparation

This notebook inspects the point cloud that is fed to the model in detail. The goal is to confirm that the input data is clean and matches our expectations. The notebook walks through the theoretical motivation and the practical implementation of each transformation, so that a bad dataset can be ruled out as a cause of suboptimal performance.

## Imports, Directory Roots and Statics

A quick reminder: in this repo, the notebook should be run in the "base" environment so that it selects the correct, working kernel for the Python notebook.

In [6]:
# Importing needed packages
import torch
import numpy as np 
import os

# Setting correct directory roots for the data
DATA_ROOT = "/home/innolidix/Pointnet_Pointnet2_pytorch/data/testdata"
NUM_CLASSES = 2

## Checking the existing structure of the input point cloud

Before creating any new datasets, we first inspect the existing dataset to determine whether it is correct.

### Data Loading

First, we check the data loading: reading the file, splitting it into points and labels, inspecting the label distribution, and deriving the label weights from it.

In [7]:
# Creating path to input point cloud
input_data_path = os.path.join(DATA_ROOT, "data_labelled_int.npy")

# Loading in the current labelled test data
input_data = np.load(input_data_path)

# Splitting points and labels
points, labels = input_data[:, 0:6], input_data[:, 6]

# Getting counts of labelled points per class
labelweights = np.zeros(NUM_CLASSES)
tmp, _ = np.histogram(labels, range(NUM_CLASSES + 1))
labelweights += tmp
# Creating inverse-frequency label weights (total count / per-class count)
model_labelweights = np.sum(labelweights) / labelweights

# Retrieving minimum and maximum of coordinates for this file
coord_min, coord_max = np.amin(points, axis=0)[:3], np.amax(points, axis=0)[:3]
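
The label weights computed above are plain inverse class frequencies, so a rare class receives a proportionally larger weight. The following is a minimal sketch of that calculation on its own, assuming integer-valued class labels stored as floats. The helper name `inverse_frequency_weights` is hypothetical, and it uses `np.bincount` instead of the `np.histogram` call in the cell above. The counts in the second example are the ones printed by the `labelweights` cell at the end of this notebook.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Return (counts, weights) with weights[c] = total / counts[c].

    Hypothetical helper for illustration; not part of the repository.
    """
    # Labels come out of the combined array as floats, so cast before counting.
    counts = np.bincount(labels.astype(np.int64), minlength=num_classes)
    counts = counts.astype(np.float64)
    if np.any(counts == 0):
        # A missing class would otherwise produce an infinite weight.
        raise ValueError("at least one class has no points")
    return counts, counts.sum() / counts

# Toy example: three points of class 0 and one of class 1
# give weights 4/3 and 4.
toy_labels = np.array([0.0, 0.0, 0.0, 1.0])
counts, weights = inverse_frequency_weights(toy_labels, num_classes=2)
print(counts, weights)

# Same formula applied to the counts reported in this notebook:
# roughly 1.01 for class 0 and 80.8 for class 1.
notebook_counts = np.array([654326.0, 8199.0])
notebook_weights = notebook_counts.sum() / notebook_counts
print(notebook_weights)
```

With these counts, class 1 makes up about 1.2% of the points, which is why its weight is roughly 80 times larger than that of class 0.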

This completes the basic loading operations. Keep in mind that the data currently used for the overtraining has already been transformed: the point cloud has been rotated and translated. Hence the minimum coordinate is 0, while the maximum is some positive value. Next, we look at the operations performed in the *__getitem__* method so that we can exactly recreate the data the model currently sees.

### *__getitem__* transformations

The *__getitem__* magic method applies two main transformations: <br>
- Normalization of the point cloud coordinates to a randomly chosen center <br>
- Normalization of the RGB values of the points

In [16]:
# Variables needed
N_points = points.shape[0]
block_size = 4.0
num_point = 4096

# TRADITIONAL SPATIAL NORMALIZATION AFTER SAMPLING
while True:
    center = points[np.random.choice(N_points)][:3]
    # Creating block minimum and maximum
    block_min = center - [block_size / 2.0, block_size / 2.0, 0]
    block_max = center + [block_size / 2.0, block_size / 2.0, 0]
    # Pulling indices of points within the boundaries
    point_idxs = np.where((points[:,0] >= block_min[0]) &
                          (points[:,0] <= block_max[0]) &
                          (points[:,1] >= block_min[1]) &
                          (points[:,1] <= block_max[1]))[0]
    if point_idxs.size > 1024:
        break

# (Sub-) Sampling points from input point cloud data
if point_idxs.size >= num_point:
    selected_point_idxs = np.random.choice(point_idxs, num_point, replace=False)
else:
    selected_point_idxs = np.random.choice(point_idxs, num_point, replace=True)

# SPATIAL NORMALIZATION according to randomly selected center point
selected_points = points[selected_point_idxs, :]
current_points = np.zeros((num_point, 9))
# Shifting of X and Y coordinate according to randomly selected center
selected_points[:, 0] = selected_points[:, 0] - center[0]
selected_points[:, 1] = selected_points[:, 1] - center[1]

# RGB NORMALIZATION
selected_points[:, 3:6] /= 255

# Creating new columns with the spatially normalized values
# (note: X and Y have already been shifted by the center at this point)
current_points[:, 6] = selected_points[:, 0] / coord_max[0]
current_points[:, 7] = selected_points[:, 1] / coord_max[1]
current_points[:, 8] = selected_points[:, 2] / coord_max[2]

# Piecing it together to return selected current points
current_points[:, 0:6] = selected_points
current_labels = labels[selected_point_idxs]


array([288.875869, 226.199528, 177.267736, 255.      , 255.      ,
       255.      ])

### Adjusted new getitem transformations

Below are the adjusted transformations for the getitem method. The main change is removing the uncontrolled random choices implemented there: we still want to select randomly, but we want to select the same points every time. One option is to set a random seed to ensure this.
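
One way to get a random but repeatable selection is to pass an explicitly seeded NumPy `Generator` into the sampling step. The following is a minimal sketch under several assumptions: the function name `sample_block_indices`, the `max_tries` guard, and the synthetic `demo_points` array are hypothetical and only stand in for the real data and the real getitem code. The seed value 42 is arbitrary.

```python
import numpy as np

def sample_block_indices(points, rng, block_size=4.0, num_point=4096,
                         min_points=1024, max_tries=1000):
    """Pick a block around a random center and sample num_point indices.

    Hypothetical helper: all randomness comes from `rng`, so the same
    seed always yields the same center and the same indices.
    """
    half = block_size / 2.0
    for _ in range(max_tries):
        center = points[rng.integers(points.shape[0]), :3]
        # Indices of points inside the XY block around the center
        in_block = np.where(
            (points[:, 0] >= center[0] - half)
            & (points[:, 0] <= center[0] + half)
            & (points[:, 1] >= center[1] - half)
            & (points[:, 1] <= center[1] + half)
        )[0]
        if in_block.size > min_points:
            break
    else:
        raise RuntimeError("no block with enough points found")
    # Sample with replacement only if the block holds too few points
    replace = in_block.size < num_point
    return center, rng.choice(in_block, num_point, replace=replace)

# Synthetic stand-in for the real point cloud (assumed layout: XYZ + RGB).
data_rng = np.random.default_rng(0)
demo_points = np.column_stack([
    data_rng.uniform(0.0, 10.0, size=(20000, 3)),
    data_rng.uniform(0.0, 255.0, size=(20000, 3)),
])

# Two generators with the same seed produce the same selection.
center_a, idxs_a = sample_block_indices(demo_points, np.random.default_rng(42))
center_b, idxs_b = sample_block_indices(demo_points, np.random.default_rng(42))
print(np.array_equal(idxs_a, idxs_b))
```

Passing the generator as an argument, rather than calling `np.random.seed` globally, keeps the reproducibility local to this step and leaves other random operations in the notebook unaffected.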

In [12]:
labelweights

array([654326.,   8199.])