'Argparse module' 

In the context of Python's 'argparse' module, 'Namespace' is a simple class used to hold the attributes and values parsed from command-line arguments. When you use 'ArgumentParser' to parse arguments, its 'parse_args()' method returns a 'Namespace' object that contains the arguments as attributes. This makes it easy to access the values passed from the command line in a structured and organised way.
Benefits of using 'Namespace':
1) Structured access: the 'Namespace' object provides a clear and structured way to access the command-line arguments.
2) Attribute access: attributes are accessed using dot notation (e.g. 'args.input'), making the code readable and easy to manage.
3) Flexibility: you can easily add or remove arguments without changing how they are accessed in your code.



In [None]:
#example
import argparse
parser= argparse.ArgumentParser(description='Example script') #create an instance of the 'ArgumentParser' class 
#the 'description' parameter provides a brief description of the script, which is displayed when the user runs 
#the script with the '--help' flag

#define expected arguments to the script
parser.add_argument('--input', type=str, help='Input file path') #the add_argument method specifies the command-line arguments that the script expects
#'--input': the name of the argument; its value is expected to be a string, and the help message explains that it should be an input file path.
parser.add_argument('--output', type=str, help='Output file path')
#this is expected to be a string representing the output file path.
parser.add_argument('--verbose', action= 'store_true', help='Increase output verbosity')
#this is a flag ('store_true'), meaning that it doesn't take a value. if the flag is present, 'args.verbose' will be True, otherwise False.

#parse the arguments
args= parser.parse_args() #this method parses the command-line arguments provided to the script and 
#returns them as an 'argparse.Namespace' object, where the attributes correspond to the defined arguments.
#the parsed arguments can be accessed as attributes of the 'args' object.
#example: args.input, args.output, args.verbose
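
A minimal sketch of working with the returned 'Namespace', assuming the same three arguments as above. The argument values ('in.txt') are made up for illustration. Passing an explicit list to parse_args() replaces the real command line, which is useful in a notebook, where the command line typically belongs to the kernel and a bare parse_args() call may reject it.

```python
import argparse


def build_parser():
    """Recreate the parser from the example above."""
    parser = argparse.ArgumentParser(description='Example script')
    parser.add_argument('--input', type=str, help='Input file path')
    parser.add_argument('--output', type=str, help='Output file path')
    parser.add_argument('--verbose', action='store_true',
                        help='Increase output verbosity')
    return parser


# The list stands in for what would normally be typed on the command line.
args = build_parser().parse_args(['--input', 'in.txt', '--verbose'])

print(args.input)    # in.txt
print(args.output)   # None (not supplied, so the default is used)
print(args.verbose)  # True
print(vars(args))    # the same values as a plain dict
```

vars(args) is handy when the parsed arguments need to be logged or saved alongside a training run.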

MMCV library over standard computer vision libraries 
1. consistency and flexibility: 
- Unified interface: MMCV provides a unified interface for building various types of layers and models, for more consistency across different parts of the project.
- flexible configuration: MMCV's layer builders, like 'build_conv_layer', allow layers to be defined from configuration rather than hard-coded. This can be particularly useful when experimenting with different architectures or hyperparameters.




8-bit Unsigned Integer Format and its suitability for image data
An 8-bit unsigned integer is a data type that can represent integer values ranging from 0 to 255. This format is widely used in image processing because:
- color representation: most standard image formats, such as JPEG or PNG, typically use 8-bit unsigned integers to represent pixel intensity values. for each color channel (red, green, blue) the values range from 0 (no intensity) to 255 (full intensity).
- memory efficiency: using 8 bits per channel per pixel is a balance between image quality and memory/storage efficiency, making it suitable for a wide range of applications from simple graphics to complex processing tasks.

Clipping between 0 and 1
Clipping refers to limiting the values of a variable within a specified range. In this context, np.clip(x.cpu().numpy(), 0, 1) ensures that all values in the tensor are constrained to the interval [0,1]. This is important for several reasons: 
1. normalisation: many machine learning models output values in a normalised range [0,1]. clipping ensures that any outliers are adjusted to fit within this expected range.

2. image representation: before scaling to 255, values need to be between 0 and 1 to represent valid image intensities when converted to an 8-bit format.
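
A small sketch of clipping followed by conversion to 8 bits, using a plain numpy array in place of a tensor. The input values are made up to include a slight undershoot and overshoot.

```python
import numpy as np

# Hypothetical model output: mostly in [0, 1], with values slightly outside.
x = np.array([-0.1, 0.0, 0.5, 1.0, 1.2])

clipped = np.clip(x, 0, 1)               # out-of-range values become 0 or 1
img8 = (255 * clipped).astype(np.uint8)  # astype truncates: 127.5 -> 127

print(img8)        # [  0   0 127 255 255]
print(img8.dtype)  # uint8
```

Without the clip, values such as 1.2 would scale to 306, which does not fit in the 0 to 255 range of an 8-bit unsigned integer.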


Visualisation and CPU usage
while GPU operations are generally faster for computational tasks, visualisation and certain data manipulations need to be performed on the CPU for the following reasons:
1) library compatibility: visualisation libraries (e.g. matplotlib, PIL) and some data processing libraries do not directly support GPU tensors. they require data in CPU memory, typically as numpy arrays.
2) Data transfer: Before visualising, data often needs to be transferred from the GPU to CPU. This is because the rendering process typically occurs on the CPU, even though GPUs are used for accelerating computations.



Lambda functions
to8b = lambda x: (255 * np.clip(x.cpu().numpy(), 0, 1)).astype(np.uint8)

the function takes a single argument x (a pytorch tensor) and returns an 8-bit numpy array
.cpu(): moves the tensor from GPU memory to CPU memory. this is necessary because numpy operations do not work directly on GPU tensors.
.numpy(): converts the tensor to a numpy array. this step is needed because numpy functions like np.clip operate on numpy arrays not pytorch tensors.
np.clip: limits the values in the array to the range [0,1]. this ensures that all values are between 0 and 1 which is common in image processing to normalise pixel values.
scaling to 255: scales the clipped values to the range [0,255]. this is necessary because pixel values in 8-bit images are typically in this range
.astype(np.uint8): converts the scaled values to 8-bit unsigned integers, discarding any fractional part.

general syntax: lambda arguments: expression
where arguments: a comma-separated list of arguments
expression: an expression that is evaluated and returned
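
A short illustration of the general syntax, showing that a lambda is equivalent to a one-line def. The function names here are made up for the example.

```python
# A lambda and the equivalent def: both take x and return the expression's value.
square = lambda x: x * x


def square_def(x):
    return x * x


# Several arguments are separated by commas, as in the general syntax.
scale_and_shift = lambda x, scale, shift: x * scale + shift

print(square(4), square_def(4))      # 16 16
print(scale_and_shift(0.5, 255, 0))  # 127.5
```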

Understanding checkpoint serialisation in pytorch (torch.save() and torch.load())
Saving and loading a general checkpoint model for inference or resuming training can be helpful for picking up where you last left off. When saving a general checkpoint, you must save more than just the model's state_dict. It is important to also save the optimizer's state_dict, as this contains buffers and parameters that are updated as the model trains. Other items that you may want to save are the epoch you left off on, the latest recorded training loss, external torch.nn.Embedding layers and more.
To save several items in one checkpoint, you must organise them in a dictionary and use torch.save() to serialise the dictionary. A common pytorch convention is to save these checkpoints using the .tar file extension. To load the items, first initialise the model and optimiser, then load the dictionary locally using torch.load(). From here, you can easily access the saved items by simply querying the dictionary as you would expect.



Tangent on serialisation in pytorch:
serialisation in the context of computer programming refers to the process of converting an object (in this case, a dictionary) into a format that can be easily stored and later reconstructed. this is useful for saving the state of a machine learning model and its optimiser during training, so training can be resumed from that point later.

Checkpoint serialisation with pytorch:
in pytorch, saving checkpoints involves serialising a dictionary of objects (model state, optimizer state, etc) using torch.save(). the dictionary structure typically follows a key-value pair format. 

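#Example
import torch
import torch.nn as nn
import torch.optim as optim

class Net(nn.Module):  #minimal placeholder model (an assumption); substitute your own architecture
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
    def forward(self, x):
        return self.fc(x)

net = Net()  #the model being trained
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

EPOCH = 5
PATH = "model.pt"
LOSS = 0.4
#SAVE THE GENERAL CHECKPOINT
torch.save({
            'epoch': EPOCH,
            'model_state_dict': net.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': LOSS,
            }, PATH)
#LOAD THE GENERAL CHECKPOINT
##First initialise the model and optimiser, then load the dictionary locally 
model = Net()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

model.eval()   #call this for inference
# - or -
model.train()  #call this to resume training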

Checkpoint serialisation in pytorch involves saving the state dictionaries of the model and optimiser along with other relevant information into a dictionary. this dictionary is then saved to a file using torch.save(). by organising the saved info with clear keys, it becomes easy to load and restore the training state using torch.load(). 

Exponential Moving Average for metrics like loss and PSNR during training

Using an exponential moving average (EMA) provides advantages over simply keeping the original, raw values. Why?
1. smoothing noisy data
purpose: training loss and other metrics can be quite noisy, especially in the early stages of training. this noise can make it difficult to discern trends and assess the overall progress of the training process.
EMA advantage: EMA smooths out short-term fluctuations, highlighting longer-term trends and making the data easier to interpret. 
2. reacting to changes:
original loss values can vary widely from one iteration to the next due to several factors. this variability can obscure important changes in training dynamics, making it hard to assess the model's progress accurately. Reasons why:
- randomness in mini-batch gradient descent (SGD)
- non smooth landscape: numerous local minima, saddle points etc
- regularization techniques: dropout and batch normalisation introduce additional variability 
3. reducing overfitting risk
purpose: relying on raw loss values can overemphasise outliers, so decisions such as early stopping or checkpoint selection may end up fitted to noise rather than to the underlying trend.



In summary: EMA smooths out short-term fluctuations, making the overall trend more apparent. stability: it provides a more stable and reliable metric for monitoring and logging, which helps in making decisions about adjusting hyperparameters or determining early stopping.
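
A minimal sketch of an EMA update applied to a logged metric. The weight alpha = 0.4 and the loss values are assumptions chosen for illustration, and ema_update is a hypothetical helper name.

```python
def ema_update(previous, value, alpha=0.4):
    """Blend a new raw value into the running average.

    alpha is the weight given to the newest value (an assumed setting);
    a smaller alpha means heavier smoothing.
    """
    return alpha * value + (1.0 - alpha) * previous


raw_losses = [1.0, 0.2, 0.9, 0.1, 0.8]  # made-up noisy loss values

ema = raw_losses[0]  # seed the average with the first observation
smoothed = [ema]
for loss in raw_losses[1:]:
    ema = ema_update(ema, loss)
    smoothed.append(ema)

for raw, smooth in zip(raw_losses, smoothed):
    print(f"raw={raw:.2f}  ema={smooth:.3f}")
```

The raw values swing between 0.1 and 1.0, while the smoothed values after the seed stay within a much narrower band, which is what makes the trend easier to read in a progress bar or log.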


The os.listdir() method is used to get the list of all files and directories in the specified directory. If we don't specify any directory, then a list of files and directories in the current working directory will be returned.
Output: returns a list of the names of the entries present in that folder (in arbitrary order)



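os.path.join(): a function in os.path, the sub-module of the os module for pathname manipulation. It joins one or more path components using the correct separator for the operating system.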
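
A small sketch combining the two calls, using a temporary directory so that it does not depend on any local files. The file names are made up.

```python
import os
import tempfile

# Build a throwaway directory containing two empty files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("b.png", "a.png"):
        open(os.path.join(folder, name), "w").close()

    # listdir returns names only, in arbitrary order, so sort them.
    names = sorted(os.listdir(folder))
    # join each name back onto the folder to get a usable path.
    paths = [os.path.join(folder, name) for name in names]

print(names)  # ['a.png', 'b.png']
```

Sorting matters when the file order is meant to correspond to frame order, as with image sequences.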


A PLY file (polygon file format) is a computer file format used to store three-dimensional data such as 3D scans and point clouds. It is commonly used in the field of computer graphics and 3D modeling. Here are the key characteristics of PLY files:
- 3D data storage: PLY files are used to store 3D models including the geometry (vertices) and attributes (color, normals, texture coordinates)
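
To make the format concrete, here is a sketch that writes a tiny coloured point cloud as an ASCII PLY file and reads the text back. The helper name write_ascii_ply and the point values are made up; real pipelines often use the binary variant of the format and a dedicated library instead.

```python
import os
import tempfile


def write_ascii_ply(path, points, colors):
    """Write vertices (x, y, z) with 8-bit RGB colours as an ASCII PLY file."""
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
        "property uchar red",
        "property uchar green",
        "property uchar blue",
        "end_header",
    ]
    with open(path, "w") as f:
        f.write("\n".join(header) + "\n")
        for (x, y, z), (r, g, b) in zip(points, colors):
            f.write(f"{x} {y} {z} {r} {g} {b}\n")


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
colors = [(255, 0, 0), (0, 255, 0)]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "points.ply")
    write_ascii_ply(path, points, colors)
    with open(path) as f:
        ply_lines = f.read().splitlines()

print("\n".join(ply_lines))
```

The header declares what each vertex line contains, so a reader can tell that the six numbers per line are a position followed by a colour.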


Structure from Motion (COLMAP)
COLMAP is a popular tool for structure-from-motion and multi-view stereo which is often used to generate 3D point clouds from a series of 2D images.

1. image input: COLMAP takes a set of overlapping images as input
2. feature extraction: detects and extracts keypoints (features) from each image using methods like SIFT
3. feature matching: matches these keypoints across different images to find correspondences
4. camera pose estimation: estimates the relative camera positions and orientations for each image and triangulates the matched keypoints, creating a sparse 3D point cloud of the scene.
5. Bundle adjustment: optimises the camera parameters and the 3D point positions to minimise the reprojection error, refining the sparse point cloud.
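
The reprojection error that bundle adjustment minimises can be illustrated for a single point. This is a sketch under simplifying assumptions: a pinhole camera with one focal length and no distortion, a point already expressed in camera coordinates, and made-up numbers. It is not COLMAP's implementation.

```python
import math


def project(point_cam, f, cx, cy):
    """Pinhole projection of a 3D point given in camera coordinates."""
    x, y, z = point_cam
    return (f * x / z + cx, f * y / z + cy)


def reprojection_error(point_cam, observed_px, f, cx, cy):
    """Pixel distance between the projected point and the detected keypoint."""
    u, v = project(point_cam, f, cx, cy)
    return math.hypot(u - observed_px[0], v - observed_px[1])


f, cx, cy = 1000.0, 640.0, 360.0  # assumed intrinsics
point = (0.1, -0.05, 2.0)         # current estimate of the 3D point
observed = (691.0, 334.0)         # where the keypoint was detected

print(project(point, f, cx, cy))  # (690.0, 335.0)
print(round(reprojection_error(point, observed, f, cx, cy), 4))  # 1.4142
```

Bundle adjustment adjusts all camera parameters and 3D points jointly so that the sum of such errors (usually squared) over every observation is as small as possible.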

How point clouds are incorporated in 4D Gaussian Splatting:
1. initialisation: the point cloud obtained from COLMAP serves as the initialisation for the 3D positions of the Gaussians.
2. 4D Extension: each 3D gaussian is extended to 4D by associating it with a temporal dimension. this can be done by capturing point clouds at different times and associating each with a corresponding time stamp.
3. hierarchical modeling: as described in the paper, the model may use a hierarchical approach to refine the representation from coarse to fine stages, ensuring that both spatial and temporal details are captured accurately.


Real cameras vs virtual cameras in the 4D Gaussian Splatting setting

Real cameras: used to capture images from different angles around a real-world scene. This involves setting up physical cameras or moving a single camera around the scene to capture multiple viewpoints.
dataset creation: the collected images and their corresponding camera parameters (e.g. position, orientation) form the training dataset for the model. They are also used to create a point cloud of the scene.

Virtual cameras:
rendering and training: once the real camera data is collected, it is used to create a 3D representation of the scene. During the training and rendering processes, virtual cameras are used within the 3D space.
virtual cameras are software constructs that simulate the behavior of real cameras within a virtual environment. they can be placed at any position and orientation to render the scene from different viewpoints. 

'torch.utils.data.Dataset'
part of pytorch's torch.utils.data module, which provides utilities for loading data.

torch.utils.data.Dataset is the base class for all datasets. Your custom dataset should inherit from this class and override the following methods:
'__len__': this method should return the size of the dataset.
'__getitem__': this method should support integer indexing in the range from 0 to 'len(self)', exclusive.

Inheritance and Extension:
When you create a custom dataset in Pytorch, you typically extend the torch.utils.data.Dataset class. In doing so, you need to implement the required methods ('__len__' and '__getitem__') to make your dataset compatible with pytorch's data loading utilities.

Example: 
How 'FourDGSdataset' extends 'torch.utils.data.Dataset':
'FourDGSdataset' is defined to inherit from 'torch.utils.data.Dataset'. This means 'FourDGSdataset' is a subclass of 'Dataset'. 

'__init__' method:
initialises the dataset object. It stores the dataset, arguments, and dataset type as instance variables.

'__getitem__' method:
retrieves an item from the dataset at the specified index.
It handles different dataset types, extracting relevant info and constructing a 'Camera' object for non-'PanopticSports' datasets.

'__len__' method:
returns the number of items in the dataset.

So by extending 'torch.utils.data.Dataset' and implementing the necessary methods 'FourDGSdataset' becomes compatible with Pytorch's data loading utilities. Specifically, it can be used with torch.utils.data.DataLoader to create batches of data, shuffle the data, and load data in parallel using multiple worker processes.
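
The pattern described above can be sketched with a toy dataset. 'ToyCameraDataset' is a hypothetical stand-in, not the actual 'FourDGSdataset' code: it returns an (image, time) pair of tensors instead of a 'Camera' object, and the data is random.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyCameraDataset(Dataset):
    """Hypothetical stand-in for a dataset such as 'FourDGSdataset'."""

    def __init__(self, images, times):
        self.images = images  # tensor of shape (N, 3, H, W)
        self.times = times    # tensor of shape (N,)

    def __getitem__(self, index):
        # Return one sample: an image and its time stamp.
        return self.images[index], self.times[index]

    def __len__(self):
        return len(self.images)


dataset = ToyCameraDataset(torch.rand(10, 3, 4, 4), torch.linspace(0, 1, 10))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_shapes = [tuple(images.shape) for images, times in loader]
print(len(dataset))  # 10
print(batch_shapes)  # [(4, 3, 4, 4), (4, 3, 4, 4), (2, 3, 4, 4)]
```

Because the class implements '__len__' and '__getitem__', DataLoader can batch and shuffle it without knowing anything else about how the samples are produced.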



after training run render.py
point cloud
github link for rotating
test.py (SSIM)



colmap.sh:
a typical COLMAP processing script includes the following steps:
1. feature extraction: detect features in the input images located in 'data/hypernerf/virg/broom2'
2. feature matching: match the detected features across images to find correspondences
3. structure from motion: reconstruct the camera positions and sparse 3D structure from the matched features
4. dense reconstruction: generate a dense point cloud from the sparse reconstruction
5. output: save the generated models and point clouds in a specified output directory.


**gaussian_model.py (scene)**



Spherical harmonics are functions defined on the surface of a sphere and are widely used in computer graphics to model the angular variation of data such as light intensity and other properties across a spherical domain.


In the context of 3D gaussians:
SH degree (self.active_sh_degree): this sets how many spherical-harmonic bands are currently in use. A higher degree lets the colour of a Gaussian vary in a more detailed way with the viewing direction.
The SH coefficients encode the appearance of the Gaussian splat in different directions. Higher SH degrees allow for more accurate and detailed modeling of the Gaussian's appearance.
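
The cost of a higher SH degree can be made concrete: spherical harmonics up to degree L have (L + 1)^2 basis functions, so each colour channel needs that many coefficients. The helper name below is made up, and the factor of 3 assumes one set of coefficients per RGB channel.

```python
def num_sh_coefficients(degree):
    """Number of SH basis functions up to and including the given degree."""
    return (degree + 1) ** 2


for degree in range(4):
    per_channel = num_sh_coefficients(degree)
    print(degree, per_channel, 3 * per_channel)
# 0 1 3
# 1 4 12
# 2 9 27
# 3 16 48
```

Degree 0 is a single constant colour per Gaussian; each extra degree adds more view-dependent detail at the price of more parameters.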

Opacity:
measures how transparent or opaque a gaussian splat is, ranging from 0 (completely transparent) to 1 (completely opaque). During rendering, opacity values are used to blend multiple Gaussian splats. Gaussians with higher opacity closer to the viewer will occlude those behind them, creating a sense of depth.
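
A simplified sketch of how opacity drives blending: front-to-back alpha compositing of depth-sorted samples along one pixel ray. The colours are single numbers and the alpha values are made up; in an actual splatting renderer the effective alpha at a pixel would also depend on the projected Gaussian's falloff, which is omitted here.

```python
def composite(colors, alphas):
    """Front-to-back alpha compositing of samples sorted nearest first."""
    pixel = 0.0
    transmittance = 1.0  # fraction of light not yet blocked
    for color, alpha in zip(colors, alphas):
        pixel += transmittance * alpha * color
        transmittance *= (1.0 - alpha)
    return pixel, transmittance


# A mostly opaque bright splat in front of a half-transparent grey one.
pixel, remaining = composite(colors=[1.0, 0.5], alphas=[0.8, 0.5])
print(round(pixel, 4), round(remaining, 4))  # 0.85 0.1
```

The near splat contributes 0.8 of the pixel, while the one behind it can only contribute through the remaining 20% of transmittance, which is the occlusion effect described above.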

Each 3D gaussian is characterised by a covariance matrix and a center point. For differentiable optimisation, the covariance matrix can be decomposed into a scaling matrix and a rotation matrix.
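
The decomposition can be written as Sigma = R S S^T R^T, where S is a diagonal scaling matrix and R a rotation matrix. A minimal numpy sketch follows; the rotation is built from a single angle about the z axis for simplicity (an assumption for illustration), and the scale values are made up.

```python
import numpy as np


def build_covariance(scales, R):
    """Covariance Sigma = R S S^T R^T from per-axis scales and a rotation."""
    S = np.diag(scales)
    M = R @ S
    return M @ M.T


angle = np.pi / 4
c, s = np.cos(angle), np.sin(angle)
R = np.array([[c, -s, 0.0],
              [s, c, 0.0],
              [0.0, 0.0, 1.0]])  # rotation about the z axis

cov = build_covariance([2.0, 1.0, 0.5], R)

print(np.allclose(cov, cov.T))  # True: the result is symmetric
print(np.linalg.eigvalsh(cov))  # approximately [0.25, 1.0, 4.0], the squared scales
```

Optimising the scales and the rotation instead of the covariance entries directly keeps the matrix symmetric and positive semi-definite at every step, which is why this form suits gradient-based optimisation.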


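On pruning: Gaussians below a certain opacity threshold are pruned, since they contribute little to the rendered image. Note that in the original 3D Gaussian Splatting scheme the view-space positional gradients drive densification (Gaussians with large gradients are cloned or split) rather than pruning.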

COLMAP
software tool for 3D reconstruction from images. It combines structure from motion (SfM) and multi-view stereo (MVS) to create 3D models from 2D photos. SfM reconstructs the 3D structure of a scene from a set of 2D images by estimating camera poses and creating a sparse 3D point cloud. Then MVS densifies the point cloud, creating a dense 3D reconstruction. Steps involved:
1. feature extraction and matching: detecting and matching keypoints across multiple images.
2. structure from motion: estimating camera positions and creating a sparse 3D point cloud.
3. multi-view stereo: densifying the sparse point cloud to generate a detailed 3D model.

How is a COLMAP reconstruction produced?
1. image capture: take overlapping images from various angles around the subject or scene.
2. feature detection: extract keypoints and descriptors from the images.
3. feature matching: match the keypoints across different images.
4. SfM: incrementally build a sparse 3D structure by triangulating matched points and estimating camera poses. (triangulation for the 3D point cloud: first estimate the fundamental matrix, which encodes the epipolar geometry, then use the matched keypoints and their corresponding camera positions to triangulate the 3D coordinates of the points. This involves solving for the 3D position that best matches the 2D projections in each image.)
5. MVS: convert the sparse point cloud into a dense 3D model using stereo matching techniques.

How is COLMAP used in the 4D Gaussian Splatting model?
In the 4D Gaussian Splatting model, COLMAP serves as a foundational tool to provide essential 3D data. 
1. Data Loading: 
dataset identification: the presence of a "sparse" directory indicates a COLMAP dataset. 
scene loading: The 'sceneLoadTypeCallbacks["Colmap"]' function loads the dataset, including camera parameters and the initial sparse point cloud.
2. Camera setup: The dataset is split into training, test and video cameras, which are stored in 'FourDGSdataset' objects for easy access.
3. Point Cloud Processing:
Bounding box calculation: compute the bounding box of the point cloud.
Gaussian model setup: initialise the Gaussian model using the point cloud data from COLMAP.
4. Gaussian model integration: The point cloud and camera parameters from COLMAP are used to initialise the Gaussian model, which is essential for 4D representation and rendering.




Dataset prep: input a set of 2D images
output a sparse 3d point cloud (.ply file) and camera parameters (intrinsics and extrinsics) using COLMAP.

loading the dataset: the sceneLoadTypeCallbacks["Colmap"] function reads the COLMAP output, loading the 3D point cloud and camera parameters into the 4D Gaussian Splatting model.
camera parameters: 
intrinsics: focal length, principal point, distortion coefficients.
extrinsics: rotation and translation matrices (camera pose).

Point cloud processing in 4D Gaussian Splatting:
the bounding box encompasses the min and max extents of the point cloud in the 3D space.
purpose: used to set the limits for the deformation network in the 4D gaussian splatting model.
it is calculated from the point cloud data during scene initialisation within the 4D Gaussian splatting model.
ensures the deformation network operates within the bounds of the scene.
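
A minimal sketch of an axis-aligned bounding box computed from a point cloud, assuming the points are held as an (N, 3) numpy array. The coordinates are made up, and the variable names are not taken from the 4D Gaussian Splatting code.

```python
import numpy as np

# Hypothetical sparse point cloud: one row per point, columns are x, y, z.
points = np.array([[0.0, 1.0, -2.0],
                   [3.0, -1.0, 0.5],
                   [1.5, 0.2, 4.0]])

xyz_min = points.min(axis=0)  # smallest x, y and z over all points
xyz_max = points.max(axis=0)  # largest x, y and z over all points

print(xyz_min)  # [ 0. -1. -2.]
print(xyz_max)  # [3. 1. 4.]
```

Every point satisfies xyz_min <= p <= xyz_max component-wise, so the pair of corners is enough to describe the region the deformation network has to cover.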





https://colmap.github.io/faq.html#faq-fix-intrinsics
https://colmap.github.io/format.html#output-format

COLMAP output

data/hypernerf/virg/broom2/colmap/sparse_custom:
COLMAP exports the following three text files for every reconstructed model:
cameras.txt, images.txt, points3D.txt.

1) cameras.txt: 


this file contains the intrinsic parameters of all reconstructed cameras in the dataset using one line per camera
I see multiple lines in the cameras.txt but with identical entries -> does this mean we have 196 cameras with the same sensor dimensions (width, height)?
In the context of COLMAP, "reconstructed cameras" refer to the estimated camera models derived from the structure-from-motion process. These models include intrinsic parameters (like focal length and principal point) and extrinsic parameters (like rotation and translation) for each camera used in capturing the images. The term "reconstructed" implies that these parameters are computed based on the 2D images and their correspondences, resulting in accurate camera positions and orientations relative to the 3D structure of the scene. The camera intrinsic parameters describe how the camera lens projects 3D points into the 2D image plane.
The camera extrinsic parameters define the position and orientation of the camera in the world coordinate system, typically represented by a rotation matrix (or quaternion) and a translation vector.

The structure of the cameras.txt file is:

Camera list with one line of data per camera:

CAMERA_ID, MODEL (e.g. SIMPLE_PINHOLE), WIDTH, HEIGHT (width and height are the dimensions of the image sensor), PARAMS[] (= intrinsic parameters such as focal length and principal point)

Number of cameras: 3

1 SIMPLE_PINHOLE 3072 2304 2559.81 1536 1152

2 PINHOLE 3072 2304 2560.56 2560.56 1536 1152

3 SIMPLE_RADIAL 3072 2304 2559.69 1536 1152 -0.0218531
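
A sketch of reading such lines into dictionaries, using the three example lines above. The helper name parse_camera_line is made up; the meaning of PARAMS depends on the model (f, cx, cy for SIMPLE_PINHOLE; fx, fy, cx, cy for PINHOLE; f, cx, cy, k for SIMPLE_RADIAL).

```python
def parse_camera_line(line):
    """Split one cameras.txt data line into its named fields."""
    parts = line.split()
    return {
        "camera_id": int(parts[0]),
        "model": parts[1],
        "width": int(parts[2]),
        "height": int(parts[3]),
        "params": [float(p) for p in parts[4:]],  # length depends on the model
    }


lines = [
    "1 SIMPLE_PINHOLE 3072 2304 2559.81 1536 1152",
    "2 PINHOLE 3072 2304 2560.56 2560.56 1536 1152",
    "3 SIMPLE_RADIAL 3072 2304 2559.69 1536 1152 -0.0218531",
]
cameras = [parse_camera_line(line) for line in lines]

for cam in cameras:
    print(cam["camera_id"], cam["model"], cam["params"])
```

In a real file, lines starting with '#' are comments and would need to be skipped before parsing.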




2) images.txt:
This file contains the pose (extrinsic parameters) and keypoints of all reconstructed images in the dataset using two lines per image. The reconstructed pose of an image is specified as the projection from world to the camera coordinate system using a quaternion and a translation vector. 

The structure of the images.txt file:
Image list with two lines of data per image:
line 1: IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME
 QW, QX, QY, QZ: quaternion components for rotation
 TX, TY, TZ: translation vector
 CAMERA_ID: reference to the camera model in cameras.txt; NAME: image file name
line 2: POINTS2D[] as (X, Y, POINT3D_ID): list of keypoints with their 2D coordinates and associated 3D point IDs
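
A sketch of turning the first line of an images.txt entry into a rotation matrix and a camera centre. The example line and helper names are made up. Since the stored pose maps world to camera coordinates (x_cam = R x_world + t), the camera centre in world coordinates is C = -R^T t.

```python
def quaternion_to_rotation(qw, qx, qy, qz):
    """Rotation matrix for a unit quaternion given as (w, x, y, z)."""
    return [
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qw * qz), 2 * (qx * qz + qw * qy)],
        [2 * (qx * qy + qw * qz), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qw * qx)],
        [2 * (qx * qz - qw * qy), 2 * (qy * qz + qw * qx), 1 - 2 * (qx * qx + qy * qy)],
    ]


def parse_image_line(line):
    """Parse the pose line (first of the two lines) of an images.txt entry."""
    parts = line.split()
    q = [float(v) for v in parts[1:5]]
    t = [float(v) for v in parts[5:8]]
    R = quaternion_to_rotation(*q)
    # Camera centre in world coordinates: C = -R^T t
    center = [-sum(R[row][col] * t[row] for row in range(3)) for col in range(3)]
    return {"image_id": int(parts[0]), "R": R, "t": t, "center": center,
            "camera_id": int(parts[8]), "name": parts[9]}


# Made-up entry with an identity rotation.
image = parse_image_line("1 1.0 0.0 0.0 0.0 0.5 -0.2 3.0 1 frame_0001.png")
print(image["name"], image["camera_id"])  # frame_0001.png 1
print(image["center"])                    # [-0.5, 0.2, -3.0]
```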


3) points3D.txt
This file contains the info of all reconstructed 3D points in the dataset using one line per point:
3D point list with one line of data per point:

POINT3D_ID, X, Y, Z, R, G, B, ERROR, TRACK[] as (IMAGE_ID, POINT2D_IDX)
TRACK[]: list of observations indicating which images observe this 3D point and the index of the corresponding keypoint.
Number of points: 3, mean track length: 3.3334
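
A sketch of parsing one points3D.txt line, including the variable-length TRACK[] list. The example line and the helper name are made up for illustration.

```python
def parse_point3d_line(line):
    """Split one points3D.txt data line into its named fields."""
    parts = line.split()
    track_values = [int(v) for v in parts[8:]]
    return {
        "point3d_id": int(parts[0]),
        "xyz": tuple(float(v) for v in parts[1:4]),
        "rgb": tuple(int(v) for v in parts[4:7]),
        "error": float(parts[7]),
        # TRACK[] is stored as consecutive (IMAGE_ID, POINT2D_IDX) pairs.
        "track": list(zip(track_values[0::2], track_values[1::2])),
    }


point = parse_point3d_line("7 1.5 0.25 2.0 115 121 122 0.8 1 10 2 14 3 9")

print(point["xyz"])         # (1.5, 0.25, 2.0)
print(point["rgb"])         # (115, 121, 122)
print(point["track"])       # [(1, 10), (2, 14), (3, 9)]
print(len(point["track"]))  # 3, the track length of this point
```

The xyz and rgb fields are what a point-cloud initialisation needs; the track only matters when relating 3D points back to the images that observed them.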




**Components and attributes of the 'Scene' class**

Arguments: takes in model parameters (args), a Gaussian model (gaussians) and optional parameters for loading iterations, shuffling etc.


Loading dataset: the __init__ method (scene > __init__.py) checks the dataset type based on the presence of specific files in the 'source_path' and calls the appropriate loader function from 'sceneLoadTypeCallbacks' (scene > dataset_reader.py) to load the dataset.


What does the 'Scene' class do?
1. loads an iteration checkpoint to retrieve the latest or specified iteration from the saved model.
2. identifies dataset type by searching for specific files in the source path.
3. initialises camera data (training, test and video cameras use the 'FourDGSdataset' class)
4. computes bounding box (min and max coordinates) from the 3D point cloud data. This bounding box is used to set the axis-aligned bounding box for the deformation network.
5. configures the deformation network with the bounding box.
6. loads or initialises Gaussian model: if loading a specific iteration, the model and point cloud are loaded from the PLY file. If no iteration is specified, a new Gaussian model is created from the point cloud data, initialised with the camera extent and maximum time value.


What is the aim of the 'Scene' object in 4DGS?
primarily: manages and provides the necessary data for rendering images from various viewpoints during training and evaluation. 
- dataset loading and management: identifies the type of dataset and loads a specific iteration allowing resumption of training from a checkpoint
- initialises camera data for training testing and video cameras
- computes the bounding box for the point cloud data
- loads and saves or initialises the Gaussian model from the point cloud data


Usage in training:

the camera data from the 'scene' object is used to render images from different viewpoints. Functions like getTrainCameras(), getTestCameras() and getVideoCameras() provide the necessary camera data for different stages of training and evaluation.


dataset.py (scene)
**FourDGSdataset class:**
retrieves camera properties and constructs a 'Camera' object. This object includes all necessary info to render the scene from this viewpoint (rotation, translation and field of view). The 'FourDGSdataset' class provides a consistent interface to access camera properties which are essential for rendering scenes correctly.