# COMPLETE ANALYSIS OF LIDAR LA PAZ
## CALLP

Author: Ludving Cano Fernandez

_Based on the code written by Maria Fernanda, 2018._

## 1. Pre-processing
### 1.1. Data files
Data obtained by LIDAR are formatted in the following way:

 - Folders named with the format YYYY_MM_DD
   - Inside each folder there are files named with the format DD_MM_YY_HRHHMM_AA.ch1.txt (e.g. 03_12_18_HR1407_90.ch1.txt)
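
As an illustration of this naming scheme, the sketch below splits a file name into its parts. The function name `parse_lidar_filename` and the group names are hypothetical, the two-digit year is assumed to belong to the 2000s, and the meaning of the `AA` suffix is not stated here, so it is returned unchanged as text.

```python
import re
from datetime import datetime

# Pattern inferred from the example name 03_12_18_HR1407_90.ch1.txt
FILE_RE = re.compile(
    r"^(?P<day>\d{2})_(?P<month>\d{2})_(?P<year>\d{2})"
    r"_HR(?P<hhmm>\d{4})_(?P<aa>\d{2})\.ch1\.txt$"
)


def parse_lidar_filename(name):
    """Return (timestamp, aa) for a data file name, or None if it does not match."""
    m = FILE_RE.match(name)
    if m is None:
        return None
    # Assumption: YY is a year in the 2000s and HHMM is a 24-hour clock time
    text = "{}-{}-{} {}".format(m["day"], m["month"], m["year"], m["hhmm"])
    stamp = datetime.strptime(text, "%d-%m-%y %H%M")
    return stamp, m["aa"]


print(parse_lidar_filename("03_12_18_HR1407_90.ch1.txt"))
# (datetime.datetime(2018, 12, 3, 14, 7), '90')
```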

The first three columns are, respectively, HOUR, MINUTE and SECOND. When the data are acquired, however, the hours and minutes are stored as decimal numbers, so a value such as $8,000000$ appears. To save memory we can reduce it to $8$, and we can also round SECOND to three decimals.

On the other hand, the data use a comma (,) as the decimal symbol, while Python uses a dot (.) as its standard decimal symbol, so we need to convert all the data to use the dot.

By default the data use a tab character (`\t`) as the separator. We keep it, since it is easy to read files with different separators.
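
The conversions described above can be sketched for a single line as follows. The helper `normalize_line` and the sample line are made up for illustration; the only assumption about the columns after the first three is that they are numeric.

```python
def normalize_line(line):
    """Convert one raw line (decimal commas, tab-separated) into a list of numbers."""
    # Replace decimal commas with dots, then split on whitespace (tabs included)
    fields = line.replace(",", ".").split()
    hour = int(float(fields[0]))          # 8.000000 -> 8
    minute = int(float(fields[1]))        # 15.000000 -> 15
    second = round(float(fields[2]), 3)   # keep three decimals
    rest = [float(x) for x in fields[3:]]
    return [hour, minute, second] + rest


# Made-up sample line, not taken from a real data file
sample = "8,000000\t15,000000\t12,345678\t0,001234"
print(normalize_line(sample))
# [8, 15, 12.346, 0.001234]
```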

Finally, we want to concatenate all the data files of a day into a new file, so that we can read all the data at once when we process them.

### 1.2. Concatenating data

#### 1.2.1. `concatenate.py`

This basic script has two main tasks and one optional task:

1. Replace all commas with dots
2. Concatenate all files from a day into a new file named `YYYY_MM_DD_rawdata_AA.txt`
3. (OPTIONAL) Convert hours and minutes to integers, and round seconds to three decimals.


In [2]:
import os
import re
import time

start_time = time.time()

# Root folder that holds one sub-folder per measurement day (YYYY_MM_DD)
orig_data_path = "/home/ludving/PROJ/LFA/LIDAR_Code/"
ls = os.listdir(orig_data_path)

# Keep only the entries whose name starts with a date: YYYY_MM_DD
r = re.compile(r"^\d{4}_\d{2}_\d{2}")
ls = list(filter(r.match, ls))

print("Which folder should be used?")
for i, j in enumerate(ls):
    print(j, ".....", i)

# selection = int(input("Enter number: "))
selection = 1
selection = ls[selection]
new_path = orig_data_path + selection + "/"
new_ls = os.listdir(new_path)

# Keep only the data files: DD_MM_YY_HRHHMM_AA.ch1.txt
r = re.compile(r"\d{2}_\d{2}_\d{2}_HR\d{4}_\d{2}\.ch1\.txt$")
ls = list(filter(r.match, new_ls))

qtn = len(ls)
print("Found", qtn, "files")

# Read each file, replace decimal commas with dots and append it to one string
new_file = ""
for i in ls:
    file_path = new_path + i
    with open(file_path, "r") as file:
        ff = file.read()
        ff = ff.replace(",", ".")
    new_file = new_file + ff + "\n"

# Write the concatenated data to a single file
raw_path = "/home/ludving/PROJ/LFA/LIDAR_Code/raw_data/resultt.txt"
with open(raw_path, "w") as file:
    file.write(new_file)

print("--- %s seconds ---" % (time.time() - start_time))
size = os.path.getsize(raw_path)
print("Size of result file: ", round(size / 1e9, 2), "GB")

Which folder should be used?
2018_12_03 ..... 0
2018_04_30 ..... 1
Found 8 files
--- 29.797842741012573 seconds ---
Size of result file:  1.43 GB
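
The script above holds the whole concatenated text in memory before writing it. A possible variation, sketched here under assumptions, writes each converted line as soon as it is read. The function name `concatenate_streaming` and its parameters are hypothetical, and sorting the file names is an added choice (`os.listdir` returns names in arbitrary order); it was not timed against the script above.

```python
import os
import re

# Same file-name pattern as above: DD_MM_YY_HRHHMM_AA.ch1.txt
DATA_FILE_RE = re.compile(r"\d{2}_\d{2}_\d{2}_HR\d{4}_\d{2}\.ch1\.txt$")


def concatenate_streaming(src_dir, dst_path):
    """Concatenate the data files in src_dir into dst_path, line by line.

    Decimal commas are replaced by dots. Returns the number of files read.
    """
    # Assumption: within one day folder, sorting by name gives time order
    names = sorted(n for n in os.listdir(src_dir) if DATA_FILE_RE.match(n))
    with open(dst_path, "w") as dst:
        for name in names:
            with open(os.path.join(src_dir, name), "r") as src:
                for line in src:
                    dst.write(line.replace(",", "."))
            # Keep the blank line that the original script adds after each file
            dst.write("\n")
    return len(names)
```

It could be called as `concatenate_streaming(new_path, raw_path)`, reusing the two path variables defined in the script above.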


There are two other versions of `concatenate.py`, each with a different approach to handling the data and writing the new file. We also wrote a Bash script (`concat.sh`), which has some disadvantages: it has to be copied into the data folder and it can only be run on Linux.

The other three programs are described below:

#### 1.2.2. `concatenate2.py`
This version uses the `pandas` library, which is widely used in data science nowadays because it is meant to handle numerical data efficiently. This code also implements tasks (2) and (3). Sadly, it is the slowest version: it has to go through every file line by line and do the conversions, then append each row to a list, and finally convert this list of lists into a pandas DataFrame that can be written to a .txt file.

In [4]:
import os
import re
import time

import pandas as pd

start_time = time.time()

# Root folder that holds one sub-folder per measurement day (YYYY_MM_DD)
orig_data_path = "/home/ludving/PROJ/LFA/LIDAR_Code/"
ls = os.listdir(orig_data_path)

# Keep only the entries whose name starts with a date: YYYY_MM_DD
r = re.compile(r"^\d{4}_\d{2}_\d{2}")
ls = list(filter(r.match, ls))

print("Which folder should be used?")
for i, j in enumerate(ls):
    print(j, ".....", i)

# selection = int(input("Enter number: "))
selection = 1
selection = ls[selection]
new_path = orig_data_path + selection + "/"
new_ls = os.listdir(new_path)

# Keep only the data files: DD_MM_YY_HRHHMM_AA.ch1.txt
r = re.compile(r"\d{2}_\d{2}_\d{2}_HR\d{4}_\d{2}\.ch1\.txt$")
ls = list(filter(r.match, new_ls))

qtn = len(ls)
print("Found", qtn, "files")

ff = []
for i in ls:
    file_path = new_path + i
    with open(file_path) as file:
        for line in file:
            # Replace decimal commas with dots and convert every field to float
            line = line.replace(",", ".")
            # Unused NumPy alternative kept from the original code:
            # dat = np.array(line.split())
            # dat = dat.astype(float)
            # for i in range(2):
            #     dat[i] = dat[i].astype(int)
            lst = [float(j) for j in line.split()]
            if not lst:
                continue  # skip empty lines
            lst[0] = int(lst[0])        # HOUR as integer
            lst[1] = int(lst[1])        # MINUTE as integer
            lst[2] = round(lst[2], 3)   # SECOND with three decimals
            ff.append(lst)

# Convert the list of rows to a DataFrame and write it as tab-separated text
df = pd.DataFrame(ff)
del ff
df.to_csv('filename.txt', sep='\t', index=False, header=False)

print("--- %s seconds ---" % (time.time() - start_time))
size = os.path.getsize('filename.txt')
print("Size of result file: ", round(size / 1e9, 2), "GB")

Which folder should be used?
2018_12_03 ..... 0
2018_04_30 ..... 1
Found 8 files
--- 156.19528150558472 seconds ---
Size of result file:  1.36 GB
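
Most of the time in the version above is spent in the Python loop over lines. As a hedged alternative, `pandas.read_csv` can parse decimal commas itself through its `decimal` parameter, so the conversion is done by the parser instead. The helper `load_lidar_table` is hypothetical, the sample rows are made up, and this sketch was not timed against the LIDAR files.

```python
import io

import pandas as pd


def load_lidar_table(sources):
    """Read tab-separated sources with decimal commas into one DataFrame."""
    frames = [
        pd.read_csv(src, sep="\t", decimal=",", header=None)
        for src in sources
    ]
    df = pd.concat(frames, ignore_index=True)
    df[0] = df[0].astype(int)   # HOUR as integer
    df[1] = df[1].astype(int)   # MINUTE as integer
    df[2] = df[2].round(3)      # SECOND with three decimals
    return df


# Made-up rows standing in for two data files
file_a = io.StringIO("8,000000\t15,000000\t12,345678\t0,001234\n")
file_b = io.StringIO("8,000000\t15,000000\t13,345678\t0,002345\n")
df = load_lidar_table([file_a, file_b])
print(df)
```

With real data, `sources` would be the list of file paths, and the result can be written with the same `df.to_csv(..., sep='\t', index=False, header=False)` call used above.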
