## 2 Data extraction from h5 to csv
### 2.1 Particularities of the h5 data

The dataset is stored in .h5 (HDF5) format. This is a dedicated format, handled through specific libraries (as in Python), for storing and structuring files that hold very large amounts of data.  
Because this format requires its own functions, we chose to convert it to .csv so that we can use the functions we already know (read.csv(), fwrite(), fread()).  

However, the initial dataset (in .h5) corresponds to a 4-dimensional array (946 IDs × 7 channels × 40 segments × 500 observations).  
To write it as a .csv file, we need to flatten it from 4 dimensions to 2. To do that, we use three nested for-loops: for each segment (and its 500 observations), we save the ID number, the channel number, the segment number and the corresponding 500 observations as one row.  

In the end, we obtain 946 × 7 × 40 = **264 880 rows**.  
Hence, for each row, the key is (ID, CHANNEL, SEGMENT) and the attributes are the 500 observations.
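
The flattening described above can be sketched on a small synthetic array. This is only an illustration under assumptions: the array `arr` and its reduced dimensions (3 IDs, 2 channels, 4 segments, 5 observations) are hypothetical stand-ins for the real data, and the vectorized `aperm()` reshape is an alternative to the three nested loops of section 2.2, not the code that was actually run.

```r
# Hypothetical toy dimensions standing in for 946 x 7 x 40 x 500
n_id <- 3; n_ch <- 2; n_seg <- 4; n_obs <- 5

# Synthetic 4-D array indexed as [ID, channel, segment, observation]
arr <- array(seq_len(n_id * n_ch * n_seg * n_obs),
             dim = c(n_id, n_ch, n_seg, n_obs))

# One key per row; expand.grid varies its first argument fastest,
# so the order is segment (fastest), then channel, then ID
keys <- expand.grid(Segments = seq_len(n_seg),
                    Channels = seq_len(n_ch),
                    Id       = seq_len(n_id))[, c("Id", "Channels", "Segments")]

# Reorder the array to [segment, channel, ID, observation] so that its
# column-major layout matches the key order, then collapse the first
# three dimensions into rows; observations become the columns
obs <- matrix(aperm(arr, c(3, 2, 1, 4)), ncol = n_obs)

flat <- cbind(keys, obs)

nrow(flat)  # equals n_id * n_ch * n_seg
```

Each row of `flat` then carries its (Id, Channels, Segments) key followed by the observations of that segment, in the same row order as nested loops over ID, channel and segment.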

*Because of the size of the source files, we ran these functions locally on several instances. The code below is therefore not executed in the HTML version. The final outputs of the functions were uploaded to the cloud in csv format so that everyone can retrieve them and follow the code of steps 3, 4 and 5.*

### 2.2 Functions to convert h5

In [4]:
library(magrittr)
library(data.table)
#install.packages("BiocManager")
library(BiocManager)
#BiocManager::install(c("rhdf5"))
library(rhdf5)

In [5]:
x_train=H5Fopen("X_train_new.h5")
View(x_train$features[,,1,1]) #(500, 7, 40, 946)
x_train.new <- aperm(x_train$features, c(4,2,1,3)) #(946, 7, 500, 40)
View(x_train.new[946,1,1,1]) #(946, 7, 500, 40)

ERROR: Error in H5Fopen("X_train_new.h5"): HDF5. File accessibilty. Unable to open file.


In [7]:
df_aux <- data.table(matrix(ncol=3, nrow= 1))
df_aux[,1] = 1
df_aux[,2] = 1
df_aux[,3] = 1
df_aux = cbind(df_aux, t(x_train.new[1,1,,1]))

df <- df_aux

V1,V2,V3
<lgl>,<lgl>,<lgl>
,,


In [None]:
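c=0
for (i in 1:946) {      # IDs
  for(j in 1:7) {       # channels
    for(k in 1:40) {    # segments
       # key columns: ID, channel, segment
       df_aux <- data.table(matrix(ncol=3, nrow= 1))
       df_aux[,1] = i
       df_aux[,2] = j
       df_aux[,3] = k
       
       # values: the 500 observations of this segment, as columns
       df_aux = cbind(df_aux, t(x_train.new[i,j,,k]))
       df = rbind(df, df_aux)
       
       c=c+1
       
       print(c)  # progress counter
    }
  }
}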

Export results

In [None]:
fwrite(df,"X_train.csv")

##### Same steps for X_test

In [None]:
x_test=H5Fopen("X_test_new.h5")
View(x_test$features[,,1,1]) #(500, 7, 40, 946)
x_test.new <- aperm(x_test$features, c(4,2,1,3)) #(946, 7, 500, 40)
View(x_test.new[946,1,1,1]) #(946, 7, 500, 40)

# seed row, built from the test array
df_aux <- data.table(matrix(ncol=3, nrow= 1))
df_aux[,1] = 1
df_aux[,2] = 1
df_aux[,3] = 1
df_aux = cbind(df_aux, t(x_test.new[1,1,,1]))

df <- df_aux
c=0
for (i in 1:946) {      # IDs
  for(j in 1:7) {       # channels
    for(k in 1:40) {    # segments
       # key columns: ID, channel, segment
       df_aux <- data.table(matrix(ncol=3, nrow= 1))
       df_aux[,1] = i
       df_aux[,2] = j
       df_aux[,3] = k
       
       # values: the 500 observations of this segment, as columns
       df_aux = cbind(df_aux, t(x_test.new[i,j,,k]))
       df = rbind(df, df_aux)
       
       c=c+1
       
       print(c)  # progress counter
    }
  }
}
fwrite(df,"X_test.csv")

### 2.3 Verify results and clean

#### 2.3.1 Verify for X_train

In [None]:
X_train=fread("X_train.csv")
X_train<-X_train[-1,]
X_train[1,]

setnames(X_train, "V1", "Idligne")
setnames(X_train, "V1", "Id")
setnames(X_train, "V2", "Channels")
setnames(X_train, "V3", "Segments")


x_train_h5=H5Fopen("X_train_new.h5")
x_train_h5 <- aperm(x_train_h5$features, c(4,2,1,3)) #(946, 7, 500, 40)


h5closeAll()

Check for the first ID, 1st channel, 1st segment:

In [None]:
(X_train[1,])
View(x_train_h5[1,1,,1])

OK.

Check for the first ID, 1st channel, 2nd segment:

In [None]:
(X_train[3,])
View(x_train_h5[1,1,,2])

OK!
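
The visual comparison above could also be done programmatically. The sketch below is an illustration under assumptions, not the check that was run: `arr` and `flat` are small synthetic stand-ins for `x_train_h5` and `X_train`, and `check_row()` is a hypothetical helper.

```r
# Synthetic stand-in for x_train_h5, indexed [ID, channel, observation, segment]
set.seed(1)
arr <- array(rnorm(2 * 3 * 5 * 4), dim = c(2, 3, 5, 4))

# Synthetic stand-in for X_train: key columns followed by the observations
keys <- expand.grid(Segments = 1:4, Channels = 1:3, Id = 1:2)
obs  <- t(mapply(function(i, j, k) arr[i, j, , k],
                 keys$Id, keys$Channels, keys$Segments))
flat <- cbind(keys[, c("Id", "Channels", "Segments")], obs)

# Hypothetical helper: TRUE when the row stored for (id, channel, segment)
# matches the corresponding slice of the array
check_row <- function(flat, arr, id, channel, segment) {
  row <- flat[flat$Id == id & flat$Channels == channel & flat$Segments == segment, ]
  if (nrow(row) != 1) return(FALSE)  # missing or duplicated key
  isTRUE(all.equal(unlist(row[, -(1:3)], use.names = FALSE),
                   arr[id, channel, , segment]))
}

check_row(flat, arr, id = 1, channel = 1, segment = 2)
```

`all.equal()` compares with a numeric tolerance, which is convenient here because values read back from a csv file may differ slightly from the original doubles.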

There should be 264 880 observations (946 × 7 × 40).

In [None]:
(X_train[X_train$Channels==1 & X_train$Segments==1 & X_train$Id==1])


=> The first duplicate is at line 9391 - 1 (we removed one line at the start).  
We therefore remove the rows before it (9389 rows), which leaves the expected 264 880 observations.
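
One possible way to locate such duplicates without scanning the table by eye is `duplicated()` on the key columns. This is a minimal sketch on a small hypothetical table `tab`, not the real `X_train`, and not the procedure that was actually used.

```r
# Hypothetical table: the first two rows carry keys that appear again later
tab <- data.frame(
  Id       = c(1, 1, 1, 1, 1, 1),
  Channels = c(1, 1, 1, 1, 2, 2),
  Segments = c(1, 2, 1, 2, 1, 2),
  V4       = c(0.1, 0.2, 0.1, 0.2, 0.3, 0.4)
)
key <- tab[, c("Id", "Channels", "Segments")]

# Position of the first row whose key was already seen earlier in the table
first_dup <- which(duplicated(key))[1]

# Keep only the last occurrence of each key, i.e. drop the earlier copies
clean <- tab[!duplicated(key, fromLast = TRUE), ]

first_dup
nrow(clean)
anyDuplicated(clean[, c("Id", "Channels", "Segments")])  # 0 means no duplicate key left
```

Keeping the last occurrence mirrors the choice made above of dropping the rows that precede the first duplicate.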

In [None]:
X_train<-X_train[-c(1:9389),]
X_train<-X_train[,-1]

#### 2.3.2 Verify for X_test.csv

In [None]:
X_test=fread("X_test_new.csv")
X_test[1:4,]

In [None]:

X_test<-X_test[-c(1,2),]#Remove first and second line
X_test[1,]
setnames(X_test, "V1", "Id")
setnames(X_test, "V2", "Channels")
setnames(X_test, "V3", "Segments")


In [None]:

x_test_h5=H5Fopen("X_test_new.h5")


x_test_h5 <- aperm(x_test_h5$features, c(4,2,1,3)) #(946, 7, 500, 40)


h5closeAll()


h5ls("X_train_new.h5", all=TRUE)


Check for the first ID, 1st channel, 1st segment:

In [None]:
(X_test[1,])
View(x_test_h5[1,1,,1])

OK!

Check for the first ID, 2nd channel, 2nd segment:

In [None]:
(X_test[X_test$Channels==2 & X_test$Segments==2 & X_test$Id==1])
View(x_test_h5[1,2,,2])

OK!

#### 2.3.3 Export the cleaned datasets

In [None]:
fwrite(X_test,"X_test_clean")
fwrite(X_train,"X_train_clean")

We have uploaded our cleaned csv files to the cloud:  
X_test: https://u2bigdataprojectpredictfrombraina-donotdelete-pr-keui4jukxng1lb.s3.eu.cloud-object-storage.appdomain.cloud/X_test.csv  
Y_train: https://u2bigdataprojectpredictfrombraina-donotdelete-pr-keui4jukxng1lb.s3.eu.cloud-object-storage.appdomain.cloud/y_train.csv  
X_train: https://u2bigdataprojectpredictfrombraina-donotdelete-pr-keui4jukxng1lb.s3.eu.cloud-object-storage.appdomain.cloud/X_train.csv  

We can now move on to the EDA & pre-processing!