# Notebook description

### Objective 

This notebook processes the per-nucleotide read depth (DP) information obtained for the 72 ECOR strains.
For each locus, it calculates the average read depth as the sum of the per-nucleotide read depths divided by the locus length.
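
As a minimal worked example of this definition, the sketch below computes the average read depth of a single locus in a single strain. The five depth values are made up for illustration and do not come from the ECOR data.

```r
# Hypothetical per-nucleotide read depths (DP) for one 5-nucleotide locus in one strain
dp_per_nucleotide <- c(10L, 12L, 8L, 11L, 9L)

# The locus length is the number of nucleotide positions
locus_length <- length(dp_per_nucleotide)

# Average read depth = sum of per-nucleotide depths / locus length
average_dp <- sum(dp_per_nucleotide) / locus_length
print(average_dp)
```

Because every position contributes one value, this is the same quantity as `mean(dp_per_nucleotide)`.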

### Notebook organization
**Environment setup**:
This part loads the required packages.
The data.table package lets R load and handle large data frames.

**Function definition**
This part defines the function used to calculate the average read depth (DP) per locus from an array that contains DP information for multiple strains (one strain per column).

**Importing Locus names**
This part imports the names of all the loci in the pangenome and generates a vector containing this information.

**ECOR_1**
This part imports and processes DP information for the first 40 ECOR strains.
Because the dataset is too large to be loaded as a single data.frame, we first import the first 40 strains.

**ECOR_2**
This part imports and processes DP information for the last 32 ECOR strains.



---

## Environment setup

This part loads the required packages.
The data.table package lets R load and handle large data frames.

In [None]:
library(tidyverse)
library(data.table)

── Attaching core tidyverse packages ────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘data.table’


The following obje

--- 

## Function definition

Here we define the function that calculates the average number of reads per locus.

In [None]:
# Function: calculate the average number of reads per nucleotide for each locus and strain
## The function loops over the loci one by one and calculates the average DP of each locus in every strain of the array

av_coverage=function(array_ecor,Loci){
  # Strip the trailing "_<digits>" suffix from LOCUS so that the locus names in the array match the locus names in the pangenome
  array_ecor$Loc_name=sub("_\\d+$", "", array_ecor$LOCUS)
  
  df_av_cov=data.frame() # empty data frame, filled row by row in the for loop
  for (i in seq_along(Loci)){
    Loc=Loci[i] 
    df_temp=subset(array_ecor, array_ecor$Loc_name==Loc) # rows belonging to the current locus
    # Mean DP of each integer (strain) column; non-integer columns (LOCUS, Loc_name) give NA
    coverage_means<- sapply(df_temp, function(x) if(is.integer(x)) mean(x) else NA)
    coverage_means_df=as.data.frame(t(coverage_means))
    coverage_means_df$LOCUS=Loc # label the row with the locus name
    coverage_means_df = coverage_means_df %>% select(-last_col()) # drop the helper column Loc_name
   
    df_av_cov=rbind(df_av_cov,coverage_means_df)
   
    if (i %% 1000 == 0) {
      print(i) # progress indicator every 1000 loci
    }
  }
  df_av_cov
}
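
To make the expected output shape concrete, the per-locus averaging can be reproduced on a toy table. This is only an illustrative sketch: the locus names, strain columns, and depth values are made up, and base R's `aggregate()` is used here as a compact stand-in for the loop above, not as the notebook's method.

```r
# Hypothetical per-nucleotide table: one row per position, one integer column per strain
toy <- data.frame(
  LOCUS   = c("locA_1", "locA_2", "locA_3", "locB_1", "locB_2"),
  strain1 = c(10L, 12L, 14L, 4L, 6L),
  strain2 = c(0L, 3L, 3L, 20L, 22L),
  stringsAsFactors = FALSE
)

# Same name simplification as in av_coverage: drop the trailing "_<digits>" suffix
toy$Loc_name <- sub("_\\d+$", "", toy$LOCUS)

# Strain columns are the integer columns
strain_cols <- names(toy)[sapply(toy, is.integer)]

# One row per locus, one mean DP per strain
toy_means <- aggregate(toy[strain_cols], by = list(LOCUS = toy$Loc_name), FUN = mean)
print(toy_means)
```

One difference worth keeping in mind: `av_coverage` returns a row for every name in `Loci`, including loci with no rows in the array, whereas this grouped version only returns loci that are present in the table.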
                       

                            

---

## Importing Locus names

In this section, we import the names of all the loci in the pangenome and generate a vector containing this information.

The function av_coverage needs a vector containing all the locus names.
These are available in the pangenome presence-absence CSV file whole_pan_ecor_presence_absence.csv.

In [None]:
pan_pres_ab=read.csv('dataset_generation/data/dp_threshold/whole_pan_ecor_presence_absence.csv')

Loci=c(unique(pan_pres_ab$Locus))

--- 

## ECOR_1
This part imports and processes DP information for the first 40 ECOR strains.

Because loading the full array with all 72 strains is too computationally intensive, we first focus on the first 40 strains in the original array ecor72_array.txt 

In [None]:
#Importing the first 40 strains
array_ecor_1=fread('data_generation/results/ecor72_DP/ecor72_array.txt',sep="\t",
                 select=1:41)
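
The column-wise split relies on the `select` argument of `fread`, which reads only the requested columns. The sketch below shows the idea on a small temporary file; the file, column names, and values are hypothetical, and the split into two chunks of two strains simply mirrors the 40/32 split used in this notebook.

```r
library(data.table)

# Hypothetical tab-separated array: one LOCUS column followed by four strain columns
toy_file <- tempfile(fileext = ".txt")
toy_array <- data.table(
  LOCUS = c("locA_1", "locA_2"),
  s1 = c(10L, 12L), s2 = c(3L, 4L), s3 = c(7L, 7L), s4 = c(0L, 1L)
)
fwrite(toy_array, toy_file, sep = "\t")

# First chunk: LOCUS plus the first two strains
chunk_1 <- fread(toy_file, sep = "\t", select = 1:3)

# Second chunk: LOCUS again, plus the remaining strains
chunk_2 <- fread(toy_file, sep = "\t", select = c(1, 4:5))

print(colnames(chunk_1))
print(colnames(chunk_2))
```

Column 1 is repeated in the second chunk so that each chunk keeps the LOCUS identifiers needed for the per-locus averaging.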

### Data formatting
We reformat the data before calculating the average read depth per locus per strain.

In [None]:
# Shorten the column names of array_ecor_1 to keep only the sample ID

current_names <- colnames(array_ecor_1)

# Remove the path and the file extension:
# the pattern captures the part after the last slash and drops the '.txt' extension
short_names <- gsub("^.*/(.*)\\.txt$", "\\1", current_names)

# Assign the cleaned names back to array_ecor_1
colnames(array_ecor_1) <- short_names

# Check the new column names
print(colnames(array_ecor_1))
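
To see what the pattern does, it can be applied to a few example names. The paths below are hypothetical and only illustrate the shape of a column header made of a file path ending in `.txt`.

```r
# Hypothetical column names: the LOCUS column plus two per-sample file paths
example_names <- c("LOCUS", "results/dp/ECOR_01.txt", "results/dp/ECOR_02.txt")

# Same pattern as above: keep what follows the last "/" and drop ".txt"
example_short <- gsub("^.*/(.*)\\.txt$", "\\1", example_names)
print(example_short)
```

Names that do not match the pattern, such as `LOCUS`, are returned unchanged, so the first column keeps its name.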

### Data Processing


We process the read depth data for the first 40 strains with the function av_coverage to obtain the average read depth per locus for each strain. We then save the result to a CSV file.

In [None]:
df_av_cov_ecor_1=av_coverage(array_ecor_1,Loci)

In [None]:
dim(df_av_cov_ecor_1)

In [None]:
head(df_av_cov_ecor_1)

In [None]:
write.csv(df_av_cov_ecor_1,'dataset_generation/data/dp_threshold/average_coverage_41.csv')

--- 

## ECOR_2
This part imports and processes DP information for the last 32 ECOR strains.

In [None]:
#Importing the last 32 strains
array_ecor2 <- fread('data_generation/results/ecor72_DP/ecor72_array.txt', sep = "\t", select = c(1, 42:73))

### Data formatting
We reformat the data before calculating the average read depth per locus per strain.

In [None]:
# Shorten the column names of array_ecor2 to keep only the sample ID

current_names <- colnames(array_ecor2)

# Remove the path and the file extension:
# the pattern captures the part after the last slash and drops the '.txt' extension
short_names <- gsub("^.*/(.*)\\.txt$", "\\1", current_names)

# Assign the cleaned names back to array_ecor2
colnames(array_ecor2) <- short_names

# Check the new column names
print(colnames(array_ecor2))

### Data Processing

We process the read depth data for the last 32 strains with the function av_coverage to obtain the average read depth per locus for each strain. We then save the result to a CSV file.

In [None]:
df_av_cov_ecor_2=av_coverage(array_ecor2,Loci)

In [None]:
dim(df_av_cov_ecor_2)

In [None]:
head(df_av_cov_ecor_2)

In [None]:
write.csv(df_av_cov_ecor_2,'dataset_generation/data/dp_threshold/average_coverage_last32.csv')

--- 

In [None]:
sessionInfo()

R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS/LAPACK: /home/manon-morin/miniforge3/envs/r_env/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.15.4 lubridate_1.9.3   forcats_1.0.0     stringr_1.5.1    
 [5] dplyr_1.1.4       purrr_1.0.2       readr_2.1.5       tidyr_1.3.1      
 [9] tibble_3.2.1      ggplot2_3.5.1     tidyverse_2.0.0  

loaded via a namespace (and not attached):
 [1] gtable_0.3.5      jsonlite_1.8.8    compiler_4.3.3    crayon_1.5.2     
 [5] tidyselect_1.2.1  IRdisplay_1.1     scales_1.3.0      uuid_1.2-0       
 [9] fastmap_1.2.0     IRkernel_1.3.2    R6_2.5.1          generics_0.1.3   
[13] munsell_0.5.1     pillar_1.9.0      tzdb_0.4.0        rlang_1.1.4      
[17] utf8_1.2.4        stringi_1.8.4     repr_1.1.7        timechange_0.3.0 
[21] cli_3.6.2         withr_3.0.0       magrittr_2.0.3    digest_0.6.35    
[25] grid_4.3.3        base64enc_0.1-3   hms_1.1.3         pbdZMQ_0.3-11    
[29] lifecycle_1.0.4   vctrs_0.6.5       evaluate_0.24.0   glue_1.7.0       
[33] fansi_1.0.6       colorspace_2.1-0  tools_4.3.3       pkgconfig_2.0.3  
[37] htmltools_0.5.8.1


---