# <span style="color:#006E7F">__CIBIG Metagenomic water project analysis__ <a class="anchor"></span>  


Created by F. Sorgho (CRUN) and E. Badoum (GRAS) - November 2024


# <span style="color:#006E7F">__BASECALLING and QC__ <a class="anchor" id="data"></span>  
    
## <span style="color: #4CACBC;"> Creating the folder, downloading data and so on</span>  


### <span style="color: #4CACBD;"> 1. Data </span>

We will analyse freshwater samples from Urban et al. (2021; https://elifesciences.org/articles/61504).  

In [None]:
cd ~/projet_BS
mkdir -p Data
cd Data
# Download the compressed freshwater samples from ENA project PRJEB34900 (runs ERR3806859_1-ERR3806892_1): https://www.ebi.ac.uk/ena/browser/view/PRJEB34900
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/009/ERR3806859/ERR3806859_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/006/ERR3806876/ERR3806876_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/000/ERR3806860/ERR3806860_1.fastq.gz
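Interrupted FTP transfers can leave truncated archives, so it may be worth checking the files before going further. The sketch below is not part of the original workflow: the helper names `count_fastq_reads` and `check_downloads` are hypothetical, and the read count assumes standard 4-line FASTQ records.

```shell
# count_fastq_reads FILE.fastq.gz
# Prints the number of reads, assuming 4 lines per FASTQ record.
count_fastq_reads() {
    echo $(( $(gzip -cd "$1" | wc -l) / 4 ))
}

# check_downloads FILE...
# Tests the gzip integrity of each file and reports its read count.
check_downloads() {
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "$f OK $(count_fastq_reads "$f") reads"
        else
            echo "$f CORRUPT"
        fi
    done
}
```

For example, `check_downloads ~/projet_BS/Data/*.fastq.gz` prints one line per downloaded file; any file flagged `CORRUPT` should be deleted and downloaded again.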

## <span style="color: #4CACBC;"> 2. Basecalling </span>  

During sequencing, the raw electrical signals are stored in fast5 files.

These signals must be converted (basecalled) into standard fastq files before downstream analysis.

Basecallers rely on trained models to convert fast5 to fastq. 
For this job we used Guppy.  
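The data used here are already basecalled, so no Guppy command appears in this notebook. As an illustration only, a typical call could look like the sketch below. The directory names and the model configuration file are assumptions, not values taken from this project; the function prints the command instead of running it so that it can be checked first.

```shell
# Hypothetical paths and model; adjust to your own run.
FAST5_DIR="fast5_dir"                      # assumption: folder holding the fast5 files
FASTQ_DIR="basecalled_fastq"               # assumption: folder for the fastq output
GUPPY_CONFIG="dna_r9.4.1_450bps_hac.cfg"   # assumption: depends on flowcell and kit

# basecall_cmd INPUT_DIR OUTPUT_DIR CONFIG
# Prints the guppy_basecaller command line (dry run).
basecall_cmd() {
    echo guppy_basecaller \
        --input_path "$1" \
        --save_path "$2" \
        --config "$3" \
        --compress_fastq
}

basecall_cmd "$FAST5_DIR" "$FASTQ_DIR" "$GUPPY_CONFIG"
```

Once the printed command looks right, it can be pasted into the terminal to start basecalling.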

## <span style="color: #4CACBC;"> 3. Quality Control on Long Reads with NanoPlot </span>  

Check read quality with NanoPlot. The available options are listed with --help.

In [None]:
NanoPlot --help

In [None]:
# NanoPlot 1.44.0 must be installed before running this cell
NanoPlot -t 8 --fastq ~/projet_BS/Data/*fastq.gz --outdir NANOPLOT_files

In [None]:
# Test on one sample
NanoPlot --fastq /home/faiza/Downloads/projet_turore/database_initial/ERR3806875_1.fastq.gz --outdir ./output

In [None]:
# Run on all samples
NanoPlot --fastq /home/faiza/Downloads/projet_turore/database_initial/ERR38068*.fastq.gz --outdir ./output_combined

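Inspect the summary statistics and plots written to the output directory (number of reads, read length distribution, read quality).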
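To cross-check the figures reported by NanoPlot, the basic length statistics can also be computed directly from a fastq file. This is a minimal sketch, not part of the original workflow: the function name `fastq_length_stats` is hypothetical, and it assumes 4-line FASTQ records (no wrapped sequence lines).

```shell
# fastq_length_stats FILE.fastq.gz
# Prints read count, total bases, mean read length and N50.
fastq_length_stats() {
    gzip -cd "$1" \
        | awk 'NR % 4 == 2 { print length($0) }' \
        | sort -rn \
        | awk '{ len[NR] = $1; total += $1 }
               END {
                   if (NR == 0) { print "reads=0"; exit }
                   half = total / 2
                   cum = 0
                   # N50: length of the read at which the cumulative sum
                   # (longest reads first) reaches half of all bases
                   for (i = 1; i <= NR; i++) {
                       cum += len[i]
                       if (cum >= half) { n50 = len[i]; break }
                   }
                   printf "reads=%d bases=%d mean=%.1f n50=%d\n", NR, total, total / NR, n50
               }'
}
```

Running `fastq_length_stats ~/projet_BS/Data/ERR3806859_1.fastq.gz` should give values close to those in the NanoPlot summary for the same file.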

## <span style="color: #4CACBC;"> Use KRAKEN2 for taxonomic assignment<a class="anchor" id="kraken2"> </span>


### <span style="color: #4CACBC;"> 3.1. Download the SILVA 16S rRNA database<a class="anchor" id="viraldb"> </span>

In [None]:
kraken2 --help

In [None]:
# Download and build the SILVA 16S database in the kraken_database folder
kraken2-build --special silva --db kraken_database


In [None]:
# Inspect the database content
kraken2-inspect --db kraken_database | head -15

### <span style="color: #4CACBC;"> 3.2. run Kraken2 <a class="anchor" id="kraken"> </span>

In [None]:
kraken2 --db kraken_database --report report.txt --report-minimizer-data --output output_kraken /projects/medium/CIBIG_metagenomic_eaux/RAW_DATA/FASTQ_DIR/ERR38068*.fastq.gz
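The command above pools every sample into a single report. If one report per sample is preferred, a loop such as the following sketch could be used. The function name `run_kraken2_per_sample`, the `kraken2_exec` variable and the output file names are illustrative assumptions.

```shell
kraken2_exec="${kraken2_exec:-kraken2}"   # Kraken2 executable (assumed to be in the PATH)

# run_kraken2_per_sample DB OUT_DIR FASTQ...
# Classifies each FASTQ file separately and writes one report per sample.
run_kraken2_per_sample() {
    db=$1
    out_dir=$2
    shift 2
    mkdir -p "$out_dir"
    for fq in "$@"; do
        sample=$(basename "$fq" .fastq.gz)
        "$kraken2_exec" --db "$db" \
            --report "$out_dir/${sample}.report.txt" \
            --report-minimizer-data \
            --output "$out_dir/${sample}.kraken" \
            "$fq" || return 1
    done
}
```

A possible call is `run_kraken2_per_sample kraken_database kraken_results /projects/medium/CIBIG_metagenomic_eaux/RAW_DATA/FASTQ_DIR/ERR38068*.fastq.gz`, where `kraken_results` is a hypothetical output folder.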

### <span style="color: #4CACBC;"> 3.3. Visualise Kraken2 output with Krona<a class="anchor" id="krakenkrona"> </span>

In [None]:
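# report.txt was written with --report-minimizer-data, which adds two columns,
# so the taxonomy ID is in column 7 (column 5 in a standard Kraken2 report)
ktImportTaxonomy -m 3 -t 7 report.txt -o kraken.html 2> krakenkrona.err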
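Besides the Krona chart, the Kraken2 report can be summarised on the command line. The sketch below lists the most abundant taxa at a given rank. The function name `top_taxa` is hypothetical; it reads the rank code and the name from the end of each line, so it should work with or without the two extra minimizer columns.

```shell
# top_taxa REPORT [RANK_CODE] [N]
# Prints the N taxa with the most clade-level reads at the given rank
# (default: genus "G", top 10) as "reads<TAB>name".
top_taxa() {
    report=$1
    rank=${2:-G}
    n=${3:-10}
    awk -F '\t' -v rank="$rank" '
        $(NF - 2) == rank {
            name = $NF
            sub(/^ +/, "", name)      # remove the indentation of the name column
            print ($2 + 0) "\t" name  # column 2: reads in the clade
        }
    ' "$report" | sort -k1,1nr | head -n "$n"
}
```

For instance, `top_taxa report.txt G 20` prints the 20 genera with the most assigned reads.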

## <span style="color: #4CACBC;"> 4. Use Diamond for taxonomic assignment<a class="anchor" id="diamond"> </span>

### <span style="color: #4CACBC;"> 4.1. Download the bacterial database<a class="anchor" id="bacteriadbdiamond"> </span>

In [None]:
# Set paths to your directories and files
input_dir="/projects/medium/CIBIG_metagenomic_eaux/RAW_DATA/FASTQ_DIR"      # Directory containing the FASTQ files
output_dir="/home/sorgho/output_dir/DIAMOND_Results"      # Directory where the DIAMOND results will be saved
diamond_db="/diamond/database"    # Path to your DIAMOND protein database
diamond_exec="diamond"                    # DIAMOND executable (assuming it's in the PATH)
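The notebook does not show how the DIAMOND database was created. As a sketch under stated assumptions, a protein FASTA file can be formatted with `diamond makedb`, and the presence of the database can be checked before launching the alignments. The helper names and the FASTA file name in the usage note are hypothetical.

```shell
diamond_exec="${diamond_exec:-diamond}"   # DIAMOND executable (assumed to be in the PATH)

# build_diamond_db PROTEINS.faa DB_PREFIX
# Formats a protein FASTA file as a DIAMOND database (writes DB_PREFIX.dmnd).
build_diamond_db() {
    "$diamond_exec" makedb --in "$1" --db "$2"
}

# diamond_db_exists DB_PREFIX
# Succeeds if DB_PREFIX.dmnd (or DB_PREFIX itself) is a non-empty file.
diamond_db_exists() {
    [ -s "$1.dmnd" ] || [ -s "$1" ]
}
```

For example, `diamond_db_exists "$diamond_db" || build_diamond_db bacteria_proteins.faa "$diamond_db"` only builds the database when it is missing; `bacteria_proteins.faa` stands for whatever protein collection was downloaded.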

### <span style="color: #4CACBC;"> 4.2. Run Diamond<a class="anchor" id="bacteriadbdiamond"> </span>

In [1]:
mkdir -p "$output_dir"

# Loop through each FASTQ file in the input directory
for fastq_file in "$input_dir"/*.fastq.gz; do
    # Extract the base name of the FASTQ file (without path and extension)
    filename=$(basename "$fastq_file" .fastq.gz)

    # Define output files for DIAMOND results:
    # tabular hits (--outfmt 6) and the run log
    diamond_output="$output_dir/${filename}_diamond_output.tsv"
    diamond_report="$output_dir/${filename}_diamond_report.txt"

    # Run DIAMOND alignment (using blastx for nucleotide-to-protein alignment).
    # The verbose log (stdout and stderr) is captured in the report file.
    echo "Running DIAMOND for $fastq_file ..."
    if "$diamond_exec" blastx \
        --db "$diamond_db" \
        --query "$fastq_file" \
        --out "$diamond_output" \
        --outfmt 6 \
        --threads 4 \
        --more-sensitive \
        --verbose > "$diamond_report" 2>&1; then
        echo "DIAMOND analysis completed for $fastq_file. Results saved to $diamond_output and $diamond_report."
    else
        echo "Error: DIAMOND failed for $fastq_file. See $diamond_report."
    fi
done

echo "All FASTQ files have been processed."

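A read can match several proteins, so the tabular output (`--outfmt 6`) is often reduced to one best hit per read before interpretation. The sketch below is an optional post-processing step, not part of the original workflow; the function names are hypothetical, and it relies on the default 12-column layout in which column 1 is the read, column 2 the matched protein and column 12 the bit score.

```shell
# best_hits DIAMOND_TABLE
# Keeps, for each read, the alignment with the highest bit score.
best_hits() {
    sort -t "$(printf '\t')" -k1,1 -k12,12nr "$1" | awk -F '\t' '!seen[$1]++'
}

# count_subjects DIAMOND_TABLE
# Counts how many reads have each protein as their best hit, most frequent first.
count_subjects() {
    best_hits "$1" | cut -f2 | sort | uniq -c | sort -k1,1nr
}
```

Calling `count_subjects` on one of the per-sample DIAMOND output tables gives a quick view of the proteins that attract the most reads.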

## <span style="color: #4CACBC;"> 5. Use Canu for Assembly<a class="anchor" id="diamond"> </span>

### <span style="color: #4CACBC;"> 5.1. Define variables<a class="anchor" id="bacteriadbdiamond"> </span>

In [None]:
# Define variables
INPUT_DIR="/projects/medium/CIBIG_metagenomic_eaux/RAW_DATA/FASTQ_DIR"  # Replace with actual path
GENOME_SIZE="2.0m"          # Estimated genome size for a metagenomic project. Adjust if necessary.
OUTPUT_ROOT_DIR="output_dir"  # Directory where results will be saved.
PREFIX="16S_assembly"         # Output file prefix.
THREADS=4                     # Number of threads (adjust according to your hardware).

### <span style="color: #4CACBC;"> 5.2. Run Canu<a class="anchor" id="bacteriadbdiamond"> </span>

In [None]:
# Loop to process each FASTQ file in the input directory
for fastq_file in "$INPUT_DIR"/*.fastq.gz; do
  # Extract file name without extension
  filename=$(basename "$fastq_file" .fastq.gz)

  # Define the specific output directory for each FASTQ file
  OUTPUT_DIR="$OUTPUT_ROOT_DIR/${filename}_16S_assembly"

  # Create output directory if necessary
  mkdir -p "$OUTPUT_DIR"

  # Run Canu for each FASTQ file
  echo "Processing $fastq_file ..."
  if canu -p "$PREFIX" -d "$OUTPUT_DIR" genomeSize="$GENOME_SIZE" maxThreads="$THREADS" -nanopore-raw "$fastq_file"; then
    echo "Analysis complete for $fastq_file. Results in $OUTPUT_DIR."
  else
    echo "Error: Canu failed for $fastq_file."
  fi
done

echo "All FASTQ files have been processed."

# Optional: monitor logs in real time during execution
# Run this from a second terminal while Canu is running to track the assembly logs
tail -f "$OUTPUT_ROOT_DIR"/*/*.log
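Once the loop has finished, it is useful to see which samples actually produced contigs. Canu writes the assembled contigs to `<prefix>.contigs.fasta` inside each output directory; the sketch below counts them per sample. The function name `summarise_assemblies` is hypothetical.

```shell
# summarise_assemblies OUTPUT_ROOT_DIR PREFIX
# Prints "<sample directory><TAB><number of contigs>" for each assembly folder.
summarise_assemblies() {
    root=$1
    prefix=$2
    for d in "$root"/*/; do
        [ -d "$d" ] || continue
        contigs="${d}${prefix}.contigs.fasta"
        if [ -s "$contigs" ]; then
            printf '%s\t%d\n' "$(basename "$d")" "$(grep -c '^>' "$contigs")"
        else
            printf '%s\tno contigs\n' "$(basename "$d")"
        fi
    done
}
```

With the variables defined above, the call would be `summarise_assemblies "$OUTPUT_ROOT_DIR" "$PREFIX"`.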