# <span style="color:#006E7F">  <center> How to analyze mapping results? </center> </span>

Created by C. Tranchant (DIADE-IRD), J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and A. Dereeper (PHIM-IRD)

***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>

[PRACTICE I - Get some basic mapping stats with samtools flagstat](#mappingstats)

* [Run samtools flagstat](#flagstat)
* [Samtools flagstat output](#flagstatoutput)
* [Merge individual flagstat files into a single file with Python code](#multiflagstat)
* [Plot mapping ratio per sample](#ratioplot)
* [EXERCISE: DO THE SAME WITH MINIMAP2 RESULTS](#minimap)


***



## <span style="color:#006E7F"> PRACTICE I - Get some basic mapping stats with samtools flagstat <a class="anchor" id="mappingstats"></a></span>

### <span style="color: #4CACBC;"> First go into the directory that contains all the bam files and list the content of the directory</span>  

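Because we will mainly run Python code to analyse the mapping results, this Jupyter notebook uses the Python 3 kernel. To run Linux commands from it, prefix a single command with `%` (for the few commands available this way, such as `%cd`), or start a cell with `%%bash` to run any Linux commands. Each `%%bash` cell runs in its own shell, so a `cd` made in one cell does not carry over to the next.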
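
As a small illustration of how shell cells interact with the Python kernel, the sketch below mimics a `%%bash` cell with `subprocess`: the command runs in a separate shell process, so a `cd` inside it leaves the kernel's working directory untouched. This is only an analogy, assuming a Unix-like system; the variable names are illustrative.

```python
import os
import subprocess

# Working directory of the Python process (the notebook kernel) before the call
before = os.getcwd()

# Run `cd / && pwd` in a separate shell process, much like a %%bash cell would
result = subprocess.run("cd / && pwd", shell=True, capture_output=True, text=True)
shell_dir = result.stdout.strip()

# Working directory of the Python process after the call
after = os.getcwd()

print("Directory inside the shell :", shell_dir)
print("Kernel directory unchanged :", before == after)
```

In a notebook, `%cd some_directory` is the way to change the kernel's directory for all following cells.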

In [None]:
%%bash
cd ~/2-MAPPING
ls

### <span style="color: #4CACBC;">Grouping the flagstat files previously created</span>  

* Create the subdirectory FLAGSTAT in the directory MAPPING_ILL
* Copy the previously generated flagstat files into this new directory

In [None]:
%%bash
mkdir -p FLAGSTAT
cp *flagstat ./FLAGSTAT/

#### <span style="color: #4CACBC;">Check that the flagstat files have been correctly copied</span>
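
One possible way to do this check from Python is sketched below. The helper `list_flagstat_files` and the relative path `FLAGSTAT` are assumptions made for the example; adapt the path to wherever the subdirectory was actually created.

```python
from pathlib import Path

def list_flagstat_files(directory):
    """Return the sorted names of the files whose name ends with 'flagstat'."""
    return sorted(p.name for p in Path(directory).glob("*flagstat") if p.is_file())

# Hypothetical location: adapt it to the directory that holds FLAGSTAT
flagstat_path = Path("FLAGSTAT")
if flagstat_path.is_dir():
    names = list_flagstat_files(flagstat_path)
    print(f"{len(names)} flagstat file(s) found")
    for name in names:
        print(name)
else:
    print(f"{flagstat_path} not found: check the current directory")
```

The number of files listed should match the number of samples that were mapped.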

### <span style="color: #4CACBC;">Let's look at the content of one flagstat file <a class="anchor" id="flagstatoutput"></a></span> 
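
A minimal sketch for displaying one file from Python follows. The helper `show_file` and the `FLAGSTAT` path are hypothetical names chosen for the example, not part of samtools or of this notebook.

```python
from pathlib import Path

def show_file(path, max_lines=20):
    """Print at most max_lines lines of a text file and return them as a list."""
    with open(path) as handle:
        lines = [line.rstrip("\n") for _, line in zip(range(max_lines), handle)]
    for line in lines:
        print(line)
    return lines

# Hypothetical location: take the first flagstat file found in FLAGSTAT, if any
flagstat_path = Path("FLAGSTAT")
candidates = sorted(flagstat_path.glob("*flagstat")) if flagstat_path.is_dir() else []
if candidates:
    show_file(candidates[0])
else:
    print("No flagstat file found in FLAGSTAT")
```

Each line of a flagstat file starts with two counts, `QC-passed + QC-failed`, followed by the category they describe; some categories also report a percentage in parentheses.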

### <span style="color: #4CACBC;">Merge individual flagstat files into a single file with Python code <a class="anchor" id="multiflagstat"></a></span> 

In [None]:
# IMPORT THE PYTHON PACKAGES USED BY THE CODE
import os
import pandas as pd

# VARIABLE INITIALIZATION

## NAME OF THE DIRECTORY THAT CONTAINS FLAGSTAT FILES
flagstat_dir = "/home/jovyan/work/MAPPING-ILL/FLAGSTAT" #PUT THE DIRECTORY NAME THAT CONTAINS FLAGSTAT FILES 

## NAME OF THE FILE THAT WILL CONTAIN ALL THE FLAGSTAT RESULTS
stat_file = f"{flagstat_dir}/all_stat.csv"

# PRINT THE CONTENT OF THE 2 VARIABLES INITIALIZED ABOVE
print("DIRECTORY : ",flagstat_dir)
print("FINAL STAT FILE : ",stat_file)


In [None]:
# OPEN THE FINAL FILE IN WHICH WE WRITE SOME STATS EXTRACTED FROM EACH INDIVIDUAL FILE GENERATED BY SAMTOOLS FLAGSTAT
with open(stat_file, 'w') as stat:
    # WRITE A HEADER LINE IN OUR STAT FILE
    # (the third value extracted below is the singleton percentage, hence the column name)
    header_line = "sample,mapped,paired,singletons"
    stat.write(header_line)

    # READ EACH FILE OF THE FLAGSTAT DIRECTORY
    for file in os.listdir(flagstat_dir):
        # Only process the files with "flagstat" in their name
        if "flagstat" in file:
            filen = os.path.join(flagstat_dir, file)
            # Extract the sample name and start the new line with it
            new_line = f"\n{file.split('.')[0]},"
            # OPEN AND READ THE FLAGSTAT FILE
            with open(filen, "r") as flagstat:
                # read the file line by line
                for line in flagstat:
                    # remove the newline character at the end of the line
                    line = line.rstrip()
                    # recent samtools versions also print a "primary mapped (" line:
                    # skip it so that each sample yields exactly three values
                    if 'primary' in line:
                        continue
                    # keep only the "mapped (", "paired (" and "singleton" lines
                    if 'mapped (' in line or 'paired (' in line or 'singleton' in line:
                        # get the percentage value and save it into the variable perc
                        perc = line.split('(')[1].split('%')[0]
                        new_line += f"{perc},"
            # WRITE THE LINE ONCE THE FLAGSTAT FILE HAS BEEN COMPLETELY READ
            stat.write(new_line.rstrip(","))
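
To see what the two chained `split` calls do, here is the same extraction applied to a single line. The helper name `extract_percentage` is hypothetical, and the example line only imitates the samtools flagstat layout with made-up numbers.

```python
def extract_percentage(line):
    """Return the percentage found in a flagstat line as a float, or None if absent."""
    if "(" not in line or "%" not in line:
        return None
    # Keep the text between the first "(" and the "%" that follows it
    return float(line.split("(")[1].split("%")[0])

# Illustrative line in samtools flagstat layout (made-up numbers)
example = "19000 + 0 mapped (95.00% : N/A)"
print(extract_percentage(example))
```

`line.split("(")[1]` keeps everything after the opening parenthesis (`95.00% : N/A)`), and `.split("%")[0]` then keeps what precedes the percent sign (`95.00`).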

### <span style="color: #4CACBC;">Display the content of the final stat file <a class="anchor" id="statfile"></a></span> 
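
A minimal sketch using only the standard library is given below. The helper `read_stat_rows` and the path `FLAGSTAT/all_stat.csv` are assumptions for the example; inside this notebook the `stat_file` variable would normally be used instead.

```python
import csv
import os

def read_stat_rows(path):
    """Read a comma-separated stat file and return its non-empty rows as lists of strings."""
    with open(path, newline="") as handle:
        return [row for row in csv.reader(handle) if row]

# Hypothetical path: in this notebook it would be the stat_file variable
example_path = "FLAGSTAT/all_stat.csv"
if os.path.exists(example_path):
    for row in read_stat_rows(example_path):
        print("\t".join(row))
else:
    print(f"{example_path} not found")
```

Every row after the header should contain one sample name followed by the same number of values as there are remaining header columns.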

### <span style="color: #4CACBC;">Plot mapping ratio per sample <a class="anchor" id="ratioplot"></a></span> 

#### Load the csv file into a pandas DataFrame


In [None]:
df_bam_stat = pd.read_csv(stat_file, index_col=False, sep=",")
df_bam_stat

#### Basic stats

In [None]:
# Display only the values of the "mapped" column
print(df_bam_stat['mapped'])

In [None]:
# Display the minimum, maximum and mean of this column
minimum = df_bam_stat["mapped"].min()
maximum = df_bam_stat["mapped"].max()
mean_flag = df_bam_stat["mapped"].mean()

print("\n######## BASIC STATS\n MAPPED")       
print(f"\t%min : {minimum}\t %max : {maximum}\t %mean : {mean_flag}")


#### Sort by sample name

In [None]:
df_bam_stat_sorted=df_bam_stat.sort_values(by=['sample'])
df_bam_stat_sorted

#### Your first plot with python

In [None]:
# Plot with seaborn
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize = (15,8))
sns.scatterplot(x="sample",y="paired", data=df_bam_stat_sorted)

In [None]:
# Plot with seaborn
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize = (15,8))
ax=sns.scatterplot(x="sample",y="value", hue='variable', data=pd.melt(df_bam_stat_sorted, 'sample'))
ax.set_title("PUT YOUR TITLE")
ax.set_xlabel("PUT YOUR X-AXIS NAME")
ax.set_ylabel("PUT YOUR Y-AXIS NAME")