# 1. Ambiguous nucleotides - removing sequencing errors
- check whether they are abundant in certain sequences (more than 1% could point to sequencing errors => not good to use)
- check whether there is a big loss if we delete sequences with 1 or more ambiguous nucleotides
- what if we just delete the ambiguous nucleotides within the sequences but don't delete the entire sequence?\
If we are able to delete these without losing a lot of important data (e.g. the distribution remains the same, no big loss of individual species), it would simplify the one-hot encoding (OHE) and kmers a lot (because only ATGC instead of ATGCRYSWKMBDHVN)
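
The third option above (removing only the ambiguous nucleotides and keeping the rest of the sequence) could look roughly like the sketch below. It assumes the `SEQ` and `LENGTH` columns used elsewhere in this notebook; the helper name `strip_ambiguous` and the toy data are hypothetical.

```python
import pandas as pd


def strip_ambiguous(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every non-ACGT character is removed from SEQ."""
    out = df.copy()
    # Upper-case first so lower-case bases are not mistaken for ambiguous ones
    out['SEQ'] = out['SEQ'].str.upper().str.replace(r'[^ACGT]', '', regex=True)
    # LENGTH has to be recomputed, because the sequences may have become shorter
    out['LENGTH'] = out['SEQ'].str.len()
    return out


# Toy data, only to illustrate the behaviour
example = pd.DataFrame({'SEQ': ['ACGTNNAC', 'ACGT', 'RYACGW']})
example['LENGTH'] = example['SEQ'].str.len()
cleaned = strip_ambiguous(example)
print(cleaned)
```

One thing to keep in mind with this option: deleting a base joins its two neighbours, so kmers spanning the deleted position do not occur in the real sequence.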

In [None]:
import pandas as pd

In [None]:
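def how_many_ambi(dataframe):
    # Note: despite its name, `dataframe` is the path to a CSV file with SEQ and LENGTH columns
    df = pd.read_csv(dataframe)
    # IUPAC codes that stand for more than one possible nucleotide
    searchfor = ['R', 'Y', 'S', 'W', 'K', 'M', 'B', 'D', 'H', 'V', 'N']
    # The joined list is used as a regular expression ('R|Y|...'), so every match is counted
    df['AMBI'] = df['SEQ'].str.count('|'.join(searchfor))
    # AMBIPERC is a fraction between 0 and 1 (0.01 corresponds to 1%)
    df['AMBIPERC'] = df['AMBI'] / df['LENGTH']
    print('The number and the fraction of ambiguous nucleotides were added to the dataframe')
    return df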


def too_much_ambi(dataframe, threshold=0.01):
    # Keep only sequences whose fraction of ambiguous nucleotides is below the threshold
    df = dataframe[dataframe.AMBIPERC < threshold]
    print(f'Sequences containing {threshold:.2%} or more ambiguous nucleotides were deleted from dataframe')
    return df


def drop_all_ambi(dataframe):
    # Keep only sequences without any ambiguous nucleotide
    df = dataframe[dataframe.AMBIPERC == 0.0]
    return df
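
To judge how big the loss is for each filtering option, the number of sequences that survive could be counted before anything is deleted. The following is a minimal sketch under the assumption that the data has a `SEQ` column; `ambiguity_loss` and the toy sequences are hypothetical and not part of the pipeline above.

```python
import pandas as pd

AMBIGUOUS = 'RYSWKMBDHVN'


def ambiguity_loss(df: pd.DataFrame, threshold: float = 0.01) -> dict:
    """Count how many sequences each filtering option would keep."""
    # Number of ambiguous nucleotides per sequence
    ambi = df['SEQ'].str.upper().str.count(f'[{AMBIGUOUS}]')
    # Fraction of ambiguous nucleotides per sequence
    frac = ambi / df['SEQ'].str.len()
    return {
        'total': len(df),
        'kept_below_threshold': int((frac < threshold).sum()),
        'kept_without_ambiguity': int((ambi == 0).sum()),
    }


# Toy data: 0, 1, 4 and 0 ambiguous nucleotides
toy = pd.DataFrame({'SEQ': ['ACGT' * 50, 'ACGT' * 50 + 'N', 'NNNNACGT', 'ACGTACGT']})
report = ambiguity_loss(toy)
print(report)
```

Comparing `kept_below_threshold` and `kept_without_ambiguity` with `total` shows how much stricter the second option is on a given dataset.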

# 2. Length - removing too long or too short ITS
Technical standpoint: neural networks require inputs of the same size, which is why we e.g. extract features with kmers (the resulting feature vectors all have the same size). For OHE, we need to pad the sequences with an ambiguous nucleotide so that they all have the same length (this could be explained more).\
Biological standpoint: the average length of the ITS region is 550bp(?); ITS regions much shorter or much longer than that may not hold any biological meaning.\
This is why we check the distribution of the ITS lengths, where to cut (min and max) and how much we lose if we cut at min and max.
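
To make the technical standpoint concrete, here is a small sketch of both representations. It is an illustration only: the function names, the choice of `N` as padding symbol and the mapping of all other ambiguous codes onto `N` are assumptions, not the encoding used in this project.

```python
from itertools import product

import numpy as np


def kmer_counts(seq: str, k: int = 3) -> np.ndarray:
    """Count each of the 4**k ACGT kmers; kmers containing other letters are skipped."""
    index = {''.join(p): i for i, p in enumerate(product('ACGT', repeat=k))}
    counts = np.zeros(len(index), dtype=int)
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:
            counts[position] += 1
    return counts


def one_hot_padded(seq: str, length: int, alphabet: str = 'ACGTN') -> np.ndarray:
    """One-hot encode seq, padded with 'N' (or truncated) to exactly `length` positions."""
    padded = seq[:length].ljust(length, 'N')
    lookup = {base: i for i, base in enumerate(alphabet)}
    encoded = np.zeros((length, len(alphabet)), dtype=int)
    for pos, base in enumerate(padded):
        # Assumption: any symbol outside the alphabet is treated as 'N'
        encoded[pos, lookup.get(base, lookup['N'])] = 1
    return encoded


features = kmer_counts('ACGTAC', k=3)     # fixed size 4**3, whatever the sequence length
matrix = one_hot_padded('ACG', length=5)  # fixed size length x alphabet
print(features.shape, matrix.shape)
```

The kmer vector has the same size for every sequence by construction, while the OHE matrix only has a fixed size because of the padding, which is why the maximum length matters for OHE.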

In [None]:
import pandas as pd
import matplotlib.pyplot as plt


def length_stats(filecsv):
    df = pd.read_csv(filecsv)
    print(f"Length of shortest sequence: {df['LENGTH'].min()}")
    print(f"Length of longest sequence: {df['LENGTH'].max()}")
    print(f"Average length of sequences: {df['LENGTH'].mean():0.2f}")
    upper = df[df.LENGTH <= 700]
    lowerupper = upper[upper.LENGTH >= 400]
    print(f"There are {lowerupper.shape[0]} sequences with lengths between 400 and 700.")
    print(f"This is {lowerupper.shape[0]/df.shape[0]*100:0.2f}% of the original dataset.")

    # Project settings for plots
    plt.style.use('bmh')
    plt.rcParams['font.family'] = 'serif'
    plt.rcParams['font.serif'] = 'UGent Panno Text'
    plt.rcParams['font.monospace'] = 'UGent Panno Text'
    plt.rcParams['font.size'] = 10
    plt.rcParams['axes.labelsize'] = 10
    plt.rcParams['axes.labelweight'] = 'bold'
    plt.rcParams['axes.titlesize'] = 10
    plt.rcParams['xtick.labelsize'] = 8
    plt.rcParams['ytick.labelsize'] = 8
    plt.rcParams['legend.fontsize'] = 10
    plt.rcParams['figure.titlesize'] = 12
    # For main: (UGent blue: color='#1E64C8')
    # For accents: (UGent yellow: color='#FFD200', linestyle='dashed', linewidth=2)
    # Set an aspect ratio
    width, height = plt.figaspect(4)
    fig = plt.figure(figsize=(width, height), dpi=400)

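    # Plot only the LENGTH column, so that other numeric columns cannot end up in the histogram
    df[['LENGTH']].plot.hist(color='#1E64C8', bins=100, align='right')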
    plt.title(f'Distribution of the lengths of ITS sequences from QIIME release with {filecsv[9:-10]}% similarity threshold')
    plt.xlabel('Length of ITS')
    plt.xticks(range(0, 1501, 50), rotation='vertical')
    plt.axvline(x=400, color='#FFD200', linestyle='dashed', linewidth=2)
    plt.gca().get_xticklabels()[8].set_weight('bold')
    plt.axvline(x=700, color='#FFD200', linestyle='dashed', linewidth=2)
    plt.gca().get_xticklabels()[14].set_weight('bold')
    # svg ensures no loss of quality when enlarging
    plt.savefig(f'LengthDistribution{filecsv[9:-10]}.svg')
    plt.show()


length_stats('Taxonomy_97_final.csv')
length_stats('Taxonomy_99_final.csv')
length_stats('Taxonomy_dynamic_final.csv')

![LengthDistribution97](https://user-images.githubusercontent.com/127412115/236632302-41945a7c-6f5d-494f-9110-4c3d61741841.svg)\
Length of shortest sequence: 140\
Length of longest sequence: 1492\
Average length of sequences: 552.11\
There are 49390 sequences with lengths between 400 and 700.\
This is 94.69% of the original dataset.

![LengthDistribution99](https://user-images.githubusercontent.com/127412115/236632323-b30fc254-ee90-4248-a2fe-3c6991aa9034.svg)\
Length of shortest sequence: 142\
Length of longest sequence: 1501\
Average length of sequences: 542.38\
There are 99509 sequences with lengths between 400 and 700.\
This is 96.00% of the original dataset.

![LengthDistributiondynamic](https://user-images.githubusercontent.com/127412115/236632336-66ec2312-a899-4022-b879-c7fd56e5ea0c.svg)\
Length of shortest sequence: 140\
Length of longest sequence: 1501\
Average length of sequences: 544.77\
There are 80441 sequences with lengths between 400 and 700.\
This is 95.56% of the original dataset.
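
The numbers above all use the fixed 400 to 700 window. To compare several candidate cut-offs, the retained fraction could be computed for a list of windows, as in this sketch. The helper `retained_fraction`, the toy lengths and the candidate windows are hypothetical; in practice `lengths` would be the `LENGTH` column of one of the CSV files.

```python
import pandas as pd


def retained_fraction(lengths: pd.Series, lower: int, upper: int) -> float:
    """Fraction of sequences whose length lies in [lower, upper], bounds included."""
    return lengths.between(lower, upper).mean()


# Toy lengths, only to illustrate the calculation
lengths = pd.Series([140, 390, 400, 550, 700, 710, 1501])
for lower, upper in [(400, 700), (350, 750), (300, 800)]:
    print(f'{lower}-{upper}: {retained_fraction(lengths, lower, upper):.2%} retained')
```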

# 3. Representation - making the dataset more balanced (- Augmentation?)
To make the dataset more balanced:
- working with a minimum requirement (e.g. a species needs to be represented by at least 5 sequences)
- working with a maximum requirement (e.g. for overrepresented species, some sequences are randomly deleted to reduce their abundance)
- adding reverse complements for the underrepresented species without changing the overrepresented species\
note: overrepresented = higher than average number of sequences per species, underrepresented = lower than average number of sequences per species
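
The three ideas above could be sketched as follows. This is a rough illustration, not a final implementation: the column name `SPECIES`, the function names and the default values are assumptions, and the toy dataframe is made up.

```python
import pandas as pd

# IUPAC complements, ambiguous codes included (e.g. R = A/G pairs with Y = C/T)
COMPLEMENT = str.maketrans('ACGTRYSWKMBDHVN', 'TGCAYRSWMKVHDBN')


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def balance(df, species_col='SPECIES', min_count=5, max_count=None, seed=0):
    """Drop species with too few sequences and randomly cap species with too many."""
    per_row = df[species_col].map(df[species_col].value_counts())
    kept = df[per_row >= min_count]
    if max_count is not None:
        # Shuffle, then keep the first max_count rows of every species
        shuffled = kept.sample(frac=1, random_state=seed)
        kept = shuffled[shuffled.groupby(species_col).cumcount() < max_count]
    return kept


def add_reverse_complements(df, species_col='SPECIES'):
    """Append reverse complements for species below the average sequence count."""
    counts = df[species_col].value_counts()
    under = df[df[species_col].map(counts) < counts.mean()].copy()
    under['SEQ'] = under['SEQ'].map(reverse_complement)
    return pd.concat([df, under], ignore_index=True)


# Toy data: species a has 6 sequences, b has 2, c has 1
toy = pd.DataFrame({'SPECIES': list('aaaaaabbc'), 'SEQ': ['AACG'] * 9})
balanced = balance(toy, min_count=2, max_count=3)
augmented = add_reverse_complements(toy)
print(balanced['SPECIES'].value_counts())
print(len(augmented))
```

Adding reverse complements at most doubles the number of sequences of a species, so species far below the average remain underrepresented after this step.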