# 1. Ambiguous nucleotides - removing sequencing errors
- Check whether ambiguous nucleotides are abundant in certain sequences (a share above 1% may point to sequencing errors, which makes the sequence unsuitable).
- Check how much data is lost if we delete every sequence with one or more ambiguous nucleotides.
- Alternatively, delete only the ambiguous nucleotides inside each sequence instead of deleting the entire sequence.\
If we can remove them without losing much important data (for example, the distribution stays the same and few individual species are lost), the OHE and k-mer steps become much simpler, because the alphabet shrinks from ATGCRYSWKMBDHVN to ATGC.
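
The options listed above can be compared on a small example before running them on the full dataset. The following is a minimal sketch, not the final pipeline: the `SPECIES` column name, the helper names `strip_ambiguous` and `species_lost`, and the toy data are assumptions made for illustration. Note that stripping joins the flanking bases of a removed nucleotide, which changes the k-mers around that position.

```python
import pandas as pd

AMBIGUOUS = "RYSWKMBDHVN"  # IUPAC ambiguity codes


def strip_ambiguous(df, seq_col="SEQ"):
    """Return a copy in which ambiguous nucleotides are removed from each sequence."""
    out = df.copy()
    out[seq_col] = out[seq_col].str.replace(f"[{AMBIGUOUS}]", "", regex=True)
    out["LENGTH"] = out[seq_col].str.len()  # lengths change after stripping
    return out


def species_lost(before, after, species_col="SPECIES"):
    """Species present before filtering but absent afterwards."""
    return set(before[species_col]) - set(after[species_col])


# Toy data; SPECIES is a hypothetical column name
toy = pd.DataFrame({
    "SPECIES": ["A", "A", "B", "C"],
    "SEQ": ["ATGC", "ATNC", "RYGC", "ATGCAT"],
})
toy["LENGTH"] = toy["SEQ"].str.len()

# Option 1: drop every sequence that contains an ambiguous nucleotide
has_ambi = toy["SEQ"].str.contains(f"[{AMBIGUOUS}]", regex=True)
dropped = toy[~has_ambi]

# Option 2: keep every sequence but strip the ambiguous nucleotides
stripped = strip_ambiguous(toy)

print(species_lost(toy, dropped))   # species that disappear under option 1
print(stripped["SEQ"].tolist())     # sequences after option 2
```

In this toy case, option 1 removes the only sequence of species B, while option 2 keeps all species but shortens two sequences.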

In [None]:
import pandas as pd

In [None]:
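def how_many_ambi(dataframe):
    """Read the CSV file at path `dataframe` and count ambiguous nucleotides per sequence."""
    df = pd.read_csv(dataframe)
    # IUPAC ambiguity codes; joined with '|' they form a regex that matches any one of them
    searchfor = ['R', 'Y', 'S', 'W', 'K', 'M', 'B', 'D', 'H', 'V', 'N']
    df['AMBI'] = df['SEQ'].str.count('|'.join(searchfor))
    # AMBIPERC is a fraction between 0 and 1 (0.01 corresponds to 1%)
    df['AMBIPERC'] = df['AMBI'] / df['LENGTH']
    print('Counts (AMBI) and fractions (AMBIPERC) of ambiguous nucleotides were added to the dataframe')
    return df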


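def too_much_ambi(dataframe, threshold=0.01):
    """Keep only sequences whose fraction of ambiguous nucleotides is below `threshold`."""
    df = dataframe[dataframe.AMBIPERC < threshold]
    # Report the threshold that was actually used instead of a hard-coded 1%
    print(f'Sequences containing {threshold:.2%} or more ambiguous nucleotides were deleted from dataframe')
    return df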


def drop_all_ambi(dataframe):
    """Keep only sequences without any ambiguous nucleotide."""
    return dataframe[dataframe.AMBIPERC == 0.0]

# 2. Length - removing too long or too short ITS
Technical standpoint: neural networks require inputs of the same size. This is why we extract features with k-mers, for example: the resulting feature vectors all have the same size. For OHE, we need to pad the sequences with an ambiguous nucleotide so that they all have the same length (this could be explained in more detail).\
Biological standpoint: the average length of the ITS region is about 550 bp (to be verified); ITS regions that are much shorter or much longer than that may not hold any biological meaning.\
This is why we check the distribution of the ITS lengths, where to cut (minimum and maximum), and how much data we lose if we cut at those values.
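
To make the size argument concrete, the sketch below shows both routes to a fixed-size input using only the standard library. It is an illustration under assumptions, not the encoding used later: the function names `kmer_counts` and `one_hot_padded` are hypothetical, the padding symbol is assumed to be `N`, and mapping `N` (and any other non-ACGT symbol) to an all-zero row is one possible design choice among several.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_counts(seq, k=2):
    """Count overlapping k-mers; the output always has 4**k entries."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # k-mers containing non-ACGT symbols are skipped
            counts[kmer] += 1
    return [counts[kmer] for kmer in kmers]


def one_hot_padded(seq, target_len, pad="N"):
    """Pad (or truncate) to target_len, then one-hot encode over ACGT.

    The pad symbol and any other non-ACGT symbol become an all-zero row.
    """
    padded = seq[:target_len].ljust(target_len, pad)
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in padded]


short_seq, long_seq = "ATG", "ATGCATGC"

# Both k-mer vectors have 4**2 = 16 entries, regardless of sequence length
print(len(kmer_counts(short_seq)), len(kmer_counts(long_seq)))

# Both one-hot matrices have target_len rows of 4 columns
print(one_hot_padded(short_seq, 5))
```

The k-mer vector grows as 4^k, so its size depends only on `k`, whereas the one-hot matrix size depends on the chosen target length, which is why the length cut-offs matter for OHE.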

In [None]:
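import matplotlib
import matplotlib.pyplot as plt

# Show where matplotlib keeps its configuration (useful when installing custom fonts)
print(matplotlib.get_configdir())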
df = pd.read_csv("Taxonomy_97_final.csv")

print(df['LENGTH'].min())
print(df['LENGTH'].max())
print(df['LENGTH'].mean())

values = pd.Series(df['LENGTH'].value_counts(), name='values')
perc = pd.Series(df['LENGTH'].value_counts(normalize=True)*100, name='perc')
print(pd.concat([values, perc], axis=1))

# Global settings for plots
plt.style.use('bmh')
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = 'UGent Panno Text'
plt.rcParams['font.monospace'] = 'UGent Panno Text'
plt.rcParams['font.size'] = 10
plt.rcParams['axes.labelsize'] = 10
plt.rcParams['axes.labelweight'] = 'bold'
plt.rcParams['axes.titlesize'] = 10
plt.rcParams['xtick.labelsize'] = 8
plt.rcParams['ytick.labelsize'] = 8
plt.rcParams['legend.fontsize'] = 10
plt.rcParams['figure.titlesize'] = 12
# color for plot (UGent blue: color='#1E64C8')
# color for accents (UGent yellow: color='#FFD200')
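# Set an aspect ratio (figaspect(4) makes the figure 4 times taller than wide)
width, height = plt.figaspect(4)
fig = plt.figure(figsize=(width,height), dpi=400)

# Plot only the LENGTH column, on the figure created above
df['LENGTH'].plot.hist(ax=fig.gca(), color='#1E64C8', bins=94, align='right')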
plt.title('Distribution of the lengths of ITS sequences')
plt.xlabel('Length of ITS')
plt.xticks(range(0, 1501, 50), rotation='vertical')
plt.axvline(x=450, color='#FFD200', linestyle='dashed', linewidth=2)
plt.gca().get_xticklabels()[9].set_weight('bold')
plt.axvline(x=650, color='#FFD200', linestyle='dashed', linewidth=2)
plt.gca().get_xticklabels()[13].set_weight('bold')
# Save as SVG (vector format)
plt.savefig('LengthDistribution.svg')
plt.show()

[Figure: LengthDistribution.svg, distribution of the lengths of ITS sequences]
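
To quantify how much is lost at a given pair of cut-offs, a small report function can be used. This is a sketch under assumptions: the default cut-offs 450 and 650 are the values marked by the dashed lines in the plot, while the `SPECIES` column name, the function name `length_cut_report`, and the toy data are hypothetical.

```python
import pandas as pd


def length_cut_report(df, min_len=450, max_len=650, species_col="SPECIES"):
    """Summarise what is kept and lost when cutting at min_len and max_len."""
    keep = df["LENGTH"].between(min_len, max_len)  # inclusive on both ends
    kept = df[keep]
    return {
        "sequences_kept": int(keep.sum()),
        "fraction_lost": float(1 - keep.mean()),
        "species_lost": sorted(set(df[species_col]) - set(kept[species_col])),
    }


# Toy data; SPECIES is a hypothetical column name
toy = pd.DataFrame({
    "SPECIES": ["A", "A", "B", "C"],
    "LENGTH": [500, 700, 300, 650],
})

print(length_cut_report(toy))
```

Running the report for several candidate cut-offs shows whether a tighter window mainly removes outliers or removes whole species.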


# 3. Representation - making the dataset more balanced (- Augmentation?)
To make the dataset more balanced, we can:
- set a minimum requirement (e.g. a species needs at least 5 sequences);
- set a maximum requirement (e.g. randomly delete sequences of overrepresented species to reduce their abundance);
- add reverse complements for the underrepresented species, leaving the overrepresented species unchanged.\
Note: overrepresented = more sequences than the average number per species; underrepresented = fewer sequences than the average number per species.
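
The three balancing ideas can be sketched as follows. This is one possible implementation under assumptions, not a fixed design: the `SPECIES` column name, the function names, the `seed` parameter, and the toy data are hypothetical, and the complement table covers the IUPAC ambiguity codes so that it also works before ambiguous nucleotides are removed.

```python
import pandas as pd

# IUPAC complement table, including ambiguity codes
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def balance(df, min_count=5, max_count=None, species_col="SPECIES", seed=0):
    """Drop species with fewer than min_count sequences and cap the rest at max_count."""
    counts = df[species_col].map(df[species_col].value_counts())
    df = df[counts >= min_count]
    if max_count is None:
        return df
    # Shuffle, then keep the first max_count rows per species (random downsampling)
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.groupby(species_col).head(max_count)


def augment_underrepresented(df, species_col="SPECIES", seq_col="SEQ"):
    """Add reverse complements for species below the average count per species."""
    counts = df[species_col].value_counts()
    under = counts[counts < counts.mean()].index
    extra = df[df[species_col].isin(under)].copy()
    extra[seq_col] = extra[seq_col].map(reverse_complement)
    return pd.concat([df, extra], ignore_index=True)


# Toy data; SPECIES is a hypothetical column name
toy = pd.DataFrame({
    "SPECIES": ["A"] * 4 + ["B"] + ["C"] * 2,
    "SEQ": ["ATGC", "ATGG", "ATGA", "ATGT", "CCGT", "GGAT", "GGAA"],
})

balanced = balance(toy, min_count=2, max_count=3)
augmented = augment_underrepresented(balanced)
print(balanced["SPECIES"].value_counts().to_dict())
print(augmented["SPECIES"].value_counts().to_dict())
```

Adding reverse complements at most doubles the count of a species, so very rare species remain rare; the minimum requirement and the augmentation step therefore complement each other.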