# 1. Collecting the data
The data come from the QIIME release on the UNITE website. The following text is copied from the UNITE website and is a good example of what we could write in the thesis:

*Three sets of QIIME files are released, corresponding to the SHs resulting from clustering at the 3% distance (97% similarity) and 1% distance (99% similarity) threshold levels. The third set of files is the result of a dynamic use of clustering thresholds, such that some SHs are delimited at the 3% distance level, some at the 2.5% distance level, some at the 2% distance level, and so on; these choices were made manually by experts of those particular lineages of fungi. The syntax is the same throughout the three sets of files.*

*Each SH is given a stable name of the accession number type, here shown in the FASTA file of the dynamic set:*

\>SH099456.05FU_FJ357315_refs 
CACAATATGAAGGCGGGCTGGCACTCCTTGAGAGGACCGGC…

*SH099456 = accession number of the SH; 05FU = global key release 5, organism group FUngi; FJ357315 = GenBank/UNITE accession number of sequence chosen to represent the SH; refs = this is a manually designated RefS (reps = this is an automatically chosen RepS)*

*In the corresponding text file, the classification string of the SH is found:*

*SH099456.05FU_FJ357315_refs k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae;g__Embellisia;s__Embellisia_planifunda*

*This specifies the hierarchical classification of the sequence. k = kingdom; p = phylum ; c = class ; o = order ; f = family ; g = genus ; and s = species. Missing information is indicated as "unidentified" item; “f__unidentified;” means that no family name for the sequence exists.*
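
As a minimal sketch of how the two strings described above can be taken apart in Python: the function names `parse_sh_header` and `parse_taxonomy` are hypothetical and only serve as illustration, and the layout is assumed to be exactly the one shown in the UNITE text.

```python
# Rank prefixes as listed in the UNITE description
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}


def parse_sh_header(header):
    """Split e.g. 'SH099456.05FU_FJ357315_refs' into its four parts."""
    sh_part, rest = header.split("_", 1)       # 'SH099456.05FU', 'FJ357315_refs'
    accession, kind = rest.rsplit("_", 1)      # 'FJ357315', 'refs'
    sh_accession, release = sh_part.split(".")  # 'SH099456', '05FU'
    return {"sh": sh_accession, "release": release,
            "accession": accession, "type": kind}


def parse_taxonomy(classification):
    """Turn 'k__Fungi;p__Ascomycota;...' into a dict keyed by rank name."""
    taxonomy = {}
    for field in classification.split(";"):
        prefix, _, name = field.partition("__")
        taxonomy[RANKS[prefix]] = name
    return taxonomy


header_info = parse_sh_header("SH099456.05FU_FJ357315_refs")
taxonomy = parse_taxonomy(
    "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;"
    "f__Pleosporaceae;g__Embellisia;s__Embellisia_planifunda"
)
print(header_info)
print(taxonomy)
```

Missing ranks would show up here as the literal string `unidentified`, as described in the text above.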

- something more about which version is used (the most recent one) and why
- something about the contents of the download (3 sets in the QIIME release, each having 1 FASTA file and 1 txt file, kind of like the explanation above)
- maybe some more explanation about what each file (FASTA and txt) contains, kind of like the explanation above but written out as running text

# Saving the data in csv files
- Because we find CSV files easier to handle than the txt and FASTA files separately
- CSV files are easy to manipulate and visualise with pandas
- Working with all 3 releases (99%, 97% and dynamic thresholds) because I'm not sure yet which one to use, so I'll convert all of them. If we end up finding a good reason to choose one over the others, we can still delete the rest

**CSV file contents**

Taxonomy_XX.csv is a CSV file with all the information from the FASTA and txt files of a QIIME release, plus the length of each sequence.

It has the following columns: ['ACCNUM', 'KINGDOM', 'PHYLUM', 'CLASS', 'ORDER', 'FAMILY', 'GENUS', 'SPECIES', 'SEQ', 'LENGTH']

A function was written that takes the following steps (basically what the code below does):
1. Loading the txt file with pandas `read_csv`
2. Removing the prefixes from each column (k__, p__, ...)
3. Replacing entries of the form XX_XX_Incertae_sedis with NaN for easier filtering with pandas, as we will delete them anyway (for supervised training, we need data with the right labels)
4. Creating a list of sequences from the FASTA file (note: the txt file has the same order as the FASTA file, so the order is preserved when the list is added to the dataframe)
5. Adding column 'SEQ' with the sequences to the dataframe
6. Adding column 'LENGTH' with the lengths of the sequences, for easy visualisation later on; the sequences are read with Biopython's `Bio.SeqIO` module, after which the lengths are quick to compute (maybe a short intermezzo about Biopython can be added here?)
7. Saving the pandas dataframe to a CSV file with `to_csv`

The same steps are applied to each QIIME set, so 3 times in total.
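
The steps above can be sketched on a small in-memory example. This is only an illustration under assumptions: the two records, their accession numbers and their sequences are made up, and a small hand-written FASTA reader (`read_fasta`, a hypothetical helper) stands in for Biopython so that the snippet runs with pandas alone.

```python
import io

import pandas as pd

# Two made-up records in the layout of the QIIME release (illustrative data only)
TAXONOMY_TXT = (
    "SH0000001.09FU_AB000001_refs\tk__Fungi;p__Ascomycota;c__Dothideomycetes;"
    "o__Pleosporales;f__Pleosporaceae;g__Embellisia;s__Embellisia_planifunda\n"
    "SH0000002.09FU_AB000002_reps\tk__Fungi;p__Ascomycota;c__Dothideomycetes;"
    "o__Pleosporales;f__Pleosporales_fam_Incertae_sedis;"
    "g__Pleosporales_gen_Incertae_sedis;s__Pleosporales_sp\n"
)
FASTA = (
    ">SH0000001.09FU_AB000001_refs\nCACAATATGAAGG\n"
    ">SH0000002.09FU_AB000002_reps\nACGTAC\nGTA\n"
)


def read_fasta(handle):
    """Return the sequences of a FASTA file as plain strings, in file order."""
    sequences, current = [], None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                sequences.append("".join(current))
            current = []
        elif line and current is not None:
            current.append(line)
    if current is not None:
        sequences.append("".join(current))
    return sequences


header = ["ACCNUM", "KINGDOM", "PHYLUM", "CLASS", "ORDER", "FAMILY", "GENUS", "SPECIES"]

# Step 1: load the taxonomy text, splitting on tab and semicolon
df = pd.read_csv(io.StringIO(TAXONOMY_TXT), sep=r"[\t;]", engine="python", names=header)

for column in header[1:]:
    # Step 2: remove the rank prefix (k__, p__, ...) at the start of the value only
    df[column] = df[column].str.replace(r"^[a-z]__", "", regex=True)
    # Step 3: replace Incertae_sedis entries with NaN
    df[column] = df[column].mask(df[column].str.contains("Incertae_sedis", na=False))

# Steps 4-6: add the sequences (same order as the txt file) and their lengths
df["SEQ"] = read_fasta(io.StringIO(FASTA))
df["LENGTH"] = df["SEQ"].str.len()

print(df[["ACCNUM", "FAMILY", "SPECIES", "LENGTH"]])
```

Step 7 would then be a single `df.to_csv(...)` call.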



In [None]:
import pandas as pd
from Bio import SeqIO
import matplotlib.pyplot as plt

In [None]:
def QIIME_to_CSV(filetxt, filefasta):
    """Combine a QIIME taxonomy txt file and its FASTA file into Taxonomy_XX.csv."""
    # Loading taxonomy data into a dataframe with explicit column names
    header = ['ACCNUM', 'KINGDOM', 'PHYLUM', 'CLASS', 'ORDER', 'FAMILY', 'GENUS', 'SPECIES']
    DataframeAll = pd.read_csv(filetxt,
                               delimiter=r'[\t;]',
                               engine='python',
                               names=header)

    # Removing prefixes from each taxonomic column (k__, p__, ...)
    # Note: anchored regexes are used because str.lstrip/str.rstrip remove a *set of
    # characters*, so rstrip('_sp') would also cut the final 's' off a species name
    for column in header[1:]:
        prefix = f'{column[0].lower()}__'
        DataframeAll[column] = DataframeAll[column].str.replace(f'^{prefix}', '', regex=True)
    DataframeAll['SPECIES'] = DataframeAll['SPECIES'].str.replace(r'_sp$', '', regex=True)

    # Replacing XX_XX_Incertae_sedis with a real missing value (not the string 'NaN')
    for column in header[1:]:
        is_incertae = DataframeAll[column].str.contains('sedis', na=False)
        DataframeAll.loc[is_incertae, column] = pd.NA

    # Create list of sequences as plain strings
    # Note: Fasta file has same order as txt file
    sequences = [str(record.seq) for record in SeqIO.parse(filefasta, 'fasta')]

    # Adding column 'SEQ' with sequences to dataframe
    # Adding column 'LENGTH' with lengths of sequences
    DataframeAll['SEQ'] = sequences
    DataframeAll['LENGTH'] = DataframeAll['SEQ'].str.len()

    # Saving csv file for further use
    # Note: the slice relies on the file name pattern sh_refs_qiime_ver9_XX_29.11.2022.fasta
    set_name = str(filefasta)[19:-17]
    DataframeAll.to_csv(f'Taxonomy_{set_name}.csv', index=False)
    print(f'QIIME set combined to Taxonomy_{set_name}.csv')


QIIME_to_CSV('sh_taxonomy_qiime_ver9_97_29.11.2022.txt', 'sh_refs_qiime_ver9_97_29.11.2022.fasta')
QIIME_to_CSV('sh_taxonomy_qiime_ver9_99_29.11.2022.txt', 'sh_refs_qiime_ver9_99_29.11.2022.fasta')
QIIME_to_CSV('sh_taxonomy_qiime_ver9_dynamic_29.11.2022.txt', 'sh_refs_qiime_ver9_dynamic_29.11.2022.fasta')
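
One detail worth keeping in mind when trimming prefixes such as `k__` or suffixes such as `_sp` and `.csv`: `str.lstrip` and `str.rstrip` treat their argument as a set of characters, not as a literal prefix or suffix. The short sketch below shows the difference with `str.removeprefix` and `str.removesuffix` (available from Python 3.9). The helper `strip_affixes` and the example labels are hypothetical and only for illustration.

```python
def strip_affixes(name, prefix, suffix=""):
    """Remove a literal prefix and (optionally) a literal suffix from a string."""
    return name.removeprefix(prefix).removesuffix(suffix)


species = "s__Agaricus_bisporus"

# rstrip removes any trailing '_', 's' or 'p', so the final 's' is lost
stripped_as_set = species.rstrip("_sp")

# Literal removal leaves the species name intact
stripped_literal = strip_affixes(species, "s__", "_sp")

# The same pitfall applies to file names ending in a character from '.csv'
filename_as_set = "Taxonomy_dynamic.csv".rstrip(".csv")
filename_literal = "Taxonomy_dynamic.csv".removesuffix(".csv")

print(stripped_as_set, stripped_literal)
print(filename_as_set, filename_literal)
```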

# 2. Cleaning the data
Cleaning was done following these steps:
1. Check the dataset for duplicate sequences and keep only one entry of each set of duplicates, so that no duplicates remain
2. Check for rows with NaN at any of the taxonomic levels and delete these rows (for supervised training, we need data with the right labels)
3. Give back a report/table/overview of the cleaning steps
4. Save the cleaned data to a CSV file for further use (Taxonomy_XX_final.csv)
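
The effect of these steps can be sketched on a toy dataframe. The rows below are made up for illustration, and only a subset of the real columns is used. In this sketch the NaN rows are counted after the duplicates have been removed, which is one possible choice for the report.

```python
import pandas as pd

# Made-up rows: A2 duplicates the sequence of A1, A3 lacks a genus label
df = pd.DataFrame({
    "ACCNUM": ["A1", "A2", "A3", "A4"],
    "GENUS": ["Embellisia", "Embellisia", None, "Agaricus"],
    "SEQ": ["ACGT", "ACGT", "GGCC", "TTAA"],
})

# Step 1: drop rows whose sequence already occurred (the first occurrence is kept)
no_duplicates = df.drop_duplicates(subset="SEQ")

# Step 2: drop rows with a missing value in any column
cleaned = no_duplicates.dropna()

# Step 3: overview of how many rows each step removed
report = pd.DataFrame({
    "Consisting of": ["Start", "Duplicates", "Entries with NaN", "Final Dataset"],
    "Size": [len(df),
             len(df) - len(no_duplicates),
             len(no_duplicates) - len(cleaned),
             len(cleaned)],
})
print(report)
```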

In [None]:
def data_cleaning(filecsv):
    # Performing the desired cleaning steps
    df = pd.read_csv(filecsv)
    df_noduplicates = df.drop_duplicates(subset='SEQ')  # keeps the first of each duplicate
    df_final = df_noduplicates.dropna()
    # Calculating 'losses' due to cleaning (percentages are truncated, not rounded)
    s_df = len(df.index)
    s_dup = len(df.index) - len(df_noduplicates.index)
    p_dup = f'{str(s_dup / s_df * 100)[:5]}%'
    # Note: s_nan counts every row removed relative to the raw data,
    # so it also includes the duplicates counted in s_dup
    s_nan = len(df.index) - len(df_final.index)
    p_nan = f'{str(s_nan / s_df * 100)[:5]}%'
    s_final = len(df_final.index)
    p_final = f'{str(s_final / s_df * 100)[:5]}%'
    df_sizes = pd.DataFrame({'Consisting of': ["QIIME Release", "Duplicates", "Entries with NaN", "Final Dataset"],
                             'Size': [s_df, s_dup, s_nan, s_final],
                             '% to start': ['/', p_dup, p_nan, p_final]})
    # Saving the cleaned data into csv and printing the 'losses' due to cleaning
    # Note: slicing replaces rstrip(".csv"), which strips a set of characters and
    # would turn 'Taxonomy_dynamic.csv' into 'Taxonomy_dynami'
    outfile = f'{filecsv[:-4]}_final.csv'
    df_final.to_csv(outfile, index=False)
    # DataFrame.to_markdown needs the optional 'tabulate' package
    print(f'Data is cleaned and saved in {outfile}\n'
          f'Loss of cleaning[NEEDS TO BE CHANGED]:\n'
          f'{df_sizes.to_markdown()}')


data_cleaning('Taxonomy_97.csv')
data_cleaning('Taxonomy_99.csv')
data_cleaning('Taxonomy_dynamic.csv')

Data is cleaned and saved in Taxonomy_97_final.csv\
Loss of cleaning[NEEDS TO BE CHANGED]:

|    | Consisting of    |   Size | % to start   |
|---:|:-----------------|-------:|:-------------|
|  0 | QIIME Release    | 111485 | /            |
|  1 | Duplicates       |      0 | 0.0%         |
|  2 | Entries with NaN |  59327 | 53.21%       |
|  3 | Final Dataset    |  52158 | 46.78%       |

Data is cleaned and saved in Taxonomy_99_final.csv\
Loss of cleaning[NEEDS TO BE CHANGED]:

|    | Consisting of    |   Size | % to start   |
|---:|:-----------------|-------:|:-------------|
|  0 | QIIME Release    | 197557 | /            |
|  1 | Duplicates       |      2 | 0.001%       |
|  2 | Entries with NaN |  93903 | 47.53%       |
|  3 | Final Dataset    | 103654 | 52.46%       |

Data is cleaned and saved in Taxonomy_dynamic_final.csv\
Loss of cleaning[NEEDS TO BE CHANGED]:

|    | Consisting of    |   Size | % to start   |
|---:|:-----------------|-------:|:-------------|
|  0 | QIIME Release    | 162041 | /            |
|  1 | Duplicates       |      1 | 0.000%       |
|  2 | Entries with NaN |  77864 | 48.05%       |
|  3 | Final Dataset    |  84177 | 51.94%       |

# DONE => analysis can start