Trial 1
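Project structure recap:
    Function: run a single fine-tuning pass or a hyperparameter search (based on Optuna)
    Dataset:
        Genome files whose NCBI Taxonomy is Fungi were downloaded with datasets-cli, using the --reference flag
        The original file lists 5089 entries, but only 2363 were found under the 3811 download path
        For the metadata of this batch I selected these fields: Accession, GC content, Assembly level, Genome size, Organism name
        The following commands retrieve the metadata for this batch: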

        ```bash
        # Take the last path component and keep everything before the first '.'
        awk -F '/' '{print $NF}' accession_paths.txt | cut -d '.' -f1 > accessions.txt
        # Query NCBI for each accession and flatten the JSON Lines report to TSV
        datasets summary genome accession --inputfile accessions.txt --as-json-lines | dataformat tsv genome --fields accession,organism-name,assminfo-level,assmstats-gc-percent > genome_metadata.tsv
        ```
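The accession extraction above, together with the gap between the listed and the found assemblies, can be sketched in Python. This is a minimal sketch under assumptions: the function names and the example path are hypothetical, and accessions are compared without their version suffix because `cut -d '.' -f1` drops it.

```python
from pathlib import PurePosixPath


def accession_from_path(path: str) -> str:
    """Return the last path component up to its first '.'.

    Roughly mirrors `awk -F '/' '{print $NF}' | cut -d '.' -f1`.
    """
    return PurePosixPath(path.strip()).name.split(".", 1)[0]


def missing_accessions(expected, paths):
    """List expected accessions (version stripped) that have no downloaded path."""
    found = {accession_from_path(p) for p in paths if p.strip()}
    wanted = {e.strip().split(".", 1)[0] for e in expected if e.strip()}
    return sorted(wanted - found)


# Hypothetical inputs: two expected accessions, one downloaded file.
expected = ["GCA_000412225.2", "GCA_947297635.1"]
paths = ["ncbi_dataset/data/GCA_000412225.2/GCA_000412225.2_genomic.fna"]
print(missing_accessions(expected, paths))
```

Running the same comparison on the full accession list and `accession_paths.txt` would show which entries were never written to disk.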
        I did not know which datasets-cli field returns the genome size, so the shell script below maps it in from the original file:

        ```bash
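        # First file: map accession -> last column (total sequence length).
        # Second file: extend the header once, then append the mapped length to each row.
        awk 'BEGIN{OFS="\t"} NR==FNR{a[$1]=$NF; next} FNR==1{print $0, "Total Sequence Length"; next} {print $0, a[$1]}' \
        original_fungi_assembly_stats.tsv \
        genome_metadata.tsv > genome_metadata_with_length.tsv

        echo "Merge complete; output written to genome_metadata_with_length.tsv"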
        ```
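The same join can be sketched in Python, which makes it easy to report accessions that have no match in the stats file. This is a minimal sketch, assuming (as the awk script does) that the accession is the first column of both files and that the total sequence length is the last column of the stats file; the function name and the toy rows are hypothetical.

```python
import csv
import io


def append_total_length(stats_tsv, metadata_tsv, new_column="Total Sequence Length"):
    """Append the last column of the stats file to each metadata row, keyed by accession."""
    stats_rows = csv.reader(stats_tsv, delimiter="\t")
    length_by_accession = {row[0]: row[-1] for row in stats_rows if row}

    reader = csv.reader(metadata_tsv, delimiter="\t")
    header = next(reader)
    merged = [header + [new_column]]
    unmatched = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        length = length_by_accession.get(row[0], "")
        if not length:
            unmatched.append(row[0])
        merged.append(row + [length])
    return merged, unmatched


# Toy inputs standing in for the two TSV files.
stats = io.StringIO("Assembly Accession\tTotal Sequence Length\nGCA_000412225.2\t8867527\n")
meta = io.StringIO(
    "Assembly Accession\tOrganism Name\n"
    "GCA_000412225.2\t[Ashbya] aceris (nom. inval.)\n"
    "GCA_947297635.1\t[Candida] aaseri\n"
)
merged, unmatched = append_total_length(stats, meta)
print(unmatched)
```

With real files, pass open file handles instead of the `io.StringIO` objects; a non-empty `unmatched` list points at rows that would otherwise get an empty length.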
        This yields a metadata file with the target fields; its contents look like this:

        ```text
        Assembly Accession      Organism Name   Assembly Level  Assembly Stats GC Percent       Total Sequence Length
        GCA_000412225.2 [Ashbya] aceris (nom. inval.)   Complete Genome 51.5    8867527
        GCA_947297635.1 [Candida] aaseri        Contig  34      10658036
        GCA_030558135.1 [Candida] adriatica     Scaffold        52      10407487
        GCA_963924405.1 [Candida] anglica       Complete Genome 38.5    13719988
        GCA_030566955.1 [Candida] anutae        Scaffold        48      11255714
        GCA_001661425.1 [Candida] arabinofermentans NRRL YB-2248        Scaffold        34.5    13233932
        GCA_030561325.1 [Candida] atlantica     Scaffold        39      11581361
        GCA_030566905.1 [Candida] aurita        Scaffold        35.5    19613085
        ```

In [None]:
import pandas as pd

# Load the CSV file
csv_file_path = 'parsed_genome.csv'
df = pd.read_csv(csv_file_path)

# Display the first few rows of the dataframe to inspect the data
print(df.head())

In [2]:
unique_labels = df['organism_name'].unique().tolist()

In [3]:
unique_labels

['Fusarium lyarnte (nom. inval.)']

In [4]:
label_to_id = {label: i for i, label in enumerate(unique_labels)}

In [5]:
label_to_id

{'Fusarium lyarnte (nom. inval.)': 0}
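The mapping above contains a single class, so a classifier trained on it would have nothing to discriminate. A minimal sketch of a sanity check before training, assuming labels arrive as a plain list of strings; `build_label_map` is a hypothetical helper, and sorting the labels is a design choice here so that ids do not depend on row order.

```python
from collections import Counter


def build_label_map(labels):
    """Return (label -> id, label -> count), with ids assigned in sorted label order."""
    counts = Counter(labels)
    label_to_id = {label: i for i, label in enumerate(sorted(counts))}
    return label_to_id, counts


# Toy input mirroring the single organism seen in this notebook.
label_to_id, counts = build_label_map(["Fusarium lyarnte (nom. inval.)"] * 3)
if len(label_to_id) < 2:
    print(f"Only {len(label_to_id)} class found; add more organisms before training a classifier.")
```

In the notebook this could be called as `build_label_map(df['organism_name'].tolist())`.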

In [6]:
train_df = df.sample(frac=0.8, random_state=42)
val_df = df.drop(train_df.index)

In [9]:
train_df

Unnamed: 0,contig,contig_accession,organism_name,label
1118,TAGTATCGCGAAGACTCTGATTGGCTAGATCCACTCCCCAGGCATT...,JAAVUB010001119.1,Fusarium lyarnte (nom. inval.),0
643,TTTCCACGTTGGATGGGGGATGGCTGATGGGTGAAGAACGGAAGGG...,JAAVUB010000644.1,Fusarium lyarnte (nom. inval.),0
422,CTTTCAAGGTGTGCGGCGCGATCTCACGAGGAGTGAAGCGATGACT...,JAAVUB010000423.1,Fusarium lyarnte (nom. inval.),0
413,TGTTAAGAGGATATCACCAAAGACTTGATCTACATTTGAGGTCCTG...,JAAVUB010000414.1,Fusarium lyarnte (nom. inval.),0
451,ttttctttttctttttactctcttcaacttttAACTTTCTTAACCT...,JAAVUB010000452.1,Fusarium lyarnte (nom. inval.),0
...,...,...,...,...
476,GATTAAGATAATATTAATCCTTATATTAATTCCTATTGTCCTTAAT...,JAAVUB010000477.1,Fusarium lyarnte (nom. inval.),0
157,ATTCAATAGTTTACTTTCTCTAATAGTTTCTATTCCAGGAATTAAA...,JAAVUB010000158.1,Fusarium lyarnte (nom. inval.),0
1195,AGGAGCCGTGGACAGCTCATACCAAAGACCGCTGTCTGACCAAGAC...,JAAVUB010001196.1,Fusarium lyarnte (nom. inval.),0
16,TCGGCCACCCCACCCCCCCCACGCATTACCAGTGACCGGCGTTGTA...,JAAVUB010000017.1,Fusarium lyarnte (nom. inval.),0


In [8]:
train_df['label'] = train_df['organism_name'].map(label_to_id)
val_df['label'] = val_df['organism_name'].map(label_to_id)

In [10]:
from datasets import load_dataset, Dataset, DatasetDict

In [11]:
dataset_dict = {
    'train': Dataset.from_pandas(train_df),
    'validation': Dataset.from_pandas(val_df)
}

In [12]:
dataset_dict

{'train': Dataset({
     features: ['contig', 'contig_accession', 'organism_name', 'label', '__index_level_0__'],
     num_rows: 1202
 }),
 'validation': Dataset({
     features: ['contig', 'contig_accession', 'organism_name', 'label', '__index_level_0__'],
     num_rows: 300
 })}

In [13]:
dataset = DatasetDict(dataset_dict)
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['contig', 'contig_accession', 'organism_name', 'label', '__index_level_0__'],
        num_rows: 1202
    })
    validation: Dataset({
        features: ['contig', 'contig_accession', 'organism_name', 'label', '__index_level_0__'],
        num_rows: 300
    })
})
