<div style="width: 100%; background-color: #ddd; overflow:hidden; ">
    <div style="display: flex; justify-content: center; align-items: center; border-bottom: 10px solid #80c4e7; padding: 3px;">
        <h2 style="position: relative; top: 3px; left: 8px;">S2 Project: DNA Classification - (part0: Retrieve Data)</h2>
        <img style="position: absolute; height: 68px; top: -2px; right: 18px" src="./Content/Notebook-images/dna1.png"/>
    </div>  
</div>

* **Initial setup**

In [1]:
import re, os, sys, json, random, requests
from tqdm import tqdm
import pandas as pd
from sklearn.utils import resample

current_directory = os.getcwd()
root_directory = os.path.abspath(os.path.join(current_directory, os.pardir))
utils_directory = os.path.join(root_directory, 'processing')
sys.path.append(utils_directory)

# Import Utils
import fasta

* **Read fasta**<br>

We read each FASTA file, remove unknown characters, and save the result in a single dataset.<br>
We then create k-mer datasets for k = 2, 3, 4, 5.
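
As a minimal sketch of what a k-mer representation looks like, the snippet below counts overlapping k-mers in a single sequence for each k value mentioned above. The helper name `kmer_counts` and the toy sequence are assumptions made for illustration; they are not part of the project's `fasta` module, whose implementation is not shown here.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count overlapping k-mers; the window slides one residue at a time."""
    # A sequence shorter than k yields an empty range, hence an empty Counter.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy sequence for illustration only (not taken from the dataset)
sequence = "MKVLAAMKV"

for k in (2, 3, 4, 5):
    counts = kmer_counts(sequence, k)
    print(f"k={k}: {len(counts)} distinct k-mers, {sum(counts.values())} windows")
```

A sequence of length n yields n - k + 1 windows, so larger k values give fewer but more specific features.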

In [9]:
gene_info_path = "../data/gene_info.json"
fasta_dir      = "../data/raw_data/"
storage_dir    = "../data/one_vs_other/"

with open(gene_info_path, 'r') as json_file:
    gene_info = json.load(json_file)

**Note**: We create one one-vs-other dataset per gene and save each one to its own CSV file.
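
The negative class is filled by sampling from every other gene in proportion to its size. The sketch below isolates that arithmetic on made-up counts; the function name `allocate_negatives` and the family names are hypothetical and only illustrate the `max(1, int(proportion * count))` rule used in the notebook's function.

```python
def allocate_negatives(target_count, other_counts):
    """Split about `target_count` negatives across the other genes,
    proportionally to their size, with at least one sample per gene."""
    total_other = sum(other_counts.values())
    return {
        name: max(1, int(count / total_other * target_count))
        for name, count in other_counts.items()
    }


# Hypothetical sizes, chosen so the arithmetic is easy to follow
other_counts = {"famA": 500, "famB": 250, "famC": 250}
quotas = allocate_negatives(target_count=10, other_counts=other_counts)

print(quotas)                 # per-gene quotas
print(sum(quotas.values()))   # total number of negatives actually drawn
```

Because `int()` truncates each quota, the total number of negatives can fall slightly below the target. This would be consistent with class 0 being marginally smaller than class 1 in the check output further down.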

In [5]:
def create_one_vs_other_dataset(gene_info, fasta_dir, storage_dir):
    for gene, info in gene_info.items():
        print('Processing : ', gene)
        # Read the current gene's fasta file
        file_path = os.path.join(fasta_dir, info['filename'])
        gene_df = fasta.read(file_path, family=1)

        # Read and combine all other genes' fasta files, sampling proportionally
        other_dfs = []
        total_other_count = sum(other_info['count'] for other_gene, other_info in gene_info.items() if other_gene != gene)
        for other_gene, other_info in tqdm(gene_info.items()):
            if other_gene != gene:
                other_file_path = os.path.join(fasta_dir, other_info['filename'])
                other_df = fasta.read(other_file_path, family=0)
                
                # Calculate the proportion of samples to take from this other gene
                proportion = other_info['count'] / total_other_count
                num_samples = max(1, int(proportion * info['count']))
                #print('proportion: ', proportion)
                #print('num_samples: ', num_samples)
                
                # Resample the other gene dataframe
                other_df_resampled = resample(other_df, 
                                              replace=False, 
                                              n_samples=num_samples, 
                                              random_state=42)
                other_dfs.append(other_df_resampled)

        # Combine current gene and sampled other genes into one dataframe
        other_df_combined = pd.concat(other_dfs, ignore_index=True)
        combined_df = pd.concat([gene_df, other_df_combined], ignore_index=True)

        # Save the combined dataframe as a CSV file ('/' in gene names is replaced to get a valid filename)
        gene_filename = f"{gene.replace('/', '__')}.csv"
        combined_df.to_csv(os.path.join(storage_dir, gene_filename), index=False)

In [6]:
create_one_vs_other_dataset(gene_info, fasta_dir, storage_dir)

Processing :  AP2


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.87it/s]


Processing :  ARF


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.63it/s]


Processing :  ARR-B


100%|█████████████████████████████████████████| 58/58 [00:05<00:00, 11.30it/s]


Processing :  B3


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.90it/s]


Processing :  BBR-BPC


100%|█████████████████████████████████████████| 58/58 [00:05<00:00, 11.27it/s]


Processing :  BES1


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.68it/s]


Processing :  C2H2


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 12.20it/s]


Processing :  C3H


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.98it/s]


Processing :  CAMTA


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.74it/s]


Processing :  CO-like


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.68it/s]


Processing :  CPP


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.63it/s]


Processing :  DBB


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.76it/s]


Processing :  Dof


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.74it/s]


Processing :  E2F/DP


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.70it/s]


Processing :  EIL


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.73it/s]


Processing :  ERF


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 12.13it/s]


Processing :  FAR1


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.92it/s]


Processing :  G2-like


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.96it/s]


Processing :  GATA


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.74it/s]


Processing :  GRAS


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 12.02it/s]


Processing :  GRF


100%|█████████████████████████████████████████| 58/58 [00:05<00:00, 11.59it/s]


Processing :  GeBP


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.74it/s]


Processing :  HB-PHD


100%|█████████████████████████████████████████| 58/58 [00:05<00:00, 11.55it/s]


Processing :  HB-other


100%|█████████████████████████████████████████| 58/58 [00:05<00:00, 11.57it/s]


Processing :  HD-ZIP


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.99it/s]


Processing :  HRT-like


100%|█████████████████████████████████████████| 58/58 [00:05<00:00, 11.51it/s]


Processing :  HSF


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.84it/s]


Processing :  LBD


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.77it/s]


Processing :  LFY


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.61it/s]


Processing :  LSD


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.75it/s]


Processing :  M-type_MADS


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.81it/s]


Processing :  MIKC_MADS


100%|█████████████████████████████████████████| 58/58 [00:05<00:00, 11.16it/s]


Processing :  MYB


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 12.30it/s]


Processing :  MYB_related


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 12.01it/s]


Processing :  NAC


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 12.07it/s]


Processing :  NF-X1


100%|█████████████████████████████████████████| 58/58 [00:05<00:00, 11.53it/s]


Processing :  NF-YA


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.77it/s]


Processing :  NF-YB


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.68it/s]


Processing :  NF-YC


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.82it/s]


Processing :  NZZ/SPL


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.72it/s]


Processing :  Nin-like


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.88it/s]


Processing :  RAV


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.63it/s]


Processing :  S1Fa-like


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.76it/s]


Processing :  SAP


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.78it/s]


Processing :  SBP


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.83it/s]


Processing :  SRS


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.68it/s]


Processing :  STAT


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.66it/s]


Processing :  TALE


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.68it/s]


Processing :  TCP


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.79it/s]


Processing :  Trihelix


100%|█████████████████████████████████████████| 58/58 [00:05<00:00, 11.57it/s]


Processing :  VOZ


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.62it/s]


Processing :  WOX


100%|█████████████████████████████████████████| 58/58 [00:05<00:00, 11.47it/s]


Processing :  WRKY


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 12.11it/s]


Processing :  Whirly


100%|█████████████████████████████████████████| 58/58 [00:05<00:00, 11.38it/s]


Processing :  YABBY


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.71it/s]


Processing :  ZF-HD


100%|█████████████████████████████████████████| 58/58 [00:05<00:00, 11.56it/s]


Processing :  bHLH


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 12.53it/s]


Processing :  bZIP


100%|█████████████████████████████████████████| 58/58 [00:04<00:00, 11.86it/s]


* **Check**<br>

We check that the files were processed correctly: every sequence must contain only the 20 standard amino acid letters plus the extra X character, and the two classes must be balanced.
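
As a small sketch of these two checks on toy data, the snippet below reports the offending rows instead of asserting, which can help when a file fails validation. The helper name `check_dataset` and the toy rows are assumptions for illustration; the column names `sequence` and `class` and the 1.5 threshold follow the notebook.

```python
import pandas as pd

# 20 standard amino acid letters plus X for unknown residues
AMINO_ACID_PATTERN = r'^[ACDEFGHIKLMNPQRSTVWYX]+$'


def check_dataset(dataset: pd.DataFrame, threshold: float = 1.5):
    """Return (invalid_rows, imbalance_ratio, is_imbalanced)."""
    valid = dataset['sequence'].str.match(AMINO_ACID_PATTERN)
    invalid_rows = dataset[~valid]
    counts = dataset['class'].value_counts()
    ratio = counts.max() / counts.min()
    return invalid_rows, ratio, ratio >= threshold


# Toy rows for illustration only; 'MKB1' contains characters outside the alphabet
toy = pd.DataFrame({
    'sequence': ['MKVLA', 'ACDXX', 'MKB1', 'GGHH'],
    'class': [1, 1, 1, 0],
})

invalid_rows, ratio, is_imbalanced = check_dataset(toy)
print(invalid_rows)
print("Imbalance ratio:", ratio, "| imbalanced:", is_imbalanced)
```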


In [17]:
for gene, info in gene_info.items():
    # Load the dataset saved for this gene
    dataset = pd.read_csv(os.path.join(storage_dir, info['file_code'] + ".csv"))

    # Sequences must contain only the 20 standard amino acid letters plus X
    pattern = r'^[ACDEFGHIKLMNPQRSTVWYX]+$'
    assert dataset['sequence'].str.match(pattern).all(), "Error: Invalid characters found in sequence column"

    # Class balance
    class_counts = dataset['class'].value_counts()
    total_samples = len(dataset)
    class_counts_df = pd.DataFrame(class_counts)
    class_counts_df.columns = ['Count']
    class_counts_df['Percentage'] = (class_counts_df['Count'] / total_samples * 100).round(2)
    print("\n>>> Class Distribution: ", gene)
    print(class_counts_df)
    print("\nTotal Samples:", total_samples)
    imbalance_ratio = class_counts_df['Count'].max() / class_counts_df['Count'].min()
    imbalance_threshold = 1.5
    print("Imbalance Ratio:", imbalance_ratio)
    if imbalance_ratio >= imbalance_threshold:
        print("#### IMBALANCED DATA : ", gene)


>>> Class Distribution:  AP2
       Count  Percentage
class                   
1       4461       50.16
0       4433       49.84

Total Samples: 8894
Imbalance Ratio: 1.0063162643807806

>>> Class Distribution:  ARF
       Count  Percentage
class                   
1       4578       50.16
0       4549       49.84

Total Samples: 9127
Imbalance Ratio: 1.0063750274785668

>>> Class Distribution:  ARR-B
       Count  Percentage
class                   
1       2354       50.33
0       2323       49.67

Total Samples: 4677
Imbalance Ratio: 1.0133448127421438

>>> Class Distribution:  B3
       Count  Percentage
class                   
1      10609       50.08
0      10576       49.92

Total Samples: 21185
Imbalance Ratio: 1.0031202723146748

>>> Class Distribution:  BBR-BPC
       Count  Percentage
class                   
1       1256        50.5
0       1231        49.5

Total Samples: 2487
Imbalance Ratio: 1.0203086921202276

>>> Class Distribution:  BES1
       Count  Percentage
cla