<!-- @format -->

### Implementation

1. Remove the duplicate columns
2. Combine the two data sets if needed
3. Create arrays of probe names and types
4. Use `train_test_split` to split the data into training and test sets
5. Combine the probe names, types and `X_train` into a new `train_df` for output (and likewise `test_df`)
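
Step 1 can be illustrated on a toy frame. This is only a sketch: the column names `s1`, `s1.1`, `s2`, `s2.1` are made up, and it assumes the duplicates sit in adjacent pairs right after the ID column, which is the layout that `iloc[:, ::2]` relies on.

```python
import pandas as pd

# Toy frame (hypothetical names): an ID column followed by pairs of identical columns.
toy = pd.DataFrame(
    {
        "Unnamed: 0": ["cg0001", "cg0002"],
        "s1": [0.87, 0.36],
        "s1.1": [0.87, 0.36],
        "s2": [0.12, 0.55],
        "s2.1": [0.12, 0.55],
    }
)

# Keep every second column, starting with the ID column (positions 0, 2, 4).
deduped = toy.iloc[:, ::2]

# Sanity check: each dropped column equals the column kept right after it.
pairs_identical = bool((toy.iloc[:, 1::2].to_numpy() == toy.iloc[:, 2::2].to_numpy()).all())

print(list(deduped.columns))
print(pairs_identical)
```

If the pairs are not adjacent, or a column is not duplicated, the slice would silently drop real data, so a check like `pairs_identical` may be worth running first.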

### Notes

1. Be aware of the random seed and test ratio settings
2. Ensure the first n rows of the input data are normal (n is `normalNumber`)
3. It takes around 5 minutes to process the breast cancer data
4. Change the paths to match your own environment
5. Make sure the input DataFrame looks like this:

| Unnamed: 0   |            |            |          |          |
| ------------ | ---------- | ---------- | -------- | -------- |
| cgxxxxxxxxxx | 0.86868686 | 0.86868686 | 0.363636 | 0.363636 |
| cgxxxxxxxxxx | 0.86868686 | 0.86868686 | 0.363636 | 0.363636 |
| cgxxxxxxxxxx | 0.86868686 | 0.86868686 | 0.363636 | 0.363636 |
| cgxxxxxxxxxx | 0.86868686 | 0.86868686 | 0.363636 | 0.363636 |
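
The layout above could be checked before running the rest of the notebook. The helper below is a hypothetical sketch, not part of the original code; it only assumes that the first column is named `Unnamed: 0` and that every other column should be numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_input_layout(df, id_column="Unnamed: 0"):
    """Hypothetical helper: return a list of problems found in the input layout."""
    problems = []
    if df.columns[0] != id_column:
        problems.append(f"first column is {df.columns[0]!r}, expected {id_column!r}")
    # Check the value columns by position so repeated column names cannot confuse the lookup.
    non_numeric = [
        df.columns[i] for i in range(1, df.shape[1]) if not is_numeric_dtype(df.iloc[:, i])
    ]
    if non_numeric:
        problems.append(f"non-numeric value columns: {non_numeric}")
    return problems


# Made-up demo data in the expected shape.
demo = pd.DataFrame(
    {"Unnamed: 0": ["cg0001", "cg0002"], "s1": [0.87, 0.36], "s1.1": [0.87, 0.36]}
)
print(check_input_layout(demo))
```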

### Input Columns

1. `Unnamed: 0` - ID of the sample
   > list of serial numbers, one for each sample

### Output File

1. training_data.csv
2. testing_data.csv

### Parameters

1. `seed` - random seed for the split; change it if you want multiple different results
2. `test_ratio` - the fraction of the champ data to hold out as the test set
3. `hasTwoSets` - combine two data sets if needed; remember to set `secChampDataPath`
4. `normalNumber` - the total number of normal samples (the first `normalNumber` rows are labeled "normal", the rest "tumor")
5. `champDataPath` - path of the input data file (change it to your own path)
6. `secChampDataPath` - path of the second input data file, read when `hasTwoSets` is `True` (change it to your own path)
7. `outputTrainDataPath` - path of output train data file
8. `outputTestDataPath` - path of output test data file
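
A few of these settings can be sanity-checked before the split. This is a minimal sketch with a hypothetical helper name; the bounds it enforces are assumptions drawn from the parameter descriptions, not checks the notebook performs.

```python
def validate_split_params(test_ratio, normal_number, n_rows):
    """Hypothetical helper: raise ValueError for settings that cannot work."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError(f"test_ratio must be between 0 and 1, got {test_ratio}")
    if not 0 <= normal_number <= n_rows:
        raise ValueError(
            f"normalNumber must be between 0 and the row count ({n_rows}), got {normal_number}"
        )


# Example call with made-up sizes: 40 rows, the first 8 of them normal.
validate_split_params(test_ratio=0.2, normal_number=8, n_rows=40)
```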


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
test_ratio = 0.2
seed = 42


hasTwoSets = True

normalNumber = 8

champDataPath = "../champ_result/breast/GSE148663/all_beta_normalized.csv"

secChampDataPath = "../champ_result/breast/GSE148663/all_beta_normalized.csv"

outputTrainDataPath = "../champ_result/breast/GSE148663/train80/all_beta_normalized.csv"

outputTestDataPath = "../champ_result/breast/GSE148663/test20/all_beta_normalized.csv"

In [None]:
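# load the first data set and keep every second column (drops the duplicate columns)
df = pd.read_csv(champDataPath)
df = df.iloc[:, ::2]
if hasTwoSets:
    # load the second data set the same way, then join both on the ID column;
    # overlapping column names get the suffixes _df1 / _df2
    df2 = pd.read_csv(secChampDataPath)
    df2 = df2.iloc[:, ::2]
    df = df.merge(df2, on='Unnamed: 0', how='outer', suffixes=('_df1', '_df2'))
df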
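
The behavior of the outer merge can be seen on two toy frames. This is a sketch with made-up IDs and values; it only shows what `how='outer'` and `suffixes` do, and says nothing about the real data.

```python
import pandas as pd

# Two toy data sets (hypothetical values) that share one ID and one column name.
left = pd.DataFrame({"Unnamed: 0": ["cg1", "cg2"], "s1": [0.1, 0.2]})
right = pd.DataFrame({"Unnamed: 0": ["cg2", "cg3"], "s1": [0.3, 0.4]})

merged = left.merge(right, on="Unnamed: 0", how="outer", suffixes=("_df1", "_df2"))

# An outer merge keeps IDs found in either frame; cells with no source value become NaN.
print(merged)
print(int(merged.isna().sum().sum()))
```

Rows that exist in only one file end up with missing values in the other file's columns, so it may be worth checking `merged.isna().sum()` before the split.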

In [None]:
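# feature matrix: every column except the ID column
X = df.iloc[:, 1:]

# rebuild X row by row so it gets a fresh 0..n-1 index and integer column labels
X = [X.iloc[i, :].values.flatten().tolist() for i in range(df.shape[0])]

X = pd.DataFrame(X)

# label the first normalNumber rows "normal" and all remaining rows "tumor"
type_array = [("normal" if i < normalNumber else "tumor") for i in range(df.shape[0])]

# row identifiers taken from the ID column
name_array = df["Unnamed: 0"].tolist()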

In [None]:
from collections import Counter

# split features, types and names together so the rows stay matched;
# test_ratio and seed come from the parameter cell
X_train, X_test, type_train, type_test, name_train, name_test = train_test_split(
    X, type_array, name_array, test_size=test_ratio, random_state=seed
)

print(f"Number of training samples: {len(X_train)}")
print(f"Number of test samples: {len(X_test)}")
train_class_distribution = Counter(type_train)
test_class_distribution = Counter(type_test)
print("Samples per class in the training set:")
print(train_class_distribution)
print("Samples per class in the test set:")
print(test_class_distribution)
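
With only a few normal rows, a plain random split can leave very few of them (or none) in the test set. The notebook does not do this, but one option is the `stratify` argument of `train_test_split`. The sketch below uses made-up data with 8 normal and 32 tumor rows.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up labels and one-column features: 8 normal rows followed by 32 tumor rows.
labels = ["normal"] * 8 + ["tumor"] * 32
features = [[i] for i in range(len(labels))]

# stratify=labels asks for roughly the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(Counter(y_tr))
print(Counter(y_te))
```

Stratification needs at least two rows of every class, so it would fail on inputs with a single normal row.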

In [None]:
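# X_train / X_test keep their shuffled row index from the split; reset it so the
# column-wise concat lines up with the freshly built name and type frames
train_df = pd.concat(
    [pd.DataFrame(name_train), pd.DataFrame(type_train), X_train.reset_index(drop=True)],
    ignore_index=True,
    axis=1,
)
test_df = pd.concat(
    [pd.DataFrame(name_test), pd.DataFrame(type_test), X_test.reset_index(drop=True)],
    ignore_index=True,
    axis=1,
)

# sort by type (column 1), then by name (column 0)
train_df = train_df.sort_values(by=[train_df.columns[1], train_df.columns[0]])
test_df = test_df.sort_values(by=[test_df.columns[1], test_df.columns[0]])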
train_df
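
Column-wise `pd.concat` matches rows by index label, not by position, which matters because a frame returned by `train_test_split` keeps its original, shuffled index. The toy sketch below (made-up values) shows the difference between concatenating with and without resetting the index.

```python
import pandas as pd

# Names built from a plain list get the index 0, 1.
names = pd.DataFrame(["a", "b"])

# Values taken out of a larger frame keep their original labels, here 2 and 0.
values = pd.DataFrame({"v": [10, 20, 30]}).iloc[[2, 0]]

# Without a reset, rows are matched on index labels: the union {0, 1, 2} appears
# and cells without a partner are filled with NaN.
misaligned = pd.concat([names, values], axis=1, ignore_index=True)

# After resetting the index, rows are matched by position as intended.
aligned = pd.concat([names, values.reset_index(drop=True)], axis=1, ignore_index=True)

print(misaligned)
print(aligned)
```

Note that `ignore_index=True` only renumbers the concatenation axis (the columns here); it does not change how the rows are matched.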

In [None]:
# export the training and testing sets to CSV
train_df.to_csv(outputTrainDataPath, index=False)

test_df.to_csv(outputTestDataPath, index=False)
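
`to_csv` does not create missing folders, so output paths such as the `train80/` and `test20/` directories must exist before exporting. A minimal sketch of one way to handle this, using a hypothetical helper and a temporary directory in place of the real paths:

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df, path):
    """Hypothetical helper: create the parent directory, then write the CSV."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Demo in a temporary directory; the sub-folder does not exist beforehand.
with tempfile.TemporaryDirectory() as tmp:
    out = save_csv(pd.DataFrame({"a": [1, 2]}), Path(tmp) / "train80" / "demo.csv")
    round_trip = pd.read_csv(out)

print(round_trip)
```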