# **Context**

This dataset comes from a proof-of-concept study published in 1999 by Golub et al. It showed how new cases of cancer could be classified by gene expression monitoring (via DNA microarray) and thereby provided a general approach for identifying new cancer classes and assigning tumors to known classes. These data were used to classify patients with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).
# **Content**

Golub et al "Molecular Classification of Cancer: Class Discovery and Class
Prediction by Gene Expression Monitoring"

There are two datasets containing the initial (training, 38 samples) and independent (test, 34 samples) datasets used in the paper. These datasets contain measurements corresponding to ALL and AML samples from Bone Marrow and Peripheral Blood. Intensity values have been re-scaled such that overall intensities for each chip are equivalent. 

# **Importing basics libraries and CSV files**

In [24]:
import pandas as pd
import numpy as np

path_train="https://raw.githubusercontent.com/RafaelM0raes/Gene/main/data_set_ALL_AML_train.csv"
path_test="https://raw.githubusercontent.com/RafaelM0raes/Gene/main/data_set_ALL_AML_independent.csv"
path_val="https://raw.githubusercontent.com/RafaelM0raes/Gene/main/actual.csv"

train_df=pd.read_csv(path_train)
test_df=pd.read_csv(path_test)
val_df=pd.read_csv(path_val, index_col='patient')

# Initial tests


In [8]:
#verify if there are any missing values
print("missing values in train_df: ", train_df.isnull().sum().sum())
print("missing values in test_df: ",test_df.isnull().sum().sum())
print("missing values in val_df: ",val_df.isnull().sum().sum())

test_df

missing values in train_df:  0
missing values in test_df:  0
missing values in val_df:  0


Unnamed: 0,Gene Description,Gene Accession Number,39,call,40,call.1,42,call.2,47,call.3,48,call.4,49,call.5,41,call.6,43,call.7,44,call.8,45,call.9,46,call.10,70,call.11,71,call.12,72,call.13,68,call.14,69,call.15,67,call.16,55,call.17,56,call.18,59,call.19,52,call.20,53,call.21,51,call.22,50,call.23,54,call.24,57,call.25,58,call.26,60,call.27,61,call.28,65,call.29,66,call.30,63,call.31,64,call.32,62,call.33
0,AFFX-BioB-5_at (endogenous control),AFFX-BioB-5_at,-342,A,-87,A,22,A,-243,A,-130,A,-256,A,-62,A,86,A,-146,A,-187,A,-56,A,-55,A,-59,A,-131,A,-154,A,-79,A,-76,A,-34,A,-95,A,-12,A,-21,A,-202,A,-112,A,-118,A,-90,A,-137,A,-157,A,-172,A,-47,A,-62,A,-58,A,-161,A,-48,A,-176,A
1,AFFX-BioB-M_at (endogenous control),AFFX-BioB-M_at,-200,A,-248,A,-153,A,-218,A,-177,A,-249,A,-23,A,-36,A,-74,A,-187,A,-43,A,-44,A,-114,A,-126,A,-136,A,-118,A,-98,A,-144,A,-118,A,-172,A,-13,A,-274,A,-185,A,-142,A,-87,A,-51,A,-370,A,-122,A,-442,A,-198,A,-217,A,-215,A,-531,A,-284,A
2,AFFX-BioB-3_at (endogenous control),AFFX-BioB-3_at,41,A,262,A,17,A,-163,A,-28,A,-410,A,-7,A,-141,A,170,A,312,A,43,A,12,A,23,A,-50,A,49,A,-30,A,-153,A,-17,A,59,A,12,A,8,A,59,A,24,A,212,A,102,A,-82,A,-77,A,38,A,-21,A,-5,A,63,A,-46,A,-124,A,-81,A
3,AFFX-BioC-5_at (endogenous control),AFFX-BioC-5_at,328,A,295,A,276,A,182,A,266,A,24,A,142,A,252,A,174,A,142,A,177,A,129,A,146,A,211,A,180,A,68,A,237,A,152,A,270,A,172,A,38,A,309,A,170,A,314,A,319,P,178,A,340,A,31,A,396,A,141,A,95,A,146,A,431,A,9,A
4,AFFX-BioC-3_at (endogenous control),AFFX-BioC-3_at,-224,A,-226,A,-211,A,-289,A,-170,A,-535,A,-233,A,-201,A,-32,A,114,A,-116,A,-108,A,-171,A,-206,A,-257,A,-110,A,-215,A,-174,A,-229,A,-137,A,-128,A,-456,A,-197,A,-401,A,-283,A,-135,A,-438,A,-201,A,-351,A,-256,A,-191,A,-172,A,-496,A,-294,A
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7124,PTGER3 Prostaglandin E receptor 3 (subtype EP3...,X83863_at,1074,A,67,A,893,P,722,A,612,A,1950,A,245,A,1235,A,354,A,304,A,625,A,320,A,348,A,874,A,524,A,742,A,441,A,806,A,1068,A,673,A,133,A,523,A,1110,A,882,A,618,A,507,A,1372,A,87,A,1111,A,707,A,423,A,809,A,466,A,551,A
7125,HMG2 High-mobility group (nonhistone chromosom...,Z17240_at,475,A,263,A,297,A,170,A,370,A,906,A,164,A,9,A,-42,A,-1,A,173,A,174,A,208,A,393,P,249,A,234,A,99,A,342,A,412,A,208,A,50,A,577,P,174,A,264,A,308,A,64,A,642,P,98,A,459,A,354,A,41,A,445,A,349,A,194,A
7126,RB1 Retinoblastoma 1 (including osteosarcoma),L49218_f_at,48,A,-33,A,6,A,0,A,29,A,79,A,84,A,7,A,-100,A,-207,A,63,A,-4,A,0,A,34,A,40,A,72,A,-8,A,14,A,-43,A,-68,A,-51,A,-26,A,8,A,73,A,0,A,-11,A,-9,A,-26,A,-8,A,-22,A,0,A,-2,A,0,A,20,A
7127,GB DEF = Glycophorin Sta (type A) exons 3 and ...,M71243_f_at,168,A,-33,A,1971,P,510,P,333,A,170,A,100,A,1545,P,45,A,112,A,63,A,176,P,74,A,237,A,-68,A,109,A,80,A,239,A,702,P,226,A,91,A,208,A,533,P,315,P,196,A,198,A,608,A,153,M,73,A,260,A,1777,P,210,A,284,A,379,A


# Data prep

Removing calls columns

In [9]:
calls_collumns_train = [calls for calls in train_df.columns if 'call' in calls]
calls_collumns_test = [col for col in test_df.columns if 'call' in col]

train_df = train_df.drop(calls_collumns_train, axis=1)
test_df = test_df.drop(calls_collumns_test, axis=1)

Transposing the dataframe


In [10]:
train_df_transposed = train_df.T
test_df_transposed = test_df.T

Removing the genes describing rows

In [39]:
train_df_transposed_final = train_df_transposed.drop(['Gene Description', 'Gene Accession Number'])
test_df_transposed_final = test_df_transposed.drop(['Gene Description', 'Gene Accession Number'])

Separating the results and subistuing ALL and AML for 1 and 0, respectvely

In [77]:
val_df = val_df.replace('ALL', 1)
val_df = val_df.replace('AML', 0)

val_df_test = val_df[val_df.index > 38]
val_df_train = val_df[val_df.index <= 38]
val_df_train = val_df_train.cancer
val_df_test=val_df_test.cancer


train_df_transposed_final.index = pd.to_numeric(train_df_transposed_final.index)
train_df_ordered = train_df_transposed_final.sort_index()

test_df_transposed_final.index = pd.to_numeric(test_df_transposed_final.index)
test_df_ordered = test_df_transposed_final.sort_index()


patient
39    1
40    1
41    1
42    1
43    1
44    1
45    1
46    1
47    1
48    1
49    1
50    0
51    0
52    0
53    0
54    0
55    1
56    1
57    0
58    0
59    1
60    0
61    0
62    0
63    0
64    0
65    0
66    0
67    1
68    1
69    1
70    1
71    1
72    1
Name: cancer, dtype: int64

# ML models

In [81]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

randomforest_model = RandomForestRegressor(random_state=1)
randomforest_model.fit(train_df_ordered, val_df_train)
preds=randomforest_model.predict(test_df_transposed_final)
print(mean_absolute_error(val_df_test, preds))

0.41470588235294115
