# **Context**

This dataset comes from a proof-of-concept study published in 1999 by Golub et al. It showed how new cases of cancer could be classified by gene expression monitoring (via DNA microarray) and thereby provided a general approach for identifying new cancer classes and assigning tumors to known classes. These data were used to classify patients with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).
# **Content**

Golub et al., "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"

There are two datasets: the initial (training, 38 samples) and the independent (test, 34 samples) sets used in the paper. Both contain measurements for ALL and AML samples from bone marrow and peripheral blood. Intensity values have been rescaled so that the overall intensities for each chip are equivalent.

# **Importing basic libraries and CSV files**

In [1]:
import pandas as pd
import numpy as np

path_train="https://raw.githubusercontent.com/RafaelM0raes/Gene/main/data_set_ALL_AML_train.csv"
path_test="https://raw.githubusercontent.com/RafaelM0raes/Gene/main/data_set_ALL_AML_independent.csv"
path_val="https://raw.githubusercontent.com/RafaelM0raes/Gene/main/actual.csv"

train_df=pd.read_csv(path_train)
test_df=pd.read_csv(path_test)
val_df=pd.read_csv(path_val, index_col='patient')


# Data Prep

Removing the call columns and the two leading descriptive columns


In [2]:
call_cols_train = [call for call in train_df.columns if 'call' in call]
call_cols_test = [call for call in test_df.columns if 'call' in call]

train_df=train_df.drop(call_cols_train, axis=1)
test_df=test_df.drop(call_cols_test, axis=1)

test_df=test_df.drop(test_df.columns[0:2], axis=1)
train_df=train_df.drop(train_df.columns[0:2], axis=1)
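As a small illustration of the step above, the sketch below applies the same filtering to a toy frame. The layout (two descriptive columns followed by alternating patient and `call` columns), the toy values, and the helper name `drop_call_columns` are assumptions made for the example, not taken from the CSV files.

```python
import pandas as pd


def drop_call_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop every column whose name contains 'call', then the first two columns."""
    call_cols = [col for col in df.columns if "call" in col]
    df = df.drop(columns=call_cols)
    return df.drop(columns=df.columns[:2])


# Hypothetical toy data: two genes, two patients
toy = pd.DataFrame({
    "Gene Description": ["gene A", "gene B"],
    "Gene Accession Number": ["A_at", "B_at"],
    "1": [10, 20],
    "call": ["A", "P"],
    "2": [30, 40],
    "call.1": ["A", "A"],
})

expression = drop_call_columns(toy)
print(expression.columns.tolist())
```

Only the patient columns remain, which is the shape the transpose in the next step expects.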

Merging test_df and train_df and sorting

In [3]:
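# Transpose so that each row is one patient and each column one gene
train_df = train_df.T
test_df = test_df.T

# DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
# The patient IDs are kept as the index (assumed to be numeric strings)
# so that sorting lines the rows up with val_df, which is indexed by patient.
df = pd.concat([train_df, test_df])
df.index = df.index.astype(int)
df_sorted = df.sort_index()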
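The split further down passes the merged expression table and `val_df` side by side, so each row must belong to the same patient in both. A minimal sketch on toy data of one way to keep the patient IDs attached while merging follows. The toy frames and the `labels` table are hypothetical stand-ins.

```python
import pandas as pd

# Toy stand-ins: genes as rows, patient IDs as string column names,
# deliberately out of numeric order.
toy_train = pd.DataFrame({"2": [0.2, 2.0], "1": [0.1, 1.0]})
toy_test = pd.DataFrame({"4": [0.4, 4.0], "3": [0.3, 3.0]})

# Transpose so each row is one patient, then stack the two sets.
merged = pd.concat([toy_train.T, toy_test.T])

# Cast the IDs to int so the sort is numeric (as strings, "10" sorts before "2").
merged.index = merged.index.astype(int)
merged = merged.sort_index()

# Hypothetical label table indexed by the same patient IDs.
labels = pd.DataFrame(
    {"cancer": ["ALL", "ALL", "AML", "AML"]},
    index=pd.Index([1, 2, 3, 4], name="patient"),
)

# True when both tables list the same patients in the same order.
aligned = merged.index.tolist() == labels.index.tolist()
print(merged.index.tolist(), aligned)
```

Comparing the two indexes before splitting is a cheap guard against silently pairing a sample with another patient's label.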

Transforming AML and ALL to int

In [4]:
# Encode the labels as integers (ALL = 1, AML = 0)
val_df['cancer'] = val_df['cancer'].map({'ALL': 1, 'AML': 0})
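One thing to watch when encoding labels is a value that is not in the mapping, for example a stray spelling. The sketch below, on a hypothetical label table shaped like `val_df`, encodes with `Series.map` and fails loudly if anything is left unmapped. The names `toy_labels` and `label_map` are illustrative only.

```python
import pandas as pd

# Hypothetical label table: one 'cancer' column, indexed by patient.
toy_labels = pd.DataFrame(
    {"cancer": ["ALL", "AML", "ALL"]},
    index=pd.Index([1, 2, 3], name="patient"),
)

label_map = {"ALL": 1, "AML": 0}
encoded = toy_labels["cancer"].map(label_map)

# map() turns labels missing from the mapping into NaN, so check before casting.
if encoded.isna().any():
    raise ValueError("found a label that is neither 'ALL' nor 'AML'")

encoded = encoded.astype(int)
print(encoded.tolist())
```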

Splitting into train and test sets

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_sorted, val_df, test_size=0.33, random_state=42)
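With only 72 samples, a purely random split can leave the two classes unevenly represented in the test set. The following sketch shows the `stratify` option of `train_test_split` on synthetic data. The feature values and the 48/24 class balance are made up for illustration and are not the dataset's actual figures.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 72 samples, 5 features, arbitrary 48/24 class balance.
X = pd.DataFrame(rng.normal(size=(72, 5)))
y = pd.Series([1] * 48 + [0] * 24, name="cancer")

# stratify=y keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Whether stratification is appropriate here is a judgment call. It mainly makes the evaluation less sensitive to the choice of `random_state`.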


# Models

## XGBoost Classifier


In [9]:
from xgboost import XGBClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import confusion_matrix
from seaborn import heatmap

# Fit one model per n_estimators value and print its confusion matrix
for n in [50, 100, 500, 1000, 1500]:
  model_xg=XGBClassifier(random_state=42, n_estimators=n, learning_rate=0.01, n_jobs=2, verbosity=0)
  model_xg.fit(X_train, y_train.cancer, verbose=False)
  preds_xg = model_xg.predict(X_test)
  cf_matrix = confusion_matrix(y_test.cancer, preds_xg)
  print(f"n_estimators={n}")
  print(cf_matrix)

# Plot the confusion matrix of the last model (n_estimators=1500)
heatmap(cf_matrix, annot=True)

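To read a confusion matrix like the one printed above, recall that scikit-learn puts the true classes in rows and the predicted classes in columns. The sketch below derives accuracy, sensitivity and specificity from a small hand-made example. The label vectors are invented, and the encoding (1 = ALL, 0 = AML) is an assumption that follows the mapping in the label-encoding cell.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels and predictions (1 = ALL, 0 = AML)
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 1])

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)  # share of true class-1 samples that were found
specificity = tn / (tn + fp)  # share of true class-0 samples that were found

print(cm)
print(accuracy, sensitivity, specificity)
```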

## K-nearest neighbors

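In [7]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# accuracy_score works for both string and integer class labels,
# unlike mean_absolute_error, which requires numeric targets
for i in [5, 10, 20, 25, 30]:
  model_kn=KNeighborsClassifier(n_neighbors=i, n_jobs=2)
  model_kn.fit(X_train, y_train.cancer)
  preds_kn = model_kn.predict(X_test)
  print(f"the accuracy is {accuracy_score(y_test.cancer, preds_kn)}, for n_neighbors={i}")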

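K-nearest neighbors compares samples by distance, so features with a large numeric range can dominate the result. A common remedy, which is not part of the original notebook, is to standardize each feature inside a pipeline and to compare values of `n_neighbors` with cross-validation. The sketch below does this on synthetic data, and all values in it are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 72 samples, 5 features, two balanced classes
# separated along the first feature.
y = np.array([1] * 36 + [0] * 36)
X = rng.normal(size=(72, 5))
X[:, 0] += 4.0 * y
# Inflate one uninformative feature so it would dominate unscaled distances.
X[:, 1] *= 1000.0

scores = {}
for k in [1, 3, 5]:
    # The scaler is refit on each training fold, so no test-fold
    # statistics leak into the model.
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(pipe, X, y, cv=5).mean()

print(scores)
```

Putting the scaler inside the pipeline, rather than scaling the whole table up front, keeps the cross-validation estimate honest.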