## Prediction of cancer

In this project you will work with an extension of the gene expression dataset used in the course. Briefly, the dataset consists of 26143 samples coming from either healthy tissues or various cancers.

To complete the challenge, take your time to think about and discuss the following:

1. What is the machine learning task you want to solve? (I encourage you to start by discriminating between cancer and healthy tissue)
2. Design the architecture and loss function.
3. Go further! If you feel confident, do not limit yourself to the points above. Explore your own ideas (be realistic), such as distinguishing between cancer types.
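For the suggested starting point (cancer vs. healthy tissue), the task is a binary classification, and one possible architecture and loss can be sketched as follows. This is only a minimal sketch, not the intended solution: the class name `CancerClassifier`, the hidden size, the dropout rate, and the placeholder `n_genes = 50` are all assumptions made for illustration.

```python
import torch
from torch import nn

class CancerClassifier(nn.Module):
    """Hypothetical MLP for binary cancer-vs-healthy classification."""
    def __init__(self, n_genes, hidden_dim=128, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_genes, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),  # a single logit per sample
        )

    def forward(self, x):
        # drop the trailing dimension so the output has shape (batch,)
        return self.net(x).squeeze(-1)

torch.manual_seed(0)
n_genes = 50  # placeholder; use the number of feature columns in practice
model = CancerClassifier(n_genes)

# BCEWithLogitsLoss applies the sigmoid internally and expects float targets
loss_fn = nn.BCEWithLogitsLoss()

x = torch.randn(8, n_genes)            # random stand-in for expression values
y = torch.randint(0, 2, (8,)).float()  # random stand-in for 0/1 labels
logits = model(x)
loss = loss_fn(logits, y)
```

If you later move to distinguishing between several cancer types, the last layer would output one logit per class and `nn.CrossEntropyLoss` would take the place of the binary loss.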

To facilitate the challenge, in the following cells I download the data and encode the labels for you, based on whether each sample comes from a cancer or not.

In [None]:
!wget https://zenodo.org/record/7828660/files/gtex_with_cancer.csv.gz

In [2]:
import pandas as pd
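# load the tab-separated table (gzip compression is inferred from the file extension)
data = pd.read_csv("gtex_with_cancer.csv.gz", sep="\t")
# encode labels: 1 if the tissue name contains 'carcinoma' (case-sensitive), 0 otherwise
data['label'] = [1 if 'carcinoma' in x else 0 for x in data['tissue']]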
data.head()

| Unnamed: 0 | ENSG00000177374.12 | ENSG00000177707.10 | ENSG00000225950.8 | ENSG00000120526.10 | ENSG00000164070.11 | ENSG00000196663.15 | ENSG00000148200.16 | ENSG00000138755.5 | ENSG00000204262.11 | ENSG00000169618.6 | ... | ENSG00000090905.18 | ENSG00000119844.15 | ENSG00000196152.10 | ENSG00000185024.16 | ENSG00000166170.9 | ENSG00000198026.7 | ENSG00000175970.10 | ENSG00000213780.10 | tissue | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 226787 | 77871 | 0 | 198267 | 72174 | 163235 | 9013 | 58467 | 343201 | 1104 | ... | 386315 | 128187 | 18552 | 164687 | 272994 | 217936 | 183796 | 100360 | Adipose - Subcutaneous | 0 |
| 1 | 889415 | 291912 | 296 | 125298 | 153665 | 193285 | 15275 | 27044 | 1504785 | 4590 | ... | 809856 | 238313 | 29292 | 317344 | 252189 | 330179 | 164379 | 150144 | Adipose - Subcutaneous | 0 |
| 2 | 344838 | 303004 | 0 | 75094 | 57021 | 155177 | 7878 | 593 | 628205 | 472 | ... | 412007 | 157403 | 21848 | 159849 | 266391 | 175808 | 119655 | 66478 | Adipose - Subcutaneous | 0 |
| 3 | 306402 | 124189 | 309 | 47470 | 40676 | 119022 | 3568 | 152 | 676679 | 13064 | ... | 361948 | 93965 | 11706 | 153276 | 273299 | 115782 | 150893 | 105214 | Adipose - Visceral (Omentum) | 0 |
| 4 | 478036 | 153432 | 211 | 169526 | 67231 | 165061 | 8517 | 5854 | 663403 | 395 | ... | 543113 | 205176 | 24213 | 258205 | 335637 | 278530 | 141132 | 180491 | Adipose - Subcutaneous | 0 |
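The label encoding above relies on a case-sensitive substring match, so it is worth checking which tissue names it actually catches. The toy example below illustrates the behaviour; apart from `Adipose - Subcutaneous`, the tissue names are hypothetical and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical tissue names used only to illustrate the matching rule
tissues = [
    "Adipose - Subcutaneous",
    "Breast invasive carcinoma",
    "Lung Adenocarcinoma",
    "Renal Clear Cell Carcinoma",
    "Glioblastoma",
]

# same rule as in the notebook: case-sensitive substring match
labels = [1 if 'carcinoma' in t else 0 for t in tissues]

# case-insensitive variant, shown for comparison
labels_ci = [1 if 'carcinoma' in t.lower() else 0 for t in tissues]

print(labels)           # [0, 1, 1, 0, 0]
print(labels_ci)        # [0, 1, 1, 1, 0]
print(Counter(labels))  # Counter({0: 3, 1: 2})
```

In both variants a cancer whose name does not contain "carcinoma" would be labelled 0. On the real data, `data['tissue'].value_counts()` and `data['label'].value_counts()` show which tissues end up in each class and how balanced the two classes are.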


### Dataset class, train & test split and dataloaders

In [3]:
import torch
from torch.utils.data import Dataset, DataLoader

class GeneExpressionDataset(Dataset):
    '''
    Creates a Dataset class for the gene expression dataset.
    gene_dim is the number of genes (features).
    The rows of the dataframe contain samples, and the
    columns contain gene expression values followed by the
    metadata columns (tissue name and class label).
    '''
    def __init__(self, dataset, metadata_columns=-2, label_position=-1):
        '''
        Args:
            dataset: pandas dataframe containing input and output data
            metadata_columns: negative offset at which the trailing
                non-feature columns (tissue, label) start
            label_position: column id of the class labels
        '''
        self.dataframe = dataset
        self.label_position = label_position

        self.label = torch.tensor(self.dataframe.iloc[:, label_position].to_numpy())
        # every column before the metadata columns is used as a feature; this
        # includes the 'Unnamed: 0' column seen in data.head() unless it is dropped first
        self.data = torch.tensor(self.dataframe.iloc[:, :metadata_columns].to_numpy()).float()
        self.gene_dim = self.data.shape[1]

    def __len__(self):
        return len(self.dataframe)
    
    def __getitem__(self, idx):
        # get expression and labels
        expression = self.data[idx,:]
        label = self.label[idx]
        return expression, label
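The expression values shown in `data.head()` span several orders of magnitude (from 0 to above one million), which can make training a neural network unstable. A common remedy, offered here as a suggestion rather than as part of the original notebook, is a log transform followed by per-gene standardization using statistics computed on the training samples only. The helper names `fit_scaler` and `transform` are hypothetical.

```python
import torch

def fit_scaler(x_train):
    """Compute per-gene mean and std of log1p-transformed training data."""
    x_log = torch.log1p(x_train)
    mean = x_log.mean(dim=0)
    # clamp to avoid division by zero for genes with constant expression
    std = x_log.std(dim=0).clamp_min(1e-8)
    return mean, std

def transform(x, mean, std):
    """Apply log1p, then standardize with the training statistics."""
    return (torch.log1p(x) - mean) / std

# first three genes of the first three rows shown in data.head()
x_train = torch.tensor([
    [226787.0, 77871.0, 0.0],
    [889415.0, 291912.0, 296.0],
    [344838.0, 303004.0, 0.0],
])
mean, std = fit_scaler(x_train)
z = transform(x_train, mean, std)
```

The same `mean` and `std` would then be reused to transform the test samples, so that no information from the test set leaks into preprocessing.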

In [4]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data,test_size=0.2,random_state=42)
train_dataset = GeneExpressionDataset(train)
test_dataset = GeneExpressionDataset(test)
train_loader = DataLoader(train_dataset, batch_size=32, num_workers=0, shuffle=True)
# no shuffling is needed for evaluation
test_loader = DataLoader(test_dataset, batch_size=32, num_workers=0, shuffle=False)
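With the loaders in place, a training and evaluation loop for the binary task could look roughly like the sketch below. To keep it self-contained it runs on synthetic data wrapped in a `TensorDataset`; the model, learning rate, number of epochs, and the function names `train_one_epoch` and `evaluate` are assumptions for illustration, not the intended solution.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# synthetic stand-in for the gene expression data: the label depends on feature 0
n_samples, n_genes = 256, 20
x = torch.randn(n_samples, n_genes)
y = (x[:, 0] > 0).long()  # integer labels, like the ones the Dataset returns
train_loader = DataLoader(TensorDataset(x[:200], y[:200]), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(x[200:], y[200:]), batch_size=32, shuffle=False)

model = nn.Sequential(nn.Linear(n_genes, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

def train_one_epoch(model, loader):
    """Run one pass over the loader and return the mean training loss."""
    model.train()
    total = 0.0
    for expression, label in loader:
        optimizer.zero_grad()
        logits = model(expression).squeeze(-1)
        # BCEWithLogitsLoss needs float targets, so cast the integer labels
        loss = loss_fn(logits, label.float())
        loss.backward()
        optimizer.step()
        total += loss.item() * len(label)
    return total / len(loader.dataset)

@torch.no_grad()
def evaluate(model, loader):
    """Return the accuracy of the model on the loader."""
    model.eval()
    correct = 0
    for expression, label in loader:
        # a logit above 0 corresponds to a predicted probability above 0.5
        pred = (model(expression).squeeze(-1) > 0).long()
        correct += (pred == label).sum().item()
    return correct / len(loader.dataset)

losses = [train_one_epoch(model, train_loader) for _ in range(20)]
accuracy = evaluate(model, test_loader)
```

If the two classes turn out to be imbalanced, accuracy alone can be misleading; metrics such as precision, recall, or a confusion matrix give a clearer picture.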
