# Programming Exercise Week 10

You can download the data [here](https://drive.google.com/file/d/1dpuJKyX6vvSRGDTiRRHsk5OChK23joRA/view?usp=share_link). Please do not hesitate to contact Xiaochen Zheng by [this email](mailto:xzheng@ethz.ch) if you have any questions.

The path to your data folder, where the data is saved:

In [None]:
pwd = ''

We will use a special package to load and investigate the dataset, the [pandas](https://pandas.pydata.org/docs/) library. Take a look at the documentation to see all the options!

In [None]:
import os
import pandas as pd # Package to load and investigate data

## Dataset description

In this exercise, we want to design a machine learning/deep learning algorithm to help determine/predict whether a patient is non-diabetic (int `0`) or diabetic (int `1`). Each patient is identified with a unique patient ID (pid). In `full_data_train.csv`, medical, demographic, and diagnosis data for each patient is arranged in 20 consecutive rows. Research has identified the following as **important risk factors** for diabetes:

```high blood pressure, high cholesterol, smoking, obesity, age and sex, race, diet, exercise, alcohol consumption, BMI, household income, marital status, sleep, time since last checkup, education, health care coverage, mental health```

Given these risk factors, we selected the related features from an open survey on diabetes.


### Features

`Diabetes_binary`

(Ever diagnosed) diabetes

`HighBP` -> `Bool`

High Blood Pressure

`HighChol` -> `Bool`

High Cholesterol

`CholCheck` -> `Bool`

Cholesterol check within past five years

`BMI` -> `Float`

Body Mass Index (BMI)

`Smoker` -> `Bool`

Have you smoked at least 100 cigarettes (5 packs) in your entire life? 

`Stroke` -> `Bool`

(Ever diagnosed) stroke.

`HeartDiseaseorAttack` -> `Bool`

Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)

`PhysActivity` -> `Bool`

Adults who reported doing physical activity or exercise during the past 30 days other than their regular job

`Fruits` -> `Bool`

Consume Fruit 1 or more times per day 

`Veggies` -> `Bool`

Consume Vegetables 1 or more times per day 

`HvyAlcoholConsump` -> `Bool`

Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)

`AnyHealthcare` -> `Bool`

Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service? 

`NoDocbcCost` -> `Bool`

Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?

`GenHlth` -> `Int`

Would you say that in general your health is between 5 (highest) and 1 (lowest).

`MentHlth` -> `Int`

Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? 

`PhysHlth` -> `Int`

Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? 

`DiffWalk` -> `Int`

Do you have serious difficulty walking or climbing stairs? 


`Sex`, and `Age` -> `Int`

`Education` -> `Int`

An ordinal variable ranging from 1 (never attended school or kindergarten only) to 6 (college 4 years or more).


`Income` -> `Int`

An ordinal variable ranging from 1 (less than \$10,000) to 8 (\$75,000 or more).

### Load the data 

In [None]:
# Training dataset
full_train = pd.read_csv(os.path.join(pwd, 'full_data_train.csv'))
# Test dataset
X_test = pd.read_csv(os.path.join(pwd, 'indicators_test.csv'))
y_test = pd.read_csv(os.path.join(pwd, 'y_test.csv'))

### Check the raw data

Use ```pandas.DataFrame.info``` to show non-null counts, data types, and memory usage.

In [None]:
full_train.info()

In [None]:
X_test.info()
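
As a complement to `info()`, missing values can also be counted per column. The sketch below uses a small invented frame (the values are made up and only the column names mimic the exercise), so it illustrates the calls rather than the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; not the exercise data.
toy = pd.DataFrame({
    "PID": [1, 2, 3],
    "BMI": [22.5, np.nan, 31.0],
    "HighBP": [0, 1, 1],
})

toy.info()                              # dtypes, non-null counts, memory usage
missing_per_column = toy.isna().sum()   # number of missing values in each column
print(missing_per_column)
```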

## Data Preprocessing
Take a look at the raw data and think carefully about which data preprocessing methods are needed.

In [None]:
# You do not necessarily need to do anything here, this is just to provide some space to look at the dataset's properties and contents.
"*** YOUR CODE HERE. ***"

### Task 1
Notice that there is one column named **PID** in both *full_train* and *X_test*. Why is it better to remove this column from the data?

 Your Answer:

### Task 2
Use the pandas `drop` function to remove the PID column from the test and training sets.

In [None]:
full_train = ... # drop the PID column
X_test = ... # drop the PID column
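
For orientation, here is a minimal sketch of `DataFrame.drop` on an invented frame. The frame and its values are hypothetical; only the `columns=` keyword is the point. Note that `drop` returns a new frame and leaves the original unchanged unless the result is reassigned.

```python
import pandas as pd

# Toy frame with invented values; not the exercise data.
toy = pd.DataFrame({
    "PID": [101, 102],
    "BMI": [22.5, 31.0],
    "Diabetes_binary": [0, 1],
})

toy_no_id = toy.drop(columns=["PID"])   # returns a copy without the PID column
print(list(toy_no_id.columns))
```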

### Task 3
Separate the labels in the column `Diabetes_binary` from the training set and create the new variables `y_train` and `y_test` containing the labels.

In [None]:
X_train = ... # drop Diabetes_binary column
y_train = ... # use Diabetes_binary column as labels
y_test = ... # use Diabetes_binary column as labels
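
Splitting a frame into features and labels can be sketched as follows. The frame is again invented for illustration, and the names `toy_features` and `toy_labels` are hypothetical.

```python
import pandas as pd

# Toy frame with invented values; not the exercise data.
toy = pd.DataFrame({
    "BMI": [22.5, 31.0, 27.3],
    "HighBP": [0, 1, 1],
    "Diabetes_binary": [0, 1, 0],
})

toy_features = toy.drop(columns=["Diabetes_binary"])   # every column except the label
toy_labels = toy["Diabetes_binary"]                     # the label column as a Series
print(toy_features.shape, toy_labels.shape)
```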

### Task 4 - Standardization Scaling
Notice that different features have different scales. For example, `BMI` ranges from 12.0 to 98.0 and `Age` ranges from 1.0 to 13.0. Normalization is a data preparation technique that is frequently used in machine learning to deal with data with different scales.

Here you will apply **standardization scaling**. The term **standardization** refers to centering a variable at zero and scaling its variance to one. The procedure subtracts the feature's mean from each observation and then divides by the feature's standard deviation. The rescaled features then have a mean of zero and a standard deviation of one, as a standard normal distribution does (the shape of each feature's distribution itself is unchanged).

***Hint***: Use `numpy` or the `StandardScaler` provided by `sklearn`
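
The two options in the hint can be compared on a toy array. This is a sketch under assumptions: the arrays are invented, and it follows the common practice of computing the mean and standard deviation on the training data only and reusing them for the test data. Each value is transformed as $z = (x - \mu) / \sigma$.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented toy data: two features on very different scales.
train = np.array([[20.0, 1.0], [30.0, 5.0], [40.0, 9.0]])
test = np.array([[25.0, 3.0]])

# Option 1: plain numpy, with statistics taken from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)            # population std (ddof=0), as StandardScaler uses
train_np = (train - mean) / std
test_np = (test - mean) / std

# Option 2: StandardScaler, fitted on the training data and reused for the test data.
scaler = StandardScaler().fit(train)
train_sk = scaler.transform(train)
test_sk = scaler.transform(test)

print(np.allclose(train_np, train_sk), np.allclose(test_np, test_sk))
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.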

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler

"*** YOUR CODE HERE. ***"

### Task 5 - Data Structure
By applying `pd.read_csv()`, you store your data in `pandas.DataFrame`. After finishing task 4, you should store your data in `numpy.ndarray`. But for `torch.nn.Module`, you need to transfer your data to the data type `torch.Tensor`.

***Hint***: Try to learn and apply [torch.from_numpy()](https://pytorch.org/docs/stable/generated/torch.from_numpy.html), [torch.Tensor.to()](https://pytorch.org/docs/stable/generated/torch.Tensor.to.html), [torch.tensor()](https://pytorch.org/docs/stable/generated/torch.tensor.html).
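
A minimal sketch of the conversions named in the hint, on invented arrays. The choice of `float32` for features and `long` for labels is an assumption that matches the defaults of `torch.nn.Linear` and the class-index targets expected by `torch.nn.CrossEntropyLoss`.

```python
import numpy as np
import torch

# Invented toy arrays; not the exercise data.
features_np = np.array([[0.5, -1.2], [1.5, 0.3]], dtype=np.float64)
labels_np = np.array([0, 1])

# from_numpy shares memory with the array; .to() then casts to float32.
features_t = torch.from_numpy(features_np).to(torch.float32)
# torch.tensor copies the data and sets the integer dtype explicitly.
labels_t = torch.tensor(labels_np, dtype=torch.long)

print(features_t.dtype, labels_t.dtype)
```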

Transform the variables `X_train`, `X_test`, `y_train` and `y_test`

In [None]:
import torch
"*** YOUR CODE HERE. ***"


## Deep learning model

### Task 6
Finish the deep learning skeleton step-by-step.

***Hint***: 

(1) Implement a multilayer perceptron with several [linear layers](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) (e.g. 4 linear layers) followed by [relu activation](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU). We show you an example of the first layer.

(2) Take regularization into account and implement certain layers to avoid overfitting. e.g.

(3) Make sure that the output size of each layer matches the input size of the subsequent layer by checking `tensor.shape`.

(4) Make sure that your model's output has the size ($N$, 2), where $N$ is the batch size and 2 is the number of possible outcomes (diabetic or non-diabetic).
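
Hints (2) to (4) can be illustrated on a deliberately tiny network. This is only a sketch: the layer sizes are arbitrary, and using `nn.Dropout` as the regularization layer is an assumption, since the hint does not name a specific layer.

```python
import torch
from torch import nn

# Toy network with arbitrary sizes; not the architecture asked for in the task.
toy_mlp = nn.Sequential(
    nn.Linear(8, 16),    # 8 input features -> 16 hidden units
    nn.ReLU(),
    nn.Dropout(p=0.2),   # randomly zeroes activations during training only
    nn.Linear(16, 2),    # input size 16 must match the previous output size
)

batch = torch.randn(4, 8)   # batch of N = 4 samples
logits = toy_mlp(batch)
print(logits.shape)         # torch.Size([4, 2])
```

Dropout is only active after `model.train()`. Calling `model.eval()` disables it, which is why the `test` function switches the mode.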



In [None]:
from torch import nn
class YourModel(torch.nn.Module):
    """ Your model should inherit from torch.nn.Module.
    """
    def __init__(self):
        super().__init__()
        "*** YOUR CODE HERE ***"
        self.fc1 = nn.Linear(21,64)

    def forward(self, x):
        '''Forward pass.'''
        x = self.fc1(x)
        x = nn.functional.relu(x)
        "*** YOUR CODE HERE ***"

In [None]:
def train(model, train_loader, criterion, optimizer, epoch):
    model.train()
    # Iterate over the DataLoader for training data
    for batch_idx, (data, target) in enumerate(train_loader):
        # Zero the gradients
        "*** YOUR CODE HERE ***"
        
        # Perform forward pass
        "*** YOUR CODE HERE ***"
        
        # Compute loss
        "*** YOUR CODE HERE ***"
        
        # Perform backward pass
        "*** YOUR CODE HERE ***"
        
        # Perform optimization
        "*** YOUR CODE HERE ***"
        
        # Printing
        if batch_idx % 10 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


def test(model, test_loader, criterion):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            output = model(data)
            test_loss += criterion(output, target).item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the largest logit, i.e. the predicted class
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
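
The five steps listed in the `train` comments follow the usual PyTorch pattern. The sketch below runs one such step on an invented linear model and random data. The model, the SGD optimizer, and all sizes are placeholders chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_model = nn.Linear(3, 2)
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

data = torch.randn(5, 3)
target = torch.tensor([0, 1, 0, 1, 1])   # class indices, shape (5,), dtype long

weights_before = toy_model.weight.detach().clone()

optimizer.zero_grad()              # zero the gradients
output = toy_model(data)           # forward pass
loss = criterion(output, target)   # compute loss
loss.backward()                    # backward pass
optimizer.step()                   # update the parameters

print(loss.item())
```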

In [None]:
from torch.utils.data import TensorDataset, DataLoader
def main(train_data, train_label, test_data, test_label, batch_size, epochs):
    """ Training your model.

    Args:
        train_data, test_data (tensor): The training/testing data. It should have a shape of (n_instance, n_features).
        train_label, test_label (tensor): The labels of training/testing instances. For torch.nn.CrossEntropyLoss with class-index targets, it should have a shape of (n_instance,) and dtype torch.long.
        batch_size  (Union[int, NoneType]): The number of samples loaded for one iteration.
        epochs (Union[int, NoneType]): The number of epochs. Training stops once this many epochs have completed.
    """
    # Set fixed random number seed. DO NOT CHANGE IT.
    torch.manual_seed(336699)
    
    # Prepare the datasets and data loaders.
    train_dataset = "*** YOUR CODE HERE. ***" # TensorDataset()
    train_loader = "*** YOUR CODE HERE. ***" # DataLoader()
    test_dataset = "*** YOUR CODE HERE. ***" # TensorDataset()
    test_loader = "*** YOUR CODE HERE. ***" # DataLoader()

    # Initialize proposed model.
    model = "*** YOUR CODE HERE. ***"

    # Define the loss function and optimizer. You can freely choose your loss function and optimizer based on your task.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = torch.nn.CrossEntropyLoss()
    criterion_test = torch.nn.CrossEntropyLoss(reduction='sum')

    # Run the training loop
    for epoch in range(1, epochs+1):
        # Print epoch
        print(f'Starting epoch {epoch}')

        train(model, train_loader, criterion, optimizer, epoch)
        test(model, test_loader, criterion_test)
    
    # Process is complete.
    print('Training process has finished.')


if __name__ == '__main__':
    # Run your code here.
    main(train_data, train_label, test_data, test_label, batch_size, epochs)  # replace with your variable names
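
For the dataset preparation step in `main`, the following sketch shows how `TensorDataset` and `DataLoader` fit together. The tensors are random placeholders, and the batch size of 4 is an arbitrary choice for illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Random placeholder tensors; not the exercise data.
features = torch.randn(10, 3)
labels = torch.randint(0, 2, (10,))     # class indices, shape (10,)

dataset = TensorDataset(features, labels)                  # pairs each row with its label
loader = DataLoader(dataset, batch_size=4, shuffle=True)   # shuffled mini-batches

batch_sizes = [x.shape[0] for x, y in loader]
print(len(dataset), len(loader), batch_sizes)
```

With 10 samples and a batch size of 4, the last batch is smaller than the others unless `drop_last=True` is passed to `DataLoader`.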