Now that I have a good idea of the dataset, I always start with a review of the state of the art and a problem definition.

I use https://www.notion.so/ to document my state of the art, with a template that I reuse for all my research.
It contains:
- Model / architecture selection
- Hyper-parameters
- Rough description of the data (origin, size, date, features…)
- Results (e.g. precision, recall, F1…)
- A link to a snapshot of data (if possible)
- Commentary and learning
- ...
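
The template above maps naturally onto a small record type. The sketch below is only an illustration of that structure: the class name `ExperimentNote`, its field names and the example values are hypothetical and are not part of the Notion template itself.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, Optional


@dataclass
class ExperimentNote:
    """One entry of a state-of-the-art review (hypothetical field names)."""
    model: str                                # model / architecture
    hyper_parameters: Dict[str, float]        # e.g. learning rate, batch size
    data_description: str                     # origin, size, date, features
    results: Dict[str, float] = field(default_factory=dict)  # precision, recall, f1...
    data_snapshot_url: Optional[str] = None   # link to a data snapshot, if any
    commentary: str = ""                      # commentary and learning


# Placeholder values only, not real results.
note = ExperimentNote(
    model="ResNet-like baseline",
    hyper_parameters={"lr": 3e-4, "batch_size": 64},
    data_description="placeholder description",
    results={"f1": 0.0},
)
print(asdict(note))
```

Keeping every entry in the same shape makes it easier to compare approaches side by side later on.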

The problem definition is really important for evaluating the client's needs (data availability + accuracy requirement + problem difficulty).
- Understand how it will benefit the business.
- Once you have an idea and have determined business compatibility, define a success metric. Is it 90%, 95% or 99% accuracy?
- Does your model need to work in real time?
- The project should have high impact: cheap prediction should be valuable for the complex parts of your business process.
- The project should have high feasibility, which is driven by data availability, accuracy requirements, and problem difficulty.
- The result should be achievable by a human. If a human can't do it, the machine probably won't be able to either.
- Make a checklist for the project and plan your time.

For state-of-the-art models, you can look on paperswithcode and use frameworks like mmdetection, mmtracking, mmsegmentation or detectron2 to test them easily.
=> In industry you don't need perfect metrics. You need to build things so they can be used and upgraded, then improve slowly once the infra is in place and working.
=> But don't forget: the first thing to do is to get a model that works as well as the current solution. Improve it after.

When I can't find a model that fits my needs, I take inspiration from existing models to build my own.
I always start with a simple architecture:
- If you don't find a pre-trained model that could work:
   - If your data looks like images, start with a LeNet-like architecture and consider using something like ResNet as your codebase gets more mature.
   - If your data looks like sequences, start with an LSTM with one hidden layer and/or temporal/classical convolutions. Then, when your problem gets more mature, you can move to an attention-based model or a WaveNet-like model.
   - For all other tasks, start with a fully-connected neural network with one hidden layer and use more advanced networks later depending on the problem.
- Try to start with a classical machine learning algorithm rather than a neural network (XGBoost, RandomForest, a simple algorithm...).
- Frame all problems as binary classification.
- Build complicated data pipelines later. These are important for large-scale ML systems, but you should not start with them because data pipelines themselves can be a big source of bugs. Just start with a dataset that you can load into memory.
=> I always try to get a working model first. I try to overfit a single batch to spot errors in my code.
=> You need 10 to 15 epochs to know whether your network is good or bad; don't change anything before that. After that, you can analyse your loss to spot problems (underfitting/overfitting) and see where the model performs poorly, to maybe add data.
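
The "overfit a single batch" check mentioned above can be sketched as follows. This is a minimal illustration on synthetic data, assuming PyTorch is installed; the layer sizes, learning rate and number of steps are arbitrary choices, not values from this project. If the loss on one fixed batch does not drop sharply, something in the model or training code is probably wrong.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One fixed batch of synthetic data (8 samples, 4 features).
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

# Small network with one hidden layer, in the spirit of the advice above.
model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

initial_loss = loss_fn(model(inputs), targets).item()

# Train repeatedly on the same batch: the loss should fall well below its start.
for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

final_loss = loss_fn(model(inputs), targets).item()
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```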

For hyper-parameters:
- I select sensible hyper-parameter defaults:
   - Common learning rates are α = 0.1, 0.01, 0.001.
       - Adam optimizer with a “magic” learning rate value of 3e-4.
   - ReLU activation for fully-connected and convolutional models and Tanh activation for LSTM models.
   - He initialization for ReLU activations and Glorot initialization for Tanh activations.
   - No regularization and data normalization.
   - Typical batch sizes are 32, 64, 128, and 256.
   - SGD with momentum or mini-batch gradient descent on big datasets, Nesterov acceleration on small datasets.
        - The momentum term γ is commonly set to 0.9.
   - Add a regularization penalty.
   - Implement learning rate schedulers to increase classification accuracy.
- Simplify the problem:
    - Work with a small training set of around 10,000 examples.
    - Use a fixed number of objects, classes, input size, etc.
- Evaluate Rank 1 and Rank 5 accuracy.
- Choose a simple performance metric, but only one! Try to find one which corresponds to your business and the objective of the model.
    - Classification :
        - Accuracy - If the classes are well balanced in the dataset like 50/50
        - Precision (P) - If the classes are not well balanced in the dataset like 10/90
        - Recall (R)
        - F1 Score (F1)
        - Area under the ROC curve, or simply AUC
        - Log loss
        - Precision at k (P@k)
        - Average precision at k (AP@k)
        - Mean average precision at k (MAP@k)
    - Regression :
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root mean squared error (RMSE)
        - Root mean squared logarithmic error (RMSLE)
        - Mean percentage error (MPE)
        - Mean absolute percentage error (MAPE)
        - R²
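
To make the point about balanced versus unbalanced classes concrete, here is a small sketch in plain Python. The labels are made up and the helper name `classification_metrics` is hypothetical. On a 10/90 split, a model that always predicts the majority class still reaches 90% accuracy while its precision, recall and F1 are all zero.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when there are no positive predictions/labels.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy 10/90 dataset: a single positive among ten samples.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
metrics = classification_metrics(y_true, always_negative)
print(metrics)
```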

I then set up my tools:
- I use DVC to manage my dataset.
- Weights & Biases to track my experiments and tune my hyper-parameters.
- Ray to distribute my training.

At the end, to improve your models you can:
   - If you can, add data. Make sure your dataset is well balanced and free of errors (treat missing and outlier values).
   - If the training is too slow:
        - Try to reduce I/O latency: convert the dataset to HDF5 first. It helps load the data faster.
   - Feature Engineering
        - Feature engineering is highly influenced by hypothesis generation. Good hypotheses result in good features. That's why I always suggest investing quality time in hypothesis generation. The feature engineering process can be divided into two steps:
            - Feature transformation: there are various scenarios where feature transformation is required:
               - Changing the scale of a variable from its original scale to a scale between zero and one.
               - Some algorithms work well with normally distributed data, so we must remove the skewness of the variable(s).
               - Sometimes, creating bins of numeric data works well, since it also handles outlier values. Numeric data can be made discrete by grouping values into bins. This is known as data discretization.
            - Feature creation: deriving new variable(s) from existing variables is known as feature creation. It helps reveal hidden relationships in a data set.
   - Feature Selection
        - Feature selection is the process of finding the subset of attributes that best explains the relationship between the independent variables and the target variable.
           - You can select the useful features based on various criteria, such as:
                - Domain knowledge: based on domain experience, we select feature(s) which may have a higher impact on the target variable.
                - Visualization: as the name suggests, it helps to visualize the relationship between variables, which makes the variable selection process easier.
   - Ensemble methods - This is the most common approach in winning solutions of data science competitions. This technique combines the results of multiple weak models to produce better results. This can be achieved in many ways:
        - Bagging (Bootstrap Aggregating)
        - Boosting
   - To be sure of the results you can use :
        - Blending
        - Stacking
   - Tune hyper-parameters at the end (the label is how sensitive the model typically is to each one):
        - Learning rate : High
        - Learning rate schedule : High
        - Loss function : High
        - Layer size : High
        - Weight initialization : Medium
        - Model depth : Medium
        - Layer params : Medium
        - Weight of regularization : Medium
        - Optimizer choice : Low
        - Other optimizer params : Low
        - Batch size : Low
        - Nonlinearity : Low
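
The three feature transformations listed above (rescaling to [0, 1], removing skewness, binning) can be sketched with NumPy. The values and bin edges below are made up for illustration, and the log transform is just one common way to reduce right skew, not necessarily the one to use on a given dataset.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 10.0, 100.0])  # toy right-skewed feature

# 1) Rescale from the original scale to the [0, 1] range (min-max scaling).
scaled = (values - values.min()) / (values.max() - values.min())

# 2) Reduce right skew with a log transform (log1p also handles zeros).
logged = np.log1p(values)

# 3) Discretize into bins; every value above the last edge lands in the top bin,
#    which is how binning absorbs outliers.
bin_edges = np.array([2.5, 5.0, 50.0])  # hypothetical edges
binned = np.digitize(values, bin_edges)

print(scaled)
print(logged)
print(binned)
```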


Then, when I have a good model for production:
- I use Triton to get the most out of my GPUs.
- I convert my model to TensorRT (I convert it to ONNX first because I'm using PyTorch).
- I reduce the floating-point precision to get better performance (FP32 => FP16 => FP8).
- I switch from numpy or pandas to RAPIDS.
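
The trade-off behind reducing precision can be illustrated without any inference engine. The sketch below uses NumPy only, with a made-up array standing in for model weights; it is not how TensorRT performs the conversion. It simply shows that FP16 halves the memory footprint at the cost of a small rounding error.

```python
import numpy as np

# Stand-in for a block of model weights stored in FP32.
weights_fp32 = np.linspace(-1.0, 1.0, 1000, dtype=np.float32)

# Cast to half precision.
weights_fp16 = weights_fp32.astype(np.float16)

# Memory used by FP16 relative to FP32.
memory_ratio = weights_fp16.nbytes / weights_fp32.nbytes

# Largest rounding error introduced by the cast.
max_abs_error = float(np.max(np.abs(weights_fp32 - weights_fp16.astype(np.float32))))

print(f"memory ratio: {memory_ratio}, max abs error: {max_abs_error:.6f}")
```

Whether that error is acceptable depends on the model, so it is worth re-running the evaluation metric after each precision change.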

In [62]:
import pandas as pd
from lazypredict.Supervised import LazyRegressor
from sklearn.preprocessing import StandardScaler,LabelEncoder,OneHotEncoder
from sklearn.model_selection import KFold
from sklearn.compose import ColumnTransformer
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

In [63]:
train_df = pd.read_csv("./data/osic/train.csv")
test_df = pd.read_csv("./data/osic/test.csv")

# We will try to train a model only on Weeks FVC Percent Age Sex and Smoking Status
# Drop Patient column
train_df = train_df.drop(columns="Patient")
test_df = test_df.drop(columns="Patient")

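# Convert Sex and SmokingStatus to integers, with one LabelEncoder per column
# so that each fitted mapping stays available afterwards
sex_encoder = LabelEncoder()
smoking_encoder = LabelEncoder()

train_df['Sex'] = sex_encoder.fit_transform(train_df['Sex'])
test_df['Sex'] = sex_encoder.transform(test_df['Sex'])

train_df['SmokingStatus'] = smoking_encoder.fit_transform(train_df['SmokingStatus'])
test_df['SmokingStatus'] = smoking_encoder.transform(test_df['SmokingStatus'])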

train_df.head()

Unnamed: 0,Weeks,FVC,Percent,Age,Sex,SmokingStatus
0,-4,2315,58.25,79,1,1
1,5,2214,55.71,79,1,1
2,7,2061,51.86,79,1,1
3,9,2144,53.95,79,1,1
4,11,2069,52.06,79,1,1
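
`LabelEncoder` assigns integers to the distinct labels in sorted order, which is why `Male` and `Ex-smoker` both show up as 1 in the table above. The sketch below mimics that mapping in plain Python. The helper name `fit_label_mapping` is hypothetical, and the category names other than `Male` and `Ex-smoker` are assumptions about the dataset.

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer, following sorted label order."""
    return {label: index for index, label in enumerate(sorted(set(values)))}


# Assumed categories, for illustration only.
sex_mapping = fit_label_mapping(["Male", "Female", "Male"])
smoking_mapping = fit_label_mapping(["Ex-smoker", "Never smoked", "Currently smokes"])

print(sex_mapping)
print(smoking_mapping)
```

Note that these integers impose an arbitrary order on the categories, which is one reason the later cell switches to one-hot encoding.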


In [64]:
# Try to find the best baseline model using lazypredict
reg = LazyRegressor(verbose=0, ignore_warnings=True, custom_metric=None)
# fit() expects X_train, X_test, y_train, y_test in that order
models, predictions = reg.fit(
    train_df.drop(columns="FVC"), test_df.drop(columns="FVC"),
    train_df["FVC"], test_df["FVC"],
)

print(models)

100%|██████████| 42/42 [00:05<00:00,  7.87it/s]

                               Adjusted R-Squared  R-Squared    RMSE  \
Model                                                                  
KernelRidge                                140.04     -33.76 2723.52   
MLPRegressor                                87.02     -20.51 2142.23   
LinearSVR                                   35.11      -7.53 1348.98   
AdaBoostRegressor                            6.25      -0.31  529.03   
RANSACRegressor                              6.06      -0.27  519.78   
OrthogonalMatchingPursuit                    5.90      -0.22  511.11   
OrthogonalMatchingPursuitCV                  5.82      -0.21  507.28   
LassoLarsCV                                  5.77      -0.19  504.43   
LarsCV                                       5.77      -0.19  504.43   
PassiveAggressiveRegressor                   5.76      -0.19  504.03   
LassoCV                                      5.76      -0.19  503.98   
SGDRegressor                                 5.75      -0.19  50
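
The Adjusted R-Squared column above contains values far greater than 1, which looks odd next to negative R-Squared values. The usual formula is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of evaluation rows and p the number of features. When n is not larger than p + 1, the denominator is zero or negative and the figure stops being meaningful. The printed numbers are consistent with n = 5 and p = 5, but that is an inference from the table, not something stated in this notebook.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Standard adjusted R² formula; undefined when n_samples == n_features + 1."""
    denominator = n_samples - n_features - 1
    if denominator == 0:
        raise ValueError("adjusted R² is undefined when n_samples == n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / denominator


# Assumed n = 5 evaluation rows and p = 5 features (inferred, see above).
print(round(adjusted_r2(-33.76, 5, 5), 2))

# With many more rows than features, the adjustment is small.
print(round(adjusted_r2(0.80, 1000, 5), 4))
```

In this situation RMSE is the more trustworthy column for comparing the models.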




When the results are not good, I fall back to a simple architecture.

In [65]:
# Basic data transformation: add weeks since each patient's first visit and the baseline FVC
def transform_df(dataframe):
    # Drop every row of a duplicated (Patient, Weeks) pair; this modifies `dataframe` in place
    dataframe.drop_duplicates(keep=False,inplace=True,subset=['Patient','Weeks'])
    dataframe['Weeks'] = dataframe['Weeks'].astype(int)
    dataframe['min_week'] = dataframe.groupby('Patient')['Weeks'].transform('min')
    dataframe['baseline_week'] = dataframe['Weeks'] - dataframe['min_week']
    # The FVC measured at the first visit becomes the per-patient baseline
    base_df = dataframe.loc[dataframe.Weeks == dataframe.min_week][['Patient','FVC']].copy()
    base_df.columns = ['Patient','base_FVC']

    # Keep a single baseline row per patient
    base_df['nb']=1
    base_df['nb'] = base_df.groupby('Patient')['nb'].transform('cumsum')

    base_df = base_df[base_df.nb==1]
    base_df.drop('nb',axis =1,inplace=True)
    df = dataframe.merge(base_df,on="Patient",how='left')
    # min_week stays in the frame as a helper column; it is not one of the training columns

    return df

In [66]:
train_df = pd.read_csv("./data/osic/train.csv")
test_df = pd.read_csv("./data/osic/test.csv")

train_df = transform_df(train_df)
test_df = transform_df(test_df)

train_columns = ['baseline_week','base_FVC','Percent','Age','Sex','SmokingStatus']
train_label = ['FVC']
sub_columns = ['Patient_Week','FVC','Confidence']

train = train_df[train_columns]
test = test_df[train_columns]

# Pre-processing: standardize the 4 numeric columns, one-hot encode Sex and SmokingStatus
transformer = ColumnTransformer([('s',StandardScaler(),[0,1,2,3]),('o',OneHotEncoder(),[4,5])])
target = train_df[train_label].values
train = transformer.fit_transform(train)
test = transformer.transform(test)

train_df.head()

Unnamed: 0,Patient,Weeks,FVC,Percent,Age,Sex,SmokingStatus,min_week,baseline_week,base_FVC
0,ID00007637202177411956430,-4,2315,58.25,79,Male,Ex-smoker,-4,0,2315
1,ID00007637202177411956430,5,2214,55.71,79,Male,Ex-smoker,-4,9,2315
2,ID00007637202177411956430,7,2061,51.86,79,Male,Ex-smoker,-4,11,2315
3,ID00007637202177411956430,9,2144,53.95,79,Male,Ex-smoker,-4,13,2315
4,ID00007637202177411956430,11,2069,52.06,79,Male,Ex-smoker,-4,15,2315


In [67]:
# Simple PyTorch model, as described in the introduction of this file
class Model(nn.Module):
    def __init__(self,n):
        super().__init__()
        self.layer1 = nn.Linear(n,200)
        self.layer2 = nn.Linear(200,100)
        self.out1 = nn.Linear(100,3)  # unconstrained part of the 3 quantile outputs
        self.out2 = nn.Linear(100,3)  # non-negative offsets (after ReLU)

    def forward(self,xb):
        x1 =  F.leaky_relu(self.layer1(xb))
        x1 =  F.leaky_relu(self.layer2(x1))
        o1 = self.out1(x1)
        o2 = F.relu(self.out2(x1))
        # Adding cumulative non-negative offsets nudges the 3 outputs toward
        # increasing order (not a hard guarantee, since o1 is unconstrained)
        return o1 + torch.cumsum(o2,dim=1)

In [68]:
def score(outputs, target):
    # Negated Laplace log-likelihood, so lower is better here.
    # outputs[:,1] is the FVC prediction; outputs[:,2] - outputs[:,0] is the confidence
    confidence = outputs[:,2] - outputs[:,0]
    clip = torch.clamp(confidence, min=70)
    target = torch.reshape(target, outputs[:,1].shape)
    delta = torch.abs(outputs[:, 1] - target)
    delta = torch.clamp(delta,max=1000)
    sqrt_2 = 2 ** 0.5
    metrics = (delta*sqrt_2/clip) + torch.log(clip*sqrt_2)
    return torch.mean(metrics)

def qloss(outputs, target):
    # Pinball (quantile) loss, summed over the three quantiles
    qs = torch.tensor([0.25,0.5,0.75],dtype=torch.float,device=outputs.device)
    e =  target - outputs
    v = torch.max(qs*e,(qs-1)*e)
    v = torch.sum(v,dim=1)
    return torch.mean(v)

def loss_fn(outputs, target, l):
    return l * qloss(outputs,target) + (1- l) * score(outputs,target)

def train_loop(train_loader, model, loss_fn, device, optimizer, lr_scheduler=None):
    model.train()
    losses = list()
    metrics = list()
    for i, (inputs, labels) in enumerate(train_loader):
        inputs = inputs.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        with torch.set_grad_enabled(True):
            outputs = model(inputs)
            metric = score(outputs,labels)

            loss = loss_fn(outputs,labels,0.8)
            metrics.append(metric.cpu().detach().numpy())
            losses.append(loss.cpu().detach().numpy())

            loss.backward()

            optimizer.step()
            # Note: the scheduler is stepped once per batch, not once per epoch
            if lr_scheduler is not None:
                lr_scheduler.step()

    return losses,metrics

def valid_loop(valid_loader, model, loss_fn, device):
    model.eval()
    losses = list()
    metrics = list()
    with torch.no_grad():  # no gradients are needed for validation
        for i, (inputs, labels) in enumerate(valid_loader):
            inputs = inputs.to(device)
            labels = labels.to(device)

            outputs = model(inputs)
            metric = score(outputs,labels)

            loss = loss_fn(outputs,labels,0.8)
            metrics.append(metric.cpu().numpy())
            losses.append(loss.cpu().numpy())

    return losses,metrics
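
To see what `qloss` and `score` compute, here is a scalar version in plain Python for a single prediction. The function names `pinball_loss` and `laplace_score` are hypothetical, and the numbers are made up. The clipping values 70 and 1000 are the ones used in `score` above.

```python
import math


def pinball_loss(y_true, y_pred, quantile):
    """Quantile (pinball) loss for one prediction of one quantile."""
    error = y_true - y_pred
    return max(quantile * error, (quantile - 1) * error)


def laplace_score(y_true, y_pred, confidence, min_confidence=70.0, max_delta=1000.0):
    """Scalar version of `score`: lower is better."""
    sigma = max(confidence, min_confidence)       # confidence is clipped from below
    delta = min(abs(y_true - y_pred), max_delta)  # error is clipped from above
    return math.sqrt(2) * delta / sigma + math.log(math.sqrt(2) * sigma)


# Under-predicting the 0.25 quantile costs less than over-predicting it.
print(pinball_loss(2100.0, 2090.0, 0.25))
print(pinball_loss(2100.0, 2110.0, 0.25))

# A wider confidence interval lowers the error term but raises the log term.
print(laplace_score(2100.0, 2000.0, confidence=100.0))
print(laplace_score(2100.0, 2000.0, confidence=400.0))
```

The asymmetry of the pinball loss is what pushes each of the three outputs toward its own quantile.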

In [69]:
kfold = KFold(3,shuffle=True,random_state=42)
#kfold
for k , (train_idx,valid_idx) in enumerate(kfold.split(train)):
    batch_size = 64
    epochs = 20
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"{device} is used")
    x_train,x_valid,y_train,y_valid = train[train_idx,:],train[valid_idx,:],target[train_idx],target[valid_idx]
    n = x_train.shape[1]
    model = Model(n)
    model.to(device)
    lr = 0.1
    optimizer = optim.Adam(model.parameters(),lr=lr)
    lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

    train_tensor = torch.tensor(x_train,dtype=torch.float)
    y_train_tensor = torch.tensor(y_train,dtype=torch.float)

    train_ds = TensorDataset(train_tensor,y_train_tensor)
    train_dl = DataLoader(train_ds,
                          batch_size = batch_size,
                          num_workers=4,
                          shuffle=True
                          )

    valid_tensor = torch.tensor(x_valid,dtype=torch.float)
    y_valid_tensor = torch.tensor(y_valid,dtype=torch.float)

    valid_ds = TensorDataset(valid_tensor,y_valid_tensor)
    valid_dl = DataLoader(valid_ds,
                          batch_size = batch_size,
                          num_workers=4,
                          shuffle=False
                          )

    print(f"Fold {k}")
    for i in range(epochs):
        losses,metrics = train_loop(train_dl,model,loss_fn,device,optimizer,lr_scheduler)
        valid_losses,valid_metrics = valid_loop(valid_dl,model,loss_fn,device)
        if (i)%5==0:
            print(f"epoch:{i} Training | loss:{np.mean(losses)} score: {np.mean(metrics)}| \n Validation | loss:{np.mean(valid_losses)} score:{np.mean(valid_metrics)}|")
    torch.save(model.state_dict(),f'model{k}.bin')

cuda is used
Fold 0
epoch:0 Training | loss:1285.89990234375 score: 10.816282272338867| 
 Validation | loss:331.4751281738281 score:7.497641086578369|
epoch:5 Training | loss:134.091552734375 score: 6.514596939086914| 
 Validation | loss:135.57754516601562 score:6.515960693359375|
epoch:10 Training | loss:126.68991088867188 score: 6.451584815979004| 
 Validation | loss:131.455078125 score:6.479407787322998|
epoch:15 Training | loss:126.3822021484375 score: 6.449957847595215| 
 Validation | loss:131.31826782226562 score:6.478060722351074|
cuda is used
Fold 1
epoch:0 Training | loss:1388.9141845703125 score: 10.927675247192383| 
 Validation | loss:698.1372680664062 score:8.32965087890625|
epoch:5 Training | loss:138.52481079101562 score: 6.59238338470459| 
 Validation | loss:138.4564208984375 score:6.589868545532227|
epoch:10 Training | loss:133.8900146484375 score: 6.552117347717285| 
 Validation | loss:135.2467498779297 score:6.561423301696777|
epoch:15 Training | loss:133.549087524414
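
One detail of the training cell above is worth spelling out: `lr_scheduler.step()` is called inside the batch loop, so `StepLR(optimizer, step_size=20, gamma=0.5)` halves the learning rate every 20 batches rather than every 20 epochs. The sketch below reproduces the closed-form step-decay rule in plain Python; the helper name `step_decay_lr` is hypothetical.

```python
def step_decay_lr(initial_lr, step_count, step_size, gamma):
    """Learning rate after `step_count` scheduler steps under step decay."""
    return initial_lr * gamma ** (step_count // step_size)


# Starting from lr = 0.1, as in the training cell.
for step_count in (0, 19, 20, 40, 100):
    print(step_count, step_decay_lr(0.1, step_count, step_size=20, gamma=0.5))
```

This rapid decay probably explains why training stays stable despite the high starting learning rate of 0.1.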

Now that we have 3 models (one per fold), we can ensemble them to predict FVC.
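
A simple way to ensemble is to average the outputs of the fold models. The sketch below uses NumPy and made-up prediction arrays in place of real model outputs; each row holds the three outputs of the network in the same order as in `score` (lower bound, FVC prediction, upper bound). Averaging is only one possible choice of ensemble.

```python
import numpy as np

# Hypothetical predictions of the 3 fold models for one sample.
fold_predictions = [
    np.array([[2000.0, 2300.0, 2600.0]]),
    np.array([[2100.0, 2400.0, 2700.0]]),
    np.array([[2200.0, 2500.0, 2800.0]]),
]

# Average across folds: the result has shape (n_samples, 3).
ensemble = np.mean(np.stack(fold_predictions), axis=0)

fvc = ensemble[:, 1]                        # middle output is the FVC prediction
confidence = ensemble[:, 2] - ensemble[:, 0]  # spread between the outer outputs

print(fvc, confidence)
```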