<a href="https://colab.research.google.com/github/Ankur-singh/moa_kaggle/blob/main/session_moa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

Read the Kaggle API docs [here](https://github.com/Kaggle/kaggle-api).

First, we will update the `kaggle` package:

In [1]:
!pip install -U -q kaggle

Next, we have to upload the `kaggle.json` file, point the environment to the directory where `kaggle.json` is saved, and finally restrict the file's permissions.

You can do all of it by running the cell below.

In [11]:
import os
from pathlib import Path

kpath = Path('/content')
os.environ['KAGGLE_CONFIG_DIR']= str(kpath)
(kpath/'kaggle.json').chmod(0o600)  # octal mode: read/write for the owner only
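
File modes are octal, so owner-only read/write is written `0o600`; a decimal `600` sets a different mode. Here is a minimal sketch of the same setup that also fails early when the token is missing. `secure_kaggle_config` is a hypothetical helper name, and the demo runs on a throwaway directory instead of `/content`.

```python
import os
import stat
import tempfile
from pathlib import Path

def secure_kaggle_config(config_dir):
    """Point the Kaggle CLI at config_dir and restrict kaggle.json to its owner."""
    config_dir = Path(config_dir)
    token = config_dir / 'kaggle.json'
    if not token.exists():
        raise FileNotFoundError(f'Upload kaggle.json to {config_dir} first')
    os.environ['KAGGLE_CONFIG_DIR'] = str(config_dir)
    token.chmod(0o600)  # octal: read/write for the owner only
    return stat.S_IMODE(token.stat().st_mode)

# Demo on a temporary directory with a placeholder token file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'kaggle.json').write_text('{}')
    mode = secure_kaggle_config(tmp)
    print(oct(mode))
```

In the notebook you would call `secure_kaggle_config('/content')` after uploading the real file.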

Now that everything is set up, it's time to download the dataset from Kaggle.

**Note:** I have added two extra arguments:
- `-p`: path (where data is to be downloaded)
- `-q`: quiet

In [18]:
!kaggle competitions download -c lish-moa -p data -q
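
If the download arrives as a zip archive (this depends on the CLI version, so treat it as an assumption), the CSV files have to be extracted before pandas can read them by name. A small sketch using only the standard library; `extract_archives` is a hypothetical helper and the demo builds its own throwaway archive.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_archives(data_dir):
    """Extract every .zip in data_dir in place and return the extracted names."""
    data_dir = Path(data_dir)
    extracted = []
    for archive in sorted(data_dir.glob('*.zip')):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(data_dir)
            extracted.extend(zf.namelist())
    return extracted

# Demo: create a tiny archive, then extract it.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / 'demo.zip'
    with zipfile.ZipFile(demo_zip, 'w') as zf:
        zf.writestr('train_features.csv', 'sig_id,g-0\n')
    names = extract_archives(tmp)
    csv_exists = (Path(tmp) / 'train_features.csv').exists()
    print(names, csv_exists)
```

In the notebook the call would be `extract_archives('data')`.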



It's a good time to mount our Google Drive and copy everything over:

In [14]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [19]:
!mkdir -p /content/drive/My\ Drive/moa_kaggle
!cp -r data /content/drive/My\ Drive/moa_kaggle/

This is all we need to download the dataset from Kaggle and save it to our Google Drive!

This was a lot of work, I agree. But you will have to do it only once. 

## Data 

In [23]:
import os
import random
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn import preprocessing
from sklearn.metrics import log_loss
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.preprocessing import QuantileTransformer

import warnings
warnings.filterwarnings('ignore')

In [25]:
def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    
seed_everything(seed=42)
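
A quick way to convince yourself that seeding works is to seed twice and compare the draws. The sketch below uses only the standard library (so it leaves out the NumPy call), and `seed_stdlib` is a hypothetical name chosen so it does not overwrite `seed_everything`. Note that `PYTHONHASHSEED` is read when the interpreter starts, so setting it here only affects processes launched afterwards.

```python
import os
import random

def seed_stdlib(seed=42):
    """Seed Python's random module; the env var only affects child processes."""
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

seed_stdlib(42)
first = [random.random() for _ in range(3)]

seed_stdlib(42)
second = [random.random() for _ in range(3)]

print(first == second)  # same seed, same sequence
```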

Set the `path` variable. Learning the [pathlib](https://realpython.com/python-pathlib/) library is a great investment of your time.

In [26]:
path = Path('/content/drive/My Drive/moa_kaggle/data')
path

PosixPath('/content/drive/My Drive/moa_kaggle/data')

In [16]:
train_features       = pd.read_csv(path/'train_features.csv')
train_targets_scored = pd.read_csv(path/'train_targets_scored.csv')
test_features        = pd.read_csv(path/'test_features.csv')

Now that we have loaded all our data, let's find a good starter notebook. For this competition, we will be using [this notebook](https://www.kaggle.com/kushal1506/moa-pytorch-0-01859-rankgauss-pca-nn).

In [17]:
GENES = [col for col in train_features.columns if col.startswith('g-')]
CELLS = [col for col in train_features.columns if col.startswith('c-')]
len(GENES), len(CELLS)

In [35]:
# RankGauss: map every feature to a normal distribution via its quantiles
def rankgauss(train, test, col):
    transformer = QuantileTransformer(n_quantiles=100, random_state=0, output_distribution="normal")
    # Fit on train only, then apply the same mapping to test
    train[col] = transformer.fit_transform(train[col].values)
    test [col] = transformer.transform    (test [col].values)
    return train, test

col = GENES + CELLS
train_features, test_features = rankgauss(train_features, test_features, col)
train_features.shape, test_features.shape

((23814, 876), (3982, 876))
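
To see what a rank-Gauss transform does, here is a minimal standard-library sketch of the idea: replace each value by the normal score of its rank. It is only an illustration. `QuantileTransformer` works from estimated quantiles with interpolation, so its numbers will differ, and `rank_gauss` below is a hypothetical helper that breaks ties by position.

```python
from statistics import NormalDist

def rank_gauss(values):
    """Map values to standard-normal scores based on their ranks."""
    n = len(values)
    # Indices of the values in ascending order (ties keep their input order).
    order = sorted(range(n), key=lambda i: values[i])
    normal = NormalDist()
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n keeps the quantile strictly inside (0, 1).
        scores[idx] = normal.inv_cdf((rank + 0.5) / n)
    return scores

skewed = [1, 2, 3, 4, 1000]  # one extreme outlier
scores = rank_gauss(skewed)
print([round(s, 2) for s in scores])
```

The outlier `1000` ends up about as far from the centre as `1` does on the other side, which is why the transform is popular for heavy-tailed features.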

In [37]:
# PCA
def pca(train, test, col, n_comp, prefix):
    data = pd.concat([pd.DataFrame(train[col]), pd.DataFrame(test[col])])
    data2 = (PCA(n_components=n_comp, random_state=42).fit_transform(data))

    train2 = data2[:train.shape[0]] 
    test2 = data2[-test.shape[0]:]

    train2 = pd.DataFrame(train2, columns=[f'pca_{prefix}-{i}' for i in range(n_comp)])
    test2 = pd.DataFrame(test2, columns=[f'pca_{prefix}-{i}' for i in range(n_comp)])

    # drop_cols = [f'c-{i}' for i in range(n_comp,len(CELLS))]
    train = pd.concat((train, train2), axis=1)
    test = pd.concat((test, test2), axis=1)
    return train, test

train_features, test_features = pca(train_features, test_features, GENES, 600, 'G')
train_features, test_features = pca(train_features, test_features, CELLS,  50, 'C')
train_features.shape, test_features.shape

((23814, 1526), (3982, 1526))

As you can see, we keep repeating the same shape check. Let's write a function for it.

In [39]:
def sanity_check(): return train_features.shape, test_features.shape

In [40]:
sanity_check()

((23814, 1526), (3982, 1526))

Great, now we have a handy little function to check the shapes.

## Writing reproducible code

#### 1. Make functions
Generally speaking, every 2-3 lines of code that do a **single task** should be placed inside a function. Having a consistent naming scheme for your functions is very important.

#### 2. Combine multiple functions into one
Functions that you call one after the other should be placed inside another function. These second level functions should perform a second level task. By second level task, I mean a single idea, like creating folds or cleaning data, that has multiple steps in it.
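
A tiny sketch of the first two rules, using plain lists so it runs anywhere. All three function names are hypothetical and are not taken from the repo: two first level functions each do one task, and a second level function expresses the whole "cleaning data" idea as one call.

```python
def drop_missing(rows):
    """First level: remove rows that contain a missing value."""
    return [row for row in rows if None not in row]

def min_max_scale(rows):
    """First level: rescale every column to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    # Fall back to 1 for constant columns to avoid dividing by zero.
    spans = [(max(c) - min(c)) or 1 for c in columns]
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)]
            for row in rows]

def clean_data(rows):
    """Second level: the whole cleaning step as a single call."""
    return min_max_scale(drop_missing(rows))

raw = [[1, 10], [None, 20], [3, 30]]
cleaned = clean_data(raw)
print(cleaned)
```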

#### 3. Make python scripts
A Python script is the ultimate form of reproducible code (for me)! Copy and paste all your functions (both first level and second level) into a file.

Use `if __name__ == "__main__":` whenever relevant. It's a great way to test your code. It can also act as documentation, showcasing:
- what the inputs are,
- how to use the functions, and
- what the outputs are.
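
For instance, a script could end with a guard like the one below. This is only a sketch: `fold_ids` is a hypothetical function, not one from the repo. The guarded block runs when the file is executed directly and is skipped when the file is imported, so it works as both a smoke test and a usage example.

```python
def fold_ids(n_rows, n_folds=5):
    """Assign each row index to a fold in round-robin order."""
    return [i % n_folds for i in range(n_rows)]

if __name__ == "__main__":
    # Input: 7 rows and 3 folds. Output: one fold id per row.
    ids = fold_ids(7, n_folds=3)
    print(ids)
    assert ids == [0, 1, 2, 0, 1, 2, 0]
```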

**Note:**

- There is no best way! Everyone has their own coding style and taste. So, you should not blindly follow these rules. Experiment a lot and see it for yourself. 

- Another important thing: to become good at something, you will have to invest a lot of time. So, be patient! You won't become a master overnight.

Here is an example of all the principles that I talked about! I know it's not perfect. It could have been much better. But for now, it should give you a pretty good idea.

In [41]:
!git clone https://github.com/Ankur-singh/moa_kaggle
%cd moa_kaggle

Cloning into 'moa_kaggle'...
remote: Enumerating objects: 37, done.
remote: Counting objects: 100% (37/37), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 37 (delta 15), reused 24 (delta 9), pack-reused 0
Unpacking objects: 100% (37/37), done.
/content/moa_kaggle


## Uploading your code to GitHub

Any piece of code that is frozen (you are sure that it works and you don't change it too often) should be uploaded to GitHub.

Having your code organised as Python scripts can be a huge time saver. Here are the benefits:

- You can experiment much faster. In Kaggle competitions, your chances of winning grow with the number of iterations you manage to run.

- It's really good for reproducibility. Every time you start a new session, you can simply clone the repo and be sure that you have all the latest changes across all your notebooks (be it a Kaggle kernel, Colab, or a local notebook).

## Further Reading

Here are some good resources to get started with:
- https://realpython.com/python-pathlib/
- https://www.kaggle.com/hiramcho/moa-tabnet-with-pca-rank-gauss