# UnpackAI Library Development Plan
> What library should we be?
This is mostly a branding proposal, aimed at people from various backgrounds who are going to play with AI.
* That means we're going to talk about how people view this library: what comes to mind when they think of ```pip install -Uqq unpackai```, the same way that if I've had dandruff recently, my mind jumps straight to Head & Shoulders.
* For ML, the **jump** currently looks something like the following. This is not thorough market research, just quick examples from a deep learning practitioner:
    * Try free-form structures quickly, run experiments: PyTorch
    * Go to production, run models on edge devices: TensorFlow
    * Play with GPU-accelerated tensor calculation: JAX
    * Play with TF, but in a simpler, layer-based sense: Keras
    * Transformers in clean code: Hugging Face
    * Visualize things with interactive features: Plotly
    * Deploy a model prototype: Streamlit
* Surely you think I failed to mention ```fastai```. This is where the **branding goes wrong**: the fastai library is bound tightly to the education. It's considered a good creation alongside its famous course, but once the education is over, its product features show many limitations: docs that are too brief, no support for multi-device training, and very few callbacks that go beyond Jeremy H's own teaching.
* Most important of all, ```fastai``` isn't enjoyable to use: **it just packs in the many things mentioned in the course**.

## What we shouldn't be
I know the course was life-changing for me, and I feel very grateful. But let's not be their library.

### The pipeline wrapping plan
It all started from a notebook, much like the template notebooks we have for the course: a notebook that handles the data processing, model building, and interpretation for a specific DL task.

Then came the packaging part: we wrap **dozens of lines of code**, which scare our kind students, into simple functions or classes.

The wrapped functions are simple to use and to look at, and mostly run in one line. So friendly to our innocent students.

This is what a Python library is about, right? Wrap things into functions, which can be further wrapped into even fewer lines.

There's nothing wrong with this approach at first. Some DL tasks, if need be, can be shrunk into **fewer than 10 lines of code.**
* The 1st line loads the data,
* the 2nd line sets how to transform the data,
* the 3rd line builds/loads the model,
* the 4th line trains the model,
* the 5th line interprets the model in various ways.

Well, the above does look like a decent **structure** to start with. Then we lay out the tasks, different contributors take different tasks that can be developed in parallel, and we can have the agile/scrum/kanban fun of tracking our progress!

Even if we do this, we could build a useful product, no less.

#### Bad side about pipeline wrapping plan
So, so many libraries are doing the same, even ones from awesome people. They usually end up as one of the following:
* A mess of functions: many good functions among them, but still a mess. It ends up a branding disaster. (**There is no way to answer, in a slogan: what can your library do?**)
* A model zoo for a specific domain.
* Wrapping things up means less and less involvement from the user. The user will spend very little time playing with the functions, and each function usually achieves a very specific task. I actually believe there is an equilibrium like:
$\large{\text{UserPlayHours} = a \times \text{TaskTransferability}}$

## Alternative approach

The salvation plan is somewhat simpler in how we perceive the library:
* A library that allows you to experiment with AI/DL for various tasks

**BUT!!!**
* Many modules within the pipeline should be **choosable** from a dropdown list or checkbox.
* The **level of detail** we let them play with and choose is the **level of difficulty** we want them to enjoy.

### What is level of detail?
Level of detail is the level of fuss we want the user to focus on. This is exactly the part the fastai library got **WRONG**, and it explains most of our struggle so far:
* It offers smooth, easy pipelines, even for newbies and business people.
* Any amount of reconfiguration is usually way too complicated for such an audience.
* There is a **GAP** between the two points above, hence no room for playing.

#### Keras Example 
I started my AI journey with Keras, and I loved Keras at that time, because:
* Keras plays with **layers** (e.g. Linear, Convolution). Its greatest strength is abstracting away the details beneath this level and letting users play with layers.
* I spent lots of time having fun playing with layers.
* Aside from the cases where I had to redesign a layer, I could build almost any kind of model mentioned in any DL paper ($\text{UserPlayHours} = a \times \text{TaskTransferability}$).

#### PyTorch Lightning example
Well, I moved on to the career team. There I have to deal with the layer level, and I have to deal with different data/forward pipelines. PyTorch Lightning (PL) is a good library because:
* It lets me play with the things I mentioned, but saves my energy on things like looping, logging, multi-device training details, etc.
* If you look at a training notebook built with PL, you'll see very few lines of training boilerplate.
* You'll find a lot of lines on the specifics you intend to be different.

> The brand image of each example is simple:
* Keras: play with TensorFlow through the concept of layers
* PyTorch Lightning: write less boilerplate code

#### Unpackai Example
For our lib, I intend for users to focus on exactly the same range of things we want people to learn:
* choose the columns they intend to use, and in what way
* choose the data transformations
* choose the loss and the model structure to use (not keras.layers, not nn.Module)
* hit run
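
As a rough sketch of what that range of choices could look like in code, the block below models each choice as a name picked from a small menu, the way a dropdown would. Everything in it (`TRANSFORMS`, `LOSSES`, `Experiment`, and the toy numeric rows) is hypothetical and only illustrates the intended level of detail; it is not an actual unpackai API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical menus: each key is what a dropdown or checkbox would show.
TRANSFORMS: Dict[str, Callable[[float], float]] = {
    "identity": lambda x: x,
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
}
LOSSES: Dict[str, Callable[[float, float], float]] = {
    "l1": lambda pred, target: abs(pred - target),
    "l2": lambda pred, target: (pred - target) ** 2,
}


@dataclass
class Experiment:
    """Everything the user is allowed to choose, and nothing else."""
    x_col: str
    y_col: str
    transforms: List[str] = field(default_factory=list)
    loss: str = "l2"

    def __post_init__(self):
        # Reject anything that is not on the menu, as a dropdown would.
        for name in self.transforms:
            if name not in TRANSFORMS:
                raise ValueError(f"unknown transform: {name}")
        if self.loss not in LOSSES:
            raise ValueError(f"unknown loss: {self.loss}")

    def run(self, rows: List[dict]) -> float:
        """Apply the chosen transforms to x, then average the chosen loss."""
        total = 0.0
        for row in rows:
            value = row[self.x_col]
            for name in self.transforms:
                value = TRANSFORMS[name](value)
            total += LOSSES[self.loss](value, row[self.y_col])
        return total / len(rows)


# Choose columns, choose transformations, choose the loss, hit run.
rows = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 5.0}]
exp = Experiment(x_col="x", y_col="y", transforms=["double"], loss="l1")
mean_loss = exp.run(rows)
```

The user never touches a loop or a layer here. The only decisions exposed are the ones on the menus, which is the level of detail this proposal argues for.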

## Demo of such an example

In [70]:
from ipywidgets import interact, interact_manual
from forgebox.imports import *
from forgebox.category import Category

In [14]:
HOME = Path(os.environ['HOME'])

Let's skip the data download here. It's just a download, and we're not going to reinvent the brilliant stuff that already exists around downloading.

In [None]:
BEAR_DATASET = HOME/"Downloads"/"bear_dataset"

### Step 1: Everything starts with a DataFrame

For fastai, everything starts from a list, an ItemList to be specific. **ImageList** and **TextList** are [**ItemList**](https://fastai1.fast.ai/tutorial.itemlist.html)s with some slightly enhanced features. ```[🧂, 🏓, 🍷, 🐻]```

For clarity in education, or for simplicity as the ultimate form of beauty, we use a [**DataFrame**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) as the starting point: an ItemList in table format. This way, every dataset has the same starting point, even tabular data.

In [26]:
def df_creator_image_folder(path: Path) -> pd.DataFrame:
    """
    Create a DataFrame with one row per image file found under `path`
    """
    path = Path(path)
    # collect image files recursively, one glob per extension spelling
    files = []
    for pattern in ("*.jpg", "*.JPG", "*.jpeg", "*.JPEG", "*.png", "*.PNG"):
        files.extend(path.rglob(pattern))
    # shuffle the rows, then renumber the index from 0
    return pd.DataFrame({"path":files}).sample(frac=1.).reset_index(drop=True)
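
A quick way to sanity-check this kind of folder-to-DataFrame step is to run it on a throwaway directory. The sketch below is a variation, not the function above: `image_folder_to_df` and `IMAGE_SUFFIXES` are names made up for the example, it matches extensions case-insensitively in a single pass, and it sorts instead of shuffling so the result is predictable.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Assumed set of image extensions, compared in lower case.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def image_folder_to_df(path) -> pd.DataFrame:
    """Collect image paths under `path` into a one-column DataFrame."""
    files = sorted(
        p for p in Path(path).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    return pd.DataFrame({"path": files})


# Build a tiny fake dataset: two images and one file that should be ignored.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ["black/a.jpg", "grizzly/b.PNG", "teddys/notes.txt"]:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    demo_df = image_folder_to_df(root)
    found = [p.name for p in demo_df["path"]]
```

Only the two image files end up in `demo_df`; the text file is skipped because its suffix is not in the assumed set.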

In [27]:
bear_df = df_creator_image_folder(BEAR_DATASET)

In [25]:
bear_df

| | path |
|---|---|
| 0 | /Users/xiaochen.zhang/Downloads/bear_dataset/g... |
| 1 | /Users/xiaochen.zhang/Downloads/bear_dataset/g... |
| 2 | /Users/xiaochen.zhang/Downloads/bear_dataset/t... |
| 3 | /Users/xiaochen.zhang/Downloads/bear_dataset/t... |
| 4 | /Users/xiaochen.zhang/Downloads/bear_dataset/b... |
| ... | ... |
| 517 | /Users/xiaochen.zhang/Downloads/bear_dataset/b... |
| 518 | /Users/xiaochen.zhang/Downloads/bear_dataset/g... |
| 519 | /Users/xiaochen.zhang/Downloads/bear_dataset/b... |
| 520 | /Users/xiaochen.zhang/Downloads/bear_dataset/b... |


### Enrich columns (feature transformation, label extraction)
After this step, there will only be **MORE** columns ➕

In [67]:
from typing import List, Dict, Callable, Any, Tuple
from torchvision import transforms as tfm
from PIL import Image

In [72]:
class Enrich:
    prefer = None
    def __call__(self,row):
        return row
    
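class EnrichImage(Enrich):
    """
    Create Image column from image path column
    """
    prefer = "QuantifyImage"  # matches the QuantifyImage key in QUANTIFY
    typing = Image.Image  # the PIL image class (`Image` alone is the module)
    def __init__(
        self,
        path_col: str = "path",
    ):
        self.path_col = path_col

    def __repr__(self):
        return "[Image]"

    def __call__(self, row):
        # open the file only when called on a row, and force 3 channels
        img = Image.open(row[self.path_col]).convert('RGB')
        return img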

def enrich_parent_as_label(
    df: pd.DataFrame,
    path_col:str = "path",
    parent_col:str="parent"
) -> pd.DataFrame:
    """
    Use parent folder name as label
    """
    df[parent_col] = df[path_col].apply(lambda i:Path(i).parent.name)
    return df
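
To see the label extraction in isolation, here is a small sketch on made-up paths. `parent_as_label` is a hypothetical variant of the function above that works on a copy, so the input DataFrame is left unchanged; whether the library should mutate in place or copy is an open design choice, not something this proposal settles.

```python
from pathlib import Path

import pandas as pd


def parent_as_label(
    df: pd.DataFrame,
    path_col: str = "path",
    label_col: str = "parent",
) -> pd.DataFrame:
    """Return a copy of `df` with the parent folder name as a new column."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out[label_col] = out[path_col].map(lambda p: Path(p).parent.name)
    return out


# Made-up relative paths, one per class folder.
paths_df = pd.DataFrame({"path": [
    "bear_dataset/black/001.jpg",
    "bear_dataset/grizzly/002.jpg",
    "bear_dataset/teddys/003.png",
]})
labeled_df = parent_as_label(paths_df)
labels = labeled_df["parent"].tolist()
```

The enriched frame has one more column than the input, which is the only thing an enrich step is supposed to do.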

In [75]:
bear_df['img'] = EnrichImage('path')

In [76]:
bear_df = enrich_parent_as_label(bear_df,)
bear_df

| | path | parent | img |
|---|---|---|---|
| 0 | /Users/xiaochen.zhang/Downloads/bear_dataset/b... | black | [Image] |
| 1 | /Users/xiaochen.zhang/Downloads/bear_dataset/b... | black | [Image] |
| 2 | /Users/xiaochen.zhang/Downloads/bear_dataset/g... | grizzly | [Image] |
| 3 | /Users/xiaochen.zhang/Downloads/bear_dataset/t... | teddys | [Image] |
| 4 | /Users/xiaochen.zhang/Downloads/bear_dataset/b... | black | [Image] |
| ... | ... | ... | ... |
| 517 | /Users/xiaochen.zhang/Downloads/bear_dataset/b... | black | [Image] |
| 518 | /Users/xiaochen.zhang/Downloads/bear_dataset/b... | black | [Image] |
| 519 | /Users/xiaochen.zhang/Downloads/bear_dataset/b... | black | [Image] |
| 520 | /Users/xiaochen.zhang/Downloads/bear_dataset/g... | grizzly | [Image] |


### Quantify: Choose columns as X and Y, turn them into numbers

In [80]:
class Quantify:
    """
    Transform list of things to torch tensor
    """
    is_quantify = True

    def __call__(self, list_of_items):
        return torch.Tensor(list_of_items)


class QuantifyImage(Quantify):
    """
    Transform PIL.Image to tensor
    """

    def __init__(
        self,
        image_size: Tuple[int, int] = (224, 224),
        mean_=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ):
        self.transform = tfm.Compose([
            tfm.Resize(image_size),
            tfm.ToTensor(),
            tfm.Normalize(mean=mean_, std=std),
        ])

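    def __call__(self, list_of_image):
        # the transform handles one PIL image at a time, so stack the results
        return torch.stack([self.transform(img) for img in list_of_image])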


class QuantifyCategory(Quantify):
    """
    Transform single categorical data to index numbers in pytorch tensors
    """

    def __init__(self, col_name: str, min_frequency: int = 5):
        self.col_name = col_name
        self.min_frequency = min_frequency

    def summarize_category(self, df):
        df = pd.DataFrame(df)
        # count how often each category appears, using plain pandas
        value_counts = df[self.col_name].value_counts()
        categories = np.array(
            list(value_counts.index[value_counts.values > self.min_frequency]))
        self.category = Category(arr=categories, pad_mst=True)
        self.num_category = len(self.category)

    def __call__(self, list_of_strings):
        return torch.LongTensor(self.category.c2i[np.array(list_of_strings)])


class QuantifyMultiCategory(Quantify):
    """
    Turn Multi-categorical data to n_hot encoding numbers in pytorch tensors
    """
    def __init__(self, col_name: str):
        self.col_name = col_name
        
QUANTIFY = dict(
    Quantify=Quantify,
    QuantifyImage=QuantifyImage,
    QuantifyCategory=QuantifyCategory,
    QuantifyMultiCategory=QuantifyMultiCategory
)
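
To make the category step concrete without depending on `forgebox`, the sketch below shows one way a frequency-filtered category index could work. `SimpleCategoryIndex` and the `"[MISSING]"` placeholder are assumptions made for this example, standing in for the `Category` object used above rather than reproducing it. The strict `>` comparison against `min_frequency` mirrors the cell above.

```python
import numpy as np
import pandas as pd


class SimpleCategoryIndex:
    """Stand-in for a category-to-index mapping with one catch-all slot."""

    def __init__(self, values, min_frequency: int = 5,
                 missing: str = "[MISSING]"):
        counts = pd.Series(list(values)).value_counts()
        # Keep only categories seen more than `min_frequency` times.
        kept = sorted(counts.index[counts.values > min_frequency])
        # Index 0 is reserved for rare or unseen categories.
        names = [missing] + kept
        self.i2c = np.array(names)
        self.c2i = {name: i for i, name in enumerate(names)}

    def __len__(self) -> int:
        return len(self.i2c)

    def encode(self, values) -> np.ndarray:
        """Map each string to its index, falling back to slot 0."""
        return np.array([self.c2i.get(v, 0) for v in values], dtype=np.int64)


# "teddys" is too rare to keep, and "panda" was never seen at all.
labels = ["black"] * 6 + ["grizzly"] * 7 + ["teddys"] * 2
index = SimpleCategoryIndex(labels, min_frequency=5)
encoded = index.encode(["grizzly", "black", "teddys", "panda"])
```

Both the rare label and the unseen one land in slot 0, so the output size of a classifier built on top of this would be `len(index)`, that is, the kept categories plus the catch-all.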

### Choose your model, loss

### Training