# Classification tutorial

In this tutorial, we apply machine learning to the [Titanic dataset](https://www.kaggle.com/c/titanic).

In daft, we use `read_csv` to load the CSV file.  
We use `show` to display the data, since daft is lazy and does not run any calculations until a result is actually needed.

In [1]:
import daft 

df = daft.DataFrame.read_csv('~/development/datasets/titanic.csv')
df.limit(2).show()

2023-01-30 15:14:27.066 | INFO     | daft.context:runner:77 - Using PyRunner


pclass INTEGER,survived INTEGER,name STRING,sex STRING,age FLOAT,sibsp INTEGER,parch INTEGER,ticket STRING,fare FLOAT,cabin STRING,embarked STRING,boat STRING,body FLOAT,home_dest STRING
1,0,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.338,B5,S,2,,"St Louis, MO"
1,0,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"


<div class="alert alert-warning">
    If you look closely at the log line above, you can see that, unless instructed otherwise, daft uses the <b>PyRunner</b> to run the calculations.
<br>
    This is great for quick experimentation and for many standard use cases where the data fits in memory.
</div>
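
To make the idea of laziness concrete, here is a minimal sketch in plain Python. It is not how daft is implemented: `LazyRows` and its `where`, `limit` and `collect` methods are hypothetical names chosen for illustration. The only point is that building a query just records steps, and nothing runs until a result is requested.

```python
from itertools import islice


class LazyRows:
    """Toy stand-in for a lazy dataframe: records steps, runs them on demand."""

    def __init__(self, rows, steps=()):
        self._rows = rows            # source data (a list of dicts)
        self._steps = tuple(steps)   # pending operations, not yet executed

    def where(self, predicate):
        # Return a new object with one more pending step; nothing is computed here.
        return LazyRows(self._rows, self._steps + (("where", predicate),))

    def limit(self, n):
        return LazyRows(self._rows, self._steps + (("limit", n),))

    def collect(self):
        # Only now are the recorded steps applied, in order.
        rows = iter(self._rows)
        for kind, arg in self._steps:
            if kind == "where":
                rows = filter(arg, rows)
            elif kind == "limit":
                rows = islice(rows, arg)
        return list(rows)


# Made-up rows, for illustration only.
passengers = [
    {"name": "A", "age": 29.0},
    {"name": "B", "age": 0.9},
    {"name": "C", "age": 41.0},
]

evaluated = []  # records which rows the predicate has actually looked at


def is_adult(row):
    evaluated.append(row["name"])
    return row["age"] > 15


plan = LazyRows(passengers).where(is_adult).limit(1)
ran_before = list(evaluated)   # still empty: building the plan ran nothing
result = plan.collect()        # the work happens here
print(ran_before, result)
```

In the same spirit, `show` is the call that triggers execution in the cell above.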

### First things first: we split the data into train and test sets
In daft, we can create lazy functions to use as columns, which can save us memory.     
To learn more about extending daft with functions, have a look at the [docs](https://getdaft.io/docs/learn/10-min.html#user-defined-functions).

Here we will use a `uniform` function to split the data.     
By seeding numpy's [default_rng](https://numpy.org/doc/stable/reference/random/generator.html), we make sure the function returns the same values every time it is evaluated. This matters because `train_test_split` evaluates `uniform` twice, once per filter: with identical draws, each row lands in exactly one of the train and test sets.

In [2]:
from daft import polars_udf
import polars as pl
import numpy as np

@polars_udf(return_type=float)
def uniform(name: pl.Series, seed=0):
    return np.random.default_rng(seed).uniform(0, 1, len(name))

def train_test_split(df, fraction=0.8):
    return df.where(uniform(df['name']) <= fraction), df.where(uniform(df['name']) > fraction)

train, test = train_test_split(df)
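
The effect of the fixed seed can be checked with numpy alone. The sketch below uses a hypothetical helper, `uniform_draws`, that mirrors the body of `uniform` without the daft decorator, and assumes both calls see the same number of rows in the same order.

```python
import numpy as np


def uniform_draws(n, seed=0):
    # A fresh generator with a fixed seed yields the same sequence on every call.
    return np.random.default_rng(seed).uniform(0, 1, n)


n_rows, fraction = 10, 0.8

first = uniform_draws(n_rows)    # draws used for the train filter
second = uniform_draws(n_rows)   # draws used for the test filter

train_mask = first <= fraction
test_mask = second > fraction

same_draws = np.array_equal(first, second)
# XOR is True where a row belongs to exactly one of the two sets.
is_partition = bool(np.all(train_mask ^ test_mask))

print(same_draws, is_partition)
```

Without a fixed seed, the two calls would produce unrelated draws, and a row could end up in both sets or in neither.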

## Feature engineering

daft has an expression system, which makes a feature like *family_size* easy to build.

In [3]:
train = train.with_column('family_size', train['parch'] + train['sibsp'] + 1)  # the +1 is for self
train.limit(2).show()

pclass INTEGER,survived INTEGER,name STRING,sex STRING,age FLOAT,sibsp INTEGER,parch INTEGER,ticket STRING,fare FLOAT,cabin STRING,embarked STRING,boat STRING,body FLOAT,home_dest STRING,family_size INTEGER
1,0,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.338,B5,S,2,,"St Louis, MO",1
1,0,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON",4


The average age, computed below, is about 30. Let's use that value to fill the missing ages with the [if_else](https://getdaft.io/docs/learn/user_guides/expressions.html#if-else-pattern) method.

In [4]:
train.mean('age').show()

age FLOAT
29.6177


* It is important that both branches of the *if_else* have the same type; in this case both are floats.

In [6]:
train = train.with_column('age', (train['age'].is_null().if_else(30.0, train['age'])))
train.limit(2).show()

pclass INTEGER,survived INTEGER,name STRING,sex STRING,sibsp INTEGER,parch INTEGER,ticket STRING,fare FLOAT,cabin STRING,embarked STRING,boat STRING,body FLOAT,home_dest STRING,family_size INTEGER,age FLOAT
1,0,"Allen, Miss. Elisabeth Walton",female,0,0,24160,211.338,B5,S,2,,"St Louis, MO",1,29.0
1,0,"Allison, Master. Hudson Trevor",male,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON",4,0.9167
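
For readers more familiar with numpy, the same "fill where missing" pattern can be sketched with `np.where`. This is an analogy rather than daft's API, and the ages below are made-up values; note that both branches are floats, so the result stays a float array.

```python
import numpy as np

# Made-up ages; NaN marks a missing value.
age = np.array([29.0, np.nan, 0.9167, np.nan])

# Where the condition holds take 30.0, otherwise keep the original age.
filled = np.where(np.isnan(age), 30.0, age)

print(filled.dtype, filled)
```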


If we need a more complicated *apply*, we can use [polars](https://pola-rs.github.io/polars-book/user-guide/introduction.html) inside a UDF.

In [8]:
@polars_udf(return_type=str)
def age_group(sex: pl.Series, age: pl.Series):
    
    def group(x):
        if x['sex'] == 'male':
            if x['age'] <= 15: 
                return 'boy'
            return 'adult male'
        if x['sex'] == 'female':
            if x['age'] <= 15:
                return 'girl'
            return 'adult female'
        return 'other'
    return pl.DataFrame([sex.alias('sex'), age.alias('age')]).select( 
        pl.struct(['sex' ,'age']).apply(group).alias('value'))['value'] 


train = train.with_column('age_group', age_group(train['sex'], train['age']))
train.limit(2).show()

pclass INTEGER,survived INTEGER,name STRING,sex STRING,sibsp INTEGER,parch INTEGER,ticket STRING,fare FLOAT,cabin STRING,embarked STRING,boat STRING,body FLOAT,home_dest STRING,family_size INTEGER,age FLOAT,age_group STRING
1,0,"Allen, Miss. Elisabeth Walton",female,0,0,24160,211.338,B5,S,2,,"St Louis, MO",1,29.0,adult female
1,0,"Allison, Master. Hudson Trevor",male,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON",4,0.9167,boy
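
The row-by-row `apply` above is easy to read, but the same rule can also be written without a Python-level loop. The sketch below is an alternative, not what the tutorial uses: it swaps in `np.select`, and `age_group_vectorized` and `child_max_age` are hypothetical names. It assumes missing ages have already been filled, as we did earlier.

```python
import numpy as np


def age_group_vectorized(sex, age, child_max_age=15):
    """Same grouping rule as `group`, expressed as boolean masks."""
    sex = np.asarray(sex)
    age = np.asarray(age, dtype=float)
    is_child = age <= child_max_age

    # One mask per label; rows matching no mask fall back to the default.
    conditions = [
        (sex == "male") & is_child,
        (sex == "male") & ~is_child,
        (sex == "female") & is_child,
        (sex == "female") & ~is_child,
    ]
    labels = ["boy", "adult male", "girl", "adult female"]
    return np.select(conditions, labels, default="other")


# Made-up inputs covering every branch, including an empty `sex`.
groups = age_group_vectorized(
    ["female", "male", "male", "", "female"],
    [29.0, 0.9167, 40.0, 30.0, 15.0],
)
print(groups)
```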


Similarly, we can also use [numpy](https://numpy.org/doc/stable/) for vectorization.    
In this case we want to bin the family size and set it as a categorical string.


In [9]:
@polars_udf(return_type=str)
def bin_family(size):
    bins = [0, 1, 2, 5, 7, 100, 1000]
    return np.vectorize(str)(np.digitize(size, bins))

train = train.with_column('family_bin',bin_family(train['family_size']))
train.limit(2).show()

pclass INTEGER,survived INTEGER,name STRING,sex STRING,sibsp INTEGER,parch INTEGER,ticket STRING,fare FLOAT,cabin STRING,embarked STRING,boat STRING,body FLOAT,home_dest STRING,family_size INTEGER,age FLOAT,age_group STRING,family_bin STRING
1,0,"Allen, Miss. Elisabeth Walton",female,0,0,24160,211.338,B5,S,2,,"St Louis, MO",1,29.0,adult female,2
1,0,"Allison, Master. Hudson Trevor",male,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON",4,0.9167,boy,3
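
To see what the bin labels mean, it helps to run `np.digitize` on a few sizes by hand. With the default `right=False`, it returns the index `i` such that `bins[i-1] <= x < bins[i]`. The sizes below are made-up examples.

```python
import numpy as np

bins = [0, 1, 2, 5, 7, 100, 1000]
sizes = np.array([1, 2, 4, 5, 7, 11])

# Index of the bin each size falls into.
indices = np.digitize(sizes, bins)

# Converting the indices to strings gives the categorical labels.
labels = indices.astype(str)

for size, label in zip(sizes, labels):
    print(size, "->", label)
```

A passenger travelling alone (size 1) gets label `2`, and families of 2 to 4 share label `3`, which matches the two rows shown above.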


Let's fill all potential missing values before feeding them to our model.

In [11]:
def fillnull(df: daft.DataFrame, columns: list, value: str):
    for column in columns:
        df = df.with_column(column, (df[column].is_null().if_else(value, df[column])))
    return df

train = fillnull(train, ['embarked', 'sex', 'family_bin', 'age_group'], '')
train.limit(2).show()

pclass INTEGER,survived INTEGER,name STRING,sibsp INTEGER,parch INTEGER,ticket STRING,fare FLOAT,cabin STRING,boat STRING,body FLOAT,home_dest STRING,family_size INTEGER,age FLOAT,embarked STRING,sex STRING,family_bin STRING,age_group STRING
1,0,"Allen, Miss. Elisabeth Walton",0,0,24160,211.338,B5,2,,"St Louis, MO",1,29.0,S,female,2,adult female
1,0,"Allison, Master. Hudson Trevor",1,2,113781,151.55,C22 C26,11,,"Montreal, PQ / Chesterville, ON",4,0.9167,S,male,3,boy


### Let's put it all together

In [12]:
@polars_udf(return_type=str)
def age_group(sex: pl.Series, age: pl.Series):
    
    def group(x):
        if x['sex'] == 'male':
            if x['age'] <= 15:
                return 'boy'
            return 'adult male'
        if x['sex'] == 'female':
            if x['age'] <= 15:
                return 'girl'
            return 'adult female'
        return 'other'
    return pl.DataFrame([sex.alias('sex'),age.alias('age')]).select(
        pl.struct(['sex' ,'age']).apply(group).alias('value'))['value'] 


@polars_udf(return_type=str)
def bin_family(size):
    bins = [0, 1, 2, 5, 7, 100, 1000]
    return np.vectorize(str)(np.digitize(size, bins))


def fillnull(df: daft.DataFrame, columns: list, value: str):
    for column in columns:
        df = df.with_column(column, (df[column].is_null().if_else(value, df[column])))
    return df


def preprocess(df):
    df = df.with_column('family_size', df['parch'] + df['sibsp'] + 1)
    df = df.with_column('age', (df['age'].is_null().if_else(30.0, df['age'])))
    df = df.with_column('age_group', age_group(df['sex'], df['age']))
    df = df.with_column('family_bin', bin_family(df['family_size']))
    df = fillnull(df, ['embarked', 'sex', 'family_bin', 'age_group'], '')
    return df

train, test = train_test_split(df)
train, test = preprocess(train), preprocess(test)

## Modelling with [LightGBM](https://lightgbm.readthedocs.io)

In [14]:
from lightgbm.sklearn import LGBMClassifier
from sklearn.metrics import classification_report
import pandas as pd 

pd.options.mode.chained_assignment = None


def to_x_y(df):
    # Columns are selected in pandas: selecting them before `to_pandas` failed for an unknown reason.
    data = df.to_pandas()[['embarked', 'sex', 'family_bin', 'age_group', 'survived']]
    # `astype` returns a new frame, so we never write into a slice of `data`.
    X = data[['embarked', 'sex', 'family_bin', 'age_group']].astype('category')
    y = data['survived'].values
    return X, y


X_train, y_train = to_x_y(train)
X_test, y_test = to_x_y(test)

model = LGBMClassifier().fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.81      0.74      0.77        99
           1       0.86      0.90      0.88       174

    accuracy                           0.84       273
   macro avg       0.83      0.82      0.83       273
weighted avg       0.84      0.84      0.84       273



# What about big data? Let's use [Ray](https://docs.ray.io/en/latest/ray-air/examples/lightgbm_example.html)
In this case, all of our preprocessing already works in a distributed manner over Ray; no changes are needed!

To run on Ray (even locally), we call `daft.context.set_runner_ray()` at the beginning.

<div class="alert alert-warning">
    Please restart the kernel for it to work.
</div>

## Setup

In [1]:
import daft
from daft import polars_udf
import polars as pl
import numpy as np
import pandas as pd 

pd.options.mode.chained_assignment = None

daft.context.set_runner_ray() # <-- This is all you need

features = ['embarked', 'sex', 'family_bin', 'age_group']
target = 'survived'

@polars_udf(return_type=float)
def uniform(name: pl.Series, seed=0):
    return np.random.default_rng(seed).uniform(0, 1, len(name))

def train_test_split(df, fraction=0.8):
    return df.where(uniform(df['name']) <= fraction), df.where(uniform(df['name']) > fraction)

@polars_udf(return_type=str)
def age_group(sex: pl.Series, age: pl.Series):
    
    def group(x):
        if x['sex'] == 'male':
            if x['age'] <= 15:
                return 'boy'
            return 'adult male'
        if x['sex'] == 'female':
            if x['age'] <= 15:
                return 'girl'
            return 'adult female'
        return 'other'
    return pl.DataFrame([sex.alias('sex'),age.alias('age')]).select(
        pl.struct(['sex' ,'age']).apply(group).alias('value'))['value'] 


@polars_udf(return_type=str)
def bin_family(size):
    bins = [0, 1, 2, 5, 7, 100, 1000]
    return np.vectorize(str)(np.digitize(size, bins))


def fillnull(df: daft.DataFrame, columns: list, value: str):
    for column in columns:
        df = df.with_column(column, (df[column].is_null().if_else(value, df[column])))
    return df


def preprocess(df):
    df = df.with_column('family_size', df['parch'] + df['sibsp'] + 1)
    df = df.with_column('age', (df['age'].is_null().if_else(30.0, df['age'])))
    df = df.with_column('age_group', age_group(df['sex'], df['age']))
    df = df.with_column('family_bin', bin_family(df['family_size']))
    df = fillnull(df, features, '')
    return df


df = daft.DataFrame.read_csv('~/development/datasets/titanic.csv')

train, test = train_test_split(df)
train, test  = preprocess(train), preprocess(test)

2023-01-30 15:18:28.392 | INFO     | daft.context:runner:71 - Using RayRunner
2023-01-30 15:18:30,476	INFO worker.py:1538 -- Started a local Ray instance.


## Distributed modelling with [Ray-lightgbm](https://docs.ray.io/en/latest/ray-air/examples/lightgbm_example.html)

In [2]:
import warnings
import ray
from ray.data.preprocessors import OrdinalEncoder, Chain, Categorizer
from ray.train.batch_predictor import BatchPredictor
from ray.air.config import ScalingConfig, RunConfig, CheckpointConfig
from ray.train.lightgbm import LightGBMTrainer, LightGBMPredictor
from tempfile import TemporaryDirectory
from sklearn.metrics import classification_report


def to_ray(df):
    return df.to_ray_dataset().select_columns(features + [target])


def to_x_y(df):
    # Columns are selected in pandas: selecting them before `to_pandas` failed for an unknown reason.
    data = df.to_pandas()[['embarked', 'sex', 'family_bin', 'age_group', 'survived']]
    # `astype` returns a new frame, so we never write into a slice of `data`.
    X = data[['embarked', 'sex', 'family_bin', 'age_group']].astype('category')
    y = data['survived'].values
    return X, y


tmpdir = TemporaryDirectory() # keeps local experiments clean; otherwise results are saved under the ~/ray_results directory
run_config = RunConfig(local_dir=tmpdir.name)

datasets = {"train": to_ray(train), "test":to_ray(test)}

trainer = LightGBMTrainer(
    scaling_config=ScalingConfig(num_workers=1, use_gpu=False), # you can scale up as you like, use GPUS and all that jazz
    run_config = run_config,
    label_column=target,
    params={"objective": "binary", "metric": ["binary_logloss", "binary_error"], "verbose":-1},
    datasets=datasets,
    preprocessor=Chain(Categorizer(features)),
    num_boost_round=100,
)

result = trainer.fit()
model = LightGBMPredictor.from_checkpoint(result.checkpoint)
# model = BatchPredictor.from_checkpoint(result.checkpoint, LightGBMPredictor) # if we want batch predictions
X_test, y_test   = to_x_y(test)
print(classification_report(y_test, (model.predict(X_test) > 0.5).astype(int)))

Map_Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 28.45it/s]
Map_Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 66.02it/s]


0,1
Current time:,2023-01-30 15:20:22
Running for:,00:00:09.50
Memory:,56.2/64.0 GiB

Trial name,status,loc,iter,total time (s),train-binary_logloss,train-binary_error,test-binary_logloss
LightGBMTrainer_35c5d_00000,TERMINATED,127.0.0.1:4600,101,6.72115,0.477441,0.198842,0.437776








Trial name,date,done,episodes_total,experiment_id,experiment_tag,hostname,iterations_since_restore,node_ip,pid,should_checkpoint,test-binary_error,test-binary_logloss,time_since_restore,time_this_iter_s,time_total_s,timestamp,timesteps_since_restore,timesteps_total,train-binary_error,train-binary_logloss,training_iteration,trial_id,warmup_time
LightGBMTrainer_35c5d_00000,2023-01-30_15-20-22,True,,fc50cfe07b8e416c80b1d6b4e3360bda,0,MacBook-Pro-2,101,127.0.0.1,4600,True,0.157509,0.437776,6.72115,0.849191,6.72115,1675088422,0,,0.198842,0.477441,101,35c5d_00000,0.01087


2023-01-30 15:20:22,724	INFO tune.py:762 -- Total run time: 9.69 seconds (9.47 seconds for the tuning loop).


              precision    recall  f1-score   support

           0       0.81      0.74      0.77        99
           1       0.86      0.90      0.88       174

    accuracy                           0.84       273
   macro avg       0.83      0.82      0.83       273
weighted avg       0.84      0.84      0.84       273
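
The last line of the cell above turns scores into class labels before building the report. With the `binary` objective, LightGBM predicts the probability of the positive class, so we threshold it at 0.5. Here is a minimal numpy sketch of that step, using made-up probabilities and labels rather than the model's real output.

```python
import numpy as np

# Made-up positive-class probabilities and true labels, for illustration only.
probabilities = np.array([0.91, 0.12, 0.50, 0.73])
y_true = np.array([1, 0, 1, 1])

# Strictly greater than 0.5 counts as "survived" (1), everything else as 0.
predicted = (probabilities > 0.5).astype(int)

# Share of rows where the thresholded prediction matches the label.
accuracy = float((predicted == y_true).mean())

print(predicted, accuracy)
```

Note that a score of exactly 0.5 maps to class 0 because the comparison is strict.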



# Conclusion
With daft, we combined the flexibility of Polars, the ecosystem of numpy and pandas, and the scalability of Ray to run preprocessing and modelling with LightGBM.