# The data block API

The data block API lets you customize the creation of a DataBunch by isolating the underlying parts of that process in separate blocks, mainly:

1. Where are the inputs and how to create them?
2. How to split the data into training and validation sets?
3. How to label the inputs?
4. What transforms to apply?
5. How to add a test set?
6. How to wrap in dataloaders and create the DataBunch?

> A DataBunch is a collection of dataloaders (train, validation, test).
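The questions above map onto a chain of method calls, where each call returns an object that the next step is called on. The sketch below imitates that pattern in plain Python so the flow is visible without fastai installed. `ToyItemList`, `label_with`, and the `seed` argument are hypothetical names made up for illustration; they are not fastai classes or signatures, and the real API does considerably more (processors, transforms, test sets).

```python
import random


class ToyItemList:
    """Hypothetical stand-in for an item list; NOT a fastai class."""

    def __init__(self, items):
        # Step 1: where are the inputs?
        self.items = list(items)
        self.train_idx = list(range(len(self.items)))
        self.valid_idx = []
        self.labels = None

    def split_by_rand_pct(self, valid_pct=0.2, seed=0):
        # Step 2: hold out a random fraction of the items for validation.
        idx = list(range(len(self.items)))
        random.Random(seed).shuffle(idx)
        cut = int(valid_pct * len(idx))
        self.valid_idx = sorted(idx[:cut])
        self.train_idx = sorted(idx[cut:])
        return self  # returning self is what makes the chaining possible

    def label_with(self, func):
        # Step 3: derive one label per input.
        self.labels = [func(item) for item in self.items]
        return self

    def databunch(self, bs=2):
        # Step 6: group (input, label) pairs into batches per split.
        def batches(indices):
            pairs = [(self.items[i], self.labels[i]) for i in indices]
            return [pairs[i:i + bs] for i in range(0, len(pairs), bs)]

        return {"train": batches(self.train_idx), "valid": batches(self.valid_idx)}


def label_review(text):
    # Toy labelling rule, only here to have something to attach as a label.
    return "positive" if ("great" in text or "loved" in text) else "negative"


reviews = ["great film", "awful plot", "loved it", "boring", "great cast"]
toy_bunch = (ToyItemList(reviews)
             .split_by_rand_pct(valid_pct=0.2, seed=0)
             .label_with(label_review)
             .databunch(bs=2))

for split, split_batches in toy_bunch.items():
    print(split, split_batches)
```

The fastai pipelines in the rest of this notebook follow the same shape: pick the inputs, split, label, then call `databunch()`.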

In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

## IMDB

- binary text classification

In [2]:
import pandas as pd
import fastai.text as fnlp

- path

In [3]:
imdb = fnlp.untar_data(fnlp.URLs.IMDB_SAMPLE)
imdb

PosixPath('/root/.fastai/data/imdb_sample')

In [4]:
imdb.ls()

[PosixPath('/root/.fastai/data/imdb_sample/texts.csv')]

In [5]:
pd.read_csv(imdb.ls()[0]).head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


- data block api
    - A single CSV holds both the inputs and the labels, loaded as a DataFrame.
    - An is_valid column marks which rows belong to the validation set rather than the training set.

In [6]:
data_lm = (fnlp.TextList.from_csv(imdb, csv_name='texts.csv', cols='text')  # Where is the text? Column 'text' of texts.csv
                   .split_by_rand_pct()  # How to split it? Randomly, with the default 20% in valid
                   .label_for_lm()       # Label it for a language model
                   .databunch())         # Finally, convert to a DataBunch
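Labelling for a language model needs no label column, because the targets come from the text itself. A minimal sketch, assuming the usual next-token objective (the target sequence is the input shifted by one token). The token list is illustrative and this is not how fastai builds its batches internally.

```python
# Illustrative token sequence in the style of fastai's tokenizer output.
tokens = ["xxbos", "xxmaj", "this", "is", "a", "well", "-", "made", "film", "."]

# Inputs are every token except the last; targets are every token except the first.
inputs = tokens[:-1]
targets = tokens[1:]

# Each input token is paired with the token that follows it.
pairs = list(zip(inputs, targets))
for current_token, next_token in pairs[:3]:
    print(f"{current_token!r} -> {next_token!r}")
```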

In [7]:
data_lm.show_batch()

idx,text
0,"! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk ! xxbos xxmaj this is a extremely well - made film . xxmaj the acting , script and camera - work are all first - rate . xxmaj the music is good , too , though it is"
1,scene the xxmaj xxunk xxunk his six - shooter about nine times and could n't hit any of three large men who were only about twenty feet away . i had to turn it off after about 15 minutes of this xxunk . xxmaj perhaps those who xxunk in this movie could have taken some lessons at the xxmaj xxunk xxmaj xxunk xxmaj school of xxmaj acting . xxbos xxmaj
2,"film the xxmaj derek couple has ever made and if you think this is a recommendation then you have n't seen any of the others . xxmaj there are the usual xxunk : it is just as poorly acted as their other efforts , we can watch xxmaj bo xxunk or xxunk for wet t - shirt xxunk quite frequently , the story is just laughably idiotic , and the"
3,"a camera could be out done by a xxunk . \n \n xxmaj the highlights ( what little there were ) came from the special effects , which were "" xxup ok "" . xxmaj the acting for the most part was also "" xxup ok "" ; though nothing special , it was of a higher quality than other b - xxmaj movies i have seen in the"
4,", they were terribly xxunk , as was the entire xxunk scene . xxmaj the xxunk of xxmaj fantine 's suffering xxunk us from feeling too much pity for her . xxmaj that xxmaj cosette knows xxmaj valjean 's past from the start xxunk with the plot a good deal . i did not even see xxmaj xxunk , and xxmaj xxunk . xxmaj xxunk only had a few seconds"


- classification

In [8]:
data_clas = (fnlp.TextList.from_csv(imdb, 'texts.csv', cols='text')
                   .split_from_df(col='is_valid')
                   .label_from_df(cols='label')
                   .databunch())
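The classification pipeline above splits on the is_valid column and takes labels from the label column. The sketch below shows that idea with only the standard library, on a few made-up rows laid out like texts.csv. It is an assumption-laden illustration of the concept, not fastai's implementation of `split_from_df` or `label_from_df`.

```python
import csv
import io

# Made-up rows using the same column names as texts.csv.
sample_csv = """label,text,is_valid
negative,sample review one,False
positive,sample review two,False
negative,sample review three,True
positive,sample review four,True
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# The csv module reads every field as a string, so compare against "True".
train = [(r["text"], r["label"]) for r in rows if r["is_valid"] != "True"]
valid = [(r["text"], r["label"]) for r in rows if r["is_valid"] == "True"]

print(len(train), "training rows;", len(valid), "validation rows")
```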

In [9]:
data_clas.show_batch()

text,target
"xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n \n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj",negative
"xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with",positive
"xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of "" xxmaj at xxmaj the xxmaj movies "" in taking xxmaj steven xxmaj soderbergh to task . \n \n xxmaj it 's usually satisfying to watch a film director change his style /",negative
"xxbos xxmaj this film sat on my xxmaj xxunk for weeks before i watched it . i xxunk a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj xxunk . \n \n xxmaj the format is the same as xxmaj max xxmaj xxunk ' "" xxmaj la xxmaj xxunk",positive
"xxbos xxmaj many neglect that this is n't just a classic due to the fact that it 's the first xxup 3d game , or even the first xxunk - up . xxmaj it 's also one of the first xxunk games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - xxunk gaming experience in general . xxmaj with graphics",positive


## Tabular

In [10]:
import fastai.tabular as ftbl

In [12]:
adult = ftbl.untar_data(ftbl.URLs.ADULT_SAMPLE)
adult

PosixPath('/root/.fastai/data/adult_sample')

- path

In [13]:
adult.ls()

[PosixPath('/root/.fastai/data/adult_sample/export.pkl'),
 PosixPath('/root/.fastai/data/adult_sample/adult.csv'),
 PosixPath('/root/.fastai/data/adult_sample/models')]

In [14]:
(adult / 'models').ls()

[PosixPath('/root/.fastai/data/adult_sample/models/mini_train.pth')]

In [18]:
df = pd.read_csv(adult/'adult.csv')

In [16]:
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
cont_names = ['education-num', 'hours-per-week', 'age', 'capital-loss', 'fnlwgt', 'capital-gain']
procs = [ftbl.FillMissing, ftbl.Categorify, ftbl.Normalize]

In [19]:
data = (ftbl.TabularList.from_df(df, path=adult, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(valid_idx=range(800,1000))
                           .label_from_df(cols=dep_var)
                           .databunch())
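`split_by_idx` takes the row positions that should go to the validation set. Because Python's `range` excludes its end point, `range(800, 1000)` selects rows 800 through 999, that is 200 rows. A small sketch of the resulting split follows. The value of `n_rows` is an assumption chosen for illustration, not the actual size of adult.csv.

```python
n_rows = 1200  # assumed row count for illustration only

valid_idx = set(range(800, 1000))  # rows 800..999 inclusive
train_idx = [i for i in range(n_rows) if i not in valid_idx]

print(len(valid_idx), "validation rows;", len(train_idx), "training rows")
```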

In [20]:
data.show_batch()

workclass,education,marital-status,occupation,relationship,race,sex,native-country,education-num_na,education-num,hours-per-week,age,capital-loss,fnlwgt,capital-gain,target
Private,Some-college,Never-married,Handlers-cleaners,Unmarried,White,Female,United-States,False,-0.0312,-0.0356,-0.6294,-0.2164,0.4568,-0.1459,<50k
Private,Some-college,Divorced,Exec-managerial,Not-in-family,White,Female,United-States,False,-0.0312,-2.9515,-0.1896,-0.2164,0.4157,-0.1459,<50k
Private,Masters,Married-spouse-absent,Exec-managerial,Not-in-family,White,Male,United-States,False,1.5334,0.7743,0.3235,-0.2164,-0.8876,-0.1459,<50k
Private,Assoc-acdm,Married-civ-spouse,Adm-clerical,Husband,White,Male,United-States,False,0.7511,-0.4406,1.5695,-0.2164,-1.077,-0.1459,>=50k
Federal-gov,Some-college,Never-married,Armed-Forces,Not-in-family,Black,Male,United-States,False,-0.0312,1.5843,-0.7027,-0.2164,1.0551,-0.1459,<50k
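The continuous columns in the batch above are small values centred around zero, which is what a standardizing step such as the `Normalize` proc would produce. A minimal sketch of that idea, assuming plain mean and standard deviation scaling with statistics taken from the training rows and reused for the validation rows. The numbers are made up, and the exact formula fastai applies (for example which standard deviation estimator or epsilon it uses) is not shown in this notebook.

```python
import statistics

# Made-up hours-per-week values, not taken from adult.csv.
train_hours = [40.0, 50.0, 20.0, 40.0, 60.0]
valid_hours = [45.0, 10.0]

# Statistics come from the training rows only.
mean = statistics.mean(train_hours)
std = statistics.stdev(train_hours)  # sample standard deviation (an assumption)


def normalize(values):
    """Shift by the training mean and scale by the training std."""
    return [(v - mean) / std for v in values]


train_norm = normalize(train_hours)
valid_norm = normalize(valid_hours)  # reuses the training statistics

print([round(v, 4) for v in train_norm])
print([round(v, 4) for v in valid_norm])
```

Reusing the training statistics for the validation rows keeps both splits on the same scale.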
