# Data Exploration: Image Pre-Processing
Let's preprocess the images so that they are:
- center cropped
- square, with NxN dimensions (same height and width)

Finally, we output the images into two folders: `train` and `test`.
Each file will be a 137x236 image with the `image_id` as the filename.
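
The notebook does not show what `gen_preprocessed_data` does internally, so as a rough illustration of the center-crop step, here is a minimal NumPy sketch. The function name `center_crop_square` and the default of N = min(height, width) are assumptions made for this example, not the helper's actual behavior.

```python
from typing import Optional

import numpy as np


def center_crop_square(img: np.ndarray, size: Optional[int] = None) -> np.ndarray:
    """Return the central size x size window of a 2-D image (hypothetical helper)."""
    h, w = img.shape[:2]
    if size is None:
        # Assumption: use the largest square that fits inside the image
        size = min(h, w)
    if size > min(h, w):
        raise ValueError("crop size must not exceed the shorter image side")
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]


# Synthetic 137x236 image standing in for one raw sample
raw = np.arange(137 * 236).reshape(137, 236)
cropped = center_crop_square(raw)
print(cropped.shape)  # (137, 137)
```

On a 137x236 input this keeps the middle 137 columns and drops 49 columns on the left and 50 on the right.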

## NOTE: images are normalized individually, not per batch or across the whole dataset!
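
One common reading of per-image normalization is that each image is shifted and scaled using its own mean and standard deviation. The sketch below assumes that reading. `normalize_per_image` is a hypothetical name, and the actual helper may use a different scheme (for example min-max scaling).

```python
import numpy as np


def normalize_per_image(img: np.ndarray, eps: float = 1e-7) -> np.ndarray:
    """Scale one image to zero mean and unit variance using only its own statistics."""
    img = img.astype(np.float32)
    return (img - img.mean()) / (img.std() + eps)


# Two synthetic images with very different brightness ranges
rng = np.random.default_rng(0)
dark = rng.integers(0, 50, size=(137, 236)).astype(np.uint8)
bright = rng.integers(200, 256, size=(137, 236)).astype(np.uint8)

dark_norm = normalize_per_image(dark)
bright_norm = normalize_per_image(bright)
```

Because the statistics come from each image alone, the dark and bright samples both end up with roughly zero mean and unit standard deviation, whereas dataset-level statistics would preserve their brightness difference.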

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import os
from pathlib import Path

import pandas as pd
from helpers.preprocess import gen_preprocessed_data

In [3]:
path = Path('./data')
sorted(os.listdir(path))

['bengaliai-cv19.zip',
 'class_map.csv',
 'mini-train',
 'mini-train.csv',
 'models',
 'sample_submission.csv',
 'test',
 'test.csv',
 'test_image_data_0.parquet',
 'test_image_data_1.parquet',
 'test_image_data_2.parquet',
 'test_image_data_3.parquet',
 'train',
 'train.csv',
 'train_image_data_0.parquet',
 'train_image_data_1.parquet',
 'train_image_data_2.parquet',
 'train_image_data_3.parquet']

In [4]:
HEIGHT = 137
WIDTH = 236
TRAIN_DATASETS = [
    path/'train_image_data_0.parquet',
    path/'train_image_data_1.parquet',
    path/'train_image_data_2.parquet',
    path/'train_image_data_3.parquet',
]
TEST_DATASETS = [
    path/'test_image_data_0.parquet',
    path/'test_image_data_1.parquet',
    path/'test_image_data_2.parquet',
    path/'test_image_data_3.parquet',
]
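
`HEIGHT` and `WIDTH` describe the shape of one raw image. As a hedged sketch of how they would be used, the example below assumes each parquet row stores an `image_id` plus `HEIGHT * WIDTH` flattened pixel values, and reshapes such rows back into 2-D images. `rows_to_images` is a hypothetical name, and synthetic data stands in for a real shard.

```python
import numpy as np

HEIGHT = 137
WIDTH = 236


def rows_to_images(flat_pixels, height=HEIGHT, width=WIDTH):
    """Reshape (n_rows, height * width) pixel rows into (n_rows, height, width) images."""
    flat_pixels = np.asarray(flat_pixels)
    if flat_pixels.ndim != 2 or flat_pixels.shape[1] != height * width:
        raise ValueError("expected rows of height * width pixel values")
    return flat_pixels.reshape(-1, height, width)


# Synthetic stand-in for three rows of pixel columns. With a real shard, the
# pixel matrix could come from something like
# pd.read_parquet(p).drop(columns='image_id').to_numpy() (assumed layout).
flat = (np.arange(3 * HEIGHT * WIDTH).reshape(3, HEIGHT * WIDTH) % 256).astype(np.uint8)
images = rows_to_images(flat)
print(images.shape)  # (3, 137, 236)
```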

In [None]:
gen_preprocessed_data(TRAIN_DATASETS, path/'train')

In [None]:
gen_preprocessed_data(TEST_DATASETS, path/'test')

# Create a mini dataset

In [5]:
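def create_mini_train_dataset(parquet_path, save_dir=path/'mini-train',
                              csv_fn='mini-train.csv'):
    """Preprocess one parquet shard into save_dir and save its labels as a CSV."""
    # We assume save_dir has already been created
    # Write the preprocessed image files to save_dir
    gen_preprocessed_data([parquet_path], save_dir)
    # Look up the labels for this shard's images in train.csv
    train_df = pd.read_csv(path/'train.csv')
    # Only the image_id column is needed, so skip loading the pixel columns
    mini_image_ids = pd.read_parquet(parquet_path, columns=['image_id']).image_id
    mini_df = train_df[train_df.image_id.isin(mini_image_ids)]
    # Every image in the shard should have a matching label row
    assert len(mini_image_ids) == len(mini_df)
    # Save the labels without the index so the CSV keeps the same columns as train.csv
    mini_df.to_csv(path/csv_fn, index=False)
    return mini_df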

In [6]:
mini_train = create_mini_train_dataset(TRAIN_DATASETS[0])
mini_train.head()

Completed 0.000000% in 4.302143
Completed 9.958176% in 9.230625
Completed 19.916351% in 14.064235
Completed 29.874527% in 18.915031
Completed 39.832703% in 23.706350
Completed 49.790878% in 28.483585
Completed 59.749054% in 33.398123
Completed 69.707230% in 38.242907
Completed 79.665405% in 43.051191
Completed 89.623581% in 47.840643
Completed 99.581757% in 52.656681
Total time for df:  52.960880279541016

Total time:  52.960991859436035


| | image_id | grapheme_root | vowel_diacritic | consonant_diacritic | grapheme |
|---|---|---|---|---|---|
| 0 | Train_0 | 15 | 9 | 5 | ক্ট্রো |
| 1 | Train_1 | 159 | 0 | 0 | হ |
| 2 | Train_2 | 22 | 3 | 5 | খ্রী |
| 3 | Train_3 | 53 | 2 | 2 | র্টি |
| 4 | Train_4 | 71 | 9 | 5 | থ্রো |
