# Bengali.AI Competition - Data Preprocessing
### Team MuchLearningSuchWow

This notebook contains the code we used to preprocess the data. The four `.parquet` training files from Kaggle contain a total of 200,840 training images with a resolution of 236 by 137 pixels (width by height), taking up 4.78 gigabytes in total. The graphemes in these images are not centered.

In this notebook, we center the graphemes, resize the images to 64 by 64 pixels, and normalize the pixel values. This reduces the size of the data to 271 megabytes, so that all of it can be loaded at once (instead of one `.parquet` file at a time).
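
To get a feel for the reduction, here is a rough back-of-the-envelope sketch of the raw, uncompressed footprint, assuming one byte per pixel (`uint8`). The constant names are ours for illustration. The sizes quoted above are presumably on-disk Parquet sizes, so they will not match these numbers exactly.

```python
# Raw pixel footprint before and after resizing (assumes 1 byte per pixel)
NUM_IMAGES = 200_840
ORIG_WIDTH, ORIG_HEIGHT = 236, 137
NEW_SIZE = 64

orig_bytes = NUM_IMAGES * ORIG_WIDTH * ORIG_HEIGHT
new_bytes = NUM_IMAGES * NEW_SIZE * NEW_SIZE

print(f"original: {orig_bytes / 1e9:.2f} GB")    # original: 6.49 GB
print(f"resized:  {new_bytes / 1e9:.2f} GB")     # resized:  0.82 GB
print(f"ratio:    {orig_bytes / new_bytes:.1f}x")  # ratio:    7.9x
```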

The code for centering, resizing, and normalizing the data is based heavily on [this Kaggle kernel](https://www.kaggle.com/iafoss/image-preprocessing-128x128). Note that this notebook is not intended to run on Kaggle; to run this code on Kaggle, add "/kaggle/" in front of all filenames.
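
One way to apply that prefix without editing every path by hand is a small helper. This is a minimal sketch; `add_prefix` is a hypothetical name and is not used elsewhere in this notebook.

```python
def add_prefix(filenames, prefix="/kaggle/"):
    """Return a new list with `prefix` prepended to every filename."""
    return [prefix + name for name in filenames]

# Example with one of the local paths used in this notebook
local_paths = ["input/bengaliai-cv19/train_image_data_0.parquet"]
kaggle_paths = add_prefix(local_paths)
print(kaggle_paths[0])  # /kaggle/input/bengaliai-cv19/train_image_data_0.parquet
```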

## Imports

In [1]:
import cv2
import os
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from tqdm.auto import tqdm

## Filenames

In [2]:
for dirname, _, filenames in os.walk('input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

input\bengaliai-cv19\class_map.csv
input\bengaliai-cv19\sample_submission.csv
input\bengaliai-cv19\test.csv
input\bengaliai-cv19\test_image_data_0.parquet
input\bengaliai-cv19\test_image_data_1.parquet
input\bengaliai-cv19\test_image_data_2.parquet
input\bengaliai-cv19\test_image_data_3.parquet
input\bengaliai-cv19\train.csv
input\bengaliai-cv19\train_image_data_0.parquet
input\bengaliai-cv19\train_image_data_1.parquet
input\bengaliai-cv19\train_image_data_2.parquet
input\bengaliai-cv19\train_image_data_3.parquet
input\bengaliai-cv19\train_image_data_preprocessed.parquet
input\bengaliai-cv19\train_image_data_preprocessed_0.parquet
input\bengaliai-cv19\train_image_data_preprocessed_1.parquet
input\bengaliai-cv19\train_image_data_preprocessed_2.parquet
input\bengaliai-cv19\train_image_data_preprocessed_3.parquet


In [None]:
train_filenames = ["input/bengaliai-cv19/train_image_data_0.parquet",
                   "input/bengaliai-cv19/train_image_data_1.parquet",
                   "input/bengaliai-cv19/train_image_data_2.parquet",
                   "input/bengaliai-cv19/train_image_data_3.parquet"]
processed_filenames = ["input/bengaliai-cv19/train_image_data_preprocessed_0.parquet",
                       "input/bengaliai-cv19/train_image_data_preprocessed_1.parquet",
                       "input/bengaliai-cv19/train_image_data_preprocessed_2.parquet",
                       "input/bengaliai-cv19/train_image_data_preprocessed_3.parquet"]
final_filename = "input/bengaliai-cv19/train_image_data_preprocessed.parquet"

## Preprocessing

In [3]:
HEIGHT = 137
WIDTH = 236
RESIZE_SIZE = 64

In [4]:
# Source: https://www.kaggle.com/iafoss/image-preprocessing-128x128

def bbox(img):
    rows = np.any(img, axis=1)
    cols = np.any(img, axis=0)
    rmin, rmax = np.where(rows)[0][[0, -1]]
    cmin, cmax = np.where(cols)[0][[0, -1]]
    return rmin, rmax, cmin, cmax

def crop_resize(img0, size=RESIZE_SIZE, pad=16):
    # Crop a box around pixels larger than the threshold;
    # a 5-pixel border is ignored because some images contain lines at the sides
    ymin,ymax,xmin,xmax = bbox(img0[5:-5,5:-5] > 80)
    # Cropping may cut too much, so we add some margin back
    xmin = xmin - 13 if (xmin > 13) else 0
    ymin = ymin - 10 if (ymin > 10) else 0
    xmax = xmax + 13 if (xmax < WIDTH - 13) else WIDTH
    ymax = ymax + 10 if (ymax < HEIGHT - 10) else HEIGHT
    img = img0[ymin:ymax,xmin:xmax]
    # Remove low-intensity pixels as noise
    img[img < 28] = 0
    lx, ly = xmax-xmin,ymax-ymin
    l = max(lx,ly) + pad
    #make sure that the aspect ratio is kept in rescaling
    img = np.pad(img, [((l-ly)//2,), ((l-lx)//2,)], mode='constant')
    return cv2.resize(img,(size,size))
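
The bounding-box and square-padding steps of `crop_resize` can be tried in isolation on a toy array. The sketch below repeats `bbox` so that it runs on its own. `pad_to_square` is a hypothetical helper that only isolates the `np.pad` call; the toy crop uses inclusive bounds and skips the extra margin and the final `cv2.resize`.

```python
import numpy as np

def bbox(img):
    """Return (rmin, rmax, cmin, cmax) of the nonzero region, bounds inclusive."""
    rows = np.any(img, axis=1)
    cols = np.any(img, axis=0)
    rmin, rmax = np.where(rows)[0][[0, -1]]
    cmin, cmax = np.where(cols)[0][[0, -1]]
    return rmin, rmax, cmin, cmax

def pad_to_square(img, pad=0):
    """Zero-pad the shorter side symmetrically (hypothetical helper)."""
    ly, lx = img.shape
    l = max(lx, ly) + pad
    return np.pad(img, [((l - ly) // 2,), ((l - lx) // 2,)], mode='constant')

# Toy 6x8 "image" with a bright 2x4 blob
toy = np.zeros((6, 8), dtype=np.uint8)
toy[2:4, 2:6] = 255

rmin, rmax, cmin, cmax = bbox(toy > 80)     # rows 2..3, columns 2..5
crop = toy[rmin:rmax + 1, cmin:cmax + 1]    # shape (2, 4)
squared = pad_to_square(crop)               # shape (4, 4), blob centered vertically
print(crop.shape, squared.shape)
```

Padding before resizing is what keeps the aspect ratio of the grapheme intact: the resize then scales a square to a square.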

In [5]:
# For each training .parquet file:
for i in range(len(train_filenames)):
    # Load the dataframe and reshape it to the correct size
    df = pd.read_parquet(train_filenames[i])
    data = 255 - df.iloc[:, 1:].values.reshape(-1, HEIGHT, WIDTH)
    
    # Process all images
    processed = []
    names = []
    
    for idx in tqdm(range(len(df))):
        names.append(df.iloc[idx,0])
        #normalize each image by its max val
        img = (data[idx]*(255.0/data[idx].max())).astype(np.uint8)
        img = crop_resize(img)
        processed.append(img.flatten())
    
    # Delete the data to save memory
    del df
    del data
    
    # Convert the processed data to a dataframe
    processed_df = pd.DataFrame(processed)
    del processed

    # Restore the "image_id" column
    processed_df.insert(0, "image_id", names)
    del names

    # Convert the dataframe to a parquet table
    table = pa.Table.from_pandas(processed_df)
    del processed_df
    
    # Write the table to a parquet file
    pq.write_table(table, processed_filenames[i])
    del table
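
The inversion and per-image normalization used in the loop can be checked on a tiny array. This is a sketch under our own assumptions: `normalize_to_uint8` is a hypothetical name, and the guard for an all-zero image is an addition that the loop above does not have.

```python
import numpy as np

def normalize_to_uint8(img):
    """Scale an image so that its brightest pixel maps to 255."""
    peak = img.max()
    if peak == 0:
        # Added guard (assumption): a blank image would otherwise divide by zero
        return img.astype(np.uint8)
    return (img * (255.0 / peak)).astype(np.uint8)

# Toy 2x2 "image": bright background (255) with darker strokes, like the raw data
raw = np.array([[255, 238], [204, 255]], dtype=np.uint8)
inverted = 255 - raw                        # background becomes 0, strokes become bright
normalized = normalize_to_uint8(inverted)   # peak of 51 is stretched to 255
print(normalized)
```

Note that `astype(np.uint8)` truncates rather than rounds, so scaled values are rounded down.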

In [6]:
# Load all of the separate preprocessed dataframes from the parquet files
dfs = []
for filename in processed_filenames:
    dfs.append(pd.read_parquet(filename))

# Combine all dataframes into a single dataframe with a fresh 0..N-1 index
combined_df = pd.concat(dfs, ignore_index=True)
del dfs

# Convert the combined dataframe to a parquet table
table = pa.Table.from_pandas(combined_df)
del combined_df

# Write the table to a parquet file
pq.write_table(table, final_filename)
del table
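
`pd.concat` keeps each input's own row labels unless `ignore_index=True` is passed. A minimal sketch on toy frames (the frame contents are made up for illustration) of what that means for the index of a combined dataframe:

```python
import pandas as pd

# Two toy "per-file" frames, each with its own 0-based index
part_a = pd.DataFrame({"image_id": ["Train_0", "Train_1"], "0": [0, 0]})
part_b = pd.DataFrame({"image_id": ["Train_2", "Train_3"], "0": [0, 0]})

kept = pd.concat([part_a, part_b])                      # labels repeat per part
fresh = pd.concat([part_a, part_b], ignore_index=True)  # labels are renumbered

kept_labels = kept.index.tolist()
fresh_labels = fresh.index.tolist()
print(kept_labels)   # [0, 1, 0, 1]
print(fresh_labels)  # [0, 1, 2, 3]
```

Repeated labels are harmless for positional access with `iloc`, but label-based access with `loc` would then return several rows per label.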

In [7]:
# Inspect whether preprocessing worked by re-loading the resulting dataframe from the parquet file
df = pd.read_parquet(final_filename)
print("Data shape: "+str(df.shape))
print(df.head())

Data shape: (200840, 4097)
  image_id  0  1  2  3  4  5  6  7  8  ...  4086  4087  4088  4089  4090  \
0  Train_0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0   
1  Train_1  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0   
2  Train_2  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0   
3  Train_3  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0   
4  Train_4  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0   

   4091  4092  4093  4094  4095  
0     0     0     0     0     0  
1     0     0     0     0     0  
2     0     0     0     0     0  
3     0     0     0     0     0  
4     0     0     0     0     0  

[5 rows x 4097 columns]
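
Each row holds one flattened image after the `image_id` column, so a row can be turned back into a 64 by 64 array for inspection. A minimal sketch on a toy stand-in dataframe; `row_to_image` is a hypothetical helper, and the toy data only mimics the layout shown above.

```python
import numpy as np
import pandas as pd

RESIZE_SIZE = 64

# Toy stand-in for the preprocessed dataframe: two fake, mostly black images
pixels = np.zeros((2, RESIZE_SIZE * RESIZE_SIZE), dtype=np.uint8)
pixels[0, 0] = 255
toy_df = pd.DataFrame(pixels)
toy_df.insert(0, "image_id", ["Train_0", "Train_1"])

def row_to_image(df, idx, size=RESIZE_SIZE):
    """Rebuild a (size, size) uint8 image from row `idx`, skipping image_id."""
    return df.iloc[idx, 1:].to_numpy(dtype=np.uint8).reshape(size, size)

image = row_to_image(toy_df, 0)
print(image.shape, image.dtype)
```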
