# Outline

I decided to fine-tune an **Inception model pretrained on ImageNet** from `torchvision`, so I need to arrange the data in the `ImageFolder` format.

I will use the provided training images together with **images I generated myself**. I went [to the Stable Diffusion website](https://stablediffusionweb.com/#demo) and [generated 65 more images of brown bears](https://drive.google.com/drive/folders/1c1XLbw4x_rzCpegpHfJOVREq1A2FSXd8?usp=sharing) similar to the pictures in the dataset _(bear with white hare, bear in the city, bear on the tree, bear in winter forest, bear under sky full of stars, bear in the mud, bear near the water and so on)_.

The prepared `ImageFolder` will be used for binary classification: **is there a brown bear in the picture or not**.
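
For reference, `ImageFolder` expects one subdirectory per class under a root directory and, as far as I know, assigns integer class indices by sorting the subdirectory names. The sketch below mimics only that discovery step with the standard library. `find_classes_sketch` and the temporary layout are illustrative names of my own, not part of torchvision.

```python
import os
import tempfile


def find_classes_sketch(root: str) -> dict:
    """Map each class subdirectory of `root` to an index, in sorted name order."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: idx for idx, name in enumerate(classes)}


# Throwaway layout: <root>/0 would hold "no bear" images, <root>/1 "bear" images.
with tempfile.TemporaryDirectory() as root:
    for label in ('1', '0'):
        os.makedirs(os.path.join(root, label))
    class_to_idx = find_classes_sketch(root)

print(class_to_idx)
```

With folders named `0` and `1`, the folder name and the class index coincide, which keeps the labels easy to read later.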

# Disclaimer

The task was to train the model that generalizes best from the training data. In a real industry problem we would not have test data on hand to validate on, and test data is by definition something we can't look at. So in my work I used only the training set of images to train, validate and test my models. The 149 test images were used only to generate the final predictions for submission, without looking at them.

# Imports

In [5]:
import os
import pandas as pd
import shutil

from sklearn.model_selection import train_test_split
from methods import ptree

SEED = 42

# Labeling Data

I will take the training part of the image set and extract labels from the `train.csv` file.

In [6]:
df = pd.read_csv('./train.csv')
df.head()

Unnamed: 0,file_name,x1,y1,x2,y2,confidence
0,image_100.png,0,0,0,0,0.0
1,image_102.jpeg,282,223,755,723,1.0
2,image_103.png,0,0,0,0,0.0
3,image_104.png,0,0,0,0,0.0
4,image_105.jpeg,189,328,402,728,1.0


`confidence == 1.0` means that there is a bear in the image.

Let's take only this column and cast it to _int_.

In [7]:
df['label'] = df['confidence'].astype(int)
df = df[['file_name', 'label']]
df.head()

Unnamed: 0,file_name,label
0,image_100.png,0
1,image_102.jpeg,1
2,image_103.png,0
3,image_104.png,0
4,image_105.jpeg,1


I also need to create labels for the **images I generated myself**. I generated only images with bears, so every label is 1.

In [8]:
filenames = list(os.listdir('./generated'))
labels = [1] * len(filenames)

genearated_df = pd.DataFrame({'file_name': filenames, 'label': labels})
genearated_df.head()

Unnamed: 0,file_name,label
0,winter5.jpeg,1
1,stars2.jpeg,1
2,winter6.jpeg,1
3,water_mud1.jpeg,1
4,monkey.jpeg,1


Finally, let's combine the labels for all these images.

In [10]:
markup = pd.concat([df, genearated_df], axis=0, ignore_index=True)
markup

Unnamed: 0,file_name,label
0,image_100.png,0
1,image_102.jpeg,1
2,image_103.png,0
3,image_104.png,0
4,image_105.jpeg,1
...,...,...
401,city1.jpeg,1
402,water3.jpeg,1
403,grass1.jpeg,1
404,white3.jpeg,1
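
Two quick checks seem worthwhile on the combined table. The generated images are all positives, so they shift the class balance, and a repeated file name would make two rows point to the same file. Below is a minimal sketch of both checks on a toy table. The file names and the `toy_markup` variable are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the combined markup table (file names are made up).
toy_markup = pd.DataFrame({
    'file_name': ['image_1.png', 'image_2.jpeg', 'city1.jpeg', 'image_2.jpeg'],
    'label': [0, 1, 1, 1],
})

# File names that occur more than once in the table.
is_repeated = toy_markup['file_name'].duplicated()
duplicated_names = toy_markup.loc[is_repeated, 'file_name'].tolist()

# Share of each class; a stratified split aims to keep these proportions.
class_share = toy_markup['label'].value_counts(normalize=True).to_dict()

print(duplicated_names)
print(class_share)
```

On the real data, the same two expressions can be applied to `markup` directly.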


# Train-Test-Valid Split

I want to make a **stratified train-valid-test split** so that each part preserves the class distribution of the data.

I implemented my own method that keeps the split stratified across all three parts.

In [11]:
def train_valid_test_split(
        df: pd.DataFrame,
        val_size: float=0.1,
        test_size: float=0.1,
        stratify_col: str=None
    ) -> pd.DataFrame:
    '''Performs stratified train-valid-test split on the given data.

    This method adds three new boolean columns to a copy of the given dataframe.
    A True value marks the part a row belongs to (train, val or test).

    Args:
        df: Dataframe to separate data from.
        val_size: Fraction of validation data.
        test_size: Fraction of test data.
        stratify_col: Column to stratify on. If None, the split is not stratified.

    Returns:
        A copy of the original dataframe with new columns containing separation info.
    '''
    df = df.copy()
    stratify = df[stratify_col] if stratify_col is not None else None

    # First split off the validation part from the whole dataset.
    train_idxs, val_idxs = train_test_split(
        df.index,
        test_size=val_size,
        stratify=stratify,
        random_state=SEED
    )

    # test_size is a fraction of the whole dataset, so rescale it
    # to a fraction of what remains after removing the validation rows.
    scaled_test_size = test_size / (1 - val_size)

    # train_idxs holds index labels, so select with .loc rather than .iloc.
    train_idxs, test_idxs = train_test_split(
        train_idxs,
        test_size=scaled_test_size,
        stratify=stratify.loc[train_idxs] if stratify is not None else None,
        random_state=SEED
    )

    df['train'] = df.index.isin(train_idxs)
    df['val'] = df.index.isin(val_idxs)
    df['test'] = df.index.isin(test_idxs)

    return df


markup_separated = train_valid_test_split(markup, stratify_col='label')
markup_separated.head()

Unnamed: 0,file_name,label,train,val,test
0,image_100.png,0,True,False,False
1,image_102.jpeg,1,True,False,False
2,image_103.png,0,False,True,False
3,image_104.png,0,True,False,False
4,image_105.jpeg,1,False,True,False
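
The second split above works on what is left after removing the validation rows, which is why `test_size` is rescaled before it is used. Here is the arithmetic as a small sketch. `second_split_fraction` is a helper name I made up for this illustration.

```python
def second_split_fraction(val_size: float, test_size: float) -> float:
    """Fraction of the post-validation remainder that must go to test."""
    if not 0 < val_size + test_size < 1:
        raise ValueError('val_size + test_size must be strictly between 0 and 1')
    return test_size / (1 - val_size)


val_size, test_size = 0.1, 0.1
scaled = second_split_fraction(val_size, test_size)

# Share of the ORIGINAL data that ends up in each part.
test_share = scaled * (1 - val_size)
train_share = (1 - scaled) * (1 - val_size)

print(round(scaled, 4), round(test_share, 4), round(train_share, 4))
```

Without the rescaling, passing `test_size=0.1` to the second split would take 10% of the remaining 90%, so the test part would be only 9% of the data.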


# Dataset Creation

Finally, I implemented a universal method for creating directories and copying files into the `torchvision` `ImageFolder` format.

In [91]:
def create_image_folder(
        df: pd.DataFrame,
        data_folder: str,
        folder_name: str
    ) -> None:
    '''Creates a directory in the torchvision ImageFolder format.

    Args:
        df: Dataframe with columns [file_name, label, train, val, test].
        data_folder: Path to the directory containing the files to distribute
            into the folder being created.
        folder_name: Name of the new image folder directory.
    '''
    splits = ['train', 'val', 'test']
    targets = list(df['label'].unique())

    for split in splits:
        for target in targets:
            # Layout: <folder_name>/<split>/<label>/<file_name>
            dst_folder = os.path.join(folder_name, split, str(target))
            # exist_ok=True lets the cell be re-run without raising an error.
            os.makedirs(dst_folder, exist_ok=True)

            # Rows that belong to this split and carry this label.
            in_split = df[split]
            has_label = df['label'] == target
            for filename in df.loc[in_split & has_label, 'file_name']:
                src = os.path.join(data_folder, filename)
                dst = os.path.join(dst_folder, filename)
                shutil.copyfile(src, dst)
            

create_image_folder(markup_separated, 'all_images', 'bears_image_folder')

Let's take a look at the final result.

In [92]:
ptree('bears_image_folder')

bears_image_folder
|-- val
|   |-- 0
|   |-- 1
|-- test
|   |-- 0
|   |-- 1
|-- train
|   |-- 0
|   |-- 1


In [3]:
!ls "./bears_image_folder/val/1"

baby1.jpeg	image_122.jpeg	image_340.jpeg	road2.jpeg  water3.jpeg
city4.jpeg	image_165.jpeg	image_367.png	tree5.jpeg  water_mud4.jpeg
image_105.jpeg	image_291.png	image_393.png	tree8.jpeg  white3.jpeg
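
Listing one folder by hand is fine for a spot check, but counting files per split and label gives a fuller picture. Below is a minimal sketch using only the standard library. `count_images` is a hypothetical helper, and the demo runs on a throwaway directory rather than on the real data.

```python
import os
import tempfile
from collections import Counter


def count_images(root: str) -> Counter:
    """Count files under root/<split>/<label>/ and key them by (split, label)."""
    counts = Counter()
    for split in sorted(os.listdir(root)):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            continue
        for label in sorted(os.listdir(split_dir)):
            label_dir = os.path.join(split_dir, label)
            if os.path.isdir(label_dir):
                counts[(split, label)] = len(os.listdir(label_dir))
    return counts


# Demo on a throwaway folder with a few empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    for split, label, n_files in [('train', '0', 3), ('train', '1', 2), ('val', '1', 1)]:
        folder = os.path.join(root, split, label)
        os.makedirs(folder)
        for i in range(n_files):
            open(os.path.join(folder, f'img_{i}.png'), 'w').close()
    demo_counts = count_images(root)

print(dict(demo_counts))
```

For the real data, one would call `count_images('bears_image_folder')` and compare the ratio of the two labels across `train`, `val` and `test`.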
