## Downsampling
___

The model performs poorly on the rarer classes because the dataset is imbalanced. It reaches 99% accuracy on the most represented class (food), but only 47% on the least represented class (menu). One way to work around the imbalance is to take a random sample of each larger class, so that every class ends up with as many images as the least represented one. This greatly reduces the amount of data the model is trained on, but removing the imbalance may still lead to better performance overall.

In [13]:
import pandas as pd
import shutil

In [14]:
df = pd.read_csv('photo_labels_final.csv', index_col=0)
df_train = df[df.set == 'train']
df_val = df[df.set == 'val']

df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 220992 entries, vhnZ58_1shy9HNVdZgtMLw to HOl4tj5PJc5siu1GE_vDWg
Data columns (total 2 columns):
label    220992 non-null object
set      220992 non-null object
dtypes: object(2)
memory usage: 5.1+ MB


In [15]:
# class imbalance in the full (pre-downsampling) train set
df_train.label.value_counts()

food       144885
inside      48551
outside     18266
drink        8203
menu         1087
Name: label, dtype: int64

In [16]:
# class imbalance in the full (pre-downsampling) val set
df_val.label.value_counts()

food       13150
inside      4420
outside     1643
drink        696
menu          91
Name: label, dtype: int64
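
To make the cost of downsampling concrete, the short sketch below works out how many rows survive when every class is cut to the size of the smallest one. The counts are copied from the two outputs above. The helper name `downsampling_summary` is hypothetical and only used for this illustration. With these counts, roughly 2.5% of the training rows are kept.

```python
# Class counts copied from the value_counts() outputs above
train_counts = {"food": 144885, "inside": 48551, "outside": 18266, "drink": 8203, "menu": 1087}
val_counts = {"food": 13150, "inside": 4420, "outside": 1643, "drink": 696, "menu": 91}

def downsampling_summary(counts):
    """Hypothetical helper: rows kept if every class is cut to the smallest class size."""
    n_min = min(counts.values())   # size of the least represented class
    total = sum(counts.values())   # rows before downsampling
    kept = n_min * len(counts)     # rows after downsampling
    return {"per_class": n_min, "kept": kept, "total": total, "kept_fraction": kept / total}

train_summary = downsampling_summary(train_counts)
val_summary = downsampling_summary(val_counts)
print(train_summary)
print(val_summary)
```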

In [17]:
# collect all images of menu class
menu_train = df_train[df_train.label == 'menu']
train_n = len(menu_train.label)

menu_val = df_val[df_val.label == 'menu']
val_n = len(menu_val.label)

# downsample other classes
food_train = df_train[df_train.label == 'food'].sample(n=train_n, replace=False, random_state=12)
food_val = df_val[df_val.label == 'food'].sample(n=val_n, replace=False, random_state=123)

inside_train = df_train[df_train.label == 'inside'].sample(n=train_n, replace=False, random_state=23)
inside_val = df_val[df_val.label == 'inside'].sample(n=val_n, replace=False, random_state=234)

outside_train = df_train[df_train.label == 'outside'].sample(n=train_n, replace=False, random_state=34)
outside_val = df_val[df_val.label == 'outside'].sample(n=val_n, replace=False, random_state=345)

drink_train = df_train[df_train.label == 'drink'].sample(n=train_n, replace=False, random_state=45)
drink_val = df_val[df_val.label == 'drink'].sample(n=val_n, replace=False, random_state=456)

In [18]:
# combine all samples
train = pd.concat([menu_train, food_train, inside_train, outside_train, drink_train], axis=0)
val = pd.concat([menu_val, food_val, inside_val, outside_val, drink_val], axis=0)

In [19]:
# classes are now balanced in the downsampled dataset
print(train.label.value_counts())
print(val.label.value_counts())

food       1087
drink      1087
outside    1087
menu       1087
inside     1087
Name: label, dtype: int64
drink      91
food       91
inside     91
menu       91
outside    91
Name: label, dtype: int64
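
The per-class sampling above can also be written more compactly. The following is a minimal sketch, not the code used in this notebook: it assumes pandas 1.1 or newer (for `GroupBy.sample`), the function name `downsample_to_minority` is hypothetical, and it uses a single seed for all classes rather than one seed per class, so it would not reproduce the exact samples drawn above. It runs on a small toy frame instead of the real labels file.

```python
import pandas as pd

def downsample_to_minority(df, label_col="label", random_state=0):
    """Hypothetical helper: keep n random rows per class, n = size of the rarest class."""
    n_min = df[label_col].value_counts().min()
    # GroupBy.sample draws n rows without replacement from each group (pandas >= 1.1)
    return df.groupby(label_col).sample(n=n_min, random_state=random_state)

# Toy data standing in for the photo labels: 6 food, 3 inside, 2 menu
toy = pd.DataFrame(
    {"label": ["food"] * 6 + ["inside"] * 3 + ["menu"] * 2},
    index=[f"photo_{i}" for i in range(11)],
)

balanced = downsample_to_minority(toy)
print(balanced["label"].value_counts())
```

Applied separately to `df_train` and `df_val`, a helper like this would give the same class counts as the cells above while avoiding one pair of lines per class.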


In [20]:
train['set'] = 'train'
val['set'] = 'val'
downsampled = pd.concat([train, val], axis=0)
downsampled.set.value_counts()

train    5435
val       455
Name: set, dtype: int64

In [21]:
downsampled.to_csv('photo_labels_downsampled.csv')

In [22]:
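# copy photos into new directories, laid out as <dest_dir>/<set>/<label>/
import os

orig_dir = 'H:/springboard/other_data/yelp/Photos/yelp_academic_dataset_photos/'
dest_dir = 'C:/Users/Nils/Documents/GitHub/Springboard-Capstone-2-local-yelp/downsampled/'

for index, row in downsampled.iterrows():
    filepath = os.path.join(orig_dir, index + '.jpg')
    filedest = os.path.join(dest_dir, row['set'], row['label'])
    # create the class folder first; without it, shutil.copy would write
    # a single file named after the label instead of copying into a folder
    os.makedirs(filedest, exist_ok=True)
    shutil.copy(filepath, filedest)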
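
After copying, it can be worth checking that each class folder holds the expected number of images (1087 per class under `train` and 91 under `val`, per the counts above). The sketch below is one possible check, assuming the `<set>/<label>/` folder layout used in the copy loop. The function name `count_images` is hypothetical, and the demo runs on a throwaway temporary directory rather than the real `dest_dir`.

```python
from pathlib import Path
import tempfile

def count_images(root):
    """Hypothetical helper: {(set, label): number of .jpg files} for a root/set/label/ layout."""
    counts = {}
    for class_dir in Path(root).glob("*/*"):
        if class_dir.is_dir():
            key = (class_dir.parent.name, class_dir.name)
            counts[key] = sum(1 for _ in class_dir.glob("*.jpg"))
    return counts

# Demo on a temporary directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for split, label, n in [("train", "menu", 3), ("val", "menu", 1)]:
        class_dir = Path(tmp) / split / label
        class_dir.mkdir(parents=True)
        for i in range(n):
            (class_dir / f"photo_{i}.jpg").touch()
    demo_counts = count_images(tmp)

print(demo_counts)
```

Calling `count_images(dest_dir)` on the real output folder and comparing against `downsampled.groupby(['set', 'label']).size()` would flag any class whose copy was incomplete.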