# Springboard Capstone Project 2

## Image sampling
___

The full Yelp dataset contains more than 280,000 photos. Using the full dataset to design a deep learning algorithm would require a prohibitive amount of processing power and would make each iteration slow. Instead, the algorithm will be designed on a sampled subset of the photos and tuned to perform well on that subset. For the final evaluation of the algorithm, it will then be trained once on the full dataset.

In [1]:
import pandas as pd
import shutil

In [2]:
# read json into pandas dataframe
all_images = pd.read_json('yelp_academic_dataset_photo.json', lines=True)
all_images.head()

| | business_id | caption | label | photo_id |
|---|---|---|---|---|
| 0 | wRKYaVXTks43GVSI2awTQA | | food | IuXwafFH3fZlTyXA-poz0w |
| 1 | wRKYaVXTks43GVSI2awTQA | | food | vhnZ58_1shy9HNVdZgtMLw |
| 2 | wRKYaVXTks43GVSI2awTQA | | food | j9ad7H2IBEzhfNCuJu4ukg |
| 3 | wRKYaVXTks43GVSI2awTQA | | food | du-5X44HccQ9Zo3pQPiFgQ |
| 4 | wRKYaVXTks43GVSI2awTQA | The classic Farmer's Choice Breakfast has a li... | food | u7Tt1nvclYNoq3UOToP-GA |


In [3]:
# trim dataframe
all_images = all_images[['photo_id', 'label']]

# ensure there is no missing data in the label field (NaN or empty strings)
print(all_images.info())
print(all_images['label'].unique())

# ensure all photo IDs are unique
assert all_images.photo_id.nunique() == len(all_images.photo_id)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280992 entries, 0 to 280991
Data columns (total 2 columns):
photo_id    280992 non-null object
label       280992 non-null object
dtypes: object(2)
memory usage: 4.3+ MB
None
['food' 'drink' 'outside' 'inside' 'menu']
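
The `info()` call above only counts non-null values, and the absence of empty strings is read off the `unique()` output by eye. A minimal sketch of an explicit check is shown below; the helper name `count_bad_labels` and the toy dataframe are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def count_bad_labels(df, column="label"):
    """Count rows whose label is NaN/None or contains only whitespace.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Treat missing values as empty strings, then strip whitespace
    cleaned = df[column].fillna("").astype(str).str.strip()
    return int((cleaned == "").sum())


# Toy data (assumed): one valid label, one empty, one missing, one blank
toy = pd.DataFrame(
    {"photo_id": ["a", "b", "c", "d"], "label": ["food", "", None, " "]}
)
bad_count = count_bad_labels(toy)
print(bad_count)
```

On the real data, `count_bad_labels(all_images)` would be expected to return 0 if the labels are as clean as the output above suggests.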


In [4]:
# set photo_id as the index
all_images.set_index('photo_id', inplace=True)

In [5]:
# sample out test image set
test_images = all_images.sample(n=30000, replace=False, weights=None, random_state=12)
test_images['set'] = 'test'

# remove test images from consideration
train_images = all_images[~all_images.index.isin(test_images.index)]

# sample out validation image set
val_images = train_images.sample(n=20000, replace=False, weights=None, random_state=34)
val_images['set'] = 'val'

# remaining images are training images
# (.copy() avoids a SettingWithCopyWarning when the 'set' column is added)
train_images = train_images[~train_images.index.isin(val_images.index)].copy()
train_images['set'] = 'train'

In [6]:
# combine sets back together
final = pd.concat([test_images, val_images, train_images], axis=0)
final['set'].value_counts()

train    230992
test      30000
val       20000
Name: set, dtype: int64
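
Because the split is a plain random sample, the label proportions in each set should be similar but are not guaranteed to be identical. The sketch below shows one way to confirm that no photo appears in more than one set and to compare label shares across sets. The function name `check_split` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def check_split(df, set_col="set", label_col="label"):
    """Verify that index values are unique and return label shares per set.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if not df.index.is_unique:
        raise ValueError("some photo IDs appear in more than one set")
    # Row-normalised table: fraction of each label within each set
    return pd.crosstab(df[set_col], df[label_col], normalize="index")


# Toy data (assumed), indexed by photo ID like the `final` dataframe
toy = pd.DataFrame(
    {
        "label": ["food", "food", "menu", "food", "menu", "food"],
        "set": ["train", "train", "train", "val", "test", "test"],
    },
    index=["p1", "p2", "p3", "p4", "p5", "p6"],
)
shares = check_split(toy)
print(shares.round(2))
```

Applied to the notebook's data, `check_split(final)` would give one row per set, which makes any drift in class proportions between train, val, and test easy to spot.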

In [7]:
# ensure all photo IDs are unique
assert final.index.nunique() == len(final.index)

In [8]:
# save photo dataframe as csv file
final.to_csv('photo_labels_all.csv')

In [9]:
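# copy images into new directories, one folder per set and label
import os

orig_dir = 'H:/springboard/other_data/yelp/Photos/yelp_academic_dataset_photos/'
dest_dir = 'H:/springboard/other_data/yelp/Photos/final_photos/'

for index, row in final.iterrows():
    filepath = orig_dir + index + '.jpg'
    filedest = dest_dir + row['set'] + '/' + row['label']
    # create the target folder if it does not exist yet; without it,
    # shutil.copy would fail or write a file named after the label
    os.makedirs(filedest, exist_ok=True)
    _ = shutil.copy(filepath, filedest)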
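
The same copy loop is needed again later for the downsampled set, so it may be worth wrapping in a function. The following is a sketch only: the name `copy_images`, the decision to skip and report missing source files, and the temporary-directory demo are all assumptions rather than the notebook's actual approach.

```python
import shutil
import tempfile
from pathlib import Path

import pandas as pd


def copy_images(labels, orig_dir, dest_dir):
    """Copy <photo_id>.jpg into dest_dir/<set>/<label>/.

    Hypothetical helper for illustration. Returns the photo IDs whose
    source file could not be found instead of raising on the first one.
    """
    orig_dir, dest_dir = Path(orig_dir), Path(dest_dir)
    missing = []
    for photo_id, row in labels.iterrows():
        src = orig_dir / f"{photo_id}.jpg"
        if not src.is_file():
            missing.append(photo_id)
            continue
        target = dest_dir / row["set"] / row["label"]
        target.mkdir(parents=True, exist_ok=True)  # create folders on demand
        shutil.copy(src, target)
    return missing


# Demo in a temporary directory with one fake image and one missing ID
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "src"
    src_dir.mkdir()
    (src_dir / "abc.jpg").write_bytes(b"fake image bytes")
    toy = pd.DataFrame(
        {"label": ["food", "menu"], "set": ["train", "val"]},
        index=["abc", "xyz"],
    )
    missing_ids = copy_images(toy, src_dir, Path(tmp) / "out")
    copied_ok = (Path(tmp) / "out" / "train" / "food" / "abc.jpg").is_file()

print(missing_ids, copied_ok)
```

With such a helper, the loop above would reduce to a single call like `copy_images(final, orig_dir, dest_dir)`.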

## Downsampling
___

The model performs poorly on the rarer classes because of class imbalance in the dataset. Accuracy is 99% on the most represented class (food) but drops to 47% on the least represented class (menu). One way to work around class imbalance is to take a random sample of the more represented classes so that every class matches the size of the least represented one. Although this greatly reduces the total amount of data the model is trained on, eliminating the imbalance may still lead to better performance.

In [10]:
# class imbalance in the training set
train_images.label.value_counts()

food       151588
inside      50684
outside     19138
drink        8486
menu         1096
Name: label, dtype: int64

In [11]:
# class imbalance in the validation set
val_images.label.value_counts()

food       13242
inside      4263
outside     1639
drink        751
menu         105
Name: label, dtype: int64

In [12]:
# collect all images of menu class
menu_train = train_images[train_images.label == 'menu']
train_n = len(menu_train.label)

menu_val = val_images[val_images.label == 'menu']
val_n = len(menu_val.label)

# downsample other classes
food_train = train_images[train_images.label == 'food'].sample(n=train_n, replace=False, random_state=12)
food_val = val_images[val_images.label == 'food'].sample(n=val_n, replace=False, random_state=123)

inside_train = train_images[train_images.label == 'inside'].sample(n=train_n, replace=False, random_state=23)
inside_val = val_images[val_images.label == 'inside'].sample(n=val_n, replace=False, random_state=234)

outside_train = train_images[train_images.label == 'outside'].sample(n=train_n, replace=False, random_state=34)
outside_val = val_images[val_images.label == 'outside'].sample(n=val_n, replace=False, random_state=345)

drink_train = train_images[train_images.label == 'drink'].sample(n=train_n, replace=False, random_state=45)
drink_val = val_images[val_images.label == 'drink'].sample(n=val_n, replace=False, random_state=456)
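
The per-class sampling above can also be expressed as one loop over the groups. The sketch below is an assumed alternative, not the notebook's code: `downsample_to_smallest` is a hypothetical name, and it uses a single seed for every class, so it would not reproduce the exact rows selected by the cell above.

```python
import pandas as pd


def downsample_to_smallest(df, label_col="label", seed=0):
    """Sample every class down to the size of the rarest class.

    Hypothetical helper for illustration; uses one seed for all classes.
    """
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, replace=False, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, axis=0)


# Toy data (assumed): 6 food, 3 inside, 2 menu
toy = pd.DataFrame(
    {"label": ["food"] * 6 + ["inside"] * 3 + ["menu"] * 2},
    index=[f"p{i}" for i in range(11)],
)
balanced = downsample_to_smallest(toy)
print(balanced["label"].value_counts())
```

Calling it once on `train_images` and once on `val_images` would give balanced training and validation frames without naming each class explicitly.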

In [13]:
# combine all samples
train = pd.concat([menu_train, food_train, inside_train, outside_train, drink_train], axis=0)
val = pd.concat([menu_val, food_val, inside_val, outside_val, drink_val], axis=0)

In [14]:
# classes are now balanced in the downsampled dataset
print(train.label.value_counts())
print(val.label.value_counts())

drink      1096
inside     1096
outside    1096
menu       1096
food       1096
Name: label, dtype: int64
menu       105
drink      105
outside    105
inside     105
food       105
Name: label, dtype: int64


In [15]:
# combine dataframes
train['set'] = 'train'
val['set'] = 'val'
downsampled = pd.concat([train, val], axis=0)
downsampled['set'].value_counts()

train    5480
val       525
Name: set, dtype: int64
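
To put the reduction in data volume into numbers, the counts printed in this notebook can be compared directly. This is a small arithmetic sketch; the variable names are illustrative, and the figures are copied from the outputs above.

```python
# Counts taken from the value_counts() outputs earlier in this notebook
train_before, train_after = 230992, 5480
val_before, val_after = 20000, 525
food_train, menu_train_count = 151588, 1096

train_kept = train_after / train_before  # fraction of training images retained
val_kept = val_after / val_before        # fraction of validation images retained
imbalance_ratio = food_train / menu_train_count  # food images per menu image

print(f"training images kept:   {train_kept:.2%}")
print(f"validation images kept: {val_kept:.3%}")
print(f"food-to-menu ratio before downsampling: {imbalance_ratio:.0f} to 1")
```

Only a few percent of each set survives downsampling, which is the trade-off the paragraph at the start of this section describes.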

In [16]:
# save to csv
downsampled.to_csv('photo_labels_downsampled.csv')

In [17]:
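# copy images into new directories, one folder per set and label
import os

orig_dir = 'H:/springboard/other_data/yelp/Photos/yelp_academic_dataset_photos/'
dest_dir = 'C:/Users/Nils/Documents/GitHub/Springboard-Capstone-2-local-yelp/downsampled/'

for index, row in downsampled.iterrows():
    filepath = orig_dir + index + '.jpg'
    filedest = dest_dir + row['set'] + '/' + row['label']
    # create the target folder if it does not exist yet; without it,
    # shutil.copy would fail or write a file named after the label
    os.makedirs(filedest, exist_ok=True)
    _ = shutil.copy(filepath, filedest)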