# Springboard Capstone Project 2

## Sampling photos for algorithm design
___

The full Yelp dataset contains more than 280,000 photos. Designing a deep learning algorithm against the full dataset would demand a prohibitive amount of processing power and make each iteration slow. Instead, a subset of the photos is sampled, and the algorithm is designed to perform well on that subset. For the final evaluation of the algorithm, it is then trained once on the full dataset.

In [1]:
import pandas as pd

In [2]:
# read json into pandas dataframe
photos_df = pd.read_json('yelp_academic_dataset_photo.json', lines=True)
photos_df.head()

| | business_id | caption | label | photo_id |
|---|---|---|---|---|
| 0 | wRKYaVXTks43GVSI2awTQA | | food | IuXwafFH3fZlTyXA-poz0w |
| 1 | wRKYaVXTks43GVSI2awTQA | | food | vhnZ58_1shy9HNVdZgtMLw |
| 2 | wRKYaVXTks43GVSI2awTQA | | food | j9ad7H2IBEzhfNCuJu4ukg |
| 3 | wRKYaVXTks43GVSI2awTQA | | food | du-5X44HccQ9Zo3pQPiFgQ |
| 4 | wRKYaVXTks43GVSI2awTQA | The classic Farmer's Choice Breakfast has a li... | food | u7Tt1nvclYNoq3UOToP-GA |


In [3]:
# ensure there are no missing data in the label field (NaN or empty strings)
print(photos_df.info())
print(photos_df['label'].unique())
assert photos_df['label'].notna().all()
assert (photos_df['label'].str.strip() != '').all()

# ensure all photo IDs are unique
assert photos_df.photo_id.nunique() == len(photos_df.photo_id)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280992 entries, 0 to 280991
Data columns (total 4 columns):
business_id    280992 non-null object
caption        280992 non-null object
label          280992 non-null object
photo_id       280992 non-null object
dtypes: object(4)
memory usage: 8.6+ MB
None
['food' 'drink' 'outside' 'inside' 'menu']


In [4]:
# trim dataframe
photos_df = photos_df[['photo_id', 'label']]
photos_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280992 entries, 0 to 280991
Data columns (total 2 columns):
photo_id    280992 non-null object
label       280992 non-null object
dtypes: object(2)
memory usage: 4.3+ MB


In [5]:
# sample photos
photos_sample = photos_df.sample(n=10000, replace=False, weights=None, random_state=123)
photos_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 9462 to 12831
Data columns (total 2 columns):
photo_id    10000 non-null object
label       10000 non-null object
dtypes: object(2)
memory usage: 234.4+ KB


In [6]:
# compare label proportions between sample and entire dataset
label_props = photos_df['label'].value_counts(normalize=True)
print('label_props\n', label_props)

label_props_sample = photos_sample['label'].value_counts(normalize=True)
print('\nlabel_props_sample\n', label_props_sample)

label_props
 food       0.656446
inside     0.219294
outside    0.082614
drink      0.036834
menu       0.004812
Name: label, dtype: float64

label_props_sample
 food       0.6587
inside     0.2224
outside    0.0790
drink      0.0363
menu       0.0036
Name: label, dtype: float64
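
With simple random sampling, the rarest class (`menu`) makes up 0.36% of the sample versus 0.48% of the full dataset. If closer agreement were wanted, one option is stratified sampling, which draws the same fraction from every label. The following is a minimal sketch of that alternative, not the approach used in this notebook. It assumes pandas 1.1 or later (for `DataFrameGroupBy.sample`); the helper name `stratified_sample` and the toy data are hypothetical.

```python
import pandas as pd


def stratified_sample(df, label_col, frac, seed=123):
    """Draw the same fraction of rows from every label group."""
    # DataFrameGroupBy.sample requires pandas >= 1.1
    return df.groupby(label_col).sample(frac=frac, random_state=seed)


# Toy stand-in for photos_df (hypothetical data, not the Yelp file)
counts = {'food': 650, 'inside': 220, 'outside': 80, 'drink': 40, 'menu': 10}
toy_df = pd.DataFrame({
    'photo_id': [f'{label}_{i}' for label, n in counts.items() for i in range(n)],
    'label': [label for label, n in counts.items() for _ in range(n)],
})

toy_sample = stratified_sample(toy_df, 'label', frac=0.1)

# Compare per-label proportions in the full toy data and in the sample
print(toy_df['label'].value_counts(normalize=True).sort_index())
print(toy_sample['label'].value_counts(normalize=True).sort_index())
```

Because each group is sampled separately, small classes cannot be under-drawn by chance, at the cost of the sample no longer being a simple random draw from the whole frame.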


In [7]:
# randomly assign sampled photos to train and validation sets (90% / 10%)
photos_sample_val = photos_sample.sample(frac=0.1, replace=False, weights=None, random_state=123)
photos_sample_val['set'] = 'val'
# left merge tags the validation rows; all other rows get NaN in 'set'
photos_sample = photos_sample.merge(photos_sample_val, how='left', on=['photo_id', 'label'])
photos_sample['set'] = photos_sample['set'].fillna('train')
photos_sample['set'].value_counts()

train    9000
val      1000
Name: set, dtype: int64
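
An alternative way to express the same kind of split, without a merge, is to sample the index labels of the validation rows and write the `set` column directly. This is only a sketch, and it assumes the frame's index is unique; `assign_split` and the toy frame are hypothetical names used for illustration.

```python
import pandas as pd


def assign_split(df, val_frac=0.1, seed=123):
    """Return a copy of df with a 'set' column marking 'train' or 'val' rows."""
    out = df.copy()
    # Index labels of the rows drawn for validation (assumes a unique index)
    val_index = out.sample(frac=val_frac, random_state=seed).index
    out['set'] = 'train'
    out.loc[val_index, 'set'] = 'val'
    return out


# Toy stand-in for photos_sample (hypothetical data)
toy = pd.DataFrame({
    'photo_id': [f'id_{i}' for i in range(20)],
    'label': ['food'] * 15 + ['inside'] * 5,
})
toy_split = assign_split(toy)
print(toy_split['set'].value_counts())
```

Working on a copy leaves the input frame unchanged, so the split can be regenerated with a different `val_frac` or `seed` without re-sampling the photos.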

In [8]:
# save photo dataframes as csv files (the index is written as an unnamed first column)
photos_df.to_csv('photo_labels_all.csv')
photos_sample.to_csv('photo_labels_sample.csv')