# Whale Tail Oversample

This data set is imbalanced: some whales have many pictures, while many whales have only one image.

In [2]:
import pandas as pd

## A look at the data

In [3]:
df = pd.read_csv('../whale-identification/input/train.csv')
df.head()

| | Image | Id |
|---|---|---|
| 0 | 0000e88ab.jpg | w_f48451c |
| 1 | 0001f9222.jpg | w_c3d896a |
| 2 | 00029d126.jpg | w_20df2c5 |
| 3 | 00050a15a.jpg | new_whale |
| 4 | 0005c1ef8.jpg | new_whale |


In [4]:
# df.Id.value_counts()[df.Id.value_counts()>=2] # the output has 2932 entries: the IDs with at least 2 images (new_whale included)
df.Id.value_counts().head()

new_whale    9664
w_23a388d      73
w_9b5109b      65
w_9c506f6      62
w_0369a5c      61
Name: Id, dtype: int64

In [21]:
len(df.Id.value_counts()[df.Id.value_counts()>=2])
# 2931 whale IDs have at least 2 images

2931

In [5]:
(df.Id == 'new_whale').mean()

0.3810575292772367

In [6]:
(df.Id.value_counts() == 1).mean()

0.4141858141858142

41% of all whales have only a single image associated with them.

38% of all images contain a new whale - a whale that has not been identified as one of the known whales.
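The two ratios above can be reproduced on a small made-up table. This is only a sketch: the helper name `imbalance_summary` and the toy IDs are illustrative and not part of the competition data. As in the cells above, `new_whale` is counted as one ID when computing the share of single-image IDs.

```python
import pandas as pd


def imbalance_summary(ids: pd.Series, unknown_label: str = "new_whale") -> dict:
    """Return the share of images with the unknown label and the share of IDs seen once."""
    counts = ids.value_counts()
    return {
        # Fraction of rows (images) labelled as the unknown whale.
        "unknown_image_share": float((ids == unknown_label).mean()),
        # Fraction of distinct IDs that have exactly one image.
        "single_image_id_share": float((counts == 1).mean()),
    }


# Toy data (made up): 2 new_whale images, one whale with 3 images,
# and two whales with a single image each.
toy_ids = pd.Series(
    ["new_whale", "new_whale", "w_a", "w_a", "w_a", "w_b", "w_c"], name="Id"
)
summary = imbalance_summary(toy_ids)
print(summary)
```

Here 2 of the 7 images are `new_whale`, and 2 of the 4 distinct IDs have a single image.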


## Oversample

We define the `train data` and `val data` below.
- `val data`: one image from each whale that has at least 2 images (2931 whales).
- `train data`: all the data excluding `new_whale`, minus the `val data`.

We then **oversample** the training data, once without and once with the validation images.
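The split rule can be sketched on a toy table before applying it to the real data. The frame, file names, and the variable names `toy_val` and `toy_train` are made up for illustration. The sketch uses `groupby(...).head(1)` after a shuffle, which is one way to pick a random image per whale and not necessarily the approach used in the cells that follow.

```python
import pandas as pd

# Toy data (made up): w_a has 3 images, w_b has 2, w_c has 1, plus one new_whale.
toy = pd.DataFrame({
    "Image": ["a1.jpg", "a2.jpg", "a3.jpg", "b1.jpg", "b2.jpg", "c1.jpg", "n1.jpg"],
    "Id": ["w_a", "w_a", "w_a", "w_b", "w_b", "w_c", "new_whale"],
})

# Drop the unknown whales first.
known = toy[toy.Id != "new_whale"]

# Number of images per whale, aligned to each row of `known`.
sighting_count = known.groupby("Id").Image.transform("count")

# Shuffle the whales that have at least 2 images, then keep one image per whale.
eligible = known[sighting_count > 1].sample(frac=1, random_state=0)
toy_val = eligible.groupby("Id").head(1)

# Everything else (without new_whale) is training data.
toy_train = known[~known.Image.isin(toy_val.Image)]

print(len(toy_val), len(toy_train))
```

The single-image whale `w_c` can only ever land in the training set, because holding out its one image would leave nothing to train on.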


In [7]:
im_count = df[df.Id != 'new_whale'].Id.value_counts()
im_count.name = 'sighting_count'
df = df.join(im_count, on='Id')
# Filter first, then shuffle, so the boolean mask matches the index of the frame it selects from.
val_fns = set(df[(df.Id != 'new_whale') & (df.sighting_count > 1)].sample(frac=1).groupby('Id').first().Image)



In [8]:
len(val_fns)

2931

In [9]:
df = df[df.Id != 'new_whale']

In [10]:
df.shape

(15697, 3)

In [12]:
df_val = df[df.Image.isin(val_fns)]
df_train = df[~df.Image.isin(val_fns)]
df_train_with_val = df

# val data: one image from each whale with sighting_count > 1 (new_whale was already removed above)
# train data: the images that remain after removing val

In [13]:
df_val.shape, df_train.shape, df_train_with_val.shape

((2931, 3), (12766, 3), (15697, 3))
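The shapes above add up (2931 + 12766 = 15697), which suggests that no rows were lost. A few more properties of the split are worth asserting. The helper below is a hypothetical sketch, shown on made-up data; `check_split` is not part of the notebook.

```python
import pandas as pd


def check_split(train: pd.DataFrame, val: pd.DataFrame, full: pd.DataFrame) -> None:
    """Raise AssertionError if the split leaks images or loses rows or whale IDs."""
    # No image may appear in both sets.
    assert set(train.Image).isdisjoint(set(val.Image)), "image appears in both sets"
    # Together the two sets must account for every row.
    assert len(train) + len(val) == len(full), "rows were lost or duplicated"
    # Each whale contributes at most one validation image.
    assert val.Id.is_unique, "a whale has more than one validation image"
    # Every validation whale must still be represented in the training set.
    assert set(val.Id) <= set(train.Id), "validation whale missing from training set"


# Toy data (made up).
full = pd.DataFrame({
    "Image": ["a1.jpg", "a2.jpg", "b1.jpg", "b2.jpg", "c1.jpg"],
    "Id": ["w_a", "w_a", "w_b", "w_b", "w_c"],
})
held_out = ["a1.jpg", "b2.jpg"]
val = full[full.Image.isin(held_out)]
train = full[~full.Image.isin(held_out)]

check_split(train, val, full)  # passes silently on a valid split
```

On the real frames the call would be `check_split(df_train, df_val, df)`, assuming `df` no longer contains `new_whale` at that point.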

Oversampling begins.

In [14]:
%%time

res = None
sample_to = 15
'''
Oversample the train data set, which contains neither new_whale nor the val data,
to get the oversampled training set.
'''
for grp in df_train.groupby('Id'):
    n = grp[1].shape[0]  # number of rows in this class = sighting count
    additional_rows = grp[1].sample(0 if sample_to < n  else sample_to - n, replace=True)
    # sample() works directly on a DataFrame; replace=True means sampling with replacement.
    # If the class already has 15 or more rows, additional_rows is empty;
    # otherwise it holds the rows needed to bring the class up to 15.
    rows = pd.concat((grp[1], additional_rows))
    
    if res is None: res = rows
    else: res = pd.concat((res, rows))

CPU times: user 14.8 s, sys: 357 ms, total: 15.2 s
Wall time: 15.4 s
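Much of the time in the loop above likely goes into `pd.concat` copying the growing result on every iteration. A possible alternative, sketched here on made-up data, collects the extra rows in a list and concatenates once. The function name `oversample` and its `seed` parameter are assumptions for illustration. Unlike the loop above, this version resets the index and appends the sampled rows after the original rows instead of grouping them by `Id`. It has not been timed on the real data.

```python
import numpy as np
import pandas as pd


def oversample(df: pd.DataFrame, sample_to: int = 15, seed=None) -> pd.DataFrame:
    """Pad every Id group to at least `sample_to` rows by sampling with replacement."""
    rng = np.random.RandomState(seed)
    parts = [df]  # keep all original rows
    for _, group in df.groupby("Id"):
        missing = sample_to - len(group)
        if missing > 0:
            # Draw the missing rows from this whale's own images.
            parts.append(group.sample(missing, replace=True, random_state=rng))
    # A single concat avoids re-copying the accumulated frame on each iteration.
    return pd.concat(parts, ignore_index=True)


# Toy data (made up): w_a has 3 images, w_b has 1, w_c has 7.
toy = pd.DataFrame({
    "Image": [f"a{i}.jpg" for i in range(3)] + ["b0.jpg"] + [f"c{i}.jpg" for i in range(7)],
    "Id": ["w_a"] * 3 + ["w_b"] + ["w_c"] * 7,
})
balanced = oversample(toy, sample_to=5, seed=0)
print(balanced.Id.value_counts().sort_index())
```

With `sample_to=5`, `w_a` and `w_b` are padded to 5 rows each, while `w_c` keeps its 7 rows unchanged.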


In [15]:
%%time

res_with_val = None
sample_to = 15
'''
Oversample the data set that includes the val data but excludes new_whale.
'''

for grp in df_train_with_val.groupby('Id'):
    n = grp[1].shape[0]
    additional_rows = grp[1].sample(0 if sample_to < n  else sample_to - n, replace=True)
    rows = pd.concat((grp[1], additional_rows))
    
    if res_with_val is None: res_with_val = rows
    else: res_with_val = pd.concat((res_with_val, rows))

CPU times: user 15.7 s, sys: 390 ms, total: 16.1 s
Wall time: 16.6 s


In [16]:
res.shape, res_with_val.shape

((76174, 3), (76287, 3))
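The row counts above follow from the padding rule: a whale with `n` images contributes `max(n, sample_to)` rows after oversampling. The sketch below computes that total from the ID column alone, on made-up data. The helper name `expected_oversampled_rows` is hypothetical.

```python
import pandas as pd


def expected_oversampled_rows(ids: pd.Series, sample_to: int = 15) -> int:
    """Total rows after padding every ID up to at least `sample_to` rows."""
    counts = ids.value_counts()
    # Groups below the threshold are raised to it; larger groups are left as they are.
    return int(counts.clip(lower=sample_to).sum())


# Toy data (made up): w_a has 3 images, w_b has 1, w_c has 7.
toy_ids = pd.Series(["w_a"] * 3 + ["w_b"] + ["w_c"] * 7)
total = expected_oversampled_rows(toy_ids, sample_to=5)
print(total)  # 5 + 5 + 7 = 17
```

Applied to `df_train.Id` and `df_train_with_val.Id` with `sample_to=15`, this should match the first entries of `res.shape` and `res_with_val.shape`.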

In [22]:
res[['Image', 'Id']].to_csv('input/oversampled_train_data.csv', index=False)
df_val[['Image', 'Id']].to_csv('input/val_data.csv', index=False)

# res_with_val[['Image', 'Id']].to_csv('input/oversampled_train_and_val.csv', index=False)