# 🐠 Reef - CV strategy: subsequences!

![](https://storage.googleapis.com/kaggle-competitions/kaggle/31703/logos/header.png)

## Problem: there are only 3 videos. Using videos as the split unit for cross-validation or train-validation splits is not optimal, because it produces too large a validation portion.


## In this notebook we explore sequences as potential units for cross-validation, but since there are only 20 sequences and their sizes are quite dissimilar, we propose an approach to split them into smaller chunks, which we call _subsequences_.

A **sequence**, as stated in the [data tab of the competition](https://www.kaggle.com/c/tensorflow-great-barrier-reef/data), is:
> sequence - ID of a gap-free subset of a given video. The sequence ids are not meaningfully ordered.

**Subsequences**, as we will define them below, are parts of a sequence where objects are either continually present or continually absent. We distinguish 2 kinds of subsequences: with objects and without objects.

&nbsp;

Let's see an **example**. Consider the sequence `A` with the following frames:
* `1-20` - No annotations present
* `21-30` - Annotations present
* `31-60` - No annotations
* `61-80` - Annotations present

In this case, we say that the sequence `A` has `4` subsequences (`1-20`, `21-30`, `31-60`, `61-80`).
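
As a minimal sketch of this example (not the notebook's actual rule, which requires two consecutive frames before cutting), the toy frame table and the column name `toy_subsequence_id` below are made up for illustration. A new chunk starts whenever the annotation state flips:

```python
import pandas as pd

# Toy sequence "A": frames 1-80, with annotations on frames 21-30 and 61-80.
frames = pd.DataFrame({"frame": range(1, 81)})
frames["has_annotations"] = (
    frames["frame"].between(21, 30) | frames["frame"].between(61, 80)
)

# A new subsequence starts whenever the annotation state differs from the previous frame.
# The first frame compares against a missing value, so it always opens a chunk.
state_changed = frames["has_annotations"] != frames["has_annotations"].shift()
frames["toy_subsequence_id"] = state_changed.cumsum() - 1

# One row per chunk, with its frame range and annotation state.
summary = frames.groupby("toy_subsequence_id").agg(
    first_frame=("frame", "min"),
    last_frame=("frame", "max"),
    has_annotations=("has_annotations", "first"),
)
print(summary)
```

The summary has one row per chunk, matching the four frame ranges listed above.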

&nbsp;


A subsequence seems to me like the minimal atom for ensuring no leaks happen between train and test.


&nbsp;
&nbsp;
&nbsp;

---

The notebook goes as follows:
1. Analyze sequences as potential units for splitting
2. Propose and create subsequences, and create videos for sequences and subsequences to get a feeling of them
3. Create common train-validation splits (1%, 5%, 10%, 20%) using subsequences
4. Create 5-fold splits and 10-fold splits using subsequences

&nbsp;
&nbsp;


### The resulting dataframes are provided as a dataset for ease of use here: [reef-cv-strategy-subsequences-dataframes](https://www.kaggle.com/julian3833/reef-cv-strategy-subsequences-dataframes)


# Please, _DO_ upvote if you find this useful or interesting!



# Analyze sequences

In [None]:
import ast
import os
import cv2
import subprocess
from tqdm.auto import tqdm
import pandas as pd
from IPython.display import Video, display, HTML
import warnings; warnings.simplefilter("ignore")


BASE_PATH = '../input/tensorflow-great-barrier-reef/train_images/'

df = pd.read_csv("/kaggle/input/tensorflow-great-barrier-reef/train.csv")
# Parse the stringified list of bounding boxes; unlike eval, literal_eval only accepts Python literals
df['annotations'] = df['annotations'].apply(ast.literal_eval)
df['n_annotations'] = df['annotations'].str.len()
df['has_annotations'] = df['n_annotations'] > 0
df['has_2_or_more_annotations'] = df['n_annotations'] >= 2
df['doesnt_have_annotations'] = df['n_annotations'] == 0
df['image_path'] = BASE_PATH + "video_" + df['video_id'].astype(str) + "/" + df['video_frame'].astype(str) + ".jpg"

**There are 20 sequences**:

In [None]:
df['sequence'].unique()

In [None]:
df['sequence'].nunique()

**"sequence" is a global identifier for a sequence (i.e., it is not relative to the video_id)**

In [None]:
df.groupby("sequence")['video_id'].nunique()

In [None]:
# Videos 0 and 1 have 8 sequences, while video 2 has 4
df.groupby("video_id")['sequence'].nunique()

In [None]:
df_agg = df.groupby(["video_id", 'sequence']).agg({'sequence_frame': 'count', 'has_annotations': 'sum', 'doesnt_have_annotations': 'sum'})\
           .rename(columns={'sequence_frame': 'Total Frames', 'has_annotations': 'Frames with at least 1 object', 'doesnt_have_annotations': "Frames with no object"})
df_agg

In [None]:
df_agg.sort_values("Total Frames")

In [None]:
df_agg.sort_values("Frames with at least 1 object")

## The total amount of frames in each sequence varies a lot, so it might be quite difficult to use sequences as the splitting unit.

In [None]:
# image_id is a unique identifier for a row
df['image_id'].nunique() == len(df)

## What do we want, ideally?

Ideally, if we split 80-20 (train-validation), the training portion would contain:
* 80% of the sequences
* 80% of the frames
* 80% of the frames with objects
* 80% of the individuals

Splitting by sequence ensures that no region of the coral appears both in training and validation data.


# Can we go into a subsequence level? 

One idea is the following: sequences contain long gaps with no objects in sight. Maybe we can cut a sequence into smaller pieces during those "empty" stretches, which should keep the same region of the coral, with the same individuals, from appearing in both training and validation data.


### See for example, sequence 40258:

In [None]:
df_agg.loc[[(0, 40258)]]

In [None]:
pd.set_option("display.max_rows", 500)
df[df['sequence'] == 40258]

### The sequence has 4 subsequences with objects, surrounded by other parts with no objects at all. We could split this sequence into these 4 subsequences and use them as units for train-validation splits.


### Here we cut the sequences into continuous subsequences with objects and without objects:

In [None]:
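# A run with objects starts at a frame with annotations preceded by 2 frames without any
# (fill_value=False keeps the shifted columns boolean instead of introducing NaN in the first rows)
df['start_cut_here'] = df['has_annotations'] & df['doesnt_have_annotations'].shift(1, fill_value=False) & df['doesnt_have_annotations'].shift(2, fill_value=False)
# A run with objects ends at a frame without annotations preceded by 2 frames with annotations
df['end_cut_here'] = df['doesnt_have_annotations'] & df['has_annotations'].shift(1, fill_value=False) & df['has_annotations'].shift(2, fill_value=False)
# Always cut where a new sequence begins and at the very last row
df['sequence_change'] = df['sequence'] != df['sequence'].shift(1)
df['last_row'] =  df.index == len(df)-1
df['cut_here'] = df['start_cut_here'] | df['end_cut_here'] | df['sequence_change'] | df['last_row']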


In [None]:
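# Label every row with the id of the chunk it belongs to.
# .loc slicing is inclusive, so the row at end_idx is labelled twice: it is overwritten
# on the next iteration and ends up opening the following chunk
# (except the final row, which closes the last chunk).
start_idx = 0
for subsequence_id, end_idx in enumerate(df[df['cut_here']].index):
    df.loc[start_idx:end_idx, 'subsequence_id'] = subsequence_id
    start_idx = end_idx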

In [None]:
df['subsequence_id'] = df['subsequence_id'].astype(int)

In [None]:
df['subsequence_id'].nunique()

In [None]:
drop_cols = ['start_cut_here', 'end_cut_here', 'sequence_change', 'last_row', 'cut_here', 'has_2_or_more_annotations', 'doesnt_have_annotations']
df = df.drop(drop_cols, axis=1)
df.head()

### The method didn't work perfectly for some subsequences, but the "broken ones" don't look that bad so we are cool 👍👍
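
One way a mixed subsequence can arise under the two-frame rule is a single empty frame inside a run of annotated frames. The sketch below uses a made-up 9-frame series and assumes, as in the labelling loop above, that each cut row opens a new chunk. The one-frame gap triggers an "end" cut, but the frames after it never trigger a "start" cut, so they stay in the same chunk as the gap:

```python
import pandas as pd

# Toy annotation flags: two annotated runs separated by a single empty frame (index 4).
has = pd.Series([False, False, True, True, False, True, True, False, False])
empty = ~has

# Same rule as above: a cut needs the two previous frames to be in the opposite state.
start_cut = has & empty.shift(1, fill_value=False) & empty.shift(2, fill_value=False)
end_cut = empty & has.shift(1, fill_value=False) & has.shift(2, fill_value=False)
cut = start_cut | end_cut
cut.iloc[0] = True  # the first frame always opens a chunk

# Each cut row opens a new chunk (hypothetical toy ids, not the notebook's ids).
toy_id = cut.cumsum() - 1
share_with_objects = has.groupby(toy_id).mean()
print(share_with_objects)
```

Here the third chunk covers frames 4 to 6 and has objects in only two of its three frames, which is the kind of fractional value the cells below look for.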

In [None]:
df.groupby("subsequence_id")['has_annotations'].mean().round(2).sort_values().value_counts()

In [None]:
df_subseq_agg = df.groupby("subsequence_id")['has_annotations'].mean()
df_subseq_agg[~df_subseq_agg.isin([0, 1])]

In [None]:
df[df['subsequence_id'] == 52]

In [None]:
df[df['subsequence_id'] == 53]

In [None]:
df[df['subsequence_id'] == 54]

# Let's see what a sequence and a subsequence look like as videos!!

## mp4 generating code from [create annotated video](https://www.kaggle.com/bamps53/create-annotated-video)

### I changed it to have sequence as parameter instead of video_id

In [None]:
! mkdir videos/

In [None]:
def load_image(img_path):
    assert os.path.exists(img_path), f'{img_path} does not exist.'
    img = cv2.imread(img_path)
    return img

def load_image_with_annotations(img_path, annotations):
    img = load_image(img_path)
    if len(annotations) > 0:
        for ann in annotations:
            cv2.rectangle(img, (ann['x'], ann['y']),
                (ann['x'] + ann['width'], ann['y'] + ann['height']),
                (255, 255, 0), thickness=2,)
    return img

def make_video(df, part_id, is_subsequence=False):
    """
    Args:
        - part_id: either a sequence or a subsequence id
    """
    
    if is_subsequence:
        part_str = "subsequence_id"
    else:
        part_str = "sequence"
    
    print(f"Creating video for part={part_id}, is_subsequence={is_subsequence} (querying by {part_str})")
    # partly borrowed from https://github.com/RobMulla/helmet-assignment/blob/main/helmet_assignment/video.py
    fps = 15 # don't know exact value
    width = 1280
    height = 720
    save_path = f'videos/video_{part_str}_{part_id}.mp4'
    tmp_path = f'videos/tmp_video_{part_str}_{part_id}.mp4'
    
    
    output_video = cv2.VideoWriter(tmp_path, cv2.VideoWriter_fourcc(*"MP4V"), fps, (width, height))
    
    df_part = df.query(f'{part_str} == @part_id')
    for _, row in tqdm(df_part.iterrows(), total=len(df_part)):
        img = load_image_with_annotations(row.image_path, row.annotations)
        output_video.write(img)
    
    output_video.release()
    # Not all browsers support this codec, so we re-encode the file at tmp_path
    # with ffmpeg (libx264), which is more broadly playable
    if os.path.exists(save_path):
        os.remove(save_path)
    subprocess.run(
        ["ffmpeg", "-i", tmp_path, "-crf", "18", "-preset", "veryfast", "-vcodec", "libx264", save_path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL
    )
    os.remove(tmp_path)
    print(f"Finished creating video for {part_id}... saved as {save_path}")
    return save_path

# Video for sequence _40258_ and its subsequences

In [None]:
video_path = make_video(df, 40258)

In [None]:
Video(video_path, width= 1280/2, height= 720/2)

In [None]:
subsequences = df.loc[df['sequence'] == 40258, 'subsequence_id'].unique()
subsequences

In [None]:
for subsequence in subsequences:
    video_path = make_video(df, subsequence, is_subsequence=True)
    display(HTML(f"<h2>Subsequence ID: {subsequence}</h2>"))
    display(Video(video_path, width= 1280/2, height= 720/2))

This looks good 😁, let's use it for creating the splits...

# Generate some common splits based on _subsequences_

In [None]:
from sklearn.model_selection import train_test_split, StratifiedKFold
df.head()

In [None]:
df_split  = df.groupby("subsequence_id").agg({'has_annotations': 'max', 'video_frame': 'count'}).astype(int).reset_index()
df_split.head()

## Train-validation splits for 1%, 5%, 10% and 20%

In [None]:
!mkdir train-validation-split/

In [None]:
def analize_split(df_train, df_val, df):
    # Report how frames and annotations are distributed between train and validation
    print(f"   Train images                 : {len(df_train) / len(df):.3f}")
    print(f"   Val   images                 : {len(df_val) / len(df):.3f}")
    print()
    print(f"   Train images with annotations: {len(df_train[df_train['has_annotations']]) / len(df[df['has_annotations']]):.3f}")
    print(f"   Val   images with annotations: {len(df_val[df_val['has_annotations']]) / len(df[df['has_annotations']]):.3f}")
    print()
    print(f"   Train images w/no annotations: {len(df_train[~df_train['has_annotations']]) / len(df[~df['has_annotations']]):.3f}")
    print(f"   Val   images w/no annotations: {len(df_val[~df_val['has_annotations']]) / len(df[~df['has_annotations']]):.3f}")
    print()
    print(f"   Train mean annotations       : {df_train['n_annotations'].mean():.3f}")
    print(f"   Val   mean annotations       : {df_val['n_annotations'].mean():.3f}")
    
    print()

In [None]:
for test_size in [0.01, 0.05, 0.1, 0.2]:
    print(f"Generating train-validation split with {test_size*100}% validation")
    df_train_idx, df_val_idx = train_test_split(df_split['subsequence_id'], stratify=df_split["has_annotations"], test_size=test_size, random_state=42)
    df['is_train'] = df['subsequence_id'].isin(df_train_idx)
    df_train, df_val = df[df['is_train']], df[~df['is_train']]
    
    # Print some statistics
    analize_split(df_train, df_val, df)
    
    # Save to file
    f_name = f"train-validation-split/train-{test_size}.csv"
    print(f"Saving file to {f_name}")
    df.to_csv(f_name, index=False)
    print()
    

In [None]:
!ls -l train-validation-split/

## Create 5-fold cross validation

In [None]:
df = df.drop("is_train", axis=1)

In [None]:
n_splits = 5
kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=2021)
for fold_id, (_, val_idx) in enumerate(kf.split(df_split['subsequence_id'], y=df_split["has_annotations"])):
    subseq_val_idx = df_split['subsequence_id'].iloc[val_idx]
    df.loc[df['subsequence_id'].isin(subseq_val_idx), 'fold'] = fold_id
    
df['fold'] = df['fold'].astype(int)
df['fold'].value_counts(dropna=False)

In [None]:
for fold_id in df['fold'].sort_values().unique():
    print("=============================")
    print(f"Analyzing fold {fold_id}")
    df_train, df_val = df[df['fold'] != fold_id], df[df['fold'] == fold_id]
    analize_split(df_train, df_val, df)
    print()
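
Since the point of splitting by subsequence is to avoid leaks, a quick sanity check is to confirm that every subsequence lands in exactly one fold. The helper name `folds_per_group` and the toy frames below are hypothetical. This is only a sketch of the check, and it relies on the same `subsequence_id` and `fold` column names as the dataframe built above:

```python
import pandas as pd

def folds_per_group(frames, group_col="subsequence_id", fold_col="fold"):
    """Number of distinct folds each group appears in (1 means no leak)."""
    return frames.groupby(group_col)[fold_col].nunique()

# Toy data: three subsequences, each fully contained in one fold.
toy = pd.DataFrame({
    "subsequence_id": [0, 0, 1, 1, 2, 2],
    "fold":           [0, 0, 1, 1, 0, 0],
})

# Same data, but subsequence 2 now straddles folds 0 and 1.
leaky = toy.copy()
leaky.loc[5, "fold"] = 1

clean_ok = (folds_per_group(toy) == 1).all()
leaky_ok = (folds_per_group(leaky) == 1).all()
print(clean_ok, leaky_ok)
```

On the real data, `(folds_per_group(df) == 1).all()` should be true if the fold assignment above worked as intended.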

In [None]:
!mkdir cross-validation/

In [None]:
df.to_csv("cross-validation/train-5folds.csv", index=False)

## Create 10-fold cross validation

In [None]:
n_splits = 10
kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=2021)
for fold_id, (_, val_idx) in enumerate(kf.split(df_split['subsequence_id'], y=df_split["has_annotations"])):
    subseq_val_idx = df_split['subsequence_id'].iloc[val_idx]
    df.loc[df['subsequence_id'].isin(subseq_val_idx), 'fold'] = fold_id
    
df['fold'] = df['fold'].astype(int)
df['fold'].value_counts(dropna=False)

In [None]:
for fold_id in df['fold'].sort_values().unique():
    print("=============================")
    print(f"Analyzing fold {fold_id}")
    df_train, df_val = df[df['fold'] != fold_id], df[df['fold'] == fold_id]
    analize_split(df_train, df_val, df)
    print()

In [None]:
df.to_csv("cross-validation/train-10folds.csv", index=False)
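
A minimal sketch of how the saved fold files might be consumed follows. The helper `get_fold` and the inline CSV are made up for illustration; in practice the frame would come from `pd.read_csv("cross-validation/train-5folds.csv")`. Note that `annotations` is written to CSV as a string, so it has to be parsed again (for example with `ast.literal_eval`) after loading.

```python
import io
import pandas as pd

def get_fold(frames, fold_id, fold_col="fold"):
    """Return (train, validation) frames, using fold_id as the validation fold."""
    val_mask = frames[fold_col] == fold_id
    return frames[~val_mask], frames[val_mask]

# Stand-in for one of the saved CSV files (hypothetical rows).
csv_text = (
    "image_id,subsequence_id,fold\n"
    "0-0,1,0\n"
    "0-1,1,0\n"
    "0-2,2,1\n"
    "0-3,3,2\n"
)
frames = pd.read_csv(io.StringIO(csv_text))

train_part, val_part = get_fold(frames, fold_id=0)
print(len(train_part), len(val_part))
```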
