In [2]:
import cv2
import os
import pandas as pd
from tqdm import tqdm
from sklearn.model_selection import train_test_split

### Data Loading and Cleaning
<blockquote>Clean up the "video_corpus.csv" file which contains the annotations, and the <strong>bos</strong> and <strong>eos</strong>.</blockquote>
<blockquote>The "annotations.txt" file only contains the video_id and the associated annotation. The difference between this file and the "video_corpus.csv" file is that the latter also contains annotations in languages other than English.</blockquote>
<blockquote>Therefore, the "video_corpus.csv" will contain only the English annotations after the cleaning process.</blockquote>

In [2]:
video_corpus_df = pd.read_csv("data/video_corpus.csv")
video_corpus_df.head()

Unnamed: 0,VideoID,Start,End,WorkerID,Source,AnnotationTime,Language,Description
0,mv89psg6zh4,33,46,588702,unverified,55,Slovene,Papagaj se umiva pod tekočo vodo v lijaku.
1,mv89psg6zh4,33,46,588702,unverified,37,Slovene,Papagaj se umiva pod tekočo vodo v lijaku.
2,mv89psg6zh4,33,46,362812,unverified,11,Macedonian,папагал се бања
3,mv89psg6zh4,33,46,968828,unverified,84,German,Ein Wellensittich duscht unter einem Wasserhahn.
4,mv89psg6zh4,33,46,203142,unverified,14,Romanian,o pasare sta intr-o chiuveta.


In [3]:
# Single quotes inside the braces keep this f-string valid on Python versions before 3.12
print(f"There are {len(video_corpus_df)} annotations in the 'video_corpus.csv', but only {len(video_corpus_df[video_corpus_df['Language'] == 'English'])} English annotations.")

There are 122604 annotations in the 'video_corpus.csv', but only 85511 English annotations.


In [4]:
# Keep only the English annotations (the original row index is preserved)
video_corpus_df = video_corpus_df[video_corpus_df['Language'] == "English"].copy()
video_corpus_df.head()

Unnamed: 0,VideoID,Start,End,WorkerID,Source,AnnotationTime,Language,Description
18,mv89psg6zh4,33,46,682611,clean,66,English,A bird in a sink keeps getting under the runni...
19,mv89psg6zh4,33,46,760882,clean,16,English,A bird is bathing in a sink.
20,mv89psg6zh4,33,46,878566,clean,76,English,A bird is splashing around under a running fau...
21,mv89psg6zh4,33,46,707318,clean,14,English,A bird is bathing in a sink.
22,mv89psg6zh4,33,46,135621,clean,58,English,A bird is standing in a sink drinking water th...


In [5]:
video_corpus_df["Source"].unique()

array(['clean', 'unverified'], dtype=object)

In [6]:
source_counts = video_corpus_df["Source"].value_counts()
source_proportions = video_corpus_df["Source"].value_counts(normalize=True)

print("Count of each Source:")
print(source_counts)

print("\nProportion of each Source:")
print(source_proportions)

Count of each Source:
Source
unverified    51673
clean         33838
Name: count, dtype: int64

Proportion of each Source:
Source
unverified    0.604285
clean         0.395715
Name: proportion, dtype: float64


In [7]:
video_corpus_df = video_corpus_df.drop(columns=['AnnotationTime', 'Language', 'WorkerID'])

In [8]:
video_corpus_df.head()

Unnamed: 0,VideoID,Start,End,Source,Description
18,mv89psg6zh4,33,46,clean,A bird in a sink keeps getting under the runni...
19,mv89psg6zh4,33,46,clean,A bird is bathing in a sink.
20,mv89psg6zh4,33,46,clean,A bird is splashing around under a running fau...
21,mv89psg6zh4,33,46,clean,A bird is bathing in a sink.
22,mv89psg6zh4,33,46,clean,A bird is standing in a sink drinking water th...


In [9]:
print(f"Number of English annotations remaining: {len(video_corpus_df)}")

Number of English annotations remaining: 85511


In [10]:
print(f"There are only {video_corpus_df['VideoID'].nunique()} unique IDs present in the data (video_corpus.csv).")

There are only 1586 unique IDs present in the data (video_corpus.csv).


In [12]:
video_corpus_df.head()

Unnamed: 0,VideoID,Start,End,Source,Description
18,mv89psg6zh4,33,46,clean,A bird in a sink keeps getting under the runni...
19,mv89psg6zh4,33,46,clean,A bird is bathing in a sink.
20,mv89psg6zh4,33,46,clean,A bird is splashing around under a running fau...
21,mv89psg6zh4,33,46,clean,A bird is bathing in a sink.
22,mv89psg6zh4,33,46,clean,A bird is standing in a sink drinking water th...


In [13]:
video_corpus_df['VideoID'] = video_corpus_df['VideoID'] + '_' + video_corpus_df['Start'].astype(str) + '_' + video_corpus_df['End'].astype(str)
video_corpus_df = video_corpus_df.drop(columns=['Start', 'End', 'Source'])

In [14]:
video_corpus_df.head()

Unnamed: 0,VideoID,Description
18,mv89psg6zh4_33_46,A bird in a sink keeps getting under the runni...
19,mv89psg6zh4_33_46,A bird is bathing in a sink.
20,mv89psg6zh4_33_46,A bird is splashing around under a running fau...
21,mv89psg6zh4_33_46,A bird is bathing in a sink.
22,mv89psg6zh4_33_46,A bird is standing in a sink drinking water th...


In [15]:
video_corpus_df.to_csv("data/video_corpus_english.csv", index=False)

#### Discussion of the data
<blockquote>After cleaning, the data (saved as "video_corpus_english.csv") only contains English annotations, and I am mainly interested in the <strong>bos</strong> and <strong>eos</strong>.</blockquote>
<!-- <blockquote>In "video_corpus.csv", there are multiple annotations associated to the same video file, but as I have observed, only <strong>1586</strong> unique videos.</blockquote> -->
<blockquote>In the downloaded directory "YouTubeClips", there are <strong>1970</strong> unique files.</blockquote>
<blockquote>Other fields like "AnnotationTime" or "WorkerID" are not relevant for my task.</blockquote>
<blockquote>The "Source" column indicates whether the annotation is verified or not. Among the English annotations, about 60% are "unverified" and about 40% are "clean". By using both the "unverified" and "clean" data, I aim to observe the robustness of the model.</blockquote>
<blockquote>One last step remains before moving on to the next stage of the project, <strong>Input Preprocessing</strong>: the data should be split into training and testing sets.</blockquote>

In [11]:
def split_data(data, version='1'):
    # Split the DataFrame passed in as `data`, not the global video_corpus_df
    video_corpus_train, video_corpus_test = train_test_split(data, test_size=0.2, random_state=42, shuffle=True)
    
    print(f"Training set size: {len(video_corpus_train)}")
    print(f"Testing set size: {len(video_corpus_test)}")

    video_corpus_train.to_csv(f"data/video_corpus_train_{version}.csv", index=False)
    video_corpus_test.to_csv(f"data/video_corpus_test_{version}.csv", index=False)
    print(f"\nThe data splits have been saved locally as '{version}'.")

In [12]:
split_data(data = video_corpus_df, version = '1')

Training set size: 68408
Testing set size: 17103

The data splits have been saved locally as '1'.
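
<blockquote>Because <strong>train_test_split</strong> shuffles individual annotation rows, captions of the same clip can end up in both splits. If that overlap is not wanted, one option is to split by clip instead. The sketch below is an alternative, not the split used in this notebook: it assumes scikit-learn's <strong>GroupShuffleSplit</strong>, and the helper name <strong>split_by_video</strong> and the toy data are hypothetical.</blockquote>

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_by_video(data, group_col="VideoID", test_size=0.2, seed=42):
    """Split rows so that all rows sharing a group value stay in one split."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(data, groups=data[group_col]))
    return data.iloc[train_idx], data.iloc[test_idx]


# Toy data (hypothetical): 5 clips with 4 captions each.
toy = pd.DataFrame({
    "VideoID": [f"clip{i}" for i in range(5) for _ in range(4)],
    "Description": [f"caption {j}" for j in range(20)],
})
toy_train, toy_test = split_by_video(toy)

# Clips that appear in both splits (expected to be none).
shared = set(toy_train["VideoID"]) & set(toy_test["VideoID"])
print(f"train rows: {len(toy_train)}, test rows: {len(toy_test)}, shared clips: {len(shared)}")
```

<blockquote>Note that <strong>test_size</strong> here is a fraction of the clips rather than of the annotation rows, so the row counts of the two splits will only approximately follow the 80/20 ratio.</blockquote>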


### Input Preprocessing
<blockquote>Now that the data is loaded, I can focus on the <strong>input preprocessing</strong> step. As I have stated in the documentation, I will treat the video as a sequence of frames, but I will reduce the number of redundant frames by keeping only some key frames and some in-between frames (to capture temporality).</blockquote>
<blockquote>There are several ways to perform this frame extraction, like <strong>uniform sampling (fixed time interval)</strong>, <strong>scene change detection techniques</strong> or the simplest of them all, <strong>ordinary frame extraction</strong> (pick a fixed number of frames, transform the video into the sequence of frames, and keep only the specified number of frames).</blockquote>
<blockquote>Before performing the frame extraction, an extra step of verification should be performed: is the video annotated or not?</blockquote>

In [3]:
video_corpus_train = pd.read_csv("data/video_corpus_train_1.csv")
video_corpus_train.head()

Unnamed: 0,VideoID,Start,End,Source,Description
0,AXL1oMdCFUM,45,59,clean,A man leads a team of oxen down a muddy path.
1,yFPHhRat6bc,160,210,unverified,the girl is cooking
2,nohvigNMsbo,199,207,clean,A woman is adding essence to corn flakes.
3,PECk9A-07Pw,1,9,clean,A man playing the accordion.
4,ymC2bNi6-Is,9,19,clean,The men are in jail.


In [4]:
video_corpus_test = pd.read_csv("data/video_corpus_test_1.csv")
video_corpus_test.head()

Unnamed: 0,VideoID,Start,End,Source,Description
0,jfrrO5K_vKM,55,65,clean,A man opens a box and shows a rifle inside.
1,n_Z0-giaspE,379,387,unverified,A man falls over in the road.
2,0hyZ__3YhZc,364,370,unverified,someone is preparing something
3,-_aaMGK6GGw,57,61,unverified,A man angrily pulls a teen aged boy on to the ...
4,bruzcOyIGeg,4,12,unverified,a person is playing with toy car


In [5]:
def check_file_annotations(video, video_corpus):
    return video in video_corpus['VideoID'].values

In [6]:
def check_all(dir_path, video_corpus):
    no_annotation_list = []
    for file in os.listdir(dir_path):
        video_id = "_".join(file.replace(".avi", "").split("_")[:-2]) # get only the actual ID of the video
        if not check_file_annotations(video_id, video_corpus):
            no_annotation_list.append(file)
    return no_annotation_list
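
<blockquote>The ID extraction in <strong>check_all</strong> relies on the clip file names following the pattern VideoID_Start_End.avi, where the YouTube ID may itself contain underscores (for example n_Z0-giaspE). A minimal sketch of that parsing is shown below. The helper name <strong>parse_clip_filename</strong> is hypothetical, and the second file name is built from an annotation row shown earlier rather than read from disk.</blockquote>

```python
import os


def parse_clip_filename(filename):
    """Split '<VideoID>_<Start>_<End>.avi' into (video_id, start, end).

    Splitting from the right keeps underscores inside the YouTube ID intact.
    """
    stem, _ext = os.path.splitext(filename)
    video_id, start, end = stem.rsplit("_", 2)
    return video_id, int(start), int(end)


print(parse_clip_filename("jbzaMtPYtl8_48_58.avi"))    # ('jbzaMtPYtl8', 48, 58)
print(parse_clip_filename("n_Z0-giaspE_379_387.avi"))  # ('n_Z0-giaspE', 379, 387)
```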

In [7]:
def frame_extraction(video_path):
    video_id = os.path.splitext(os.path.basename(video_path))[0]
    
    save_dir = os.path.join("data/Frames", video_id)
    os.makedirs(save_dir, exist_ok=True)
    
    count = 0
    image_list = []
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        # Raise so that callers can record this file as failed
        raise IOError(f"Failed to open video: {video_path}")
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:  # no more frames, or the read failed
            break
        frame_filename = f"frame{count}.jpg"
        frame_path = os.path.join(save_dir, frame_filename)
        cv2.imwrite(frame_path, frame)
        image_list.append(frame_path)
        count += 1

    # No GUI windows are created here, so releasing the capture is enough
    cap.release()
    return image_list

In [8]:
def select_frames(method='uniform_sampling'):
    pass
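
<blockquote>The <strong>select_frames</strong> function is still a stub. As an illustration of the uniform sampling option mentioned earlier, the sketch below picks evenly spaced frame indices from a video with a known number of frames. The helper name <strong>uniform_frame_indices</strong> and its parameters are hypothetical, and this is not the notebook's implementation.</blockquote>

```python
def uniform_frame_indices(total_frames, num_samples):
    """Return up to `num_samples` evenly spaced frame indices in [0, total_frames - 1]."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Fewer frames than requested: keep every frame
        return list(range(total_frames))
    if num_samples == 1:
        # A single sample: take the middle frame
        return [total_frames // 2]
    # Spread the samples so that the first and last frames are included
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


print(uniform_frame_indices(10, 4))  # [0, 3, 6, 9]
print(uniform_frame_indices(3, 8))   # [0, 1, 2]
```

<blockquote>The returned indices could then be mapped to the saved files, since <strong>frame_extraction</strong> names them frame0.jpg, frame1.jpg, and so on.</blockquote>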

In [9]:
to_skip = check_all('data/YouTubeClips', video_corpus_train)
print(f"Videos without any annotation in the training data: {to_skip}")

Videos without any annotation in the training data: ['jbzaMtPYtl8_48_58.avi']


In [10]:
to_skip = check_all('data/YouTubeClips', video_corpus_test)
print(f"Videos without any annotation in the testing data: {to_skip}")

Videos without any annotation in the testing data: ['MMnnqzOoMF0_68_72.avi', 'jbzaMtPYtl8_48_58.avi']


In [20]:
print("Double-checking the existence of the video and its corresponding annotation:")
video_corpus_train[video_corpus_train['VideoID'] == 'jbzaMtPYtl8']

Double-checking the existence of the video and its corresponding annotation:


Unnamed: 0,VideoID,Start,End,Source,Description


In [21]:
def extract_frames(video_dir="data/YouTubeClips"):
    extracted = []
    failed = []
    # Only .avi files are kept here, so no second extension check is needed in the loop
    video_files = [f for f in os.listdir(video_dir) if f.endswith(".avi")]

    for filename in tqdm(video_files, desc="Extracting frames"):
        video_path = os.path.join(video_dir, filename)
        try:
            frame_extraction(video_path)
            extracted.append(filename)
        except Exception:
            # Record the file so it can be inspected or retried later
            failed.append(filename)

    return extracted, failed

In [22]:
extracted, failed = extract_frames("data/YouTubeClips")

Extracting frames: 100%|████████████████████████████| 1970/1970 [25:15<00:00,  1.30it/s]


#### Discussion
<blockquote>I checked whether any video files in the YouTubeClips directory have no associated annotation. Only one file, <strong>jbzaMtPYtl8_48_58.avi</strong>, has no annotation in either split; <strong>MMnnqzOoMF0_68_72.avi</strong> is missing only from the testing split, which means its annotations all landed in the training split. Since only one video is affected, there is no need to handle this.</blockquote>
<blockquote>While working on the frame extraction, I observed that some files are listed in video_corpus.csv but are not present in the downloaded directory containing the video clips.</blockquote>
<blockquote>Therefore, I decided to collect all the files that were not found in a list, in order to perform another cleaning step on my video_corpus.csv file.</blockquote>
<blockquote>The <strong>check_all</strong> function checks whether any videos in the data (YouTubeClips) have no annotation in the video_corpus. It does not reveal the opposite problem: there might be annotations for files which are not actually present in the YouTubeClips directory. This problem must be addressed.</blockquote>
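
<blockquote>The reverse check described above could be sketched as a set difference between the clip IDs named in the annotations and the clip files on disk. The helper name <strong>find_missing_clips</strong> and the toy inputs are hypothetical, and the sketch assumes the annotation IDs are already in the VideoID_Start_End form used in video_corpus_english.csv.</blockquote>

```python
import os


def find_missing_clips(annotated_clip_ids, video_filenames, ext=".avi"):
    """Return the annotated clip IDs that have no matching video file, sorted."""
    available = {
        os.path.splitext(name)[0]
        for name in video_filenames
        if name.endswith(ext)
    }
    return sorted(set(annotated_clip_ids) - available)


# Toy inputs (hypothetical): in practice video_filenames would come from
# os.listdir("data/YouTubeClips") and the IDs from the annotation DataFrame.
annotated = ["mv89psg6zh4_33_46", "mv89psg6zh4_33_46", "AXL1oMdCFUM_45_59"]
on_disk = ["mv89psg6zh4_33_46.avi", "notes.txt"]
print(find_missing_clips(annotated, on_disk))  # ['AXL1oMdCFUM_45_59']
```

<blockquote>The resulting list could then be used to drop the affected rows from the annotation file before training.</blockquote>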