## Copyright
Copyright (c) 2021, Her Majesty in Right of Canada as represented by the National Research Council Canada. Rights provided under GNU GENERAL PUBLIC LICENSE, Version 3. Full text of the license accessible at the [LICENSE](LICENSE) file.

---

This scripts automatically collect videos from the following sources, processes and curates them, and stores them along with their metadata on the user's device:

* ButterflyNetwork
* GrepMed
* LITFL
* The PocusAtlas
* Radiopaedia
* CoreUltrasound
* University of Florida (UF)
* Scientific Publications
* Clarius

In addition, it extracts images from the collected videos, processes and curates them, and stores images and their metadata locally on the user's device.

__Note:__ No data is stored on the NRC-COVIDx-US repository and it only contains scripts to systematically collect, curate, and integrate data on user's device.

---

#### Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import re
import shutil
import random 

import cv2
from PIL import Image

import zipfile
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import requests
from vimeo_downloader import Vimeo
import urllib.request

from progressbar import ProgressBar

import time
from image_data import extract_images

import matplotlib.pyplot as plt

import subprocess # to unzip butterfly file
import glob

In [2]:
print("Pandas", pd.__version__)
import selenium
print("selenium", selenium.__version__)
print("requests", requests.__version__)

Pandas 1.3.4
selenium 3.141.0
requests 2.26.0


#### Functions

In [3]:
def get_download_path():
    """Returns the default downloads path for linux or windows"""
    if os.name == 'nt':
        import winreg
        sub_key = r'SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders'
        downloads_guid = '{374DE290-123F-4565-9164-39C4925E467B}'
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, sub_key) as key:
            location = winreg.QueryValueEx(key, downloads_guid)[0]
        return location
    else:
        return os.path.join(os.path.expanduser('~'), 'downloads')
    
def remove_html_tags(text):
    """Function to remove html tags from a string"""
    import re
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)


In [4]:
# code to download zip files from Google drive, in case required
def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    progress = ProgressBar() 
    
    with open(destination, "wb") as f:
        for chunk in progress(response.iter_content(CHUNK_SIZE)):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

#### Parameters

In [5]:
# set save path directory
SAVE_PATH = 'data'

# create data, video, and image folders, if they do not exist
if not os.path.exists('data'):
    os.makedirs('data')
if not os.path.exists('data/video'):
    os.makedirs('data/video')
if not os.path.exists('data/image'):
    os.makedirs('data/image')
    
# setting chrome driver
chromedriver = "utils/chromedriver.exe" 
os.environ["webdriver.chrome.driver"] = chromedriver
chrome_options = Options()
chrome_options.add_argument("--headless")

# setting global vars
VIDEO_PATH = 'data/video/'
IMAGE_PATH = 'data/image/'

#### Read the Metadata File

In [6]:
metadata = pd.read_csv('utils/video_metadata.csv', sep=',', encoding='latin1')
print(metadata.shape)
metadata.head(2)

(244, 21)


Unnamed: 0,id,filename,filetype,folder,source,url,probe,class,class_on_website,version,...,type,patient,case_no,gender,age,comment,paper_link,paper_doi,license,link
0,1_butterfly_covid,Coalescing B lines.mp4,mp4,data\tmp\Butterfly\B lines,Butterfly,https://butterflynetwork.getbynder.com/transfe...,Convex,COVID,,1.0,...,lung,,,,,,,,,
1,2_butterfly_covid,Confluent B lines.mp4,mp4,data\tmp\Butterfly\B lines,Butterfly,https://butterflynetwork.getbynder.com/transfe...,Convex,COVID,,1.0,...,lung,,,,,,,,,


# 1. Get Ultrasound Videos

## 1.1. ButterflyNetwork

__Note1:__ Depending on your system configuration the download button may not load in time. If you get an error, increase the sleep time in the following code:

```python
time.sleep(5)
```

__Note2:__ The below code block works with Chrome web browser and with ChromeDriver version 88 that is included in the utils folder. Depending on the version of your chrome browser, you may get a ChromeDriver version error. If occurs, please download the correct version of ChromeDriver based on the version of your chrome browser from this [link](https://chromedriver.chromium.org/downloads), and copy it to the utils folder.

In [9]:
# zip file url
butterfly_url = metadata[(metadata.source == 'Butterfly') & (metadata.date_added == 'Mar_2021')].url.unique()[0]
print('...Downloading ButterflyNetwork zip file...')

# simulatting button click to download the zip file
browser = webdriver.Chrome(chromedriver) #, options=chrome_options)
browser.get(butterfly_url)

# Download button sometimes doesn't load in time to click. If such error occurring, increase sleep time
time.sleep(5)

browser.find_element_by_class_name('btn-primary').click() 

# path to the downloaded zip file
zip_file_path = os.path.join(get_download_path(), 'Published -20210112T164653Z-001.zip')  # new version - checked March 9, 2021

# wait till the zip file is downloaded
while not os.path.exists(zip_file_path):
    time.sleep(1)

# create butterfly folder under video folder, if it does not exist
if not os.path.exists('data/tmp/Butterfly'):
    os.makedirs('data/tmp/Butterfly')
time.sleep(2)


print('...Extracting the video files...')
# extract the downloaded zip file and remove the zip file after extraction
os.rename(zip_file_path, zip_file_path.replace(' ', ''))
subprocess.Popen("utils/7z.exe" +' x ' + zip_file_path.replace(' ', '') + ' -o' + 'data/tmp/Butterfly',stdout=subprocess.PIPE)
time.sleep(5)

# copy files from subfolders to the video folder
for root, dirs, files in os.walk('data/tmp/Butterfly/Published_'):  
    for file in files:
        if file.endswith(".png"):
            continue
        path_file = os.path.join(root,file)
        shutil.copy2(path_file, 'data/video') 

# renaming extracted files to their ids
progress = ProgressBar() 
for root, dirs, files in os.walk('data/video'):  
    for file in progress(files):
        if file.endswith(".png"):
            continue
        path_file = os.path.join(root,file)
        file_id = metadata[metadata.filename == file].id.values[0] + '.mp4'
        # rename the file to its id
        os.rename(path_file, os.path.join(root,file_id))

print('=== ButterflyNetwork video files extraction done! ===')        
        
# delete the tmp folder and its contents
shutil.rmtree('data/tmp')
        
# remove the zip file
os.remove(zip_file_path.replace(' ', ''))

...Downloading ButterflyNetwork zip file...
...Extracting the video files...


FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\HEMANTH\\Downloads\\Published -20210112T164653Z-001.zip' -> 'C:\\Users\\HEMANTH\\Downloads\\Published-20210112T164653Z-001.zip'

**Note:** If you get error running the above cell, uncomment and run the following code block instead:

In [11]:
import requests

print('Downloading Butterfly zip file')
file_id = '18I4N6lWdcUW618Qwr6Krsd1Rkn946Ag0' # sharable link id
zip_file_path = os.path.join(get_download_path(), 'butterfly.zip')
download_file_from_google_drive(file_id, zip_file_path)

# unzip video files
print('Download complete\nExtracting video files')
open_file = subprocess.Popen("utils/7z.exe" +' x ' + zip_file_path + ' -o' + 'data/video',stdout=subprocess.PIPE)
print('Extraction complete')

Downloading Butterfly zip file


| |#                                                  | 0 Elapsed Time: 0:00:00


Download complete
Extracting video files
Extraction complete


## 1.2. GrepMed

In [12]:
print('...Extracting the video files...')
grepmed_df = metadata[metadata.source == 'GrepMed']

progress = ProgressBar(max_value=grepmed_df.shape[0]) 
for idx, row in progress(grepmed_df.iterrows()):
    filename = row.id + '.' + row.filetype
    # write the video file to disk
    vid = requests.get(row.url).content
    with open(os.path.join('data/video/', filename), 'wb') as handler:
        handler.write(vid)
print('=== GrepMed video files extraction done! ===')        

  5% (1 of 20) |#                        | Elapsed Time: 0:00:00 ETA:   0:00:03

...Extracting the video files...


100% (20 of 20) |########################| Elapsed Time: 0:00:03 Time:  0:00:03


=== GrepMed video files extraction done! ===


## 1.3. LITFL

In [13]:
print('...Extracting the video files...')
litfl_df = metadata[metadata.source == 'Litfl']

progress = ProgressBar(max_value=litfl_df.shape[0]) 
for idx, row in progress(litfl_df.iterrows()):
    filename = row.id + '.' + row.filetype
    # write the video file to disk
    vid = requests.get(row.url).content
    with open(os.path.join('data/video/', filename), 'wb') as handler:
        handler.write(vid)
print('=== LITFL video files extraction done! ===')        

  0% (0 of 63) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--

...Extracting the video files...


100% (63 of 63) |########################| Elapsed Time: 0:00:22 Time:  0:00:22


=== LITFL video files extraction done! ===


## 1.4. The POCUS Atlas

In [14]:
print('...Extracting the video files...')
pocus_df = metadata[metadata.source == 'PocusAtlas']

progress = ProgressBar(max_value=pocus_df.shape[0]) 
for idx, row in progress(pocus_df.iterrows()):
    filename = row.id + '.' + row.filetype
    # write the video file to disk
    vid = requests.get(row.url).content
    with open(os.path.join('data/video/', filename), 'wb') as handler:
        handler.write(vid)
print('=== THEPocusAtlas video files extraction done! ===')        

  0% (0 of 32) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--

...Extracting the video files...


100% (32 of 32) |########################| Elapsed Time: 0:00:06 Time:  0:00:06


=== THEPocusAtlas video files extraction done! ===


## 1.5. Radiopaedia

In [15]:
print('...Extracting the video files...')
radio_df = metadata[metadata.source == 'Radiopaedia']

progress = ProgressBar(max_value=radio_df.shape[0]) 
for idx, row in progress(radio_df.iterrows()):
    filename = row.id + '.' + row.filetype
    # write the video file to disk
    vid = requests.get(row.url).content
    with open(os.path.join('data/video/', filename), 'wb') as handler:
        handler.write(vid)
print('=== Radiopaedia video files extraction done! ===')        

  0% (0 of 5) |                          | Elapsed Time: 0:00:00 ETA:  --:--:--

...Extracting the video files...


100% (5 of 5) |##########################| Elapsed Time: 0:00:01 Time:  0:00:01


=== Radiopaedia video files extraction done! ===


## 1.6. CoreUltrasound

In [16]:
print('...Extracting the video files...')
core_df = metadata[metadata.source == 'CoreUltrasound']

progress = ProgressBar(max_value=core_df.shape[0]) 
for idx, row in progress(core_df.iterrows()):
    filename = row.id + '.' + row.filetype
    
    # extract videos from Vimeo
    if 'vimeo' in row.url:
        v = Vimeo(row.url)
        stream = v.streams # List of available streams of different quality
        highest_quality_available = stream[-1]
        highest_quality_available.download(download_directory = 'data/video/', filename = filename.split('.')[0])
    # extract mp4 videos
    else:
        # write the video file to disk
        vid = requests.get(row.url).content
        with open(os.path.join('data/video/', filename), 'wb') as handler:
            handler.write(vid)
print('=== CoreUltrasound video files extraction done! ===')        

  0% (0 of 18) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--

...Extracting the video files...


157_core_other.mp4: 1294KB [00:00, 11629.72KB/s]                                                                       
158_core_pneumonia.mp4: 3892KB [00:00, 23289.12KB/s]                                                                   
159_core_other.mp4: 3847KB [00:00, 24074.05KB/s]                                                                       
160_core_other.mp4: 3847KB [00:00, 23567.36KB/s]                                                                       
161_core_other.mp4: 4805KB [00:00, 21292.12KB/s]                                                                       
162_core_other.mp4: 4805KB [00:00, 26585.38KB/s]                                                                       
174_core_covid.mp4: 1633KB [00:00, 13163.75KB/s]                                                                       
100% (18 of 18) |########################| Elapsed Time: 0:00:16 Time:  0:00:16


=== CoreUltrasound video files extraction done! ===


## 1.7. UF

In [17]:
paper_df = metadata[(metadata.source == 'Paper') & ((metadata['id'].str.contains('199', na=False)) | (metadata['id'].str.contains('200', na=False)))] 

progress = ProgressBar(max_value=paper_df.shape[0]) 
for idx, row in progress(paper_df.iterrows()):
    filename = row.id + '.' + row.filetype
    
    # write the video file to disk
    r = requests.get(row.url, stream=True, headers={'User-agent': 'Mozilla/5.0'})
    if r.status_code == 200:
        with open(os.path.join('data/video/', filename), 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)       

    # set a random delay, otherwise the connection gets closed
    delay = random.randint(3, 5)
    time.sleep(delay)
print('=== 2 extra video files downloaded! ===')        

100% (2 of 2) |##########################| Elapsed Time: 0:00:08 Time:  0:00:08


=== 2 extra video files downloaded! ===


In [18]:
print('...Extracting the video files...')
uf_df = metadata[metadata.source == 'UF']

progress = ProgressBar(max_value=uf_df.shape[0]) 
for idx, row in progress(uf_df.iterrows()):
    filename = row.id + '.' + row.filetype
    
    # write the video file to disk
    r = requests.get(row.url, stream=True, headers={'User-agent': 'Mozilla/5.0'})
    if r.status_code == 200:
        with open(os.path.join('data/video/', filename), 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)       

    # set a random delay, otherwise the connection gets closed
    delay = random.randint(3, 5)
    time.sleep(delay)
print('=== UF video files extraction done! ===')        

  0% (0 of 24) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--

...Extracting the video files...


100% (24 of 24) |########################| Elapsed Time: 0:02:32 Time:  0:02:32


=== UF video files extraction done! ===


## 1.8. Scientific Publications

In [19]:
print('...Extracting the video files...')
paper_df = metadata[(metadata.source == 'Paper')] 

progress = ProgressBar(max_value=paper_df.shape[0]) 
for idx, row in progress(paper_df.iterrows()):
    filename = row.id + '.' + row.filetype
    
    r = requests.get(row.url, stream=True, headers={'User-agent': 'Mozilla/5.0'})
    if r.status_code == 200:
        with open(os.path.join('data/video/', filename), 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)       
        
    # set a random delay, otherwise the connection gets closed
    if (('241_' in row.id) | ('242_' in row.id) | ('243_' in row.id)): #longer delay for last files
        delay = random.randint(10, 20)
    else:
        delay = random.randint(3, 5)
    time.sleep(delay)
print('=== Video files extraction from papers is done! ===')        

  0% (0 of 22) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--

...Extracting the video files...


100% (22 of 22) |########################| Elapsed Time: 0:02:15 Time:  0:02:15


=== Video files extraction from papers is done! ===


## 1.9. Clarius
* Extracting the first part of Clarius files (**6 files**)

In [20]:
print('...Extracting the video files...')
clarius_df = metadata[metadata.source == 'Clarius'].iloc[:6, :]

progress = ProgressBar(max_value=clarius_df.shape[0]) 
for idx, row in progress(clarius_df.iterrows()):
    filename = row.id + '.' + row.filetype
    
    # write the video file to disk
    r = requests.get(row.url, stream=True, headers={'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64)'})
    if r.status_code == 200:
        with open(os.path.join('data/video/', filename), 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)       

    # set a random delay, otherwise the connection gets closed
    delay = random.randint(3, 5)
    time.sleep(delay)
print('=== Clarius video files extraction done! ===')        

  0% (0 of 6) |                          | Elapsed Time: 0:00:00 ETA:  --:--:--

...Extracting the video files...


100% (6 of 6) |##########################| Elapsed Time: 0:00:27 Time:  0:00:27


=== Clarius video files extraction done! ===


* Extracting the second part of Clarius files (**17 files**)

In [21]:
import requests

print('Downloading Clarius zip file...')
file_id = '1bqsqNzAJYwdriOP9CcGWPzCUB-7G72Ta' # sharable link id
zip_file_path = os.path.join(get_download_path(), 'clarius.zip')
download_file_from_google_drive(file_id, zip_file_path)

# unzip video files
open_file = subprocess.Popen("utils/7z.exe" +' x ' + zip_file_path + ' -o' + 'data/video',stdout=subprocess.PIPE)
print('=== Clarius video files extraction done! ===')

Downloading Clarius zip file...


| |#                                                  | 0 Elapsed Time: 0:00:00


=== Clarius video files extraction done! ===


# 2. Video Preprocessing

#### Move original video files to the original folder

In [23]:
source_dir = 'data/video/'
target_dir = 'data/video/original'
    
file_names = os.listdir(source_dir)

if not os.path.exists('data/video/original'): 
    os.makedirs('data/video/original') 

progress = ProgressBar()
for file_name in progress(file_names):
    shutil.move(os.path.join(source_dir, file_name), target_dir)

100% (1 of 1) |##########################| Elapsed Time: 0:00:00 Time:  0:00:00


## 2.1. Fetching Video Files Properties

In [24]:
VIDEO_PATH_ORG = 'data/video/original/'

vid_files = os.listdir(VIDEO_PATH_ORG)

progress = ProgressBar(max_value=metadata.shape[0]) 
with open('utils/video_files_properties.csv', 'w') as f:
    # write the file header
    f.write('filename,framerate,width,height,frame_count,duration_secs\n')
    
    # loop over the video files and get their properties
    for vid in progress(vid_files):
        vid_filename = VIDEO_PATH_ORG + str(vid)
        file_type = vid.split('.')[-1]
        
        # get video file properties
        cv2video = cv2.VideoCapture(vid_filename)
        height = cv2video.get(cv2.CAP_PROP_FRAME_HEIGHT)
        width  = cv2video.get(cv2.CAP_PROP_FRAME_WIDTH) 
        frame_rate = round(cv2video.get(cv2.CAP_PROP_FPS), 2)
        frame_count=0
        duration=0
        if file_type == 'mp4':
            frame_count = cv2video.get(cv2.CAP_PROP_FRAME_COUNT) 
            duration = round((frame_count / frame_rate), 2)
        elif file_type == 'gif':
            frame_count = round(Image.open(vid_filename).n_frames) #round((duration * frame_rate ), 0)
            duration = round((frame_count / frame_rate), 2)

        # write video properties to the file
        line_to_write = str(vid) + ',' + str(frame_rate) + ',' + str(width) + ',' + str(height) + ',' + str(frame_count) + ',' + str(duration) + '\n'
        f.write(line_to_write)

100% (244 of 244) |######################| Elapsed Time: 0:00:07 Time:  0:00:07


## 2.2. Video Preprocessing

In [25]:
VIDEO_CROPPED_OUT = 'data/video/cropped/' #processed/cropped/'

# create processed and cropped folder if they don't already exist
if not os.path.exists('data/video/cropped'): #processed/cropped'):
    os.makedirs('data/video/cropped') #processed/cropped')

### 2.2.1. Inital Cropping

In [26]:
# read cropping metadata file
vid_crp_metadata = pd.read_csv('utils/video_cropping_metadata.csv', sep=',', encoding='latin1')
print(vid_crp_metadata.shape)
vid_crp_metadata.head(2)

(243, 27)


Unnamed: 0,filename,source,probe,class,org_width,org_height,org_framecount,org_framerate,org_duration,green_dot,...,del_upper,width_rate,x1_w_y1_h,cropped_filename,crp_width,crp_height,version,date_added,multiple_videos,Note
0,1_butterfly_covid.mp4,Butterfly,Convex,COVID,880,1080,65,19.57,3.32,no,...,15.0,0.035,,1_butterfly_covid_prc.avi,820.0,820.0,1.0,Nov_2020,,
1,2_butterfly_covid.mp4,Butterfly,Convex,COVID,720,1236,818,30.0,27.27,yes,...,83.0,0.068,,2_butterfly_covid_prc.avi,624.0,624.0,1.0,Nov_2020,,


In [27]:
progress = ProgressBar(max_value=vid_crp_metadata.shape[0])

for idx, row in progress(vid_crp_metadata.iterrows()):
    vid_arr = []  # array to store frames of a video file
    
    filename = row.filename
    file_label = filename.split('_')[-1].split('.')[0] # label of the video file
    
    # the following file was removed in the new release of butterfly data
    if filename == '22_butterfly_covid.mp4':
        continue
    
    cap = cv2.VideoCapture(os.path.join(VIDEO_PATH_ORG, filename))
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH) + 0.5)
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT) + 0.5)
    dim = (width, height) # dimension of the original file
    
    if pd.isna(row.x1_w_y1_h): # square cropping
        DEL_UPPER = int(row.del_upper) # to remove top
        WIDTH_RATE = float(row.width_rate) # to remove sides e.g. the meter
        
        width_border = int(width * WIDTH_RATE)
        width_box = int(width - (2 * width_border)) 
        if width_box + DEL_UPPER > height:
            width_box = int(height - DEL_UPPER)
            width_border = int( (width / 2) - (width_box / 2))

        while(True):
            ret, frame = cap.read()

            if not ret:
                break

            # crop
            frame = frame[DEL_UPPER:width_box + DEL_UPPER, width_border:width_box + width_border]

            frame = np.asarray(frame).astype(np.uint8)
            vid_arr.append(frame)

    else: # crop using (x1,y1) and (x2, y2). The output will not be necessarily a square file
        X1 = int(row.x1_w_y1_h.split(',')[0].replace('(', ''))
        W = int(row.x1_w_y1_h.split(',')[1].strip())
        Y1 = int(row.x1_w_y1_h.split(',')[2].strip())
        H = int(row.x1_w_y1_h.split(',')[3].replace(')', '').strip())

        while(True):
            ret, frame = cap.read()

            if not ret:
                break

            # crop
            frame = frame[Y1:Y1 + H, X1:X1 + W]

            frame = np.asarray(frame).astype(np.uint8)
            vid_arr.append(frame)

    vid_arr = np.asarray(vid_arr)
    # print("vid_arr.shape {}".format(vid_arr.shape))
    if (len(vid_arr.shape) != 4):
        continue
    prc_dim = vid_arr.shape[1:3] # dimension of the cropped file
    prc_dim = (prc_dim[1], prc_dim[0])

    fourcc = cv2.VideoWriter_fourcc(*'XVID')
    out = cv2.VideoWriter(os.path.join(VIDEO_CROPPED_OUT + filename.split('.')[0] + '_prc.avi'), fourcc, 20.0, tuple(prc_dim))

    for frame in vid_arr:
        out.write(frame.astype("uint8"))

    vid_crp_metadata.iloc[idx, vid_crp_metadata.columns.get_loc('crp_width')] = prc_dim[1]
    vid_crp_metadata.iloc[idx, vid_crp_metadata.columns.get_loc('crp_height')] = prc_dim[0]

    cap.release()
    out.release()
    cv2.destroyAllWindows()

vid_crp_metadata.to_csv('utils/video_cropping_metadata.csv', index=None)

print('Initial cropping done...')

100% (243 of 243) |######################| Elapsed Time: 0:01:19 Time:  0:01:19


Initial cropping done...


# 3. Extract Ultrasound Images from Videos

#### Read video properties

In [28]:
vid_prop_df = pd.read_csv('utils/video_files_properties.csv')

# merge with the video meta data file 
vid_prop_df.filename = vid_prop_df.filename.astype(str)
vid_prop_df.filename = vid_prop_df.filename.str.strip()

metadata['filename2'] = metadata.id + '.' + metadata.filetype
metadata.filename2 = metadata.filename2.astype(str)
metadata.filename2 = metadata.filename2.str.strip()

vid_prop_df = pd.merge(vid_prop_df, metadata[['filename2', 'source', 'probe', 'class']], left_on='filename', right_on='filename2', how='left').drop('filename2', axis=1)

del metadata['filename2']
print(vid_prop_df.shape)
vid_prop_df.head(2)

(188, 9)


Unnamed: 0,filename,framerate,width,height,frame_count,duration_secs,source,probe,class
0,100_litfl_other.mp4,15.0,480.0,360.0,46.0,3.07,Litfl,Convex,Other
1,101_litfl_other.mp4,15.0,480.0,360.0,28.0,1.87,Litfl,Convex,Other


#### Extract frames from original videos
* v1.4.: 32,052 images are extracted
* v1.3.: 19,161 images are extracted
* v1.2.: 15,282 images are extracted

In [29]:
IMAGE_PATH_ORG = 'data/image/original/'

# create a folder for images extracted from original videos, if doesn't exist
if not os.path.exists(IMAGE_PATH_ORG):
    os.makedirs(IMAGE_PATH_ORG)

In [30]:
extract_images(video_path= VIDEO_PATH_ORG, image_path=IMAGE_PATH_ORG, cropped=False)

100% (188 of 188) |######################| Elapsed Time: 0:02:47 Time:  0:02:47


## 3.1. Extract frames from cropped video files

In [31]:
IMAGE_CROPPED_OUT = 'data/image/cropped/'
IMAGE_MASK_OUT = 'data/mask/'

# create cropped and inpainted image folders and the mask folder if they don't already exist
if not os.path.exists(IMAGE_CROPPED_OUT):
    os.makedirs(IMAGE_CROPPED_OUT)
if not os.path.exists(IMAGE_MASK_OUT):
    os.makedirs(IMAGE_MASK_OUT)

In [32]:
extract_images(video_path= VIDEO_CROPPED_OUT, image_path=IMAGE_CROPPED_OUT, cropped=True)

100% (243 of 243) |######################| Elapsed Time: 0:01:41 Time:  0:01:41


### 3.1.1. (Optional) Extracting frames from cropped ultrasouund video files using a parameter set as filter
* You can extract images using the follwoing parameters:
    * maximum number of frames to be extracted from each video file
    * extracting a targetted set of classes from ['COVID', 'Pneumonia', 'Normal', 'Other']
    * extracting a targetted set of data sources from ['Butterfly', 'GrepMed', 'LITFL', 'PocusAtlas', 'CU', 'Radiopaedia', 'UF', 'Paper', 'Clarius']
    * extracting a targetted set of probes from ['convex', 'linear']

In [None]:
#extract_images(video_path= VIDEO_CROPPED_OUT, image_path=IMAGE_CROPPED_OUT, cropped=True, 
#                max_frames=10, 
#                target_class=['COVID', 'Pneumonia', 'Normal'],
#                target_source=['Butterfly', 'GrepMed', 'LITFL', 'PocusAtlas'],
#                target_probe=['convex', 'linear']))

## 3.2. Preprocessing Images

#### Read image preprocessing metadata

In [33]:
image_prc_df = pd.read_csv('utils/mask_metadata.csv')

image_prc_df = image_prc_df[image_prc_df.filename !='22_butterfly_covid.mp4'] # 22_butterfly_covid.mp4 was removed in March release of butterfly

print(image_prc_df.shape)
image_prc_df.head(2)

(242, 17)


Unnamed: 0,filename,source,probe,class,org_width,org_height,cropped_filename,crp_width,crp_height,need_mask_after_crop,need_multiple_masks,frame_specific_masks,delete_frames_from_to,mask_main_filename,tight_inpainting,version,date_added
0,1_butterfly_covid.mp4,Butterfly,Convex,COVID,880,1080,1_butterfly_covid_prc.avi,820,820,yes,no,,,1_butterfly_covid_prc_convex_frame0_mask.jpg,no,1.0,Nov_2020
1,2_butterfly_covid.mp4,Butterfly,Convex,COVID,720,1236,2_butterfly_covid_prc.avi,624,624,yes,yes,"118-130, 134-139, 147-150","131-133, 143-146, 154-202, 210-813",2_butterfly_covid_prc_convex_frame0_mask.jpg,no,1.0,Nov_2020


### 3.2.1. Removing frames with artifacts on the ROI
Some frames of the following files need to be deleted as the moving pointer is on ROI, we will remove them from the cropped images folder:
* 2_butterfly_covid.mp4
* 6_butterfly_covid.mp4
* 16_butterfly_covid.mp4
* 20_butterfly_normal.mp4
* 22_butterfly_covid.mp4 (it was removed in the March release of butterfly data)
* 25_grepmed_pneumonia.mp4
* 178_uf_other.mp4 (initial 30 frames are removed)
* 184_uf_pneumonia.mp4 (initial 30 frames are removed)

We need 2 masks for the following videos:
* 178_uf_other.mp4 
* 184_uf_pneumonia.mp4

**Number of frames:**
* __Initial total number of frames:__ 
    * v1.4.: 32,052
    * v1.3.: 19,161
    * v1.2.: 13,646
* __Total number of frames after removal:__ 
    * v1.4.: 29,651
    * v1.3.: 16,822
    * v1.2.: 11,307

In [39]:
progress = ProgressBar(max_value=image_prc_df[~pd.isna(image_prc_df.delete_frames_from_to)].shape[0])

for idx, row in progress(image_prc_df[~pd.isna(image_prc_df.delete_frames_from_to)].iterrows()):
    frames_to_delete = row.delete_frames_from_to.strip().split(',')
    frame_name_main = row.mask_main_filename.split('.')[0].split('_frame')[0]
    
    for frames in frames_to_delete:
        from_frame = int(frames.split('-')[0])
        to_frame = int(frames.split('-')[1]) + 1
        
        # delete frames with moving part on the roi
        for i in range(from_frame, to_frame):
            file_to_remove = IMAGE_CROPPED_OUT + frame_name_main + '_frame' + str(i) + '.jpg'
            if os.path.exists(file_to_remove):
                os.remove(file_to_remove)

print("=== Files removed! ===")

100% (7 of 7) |##########################| Elapsed Time: 0:00:00 Time:  0:00:00


=== Files removed! ===


### 3.2.2. Applying masks

In [40]:
CLEAN_IMAGE_OUT = 'data/image/clean/'
CLEAN_VIDEO_OUT = 'data/video/clean/'

# create clean image and video folders if they don't already exist
if not os.path.exists(CLEAN_IMAGE_OUT):
    os.makedirs(CLEAN_IMAGE_OUT)
if not os.path.exists(CLEAN_VIDEO_OUT):
    os.makedirs(CLEAN_VIDEO_OUT)

In [49]:
def zero_pad_array(arr, pad=5):
    if len(arr.shape) == 3:
        padded_arr = np.zeros((arr.shape[0]+2*pad, arr.shape[1]+2*pad, arr.shape[2]), dtype=np.uint8)
        padded_arr[pad:pad + arr.shape[0], pad:pad + arr.shape[1], :] = arr
    else:
        padded_arr = np.zeros((arr.shape[0]+2*pad, arr.shape[1]+2*pad), dtype=np.uint8)
        padded_arr[pad:pad + arr.shape[0], pad:pad + arr.shape[1]] = arr
    return padded_arr
        

def frame_inpainting(frame_dict, mask, default_mask=0, kernel_size=(5,5), method='telea', pad=5):
    '''
    The function performs inpainting on frames using the created masks
    
    - frame_dict: dict of frames from video, indexed by frame number
    - mask: (h, w, 1) array if single mask, else dict of such arrays
        indexed by frame number
    - default_mask: index for mask to be used as default, for frames
        without specific mask (if mask is not constant across frames)
    - kernel_size: Size of patch used to perform inpainting
    - method: one of 'ns' (navier-stokes) or 'telea' - telea usually works better
    '''
    # Dilate mask make sure it covers enough of the ROI to be masked
    kernel = np.ones(kernel_size, np.uint8)
    if type(mask) is not dict:
        mask = {default_mask: mask}
    masks_processed = {key:cv2.dilate(zero_pad_array(m, pad=pad), kernel, iterations=1) for key, m in mask.items()}
    
    method_dict = {'ns':cv2.INPAINT_NS, 'telea':cv2.INPAINT_TELEA}
    
    frames_inpainted = {}
    for key, frame in frame_dict.items():
        if key in masks_processed:
            #print(frame.shape, masks_processed[key].shape)
            frames_inpainted[key] = cv2.inpaint(zero_pad_array(frame, pad=pad), masks_processed[key], 3, method_dict[method])[pad:-pad, pad:-pad, :]
        else: # default mask
            frames_inpainted[key] = cv2.inpaint(zero_pad_array(frame, pad=pad), masks_processed[default_mask], 3, method_dict[method])[pad:-pad, pad:-pad, :]
    print(frames_inpainted)
    return frames_inpainted

In [50]:
progress = ProgressBar(max_value=image_prc_df.shape[0])

for idx, row in progress(image_prc_df.iterrows()):     
    # get the main token of the filename
    if row.probe == 'Convex':
        filename_main = row.filename.split('.')[0] + '_prc_convex'
    elif row.probe == 'Linear':
        filename_main = row.filename.split('.')[0] + '_prc_linear'
        
    if row.tight_inpainting == 'yes':
        # objects close to ROI, avoid bleeding while inpainting
        inpainting_kernel_size = (1,1)
    else:
        # no objects close to ROI, more effective inpainting
        inpainting_kernel_size = (5,5)

    # check if the cropped frames need cleaning
    if row.need_mask_after_crop == 'no':
        frames = {}
        
        # 1. no clearning, copy cropped images and rename them to clean folder
        for file in os.listdir(IMAGE_CROPPED_OUT):
            if file.startswith(filename_main):
                #last_part = file.split('_')[-1]
                #last_part = last_part.replace('frame', '_clean_frame')
                new_filename = file.replace('frame', 'clean_frame')
                #print(file, new_filename) #, last_part)
                shutil.copy(IMAGE_CROPPED_OUT + file, CLEAN_IMAGE_OUT + new_filename)

                img = cv2.imread(os.path.join(CLEAN_IMAGE_OUT, new_filename))
                frame_num = int(re.search(r'\d+$', file[:-4]).group())
                frames[frame_num] = img
        
        # make a video out of the clean frames
        keys = list(frames.keys())
        keys.sort()
        clean_vid_frames = [frames[k] for k in keys]

        h, w, layers = clean_vid_frames[0].shape
        size = (w, h)

        out = cv2.VideoWriter(os.path.join(CLEAN_VIDEO_OUT + filename_main + '_clean.avi'), cv2.VideoWriter_fourcc(*'DIVX'), 15, size)
        for i in range(len(clean_vid_frames)):
            out.write(clean_vid_frames[i])
        out.release()    
        
    # 2. frames need cleaning
    else: 
        # create a dictionary of frames
        frames = {}
        for f in os.listdir(IMAGE_CROPPED_OUT):
            if f.startswith(filename_main):
                img = cv2.imread(os.path.join(IMAGE_CROPPED_OUT, f))
                frame_num = int(re.search(r'\d+$', f[:-4]).group())
                frames[frame_num] = img

        # check if the video file requires multiple masks or a single mask is enough
        if row.need_multiple_masks == 'no':
            mask = cv2.imread(os.path.join(IMAGE_MASK_OUT, filename_main + '_frame0_mask.jpg'))[:,:,0]

            # perform inpainting on frames using a single main mask
            frames_inpainted = frame_inpainting(frames, mask, kernel_size=inpainting_kernel_size)
        else:
            masks = {}
           
            for f in os.listdir(IMAGE_MASK_OUT):
                if f.startswith(filename_main):
                    img = cv2.imread(os.path.join(IMAGE_MASK_OUT, f))
                    frame_num = int(re.search(r'\d+$', f[:-9]).group())
                    masks[frame_num] = img[:,:,0]

            # perform inpainting on frames using multiple masks
            frames_inpainted = frame_inpainting(frames, masks, default_mask=0, kernel_size=inpainting_kernel_size)

        # write clean frames to the disk
        for key, value in frames_inpainted.items():
            cv2.imwrite(CLEAN_IMAGE_OUT + filename_main + "_clean_frame" + str(key) + ".jpg", value)

        # write clean video to the disk
        keys = list(frames_inpainted.keys())
        keys.sort()
        clean_vid_frames = [frames_inpainted[k] for k in keys]
        print(len(clean_vid_frames))
        h, w, layers = clean_vid_frames[0].shape
        size = (w, h)

        out = cv2.VideoWriter(os.path.join(CLEAN_VIDEO_OUT + filename_main + '_clean.avi'), cv2.VideoWriter_fourcc(*'DIVX'), 15, size)
        for i in range(len(clean_vid_frames)):
            out.write(clean_vid_frames[i])
        out.release()    

  0% (0 of 242) |                        | Elapsed Time: 0:00:00 ETA:  --:--:--

{}
0


IndexError: list index out of range

-----


# Data Science Ethics Checklist

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

## A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [ ] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - [ ] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

## B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

## C. Analysis
 - [ ] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [ ] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [ ] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [ ] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

## D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [ ] **D.5 Communicate bias**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

## E. Deployment
 - [ ] **E.1 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.2 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [ ] **E.3 Concept drift**: Do we test and monitor for concept drift to ensure the model remains fair over time?
 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

*Data Science Ethics Checklist generated with [deon](http://deon.drivendata.org).*
