## Classify Videos, the .ipynb version

A cell-by-cell breakdown of how this script works, along with all changes made so far.

### A) Imports, directory, and parser set-up

Before running this section, install the dependencies from a terminal with `pip install torch torchvision`. While adding arguments to the parser, I sometimes hit an error caused by the default paths, which were built from `os.getenv("HOME")`. That call returns `None` in environments where `HOME` is not set, most commonly on Windows, where it is not a standard environment variable, and `os.path.join(None, ...)` then raises a `TypeError`. To handle this, the notebook falls back to `os.path.expanduser("~")` whenever `HOME` is unset.

In [1]:
import argparse, subprocess, datetime, os, pdb, sys

In [2]:
from Utils.CichlidActionRecognition import ML_model
from Utils.DataPrepare import DP_worker

In [3]:
# Get the HOME environment variable, falling back to the user's home directory when it is unset (e.g. on Windows)
home_dir = os.getenv("HOME") or os.path.expanduser("~")

In [4]:
parser = argparse.ArgumentParser(description='This script takes a trained model and applies it to new video clips')
needsDir = []

### B) Input data

The setup below helps make the script more flexible and user-friendly by allowing users to specify different directories for their various files, while ensuring that necessary directories are created if they don't already exist.

- `--Input_videos_directory` is the directory that holds all the labeled video clips.
- `--Videos_to_project_file` (.csv) maps each video clip to the project its animal belongs to.
- `--Trained_model_file` (.pth) is the model data saved from the previous training.
- `--Trained_categories_file` (.json) is the categories file used in the previous training.
- `--Training_options_file` (.log) is the log file from the previous training.
- `--Output_file` (.csv) records the confidence and label for each video clip.

In [5]:
# Directory of video clips
parser.add_argument('--Input_videos_directory',
                    type = str, 
                    default = os.path.join(home_dir,'data/labeled_videos'),
                    required = False, 
                    help = 'Name of directory to hold all video clips')
needsDir.append("Input_videos_directory")

In [6]:
# Mapping of video clips to project
parser.add_argument('--Videos_to_project_file',
                    type = str, 
                    default = os.path.join(home_dir,'data/videoToProject.csv'),
                    help = 'Project each animal belongs to')
needsDir.append("Videos_to_project_file")

In [7]:
# Saving the previous training's model results
parser.add_argument('--Trained_model_file',
                    default = os.path.join(home_dir,'data/model.pth'),
                    type = str,
                    help = 'Save data (.pth) of previous training')
needsDir.append("Trained_model_file")

In [8]:
# JSON file previously used for training
parser.add_argument('--Trained_categories_file',
                    type = str, 
                    default = os.path.join(home_dir,'data/train.json'),
                    help = 'JSON file previously used for training')
needsDir.append("Trained_categories_file")

In [9]:
# Log file used for training
parser.add_argument('--Training_options_file',
                    type = str, 
                    default = os.path.join(home_dir,'data/log_test/val.log'),
                    help = 'log file in training')
needsDir.append("Training_options_file")

In [10]:
# Output CSV that details the confidence and label for each video clip 
parser.add_argument('--Output_file',
                    type = str, 
                    default = os.path.join(home_dir,'data/confusionMatrix.csv'),
                    help = 'CSV file that keeps the confidence and label for each video clip')
needsDir.append("Output_file")
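
Because every path argument has a default under `home_dir`, any subset of them can be overridden. A notebook has no command line, so one option is to pass an explicit list to `parse_args`. The following is a minimal sketch using a stand-in parser with two of the arguments above; `demo_parser` and the `/scratch/my_clips` path are made up for illustration.

```python
import argparse
import os

home_dir = os.getenv("HOME") or os.path.expanduser("~")

# Stand-in parser with two of the path arguments defined above
demo_parser = argparse.ArgumentParser()
demo_parser.add_argument('--Input_videos_directory', type=str,
                         default=os.path.join(home_dir, 'data/labeled_videos'))
demo_parser.add_argument('--Output_file', type=str,
                         default=os.path.join(home_dir, 'data/confusionMatrix.csv'))

# Passing a list to parse_args() bypasses sys.argv, which is convenient in a notebook.
# '/scratch/my_clips' is a hypothetical path used only for illustration.
demo_args = demo_parser.parse_args(['--Input_videos_directory', '/scratch/my_clips'])

print(demo_args.Input_videos_directory)  # the overridden value
print(demo_args.Output_file)             # still the default under home_dir
```

Arguments that are not listed keep their defaults, so only the paths that differ from the standard layout need to be spelled out.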

### C) Temporary directories

These directories hold temporary clips and files that are deleted by the end of the analysis:

- `--Temporary_clips_directory` is the location where the temporary clips are stored.
- `--Temporary_output_directory` is the location where the temporary files are stored.

In [11]:
# Location of temporary clips
parser.add_argument('--Temporary_clips_directory',
                    default = os.path.join(home_dir,'data/clips_temp'),
                    type = str, 
                    required = False, 
                    help = 'Location for temp clips to be stored')
needsDir.append("Temporary_clips_directory")

In [12]:
# Location of temporary files
parser.add_argument('--Temporary_output_directory',
                    default = os.path.join(home_dir,'data/intermediate_temp'),
                    type = str, 
                    required = False, 
                    help = 'Location for temp files to be stored')
needsDir.append("Temporary_output_directory")
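
The cleanup step itself is not shown in this notebook. As a rough sketch, assuming both temporary directories can be removed wholesale once the analysis finishes, a small helper built on `shutil.rmtree` would do. `cleanup_temp_dirs` is a hypothetical name, and the demonstration runs inside a scratch directory so that no real data is touched.

```python
import os
import shutil
import tempfile

def cleanup_temp_dirs(*paths):
    """Remove each temporary directory tree, skipping any that are already gone."""
    removed = []
    for path in paths:
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(path)
    return removed

# Demonstration inside a scratch directory so nothing real is touched
scratch = tempfile.mkdtemp()
clips_temp = os.path.join(scratch, 'clips_temp')
intermediate_temp = os.path.join(scratch, 'intermediate_temp')
os.makedirs(clips_temp)
os.makedirs(intermediate_temp)

removed = cleanup_temp_dirs(clips_temp, intermediate_temp)
print(f"Removed {len(removed)} temporary directories")
shutil.rmtree(scratch)
```

In the notebook, the equivalent call would be `cleanup_temp_dirs(args.Temporary_clips_directory, args.Temporary_output_directory)` after classification completes.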

### D) Arguments that don't require a file to be passed in

These arguments take plain values rather than file or directory paths. Most of them are hyperparameters for the model, the data loader, and the optimizer.

In [13]:
# Purpose of the script
parser.add_argument('--Purpose',
                    type = str, 
                    default = 'classify',
                    help = 'classify is the only function for this script for now')

# Batch size for the model
parser.add_argument('--batch_size', 
                    default=13, 
                    type=int, help='Batch Size')

# Number of workers
parser.add_argument('--n_threads',
                    default=5,
                    type=int,
                    help='Number of threads for multi-thread loading')

# GPU card to use
parser.add_argument('--gpu_card',
                    default='1',
                    type=str,
                    help='gpu card to use')

_StoreAction(option_strings=['--gpu_card'], dest='gpu_card', nargs=None, const=None, default='1', type=<class 'str'>, choices=None, required=False, help='gpu card to use', metavar=None)

In [14]:
# Similar parameters, but these are for the dataloader

# The sample duration of each inputted clip
parser.add_argument('--sample_duration',
                    default=96,
                    type=int,
                    help='Temporal duration of inputs')

# Standardized height and width of inputs                    
parser.add_argument('--sample_size',
                    default=120,
                    type=int,
                    help='Height and width of inputs')

_StoreAction(option_strings=['--sample_size'], dest='sample_size', nargs=None, const=None, default=120, type=<class 'int'>, choices=None, required=False, help='Height and width of inputs', metavar=None)

In [15]:
# Parameters for the optimizer
parser.add_argument('--learning_rate', default=0.1, type=float, help='Initial learning rate (divided by 10 while training by lr scheduler)')
parser.add_argument('--momentum', default=0.9, type=float, help='Momentum')
parser.add_argument('--dampening', default=0.9, type=float, help='dampening of SGD')
parser.add_argument('--weight_decay', default=1e-5, type=float, help='Weight Decay')
parser.add_argument('--nesterov', action='store_true', help='Nesterov momentum')
parser.set_defaults(nesterov = False)
parser.add_argument('--optimizer', default='sgd', type=str, help='Currently only support SGD')
parser.add_argument('--lr_patience', default=10, type=int, help='Patience of LR scheduler. See documentation of ReduceLROnPlateau.')
parser.add_argument('--resnet_shortcut', default='B', help='Shortcut type of resnet (A | B)')

_StoreAction(option_strings=['--resnet_shortcut'], dest='resnet_shortcut', nargs=None, const=None, default='B', type=None, choices=None, required=False, help='Shortcut type of resnet (A | B)', metavar=None)
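
The optimizer flags above line up with the keyword arguments of `torch.optim.SGD`. How `ML_model` actually consumes them is not shown in this notebook, so the following is only a sketch under that assumption; `sgd_kwargs` and `opt_parser` are hypothetical names. It also highlights one interaction between the defaults: PyTorch's SGD rejects Nesterov momentum unless momentum is positive and dampening is zero, so with the default dampening of 0.9, passing `--nesterov` alone would be expected to fail.

```python
import argparse

def sgd_kwargs(args):
    """Collect SGD keyword arguments from parsed options (hypothetical helper)."""
    if args.nesterov and (args.momentum <= 0 or args.dampening != 0):
        # Mirrors the validation torch.optim.SGD performs when it is constructed
        raise ValueError("Nesterov momentum requires a momentum and zero dampening")
    return {
        'lr': args.learning_rate,
        'momentum': args.momentum,
        'dampening': args.dampening,
        'weight_decay': args.weight_decay,
        'nesterov': args.nesterov,
    }

# Stand-in parser with the same optimizer defaults as the cell above
opt_parser = argparse.ArgumentParser()
opt_parser.add_argument('--learning_rate', default=0.1, type=float)
opt_parser.add_argument('--momentum', default=0.9, type=float)
opt_parser.add_argument('--dampening', default=0.9, type=float)
opt_parser.add_argument('--weight_decay', default=1e-5, type=float)
opt_parser.add_argument('--nesterov', action='store_true')

default_kwargs = sgd_kwargs(opt_parser.parse_args([]))
print(default_kwargs)

try:
    sgd_kwargs(opt_parser.parse_args(['--nesterov']))
    nesterov_ok = True
except ValueError as err:
    nesterov_ok = False
    print(f"Rejected: {err}")
```

With PyTorch installed, the resulting dictionary could be unpacked as `torch.optim.SGD(model.parameters(), **sgd_kwargs(args))`, where `model` stands for whatever network is being trained.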

In [16]:
# Parameters specific for training from scratch
parser.add_argument('--n_classes', default=10, type=int)

_StoreAction(option_strings=['--n_classes'], dest='n_classes', nargs=None, const=None, default=10, type=<class 'int'>, choices=None, required=False, help=None, metavar=None)

### E) Output Data

This is the directory where the sample preparation logs are stored.

In [17]:
# Creating the results directory
parser.add_argument('--Results_directory',
                    type = str,
                    default = os.path.join(home_dir,'data/results_dir_temp'),
                    help = 'directory to store sample prepare logs')
needsDir.append("Results_directory")

### F) Helper functions that create the required directories

When this code runs inside a Jupyter notebook, the kernel passes its own command-line arguments, which the script's parser does not recognize. `parse_known_args()` avoids the resulting error by returning unrecognized arguments separately instead of rejecting them.

The remaining cells make sure every required path exists before anything relies on it. The code iterates through `needsDir`, which holds the names of the arguments that point to directories or files, and retrieves each path from the `args` object. For an argument whose name ends in `_file`, it ensures that the file's parent directory exists; otherwise it ensures that the directory itself exists. Any missing directory is created.

In [18]:
# The code below is needed if you're running this script from a Jupyter notebook
if 'ipykernel_launcher' in sys.argv[0]:
    args, unknown = parser.parse_known_args()
else:
    args = parser.parse_args()
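
To see the difference in isolation, the sketch below feeds a stand-in parser the kind of arguments a Jupyter kernel is typically launched with (a `-f` flag followed by a connection file). The parser, the argument list, and the file path are all made up for illustration.

```python
import argparse

demo_parser = argparse.ArgumentParser()
demo_parser.add_argument('--batch_size', default=13, type=int)

# Simulated kernel-style arguments; the connection-file path is made up
kernel_argv = ['-f', '/tmp/kernel-1234.json']

# parse_known_args() returns the parsed namespace plus a list of leftovers
known, unknown = demo_parser.parse_known_args(kernel_argv)
print(known.batch_size)
print(unknown)
```

Calling `demo_parser.parse_args(kernel_argv)` on the same list would instead print a usage error and exit, which is the failure the notebook branch avoids.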

In [19]:
# Function to check if a directory exists, and create it if it doesn't
def ensure_directory_exists(path):
    if not os.path.exists(path):
        print(f"Path does not exist, creating path: {path}")
        os.makedirs(path)
    print(f"Using directory: {path}")
    
# Function to check if a file's directory exists, and create it if it doesn't
def ensure_file_directory_exists(file_path):
    directory = os.path.dirname(file_path)
    # A bare filename has no directory component, and os.makedirs('') would raise an error
    if directory and not os.path.exists(directory):
        print(f"Directory for the file does not exist, creating directory: {directory}")
        os.makedirs(directory)
    print(f"Using file path: {file_path}")

In [20]:
# Creating directories for every item in needsDir if it doesn't exist
for nDir in needsDir:
    arg_value = getattr(args, nDir)
    if nDir.endswith('_file'):
        ensure_file_directory_exists(arg_value)
    else:
        ensure_directory_exists(arg_value)

Path does not exist, creating path: /home/hice1/kpatherya3/data/labeled_videos
Using directory: /home/hice1/kpatherya3/data/labeled_videos
Using file path: /home/hice1/kpatherya3/data/videoToProject.csv
Using file path: /home/hice1/kpatherya3/data/model.pth
Using file path: /home/hice1/kpatherya3/data/train.json
Directory for the file does not exist, creating directory: /home/hice1/kpatherya3/data/log_test
Using file path: /home/hice1/kpatherya3/data/log_test/val.log
Using file path: /home/hice1/kpatherya3/data/confusionMatrix.csv
Path does not exist, creating path: /home/hice1/kpatherya3/data/clips_temp
Using directory: /home/hice1/kpatherya3/data/clips_temp
Path does not exist, creating path: /home/hice1/kpatherya3/data/intermediate_temp
Using directory: /home/hice1/kpatherya3/data/intermediate_temp
Path does not exist, creating path: /home/hice1/kpatherya3/data/results_dir_temp
Using directory: /home/hice1/kpatherya3/data/results_dir_temp
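
The same logic can be written more compactly with `pathlib`, since `mkdir(parents=True, exist_ok=True)` does nothing when the directory already exists. This is an alternative sketch, not the notebook's own code: `ensure_paths` is a hypothetical helper, and it relies on the same convention that file arguments end in `_file`.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace

def ensure_paths(args, names):
    """Create the directory for each named argument (the parent directory for *_file arguments)."""
    created = []
    for name in names:
        path = Path(getattr(args, name))
        target = path.parent if name.endswith('_file') else path
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created

# Demonstration with a stand-in for the parsed args, rooted in a scratch directory
with tempfile.TemporaryDirectory() as scratch:
    demo_args = SimpleNamespace(
        Results_directory=str(Path(scratch) / 'data' / 'results_dir_temp'),
        Training_options_file=str(Path(scratch) / 'data' / 'log_test' / 'val.log'),
    )
    targets = ensure_paths(demo_args, ['Results_directory', 'Training_options_file'])
    all_exist = all(t.is_dir() for t in targets)
    print(all_exist)
```

In the notebook this would be called as `ensure_paths(args, needsDir)`. The trade-off is that it drops the per-path status messages printed by the two functions above.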


### G) rClone Set-Up

With the directory structure set up, the next step is to pull the files from DropBox, which we do with rClone.

#### i) Setting up rClone within this notebook

The following code block downloads rClone, and prints out the version to verify that the installation was successful.

In [21]:
!wget https://downloads.rclone.org/rclone-current-linux-amd64.zip -O rclone.zip
!unzip rclone.zip
!mkdir -p ~/bin
!mv rclone-*-linux-amd64/rclone ~/bin/

# Add the ~/bin directory to the PATH environment variable
os.environ["PATH"] += os.pathsep + os.path.expanduser("~/bin")

# Verify rclone is in the PATH
!rclone version

--2024-05-30 22:05:32--  https://downloads.rclone.org/rclone-current-linux-amd64.zip
Resolving downloads.rclone.org (downloads.rclone.org)... 95.217.6.16, 2a01:4f9:c012:7154::1
Connecting to downloads.rclone.org (downloads.rclone.org)|95.217.6.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21137057 (20M) [application/zip]
Saving to: ‘rclone.zip’


2024-05-30 22:05:34 (14.2 MB/s) - ‘rclone.zip’ saved [21137057/21137057]

Archive:  rclone.zip
   creating: rclone-v1.66.0-linux-amd64/
  inflating: rclone-v1.66.0-linux-amd64/README.html  
  inflating: rclone-v1.66.0-linux-amd64/README.txt  
  inflating: rclone-v1.66.0-linux-amd64/rclone  
  inflating: rclone-v1.66.0-linux-amd64/git-log.txt  
  inflating: rclone-v1.66.0-linux-amd64/rclone.1  
rclone v1.66.0
- os/version: redhat 9.3 (64 bit)
- os/kernel: 5.14.0-362.24.1.el9_3.x86_64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.22.1
- go/linking: static
- go/tags: none
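
Re-running the download cell fetches and unpacks the archive again each time. One way to avoid that is to check from Python whether an `rclone` binary is already reachable before downloading. This is a minimal sketch: `find_rclone` is a hypothetical helper, and `~/bin` is the install location used in the cell above.

```python
import os
import shutil

def find_rclone(extra_dir="~/bin"):
    """Return the full path to rclone if it can be found, otherwise None."""
    # Search the current PATH plus the directory the download cell installs into
    search_path = os.environ.get("PATH", "") + os.pathsep + os.path.expanduser(extra_dir)
    return shutil.which("rclone", path=search_path)

rclone_path = find_rclone()
if rclone_path is None:
    print("rclone not found; run the download cell above first")
else:
    print(f"rclone found at {rclone_path}")
```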


#### ii) Editing the rClone configuration file

This cell writes an rClone configuration that connects to the BioSci-McGrath DropBox folder. The token values have been masked to prevent leaks.

In [22]:
config_content = """
[cichlidVideo]
type = dropbox
token = {"access_token":"---","token_type":"---","expiry":"---"}
"""

# Write the content to the rclone.conf file
with open('rclone.conf', 'w') as config_file:
    config_file.write(config_content)
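
One way to keep the token out of the notebook entirely is to read it from an environment variable at run time and to write the config file with owner-only permissions. The sketch below assumes this approach: the variable name `DROPBOX_RCLONE_TOKEN` and the helper `write_rclone_config` are hypothetical, and the fallback token is the same masked placeholder shown above.

```python
import os
import tempfile

def write_rclone_config(path, token_json, remote_name="cichlidVideo"):
    """Write a single-remote rclone config file readable only by the current user."""
    content = f"[{remote_name}]\ntype = dropbox\ntoken = {token_json}\n"
    with open(path, "w") as config_file:
        config_file.write(content)
    os.chmod(path, 0o600)  # owner read/write only (has limited effect on Windows)
    return content

# DROPBOX_RCLONE_TOKEN is a hypothetical variable name; the fallback is a masked placeholder
token = os.environ.get(
    "DROPBOX_RCLONE_TOKEN",
    '{"access_token":"---","token_type":"---","expiry":"---"}',
)

# Written to a scratch directory here so the demonstration does not overwrite a real config
with tempfile.TemporaryDirectory() as scratch:
    demo_conf = os.path.join(scratch, "rclone.conf")
    written = write_rclone_config(demo_conf, token)
    print(written.splitlines()[0])
```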

#### iii) Testing the DropBox connection

This cell lists the files in an arbitrary directory within the folder to confirm that the connection works.

In [23]:
random_path = "BioSci-McGrath/Apps/CichlidPiData/__CredentialFiles/iof_credentials"
!rclone --config rclone.conf ls cichlidVideo:{random_path}

       14 hosts.secret
       15 pi_password.secret
      177 rclone.conf
       69 sendgrid_key.secret


#### iv) Copying files over to the relevant directory

The cells below copy a set of files that the ML model could use. I will revise these paths as I receive more direction.

##### a) Input Data

These are the files that need to be in place before running the model.

- `.../labeled_videos/` is the directory that holds all the labeled video clips.
- `.../videoToProject.csv` maps each video clip to the project its animal belongs to.
- `.../model.pth` is the model data saved from the previous training.
- `.../train.json` is the categories file used in the previous training.
- `.../log_test/val.log` is the log file from the previous training.
- `.../confusionMatrix.csv` records the confidence and label for each video clip.

In [24]:
# TODO: get the correct paths

videos_path = "BioSci-McGrath/Apps/CichlidPiData/__AnnotatedData/LabeledVideos/Clips"
vid_to_proj_path = "BioSci-McGrath/Apps/CichlidPiData/__MachineLearningModels/3DResnet/MCsingle_nuc/videoToProject.csv"
trained_model = "BioSci-McGrath/Apps/CichlidPiData/__MachineLearningModels/3DResnet/MCsingle_nuc/model.pth"
trained_categories = "BioSci-McGrath/Apps/CichlidPiData/__MachineLearningModels/3DResnet/Model18_All/Lijiang_best_model/train.json"
training_options = "BioSci-McGrath/Apps/CichlidPiData/__MachineLearningModels/3DResnet/MCsingle_nuc/val.log"
output_file = "BioSci-McGrath/Apps/CichlidPiData/__MachineLearningModels/3DResnet/MCsingle_nuc/confusionMatrix.csv"

In [25]:
!rclone --config rclone.conf -v copy cichlidVideo:{videos_path} data/labeled_videos/
!rclone --config rclone.conf -v copy cichlidVideo:{vid_to_proj_path} data/
!rclone --config rclone.conf -v copy cichlidVideo:{trained_model} data/
!rclone --config rclone.conf -v copy cichlidVideo:{trained_categories} data/
!rclone --config rclone.conf -v copy cichlidVideo:{training_options} data/log_test/
!rclone --config rclone.conf -v copy cichlidVideo:{output_file} data/

2024/05/30 22:05:38 INFO  : ._TI2_4.y4oZPj: Copied (new)
2024/05/30 22:05:39 INFO  : MC_singlenuc21_6_Tk53_030320.tar: Copied (new)
2024/05/30 22:05:46 INFO  : MC_singlenuc23_8_Tk33_031720.tar: Copied (new)
2024/05/30 22:05:50 INFO  : MC_singlenuc28_1_Tk3_022520.tar: Copied (new)
2024/05/30 22:05:51 INFO  : MC_singlenuc29_3_Tk9_030320.tar: Copied (new)
2024/05/30 22:05:52 INFO  : MC6_5.tar: Multi-thread Copied (new)
2024/05/30 22:05:54 INFO  : MC_singlenuc35_11_Tk61_051220.tar: Copied (new)
2024/05/30 22:05:55 INFO  : MC_singlenuc36_2_Tk3_030320.tar: Copied (new)
2024/05/30 22:05:56 INFO  : MC_singlenuc41_2_Tk9_030920.tar: Copied (new)
2024/05/30 22:05:56 INFO  : MC_singlenuc46_2_Tk53_030920.tar: Copied (new)
2024/05/30 22:05:58 INFO  : MC_singlenuc54_5_Tk53_051220.tar: Copied (new)
2024/05/30 22:05:59 INFO  : MC_singlenuc50_5_Tk9_050720.tar: Copied (new)
2024/05/30 22:06:00 INFO  : CV10_3.tar: Multi-thread Copied (new)
2024/05/30 22:06:01 INFO  : MC16_2.tar: Multi-thread Copied (new)
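
The `!rclone` lines can also be driven from Python, which makes it easier to stop at the first failed transfer. This is a sketch rather than the notebook's method: `build_copy_command` and `copy_all` are hypothetical helpers, `dry_run` is a parameter of this sketch (not an rclone flag), and only two of the transfers above are listed. The rclone arguments are the same `--config`, `-v`, and `copy` used in the cell above.

```python
import subprocess

def build_copy_command(remote_path, local_dir, remote="cichlidVideo", config="rclone.conf"):
    """Build the argument list for one `rclone copy`, mirroring the shell lines above."""
    return ["rclone", "--config", config, "-v", "copy", f"{remote}:{remote_path}", local_dir]

def copy_all(transfers, dry_run=True):
    """Run each transfer in order; with dry_run=True the commands are only printed."""
    commands = [build_copy_command(src, dst) for src, dst in transfers]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return commands

transfers = [
    ("BioSci-McGrath/Apps/CichlidPiData/__AnnotatedData/LabeledVideos/Clips",
     "data/labeled_videos/"),
    ("BioSci-McGrath/Apps/CichlidPiData/__MachineLearningModels/3DResnet/MCsingle_nuc/model.pth",
     "data/"),
]
commands = copy_all(transfers, dry_run=True)
```

Calling `copy_all(transfers, dry_run=False)` would run the transfers for real, provided `rclone` is on the `PATH` and `rclone.conf` holds a valid token.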


### H) Experimenting with the data worker and ML model

In [26]:
# Use a separate name for the instance so the imported ML_model class is not overwritten
ml_model = ML_model(args)
ml_model.work()

KeyError: 'Location'

In [27]:
data_worker = DP_worker(args)
data_worker.work()

convert video clips to images for faster loading
calculate mean file


ValueError: Cannot set a DataFrame with multiple columns to the single column MeanID

In [None]:
os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu_card
print(os.environ["CUDA_VISIBLE_DEVICES"])