# Install the ASR workshop files (Dzardzongke Exercise)
Rolando Coto-Solano (Rolando.A.Coto.Solano@dartmouth.edu)<br>
Dartmouth College. Last update: 20230420

The program takes two main inputs:

* `folderToCreate`: This folder will be created in your Google Drive. It will contain the files and folders necessary for ASR training. The default is `202303-asr-workshop-ka` for the first Kaike test.<br>
* `sandboxes`: An array with the names of the sandboxes that will be used. It requires at least two: `sandbox-user` as a temporary bucket, and `all-wavs` as a permanent one. You can add more sandboxes if you have more than one person working on your transcriptions.<br>
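
Before running the installer, it can help to check the two inputs against the requirements described above. The following is a minimal sketch, not part of the original notebook: the function name `check_inputs` is hypothetical, and the rules that the folder name must not contain `/` and that sandbox names must be unique are assumptions added for illustration.

```python
# Names of the two sandboxes the description above lists as required.
REQUIRED_SANDBOXES = {"sandbox-user", "all-wavs"}


def check_inputs(folder_to_create, sandboxes):
    """Return a list of problems found in the two inputs (empty if none)."""
    problems = []
    # Assumption: the folder is created directly under MyDrive, so no '/' is allowed.
    if not folder_to_create or "/" in folder_to_create:
        problems.append("folderToCreate must be a non-empty folder name without '/'")
    missing = REQUIRED_SANDBOXES - set(sandboxes)
    if missing:
        problems.append("missing required sandboxes: " + ", ".join(sorted(missing)))
    # Assumption: each sandbox gets its own folder and sheet, so names must be unique.
    if len(set(sandboxes)) != len(sandboxes):
        problems.append("sandbox names must be unique")
    return problems


print(check_inputs("202303-asr-workshop-ka", ["sandbox-user", "all-wavs"]))
```

An empty list means both inputs satisfy these checks.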

The program will perform the following tasks:

1. Ask for permission to read and write to your Google Drive
2. Create folders and subfolders for each sandbox
3. Download the exercise files
4. Create the Google Sheets needed to store transcriptions
5. Download the notebooks with the code for transcription training

**Please remember to delete the Dzardzongke recordings once you're done.**

## (1) Settings needed for installation

In [3]:
folderToCreate = "202303-asr-workshop-hew"
sandboxes = ["sandbox-user", "all-wavs"]

downloadDzAudioFiles = 0   # Set this to 1 to download the Dzardzongke wave files automatically, or to 0 to skip the download

## (2) Request access to your Google Drive

In [4]:
# Load other libraries
import pandas as pd
import random
import os.path

# Load libraries for access to Google Spreadsheets
from google.colab import auth
auth.authenticate_user()
import gspread
from google.colab import drive

# It needs this permission to access the ASR spreadsheets in your GDrive
from google.auth import default
creds, _ = default()
gc = gspread.authorize(creds)

In [5]:
# It needs this permission to read and write ASR files into your GDrive
drive.mount('/content/drive/',force_remount=True)

Mounted at /content/drive/


## (3) Create folders and download exercise files

In [6]:
# Example of the folder structure:

# -- 202303-asr-workshop
#    | - all-wavs
#    |   | - audiofiles-to-transcribe
#    |   | - logs-kaldi-res
#    |   | - logs-wav2vec2-res
#    |   | - processed-elan-files
#    |   | - tsv-inputs
#    |   | - tsv-outputs
#    |   | - wav
#    |   | - wav2vec2-model
#    | - sandbox-user
#    |   | - audiofiles-to-transcribe
#    |   | - logs-kaldi-res
#    |   | - logs-wav2vec2-res
#    |   | - processed-elan-files
#    |   | - tsv-inputs
#    |   | - tsv-outputs
#    |   | - wav
#    |   | - wav2vec2-model

In [7]:
#===========================================================================
# Create folders
#===========================================================================

# -p creates missing parent folders and does not complain if a folder
# already exists, so this cell can be re-run safely.
!mkdir -p /content/drive/MyDrive/{folderToCreate}

for b in sandboxes:
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/audiofiles-to-transcribe
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/logs-kaldi-res
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/logs-wav2vec2-res
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/processed-elan-files
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/tsv-inputs
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/tsv-outputs
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/wav
  !mkdir -p /content/drive/MyDrive/{folderToCreate}/{b}/wav2vec2-model
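
The same folder tree can also be built in plain Python, which avoids shell commands and makes the step easy to test outside Colab. This is a sketch under stated assumptions: the function name `create_workshop_folders` and the `base_dir` parameter are hypothetical and do not appear in the original notebook. The subfolder names are copied from the structure shown above.

```python
import os

# Subfolders created inside every sandbox (taken from the folder structure above).
SUBFOLDERS = [
    "audiofiles-to-transcribe",
    "logs-kaldi-res",
    "logs-wav2vec2-res",
    "processed-elan-files",
    "tsv-inputs",
    "tsv-outputs",
    "wav",
    "wav2vec2-model",
]


def create_workshop_folders(base_dir, folder_to_create, sandboxes):
    """Create base_dir/folder_to_create/<sandbox>/<subfolder> for every combination.

    Returns the list of leaf folder paths. Existing folders are left untouched.
    """
    created = []
    for b in sandboxes:
        for sub in SUBFOLDERS:
            path = os.path.join(base_dir, folder_to_create, b, sub)
            os.makedirs(path, exist_ok=True)  # also creates missing parents
            created.append(path)
    return created
```

In Colab, with Drive mounted as in section (2), the call would presumably be `create_workshop_folders("/content/drive/MyDrive", folderToCreate, sandboxes)`.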

In [8]:
#===========================================================================
# Download workshop files
#===========================================================================

for b in sandboxes:
  if (downloadDzAudioFiles == 1 and "sandbox-user" in b):
    !curl -o /content/drive/MyDrive/{folderToCreate}/{b}/processed-elan-files/dz-files-20230419.zip https://rcweb.dartmouth.edu/homes/f00458c/workshop-nepali-2023/dz-files-20230419.zip
    !unzip -o /content/drive/MyDrive/{folderToCreate}/{b}/processed-elan-files/dz-files-20230419.zip -d /content/drive/MyDrive/{folderToCreate}/{b}/processed-elan-files
    !rm -f /content/drive/MyDrive/{folderToCreate}/{b}/processed-elan-files/dz-files-20230419.zip
    !mv /content/drive/MyDrive/{folderToCreate}/{b}/processed-elan-files/smt-003.wav /content/drive/MyDrive/{folderToCreate}/{b}/audiofiles-to-transcribe
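
The three steps that follow the download (unzip, delete the archive, move one recording to `audiofiles-to-transcribe`) can be sketched with the standard library. The function name `unpack_exercise_zip` and its parameters are hypothetical. The sketch covers only the post-download steps and assumes the archive is already on disk and the destination folder exists.

```python
import os
import shutil
import zipfile


def unpack_exercise_zip(zip_path, extract_dir, wav_name, wav_dest_dir):
    """Extract zip_path into extract_dir, delete the archive, and move
    wav_name from extract_dir into wav_dest_dir.

    Returns the new path of the moved file.
    """
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(extract_dir)  # overwrites existing files, like `unzip -o`
    os.remove(zip_path)

    src = os.path.join(extract_dir, wav_name)
    if not os.path.isfile(src):
        raise FileNotFoundError(f"{wav_name} was not found in the archive")
    return shutil.move(src, os.path.join(wav_dest_dir, wav_name))
```

For the cell above, `wav_name` would be `smt-003.wav`, `extract_dir` the sandbox's `processed-elan-files` folder, and `wav_dest_dir` its `audiofiles-to-transcribe` folder.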

## (4) Create Google Sheet files

In [9]:
#=================================================================
# Write Google Sheet files
#=================================================================

idSandboxes = []

for b in sandboxes:
  sheetName = "asr-transcriptions-" + b
  sh = gc.create(sheetName)
  worksheet = sh.sheet1  # use the spreadsheet just created; opening by name could match an older sheet with the same title
  worksheet.update_title('wav-metadata')
  inValues = ["wav_filename", "dataProcessedBy", "dateAdded", "speakerPrefix", "gender", "wav_filesize", "duration_seconds", "codeSwitch", "needsFurtherCheck", "transcript", "original_transcript"]
  print(str(b) + "\t" + str(sh.id))
  idSandboxes.append(sh.id)
  worksheet.append_row(inValues)

sandbox-user	1F2yws84wNBLVTidLFo84OfJ5ihS-mPsgBbVQfN5fPlA
all-wavs	10-0Jmdg95kKsg7NgJgkUmMqilzE0uktaIAUTHzlNT9w


In [10]:
#=================================================================
# Write the file with the sandbox IDs
#=================================================================

output = ""
for i in range(0,len(sandboxes)):
  idSandboxes[i] = "https://docs.google.com/spreadsheets/d/" + idSandboxes[i] + "/edit?usp=sharing"
  output += sandboxes[i] + "\t" + idSandboxes[i] + "\n"

output = output[:-1]

print(output)

with open("/content/drive/MyDrive/" + folderToCreate + "/sandboxes.txt", "w") as f:
  f.write(output)

sandbox-user	https://docs.google.com/spreadsheets/d/1F2yws84wNBLVTidLFo84OfJ5ihS-mPsgBbVQfN5fPlA/edit?usp=sharing
all-wavs	https://docs.google.com/spreadsheets/d/10-0Jmdg95kKsg7NgJgkUmMqilzE0uktaIAUTHzlNT9w/edit?usp=sharing
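
The `sandboxes.txt` file written above holds one sandbox per line: the name, a tab, and the sheet URL. A later notebook could read it back roughly as follows. This is a sketch, not the code the other workshop notebooks actually use; the function name `read_sandbox_sheets` is hypothetical.

```python
import re


def read_sandbox_sheets(path):
    """Map each sandbox name to a (url, spreadsheet_id) tuple.

    Expects one `name<TAB>url` pair per line; blank lines are skipped.
    spreadsheet_id is None if the URL does not contain `/spreadsheets/d/<id>`.
    """
    sheets = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            name, url = line.split("\t", 1)
            match = re.search(r"/spreadsheets/d/([^/]+)", url)
            sheets[name] = (url, match.group(1) if match else None)
    return sheets
```

The extracted ID is the same value that `sh.id` printed when the sheet was created.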


In [11]:
#=================================================================
# Move the Google Sheet files to the installation folder
# (If you get an error that you "cannot stat" the file,
# give it a minute or two and try again. The drive might
# take a minute to update itself).
#=================================================================

drive.mount('/content/drive/',force_remount=True)

for b in sandboxes:
  sheetName = "asr-transcriptions-" + b
  !mv /content/drive/MyDrive/{sheetName}.gsheet /content/drive/MyDrive/{folderToCreate}

Mounted at /content/drive/
mv: cannot stat '/content/drive/MyDrive/asr-transcriptions-sandbox-user.gsheet': No such file or directory
mv: cannot stat '/content/drive/MyDrive/asr-transcriptions-all-wavs.gsheet': No such file or directory
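
The "cannot stat" errors above are the delay the cell comment warns about. Instead of re-running the cell by hand, one option is to poll for the file for a limited time before moving it. The helper below is a hypothetical sketch (`move_when_available` and its timeout values are not from the original notebook). It does not verify whether the mounted Drive shows a new file without a remount, so the manual retry may still be needed.

```python
import os
import shutil
import time


def move_when_available(src, dest_dir, timeout_s=120, poll_s=5):
    """Wait up to timeout_s seconds for src to exist, then move it into dest_dir.

    Returns True if the file was moved, False if it never appeared.
    """
    deadline = time.monotonic() + timeout_s
    while not os.path.exists(src):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    shutil.move(src, dest_dir)
    return True
```

In the cell above, `src` would be `/content/drive/MyDrive/{sheetName}.gsheet` and `dest_dir` the installation folder.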


## (5) Download Jupyter notebooks for the exercises

In [12]:
!curl -o /content/drive/MyDrive/{folderToCreate}/from-elan-to-wav-and-gsheet.ipynb https://rcweb.dartmouth.edu/homes/f00458c/workshop-nepali-2023/from-elan-to-wav-and-gsheet.ipynb

!curl -o /content/drive/MyDrive/{folderToCreate}/from-gsheet-to-wav2vec2-files.ipynb https://rcweb.dartmouth.edu/homes/f00458c/workshop-nepali-2023/from-gsheet-to-wav2vec2-files.ipynb
!curl -o /content/drive/MyDrive/{folderToCreate}/train-wav2vec2.ipynb https://rcweb.dartmouth.edu/homes/f00458c/workshop-nepali-2023/train-wav2vec2.ipynb

!curl -o /content/drive/MyDrive/{folderToCreate}/inference-split.ipynb https://rcweb.dartmouth.edu/homes/f00458c/workshop-nepali-2023/inference-split.ipynb
!curl -o /content/drive/MyDrive/{folderToCreate}/inference-transcribe-from-blank-elan.ipynb https://rcweb.dartmouth.edu/homes/f00458c/workshop-nepali-2023/inference-transcribe-from-blank-elan.ipynb
!curl -o /content/drive/MyDrive/{folderToCreate}/inference-transcribe-from-user-wav2vec2.ipynb https://rcweb.dartmouth.edu/homes/f00458c/workshop-nepali-2023/inference-transcribe-from-user-wav2vec2.ipynb

!curl -o /content/drive/MyDrive/{folderToCreate}/from-sandbox-to-permanent-gsheet.ipynb https://rcweb.dartmouth.edu/homes/f00458c/workshop-nepali-2023/from-sandbox-to-permanent-gsheet.ipynb
