All scripts used for my year 2 ABWE project. I anticipate 3 primary steps will be required to convert my raw data into findings, laid out below.
Audio collected from the site, containing details of behaviours performed, meterological values, and covariates will be transcribed with timestamps using Whisper Distil Large v3 (Gandhi et al. 2023). Human oversight was required to ensure transcription accuracy, and to reliably clean the data.
My favourite part of this is when it acknowledges after some time:
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
This shouldn't be difficult to implement, but I really can't see it providing a significant enough boost with such a small dataset. Nonetheless, this is a potential area for performance improvement alongside compilation/IO-aware attention/dusting off the RTX 3060.
Just use a spreadsheet, they said. It'll be easier, they said...
# What toy did the dog play with?
=IFERROR(MID(
OFFSET(INDIRECT(H2),0,4),
FIND(", ",OFFSET(INDIRECT(H2),0,4))+2,
LEN(OFFSET(INDIRECT(H2),0,4))
),"")
# How long did the dog sit down?
=SUMIF(
OFFSET(INDIRECT($H2), 1, 2, $I2, 1),
M$1, OFFSET(INDIRECT($H2), 1, 4, $I2, 1)
)
# How hot was it on average?
=AVERAGE(
VALUE(TRIM(MID(
OFFSET(INDIRECT(H2), 0, 1),
FIND(",", OFFSET(INDIRECT(H2), 0, 1)) + 1,
FIND(",", OFFSET(INDIRECT(H2), 0, 1) & ",",
FIND(",", OFFSET(INDIRECT(H2), 0, 1)) + 1)
-FIND(",", OFFSET(INDIRECT(H2), 0, 1)) - 1))),
VALUE(TRIM(MID(
OFFSET(INDIRECT(H2), 0, 2),
FIND(",", OFFSET(INDIRECT(H2), 0, 2)) + 1,
FIND(",", OFFSET(INDIRECT(H2), 0, 2) & ",",
FIND(",", OFFSET(INDIRECT(H2), 0, 2)) + 1)
-FIND(",", OFFSET(INDIRECT(H2), 0, 2)) - 1)))
)
# What was the relative humidity? (Abbott and Tabony 1985)
=100*EXP(1.8096+((17.2694*Z2)/(237.3+Z2)))-(0.0007866*998.3*(Y2-Z2)*(1+(Z2/610))))
/(EXP(1.8096+((17.2694*Y2)/(237.3+Y2)))
# What was the heat index? (Rothfusz 1990)
=(-8.78469475556)+(1.61139411*Y2)+(2.33854883889*AB2)
+(-0.14611605*Y2*AB2)+(-0.012308094*(Y2^2))+(-0.0164248277778*(AB2^2))
+(0.002211732*(Y2^2)*AB2)+(0.00085282*Y2*(AB2^2))
+(-0.000003582*(Y2^2)*(AB2^2))
Analysis mainly performed in GraphPad Prism, however beta regression, correlation matrix, and FAMD were done in R.
Expected inputs:
> head(propdata)
Temperature Proportion
1 36 0.23
2 36 0.00
3 36 0.06
4 37 0.16
5 36 0.00
6 36 0.79
> head(corrdata)
Size ToyState PantState DL DD DH DF AS AH AD NV NE NF Location Day Hour Cloud HeatIndex TotalDur TotalPlay
1 Medium 0 0 98.00 0 0 0.00 0 0.00 29.74 0 0 0 TQ275839 5 16:28 5 36 127.74 29.74
2 Medium 0 0 89.00 0 0 0.00 0 0.00 0.00 0 0 0 TQ275839 5 16:32 7 36 89.00 0.00
3 Small 1 0 30.00 0 0 0.00 2 0.00 0.00 0 0 0 TQ275839 5 16:32 7 36 32.00 2.00
4 Small 0 0 76.40 0 0 6.48 0 0.00 16.00 0 0 0 TQ275839 5 16:36 7 37 98.88 16.00
5 Medium 1 0 36.00 0 0 0.00 0 0.00 0.00 0 0 0 TQ275839 5 16:59 7 36 36.00 0.00
6 Large 0 0 20.06 0 0 0.00 0 73.48 0.00 0 0 1 TQ275839 5 16:59 7 36 93.54 73.48
> head(famddata)
Size ToyState PantState DL DD DH DF AS AH AD NV NE NF Location Cloud HeatIndex
1 Medium 0 0 98.00 0 0 0.00 0 0.00 29.74 0 0 0 TQ275839 5 36
2 Medium 0 0 89.00 0 0 0.00 0 0.00 0.00 0 0 0 TQ275839 7 36
3 Small 1 0 30.00 0 0 0.00 2 0.00 0.00 0 0 0 TQ275839 7 36
4 Small 0 0 76.40 0 0 6.48 0 0.00 16.00 0 0 0 TQ275839 7 37
5 Medium 1 0 36.00 0 0 0.00 0 0.00 0.00 0 0 0 TQ275839 7 36
6 Large 0 0 20.06 0 0 0.00 0 73.48 0.00 0 0 1 TQ275839 7 36
Step-by-step instructions to reproduce my method on various systems. Assumes Git, Python, pip, python3-venv, R (v4.4 or higher), and RStudio are installed.
Clone, enter, and install requirements for the repo:
git clone https://github.com/JoshuaJewell/abwe2-res.git
cd abwe2-res
python3 -m venv venv
pip install torch transformers
Copy recordings and run script:
cp -r /path/to/recordings /path/to/abwe2-res/transcription/recordings
python ./transcription/whisper.py
Paste csv data from transcriptions.txt into a spreadsheet:
ctrl-c
ctrl-v
Then spend a couple hours filtering out your unnecessary thoughts from the data and formatting for spreadsheet to take care of (as above).
Open breg-corr-famd.R in RStudio. Install requirements:
> install.packages(c("corrplot", "FactoMineR", "factoextra", "lmtest", "sandwich"))
Adjust paths in breg-corr-famd.R:
propdata <- read.csv('/path/to/propdata.csv')
corrdata <- read.csv('/path/to/corrdata.csv')
famddata <- read.csv('/path/to/famddata.csv')
Run script.
Clone, enter, and install requirements for the repo:
git clone https://github.com/JoshuaJewell/abwe2-res.git
cd abwe2-res
python -m venv venv
venv\Scripts\pip.exe install torch transformers
Copy recordings and run script:
xcopy /E /I C:\path\to\recordings C:\path\to\abwe2-res\transcription\recordings
python .\transcription\whisper.py
Paste csv data from transcriptions.txt into a spreadsheet and format accordingly.
Open breg-corr-famd.R in RStudio. Install requirements:
> install.packages(c("corrplot", "FactoMineR", "factoextra", "lmtest", "sandwich"))
Adjust paths in breg-corr-famd.R:
propdata <- read.csv('/path/to/propdata.csv')
corrdata <- read.csv('/path/to/corrdata.csv')
famddata <- read.csv('/path/to/famddata.csv')
Run script.
Basically use the Debian steps... I think.
- Abbott, P.F. and Tabony, R.C. (1985) ‘The estimation of humidity parameters’, Meteorological Magazine, 114(1351), 49–56.
- Husson, F., Josse, J., Le, S., and Jeremy, M. (2006) ‘FactoMineR: Multivariate Exploratory Data Analysis and Data Mining’, available: https://doi.org/10.32614/CRAN.package.FactoMineR.
- Gandhi, S., Platen, P. von, and Rush, A.M. (2023) ‘Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling’, available: https://doi.org/10.48550/arXiv.2311.00430.
- GraphPad Software Inc. (2023) GraphPad Prism (Version 10.4.2 for Windows), available: http://www.graphpad.com.
- Hothorn, T., Zeileis, A., Farebrother, R.W., and Cummins, C. (1999) ‘lmtest: Testing Linear Regression Models’, available: https://doi.org/10.32614/CRAN.package.lmtest.
- Judd, C.M., McClelland, G.H., and Ryan, C.S. (2011) Data Analysis [online], 0 ed., Routledge, available: https://doi.org/10.4324/9780203892053.
- Kassambara, A. and Mundt, F. (2016) ‘factoextra: Extract and Visualize the Results of Multivariate Data Analyses’, available: https://doi.org/10.32614/CRAN.package.factoextra.
- Pagès, J. (2004) ‘Analyse factorielle de données mixtes’, Revue de Statistique Appliquée, 52(4), 93–111.
- Rothfusz, L. (1990) ‘The heat index “equation”or more than you ever wanted to know about heat index: National weather service southern region technical attachment sr/ssd 90-23’, Fort Worth: National Weather Service, 1–2.
- Wei, T. and Simko, V. (2010) ‘corrplot: Visualization of a Correlation Matrix’, available: https://doi.org/10.32614/CRAN.package.corrplot.
- Zeileis, A. and Lumley, T. (2004) ‘sandwich: Robust Covariance Matrix Estimators’, available: https://doi.org/10.32614/CRAN.package.sandwich.