Year 2 ABWE Project Resources

All scripts used for my year 2 ABWE project. I anticipate 3 primary steps will be required to convert my raw data into findings, laid out below.

Process

1. Transcription, Python

Audio collected from the site, containing details of behaviours performed, meterological values, and covariates will be transcribed with timestamps using Whisper Distil Large v3 (Gandhi et al. 2023). Human oversight was required to ensure transcription accuracy, and to reliably clean the data.

My favourite part of this is when it acknowledges after some time:
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
This shouldn't be difficult to implement, but I really can't see it providing a significant enough boost with such a small dataset. Nonetheless, this is a potential area for performance improvement alongside compilation/IO-aware attention/dusting off the RTX 3060.

2. Storage, SQL (until it wasn't)

Just use a spreadsheet, they said. It'll be easier, they said...

# What toy did the dog play with?
=IFERROR(MID(
  OFFSET(INDIRECT(H2),0,4),
  FIND(", ",OFFSET(INDIRECT(H2),0,4))+2,
  LEN(OFFSET(INDIRECT(H2),0,4))
),"")

# How long did the dog sit down?
=SUMIF(
  OFFSET(INDIRECT($H2), 1, 2, $I2, 1),
  M$1, OFFSET(INDIRECT($H2), 1, 4, $I2, 1)
)

# How hot was it on average?
=AVERAGE(
  VALUE(TRIM(MID(
    OFFSET(INDIRECT(H2), 0, 1),
    FIND(",", OFFSET(INDIRECT(H2), 0, 1)) + 1,
    FIND(",", OFFSET(INDIRECT(H2), 0, 1) & ",",
    FIND(",", OFFSET(INDIRECT(H2), 0, 1)) + 1)
    -FIND(",", OFFSET(INDIRECT(H2), 0, 1)) - 1))),
  VALUE(TRIM(MID(
    OFFSET(INDIRECT(H2), 0, 2),
    FIND(",", OFFSET(INDIRECT(H2), 0, 2)) + 1,
    FIND(",", OFFSET(INDIRECT(H2), 0, 2) & ",",
    FIND(",", OFFSET(INDIRECT(H2), 0, 2)) + 1)
    -FIND(",", OFFSET(INDIRECT(H2), 0, 2)) - 1)))
)

# What was the relative humidity? (Abbott and Tabony 1985)
=100*EXP(1.8096+((17.2694*Z2)/(237.3+Z2)))-(0.0007866*998.3*(Y2-Z2)*(1+(Z2/610))))
/(EXP(1.8096+((17.2694*Y2)/(237.3+Y2)))

# What was the heat index? (Rothfusz 1990)
=(-8.78469475556)+(1.61139411*Y2)+(2.33854883889*AB2)
+(-0.14611605*Y2*AB2)+(-0.012308094*(Y2^2))+(-0.0164248277778*(AB2^2))
+(0.002211732*(Y2^2)*AB2)+(0.00085282*Y2*(AB2^2))
+(-0.000003582*(Y2^2)*(AB2^2))

3. Analysis, R

Analysis mainly performed in GraphPad Prism, however beta regression, correlation matrix, and FAMD were done in R.

Expected inputs:

> head(propdata)
  Temperature Proportion
1          36       0.23
2          36       0.00
3          36       0.06
4          37       0.16
5          36       0.00
6          36       0.79

> head(corrdata)
    Size ToyState PantState    DL DD DH   DF AS    AH    AD NV NE NF Location Day  Hour Cloud HeatIndex TotalDur TotalPlay
1 Medium        0         0 98.00  0  0 0.00  0  0.00 29.74  0  0  0 TQ275839   5 16:28     5        36   127.74     29.74
2 Medium        0         0 89.00  0  0 0.00  0  0.00  0.00  0  0  0 TQ275839   5 16:32     7        36    89.00      0.00
3  Small        1         0 30.00  0  0 0.00  2  0.00  0.00  0  0  0 TQ275839   5 16:32     7        36    32.00      2.00
4  Small        0         0 76.40  0  0 6.48  0  0.00 16.00  0  0  0 TQ275839   5 16:36     7        37    98.88     16.00
5 Medium        1         0 36.00  0  0 0.00  0  0.00  0.00  0  0  0 TQ275839   5 16:59     7        36    36.00      0.00
6  Large        0         0 20.06  0  0 0.00  0 73.48  0.00  0  0  1 TQ275839   5 16:59     7        36    93.54     73.48

> head(famddata)
    Size ToyState PantState    DL DD DH   DF AS    AH    AD NV NE NF Location Cloud HeatIndex
1 Medium        0         0 98.00  0  0 0.00  0  0.00 29.74  0  0  0 TQ275839     5        36
2 Medium        0         0 89.00  0  0 0.00  0  0.00  0.00  0  0  0 TQ275839     7        36
3  Small        1         0 30.00  0  0 0.00  2  0.00  0.00  0  0  0 TQ275839     7        36
4  Small        0         0 76.40  0  0 6.48  0  0.00 16.00  0  0  0 TQ275839     7        37
5 Medium        1         0 36.00  0  0 0.00  0  0.00  0.00  0  0  0 TQ275839     7        36
6  Large        0         0 20.06  0  0 0.00  0 73.48  0.00  0  0  1 TQ275839     7        36

How to

Step-by-step instructions to reproduce my method on various systems. Assumes Git, Python, pip, python3-venv, R (v4.4 or higher), and RStudio are installed.

Debian (I did this)

Step 1: Initialise repo

Clone, enter, and install requirements for the repo:

git clone https://github.com/JoshuaJewell/abwe2-res.git
cd abwe2-res

python3 -m venv venv
pip install torch transformers

Step 2: Transcribe recordings

Copy recordings and run script:

cp -r /path/to/recordings /path/to/abwe2-res/transcription/recordings
python ./transcription/whisper.py

Step 3: Clean data

Paste csv data from transcriptions.txt into a spreadsheet:

ctrl-c
ctrl-v

Then spend a couple hours filtering out your unnecessary thoughts from the data and formatting for spreadsheet to take care of (as above).

Step 4: Analyse

Open breg-corr-famd.R in RStudio. Install requirements:

> install.packages(c("corrplot", "FactoMineR", "factoextra", "lmtest", "sandwich"))

Adjust paths in breg-corr-famd.R:

propdata <- read.csv('/path/to/propdata.csv')
corrdata <- read.csv('/path/to/corrdata.csv')
famddata <- read.csv('/path/to/famddata.csv')

Run script.

Windows (You might want to do this)

Step 1: Initialise repo

Clone, enter, and install requirements for the repo:

git clone https://github.com/JoshuaJewell/abwe2-res.git
cd abwe2-res
python -m venv venv
venv\Scripts\pip.exe install torch transformers

Step 2: Transcribe recordings

Copy recordings and run script:

xcopy /E /I C:\path\to\recordings C:\path\to\abwe2-res\transcription\recordings
python .\transcription\whisper.py

Step 3: Clean data

Paste csv data from transcriptions.txt into a spreadsheet and format accordingly.

Step 4: Analyse

Open breg-corr-famd.R in RStudio. Install requirements:

> install.packages(c("corrplot", "FactoMineR", "factoextra", "lmtest", "sandwich"))

Adjust paths in breg-corr-famd.R:

propdata <- read.csv('/path/to/propdata.csv')
corrdata <- read.csv('/path/to/corrdata.csv')
famddata <- read.csv('/path/to/famddata.csv')

Run script.

MacOS

Basically use the Debian steps... I think.

References

Abbott, P.F. and Tabony, R.C. (1985) ‘The estimation of humidity parameters’, Meteorological Magazine, 114(1351), 49–56.
Husson, F., Josse, J., Le, S., and Jeremy, M. (2006) ‘FactoMineR: Multivariate Exploratory Data Analysis and Data Mining’, available: https://doi.org/10.32614/CRAN.package.FactoMineR.
Gandhi, S., Platen, P. von, and Rush, A.M. (2023) ‘Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling’, available: https://doi.org/10.48550/arXiv.2311.00430.
GraphPad Software Inc. (2023) GraphPad Prism (Version 10.4.2 for Windows), available: http://www.graphpad.com.
Hothorn, T., Zeileis, A., Farebrother, R.W., and Cummins, C. (1999) ‘lmtest: Testing Linear Regression Models’, available: https://doi.org/10.32614/CRAN.package.lmtest.
Judd, C.M., McClelland, G.H., and Ryan, C.S. (2011) Data Analysis [online], 0 ed., Routledge, available: https://doi.org/10.4324/9780203892053.
Kassambara, A. and Mundt, F. (2016) ‘factoextra: Extract and Visualize the Results of Multivariate Data Analyses’, available: https://doi.org/10.32614/CRAN.package.factoextra.
Pagès, J. (2004) ‘Analyse factorielle de données mixtes’, Revue de Statistique Appliquée, 52(4), 93–111.
Rothfusz, L. (1990) ‘The heat index “equation”or more than you ever wanted to know about heat index: National weather service southern region technical attachment sr/ssd 90-23’, Fort Worth: National Weather Service, 1–2.
Wei, T. and Simko, V. (2010) ‘corrplot: Visualization of a Correlation Matrix’, available: https://doi.org/10.32614/CRAN.package.corrplot.
Zeileis, A. and Lumley, T. (2004) ‘sandwich: Robust Covariance Matrix Estimators’, available: https://doi.org/10.32614/CRAN.package.sandwich.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
analysis		analysis
storage		storage
transcription		transcription
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Year 2 ABWE Project Resources

Process

1. Transcription, Python

2. Storage, SQL (until it wasn't)

3. Analysis, R

How to

Debian (I did this)

Step 1: Initialise repo

Step 2: Transcribe recordings

Step 3: Clean data

Step 4: Analyse

Windows (You might want to do this)

Step 1: Initialise repo

Step 2: Transcribe recordings

Step 3: Clean data

Step 4: Analyse

MacOS

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Year 2 ABWE Project Resources

Process

1. Transcription, Python

2. Storage, SQL (until it wasn't)

3. Analysis, R

How to

Debian (I did this)

Step 1: Initialise repo

Step 2: Transcribe recordings

Step 3: Clean data

Step 4: Analyse

Windows (You might want to do this)

Step 1: Initialise repo

Step 2: Transcribe recordings

Step 3: Clean data

Step 4: Analyse

MacOS

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages