# Fall 2023 DS-549 Project Data


## Data Sources

This notebook is expected to be run on the SCC because it uses SCC-specific paths.
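
Before running the rest of the notebook, it can help to confirm that the dataset directories mentioned in the sections below are visible from the current machine. The following is a minimal sketch; the dictionary keys and the `find_missing_dirs` helper are names made up for illustration, while the paths are the ones quoted in this notebook.

```python
from pathlib import Path

# Dataset locations quoted in this notebook (all SCC-specific).
# The short keys are illustrative labels, not official names.
DATASET_DIRS = {
    "missing_children": "/projectnb/sparkgrp/datasets/missing_children/database",
    "epik": "/projectnb/sparkgrp/datasets/EPIK/data",
    "maple": "/projectnb/sparkgrp/datasets/MAPLE",
    "bodycam": "/projectnb/sparkgrp/datasets/bodycam",
    "windows_on_earth": "/projectnb/sparkgrp/datasets/windows_on_earth",
}


def find_missing_dirs(dirs):
    """Return the labels whose directory does not exist on this machine."""
    return [name for name, path in dirs.items() if not Path(path).is_dir()]


if __name__ == "__main__":
    missing = find_missing_dirs(DATASET_DIRS)
    if missing:
        print(f"Not found (are you on the SCC?): {', '.join(missing)}")
    else:
        print("All dataset directories are available.")
```

An empty result suggests the notebook is running somewhere that can see the shared project space.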

### Missing Children

The dataset photos are being reorganized, and the client will provide an updated
dataset soon.

A version of the data prepared last semester is available in the project
[repo](https://github.com/BU-Spark/ml-atfal-mafkoda-missing-children/tree/main), in the
[database](https://github.com/BU-Spark/ml-atfal-mafkoda-missing-children/tree/main/database) folder.

For convenience, the dataset folder was copied to
`/projectnb/sparkgrp/datasets/missing_children/database`.

The following commands were used to copy the dataset:

```shell
mkdir tmp_repo
git clone --filter=blob:none --no-checkout https://github.com/BU-Spark/ml-atfal-mafkoda-missing-children.git tmp_repo

cd tmp_repo/
git checkout main -- "database"
mv database/ /projectnb/sparkgrp/datasets/missing_children/

# remove the temporary directory (from the directory where tmp_repo was created)
cd ..
rm -rf tmp_repo/
```

### EPIK

The data is in the GitHub repo at https://github.com/BU-Spark/ml-epik-project-nlp/tree/dev/data

For convenience, it is also copied to `/projectnb/sparkgrp/datasets/EPIK/data`.

### Herbaria

Some data is on this [Google Drive](https://drive.google.com/drive/folders/1csp1nCAneQh1kXQ0oO4wflcG29kpPJgE?usp=sharing). Other sources:

* subset: `/projectnb/sparkgrp/ml-herbarium-grp/ml-herbarium-data/TROCR_Training/goodfiles`
* more images: `/projectnb/sparkgrp/ml-herbarium-grp/ml-herbarium-data/scraped-data/drago_testdata/images`
* alternatively, more can be scraped from GBIF
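
For the GBIF option, a rough sketch using the public GBIF occurrence search API is shown here. The scientific name in the usage note, the helper names, and the choice of filters (`basisOfRecord=PRESERVED_SPECIMEN`, `mediaType=StillImage`) are assumptions about what the project needs, so check them against the GBIF API documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_SEARCH_URL = "https://api.gbif.org/v1/occurrence/search"


def build_params(scientific_name, limit=20, offset=0):
    """Query parameters for preserved specimens that have still images."""
    return {
        "scientificName": scientific_name,
        "basisOfRecord": "PRESERVED_SPECIMEN",  # assumed filter for herbarium sheets
        "mediaType": "StillImage",
        "limit": limit,
        "offset": offset,
    }


def extract_image_urls(payload):
    """Collect the still-image identifiers (URLs) from a search response."""
    urls = []
    for record in payload.get("results", []):
        for media in record.get("media", []):
            if media.get("type") == "StillImage" and media.get("identifier"):
                urls.append(media["identifier"])
    return urls


def search_specimen_images(scientific_name, limit=20):
    """Run one search request and return the image URLs it lists."""
    query = urlencode(build_params(scientific_name, limit))
    with urlopen(f"{GBIF_SEARCH_URL}?{query}", timeout=30) as response:
        return extract_image_urls(json.load(response))
```

Calling `search_specimen_images("Quercus alba", limit=5)` (a species picked only as an example) would return a list of image URLs that could then be downloaded next to the existing scraped data.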


### MAPLE

Bill text needs to be collected via the
[API](https://malegislature.gov/api/swagger/index.html?url=/api/swagger/v1/swagger.json).

Select a category on that page to see how to formulate the HTTPS request that
returns the corresponding data.

A directory has been created at `/projectnb/sparkgrp/datasets/MAPLE` that you can
use for downloaded data.


As an example, we prompted Copilot to write the Python equivalent of a
shell command that gets the list of hearings.

**Copilot Prompt:**

> Execute the equivalent of the following shell command in Python: "curl -X GET "https://malegislature.gov/api/Hearings" -H "accept: application/json""

The resulting Python code failed with an SSL certificate error. After a few
iterations of prompting Copilot with the error, it suggested disabling
certificate verification (`verify=False`) and suppressing the resulting
warning. This code worked:

```python
import requests
import urllib3

url = "https://malegislature.gov/api/Hearings"
headers = {"accept": "application/json"}

# verify=False skips SSL certificate verification; silence the warning it triggers.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
response = requests.get(url, headers=headers, verify=False)

if response.status_code == 200:
    data = response.json()
    # Show what came back: a list with one dict per hearing.
    print("Got the data!")
    print(f"Type of data: {type(data)}")
    print(f"Length of data: {len(data)}")
    for event in data:
        print(f"{event}")
else:
    print(f"Request failed with status code {response.status_code}")
```

```text
Got the data!
Type of data: <class 'list'>
Length of data: 3360
{'EventId': 4572, 'Details': 'http://malegislature.gov/api/Hearings/4572'}
{'EventId': 4745, 'Details': 'http://malegislature.gov/api/Hearings/4745'}
{'EventId': 4722, 'Details': 'http://malegislature.gov/api/Hearings/4722'}
{'EventId': 4738, 'Details': 'http://malegislature.gov/api/Hearings/4738'}
{'EventId': 4717, 'Details': 'http://malegislature.gov/api/Hearings/4717'}
{'EventId': 4737, 'Details': 'http://malegislature.gov/api/Hearings/4737'}
{'EventId': 4732, 'Details': 'http://malegislature.gov/api/Hearings/4732'}
{'EventId': 4733, 'Details': 'http://malegislature.gov/api/Hearings/4733'}
{'EventId': 4739, 'Details': 'http://malegislature.gov/api/Hearings/4739'}
{'EventId': 4744, 'Details': 'http://malegislature.gov/api/Hearings/4744'}
{'EventId': 4714, 'Details': 'http://malegislature.gov/api/Hearings/4714'}
{'EventId': 4575, 'Details': 'http://malegislature.gov/api/Hearings/4575'}
{'EventId': 4730, 'Details': 'http:/
```
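
Each entry in the listing above carries a `Details` URL. A minimal sketch of fetching one entry and saving it under the MAPLE directory might look like the following. The `hearings` subfolder, the helper names, and the rewrite of `http://` links to `https://` are assumptions for illustration, not something the API documentation prescribes. The `verify` argument is passed straight to `requests.get`, which accepts `True`, `False`, or a path to a CA bundle.

```python
import json
from pathlib import Path

MAPLE_DIR = Path("/projectnb/sparkgrp/datasets/MAPLE")


def to_https(url):
    """The listing returns http:// links; request them over HTTPS instead."""
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    return url


def hearing_path(event_id, directory=MAPLE_DIR):
    """File that will hold the JSON details for one hearing.

    The 'hearings' subfolder is a hypothetical layout choice.
    """
    return Path(directory) / "hearings" / f"{event_id}.json"


def save_hearing(event, directory=MAPLE_DIR, verify=True):
    """Fetch one hearing's details and write them to disk as JSON."""
    import requests  # third-party; imported here so the helpers above work without it

    response = requests.get(
        to_https(event["Details"]),
        headers={"accept": "application/json"},
        verify=verify,
        timeout=30,
    )
    response.raise_for_status()
    path = hearing_path(event["EventId"], directory)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(response.json(), indent=2))
    return path
```

Calling `save_hearing(event, verify=False)` for an `event` from the list would mirror the certificate workaround used above, while passing a CA bundle path instead keeps verification switched on.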

### Body Camera Timestamps

The dataset was originally provided in this [Google Drive](https://drive.google.com/drive/folders/1eMsS2tl9cgiBJ25kAfu4jjsFu1nvtnS0?usp=sharing) folder.

For convenience, it has also been downloaded to `/projectnb/sparkgrp/datasets/bodycam`.


```python
# For reference, here is the code used to download the folder
import gdown

url = "https://drive.google.com/drive/folders/1eMsS2tl9cgiBJ25kAfu4jjsFu1nvtnS0"
output = "/projectnb/sparkgrp/datasets/bodycam"

gdown.download_folder(url=url, output=output, quiet=False)
```

```text
Retrieving folder list
Retrieving folder 1MXbdg5aLjcmdjdA9ihujEffWH6zIg2m0 evidence.com_case_Protests_5_31_2020_-_6_1_2020_package_9_of_14_created_2023-03-16T17_37_05Z
Processing file 12kuZM086Ru7Z87_oIwpT_Uip09uDF60b AXON_Body_2_Video_2020-05-31_2055.mp4
Processing file 1xvzbD8FetdRXmG_84iaM325gNytfRhvz protest_downtown.mp4
Processing file 1vJgt1MUcA4XxxFzC4whbCYs-jXl0zuAN Protest-2.mp4
Processing file 1ciqBcj7r9Ir4XTeos9aMqzZaOES2bFKC PROTEST-3.mp4
Processing file 1TMPuRS9tgo1tBZInK8qj7KMzHBnFQNvQ protest.mp4
Processing file 1xjD8pnMHq8gLNGslNUcDjpba3_iFtyVy Riot_Tremont_&_TempleSt.mp4
Processing file 1J6nXcT-kjN3yiBdcU6zovqyY1QOKtE83 riot.mp4
Processing file 15SDw-7q1kkxwq31NmPibiQ1ocpob6kNX Table_of_Contents.xlsx
Retrieving folder 1j2kM8fXOqbcfj-DWdmWgbWVe-8qneBh1 evidence.com_case_Protests_5_31_2020_-_6_1_2020_package_14_of_14_created_2023-03-16T17_37_05Z
Processing file 14Zj0f9O7PYfA2VPGQk9G-JWrVSgORglA Civil_Unrest_Tremont_@_Winter.mp4
Processing file 1k1SMMg_LrapJeDMMvdBHaPjfsOMgZ_lL RIOT_-_1_Ring_Rd

Retrieving folder list completed
Building directory structure
Downloading...
From (uriginal): https://drive.google.com/uc?id=12kuZM086Ru7Z87_oIwpT_Uip09uDF60b
From (redirected): https://drive.google.com/uc?id=12kuZM086Ru7Z87_oIwpT_Uip09uDF60b&confirm=t&uuid=ce616557-04e2-4546-9df6-e4caf24477db
To: /projectnb/sparkgrp/datasets/bodycam/evidence.com_case_Protests_5_31_2020_-_6_1_2020_package_9_of_14_created_2023-03-16T17_37_05Z/AXON_Body_2_Video_2020-05-31_2055.mp4
100%|██████████| 2.82G/2.82G [00:23<00:00, 121MB/s]
Downloading...
From (uriginal): https://drive.google.com/uc?id=1xvzbD8FetdRXmG_84iaM325gNytfRhvz
From (redirected): https://drive.google.com/uc?id=1xvzbD8FetdRXmG_84iaM325gNytfRhvz&confirm=t&uuid=e40a20d3-12d5-4289-af0d-9b04486e0e89
To: /projectnb/sparkgrp/datasets/bodycam/evidence.com_case_Protests_5_31_2020_-_6_1_2020_package_9_of_14_created_2023-03-16T17_37_05Z/protest_downtown.mp4
100%|██████████| 2.69G/2.69G [00:22<00:00, 120MB/s]
Downloading...
From (uriginal): https:/

['/projectnb/sparkgrp/datasets/bodycam/evidence.com_case_Protests_5_31_2020_-_6_1_2020_package_9_of_14_created_2023-03-16T17_37_05Z/AXON_Body_2_Video_2020-05-31_2055.mp4',
 '/projectnb/sparkgrp/datasets/bodycam/evidence.com_case_Protests_5_31_2020_-_6_1_2020_package_9_of_14_created_2023-03-16T17_37_05Z/protest_downtown.mp4',
 '/projectnb/sparkgrp/datasets/bodycam/evidence.com_case_Protests_5_31_2020_-_6_1_2020_package_9_of_14_created_2023-03-16T17_37_05Z/Protest-2.mp4',
 '/projectnb/sparkgrp/datasets/bodycam/evidence.com_case_Protests_5_31_2020_-_6_1_2020_package_9_of_14_created_2023-03-16T17_37_05Z/PROTEST-3.mp4',
 '/projectnb/sparkgrp/datasets/bodycam/evidence.com_case_Protests_5_31_2020_-_6_1_2020_package_9_of_14_created_2023-03-16T17_37_05Z/protest.mp4',
 '/projectnb/sparkgrp/datasets/bodycam/evidence.com_case_Protests_5_31_2020_-_6_1_2020_package_9_of_14_created_2023-03-16T17_37_05Z/Riot_Tremont_&_TempleSt.mp4',
 '/projectnb/sparkgrp/datasets/bodycam/evidence.com_case_Protests_5_3
```

### Windows on Earth

The original dataset is stored in this
[Google Drive](https://drive.google.com/drive/folders/17k9OTFAdbD2-rZO64MyodMbzHfF6fXbx?usp=sharing) folder.

It has been downloaded to `/projectnb/sparkgrp/datasets/windows_on_earth`.
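
This notebook does not record how the Windows on Earth folder was downloaded. As a sketch only, and assuming the same `gdown` approach shown for the body camera data also applies here, a small helper could strip the `?usp=sharing` suffix from a shared link and hand the result to `gdown.download_folder`. The helper names `strip_query` and `download_drive_folder` are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Drop the query string (e.g. ?usp=sharing) so only the folder URL remains."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def download_drive_folder(shared_url, output):
    """Download a shared Google Drive folder into `output` using gdown."""
    import gdown  # third-party; imported here so strip_query works without it

    return gdown.download_folder(url=strip_query(shared_url), output=output, quiet=False)
```

With these helpers, `download_drive_folder(shared_url, "/projectnb/sparkgrp/datasets/windows_on_earth")` would target the directory named above, where `shared_url` is the Google Drive link for this dataset.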