<a href="https://colab.research.google.com/github/DrDavidL/learning-dhds/blob/main/Retrieve_Public_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Medical Dataset Search and Download

This notebook helps medical students and researchers find and download publicly available medical datasets. Use it to search Data.gov for CSV files on a medical topic and then download them for your own analysis.

### Instructions:

1.  **Enter your search term:** In the form field below, replace the value of `search_term` with your medical topic (e.g., `cancer OR malignancy`, `diabetes OR diabetic`, `cardiovascular disease OR heart disease`). This API does not support wildcard searches such as `diab*`, so combine terms with the following operators instead:

| Operator | Description |
|----------|-------------|
| `AND`    | Must match both terms |
| `OR`     | Match either term |
| `-` (minus) | Exclude a term |
| `"quoted phrase"` | Exact phrase match |

2.  **Run the search:** Execute the code cell. The notebook will then search for relevant datasets and display the results in a table. The table will include the dataset title, a brief description, and a download link.

3.  **Download a dataset:** To download a specific dataset, copy the download link from the "Download Link" column and paste it into your web browser. This will initiate the download of the CSV file.
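
The operators in the table above can also be combined in code. The following is a minimal sketch and not part of the original notebook: `build_query` is a hypothetical helper that only assembles the string you would otherwise type into `search_term`. It assumes the minus operator also applies to quoted phrases.

```python
def build_query(terms, exclude=(), match="OR"):
    """Assemble a search string from terms (hypothetical helper).

    terms   -- words or phrases to search for
    exclude -- words or phrases to exclude with the minus operator
    match   -- "OR" to match any term, "AND" to require all terms
    """
    if match not in ("AND", "OR"):
        raise ValueError("match must be 'AND' or 'OR'")
    if not terms:
        raise ValueError("at least one search term is required")

    def quote(term):
        # Wrap multi-word terms in quotes so they are matched as exact phrases
        return f'"{term}"' if " " in term else term

    query = f" {match} ".join(quote(t) for t in terms)
    for term in exclude:
        # Assumption: '-' can be placed in front of a quoted phrase as well
        query += f" -{quote(term)}"
    return query


search_term = build_query(["diabetes", "diabetic"], exclude=["gestational"])
print(search_term)
```

Note that quoting a multi-word term turns it into an exact phrase match, which is stricter than typing the words unquoted.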

In [None]:
# @title Search Data.gov for CSV datasets
search_term = "heart disease OR cardiovascular" # @param {"type":"string"}
import pandas as pd
import requests
from google.colab import data_table

# Show DataFrames as interactive tables in Colab
data_table.enable_dataframe_formatter()

# 1. Define the base URL for the Data.gov catalog search API
base_url = "https://catalog.data.gov/api/3/action/package_search"

# 2. Create a dictionary for the query parameters
query_params = {
    "q": search_term,        # search text, including any operators
    "fq": "res_format:CSV",  # only datasets that include a CSV resource
    "rows": 1000,            # maximum number of datasets to return
}

# 3. Make a GET request to the API; the timeout stops the cell from waiting forever
response = requests.get(base_url, params=query_params, timeout=60)

# 4. Convert the response to a JSON object
if response.status_code == 200:
    datasets_json = response.json()
    print("Successfully fetched datasets.")
else:
    print(f"Error: {response.status_code}")
    datasets_json = None

data = []
if datasets_json and datasets_json.get('result') and datasets_json['result'].get('results'):
    for dataset in datasets_json['result']['results']:
        title = dataset.get('title', 'No Title')
        notes = dataset.get('notes', 'No Description')
        for resource in dataset.get('resources', []):
            if resource.get('format', '').upper() == 'CSV':
                url = resource.get('url')
                if url:
                    data.append({'Title': title, 'Description': notes, 'Download Link': url})
                    break  # Move to the next dataset once a CSV link is found

df = pd.DataFrame(data)
print(f"Found {len(df)} datasets.")
display(df)

Successfully fetched datasets.
Found 55 datasets.


,Title,Description,Download Link
0,Air Quality,Dataset contains information on New York City ...,https://data.cityofnewyork.us/api/views/c3uy-2...
1,Diabetes,These datasets provide de-identified insurance...,https://data.wprdc.org/dataset/23fa923f-fc4e-4...
2,Rates and Trends in Heart Disease and Stroke M...,This dataset documents rates and trends in hea...,https://data.cdc.gov/api/views/7b9s-s8ck/rows....
3,Heart Disease Mortality Data Among US Adults (...,"2019 to 2021, 3-year average. Rates are age-st...",https://data.cdc.gov/api/views/55yu-xksw/rows....
4,Hypertension,These datasets provide de-identified insurance...,https://data.wprdc.org/dataset/3f0b8a8c-2239-4...
5,Rates and Trends in Hypertension-related Cardi...,This dataset documents rates and trends in loc...,https://data.cdc.gov/api/views/uc9k-vc2j/rows....
6,Weekly Provisional Counts of Deaths by State a...,"Effective September 27, 2023, this dataset wil...",https://data.cdc.gov/api/views/muzy-jte6/rows....
7,Surgical Site Infections (SSIs) for Operative ...,These datasets show surgical site infections (...,https://data.chhs.ca.gov/dataset/f243090b-4c05...
8,Heart Disease Mortality Data Among US Adults (...,"2018 to 2020, 3-year average. Rates are age-st...",https://data.cdc.gov/api/views/jiwm-ppbh/rows....
9,"Conditions Contributing to COVID-19 Deaths, by...","Effective September 27, 2023, this dataset wil...",https://data.cdc.gov/api/views/hk9y-quqm/rows....
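
The `rows` parameter caps how many datasets a single request returns. CKAN's `package_search` action, which this endpoint exposes, also accepts a `start` offset and reports the total number of matches in `result['count']`, so broader searches can be collected page by page. The sketch below is not part of the original notebook; `fetch_all_datasets`, `ckan_page`, and the `page_size` and `max_datasets` defaults are illustrative names and values.

```python
def fetch_all_datasets(fetch_page, page_size=100, max_datasets=5000):
    """Collect datasets page by page.

    fetch_page(start, rows) must return the 'result' dict of a
    package_search response: {'count': int, 'results': [...]}.
    """
    datasets = []
    start = 0
    while start < max_datasets:
        result = fetch_page(start, page_size)
        page = result.get("results", [])
        datasets.extend(page)
        start += page_size
        # Stop when a page is empty or every match has been requested
        if not page or start >= result.get("count", 0):
            break
    return datasets[:max_datasets]


def ckan_page(search_term):
    """Build a fetch_page function that queries catalog.data.gov."""
    import requests  # imported here so the paging logic above has no dependencies

    def fetch_page(start, rows):
        response = requests.get(
            "https://catalog.data.gov/api/3/action/package_search",
            params={"q": search_term, "fq": "res_format:CSV",
                    "rows": rows, "start": start},
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["result"]

    return fetch_page


# Example use (makes network requests):
# datasets = fetch_all_datasets(ckan_page("heart disease OR cardiovascular"))
```

Passing the page fetcher in as a function keeps the paging loop separate from the network call, so the loop can be checked with a stand-in fetcher before querying the live API.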


## Select your dataset!

### Run the following cell and follow the directions:
1. Paste the link to the desired dataset. (Double-click the table cell containing the link to select it, then right-click to copy.)
2. Enter a filename (adding a `.csv` extension, e.g., `deaths.csv`, makes the file easier to open later) and then run the cell.

(N.B. Large datasets may sometimes generate an error; be patient and try again.)
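
For large datasets, one option is to stream the response to disk in chunks instead of holding the whole file in memory. The following is a minimal sketch under that assumption, not the notebook's own method, and whether it avoids a given error depends on the cause. `csv_file_name`, `save_chunks`, and `download_csv` are hypothetical helper names; the `requests` calls used (`stream=True`, `iter_content`, `raise_for_status`) are standard.

```python
from pathlib import Path


def csv_file_name(name):
    """Return name with a .csv extension added if it is missing."""
    name = name.strip()
    return name if name.lower().endswith(".csv") else f"{name}.csv"


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path; return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            written += f.write(chunk)
    return written


def download_csv(url, name, chunk_size=1024 * 1024, timeout=120):
    """Stream url to a .csv file and return (path, size in bytes)."""
    import requests  # imported here so the helpers above have no dependencies

    target = Path(csv_file_name(name))
    # stream=True defers the body download until iter_content is consumed
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        size = save_chunks(response.iter_content(chunk_size=chunk_size), target)
    return target, size


# Example use (makes a network request):
# path, size = download_csv(download_url, file_name)
```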


In [None]:
# @title Paste the web address (URL) and enter a file name for saving!
download_url = "https://data.cdc.gov/api/views/9dzk-mvmi/rows.csv?accessType=DOWNLOAD" # @param {"type":"string","placeholder":"Paste the download link here"}
file_name = "deaths" # @param {"type":"string","placeholder":"Enter the desired filename"}
import requests

# download_url = input("Paste the download link here: ")
# file_name = input("Enter the desired filename (e.g., my_dataset.csv): ")

try:
    # The timeout keeps the cell from waiting forever if the server stops responding
    response = requests.get(download_url, timeout=120)
    if response.status_code == 200:
        with open(file_name, 'wb') as f:
            f.write(response.content)
        print(f"Successfully downloaded '{file_name}' to Colab storage")
    else:
        print(f"Error: Failed to download file. Status code: {response.status_code}")
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")

Successfully downloaded 'deaths' to Colab storage


Now download your dataset to your computer: click the folder icon on the left side of this page, hover over the file you created, click the three vertical dots, and choose the download option.

You may also run the cell below to see what your dataset looks like before analyzing in another notebook!

In [None]:
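import pandas as pd
from google.colab import data_table
data_table.enable_dataframe_formatter()
# latin1 can decode any byte sequence, so the read never fails on encoding,
# but non-ASCII characters may display incorrectly if the file is actually UTF-8
df = pd.read_csv(file_name, encoding='latin1')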
data_table.DataTable(
    df, include_index=False, num_rows_per_page=10)



,Data As Of,Start Date,End Date,Jurisdiction of Occurrence,Year,Month,All Cause,Natural Cause,Septicemia,Malignant Neoplasms,...,Intentional Self-Harm (Suicide),Assault (Homicide),Drug Overdose,COVID-19 (Multiple Cause of Death),COVID-19 (Underlying Cause of Death),flag_accid,flag_mva,flag_suic,flag_homic,flag_drugod
0,09/27/2023,01/01/2020,01/31/2020,United States,2020,1,264681,242914,3687,52635,...,4040.0,1708.0,6547.0,6,4,,,,,
1,09/27/2023,02/01/2020,02/29/2020,United States,2020,2,244966,224343,3324,48764,...,3672.0,1471.0,6435.0,25,20,,,,,
2,09/27/2023,03/01/2020,03/31/2020,United States,2020,3,269806,247634,3669,51640,...,3952.0,1693.0,7268.0,7175,6785,,,,,
3,09/27/2023,04/01/2020,04/30/2020,United States,2020,4,322424,300780,3366,48773,...,3480.0,1756.0,7938.0,65553,62014,,,,,
4,09/27/2023,05/01/2020,05/31/2020,United States,2020,5,280564,255489,3085,49012,...,3769.0,2067.0,9466.0,38330,35279,,,,,
5,09/27/2023,06/01/2020,06/30/2020,United States,2020,6,250456,225455,3036,47962,...,3985.0,2261.0,8212.0,18026,15827,,,,,
6,09/27/2023,07/01/2020,07/31/2020,United States,2020,7,279012,252481,3127,50626,...,4184.0,2426.0,8583.0,31135,28279,,,,,
7,09/27/2023,08/01/2020,08/31/2020,United States,2020,8,277282,251071,3268,51209,...,4055.0,2348.0,8351.0,29913,27031,,,,,
8,09/27/2023,09/01/2020,09/30/2020,United States,2020,9,257190,232827,3136,49671,...,3925.0,2191.0,7589.0,19158,16858,,,,,
9,09/27/2023,10/01/2020,10/31/2020,United States,2020,10,273906,249366,3250,51255,...,3804.0,2368.0,7486.0,24930,22083,,,,,
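
The preview cell reads the file with `encoding='latin1'`, which accepts any bytes but can garble accented characters when the file is really UTF-8. A minimal sketch of one alternative is to try UTF-8 first and fall back to `latin1`; `pick_encoding` is a hypothetical helper, and reading the whole file into memory is a simplification that suits small and medium files.

```python
from pathlib import Path


def pick_encoding(path, candidates=("utf-8", "latin1")):
    """Return the first candidate encoding that decodes the whole file."""
    raw = Path(path).read_bytes()
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
        return encoding
    raise ValueError(f"None of {candidates} could decode {path}")


# Example use with the notebook's variables:
# df = pd.read_csv(file_name, encoding=pick_encoding(file_name))
```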
