
## Using the Requests module to get a file

Documentation for Requests is available at https://requests.readthedocs.io/en/latest/

This demonstration requests a file from the HHS Open Data portal: https://healthdata.gov/State/EHR-Incentive-Program-Payments-Hospitals/iq9g-z8pq/about_data
In this example, we get the file from HHS, inspect some interesting information about it, then write the data to a local file.

In [1]:
import requests

In [2]:
r0 = requests.get("https://download.medicaid.gov/data/nadac-national-average-drug-acquisition-cost-2022.csv")

In [3]:
type(r0)

In [4]:
r0.status_code

200
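
A status code of 200 means the server reported success. As a small sketch of how a numeric code maps to its standard meaning, the standard library's `http.HTTPStatus` can be used. The helper names `describe` and `is_success` are illustrative and not part of Requests; in Requests itself, `r.ok` and `r.raise_for_status()` serve a similar purpose.

```python
from http import HTTPStatus

def describe(status_code):
    """Return the code with its standard reason phrase, e.g. '200 OK'."""
    return f"{status_code} {HTTPStatus(status_code).phrase}"

def is_success(status_code):
    """True for any 2xx code."""
    return 200 <= status_code < 300

print(describe(200))
print(describe(404))
print(is_success(404))
```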

In [5]:
help(open)

Help on built-in function open in module _io:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
    Open file and return a stream.  Raise OSError upon failure.

    file is either a text or byte string giving the name (and the path
    if the file isn't in the current working directory) of the file to
    be opened or an integer file descriptor of the file to be
    wrapped. (If a file descriptor is given, it is closed when the
    returned I/O object is closed, unless closefd is set to False.)

    mode is an optional string that specifies the mode in which the file
    is opened. It defaults to 'r' which means open for reading in text
    mode.  Other common values are 'w' for writing (truncating the file if
    it already exists), 'x' for creating and writing to a new file, and
    'a' for appending (which on some Unix systems, means that all writes
    append to the end of the file regardless of the current seek position).
    

In [6]:
with open('nadac.csv', 'w') as f:
    f.write(r0.text)
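
Writing `r0.text` decodes the whole body to a string first. For large downloads, an alternative is to write the raw bytes in chunks using mode `'wb'`. The sketch below assumes the chunks would come from something like `r.iter_content(chunk_size=8192)` on a request made with `stream=True`; here a hypothetical list of byte strings stands in so the example runs offline, and `save_chunks` is an illustrative helper name.

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of byte strings to path; return bytes written."""
    written = 0
    with open(path, "wb") as f:   # 'wb' = write binary, no text decoding
        for chunk in chunks:
            if chunk:             # skip empty chunks
                written += f.write(chunk)
    return written

# Stand-in for r.iter_content(...): made-up CSV bytes, not real NADAC data
fake_chunks = [b"ndc,price\n", b"", b"00000000001,0.02\n"]
out_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
n_bytes = save_chunks(fake_chunks, out_path)
print(n_bytes)
```

Writing bytes also sidesteps any question about which text encoding the response was decoded with.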

In [None]:
%%time
r = requests.get("https://gis.dhcs.ca.gov/api/download/v1/items/e175f359277641348b019da88519fe8d/csv?layers=0")

CPU times: user 20.5 ms, sys: 2.8 ms, total: 23.3 ms
Wall time: 1.33 s


In [None]:
type(r)

requests.models.Response

In [None]:
r.status_code

200

In [None]:
with open('ehr.csv','w') as f:
    f.write(r.text)

In [None]:
lines = 0
for row in r.text.split('\n'):
    lines += 1
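
One caveat with counting via `split('\n')`: if the text ends with a newline, the split produces a trailing empty string, so the count is one higher than the number of rows. A small sketch with a made-up three-line string (not the real file) shows the difference from `str.splitlines()`:

```python
sample = "X,Y,OBJECTID\n1,2,3\n4,5,6\n"  # made-up text ending in a newline

count_split = len(sample.split("\n"))      # includes a trailing ''
count_lines = len(sample.splitlines())     # counts only real lines

print(count_split, count_lines)
```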

In [None]:
r.text[0:1000]

'ï»¿X,Y,OBJECTID,Provider_Name,NPI,CCN,Business_Street_Address,Business_City,Business_County,Business_ZIP_Code,Business_State_Territory,Payment_Year_Number,Program_Type,Medicaid_EP_Hospital_Type,total_payments,Last_Payment_Criteria,Recent_Disbursement_Amount,Latitude,Longitude,Last_Program_Year,Last_Payment_Year\n-124.142008559,40.783559489,1,ST JOSEPH HEALTH NORTHERN CALIFORNIA LLC,1609858950,50006,2700 Dolbeer St,Eureka,Humboldt,95501,California,4,Medicare/Medicaid,Acute Care Hospitals,1530950.7,MU,153095.07,40.7835594893242,-124.142008559438,2015,2016\n-122.086674,37.632915,2,HAYWARD SISTERS HOSPITAL,1942298153,50002,27200 Calaroga Ave,Hayward,Alameda,94545,California,4,Medicare/Medicaid,Acute Care Hospitals,3245920.28,MU,324592.03,37.632915,-122.086674,2015,2016\n-122.295861,38.3254020000001,3,ST JOSEPH HEALTH NORTHERN CALIFORNIA LLC,1235218785,50009,1000 Trancas St,Napa,Napa,94558,California,4,Medicare/Medicaid,Acute Care Hospitals,1262015.89,MU,126201.59,38.325402,-122.295861,201
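
The `ï»¿` at the start of this output looks like a UTF-8 byte order mark (bytes `EF BB BF`) that was decoded as if it were Latin-1. A minimal sketch of the effect, using a short made-up byte string rather than the real response:

```python
raw = b"\xef\xbb\xbfX,Y,OBJECTID\n"    # BOM followed by the first header names

as_latin1 = raw.decode("latin-1")      # BOM bytes become three visible characters
as_utf8_sig = raw.decode("utf-8-sig")  # BOM is recognized and dropped

print(repr(as_latin1))
print(repr(as_utf8_sig))
```

With Requests, setting `r.encoding = 'utf-8-sig'` before reading `r.text` should apply the same decoding, assuming the body really is UTF-8.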

In [None]:
r.headers

{'Accept-Ranges': 'bytes', 'Last-Modified': 'Wed, 18 Jan 2023 21:41:45 GMT', 'ETag': '"b74718922d16e79b8519a608ce73a317:1674078105.768109"', 'Content-Length': '113687401', 'Date': 'Wed, 08 Oct 2025 22:04:05 GMT', 'Connection': 'keep-alive', 'Content-Type': 'application/octet-stream', 'content-disposition': 'attachment', 'Strict-Transport-Security': 'max-age=31536000'}

In [None]:
import json
print(json.dumps(dict(r.headers), indent=4))

{
    "Date": "Wed, 08 Oct 2025 22:07:31 GMT",
    "Content-Type": "text/csv",
    "Content-Length": "24657",
    "Connection": "keep-alive",
    "Vary": "Accept-Encoding",
    "x-amz-replication-status": "COMPLETED",
    "Last-Modified": "Tue, 24 Jun 2025 17:33:54 GMT",
    "ETag": "\"fca51ae9619ce9d4176ec26b691ef293\"",
    "x-amz-server-side-encryption": "AES256",
    "Cache-Control": "must-revalidate",
    "x-amz-meta-cachetime": "731",
    "Content-Disposition": "attachment; filename=\"EHR_Incentive_Program_Payments_-_Hospitals.csv\"",
    "Content-Encoding": "gzip",
    "x-amz-meta-contentlastmodified": "2021-05-18T22:12:26.698Z",
    "Strict-Transport-Security": "max-age=31536000",
    "X-Content-Type-Options": "nosniff",
    "Content-Security-Policy": "upgrade-insecure-requests"
}
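
Header values are always strings, so they usually need converting before use. The sketch below takes three values copied from the output above and parses them with the standard library only; it is one possible approach, not something Requests does for you.

```python
from email.message import Message
from email.utils import parsedate_to_datetime

# Values copied from the header output above
headers = {
    "Content-Length": "24657",
    "Last-Modified": "Tue, 24 Jun 2025 17:33:54 GMT",
    "Content-Disposition": 'attachment; filename="EHR_Incentive_Program_Payments_-_Hospitals.csv"',
}

size_bytes = int(headers["Content-Length"])
last_modified = parsedate_to_datetime(headers["Last-Modified"])

# Let the email package parse the filename parameter
msg = Message()
msg["Content-Disposition"] = headers["Content-Disposition"]
filename = msg.get_filename()

print(size_bytes, last_modified.isoformat(), filename)
```

Because this response also carries `Content-Encoding: gzip`, the `Content-Length` most likely describes the compressed body rather than the decoded text.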


### Total payments in this file, using the csv module

In [None]:
import csv

In [None]:
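total = 0

# newline='' is what the csv module documentation recommends when opening files
with open('ehr.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # first row holds the column names
    payment_idx = header.index('total_payments')
    for record in reader:
        total += float(record[payment_idx])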


In [None]:
print("CA hospitals have received ${:,.2f} in payments.".format(total))

CA hospitals have received $792,870,050.99 in payments.
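
As a variation on the same calculation, `csv.DictReader` looks columns up by name instead of by index. This is a minimal sketch on an in-memory sample: the first two rows reuse provider names and amounts from the sample output earlier, the third row is hypothetical, and skipping blank amounts is an assumption about how missing values should be treated.

```python
import csv
import io

# Two rows adapted from the sample output earlier; columns trimmed for brevity
sample = io.StringIO(
    "Provider_Name,total_payments\n"
    "ST JOSEPH HEALTH NORTHERN CALIFORNIA LLC,1530950.7\n"
    "HAYWARD SISTERS HOSPITAL,3245920.28\n"
    "HYPOTHETICAL ROW WITH NO AMOUNT,\n"
)

total = 0.0
for record in csv.DictReader(sample):
    value = record["total_payments"].strip()
    if value:                      # skip blanks instead of raising ValueError
        total += float(value)

print("${:,.2f}".format(total))
```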


## Reading internet files with Pandas

When you pass an HTTP(S) URL to a pandas reader such as `read_csv`, pandas downloads the data from the internet itself, so no separate Requests call is needed.

https://pandas.pydata.org/pandas-docs/version/0.23.4/io.html


In [None]:
import pandas as pd

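In [None]:
# Read the CSV directly from the same URL requested earlier; pandas fetches it over HTTP
df = pd.read_csv("https://gis.dhcs.ca.gov/api/download/v1/items/e175f359277641348b019da88519fe8d/csv?layers=0")
# Column name as it appears in that file's header row
print("CA hospitals have received ${:,.2f} in payments.".format(df['total_payments'].sum()))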