# Notebook Information

This notebook is a first look at the acquired datasets: what data we have and how it can be merged and aligned for future modeling work.

In [1]:
# Standard package imports

# Data analysis
import pandas as pd
import numpy as np

# Plotting and Correlation Maths
import seaborn as sns
import scipy as sci

# Simple model development
import sklearn as sk

import os

In [3]:
# You will probably be in the /notebooks directory if you've placed your
#  Jupyter notebook in the right spot.
os.getcwd()

'c:\\Users\\btb51\\Documents\\GitHub\\DAAN881_Team1\\opioid_analytics\\notebooks'

# Reading data and looking at some summary statistics

## data-table

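This table lists each State, the Data reported, and the Years available. In the summary below, 25 of the 50 states are marked `Unfunded State`, and only 25 rows have a value for Years available.

This may not actually be useful.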

In [4]:
dt = pd.read_csv('../data/raw/data-table.csv')
dt.head(5)

Unnamed: 0,State,Data reported,Years available
0,Alabama,Unfunded State,
1,Alaska,Emergency Department & Inpatient Hospitalization,2018-2022
2,Arizona,Unfunded State,
3,Arkansas,Unfunded State,
4,California,Emergency Department Only,2018-2022


In [5]:
dt.describe()

Unnamed: 0,State,Data reported,Years available
count,50,50,25
unique,50,4,2
top,Alabama,Unfunded State,2018-2022
freq,1,25,24
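
If this table does turn out to be useful, a minimal sketch of pulling out the funded states and their year range could look like the following. The frame `dt_toy` is a stand-in built from the `dt.head(5)` rows above, and the sketch assumes every funded row has a `YYYY-YYYY` value in `Years available`, which has not been checked against the full file.

```python
import pandas as pd

# Toy rows copied from the dt.head(5) output above
dt_toy = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    'Data reported': [
        'Unfunded State',
        'Emergency Department & Inpatient Hospitalization',
        'Unfunded State',
        'Unfunded State',
        'Emergency Department Only',
    ],
    'Years available': [None, '2018-2022', None, None, '2018-2022'],
})

# Keep only the states that report data
funded = dt_toy[dt_toy['Data reported'] != 'Unfunded State'].copy()

# Split 'YYYY-YYYY' into two integer columns
# (assumption: no funded row is missing this value)
years = funded['Years available'].str.split('-', expand=True).astype(int)
funded['first_year'] = years[0]
funded['last_year'] = years[1]

print(funded[['State', 'first_year', 'last_year']])
```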


## County Opioid Dispensing Rates

Data provided by the CDC at:  https://www.cdc.gov/overdose-prevention/data-research/facts-stats/opioid-dispensing-rate-maps.html

In [6]:
count_disp_rates_df = pd.read_csv('../data/raw/County Opioid Dispensing Rates.csv')
count_disp_rates_df.head(5)

Unnamed: 0,FullGeoName,YEAR,STATE_NAME,STATE_ABBREV,COUNTY_NAME,STATE_COUNTY_FIP_U,opioid_dispensing_rate,Opioid Dispensing Rate (per 100 persons)
0,"AL, Autauga",2019,Alabama,AL,Autauga County,1001,102.8,>51.0
1,"AL, Baldwin",2019,Alabama,AL,Baldwin County,1003,67.9,>51.0
2,"AL, Barbour",2019,Alabama,AL,Barbour County,1005,27.6,18.6 - 32.2
3,"AL, Bibb",2019,Alabama,AL,Bibb County,1007,21.0,18.6 - 32.2
4,"AL, Blount",2019,Alabama,AL,Blount County,1009,23.8,18.6 - 32.2


In [27]:
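# COUNTY_NAME values carry a " County" suffix (e.g. 'Autauga County'),
# so an exact match on 'Allegheny' finds no rows.
sum(count_disp_rates_df['COUNTY_NAME'] == 'Allegheny')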

0
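
The zero above most likely comes from the naming convention rather than from missing data. Here is a small sketch of two lookups that should be more robust: matching the full county name together with the state, or matching the county FIPS code (42003 is Allegheny County, PA). The frame `counties` is a toy stand-in, and the Pennsylvania row in it is an assumption about what the real file contains.

```python
import pandas as pd

# Toy stand-in; the PA row is assumed, not copied from the real file
counties = pd.DataFrame({
    'STATE_ABBREV': ['AL', 'AL', 'PA'],
    'COUNTY_NAME': ['Autauga County', 'Baldwin County', 'Allegheny County'],
    'STATE_COUNTY_FIP_U': [1001, 1003, 42003],
})

# The original check: no match because of the " County" suffix
exact = (counties['COUNTY_NAME'] == 'Allegheny').sum()

# Option 1: match the full name and the state
by_name = counties[
    (counties['COUNTY_NAME'] == 'Allegheny County')
    & (counties['STATE_ABBREV'] == 'PA')
]

# Option 2: match the county FIPS code
by_fips = counties[counties['STATE_COUNTY_FIP_U'] == 42003]

# A suffix-free name is handy for joining to the overdose data,
# which stores the county as plain 'Allegheny'
counties['county_short'] = counties['COUNTY_NAME'].str.replace(
    r' County$', '', regex=True
)

print(exact, len(by_name), len(by_fips))
```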

In [7]:
# What years do we have county data for?
count_disp_rates_df['YEAR'].unique()

array([2019, 2020, 2021, 2022], dtype=int64)

In [8]:
count_disp_rates_df.describe()

Unnamed: 0,YEAR,STATE_COUNTY_FIP_U,opioid_dispensing_rate
count,12340.0,12340.0,12331.0
mean,2020.498298,30329.250729,38.306496
std,1.118513,15176.674079,30.690663
min,2019.0,1001.0,0.0
25%,2019.0,18171.0,18.6
50%,2020.0,29147.0,32.2
75%,2021.0,45069.5,51.0
max,2022.0,56045.0,569.1


## State Opioid Dispensing Rates

Data provided by the CDC at:  https://www.cdc.gov/overdose-prevention/data-research/facts-stats/opioid-dispensing-rate-maps.html

In [9]:
state_disp_rates_df = pd.read_csv('../data/raw/State Opioid Dispensing Rates.csv')
state_disp_rates_df.head(5)

Unnamed: 0,YEAR,STATE_NAME,STATE_ABBREV,STATE_FIPS,opioid_dispensing_rate,Opioid Dispensing Rate (per 100 persons)
0,2019,Alabama,AL,1,86.0,>52.8
1,2019,Alaska,AK,2,39.3,36.1 - 42.7
2,2019,Arizona,AZ,4,44.2,42.8 - 52.8
3,2019,Arkansas,AR,5,81.1,>52.8
4,2019,California,CA,6,31.0,<36.1


In [10]:
# What Years do we have state data for?
state_disp_rates_df['YEAR'].unique()

array([2019, 2020, 2021, 2022], dtype=int64)

In [11]:
state_disp_rates_df.describe()

Unnamed: 0,YEAR,STATE_FIPS,opioid_dispensing_rate
count,204.0,204.0,204.0
mean,2020.5,28.960784,45.566176
std,1.120784,15.715401,12.663324
min,2019.0,1.0,24.3
25%,2019.75,16.0,36.175
50%,2020.5,29.0,42.75
75%,2021.25,42.0,52.825
max,2022.0,56.0,86.0
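
Since the county and state tables share `YEAR` and `STATE_ABBREV`, one possible way to align them is a left merge from county to state. This is only a sketch on toy rows copied from the two `head(5)` outputs above; the derived column `rate_vs_state` is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Toy rows copied from the county and state head() outputs
county = pd.DataFrame({
    'YEAR': [2019, 2019],
    'STATE_ABBREV': ['AL', 'AL'],
    'COUNTY_NAME': ['Autauga County', 'Baldwin County'],
    'opioid_dispensing_rate': [102.8, 67.9],
})
state = pd.DataFrame({
    'YEAR': [2019, 2019],
    'STATE_ABBREV': ['AL', 'AK'],
    'opioid_dispensing_rate': [86.0, 39.3],
})

# Attach the state-level rate to every county row.
# validate='many_to_one' raises if the state table has duplicate keys.
merged = county.merge(
    state,
    on=['YEAR', 'STATE_ABBREV'],
    how='left',
    suffixes=('_county', '_state'),
    validate='many_to_one',
)

# Hypothetical derived feature: county rate relative to its state
merged['rate_vs_state'] = (
    merged['opioid_dispensing_rate_county']
    - merged['opioid_dispensing_rate_state']
)

print(merged)
```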


## DOSE DX Dashboard Information

Data provided by the CDC at: https://www.cdc.gov/overdose-prevention/data-research/facts-stats/dose-dashboard-nonfatal-discharge-data.html

Additional information about DOSE can be found at: https://www.cdc.gov/overdose-prevention/data-research/facts-stats/about-dose-system.html

`TODO`
This workbook needs to be read sheet by sheet. By default, `pd.read_excel` loads only the first sheet, 'Data Dictionary', which just describes the variables in the other sheets.

In [12]:
dose_df = pd.read_excel('../data/raw/dose_dx_dashboard_v6.xlsx')
dose_df
# dose_data = pd.read_excel('../data/raw/dose_dx_dashboard_v6.xlsx', engine='calamine')
# dose_data
# cols = dose_data[0]
# vals = dose_data[1:]

# dose_df = pd.DataFrame(vals, columns=cols)

# dose_df

Unnamed: 0,Variable,Definition,Variable Values
0,US State Submission Counts,This sheet provides information on the number ...,
1,dataset,Indicator for the data source for that row of ...,Character (ED or HOSP)
2,month,All (annual) or month value ranging from 1-12 ...,Character (1-12 or all) (where 1=January and 1...
3,year,Year of analysis for that row of data.,"Numeric (2018, 2019, 2020, 2021 or 2022)"
4,jurisdiction_count,Count of states reporting ED visits or Inpatie...,Numeric
5,State Counts & Rates,This sheet provides information on the monthly...,
6,state,The two-digit state abbreviation.,"Character (e.g., AK=Alaska, US=overall for par..."
7,month,All (annual) or month value ranging from 1-12 ...,Character (1-12 or all) (where 1=January and 1...
8,year,Year of analysis for that row of data.,"Numeric (2018, 2019, 2020, 2021 or 2022)"
9,time_frame,Time frame (annual or monthly) for that row of...,Character (annual or monthly)


In [13]:
dose_df.describe()

Unnamed: 0,Variable,Definition,Variable Values
count,51,48,44
unique,28,35,19
top,year,Year of analysis for that row of data.,Character (count value or suppressed)*
freq,4,4,9
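
For the sheet-by-sheet TODO, `pd.read_excel(..., sheet_name=None)` returns a dict that maps each sheet name to a DataFrame. The sketch below shows a small helper, `summarize_sheets` (a hypothetical name), that reports the shape and columns of each sheet. It runs on a toy dict here; the sheet names in the toy dict are guesses based on the data dictionary rows above.

```python
import pandas as pd

def summarize_sheets(sheets):
    """Return one row per sheet with its size and column names."""
    rows = [
        {
            'sheet': name,
            'n_rows': len(df),
            'n_cols': df.shape[1],
            'columns': ', '.join(map(str, df.columns)),
        }
        for name, df in sheets.items()
    ]
    return pd.DataFrame(rows)

# With the real workbook this would be (not run here):
# sheets = pd.read_excel('../data/raw/dose_dx_dashboard_v6.xlsx',
#                        sheet_name=None, engine='calamine')

# Toy stand-in; sheet names and contents are assumptions
sheets = {
    'Data Dictionary': pd.DataFrame({'Variable': ['dataset', 'month']}),
    'State Counts & Rates': pd.DataFrame(
        {'state': ['AK'], 'month': ['all'], 'year': [2018]}
    ),
}

summary = summarize_sheets(sheets)
print(summary)
```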


## Weather Events

`TODO` Where did we get this data?


`Question` This dataset seems to be highly detailed for the Western PA region. We may need to narrow our analysis from the state of PA to the Allegheny County (Pittsburgh) region.

In [100]:
# Reading this file takes about 4 minutes on my system (BB) with the
#  default pandas Excel engine.
# The calamine engine (pandas 2.2+, via the python-calamine package in
#  environment.yaml) cuts this to about 30 seconds, so I would recommend
#  installing python-calamine.

weather_events_df = pd.read_excel('../data/raw/WeatherEvents_Jan2016-Dec2022.xlsx', engine='calamine')
weather_events_df.head(5)

Unnamed: 0,EventId,Type,Severity,StartTime(UTC),EndTime(UTC),Precipitation(in),TimeZone,AirportCode,LocationLat,LocationLng,City,County,State,ZipCode
0,W-1,Snow,Light,2016-01-06 23:14:00,2016-01-07 00:34:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
1,W-2,Snow,Light,2016-01-07 04:14:00,2016-01-07 04:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
2,W-3,Snow,Light,2016-01-07 05:54:00,2016-01-07 15:34:00,0.03,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
3,W-4,Snow,Light,2016-01-08 05:34:00,2016-01-08 05:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
4,W-5,Snow,Light,2016-01-08 13:54:00,2016-01-08 15:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0


In [101]:
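# This only has 45 of the 50 states.
# The count is computed first because reusing the same quote character
# inside an f-string is a syntax error before Python 3.12.
n_states = weather_events_df['State'].nunique()
print(f'There are {n_states} out of 50 states.')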
weather_events_df['State'].unique()


There are 45 out of 50 states.


array(['CO', 'OK', 'MN', 'LA', 'WI', 'ID', 'MI', 'KS', 'WY', 'MA', 'MO',
       'NM', 'NC', 'SC', 'RI', 'VA', 'CT', 'OR', 'ND', 'CA', 'NY', 'OH',
       'SD', 'AZ', 'NV', 'IA', 'TX', 'GA', 'NE', 'TN', 'AL', 'IL', 'AR',
       'WA', 'IN', 'UT', 'FL', 'WV', 'MS', 'PA', 'ME', 'MD', 'NJ', 'KY',
       'VT'], dtype=object)
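
To see which states are absent, the array above can be compared against the full list of 50 state abbreviations. The sketch below hard-codes the 45 codes from the output; with the real frame, `present` would come from `weather_events_df['State'].unique()`.

```python
# All 50 US state postal abbreviations
ALL_STATES = {
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA',
    'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
    'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
    'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
    'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY',
}

# State codes copied from the unique() output above
present = [
    'CO', 'OK', 'MN', 'LA', 'WI', 'ID', 'MI', 'KS', 'WY', 'MA', 'MO',
    'NM', 'NC', 'SC', 'RI', 'VA', 'CT', 'OR', 'ND', 'CA', 'NY', 'OH',
    'SD', 'AZ', 'NV', 'IA', 'TX', 'GA', 'NE', 'TN', 'AL', 'IL', 'AR',
    'WA', 'IN', 'UT', 'FL', 'WV', 'MS', 'PA', 'ME', 'MD', 'NJ', 'KY',
    'VT',
]

# States with no weather events in this file
missing = sorted(ALL_STATES - set(present))
print(missing)  # ['AK', 'DE', 'HI', 'MT', 'NH']
```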

In [102]:
weather_events_df.describe()

Unnamed: 0,Precipitation(in),LocationLat,LocationLng,ZipCode
count,999999.0,999999.0,999999.0,991683.0
mean,0.08989,39.453285,-92.545095,52737.512165
std,0.561167,5.132021,13.733891,26410.391978
min,0.0,24.7263,-123.1408,1201.0
25%,0.0,36.0103,-99.3216,31510.0
50%,0.0,39.7945,-90.9938,55008.0
75%,0.05,43.208,-82.1591,73703.0
max,115.6,48.929,-68.0179,99362.0


## Larger Generic Weather Dataset

In [103]:
generic_weather_df = pd.read_excel('../data/raw/Allegheny County 2017-01-01 to 2024-05-01.xlsx', engine='calamine')
generic_weather_df.head(5)

Unnamed: 0,name,datetime,tempmax,tempmin,temp,feelslikemax,feelslikemin,feelslike,dew,humidity,...,solarenergy,uvindex,severerisk,sunrise,sunset,moonphase,conditions,description,icon,stations
0,Allegheny County,2017-01-01,43.9,31.4,37.1,40.1,29.0,33.2,28.3,71.1,...,8.2,4,,2017-01-01T07:43:20,2017-01-01T17:04:23,0.12,"Snow, Rain, Partially cloudy",Clearing in the afternoon with early morning s...,rain,"KBTP,KAGC,72512464705,KPIT,72520514762,7252009..."
1,Allegheny County,2017-01-02,46.3,37.0,42.6,43.3,33.2,38.4,38.9,87.1,...,1.3,1,,2017-01-02T07:43:25,2017-01-02T17:05:15,0.16,"Rain, Partially cloudy",Partly cloudy throughout the day with a chance...,rain,"KBTP,KAGC,72512464705,KPIT,72520514762,7252009..."
2,Allegheny County,2017-01-03,48.6,42.9,45.1,46.3,37.7,41.6,42.6,91.0,...,0.7,0,,2017-01-03T07:43:27,2017-01-03T17:06:08,0.19,"Rain, Partially cloudy",Partly cloudy throughout the day with rain.,rain,"KBTP,KAGC,72512464705,KPIT,72520514762,7252009..."
3,Allegheny County,2017-01-04,44.3,21.1,33.8,37.4,8.0,23.6,23.4,67.4,...,6.8,4,,2017-01-04T07:43:28,2017-01-04T17:07:02,0.23,"Snow, Rain, Partially cloudy",Partly cloudy throughout the day with rain or ...,rain,"KBTP,KAGC,72512464705,KPIT,72520514762,7252009..."
4,Allegheny County,2017-01-05,22.6,19.0,20.6,20.8,7.0,11.2,11.5,68.4,...,6.0,3,,2017-01-05T07:43:26,2017-01-05T17:07:58,0.25,"Snow, Partially cloudy",Partly cloudy throughout the day with snow.,snow,"KBTP,KAGC,72512464705,KPIT,72520514762,7252009..."


In [104]:
generic_weather_df

Unnamed: 0,name,datetime,tempmax,tempmin,temp,feelslikemax,feelslikemin,feelslike,dew,humidity,...,solarenergy,uvindex,severerisk,sunrise,sunset,moonphase,conditions,description,icon,stations
0,Allegheny County,2017-01-01,43.9,31.4,37.1,40.1,29.0,33.2,28.3,71.1,...,8.2,4,,2017-01-01T07:43:20,2017-01-01T17:04:23,0.12,"Snow, Rain, Partially cloudy",Clearing in the afternoon with early morning s...,rain,"KBTP,KAGC,72512464705,KPIT,72520514762,7252009..."
1,Allegheny County,2017-01-02,46.3,37.0,42.6,43.3,33.2,38.4,38.9,87.1,...,1.3,1,,2017-01-02T07:43:25,2017-01-02T17:05:15,0.16,"Rain, Partially cloudy",Partly cloudy throughout the day with a chance...,rain,"KBTP,KAGC,72512464705,KPIT,72520514762,7252009..."
2,Allegheny County,2017-01-03,48.6,42.9,45.1,46.3,37.7,41.6,42.6,91.0,...,0.7,0,,2017-01-03T07:43:27,2017-01-03T17:06:08,0.19,"Rain, Partially cloudy",Partly cloudy throughout the day with rain.,rain,"KBTP,KAGC,72512464705,KPIT,72520514762,7252009..."
3,Allegheny County,2017-01-04,44.3,21.1,33.8,37.4,8.0,23.6,23.4,67.4,...,6.8,4,,2017-01-04T07:43:28,2017-01-04T17:07:02,0.23,"Snow, Rain, Partially cloudy",Partly cloudy throughout the day with rain or ...,rain,"KBTP,KAGC,72512464705,KPIT,72520514762,7252009..."
4,Allegheny County,2017-01-05,22.6,19.0,20.6,20.8,7.0,11.2,11.5,68.4,...,6.0,3,,2017-01-05T07:43:26,2017-01-05T17:07:58,0.25,"Snow, Partially cloudy",Partly cloudy throughout the day with snow.,snow,"KBTP,KAGC,72512464705,KPIT,72520514762,7252009..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2673,Allegheny County,2024-04-27,79.1,53.0,65.3,79.1,53.0,65.3,50.3,60.5,...,15.3,7,10.0,2024-04-27T06:23:41,2024-04-27T20:12:13,0.63,"Rain, Partially cloudy",Partly cloudy throughout the day with rain cle...,rain,"KBTP,KAGC,72512464705,KPIT,72520514762,7252009..."
2674,Allegheny County,2024-04-28,82.0,63.2,72.6,81.5,63.2,72.6,55.3,55.6,...,20.8,10,10.0,2024-04-28T06:22:22,2024-04-28T20:13:15,0.67,Partially cloudy,Partly cloudy throughout the day.,partly-cloudy-day,"KBTP,KAGC,72512464705,KPIT,72520514762,7252009..."
2675,Allegheny County,2024-04-29,82.9,62.6,74.1,82.1,62.6,73.9,54.6,52.0,...,23.1,9,10.0,2024-04-29T06:21:04,2024-04-29T20:14:17,0.70,Partially cloudy,Becoming cloudy in the afternoon.,partly-cloudy-day,"KBTP,KAGC,72512464705,KPIT,72520514762,7252009..."
2676,Allegheny County,2024-04-30,72.3,61.1,65.9,72.3,61.1,65.9,56.1,71.0,...,11.9,6,10.0,2024-04-30T06:19:47,2024-04-30T20:15:18,0.74,"Rain, Partially cloudy",Partly cloudy throughout the day with afternoo...,rain,"KBTP,KAGC,72512464705,KPIT,72520514762,7252009..."


In [105]:
generic_weather_df.columns

Index(['name', 'datetime', 'tempmax', 'tempmin', 'temp', 'feelslikemax',
       'feelslikemin', 'feelslike', 'dew', 'humidity', 'precip', 'precipprob',
       'precipcover', 'preciptype', 'snow', 'snowdepth', 'windgust',
       'windspeed', 'winddir', 'sealevelpressure', 'cloudcover', 'visibility',
       'solarradiation', 'solarenergy', 'uvindex', 'severerisk', 'sunrise',
       'sunset', 'moonphase', 'conditions', 'description', 'icon', 'stations'],
      dtype='object')

In [106]:
gen_weather_sel_cols = [
    'name','datetime','tempmax','tempmin','temp','feelslikemax',
    'feelslikemin','feelslike','dew','humidity','precip','precipprob',
    'precipcover','preciptype','snow','snowdepth','windgust','windspeed',
    'winddir','sealevelpressure','cloudcover','visibility','solarradiation','solarenergy',
    'uvindex','severerisk','sunrise','sunset','moonphase'
]
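
The selection list above is every column except the four free-text ones at the end (`conditions`, `description`, `icon`, `stations`). An equivalent and shorter way to express that, sketched here on an empty toy frame with the same column names, is to drop those four.

```python
import pandas as pd

# Column names copied from generic_weather_df.columns
all_cols = [
    'name', 'datetime', 'tempmax', 'tempmin', 'temp', 'feelslikemax',
    'feelslikemin', 'feelslike', 'dew', 'humidity', 'precip', 'precipprob',
    'precipcover', 'preciptype', 'snow', 'snowdepth', 'windgust',
    'windspeed', 'winddir', 'sealevelpressure', 'cloudcover', 'visibility',
    'solarradiation', 'solarenergy', 'uvindex', 'severerisk', 'sunrise',
    'sunset', 'moonphase', 'conditions', 'description', 'icon', 'stations',
]

# Empty stand-in for generic_weather_df
weather_toy = pd.DataFrame(columns=all_cols)

# Drop the free-text columns instead of listing the 29 to keep
text_cols = ['conditions', 'description', 'icon', 'stations']
selected = weather_toy.drop(columns=text_cols)

print(len(selected.columns))
```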

In [107]:
down_sel_generic_weather_df = generic_weather_df[gen_weather_sel_cols]
down_sel_generic_weather_df

Unnamed: 0,name,datetime,tempmax,tempmin,temp,feelslikemax,feelslikemin,feelslike,dew,humidity,...,sealevelpressure,cloudcover,visibility,solarradiation,solarenergy,uvindex,severerisk,sunrise,sunset,moonphase
0,Allegheny County,2017-01-01,43.9,31.4,37.1,40.1,29.0,33.2,28.3,71.1,...,1021.5,41.7,9.5,95.6,8.2,4,,2017-01-01T07:43:20,2017-01-01T17:04:23,0.12
1,Allegheny County,2017-01-02,46.3,37.0,42.6,43.3,33.2,38.4,38.9,87.1,...,1021.8,87.9,5.5,15.5,1.3,1,,2017-01-02T07:43:25,2017-01-02T17:05:15,0.16
2,Allegheny County,2017-01-03,48.6,42.9,45.1,46.3,37.7,41.6,42.6,91.0,...,1006.3,78.7,5.3,7.9,0.7,0,,2017-01-03T07:43:27,2017-01-03T17:06:08,0.19
3,Allegheny County,2017-01-04,44.3,21.1,33.8,37.4,8.0,23.6,23.4,67.4,...,1007.1,73.0,9.0,78.9,6.8,4,,2017-01-04T07:43:28,2017-01-04T17:07:02,0.23
4,Allegheny County,2017-01-05,22.6,19.0,20.6,20.8,7.0,11.2,11.5,68.4,...,1015.1,83.5,6.4,68.1,6.0,3,,2017-01-05T07:43:26,2017-01-05T17:07:58,0.25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2673,Allegheny County,2024-04-27,79.1,53.0,65.3,79.1,53.0,65.3,50.3,60.5,...,1022.5,65.9,9.6,177.0,15.3,7,10.0,2024-04-27T06:23:41,2024-04-27T20:12:13,0.63
2674,Allegheny County,2024-04-28,82.0,63.2,72.6,81.5,63.2,72.6,55.3,55.6,...,1019.8,47.9,9.9,241.0,20.8,10,10.0,2024-04-28T06:22:22,2024-04-28T20:13:15,0.67
2675,Allegheny County,2024-04-29,82.9,62.6,74.1,82.1,62.6,73.9,54.6,52.0,...,1015.8,27.2,9.9,265.8,23.1,9,10.0,2024-04-29T06:21:04,2024-04-29T20:14:17,0.70
2676,Allegheny County,2024-04-30,72.3,61.1,65.9,72.3,61.1,65.9,56.1,71.0,...,1012.7,68.1,9.9,140.0,11.9,6,10.0,2024-04-30T06:19:47,2024-04-30T20:15:18,0.74


In [None]:
gen_weather_sel_cols

In [117]:
down_sel_generic_weather_df[gen_weather_sel_cols[20:30]].describe()

Unnamed: 0,cloudcover,visibility,solarradiation,solarenergy,uvindex,severerisk,moonphase
count,2678.0,2678.0,2678.0,2678.0,2678.0,843.0,2678.0
mean,54.78174,8.912211,125.280358,10.809858,4.79537,14.827995,0.483342
std,29.013607,1.339226,79.305213,6.849332,2.471342,14.298733,0.288079
min,0.0,1.1,5.8,0.5,0.0,10.0,0.0
25%,29.8,8.5,59.05,5.1,3.0,10.0,0.25
50%,55.25,9.5,110.55,9.5,5.0,10.0,0.49
75%,80.475,9.9,184.075,15.9,7.0,10.0,0.7475
max,100.0,9.9,345.4,29.8,10.0,100.0,0.98


In [111]:
down_sel_generic_weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2678 entries, 0 to 2677
Data columns (total 29 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              2678 non-null   object 
 1   datetime          2678 non-null   object 
 2   tempmax           2678 non-null   float64
 3   tempmin           2678 non-null   float64
 4   temp              2678 non-null   float64
 5   feelslikemax      2678 non-null   float64
 6   feelslikemin      2678 non-null   float64
 7   feelslike         2678 non-null   float64
 8   dew               2678 non-null   float64
 9   humidity          2678 non-null   float64
 10  precip            2678 non-null   float64
 11  precipprob        2678 non-null   int64  
 12  precipcover       2678 non-null   float64
 13  preciptype        1648 non-null   object 
 14  snow              2678 non-null   float64
 15  snowdepth         2678 non-null   float64
 16  windgust          2678 non-null   float64
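
The `info()` output shows `datetime` stored as `object`, which would get in the way of any date-based join. Below is a minimal sketch of converting the date-like columns with `pd.to_datetime`, using two rows copied from the output above. The `daylight_hours` column is a hypothetical derived feature, not something the notebook already computes.

```python
import pandas as pd

# Two rows copied from the generic weather output above
weather_toy = pd.DataFrame({
    'datetime': ['2017-01-01', '2017-01-02'],
    'sunrise': ['2017-01-01T07:43:20', '2017-01-02T07:43:25'],
    'sunset': ['2017-01-01T17:04:23', '2017-01-02T17:05:15'],
})

# Parse the string columns into proper datetimes
for col in ['datetime', 'sunrise', 'sunset']:
    weather_toy[col] = pd.to_datetime(weather_toy[col])

# Hypothetical feature: hours between sunrise and sunset
daylight = weather_toy['sunset'] - weather_toy['sunrise']
weather_toy['daylight_hours'] = daylight.dt.total_seconds() / 3600

print(weather_toy.dtypes)
```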


In [119]:
for col in gen_weather_sel_cols:
    print(f'{col}: {down_sel_generic_weather_df[col].value_counts()}')

name: name
Allegheny County    2678
Name: count, dtype: int64
datetime: datetime
2017-01-01    1
2021-11-15    1
2021-11-17    1
2021-11-18    1
2021-11-19    1
             ..
2019-06-13    1
2019-06-14    1
2019-06-15    1
2019-06-16    1
2024-05-01    1
Name: count, Length: 2678, dtype: int64
tempmax: tempmax
78.4    17
73.9    14
76.6    14
57.8    13
82.4    12
        ..
26.3     1
19.0     1
19.8     1
59.6     1
72.3     1
Name: count, Length: 666, dtype: int64
tempmin: tempmin
64.9    17
66.8    17
30.8    16
38.9    13
28.1    13
        ..
74.2     1
72.6     1
13.0     1
70.2     1
17.1     1
Name: count, Length: 633, dtype: int64
temp: temp
70.1    13
65.7    13
71.2    13
73.2    12
71.1    11
        ..
7.9      1
22.6     1
22.8     1
5.7      1
61.6     1
Name: count, Length: 642, dtype: int64
feelslikemax: feelslikemax
78.4     17
73.9     14
76.6     14
57.8     13
76.9     12
         ..
100.1     1
98.5      1
82.8      1
89.6      1
72.3      1
Name: count, Length

In [120]:
for col in gen_weather_sel_cols:
    print(f'{col}: {down_sel_generic_weather_df[col].unique()}')

name: ['Allegheny County']
datetime: ['2017-01-01' '2017-01-02' '2017-01-03' ... '2024-04-29' '2024-04-30'
 '2024-05-01']
tempmax: [43.9 46.3 48.6 44.3 22.6 17.6 16.  15.7 27.9 49.  54.7 64.4 36.6 35.6
 34.1 38.7 60.2 51.3 41.7 55.8 63.2 60.8 49.9 39.1 51.9 50.3 31.9 33.
 31.6 37.9 24.9 29.9 44.8 54.4 63.4 55.9 29.6 32.9 54.5 55.7 38.8 48.7
 42.  28.4 52.8 65.9 59.5 65.5 65.1 67.9 75.7 65.6 45.4 56.7 61.1 39.8
 30.3 46.2 56.  57.  33.7 25.9 27.7 39.4 35.3 35.2 41.  43.3 39.5 51.
 59.3 41.9 49.5 73.4 72.7 65.4 68.1 59.7 60.1 55.6 52.9 48.  57.1 70.3
 63.1 70.6 64.2 38.2 57.7 71.9 78.4 70.  65.2 71.4 80.8 74.6 67.5 70.1
 75.4 68.5 53.9 66.5 61.  63.6 76.4 80.9 70.4 73.9 85.2 80.6 61.2 58.1
 64.7 71.1 47.8 63.9 70.8 58.6 66.2 68.7 69.9 82.3 86.1 84.5 70.9 74.7
 65.3 71.3 66.8 66.6 77.3 78.  74.9 73.1 74.8 78.9 79.8 73.7 64.  69.8
 76.1 82.4 84.6 86.3 88.7 83.2 84.3 81.2 85.5 87.2 77.6 83.3 75.5 74.3
 73.  74.  84.1 83.1 81.  83.  80.7 77.4 79.5 81.8 87.  76.7 77.8 80.3
 83.9 86.4 77.1 79.

## OverdoseDataPA1

In [2]:
overdose_PA_df = pd.read_excel('../data/raw/OverdoseDataPA1.xlsx', engine='calamine')
overdose_PA_df.head(5)

Unnamed: 0,dose_count,incident_id,incident_time,day,incident_county_name,incident_state,victim_id,gender_desc,age_range,race,...,geocoded_column,geocoded_column_address,geocoded_column_city,geocoded_column_zip,geocoded_column_1_city,geocoded_column_1_zip,geocoded_column_1_state,geocoded_column_1,dose_unit,geocoded_column_1_address
0,1,184,13:52:00,Wednesday,Allegheny,Pennsylvania,124,Female,20 - 24,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
1,2,39237,4:13:00,Saturday,Allegheny,Pennsylvania,31680,Female,50 - 59,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
2,1,27008,0:15:00,Monday,Lackawanna,Pennsylvania,21028,Female,30 - 39,BLACK,...,POINT (-75.612183 41.439101),,,,,,,POINT (-75.032709 41.332572),0,
3,2,47495,21:00:00,Saturday,Adams,Pennsylvania,38590,Male,40 - 49,WHITE,...,POINT (-77.222243 39.872096),,,,,,,POINT (0 0),2,
4,1,25677,17:30:00,Wednesday,Allegheny,Pennsylvania,20086,Male,50 - 59,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,


In [3]:
overdose_PA_df

Unnamed: 0,dose_count,incident_id,incident_time,day,incident_county_name,incident_state,victim_id,gender_desc,age_range,race,...,geocoded_column,geocoded_column_address,geocoded_column_city,geocoded_column_zip,geocoded_column_1_city,geocoded_column_1_zip,geocoded_column_1_state,geocoded_column_1,dose_unit,geocoded_column_1_address
0,1,184,13:52:00,Wednesday,Allegheny,Pennsylvania,124,Female,20 - 24,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
1,2,39237,4:13:00,Saturday,Allegheny,Pennsylvania,31680,Female,50 - 59,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
2,1,27008,0:15:00,Monday,Lackawanna,Pennsylvania,21028,Female,30 - 39,BLACK,...,POINT (-75.612183 41.439101),,,,,,,POINT (-75.032709 41.332572),0,
3,2,47495,21:00:00,Saturday,Adams,Pennsylvania,38590,Male,40 - 49,WHITE,...,POINT (-77.222243 39.872096),,,,,,,POINT (0 0),2,
4,1,25677,17:30:00,Wednesday,Allegheny,Pennsylvania,20086,Male,50 - 59,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53932,2,39326,14:27:00,Tuesday,Cambria,Pennsylvania,31751,Male,40 - 49,BLACK,...,POINT (-78.718942 40.491275),,,,,,,POINT (-78.718942 40.491275),2,
53933,2,14560,16:36:00,Friday,York,Pennsylvania,11654,Male,25 - 29,WHITE,...,POINT (-76.725761 39.921925),,,,,,,POINT (-76.725761 39.921925),4,
53934,1,12293,21:30:00,Monday,Allegheny,Pennsylvania,9750,Female,30 - 39,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
53935,2,35341,21:58:00,Sunday,Westmoreland,Pennsylvania,28171,Male,40 - 49,WHITE,...,POINT (-79.471341 40.310315),,,,,,,POINT (-79.471341 40.310315),2,


In [90]:
overdose_PA_df.columns

Index(['dose_count', 'incident_id', 'incident_time', 'day',
       'incident_county_name', 'incident_state', 'victim_id', 'gender_desc',
       'age_range', 'race', 'ethnicity_desc', 'victim_state', 'victim_county',
       'accidental_exposure', 'victim_od_drug_id', 'susp_od_drug_desc',
       'naloxone_administered', 'administration_id', 'incident_date',
       'dose_desc', 'response_time_desc', 'survive', 'response_desc',
       'revive_action_desc', 'third_party_admin_desc',
       'incident_county_fips_code', 'incident_county_lat',
       'incident_county_long', 'victim_county_fips_code', 'victim_county_lat',
       'victim_county_long', 'geocoded_column_state', 'geocoded_column',
       'geocoded_column_address', 'geocoded_column_city',
       'geocoded_column_zip', 'geocoded_column_1_city',
       'geocoded_column_1_zip', 'geocoded_column_1_state', 'geocoded_column_1',
       'dose_unit', 'geocoded_column_1_address'],
      dtype='object')

In [94]:
overdose_sel_cols = ['incident_time','day','incident_county_name','gender_desc','age_range',
                     'race','ethnicity_desc','accidental_exposure','susp_od_drug_desc',
                     'naloxone_administered','incident_date','response_time_desc','survive',
                     'incident_county_lat','incident_county_long']

In [95]:
overdose_PA_df[overdose_sel_cols].describe()

Unnamed: 0,incident_county_lat,incident_county_long
count,53937.0,53937.0
mean,40.551382,-77.434264
std,0.566338,1.89211
min,39.854804,-80.351074
25%,40.167598,-79.762866
50%,40.419746,-76.725761
75%,40.815095,-75.71107
max,41.994138,-75.032709


In [91]:
overdose_PA_df['day'].value_counts()

day
Saturday     8724
Friday       8474
Thursday     7670
Wednesday    7453
Tuesday      7402
Sunday       7262
Monday       6952
Name: count, dtype: int64
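
`value_counts()` sorts by frequency, which makes the weekly pattern harder to read. A small sketch of putting the same counts in calendar order and converting them to shares follows; the numbers are copied from the output above.

```python
import pandas as pd

# Counts copied from the value_counts() output above
day_counts = pd.Series({
    'Saturday': 8724, 'Friday': 8474, 'Thursday': 7670,
    'Wednesday': 7453, 'Tuesday': 7402, 'Sunday': 7262, 'Monday': 6952,
}, name='count')

weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

# Reorder by calendar day and compute each day's share of incidents
ordered = day_counts.reindex(weekday_order)
share = (ordered / ordered.sum()).round(3)

print(pd.DataFrame({'count': ordered, 'share': share}))
```

With the real frame, `overdose_PA_df['day'].value_counts().reindex(weekday_order)` should give the same ordering.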

In [15]:
overdose_PA_df['incident_county_name'].count()

53937

In [20]:
down_sel = overdose_PA_df[overdose_PA_df['incident_county_name'] == 'Allegheny']
down_sel

Unnamed: 0,dose_count,incident_id,incident_time,day,incident_county_name,incident_state,victim_id,gender_desc,age_range,race,...,geocoded_column,geocoded_column_address,geocoded_column_city,geocoded_column_zip,geocoded_column_1_city,geocoded_column_1_zip,geocoded_column_1_state,geocoded_column_1,dose_unit,geocoded_column_1_address
0,1,184,13:52:00,Wednesday,Allegheny,Pennsylvania,124,Female,20 - 24,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
1,2,39237,4:13:00,Saturday,Allegheny,Pennsylvania,31680,Female,50 - 59,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
4,1,25677,17:30:00,Wednesday,Allegheny,Pennsylvania,20086,Male,50 - 59,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
6,1,3984,0:02:00,Friday,Allegheny,Pennsylvania,3252,Male,50 - 59,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
22,0,42561,8:00:00,Monday,Allegheny,Pennsylvania,34747,Male,40 - 49,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53897,2,26867,0:00:00,Saturday,Allegheny,Pennsylvania,20932,Female,20 - 24,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (0 0),4,
53904,2,39762,23:00:00,Monday,Allegheny,Pennsylvania,32155,Male,30 - 39,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),4,
53918,0,23593,1:00:00,Saturday,Allegheny,Pennsylvania,18642,Male,25 - 29,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
53929,1,39064,21:16:00,Sunday,Allegheny,Pennsylvania,31523,Male,40 - 49,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),4,


In [21]:
down_sel['susp_od_drug_desc'].unique()

array(['HEROIN', 'FENTANYL', 'PHARMACEUTICAL OTHER', 'UNKNOWN',
       'MARIJUANA', 'CARFENTANIL',
       'FENTANYL ANALOG/OTHER SYNTHETIC OPIOID', 'METHADONE', 'ALCOHOL',
       'OTHER', 'BENZODIAZEPINES (I.E.VALIUM, XANAX, ATIVAN, ETC)',
       'COCAINE/CRACK', 'PHARMACEUTICAL OPIOID', 'METHAMPHETAMINE',
       'SYNTHETIC MARIJUANA', 'PHARMACEUTICAL STIMULANT',
       'BARBITURATES (I.E. AMYTAL, NEMBUTAL, ETC)',
       'BUPRENORPHINE (I.E. SUBOXONE, SUBLOCADE, SUBUTEX, BUTRANS, BUPRENEX, ETC)',
       'XYLAZINE', 'BATH SALTS'], dtype=object)

In [48]:
susp_drugs = ['HEROIN', 'FENTANYL', 'CARFENTANIL',
       'FENTANYL ANALOG/OTHER SYNTHETIC OPIOID', 'METHADONE',
       'PHARMACEUTICAL OPIOID',
       'BUPRENORPHINE (I.E. SUBOXONE, SUBLOCADE, SUBUTEX, BUTRANS, BUPRENEX, ETC)']



In [49]:
down_sel2 = down_sel[down_sel['susp_od_drug_desc'].isin(susp_drugs)].copy()


In [50]:
down_sel2

Unnamed: 0,dose_count,incident_id,incident_time,day,incident_county_name,incident_state,victim_id,gender_desc,age_range,race,...,geocoded_column,geocoded_column_address,geocoded_column_city,geocoded_column_zip,geocoded_column_1_city,geocoded_column_1_zip,geocoded_column_1_state,geocoded_column_1,dose_unit,geocoded_column_1_address
0,1,184,13:52:00,Wednesday,Allegheny,Pennsylvania,124,Female,20 - 24,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
1,2,39237,4:13:00,Saturday,Allegheny,Pennsylvania,31680,Female,50 - 59,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
4,1,25677,17:30:00,Wednesday,Allegheny,Pennsylvania,20086,Male,50 - 59,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
6,1,3984,0:02:00,Friday,Allegheny,Pennsylvania,3252,Male,50 - 59,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
22,0,42561,8:00:00,Monday,Allegheny,Pennsylvania,34747,Male,40 - 49,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53879,0,40768,17:08:00,Tuesday,Allegheny,Pennsylvania,33077,Male,30 - 39,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.917118 40.910832),0,
53891,2,37405,16:00:00,Friday,Allegheny,Pennsylvania,30037,Male,30 - 39,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),4,
53904,2,39762,23:00:00,Monday,Allegheny,Pennsylvania,32155,Male,30 - 39,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),4,
53929,1,39064,21:16:00,Sunday,Allegheny,Pennsylvania,31523,Male,40 - 49,WHITE,...,POINT (-79.986198 40.467355),,,,,,,POINT (-79.986198 40.467355),4,


In [66]:
# Use this cell for dataset information.  Just change the column name
down_sel2['naloxone_administered'].value_counts()

naloxone_administered
Y    4115
N    1198
Name: count, dtype: int64

In [92]:
# Print the unique values of each column of interest
cols_to_check = [
    'incident_county_name',
    # 'incident_time',
    'day', 'gender_desc', 'age_range', 'ethnicity_desc',
    'accidental_exposure', 'susp_od_drug_desc', 'incident_date',
    'response_time_desc', 'survive', 'incident_county_lat',
    'incident_county_long', 'naloxone_administered',
]
for col in cols_to_check:
    print(down_sel2[col].unique())


['Allegheny']
['Wednesday' 'Saturday' 'Friday' 'Monday' 'Sunday' 'Thursday' 'Tuesday']
['Female', 'Male', 'Unknown']
Categories (3, object): ['Female', 'Male', 'Unknown']
['20 - 24' '50 - 59' '40 - 49' '30 - 39' '25 - 29' '70 - 79' '15 - 19'
 '60 - 69' '0 - 9' '14-Oct' '80 - *']
['Unknown' 'NON-HISPANIC OR NOT LATINO' 'HISPANIC or LATINO']
['N' 'Y']
['HEROIN' 'FENTANYL' 'CARFENTANIL'
 'FENTANYL ANALOG/OTHER SYNTHETIC OPIOID' 'METHADONE'
 'PHARMACEUTICAL OPIOID'
 'BUPRENORPHINE (I.E. SUBOXONE, SUBLOCADE, SUBUTEX, BUTRANS, BUPRENEX, ETC)']
['3/7/2018' '4/22/2023' '5/26/2021' ... '12/20/2019' '11/9/2019'
 '12/23/2022']
[nan '1-3 MINUTES' '<1 MINUTE' 'DID NOT WORK' '>5 MINUTES' '3-5 MINUTES'
 "DON'T KNOW"]
['N' 'U' 'Y']
[40.467355]
[-79.986198]
['Y' 'N']
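
One value in `age_range` stands out: `'14-Oct'`. This looks like the `10 - 14` bucket after a spreadsheet converted it to a date, although that is an assumption and should be checked against the original source. Under that assumption, a minimal repair plus an ordered categorical could look like this (toy values only):

```python
import pandas as pd

# Toy sample of the age_range values seen above
age = pd.Series(['20 - 24', '14-Oct', '0 - 9', '80 - *', '14-Oct'],
                name='age_range')

# Assumption: '14-Oct' is the mangled '10 - 14' bucket
age_fixed = age.replace({'14-Oct': '10 - 14'})

# Ordered categories so sorting and plotting follow age, not text
age_order = ['0 - 9', '10 - 14', '15 - 19', '20 - 24', '25 - 29',
             '30 - 39', '40 - 49', '50 - 59', '60 - 69', '70 - 79',
             '80 - *']
age_cat = pd.Categorical(age_fixed, categories=age_order, ordered=True)

print(age_fixed.tolist())
```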


In [74]:
down_sel2.dtypes

dose_count                      int64
incident_id                     int64
incident_time                  object
day                            object
incident_county_name           object
incident_state                 object
victim_id                       int64
gender_desc                  category
age_range                      object
race                           object
ethnicity_desc                 object
victim_state                   object
victim_county                  object
accidental_exposure            object
victim_od_drug_id               int64
susp_od_drug_desc              object
naloxone_administered          object
administration_id               int64
incident_date                  object
dose_desc                      object
response_time_desc             object
survive                        object
response_desc                  object
revive_action_desc             object
third_party_admin_desc         object
incident_county_fips_code       int64
incident_cou

In [77]:
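# .iloc[0] takes the first row by position; [0] only works here
# because the row labelled 0 survived the filtering.
type(down_sel2['incident_date'].iloc[0])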

str

In [70]:
down_sel2['incident_time'][0]

'13:52:00'
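
Because `incident_date` and `incident_time` are both strings, they need parsing before the overdose records can be aligned with the daily weather data. The sketch below combines them into one timestamp, counts incidents per day, and left-merges the counts onto a toy weather frame. The date and time pairings and the `tempmax` values are made up for illustration; only the formats (`M/D/YYYY` and `H:MM:SS`) come from the outputs above.

```python
import pandas as pd

# Toy incidents; date/time pairings are illustrative only
overdose = pd.DataFrame({
    'incident_date': ['3/7/2018', '3/7/2018', '12/20/2019'],
    'incident_time': ['13:52:00', '4:13:00', '0:15:00'],
})

# Combine date and time strings into a single timestamp
overdose['incident_dt'] = pd.to_datetime(
    overdose['incident_date'] + ' ' + overdose['incident_time'],
    format='%m/%d/%Y %H:%M:%S',
)

# Count incidents per calendar day
overdose['date'] = overdose['incident_dt'].dt.normalize()
daily = (
    overdose.groupby('date').size()
    .rename('overdose_count')
    .reset_index()
)

# Toy daily weather; tempmax values are hypothetical
weather = pd.DataFrame({
    'datetime': ['2018-03-07', '2018-03-08', '2019-12-20'],
    'tempmax': [40.0, 38.0, 35.0],
})
weather['date'] = pd.to_datetime(weather['datetime'])

# Left merge keeps every weather day; days with no incidents get 0
merged = weather.merge(daily, on='date', how='left')
merged['overdose_count'] = merged['overdose_count'].fillna(0).astype(int)

print(merged[['date', 'tempmax', 'overdose_count']])
```

Keeping the weather table on the left means days with zero incidents stay in the result, which matters if the daily count becomes a modeling target.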

In [72]:
down_sel2['gender_desc'] = down_sel2['gender_desc'].astype('category')

In [73]:
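# Elements of a categorical column are still plain str;
# check down_sel2['gender_desc'].dtype to confirm the conversion.
type(down_sel2['gender_desc'][0])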

str

In [110]:
down_sel2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5313 entries, 0 to 53934
Data columns (total 42 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   dose_count                 5313 non-null   int64   
 1   incident_id                5313 non-null   int64   
 2   incident_time              5313 non-null   object  
 3   day                        5313 non-null   object  
 4   incident_county_name       5313 non-null   object  
 5   incident_state             5313 non-null   object  
 6   victim_id                  5313 non-null   int64   
 7   gender_desc                5313 non-null   category
 8   age_range                  5313 non-null   object  
 9   race                       5313 non-null   object  
 10  ethnicity_desc             5313 non-null   object  
 11  victim_state               4921 non-null   object  
 12  victim_county              4921 non-null   object  
 13  accidental_exposure        5313 non-n