### DSCI 511: Data Acquisition and Preprocessing Group Project
### Group members: Allie Schneider and Francis Villamater

The end goal of this open-ended assignment was to build and make available a complex dataset. Our group decided to build a dataset that compares air quality with population density for each Pennsylvania county.

#### Data Sources Used

<b>[OpenWeather:](https://openweathermap.org/api/air-pollution)<br> </b>
We used the "Historical Air Pollution" data, which requires latitude, longitude, start date (in Unix time), and end date (in Unix time) as inputs. The request returns a dictionary containing the overall Air Quality Index, as well as concentrations of polluting gases and particulates:

- Carbon monoxide (CO)
- Nitrogen monoxide (NO)
- Nitrogen dioxide (NO2)
- Ozone (O3)
- Sulphur dioxide (SO2)
- Ammonia (NH3)
- Particulates (PM2.5 and PM10)

Historical data is not available prior to November 27, 2020 for this data source.
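
As a rough illustration of the inputs described above, the sketch below converts calendar dates to Unix time and assembles the query parameters. The helper names `to_unix` and `build_history_params` and the `YOUR_API_KEY` placeholder are assumptions made for this example. The parameter names mirror the request URL used later in this notebook, and the dates are pinned to UTC here, whereas the notebook's own cell uses the local timezone.

```python
from datetime import datetime, timezone


def to_unix(year, month, day):
    """Return midnight UTC of the given date as an integer Unix timestamp."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def build_history_params(lat, lon, start, end, api_key):
    """Assemble query parameters for a historical air pollution request.

    Hypothetical helper: the keys match the query string used in this notebook.
    """
    return {"lat": lat, "lon": lon, "start": start, "end": end, "appid": api_key}


params = build_history_params(
    lat=39.872,
    lon=-77.2222,
    start=to_unix(2020, 11, 27),  # earliest date with historical data
    end=to_unix(2021, 10, 31),
    api_key="YOUR_API_KEY",  # placeholder, not a real key
)
print(params)
```

A dictionary like this can be passed to `requests.get` through its `params` argument instead of formatting the URL by hand.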

<b>[US Census:](https://www.census.gov/data/developers/data-sets/popest-popproj/popest.html)<br> </b>
We used the 2019 Population Estimates (as 2020 estimates were unavailable), which require census county and state codes retrieved from [census.gov](https://www2.census.gov/geo/docs/reference/codes/files/st42_pa_cou.txt) as inputs. The request returns a nested list with county name, population, and density.
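
To make the shape of that nested list concrete, here is a minimal sketch that pairs the header row with a data row. The sample values are copied from the output shown later in this notebook; the variable names are chosen only for this example.

```python
# Sample response shaped like the nested list returned by the census request;
# the values are copied from the output shown later in this notebook.
sample_response = [
    ["NAME", "POP", "DENSITY", "state", "county"],
    ["Adams County, Pennsylvania", "103009", "198.59524445000000", "42", "001"],
]

header, *rows = sample_response

# Pair each header field with its value to get one dictionary per county.
records = [dict(zip(header, row)) for row in rows]

# Every value arrives as a string, so convert the numeric fields explicitly.
for record in records:
    record["POP"] = int(record["POP"])
    record["DENSITY"] = float(record["DENSITY"])

print(records[0])
```

The county code is left as a string so that its leading zeros are preserved.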

In order to retrieve all PA county data from OpenWeather, we looped through a [list of latitudes and longitudes.](https://data.pa.gov/Government-That-Works/County-Latitude-Longitude-Points-For-Each-County-S/dvjn-d63b)

In order to retrieve all PA county data from census.gov, we looped through a [list of census county codes.](https://www2.census.gov/geo/docs/reference/codes/files/st42_pa_cou.txt)

#### API Restrictions

U.S. Census data and opendataPA are both open access. OpenWeather advocates free access to weather data and their APIs.

#### Gather Pennsylvania Census Data.

In [10]:
import pprint
import requests
import csv
import pandas as pd
import datetime

In [4]:
# Pull the county names/codes from the census.gov URL rather than from a local file
countyreq = requests.get('https://www2.census.gov/geo/docs/reference/codes/files/st42_pa_cou.txt')
counties = countyreq.text
county_dict = {}
county_codes = []
# Fields on each line: state, state code, county code, county name
for line in counties.splitlines():
    if not line.strip():
        continue  # skip blank lines, such as one left by a trailing newline
    fields = line.split(",")
    state = fields[0]
    state_code = fields[1]
    county_code = fields[2]
    county_name = fields[3]
    county_dict[county_name] = {"state": state,
                               "state_code" : state_code,
                               "county_code" : county_code}
    county_codes.append(county_code)
pprint.pprint(county_codes)
print()
pprint.pprint(county_dict)


['001',
 '003',
 '005',
 '007',
 '009',
 '011',
 '013',
 '015',
 '017',
 '019',
 '021',
 '023',
 '025',
 '027',
 '029',
 '031',
 '033',
 '035',
 '037',
 '039',
 '041',
 '043',
 '045',
 '047',
 '049',
 '051',
 '053',
 '055',
 '057',
 '059',
 '061',
 '063',
 '065',
 '067',
 '069',
 '071',
 '073',
 '075',
 '077',
 '079',
 '081',
 '083',
 '085',
 '087',
 '089',
 '091',
 '093',
 '095',
 '097',
 '099',
 '101',
 '103',
 '105',
 '107',
 '109',
 '111',
 '113',
 '115',
 '117',
 '119',
 '121',
 '123',
 '125',
 '127',
 '129',
 '131',
 '133']

{'Adams County': {'county_code': '001', 'state': 'PA', 'state_code': '42'},
 'Allegheny County': {'county_code': '003', 'state': 'PA', 'state_code': '42'},
 'Armstrong County': {'county_code': '005', 'state': 'PA', 'state_code': '42'},
 'Beaver County': {'county_code': '007', 'state': 'PA', 'state_code': '42'},
 'Bedford County': {'county_code': '009', 'state': 'PA', 'state_code': '42'},
 'Berks County': {'county_code': '011', 'state': 'PA', 'state_code': '42

#### Loop through the county code list in order to access each county's census data.

In [5]:
# Get the census data for each Pennsylvania county and collect it in a list
api_key ='312a2c4d299453c3f06359d2f2158546e89f2f6c'
# Density is expressed as the number of people per square mile of land area
census_data = [['NAME', 'POP', 'DENSITY', 'state', 'county']]
for county_code in county_codes:
    address = 'https://api.census.gov/data/2019/pep/population?get=NAME,POP,DENSITY&for=county:{}&in=state:42&key={}'.format(county_code, api_key)
    
    census_resp = requests.get(address)
    census_result = census_resp.json()
    census_data.append(census_result[1])

pprint.pprint(census_data)

[['NAME', 'POP', 'DENSITY', 'state', 'county'],
 ['Adams County, Pennsylvania', '103009', '198.59524445000000', '42', '001'],
 ['Allegheny County, Pennsylvania',
  '1216045',
  '1665.54791010000000',
  '42',
  '003'],
 ['Armstrong County, Pennsylvania', '64735', '99.10578176300000', '42', '005'],
 ['Beaver County, Pennsylvania', '163929', '377.09710119000000', '42', '007'],
 ['Bedford County, Pennsylvania', '47888', '47.30601317700000', '42', '009'],
 ['Berks County, Pennsylvania', '421164', '491.78051037000000', '42', '011'],
 ['Blair County, Pennsylvania', '121829', '231.70158614000000', '42', '013'],
 ['Bradford County, Pennsylvania', '60323', '52.57303239400000', '42', '015'],
 ['Bucks County, Pennsylvania', '628270', '1039.49065960000000', '42', '017'],
 ['Butler County, Pennsylvania', '187853', '237.92159069000000', '42', '019'],
 ['Cambria County, Pennsylvania', '130192', '189.13291868000000', '42', '021'],
 ['Cameron County, Pennsylvania', '4447', '11.22325306500000', '42', '02
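
The loop above issues one request per county. As a possible alternative, and assuming this endpoint accepts the `*` wildcard in the `for` clause the way Census Data API endpoints generally do, all counties in the state could be requested at once. The sketch below only builds the URL and sends nothing; `build_census_url` is a hypothetical helper, and the wildcard behaviour should be verified against the Census API documentation before relying on it.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data/2019/pep/population"


def build_census_url(state_code, county="*", api_key=None):
    """Build a population-estimates URL for one county or, with "*", all counties.

    Hypothetical helper; the wildcard behaviour is an assumption to verify
    against the Census API documentation.
    """
    params = {
        "get": "NAME,POP,DENSITY",
        "for": f"county:{county}",
        "in": f"state:{state_code}",
    }
    if api_key:
        params["key"] = api_key
    # Keep ":", "*" and "," readable instead of percent-encoding them.
    return f"{BASE_URL}?{urlencode(params, safe=':*,')}"


print(build_census_url("42"))
print(build_census_url("42", county="001"))
```

If the wildcard works as assumed, the response would carry one header row followed by one row per county, so the per-county loop would no longer be needed.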

#### Funnel PA census data by county into a nested dictionary.

In [6]:
census_dict = {}
for line in census_data[1:]:
    name = line[0]
    pop = line[1]
    density = line[2]
    state_code = line[3]
    county_code = line[4]
    census_dict[name] = {"pop": pop,
                        "density": density,
                        "state_code":state_code,
                        "county_code":county_code}
pprint.pprint(census_dict)

{'Adams County, Pennsylvania': {'county_code': '001',
                                'density': '198.59524445000000',
                                'pop': '103009',
                                'state_code': '42'},
 'Allegheny County, Pennsylvania': {'county_code': '003',
                                    'density': '1665.54791010000000',
                                    'pop': '1216045',
                                    'state_code': '42'},
 'Armstrong County, Pennsylvania': {'county_code': '005',
                                    'density': '99.10578176300000',
                                    'pop': '64735',
                                    'state_code': '42'},
 'Beaver County, Pennsylvania': {'county_code': '007',
                                 'density': '377.09710119000000',
                                 'pop': '163929',
                                 'state_code': '42'},
 'Bedford County, Pennsylvania': {'county_code': '009',
                        

#### Collect latitude and longitude for each county.

In [7]:
# Pull the coordinates from the data.pa.gov URL rather than from a local file
datareq = requests.get("https://data.pa.gov/resource/dvjn-d63b.json")
data = datareq.json()

county_lonlat = {}
for record in data:
    # Build keys in the "<name> County, Pennsylvania" format used by census_dict
    county_name = str(record['name']) + " County, Pennsylvania"
    latitude = record['latitude']
    longitude = record['longitude']
    county_lonlat[county_name] = {'Longitude' : longitude,
                          'Latitude' : latitude}
pprint.pprint(county_lonlat)

{'Adams County, Pennsylvania': {'Latitude': '39.87209565',
                                'Longitude': '-77.22224271'},
 'Allegheny County, Pennsylvania': {'Latitude': '40.46735543',
                                    'Longitude': '-79.98619843'},
 'Armstrong County, Pennsylvania': {'Latitude': '40.81509526',
                                    'Longitude': '-79.47316899'},
 'Beaver County, Pennsylvania': {'Latitude': '40.68349245',
                                 'Longitude': '-80.35107356'},
 'Bedford County, Pennsylvania': {'Latitude': '40.00737536',
                                  'Longitude': '-78.49116474'},
 'Berks County, Pennsylvania': {'Latitude': '40.41939635',
                                'Longitude': '-75.93077327'},
 'Blair County, Pennsylvania': {'Latitude': '40.48555024',
                                'Longitude': '-78.34907687'},
 'Bradford County, Pennsylvania': {'Latitude': '41.79117814',
                                   'Longitude': '-76.51825624'},
 'Bu

#### Combine county coordinates with census data.

In [8]:
complete_census = {}

for county_key in census_dict:
    # Keep only counties that appear in both sources under the same name
    if county_key in county_lonlat:
        complete_census[county_key] = {"population": census_dict[county_key]["pop"],
                                       "density" : census_dict[county_key]["density"],
                                       "longitude" : county_lonlat[county_key]["Longitude"],
                                       "latitude" : county_lonlat[county_key]["Latitude"],
                                       "county_code" : census_dict[county_key]["county_code"]}
            
pprint.pprint(complete_census)

{'Adams County, Pennsylvania': {'county_code': '001',
                                'density': '198.59524445000000',
                                'latitude': '39.87209565',
                                'longitude': '-77.22224271',
                                'population': '103009'},
 'Allegheny County, Pennsylvania': {'county_code': '003',
                                    'density': '1665.54791010000000',
                                    'latitude': '40.46735543',
                                    'longitude': '-79.98619843',
                                    'population': '1216045'},
 'Armstrong County, Pennsylvania': {'county_code': '005',
                                    'density': '99.10578176300000',
                                    'latitude': '40.81509526',
                                    'longitude': '-79.47316899',
                                    'population': '64735'},
 'Beaver County, Pennsylvania': {'county_code': '007',
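
The combine step above matches the two sources on exact county-name strings, so a county spelled differently in one source would be dropped without any warning. A minimal sketch of a check for that, using small samples copied from the outputs above plus one made-up entry ("Example County") that exists only to show an unmatched key:

```python
# Small samples copied from the outputs above; "Example County" is a
# made-up entry added only to show what an unmatched key looks like.
census_sample = {
    "Adams County, Pennsylvania": {"pop": "103009", "county_code": "001"},
    "Example County, Pennsylvania": {"pop": "0", "county_code": "999"},
}
coords_sample = {
    "Adams County, Pennsylvania": {"Latitude": "39.87209565", "Longitude": "-77.22224271"},
}

# Keys present in both dictionaries can be merged directly.
merged = {
    name: {**census_sample[name], **coords_sample[name]}
    for name in census_sample.keys() & coords_sample.keys()
}

# Keys found on only one side would be dropped silently by the merge.
unmatched = sorted(census_sample.keys() ^ coords_sample.keys())

print(merged)
print("Unmatched:", unmatched)
```

Running the same symmetric-difference check on the real dictionaries would confirm whether every county made it into the combined data.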
                 

#### Gather AQI data from OpenWeather and average each air quality measure over our time range (November 1, 2020 to October 31, 2021).

In [11]:
start_date = datetime.datetime(2020,11,1) 
start = int(start_date.timestamp()) #converting to unix epoch timestamp

end_date = datetime.datetime(2021, 10, 31) #COMMENT[FV]: Corrected date to 31st (from 1st) for full 365 days.
end = int(end_date.timestamp()) #converting to unix epoch timestamp
    
air = [['county', 'county_code', 'lat', 'long', 'pop', 'density', 'co', 'nh3', 'no', 'no2', '03', 'pm10', 'pm2_5', 'so2', 'aqi']]

for county in complete_census:
    lon = complete_census[county]["longitude"][:8]
    lat = complete_census[county]["latitude"][:7]
    pop = complete_census[county]['population']
    density = complete_census[county]['density']
    county_code = complete_census[county]['county_code']

    address = 'http://api.openweathermap.org/data/2.5/air_pollution/history?lat={}&lon={}&start={}&end={}&appid=da5dc0e03e30df9819aafabdd4a58004'.format(lat, lon, str(start), str(end))

    resp = requests.get(address)
    result = resp.json()

    records = result['list']  # all readings returned for this county

    # Average each pollutant concentration over the whole time range
    pollutants = ['co', 'nh3', 'no', 'no2', 'o3', 'pm10', 'pm2_5', 'so2']
    pollutant_avgs = [sum(record['components'][name] for record in records) / len(records)
                      for name in pollutants]

    # Average the overall air quality index over the same range
    aqi_avg = sum(record['main']['aqi'] for record in records) / len(records)

    # Row layout: [county, county_code, lat, long, pop, density,
    #              co, nh3, no, no2, o3, pm10, pm2_5, so2, aqi]
    list_air = [county, county_code, result['coord']['lat'], result['coord']['lon'],
                pop, density] + pollutant_avgs + [aqi_avg]

    air.append(list_air)
    
pprint.pprint(air)

[['county',
  'county_code',
  'lat',
  'long',
  'pop',
  'density',
  'co',
  'nh3',
  'no',
  'no2',
  '03',
  'pm10',
  'pm2_5',
  'so2',
  'aqi'],
 ['Adams County, Pennsylvania',
  '001',
  39.872,
  -77.2222,
  '103009',
  '198.59524445000000',
  269.91551474201697,
  1.1425712530712544,
  0.2974336609336562,
  6.327442260442295,
  60.32694840294856,
  7.368173218673216,
  6.540871007370997,
  1.727047911547918,
  1.4705159705159705],
 ['Allegheny County, Pennsylvania',
  '003',
  40.4673,
  -79.9861,
  '1216045',
  '1665.54791010000000',
  308.2815319410333,
  0.7623734643734688,
  2.764500000000007,
  14.487695331695186,
  50.79184275184297,
  9.489630221130273,
  8.456509828009821,
  4.804437346437362,
  1.5732186732186733],
 ['Armstrong County, Pennsylvania',
  '005',
  40.815,
  -79.4731,
  '64735',
  '99.10578176300000',
  252.05870638820693,
  0.4475073710073805,
  0.27839066339066065,
  5.82265110565113,
  54.51617321867347,
  6.952847665847665,
  6.343954545454539,
  2.9
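
The averaging in the cell above can also be written as a small reusable function. The sketch below is one way to do it under stated assumptions: `sample_result` is a made-up two-record response that only mirrors the structure the cell reads (`list`, `components`, `main`), and `average_components` is a hypothetical helper name.

```python
from statistics import fmean

# Hypothetical two-record response that mirrors the structure read in the
# cell above ("list" -> "components" / "main"); the numbers are made up.
sample_result = {
    "coord": {"lon": -77.2222, "lat": 39.872},
    "list": [
        {"main": {"aqi": 1}, "components": {"co": 250.0, "o3": 60.0}},
        {"main": {"aqi": 2}, "components": {"co": 270.0, "o3": 50.0}},
    ],
}


def average_components(result):
    """Return the mean of every pollutant and of the AQI across all records."""
    records = result["list"]
    names = records[0]["components"].keys()
    averages = {name: fmean(r["components"][name] for r in records) for name in names}
    averages["aqi"] = fmean(r["main"]["aqi"] for r in records)
    return averages


print(average_components(sample_result))
```

Returning a dictionary keyed by measure name keeps each average attached to its label, which avoids relying on column positions later on.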

#### Finalize variables

In [15]:
final_list = [['Latitude', 'Longitude', 'PA County Name', 'County Code', 'Average AQI Score', 
               'Population', 'Population Density (/mi^2)', 'Avg AQI / Pop Density',
              'co', 'nh3', 'no', 'no2', '_03', 'pm10', 'pm2_5', 'so2']]

for line in air[1:]:
    lat = line[2]
    lon = line[3]
    county_name = line[0].split(',')[0] 
    county_code = line[1]
    avg_aqi = line[-1]
    pop = float(line[4])
    density = float(line[5])
    aqi_per_density = float(avg_aqi) / float(density)
    
    co = line[6]
    nh3 = line[7]
    no = line[8]
    no2 = line[9]
    _03 = line[10]
    pm10 = line[11]
    pm2_5 = line[12]
    so2 = line[13]
    
    #append all to final_list
    final_list.append([lat,lon,county_name,county_code,avg_aqi,pop,density,aqi_per_density,
                      co, nh3, no, no2, _03, pm10, pm2_5, so2])
    
pprint.pprint(final_list)

[['Latitude',
  'Longitude',
  'PA County Name',
  'County Code',
  'Average AQI Score',
  'Population',
  'Population Density (/mi^2)',
  'Avg AQI / Pop Density',
  'co',
  'nh3',
  'no',
  'no2',
  '_03',
  'pm10',
  'pm2_5',
  'so2'],
 [39.872,
  -77.2222,
  'Adams County',
  '001',
  1.4705159705159705,
  103009.0,
  198.59524445,
  0.007404588033255751,
  269.91551474201697,
  1.1425712530712544,
  0.2974336609336562,
  6.327442260442295,
  60.32694840294856,
  7.368173218673216,
  6.540871007370997,
  1.727047911547918],
 [40.4673,
  -79.9861,
  'Allegheny County',
  '003',
  1.5732186732186733,
  1216045.0,
  1665.5479101,
  0.00094456524707489,
  308.2815319410333,
  0.7623734643734688,
  2.764500000000007,
  14.487695331695186,
  50.79184275184297,
  9.489630221130273,
  8.456509828009821,
  4.804437346437362],
 [40.815,
  -79.4731,
  'Armstrong County',
  '005',
  1.3893120393120393,
  64735.0,
  99.105781763,
  0.014018476163524123,
  252.05870638820693,
  0.4475073710073805
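
As a worked check of the derived column, the sketch below recomputes `Avg AQI / Pop Density` for Adams County from the values printed in this notebook and compares it with the value in the final list.

```python
import math

# Values for Adams County copied from the outputs in this notebook.
avg_aqi = 1.4705159705159705
density = float("198.59524445000000")  # people per square mile

aqi_per_density = avg_aqi / density
print(aqi_per_density)

# Cross-check against the value printed in the final list above.
assert math.isclose(aqi_per_density, 0.007404588033255751, rel_tol=1e-9)
```

Because the ratio divides by density, sparsely populated counties receive larger values even when their average AQI is similar to that of denser counties.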

#### Save as a CSV

In [16]:
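# Write the final table to disk (the ./data directory must already exist)
with open("./data/aqi.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(final_list)
# Read the file back to check it; there is no date column, so no date parsing is needed
aqi = pd.read_csv("./data/aqi.csv", sep = ",", header = 0)
aqi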

| | Latitude | Longitude | PA County Name | County Code | Average AQI Score | Population | Population Density (/mi^2) | Avg AQI / Pop Density | co | nh3 | no | no2 | _03 | pm10 | pm2_5 | so2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39.872 | -77.2222 | Adams County | 1 | 1.470516 | 103009.0 | 198.595244 | 0.007405 | 269.915515 | 1.142571 | 0.297434 | 6.327442 | 60.326948 | 7.368173 | 6.540871 | 1.727048 |
| 1 | 40.4673 | -79.9861 | Allegheny County | 3 | 1.573219 | 1216045.0 | 1665.547910 | 0.000945 | 308.281532 | 0.762373 | 2.764500 | 14.487695 | 50.791843 | 9.489630 | 8.456510 | 4.804437 |
| 2 | 40.815 | -79.4731 | Armstrong County | 5 | 1.389312 | 64735.0 | 99.105782 | 0.014018 | 252.058706 | 0.447507 | 0.278391 | 5.822651 | 54.516173 | 6.952848 | 6.343955 | 2.923330 |
| 3 | 40.6834 | -80.3510 | Beaver County | 7 | 1.436486 | 163929.0 | 377.097101 | 0.003809 | 262.141009 | 0.508416 | 0.536749 | 8.504674 | 55.868408 | 7.802413 | 6.994066 | 5.151950 |
| 4 | 40.0073 | -78.4911 | Bedford County | 9 | 1.392506 | 47888.0 | 47.306013 | 0.029436 | 256.226350 | 0.699826 | 0.325881 | 5.745870 | 57.256576 | 6.578182 | 5.988871 | 2.628241 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 62 | 40.191 | -80.2518 | Washington County | 125 | 1.434889 | 206865.0 | 241.386316 | 0.005944 | 280.024455 | 0.570855 | 0.779869 | 10.277667 | 53.200193 | 7.846797 | 6.968866 | 4.658197 |
| 63 | 41.6496 | -75.3051 | Wayne County | 127 | 1.361916 | 51361.0 | 70.784131 | 0.019240 | 247.146818 | 0.321459 | 0.138294 | 3.747085 | 62.057559 | 5.297419 | 4.897725 | 1.134521 |
| 64 | 40.3103 | -79.4713 | Westmoreland County | 129 | 1.405037 | 348899.0 | 339.382505 | 0.004140 | 261.814791 | 0.489625 | 0.359620 | 6.902709 | 55.703929 | 7.249383 | 6.548383 | 3.429400 |
| 65 | 41.5189 | -76.0181 | Wyoming County | 131 | 1.338452 | 26794.0 | 67.438935 | 0.019847 | 245.168440 | 0.483333 | 0.150128 | 3.656564 | 58.187187 | 5.253752 | 4.845182 | 1.129765 |
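
In the table above the county codes appear as 1, 3, 5 rather than 001, 003, 005, because the column was read back as integers. The sketch below shows that the zero padding is still present in the CSV text itself when the fields are read as strings. It uses only the standard library and an in-memory buffer in place of `./data/aqi.csv`; with pandas, passing `dtype={"County Code": str}` to `read_csv` should have a similar effect.

```python
import csv
import io

# Two columns of the first rows written above, held in memory for the demo.
rows = [["PA County Name", "County Code"], ["Adams County", "001"], ["Allegheny County", "003"]]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
buffer.seek(0)

# csv.DictReader returns every field as text, so the zero padding survives.
codes = [record["County Code"] for record in csv.DictReader(buffer)]
print(codes)
```

Keeping the codes as text matters if the dataset is later joined back to other census tables, which identify counties by the three-digit code.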
