# ICE: Sources of Data
## Name: Cabot Steward
## *DATA 3300*

In this in-class exercise we will examine different ways to source data: producing our own dataset from a class survey, using precompiled datasets, and pulling data from web APIs.

Whenever I am working on putting together a dataset, I do so with the intention of solving a particular problem or answering a particular question. This will guide what data I need to source!

**So, let's say we're starting a travel agency to help individuals plan optimal trips to National Parks. What sort of data might we want in order to make data-driven recommendations?**

*  customer level data - primary data collection - survey
*  park level data - secondary data



## **Part A) Primary Data Collection**

Surveys are one common way of collecting data, and they can be a helpful way of gathering information from specific individuals. Let's [create a survey](https://docs.google.com/forms/u/0/) with five or so questions we think will help us plan a National Parks holiday trip!

These questions should help us narrow down their preferences around:



*   Region of the country to visit
*   Preference around traffic
*   Type of sites preferred
*   Trail activity type preference
*   Weather preferences
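
Once responses come in, they can be treated like any other table. Here is a minimal sketch, assuming the responses have been exported to a table with one column per question; the column names and answers below are made up for illustration and are not real survey results.

```python
import pandas as pd

# Hypothetical survey responses: column names and answers are invented for illustration
responses = pd.DataFrame([
    {"region": "Intermountain West", "traffic": "Low", "site_type": "mountains",
     "trail_activity": "Hiker", "weather": "Cool"},
    {"region": "Intermountain West", "traffic": "High", "site_type": "canyons",
     "trail_activity": "Bicycle", "weather": "Warm"},
    {"region": "Pacific Northwest", "traffic": "Low", "site_type": "mountains",
     "trail_activity": "Hiker", "weather": "Cool"},
])

# Most common answer to each question (first row of the per-column mode)
top_answers = responses.mode().iloc[0]
print(top_answers)
```

Each of these answers maps onto a filter we could later apply to park-level or trail-level data.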



In [1]:
import pandas as pd
import requests
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

In [2]:
df_1 = pd.read_csv('NPS_-_Trails_-_Geographic_Coordinate_System.csv') # load the NPS trails CSV
df_1.head()

Unnamed: 0,OBJECTID,TRLNAME,TRLALTNAME,MAPLABEL,TRLSTATUS,TRLSURFACE,TRLTYPE,TRLCLASS,TRLUSE,PUBLICDISPLAY,...,ISEXTANT,OPENTOPUBLIC,ALTLANGNAME,ALTLANG,SEASONAL,SEASDESC,MAINTAINER,NOTES,StagingTable,ShapeSTLength
0,1,Jumbo Mine Trail,,Jumbo Mine Trail,Existing,Unknown,Standard Terra Trail,Class 2: Moderately Developed,Hiker/Pedestrian,Public Map Display,...,True,Unknown,,,Unknown,,National Park Service,Multiple Routes,STAGING_AKR_TRAILS_ALL_20240213004113,0.000682
1,2,Jumbo Mine Trail,,Jumbo Mine Trail,Existing,Unknown,Standard Terra Trail,Class 2: Moderately Developed,Hiker/Pedestrian,Public Map Display,...,True,Unknown,,,Unknown,,National Park Service,Multiple Routes,STAGING_AKR_TRAILS_ALL_20240213004113,0.000259
2,3,Jumbo Mine Trail,,Jumbo Mine Trail,Existing,Unknown,Standard Terra Trail,Class 2: Moderately Developed,Hiker/Pedestrian,Public Map Display,...,True,Unknown,,,Unknown,,National Park Service,Multiple Routes,STAGING_AKR_TRAILS_ALL_20240213004113,0.000264
3,4,Jumbo Mine Trail,,Jumbo Mine Trail,Existing,Unknown,Standard Terra Trail,Class 2: Moderately Developed,Hiker/Pedestrian,Public Map Display,...,True,Unknown,,,Unknown,,National Park Service,Multiple Routes,STAGING_AKR_TRAILS_ALL_20240213004113,0.000934
4,5,Jumbo Mine Trail,,Jumbo Mine Trail,Existing,Unknown,Standard Terra Trail,Class 2: Moderately Developed,Hiker/Pedestrian,Public Map Display,...,True,Unknown,,,Unknown,,National Park Service,Multiple Routes,STAGING_AKR_TRAILS_ALL_20240213004113,0.000281


## **Part B) Secondary Data Collection**

Let's begin by examining some sources of precompiled datasets. Websites I often use for sourcing datasets include:

*   Government websites, like [Transparent Utah](https://transparent.utah.gov/job_title_search.php)
*   [Kaggle](https://www.kaggle.com/datasets)
*   [Data is Plural](https://www.data-is-plural.com/archive/)

Let's see what precompiled data is out there that might be useful for finding info on national parks!

[National Parks Dataset](https://www.kaggle.com/datasets/thedevastator/the-united-states-national-parks?select=df_2.csv)

[National Park Trails](https://www.data-is-plural.com/archive/2020-08-26-edition/)

In [3]:
parks = pd.read_csv('df_2.csv') # load in data set
parks.head()

Unnamed: 0.1,Unnamed: 0,Name,Image,Location,Date established as park[7][12],Area (2021)[13],Recreation visitors (2021)[11],Description
0,0,Acadia,,"Maine.mw-parser-output .geo-default,.mw-parser...","February 26, 1919","49,071.40 acres (198.6 km2)",4069098,Covering most of Mount Desert Island and other...
1,1,American Samoa,,American Samoa14°15′S 170°41′W﻿ / ﻿14.25°S 170...,"October 31, 1988","8,256.67 acres (33.4 km2)",8495,The southernmost national park is on three Sam...
2,2,Arches,,Utah38°41′N 109°34′W﻿ / ﻿38.68°N 109.57°W,"November 12, 1971","76,678.98 acres (310.3 km2)",1806865,"This site features more than 2,000 natural san..."
3,3,Badlands,,South Dakota43°45′N 102°30′W﻿ / ﻿43.75°N 102.50°W,"November 10, 1978","242,755.94 acres (982.4 km2)",1224226,"The Badlands are a collection of buttes, pinna..."
4,4,Big Bend,,Texas29°15′N 103°15′W﻿ / ﻿29.25°N 103.25°W,"June 12, 1944","801,163.21 acres (3,242.2 km2)",581220,Named for the prominent bend in the Rio Grande...


**1) Narrow Down the Dataset: Let's filter this dataset down to parks of interest based on their Location -- Assume we want to stick to locations in the Intermountain West. How can we use the Location column to narrow down our list?**

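Each `Location` value begins with the state name, so we can keep the rows whose `Location` starts with one of the states we care about (see the `lst` variable below).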
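
The prefix-matching idea can be tried in plain Python first. This is a small sketch using a few `Location` strings copied (and trimmed) from the preview above; Python's built-in `str.startswith` accepts a tuple of prefixes, which is the same idea the pandas filter relies on.

```python
# Location strings trimmed from the parks preview; the state name comes first
locations = [
    "Maine.mw-parser-output .geo-default",
    "Utah38°41′N 109°34′W",
    "Texas29°15′N 103°15′W",
    "Colorado38°34′N 107°43′W",
]

# A tuple of prefixes: a string matches if it starts with ANY of them
states = ("Utah", "Colorado", "Wyoming", "Montana", "Idaho", "Arizona")

matches = [loc for loc in locations if loc.startswith(states)]
print(matches)
```

Only the Utah and Colorado strings survive, because the state name is glued to the front of the coordinates.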

In [4]:
lst = ('Utah', 'Colorado', 'Wyoming', 'Montana', 'Idaho', 'Arizona') # states to keep
filtered_parks = parks[parks['Location'].str.startswith(lst, na=False)] # keep rows whose Location begins with one of these states
filtered_parks.head()

Unnamed: 0.1,Unnamed: 0,Name,Image,Location,Date established as park[7][12],Area (2021)[13],Recreation visitors (2021)[11],Description
2,2,Arches,,Utah38°41′N 109°34′W﻿ / ﻿38.68°N 109.57°W,"November 12, 1971","76,678.98 acres (310.3 km2)",1806865,"This site features more than 2,000 natural san..."
6,6,Black Canyon of the Gunnison,,Colorado38°34′N 107°43′W﻿ / ﻿38.57°N 107.72°W,"October 21, 1999","30,779.83 acres (124.6 km2)",308910,The park protects a quarter of the Gunnison Ri...
7,7,Bryce Canyon,,Utah37°34′N 112°11′W﻿ / ﻿37.57°N 112.18°W,"February 25, 1928","35,835.08 acres (145.0 km2)",2104600,Bryce Canyon is a geological amphitheater on s...
8,8,Canyonlands,,Utah38°12′N 109°56′W﻿ / ﻿38.2°N 109.93°W,"September 12, 1964","337,597.83 acres (1,366.2 km2)",911594,This landscape was eroded into a maze of canyo...
9,9,Capitol Reef,,Utah38°12′N 111°10′W﻿ / ﻿38.20°N 111.17°W,"December 18, 1971","241,904.50 acres (979.0 km2)",1405353,The park's Waterpocket Fold is a 100-mile (160...


In [5]:
# view the parks within those locations
filtered_parks['Name']

2                           Arches
6     Black Canyon of the Gunnison
7                     Bryce Canyon
8                      Canyonlands
9                     Capitol Reef
21                         Glacier
23                  Grand Canyon *
24                     Grand Teton
26                Great Sand Dunes
42                    Mesa Verde *
47                Petrified Forest
50                  Rocky Mountain
51                         Saguaro
60                     Yellowstone
62                            Zion
Name: Name, dtype: object

**2) If we know our customer is interested in specific geographic sites or types of sites, how can we further filter down parks?**

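We can search the `Description` column for a keyword that matches the type of site the customer prefers, as in the cell below.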
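
To see why a keyword search uses `contains` rather than `startswith`, here is a small sketch on made-up descriptions (they are not taken from the parks dataset). The keyword can appear anywhere in the text and in any capitalization, so a prefix match would miss it.

```python
import pandas as pd

# Made-up descriptions for illustration, including one missing value
desc = pd.Series([
    "Tall peaks and alpine lakes in the Rocky Mountains",
    "Mountains rise above a field of sand dunes",
    "A maze of canyons carved by two rivers",
    None,
])

keyword = "mountains"

# startswith only checks the beginning of the string and is case-sensitive
starts = desc.str.startswith(keyword, na=False)

# contains looks anywhere in the string; case=False ignores capitalization
contains = desc.str.contains(keyword, case=False, na=False)

print(starts.sum(), "prefix matches;", contains.sum(), "keyword matches")
```

Passing `na=False` treats missing descriptions as non-matches instead of leaving gaps in the filter.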

In [6]:
desc_filter = 'mountains'
mountain_parks = filtered_parks[filtered_parks['Description'].str.contains(desc_filter, case=False, na=False)] # keyword search on Description; why are we using contains here instead of startswith?
mountain_parks.head()

Unnamed: 0.1,Unnamed: 0,Name,Image,Location,Date established as park[7][12],Area (2021)[13],Recreation visitors (2021)[11],Description
21,21,Glacier,,Montana48°48′N 114°00′W﻿ / ﻿48.80°N 114.00°W,"May 11, 1910","1,013,126.39 acres (4,100.0 km2)",3081656,The U.S. half of Waterton-Glacier Internationa...
24,24,Grand Teton,,Wyoming43°44′N 110°48′W﻿ / ﻿43.73°N 110.80°W,"February 26, 1929","310,044.36 acres (1,254.7 km2)",3885230,Grand Teton is the tallest mountain in the sce...
26,26,Great Sand Dunes,,Colorado37°44′N 105°31′W﻿ / ﻿37.73°N 105.51°W,"September 24, 2004","107,345.73 acres (434.4 km2)",602613,"The tallest sand dunes in North America, up to..."
50,50,Rocky Mountain,,Colorado40°24′N 105°35′W﻿ / ﻿40.40°N 105.58°W,"January 26, 1915","265,807.24 acres (1,075.7 km2)",4434848,Bisected north to south by the Continental Div...


In [7]:
mountain_parks['Name'].value_counts() # check the parks count

Name
Glacier             1
Grand Teton         1
Great Sand Dunes    1
Rocky Mountain      1
Name: count, dtype: int64

**3) Now let's incorporate our trail data! What pieces of information tie together these two datasets?**

In [45]:
trails = pd.read_csv('NPS_-_Trails_-_Geographic_Coordinate_System.csv') # read in second dataset
trails.head()

Unnamed: 0,OBJECTID,TRLNAME,TRLALTNAME,MAPLABEL,TRLSTATUS,TRLSURFACE,TRLTYPE,TRLCLASS,TRLUSE,PUBLICDISPLAY,...,ISEXTANT,OPENTOPUBLIC,ALTLANGNAME,ALTLANG,SEASONAL,SEASDESC,MAINTAINER,NOTES,StagingTable,ShapeSTLength
0,1,Jumbo Mine Trail,,Jumbo Mine Trail,Existing,Unknown,Standard Terra Trail,Class 2: Moderately Developed,Hiker/Pedestrian,Public Map Display,...,True,Unknown,,,Unknown,,National Park Service,Multiple Routes,STAGING_AKR_TRAILS_ALL_20240213004113,0.000682
1,2,Jumbo Mine Trail,,Jumbo Mine Trail,Existing,Unknown,Standard Terra Trail,Class 2: Moderately Developed,Hiker/Pedestrian,Public Map Display,...,True,Unknown,,,Unknown,,National Park Service,Multiple Routes,STAGING_AKR_TRAILS_ALL_20240213004113,0.000259
2,3,Jumbo Mine Trail,,Jumbo Mine Trail,Existing,Unknown,Standard Terra Trail,Class 2: Moderately Developed,Hiker/Pedestrian,Public Map Display,...,True,Unknown,,,Unknown,,National Park Service,Multiple Routes,STAGING_AKR_TRAILS_ALL_20240213004113,0.000264
3,4,Jumbo Mine Trail,,Jumbo Mine Trail,Existing,Unknown,Standard Terra Trail,Class 2: Moderately Developed,Hiker/Pedestrian,Public Map Display,...,True,Unknown,,,Unknown,,National Park Service,Multiple Routes,STAGING_AKR_TRAILS_ALL_20240213004113,0.000934
4,5,Jumbo Mine Trail,,Jumbo Mine Trail,Existing,Unknown,Standard Terra Trail,Class 2: Moderately Developed,Hiker/Pedestrian,Public Map Display,...,True,Unknown,,,Unknown,,National Park Service,Multiple Routes,STAGING_AKR_TRAILS_ALL_20240213004113,0.000281


**4) Join the datasets by mapping Name to UNITCODE, then joining on that shared key!**

In [None]:
# Map each park name to its NPS unit code (the same codes used for the forecast lookup in Part C)
parks_nps_codes = {
    "Glacier": "GLAC",
    "Grand Teton": "GRTE",
    "Great Sand Dunes": "GRSA",
    "Rocky Mountain": "ROMO",
    "Saguaro": "SAGU",
    "Yellowstone": "YELL"
}

# Rename the 'Name' column to 'UNITCODE' and replace values using parks_nps_codes
mountain_parks = mountain_parks.rename(columns={'Name': 'UNITCODE'})
mountain_parks['UNITCODE'] = mountain_parks['UNITCODE'].replace(parks_nps_codes)
mountain_parks.head()

In [None]:
merged_df = mountain_parks.merge(trails, on='UNITCODE', how='inner') # inner merge keeps only parks and trails that share a UNITCODE
merged_df.head()

In [None]:
# view all columns in merged_df
merged_df.columns.tolist()

**5) What has changed about the structure of the dataset? (Apart from adding new columns from the join)?**
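
One way to explore this question is with a toy one-to-many merge. In this sketch the visitor counts are the 2021 figures shown above for Rocky Mountain and Glacier, while the trail names are placeholders invented for illustration.

```python
import pandas as pd

# One row per park
parks_toy = pd.DataFrame({
    "UNITCODE": ["ROMO", "GLAC"],
    "visitors": [4434848, 3081656],
})

# Several rows per park: one row per trail (placeholder trail names)
trails_toy = pd.DataFrame({
    "UNITCODE": ["ROMO", "ROMO", "ROMO", "GLAC"],
    "TRLNAME": ["Trail A", "Trail B", "Trail C", "Trail D"],
})

merged_toy = parks_toy.merge(trails_toy, on="UNITCODE", how="inner")
print(len(parks_toy), "park rows ->", len(merged_toy), "merged rows")
print(merged_toy)
```

After the merge, each row describes a trail rather than a park, and the park-level values are copied onto every trail that belongs to that park.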

## **Part C) OpenWeatherMap API - Let's Check the Forecast at our Parks of Interest!**

To access the OpenWeatherMap API, first navigate [here](https://openweathermap.org/api) and sign up for an account!

Then navigate to the `My API Keys` page under your profile to generate an API key. This is all we need (apart from the request/endpoint) to start pulling data from this website!

**1) Store API Credentials**

Note: Hard-coding a key in the notebook isn't a great way to do this, because anyone who can see the notebook can see the key. But since this is a free account, we're not too worried about security.
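
A common alternative is to keep the key out of the notebook and read it from an environment variable. This is a minimal sketch; the variable name `OPENWEATHER_API_KEY` and the helper `load_api_key` are hypothetical names chosen for illustration, not something OpenWeatherMap requires.

```python
import os


def load_api_key(env_var="OPENWEATHER_API_KEY"):
    """Return the API key stored in the given environment variable, or None if it is unset.

    The default variable name is a hypothetical choice for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        print(f"No key found: set the {env_var} environment variable first")
        return None
    return key


key_from_env = load_api_key()
```

With this approach the notebook can be shared without exposing the key itself.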

In [10]:
# API key (replace this value with your own key from the `My API Keys` page)
api_key = '42e919afac4c1f87a02bbbff6fab2f54'


In [1]:
import requests

city = input("Enter City:")

# Current-weather endpoint; passing params lets requests build and encode the query string
url = 'http://api.openweathermap.org/data/2.5/weather'
params = {'q': city, 'appid': api_key, 'units': 'metric'}

res = requests.get(url, params=params)
data = res.json()

print(data)

# humidity = data['main']['humidity']
# pressure = data['main']['pressure']
# wind = data['wind']['speed']
# description = data['weather'][0]['description']
# temp = data['main']['temp']

# print('Temperature:',temp,'°C')
# print('Wind:',wind)
# print('Pressure: ',pressure)
# print('Humidity: ',humidity)
# print('Description:',description)


{'coord': {'lon': 14.5048, 'lat': 35.8628}, 'weather': [{'id': 800, 'main': 'Clear', 'description': 'clear sky', 'icon': '01n'}], 'base': 'stations', 'main': {'temp': 15.12, 'feels_like': 14.99, 'temp_min': 15.12, 'temp_max': 15.12, 'pressure': 1019, 'humidity': 88, 'sea_level': 1019, 'grnd_level': 1018}, 'visibility': 10000, 'wind': {'speed': 5.66, 'deg': 220}, 'clouds': {'all': 0}, 'dt': 1737659585, 'sys': {'type': 1, 'id': 6861, 'country': 'MT', 'sunrise': 1737612493, 'sunset': 1737649147}, 'timezone': 3600, 'id': 2562529, 'name': 'Santa Luċija', 'cod': 200}
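
The response is a nested dictionary, so individual readings are pulled out by key. This sketch uses a trimmed copy of the response printed above rather than a live request, so it runs without an API key.

```python
# Trimmed copy of the response printed above (not a live request)
data = {
    'weather': [{'id': 800, 'main': 'Clear', 'description': 'clear sky', 'icon': '01n'}],
    'main': {'temp': 15.12, 'feels_like': 14.99, 'pressure': 1019, 'humidity': 88},
    'wind': {'speed': 5.66, 'deg': 220},
    'name': 'Santa Luċija',
    'cod': 200,
}

# 'main' and 'wind' are dictionaries; 'weather' is a list holding one dictionary
temp = data['main']['temp']
humidity = data['main']['humidity']
pressure = data['main']['pressure']
wind = data['wind']['speed']
description = data['weather'][0]['description']

print('Temperature:', temp, '°C')
print('Wind:', wind)
print('Pressure:', pressure)
print('Humidity:', humidity)
print('Description:', description)
```

Note the `[0]` on `weather`: it is a list, so we index into it before asking for `description`.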


**2) Set up your request!**

Here, the goal is to add the five-day average forecast temperature for each of our six parks to our dataset, using the OpenWeatherMap API.
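
The averaging itself can be sketched without calling the API. The entries below are made-up values shaped like the items we expect in the forecast response (a `dt_txt` timestamp and a `main.temp` reading); the approach averages within each day first, then averages the daily values.

```python
from collections import defaultdict
from statistics import mean

# Made-up forecast entries for illustration
forecast_list = [
    {"dt_txt": "2025-01-24 00:00:00", "main": {"temp": 20.0}},
    {"dt_txt": "2025-01-24 12:00:00", "main": {"temp": 30.0}},
    {"dt_txt": "2025-01-25 00:00:00", "main": {"temp": 25.0}},
    {"dt_txt": "2025-01-25 12:00:00", "main": {"temp": 35.0}},
    {"dt_txt": "2025-01-26 00:00:00", "main": {"temp": 40.0}},
]

# Group temperatures by calendar date (the part of dt_txt before the space)
temps_by_date = defaultdict(list)
for item in forecast_list:
    date = item["dt_txt"].split(" ")[0]
    temps_by_date[date].append(item["main"]["temp"])

# Average within each day, then average the daily averages
daily_avg = {date: mean(temps) for date, temps in temps_by_date.items()}
overall_avg = mean(list(daily_avg.values()))

print(daily_avg)
print(round(overall_avg, 2))
```

Because the last day has only one reading, the mean of daily means differs from the mean of all five readings; that is worth keeping in mind when interpreting the summary value.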

In [4]:
parks_info = {
    "GLAC": {"lat": 48.6966, "lon": -113.7183},
    "GRTE": {"lat": 43.7904, "lon": -110.6818},
    "GRSA": {"lat": 37.7926, "lon": -105.5943},
    "ROMO": {"lat": 40.3428, "lon": -105.6836},
    "SAGU": {"lat": 32.2967, "lon": -111.1666},
    "YELL": {"lat": 44.4280, "lon": -110.5885}
} # latitude/longitude used for each park's forecast request, keyed by NPS unit code

In [6]:
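# Collect one summary record (unit code + average temperature) per park
avg_temp_summary = []

# Loop over each park's unit code and coordinates
for unit_code, coords in parks_info.items():
    lat, lon = coords['lat'], coords['lon']

    # Build the forecast request URL for this park's coordinates (imperial units = Fahrenheit)
    url = f"http://api.openweathermap.org/data/2.5/forecast?lat={lat}&lon={lon}&appid={api_key}&units=imperial"

    # Send the request to the API
    response = requests.get(url)

    if response.status_code == 200:
        # Parse the JSON response
        forecast_data = response.json()
        forecast_list = forecast_data['list']

        # Keep only the timestamp and temperature from each forecast entry
        df_forecast = pd.DataFrame([{
            'datetime': item['dt_txt'],
            'temperature': item['main']['temp']
        } for item in forecast_list])

        # Convert the timestamp strings to datetime values
        df_forecast['datetime'] = pd.to_datetime(df_forecast['datetime'])

        # Extract the calendar date and average the temperatures within each day
        df_forecast['date'] = df_forecast['datetime'].dt.date
        daily_avg_temps = df_forecast.groupby('date')['temperature'].mean().reset_index()

        # Average the daily averages into a single value for the forecast window
        five_day_avg_temp = daily_avg_temps['temperature'].mean()

        # Store the result for this park
        avg_temp_summary.append({'UNITCODE': unit_code, 'five_day_avg_temp': five_day_avg_temp})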

    else:
        print(f"Failed to retrieve data for {unit_code}: Status code", response.status_code)


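df_avg_temp_summary = pd.DataFrame(avg_temp_summary) # Create a DataFrame from the summary list


merged_df = merged_df.merge(df_avg_temp_summary, on='UNITCODE', how='left') # Left merge keeps every row of merged_df and adds the temperature where the UNITCODE matches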


merged_df.head()

Failed to retrieve data for GLAC: Status code 401
Failed to retrieve data for GRTE: Status code 401
Failed to retrieve data for GRSA: Status code 401
Failed to retrieve data for ROMO: Status code 401
Failed to retrieve data for SAGU: Status code 401
Failed to retrieve data for YELL: Status code 401


NameError: name 'merged_df' is not defined

**3) Why are the values in five_day_avg_temp repeated?**

**4) What is the difference between an inner (what we did the first time) and left merge (what we just did)?**
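
Both questions can be explored on a toy example. In this sketch the trail names and temperatures are invented, and `ZION` is included only as a park that has no temperature record, so the two merge types give different results.

```python
import pandas as pd

# Trail-level table: several rows can share a UNITCODE (placeholder trail names)
trails_toy = pd.DataFrame({
    "UNITCODE": ["ROMO", "ROMO", "GLAC", "ZION"],
    "TRLNAME": ["Trail A", "Trail B", "Trail C", "Trail D"],
})

# Park-level table: one temperature per UNITCODE (made-up values), none for ZION
temps_toy = pd.DataFrame({
    "UNITCODE": ["ROMO", "GLAC"],
    "five_day_avg_temp": [21.5, 12.0],
})

# Inner merge: only rows whose UNITCODE appears in BOTH tables
inner = trails_toy.merge(temps_toy, on="UNITCODE", how="inner")

# Left merge: every row of the left table, with NaN where no match exists
left = trails_toy.merge(temps_toy, on="UNITCODE", how="left")

print(inner)
print(left)
```

The `ROMO` temperature shows up on both `ROMO` trails because it is a park-level value attached to trail-level rows, and the `ZION` trail appears only in the left merge, with a missing temperature.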

## **Part D) Build a Park \& Trail Recommender!**

In [None]:
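def park_recommender(activity, state, n=5, popular=True):
    # Keep only rows whose Location starts with the requested state
    state_parks = merged_df[merged_df['Location'].str.startswith(state, na=False)]

    # Keep only trails whose listed use mentions the requested activity
    activity_parks = state_parks[state_parks['TRLUSE'].str.contains(activity, case=False, na=False)]

    # Sort by 2021 visitor count: most visited first if popular=True, least visited first otherwise
    activity_parks = activity_parks.sort_values(
        'Recreation visitors (2021)[11]',
        ascending=not popular
    )

    # Take the first n rows after sorting
    top_parks = activity_parks.head(n)

    # Display columns are one possible choice; swap in whichever columns you want shown
    return top_parks[['UNITCODE', 'Location', 'TRLNAME', 'TRLUSE', 'Recreation visitors (2021)[11]']]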

In [None]:
recommendations = park_recommender("Hiker", "Colorado", n=10) # example arguments; swap in the activity and state from the survey
recommendations

**1) Describe how the recommender works -- why have these particular trails been recommended?**
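
To reason about this, it can help to run the same logic on a tiny table where the answer is easy to check by hand. This is a sketch, not the recommender itself: `recommend` takes the table as an argument, the column `visitors` stands in for `Recreation visitors (2021)[11]`, and the trail names and the `Bicycle` use value are invented.

```python
import pandas as pd

# Toy stand-in for merged_df: trail names and the Bicycle row are made up
toy = pd.DataFrame({
    "UNITCODE": ["ROMO", "ROMO", "GRSA", "GLAC"],
    "Location": ["Colorado40°24′N", "Colorado40°24′N", "Colorado37°44′N", "Montana48°48′N"],
    "TRLNAME": ["Trail A", "Trail B", "Trail C", "Trail D"],
    "TRLUSE": ["Hiker/Pedestrian", "Bicycle", "Hiker/Pedestrian", "Hiker/Pedestrian"],
    "visitors": [4434848, 4434848, 602613, 3081656],
})


def recommend(df, activity, state, n=5, popular=True):
    # Step 1: keep rows in the requested state
    in_state = df[df["Location"].str.startswith(state, na=False)]
    # Step 2: keep trails whose use mentions the activity
    matches = in_state[in_state["TRLUSE"].str.contains(activity, case=False, na=False)]
    # Step 3: rank by park visitor count and keep the top n
    return matches.sort_values("visitors", ascending=not popular).head(n)


popular_first = recommend(toy, "hiker", "Colorado")
quiet_first = recommend(toy, "hiker", "Colorado", popular=False)
print(popular_first[["UNITCODE", "TRLNAME", "visitors"]])
```

Notice that the ranking uses the park's visitor count, not anything about the individual trail, so trails within the same park are tied with one another.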

**2) Does this recommender suit the needs of our client (based on the data we wanted to collect from them)? Explain.**

**3) How can we improve our approach? Should we be asking different questions in our survey? Trying to source different data? Trying to change our recommender function?**