# ICE: Sources of Data
## Name:
## *DATA 3300*

In this in-class exercise we will examine different ways to source data: first we will produce our own dataset from a class survey, then we will look at precompiled datasets, and finally we will pull data from a web API.

Whenever I am working on putting together a dataset, I do so with the intention of solving a particular problem or answering a particular question. This will guide what data I need to source!

**So, let's say we're starting a travel agency to help individuals plan optimal trips to National Parks. What sort of data might we want in order to make data-driven recommendations?**



*   
*   



## **Part A) Primary Data Collection**

Surveys are a common way of collecting data, and they can be helpful for gathering information from specific individuals. Let's [create a survey](https://docs.google.com/forms/u/0/) with five or so questions we think will help us plan a National Parks holiday trip!

These questions should help us narrow down our respondents' preferences around:



*   Region of the country to visit
*   Preference around traffic
*   Type of sites preferred
*   Trail activity type preference
*   Weather preferences



In [None]:
import pandas as pd
import requests
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

In [None]:
df_1 = pd.read_csv('insert file path')  # import the survey responses exported from Google Sheets
df_1.head()



## **Part B) Secondary Data Collection**

Let's begin by examining some sources of precompiled datasets. Websites I often use for sourcing datasets include:

*   Government websites, like [Transparent Utah](https://transparent.utah.gov/job_title_search.php)
*   [Kaggle](https://www.kaggle.com/datasets)
*   [Data is Plural](https://www.data-is-plural.com/archive/)

Let's see what precompiled data is out there that might help us learn about national parks!

[National Parks Dataset](https://www.kaggle.com/datasets/thedevastator/the-united-states-national-parks?select=df_2.csv)

[National Park Trails](https://www.data-is-plural.com/archive/2020-08-26-edition/)

In [None]:
parks = pd.read_csv('') # load in data set
parks.head()

**1) Narrow Down the Dataset: Let's filter this dataset down to parks of interest based on their Location -- assume we want to stick to locations in the Intermountain West. How can we use the Location column to narrow down our list?**

In [None]:
filtered_parks = parks[parks['Location'].str.startswith(())] # add in strings to filter down location
filtered_parks.head()
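
As a minimal sketch of the filtering idea, the cell below uses a small made-up table (the park names and `Location` strings are invented, not taken from the Kaggle file). It shows that `str.startswith` accepts a tuple of prefixes and keeps a row when any one of them matches.

```python
import pandas as pd

# Toy data for illustration only; these rows are not from the real dataset
toy_parks = pd.DataFrame({
    "Name": ["Park A", "Park B", "Park C", "Park D"],
    "Location": ["Utah (east)", "Maine (coast)", "Idaho (north)", "Utah (south)"],
})

# A tuple of prefixes: a row is kept if Location starts with any of them
states = ("Utah", "Idaho")
toy_filtered = toy_parks[toy_parks["Location"].str.startswith(states)]

print(toy_filtered["Name"].tolist())  # ['Park A', 'Park C', 'Park D']
```

The same pattern should carry over to `parks['Location']`, provided each value begins with the state name.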

In [None]:
# view the parks within those locations

**2) If we know our customer is interested in specific geographic sites or types of sites, how can we further filter down parks?**

In [None]:
mountain_parks = filtered_parks[filtered_parks['Description'].str.contains()] # add in description; why are we using contains here instead of startswith?
mountain_parks.head()
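
To compare the two string methods side by side, here is a small sketch on invented descriptions (not the real `Description` column). `str.startswith` only checks the beginning of each string, while `str.contains` searches the whole string; `case=False` and `na=False` are optional arguments that ignore letter case and treat missing values as non-matches.

```python
import pandas as pd

# Invented descriptions, used only to compare the two methods
toy_desc = pd.DataFrame({
    "Name": ["Park A", "Park B", "Park C"],
    "Description": [
        "Mountains rise above alpine lakes.",
        "A desert of tall cacti.",
        "Home to glaciers and rugged mountain peaks.",
    ],
})

# Matches only when the text begins with the pattern
starts = toy_desc[toy_desc["Description"].str.startswith("Mountain")]

# Matches when the pattern appears anywhere in the text
contains = toy_desc[toy_desc["Description"].str.contains("mountain", case=False, na=False)]

print(starts["Name"].tolist())    # ['Park A']
print(contains["Name"].tolist())  # ['Park A', 'Park C']
```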

In [None]:
mountain_parks['Name'].value_counts() # check the parks count

**3) Now let's incorporate our trail data! What pieces of information tie these two datasets together?**

In [None]:
trails = pd.read_csv() # read in second dataset
trails.head()

**4) Join the datasets by mapping `Name` to `UNITCODE`, then joining on that shared key!**

In [None]:
parks_nps_codes = {
    "Glacier": "",
    "Grand Teton": "",
    "Great Sand Dunes": "",
    "Rocky Mountain": "ROMO",
    "Saguaro": "SAGU",
    "Yellowstone": "YELL"
}

# Rename the 'Name' column to 'UNITCODE' and replace values using parks_nps_codes
mountain_parks = mountain_parks.rename() # fill in rename argument
mountain_parks['UNITCODE'] = mountain_parks['UNITCODE'].replace() # fill in replace argument
mountain_parks.head()
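
The rename-then-replace pattern can be tried out on a toy table first. In this sketch the park names and four-letter codes are hypothetical, chosen only to show how `rename(columns=...)` and `replace(...)` accept dictionaries.

```python
import pandas as pd

# Hypothetical names and codes, for illustration only
toy_names = pd.DataFrame({"Name": ["Alpha Park", "Beta Park"], "Visitors": [100, 250]})
toy_codes = {"Alpha Park": "ALPH", "Beta Park": "BETA"}

# rename takes a {old_column: new_column} mapping
toy_names = toy_names.rename(columns={"Name": "UNITCODE"})

# replace takes a {old_value: new_value} mapping
toy_names["UNITCODE"] = toy_names["UNITCODE"].replace(toy_codes)

print(toy_names["UNITCODE"].tolist())  # ['ALPH', 'BETA']
```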

In [None]:
merged_df = # create new dataframe by merging mountain_parks and trails
merged_df.head()
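
As a rough illustration of merging on a shared key, the sketch below joins two invented tables. The codes, states, and the `TRLNAME` column are made up for this example. One table has one row per park and the other has one row per trail, which mirrors the general shape of a park table and a trail table.

```python
import pandas as pd

# Invented tables: one row per park, and one row per trail
toy_park_table = pd.DataFrame({"UNITCODE": ["ALPH", "BETA"], "State": ["Utah", "Idaho"]})
toy_trail_table = pd.DataFrame({
    "UNITCODE": ["ALPH", "ALPH", "BETA", "GAMM"],
    "TRLNAME": ["Ridge Loop", "Lake Spur", "Canyon Rim", "Dune Walk"],
})

# An inner merge keeps only the keys that appear in both tables
toy_merged = toy_park_table.merge(toy_trail_table, on="UNITCODE", how="inner")

print(toy_merged.shape)  # (3, 3)
print(toy_merged)
```

Comparing the row counts before and after the merge is a quick way to check what the join did to the table.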

In [None]:
# view all columns in merged_df

**5) What has changed about the structure of the dataset? (Apart from adding new columns from the join)?**

## **Part C) OpenWeatherMap API - Let's Check the Forecast at our Parks of Interest!**

To access the OpenWeatherMap API, first navigate [here](https://openweathermap.org/api) and sign up for an account!

Then navigate to the `My API Keys` page under your profile to generate an API key. This is all we need (apart from the request/endpoint) to start pulling data from this website!

**1) Store API Credentials**

Note: Hard-coding a key in a notebook isn't good practice, because anyone who can see the notebook can use the key. Since this is a free account, we're not too worried about security here.

In [None]:
# API key (paste your own key between the quotes)
api_key = ''
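
One common alternative to pasting a key into the notebook is to read it from an environment variable. This is a minimal sketch; the variable name `OPENWEATHER_API_KEY` and the Python name `api_key_from_env` are choices made for this example, not something OpenWeatherMap requires.

```python
import os

# OPENWEATHER_API_KEY is a name chosen for this example; set it in your shell
# or notebook environment before launching Python
api_key_from_env = os.environ.get("OPENWEATHER_API_KEY", "")

if not api_key_from_env:
    print("No key found; set OPENWEATHER_API_KEY before running the request cells.")
```

With this approach the key never appears in the saved notebook, so sharing the file does not share the key.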

**2) Set up your request!**

Here, the goal is to add the average forecast temperature for each of our six parks to our dataset, using the OpenWeatherMap API.

In [None]:
parks_info = {
    "GLAC": {"lat": 48.6966, "lon": -113.7183},
    "GRTE": {"lat": 43.7904, "lon": -110.6818},
    "GRSA": {"lat": 37.7926, "lon": -105.5943},
    "ROMO": {"lat": 40.3428, "lon": -105.6836},
    "SAGU": {"lat": 32.2967, "lon": -111.1666},
    "YELL": {"lat": 44.4280, "lon": -110.5885}
} # add comment

In [None]:
# Add comments
avg_temp_summary = []

#
for unit_code, coords in parks_info.items():
    lat, lon = coords['lat'], coords['lon']

    #
    url = f"https://api.openweathermap.org/data/2.5/forecast?lat={lat}&lon={lon}&appid={api_key}&units=imperial"

    #
    response = requests.get(url)

    if response.status_code == 200:
        # Parse the JSON response
        forecast_data = response.json()
        forecast_list = forecast_data['list']

        #
        df_forecast = pd.DataFrame([{
            'datetime': item['dt_txt'],
            'temperature': item['main']['temp']
        } for item in forecast_list])

        #
        df_forecast['datetime'] = pd.to_datetime(df_forecast['datetime'])

        #
        df_forecast['date'] = df_forecast['datetime'].dt.date
        daily_avg_temps = df_forecast.groupby('date')['temperature'].mean().reset_index()

        #
        five_day_avg_temp = daily_avg_temps['temperature'].mean()

        #
        avg_temp_summary.append({'UNITCODE': unit_code, 'five_day_avg_temp': five_day_avg_temp})

    else:
        print(f"Failed to retrieve data for {unit_code}: status code {response.status_code}")


df_avg_temp_summary = # Create a DataFrame from the summary list


merged_df = merged_df.merge() # Merge the summary DataFrame with your existing merged_df on UNITCODE


merged_df.head()
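
The averaging steps inside the loop can be tried offline on a hand-written payload. In the sketch below, `sample_payload` is a made-up stand-in for `response.json()` that keeps only the keys the loop reads (`list`, `dt_txt`, `main`, `temp`); the timestamps and temperatures are invented.

```python
import pandas as pd

# Hand-written stand-in for response.json(); all values are made up
sample_payload = {
    "list": [
        {"dt_txt": "2024-06-01 09:00:00", "main": {"temp": 60.0}},
        {"dt_txt": "2024-06-01 15:00:00", "main": {"temp": 70.0}},
        {"dt_txt": "2024-06-02 09:00:00", "main": {"temp": 50.0}},
    ]
}

# One row per forecast entry
sample_df = pd.DataFrame(
    [{"datetime": item["dt_txt"], "temperature": item["main"]["temp"]}
     for item in sample_payload["list"]]
)

# Convert the text timestamps, then keep just the calendar date
sample_df["datetime"] = pd.to_datetime(sample_df["datetime"])
sample_df["date"] = sample_df["datetime"].dt.date

# Average within each day, then average the daily means
sample_daily = sample_df.groupby("date")["temperature"].mean().reset_index()
sample_overall = sample_daily["temperature"].mean()

print(sample_daily["temperature"].tolist())  # [65.0, 50.0]
print(sample_overall)                        # 57.5
```

Note that the mean of daily means (57.5 here) can differ from the mean of all readings (60.0 here) when days have different numbers of entries.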

**3) Why are the values in five_day_avg_temp repeated?**

**4) What is the difference between an inner merge (what we did the first time) and a left merge (what we just did)?**
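
To experiment with the `how` argument, the sketch below merges two invented tables both ways. All codes and numbers are made up, and one key (`GAMM`) is deliberately missing from the right-hand table so the two results differ.

```python
import pandas as pd

# Invented tables; 'GAMM' has no match in the right-hand table
toy_left = pd.DataFrame({"UNITCODE": ["ALPH", "BETA", "GAMM"], "trail_count": [12, 7, 3]})
toy_right = pd.DataFrame({"UNITCODE": ["ALPH", "BETA"], "avg_temp": [61.5, 48.0]})

toy_inner = toy_left.merge(toy_right, on="UNITCODE", how="inner")
toy_left_join = toy_left.merge(toy_right, on="UNITCODE", how="left")

print(len(toy_inner), len(toy_left_join))      # 2 3
print(toy_left_join["avg_temp"].isna().sum())  # 1
```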

## **Part D) Build a Park & Trail Recommender!**

In [None]:
def park_recommender(activity, state, n=5, popular=True):
    # Add comments
    state_parks = merged_df[merged_df['Location'].str.startswith(state, na=False)]

    #
    activity_parks = state_parks[state_parks['TRLUSE'].str.contains(activity, case=False, na=False)]

    #
    activity_parks = activity_parks.sort_values(
        'Recreation visitors (2021)[11]',
        ascending=not popular
    )

    #
    top_parks = activity_parks.head(n)

    return top_parks[['UNITCODE', 'Location', 'add whatever columns you want displayed for the top trails/parks']]

In [None]:
recommendations = park_recommender("activity", "state", n=10)
recommendations
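
The `ascending=not popular` argument in the recommender can be explored in isolation. This sketch assumes a numeric visitor column; the table, the `visitors` column, and the helper `order_by_visitors` are hypothetical and exist only for this example.

```python
import pandas as pd

# Invented visitor counts, stored as numbers
toy_visits = pd.DataFrame({"UNITCODE": ["ALPH", "BETA", "GAMM"], "visitors": [300, 100, 200]})

def order_by_visitors(df, popular=True):
    # popular=True gives ascending=False, so the busiest park comes first
    return df.sort_values("visitors", ascending=not popular)

most = order_by_visitors(toy_visits, popular=True)["UNITCODE"].tolist()
least = order_by_visitors(toy_visits, popular=False)["UNITCODE"].tolist()

print(most)   # ['ALPH', 'GAMM', 'BETA']
print(least)  # ['BETA', 'GAMM', 'ALPH']
```

If a visitor column is stored as text rather than numbers, the sort order would follow string comparison instead, so it is worth checking the column's dtype first.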

**1) Describe how the recommender works -- why have these particular trails been recommended?**

**2) Does this recommender suit the needs of our client (based on the data we wanted to collect from them)? Explain.**

**3) How can we improve our approach? Should we ask different questions in our survey, source different data, or change our recommender function?**