## Red Rock Canyon Data Project

Hello, and welcome to my Red Rock Canyon project notebook.

The purpose of this notebook is to demonstrate data scraping, cleaning, and visualization skills, using one of my hobbies as the topic. I'm a Las Vegas local, and hiking is one of my favorite things about the city I live in. In this notebook I will use the Red Rock Canyon website to build a data set of the hikes and trails located in the canyon. I hope you enjoy it.

-Richard Henderson

In [44]:
#Importing dependencies...

from bs4 import BeautifulSoup
import pandas as pd
import plotly.express as px
import requests


In [20]:
user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}


trails_page = requests.get(url = 'https://www.redrockcanyonlv.org/lasvegas/hikes-trails/', headers = user_agent).text
trails_soup = BeautifulSoup(trails_page, 'html.parser')
print(trails_soup)

#The code above downloads the page at the URL of our choice and parses it into a BeautifulSoup object.

<!DOCTYPE html>

<html lang="en-US">
<head>
<meta charset="utf-8"/><meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=0" name="viewport"/><meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/><meta content="telephone=no" name="format-detection"/><title>Hikes &amp; Trails - Round-Trip Distances &amp; Times | Red Rock Canyon Las Vegas</title>
<script type="text/javascript">var ajaxurl = "https://www.redrockcanyonlv.org/wp-admin/admin-ajax.php";</script>
<!-- All in One SEO 4.1.5.3 -->
<meta content="Hikes are numbered according to their location on the trail map. Georeferenced maps are also included with each trail and can be used with any georeferenced map mobile application." name="description">
<meta content="max-image-preview:large" name="robots"/>
<link href="https://www.redrockcanyonlv.org/lasvegas/hikes-trails/" rel="canonical"/>
<link href="https://www.redrockcanyonlv.org/lasvegas/hikes-trails/page/2/" rel="next"/>
<

In [21]:
#There are actually two pages worth of trails. This does the same thing as above for the second page. 
trails_page2 = requests.get(url = 'https://www.redrockcanyonlv.org/lasvegas/hikes-trails/page/2/',headers = user_agent).text
trails_soup2 = BeautifulSoup(trails_page2, 'html.parser')

trails_soup2


<!DOCTYPE html>

<html lang="en-US">
<head>
<meta charset="utf-8"/><meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=0" name="viewport"/><meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/><meta content="telephone=no" name="format-detection"/><title>Hikes &amp; Trails - Round-Trip Distances &amp; Times | Red Rock Canyon Las Vegas - Part 2</title>
<script type="text/javascript">var ajaxurl = "https://www.redrockcanyonlv.org/wp-admin/admin-ajax.php";</script>
<!-- All in One SEO 4.1.5.3 -->
<meta content="Hikes are numbered according to their location on the trail map. Georeferenced maps are also included with each trail and can be used with any georeferenced map mobile application. - Part 2" name="description">
<meta content="noindex, nofollow, max-image-preview:large" name="robots"/>
<link href="https://www.redrockcanyonlv.org/lasvegas/hikes-trails/page/2/" rel="canonical"/>
<link href="https://www.redrockcanyonlv.org/l
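
The head of the first page contains a `<link rel="next">` tag that points at page 2, so the second URL could be read from the markup instead of being typed in by hand. This is a minimal sketch, assuming the site keeps emitting that tag on every listing page except the last. `next_page_url` and `sample_head` are hypothetical names, and the sample HTML is a cut-down stand-in for the real page.

```python
from bs4 import BeautifulSoup


def next_page_url(html):
    """Return the href of <link rel="next"> if the page has one, else None."""
    soup = BeautifulSoup(html, "html.parser")
    # rel is a multi-valued attribute, so "next" is matched against each value.
    link = soup.find("link", rel="next")
    return link.get("href") if link else None


# Cut-down stand-in for a listing page; only the <link> tags matter here.
sample_head = """
<html><head>
<link href="https://www.redrockcanyonlv.org/lasvegas/hikes-trails/" rel="canonical"/>
<link href="https://www.redrockcanyonlv.org/lasvegas/hikes-trails/page/2/" rel="next"/>
</head><body></body></html>
"""

print(next_page_url(sample_head))
print(next_page_url("<html><head></head></html>"))
```

Calling this in a loop until it returns `None` would collect every listing page without knowing in advance how many there are.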

In [22]:
#Two list comprehensions create a list of trail URLs from our trails_soup objects. Each one finds all h2 tags with the class 'the-title'.
#Then, for each h2 returned by find_all(), it finds the 'a' tag and gets its 'href'.

page1_list = [x.find("a").get('href') for x in trails_soup.find_all('h2',{'class':'the-title'})]
page2_list = [x.find("a").get('href') for x in trails_soup2.find_all('h2',{'class':'the-title'})]

#Concatenate the two lists to make one list of URLs.

trail_urls = page1_list + page2_list

In [23]:
trail_urls

['https://www.redrockcanyonlv.org/moenkopi-loop/',
 'https://www.redrockcanyonlv.org/calico-hills/',
 'https://www.redrockcanyonlv.org/calico-tanks/',
 'https://www.redrockcanyonlv.org/turtlehead-peak/',
 'https://www.redrockcanyonlv.org/keystone-thrust/',
 'https://www.redrockcanyonlv.org/white-rock-willow-springs/',
 'https://www.redrockcanyonlv.org/grand-circle-loop/',
 'https://www.redrockcanyonlv.org/white-rock-la-madre-spring-loop/',
 'https://www.redrockcanyonlv.org/willow-springs-loop/',
 'https://www.redrockcanyonlv.org/la-madre-spring/',
 'https://www.redrockcanyonlv.org/petroglyph-wall-trail/',
 'https://www.redrockcanyonlv.org/north-peak-trail/',
 'https://www.redrockcanyonlv.org/bridge-mountain-trail/',
 'https://www.redrockcanyonlv.org/lost-creek-childrens-discovery/',
 'https://www.redrockcanyonlv.org/smyc/',
 'https://www.redrockcanyonlv.org/ice-box-canyon/',
 'https://www.redrockcanyonlv.org/dales/',
 'https://www.redrockcanyonlv.org/pine-creek-canyon/',
 'https://www.

In [24]:
#With our list of URLs we can now request each trail page and scrape the necessary information from it.
#First we create a list of parsed HTML pages with a list comprehension.

trail_htmls = [BeautifulSoup(requests.get(url = link ,headers = user_agent).text, 'html.parser') for link in trail_urls]

#Then, in the next cell, we scrape each page in the list and collect the results in a DataFrame.
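
Fetching thirty-odd pages in one list comprehension works, but a single failed request would silently hand an error page to the parser. Here is a hedged sketch of a more defensive version, using `requests.Session`, a timeout, and `raise_for_status()`. The helper names `fetch_soup` and `fetch_all` and the one-second delay are my own assumptions, not anything the site requires.

```python
import time

import requests
from bs4 import BeautifulSoup


def fetch_soup(session, url, timeout=10):
    """Download one page and parse it, raising on HTTP errors."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return BeautifulSoup(response.text, "html.parser")


def fetch_all(urls, headers, delay_seconds=1.0):
    """Fetch each URL in order over one session, pausing between requests."""
    soups = []
    with requests.Session() as session:
        session.headers.update(headers)
        for url in urls:
            soups.append(fetch_soup(session, url))
            time.sleep(delay_seconds)  # assumed courtesy delay
    return soups


# In this notebook the call would look like:
# trail_htmls = fetch_all(trail_urls, user_agent)
```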

In [25]:
#Creating DataFrame

df = pd.DataFrame({'Name': [],
                   'Distance (miles)': [],
                   'Average Time (hours)':[],
                   'Difficulty':[],
                  })
#For loop to grab needed data from htmls created above.

for x in trail_htmls:
    #This grabs the metrics paragraph from the web page and splits it on ';' into a list.
    scraped_metrics = x.find_all('div',{'class':'mk-single-content clearfix'})[0].find('p').text.split(';')

    #This grabs the <title> of the web page in order to get the name of the trail.
    scraped_name = x.find_all('title')[0].get_text().split('|')[0].strip()

    #This adds the trail name to the beginning of the metrics list.
    scraped_metrics.insert(0, scraped_name)

    #This adds the list of information to the DataFrame as a new row.
    df.loc[len(df.index)] = scraped_metrics
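
An alternative to appending with `df.loc[len(df.index)]` is to collect plain dictionaries and build the DataFrame once at the end, checking on the way that each page really produced three metrics. This is only a sketch: `build_row` is a hypothetical helper, and the sample strings are my guess at what the scraped title and paragraph text look like.

```python
import pandas as pd

COLUMNS = ["Name", "Distance (miles)", "Average Time (hours)", "Difficulty"]


def build_row(title_text, metrics_text):
    """Turn one page's <title> text and metrics paragraph into a row dict."""
    name = title_text.split("|")[0].strip()
    metrics = metrics_text.split(";")
    if len(metrics) != 3:
        raise ValueError(f"expected 3 metrics for {name!r}, got {len(metrics)}")
    return dict(zip(COLUMNS, [name, *metrics]))


# Hypothetical inputs shaped like the scraped text; the exact wording is assumed.
samples = [
    ("Moenkopi Loop | Red Rock Canyon Las Vegas",
     "Distance: 2 miles; Average Time: 1.5 hours; Difficulty: EASY"),
    ("Turtlehead Peak | Red Rock Canyon Las Vegas",
     "Distance: 4.6 miles; Average Time: 3.5 hours; Difficulty: STRENUOUS"),
]

rows = [build_row(title, metrics) for title, metrics in samples]
df_sketch = pd.DataFrame(rows, columns=COLUMNS)
print(df_sketch)
```

The length check means a page with an unexpected layout raises an error that names the trail, rather than failing later during cleaning.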
     

In [26]:
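# Cleaning data.

# Keep only the numeric value from the distance column.
# Assigning to the column writes the result back to df. Changing the rows
# yielded by iterrows() is not guaranteed to do that.
df['Distance (miles)'] = df['Distance (miles)'].str.split(' ').str[1]
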


In [27]:
#Cleaning Average Time data. 

#Strip the label and the unit from each string so only the number is left.
#For a hyphenated range, keep the lower bound. The conversion to a numeric dtype happens later.

time_text = df['Average Time (hours)'].str.strip().str.split(':').str[1]
time_text = time_text.str.split('h').str[0].str.strip()
df['Average Time (hours)'] = time_text.str.split('-').str[0]


#Petroglyph Wall had its value as '30 minutes'.

df.loc[10,'Average Time (hours)'] = .5

#Ash Spring trail had its value as '1/2'.
df.loc[30,'Average Time (hours)'] = .5
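
The two manual fixes above depend on rows 10 and 30 staying where they are. A small parser could handle the '30 minutes' and '1/2' cases by value instead. This is a sketch under the assumption that those are the only odd formats; `to_hours` is a hypothetical helper and is not part of the original cleaning.

```python
import re
from fractions import Fraction


def to_hours(text):
    """Convert a cleaned time string to hours as a float."""
    text = text.strip().lower()
    minutes = re.fullmatch(r"(\d+(?:\.\d+)?)\s*min(?:ute)?s?", text)
    if minutes:
        return float(minutes.group(1)) / 60
    if "/" in text:
        return float(Fraction(text))  # Fraction parses strings such as "1/2"
    return float(text)


for raw in ["1.5", "30 minutes", "1/2"]:
    print(raw, "->", to_hours(raw))
```

A range such as '1.5 – 3.5' still raises a `ValueError` here, which seems preferable to quietly picking one end of the range.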


In [28]:
df

    Name                         Distance (miles)  Average Time (hours)  Difficulty
0   Moenkopi Loop                2                 1.5                   Difficulty: EASY
1   Calico Hills                 2-6               1.5 – 3.5             Difficulty: MODERATE
2   Calico Tanks                 2.2               2                     Difficulty: MODERATE – STRENUOUS
3   Turtlehead Peak              4.6               3.5                   Difficulty: STRENUOUS
4   Keystone Thrust              2.4               1.5                   Difficulty: MODERATE
5   White Rock – Willow Springs  4                 2.5                   Difficulty: MODERATE
6   Grand Circle Loop            11.4              6                     Difficulty: Strenuous
7   White Rock Mountain Loop     6.2               3.5                   Difficulty: STRENUOUS
8   Willow Spring Loop           1.1               1.25                  Difficulty: EASY
9   La Madre Spring              3.6               2                     Difficulty: MODERATE


In [29]:
df['Difficulty'].unique()

array([' Difficulty: EASY', '\xa0Difficulty: MODERATE',
       '\xa0Difficulty: MODERATE – STRENUOUS\xa0',
       '\xa0Difficulty: STRENUOUS', ' Difficulty: MODERATE',
       ' Difficulty: Strenuous', ' Difficulty:\xa0STRENUOUS',
       ' Difficulty: Easy', ' Difficulty: EASY-MODERATE',
       '\xa0Difficulty: EASY-MODERATE', '\xa0Difficulty: EASY',
       ' Difficulty: Moderate', ' Difficulty: STRENUOUS',
       ' Difficulty: Easy to Moderate'], dtype=object)

In [30]:
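#Drop the 'Difficulty:' label, strip whitespace (including \xa0) and lowercase the rating.
df['Difficulty'] = df['Difficulty'].str.split(':').str[1].str.strip().str.lower()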

In [31]:
df['Difficulty'].unique()

array(['easy', 'moderate', 'moderate – strenuous', 'strenuous',
       'easy-moderate', 'easy to moderate'], dtype=object)

In [32]:
#Map each rating to a numeric score. The 'moderate – strenuous' label is written
#with an en dash (–) on the website, so the key must use the same character.
difficulty_scores = {
    'easy': 1,
    'easy-moderate': 1.5,
    'easy to moderate': 1.5,
    'moderate': 2,
    'moderate – strenuous': 2.5,
    'strenuous': 3,
}

#Any label missing from the dictionary becomes NaN.
df['Difficulty'] = df['Difficulty'].map(difficulty_scores)
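
The raw labels differ in case, in non-breaking spaces, and in whether they join two ratings with a hyphen, an en dash, or the word "to". One way to avoid special cases is to normalize all of that before scoring. The sketch below uses raw labels taken from the `unique()` output above; `normalize_difficulty` and `DIFFICULTY_SCORES` are hypothetical names, and the scores are the same ones used in this notebook.

```python
import re

DIFFICULTY_SCORES = {
    "easy": 1.0,
    "easy-moderate": 1.5,
    "moderate": 2.0,
    "moderate-strenuous": 2.5,
    "strenuous": 3.0,
}


def normalize_difficulty(raw):
    """Reduce a scraped 'Difficulty: ...' label to a canonical lowercase key."""
    # str.strip() also removes non-breaking spaces.
    label = raw.split(":", 1)[1].strip().lower()
    # Treat an en dash, a hyphen, or the word "to" as the same separator.
    return re.sub(r"\s*(?:–|-|\bto\b)\s*", "-", label)


raw_labels = [
    " Difficulty: EASY",
    "\xa0Difficulty: MODERATE – STRENUOUS\xa0",
    " Difficulty:\xa0STRENUOUS",
    " Difficulty: Easy to Moderate",
    "\xa0Difficulty: EASY-MODERATE",
]

for raw in raw_labels:
    key = normalize_difficulty(raw)
    print(repr(key), DIFFICULTY_SCORES[key])
```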

In [33]:
#Calico Hills had its average time listed as '1.5 – 3.5'. According to its description on the website, this is
#because there are two different "versions" of the trail. Its distance is also listed as '2-6'. However, it is
#only given one difficulty rating. Since I'm trying to find the correlation between distance, average time and
#difficulty, it becomes a guessing game which of these pairs, average time 1.5 and distance 2, or average time
#3.5 and distance 6, is the one considered moderate. After some thought I decided to simply remove the row
#rather than guess. We already have a small number of rows, but one wrong value distorts a small data set
#far more than it would distort a large one.

df.drop(1, inplace = True)
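
Dropping by index label works as long as Calico Hills stays at position 1. Filtering on the trail name is a little more robust if the site ever reorders its list. This is a sketch on a tiny stand-in frame; `trails` and `trails_clean` are hypothetical names.

```python
import pandas as pd

# Three-row stand-in for the scraped data.
trails = pd.DataFrame({
    "Name": ["Moenkopi Loop", "Calico Hills", "Calico Tanks"],
    "Distance (miles)": ["2", "2-6", "2.2"],
})

# Select by trail name so the result does not depend on row order.
is_calico_hills = trails["Name"] == "Calico Hills"
trails_clean = trails[~is_calico_hills].copy()

print(trails_clean)
```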

In [41]:
#Lastly we make sure all of our string values are turned to numerics. 


df['Average Time (hours)'] = pd.to_numeric(df['Average Time (hours)'])
df['Distance (miles)'] = pd.to_numeric(df['Distance (miles)'])
df['Difficulty'] = pd.to_numeric(df['Difficulty'])
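
By default `pd.to_numeric` raises an error on the first string it cannot parse. Passing `errors="coerce"` turns those values into NaN instead, which makes it easy to list every row that still needs cleaning. This is a small sketch with a made-up Series; `times` and `leftovers` are hypothetical names.

```python
import pandas as pd

# Stand-in column with one value that is not a plain number.
times = pd.Series(["1.5", "1.5 – 3.5", "2"], name="Average Time (hours)")

converted = pd.to_numeric(times, errors="coerce")  # unparseable strings become NaN
leftovers = times[converted.isna()]

print(leftovers)
```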


In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 30
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Name                  30 non-null     object 
 1   Distance (miles)      30 non-null     float64
 2   Average Time (hours)  30 non-null     float64
 3   Difficulty            30 non-null     float64
dtypes: float64(3), object(1)
memory usage: 1.2+ KB


In [60]:
df.describe()

       Distance (miles)  Average Time (hours)  Difficulty
count  30.0              30.0                  30.0
mean   3.603333          2.241667              1.933333
std    3.52904           1.4301                0.727932
min    0.2               0.5                   1.0
25%    1.625             1.3125                1.5
50%    2.4               2.0                   2.0
75%    4.0               2.5                   2.375
max    15.8              6.0                   3.0


In [62]:
df.corr()

                      Distance (miles)  Average Time (hours)  Difficulty
Distance (miles)      1.0               0.944939              0.683999
Average Time (hours)  0.944939          1.0                   0.777865
Difficulty            0.683999          0.777865              1.0
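
`df.corr()` uses the Pearson coefficient by default, which is the covariance of two columns divided by the product of their standard deviations. The sketch below reproduces that by hand on a few illustrative values and also shows the Spearman variant, which ranks the values first and may suit an ordinal score like Difficulty better. The `sample` frame is a hypothetical subset and is not the full data set.

```python
import pandas as pd

# Illustrative values, used only to show the calculation.
sample = pd.DataFrame({
    "Distance (miles)": [2.0, 2.2, 4.6, 11.4],
    "Difficulty": [1.0, 2.5, 3.0, 3.0],
})

x = sample["Distance (miles)"]
y = sample["Difficulty"]

# Pearson r = cov(x, y) / (std_x * std_y); pandas uses ddof=1 for both.
pearson_by_hand = x.cov(y) / (x.std() * y.std())
pearson = sample.corr().loc["Distance (miles)", "Difficulty"]

# Spearman computes the same statistic on the ranks of the values.
spearman = sample.corr(method="spearman").loc["Distance (miles)", "Difficulty"]

print(pearson_by_hand, pearson, spearman)
```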


In [64]:
px.histogram(df, x = "Distance (miles)")

In [65]:
px.histogram(df, x = "Average Time (hours)")

In [66]:
px.histogram(df , x = "Difficulty")

In [72]:
px.scatter(df, x = "Distance (miles)", y = "Average Time (hours)", 
           color = "Difficulty", color_continuous_scale = "reds", template = "plotly_dark")