# Hiking Trail Time Predictor

Goal: Use a database of hiking trails with various parameters to predict the average walking time. 

## Table of contents
1. [Introduction](#intro)
2. [Dataset](#dataset)
3. [Loading modules and data](#loading)
4. [Data pre-processing](#pre-process)


# Introduction <a name="intro"></a>

# Dataset <a name="dataset"></a>

# Loading modules and data <a name="loading"></a>

In [1]:
import sys
import warnings
from urllib.parse import urlparse
import numpy as np
import pandas as pd

# plotting
import seaborn as sns
import matplotlib.pyplot as plt

# model
from sklearn.model_selection import train_test_split

# mlflow
import mlflow
import mlflow.sklearn




In [2]:
df=pd.read_csv("gpx-tracks-from-hikr.org.csv")

In [3]:
df.head(3)

| | _id | length_3d | user | start_time | max_elevation | bounds | uphill | moving_time | end_time | max_speed | gpx | difficulty | min_elevation | url | downhill | name | length_2d |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5afb229e8f80884aaad9c6ea | 10832.953016 | Bergfritz | 2018-05-11 07:37:40 | 1934.47 | `{'min': {'type': 'Point', 'coordinates': [13.2...` | 612.88 | 12155.0 | 2018-05-11 11:38:23 | 1.595493 | `<?xml version="1.0" encoding="UTF-8"?>\n<gpx x...` | T2 - Mountain hike | 1322.96 | http://www.hikr.org/tour/post131855.html | 609.67 | Remsteinkopf, 1945 m | 10832.953016 |
| 1 | 5afb229e8f80884aaad9c6eb | 12259.376315 | Bergfritz | 2018-05-12 07:25:08 | 2186.21 | `{'min': {'type': 'Point', 'coordinates': [13.1...` | 614.753 | 13876.0 | 2018-05-12 12:08:28 | 1.39432 | `<?xml version="1.0" encoding="UTF-8"?>\n<gpx x...` | T3 - Difficult Mountain hike | 1266.4 | http://www.hikr.org/tour/post131856.html | 1193.733 | Schuhflicker, 2214 m | 12259.376315 |
| 2 | 5afb229e8f80884aaad9c6ec | 22980.168081 | igor | 2018-05-11 06:29:38 | 2265.0 | `{'min': {'type': 'Point', 'coordinates': [8.99...` | 2255.976 | 28971.0 | 2018-05-11 15:32:43 | 1.503002 | `<?xml version="1.0" encoding="UTF-8"?>\n<gpx x...` | T3 - Difficult Mountain hike | 176.54 | http://www.hikr.org/tour/post131839.html | 2177.626 | Cima d'erbea Est quota 2164m e Gaggio 2267m | 22980.168081 |


In [4]:
df.columns

Index(['_id', 'length_3d', 'user', 'start_time', 'max_elevation', 'bounds',
       'uphill', 'moving_time', 'end_time', 'max_speed', 'gpx', 'difficulty',
       'min_elevation', 'url', 'downhill', 'name', 'length_2d'],
      dtype='object')

In [5]:
df.shape

(12141, 17)

# Data pre-processing <a name="pre-process"></a>

We can immediately drop some columns that won't be useful for the prediction: `_id`, `bounds`, `gpx`, `name`, `start_time`, `end_time`, and `url`.

In [6]:
df=df.drop(['_id','bounds','gpx','name','start_time','end_time','url'],axis=1)

In [7]:
df.shape

(12141, 10)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12141 entries, 0 to 12140
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   length_3d      12141 non-null  float64
 1   user           12141 non-null  object 
 2   max_elevation  10563 non-null  float64
 3   uphill         12141 non-null  float64
 4   moving_time    12141 non-null  float64
 5   max_speed      12141 non-null  float64
 6   difficulty     12141 non-null  object 
 7   min_elevation  10563 non-null  float64
 8   downhill       12141 non-null  float64
 9   length_2d      12141 non-null  float64
dtypes: float64(8), object(2)
memory usage: 948.6+ KB


Let's convert the difficulty into a number for easier use. First we can investigate the current difficulty ratings.

In [13]:
np.unique(df.difficulty)

array(['T1 - Valley hike', 'T2 - Mountain hike',
       'T3 - Difficult Mountain hike', 'T3+ - Difficult Mountain hike',
       'T4 - High-level Alpine hike', 'T4+ - High-level Alpine hike',
       'T4- - High-level Alpine hike',
       'T5 - Challenging High-level Alpine hike',
       'T5+ - Challenging High-level Alpine hike',
       'T5- - Challenging High-level Alpine hike',
       'T6 - Difficult High-level Alpine hike',
       'T6+ - Difficult High-level Alpine hike',
       'T6- - Difficult High-level Alpine hike'], dtype=object)

We can use the number after the 'T' as the numeric difficulty, but we also need to account for the '+' and '-' modifiers in the ratings. To keep the relative ordering (for example, T4- is easier than T4, which is easier than T4+), we convert each difficulty string into a float: the grade plus 0.2 for '-', 0.5 for no modifier, and 0.8 for '+'.

In [14]:
def dif_to_num(rating):
    """Convert a difficulty string such as 'T3+ - ...' to a float."""
    grade = int(rating[1])    # single digit after the leading 'T'
    modifier = rating[2]      # '+', '-', or a space when there is no modifier

    if modifier == "+":       # e.g. T3+ -> 3.8
        return grade + 0.8
    elif modifier == "-":     # e.g. T3- -> 3.2
        return grade + 0.2
    else:                     # e.g. T3 -> 3.5
        return grade + 0.5

df['dif_num'] = df['difficulty'].map(dif_to_num)
print(f"New difficulty values are: {np.unique(df.dif_num)}")

New difficulty values are: [1.5 2.5 3.5 3.8 4.2 4.5 4.8 5.2 5.5 5.8 6.2 6.5 6.8]
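
The positional indexing above relies on every rating starting with 'T' followed by a single-digit grade. As an alternative sketch (not the notebook's own method), the same mapping can be written with a regular expression that raises an error on unexpected strings. The name `parse_difficulty` and the `OFFSETS` table are illustrative assumptions that mirror the offsets used above.

```python
import re

# Offsets mirror the mapping above: '-' -> +0.2, no modifier -> +0.5, '+' -> +0.8.
OFFSETS = {"-": 0.2, "": 0.5, "+": 0.8}

# 'T', one digit for the grade, then an optional '+' or '-' modifier.
RATING_PATTERN = re.compile(r"T(\d)([+-]?)")


def parse_difficulty(label: str) -> float:
    """Hypothetical helper: convert e.g. 'T3+ - Difficult Mountain hike' to a float."""
    match = RATING_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unrecognised difficulty rating: {label!r}")
    grade, modifier = match.groups()
    return int(grade) + OFFSETS[modifier]


examples = [
    "T1 - Valley hike",
    "T3+ - Difficult Mountain hike",
    "T4- - High-level Alpine hike",
]
for label in examples:
    print(label, "->", parse_difficulty(label))
```

Failing loudly on an unknown rating is a design choice: it surfaces new or malformed categories instead of silently assigning them the "no modifier" offset.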


Now let's explore the properties of the numeric parameters:

In [12]:
df.describe()

| | length_3d | max_elevation | uphill | moving_time | max_speed | min_elevation | downhill | length_2d | dif_num |
|---|---|---|---|---|---|---|---|---|---|
| count | 12141.0 | 10563.0 | 12141.0 | 12141.0 | 12141.0 | 10563.0 | 12141.0 | 12141.0 | 12141.0 |
| mean | 18747.71 | 1934.281708 | 942.184362 | 12848.445268 | 1.746356 | 1003.33115 | 879.145539 | 18747.71 | 3.392669 |
| std | 409309.8 | 784.968353 | 1065.498993 | 11599.792248 | 5.394065 | 813.001041 | 1028.618856 | 409309.8 | 1.162604 |
| min | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | -32768.0 | 0.0 | 0.0 | 1.5 |
| 25% | 8254.129 | 1382.275 | 420.142 | 5260.0 | 1.078841 | 560.02 | 256.519 | 8254.129 | 2.5 |
| 50% | 12005.77 | 1986.7 | 882.0 | 12990.0 | 1.36702 | 960.09 | 823.199002 | 12005.77 | 3.5 |
| 75% | 16458.13 | 2498.455848 | 1301.005 | 18514.0 | 1.604181 | 1389.485 | 1266.923 | 16458.13 | 3.8 |
| max | 31891800.0 | 5633.462891 | 35398.006781 | 189380.0 | 192.768748 | 4180.0 | 52379.2 | 31891800.0 | 6.8 |


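There are some suspicious values here. For example, the minimum `length_3d` and `moving_time` are both 0, the minimum `max_elevation` is -1, and the minimum `min_elevation` is -32768 (the lowest value a signed 16-bit integer can hold, which suggests a placeholder rather than a real measurement). At the other extreme, the maximum length of 31,891,800 metres is far beyond any plausible hike.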
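
One way to quantify values like these is to express each suspicion as a boolean rule and count how many rows trip it. The sketch below runs on a small invented DataFrame rather than the real data. The rule names and the `MAX_LENGTH_M` threshold are assumptions chosen for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real data; all values are invented for illustration.
toy = pd.DataFrame({
    "length_3d": [10832.95, 0.0, 12259.38, 31891800.0],
    "moving_time": [12155.0, 0.0, 13876.0, 5000.0],
    "max_elevation": [1934.47, -1.0, 2186.21, np.nan],
    "min_elevation": [1322.96, -32768.0, 1266.40, np.nan],
})

# Assumed upper bound for a single hike, in metres (hypothetical threshold).
MAX_LENGTH_M = 100_000

# One boolean column per plausibility rule.
# Comparisons with NaN evaluate to False, so rows with missing elevations
# are not flagged by the elevation rule.
checks = pd.DataFrame({
    "zero_length": toy["length_3d"] <= 0,
    "zero_time": toy["moving_time"] <= 0,
    "too_long": toy["length_3d"] > MAX_LENGTH_M,
    "negative_elevation": (toy["max_elevation"] < 0) | (toy["min_elevation"] < 0),
})

# Number of rows that trip each rule.
print(checks.sum())

# Keep only rows that pass every rule.
clean = toy[~checks.any(axis=1)]
print(f"{len(clean)} of {len(toy)} rows pass all checks")
```

Keeping the rules in a separate `checks` frame makes it easy to inspect how many rows each rule affects before deciding whether to drop or repair them. Note that a negative elevation is not always an error, since some places lie below sea level, so that rule in particular may need adjusting.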