# <img src="./resources/GA.png" width="25" height="25" /> <span style="color:Blue">DSI Capstone:  MTB Trail Recommender Engine</span> 
---
## <span style="color:Green">Preprocessing</span>      

#### Ryan McDonald - General Assembly

---

### Notebook Contents:

- [Reading the User Data](#intro)    
    - [Arizona User Data Cleaning](#cleanaz)
    - [Utah User Data Cleaning](#cleanut) 
- [Reading the Trail Data](#trail)
    - [Arizona Trail Data Cleaning](#trailaz)
        - [Arizona Imputation/OHE](#imputeaz)
    - [Utah Trail Data Cleaning](#trailut) 
        - [Utah Imputation/OHE](#imputeut)
- [Export to CSV- Arizona Trails](#saveaz)
- [Export to CSV- Utah Trails](#saveut)

**Imports**

In [46]:
# basic imports
import numpy as np
import pandas as pd
import sys

# sparse matrices, pairwise distances, feature scaling
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Spatial distance module
import geopandas as gpd
from shapely.geometry import Point
from shapely.ops import nearest_points

ModuleNotFoundError: No module named 'geopandas'

<a id='intro'></a>
## 1. Content-Based Recommender
## Read Data - Arizona Trail Data

In [12]:
# reading in the scaled, one_hot_encoded dataset for the recommender system
az_trails = pd.read_csv('./data/recommender_data/az_trail_data.csv')
az_trails = az_trails.set_index('trail_name')
az_trails.head()

trail_name,length,longitude,latitude,popularity,rating,tot_climb,tot_descent,ave_grade,max_grade,max_elevation,...,difficulty_intermediate,difficulty_intermediate/difficult,difficulty_very difficult,dog_policy_leashed,dog_policy_no dogs,dog_policy_off-leash,dog_policy_unknown,e_bike_policy_allowed,e_bike_policy_not allowed,e_bike_policy_unknown
Hiline Trail,0.022399,0.619678,0.507727,1.0,0.94,0.022963,0.057739,0.315789,0.357143,0.429345,...,0,0,1,0,0,0,1,0,0,1
Slim Shady Trail,0.018786,0.6171,0.508725,0.998953,0.88,0.018666,0.021932,0.210526,0.112245,0.412061,...,0,1,0,0,0,0,1,0,0,1
Mescal,0.017341,0.637923,0.498336,0.997906,0.92,0.01451,0.013791,0.157895,0.112245,0.435423,...,0,1,0,0,0,0,1,0,0,1
Chuckwagon,0.039017,0.637893,0.498445,0.996859,0.9,0.039375,0.040625,0.210526,0.132653,0.432479,...,1,0,0,0,0,0,1,0,0,1
Tortolita Preserve Loop,0.070087,0.197366,0.626201,0.995812,0.84,0.036627,0.0432,0.105263,0.040816,0.254416,...,1,0,0,0,0,0,1,0,0,1


In [13]:
az_trails.shape, az_trails.isnull().sum().sort_values(ascending = False).head()

((956, 24),
 longitude                    11
 latitude                     11
 e_bike_policy_unknown         0
 e_bike_policy_not allowed     0
 popularity                    0
 dtype: int64)

#### Creating a Sparse Matrix

In [14]:
az_sparse = sparse.csr_matrix(az_trails.fillna(0))


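# verifying that the sparse matrix and az_trails have the same shape
# caution: sys.getsizeof counts only the sparse matrix's Python wrapper object, not its
# data/indices/indptr arrays, so the two byte counts below are not a like-for-like comparison;
# with roughly 57% of entries stored as non-zero, CSR saves little real memory here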
az_sparse, az_trails.shape, sys.getsizeof(az_sparse), sys.getsizeof(az_trails)

(<956x24 sparse matrix of type '<class 'numpy.float64'>'
 	with 13146 stored elements in Compressed Sparse Row format>,
 (956, 24),
 48,
 254197)
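
Because `sys.getsizeof` sees only the outer Python object, a fairer comparison sums the bytes of the three arrays that back a CSR matrix. The sketch below uses a random stand-in matrix rather than `az_trails`, and `csr_nbytes` is a hypothetical helper name, so treat it as an illustration of the measurement rather than a result for the real data.

```python
import numpy as np
from scipy import sparse

def csr_nbytes(m):
    """Bytes held by the arrays backing a CSR/CSC matrix (hypothetical helper)."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

# stand-in data: a 956 x 24 matrix with roughly 57% non-zero entries
# (an assumption chosen to resemble the density reported above, not the trail data)
rng = np.random.default_rng(0)
dense = rng.random((956, 24)) * (rng.random((956, 24)) < 0.57)
csr = sparse.csr_matrix(dense)

print("dense bytes:", dense.nbytes)
print("csr bytes:  ", csr_nbytes(csr))
```

For a pandas DataFrame, `df.memory_usage(deep=True).sum()` gives a comparable total that includes the index.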

In [15]:
# calculating pairwise distances and building into a dataframe
az_rec = pairwise_distances(az_sparse, metric = 'cosine')
az_rec = pd.DataFrame(az_rec, index = az_trails.index, columns = az_trails.index)
az_rec

trail_name,Hiline Trail,Slim Shady Trail,Mescal,Chuckwagon,Tortolita Preserve Loop,Lone Cactus Loop,Apache Wash Loop,Desperado Loop,North Loop,Bug Springs,...,Monument Trail,Spine Trail,Spine Trail to Ridge Trail Connector,Far West Trail,Alamo Springs Spur Trail,Trail C,Trail G,Trail H,Trail D,Kain Trail
Hiline Trail,0.000000,0.174012,0.173635,0.171861,0.210020,0.209322,0.385569,0.382047,0.373010,0.185404,...,0.619941,0.422191,0.411123,0.432790,0.404184,0.410994,0.400494,0.406358,0.408603,0.411950
Slim Shady Trail,0.174012,0.000000,0.000515,0.170912,0.204185,0.205499,0.382805,0.382360,0.374641,0.199690,...,0.615020,0.416384,0.417889,0.420689,0.421725,0.408406,0.406637,0.185855,0.407863,0.414595
Mescal,0.173635,0.000515,0.000000,0.169433,0.205173,0.206346,0.381109,0.381116,0.373362,0.201811,...,0.616264,0.422664,0.427050,0.424741,0.432608,0.413944,0.413776,0.194056,0.413974,0.422610
Chuckwagon,0.171861,0.170912,0.169433,0.000000,0.026550,0.029780,0.382711,0.382585,0.373995,0.198061,...,0.614656,0.419340,0.204844,0.423613,0.225401,0.189014,0.407759,0.409140,0.189339,0.416799
Tortolita Preserve Loop,0.210020,0.204185,0.205173,0.026550,0.000000,0.001065,0.395093,0.375853,0.370878,0.209881,...,0.645532,0.416363,0.197769,0.409590,0.230118,0.192250,0.430785,0.428032,0.194174,0.414963
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Trail C,0.410994,0.408406,0.413944,0.189014,0.192250,0.197469,0.656729,0.668259,0.660269,0.425583,...,0.576559,0.296292,0.015321,0.311230,0.060976,0.000000,0.288635,0.288285,0.000494,0.303732
Trail G,0.400494,0.406637,0.413776,0.407759,0.430785,0.432210,0.426115,0.440730,0.434386,0.407729,...,0.576701,0.012339,0.282841,0.319346,0.289132,0.288635,0.000000,0.283847,0.285194,0.292341
Trail H,0.406358,0.185855,0.194056,0.409140,0.428032,0.430074,0.657826,0.667859,0.658605,0.418125,...,0.576370,0.295993,0.290961,0.313520,0.306951,0.288285,0.283847,0.000000,0.286580,0.298793
Trail D,0.408603,0.407863,0.413974,0.189339,0.194174,0.199227,0.657807,0.668152,0.659034,0.420332,...,0.575970,0.295465,0.011559,0.312430,0.052484,0.000494,0.285194,0.286580,0.000000,0.299151


Lower values mean greater similarity between two trails: **'0'** means the trails are identical (as when a trail is compared with itself), and **'1'** means they are not similar at all.
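
As a quick check on that scale, cosine distance is 1 minus the cosine of the angle between two feature vectors. The sketch below uses made-up vectors (not rows from `az_trails`), and `cosine_distance` is a hypothetical helper written out by hand to show the formula.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumes neither vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

same_direction = cosine_distance([1, 2, 0], [2, 4, 0])  # parallel vectors
no_overlap = cosine_distance([1, 0, 0], [0, 3, 0])      # orthogonal vectors
print(same_direction, no_overlap)
```

Parallel vectors give a distance of (numerically) zero and orthogonal vectors give one. The trail features shown above appear to be non-negative, in which case the dot product cannot be negative and distances stay between 0 and 1.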

In [16]:
# Which 10 trails are most similar to Hangover Trail?

az_rec['Hangover Trail'].sort_values().head(11)[1:]

trail_name
Hiline Trail                       0.000617
Kellog/Incinerator Ridge           0.037165
Tabletop                           0.041521
Western Loop Trail                 0.054197
Green Mountain                     0.066301
Baby Jesus Trail                   0.087726
Hog Heaven                         0.164656
Sunset                             0.165837
Little Yeager Canyon Trail #533    0.165987
Cathedral Rock Connector Trail     0.166559
Name: Hangover Trail, dtype: float64

'Hiline Trail' is by far the closest match to 'Hangover Trail' (distance 0.000617). The next five trails all fall below 0.09, and the remaining four sit between 0.16 and 0.17.
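
Slicing off the first sorted entry assumes the query trail itself always sorts first, which may not hold when another trail ties with it at distance 0. A slightly more defensive sketch drops the query by label before ranking. `closest` is a hypothetical helper, and the small distance table is invented for illustration.

```python
import pandas as pd

def closest(dist_df, name, n=10):
    """Return the n smallest distances to `name`, excluding `name` itself."""
    return dist_df[name].drop(labels=name).nsmallest(n)

# toy symmetric distance table (made-up values); "A" and "B" tie at distance 0
names = ["A", "B", "C", "D"]
toy = pd.DataFrame(
    [[0.0, 0.0, 0.3, 0.9],
     [0.0, 0.0, 0.3, 0.9],
     [0.3, 0.3, 0.0, 0.5],
     [0.9, 0.9, 0.5, 0.0]],
    index=names, columns=names,
)

print(closest(toy, "A", n=2))
```

With the real data, the equivalent call would be something like `closest(az_rec, 'Hangover Trail')`.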

In [17]:
az_trails.head(1)

trail_name,length,longitude,latitude,popularity,rating,tot_climb,tot_descent,ave_grade,max_grade,max_elevation,...,difficulty_intermediate,difficulty_intermediate/difficult,difficulty_very difficult,dog_policy_leashed,dog_policy_no dogs,dog_policy_off-leash,dog_policy_unknown,e_bike_policy_allowed,e_bike_policy_not allowed,e_bike_policy_unknown
Hiline Trail,0.022399,0.619678,0.507727,1.0,0.94,0.022963,0.057739,0.315789,0.357143,0.429345,...,0,0,1,0,0,0,1,0,0,1


In [18]:
# Searching for trails by name (plain substring match, not a regex):

search = "Hiline"
trails = az_trails[az_trails.index.str.contains(search, regex=False)].index
for trail in trails:
    print(trail)
    print("Popularity: ", az_trails.loc[trail, 'popularity'])
    # counts the trail's non-null feature columns; this table holds no per-user ratings
    print("Non-null Features: ", az_trails.T[trail].count())
    print("")
    print("10 Closest Trails")
    # skips the first entry, which is normally the trail itself at distance 0
    print(az_rec[trail].sort_values()[1:11])
    print("")
    print("*"*35)
    print("")

Hiline Trail
Popularity:  1.0
Non-null Features:  24

10 Closest Trails
trail_name
Hangover Trail                    0.000617
Kellog/Incinerator Ridge          0.042147
Tabletop                          0.042998
Western Loop Trail                0.051682
Green Mountain                    0.070407
Baby Jesus Trail                  0.087600
Hog Heaven                        0.166381
Cathedral Rock Connector Trail    0.167931
High on the Hog                   0.169619
Broken Arrow Trail                0.169760
Name: Hiline Trail, dtype: float64

***********************************



### All trails within a defined radius

18.755876755653013
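
One way to find trails within a radius is the haversine great-circle distance. The sketch below is an assumption about the approach, not the code that produced the value above. It needs unscaled latitude and longitude in decimal degrees (the `latitude` and `longitude` columns in `az_trails` are scaled, so they cannot be used directly), and the function names, column names, coordinates, and trail labels are all hypothetical.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; use 6371.0 for kilometres

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

def trails_within(df, lat, lon, radius_miles):
    """Rows of df (with raw 'lat'/'lon' columns) within radius_miles of (lat, lon)."""
    dist = haversine_miles(lat, lon, df["lat"].to_numpy(), df["lon"].to_numpy())
    out = df.assign(distance_miles=dist)
    return out[out["distance_miles"] <= radius_miles]

# made-up coordinates for illustration only
toy = pd.DataFrame(
    {"lat": [34.87, 34.90, 32.25], "lon": [-111.76, -111.80, -110.90]},
    index=["near_a", "near_b", "far_c"],
)
print(trails_within(toy, 34.87, -111.76, radius_miles=20))
```

The haversine formula treats the Earth as a sphere, which is usually accurate enough for a "trails near me" filter.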


<a id='intro'></a>
## 2. User-Based (Binary) Recommender
## Read Data - Arizona User Data

In [8]:
# reading in the cleaned, sorted user dataset for the recommender system
az_users = pd.read_csv('./data/all_arizona_users.csv')

# add the binary rating column (users that rated the trail)
# '1' = user rated, '0' = user not rated
az_users['binary_rate'] = 1
az_users.head()

,user_name,trail_name,binary_rate
0,Maxx Byerly,Hiline Trail,1
1,Cameron McFarland,Hiline Trail,1
2,Ascanio Pignatelli,Hiline Trail,1
3,Sabrina Katharina,Hiline Trail,1
4,Clayton Burtsfield,Hiline Trail,1


In [9]:
az_users.shape, az_users.isnull().sum().sort_values(ascending = False).head()

((5192, 3),
 binary_rate    0
 trail_name     0
 user_name      0
 dtype: int64)

#### Transform to Pivot Table

In [10]:
# users as the index, trail names on x-axis, ratings as values.
# will show NaN for trails not rated, and '1' where trail was rated
az_pivot = az_users.pivot_table(index='user_name', columns= 'trail_name', values = 'binary_rate')
az_pivot.head()

user_name,#297 Smith Ravine Trail,#305 Homestead Trail,#307 Groom Creek Loop Trail (Spruce Mtn),#322 Circle Connection Trail,#9415 Wolverton Mountain Trail,104th St. Trail,10K Route (Blue),136th St. Express,136th Street Spur,1918,...,Windmill Trail,Window Rock Loop,Woods Canyon Trail,Wren Arena Red Loop Trail,Yeager Cabin Trail #111,Yeager Canyon Trail #28,Yeti Crossing,Yetman Trail,Yuma Bike Path (Colorado River Levee Multi-use Path),Zygomatic
A H,,,,,,,,,,,...,,,,,,,,,,
AJ Wanta,,,,,,,,,,,...,,,,,,,,,,
Aaron Cholewa,,,,,,,,,,,...,,,,,,,,,,
Aaron Davies,,,,,,,,,,,...,,,,,,,,,,
Aaron Frank,,,,,,,,,,,...,,,,,,,,,,
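
To make the reshaping concrete, here is a small sketch on invented ratings (the user and trail labels are placeholders, not values from `az_users`). Each row of the long table becomes a `1` in the user-by-trail grid, and every unrated combination is left as `NaN` until it is filled with 0.

```python
import pandas as pd

# long-format toy data: one row per (user, trail) rating event
toy = pd.DataFrame({
    "user_name": ["u1", "u1", "u2", "u3"],
    "trail_name": ["t1", "t2", "t1", "t3"],
})
toy["binary_rate"] = 1

# wide format: users as rows, trails as columns, NaN where no rating exists
toy_pivot = toy.pivot_table(index="user_name", columns="trail_name", values="binary_rate")

# replace NaN with 0 so the grid can feed a distance computation
toy_binary = toy_pivot.fillna(0)
print(toy_binary)
```

By default `pivot_table` aggregates duplicate (user, trail) pairs with the mean, which stays 1 for a binary column. Duplicates collapsing this way may be why the table's row count and the number of stored elements in the sparse matrix can differ.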


#### Creating a Sparse Matrix

In [11]:
sparse_users = sparse.csc_matrix(az_pivot.fillna(0))
# verifying shapes of pivot and sparse are the same
sparse_users, az_pivot.shape

(<1768x873 sparse matrix of type '<class 'numpy.float64'>'
 	with 5180 stored elements in Compressed Sparse Column format>,
 (1768, 873))

In [12]:
# calculating pairwise distances and building into a dataframe
# both axes are labeled by 'user_name'
az_user_rec = pairwise_distances(sparse_users, metric = 'cosine')
az_user_rec  = pd.DataFrame(az_user_rec , index = az_pivot.index, columns = az_pivot.index)
az_user_rec 

user_name,A H,AJ Wanta,Aaron Cholewa,Aaron Davies,Aaron Frank,Aaron Hickson,Aaron Johnson,Aaron Lovato,Abe Ferraro,Abe Gold,...,sal serrano,sam schwann,skelldify,stuart schwartz,theiner Heiner,trevjens,victor thompson,yannick,Þorvarður Hálfdanarson,❤️
A H,0.0,1.0,1.000000,1.0,1.000000,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
AJ Wanta,1.0,0.0,1.000000,1.0,1.000000,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Aaron Cholewa,1.0,1.0,0.000000,1.0,0.666667,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Aaron Davies,1.0,1.0,1.000000,0.0,1.000000,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Aaron Frank,1.0,1.0,0.666667,1.0,0.000000,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
trevjens,1.0,1.0,1.000000,1.0,1.000000,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
victor thompson,1.0,1.0,1.000000,1.0,1.000000,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0
yannick,1.0,1.0,1.000000,1.0,1.000000,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0
Þorvarður Hálfdanarson,1.0,1.0,1.000000,1.0,1.000000,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0


Lower values mean greater similarity between two users: **'0'** means the users are identical (as when a user is compared with themselves), and **'1'** means they are not similar at all.

In [13]:
# Which 10 users are most similar to A H?

az_user_rec['A H'].sort_values().head(11)[1:]

user_name
Soloman Picoult          0.000000
Josh Richart             0.000000
Brian Derrick            0.292893
Brandon Sudeith          0.422650
Bob Spak                 0.500000
Mark Smith               0.552786
Nikki McIntyre           0.666667
Pablo Cortez             0.750000
Happy Cycling            0.781782
Michael Bartholomeusz    1.000000
Name: A H, dtype: float64

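'Soloman Picoult' and 'Josh Richart' are at distance 0 from 'A H', which for binary data means they rated exactly the same set of trails as 'A H'. Two other users are also close (distance below 0.5), after which users quickly become dissimilar.
These distances reflect only which trails users have in common. 'A H' has a single rating (see the search output below), so this alone is not enough to conclude that these users ride together or that 'A H' is a strong rider.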

In [23]:
az_pivot.head(1)

user_name,#297 Smith Ravine Trail,#305 Homestead Trail,#307 Groom Creek Loop Trail (Spruce Mtn),#322 Circle Connection Trail,#9415 Wolverton Mountain Trail,104th St. Trail,10K Route (Blue),136th St. Express,136th Street Spur,1918,...,Windmill Trail,Window Rock Loop,Woods Canyon Trail,Wren Arena Red Loop Trail,Yeager Cabin Trail #111,Yeager Canyon Trail #28,Yeti Crossing,Yetman Trail,Yuma Bike Path (Colorado River Levee Multi-use Path),Zygomatic
A H,,,,,,,,,,,...,,,,,,,,,,


In [46]:
# Creating a user search term:

search = "A H"
users = az_pivot[az_pivot.index.str.contains(search)].index
for user in users:
    print(user)
    print("Average Rating: ", az_pivot.loc[user, :].mean())
    print("Number of Ratings: ", az_pivot.T[user].count())
    print("")
    print("10 Closest Users")
    print(az_user_rec[user].sort_values()[1:11])
    print("")
    print("*"*35)
    print("")

A H
Average Rating:  1.0
Number of Ratings:  1

10 Closest Users
user_name
Soloman Picoult          0.000000
Josh Richart             0.000000
Brian Derrick            0.292893
Brandon Sudeith          0.422650
Bob Spak                 0.500000
Mark Smith               0.552786
Nikki McIntyre           0.666667
Pablo Cortez             0.750000
Happy Cycling            0.781782
Michael Bartholomeusz    1.000000
Name: A H, dtype: float64

***********************************
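
With binary vectors, the distances listed for 'A H' follow a simple pattern. If one user rated a single trail and another rated that same trail plus others, for `n` trails in total, the cosine distance works out to `1 - 1/sqrt(n)`. The check below uses made-up vectors, not rows of `az_pivot`, and `cosine_distance` is a hypothetical helper. Reading the listed values this way (for example 0.292893 as `n = 2`, or 0.5 as `n = 4`) is an inference from the formula, not something the data confirms directly.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumes neither vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

n_trails = 25
single = np.zeros(n_trails)
single[0] = 1  # a user who rated only trail 0

for n in (1, 2, 3, 4):
    other = np.zeros(n_trails)
    other[:n] = 1  # a user who rated trail 0 plus n - 1 more trails
    print(n, round(cosine_distance(single, other), 6), round(1 - 1 / np.sqrt(n), 6))
```

The two printed columns agree for each `n`, which is why a user with one rating sees neighbours at a handful of fixed distances rather than a smooth range.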

