# <img src="./resources/GA.png" width="25" height="25" /> <span style="color:Blue">DSI Capstone:  MTB Trail Recommender Engine</span> 
---
## <span style="color:Green">Preprocessing</span>      

#### Ryan McDonald - General Assembly

---

### Notebook Contents:

- [Reading the User Data](#intro)    
    - [Arizona User Data Cleaning](#cleanaz)
    - [Utah User Data Cleaning](#cleanut) 
- [Reading the Trail Data](#trail)
    - [Arizona Trail Data Cleaning](#trailaz)
        - [Arizona Imputation/OHE](#imputeaz)
    - [Utah Trail Data Cleaning](#trailut) 
        - [Utah Imputation/OHE](#imputeut)
- [Export to CSV- Arizona Trails](#saveaz)
- [Export to CSV- Utah Trails](#saveut)

**Imports**

In [1]:
# basic imports
import numpy as np
import pandas as pd
import sys

# sparse matrices, pairwise distance metrics, scaling
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# # Spatial distance module
# import geopandas as gpd
# from shapely.geometry import Point
# from shapely.ops import nearest_points

<a id='intro'></a>
## 1. Content-Based Recommender
## Read Data

### Arizona Trail Data

In [2]:
# reading in the scaled, one_hot_encoded dataset for the recommender system
az_trails = pd.read_csv('./data/recommender_data/az_trail_data.csv')
az_trails = az_trails.set_index('trail_name')
az_trails.head()

trail_name,length,longitude,latitude,popularity,rating,tot_climb,tot_descent,ave_grade,max_grade,max_elevation,...,difficulty_intermediate,difficulty_intermediate/difficult,difficulty_very difficult,dog_policy_leashed,dog_policy_no dogs,dog_policy_off-leash,dog_policy_unknown,e_bike_policy_allowed,e_bike_policy_not allowed,e_bike_policy_unknown
Hiline Trail,0.022399,0.619678,0.507727,1.0,0.94,0.022963,0.057739,0.315789,0.357143,0.429345,...,0,0,1,0,0,0,1,0,0,1
Slim Shady Trail,0.018786,0.6171,0.508725,0.998953,0.88,0.018666,0.021932,0.210526,0.112245,0.412061,...,0,1,0,0,0,0,1,0,0,1
Mescal,0.017341,0.637923,0.498336,0.997906,0.92,0.01451,0.013791,0.157895,0.112245,0.435423,...,0,1,0,0,0,0,1,0,0,1
Chuckwagon,0.039017,0.637893,0.498445,0.996859,0.9,0.039375,0.040625,0.210526,0.132653,0.432479,...,1,0,0,0,0,0,1,0,0,1
Tortolita Preserve Loop,0.070087,0.197366,0.626201,0.995812,0.84,0.036627,0.0432,0.105263,0.040816,0.254416,...,1,0,0,0,0,0,1,0,0,1


In [3]:
az_trails.shape, az_trails.isnull().sum().sort_values(ascending = False).head()

((956, 24),
 longitude                    11
 latitude                     11
 e_bike_policy_unknown         0
 e_bike_policy_not allowed     0
 popularity                    0
 dtype: int64)

### Utah Trail Data

In [4]:
# reading in the scaled, one_hot_encoded dataset for the recommender system
ut_trails = pd.read_csv('./data/recommender_data/ut_trail_data.csv')
ut_trails = ut_trails.set_index('trail_name')
ut_trails.head()

trail_name,length,longitude,latitude,popularity,rating,tot_climb,tot_descent,ave_grade,max_grade,max_elevation,...,difficulty_intermediate,difficulty_intermediate/difficult,difficulty_very difficult,dog_policy_leashed,dog_policy_no dogs,dog_policy_off-leash,dog_policy_unknown,e_bike_policy_allowed,e_bike_policy_not allowed,e_bike_policy_unknown
Thunder Mountain Trail #33098,0.065165,0.140063,0.312952,1.0,0.94,0.052217,0.14821,0.3,0.409091,0.632152,...,0,1,0,0,0,1,0,0,0,1
Wasatch Crest,0.100563,0.726322,0.462306,0.998922,0.96,0.082152,0.234174,0.3,0.393939,0.817796,...,0,1,0,0,1,0,0,0,1,0
Captain Ahab,0.033789,0.304305,0.87393,0.997845,0.94,0.024706,0.086493,0.3,0.348485,0.246302,...,0,0,0,1,0,0,0,0,1,0
Wire Mesa Loop,0.059533,0.025333,0.145997,0.996767,0.92,0.032437,0.03659,0.1,0.181818,0.200894,...,0,1,0,0,0,0,1,1,0,0
Ramblin',0.026549,0.328357,0.838821,0.99569,0.94,0.014778,0.035091,0.15,0.181818,0.28999,...,0,1,0,1,0,0,0,0,1,0


#### Creating a Content-Based Recommender

In [31]:
def content_recommend(df):
    """Return a trail-by-trail cosine similarity matrix (1.0 = identical)."""

    # creating the sparse matrix (missing values are treated as 0)
    sparse_matrix = sparse.csr_matrix(df.fillna(0))

    # calculating pairwise cosine distances (0 = identical)
    dist = pairwise_distances(sparse_matrix, metric = 'cosine')

    # converting distance to similarity and labelling both axes with trail names
    rec = pd.DataFrame(1 - dist, index = df.index, columns = df.index)

    # return the similarity dataframe
    return rec
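
As a quick sanity check of the steps in this function, the sketch below runs them on a made-up three-trail frame and confirms that `1 - cosine distance` matches scikit-learn's `cosine_similarity`. The trail names and feature values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# toy data: three hypothetical trails, three scaled features (not from the real dataset)
toy = pd.DataFrame(
    {"length": [0.2, 0.4, 0.0], "rating": [0.9, 0.8, 0.1], "difficulty_easy": [0, 0, 1]},
    index=pd.Index(["Trail A", "Trail B", "Trail C"], name="trail_name"),
)

toy_sparse = sparse.csr_matrix(toy.fillna(0))

# cosine distance, flipped to similarity
toy_sim = pd.DataFrame(
    1 - pairwise_distances(toy_sparse, metric="cosine"),
    index=toy.index, columns=toy.index,
)

# the same matrix computed directly
direct = cosine_similarity(toy_sparse)

print(toy_sim.round(3))
print(np.allclose(toy_sim.values, direct))
```

Trail A and Trail B share a similar profile and score close to 1, while Trail C, which differs on every feature, scores close to 0 against both.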

### Arizona Trail Recommender

In [32]:
content_recommend(az_trails)

trail_name,Hiline Trail,Slim Shady Trail,Mescal,Chuckwagon,Tortolita Preserve Loop,Lone Cactus Loop,Apache Wash Loop,Desperado Loop,North Loop,Bug Springs,...,Monument Trail,Spine Trail,Spine Trail to Ridge Trail Connector,Far West Trail,Alamo Springs Spur Trail,Trail C,Trail G,Trail H,Trail D,Kain Trail
Hiline Trail,1.000000,0.825988,0.826365,0.828139,0.789980,0.790678,0.614431,0.617953,0.626990,0.814596,...,0.380059,0.577809,0.588877,0.567210,0.595816,0.589006,0.599506,0.593642,0.591397,0.588050
Slim Shady Trail,0.825988,1.000000,0.999485,0.829088,0.795815,0.794501,0.617195,0.617640,0.625359,0.800310,...,0.384980,0.583616,0.582111,0.579311,0.578275,0.591594,0.593363,0.814145,0.592137,0.585405
Mescal,0.826365,0.999485,1.000000,0.830567,0.794827,0.793654,0.618891,0.618884,0.626638,0.798189,...,0.383736,0.577336,0.572950,0.575259,0.567392,0.586056,0.586224,0.805944,0.586026,0.577390
Chuckwagon,0.828139,0.829088,0.830567,1.000000,0.973450,0.970220,0.617289,0.617415,0.626005,0.801939,...,0.385344,0.580660,0.795156,0.576387,0.774599,0.810986,0.592241,0.590860,0.810661,0.583201
Tortolita Preserve Loop,0.789980,0.795815,0.794827,0.973450,1.000000,0.998935,0.604907,0.624147,0.629122,0.790119,...,0.354468,0.583637,0.802231,0.590410,0.769882,0.807750,0.569215,0.571968,0.805826,0.585037
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Trail C,0.589006,0.591594,0.586056,0.810986,0.807750,0.802531,0.343271,0.331741,0.339731,0.574417,...,0.423441,0.703708,0.984679,0.688770,0.939024,1.000000,0.711365,0.711715,0.999506,0.696268
Trail G,0.599506,0.593363,0.586224,0.592241,0.569215,0.567790,0.573885,0.559270,0.565614,0.592271,...,0.423299,0.987661,0.717159,0.680654,0.710868,0.711365,1.000000,0.716153,0.714806,0.707659
Trail H,0.593642,0.814145,0.805944,0.590860,0.571968,0.569926,0.342174,0.332141,0.341395,0.581875,...,0.423630,0.704007,0.709039,0.686480,0.693049,0.711715,0.716153,1.000000,0.713420,0.701207
Trail D,0.591397,0.592137,0.586026,0.810661,0.805826,0.800773,0.342193,0.331848,0.340966,0.579668,...,0.424030,0.704535,0.988441,0.687570,0.947516,0.999506,0.714806,0.713420,1.000000,0.700849


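### Utah Trail Recommender

The Utah table below shows cosine *distance* (zeros on the diagonal), not similarity: cell `In [19]` ran before `content_recommend` was changed to return `1 - distance`. Subtract each value from 1 to compare it with the Arizona table.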

In [19]:
content_recommend(ut_trails)

trail_name,Thunder Mountain Trail #33098,Wasatch Crest,Captain Ahab,Wire Mesa Loop,Ramblin',Rush,Bull Run,Big Mesa,Getaway,Dino-Flow,...,Jones Ranch Trail #123 Alternate Access,Sovereign Connect,Whales Connect,Humpback,Flat Iron Mesa 4x4 Jeep Road Spur,BST Access Trail,The Farm - Green Trail,Hi Line,Carin-Age,Lasso
Thunder Mountain Trail #33098,0.000000,0.337835,0.551475,0.402220,0.385433,0.373467,0.375111,0.548581,0.553409,0.405215,...,0.586680,0.662385,0.650733,0.647736,0.826635,0.653027,0.704335,0.623006,0.618705,0.842940
Wasatch Crest,0.337835,0.000000,0.372677,0.429697,0.213528,0.178302,0.204471,0.364152,0.364672,0.543282,...,0.694661,0.804063,0.842944,0.840320,0.771875,0.747212,0.624618,0.777175,0.773478,0.592971
Captain Ahab,0.551475,0.372677,0.000000,0.603297,0.172981,0.531512,0.170624,0.172176,0.177188,0.350722,...,0.809777,0.793656,0.918735,0.915005,0.768928,0.828743,0.676294,0.853385,0.843765,0.594059
Wire Mesa Loop,0.402220,0.429697,0.603297,0.000000,0.418805,0.237567,0.418482,0.602992,0.610502,0.620838,...,0.701861,0.718164,0.705162,0.704265,0.490050,0.723431,0.368349,0.701449,0.699957,0.712317
Ramblin',0.385433,0.213528,0.172981,0.418805,0.000000,0.369284,0.001984,0.169689,0.172197,0.347940,...,0.801376,0.792799,0.909435,0.908101,0.776620,0.829494,0.678424,0.848264,0.846979,0.585471
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
BST Access Trail,0.653027,0.747212,0.828743,0.723431,0.829494,0.779716,0.822303,0.824173,0.825768,0.620860,...,0.042294,0.313384,0.348712,0.078478,0.555744,0.000000,0.544516,0.038585,0.036273,0.312054
The Farm - Green Trail,0.704335,0.624618,0.676294,0.368349,0.678424,0.456779,0.677014,0.494399,0.504036,0.507482,...,0.567359,0.382348,0.418159,0.659477,0.166359,0.544516,0.000000,0.604320,0.599345,0.605575
Hi Line,0.623006,0.777175,0.853385,0.701449,0.848264,0.835228,0.840080,0.842774,0.839346,0.633630,...,0.016246,0.307568,0.300528,0.013694,0.551729,0.038585,0.604320,0.000000,0.003339,0.292346
Carin-Age,0.618705,0.773478,0.843765,0.699957,0.846979,0.827146,0.836662,0.840916,0.838316,0.633379,...,0.020773,0.309848,0.306956,0.018523,0.547404,0.036273,0.599345,0.003339,0.000000,0.298653


In [25]:
az_sparse = sparse.csr_matrix(az_trails.fillna(0))


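# verifying sparse matrix and az_trails are the same shape!
# note: sys.getsizeof only measures the sparse wrapper object, not the arrays holding its data,
# so the two sizes below are not comparable; with over half of the cells non-zero, the real saving is small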
az_sparse, az_trails.shape, sys.getsizeof(az_sparse), sys.getsizeof(az_trails)

(<956x24 sparse matrix of type '<class 'numpy.float64'>'
 	with 13146 stored elements in Compressed Sparse Row format>,
 (956, 24),
 48,
 295157)
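
The 48 bytes reported above is the size of the Python wrapper object only. A fairer comparison adds up the three NumPy arrays that hold a CSR matrix's contents. The sketch below does this on a random stand-in with the same shape as `az_trails` and roughly the same share of non-zero cells (13,146 of 22,944, about 57%). The random data is an assumption for illustration, not the real trail data.

```python
import numpy as np
from scipy import sparse

# hypothetical stand-in for az_trails: same shape, roughly 57% non-zero cells
rng = np.random.default_rng(0)
dense = rng.random((956, 24))
dense[rng.random((956, 24)) > 0.57] = 0.0

csr = sparse.csr_matrix(dense)

# the CSR payload lives in three NumPy arrays: values, column indices, row pointers
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
dense_bytes = dense.nbytes

print(f"dense: {dense_bytes:,} bytes")
print(f"CSR:   {csr_bytes:,} bytes ({csr.nnz:,} stored elements)")
```

At this density the two footprints are of the same order, so the sparse format here is a convenience for `pairwise_distances` rather than a large memory saving.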

In [30]:
# calculating pairwise cosine distances, then storing 1 - distance (similarity) in a dataframe
az_rec = pairwise_distances(az_sparse, metric = 'cosine')
az_rec = pd.DataFrame(1-az_rec, index = az_trails.index, columns = az_trails.index)
az_rec

trail_name,Hiline Trail,Slim Shady Trail,Mescal,Chuckwagon,Tortolita Preserve Loop,Lone Cactus Loop,Apache Wash Loop,Desperado Loop,North Loop,Bug Springs,...,Monument Trail,Spine Trail,Spine Trail to Ridge Trail Connector,Far West Trail,Alamo Springs Spur Trail,Trail C,Trail G,Trail H,Trail D,Kain Trail
Hiline Trail,1.000000,0.825988,0.826365,0.828139,0.789980,0.790678,0.614431,0.617953,0.626990,0.814596,...,0.380059,0.577809,0.588877,0.567210,0.595816,0.589006,0.599506,0.593642,0.591397,0.588050
Slim Shady Trail,0.825988,1.000000,0.999485,0.829088,0.795815,0.794501,0.617195,0.617640,0.625359,0.800310,...,0.384980,0.583616,0.582111,0.579311,0.578275,0.591594,0.593363,0.814145,0.592137,0.585405
Mescal,0.826365,0.999485,1.000000,0.830567,0.794827,0.793654,0.618891,0.618884,0.626638,0.798189,...,0.383736,0.577336,0.572950,0.575259,0.567392,0.586056,0.586224,0.805944,0.586026,0.577390
Chuckwagon,0.828139,0.829088,0.830567,1.000000,0.973450,0.970220,0.617289,0.617415,0.626005,0.801939,...,0.385344,0.580660,0.795156,0.576387,0.774599,0.810986,0.592241,0.590860,0.810661,0.583201
Tortolita Preserve Loop,0.789980,0.795815,0.794827,0.973450,1.000000,0.998935,0.604907,0.624147,0.629122,0.790119,...,0.354468,0.583637,0.802231,0.590410,0.769882,0.807750,0.569215,0.571968,0.805826,0.585037
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Trail C,0.589006,0.591594,0.586056,0.810986,0.807750,0.802531,0.343271,0.331741,0.339731,0.574417,...,0.423441,0.703708,0.984679,0.688770,0.939024,1.000000,0.711365,0.711715,0.999506,0.696268
Trail G,0.599506,0.593363,0.586224,0.592241,0.569215,0.567790,0.573885,0.559270,0.565614,0.592271,...,0.423299,0.987661,0.717159,0.680654,0.710868,0.711365,1.000000,0.716153,0.714806,0.707659
Trail H,0.593642,0.814145,0.805944,0.590860,0.571968,0.569926,0.342174,0.332141,0.341395,0.581875,...,0.423630,0.704007,0.709039,0.686480,0.693049,0.711715,0.716153,1.000000,0.713420,0.701207
Trail D,0.591397,0.592137,0.586026,0.810661,0.805826,0.800773,0.342193,0.331848,0.340966,0.579668,...,0.424030,0.704535,0.988441,0.687570,0.947516,0.999506,0.714806,0.713420,1.000000,0.700849


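Cosine distance runs from **'0'** (a trail compared with itself) to **'1'** (not similar at all), so the most similar trails have the lowest distances. The `az_rec` matrix above stores cosine *similarity* (`1 - distance`), where the scale is reversed and **'1'** marks a trail compared with itself. The lookups below report distances.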

In [6]:
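# Which 10 trails are most similar to Hangover Trail?
# az_rec holds similarity, so convert back to distance: smallest value = most similar
# position 0 is the trail itself (distance 0), so it is skipped
(1 - az_rec['Hangover Trail']).sort_values().head(11)[1:]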

trail_name
Hiline Trail                       0.000617
Kellog/Incinerator Ridge           0.037165
Tabletop                           0.041521
Western Loop Trail                 0.054197
Green Mountain                     0.066301
Baby Jesus Trail                   0.087726
Hog Heaven                         0.164656
Sunset                             0.165837
Little Yeager Canyon Trail #533    0.165987
Cathedral Rock Connector Trail     0.166559
Name: Hangover Trail, dtype: float64

'Hiline Trail' is the closest match to 'Hangover Trail' (cosine distance 0.000617). The next five trails fall below 0.09, after which the distance jumps to about 0.165.
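
Slicing off the first row assumes the queried trail always sorts into position 0. A lookup that drops the trail by label avoids that assumption and works directly on a similarity matrix such as the one `content_recommend` returns. The helper name `most_similar` and the toy values below are hypothetical, and the sketch assumes trail names are unique.

```python
import pandas as pd

def most_similar(sim: pd.DataFrame, name: str, n: int = 10) -> pd.Series:
    """Return the n highest-similarity entries for `name`, excluding `name` itself."""
    scores = sim[name].drop(labels=name)   # drop by label rather than by position
    return scores.sort_values(ascending=False).head(n)

# toy similarity matrix with invented values (1.0 on the diagonal)
names = ["Trail A", "Trail B", "Trail C"]
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.4],
     [0.2, 0.4, 1.0]],
    index=names, columns=names,
)

print(most_similar(toy_sim, "Trail A", n=2))
```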

In [7]:
az_trails.head(1)

trail_name,length,longitude,latitude,popularity,rating,tot_climb,tot_descent,ave_grade,max_grade,max_elevation,...,difficulty_intermediate,difficulty_intermediate/difficult,difficulty_very difficult,dog_policy_leashed,dog_policy_no dogs,dog_policy_off-leash,dog_policy_unknown,e_bike_policy_allowed,e_bike_policy_not allowed,e_bike_policy_unknown
Hiline Trail,0.022399,0.619678,0.507727,1.0,0.94,0.022963,0.057739,0.315789,0.357143,0.429345,...,0,0,1,0,0,0,1,0,0,1


In [8]:
# Creating a trail search term:

search = "Hiline"
trails = az_trails[az_trails.index.str.contains(search)].index
for trail in trails:
    print(trail)
    print("Popularity: ", az_trails.loc[trail, 'popularity'])
    # count() tallies the non-null feature columns for this trail, not user ratings
    print("Non-null Features: ", az_trails.T[trail].count())
    print("")
    print("10 Closest Trails")
    # az_rec holds similarity, so rank by distance (1 - similarity); position 0 is the trail itself
    print((1 - az_rec[trail]).sort_values()[1:11])
    print("")
    print("*"*35)
    print("")

Hiline Trail
Popularity:  1.0
Non-null Features:  24

10 Closest Trails
trail_name
Hangover Trail                    0.000617
Kellog/Incinerator Ridge          0.042147
Tabletop                          0.042998
Western Loop Trail                0.051682
Green Mountain                    0.070407
Baby Jesus Trail                  0.087600
Hog Heaven                        0.166381
Cathedral Rock Connector Trail    0.167931
High on the Hog                   0.169619
Broken Arrow Trail                0.169760
Name: Hiline Trail, dtype: float64

***********************************



### All trails within a defined radius

<a id='intro'></a>
## 2. User-Based (Binary) Recommender
## Read Data - Arizona User Data

In [None]:
# reading in the cleaned, sorted user dataset for the recommender system
az_users = pd.read_csv('./data/all_arizona_users.csv')

# add the binary rating column (users that rated the trail)
# every row in this file is a rating, so each row gets a '1';
# the '0's (user has not rated the trail) are filled in after pivoting
az_users['binary_rate'] = 1
az_users.head()

In [None]:
az_users.shape, az_users.isnull().sum().sort_values(ascending = False).head()

#### Transform to Pivot Table

In [None]:
# users as the index, trail names on x-axis, ratings as values.
# will show NaN for trails not rated, and '1' where trail was rated
az_pivot = az_users.pivot_table(index='user_name', columns= 'trail_name', values = 'binary_rate')
az_pivot.head()
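
To make the reshaping concrete, the sketch below pivots a tiny invented ratings log the same way and then fills the gaps with 0 to get the binary user-by-trail matrix. The user names, trail names and the `toy_` variables are all hypothetical.

```python
import pandas as pd

# invented ratings log: one row per (user, trail) pair
toy_users = pd.DataFrame({
    "user_name": ["Ann", "Ann", "Bo", "Cy", "Cy"],
    "trail_name": ["Trail A", "Trail B", "Trail A", "Trail B", "Trail C"],
})
toy_users["binary_rate"] = 1

# users as rows, trails as columns; NaN where a user has not rated a trail
toy_pivot = toy_users.pivot_table(index="user_name", columns="trail_name", values="binary_rate")

# filling the gaps gives the 0/1 matrix used for the distance calculation
toy_binary = toy_pivot.fillna(0).astype(int)
print(toy_binary)
```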

#### Creating a Sparse Matrix

In [None]:
sparse_users = sparse.csc_matrix(az_pivot.fillna(0))
# verifying shapes of pivot and sparse are the same
sparse_users, az_pivot.shape

In [None]:
# calculating pairwise cosine distances and building into a dataframe
# both axes are 'user_name'; values stay as distances (0 = identical rating history)
az_user_rec = pairwise_distances(sparse_users, metric = 'cosine')
az_user_rec = pd.DataFrame(az_user_rec, index = az_pivot.index, columns = az_pivot.index)
az_user_rec

The most similar users have the lowest values: **'0'** is a user compared with themselves, and **'1'** means two users have no rated trails in common.
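
For binary rows the cosine distance reduces to a count of shared trails: `1 - shared / sqrt(rated_by_a * rated_by_b)`. The sketch below works this out by hand for two invented users and checks it against `pairwise_distances`.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# invented binary rows: 1 = user rated the trail
user_a = np.array([1, 1, 1, 0, 0])
user_b = np.array([1, 1, 0, 1, 0])

shared = int((user_a & user_b).sum())   # trails both users rated
by_hand = 1 - shared / np.sqrt(user_a.sum() * user_b.sum())

by_sklearn = pairwise_distances([user_a], [user_b], metric="cosine")[0, 0]
print(by_hand, by_sklearn)
```

Here each user rated three trails and two overlap, so the distance is `1 - 2/3`, or about 0.33.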

In [None]:
# Which 10 users are most similar to A H?

az_user_rec['A H'].sort_values().head(11)[1:]

'Soloman Picoult' and 'Josh Richart' are the closest matches to 'A H', which suggests (but does not prove) that they ride together. Two other users are also close (cosine distance below 0.5); beyond that, users become quite dissimilar.
'A H' has rated mostly challenging trails, so is probably a strong rider.

In [None]:
az_pivot.head(1)

In [None]:
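# Creating a user search term:

search = "A H"
users = az_pivot[az_pivot.index.str.contains(search)].index
for user in users:
    print(user)
    # ratings are binary and NaN is skipped, so this mean is always 1.0
    print("Average Rating: ", az_pivot.loc[user, :].mean())
    # count() tallies the non-null cells, i.e. the trails this user rated
    print("Number of Ratings: ", az_pivot.T[user].count())
    print("")
    print("10 Closest Users")
    # smallest distance first; position 0 is the user themselves
    print(az_user_rec[user].sort_values()[1:11])
    print("")
    print("*"*35)
    print("")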
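
One way the nearest users could be turned into trail suggestions is to count how many of them rated each trail the target user has not. This is a minimal sketch under that assumption, not the notebook's method. The function `suggest_trails`, the user names and the toy matrix are all hypothetical.

```python
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances

def suggest_trails(binary: pd.DataFrame, dist: pd.DataFrame, user: str, k: int = 2) -> pd.Series:
    """Rank trails `user` has not rated by how many of the k nearest users rated them."""
    neighbours = dist[user].drop(labels=user).nsmallest(k).index
    counts = binary.loc[neighbours].sum()   # per-trail tally among the neighbours
    unseen = binary.loc[user] == 0          # trails the target user has not rated
    return counts[unseen & (counts > 0)].sort_values(ascending=False)

# invented binary user-by-trail matrix
toy_binary = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [1, 0, 1, 1],
     [0, 0, 0, 1]],
    index=["Ann", "Bo", "Cy", "Di"],
    columns=["Trail A", "Trail B", "Trail C", "Trail D"],
)

toy_dist = pd.DataFrame(
    pairwise_distances(toy_binary.values, metric="cosine"),
    index=toy_binary.index, columns=toy_binary.index,
)

print(suggest_trails(toy_binary, toy_dist, "Ann", k=2))
```

For 'Ann' the two nearest users are 'Bo' and 'Cy'. Both rated 'Trail C' and one rated 'Trail D', so 'Trail C' ranks first.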