# Notebook \#4

**Part 1: [3 points]** You must run at least 6 variations of the algorithms and display their results using an __appropriate regression metric__ (again, use the scikit-learn modules). I will be looking for the following to be included in your comparison:

* **weighted k-Nearest-Neighbor** with a **small value of k** (the same one you used for the unweighted version)
* **weighted k-Nearest-Neighbor** with a **large value of k** (the same one you used for the unweighted version)
* a **decision tree** with default parameter values
* a **decision tree**, setting some kind of parameter that results in a smaller tree 
* a **Random Forest**, with default parameter values
* a **Random Forest**, with a **change to the number of trees** used.

You will need to use the documentation for sklearn for this Notebook. Here are some helpful links:
* [K Neighbors Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)
* [Decision Tree Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) 


In [None]:
# load in the data and necessary libraries
import sklearn
import pandas
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import neighbors
from sklearn.preprocessing import StandardScaler

from google.colab import drive
drive.mount('/content/drive')
songs = pandas.read_csv('/content/drive/MyDrive/CS167-Machine_Learning/datasets/spotify.csv') # change this to match your dataset directory
songs.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).




Unnamed: 0.1,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,...,id,uri,track_href,analysis_url,duration_ms,time_signature,genre,song_name,Unnamed: 0,title
0,0.831,0.814,2,-7.364,1,0.42,0.0598,0.0134,0.0556,0.389,...,2Vc6NJ9PW9gD9q343XFRKx,spotify:track:2Vc6NJ9PW9gD9q343XFRKx,https://api.spotify.com/v1/tracks/2Vc6NJ9PW9gD...,https://api.spotify.com/v1/audio-analysis/2Vc6...,124539,4,Dark Trap,Mercury: Retrograde,,
1,0.719,0.493,8,-7.23,1,0.0794,0.401,0.0,0.118,0.124,...,7pgJBLVz5VmnL7uGHmRj6p,spotify:track:7pgJBLVz5VmnL7uGHmRj6p,https://api.spotify.com/v1/tracks/7pgJBLVz5Vmn...,https://api.spotify.com/v1/audio-analysis/7pgJ...,224427,4,Dark Trap,Pathology,,
2,0.85,0.893,5,-4.783,1,0.0623,0.0138,4e-06,0.372,0.0391,...,0vSWgAlfpye0WCGeNmuNhy,spotify:track:0vSWgAlfpye0WCGeNmuNhy,https://api.spotify.com/v1/tracks/0vSWgAlfpye0...,https://api.spotify.com/v1/audio-analysis/0vSW...,98821,4,Dark Trap,Symbiote,,
3,0.476,0.781,0,-4.71,1,0.103,0.0237,0.0,0.114,0.175,...,0VSXnJqQkwuH2ei1nOQ1nu,spotify:track:0VSXnJqQkwuH2ei1nOQ1nu,https://api.spotify.com/v1/tracks/0VSXnJqQkwuH...,https://api.spotify.com/v1/audio-analysis/0VSX...,123661,3,Dark Trap,ProductOfDrugs (Prod. The Virus and Antidote),,
4,0.798,0.624,2,-7.668,1,0.293,0.217,0.0,0.166,0.591,...,4jCeguq9rMTlbMmPHuO7S3,spotify:track:4jCeguq9rMTlbMmPHuO7S3,https://api.spotify.com/v1/tracks/4jCeguq9rMTl...,https://api.spotify.com/v1/audio-analysis/4jCe...,123298,4,Dark Trap,Venom,,


In [None]:
# Split the data into the training data and testing data
# we're only going to use a subset of the columns that are numeric
target= 'danceability'
predictors = ['energy', 'key', 'loudness','speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo','duration_ms']
train_data, test_data, train_sln, test_sln = train_test_split(songs[predictors], songs[target], test_size = 0.2, random_state=41)
train_data.head()
train_data.shape

(33844, 10)

In [None]:
# w-kNN with small k
knn = neighbors.KNeighborsRegressor(n_neighbors = 5, weights = 'distance')

knn.fit(train_data,train_sln)

danceability_predictions = knn.predict(test_data)

print("R2 score", metrics.r2_score(test_sln,danceability_predictions))

R2 score 0.16320665977971316


In [None]:
# w-kNN with large k
knn = neighbors.KNeighborsRegressor(n_neighbors = 500, weights = 'distance')

knn.fit(train_data,train_sln)

danceability_predictions = knn.predict(test_data)

print("R2 score", metrics.r2_score(test_sln,danceability_predictions))

R2 score 0.32561799134055114


In [None]:
# decision tree with default parameters
dt = tree.DecisionTreeRegressor()

dt.fit(train_data,train_sln)

danceability_predictions = dt.predict(test_data)

print(dt.get_depth())
print("R2 score", metrics.r2_score(test_sln,danceability_predictions))

44
R2 score 0.30908438249957193


In [None]:
# decision tree with some kind of parameter that (hopefully) results in a smaller tree
dt = tree.DecisionTreeRegressor(min_samples_leaf=100)

dt.fit(train_data,train_sln)

danceability_predictions = dt.predict(test_data)

print(dt.get_depth())
print("R2 score", metrics.r2_score(test_sln,danceability_predictions))

14
R2 score 0.4637541810828988
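
`min_samples_leaf` is one of several `DecisionTreeRegressor` parameters that limit tree size. The following is a minimal sketch, not part of the original assignment, that compares a few of them by depth and leaf count. It assumes synthetic data from `make_regression` as a stand-in for the Spotify dataset, and the specific parameter values are arbitrary choices for illustration.

```python
from sklearn import tree
from sklearn.datasets import make_regression

# Synthetic stand-in data (an assumption; the notebook uses spotify.csv)
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=41)

# Each entry is one way of constraining how large the tree may grow
settings = {
    "default": {},
    "max_depth=5": {"max_depth": 5},
    "min_samples_leaf=100": {"min_samples_leaf": 100},
    "max_leaf_nodes=20": {"max_leaf_nodes": 20},
}

sizes = {}
for label, params in settings.items():
    model = tree.DecisionTreeRegressor(random_state=41, **params).fit(X, y)
    # record (depth, number of leaves) for each fitted tree
    sizes[label] = (model.get_depth(), model.get_n_leaves())
    print(f"{label:22s} depth = {sizes[label][0]:3d}  leaves = {sizes[label][1]:4d}")
```

All three constraints should produce a much smaller tree than the default, which keeps splitting until its leaves can no longer be divided.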


In [None]:
# Random Forest with default parameters
from sklearn import ensemble

rf = ensemble.RandomForestRegressor()

rf.fit(train_data,train_sln)

danceability_predictions = rf.predict(test_data)

print("R2 score", metrics.r2_score(test_sln,danceability_predictions))

R2 score 0.6471319230715651


In [None]:
# Random Forest with change to the number of trees
rf = ensemble.RandomForestRegressor(n_estimators=150)

rf.fit(train_data,train_sln)

danceability_predictions = rf.predict(test_data)

print("R2 score", metrics.r2_score(test_sln,danceability_predictions))

R2 score 0.6492496597631483
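
The six variations above can also be collected into one loop so that their scores print side by side. This is only a sketch under stated assumptions: it uses synthetic `make_regression` data instead of the Spotify file, smaller values of k and `min_samples_leaf` to suit the smaller dataset, and it reports mean absolute error alongside R2 as a second regression metric. The `models` and `results` names are introduced here for illustration.

```python
from sklearn import ensemble, metrics, neighbors, tree
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the notebook uses spotify.csv)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=41)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=41
)

# One entry per variation requested in Part 1
models = {
    "w-kNN, k=5": neighbors.KNeighborsRegressor(n_neighbors=5, weights="distance"),
    "w-kNN, k=100": neighbors.KNeighborsRegressor(n_neighbors=100, weights="distance"),
    "tree, default": tree.DecisionTreeRegressor(random_state=41),
    "tree, min_samples_leaf=20": tree.DecisionTreeRegressor(
        min_samples_leaf=20, random_state=41
    ),
    "forest, default": ensemble.RandomForestRegressor(random_state=41),
    "forest, 150 trees": ensemble.RandomForestRegressor(
        n_estimators=150, random_state=41
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "r2": metrics.r2_score(y_test, predictions),
        "mae": metrics.mean_absolute_error(y_test, predictions),
    }
    print(f"{name:28s} R2 = {results[name]['r2']:.3f}  MAE = {results[name]['mae']:.3f}")
```

Fixing `random_state` on the tree and forest models makes the comparison repeatable from run to run; the cells above leave it unset, so their scores can vary slightly between runs.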


## Part 2: w-kNN on Normalized data
**2.)** Normalize the data and run a weighted k-Nearest Neighbors algorithm on it (from sklearn, not the one we wrote from scratch). You can choose the k value. To normalize, use the `StandardScaler` from sklearn.

In [None]:
# normalize the data
# work on copies so the original split stays untouched
train_data_copy = train_data.copy()
test_data_copy = test_data.copy()

# fit the scaler on the training data only, then apply the same
# transformation to the test data (refitting on the test set would
# scale it with different means and standard deviations)
scaler = StandardScaler()
train_scale = scaler.fit_transform(train_data_copy)
test_scale = scaler.transform(test_data_copy)

# run a w-knn
knn = neighbors.KNeighborsRegressor(n_neighbors = 500, weights = 'distance')

knn.fit(train_scale,train_sln)

danceability_predictions = knn.predict(test_scale)

print("R2 score", metrics.r2_score(test_sln,danceability_predictions))

R2 score 0.46029844095566164
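
Scaling and kNN can also be bundled into a single scikit-learn pipeline, so that the scaler is always fit on the training data and then reused unchanged on the test data. The sketch below is an illustration rather than the assignment's required approach: it assumes synthetic data with one column inflated to mimic a large-range feature such as `duration_ms`, and the choice of `n_neighbors=10` is arbitrary.

```python
from sklearn import metrics, neighbors
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the notebook uses spotify.csv)
X, y = make_regression(n_samples=400, n_features=5, noise=5.0, random_state=41)
X[:, 0] *= 100_000  # put one column on a much larger scale than the others
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=41
)

# The pipeline fits the scaler on X_train only and reuses it inside predict()
scaled_knn = make_pipeline(
    StandardScaler(),
    neighbors.KNeighborsRegressor(n_neighbors=10, weights="distance"),
)
scaled_knn.fit(X_train, y_train)
pipeline_r2 = metrics.r2_score(y_test, scaled_knn.predict(X_test))

# The same steps written out by hand, for comparison
scaler = StandardScaler().fit(X_train)
knn = neighbors.KNeighborsRegressor(n_neighbors=10, weights="distance")
knn.fit(scaler.transform(X_train), y_train)
manual_r2 = metrics.r2_score(y_test, knn.predict(scaler.transform(X_test)))

print("pipeline R2:", pipeline_r2)
print("manual R2:  ", manual_r2)
```

Both versions apply the training-set means and standard deviations to the test set, so they should produce the same score.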


## Part 3:
**3.)** Use a Markdown cell to answer the following questions:
* Which algorithm performed best: w-kNN, Decision Trees, or Random Forests? Why do you think this was the case?
* What effect did normalizing the data have on your results? Explain. 

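* Random Forest with the higher tree count performed best (R2 of about 0.649 with 150 trees, versus about 0.647 with the default settings), though the extra trees made only a marginal difference. A random forest is made up of many decision trees whose predictions are pooled and averaged, which likely outperforms an individual decision tree because averaging smooths out the overfitting of any single tree. As for doing better than k-Nearest-Neighbors, I'd guess it's because some predictor variables have a larger impact on the target than others: trees can split on the most informative predictors, while kNN counts every predictor in its distance calculation.
* Normalizing the data significantly improved the results of the weighted kNN algorithm: with k = 500, the R2 score rose from about 0.326 to about 0.460. The predictors are on very different scales, so without normalization the variables with large numeric ranges (such as duration_ms and loudness) have too much influence on the distance calculation, while variables confined to a 0 to 1 range (such as instrumentalness) have little to no influence.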
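
The scale effect described in the last answer can be checked with a small calculation. The sketch below uses three hypothetical songs (the values are made up for illustration, not taken from the dataset) and measures how much each feature contributes to the squared Euclidean distance between the first two songs, before and after standardizing. The `standardize` helper is written by hand and uses the population standard deviation, which matches what `StandardScaler` computes.

```python
import statistics

features = ["instrumentalness", "loudness", "duration_ms"]

# Hypothetical songs: the first two differ a lot in instrumentalness and
# loudness but are almost the same length
songs = [
    (0.00, -7.4, 124_000),
    (0.90, -4.8, 124_500),
    (0.01, -7.2, 224_000),
]

def standardize(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def squared_distance_shares(p, q):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    squared = [(a - b) ** 2 for a, b in zip(p, q)]
    total = sum(squared)
    return {name: part / total for name, part in zip(features, squared)}

raw_share = squared_distance_shares(songs[0], songs[1])
scaled = standardize(songs)
scaled_share = squared_distance_shares(scaled[0], scaled[1])

for name in features:
    print(f"{name:18s} raw = {raw_share[name]:.5f}  scaled = {scaled_share[name]:.5f}")
```

On the raw values, the half-second difference in `duration_ms` accounts for nearly all of the distance between the first two songs. After standardizing, the distance comes almost entirely from instrumentalness and loudness, the features in which those two songs actually differ.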