# P4B - Learning

This project gives you experience with SQL and Learning topics.

In [3]:
import pandas as pd 
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_iris
from sklearn.datasets import make_classification


from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn import model_selection

from sklearn.feature_selection import SelectPercentile

from sklearn.preprocessing import MinMaxScaler, StandardScaler

from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn import svm

from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline

from sklearn import metrics

# Fix the NumPy seed for repeatability (numpy is already imported above)
np.random.seed(5550)

import otter
grader = otter.Notebook()

# Problem: Classification - Music Hits 

For this problem, you will work to classify a song’s popularity. Specifically, you will develop methods to predict whether a song will make the Top10 of Billboard’s Hot 100 Chart. The data set consists of songs from the Top10 of Billboard’s Hot 100 Chart from 1990-2010 along with a sampling of other songs that did not make the list.

The data source is adapted from one used in an MIT 15.071 course. The data set was created by scraping Billboard’s Hot 100, other songs on Billboard, and using the EchoNest API, now a part of Spotify, to get song information.

The variables in the data set include several descriptions of the song and artist (including song title and id numbers) and the year the song was released. Additionally, several variables describe the song attributes: time signature, loudness, tempo, key, energy, pitch, and timbre (measured over different sections of the song). The last variable is a binary indicator of whether the song was in the Top10 or not.

You will use the variables of the song attributes to predict whether the song will be popular or not.

## Q1 - Load and understand the data 

Load in the `music` data. 

You should not use the `year`, `artistname`, `artistID`, `songtitle` or `songID` in the prediction.  
Additionally, remove any variables that are the confidence of another variable, e.g., `timesignature_confidence`, `tempo_confidence`.


Create an input feature matrix, `Xm`, and a label vector, `ym`, that you will use to create your classifiers.
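As a minimal sketch of the column selection described above, the confidence columns can be found by name rather than listed by hand. The toy frame and its values are hypothetical; only the column names mirror the data set, and the `_confidence` suffix is assumed to mark every confidence variable.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the data set.
toy = pd.DataFrame({
    "year": [1999, 2005],
    "tempo": [120.0, 98.5],
    "tempo_confidence": [0.9, 0.4],
    "key": [5, 7],
    "key_confidence": [0.7, 0.2],
    "Top10": [1, 0],
})

# Columns ending in "_confidence" describe how reliable another variable is.
confidence_cols = [c for c in toy.columns if c.endswith("_confidence")]

# Drop identifier-like and confidence columns, then split features from the label.
X_toy = toy.drop(columns=["year", "Top10"] + confidence_cols)
y_toy = toy["Top10"]

print(confidence_cols)
print(list(X_toy.columns))
```

Selecting by suffix avoids missing a confidence column through a typo in a hand-written list.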


In [1]:
music = pd.read_csv('music-f23.csv', encoding = "ISO-8859-1")
drop_columns = ['year', 'artistname', 'artistID', 'songtitle', 'songID','timesignature_confidence','tempo_confidence','key_confidence']
music = music.drop(columns = drop_columns)

Xm = music.drop(columns = ['Top10'])
ym = music['Top10']

Xm.head()


In [2]:
grader.check("q1")


## Q2. Classify Top 10 Hits 

We want to report out the results of predicting the top-10 hits using KNN, Decision Trees, and SVMs.

For each model, you will tune the hyper-parameters:    
* KNN, number of neighbors = [3, 5, 7, 9, 11] and weights = ['uniform', 'distance']
* Decision Trees, maximum depth of the tree = [2, 5, 10, 15] and criterion of ['gini', 'entropy'], set the random_state = 5
* SVM, use an RBF kernel with C = [0.01, 0.1, 1, 10]

In addition, you will want to see which scaling method seems to work best for this dataset and model: `StandardScaler` or `MinMaxScaler`.

Overall, you will construct **three pipelines** to perform this analysis, one for each model: KNN, DT, SVM. You will do an initial stratified split of your data into a training+validation set with 85% of the data and a test set with 15% of the data (random_state=5). Use 10-fold stratified cross-validation with random_state = 5.
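The following sketch illustrates what "stratified" buys in both the split and the cross-validation. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the music data) with roughly the same class imbalance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the music data: roughly 77% negatives, 23% positives.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.77, 0.23], random_state=5
)

# A stratified 85/15 split keeps the positive rate nearly identical in both parts.
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=5, stratify=y_demo
)
print(round(y_tv.mean(), 3), round(y_te.mean(), 3))

# Each of the 10 stratified folds preserves that rate in its validation part too.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)
fold_rates = [y_tv[val_idx].mean() for _, val_idx in cv.split(X_tv, y_tv)]
print(np.round(fold_rates, 3))
```

Without stratification, a fold could by chance contain very few positives, which makes the per-fold F1 score noisy.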

Additionally, when selecting the best hyper-parameters, use the F1-measure instead of accuracy (`scoring='f1'` in scikit-learn).
 
The steps in your pipeline should be called `scaler` for the scaling step, `knn` for the KNN classifier, `dt` for the Decision Tree, and `svm` for the Support Vector Machine. 

One note: we are not using the results here to select a certain model (that would be using the test set for more than just estimating generalization performance); rather, we just report the results.
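One way to compare the two scalers inside the same search is to treat the `scaler` step itself as a hyper-parameter. The sketch below shows this for KNN on synthetic stand-in data (an assumption, not the music data), with a smaller grid and 5 folds to keep it quick; it is an illustration of the technique, not the required solution.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic, imbalanced stand-in data (assumption; not the music data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.77, 0.23], random_state=5
)

pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Listing estimators under the step name lets the grid swap out the whole step.
param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler()],
    "knn__n_neighbors": [3, 5, 7],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1")
grid.fit(X_demo, y_demo)

# best_params_ reports the winning scaler object alongside the KNN setting.
print(grid.best_params_)
```

Because the scaler is fitted inside each cross-validation fold, the comparison does not leak validation data into the scaling statistics.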

In [4]:

# Split off the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(Xm, ym, test_size = 0.15, random_state = 5, stratify = ym)



# ** KNN **
# Create pipeline, with steps 'scaler' and 'knn'
knn_pipe = Pipeline( [('scaler', StandardScaler()),  ('knn', KNeighborsClassifier())])

# specify pipeline steps hyperparameters
knn_param = { 'knn__n_neighbors': [3, 5, 7, 9, 11], 'knn__weights': ['uniform', 'distance']}

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)

# instantiate and run GridSearchCV on pipeline:
knn_grid = GridSearchCV(knn_pipe, knn_param, cv=cvStrat, scoring='f1', n_jobs=-1)
knn_grid.fit(X_trainval, y_trainval)


# predictions on final test set
knn_ytest = knn_grid.predict(X_test)


print(knn_grid.best_params_)

{'knn__n_neighbors': 3, 'knn__weights': 'distance'}


In [5]:

np.random.seed(5550)
from sklearn.tree import DecisionTreeClassifier

# ** DT ** 
# Create pipeline, with steps 'scaler' and 'dt'
dt_pipe = Pipeline([('scaler', StandardScaler()), ('dt', DecisionTreeClassifier(random_state=5))])

dt_param = { 'dt__max_depth': [2, 5, 10, 15],'dt__criterion': ['gini', 'entropy']}  

# Setup cross-validation for repeatability 
cvStrat =  StratifiedKFold(n_splits=10, shuffle=True, random_state=5)

# instantiate and run GridSearchCV on pipeline:
dt_grid = GridSearchCV(dt_pipe, dt_param, cv=cvStrat, scoring='f1', n_jobs=-1)
dt_grid.fit(X_trainval, y_trainval)

# predictions on final test set
dt_ytest = dt_grid.predict(X_test)


print(dt_grid.best_params_)

{'dt__criterion': 'gini', 'dt__max_depth': 10}


In [6]:
from sklearn.svm import SVC
# ** SVM ** 
# Create pipeline, with steps 'scaler' and 'svm'
svm_pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC(kernel='rbf'))])

svm_param =  {'svm__C': [0.01, 0.1, 1, 10]}

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)

# instantiate and run GridSearchCV on pipeline:
svm_grid = GridSearchCV(svm_pipe, svm_param, cv=cvStrat, scoring='f1', n_jobs=-1)
svm_grid.fit(X_trainval, y_trainval)

# predictions on final test set
svm_ytest = svm_grid.predict(X_test)


print(svm_grid.best_params_)

{'svm__C': 10}


In [7]:
grader.check("q2")

## Q3  Table of Results 

Report in a DataFrame the following information for each model:
* `Model` type (KNN, DT, SVM), 
* best `Hyper-parameters` for the model, e.g., [(n_neighbors, 7), (weights, 'uniform')], (max_depth, 10), ('C', 0.1), etc.
* `Accuracy`, 
* `Precision`,
* `Recall`, 
* `F1-measure` and 
* `Balanced Acc` - balanced accuracy

The last 5 values should all be calculated on the test set. 
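The five metrics can be collected with a small helper so that each model contributes one row. The sketch below is one possible shape for this: `summarize` is a hypothetical helper name, the labels and predictions are made up for illustration, and the `Hyper-parameters` column is left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn import metrics

def summarize(name, y_true, y_pred):
    """Return one results-table row of test-set metrics for a model."""
    return {
        "Model": name,
        "Accuracy": metrics.accuracy_score(y_true, y_pred),
        "Precision": metrics.precision_score(y_true, y_pred),
        "Recall": metrics.recall_score(y_true, y_pred),
        "F1-measure": metrics.f1_score(y_true, y_pred),
        "Balanced Acc.": metrics.balanced_accuracy_score(y_true, y_pred),
    }

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1])

row = summarize("toy model", y_true, y_pred)
table = pd.DataFrame([row])

# Balanced accuracy is the mean of the recall on the positive and negative class.
recall_pos = metrics.recall_score(y_true, y_pred, pos_label=1)
recall_neg = metrics.recall_score(y_true, y_pred, pos_label=0)

print(table)
print((recall_pos + recall_neg) / 2)
```

On imbalanced data, accuracy is dominated by the majority class, which is why recall, F1-measure, and balanced accuracy are reported next to it.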

In [8]:

# Build data frame of requested results

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, balanced_accuracy_score

knn_accuracy = accuracy_score(y_test, knn_ytest)
knn_precision = precision_score(y_test, knn_ytest)
knn_recall = recall_score(y_test, knn_ytest)
knn_f1 = f1_score(y_test, knn_ytest)
knn_balanced_acc = balanced_accuracy_score(y_test, knn_ytest)

dt_accuracy = accuracy_score(y_test, dt_ytest)
dt_precision = precision_score(y_test, dt_ytest)
dt_recall = recall_score(y_test, dt_ytest)
dt_f1 = f1_score(y_test, dt_ytest)
dt_balanced_acc = balanced_accuracy_score(y_test, dt_ytest)

svm_accuracy = accuracy_score(y_test, svm_ytest)
svm_precision = precision_score(y_test, svm_ytest)
svm_recall = recall_score(y_test, svm_ytest)
svm_f1 = f1_score(y_test, svm_ytest)
svm_balanced_acc = balanced_accuracy_score(y_test, svm_ytest)

# Create DataFrame
results_data = {
    'Model': ['KNN', 'Decision Tree', 'SVM'],
    'Hyper-parameters': [knn_grid.best_params_, dt_grid.best_params_, svm_grid.best_params_],
    'Accuracy': [knn_accuracy, dt_accuracy, svm_accuracy],
    'Precision': [knn_precision, dt_precision, svm_precision],
    'Recall': [knn_recall, dt_recall, svm_recall],
    'F1-measure': [knn_f1, dt_f1, svm_f1],
    'Balanced Acc.': [knn_balanced_acc, dt_balanced_acc, svm_balanced_acc]
}

results = pd.DataFrame(results_data)



results

| | Model | Hyper-parameters | Accuracy | Precision | Recall | F1-measure | Balanced Acc. |
|---|---|---|---|---|---|---|---|
| 0 | KNN | {'knn__n_neighbors': 3, 'knn__weights': 'dista... | 0.741158 | 0.527559 | 0.39881 | 0.454237 | 0.633325 |
| 1 | Decision Tree | {'dt__criterion': 'gini', 'dt__max_depth': 10} | 0.681672 | 0.405063 | 0.380952 | 0.392638 | 0.586952 |
| 2 | SVM | {'svm__C': 10} | 0.747588 | 0.539568 | 0.446429 | 0.488599 | 0.65273 |


In [9]:
grader.check("q3")

<!-- BEGIN QUESTION -->

## Q4  Results summary 

Describe the results. Write 5-7 sentences about the results observed and the overall performance on the problem. Include a description of which method does best at predicting the positive examples.


**ANSWER** 
The results indicate that the Support Vector Machine (SVM) model performs best among the three models (KNN, Decision Tree, SVM) in predicting whether a song will make the Top 10 of Billboard's Hot 100 Chart. It has the highest accuracy (0.748), precision (0.540), recall (0.446), F1-measure (0.489), and balanced accuracy (0.653) on the test set, using the cross-validated best setting of C = 10.

The KNN model, with three distance-weighted neighbors, is a close second: accuracy 0.741 and F1-measure 0.454, with a recall of 0.399.

The Decision Tree model (criterion 'gini', max depth 10) trails both, with accuracy 0.682 and F1-measure 0.393. Overall performance is modest: every model recovers fewer than half of the actual Top 10 hits, so the SVM is the best at predicting the positive examples only in relative terms.

<!-- END QUESTION -->

## Bonus.  Improve Performance of Models

The problem we are working with deals with an imbalanced data set, meaning there are many more samples of one class than the other. For this dataset, ~77% of the data are negative samples (not Top10 hits).
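A quick way to check class balance is `value_counts(normalize=True)` on the label vector. The sketch below uses a hypothetical label series built to match the ~77/23 split stated above, not the actual `ym`.

```python
import pandas as pd

# Hypothetical label vector with the same ~77/23 split described above.
y_demo = pd.Series([0] * 77 + [1] * 23, name="Top10")

# normalize=True returns class proportions instead of raw counts.
class_share = y_demo.value_counts(normalize=True)
print(class_share)
```

A classifier that always predicts the majority class would reach about 77% accuracy on such data while finding no hits at all.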

The imbalanced data is one explanation for the poor performance of our classifiers above (among other reasons).  

Let's try to improve this performance.  Classification with imbalanced data can be improved using a number of different techniques.  Two approaches are: 

* Cost-sensitive or weighted learning approach
* Data or sampling approach 

Here we will examine the class weighting approach. Some of our traditional classification models can be adapted to include a penalty or cost for the different classes. In our problem, we have a minority class "Top10 Hits" and a majority class "Non-Top10 Hits". We can use class weighting to penalize the model more for misclassifying the minority class than the majority class.

We will make use of the `scikit-learn` parameter `class_weight`.  Setting `class_weight ='balanced'` will have a weight applied inversely proportional to the class frequency.  

Note, not all classification models have this parameter to set, e.g., KNN. 
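To make "inversely proportional to the class frequency" concrete, the sketch below computes the balanced weights for a hypothetical label vector with the ~77/23 split mentioned above. scikit-learn documents the heuristic as `n_samples / (n_classes * np.bincount(y))`, and `compute_class_weight` exposes the same calculation.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with the ~77/23 split described above.
y_demo = np.array([0] * 77 + [1] * 23)
classes = np.unique(y_demo)

# The 'balanced' heuristic: n_samples / (n_classes * count_per_class).
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

print(dict(zip(classes.tolist(), np.round(weights, 3).tolist())))
```

With this split, an error on a minority-class sample is weighted about 3.3 times as heavily as an error on a majority-class sample (77/23).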

Rerun your DT and SVM pipelines from above (Q2), now with the DT and SVM using the parameter `class_weight='balanced'`.

Add the resulting models `DT bal class weights` and `SVM bal class weights` and their performance to the results table from Q3.  



In [10]:

# ** DT ** 
# include class_weight in DT parameters
dt_pipe2 = Pipeline([('scaler', StandardScaler()), ('dt', DecisionTreeClassifier(random_state=5, class_weight='balanced'))])

# Specify DT hyperparameters
dt_param2 = {'dt__max_depth': [2, 5, 10, 15], 'dt__criterion': ['gini', 'entropy']}   

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)

# instantiate and run GridSearchCV on pipeline:
dt_grid2 = GridSearchCV(dt_pipe2, param_grid=dt_param2, cv=cvStrat, scoring='f1', n_jobs=-1)
dt_grid2.fit(X_trainval, y_trainval)

# predictions on final test set
dt_ytest2 = dt_grid2.predict(X_test)


print(dt_grid2.best_params_)

{'dt__criterion': 'entropy', 'dt__max_depth': 5}


In [11]:

# ** SVM ** 
# include class_weight in SVM parameters
svm_pipe2 =  Pipeline([('scaler', StandardScaler()), ('svm', SVC(kernel='rbf', class_weight='balanced'))])


# Specify SVM hyperparameters
svm_param2 = {'svm__C': [0.01, 0.1, 1, 10]}

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)

# instantiate and run GridSearchCV on pipeline:
svm_grid2 = GridSearchCV(svm_pipe2, param_grid=svm_param2, cv=cvStrat, scoring='f1', n_jobs=-1)
svm_grid2.fit(X_trainval, y_trainval)

# predictions on final test set
svm_ytest2 = svm_grid2.predict(X_test)


print(svm_grid2.best_params_)

{'svm__C': 1}


In [12]:

# Add "DT balanced class weights" and "SVM balanced class weights" rows 
#  to the results table. 


dt2_accuracy = accuracy_score(y_test, dt_ytest2)
dt2_precision = precision_score(y_test, dt_ytest2)
dt2_recall = recall_score(y_test, dt_ytest2)
dt2_f1 = f1_score(y_test, dt_ytest2)
dt2_balanced_acc = balanced_accuracy_score(y_test, dt_ytest2)

svm2_accuracy = accuracy_score(y_test, svm_ytest2)
svm2_precision = precision_score(y_test, svm_ytest2)
svm2_recall = recall_score(y_test, svm_ytest2)
svm2_f1 = f1_score(y_test, svm_ytest2)
svm2_balanced_acc = balanced_accuracy_score(y_test, svm_ytest2)

# Create DataFrame
results2_data = {
    'Model': [ 'DT bal class weights', 'SVM bal class weights'],
    'Hyper-parameters': [dt_grid2.best_params_, svm_grid2.best_params_],
    'Accuracy': [dt2_accuracy, svm2_accuracy],
    'Precision': [dt2_precision, svm2_precision],
    'Recall': [dt2_recall, svm2_recall],
    'F1-measure': [dt2_f1, svm2_f1],
    'Balanced Acc.': [dt2_balanced_acc, svm2_balanced_acc]
}

results2 = pd.DataFrame(results2_data)




results = pd.concat([results, results2], ignore_index= True)
results

| | Model | Hyper-parameters | Accuracy | Precision | Recall | F1-measure | Balanced Acc. |
|---|---|---|---|---|---|---|---|
| 0 | KNN | {'knn__n_neighbors': 3, 'knn__weights': 'dista... | 0.741158 | 0.527559 | 0.39881 | 0.454237 | 0.633325 |
| 1 | Decision Tree | {'dt__criterion': 'gini', 'dt__max_depth': 10} | 0.681672 | 0.405063 | 0.380952 | 0.392638 | 0.586952 |
| 2 | SVM | {'svm__C': 10} | 0.747588 | 0.539568 | 0.446429 | 0.488599 | 0.65273 |
| 3 | DT bal class weights | {'dt__criterion': 'entropy', 'dt__max_depth': 5} | 0.602894 | 0.377709 | 0.72619 | 0.496945 | 0.64173 |
| 4 | SVM bal class weights | {'svm__C': 1} | 0.731511 | 0.502203 | 0.678571 | 0.577215 | 0.714836 |


In [134]:
grader.check("bonus")

In [13]:
results.shape

(5, 7)

<!-- BEGIN QUESTION -->

## Bonus 2 

Given the updated results, state which classifier you would want to use to ensure that you correctly identified the most Top10 Hits. Explain why (briefly, 1-2 sentences).

**ANSWER** 

Based on the above results, the SVM model with balanced class weights is the best classifier for identifying the most Top10 Hits. It has the highest balanced accuracy (0.714836) and F1-measure (0.577215) of the five models, and its recall of 0.678571 means it finds about two-thirds of the actual hits. The Decision Tree with balanced class weights has slightly higher recall (0.72619), but at much lower precision (0.377709) and accuracy (0.602894), so it produces many more false positives.

<!-- END QUESTION -->



## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [135]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)