# K-Fold Cross Validation

Let's revisit the song_df data set:

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import svm
import io

# Dataset: https://www.kaggle.com/datasets/nitishraj/song-popularity-k-folds
# 15,000 rows were deleted because the SVM took over 3.5 minutes to run on the full data set

from google.colab import files
uploaded = files.upload()
song_df = pd.read_csv(io.BytesIO(uploaded['song_popularity.csv']))


Saving song_popularity.csv to song_popularity.csv


In [2]:
song_df.head()                # get a visual of the dataset

Unnamed: 0,id,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence,song_popularity
0,25525,160207.0,0.468952,0.558192,0.542314,0.004122,1.0,,,1,0.048953,118.507699,4,0.736457,0
1,29945,175575.0,0.562466,0.837785,0.684599,0.001808,,0.124795,,1,0.053656,102.400007,3,0.65118,1
2,22765,66156.0,1.025163,,,0.007247,7.0,0.127724,-21.378036,0,0.033405,81.077515,2,0.132257,0
3,9738,194331.0,0.101652,,0.811663,0.00225,5.0,0.11615,-4.430667,1,0.106921,79.33407,3,0.41887,0
4,25087,250925.0,0.676626,0.822191,0.413637,,10.0,0.110757,-8.159729,1,0.106729,101.913642,3,0.406016,0


In [3]:
song_df.shape                           # 2,498 rows and 15 columns in the dataset

(2498, 15)

In [4]:
song_df.drop('id', axis=1, inplace=True)

In [5]:
song_df.describe()                      # a count below 2,498 means that column contains null values, which need to be dropped

Unnamed: 0,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence,song_popularity
count,2247.0,2254.0,2261.0,2253.0,2255.0,2261.0,2232.0,2238.0,2498.0,2498.0,2498.0,2498.0,2498.0,2498.0
mean,192646.306186,0.278503,0.568458,0.681626,0.039851,4.896948,0.198706,-7.343748,0.314251,0.09474,115.963598,3.400721,0.580343,0.370697
std,44436.269819,0.297922,0.191869,0.213564,0.159469,3.415153,0.153489,3.901285,0.46431,0.084072,25.589144,0.514849,0.239146,0.483088
min,42109.0,-0.010766,0.077057,0.037646,-0.002523,0.0,0.032782,-29.043994,0.0,0.021477,63.977202,2.0,0.041047,0.0
25%,166435.5,0.03928,0.424706,0.5431,0.000994,2.0,0.112093,-9.306378,0.0,0.038926,96.423734,3.0,0.397502,0.0
50%,186798.0,0.141846,0.604645,0.699832,0.002025,5.0,0.135795,-6.281367,0.0,0.055847,112.99411,3.0,0.600443,0.0
75%,213463.0,0.478014,0.716935,0.867654,0.003196,8.0,0.210705,-4.567081,1.0,0.121436,127.570636,4.0,0.759454,1.0
max,414414.0,1.065284,0.92517,1.024361,1.007672,11.0,0.980411,-1.099458,1.0,0.560748,212.54355,5.0,1.001118,1.0


In [6]:
song_df.dropna(inplace=True)          # drop every row that contains a null value

In [7]:
song_df.isnull().sum()                # confirmation that the above worked successfully. There are now no null values left in the dataset

song_duration_ms    0
acousticness        0
danceability        0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
audio_mode          0
speechiness         0
tempo               0
time_signature      0
audio_valence       0
song_popularity     0
dtype: int64

In [8]:
features = list(song_df.columns[0:13])    # independent variables (the first 13 columns) put into a list

X = song_df[features]                     # X variable now assigned the above list from column "song_duration_ms" through "audio_valence"
y = song_df['song_popularity']            # y is the target variable as we are trying to predict "song popularity"
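Slicing `columns[0:13]` depends on the target being the last column. As a minimal sketch, assuming a made-up toy frame (`toy_df` and its values are hypothetical, not the song data), the features can instead be selected by dropping the target by name:

```python
import pandas as pd

# Hypothetical toy frame standing in for song_df; the values are invented for illustration
toy_df = pd.DataFrame({
    "tempo": [118.5, 102.4, 81.1],
    "energy": [0.54, 0.68, 0.81],
    "song_popularity": [0, 1, 0],
})

target = "song_popularity"
X_toy = toy_df.drop(columns=[target])   # every column except the target
y_toy = toy_df[target]                  # the target column on its own

print(list(X_toy.columns))
print(y_toy.tolist())
```

Selecting by name keeps working if columns are later added or reordered.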


A single train/test split is made easy with the train_test_split function in scikit-learn's model_selection module:

In [13]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)        # Training and test data split out into different data sets - 70/30 split


song_clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)                                  # Support Vector Machine is now fit with the training data in order to predict song popularity


print(song_clf.score(X_train, y_train))                                                               # Print the accuracy score for the training data
print(song_clf.score(X_test, y_test))                                                                 # Now score the unseen test data to see how well the SVM generalizes

0.6151832460732984
0.5975609756097561
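A single split gives one accuracy number that depends on which rows happened to land in the test set. The sketch below illustrates this on synthetic stand-in data from `make_classification` (an assumption; it is not the song data set) by repeating the 70/30 split with different `random_state` values:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption for illustration, not the song data set)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

split_scores = []
for seed in range(5):
    # Same 70/30 split as above, but with a different shuffle each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    clf = SVC(kernel="linear", C=1).fit(X_tr, y_tr)
    split_scores.append(clf.score(X_te, y_te))

print(np.round(split_scores, 3))
print("spread:", max(split_scores) - min(split_scores))
```

Any spread between the printed scores comes only from how the rows were divided, which is the variation that averaging over several folds is meant to smooth out.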


K-Fold cross validation is just as easy; let's use a K of 5:

In [14]:
# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
scores = cross_val_score(song_clf, X, y, cv=5)

# Print the accuracy for each fold:
print(scores)                                       # The mean cross-validation accuracy is about 62%, slightly above the roughly 60% from the single test split

# And the mean accuracy across all folds:
print(scores.mean())

[0.61818182 0.63636364 0.6146789  0.62385321 0.62385321 0.62385321
 0.6146789  0.62385321 0.62385321 0.62385321]
0.6227022518765637


Our model is even better than we thought! Can we do better? Let's try a different kernel (poly):

In [11]:
song_clf = svm.SVC(kernel='poly', C=1)
scores = cross_val_score(song_clf, X, y, cv=5)
print(scores)
print(scores.mean())

[0.62557078 0.62557078 0.62844037 0.62385321 0.62385321]
0.6254576683004481


No! The more complex polynomial kernel produced about the same accuracy as the simple linear kernel (0.625 vs 0.623). If anything, the added complexity risks overfitting. But we couldn't have told that with a single train/test split:

In [15]:
# Build an SVC model for predicting song popularity using the training data
song_clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train)

# Now measure its performance with the test data
song_clf.score(X_test, y_test)   

0.625

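On this single split the polynomial kernel scores 0.625, while the linear kernel scored 0.598 on the same test set. Judged by the single split alone, the polynomial kernel would look like a clear improvement; cross validation shows that the two kernels perform about the same.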
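One way to compare kernels on an equal footing is to score each with the same cross-validation setup and look at the spread as well as the mean. This is a minimal sketch on synthetic stand-in data (an assumption; the numbers it prints say nothing about the song data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption for illustration, not the song data set)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

results = {}
for kernel in ("linear", "poly"):
    # An integer cv yields the same unshuffled folds for both kernels
    fold_scores = cross_val_score(SVC(kernel=kernel, C=1), X_demo, y_demo, cv=5)
    results[kernel] = (fold_scores.mean(), fold_scores.std())
    print(f"{kernel}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

If the gap between the two means is smaller than the fold-to-fold standard deviation, the data give little reason to prefer the more complex kernel.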

## Activity