# Execute the code below

In [1]:
import pandas as pd
import seaborn as sns


url = "https://raw.githubusercontent.com/murpi/wilddata/master/quests/spotify.zip"
df_music = pd.read_csv(url)
df_zoom = df_music.loc[df_music.genre.isin(['Country', 'Classical']), ['genre', 'duration_ms', 'speechiness']].reset_index(drop = True)
df_zoom

Unnamed: 0,genre,duration_ms,speechiness
0,Country,200013,0.0444
1,Country,208187,0.0569
2,Country,123360,0.0960
3,Country,238600,0.0368
4,Country,243000,0.0330
...,...,...,...
17915,Country,179147,0.0322
17916,Country,230400,0.0832
17917,Country,216093,0.0268
17918,Country,179947,0.0909


# Standardization and classification

You now have a dataset with Country and Classical musics, and 2 numerical features : duration and speechiness.
Our goal will be to predict genre from numerical features.

## Draw a scatterplot from df_zoom with
- 'duration_ms' on X axis
- 'speechiness' on Y axis
- 'genre' in hue

In [2]:
# Your code here :
#import matplotlib as plt

#plt.scatter(x = df_zoom["duration_ms"], y = df_zoom["speechiness"], c= 'red')
#plt.show()

## Classification
From df_zoom :
- define X (`duration_ms` and `speechiness`)
- define y (`genre`)
- split your datas into train and test datasets, and `random_state = 2`
- perform 3 classification algorithms (Logistic Regression, KNN and Decision Tree)
- score your 3 models with accuracy score on the train dataset and on the test dataset


In [3]:
# Your code here :
X = df_zoom[["duration_ms", "speechiness"]]
y = df_zoom["genre"]

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2, train_size = 0.75)
print("The length of the initial dataset is :", len(X))
print("The length of the train dataset is   :", len(X_train))
print("The length of the test dataset is    :", len(X_test))

The length of the initial dataset is : 17920
The length of the train dataset is   : 13440
The length of the test dataset is    : 4480


##Regression Logistique

In [5]:
#Train a logistic regression
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

In [6]:
y_pred_test = model.predict(X_test)
y_pred_train = model.predict(X_train)

In [7]:
from sklearn.metrics import accuracy_score

accuracy1 = accuracy_score(y_train, y_pred_train)
accuracy2 = accuracy_score(y_test, y_pred_test)
print(f"Accuracy on train: {round(accuracy1, 2)}")
print(f"Accuracy on test: {round(accuracy2, 2)}")

Accuracy on train: 0.52
Accuracy on test: 0.51


##KNN Classifier

In [8]:
from sklearn.neighbors import KNeighborsClassifier

modelKNN = KNeighborsClassifier(n_neighbors=5)

modelKNN.fit(X_train, y_train)

print("\nScore for the Train dataset :", modelKNN.score(X_train, y_train))
print("Score for the Test dataset :", modelKNN.score(X_test, y_test))

print("Scikit-Learn : ", modelKNN.predict([[10, 10]]))


Score for the Train dataset : 0.7979910714285714
Score for the Test dataset : 0.7129464285714285
Scikit-Learn :  ['Classical']




##Decision Tree

In [9]:
X = df_zoom[["duration_ms", "speechiness"]]
y = df_zoom["genre"]

In [10]:
from sklearn.tree import DecisionTreeClassifier

modelDTC = DecisionTreeClassifier()
modelDTC.fit(X_train, y_train)

In [11]:
from sklearn.metrics import accuracy_score

x_pred = modelDTC.predict(X_train)
y_pred = modelDTC.predict(X_test)

accuracy1 = accuracy_score(y_train, x_pred)
accuracy2 = accuracy_score(y_test, y_pred)
print(f"Accuracy on train: {round(accuracy1, 2)}")
print(f"Accuracy on test: {round(accuracy2, 2)}")

Accuracy on train: 1.0
Accuracy on test: 0.74



You have to find these accuracy scores for test set :
- Logistic regression : 0.50982
- KNN : 0.71295
- Decision tree : 0.73728

It seems Decision tree is better, but did you look at the overfitting ?

#Standardization

- Fit your scaler model on X_train
- Transform X_train and X_test  with your scaler model into  X_train_scaled and X_test_scaled
- perform and score the same 3 classification algorithms, but with X_train_scaled and X_test_scaled

In [13]:
# Your code here :
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

##Regression logistique

In [14]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

In [15]:
y_pred_test = model.predict(X_test_scaled)
y_pred_train = model.predict(X_train_scaled)



In [16]:
from sklearn.metrics import accuracy_score

accuracy1 = accuracy_score(y_train, y_pred_train)
accuracy2 = accuracy_score(y_test, y_pred_test)
print(f"Accuracy on train: {round(accuracy1, 2)}")
print(f"Accuracy on test: {round(accuracy2, 2)}")

Accuracy on train: 0.67
Accuracy on test: 0.69


##KNN Classifier

In [24]:
from sklearn.neighbors import KNeighborsClassifier

modelKNN = KNeighborsClassifier(n_neighbors=5)

modelKNN.fit(X_train_scaled, y_train)

print("\nScore for the Train dataset :", modelKNN.score(X_train_scaled, y_train))
print("Score for the Test dataset :", modelKNN.score(X_test_scaled, y_test))

print("Scikit-Learn : ", modelKNN.predict([[5, 5]]))


Score for the Train dataset : 0.8364583333333333
Score for the Test dataset : 0.7743303571428571
Scikit-Learn :  ['Classical']


##Decision Tree

In [25]:
from sklearn.tree import DecisionTreeClassifier

modelDTC = DecisionTreeClassifier()
modelDTC.fit(X_train_scaled, y_train)

In [26]:
from sklearn.metrics import accuracy_score

x_pred = modelDTC.predict(X_train_scaled)
y_pred = modelDTC.predict(X_test_scaled)

accuracy1 = accuracy_score(y_train, x_pred)
accuracy2 = accuracy_score(y_test, y_pred)
print(f"Accuracy on train: {round(accuracy1, 2)}")
print(f"Accuracy on test: {round(accuracy2, 2)}")

Accuracy on train: 1.0
Accuracy on test: 0.74


# Conclusion
- Decision tree is insensitive to Standardization.
- Logistic regression and KNN have better result after Standardization.


We can remember that standardization is always good in preprocessing, before machine learning classifications and regressions. At worst, it does not change anything. At best, it improves results.