<div style="text-align: right"> Machine Learning Project <br/> Julia Locatelli <br/> Jonathan Yuste</div>

<h1><center>Machine Learning Project: <br/> Music Genre Classification</center></h1>
<br/>
<div class='text-danger'><center> Objective: determine the genre of a piece of music</center></div>

## Project objective
We are expected to apply what we learned in class in order to:
* Use Machine Learning,
* Predict/solve a problem of our choice.

For this project we selected the "Spotify Genre Audio Features" dataset. The goal is to determine the genre of a track from several of its features.

First, we import the libraries the project needs.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

## Loading the dataset

Since we downloaded the dataset as a file, we first need to import and read it.
To do so, we use "pandas" to read our Excel file.

In [2]:
tracks = pd.read_excel("spotify_genre_final.xlsx")

print(tracks.head())
print(tracks.columns)

  Genre                          Title  \
0  rock                   Baba O'Riley   
1  rock            More Than a Feeling   
2  rock        (Don't Fear) The Reaper   
3  rock           Jump - 2015 Remaster   
4  rock  Stairway to Heaven - Remaster   

                                    Album_cover_link            Artist  \
0  https://i.scdn.co/image/ab67616d0000b273fe24dc...           The Who   
1  https://i.scdn.co/image/ab67616d0000b27390ef97...            Boston   
2  https://i.scdn.co/image/ab67616d0000b2733ac318...  Blue Öyster Cult   
3  https://i.scdn.co/image/ab67616d0000b273b414c6...         Van Halen   
4  https://i.scdn.co/image/ab67616d0000b273c8a11e...      Led Zeppelin   

   duration_ms  explicit                      id  popularity release_date  \
0       300400     False  3qiyyUfYe7CRYLucrPmulD          75   1971-08-14   
1       285133     False  1QEEqeFIZktqIpPI4jSVSF          78         1976   
2       308120     False  5QTxFnGygVM4jFQiBovmRo          76         19

Not every column in this file is needed to determine the genre of a track.
We therefore drop some of them. The `Genre` column is also removed from the features because it is the target we want to predict.

In [3]:
ColPasUtile = ['Genre','Title', 'Album_cover_link', 'Artist', 'duration_ms',
       'explicit', 'id', 'popularity', 'release_date',
       'release_date_precision', 'total_tracks','time_signature']

y = tracks['Genre'].values
X = tracks.drop(ColPasUtile, axis=1).values

print(y)
print(X)

['rock' 'rock' 'rock' ... 'r&b' 'r&b' 'r&b']
[[4.89000e-01 7.24000e-01 5.00000e+00 ... 2.87000e-01 1.50000e-01
  1.17292e+02]
 [3.77000e-01 6.81000e-01 7.00000e+00 ... 5.04000e-02 2.85000e-01
  1.08789e+02]
 [3.33000e-01 9.27000e-01 9.00000e+00 ... 2.97000e-01 3.85000e-01
  1.41466e+02]
 ...
 [5.07000e-01 8.32000e-01 9.00000e+00 ... 1.64000e-01 5.78000e-01
  8.49170e+01]
 [8.50000e-01 7.00000e-01 8.00000e+00 ... 6.16000e-02 7.40000e-01
  1.39980e+02]
 [5.28000e-01 3.49000e-01 0.00000e+00 ... 1.75000e-01 1.95000e-01
  8.93470e+01]]


## Supervised Learning: Classification

In this project we use a linear classifier, since every track is described by the same fixed set of numeric features.

We first define the data that will be used for training and for testing.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size = 0.2)

print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_train shape: (5533, 11)
y_train shape: (5533,)
X_test shape: (1384, 11)
y_test shape: (1384,)
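The split above is purely random, so the genre proportions in the test set can drift from those of the full dataset. A possible refinement, sketched here on synthetic data rather than on the Spotify file, is to pass `stratify=` to `train_test_split` so that each class keeps the same share in both subsets. The arrays `X_demo` and `y_demo` and the 70/30 class mix are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 4 features, two imbalanced classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array(["rock"] * 70 + ["r&b"] * 30)

# stratify=y_demo keeps the 70/30 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

labels, counts = np.unique(y_te, return_counts=True)
test_counts = {str(label): int(count) for label, count in zip(labels, counts)}
print(test_counts)
```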


In [5]:
# multinomial because our dataset contains several classes
clf_lr = LogisticRegression(multi_class= 'multinomial', solver='lbfgs')

clf_lr.fit(X_train, y_train) 

print("intercept: {}".format(clf_lr.intercept_))
print("weights:   {}".format(clf_lr.coef_))

intercept: [ 0.32425753  0.35464957 -0.02141149  0.40778784 -0.12847997  0.02321762
 -0.96002109]
weights:   [[-0.08381727  0.96746954  0.01476265  0.27551843 -0.78581669 -0.43357393
  -0.90301375  1.47106488  0.19760608 -1.87546241  0.01666389]
 [ 1.37139446  0.59643098 -0.02250773 -0.09823317 -0.42625518  1.48831838
  -0.70038149 -0.21279324  0.42210989  1.12519393 -0.02195792]
 [ 0.65855322  0.06239343 -0.02287271  0.15805975  0.82996375 -0.47716506
   1.63585924 -0.48810183 -0.16948999  2.07510806 -0.00788554]
 [-0.01949103 -0.89620735  0.03776111  0.07515409  1.08068249 -0.50159685
   1.31950911 -0.38872861 -0.3316329  -0.86632227  0.00422126]
 [-0.01140463 -0.93342888 -0.0182703  -0.13263812 -1.16582516 -0.15633831
   0.36302216 -0.32098405 -0.25815751 -0.2055438   0.0040473 ]
 [ 1.23009354 -0.31036204 -0.0292423  -0.05519834 -0.47129862  1.17275519
  -0.68535159 -0.40050044  0.22407421 -0.04592352 -0.00328179]
 [-3.14532829  0.51370431  0.04036928 -0.22266264  0.9385494  -1.0923

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
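The message above suggests two remedies: raising `max_iter` or scaling the data. The printed `X` hints at why scaling may matter, since its last column holds values around 100 while most others stay below 1. The following is a minimal sketch of both remedies combined in a pipeline, run on synthetic data that only mimics this scale mismatch. The dataset, the value `max_iter=1000` and the variable names are assumptions, and no claim is made about the accuracy it would reach on the real tracks.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the track features (not the Spotify data).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=11, n_informative=6,
    n_classes=3, random_state=0,
)
# Stretch one column so its scale is far larger than the others.
X_syn[:, -1] = X_syn[:, -1] * 30 + 120

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# StandardScaler is fitted on the training data only, then the same
# transformation is applied automatically whenever the pipeline predicts.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

n_iter = scaled_lr.named_steps["logisticregression"].n_iter_
test_acc = scaled_lr.score(X_te, y_te)
print("lbfgs iterations used:", n_iter)
print("Test accuracy: {:.2f}".format(test_acc))
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training, which would happen if the whole of `X` were scaled before the split.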


In [6]:
t = X_test[:1, :]  # slicing keeps a 2-D array that holds only the first test example
pred = clf_lr.predict(t)  # predict() expects a 2-D array of shape (n_samples, n_features)
pred_pr = clf_lr.predict_proba(t)

t_pred = pred[0]
t_pred_pr = pred_pr[0]
# t_pred is a string label, so its probability is looked up through clf_lr.classes_
#print("Prediction: {}, probability: {}".format(t_pred, t_pred_pr[list(clf_lr.classes_).index(t_pred)]))

train_score = clf_lr.score(X_train,y_train)
test_score = clf_lr.score(X_test,y_test)

print("Training set score: {:.2f} ".format(train_score))
print("Test set score: {:.2f} ".format(test_score))

Training set score: 0.39 
Test set score: 0.35 
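A single accuracy figure is easier to judge next to a reference point. The intercept vector printed earlier has 7 entries, so the model separates 7 genres; if those genres were evenly represented (which the outputs here do not show), always guessing one genre would score about 1/7, or roughly 0.14. The sketch below shows one way to make such a comparison and to see per-class behaviour, using `DummyClassifier`, `confusion_matrix` and `classification_report` on synthetic three-class data. All data and variable names are assumptions, and the numbers it prints say nothing about the Spotify results.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with three classes (not the Spotify data).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=11, n_informative=6,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = model.score(X_te, y_te)
print("Baseline accuracy: {:.2f}".format(baseline_acc))
print("Model accuracy:    {:.2f}".format(model_acc))

# Rows are true classes, columns are predicted classes.
y_pred = model.predict(X_te)
cm = confusion_matrix(y_te, y_pred)
print(cm)
print(classification_report(y_te, y_pred))
```

The confusion matrix shows which classes get mistaken for each other, information that the overall score hides.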


# Sources:

https://www.kaggle.com/naoh1092/spotify-genre-audio-features