# **Song Analysis: Predicting Music Genre**

# Machine Learning Project

## Preparing the Shuffled Dataset

### by Marta Buesa

#### February 2022

![MartaBuesaProyectoML](portada_ML.png)


Identifying a song's genre may seem easy when we know the artist, since artists are usually associated with a particular genre. However, some songs blend several genres, and there are many subgenres into which songs can be classified.

For that reason, this study focuses on 10 music genres. Using a dataset with a large sample of 50,000 songs belonging to those 10 genres, I analyze their features and prepare models to predict the genre each song would be classified under.

Source CSV:
https://www.kaggle.com/vicsuperman/prediction-of-music-genre 

## 1. Import libraries

In [1]:
import pandas as pd
import numpy as np

## 2. Load the dataset

In [2]:
music_df = pd.read_csv('csvs/music_genre.csv')
music_df.head(15)

Unnamed: 0,instance_id,artist_name,track_name,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence,music_genre
0,32894.0,Röyksopp,Röyksopp's Night Out,27.0,0.00468,0.652,-1.0,0.941,0.792,A#,0.115,-5.201,Minor,0.0748,100.889,4-Apr,0.759,Electronic
1,46652.0,Thievery Corporation,The Shining Path,31.0,0.0127,0.622,218293.0,0.89,0.95,D,0.124,-7.043,Minor,0.03,115.00200000000001,4-Apr,0.531,Electronic
2,30097.0,Dillon Francis,Hurricane,28.0,0.00306,0.62,215613.0,0.755,0.0118,G#,0.534,-4.617,Major,0.0345,127.994,4-Apr,0.333,Electronic
3,62177.0,Dubloadz,Nitro,34.0,0.0254,0.774,166875.0,0.7,0.00253,C#,0.157,-4.498,Major,0.239,128.014,4-Apr,0.27,Electronic
4,24907.0,What So Not,Divide & Conquer,32.0,0.00465,0.638,222369.0,0.587,0.909,F#,0.157,-6.266,Major,0.0413,145.036,4-Apr,0.323,Electronic
5,89064.0,Axel Boman,Hello,47.0,0.00523,0.755,519468.0,0.731,0.854,D,0.216,-10.517,Minor,0.0412,?,4-Apr,0.614,Electronic
6,43760.0,Jordan Comolli,Clash,46.0,0.0289,0.572,214408.0,0.803,8e-06,B,0.106,-4.294,Major,0.351,149.995,4-Apr,0.23,Electronic
7,30738.0,Hraach,Delirio,43.0,0.0297,0.809,416132.0,0.706,0.903,G,0.0635,-9.339,Minor,0.0484,120.008,4-Apr,0.761,Electronic
8,84950.0,Kayzo,NEVER ALONE,39.0,0.00299,0.509,292800.0,0.921,0.000276,F,0.178,-3.175,Minor,0.268,149.94799999999998,4-Apr,0.273,Electronic
9,56950.0,Shlump,Lazer Beam,22.0,0.00934,0.578,204800.0,0.731,0.0112,A,0.111,-7.091,Minor,0.173,139.933,4-Apr,0.203,Electronic


The dataset has 18 columns, each describing one aspect of a song:

- **instance_id**: serial number of the song in the dataset.
- **artist_name**: name of the song's artist.
- **track_name**: title of the song.

- **popularity**: a score assigned to the song on a scale from 0 to 100, where 100 is the most popular and 0 the least.
- **acousticness**: describes how acoustic a song is. A score of 1.0 means the song is most likely acoustic.
- **danceability**: describes how suitable a track is for dancing, based on a combination of musical elements. A value of 0.0 is least danceable and 1.0 is most danceable.

- **duration_ms**: duration of the song in milliseconds.
- **energy**: the energy of the song, in the range [0-1], where 1 is the highest energy and 0 the lowest.
- **instrumentalness**: estimates how likely the song is to contain no vocals. The closer the value is to 1.0, the more instrumental the song.

- **key**: the key of a piece is the group of pitches, or scale, that forms the basis of the composition (for example A#, D, G#).
- **liveness**: the probability that the song was recorded in front of a live audience, in the range [0-1].
- **loudness**: the loudness of the song.

- **mode**: whether the song is based on a major or a minor scale.
- **speechiness**: detects the presence of spoken words in a track.
- **tempo**: the speed at which the song is played.

- **obtained_date**: the date on which the song's metadata was retrieved.
- **valence**: a measure from 0.0 to 1.0 describing the musical positivity conveyed by a track. Tracks with high valence sound more positive.
- **music_genre**: the actual category the song belongs to. This is our target variable.

## 3. Inspect the dataset

In [3]:
# Check the shape of the dataset: rows x columns
music_df.shape

(50005, 18)

In [4]:
# Look at the descriptive summary of the numeric columns
music_df.describe(include=None).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
instance_id,50000.0,55888.39636,20725.256253,20002.0,37973.5,55913.5,73863.25,91759.0
popularity,50000.0,44.22042,15.542008,0.0,34.0,45.0,56.0,99.0
acousticness,50000.0,0.306383,0.34134,0.0,0.02,0.144,0.552,0.996
danceability,50000.0,0.558241,0.178632,0.0596,0.442,0.568,0.687,0.986
duration_ms,50000.0,221252.60286,128671.957157,-1.0,174800.0,219281.0,268612.25,4830606.0
energy,50000.0,0.599755,0.264559,0.000792,0.433,0.643,0.815,0.999
instrumentalness,50000.0,0.181601,0.325409,0.0,0.0,0.000158,0.155,0.996
liveness,50000.0,0.193896,0.161637,0.00967,0.0969,0.126,0.244,1.0
loudness,50000.0,-9.133761,6.16299,-47.046,-10.86,-7.2765,-5.173,3.744
speechiness,50000.0,0.093586,0.101373,0.0223,0.0361,0.0489,0.098525,0.942


In [5]:
# Check for null values and the data type of each column
music_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50005 entries, 0 to 50004
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   instance_id       50000 non-null  float64
 1   artist_name       50000 non-null  object 
 2   track_name        50000 non-null  object 
 3   popularity        50000 non-null  float64
 4   acousticness      50000 non-null  float64
 5   danceability      50000 non-null  float64
 6   duration_ms       50000 non-null  float64
 7   energy            50000 non-null  float64
 8   instrumentalness  50000 non-null  float64
 9   key               50000 non-null  object 
 10  liveness          50000 non-null  float64
 11  loudness          50000 non-null  float64
 12  mode              50000 non-null  object 
 13  speechiness       50000 non-null  float64
 14  tempo             50000 non-null  object 
 15  obtained_date     50000 non-null  object 
 16  valence           50000 non-null  float64
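
The `info()` output lists `tempo` as `object` even though it holds numbers, and the sample rows above show `?` in `tempo` and `-1.0` in `duration_ms`. These look like placeholders for missing values, which `info()` does not count as nulls. A minimal sketch of how they could be counted follows; `toy_df` is a hypothetical stand-in for `music_df`, and treating `?` and negative durations as missing is an assumption, not something the notebook states.

```python
import pandas as pd

# Toy rows mimicking the sample above: '?' in tempo, -1.0 in duration_ms
toy_df = pd.DataFrame({
    "duration_ms": [-1.0, 218293.0, 519468.0],
    "tempo": ["100.889", "115.002", "?"],
})

# errors="coerce" turns non-numeric strings such as '?' into NaN
tempo_numeric = pd.to_numeric(toy_df["tempo"], errors="coerce")

# Count the suspected placeholders in each column
n_bad_tempo = int(tempo_numeric.isna().sum())
n_bad_duration = int((toy_df["duration_ms"] < 0).sum())

print("Non-numeric tempo values:", n_bad_tempo)
print("Negative durations:", n_bad_duration)
```

Running the same two counts on `music_df` would show how many rows are affected before deciding how to handle them.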

In [6]:
# Drop the 5 rows that contain null values
music_df.dropna(inplace=True)
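
The shape reports 50005 rows while every column has 50000 non-null entries, which suggests that the same 5 rows are empty across all columns. A quick check, sketched here on a hypothetical `toy_df`, compares the rows that are entirely null with the rows that have at least one null. If both counts match, `dropna()` removes only empty rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for music_df with one completely empty row
toy_df = pd.DataFrame({
    "popularity": [27.0, np.nan, 31.0],
    "music_genre": ["Electronic", np.nan, "Electronic"],
})

# Rows where every column is null
fully_null = toy_df.isna().all(axis=1)
# Rows where at least one column is null
any_null = toy_df.isna().any(axis=1)

print("Fully null rows:", int(fully_null.sum()))
print("Rows with any null:", int(any_null.sum()))

# dropna() with default arguments removes every row that has any null
cleaned_df = toy_df.dropna()
```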

### Drop unnecessary columns

In [7]:
# Drop columns that are not useful for the analysis
music_df.drop(columns=['instance_id', 'obtained_date'], inplace=True) 

In [8]:
# The dataset now has this shape
print('Number of rows: ', music_df.shape[0])
print('Number of columns: ', music_df.shape[1])

Number of rows:  50000
Number of columns:  16


In [9]:
# Keep every column name except the first one (artist_name)
columnas_dataset = music_df.columns[1:]

## 4. Shuffle the dataset so it can be split into TRAIN / TEST

The rows of the dataset are ordered by the target column, genre. I shuffle the rows and save the result so that a later TRAIN / TEST split can contain a balanced mix of genres.

In [10]:
from sklearn.utils import shuffle

# Shuffle the rows; no random_state is set, so the order changes on every run
music_df = shuffle(music_df)

music_df.head(30)

Unnamed: 0,artist_name,track_name,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,music_genre
27249,Lupe Fiasco,XO (feat. Troi Irons),51.0,0.0224,0.662,271871.0,0.922,0.0,A,0.499,-4.862,Minor,0.23,128.137,0.174,Rap
3541,Grandtheft,Easy Go,49.0,0.107,0.681,190912.0,0.922,0.00167,F,0.0938,-3.592,Major,0.0779,156.071,0.566,Electronic
25603,Hollywood Undead,Comin’ In Hot,61.0,0.0146,0.708,-1.0,0.749,0.0,D,0.0913,-5.64,Major,0.0897,96.016,0.382,Rap
31957,Santana,Nothing At All (feat. Musiq),40.0,0.103,0.532,268827.0,0.643,0.0,G,0.123,-7.039,Minor,0.0638,?,0.208,Blues
14138,Pink Martini,Hey Eugene,34.0,0.442,0.723,-1.0,0.573,0.0001,D,0.102,-7.303,Major,0.0459,95.802,0.509,Jazz
3854,DJ Shadow,Organ Donor,41.0,0.837,0.61,117240.0,0.571,0.743,C#,0.15,-7.136,Minor,0.0315,106.185,0.909,Electronic
49935,Lil Pump,Youngest Flexer (feat. Gucci Mane),55.0,0.246,0.854,199111.0,0.648,0.0,A,0.125,-5.251,Major,0.0808,134.916,0.718,Hip-Hop
49482,Atmosphere,Modern Man's Hustle,46.0,0.154,0.859,225227.0,0.609,0.0,B,0.348,-3.633,Major,0.286,85.976,0.649,Hip-Hop
23442,Brantley Gilbert,Just As I Am,44.0,0.869,0.384,245946.0,0.38,0.0,G#,0.322,-9.56,Major,0.046,134.885,0.333,Country
18664,The Black Keys,You're the One,44.0,0.822,0.594,208227.0,0.462,0.00186,G,0.123,-10.218,Major,0.0403,?,0.529,Alternative
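
Shuffling makes a balanced split possible but does not guarantee it. As a hedged illustration, not the method used in this notebook, scikit-learn's `train_test_split` can shuffle and keep the genre proportions equal in both parts through its `stratify` argument. The `toy_df` data, the 80/20 ratio and the seed are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for music_df: rows sorted by genre, as in the source CSV
toy_df = pd.DataFrame({
    "popularity": range(20),
    "music_genre": ["Rap"] * 10 + ["Jazz"] * 10,
})

# shuffle=True is the default; stratify keeps genre proportions equal in both parts
train_df, test_df = train_test_split(
    toy_df,
    test_size=0.2,                   # assumed 80/20 split
    stratify=toy_df["music_genre"],
    random_state=42,                 # assumed seed, only for reproducibility
)

print(train_df["music_genre"].value_counts().to_dict())
print(test_df["music_genre"].value_counts().to_dict())
```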


In [11]:
# Renumber the rows from 0; the old shuffled index is kept in a new 'index' column
music_df.reset_index(inplace=True)

In [12]:
# Drop the 'index' column holding the old row labels, which is no longer needed
music_df.drop(columns=['index'], inplace=True)
music_df.head()

Unnamed: 0,artist_name,track_name,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,music_genre
0,Lupe Fiasco,XO (feat. Troi Irons),51.0,0.0224,0.662,271871.0,0.922,0.0,A,0.499,-4.862,Minor,0.23,128.137,0.174,Rap
1,Grandtheft,Easy Go,49.0,0.107,0.681,190912.0,0.922,0.00167,F,0.0938,-3.592,Major,0.0779,156.071,0.566,Electronic
2,Hollywood Undead,Comin’ In Hot,61.0,0.0146,0.708,-1.0,0.749,0.0,D,0.0913,-5.64,Major,0.0897,96.016,0.382,Rap
3,Santana,Nothing At All (feat. Musiq),40.0,0.103,0.532,268827.0,0.643,0.0,G,0.123,-7.039,Minor,0.0638,?,0.208,Blues
4,Pink Martini,Hey Eugene,34.0,0.442,0.723,-1.0,0.573,0.0001,D,0.102,-7.303,Major,0.0459,95.802,0.509,Jazz
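
The shuffle, index reset and column drop above can also be written with pandas alone. The sketch below is an equivalent alternative under stated assumptions: `toy_df` is a hypothetical stand-in for `music_df`, and the seed is added only so the order can be reproduced.

```python
import pandas as pd

# Toy stand-in for music_df
toy_df = pd.DataFrame({
    "track_name": ["a", "b", "c", "d", "e"],
    "music_genre": ["Rap", "Rap", "Jazz", "Jazz", "Blues"],
})

# frac=1 returns every row in random order; random_state is an assumed seed
shuffled_df = toy_df.sample(frac=1, random_state=0)

# drop=True discards the old index instead of adding it as an 'index' column
shuffled_df = shuffled_df.reset_index(drop=True)

print(shuffled_df)
```

With `drop=True`, the separate step that removes the `index` column is not needed.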


## 4.1. Save the shuffled CSV

In [13]:
# Save the shuffled dataset; the row index is written as the first, unnamed column
music_df.to_csv('csvs/music_genre_barajado.csv')
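
Because `to_csv` writes the row index by default, reading the file back with plain `read_csv` produces an extra `Unnamed: 0` column. The sketch below shows two ways to avoid that, using a hypothetical `toy_df` and temporary files rather than the project's `csvs/` folder.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the shuffled music_df
toy_df = pd.DataFrame({"track_name": ["a", "b"], "music_genre": ["Rap", "Jazz"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    # Default behaviour: the index is written as an unnamed first column
    toy_df.to_csv(with_index)
    # Alternative: skip the index when writing
    toy_df.to_csv(without_index, index=False)

    cols_default = list(pd.read_csv(with_index).columns)
    # Alternative: tell read_csv that the first column is the index
    cols_index_col = list(pd.read_csv(with_index, index_col=0).columns)
    cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_index_col)
print(cols_no_index)
```

Either `index=False` when writing or `index_col=0` when reading keeps the column set identical to the saved DataFrame.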