### 1. Introduction and Data Collection

#### Data Source
This project uses the **Spotify Tracks Dataset**, published on Kaggle by Maharshi Pandya. The dataset contains audio features of Spotify tracks across 125 different genres.

* **Original Source:** Kaggle
* **Author:** Maharshi Pandya
* **Link:** [Spotify Tracks Dataset](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset)

#### License
The dataset is licensed under the **Community Data License Agreement - Sharing - Version 1.0 (CDLA-Sharing-1.0)** or a similar license (Kaggle Open Data), which permits use, modification, and sharing for academic and analytical purposes.

#### Target Variable
For the linear regression analyses (simple, multiple, and polynomial), the dependent variable will be:
* **`popularity`**: An integer from 0 to 100 representing the track's popularity on Spotify (renamed to `popularidade` further down in this notebook).

#### Business Hypotheses
We want to understand which musical characteristics influence a song's success.
1.  **H1:** More "danceable" songs (`danceability`) tend to be more popular.
2.  **H2:** Songs with higher energy (`energy`) and loudness (`loudness`) are preferred by today's audience.
3.  **H3:** Song duration (`duration_ms`) has a non-linear relationship with popularity (very short or very long tracks may be less popular).
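
As a rough illustration of how these three hypotheses could be probed before any regression is fitted, the sketch below computes Pearson correlations for H1 and H2 and fits a quadratic curve for H3. The helper name `probe_hypotheses` and the synthetic demo data are assumptions made for illustration only; the column names are those of the original dataset.

```python
import numpy as np
import pandas as pd


def probe_hypotheses(df: pd.DataFrame) -> dict:
    """Return simple descriptive statistics related to H1-H3 (hypothetical helper)."""
    # H1 / H2: linear association between each feature and popularity.
    corr = df[["danceability", "energy", "loudness"]].corrwith(df["popularity"])

    # H3: fit popularity ~ a*d^2 + b*d + c, with d in minutes.
    # A negative `a` suggests an inverted-U shape (mid-length tracks favoured).
    minutes = df["duration_ms"] / 60_000
    a, b, c = np.polyfit(minutes, df["popularity"], deg=2)

    return {
        "corr_danceability": corr["danceability"],
        "corr_energy": corr["energy"],
        "corr_loudness": corr["loudness"],
        "duration_quadratic_coef": a,
    }


# Synthetic demo data (NOT the Spotify dataset), built so that popularity
# rises with danceability and peaks around 3.5 minutes.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "energy": rng.uniform(0, 1, n),
    "loudness": rng.uniform(-30, 0, n),
    "duration_ms": rng.uniform(60_000, 480_000, n),
})
demo["popularity"] = (
    40
    + 20 * demo["danceability"]
    - 3 * (demo["duration_ms"] / 60_000 - 3.5) ** 2
    + rng.normal(0, 2, n)
)

summary = probe_hypotheses(demo)
print(summary)
```

A correlation only describes linear association, so it cannot confirm or reject H3 on its own; the sign of the quadratic coefficient is the more relevant quantity there.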

##### Importing the Libraries

In [43]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

##### Loading the dataset

In [44]:
# Forward slashes work on Windows, Linux, and macOS (and match the path used when saving below)
df = pd.read_csv("../data/raw/dataset.csv")
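
Relative paths can also be assembled with `pathlib`, which avoids hard-coding any path separator. The sketch below is one possible way to do this; the constant `DATA_DIR` and the helper `raw_path` are hypothetical names, and the folder layout is inferred from the paths used in this notebook.

```python
from pathlib import Path

# Assumed layout (taken from the paths used in this notebook):
#   <project>/data/raw/dataset.csv
#   <project>/data/processed/dataset_limpo.csv
DATA_DIR = Path("..") / "data"  # hypothetical constant, relative to the notebook folder


def raw_path(filename: str = "dataset.csv") -> Path:
    """Build the path to a raw file without hard-coding a path separator."""
    return DATA_DIR / "raw" / filename


print(raw_path().as_posix())  # ../data/raw/dataset.csv
# The resulting Path can be passed directly to pandas:
#   df = pd.read_csv(raw_path())
```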

In [45]:
df.head()

,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,...,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,...,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,...,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,...,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,...,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


##### Renaming the columns

In [46]:
df = df.rename(columns={
    "Unnamed: 0": "id",
    "track_id": "id_musica",
    "artists": "artistas",
    "album_name": "album",
    "track_name": "musica",
    "popularity": "popularidade",
    "duration_ms": "duracao_ms",
    "explicit": "explicita",
    "danceability": "dancabilidade",
    "energy": "energia",
    "key": "tom",
    "loudness": "volume",
    "mode": "modo",
    "speechiness": "falada",
    "acousticness": "acustica",
    "instrumentalness": "instrumental",
    "liveness": "ao_vivo",
    "valence": "valencia",
    "tempo": "tempo_bpm",
    "time_signature": "compasso",
    "track_genre": "genero"
})
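
By default, `DataFrame.rename` silently ignores mapping keys that match no column, so a typo in the mapping would leave a column untouched. The sketch below, on a toy frame, passes `errors="raise"` to surface such mismatches; the toy data and the shortened mapping are illustrative assumptions.

```python
import pandas as pd

# Toy frame with a few of the dataset's original column names (values are made up).
toy = pd.DataFrame({"track_name": ["a"], "popularity": [10], "key": [1]})

mapping = {"track_name": "musica", "popularity": "popularidade", "key": "tom"}

# errors="raise" makes pandas raise KeyError if any key is missing from the columns.
renamed = toy.rename(columns=mapping, errors="raise")
print(list(renamed.columns))  # ['musica', 'popularidade', 'tom']

# A key with a typo is detected instead of being silently skipped.
try:
    toy.rename(columns={"popularty": "popularidade"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True
print(typo_detected)  # True
```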


##### Converting the boolean column `explicita` to int, so that it holds the values 0 and 1


In [47]:
df["explicita"] = df["explicita"].astype(int)
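
For reference, casting a boolean column with `astype(int)` maps `False` to 0 and `True` to 1. A minimal example on made-up values:

```python
import pandas as pd

# Made-up boolean flags standing in for the `explicita` column.
flags = pd.Series([False, True, False], name="explicita")

as_int = flags.astype(int)
print(as_int.tolist())  # [0, 1, 0]
```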


##### Dropping the track_id column (renamed to `id_musica`)

In [48]:
df.drop(columns="id_musica", inplace=True)

In [49]:
df.head()

,id,artistas,album,musica,popularidade,duracao_ms,explicita,dancabilidade,energia,tom,volume,modo,falada,acustica,instrumental,ao_vivo,valencia,tempo_bpm,compasso,genero
0,0,Gen Hoshino,Comedy,Comedy,73,230666,0,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,0,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,0,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,0,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,Chord Overstreet,Hold On,Hold On,82,198853,0,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


##### Dropping null values

<p>There were only three rows with null values, so dropping them should not affect the results of our model.
</p>
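
The number of affected rows can be checked before dropping anything. Note that the per-column counts reported by `isna().sum()` are not the same as the number of rows containing a null. The sketch below shows both on a toy frame; the data is made up, and the column names follow the renamed columns used in this notebook.

```python
import pandas as pd

# Made-up data: row 1 is null in three columns, row 3 is null in one.
toy = pd.DataFrame({
    "artistas": ["A", None, "C", "D"],
    "album": ["x", None, "z", "w"],
    "musica": ["m1", None, "m3", None],
    "popularidade": [10, 0, 30, 40],
})

cols = ["artistas", "album", "musica"]

# Number of rows with at least one null in the selected columns.
rows_with_nulls = int(toy[cols].isna().any(axis=1).sum())
# Number of nulls per column (what isna().sum() reports).
nulls_per_column = toy[cols].isna().sum()

cleaned = toy.dropna(subset=cols)
print(rows_with_nulls, len(toy) - len(cleaned))  # 2 2
```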

In [50]:
df.dropna(subset=["artistas", "album", "musica"], inplace=True)



In [51]:
df.isna().sum()

id               0
artistas         0
album            0
musica           0
popularidade     0
duracao_ms       0
explicita        0
dancabilidade    0
energia          0
tom              0
volume           0
modo             0
falada           0
acustica         0
instrumental     0
ao_vivo          0
valencia         0
tempo_bpm        0
compasso         0
genero           0
dtype: int64

In [52]:
df.to_csv("../data/processed/dataset_limpo.csv", index=False)
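
`to_csv` does not create missing folders, and omitting `index=False` would write the row index as an extra unnamed column. The sketch below illustrates both points with a round trip; it uses a temporary directory and a toy frame, which are assumptions for illustration rather than this project's actual paths or data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with two of the renamed columns (values are made up).
toy = pd.DataFrame({"id": [0, 1], "popularidade": [73, 55]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "dataset_limpo.csv"
    # Create the target folder first, since to_csv will not do it.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)

    # Reading the file back confirms that no extra index column was written.
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))  # ['id', 'popularidade']
```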
