# Predictive model

* train.csv - the training set, with 20 columns plus the "popularity" target column (`popularity_target`)
* test.csv - the test set, with 20 columns
* sample_submission.csv - a sample submission file that shows how to structure your answers.

* **Imports and libraries**

In [2]:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler, LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error

* **Loading the dataframes**

In [3]:
df_test = pd.read_csv("test.csv")
df_train = pd.read_csv("train.csv")

* **Initial information**

In [4]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34200 entries, 0 to 34199
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track_unique_id   34200 non-null  int64  
 1   track_id          34200 non-null  object 
 2   artists           34199 non-null  object 
 3   album_name        34199 non-null  object 
 4   track_name        34199 non-null  object 
 5   duration_ms       34200 non-null  int64  
 6   explicit          34200 non-null  bool   
 7   danceability      34200 non-null  float64
 8   energy            34200 non-null  float64
 9   key               34200 non-null  int64  
 10  loudness          34200 non-null  float64
 11  mode              34200 non-null  int64  
 12  speechiness       34200 non-null  float64
 13  acousticness      34200 non-null  float64
 14  instrumentalness  34200 non-null  float64
 15  liveness          34200 non-null  float64
 16  valence           34200 non-null  float6

In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79800 entries, 0 to 79799
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   track_unique_id    79800 non-null  int64  
 1   track_id           79800 non-null  object 
 2   artists            79800 non-null  object 
 3   album_name         79800 non-null  object 
 4   track_name         79800 non-null  object 
 5   duration_ms        79800 non-null  int64  
 6   explicit           79800 non-null  bool   
 7   danceability       79800 non-null  float64
 8   energy             79800 non-null  float64
 9   key                79800 non-null  int64  
 10  loudness           79800 non-null  float64
 11  mode               79800 non-null  int64  
 12  speechiness        79800 non-null  float64
 13  acousticness       79800 non-null  float64
 14  instrumentalness   79800 non-null  float64
 15  liveness           79800 non-null  float64
 16  valence            798

* **Missing values**

In [6]:
missing_values = df_train.isnull().sum()
print(missing_values)

track_unique_id      0
track_id             0
artists              0
album_name           0
track_name           0
duration_ms          0
explicit             0
danceability         0
energy               0
key                  0
loudness             0
mode                 0
speechiness          0
acousticness         0
instrumentalness     0
liveness             0
valence              0
tempo                0
time_signature       0
track_genre          0
popularity_target    0
dtype: int64
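
&emsp;The check above covers only the training set, while `df_test.info()` reports one missing value in each of `artists`, `album_name` and `track_name`. A minimal sketch of the same check on the test side, using a hypothetical toy frame in place of `df_test`; the `"unknown"` placeholder is an assumption, not something the notebook prescribes:

```python
import pandas as pd

# Hypothetical toy frame standing in for df_test.
toy_test = pd.DataFrame({
    "artists": ["Artist A", None],
    "album_name": ["Album X", "Album Y"],
    "tempo": [120.0, 98.5],
})

# Count missing values per column, as done for the training set.
missing = toy_test.isnull().sum()
print(missing)

# Fill missing text fields with a placeholder (assumed value) so that
# later encoding steps see plain strings only.
text_cols = ["artists", "album_name"]
toy_test[text_cols] = toy_test[text_cols].fillna("unknown")
```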


&emsp;Creating copies of the dataframes for preprocessing:

In [7]:
df_normalized = df_train.copy()
df_test_normalized = df_test.copy()

&emsp;Defining the numerical, categorical and target columns:

In [8]:
categorical_cols = ['track_id', 'artists', 'album_name', 'track_name', 'track_genre']
numerical_cols = ['duration_ms', 'danceability', 'energy', 'key', 'loudness', 'speechiness',
                  'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature']
target_column = ['popularity_target']


&emsp;Encoding the categorical variables:

In [9]:
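label_encoder = LabelEncoder()

# Encode each categorical column of the training set as integer codes.
for col in categorical_cols:
    df_normalized[col] = label_encoder.fit_transform(df_normalized[col])

# Encode the test set too. It has a few missing text values, hence astype(str).
# The encoder is refit per set, so train and test codes for the same category differ.
for col in categorical_cols:
    df_test_normalized[col] = label_encoder.fit_transform(df_test_normalized[col].astype(str))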
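
&emsp;If the test set is encoded with an encoder fitted separately from the training set, the same category can receive different codes in each set. A minimal sketch of one alternative, assuming scikit-learn's `OrdinalEncoder` is acceptable here: fit on the training data only and map categories unseen in training to -1. The toy frames are hypothetical stand-ins for `df_normalized` and `df_test_normalized`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames standing in for the training and test sets.
toy_train = pd.DataFrame({"track_genre": ["pop", "rock", "jazz", "pop"]})
toy_test = pd.DataFrame({"track_genre": ["rock", "samba"]})  # "samba" is unseen in training

# Fit on the training data only; unseen test categories are mapped to -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(toy_train[["track_genre"]])
test_codes = encoder.transform(toy_test[["track_genre"]])

print(train_codes.ravel())
print(test_codes.ravel())
```

&emsp;With this approach "rock" gets the same code in both sets, which a per-set `fit_transform` does not guarantee.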


&emsp;Identifying and removing _outliers_:

In [10]:
iso_forest = IsolationForest(contamination=0.1, random_state=42)    
outliers = iso_forest.fit_predict(df_normalized)
df_normalized['Outlier'] = outliers

df_normalized = df_normalized[df_normalized['Outlier'] != -1].drop(columns='Outlier')
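
&emsp;The cell above fits the isolation forest on every column, including identifiers and the target. A minimal sketch of a variant that scores rows on feature columns only, so the target does not influence which rows are dropped. The toy data and the choice of `feature_cols` are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical toy data standing in for the training set.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "duration_ms": rng.normal(200_000, 30_000, 500),
    "tempo": rng.normal(120, 15, 500),
    "popularity_target": rng.integers(0, 2, 500),
})

# Score rows on the feature columns only (assumed selection).
feature_cols = ["duration_ms", "tempo"]
iso = IsolationForest(contamination=0.1, random_state=42)
labels = iso.fit_predict(toy[feature_cols])  # -1 marks an outlier, 1 an inlier

# Keep the inliers; roughly 10% of the rows are removed.
toy_clean = toy[labels == 1]
print(len(toy), len(toy_clean))
```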


&emsp;Standardizing the numerical variables:

In [11]:
df_normalized = df_normalized[categorical_cols + numerical_cols + target_column]
scaler = StandardScaler()
df_normalized[numerical_cols] = scaler.fit_transform(df_normalized[numerical_cols])
df_normalized.describe()

| statistic | track_id | artists | album_name | track_name | track_genre | duration_ms | danceability | energy | key | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | popularity_target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 71820.0 | 71820.0 | 71820.0 | 71820.0 | 71820.0 | 71820.0 | 71820.0 | 71820.0 | 71820.0 | 71820.0 | 71820.0 | 71820.0 | 71820.0 | 71820.0 | 71820.0 | 71820.0 | 71820.0 | 71820.0 |
| mean | 33420.583083 | 12867.553801 | 18070.648134 | 27639.288847 | 57.287413 | 2.9284410000000005e-17 | 4.847757e-16 | -3.640765e-17 | -3.264816e-17 | -7.697052e-17 | 7.835559000000001e-17 | -1.482029e-16 | 8.458842e-18 | 9.428394e-17 | -1.144664e-16 | 1.83819e-16 | 5.817309e-17 | 0.497536 |
| std | 19148.199272 | 7372.022352 | 10703.533596 | 15737.991025 | 32.32204 | 1.000007 | 1.000007 | 1.000007 | 1.000007 | 1.000007 | 1.000007 | 1.000007 | 1.000007 | 1.000007 | 1.000007 | 1.000007 | 1.000007 | 0.499997 |
| min | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | -2.27574 | -3.256822 | -2.872179 | -1.506974 | -8.303974 | -0.7490449 | -0.9145838 | -0.4480075 | -1.109077 | -1.950865 | -3.182094 | -12.53332 | 0.0 |
| 25% | 16940.0 | 6603.0 | 8928.75 | 14376.0 | 30.0 | -0.5576185 | -0.6622711 | -0.6950852 | -0.9404033 | -0.4762677 | -0.5576194 | -0.8666782 | -0.4480075 | -0.6090437 | -0.8116754 | -0.8056051 | 0.1940021 | 0.0 |
| 50% | 33418.5 | 12775.0 | 17919.0 | 27378.5 | 57.0 | -0.1479181 | 0.05704748 | 0.1642254 | -0.09054719 | 0.207116 | -0.3748297 | -0.4516039 | -0.4479142 | -0.4176626 | -0.03366513 | -0.006472183 | 0.1940021 | 0.0 |
| 75% | 49930.25 | 19235.0 | 27109.0 | 41024.0 | 85.0 | 0.3702953 | 0.7388364 | 0.8455978 | 0.7593089 | 0.7003795 | 0.08286423 | 0.760503 | -0.3929915 | 0.33764 | 0.8038868 | 0.5904729 | 0.1940021 | 1.0 |
| max | 66719.0 | 25774.0 | 37314.0 | 55766.0 | 113.0 | 55.88559 | 2.52775 | 1.461871 | 1.609165 | 2.655602 | 11.52665 | 2.287693 | 3.191794 | 4.517359 | 1.994719 | 3.466172 | 3.375832 | 1.0 |
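
&emsp;For predictions on the test set, the numerical columns would need the same scaling as the training set. A minimal sketch, using hypothetical toy frames, of fitting the scaler on the training data and reusing it on the test data with `transform` rather than a second `fit_transform`:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames standing in for the training and test sets.
toy_train = pd.DataFrame({"tempo": [100.0, 120.0, 140.0]})
toy_test = pd.DataFrame({"tempo": [120.0, 160.0]})

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only.
train_scaled = scaler.fit_transform(toy_train[["tempo"]])
# Reuse those statistics on the test data.
test_scaled = scaler.transform(toy_test[["tempo"]])

print(test_scaled.ravel())
```

&emsp;A test value equal to the training mean (120) maps to 0, so both sets share one scale.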


<center> <h3>Hypotheses<hr /></h3> </center>

1. **Song duration**

**Insight**: After standardization, duration_ms has a median of about -0.15 and ranges from -2.28 to 55.89 standard deviations around the mean, so a few tracks are extremely long. <br />
**Hypothesis**: Songs of medium length (around 3-4 minutes) are more popular than very short or very long songs.


2. **Danceability**

**Insight**: After standardization, danceability ranges from about -3.26 to 2.53 with a median of 0.06, without the extreme upper tail seen in duration_ms. <br />
**Hypothesis**: Songs with higher danceability tend to be more popular, especially in genres such as pop and dance.


6. **Music genre**

**Insight**: The track_genre column is categorical, so the summary statistics of its label-encoded codes (0 to 113) are not meaningful in themselves, but genre is an important factor. <br />
**Hypothesis**: Certain genres, such as pop and hip-hop, are more popular than less mainstream genres, such as jazz or classical music.

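9. **Instrumentalness**

**Insight**: The standardized instrumentalness values are about -0.45 at the minimum, the 25th percentile and the median, and -0.39 at the 75th percentile, against a maximum of 3.19. Most songs sit at the low end of the scale, which indicates that they contain vocals. <br />
**Hypothesis**: Songs with lower instrumentalness (more vocals) tend to be more popular than purely instrumental songs.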
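
&emsp;Hypotheses of this kind can be checked by comparing the share of popular tracks across groups. A minimal sketch for the duration hypothesis, assuming it is run on the raw (unstandardized) `duration_ms` values. The toy data, bucket edges and bucket names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy data standing in for the raw training set (df_train).
toy = pd.DataFrame({
    "duration_ms": [90_000, 100_000, 200_000, 220_000, 230_000, 400_000, 450_000, 210_000],
    "popularity_target": [0, 0, 1, 1, 0, 0, 1, 1],
})

# Convert to minutes and bucket into short, medium and long tracks.
duration_min = toy["duration_ms"] / 60_000
bins = [0, 3, 4, float("inf")]
names = ["up to 3 min", "3 to 4 min", "over 4 min"]
toy["duration_bucket"] = pd.cut(duration_min, bins=bins, labels=names)

# Mean of the binary target = share of popular tracks in each bucket.
popular_share = toy.groupby("duration_bucket", observed=True)["popularity_target"].mean()
print(popular_share)
```

&emsp;The same pattern applies to the other hypotheses, for example by grouping on `track_genre` or on danceability quantiles.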


&emsp;Creating the correlation matrix:

In [12]:
correlation_matrix = df_normalized.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True)
plt.title('Correlation matrix')

plt.savefig('matriz_de_correlacao.png', dpi=300) 

plt.close()

&emsp;The correlation matrix shows little correlation between the variables, which makes a supervised approach harder. The strongest relationships are between ```track_name``` and ```album_name``` (0.30), ```valence``` and ```danceability``` (0.40) and, finally, ```loudness``` and ```energy``` (0.70), the highest correlation of all.
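
&emsp;Reading the strongest pairs off a large heatmap is error-prone. A minimal sketch, on hypothetical toy data, of listing the pairs with the highest absolute correlation directly from the matrix:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: loudness is built to correlate with energy.
rng = np.random.default_rng(0)
energy = rng.normal(size=200)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": energy + rng.normal(scale=0.5, size=200),
    "tempo": rng.normal(size=200),
})

corr = toy.corr()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)
print(pairs.head(3))
```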

&emsp;Building the model:

In [33]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

features = df_normalized.drop(columns='popularity_target').values
# ravel() turns the (n, 1) target array into the 1-D array that fit() expects.
target = df_normalized[target_column].values.ravel()

X_train, X_val, y_train, y_val, track_train, track_val = train_test_split(
    features, target, df_normalized['track_id'].values, test_size=0.2, random_state=42
)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=50, min_samples_split=2, min_samples_leaf=1, bootstrap=False)
rf_model.fit(X_train, y_train)

y_pred = rf_model.predict(X_val)

accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred, average='macro') 
recall = recall_score(y_val, y_pred, average='macro')

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")

Accuracy: 0.8192703982177666
Precision: 0.819287281518922
Recall: 0.8192793830262424
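
&emsp;The precision and recall above use `average='macro'`, which computes each metric per class and then takes the unweighted mean. A minimal sketch on hypothetical labels that reproduces the macro values by hand from the confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

# Per-class precision divides the diagonal by the column sums (all predicted as that class).
per_class_precision = np.diag(cm) / cm.sum(axis=0)
# Per-class recall divides the diagonal by the row sums (all truly in that class).
per_class_recall = np.diag(cm) / cm.sum(axis=1)

# Macro averaging is the unweighted mean over classes.
macro_precision = per_class_precision.mean()
macro_recall = per_class_recall.mean()

print(macro_precision, precision_score(y_true, y_pred, average="macro"))
print(macro_recall, recall_score(y_true, y_pred, average="macro"))
```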


&emsp;Hyperparameter optimization:

In [13]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

rf_model = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, 
                           cv=5, n_jobs=-1, verbose=2, scoring='accuracy')

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
print("Best parameters found:", best_params)

best_rf_model = grid_search.best_estimator_
y_pred = best_rf_model.predict(X_val)

accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred, average='macro')
recall = recall_score(y_val, y_pred, average='macro')

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
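
&emsp;An exhaustive grid of 216 candidates with 5 folds means 1080 fits, which is slow on a training set of this size. A minimal sketch of a cheaper alternative, assuming a randomized search is acceptable: `RandomizedSearchCV` samples a fixed number of candidates from the same kind of grid. The synthetic data, the reduced grid and `n_iter=5` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train and y_train.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Sample 5 candidates instead of trying every combination.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

&emsp;The trade-off is that a randomized search may miss the best combination in the grid; raising `n_iter` narrows that gap at the cost of more fits.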
