# Data Mining

This notebook cleans and prepares the dataframe that will be used throughout the project. The main points evaluated are:
* Number of null values and how they are handled
* Correlations that could yield additional information (feature engineering)


### Requirements

In [30]:
import pandas as pd
import numpy as np

In [31]:

df_rec = pd.read_parquet('data/raw_recommendation_data.parquet')

In [32]:
df_rec.isnull().sum()

id_date        0
user_id        0
id_tracks      0
plays          0
holiday        0
id_artist    511
id_genre     511
Feature1     511
Feature2     511
Feature3     511
Feature4     511
Feature5     511
Feature6     511
Feature7     511
Feature8     511
Feature9     511
Feature10    511
dtype: int64

The dataset has 511 rows with null values, about 13.74% of the data (511 of the 3,719 rows: 3,208 non-null plus 511 null). The missing values all come from the artist information for the track the user listened to.
Since this information is important, we will impute it instead of dropping the rows.
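
The share of missing values quoted above can be checked directly from the dataframe. The sketch below is a minimal illustration on a small made-up frame (`toy` and its values are hypothetical stand-ins for `df_rec`); it reports the null count and null percentage per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_rec (values are made up for illustration).
toy = pd.DataFrame({
    "plays": [3, 10, 12, 17],
    "id_artist": [np.nan, 66.0, np.nan, 347.0],
    "id_genre": [np.nan, 3.0, np.nan, 2.0],
})

# isnull().sum() counts the nulls; isnull().mean() gives the fraction of nulls.
null_report = pd.DataFrame({
    "n_null": toy.isnull().sum(),
    "pct_null": (toy.isnull().mean() * 100).round(2),
})
print(null_report)
```

Running the same two expressions on `df_rec` should give the per-column counts shown in the output above, together with their percentages.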

## ID_GENRE = Fill with a new value
For the null genre values we chose to create a new "unknown" genre that covers every track whose genre was not identified.


In [34]:
df_rec.id_genre.value_counts()

id_genre
3.0    820
1.0    804
2.0    792
4.0    792
Name: count, dtype: int64

In [35]:
df_rec.id_genre = np.where(df_rec.id_genre.isnull(), 5.0, df_rec.id_genre)
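
As a side note, the `np.where` call above could also be written with `Series.fillna`. The sketch below uses a small hypothetical series (not the real data) to show that both forms give the same result; the constant name `UNKNOWN_GENRE` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical genre column with two missing entries.
genre = pd.Series([3.0, np.nan, 1.0, np.nan, 4.0], name="id_genre")

UNKNOWN_GENRE = 5.0  # same placeholder value the notebook uses

# Form used in the notebook: replace nulls through np.where.
filled_where = pd.Series(np.where(genre.isnull(), UNKNOWN_GENRE, genre), name="id_genre")

# Equivalent pandas form.
filled_fillna = genre.fillna(UNKNOWN_GENRE)

print(filled_fillna.value_counts().sort_index())
```

Naming the placeholder makes it easier to find the "unknown" category later, for instance when the genre is treated as a categorical feature.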

## ID_ARTIST = Fill with a new value
The artist ID is an important point of attention in this project, because the chosen imputation method directly affects the modeling and the final results.
To avoid assuming that the missing artist is simply the one with the most appearances or the most plays, we created a single dedicated ID for these artists. Note that the unidentified rows were also assigned a "new genre".

In [36]:
df_rec.id_artist.value_counts().sort_values(ascending=False)

id_artist
463.0    18
237.0    17
468.0    16
151.0    16
178.0    15
         ..
314.0     1
494.0     1
409.0     1
150.0     1
449.0     1
Name: count, Length: 493, dtype: int64

In [37]:
df_rec.id_artist.describe()

count    3208.000000
mean      250.372506
std       142.287096
min         1.000000
25%       126.000000
50%       250.000000
75%       371.250000
max       500.000000
Name: id_artist, dtype: float64

In [38]:

df_rec.id_artist = np.where(df_rec.id_artist.isnull(), 501, df_rec.id_artist)
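
The placeholder 501 works because the largest existing artist ID reported by `describe()` is 500. A slightly more defensive variant, sketched below on a hypothetical series, derives the placeholder from the data and checks that it does not collide with a real ID. The name `unknown_artist_id` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical artist column; 500.0 plays the role of the current maximum ID.
artist = pd.Series([463.0, np.nan, 237.0, 500.0, np.nan], name="id_artist")

# Derive the placeholder from the data instead of hard-coding it.
unknown_artist_id = artist.max() + 1

# Guard against accidentally reusing an ID that already exists.
assert not artist.eq(unknown_artist_id).any()

artist_filled = artist.fillna(unknown_artist_id)
print(artist_filled.tolist())
```

If new artists are added to the raw data later, this version keeps the "unknown" ID distinct without a manual update.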

## Techniques for the missing values

In [39]:
from sklearn.impute import SimpleImputer

# Fill the remaining nulls in Feature1..Feature10 with each column's median.
feature_cols = [f'Feature{i}' for i in range(1, 11)]
imp = SimpleImputer(strategy="median")
df_rec[feature_cols] = imp.fit_transform(df_rec[feature_cols])
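
To make the effect of the median strategy concrete, the sketch below reproduces the same per-column fill with pandas alone on a made-up frame. It is an illustration of what the imputer computes, not a replacement for the cell above; the toy values are hypothetical and chosen so that the median and the mean differ clearly.

```python
import numpy as np
import pandas as pd

# Toy features with one missing value each; 100.0 is a deliberate extreme value.
toy = pd.DataFrame({
    "Feature1": [1.0, 2.0, np.nan, 100.0],
    "Feature2": [10.0, np.nan, 30.0, 20.0],
})

# Per-column medians, ignoring nulls (what strategy="median" fits).
medians = toy.median()

# Fill each column's nulls with that column's median.
toy_filled = toy.fillna(medians)

# For comparison: the mean is pulled towards the extreme value.
means = toy.mean()
print(medians["Feature1"], round(means["Feature1"], 2))
```

Every imputed row receives the same value per column, which is why the rows with the placeholder artist in the output below share identical feature values.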

In [40]:
df_rec

Unnamed: 0,id_date,user_id,id_tracks,plays,holiday,id_artist,id_genre,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10
0,2023-01-01,41,611,3,1,501.0,5.0,421.0,1148.0,744.0,448.0,649.0,470.0,1238.0,539.0,414.0,901.0
1,2023-01-01,41,128,10,1,66.0,3.0,387.0,1460.0,362.0,1026.0,1217.0,454.0,1362.0,503.0,323.0,79.0
2,2023-01-01,41,478,12,1,501.0,5.0,421.0,1148.0,744.0,448.0,649.0,470.0,1238.0,539.0,414.0,901.0
3,2023-01-01,1,3003,17,1,501.0,5.0,421.0,1148.0,744.0,448.0,649.0,470.0,1238.0,539.0,414.0,901.0
4,2023-01-01,1,1778,13,1,347.0,2.0,419.0,1226.0,860.0,497.0,945.0,385.0,1094.0,515.0,412.0,1359.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3714,2023-07-30,30,1624,5,1,293.0,1.0,452.0,1077.0,367.0,129.0,1045.0,381.0,1289.0,574.0,459.0,901.0
3715,2023-07-30,30,1981,2,1,130.0,1.0,421.0,1617.0,1412.0,334.0,1323.0,490.0,1250.0,823.0,666.0,-867.0
3716,2023-07-30,30,1774,3,1,362.0,4.0,396.0,1448.0,1313.0,1132.0,513.0,672.0,1261.0,630.0,414.0,981.0
3717,2023-07-30,30,1973,17,1,456.0,3.0,370.0,853.0,-107.0,194.0,642.0,354.0,1318.0,335.0,15.0,173.0


## Feature Engineering

In [41]:
df_rec.corr()

Unnamed: 0,id_date,user_id,id_tracks,plays,holiday,id_artist,id_genre,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10
id_date,1.0,-0.044736,0.003031,-0.007352,,-0.003408,-0.004578,-0.030925,-0.008527,-0.000259,0.017292,0.01008,-0.01217,0.027819,-0.00066,-0.0203,0.014416
user_id,-0.044736,1.0,-0.01529,-0.007363,,0.00692,0.025261,-0.010483,-0.005253,0.018985,-0.017119,-0.004294,0.002947,0.0066,0.005726,-0.022627,0.011918
id_tracks,0.003031,-0.01529,1.0,-0.008675,,0.024689,-0.016587,0.036243,-0.01659,0.007248,-0.016574,-0.01405,0.026182,-0.00646,-0.029121,-0.035325,0.029778
plays,-0.007352,-0.007363,-0.008675,1.0,,0.00791,-0.01254,-0.021328,0.003067,-0.011929,0.020447,-0.001088,0.004385,0.004535,-0.016286,0.003406,0.01263
holiday,,,,,,,,,,,,,,,,,
id_artist,-0.003408,0.00692,0.024689,0.00791,,1.0,0.363274,-0.000528,0.013893,-0.037329,-0.017251,0.058213,0.015147,-0.049456,-0.068686,-0.014493,0.069638
id_genre,-0.004578,0.025261,-0.016587,-0.01254,,0.363274,1.0,0.002292,-0.01986,-0.010033,-0.021396,0.019926,-0.006536,0.010973,0.015981,-0.000798,0.010378
Feature1,-0.030925,-0.010483,0.036243,-0.021328,,-0.000528,0.002292,1.0,0.046217,0.025678,-0.076015,0.085121,-0.087508,-0.09642,0.019207,0.005944,-0.051948
Feature2,-0.008527,-0.005253,-0.01659,0.003067,,0.013893,-0.01986,0.046217,1.0,-0.000688,0.06154,0.065659,-0.017137,-0.061341,0.04452,-0.018375,0.007827
Feature3,-0.000259,0.018985,0.007248,-0.011929,,-0.037329,-0.010033,0.025678,-0.000688,1.0,0.001853,0.021277,0.024297,-0.031969,0.049956,0.066205,-0.061463


At this stage we did not find a correlation strong enough to justify combining two features, so for now we keep the features as they are.
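
Reading a full correlation matrix by eye is error-prone, so one option is to list the feature pairs ordered by absolute correlation. The sketch below does this on synthetic data (all column names and values are hypothetical). It also includes a constant column, which yields NaN correlations; the all-NaN `holiday` row in the matrix above is consistent with that column having a single value.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is built from "a", "c" is independent, "const" never varies.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
    "const": 1,
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)
    .stack()
    .dropna()  # removes pairs involving the constant column
    .sort_values(key=np.abs, ascending=False)
)
print(pairs)
```

Applied to `df_rec`, a listing like this would put `id_artist` and `id_genre` (0.363) at the top. That value may simply reflect the placeholders 501 and 5.0 being assigned to the same 511 rows rather than a real relationship.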

### Play-count "outlier" flags

In [42]:
# 1 if the row's plays exceed the global threshold (mean of the per-track mean plays), else 0.
df_rec['acima_media_track'] = np.where(df_rec.plays > df_rec.groupby('id_tracks')['plays'].mean().mean(), 1, 0)


In [43]:
# 1 if the row's plays fall below the same global per-track threshold, else 0.
df_rec['abaixo_media_track'] = np.where(df_rec.plays < df_rec.groupby('id_tracks')['plays'].mean().mean(), 1, 0)


### Premium artists by genre (artists with the most plays)

In [44]:
# 1 if the row's plays exceed the global threshold (mean of the per-artist mean plays), else 0.
df_rec['acima_media_artist'] = np.where(df_rec.plays > df_rec.groupby('id_artist')['plays'].mean().mean(), 1, 0)


In [45]:
# 1 if the row's plays fall below the same global per-artist threshold, else 0.
df_rec['abaixo_media_artist'] = np.where(df_rec.plays < df_rec.groupby('id_artist')['plays'].mean().mean(), 1, 0)
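
The four flags above compare every row against a single number: the mean of the per-group means. If the intent were instead to compare each row with the mean of its own track or artist, `groupby(...).transform('mean')` is one way to do it. The sketch below contrasts the two on a made-up frame; the column names `above_global` and `above_own_track_mean` are hypothetical, and the second flag is an alternative, not what the notebook computes.

```python
import numpy as np
import pandas as pd

# Toy data: track 1 has few plays, track 2 has many.
toy = pd.DataFrame({
    "id_tracks": [1, 1, 2, 2, 2],
    "plays": [2, 4, 10, 20, 30],
})

# Notebook approach: one global threshold (mean of the per-track means).
global_threshold = toy.groupby("id_tracks")["plays"].mean().mean()
toy["above_global"] = np.where(toy["plays"] > global_threshold, 1, 0)

# Alternative: compare each row with the mean of its own track.
own_track_mean = toy.groupby("id_tracks")["plays"].transform("mean")
toy["above_own_track_mean"] = np.where(toy["plays"] > own_track_mean, 1, 0)

print(toy)
```

With a global threshold, rows from a rarely played track are almost never flagged, while the per-group version flags the relatively strong rows within each track. In both versions a row exactly equal to the threshold gets 0 in the "above" flag and would also get 0 in a matching "below" flag.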


## Saving the final dataset

In [46]:
df_rec.to_parquet(r'data/processed_recommendation_data.parquet')

### Conclusions

* Variables created
* Correlations between the variables
* How the null values were handled