# Data Mining

This notebook cleans and prepares the dataframe used in the rest of the project. The main points evaluated are:
* Number of null values and how they are handled
* Correlations that could yield additional information (feature engineering)


### Requirements

In [24]:
import pandas as pd
import numpy as np

In [25]:

df_rec = pd.read_parquet('data/raw_recommendation_data.parquet')

In [26]:
df_rec.isnull().sum()

id_date         0
user_id         0
id_tracks       0
plays           0
holiday         0
id_artist    1453
id_genre     1453
Feature1     1453
Feature2     1453
Feature3     1453
Feature4     1453
Feature5     1453
Feature6     1453
Feature7     1453
Feature8     1453
Feature9     1453
Feature10    1453
dtype: int64

According to the output above, each artist-related column (`id_artist`, `id_genre`, `Feature1` to `Feature10`) has 1453 null values, roughly 11% of the rows. The missing rows all come from information about the artist of the track the user listened to.
Since we consider this information important, we will use imputation techniques to fill it in.
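As a minimal sketch of how the null count and its share of the dataset can be computed together, the snippet below uses a small made-up frame (`toy`, with hypothetical values) in place of `df_rec`. Only the column names come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_rec (values are made up for illustration).
toy = pd.DataFrame({
    "plays": [3, 7, 1, 9],
    "id_artist": [10.0, np.nan, 12.0, 10.0],
    "id_genre": [1.0, np.nan, 2.0, 1.0],
})

# Count of nulls and percentage of nulls for every column.
null_summary = pd.DataFrame({
    "n_null": toy.isnull().sum(),
    "pct_null": (toy.isnull().mean() * 100).round(2),
})

# Number of rows that have at least one null value.
rows_with_null = int(toy.isnull().any(axis=1).sum())

print(null_summary)
print(rows_with_null)
```

Reporting the count and the percentage from the same frame keeps the written summary consistent with the `isnull().sum()` output.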

## ID_GENRE = Fill with a new value
For the null genre values, we chose to create a new "unknown" genre (id 5.0) that covers every track whose genre was not identified.


In [27]:
df_rec.id_genre.value_counts()

id_genre
3.0    2915
4.0    2872
1.0    2812
2.0    2724
Name: count, dtype: int64

In [28]:
df_rec.id_genre = np.where(df_rec.id_genre.isnull(), 5.0, df_rec.id_genre)
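The cell above hard-codes 5.0 as the new genre id. A possible variation, sketched here on a made-up series (`genre`, hypothetical values), derives the placeholder from the data so it cannot collide with an existing id if the set of genres changes. The name `unknown_genre` is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

# Made-up genre ids with two missing entries.
genre = pd.Series([3.0, np.nan, 1.0, 4.0, np.nan, 2.0], name="id_genre")

# Series.max() skips NaN by default, so this is the largest known id plus one.
unknown_genre = genre.max() + 1

# fillna is equivalent to the np.where(isnull, value, original) pattern.
filled = genre.fillna(unknown_genre)

print(unknown_genre)
print(filled.tolist())
```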

## ID_ARTIST = Fill with a new value
The artist ID is an important point of attention in this project, because the chosen imputation method directly influences the modeling and the final results.
To avoid assuming that a missing artist is the one with the most appearances or the most plays, we created a single dedicated id for these artists. Note that we also used a "new genre" for the unidentified ones.

In [29]:
df_rec.id_artist.value_counts().sort_values(ascending=False)

id_artist
466.0    51
454.0    48
196.0    48
408.0    47
26.0     47
         ..
281.0     5
465.0     5
96.0      5
63.0      4
117.0     3
Name: count, Length: 500, dtype: int64

In [30]:
df_rec.id_artist.describe()

count    11323.000000
mean       251.359445
std        146.297191
min          1.000000
25%        126.000000
50%        251.000000
75%        380.000000
max        500.000000
Name: id_artist, dtype: float64

In [31]:

df_rec.id_artist = np.where(df_rec.id_artist.isnull(), 501, df_rec.id_artist)
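To illustrate the trade-off described in this section, the sketch below compares filling missing artists with the most frequent id against filling them with a dedicated id. The series `artist` and its values are made up, and `unknown_artist` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Made-up artist ids: artist 7 appears twice, artist 3 once, three are missing.
artist = pd.Series([7.0, 7.0, 3.0, np.nan, np.nan, np.nan], name="id_artist")

# Option A: mode imputation assigns every unknown row to the most frequent artist.
mode_filled = artist.fillna(artist.mode().iloc[0])

# Option B: a dedicated id keeps the unknown rows separate from real artists.
unknown_artist = artist.max() + 1
sentinel_filled = artist.fillna(unknown_artist)

print(mode_filled.value_counts().to_dict())
print(sentinel_filled.value_counts().to_dict())
```

In this toy case, option A inflates the count of the most frequent artist, while option B leaves the counts of the real artists unchanged.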

## Techniques for missing values

In [32]:
from sklearn.impute import SimpleImputer

# Fill the missing feature values with each column's median
feature_cols = [f'Feature{i}' for i in range(1, 11)]
imp = SimpleImputer(strategy="median")
df_rec[feature_cols] = imp.fit_transform(df_rec[feature_cols])
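For reference, the same per-column median fill can be sketched in pandas alone, which should give the same values as `SimpleImputer(strategy="median")` for purely numeric columns. The frame `toy` and its values are made up, and only two of the ten feature columns are shown.

```python
import numpy as np
import pandas as pd

# Made-up feature values with one missing entry per column.
toy = pd.DataFrame({
    "Feature1": [100.0, np.nan, 140.0, 120.0],
    "Feature2": [600.0, 640.0, np.nan, 500.0],
})
feature_cols = ["Feature1", "Feature2"]

# DataFrame.median() ignores NaN, so each median uses only the observed values.
medians = toy[feature_cols].median()

# fillna with a Series fills each column with its own median.
toy[feature_cols] = toy[feature_cols].fillna(medians)

print(toy)
```

Every imputed row receives the same value in a given column, which is why the rows with the placeholder artist id share identical feature values in the output.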

In [33]:
df_rec

Unnamed: 0,id_date,user_id,id_tracks,plays,holiday,id_artist,id_genre,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10
0,2021-07-30,9,182,18,1,93.0,3.0,0.0,734.0,380.0,596.0,1227.0,-158.0,228.0,2294.0,534.0,2169.0
1,2021-07-30,9,3663,14,1,76.0,4.0,141.0,540.0,849.0,573.0,1242.0,-251.0,230.0,981.0,934.0,394.0
2,2021-07-30,9,3268,9,1,501.0,5.0,121.0,627.0,754.0,580.0,1208.0,1489.0,232.0,1403.0,1573.0,1098.0
3,2021-07-30,4,178,3,1,415.0,1.0,99.0,589.0,930.0,585.0,1368.0,2313.0,229.0,1863.0,-313.0,1033.0
4,2021-07-30,4,379,1,1,366.0,4.0,118.0,488.0,947.0,528.0,782.0,2289.0,233.0,1339.0,2218.0,2011.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12771,2023-07-30,23,3193,5,1,443.0,4.0,101.0,413.0,564.0,529.0,1885.0,1510.0,223.0,670.0,3441.0,1096.0
12772,2023-07-30,23,115,15,1,408.0,4.0,226.0,605.0,811.0,619.0,1200.0,2053.0,246.0,2527.0,2623.0,2046.0
12773,2023-07-30,18,1661,1,1,371.0,1.0,86.0,1075.0,1036.0,602.0,1009.0,1746.0,212.0,1851.0,1160.0,1392.0
12774,2023-07-30,18,1965,19,1,501.0,5.0,121.0,627.0,754.0,580.0,1208.0,1489.0,232.0,1403.0,1573.0,1098.0


## Feature Engineering

In [34]:
df_rec.corr()

Unnamed: 0,id_date,user_id,id_tracks,plays,holiday,id_artist,id_genre,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10
id_date,1.0,0.003089,-0.002724,0.008199,-0.005541,0.01567,0.007709,-0.0032,-0.000717,0.004023,-0.010739,-0.001663,0.012991,0.005332,0.015231,-0.007628,-0.017287
user_id,0.003089,1.0,-0.014795,0.011555,0.030982,0.022001,0.002877,-0.009174,0.004732,0.002481,0.02629,-0.009265,-0.019317,-0.007331,0.004088,0.006761,-0.003104
id_tracks,-0.002724,-0.014795,1.0,0.006058,0.007294,-0.008165,0.017849,-0.028343,0.005711,0.023167,-0.013635,0.012029,-0.006971,0.012702,0.044653,-0.018425,0.010407
plays,0.008199,0.011555,0.006058,1.0,0.005657,-0.003603,-0.004443,0.008819,0.013403,-0.002998,-0.000615,-0.009854,-0.012776,-0.00957,-0.00303,0.005667,-0.003675
holiday,-0.005541,0.030982,0.007294,0.005657,1.0,0.003469,-0.007458,-0.016034,0.00654,0.003531,0.010443,-0.012436,0.008067,-0.00819,0.004305,0.016083,-0.001926
id_artist,0.01567,0.022001,-0.008165,-0.003603,0.003469,1.0,0.312,-0.019356,-0.028848,-0.007228,0.005215,0.008128,0.04219,0.028921,0.06535,-0.01178,-0.010838
id_genre,0.007709,0.002877,0.017849,-0.004443,-0.007458,0.312,1.0,-0.000245,-0.008741,0.016423,-0.054634,-0.007618,0.009715,0.009226,-0.029958,-0.023257,0.015684
Feature1,-0.0032,-0.009174,-0.028343,0.008819,-0.016034,-0.019356,-0.000245,1.0,-0.021633,-0.013588,0.010626,-0.059901,0.022516,-0.019012,-0.025861,-0.035818,0.011899
Feature2,-0.000717,0.004732,0.005711,0.013403,0.00654,-0.028848,-0.008741,-0.021633,1.0,-0.033846,0.046326,0.026742,-0.005318,-0.019116,0.07205,-0.062417,-0.043491
Feature3,0.004023,0.002481,0.023167,-0.002998,0.003531,-0.007228,0.016423,-0.013588,-0.033846,1.0,0.008911,-0.026061,-0.013514,-0.043414,0.033533,0.052753,-0.062844


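At this stage we did not find any correlation strong enough to justify deriving a feature from a pair of variables, so for now we proceed with the columns as they are. The largest off-diagonal value shown is 0.312, between `id_artist` and `id_genre`, which likely reflects the placeholder ids (501 and 5.0) being assigned to the same rows.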
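Reading a full correlation matrix by eye is error-prone. One way to list the strongest pairs is sketched below on synthetic data (columns `a`, `b`, `c` are made up, and `b` is deliberately built from `a`). The helper names `pairs` and `ranked` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data: b depends strongly on a, c is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```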

### Number of plays "outlier"

In [35]:
# Flag rows whose plays exceed the average of the per-track mean plays
df_rec['acima_media_track'] = np.where(df_rec.plays > df_rec.groupby('id_tracks')['plays'].mean().mean(), 1, 0)


In [36]:
# Flag rows whose plays fall below the average of the per-track mean plays
df_rec['abaixo_media_track'] = np.where(df_rec.plays < df_rec.groupby('id_tracks')['plays'].mean().mean(), 1, 0)
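The two flags above compare every row with a single global threshold (the mean of the per-track means). If the intent were instead to compare each row with the mean of its own track, `groupby(...).transform("mean")` is one way to do it. The sketch below shows both on made-up data. The column names `above_global_mean` and `above_own_track_mean` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Made-up plays for two tracks.
toy = pd.DataFrame({
    "id_tracks": [1, 1, 2, 2, 2],
    "plays": [10, 2, 3, 4, 2],
})

# Global threshold, as in the cells above: mean of the per-track means.
global_threshold = toy.groupby("id_tracks")["plays"].mean().mean()
toy["above_global_mean"] = (toy["plays"] > global_threshold).astype(int)

# Per-track threshold: transform broadcasts each track's mean back to its rows.
track_mean = toy.groupby("id_tracks")["plays"].transform("mean")
toy["above_own_track_mean"] = (toy["plays"] > track_mean).astype(int)

print(toy)
```

The two definitions can disagree: a row can be below the global threshold and still above the mean of its own track.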


### Premium artists by genre (the artists with the most plays)

In [37]:
# Flag rows whose plays exceed the average of the per-artist mean plays
df_rec['acima_media_artist'] = np.where(df_rec.plays > df_rec.groupby('id_artist')['plays'].mean().mean(), 1, 0)


In [38]:
# Flag rows whose plays fall below the average of the per-artist mean plays
df_rec['abaixo_media_artist'] = np.where(df_rec.plays < df_rec.groupby('id_artist')['plays'].mean().mean(), 1, 0)
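The heading mentions genre, but the cells above group by artist only. One possible reading of "premium artist by genre" is an artist whose total plays exceed the average artist total within the same genre. The sketch below implements that reading on made-up data. The columns `artist_plays`, `genre_mean` and `premium_artist` are hypothetical names, and this is an assumption about the intent rather than the notebook's method.

```python
import pandas as pd

# Made-up listening rows for two genres.
toy = pd.DataFrame({
    "id_genre":  [1, 1, 1, 1, 2, 2, 2],
    "id_artist": [10, 10, 11, 12, 20, 21, 21],
    "plays":     [9, 8, 2, 3, 4, 1, 1],
})

# Total plays per (genre, artist) pair.
artist_totals = (
    toy.groupby(["id_genre", "id_artist"], as_index=False)["plays"].sum()
       .rename(columns={"plays": "artist_plays"})
)

# Average artist total within each genre, broadcast back to every artist.
artist_totals["genre_mean"] = (
    artist_totals.groupby("id_genre")["artist_plays"].transform("mean")
)
artist_totals["premium_artist"] = (
    artist_totals["artist_plays"] > artist_totals["genre_mean"]
).astype(int)

# Attach the flag to the original rows; a left merge keeps the row order.
toy = toy.merge(
    artist_totals[["id_genre", "id_artist", "premium_artist"]],
    on=["id_genre", "id_artist"],
    how="left",
)

print(toy)
```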


## Assembling the final dataset

In [39]:
df_rec.to_parquet(r'data/processed_recommendation_data.parquet')
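Before writing the processed file, a quick sanity check can confirm that no nulls remain and that the engineered flags are binary. The helper below is a hypothetical sketch (`check_ready_to_save` is not part of the notebook), demonstrated on made-up frames.

```python
import numpy as np
import pandas as pd


def check_ready_to_save(df: pd.DataFrame, flag_cols: list[str]) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # Any column that still contains nulls.
    null_cols = df.columns[df.isnull().any()].tolist()
    if null_cols:
        problems.append(f"columns with nulls: {null_cols}")
    # Flag columns must contain only 0 or 1.
    for col in flag_cols:
        if not df[col].isin([0, 1]).all():
            problems.append(f"non-binary flag: {col}")
    return problems


clean = pd.DataFrame({"plays": [1, 2], "acima_media_track": [0, 1]})
dirty = pd.DataFrame({"plays": [1.0, np.nan], "acima_media_track": [0, 2]})

print(check_ready_to_save(clean, ["acima_media_track"]))
print(check_ready_to_save(dirty, ["acima_media_track"]))
```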

### Conclusions

The correlations between the variables did not reveal many insights, so we created new variables based on the "importance" of the artist and of the track.
In addition, we defined a null-handling approach for each variable.