# Question 1 - Classification
Design a solution that uses two machine learning algorithms of your choice to classify the emotion of a song (choose how you will represent the output, since some songs may belong to two categories).
Justify the choice of the two machine learning algorithms and discuss the results obtained with each. Split the dataset (randomly) into training and test sets: 80% and 20%. The results of this question must be described in detail in the report, covering three main points:
- Dataset analysis: identify instances with incomplete attributes, generate the correlation matrix, identify the presence of outliers and check whether the classes are balanced. The class balance must be illustrated with plots (e.g. a histogram);
- Analysis of the results considering the confusion matrix, specificity, sensitivity, F1 score and accuracy. Describe in detail the results obtained for each metric, justifying the differences between them.
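
As a minimal sketch of how the metrics listed above relate to each other, the block below derives sensitivity, specificity, F1 and accuracy for each label from `sklearn.metrics.multilabel_confusion_matrix`. The arrays `y_true` and `y_pred` are made-up toy values with three hypothetical labels, not results from the emotions dataset.

```python
import numpy as np
from sklearn import metrics

# Toy multi-label data (hypothetical): rows are instances, columns are labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1]])

# One 2x2 matrix per label, laid out as [[TN, FP], [FN, TP]].
mcm = metrics.multilabel_confusion_matrix(y_true, y_pred)
tn, fp = mcm[:, 0, 0], mcm[:, 0, 1]
fn, tp = mcm[:, 1, 0], mcm[:, 1, 1]

sensitivity = tp / (tp + fn)                 # recall of the positive class
specificity = tn / (tn + fp)                 # recall of the negative class
accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct decisions
f1 = 2 * tp / (2 * tp + fp + fn)             # harmonic mean of precision and recall

for i in range(y_true.shape[1]):
    print("label %d: sens=%.2f spec=%.2f f1=%.2f acc=%.2f"
          % (i, sensitivity[i], specificity[i], f1[i], accuracy[i]))
```

For the third label in this toy data, specificity is 1.00 while sensitivity is 0.50: there are no false positives, but one positive instance is missed. This is the kind of gap between metrics that the analysis is expected to explain.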

Bonus: Change two parameters of each machine learning algorithm used in the question and discuss the results obtained. Example: change the number of neighbors k and the distance function, change the kernel function of the SVM, change the neural network architecture (e.g. layers and activation function), or change the optimizer and the learning rate.
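
A sketch of the kind of experiment the bonus describes, using k-nearest neighbors as one possible choice: it performs the random 80%/20% split and then varies two parameters, `n_neighbors` and `metric`. The data comes from `make_multilabel_classification`, a synthetic stand-in used here only for illustration, and the parameter values tried are arbitrary.

```python
from sklearn import datasets, metrics, model_selection, neighbors

# Synthetic multi-label data standing in for the real dataset (assumption).
X, Y = datasets.make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=6, random_state=0)

# Random 80% / 20% train-test split.
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    X, Y, test_size=0.2, random_state=0)

results = {}
for k in (3, 7):
    for metric in ("euclidean", "manhattan"):
        # KNeighborsClassifier accepts a 2-D binary indicator matrix as target.
        clf = neighbors.KNeighborsClassifier(n_neighbors=k, metric=metric)
        clf.fit(X_train, Y_train)
        Y_pred = clf.predict(X_test)
        # Micro-averaged F1 pools the decisions made for all labels.
        results[(k, metric)] = metrics.f1_score(Y_test, Y_pred, average="micro")

for params, score in results.items():
    print(params, round(score, 3))
```

Representing the output as one binary column per emotion lets a song belong to several categories at once, which is the situation the question statement mentions.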

## Preliminaries
Importing dependencies, loading the dataset and taking a first look at the data.

In [85]:
from scipy.io import arff
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import preprocessing
from sklearn import neighbors
from sklearn import tree
from sklearn import model_selection
from sklearn import metrics

In [86]:
# emotions.arff contains all instances, both test and train
data = arff.loadarff('multilabel-classification-emotions/emotions.arff')
df = pd.DataFrame(data[0])

df.head()

| | Mean_Acc1298_Mean_Mem40_Centroid | Mean_Acc1298_Mean_Mem40_Rolloff | Mean_Acc1298_Mean_Mem40_Flux | Mean_Acc1298_Mean_Mem40_MFCC_0 | Mean_Acc1298_Mean_Mem40_MFCC_1 | Mean_Acc1298_Mean_Mem40_MFCC_2 | Mean_Acc1298_Mean_Mem40_MFCC_3 | Mean_Acc1298_Mean_Mem40_MFCC_4 | Mean_Acc1298_Mean_Mem40_MFCC_5 | Mean_Acc1298_Mean_Mem40_MFCC_6 | ... | BH_HighLowRatio | BHSUM1 | BHSUM2 | BHSUM3 | amazed-suprised | happy-pleased | relaxing-calm | quiet-still | sad-lonely | angry-aggresive |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.034741 | 0.089665 | 0.091225 | -73.302422 | 6.215179 | 0.615074 | 2.03716 | 0.804065 | 1.301409 | 0.558576 | ... | 2.0 | 0.245457 | 0.105065 | 0.405399 | b'0' | b'1' | b'1' | b'0' | b'0' | b'0' |
| 1 | 0.081374 | 0.272747 | 0.085733 | -62.584437 | 3.183163 | -0.218145 | 0.163038 | 0.620251 | 0.458514 | 0.041426 | ... | 2.0 | 0.343547 | 0.276366 | 0.710924 | b'1' | b'0' | b'0' | b'0' | b'0' | b'1' |
| 2 | 0.110545 | 0.273567 | 0.08441 | -65.235325 | 2.794964 | 0.639047 | 1.281297 | 0.757896 | 0.489412 | 0.627636 | ... | 3.0 | 0.188693 | 0.045941 | 0.457372 | b'0' | b'1' | b'0' | b'0' | b'0' | b'1' |
| 3 | 0.042481 | 0.199281 | 0.093447 | -80.305152 | 5.824409 | 0.648848 | 1.75487 | 1.495532 | 0.739909 | 0.809644 | ... | 2.0 | 0.102839 | 0.241934 | 0.351009 | b'0' | b'0' | b'1' | b'0' | b'0' | b'0' |
| 4 | 0.07455 | 0.14088 | 0.079789 | -93.697749 | 5.543229 | 1.064262 | 0.899152 | 0.890336 | 0.702328 | 0.490685 | ... | 2.0 | 0.195196 | 0.310801 | 0.683817 | b'0' | b'0' | b'0' | b'1' | b'0' | b'0' |


In [97]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 593 entries, 0 to 592
Data columns (total 78 columns):
Mean_Acc1298_Mean_Mem40_Centroid    593 non-null float64
Mean_Acc1298_Mean_Mem40_Rolloff     593 non-null float64
Mean_Acc1298_Mean_Mem40_Flux        593 non-null float64
Mean_Acc1298_Mean_Mem40_MFCC_0      593 non-null float64
Mean_Acc1298_Mean_Mem40_MFCC_1      593 non-null float64
Mean_Acc1298_Mean_Mem40_MFCC_2      593 non-null float64
Mean_Acc1298_Mean_Mem40_MFCC_3      593 non-null float64
Mean_Acc1298_Mean_Mem40_MFCC_4      593 non-null float64
Mean_Acc1298_Mean_Mem40_MFCC_5      593 non-null float64
Mean_Acc1298_Mean_Mem40_MFCC_6      593 non-null float64
Mean_Acc1298_Mean_Mem40_MFCC_7      593 non-null float64
Mean_Acc1298_Mean_Mem40_MFCC_8      593 non-null float64
Mean_Acc1298_Mean_Mem40_MFCC_9      593 non-null float64
Mean_Acc1298_Mean_Mem40_MFCC_10     593 non-null float64
Mean_Acc1298_Mean_Mem40_MFCC_11     593 non-null float64
Mean_Acc1298_Mean_Mem40_MFCC_12     593 

The only non-numeric attributes are the target attributes, which will be converted later. Finally, the last preliminary step is to check whether any data is missing:

In [88]:
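# Print the number of missing values in each column
# (same numbers as df.isnull().sum(), one column per line).
# DataFrame.items() replaces iteritems(), which was removed in pandas 2.0.
for name, column in df.items():
    print("%s %d" % (name, column.isnull().sum()))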


Mean_Acc1298_Mean_Mem40_Centroid 0
Mean_Acc1298_Mean_Mem40_Rolloff 0
Mean_Acc1298_Mean_Mem40_Flux 0
Mean_Acc1298_Mean_Mem40_MFCC_0 0
Mean_Acc1298_Mean_Mem40_MFCC_1 0
Mean_Acc1298_Mean_Mem40_MFCC_2 0
Mean_Acc1298_Mean_Mem40_MFCC_3 0
Mean_Acc1298_Mean_Mem40_MFCC_4 0
Mean_Acc1298_Mean_Mem40_MFCC_5 0
Mean_Acc1298_Mean_Mem40_MFCC_6 0
Mean_Acc1298_Mean_Mem40_MFCC_7 0
Mean_Acc1298_Mean_Mem40_MFCC_8 0
Mean_Acc1298_Mean_Mem40_MFCC_9 0
Mean_Acc1298_Mean_Mem40_MFCC_10 0
Mean_Acc1298_Mean_Mem40_MFCC_11 0
Mean_Acc1298_Mean_Mem40_MFCC_12 0
Mean_Acc1298_Std_Mem40_Centroid 0
Mean_Acc1298_Std_Mem40_Rolloff 0
Mean_Acc1298_Std_Mem40_Flux 0
Mean_Acc1298_Std_Mem40_MFCC_0 0
Mean_Acc1298_Std_Mem40_MFCC_1 0
Mean_Acc1298_Std_Mem40_MFCC_2 0
Mean_Acc1298_Std_Mem40_MFCC_3 0
Mean_Acc1298_Std_Mem40_MFCC_4 0
Mean_Acc1298_Std_Mem40_MFCC_5 0
Mean_Acc1298_Std_Mem40_MFCC_6 0
Mean_Acc1298_Std_Mem40_MFCC_7 0
Mean_Acc1298_Std_Mem40_MFCC_8 0
Mean_Acc1298_Std_Mem40_MFCC_9 0
Mean_Acc1298_Std_Mem40_MFCC_10 0
Mean_Acc1298_Std_M

~Miraculously~ None of the attributes has missing values.
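
The same check can be collapsed into a single number, and the incomplete instances themselves can be listed. The sketch below shows the idea on a small made-up DataFrame; `toy` and its columns are hypothetical and are not part of the emotions data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [0.5, 0.2, 0.1]})

per_column = toy.isnull().sum()                  # missing values in each column
total_missing = int(per_column.sum())            # missing values in the whole frame
rows_with_gaps = toy[toy.isnull().any(axis=1)]   # instances with any missing attribute

print(per_column[per_column > 0])
print("total missing:", total_missing)
print(rows_with_gaps)
```

On a frame without gaps, `total_missing` is 0 and `rows_with_gaps` is empty.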

## Handling non-numeric attributes
In this dataset, the only attributes presented in non-numeric form are the target attributes. First, we check the values that occur in each one.

In [89]:
print("Unique values of amazed-suprised:", df['amazed-suprised'].unique(), "\n")
print("Unique values of happy-pleased:", df['happy-pleased'].unique(), "\n")
print("Unique values of relaxing-calm:", df['relaxing-calm'].unique(), "\n")
print("Unique values of quiet-still:", df['quiet-still'].unique(), "\n")
print("Unique values of sad-lonely:", df['sad-lonely'].unique(), "\n")
print("Unique values of angry-aggresive:", df['angry-aggresive'].unique(), "\n")

Unique values of amazed-suprised: [b'0' b'1'] 

Unique values of happy-pleased: [b'1' b'0'] 

Unique values of relaxing-calm: [b'1' b'0'] 

Unique values of quiet-still: [b'0' b'1'] 

Unique values of sad-lonely: [b'0' b'1'] 

Unique values of angry-aggresive: [b'0' b'1'] 



For all of these attributes, the existing values are b'0' and b'1', so it is straightforward to map them to numeric values using a dictionary.

In [90]:
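# str() of a bytes value such as b'1' yields the text "b'1'",
# so the dictionary keys include the b prefix and the quotes.
to_num = {
   "b'1'" : 1, "b'0'": 0
}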

df['amazed-suprised'] = df['amazed-suprised'].apply(lambda x : to_num[str(x)])
df['happy-pleased'] = df['happy-pleased'].apply(lambda x : to_num[str(x)])
df['relaxing-calm'] = df['relaxing-calm'].apply(lambda x : to_num[str(x)])
df['quiet-still'] = df['quiet-still'].apply(lambda x : to_num[str(x)])
df['sad-lonely'] = df['sad-lonely'].apply(lambda x : to_num[str(x)])
df['angry-aggresive'] = df['angry-aggresive'].apply(lambda x : to_num[str(x)])
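
An alternative sketch of the same conversion: because the values are `bytes` objects, the dictionary can be keyed on the bytes themselves and applied with `Series.map`, which avoids the round trip through `str`. The frame `toy` below is a made-up stand-in for two of the target columns.

```python
import pandas as pd

# Made-up stand-in for two target columns as loaded from the ARFF file.
toy = pd.DataFrame({
    "happy-pleased": [b"1", b"0", b"1"],
    "sad-lonely": [b"0", b"0", b"1"],
})

to_num = {b"1": 1, b"0": 0}  # keys are bytes, matching the raw values

for col in toy.columns:
    toy[col] = toy[col].map(to_num)

# With integer targets, column sums give the number of positive
# instances per label, a quick view of how balanced the classes are.
print(toy.sum())
```

The loop over column names also removes the need to repeat one line per target.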

## Analyzing correlation

In [148]:
# Keep only the feature rows and the six target columns of the correlation matrix
correl = df.corr().iloc[:-6, -6:]
#correl = df.corr()
correl.style.background_gradient(cmap='coolwarm')

| | amazed-suprised | happy-pleased | relaxing-calm | quiet-still | sad-lonely | angry-aggresive |
|---|---|---|---|---|---|---|
| Mean_Acc1298_Mean_Mem40_Centroid | 0.345763 | 0.108973 | -0.305083 | -0.424649 | -0.335437 | 0.267874 |
| Mean_Acc1298_Mean_Mem40_Rolloff | 0.339501 | 0.0312514 | -0.348233 | -0.491562 | -0.380406 | 0.375731 |
| Mean_Acc1298_Mean_Mem40_Flux | 0.0995083 | 0.0381931 | -0.171167 | -0.393341 | -0.286426 | 0.278627 |
| Mean_Acc1298_Mean_Mem40_MFCC_0 | 0.383036 | 0.0456017 | -0.424995 | -0.492264 | -0.163028 | 0.413489 |
| Mean_Acc1298_Mean_Mem40_MFCC_1 | -0.347632 | -0.0706246 | 0.415806 | 0.597033 | 0.401008 | -0.455261 |
| Mean_Acc1298_Mean_Mem40_MFCC_2 | -0.0918324 | -0.160563 | -0.0769074 | 0.189363 | 0.129448 | 0.0652834 |
| Mean_Acc1298_Mean_Mem40_MFCC_3 | -0.220566 | 0.0712478 | 0.190066 | 0.16152 | 0.17482 | -0.190563 |
| Mean_Acc1298_Mean_Mem40_MFCC_4 | -0.0809502 | -0.211555 | -0.00334488 | 0.150739 | 0.147596 | 0.0176273 |
| Mean_Acc1298_Mean_Mem40_MFCC_5 | -0.106724 | -0.0658283 | 0.0862056 | 0.211173 | 0.144831 | -0.147237 |
| Mean_Acc1298_Mean_Mem40_MFCC_6 | -0.012209 | -0.11129 | -0.0393416 | 0.0145314 | -0.0428849 | 0.115362 |
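
One way to turn a reading of such a table into a rule is to keep, for each attribute, its largest absolute correlation with any target and flag the attributes that fall below a cutoff. The sketch below does this on a made-up frame; the column names, the data and the `threshold` of 0.2 are all assumptions for illustration, not the values used in this analysis.

```python
import pandas as pd

# Made-up data: one feature that follows the target, one unrelated to it.
toy = pd.DataFrame({
    "feat_related": [0.1, 0.2, 0.0, 0.3, 0.9, 1.1, 0.8, 1.0],
    "feat_noise":   [1, -1, 1, -1, 1, -1, 1, -1],
    "target":       [0, 0, 0, 0, 1, 1, 1, 1],
})

n_targets = 1  # the last n_targets columns are the targets
correl = toy.corr().iloc[:-n_targets, -n_targets:]

threshold = 0.2  # arbitrary cutoff (assumption)
strongest = correl.abs().max(axis=1)   # largest |r| of each feature over all targets
weak = strongest[strongest < threshold].index.tolist()

print(strongest)
print("weakly correlated:", weak)

# Drop the weakly correlated features from the frame.
reduced = toy.drop(columns=weak)
```

Taking the maximum over targets means an attribute is only flagged when it is weakly correlated with every target, not just with some of them.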


From the matrix above, we can see that many attributes have a weak correlation with all target attributes; this is the case for the attributes ..., which will be dropped: