# Assignment
1. Implement the Naive Bayes (NB) and SVM algorithms using an interface similar to Scikit-Learn's. Each algorithm must be a Python class in an external library.
2. Using your own NB and SVM implementations, train and evaluate (according to the F1-Score metric) on the following binary classification problem: [Heart Disease Dataset](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset)
3. Compare the results of your implementation with the scikit-learn implementations of NB and SVM in a grid search. Vary the hyperparameters of the scikit-learn implementations, using those presented in the course.



# Imports

In [1]:
import pandas as pd
import numpy as np

# helper libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# model libraries
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

In [2]:
df = pd.read_csv('/content/drive/MyDrive/machine-learning/heart.csv')
df.head()

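| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |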


In [3]:
categoricos = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
for i in categoricos:
  print(df[i].value_counts())

1    713
0    312
Name: sex, dtype: int64
0    497
2    284
1    167
3     77
Name: cp, dtype: int64
0    872
1    153
Name: fbs, dtype: int64
1    513
0    497
2     15
Name: restecg, dtype: int64
0    680
1    345
Name: exang, dtype: int64
1    482
2    469
0     74
Name: slope, dtype: int64
0    578
1    226
2    134
3     69
4     18
Name: ca, dtype: int64
2    544
3    410
1     64
0      7
Name: thal, dtype: int64


In [4]:
categoricos

['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']

In [5]:
lista_de_valores_unicos = list()
for i in categoricos:
  valores_unicos = df[i].unique()
  lista_de_valores_unicos.append(valores_unicos)

lista_de_valores_unicos

[array([1, 0]),
 array([0, 1, 2, 3]),
 array([0, 1]),
 array([1, 0, 2]),
 array([0, 1]),
 array([2, 0, 1]),
 array([2, 0, 1, 3, 4]),
 array([3, 2, 1, 0])]

In [6]:
for classe in range(len(lista_de_valores_unicos)):
  categorias = sorted(lista_de_valores_unicos[classe], reverse=True)
  for categoria in categorias:
    df[categoricos[classe]] = df[categoricos[classe]].replace(categoria, categoria+1)
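
The loop above shifts every category code up by one. It walks the categories in descending order because replacing in ascending order would cascade (0 becomes 1, then that 1 becomes 2, and so on). A vectorized equivalent can be sketched on a toy frame; the `toy` data and the `categorical` list below are illustrative assumptions, not the notebook's real data.

```python
import pandas as pd

# Toy stand-in for the heart dataset (values are made up for illustration).
toy = pd.DataFrame({"sex": [1, 0, 1], "cp": [0, 3, 2], "age": [52, 53, 70]})
categorical = ["sex", "cp"]

# Adding 1 to the whole block shifts every code at once,
# so there is no risk of one replacement feeding into the next.
shifted = toy.copy()
shifted[categorical] = shifted[categorical] + 1

print(shifted)
```

Only the listed categorical columns change; numeric columns such as `age` are left untouched.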

In [7]:
df.head()

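| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 2 | 1 | 125 | 212 | 1 | 2 | 168 | 1 | 1.0 | 3 | 3 | 4 | 0 |
| 1 | 53 | 2 | 1 | 140 | 203 | 2 | 1 | 155 | 2 | 3.1 | 1 | 1 | 4 | 0 |
| 2 | 70 | 2 | 1 | 145 | 174 | 1 | 2 | 125 | 2 | 2.6 | 1 | 1 | 4 | 0 |
| 3 | 61 | 2 | 1 | 148 | 203 | 1 | 2 | 161 | 1 | 0.0 | 3 | 2 | 4 | 0 |
| 4 | 62 | 1 | 1 | 138 | 294 | 2 | 2 | 106 | 1 | 1.9 | 2 | 4 | 3 | 0 |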


In [8]:
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [9]:
for i in df.columns:
  print(i, len(df[df[i] == 0]))

age 0
sex 0
cp 0
trestbps 0
chol 0
fbs 0
restecg 0
thalach 0
exang 0
oldpeak 329
slope 0
ca 0
thal 0
target 499


**Testing strategies for handling the zero values in the oldpeak column**
1. Drop the column that still contains zeros
2. Add 1 to every value in the oldpeak column

In [10]:
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [11]:
x1 = df[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
   'exang', 'slope', 'ca', 'thal']].values

In [12]:
df2 = df.copy()

In [13]:
df2.oldpeak

0       1.0
1       3.1
2       2.6
3       0.0
4       1.9
       ... 
1020    0.0
1021    2.8
1022    1.0
1023    0.0
1024    1.4
Name: oldpeak, Length: 1025, dtype: float64

In [14]:
df2['oldpeak'] = df2['oldpeak'] +1

In [15]:
df2.oldpeak

0       2.0
1       4.1
2       3.6
3       1.0
4       2.9
       ... 
1020    1.0
1021    3.8
1022    2.0
1023    1.0
1024    2.4
Name: oldpeak, Length: 1025, dtype: float64

Our SVM implementation requires the labels to be -1 and 1. To avoid a misleading score in the ranking, we map 0 to -1 beforehand and leave 1 unchanged.

In [16]:
y = df.iloc[:,-1].values

In [17]:
y = [-1 if value==0 else 1 for value in y]
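
Why the ±1 encoding matters can be seen in a minimal linear SVM sketch trained by subgradient descent on the hinge loss, where the margin `y * (w·x + b)` only makes sense for labels in {-1, 1}. The class name `LinearSVMSketch`, its hyperparameters, and the toy data are assumptions for illustration; the actual `libs.svm.SVM` code is not shown in this notebook and may differ.

```python
import numpy as np

class LinearSVMSketch:
    """Hypothetical linear SVM: minimizes lambda/2 * ||w||^2 + mean hinge loss."""

    def __init__(self, learning_rate=0.1, lambda_param=0.01, n_iters=1000):
        self.learning_rate = learning_rate
        self.lambda_param = lambda_param
        self.n_iters = n_iters

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        if not set(np.unique(y)) <= {-1, 1}:
            raise ValueError("labels must be -1 or 1")
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for _ in range(self.n_iters):
            margins = y * (X @ self.w + self.b)
            violating = margins < 1  # samples inside the margin or misclassified
            # Subgradient of the regularized mean hinge loss.
            grad_w = self.lambda_param * self.w - (
                y[violating][:, None] * X[violating]
            ).sum(axis=0) / n_samples
            grad_b = -y[violating].sum() / n_samples
            self.w -= self.learning_rate * grad_w
            self.b -= self.learning_rate * grad_b
        return self

    def predict(self, X):
        scores = np.asarray(X, dtype=float) @ self.w + self.b
        return np.where(scores >= 0, 1, -1)

# Toy, well-separated data with 0/1 labels mapped to -1/1 as in the cell above.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y01 = np.array([0] * 20 + [1] * 20)
y_pm1 = np.where(y01 == 0, -1, 1)

model = LinearSVMSketch().fit(X_toy, y_pm1)
accuracy = (model.predict(X_toy) == y_pm1).mean()
print(accuracy)
```

Because `predict` returns -1/1, any ground truth it is scored against needs the same encoding.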

In [18]:
x2 = df2.iloc[:,:-1].values

# 1. Implementations

In [19]:
from libs import naive
from libs import svm
from libs import ranking_gs  # I wrote this lib based on my implementations from the previous exercises
# to generate the ranking in a standard way that does not depend on memorizing the positions of the
# classifiers, as the professor suggested in class.
# It works for both grid search and random search.
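
The source of `libs.naive` is not included in this notebook, so the following is only a minimal sketch of what a Gaussian Naive Bayes class with a Scikit-Learn-like `fit`/`predict` interface could look like. The class name `GaussianNBSketch` and the `var_smoothing` parameter are assumptions. The sketch computes the Gaussian log-density directly instead of taking `np.log` of a density value, which avoids `log(0)` when a density underflows.

```python
import numpy as np

class GaussianNBSketch:
    """Hypothetical Gaussian Naive Bayes with a fit/predict interface."""

    def __init__(self, var_smoothing=1e-9):
        self.var_smoothing = var_smoothing

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # Per-class mean and variance of every feature.
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        variances = np.array([X[y == c].var(axis=0) for c in self.classes_])
        # Small additive term keeps variances strictly positive.
        self.vars_ = variances + self.var_smoothing * X.var(axis=0).max()
        self.log_priors_ = np.log(
            np.array([(y == c).mean() for c in self.classes_])
        )
        return self

    def _joint_log_likelihood(self, X):
        X = np.asarray(X, dtype=float)
        # diff has shape (n_samples, n_classes, n_features).
        diff = X[:, None, :] - self.means_[None, :, :]
        log_pdf = -0.5 * (
            np.log(2.0 * np.pi * self.vars_)[None, :, :]
            + diff ** 2 / self.vars_[None, :, :]
        )
        return self.log_priors_[None, :] + log_pdf.sum(axis=2)

    def predict(self, X):
        return self.classes_[np.argmax(self._joint_log_likelihood(X), axis=1)]

# Toy data with -1/1 labels, matching the encoding used in this notebook.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y_toy = np.array([-1] * 20 + [1] * 20)

nb_sketch = GaussianNBSketch().fit(X_toy, y_toy)
nb_accuracy = (nb_sketch.predict(X_toy) == y_toy).mean()
print(nb_accuracy)
```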

# 2. Training and evaluation

**Training and evaluation of the 1st strategy**
* Drop the column that still contains zeros

In [20]:
X_train, X_test, y_train, y_test = train_test_split(x1, y, train_size=.8, random_state=42)

In [21]:
s = StandardScaler()
X_train = s.fit_transform(X_train)  # note: only X_train is scaled here; X_test stays on its original scale
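
In the cell above the scaler is fitted and applied to `X_train` only, while later cells pass `X_test` to `predict` unchanged. A common pattern is to reuse the fitted scaler on the test split, sketched below on synthetic data. The variable names and the random data are assumptions, and this sketch does not reproduce the notebook's recorded results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (made up for illustration).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[50.0, 200.0], scale=[10.0, 40.0], size=(100, 2))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from the training split
X_te_scaled = scaler.transform(X_te)      # apply the same mean/std to the test split

print("raw test means:   ", X_te.mean(axis=0))
print("scaled test means:", X_te_scaled.mean(axis=0))
```

A model trained on standardized features and then fed raw values near 50 or 200 sees inputs far outside the range it was fitted on, which can push every prediction to one class.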

In [22]:
nb1 = naive.NaiveBayes()
svm1 = svm.SVM()

In [23]:
nb1.fit(X_train, y_train)

In [24]:
svm1.fit(X_train, y_train)

In [25]:
pred_nb = nb1.predict(X_test)

  posterior = np.sum(np.log(self._pdf(idx, x)))


In [26]:
pred_nb

array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1])

In [27]:
pred_svm = svm1.predict(X_test)

In [28]:
print("Test F1-Score:", f1_score(y_test, pred_nb, average=None))
print("Test F1-Score:", f1_score(y_test, pred_nb, average='micro'))
print("Test F1-Score:", f1_score(y_test, pred_nb, average='macro'))
print("Test F1-Score:", f1_score(y_test, pred_nb, average='weighted'))

Test F1-Score: [0.66449511 0.        ]
Test F1-Score: 0.4975609756097561
Test F1-Score: 0.3322475570032573
Test F1-Score: 0.3306268372129975
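
The four numbers above can be reproduced by hand for a classifier that predicts -1 for every sample. The sketch below assumes a test set of 102 negatives and 103 positives; these counts are not printed in the notebook but are inferred from the micro score (0.4976 ≈ 102/205).

```python
import numpy as np
from sklearn.metrics import f1_score

# Assumed class counts, inferred from the reported scores.
y_true = np.array([-1] * 102 + [1] * 103)
y_pred = np.full_like(y_true, -1)  # constant prediction, as in pred_nb

# zero_division=0 scores the never-predicted class as 0 without a warning.
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(per_class, micro, macro, weighted)
```

For class -1, precision is 102/205 and recall is 1, so F1 = 2·102/(205 + 102) ≈ 0.6645. Micro F1 equals accuracy here, macro is the plain mean of the two per-class scores, and weighted averages them by class support.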


In [29]:
print("Test F1-Score:", f1_score(y_test, pred_svm, average=None))
print("Test F1-Score:", f1_score(y_test, pred_svm, average='micro'))
print("Test F1-Score:", f1_score(y_test, pred_svm, average='macro'))
print("Test F1-Score:", f1_score(y_test, pred_svm, average='weighted'))

Test F1-Score: [0.67808219 0.20338983]
Test F1-Score: 0.5414634146341464
Test F1-Score: 0.4407360111446482
Test F1-Score: 0.43957822489764253


**Training and evaluation of the 2nd strategy**
* Add 1 to every value in the oldpeak column

In [30]:
X_train, X_test, y_train, y_test = train_test_split(x2, y, train_size=.8, random_state=42)

In [31]:
s = StandardScaler()
X_train = s.fit_transform(X_train)  # note: only X_train is scaled here; X_test stays on its original scale

In [32]:
nb2 = naive.NaiveBayes()
svm2 = svm.SVM()

In [33]:
nb2.fit(X_train, y_train)

In [34]:
svm2.fit(X_train, y_train)

In [35]:
pred_nb2 = nb2.predict(X_test)

  posterior = np.sum(np.log(self._pdf(idx, x)))


In [36]:
pred_nb2

array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1])

In [37]:
pred_svm2 = svm2.predict(X_test)

In [38]:
pred_svm2 = [0 if value==-1 else 1 for value in pred_svm2]  # note: y_test still uses -1/1, so f1_score below sees three labels (-1, 0, 1)

In [39]:
print("Test F1-Score:", f1_score(y_test, pred_nb2, average=None))
print("Test F1-Score:", f1_score(y_test, pred_nb2, average='micro'))
print("Test F1-Score:", f1_score(y_test, pred_nb2, average='macro'))
print("Test F1-Score:", f1_score(y_test, pred_nb2, average='weighted'))

Test F1-Score: [0.66449511 0.        ]
Test F1-Score: 0.4975609756097561
Test F1-Score: 0.3322475570032573
Test F1-Score: 0.3306268372129975


In [40]:
print("Test F1-Score:", f1_score(y_test, pred_svm2, average=None))
print("Test F1-Score:", f1_score(y_test, pred_svm2, average='micro'))
print("Test F1-Score:", f1_score(y_test, pred_svm2, average='macro'))
print("Test F1-Score:", f1_score(y_test, pred_svm2, average='weighted'))

Test F1-Score: [0.         0.         0.07476636]
Test F1-Score: 0.01951219512195122
Test F1-Score: 0.024922118380062305
Test F1-Score: 0.037565534533850004
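
The three-element array in the first line above is what `f1_score` returns when the ground truth and the predictions use different label encodings: it scores the union of labels it sees. A small sketch with made-up labels illustrates the effect.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels: ground truth in {-1, 1}, as y_test is in this notebook.
y_true = np.array([-1, -1, 1, 1])
pred_pm1 = np.array([-1, 1, 1, 1])        # predictions in the same encoding
pred_01 = np.where(pred_pm1 == -1, 0, 1)  # same predictions remapped to {0, 1}

# Mismatched encodings: the label set becomes {-1, 0, 1}.
mismatched = f1_score(y_true, pred_01, average=None, zero_division=0)
# Matching encodings: one score per real class.
matched = f1_score(y_true, pred_pm1, average=None)

print(len(mismatched), mismatched)
print(len(matched), matched)
```

With the mismatch, class -1 is never predicted and class 0 never occurs in the ground truth, so both score 0 and drag the averaged scores down.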


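We consider the first strategy the better one: Naive Bayes produced the same predictions under both strategies, while the SVM reached a micro F1-score of 0.54 under the first strategy, higher than any of its F1-scores under the second. Note, however, that the second strategy's SVM predictions were remapped to 0/1 while `y_test` still used -1/1, so the two sets of SVM scores are not directly comparable.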

# 3. Comparison of results
Compare the results of your implementation with the scikit-learn implementations of NB and SVM in a grid search. Vary the hyperparameters of the scikit-learn implementations, using those presented in the course.

In [41]:
X_train, X_test, y_train, y_test = train_test_split(x1, y, train_size=.8, random_state=42)

In [42]:
clf1 = naive.NaiveBayes()
clf2 = GaussianNB()
clf3 = svm.SVM()
clf4 = SVC()

In [43]:
classificadores = [clf1, clf2, clf3, clf4]

In [44]:
param1 = {}
param1['classifier'] = [clf1]

param2 = {}
param2['classifier'] = [clf2]

param3 = {}
param3['classifier'] = [clf3]

param4 = {}
param4['classifier__max_iter'] = [500, 1000, 2000]
param4['classifier__random_state'] = [42]
param4['classifier'] = [clf4]

In [45]:
params = [param1, param2, param3, param4]

In [46]:
pipeline = Pipeline([('standard', StandardScaler()), ('classifier', clf1)])

In [47]:
gs = GridSearchCV(pipeline, params, cv=3, n_jobs=-1, scoring='f1_micro', return_train_score=True).fit(X_train, y_train)
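
The list-of-dicts pattern used above swaps the `classifier` step of the pipeline and can carry a separate hyperparameter grid for each model. The sketch below uses only scikit-learn estimators on synthetic data. The grid values for `C`, `kernel`, and `var_smoothing` are illustrative assumptions, not necessarily the hyperparameters presented in the course.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary problem standing in for the heart dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

pipeline_demo = Pipeline(
    [("standard", StandardScaler()), ("classifier", GaussianNB())]
)

# One dict per model; "classifier" replaces the pipeline step,
# "classifier__<name>" sets a hyperparameter on that step.
param_grid = [
    {
        "classifier": [GaussianNB()],
        "classifier__var_smoothing": [1e-9, 1e-7],
    },
    {
        "classifier": [SVC(random_state=42)],
        "classifier__C": [0.1, 1, 10],
        "classifier__kernel": ["linear", "rbf"],
    },
]

gs_demo = GridSearchCV(
    pipeline_demo, param_grid, cv=3, scoring="f1_micro", return_train_score=True
).fit(X_demo, y_demo)

results = pd.DataFrame(gs_demo.cv_results_)
summary = results[
    ["param_classifier", "mean_train_score", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary)
```

Because the scaler sits inside the pipeline, it is refitted on the training part of every fold and applied to the held-out part automatically.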



In [48]:
df_results_gs = pd.DataFrame(gs.cv_results_)

In [49]:
ranking_train, ranking_test = ranking_gs.rankings(df_results_gs, classificadores)
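
The code of `ranking_gs.rankings` is not shown here. As a rough idea of how such a ranking could be built without relying on row positions, the sketch below keeps the best candidate per classifier family and sorts by train and by test score. The function name `rankings_sketch`, the `family` column, and all numbers in the toy frame are hypothetical.

```python
import pandas as pd

def rankings_sketch(results: pd.DataFrame):
    """Hypothetical helper: best candidate per family, ranked two ways."""
    # Pick, for each family, the row with the highest mean test score.
    best = results.loc[results.groupby("family")["mean_test_score"].idxmax()]
    cols = ["model", "mean_train_score", "mean_test_score"]
    by_train = best.sort_values("mean_train_score", ascending=False)[cols]
    by_test = best.sort_values("mean_test_score", ascending=False)[cols]
    return by_train, by_test

# Toy results table; the scores are made up for illustration.
toy_results = pd.DataFrame(
    {
        "family": ["SVC", "SVC", "GaussianNB"],
        "model": ["SVC(max_iter=500)", "SVC(max_iter=1000)", "GaussianNB()"],
        "mean_train_score": [0.90, 0.92, 0.80],
        "mean_test_score": [0.88, 0.85, 0.79],
    }
)

toy_ranking_train, toy_ranking_test = rankings_sketch(toy_results)
print(toy_ranking_test)
```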

In [50]:
ranking_train

| | model | mean_train_score | mean_test_score |
|---|---|---|---|
| 3 | `SVC(max_iter=500, random_state=42)` | 0.938418 | 0.90363 |
| 2 | `<libs.svm.SVM object at 0x7f0f3a2d5ad0>` | 0.861583 | 0.853627 |
| 0 | `<libs.naive.NaiveBayes object at 0x7f0f3a2d5a90>` | 0.843292 | 0.826796 |
| 1 | `GaussianNB()` | 0.843292 | 0.826796 |


In [51]:
ranking_test

| | model | mean_train_score | mean_test_score |
|---|---|---|---|
| 3 | `SVC(max_iter=500, random_state=42)` | 0.938418 | 0.90363 |
| 2 | `<libs.svm.SVM object at 0x7f0f3a2d5ad0>` | 0.861583 | 0.853627 |
| 0 | `<libs.naive.NaiveBayes object at 0x7f0f3a2d5a90>` | 0.843292 | 0.826796 |
| 1 | `GaussianNB()` | 0.843292 | 0.826796 |


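Comparing the scores from sections 2 and 3, we have good reason to suspect that something is wrong with the training runs in section 2, lol. One likely culprit: in section 2 only `X_train` goes through `StandardScaler`, so the models predict on an unscaled `X_test`, whereas the pipeline in section 3 scales the data inside every fold.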

# End