<h1>Activity 02 - Improving pattern recognition (RP) performance on an existing dataset</h1>
<p>Activity 02 works with a pre-built dataset; the developer's options are limited to applying the preprocessing techniques listed below:</p>
<ul><li>Selection</li>
<li>Cleaning</li>
<li>Encoding</li>
<li>Enrichment</li>
<li>Normalization</li>
<li>Feature construction</li>
<li>Prevalence correction</li>
<li>Dataset partitioning</li>
</ul>
<p>Choose a dataset from the UCI Machine Learning Repository that is suited to classification problems. (<a target="_blank" href="https://archive.ics.uci.edu/datasets">https://archive.ics.uci.edu/datasets</a>)</p>

<h2>Team members</h2>
<ul>
  <li>Lugi Alves</li>
  <li>Renan Santos</li>
  <li>Umberto Ferreira</li>
  <li>William Cardoso</li>
</ul>

Option 01 - load the data file from the local folder into Colab (the cell below downloads it directly from the UCI repository instead).


In [12]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report
import requests
from io import StringIO

# Dataset available as CSV
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed

# Column names according to the UCI documentation
colunas = [
    'age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
    'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'
]

# Read the content as CSV (missing values are marked as '?')
df = pd.read_csv(StringIO(response.text), header=None, names=colunas, na_values='?')

# Initial preview
df.head()


    age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal  target
0  63.0  1.0  1.0     145.0  233.0  1.0      2.0    150.0    0.0      2.3    3.0  0.0   6.0       0
1  67.0  1.0  4.0     160.0  286.0  0.0      2.0    108.0    1.0      1.5    2.0  3.0   3.0       2
2  67.0  1.0  4.0     120.0  229.0  0.0      2.0    129.0    1.0      2.6    2.0  2.0   7.0       1
3  37.0  1.0  3.0     130.0  250.0  0.0      0.0    187.0    0.0      3.5    3.0  0.0   3.0       0
4  41.0  0.0  2.0     130.0  204.0  0.0      2.0    172.0    0.0      1.4    1.0  0.0   3.0       0


In [13]:
# Check for missing values
print("Missing values per column:\n", df.isna().sum())

# Drop rows with missing values
df = df.dropna()

# Convert columns to the correct types
df = df.astype({
    'ca': int, 'thal': int
})

# The target variable ranges from 0 (no disease) to 4 (levels of disease)
# Binarize it: 0 = no disease, 1 = disease
df['target'] = (df['target'] > 0).astype(int)

# Separate features and label
X_orig = df.drop(columns=['target'])
Y_orig = df['target']


Missing values per column:
 age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          4
thal        2
target      0
dtype: int64
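
The cleaning and binarization steps above can also be traced by hand on a few records. The sketch below is a dependency-free illustration, not the notebook's code: `parse_rows`, `drop_incomplete`, and `binarize` are hypothetical helper names, only three columns are used, and the third record is a made-up example of a row with a missing value.

```python
import csv
from io import StringIO

# Three columns only, to keep the example short. The first two rows reuse the
# 'ca', 'thal', 'target' values shown in the preview above; the last row is a
# made-up example of a record with a missing value ('?').
RAW = """0.0,6.0,0
3.0,3.0,2
?,7.0,1
"""
COLUMNS = ["ca", "thal", "target"]


def parse_rows(text, columns, missing="?"):
    """Parse CSV text into dicts, mapping the missing marker to None."""
    rows = []
    for fields in csv.reader(StringIO(text)):
        row = {name: (None if value == missing else float(value))
               for name, value in zip(columns, fields)}
        rows.append(row)
    return rows


def drop_incomplete(rows):
    """Keep only rows without missing values (the analogue of DataFrame.dropna)."""
    return [row for row in rows if None not in row.values()]


def binarize(target):
    """Collapse disease levels 1..4 into a single positive class."""
    return 1 if target > 0 else 0


rows = parse_rows(RAW, COLUMNS)
missing_per_column = {name: sum(row[name] is None for row in rows) for name in COLUMNS}
clean = drop_incomplete(rows)
labels = [binarize(row["target"]) for row in clean]

print(missing_per_column)  # {'ca': 1, 'thal': 0, 'target': 0}
print(labels)              # [0, 1]
```

Dropping rows is the simplest cleaning choice; with only 6 missing values in the real dataset it discards few records.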


In [14]:
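# Scale every feature to [0, 1]. Caveat: the scaler is fitted on all rows before
# the split, so test rows influence the min/max (fitting on the training rows
# only would avoid this leakage).
scaler = MinMaxScaler()
X = pd.DataFrame(scaler.fit_transform(X_orig), columns=X_orig.columns, index=X_orig.index)
Y = Y_orig.copy()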
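
For each feature, min-max normalization computes x' = (x - min) / (max - min). The sketch below is a dependency-free illustration of that formula, not scikit-learn's implementation: `fit_min_max` and `transform` are hypothetical helpers, the training values are the `chol` entries of the five preview rows, and the held-out value is made up. It learns the statistics from the training values only and reuses them for the held-out value.

```python
def fit_min_max(column):
    """Learn the minimum and maximum of one feature from the training values."""
    return min(column), max(column)


def transform(column, lo, hi):
    """Apply x' = (x - lo) / (hi - lo); a constant feature maps to 0.0."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]
    return [(x - lo) / span for x in column]


# 'chol' values from the five preview rows, treated here as a toy training set.
chol_train = [233.0, 286.0, 229.0, 250.0, 204.0]
# Made-up held-out value, chosen to lie above the training maximum.
chol_test = [300.0]

lo, hi = fit_min_max(chol_train)            # statistics come from training data only
train_scaled = transform(chol_train, lo, hi)
test_scaled = transform(chol_test, lo, hi)  # reuses the training min/max

print(min(train_scaled), max(train_scaled))  # 0.0 1.0
print(test_scaled[0] > 1.0)                  # True
```

When the scaler is fitted on training data only, held-out values can fall slightly outside [0, 1]. That is expected and generally harmless.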


In [15]:
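# Identical stratified 75/25 splits (same seed) for the normalized and the original
# features, so both models are trained and evaluated on the same rows
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, stratify=Y, random_state=42)
X_orig_train, X_orig_test, y_orig_train, y_orig_test = train_test_split(X_orig, Y_orig, test_size=0.25, stratify=Y_orig, random_state=42)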
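
Passing `stratify` asks `train_test_split` to keep the class proportions roughly equal in both partitions. The sketch below is a simplified, dependency-free illustration of that idea, not scikit-learn's actual algorithm: `stratified_split` is a hypothetical helper, and the class counts (160 negatives, 137 positives) are taken from the supports shown in this notebook's reports.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), sampling test rows separately per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class counts matching the supports reported in this notebook: 160 vs 137.
labels = [0] * 160 + [1] * 137
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=42)

test_positive = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx))   # 223 74
print(test_positive / len(test_idx))   # close to the overall rate 137 / 297
```

The per-class rounding used here yields 74 test rows, whereas scikit-learn allocated 75 in the notebook. The sketch only shows the principle of sampling within each class.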


In [16]:
# Baseline: SVC with default hyperparameters on the unscaled features
modelo_orig = SVC()
modelo_orig.fit(X_orig_train, y_orig_train)

print("🔎 Evaluation on ORIGINAL data (train):")
print(confusion_matrix(y_orig_train, modelo_orig.predict(X_orig_train)))
print(classification_report(y_orig_train, modelo_orig.predict(X_orig_train)))

print("🔎 Evaluation on ORIGINAL data (test):")
print(confusion_matrix(y_orig_test, modelo_orig.predict(X_orig_test)))
print(classification_report(y_orig_test, modelo_orig.predict(X_orig_test)))


🔎 Evaluation on ORIGINAL data (train):
[[104  16]
 [ 59  43]]
              precision    recall  f1-score   support

           0       0.64      0.87      0.73       120
           1       0.73      0.42      0.53       102

    accuracy                           0.66       222
   macro avg       0.68      0.64      0.63       222
weighted avg       0.68      0.66      0.64       222

🔎 Evaluation on ORIGINAL data (test):
[[33  7]
 [17 18]]
              precision    recall  f1-score   support

           0       0.66      0.82      0.73        40
           1       0.72      0.51      0.60        35

    accuracy                           0.68        75
   macro avg       0.69      0.67      0.67        75
weighted avg       0.69      0.68      0.67        75
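
The figures in the report follow directly from the confusion matrix. The sketch below recomputes accuracy and the positive-class precision, recall, and F1 from the test matrix printed above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes. `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(matrix):
    """Compute accuracy plus precision/recall/F1 of class 1 from a 2x2 matrix.

    matrix[i][j] counts samples whose true class is i and predicted class is j.
    """
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Test-set confusion matrix of the SVC trained on the original features.
metrics = binary_metrics([[33, 7], [17, 18]])
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# precision: 0.72
# recall: 0.51
# f1: 0.60
# accuracy: 0.68
```

The 17 false negatives out of 35 patients with disease are what pull the positive-class recall down to 0.51.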



In [None]:
# Same default SVC, now trained on the min-max normalized features
modelo = SVC()
modelo.fit(X_train, y_train)

print("🔎 Evaluation on NORMALIZED data (train):")
print(confusion_matrix(y_train, modelo.predict(X_train)))
print(classification_report(y_train, modelo.predict(X_train)))

print("🔎 Evaluation on NORMALIZED data (test):")
print(confusion_matrix(y_test, modelo.predict(X_test)))
print(classification_report(y_test, modelo.predict(X_test)))


🔎 Evaluation on NORMALIZED data (train):
[[109  11]
 [ 18  84]]
              precision    recall  f1-score   support

           0       0.86      0.91      0.88       120
           1       0.88      0.82      0.85       102

    accuracy                           0.87       222
   macro avg       0.87      0.87      0.87       222
weighted avg       0.87      0.87      0.87       222

🔎 Evaluation on NORMALIZED data (test):
[[36  4]
 [ 7 28]]
              precision    recall  f1-score   support

           0       0.84      0.90      0.87        40
           1       0.88      0.80      0.84        35

    accuracy                           0.85        75
   macro avg       0.86      0.85      0.85        75
weighted avg       0.85      0.85      0.85        75
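
A likely explanation for the rise in test accuracy from 0.68 to 0.85 is that `SVC` uses an RBF kernel by default. That kernel depends on the Euclidean distance between samples, so features with large numeric ranges dominate when the data is unscaled. The sketch below illustrates the effect with the first two preview rows; it illustrates the mechanism rather than proving the cause, and `squared_contributions` is a hypothetical helper.

```python
FEATURES = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
            'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

# First two rows of the preview printed earlier (target column omitted).
row_0 = [63.0, 1.0, 1.0, 145.0, 233.0, 1.0, 2.0, 150.0, 0.0, 2.3, 3.0, 0.0, 6.0]
row_1 = [67.0, 1.0, 4.0, 160.0, 286.0, 0.0, 2.0, 108.0, 1.0, 1.5, 2.0, 3.0, 3.0]


def squared_contributions(a, b, names):
    """Return each feature's share of the squared Euclidean distance."""
    squares = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(squares)
    return {name: sq / total for name, sq in zip(names, squares)}


shares = squared_contributions(row_0, row_1, FEATURES)
top = sorted(shares.items(), key=lambda item: item[1], reverse=True)[:3]
for name, share in top:
    print(f"{name}: {share:.1%}")
# chol: 58.0%
# thalach: 36.4%
# trestbps: 4.6%
```

For this pair of patients, three features account for more than 99% of the squared distance, while binary features such as `exang` contribute almost nothing. After min-max scaling every feature lies in [0, 1], so no single feature can dominate through its units alone.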

