
# 📊 Reproducing the Experiment from Chapter 2.8 - Data Mining

**Dataset:** Mammographic Mass Dataset  
**Source:** [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/161/mammographic+mass)

In this notebook, we reproduce **all the steps** of Chapter 2.8 of the book "Introdução à Mineração de Dados", including:  
✅ Loading the data  
✅ Missing-value analysis  
✅ Handling inconsistencies  
✅ Discretization  
✅ Transformation  
✅ Reduction  
✅ Final dataset ready for mining


In [None]:

# Install the libraries if needed
# !pip install pandas numpy matplotlib seaborn scikit-learn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler


In [None]:

# 📥 Load the dataset ('?' marks a missing value in the raw file)

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data"
columns = ["BI-RADS", "Age", "Shape", "Margin", "Density", "Severity"]

df = pd.read_csv(url, names=columns, na_values='?')

# Show the first rows
df.head()


In [None]:

# 🔍 Analyze missing values (count of nulls per column)

df.info()
df.isnull().sum()

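The introduction lists handling inconsistencies, but none of the cells perform that step. The following is a minimal sketch of one possible approach, assuming that any value outside an attribute's documented range should be treated as missing so that the later imputation step can fill it. The ranges in `VALID_RANGES` are an assumption taken from the UCI attribute description, `mask_out_of_range` is a hypothetical helper, and the toy rows are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumed valid ranges, based on the UCI attribute description.
VALID_RANGES = {
    "BI-RADS": (1, 5),
    "Shape": (1, 4),
    "Margin": (1, 5),
    "Density": (1, 4),
    "Severity": (0, 1),
}

def mask_out_of_range(frame, ranges=VALID_RANGES):
    """Return a copy in which out-of-range values are replaced by NaN."""
    cleaned = frame.copy()
    for col, (low, high) in ranges.items():
        if col in cleaned.columns:
            # where() keeps values that satisfy the condition and sets the rest to NaN
            cleaned[col] = cleaned[col].where(cleaned[col].between(low, high))
    return cleaned

# Toy rows for illustration only (not taken from the real file).
toy_raw = pd.DataFrame({
    "BI-RADS": [4.0, 55.0, 5.0],
    "Age": [67.0, 43.0, 58.0],
    "Shape": [3.0, 1.0, 9.0],
})
toy_clean = mask_out_of_range(toy_raw)
```

In the notebook this could be applied as `df = mask_out_of_range(df)` before the imputation cell, so that masked values are imputed together with the original missing ones.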

In [None]:

# 🛠️ Impute missing values with the most frequent value of each column

imputer = SimpleImputer(strategy='most_frequent')
# fit_transform returns a NumPy array, so rebuild the DataFrame with the original labels
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=columns, index=df.index)

# Make sure every column is numeric again
for col in columns:
    df_imputed[col] = pd.to_numeric(df_imputed[col])

df_imputed.isnull().sum()

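The cell above fills every column with its mode. As a possible variation (not the chapter's method), a continuous attribute such as `Age` could be imputed with the median while the coded attributes keep the mode. The sketch below shows this on toy values; the frame `toy_missing` and the choice of median are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy values for illustration only.
toy_missing = pd.DataFrame({
    "Age": [40.0, np.nan, 60.0, 70.0],
    "Shape": [1.0, 1.0, np.nan, 4.0],
})

age_imputer = SimpleImputer(strategy="median")           # continuous attribute
shape_imputer = SimpleImputer(strategy="most_frequent")  # coded categorical attribute

toy_imputed = toy_missing.copy()
toy_imputed[["Age"]] = age_imputer.fit_transform(toy_missing[["Age"]])
toy_imputed[["Shape"]] = shape_imputer.fit_transform(toy_missing[["Shape"]])
```

The median is less sensitive to extreme ages than the mean, and unlike the mode it does not depend on which exact age happens to repeat most often.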

In [None]:

# 📊 Descriptive statistics

df_imputed.describe()


In [None]:

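# 🧩 Discretize age into 3 equal-width bins (codes 0, 1, 2)

kbd = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
# fit_transform returns a 2-D array; flatten it before storing it as a column
df_imputed['Age_binned'] = kbd.fit_transform(df_imputed[['Age']]).ravel()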

df_imputed[['Age', 'Age_binned']].head()

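To make the `strategy='uniform'` setting concrete, the sketch below compares the bin edges learned by `KBinsDiscretizer` with equal-width edges computed by hand using `np.linspace`. The ages are toy values chosen for illustration, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy ages for illustration only.
ages = np.array([[18.0], [30.0], [45.0], [60.0], [78.0]])

kbd_demo = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
codes = kbd_demo.fit_transform(ages).ravel()

# Edges learned by scikit-learn for the single input column
edges = kbd_demo.bin_edges_[0]

# Equal-width edges computed by hand: n_bins + 1 points from min to max
manual_edges = np.linspace(ages.min(), ages.max(), 4)

print(edges)         # learned edges
print(manual_edges)  # should match the learned edges
print(codes)         # bin index assigned to each age
```

In the notebook, `kbd.bin_edges_[0]` gives the actual age boundaries behind the codes in `Age_binned`.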

In [None]:

# 🔧 Normalize the numeric attributes to the [0, 1] range

scaler = MinMaxScaler()
df_normalized = df_imputed.copy()
df_normalized[["Age", "BI-RADS"]] = scaler.fit_transform(df_normalized[["Age", "BI-RADS"]])

df_normalized.head()

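Min-max scaling maps each value x to (x - min) / (max - min), computed per column. The sketch below checks that formula against `MinMaxScaler` on toy values; the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column for illustration only.
values = np.array([[20.0], [40.0], [80.0]])

scaled = MinMaxScaler().fit_transform(values)

# Same transformation written out by hand
manual = (values - values.min()) / (values.max() - values.min())

print(scaled.ravel())
print(manual.ravel())
```

Because the minimum and maximum are learned from the data, a single extreme value compresses all the others toward one end of the range, so it can help to inspect `df_imputed.describe()` before scaling.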

In [None]:

# ✂️ Reduction: remove the non-predictive attribute

df_reduced = df_normalized.drop(columns=['BI-RADS'])

df_reduced.head()


In [None]:

# ✅ Final dataset ready for mining

df_reduced.to_csv("mammographic_mass_preprocessed.csv", index=False)
print("File saved as mammographic_mass_preprocessed.csv")

df_reduced.head()



## ✅ Conclusion

We successfully reproduced all the steps of Chapter 2.8:  
- Loading and analyzing the data  
- Handling missing values  
- Discretization and transformation  
- Attribute reduction

The final dataset is ready to be used with **Data Mining** techniques such as classification, clustering, or association rules.
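
As one illustration of the classification use case, the sketch below estimates the cross-validated accuracy of a shallow decision tree. The helper `evaluate_tree`, the choice of model, and its settings are assumptions for illustration, and the demo frame is synthetic random data that only mimics the column names of the saved file, so its scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate_tree(frame, target="Severity", folds=5):
    """Cross-validated accuracy of a shallow decision tree (hypothetical helper)."""
    X = frame.drop(columns=[target])
    y = frame[target].astype(int)
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy")

# Synthetic stand-in with the same column names as the saved file; not real data.
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({
    "Age": rng.random(n),
    "Shape": rng.integers(1, 5, n).astype(float),
    "Margin": rng.integers(1, 6, n).astype(float),
    "Density": rng.integers(1, 5, n).astype(float),
    "Age_binned": rng.integers(0, 3, n).astype(float),
    "Severity": np.tile([0.0, 1.0], n // 2),
})
scores = evaluate_tree(demo)
```

With the real file, the same helper could be called as `evaluate_tree(pd.read_csv("mammographic_mass_preprocessed.csv"))`.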
