<hr>
<hr>

**Exercício 1** Utilizando a base de dados `breast_cancer`, utilize a clusterização hierárquica para descrever fatores que podem ajudar a indicar se um tumor é maligno ou não. Dica: transforme a feature alvo (`diagnosis`) em número. Use 3 e 5 agrupamentos para a comparação e compare os resultados. Utilize o método de Ward para ligar os clusters. Apresente o dendograma obtido.

<hr>

**Exercício 2** Repita a análise feita anteriormente, porém agora utilizando o DBSCAN. Realize a tunagem dos hiperparâmetros usando o método da silhueta. 

<hr>

### 1 - Importações

In [1]:
import warnings
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.cluster import DBSCAN
from sklearn.pipeline import Pipeline
from sklearn.metrics import silhouette_score

# ignorar warnings
warnings.filterwarnings('ignore')

### 2 - Tratamento Inicial dos Dados

In [2]:
# importar os dados
data = pd.read_csv('../data/breast_cancer.csv')
data.head()


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [3]:
# eliminar colunas que não serão utilizadas
data.drop(['id'], axis=1, inplace=True)

In [4]:
# separar entre preditores (x) e targets (y)
x = data.drop(['diagnosis'], axis=1)
y = data[['diagnosis']]

# transformar o diagnosis em números
y.diagnosis = y.diagnosis.map({'B': 0, 'M': 1})


In [5]:
# verificando presença dos nulos
data.isna().sum()

diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

In [6]:
# verificando se todos os atributos são numéricos
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    object 
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  5

Agrupamento 3 clusters

### 3 - Modelagem

### 3.1 - Funções auxiliares

In [7]:
# função para avaliar influência dos parâmetros
def plot_dbscan(x, eps, min_pts):

    # pipeline de modelo
    cluster_pipe = Pipeline([
        ('scaler', RobustScaler()),
        ('dbscan', DBSCAN(eps=eps, min_samples=min_pts))
    ])

    # ajuste
    cluster_pipe.fit(x)

    # atribuindo clusters
    x['cluster'] = cluster_pipe['dbscan'].labels_

    # construir gráfico
    print(f'DBSCAN - eps = {eps} - min_pts = {min_pts}')
    
    # Number of clusters in labels, ignoring noise if present.
    labels = x['cluster']
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise_ = list(labels).count(-1)

    print("Estimated number of clusters: %d" % n_clusters_)
    print("Estimated number of noise points: %d" % n_noise_)

    # desempenho do modelo
    if len(x.cluster.unique()) > 1:
        print(f'Score de Silhueta: {silhouette_score(x.iloc[:, :-1], x["cluster"])}')
    else:
        print(f'Score de Silhueta: Não foi possível calcular')
    
    # análise da distribuição de pontos em cada cluster
    print('Distribuição de pontos')
    print(x.cluster.value_counts(normalize=True))
    print(x.groupby(['cluster'])['cluster'].count())
    


### 3.2 - Processamento variando o eps

In [8]:
s = []
# avaliando influência do eps
eps_list = np.arange(0.5, 10, 0.5)

for eps in eps_list:
    plot_dbscan(x.copy(), eps, 2)


DBSCAN - eps = 0.5 - min_pts = 2
Estimated number of clusters: 1
Estimated number of noise points: 569
Score de Silhueta: Não foi possível calcular
Distribuição de pontos
-1    1.0
Name: cluster, dtype: float64
cluster
-1    569
Name: cluster, dtype: int64
DBSCAN - eps = 1.0 - min_pts = 2
Estimated number of clusters: 8
Estimated number of noise points: 555
Score de Silhueta: -0.7597111221532754
Distribuição de pontos
-1    0.975395
 0    0.003515
 1    0.003515
 2    0.003515
 3    0.003515
 4    0.003515
 5    0.003515
 6    0.003515
Name: cluster, dtype: float64
cluster
-1    555
 0      2
 1      2
 2      2
 3      2
 4      2
 5      2
 6      2
Name: cluster, dtype: int64
DBSCAN - eps = 1.5 - min_pts = 2
Estimated number of clusters: 25
Estimated number of noise points: 405
Score de Silhueta: -0.7302419626088638
Distribuição de pontos
-1     0.711775
 2     0.177504
 13    0.010545
 0     0.008787
 18    0.008787
 11    0.008787
 7     0.007030
 10    0.005272
 3     0.005272
 5

### 3.2 - Processamento variando o min_pts

In [9]:
# avaliando influência do eps
min_pt_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

for n in min_pt_list:
    plot_dbscan(x.copy(), 7, n)

DBSCAN - eps = 7 - min_pts = 1
Estimated number of clusters: 7
Estimated number of noise points: 0
Score de Silhueta: -0.5048985277824654
Distribuição de pontos
0    0.989455
1    0.001757
2    0.001757
3    0.001757
4    0.001757
5    0.001757
6    0.001757
Name: cluster, dtype: float64
cluster
0    563
1      1
2      1
3      1
4      1
5      1
6      1
Name: cluster, dtype: int64
DBSCAN - eps = 7 - min_pts = 2
Estimated number of clusters: 2
Estimated number of noise points: 6
Score de Silhueta: 0.593336524796007
Distribuição de pontos
 0    0.989455
-1    0.010545
Name: cluster, dtype: float64
cluster
-1      6
 0    563
Name: cluster, dtype: int64
DBSCAN - eps = 7 - min_pts = 3
Estimated number of clusters: 2
Estimated number of noise points: 6
Score de Silhueta: 0.593336524796007
Distribuição de pontos
 0    0.989455
-1    0.010545
Name: cluster, dtype: float64
cluster
-1      6
 0    563
Name: cluster, dtype: int64
DBSCAN - eps = 7 - min_pts = 4
Estimated number of clusters: 2

3.3 Conclusões

Ao processar o DBSCAN oscilando-se apenas o valor de eps, não se encontrou nenhum valor satisfatório. 
<br>Na maioria das vezes, os pontos acabavam sendo classifacos em grande parte em outliers, ou então, ficavam muito concentrados em um único cluster.

Em seguida, ao processar o DBSCAN oscilando-se o min_points, a situação continou a mesma