# Assignment 01 - Unstructured Data Mining (SCC0287)

---

Heitor Carvalho Pinheiro

---


In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import chi2, SelectKBest, RFECV, RFE
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report


import statsmodels.api as sm
import statsmodels.formula.api as smf

pd.set_option('display.max_columns', 500)

# About the Data

**Binary Classification with a Software Defects Dataset
Playground Series - Season 3, Episode 23**

**Dataset Description**

The dataset for this competition (both train and test) was generated from a deep learning model trained on the [Software Defect Dataset](https://www.kaggle.com/datasets/semustafacevik/software-defect-prediction).

The feature distributions are close to, but not exactly the same as, those of the original. Feel free to use the original dataset as part of this competition, both to explore the differences and to see whether incorporating the original in training improves model performance.

**Files**

* train.csv - the training dataset; defects is the binary target variable, which is treated as a boolean (False=0, True=1)

* test.csv - the test dataset; your objective is to predict the probability of positive defects (i.e., defects=True)

* sample_submission.csv - a sample submission file in the correct format

### Data Dictionary

**About the McCabe metrics**

`The McCabe metrics are a collection of four software metrics: essential complexity, cyclomatic complexity, design complexity, and LOC (lines of code).`

-- **Cyclomatic Complexity**, or "v(G)", measures the number of "linearly independent paths" through a program's "flowgraph". A set of paths is linearly independent if no path in the set is a linear combination of any other paths in the set. A flowgraph is a directed graph where each node corresponds to a program statement, and each arc indicates the flow of control from one statement to another. "v(G)" is calculated by "v(G) = e - n + 2", where "G" is a program's flowgraph, "e" is the number of arcs in the flowgraph, and "n" is the number of nodes in the flowgraph. The standard McCabe rule ("v(G)" > 10) is used to identify fault-prone modules.

-- **Essential Complexity**, or "ev(G)", is the extent to which a flowgraph can be "reduced" by decomposing all the subflowgraphs of "G" that are "D-structured primes". Such "D-structured primes" are sometimes referred to as "proper one-entry one-exit subflowgraphs" (for a more thorough discussion of D-primes, see the Fenton text). "ev(G)" is calculated using "ev(G) = v(G) - m", where "m" is the number of subflowgraphs of "G" that are D-structured primes.

-- **Design Complexity**, or "iv(G)", is the cyclomatic complexity of a module's reduced flowgraph, that is, the flowgraph after eliminating any complexity that does not influence the interrelationship between design modules. According to McCabe, this complexity measure reflects the module's calling patterns to its immediate subordinate modules.

-- **Lines of code (LOC)** are measured according to McCabe's line counting conventions.
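
As a minimal sketch of the formula v(G) = e - n + 2, the snippet below counts arcs and nodes in a small flowgraph for a function with a single `if/else`. The edge list and the helper name `cyclomatic_complexity` are hypothetical, chosen only for illustration; they are not part of the dataset.

```python
def cyclomatic_complexity(edges):
    """Return v(G) = e - n + 2 for a connected flowgraph given as (src, dst) arcs."""
    nodes = {node for edge in edges for node in edge}
    return len(edges) - len(nodes) + 2


# Hypothetical flowgraph of a function with one if/else branch.
if_else_graph = [
    ("entry", "cond"),
    ("cond", "then"),
    ("cond", "else"),
    ("then", "exit"),
    ("else", "exit"),
]

v_g = cyclomatic_complexity(if_else_graph)
print(v_g)  # 5 arcs - 5 nodes + 2 = 2
```

A single decision point yields two linearly independent paths, which is far below the v(G) > 10 rule of thumb quoted in the list.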

**About the Halstead metrics**

`The Halstead metrics fall into three groups: the base measures, the derived measures, and the lines-of-code measures.`

-- **Base measures**:
   -- mu1             = number of unique operators
   
   -- mu2             = number of unique operands
   
   -- N1              = total occurrences of operators
   
   -- N2              = total occurrences of operands
   
   -- length          = N  = N1 + N2
   
   -- vocabulary      = mu = mu1 + mu2
   
   -- Constants set for each function:
      
      -- mu1' = 2 = potential operator count (just the function name and the "return" operator)
      
      -- mu2'      = potential operand count (the number of arguments to the module)

   For example, the expression "return max(w+x,x+y)" has "N1=4" operators ("return, max, +, +"), "N2=4" operands (w, x, x, y), "mu1=3" unique operators (return, max, +), and "mu2=3" unique operands (w, x, y).
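
The counts in this example can be reproduced with a short sketch. The split of tokens into operators and operands is done by hand here, which is an assumption: a real metrics tool would derive it from the language grammar.

```python
# Tokens of `return max(w+x, x+y)`, classified by hand (assumption).
operators = ["return", "max", "+", "+"]
operands = ["w", "x", "x", "y"]

N1 = len(operators)        # total operator occurrences
N2 = len(operands)         # total operand occurrences
mu1 = len(set(operators))  # unique operators
mu2 = len(set(operands))   # unique operands

length = N1 + N2           # N
vocabulary = mu1 + mu2     # mu

print(N1, N2, mu1, mu2, length, vocabulary)  # 4 4 3 3 8 6
```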

-- **Derived measures**:

   -- P = volume = V = N * log2(mu) (the number of mental comparisons needed to write a program of length N)

   -- V* = volume of the minimal implementation
         = (2 + mu2')*log2(2 + mu2')
   
   -- L  = program length = V*/N
   
   -- D  = difficulty = 1/L
   
   -- L' = 1/D
   
   -- I  = intelligence = L'*V'
   
   -- E  = effort to write the program = V/L
   
   -- T  = time to write the program = E/18 seconds
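
To make the chain of derived measures concrete, here is a small sketch that applies the formulas exactly as the data dictionary lists them. The function name `halstead_derived` is hypothetical, and `mu2_prime = 3` (the arguments w, x, y) is an assumed value for illustration only.

```python
import math


def halstead_derived(N, mu, mu2_prime):
    """Derived Halstead measures, following the data dictionary's formulas."""
    V = N * math.log2(mu)                                 # volume
    V_star = (2 + mu2_prime) * math.log2(2 + mu2_prime)   # minimal volume
    L = V_star / N                                        # L as listed in the dictionary
    D = 1 / L                                             # difficulty
    E = V / L                                             # effort
    T = E / 18                                            # time, in seconds
    return {"V": V, "V*": V_star, "L": L, "D": D, "E": E, "T": T}


# N = 8 and mu = 6 come from `return max(w+x, x+y)`;
# mu2_prime = 3 is an assumption made for this example.
measures = halstead_derived(N=8, mu=6, mu2_prime=3)
for name, value in measures.items():
    print(f"{name} = {value:.3f}")
```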

### About the data

**Number of attributes:** 22 (5 different lines-of-code measures, 3 McCabe metrics, 4 base Halstead measures, 8 derived Halstead measures, a branch count, and 1 goal field)



1. loc             : numeric % McCabe's line count of code
2. v(g)            : numeric % McCabe "cyclomatic complexity"
3. ev(g)           : numeric % McCabe "essential complexity"
4. iv(g)           : numeric % McCabe "design complexity"
5. n               : numeric % Halstead total operators + operands
6. v               : numeric % Halstead "volume"
7. l               : numeric % Halstead "program length"
8. d               : numeric % Halstead "difficulty"
9. i               : numeric % Halstead "intelligence"
10. e               : numeric % Halstead "effort"
11. b               : numeric % Halstead
12. t               : numeric % Halstead's time estimator
13. lOCode          : numeric % Halstead's line count
14. lOComment       : numeric % Halstead's count of lines of comments
15. lOBlank         : numeric % Halstead's count of blank lines
16. lOCodeAndComment: numeric
17. uniq_Op         : numeric % unique operators
18. uniq_Opnd       : numeric % unique operands
19. total_Op        : numeric % total operators
20. total_Opnd      : numeric % total operands
21. branchCount     : numeric % of the flow graph
22. defects         : {false,true} % module has/has not one or more


## Reading the data

In [None]:
train = pd.read_csv("https://raw.githubusercontent.com/Heitorcp/scc0287-mineracao-de-dados-nao-estruturados/main/train.csv", usecols=[*range(1,23)])
test = pd.read_csv("https://raw.githubusercontent.com/Heitorcp/scc0287-mineracao-de-dados-nao-estruturados/main/test.csv")
sample_submission = pd.read_csv("https://raw.githubusercontent.com/Heitorcp/scc0287-mineracao-de-dados-nao-estruturados/main/sample_submission.csv")

In [None]:
def rename_cols(df):
  """
  Rename columns in the dataframe that contain the strings (g)
  to replace the (g) with _g.

  Args:
    df: Pandas DataFrame.

  Returns:
    Pandas DataFrame with renamed columns.
  """
  df.columns = df.columns.str.replace(r"\(g\)", "_g", regex=True)
  return df

In [None]:
train = rename_cols(train)
test = rename_cols(test)

# Preprocessing

### Splitting the data into training and validation sets

In [None]:
X = train.drop('defects', axis=1)
y = train['defects']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

#Encoding variables
y_train = y_train.astype(int)
y_val = y_val.astype(int)

### Checking for null values
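
A minimal sketch of a null check, assuming a pandas DataFrame such as `X_train`. The toy frame `demo` is a hypothetical stand-in with one missing value, so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train, with one missing value on purpose.
demo = pd.DataFrame({"loc": [10.0, 22.0, np.nan], "v_g": [1.0, 3.0, 2.0]})

null_counts = demo.isnull().sum()  # number of missing values per column
has_nulls = bool(demo.isnull().values.any())

print(null_counts)
print("Any nulls:", has_nulls)
```

Applied to the real data, the same two calls on `X_train` and `X_val` would show whether any imputation is needed before scaling.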

## Feature Scaling

**Applying a MinMaxScaler to the data**

In [None]:
features = train.columns.drop('defects')
scaler = MinMaxScaler()
scaler.fit(X_train[features])
train_scaled = scaler.transform(X_train[features])
val_scaled = scaler.transform(X_val[features])

In [None]:
train_scaled_df = pd.DataFrame(train_scaled, columns=features)
val_scaled_df = pd.DataFrame(val_scaled, columns=features)

In [None]:
train_scaled_df

Unnamed: 0,loc,v_g,ev_g,iv_g,n,v,l,d,i,e,b,t,lOCode,lOComment,lOBlank,locCodeAndComment,uniq_Op,uniq_Opnd,total_Op,total_Opnd,branchCount
0,0.009590,0.017370,0.000000,0.012469,0.016823,0.015136,0.059701,0.055476,0.055530,0.001011,0.008905,0.001405,0.008144,0.002907,0.050228,0.000000,0.043902,0.019729,0.015498,0.019199,0.027888
1,0.005812,0.004963,0.000000,0.002494,0.004739,0.003731,0.104478,0.033309,0.026098,0.000141,0.002226,0.000196,0.005666,0.000000,0.009132,0.000000,0.031707,0.008631,0.004613,0.004965,0.007968
2,0.001744,0.000000,0.000000,0.000000,0.001777,0.001027,0.432836,0.008369,0.024992,0.000010,0.000742,0.000014,0.001416,0.000000,0.004566,0.000000,0.012195,0.006165,0.001476,0.002317,0.000000
3,0.002616,0.002481,0.000000,0.002494,0.002962,0.001962,0.283582,0.012745,0.031328,0.000036,0.001113,0.000050,0.002479,0.000000,0.004566,0.000000,0.019512,0.007398,0.002952,0.002979,0.003984
4,0.004650,0.004963,0.000000,0.002494,0.004265,0.003138,0.179104,0.020182,0.035435,0.000087,0.001855,0.000121,0.004249,0.002907,0.009132,0.000000,0.021951,0.011097,0.003875,0.004965,0.007968
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81405,0.016565,0.017370,0.012195,0.014963,0.019311,0.018885,0.059701,0.062673,0.057233,0.001172,0.011503,0.001630,0.014518,0.008721,0.018265,0.184211,0.043902,0.025894,0.019373,0.021185,0.027888
81406,0.017437,0.017370,0.024390,0.012469,0.018600,0.016629,0.074627,0.043950,0.077837,0.000801,0.010019,0.001113,0.014518,0.023256,0.054795,0.000000,0.041463,0.029593,0.016236,0.022840,0.027888
81407,0.004359,0.007444,0.012195,0.000000,0.002962,0.002407,0.194030,0.021927,0.023009,0.000053,0.001484,0.000074,0.003895,0.000000,0.013699,0.000000,0.026829,0.007398,0.002768,0.003310,0.011952
81408,0.004650,0.002481,0.000000,0.002494,0.004265,0.003403,0.253731,0.013989,0.050054,0.000054,0.002226,0.000075,0.004958,0.000000,0.004566,0.000000,0.021951,0.016030,0.004059,0.004634,0.003984


## Feature Selection

Since this is a supervised binary classification problem, we use supervised methods for feature selection.

The methods used are:
1. Chi-squared method
2. Recursive Feature Elimination (RFE)

### Chi-squared method

In [None]:
def selectBestChi2(X, y) -> pd.Index:
  chi2_selector = SelectKBest(chi2, k=10)  # Keep the 10 highest-scoring features
  chi2_selector.fit(X, y)
  # Names of the selected features
  selected_features_chi2 = X.columns[chi2_selector.get_support()]
  print("Features selected by the chi-squared test:", selected_features_chi2)

  return selected_features_chi2

In [None]:
X_train = train_scaled_df
selected_features_chi2 = selectBestChi2(X_train, y_train)

Features selected by the chi-squared test: Index(['loc', 'v_g', 'ev_g', 'v', 'l', 'lOComment', 'lOBlank',
       'locCodeAndComment', 'total_Opnd', 'branchCount'],
      dtype='object')
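
To see why particular features are kept, the fitted selector's `scores_` and `pvalues_` attributes can be inspected. The sketch below uses synthetic stand-in data (an assumption, not the competition data). It also shows why the features are scaled to [0, 1] first: scikit-learn's `chi2` requires non-negative inputs.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: `signal` depends on the label, `noise` does not.
y_demo = rng.integers(0, 2, size=500)
X_demo = pd.DataFrame({
    "signal": y_demo * 2.0 + rng.normal(size=500),
    "noise": rng.normal(size=500),
})

# chi2 rejects negative values, so scale every feature to [0, 1] first.
X_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(X_demo), columns=X_demo.columns
)

selector = SelectKBest(chi2, k=1).fit(X_scaled, y_demo)
ranking = pd.DataFrame(
    {"chi2": selector.scores_, "p_value": selector.pvalues_},
    index=X_scaled.columns,
).sort_values("chi2", ascending=False)
print(ranking)
```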


### RFE method

In [None]:
clf_rf = RandomForestClassifier()
# n_features_to_select is left at its default, so RFE keeps half of the
# 21 features; the count printed below is not tuned by cross-validation.
rfe = RFE(estimator=clf_rf, step=1)
rfe = rfe.fit(X_train, y_train)

print('Number of features kept:', rfe.n_features_)
print('Selected features:', X_train.columns[rfe.support_])

Number of features kept: 10
Selected features: Index(['loc', 'n', 'v', 'd', 'i', 'e', 't', 'lOCode', 'total_Op',
       'branchCount'],
      dtype='object')


In [None]:
rfe_selected_features = X_train.columns[rfe.support_]
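
With its default settings, `RFE` keeps half of the input features, so the 10 features above are not a tuned count. One possible way to let cross-validation choose the count is `RFECV`, which is already imported at the top of the notebook. The sketch below runs on synthetic stand-in data (an assumption) and uses a logistic regression to stay fast; the notebook's `RandomForestClassifier` could be passed as `estimator` instead.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption): 10 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,                          # drop one feature per iteration
    cv=StratifiedKFold(n_splits=3),  # folds used to score each feature count
    scoring="accuracy",
)
rfecv.fit(X_demo, y_demo)

print("Number of features chosen by cross-validation:", rfecv.n_features_)
print("Selected feature mask:", rfecv.support_)
```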

# Pattern Extraction and Post-processing

## Classification

To classify the dataset, we use a logistic regression as our baseline model.

We then fit a Random Forest and compare it with the baseline.

In [None]:
def evaluate_model(y_true, y_pred):
  """
  Evaluate the model using classification_report.

  Args:
    y_true: True labels.
    y_pred: Predicted labels.

  Returns:
    Classification report string.
  """
  report = classification_report(y_true, y_pred)
  return report

### Logistic regression with chi-squared-selected features

We use the `statsmodels` library to fit the logistic regression because it lets us define the model as a GLM with a Binomial-family response. The fit is an unpenalized maximum-likelihood estimate, and its summary reports standard errors, z-statistics, and confidence intervals for each coefficient. scikit-learn's `LogisticRegression`, by contrast, applies L2 regularization by default and does not report these inference statistics.

The decision threshold that turns the predicted probabilities into classes is the point on the ROC curve that maximizes the difference between the true positive rate and the false positive rate (Youden's J statistic).
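
The threshold rule can be sketched on toy data as follows. The labels and scores are illustrative values only, not results from the model.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J statistic: J = TPR - FPR, maximized over candidate thresholds.
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)
best_threshold = thresholds[best_idx]

# Everything at or above the chosen threshold is labeled positive.
y_pred = (y_score >= best_threshold).astype(int)
print("Threshold:", best_threshold, "J:", j_scores[best_idx])
```

Picking the threshold on the same split that is used for the final report tends to make the metrics look better than they would on unseen data, so a separate split for threshold selection is a common alternative.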

In [None]:
# training data
df_train = train_scaled_df.copy()
df_train_chi2 = df_train[selected_features_chi2]
df_train['y'] = y_train.values

# validation data
df_val = val_scaled_df.copy()
df_val_chi2 = df_val[selected_features_chi2]
df_val['y'] = y_val.values

In [None]:
formula = 'y ~ ' + ' + '.join(df_train_chi2.columns)
lr_model = smf.glm(formula=formula, data=df_train, family=sm.families.Binomial())
result_lr = lr_model.fit()
print(result_lr.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                      y   No. Observations:                81410
Model:                            GLM   Df Residuals:                    81399
Model Family:                Binomial   Df Model:                           10
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -36956.
Date:                Sun, 15 Sep 2024   Deviance:                       73912.
Time:                        22:15:49   Pearson chi2:                 1.63e+15
No. Iterations:                     7   Pseudo R-squ. (CS):             0.1521
Covariance Type:            nonrobust                                         
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            -1.7655      0.02

In [None]:
from sklearn.metrics import roc_curve, auc

# Predict probabilities
y_pred_prob = result_lr.predict(exog=df_val_chi2)

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(df_val['y'], y_pred_prob)
roc_auc = auc(fpr, tpr)

# Find the threshold that maximizes Youden's J statistic (TPR - FPR)
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]

# Classify based on the optimal threshold
y_pred = (y_pred_prob >= optimal_threshold).astype(int)

# Evaluate the model
report = evaluate_model(df_val['y'], y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.89      0.77      0.82     15825
           1       0.45      0.66      0.53      4528

    accuracy                           0.74     20353
   macro avg       0.67      0.71      0.68     20353
weighted avg       0.79      0.74      0.76     20353



### Logistic regression with RFE-selected features

In [None]:
df_train_rfe = df_train[rfe_selected_features]
df_val_rfe = df_val[rfe_selected_features]

formula = 'y ~ ' + ' + '.join(df_train_rfe.columns)
lr_model_rfe = smf.glm(formula=formula, data=df_train, family=sm.families.Binomial())
result_lr_rfe = lr_model_rfe.fit()
print(result_lr_rfe.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                      y   No. Observations:                81410
Model:                            GLM   Df Residuals:                    81399
Model Family:                Binomial   Df Model:                           10
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -36920.
Date:                Sun, 15 Sep 2024   Deviance:                       73841.
Time:                        22:15:55   Pearson chi2:                 9.48e+12
No. Iterations:                     7   Pseudo R-squ. (CS):             0.1529
Covariance Type:            nonrobust                                         
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      -2.6843      0.025   -108.652      

In [None]:
# Predict probabilities
y_pred_prob = result_lr_rfe.predict(exog=df_val_rfe)

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(df_val['y'], y_pred_prob)
roc_auc = auc(fpr, tpr)

# Find the threshold that maximizes Youden's J statistic (TPR - FPR)
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]

# Classify based on the optimal threshold
y_pred = (y_pred_prob >= optimal_threshold).astype(int)

# Evaluate the model
report = evaluate_model(df_val['y'], y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.90      0.73      0.81     15825
           1       0.43      0.71      0.53      4528

    accuracy                           0.73     20353
   macro avg       0.66      0.72      0.67     20353
weighted avg       0.79      0.73      0.75     20353



### Random Forest with cross-validation



In [None]:
from sklearn.model_selection import cross_val_score, KFold

rf_model = RandomForestClassifier()
cv = KFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(rf_model, df_train_chi2, y_train, cv=cv, scoring='accuracy')
print("Cross-validation scores:", scores)
print("Mean accuracy:", scores.mean())

rf_model.fit(df_train_chi2, y_train)
y_pred = rf_model.predict(df_val_chi2)

report = evaluate_model(df_val['y'], y_pred)
print(report)

Cross-validation scores: [0.79913034 0.80067804 0.79746462]
Mean accuracy: 0.7990910007814994
              precision    recall  f1-score   support

           0       0.84      0.92      0.88     15825
           1       0.58      0.38      0.46      4528

    accuracy                           0.80     20353
   macro avg       0.71      0.65      0.67     20353
weighted avg       0.78      0.80      0.78     20353



A Random Forest trained on the features selected by the chi-squared method reached a precision of 58% for the defective class, compared with 45% for the logistic regression. That gain comes at the cost of recall, which drops from 66% to 38%.

### KNN classifier

### Using the features selected by the chi-squared method

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for GridSearchCV
param_grid = {
    'n_neighbors': np.arange(1, 11),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# Create the KNN classifier
knn = KNeighborsClassifier()

# Create GridSearchCV object
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=3, scoring='accuracy')

# Fit the grid search to the data
grid_search.fit(df_train_chi2, y_train)

# Print the best parameters and best score
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

# Get the best estimator
best_knn = grid_search.best_estimator_

# Make predictions on the validation set
y_pred = best_knn.predict(df_val_chi2)

# Evaluate the model
report = evaluate_model(df_val['y'], y_pred)
print(report)

Best parameters: {'metric': 'euclidean', 'n_neighbors': 10, 'weights': 'uniform'}
Best score: 0.8039061233460351
              precision    recall  f1-score   support

           0       0.83      0.94      0.88     15825
           1       0.62      0.33      0.43      4528

    accuracy                           0.80     20353
   macro avg       0.72      0.63      0.65     20353
weighted avg       0.78      0.80      0.78     20353



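KNN achieved a higher precision for the positive class than the Random Forest (62% versus 58%), but a lower recall (33% versus 38%) and F1-score (0.43 versus 0.46). Overall accuracy is the same for both models, at 0.80.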

## Clustering

### K-Means

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

n_clusters_range = range(2, 11)

silhouette_scores = []

for n_clusters in n_clusters_range:
    # n_init is set explicitly to the default used in this run (10),
    # which avoids the FutureWarning about that default changing.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    kmeans.fit(df_train_chi2)
    cluster_labels = kmeans.labels_
    silhouette_avg = silhouette_score(df_train_chi2, cluster_labels)
    silhouette_scores.append(silhouette_avg)

best_n_clusters = n_clusters_range[silhouette_scores.index(max(silhouette_scores))]
print("Best number of clusters:", best_n_clusters)

kmeans = KMeans(n_clusters=best_n_clusters, n_init=10, random_state=42)
kmeans.fit(df_train_chi2)

Best number of clusters: 2


### Quality evaluation

In [None]:
# assign returns new DataFrames, which avoids pandas' SettingWithCopyWarning
df_train_chi2 = df_train_chi2.assign(cluster=kmeans.labels_)
df_val_chi2 = df_val_chi2.assign(cluster=kmeans.predict(df_val_chi2))


In [None]:
df_train_chi2

Unnamed: 0,loc,v_g,ev_g,v,l,lOComment,lOBlank,locCodeAndComment,total_Opnd,branchCount,cluster
0,0.009590,0.017370,0.000000,0.015136,0.059701,0.002907,0.050228,0.000000,0.019199,0.027888,0
1,0.005812,0.004963,0.000000,0.003731,0.104478,0.000000,0.009132,0.000000,0.004965,0.007968,0
2,0.001744,0.000000,0.000000,0.001027,0.432836,0.000000,0.004566,0.000000,0.002317,0.000000,1
3,0.002616,0.002481,0.000000,0.001962,0.283582,0.000000,0.004566,0.000000,0.002979,0.003984,1
4,0.004650,0.004963,0.000000,0.003138,0.179104,0.002907,0.009132,0.000000,0.004965,0.007968,0
...,...,...,...,...,...,...,...,...,...,...,...
81405,0.016565,0.017370,0.012195,0.018885,0.059701,0.008721,0.018265,0.184211,0.021185,0.027888,0
81406,0.017437,0.017370,0.024390,0.016629,0.074627,0.023256,0.054795,0.000000,0.022840,0.027888,0
81407,0.004359,0.007444,0.012195,0.002407,0.194030,0.000000,0.013699,0.000000,0.003310,0.011952,0
81408,0.004650,0.002481,0.000000,0.003403,0.253731,0.000000,0.004566,0.000000,0.004634,0.003984,0


Let us compare the clusters found with the true *labels*.

In [None]:
from sklearn.metrics.cluster import rand_score, adjusted_rand_score

print(rand_score(df_train_chi2['cluster'], y_train))
print(adjusted_rand_score(df_train_chi2['cluster'], y_train))

0.5297261483413154
-0.06888205288771096


The adjusted Rand score is very close to 0, and slightly negative. This means the KMeans cluster assignments agree with the true labels no better than a random assignment would. In other words, with these features the clusters do not recover the groups corresponding to class 0 and class 1.
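
A small sketch with toy labels (chosen for illustration) shows how the two scores behave: the adjusted Rand score ignores how cluster ids are named and can drop below 0, while the unadjusted Rand score stays near 0.5 even for an uninformative partition.

```python
from sklearn.metrics import adjusted_rand_score, rand_score

labels_true = [0, 0, 0, 1, 1, 1]

# Same partition with the cluster ids swapped.
swapped = [1, 1, 1, 0, 0, 0]
# A partition that cuts across both true classes.
mixed = [0, 1, 0, 1, 0, 1]

ari_swapped = adjusted_rand_score(labels_true, swapped)
ari_mixed = adjusted_rand_score(labels_true, mixed)
ri_mixed = rand_score(labels_true, mixed)

print("swapped ARI:", ari_swapped)
print("mixed   ARI:", ari_mixed)
print("mixed   RI :", ri_mixed)
```

The `mixed` case mirrors the pattern in the notebook's result, where the Rand score is about 0.53 and the adjusted score is slightly below zero.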