# **Projeto ponta a ponta : Classificando a saúde de fetos**

----

O objetivo desse projeto é, com base em caracteristicas relacionadas à saúde fetal, prever se um feto está com algum problema. O dataset foi retirado do [kaggle](https://www.kaggle.com/andrewmvd/fetal-health-classification) .

## 1. Aquisição dos dados

In [32]:
import pandas as pd
import plotly.express as px


In [33]:
df = pd.read_csv('https://raw.githubusercontent.com/MSR1805200/Portifolio/main/projetos_pessoais/fetal_health/data/fetal_health.csv')

In [34]:
df.head() 

Unnamed: 0,baseline value,accelerations,fetal_movement,uterine_contractions,light_decelerations,severe_decelerations,prolongued_decelerations,abnormal_short_term_variability,mean_value_of_short_term_variability,percentage_of_time_with_abnormal_long_term_variability,...,histogram_min,histogram_max,histogram_number_of_peaks,histogram_number_of_zeroes,histogram_mode,histogram_mean,histogram_median,histogram_variance,histogram_tendency,fetal_health
0,120.0,0.0,0.0,0.0,0.0,0.0,0.0,73.0,0.5,43.0,...,62.0,126.0,2.0,0.0,120.0,137.0,121.0,73.0,1.0,2.0
1,132.0,0.006,0.0,0.006,0.003,0.0,0.0,17.0,2.1,0.0,...,68.0,198.0,6.0,1.0,141.0,136.0,140.0,12.0,0.0,1.0
2,133.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.1,0.0,...,68.0,198.0,5.0,1.0,141.0,135.0,138.0,13.0,0.0,1.0
3,134.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.4,0.0,...,53.0,170.0,11.0,0.0,137.0,134.0,137.0,13.0,1.0,1.0
4,132.0,0.007,0.0,0.008,0.0,0.0,0.0,16.0,2.4,0.0,...,53.0,170.0,9.0,0.0,137.0,136.0,138.0,11.0,1.0,1.0


Observe abaixo a explicação dos atributos :

* **baseline value(Frequência cardíaca fetal basal)** - mostra a quantidade de batidas do coração por minuto

* **accelerations** - número de acelerações por segundo 

* **fetal_movement** - número de movimento fetal por segundo

* **uterine_contractions** - número de contrações uterinas por segundo

* **light_decelerations** - número de desacelerações leves por segundo

* **severe_decelerations** - número de desacelerações severas por segundo

* **prolongued_decelerations** - número de desacelerações prolongadas por segundo

* **abnormal_short_term_variability** - Porcentagem de tempo com variabilidade anormal de curto prazo

* **mean_value_of_short_term_variability** - Valor médio da variabilidade de curto prazo

* **percentage_of_time_with_abnormal_long_term_variability** - Porcentagem de tempo com variabilidade anormal de longo prazo

* **mean_value_of_long_term_variability** - Valor médio da variabilidade de curto prazo

* **histogram_width** - Largura do histograma feito usando todos os valores de um registro

* **histogram_min** - Valor mínimo do histograma

* **histogram_max** - Valor máximo do histograma

* **histogram_number_of_peaks** - Número de picos no histograma do exame

* **histogram_number_of_zeroes** - Número de zeros no histograma do exame
 
* **histogram_mode** - moda do histograma

* **histogram_mean** - média do histograma

* **histogram_median** - mediana do histograma

* **histogram_variance** - variancia do histograma

* **histogram_tendency** - Tendência do histograma

* **fetal_health(Vida fetal)** - 1: Normal, 2: Suspeita, 3: Patológica

## 2 . Divisão dos dados

Vamos observar como as classes estão distribuidas.

In [35]:
df['fetal_health'].value_counts()

1.0    1655
2.0     295
3.0     176
Name: fetal_health, dtype: int64

Como as classes estão mal distribuidas, se faz necessário estratificar os dados. Vamos realizar a divisão dos dados no início para evitar viéses.

In [36]:
from sklearn.model_selection import  train_test_split
train_set,test_set = train_test_split(df, random_state = 42, shuffle = True, stratify = df['fetal_health'])


## 3 . pré-processamento dos dados

Vamos ver se há algum valor nulo no conjunto de dados do treino.

In [37]:
train_set.isnull().sum()

baseline value                                            0
accelerations                                             0
fetal_movement                                            0
uterine_contractions                                      0
light_decelerations                                       0
severe_decelerations                                      0
prolongued_decelerations                                  0
abnormal_short_term_variability                           0
mean_value_of_short_term_variability                      0
percentage_of_time_with_abnormal_long_term_variability    0
mean_value_of_long_term_variability                       0
histogram_width                                           0
histogram_min                                             0
histogram_max                                             0
histogram_number_of_peaks                                 0
histogram_number_of_zeroes                                0
histogram_mode                          

Parece não ter nenhum valor nulo, mas não sabemos no conjunto de teste. Vamos observar se o conjunto de dados de treino precisa de alguma técnica de pré-processamento.

In [38]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1594 entries, 883 to 229
Data columns (total 22 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   baseline value                                          1594 non-null   float64
 1   accelerations                                           1594 non-null   float64
 2   fetal_movement                                          1594 non-null   float64
 3   uterine_contractions                                    1594 non-null   float64
 4   light_decelerations                                     1594 non-null   float64
 5   severe_decelerations                                    1594 non-null   float64
 6   prolongued_decelerations                                1594 non-null   float64
 7   abnormal_short_term_variability                         1594 non-null   float64
 8   mean_value_of_short_term_variability 

A tipagem está a mesma para todas as colunas.

In [39]:
train_set.sample(15,random_state= 42)

Unnamed: 0,baseline value,accelerations,fetal_movement,uterine_contractions,light_decelerations,severe_decelerations,prolongued_decelerations,abnormal_short_term_variability,mean_value_of_short_term_variability,percentage_of_time_with_abnormal_long_term_variability,...,histogram_min,histogram_max,histogram_number_of_peaks,histogram_number_of_zeroes,histogram_mode,histogram_mean,histogram_median,histogram_variance,histogram_tendency,fetal_health
1068,133.0,0.01,0.0,0.006,0.0,0.0,0.0,33.0,1.2,0.0,...,97.0,179.0,7.0,0.0,163.0,156.0,161.0,19.0,1.0,1.0
2107,136.0,0.0,0.001,0.006,0.0,0.0,0.0,74.0,1.0,21.0,...,107.0,149.0,2.0,0.0,137.0,135.0,138.0,1.0,1.0,1.0
2019,125.0,0.0,0.0,0.008,0.007,0.0,0.001,64.0,1.4,0.0,...,78.0,155.0,4.0,0.0,107.0,111.0,113.0,11.0,0.0,1.0
1285,112.0,0.0,0.0,0.005,0.0,0.0,0.0,26.0,1.3,0.0,...,90.0,127.0,2.0,1.0,114.0,114.0,116.0,2.0,1.0,1.0
1936,133.0,0.0,0.003,0.009,0.003,0.0,0.0,63.0,2.4,0.0,...,103.0,142.0,2.0,0.0,133.0,125.0,131.0,8.0,1.0,1.0
1889,141.0,0.005,0.0,0.007,0.0,0.0,0.0,58.0,0.6,3.0,...,108.0,171.0,3.0,1.0,156.0,154.0,157.0,5.0,1.0,1.0
1338,128.0,0.0,0.016,0.01,0.008,0.0,0.002,16.0,2.9,0.0,...,53.0,178.0,9.0,0.0,133.0,114.0,121.0,74.0,0.0,1.0
1221,135.0,0.0,0.0,0.007,0.0,0.0,0.0,50.0,0.6,0.0,...,118.0,152.0,0.0,0.0,137.0,136.0,138.0,1.0,0.0,1.0
463,120.0,0.006,0.001,0.001,0.0,0.0,0.0,51.0,1.3,3.0,...,59.0,172.0,16.0,1.0,117.0,127.0,129.0,23.0,0.0,1.0
1344,128.0,0.011,0.011,0.005,0.003,0.0,0.0,36.0,1.6,0.0,...,59.0,183.0,5.0,0.0,137.0,135.0,139.0,18.0,0.0,1.0


Ao que parece nenhuma coluna precisa ser pré-processada. Vamos observar como os dados estão se comportando.

In [40]:
train_set.describe()

Unnamed: 0,baseline value,accelerations,fetal_movement,uterine_contractions,light_decelerations,severe_decelerations,prolongued_decelerations,abnormal_short_term_variability,mean_value_of_short_term_variability,percentage_of_time_with_abnormal_long_term_variability,...,histogram_min,histogram_max,histogram_number_of_peaks,histogram_number_of_zeroes,histogram_mode,histogram_mean,histogram_median,histogram_variance,histogram_tendency,fetal_health
count,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,...,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0
mean,133.09473,0.003206,0.009964,0.004354,0.001848,3e-06,0.000164,46.77478,1.336324,9.900251,...,93.475533,163.81995,4.060853,0.314304,137.27478,134.442911,137.951694,18.959849,0.320577,1.304266
std,9.743306,0.003872,0.048223,0.002917,0.002911,5e-05,0.000602,17.312934,0.880302,18.526768,...,29.482176,17.935914,2.946808,0.665505,16.402763,15.52976,14.337718,29.671918,0.611249,0.614448
min,106.0,0.0,0.0,0.0,0.0,0.0,0.0,12.0,0.2,0.0,...,50.0,122.0,0.0,0.0,60.0,73.0,77.0,0.0,-1.0,1.0
25%,126.0,0.0,0.0,0.002,0.0,0.0,0.0,32.0,0.7,0.0,...,67.0,152.0,2.0,0.0,128.0,125.0,128.0,2.0,0.0,1.0
50%,133.0,0.002,0.0,0.004,0.0,0.0,0.0,48.0,1.2,0.0,...,94.0,161.0,3.0,0.0,139.0,136.0,139.0,7.0,0.0,1.0
75%,140.0,0.006,0.003,0.007,0.003,0.0,0.0,61.0,1.7,11.0,...,120.0,174.0,6.0,0.0,148.0,145.0,148.0,23.0,1.0,1.0
max,160.0,0.018,0.477,0.014,0.015,0.001,0.005,86.0,7.0,91.0,...,159.0,238.0,18.0,8.0,187.0,182.0,186.0,254.0,1.0,3.0


Ao que parece, algumas colunas estão com alguns valores discrepantes, como por exemplo a coluna **percentage_of_time_with_abnormal_long_term_variabilit**	que possui uma média de 9.900251	, mas sua mediana possui um valor de 0.000000. Entretanto de início não há necessidade de realizar um tratamento de outliers. A frente observaremos melhor esses dados.

## 4. Análise Exploratória

Vamos criar alguns histogramas para observar o comportamento de algumas colunas.

In [41]:
px.histogram(x =train_set['baseline value'], title  = 'Histograma da coluna : baseline value')


Os dados da **coluna baseline value** parecem ter uma forma parecida com uma curva normal.

In [42]:
px.box(y = train_set['baseline value'], color = train_set['fetal_health'], 
       title = 'boxplot da coluna: baseline value', labels = {'color' : 'Classe', 'y' : 'valores'}
       )

O histograma e o box plot confirmaram a hipótese de que os dados possuem outliers, porém são bem poucos. Vamos observar as outras colunas.

In [43]:
px.histogram(x =train_set['accelerations'], title  = 'Histograma da coluna : accelerations')


O gráfico acima mantém uma forma assimétrica positiva. Os valores mais frequentes estão bem mais perto do 0.

In [44]:
px.histogram(x = train_set['fetal_movement'], title  = 'Histograma da coluna : fetal_movement')


O padrão de assimetria positiva permanece para a  coluna acima. Observe que ela possui alguns outliers.

In [45]:
px.histogram(x =train_set['uterine_contractions'], title  = 'Histograma da coluna : uterine contractions')

In [46]:
px.box(y =train_set['uterine_contractions'], color = train_set['fetal_health'],
          title = 'boxplot da coluna: uterine contractions', labels = {'color' : 'Classe', 'y' : 'valores'}
       )

A coluna acima não possui muitos outliers.

In [47]:
px.histogram(x = train_set['light_decelerations'], title  = 'Histograma da coluna: light decelerations')


In [48]:
px.bar(x = train_set['severe_decelerations'].value_counts().sort_index().index,
       y = train_set['severe_decelerations'].value_counts().sort_index().values,
      color = train_set['severe_decelerations'].value_counts().sort_index().values,
       color_continuous_scale = 'RdBu',
       title = 'Frequência dos valores da coluna: severe decelerations',
       labels = {'color' : 'Quantidade', 'x' : 'Valor', 'y':'Quantidade'}
      )

A coluna acima tem apenas dois valores, tornando-a categórica.

In [49]:
px.histogram(x =train_set['prolongued_decelerations'],title  = 'Histograma da coluna: prolongued decelerations')

In [50]:
px.histogram( x =train_set['abnormal_short_term_variability'],title  = 'Histograma da coluna: abnormal short term variability')

In [51]:
px.box(y = train_set['abnormal_short_term_variability'], color = train_set['fetal_health'],
          title = 'boxplot da coluna: abnormal short term variability', labels = {'color' : 'Classe', 'y' : 'valores'}
       )

In [52]:
px.histogram(x =train_set['mean_value_of_short_term_variability'],title  = 'Histograma da coluna: mean value of short term variability')

In [53]:
px.box(y = train_set['mean_value_of_short_term_variability'], color = train_set['fetal_health'],
        title = 'boxplot da coluna : mean value of short term variability', labels = {'color' : 'Classe', 'y' : 'valores'}
       )

A coluna acima possui diversos  outliers na classe 1 e 2.

In [54]:
px.histogram( x = train_set['percentage_of_time_with_abnormal_long_term_variability'],title  = 'Histograma da coluna: percentage of time with abnormal long term variability')

In [55]:
px.box( y = train_set['percentage_of_time_with_abnormal_long_term_variability'], color = train_set['fetal_health'],
      title = 'boxplot da coluna : percentage of time with abnormal long term variability', labels = {'color' : 'Classe', 'y' : 'valores'}
       )

A coluna acima possui muitos outliers apenas na classe 1.

In [56]:
px.histogram( x =train_set['mean_value_of_long_term_variability'],title  = 'Histograma da coluna: mean value of long term variability')

In [57]:
px.box(y = train_set['mean_value_of_long_term_variability'], color = train_set['fetal_health'],
       title = 'boxplot da coluna : mean value of long term variability', labels = {'color' : 'Classe', 'y' : 'valores'}
      )

Em resumo, algumas colunas tem comportamentos de formam um comportamento assimétrico positivo, todavia algumas colunas tendem a se comportar como uma curva normal. Observamos também a presença de outliers que podem ser tratados, mas vamos deixar passar para observar como os algoritmos de aprendizado de máquina se comportam quando alimentados por esses dados.

## 5. Transformação e Treinamento

Primeiro, como nós não sabemos se há ou não valores nulos nos dados de teste, vamos realizar o processo de imputação de dados pela mediana.

In [58]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy = 'median')

x_train = train_set.drop(['fetal_health'],axis = 1).copy()
y_train = train_set['fetal_health'].copy()

imputer.fit(x_train)


SimpleImputer(strategy='median')

Agora vamos escalonar as features, para depois utilizarmos para a escolha do melhor modelo.

In [59]:
from sklearn.preprocessing import StandardScaler

sts = StandardScaler()

x_tr = sts.fit_transform(x_train)


In [60]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report


Para realização da avaliação priorizaremos a revocação acima de outras métricas, pois não queremos que o classificador classifique erroneamente que o feto está saudável sem de fato está, isso acabaria com as ações clínicas para melhorar sua saúde.

In [61]:
from sklearn.linear_model import LogisticRegression

pred = cross_val_predict(LogisticRegression(max_iter=1000),x_tr,y_train,cv =10)

print(classification_report(y_train,pred))

              precision    recall  f1-score   support

         1.0       0.94      0.95      0.95      1241
         2.0       0.65      0.65      0.65       221
         3.0       0.80      0.77      0.78       132

    accuracy                           0.89      1594
   macro avg       0.80      0.79      0.79      1594
weighted avg       0.89      0.89      0.89      1594



In [62]:
from sklearn.ensemble import RandomForestClassifier
pred = cross_val_predict(RandomForestClassifier(),x_tr,y_train,cv =10)

print(classification_report(y_train,pred))

              precision    recall  f1-score   support

         1.0       0.96      0.98      0.97      1241
         2.0       0.85      0.77      0.81       221
         3.0       0.93      0.86      0.90       132

    accuracy                           0.94      1594
   macro avg       0.92      0.87      0.89      1594
weighted avg       0.94      0.94      0.94      1594



In [63]:
from sklearn.neighbors import KNeighborsClassifier

pred = cross_val_predict(KNeighborsClassifier(),x_tr,y_train,cv = 10)

print(classification_report(y_train,pred))

              precision    recall  f1-score   support

         1.0       0.92      0.97      0.94      1241
         2.0       0.73      0.59      0.65       221
         3.0       0.95      0.76      0.84       132

    accuracy                           0.90      1594
   macro avg       0.87      0.77      0.81      1594
weighted avg       0.90      0.90      0.90      1594



In [64]:
from sklearn.svm import SVC

pred = cross_val_predict(SVC(),x_tr,y_train,cv = 10)

print(classification_report(y_train,pred))

              precision    recall  f1-score   support

         1.0       0.94      0.97      0.95      1241
         2.0       0.74      0.68      0.71       221
         3.0       0.95      0.80      0.87       132

    accuracy                           0.92      1594
   macro avg       0.88      0.81      0.84      1594
weighted avg       0.91      0.92      0.91      1594



In [65]:
from sklearn.svm import LinearSVC

pred = cross_val_predict(LinearSVC(max_iter = 5000),x_tr,y_train,cv = 10)

print(classification_report(y_train,pred))

              precision    recall  f1-score   support

         1.0       0.94      0.95      0.95      1241
         2.0       0.66      0.65      0.65       221
         3.0       0.83      0.77      0.80       132

    accuracy                           0.89      1594
   macro avg       0.81      0.79      0.80      1594
weighted avg       0.89      0.89      0.89      1594



In [66]:
from sklearn.linear_model import SGDClassifier

pred = cross_val_predict(SGDClassifier(),x_tr,y_train,cv = 10)

print(classification_report(y_train,pred))

              precision    recall  f1-score   support

         1.0       0.94      0.95      0.94      1241
         2.0       0.66      0.57      0.61       221
         3.0       0.77      0.80      0.78       132

    accuracy                           0.89      1594
   macro avg       0.79      0.77      0.78      1594
weighted avg       0.88      0.89      0.88      1594



Diante dos classificadores acima, podemos observar que o random forest consegue classificar melhor, logo ele será escolhido para a próxima etapa.

## 6. Aprimoramento e Teste

Vamos melhorar nosso classificador com a técnica de grid search.

In [67]:
from sklearn.model_selection import GridSearchCV

param = { 
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}

grid = GridSearchCV( RandomForestClassifier(),param, cv = 10)

grid.fit(x_tr,y_train)



GridSearchCV(cv=10, estimator=RandomForestClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [4, 5, 6, 7, 8],
                         'max_features': ['auto', 'sqrt', 'log2'],
                         'n_estimators': [200, 500]})

In [68]:
modelo = grid.best_estimator_

Para aprimorar mais ainda o modelo, vamos utilizar a técnica de classificação multilabel one vs one.

In [69]:
from sklearn.multiclass import OneVsOneClassifier

ovo_clf = OneVsOneClassifier(modelo)

Agora vamos por em prática o nosso modelo.

In [70]:

x_test = test_set.drop(['fetal_health'], axis= 1).copy()
y_test = test_set['fetal_health'].copy()

x_test_imp = imputer.transform(x_test)
x_test_tr = sts.transform(x_test_imp)


In [73]:
ovo_clf.fit(x_tr,y_train)

pred = ovo_clf .predict(x_test_tr)

print(classification_report(y_test,pred))

              precision    recall  f1-score   support

         1.0       0.94      0.98      0.96       414
         2.0       0.87      0.65      0.74        74
         3.0       0.93      0.91      0.92        44

    accuracy                           0.93       532
   macro avg       0.91      0.85      0.87       532
weighted avg       0.93      0.93      0.93       532



O modelo diminuiu sua performance comparado com seu treino, porém ainda manteve uma pontuação alta, logo ele está pronto para deploy.