<h1> Predicting dermatological diseases </h1>

In this application we are going to use the [Dermatology Data Set](https://archive.ics.uci.edu/ml/datasets/dermatology) from <b> UCI Machine Learning Repository</b>. It provides us 12 clinical attributes and 22 histopathological attributes related to 6 different diseases: <br>

1. Psoriasis
2. Seboreic dermatitis
3. Lichen planus
4. Pityriasis rosea
5. Cronic dermatitis
6. Pityriasis rubra pilaris
<br>

Our _goal_ is to <b> classify which disease the patient has based on these attributes</b>.

<h1> Preprocessing </h1>

For preprocessing our data set, we could change it for the one-hot-enconding approach. Since there is a lot of examples of that on the internet, we are going to try the multiclassification approach with integer values as our targets (diseases 1 to 6). <br>

Another thing here is since some algorithms cannot deal with missing values (NaN), we decided to delete these rows.
<br>

Ps.: we manually put the names of the attributes in the .csv file, but it can be done by code as well.

In [3]:
import pandas as pd

In [4]:
raw_dataset = pd.read_csv("Dermatology.csv", na_values="?")

In [5]:
dataset = raw_dataset.dropna()
dataset.reset_index(drop=True, inplace=True) #reorder rows

In [6]:
dataset.tail() #if you want to print everything, just type 'dataset'

Unnamed: 0,erythema,scaling,definiteBorders,itching,koebnerPhenomenon,polygonalPapules,follicularPapules,oralMucosal,kneeElbow,scalp,...,disappearance,vacuolisation,spongiosis,sawTooth,follicularPlug,perifollicular,mononuclear,bandLike,age,disease
353,2,1,1,0,1,0,0,0,0,0,...,0,0,1,0,0,0,2,0,25.0,4
354,3,2,1,0,1,0,0,0,0,0,...,1,0,1,0,0,0,2,0,36.0,4
355,3,2,2,2,3,2,0,2,0,0,...,0,3,0,3,0,0,2,3,28.0,3
356,2,1,3,1,2,3,0,2,0,0,...,0,2,0,1,0,0,2,3,50.0,3
357,3,2,2,0,0,0,0,0,3,3,...,2,0,0,0,0,0,3,0,35.0,1


<h1> Separating the data set </h1>

In [7]:
import __future__ #for future features in newer versions
import tensorflow
from tensorflow import keras
from tensorflow.keras import layers

In [8]:
train_dataset = dataset.sample(frac=0.75,random_state=0) #the random_state gives the seed for the randomization 
test_dataset = dataset.drop(train_dataset.index)

<b> The label is the value we want to predict (in this case, we want to predict the disease): </b>

In [9]:
train_labels = train_dataset.pop('disease')
test_labels = test_dataset.pop('disease')

<h1> Artificial Neural Network </h1>

For building our ANN we used some _rules-of-thumb_ presented in [this](https://towardsdatascience.com/17-rules-of-thumb-for-building-a-neural-network-93356f9930a) article. 

In [10]:
#remember: 34 features + 1 (disease)
def build_model():
    model = keras.Sequential([
        layers.Input(len(train_dataset.keys())), #input_shape = 34
        layers.Dense(16, activation = 'relu'),
        layers.Dense(8, activation = 'relu'),
        layers.Dense(7, activation = 'softmax'), #6 possible diseases
    ])

    model.compile(loss = 'sparse_categorical_crossentropy', optimizer = 'adam', metrics = ['acc']) 
    #sparse is used here because our target values are not one-hot-encoded, but integers

    return model

In [11]:
model = build_model()
print(model.summary())

#params = current *(previous +1)

W0917 14:37:06.242386  2156 deprecation.py:506] From C:\Users\amand\Anaconda3\lib\site-packages\tensorflow\python\ops\init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                560       
_________________________________________________________________
dense_1 (Dense)              (None, 8)                 136       
_________________________________________________________________
dense_2 (Dense)              (None, 7)                 63        
Total params: 759
Trainable params: 759
Non-trainable params: 0
_________________________________________________________________
None


<b> Now we have to see if the model training shows decreasing loss and any improvement in accuracy (acc): </b>

In [12]:
predicted_labels_ANN = model.fit(train_dataset, train_labels, epochs=85)

Epoch 1/85
Epoch 2/85
Epoch 3/85
Epoch 4/85
Epoch 5/85
Epoch 6/85
Epoch 7/85
Epoch 8/85
Epoch 9/85
Epoch 10/85
Epoch 11/85
Epoch 12/85
Epoch 13/85
Epoch 14/85
Epoch 15/85
Epoch 16/85
Epoch 17/85
Epoch 18/85
Epoch 19/85
Epoch 20/85
Epoch 21/85
Epoch 22/85
Epoch 23/85
Epoch 24/85
Epoch 25/85
Epoch 26/85
Epoch 27/85
Epoch 28/85
Epoch 29/85
Epoch 30/85
Epoch 31/85
Epoch 32/85
Epoch 33/85
Epoch 34/85
Epoch 35/85
Epoch 36/85
Epoch 37/85
Epoch 38/85
Epoch 39/85
Epoch 40/85
Epoch 41/85
Epoch 42/85
Epoch 43/85
Epoch 44/85
Epoch 45/85
Epoch 46/85
Epoch 47/85
Epoch 48/85
Epoch 49/85
Epoch 50/85
Epoch 51/85
Epoch 52/85
Epoch 53/85
Epoch 54/85
Epoch 55/85
Epoch 56/85
Epoch 57/85
Epoch 58/85
Epoch 59/85
Epoch 60/85
Epoch 61/85
Epoch 62/85
Epoch 63/85
Epoch 64/85
Epoch 65/85
Epoch 66/85
Epoch 67/85
Epoch 68/85
Epoch 69/85
Epoch 70/85
Epoch 71/85
Epoch 72/85
Epoch 73/85
Epoch 74/85
Epoch 75/85
Epoch 76/85
Epoch 77/85
Epoch 78/85
Epoch 79/85
Epoch 80/85
Epoch 81/85
Epoch 82/85
Epoch 83/85


Epoch 84/85
Epoch 85/85


In [13]:
from matplotlib import pyplot

pyplot.plot(predicted_labels_ANN.history['acc'])
pyplot.show()

<Figure size 640x480 with 1 Axes>

In [14]:
test_loss, test_acc = model.evaluate(train_dataset, train_labels)



In [15]:
final_results = model.evaluate(test_dataset, test_labels)



<h1> Naive Bayes </h1>

<b> Let's test the Gaussian Naive Bayes approach: </b>

In [16]:
from sklearn.naive_bayes import GaussianNB

modelB = GaussianNB().fit(train_dataset, train_labels) 

In [17]:
predicted_label = modelB.predict(test_dataset)

In [18]:
from sklearn.metrics import accuracy_score

accuracy_score = accuracy_score(test_labels, predicted_label) 
print (accuracy_score)

0.9111111111111111


<h1> Metrics </h1>

In [19]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import numpy as np

This is a function to display dataframes for a better analysis. For more details see [here](https://stackoverflow.com/questions/38783027/jupyter-notebook-display-two-pandas-tables-side-by-side).

In [20]:
from IPython.core.display import display, HTML

def display_side_by_side(dfs:list, captions:list):
    """Display tables side by side to save vertical space
    Input:
        dfs: list of pandas.DataFrame
        captions: list of table captions
    """
    output = ""
    combined = dict(zip(captions, dfs))
    for caption, df in combined.items():
        output += df.style.set_table_attributes("style='display:inline'").set_caption(caption)._repr_html_()
        output += "\xa0\xa0\xa0"
    display(HTML(output))

In [21]:
prediction_ANN = model.predict(test_dataset) #shows the probabilities for each disease given one row
predictions_ANN = np.argmax(prediction_ANN, axis = 1) #the highest probability value is taken by the disease for that row

labels = list(set(test_labels))
df_ANN = pd.DataFrame(
    data = confusion_matrix(test_labels, predictions_ANN, labels=labels),
    columns=labels,
    index=labels
)


df_GNB = pd.DataFrame(
    data  = confusion_matrix(test_labels, predicted_label, labels=labels),
    columns=labels,
    index=labels
)

display_side_by_side([df_ANN, df_GNB], ['ANN', 'GNB'])

Unnamed: 0,1,2,3,4,5,6
1,35,1,0,0,0,0
2,0,11,0,1,0,0
3,0,0,14,0,0,0
4,0,1,0,13,0,0
5,0,0,0,0,12,0
6,0,0,0,0,0,2

Unnamed: 0,1,2,3,4,5,6
1,36,0,0,0,0,0
2,0,5,0,6,0,1
3,0,0,14,0,0,0
4,1,0,0,13,0,0
5,0,0,0,0,12,0
6,0,0,0,0,0,2


Before we present the metrics associated to the confusion matrix, let's do some recap about them. <br>

 - **Accuracy**: How often is the classifier correct?
 - **Precision**: Of those it classified correctly, how many were they?
 - **Recall**: When it belongs to some X class, how often it classifies as X?
 - **F1-score**: General quality (combines precision and recall). The higher the score, the better the model.
 <br>

The more rigorous we are to predict correctly (improve precision) the less we are willing to make mistakes (increase recall).
<br>
 
Formulas: <br>
 - **Accuracy**: $\frac{TP + TN}{total}$
 - **Precision**: $\frac{TP}{TP+FP}$
 - **Recall**: $\frac{TP}{TP+FN}$
 - **F1-score**:  $\frac{2 * precision * recall}{precision + recall}$

In [22]:
print(classification_report(test_labels, predictions_ANN, digits=4))
print(classification_report(test_labels, predicted_label, digits=4))

              precision    recall  f1-score   support

           1     1.0000    0.9722    0.9859        36
           2     0.8462    0.9167    0.8800        12
           3     1.0000    1.0000    1.0000        14
           4     0.9286    0.9286    0.9286        14
           5     1.0000    1.0000    1.0000        12
           6     1.0000    1.0000    1.0000         2

    accuracy                         0.9667        90
   macro avg     0.9625    0.9696    0.9657        90
weighted avg     0.9684    0.9667    0.9673        90

              precision    recall  f1-score   support

           1     0.9730    1.0000    0.9863        36
           2     1.0000    0.4167    0.5882        12
           3     1.0000    1.0000    1.0000        14
           4     0.6842    0.9286    0.7879        14
           5     1.0000    1.0000    1.0000        12
           6     0.6667    1.0000    0.8000         2

    accuracy                         0.9111        90
   macro avg     0.8873