## Dermatology Using Naive Bayes Algorithm 
Download the Dermatology data set from the UCI repository, apply Naive Bayes classification, and analyze the performance. Display the confusion matrix and discuss it. Vary the train/test split [(60%, 40%), (70%, 30%), (80%, 20%)] and discuss its effect, if any. (source: https://archive.ics.uci.edu/ml/datasets/Dermatology)


In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [12]:
df = pd.read_csv('data_sets/dermatology.data')
df.columns = [
    "erythema","scaling","definite_borders","itching","koebner_phenomenon",
    "polygonal_papules","follicular_papules","oral_mucosal_involvement",
    "knee_and_elbow_involvement","scalp_involvement","family_history",
    "melanin_incontinence","eosinophils_in_the_infiltrate","PNL_infiltrate",
    "fibrosis_of_the_papillary_dermis","exocytosis","acanthosis",
    "hyperkeratosis","parakeratosis","clubbing_of_the_rete_ridges",
    "elongation_of_the_rete_ridges","thinning_of_the_suprapapillary_epidermis",
    "spongiform_pustule","munro_microabcess","focal_hypergranulosis",
    "disappearance_of_the_granular_layer","vacuolisation_and_damage_of_basal_layer",
    "spongiosis","saw_tooth_appearance_of_retes","follicular_horn_plug",
    "perifollicular_parakeratosis","inflammatory_monoluclear_inflitrate",
    "band_like_infiltrate","age","class"
]
df.head()

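| | erythema | scaling | definite_borders | itching | koebner_phenomenon | polygonal_papules | follicular_papules | oral_mucosal_involvement | knee_and_elbow_involvement | scalp_involvement | ... | disappearance_of_the_granular_layer | vacuolisation_and_damage_of_basal_layer | spongiosis | saw_tooth_appearance_of_retes | follicular_horn_plug | perifollicular_parakeratosis | inflammatory_monoluclear_inflitrate | band_like_infiltrate | age | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 3 | 3 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 8 | 1 |
| 1 | 2 | 1 | 2 | 3 | 1 | 3 | 0 | 3 | 0 | 0 | ... | 0 | 2 | 3 | 2 | 0 | 0 | 2 | 3 | 26 | 3 |
| 2 | 2 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 40 | 1 |
| 3 | 2 | 3 | 2 | 2 | 2 | 2 | 0 | 2 | 0 | 0 | ... | 2 | 3 | 2 | 3 | 0 | 0 | 2 | 3 | 45 | 3 |
| 4 | 2 | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 41 | 2 |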
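
The class counts printed later in this notebook sum to 365, while the UCI page lists 366 instances. One plausible cause is that `read_csv` treats the first record of a header-less file as column names. The sketch below illustrates the difference on a made-up three-record stand-in for the file; the rows and the shortened `names` list are assumptions for illustration only. It also shows `na_values="?"`, which parses the `?` placeholder as NaN at load time.

```python
import io
import pandas as pd

# Hypothetical three-record stand-in for a header-less data file;
# '?' marks a missing age, as in the dermatology data.
raw = "2,0,55,2\n3,1,?,1\n2,3,26,3\n"
names = ["erythema", "scaling", "age", "class"]  # shortened for illustration

# Default call: the first record is consumed as the header row.
lossy = pd.read_csv(io.StringIO(raw))
lossy.columns = names

# header=None keeps every record; na_values="?" turns the placeholder into NaN.
full = pd.read_csv(io.StringIO(raw), header=None, names=names, na_values="?")

print(len(lossy), len(full))
print(full["age"].dtype, full["age"].isna().sum())
```

If this is indeed what happened here, passing `header=None` and `names=[...]` to the original `read_csv` call would keep the first record, and the outputs further down would change accordingly.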


In [13]:
df.isnull().sum()

erythema                                    0
scaling                                     0
definite_borders                            0
itching                                     0
koebner_phenomenon                          0
polygonal_papules                           0
follicular_papules                          0
oral_mucosal_involvement                    0
knee_and_elbow_involvement                  0
scalp_involvement                           0
family_history                              0
melanin_incontinence                        0
eosinophils_in_the_infiltrate               0
PNL_infiltrate                              0
fibrosis_of_the_papillary_dermis            0
exocytosis                                  0
acanthosis                                  0
hyperkeratosis                              0
parakeratosis                               0
clubbing_of_the_rete_ridges                 0
elongation_of_the_rete_ridges               0
thinning_of_the_suprapapillary_epi

In [14]:
# Replace the '?' placeholder with NaN and convert the column to float
df['age'] = df['age'].replace('?', np.nan).astype(float)

# Fill missing ages with the median. The result is assigned back instead of
# calling fillna(inplace=True) on a column selection, which pandas flags as
# chained assignment and which will stop working in pandas 3.0.
df['age'] = df['age'].fillna(df['age'].median())


In [15]:
print(df['class'].value_counts())

print(" Class Number\tDisease Name")
print("1                Psoriasis")
print("2                Seborrheic Dermatitis")
print("3                Lichen Planus")
print("4                Pityriasis Rosea")
print("5                Chronic Dermatitis")
print("6                Pityriasis Rubra Pilaris")


class
1    112
3     72
2     60
5     52
4     49
6     20
Name: count, dtype: int64
 Class Number	Disease Name
1                Psoriasis
2                Seborrheic Dermatitis
3                Lichen Planus
4                Pityriasis Rosea
5                Chronic Dermatitis
6                Pityriasis Rubra Pilaris
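
The lookup table printed above can also be attached to the counts directly, which makes the class imbalance easier to read. This is a small sketch with the counts copied from the output above; `CLASS_NAMES` and `summary` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Class-number to disease-name mapping, as printed in the cell above.
CLASS_NAMES = {
    1: "Psoriasis",
    2: "Seborrheic Dermatitis",
    3: "Lichen Planus",
    4: "Pityriasis Rosea",
    5: "Chronic Dermatitis",
    6: "Pityriasis Rubra Pilaris",
}

# Counts copied from the value_counts() output above.
counts = pd.Series({1: 112, 3: 72, 2: 60, 5: 52, 4: 49, 6: 20}, name="count")

summary = pd.DataFrame(
    {
        "disease": [CLASS_NAMES[c] for c in counts.index],
        "count": counts.to_numpy(),
        "share": (counts / counts.sum()).round(3).to_numpy(),
    },
    index=counts.index,
)
print(summary)
```

In the notebook itself, `df['class'].value_counts()` would take the place of the hardcoded `counts`. The smallest class, Pityriasis Rubra Pilaris, makes up roughly 5% of the records, so a small test set contains only a handful of its cases.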


In [16]:
X = df.drop('class', axis=1)
y = df['class']

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
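
The exercise asks for three train/test ratios, whereas the cell above fixes an 80/20 split. A minimal sketch of sweeping the ratios follows. `evaluate_splits` is a hypothetical helper, the synthetic data from `make_classification` only stands in for `X` and `y`, and `stratify=y` is an added assumption (the notebook's own split is not stratified) meant to keep the small classes represented in every test set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def evaluate_splits(X, y, test_sizes=(0.4, 0.3, 0.2), seed=42):
    """Fit GaussianNB once per test-set fraction; return {test_size: accuracy}."""
    results = {}
    for test_size in test_sizes:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed, stratify=y
        )
        model = GaussianNB().fit(X_tr, y_tr)
        results[test_size] = accuracy_score(y_te, model.predict(X_te))
    return results


# Synthetic stand-in data; in the notebook, pass the real X and y instead.
X_demo, y_demo = make_classification(
    n_samples=365, n_features=34, n_informative=10, n_classes=6,
    n_clusters_per_class=1, random_state=0,
)
for test_size, acc in evaluate_splits(X_demo, y_demo).items():
    print(f"train {1 - test_size:.0%} / test {test_size:.0%}: accuracy = {acc:.3f}")
```

Accuracy from a single split depends on which rows land in the test set, so repeating the sweep over several `seed` values gives a fairer picture of how the ratio affects performance.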




In [18]:
# Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

GaussianNB()


In [19]:
y_pred = model.predict(X_test)

In [20]:
# Results
print(" Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

 Accuracy: 0.8493150684931506

Classification Report:
               precision    recall  f1-score   support

           1       1.00      1.00      1.00        20
           2       1.00      0.21      0.35        14
           3       1.00      1.00      1.00        13
           4       0.47      1.00      0.64         9
           5       0.91      1.00      0.95        10
           6       1.00      1.00      1.00         7

    accuracy                           0.85        73
   macro avg       0.90      0.87      0.82        73
weighted avg       0.92      0.85      0.83        73


Confusion Matrix:
 [[20  0  0  0  0  0]
 [ 0  3  0 10  1  0]
 [ 0  0 13  0  0  0]
 [ 0  0  0  9  0  0]
 [ 0  0  0  0 10  0]
 [ 0  0  0  0  0  7]]
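
To read the matrix by disease rather than by position, the rows (true class) and columns (predicted class) can be labelled. The sketch below hardcodes the matrix printed above so that it runs on its own; `labels` and `cm_df` are illustrative names, and in the notebook `confusion_matrix(y_test, y_pred)` would supply `cm`.

```python
import numpy as np
import pandas as pd

labels = [
    "Psoriasis", "Seborrheic Dermatitis", "Lichen Planus",
    "Pityriasis Rosea", "Chronic Dermatitis", "Pityriasis Rubra Pilaris",
]

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([
    [20, 0, 0, 0, 0, 0],
    [0, 3, 0, 10, 1, 0],
    [0, 0, 13, 0, 0, 0],
    [0, 0, 0, 9, 0, 0],
    [0, 0, 0, 0, 10, 0],
    [0, 0, 0, 0, 0, 7],
])
cm_df = pd.DataFrame(cm, index=labels, columns=labels)

# Recall = diagonal / row sum; precision = diagonal / column sum.
recall = pd.Series(np.diag(cm) / cm.sum(axis=1), index=labels)
precision = pd.Series(np.diag(cm) / cm.sum(axis=0), index=labels)
accuracy = np.trace(cm) / cm.sum()

print(cm_df)
print(pd.DataFrame({"precision": precision, "recall": recall}).round(2))
print(f"accuracy = {accuracy:.4f}")
```

All 11 errors come from Seborrheic Dermatitis: 10 of its 14 test cases are predicted as Pityriasis Rosea and 1 as Chronic Dermatitis. This accounts for the 0.21 recall of class 2 and the 0.47 precision of class 4 in the report, while the other four classes are classified without error on this split.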
