The Dermatology dataset is a small tabular dataset for multi-class classification: the task is to diagnose one of six erythemato-squamous skin diseases from clinical and histopathological attributes. It is not bundled with scikit-learn, but it can be downloaded from OpenML with scikit-learn's `fetch_openml` function, which makes it convenient for building and evaluating classifiers.

Here's an in-depth explanation of the Dermatology dataset:

**Origin:** The dataset comes from the UCI Machine Learning Repository, where it was contributed for research on the differential diagnosis of erythemato-squamous skin diseases. A copy is hosted on OpenML and can be fetched with scikit-learn's `fetch_openml` function.

**Data Description:**
The dataset contains records for 366 patients. Each record combines clinical and histopathological attributes and is labeled with one of six diagnoses, which makes this a multi-class classification problem.

Here are the key attributes of the dataset:

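1. **Features:** The dataset contains 34 features. According to the UCI documentation, 12 are clinical (for example erythema, scaling, itching, family history, and age) and 22 are histopathological (for example acanthosis, hyperkeratosis, and parakeratosis). Most are ordinal grades from 0 to 3, where 0 means the feature is absent and 3 is the largest amount; family history is binary and age is given in years.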

2. **Target Variable:** The target variable is the class label, which represents the diagnosis of the skin disease. There are six possible classes (class labels) representing different skin diseases:
   - Class 1: Psoriasis
   - Class 2: Seborrheic Dermatitis
   - Class 3: Lichen Planus
   - Class 4: Pityriasis Rosea
   - Class 5: Chronic Dermatitis
   - Class 6: Pityriasis Rubra Pilaris
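
When the integer labels need to be reported as disease names, a small lookup table is enough. The sketch below is illustrative only: `CLASS_NAMES` and `label_to_name` are hypothetical names, not part of the dataset or of scikit-learn, and the mapping simply restates the list above.

```python
# Hypothetical helper: maps the integer class labels listed above to names.
CLASS_NAMES = {
    1: "Psoriasis",
    2: "Seborrheic Dermatitis",
    3: "Lichen Planus",
    4: "Pityriasis Rosea",
    5: "Chronic Dermatitis",
    6: "Pityriasis Rubra Pilaris",
}


def label_to_name(label):
    """Return the disease name for a class label (1-6, int or numeric string)."""
    return CLASS_NAMES[int(label)]


print(label_to_name(3))
```

The `int(...)` conversion is there because labels fetched from OpenML may arrive as strings or categoricals rather than integers.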

**Use Cases:**
The Dermatology dataset is primarily used for training and evaluating machine learning models for multi-class classification tasks related to dermatology. Some potential use cases for this dataset include:

1. **Disease Diagnosis:** It can be used to build models for automated dermatological disease diagnosis based on patient data.

2. **Research:** Researchers in the field of dermatology can use this dataset to study patterns and relationships between various dermatological attributes and skin diseases.

**Preprocessing and Handling Missing Values:**
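In practice, the dataset needs light preprocessing: missing values must be handled (according to the UCI documentation they occur only in the age feature), and scale-sensitive models such as SVMs, k-nearest neighbors, and neural networks benefit from feature scaling. Missing values can be imputed with a simple strategy such as the mean or median, or with more sophisticated methods. Ideally the imputer is fitted on the training data only, so that no information from the test set leaks into the model.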
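
As a minimal sketch of mean imputation fitted on the training portion only, the example below uses a tiny made-up array rather than the dermatology data. The variable names (`X_train_toy`, `X_test_toy`) are assumptions for illustration; `SimpleImputer` and its `strategy` parameter are the same scikit-learn API used later in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up): the second column stands in for a feature with gaps.
X_train_toy = np.array([[1.0, 20.0], [2.0, np.nan], [3.0, 40.0]])
X_test_toy = np.array([[0.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train_toy)  # learn column means on train
X_test_imp = imputer.transform(X_test_toy)        # reuse the training means

print(X_train_imp[1, 1], X_test_imp[0, 1])
```

Both missing entries are filled with the mean of the observed training values in that column, so the test row never influences the statistic.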

**Evaluation Metrics:**
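Standard classification metrics apply here: accuracy, precision, recall, F1-score, and the confusion matrix. The classes are imbalanced (the test split below contains 31 samples of class 1 but only 3 of class 6), so per-class and macro-averaged scores are more informative than accuracy alone.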
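
To make the difference between the averaging modes concrete, here is a small sketch on made-up labels (these are not results from the dataset). It uses scikit-learn's `confusion_matrix` and `f1_score`; the `_toy` variable names are assumptions for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels for illustration; not results from the dataset.
y_true_toy = [1, 1, 1, 2, 2, 3]
y_pred_toy = [1, 1, 2, 2, 2, 3]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[1, 2, 3])

# Macro averaging weights every class equally; weighted averaging
# weights each class by its number of true samples.
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro")
weighted_f1 = f1_score(y_true_toy, y_pred_toy, average="weighted")

print(cm)
print(f"macro F1: {macro_f1:.3f}, weighted F1: {weighted_f1:.3f}")
```

The two averages differ whenever class sizes differ, which is why the classification reports further down print both.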

**Availability:**
The dataset is not shipped with scikit-learn itself. The `fetch_openml` function downloads it from OpenML on first use and caches it locally, so the first call needs an internet connection.

In [4]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, classification_report



In [6]:
# Load the Dermatology dataset
data = fetch_openml(name="dermatology", version=1)
X, y = data.data, data.target.astype(int)





In [8]:
X.shape

(366, 34)

In [9]:
y.shape

(366,)

In [10]:
X.isna().sum()

erythema                                    0
scaling                                     0
definite_borders                            0
itching                                     0
koebner_phenomenon                          0
polygonal_papules                           0
follicular_papules                          0
oral_mucosal_involvement                    0
knee_and_elbow_involvement                  0
scalp_involvement                           0
family_history                              0
melanin_incontinence                        0
eosinophils_in_the_infiltrate               0
PNL_infiltrate                              0
fibrosis_of_the_papillary_dermis            0
exocytosis                                  0
acanthosis                                  0
hyperkeratosis                              0
parakeratosis                               0
clubbing_of_the_rete_ridges                 0
elongation_of_the_rete_ridges               0
thinning_of_the_suprapapillary_epi

In [11]:
# Replace missing values (NaN) with the mean value for each feature
imputer = SimpleImputer(strategy="mean")
X = imputer.fit_transform(X)

In [19]:
X[233]

array([ 2.,  2.,  2.,  1.,  1.,  0.,  0.,  0.,  2.,  0.,  1.,  0.,  0.,
        2.,  0.,  1.,  2.,  1.,  2.,  2.,  2.,  2.,  1.,  1.,  0.,  1.,
        0.,  0.,  0.,  0.,  0.,  2.,  0., 60.])

In [14]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
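
The split above is not stratified, so rare classes can end up with very few test samples. One possible variation, sketched here on made-up labels rather than the dermatology data, passes the `stratify` argument of `train_test_split` to keep class proportions equal in both parts. The `_toy` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 40 samples of class 1 and 10 of class 6.
y_toy = np.array([1] * 40 + [6] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the 4:1 class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("test class 1:", int((y_te == 1).sum()), "test class 6:", int((y_te == 6).sum()))
```

Switching the notebook to a stratified split would change which samples land in the test set, so the numbers reported below would no longer match exactly.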


In [15]:
# Create a dictionary of classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=10000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'SVM': SVC(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
    'MLP Neural Network': MLPClassifier(max_iter=10000),
    'Linear Discriminant Analysis': LinearDiscriminantAnalysis()
}


In [16]:
# Create a pipeline that includes preprocessing (scaling) and classification
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Standardize features
    ('classifier', None)  # The classifier will be set dynamically
])
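
A possible refinement, shown here as a sketch rather than the approach used in this notebook, is to move imputation into the pipeline as well, so that imputation and scaling statistics are learned only from the training folds during cross-validation. The example runs on synthetic stand-in data from `make_classification`; the names `full_pipeline`, `X_syn`, and `y_syn` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the dermatology data), with a few NaNs injected.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)
rng = np.random.default_rng(0)
X_syn[rng.integers(0, 200, size=10), 0] = np.nan

# Every preprocessing step is fitted inside each cross-validation fold.
full_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(full_pipeline, X_syn, y_syn, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")
```

With only 366 samples in the real dataset, a cross-validated score of this kind tends to be a steadier estimate than a single 80/20 split.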


In [17]:
# Train and evaluate each classifier using the pipeline
for clf_name, clf in classifiers.items():
    pipeline.set_params(classifier=clf)  # Set the classifier in the pipeline
    pipeline.fit(X_train, y_train)  # Train the model
    y_pred = pipeline.predict(X_test)  # Make predictions
    accuracy = accuracy_score(y_test, y_pred)  # Evaluate accuracy
    report = classification_report(y_test, y_pred)  # Generate classification report
    print(f'{clf_name} Accuracy: {accuracy:.2f}')
    print(f'Classification Report for {clf_name}:\n{report}\n')


Logistic Regression Accuracy: 0.99
Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00        31
           2       0.90      1.00      0.95         9
           3       1.00      1.00      1.00        13
           4       1.00      0.88      0.93         8
           5       1.00      1.00      1.00        10
           6       1.00      1.00      1.00         3

    accuracy                           0.99        74
   macro avg       0.98      0.98      0.98        74
weighted avg       0.99      0.99      0.99        74


Decision Tree Accuracy: 0.99
Classification Report for Decision Tree:
              precision    recall  f1-score   support

           1       0.97      1.00      0.98        31
           2       1.00      1.00      1.00         9
           3       1.00      1.00      1.00        13
           4       1.00      1.00      1.00         8
           5       1.00      0.90

In [28]:
import copy
import pickle

# Initialize variables to track the best model and its accuracy
best_model = None
best_accuracy = 0.0
best_classifier_name = ""

# Train and evaluate each classifier using the pipeline
for clf_name, clf in classifiers.items():
    pipeline.set_params(classifier=clf)  # Set the classifier in the pipeline
    pipeline.fit(X_train, y_train)  # Train the model
    y_pred = pipeline.predict(X_test)  # Make predictions
    accuracy = accuracy_score(y_test, y_pred)  # Evaluate accuracy
    report = classification_report(y_test, y_pred)  # Generate classification report
    print(f'{clf_name} Accuracy: {accuracy:.2f}')
    print(f'Classification Report for {clf_name}:\n{report}\n')

    # Keep the best model so far (on a tie, the earlier classifier wins).
    # The shared pipeline object is refit on every iteration, so store a
    # deep copy of the fitted pipeline instead of a reference to it;
    # otherwise the pickled model would always be the last classifier.
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = copy.deepcopy(pipeline)
        best_classifier_name = clf_name

# Save the best model to a pickle file
with open("best_model.pkl", "wb") as model_file:
    pickle.dump(best_model, model_file)

# Display the best algorithm and its evaluation
print(f'Best Algorithm: {best_classifier_name}')
print(f'Best Accuracy: {best_accuracy:.2f}')

Logistic Regression Accuracy: 0.99
Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00        31
           2       0.90      1.00      0.95         9
           3       1.00      1.00      1.00        13
           4       1.00      0.88      0.93         8
           5       1.00      1.00      1.00        10
           6       1.00      1.00      1.00         3

    accuracy                           0.99        74
   macro avg       0.98      0.98      0.98        74
weighted avg       0.99      0.99      0.99        74


Decision Tree Accuracy: 0.97
Classification Report for Decision Tree:
              precision    recall  f1-score   support

           1       0.94      1.00      0.97        31
           2       1.00      1.00      1.00         9
           3       1.00      1.00      1.00        13
           4       1.00      0.88      0.93         8
           5       1.00      0.90