Importing needed libraries:

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

Importing Training and Validation Data:

In [2]:
df_training = pd.read_excel('training_data.xlsx')
df_validation = pd.read_excel('validation_data.xlsx')

Splitting datasets into features and targets:

In [3]:
X_train = df_training[['Aire','Peremeter','Circularity','Moyenne_H','Moyenne_S','Moyenne_V']]
y_train = df_training[['melanoma','seborrheic_keratosis']]

X_val = df_validation[['Aire','Peremeter','Circularity','Moyenne_H','Moyenne_S','Moyenne_V']]
y_val = df_validation[['melanoma','seborrheic_keratosis']]

Creating Random Forest Classification Model:

In [4]:
# define the model
model = RandomForestClassifier(random_state=1)

# fit the model on the training dataset
model.fit(X_train, y_train)

# performing classification on validation features
y_pred = model.predict(X_val)

# calculating accuracy score
accuracy = accuracy_score(y_val, y_pred)
accuracy

0.6133333333333333
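
A note on how this score is computed: y_val has two columns, so (assuming they hold 0/1 indicators) accuracy_score treats the task as multilabel and returns subset accuracy, where a sample counts as correct only when both labels match. The sketch below uses small made-up label arrays, not the notebook's data, to show how a per-label accuracy could be reported alongside it. The names label_names, y_true and y_hat are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels invented for illustration; columns follow the notebook's target order.
label_names = ['melanoma', 'seborrheic_keratosis']
y_true = np.array([[1, 0], [0, 1], [0, 0], [1, 0]])
y_hat = np.array([[1, 0], [0, 0], [0, 0], [0, 0]])

# Subset accuracy: a row is correct only if every label in it matches.
subset_accuracy = accuracy_score(y_true, y_hat)

# Per-label accuracy: score each target column on its own.
per_label_accuracy = {
    name: accuracy_score(y_true[:, i], y_hat[:, i])
    for i, name in enumerate(label_names)
}

print(subset_accuracy)
print(per_label_accuracy)
```

In this toy case two of four rows match exactly, while each individual label is right three times out of four, so the subset score is the stricter of the two views.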

Hyperparameter Optimization via Grid Search: Improving the model accuracy

Hyperparameters to explore:
- Number of features considered per split (listed as an option; not varied in cell In [5])
- Number of trees (10 to 500, in steps of 10)
- Tree depth (1 to 7)

In [5]:
# candidate values: 10 to 500 trees in steps of 10, depths 1 to 7
tree_list = np.arange(10,510,10)
depth_list = np.arange(1,8)
tree_value_list = []
depth_value_list = []
accuracy_value_list = []
counter = 0

# fit one model per (trees, depth) pair and score it on the validation set
# note: random_state is not set here, so accuracies can vary between runs
for nbr_trees in tree_list:
    for nbr_depth in depth_list:
        model_optimized = RandomForestClassifier(n_estimators = nbr_trees, max_depth = nbr_depth)
        model_optimized.fit(X_train, y_train)
        y_pred = model_optimized.predict(X_val)
        accuracy = accuracy_score(y_val, y_pred)
        accuracy_value_list.append(accuracy)
        tree_value_list.append(nbr_trees)
        depth_value_list.append(nbr_depth)
        counter += 1
        print(f'Models Executed: {counter}/{len(tree_list)*len(depth_list)}', end = '\r')

# collect the results, one row per fitted model
df_results = pd.DataFrame({'nbr_tree': tree_value_list, 'nbr_depth': depth_value_list, 'accuracy': accuracy_value_list})

Models Executed: 350/350

In [67]:
df_results[df_results['accuracy'] == df_results['accuracy'].max()]

     nbr_tree  nbr_depth  accuracy
139       200          7  0.626667
286       410          7  0.626667
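
The nested loops in cell In [5] could also be written with scikit-learn's ParameterGrid, which makes it easy to add the number of features per split (max_features) to the search and to fix random_state so the results are repeatable. The following is a minimal sketch under stated assumptions, not the notebook's method: it runs on synthetic data from make_multilabel_classification (6 features, 2 labels) rather than the Excel files, uses a deliberately small grid, and the names X_demo_train, X_demo_val, y_demo_train, y_demo_val, df_demo_results and best_row are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in for the notebook's data (assumption: 6 features, 2 labels).
X_demo, y_demo = make_multilabel_classification(
    n_samples=300, n_features=6, n_classes=2, random_state=0
)
X_demo_train, X_demo_val, y_demo_train, y_demo_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Small illustrative grid; the notebook's own ranges are much larger.
param_grid = ParameterGrid({
    'n_estimators': [10, 50],
    'max_depth': [1, 4, 7],
    'max_features': [2, 'sqrt'],
})

rows = []
for params in param_grid:
    # A fixed random_state makes each configuration's score repeatable.
    model = RandomForestClassifier(random_state=1, **params)
    model.fit(X_demo_train, y_demo_train)
    score = accuracy_score(y_demo_val, model.predict(X_demo_val))
    rows.append({**params, 'accuracy': score})

# One row per configuration, then pick the first row with the top accuracy.
df_demo_results = pd.DataFrame(rows)
best_row = df_demo_results.loc[df_demo_results['accuracy'].idxmax()]
print(best_row)
```

To apply the same pattern to the notebook's data, the synthetic arrays would be swapped for X_train, y_train, X_val and y_val.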


Using Optimized Model Hyperparameters:

In [72]:
# define the model
model_optimized = RandomForestClassifier(n_estimators = 200, max_depth = 7, random_state = 1)

# fit the model on the training dataset
model_optimized.fit(X_train, y_train)

# performing classification on validation features
y_pred = model_optimized.predict(X_val)

# calculating accuracy score
accuracy = accuracy_score(y_val, y_pred)
accuracy

0.62
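
The refit scores 0.62, slightly below the 0.626667 that the grid search reported for the same hyperparameters. A likely explanation is that the search loop did not fix random_state while the refit uses random_state = 1, and a random forest's bootstrap samples and feature choices depend on the seed. The sketch below illustrates this effect on synthetic data (an assumption; it does not use the notebook's Excel files), with the hypothetical helper validation_accuracy and illustrative variable names.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (assumption: 6 features, 2 labels).
X_demo, y_demo = make_multilabel_classification(
    n_samples=300, n_features=6, n_classes=2, random_state=0
)
X_demo_train, X_demo_val, y_demo_train, y_demo_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)


def validation_accuracy(seed):
    """Fit the tuned configuration with a given seed and score it on the held-out split."""
    model = RandomForestClassifier(n_estimators=200, max_depth=7, random_state=seed)
    model.fit(X_demo_train, y_demo_train)
    return accuracy_score(y_demo_val, model.predict(X_demo_val))


# Same hyperparameters, different seeds: the scores may differ from seed to seed.
scores = [validation_accuracy(seed) for seed in range(5)]
print(scores)
print('spread:', max(scores) - min(scores))
```

Reporting the mean and spread over several seeds, rather than a single run, gives a fairer picture of whether a change in hyperparameters actually helps.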

Due to a tight project deadline, only a baseline Random Forest model was implemented. It reached a validation accuracy of about 61% with default hyperparameters and 62% after tuning the number of trees and the tree depth. With more time, further image preprocessing and feature engineering (e.g., normalization, segmentation, and augmentation) could be applied to improve performance.