## ZOIDBERG

In [93]:
import os
import time
import cv2
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm, metrics
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB
from tqdm import tqdm

## Models used
- KNN (K-Nearest Neighbors)
- SVM (Support Vector Machine)
- MLP_classifier (Multi-Layer Perceptron classifier)
- Naive Bayes
- Extremely Randomized Trees

## Get Data

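load_data(): This function loads the JPEG images from the train and test directories, resizes each one to 64×64 pixels, converts it to grayscale, and flattens it into a 4096-value vector. It returns four lists: train_images, train_labels, test_images, and test_labels. The train_images and train_labels lists contain the training data, and the test_images and test_labels lists contain the testing data. Labels are 0 for NORMAL and 1 for PNEUMONIA; files that cannot be read are reported and skipped.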

In [94]:
def load_data():
    # Start time
    start_time = time.time()
    
    # Paths to the folders containing the training and test images
    train_dir = "/jup/Epitech/Data/chest_Xray/train/"
    test_dir = "/jup/Epitech/Data/chest_Xray/test/"

    # Number of neighbors to use for K-NN classification
    # n_neighbors = 5

    # Load the training and test images together with their labels
    train_images = []
    train_labels = []
    for foldername in tqdm(["NORMAL", "PNEUMONIA"]): # os.listdir(train_dir)
        label = 0 if foldername == "NORMAL" else 1
        folderpath = os.path.join(train_dir, foldername)
        for filename in os.listdir(folderpath):
            if filename.endswith(".jpeg"):
                imgpath = os.path.join(folderpath, filename)
                img = cv2.imread(imgpath)
                if img is None:
                    print('Wrong path:', imgpath)
                else:
                    img = cv2.resize(img, (64, 64))
                    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
                    train_images.append(img.flatten())
                    train_labels.append(label)

    test_images = []
    test_labels = []
    for foldername in tqdm(["NORMAL", "PNEUMONIA"]): # os.listdir(test_dir)
        label = 0 if foldername == "NORMAL" else 1
        folderpath = os.path.join(test_dir, foldername)
        for filename in os.listdir(folderpath):
            if filename.endswith(".jpeg"):
                imgpath = os.path.join(folderpath, filename)
                img = cv2.imread(imgpath)
                if img is None:
                    print('Wrong path:', imgpath)
                else:
                    img = cv2.resize(img, (64, 64))
                    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
                    test_images.append(img.flatten())
                    test_labels.append(label)
                    
    print("Finished in", round((time.time() - start_time), 1), "s")
                    
    return train_images, train_labels, test_images, test_labels

In [95]:
train_images, train_labels, test_images, test_labels = load_data()

100%|██████████| 2/2 [01:15<00:00, 37.88s/it]
100%|██████████| 2/2 [00:08<00:00,  4.30s/it]

Finished in 84.4 s
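
load_data() returns plain Python lists of flattened 64×64 arrays. The following is a minimal sketch, using random stand-in pixels rather than the real X-rays, of how such lists can be stacked into a 2-D feature matrix and how the class balance can be checked before training. The names `demo_images`, `demo_labels`, `X_demo` and `y_demo` are illustrative and do not appear in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for load_data() output: six fake 64x64 grayscale images, flattened
demo_images = [rng.integers(0, 256, size=(64, 64), dtype=np.uint8).flatten()
               for _ in range(6)]
demo_labels = [0, 0, 1, 1, 1, 1]  # 0 = NORMAL, 1 = PNEUMONIA

# Stack into the (n_samples, n_features) layout scikit-learn expects
X_demo = np.asarray(demo_images)
y_demo = np.asarray(demo_labels)

print(X_demo.shape)         # (6, 4096): one row per image, 64 * 64 pixels
print(np.bincount(y_demo))  # [2 4]: number of images per class
```

Checking the class counts is useful because, on an imbalanced test set, a model that always predicts the majority class already reaches an accuracy equal to that class's share.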





## KNN

**The KNN (k-nearest neighbors) model is a supervised learning algorithm used for classification and regression. The basic idea is to find the k training samples closest to the test sample (according to a distance measure), then assign the test sample the majority label of those neighbors (for classification) or the average of their values (for regression). The larger k is, the smoother the decision boundary: the variance of the estimate decreases, but its bias increases.**

KNN(): This function implements the K-Nearest Neighbors algorithm for classification. It takes as input the training and testing data and the number of neighbors to consider (default value is 5). It trains the model on the training data and makes predictions on the testing data. It then calculates the accuracy of the model and prints it.

In [96]:
def KNN(train_images, train_labels, test_images, test_labels, n_neighbors = 5):
    start_time = time.time()
    
    # Create a KNeighborsClassifier and train it on the training data
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(train_images, train_labels)

    # Predict the classes of the test images
    test_preds = knn.predict(test_images)

    # Compute the model's accuracy on the test data
    accuracy = np.mean(test_preds == test_labels)
    print("Exactitude du modèle : {:.2f} %".format(accuracy*100))
    print("Finished in", round((time.time() - start_time), 1), "s")

In [97]:
KNN(train_images, train_labels, test_images, test_labels)

Exactitude du modèle : 74.20 %
Finished in 0.8 s
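
The majority-vote rule described in this section can be sketched directly in NumPy. This is an illustration on toy 2-D points, not the scikit-learn implementation; the helper `knn_predict` and the toy data are hypothetical. It uses Euclidean distance, which is also the default metric of KNeighborsClassifier.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=5):
    """Classify one sample by majority vote among its k nearest neighbors."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    # Euclidean distance from the query to every training sample
    distances = np.linalg.norm(train_X - np.asarray(query, dtype=float), axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    return int(np.bincount(train_y[nearest]).argmax())

# Toy 2-D data: class 0 near the origin, class 1 near (5, 5)
toy_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_X, toy_y, [0.5, 0.5], k=3))  # 0
print(knn_predict(toy_X, toy_y, [5.5, 5.5], k=3))  # 1
```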


## SVC

**The SVM (Support Vector Machine) is a supervised learning algorithm used for classification and regression. It looks for the hyperplane that separates the classes with the largest possible margin.**

**The separating hyperplane is determined by a subset of the training data called support vectors. These support vectors are the training points closest to the decision boundary.**

**The SVM model can also use a kernel function to transform the data into a higher dimensional space, where linear separation is easier to achieve. This allows capturing more complex relationships between the data.**

**The SVM model is known for its ability to handle data with a large number of variables, as well as its ability to generalize to new data.**

SVC(): This function trains a Support Vector Machine (SVM) classifier. It takes as input the training and testing data and a boolean, linear (default value is False). When linear is False it uses svm.SVC with its default RBF kernel; when linear is True it uses svm.LinearSVC. It trains the model on the training data and makes predictions on the testing data. It then calculates the accuracy of the model and prints it.

In [98]:
def SVC(train_images, train_labels, test_images, test_labels, linear: bool = False):
    start_time = time.time()
    
    # LinearSVC when linear=True, otherwise SVC with its default (RBF) kernel
    clf = svm.LinearSVC(verbose=True) if linear else svm.SVC(verbose=True)
    clf.fit(train_images, train_labels)
    predicted = clf.predict(test_images)
    
    print("Accuracy:", round(metrics.accuracy_score(test_labels, predicted)*100, 2), "%")
    print("Finished in", round((time.time() - start_time), 1), "s")

In [99]:
SVC(train_images, train_labels, test_images, test_labels)

[LibSVM]Accuracy: 76.12 %
Finished in 24.4 s
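
To illustrate why a kernel can help, here is a small sketch on synthetic data (two concentric rings from scikit-learn's `make_circles`, not the X-ray images). A linear kernel cannot separate the rings, while the RBF kernel can. The dataset parameters are arbitrary choices made for the illustration.

```python
from sklearn import svm
from sklearn.datasets import make_circles

# Two concentric rings: not separable by a straight line in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_clf = svm.SVC(kernel="linear").fit(X, y)
rbf_clf = svm.SVC(kernel="rbf").fit(X, y)  # "rbf" is also SVC's default kernel

linear_acc = linear_clf.score(X, y)
rbf_acc = rbf_clf.score(X, y)
print("linear kernel training accuracy:", linear_acc)
print("RBF kernel training accuracy:", rbf_acc)

# Only the points near the boundary are kept as support vectors
print("support vectors per class (RBF):", rbf_clf.n_support_)
```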


## MLP Classifier

**The MLP (Multi-Layer Perceptron) model is a type of artificial neural network used for classification. It is composed of several layers of connected neurons that transform the input into a predicted output. The input data is passed through hidden layers that apply activation functions to generate intermediate outputs. These outputs are then passed to the output layer, where the output class is predicted based on the inputs and connection weights. The model is trained to adjust the connection weights to minimize a cost function that measures the difference between the predicted output and the actual output. The model can be used for multi-class classification and regression problems.**

MLP_classifier(): This function trains a multilayer perceptron (MLP) classifier. It takes as input the training and testing data. It creates an MLPClassifier with two hidden layers of 784 and 3 neurons (hidden_layer_sizes=(784, 3)), using the lbfgs solver, and trains the model on the training data. It then makes predictions on the testing data, calculates the accuracy of the model, and prints it.

In [100]:
def MLP_classifier(train_images, train_labels, test_images, test_labels):
    start_time = time.time()
    
    clf = MLPClassifier(verbose=True, solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(784, 3), random_state=1)
    clf.fit(train_images, train_labels)
    predicted = clf.predict(test_images)
    
    print("Accuracy:", round(metrics.accuracy_score(test_labels, predicted)*100, 2), "%")
    print("Finished in", round((time.time() - start_time), 1), "s")

In [101]:
MLP_classifier(train_images, train_labels, test_images, test_labels)

Accuracy: 62.18 %
Finished in 26.6 s


ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
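
The warning above suggests scaling the data. One possible way to do so, sketched here on random stand-in data, is to place a `StandardScaler` in front of the network with `make_pipeline`. The layer size, `max_iter` value, and the demo arrays are assumptions made for this illustration; whether scaling improves the accuracy on the X-ray data is not tested here.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Stand-in data: 40 fake "images" of 16 pixels with raw values in 0..255
X_demo = rng.integers(0, 256, size=(40, 16)).astype(float)
y_demo = np.arange(40) % 2  # arbitrary synthetic labels, alternating 0 and 1

# Standardize every pixel column before it reaches the network
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(8,),
                  max_iter=500, random_state=1),
)
scaled_mlp.fit(X_demo, y_demo)
demo_preds = scaled_mlp.predict(X_demo)
print(demo_preds.shape)  # (40,)
```

Because the scaler is part of the pipeline, the statistics learned on the training data are reused automatically when predicting on test data.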


## Naive Bayes

**The Naive Bayes model is a supervised learning algorithm, often used for text classification, that predicts the probability of an observation belonging to each class. It computes these probabilities with Bayes' theorem and assumes that the features are independent of one another given the class, hence the term "naive". The training data is used to estimate the distribution of each feature within each class; the GaussianNB variant used here models every feature as a Gaussian. During prediction, the model combines these estimates to compute the probability of each class for the given observation and chooses the class with the highest probability.**

NAIVE_bayes(): This function implements the Naive Bayes algorithm for classification. It takes as input the training and testing data, fits the model to the training data, makes predictions on the testing data, calculates the accuracy of the model, and prints it.

In [102]:
def NAIVE_bayes(train_images, train_labels, test_images, test_labels):
    start_time = time.time()
    
    model = GaussianNB()
    # fit the model with the training data
    model.fit(train_images, train_labels)

    predicted = model.predict(test_images)
    
    print("Accuracy:", round(metrics.accuracy_score(test_labels, predicted)*100, 2), "%")
    print("Finished in", round((time.time() - start_time), 1), "s")

In [103]:
NAIVE_bayes(train_images, train_labels, test_images, test_labels)

Accuracy: 72.76 %
Finished in 0.8 s
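
The prediction rule described in this section can be sketched in NumPy as follows. This is a simplified illustration of Gaussian Naive Bayes on toy data, not scikit-learn's implementation; the helpers `fit_gaussian_nb` and `predict_gaussian_nb` and the small variance constant are assumptions of the sketch.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors and per-feature Gaussian parameters."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant keeps variances strictly positive (an assumption of this sketch)
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    """Pick the class with the highest log posterior for each row of X."""
    classes, priors, means, variances = model
    X = np.asarray(X, dtype=float)
    # log p(x | c): per-feature log densities are summed because features
    # are assumed independent given the class
    log_likelihood = -0.5 * (
        np.log(2 * np.pi * variances)[None, :, :]
        + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    ).sum(axis=2)
    log_posterior = np.log(priors)[None, :] + log_likelihood
    return classes[np.argmax(log_posterior, axis=1)]

toy_X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [6.0, 7.0], [6.2, 6.8], [5.8, 7.2]]
toy_y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(toy_X, toy_y)
print(predict_gaussian_nb(model, [[1.0, 2.1], [6.1, 7.0]]))  # [0 1]
```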


## Extremely Randomized Trees

**The Extremely Randomized Trees (ERT) model is an extension of the Random Forest algorithm that combines many decision trees. However, unlike Random Forest, the decision trees in ERT are built using random cutoffs for the data features rather than the optimal cutoffs. In addition, each node is split using a randomly selected subset of the features.**

**By using randomly chosen cutoffs and features, ERT seeks to increase the diversity between decision trees, which can reduce variance and improve generalization. The randomly selected feature subset also helps reduce the correlation between trees, which can further improve model performance.**

**ERT is particularly useful for high-dimensional and noisy datasets, as well as for datasets with features that have complex and nonlinear interactions.**

EXTREMELY_randomized_trees(): This function implements the Extremely Randomized Trees algorithm for classification. It takes as input the training and testing data, the number of estimators to use (default value is 100), and the maximum depth of the trees (default value is 10). It trains the model on the training data and makes predictions on the testing data. It then calculates the accuracy of the model and prints it.

In [104]:
def EXTREMELY_randomized_trees(train_images, train_labels, test_images, test_labels, estimators: int = 100, max_depth: int = 10):
    start_time = time.time()
    
    # ExtraTrees tests randomly drawn split thresholds over a random subset of features
    # (in contrast to RandomForest, which searches for the best threshold over a random subset of features)

    clf = ExtraTreesClassifier(n_estimators=estimators, max_depth=max_depth, min_samples_split=2, random_state=0)
    clf = clf.fit(train_images, train_labels)

    # Predict the response for test dataset
    predicted = clf.predict(test_images)
    
    print("Accuracy:", round(metrics.accuracy_score(test_labels, predicted)*100, 2), "%")
    print("Finished in", round((time.time() - start_time), 1), "s")

In [105]:
EXTREMELY_randomized_trees(train_images, train_labels, test_images, test_labels)

Accuracy: 76.6 %
Finished in 8.2 s
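
The difference between an optimal cutoff and a random cutoff can be illustrated on a single feature. The sketch below is a simplification: it scores splits with Gini impurity on toy data, and the helper `gini_after_split` is hypothetical. In scikit-learn's Extra Trees, one random threshold is drawn per candidate feature and the best of those random thresholds is kept.

```python
import numpy as np

def gini_after_split(feature, labels, threshold):
    """Weighted Gini impurity of the two groups produced by a threshold."""
    total = 0.0
    for mask in (feature <= threshold, feature > threshold):
        if mask.any():
            p = np.mean(labels[mask])  # fraction of class 1 in this group
            total += mask.mean() * 2 * p * (1 - p)
    return total

rng = np.random.default_rng(0)
feature = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Random-Forest-style: evaluate every midpoint and keep the best one
values = np.sort(feature)
candidates = (values[:-1] + values[1:]) / 2
best = min(candidates, key=lambda t: gini_after_split(feature, labels, t))

# Extra-Trees-style: draw a single cutoff uniformly between min and max
random_cut = rng.uniform(feature.min(), feature.max())

print("optimal cutoff:", best, "impurity:", gini_after_split(feature, labels, best))
print("random cutoff:", random_cut, "impurity:", gini_after_split(feature, labels, random_cut))
```

A single random cutoff is never better than the optimal one on the training data, but it is cheaper to compute and makes the individual trees less alike.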


In [1]:
from Zoidberg_Object import ZOIDBERG
z = ZOIDBERG()
z.compare()


=== Loading Data ===


100%|██████████| 2/2 [01:12<00:00, 36.07s/it]
100%|██████████| 2/2 [00:08<00:00,  4.28s/it]


Finished in 80.7 s

=== KNN Model ===
Exactitude du modèle : 74.20 %
Finished in 0.5 s

=== SVC Model ===
[LibSVM]Accuracy: 76.12 %
Finished in 24.0 s

=== MLP Classifier Model ===


ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Accuracy: 62.18 %
Finished in 27.9 s

=== NAIVE bayes Model ===
Accuracy: 72.76 %
Finished in 0.6 s

=== EXTREMELY randomized trees Model ===
Accuracy: 76.6 %
Finished in 8.0 s
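
The source of `Zoidberg_Object` is not shown in this notebook. As a rough idea of how such a comparison can be organized, here is a hedged sketch that fits several scikit-learn classifiers in one loop on synthetic data. It is not the actual `ZOIDBERG.compare()` implementation; the `models` dictionary and the generated dataset are assumptions.

```python
import time
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the notebook itself uses the chest X-ray images
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, max_depth=10, random_state=0),
}

results = {}
for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    accuracy = metrics.accuracy_score(y_test, model.predict(X_test))
    results[name] = accuracy
    print(f"=== {name} === accuracy: {accuracy * 100:.2f} % in {time.time() - start:.1f} s")
```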


In [110]:
from Tool import tool
tool.convert2html("Zoidberg_Draft.ipynb", "Draft_14-03-23")

Convert Zoiberg_Draft.ipynb to html
0
