# How to study the Malaria dataset

In this notebook, we illustrate how FrImCla can be employed to classify images. To do so, we first have to train the framework. In particular, we use the dataset provided by the National Library of Medicine (NLM); from now on, we will call it the Malaria dataset.

The Malaria dataset consists of 27,558 cell images with equal instances of parasitized and uninfected cells. You can find more information about the dataset at the following link:

https://ceb.nlm.nih.gov/repositories/malaria-datasets/

This dataset is used in the paper "Pre-trained convolutional neural networks as feature extractors toward improved malaria parasite detection in thin blood smear images". You can read the paper at the following link:

https://peerj.com/articles/4568/

We will use only a subset of this dataset to evaluate the performance of FrImCla. The subset contains 500 images of parasitized cells and 500 images of uninfected cells. It can be downloaded by executing the following commands.

In [None]:
!wget "https://drive.google.com/uc?id=1ElTc5K3CNVxYiMfdRc2HX1HL-UVud4UZ&export=download&authuser=0" -O malaria.zip
!unzip malaria.zip
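After unzipping, it can be worth checking that the subset really holds the expected 500 images per class. The sketch below assumes the archive extracts into one folder with one subfolder per class; that layout, the folder name `malaria`, and the list of image extensions are assumptions, not something documented here.

```python
from collections import Counter
from pathlib import Path

# Assumed set of image file extensions; extend it if your files differ.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def count_images_per_class(dataset_dir):
    """Return {class_folder_name: number_of_image_files} for a dataset folder."""
    counts = Counter()
    root = Path(dataset_dir)
    if not root.is_dir():
        # Nothing to count if the folder does not exist.
        return counts
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


# "malaria" is a hypothetical folder name; use whatever unzip produced.
print(dict(count_images_per_class("malaria")))
```

If the two counts differ from 500, the download or extraction is probably incomplete.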

If FrImCla is not installed on your system, the first task is to install it using pip.

In [1]:
!pip install frimcla

Collecting frimcla
  Could not find a version that satisfies the requirement frimcla (from versions: )
No matching distribution found for frimcla
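When pip reports that no matching distribution was found, as in the output above, the import cells that follow will fail. A small check like the following, which uses only the standard library, makes that explicit before going further. The helper name `is_installed` is ours and is not part of FrImCla.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("frimcla"):
    print("frimcla is available")
else:
    print("frimcla was not found; the import cells below will fail")
```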


Next, we go through the process of finding the best model for the Malaria dataset (with a technique called data augmentation).

We need some libraries to execute this framework and obtain the results. 

In [None]:
from frimcla.index_features import generateFeatures
from frimcla.StatisticalComparison import statisticalComparison
from frimcla.train import train

### Configuring the variables of the program

First of all, we have to set the variables that the program needs, such as the path of the dataset and the models we want to use.

In [None]:
datasetPath = "mias"  # NOTE: set this to the folder extracted from malaria.zip
outputPath = "output"  # folder where the results are stored
featureExtractors = [["inception", "False"]]
batchSize = 32
verbose = False
modelClassifiers = ["MLP", "SVM", "KNN"]  # You can use MLP, SVM, KNN, LogisticRegression or RandomForest
measure = "accuracy"  # You can use accuracy, f1, auroc, precision or recall
trainingSize = 1
nSteps = 10


These variables configure the environment of the program: where the dataset to study is located and where the results should be stored.
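A typo in one of these values only shows up once the framework is already running, so a quick sanity check beforehand can save time. The following is a minimal sketch, not part of FrImCla: the helper `check_config` is hypothetical, and the sets of accepted names are simply copied from the comments in the configuration cell.

```python
import os

# Option names copied from the comments in the configuration cell.
KNOWN_CLASSIFIERS = {"MLP", "SVM", "KNN", "LogisticRegression", "RandomForest"}
KNOWN_MEASURES = {"accuracy", "f1", "auroc", "precision", "recall"}


def check_config(dataset_path, classifiers, measure, batch_size):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    if not os.path.isdir(dataset_path):
        problems.append(f"dataset folder not found: {dataset_path!r}")
    unknown = sorted(set(classifiers) - KNOWN_CLASSIFIERS)
    if unknown:
        problems.append(f"unknown classifiers: {unknown}")
    if measure not in KNOWN_MEASURES:
        problems.append(f"unknown measure: {measure!r}")
    if not isinstance(batch_size, int) or batch_size <= 0:
        problems.append(f"batchSize must be a positive integer, got {batch_size!r}")
    return problems


# Same values as in the configuration cell.
for problem in check_config("mias", ["MLP", "SVM", "KNN"], "accuracy", 32):
    print("config problem:", problem)
```

Returning a list instead of raising on the first error lets all problems be reported in one pass.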

### Generating the features

In this step, we store the features of each image in the dataset. These features depend on the model used, because each model extracts different features from an image.

In [None]:
generateFeatures(outputPath, batchSize, datasetPath, featureExtractors, nSteps, verbose)

### Statistical analysis

Now that we have the features of all the images for each model, we can perform a statistical analysis to find out which of these models has the best performance.

In [None]:
statisticalComparison(outputPath, datasetPath, featureExtractors, modelClassifiers, measure, verbose)
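How `statisticalComparison` works internally is not shown in this notebook. To give an intuition for what comparing models on a measure involves, the sketch below ranks models by the mean of their per-fold scores. It is only an illustration: the function `rank_by_mean_score` is hypothetical, the scores are made-up numbers rather than results, and no significance test is performed.

```python
from statistics import mean, stdev


def rank_by_mean_score(scores_per_model):
    """Return (name, mean, stdev) tuples sorted from best to worst mean score."""
    summary = [
        (name, mean(scores), stdev(scores))
        for name, scores in scores_per_model.items()
    ]
    return sorted(summary, key=lambda row: row[1], reverse=True)


# Made-up per-fold scores, for illustration only (not real results).
example_scores = {
    "model_a": [0.90, 0.92, 0.91, 0.93, 0.89],
    "model_b": [0.88, 0.90, 0.87, 0.91, 0.89],
    "model_c": [0.80, 0.83, 0.79, 0.82, 0.81],
}

for name, avg, spread in rank_by_mean_score(example_scores):
    print(f"{name}: mean={avg:.3f} stdev={spread:.3f}")
```

A higher mean alone does not show that one model is significantly better than another, which is why a statistical test on the per-fold scores is still needed.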

### Train the model

The study returns the best model and indicates whether there are significant differences between it and the rest of the models. With this information, we can train the best model and return it to the user as the result of the framework.

In [None]:
train(outputPath, datasetPath, trainingSize)

### Predict the class of the images

Finally, we have the best model and can use it to predict the class of our images. To do this, we use the following command, for which we have to define the feature extractor and the classifier.

In [None]:
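# Assumption: prediction lives in frimcla.prediction; adjust the import
# if your installed version exposes it elsewhere.
from frimcla.prediction import prediction

image = "./example.jpg"  # path to the image to classify
featExt = ["inception", "False"]  # feature extractor
classi = "MLP"  # classifier
prediction(featExt, classi, image, outputPath, datasetPath)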