This repo contains:
- a PDF summarizing all the work done
- data analysis (data_analysis.ipynb)
- code for preprocessing (preprocessing.py)
- code for training (train.py)
- code for evaluation (test.py)
- evaluation analysis (evaluation.ipynb)

A ResNet trained with the Adam optimizer achieves 99% test accuracy on a 250-class classification task.
## Environment
- Windows 10
- Intel(R) Core(TM) i5
- Memory: 16 GB
- NVIDIA GeForce GTX 1060 * 1 (4617 MB memory)
## Getting Started
To reproduce the results, there are three steps (sketched as commands below):
- Dataset preparation: run preprocessing.py
- Training: for each backbone, run train.py
- Evaluation: for each backbone, run test.py
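Concretely, an end-to-end run could look like the following sketch, assuming train.py writes its checkpoints to a timestamped folder under output/, as the Evaluation example below suggests:
```
python preprocessing.py
python train.py --output_version resnet1 --backbone resnet
python test.py --model_dir output/<run_dir>   # e.g. output/20240107-094325
```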
```
# Install requirements
pip install -r requirements.txt
```
In case of problems, install tensorflow, Bio, and propythia with pip and it should work.
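Assuming "Bio" refers to the biopython distribution on PyPI, that fallback amounts to:
```
pip install tensorflow biopython propythia
```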
## Project Structure
```
+-- assets/
| +-- family_distribution.png
| +-- ...
+-- backbone/
| +-- layers/
| | +-- ResidualBlock.py
| | +-- ...
| +-- resnet.py
| +-- ...
+-- logs/
+-- output/
| +-- exp1
| +-- ...
+-- preprocessed_data/
| +-- full/
| | +-- ...
| +-- sample/
| | +-- ...
+-- random_split/
| +-- train/
| +-- dev/
| +-- test/
+-- utils/
| +-- dataloader.py
| +-- model.py
| +-- tools.py
+-- preprocessing.py
+-- README.md
+-- requirements.txt
+-- train.py
+-- test.py
+-- pfam Classification - Mounir Messaoudi.pdf
+-- evaluation.ipynb
+-- data_analysis.ipynb
```
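For illustration, backbone/layers/ResidualBlock.py presumably implements something like the following pre-activation 1D residual block in Keras (a ProtCNN-style sketch, not the repo's actual code; layer choices and kernel sizes are assumptions):
```python
from tensorflow.keras import layers

def residual_block(x, filters, dilation=1):
    # Pre-activation residual block: BN -> ReLU -> dilated Conv1D, twice,
    # then add the skip connection. Assumes x already has `filters` channels.
    y = layers.BatchNormalization()(x)
    y = layers.Activation("relu")(y)
    y = layers.Conv1D(filters, 3, dilation_rate=dilation, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv1D(filters, 3, padding="same")(y)
    return layers.Add()([x, y])
```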
If the random_split/ folder is not present, download it from "Kaggle: Pfam seed random split" and put the random_split folder inside this repo.
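For example, with the Kaggle CLI (the dataset slug is an assumption; verify it on the Kaggle page):
```
kaggle datasets download -d googleai/pfam-seed-random-split
unzip pfam-seed-random-split.zip   # make sure the shards end up under random_split/{train,dev,test}
```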
After downloading and preprocessing, the data directory is structured as:
```
+-- dataset/
| +-- train/
| | +-- data-00000-of-00080
| | +-- ...
| +-- dev/
| | +-- data-00000-of-00010
| | +-- ...
| +-- test/
| | +-- data-00000-of-00010
| | +-- ...
```
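Each shard is a headered CSV file; a minimal sketch of loading one split with pandas (read_split is a hypothetical helper, not part of this repo, and the column names follow the Kaggle release):
```python
import os
import pandas as pd

def read_split(split_dir):
    # Concatenate every data-XXXXX-of-XXXXX shard of a split into one DataFrame.
    shards = sorted(os.listdir(split_dir))
    return pd.concat(
        (pd.read_csv(os.path.join(split_dir, shard)) for shard in shards),
        ignore_index=True,
    )

train_df = read_split("random_split/train")
print(train_df.columns)  # expect columns like sequence, family_accession, ...
```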
## Data Analysis
For exploratory data analysis, see the notebook data_analysis.ipynb.
## Preprocessing
### Usage
```
python preprocessing.py
```
It preprocesses the data from random_split/ and saves the preprocessed data in preprocessed_data/.
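As intuition for what the preprocessing does, classifiers on this dataset typically integer-encode amino acids and pad sequences to a fixed length; a hypothetical sketch (the alphabet, padding id, and max_len are assumptions, not the repo's actual values):
```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_ID = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 is reserved for padding

def encode(sequence, max_len=128):
    # Truncate to max_len, map rare/unknown residues to 0, then right-pad with 0s.
    ids = [AA_TO_ID.get(aa, 0) for aa in sequence[:max_len]]
    return ids + [0] * (max_len - len(ids))

print(encode("MKV")[:5])  # [11, 9, 18, 0, 0]
```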
## Training
### Usage
```
usage: train.py --backbone BACKBONE [--output_version OUTPUT_VERSION] [--data_dir DATA_DIR]
                [--num_epochs NUM_EPOCHS] [--batch_size BATCH_SIZE] [--seed SEED]
```
--backbone is required; valid values are resnet, mobilnetv2, light_mobilnetv2, and LSTM. The rest of the parameters are optional.
### Example Usage
Train resnet:
```
python train.py --output_version resnet1 --num_epochs 5 --batch_size 256 --backbone resnet
```
Train LSTM:
```
python train.py --output_version LSTM1 --num_epochs 5 --batch_size 256 --backbone LSTM
```
Visualize training with TensorBoard:
```
tensorboard --logdir "./logs"
```
To train the Random Forest baseline, open RF_Training.ipynb and run all cells.
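For reference, a Random Forest baseline amounts to something like this scikit-learn sketch (the toy arrays stand in for the encoded sequences used in the notebook; hyperparameters are assumptions):
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in data: in RF_Training.ipynb the features come from the encoded sequences.
rng = np.random.default_rng(0)
X_train, y_train = rng.integers(0, 21, (1000, 128)), rng.integers(0, 250, 1000)
X_test, y_test = rng.integers(0, 21, (200, 128)), rng.integers(0, 250, 200)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```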
## Evaluation
### Usage
```
usage: test.py --model_dir MODEL_DIR [--OUTPUT_dir OUTPUT_DIR] [--output_version OUTPUT_VERSION]
```
--model_dir is required; it tests the trained model.
### Example Usage
```
python test.py --model_dir output/20240107-094325
```
## Results
Model | Test Acc
--- | ---
resnet | 0.9999
LSTM | 0.9947
mobilenetV2 light | 0.9998
mobilenetV2 | 0.9997
Random Forest | 0.02
## References
- Using Deep Learning to Annotate the Protein Universe
- Kaggle: Pfam seed random split
- Automatic structure classification of small proteins using random forest
- Efficacy of different protein descriptors in predicting protein functional families
- Protein Sequence Classification
- Deep Dive into Machine Learning Models for Protein Engineering
- Pytorch gene sequence classification by protCNN
- A comprehensive framework for advanced protein classification and function prediction using synergistic approaches: Integrating bispectral analysis, machine learning, and deep learning