
MediChestAI

MediChestAI is a comprehensive MLOps project focused on classifying chest CT images as normal or adenocarcinoma using a deep learning model. The project integrates modern tools and practices to streamline development, testing, deployment, and maintenance.

CT Scan

Table of Contents

  • Project Overview
  • Project Structure
  • Dataset
  • Data Pipeline
  • MLflow Experiments
  • CI/CD Deployment
  • Installation
  • Usage
  • MLflow DagsHub Connection
  • DVC Commands

Project Overview

  • MLOps Workflow: Implements an end-to-end CI/CD pipeline in which GitHub Actions, triggered by pushes to the repository, starts a Jenkins job that builds, tests, and deploys the project.

  • Deployment: The application is designed to be deployed on AWS EC2 instances, leveraging Docker containers stored in Amazon ECR for consistent and scalable deployment.

  • Docker: A Dockerfile builds the project's image from python:3.8-slim-buster, installs the required dependencies, and starts the Flask application (a minimal sketch follows this list).
  • Docker Compose: A docker-compose.yaml file runs the application with Docker Compose, mapping port 8080 on the host to port 8080 in the container.
  • Pipeline Tracking (DVC): DVC is used to manage and track the project's machine learning pipeline, including stages like data ingestion, model preparation, training, and evaluation.

  • MLflow Integration: MLflow is integrated for experiment tracking and serving, with DagsHub providing a centralized tracking URI for managing and comparing experiments.

  • Data Storage: The project's data is stored on Google Drive, ensuring easy access and version control through DVC.

  • Flask Application: A Flask web application with a user-friendly interface is developed for interacting with the model, allowing users to easily upload CT images and get predictions.

  • Python Modular Approach: The project emphasizes a modular Python approach, following industry best practices to ensure maintainability and scalability.

  • Model Framework: TensorFlow and Keras frameworks are used for developing the deep learning model, specifically using the VGG16 architecture.
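
The container setup described above can be sketched roughly as follows. This is a minimal illustration based on the bullets in this list, not the repository's exact files:

    # Dockerfile (illustrative sketch)
    FROM python:3.8-slim-buster

    WORKDIR /app
    COPY . /app

    # Install the project's Python dependencies
    RUN pip install --no-cache-dir -r requirements.txt

    # Start the Flask application
    CMD ["python", "app.py"]

    # docker-compose.yaml (illustrative sketch; service name is a placeholder)
    version: "3.8"
    services:
      app:
        build: .
        ports:
          - "8080:8080"   # host port 8080 -> container port 8080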

Project Structure

MediChestAI/
│
├── .dvc/                    # DVC cache and temporary files
│   ├── cache/
│   └── tmp/
│
├── .github/
│   └── workflows/
│       └── main.yaml        # GitHub Actions workflow for triggering Jenkins
│
├── jenkins/
│   └── Jenkinsfile          # Jenkins pipeline configuration
│
├── artifacts/               # Folder to store pipeline outputs
│
├── config/                  # Configuration files for the project
│   └── config.yaml          # Project configuration
│
├── logs/
│   └── running_logs.log     # Application logs
│
├── model/
│   └── model.h5             # Saved model
│
├── research/                # Jupyter notebooks
│   ├── 01_data_ingestion.ipynb
│   ├── 02_prepare_base_model.ipynb
│   ├── 03_model_trainer.ipynb
│   └── 04_model_evaluation_with_mlflow.ipynb
│
├── scripts/
│   ├── ec2_setup.sh         # Shell script to setup EC2 instance
│   └── jenkins.sh           # Jenkins setup script
│
├── src/                     # Source code for the project
│   └── CNNClassifier/
│       ├── __init__.py      # Init file for the package
│       ├── components/      # Pipeline components
│       │   ├── data_ingestion.py
│       │   ├── model_evaluation.py
│       │   ├── model_trainer.py
│       │   ├── prepare_base_model.py
│       │   └── __init__.py
│       │
│       ├── config/          # Configuration-related source code
│       │   ├── configuration.py
│       │   └── __init__.py
│       │
│       ├── constants/       # Project constants
│       │   └── __init__.py
│       │
│       ├── entity/          # Entity classes
│       │   ├── config_entity.py
│       │   └── __init__.py
│       │
│       ├── pipeline/        # Pipeline scripts
│       │   ├── stage_01_data_ingestion.py
│       │   ├── stage_02_prepare_base_model.py
│       │   ├── stage_03_model_trainer.py
│       │   ├── stage_04_model_evaluation.py
│       │   └── __init__.py
│       │
│       ├── utils/           # Utility functions
│       │   ├── common.py
│       │   └── __init__.py
│
├── templates/
│   └── index.html           # HTML template for the Flask web app
│
├── .dockerignore            # Files to exclude from Docker builds
├── .dvcignore               # Files to exclude from DVC
├── .gitignore               # Files to exclude from Git
├── app.py                   # Flask application entry point
├── docker-compose.yaml      # Docker Compose configuration file
├── Dockerfile               # Dockerfile for building the Docker image
├── dvc.lock                 # DVC lock file
├── dvc.yaml                 # DVC pipeline configuration
├── inputImage.jpg           # Sample input image
├── LICENSE                  # Project license
├── main.py                  # Main Python script with the model pipeline stages
├── params.yaml              # Parameters for training VGG16 model
├── README.md                # Project README
├── requirements.txt         # Python dependencies
├── scores.json              # Model evaluation scores
├── setup.py                 # Setup script for the Python package
└── template.py              # Template script for setting up project structure

Dataset

The dataset used for training and evaluation can be found on Google Drive. You can download it using the link below:

CT Image Dataset

Data Pipeline

The data pipeline of the MediChestAI project includes several key stages, each of which has a specific role in the machine learning workflow:

CT Scan

  1. Data Ingestion:

    • The stage_01_data_ingestion.py file is responsible for downloading and extracting data from a given URL. The dataset is downloaded as a zip file, then extracted to the specified location for further processing.
  2. Prepare Base Model:

    • The stage_02_prepare_base_model.py file prepares a base model using the VGG16 architecture (see the sketch after this list).
    • It initializes the model with specific configurations such as input image size, weights ("imagenet"), and whether to include the top layer.
    • This stage saves the initialized model for use in training and evaluation.
  3. Training:

    • The stage_03_model_trainer.py file handles the training of the initialized model using the training dataset.
    • The model is fine-tuned with key hyperparameters like learning rate, batch size, and epochs, as specified in the params.yaml file.
    • After training, the model is saved for evaluation.
  4. Evaluation:

    • The stage_04_model_evaluation.py file evaluates the trained model's performance using a validation dataset.
    • It computes evaluation metrics like accuracy and loss, which are saved in a JSON file for analysis.
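
To make the prepare-base-model stage concrete, here is a minimal Keras sketch of initializing a VGG16 base model as described in step 2. The image size, class count, and save path are illustrative assumptions; the real values come from params.yaml and the project configuration:

    import tensorflow as tf

    # Load VGG16 with ImageNet weights and without the classification head
    base_model = tf.keras.applications.VGG16(
        input_shape=(224, 224, 3),  # illustrative; taken from params.yaml in practice
        weights="imagenet",
        include_top=False,
    )
    base_model.trainable = False  # freeze the convolutional layers

    # Attach a new head for the two classes (normal vs. adenocarcinoma)
    x = tf.keras.layers.Flatten()(base_model.output)
    outputs = tf.keras.layers.Dense(2, activation="softmax")(x)
    model = tf.keras.Model(inputs=base_model.input, outputs=outputs)

    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    model.save("artifacts/prepare_base_model/base_model.h5")  # hypothetical path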

The pipeline is managed with DVC to ensure reproducibility and version control, making it easy to track changes and improve the pipeline iteratively.
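
For illustration, a dvc.yaml wiring these four stages together might look roughly like this. The commands match the pipeline scripts listed above, but the dependency, output, and parameter names are assumptions; the repository's dvc.yaml is authoritative:

    stages:
      data_ingestion:
        cmd: python src/CNNClassifier/pipeline/stage_01_data_ingestion.py
        outs:
          - artifacts/data_ingestion
      prepare_base_model:
        cmd: python src/CNNClassifier/pipeline/stage_02_prepare_base_model.py
        params:
          - IMAGE_SIZE
        outs:
          - artifacts/prepare_base_model
      training:
        cmd: python src/CNNClassifier/pipeline/stage_03_model_trainer.py
        deps:
          - artifacts/data_ingestion
          - artifacts/prepare_base_model
        params:
          - EPOCHS
          - BATCH_SIZE
        outs:
          - artifacts/training
      evaluation:
        cmd: python src/CNNClassifier/pipeline/stage_04_model_evaluation.py
        deps:
          - artifacts/training
        metrics:
          - scores.json:
              cache: false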

MLflow Experiments

The project leverages MLflow to track and visualize experiments, allowing you to monitor different metrics such as accuracy and loss over multiple training runs. This helps in understanding the model's performance across different hyperparameter settings.

Key Features:

  • Tracking Experiments: Track experiments and compare their metrics and parameters side by side.
  • UI Visualization: The MLflow UI provides a clear visualization of the experiments to evaluate model performance.
  • Integration with DagsHub: Easily integrates with DagsHub for centralized experiment tracking.
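
As a minimal sketch of what a tracked run looks like in code (parameter names and metric values are placeholders, not results from this project):

    import mlflow

    # Assumes the MLflow/DagsHub environment variables are set,
    # as described in the "MLflow DagsHub Connection" section below.
    with mlflow.start_run():
        mlflow.log_params({"learning_rate": 0.01, "batch_size": 16, "epochs": 10})
        mlflow.log_metrics({"accuracy": 0.92, "loss": 0.24})
        mlflow.log_artifact("scores.json")  # attach the evaluation report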

To see the experiment tracking in action, here's a screenshot from the MLflow UI:

MLflow Experiments

CI/CD Deployment

Overview

The CI/CD deployment architecture for MediChestAI includes the following components:

  1. GitHub Actions:

    • Used to trigger Jenkins jobs upon pushing changes to the main repository on GitHub.
    • Acts as the starting point in the CI/CD pipeline.
    • GitHub Actions Trigger
  2. Jenkins (EC2-1):

    • Jenkins is installed on an EC2 instance (EC2-1) and is responsible for managing the CI/CD pipeline.
    • It orchestrates the build, testing, and deployment processes.
    • Jenkins fetches the code from GitHub, builds the Docker image, and pushes it to Amazon ECR.
    • Jenkins Pipeline
  3. Amazon ECR (Elastic Container Registry):

    • Stores the Docker images built by Jenkins.
    • Acts as a centralized repository for the project's Docker images, ensuring that they are versioned and securely stored.
    • ECR
  4. EC2-2 (Flask Application):

    • A second EC2 instance (EC2-2) is used to host the Flask application.
    • The Flask application is deployed using the Docker image pulled from ECR.
    • This instance acts as an endpoint that serves predictions based on the input data.

Deployment Workflow

  1. Source Code Commit: Changes to the code are pushed to the GitHub repository.

  2. Triggering Jenkins: GitHub Actions triggers a Jenkins job to handle the CI/CD pipeline.

  3. Building the Docker Image: Jenkins builds the Docker image based on the Dockerfile in the repository.

  4. Pushing to ECR: The Docker image is pushed to Amazon ECR.

  5. Deployment on EC2-2: EC2-2 pulls the latest Docker image from ECR and redeploys the Flask application, making the new version available for use.

This setup ensures automated, consistent, and reliable deployment of the project, enabling continuous integration and delivery.
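
A trimmed-down sketch of the trigger workflow is shown below. The job name, secret names, and endpoint are placeholders; the actual configuration lives in .github/workflows/main.yaml:

    name: Trigger Jenkins

    on:
      push:
        branches: [main]

    jobs:
      trigger:
        runs-on: ubuntu-latest
        steps:
          - name: Trigger the Jenkins build
            run: |
              curl -X POST "${{ secrets.JENKINS_URL }}/job/MediChestAI/build" \
                --user "${{ secrets.JENKINS_USER }}:${{ secrets.JENKINS_API_TOKEN }}"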

Installation

  1. Clone the repository:
git clone https://github.com/Omar-Karimov/MediChestAI.git
  2. Create and activate a virtual environment:
conda create -n medichest python=3.8 -y
conda activate medichest
  3. Install the required Python dependencies:
pip install -r requirements.txt
  4. Set up the package for development:
pip install -e .

Usage

  1. To run the application:
python app.py
  2. Access the application in your web browser at http://localhost:8080.
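
For reference, the serving layer in app.py follows the standard Flask pattern. A stripped-down sketch, with illustrative route, label order, and preprocessing (the actual app may differ):

    import numpy as np
    import tensorflow as tf
    from flask import Flask, jsonify, render_template, request
    from PIL import Image

    app = Flask(__name__)
    model = tf.keras.models.load_model("model/model.h5")
    CLASSES = ["adenocarcinoma", "normal"]  # illustrative label order

    @app.route("/")
    def home():
        return render_template("index.html")

    @app.route("/predict", methods=["POST"])
    def predict():
        # Decode the uploaded CT image and resize it to the model's input shape
        img = Image.open(request.files["image"]).convert("RGB").resize((224, 224))
        batch = np.expand_dims(np.asarray(img, dtype=np.float32), axis=0)
        probs = model.predict(batch)[0]
        return jsonify({"prediction": CLASSES[int(np.argmax(probs))]})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)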

Application

MLflow DagsHub Connection

To set up MLflow tracking using DagsHub:

  1. Create a DagsHub Account:

    • Visit DagsHub and create a free account.
  2. Connect Your GitHub Repository to DagsHub:

    • In DagsHub, create a new project and link it to your existing GitHub repository.
    • Enable the Experiments feature for your project.
  3. Get MLflow Tracking Credentials:

    • After connecting your GitHub repo, click on the Remote tab in DagsHub.
    • Copy the following environment variables provided by DagsHub:
      MLFLOW_TRACKING_URI=https://dagshub.com/<your-username>/<your-repo>.mlflow
      MLFLOW_TRACKING_USERNAME=<your-username>
      MLFLOW_TRACKING_PASSWORD=<your-password>
      
  4. Set the Environment Variables:

    • You can either export these variables using Git Bash:
      export MLFLOW_TRACKING_URI=https://dagshub.com/<your-username>/<your-repo>.mlflow
      export MLFLOW_TRACKING_USERNAME=<your-username>
      export MLFLOW_TRACKING_PASSWORD=<your-password>
    • Alternatively, you can add these to your system environment variables.
  5. Access the MLflow UI:

    • Go to the MLflow UI tab in DagsHub to track all experiments linked to your project.
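
Once these variables are exported, MLflow picks up the username and password from the environment automatically for HTTP authentication; client code only needs the tracking URI. A quick sanity check:

    import os
    import mlflow

    # MLFLOW_TRACKING_USERNAME / MLFLOW_TRACKING_PASSWORD are read from
    # the environment by MLflow itself; only the URI is set explicitly.
    mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
    print(mlflow.get_tracking_uri())  # should print the DagsHub URL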

DVC Commands

  • Initialize DVC:
dvc init
  • Reproduce the pipeline:
dvc repro
  • DVC will automatically skip stages that haven't changed and run only the necessary parts of the pipeline.
  • Visualize the pipeline:
dvc dag
  • This will generate a graph illustrating the flow of the data pipeline.
+----------------+            +--------------------+ 
| data_ingestion |            | prepare_base_model | 
+----------------+*****       +--------------------+ 
         *             *****             *
         *                  ******       *
         *                        ***    *
         **                        +----------+      
           **                      | training |      
             ***                   +----------+      
                ***             ***
                   **         **
                     **     **
                  +------------+
                  | evaluation |
                  +------------+
