Cancer Diagnosis System

Genes control how our cells work by making proteins. All types of cancer begin when one or more genes in a cell mutate (change). A mutated gene can produce an abnormal protein, which can cause cells to multiply uncontrollably and become cancerous. A mutation may be beneficial, harmful, or neutral, depending on where in the gene the change occurs.

In this project, I have built a model that takes the features 'Gene' and 'Variation' and, with the help of clinical evidence (text data), predicts the cancer class. The dataset contains 9 classes. Since it is important to reduce false negatives in the medical domain, I have used the "Recall" metric to evaluate model performance. The dataset is imbalanced, so I have used "class_weights" as an additional parameter to weight the target classes, and have therefore only used models that support this parameter.
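To make the two ideas above concrete, here is a minimal sketch of class weighting and recall on an imbalanced label set. It is not the code from the notebooks: the toy labels are hypothetical, the weights follow the common "balanced" heuristic (n_samples / (n_classes * class_count)), and macro averaging of recall is an assumption, since the averaging mode is not stated here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def macro_recall(y_true, y_pred):
    """Average the per-class recall so that rare classes count as much as common ones."""
    recalls = []
    for cls in sorted(set(y_true)):
        true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        actual = sum(1 for t in y_true if t == cls)
        recalls.append(true_positives / actual)
    return sum(recalls) / len(recalls)


# Hypothetical toy labels (not taken from the dataset): class 3 is rare.
y_true = [1, 1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 1, 2, 2, 2, 1]

weights = balanced_class_weights(y_true)  # the rare class gets the largest weight
score = macro_recall(y_true, y_pred)      # the missed rare class pulls the score down
```

Plain accuracy on these toy labels would look acceptable (5 of 7 correct), while macro recall exposes that class 3 is never found.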

I have tried several methods of vectorizing both the categorical and the text data to get high Recall, and compared multiple models to find the best one.

Acknowledgements

Demo Screenshots

Demo

https://bairagisaurabh-project-ii-cancer-prediction-app1-wa9krj.streamlit.app/

Data Overview

The dataset has been obtained from: https://www.kaggle.com/competitions/msk-redefining-cancer-treatment/data

⚪Clinical Evidence: This is the text data that human specialists rigorously go through to classify the genetic mutations. It is an important feature, and our model depends heavily on it for its classification task.

⚪Gene: This tells us about the gene where the mutation is located. This is a categorical feature.

⚪Variation: This describes the amino acid change caused by the mutation.

Technical approach for classification

👉Data Cleaning: The categorical features have no missing values.

For the text data, stopwords are removed and lemmatization is applied, since it returns the dictionary form of each word. I have avoided the Porter and Snowball stemmers because they return a root form that sometimes has a different spelling or no proper meaning of its own, and in the medical domain even one word can dictate or change the meaning of the whole text.
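The cleaning step described above can be sketched as follows. This is an illustration only: `STOPWORDS` and `LEMMAS` are tiny hypothetical stand-ins for a real stopword list and a dictionary-based lemmatizer, and `clean_text` is a made-up helper name, not a function from the notebooks.

```python
import re

# Hypothetical, deliberately tiny stand-ins for a full stopword list
# and a dictionary-based lemmatizer.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "were", "was", "is", "are"}
LEMMAS = {"mutations": "mutation", "cells": "cell", "studies": "study"}


def clean_text(text):
    """Lowercase, tokenize, drop stopwords, then map each token to its dictionary form."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [token for token in tokens if token not in STOPWORDS]
    return " ".join(LEMMAS.get(token, token) for token in kept)


cleaned = clean_text("The mutations were found in the cells of studies.")
# cleaned == "mutation found cell study"
```

The lemma table maps "studies" to the real word "study", whereas a Porter stemmer would reduce it to "studi", which is the kind of non-dictionary root the paragraph above wants to avoid.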

👉Data visualisation:

⚪ Bar plots for distribution of classes.

⚪ Bar plots for categorical features.

⚪ Confusion matrix, plus precision and recall matrices. The precision and recall matrices help us identify the classes on which our model predicts poorly.
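As a rough sketch of how the precision and recall matrices relate to the confusion matrix: dividing each row by its row sum gives the recall matrix, and dividing each column by its column sum gives the precision matrix. The function names and toy labels below are hypothetical, and rows are assumed to be true classes while columns are predicted classes.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {cls: i for i, cls in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def recall_matrix(cm):
    """Divide each row by its sum: the diagonal holds the per-class recall."""
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in cm]


def precision_matrix(cm):
    """Divide each column by its sum: the diagonal holds the per-class precision."""
    col_sums = [sum(col) for col in zip(*cm)]
    return [
        [v / col_sums[j] if col_sums[j] else 0.0 for j, v in enumerate(row)]
        for row in cm
    ]


# Hypothetical toy labels (not taken from the dataset).
y_true = [1, 1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 1, 2, 2, 2, 1]

cm = confusion_matrix(y_true, y_pred, classes=[1, 2, 3])
rec = recall_matrix(cm)
prec = precision_matrix(cm)
```

A low value on the diagonal of `rec` marks a class the model tends to miss, and the off-diagonal entries in the same row show which classes it is being confused with.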

⚪ Distribution of extracted features for each class.

⚪ Wordcloud for TEXT, corresponding to each class.

👉Data Preprocessing:

I have employed 2 methods for vectorizing categorical data:

CatBoost Encoder.
Response Coding.
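For readers unfamiliar with response coding, here is a minimal sketch of the idea: each category value is replaced by a vector of smoothed class probabilities estimated on the training split only. The Laplace smoothing term `alpha`, the function names, and the toy rows are assumptions for illustration, not details taken from the notebooks.

```python
from collections import Counter, defaultdict


def fit_response_coding(values, labels, classes, alpha=1.0):
    """Estimate P(class | value) per category value, with Laplace smoothing."""
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    n_classes = len(classes)
    table = {}
    for value, class_counts in counts.items():
        total = sum(class_counts.values())
        table[value] = [
            (class_counts[cls] + alpha) / (total + alpha * n_classes)
            for cls in classes
        ]
    return table


def transform_response_coding(values, table, n_classes):
    """Look up each value; values unseen during fitting get a uniform vector."""
    uniform = [1.0 / n_classes] * n_classes
    return [table.get(value, uniform) for value in values]


# Hypothetical toy rows: the gene names are used only as example category values.
train_genes = ["BRCA1", "BRCA1", "TP53", "BRCA1"]
train_labels = [1, 1, 2, 2]

table = fit_response_coding(train_genes, train_labels, classes=[1, 2])
encoded = transform_response_coding(["BRCA1", "EGFR"], table, n_classes=2)
```

With 9 classes, each categorical column would become 9 numeric columns under this scheme. Fitting the table on the training split alone keeps label information from the test split out of the features.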

⚪ TFIDF vectorization on 'text'.

⚪ Sentence vector through BERT on 'text'.
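The TFIDF bullet above can be illustrated with a small pure-Python sketch. It uses the textbook weighting, term frequency times log(N / document frequency); library implementations typically add smoothing and normalization, so their numbers will differ. The `tfidf` helper and the example sentences are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return the sorted vocabulary and one TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)
    rows = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        length = max(len(tokens), 1)  # guard against empty documents
        rows.append([
            (term_counts[term] / length) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ])
    return vocab, rows


# Hypothetical example sentences (not taken from the dataset).
docs = ["mutation activates kinase", "mutation is neutral"]
vocab, rows = tfidf(docs)
```

Here "mutation" appears in every document, so its weight is zero, while "kinase" appears in only one and gets a positive weight. This is why TF-IDF tends to highlight the words that distinguish one class's clinical text from another's.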

👉Model:

Decision Trees
Random Forest
XGBoost

🛠 Skills

Python, Feature Engineering, Hyperparameter Tuning (Optuna), Streamlit, Heroku.

