
Predicting the Cancerous Gene Variations via Machine Learning Techniques

Personalized Cancer Diagnosis

Software

Jupyter Notebook  

Packages

1) Pandas
2) SciPy
3) Scikit-Learn
4) Seaborn
5) NLTK
6) NumPy
7) Plotly
8) Matplotlib

Installation of Packages

  • Open cmd and type the following commands:
  pip3 install pandas
  pip3 install matplotlib
  pip3 install nltk
  pip3 install numpy
  pip3 install scipy
  pip3 install scikit-learn
  pip3 install seaborn
  pip3 install plotly

Concepts Used

  • Hyperparameter Tuning
  • K-Nearest Neighbours
  • Logistic Regression
  • Exploratory Data Analysis
  • Support Vector Machine
  • Random Forest Classifier
  • One Hot Encoding
  • Response Encoding / Mean Value Replacement
  • Naive Bayes
  • Laplace Smoothing

Problem Overview

  • Classify the given genetic variations/mutations based on evidence from text-based clinical literature.
    Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/

  • We have two data files:

    1) One contains information about the genetic mutations.

    2) The other contains the clinical evidence (text) that human experts/pathologists use to classify the genetic mutations.

  • Both data files share a common column called ID.

  • Data files' columns:

    1) training_variants (ID, Gene, Variation, Class)

    2) training_text (ID, Text)

  • A genetic mutation can be classified into one of 9 different classes => a multi-class classification problem.

  • Performance Metric(s) to be used:

    1) Multi-class log-loss

    2) Confusion matrix

Objective & Constraints

  • Objective: Predict the probability of each data-point belonging to each of the nine classes.
  • Constraints:
    1. Interpretability
    2. Class probabilities are needed.
    3. Penalize errors in the predicted class probabilities => the metric is log-loss (see the sketch below).
    4. No Latency constraints.
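
Since log-loss is the primary metric, here is a minimal sketch of how multi-class log-loss can be computed with scikit-learn; the labels and probability rows below are toy values, not taken from the dataset.

  import numpy as np
  from sklearn.metrics import log_loss

  # Toy example: 3 data points, 9 possible classes (1..9).
  y_true = [1, 4, 7]

  # Each row is a predicted probability distribution over the 9 classes.
  y_pred = np.full((3, 9), 0.05)
  y_pred[0, 0] = 0.6   # fairly confident in class 1 for the first point
  y_pred[1, 3] = 0.6   # class 4 for the second point
  y_pred[2, 6] = 0.6   # class 7 for the third point

  # Rows already sum to 1 (8 * 0.05 + 0.6); log_loss penalizes putting
  # low probability on the true class.
  print(log_loss(y_true, y_pred, labels=list(range(1, 10))))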

Workflow Analysis

Step 1

  1. Reading Gene and Variation Data

  2. Reading Text Data

  3. Preprocessing of Text

  4. Splitting the Data into Train, Test and Cross Validation (64:20:16); see the split sketch after this list
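
A minimal sketch of how the two files could be read, merged on ID and split in a stratified 64:20:16 ratio. The file names and the '||' separator follow the Kaggle competition data; variable names such as data, X_train, X_cv and X_test are illustrative and are reused in the later sketches.

  import pandas as pd
  from sklearn.model_selection import train_test_split

  # training_text uses '||' as its field separator (per the Kaggle data description).
  variants = pd.read_csv('training_variants')
  text = pd.read_csv('training_text', sep=r'\|\|', engine='python',
                     skiprows=1, names=['ID', 'Text'])
  data = pd.merge(variants, text, on='ID', how='left')

  # 20% held out as the test set, stratified on the class label.
  X_train, X_test, y_train, y_test = train_test_split(
      data, data['Class'], test_size=0.2, stratify=data['Class'], random_state=42)

  # 20% of the remaining 80% becomes the cross-validation set (0.8 * 0.2 = 16%).
  X_train, X_cv, y_train, y_cv = train_test_split(
      X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)

  print(X_train.shape[0], X_cv.shape[0], X_test.shape[0])   # roughly 64% / 16% / 20%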


Clearly, Classes 1, 2, 4 and 7 have noticeably more data points than the others. This class distribution will influence how the models are likely to perform.

Step 2

  1. Prediction using a Random Model


This random model acts as a baseline for the other models, i.e. we should aim to keep every other model's log-loss below that of the random model.
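
A sketch of such a baseline, reusing y_cv from the split sketch above: every point gets a random probability distribution over the 9 classes, and the resulting log-loss is the number the real models have to beat.

  import numpy as np
  from sklearn.metrics import log_loss

  rng = np.random.default_rng(0)

  # One random probability row per CV point, normalized so each row sums to 1.
  random_probs = rng.random((len(y_cv), 9))
  random_probs /= random_probs.sum(axis=1, keepdims=True)

  baseline = log_loss(y_cv, random_probs, labels=list(range(1, 10)))
  print(f"Random-model CV log-loss: {baseline:.3f}")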

Step 3

  1. Univariate Analysis

    Uni means one and variate means variable, so univariate analysis looks at a single variable at a time. Its objective is to describe and summarize the data and to analyze the patterns present in it; each variable in the dataset is explored separately. It applies to both kinds of variables, categorical and numerical.

We do the Univariate Analysis on 3 Features (a small sketch for the Gene feature follows this list):

i) Gene (Categorical Variable)


ii) Variation (Categorical Variable)


iii) Text Feature (Words)
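
A small sketch of what the univariate analysis of the Gene feature can look like, reusing X_train from the split sketch above; Variation and the text feature are explored in the same spirit.

  import numpy as np

  gene_counts = X_train['Gene'].value_counts()
  print("Unique genes in train:", gene_counts.shape[0])
  print(gene_counts.head(10))                     # most frequent genes

  # Cumulative share of training points covered by the most frequent genes.
  cumulative = np.cumsum(gene_counts.values) / gene_counts.sum()
  print("Share covered by the 50 most frequent genes:", round(cumulative[49], 3))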



  2. Stacking The Three Types of Features
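
A minimal sketch of stacking the three feature blocks with one-hot style encodings: CountVectorizer turns Gene, Variation and the clinical text into sparse matrices, and scipy.sparse.hstack glues them together (the vectorizer settings here are illustrative, not the notebook's exact ones).

  from scipy.sparse import hstack
  from sklearn.feature_extraction.text import CountVectorizer

  gene_vec = CountVectorizer()
  var_vec = CountVectorizer()
  text_vec = CountVectorizer(min_df=3)      # drop very rare words

  # Fit on the training split only, then reuse the fitted vectorizers elsewhere.
  train_x = hstack((gene_vec.fit_transform(X_train['Gene']),
                    var_vec.fit_transform(X_train['Variation']),
                    text_vec.fit_transform(X_train['Text'].astype(str)))).tocsr()

  cv_x = hstack((gene_vec.transform(X_cv['Gene']),
                 var_vec.transform(X_cv['Variation']),
                 text_vec.transform(X_cv['Text'].astype(str)))).tocsr()

  print(train_x.shape, cv_x.shape)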

Step 4

  1. Machine Learning Models

I) Naive Bayes
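
A sketch (not the notebook's exact code) of a Multinomial Naive Bayes model on the stacked features: the Laplace smoothing parameter alpha is tuned on the CV set, and CalibratedClassifierCV is used so the predicted probabilities behave well under log-loss.

  from sklearn.naive_bayes import MultinomialNB
  from sklearn.calibration import CalibratedClassifierCV
  from sklearn.metrics import log_loss

  best_alpha, best_loss = None, float('inf')
  for alpha in [1e-5, 1e-3, 0.1, 1, 10, 100]:        # Laplace smoothing strengths
      clf = CalibratedClassifierCV(MultinomialNB(alpha=alpha), method='sigmoid')
      clf.fit(train_x, y_train)
      loss = log_loss(y_cv, clf.predict_proba(cv_x), labels=clf.classes_)
      if loss < best_loss:
          best_alpha, best_loss = alpha, loss

  print(f"Best alpha = {best_alpha}, CV log-loss = {best_loss:.3f}")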

My Output:

II) K Nearest Neighbour
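
A sketch of k-nearest neighbours with the number of neighbours tuned on the CV set. For a self-contained example it reuses the stacked one-hot features; in practice a lower-dimensional representation such as response coding can work better for KNN.

  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.metrics import log_loss

  best_k, best_loss = None, float('inf')
  for k in [5, 11, 15, 21, 31, 41, 51]:
      knn = KNeighborsClassifier(n_neighbors=k)
      knn.fit(train_x, y_train)
      loss = log_loss(y_cv, knn.predict_proba(cv_x), labels=knn.classes_)
      if loss < best_loss:
          best_k, best_loss = k, loss

  print(f"Best k = {best_k}, CV log-loss = {best_loss:.3f}")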

My Output:

III) Logistic Regression with Class Balancing
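
A sketch of logistic regression with class balancing, implemented here as an SGD classifier with logistic loss, class_weight='balanced' to counter the skewed class distribution, and sigmoid calibration on top; the regularization strength alpha would normally be tuned the same way as above.

  from sklearn.linear_model import SGDClassifier
  from sklearn.calibration import CalibratedClassifierCV
  from sklearn.metrics import log_loss

  # loss='log_loss' on scikit-learn >= 1.1 (loss='log' on older versions).
  lr = SGDClassifier(loss='log_loss', penalty='l2', alpha=1e-4,
                     class_weight='balanced', random_state=42)
  clf = CalibratedClassifierCV(lr, method='sigmoid')
  clf.fit(train_x, y_train)

  print("CV log-loss:", log_loss(y_cv, clf.predict_proba(cv_x), labels=clf.classes_))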

My Output:

IV) Logistic Regression without Class Balancing

My Output:

V) Linear Support Vector Machine
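
A sketch of a linear SVM, again via SGD but with hinge loss; since hinge loss does not produce probabilities directly, the calibration step is what makes log-loss computable.

  from sklearn.linear_model import SGDClassifier
  from sklearn.calibration import CalibratedClassifierCV
  from sklearn.metrics import log_loss

  svm = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4,
                      class_weight='balanced', random_state=42)
  clf = CalibratedClassifierCV(svm, method='sigmoid')
  clf.fit(train_x, y_train)

  print("CV log-loss:", log_loss(y_cv, clf.predict_proba(cv_x), labels=clf.classes_))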

My Output:

VI) Random Forest Classifier
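
A sketch of a random forest on the stacked features with calibrated probabilities; the hyperparameters shown are illustrative and would normally be tuned on the CV set.

  from sklearn.ensemble import RandomForestClassifier
  from sklearn.calibration import CalibratedClassifierCV
  from sklearn.metrics import log_loss

  rf = RandomForestClassifier(n_estimators=1000, max_depth=10,
                              n_jobs=-1, random_state=42)
  clf = CalibratedClassifierCV(rf, method='sigmoid')
  clf.fit(train_x, y_train)

  print("CV log-loss:", log_loss(y_cv, clf.predict_proba(cv_x), labels=clf.classes_))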

My Output:

Summarized Log-Losses and Misclassified Points


Logistic Regression with Class Balancing using One-Hot Encoding gave the lowest misclassified-points percentage of all the individual models.


The Voting Classifier, which combines Naive Bayes, Logistic Regression and a Support Vector Machine, gave the lowest misclassified-points percentage of all the models. However, its interpretability is almost negligible, so it is not recommended here.
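
A sketch of the idea behind that ensemble: a soft-voting classifier that averages the calibrated probabilities of the logistic regression, linear SVM and Naive Bayes models (hyperparameters are illustrative, not the notebook's exact configuration).

  from sklearn.ensemble import VotingClassifier
  from sklearn.linear_model import SGDClassifier
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.calibration import CalibratedClassifierCV
  from sklearn.metrics import log_loss

  lr = CalibratedClassifierCV(
      SGDClassifier(loss='log_loss', class_weight='balanced', random_state=42),
      method='sigmoid')
  svm = CalibratedClassifierCV(
      SGDClassifier(loss='hinge', class_weight='balanced', random_state=42),
      method='sigmoid')
  nb = CalibratedClassifierCV(MultinomialNB(alpha=1.0), method='sigmoid')

  vote = VotingClassifier(estimators=[('lr', lr), ('svm', svm), ('nb', nb)],
                          voting='soft')
  vote.fit(train_x, y_train)

  print("CV log-loss:", log_loss(y_cv, vote.predict_proba(cv_x), labels=vote.classes_))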

References & Resources
