- Open a terminal (cmd) and run the following commands:
pip3 install pandas
pip3 install matplotlib
pip3 install nltk
pip3 install numpy
pip3 install scipy
pip3 install scikit-learn
pip3 install seaborn
pip3 install plotly
- Hyperparameter Tuning
- K-Nearest Neighbours
- Logistic Regression
- Exploratory Data Analysis
- Support Vector Machine
- Random Forest Classifier
- One Hot Encoding
- Response Encoding / Mean Value Replacement
- Naive Bayes
- Laplace Smoothing
Classify the given genetic variations/mutations based on evidence from text-based clinical literature.
Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/
We have two data files; both of them have a common column called ID.

Data files' information:

There are 9 different classes a genetic mutation can be classified into => a multi-class classification problem.
Performance Metric(s) to be used:
- Objective: Predict the probability of each data-point belonging to each of the nine classes.
- Constraints:
- Interpretability
- Class probabilities are needed.
- Penalize the errors in class probabilities => the metric is Log-loss.
- No Latency constraints.
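As a quick illustration of the metric, multi-class log loss can be computed with scikit-learn's `log_loss`; the labels and probabilities below are made up for the sketch:

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical example: 4 data points, 9 classes (labels 0..8).
y_true = [0, 3, 6, 1]

# A predicted probability distribution over the 9 classes for each point.
rng = np.random.default_rng(42)
y_pred = rng.random((4, 9))
y_pred = y_pred / y_pred.sum(axis=1, keepdims=True)  # each row must sum to 1

# Multi-class log loss = -mean(log(probability assigned to the true class));
# lower is better, and confident wrong predictions are penalized heavily.
print(log_loss(y_true, y_pred, labels=list(range(9))))
```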
- Reading Gene and Variation Data
- Reading Text Data
- Preprocessing of Text
- Splitting the Data into Train, Test and Cross Validation (64:20:16)
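The 64:20:16 split above can be produced with two successive `train_test_split` calls; the arrays below are toy stand-ins for the real features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real features/labels (assumption: X, y come from the dataset).
X = np.arange(1000).reshape(-1, 1)
y = np.random.default_rng(0).integers(1, 10, size=1000)  # 9 classes, labeled 1..9

# Step 1: hold out 20% as the test set, stratified so class proportions are preserved.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Step 2: 20% of the remaining 80% becomes cross-validation data,
# leaving 64% train / 16% CV / 20% test overall.
X_train, X_cv, y_train, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=42)

print(len(X_train), len(X_cv), len(X_test))  # 640 160 200
```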
Clearly, Classes 1, 2, 4 and 7 have more data points than the rest. This skewed distribution will influence how our models are likely to perform.
- Prediction using a Random Model
This random model acts as a baseline for the other models, i.e. every other model should keep its Log-Loss value below that of the random model.
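A minimal sketch of such a random baseline, assuming made-up labels: predict a random probability distribution for every point and measure its log loss.

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
n_points, n_classes = 500, 9

# "Random model": a random probability distribution for every point.
probs = rng.random((n_points, n_classes))
probs /= probs.sum(axis=1, keepdims=True)

# Hypothetical true labels, drawn uniformly from the 9 classes.
y_true = rng.integers(0, n_classes, size=n_points)

# Typically somewhat above log(9) ~= 2.197, the loss of a uniform predictor;
# any trained model should come in below this value.
baseline = log_loss(y_true, probs, labels=list(range(n_classes)))
print(round(baseline, 3))
```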
- Univariate Analysis
Uni means one and variate means variable, so in univariate analysis there is only one variable under consideration. The objective of univariate analysis is to describe and summarize the data and analyze the patterns present in it. In a dataset, it explores each variable separately, and it works for both kinds of variables: categorical and numerical.
We do the Univariate Analysis on 3 Features: Gene, Variation and Text.
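As a sketch, univariate analysis of a single categorical feature (a hypothetical `Gene` column with toy values) amounts to examining its value distribution:

```python
import pandas as pd

# Hypothetical stand-in for the training data's Gene column.
df = pd.DataFrame({"Gene": ["BRCA1", "TP53", "BRCA1", "EGFR", "TP53", "BRCA1"]})

# Frequency of each distinct value, most common first.
counts = df["Gene"].value_counts()
print(counts)

# Cumulative share of the data covered by the most frequent genes.
print((counts / counts.sum()).cumsum())
```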
- Stacking The Three Types of Features
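One common way to stack heterogeneous feature families, sketched here with toy data and assumed encoders (count vectors for Gene/Variation, TF-IDF for Text), is `scipy.sparse.hstack`:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy rows standing in for the Gene, Variation and Text columns.
genes = ["BRCA1", "TP53", "BRCA1"]
variations = ["Truncating", "Missense", "Amplification"]
texts = ["tumor suppressor loss", "missense mutation observed", "gene amplification seen"]

# Encode each feature family separately...
gene_feats = CountVectorizer().fit_transform(genes)   # one-hot-like counts
var_feats = CountVectorizer().fit_transform(variations)
text_feats = TfidfVectorizer().fit_transform(texts)

# ...then stack them horizontally into one sparse design matrix,
# one row per data point, columns from all three encoders side by side.
X = hstack([gene_feats, var_feats, text_feats])
print(X.shape)
```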
- Machine Learning Models
Logistic Regression with Class Balancing using One Hot Encoding gave the lowest Misclassified Points Percentage among the interpretable models. The Voting Classifier, which combines Naive Bayes, Logistic Regression and Support Vector Machine, achieved an even lower Misclassified Points Percentage, but its Interpretability is almost negligible, and it is therefore not recommended.
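A minimal sketch of logistic regression with class balancing, on a synthetic imbalanced 9-class dataset standing in for the encoded features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced 9-class data standing in for the real encoded features.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=20,
                           n_classes=9,
                           weights=[0.3, 0.2, 0.1, 0.1, 0.1,
                                    0.08, 0.05, 0.04, 0.03],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" reweights samples inversely to class frequency,
# which is the "class balancing" referred to above.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

print("log-loss:", log_loss(y_te, clf.predict_proba(X_te)))
```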
- https://machinelearningknowledge.ai/k-nearest-neighbor-classification-simple-explanation-beginners/
- https://towardsdatascience.com/choosing-the-right-encoding-method-label-vs-onehot-encoder-a4434493149b
- https://www.analyticsvidhya.com/blog/2021/04/exploratory-analysis-using-univariate-bivariate-and-multivariate-analysis-techniques/
- https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
- https://www.datacamp.com/community/tutorials/categorical-data
- https://www.ibm.com/cloud/learn/exploratory-data-analysis
- https://www.geeksforgeeks.org/introduction-to-support-vector-machines-svm/
- https://towardsdatascience.com/introduction-to-data-analysis-basic-concepts-involved-in-multivariate-analysis-4295cc125052
- https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/
- https://www.kaggle.com/dansbecker/what-is-log-loss
- https://www.cs.princeton.edu/courses/archive/spring16/cos495/slides/ML_basics_lecture7_multiclass.pdf
- https://towardsdatascience.com/intuition-behind-log-loss-score-4e0c9979680a
- https://www.javatpoint.com/machine-learning-random-forest-algorithm
- https://www.forbes.com/sites/matthewherper/2017/06/03/a-new-cancer-drug-helped-almost-everyone-who-took-it-almost-heres-what-it-teaches-us/#2a44ee2f6b25
- https://www.youtube.com/watch?v=qxXRKVompI8