S-M-J-I/cancer-classification-uci

Breast Cancer Classification

A short project that predicts breast cancer diagnoses and compares several models to find the best-performing one, using the UCI Wisconsin Breast Cancer dataset.

Kaggle Link

Technologies used:

  • Jupyter Notebook
  • NumPy
  • Pandas
  • scikit-learn
  • Matplotlib

Dataset:

  • radius (mean of distances from center to points on the perimeter)
  • texture (standard deviation of gray-scale values)
  • perimeter
  • area
  • smoothness (local variation in radius lengths)
  • compactness (perimeter^2 / area - 1.0)
  • concavity (severity of concave portions of the contour)
  • concave points (number of concave portions of the contour)
  • symmetry
  • fractal dimension ("coastline approximation" - 1)

There are two possible labels: WDBC-Malignant (1) and WDBC-Benign (0).

All of the remaining columns are used as features.
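As a quick way to inspect the data described above, here is a minimal sketch that loads the copy of the dataset bundled with scikit-learn. This is an assumption: the notebook itself may read the Kaggle CSV instead. The bundled copy has 30 feature columns, because each of the ten measurements listed above is recorded as a mean, a standard error, and a "worst" value. scikit-learn also encodes the labels the other way round (malignant = 0, benign = 1), so the sketch flips them to match the convention used here.

```python
from sklearn.datasets import load_breast_cancer

# Load the copy of the dataset bundled with scikit-learn
# (assumption: the original notebook reads the Kaggle CSV instead).
data = load_breast_cancer(as_frame=True)
x = data.data  # one row per sample, one numeric column per feature

# scikit-learn encodes malignant as 0 and benign as 1; flip the labels so
# they match the convention used here (Malignant = 1, Benign = 0).
y = 1 - data.target

print(x.shape)
print(y.value_counts())
```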

Process:

First, we check how many missing values the dataset has. We then drop the columns that have no significant effect on our predictions. If any labels are missing, we drop the entire row.
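The cleaning steps above could look roughly like the following sketch. The small DataFrame and its column names (`id`, `radius_mean`, `texture_mean`, `diagnosis`) are hypothetical stand-ins for the real CSV, chosen only to make the example self-contained.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data; all column names are hypothetical.
df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "radius_mean": [17.99, 20.57, 19.69, 11.42],
    "texture_mean": [10.38, 17.77, 21.25, 20.38],
    "diagnosis": [1, 1, np.nan, 0],
})

# Count the missing values in each column.
print(df.isnull().sum())

# Drop a column that carries no predictive signal (here, the row identifier).
df = df.drop(columns=["id"])

# Drop every row whose label is missing.
df = df.dropna(subset=["diagnosis"])
```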

We then split the data into features (x) and labels (y).
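A minimal sketch of that separation, assuming the label lives in a column named `diagnosis` (a hypothetical name used only for illustration):

```python
import pandas as pd

# Toy cleaned frame; the column names are hypothetical.
df = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 11.42],
    "texture_mean": [10.38, 17.77, 20.38],
    "diagnosis": [1, 1, 0],
})

# Features: every column except the label. Labels: the label column alone.
x = df.drop(columns=["diagnosis"])
y = df["diagnosis"]
```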

The dataset has no missing values and no categorical features.

Next, we apply feature scaling by normalizing the data. Later, we also looked at what the scores would have been had we standardized it instead.
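To make the difference between the two scaling options concrete, here is a small sketch on a toy array. It assumes that "normalize" means min-max scaling to the range [0, 1] (`MinMaxScaler`) and that "standardize" means zero mean and unit variance (`StandardScaler`); the notebook may use different scalers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix with two columns on very different scales.
x = np.array([[10.0, 200.0],
              [15.0, 400.0],
              [20.0, 1000.0]])

# Normalization: rescale each column to the range [0, 1].
x_norm = MinMaxScaler().fit_transform(x)

# Standardization: shift each column to mean 0 and scale it to unit variance.
x_std = StandardScaler().fit_transform(x)

print(x_norm)
print(x_std)
```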

We split the data into training and test sets, with the test set making up 20% of the total data.
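A sketch of the 80/20 split using scikit-learn's `train_test_split`. The bundled copy of the dataset and the `random_state=0` seed are assumptions made so the example is reproducible; they are not taken from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's bundled copy of the dataset.
x, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the rows for testing; the seed is an arbitrary choice.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)
```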

We then try the following models on the data to find which one performs best:

  • Linear Model (Accuracy score: 96%)
  • Support Vector Machine (Accuracy score: 97%)
  • Stochastic Gradient Descent (Accuracy score: 96%)
  • Nearest Neighbours (Accuracy score: 96%)
  • Gaussian Processes (Accuracy score: 95%)
  • Naive Bayes (Accuracy score: 90%) Worst performing
  • Decision Trees (Accuracy score: 91%)
  • Random Forest (Accuracy score: 98%) Best performing
  • Majority Voting (Accuracy score: 97%)
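The comparison above can be sketched as a loop over scikit-learn estimators. Everything specific here is an assumption rather than the notebook's exact setup: the choice of `LogisticRegression` as the linear model, default hyperparameters, the three members of the voting ensemble, and the `random_state=0` seeds. The scores it prints will therefore not necessarily match the figures listed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled copy of the dataset.
x, y = load_breast_cancer(return_X_y=True)

# Normalize, then split 80/20, in the order the process is described.
x = MinMaxScaler().fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# One estimator per entry in the list; hyperparameters are library defaults.
models = {
    "Linear Model": LogisticRegression(),
    "Support Vector Machine": SVC(),
    "Stochastic Gradient Descent": SGDClassifier(random_state=0),
    "Nearest Neighbours": KNeighborsClassifier(),
    "Gaussian Processes": GaussianProcessClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Decision Trees": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Majority voting: each member predicts a class and the most common one wins.
models["Majority Voting"] = VotingClassifier(
    estimators=[
        ("svm", SVC()),
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    voting="hard",
)

# Fit every model on the training set and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(x_test))

# Print the models from best to worst test accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2%}")
```

With a test set of only about a hundred rows, differences of one or two percentage points between models correspond to one or two samples, so the ranking can change with a different random split.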

Conclusion:

The best model for this case, a binary classification problem, is the Random Forest (RandomForestClassifier), with an accuracy score of 98%.

Random Forest is suitable when we have a large dataset and interpretability is not a major concern. It also tends to give high accuracy, as it did here.

However, the main limitation of Random Forest is that a large number of trees can make the algorithm too slow for real-time predictions. In general, these algorithms are fast to train but comparatively slow at producing predictions once trained.
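One way to check how prediction time grows with the number of trees is to time `predict` for a few forest sizes. This is only a rough sketch: the tree counts, the seed, and the use of scikit-learn's bundled dataset are arbitrary assumptions, and absolute timings depend on the machine.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Assumption: scikit-learn's bundled copy of the dataset.
x, y = load_breast_cancer(return_X_y=True)

# Time a full pass of predictions for forests of increasing size.
timings = {}
for n_trees in (10, 100, 500):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(x, y)
    start = time.perf_counter()
    forest.predict(x)
    timings[n_trees] = time.perf_counter() - start

for n_trees, seconds in timings.items():
    print(f"{n_trees} trees: {seconds * 1000:.1f} ms to predict {len(x)} rows")
```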
