We are going to create a model that predicts whether a person has breast cancer. The prediction and target are in numerical format (0 and 1). Note that in scikit-learn's built-in breast cancer dataset, 0 means malignant (cancerous) and 1 means benign (non-cancerous).
- Python 3.10
- Anaconda / conda
These are the steps we are going to follow as we code:
- Run `conda create --prefix ./env numpy pandas scikit-learn matplotlib notebook` in the Anaconda/conda prompt. This creates a virtual environment with all the necessary packages installed.
- Run `conda activate path/to/env` to activate the virtual environment.
- Run `jupyter notebook` to open Jupyter Notebook.
- Create a Python notebook.
- Import all the packages.
- Import the breast cancer data from `sklearn.datasets`.
- Create a pandas DataFrame using the breast cancer data.
- Check whether any data fields are empty/null.
- Split the data into feature and target subsets (X & Y respectively).
- Import and initialize the classification model from `sklearn.ensemble`.
- Split the feature and target datasets into train and test datasets.
- Fit/train the model using the train dataset.
- Evaluate the model.
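The notebook steps above can be sketched roughly as follows. This is a minimal sketch, not the definitive implementation: the text only names `sklearn.ensemble`, so `RandomForestClassifier` is an assumed choice, and the 80/20 split and `random_state=42` are arbitrary values picked for reproducibility.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the built-in breast cancer dataset.
data = load_breast_cancer()

# Build a DataFrame with one column per feature plus a "target" column.
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Count empty/null fields across the whole DataFrame.
missing = int(df.isna().sum().sum())

# Split the data into features (X) and target (y).
X = df.drop(columns="target")
y = df["target"]

# Assumption: RandomForestClassifier is one of several classifiers
# available in sklearn.ensemble; the text does not say which one is used.
model = RandomForestClassifier(random_state=42)

# Hold out 20% of the rows for testing (assumed split size).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the model on the training data, then score it on the held-out data.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"missing values: {missing}")
print(f"test accuracy: {test_accuracy:.3f}")
```

Splitting before fitting keeps the test rows unseen during training, so the printed score reflects how the model handles new data.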
To evaluate our model we use several methods. Scikit-learn provides many evaluation methods; the ones we are going to use are:
- The default model score method, `model.score()`. For a classifier this returns the mean accuracy (the coefficient of determination is what regressors return).
- Cross-validation
- `sklearn.metrics.accuracy_score`
- ROC (receiver operating characteristic) curve
- Confusion matrix
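As a rough illustration of the score, cross-validation, accuracy, and confusion-matrix methods listed above, the sketch below evaluates a classifier on a held-out test set. The choice of `RandomForestClassifier`, the 80/20 split, `random_state=42`, and `cv=5` are all assumptions made for this example. The ROC curve is left out here to keep the block short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumed model; any scikit-learn classifier exposes the same methods.
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Default score: mean accuracy on the given data for a classifier.
default_score = model.score(X_test, y_test)

# accuracy_score compares true labels with predictions directly.
accuracy = accuracy_score(y_test, y_pred)

# 5-fold cross-validation: a fresh model is trained and scored on each fold.
cv_scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

print(f"model.score:    {default_score:.3f}")
print(f"accuracy_score: {accuracy:.3f}")
print(f"cross-val mean: {cv_scores.mean():.3f}")
print(cm)
```

Because `model.score()` on a classifier computes accuracy, the first two numbers agree; cross-validation averages over several splits and is therefore less sensitive to one lucky or unlucky split.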
ROC Curve
Perfect ROC Curve
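To make the two ROC cases concrete, here is a minimal sketch that computes the ROC curve for a trained model and for a hypothetical perfect classifier. The model choice (`RandomForestClassifier`), the split, and `random_state=42` are assumptions for illustration; the "perfect" scores are simulated by reusing the true labels as scores.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumed model, used only to produce probability scores.
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# The ROC curve needs scores, not hard labels: take the probability of class 1.
y_proba = model.predict_proba(X_test)[:, 1]

# False positive rate and true positive rate at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

# A perfect classifier ranks every positive above every negative.
# Simulated here by scoring each sample with its own true label.
perfect_fpr, perfect_tpr, _ = roc_curve(y_test, y_test)
perfect_auc = roc_auc_score(y_test, y_test)

print(f"model AUC:   {auc:.3f}")
print(f"perfect AUC: {perfect_auc:.3f}")
```

The `fpr` and `tpr` arrays can be passed to `matplotlib.pyplot.plot` to draw the curve. A perfect curve reaches a true positive rate of 1 while the false positive rate is still 0, which gives an area under the curve of 1.0.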