# Breast Cancer Diagnosis using Machine Learning

## 1. Background

### 1.1 Dataset

The breast cancer dataset is a widely recognized and frequently used dataset in the field of machine learning and medical research. It is included in the scikit-learn library, a popular Python library for machine learning tasks. The dataset provides valuable information extracted from digitized images of fine needle aspirates (FNA) of breast masses, aiding in the diagnosis and classification of breast cancer tumors.

Comprising a total of 569 instances, this dataset offers a rich collection of features that describe various characteristics of cell nuclei present in the breast images. These features have been carefully computed from the images and encompass essential attributes such as radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.

For each instance, 30 numeric features are available, providing insights into the nature and composition of the breast mass. These features are further categorized into three groups: mean, standard error, and worst, representing the mean value, standard error value, and the worst (mean of the three largest values) value of each attribute, respectively.

Alongside the features, the dataset includes a binary target variable that serves as the ground truth for classification tasks. In the scikit-learn encoding, a value of 0 indicates a malignant tumor and a value of 1 indicates a benign tumor.

The breast cancer dataset serves as a valuable resource for researchers, data scientists, and medical professionals aiming to develop robust machine learning models for breast cancer diagnosis. It enables the exploration of various classification algorithms, feature selection techniques, and model evaluation strategies, fostering advancements in the field of medical diagnostics and contributing to the early detection and treatment of breast cancer.

### 1.2 Introduction

Objective: Develop a machine learning model that classifies breast tumors as benign or malignant from the features in the provided dataset. Using the breast cancer dataset from the sklearn.datasets module, we aim to build a predictive model that can assist in the early detection and diagnosis of breast cancer.

## 2. Preparation

In this section, you will:
1. Load the breast cancer dataset using the sklearn.datasets module.
2. Understand the structure and characteristics of the dataset.
3. Explore the features and their distributions.

### 1.1 Import library

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
import seaborn as sns

plt.style.use('ggplot')

### 1.2 Load Dataset

In [None]:
# Load the breast cancer dataset
data = load_breast_cancer()

# Access the features (X) and target variable (y)
X = data.data
y = data.target

In [None]:
# Print the shape of the dataset
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

In [None]:
print(data.feature_names)
print()
print(data.target_names)

### 1.3 Dataframe Creation

In [None]:
# Create a dataframe with one column per feature
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add the target variable to the dataframe
df['target'] = data.target

# Print the first few rows of the dataframe
df.head()

In [None]:
df.info()

In [None]:
df.describe()

The breast cancer dataset from scikit-learn contains information about various features computed from digitized images of breast mass aspirates. Here's some information about the dataset:

1. Number of Instances: 569
2. Number of Features: 30 (not counting the target variable)

The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image. They cover the following attributes:

1. Radius
2. Texture
3. Perimeter
4. Area
5. Smoothness
6. Compactness
7. Concavity
8. Concave points
9. Symmetry
10. Fractal dimension

For each of these ten attributes, three measures are available: the mean, the standard error, and the "worst" or largest value (the mean of the three largest values). This results in a total of 30 features.
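
The grouping can be checked against the feature names themselves. The sketch below assumes the naming used by scikit-learn's copy of the dataset, where names start with `mean`, end with `error`, or start with `worst`; the three list names are illustrative.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
names = list(data.feature_names)

# Group the 30 feature names by the measure they report
mean_features = [n for n in names if n.startswith("mean ")]
error_features = [n for n in names if n.endswith(" error")]
worst_features = [n for n in names if n.startswith("worst ")]

print(len(mean_features), len(error_features), len(worst_features))
print(mean_features)
```

Each group should contain one entry per attribute, so the three lengths add up to the 30 feature columns.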

The target variable in the dataset represents the class label and indicates whether the breast mass is classified as malignant (cancerous) or benign (non-cancerous). In the scikit-learn encoding, the two classes are:

0: Malignant
1: Benign
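
Because the integer codes are easy to mix up, it can help to read them from the dataset rather than from memory. A minimal sketch, assuming only that `target_names[i]` is the label for code `i`:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the label that the integer code i stands for
for code, name in enumerate(data.target_names):
    count = int(np.sum(data.target == code))
    print(code, name, count)
```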

## 2. Data Visualization

### 2.1 Distribution of Benign and Malignant Breast Cancer Cases

In [None]:
# Count the occurrences of each target value
target_counts = 

# Define the colors for the bars
colors = 

# Create a bar plot with custom colors

plt.show()
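
One possible way to complete the cell above is sketched here. The colors and axis labels are illustrative choices, not part of the original notebook, and the block rebuilds `df` so that it runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Count the occurrences of each target value, ordered by class code (0, 1)
target_counts = df["target"].value_counts().sort_index()

# Illustrative colors, one per class
colors = ["tab:red", "tab:blue"]

# Label the bars with the class names instead of the integer codes
labels = [str(data.target_names[i]) for i in target_counts.index]
plt.bar(labels, target_counts.values, color=colors)
plt.xlabel("Diagnosis")
plt.ylabel("Number of cases")
plt.title("Distribution of benign and malignant cases")
plt.show()
```

Sorting by index keeps the bar order aligned with `data.target_names`, so each label is attached to the right count.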

### 2.2 Relationship Between Selected Features and Target Variable (Benign vs. Malignant)

In [None]:
# Select a subset of features for visualization
subset_features = 

# Create a pairwise scatter plot

plt.show()

## 3. Data Cleaning

### 3.1 Checking for Missing Values
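
A minimal check, assuming the dataframe is built as in the Dataframe Creation step; `missing_per_column` is an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Number of missing entries per column, then the overall total
missing_per_column = df.isnull().sum()
print(missing_per_column)
print("Total missing values:", int(missing_per_column.sum()))
```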

### 3.2 Checking for outliers in our dataset

In [None]:
# Select a subset of features for visualization
subset_features = 

# Create box plots for each feature

plt.show()
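
One way to draw the box plots is sketched below using plain matplotlib. The feature subset and figure size are illustrative assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Illustrative subset of features
subset_features = ["mean radius", "mean texture", "mean area", "mean smoothness"]

# One box plot per feature, each on its own axis because the scales differ widely
fig, axes = plt.subplots(1, len(subset_features), figsize=(14, 4))
for ax, feature in zip(axes, subset_features):
    ax.boxplot(df[feature].to_numpy())
    ax.set_title(feature)
    ax.set_xticks([])
plt.tight_layout()
plt.show()
```

Points drawn beyond the whiskers are the candidate outliers; a shared axis would hide them for small-valued features such as smoothness.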

The dataset does not contain any missing values, so no imputation or other handling of missing data is needed. We have decided not to remove the outliers: extreme measurements may reflect genuine characteristics of a tumor rather than recording errors, and retaining them gives a more complete representation of the breast cancer dataset.

## 4. Feature Correlation

### 4.1 Finding the correlation between features

In [None]:
featureMeans = 
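
A possible completion, under the assumption that `featureMeans` is meant to hold the ten columns whose names start with `mean`. The `correlation` name is introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Assumption: featureMeans holds the ten "mean ..." columns
featureMeans = [c for c in df.columns if c.startswith("mean ")]

# Pairwise Pearson correlation between those columns
correlation = df[featureMeans].corr()
print(correlation.round(2))
```

Radius, perimeter, and area are geometrically related, so strong correlations among them are expected.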

In [None]:
import seaborn as sns


### 4.2 Heatmap
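
A sketch of the heatmap with `seaborn.heatmap`, again assuming the correlation is taken over the ten `mean` columns. The color map, annotation format, and figure size are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Correlation matrix of the ten "mean ..." columns
featureMeans = [c for c in df.columns if c.startswith("mean ")]
correlation = df[featureMeans].corr()

# Fix the color scale to [-1, 1] so that zero correlation sits at the midpoint
plt.figure(figsize=(10, 8))
ax = sns.heatmap(correlation, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between the mean features")
plt.tight_layout()
plt.show()
```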

## 5. Data Encoding

No data encoding is required for the breast cancer dataset as it already contains numeric features and the target variable is represented by integers.

## 6. Model Performance Comparison - Preparation

### 6.1 Division of Dataset into Features and Target

In [None]:
# Split the data into features (X) and target variable (y)
X = df.drop(columns='target')
y = df['target']

### 6.2 Standardisation of the Data

In [None]:
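# Standardize the feature variables (zero mean, unit variance per column)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)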

### 6.3 Train-Test Split

In [None]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
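
The split itself could look like the sketch below. The 80/20 ratio, the seed of 42, and the use of `stratify` are assumptions made for illustration; the notebook does not specify them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the standardized features so the block runs on its own
X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Assumed settings: 20% held out, fixed seed, class proportions preserved
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

This follows the notebook's order of standardizing before splitting. Fitting the scaler on the training split only is the stricter alternative, since it keeps test-set statistics out of the training features.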


## 7. Model Performance Comparison

### 7.1 Decision Tree

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Create a decision tree classifier

# Fit the model to the training data

# Make predictions on the test set

# Calculate the accuracy of the model
print("Accuracy:", dt_accuracy)
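
A self-contained sketch of the four commented steps. The split settings and `random_state=42` are assumed values, and `dt` is an illustrative name; only `dt_accuracy` comes from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Rebuild the standardized split so the block runs on its own
X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# Create the classifier, fit it, and score it on the held-out set
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_accuracy = accuracy_score(y_test, dt.predict(X_test))
print("Accuracy:", dt_accuracy)
```

Tree splits depend only on the ordering of feature values, so standardization should not change the result here; it is kept for consistency with the other models.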

### 7.2 Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier

# Fit the model to the standardized training data

# Make predictions on the standardized test set

# Calculate the accuracy of the model
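
The same steps for a random forest, as a sketch. `n_estimators=100` and the seeds are assumed values, and `rf` is an illustrative name; `rf_accuracy` is the name the leaderboard in section 8 expects.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the standardized split so the block runs on its own
X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# An ensemble of 100 trees, each trained on a bootstrap sample of the training set
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_accuracy = accuracy_score(y_test, rf.predict(X_test))
print("Accuracy:", rf_accuracy)
```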


### 7.3 Support Vector Machine (SVM)

In [None]:
from sklearn.svm import SVC

# Create an SVM classifier

# Fit the model to the standardized training data

# Make predictions on the standardized test set

# Calculate the accuracy of the model
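
A sketch for the SVM cell, assuming scikit-learn's default RBF kernel and the same illustrative split settings; `svm` is a name introduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Rebuild the standardized split so the block runs on its own
X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# SVC defaults to an RBF kernel, which is sensitive to feature scale
svm = SVC()
svm.fit(X_train, y_train)
svm_accuracy = accuracy_score(y_test, svm.predict(X_test))
print("Accuracy:", svm_accuracy)
```

Standardization matters most for this model, because unscaled features such as area would otherwise dominate the kernel distances.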


### 7.4 K-Nearest Neighbours (KNN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier

# Fit the model to the standardized training data

# Make predictions on the standardized test set

# Calculate the accuracy of the model
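
A sketch for the KNN cell. `n_neighbors=5` is scikit-learn's default, written out here as an explicit assumption, and `knn` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Rebuild the standardized split so the block runs on its own
X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# Each test point is assigned the majority class of its 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_accuracy = accuracy_score(y_test, knn.predict(X_test))
print("Accuracy:", knn_accuracy)
```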


### 7.5 Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

# Create a Naive Bayes classifier

# Fit the model to the standardized training data

# Make predictions on the standardized test set

# Calculate the accuracy of the model
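
A sketch for the Naive Bayes cell using `GaussianNB` with its default settings; the split settings are the same assumed values as before, and `nb` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Rebuild the standardized split so the block runs on its own
X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# GaussianNB models each feature as a per-class normal distribution
nb = GaussianNB()
nb.fit(X_train, y_train)
nb_accuracy = accuracy_score(y_test, nb.predict(X_test))
print("Accuracy:", nb_accuracy)
```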


## 8. Result Dataframe

### 8.1 Accuracy Comparison


In [None]:
finalleaderboard = {
    'Decision Tree': dt_accuracy,
    'Random Forest': rf_accuracy,
    'Support Vector Machine': svm_accuracy,
    'K-Nearest Neighbors': knn_accuracy,
    'Naive Bayes': nb_accuracy,
}

finalleaderboard = pd.DataFrame.from_dict(finalleaderboard, orient='index', columns=['Accuracy'])
finalleaderboard = finalleaderboard.sort_values('Accuracy', ascending=False)
finalleaderboard