# Dataset: Breast Cancer

A breakdown of the data in this dataset:

- Classes (2): The dataset has two classes, one for each tumor diagnosis: malignant (M) and benign (B). In the scikit-learn copy loaded below, the target encodes malignant as 0 and benign as 1.

- Samples per class (212 M, 357 B): There are 212 malignant ('M') and 357 benign ('B') samples, so the classes are moderately imbalanced: roughly 37% malignant and 63% benign.

- Samples total (569): This is the total number of samples or data points in the dataset. It combines the samples of both malignant and benign tumors.

- Dimensionality (30): Each sample has 30 features. They are measurements of cell nuclei computed from a digitized image of a fine needle aspirate of a breast mass: ten characteristics (such as radius, texture, and smoothness), each reported as a mean, a standard error, and a worst value.

- Features (real, positive): This describes the nature of the features in the dataset.
  - Real: The feature values are real numbers (as opposed to integers, categories, or binary values).
  - Positive: The features are physical measurements such as size, area, and smoothness, so they cannot be negative. A few values are exactly zero (see the `0.000e+00` in the printed array below), so "non-negative" is the more precise description.
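
The summary above can be checked directly against the data. The following is a minimal sketch, assuming scikit-learn's bundled copy of the dataset; the variable name `class_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# Label order: index 0 is the first name, index 1 the second.
print(list(data.target_names))

# Number of samples per class label (0 and 1).
class_counts = np.bincount(y)
print(dict(zip(data.target_names, class_counts)))

# (samples, features)
print(X.shape)

# Data type and smallest value across all features.
print(X.dtype, X.min())
```

If the smallest value printed is zero rather than a positive number, that supports reading "positive" as "non-negative".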

In [8]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

data = load_breast_cancer()
X = data.data # features
y = data.target # labels

print(X)
print(y)

[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 

In [9]:
# Split the data into a training set and a test set. The training set is used
# to fit the model; the test set is held out to evaluate its performance.
# test_size=0.2 holds out 20% of the samples; random_state=42 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
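
Because the classes are imbalanced, a plain random split can leave the training and test sets with slightly different class proportions. As a variant that is not part of the original notebook, the sketch below passes `stratify=y` so both sets keep roughly the same benign fraction. Using it would change the exact numbers reported later, so treat it as an optional alternative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y is an addition for illustration; the original split omits it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Sizes of the two sets.
print(X_train.shape, X_test.shape)

# Fraction of benign samples (label 1) overall, in train, and in test.
print(y.mean(), y_train.mean(), y_test.mean())
```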

In [10]:
# Here, we'll use Logistic Regression, a good starting point for binary classification tasks.
# max_iter is raised well above the default of 100 so the solver can converge on the unscaled features.
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
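
The large `max_iter` is needed mainly because the features sit on very different scales. One common alternative, shown here as a sketch and not as the notebook's method, is to standardize the features inside a pipeline so the default solver settings are usually enough. The name `scaled_model` is hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fit on the training data only, then applied to the test data,
# which avoids leaking test-set statistics into training.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_train, y_train)

# Mean accuracy on the held-out test set.
scaled_accuracy = scaled_model.score(X_test, y_test)
print(scaled_accuracy)
```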

In [11]:
# Use the trained model to make predictions on the test set.
y_pred = model.predict(X_test)
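
`predict` returns hard labels, but the model also exposes class probabilities through `predict_proba`. The sketch below, which assumes the same split and model as above, shows that the hard labels correspond to thresholding the benign probability at 0.5. It then applies a stricter, purely illustrative threshold (`strict_threshold` is a hypothetical value, not from the original) that labels a sample benign only when the model is more confident.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# One row per sample, one column per class, in the order of model.classes_.
proba = model.predict_proba(X_test)
p_benign = proba[:, 1]

# Thresholding P(benign) at 0.5 reproduces the hard predictions.
y_pred_from_proba = (p_benign > 0.5).astype(int)
print(np.array_equal(y_pred, y_pred_from_proba))

# Illustrative stricter rule: call a tumor benign only at >= 90% probability.
strict_threshold = 0.9  # hypothetical value chosen for the example
y_pred_strict = (p_benign >= strict_threshold).astype(int)
print(y_pred.sum(), y_pred_strict.sum())
```

Raising the threshold trades some benign recall for fewer malignant tumors labeled benign, which may be the preferable error in a screening setting.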

In [12]:
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{conf_matrix}")

# Classification Report
class_report = classification_report(y_test, y_pred)
print(f"Classification Report:\n{class_report}")


Accuracy: 0.956140350877193
Confusion Matrix:
[[39  4]
 [ 1 70]]
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.91      0.94        43
           1       0.95      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114
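
The figures in the classification report follow directly from the confusion matrix, whose rows are true classes and whose columns are predicted classes. With label 0 as malignant, the top row says that 39 malignant tumors were identified correctly and 4 were predicted benign. As a sketch, the matrix printed above can be plugged into the standard formulas to recover the reported numbers:

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = true class, columns = predicted class.
conf_matrix = np.array([[39, 4],
                        [1, 70]])

# Correct predictions lie on the diagonal.
correct = np.diag(conf_matrix)

# Accuracy: correct predictions over all predictions.
accuracy = correct.sum() / conf_matrix.sum()

# Recall per class: correct predictions over the true count (row sums).
recall = correct / conf_matrix.sum(axis=1)

# Precision per class: correct predictions over the predicted count (column sums).
precision = correct / conf_matrix.sum(axis=0)

# F1 per class: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")
print(f"recall:    {recall}")
print(f"precision: {precision}")
print(f"f1:        {f1}")
```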

