These are all the modules we need. Only the specific parts of the scikit-learn library that the notebook uses are imported, since the library is very big and importing all of it is inefficient.

In [None]:
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import ConfusionMatrixDisplay

Read the CSV and display its head to get an overview of the dataset.
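
The loading cell itself is not shown in this notebook, so here is a minimal sketch of what it might look like. The file name breast-cancer.csv is an assumption (use whatever name the Kaggle download has locally), and the feature column names and values in the in-memory sample are made up so the sketch runs without the real file.

```python
import io

import pandas as pd

# Assumed file name; point this at the downloaded Kaggle CSV.
CSV_PATH = "breast-cancer.csv"


def load_dataset(source):
    """Read the CSV from a path or file-like object into a DataFrame."""
    return pd.read_csv(source)


# In the notebook this would be: df = load_dataset(CSV_PATH)
# Illustrative stand-in with made-up rows so the sketch runs anywhere.
sample_csv = io.StringIO(
    "id,diagnosis,radius_mean,texture_mean\n"
    "1,M,18.0,10.4\n"
    "2,B,13.5,14.4\n"
    "3,B,12.4,15.7\n"
)
df_sample = load_dataset(sample_csv)
print(df_sample.head())   # first rows
print(df_sample.shape)    # (rows, columns)
```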

Simple data preparation: check the column types, then look for nulls and duplicate rows.

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.duplicated().sum()
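
The cells above only report nulls and duplicates. If either count were non-zero, one possible way to handle them is sketched below on a toy frame with made-up values. Dropping the affected rows is an assumption here, not necessarily what this dataset needs.

```python
import pandas as pd

# Toy frame (made-up values) with one missing value and one duplicated row.
toy = pd.DataFrame(
    {
        "id": [1, 2, 2, 3],
        "diagnosis": ["M", "B", "B", "B"],
        "radius_mean": [18.0, 13.5, 13.5, None],
    }
)

print(toy.isna().sum())        # missing values per column
print(toy.duplicated().sum())  # number of fully duplicated rows

# Drop exact duplicate rows, then rows that still contain a missing value.
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)
print(cleaned)
```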

This is the distribution of the diagnosis. As you can see, the dataset is somewhat imbalanced.

In [None]:
sns.countplot(data=df, x='diagnosis')
plt.title('Diagnosis Distribution')
plt.xlabel('Diagnosis (M = Malignant, B = Benign)')
plt.ylabel('Count')
plt.show()
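
To put a number on the imbalance, the class shares can be computed directly. This is a minimal sketch with made-up labels; in the notebook the labels would come from df['diagnosis'].

```python
from collections import Counter

# Made-up labels for illustration; in the notebook use df['diagnosis'].
labels = ["B", "B", "M", "B", "M", "B", "B", "M", "B", "B"]

counts = Counter(labels)
total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}

for label in sorted(shares):
    print(f"{label}: {counts[label]} samples ({shares[label]:.0%})")
```

With pandas, df['diagnosis'].value_counts(normalize=True) gives the same shares in one call.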

# Model Building

The features (X) and the target (y) for the model are set here. The id column is dropped because it is only an identifier.

In [None]:
X = df.drop(['id', 'diagnosis'], axis=1)
y = df['diagnosis']


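Here the diagnosis labels are label-encoded into integers with LabelEncoder (this is not one-hot encoding). The classes are sorted alphabetically, so B maps to 0 and M maps to 1.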

In [None]:
le = LabelEncoder()
y_encoded = le.fit_transform(y)
print("\nMapping of original diagnosis to encoded values:")
for original, encoded in zip(le.classes_, le.transform(le.classes_)):
    print(f"'{original}': {encoded}")
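
As a small illustration of what the encoder does, the sketch below runs LabelEncoder on a few made-up labels and maps integer predictions back to the original letters with inverse_transform. Label encoding yields a single integer column, whereas one-hot encoding would yield one 0/1 column per class.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration only.
toy_labels = ["M", "B", "B", "M", "B"]

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(toy_labels)

# classes_ is sorted, so index 0 is 'B' and index 1 is 'M'.
print(list(toy_le.classes_))
print(encoded.tolist())

# Map integer predictions back to the original diagnosis letters.
decoded = toy_le.inverse_transform([0, 1, 1])
print(decoded.tolist())
```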

The dataset is split here into 80% training data and 20% test data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)
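
Because the classes are imbalanced, one option is to pass stratify to train_test_split so that both parts keep the same class ratio. This is a suggestion, not what the split above does. A toy sketch with made-up data:

```python
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 15 of class 0 and 5 of class 1 (made up).
X_toy = [[i] for i in range(20)]
y_toy = [0] * 15 + [1] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both parts keep the 75/25 class ratio of the full data.
print(len(y_tr), sum(y_tr))
print(len(y_te), sum(y_te))
```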

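A simple snippet for the logistic regression model. The features are standardized first, with the scaler fitted on the training set only and then applied to the test set. The resulting accuracy is about 97%.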

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(random_state=42)

model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=le.classes_)

print("\nModel Accuracy:", accuracy)
print("\nClassification Report:\n", report)
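
The scale-then-fit steps can also be bundled into a scikit-learn pipeline, which keeps the scaler fitted on training data only. The sketch below is an alternative arrangement, not this notebook's code. It uses scikit-learn's bundled load_breast_cancer data as a stand-in so that it runs without the CSV. Note that this bundled copy encodes the target as 0 = malignant and 1 = benign, the reverse of the mapping LabelEncoder produces in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data bundled with scikit-learn (target: 0 = malignant, 1 = benign).
X_demo, y_demo = load_breast_cancer(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() scales with training statistics only; predict() reuses them.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe.fit(X_tr, y_tr)

demo_accuracy = accuracy_score(y_te, pipe.predict(X_te))
print(f"Accuracy: {demo_accuracy:.3f}")
```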

To better visualize the results, here is a confusion matrix.

In [None]:
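class_names = le.classes_
# Create the axes first and pass them in; otherwise from_predictions
# opens its own figure and the 8x6 figure stays empty.
fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=class_names, cmap=plt.cm.Blues, ax=ax)
ax.set_title('Confusion Matrix')
plt.show()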
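
To read the plot, it helps to recall what each cell counts. The sketch below tallies the four cells by hand on made-up labels, treating 1 (malignant in this notebook's encoding) as the positive class, and derives the recall for that class.

```python
# Made-up true and predicted labels (0 = benign, 1 = malignant).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 0, 1, 0, 0, 1, 0, 1]

pairs = list(zip(y_true, y_hat))
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign, predicted benign
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, predicted malignant
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malignant, predicted benign
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malignant, predicted malignant

# Rows are true classes, columns are predicted classes (scikit-learn's layout).
matrix = [[tn, fp], [fn, tp]]
recall_malignant = tp / (tp + fn)

print(matrix)
print(f"Recall (malignant): {recall_malignant:.2f}")
```

For a diagnosis task, malignant cases predicted as benign (the bottom-left cell) are usually the costlier error, so recall for the malignant class is worth checking alongside accuracy.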

Notebook author: Hadi Faheem Farooqi

Dataset link: https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset/data