<a href="https://colab.research.google.com/github/SallyAlsfadi/MLmodels/blob/main/ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Breast Cancer Dataset

The dataset contains 30 numerical attributes (features) that describe the tumor characteristics, and the goal is to develop machine learning models to predict the diagnosis.

he task is to classify the tumors as either malignant (M) or benign (B).

 **Dataset Loading**
We will load the dataset using the ucimlrepo library

In [11]:
!pip install ucimlrepo

from ucimlrepo import fetch_ucirepo


breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)


X = breast_cancer_wisconsin_diagnostic.data.features
y = breast_cancer_wisconsin_diagnostic.data.targets

print(breast_cancer_wisconsin_diagnostic.metadata)
print(breast_cancer_wisconsin_diagnostic.variables)


Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
{'uci_id': 17, 'name': 'Breast Cancer Wisconsin (Diagnostic)', 'repository_url': 'https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic', 'data_url': 'https://archive.ics.uci.edu/static/public/17/data.csv', 'abstract': 'Diagnostic Wisconsin Breast Cancer Database.', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 569, 'num_features': 30, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['Diagnosis'], 'index_col': ['ID'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1993, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C5DW2B', 'creators': ['William Wolberg', 'Olvi Mangasarian', 'Nick Street', 'W. Street'], 'intro_paper':

**Features**: 30 numerical attributes such as radius, texture, perimeter, area, etc.

**Target Variable**: The Diagnosis column, where M represents malignant tumors and B represents benign tumors

In [12]:
print(X.head())
print(y.head())

   radius1  texture1  perimeter1   area1  smoothness1  compactness1  \
0    17.99     10.38      122.80  1001.0      0.11840       0.27760   
1    20.57     17.77      132.90  1326.0      0.08474       0.07864   
2    19.69     21.25      130.00  1203.0      0.10960       0.15990   
3    11.42     20.38       77.58   386.1      0.14250       0.28390   
4    20.29     14.34      135.10  1297.0      0.10030       0.13280   

   concavity1  concave_points1  symmetry1  fractal_dimension1  ...  radius3  \
0      0.3001          0.14710     0.2419             0.07871  ...    25.38   
1      0.0869          0.07017     0.1812             0.05667  ...    24.99   
2      0.1974          0.12790     0.2069             0.05999  ...    23.57   
3      0.2414          0.10520     0.2597             0.09744  ...    14.91   
4      0.1980          0.10430     0.1809             0.05883  ...    22.54   

   texture3  perimeter3   area3  smoothness3  compactness3  concavity3  \
0     17.33      184.60 

Features should be numerical, and the target variable should be categorical.

In [14]:
import pandas as pd
columns = list(X.columns) + ['Diagnosis']
df = pd.concat([X, y], axis=1)
df = df[columns]

print(df.head())


   radius1  texture1  perimeter1   area1  smoothness1  compactness1  \
0    17.99     10.38      122.80  1001.0      0.11840       0.27760   
1    20.57     17.77      132.90  1326.0      0.08474       0.07864   
2    19.69     21.25      130.00  1203.0      0.10960       0.15990   
3    11.42     20.38       77.58   386.1      0.14250       0.28390   
4    20.29     14.34      135.10  1297.0      0.10030       0.13280   

   concavity1  concave_points1  symmetry1  fractal_dimension1  ...  texture3  \
0      0.3001          0.14710     0.2419             0.07871  ...     17.33   
1      0.0869          0.07017     0.1812             0.05667  ...     23.41   
2      0.1974          0.12790     0.2069             0.05999  ...     25.53   
3      0.2414          0.10520     0.2597             0.09744  ...     26.50   
4      0.1980          0.10430     0.1809             0.05883  ...     16.67   

   perimeter3   area3  smoothness3  compactness3  concavity3  concave_points3  \
0      184.

Now, the last column is Diagnosis (target variable), and the other 30 columns are the features.

Splitting the Dataset into Training and Testing Sets

In [15]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

y_train_numeric = np.where(y_train == 'M', 1, 0)
y_test_numeric = np.where(y_test == 'M', 1, 0)

X_train_bias = np.c_[np.ones((X_train.shape[0], 1)), X_train]
X_test_bias = np.c_[np.ones((X_test.shape[0], 1)), X_test]

Training set size: 398 samples
Test set size: 171 samples


Setting random_state=42 ensures that the data split is reproducible each time the code is run.

***Logistic Regression*** is used for classification problems, where the target variable is categorical (like "Malignant" or "Benign")

Breast Cancer dataset is a binary classification problem, Logistic Regression is the correct choice for modeling.

In [None]:
def sigmoid(z):

    return 1 / (1 + np.exp(-z))

def sigmoid_clipped(z):
    return np.clip(1 / (1 + np.exp(-z)), 1e-10, 1 - 1e-10)


    #we can clip the output of the sigmoid function to ensure it never reaches exactly 0 or 1. This helps avoid taking the log of zero


we use cross-entropy loss (log loss) as the cost function

 we need to handle situations where the sigmoid function might produce values exactly equal to 0 or 1. This can be done by clipping h values before calculating the log loss.

In [None]:
def compute_cost(X, y, theta):
    m = len(y)
    h = sigmoid(np.dot(X, theta))
    h = np.clip(h, 1e-10, 1 - 1e-10)
    cost = (1/m) * np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))
    return cost


We will use gradient descent to minimize the cost function and optimize the weights (theta)

add L2 regularization to the cost function.

 Regularization adds a penalty term to the cost function to prevent weights from growing too large, which can help stabilize the training process

In [None]:
def gradient_descent(X, y, theta, learning_rate, num_iterations):
    m = len(y)
    cost_history = []
    for _ in range(num_iterations):
        gradients = (1/m) * np.dot(X.T, (sigmoid(np.dot(X, theta)) - y))
        theta -= learning_rate * gradients
        cost_history.append(compute_cost(X, y, theta))
    return theta, cost_history


To train the logistic regression model, we will add a bias column (a column of ones) to the feature matrix X and then apply gradient descent to learn the optimal weights

In [None]:
def train_logistic_regression(X, y, learning_rate=0.01, num_iterations=1000):
    theta = np.zeros(X.shape[1])
    theta_optimal, cost_history = gradient_descent(X, y, theta, learning_rate, num_iterations)
    return theta_optimal, cost_history



In [16]:
print(f"Shape of X_train_bias: {X_train_bias.shape}")
print(f"Shape of y_train_numeric: {y_train_numeric.shape}")

Shape of X_train_bias: (398, 31)
Shape of y_train_numeric: (398, 1)


train the model

In [17]:
y_train_numeric = np.ravel(y_train_numeric)
y_test_numeric = np.ravel(y_test_numeric)

In [None]:
theta_optimal, cost_history = train_logistic_regression(X_train_bias, y_train_numeric, learning_rate=0.001, num_iterations=1000)


  return 1 / (1 + np.exp(-z))


We’ll classify the output as Malignant (M) if the probability is greater than 0.5, and Benign (B) otherwise

In [None]:
def predict(X, theta):
    predictions = sigmoid(np.dot(X, theta))
    return [1 if prob >= 0.5 else 0 for prob in predictions]


In [None]:
predictions = predict(X_test_bias, theta_optimal)
print(predictions)

[1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]


A confusion matrix to check how the model is classifying the Malignant (M) and Benign (B) classes

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test_numeric, predictions)
print(f"Confusion Matrix:\n{cm}")


Confusion Matrix:
[[98 10]
 [ 2 61]]


Benign (0) Malignant (1)
98 (TN) 10 (FP)  2 (FN) 61 (TP)
Since the first row in the confusion matrix corresponds to actual class 0 (Benign), the first value (98) represents True Negatives
TN = 98 (Correctly predicted as Benign)
FP = 10 (Incorrectly predicted as Malignant)
FN = 2 (Incorrectly predicted as Benign)
TP = 61 (Correctly predicted as Malignant)

In [None]:
accuracy = np.mean(predictions == y_test_numeric)
print(f"Accuracy on test set: {accuracy * 100:.2f}%")

Accuracy on test set: 92.98%


***The Decision Tree*** will be used to classify Breast Cancer data into two classes: Malignant (M) and Benign (B). The decision tree will split the dataset based on different features like radius, texture, smoothness. to predict whether a tumor is Malignant or Benign** **bold text**

In [18]:
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report


def entropy(y):
    classes = np.unique(y)
    entropy_value = 0
    for c in classes:
        p = np.sum(y == c) / len(y)
        if p > 0:
            entropy_value -= p * np.log2(p)
    return entropy_value

**Entropy** is a measure of uncertainty in the dataset. A pure node (where all data points belong to the same class) has zero entropy, while a node with an equal distribution of classes has maximum entropy


In [39]:
def split_dataset(X, y, feature_index, threshold):
    left_mask = X[:, feature_index] <= threshold
    right_mask = ~left_mask
    X_left, y_left = X[left_mask], y[left_mask]
    X_right, y_right = X[right_mask], y[right_mask]
    return X_left, y_left, X_right, y_right




**Split Dataset** Function: This function splits the dataset into two subsets based on a specific feature and threshold. It returns the left and right subsets ,,,,
**Best Split Function**: This function iterates through each feature and possible threshold, calculating the information gain (which is the reduction in entropy) for each potential split. The split that results in the highest information gain is chosen as the best.

In [53]:
import numpy as np






def best_split(X, y):
    best_entropy = float('inf')  # Start with a very high entropy to minimize it
    best_feature_index, best_threshold = None, None
    best_left_y, best_right_y = None, None

    # Loop through each feature to find the best split
    for feature_index in range(X.shape[1]):
        thresholds = np.unique(X[:, feature_index])  # Possible split values for this feature

        # Debug: Print the thresholds for the feature
        print(f"Feature {feature_index}, Thresholds: {thresholds}")

        # Check if the feature has more than one unique value
        if len(thresholds) == 1:
            continue  # Skip this feature if it doesn't have variance (only one unique value)

        for threshold in thresholds:
            # Split the data based on the current threshold
            X_left, y_left, X_right, y_right = split_dataset(X, y, feature_index, threshold)

            # Skip if split is invalid (empty set or no variance in the split)
            if len(y_left) == 0 or len(y_right) == 0:
                continue

            # Calculate Entropy for the split
            total_len = len(X)
            left_entropy = entropy(y_left)
            right_entropy = entropy(y_right)
            weighted_entropy = (len(X_left) / total_len) * left_entropy + (len(X_right) / total_len) * right_entropy

            # Information Gain: original entropy - weighted entropy after split
            info_gain = entropy(y) - weighted_entropy

            # If the current split gives us a better information gain, save it
            if info_gain > best_entropy:
                best_entropy = info_gain
                best_feature_index = feature_index
                best_threshold = threshold
                best_left_y = y_left
                best_right_y = y_right

    # Debug: Print the best feature and threshold
    print(f"Best Split: Feature index = {best_feature_index}, Threshold = {best_threshold}")

    # Return best split info
    return best_feature_index, best_threshold, best_left_y, best_right_y





In [54]:
class DecisionTree:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None

    def fit(self, X, y, depth=0):

        print(f"Depth: {depth}, Unique Classes: {len(np.unique(y))}")


        if len(np.unique(y)) == 1:
            print(f"Leaf node: {np.unique(y)[0]}")
            return np.unique(y)[0]


        if self.max_depth and depth >= self.max_depth:
            majority_class = np.argmax(np.bincount(y))
            print(f"Majority class at depth {depth}: {majority_class}")
            return majority_class

        feature_index, threshold, left_y, right_y = best_split(X, y)

        print(f"Best split at depth {depth}: Feature index = {feature_index}, Threshold = {threshold}")


        left_tree = self.fit(X[left_y == left_y], left_y, depth + 1)
        right_tree = self.fit(X[right_y == right_y], right_y, depth + 1)


        return {'feature_index': feature_index, 'threshold': threshold, 'left': left_tree, 'right': right_tree}

    def predict_one(self, x, tree):

        if tree is None:
            return None

        if isinstance(tree, dict):
            feature_index = tree['feature_index']
            threshold = tree['threshold']


            if feature_index is None or threshold is None:
                return None  # or handle the error

            if x[feature_index] <= threshold:
                return self.predict_one(x, tree['left'])
            else:
                return self.predict_one(x, tree['right'])
        else:
            return tree

    def predict(self, X):
        return [self.predict_one(x, self.tree) for x in X]

The **fit function** builds the decision tree recursively. It stops when the dataset is pure (all labels are the same) or when a predefined maximum depth is reached.

In [55]:

dt_model_entropy = DecisionTree(max_depth=10)
dt_model_entropy.tree = dt_model_entropy.fit(X_train.values, y_train_numeric)


Depth: 0, Unique Classes: 2
Feature 0, Thresholds: [ 7.691  8.219  8.571  8.597  8.598  8.618  8.671  8.726  8.734  8.878
  8.888  8.95   9.     9.042  9.268  9.333  9.397  9.405  9.436  9.465
  9.504  9.567  9.668  9.676  9.683  9.731  9.742  9.755  9.787  9.847
  9.876 10.05  10.2   10.26  10.32  10.44  10.49  10.51  10.57  10.6
 10.66  10.71  10.8   10.86  10.88  10.9   10.91  10.94  10.95  10.96
 10.97  11.04  11.06  11.08  11.14  11.16  11.22  11.26  11.27  11.28
 11.3   11.31  11.33  11.34  11.36  11.37  11.41  11.42  11.43  11.45
 11.46  11.47  11.49  11.5   11.51  11.54  11.57  11.6   11.62  11.63
 11.64  11.66  11.68  11.69  11.71  11.74  11.75  11.76  11.8   11.84
 11.85  11.87  11.89  11.93  11.94  11.95  11.99  12.    12.03  12.04
 12.05  12.06  12.07  12.16  12.18  12.19  12.2   12.21  12.22  12.23
 12.25  12.27  12.3   12.31  12.32  12.34  12.36  12.39  12.42  12.43
 12.45  12.46  12.47  12.49  12.54  12.58  12.62  12.63  12.65  12.67
 12.68  12.7   12.72  12.75  12.76  1

In [56]:

dt_predictions_entropy = [1 if prob is not None and prob >= 0.5 else 0 for prob in dt_predictions_entropy]



In [57]:

from sklearn.metrics import confusion_matrix, accuracy_score


accuracy = accuracy_score(y_test_numeric, dt_predictions_entropy)
print(f"Decision Tree (Entropy) Accuracy: {accuracy * 100:.2f}%")


cm = confusion_matrix(y_test_numeric, dt_predictions_entropy)
print(f"Confusion Matrix:\n{cm}")




Decision Tree (Entropy) Accuracy: 63.16%
Confusion Matrix:
[[108   0]
 [ 63   0]]
