<a href="https://colab.research.google.com/github/SallyAlsfadi/MLmodles/blob/main/ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Breast Cancer Dataset

The dataset contains 30 numerical attributes (features) that describe the tumor characteristics, and the goal is to develop machine learning models to predict the diagnosis.

The task is to classify the tumors as either malignant (M) or benign (B).

**Dataset Loading**
We will load the dataset using the ucimlrepo library.

In [None]:
from ucimlrepo import fetch_ucirepo


breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)


X = breast_cancer_wisconsin_diagnostic.data.features
y = breast_cancer_wisconsin_diagnostic.data.targets

print(breast_cancer_wisconsin_diagnostic.metadata)
print(breast_cancer_wisconsin_diagnostic.variables)


{'uci_id': 17, 'name': 'Breast Cancer Wisconsin (Diagnostic)', 'repository_url': 'https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic', 'data_url': 'https://archive.ics.uci.edu/static/public/17/data.csv', 'abstract': 'Diagnostic Wisconsin Breast Cancer Database.', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 569, 'num_features': 30, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['Diagnosis'], 'index_col': ['ID'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1993, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C5DW2B', 'creators': ['William Wolberg', 'Olvi Mangasarian', 'Nick Street', 'W. Street'], 'intro_paper': {'ID': 230, 'type': 'NATIVE', 'title': 'Nuclear feature extraction for breast tumor diagnosis', 'authors': 'W. Street, W. Wolberg, O. Mangasarian', 'venue': 'Electronic imaging', 'year': 1993, 'journal': None, 'DOI': '1

**Features**: 30 numerical attributes such as radius, texture, perimeter, area, etc.

**Target Variable**: The Diagnosis column, where M represents malignant tumors and B represents benign tumors

In [None]:
print(X.head())
print(y.head())

   radius1  texture1  perimeter1   area1  smoothness1  compactness1  \
0    17.99     10.38      122.80  1001.0      0.11840       0.27760   
1    20.57     17.77      132.90  1326.0      0.08474       0.07864   
2    19.69     21.25      130.00  1203.0      0.10960       0.15990   
3    11.42     20.38       77.58   386.1      0.14250       0.28390   
4    20.29     14.34      135.10  1297.0      0.10030       0.13280   

   concavity1  concave_points1  symmetry1  fractal_dimension1  ...  radius3  \
0      0.3001          0.14710     0.2419             0.07871  ...    25.38   
1      0.0869          0.07017     0.1812             0.05667  ...    24.99   
2      0.1974          0.12790     0.2069             0.05999  ...    23.57   
3      0.2414          0.10520     0.2597             0.09744  ...    14.91   
4      0.1980          0.10430     0.1809             0.05883  ...    22.54   

   texture3  perimeter3   area3  smoothness3  compactness3  concavity3  \
0     17.33      184.60 

Features should be numerical, and the target variable should be categorical.

In [None]:

import pandas as pd

# Combine features and target so that Diagnosis is the last column
columns = list(X.columns) + ['Diagnosis']
df = pd.concat([X, y], axis=1)
df = df[columns]

print(df.head())


   radius1  texture1  perimeter1   area1  smoothness1  compactness1  \
0    17.99     10.38      122.80  1001.0      0.11840       0.27760   
1    20.57     17.77      132.90  1326.0      0.08474       0.07864   
2    19.69     21.25      130.00  1203.0      0.10960       0.15990   
3    11.42     20.38       77.58   386.1      0.14250       0.28390   
4    20.29     14.34      135.10  1297.0      0.10030       0.13280   

   concavity1  concave_points1  symmetry1  fractal_dimension1  ...  texture3  \
0      0.3001          0.14710     0.2419             0.07871  ...     17.33   
1      0.0869          0.07017     0.1812             0.05667  ...     23.41   
2      0.1974          0.12790     0.2069             0.05999  ...     25.53   
3      0.2414          0.10520     0.2597             0.09744  ...     26.50   
4      0.1980          0.10430     0.1809             0.05883  ...     16.67   

   perimeter3   area3  smoothness3  compactness3  concavity3  concave_points3  \
0      184.

Now, the last column is Diagnosis (target variable), and the other 30 columns are the features.

Splitting the Dataset into Training and Testing Sets

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

# Encode the labels numerically: malignant (M) -> 1, benign (B) -> 0
y_train_numeric = np.where(y_train == 'M', 1, 0)
y_test_numeric = np.where(y_test == 'M', 1, 0)

# Prepend a column of ones so that theta[0] acts as the bias (intercept) term
X_train_bias = np.c_[np.ones((X_train.shape[0], 1)), X_train]
X_test_bias = np.c_[np.ones((X_test.shape[0], 1)), X_test]

Training set size: 398 samples
Test set size: 171 samples


Setting random_state=42 ensures that the data split is reproducible each time the code is run.

Logistic regression is used for classification problems, where the target variable is categorical (like "Malignant" or "Benign").

Because the Breast Cancer dataset poses a binary classification problem, logistic regression is a suitable choice for modeling it.

In [None]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Clipping the sigmoid output ensures it never reaches exactly 0 or 1,
# which avoids taking the log of zero in the cost function.
def sigmoid_clipped(z):
    return np.clip(1 / (1 + np.exp(-z)), 1e-10, 1 - 1e-10)
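
For inputs of large magnitude, `np.exp(-z)` can overflow and NumPy emits a runtime warning, even though the resulting sigmoid value is still usable. The sketch below shows one common way to avoid the overflow. It is not part of the original notebook, and `stable_sigmoid` is a hypothetical helper name introduced only for illustration.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates non-positive numbers.

    Hypothetical helper (not from the notebook). Scalars are returned
    as arrays of length 1.
    """
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so 1 / (1 + exp(-z)) cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)), where exp(z) < 1.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

print(stable_sigmoid([-1000.0, 0.0, 1000.0]))
```

Both branches compute the same mathematical function; the split only controls which side of zero is passed to `np.exp`.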


We use cross-entropy loss (log loss) as the cost function.

We also need to handle cases where the sigmoid function produces values exactly equal to 0 or 1, because the log of zero is undefined. This is done by clipping the h values before calculating the log loss.

In [None]:
def compute_cost(X, y, theta):
    m = len(y)
    h = sigmoid(np.dot(X, theta))
    h = np.clip(h, 1e-10, 1 - 1e-10)
    cost = (1/m) * np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))
    return cost
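
To see why the clipping step matters, the following small sketch evaluates the log loss on two saturated predictions. The arrays are made-up illustration values, not data from the notebook; `eps` mirrors the `1e-10` bound used in `compute_cost`.

```python
import numpy as np

# Two confidently wrong predictions: h is exactly 0 where y = 1,
# and exactly 1 where y = 0 (illustrative values only).
h = np.array([0.0, 1.0])
y = np.array([1.0, 0.0])

# Without clipping, -log(0) is infinite (warning suppressed for the demo).
with np.errstate(divide="ignore"):
    unclipped_term = -np.log(h[0])

# With clipping, every term of the loss stays finite.
eps = 1e-10
h_clipped = np.clip(h, eps, 1 - eps)
clipped_cost = np.mean(-y * np.log(h_clipped) - (1 - y) * np.log(1 - h_clipped))

print(unclipped_term, clipped_cost)
```

The clipped cost is large but finite, so gradient descent can keep tracking it instead of recording `inf` or `nan` in the cost history.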


We will use gradient descent to minimize the cost function and optimize the weights (theta).

L2 regularization can also be added to the cost function. Regularization adds a penalty term that keeps the weights from growing too large, which can help stabilize the training process. The `gradient_descent` function in the next cell applies the plain update, without a penalty term.

In [None]:
def gradient_descent(X, y, theta, learning_rate, num_iterations):
    m = len(y)
    cost_history = []
    for _ in range(num_iterations):
        gradients = (1/m) * np.dot(X.T, (sigmoid(np.dot(X, theta)) - y))
        theta -= learning_rate * gradients
        cost_history.append(compute_cost(X, y, theta))
    return theta, cost_history
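
The update above contains no penalty term. As a minimal sketch of how L2 regularization could be folded in, the block below adds a hypothetical `lam` parameter (the regularization strength) and assumes the bias weight `theta[0]` is left unpenalized, which is a common convention but an assumption here. The names `compute_cost_l2` and `gradient_descent_l2` and the toy data are not from the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost_l2(X, y, theta, lam):
    """Cross-entropy cost plus an L2 penalty on the non-bias weights."""
    m = len(y)
    h = np.clip(sigmoid(X @ theta), 1e-10, 1 - 1e-10)
    data_cost = (1 / m) * np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return data_cost + penalty

def gradient_descent_l2(X, y, theta, learning_rate, num_iterations, lam):
    m = len(y)
    theta = theta.astype(float).copy()  # avoid mutating the caller's array
    cost_history = []
    for _ in range(num_iterations):
        gradients = (1 / m) * (X.T @ (sigmoid(X @ theta) - y))
        # Gradient of the penalty; theta[0] (bias) is assumed unpenalized.
        gradients[1:] += (lam / m) * theta[1:]
        theta -= learning_rate * gradients
        cost_history.append(compute_cost_l2(X, y, theta, lam))
    return theta, cost_history

# Toy, linearly separable data with a bias column (illustrative only).
X_toy = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])
theta_plain, _ = gradient_descent_l2(X_toy, y_toy, np.zeros(2), 0.1, 500, lam=0.0)
theta_reg, _ = gradient_descent_l2(X_toy, y_toy, np.zeros(2), 0.1, 500, lam=5.0)
print(theta_plain, theta_reg)
```

With `lam=0.0` the function reduces to the unregularized update. On the separable toy data, the penalized run ends with a smaller feature weight than the unpenalized one, which is the stabilizing effect described above.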


To train the logistic regression model, we use the feature matrix with a bias column (a column of ones) prepended, as in X_train_bias, and apply gradient descent to learn the optimal weights.

In [None]:
def train_logistic_regression(X, y, learning_rate=0.01, num_iterations=1000):
    theta = np.zeros(X.shape[1])
    theta_optimal, cost_history = gradient_descent(X, y, theta, learning_rate, num_iterations)
    return theta_optimal, cost_history
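
The role of the bias column can be illustrated on a tiny matrix. The values below are made up for the example; the point is only that prepending a column of ones lets a single dot product include the intercept `theta[0]`.

```python
import numpy as np

X_small = np.array([[2.0, 3.0],
                    [4.0, 5.0]])
# Prepend a column of ones, as done for X_train_bias and X_test_bias.
X_small_bias = np.c_[np.ones((X_small.shape[0], 1)), X_small]

theta = np.array([0.5, 1.0, -1.0])  # theta[0] is the bias, the rest are weights

# With the ones column, one dot product already includes the bias term.
z_with_bias_column = X_small_bias @ theta
z_explicit = X_small @ theta[1:] + theta[0]
print(z_with_bias_column, z_explicit)
```

Both expressions give the same linear scores, which is why `theta` is initialized with `X.shape[1]` entries: one per feature plus one for the bias.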



In [None]:
print(f"Shape of X_train_bias: {X_train_bias.shape}")
print(f"Shape of y_train_numeric: {y_train_numeric.shape}")

Shape of X_train_bias: (398, 31)
Shape of y_train_numeric: (398, 1)


The label array has shape (398, 1), so we first flatten the labels to one dimension and then train the model.

In [None]:
y_train_numeric = np.ravel(y_train_numeric)
y_test_numeric = np.ravel(y_test_numeric)

In [None]:
theta_optimal, cost_history = train_logistic_regression(X_train_bias, y_train_numeric, learning_rate=0.001, num_iterations=1000)


  return 1 / (1 + np.exp(-z))


We'll classify the output as Malignant (M) if the predicted probability is at least 0.5, and Benign (B) otherwise.

In [None]:
def predict(X, theta):
    predictions = sigmoid(np.dot(X, theta))
    return [1 if prob >= 0.5 else 0 for prob in predictions]
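
The same thresholding rule can be written without a Python loop. The sketch below assumes the probabilities have already been computed; `predict_vectorized` and its `threshold` parameter are hypothetical names, not part of the notebook.

```python
import numpy as np

def predict_vectorized(probabilities, threshold=0.5):
    """Hypothetical variant of predict that thresholds probabilities directly."""
    probabilities = np.asarray(probabilities, dtype=float)
    # True/False is converted to 1/0; a probability equal to the threshold maps to 1.
    return (probabilities >= threshold).astype(int)

example_probs = [0.10, 0.50, 0.49, 0.93]
print(predict_vectorized(example_probs))
```

Returning a NumPy array rather than a list keeps later elementwise comparisons against the numeric labels explicit.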


In [None]:
predictions = predict(X_test_bias, theta_optimal)
print(predictions)

[1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]


We use a confusion matrix to check how the model classifies the Malignant (M) and Benign (B) classes.

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test_numeric, predictions)
print(f"Confusion Matrix:\n{cm}")


Confusion Matrix:
[[98 10]
 [ 2 61]]


| | Predicted Benign (0) | Predicted Malignant (1) |
|---|---|---|
| Actual Benign (0) | 98 (TN) | 10 (FP) |
| Actual Malignant (1) | 2 (FN) | 61 (TP) |

Since the first row in the confusion matrix corresponds to actual class 0 (Benign), the first value (98) represents True Negatives.

- TN = 98 (Correctly predicted as Benign)
- FP = 10 (Incorrectly predicted as Malignant)
- FN = 2 (Incorrectly predicted as Benign)
- TP = 61 (Correctly predicted as Malignant)
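
The four counts above are enough to derive the usual summary metrics. The following sketch recomputes them from the reported matrix; the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix reported above: rows are actual classes, columns are predicted classes.
cm = np.array([[98, 10],
               [2, 61]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)    # of the tumors predicted malignant, the share that were malignant
recall = tp / (tp + fn)       # of the malignant tumors, the share that were detected
specificity = tn / (tn + fp)  # of the benign tumors, the share correctly identified

print(f"Accuracy:    {accuracy:.4f}")
print(f"Precision:   {precision:.4f}")
print(f"Recall:      {recall:.4f}")
print(f"Specificity: {specificity:.4f}")
```

Accuracy works out to 159/171, about 92.98%. Recall (61/63, about 96.8%) is higher than precision (61/71, about 85.9%), meaning the model misses few malignant tumors but flags some benign ones as malignant.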

In [None]:
# Convert the prediction list to an array so the comparison is explicitly elementwise
accuracy = np.mean(np.array(predictions) == y_test_numeric)
print(f"Accuracy on test set: {accuracy * 100:.2f}%")

Accuracy on test set: 92.98%
