# Breast Cancer Diagnosis with Logistic Regression

**1. Loading Libraries**

In [1]:
import numpy as np  # For linear algebra
import pandas as pd  # For data processing
import matplotlib.pyplot as plt  # For visualization

**2. Loading the Dataset**

In [4]:
data = pd.read_csv("data.csv")
data.head()


| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

Key columns: `diagnosis` (M = malignant, B = benign) and 30 numerical features.

The `id` and `Unnamed: 32` columns carry no predictive information and are dropped during preprocessing.

## Preprocessing
**3. Dropping Unnecessary Columns and Encoding the Label**

In [6]:
data.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)
data.diagnosis = [1 if each == "M" else 0 for each in data.diagnosis]
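
The label encoding above can also be written with `Series.map`, which states the mapping explicitly and turns any unexpected label into `NaN` instead of silently treating it as benign. A minimal sketch on a hypothetical toy frame (`toy` and `encoded` are illustration-only names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# Explicit mapping: malignant -> 1, benign -> 0.
# Labels missing from the mapping become NaN, so typos are easy to spot.
encoded = toy["diagnosis"].map({"M": 1, "B": 0})

print(encoded.tolist())
print("unmapped labels:", int(encoded.isna().sum()))
```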

**4. Splitting Input and Output**

In [7]:
y = data.diagnosis.values
x_data = data.drop(['diagnosis'], axis=1)

**5. Normalization**

In [8]:
x = (x_data - np.min(x_data)) / (np.max(x_data) - np.min(x_data))
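
Min-max scaling is meant to map each feature to [0, 1] using that feature's own range. Depending on the pandas version, `np.min(x_data)` on a DataFrame may reduce over the whole frame and return a single scalar rather than one minimum per column, so it is worth checking in your environment. Calling the DataFrame methods with an explicit axis makes the per-column intent unambiguous. A minimal sketch on a hypothetical toy frame, contrasting the two behaviours:

```python
import pandas as pd

# Hypothetical toy frame with two features on very different scales.
toy = pd.DataFrame({
    "radius_mean": [10.0, 15.0, 20.0],
    "area_mean": [300.0, 800.0, 1300.0],
})

# Per-column scaling: each feature is mapped to [0, 1] independently.
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
scaled = (toy - col_min) / (col_max - col_min)

# Global scaling (what a scalar min/max would produce): the small-valued
# feature is squashed close to zero by the large-valued one.
values = toy.to_numpy()
global_scaled = (toy - values.min()) / (values.max() - values.min())

print(scaled)
print(global_scaled)
```

With global scaling, features with small raw values end up in a very narrow band, which can slow gradient descent considerably.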

**6. Splitting Data for Training and Testing**

In [9]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15, random_state=42)

# Transpose data for compatibility
x_train, x_test = x_train.T, x_test.T
y_train, y_test = y_train.T, y_test.T
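
The transpose puts features in rows and samples in columns, which is the layout the functions in this notebook expect: with `w` of shape `(n_features, 1)`, `np.dot(w.T, x)` yields one score per sample. A small shape check with placeholder arrays (the sample count of 5 is arbitrary, and the `*_demo` names are illustration-only):

```python
import numpy as np

n_features, n_samples = 30, 5

# Placeholder data laid out as (n_features, n_samples), like x_train after .T
x_demo = np.zeros((n_features, n_samples))
w_demo = np.full((n_features, 1), 0.01)

# (1, n_features) dot (n_features, n_samples) -> (1, n_samples)
z_demo = np.dot(w_demo.T, x_demo) + 0.0
print(z_demo.shape)
```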

## Logistic Regression Implementation

**7. Initializing Weights and Bias**

In [10]:
def initialize_weights_and_bias(dimension):
    w = np.full((dimension, 1), 0.01)
    b = 0.0
    return w, b

**8. Sigmoid Function**

In [11]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
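
For strongly negative inputs, `np.exp(-z)` overflows and NumPy emits a runtime warning, even though the result still rounds to 0. A common workaround is to evaluate the sigmoid piecewise so that the exponent is never positive. The sketch below is one such variant (`stable_sigmoid` is a hypothetical name, not part of the notebook); SciPy's `scipy.special.expit` provides the same function if SciPy is available.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid evaluated piecewise so np.exp never receives a large positive argument."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)

    pos = z >= 0
    # For z >= 0: 1 / (1 + exp(-z)), where exp(-z) <= 1
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0: exp(z) / (1 + exp(z)), where exp(z) < 1
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)
```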

**9. Forward and Backward Propagation**

In [12]:
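def forward_backward_propagation(w, b, x_train, y_train):
    # Forward pass: x_train is (n_features, n_samples), so z is (1, n_samples)
    z = np.dot(w.T, x_train) + b
    y_head = sigmoid(z)
    # Binary cross-entropy loss per sample, averaged into a single cost
    loss = -y_train * np.log(y_head) - (1 - y_train) * np.log(1 - y_head)
    cost = np.sum(loss) / x_train.shape[1]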

    # Gradients
    derivative_weight = np.dot(x_train, (y_head - y_train).T) / x_train.shape[1]
    derivative_bias = np.sum(y_head - y_train) / x_train.shape[1]

    gradients = {"derivative_weight": derivative_weight, "derivative_bias": derivative_bias}
    return cost, gradients
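
The gradient expressions above come from differentiating the mean cross-entropy cost: the weight gradient is `X (y_head - y)^T / m` and the bias gradient is `sum(y_head - y) / m`, where `m` is the number of samples. One way to gain confidence in them is a finite-difference check on small random data. The sketch below re-implements the cost and gradients in a compact helper (`cost_and_grads` is a hypothetical name) and uses toy inputs, not the real dataset:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_and_grads(w, b, x, y):
    """Mean cross-entropy cost and its analytic gradients."""
    m = x.shape[1]
    y_head = sigmoid(np.dot(w.T, x) + b)
    cost = np.sum(-y * np.log(y_head) - (1 - y) * np.log(1 - y_head)) / m
    dw = np.dot(x, (y_head - y).T) / m
    db = np.sum(y_head - y) / m
    return cost, dw, db

# Toy problem: 3 features, 8 samples, random binary labels.
rng = np.random.default_rng(0)
x = rng.random((3, 8))
y = rng.integers(0, 2, size=(1, 8)).astype(float)
w = np.full((3, 1), 0.01)
b = 0.0

_, dw, db = cost_and_grads(w, b, x, y)

# Central differences: perturb one parameter at a time.
eps = 1e-6
numeric_dw = np.zeros_like(w)
for j in range(w.shape[0]):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j, 0] += eps
    w_minus[j, 0] -= eps
    cost_plus = cost_and_grads(w_plus, b, x, y)[0]
    cost_minus = cost_and_grads(w_minus, b, x, y)[0]
    numeric_dw[j, 0] = (cost_plus - cost_minus) / (2 * eps)

numeric_db = (cost_and_grads(w, b + eps, x, y)[0]
              - cost_and_grads(w, b - eps, x, y)[0]) / (2 * eps)

print("max weight-gradient error:", np.max(np.abs(dw - numeric_dw)))
print("bias-gradient error:", abs(db - numeric_db))
```

Both errors should be tiny compared with the gradients themselves; a large discrepancy would point to a mistake in the analytic formulas.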

**10. Updating Parameters**

In [13]:
def update(w, b, x_train, y_train, learning_rate, num_iterations):
    cost_list = []

    for i in range(num_iterations):
        cost, gradients = forward_backward_propagation(w, b, x_train, y_train)
        w -= learning_rate * gradients["derivative_weight"]
        b -= learning_rate * gradients["derivative_bias"]

        if i % 10 == 0:
            cost_list.append(cost)
            print(f"Cost after iteration {i}: {cost}")

    parameters = {"weight": w, "bias": b}
    return parameters, cost_list

**11. Prediction Function**

In [14]:
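def predict(w, b, x):
    # Probability of the positive class (malignant) for each sample
    y_head = sigmoid(np.dot(w.T, x) + b)
    # Threshold at 0.5: probabilities above it are labeled 1, the rest stay 0
    Y_prediction = np.zeros((1, x.shape[1]))
    Y_prediction[y_head > 0.5] = 1
    return Y_prediction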

**12. Training Logistic Regression**

In [15]:
def logistic_regression(x_train, y_train, x_test, y_test, learning_rate, num_iterations):
    dimension = x_train.shape[0]
    w, b = initialize_weights_and_bias(dimension)

    parameters, _ = update(w, b, x_train, y_train, learning_rate, num_iterations)

    y_prediction_train = predict(parameters["weight"], parameters["bias"], x_train)
    y_prediction_test = predict(parameters["weight"], parameters["bias"], x_test)

    print(f"Train accuracy: {100 - np.mean(np.abs(y_prediction_train - y_train)) * 100}%")
    print(f"Test accuracy: {100 - np.mean(np.abs(y_prediction_test - y_test)) * 100}%")

logistic_regression(x_train, y_train, x_test, y_test, learning_rate=1, num_iterations=100)

Cost after iteration 0: 0.6928602487831985
Cost after iteration 10: 0.6386915336255256
Cost after iteration 20: 0.6138714743826124
Cost after iteration 30: 0.5917763428204554
Cost after iteration 40: 0.5720243853319734
Cost after iteration 50: 0.5543143594255145
Cost after iteration 60: 0.5383746690036969
Cost after iteration 70: 0.5239681021838539
Cost after iteration 80: 0.5108912623752068
Cost after iteration 90: 0.4989713675892603
Train accuracy: 80.74534161490683%
Test accuracy: 81.3953488372093%
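
The logged cost is still falling at iteration 90 (about 0.499), which suggests that 100 iterations may not be enough for gradient descent to converge with these settings. The sketch below illustrates the effect on synthetic data, not the breast cancer dataset; `train`, `x_toy`, and `y_toy` are hypothetical names introduced for this illustration, and the learning rate of 0.5 is an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(x, y, learning_rate, num_iterations):
    """Plain gradient descent; returns the cost recorded at every iteration."""
    m = x.shape[1]
    w = np.full((x.shape[0], 1), 0.01)
    b = 0.0
    costs = []
    for _ in range(num_iterations):
        y_head = sigmoid(np.dot(w.T, x) + b)
        costs.append(np.sum(-y * np.log(y_head) - (1 - y) * np.log(1 - y_head)) / m)
        w -= learning_rate * np.dot(x, (y_head - y).T) / m
        b -= learning_rate * np.sum(y_head - y) / m
    return costs

# Synthetic data: 3 features in [0, 1], label depends on the first two.
rng = np.random.default_rng(42)
x_toy = rng.random((3, 200))
y_toy = (x_toy[0] + x_toy[1] > 1.0).astype(float).reshape(1, -1)

costs = train(x_toy, y_toy, learning_rate=0.5, num_iterations=1000)
print(f"cost at start:              {costs[0]:.4f}")
print(f"cost after 100 iterations:  {costs[99]:.4f}")
print(f"cost after 1000 iterations: {costs[-1]:.4f}")
```

Returning the full cost history also makes it easy to plot the curve with `matplotlib` and judge visually whether training has flattened out.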


## Verifying with Scikit-Learn
**13. Using Sklearn's Logistic Regression**

In [16]:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

imputer = SimpleImputer(strategy='mean')
x_train = imputer.fit_transform(x_train.T).T
x_test = imputer.transform(x_test.T).T

logreg = LogisticRegression(random_state=42, max_iter=150)
logreg.fit(x_train.T, y_train.T)

print(f"Train accuracy: {logreg.score(x_train.T, y_train.T)}")
print(f"Test accuracy: {logreg.score(x_test.T, y_test.T)}")

Train accuracy: 0.8633540372670807
Test accuracy: 0.8953488372093024
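
As a cross-check that does not depend on `data.csv`, scikit-learn ships a copy of the Wisconsin diagnostic breast cancer data as `load_breast_cancer`. The sketch below assumes that copy is equivalent to the CSV used here, and wraps scaling and the classifier in a pipeline so the scaler is fitted on the training split only. Note that scikit-learn's copy encodes the target as 0 = malignant and 1 = benign, the reverse of the encoding used in this notebook. The `stratify` argument and `max_iter=1000` are choices made for this sketch, not settings from the original run.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Samples in rows, features in columns (scikit-learn's convention).
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)

# The pipeline fits MinMaxScaler on the training split, then the classifier.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Train accuracy: {train_acc * 100:.2f}%")
print(f"Test accuracy: {test_acc * 100:.2f}%")
```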


## Results

Custom Logistic Regression:

Train accuracy: 80.75%
Test accuracy: 81.40%

Scikit-Learn Logistic Regression:

Train accuracy: 86.34%
Test accuracy: 89.53%

## Recommendations for Improving the Model
1. Feature Engineering: Making Your Inputs Smarter

What does it mean?
The dataset has many features (such as `radius_mean` or `texture_mean`), but not all of them are equally useful for predicting whether a tumor is malignant or benign. Feature engineering means identifying the most informative features or creating new ones.

How to improve it?

Analyze feature importance: Check which features actually affect the prediction. For example, does `radius_mean` matter more than `symmetry_worst`? Focus on the ones that matter.

Dimensionality reduction: Imagine the dataset as a crowded classroom with many students (features). PCA (Principal Component Analysis) "summarizes" the students into fewer groups (principal components) so the room is easier to manage.

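
Both ideas can be tried in a few lines with scikit-learn. The sketch below uses scikit-learn's bundled `load_breast_cancer` data (assumed here to match `data.csv`), ranks features by the magnitude of logistic regression coefficients on standardized inputs as one simple proxy for importance, and then projects the 30 features onto 5 principal components. The choice of 5 components and the names `bunch`, `order`, and `X_reduced` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

bunch = load_breast_cancer()
# Standardize so coefficient magnitudes are comparable across features.
X = StandardScaler().fit_transform(bunch.data)
y = bunch.target

# Feature importance proxy: absolute coefficient size, largest first.
clf = LogisticRegression(max_iter=1000).fit(X, y)
order = np.argsort(-np.abs(clf.coef_[0]))
for idx in order[:5]:
    print(f"{bunch.feature_names[idx]}: {clf.coef_[0][idx]:+.3f}")

# Dimensionality reduction: keep the 5 directions with the most variance.
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("reduced shape:", X_reduced.shape)
print("cumulative explained variance:", cumulative)
```

Coefficient magnitude is only a rough guide when features are strongly correlated, as several of the size-related measurements here are likely to be, so treat the ranking as a starting point rather than a verdict.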
2. Hyperparameter Tuning: Fine-Tuning the Recipe
What does it mean?
The model’s performance depends on settings like the learning rate (how quickly it learns) and the number of iterations