# Breast Cancer Prediction Project
This project uses logistic regression, random forest, and k-nearest neighbors (KNN) to predict whether a breast tumor is malignant or benign from diagnostic measurements.

In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

## 1. Load Breast Cancer Dataset

We use `load_breast_cancer()` from `sklearn.datasets`.  
This dataset contains 30 numeric features computed from tumor measurements and a binary `target` column, where 0 means malignant and 1 means benign.

In [2]:
data = load_breast_cancer()

In [3]:
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |
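
Every metric later in this notebook is reported per label (0 or 1), so it helps to confirm what each label means before interpreting them. A minimal check, using only attributes of the object returned by `load_breast_cancer()`; the variable names `label_names` and `class_counts` are introduced here for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names is ordered by label: index 0 -> label 0, index 1 -> label 1
label_names = {label: str(name) for label, name in enumerate(data.target_names)}
print(label_names)

# Number of samples carrying each label
class_counts = np.bincount(data.target)
print({str(name): int(count) for name, count in zip(data.target_names, class_counts)})
```

Malignant tumors are the smaller class, which is why class imbalance comes up in the sections that follow.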


## 2. Select Features and Target Variable

We define `X` as the input features by dropping the `'target'` column from the DataFrame.  
The target variable `y` is set to the `'target'` column itself.

In [4]:
X = df.drop('target', axis=1)
y = df['target']

## 3. Train/Test Split and Model Training

We split the dataset into training and testing sets using an 80/20 ratio with a fixed random state for reproducibility.  
Then we trained a Logistic Regression model with `class_weight='balanced'`, which weights each class inversely to its frequency so that the minority class (malignant tumors) is not underweighted during training.

With this setting, test accuracy rose from **95% to 97%** compared with the unweighted model.


In [23]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(class_weight='balanced', max_iter=10000)
model.fit(x_train, y_train)


LogisticRegression(class_weight='balanced', max_iter=10000)
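
To make the effect of `class_weight='balanced'` concrete, the sketch below reproduces the weights by hand and compares them with scikit-learn's `compute_class_weight` helper. It reloads the data as arrays and repeats the same 80/20 split so that it runs on its own; `manual_weights` and `sklearn_weights` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

classes = np.unique(y_train)
# 'balanced' weight for a class = n_samples / (n_classes * samples_in_that_class)
manual_weights = len(y_train) / (len(classes) * np.bincount(y_train))
sklearn_weights = compute_class_weight("balanced", classes=classes, y=y_train)

print(dict(zip(classes.tolist(), manual_weights.round(3).tolist())))
```

The rarer class receives the larger weight, so errors on it cost more during fitting.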


## 4. Model Evaluation – Accuracy

We use the test set to evaluate the model’s performance by calculating the accuracy score.  
This tells us how many predictions were correct out of the total test samples.

In [24]:
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.9736842105263158

## 5. Confusion Matrix

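The confusion matrix shows how many samples of each class were classified correctly, which is more informative than accuracy alone when the classes are imbalanced.  
Rows are true labels and columns are predicted labels, with label 0 (malignant) first.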

In [25]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[41,  2],
       [ 1, 70]])
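
As a worked example, the accuracy, precision, and recall figures can be recomputed directly from the matrix reported above. This is a small arithmetic sketch; it hard-codes the reported counts rather than refitting the model.

```python
import numpy as np

# Confusion matrix reported above: rows = true label, columns = predicted label
cm = np.array([[41, 2],
               [1, 70]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per true label (row-wise)
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted label (column-wise)

print(f"accuracy: {accuracy:.4f}")
for label in range(2):
    print(f"label {label}: precision={precision[label]:.2f}, recall={recall[label]:.2f}")
```

Reading the first row, 41 of the 43 samples with true label 0 were predicted correctly and 2 were predicted as label 1.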

## 6. Classification Report

We use the classification report to evaluate the model's precision, recall, F1-score, and support for each class.  
This gives a more detailed picture of the model’s performance than accuracy alone.


In [26]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
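
The report labels its rows with bare 0 and 1. One way to make it easier to read, and to pull out recall for a specific class, is sketched below. It rebuilds label vectors that reproduce the reported confusion matrix `[[41, 2], [1, 70]]` instead of refitting the model, so `y_true` and `y_hat` are illustrative stand-ins for `y_test` and `y_pred`.

```python
from sklearn.metrics import classification_report, recall_score

# Label vectors consistent with the reported matrix [[41, 2], [1, 70]]
y_true = [0] * 43 + [1] * 71
y_hat = [0] * 41 + [1] * 2 + [0] * 1 + [1] * 70

# pos_label=0 treats label 0 (malignant in this dataset) as the positive class
malignant_recall = recall_score(y_true, y_hat, pos_label=0)
print(f"malignant recall: {malignant_recall:.3f}")

# target_names replaces the 0/1 row labels with readable class names
print(classification_report(y_true, y_hat, target_names=["malignant", "benign"]))
```

In the notebook itself, the same two calls would take `y_test` and `y_pred` in place of the rebuilt vectors.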




## 7. Random Forest Model

We trained a Random Forest classifier as an alternative model to compare its performance with Logistic Regression.  
By evaluating accuracy, confusion matrix, and classification report, we assess whether Random Forest provides better results in terms of precision, recall, and overall balance.


In [27]:
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(x_train, y_train)

yrf_pred = rf_model.predict(x_test)

print("Random Forest Accuracy:", accuracy_score(y_test, yrf_pred))
print('Random forest confusion matrix',confusion_matrix(y_test,yrf_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, yrf_pred))

Random Forest Accuracy: 0.9649122807017544
Random forest confusion matrix [[40  3]
 [ 1 70]]
Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114



## 8. K-Nearest Neighbors (KNN) Model

We trained a K-Nearest Neighbors (KNN) model as a third approach to compare its performance with Logistic Regression and Random Forest.

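We selected `k=5` because it gave **100% recall for class 1** on the test split. Note that in scikit-learn's encoding class 1 is benign and class 0 is malignant, so this result means no benign tumor was flagged as malignant. Recall for malignant tumors is 0.88 (38 of 43), which means 5 malignant cases were predicted as benign.  
Overall accuracy was also slightly lower than with `k=6`. If the goal is to minimize the risk of missing a cancer diagnosis, the metric to prioritize when choosing `k` is recall for class 0.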


In [28]:
knn_model = KNeighborsClassifier(n_neighbors=5)  # You can try different k values
knn_model.fit(x_train, y_train)

knn_pred = knn_model.predict(x_test)
print("KNN Accuracy:", accuracy_score(y_test, knn_pred))
print('KNN confusion matrix',confusion_matrix(y_test,knn_pred))
print("KNN Classification Report:\n", classification_report(y_test, knn_pred))

KNN Accuracy: 0.956140350877193
KNN confusion matrix [[38  5]
 [ 0 71]]
KNN Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.88      0.94        43
           1       0.93      1.00      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.94      0.95       114
weighted avg       0.96      0.96      0.96       114
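
One possible way to choose `k` more systematically is to compare candidate values by cross-validated recall for the malignant class rather than by a single test split. The sketch below is an illustration under several assumptions that are not part of the original notebook: the candidate grid `(3, 5, 7, 9)`, the `StandardScaler` step (KNN is distance-based, so features on large scales such as area would otherwise dominate), and the custom scorer `malignant_recall`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Score recall for label 0 (malignant) instead of overall accuracy
malignant_recall = make_scorer(recall_score, pos_label=0)

results = {}
for k in (3, 5, 7, 9):  # hypothetical candidate values
    # Scaling inside the pipeline is refit on each training fold
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipe, X, y, cv=5, scoring=malignant_recall)
    results[k] = scores.mean()

for k, mean_recall in results.items():
    print(f"k={k}: mean malignant recall = {mean_recall:.3f}")
```

Because the scores come from five folds rather than one split, small differences between neighboring values of `k` are less likely to be noise.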



## 9. Handling Class Imbalance and Cross-Validation

We experimented with `class_weight='balanced'` in the Logistic Regression model to address the imbalance between malignant and benign cases.

Recall for class 1 (benign) stayed the same, with 70 of 71 samples correctly identified. Class weighting reduced the number of class 0 (malignant) samples misclassified as benign from 4 to 2, which raises malignant recall from 39/43 to 41/43.

To further evaluate the model’s generalization performance, we applied **5-fold cross-validation** to the full dataset. The features are passed unscaled here: `StandardScaler` is imported but not applied in this notebook.  
Cross-validation helps check that the model's accuracy is consistent across different data splits, not just the original train-test split.

While the cross-validation accuracy dropped slightly from ~0.951 to ~0.949 with class weighting, we decided to keep it because it misses fewer malignant cases on the test split.

In [29]:
scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV
print("Cross-Validation Scores:", scores)
print("Average Accuracy:", scores.mean())


Cross-Validation Scores: [0.93859649 0.94736842 0.98245614 0.92982456 0.94690265]
Average Accuracy: 0.9490296537804689
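
The call above passes the raw feature matrix `X`, so no scaling takes place. If scaling is wanted, one option is to wrap `StandardScaler` and the model in a pipeline, so that the scaler is fit only on the training folds of each split. The following is a sketch under that assumption; the shuffled `StratifiedKFold` with `random_state=42` and the names `scaled_model` and `scaled_scores` are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaler and classifier are refit together on each training fold
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=10000),
)

# Stratified folds keep the malignant/benign ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scaled_scores = cross_val_score(scaled_model, X, y, cv=cv)

print("Cross-Validation Scores:", scaled_scores)
print("Average Accuracy:", scaled_scores.mean())
```

Fitting the scaler on the whole dataset before cross-validation would let information from the validation folds leak into training, which the pipeline avoids.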


## 10. KNN Cross-Validation

We applied 5-fold cross-validation to the KNN model to evaluate its generalization performance.   
This helps ensure the model's accuracy is consistent across different data splits, not just the original train-test split.

By comparing the average accuracy with other models, we can better assess how well KNN performs overall.

In [19]:
scores = cross_val_score(knn_model, X, y, cv=5)
print("Cross-Validation Scores:", scores)
print("Average Accuracy:", scores.mean())


Cross-Validation Scores: [0.88596491 0.93859649 0.93859649 0.94736842 0.92920354]
Average Accuracy: 0.9279459711224964


## 11. Random Forest Cross-Validation

We performed 5-fold cross-validation on the Random Forest model to evaluate its stability and generalization performance.

Cross-validation helps confirm whether the model's accuracy is reliable across different subsets of the data, providing a more robust comparison against other models.


In [20]:
scores = cross_val_score(rf_model, X, y, cv=5)
print("Cross-Validation Scores:", scores)
print("Average Accuracy:", scores.mean())


Cross-Validation Scores: [0.92105263 0.93859649 0.98245614 0.96491228 0.97345133]
Average Accuracy: 0.9560937742586555
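
To compare the three models side by side, the reported figures can be collected in one place. The sketch below only copies numbers from the outputs in this notebook and derives malignant recall (label 0) from each test-split confusion matrix; nothing is refit, and the dictionary names are introduced here for illustration.

```python
# Mean 5-fold accuracy, copied from the cross-validation outputs above
cv_mean_accuracy = {
    "Logistic Regression": 0.9490296537804689,
    "KNN (k=5)": 0.9279459711224964,
    "Random Forest": 0.9560937742586555,
}

# Test-split confusion matrices: rows = true label, columns = predicted label
confusion = {
    "Logistic Regression": [[41, 2], [1, 70]],
    "KNN (k=5)": [[38, 5], [0, 71]],
    "Random Forest": [[40, 3], [1, 70]],
}

# Label 0 is malignant, so malignant recall is the first row's diagonal share
malignant_recall = {name: cm[0][0] / sum(cm[0]) for name, cm in confusion.items()}

for name in confusion:
    print(f"{name:<20} CV accuracy={cv_mean_accuracy[name]:.3f}  "
          f"malignant recall={malignant_recall[name]:.3f}")

best_cv = max(cv_mean_accuracy, key=cv_mean_accuracy.get)
best_recall = max(malignant_recall, key=malignant_recall.get)
print("Highest CV accuracy:", best_cv)
print("Highest malignant recall on the test split:", best_recall)
```

The two criteria favor different models here, so the choice depends on whether average accuracy or missed malignant cases matters more for the application.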
