Problem Statement:
Predict whether a tumor is malignant or benign based on cell features.
Features & Target Variable:
- Features: Mean Radius, Mean Texture, Mean Perimeter, Mean Area, Mean Smoothness, etc.
- Target: 0 → Malignant (Cancerous), 1 → Benign (Non-Cancerous)
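
The label encoding is easy to get backwards, so it can be worth confirming it from the dataset object itself. A minimal check, assuming scikit-learn's bundled copy of the dataset (the same `load_breast_cancer()` call this notebook uses); the `label_names` variable is introduced here only for illustration:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the class name that corresponds to label i
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
print(label_names)
```

Reading the mapping from `target_names` avoids relying on memory when interpreting the classification report later on.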

In [None]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add the target variable
df['Target'] = data.target

print("Dataset Loaded Successfully! ✅")


Dataset Loaded Successfully! ✅


In [None]:
print(df.head())  # Show the first 5 rows


   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0             

Check for Missing Values & Basic Info

In [None]:
df.info()  # Overview of columns, non-null counts, and dtypes (info() prints directly and returns None)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [None]:
print(df.describe())  # Min, max, mean, std deviation, etc.


       mean radius  mean texture  mean perimeter    mean area  \
count   569.000000    569.000000      569.000000   569.000000   
mean     14.127292     19.289649       91.969033   654.889104   
std       3.524049      4.301036       24.298981   351.914129   
min       6.981000      9.710000       43.790000   143.500000   
25%      11.700000     16.170000       75.170000   420.300000   
50%      13.370000     18.840000       86.240000   551.100000   
75%      15.780000     21.800000      104.100000   782.700000   
max      28.110000     39.280000      188.500000  2501.000000   

       mean smoothness  mean compactness  mean concavity  mean concave points  \
count       569.000000        569.000000      569.000000           569.000000   
mean          0.096360          0.104341        0.088799             0.048919   
std           0.014064          0.052813        0.079720             0.038803   
min           0.052630          0.019380        0.000000             0.000000   
25%      

Missing Values per Column

In [None]:
print(df.isnull().sum())  # Count missing values in each column


mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
Target                     0
dtype: int64


In [None]:
# Fill missing values with each column's median (a no-op here: no column has missing values)
df.fillna(df.median(), inplace=True)
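
Since this dataset has no gaps, the median fill above changes nothing. To see what the same call would do on data that does have gaps, here is a small sketch on a toy frame; the column names and values are hypothetical and not taken from the breast cancer data:

```python
import pandas as pd

# Toy frame with gaps (hypothetical values, not from the breast cancer data)
toy = pd.DataFrame({
    "a": [1.0, None, 3.0, 10.0],
    "b": [None, 2.0, 4.0, 6.0],
})

# toy.median() returns one median per column; fillna aligns them by column name
filled = toy.fillna(toy.median())
print(filled)
```

The median is often preferred over the mean for this purpose because a single extreme value (like the 10.0 in column `a`) shifts it far less.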



In [None]:
print(df['Target'].value_counts())  # Count benign (1) and malignant (0) cases


Target
1    357
0    212
Name: count, dtype: int64
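
These counts also give a useful reference point for judging accuracy later. The sketch below uses only the two numbers printed above to compute the class shares and the accuracy of a trivial classifier that always predicts the most common label; the variable names are illustrative:

```python
# Class counts copied from the value_counts() output above
counts = {1: 357, 0: 212}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the most common label
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"Total samples: {total}")
for label, share in shares.items():
    print(f"Class {label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model worth keeping should clearly beat this baseline.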


Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Scale only the features, not the target variable.
# Note: fitting on all rows before the split lets test rows influence the scaling statistics.
df_scaled = scaler.fit_transform(df.iloc[:, :-1])

# Convert the scaled array back to a DataFrame
df_scaled = pd.DataFrame(df_scaled, columns=df.columns[:-1])
df_scaled['Target'] = df['Target']  # Add the target back

print(df_scaled.head())  # Preview the scaled data


   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0     1.097064     -2.073335        1.269934   0.984375         1.568466   
1     1.829821     -0.353632        1.685955   1.908708        -0.826962   
2     1.579888      0.456187        1.566503   1.558884         0.942210   
3    -0.768909      0.253732       -0.592687  -0.764464         3.283553   
4     1.750297     -1.151816        1.776573   1.826229         0.280372   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0          3.283515        2.652874             2.532475       2.217515   
1         -0.487072       -0.023846             0.548144       0.001392   
2          1.052926        1.363478             2.037231       0.939685   
3          3.402909        1.915897             1.451707       2.867383   
4          0.539340        1.371011             1.428493      -0.009560   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0             
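
To make the scaling step concrete, the sketch below reproduces one scaled value by hand using the z-score formula z = (x - mean) / std. The inputs are copied from the `df.describe()` and `df.head()` outputs earlier in the notebook. One detail is easy to miss: `describe()` reports the sample standard deviation (ddof=1), while `StandardScaler` divides by the population standard deviation (ddof=0), so the sketch converts between the two.

```python
import math

# Values copied from the df.describe() and df.head() outputs earlier in the notebook
n = 569
mean_radius_mean = 14.127292
mean_radius_std_sample = 3.524049   # describe() reports the sample std (ddof=1)
x = 17.99                           # "mean radius" of row 0

# StandardScaler divides by the population std (ddof=0), so convert first
std_population = mean_radius_std_sample * math.sqrt((n - 1) / n)

z = (x - mean_radius_mean) / std_population
print(f"{z:.4f}")
```

The result should land very close to the 1.097064 shown for row 0 in the scaled preview; small differences come from the rounding of the printed inputs.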

Split Data into Train & Test Sets

In [None]:
from sklearn.model_selection import train_test_split

X = df_scaled.drop(columns=['Target'])  # Features
y = df_scaled['Target']  # Target variable

# 80% Training, 20% Testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Training Set:", X_train.shape, y_train.shape)
print("Testing Set:", X_test.shape, y_test.shape)


Training Set: (455, 30) (455,)
Testing Set: (114, 30) (114,)
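
Because the notebook fits the scaler on the full dataset before splitting, the test rows contribute to the scaling mean and standard deviation. One common way to avoid that, shown here as a sketch rather than as the notebook's method, is to split the raw data first and wrap the scaler and model in a scikit-learn pipeline. The names `X_raw`, `y_raw`, `pipeline`, and the `max_iter=1000` setting are assumptions made for this example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_breast_cancer(return_X_y=True)

# Split the unscaled data first, with the same settings as the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics when it transforms the test rows
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_tr, y_tr)

test_accuracy = pipeline.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```

On a dataset this small and clean the difference is likely to be minor, but the pipeline form also makes cross-validation and model saving simpler, since preprocessing travels with the model.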


Train a Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)  # Train the model
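
For intuition about what the fitted model computes, binary logistic regression takes a weighted sum of the features, adds a bias, and passes the result through the sigmoid function to get a probability for class 1. The sketch below uses two features with made-up weights; `weights`, `bias`, and `predict_proba_class1` are hypothetical and are not the coefficients learned above:

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and bias for two scaled features (not the fitted values)
weights = [-1.2, -0.8]
bias = 0.5

def predict_proba_class1(features):
    """Probability of class 1 for one sample under the toy model."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

sample = [1.10, -2.07]  # roughly row 0's scaled mean radius and mean texture
p = predict_proba_class1(sample)
label = 1 if p >= 0.5 else 0
print(f"P(class 1) = {p:.3f}, predicted label = {label}")
```

The real model does the same thing with 30 weights, available after fitting as `model.coef_` and `model.intercept_`.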


Model Evaluation

In [None]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)  # Predict on the test set

# Check accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Detailed Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))


Model Accuracy: 0.98
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98        42
           1       0.99      0.99      0.99        72

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114



Hyperparameter Tuning using GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter Grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Regularization strength
    'solver': ['liblinear', 'lbfgs']  # Different solvers for optimization
}

grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)


Best Parameters: {'C': 0.1, 'solver': 'lbfgs'}
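
To see how much work the search above performs, the sketch below enumerates the same grid by hand. It only counts candidates and fits and does not train anything; `candidates` and `n_cv_fits` are names chosen for this example:

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "solver": ["liblinear", "lbfgs"],
}
cv_folds = 5

# Every combination of one value per hyperparameter
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]

n_candidates = len(candidates)
n_cv_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_cv_fits} cross-validation fits")
print("First candidate:", candidates[0])
```

With its default `refit=True`, `GridSearchCV` then trains the best candidate once more on the whole training set, and that model is available as `grid_search.best_estimator_`.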


In [None]:
{'C': 1, 'solver': 'liblinear'}


{'C': 1, 'solver': 'liblinear'}

Train Final Model with Best Parameters

In [None]:
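# Hard-coded parameters. Note: these differ from the grid search result printed above
# ({'C': 0.1, 'solver': 'lbfgs'}); use grid_search.best_params_ to reuse that result.
best_params = {'C': 1, 'solver': 'liblinear'}

final_model = LogisticRegression(**best_params)
final_model.fit(X_train, y_train)  # Train the model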

print("Final Model Trained Successfully! ✅")


Final Model Trained Successfully! ✅


Final Model Evaluation

In [None]:
from sklearn.metrics import confusion_matrix

y_pred_final = final_model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred_final)
print(f"Final Model Accuracy: {accuracy:.2f}")

# Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_final))

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred_final))


Final Model Accuracy: 0.98
Confusion Matrix:
 [[41  1]
 [ 1 71]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98        42
           1       0.99      0.99      0.99        72

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114
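
The numbers in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, recall, and F1 from the four counts printed above, using scikit-learn's convention that rows are true labels and columns are predicted labels. The helper `precision_recall_f1` is written for this example:

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label
cm = [[41, 1],
      [1, 71]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision_recall_f1(cm, label):
    """Per-class precision, recall and F1 from a square confusion matrix."""
    true_pos = cm[label][label]
    predicted = sum(row[label] for row in cm)   # column sum: predicted as `label`
    actual = sum(cm[label])                     # row sum: truly `label`
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(f"accuracy = {accuracy:.2f}")
for label in (0, 1):
    p, r, f1 = precision_recall_f1(cm, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In a screening context the off-diagonal entry in row 0 deserves the most attention: it is a malignant case (label 0) that the model predicted as benign.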



In [None]:
import joblib

# Save the model
joblib.dump(final_model, "breast_cancer_classifier.pkl")
print("Model Saved Successfully! ✅")


Model Saved Successfully! ✅
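
One thing to keep in mind when reloading: the saved model was trained on scaled features, but the fitted scaler itself is not written to disk here. A possible way to handle this, sketched below as an assumption rather than as the notebook's approach, is to save the scaler and the model together and apply both at prediction time. The file name `classifier_with_scaler.pkl` and the dictionary layout are hypothetical:

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_breast_cancer(return_X_y=True)
scaler = StandardScaler().fit(X_raw)
clf = LogisticRegression(C=1, solver="liblinear").fit(scaler.transform(X_raw), y_raw)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; bundling the scaler keeps preprocessing and model together
    path = os.path.join(tmp_dir, "classifier_with_scaler.pkl")
    joblib.dump({"scaler": scaler, "model": clf}, path)

    bundle = joblib.load(path)

new_rows = X_raw[:5]  # stand-in for new, unscaled measurements
predictions = bundle["model"].predict(bundle["scaler"].transform(new_rows))
print(predictions)
```

Files written by `joblib.dump` are pickles, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that created them.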


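Final Data Analysis Observations:
1. The raw features span very different magnitudes (mean area averages about 655, mean smoothness about 0.096), so StandardScaler() was used to put them on a comparable scale before fitting.
2. Logistic Regression performed well, reaching 0.98 accuracy on the 114-sample test set (2 misclassifications).
3. Hyperparameter tuning did not change test performance: the default model and the final model both scored 0.98 with identical classification reports.
4. The classes are moderately imbalanced (357 benign vs. 212 malignant, roughly 63% to 37%); the split was stratified and no resampling was applied.
5. The final model was saved as breast_cancer_classifier.pkl for future use.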