## **Dataset Overview: Breast Cancer Dataset (from sklearn)**

The **Breast Cancer** dataset is a well-known dataset available in the sklearn library, primarily used for classification tasks. The dataset contains information about various features of cell nuclei from breast cancer biopsies, with the goal of predicting whether the tumor is **malignant** (cancerous) or **benign** (non-cancerous).

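The dataset contains 569 samples and 30 numeric features, computed from digitized images of fine needle aspirates (FNA) of breast masses. The features describe the shape and texture of the cell nuclei: ten measurements (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension), each reported as a mean, a standard error, and a "worst" value. In the `target` column, `0` denotes malignant and `1` denotes benign.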
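The label encoding is easy to get backwards, so it can help to check it directly. The snippet below is a small sketch that reads the class names and counts from the object returned by `load_breast_cancer`; the variable names (`cancer`, `class_counts`) are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Shape of the feature matrix: (samples, features)
n_samples, n_features = cancer.data.shape

# Pair each integer label (0, 1) with its class name and count its samples
class_counts = {
    str(name): int(count)
    for name, count in zip(cancer.target_names, np.bincount(cancer.target))
}

print(n_samples, n_features)  # 569 30
print(class_counts)           # {'malignant': 212, 'benign': 357}
```

Because `target_names` is ordered by label, label `0` is malignant and label `1` is benign.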

## **Data Acquisition**

In [16]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer dataset
cancer_data = load_breast_cancer()

# Create the DataFrame from the data and feature names
# (named `df` because every later cell refers to it by that name)
df = pd.DataFrame(cancer_data['data'], columns=cancer_data['feature_names'])

# Add the target column to the DataFrame
df['target'] = cancer_data.target


In [2]:
# Display the first 5 rows of the dataset
df.head(5)

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


## **Data Cleaning**

In [3]:
# Display basic information
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [4]:
# Check for missing values
print(f"Missing value count in each column:\n {df.isnull().sum()}")

Missing value count in each column:
 mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64


In [5]:
# Check for duplicates
print(f"Duplicate rows : {df.duplicated().sum()}")

Duplicate rows : 0


In [6]:
# Basic statistics of the dataset
print("Descriptive Statistics:\n")
print(df.describe())

Descriptive Statistics:

       mean radius  mean texture  mean perimeter    mean area  \
count   569.000000    569.000000      569.000000   569.000000   
mean     14.127292     19.289649       91.969033   654.889104   
std       3.524049      4.301036       24.298981   351.914129   
min       6.981000      9.710000       43.790000   143.500000   
25%      11.700000     16.170000       75.170000   420.300000   
50%      13.370000     18.840000       86.240000   551.100000   
75%      15.780000     21.800000      104.100000   782.700000   
max      28.110000     39.280000      188.500000  2501.000000   

       mean smoothness  mean compactness  mean concavity  mean concave points  \
count       569.000000        569.000000      569.000000           569.000000   
mean          0.096360          0.104341        0.088799             0.048919   
std           0.014064          0.052813        0.079720             0.038803   
min           0.052630          0.019380        0.000000         

**Observation:** All columns have the expected data types, and the dataset has no missing values and no duplicate rows. No cleaning is required, so we can move on to the preprocessing steps.

In [7]:
# Separate features (X) from the target variable (y)
X = df.drop(columns=['target'])
y = df['target']

## **Feature Selection**

In [8]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Perform feature selection using RFE with a RandomForest classifier as the estimator
estimator = RandomForestClassifier(random_state=42)
rfe = RFE(estimator=estimator, n_features_to_select=10)
rfe.fit(X, y)

# Get and display the names of the top 10 selected features
selected_features = X.columns[rfe.support_]
print("Best 10 Features:\n")
print(selected_features)

# Filter the dataset to keep only the selected features
X_selected = X[selected_features]

Best 10 Features:

Index(['mean perimeter', 'mean area', 'mean concavity', 'mean concave points',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst concavity', 'worst concave points'],
      dtype='object')
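The cell above runs RFE on all 569 rows before the train/test split, so the rows later used for testing also influence which features are kept. The following is a minimal sketch of a split-first variant, not the notebook's method. The smaller forest (`n_estimators=50`), the larger elimination step (`step=5`), and `stratify=y` are assumptions made here to keep the example fast and the class balance stable, so the selected features may differ from the list above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Split first so the test rows never influence which features are chosen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Assumed settings: a small forest and step=5 keep the sketch quick to run
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    n_features_to_select=10,
    step=5,
)
rfe.fit(X_train, y_train)

# Apply the same column selection to both splits
selected = X_train.columns[rfe.support_]
X_train_sel, X_test_sel = X_train[selected], X_test[selected]
print(list(selected))
```

Selecting on the training split alone gives a test score that better reflects performance on unseen data, at the cost of choosing features from slightly fewer rows.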


## **Data Preprocessing**

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Use train_test_split to separate features and target into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

# Initialize the StandardScaler for feature scaling
scaler = StandardScaler()

# Fit the scaler on the training data only, then apply the same transformation to both sets
# (fit_transform returns the scaled array; fit alone would return the scaler object)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
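`StandardScaler.fit` returns the fitted scaler itself, while `fit_transform` returns the scaled array, so the two are not interchangeable on the training set. The sketch below illustrates the difference and checks the result. For brevity it uses all 30 features rather than the 10 selected ones, which is an assumption of this example only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()

# fit() learns the per-feature mean and std and returns the scaler object
fit_returns_self = scaler.fit(X_train) is scaler

# fit_transform() learns the statistics and returns the scaled training array
X_train_scaled = scaler.fit_transform(X_train)

# transform() reuses the training statistics, so no test information leaks in
X_test_scaled = scaler.transform(X_test)

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)

print(fit_returns_self)
print(np.allclose(train_means, 0.0), np.allclose(train_stds, 1.0))
```

The scaled test set will generally not have mean 0 and standard deviation 1, which is expected: it is scaled with the training statistics.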


## **ANN Model Building**

In [12]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report

# Create and configure the Artificial Neural Network (ANN)
ann = MLPClassifier(random_state=42, max_iter=1000)

# Fit the model on the training data
ann.fit(X_train_scaled, y_train)

# Predict on the test data
predictions = ann.predict(X_test_scaled)

# Evaluate the model's performance using accuracy and classification report
accuracy = accuracy_score(y_test, predictions)
class_report = classification_report(y_test, predictions)

# Output the results
print("ANN Model Performance Evaluation:\n")
print(f"Accuracy: {accuracy}\n")
print("Classification Report:")
print(class_report)


ANN Model Performance Evaluation:

Accuracy: 0.9736842105263158

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



## **Hyperparameter Tuning**

In [13]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report

# Baseline before tuning: the same default MLPClassifier configuration as in the previous section
model = MLPClassifier(random_state=42, max_iter=1000)

# Train the model on the training data
model.fit(X_train_scaled, y_train)

# Generate predictions on the test data
y_pred = model.predict(X_test_scaled)

# Calculate accuracy and generate classification report
results = {
    'accuracy': accuracy_score(y_test, y_pred),
    'classification_report': classification_report(y_test, y_pred)
}

# Print out the evaluation results
print("ANN Model Performance Summary:")
print(f"Accuracy: {results['accuracy']}\n")
print("Classification Report:")
print(results['classification_report'])


ANN Model Performance Summary:
Accuracy: 0.9736842105263158

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



In [14]:
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report

# Define hyperparameter search space for the ANN model
hyperparameters = {
    'hidden_layer_sizes': [(50,), (100,), (150,), (200,)],  # Neurons in hidden layer
    'activation': ['tanh', 'relu'],  # Activation functions
    'solver': ['adam', 'sgd'],  # Optimization algorithms
    'alpha': [0.0001, 0.001, 0.01, 0.1],  # Regularization
    'learning_rate': ['constant', 'invscaling', 'adaptive']  # Learning rate options
}

# Initialize GridSearchCV with MLPClassifier and hyperparameter grid
grid_search = GridSearchCV(
    estimator=MLPClassifier(max_iter=500, early_stopping=True, random_state=42),
    param_grid=hyperparameters,
    cv=5,
    n_jobs=-1
)

# Train the model using GridSearchCV
grid_search.fit(X_train_scaled, y_train)

# Display the best parameters and best score found
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Hyperparameters from GridSearchCV:\n")
print(best_params)
print(f"Best Cross-validation Score: {best_score}\n")

# Evaluate the best model from GridSearchCV on the test set
best_model = grid_search.best_estimator_
test_predictions = best_model.predict(X_test_scaled)

print("Evaluation of the Best Model:\n")
print(f"Accuracy: {accuracy_score(y_test, test_predictions)}\n")
print("Classification Report:")
print(classification_report(y_test, test_predictions))


Best Hyperparameters from GridSearchCV:

{'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
Best Cross-validation Score: 0.9384615384615385

Evaluation of the Best Model:

Accuracy: 0.9649122807017544

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.98      0.95        43
           1       0.99      0.96      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.97      0.96       114
weighted avg       0.97      0.96      0.97       114
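The grid above is fairly large, and part of it is redundant: in `MLPClassifier`, `learning_rate` is only used when `solver='sgd'`, so for `adam` the three learning-rate options train identical models. The sketch below counts the candidates with `ParameterGrid` and compares them with a split grid, which `GridSearchCV` also accepts as a list of dicts. The names `full_grid`, `shared`, and `split_grid` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The search space used in the notebook
full_grid = {
    'hidden_layer_sizes': [(50,), (100,), (150,), (200,)],
    'activation': ['tanh', 'relu'],
    'solver': ['adam', 'sgd'],
    'alpha': [0.0001, 0.001, 0.01, 0.1],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
}

# Parameters that apply to both solvers
shared = {
    key: values
    for key, values in full_grid.items()
    if key not in ('solver', 'learning_rate')
}

# Vary learning_rate only for sgd, where it has an effect
split_grid = [
    {**shared, 'solver': ['adam']},
    {**shared, 'solver': ['sgd'], 'learning_rate': full_grid['learning_rate']},
]

cv_folds = 5
n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))

print(n_full, n_full * cv_folds)    # 192 960
print(n_split, n_split * cv_folds)  # 128 640
```

Passing `split_grid` as `param_grid` would cover the same distinct models with a third fewer fits.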



## **Model Comparison: ANN vs. ANN with Hyperparameter Tuning**

### 1. **ANN Model (Without Hyperparameter Tuning)**
- **Accuracy**: 97.37%
- **Precision (Class 0)**: 0.98 | **Recall (Class 0)**: 0.95 | **F1-score (Class 0)**: 0.96
- **Precision (Class 1)**: 0.97 | **Recall (Class 1)**: 0.99 | **F1-score (Class 1)**: 0.98
- This model misclassifies 3 of the 114 test samples. Its strongest result is recall on Class 1 (0.99). In sklearn's encoding, Class 0 is malignant and Class 1 is benign.

### 2. **ANN Model (With Hyperparameter Tuning)**
- **Accuracy**: 96.49%
- **Precision (Class 0)**: 0.93 | **Recall (Class 0)**: 0.98 | **F1-score (Class 0)**: 0.95
- **Precision (Class 1)**: 0.99 | **Recall (Class 1)**: 0.96 | **F1-score (Class 1)**: 0.97
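- The tuned model misclassifies 4 of the 114 test samples, one more than the untuned model. Tuning raised recall for Class 0 from 0.95 to 0.98, but precision for Class 0 and recall for Class 1 both fell. The grid search also used `max_iter=500` with `early_stopping=True`, whereas the untuned model used `max_iter=1000` without early stopping, so the two runs are not a like-for-like comparison.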

### **Conclusion:**
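- The **ANN Model (Without Hyperparameter Tuning)** has the higher test accuracy (97.37% vs. 96.49%), a difference of one test sample out of 114.
- The tuned model has higher recall on Class 0 (0.98 vs. 0.95), so the gap is too small to call either model clearly better. The untuned model is kept for this dataset because it is simpler and scores marginally higher on this split.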
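To put the accuracy gap in perspective, the reported accuracies can be converted back into counts of correct predictions. The sketch below does that arithmetic and compares the gap with a rough binomial standard error for a 114-sample test set. Using the binomial standard error as a yardstick is an assumption of this example, not an analysis performed in the notebook.

```python
import math

n_test = 114                        # test-set size from the classification reports
acc_untuned = 0.9736842105263158    # reported accuracy, untuned model
acc_tuned = 0.9649122807017544      # reported accuracy, tuned model

# Recover the number of correctly classified test samples
correct_untuned = round(acc_untuned * n_test)
correct_tuned = round(acc_tuned * n_test)

# Rough binomial standard error of an accuracy estimate on n_test samples
std_error = math.sqrt(acc_untuned * (1 - acc_untuned) / n_test)
gap = acc_untuned - acc_tuned

print(correct_untuned, correct_tuned)      # 111 110
print(round(gap, 4), round(std_error, 4))  # 0.0088 0.015
```

The gap is smaller than one standard error, which is consistent with treating the two models as roughly equivalent on this split.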


## **Saving the Model**

In [15]:
import joblib

# Define the model and scaler to be saved
model_to_save = ann  # Best model after training
scaler_to_save = scaler  # Scaler used for preprocessing

# Saving the model to a file
model_filename = 'ann_model.joblib'
scaler_filename = 'scaler.joblib'

# Use joblib to save both the model and scaler (joblib.dump accepts a file path directly)
joblib.dump(model_to_save, model_filename)
joblib.dump(scaler_to_save, scaler_filename)

print(f"Model and scaler have been saved as '{model_filename}' and '{scaler_filename}' respectively.")


Model and scaler have been saved as 'ann_model.joblib' and 'scaler.joblib' respectively.
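A saved model is only useful if it can be reloaded together with the scaler and fed the same columns in the same order. The sketch below is a self-contained round trip under stated assumptions: it retrains a default `MLPClassifier` on the 10 features listed in the feature-selection output and writes to a temporary directory instead of the working directory. `SELECTED_FEATURES` and the other variable names are illustrative.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# The 10 features reported by the RFE step in this notebook
SELECTED_FEATURES = [
    'mean perimeter', 'mean area', 'mean concavity', 'mean concave points',
    'worst radius', 'worst texture', 'worst perimeter', 'worst area',
    'worst concavity', 'worst concave points',
]

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X[SELECTED_FEATURES], y, test_size=0.2, random_state=42
)

scaler = StandardScaler().fit(X_train)
model = MLPClassifier(random_state=42, max_iter=1000)
model.fit(scaler.transform(X_train), y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'ann_model.joblib'
    scaler_path = Path(tmp_dir) / 'scaler.joblib'

    # Save both artifacts, then load them back as a deployment would
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New data must be scaled with the loaded scaler before prediction
original_pred = model.predict(scaler.transform(X_test))
reloaded_pred = loaded_model.predict(loaded_scaler.transform(X_test))
print(np.array_equal(original_pred, reloaded_pred))
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources and, ideally, with the same scikit-learn version that produced them.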


## **GitHub Link:** https://github.com/Rutvik-06/Assignment-1-

## **Cloud Deployment Link :**