**Import Libraries and Load Data**

To start, I imported essential libraries for data handling, visualization, preprocessing, modeling, and evaluation. Each library plays a specific role in managing different parts of the project. For example:

- Pandas and NumPy are used for data manipulation and basic operations.
- Seaborn and Matplotlib enable data visualization, helping me understand distributions, correlations, and patterns.
- Scikit-learn provides tools for preprocessing, feature selection, model building, and evaluation.

After importing the libraries, I loaded the dataset and checked for missing values and data types in each column. This step allowed me to gain a preliminary understanding of the data structure and to identify any immediate issues that might need addressing, such as missing values or categorical features.



In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.experimental import enable_iterative_imputer  # Enable the experimental feature
from sklearn.impute import IterativeImputer  # Import IterativeImputer
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Load the dataset
data = pd.read_csv('/content/drive/MyDrive/Breast_Cancer_dataset.csv')

# Display initial data information
print("Dataset Info:")
print(data.info())
print("Missing values per column:\n", data.isnull().sum())


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4024 entries, 0 to 4023
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Age                     3823 non-null   float64
 1   Race                    3622 non-null   object 
 2   Marital Status          3703 non-null   object 
 3   T Stage                 4024 non-null   object 
 4   N Stage                 4024 non-null   object 
 5   6th Stage               4024 non-null   object 
 6   differentiate           4024 non-null   object 
 7   Grade                   4024 non-null   object 
 8   A Stage                 4024 non-null   object 
 9   Tumor Size              3622 non-null   float64
 10  Estrogen Status         3823 non-null   object 
 11  Progesterone Status     4024 non-null   object 
 12  Regional Node Examined  3421 non-null   float64
 13  Reginol Node Positive   4024 non-null   int64  
 14  Survival Months         40

**Data Preprocessing (Handle Missing Values and Encode Categorical Variables)**


In this chunk, I focused on preparing the dataset for model training by addressing two primary issues: categorical variables and missing values.

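Encoding Categorical Variables: Since many machine learning algorithms require numerical input, I converted categorical columns to numeric codes using Label Encoding. This process maps each unique category to an integer. The mapping is compact, but it also imposes an arbitrary order on the categories, which matters more for distance-based models than for tree-based ones. Because the encoding cell casts each column to string first, missing entries in categorical columns such as Race, Marital Status, and Estrogen Status become a separate 'nan' category instead of being left for the imputer.

Handling Missing Values with MICE: To handle the remaining missing values, I used scikit-learn's IterativeImputer, which is modeled on Multiple Imputation by Chained Equations (MICE). It iteratively fills in missing values by modeling each feature with missing values as a function of the other features. With its default settings it returns a single imputed dataset rather than multiple imputations. Compared with simpler methods such as mean imputation, this approach uses relationships between features and can reduce the bias they introduce.

After these steps, I confirmed that the dataset had no missing values left. This preprocessing step is crucial because most of the models used later cannot be trained on data that contains missing values.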

In [3]:
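# Encode categorical variables
# Note: astype(str) turns NaN into the string 'nan', so missing categories
# are encoded as their own class here rather than being imputed below
for col in data.select_dtypes(include='object').columns:
    data[col] = LabelEncoder().fit_transform(data[col].astype(str))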

# Handle missing values using MICE (Iterative Imputer)
imputer = IterativeImputer(max_iter=10, random_state=0)
data_imputed = imputer.fit_transform(data)
data = pd.DataFrame(data_imputed, columns=data.columns)

# Check if missing values are handled
print("Missing values after imputation:\n", data.isnull().sum())


Missing values after imputation:
 Age                       0
Race                      0
Marital Status            0
T Stage                   0
N Stage                   0
6th Stage                 0
differentiate             0
Grade                     0
A Stage                   0
Tumor Size                0
Estrogen Status           0
Progesterone Status       0
Regional Node Examined    0
Reginol Node Positive     0
Survival Months           0
Status                    0
dtype: int64
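
The encoding and imputation steps above can also be arranged so that missing categories stay missing until the imputer runs. The following is a minimal sketch of that idea on a hypothetical toy frame (the values, and the use of pandas category codes instead of `LabelEncoder`, are assumptions for illustration, not the notebook's method):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Hypothetical toy frame: one numeric and one categorical column, both with gaps
toy = pd.DataFrame({
    "Age": [40.0, 52.0, np.nan, 61.0, 47.0, 58.0],
    "Race": ["White", None, "Black", "White", "Other", "White"],
})

# Category codes mark missing entries as -1; convert those back to NaN
# so the imputer treats them as missing instead of as a real category
codes = toy["Race"].astype("category").cat.codes
toy["Race"] = codes.astype(float).where(codes >= 0)

missing_before = int(toy.isna().sum().sum())

# Model each column with gaps as a function of the other columns
imputer = IterativeImputer(max_iter=10, random_state=0)
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

missing_after = int(toy_imputed.isna().sum().sum())
print("Missing before:", missing_before, "after:", missing_after)
```

One trade-off of this variant is that imputed category codes come back as continuous values, so they would likely need rounding (or a classifier-based imputer) before being interpreted as categories.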


**Outlier Detection and Removal**


Next, I addressed outliers, which can distort model training and lead to inaccurate predictions. I used Z-score analysis to detect extreme values. A Z-score measures how far a value lies from the column mean, in units of standard deviation. I treated any value with an absolute Z-score of 3 or more as an outlier and dropped every row containing at least one such value. The check is applied to all columns, including the label-encoded categorical ones, so a rare category can also cause a row to be dropped.

After identifying outliers, I removed them so that the dataset better represents typical cases, with the aim of improving model robustness and reducing the risk of fitting noise.



In [4]:
# Detect and remove outliers using Z-score
z_scores = np.abs((data - data.mean()) / data.std())
data = data[(z_scores < 3).all(axis=1)]  # Removing outliers beyond 3 standard deviations

# Separate features and target variable
X = data.drop(columns=['Status'])  # Adjust 'Status' if your target column has a different name
y = data['Status']
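
As a small illustration of the Z-score rule described above, the sketch below applies it only to a column that is genuinely continuous. The toy data and the `numeric_cols` list are hypothetical; which columns count as continuous in the real dataset is a judgment call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: 99 typical tumor sizes plus one extreme value
toy = pd.DataFrame({
    "Tumor Size": np.append(rng.normal(30, 5, size=99), 500.0),
    "Grade": rng.integers(0, 3, size=100).astype(float),  # label-encoded category
})

# Assumed list of continuous columns; encoded categories are left out
numeric_cols = ["Tumor Size"]

# Z-score: distance from the column mean in units of standard deviation
z = (toy[numeric_cols] - toy[numeric_cols].mean()) / toy[numeric_cols].std()

# Keep rows whose absolute Z-score is below 3 in every checked column
keep = (z.abs() < 3).all(axis=1)
filtered = toy[keep].reset_index(drop=True)

print(len(toy), "rows before,", len(filtered), "rows after")
```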


**Standardization and Dimensionality Reduction (PCA)**


Standardization: Before applying machine learning models, I standardized the features so that each had a mean of 0 and a standard deviation of 1. Standardization matters for algorithms that are sensitive to feature scales, such as KNN and Neural Networks, because it keeps features with larger numeric ranges from dominating distances and gradients.

Principal Component Analysis (PCA): Since high-dimensional data can lead to redundancy and slow computation, I applied PCA to reduce the dataset’s dimensionality while preserving 95% of the variance. PCA transforms the data into new features (principal components), each a linear combination of the original features, ordered by how much variance they capture. In this run, 11 components were needed to reach the 95% threshold. The PCA output (X_pca) is reported here for reference; the later cells work with the original columns rather than the components.



In [5]:
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA to reduce dimensionality while retaining 95% variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"Number of features after PCA: {X_pca.shape[1]}")


Number of features after PCA: 11
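
To see how the 95% threshold translates into a component count, the cumulative explained variance can be inspected directly. This is a minimal sketch on synthetic data from `make_classification` (an assumption; it is not the breast cancer dataset), comparing a manual count with what `PCA(n_components=0.95)` selects:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with some redundant features
X_demo, _ = make_classification(
    n_samples=300, n_features=15, n_informative=5, n_redundant=5, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components and accumulate the variance ratios
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

# Let scikit-learn choose the count from the same threshold
pca_95 = PCA(n_components=0.95).fit(X_demo)

print("Cumulative variance:", np.round(cumulative, 3))
print("Manual count:", n_needed, "| PCA(n_components=0.95):", pca_95.n_components_)
```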


**Recursive Feature Elimination (RFE)**


In this chunk, I used Recursive Feature Elimination (RFE) to select the most important features. RFE is a powerful feature selection technique that iteratively removes the least significant features and retrains the model with the remaining ones. For this project, I used a Random Forest model as the estimator within RFE.

The Random Forest model helped assess feature importance during each iteration. By specifying that I wanted to select the top 5 features, RFE ranked the features and retained only those that contributed most to model performance. This method is beneficial for reducing dimensionality in a targeted way, ensuring the model focuses on the most predictive features.



In [6]:
# Initialize Random Forest model for RFE
model = RandomForestClassifier(random_state=42)

# Apply Recursive Feature Elimination to select top features
rfe_selector = RFE(estimator=model, n_features_to_select=5, step=1)
rfe_selector.fit(X_scaled, y)

# Get selected features based on RFE
rfe_features = X.columns[rfe_selector.get_support()]
X_selected = X[rfe_features]
print("Selected features with RFE:", rfe_features)


Selected features with RFE: Index(['Age', 'Tumor Size', 'Regional Node Examined', 'Reginol Node Positive',
       'Survival Months'],
      dtype='object')
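
Besides the selected subset, RFE also records the order in which features were eliminated. The sketch below, run on synthetic data (an assumption for illustration), shows the `support_` mask and the `ranking_` array, where rank 1 marks a selected feature and higher ranks were dropped earlier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data: 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Drop one feature per iteration until 3 remain
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    n_features_to_select=3,
    step=1,
)
selector.fit(X_demo, y_demo)

print("Selected mask:", selector.support_)
print("Elimination ranking:", selector.ranking_)
```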


**Splitting Data into Training and Testing Sets**

With the most important features selected, I split the data into training and testing sets. This separation is essential for assessing model performance objectively. The training set (80% of the data) is used to fit the model, while the testing set (20%) is kept separate to evaluate how well the model generalizes to unseen data.

By using this approach, I could assess how the model might perform on new cases, ensuring that my results reflect generalizability rather than overfitting to the training data.



In [7]:
# Split the data into training and testing sets using selected features
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

# Reset indices to ensure alignment
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)
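
When the two classes are unevenly represented, a stratified split keeps the class proportions the same in both subsets. The following sketch uses hypothetical labels with a 15% minority class (the proportion is an assumption, not a figure from this dataset) and the `stratify` argument of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 1000 samples, 15% in the positive class
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 150 + [0] * 850)

# stratify=y_demo preserves the class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive rate in train:", y_tr.mean())
print("Positive rate in test: ", y_te.mean())
```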


**Model Implementation**


Here, I implemented and evaluated six different machine learning models, each with unique strengths and limitations:

*K-Nearest Neighbors (KNN):* I implemented KNN from scratch to gain a deeper understanding of the algorithm. KNN works by finding the nearest k neighbors of a data point and predicting the majority class. It is simple, interpretable, and effective for low-dimensional data. However, it’s sensitive to feature scaling and can be slow with large datasets.





In [8]:
# Import necessary libraries
import numpy as np
from sklearn.metrics import accuracy_score

# Define Euclidean distance function
def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Define custom KNN class
class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Ensure that the data is in Numpy array format
        self.X_train = np.array(X)
        self.y_train = np.array(y)

    def predict(self, X):
        # Convert the input data to a Numpy array
        X = np.array(X)
        predictions = [self._predict(x) for x in X]
        return np.array(predictions)

    def _predict(self, x):
        # Compute distances from x to all training samples
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Get indices of the k nearest neighbors
        k_indices = np.argsort(distances)[:self.k]
        # Extract the labels of the k nearest neighbors
        k_nearest_labels = [self.y_train[i] for i in k_indices]
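        # Return the most common label among the neighbors
        # (np.unique works for float labels, which np.bincount rejects)
        labels, counts = np.unique(k_nearest_labels, return_counts=True)
        return labels[np.argmax(counts)]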

# Reset index after preprocessing steps (if applicable)
data.reset_index(drop=True, inplace=True)

# Redefine X and y after any preprocessing that may alter the indices
# Note: X now holds every column except the target, so this split replaces
# the earlier RFE-based split; the models below train on all unscaled features
X = data.drop(columns=['Status'])  # Replace 'Status' with actual target column name if different
y = data['Status']

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert training and test sets to numpy arrays
X_train_np = X_train.values
y_train_np = y_train.values
X_test_np = X_test.values

# Instantiate, fit, and predict with the custom KNN model
knn_model = KNN(k=5)
knn_model.fit(X_train_np, y_train_np)
knn_predictions = knn_model.predict(X_test_np)

# Evaluate the model
print("KNN Accuracy:", accuracy_score(y_test, knn_predictions))


KNN Accuracy: 0.8887323943661972
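
The per-sample Python loop in the class above is easy to follow but slow on larger data. As a hedged alternative, the sketch below computes all test-to-train distances in one matrix operation. The function name `knn_predict` and the toy points are hypothetical; squared distances are used because they give the same neighbor ordering as Euclidean distances.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority-vote KNN using a vectorized squared-distance matrix."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)

    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, shape (n_test, n_train)
    dists = (
        (X_test ** 2).sum(axis=1)[:, None]
        + (X_train ** 2).sum(axis=1)[None, :]
        - 2.0 * X_test @ X_train.T
    )

    # Indices of the k closest training points for each test point
    nearest = np.argsort(dists, axis=1)[:, :k]

    # Majority vote per test point (works for int or float labels)
    preds = []
    for row in y_train[nearest]:
        labels, counts = np.unique(row, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Two well-separated toy clusters
X_tr = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_tr = np.array([0, 0, 0, 1, 1, 1])
X_te = np.array([[0.05, 0.05], [5.0, 5.1]])

print(knn_predict(X_tr, y_tr, X_te, k=3))  # → [0 1]
```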


*Naïve Bayes:* This probabilistic model is based on Bayes’ theorem and assumes feature independence. Naïve Bayes is fast and handles small datasets well, but it may underperform if features are highly correlated.

*Decision Tree:* The Decision Tree model splits data based on feature conditions, forming branches that lead to class predictions. It’s highly interpretable and can handle non-linear relationships, though it’s prone to overfitting without pruning.

*Random Forest:* Random Forest is an ensemble method that builds multiple decision trees and combines their predictions. It reduces overfitting and improves accuracy, particularly for large datasets. However, it’s less interpretable compared to a single decision tree.

*Gradient Boosting:* Gradient Boosting iteratively improves predictions by learning from errors in previous models. This makes it highly accurate and suitable for complex patterns, although it requires careful tuning to avoid overfitting.

*Neural Network:* Lastly, I implemented a Neural Network with two hidden layers. Neural networks excel at capturing complex patterns in large datasets. However, they’re computationally intensive and require substantial tuning to perform well.

Each model was evaluated on the testing set, allowing me to compare accuracy across different approaches.
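
Accuracy is a convenient single number for comparison, but it can look strong even when a model ignores the minority class. The sketch below makes this concrete with hypothetical labels (85 negatives and 15 positives, an assumed split rather than this dataset's actual balance) and a degenerate model that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical ground truth: 85 negatives, 15 positives
y_true = np.array([0] * 85 + [1] * 15)

# A degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)      # recall for the positive class
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted

print("Accuracy:", acc)   # 0.85
print("Recall:  ", rec)   # 0.0
print(cm)
```

Reporting recall or a confusion matrix next to accuracy shows whether a model actually identifies the less frequent outcome.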

In [9]:


# 2. Naïve Bayes
from sklearn.naive_bayes import GaussianNB
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_predictions = nb_model.predict(X_test)
print("Naïve Bayes Accuracy:", accuracy_score(y_test, nb_predictions))

# 3. Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, dt_predictions))

# 4. Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_predictions))

# 5. Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)
print("Gradient Boosting Accuracy:", accuracy_score(y_test, gb_predictions))

# 6. Neural Network
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

nn_model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])
nn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
nn_model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=1)
nn_loss, nn_accuracy = nn_model.evaluate(X_test, y_test)
print("Neural Network Accuracy:", nn_accuracy)


Naïve Bayes Accuracy: 0.8098591549295775
Decision Tree Accuracy: 0.8380281690140845
Random Forest Accuracy: 0.9014084507042254
Gradient Boosting Accuracy: 0.9028169014084507
Epoch 1/10


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


[1m89/89[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 8ms/step - accuracy: 0.8509 - loss: 0.4827
Epoch 2/10
[1m89/89[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.8893 - loss: 0.3100
Epoch 3/10
[1m89/89[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.9071 - loss: 0.2778
Epoch 4/10
[1m89/89[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.8972 - loss: 0.2888
Epoch 5/10
[1m89/89[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.8950 - loss: 0.3089
Epoch 6/10
[1m89/89[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.8994 - loss: 0.2760
Epoch 7/10
[1m89/89[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.9005 - loss: 0.2900
Epoch 8/10
[1m89/89[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.9061 - loss: 0.2719
Epoch 9/10
[1m89/89[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1

To further optimize model performance, I conducted **hyperparameter tuning** on two of the models: Random Forest and Gradient Boosting.



Grid Search for Random Forest: I adjusted parameters like n_estimators (number of trees) and max_depth (depth of each tree) using a grid search. Grid search systematically evaluates each combination to identify the configuration that maximizes cross-validation accuracy.

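Grid Search for Gradient Boosting: For Gradient Boosting, I tuned n_estimators and learning_rate. The learning rate controls how much each tree contributes to the final model, while n_estimators determines the number of trees. The two interact: a smaller learning rate usually needs more trees to reach the same fit.

The grid search reports the configuration with the highest mean 5-fold cross-validation accuracy for each model. On the held-out test set, tuning raised Random Forest accuracy from 0.901 to 0.907, while the tuned Gradient Boosting model scored slightly below its default configuration (0.899 vs. 0.903), so the gains from tuning were small and not uniform.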



In [10]:
# Hyperparameter tuning for Random Forest and Gradient Boosting
param_grid_rf = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}
grid_rf = GridSearchCV(RandomForestClassifier(), param_grid_rf, cv=5)
grid_rf.fit(X_train, y_train)
print("Best Random Forest Params:", grid_rf.best_params_)
print("Random Forest best accuracy:", grid_rf.best_score_)

param_grid_gb = {'n_estimators': [50, 100], 'learning_rate': [0.01, 0.1, 0.5]}
grid_gb = GridSearchCV(GradientBoostingClassifier(), param_grid_gb, cv=5)
grid_gb.fit(X_train, y_train)
print("Best Gradient Boosting Params:", grid_gb.best_params_)
print("Gradient Boosting best accuracy:", grid_gb.best_score_)


Best Random Forest Params: {'max_depth': None, 'n_estimators': 50}
Random Forest best accuracy: 0.9111371935315598
Best Gradient Boosting Params: {'learning_rate': 0.1, 'n_estimators': 50}
Gradient Boosting best accuracy: 0.9129002409518842
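
Beyond `best_params_` and `best_score_`, the `cv_results_` attribute holds the mean and standard deviation of the score for every combination, which helps judge whether the winning configuration is meaningfully better than the rest. A minimal sketch on synthetic data (an assumption; the grid values are also illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# A fixed random_state on the estimator makes the search repeatable
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 50], "max_depth": [None, 5]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_demo, y_demo)

# One row per parameter combination, best rank first
summary = pd.DataFrame(grid.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(summary.to_string(index=False))
```

If the standard deviations are larger than the gaps between mean scores, the ranking of configurations is probably not stable.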


**Summary of Results**


Finally, I summarized the accuracy of each model in a results table, providing a straightforward comparison of performance across models. This table highlights which models performed best on the test set, making it easy to identify the most reliable model for predicting breast cancer survivability.

Additionally, the Random Forest estimator used within RFE indicates which features are most influential: Age, Tumor Size, Regional Node Examined, Reginol Node Positive, and Survival Months were the five features it retained. This information is valuable for understanding which patient characteristics are most predictive of survivability, providing insights that could be useful in a clinical setting.

In conclusion, this project demonstrates a systematic approach to predictive modeling, covering data preprocessing, feature selection, model implementation, and tuning. The ensemble models performed best, with the tuned Random Forest reaching about 0.907 test accuracy and the default Gradient Boosting model about 0.903, while Naïve Bayes scored lowest at about 0.810. These results make the ensemble models promising candidates for predicting breast cancer survivability on this dataset.

In [12]:
# Evaluate the best models from GridSearchCV on the test set
best_rf_model = grid_rf.best_estimator_
best_rf_predictions = best_rf_model.predict(X_test)
best_rf_accuracy = accuracy_score(y_test, best_rf_predictions)

best_gb_model = grid_gb.best_estimator_
best_gb_predictions = best_gb_model.predict(X_test)
best_gb_accuracy = accuracy_score(y_test, best_gb_predictions)

# Display results for all models, including the best-tuned RF and GB models
results = pd.DataFrame({
    "Model": [
        "KNN",
        "Naïve Bayes",
        "Decision Tree",
        "Random Forest (Default)",
        "Gradient Boosting (Default)",
        "Neural Network",
        "Random Forest (Tuned)",
        "Gradient Boosting (Tuned)"
    ],
    "Accuracy": [
        accuracy_score(y_test, knn_predictions),
        accuracy_score(y_test, nb_predictions),
        accuracy_score(y_test, dt_predictions),
        accuracy_score(y_test, rf_predictions),          # Default RF accuracy
        accuracy_score(y_test, gb_predictions),          # Default GB accuracy
        nn_accuracy,
        best_rf_accuracy,                                # Tuned RF accuracy
        best_gb_accuracy                                 # Tuned GB accuracy
    ]
})

print("\nModel Performance Summary:\n", results)



Model Performance Summary:
                          Model  Accuracy
0                          KNN  0.888732
1                  Naïve Bayes  0.809859
2                Decision Tree  0.838028
3      Random Forest (Default)  0.901408
4  Gradient Boosting (Default)  0.902817
5               Neural Network  0.884507
6        Random Forest (Tuned)  0.907042
7    Gradient Boosting (Tuned)  0.898592
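
Feature importance scores from a fitted Random Forest can be listed in a few lines. The sketch below uses synthetic data and placeholder names such as `feature_0` (both are assumptions for illustration); with the notebook's objects, the same pattern would presumably apply to the tuned model and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; larger means more influential
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances)
```

Impurity-based importances tend to favor features with many distinct values, so they are best read as a rough ranking rather than a precise measure.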
