1. Loading and Preprocessing


In [22]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
data = load_breast_cancer()
X = data.data  # Features
y = data.target  # Labels

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Preprocessing Steps:
Handling Missing Values: The breast cancer dataset from sklearn is clean and does not contain any missing values. If there were missing values, we would use techniques like imputation (mean, median, or mode) or drop rows/columns with missing values.
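The claim that the dataset is clean can be checked directly. The sketch below counts NaN entries and then shows, on a small made-up array (`toy` is purely illustrative and not part of the dataset), what median imputation with `SimpleImputer` would look like if values were missing.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer

data = load_breast_cancer()
X = data.data

# Count NaN entries per feature; all zeros means there is nothing to impute.
missing_per_feature = np.isnan(X).sum(axis=0)
total_missing = int(missing_per_feature.sum())
print(f"Samples: {X.shape[0]}, features: {X.shape[1]}, missing values: {total_missing}")

# Hypothetical toy array with gaps, only to illustrate imputation.
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan]])
imputed = SimpleImputer(strategy="median").fit_transform(toy)
print(imputed)  # each NaN is replaced by the median of its column
```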

Feature Scaling: The features in this dataset have very different scales. For example, the smoothness features stay well below 1, while the area features reach into the thousands. Scaling matters for algorithms such as SVM, k-NN, and Logistic Regression, which are sensitive to the scale of the input data. We use StandardScaler to standardize each feature to a mean of 0 and a standard deviation of 1.
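To see how uneven the raw scales are, a quick sketch like the following compares the range (max minus min) of every feature before any scaling. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data.data

# Range (max - min) of every raw feature, before any scaling.
feature_ranges = X.max(axis=0) - X.min(axis=0)
widest = int(np.argmax(feature_ranges))
narrowest = int(np.argmin(feature_ranges))

print(f"Widest:    {data.feature_names[widest]} (range {feature_ranges[widest]:.4g})")
print(f"Narrowest: {data.feature_names[narrowest]} (range {feature_ranges[narrowest]:.4g})")
```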

In [23]:
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

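Justification: Feature scaling puts all features on a comparable scale, so features with large numeric ranges do not dominate distance computations or regularization penalties. This matters most for distance- and margin-based algorithms such as k-NN and SVM, and for regularized Logistic Regression; tree-based models (Decision Tree, Random Forest) are largely insensitive to it. The scaler is fit on the training set only and then applied to the test set, so no test-set statistics leak into training.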
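As a sanity check on the standardization step, the sketch below repeats the split and scaling and inspects the resulting statistics. The `_scaled` variable names are introduced here for clarity. The training features should come out with mean 0 and standard deviation 1, while the test features will only be close to that, because they reuse the training statistics.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_scaled = scaler.transform(X_test)        # same statistics reused on the test data

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

print(f"Largest |mean| on train: {np.abs(train_means).max():.2e}")
print(f"Train std range: {train_stds.min():.4f} to {train_stds.max():.4f}")
print(f"Largest |mean| on test:  {np.abs(test_means).max():.4f}")
```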

2. Classification Algorithm Implementation:

1. Logistic Regression
Description: Logistic Regression is a linear model used for binary classification. It estimates the probability of a sample belonging to a particular class using the logistic function.

Suitability: It is simple and interpretable, making it a good baseline model for binary classification tasks like this.
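To make the role of the logistic function concrete, the following sketch recomputes the predicted probabilities by hand from the fitted coefficients and compares them with `predict_proba`. It assumes the same split and scaling as above; `z`, `manual_proba`, and `manual_pred` are names introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

log_reg = LogisticRegression().fit(X_train, y_train)

# Linear score z = w . x + b for every test sample.
z = X_test @ log_reg.coef_.ravel() + log_reg.intercept_[0]

# Logistic (sigmoid) function maps the score to a probability for class 1.
manual_proba = 1.0 / (1.0 + np.exp(-z))
sklearn_proba = log_reg.predict_proba(X_test)[:, 1]

# Thresholding the probability at 0.5 gives the predicted class.
manual_pred = (manual_proba >= 0.5).astype(int)

print("Probabilities match:", np.allclose(manual_proba, sklearn_proba))
print("Predictions match:  ", np.array_equal(manual_pred, log_reg.predict(X_test)))
```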

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Train the model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predict
y_pred = log_reg.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Logistic Regression:")
print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")

Logistic Regression:
Accuracy: 0.9825, Precision: 0.9907, Recall: 0.9815, F1 Score: 0.9860
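Note that `precision_score`, `recall_score`, and `f1_score` default to `pos_label=1`, and in this dataset label 1 is the benign class (`data.target_names` is `['malignant', 'benign']`). The metrics reported here therefore describe how well benign cases are identified. If the malignant class is the one of interest, a sketch along these lines reports it explicitly; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

y_pred = LogisticRegression().fit(X_train, y_train).predict(X_test)

print("Label order:", list(data.target_names))  # index 0 = malignant, index 1 = benign

# Default pos_label=1 scores the benign class; pos_label=0 scores the malignant class.
recall_benign = recall_score(y_test, y_pred)
recall_malignant = recall_score(y_test, y_pred, pos_label=0)
precision_malignant = precision_score(y_test, y_pred, pos_label=0)

# Rows are true labels, columns are predicted labels, in the order [malignant, benign].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
print(f"Recall (benign): {recall_benign:.4f}")
print(f"Recall (malignant): {recall_malignant:.4f}, Precision (malignant): {precision_malignant:.4f}")
```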


2. Decision Tree Classifier
Description: A Decision Tree splits the data into subsets based on feature values, creating a tree-like structure to make predictions.

Suitability: It is easy to interpret and can handle non-linear relationships in the data.

In [25]:
from sklearn.tree import DecisionTreeClassifier

# Train the model
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

# Predict
y_pred = dt.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Decision Tree Classifier:")
print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")

Decision Tree Classifier:
Accuracy: 0.9415, Precision: 0.9712, Recall: 0.9352, F1 Score: 0.9528
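One way to probe whether the tree is overfitting is to compare training and test accuracy, and to contrast the default unrestricted tree with a depth-limited one. The sketch below does this; `max_depth=3` is an arbitrary choice for illustration, not a tuned value.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

trees = {
    "unrestricted": DecisionTreeClassifier(random_state=42),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),  # illustrative limit
}

summary = {}
for name, tree in trees.items():
    tree.fit(X_train, y_train)
    summary[name] = {
        "depth": tree.get_depth(),
        "leaves": tree.get_n_leaves(),
        "train_acc": tree.score(X_train, y_train),
        "test_acc": tree.score(X_test, y_test),
    }
    s = summary[name]
    print(f"{name}: depth={s['depth']}, leaves={s['leaves']}, "
          f"train={s['train_acc']:.4f}, test={s['test_acc']:.4f}")
```

A large gap between training and test accuracy for the unrestricted tree would point to overfitting.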


3. Random Forest Classifier
Description: Random Forest is an ensemble method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting.

Suitability: Averaging many trees makes it less prone to overfitting than a single decision tree, and it performs well on high-dimensional datasets.

In [26]:
from sklearn.ensemble import RandomForestClassifier

# Train the model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Predict
y_pred = rf.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Random Forest Classifier:")
print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")

Random Forest Classifier:
Accuracy: 0.9708, Precision: 0.9640, Recall: 0.9907, F1 Score: 0.9772
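To illustrate how the forest combines its trees, the sketch below averages the class probabilities of the individual trees by hand and checks the result against `rf.predict_proba`. It also lists the five most important features as a rough interpretability aid. Variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# The forest's probability is the mean of the per-tree probabilities.
per_tree = [tree.predict_proba(X_test) for tree in rf.estimators_]
manual_proba = np.mean(per_tree, axis=0)
print("Trees in forest:", len(rf.estimators_))
print("Manual average matches predict_proba:",
      np.allclose(manual_proba, rf.predict_proba(X_test)))

# Impurity-based importances, normalized to sum to 1.
importances = rf.feature_importances_
top = np.argsort(importances)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]}: {importances[i]:.4f}")
```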


4. Support Vector Machine (SVM)
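Description: SVM finds the hyperplane that separates the classes with the maximum margin. scikit-learn's SVC uses an RBF kernel by default, so the separating hyperplane lives in the kernel-induced feature space and the decision boundary in the original features can be non-linear.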

Suitability: It is effective in high-dimensional spaces and works well for binary classification tasks.

In [27]:
from sklearn.svm import SVC

# Train the model
svm = SVC(random_state=42)
svm.fit(X_train, y_train)

# Predict
y_pred = svm.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Support Vector Machine:")
print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")

Support Vector Machine:
Accuracy: 0.9766, Precision: 0.9815, Recall: 0.9815, F1 Score: 0.9815
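Because `SVC()` defaults to `kernel="rbf"`, it can be informative to compare it with a linear kernel and to look at how many training points end up as support vectors. The following is a minimal sketch of that comparison; the linear-kernel model is an addition for illustration and was not part of the experiment above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

rbf_svm = SVC(random_state=42).fit(X_train, y_train)                      # default kernel is "rbf"
linear_svm = SVC(kernel="linear", random_state=42).fit(X_train, y_train)  # illustrative alternative

for model in (rbf_svm, linear_svm):
    print(f"kernel={model.kernel}: "
          f"support vectors per class={model.n_support_.tolist()}, "
          f"test accuracy={model.score(X_test, y_test):.4f}")

# Only the linear kernel exposes explicit hyperplane weights (one per feature).
print("Linear hyperplane weights shape:", linear_svm.coef_.shape)
```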


5. k-Nearest Neighbors (k-NN)
Description: k-NN classifies a sample based on the majority class among its k-nearest neighbors in the feature space.

Suitability: It is simple and works well for small datasets with clear separation between classes.

In [28]:
from sklearn.neighbors import KNeighborsClassifier

# Train the model
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

# Predict
y_pred = knn.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("k-Nearest Neighbors:")
print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")

k-Nearest Neighbors:
Accuracy: 0.9591, Precision: 0.9633, Recall: 0.9722, F1 Score: 0.9677
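The majority-vote rule can be written out by hand in a few lines of NumPy. The sketch below assumes the scikit-learn defaults used above (k = 5, Euclidean distance, uniform weights) and checks how closely the manual predictions agree with `KNeighborsClassifier`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier().fit(X_train, y_train)
k = knn.n_neighbors  # scikit-learn default

# Euclidean distance from every test sample to every training sample: shape (n_test, n_train).
dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)

# Indices of the k closest training samples for each test sample.
nearest = np.argsort(dists, axis=1)[:, :k]

# Majority vote over the neighbors' labels (labels are 0/1 and k is odd, so no ties).
votes = y_train[nearest]
manual_pred = (votes.sum(axis=1) > k / 2).astype(int)

agreement = np.mean(manual_pred == knn.predict(X_test))
print(f"k = {k}, agreement with KNeighborsClassifier: {agreement:.4f}")
```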


3. Model Comparison:

Accuracy:
Logistic Regression has the highest accuracy (0.9825), followed by SVM (0.9766) and Random Forest (0.9708).

Decision Tree (0.9415) and k-NN (0.9591) have the lowest accuracy.

Precision:
Logistic Regression has the highest precision (0.9907), followed by SVM (0.9815) and Decision Tree (0.9712).

Random Forest (0.9640) and k-NN (0.9633) have slightly lower precision.

Recall:
Random Forest has the highest recall (0.9907), followed by Logistic Regression and SVM (both 0.9815) and k-NN (0.9722).

Decision Tree (0.9352) has the lowest recall.

F1 Score:
Logistic Regression has the highest F1 score (0.9860), followed by SVM (0.9815) and Random Forest (0.9772).

Decision Tree (0.9528) and k-NN (0.9677) have the lowest F1 scores.
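Rankings like these are easy to get wrong by eye, so a small sketch that sorts the reported numbers can help. The values below are copied from the outputs above; the `results` and `rankings` structures are just one convenient way to hold them.

```python
# Metrics copied from the cell outputs above.
results = {
    "Logistic Regression": {"accuracy": 0.9825, "precision": 0.9907, "recall": 0.9815, "f1": 0.9860},
    "Decision Tree":       {"accuracy": 0.9415, "precision": 0.9712, "recall": 0.9352, "f1": 0.9528},
    "Random Forest":       {"accuracy": 0.9708, "precision": 0.9640, "recall": 0.9907, "f1": 0.9772},
    "SVM":                 {"accuracy": 0.9766, "precision": 0.9815, "recall": 0.9815, "f1": 0.9815},
    "k-NN":                {"accuracy": 0.9591, "precision": 0.9633, "recall": 0.9722, "f1": 0.9677},
}

# For each metric, list the models from best to worst (ties keep dictionary order).
rankings = {
    metric: sorted(results, key=lambda name: results[name][metric], reverse=True)
    for metric in ("accuracy", "precision", "recall", "f1")
}

for metric, order in rankings.items():
    ranked = ", ".join(f"{name} ({results[name][metric]:.4f})" for name in order)
    print(f"{metric}: {ranked}")
```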


Conclusion:
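Logistic Regression is the best-performing algorithm on this test split, with the highest accuracy, precision, and F1 score. The test set holds only 171 samples, so its lead over SVM in accuracy (0.9825 vs. 0.9766) corresponds to a single prediction.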

SVM and Random Forest are strong alternatives, with SVM being slightly better in precision and F1 score, and Random Forest excelling in recall.

k-NN performs moderately but is not as strong as the top three algorithms.

Decision Tree is the worst-performing algorithm, likely because a single unpruned tree (scikit-learn's default places no limit on depth) overfits the training data.
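Since these conclusions rest on a single train/test split, cross-validation is one way to check how stable the ranking is. The sketch below is an optional extension, not part of the original experiment: it wraps each model in a pipeline with `StandardScaler` so that scaling is refit inside every fold, and reports 5-fold accuracy. The fold count and the shuffling seed are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(random_state=42),
    "k-NN": KNeighborsClassifier(),
}

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, model in models.items():
    # The scaler is fit on each fold's training part only, avoiding leakage.
    pipeline = make_pipeline(StandardScaler(), model)
    cv_scores[name] = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {cv_scores[name].mean():.4f} "
          f"(std {cv_scores[name].std():.4f})")
```

If the per-fold spread is comparable to the gaps between models, the ordering of the top performers should be read as tentative.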

