# Task Description

## Implementation

The task involves implementing and evaluating three classifiers for diagnosing breast cancer using a dataset of patients tested via fine needle aspiration (FNA). The dataset contains statistics of 10 different features of multiple cell samples, along with a diagnosis (malignant or benign).

1. **Rule-based Classifier**: Flags a sample as malignant when cell size, shape, texture, or homogeneity is abnormal.

2. **Random Forest Classifier**: A random forest trained on the supplied dataset features using scikit-learn (`sklearn`).

3. **Custom Classifier**: Designed to balance interpretability and classification performance.

### Rule-based Classifier
For the rule-based classifier, appropriate variables, and what counts as "abnormal" for each of them, must be defined from the provided medical insights and the available data. The rules themselves are an interpretation of those insights.
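
One possible reading of these rules, sketched in plain Python: take the largest value each feature reaches among benign training samples as its threshold, and flag a sample as malignant if any feature exceeds it. The function names, feature names, and numbers below are made up for illustration and are not taken from the dataset.

```python
def fit_thresholds(samples, labels):
    """Return, per feature, the largest value seen among benign (label 0) samples."""
    thresholds = {}
    for sample, label in zip(samples, labels):
        if label != 0:
            continue  # only benign samples define the "normal" range
        for feature, value in sample.items():
            thresholds[feature] = max(value, thresholds.get(feature, value))
    return thresholds


def classify(sample, thresholds):
    """Return 1 (malignant) if any feature exceeds its benign maximum, else 0."""
    return int(any(sample[f] > t for f, t in thresholds.items()))


# Toy data with hypothetical feature names and values
train = [
    {"radius": 12.0, "texture": 15.0},
    {"radius": 13.5, "texture": 18.0},
    {"radius": 20.0, "texture": 25.0},
]
train_labels = [0, 0, 1]

thresholds = fit_thresholds(train, train_labels)
print(thresholds)  # {'radius': 13.5, 'texture': 18.0}

# Samples that were not used to derive the thresholds
unseen = [{"radius": 12.5, "texture": 17.0}, {"radius": 14.0, "texture": 16.0}]
predictions = [classify(s, thresholds) for s in unseen]
print(predictions)  # [0, 1]
```

Deriving the thresholds from training samples only and classifying unseen ones keeps the resulting accuracy comparable to a hold-out score.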

### Random Forest Classifier
A random forest classifier is trained on the dataset features using scikit-learn (`sklearn`).

### Custom Classifier
A custom classifier is designed to balance interpretability and classification performance. It may build on existing models, but its focus is this trade-off.
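
The task leaves the design of the custom classifier open. As one hypothetical direction, and not the design used in this notebook, a voting rule could count how many features look abnormal and require a minimum number of them before flagging a sample. Every prediction can then be explained by listing the features that triggered. All names, thresholds, and values below are illustrative assumptions.

```python
def explain(sample, thresholds):
    """List the features whose value exceeds its threshold."""
    return [f for f, t in thresholds.items() if sample[f] > t]


def classify_by_votes(sample, thresholds, min_votes):
    """Return 1 (malignant) when at least `min_votes` features are abnormal."""
    return int(len(explain(sample, thresholds)) >= min_votes)


# Hypothetical thresholds and sample, not derived from the dataset
thresholds = {"radius": 14.0, "texture": 20.0, "concavity": 0.10}
sample = {"radius": 15.2, "texture": 18.5, "concavity": 0.15}

triggered = explain(sample, thresholds)
print(triggered)  # ['radius', 'concavity']
print(classify_by_votes(sample, thresholds, min_votes=2))  # 1
print(classify_by_votes(sample, thresholds, min_votes=3))  # 0
```

Raising `min_votes` makes the rule stricter about calling a sample malignant, which gives a single, readable knob for trading false alarms against missed cases.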

## Evaluation
The classification performance of the three classifiers will be compared and the interpretability of each will be discussed. Notable interactions between features will also be explored.
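
Accuracy alone can hide which kind of error a classifier makes, and a missed malignant case is a different mistake from a false alarm. As a sketch of how the comparison could be broken down further, the helper below derives confusion counts, sensitivity, and specificity from true and predicted labels. The function name and the toy labels are illustrative assumptions, not part of the notebook.

```python
def diagnostic_metrics(y_true, y_pred):
    """Confusion counts and derived rates for binary labels (1 = malignant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malignant, caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malignant, missed
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        # Share of malignant cases that were detected
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Share of benign cases that were correctly cleared
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Toy labels: three malignant and five benign samples
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

metrics = diagnostic_metrics(y_true, y_pred)
print(metrics["accuracy"])     # 0.75
print(metrics["specificity"])  # 1.0
print(metrics["fn"])           # 2
```

In this toy case the accuracy is 75% even though two of the three malignant samples are missed, which is the kind of pattern that sensitivity makes visible.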

---



In [140]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [144]:
class BreastCancerClassifier:
    """Wraps the rule-based and random forest classifiers behind one interface."""

    def __init__(self):
        # Per-feature thresholds of the rule-based classifier (set in fit)
        self.rule_based_thresholds = None
        self.random_forest_classifier = RandomForestClassifier()
        # Placeholder: the custom classifier is not implemented yet
        self.custom_classifier = None

    def fit(self, df, X_train, y_train):
        self.fit_rule_based_classifier(df)
        self.random_forest_classifier.fit(X_train, y_train)

    def predict(self, X_test):
        # Only the random forest is used for predictions on held-out data
        return self.random_forest_classifier.predict(X_test)

    def score(self, df, X_test, y_test):
        self.score_rule_based_classifier(df)
        y_pred = self.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        print("Accuracy Random forest:", accuracy)

    def fit_rule_based_classifier(self, df):
        # Use the benign samples to define the "normal" range of each feature
        features = df[df.malignant == 0].drop(
            columns=["id", "malignant", "predicted_diagnosis"], errors="ignore"
        )

        # Threshold per feature: the largest value observed in any benign sample
        thresholds = features.max().to_dict()
        self.rule_based_thresholds = thresholds

        def classify(row):
            # Malignant if any feature exceeds its benign maximum
            for feature, threshold in thresholds.items():
                if row[feature] > threshold:
                    return 1  # malignant
            return 0  # benign

        # Note: this adds a column to df in place
        df["predicted_diagnosis"] = df.apply(classify, axis=1)

    def score_rule_based_classifier(self, df):
        # Scored on the same rows the thresholds were derived from, so this
        # figure is not directly comparable to the random forest's hold-out accuracy
        correct_predictions = (df["malignant"] == df["predicted_diagnosis"]).sum()
        accuracy_percentage = correct_predictions / len(df) * 100

        print(f"Accuracy rule based: {accuracy_percentage:.2f}%")

        

In [145]:
df = pd.read_pickle('dataset/wdbc.pkl')
X = df.drop(columns=["id", "malignant"])
y = df.malignant
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifier = BreastCancerClassifier()
classifier.fit(df, X_train, y_train)
classifier.score(df, X_test, y_test)






Accuracy rule based: 93.32%
Accuracy Random forest: 0.9649122807017544


In [143]:
# Overlaid histograms of every feature, split by diagnosis.
# The three definitions below are assumed; the original cell left them undefined.
features = df.drop(columns=["id", "malignant", "predicted_diagnosis"], errors="ignore")
benign = features[df.malignant == 0]
malignant = features[df.malignant == 1]

fig, axs = plt.subplots(10, 3, figsize=(12, 12))
axs = axs.flatten()

# Limit the number of features to the length of axs array
num_features = min(len(features.columns), len(axs))

for i, feature in enumerate(features.columns[:num_features]):
    axs[i].hist(benign[feature], bins=20, alpha=0.5, color='blue', label='Benign')
    axs[i].hist(malignant[feature], bins=20, alpha=0.5, color='red', label='Malignant')
    axs[i].set_title(feature)
    axs[i].legend()

plt.tight_layout()
plt.show()