#### Project Description

This assignment involves extracting information from breast cancer data to support informed decision-making.

#### Loading data & Exploratory analysis

As a first step, the breast cancer data (stored in a CSV file) are loaded and the first few rows are displayed to show what the data look like. This helps to identify the columns, data types, and initial observations.

In [None]:
import pandas as pd
import yaml

# Import the data
configPath = 'config.yaml'

# Read the yaml data from the file
with open(configPath, 'r') as file:
    configData = yaml.safe_load(file)

df = pd.read_csv(configData["breast_cancer_path"])

# Displaying the first few rows
print(df.head())

In [None]:
# Display summary statistics
print(df.describe())

# Visualize the distribution of features
import matplotlib.pyplot as plt

df.drop(['id'], axis=1).hist(bins=30, figsize=(15, 10))
plt.tight_layout()
plt.show()

# Check class balance
class_counts = df['diagnosis'].value_counts()
print(class_counts)

As the results show, the diagnosis column is a categorical field with 357 observations of 'B' and 212 observations of 'M'. The histograms show right tails, so many features are skewed rather than normally distributed, and the standard deviations (from around 0.02 to nearly 350) show that the numerical features are on very different scales. Furthermore, judging from the displayed rows, the id field does not appear to carry meaningful information for this assignment.
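
To put the class counts in perspective, the following minimal sketch turns the counts reported above into proportions and derives the accuracy of a trivial classifier that always predicts the majority class. The variable names are illustrative and are not part of the original notebook.

```python
# Class counts reported above for the diagnosis column
class_counts = {"B": 357, "M": 212}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

for label, share in proportions.items():
    print(f"{label}: {share:.1%}")

# A classifier that always predicts the majority class reaches this accuracy,
# so any useful model should score clearly above it
baseline_accuracy = max(proportions.values())
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

The baseline gives a reference point for judging the accuracy scores reported later.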

#### Preprocess data

In the following block, the data are prepared for machine learning algorithms: the categorical target variable (diagnosis) is encoded as numbers to make it suitable for modeling, and the dataset is split into features (X) and the target variable (y) for further processing.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode the diagnosis column (Malignant = 1, Benign = 0)
label_encoder = LabelEncoder()
df['diagnosis'] = label_encoder.fit_transform(df['diagnosis'])

# Split the data into features (X) and target variable (y)
X = df.drop(['id', 'diagnosis'], axis=1)
y = df['diagnosis']

As the results above show, the data are not normally distributed. To improve model performance, features with high skewness (absolute value above 0.7) are identified and a power transformation is applied to them. Finally, the features are scaled with StandardScaler so that they have similar scales and large differences in variance do not distort the analysis.

In [None]:
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import StandardScaler

# Calculating skewness of the features
skewness = X.skew()

# Selecting features with skewness above a threshold (0.7)
skewed_features = skewness[abs(skewness) > 0.7].index

# Applying power transformation (PowerTransformer defaults to the Yeo-Johnson method)
power_transformer = PowerTransformer()
X_skewed = X[skewed_features].copy()
X_skewed_transformed = power_transformer.fit_transform(X_skewed)

# Replacing the original skewed features with the transformed features
X[skewed_features] = X_skewed_transformed

# Normalizing the features using standard scaling

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
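
For reference, the skewness that the 0.7 threshold is applied to can be computed by hand. The sketch below implements the bias-adjusted sample skewness, which should correspond to what `X.skew()` reports (pandas documents it as the unbiased skew, normalized by N-1). The helper name `sample_skewness` and the two toy lists are hypothetical and only serve as an illustration.

```python
import math

def sample_skewness(values):
    """Bias-adjusted Fisher-Pearson skewness of a list of numbers (needs n >= 3)."""
    n = len(values)
    mean = sum(values) / n
    # Second and third central moments (population versions)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    g1 = m3 / m2 ** 1.5
    # Small-sample adjustment factor
    return math.sqrt(n * (n - 1)) / (n - 2) * g1

# Toy data: one symmetric sample and one with a long right tail
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]

for name, data in [("symmetric", symmetric), ("right_tailed", right_tailed)]:
    skew = sample_skewness(data)
    flagged = abs(skew) > 0.7  # same threshold as in the notebook
    print(f"{name}: skewness={skew:.3f}, transform={flagged}")
```

A symmetric sample has a skewness of zero, while a long right tail pushes the value above the threshold, which is the case the power transformation is meant to handle.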

#### Modeling

In this step, machine learning algorithms are selected for classification. The following code experiments with a Decision Tree Classifier and a Gaussian Naive Bayes classifier on the breast cancer data. Using 5-fold cross-validation with cross_val_score, the performance of each classifier is estimated as the mean accuracy across the folds.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Initialize the classifiers with different parameters
classifiers = [
    DecisionTreeClassifier(max_depth=1),
    GaussianNB()
]

# Train and evaluate the classifiers
for classifier in classifiers:
    scores = cross_val_score(classifier, X_scaled, y, cv=5)
    print(f'{classifier.__class__.__name__}: {scores.mean()}')

The accuracy score is the ratio of correctly predicted instances to the total number of instances for the diagnosis target. During cross-validation, the DecisionTreeClassifier achieved an average accuracy of about 0.8981 (89.81%), while the Gaussian Naive Bayes algorithm achieved an average accuracy of around 0.9403 (94.03%).

The higher accuracy of Gaussian Naive Bayes suggests that it might be better suited to the patterns present in the breast cancer dataset.
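
The scores above were obtained on features that were transformed and scaled on the full dataset before cross-validation. As a possible variation, the sketch below fits the preprocessing inside each fold by wrapping it in a pipeline, and also reports the spread of the per-fold scores. It rests on several assumptions: scikit-learn's bundled `load_breast_cancer` dataset stands in for the CSV file, the power transformation is applied to all features rather than only the skewed ones, and the names `X_demo`, `y_demo`, `pipeline`, and `fold_scores` are illustrative. Its numbers are therefore not expected to match the results reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

# Stand-in data (assumption): the copy of the dataset bundled with scikit-learn
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# PowerTransformer standardizes its output by default, so no separate scaler is
# added here; the pipeline is refitted on the training part of every fold
pipeline = make_pipeline(PowerTransformer(), GaussianNB())

fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(f"Per-fold accuracy: {fold_scores.round(4)}")
print(f"Mean: {fold_scores.mean():.4f}, std: {fold_scores.std():.4f}")
```

Looking at the standard deviation alongside the mean helps to judge whether a difference between two classifiers is larger than the fold-to-fold variation.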

#### Evaluating

Accuracy alone might not provide a complete picture of model performance, especially in imbalanced datasets. To gain deeper insights, additional metrics (precision, recall, and F1-score) are computed from the out-of-fold predictions returned by cross_val_predict. Because malignant was encoded as 1, these metrics treat malignant as the positive class. They give a better understanding of how well the models classify malignant 'M' and benign 'B' cases in this breast cancer data.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_predict

# Perform cross-validation predictions
for classifier in classifiers:
    y_pred = cross_val_predict(classifier, X_scaled, y, cv=5)
    print(f'{classifier.__class__.__name__} metrics:')
    print(f'Accuracy: {accuracy_score(y, y_pred)}')
    print(f'Precision: {precision_score(y, y_pred)}')
    print(f'Recall: {recall_score(y, y_pred)}')
    print(f'F1-Score: {f1_score(y, y_pred)}')


Comparing the metrics between the two models:

The Gaussian Naive Bayes model generally performs better, with higher accuracy, recall, and F1-score than the decision tree model.

The decision tree model has a slightly higher precision (indicating fewer false positives among its malignant predictions) but lower recall (indicating that it misses more of the truly malignant cases) than the Gaussian model.

The F1-scores of the two models are quite close, which shows that both models strike a similar balance between precision and recall.

In medical applications like breast cancer diagnosis, a higher recall might be more important than precision, as missing a true positive (a malignant 'M' case) can have serious consequences.
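
The relationship between these metrics can be made concrete with a small worked example. The sketch below computes them from confusion-matrix counts, with malignant as the positive class. The counts are hypothetical and are not results from this notebook.

```python
# Hypothetical confusion-matrix counts (malignant is the positive class);
# these numbers are made up and are NOT results from the notebook
tp, fp, fn, tn = 90, 10, 20, 80

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # share of predicted malignant cases that are malignant
recall = tp / (tp + fn)     # share of malignant cases that the model finds
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
print(f"Missed malignant cases (false negatives): {fn}")
```

In this made-up example, precision is higher than recall: few benign cases are flagged by mistake, yet 20 malignant cases go undetected, which is the kind of error that matters most in a diagnostic setting.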

#### Explanation

Finally, a Decision Tree Classifier (with a maximum depth of 3) is fitted. Visualizing the decision tree helps to understand how the model uses the different features to predict the diagnosis.

In [None]:
from sklearn.tree import plot_tree

# Create and fit a decision tree with max_depth=3
decision_tree = DecisionTreeClassifier(max_depth=3)
decision_tree.fit(X_scaled, y)

# Visualize the decision tree
plt.figure(figsize=(12, 8))
# Names are passed as plain lists, which some scikit-learn versions require
plot_tree(decision_tree, feature_names=list(X.columns), class_names=list(label_encoder.classes_), filled=True)
plt.show()


The decision tree visualization shows the structure of the decision tree model trained on the breast cancer dataset. Each internal node represents a decision based on a specific feature, and the branches leading to other nodes represent the possible outcomes of that decision.

Nodes near the root play a significant role in the decisions, because they split the largest share of the instances.

In a decision tree, each internal node evaluates a specific feature and compares it against a threshold. If the feature value satisfies the condition, the algorithm follows the left branch; otherwise, it follows the right branch. Because the tree was fitted on X_scaled, the thresholds shown are in transformed and standardized units rather than the original measurement units. The process continues until a leaf node is reached, where the final prediction (Malignant 'M' or Benign 'B') is made.

Furthermore, the visualization uses colors to represent different classes (malignant 'M' class and benign 'B' class), to make it easier to see how instances are classified at different points in the tree.
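
The traversal rule described above can be sketched in a few lines of plain Python. The tree below is a hypothetical two-level example: its feature names and thresholds are made up for illustration and are not taken from the fitted model.

```python
# Hypothetical tree; feature names and thresholds are made up for illustration
toy_tree = {
    "feature": "radius_worst", "threshold": 0.1,
    "left": {
        "feature": "concave_points_mean", "threshold": 0.3,
        "left": {"label": "B"},
        "right": {"label": "M"},
    },
    "right": {"label": "M"},
}

def predict(node, sample):
    """Walk from the root to a leaf and return the leaf's class label."""
    while "label" not in node:
        # Condition satisfied (value <= threshold): go left, otherwise go right
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

# Two hypothetical samples in standardized units
sample_a = {"radius_worst": -0.5, "concave_points_mean": 0.0}
sample_b = {"radius_worst": 1.2, "concave_points_mean": 0.0}

print(predict(toy_tree, sample_a))
print(predict(toy_tree, sample_b))
```

The first sample satisfies both conditions and ends in a benign leaf, while the second fails the root condition and is sent directly to a malignant leaf.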