## Setup and Library Imports

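This cell imports the libraries used throughout the notebook: `pandas` for data manipulation, `matplotlib.pyplot` for visualization, and `sklearn` (scikit-learn) for machine learning. Despite the heading, it also does the modeling work up front: it loads `./data.csv`, encodes the `diagnosis` label as integers, splits the data 80/20 into training and test sets, fits a logistic regression on `radius_mean` and `texture_mean`, computes evaluation metrics on the test set, and defines the `predict_tumor_probability` helper used later.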
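
The cell below depends on how `LabelEncoder` assigns integer codes, because the helper function later reads `predict_proba` columns by position. A minimal sketch on toy labels (the list is hypothetical and stands in for the `diagnosis` column) shows that the codes follow the sorted label order:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the `diagnosis` column (hypothetical values)
labels = ['M', 'B', 'B', 'M', 'B']

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ lists the original labels in the order of their integer codes
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'B': 0, 'M': 1}
print(codes.tolist())  # [1, 0, 0, 1, 0]
```

Under this ordering, benign is class 0 and malignant is class 1, so column 1 of `predict_proba` is the malignant probability.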


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

file_path = './data.csv'
data = pd.read_csv(file_path)

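# Encode the diagnosis labels as integers; LabelEncoder sorts the classes, so 'B' -> 0 and 'M' -> 1
label_encoder = LabelEncoder()
data['diagnosis'] = label_encoder.fit_transform(data['diagnosis'])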

X = data[['radius_mean', 'texture_mean']]
y = data['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

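def predict_tumor_probability(radius_mean, texture_mean):
    """Return a formatted string with the benign and malignant probabilities for one tumor."""
    # Build a DataFrame with the training column names so the features line up with the model
    values = pd.DataFrame([[radius_mean, texture_mean]], columns=['radius_mean', 'texture_mean'])
    # Columns follow model.classes_: index 0 is benign (0), index 1 is malignant (1)
    probabilities = model.predict_proba(values)

    return f'Probability that the tumor is BENIGN: {probabilities[0][0]*100:.3f}%\nProbability that the tumor is MALIGNANT: {probabilities[0][1]*100:.3f}%'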


## Data Loading and Initial Exploration

Here we reload the dataset with `pandas` and plot histograms of `radius_mean` and `texture_mean`, split by diagnosis. Comparing the malignant and benign distributions side by side shows how much each feature differs between the two classes and helps spot anomalies, which informs the choice of features and preprocessing.
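
Before plotting, a numeric summary per class can complement the histograms. The sketch below uses a small hand-made DataFrame (the values are hypothetical, not taken from `data.csv`) with the same three column names to show class counts and per-class feature means:

```python
import pandas as pd

# Hypothetical rows that mimic the notebook's columns
toy = pd.DataFrame({
    'diagnosis': ['M', 'B', 'B', 'M', 'B'],
    'radius_mean': [18.0, 12.0, 11.0, 20.0, 13.0],
    'texture_mean': [22.0, 16.0, 18.0, 24.0, 17.0],
})

# Number of samples in each class
counts = toy['diagnosis'].value_counts()

# Mean of each feature within each class
means = toy.groupby('diagnosis')[['radius_mean', 'texture_mean']].mean()

print(counts)
print(means)
```

The same two calls should work on the real `data` frame as long as `diagnosis` still holds the raw `'M'`/`'B'` labels.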


In [None]:
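# Reload the raw data so `diagnosis` holds the original 'M'/'B' strings
# (the setup cell replaced them with 0/1)
data = pd.read_csv(file_path)
plt.figure(figsize=(12, 6))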

plt.subplot(1, 2, 1)  
plt.hist(data[data['diagnosis'] == 'M']['radius_mean'], bins=20, alpha=0.5, label='Malignant')
plt.hist(data[data['diagnosis'] == 'B']['radius_mean'], bins=20, alpha=0.5, label='Benign')
plt.legend()
plt.title('Distribution of Mean Radius by Type of Diagnosis')
plt.xlabel('Mean Radius')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(data[data['diagnosis'] == 'M']['texture_mean'], bins=20, alpha=0.5, label='Malignant')
plt.hist(data[data['diagnosis'] == 'B']['texture_mean'], bins=20, alpha=0.5, label='Benign')
plt.legend()
plt.title('Distribution of Mean Texture by Type of Diagnosis')
plt.xlabel('Mean Texture')
plt.ylabel('Frequency')

plt.tight_layout()

plt.show()


## Model Application and Prediction

With the data explored, we use the logistic regression model fitted in the setup cell. The helper `predict_tumor_probability` takes a `radius_mean` and a `texture_mean` and returns the estimated probabilities that the tumor is benign or malignant. The cell also reports the model's accuracy, confusion matrix, and ROC AUC on the held-out test set.
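
For intuition, a binary logistic regression turns a weighted sum of the features into a probability with the sigmoid function. The sketch below uses made-up coefficients (`w_radius`, `w_texture`, and `intercept` are hypothetical, not the values fitted in this notebook) and the illustrative function name `malignant_probability`:

```python
import math

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and intercept, chosen only for illustration
w_radius, w_texture, intercept = 1.0, 0.2, -19.0

def malignant_probability(radius_mean, texture_mean):
    # Linear score: intercept plus the weighted features
    z = intercept + w_radius * radius_mean + w_texture * texture_mean
    return sigmoid(z)

p = malignant_probability(15.0, 20.0)
print(f"P(malignant) = {p:.3f}, P(benign) = {1 - p:.3f}")
# P(malignant) = 0.500, P(benign) = 0.500
```

With these made-up weights the point (15.0, 20.0) sits exactly on the decision boundary, and larger radii push the probability toward malignant.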

In [None]:
print(predict_tumor_probability(2.0, 8.0))
print('-------------------------------------')

# Displaying the model's overall accuracy in percentage
print(f"The model's accuracy is: {accuracy*100:.2f}%")
print('-------------------------------------')

# Presenting the model's confusion matrix: rows are the true classes and columns are the predicted classes,
# so it contains the counts of true negatives, false positives, false negatives, and true positives.
print("Model's Confusion Matrix:")
print(conf_matrix)  
print("This matrix helps in understanding the model's performance in classifying cancer correctly.")
print('-------------------------------------')

# Showing the model's AUC ROC score, which measures the ability of the model to distinguish between classes.
print(f"Model's AUC ROC: {roc_auc:.2f}")
print("A higher AUC ROC score indicates better model performance in distinguishing between patients with and without cancer.")
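
The printed confusion matrix can be hard to read at a glance. As a minimal sketch on toy labels (hypothetical values, with 0 = benign and 1 = malignant as in the encoding above), the four cells can be unpacked and turned into sensitivity and specificity:

```python
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (hypothetical): 0 = benign, 1 = malignant
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary labels the flattened matrix is ordered tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)  # share of malignant cases that were detected
specificity = tn / (tn + fp)  # share of benign cases that were cleared

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}")
```

The same unpacking should apply to `conf_matrix` from the setup cell, since it is also a 2x2 matrix over the classes 0 and 1.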


## Conclusion

This notebook explored the distributions of mean radius and mean texture by diagnosis, trained a logistic regression classifier on those two features, and evaluated it on a held-out test set with accuracy, a confusion matrix, and ROC AUC. The resulting `predict_tumor_probability` helper estimates the probability that a tumor is benign or malignant from its mean radius and mean texture.
