<a href="https://colab.research.google.com/github/AnahitSh/proj-ds/blob/main/Breast_Cancer_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Breast Cancer Detection

According to the American Cancer Society (n.d.), breast cancer is the second-most common type of cancer diagnosed in American women, behind only skin cancers. The average risk of an American woman developing breast cancer sometime in her life is about 13%. This means there is a 1 in 8 chance she will eventually develop breast cancer.

Mammograms are used to detect breast cancer—hopefully at an early stage. However, many masses that appear on a mammogram are not actually cancerous. Developing a machine learning model to predict whether a tumor is benign or cancerous would be helpful for physicians as they guide and treat patients.

In this module, we will use decision tree–based methods to classify the tumors as benign or malignant. You'll learn whether this model does a better or worse job of classifying the tumors than the previous models you've tried.

In [None]:
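# Upload the cancer.csv data set to the Colab session

from google.colab import files
# files.upload() returns a dict of the uploaded files; the DataFrame is created later with pd.read_csv
uploaded = files.upload()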

In [None]:
# Importing necessary packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import graphviz

In [None]:
# Create a pandas DataFrame named cancer from the CSV file and print the first five observations

cancer = pd.read_csv('/content/cancer.csv')
cancer.head(5)


In [None]:
# Convert the diagnosis variable into a numeric indicator (1 = malignant, 0 = benign)

cancer.loc[cancer['diagnosis'] == 'M', 'cancer_present'] = 1
cancer.loc[cancer['diagnosis'] == 'B', 'cancer_present'] = 0

Splitting the data into the target variable and the features.

The goal is to predict whether a tumor is benign or malignant (cancer_present) using the tumor measurements in the data set, that is, every column except id, diagnosis, and cancer_present.

In [None]:
X = cancer.drop(["id", "diagnosis", "cancer_present"], axis = 1)
y = cancer['cancer_present']

In [None]:
# Split the data into a training set (75%) and a test set (25%)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
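
When the two classes are not equally common, a purely random split can leave the training and test sets with different shares of malignant cases. The sketch below shows the optional `stratify` argument of `train_test_split` on hypothetical toy labels (`X_toy` and `y_toy` are made up for illustration and are not taken from cancer.csv).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 30 malignant (1) and 70 benign (0)
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder feature column

# stratify=y_toy keeps the class proportions similar in both parts of the split
X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print("Malignant share in training part:", y_a.mean())
print("Malignant share in test part:", y_b.mean())
```

Whether stratification matters for this data set depends on how balanced the diagnosis column is, which can be checked with `cancer['diagnosis'].value_counts()`.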

We will build a Pipeline that imputes missing values, standardizes the data, and fits a Decision Tree Classifier that splits based on entropy.

In [None]:
tree1 = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('tree', DecisionTreeClassifier(criterion='entropy', random_state=42)),
])

tree1.fit(X_train, y_train)
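
With `criterion='entropy'`, the tree scores each candidate split by how much it reduces the entropy of the class labels (the information gain). The sketch below works through that calculation by hand on a hypothetical toy node; the helper names `entropy` and `information_gain` and the label lists are illustrative assumptions, not part of scikit-learn or of the data set.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * math.log2(p)
    return result

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node holding 4 malignant (1) and 4 benign (0) tumors
parent = [1, 1, 1, 1, 0, 0, 0, 0]
# Hypothetical split: the left child is mostly benign, the right child is purely malignant
left = [0, 0, 0, 0, 1]
right = [1, 1, 1]

print(f"Parent entropy: {entropy(parent):.3f}")
print(f"Information gain of the split: {information_gain(parent, left, right):.3f}")
```

A perfectly mixed node has an entropy of 1 bit and a pure node has an entropy of 0, so a split that produces purer children yields a larger gain.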

In [None]:
# Evaluate the Pipeline using 10-fold cross-validation

scores = cross_val_score(tree1, X_train, y_train, cv=10)

print("The mean of the cross-validation scores is", scores.mean())
print("The standard deviation of the cross-validation scores is", scores.std())

The cross-validation results suggest that the model performs fairly well at predicting tumor malignancy, with a mean accuracy of around 89.4%. The results are also relatively consistent across folds, as indicated by the low standard deviation.
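
Cross-validation only uses the training data, so a natural follow-up is to score the fitted Pipeline on the held-out test set. The sketch below is one way to do that. It assumes scikit-learn's bundled breast cancer data set as a stand-in for cancer.csv so that it runs on its own, and the names `X_demo`, `y_demo`, and `demo_tree` are hypothetical; its numbers will not necessarily match the ones reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled data set stands in for cancer.csv
data = load_breast_cancer(as_frame=True)
X_demo = data.data
# The bundled target codes malignant as 0, so flip it to match cancer_present (1 = malignant)
y_demo = 1 - data.target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

demo_tree = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('tree', DecisionTreeClassifier(criterion='entropy', random_state=42)),
])
demo_tree.fit(X_tr, y_tr)

# Score the model on data it has never seen
y_pred = demo_tree.predict(X_te)
test_accuracy = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)  # rows: true class, columns: predicted class

print("Test accuracy:", test_accuracy)
print("Confusion matrix (0 = benign, 1 = malignant):")
print(cm)
```

In the notebook itself, the same two metric calls can be applied to `tree1.predict(X_test)` and `y_test`. The confusion matrix is worth reading alongside the accuracy, because a malignant tumor predicted as benign is usually the more costly error.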

Visualize the Decision Tree.

In [None]:
!ls

In [None]:
clf = tree1.named_steps['tree']

In [None]:
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 12))
# class_names follow the sorted class labels: 0 = benign, 1 = malignant
plot_tree(clf, feature_names=list(X_train.columns), class_names=['benign', 'malignant'], filled=True)
plt.show()

The plot shows the fitted decision tree as a diagram in which each internal node corresponds to a decision rule based on a feature and each leaf node represents the predicted class. The color of each node indicates the majority class in that node.
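
A large tree diagram can be hard to read, so the same rules can also be printed as indented text with `export_text`. The sketch below again assumes scikit-learn's bundled breast cancer data set as a stand-in for cancer.csv, and `demo_pipe` and `rules` are hypothetical names used only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the bundled data set stands in for cancer.csv
data = load_breast_cancer()

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('tree', DecisionTreeClassifier(criterion='entropy', random_state=42)),
])
demo_pipe.fit(data.data, data.target)

# Pull the fitted tree out of the Pipeline, as done with tree1.named_steps['tree'] above
fitted_tree = demo_pipe.named_steps['tree']

# Print only the top levels of the tree to keep the output short
rules = export_text(fitted_tree, feature_names=list(data.feature_names), max_depth=2)
print(rules)
```

Because the tree sits after `StandardScaler` in the Pipeline, the thresholds in both the printed rules and the plotted diagram are expressed in standardized units rather than in the original measurement units.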