<br />

<h1 align="center">Predict Breast Cancer</h1>
<h5 align="center">by</h5>
<h3 align="center">Monde Anna</h3>

<br />
<br />


<p>We will predict the class of breast cancer from the features of images taken from breast samples. The classes to be predicted are:
    <ul>
        <li><b><i>Malignant:</i></b> The cancerous form, which we wish to avoid; or</li>
        <li><b><i>Benign:</i></b> Non-cancerous, and not as dangerous as the former</li>
    </ul>
</p>



<p>Each record holds a <b><i>Sample ID</i></b> and <b><i>nine</i></b> biological attributes of the cell nuclei, calculated from the images:</p>

<br />
<br />


<table width="95%">
    <tr align="center">
        <th>Attribute</th>
        <th>Domain</th>
        <th>Attribute</th>
        <th>Domain</th>
    </tr>
    <tr align="center">
        <td>Sample Code Number</td>
        <td>ID Number</td>
        <td>Clump Thickness</td>
        <td>1 - 10</td>
    </tr>
    <tr align="center">
        <td>Uniformity of Cell Size</td>
        <td>1 - 10</td>
        <td>Uniformity of Cell Shape</td>
        <td>1 - 10</td>
    </tr>
    <tr align="center">
        <td>Marginal Adhesion</td>
        <td>1 - 10</td>
        <td>Single Epithelial Cell Size</td>
        <td>1 - 10</td>
    </tr>
    <tr align="center">
        <td>Bare Nuclei</td>
        <td>1 - 10</td>
        <td>Bland Chromatin</td>
        <td>1 - 10</td>
    </tr>
    <tr align="center">
        <td>Normal Nucleoli</td>
        <td>1 - 10</td>
        <td>Mitoses</td>
        <td>1 - 10</td>
    </tr>
</table>

<br />
<br />


<p>The target used for prediction is named <b><i>Class</i></b> in the original schema; we will refer to it as <b><i>target</i></b>, where:
    <ul>
        <li><b><i>Benign</i></b> is signified by 2</li>
        <li><b><i>Malignant</i></b> is signified by 4</li>
    </ul>
</p>
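
<p>As a minimal sketch of how these codes might be turned into readable labels or a 0/1 indicator: the names <code>TARGET_LABELS</code> and <code>example</code> are illustrative assumptions and do not appear elsewhere in this notebook.</p>

```python
import pandas as pd

# Hypothetical lookup built from the encoding above: 2 = benign, 4 = malignant.
TARGET_LABELS = {2: "benign", 4: "malignant"}

# Illustrative values only, not taken from the data set.
example = pd.Series([2, 4, 4, 2], name="target")

# Readable labels for reporting.
labels = example.map(TARGET_LABELS)

# Binary indicator: 1 = malignant, 0 = benign.
is_malignant = (example == 4).astype(int)
```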

<br />
<br />


<h2 align="center">Source</h2>

<br />

<ul>
    <li><a href="http://syllabus.africacode.net/projects/data-science-specific/logistic-regression/breast-cancer/">Brief</a></li>
    <br />
    <li><a href="http://syllabus.africacode.net/projects/data-science-specific/logistic-regression/breast-cancer/cancer.data">Data</a></li>
</ul>

<br />
<br />


<h2 align="center">Imports</h2>

<br />
<br />


In [1]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
import seaborn as sns
import pandas as pd
import numpy as np


<br />

<h2 align="center">Global Settings</h2>

<br />
<br />


In [2]:
BOLD = "bold"
SMALL = 8
MEDIUM = 16
LARGE = 32

sns.set(rc={
    "axes.labelpad": MEDIUM,
    "axes.labelsize": MEDIUM,
    "axes.labelweight": BOLD,
    "axes.titlepad": MEDIUM,
    "axes.titlesize": LARGE,
    "axes.titleweight": BOLD,
    "figure.figsize": (MEDIUM, SMALL),
    "figure.titlesize": LARGE,
    "figure.titleweight": BOLD,
})


<br />

<h2 align="center">Data Prep</h2>

<br />
<br />


In [4]:
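# Load the raw file. Note: it has no header row, so read_csv consumes the
# first record as column names (visible in the output below).
data = pd.read_csv("../data/cancer.data")

# Preview the first and last three rows.
pd.concat([data.head(3), data.tail(3)], axis="rows")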


Unnamed: 0,1000025,5,1,1.1,1.2,2,1.3,3,1.4,1.5,2.1
0,1002945,5,4,4,5,7,10,3,2,1,2
1,1015425,3,1,1,1,2,2,3,1,1,2
2,1016277,6,8,8,1,3,4,3,7,1,2
695,888820,5,10,10,3,7,3,8,10,2,4
696,897471,4,8,6,4,3,4,10,6,1,4
697,897471,4,8,8,5,4,5,10,4,1,4


<br />
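
<p>The column names in the output above are really the file's first record. A small sketch of the difference <code>header=None</code> would make, using an in-memory copy of the first three records as implied by that output (the variable names are illustrative, and this is not the loading code used in the rest of the notebook):</p>

```python
import io

import pandas as pd

# First three records, reconstructed from the output above; the column
# names shown there are the first record with pandas' duplicate suffixes.
raw = (
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1015425,3,1,1,1,2,2,3,1,1,2\n"
)

# Default behaviour: the first record is consumed as the header.
default_read = pd.read_csv(io.StringIO(raw))

# With header=None every record is kept as data and columns are numbered.
headerless_read = pd.read_csv(io.StringIO(raw), header=None)
```

<p>Reading with <code>header=None</code> would keep one more record, at the cost of shifting every row index shown in this notebook by one.</p>

<br />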

<h4 align="center">Initial Impressions</h4>

<br />

<ul>
    <li>The features appear to follow the proposed schema, so we will assume the two match</li>
    <br />
    <li>Feature naming needs to be attended to</li>
    <br />
    <li>The first feature falls outside the 1 - 10 range and does not take the binary values of the target, so we will consider it the <b><i>ID</i></b> feature; it is thus eligible to be dropped</li>
    <br />
    <li>The remaining features need to be checked for values that go against the proposed schema</li>
    <br />
    <li>Missing and null values also need to be found</li>
    <br />
    <li>Apart from <b><i>ID</i></b> (feature name 1000025) and <b><i>Target</i></b> (column 2.1), the data is ordinal</li>
    <br />
    <li><b><i>Target</i></b> (column 2.1) is binary</li>
</ul>
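
<p>One way the schema check mentioned above could be sketched is to count, per column, the values that are non-numeric or fall outside 1 - 10. The helper <code>out_of_range_counts</code> and the <code>toy</code> frame are hypothetical and only illustrate the idea:</p>

```python
import pandas as pd


def out_of_range_counts(frame, low=1, high=10):
    """Count, per column, values that are non-numeric or outside [low, high]."""
    # Non-numeric entries become NaN, which `between` reports as False.
    numeric = frame.apply(pd.to_numeric, errors="coerce")
    return numeric.apply(lambda col: ~col.between(low, high)).sum()


# Illustrative values only, not taken from the data set.
toy = pd.DataFrame({"a": [1, 5, 10], "b": ["3", "?", "11"]})
violations = out_of_range_counts(toy)
print(violations)
```

<p>A column with a count of zero agrees with the schema; anything else points to rows worth inspecting.</p>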

<br />
<br />


<h3 align="center">Feature Renaming</h3>

<br />
<br />


In [5]:
new_feature_names = [
    "sample_id",
    "clumb_thickness",
    "cell_size_uniformity",
    "cell_shape_uniformity",
    "marginal_adhesion",
    "single_epithelial_cell_size",
    "bare_nuclei",
    "bland_chromatin",
    "normal_nucleoli",
    "mitoses",
    "target",
]

data.columns = new_feature_names
pd.concat([data.head(3), data.tail(3)], axis="rows").iloc[:, :8]


Unnamed: 0,sample_id,clumb_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin
0,1002945,5,4,4,5,7,10,3
1,1015425,3,1,1,1,2,2,3
2,1016277,6,8,8,1,3,4,3
695,888820,5,10,10,3,7,3,8
696,897471,4,8,6,4,3,4,10
697,897471,4,8,8,5,4,5,10


<br />

<h3 align="center">Removal</h3>
<h5 align="center">of</h5>
<h3 align="center">Identifier Features</h3>
<h5 align="center">and</h5>
<h3 align="center">Duplicates</h3>

<br />
<br />

<p>As luck would have it, only the <b><i>Sample ID</i></b> feature could possibly be used to identify the data's source. Before removing this feature, it is worth using it to check for duplicates.</p>

<br />
<br />


In [6]:
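# Sanity check: this passes only when at least one Sample ID is repeated.
assert data["sample_id"].duplicated().any(), "There are no duplicates in Sample ID"

# Drop rows that are identical in every column (ID included), then drop the ID.
data.drop_duplicates(inplace=True)
data.drop(columns=["sample_id"], inplace=True)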


<br />
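
<p>A repeated Sample ID is not necessarily a repeated row: <code>drop_duplicates</code> only removes rows that match in every column. A small sketch of the distinction, on made-up values (the <code>toy</code> frame is an assumption for illustration):</p>

```python
import pandas as pd

# Illustrative values only: sample 101 is an exact copy, sample 102 has
# the same ID but a different measurement.
toy = pd.DataFrame({
    "sample_id": [101, 101, 102, 102, 103],
    "mitoses": [1, 1, 2, 3, 1],
})

# Every row whose ID occurs more than once.
repeated_ids = toy["sample_id"].duplicated(keep=False)

# Later rows that are identical to an earlier row in every column.
exact_copies = toy.duplicated()

# Only the exact copy is removed; both rows for sample 102 survive.
deduplicated = toy.drop_duplicates()
```

<br />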

<h3 align="center">Descriptive Statistics</h3>

<br />
<br />


<h4 align="center">Data Types</h4>
<h5 align="center">and</h5>
<h4 align="center">Null Counts</h4>

<br />
<br />


In [7]:
data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 690 entries, 0 to 697
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   clumb_thickness              690 non-null    int64 
 1   cell_size_uniformity         690 non-null    int64 
 2   cell_shape_uniformity        690 non-null    int64 
 3   marginal_adhesion            690 non-null    int64 
 4   single_epithelial_cell_size  690 non-null    int64 
 5   bare_nuclei                  690 non-null    object
 6   bland_chromatin              690 non-null    int64 
 7   normal_nucleoli              690 non-null    int64 
 8   mitoses                      690 non-null    int64 
 9   target                       690 non-null    int64 
dtypes: int64(9), object(1)
memory usage: 59.3+ KB


<br />

<h4 align="center">Observations</h4>

<br />

<ul>
    <li>No column reports null values; recall that numpy's <b>NaN</b> value is a float type, so an <b>int64</b> column cannot hold one</li>
    <br />
    <li>There appear to be no explicitly missing values</li>
    <br />
    <li>Other than <b><i>Bare Nuclei</i></b>, the feature set is made up of 64-bit integers</li>
    <br />
    <li>We will have to transform the data type of <b><i>Bare Nuclei</i></b> to 64-bit integers as well so as to match the schema's proposal</li>
    <br />
    <li><b><i>Bare Nuclei's</i></b> object data type suggests it holds non-digit strings, which are best treated as null values; should any be found, the feature will have to become float type, since <b>NaN</b> cannot be stored as an integer</li>
</ul>

<br />
<br />


<h4 align="center">Feature Type Conversion</h4>
<h4 align="center">Bare Nuclei</h4>

<br />
<br />


In [8]:
data["bare_nuclei"] = data["bare_nuclei"].apply(
    lambda x: np.int64(x) if x.isdigit() else np.nan
)


<br />
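
<p>For comparison, the same conversion could be sketched with <code>pd.to_numeric</code>, which turns anything unparseable into <b>NaN</b>. The placeholder <code>"?"</code> below is an assumption for illustration; the notebook does not state which non-digit strings occur.</p>

```python
import pandas as pd

# Illustrative values only; "?" stands in for an unknown non-digit entry.
raw = pd.Series(["1", "10", "?", "3"], name="bare_nuclei")

# errors="coerce" replaces unparseable strings with NaN, so the result
# is float64 rather than int64.
converted = pd.to_numeric(raw, errors="coerce")
```

<p>Unlike the <code>isdigit</code> check, this would also accept strings such as <code>"2.5"</code>, so it is slightly more permissive.</p>

<br />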

<h4 align="center">Null Value Perusal</h4>

<br />
<br />


In [9]:
null_value_count = data.isna().sum()

null_value_count_total = null_value_count[null_value_count.values > 0]
null_value_count_total.index = [
    name + "_count"
    for name in null_value_count_total.index
]

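null_value_count_proportion = null_value_count_total / data.shape[0]
# Slice the suffix off: str.rstrip("_count") strips a set of characters,
# not a suffix, and would mangle a name such as "target_count".
null_value_count_proportion.index = [
    name[: -len("_count")] + "_proportion"
    for name in null_value_count_proportion.index
]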

pd.DataFrame(
    data=pd.concat([null_value_count_total, null_value_count_proportion], axis="rows"),
    columns=["values"],
)


Unnamed: 0,values
bare_nuclei_count,16.0
bare_nuclei_proportion,0.023188


<br />
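
<p>A more compact way to get the same two figures is sketched below, with one row per feature and separate count and proportion columns. The <code>toy</code> frame and the <code>summary</code> name are illustrative assumptions:</p>

```python
import numpy as np
import pandas as pd

# Illustrative values only, not taken from the data set.
toy = pd.DataFrame({
    "bare_nuclei": [1.0, np.nan, 3.0, 4.0],
    "target": [2, 4, 2, 4],
})

# isna().sum() counts nulls; isna().mean() gives their share of the rows.
summary = pd.DataFrame({
    "count": toy.isna().sum(),
    "proportion": toy.isna().mean(),
})

# Keep only the features that actually contain nulls.
summary = summary[summary["count"] > 0]
```

<br />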

<h4 align="center">Observations</h4>

<br />

<ul>
    <li>There are few null values: 16 rows, or <b><i>2.32%</i></b> (rounded) of the data set</li>
    <br />
    <li>Removing so few rows is expected to have little impact, so they will be dropped</li>
</ul>

<br />
<br />


<h4 align="center">Drop Rows with Null Values</h4>

<br />
<br />


In [10]:
data.dropna(axis="rows", inplace=True)
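
<br />

<p>Once the null rows are gone, <b><i>Bare Nuclei</i></b> can hold integers again, which would match the schema. A minimal sketch of that follow-up step on made-up values (the <code>toy</code> and <code>cleaned</code> names are assumptions, and this cast is not part of the cell above):</p>

```python
import numpy as np
import pandas as pd

# Illustrative values only, not taken from the data set.
toy = pd.DataFrame({
    "bare_nuclei": [1.0, np.nan, 10.0],
    "target": [2, 4, 4],
})

# Drop rows containing nulls, then restore the integer dtype.
cleaned = toy.dropna(axis="rows").astype({"bare_nuclei": "int64"})
```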
