<h1 align = center>Programming for Data Analytics Project 2</h1>
<h2 align = center>Stephen Caulfield</h2>
<h2 align = center>G00398240</h2>

<h2>Overview of Project</h2>
    <p>This project investigates the Wisconsin Breast Cancer dataset. The requirements for the project are:</p>
<ul>
    <li>Undertake an analysis/review of the dataset and present an overview and background</li>
    <li>Provide a literature review on classifiers which have been applied to the dataset and compare their performance</li>
    <li>Present a statistical analysis of the dataset</li>
    <li>Using a range of machine learning algorithms, train a set of classifiers on the dataset (using SKLearn etc.) and present classification performance results, detailing the rationale for the parameter selections made while training the classifiers.</li>
    <li>Compare, contrast and critique your results with reference to the literature</li>
    <li>Discuss and investigate how the dataset could be extended – using data synthesis of new tumour datapoints</li>
</ul>

<h2>What is Breast Cancer?</h2>
<img src="https://www.cdc.gov/cancer/breast/basic_info/images/female-breast-diagram-750px.jpg?_=41558" align="right" style="width:170px;height:170px;">
<p>Breast cancer is one of many forms of cancer. It happens when cells in the breast grow out of control. There are different types of breast cancer, depending on which cells in the breast turn into cancer. The breast consists of three main parts: the lobules, the ducts and the connective tissue.</p>
<p>Breast cancer usually begins in the ducts or lobules. It can spread outside the breast, at which point it is said to have metastasized.</p>
<p>The most common types of breast cancer are:</p>
    <ul>
        <li>Invasive ductal carcinoma, where the cancer cells begin in the ducts and grow outside them into other parts of the breast tissue, and possibly into other parts of the body.</li>
        <li>Invasive lobular carcinoma, where the cancer cells begin in the lobules and spread from the lobules to nearby breast tissue. These cells can also spread to other parts of the body.</li>
    </ul>

<h2>Overview of Wisconsin Breast Cancer Dataset</h2>

<p>Dataset Title: Wisconsin Breast Cancer Database (January 8, 1991)</p>
<p>Source: Dr. William H. Wolberg (physician), University of Wisconsin Hospitals, Madison, Wisconsin, USA</p>

<h3>Attribute Information</h3>
<p>Below is a truncated description of each attribute in the data set. The data set contains specific statistics that can be easily analysed with my code.</p>
<ol>
    <li>ID number</li>
    <li>Diagnosis (M = malignant, B = benign) </li>
    <li>Radius (mean of distances from center to points on the perimeter) </li>
    <li>Texture (standard deviation of gray-scale values) </li>
    <li>Perimeter </li>
    <li>Area</li>
    <li>Smoothness (local variation in radius lengths) </li>
    <li>Compactness (perimeter^2 / area - 1.0) </li>
    <li>Concavity (severity of concave portions of the contour)</li>
    <li>Symmetry </li>
    <li>Fractal dimension ("coastline approximation" - 1)</li>
</ol>

<p>There are a few instances of missing attributes in the data set. They will not be included in the analysis.</p>

<h2>Packages</h2>

In [None]:
# Standard library
import csv
import random
import statistics

# Third-party packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from pandas.plotting import scatter_matrix
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import norm, spearmanr

<h2>Data set imported</h2>
<p>I am using the Python package Pandas here to load the data set into a dataframe, which can be displayed in a readable tabular form. It also provides a suitable platform for manipulating the data wherever I see fit.</p>

In [None]:
file = "data.csv"

main_df = pd.read_csv(file, delimiter=",")

main_df.head()

This checks how many missing values there are in each column.

In [None]:
#Shows missing values in dataframe.
main_df.isna().sum()

Verifies data type of each column.

In [None]:
#Shows data type of each column
print(main_df.dtypes)

Shows the size of the data frame.

In [None]:
print(main_df.shape)

ID is a column I have deemed redundant for this analysis, as there is no statistical knowledge to be gained from it, so it will be removed. The "Unnamed: 32" column is dropped in the same step.

In [None]:
main_df = main_df.drop(["id", "Unnamed: 32"], axis = 1)

main_df.head()
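
As an alternative to naming "Unnamed: 32" directly, the clean-up step could drop any column that contains no values at all. The cell below is only a minimal sketch on a made-up three-row stand-in for the real file. The names `toy_csv`, `toy_df` and `toy_clean` are hypothetical, and the trailing comma in the header is my assumption about why an unnamed empty column can appear when a CSV is read.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv. The trailing comma on every line is an
# assumption about how an empty, unnamed column ends up in the dataframe.
toy_csv = io.StringIO(
    "id,diagnosis,radius_mean,\n"
    "101,M,17.99,\n"
    "102,B,11.42,\n"
    "103,B,12.45,\n"
)
toy_df = pd.read_csv(toy_csv)

# Drop the identifier, then drop every column in which all values are missing.
toy_clean = toy_df.drop(columns=["id"]).dropna(axis=1, how="all")
print(toy_clean.columns.tolist())
```

Dropping by "all values missing" rather than by name means the step would still work if the empty column had a different position or label, although it would also silently remove any other fully empty column.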

<h2>General statistics</h2>

In [None]:
main_df.describe()
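
The summary above pools both diagnoses together. One possible follow-up is to split the same statistics by diagnosis, which is sketched below on a small made-up dataframe. The values in `toy_df` are hypothetical and are not taken from the data set; in the notebook the same call would be applied to `main_df`.

```python
import pandas as pd

# Hypothetical stand-in rows, not real measurements from the data set.
toy_df = pd.DataFrame({
    "diagnosis": ["M", "M", "B", "B"],
    "radius_mean": [18.0, 20.0, 12.0, 14.0],
    "texture_mean": [10.0, 22.0, 14.0, 18.0],
})

# Mean of each numeric column within each diagnosis group.
group_means = toy_df.groupby("diagnosis").mean()
print(group_means)
```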

In [None]:
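# Count each diagnosis and fix the order so the labels always match the slices.
diag = main_df['diagnosis'].value_counts().reindex(['B', 'M'])
# Look the counts up by label (B/M) rather than by position.
benign = f'Benign = {diag["B"]}'
malig = f'Malignant = {diag["M"]}'
diag_title = [benign, malig]
plt.pie(diag, labels=diag_title, startangle=90)
plt.title('Benign vs Malignant Diagnoses')
plt.show()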

- Benign Tumors do not grow/spread and are not considered cancerous.
- Malignant Tumors do grow and spread in the body. These are considered cancerous.
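
The balance between the two classes can also be expressed as proportions rather than raw counts. The cell below is a minimal sketch on a made-up series; `toy_diagnosis` and its four values are hypothetical and do not reflect the real class balance. In the notebook the same call would be made on `main_df['diagnosis']`.

```python
import pandas as pd

# Hypothetical labels, chosen only to make the arithmetic easy to follow.
toy_diagnosis = pd.Series(["B", "B", "B", "M"], name="diagnosis")

# normalize=True returns the share of each label instead of its count.
share = toy_diagnosis.value_counts(normalize=True)
print(share)
```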

In [None]:
# One histogram of radius_mean per diagnosis, shown side by side.
compare = sns.FacetGrid(main_df, col="diagnosis")
compare.map(plt.hist, "radius_mean")
plt.show()

<h1>References</h1>
<ol>
    <li><a href = https://www.cdc.gov/cancer/breast/basic_info/what-is-breast-cancer.htm>CDC: Centers for Disease Control and Prevention</a></li>
    <li><a href = https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)> UCI Machine Learning Repository: Breast Cancer Wisconsin (Original) Data Set</a></li>
    <li><a href = https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data>Kaggle: Wisconsin Diagnostic Breast Cancer Dataset</a></li>
    <li><a href = https://stackoverflow.com/>StackOverflow</a></li>
    <li><a href = https://www.w3schools.com/>W3Schools</a></li>
</ol>
<h3>Python Packages</h3>
<ol>
    <li><a href = https://numpy.org/>Numpy</a></li>
    <li><a href = https://pandas.pydata.org/>Pandas</a></li>
    <li><a href = https://seaborn.pydata.org/>Seaborn</a></li>
    <li><a href = https://matplotlib.org/>MatPlotLib</a></li>
    <li><a href = https://docs.python.org/3/library/random.html>Random</a></li>
    <li><a href = https://docs.python.org/3/library/math.html>Math</a></li>
    <li><a href = https://scikit-learn.org/stable/>Sklearn</a></li>
    <li><a href = https://scipy.org/>Scipy</a></li>
</ol>