# Programming for Data Analytics

<br/>

---

# Project II

<br/>

Author: Jamie Tohall<br/>
Student Number: G00411380<br/>
Lecturer: Brian McGinley<br/>

<br/>

---

### Problem Statement
<br/>

This project will investigate the Wisconsin Breast Cancer dataset. The following list presents the requirements of the project:
<br/>

* Undertake an analysis/review of the dataset and present an overview and background.
* Provide a literature review on classifiers which have been applied to the dataset and compare their performance.
* Present a statistical analysis of the dataset.
* Using a range of machine learning algorithms, train a set of classifiers on the dataset (using SKLearn etc.) and present classification performance results. Detail your rationale for the parameter selections you made while training the classifiers.
* Compare, contrast and critique your results with reference to the literature.
* Discuss and investigate how the dataset could be extended, using data synthesis of new tumour datapoints.
* Document your work in a Jupyter notebook.
* As a suggestion, you could use Pandas, Seaborn, SKLearn, etc. to perform your analysis.
* Please use GitHub to demonstrate research, progress and consistency.

<br/>

---

## Introduction

### Review of the Dataset

### Overview

### Background

### Importing relevant modules

In [2]:
import pandas as pd
import seaborn as sns
import sklearn as sk
import matplotlib.pyplot as plt

---

### Reading in the dataset

In [3]:
# Read the dataset from the CSV file into a DataFrame named db

db = pd.read_csv("wisc_bc_data.csv")

### Preprocessing of the Dataset

In [4]:
# shape returns a (rows, columns) tuple for the dataset

db.shape

(569, 32)

In [5]:
# columns returns an Index holding the names of all 32 columns

db.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'points_mean', 'symmetry_mean', 'dimension_mean', 'radius_se',
       'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'points_se', 'symmetry_se',
       'dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
       'area_worst', 'smoothness_worst', 'compactness_worst',
       'concavity_worst', 'points_worst', 'symmetry_worst', 'dimension_worst'],
      dtype='object')
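
The column listing suggests a typical preprocessing step: separate the `id` and `diagnosis` columns from the 30 numeric features and check for missing values. The following is a minimal sketch under stated assumptions, not the notebook's own method: the names `sample`, `features`, `target` and `missing_per_column` are hypothetical, the small frame is built from rows shown in the `db.head(10)` output so that the example runs without the CSV file, and the 0/1 encoding of the diagnosis is one possible choice.

```python
import pandas as pd

# Stand-in frame: three rows and four columns copied from the db.head(10) output.
# In the notebook, db itself would be used in place of sample.
sample = pd.DataFrame({
    "id": [87139402, 8910251, 87880],
    "diagnosis": ["B", "B", "M"],
    "radius_mean": [12.32, 10.60, 13.81],
    "texture_mean": [12.39, 18.95, 23.75],
})

# Count missing values per column before any modelling.
missing_per_column = sample.isna().sum()

# Assumption: id is only a record identifier, so it is excluded from the features.
features = sample.drop(columns=["id", "diagnosis"])

# Assumed encoding of the target: 1 = malignant (M), 0 = benign (B).
target = sample["diagnosis"].map({"B": 0, "M": 1})

print(missing_per_column)
print(features.columns.tolist(), target.tolist())
```

Applied to `db`, the same three calls (`isna().sum()`, `drop(columns=...)`, `map(...)`) would give a feature matrix and a numeric target ready for scikit-learn.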

---

In [7]:
# head() returns the first 5 rows by default; here 10 rows are requested

db.head(10)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,points_worst,symmetry_worst,dimension_worst
0,87139402,B,12.32,12.39,78.85,464.1,0.1028,0.06981,0.03987,0.037,...,13.5,15.64,86.97,549.1,0.1385,0.1266,0.1242,0.09391,0.2827,0.06771
1,8910251,B,10.6,18.95,69.28,346.4,0.09688,0.1147,0.06387,0.02642,...,11.88,22.94,78.28,424.8,0.1213,0.2515,0.1916,0.07926,0.294,0.07587
2,905520,B,11.04,16.83,70.92,373.2,0.1077,0.07804,0.03046,0.0248,...,12.41,26.44,79.93,471.4,0.1369,0.1482,0.1067,0.07431,0.2998,0.07881
3,868871,B,11.28,13.39,73.0,384.8,0.1164,0.1136,0.04635,0.04796,...,11.92,15.77,76.53,434.0,0.1367,0.1822,0.08669,0.08611,0.2102,0.06784
4,9012568,B,15.19,13.21,97.65,711.8,0.07963,0.06934,0.03393,0.02657,...,16.2,15.73,104.5,819.1,0.1126,0.1737,0.1362,0.08178,0.2487,0.06766
5,906539,B,11.57,19.04,74.2,409.7,0.08546,0.07722,0.05485,0.01428,...,13.07,26.98,86.43,520.5,0.1249,0.1937,0.256,0.06664,0.3035,0.08284
6,925291,B,11.51,23.93,74.52,403.5,0.09261,0.1021,0.1112,0.04105,...,12.48,37.16,82.28,474.2,0.1298,0.2517,0.363,0.09653,0.2112,0.08732
7,87880,M,13.81,23.75,91.56,597.8,0.1323,0.1768,0.1558,0.09176,...,19.2,41.85,128.5,1153.0,0.2226,0.5209,0.4646,0.2013,0.4432,0.1086
8,862989,B,10.49,19.29,67.41,336.1,0.09989,0.08578,0.02995,0.01201,...,11.54,23.31,74.22,402.8,0.1219,0.1486,0.07987,0.03203,0.2826,0.07552
9,89827,B,11.06,14.96,71.49,373.9,0.1033,0.09097,0.05397,0.03341,...,11.92,19.9,79.76,440.0,0.1418,0.221,0.2299,0.1075,0.3301,0.0908


In [8]:
# describe() summarises the numeric columns (count, mean, std, min, quartiles, max).
# The parentheses are required: db.describe on its own only returns the bound method.

db.describe()
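
Summary statistics are often more informative when split by diagnosis, since the two classes may differ in size and in typical feature values. The sketch below illustrates the idea with `value_counts` and `groupby`. It is an illustration only: `sample`, `class_counts` and `class_means` are hypothetical names, and the three rows are copied from the `db.head(10)` output so the example runs without the CSV file.

```python
import pandas as pd

# Stand-in frame built from rows 0, 1 and 7 of the db.head(10) output.
sample = pd.DataFrame({
    "diagnosis": ["B", "B", "M"],
    "radius_mean": [12.32, 10.60, 13.81],
    "area_mean": [464.1, 346.4, 597.8],
})

# Number of records per class, which shows how balanced the classes are.
class_counts = sample["diagnosis"].value_counts()

# Mean of selected features within each class.
class_means = sample.groupby("diagnosis")[["radius_mean", "area_mean"]].mean()

print(class_counts)
print(class_means)
```

On the full dataset, replacing `sample` with `db` would give the class balance and per-class means for all 569 records.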

### Literature Review on Classifiers

---

### Statistical Analysis

---

### Training a Set of Classifiers
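
As a starting point for this section, the sketch below shows one possible way to train and evaluate a single classifier with scikit-learn. It rests on several assumptions: scikit-learn's bundled diagnostic breast cancer data (`load_breast_cancer`) stands in for `wisc_bc_data.csv`, logistic regression is chosen only as an example, and the split size, `random_state` and `max_iter` values are illustrative rather than tuned.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for wisc_bc_data.csv.
# Note its target encoding: 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the records, keeping the class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scaling inside the pipeline means the scaler is fitted on training folds only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation on the training set, then a final held-out score.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Other estimators can be swapped into `make_pipeline` in place of `LogisticRegression`, which keeps the comparison between classifiers on the same split and the same preprocessing.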

---

### Review of Results

---

### Investigation of Dataset Extension
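
One way to synthesise new tumour datapoints is to interpolate between existing records of the same class, in the spirit of SMOTE. The sketch below is a simplified illustration of that idea, not a full SMOTE implementation: the function `synthesise_points` and its parameters are hypothetical, and the four benign records (`radius_mean`, `texture_mean`) are copied from the `db.head(10)` output.

```python
import numpy as np

def synthesise_points(X, n_new, k=2, seed=0):
    """Create n_new points by interpolating between a randomly chosen
    record and one of its k nearest neighbours (simplified SMOTE-style)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X))                    # pick a random base record
        dists = np.linalg.norm(X - X[i], axis=1)    # distance to every record
        neighbours = np.argsort(dists)[1:k + 1]     # skip the record itself
        j = rng.choice(neighbours)
        gap = rng.random()                          # position along the segment
        new_points.append(X[i] + gap * (X[j] - X[i]))
    return np.array(new_points)

# Four benign records (radius_mean, texture_mean) from the db.head(10) output.
benign = np.array([
    [12.32, 12.39],
    [10.60, 18.95],
    [11.04, 16.83],
    [11.28, 13.39],
])

synthetic = synthesise_points(benign, n_new=5)
print(synthetic)
```

Because every synthetic point lies on a line segment between two real records, it stays within the range of the observed data. In practice the features would usually be standardised first so that no single feature dominates the distance calculation.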

---

## References

[1] https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29<br/>
[2] https://www.researchgate.net/publication/311950799_Analysis_of_the_Wisconsin_Breast_Cancer_Dataset_and_Machine_Learning_for_Breast_Cancer_Detection<br/>
[3] 