![header.jpg](attachment:header.jpg)

# Programming for Data Analysis

<b> Student: Lais Coletta Pereira </b>
***

## About the project

In this project for the Programming for Data Analysis course we were asked to investigate the Wisconsin Breast Cancer dataset and complete the following tasks:

<b>1)</b> Undertake an analysis/review of the dataset and present an overview and background.

<b>2)</b> Provide a literature review on classifiers which have been applied to the dataset and compare their performance.

<b>3)</b> Present a statistical analysis of the dataset.

<b>4)</b> Using a range of machine learning algorithms, train a set of classifiers on the dataset (using SKLearn etc.) and present classification performance results. Detail your rationale for the parameter selections you made while training the classifiers.

<b>5)</b> Compare, contrast and critique your results with reference to the literature.

<b>6)</b> Discuss and investigate how the dataset could be extended using data synthesis of new tumour datapoints.

<b>7)</b> Document your work in a Jupyter notebook.

I have downloaded the Breast Cancer Wisconsin (Diagnostic) Data Set from the [Kaggle website](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data/code?resource=download). 


## 1) Dataset overview and analysis
***

The Breast Cancer Wisconsin (Diagnostic) Data Set is a collection of diagnostic measurements from breast masses, collected by Dr. William H. Wolberg at the University of Wisconsin Hospitals. The features are computed from digitized images of the samples and describe the cell nuclei, for example their size, shape and texture. The data set has been widely used for research and education, particularly in machine learning and data mining.

According to this explanation by [Shashmi Karanam](https://medium.com/@shashmikaranam/exploratory-data-analysis-breast-cancer-wisconsin-diagnostic-dataset-6a3be9525cd): "Diagnosis of breast cancer is traditionally done by a full biopsy which is an invasive surgical method. A less invasive method called Fine Needle Biopsy (FNB), allows for examination of a small amount of tissue from the tumor".

This dataset was obtained by analyzing the cell nuclei characteristics of <b>569 images</b> obtained by Fine Needle Aspiration (FNA) of breast masses. Each image is classified (diagnosed) as <b>“Benign”</b> or <b>“Malignant”</b>. Each record holds an ID number, the diagnosis (a categorical label, M or B) and 30 real-valued features that describe the cell nuclei in the image.

The Breast Cancer Wisconsin data set is often used to train machine learning models that classify a breast tissue sample as malignant or benign. It has also been used to study the relationship between the measured characteristics of the samples and the diagnosis, and to identify patterns in the data that may be relevant to diagnosis and treatment.

Overall, this data set is a valuable resource for researchers and practitioners working on breast cancer diagnosis, and it remains a common benchmark for comparing classification methods.

### Import Libraries and read the data
***

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # data visualization library  
import matplotlib.pyplot as plt

data = pd.read_csv('data.csv')

According to the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)), the variable definitions are:

<b>Attribute Information:</b>

1) ID number

2) Diagnosis (M = malignant, B = benign)

3-32) <b>Ten real-valued features are computed for each cell nucleus:</b>

* radius (mean of distances from center to points on the perimeter)

* texture (standard deviation of gray-scale values)

* perimeter

* area

* smoothness (local variation in radius lengths)

* compactness (perimeter^2 / area - 1.0)

* concavity (severity of concave portions of the contour)

* concave points (number of concave portions of the contour)

* symmetry

* fractal dimension ("coastline approximation" - 1)
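The UCI description states that the mean, standard error and "worst" (largest) value of each of these ten features are computed per image, which gives the 30 feature columns. The sketch below shows how the column names line up. It assumes the Kaggle CSV suffixes `_mean`, `_se` and `_worst`; only `_mean` and `_worst` are visible in the truncated output in this notebook, so `_se` is an assumption, and the names `BASE_FEATURES`, `STATISTICS` and `feature_columns` are illustrative.

```python
# Ten base features, written the way they appear in the CSV column names
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Three statistics per feature (the "_se" suffix is assumed, see the note above)
STATISTICS = ["mean", "se", "worst"]

# All "_mean" columns first, then "_se", then "_worst"
feature_columns = [
    f"{feature}_{stat}" for stat in STATISTICS for feature in BASE_FEATURES
]

print(len(feature_columns))
```

A list like this can be handy later for selecting only the measurement columns, for example with `data[feature_columns]`, provided the names match the loaded file.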

In [3]:
# show the first five rows of the data
data.head()

|   | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |
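The `diagnosis` column in the output above holds the class label as text (`M` or `B`), whereas most classifiers expect a numeric target. The sketch below shows one possible encoding and a class count. The helper name `encode_diagnosis` and the five toy labels are made up for illustration and are not taken from the dataset; with the real data the same calls would be applied to `data['diagnosis']`.

```python
import pandas as pd


def encode_diagnosis(labels: pd.Series) -> pd.Series:
    """Map 'M' (malignant) to 1 and 'B' (benign) to 0."""
    encoded = labels.map({"M": 1, "B": 0})
    # Any label other than 'M' or 'B' becomes NaN after map(); fail loudly
    if encoded.isna().any():
        raise ValueError("unexpected diagnosis label")
    return encoded.astype(int)


# Toy labels for illustration only (not real records)
toy_labels = pd.Series(["M", "M", "B", "M", "B"], name="diagnosis")

toy_target = encode_diagnosis(toy_labels)
class_counts = toy_labels.value_counts()

print(toy_target.tolist())
print(class_counts)
```

Checking the class counts early is useful because an imbalance between benign and malignant samples affects how classification accuracy should be interpreted.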


In [4]:
# describe the data
data.describe()

|   | id | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 0.0 |
| mean | 30371830.0 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | ... | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | NaN |
| std | 125020600.0 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | ... | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | NaN |
| min | 8670.0 | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | ... | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 | NaN |
| 25% | 869218.0 | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | ... | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 | NaN |
| 50% | 906024.0 | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | ... | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 | NaN |
| 75% | 8813129.0 | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | ... | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 | NaN |
| max | 911320500.0 | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | ... | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 | NaN |
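In the summary above, `Unnamed: 32` has a count of 0, so it holds no values, and the statistics for `id` carry no meaning because it is an identifier, not a measurement. The sketch below shows one way to drop both before further analysis. It uses a toy frame (two rows copied from the `head()` output, restricted to a few columns) in place of the real `data`; the names `toy` and `cleaned` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout of the loaded CSV (subset of columns)
toy = pd.DataFrame({
    "id": [842302, 842517],
    "diagnosis": ["M", "M"],
    "radius_mean": [17.99, 20.57],
    "Unnamed: 32": [np.nan, np.nan],
})

# Drop every column that contains only missing values, then the identifier
cleaned = toy.dropna(axis="columns", how="all").drop(columns=["id"])

print(cleaned.columns.tolist())
```

Using `dropna(axis="columns", how="all")` avoids hard-coding the name of the empty column, which is presumably an artifact of how the CSV file was written.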


In [5]:
# summarize column dtypes and non-null counts
# (this cell must run after the cell that loads `data`, otherwise it raises a NameError)
data.info()

References:

* Information about the dataset: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)
* Data set variables information: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data/code?resource=download
* Analysis about breast cancer types and dataset: https://medium.com/@shashmikaranam/exploratory-data-analysis-breast-cancer-wisconsin-diagnostic-dataset-6a3be9525cd