# Breast Cancer Diagnostic 

This mini-project preprocesses the data, trains several models, compares their metrics, and selects the one with the best performance.

The dataset can be found in [Breast Cancer Wisconsin (Diagnostic)](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic).

## Steps:
  - Prepare data for use
    - Read
    - Clean
    - Separate
  - Plot information to gain insights
  - Train a model
  - Check the results

## Importing libs

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import max_error, mean_absolute_error, root_mean_squared_error, mean_absolute_percentage_error, mean_squared_error, median_absolute_error, r2_score

In [2]:
warnings.filterwarnings('ignore')

## Loading and viewing data

In [3]:
# Loading the data
df = pd.read_csv('./data/wdbc.data', header=None, delimiter = ',', encoding = 'utf-8')
df.head()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |
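Because the file is read with `header=None`, the columns are labelled 0 to 31, which makes later cells harder to read. The sketch below builds descriptive names from the layout given in the UCI documentation: an ID, the diagnosis, then ten measurements each reported as mean, standard error, and worst value. The snake_case spellings and the `column_names` variable are assumptions made for illustration, not names defined by the dataset or used elsewhere in this notebook.

```python
import pandas as pd

# The ten base measurements listed in the UCI description of the dataset.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]

# Columns 2-11 hold the means, 12-21 the standard errors, 22-31 the worst values.
column_names = ["id", "diagnosis"] + [
    f"{feature}_{stat}"
    for stat in ("mean", "se", "worst")
    for feature in base_features
]

# Hypothetical usage (not run here): supply the names when loading the file.
# df = pd.read_csv("./data/wdbc.data", header=None, names=column_names)
print(len(column_names))
```

If these names were adopted, later cells would refer to `"diagnosis"` instead of the integer label `1`.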


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       569 non-null    int64  
 1   1       569 non-null    object 
 2   2       569 non-null    float64
 3   3       569 non-null    float64
 4   4       569 non-null    float64
 5   5       569 non-null    float64
 6   6       569 non-null    float64
 7   7       569 non-null    float64
 8   8       569 non-null    float64
 9   9       569 non-null    float64
 10  10      569 non-null    float64
 11  11      569 non-null    float64
 12  12      569 non-null    float64
 13  13      569 non-null    float64
 14  14      569 non-null    float64
 15  15      569 non-null    float64
 16  16      569 non-null    float64
 17  17      569 non-null    float64
 18  18      569 non-null    float64
 19  19      569 non-null    float64
 20  20      569 non-null    float64
 21  21      569 non-null    float64
 22  22

Every column reports 569 non-null entries out of 569 rows, so there are no missing values.
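The same check can be made explicitly instead of reading it off the `info()` listing. This is a minimal sketch on a toy frame standing in for `df`; the toy values, including the deliberately missing one, are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `df`; one value is missing on purpose.
toy = pd.DataFrame({
    0: [1, 2, 3],
    1: ["M", "B", "B"],
    2: [17.99, np.nan, 11.42],
})

# Count missing entries per column, then in total.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

# Show only the columns that actually contain missing values.
print(missing_per_column[missing_per_column > 0])
print(total_missing)
```

Run against the real `df`, a total of zero would confirm the observation above.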

In [5]:
# Summary statistics of the numerical columns
df.describe(exclude='object').round(2)

|  | 0 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 30371830.0 | 14.13 | 19.29 | 91.97 | 654.89 | 0.1 | 0.1 | 0.09 | 0.05 | 0.18 | ... | 16.27 | 25.68 | 107.26 | 880.58 | 0.13 | 0.25 | 0.27 | 0.11 | 0.29 | 0.08 |
| std | 125020600.0 | 3.52 | 4.3 | 24.3 | 351.91 | 0.01 | 0.05 | 0.08 | 0.04 | 0.03 | ... | 4.83 | 6.15 | 33.6 | 569.36 | 0.02 | 0.16 | 0.21 | 0.07 | 0.06 | 0.02 |
| min | 8670.0 | 6.98 | 9.71 | 43.79 | 143.5 | 0.05 | 0.02 | 0.0 | 0.0 | 0.11 | ... | 7.93 | 12.02 | 50.41 | 185.2 | 0.07 | 0.03 | 0.0 | 0.0 | 0.16 | 0.06 |
| 25% | 869218.0 | 11.7 | 16.17 | 75.17 | 420.3 | 0.09 | 0.06 | 0.03 | 0.02 | 0.16 | ... | 13.01 | 21.08 | 84.11 | 515.3 | 0.12 | 0.15 | 0.11 | 0.06 | 0.25 | 0.07 |
| 50% | 906024.0 | 13.37 | 18.84 | 86.24 | 551.1 | 0.1 | 0.09 | 0.06 | 0.03 | 0.18 | ... | 14.97 | 25.41 | 97.66 | 686.5 | 0.13 | 0.21 | 0.23 | 0.1 | 0.28 | 0.08 |
| 75% | 8813129.0 | 15.78 | 21.8 | 104.1 | 782.7 | 0.11 | 0.13 | 0.13 | 0.07 | 0.2 | ... | 18.79 | 29.72 | 125.4 | 1084.0 | 0.15 | 0.34 | 0.38 | 0.16 | 0.32 | 0.09 |
| max | 911320500.0 | 28.11 | 39.28 | 188.5 | 2501.0 | 0.16 | 0.35 | 0.43 | 0.2 | 0.3 | ... | 36.04 | 49.54 | 251.2 | 4254.0 | 0.22 | 1.06 | 1.25 | 0.29 | 0.66 | 0.21 |


In [6]:
# Summary statistics of the categorical column
df.describe(include='object')

|  | 1 |
|---|---|
| count | 569 |
| unique | 2 |
| top | B |
| freq | 357 |


There are 357 benign (B) cases and 212 malignant (M) cases.
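Expressing these counts as proportions makes the class balance easier to judge. The sketch below rebuilds a label column from the two counts reported above rather than reading the file; the `labels` series is a stand-in for column `1` of `df`.

```python
import pandas as pd

# Stand-in for df[1], rebuilt from the counts shown by describe().
labels = pd.Series(["B"] * 357 + ["M"] * 212)

# Absolute counts and relative frequencies of each class.
counts = labels.value_counts()
proportions = labels.value_counts(normalize=True)

print(counts)
print(proportions.round(3))
```

On the real data the equivalent call would be `df[1].value_counts(normalize=True)`.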

## Data Preprocessing

In [7]:
# Encode categorical columns
df = pd.get_dummies(df, columns=[1])
df.head()

|  | 0 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 1_B | 1_M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | False | True |
| 1 | 842517 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | False | True |
| 2 | 84300903 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | False | True |
| 3 | 84348301 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | False | True |
| 4 | 84358402 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | False | True |
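The two dummy columns `1_B` and `1_M` carry the same information, since each is the negation of the other. As an alternative, the label can be mapped to a single 0/1 column. The sketch below shows this on a toy frame standing in for `df`; the `target` column name and the 0 = benign, 1 = malignant convention are assumptions chosen for illustration, and the third toy row is made up.

```python
import pandas as pd

# Toy frame standing in for `df` before encoding.
toy = pd.DataFrame({
    0: [842302, 842517, 1],
    1: ["M", "M", "B"],
    2: [17.99, 20.57, 12.0],
})

# Map the diagnosis letter to one binary column and drop the original label.
encoded = toy.copy()
encoded["target"] = encoded[1].map({"B": 0, "M": 1})
encoded = encoded.drop(columns=[1])

print(encoded)
```

A similar result can be had with `pd.get_dummies(df, columns=[1], drop_first=True)`, which keeps only the `1_M` column.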


## Exploring the data