# Breast Cancer

Breast cancer is the most common cancer in women. In fact, it accounts for about 25% of all cases. It starts with the mutation of cells, which can be seen with the help of X-ray imaging.

The dataset poses a binary classification problem: we want to predict whether or not someone has breast cancer. Patients with type M (malignant) cells have breast cancer, and those with type B (benign) cells do not. So for each patient we want to know whether the cells are type M or type B.
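As a minimal sketch of that binary framing, the `diagnosis` labels can be mapped to a numeric target. The function name `encode_diagnosis` and the 1/0 coding (M = 1, B = 0) are assumptions made for illustration and are not part of the original notebook; the small frame below stands in for the real data.

```python
import pandas as pd

def encode_diagnosis(df: pd.DataFrame) -> pd.Series:
    """Map diagnosis labels to a binary target: M (malignant) -> 1, B (benign) -> 0."""
    target = df["diagnosis"].map({"M": 1, "B": 0})
    if target.isna().any():
        # Any label other than M or B maps to NaN, so fail loudly instead of guessing
        raise ValueError("unexpected diagnosis label")
    return target.astype(int)

# Toy rows made up for illustration, not taken from the dataset
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})
target = encode_diagnosis(toy)
print(target.tolist())
```

Choosing M as the positive class (1) is a convention, but it keeps metrics such as recall focused on the malignant cases.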

### Import libraries

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('breast-cancer.csv')

data.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [15]:

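def detect_outlier(data_1):
    """Return the values in data_1 whose absolute z-score exceeds 3."""
    outliers = []
    threshold = 3
    mean_1 = np.mean(data_1)
    std_1 = np.std(data_1)

    for y in data_1:
        # z-score: distance from the mean in units of standard deviation
        z_score = (y - mean_1) / std_1
        if np.abs(z_score) > threshold:
            outliers.append(y)
    return outliers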
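
The same z-score rule can also be written without an explicit loop. This is a sketch, not a replacement for the function above: `detect_outlier_vectorized` is a hypothetical name, the sample values are made up, and the population standard deviation (`ddof=0`) is used to match the default of `np.std`.

```python
import numpy as np

def detect_outlier_vectorized(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    arr = np.asarray(values, dtype=float)
    std = arr.std()  # population std (ddof=0), the same default as np.std
    if std == 0:
        return []  # constant column: z-scores are undefined, so nothing is flagged
    z_scores = (arr - arr.mean()) / std
    return arr[np.abs(z_scores) > threshold].tolist()

# Made-up sample: twenty values near 10 plus one extreme value
sample = [10.0] * 10 + [11.0] * 10 + [100.0]
print(detect_outlier_vectorized(sample))
```

Note that an extreme value inflates both the mean and the standard deviation, so in very small samples the z-score rule can fail to flag it.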

In [16]:
for col in data.select_dtypes(include=np.number):
    column = data[col]
    print(f"{col}:")
    print(detect_outlier(column))
    print('\n')

id:
[871001501, 871001502, 881046502, 881094802, 901034301, 901034302, 911157302, 911296201, 911296202, 911320501, 911320502]


radius_mean:
[25.22, 27.22, 28.11, 25.73, 27.42]


texture_mean:
[32.47, 33.81, 39.28, 33.56]


perimeter_mean:
[171.5, 166.2, 182.1, 188.5, 174.2, 186.9, 165.5]


area_mean:
[1878.0, 1761.0, 2250.0, 2499.0, 1747.0, 2010.0, 2501.0, 1841.0]


smoothness_mean:
[0.1425, 0.1398, 0.1447, 0.1634, 0.05263]


compactness_mean:
[0.2776, 0.2839, 0.3454, 0.2665, 0.2768, 0.2867, 0.2832, 0.3114, 0.277]


concavity_mean:
[0.3754, 0.3339, 0.4264, 0.4268, 0.4108, 0.3523, 0.3368, 0.3635, 0.3514]


concave points_mean:
[0.1845, 0.1823, 0.2012, 0.1878, 0.1913, 0.1689]


symmetry_mean:
[0.304, 0.2743, 0.2906, 0.2655, 0.2678]


fractal_dimension_mean:
[0.09744, 0.0898, 0.09296, 0.08743, 0.0845, 0.09502, 0.09575]


radius_se:
[1.509, 1.296, 2.873, 1.292, 1.37, 2.547, 1.291]


texture_se:
[3.568, 2.91, 3.12, 4.885, 2.878, 3.647, 2.927, 2.904, 3.896]


perimeter_se:
[11.07, 10.05, 9.

In [4]:
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency, ttest_ind, f_oneway, shapiro
import statsmodels.formula.api as sfa
import statsmodels.api as sa
import statsmodels.stats.multicomp as mc
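
To get a feel for the Shapiro-Wilk test used in the normality check that follows, it can help to run it on synthetic data first. The sketch below is not part of the original notebook; the seed, sample size, and distributions are arbitrary choices for illustration. H0 is that the sample comes from a normal distribution, so a small p-value counts as evidence against normality.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=500)

for name, sample in [("normal", normal_sample), ("lognormal", skewed_sample)]:
    stat, p = shapiro(sample)
    verdict = "fail to reject H0" if p > 0.05 else "reject H0"
    print(f"{name}: W={stat:.3f}, p={p:.3g} ({verdict})")
```

With several hundred rows the test tends to flag even mild departures from normality, so a histogram or Q-Q plot is a useful companion to the p-value.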

In [5]:
# Shapiro-Wilk test to check whether a column (data) is normally distributed.
alpha = 0.05

# Only numeric columns can be tested, so select them explicitly
# instead of silently swallowing errors with a bare except.
for c in data.select_dtypes(include=np.number):
    stat, p = shapiro(data[c])

    if p > alpha:
        print(f'Normally distributed (fail to reject H0) | Column: {c}\nstat: {stat}\np-value: {p}\n')
    else:
        print(f'Not normally distributed (reject H0) | Column: {c}\nstat: {stat}\np-value: {p}\n')

Not normally distributed (reject H0) | Column: id
stat: 0.22388029098510742
p-value: 3.0688436368713494e-43

Not normally distributed (reject H0) | Column: radius_mean
stat: 0.941069483757019
p-value: 3.106064735383836e-14

Not normally distributed (reject H0) | Column: texture_mean
stat: 0.9767200946807861
p-value: 7.281492031552261e-08

Not normally distributed (reject H0) | Column: perimeter_mean
stat: 0.9361826181411743
p-value: 7.01163031715385e-15

Not normally distributed (reject H0) | Column: area_mean
stat: 0.858401358127594
p-value: 3.1962737991608384e-22

Not normally distributed (reject H0) | Column: smoothness_mean
stat: 0.987487256526947
p-value: 8.59934589243494e-05

Not normally distributed (reject H0) | Column: compactness_mean
stat: 0.9169782996177673
p-value: 3.9677173918984066e-17

Not normally distributed (reject H0) | Column: concavity_mean
stat: 0.8668307662010193
p-value: 1.3385464541211153e-21

Not normally distributed (reject H0) | Column: concave points_mean
stat: 0.89164972305