# Cancer detection

## About the dataset
This breast cancer database was obtained from the University of Wisconsin
Hospitals, Madison from Dr. William H. Wolberg.

## Relevant Information

   #  Features
   1. Sample code number            id number
   2. Clump Thickness               -- (1 - 10)
   3. Uniformity of Cell Size       -- (1 - 10)
   4. Uniformity of Cell Shape      -- (1 - 10)
   5. Marginal Adhesion             -- (1 - 10)
   6. Single Epithelial Cell Size   -- (1 - 10)
   7. Bare Nuclei                   -- (1 - 10)
   8. Bland Chromatin               -- (1 - 10)
   9. Normal Nucleoli               -- (1 - 10)
   10. Mitoses                      -- (1 - 10)
   11. Class:                        (2 for benign, 4 for malignant)


* Missing attribute values: 16, all in the feature **"Bare Nuclei"**

   There are 16 instances in Groups 1 to 6 that contain a single missing 
   (i.e., unavailable) attribute value, now denoted by "?". 

In [62]:
import pandas as pd
import numpy as np

Take a look at the dataset. Notice that the column names are missing.

In [63]:
df = pd.read_csv("breast-cancer-wisconsin.data.txt", header = None)
df.head()

,0,1,2,3,4,5,6,7,8,9,10
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


The column labels are integers (an `Int64Index`), not names:

In [64]:
df.columns

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='int64')

So let's replace the integer labels with the correct column names.

In [65]:
columns_names = {0:"Id",
                 1:"Clump Thickness",
                 2:"Uniformity of Cell Size",
                 3:"Uniformity of Cell Shape",
                 4:"Marginal Adhesion",
                 5:"Single Epithelial Cell Size",
                 6:"Bare Nuclei",
                 7:"Bland Chromatin",
                 8:"Normal Nucleoli",
                 9:"Mitoses",
                 10:"Class"}

In [66]:
df.rename(columns = columns_names, inplace = True)
df.head()

,Id,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2
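As an alternative to renaming after loading, the names can be passed to `pd.read_csv` directly, and the "?" placeholder can be parsed as a missing value at the same time. The following is only a sketch: it reads three rows from an in-memory string instead of the real file, and the "?" in the third row was put there for illustration (the real row holds a 2 in that position). The names `sample`, `column_names` and `sample_df` are hypothetical.

```python
import io
import pandas as pd

# Three rows in the file's format. The "?" in the last row is an
# illustrative assumption, not the value found in the real data.
sample = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1015425,3,1,1,1,2,?,3,1,1,2\n"
)

column_names = [
    "Id", "Clump Thickness", "Uniformity of Cell Size",
    "Uniformity of Cell Shape", "Marginal Adhesion",
    "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
    "Normal Nucleoli", "Mitoses", "Class",
]

# names= labels the columns at load time; na_values="?" turns "?" into NaN.
sample_df = pd.read_csv(sample, header=None, names=column_names, na_values="?")

print(sample_df.columns.tolist())
print(sample_df["Bare Nuclei"].isna().sum())
```

With this approach the missing entries show up in `isnull()` right away, at the cost of the column being read as float until the gaps are filled.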


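Let's preview the dataframe with "Id" as its index. Note that `set_index` returns a new dataframe, so without assigning the result (or passing `inplace=True`) `df` itself keeps "Id" as a regular column.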

In [67]:
df.set_index("Id")

Id,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
...,...,...,...,...,...,...,...,...,...,...
776715,3,1,1,1,3,2,1,1,1,2
841769,2,1,1,1,2,1,1,1,1,2
888820,5,10,10,3,7,3,8,10,2,4
897471,4,8,6,4,3,4,10,6,1,4
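The difference between displaying the result of `set_index` and actually changing the dataframe can be seen on a tiny example. This is a minimal sketch using a hypothetical two-row frame called `demo`, not the real dataset.

```python
import pandas as pd

# Tiny hypothetical frame standing in for the real dataset.
demo = pd.DataFrame({"Id": [1000025, 1002945], "Class": [2, 2]})

# set_index returns a new dataframe; demo is left untouched.
indexed = demo.set_index("Id")

print(list(demo.columns))     # "Id" is still a column of demo
print(list(indexed.columns))  # in the new frame "Id" has moved to the index
```

To keep the change, the result has to be assigned back, for example `demo = demo.set_index("Id")`.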


Shape of the dataset

In [68]:
print("The shape of the dataset: {}".format(df.shape))

The shape of the dataset: (699, 11)


The dataset consists of 9 features:
   2. Clump Thickness               -- (1 - 10)
   3. Uniformity of Cell Size       -- (1 - 10)
   4. Uniformity of Cell Shape      -- (1 - 10)
   5. Marginal Adhesion             -- (1 - 10)
   6. Single Epithelial Cell Size   -- (1 - 10)
   7. Bare Nuclei                   -- (1 - 10)
   8. Bland Chromatin               -- (1 - 10)
   9. Normal Nucleoli               -- (1 - 10)
   10. Mitoses                      -- (1 - 10)

All features range from **1 to 10** and should be stored as **int**,
so let's check whether each column has the correct data type.

In [69]:
df.dtypes

Id                              int64
Clump Thickness                 int64
Uniformity of Cell Size         int64
Uniformity of Cell Shape        int64
Marginal Adhesion               int64
Single Epithelial Cell Size     int64
Bare Nuclei                    object
Bland Chromatin                 int64
Normal Nucleoli                 int64
Mitoses                         int64
Class                           int64
dtype: object
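An `object` dtype in a column that should be numeric usually means that some entries could not be parsed as numbers. One way to find them, sketched here on a hypothetical series rather than on the real column, is `pd.to_numeric` with `errors="coerce"`, which turns every unparseable entry into NaN.

```python
import pandas as pd

# Hypothetical column mixing numeric strings with the "?" placeholder.
bare_nuclei = pd.Series(["1", "10", "?", "2", "?"], name="Bare Nuclei")

# Entries that cannot be parsed as numbers become NaN.
parsed = pd.to_numeric(bare_nuclei, errors="coerce")

# Distinct raw values that failed to parse.
bad_values = bare_nuclei[parsed.isna()].unique().tolist()
print(bad_values)
```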

The column "Bare Nuclei" has the wrong data type (`object`) because its missing values are recorded as "?". Let's replace those values first and convert the column afterwards, starting with a look at its distribution in each class.

In [70]:
#Class 2
df[df["Class"] == 2]["Bare Nuclei"].value_counts()

1     387
2      21
?      14
3      14
5      10
4       6
10      3
8       2
7       1
Name: Bare Nuclei, dtype: int64

For **Class 2** the most common value is "1", so let's replace the missing values "?" with "1".

In [71]:
common_bare_nuclei_class2 = df[df["Class"] == 2]["Bare Nuclei"].value_counts().idxmax()
print("The most common value for the column Bare Nuclei in class 2: {}".format(common_bare_nuclei_class2))

The most common value for the column Bare Nuclei in class 2: 1


In [75]:
df.loc[(df["Class"] == 2) & (df["Bare Nuclei"] == "?"), "Bare Nuclei"] = common_bare_nuclei_class2

For **Class 4** the most common value is "10", so let's replace the missing values there as well.

In [36]:
df[df["Class"] == 4]["Bare Nuclei"].value_counts()

10    129
5      20
8      19
1      15
3      14
4      13
9       9
2       9
7       7
6       4
?       2
Name: Bare Nuclei, dtype: int64

In [77]:
#Class 4
common_bare_nuclei_class4 = df[df["Class"] == 4]["Bare Nuclei"].value_counts().idxmax()
print("The most common value for the column Bare Nuclei in class 4: {}".format(common_bare_nuclei_class4))

The most common value for the column Bare Nuclei in class 4: 10


In [80]:
df.loc[(df["Class"] == 4) & (df["Bare Nuclei"] == "?"), "Bare Nuclei"] = common_bare_nuclei_class4
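The two per-class replacements above amount to imputing each missing value with the mode of its class. A more general version can be sketched with `groupby` and `transform`, shown here on a hypothetical eight-row frame called `toy`. Unlike `value_counts().idxmax()` on the raw column, this sketch masks "?" before computing the mode, so the placeholder can never be picked as the most frequent value.

```python
import pandas as pd

# Hypothetical miniature of the dataset: "?" marks a missing Bare Nuclei value.
toy = pd.DataFrame({
    "Class":       [2, 2, 2, 2, 4, 4, 4, 4],
    "Bare Nuclei": ["1", "1", "2", "?", "10", "10", "5", "?"],
})

# Treat "?" as missing so it is ignored when computing each class's mode.
values = toy["Bare Nuclei"].mask(toy["Bare Nuclei"] == "?")

# Most frequent value within each class, broadcast back to every row.
class_mode = values.groupby(toy["Class"]).transform(lambda s: s.mode().iloc[0])

# Fill the gaps with the class mode and convert to integers in one step.
toy["Bare Nuclei"] = values.fillna(class_mode).astype(int)
print(toy)
```

When several values tie for most frequent, `mode()` returns them sorted and `.iloc[0]` takes the first one; that tie-breaking rule is a choice made for this sketch.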

Now there are no missing values left:

In [81]:
df[df["Bare Nuclei"] == "?"]

,Id,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class


In [82]:
df.dtypes

Id                              int64
Clump Thickness                 int64
Uniformity of Cell Size         int64
Uniformity of Cell Shape        int64
Marginal Adhesion               int64
Single Epithelial Cell Size     int64
Bare Nuclei                    object
Bland Chromatin                 int64
Normal Nucleoli                 int64
Mitoses                         int64
Class                           int64
dtype: object

We only need to change the data type now: the column is still `object`, so let's convert it to `int`.

In [83]:
df["Bare Nuclei"] = df["Bare Nuclei"].astype(int)

In [84]:
df.dtypes

Id                             int64
Clump Thickness                int64
Uniformity of Cell Size        int64
Uniformity of Cell Shape       int64
Marginal Adhesion              int64
Single Epithelial Cell Size    int64
Bare Nuclei                    int64
Bland Chromatin                int64
Normal Nucleoli                int64
Mitoses                        int64
Class                          int64
dtype: object
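Once every feature is stored as an integer, the stated 1 to 10 range can also be verified. The following is a minimal sketch on a hypothetical frame called `features`, whose last row deliberately breaks the range so that the check has something to report.

```python
import pandas as pd

# Hypothetical rows; the last Mitoses value deliberately violates the range.
features = pd.DataFrame({
    "Clump Thickness": [5, 3, 6],
    "Mitoses":         [1, 1, 11],
})

# For each column, True when every value lies in [1, 10] (inclusive).
in_range = features.apply(lambda col: col.between(1, 10).all())
print(in_range)
```

On the real data the same check would be applied to the nine feature columns only, leaving out "Id" and "Class".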

In [85]:
df.isnull().sum()

Id                             0
Clump Thickness                0
Uniformity of Cell Size        0
Uniformity of Cell Shape       0
Marginal Adhesion              0
Single Epithelial Cell Size    0
Bare Nuclei                    0
Bland Chromatin                0
Normal Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64

In [87]:
df.to_csv("Breast-Cancer-Wisconsin-data.csv")
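By default `to_csv` also writes the row index as an extra, unnamed first column, which comes back as "Unnamed: 0" when the file is read again. If that column is not wanted, `index=False` leaves it out. The sketch below shows the difference on a hypothetical two-row frame, writing to in-memory buffers instead of a file.

```python
import io
import pandas as pd

demo = pd.DataFrame({"Id": [1000025, 1002945], "Class": [2, 2]})

# Default: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```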